# Classification with word embeddings

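Word embeddings are dense vector representations: low-dimensional, real-valued vectors in which most entries are non-zero.
They contrast with sparse representations such as bag of words, where almost all entries are zero.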

### Quick bag of words intro

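Let's say you have a vocabulary of size $V$. Every word is then represented by a sparse (one-hot) vector of size $V$ that is 1 at a position unique to that word and 0 everywhere else. The bag-of-words vector of a document is the sum of the one-hot vectors of its words, so it counts how often each vocabulary word occurs.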
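As a minimal sketch of the representation described above, the following builds one-hot vectors and a bag-of-words count vector. The four-word vocabulary and the helper names `one_hot` and `bag_of_words` are made up for illustration and are not part of the notebook.

```python
import numpy as np

# Toy vocabulary (hypothetical); each word gets a unique position.
vocab = ["profit", "sales", "google", "film"]
word_to_pos = {word: pos for pos, word in enumerate(vocab)}

def one_hot(word):
    """Return a vector of size V that is 1 at the word's position and 0 elsewhere."""
    vec = np.zeros(len(vocab), dtype=int)
    vec[word_to_pos[word]] = 1
    return vec

def bag_of_words(tokens):
    """Sum the one-hot vectors of all tokens, i.e. count each vocabulary word."""
    return np.sum([one_hot(tok) for tok in tokens], axis=0)

doc = ["sales", "profit", "sales"]
print(one_hot("google"))   # [0 0 1 0]
print(bag_of_words(doc))   # [1 2 0 0]
```

With a realistic vocabulary of tens of thousands of words, almost every entry of these vectors is zero, which is what makes the representation sparse.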

### Disadvantages of bag of words

- It ignores context and word order
- It does not encode similarity between words
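Both disadvantages can be illustrated with a small sketch. The three-word vocabularies and the helper names below are hypothetical and only serve the example.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# One-hot vectors for a hypothetical 3-word vocabulary: film, movie, tax.
film, movie, tax = np.eye(3)

# Any two distinct one-hot vectors are orthogonal, so "film" is exactly as
# dissimilar to "movie" as it is to "tax".
sim_film_movie = cosine(film, movie)
sim_film_tax = cosine(film, tax)
print(sim_film_movie, sim_film_tax)

# Word order is lost: both sentences contain the same words equally often.
vocab = ["dog", "bites", "man"]

def bag_of_words(sentence):
    tokens = sentence.split()
    return [tokens.count(word) for word in vocab]

same_vector = bag_of_words("dog bites man") == bag_of_words("man bites dog")
print(same_vector)
```

Dense embeddings address the first point: related words can end up with a cosine similarity greater than zero.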

However, one-hot vectors are the input representation that word2vec uses when it learns its vectors.

In [1]:
import os
import nltk
import string
import glob
import re
import numpy as np
import matplotlib.pyplot as plt
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Embedding, Dense, Flatten
from tensorflow.keras.models import Sequential
from keras.utils import np_utils
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedShuffleSplit
import fnmatch
import warnings
warnings.filterwarnings("ignore")
np.set_printoptions(threshold=1800)
%matplotlib inline

Using TensorFlow backend.


### The dataset

- Consists of 2225 documents from the BBC news website corresponding to stories in five topical areas from 2004-2005.
- Class Labels: 5 (business, entertainment, politics, sport, tech)

[Get_the_data](http://mlg.ucd.ie/datasets/bbc.html)

### Import and label the raw data

In [3]:
labels = []
news = []
num_per_topic = {}
# the position of a topic in this list is its integer class label
topics = ['business', 'entertainment', 'tech', 'sport', 'politics']
for label, label_type in enumerate(topics):
    dir_name = './bbc/' + label_type
    txt_files = fnmatch.filter(os.listdir(dir_name), '*.txt')
    num_per_topic[label_type] = len(txt_files)
    print(label_type, num_per_topic[label_type])
    for fname in txt_files:
        with open(os.path.join(dir_name, fname), encoding='utf-8', errors='ignore') as f:
            news.append(f.read())
        # append a label only for files that were actually read,
        # so that news and labels always stay aligned
        labels.append(label)
print('Total number of news:', len(news))

business 510
entertainment 386
tech 401
sport 511
politics 417
Total number of news: 2225


In [5]:
print(news[2])

Yukos unit buyer faces loan claim

The owners of embattled Russian oil giant Yukos are to ask the buyer of its former production unit to pay back a $900m (£479m) loan.

State-owned Rosneft bought the Yugansk unit for $9.3bn in a sale forced by Russia to part settle a $27.5bn tax claim against Yukos. Yukos' owner Menatep Group says it will ask Rosneft to repay a loan that Yugansk had secured on its assets. Rosneft already faces a similar $540m repayment demand from foreign banks. Legal experts said Rosneft's purchase of Yugansk would include such obligations. "The pledged assets are with Rosneft, so it will have to pay real money to the creditors to avoid seizure of Yugansk assets," said Moscow-based US lawyer Jamie Firestone, who is not connected to the case. Menatep Group's managing director Tim Osborne told the Reuters news agency: "If they default, we will fight them where the rule of law exists under the international arbitration clauses of the credit."

Rosneft officials were unav

In [4]:
# If the stopwords and WordNet corpora have not been downloaded yet, uncomment and run
#nltk.download('stopwords')
#nltk.download('wordnet')

### Process data for embedding matrices

### Remove punctuation

In [4]:
translator = str.maketrans('', '', string.punctuation)
news = [new.translate(translator) for new in news]
print(news[0])

Ad sales boost Time Warner profit

Quarterly profits at US media giant TimeWarner jumped 76 to 113bn £600m for the three months to December from 639m yearearlier

The firm which is now one of the biggest investors in Google benefited from sales of highspeed internet connections and higher advert sales TimeWarner said fourth quarter sales rose 2 to 111bn from 109bn Its profits were buoyed by oneoff gains which offset a profit dip at Warner Bros and less users for AOL

Time Warner said on Friday that it now owns 8 of searchengine Google But its own internet business AOL had has mixed fortunes It lost 464000 subscribers in the fourth quarter profits were lower than in the preceding three quarters However the company said AOLs underlying profit before exceptional items rose 8 on the back of stronger internet advertising revenues It hopes to increase subscribers by offering the online service free to TimeWarner internet customers and will try to sign up AOLs existing customers for highspeed

### Remove numbers

In [5]:
news = [re.sub(r'\d+', '', new) for new in news]
print(news[0])

Ad sales boost Time Warner profit

Quarterly profits at US media giant TimeWarner jumped  to bn £m for the three months to December from m yearearlier

The firm which is now one of the biggest investors in Google benefited from sales of highspeed internet connections and higher advert sales TimeWarner said fourth quarter sales rose  to bn from bn Its profits were buoyed by oneoff gains which offset a profit dip at Warner Bros and less users for AOL

Time Warner said on Friday that it now owns  of searchengine Google But its own internet business AOL had has mixed fortunes It lost  subscribers in the fourth quarter profits were lower than in the preceding three quarters However the company said AOLs underlying profit before exceptional items rose  on the back of stronger internet advertising revenues It hopes to increase subscribers by offering the online service free to TimeWarner internet customers and will try to sign up AOLs existing customers for highspeed broadband TimeWarner also

### Lowercase words

In [6]:
news = [new.lower() for new in news]
print(news[0])

ad sales boost time warner profit

quarterly profits at us media giant timewarner jumped  to bn £m for the three months to december from m yearearlier

the firm which is now one of the biggest investors in google benefited from sales of highspeed internet connections and higher advert sales timewarner said fourth quarter sales rose  to bn from bn its profits were buoyed by oneoff gains which offset a profit dip at warner bros and less users for aol

time warner said on friday that it now owns  of searchengine google but its own internet business aol had has mixed fortunes it lost  subscribers in the fourth quarter profits were lower than in the preceding three quarters however the company said aols underlying profit before exceptional items rose  on the back of stronger internet advertising revenues it hopes to increase subscribers by offering the online service free to timewarner internet customers and will try to sign up aols existing customers for highspeed broadband timewarner also

### Remove stop-words (optional)

In [7]:
en_stop_words = set(stopwords.words('english'))
# drop every stop word; splitting and re-joining also collapses whitespace to single spaces
news = [' '.join(word for word in new.split() if word not in en_stop_words) for new in news]
print(news[0])

ad sales boost time warner profit quarterly profits us media giant timewarner jumped bn £m three months december yearearlier firm one biggest investors google benefited sales highspeed internet connections higher advert sales timewarner said fourth quarter sales rose bn bn profits buoyed oneoff gains offset profit dip warner bros less users aol time warner said friday owns searchengine google internet business aol mixed fortunes lost subscribers fourth quarter profits lower preceding three quarters however company said aols underlying profit exceptional items rose back stronger internet advertising revenues hopes increase subscribers offering online service free timewarner internet customers try sign aols existing customers highspeed broadband timewarner also restate results following probe us securities exchange commission sec close concluding time warners fourth quarter profits slightly better analysts expectations film division saw profits slump helped boxoffice flops alexander catw

### Lemmatize words (optional)

In [8]:
lemmatizer = WordNetLemmatizer()
# reduce every word to its lemma (treated as a noun by default), e.g. 'profits' -> 'profit'
news = [' '.join(lemmatizer.lemmatize(word) for word in new.split()) for new in news]
print(news[0])

ad sale boost time warner profit quarterly profit u medium giant timewarner jumped bn £m three month december yearearlier firm one biggest investor google benefited sale highspeed internet connection higher advert sale timewarner said fourth quarter sale rose bn bn profit buoyed oneoff gain offset profit dip warner bros le user aol time warner said friday owns searchengine google internet business aol mixed fortune lost subscriber fourth quarter profit lower preceding three quarter however company said aols underlying profit exceptional item rose back stronger internet advertising revenue hope increase subscriber offering online service free timewarner internet customer try sign aols existing customer highspeed broadband timewarner also restate result following probe u security exchange commission sec close concluding time warner fourth quarter profit slightly better analyst expectation film division saw profit slump helped boxoffice flop alexander catwoman sharp contrast yearearlier t

### Split data into train and test set with topics approximately evenly distributed between both data sets

In [9]:
stshsp = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in stshsp.split(news, labels):
    x_train = np.array(news)[train_index.astype(int)]
    y_train = np.array(labels)[train_index.astype(int)]
    x_test = np.array(news)[test_index.astype(int)]
    y_test = np.array(labels)[test_index.astype(int)]

In [10]:
x_test[0]



In [11]:
# only the most frequently occurring words are considered; Keras keeps the word indices
# below num_words, i.e. the 999 most frequent words (index 0 is reserved for padding)
max_num_words = 1000

# instantiate Tokenizer
tokenizer = Tokenizer(num_words=max_num_words)

# fit Tokenizer to the processed training text, assigning each word a unique integer ranked by frequency
tokenizer.fit_on_texts(x_train)

# transform news and assign the transformation to sequences
train_sequences = tokenizer.texts_to_sequences(x_train)
test_sequences = tokenizer.texts_to_sequences(x_test)

dummy_y_train = np_utils.to_categorical(y_train)
dummy_y_test = np_utils.to_categorical(y_test)

In [12]:
train_sequences[0][0:10]

[571, 493, 630, 571, 630, 802, 323, 148, 309, 285]
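The idea behind these integer sequences can be sketched in plain Python. This is a simplified illustration under the assumption that indices are assigned by descending frequency starting at 1 and that only indices below `num_words` are kept; the helper names and toy texts are hypothetical, and this is not the Keras implementation.

```python
from collections import Counter

def fit_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter(word for text in texts for word in text.split())
    return {word: rank
            for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_sequences(texts, word_index, num_words):
    """Replace words by their index and drop words whose index is >= num_words."""
    return [[word_index[w] for w in text.split()
             if w in word_index and word_index[w] < num_words]
            for text in texts]

toy_texts = ["profit sale profit", "sale profit film"]
word_index = fit_word_index(toy_texts)
sequences = texts_to_sequences(toy_texts, word_index, num_words=3)
print(word_index)   # {'profit': 1, 'sale': 2, 'film': 3}
print(sequences)    # [[1, 2, 1], [2, 1]]
```

In the toy run, the rare word `film` has index 3 and is dropped because `num_words=3`, which mirrors how words outside the most frequent ones vanish from the sequences above.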

In [13]:
# pad to the length of the longest sequence in either the training or the test set
max_len = max(len(seq) for seq in train_sequences + test_sequences)
train_pad_sequ = pad_sequences(train_sequences, maxlen=max_len, padding='post')
test_pad_sequ = pad_sequences(test_sequences, maxlen=max_len, padding='post')

#labels = np.asarray(labels)

In [14]:
train_pad_sequ[0]

array([571, 493, 630, 571, 630, 802, 323, 148, 309, 285, 603, 630,  13,
        29, 185, 736,  15, 189, 773, 493, 387, 493,  84,   1, 249, 411,
       131,  29, 571,   7, 524, 494, 387,  46, 457, 941,  46, 127, 630,
        17,   7, 239, 362, 309,  19, 666, 863, 493,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   

In [15]:
max_len

1147
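What post-padding does can be sketched with NumPy alone. The helper `pad_post` is hypothetical and assumes that no sequence is longer than `maxlen`, which holds here because `max_len` is the length of the longest sequence; it does not reproduce the truncation options of `pad_sequences`.

```python
import numpy as np

def pad_post(sequences, maxlen):
    """Right-pad every sequence with zeros up to maxlen.

    Assumes len(seq) <= maxlen for every sequence (no truncation handled).
    """
    padded = np.zeros((len(sequences), maxlen), dtype=int)
    for row, seq in enumerate(sequences):
        padded[row, :len(seq)] = seq
    return padded

toy_sequences = [[5, 3], [7], [2, 4, 6, 8]]
toy_max_len = max(len(seq) for seq in toy_sequences)
toy_padded = pad_post(toy_sequences, toy_max_len)
print(toy_padded)
```

Since index 0 is never assigned to a word, the zeros can be recognised as padding.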

### GloVe

The pre-trained GloVe vectors used here were trained on 6 billion tokens from a 2014 Wikipedia dump and Gigaword 5.
The 400,000 most frequent words are chosen and their co-occurrence matrix is constructed.
If a word appears $d$ words apart from another word and $d$ is less than or equal to the window size, then $1/d$ is added to the co-occurrence value between those words.
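The distance-weighted counting can be sketched as follows. The function name, the toy tokens and the choice of a symmetric window (each pair is counted in both directions) are assumptions of this example, not details taken from the notebook.

```python
import numpy as np

def cooccurrence(tokens, vocab, window):
    """Add 1/d to X[i, j] and X[j, i] for every pair of words d <= window apart."""
    pos = {word: idx for idx, word in enumerate(vocab)}
    X = np.zeros((len(vocab), len(vocab)))
    for center, word in enumerate(tokens):
        for d in range(1, window + 1):
            other = center + d
            if other >= len(tokens):
                break
            i, j = pos[word], pos[tokens[other]]
            X[i, j] += 1.0 / d
            X[j, i] += 1.0 / d
    return X

vocab = ["ad", "sale", "boost", "profit"]
tokens = ["ad", "sale", "boost", "profit"]
X = cooccurrence(tokens, vocab, window=2)
print(X)
```

Here `ad` and `sale` are neighbours and contribute 1, `ad` and `boost` are two words apart and contribute 0.5, and `ad` and `profit` lie outside the window and contribute nothing.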

The training objective of GloVe is to learn word vectors such that their dot product equals the logarithm of the words' probability of co-occurrence.

![title](glove.png)

1. Create the co-occurrence matrix $X$ of size $(V \times V)$, where $V$ is the vocabulary size and element $X_{ij}$ measures how often word $i$ appears in the context of word $j$. A word is part of another word's context if it appears in its context window. The farther away word $i$ appears in $j$'s context window, the smaller its contribution to $X_{ij}$.

2. Randomly initialize two matrices $W_1$ and $W_2$, both of size $(V \times N)$, and two vectors $b_1$ and $b_2$, both of size $(V \times 1)$.

3. For each word pair $i$ and $j$, pick the corresponding row $w_i$ of $W_1$, row $w_j$ of $W_2$ and bias entries $b_i$ of $b_1$ and $b_j$ of $b_2$. They should satisfy

$w_i^Tw_j + b_i + b_j = \log(X_{ij})$,

with $w_i$ and $w_j$ of size $(N \times 1)$ being updated by minimizing the cost function

  $J = \sum_{i=1}^{V} \sum_{j=1}^{V} f(X_{ij})(w_i^Tw_j + b_i + b_j - \log(X_{ij}))^2$

where

$f(X_{ij}) = (X_{ij}/X_{max})^\alpha$ if $X_{ij} < X_{max}$ else $1$

Finally, add matrices $W_1$ and $W_2$ to obtain the embedding matrix $W$.
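The cost function can be evaluated with a few lines of NumPy. This is a sketch under stated assumptions: the function names are hypothetical, the biases are stored as 1-D arrays instead of $(V \times 1)$ columns, the defaults $X_{max} = 100$ and $\alpha = 0.75$ are the values reported in the GloVe paper, and pairs with $X_{ij} = 0$ are skipped because $f(0) = 0$ and $\log(0)$ is undefined.

```python
import numpy as np

def weight(x, x_max=100.0, alpha=0.75):
    """Weighting function f: (x / x_max)^alpha below x_max, otherwise 1."""
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_cost(X, W1, W2, b1, b2, x_max=100.0, alpha=0.75):
    """Weighted squared error between w_i^T w_j + b_i + b_j and log(X_ij)."""
    i, j = np.nonzero(X)                       # only pairs that co-occur
    pred = np.sum(W1[i] * W2[j], axis=1) + b1[i] + b2[j]
    err = pred - np.log(X[i, j])
    return float(np.sum(weight(X[i, j], x_max, alpha) * err ** 2))

rng = np.random.default_rng(0)
V, N = 4, 3
X = np.array([[0.0, 2.0, 1.0, 0.0],
              [2.0, 0.0, 1.5, 0.5],
              [1.0, 1.5, 0.0, 3.0],
              [0.0, 0.5, 3.0, 0.0]])
W1, W2 = rng.normal(size=(V, N)), rng.normal(size=(V, N))
b1, b2 = np.zeros(V), np.zeros(V)
cost = glove_cost(X, W1, W2, b1, b2)
print(cost)

# After training, the final embedding is the sum of both matrices.
W = W1 + W2
```

A real training loop would minimize this cost with a gradient-based optimizer; that part is left out here.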

[Get_the_paper](https://nlp.stanford.edu/pubs/glove.pdf)

In [16]:
embeddings_index = {}
# each line holds a word followed by the components of its embedding vector
with open('GloVe/glove.6B.300d.txt', encoding='utf-8') as file:
    for line in file:
        vector = line.split()
        word = vector[0]
        values = np.asarray(vector[1:], dtype='float32')
        embeddings_index[word] = values

In [17]:
embeddings_index['bbc']

array([-2.0522e-01, -1.2182e-01,  5.2485e-01,  1.7955e-01, -4.7728e-01,
       -2.8371e-01, -1.0358e-01,  3.1730e-01, -7.2417e-01, -1.0331e+00,
        2.1992e-01, -2.2033e-01,  4.3517e-01,  3.3936e-01,  4.9328e-01,
       -1.2552e-01, -3.3965e-01,  7.4982e-02,  7.0120e-01, -4.3616e-01,
        2.7040e-01,  3.9262e-02, -2.6013e-01,  3.5666e-01,  3.5008e-01,
       -2.8987e-01, -2.7419e-01,  1.9304e-02, -1.7569e-01,  1.0220e-01,
       -9.7804e-01, -5.9894e-01,  3.8760e-02,  3.4257e-01, -5.1110e-01,
       -4.2461e-01,  2.2995e-01,  7.4401e-01, -2.9785e-01,  4.8539e-01,
        2.7491e-01, -4.0656e-01, -9.5166e-02,  9.0538e-01, -4.4044e-02,
        6.5553e-01, -3.9572e-01, -1.5594e-02, -1.2509e-01, -4.6166e-01,
       -2.3151e-01,  4.8925e-01, -1.9804e-01, -1.3514e-01,  5.3156e-02,
        4.5265e-01,  4.6270e-01, -2.4297e-01, -3.8862e-02, -4.5783e-01,
       -3.1181e-01,  1.9102e-01, -2.9440e-01,  3.9846e-01,  9.1296e-02,
       -6.3331e-02,  1.0522e+00,  2.5403e-01,  5.3852e-03,  1.24

Here, the integer indices that the tokenizer assigns in order of word frequency across the training documents (tokenizer.word_index) come in handy. If a word's index is 1000 or greater, i.e. the word is not among the most frequent ones, its embedding values are not loaded into the embedding matrix.

In [19]:
embedding_dim = 300

embedding_matrix = np.zeros((max_num_words, embedding_dim))
for word, i in tokenizer.word_index.items():
    if i < max_num_words:
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector

In [20]:
np.count_nonzero(np.sum(embedding_matrix, axis=1))

996

In [21]:
model = Sequential()
model.add(Embedding(input_dim=max_num_words, output_dim=embedding_dim, weights=[embedding_matrix], 
                    input_length=max_len, trainable=True))
model.add(Flatten())
model.add(Dense(16, activation='relu'))
model.add(Dense(5, activation='softmax'))
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 1147, 300)         300000    
_________________________________________________________________
flatten_1 (Flatten)          (None, 344100)            0         
_________________________________________________________________
dense (Dense)                (None, 16)                5505616   
_________________________________________________________________
dense_1 (Dense)              (None, 5)                 85        
Total params: 5,805,701
Trainable params: 5,805,701
Non-trainable params: 0
_________________________________________________________________
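The parameter counts in the summary can be checked by hand. The short calculation below only uses numbers that appear in the summary and in the earlier cells; the variable names are chosen for this example.

```python
vocab_size, embedding_dim, max_len = 1000, 300, 1147
hidden_units, num_classes = 16, 5

# one 300-dimensional vector per word index
embedding_params = vocab_size * embedding_dim
# Flatten concatenates the 1147 word vectors of a document
flatten_size = max_len * embedding_dim
# weights plus one bias per unit
dense_params = flatten_size * hidden_units + hidden_units
output_params = hidden_units * num_classes + num_classes

total_params = embedding_params + dense_params + output_params
print(embedding_params, flatten_size, dense_params, output_params, total_params)
```

Almost all parameters sit in the first dense layer, because flattening the padded sequence produces a 344,100-dimensional input.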


In [22]:
model.compile(loss='categorical_crossentropy', 
              optimizer='adam', 
              metrics=['accuracy'])

In [23]:
model.fit(train_pad_sequ, dummy_y_train,
            epochs=5,
            batch_size=4)
#model.save_weights('tuned_glove_weights.h5')

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x1d36d3a0cf8>

In [24]:
predictions = model.predict(test_pad_sequ)

In [25]:
predictions.shape

(445, 5)

In [26]:
accuracy_score(dummy_y_test, (predictions > 0.5))

0.950561797752809
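Note that thresholding at 0.5 and comparing complete one-hot rows counts a sample as wrong whenever no class probability exceeds 0.5, even if the most probable class is the correct one. The sketch below contrasts this with picking the most probable class via argmax. The probabilities are made up for illustration, and the row-wise comparison is written out in NumPy rather than calling scikit-learn.

```python
import numpy as np

# Hypothetical softmax outputs for three documents and five classes.
probs = np.array([
    [0.70, 0.10, 0.10, 0.05, 0.05],   # confident and correct
    [0.40, 0.30, 0.10, 0.10, 0.10],   # correct, but below the 0.5 threshold
    [0.10, 0.20, 0.60, 0.05, 0.05],   # confident and wrong
])
y_true = np.array([0, 0, 1])
one_hot_true = np.eye(5, dtype=bool)[y_true]

# Exact row match after thresholding, as in the cell above.
threshold_acc = np.mean(np.all((probs > 0.5) == one_hot_true, axis=1))

# Most probable class per document.
argmax_acc = np.mean(np.argmax(probs, axis=1) == y_true)

print(threshold_acc, argmax_acc)
```

In this toy case the second document is counted as an error by the threshold variant only, so the reported accuracies can be read as a slightly conservative estimate.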

### word2vec & gensim

1. Randomly initialize matrix $W_1$ of size $(V \times N)$ and matrix $W_2$ of size $(N \times V)$.

2. Each word in the vocabulary of size $V$ is represented as a unique one-hot encoded vector $X_i$.

3. A supervised learning problem is created with the one-hot vectors of the context window as input and the one-hot vector of the chosen word as target. The output is a vector containing the probability of each word in the vocabulary given the input words.

4. The weights are adjusted through backpropagation.
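The four steps can be sketched as a single CBOW training step in NumPy. This is an illustration under simplifying assumptions: it uses a full softmax over a toy vocabulary and plain gradient descent, the context vectors are averaged, and all names and sizes are hypothetical. Practical word2vec training replaces the full softmax with cheaper approximations.

```python
import numpy as np

rng = np.random.default_rng(0)
V, N = 6, 4                                   # toy vocabulary and embedding size
W1 = rng.normal(scale=0.1, size=(V, N))       # step 1: input weights
W2 = rng.normal(scale=0.1, size=(N, V))       # step 1: output weights

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def cbow_step(W1, W2, context_ids, target_id, lr=0.1):
    """One forward and backward pass; updates W1 and W2 in place."""
    # steps 2 and 3: multiplying a one-hot vector by W1 selects a row of W1
    h = W1[context_ids].mean(axis=0)
    probs = softmax(h @ W2)                   # probability of each vocabulary word
    loss = -np.log(probs[target_id])
    # step 4: backpropagation of the cross-entropy loss
    grad_z = probs.copy()
    grad_z[target_id] -= 1.0
    grad_h = W2 @ grad_z
    W2 -= lr * np.outer(h, grad_z)
    for c in context_ids:
        W1[c] -= lr * grad_h / len(context_ids)
    return loss, probs

# Predict word 2 from its neighbours 1 and 3, repeated for a few steps.
losses = [cbow_step(W1, W2, [1, 3], 2)[0] for _ in range(100)]
print(losses[0], losses[-1])
```

After training, the rows of `W1` serve as the word embeddings.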



![title](word2vec-cbow.png)

[Get_the_other_paper](https://arxiv.org/pdf/1301.3781.pdf)

In [24]:
import gensim

Gensim is a production-ready open-source library for unsupervised topic modeling and natural language processing, using modern statistical machine learning.

In [25]:
# forward slashes avoid the invalid escape sequence '\G' and work on every platform
model = gensim.models.KeyedVectors.load_word2vec_format('word2vec/GoogleNews-vectors-negative300.bin', binary=True)

In [30]:
embedd = model.wv['bbc']
print(embedd)
#ind = model.wv.vocab.get('the').index

[-0.06494141  0.08544922 -0.10644531  0.41015625  0.125       0.01977539
  0.06982422 -0.08642578 -0.03686523  0.05908203 -0.16308594 -0.39648438
 -0.28125    -0.07519531  0.04345703  0.22753906  0.1796875   0.33203125
  0.23339844 -0.09423828 -0.29492188  0.11279297  0.203125    0.33398438
  0.00787354  0.4453125  -0.20507812  0.40234375  0.30078125 -0.08496094
  0.00334167  0.21484375  0.10107422 -0.19726562 -0.3515625  -0.1796875
 -0.3671875   0.4765625  -0.00148773 -0.12597656  0.23339844 -0.29882812
  0.08203125  0.35351562  0.02416992  0.26953125 -0.13476562 -0.25585938
 -0.40429688 -0.19726562 -0.23925781 -0.14453125  0.5390625  -0.07324219
 -0.08691406 -0.38671875 -0.4296875   0.15527344 -0.2734375  -0.16015625
  0.22265625  0.06445312 -0.26367188 -0.18066406  0.17285156 -0.04492188
 -0.13378906 -0.2890625  -0.35742188 -0.18457031 -0.21484375  0.23535156
 -0.12207031  0.02636719 -0.24902344 -0.04418945  0.31054688 -0.20605469
  0.14160156 -0.24023438  0.05859375  0.38867188 -0.

In [31]:
embedding_matrix = np.zeros((max_num_words, embedding_dim))
for word, i in tokenizer.word_index.items():
    if i < max_num_words:
        try:
            embedding_vector = model.wv[word]
        except KeyError:
            embedding_vector = None
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector

In [32]:
np.count_nonzero(np.sum(embedding_matrix, axis=1))

989

In [33]:
model = Sequential()
model.add(Embedding(input_dim=max_num_words, output_dim=embedding_dim, weights=[embedding_matrix], 
                    input_length=max_len, trainable=True))
model.add(Flatten())
model.add(Dense(16, activation='relu'))
model.add(Dense(5, activation='softmax'))
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 1147, 300)         300000    
_________________________________________________________________
flatten_2 (Flatten)          (None, 344100)            0         
_________________________________________________________________
dense_2 (Dense)              (None, 16)                5505616   
_________________________________________________________________
dense_3 (Dense)              (None, 5)                 85        
Total params: 5,805,701
Trainable params: 5,805,701
Non-trainable params: 0
_________________________________________________________________


In [34]:
model.compile(loss='categorical_crossentropy', 
              optimizer='adam', 
              metrics=['accuracy'])
model.fit(train_pad_sequ, dummy_y_train,
            epochs=5,
            batch_size=4)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x23d083d7dd8>

In [35]:
predictions = model.predict(test_pad_sequ)

In [36]:
accuracy_score(dummy_y_test, (predictions > 0.5))

0.9573033707865168

### How to train word2vec on your own text corpus with gensim

In [29]:
# gensim 3.x API; in gensim 4 and later the `size` argument is named `vector_size`
model2 = gensim.models.Word2Vec(min_count=4, size=300)

In [26]:
sent = [new.split() for new in x_train]

In [27]:
sent[0]

['collins',
 'appeal',
 'drug',
 'ban',
 'sprinter',
 'michelle',
 'collins',
 'lodged',
 'appeal',
 'eightyear',
 'doping',
 'ban',
 'north',
 'american',
 'court',
 'arbitration',
 'sport',
 'ca',
 'yearold',
 'received',
 'ban',
 'last',
 'month',
 'result',
 'connection',
 'federal',
 'inquiry',
 'balco',
 'doping',
 'scandal',
 'first',
 'athlete',
 'banned',
 'without',
 'positive',
 'drug',
 'test',
 'admission',
 'drug',
 'use',
 'ca',
 'said',
 'ruling',
 'normally',
 'given',
 'within',
 'four',
 'month',
 'appeal',
 'collins',
 'suspended',
 'u',
 'antidoping',
 'agency',
 'based',
 'pattern',
 'observed',
 'blood',
 'urine',
 'test',
 'well',
 'evidence',
 'balco',
 'investigation',
 'well',
 'hit',
 'ban',
 'collins',
 'stripped',
 'world',
 'u',
 'indoor',
 'title',
 'san',
 'franciscobased',
 'balco',
 'laboratory',
 'centre',
 'scandal',
 'rocked',
 'sport',
 'company',
 'accused',
 'distributing',
 'illegal',
 'performanceenhancing',
 'drug',
 'elite',
 'athlete']

In [30]:
model2.build_vocab(sent)

In [31]:
model2.train(sent, total_examples=model2.corpus_count, epochs=30, report_delay=1)

(10350944, 11349960)

In [41]:
model2.wv['google']

array([ 0.99413025, -0.20194669, -0.42620084,  0.4460835 ,  0.30581605,
       -0.4742018 , -0.3601839 ,  0.3225689 , -0.5148598 ,  1.0312783 ,
        1.2999033 ,  0.22674341,  0.75039387,  0.44372705,  0.16040012,
       -0.62010753, -0.10729679, -0.06170062, -0.09001117,  1.448449  ,
        0.02740547, -0.24464263, -0.7453877 , -0.9412458 ,  0.66363484,
       -0.09560748, -0.48959064, -0.3056355 ,  0.1496163 , -0.09705112,
       -0.00971344,  0.4996164 ,  0.16528675, -0.42980462,  0.25316024,
       -0.34875566,  0.5857569 ,  0.01958018, -0.8425177 , -0.26651308,
        0.09368241, -0.3635835 ,  0.5704309 , -0.8728141 , -0.19457762,
        0.24356344,  0.75698096, -0.06678712,  0.5018418 ,  0.4549258 ,
       -1.2295932 , -0.45444015, -0.5253445 ,  0.7744052 ,  0.5649921 ,
        0.5480347 ,  0.5309998 , -1.1545534 , -1.2731216 ,  0.10626012,
       -0.06009358, -0.8233754 , -0.675419  ,  0.23779693, -0.22993396,
       -0.8960414 , -0.9856039 ,  0.24893601, -0.22650301,  0.04

In [42]:
embedding_matrix = np.zeros((max_num_words, embedding_dim))
for word, i in tokenizer.word_index.items():
    if i < max_num_words:
        try:
            embedding_vector = model2.wv[word]
        except KeyError:
            embedding_vector = None
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector

In [43]:
np.count_nonzero(np.sum(embedding_matrix, axis=1))

999

In [44]:
model = Sequential()
model.add(Embedding(input_dim=max_num_words, output_dim=embedding_dim, weights=[embedding_matrix], 
                    input_length=max_len, trainable=True))
model.add(Flatten())
model.add(Dense(16, activation='relu'))
model.add(Dense(5, activation='softmax'))
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 1147, 300)         300000    
_________________________________________________________________
flatten_3 (Flatten)          (None, 344100)            0         
_________________________________________________________________
dense_4 (Dense)              (None, 16)                5505616   
_________________________________________________________________
dense_5 (Dense)              (None, 5)                 85        
Total params: 5,805,701
Trainable params: 5,805,701
Non-trainable params: 0
_________________________________________________________________


In [45]:
model.compile(loss='categorical_crossentropy', 
              optimizer='adam', 
              metrics=['accuracy'])
model.fit(train_pad_sequ, dummy_y_train,
            epochs=5,
            batch_size=4)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x23d08ede978>

In [46]:
predictions = model.predict(test_pad_sequ)

In [47]:
accuracy_score(dummy_y_test, (predictions > 0.5))

0.8786516853932584

You can find this notebook on https://github.com/arturzeitler/meetup/blob/master/embeddings2.ipynb

### A TensorFlow version of the Keras code will follow soon.