# Lesson 05A - Learned Pre-trained Embedding

### Playing with Keras and getting familiar with CNN layers + using Word2Vec pre-trained embedding


The code here is based on this [ML Mastery post](https://machinelearningmastery.com/use-word-embedding-layers-deep-learning-keras/); however, I use a different w2v model here, loaded with gensim.

In [56]:
import pandas as pd
import numpy as np
from keras.preprocessing import sequence
from keras.preprocessing.text import Tokenizer
from gensim.models import Word2Vec
from keras.models import Sequential
from keras.layers import Embedding
from keras.layers import SpatialDropout1D
from keras.layers import Convolution1D
from keras.layers import Conv1D
from keras.layers import GlobalMaxPooling1D
from keras.layers import Dense, Dropout, Activation
from keras.layers import Flatten

**Loading the w2v**

In [12]:
# Load the pre-trained w2v model from disk
pretrained = Word2Vec.load(r"C:\Users\ishay\Documents\Data Science\Naya Course\part_4\02_text_mining_and_nlp\nlp_from_github\nlp_classification\data\w2v.pickle")

In [27]:
# These are all of the vectors
pretrained.wv

<gensim.models.keyedvectors.Word2VecKeyedVectors at 0x1d138ecc828>

In [28]:
# Examine one of the vectors:
pretrained.wv["schedule"]

array([  2.77401829e+00,  -2.24431610e+00,  -1.01791372e+01,
        -4.19550276e+00,  -1.84075713e+00,  -5.41560078e+00,
         1.44039355e-02,  -5.99770021e+00,   7.06198025e+00,
        -9.16574860e+00,   5.53503990e+00,  -3.34497046e+00,
        -4.42923605e-01,  -8.38503551e+00,   4.76956034e+00,
        -4.30649328e+00,  -5.75036144e+00,   1.72932088e+00,
         2.89756984e-01,  -8.27550030e+00,  -9.32571697e+00,
        -1.17719114e-01,  -6.59948111e+00,  -6.64540958e+00,
         6.45717907e+00,   5.95884228e+00,   4.18421841e+00,
         5.87587404e+00,  -7.26584864e+00,   8.81920624e+00,
         4.19390440e+00,   2.25955439e+00,  -4.71369219e+00,
         2.67389464e+00,  -7.70914555e+00,  -5.84583163e-01,
         1.05691934e+00,   1.44013420e-01,   1.15844083e+00,
        -5.53940344e+00,   1.71546412e+00,   4.60938644e+00,
        -1.26359761e-01,  -6.45151641e-03,   3.69829804e-01,
         4.95956945e+00,  -2.00106478e+00,   3.57773185e+00,
         2.57341909e+00,

In [29]:
len(pretrained.wv["schedule"])

50

So each vector has length 50. I can also get back to the original words through vocab (a dictionary keyed by word):

In [32]:
pretrained.wv.vocab

{'board of directors': <gensim.models.keyedvectors.Vocab at 0x1d139c716a0>,
 'board': <gensim.models.keyedvectors.Vocab at 0x1d139c71780>,
 'conversion price': <gensim.models.keyedvectors.Vocab at 0x1d139c717b8>,
 'grantor': <gensim.models.keyedvectors.Vocab at 0x1d139c71828>,
 'lenders': <gensim.models.keyedvectors.Vocab at 0x1d139c71898>,
 'accounts': <gensim.models.keyedvectors.Vocab at 0x1d139c718d0>,
 'collateral': <gensim.models.keyedvectors.Vocab at 0x1d139c71908>,
 'documents': <gensim.models.keyedvectors.Vocab at 0x1d139c71940>,
 'person': <gensim.models.keyedvectors.Vocab at 0x1d139c719b0>,
 'proceeds': <gensim.models.keyedvectors.Vocab at 0x1d139c719e8>,
 'security agreement': <gensim.models.keyedvectors.Vocab at 0x1d139c71a20>,
 '': <gensim.models.keyedvectors.Vocab at 0x1d139c71a58>,
 'share': <gensim.models.keyedvectors.Vocab at 0x1d139c71ac8>,
 'shares': <gensim.models.keyedvectors.Vocab at 0x1d139c71b38>,
 'shareholder': <gensim.models.keyedvectors.Vocab at 0x1d139c71b7

**Create the data**

In [10]:
# define documents
docs = ['Well done!',
		'Good work',
		'Great effort',
		'nice work',
		'Excellent!',
		'Weak',
		'Poor effort!',
		'not good',
		'poor work',
		'Could have done better.']
# define class labels
labels = np.array([1,1,1,1,1,0,0,0,0,0])
# prepare tokenizer
t = Tokenizer()
t.fit_on_texts(docs)
vocab_size = len(t.word_index) + 1
# integer encode the documents
encoded_docs = t.texts_to_sequences(docs)
print(encoded_docs)
print()
# pad documents to a max length of 4 words
max_length = 4
padded_docs = sequence.pad_sequences(encoded_docs, maxlen=max_length, padding='post')
print(padded_docs)

[[6, 2], [3, 1], [7, 4], [8, 1], [9], [10], [5, 4], [11, 3], [5, 1], [12, 13, 2, 14]]

[[ 6  2  0  0]
 [ 3  1  0  0]
 [ 7  4  0  0]
 [ 8  1  0  0]
 [ 9  0  0  0]
 [10  0  0  0]
 [ 5  4  0  0]
 [11  3  0  0]
 [ 5  1  0  0]
 [12 13  2 14]]


At this stage the data is in the same form as the IMDb data, which I also had integer-encoded and padded (there to a length of 400).
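To make the padding step concrete, here is a minimal pure-Python sketch of what post-padding does. The helper name `pad_post` is my own, not a Keras function. It also mimics the Keras default of dropping items from the front of a sequence that is longer than the maximum length, which does not happen in this toy dataset.

```python
def pad_post(sequences, maxlen, value=0):
    """Pad each sequence at the end with `value`.

    Sequences longer than `maxlen` keep only their last `maxlen` items.
    Assumes maxlen >= 1.
    """
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # truncate from the front if too long
        padded.append(seq + [value] * (maxlen - len(seq)))
    return padded


# three of the encoded documents from the cell above
encoded = [[6, 2], [9], [12, 13, 2, 14]]
print(pad_post(encoded, maxlen=4))
# [[6, 2, 0, 0], [9, 0, 0, 0], [12, 13, 2, 14]]
```

Index 0 is never assigned to a real word by the Tokenizer, which is why it is safe to use as the padding value.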

**Combining the W2V model to our dataset**

Next, we need to create a matrix holding one embedding for each word **in the training dataset**. We can do that by iterating over all unique words in Tokenizer.word_index and looking up each word's vector in the loaded W2V model.

In [48]:
"schedule" in pretrained.wv.vocab.keys()

True
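The membership test above goes through `wv.vocab`, which is how older gensim releases expose the vocabulary. From gensim 4 onward the same word mapping is exposed as `key_to_index` instead. A small sketch of a helper that works with either, where `known_words` and the two stand-in classes are hypothetical names used only for illustration:

```python
def known_words(wv):
    """Return the word mapping of a gensim KeyedVectors-like object.

    gensim 4+ exposes it as `key_to_index`; older releases as `vocab`.
    """
    if hasattr(wv, "key_to_index"):
        return wv.key_to_index
    return wv.vocab


class OldStyle:  # stand-in for a KeyedVectors object from gensim < 4
    vocab = {"schedule": object()}


class NewStyle:  # stand-in for a KeyedVectors object from gensim >= 4
    key_to_index = {"schedule": 0}


print("schedule" in known_words(OldStyle()))  # True
print("schedule" in known_words(NewStyle()))  # True
```

With the real model this would be called as `known_words(pretrained.wv)`.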

In [50]:
# create a weight matrix for words in training docs
embedding_matrix = np.zeros((vocab_size, 50))
for word, i in t.word_index.items():
    # only words that exist in the pre-trained model get a vector;
    # every other row (and row 0, the padding index) stays all-zero
    if word in pretrained.wv.vocab:
        embedding_matrix[i] = pretrained.wv[word]

In [52]:
embedding_matrix.shape

(15, 50)

So the matrix has 15 rows: the 14 words in my vocabulary plus index 0, which is reserved for padding. Each row is a vector of length 50.
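One thing the shape alone does not show is how many of those rows were actually filled. Any word that is missing from the pre-trained model keeps an all-zero row, which makes it indistinguishable from padding. Below is a hedged sketch of the same loop that also reports the missing words. The function name `build_embedding_matrix` and the toy dictionaries are assumptions for illustration; a plain dict stands in for `pretrained.wv`, which supports the same `in` and `[]` operations.

```python
import numpy as np


def build_embedding_matrix(word_index, vectors, dim):
    """Copy known vectors into a (len(word_index) + 1, dim) matrix.

    Returns the matrix and the list of words that had no vector.
    """
    matrix = np.zeros((len(word_index) + 1, dim))  # row 0 is padding
    missing = []
    for word, i in word_index.items():
        if word in vectors:
            matrix[i] = vectors[word]
        else:
            missing.append(word)  # this row stays all-zero
    return matrix, missing


# toy stand-ins, not the real tokenizer or w2v model
toy_word_index = {"work": 1, "done": 2, "good": 3}
toy_vectors = {"work": np.ones(5), "good": np.full(5, 2.0)}

matrix, missing = build_embedding_matrix(toy_word_index, toy_vectors, dim=5)
print(matrix.shape)  # (4, 5)
print(missing)       # ['done']
```

If the missing list is long, the frozen embedding carries little information for this dataset, whatever the matrix shape says.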

**Modeling**

In [53]:
# define model
model = Sequential()

Now, the arguments to the Embedding layer are a bit different from the self-trained model:
- The output dimension is dictated by the pre-trained vectors (50 in this case)
- I need to provide the weights from the embedding matrix
- I need to specify that the layer is not trainable, so its weights stay frozen during fitting

In [54]:
model.add(Embedding(vocab_size, 50, weights=[embedding_matrix], input_length=4, trainable=False))
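Conceptually, a frozen embedding layer is just a table lookup: each integer in the padded input selects one row of the weight matrix. A NumPy sketch of that idea follows, using a random stand-in matrix rather than the real embedding_matrix, so the values themselves are meaningless and only the shapes matter.

```python
import numpy as np

rng = np.random.default_rng(0)
toy_matrix = rng.normal(size=(15, 50))  # stand-in for embedding_matrix
toy_matrix[0] = 0.0                     # row 0 is the padding vector

# two padded documents, as integer word indices
batch = np.array([[6, 2, 0, 0],
                  [12, 13, 2, 14]])

embedded = toy_matrix[batch]                   # row lookup per index
flattened = embedded.reshape(len(batch), -1)   # what Flatten() produces

print(embedded.shape, flattened.shape)  # (2, 4, 50) (2, 200)
```

These shapes match the (None, 4, 50) and (None, 200) output shapes that Keras reports for the embedding and flatten layers.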

In [57]:
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))
# compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
# summarize the model
print(model.summary())
# fit the model
model.fit(padded_docs, labels, epochs=50, verbose=0)
# evaluate the model
loss, accuracy = model.evaluate(padded_docs, labels, verbose=0)
print('Accuracy: %f' % (accuracy*100))

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 4, 50)             750       
_________________________________________________________________
flatten_1 (Flatten)          (None, 200)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 201       
Total params: 951
Trainable params: 201
Non-trainable params: 750
_________________________________________________________________
None
Accuracy: 89.999998
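The parameter counts in the summary can be checked by hand. This is just arithmetic on the numbers already used above; the variable names are my own:

```python
vocab_size, embedding_dim, max_length = 15, 50, 4

# frozen embedding table: one 50-dim vector per index
embedding_params = vocab_size * embedding_dim

# dense layer: one weight per flattened input value, plus one bias
dense_params = max_length * embedding_dim * 1 + 1

print(embedding_params, dense_params, embedding_params + dense_params)
# 750 201 951
```

Only the 201 dense-layer parameters are updated during fitting; the 750 embedding weights stay exactly as they were loaded.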


### Warning

Updating the pre-trained word embeddings during training **can lead to overfitting**. Given that the training set is limited and doesn’t cover the entire vocabulary, we’re only moving some words in the embedding space but keeping others in their place. The words we move (e.g. kangaroo) now have better vectors for the specific task, but they’re further away from their (distributional) neighbours which are not found in the training set. This hurts the model’s ability to generalize: when it encounters an unobserved word (e.g. wallaby), **its vector is no longer located next to similar words (kangaroo) which have been moved**, so the model doesn’t know much about it.

Updating the embeddings during training is a good idea only if your task has very different needs from its embeddings than just plain similarity (and then you may want to start with a random initialization rather than pre-trained embeddings), and only if you have enough training data that covers a broad vocabulary.
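The kangaroo and wallaby argument can be illustrated with a toy calculation. The three-dimensional vectors and the size of the update below are invented purely for illustration and do not come from any real model:

```python
import numpy as np


def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# made-up pre-trained vectors: the two words start out as close neighbours
kangaroo = np.array([1.0, 0.9, 0.1])
wallaby = np.array([0.9, 1.0, 0.1])  # never seen in the training set

before = cosine(kangaroo, wallaby)

# pretend fine-tuning moved kangaroo along a task-specific direction,
# while wallaby stayed where it was
kangaroo_tuned = kangaroo + np.array([0.0, -1.5, 2.0])
after = cosine(kangaroo_tuned, wallaby)

print(before > after)  # True
```

After the update the two words are no longer neighbours, so whatever the model learned about kangaroo no longer transfers to wallaby.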