### Pretrained Word Embedding

The Keras Embedding layer can also use a word embedding learned elsewhere.

It is common in the field of Natural Language Processing to learn, save, and make freely available word embeddings.

For example, the researchers behind the GloVe method provide a suite of pre-trained word embeddings on their website, released under a public domain license. See:

GloVe: Global Vectors for Word Representation

The smallest package of embeddings is 822 MB, called "glove.6B.zip". It was trained on a dataset of six billion tokens (words) with a vocabulary of 400 thousand words. There are a few different embedding vector sizes: 50, 100, 200 and 300 dimensions.

You can download this collection of embeddings and seed the Keras Embedding layer with the pre-trained weights for the words in your training dataset.

After downloading and unzipping, you will see a few files, one of which is "glove.6B.100d.txt", which contains a 100-dimensional version of the embedding.

If you peek inside the file, you will see a token (word) followed by the weights (100 numbers) on each line. For example, the first line of this ASCII text file holds the embedding for "the".
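
To make the file layout concrete, here is a minimal sketch of splitting one line into a token and its vector. The sample line is made up and shortened to four numbers so the block runs without the download, and `parse_glove_line` is a hypothetical helper name, not part of GloVe or Keras.

```python
import numpy as np

def parse_glove_line(line):
    """Split one 'token w1 w2 ... wn' line into (token, float32 vector)."""
    parts = line.rstrip().split(' ')
    token = parts[0]                                  # first field is the word
    vector = np.asarray(parts[1:], dtype='float32')   # remaining fields are the weights
    return token, vector

# Made-up, shortened line for illustration only; real lines in
# glove.6B.100d.txt carry 100 numbers after the token.
sample_line = "example 0.1 -0.2 0.3 0.4\n"
token, vector = parse_glove_line(sample_line)
print(token, vector.shape)
```

The same split is what the loading loop later in this notebook applies to every line of the real file.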

In [2]:
from numpy import array
from numpy import asarray
from numpy import zeros
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Flatten
from tensorflow.keras.layers import Embedding

In [3]:
# define documents
docs = ['Well done!',
        'Good work',
        'Great effort',
        'nice work',
        'Excellent!',
        'Weak',
        'Poor effort!',
        'not good',
        'poor work',
        'Could have done better.']
# define class labels
labels = array([1,1,1,1,1,0,0,0,0,0])

### Using the Keras Tokenizer (Similar to One-Hot Encoding)

In [4]:
# prepare tokenizer
t = Tokenizer()
t.fit_on_texts(docs)
vocab_size = len(t.word_index) + 1
# integer encode the documents
encoded_docs = t.texts_to_sequences(docs)
print(encoded_docs)

[[6, 2], [3, 1], [7, 4], [8, 1], [9], [10], [5, 4], [11, 3], [5, 1], [12, 13, 2, 14]]
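
To see where these integers come from, the sketch below approximates the indexing in plain Python: lower-case the text, strip punctuation, count words, and number them from 1 in order of decreasing frequency (ties keep first-seen order). It is an illustration under those assumptions, not the Keras implementation; `tokenize` and `build_word_index` are hypothetical helper names, and `string.punctuation` is only an approximation of the Tokenizer's default filter.

```python
import string

docs = ['Well done!', 'Good work', 'Great effort', 'nice work', 'Excellent!',
        'Weak', 'Poor effort!', 'not good', 'poor work', 'Could have done better.']

# Map every punctuation character to a space before splitting.
_PUNCT_TABLE = str.maketrans({ch: ' ' for ch in string.punctuation})

def tokenize(text):
    """Lower-case, replace punctuation with spaces, split on whitespace."""
    return text.lower().translate(_PUNCT_TABLE).split()

def build_word_index(texts):
    """Number words from 1, most frequent first."""
    counts = {}
    for text in texts:
        for word in tokenize(text):
            counts[word] = counts.get(word, 0) + 1
    # sorted() is stable, so equally frequent words keep first-seen order.
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    return {word: i for i, (word, _) in enumerate(ranked, start=1)}

word_index = build_word_index(docs)
encoded = [[word_index[w] for w in tokenize(d)] for d in docs]
print(encoded)
```

Index 0 is never assigned to a word, which is why `vocab_size` is `len(t.word_index) + 1`: the extra slot is reserved for padding.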


### Using Pad Sequences


In [5]:
# pad documents to the length of the longest encoded document (4 words here)
max_length = max(len(seq) for seq in encoded_docs)
padded_docs = pad_sequences(encoded_docs, maxlen=max_length, padding='post')
print(padded_docs)

[[ 6  2  0  0]
 [ 3  1  0  0]
 [ 7  4  0  0]
 [ 8  1  0  0]
 [ 9  0  0  0]
 [10  0  0  0]
 [ 5  4  0  0]
 [11  3  0  0]
 [ 5  1  0  0]
 [12 13  2 14]]
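
The post-padding step can be sketched in NumPy as follows. This is an illustrative re-implementation, not the Keras function: `pad_post` is a hypothetical helper, and cutting over-long sequences at the end is a choice made for this sketch.

```python
import numpy as np

def pad_post(sequences, maxlen, value=0):
    """Right-pad each integer sequence with `value` up to `maxlen`."""
    padded = np.full((len(sequences), maxlen), value, dtype='int32')
    for row, seq in enumerate(sequences):
        kept = seq[:maxlen]               # sketch-only choice: cut at the end
        padded[row, :len(kept)] = kept
    return padded

# A few of the encoded documents from above.
encoded_sample = [[6, 2], [9], [12, 13, 2, 14]]
padded_sample = pad_post(encoded_sample, maxlen=4)
print(padded_sample)
```

Padding with 0 is safe here because the Tokenizer never assigns index 0 to a word.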


### Load the GloVe Embedding File as a Dictionary Mapping Words to Embedding Arrays

In [6]:
# load the whole embedding into memory
embeddings_index = dict()
# the context manager closes the file even if parsing fails
with open('glove.6B.100d.txt', encoding="utf8") as f:
    for line in f:
        values = line.split()
        word = values[0]  # first field is the token
        coefs = asarray(values[1:], dtype='float32')  # remaining fields are the weights
        embeddings_index[word] = coefs
print('Loaded %s word vectors.' % len(embeddings_index))
print(type(embeddings_index))

Loaded 400000 word vectors.
<class 'dict'>


In [7]:
embeddings_index['well']  # the 100-dimensional vector for the word 'well'

array([-0.53086  ,  0.51404  ,  0.087599 , -0.37314  ,  0.2747   ,
        0.07947  , -0.0085023,  0.028399 , -0.35114  ,  0.094339 ,
        0.087771 , -0.38307  ,  0.43129  ,  0.15261  , -0.1512   ,
       -0.4607   ,  0.080433 ,  0.037627 , -0.43959  ,  0.42451  ,
        0.16058  ,  0.26608  ,  0.35311  ,  0.014055 , -0.052771 ,
       -0.1615   , -0.299    , -0.56214  , -0.18742  ,  0.044237 ,
       -0.28118  ,  0.36594  , -0.26226  ,  0.11013  ,  0.44358  ,
        0.43131  , -0.0053095,  0.34705  , -0.44883  , -0.33727  ,
       -0.13281  , -0.35542  , -0.081663 , -0.12983  ,  0.080606 ,
       -0.161    ,  0.367    , -0.30568  ,  0.057269 , -0.794    ,
       -0.24581  ,  0.027115 ,  0.13203  ,  1.2262   , -0.19183  ,
       -2.5497   ,  0.055273 , -0.1378   ,  1.4552   ,  0.53697  ,
       -0.12337  ,  1.1278   , -0.16365  ,  0.21871  ,  0.82735  ,
       -0.30681  ,  0.65456  ,  0.17636  ,  0.6172   , -0.18425  ,
       -0.029966 , -0.098315 ,  0.32056  , -0.28124  ,  0.2568

In [8]:
vocab_size

15

In [9]:
t.word_index.items()  # each word maps to a unique integer index, which the Embedding layer requires as input

dict_items([('work', 1), ('done', 2), ('good', 3), ('effort', 4), ('poor', 5), ('well', 6), ('great', 7), ('nice', 8), ('excellent', 9), ('weak', 10), ('not', 11), ('could', 12), ('have', 13), ('better', 14)])

### Creating the Embedding Matrix with GloVe Weights

Next, we need to create a matrix of one embedding for each word in the training dataset. We can do that by enumerating all unique words in the Tokenizer.word_index and locating the embedding weight vector from the loaded GloVe embedding.

In [10]:
embedding_matrix = zeros((vocab_size, 100))
print(embedding_matrix)
embedding_matrix.shape

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]


(15, 100)

In [11]:
embedding_vector = embeddings_index.get('well')
embedding_vector

array([-0.53086  ,  0.51404  ,  0.087599 , -0.37314  ,  0.2747   ,
        0.07947  , -0.0085023,  0.028399 , -0.35114  ,  0.094339 ,
        0.087771 , -0.38307  ,  0.43129  ,  0.15261  , -0.1512   ,
       -0.4607   ,  0.080433 ,  0.037627 , -0.43959  ,  0.42451  ,
        0.16058  ,  0.26608  ,  0.35311  ,  0.014055 , -0.052771 ,
       -0.1615   , -0.299    , -0.56214  , -0.18742  ,  0.044237 ,
       -0.28118  ,  0.36594  , -0.26226  ,  0.11013  ,  0.44358  ,
        0.43131  , -0.0053095,  0.34705  , -0.44883  , -0.33727  ,
       -0.13281  , -0.35542  , -0.081663 , -0.12983  ,  0.080606 ,
       -0.161    ,  0.367    , -0.30568  ,  0.057269 , -0.794    ,
       -0.24581  ,  0.027115 ,  0.13203  ,  1.2262   , -0.19183  ,
       -2.5497   ,  0.055273 , -0.1378   ,  1.4552   ,  0.53697  ,
       -0.12337  ,  1.1278   , -0.16365  ,  0.21871  ,  0.82735  ,
       -0.30681  ,  0.65456  ,  0.17636  ,  0.6172   , -0.18425  ,
       -0.029966 , -0.098315 ,  0.32056  , -0.28124  ,  0.2568

The embedding vector for 'well' shown above is copied into the row of the embedding matrix at that word's integer index.

In [12]:
# Repeat the lookup for every word in the vocabulary and store its vector in embedding_matrix.
# Words that are missing from GloVe keep an all-zero row.

for word, i in t.word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

embedding_matrix

array([[ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       [-0.11619   ,  0.45447001, -0.69216001, ..., -0.54737002,
         0.48822001,  0.32246   ],
       [-0.2978    ,  0.31147   , -0.14937   , ..., -0.22709   ,
        -0.029261  ,  0.4585    ],
       ...,
       [ 0.05869   ,  0.40272999,  0.38633999, ..., -0.35973999,
         0.43718001,  0.10121   ],
       [ 0.15711001,  0.65605998,  0.0021149 , ..., -0.60614997,
         0.71004999,  0.41468999],
       [-0.047543  ,  0.51914001,  0.34283999, ..., -0.26859   ,
         0.48664999,  0.55609   ]])
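
Because words absent from the pretrained file silently keep a zero row, it can be worth listing them. The sketch below shows one way to do that with toy stand-ins so it runs without the GloVe download; the 3-dimensional vectors, the word 'zzzunknown', and all `toy_*` names are made up for illustration.

```python
import numpy as np

# Toy stand-ins for t.word_index and embeddings_index (made-up values).
toy_word_index = {'work': 1, 'done': 2, 'zzzunknown': 3}
toy_embeddings_index = {
    'work': np.array([0.1, 0.2, 0.3], dtype='float32'),
    'done': np.array([0.4, 0.5, 0.6], dtype='float32'),
}

toy_matrix = np.zeros((len(toy_word_index) + 1, 3))
missing = []
for word, i in toy_word_index.items():
    vector = toy_embeddings_index.get(word)
    if vector is None:
        missing.append(word)      # no pretrained vector: row stays all zeros
    else:
        toy_matrix[i] = vector

print('Words without a pretrained vector:', missing)
```

In the real notebook the same loop would use `t.word_index`, `embeddings_index`, and a width of 100.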

### Embedding Layer 

We now pass the embedding matrix, which holds the GloVe weights, directly to the Embedding layer.

The key difference from learning an embedding from scratch is that the Embedding layer is seeded with the GloVe word embedding weights. We chose the 100-dimensional version, so the Embedding layer must be defined with output_dim set to 100. Finally, we do not want to update the learned word weights in this model, so we set the trainable attribute of the layer to False.

No learning takes place in the Embedding layer because it reuses the pretrained GloVe vectors; only the weights of the Dense layer are trained.
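
Conceptually, a frozen embedding layer is a row lookup into the weight matrix. The NumPy sketch below illustrates that idea and the flattening that follows it; it is not Keras code, and the random `toy_matrix` is only a stand-in for the real GloVe-filled `embedding_matrix`.

```python
import numpy as np

rng = np.random.default_rng(0)
toy_matrix = rng.normal(size=(15, 100))   # stand-in for embedding_matrix (vocab_size x 100)
toy_matrix[0] = 0.0                       # row 0 belongs to the padding index

# Two padded documents taken from the padded output above.
batch = np.array([[6, 2, 0, 0],
                  [12, 13, 2, 14]])

embedded = toy_matrix[batch]                    # one 100-d row per word index
flattened = embedded.reshape(len(batch), -1)    # what Flatten hands to Dense
print(embedded.shape, flattened.shape)
```

Each document becomes a 4 x 100 block of word vectors, which is flattened to 400 features before the Dense layer.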

In [13]:
max_length

4

In [14]:
e = Embedding(vocab_size, 100, weights=[embedding_matrix], input_length=max_length, trainable=False)


In [15]:
model = Sequential()
model.add(e)
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))
# compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
# summarize the model
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 4, 100)            1500      
_________________________________________________________________
flatten (Flatten)            (None, 400)               0         
_________________________________________________________________
dense (Dense)                (None, 1)                 401       
Total params: 1,901
Trainable params: 401
Non-trainable params: 1,500
_________________________________________________________________
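
The parameter counts in the summary can be checked by hand. The short sketch below redoes the arithmetic using the values from this notebook; the variable names are chosen for illustration.

```python
vocab_size, embedding_dim, max_length = 15, 100, 4

# Embedding: one 100-d row per vocabulary index (frozen, so non-trainable).
embedding_params = vocab_size * embedding_dim

# Dense(1): one weight per flattened input feature plus one bias (trainable).
dense_params = max_length * embedding_dim * 1 + 1

print(embedding_params, dense_params, embedding_params + dense_params)
# prints: 1500 401 1901
```

Only the 401 Dense parameters are updated during training, which matches the "Trainable params" line.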


In [None]:
model.fit(padded_docs, labels, epochs=10, verbose=0)

In [None]:
# evaluate the model
loss, accuracy = model.evaluate(padded_docs, labels, verbose=0)
print('Accuracy: %f' % (accuracy*100))