# 10 Word2Vec Implemented on Keras
Keras is a high-level deep learning library. In this notebook, we implement two word2vec models: CBoW and Skip-gram. The corpus is the IMDB movie review dataset: http://ai.stanford.edu/~amaas/data/sentiment/


## Agenda

1. How to load pre-trained word vectors
2. Reading in the IMDB Sentiment Dataset and Iterating over files in Python
3. Build Skip-gram Model
4. Build CBoW Model
5. Memory-friendly Data Generation Methods

## Part 1: Load pre-trained word vectors

- You can find the word2vec project here: https://code.google.com/archive/p/word2vec/
- Download the word embeddings from the section **Pre-trained word and phrase vectors**. The file is named `GoogleNews-vectors-negative300.bin.gz` (3.4G); the code below loads the decompressed `.bin` file.
- Use gensim to load these word vectors and to access its helper functions.

In [1]:
from gensim.models import KeyedVectors
# Load pretrained model (since intermediate data is not included, the model cannot be refined with additional data)
model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

dog = model['dog']
print(dog.shape)
print(dog[:10])

# Some predefined functions that show content related information for given words
print(model.most_similar(positive=['woman', 'king'], negative=['man']))

print(model.doesnt_match("breakfast cereal dinner lunch".split()))

print(model.similarity('woman', 'man'))

(300,)
[ 0.05126953 -0.02233887 -0.17285156  0.16113281 -0.08447266  0.05737305
  0.05859375 -0.08251953 -0.01538086 -0.06347656]
[('queen', 0.7118192911148071), ('monarch', 0.6189674139022827), ('princess', 0.5902431607246399), ('crown_prince', 0.5499460697174072), ('prince', 0.5377321243286133), ('kings', 0.5236844420433044), ('Queen_Consort', 0.5235945582389832), ('queens', 0.5181134343147278), ('sultan', 0.5098593235015869), ('monarchy', 0.5087411999702454)]
cereal
0.76640123
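
The calls above rely on cosine similarity between word vectors. Here is a minimal NumPy sketch of the idea, using made-up 3-dimensional vectors. The `toy_vectors` dictionary, its values, and the `cosine` and `analogy` helpers are assumptions for illustration, not GoogleNews data or gensim code. gensim's own implementation differs in details, for example in how it normalizes vectors.

```python
import numpy as np

# Hypothetical toy vectors, chosen only to illustrate the arithmetic.
toy_vectors = {
    'king':  np.array([0.9, 0.8, 0.1]),
    'queen': np.array([0.9, 0.1, 0.8]),
    'man':   np.array([0.1, 0.9, 0.1]),
    'woman': np.array([0.1, 0.1, 0.9]),
    'apple': np.array([0.5, 0.5, 0.0]),
}

def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def analogy(positive, negative, vectors):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    query = sum(vectors[w] for w in positive) - sum(vectors[w] for w in negative)
    exclude = set(positive) | set(negative)
    scores = [(w, cosine(query, v)) for w, v in vectors.items() if w not in exclude]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

result = analogy(['woman', 'king'], ['man'], toy_vectors)
print(result)
```

With these toy values, `queen` ranks first because `woman + king - man` points in almost the same direction as the `queen` vector.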


In [2]:
# clear the memory
del model

## Part 2: Read in the IMDB Sentiment Dataset

- The IMDB data is in the `imdb` folder inside the `BT5153_data` folder.
- Each movie review is a text file, stored under one of two folders: `pos` and `neg`.
- We need to iterate over these files and load them one by one.

In [3]:
import numpy as np
import pandas as pd
import os

In [4]:
def load_imdb_dataset(imdb_path):
    # imdb_path is the base path 
    train_texts = []
    train_labels = []
    # contain two sub-folders named pos and neg
    for cat in ['pos', 'neg']:
        dset_path = os.path.join(imdb_path, cat)
        # loop over each folder and pick up every .txt file name
        for fname in sorted(os.listdir(dset_path)):
            if fname.endswith('.txt'):
                with open(os.path.join(dset_path, fname), encoding='utf-8') as f:
                    train_texts.append(f.read()) # load the data into memory
                label = 0 if cat == 'neg' else 1
                train_labels.append(label)
    imdbdf = pd.DataFrame(
             {'text': train_texts,
              'label': train_labels}
             )
    # shuffle the whole dataset
    imdbdf = imdbdf.sample(frac=1).reset_index(drop=True)
    # Return the dataset in dataframe format
    return imdbdf

In [5]:
df_corpus = load_imdb_dataset('../BT5153_data/imdb')
print ('Train samples shape :', df_corpus.shape[0])

Train samples shape : 25000


In [6]:
# 1 denotes positive and 0 is negative
print(df_corpus.head())

                                                text  label
0  This movie has great stars in their earlier ye...      1
1  This is the biggest insult to TMNT ever. Fortu...      0
2  I only saw it once. This happened in 1952, I w...      1
3  Sure, for it's super imagery and awesome sound...      1
4  It's very sly for all of the 60's look to the ...      1


#### Raw Text Cleaning

In [7]:
from bs4 import BeautifulSoup 
import re

def clean_txt(raw_txt):
    # Function to clean raw text
    # 1. Remove HTML
    raw_txt = BeautifulSoup(raw_txt, "html.parser").get_text() 
    #
    # 2. Remove non-letters        
    letters_only = re.sub("[^a-zA-Z]", " ", raw_txt) 
    #
    # 3. Convert to lower case, split into individual words
    words = letters_only.lower().split()                                             
    # 
    #
    # 4. Join the words back into one string separated by space, 
    # and return the result.
    return( " ".join( words )) 

In [8]:
df_corpus['text'] = df_corpus.text.apply(clean_txt)
corpus = df_corpus.text.tolist()

In [9]:
# check the corpus type, which is a list of strings
print(type(corpus))
print(type(corpus[1]))

<class 'list'>
<class 'str'>


#### Text tokenization from Keras

The `Tokenizer` class vectorizes a text corpus by turning each text into either a sequence of integers (each integer being the index of a token in a dictionary) or a vector whose coefficient for each token can be binary, based on word count, based on tf-idf, and so on.

In [10]:
from keras.preprocessing.text import Tokenizer

Using TensorFlow backend.


In [11]:
tokenizer = Tokenizer()
# learn the vocab
tokenizer.fit_on_texts(corpus)

In [12]:
print(type(corpus[1]))
print(corpus[1])

<class 'str'>
this is the biggest insult to tmnt ever fortunantely officially venus does not exist in canon tmnt there will never be a female turtle this took away from the tragic tale of male unique mutants who will never have a family of their own once gone no more the biggest mistake was crossing over power rangers to tmnt with a horrible episode the turtle s voices were wrong and they all acted out of character they could have done such a better job better designs and animatronics and no venus don t bother with this people it s cringe worthy material the lip flap was slow and unnatural looking they totally disrespected shredder the main baddie some dragonlord dude was corny the turtles looked corny with things hanging off their bodies what s with the thing around raph s thigh the silly looking sculpted plastrons if they looked normal acted in character and got rid of venus got rid of the stupid kiddie cartoon sounds and better writing it could have been good


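The `fit_on_texts` function builds the vocabulary: it counts the words in the corpus and assigns each word an integer index, with lower indices going to more frequent words.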
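
To make the vocabulary-building step concrete, here is a minimal pure-Python sketch of the idea. It is not the Keras implementation: `build_word_index`, `to_sequences`, and `demo_texts` are hypothetical names, and the sketch only splits on whitespace, whereas the Keras `Tokenizer` also filters punctuation by default.

```python
from collections import Counter

def build_word_index(texts):
    """Map each word to an integer rank: 1 = most frequent. Index 0 is left unused."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    # sorted() is stable, so words with equal counts keep first-seen order.
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

def to_sequences(texts, word_index):
    """Replace every known word with its index; unknown words are skipped."""
    return [[word_index[w] for w in text.lower().split() if w in word_index]
            for text in texts]

demo_texts = ['king is a strong man', 'queen is a wise woman', 'man is strong']
demo_index = build_word_index(demo_texts)
print(demo_index)
print(to_sequences(['king is unknown'], demo_index))
```

In this sketch, a word that was never seen during fitting simply disappears from the sequence.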

In [13]:
# from string to a sequence of integers
# each word will be converted to its vocab index
seq_corpus = tokenizer.texts_to_sequences(corpus)
print(seq_corpus[1])

[10, 6, 1, 1103, 2334, 5, 15685, 123, 46352, 8169, 14042, 125, 23, 1752, 8, 9697, 15685, 39, 78, 113, 28, 3, 652, 7366, 10, 551, 242, 37, 1, 1545, 768, 4, 887, 932, 8479, 35, 78, 113, 27, 3, 212, 4, 66, 203, 280, 809, 57, 51, 1, 1103, 1297, 13, 5971, 118, 647, 4771, 5, 15685, 16, 3, 516, 381, 1, 7366, 12, 2295, 69, 349, 2, 32, 30, 897, 44, 4, 103, 32, 96, 27, 220, 139, 3, 127, 288, 127, 4567, 2, 24102, 2, 57, 14042, 88, 20, 1397, 16, 10, 77, 7, 12, 4011, 1493, 802, 1, 5402, 26207, 13, 538, 2, 7466, 262, 32, 477, 26208, 19868, 1, 289, 8307, 48, 28758, 2643, 13, 2004, 1, 15099, 596, 2004, 16, 181, 2317, 122, 66, 2303, 47, 12, 16, 1, 152, 185, 37400, 12, 22375, 1, 694, 262, 32282, 46353, 45, 32, 596, 1226, 897, 8, 103, 2, 187, 3707, 4, 14042, 187, 3707, 4, 1, 371, 8961, 1050, 914, 2, 127, 478, 7, 96, 27, 76, 49]


- In the following, we are going to use a toy corpus instead of the IMDB corpus for a quick demo.

In [14]:
# let us check the texts_to_sequences function
toy_corpus = ['king is a strong man', 
              'queen is a wise woman', 
              'boy is a young man',
              'girl is a young woman',
              'prince is a young king',
              'princess is a young queen',
              'man is strong', 
              'woman is pretty',
               'prince is a boy will be king',
               'princess is a girl will be queen']
tokenizer = Tokenizer()
tokenizer.fit_on_texts(toy_corpus)
toy_seq_corpus = tokenizer.texts_to_sequences(toy_corpus)
print(toy_seq_corpus[0])
print(tokenizer.word_index['king'])
print(tokenizer.word_index['is'])
print(tokenizer.word_index['a'])
print(tokenizer.word_index['strong'])

[4, 1, 2, 8, 5]
4
1
2
8


In [15]:
print(tokenizer.index_word[1])

is


In [16]:
print(tokenizer.index_word[0])

KeyError: 0

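- The KeyError shows that the Tokenizer never assigns index 0 to a word: word indices start at 1, and 0 is reserved (in Keras it is typically used for padding).
- In practice, the first embedding in the word embedding matrix is therefore kept for this reserved index, for example for padding or for unknown words or characters.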

## Part 3: Build Skip-gram Model

- Here, we use only the toy corpus for demonstration purposes.
- Target: predict the nearby words based on the center word.
<img src="word2vec-skip-gram.png" alt="skip-gram"
	title="skip-gram pic" width="250" height="150" />

In [17]:
from keras.models import Sequential
from keras.layers import Dense, Embedding, Reshape
from keras.utils import to_categorical
from keras.preprocessing import sequence
import keras.backend as K

- **For skip-gram,  training data generation**:

The input x is the center word index, and the output y is the one-hot vector of the nearby word index.
For example, suppose the corpus contains only two sentences.
```
I like apple 
I like reading books
```
1. The first step is to build a vocabulary, which can be regarded as a mapping from words to integer indices.

Here, OOV-> 0, I -> 1, like -> 2, apple -> 3, reading -> 4, books -> 5.

2. Then, we scan the corpus and create pairs of center word and nearby word. Here, we set the window size to `one`.
This gives the following pairs of input x and target y.

<pre>
words pair              numerical input       numerical output

(I, like)                       1               [0,0,1,0,0,0]

(like, I)                       2               [0,1,0,0,0,0]

(like, apple)                   2               [0,0,0,1,0,0]

(apple, like)                   3               [0,0,1,0,0,0]

(I, like)                       1               [0,0,1,0,0,0]

(like, I)                       2               [0,1,0,0,0,0]

(like, reading)                 2               [0,0,0,0,1,0]

(reading, like)                 4               [0,0,1,0,0,0]

(reading, books)                4               [0,0,0,0,0,1]

(books, reading)                5               [0,0,0,0,1,0]
</pre>
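
The pairing procedure for this two-sentence example can be sketched in plain Python as follows. `skipgram_pairs`, `one_hot`, and `example_index` are hypothetical names introduced for illustration; the notebook's own `generate_data` function works on tokenized sequences instead.

```python
def skipgram_pairs(sentences, word_index, window_size=1):
    """Return (center_index, context_index) pairs for every word in every sentence."""
    pairs = []
    for sentence in sentences:
        ids = [word_index[w] for w in sentence.split()]
        for pos, center in enumerate(ids):
            lo = max(0, pos - window_size)
            hi = min(len(ids), pos + window_size + 1)
            for ctx_pos in range(lo, hi):
                if ctx_pos != pos:
                    pairs.append((center, ids[ctx_pos]))
    return pairs

def one_hot(index, size):
    """Return a list of zeros with a single 1 at position index."""
    vec = [0] * size
    vec[index] = 1
    return vec

# Index 0 is left unused, so the one-hot vectors have 6 entries.
example_index = {'I': 1, 'like': 2, 'apple': 3, 'reading': 4, 'books': 5}
example_pairs = skipgram_pairs(['I like apple', 'I like reading books'], example_index)
for center, context in example_pairs:
    print(center, one_hot(context, size=6))
```

Each printed line corresponds to one (numerical input, numerical output) row for the example sentences.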

In [18]:
def generate_data(corpus, window_size, V):
    """
    corpus is a list of sequences of word indices
    window_size is the context size that defines 'nearby' words
    V is the vocab size
    """
    labels = []
    in_words = []
    for words in corpus:
        L = len(words)
        for index, word in enumerate(words):
            s = index - window_size
            e = index + window_size + 1
            for i in range(s, e):
                if 0<= i < L and i != index:
                    in_words.append([word])
                    labels.append(to_categorical(words[i], V))
    return (in_words, labels)   

In [19]:
# plus one accounts for the reserved index 0
V = len(tokenizer.word_index) + 1
dim = 5
window_size = 4
ith = 0
input_x, target_y =  generate_data(toy_seq_corpus, window_size, V)
input_x           = np.array(input_x,dtype=np.int32)
target_y          = np.array(target_y,dtype=np.int32)

In [20]:
print('check the first pair of input and output')
print(input_x[0])
print(target_y[0])
print('check the third pair of input and output')
print(input_x[2])
print(target_y[2])

check the first pair of input and output
[4]
[0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
check the third pair of input and output
[4]
[0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0]


In [21]:
print(toy_seq_corpus[0])

[4, 1, 2, 8, 5]


- **Model Config**

It consists of two layers:

1. The first layer is an embedding layer, which performs a lookup: given a word index as input, it returns the corresponding vector.

2. The second layer is a softmax layer, which outputs a probability distribution over the vocabulary.
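
As a rough illustration of what these two layers compute, here is a NumPy sketch of a single forward pass. The weights are random stand-ins rather than trained values, and `embedding_matrix`, `dense_kernel`, `dense_bias`, and `forward` are hypothetical names, not Keras internals.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim = 17, 5   # same sizes as V and dim in this notebook

# Randomly initialised stand-ins for the trainable weights.
embedding_matrix = rng.normal(size=(vocab_size, embed_dim))
dense_kernel = rng.normal(size=(embed_dim, vocab_size))
dense_bias = np.zeros(vocab_size)

def softmax(z):
    z = z - z.max()          # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def forward(center_index):
    """Embedding lookup followed by a softmax over the vocabulary."""
    hidden = embedding_matrix[center_index]        # shape (embed_dim,)
    logits = hidden @ dense_kernel + dense_bias    # shape (vocab_size,)
    return softmax(logits)

probs = forward(4)
print(probs.shape, probs.sum())
```

Training adjusts both weight matrices so that the probability of the observed nearby word goes up; the rows of the embedding matrix are the word vectors we keep afterwards.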

- **Embeddings Layer**:

Turns positive integers (indexes) into dense vectors of fixed size. eg. [[4], [20]] -> [[0.25, 0.1], [0.6, -0.2]]

This layer can only be used as the first layer in a model.

1. input_dim: int > 0. Size of the vocabulary, i.e. maximum integer index + 1.
2. output_dim: int >= 0. Dimension of the dense embedding.
3. embeddings_initializer: Initializer for the embeddings matrix (see initializers).
4. input_length: Length of input sequences, when it is constant. This argument is required if you are going to connect  Flatten then Dense layers upstream (without it, the shape of the dense outputs cannot be computed).
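
One way to think about the lookup is that indexing a row of the embedding matrix gives the same result as multiplying a one-hot vector by that matrix. The NumPy sketch below uses a made-up 4-word, 3-dimensional `lookup_table` purely for illustration.

```python
import numpy as np

# Hypothetical embedding matrix: 4 vocabulary indices, 3-dimensional vectors.
lookup_table = np.arange(12, dtype=float).reshape(4, 3)
word_id = 2

one_hot_row = np.zeros(4)
one_hot_row[word_id] = 1.0

by_indexing = lookup_table[word_id]        # direct row lookup
by_matmul = one_hot_row @ lookup_table     # one-hot vector times the matrix
print(by_indexing, by_matmul)

# A batch of shape (2, 1) of indices gives an output of shape (2, 1, 3),
# which mirrors an output shape of (batch, input_length, output_dim).
batch = np.array([[0], [3]])
print(lookup_table[batch].shape)
```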

### Model Building

In [46]:
skipgram = Sequential()
skipgram.add(Embedding(input_dim=V, output_dim=dim, embeddings_initializer='glorot_uniform', input_length=1))
skipgram.add(Reshape((dim, )))
skipgram.add(Dense(V, activation='softmax'))



In [47]:
print(skipgram.summary())

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_4 (Embedding)      (None, 1, 5)              85        
_________________________________________________________________
reshape_4 (Reshape)          (None, 5)                 0         
_________________________________________________________________
dense_4 (Dense)              (None, 17)                102       
Total params: 187
Trainable params: 187
Non-trainable params: 0
_________________________________________________________________
None


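The embedding layer has 85 = 17*5 parameters: one 5-dimensional vector for each of the 17 vocabulary indices.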
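
The other numbers in the summary follow from the same arithmetic. A short check, where `vocab_size` and `embed_dim` are stand-ins for `V` and `dim`:

```python
vocab_size, embed_dim = 17, 5

embedding_params = vocab_size * embed_dim            # one embed_dim vector per index
dense_params = embed_dim * vocab_size + vocab_size   # kernel plus one bias per output unit
total_params = embedding_params + dense_params
print(embedding_params, dense_params, total_params)  # 85 102 187
```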

In [48]:
skipgram.compile(loss='categorical_crossentropy', optimizer="adadelta")

In [49]:
skipgram.fit(input_x, target_y, batch_size=8, epochs=30)

Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30


<keras.callbacks.History at 0x1af3525208>

- **How to save the learned word vectors**

In [50]:
f = open('vectors.txt' ,'w')
f.write('{} {}\n'.format(V-1, dim))

5

In [51]:
vectors = skipgram.get_weights()[0]
for word, i in tokenizer.word_index.items():
    str_vec = ' '.join(map(str, list(vectors[i, :])))
    f.write('{} {}\n'.format(word, str_vec))
f.close()
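
As an alternative sketch, the same text format can be written and read back with `with` blocks, which close the file even if an error occurs along the way. The helper names `save_word2vec_text` and `load_word2vec_text`, the demo words, and the demo values are assumptions for illustration; they are not part of Keras or gensim.

```python
import os
import tempfile
import numpy as np

def save_word2vec_text(path, words, matrix):
    """Write a header line 'count dim', then one 'word v1 v2 ...' line per word."""
    with open(path, 'w', encoding='utf-8') as out:
        out.write('{} {}\n'.format(len(words), matrix.shape[1]))
        for word, row in zip(words, matrix):
            out.write('{} {}\n'.format(word, ' '.join(str(x) for x in row)))

def load_word2vec_text(path):
    """Parse the same format back into a {word: vector} dictionary."""
    with open(path, encoding='utf-8') as src:
        count, dim = map(int, src.readline().split())
        table = {}
        for line in src:
            parts = line.split()
            table[parts[0]] = np.array([float(x) for x in parts[1:]])
    # Sanity check: the header must agree with what was actually read.
    assert len(table) == count and all(v.shape == (dim,) for v in table.values())
    return table

demo_words = ['king', 'queen']
demo_matrix = np.array([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]])
demo_path = os.path.join(tempfile.mkdtemp(), 'demo_vectors.txt')
save_word2vec_text(demo_path, demo_words, demo_matrix)
loaded = load_word2vec_text(demo_path)
print(loaded['queen'])
```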

- **the saved format for word vectors**
<img src="saved_format.jpg" alt="saved format"
	title="saved format" width="550" height="450" />

The first line of the file is the header: vocab = 16, dim = 5 (vocabulary size and vector dimension).

- **we can use gensim**

Gensim is a production-ready open-source library for NLP problems.

https://radimrehurek.com/gensim/index.html


In [52]:
from gensim.models import KeyedVectors
w2v = KeyedVectors.load_word2vec_format('./vectors.txt', binary=False)

In [53]:
w2v.most_similar(positive=['man'])

[('strong', 0.9319900870323181),
 ('boy', 0.8408138751983643),
 ('is', 0.6083444356918335),
 ('queen', 0.5916460752487183),
 ('will', 0.44901859760284424),
 ('girl', 0.4101567566394806),
 ('wise', 0.3264765441417694),
 ('young', 0.2976924777030945),
 ('pretty', 0.2894490361213684),
 ('a', 0.22375084459781647)]

## Part 4: Build CBoW Model

- CBoW's target is to predict the center word from its context words.
<img src="word2vec-cbow.png" alt="cbow"
	title="cbow pic" width="250" height="150" />

- **For cbow,  training data generation**:

The input x is the list of context word indices, and the output y is the one-hot vector of the center word.
For example, suppose the corpus contains only two sentences.
```
I like apple 
I like reading books
```
1. The first step is to build a vocabulary, which can be regarded as a mapping from words to integer indices.
Here, OOV -> 0, I -> 1, like -> 2, apple -> 3, reading -> 4, books -> 5.

2. Then, we scan the corpus and create pairs of a list of nearby words and the center word. Here, we set the window size to `one`.
This gives the following pairs of input x and target y.

<pre>
words pair                     numerical input       numerical output

([like], I)                        [2]                 [0,1,0,0,0,0]

([I, apple], like)                 [1,3]               [0,0,1,0,0,0]

([like], apple)                    [2]                 [0,0,0,1,0,0]

([like], I)                        [2]                 [0,1,0,0,0,0]
 
([I, reading], like)               [1,4]               [0,0,1,0,0,0]

([like, books], reading)           [2,5]               [0,0,0,0,1,0]

([reading], books)                 [4]                 [0,0,0,0,0,1]
</pre>

3. Finally, some contexts are shorter than the full window. For example, the first pair's numerical input has only one word index instead of two. We pad the short inputs so that all input data have the same length.
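
The three steps above can be sketched in plain Python for the two-sentence example. `cbow_samples` and `example_index` are hypothetical names; the sketch pads on the left with 0, which is an assumption chosen to match the default behaviour of `pad_sequences`.

```python
def cbow_samples(sentences, word_index, window_size=1):
    """Return (padded_context_indices, center_index) for every word."""
    maxlen = 2 * window_size
    samples = []
    for sentence in sentences:
        ids = [word_index[w] for w in sentence.split()]
        for pos, center in enumerate(ids):
            context = [ids[i]
                       for i in range(pos - window_size, pos + window_size + 1)
                       if 0 <= i < len(ids) and i != pos]
            # Left-pad with 0 so every context has exactly maxlen entries.
            padded = [0] * (maxlen - len(context)) + context
            samples.append((padded, center))
    return samples

example_index = {'I': 1, 'like': 2, 'apple': 3, 'reading': 4, 'books': 5}
example_samples = cbow_samples(['I like apple', 'I like reading books'], example_index)
for context, center in example_samples:
    print(context, center)
```

Words at the start or end of a sentence get a padded context such as `[0, 2]`, while words in the middle get a full one such as `[1, 3]`.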

- **Prepare the training and labels**

In [54]:
from keras.preprocessing import sequence
import keras.backend as K
def generate_data(corpus, window_size, V):
    """
    corpus is a list of sequences of word indices
    window_size is the context size that defines 'nearby' words
    V is the vocab size
    """
    context_words   = []
    center_words    = []
    maxlen = window_size*2
    for words in corpus:
        L = len(words)
        for index, word in enumerate(words):
            contexts = []
            labels   = []            
            s = index - window_size
            e = index + window_size + 1
            contexts.append([words[i] for i in range(s, e) if 0 <= i < L and i != index])
            labels.append(word)           
            x = sequence.pad_sequences(contexts, maxlen=maxlen)
            y = to_categorical(labels, V)
            context_words.append(x)
            center_words.append(y)
    return context_words, center_words

In [55]:
ith = 0
input_x, target_y = generate_data(toy_seq_corpus, window_size, V)

In [56]:
input_x = np.array(input_x)
print(input_x.shape)
input_x = np.squeeze(input_x)  # squeeze out the second dimension, which has size one
print(input_x.shape)
target_y = np.array(target_y)
target_y = np.squeeze(target_y)

(50, 1, 8)
(50, 8)


In [57]:
print(input_x[0])
print(target_y[0])

[0 0 0 0 1 2 8 5]
[0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]


In [58]:
print(toy_seq_corpus[0])

[4, 1, 2, 8, 5]


In [59]:
cbow = Sequential()

- **Lambda Layer**:

Wraps an arbitrary expression as a Layer object.

1. function: The function to be evaluated. It takes the input tensor as its first argument and is usually built from backend operations.
2. output_shape: Expected output shape from the function.
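
The CBoW model in this notebook uses a Lambda layer to average the context embeddings along axis 1. A NumPy sketch of that reduction, on a made-up batch (the `context_embeddings` array is hypothetical):

```python
import numpy as np

# Hypothetical batch: 2 samples, 4 context positions, 3-dimensional embeddings.
context_embeddings = np.arange(24, dtype=float).reshape(2, 4, 3)

pooled = context_embeddings.mean(axis=1)   # average over the context positions
print(context_embeddings.shape, '->', pooled.shape)
print(pooled[0])
```

Note that a plain mean treats every position equally, so the embedding of the padding index 0 also contributes to the average for short contexts.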

In [60]:
from keras.layers import Lambda
cbow.add(Embedding(input_dim=V, output_dim=dim, input_length=window_size*2))
# average all context embeddings
cbow.add(Lambda(lambda x: K.mean(x, axis=1), output_shape=(dim,)))
## add softmax layer
cbow.add(Dense(V, activation='softmax'))

In [61]:
print(cbow.summary())

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_5 (Embedding)      (None, 8, 5)              85        
_________________________________________________________________
lambda_1 (Lambda)            (None, 5)                 0         
_________________________________________________________________
dense_5 (Dense)              (None, 17)                102       
Total params: 187
Trainable params: 187
Non-trainable params: 0
_________________________________________________________________
None


In [62]:
cbow.compile(loss='categorical_crossentropy', optimizer='adam')
# Train the model, iterating on the data in batches of 8 samples
cbow.fit(input_x, target_y, batch_size=8, epochs=10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x1af34f5748>

## Part 5: Memory-friendly Data Generation

- Here, we modify the data generation function of skip-gram.
- `yield`: a function that uses `yield` returns a generator. Generators do not store all their values in memory; they produce one value per iteration.

Sample code (a generator expression, which is consumed the same way):
```
generator = (x * x for x in range(3))
for i in generator:
    print(i)
```
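
The sample above uses a generator expression. A function-based version with `yield`, closer in spirit to `generate_data_live`, might look like this (`squares` is a made-up example function):

```python
def squares(limit):
    """Yield one square at a time instead of building the whole list."""
    for x in range(limit):
        yield x * x

gen = squares(3)
first = next(gen)        # runs the function body up to the first yield
remaining = list(gen)    # consumes the values that are left
print(first, remaining)
```

Once a generator is exhausted it yields nothing more, which is why the training loop calls the generator function again at the start of every epoch.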

https://stackoverflow.com/questions/231767/what-does-the-yield-keyword-do


In [63]:
def generate_data_live(corpus, window_size, V):
    """
    corpus is a list of sequences of word indices
    window_size is the context size that defines 'nearby' words
    V is the vocab size
    Yields one (x, y) batch per sentence instead of building the full dataset.
    """
    for words in corpus:
        labels   = []
        in_words = [] 
        L = len(words)
        for index, word in enumerate(words):
            s = index - window_size
            e = index + window_size + 1
            for i in range(s, e):
                if 0<= i < L and i != index:
                    in_words.append([word])
                    labels.append(words[i])
        x = np.array(in_words,dtype=np.int32)
        y = to_categorical(labels, V)
        yield (x, y)  # hand back one sentence's batch at a time; the full dataset is never stored


In [64]:
## re-create and compile skipgram here to train from scratch; otherwise training continues from the weights fitted above

for ite in range(50):
    loss = 0.
    for x, y in generate_data_live(toy_seq_corpus, window_size, V):
        # train_on_batch runs a single gradient update on the given samples, whatever their number
        loss += skipgram.train_on_batch(x, y)
    print(ite, loss)

0 25.75001883506775
1 25.72290015220642
2 25.696035861968994
3 25.669301748275757
4 25.642654418945312
5 25.61606216430664
6 25.589520692825317
7 25.563032627105713
8 25.53661823272705
9 25.510297298431396
10 25.484103202819824
11 25.458067655563354
12 25.43221688270569
13 25.406575918197632
14 25.38117027282715
15 25.356019258499146
16 25.33113932609558
17 25.30654764175415
18 25.28225612640381
19 25.258276224136353
20 25.23461675643921
21 25.211283445358276
22 25.18828320503235
23 25.16561770439148
24 25.143290042877197
25 25.12130117416382
26 25.09965229034424
27 25.07834005355835
28 25.057363033294678
29 25.036719799041748
30 25.016404151916504
31 24.996421575546265
32 24.976744651794434
33 24.957393884658813
34 24.938352584838867
35 24.919610023498535
36 24.90116810798645
37 24.88301706314087
38 24.865150690078735
39 24.84756088256836
40 24.830241680145264
41 24.813185691833496
42 24.796387672424316
43 24.779839038848877
44 24.763533353805542
45 24.747464418411255
46 24.7316238880