## Skip-gram: feed in the centre word and predict the context words


**X is a single centre word; y is the set of its context words**
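
As a minimal illustration of this X/y layout, the sketch below builds (centre word, context words) pairs from a toy sentence. The function name `skipgram_pairs` and the sample sentence are hypothetical and not part of the notebook; the window is assumed to be symmetric, with `window_size // 2` words on each side.

```python
def skipgram_pairs(tokens, window_size=5):
    """Return a list of (centre, [context words]) tuples.

    The window is symmetric: window_size // 2 words on each side,
    clipped at the start and end of the token list.
    """
    half = window_size // 2
    pairs = []
    for pos, centre in enumerate(tokens):
        left = tokens[max(0, pos - half):pos]
        right = tokens[pos + 1:pos + 1 + half]
        pairs.append((centre, left + right))
    return pairs


tokens = "the quick brown fox jumps".split()
for centre, context in skipgram_pairs(tokens):
    print(centre, "->", context)
```

Words at either end of the text get fewer context words, which is why the notebook later pads every context list to a common length.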

**Importing libraries**

In [1]:
import pandas as pd
import numpy as np
import pickle
from tqdm import tqdm
import nltk
from nltk.corpus import stopwords
from keras.preprocessing.text import Tokenizer,text_to_word_sequence
from keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split

from keras.models import Sequential
from keras.layers import Dense, Dropout, Input
from keras.layers import add, dot, concatenate, Permute
from keras.callbacks import ModelCheckpoint

from numpy import argmax
from keras.utils import to_categorical


Using TensorFlow backend.


In [2]:
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /home/swati/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /home/swati/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [3]:
class FirstWordException(Exception):
    '''raised when first word is encountered'''
    pass 

**getting data**

In [4]:
with open('/home/swati/Word2Vec/data/text8', 'r') as f:
    wikipedia=f.read()

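# Note: this rebinds the name `stopwords` from the nltk corpus reader to a plain set,
# so the cell cannot be re-run without re-importing nltk.corpus.stopwords first.
stopwords=set(stopwords.words('english')+['as', 'a', 'b', 'of', ''])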

In [5]:
wiki = nltk.word_tokenize(wikipedia)

**cleaning data**

In [28]:
# wiki=[x.lower() for x in wiki if x not in stopwords and len(x)>2]
wiki=[x.lower() for x in wiki if x not in stopwords and len(x)>=2]
words=wiki[:1000]

# words=[x for x in words if len(x)>2]

vocab_len=len(set(words))
vocab_len

561

**creating dictionary for vocabulary**

In [17]:
vocabulary=set(words)
word_index=dict((c, i + 1) for i, c in enumerate(vocabulary))

In [24]:
def oneHotEncode(word):
    word_vec=[0 for i in range(len(vocabulary))]
    word_ind = word_index[word]-1
    word_vec[word_ind] = 1
    return word_vec
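
The same encoding can be sketched with NumPy, together with the inverse lookup that turns a one-hot vector back into a word. The toy vocabulary and the names `one_hot`, `decode` and `index_word` are assumptions for illustration; the 1-based `word_index` mirrors the convention used above.

```python
import numpy as np

vocabulary = ["anarchism", "originated", "term"]  # toy vocabulary (hypothetical)
word_index = {w: i + 1 for i, w in enumerate(vocabulary)}  # 1-based, as above
index_word = {i: w for w, i in word_index.items()}


def one_hot(word):
    """Return a vector of zeros with a single 1 at the word's slot."""
    vec = np.zeros(len(vocabulary), dtype=np.int8)
    vec[word_index[word] - 1] = 1  # shift the 1-based index to a 0-based slot
    return vec


def decode(vec):
    """Map a one-hot vector back to its word."""
    return index_word[int(np.argmax(vec)) + 1]


print(one_hot("originated"))
print(decode(one_hot("originated")))
```

A list is used for the toy vocabulary so the slot order is deterministic; with a `set`, as in the notebook, the order can differ between runs.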

**creating pairs of neighbouring words**

In [133]:
def create_pairs(words, corpus, window_size=5):
    # For every position in `words`, build [one-hot centre word, list of one-hot
    # context words], taking window_size // 2 neighbours on each side.
    # Assumes `words` is a prefix of `corpus` (the call below passes the same list).
    half_window = window_size // 2
    training_data = []
    for word_ind, word in enumerate(words):
        # Use the running position: corpus.index(word) always returns the first
        # occurrence, which gives repeated words the wrong context.
        w_target = oneHotEncode(word)
        w_context = []
        for i in range(1, half_window + 1):
            # Forward neighbour first, then backward, as in the original loop.
            for j in (word_ind + i, word_ind - i):
                # Skip positions outside the corpus and out-of-vocabulary words.
                if 0 <= j < len(corpus) and corpus[j] in word_index:
                    w_context.append(oneHotEncode(corpus[j]))
        training_data.append([w_target, w_context])

    # Context lists are ragged until they are padded, so keep an object array.
    return np.array(training_data, dtype=object)

In [134]:
training_data=create_pairs(words, words, 5)

In [135]:
training_data.shape

(1000, 2)

In [137]:
max_c_words=max([len(training_data[i][1]) for i in range(training_data.shape[0])])

In [138]:
max_c_words

4

**Padding each context list to the maximum context length**

In [139]:
word_vec=[0 for i in range(len(vocabulary))]
for w_t, w_c in training_data:
    if(len(w_c)<max_c_words):
        for i in range(max_c_words-len(w_c)):
            w_c.append(word_vec)
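
The padding step can also be written as a small helper that returns a fixed-size array instead of mutating the lists in place. This is a sketch under the assumption that every context vector has the same length; `pad_context` is a hypothetical name, not something defined in the notebook.

```python
import numpy as np


def pad_context(context_vectors, max_len, vec_len):
    """Stack context vectors into a (max_len, vec_len) array, zero-padded."""
    padded = np.zeros((max_len, vec_len), dtype=np.int8)
    for row, vec in enumerate(context_vectors[:max_len]):
        padded[row] = vec
    return padded


# A word at the very start of the corpus has only two context words.
context = [[0, 1, 0], [0, 0, 1]]
print(pad_context(context, max_len=4, vec_len=3))
```

The all-zero rows act as padding slots: they do not correspond to any word in the vocabulary.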

In [140]:
len(training_data[0][1])

4

**Splitting training and test data (skip-gram: X = centre word, y = context words)**

In [141]:
X=[]
y=[]
for w_t, w_c in training_data:
    X.append(w_t)
    y.append(w_c)

In [147]:
X=np.array(X)
y=np.array(y).reshape(1000,-1)
X.shape, y.shape

((1000, 561), (1000, 2244))
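
Because each row of `y` is four one-hot vectors laid end to end (4 × 561 = 2244), a row can be split back into context words by reshaping it. A minimal sketch, assuming a toy vocabulary of three words and a hypothetical 0-based `index_word` lookup; all-zero slots are treated as padding.

```python
import numpy as np

index_word = {0: "anarchism", 1: "originated", 2: "term"}  # hypothetical lookup
vocab_len, max_c_words = 3, 4


def decode_context(flat_row):
    """Split a flattened row into one-hot slots and return the words."""
    slots = np.asarray(flat_row).reshape(max_c_words, vocab_len)
    # Skip padding slots, which contain no 1 at all.
    return [index_word[int(slot.argmax())] for slot in slots if slot.any()]


row = [0, 1, 0,  0, 0, 1,  0, 0, 0,  0, 0, 0]  # two context words + two pads
print(decode_context(row))
```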

In [148]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

**Checking the shapes after the split**

In [150]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((670, 561), (330, 561), (670, 2244), (330, 2244))

In [166]:
model = Sequential()
model.add(Dense(200, input_dim=X_train.shape[1], activation='relu'))
model.add(Dense(130, activation='relu'))
model.add(Dense(15, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(y_train.shape[1], activation='sigmoid'))

In [167]:
model.compile(optimizer='rmsprop', loss='mse', metrics=['accuracy'])

In [173]:
filename = 'model.h5'
checkpoint = ModelCheckpoint(filename, monitor='val_loss', verbose=1, save_best_only=True, mode='min')



hist = model.fit(X_train, y_train, epochs=500, batch_size=64,
                 validation_data=(X_test, y_test), callbacks=[checkpoint])

Train on 670 samples, validate on 330 samples
Epoch 1/500

Epoch 00001: val_loss improved from inf to 0.00176, saving model to model.h5
Epoch 2/500

Epoch 00002: val_loss improved from 0.00176 to 0.00176, saving model to model.h5
Epoch 3/500

Epoch 00003: val_loss improved from 0.00176 to 0.00176, saving model to model.h5
Epoch 4/500

Epoch 00004: val_loss improved from 0.00176 to 0.00176, saving model to model.h5
Epoch 5/500

Epoch 00005: val_loss improved from 0.00176 to 0.00176, saving model to model.h5
Epoch 6/500

Epoch 00006: val_loss improved from 0.00176 to 0.00176, saving model to model.h5
Epoch 7/500

Epoch 00007: val_loss improved from 0.00176 to 0.00176, saving model to model.h5
Epoch 8/500

Epoch 00008: val_loss improved from 0.00176 to 0.00176, saving model to model.h5
Epoch 9/500

Epoch 00009: val_loss did not improve from 0.00176
Epoch 10/500

Epoch 00010: val_loss improved from 0.00176 to 0.00176, saving model to model.h5
Epoch 11/500

...


Epoch 00040: val_loss did not improve from 0.00175
Epoch 41/500

Epoch 00041: val_loss did not improve from 0.00175
Epoch 42/500

Epoch 00042: val_loss did not improve from 0.00175
Epoch 43/500

Epoch 00043: val_loss improved from 0.00175 to 0.00175, saving model to model.h5
Epoch 44/500

Epoch 00044: val_loss improved from 0.00175 to 0.00174, saving model to model.h5
Epoch 45/500

Epoch 00045: val_loss improved from 0.00174 to 0.00174, saving model to model.h5
Epoch 46/500

Epoch 00046: val_loss improved from 0.00174 to 0.00174, saving model to model.h5
Epoch 47/500

Epoch 00047: val_loss improved from 0.00174 to 0.00174, saving model to model.h5
Epoch 48/500

Epoch 00048: val_loss improved from 0.00174 to 0.00174, saving model to model.h5
Epoch 49/500

Epoch 00049: val_loss improved from 0.00174 to 0.00174, saving model to model.h5
Epoch 50/500

Epoch 00050: val_loss improved from 0.00174 to 0.00174, saving model to model.h5
Epoch 51/500

...


Epoch 00078: val_loss improved from 0.00170 to 0.00170, saving model to model.h5
Epoch 79/500

Epoch 00079: val_loss did not improve from 0.00170
Epoch 80/500

Epoch 00080: val_loss did not improve from 0.00170
Epoch 81/500

Epoch 00081: val_loss did not improve from 0.00170
Epoch 82/500

Epoch 00082: val_loss did not improve from 0.00170
Epoch 83/500

Epoch 00083: val_loss did not improve from 0.00170
Epoch 84/500

Epoch 00084: val_loss did not improve from 0.00170
Epoch 85/500

Epoch 00085: val_loss did not improve from 0.00170
Epoch 86/500

Epoch 00086: val_loss did not improve from 0.00170
Epoch 87/500

Epoch 00087: val_loss improved from 0.00170 to 0.00170, saving model to model.h5
Epoch 88/500

Epoch 00088: val_loss improved from 0.00170 to 0.00170, saving model to model.h5
Epoch 89/500

Epoch 00089: val_loss did not improve from 0.00170
Epoch 90/500

Epoch 00090: val_loss did not improve from 0.00170
Epoch 91/500

Epoch 00091: val_loss did not improve from 0.00170
Epoch 92/500

...



Epoch 00119: val_loss did not improve from 0.00169
Epoch 120/500

Epoch 00120: val_loss did not improve from 0.00169
Epoch 121/500

Epoch 00121: val_loss improved from 0.00169 to 0.00169, saving model to model.h5
Epoch 122/500

Epoch 00122: val_loss improved from 0.00169 to 0.00169, saving model to model.h5
Epoch 123/500

Epoch 00123: val_loss improved from 0.00169 to 0.00169, saving model to model.h5
Epoch 124/500

Epoch 00124: val_loss did not improve from 0.00169
Epoch 125/500

Epoch 00125: val_loss did not improve from 0.00169
Epoch 126/500

Epoch 00126: val_loss improved from 0.00169 to 0.00169, saving model to model.h5
Epoch 127/500

Epoch 00127: val_loss did not improve from 0.00169
Epoch 128/500

Epoch 00128: val_loss improved from 0.00169 to 0.00169, saving model to model.h5
Epoch 129/500

Epoch 00129: val_loss improved from 0.00169 to 0.00169, saving model to model.h5
Epoch 130/500

Epoch 00130: val_loss improved from 0.00169 to 0.00169, saving model to model.h5
Epoch 131/500

...


Epoch 00159: val_loss improved from 0.00168 to 0.00168, saving model to model.h5
Epoch 160/500

Epoch 00160: val_loss did not improve from 0.00168
Epoch 161/500

Epoch 00161: val_loss improved from 0.00168 to 0.00168, saving model to model.h5
Epoch 162/500

Epoch 00162: val_loss improved from 0.00168 to 0.00168, saving model to model.h5
Epoch 163/500

Epoch 00163: val_loss did not improve from 0.00168
Epoch 164/500

Epoch 00164: val_loss did not improve from 0.00168
Epoch 165/500

Epoch 00165: val_loss improved from 0.00168 to 0.00168, saving model to model.h5
Epoch 166/500

Epoch 00166: val_loss improved from 0.00168 to 0.00168, saving model to model.h5
Epoch 167/500

Epoch 00167: val_loss improved from 0.00168 to 0.00168, saving model to model.h5
Epoch 168/500

Epoch 00168: val_loss improved from 0.00168 to 0.00168, saving model to model.h5
Epoch 169/500

Epoch 00169: val_loss did not improve from 0.00168
Epoch 170/500

Epoch 00170: val_loss did not improve from 0.00168
Epoch 171/500

...


Epoch 00198: val_loss did not improve from 0.00168
Epoch 199/500

Epoch 00199: val_loss improved from 0.00168 to 0.00168, saving model to model.h5
Epoch 200/500

Epoch 00200: val_loss did not improve from 0.00168
Epoch 201/500

Epoch 00201: val_loss did not improve from 0.00168
Epoch 202/500

Epoch 00202: val_loss did not improve from 0.00168
Epoch 203/500

Epoch 00203: val_loss improved from 0.00168 to 0.00168, saving model to model.h5
Epoch 204/500

Epoch 00204: val_loss did not improve from 0.00168
Epoch 205/500

Epoch 00205: val_loss improved from 0.00168 to 0.00168, saving model to model.h5
Epoch 206/500

Epoch 00206: val_loss did not improve from 0.00168
Epoch 207/500

Epoch 00207: val_loss improved from 0.00168 to 0.00167, saving model to model.h5
Epoch 208/500

Epoch 00208: val_loss improved from 0.00167 to 0.00167, saving model to model.h5
Epoch 209/500

Epoch 00209: val_loss improved from 0.00167 to 0.00167, saving model to model.h5
Epoch 210/500

...

Epoch 238/500

Epoch 00238: val_loss improved from 0.00167 to 0.00167, saving model to model.h5
Epoch 239/500

Epoch 00239: val_loss did not improve from 0.00167
Epoch 240/500

Epoch 00240: val_loss did not improve from 0.00167
Epoch 241/500

Epoch 00241: val_loss did not improve from 0.00167
Epoch 242/500

Epoch 00242: val_loss did not improve from 0.00167
Epoch 243/500

Epoch 00243: val_loss did not improve from 0.00167
Epoch 244/500

Epoch 00244: val_loss did not improve from 0.00167
Epoch 245/500

Epoch 00245: val_loss did not improve from 0.00167
Epoch 246/500

Epoch 00246: val_loss did not improve from 0.00167
Epoch 247/500

Epoch 00247: val_loss did not improve from 0.00167
Epoch 248/500

Epoch 00248: val_loss did not improve from 0.00167
Epoch 249/500

Epoch 00249: val_loss did not improve from 0.00167
Epoch 250/500

Epoch 00250: val_loss did not improve from 0.00167
Epoch 251/500

Epoch 00251: val_loss did not improve from 0.00167
Epoch 252/500

...


Epoch 00279: val_loss did not improve from 0.00167
Epoch 280/500

Epoch 00280: val_loss did not improve from 0.00167
Epoch 281/500

Epoch 00281: val_loss did not improve from 0.00167
Epoch 282/500

Epoch 00282: val_loss did not improve from 0.00167
Epoch 283/500

Epoch 00283: val_loss did not improve from 0.00167
Epoch 284/500

Epoch 00284: val_loss did not improve from 0.00167
Epoch 285/500

Epoch 00285: val_loss did not improve from 0.00167
Epoch 286/500

Epoch 00286: val_loss did not improve from 0.00167
Epoch 287/500

Epoch 00287: val_loss did not improve from 0.00167
Epoch 288/500

Epoch 00288: val_loss did not improve from 0.00167
Epoch 289/500

Epoch 00289: val_loss did not improve from 0.00167
Epoch 290/500

Epoch 00290: val_loss did not improve from 0.00167
Epoch 291/500

Epoch 00291: val_loss did not improve from 0.00167
Epoch 292/500

Epoch 00292: val_loss did not improve from 0.00167
Epoch 293/500

Epoch 00293: val_loss did not improve from 0.00167
Epoch 294/500

...


Epoch 00321: val_loss did not improve from 0.00167
Epoch 322/500

Epoch 00322: val_loss did not improve from 0.00167
Epoch 323/500

Epoch 00323: val_loss did not improve from 0.00167
Epoch 324/500

Epoch 00324: val_loss did not improve from 0.00167
Epoch 325/500

Epoch 00325: val_loss did not improve from 0.00167
Epoch 326/500

Epoch 00326: val_loss did not improve from 0.00167
Epoch 327/500

Epoch 00327: val_loss did not improve from 0.00167
Epoch 328/500

Epoch 00328: val_loss did not improve from 0.00167
Epoch 329/500

Epoch 00329: val_loss did not improve from 0.00167
Epoch 330/500

Epoch 00330: val_loss did not improve from 0.00167
Epoch 331/500

Epoch 00331: val_loss did not improve from 0.00167
Epoch 332/500

Epoch 00332: val_loss did not improve from 0.00167
Epoch 333/500

Epoch 00333: val_loss did not improve from 0.00167
Epoch 334/500

Epoch 00334: val_loss did not improve from 0.00167
Epoch 335/500

Epoch 00335: val_loss did not improve from 0.00167
Epoch 336/500

...


Epoch 00363: val_loss did not improve from 0.00167
Epoch 364/500

Epoch 00364: val_loss did not improve from 0.00167
Epoch 365/500

Epoch 00365: val_loss did not improve from 0.00167
Epoch 366/500

Epoch 00366: val_loss did not improve from 0.00167
Epoch 367/500

Epoch 00367: val_loss did not improve from 0.00167
Epoch 368/500

Epoch 00368: val_loss did not improve from 0.00167
Epoch 369/500

Epoch 00369: val_loss did not improve from 0.00167
Epoch 370/500

Epoch 00370: val_loss did not improve from 0.00167
Epoch 371/500

Epoch 00371: val_loss did not improve from 0.00167
Epoch 372/500

Epoch 00372: val_loss did not improve from 0.00167
Epoch 373/500

Epoch 00373: val_loss did not improve from 0.00167
Epoch 374/500

Epoch 00374: val_loss did not improve from 0.00167
Epoch 375/500

Epoch 00375: val_loss did not improve from 0.00167
Epoch 376/500

Epoch 00376: val_loss did not improve from 0.00167
Epoch 377/500

Epoch 00377: val_loss did not improve from 0.00167
Epoch 378/500

...


Epoch 00405: val_loss did not improve from 0.00167
Epoch 406/500

Epoch 00406: val_loss did not improve from 0.00167
Epoch 407/500

Epoch 00407: val_loss did not improve from 0.00167
Epoch 408/500

Epoch 00408: val_loss did not improve from 0.00167
Epoch 409/500

Epoch 00409: val_loss did not improve from 0.00167
Epoch 410/500

Epoch 00410: val_loss did not improve from 0.00167
Epoch 411/500

Epoch 00411: val_loss did not improve from 0.00167
Epoch 412/500

Epoch 00412: val_loss did not improve from 0.00167
Epoch 413/500

Epoch 00413: val_loss did not improve from 0.00167
Epoch 414/500

Epoch 00414: val_loss did not improve from 0.00167
Epoch 415/500

Epoch 00415: val_loss did not improve from 0.00167
Epoch 416/500

Epoch 00416: val_loss did not improve from 0.00167
Epoch 417/500

Epoch 00417: val_loss did not improve from 0.00167
Epoch 418/500

Epoch 00418: val_loss did not improve from 0.00167
Epoch 419/500

Epoch 00419: val_loss did not improve from 0.00167
Epoch 420/500

...


Epoch 00447: val_loss did not improve from 0.00167
Epoch 448/500

Epoch 00448: val_loss did not improve from 0.00167
Epoch 449/500

Epoch 00449: val_loss did not improve from 0.00167
Epoch 450/500

Epoch 00450: val_loss did not improve from 0.00167
Epoch 451/500

Epoch 00451: val_loss did not improve from 0.00167
Epoch 452/500

Epoch 00452: val_loss did not improve from 0.00167
Epoch 453/500

Epoch 00453: val_loss did not improve from 0.00167
Epoch 454/500

Epoch 00454: val_loss did not improve from 0.00167
Epoch 455/500

Epoch 00455: val_loss did not improve from 0.00167
Epoch 456/500

Epoch 00456: val_loss did not improve from 0.00167
Epoch 457/500

Epoch 00457: val_loss did not improve from 0.00167
Epoch 458/500

Epoch 00458: val_loss did not improve from 0.00167
Epoch 459/500

Epoch 00459: val_loss did not improve from 0.00167
Epoch 460/500

Epoch 00460: val_loss did not improve from 0.00167
Epoch 461/500

Epoch 00461: val_loss did not improve from 0.00167
Epoch 462/500

...


Epoch 00489: val_loss did not improve from 0.00167
Epoch 490/500

Epoch 00490: val_loss did not improve from 0.00167
Epoch 491/500

Epoch 00491: val_loss did not improve from 0.00167
Epoch 492/500

Epoch 00492: val_loss did not improve from 0.00167
Epoch 493/500

Epoch 00493: val_loss did not improve from 0.00167
Epoch 494/500

Epoch 00494: val_loss did not improve from 0.00167
Epoch 495/500

Epoch 00495: val_loss did not improve from 0.00167
Epoch 496/500

Epoch 00496: val_loss did not improve from 0.00167
Epoch 497/500

Epoch 00497: val_loss did not improve from 0.00167
Epoch 498/500

Epoch 00498: val_loss did not improve from 0.00167
Epoch 499/500

Epoch 00499: val_loss did not improve from 0.00167
Epoch 500/500

Epoch 00500: val_loss did not improve from 0.00167
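
To summarise a long log like this one, the `history` dictionary on the object returned by `model.fit` can be inspected directly. The sketch below uses a hand-written dictionary with made-up loss values in place of `hist.history`; `best_epoch` is a hypothetical helper, not part of Keras.

```python
def best_epoch(history):
    """Return (1-based epoch, val_loss) for the lowest validation loss.

    On ties, the earliest epoch wins.
    """
    val_losses = history["val_loss"]
    best = min(range(len(val_losses)), key=val_losses.__getitem__)
    return best + 1, val_losses[best]


# Stand-in for hist.history; the numbers are illustrative, not from the run.
toy_history = {"val_loss": [0.0030, 0.0025, 0.0021, 0.0022, 0.0021]}
epoch, loss = best_epoch(toy_history)
print(f"best epoch: {epoch}, val_loss: {loss}")
```

In the notebook this would be called as `best_epoch(hist.history)`, assuming training was run with `validation_data` so that the `val_loss` key is present.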
