<h1> Word2Vec- Wikipedia Sentences </h1>

In [1]:
#importing the required libraries/module

%matplotlib inline
import warnings
warnings.filterwarnings("ignore")
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
from tqdm import tqdm
import os
import random
import tensorflow as tf
from tensorflow.keras import *

<h2> Loading the Text </h2>

In [2]:
#the text is a plain-text (.txt) file in the same folder
#it is a 1 GB file of Wikipedia sentences
#each line is a sentence

with open('wikisent2.txt') as f:
    text = f.readlines()

In [3]:
#now text is a list of sentences, each stored as a string
text[50000]

"Abkhazia's break-away government stated that a plane crashed, and rejected the claim that it was shot down.\n"

In [4]:
len(text)

7871825

The list has 7,871,825 sentences (about 78.7 lakh).
Due to limited computing power, I have randomly sampled 500 sentences on which we will train the w2v model.

In [5]:
text=random.sample(text,500)
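
Loading all 7.8 million lines just to keep 500 of them uses a lot of memory. As one possible alternative, the sketch below draws the sample in a single pass over any iterable of lines using reservoir sampling. The function name `reservoir_sample` and its `seed` parameter are illustrative assumptions and are not part of the original notebook; fixing a seed also makes the sample repeatable between runs.

```python
import random

def reservoir_sample(lines, k, seed=0):
    """Return k items chosen uniformly at random from an iterable (hypothetical helper)."""
    rng = random.Random(seed)          # private generator, so the global state is untouched
    sample = []
    for i, line in enumerate(lines):
        if i < k:
            sample.append(line)        # fill the reservoir with the first k items
        else:
            j = rng.randint(0, i)      # inclusive on both ends
            if j < k:
                sample[j] = line       # replace an entry with probability k/(i+1)
    return sample

# Toy stand-in for the real file: a generator of 10,000 fake sentences.
demo = reservoir_sample((f"sentence {n}\n" for n in range(10_000)), k=5, seed=42)
print(len(demo))
```

Because a file object is itself an iterable of lines, the same function could presumably be given the open `wikisent2.txt` handle directly instead of the list returned by `readlines()`.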

In [6]:
len(text)

500

In [7]:
text[100]

'The song has peaked at numbers two and one on the Billboard Hot Country Songs and Country Airplay charts, respectively.\n'

<h2> Data Cleaning </h2>

<b>Text Preprocessing</b>
Now that the data is loaded, it needs some preprocessing before we train the w2v model.

In the preprocessing phase we do the following, in this order:

1. Expand contractions (e.g. "won't" becomes "will not")
2. Remove words that contain digits (alpha-numeric words)
3. Remove special characters and punctuation such as , or . or #, then remove underscores
4. Convert every word to lowercase and drop single-character words

In [8]:
#try to decontract as many words as possible
def decontracted(sentence):
    # specific
    sentence = re.sub(r"won\'t", "will not", sentence)
    sentence = re.sub(r"can\'t", "can not", sentence)

    # general
    sentence = re.sub(r"n\'t", " not", sentence)
    sentence = re.sub(r"\'re", " are", sentence)
    sentence = re.sub(r"\'s", " is", sentence)
    sentence = re.sub(r"\'d", " would", sentence)
    sentence = re.sub(r"\'ll", " will", sentence)
    sentence = re.sub(r"\'t", " not", sentence)
    sentence = re.sub(r"\'ve", " have", sentence)
    sentence = re.sub(r"\'m", " am", sentence)
    return sentence
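
To see what these substitutions do in practice, here is a small self-contained sketch that applies the same rules in the same order to a made-up sentence. The name `expand_contractions` and the `RULES` list are illustrative only. Note that the `'s` rule cannot tell a contraction from a possessive, so a possessive is expanded to "is" as well.

```python
import re

# Same substitution rules as decontracted(), applied in the same order.
RULES = [
    (r"won\'t", "will not"), (r"can\'t", "can not"),
    (r"n\'t", " not"), (r"\'re", " are"), (r"\'s", " is"),
    (r"\'d", " would"), (r"\'ll", " will"), (r"\'t", " not"),
    (r"\'ve", " have"), (r"\'m", " am"),
]

def expand_contractions(sentence):
    """Apply every (pattern, replacement) pair in turn (hypothetical helper)."""
    for pattern, replacement in RULES:
        sentence = re.sub(pattern, replacement, sentence)
    return sentence

expanded = expand_contractions("They can't say Abkhazia's plan isn't risky.")
print(expanded)
# They can not say Abkhazia is plan is not risky.
```

For word2vec on a small sample this side effect is probably harmless, but it does add spurious "is" tokens to the context windows.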


In [9]:
def remove_alpha_numeric(sentence): #words with numbers as well
    sentence=re.sub(r"\S*\d+\S*","",sentence)
    return sentence
#zero or more non space followed by one or more digits followed by zero or more non space

In [10]:
def remove_special_char(sentence): # @ # $ % ^ & <  etc
    sentence=re.sub(r"[^a-zA-Z\s]+"," ",sentence)
    return sentence
#anything that is not a-z, A-Z or white space (1 or more such occurrences) is replaced with a space
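
In a negated character class, `\s` (whitespace) and `\S` (non-whitespace) behave very differently, and the two are easy to mix up. The short check below, on a made-up sample string, compares both spellings: with `\S` inside the class only whitespace can match, while with `\s` every character that is neither a letter nor whitespace is replaced.

```python
import re

sample = "C++ & Python @ 100%!"

# [^a-zA-Z\S] excludes letters AND all non-whitespace, so only whitespace matches.
with_upper_s = re.sub(r"[^a-zA-Z\S]+", " ", sample)

# [^a-zA-Z\s] excludes letters and whitespace, so symbols and digits match.
with_lower_s = re.sub(r"[^a-zA-Z\s]+", " ", sample)

print(repr(with_upper_s))                    # the symbols survive
print(repr(" ".join(with_lower_s.split())))  # only the letters survive
```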

In [11]:
def remove_punctuations(sentence):
    sentence=re.sub(r"[^\w\s]"," ",sentence)
    return sentence
#not (word char or space char)- it has to be a punctuation if you have removed alpha numeric, html tags, url, and special char
#therefore use it in the end- the function has been designed on this assumption that it will be used at last

In [12]:
def remove_underscore(sentence):
    sentence=re.sub(r"_+","",sentence)
    return sentence

<h3> preprocessing/cleaning loop </h3>

In [13]:
preprocessed_data=[] #list to store all the final preprocessed sentences

for sentence in tqdm(text):
    #.strip() removes leading and trailing characters (white space by default) from a string
    #applying all the preprocessing functions defined above
    sentence=decontracted(sentence).strip()
    sentence=remove_alpha_numeric(sentence).strip()
    sentence=remove_special_char(sentence).strip()
    sentence=remove_punctuations(sentence).strip() #used in the end
    sentence=remove_underscore(sentence).strip()

    #lower-casing the words and dropping single characters (no stop-word removal is done)
    word_list=sentence.split() #splits the string by white spaces and stores it in a list
    word_list=[i.lower() for i in word_list] #changing case of all words to lower case
    word_list=[i for i in word_list if len(i)>1] #removing single characters
    sentence=" ".join(word_list) #sentence has been created from all the final words in word_list
    preprocessed_data.append(sentence)

100%|█████████████████████████████████████████████████████████████████████████████| 500/500 [00:00<00:00, 15688.20it/s]


now the data preprocessing is complete

<b> comparing results before and after preprocessing </b>

In [17]:
text[150]

'Since early 2009, they have been on a silent hiatus.\n'

In [18]:
preprocessed_data[150]

'since early they have been on silent hiatus'
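
The before/after pair above can be reproduced with a condensed, self-contained version of the cleaning steps. This is only a sketch: the name `clean_sentence` is made up here, contraction expansion is left out because this sentence has none, and the regular expressions are the ones used by the helper functions in this notebook.

```python
import re

def clean_sentence(sentence):
    """Condensed version of the cleaning loop (hypothetical helper)."""
    sentence = re.sub(r"\S*\d+\S*", "", sentence)  # drop any token containing a digit
    sentence = re.sub(r"[^\w\s]", " ", sentence)   # punctuation becomes a space
    sentence = re.sub(r"_+", "", sentence)         # drop underscores
    # lower-case every word and discard single characters such as "a"
    words = [w.lower() for w in sentence.split() if len(w) > 1]
    return " ".join(words)

cleaned = clean_sentence("Since early 2009, they have been on a silent hiatus.\n")
print(cleaned)
# since early they have been on silent hiatus
```

The token `2009,` disappears as a whole because the digit pattern also consumes the non-space characters attached to it.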

<h2> Defining the Vocabulary </h2>

now we need a function to define the vocabulary of all unique words in the dataset

In [19]:
def get_vocab(data): #takes in a list of sentences
    
    unique_words = set() # at first we will initialize an empty set
    for row in tqdm(data): # for each sentence in the list
        for word in row.split(): # for each word in the sentence; split() returns the list of words and skips empty strings
            unique_words.add(word)
    #sorting the words in alphabetical order
    unique_words = sorted(unique_words)
    #defining a dictionary giving each word a unique index
    vocab = {j:i for i,j in enumerate(unique_words)}
    
    return vocab #returning the vocab dictionary

In [20]:
#getting the vocabulary for our case
vocab=get_vocab(preprocessed_data)

100%|████████████████████████████████████████████████████████████████████████████| 500/500 [00:00<00:00, 125255.45it/s]


In [51]:
#now our vocabulary has been defined and each word in preprocessed data has its own unique index

In [21]:
vocab

{'abandoned': 0,
 'abandonment': 1,
 'abbandonata': 2,
 'abbreviated': 3,
 'abbreviation': 4,
 'abco': 5,
 'abdourahman': 6,
 'abolition': 7,
 'aboriginal': 8,
 'about': 9,
 'above': 10,
 'abramowitz': 11,
 'abyssinia': 12,
 'academic': 13,
 'acanthoscelides': 14,
 'accepted': 15,
 'accident': 16,
 'accommodate': 17,
 'accordance': 18,
 'according': 19,
 'accounts': 20,
 'accredited': 21,
 'ace': 22,
 'achievement': 23,
 'acquired': 24,
 'acquisition': 25,
 'acres': 26,
 'across': 27,
 'act': 28,
 'action': 29,
 'activated': 30,
 'active': 31,
 'activist': 32,
 'activities': 33,
 'activity': 34,
 'actor': 35,
 'actress': 36,
 'acts': 37,
 'actual': 38,
 'ada': 39,
 'adaptation': 40,
 'adapted': 41,
 'adc': 42,
 'added': 43,
 'addition': 44,
 'adds': 45,
 'adelaide': 46,
 'adjacent': 47,
 'administration': 48,
 'adolph': 49,
 'advanced': 50,
 'advancing': 51,
 'advocates': 52,
 'adwa': 53,
 'aerial': 54,
 'aeronautica': 55,
 'affected': 56,
 'africa': 57,
 'african': 58,
 'after': 59,
 

In [22]:
#the following words might not be present in the dictionary when the notebook is run again
#as the sampling of the sentences is done randomly
#you can check with other words

vocab["comedy"]

607

In [23]:
vocab["great"]

1293

In [24]:
print("length of vocabulary: ", len(vocab))

length of vocabulary:  3401
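
The vocabulary maps words to indices. When reading model outputs it is often useful to go the other way as well. The sketch below builds both mappings on a two-sentence toy corpus; `build_vocab`, `toy_vocab` and `index_to_word` are illustrative names, not objects defined in this notebook.

```python
def build_vocab(sentences):
    """Map each unique word to its position in alphabetical order (hypothetical helper)."""
    unique_words = sorted({word for s in sentences for word in s.split()})
    return {word: index for index, word in enumerate(unique_words)}

toy_corpus = ["the cat sat", "the dog sat down"]
toy_vocab = build_vocab(toy_corpus)

# Reverse lookup: index -> word, handy for decoding the softmax output later.
index_to_word = {index: word for word, index in toy_vocab.items()}

print(toy_vocab)
# {'cat': 0, 'dog': 1, 'down': 2, 'sat': 3, 'the': 4}
print(index_to_word[3])
# sat
```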


<b> our vocab has 3401 unique words </b>

<h2> Defining the Dataset for Neural Network </h2>

In [25]:
def one_hot(word): #returns the one hot encoded vector for a given word (given word must be in the vocabulary)
    encoding=np.zeros(len(vocab))
    encoding[vocab[word]]=1 #setting the index corresponding to the word as 1
    return encoding

In [26]:
def get_context_words(word_position,word_list,window=2): #returns the list of context words for a given word position in a sentence
    
    lower_index=max(word_position-window,0)
    upper_index=min(word_position+window,len(word_list)-1)
    
    context_words=[] #list of context words
    for i in range(lower_index,upper_index+1):
        if i==word_position: #index of the word itself
            continue
        else: #context word index
            context_words.append(word_list[i])
                
    return context_words #returns list of context words for a word
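
A quick way to check the windowing logic is to run it on a short sentence and look at the edges, where the window is clipped. The sketch below restates the same logic compactly under the made-up name `context_window` so that it can run on its own.

```python
def context_window(position, words, window=2):
    """Words within `window` positions of `position`, excluding the word itself."""
    lower = max(position - window, 0)                 # clip at the start of the sentence
    upper = min(position + window, len(words) - 1)    # clip at the end of the sentence
    return [words[i] for i in range(lower, upper + 1) if i != position]

words = "since early they have been".split()
for pos, word in enumerate(words):
    print(word, "->", context_window(pos, words))
# since -> ['early', 'they']
# early -> ['since', 'they', 'have']
# they -> ['since', 'early', 'have', 'been']
# have -> ['early', 'they', 'been']
# been -> ['they', 'have']
```

A five-word sentence therefore contributes 2 + 3 + 4 + 3 + 2 = 14 (word, context word) pairs with a window of 2.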

In [27]:
#generating x and y for the neural network

In [28]:
x_train=[] #list of words
y_train=[] #list of context words

for sentence in tqdm(preprocessed_data):
    word_list=sentence.split() #getting the list of words in the sentence
    for i in range(len(word_list)): #for all words in a sentence
        x_word=one_hot(word_list[i])
        context_words=get_context_words(i,word_list)
        for word in context_words: #for each context word for a word
            y_word=one_hot(word)
            #adding the x_word and y_word to the main dataset
            x_train.append(x_word)
            y_train.append(y_word)

100%|███████████████████████████████████████████████████████████████████████████████| 500/500 [00:00<00:00, 819.60it/s]


In [29]:
len(x_train)

33788

In [30]:
len(y_train)

33788

<b> Now we have 33,788 (word, context word) pairs with which to train the neural network </b>
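
Storing every pair as two dense one-hot vectors is memory hungry: 33,788 rows of 3,401 float64 values come to roughly 0.9 GB per array. A lighter option, sketched below on toy data, is to keep only the integer indices of each pair and build one-hot rows when a batch is needed. The names `index_pairs` and `toy_vocab` are assumptions made for this illustration; integer targets of this kind are what the Keras `sparse_categorical_crossentropy` loss expects, although the notebook itself uses one-hot targets.

```python
import numpy as np

def index_pairs(sentences, vocab, window=2):
    """Return (center, context) word indices for every pair (hypothetical helper)."""
    centers, contexts = [], []
    for sentence in sentences:
        ids = [vocab[word] for word in sentence.split()]
        for pos, center in enumerate(ids):
            lower = max(pos - window, 0)
            upper = min(pos + window, len(ids) - 1)
            for j in range(lower, upper + 1):
                if j != pos:
                    centers.append(center)
                    contexts.append(ids[j])
    return np.array(centers, dtype=np.int32), np.array(contexts, dtype=np.int32)

toy_vocab = {"cat": 0, "dog": 1, "sat": 2, "the": 3}
centers, contexts = index_pairs(["the cat sat", "the dog"], toy_vocab)

# One-hot rows can be produced on demand by indexing an identity matrix.
one_hot_batch = np.eye(len(toy_vocab), dtype=np.float32)[centers]
print(centers.tolist(), contexts.tolist(), one_hot_batch.shape)
```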

<h2> Defining the Neural Network </h2>

<b> W2V DENSE VECTOR SIZE=30--> NEURONS IN HIDDEN LAYER=30 </b>

<b> VOCABULARY SIZE 3401--> NEURONS IN INPUT AND OUTPUT LAYER=3401 </b>

In [31]:
from tensorflow.keras.layers import Dropout, Dense, Flatten, Input
from tensorflow.keras.initializers import RandomNormal
from tensorflow.keras.activations import softmax

model=Sequential()
model.add(Dense(30, activation='relu',use_bias=True,kernel_initializer="glorot_uniform"
                ,bias_initializer=RandomNormal(mean=0.0, stddev=0.05),input_dim=len(vocab))) #hidden layer
#input size has been defined as the size of the vocabulary
          
model.add(Dense(len(vocab), activation=softmax,use_bias=True,kernel_initializer="glorot_uniform"
                ,bias_initializer=RandomNormal(mean=0.0, stddev=0.05))) #output layer- softmax activation

<h2> Compiling the Model </h2>

In [32]:
from tensorflow.keras.optimizers import Adam
model.compile(
    optimizer=Adam(learning_rate=0.001),
    loss='categorical_crossentropy' #cross entropy as the loss function
)

In [33]:
#summary of the model
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense (Dense)                (None, 30)                102060    
_________________________________________________________________
dense_1 (Dense)              (None, 3401)              105431    
Total params: 207,491
Trainable params: 207,491
Non-trainable params: 0
_________________________________________________________________
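
The parameter counts in the summary follow directly from the layer sizes: each Dense layer has (inputs x outputs) weights plus one bias per output. The NumPy sketch below checks that arithmetic and walks through one forward pass of the same architecture (ReLU hidden layer, softmax output). The weights here are random stand-ins, not the trained values, and the one-hot index 1293 is simply the index shown for "great" earlier.

```python
import numpy as np

vocab_size, embed_dim = 3401, 30

# Weights plus biases for each Dense layer.
hidden_params = vocab_size * embed_dim + embed_dim   # input -> hidden
output_params = embed_dim * vocab_size + vocab_size  # hidden -> output
print(hidden_params, output_params, hidden_params + output_params)
# 102060 105431 207491

# Random stand-in weights with the same shapes as the Keras model.
rng = np.random.default_rng(0)
W_word = rng.normal(scale=0.05, size=(vocab_size, embed_dim))
b_hidden = np.zeros(embed_dim)
W_context = rng.normal(scale=0.05, size=(embed_dim, vocab_size))
b_out = np.zeros(vocab_size)

x = np.zeros(vocab_size)
x[1293] = 1.0                                   # one-hot input word

h = np.maximum(x @ W_word + b_hidden, 0.0)      # hidden layer with ReLU
logits = h @ W_context + b_out
probs = np.exp(logits - logits.max())           # numerically stable softmax
probs /= probs.sum()
print(h.shape, probs.shape)
```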


<h2> Training the Model </h2>

In [34]:
batch_size=64
epochs=50

#train the model
model.fit(x=np.array(x_train),y=np.array(y_train),
                    batch_size=batch_size,
                    epochs=epochs)
                    #callbacks=cp_callback) # Pass callback to training

Epoch 1/50
...
Epoch 50/50


<keras.callbacks.History at 0x205b909a880>

<b> the training loss decreased from 7.2238 to 4.41 </b>
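
To put these loss values in perspective, categorical cross-entropy for a model that spreads its probability evenly over all 3,401 words is ln(3401), about 8.13. Exponentiating a loss gives a rough "effective number of choices" per prediction. The short calculation below is only a back-of-the-envelope aid using the final loss reported above.

```python
import math

vocab_size = 3401
uniform_loss = math.log(vocab_size)       # loss of a uniform guess over the vocabulary

reported_final_loss = 4.41                # value reported after 50 epochs
effective_choices = math.exp(reported_final_loss)

print(round(uniform_loss, 2), round(effective_choices))
```

A final loss of 4.41 is therefore well below the uniform baseline, corresponding to roughly 82 equally likely candidates instead of 3,401. Since this is training loss on a 500-sentence sample, it says little about how well the embeddings generalise.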

<h2> Analysing the Weight Matrix </h2>

In [43]:
len(vocab) #vocab size

3401

In [35]:
weights = model.get_weights() # returns a list of NumPy arrays

In [36]:
len(weights)

4

In [38]:
weights[0].shape #Wword

(3401, 30)

In [40]:
weights[1].shape #hidden layer biases

(30,)

In [42]:
weights[2].shape #Wcontext

(30, 3401)

In [44]:
weights[3].shape #output layer biases

(3401,)

<h2> Getting Word Embeddings for any word </h2>

In [52]:
w_word=weights[0]
bias=weights[1]

In [65]:
word="great" #enter the word here
encoding=one_hot(word) #one hot encoding for the word

In [66]:
embedding=np.matmul(encoding,w_word)+bias

In [67]:
embedding #30-dimensional dense embedding for the word

array([ 0.11897421,  1.04536515,  1.03667539,  1.27910376,  0.95242138,
        0.87302552,  0.36116925, -0.00405413,  0.49722481, -0.09004649,
       -0.0826215 , -0.05988851, -0.01450284, -0.11034647,  0.84957301,
        0.40098637,  1.25051621,  2.08496851, -0.05835783,  0.71212136,
       -0.13719042,  1.18773946,  0.49917839,  1.56662285,  1.11882298,
        1.09841652,  1.22299717,  1.16509277,  0.5704588 ,  0.50117218])
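
Multiplying a one-hot vector by the weight matrix simply selects one row, so the same embedding can be read with plain indexing. Once embeddings are available, cosine similarity is a common way to compare two words. The sketch below demonstrates both points on a tiny random matrix; `toy_vocab`, `w_word_demo`, `bias_demo`, `embed` and `cosine` are stand-ins invented for this example, not the trained weights.

```python
import numpy as np

rng = np.random.default_rng(0)
toy_vocab = {"good": 0, "great": 1, "river": 2}
w_word_demo = rng.normal(size=(len(toy_vocab), 4))  # stand-in for weights[0]
bias_demo = rng.normal(size=4)                      # stand-in for weights[1]

def embed(word):
    """Row lookup plus bias; equivalent to one_hot(word) @ W + b."""
    return w_word_demo[toy_vocab[word]] + bias_demo

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

one_hot_vec = np.zeros(len(toy_vocab))
one_hot_vec[toy_vocab["great"]] = 1.0
via_matmul = one_hot_vec @ w_word_demo + bias_demo

print(np.allclose(via_matmul, embed("great")))
print(cosine(embed("good"), embed("great")))
```

With the trained model, the same lookup would presumably be `w_word[vocab[word]] + bias`, which avoids building a 3,401-element one-hot vector for every query.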

<h1> ---Built the W2V Model--- </h1>