# **Natural Language Processing : Word Embeddings and More!✨**

This is the notebook I used in the webinar conducted by the Data Science Community of SRM University. It implements several kinds of word embeddings and looks at their results during and after training. I have also included a link to the PPT used in the webinar to explain some of the concepts, in case you find them hard to follow.

📌 [PPT Link](https://drive.google.com/file/d/1C5DSz-WezR_onXYyIYUM3iU8UbuhVBcN/view?usp=sharing): Apart from embeddings, it also discusses how you can make your model stronger using RNNs and LSTMs with CNNs, etc., which was covered by my co-host [Harsh Sharma](https://www.linkedin.com/in/harshsharma27/).

Alright then, hope you have some fun! 😄

## **Importing Libraries and Downloading Necessary Items**


In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
!pip3 install contractions


In [None]:
print(os.listdir('../input'))

In [None]:
import re
import nltk
import contractions
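# Uncomment on a fresh machine: the stopword list and the WordNet data must be downloaded once.
# nltk.download('stopwords')
# nltk.download('wordnet')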
from nltk.corpus import stopwords
STOPWORDS = set(stopwords.words('english'))
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
from nltk.stem import WordNetLemmatizer 
lemmatizer = WordNetLemmatizer() 
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [None]:
def check_null(data):
    for i in data.columns:
        print(i,":",data[str(i)].isna().sum())

We have imported all the necessary libraries. Let's load the data.

In [None]:
data = pd.read_csv('/kaggle/input/imdb-dataset-of-50k-movie-reviews/IMDB Dataset.csv')
data.head()

## **Checks**

Before starting any Machine Learning or Deep Learning task, there are certain checks you should always make on the dataset.
These include:

1) Checking the shape of the data

2) Checking the type of data in each column (more often than not, date columns are given the `object` dtype)

3) Checking presence of Null values



In [None]:
data.shape

In [None]:
data.info()

In [None]:
check_null(data)

## **Text Cleaning**

Every dataset has to go through cleaning. Let's check what kind of cleaning we will need in our case.

In [None]:
data['review'][0]

As you can see, there are a couple of HTML tags in the review. We will get rid of them, along with extra white space, some special characters, etc. I often find it hard to keep track of everything that needs to be removed from the text, so shall we make a list?

**To Do Cleaning List**

1) Remove HTML TAGS

2) Remove emojis

3) Remove numbers

4) Remove Punctuation

5) Remove Stopwords

6) Remove words shorter than 3 characters

7) Fix contractions

8) Stem or lemmatize the words (up to you which one to perform)
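
The order of these steps matters. Expanding contractions only works while the apostrophes are still in the text, so it has to run before punctuation is removed. Here is a minimal sketch of that effect, assuming a tiny hand-written contraction map (`TINY_CONTRACTIONS` is hypothetical; the notebook itself uses the `contractions` package, which covers far more cases).

```python
import re

# Hypothetical two-entry map, only for illustration
TINY_CONTRACTIONS = {"don't": "do not", "it's": "it is"}

def expand(text):
    # Replace every known contraction with its expanded form
    for short, full in TINY_CONTRACTIONS.items():
        text = text.replace(short, full)
    return text

def strip_punct(text):
    # Same punctuation pattern as the cleaning function in this notebook
    return re.sub(r"[^\w\s]", " ", text)

raw = "i don't think it's bad"

# Contractions first, punctuation second
expand_first = " ".join(strip_punct(expand(raw)).split())
# Punctuation first: the apostrophes are gone, so nothing matches the map anymore
punct_first = " ".join(expand(strip_punct(raw)).split())

print(expand_first)  # i do not think it is bad
print(punct_first)   # i don t think it s bad
```

In the second ordering the leftover fragments ("don", "t", "s") are what would reach the stopword and length filters.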

In [None]:
TAG_RE = re.compile(r'<[^>]+>')  # compiled once, reused for every review

def clean_txt(txt):
    # Contractions: expand them first, while the apostrophes are still in the text
    txt = contractions.fix(txt).lower()
    # HTML tags: replace with a space so the words around a <br /> do not merge
    txt = TAG_RE.sub(' ', txt)
    # Emojis and other non-ASCII characters
    txt = txt.encode("ascii", "ignore").decode()
    # Numbers
    txt = ''.join(ch for ch in txt if not ch.isdigit())
    # Punctuation
    txt = re.sub(r'[^\w\s]', ' ', txt)
    # Stopwords and words of length 2 or less
    words = [w for w in txt.split() if w not in STOPWORDS and len(w) > 2]
    # Stemming or lemmatization, and why? Both work on single words, not on whole strings.
    # words = [stemmer.stem(w) for w in words]
    words = [lemmatizer.lemmatize(w) for w in words]
    return ' '.join(words)

clean_txt(data['review'][0])
        

In [None]:
data['Clean Text']=data['review'].apply(clean_txt)
data.head()

## **Label Encoding**

The sentiment labels need to be converted to numbers so that the model can work with them.

In [None]:
sentiment = {'positive':0,'negative':1}
data['sentiment'] =  data['sentiment'].map(sentiment)
data.head()

## **Splitting the Data**

In [None]:
X = data['Clean Text']
Y = data['sentiment']

In [None]:
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(X,Y,test_size=0.2, random_state=42)

In [None]:
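# Review lengths in words (not characters), to help pick max_length below
review_lengths = data['Clean Text'].str.split().str.len()
print(review_lengths.max())
print(review_lengths.min())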

## **Converting sentences into tokens**

In [None]:
vocab_size = 10000
embedding_dim = 16
max_length = 21
trunc_type='post'
oov_tok = "<OOV>"

tokenizer = Tokenizer(num_words = vocab_size, oov_token=oov_tok)
tokenizer.fit_on_texts(x_train)
word_index = tokenizer.word_index
sequences = tokenizer.texts_to_sequences(x_train)
padded = pad_sequences(sequences,maxlen=max_length, truncating=trunc_type)

testing_sequences = tokenizer.texts_to_sequences(x_test)
# Truncate the test reviews the same way as the training reviews
testing_padded = pad_sequences(testing_sequences,maxlen=max_length, truncating=trunc_type)
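
To make the effect of `maxlen`, `padding` and `truncating` concrete, here is a pure-Python sketch that mimics what `pad_sequences` does to a single sequence. The helper `pad_or_truncate` is hypothetical and written only for illustration. Its defaults follow the Keras defaults, which are `'pre'` for both padding and truncating.

```python
def pad_or_truncate(seq, maxlen, padding="pre", truncating="pre", value=0):
    """Mimic pad_sequences for one list of token ids (illustrative helper)."""
    if len(seq) > maxlen:
        # 'pre' drops tokens from the start, 'post' drops them from the end
        seq = seq[-maxlen:] if truncating == "pre" else seq[:maxlen]
    pad = [value] * (maxlen - len(seq))
    # 'pre' puts the padding in front, 'post' puts it at the back
    return pad + seq if padding == "pre" else seq + pad

short = [5, 8, 2]
long_seq = [1, 2, 3, 4, 5, 6, 7]

print(pad_or_truncate(short, maxlen=5))                          # [0, 0, 5, 8, 2]
print(pad_or_truncate(long_seq, maxlen=5))                       # [3, 4, 5, 6, 7]
print(pad_or_truncate(long_seq, maxlen=5, truncating="post"))    # [1, 2, 3, 4, 5]
```

With `truncating='post'` the model sees the beginning of each review, with `'pre'` it sees the end, so the training and test sets should use the same setting.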

## **The DL Model**


This is where the embedding layer comes into play. Our padded sequences will be fed into the network, and the embedding layer will map each token in a padded sequence to a trainable vector.
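
Under the hood, an embedding layer is essentially a lookup table: one row of numbers per token id. The NumPy sketch below shows the lookup and the flattening step for a single padded review. The sizes and the random table are made up for illustration. In the real model the table has `vocab_size` rows and its values are learned during training.

```python
import numpy as np

# Toy sizes, chosen only to keep the output readable
toy_vocab_size, toy_embedding_dim = 6, 4

rng = np.random.default_rng(0)
# One row per token id; Keras would learn these values during training
embedding_table = rng.normal(size=(toy_vocab_size, toy_embedding_dim))

# One padded review of length 5 (0 is the padding id)
padded_row = np.array([0, 0, 3, 5, 1])

# The "embedding lookup" is plain row indexing
vectors = embedding_table[padded_row]
print(vectors.shape)   # (5, 4)

# Flatten() then lines the 5 vectors up into one long feature vector
flat = vectors.reshape(-1)
print(flat.shape)      # (20,)
```

Both padding positions pick up the same row (row 0), which is why heavily padded reviews look alike to the dense layers.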

In [None]:
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])
model.summary()

## Training Time !!

In [None]:
num_epochs = 5
history = model.fit(padded, y_train , epochs=num_epochs, validation_data=(testing_padded, y_test))

Let's check how our model performed, shall we?

In [None]:
from sklearn.metrics import classification_report 
y_pred = model.predict(testing_padded)
y_pred = (y_pred > 0.6)
print(classification_report(y_test,y_pred))

In [None]:
import matplotlib.pyplot as plt
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.legend(['Accuracy','Val Accuracy'])
plt.show()

Hmm, it seems like there is some issue with the training if you look at those curves. Can you identify what it is and how to fix it?

If yes, give it a go!
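
If you want a starting point for reading the curves, here is a small sketch that compares training and validation accuracy epoch by epoch. The two helper functions are hypothetical, and the numbers are made up for illustration (they are not results from this notebook). With a real run you would pass `history.history['accuracy']` and `history.history['val_accuracy']` instead.

```python
def best_epoch(val_scores):
    """Return the 1-based epoch with the highest validation score."""
    return max(range(len(val_scores)), key=lambda i: val_scores[i]) + 1

def generalization_gap(train_scores, val_scores):
    """Training score minus validation score, per epoch."""
    return [round(t - v, 4) for t, v in zip(train_scores, val_scores)]

# Made-up accuracy curves, for illustration only
train_acc = [0.70, 0.82, 0.90, 0.96, 0.99]
val_acc = [0.71, 0.74, 0.73, 0.72, 0.71]

print(best_epoch(val_acc))                      # 2
print(generalization_gap(train_acc, val_acc))   # the gap keeps growing after epoch 2
```

A gap that widens while the validation score stalls is the pattern to look for in your own plot.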

## **Word2Vec Embeddings**

In [None]:
# Word2Vec expects a list of tokenized sentences (a list of lists of words),
# so keep one token list per review instead of one flat list of words.
sentences = [text.split() for text in data['Clean Text']]
sentences[:2]

In [None]:
num_features = 300  # Word vector dimensionality
min_word_count = 1  # Minimum word count
num_workers = 4     # Number of parallel threads
context = 10        # Context window size
downsampling = 1e-3 # (0.001) Downsample setting for frequent words

# Initializing and training the model
from gensim.models import word2vec
print("Training model....")
model = word2vec.Word2Vec(sentences,
                          workers=num_workers,
                          size=num_features,  # renamed to vector_size in gensim >= 4.0
                          min_count=min_word_count,
                          window=context,
                          sample=downsampling)

print('Completed')
# To make the model memory efficient (gensim < 4.0; deprecated in later versions)
model.init_sims(replace=True)

# # Saving the model for later use. Can be loaded using Word2Vec.load()
# model_name = "300features_40minwords_10context"
# model.save(model_name)

Okay, now our embedding is ready. Let's have a look at its vocabulary!

In [None]:
list(model.wv.vocab)  # gensim < 4.0; on gensim >= 4.0 use list(model.wv.key_to_index)

Want to see what an embedding vector looks like? Run the next cell.

In [None]:
# Word vectors live on model.wv; indexing the model object directly is deprecated
print(model.wv['one'])

Let's see whether this embedding knows its neighbours. Run the next cell to find the words in the vocabulary that the embedding considers most similar to the words we give it.

In [None]:
print(model.wv.most_similar("okay"))

In [None]:
print(model.wv.most_similar("films"))
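
`most_similar` ranks words by the cosine similarity between their vectors. Here is a minimal NumPy sketch of that ranking, assuming three hand-made 3-dimensional vectors (the `toy` dictionary is hypothetical and only stands in for the trained 300-dimensional vectors).

```python
import numpy as np

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector lengths
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up vectors, for illustration only
toy = {
    "film": np.array([1.0, 0.9, 0.0]),
    "movie": np.array([0.9, 1.0, 0.1]),
    "banana": np.array([0.0, 0.1, 1.0]),
}

def toy_most_similar(word, vectors):
    # Score every other word against the query and sort from most to least similar
    query = vectors[word]
    scores = [(other, cosine_similarity(query, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

ranking = toy_most_similar("film", toy)
print(ranking[0][0])  # movie
```

Vectors pointing in nearly the same direction score close to 1, unrelated ones score close to 0.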

## **Making the Embedding Matrix / Layer**

The next two functions turn each review into a single feature vector by averaging the Word2Vec vectors of its words. The logic is explained in the PPT.

In [None]:
# Function to average all word vectors in a paragraph
def featureVecMethod(words, model, num_features):
    # Pre-initialising empty numpy array for speed
    featureVec = np.zeros(num_features,dtype="float32")
    nwords = 0
    
    for word in words.split():
        # KeyedVectors supports membership tests, so no separate vocabulary set is needed
        if word in model.wv:
            nwords = nwords + 1
            featureVec = np.add(featureVec,model.wv[word])
    
    # Dividing the result by number of words to get average.
    # Reviews with no known words keep the all-zeros vector instead of becoming NaN.
    if nwords > 0:
        featureVec = np.divide(featureVec, nwords)
    return featureVec



In [None]:
# Function for calculating the average feature vector
def getAvgFeatureVecs(reviews, model, num_features):
    counter = 0
    reviewFeatureVecs = np.zeros((len(reviews),num_features),dtype="float32")
    for review in reviews:
        # Printing a status message every 1000th review
        if counter%1000 == 0:
            print("Review %d of %d"%(counter,len(reviews)))
            
        reviewFeatureVecs[counter] = featureVecMethod(review, model, num_features)
        counter = counter+1
        
    return reviewFeatureVecs


In [None]:
trainVectors = getAvgFeatureVecs(x_train,model,num_features)
testVectors = getAvgFeatureVecs(x_test,model,num_features)

## **Finding the Perfect Fit**

Time to try our prepared vectors with a few ML classification algorithms.

In [None]:
from sklearn import tree
from sklearn.metrics import classification_report
clf = tree.DecisionTreeClassifier()
clf = clf.fit(trainVectors,y_train)
res = clf.predict(testVectors)
print(classification_report(y_test,res))

In [None]:
from sklearn.ensemble import RandomForestClassifier
forest = RandomForestClassifier(n_estimators = 100)
    
forest = forest.fit(trainVectors, y_train)
res = forest.predict(testVectors)
print(classification_report(y_test,res))

In [None]:
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
y_pred = gnb.fit(trainVectors, y_train).predict(testVectors)
print(classification_report(y_test,y_pred))

In [None]:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
y_pred = lr.fit(trainVectors, y_train).predict(testVectors)
print(classification_report(y_test,y_pred))

## **GloVe Embedding**

If you have downloaded this notebook and want to run it on your local computer, you will have to download the GloVe embedding vectors as well. 
[You can download them here.](https://nlp.stanford.edu/data/glove.6b.zip)

In this notebook, the GloVe embeddings have already been included.

## Loading the Embedding

In [None]:
GLOVE_DIR='../input/glove6b50dtxt'
embeddings_index = {}
# The GloVe file is UTF-8 text; the with-block closes it even if parsing fails
with open(os.path.join(GLOVE_DIR, 'glove.6B.50d.txt'), encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

print('Found %s word vectors.' % len(embeddings_index))
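
Each line of the GloVe text file is a token followed by the components of its vector, separated by spaces. The sketch below parses two such lines from an in-memory file. The words and values are made up and only 3-dimensional, so treat it as an illustration of the format, not of the real file contents.

```python
import io

# Stand-in for the real file: two made-up lines in the GloVe text format
fake_file = io.StringIO("the 0.1 -0.2 0.3\ncat 0.5 0.0 -0.4\n")

toy_index = {}
for line in fake_file:
    parts = line.split()
    # First field is the token, the rest are the vector components
    toy_index[parts[0]] = [float(x) for x in parts[1:]]

print(toy_index["cat"])   # [0.5, 0.0, -0.4]
print(len(toy_index))     # 2
```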

## Making the Embedding Matrix

In [None]:
EMBEDDING_DIM = 50
# Row i holds the GloVe vector of the word with tokenizer id i; row 0 is left for padding
embedding_matrix = np.zeros((len(word_index) + 1, EMBEDDING_DIM))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    # words not found in the embedding index keep their all-zeros row
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
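
Since missing words end up as all-zero rows, it can be worth checking how much of the vocabulary GloVe actually covers. Here is a toy version of the matrix construction with that check added. `toy_word_index` and `toy_glove` are hypothetical stand-ins for `word_index` and `embeddings_index`.

```python
import numpy as np

# Hypothetical toy inputs; the real ones come from the tokenizer and the GloVe file
toy_word_index = {"good": 1, "movie": 2, "zzzunknown": 3}
toy_glove = {"good": np.array([0.1, 0.2]), "movie": np.array([0.3, 0.4])}

dim = 2
# +1 row because token ids start at 1 and id 0 is used for padding
matrix = np.zeros((len(toy_word_index) + 1, dim))
for word, idx in toy_word_index.items():
    vec = toy_glove.get(word)
    if vec is not None:
        matrix[idx] = vec

# Fraction of vocabulary words that received a pretrained vector
coverage = sum(w in toy_glove for w in toy_word_index) / len(toy_word_index)
print(matrix)
print(f"coverage: {coverage:.2f}")   # coverage: 0.67
```

A low coverage on the real vocabulary would mean many tokens reach the network as zeros, which is one thing to rule out when the model underperforms.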

In [None]:
embedding_layer = tf.keras.layers.Embedding(len(word_index) + 1,
                            EMBEDDING_DIM,
                            weights=[embedding_matrix],
                            input_length=max_length,
                            trainable=False)

## DL Model 

The only difference between this and the DL model we trained in the beginning is that here we provide the weights of the embedding layer ourselves. 
These weights come from the GloVe embedding and are frozen (`trainable=False`), so they are not updated during training.

In [None]:
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(len(word_index) + 1,EMBEDDING_DIM,weights=[embedding_matrix],input_length=max_length,trainable=False),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])
model.summary()

## Training Time

In [None]:
num_epochs = 15
history = model.fit(padded, y_train , epochs=num_epochs, validation_data=(testing_padded, y_test))

## Performance Check 

In [None]:
from sklearn.metrics import classification_report 
y_pred = model.predict(testing_padded)
y_pred = (y_pred > 0.6)
print(classification_report(y_test,y_pred))

In [None]:
import matplotlib.pyplot as plt
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.legend(['Accuracy','Val Accuracy'])
plt.show()

Again some issue, right? Do you think it is because of the embedding or something else? Try fixing it. 

# **That's all Folks! 🎉 Hope you enjoyed this tutorial!**

If you learnt something new, don't forget to upvote this notebook! If you found something in the notebook and would like to tell me, leave a comment and I'll get back to you ASAP. 🖖🏼

Take care! 😊