# Assignment 3 CS 5316 Natural Language Processing
For this assignment we will use the following packages
<ul>
    <li><a href="https://radimrehurek.com/gensim/">Gensim</a>.</li>
    <li><a href="https://keras.io/">Keras</a>.</li>
    <li><a href="https://www.tensorflow.org/">Tensorflow</a>.</li>
</ul>
You can install these packages via Anaconda Navigator or with the conda install / pip install commands, e.g.<br>
<blockquote>pip install gensim<br>
pip install tensorflow<br>
pip install keras</blockquote>

In [None]:
import numpy as np
from IPython.display import Image
# Get the interactive Tools for Matplotlib
%matplotlib notebook
import matplotlib.pyplot as plt
plt.style.use('ggplot')
from sklearn.decomposition import PCA
from gensim.test.utils import datapath, get_tmpfile
from gensim.models import KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec
from sklearn.decomposition import TruncatedSVD
from sklearn.utils.extmath import randomized_svd
from sklearn.model_selection import train_test_split
from nltk import ngrams
import pandas as pd
import re
from nltk.corpus import stopwords
from nltk import ngrams
from nltk.tokenize import word_tokenize
import keras
import seaborn as sn


# Word Embeddings

Word Vectors nowadays are often used as a fundamental component for downstream NLP tasks, e.g. question answering, text generation, language translation, sentiment analysis, etc. The goal of word embedding methods is to derive a low-dimensional continuous vector representation for words so that words that are syntactically or semantically related are close together in that vector space and thus, share a similar representation.

In this assignment we are going to explore different word embeddings in order to build some intuitions about their strengths and weaknesses. Although there are many types of word embeddings, they can be broadly classified into two categories:
<ul>
    <li>Frequency based Embedding</li>
    <li>Prediction based Embedding</li>
</ul>
For frequency based embeddings we will explore embeddings based on <b>word co-occurrence</b> counts with <a href="https://en.wikipedia.org/wiki/Pointwise_mutual_information">Positive Pointwise Mutual Information (PPMI)</a> and <a href="https://en.wikipedia.org/wiki/Singular_value_decomposition">Singular Value Decomposition (SVD)</a>.
<a href="https://www.youtube.com/watch?v=P5mlg91as1c">SVD video explanation</a><br>
For prediction based embeddings we will explore <a href="https://en.wikipedia.org/wiki/Word2vec">Word2Vec</a> based embeddings.

For evaluating these embeddings we will work with the following two datasets: 
<ul>
    <li>Twitter dataset created by Sanders Analytics, which we explored in the previous assignment <b>(file provided)</b></li>
    <li>Movie reviews dataset from the popular website <a href="https://www.imdb.com/">IMDB</a>.
        Head over to <a href="https://www.kaggle.com/c/word2vec-nlp-tutorial/data">Kaggle</a> and download the dataset from there. The dataset consists of three files:<br><b>labeledTrainData, unlabeledTrainData, testData</b></li>   
</ul>
Read the "Data" section on kaggle for details on the dataset.

Let's get started.......<br>

## Frequency based Embeddings
For this part we will use the Sanders Analytics dataset to create embeddings, since the other dataset is large and we might run into memory problems.<br><br><br>
Although we could use word representations taken directly from the word co-occurrence matrix, it is generally not a good idea to do so, for the following reasons:
<ul>
    <li>The word co-occurrence matrix scales quadratically with vocabulary size, which is problematic for large datasets given memory constraints. For example, the IMDB dataset has a vocabulary of 225109 words after removing stop words, which requires roughly 189 GiB of storage (roughly 203 GB).</li><img src="memoryerror.png">
    <li>The word co-occurrence matrix will be quite sparse, meaning many entries in the matrix will be zero. This is problematic because many NLP tasks rely heavily on multiplication, e.g. the word similarity task uses cosine similarity:<img src="cosine-equation.png"> Here the dot product is computed between two word vectors; multiplying by zeros wastes precious computation power and your time.</li>
    <li>High co-occurrence counts for stop words and conjunctions distort the true representation of words, meaning these counts could become a dominant factor when the embeddings are used in computations. They also do not provide much information, since their counts with other words would also be high.</li>
</ul>
In summary, you want to avoid sparse representations just like the coronavirus.<br>
To mitigate the above problems we will use PPMI and SVD. PPMI is used to control high co-occurrence counts and SVD is used to reduce dimensionality.
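
The storage figure quoted in the first bullet can be checked with a few lines of arithmetic. This is only a sketch: it assumes a dense matrix with 4-byte entries (for example `float32`), which the text does not state, and the variable names are illustrative.

```python
# Rough memory estimate for a dense vocab x vocab co-occurrence matrix.
vocab_size = 225109        # IMDB vocabulary size quoted above
bytes_per_entry = 4        # assumption: 4-byte entries (e.g. float32)

total_bytes = vocab_size ** 2 * bytes_per_entry
size_gb = total_bytes / 10 ** 9    # decimal gigabytes
size_gib = total_bytes / 2 ** 30   # binary gibibytes

print(f"{size_gb:.0f} GB, {size_gib:.0f} GiB")  # 203 GB, 189 GiB
```

With 8-byte entries (NumPy's default `float64`) the requirement would double, which is why the matrix is only built for the much smaller Twitter dataset.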

    

### Preprocessing
Since we have already discussed preprocessing trade-offs in previous assignments, we expect you to analyse the data and perform the preprocessing that is required.<br> 

In [None]:
def load_data(filename):
    
    labels=list(pd.read_csv(filename,usecols=[0]).values)
    tweets=list(pd.read_csv(filename,usecols=[1]).values)
    tweets=[tweet for i in tweets for tweet in i]
    labels=[label for i in labels for label in i]
    
    return  tweets,labels
tweets, labels=load_data("twitter-sanders-apple3.csv")


In [None]:
def preprocessing(data):
    """
    Return preprocessed data

    Args:
        data : reviews
    
    Returns: preprocessed_data
    preprocessed_data : preprocessed dataset 
    """
    #remove http
    for i in range(0,len(data)):
        data[i]=re.sub(r'http:\/\/t\.co\/.*?[A-Z]$','',data[i])
        data[i]=re.sub(r'http[A-Za-z0-9]+','',data[i])
    #converting to lowercase
    for i in range(0,len(data)):
        data[i]=data[i].lower()
    #remove punc
    for i in range(0,len(data)):
        data[i]=re.sub(r'[^0-9a-z\@\s]',' ',data[i])
        data[i]=data[i].replace('  ',' ')
    #strip the @ symbol from mentions
    for i in range(0,len(data)):
        data[i]=re.sub(r'\@',"",data[i])

    return data


data=preprocessing(tweets)

    

### Test train split
Use test train split from sklearn.


In [None]:
def testTrainSplit(tweets,labels):
    """
    Return test train data

    Args:
        data_X : reviews
        data_Y: labels
    Returns: test train split data 
    """
    return train_test_split(tweets,labels,train_size=0.8,test_size=0.2,shuffle=True)

train_tweets,test_tweets,train_labels,test_labels=testTrainSplit(data,labels)


Extract the vocabulary to find the dimensions of the co-occurrence matrix.

In [None]:
from collections import Counter

def getVocabulary(data):
    """
    Return dataset vocabulary

    Args:
        data : reviews in train dataset
    
    Returns: vocab, dic, sen_tweets
    vocab: list of unique words in dataset, with rare words collapsed into 'UNK'
    dic: mapping from each vocabulary word to its index
    sen_tweets: tokenized reviews without stop words, rare words replaced by 'UNK'
    """
    # tokenize each review
    sen_tweets=[word_tokenize(i) for i in data]
    # drop stop words; build new lists instead of removing items while iterating,
    # which would skip the element after every removal
    stop=set(stopwords.words('english'))
    sen_tweets=[[w for w in sen if w not in stop] for sen in sen_tweets]
    # count every token once instead of calling list.count per word
    counts=Counter(w for sen in sen_tweets for w in sen)
    # words seen only once are treated as unknown
    vocab=[w for w,c in counts.items() if c>1]
    known=set(vocab)
    sen_tweets=[[w if w in known else 'UNK' for w in sen] for sen in sen_tweets]
    vocab.append('UNK')
    # word -> row/column index in the co-occurrence matrix
    dic={word:i for i,word in enumerate(vocab)}
        
    print(len(vocab))
    return vocab,dic,sen_tweets

vocab,dic,train=getVocabulary(train_tweets)



In [None]:
#tokenizing sentences
sen_tokens=[word_tokenize(i) for i in test_tweets]
lst=list(set(stopwords.words('english')))
test=[]
tokens=[]
for i in sen_tokens:
    for j in i:
        if j not in lst:
            tokens.append(j)
    test.append(tokens)
    tokens=[]

test_vocab=[]
for i in sen_tokens:
    for j in i:
        test_vocab.append(j)

test_vocab=list(set(test_vocab))       
len(test_vocab)

### Pointwise Mutual Information
Pointwise mutual information, or PMI, is the (unweighted) term that occurs inside of the summation of mutual information and measures the correlation between two specific events. Specifically, PMI is defined as<br>
$$PMI(a, b) = \log \frac{p(a,b)}{p(a)p(b)}$$

and measures the (log) ratio of the joint probability of the two events as compared to the joint probability of the two events assuming they were independent. Thus, PMI is high when the two events a and b co-occur with higher probability than would be expected if they were independent.
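
As a small numerical illustration of the definition above, the sketch below evaluates PMI for two made-up sets of probabilities. The numbers and the helper name `pmi` are hypothetical, and the natural logarithm is assumed since the text does not fix a base.

```python
import math

def pmi(p_ab, p_a, p_b):
    """Pointwise mutual information of two events (natural log assumed)."""
    return math.log(p_ab / (p_a * p_b))

# Independent events: the joint probability equals the product, so PMI is 0.
independent = pmi(0.125, 0.5, 0.25)
# Events that co-occur twice as often as independence predicts: PMI is log 2.
associated = pmi(0.25, 0.5, 0.25)

print(independent, associated)
```

A positive value means the two events appear together more often than chance would suggest; a negative value means they appear together less often.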

If we suppose that a and b are words, we can measure how likely we are to see a and b together compared to what we would expect if they were unrelated by computing their PMI under some model for the joint probability $p(a,b)$.

Let D represent a collection of observed word-context pairs (with contexts being other words). We can construct D by considering the full context of a specific word occurrence as the collection of all word occurrences that appear within a fixed-size window of length L before and after it.

For a specific word $w_i$ in position i in a large, ordered collection of words $w_1, w_2, \ldots$, we would have the context $w_{i-L}, \ldots, w_{i-1}, w_{i+1}, \ldots, w_{i+L}$, and could thus collect counts (a total of 2L) of each of those words as appearing in the context of word $w_i$. We will refer to $w_i$ as the “target word” and the words appearing in the L-sized window around $w_i$ as “context words”.

Consider a sample corpus containing only one sentence:<br>
    <center><blockquote>Encumbered forever by desire and ambition</blockquote></center>

We can construct D by considering each word position i and extracting the pairs $(w_i, w_{i+k})$ for $-L \le k \le L,\ k \ne 0$. In such a pair, we would call $w_i$ the “target word” and $w_{i+k}$ the “context word”.

For example, we would extract the following pairs for $i=4$ if we let our window size $L=2$:<br>
    <center><blockquote>(desire,forever),(desire,by),(desire,and),(desire,ambition)</blockquote></center>

Similarly, for $i=5$, we would extract the following pairs:
    <center><blockquote>(and,by),(and,desire),(and,ambition)</blockquote></center>
Let $n_{w,c}$ represent the number of times we observe word type c in the context of word type w. We can then define $n_w = \sum_{c'} n_{w,c'}$ as the number of times we see a “target” word w in the collection of pairs D and $n_c = \sum_{w'} n_{w',c}$ as the number of times we see the context word c in the collection of pairs D.
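
The pair extraction and counts described above can be sketched in plain Python. The function name `extract_pairs` is illustrative, and the sample sentence is lowercased to match the pairs listed in the example.

```python
from collections import Counter

def extract_pairs(tokens, L=2):
    """Return all (target, context) pairs within a window of L words on each side."""
    pairs = []
    for i, target in enumerate(tokens):
        # Clip the window at the sentence boundaries.
        for j in range(max(0, i - L), min(len(tokens), i + L + 1)):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

sentence = "encumbered forever by desire and ambition".split()
pairs = extract_pairs(sentence, L=2)       # the collection D
pair_counts = Counter(pairs)               # n_{w,c} for every observed pair

desire_pairs = [p for p in pairs if p[0] == "desire"]
and_pairs = [p for p in pairs if p[0] == "and"]
print(desire_pairs)
print(and_pairs)
print(len(pairs))  # |D|
```

Words near the ends of the sentence contribute fewer than 2L pairs, which is why "and" yields three pairs rather than four.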

We can then define the joint probability of a word and a context word as
    $$p(w, c) = \frac{n_{w,c}}{|D|}$$

where $|D|$ is simply the total number of word-context occurrences we see. Similarly, we can define
    $$p(w) = \frac{n_w}{|D|}$$

and $$p(c) = \frac{n_c}{|D|}$$

and thus the PMI between a word w and context word c is
$$PMI(w, c) = \log \frac{p(w,c)}{p(w)p(c)} = \log \frac{n_{w,c} \cdot |D|}{n_w \cdot n_c}.$$

If we compute the PMI between all pairs of words in our vocabulary V, we will arrive at a large, real-valued matrix. However, some of the values of this matrix will be $\log 0$ when the word-context pair $(w,c)$ is unobserved, which results in $-\infty$ being computed. To remedy this, we can use the positive pointwise mutual information (PPMI), which clips every negative value (including $-\infty$) to 0, so it is equal to 0 whenever $n_{w,c} = 0$:
    $$PPMI(w,c) = \max(0, PMI(w,c))$$
<br>
This wonderful explanation is by <a href="http://czhai.cs.illinois.edu/">Dr. ChengXiang ("Cheng") Zhai</a><br><br>
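
To make the formulas concrete, here is a minimal NumPy sketch that applies them to a tiny count matrix. The counts are hypothetical and the function name `ppmi` is illustrative; it is not part of the explanation credited above.

```python
import numpy as np

def ppmi(counts):
    """Positive PMI for a word-by-context count matrix."""
    counts = np.asarray(counts, dtype=float)
    total = counts.sum()                       # |D|
    n_w = counts.sum(axis=1, keepdims=True)    # row totals n_w
    n_c = counts.sum(axis=0, keepdims=True)    # column totals n_c
    # log(0) and 0/0 would raise warnings; silence them and clean up below.
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(counts * total / (n_w * n_c))
    pmi[~np.isfinite(pmi)] = 0.0               # unobserved pairs become 0
    return np.maximum(pmi, 0.0)                # clip negative PMI to 0

# Hypothetical 3-word example; rows are targets, columns are contexts.
toy_counts = np.array([[0, 2, 0],
                       [2, 0, 1],
                       [0, 1, 0]])
toy_ppmi = ppmi(toy_counts)
print(toy_ppmi)
```

In this example $|D| = 6$ and every observed pair occurs twice as often as independence would predict, so each non-zero entry equals $\log 2$.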

<center><b>HINT: Consult your slides and see the example of how the formulas are used. You can calculate $|D|$ by the formula given in the slides (it's the same thing).</b></center>

If you're having trouble implementing this, here is some [motivation](https://www.youtube.com/watch?v=TsyM5jP7RQk)

### Create a co-occurrence matrix with +,- k window size
Hint: Use the ngrams package from [nltk](https://www.nltk.org/) to make life easier. Matrix size is vocab X vocab.
Please keep track of the order of words in the matrix; this will be useful later.

In [None]:
import nltk
import pandas as pd

def coOccuranceMatrix(sen_tokens,vocab,dic,k=2):
    """
    Return co-occurrence matrix with raw counts

    Args:
        sen_tokens : tokenized dataset
        vocab : vocabulary
        dic : mapping from word to matrix index
        k : window size on each side of the target word
    Returns: co_matrix
    co_matrix: co-occurrence matrix
    """
    matrix = np.zeros((len(vocab),len(vocab)))

    # for every target position, count each word within k positions on either side;
    # clipping the range handles sentence boundaries and tweets shorter than the window
    for sen in sen_tokens:
        for i,target in enumerate(sen):
            lo=max(0,i-k)
            hi=min(len(sen),i+k+1)
            for j in range(lo,hi):
                if j!=i:
                    matrix[dic[target],dic[sen[j]]]+=1

    return matrix

matrix=coOccuranceMatrix(train,vocab,dic)



In [None]:

def ppmiMatrix(co_matrix):
    """
    Return PPMI matrix computed from the co-occurrence counts

    Args:
        co_matrix : co-occurrence matrix
    Returns: ppmi_matrix
    ppmi_matrix: co-occurrence matrix with ppmi values
    """
    col_totals = co_matrix.sum(axis=0)
    total = col_totals.sum()
    row_totals = co_matrix.sum(axis=1)
    expected = np.outer(row_totals, col_totals) / total
    
    #adapted from stackoverflow
    # zero counts give log(0) = -inf and empty rows/columns give 0/0 = nan;
    # silence the warnings here and map both cases to 0 below
    with np.errstate(divide='ignore', invalid='ignore'):
        ppmi_matrix = np.log(co_matrix / expected)
    ppmi_matrix[~np.isfinite(ppmi_matrix)] = 0
    ppmi_matrix[ppmi_matrix < 0] = 0
    
    return ppmi_matrix
    

ppmatrix=ppmiMatrix(matrix)
ppmatrix

Code for SVD has been provided for you; all you have to do is specify how many of the top singular values, i.e. how many top <b>n</b> dimensions, you want to keep. Check the dimensions of the returned matrix using the <blockquote>.shape</blockquote> attribute to figure out whether the embedding for each word is in a row or a column. By our calculation the vocab count should be less than five thousand; reduce the dimensionality to less than one thousand.
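
The shape check suggested above can be tried on a toy matrix first. This sketch uses NumPy's exact `np.linalg.svd` as a stand-in for the randomized SVD in the provided code, and the random symmetric matrix is a hypothetical substitute for a real PPMI matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, n = 6, 2

# Hypothetical stand-in for a PPMI matrix (symmetric, non-negative).
A = rng.random((vocab_size, vocab_size))
toy_matrix = (A + A.T) / 2

# Full SVD, then keep the columns for the n largest singular values
# (NumPy returns the singular values in descending order).
U, Sigma, VT = np.linalg.svd(toy_matrix)
embeddings = U[:, :n]

print(embeddings.shape)  # (6, 2)
```

The result has one row per word and one column per kept dimension, so the embedding of the word with index `i` is `embeddings[i]`.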

In [None]:
e_size=len(vocab)

In [None]:
#code provided 
def denseMatrixViaSVD(ppmi_co_matrix,n):
    """
    Return reduced dimensionality co-occurance matrix by applying svd

    Args:
        ppmi_matrix : co-occurance matrix with ppmi counts
        
    Returns: svd_co_matrix
    svd_co_matrix: reduced dimensionality co-occurance matrix
    """
#     top_n_eigenvalues=
    U, Sigma, VT = randomized_svd(ppmi_co_matrix, 
                              n_components=n,
                              n_iter=5,
                              random_state=None)
    svd_co_matrix=U
    return svd_co_matrix
dense=denseMatrixViaSVD(ppmatrix,e_size)
dense 



### Modelling
Now that we have our embeddings, let's use them to train a feed forward neural network for our sentiment classification task. Since a feed forward network's input layer has a fixed size, we need to create a fixed size representation for each review. For this purpose we will use the following:
<ul>
    <li>Average pooling.</li>
    <li>Averaging pooling algorithm by FastText(provided)</li>
    <li>Max pooling. </li>
</ul>
For those of you who are familiar with Convolutional Neural Networks, this pooling is a 1D pooling operation. See the illustrated example below:<img src="pooling.png">
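
As a rough illustration of average and max pooling over word embeddings, the sketch below pools a made-up three-word review. The vectors are hypothetical; real embeddings would come from the SVD step.

```python
import numpy as np

# Hypothetical embeddings for a three-word review, embedding size 4.
word_vectors = np.array([[0.1, 0.8, -0.3, 0.0],
                         [0.5, 0.2,  0.4, 0.0],
                         [0.3, 0.5, -0.1, 1.0]])

avg_pooled = word_vectors.mean(axis=0)  # element-wise mean over the words
max_pooled = word_vectors.max(axis=0)   # element-wise max over the words

print(avg_pooled.shape, max_pooled.shape)
```

Both results have the embedding size regardless of how many words the review contains, which is what gives the network a fixed size input.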

Since we can't have a Keras tutorial due to the coronavirus, a simple feed forward network has been provided for you. You need to create train_X, test_X, train_Y and test_Y; these should be numpy arrays in order for Keras to use them.
<ul>
    <li>train_X= contains embedding representations of all the reviews in the train set</li>
    <li>test_X= contains embedding representations of all the reviews in the test set</li>
    <li>train_Y= contains <b>one hot</b> representations of train labels</li>
    <li>test_Y= contains <b>one hot</b> representations of test labels</li>   
</ul>
To construct one hot representation you can use the sklearn's preprocessing package or the preprocessing package from keras. Read online.
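
For intuition about what a one hot representation looks like, here is a minimal sketch in plain NumPy rather than the sklearn or Keras utilities mentioned above. The helper name `one_hot` and the toy labels are hypothetical.

```python
import numpy as np

def one_hot(labels):
    """Return (classes, matrix): sorted unique classes and one one-hot row per label."""
    classes, ids = np.unique(labels, return_inverse=True)
    return classes, np.eye(len(classes))[ids]

toy_labels = ["pos", "neg", "neu", "neg"]
classes, encoded = one_hot(toy_labels)

print(classes)
print(encoded)
```

The classes are sorted alphabetically, so `neg`, `neu` and `pos` receive ids 0, 1 and 2. Whatever encoder you use, keep track of this order because it is needed later to name the predicted classes.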

In [None]:
# CONSTRUCT ONE HOT REPRESENTATION
#[neg0,neu1,pos2]
def make_labels(train_labels):
    from sklearn.preprocessing import LabelEncoder
    from keras.utils import to_categorical
    label_encoder = LabelEncoder()
    integer_encoded = label_encoder.fit_transform(train_labels)
    return to_categorical(integer_encoded)
labels=make_labels(train_labels)
print(labels[1],train_labels[1])
labels1=make_labels(test_labels)

In [None]:
#Fast text averaging, pass a list of word embeddings and embedding size to fasttextAveraging function
def l2_norm(x):
   return np.sqrt(np.sum(x**2))

def div_norm(x):
   norm_value = l2_norm(x)
   if norm_value > 0:
       return x * ( 1.0 / norm_value)
   else:
       return x
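def fasttextAveraging(embedding_list,embedding_size):
    norm=np.zeros(embedding_size)
    # an empty review (e.g. only stop words) maps to the zero vector instead of dividing by zero
    if len(embedding_list)==0:
        return norm
    for emb in embedding_list:
        norm=norm+div_norm(emb) 
    return norm/len(embedding_list)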

tweets=[]
tweets1=[]
tw=[]
for j in train:
    for i in j:
        tw.append(dense[dic[i]])
    ans=fasttextAveraging(tw,e_size)
    tweets.append(ans)
    tw=[]
tweets=np.array(tweets)

matrix1=np.array((len(test_vocab),len(test_vocab)))

for i in test:
    for j in i:
        if j not in vocab:
            tw.append(dense[dic['UNK']])
        else:
            tw.append(dense[dic[j]])
    ans=fasttextAveraging(tw,e_size)
    tweets1.append(ans)
    tw=[]
tweets1=np.array(tweets1)
tweets.shape



In [None]:
tweets.shape

In [None]:
def averagePooling(embedding_list,embedding_size):
    """
    Return average embedding vector from list of embedding
    Args:
        embedding_list : embedding list
        embedding_size: size of embedding vector
    Returns: average_embedding
    average_embedding: average embedding vector
    """
        
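    # an empty review has no embeddings to average; return the zero vector
    if len(embedding_list)==0:
        return np.zeros(embedding_size)
    return np.mean(embedding_list,axis=0)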

avg_tweets=[]
avg_tweets1=[]
tw=[]
for j in train:
    for i in j:
        tw.append(dense[dic[i]])
    ans=averagePooling(tw,e_size)
    avg_tweets.append(ans)
    tw=[]
avg_tweets=np.array(avg_tweets)


for i in test:
    for j in i:
        if j not in vocab:
            tw.append(dense[dic['UNK']])
        else:
            tw.append(dense[dic[j]])
    ans=averagePooling(tw,e_size)
    avg_tweets1.append(ans)
    tw=[]
avg_tweets1=np.array(avg_tweets1)






In [None]:
def maxPooling(embedding_list,embedding_size):
    """
    Return maxpooling embedding vector from list of embedding
    Args:
        embedding_list : embedding list
        embedding_size: size of embedding vector
    Returns: max_embedding
    max_embedding: maxpooled embedding vector
    """
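    # an empty review has no embeddings to pool; return the zero vector
    if len(embedding_list)==0:
        return np.zeros(embedding_size)
    # element-wise maximum across all word embeddings in the review
    return np.max(np.stack(embedding_list),axis=0)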

max_tweets=[]
max_tweets1=[]
tw=[]
for j in train:
    for i in j:
        tw.append(dense[dic[i]])
    ans=maxPooling(tw,e_size)
    max_tweets.append(ans)
    tw=[]
max_tweets=np.array(max_tweets)
print(max_tweets[0])

for i in test:
    for j in i:
        if j not in vocab:
            tw.append(dense[dic['UNK']])
        else:
            tw.append(dense[dic[j]])
    ans=maxPooling(tw,e_size)
    max_tweets1.append(ans)
    tw=[]
max_tweets1=np.array(max_tweets1)


Try using all three representations to train the model and check which one works best. You can play around with the embedding size by controlling <b>n</b> in the SVD function, and for the model you can add or remove layers or change the number of neurons in the hidden layers. Keep in mind that the layers should decrease in size as we go deeper into the network; theoretically this means that we are constructing complex features in a lower dimensional space from less complex features in a higher dimensional space.<br><br>
Issues related to overfitting will be properly addressed in the next assignment. For now you are free to choose the number of epochs; try to find one that trains the model sufficiently but does not overfit it.

In [None]:
from keras.callbacks import ModelCheckpoint, EarlyStopping, CSVLogger, ReduceLROnPlateau
filepath = "setting_" + "model1" + ".hdf5"
logfilepath = "setting_"+"model1" + ".csv"
reduce_lr_rate=0.2
logCallback = CSVLogger(logfilepath, separator=',', append=False)
earlyStopping = EarlyStopping(monitor='acc', min_delta=0, patience=10, verbose=0, mode='max')
checkpoint = ModelCheckpoint(filepath, monitor='acc', save_weights_only=True, verbose=1,
                             save_best_only=True, mode='max')
reduce_lr = ReduceLROnPlateau(monitor='acc', factor=reduce_lr_rate, patience=10,
                              cooldown=0, min_lr=0.0000000001, verbose=0)

callbacks_list = [logCallback, earlyStopping, reduce_lr, checkpoint]

import tensorflow as tf
from tensorflow import keras
embedding_size=e_size
model = keras.Sequential([
    keras.layers.Flatten(input_shape=(embedding_size,)),#donot change(input layer)
    keras.layers.Dense(500, activation='relu'),#(hidden layer)
    keras.layers.Dense(100, activation='relu'),#(hidden layer)
    keras.layers.Dense(3)#donot change
])
adam=keras.optimizers.Adam(lr=0.00001)
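# the output layer has no softmax, so the loss must be told it receives raw logits
model.compile(optimizer=adam,
              loss=keras.losses.CategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])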

model.summary()

In [None]:
model.fit(avg_tweets,labels, epochs=15, batch_size=32,
               verbose=1,shuffle=True,callbacks=callbacks_list)

In [None]:
labelList=['neg','neu','pos']

Use the <b>model.predict</b> method to get predictions. Each prediction is a vector with one score per label; to get the predicted class, take the index of the largest value in the vector.<br> To run the code below you need to construct a list of unique labels; the list should be ordered by the id assigned to each class when you were constructing the one hot representation.
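
The step from prediction vectors to class names can be sketched as follows. The score values are made up for illustration, and `label_list` mirrors the ordering used in the cell above.

```python
import numpy as np

label_list = ["neg", "neu", "pos"]  # order must match the one-hot class ids

# Hypothetical model outputs for three examples (one row per example).
toy_predictions = np.array([[0.7, 0.2, 0.1],
                            [0.1, 0.3, 0.6],
                            [0.2, 0.5, 0.3]])

class_ids = np.argmax(toy_predictions, axis=1)    # index of the largest score per row
class_names = [label_list[i] for i in class_ids]

print(class_ids, class_names)
```

Calling `np.argmax` with `axis=1` processes all rows at once, which is equivalent to looping over the predictions one by one.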

In [None]:
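# the model was trained on the average-pooled vectors, so predict on the matching test representation
predictions = model.predict(avg_tweets1)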
pred=[]
for i in predictions:
    pred.append(np.argmax(i))
predictions=pred

from sklearn.metrics import confusion_matrix

test_Y_max=np.argmax(labels1, axis=-1)
cm=confusion_matrix(test_Y_max,predictions)
cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
cm = pd.DataFrame(cm, labelList,labelList )# matrix,names row,names col,
# plt.figure(figsize=(10,7))
sn.set(font_scale=1.4) # for label size
sn.heatmap(cm, annot=True, annot_kws={"size": 11}, fmt=".2f") # font size
plt.show()



In [None]:
import sklearn
print("Classification Report\n",sklearn.metrics.classification_report(test_Y_max, predictions, labels=[0,1,2], target_names = labelList))

## Prediction based Embeddings
For prediction based embeddings we will use the IMDB dataset. We will create our embeddings using the unlabeledTrainData.tsv file.
We will use the Word2Vec model that we have already covered in class. 

### Preprocessing
Since we have already discussed preprocessing trade-offs in previous assignments, we expect you to analyse the data and perform the preprocessing that is required.<br> 
<b> Hint: Each review is in string format, so slashes are used to escape characters and br tags to mark line breaks</b>

In [None]:
#load the data
import pandas as pd 
unlabeled=pd.read_csv('unlabeled.txt',sep="\t",error_bad_lines=False,encoding='utf-8')
unlabeled=list(unlabeled['review'])
labeled=pd.read_csv('labeledTrainData.tsv',sep='\t',encoding='utf-8')
data=list(labeled['review'])
labels=list(labeled['sentiment'])


In [None]:
import re
def preprocessing(data):
    """
    Return preprocessed data

    Args:
        data : reviews
    
    Returns: preprocessed_data
    preprocessed_data : preprocessed dataset 
    """
    for i,text in enumerate(data):
        data[i]=text.lower()
    
    
    for i,text in enumerate(data):
        data[i]=re.findall(r'[A-Za-z]+',text)
    
    lst=list(set(stopwords.words('english')))
    tokens=[]
    all_tokens=[]
    for num,i in enumerate(data):
        for j in i:
            if len(j)<=2:
                continue
            if j in lst:
                continue
            else:
                tokens.append(j)
        all_tokens.append(tokens)
        tokens=[]
        
    data=all_tokens
        
    
    
    return data
data= preprocessing(data)
unlabeled=preprocessing(unlabeled)



In [None]:
train_reviews,test_reviews,train_labels,test_labels=testTrainSplit(data,labels)
len(train_reviews)


In [None]:
def get_voc(data):
    vocab=[]
    for i in data:
        for j in i:
            vocab.append(j)
    vocab=list(set(vocab))
    return vocab
unlabeled_voc=get_voc(unlabeled)

vocab_dic={}
for i,v in enumerate(unlabeled_voc):
    vocab_dic[v]=i


Use [gensim](https://radimrehurek.com/gensim/models/word2vec.html) to train a Word2Vec model. Keep the dimensionality at 300 and the window size at 2. After training, use the model and the previously coded methods to create vectorial representations for the movie reviews. <i>(create train_X, test_X, train_Y and test_Y)</i>

In [None]:
from gensim.models import Word2Vec
def trainWord2Vec(data):
    """
    Return a Word2Vec model trained on the given reviews

    Args:
        data : movie reviews
    
    Returns: model
    model : Word2Vec model 
    """
    path = get_tmpfile("word2vec.model")
    model = Word2Vec(data, size=300, window=2, min_count=1, workers=4)
    model.save("word2vec.model")
    return model
model=trainWord2Vec(unlabeled)

In [None]:
#load the train and test files and create the vectorial representations

In [None]:
tw=[]
train=[]
for i in train_reviews:
    for j in i:
        try:
            word=vocab_dic[j]
        except:
            continue
        tw.append(model.wv[j])
    ans=fasttextAveraging(tw,300)
    train.append(ans)
    tw=[]

train=np.array(train)


In [None]:
tw=[]
test=[]
for i in test_reviews:
    for j in i:
        try:
            word=vocab_dic[j]
        except:
            continue
        tw.append(model.wv[j])
    # use the same pooling as the training set so train and test features are comparable
    ans=fasttextAveraging(tw,300)
    test.append(ans)
    tw=[]

test=np.array(test)
    

In [None]:
labels=make_labels(train_labels)
labels1=make_labels(test_labels)
labels.shape
test.shape

Since this is a dense representation, we won't be faced with the challenges posed by sparse representations. We can move on to modelling.

### Modelling


In [None]:
from keras.callbacks import ModelCheckpoint, EarlyStopping, CSVLogger, ReduceLROnPlateau
filepath = "setting_" + "model1" + ".hdf5"
logfilepath = "setting_"+"model1" + ".csv"
reduce_lr_rate=0.2
logCallback = CSVLogger(logfilepath, separator=',', append=False)
earlyStopping = EarlyStopping(monitor='loss', min_delta=0, patience=3, verbose=0, mode='min')
checkpoint = ModelCheckpoint(filepath, monitor='loss', save_weights_only=True, verbose=1,
                             save_best_only=True, mode='min')
reduce_lr = ReduceLROnPlateau(monitor='loss', factor=reduce_lr_rate, patience=10,
                              cooldown=0, min_lr=0.0000000001, verbose=0)

callbacks_list = [logCallback, earlyStopping, reduce_lr, checkpoint]

import tensorflow as tf
from tensorflow import keras
embedding_size=300
model_word2vec = keras.Sequential([
    keras.layers.Flatten(input_shape=(embedding_size,)),#donot change(input layer)
    keras.layers.Dense(300, activation='relu'),#(hidden layer)
    keras.layers.Dense(50, activation='relu'),#(hidden layer)
    keras.layers.Dense(2)#donot change
])
adam=keras.optimizers.Adam(lr=0.00001)
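# the output layer has no softmax, so the loss must be told it receives raw logits
model_word2vec.compile(optimizer=adam,
              loss=keras.losses.CategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])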

model_word2vec.summary()

In [None]:
model_word2vec.fit(train,labels, epochs=15, batch_size=32,
               verbose=1,shuffle=True)

In [None]:
predictions = model_word2vec.predict(test)
pred=[]
for i in predictions:
    pred.append(np.argmax(i))
predictions=np.array(pred)


from sklearn.metrics import confusion_matrix
labelList=['<5','>7']
test_Y_max=np.argmax(labels1, axis=-1)
cm=confusion_matrix(test_Y_max,predictions)
cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
cm = pd.DataFrame(cm, labelList,labelList )# matrix,names row,names col,
# plt.figure(figsize=(10,7))
sn.set(font_scale=1.4) # for label size
sn.heatmap(cm, annot=True, annot_kws={"size": 11}, fmt=".2f") # font size
plt.show()


In [None]:
print("Classification Report\n",sklearn.metrics.classification_report(test_Y_max, predictions, labels=[0,1], target_names = labelList))

# Theory
There are two major research papers: [Word2Vec](https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf) for prediction based embeddings and [GloVe](https://nlp.stanford.edu/pubs/glove.pdf) for frequency based embeddings. Research online and write a short note on the trade-offs associated with the two types of embeddings. 

###_______________Answer________________###

Word2Vec performs better than the GloVe model. Word2Vec offers the best vector representation in a lower dimensional space. GloVe's performance can be made similar to Word2Vec's by increasing the dimensionality of the semantic space. GloVe performs better when the dataset involves a wide variety of grammatical forms and complex morphology.











#### Ending Note:
Feed forward networks are not well suited to natural language tasks because of their fixed input sizes: the length of the text in each example of a dataset can vary considerably. Feed forward networks also ignore the temporal nature of natural language text, which means they cannot capture context or interdependencies between words for semantic information. To address this, researchers have invented recurrent neural networks, which help to alleviate these limitations.
The next assignment will be related to recurrent neural networks.

# We hope all of you are working on your projects!

:(