# Assignment 3 CS 5316 Natural Language Processing
For this assignment we will use the following packages:
<ul>
    <li><a href="https://radimrehurek.com/gensim/">Gensim</a>.</li>
    <li><a href="https://keras.io/">Keras</a>.</li>
    <li><a href="https://www.tensorflow.org/">Tensorflow</a>.</li>
</ul>
You can install these packages via Anaconda Navigator or with the conda install / pip install commands, e.g.<br>
<blockquote>pip install gensim<br>
pip install tensorflow<br>
pip install keras</blockquote>

In [8]:
import numpy as np
from IPython.display import Image
# Get the interactive Tools for Matplotlib
%matplotlib notebook
import matplotlib.pyplot as plt
plt.style.use('ggplot')
from sklearn.decomposition import PCA
from gensim.test.utils import datapath, get_tmpfile
from gensim.models import KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec
from sklearn.decomposition import TruncatedSVD
from sklearn.utils.extmath import randomized_svd
import pandas as pd
import re, string
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.util import ngrams
from sklearn.model_selection import train_test_split

# Word Embeddings

Word Vectors nowadays are often used as a fundamental component for downstream NLP tasks, e.g. question answering, text generation, language translation, sentiment analysis, etc. The goal of word embedding methods is to derive a low-dimensional continuous vector representation for words so that words that are syntactically or semantically related are close together in that vector space and thus, share a similar representation.

In this assignment we are going to explore different word embeddings in order to build some intuition about their strengths and weaknesses. Although there are many types of word embeddings, they can be broadly classified into two categories:
<ul>
    <li>Frequency based Embedding</li>
    <li>Prediction based Embedding</li>
</ul>
For frequency-based embeddings we will explore embeddings based on <b>word co-occurrence</b> counts with <a href="https://en.wikipedia.org/wiki/Pointwise_mutual_information">Positive Pointwise Mutual Information (PPMI)</a> and <a href="https://en.wikipedia.org/wiki/Singular_value_decomposition">Singular Value Decomposition (SVD)</a>.
<a href="https://www.youtube.com/watch?v=P5mlg91as1c">SVD video explanation</a><br>
For prediction-based embeddings we will explore <a href="https://en.wikipedia.org/wiki/Word2vec">Word2Vec</a> based embeddings.

For evaluating these embeddings we will work with the following two datasets: 
<ul>
    <li>Twitter dataset created by Sanders Analytics, which we explored in the previous assignment <b>(file provided)</b></li>
    <li>Movie reviews dataset from the popular website <a href="https://www.imdb.com/">IMDB</a>.
        Head over to <a href="https://www.kaggle.com/c/word2vec-nlp-tutorial/data">Kaggle</a> and download the dataset from there. The dataset consists of three files:<br><b>labeledTrainData, unlabeledTrainData, testData</b></li>
</ul>
Read the "Data" section on kaggle for details on the dataset.

Let's get started.......<br>

## Frequency-based Embeddings
For this part we will use the Sanders Analytics dataset to create embeddings, since the other dataset is large enough that we might run into memory problems.<br><br><br>
Although we could use the rows of the word co-occurrence matrix directly as word representations, it is generally not a good idea to do so, for the following reasons:
<ul>
    <li>The word co-occurrence matrix scales quadratically with vocabulary size, which is problematic for large datasets under memory constraints. For example, the IMDB dataset has a vocabulary of 225109 words after removing stop words, so a dense matrix with 4-byte entries requires roughly 189 GiB (about 203 GB) of storage.</li><img src="memoryerror.png">
    <li> The word co-occurrence matrix will be quite sparse, meaning many entries in the matrix will be zeros. This is problematic because many NLP tasks rely heavily on multiplication, e.g. the word similarity task uses cosine similarity:<img src="cosine-equation.png"> Here the dot product is computed between two word vectors, and multiplication with zeros wastes precious computation power and your time.</li>
    <li> High co-occurrence counts for stop words and conjunctions distort the true representation of words, meaning their counts could become a dominant factor when these embeddings are used in computations. They also don't provide much information, as their counts with most other words are high as well.</li>
</ul>
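
The cosine similarity mentioned in the list above can be sketched as follows. This is a minimal illustration, not part of the provided assignment code: the function name `cosine_similarity` and the toy vectors are assumptions chosen for the example. Note how most of the products in the dot product involve a zero when the vectors are sparse.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom > 0 else 0.0

# Toy sparse count vectors (illustrative values only)
word_a = [0, 3, 0, 4]
word_b = [0, 3, 0, 4]
word_c = [5, 0, 0, 0]

print(cosine_similarity(word_a, word_b))  # identical direction
print(cosine_similarity(word_a, word_c))  # no shared non-zero dimension
```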
In summary, you want to avoid sparse representations just like the coronavirus.<br>
To mitigate the above problems we will use PPMI and SVD. PPMI is used to dampen high co-occurrence counts and SVD is used to reduce dimensionality.
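
As a quick sanity check of the memory figure quoted above, here is a back-of-the-envelope calculation. It assumes a dense matrix with 4-byte entries (for example int32 or float32); that entry width is an assumption, and 8-byte float64 entries would double the result.

```python
# Size of a dense vocab x vocab co-occurrence matrix.
vocab_size = 225109      # IMDB vocabulary size quoted in the text
bytes_per_entry = 4      # assumed entry width (int32 / float32)

total_bytes = vocab_size ** 2 * bytes_per_entry

print(f"{total_bytes / 1e9:.0f} GB")      # decimal gigabytes
print(f"{total_bytes / 2**30:.0f} GiB")   # binary gibibytes
```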

    

### Preprocessing
Since we have already discussed preprocessing trade-offs in previous assignments, we expect you to analyse the data and perform the preprocessing that is required.<br> 

In [9]:
#load the dataset
def load_data(filename):
    """
    Load data from file

    Args:
        filename : Name of the file from which the data is to be loaded
    
    Returns: tweet_X, sentiment_Y
    tweet_X: list of tweets
    sentiment_Y: list of sentiment labels corresponding to each tweet
    """
    data = pd.read_csv(filename)
    tweets = data['text']
    labels = data['class']
#     print(labels)
    return list(tweets), list(labels)

In [10]:
def striphtml(data):
    p = re.compile(r'<.*?>')
    return p.sub('', data)

def replaceAtTag(data):
    return re.sub(r'@\w+', r'', data)

def removeHashTag(data):
    return re.sub(r'#', r'', data)
    
def removePunc(data):
    return data.translate(str.maketrans('', '', string.punctuation))

def preprocessing(data_X):
    """
    Perform preprocessing of the tweets

    Args:
        data_X : list of tweets
    
    Returns: processedList: list of tokenized, preprocessed tweets
    """
    processedList = []
    
    for tweet in data_X:
        p1 = striphtml(tweet) # HTML tags removed
        p2 = replaceAtTag(p1) # @mentions removed
        p3 = removeHashTag(p2) # '#' stripped, hashtag word kept
        p4 = p3.lower() # lowercase
        p5 = removePunc(p4) # punctuation removed
        p6 = word_tokenize(p5) # tokenization
        processedList.append(p6)
    return processedList

# tweet_X_task2=preprocessing(tweet_X_task2)

# def preprocessing(data_X):
#     """
#     Return preprocessed data

#     Args:
#         data : reviews
    
#     Returns: preprocessed_data
#     preprocessed_data : preprocessed dataset 
#     """
# #     preprocessed_data=
#     print(data_X.shape)

tweet_X, sentiment_label_Y=load_data("twitter-sanders-apple3.csv")
tweet_X_Backup = tweet_X[:]

# tweet_X = ['hello whatsup bro how are you bro', 'haha I am good bro whatsup', 'are you happy bro']
# sentiment_label_Y = ['Pos', 'Neg', 'Pos']

tweet_X=preprocessing(tweet_X)
print(tweet_X[:10])

[['now', 'all', 'has', 'to', 'do', 'is', 'get', 'swype', 'on', 'the', 'iphone', 'and', 'it', 'will', 'be', 'crack', 'iphone', 'that', 'is'], ['will', 'be', 'adding', 'more', 'carrier', 'support', 'to', 'the', 'iphone', '4s', 'just', 'announced'], ['hilarious', 'video', 'guy', 'does', 'a', 'duet', 'with', 's', 'siri', 'pretty', 'much', 'sums', 'up', 'the', 'love', 'affair', 'httptco8exbnqjy'], ['you', 'made', 'it', 'too', 'easy', 'for', 'me', 'to', 'switch', 'to', 'iphone', 'see', 'ya'], ['i', 'just', 'realized', 'that', 'the', 'reason', 'i', 'got', 'into', 'twitter', 'was', 'ios5', 'thanks'], ['im', 'a', 'current', 'user', 'little', 'bit', 'disappointed', 'with', 'it', 'should', 'i', 'move', 'to', 'or'], ['the', '16', 'strangest', 'things', 'siri', 'has', 'said', 'so', 'far', 'i', 'am', 'sooo', 'glad', 'that', 'gave', 'siri', 'a', 'sense', 'of', 'humor', 'httptcotwaeudbp', 'via'], ['great', 'up', 'close', 'personal', 'event', 'tonight', 'in', 'regent', 'st', 'store'], ['from', 'which',

### Test train split
Use train_test_split from sklearn.


In [11]:
def testTrainSplit(data_X,data_Y):
    """
    Returns test train split data

    Args:
        data_X : list of preprocessed tweets
        data_Y : list of sentiment labels
    
    Returns: train_features, test_features, train_labels, test_labels
    """
#     print(sentiment_label_Y_task1[0])
#     print(featurevectors[0])
    
    train_features, test_features, train_labels, test_labels = train_test_split(data_X, data_Y, test_size=0.2, random_state = 0)
    return train_features, test_features, train_labels, test_labels

# def testTrainSplit(data_X,data_Y):
#     """
#     Return test train data

#     Args:
#         data_X : reviews
#         data_Y: labels
#     Returns: test train split data 
#     """
#     pass

train_X, test_X, train_Y, test_Y=testTrainSplit(tweet_X, sentiment_label_Y)
# train_X = tweet_X
len(train_X)

790

Extract the vocabulary to find the dimensions of the co-occurrence matrix.

In [12]:
def getVocabulary(train_X):
    """
    Return dataset vocabulary

    Args:
        train_X : tokenized tweets in train dataset
    
    Returns: vocabulary
    vocabulary: list of unique words in dataset
    """
    # Flatten the tokenized tweets and keep each distinct word once.
    # Sorting fixes the word order, which must match the matrix rows/columns later.
    return sorted({word for tweet in train_X for word in tweet})

vocabulary = getVocabulary(train_X)
len(vocabulary)

2771

### Pointwise Mutual Information
Pointwise mutual information, or PMI, is the (unweighted) term that occurs inside of the summation of mutual information and measures the correlation between two specific events. Specifically, PMI is defined as<br>
$$PMI(a, b) = \log \frac{p(a,b)}{p(a)p(b)}$$

and measures the (log) ratio of the joint probability of the two events as compared to the joint probability of the two events assuming they were independent. Thus, PMI is high when the two events a and b co-occur with higher probability than would be expected if they were independent.

If we suppose that a and b are words, we can measure how likely we are to see a and b together, compared to what we would expect if they were unrelated, by computing their PMI under some model for the joint probability $p(a,b)$.

Let D represent a collection of observed word-context pairs (with contexts being other words). We can construct D by considering the full context of a specific word occurrence as the collection of all word occurrences that appear within a fixed-size window of length L before and after it.

For a specific word $w_i$ at position $i$ in a large, ordered collection of words $w_1, w_2, \ldots, w_N$, we would have the context $w_{i-L}, \ldots, w_{i-1}, w_{i+1}, \ldots, w_{i+L}$, and could thus collect counts (a total of $2L$) of each of those words as appearing in the context of word $w_i$. We will refer to $w_i$ as the “target word” and the words appearing in the L-sized window around $w_i$ as “context words”.

Consider a sample corpus containing only one sentence:<br>
    <center><blockquote>Encumbered forever by desire and ambition</blockquote></center>

We can construct D by considering each word position i and extracting the pairs $(w_i, w_{i+k})$ for $-L \le k \le L;\ k \ne 0$. In such a pair, we would call $w_i$ the “target word” and $w_{i+k}$ the “context word”.

For example, we would extract the following pairs for $i=4$ if we let our window size $L=2$:<br>
    <center><blockquote>(desire,forever),(desire,by),(desire,and),(desire,ambition)</blockquote></center>

Similarly, for $i=5$, we would extract the following pairs:
    <center><blockquote>(and,by),(and,desire),(and,ambition)</blockquote></center>
Let $n_{w,c}$ represent the number of times we observe word type c in the context of word type w. We can then define $n_w = \sum_{c'} n_{w,c'}$ as the number of times we see a “target” word w in the collection of pairs D and $n_c = \sum_{w'} n_{w',c}$ as the number of times we see the context word c in the collection of pairs D.
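
The pair extraction and the counts defined above can be sketched as follows. This is a minimal illustration on the one-sentence sample corpus; the helper name `extract_pairs` and the variable names are assumptions made for the example, not part of the assignment code.

```python
from collections import Counter

def extract_pairs(tokens, L=2):
    """Return all (target, context) pairs within a window of L words on each side."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - L), min(len(tokens), i + L + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

sentence = "encumbered forever by desire and ambition".split()
D = extract_pairs(sentence, L=2)

n_wc = Counter(D)                    # n_{w,c}: count of each (target, context) pair
n_w = Counter(w for w, _ in D)       # n_w: times w appears as a target
n_c = Counter(c for _, c in D)       # n_c: times c appears as a context

print([c for w, c in D if w == "desire"])  # contexts of position i=4
print([c for w, c in D if w == "and"])     # contexts of position i=5
print(len(D))                              # |D|
```

Words near the edges of the sentence get fewer than 2L context words, which is why "and" only has three pairs.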

We can then define the joint probability of a word and a context word as
    $$p(w, c) = \frac{n_{w,c}}{|D|}$$

where $|D|$ is simply the total number of word-context occurrences we see. Similarly, we can define
    $$p(w) = \frac{n_w}{|D|}$$

and $$p(c) = \frac{n_c}{|D|}$$

and thus the PMI between a word w and context word c is
$$PMI(w, c) = \log \frac{p(w,c)}{p(w)p(c)} = \log \frac{n_{w,c} \cdot |D|}{n_w \cdot n_c}.$$

If we compute the PMI between all pairs of words in our vocabulary V, we will arrive at a large, real-valued matrix. However, some of the values of this matrix will be $\log 0$: if the word-context pair $(w,c)$ is unobserved, the result is $-\infty$. To remedy this, we define a modified PMI that is equal to 0 when $n_{w,c} = 0$ and that also clips negative values to 0. This is the positive pointwise mutual information (PPMI):
    $$PPMI(w,c) = \max(0, PMI(w,c))$$
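
As a small worked check of the formulas above, the following sketch computes PPMI for a single word-context pair directly from counts. The function name `ppmi` and the example counts are illustrative assumptions; the natural logarithm is used here, and any other base only rescales the values.

```python
import math

def ppmi(n_wc, n_w, n_c, total_pairs):
    """PPMI from raw counts: max(0, log(n_wc * |D| / (n_w * n_c)))."""
    if n_wc == 0:
        return 0.0  # unobserved pair: avoid log(0) and define PPMI as 0
    pmi = math.log(n_wc * total_pairs / (n_w * n_c))
    return max(0.0, pmi)

# Hypothetical counts with |D| = 6
print(ppmi(1, 5, 1, 6))  # pair seen more often than chance: positive PMI is kept
print(ppmi(4, 5, 5, 6))  # pair seen less often than chance: negative PMI is clipped to 0
print(ppmi(0, 5, 1, 6))  # unobserved pair: 0 by definition
```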
<br>
This wonderful explanation is by <a href="http://czhai.cs.illinois.edu/">Dr. ChengXiang ("Cheng") Zhai</a><br><br>

<center><b>HINT: Consult your slides and see the example of how the formulas are used. You can calculate $|D|$ using the formula given in the slides (it's the same thing).</b></center>

If you're having trouble implementing this, here is some [motivation](https://www.youtube.com/watch?v=TsyM5jP7RQk)

### Create a co-occurrence matrix with a ±k window size
Hint: Use the ngrams function from [nltk](https://www.nltk.org/) to make life easier. The matrix size is vocab × vocab.
Please keep track of the order of words in the matrix; this will be useful later.
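
One possible way to build such a matrix is sketched below. It is a sketch under stated assumptions rather than the reference solution: the name `build_cooccurrence` is hypothetical, it slides the window by position instead of using nltk's ngrams, it counts every occurrence of a pair, and it drops out-of-vocabulary tokens before windowing. A word-to-index dictionary keeps track of which row and column belongs to which word.

```python
import numpy as np

def build_cooccurrence(tokenized_docs, vocab, k=2):
    """Count how often each vocab word appears within k positions of another."""
    index = {word: i for i, word in enumerate(vocab)}  # word -> row/column id
    matrix = np.zeros((len(vocab), len(vocab)))
    for tokens in tokenized_docs:
        ids = [index[t] for t in tokens if t in index]  # skip unknown words
        for pos, target in enumerate(ids):
            lo, hi = max(0, pos - k), min(len(ids), pos + k + 1)
            for ctx_pos in range(lo, hi):
                if ctx_pos != pos:
                    matrix[target, ids[ctx_pos]] += 1
    return matrix, index

docs = [["encumbered", "forever", "by", "desire", "and", "ambition"]]
vocab = sorted(set(docs[0]))
M, index = build_cooccurrence(docs, vocab, k=2)

print(M.shape)
print(M[index["desire"], index["ambition"]])  # within the window
print(M[index["encumbered"], index["ambition"]])  # too far apart
```

Each document is visited once, so the cost grows with the number of tokens times the window size rather than with the square of the vocabulary size.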

In [13]:
def getCount(target, context, gramed_tweets):
    count = 0
    for tweet in gramed_tweets:
        for window in tweet:
#             print(window , "T= ", target, " C=", context)
            if(target == context):
                if window.count(target) == 2:
                    count += 1
            else:
                if (target in window) and (context in window):
                    count += 1
    return count

def coOccuranceMatrix(train_X,vocab,k=2):
    """
    Return co-occurrence matrix (raw counts) for words within k positions of each other

    Args:
        train_X : tokenized tweets in train dataset
        vocab : vocabulary (fixes the row/column order)
    Returns: co_matrix
    co_matrix: co-occurrence matrix
    """
    gramed_tweets = [list(ngrams(tweet,k+1)) for tweet in train_X]
    paired_gramed_tweets = []
    
    for tweet in gramed_tweets:
        tweet_pairs = []
        for ngram in tweet:
            for word in ngram:
                for word2 in ngram:
                    if word != word2:
                        if (word2, word) not in tweet_pairs:
                            tweet_pairs.append((word, word2))
        tweet_pairs = list(set(tweet_pairs))
        paired_gramed_tweets.append(tweet_pairs)
#     print(paired_gramed_tweets)
    lenV = len(vocab)
    matrix = np.zeros(shape=(lenV,lenV))
    
    # Row i / column j follow the word order of vocab
    for (i, ColEle) in enumerate(vocab):
        for (j, RowEle) in enumerate(vocab):
            matrix[i,j] = getCount(ColEle, RowEle, paired_gramed_tweets)
    return matrix
#     print(matrix)

co_matrix = coOccuranceMatrix(train_X, vocabulary, 2)
co_matrix

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [186]:
# import dill

# # dill.dump_session('notebook_env.db')
# dill.load_session('notebook_env2.db')

(2771, 2771)

import dill
# dill.dump_session('notebook_env.db')
dill.load_session('notebook_env.db')

In [None]:

def ppmiMatrix(co_matrix):
    """
    Return the PPMI-weighted co-occurrence matrix

    Args:
        co_matrix : co-occurrence matrix
    Returns: ppmi_co_matrix
    ppmi_co_matrix: matrix of PPMI values, same shape as co_matrix
    """
    total = co_matrix.sum() # |D|: total number of observed word-context pairs
    p_w = co_matrix.sum(axis=1) / total # p(w) for every target word
    p_c = co_matrix.sum(axis=0) / total # p(c) for every context word
    print(p_w.shape)
    p_wc = co_matrix / total # p(w, c)
    print("Doing ppmi now")
    expected = np.outer(p_w, p_c) # p(w) * p(c): joint probability under independence
    ppmiM = np.zeros(co_matrix.shape)
    observed = p_wc > 0 # unobserved pairs keep a PPMI of 0
    ppmiM[observed] = np.maximum(0, np.log2(p_wc[observed] / expected[observed]))
    return ppmiM

ppmi_co_matrix = ppmiMatrix(co_matrix)

(2771,)
Doing ppmi now




In [7]:
import dill

# dill.dump_session('notebook_env.db')
dill.load_session('notebook_env2.db')

TypeError: code() takes at most 15 arguments (16 given)

Code for SVD has been provided for you; all you have to do is specify how many of the top singular values, i.e. how many top <b>n</b> dimensions, you want to keep. Check the dimensions of the returned matrix using the <blockquote>.shape</blockquote> attribute to figure out whether the embedding for each word is in a row or a column. By our calculation the vocabulary size should be less than five thousand; reduce the dimensionality to less than one thousand.
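
To see what the shape check tells you, here is a minimal sketch on a toy matrix. It uses np.linalg.svd as a stand-in for the provided randomized_svd call (an assumption made so the example needs only numpy), and `toy_ppmi` and `n` are illustrative names and values.

```python
import numpy as np

rng = np.random.default_rng(0)
toy_ppmi = rng.random((6, 6))   # stand-in for a PPMI matrix over a 6-word vocabulary
n = 3                           # number of top dimensions to keep

# Singular values come back sorted from largest to smallest
U, Sigma, VT = np.linalg.svd(toy_ppmi, full_matrices=False)
embeddings = U[:, :n]           # keep the columns for the n largest singular values

print(embeddings.shape)         # (vocabulary size, n): one row per word
```

Because the first dimension equals the vocabulary size, row i is the embedding of the i-th word in the vocabulary order used to build the matrix.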

In [212]:
#code provided 
def denseMatrixViaSVD(ppmi_co_matrix,n):
    """
    Return reduced dimensionality co-occurance matrix by applying svd

    Args:
        ppmi_matrix : co-occurance matrix with ppmi counts
        
    Returns: svd_co_matrix
    svd_co_matrix: reduced dimensionality co-occurance matrix
    """
#     top_n_eigenvalues=
    U, Sigma, VT = randomized_svd(ppmi_co_matrix, 
                              n_components=n,
                              n_iter=5,
                              random_state=None)
    svd_co_matrix=U
    return svd_co_matrix
svd_co_matrix = denseMatrixViaSVD(ppmi_co_matrix, 500)

In [279]:
svd_co_matrix.shape

(2771, 500)

dill.dump_session('notebook_env2.db')

### Modelling
Now that we have our embeddings, let's use them to train a feed-forward neural network for our sentiment classification task. Since a feed-forward network's input layer has a fixed size, we will need to create a fixed-size representation for each review. For this purpose we will use the following:
<ul>
    <li>Average pooling.</li>
    <li>Average pooling algorithm by FastText (provided)</li>
    <li>Max pooling. </li>
</ul>
For those of you who are familiar with Convolutional Neural Networks, this pooling is a 1D pooling operation. See the illustrated example below:<img src="pooling.png">

Since we can't hold a Keras tutorial due to the coronavirus, a simple feed-forward network has been provided for you. You need to create train_X, test_X, train_Y and test_Y; these should be numpy arrays in order for Keras to use them.
<ul>
    <li>train_X= contains embedding representations of all the reviews in the train set</li>
    <li>test_X= contains embedding representations of all the reviews in the test set</li>
    <li>train_Y= contains <b>one hot</b> representations of train labels</li>
    <li>test_Y= contains <b>one hot</b> representations of test labels</li>   
</ul>
To construct the one-hot representation you can use sklearn's preprocessing package or the preprocessing utilities from Keras. Read the documentation online.

In [228]:
arr = np.zeros(shape=(2,3))
arr[1,:]

array([0., 0., 0.])

In [269]:
# print(len(train_X[1]))
# One-hot ids follow this order: Pos -> 0, Neutral -> 1, Neg -> 2
one_hot = {'Pos': [1,0,0], 'Neutral': [0,1,0], 'Neg': [0,0,1]}
# Labels outside these three classes are skipped
train_Y_hot = [one_hot[label] for label in train_Y if label in one_hot]
# train_Y[3]
test_Y_hot = [one_hot[label] for label in test_Y if label in one_hot]
# test_Y_hot[0]

[0, 0, 1]

In [247]:
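#making embeddings for each tweet
word_to_index = {w: i for i, w in enumerate(vocabulary)} # constant-time lookup of a word's row

def getvec(word):
    # Returns None for words that are not in the vocabulary
    i = word_to_index.get(word)
    return None if i is None else svd_co_matrix[i,:]

embedding_list = []
for tweet in train_X:
    tweet_embedding = [getvec(word) for word in tweet]
    embedding_list.append(tweet_embedding)
# Tweets differ in length, so keep a list of lists instead of a ragged numpy array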
# len(embedding_list[1])
# embedding_size = 500

16

In [None]:
# CONSTRUCT ONE HOT REPRESENTATION
#done above

In [255]:
#Fast text averaging, pass a list of word embeddings and embedding size to fasttextAveraging function
def l2_norm(x):
   return np.sqrt(np.sum(x**2))

def div_norm(x):
   norm_value = l2_norm(x)
   if norm_value > 0:
       return x * ( 1.0 / norm_value)
   else:
       return x
def fasttextAveraging(embedding_list,embedding_size):
    norm=np.zeros(embedding_size)
    for emb in embedding_list:
        norm=norm+div_norm(emb) 
    return norm/len(embedding_list)
fasttext_embeddings = []
for embedding in embedding_list:
    fasttext_embeddings.append( fasttextAveraging(embedding, 500))
# len(fasttext_embeddings[0])

500

In [None]:
def averagePooling(embedding_list,embedding_size):
    """
    Return average embedding vector from list of embedding
    Args:
        embedding_list : embedding list
        embedding_size: size of embedding vector
    Returns: average_embedding
    average_embedding: average embedding vector
    """
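    vectors = [emb for emb in embedding_list if emb is not None] # skip words without a vector
    if len(vectors) == 0:
        return np.zeros(embedding_size) # no usable words: fall back to a zero vector
    return np.mean(np.asarray(vectors, dtype=float), axis=0) # element-wise mean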

In [None]:
def maxPooling(embedding_list,embedding_size):
    """
    Return maxpooling embedding vector from list of embedding
    Args:
        embedding_list : embedding list
        embedding_size: size of embedding vector
    Returns: max_embedding
    max_embedding: maxpooled embedding vector
    """
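    vectors = [emb for emb in embedding_list if emb is not None] # skip words without a vector
    if len(vectors) == 0:
        return np.zeros(embedding_size) # no usable words: fall back to a zero vector
    return np.max(np.asarray(vectors, dtype=float), axis=0) # element-wise maximum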

Try using all three representations to train the model and check which one works best. You can play around with the embedding size by controlling <b>n</b> in the SVD function, and for the model you can add or remove layers or change the number of neurons in the hidden layers. Keep in mind that the layers should decrease in size as we go deeper into the network; theoretically this means that we are constructing complex features in a lower-dimensional space from less complex features in a higher-dimensional space.<br><br>
Issues related to overfitting will be properly addressed in the next assignment. For now you are free to choose the number of epochs; try to find a value that trains the model sufficiently but does not overfit it.

# import dill
# dill.dump_session('notebook_env3.db')
# dill.load_session('notebook_env.db')

In [273]:
import tensorflow as tf
from tensorflow import keras
embedding_size=500
model = keras.Sequential([
    keras.layers.Flatten(input_shape=(embedding_size,)),#do not change (input layer)
    keras.layers.Dense(300, activation='relu'),#(hidden layer)
    keras.layers.Dense(50, activation='relu'),#(hidden layer)
    keras.layers.Dense(3, activation='softmax')#output layer: softmax so that categorical_crossentropy receives class probabilities
])
model.compile(optimizer='adam',
              loss=["categorical_crossentropy"],
              metrics=['accuracy'])

model.summary()

ModuleNotFoundError: No module named 'tensorflow'

In [None]:
model.fit(train_X,train_Y, epochs=15, batch_size=32,
               verbose=1,shuffle=True)

Use the <b>model.predict</b> method to get predictions. These predictions will be a probability distribution over the labels; to get the desired class, take the index of the max value in a prediction vector as the predicted class.<br> To run the code below you need to construct a list of unique labels, ordered by the id assigned to each class when you constructed the one-hot representation.
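
The step from probability vectors to class labels can be sketched on made-up numbers. The array `probs` is a hypothetical stand-in for the output of model.predict, and `label_list` assumes the class order Pos, Neutral, Neg used for the one-hot ids 0, 1, 2.

```python
import numpy as np

label_list = ['Pos', 'Neutral', 'Neg']   # assumed order of one-hot ids 0, 1, 2

# Hypothetical prediction vectors, one row per example
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.3, 0.6]])

pred_ids = np.argmax(probs, axis=-1)              # index of the largest probability
pred_labels = [label_list[i] for i in pred_ids]   # map ids back to class names

print(pred_ids)
print(pred_labels)
```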

In [None]:
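# Predicted class id = index of the largest value in each prediction vector
predictions = np.argmax(model.predict(test_X), axis=-1)

import seaborn as sn
from sklearn.metrics import confusion_matrix

# Class names ordered by the id used in the one-hot encoding above
labelList = ['Pos', 'Neutral', 'Neg']

test_Y_max=np.argmax(test_Y, axis=-1)
cm=confusion_matrix(test_Y_max,predictions)
cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
cm = pd.DataFrame(cm, labelList,labelList )# matrix,names row,names col,
# plt.figure(figsize=(10,7))
sn.set(font_scale=1.4) # for label size
sn.heatmap(cm, annot=True, annot_kws={"size": 11}, fmt=".2f") # font size
plt.show()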


In [None]:
from sklearn.metrics import classification_report
print("Classification Report\n",classification_report(test_Y_max, predictions, labels=[0,1,2], target_names = labelList))

## Prediction-based Embeddings
For prediction-based embeddings we will use the IMDB dataset. We will create our embeddings using the unlabeledTrainData.tsv file.
We will use the Word2Vec model that we have already covered in class. 

### Preprocessing
Since we have already discussed preprocessing trade-offs in previous assignments, we expect you to analyse the data and perform the preprocessing that is required.<br> 
<b> Hint: Each review is in string format, so slashes are used to escape characters and br tags to mark line breaks</b>

In [None]:
#load the data

In [None]:
def preprocessing(data):
    """
    Return preprocessed data

    Args:
        data : reviews
    
    Returns: preprocessed_data
    preprocessed_data : preprocessed dataset 
    """
#     preprocessed_data=
    pass

Use [gensim](https://radimrehurek.com/gensim/models/word2vec.html) to train a Word2Vec model. Keep the dimensionality at 300 and the window size at 2. After training, use the model and the previously coded methods to create vector representations for the movie reviews. <i>(create train_X, test_X, train_Y and test_Y)</i>

In [None]:
def trainWord2Vec(data):
    """
    Return trained Word2Vec model

    Args:
        data : movie reviews
    
    Returns: model
    model : Word2Vec model 
    """
#     model=
    pass

In [None]:
#load the train and test files and create the vectorial representations

Since this is a dense representation, we won't face the challenges posed by sparse representations. We can move on to modelling.

### Modelling


In [None]:
import tensorflow as tf
from tensorflow import keras
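embedding_size=300 # must match the Word2Vec dimensionality specified above
model_word2vec = keras.Sequential([
    keras.layers.Flatten(input_shape=(embedding_size,)),#do not change (input layer)
    keras.layers.Dense(150, activation='relu'),#(hidden layer)
    keras.layers.Dense(50, activation='relu'),#(hidden layer)
    keras.layers.Dense(2, activation='softmax')#output layer: softmax so that categorical_crossentropy receives class probabilities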
])
model_word2vec.compile(optimizer='adam',
              loss=["categorical_crossentropy"],
              metrics=['accuracy'])

model_word2vec.summary()

In [None]:
model_word2vec.fit(train_X,train_Y, epochs=15, batch_size=32,
               verbose=1,shuffle=True)

In [None]:
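# Predicted class id = index of the largest value in each prediction vector
predictions = np.argmax(model_word2vec.predict(test_X), axis=-1)

import seaborn as sn
from sklearn.metrics import confusion_matrix

# labelList must hold the class names ordered by their one-hot id
test_Y_max=np.argmax(test_Y, axis=-1)
cm1=confusion_matrix(test_Y_max,predictions)
cm1 = cm1.astype('float') / cm1.sum(axis=1)[:, np.newaxis]
cm1 = pd.DataFrame(cm1, labelList,labelList )# matrix,names row,names col,
# plt.figure(figsize=(10,7))
sn.set(font_scale=1.4) # for label size
sn.heatmap(cm1, annot=True, annot_kws={"size": 11}, fmt=".2f") # font size
plt.show()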


In [None]:
from sklearn.metrics import classification_report
print("Classification Report\n",classification_report(test_Y_max, predictions, labels=[0,1], target_names = labelList))

# Theory
There are two major research papers: [Word2Vec](https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf) for prediction-based embeddings and [GloVe](https://nlp.stanford.edu/pubs/glove.pdf) for frequency-based embeddings. Research online and write a short note on the trade-offs associated with the two types of embeddings. 

###_______________Answer________________###


#### Ending Note:
Feed-forward networks are not well suited to natural language tasks because of their fixed input size: the length of the text in each example of a dataset can vary considerably. Feed-forward networks also ignore the sequential nature of natural language text, which leaves them unable to capture context or interdependencies between words for semantic information. To address this, researchers invented recurrent neural networks, which help to alleviate these limitations.
The next assignment will be related to recurrent neural networks.

# We hope all of you are working on your projects!