# Assignment 3 CS 5316 Natural Language Processing
For this assignment we will use the following packages
<ul>
    <li><a href="https://radimrehurek.com/gensim/">Gensim</a>.</li>
    <li><a href="https://keras.io/">Keras</a>.</li>
    <li><a href="https://www.tensorflow.org/">Tensorflow</a>.</li>
</ul>
You can install these packages via anaconda navigator or use the conda install / pip install commands e.g<br>
<blockquote>pip install gensim<br>
pip install tensorflow<br>
pip install keras</blockquote>

In [1]:
import numpy as np
from IPython.display import Image
# Get the interactive Tools for Matplotlib
%matplotlib notebook
import matplotlib.pyplot as plt
plt.style.use('ggplot')
from sklearn.decomposition import PCA
from gensim.test.utils import datapath, get_tmpfile
from gensim.models import KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec
from sklearn.decomposition import TruncatedSVD
from sklearn.utils.extmath import randomized_svd
from nltk import ngrams
import pandas as pd
from nltk import word_tokenize
from keras.utils import to_categorical
import seaborn as sn
from sklearn.metrics import classification_report
from gensim.models import Word2Vec

Using TensorFlow backend.


# Word Embeddings

Word Vectors nowadays are often used as a fundamental component for downstream NLP tasks, e.g. question answering, text generation, language translation, sentiment analysis, etc. The goal of word embedding methods is to derive a low-dimensional continuous vector representation for words so that words that are syntactically or semantically related are close together in that vector space and thus, share a similar representation.

In this assingment we are going to explore different word embedddings inorder to build some intuitions about their strengths and weaknesses. Although there are many types of word embeddings they can be broadly classified into two categories:
<ul>
    <li>Frequency based Embedding</li>
    <li>Prediction based Embedding</li>
</ul>
For frequenct based embedding we will explore embeddings based on <b>word co-occurance</b> counts with <a href="https://en.wikipedia.org/wiki/Pointwise_mutual_information">Point Wise Mutial Information(PPMI)</a> and <a href="https://en.wikipedia.org/wiki/Singular_value_decomposition">Singular Value Decomposition(SVD)</a>.
<a href="https://www.youtube.com/watch?v=P5mlg91as1c">SVD video explaination</a><br>
For prediction based embeddings we will explore <a href="https://en.wikipedia.org/wiki/Pointwise_mutual_information">Word2Vec</a> based embeddings.

For evaluating these embeddings we will work with the following two datasets: 
<ul>
    <li>Twitter dataset created by Sanders Analytics which we explored in the previous assignment<b>(file provided)</b></li>
    <li>Movie reviews dataset from the popular website <a href="https://www.imdb.com/">IMDB</a>.
        Head over the to <a href="https://www.kaggle.com/c/word2vec-nlp-tutorial/data">kaggle</a> and download the dataset from there. The dataset consists of three files:<br><b>labelledTrainData,unlabelledTrainData,testData</b></li>   
</ul>
Read the "Data" section on kaggle for details on the dataset.

Let's get started.......<br>
remove this link later [Assignment solution](https://www.youtube.com/watch?v=dQw4w9WgXcQ&t=40s) 

## Frequency base Embeddings
For this part we will use the Sanders Analytics dataset to create embeddings. Since the other dataset is large we might run into memory problems.<br><br><br>
Although we can directly use word representation based on word co-occurance matrix directly it is generally not a good idea to do so for the following reasons:
<ul>
    <li>The word co-occurance matrix scales with vocabulary size, considering memory constraints this would be problematic for large datasets, as in the IMDB data set that has vocabulary size after remove stop words of 225109, which requres rougly around 189 GiB of storage capacity(roughly 203 GB)</li><img src="memoryerror.png">
    <li> The word co-occurance matrix will be quite sparse, meaning many entries in the matrix will be zeros. This is problematic due the fact that for many nlp tasks the multyplication operation is used quite frequently, e.g. for word similarity task, cosine similarity is used:<img src="cosine-equation.png"> Here we can see the dot product is computed between two word vectors, multyplication with zeros wastes precious computation power and your time.</li>
    <li> High co-occurance counts for stop words and conjunctions offset true representation of words, meaning thier could become a dominant factor when these embeddings are used in computations. These also dont provide a lot of information as thier counts with other words would also be high.</li>
</ul>
In summary, you want to avoid sparse represenation's just like the corona virus.<br>
To mitigate the above problems we will use PPMI and SVD. PPMI is use to control high co-occurance counts and SVD is used to reduce dimensionality.

    

### Preprocessing
Since we have already discussed preprocessing trade off's in previous assingments. We expect you to analyse the data and preform the preprocessing that is required.<br> 

In [2]:
def load_data(filename):
    """
    Load data from file

    Args:
        filename : Name of the file from which the data is to be loaded
    
    Returns: tweet_X, sentiment_Y
    tweet_X: list of tweets
    sentiment_Y: list of sentiment lables correponding to each tweet
    """
    df = pd.read_csv(filename)
    df.head()
    y = df[['class']]
    x = df[['text']]
    return x,y#update these
tweet_X_task1, sentiment_label_Y_task1=load_data("twitter-sanders-apple3.csv")

In [3]:
tweet_X_task1

Unnamed: 0,text
0,Now all @Apple has to do is get swype on the i...
1,@Apple will be adding more carrier support to ...
2,Hilarious @youtube video - guy does a duet wit...
3,@RIM you made it too easy for me to switch to ...
4,I just realized that the reason I got into twi...
...,...
983,@vlingo is a POOR substitute for Siri!! Yo @AP...
984,@Apple Scrapple. (:
985,@tvnewschick @apple Oh no! Why not?! I want it...
986,One of the great #entrepreneurs has died. #Ste...


In [4]:
import string
import re

In [5]:
#Copied from assignment 2
def remove_punctuation(data):
    for punctuation in string.punctuation:
        if punctuation != '@':
            data = data.replace(punctuation, ' ')
    return data
def remove_trailing(data):
    data = data.replace('@',' ')
    return data

In [6]:
def preprocessing(data):
    """
    Return preprocessed data

    Args:
        data : reviews
    
    Returns: preprocessed_data
    preprocessed_data : preprocessed dataset 
    """
#     preprocessed_data=
    #copied from assignment 2 
    data['text'] = data['text'].replace(to_replace=r'http://t\.co/[A-Za-z0-9]{8}',value=" ",regex=True)
    data['text'] = data['text'].str.lower()
    data['text'] = data['text'].apply(remove_punctuation)
    data['text'] = data['text'].replace(to_replace = r'@[A-Za-z0-9]*',value = 'AT_TOKEN',regex=True)
    data['text'] = data['text'].apply(remove_trailing)
    #xCount = data['text']
    data['text'] = data['text'].apply(word_tokenize)
    return data

In [7]:
x=preprocessing(tweet_X_task1)
y = sentiment_label_Y_task1

In [8]:
#checking if x is correct
x

Unnamed: 0,text
0,"[now, all, AT_TOKEN, has, to, do, is, get, swy..."
1,"[AT_TOKEN, will, be, adding, more, carrier, su..."
2,"[hilarious, AT_TOKEN, video, guy, does, a, due..."
3,"[AT_TOKEN, you, made, it, too, easy, for, me, ..."
4,"[i, just, realized, that, the, reason, i, got,..."
...,...
983,"[AT_TOKEN, is, a, poor, substitute, for, siri,..."
984,"[AT_TOKEN, scrapple]"
985,"[AT_TOKEN, AT_TOKEN, oh, no, why, not, i, want..."
986,"[one, of, the, great, entrepreneurs, has, died..."


In [9]:
#checking if y is correct
y

Unnamed: 0,class
0,Pos
1,Pos
2,Pos
3,Pos
4,Pos
...,...
983,Neutral
984,Neutral
985,Neutral
986,Neutral


### Test train split
Use test train split from sklearn.


In [10]:
from sklearn.model_selection import train_test_split

In [11]:
def testTrainSplit(data_X,data_Y):
    """
    Return test train data

    Args:
        data_X : reviews
        data_Y: labels
    Returns: test train split data 
    """
    X_train, X_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2)
    return X_train,X_test,y_train,y_test
X_train,X_test,y_train,y_test = testTrainSplit(x,y)

In [12]:
#Incorporating unks
pseudoDict = {}
for values in X_train['text']:
    for word in values:
        if word not in pseudoDict.keys():
            pseudoDict[word] = 1
        else:
            pseudoDict[word] = pseudoDict[word]+1

In [13]:
stopwords = []
threshold = 1
for val in pseudoDict:
    if pseudoDict[val]<=threshold:
        stopwords.append(val)

In [14]:
X_train['text'] = X_train['text'].str.join(" ")
for word in stopwords:
    X_train['text'] = X_train['text'].replace(r'\b{}\b'.format(word),'UNK',regex=True)
X_train['text'] = X_train['text'].str.split()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  This is separate from the ipykernel package so we can avoid doing imports until
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  after removing the cwd from sys.path.


Extract the vocabulary, to find te dimensions of co-occurance matrix

In [15]:
def getVocabulary(train_X):
    """
    Return dataset vocabulart

    Args:
        train_X : reviews in train dataset
    
    Returns: vocabulary
    vocabulary: list of unique words in dataset
    """
#     vocabulary=
    tempSet = set()
    for val in train_X['text']:
            for subval in val:
                tempSet.add(subval)
    return list(tempSet)
vocab = getVocabulary(X_train)

In [16]:
print(len(vocab))

1078


In [17]:
print(vocab)

['bring', 'out', 'reminders', 'its', 'dumb', 'also', 'setup', 'call', 'stay', 'ipad2', 'ours', 'ugh', 'keep', 'lead', 'be', 'some', 'wall', 'finance', 'y', 'played', 'x', 'tweet', 'watch', 'with', 'champ', 'hand', 'her', 'last', 'gt', 'comp', 'still', 'bar', 'problems', 'gen', 'when', 'person', 'tv', 'space', 'online', 'who', 'hire', 'can', 'rest', 'trip', 'wonder', 'older', 'backside', 'half', 'glad', 'eric', 'perfect', 'sd', 'did', 'top', 'nostalgia', 'found', 'santa', 'resources', 'tweets', 'announced', 'recognition', 'give', 'we', 'says', 'ipad', 'after', 'book', 'page', 'global', 'itunes', 'changes', 'kind', 'tech', 'there', 'turn', 'bud', 'needed', 'bad', 'far', 'aka', 'coming', 'annoying', 'talking', 'omg', 'loyalty', 'move', 'memorial', 'any', 'spend', 'finally', 'thanks', 'thank', 'against', 'came', 'management', '20', 'less', 'cloud', '2', 'buy', 'adding', 'free', 'rolling', 'week', 'failed', 'eye', 'like', 'again', 'forget', 'imessage', 'months', 'na', 'reader', 'note', 'cc'

### Point Wise Mutial Information
Pointwise mutual information, or PMI, is the (unweighted) term that occurs inside of the summation of mutual information and measures the correlation between two specific events. Specifically, PMI is defined as<br>
$$PMI(a, b) = \log \frac{p(a,b)}{p(a)p(b)}$$

and measures the (log) ratio of the joint probability of the two events as compared to the joint probability of the two events assuming they were independent. Thus, PMI is high when the two events a and b co-occur with higher probability than would be expected if they were independent.

If we suppose that a and b are words, we can measure how likely we see a and b together compared to what we would expect of they were unrelated by computing their PMI under some model for the joint probability $$p(a,b)$$

Let D represent a collection of observed word-context pairs (with contexts being other words). We can construct D by considering the full context of a specific word occurrence as the collection of all word occurrences that appear within a fixed-size window of length L before and after it.

For a specific word $w_i$ in position i in a large, ordered collection of words $w_1, w_2$, we would have the context as ,$w_{i-1},w_{i+1},\ldots$, and could thus collect counts (a total of 2L) of each of those words as appearing in the context of word $w_i$. We will refer to $w_i$ as the “target word” and the words appearing in the L-sized window around $w_i$ as “context words”.

Consider a sample corpus containing only one sentence:<br>
    <center><blockquote>Encumbered forever by desire and ambition</blockquote></center>

We can construct D by considering each word position i and extracting the pairs $(w_i, w_{i+k})$ for $−L≤k≤L;k≠0$. In such a pair, we would call $w_i$ the “target word” and $w_{i+k}$ the “context word”.

For example, we would extract the following pairs for $i=4i$ if we let our window size $L=2$<br>
    <center><blockquote>(desire,forever),(desire,by),(desire,and),(desire,ambition)</blockquote></center>

Similarly, for $i=5i$, we would extract the following pairs:
    <center><blockquote>(and,by),(and,desire),(and,ambition)</blockquote></center>
Let’s let $n_{w,c}$ represent the number of times we observe word type c in the context of word type w. We can then define ,$n_w = \sum_{c'} n_{w,c'}$ as the number of times we see a “target” word w in the collection of pairs D and $n_c = \sum_{w'} n_{w',c}$ as the number of times we see the context word c in the collection of pairs D.

We can then define the joint probability of a word and a context word as
    $$p(w, c) = \frac{n_{w,c}}{|D|}$$

where $∣D∣$ is simply the total number of word-context occurrences we see. Similarly, we can define
    $$p(w) = \frac{n_w}{|D|}$$

and $$p(c) = \frac{n_c}{|D|}$$

and thus the PMI between a word w and context word c is
$$PMI(w, c) = \log \frac{p(w,c)}{p(w)p(c)} = \log \frac{n_{w,c} \cdot |D|}{n_w \cdot n_c}.$$

If we compute the PMI between all pairs of words in our vocabulary V, we will arrive at a large, real-valued matrix. However, some of the values of this matrix will be $\log 0$, if the word-context pair $(w,c)$ is unobserved, this will result in inf bieng computed. To remedy this, we could simply define a modified PMI that is equal to 0 when $n_{w,c} = 0$, which is the positive pointwise mutual information (PPMI) which:
    P$$PPMI(w,c) = \max(0, PMI(w,c))$$
<br>
This wonderfull explaination is made by <a href="http://czhai.cs.illinois.edu/">Dr.ChengXiang ("Cheng") Zhai</a><br><br>

<center><b>HINT: Consult your slides and see the example, how the formulas are used. You can calculate $|D|$ by the formula given in the slides(its the same thing).</b></center>

If youre having troubles implementing this here is some [motivation](https://www.youtube.com/watch?v=TsyM5jP7RQk)

### Create a co-occurance matrix with +,- k window size
Hint: Use the ngrams package from [nltk](https://www.nltk.org/) to make life easier. Matrix size is vocab X vocab.
Please keep track of the order of words in the matrix this will be usefull later.

In [18]:
def coOccuranceMatrix(train_X,vocab,k=2):
    """
    Return co-occurance matrix with ppmi counts

    Args:
        data : dataset
        vocab : vocabulary
    Returns: co_matrix
    co_matrix: co-occurance matrix
    """
    vocab_index ={}
    vocab_word = {}
    for i,word in zip((range(len(vocab))),vocab):
        vocab_index[word] = i
        vocab_word[i] = word
    #print(vocab_index)
   
    co_matrix = np.zeros((len(vocab),len(vocab)))
    count = 0
    for word in train_X['text']:
        #print(word)
        for w,i in zip(word,range(len(word))):
            #print(i,k+i)
            #print('Orig Word:',word[i])
            index_i = vocab_index[word[i]]
            for j in range(i+1,k+i+1):
                try:
                    #print(word[j])
                    index_j = vocab_index[word[j]]
                    co_matrix[index_i][index_j] = co_matrix[index_i][index_j]+1
                    count = count+1
                except:
                    pass
           
            for l in range(i,i-k,-1):
                if l-1>=0:
                    index_l = vocab_index[word[l-1]]
                    co_matrix[index_i][index_l] = co_matrix[index_i][index_l]+1
                    count = count+1
                    #print(word[l-1])
            #print(" ")
    print(count)
    return np.matrix(co_matrix),vocab_index,vocab_word

    
    #print(vocab_index)

comat,vocab_index,vocab_word = coOccuranceMatrix(X_train,vocab)

48168


In [19]:
vocab_index


{'bring': 0,
 'out': 1,
 'reminders': 2,
 'its': 3,
 'dumb': 4,
 'also': 5,
 'setup': 6,
 'call': 7,
 'stay': 8,
 'ipad2': 9,
 'ours': 10,
 'ugh': 11,
 'keep': 12,
 'lead': 13,
 'be': 14,
 'some': 15,
 'wall': 16,
 'finance': 17,
 'y': 18,
 'played': 19,
 'x': 20,
 'tweet': 21,
 'watch': 22,
 'with': 23,
 'champ': 24,
 'hand': 25,
 'her': 26,
 'last': 27,
 'gt': 28,
 'comp': 29,
 'still': 30,
 'bar': 31,
 'problems': 32,
 'gen': 33,
 'when': 34,
 'person': 35,
 'tv': 36,
 'space': 37,
 'online': 38,
 'who': 39,
 'hire': 40,
 'can': 41,
 'rest': 42,
 'trip': 43,
 'wonder': 44,
 'older': 45,
 'backside': 46,
 'half': 47,
 'glad': 48,
 'eric': 49,
 'perfect': 50,
 'sd': 51,
 'did': 52,
 'top': 53,
 'nostalgia': 54,
 'found': 55,
 'santa': 56,
 'resources': 57,
 'tweets': 58,
 'announced': 59,
 'recognition': 60,
 'give': 61,
 'we': 62,
 'says': 63,
 'ipad': 64,
 'after': 65,
 'book': 66,
 'page': 67,
 'global': 68,
 'itunes': 69,
 'changes': 70,
 'kind': 71,
 'tech': 72,
 'there': 73,
 't

In [20]:
vocab_word

{0: 'bring',
 1: 'out',
 2: 'reminders',
 3: 'its',
 4: 'dumb',
 5: 'also',
 6: 'setup',
 7: 'call',
 8: 'stay',
 9: 'ipad2',
 10: 'ours',
 11: 'ugh',
 12: 'keep',
 13: 'lead',
 14: 'be',
 15: 'some',
 16: 'wall',
 17: 'finance',
 18: 'y',
 19: 'played',
 20: 'x',
 21: 'tweet',
 22: 'watch',
 23: 'with',
 24: 'champ',
 25: 'hand',
 26: 'her',
 27: 'last',
 28: 'gt',
 29: 'comp',
 30: 'still',
 31: 'bar',
 32: 'problems',
 33: 'gen',
 34: 'when',
 35: 'person',
 36: 'tv',
 37: 'space',
 38: 'online',
 39: 'who',
 40: 'hire',
 41: 'can',
 42: 'rest',
 43: 'trip',
 44: 'wonder',
 45: 'older',
 46: 'backside',
 47: 'half',
 48: 'glad',
 49: 'eric',
 50: 'perfect',
 51: 'sd',
 52: 'did',
 53: 'top',
 54: 'nostalgia',
 55: 'found',
 56: 'santa',
 57: 'resources',
 58: 'tweets',
 59: 'announced',
 60: 'recognition',
 61: 'give',
 62: 'we',
 63: 'says',
 64: 'ipad',
 65: 'after',
 66: 'book',
 67: 'page',
 68: 'global',
 69: 'itunes',
 70: 'changes',
 71: 'kind',
 72: 'tech',
 73: 'there',
 74

In [21]:
vocab_index['UNK']

131

In [22]:
def ppmiMatrix(co_matrix):
    """
    Return co-occurance matrix with ppmi counts

    Args:
        co_matrix : co-occurance matrix
    Returns: ppmi_matrix
    ppmi_co_matrix: co-occurance matrix with ppmi counts
    """
    #formulae = P(w,c) = n(w,c)*|D|/(n(w)*n(c))
    ppmi_co_matrix = np.zeros((co_matrix.shape[0],co_matrix.shape[1]))
    D = co_matrix.sum()
    for x in range(0,ppmi_co_matrix.shape[0]):
        for y in range(0,ppmi_co_matrix.shape[1]):
            #print(co_matrix[x,y])
            result = co_matrix[x,y]*D/(co_matrix[x].sum()*co_matrix[y].sum())
            #print(result)
            nresult = max(0,result)
            ppmi_co_matrix[x,y] = nresult
    return ppmi_co_matrix
pMatrix = ppmiMatrix(comat)

Code for SVD has been provided for you, all you have to do is specify the number of top eigenvalues or how many top <b>n</b> dimensions you want to keep. Check the dimensions of the returned matrix by using <blockquote>.shape</blockquote> command to figure out if the embedding for each word is in row or column. By our calculation the vocab count should be less than five thousand, reduce the dimensionality to less than one thousand.

In [23]:
#code provided 
def denseMatrixViaSVD(ppmi_co_matrix,n):
    """
    Return reduced dimensionality co-occurance matrix by applying svd

    Args:
        ppmi_matrix : co-occurance matrix with ppmi counts
        
    Returns: svd_co_matrix
    svd_co_matrix: reduced dimensionality co-occurance matrix
    """
#     top_n_eigenvalues=
    U, Sigma, VT = randomized_svd(ppmi_co_matrix, 
                              n_components=n,
                              n_iter=5,
                              random_state=None)
    svd_co_matrix=U
    #print(U.shape)
    return svd_co_matrix


In [24]:
pMatrix.shape

(1078, 1078)

In [25]:
svdp_mat = denseMatrixViaSVD(pMatrix,n=100)

In [26]:
def convertData(X):
    Xlist = []
    for word in X['text']:
        a = []
        for i in word:
            try:
                a.append(svdp_mat[vocab_index[i]])
            except:
                a.append(svdp_mat[vocab_index['UNK']])
        a = np.stack(a,axis=0)
        Xlist.append(a)
    return Xlist

In [27]:
X_train_copy = convertData(X_train)

In [28]:
X_test_copy = convertData(X_test)

### Modelling
Now that we have our embeddings, lets use these to train a Feed Forward Neural network for our semantic classification task. Since a feed forward network's input layer is of a fixed size we will need to create a fixed size representation for each review. For this purpose we will use the following:
<ul>
    <li>Average pooling.</li>
    <li>Averaging pooling algorithm by FastText(provided)</li>
    <li>Max pooling. </li>
</ul>
For those of you who are familiar with Convolution Neural Networks this pooling will be a 1d pooling operation. See illustrated example below:<img src="pooling.png">

Since we cant have a tutorial due to corona virus for keras, a simple feed forward network has beed provided for you. You need to create train_X, test_X , train_Y and test_Y these should numpy arrays inorder for keras to use them.
<ul>
    <li>train_X= contains embedding representains of all the reviews in the train set</li>
    <li>train_Y= contains embedding representains of all the reviews in the test set</li>
    <li>train_Y= contains <b>one hot</b> representations of train labels</li>
    <li>test_Y= contains <b>one hot</b> representations of test labels</li>   
</ul>
To construct one hot representation you can use the sklearn's preprocessing package or the preprocessing package from keras. Read online.

In [29]:
from keras.utils import to_categorical

In [30]:
from sklearn.preprocessing import LabelEncoder
labelencoder = LabelEncoder()
y_train['cat'] = labelencoder.fit_transform(y_train['class'])
y_test['cat'] = labelencoder.transform(y_test['class'])

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  This is separate from the ipykernel package so we can avoid doing imports until
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  after removing the cwd from sys.path.


In [31]:
# CONSTRUCT ONE HOT REPRESENTATION
y_train_hot = to_categorical(y_train['cat'])
y_test_hot = to_categorical(y_test['cat'])

In [32]:
# Need to construct Training and Test data and stack them with new representations


In [33]:
y_train_hot

array([[0., 1., 0.],
       [0., 1., 0.],
       [1., 0., 0.],
       ...,
       [0., 0., 1.],
       [1., 0., 0.],
       [0., 1., 0.]], dtype=float32)

In [34]:
#Fast text averaging, pass a list of word embeddings and embedding size to fasttextAveraging function
def l2_norm(x):
   return np.sqrt(np.sum(x**2))

def div_norm(x):
   norm_value = l2_norm(x)
   if norm_value > 0:
       return x * ( 1.0 / norm_value)
   else:
       return x
def fasttextAveraging(embedding_list,embedding_size):
    norm=np.zeros(embedding_size)
    for emb in embedding_list:
        norm=norm+div_norm(emb) 
    return norm/len(embedding_list)

In [35]:
def averagePooling(embedding_list,embedding_size):
    """
    Return average embedding vector from list of embedding
    Args:
        embedding_list : embedding list
        embedding_size: size of embedding vector
    Returns: average_embedding
    average_embedding: average embedding vector
    """
    
    xlist = []
    for example in embedding_list:
        xlist.append(np.mean(example,axis=0))
    return np.array(xlist)
avg_x_train = averagePooling(X_train_copy,embedding_size=100) #You need to change n to change size of embedding.This has no effect
avg_x_test = averagePooling(X_test_copy,embedding_size=100)

In [36]:
avg_x_train

array([[ 2.98959959e-05,  3.92301363e-04,  1.64971768e-04, ...,
        -1.23231723e-03,  2.55676959e-04,  4.23713645e-03],
       [ 6.61491136e-05,  7.77958133e-05,  1.06389450e-04, ...,
         1.36234435e-03, -1.34981032e-03,  4.40264073e-03],
       [ 7.59412091e-05,  8.43491253e-06,  1.19832002e-04, ...,
        -1.32146224e-03, -8.40745898e-04,  5.47390349e-03],
       ...,
       [ 1.19018439e-04,  3.35072272e-05,  1.68173172e-04, ...,
        -1.93112604e-03, -8.80703944e-04,  5.66757473e-03],
       [ 4.08347748e-05,  2.51773880e-05,  5.42102716e-05, ...,
        -5.66728792e-04, -8.90262851e-04,  9.40047345e-04],
       [ 2.83093851e-05,  1.05484502e-04,  4.89044844e-05, ...,
        -6.80280472e-04, -4.97600194e-04,  2.60737838e-03]])

In [37]:
avg_x_train.shape

(790, 100)

In [38]:
def maxPooling(embedding_list,embedding_size):
    """
    Return maxpooling embedding vector from list of embedding
    Args:
        embedding_list : embedding list
        embedding_size: size of embedding vector
    Returns: max_embedding
    max_embedding: maxpooled embedding vector
    """
    xlist = []
    for example in embedding_list:
        xlist.append(np.max(example,axis=0))
    return np.array(xlist)
max_x_train = maxPooling(X_train_copy,embedding_size=100)
max_x_test = maxPooling(X_test_copy,embedding_size=100)

In [39]:
max_x_train.shape

(790, 100)

Try using all three representaions to train the model and check which one works best. You can play around with embedding size by controlling <b>n</b> in SVD function and for the model you can add or remove layers or change the number of neurons in the hidden layers. Keep in mind that the layers should be decreasing in size as we go deeper into the network, theoritically this means that we are constructing complex features in a lower dimensional space from less complex features and larger dimensional space.<br><br>
Issues related to overfiting will be proper addressed in the next assignment for now you are free to choose the number of epoch, try to find one that trains the model sufficiently enough but does not overfit it.

In [40]:
import tensorflow as tf
from tensorflow import keras
embedding_size=100
model = keras.Sequential([
    keras.layers.Flatten(input_shape=(embedding_size,)),#donot change(input layer)
    keras.layers.Dense(300, activation='relu'),#(hidden layer)
    keras.layers.Dense(50, activation='relu'),#(hidden layer)
    keras.layers.Dense(3)#donot change
])
model.compile(optimizer='adam',
              loss=["categorical_crossentropy"],
              metrics=['accuracy'])

model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
flatten (Flatten)            (None, 100)               0         
_________________________________________________________________
dense (Dense)                (None, 300)               30300     
_________________________________________________________________
dense_1 (Dense)              (None, 50)                15050     
_________________________________________________________________
dense_2 (Dense)              (None, 3)                 153       
Total params: 45,503
Trainable params: 45,503
Non-trainable params: 0
_________________________________________________________________


In [41]:
train_Y = y_train_hot
train_X = max_x_train

In [42]:
model.fit(train_X,train_Y, epochs=15, batch_size=32,
               verbose=1,shuffle=True)

Train on 790 samples
Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15


<tensorflow.python.keras.callbacks.History at 0x1d85d781248>

Use the <b>model.predict</b> method to get predictions. There predictions will be a probability distribution over the lables, to get the desired class take the max value in a prediction vector as the predicted class.<br> To run the code below you need to construct a list of unique labels, the list should be ordered on the basis of the id assigned to each class when you were constructing the one hot representation.

In [43]:
predictions1 = model.predict(max_x_test)
test_Y = y_test_hot
predictions = np.argmax(predictions1,axis=-1)
predictions

array([0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0,
       1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0,
       0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0,
       1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
      dtype=int64)

In [44]:
mapping = dict(zip(labelencoder.classes_, range(len(labelencoder.classes_))))

In [45]:
mapping

{'Neg': 0, 'Neutral': 1, 'Pos': 2}

In [46]:
#predictions = code here

from sklearn.metrics import confusion_matrix
labelList = ['Neg','Neutral','Pos']
test_Y_max=np.argmax(test_Y, axis=-1)
cm=confusion_matrix(test_Y_max,predictions)
cm1 = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
cm1 = pd.DataFrame(cm, labelList,labelList )# matrix,names row,names col,
plt.figure(figsize=(3,3))
sn.set(font_scale=1.4) # for label size
sn.heatmap(cm1, annot=True, annot_kws={"size": 11}, fmt=".2f") # font size
plt.show()


<IPython.core.display.Javascript object>

In [47]:
print("Classification Report\n",classification_report(test_Y_max, predictions, labels=[0,1,2], target_names = labelList))

Classification Report
               precision    recall  f1-score   support

         Neg       0.28      0.91      0.43        54
     Neutral       0.71      0.16      0.26       107
         Pos       0.00      0.00      0.00        37

    accuracy                           0.33       198
   macro avg       0.33      0.36      0.23       198
weighted avg       0.46      0.33      0.26       198



  'precision', 'predicted', average, warn_for)


## Prediction base Embeddings
For prediction based embeddings we will use the IMDB dataset. We will create create our embeddings by using the unlabeledTrainData.tsv file.
We will use the Word2Vec model that we have already covered in class. 

### Preprocessing
Since we have already discussed preprocessing trade off's in previous assingments. We expect you to analyse the data and preform the preprocessing that is required.<br> 
<b> Hint: Each review is in string format so they have used slahes to escape characters and br tags to identify line breaks</b>

In [48]:
#load the data
df_train = pd.read_csv("labeledTrainData.tsv", header=0, delimiter="\t", quoting=3)
#df_train = df_train.head(1000)
df_train.head()
len(df_train)

25000

In [49]:
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
def preprcoessing1( raw_review ):
    # 1. Remove HTML
    review_text = BeautifulSoup(raw_review, 'lxml').get_text() 
    
    # 2. Remove non-letters with regex
    letters_only = re.sub("[^a-zA-Z]", " ", review_text) 
    
    # 3. Convert to lower case, split into individual words
    words = letters_only.lower().split()                           
    
    # 4. Create set of stopwords
    stops = set(stopwords.words("english"))                  
    
    # 5. Remove stop words
    meaningful_words = [w for w in words if not w in stops]   
    
    # 6. Join the words back into one string separated by space, 
    # and return the result.
    return meaningful_words  

df_train['review']=df_train['review'].apply(preprcoessing1)

In [50]:
df_train['review']

0        [stuff, going, moment, mj, started, listening,...
1        [classic, war, worlds, timothy, hines, enterta...
2        [film, starts, manager, nicholas, bell, giving...
3        [must, assumed, praised, film, greatest, filme...
4        [superbly, trashy, wondrously, unpretentious, ...
                               ...                        
24995    [seems, like, consideration, gone, imdb, revie...
24996    [believe, made, film, completely, unnecessary,...
24997    [guy, loser, get, girls, needs, build, picked,...
24998    [minute, documentary, bu, uel, made, early, on...
24999    [saw, movie, child, broke, heart, story, unfin...
Name: review, Length: 25000, dtype: object

In [51]:
#since only the training data has the labels so only using the training data 
X_train,X_test,y_train,y_test = train_test_split(df_train,df_train['sentiment'])

In [52]:
#UNK tokens 
# pseudoDict = {}
# for values in X_train['review']:
#     for word in values:
#         if word not in pseudoDict.keys():
#             pseudoDict[word] = 1
#         else:
#             pseudoDict[word] = pseudoDict[word]+1

In [53]:
# stopwords = []
# threshold = 1
# for val in pseudoDict:
#     if pseudoDict[val]<=threshold:
#         stopwords.append(val)

In [54]:
# len(stopwords)

In [55]:
# def createUnk(data):
#     count = 0
#     for word in stopwords:
#         data = data.replace(r'\b{}\b'.format(word),'UNK',regex=True)
#         count = count+1
#         if count%10==0:
#             print(count)
#     return data

In [56]:
# X_train['review'] = X_train['review'].str.join(" ")

In [57]:
# count = 0
# # for word in stopwords:
# #     X_train['review'] = X_train['review'].replace(r'\b{}\b'.format(word),'UNK',regex=True)
# #     count = count+1
# #     if count%1000==0:
# #         print(count)

In [58]:
# X_train['review'] = X_train['review'].str.split()

In [59]:
# X_train['review']

In [60]:
#this shiz going to happen lol
xList = []
for review in X_train['review']:
    xList.append(review)

Use the [gensim](https://radimrehurek.com/gensim/models/word2vec.html) to train a Word2Vec model. Keep the dimensionality at 300 and window size at 2. After trianing use the model and previously coded methods create vectorial represenations for movie reviews.<i>(create train_X, test_X, train_Y and test_Y)</i>

In [61]:
def trainWord2Vec(dataX):
    """
    Return preprocessed data

    Args:
        data : movie reviews
    
    Returns: model
    model : Word2Vec model 
    """
#     preprocessed_data=
    model = Word2Vec(dataX, size=300, window=2,min_count=1)
    return model
    


In [62]:
wv = trainWord2Vec(xList)

In [63]:
word_vector = wv.wv

In [64]:
#Removing words not in vocab
v = word_vector.vocab.keys()

In [65]:
vocab = []
for i in v:
    vocab.append(i)

In [66]:
def convertVector(X):
    Xlist = []
    for word in X:
        #print(word)
        a = []
        for i in word:
            #print(word)
            try:
                a.append(word_vector.get_vector(i))
            except:
                pass
        try:
            a = np.stack(a,axis=0)
        except:
            print(a)
        Xlist.append(a)
    return Xlist

In [67]:
X_train_vec = X_train['review']
X_test_vec = X_test['review']

In [68]:
X_train_vec = convertVector(X_train_vec)

In [69]:
X_test_vec = convertVector(X_test_vec)

In [70]:
y_test_hot = to_categorical(y_test,num_classes=2)
y_train_hot = to_categorical(y_train,num_classes=2)

In [71]:
len(X_test_vec)

6250

In [84]:
#load the train and test files and create the vectorial representations
max_x_train = averagePooling(X_train_vec,embedding_size=300)
max_x_test = averagePooling(X_test_vec,embedding_size=300)

In [85]:
max_x_train.shape

(18750, 300)

Since this is dense representaion we wont be faced with the challenges posed by sparse representations. We can move onto modelling.

### Modelling


In [86]:
import tensorflow as tf
from tensorflow import keras
embedding_size=300
model_word2vec = keras.Sequential([
    keras.layers.Flatten(input_shape=(embedding_size,)),#donot change(input layer)
    keras.layers.Dense(150, activation='relu'),#(hidden layer)
    keras.layers.Dense(50, activation='relu'),#(hidden layer)
    keras.layers.Dense(2)#donot change
])
model_word2vec.compile(optimizer='adam',
              loss=["categorical_crossentropy"],
              metrics=['accuracy'])

model_word2vec.summary()

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
flatten_2 (Flatten)          (None, 300)               0         
_________________________________________________________________
dense_6 (Dense)              (None, 150)               45150     
_________________________________________________________________
dense_7 (Dense)              (None, 50)                7550      
_________________________________________________________________
dense_8 (Dense)              (None, 2)                 102       
Total params: 52,802
Trainable params: 52,802
Non-trainable params: 0
_________________________________________________________________


In [87]:
train_X = max_x_train
train_Y = y_train_hot

In [88]:
model_word2vec.fit(train_X,train_Y, epochs=15, batch_size=32,
               verbose=1,shuffle=True)

Train on 18750 samples
Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15


<tensorflow.python.keras.callbacks.History at 0x1d90d170b88>

In [89]:
predictions1 = model_word2vec.predict(max_x_test)
test_Y = y_test_hot
predictions = np.argmax(predictions1,axis=-1)
predictions

array([1, 0, 1, ..., 0, 1, 1], dtype=int64)

In [90]:
#predictions = 
labelList = ['Pos','Neg']
from sklearn.metrics import confusion_matrix

test_Y_max=np.argmax(test_Y, axis=-1)
cm=confusion_matrix(test_Y_max,predictions)
cm1 = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
cm1 = pd.DataFrame(cm, labelList,labelList )# matrix,names row,names col,
plt.figure(figsize=(3,3))
sn.set(font_scale=1.4) # for label size
sn.heatmap(cm1, annot=True, annot_kws={"size": 11}, fmt=".2f") # font size
plt.show()


<IPython.core.display.Javascript object>

In [79]:
d = {'predictions':predictions,'test_Y':np.array(y_test)}


In [91]:
d

{'predictions': array([0, 1, 1, ..., 1, 0, 1], dtype=int64),
 'test_Y': array([1, 0, 1, ..., 0, 1, 0], dtype=int64)}

In [92]:
writeData = pd.DataFrame(data=d)
writeData.to_csv("Final.csv")

In [93]:
print("Classification Report\n",classification_report(test_Y_max, predictions, labels=[0,1], target_names = labelList))

Classification Report
               precision    recall  f1-score   support

         Pos       0.82      0.82      0.82      3131
         Neg       0.82      0.82      0.82      3119

    accuracy                           0.82      6250
   macro avg       0.82      0.82      0.82      6250
weighted avg       0.82      0.82      0.82      6250



# Theory
The two are two major reaserch papers [Word2Vec](https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf) for prediction based embeddings and [GloVe](https://nlp.stanford.edu/pubs/glove.pdf) for frequency based embeddings. Research online and write a short note on the trade offs associated with the two types of embeddings. 

`Copied from Quora after reading both papers`

`Presence of Neural Networks: GloVe does not use neural networks while word2vec does. In GloVe, the loss function is the difference between the product of word embeddings and the log of the probability of co-occurrence. We try to reduce that and use SGD but solve it as we would solve a linear regression. While in the case of word2vec, we either train the word on its context (skip-gram) or train the context on the word (continuous bag of words) using a 1-hidden layer neural network.`

`Global information: word2vec does not have any explicit global information embedded in it by default. GloVe creates a global co-occurrence matrix by estimating the probability a given word will co-occur with other words. This presence of global information makes GloVe ideally work better. Although in a practical sense, they work almost similar and people have found similar performance with both.`

#### Ending Note:
Feed forward networks are not suitable for natural language task because of thier fixed input sizes, the size of natural language text in each example for a dataset can vary considerably, also feed forward networks ignore the temporal nature of natural language text, which result's in them not bieng able to caputre context's or interdepencies between words for semantic information. To fix this researcher's have invented recurrent neural networks that help to aleviate these limitations.
The next assignment will be related to recurrent neural networks.

# We hope all of you are working on your projects!