Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE", as well as your name and collaborators below:

In [1]:
NAME = "Mohlatlego Nakeng"
StudentNumber = "13413351"

---

## COS 802 2019 Homework 2 - Part 3 [35 points, 42 with extra credit]
## Exploring Word Embeddings.


**You will learn how to:**
- Train your own word embedding.
- Use pretrained word embeddings.
- Use word embeddings in simple classifiers.

**What this homework does not cover!!!!**
- Sequential deep learning models with embeddings. A challenge with words is that their meaning depends on the sequence in which they appear. [Resource 1](https://machinelearningmastery.com/sequence-classification-lstm-recurrent-neural-networks-python-keras/) [Resource 2](https://www.coursera.org/lecture/nlp-sequence-models/why-sequence-models-0h7gT)
- Other Deep learning models with embeddings [Resource 3](https://realpython.com/python-keras-text-classification/#convolutional-neural-networks-cnn)

**Note:** you can experiment by adding additional cells, but they must be removed from the final solution. Only the cells originally in the notebook plus the ones you have filled with your solution are required.

## 1 Packages ##

Scikit-Learn for text Analysis
- [sklearn](https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html): Scikit-Learn Working With Text Data
- [matplotlib](http://matplotlib.org) is a library for plotting graphs in Python.
- [gensim](https://pypi.org/project/gensim/) NLP library with word-embedding functions.

In [19]:
%pylab inline

Populating the interactive namespace from numpy and matplotlib


In [20]:
!pip install wordcloud
!pip install nltk
!pip install gensim



In [2]:
import gensim
import nltk

import pandas as pd
import numpy as np

from nltk.tokenize import sent_tokenize, word_tokenize
from sklearn.model_selection import train_test_split

In [22]:
nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

## Fetch the AGNews dataset from the FastAI repository

Fast AI repo https://course.fast.ai/datasets

> 496,835 categorized news articles from >2000 news sources from the 4 largest classes from AG’s corpus of news articles, using only the title and description fields. The number of training samples for each class is 30,000 and testing 1900.

In [3]:
import os

if not os.path.isfile('ag_news_csv.tgz'):
    !wget https://s3.amazonaws.com/fast-ai-nlp/ag_news_csv.tgz

# Uncompress the archive
if not os.path.isfile('ag_news_csv/train.csv'):
    !tar -xzf ag_news_csv.tgz
    # Let's see what is in the archive
    !ls ag_news_csv/
# View categories
!cat ag_news_csv/classes.txt

'cat' is not recognized as an internal or external command,
operable program or batch file.


In [4]:
df_ag_news_train  = pd.read_csv('ag_news_csv/train.csv', header = None)
df_ag_news_train.columns = ['category','title','text']
df_ag_news_train.head()

| | category | title | text |
|---|---|---|---|
| 0 | 3 | Wall St. Bears Claw Back Into the Black (Reuters) | Reuters - Short-sellers, Wall Street's dwindli... |
| 1 | 3 | Carlyle Looks Toward Commercial Aerospace (Reu... | Reuters - Private investment firm Carlyle Grou... |
| 2 | 3 | Oil and Economy Cloud Stocks' Outlook (Reuters) | Reuters - Soaring crude prices plus worries\ab... |
| 3 | 3 | Iraq Halts Oil Exports from Main Southern Pipe... | Reuters - Authorities have halted oil export\f... |
| 4 | 3 | Oil prices soar to all-time record, posing new... | AFP - Tearaway world oil prices, toppling reco... |


## Task 1: Building your own Word2Vec [4 points]
### What is word2vec

*Word2vec is a group of related models that are used to produce word embeddings. These models are shallow, two-layer neural networks that are trained to reconstruct linguistic contexts of words. Word2vec takes as its input a large corpus of text and produces a vector space, typically of several hundred dimensions, with each unique word in the corpus being assigned a corresponding vector in the space.* [URL](https://en.wikipedia.org/wiki/Word2vec)

**Task:**

You will use gensim to train both Skip-Gram and CBOW word2vec models on the AGNews data.
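To make the difference between the two training modes concrete: CBOW predicts a centre word from its surrounding context words, while Skip-Gram predicts each context word from the centre word. The sketch below only builds the (centre, context) windows that both modes learn from. `context_pairs` is a hypothetical helper written for illustration, and real word2vec implementations add details (such as subsampling and negative sampling) that are omitted here.

```python
def context_pairs(tokens, window):
    """Return a list of (centre_word, context_words) for a fixed window size."""
    pairs = []
    for i, centre in enumerate(tokens):
        left = tokens[max(0, i - window):i]    # up to `window` words before
        right = tokens[i + 1:i + 1 + window]   # up to `window` words after
        pairs.append((centre, left + right))
    return pairs

# Toy sentence (hypothetical, not taken from the dataset)
tokens = ["oil", "prices", "soar", "again"]
for centre, context in context_pairs(tokens, window=1):
    # CBOW: context -> centre; Skip-Gram: centre -> each context word
    print(centre, context)
```

With `window_size = 5`, as used later in this notebook, each window would reach up to five words on either side.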

In [5]:
documents = df_ag_news_train.text.values

# Go through all the documents and tokenize them by words
data_gensim = []
for doc in documents:
    
    # We are training on sentences
    for sentence in sent_tokenize(doc):
        tokens = []
        # We want only lower case
        for word in word_tokenize(sentence):
            tokens.append(word.lower())
        data_gensim.append(tokens)

# The result
print("Original doc: ", documents[0])
print("Tokenized doc: ", data_gensim[0])
print("Num Sentences: ", len(data_gensim))

Original doc:  Reuters - Short-sellers, Wall Street's dwindling\band of ultra-cynics, are seeing green again.
Tokenized doc:  ['reuters', '-', 'short-sellers', ',', 'wall', 'street', "'s", 'dwindling\\band', 'of', 'ultra-cynics', ',', 'are', 'seeing', 'green', 'again', '.']
Num Sentences:  154351


Hmm, that does not seem right. Can we remove all of the special characters? Let's use the RegexpTokenizer, which takes a regular expression and can be used to drop special characters.

In [6]:
from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r'\w+') # Can you figure out what the regular expression is doing here?
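As a hint for the question in the comment above: the pattern `\w+` matches maximal runs of word characters (letters, digits and underscores), so punctuation, whitespace and the stray backslash all act as separators. A minimal sketch using Python's standard `re` module as a stand-in for `RegexpTokenizer`, which should produce the same tokens for this pattern (the sample string is adapted from the first document):

```python
import re

# Sample text adapted from the first AG News document; note the literal backslash.
sample = "Wall Street's dwindling\\band of ultra-cynics, are seeing green again."

# \w+ keeps runs of letters, digits and underscores; everything else is dropped.
tokens = re.findall(r"\w+", sample.lower())
print(tokens)
```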

Let's do this again.

In [7]:
documents = df_ag_news_train.text.values

# Go through all the documents and tokenize them by words
data_gensim = []
for doc in documents:
    # We are training on sentences
    for sentence in sent_tokenize(doc):
        tokens = []
        # We want only lower case
        for word in tokenizer.tokenize(sentence): # We are using the new tokeniser here
            tokens.append(word.lower())
        data_gensim.append(tokens)

# The result
print("Original doc: ", documents[0])
print("Tokenized doc: ", data_gensim[0])
print("Num Sentences: ", len(data_gensim))

Original doc:  Reuters - Short-sellers, Wall Street's dwindling\band of ultra-cynics, are seeing green again.
Tokenized doc:  ['reuters', 'short', 'sellers', 'wall', 'street', 's', 'dwindling', 'band', 'of', 'ultra', 'cynics', 'are', 'seeing', 'green', 'again']
Num Sentences:  154351


In [8]:
# Word2vec properties
embedding_size = 300
window_size = 5
min_count_words = 1 # minimum times we must see a word before we make a vector for it

### Q1.1 

Take a look at **[gensim.models.Word2Vec](https://radimrehurek.com/gensim/models/word2vec.html)** and train a **CBOW** model. The model should be saved in *model_cbow*



In [9]:
# Train CBOW model
# YOUR CODE HERE
from gensim.models import Word2Vec
# sg=0 selects CBOW (the default). Note: gensim >= 4.0 renames `size` to `vector_size`.
model_cbow = Word2Vec(data_gensim, min_count=min_count_words, size=embedding_size, window=window_size, sg=0)

# Note: This will take some time. Be patient.

Let's see the vector for the word *money*

In [10]:
model_cbow.wv['money']

array([-0.43023917,  0.49391034, -0.29455504, -0.16713655, -0.7198631 ,
        1.1437222 ,  0.33048776, -0.79975283,  0.16143873,  0.22538131,
       -0.69949865,  0.42010462,  0.02072031,  0.27805907, -0.02141028,
        0.01897099, -0.3570757 , -0.1552344 , -0.11092183, -0.10134619,
       -0.89406645, -0.25430495, -0.7951853 ,  0.7433451 ,  0.37337   ,
        0.13273127, -0.252723  ,  0.29027316,  0.49601826, -0.547723  ,
        0.17721382, -0.06368242, -0.22159591, -0.55978185,  0.02974451,
        0.78752035,  0.7518917 , -1.2577966 , -0.45998067,  0.03627203,
       -0.8557342 , -0.34061185,  0.3619586 ,  0.19925684,  0.14900613,
       -0.29531407, -0.09845301,  0.14405546, -0.13352919,  0.11729415,
        0.3884275 ,  0.08272228,  0.40851918, -0.1461801 ,  0.0803031 ,
        0.84643286,  0.504147  , -0.71561295, -0.8155103 ,  0.4897443 ,
       -0.02334483,  0.36437902, -0.47386777, -0.54606193, -0.29890427,
        0.34048876, -0.12005597,  0.5151365 ,  0.07837338, -0.52

One thing word embeddings allow that bag-of-words models do not is measuring the [cosine similarity](https://www.machinelearningplus.com/nlp/cosine-similarity/) between words. A value of 1.0 is a perfect match, meaning the two vectors point in the same direction.
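For intuition, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below computes it with NumPy on toy vectors; the values are hypothetical and not taken from the trained models, and `cosine_similarity` is a helper defined here for illustration.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy 3-dimensional vectors (hypothetical values)
a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 4.0, 0.0])    # same direction as a, different length
c = np.array([-2.0, 1.0, 0.0])   # orthogonal to a

print(cosine_similarity(a, b))   # close to 1: same direction
print(cosine_similarity(a, c))   # close to 0: unrelated directions
```

Because the measure ignores vector length, two words can be highly similar even if one of them has a much larger embedding norm.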

In [11]:
print("CBOW similarity: ", model_cbow.wv.similarity('money', 'cash'))

CBOW similarity:  0.56732136


In [12]:
print("CBOW Most Similar to Money")
model_cbow.wv.most_similar('money')

CBOW Most Similar to Money


[('benefits', 0.6512506008148193),
 ('customers', 0.6380195021629333),
 ('paying', 0.6330016851425171),
 ('consumers', 0.6310704350471497),
 ('funds', 0.6260761022567749),
 ('savings', 0.6136110424995422),
 ('cowbird', 0.610594630241394),
 ('options', 0.6009640693664551),
 ('ways', 0.599640965461731),
 ('certain', 0.5945741534233093)]

### Q1.2

Now, Take a look at **[gensim.models.Word2Vec](https://radimrehurek.com/gensim/models/word2vec.html)** and train a **Skip-Gram** model. The model should be saved in *model_skipgram*

In [13]:
# Train Skip-Gram model
# YOUR CODE HERE

# sg ({0, 1}, optional): training algorithm, 1 for skip-gram; otherwise CBOW.
model_skipgram = Word2Vec(data_gensim, min_count=min_count_words, size=embedding_size, window=window_size, sg=1)
# Note: This will take some time. Be patient.

In [14]:
print("Skip-Gram similarity: ", model_skipgram.wv.similarity('money', 'cash'))

Skip-Gram similarity:  0.43862706


In [15]:
print("SkipGram Most Similar to Money")
model_skipgram.wv.most_similar('money')

SkipGram Most Similar to Money


[('changers', 0.6007676720619202),
 ('recoup', 0.5959882140159607),
 ('payoffs', 0.5949201583862305),
 ('laundering', 0.579896092414856),
 ('gifts', 0.5775237083435059),
 ('inclined', 0.5665773153305054),
 ('deposit', 0.56183260679245),
 ('portfolios', 0.5608170032501221),
 ('parija', 0.5567904710769653),
 ('taxpayer', 0.556017279624939)]

## Task 2: Pretrained Embeddings. [4 Points]

A great feature of embeddings is someone else can pretrain them for us. In this case we can get the original [Google News word2vec vectors](https://code.google.com/archive/p/word2vec/). 

In [17]:
# If you have not downloaded this, run the next 2 lines
!wget -c https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz
!gunzip GoogleNews-vectors-negative300.bin.gz

'wget' is not recognized as an internal or external command,
operable program or batch file.
'gunzip' is not recognized as an internal or external command,
operable program or batch file.


In [19]:
from gensim import models

## Uncomment below

# model_w2v_google = models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

## Note: This will take some time. Be patient.

In [21]:
# The cell above is not editable!
# I was unable to access the files, so I downloaded them manually instead.
model_w2v_google = models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

In [24]:
## Uncomment below
#  print("Google News Pretrained W2V similarity: ", model_w2v_google.wv.similarity('money', 'cash'))

In [25]:
## Uncomment below

# model_w2v_google is already a KeyedVectors object, so no `.wv` is needed
print("Google News Pretrained W2V similarity: ", model_w2v_google.similarity('money', 'cash'))

In [27]:
## Uncomment below

print("Google Pretrained W2V Most Similar to Money")
model_w2v_google.most_similar('money')

### Questions

**Q2.1:** Why do the 3 different word2vec models have differing distances between money and cash?

**Q2.2:** Test other words and check if the distance makes sense to you. What do you observe?

### Answers

A2.1 and A2.2 Here
YOUR ANSWER HERE

## Task 3: Visualizing Word Vectors with TSNE [4 points]
In this section you will first reduce the large vector space (dimension 300) to a 2D space, so that you can visualise the words and their semantic relations on a plot. We use t-SNE for this.

*t-distributed Stochastic Neighbor Embedding. t-SNE is a tool to visualize high-dimensional data. It converts similarities between data points to joint probabilities and tries to minimize the Kullback-Leibler divergence between the joint probabilities of the low-dimensional embedding and the high-dimensional data. t-SNE has a cost function that is not convex, i.e. with different initializations we can get different results.* [URL](https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html)
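For reference, the quantity described above can be written out as follows. This is the standard t-SNE formulation, with $p_{ij}$ the joint probabilities in the high-dimensional space and $y_i$ the low-dimensional points; the symbol names are a notational choice and are not taken from this notebook.

$$
q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}{\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}},
\qquad
C = \mathrm{KL}(P \,\|\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}
$$

Because $C$ is not convex in the $y_i$, gradient descent from different starting points can settle in different layouts, which is why two runs on the same vectors need not look identical.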

In [28]:
from sklearn.manifold import TSNE
tsne_model = TSNE(perplexity=40, n_components=2, init='pca', n_iter=1500)

Let's get data from the CBOW model so that we can then fit the t-SNE model.

In [29]:
labels_viz = []
tokens_viz = []
# Keep up to 10000 words that appear in both the CBOW and Google News vocabularies
for word in model_cbow.wv.vocab:
    if word in model_w2v_google:
        tokens_viz.append(model_cbow.wv[word])
        labels_viz.append(word)
        if len(labels_viz) >= 10000:
            break


Now let's train the TSNE model for CBOW

In [30]:
values_cbow_tsne = tsne_model.fit_transform(tokens_viz)
# Note: This will take some time. Be patient.

In [32]:
import matplotlib.pyplot as plt
x = []
y = []
for value in values_cbow_tsne:
    x.append(value[0])
    y.append(value[1])
    
plt.figure(figsize=(20, 20)) 
for i in range(200):
    plt.scatter(x[i],y[i])
    plt.annotate(labels_viz[i],
                    xy=(x[i], y[i]),
                    xytext=(5, 2),
                    textcoords='offset points',
                    ha='right',
                    va='bottom')

In [32]:
# Get the same words in the Google model
tokens_viz_w2v = []
for word in labels_viz:
    tokens_viz_w2v.append(model_w2v_google[word])

In [33]:
values_w2v_tsne = tsne_model.fit_transform(tokens_viz_w2v)
# Note: This will take some time. Be patient.

In [None]:
import matplotlib.pyplot as plt
x = []
y = []
for value in values_w2v_tsne:
    x.append(value[0])
    y.append(value[1])
    
plt.figure(figsize=(20, 20)) 
for i in range(200):
    plt.scatter(x[i],y[i])
    plt.annotate(labels_viz[i],
                    xy=(x[i], y[i]),
                    xytext=(5, 2),
                    textcoords='offset points',
                    ha='right',
                    va='bottom')

### Question 3:
What do you observe in the two visualisations? What are the similarities and the differences? What can you say about one embedding compared with the other?

### A3
YOUR ANSWER HERE

## Task 4: Classifiers and embeddings [40 points]

In this task, which takes up most of this homework, you will experiment with different ways of using word embeddings to transform documents. You will transform each document (a sequence of word embeddings) into a single vector and look at the effect on classifier performance.

* **Q1:** experiment with using a *sum* to transform the word embedding sequence to a document vector [5]
* **Q2:** experiment with using a *max* to transform the word embedding sequence to a document vector [5]
* **Q3:** experiment with the transforms from Q1 and Q2 using the different CBOW and Skip-Gram models to transform the word embedding sequence to a document vector [7]
* **Q4: Discussion:** Discuss your observations on the performance results [10]
* **Q5 [Extra Credit - extra 5 points]:** experiment with using approaches from [this paper](https://arxiv.org/abs/1803.01400) to transform the word embedding sequence to a document vector - Paper: *Concatenated Power Mean Word Embeddings as Universal Cross-Lingual Sentence Representations*

**Evaluation:** Evaluate performance via accuracy on predicting the document class. An example is shown below for a mean transform with a logistic regression model.

**Note** We will not cover using paragraph2vec or doc2vec but you can look at that yourself. It is available in the gensim package.
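The transforms in Q1 and Q2 all reduce a (number of words, embedding size) array along the word axis. A minimal sketch on a toy document, assuming three words with 4-dimensional embeddings (hypothetical values chosen for easy arithmetic):

```python
import numpy as np

# Toy "document": three words, each with a 4-dimensional embedding
word_vectors = np.array([
    [1.0, 0.0, -2.0, 4.0],
    [3.0, 2.0,  0.0, 4.0],
    [2.0, 4.0,  2.0, 1.0],
])

doc_mean = word_vectors.mean(axis=0)  # average over the word axis
doc_sum = word_vectors.sum(axis=0)    # grows with document length
doc_max = word_vectors.max(axis=0)    # largest value per dimension

print(doc_mean)
print(doc_sum)
print(doc_max)
```

Whatever the document length, each result has the same shape as a single word vector, which is what lets a standard classifier consume it.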

In [None]:
# Helper function to pool a document's word vectors into a single document vector.

# Note!! You should add your code in this function to implement Q1-5
def embedding_transform(documents_tokenized, w2v_model, embedding_size, transform_type='mean'):
    transformed_docs = []
    for words in documents_tokenized:
        transformed_doc = []
        for w in words:
            if w in w2v_model:
                transformed_doc.append(w2v_model[w])
            else:
                transformed_doc.append(np.zeros(embedding_size))  # Default 0 vector for unknown words
        if not transformed_doc:
            # Guard against empty documents so every row has length embedding_size
            transformed_doc.append(np.zeros(embedding_size))

        # Pool the (num_words, embedding_size) sequence along the word axis.
        if transform_type == 'mean':
            transformed_doc = np.mean(transformed_doc, axis=0)
        elif transform_type == 'sum':
            # YOUR CODE HERE
            transformed_doc = np.sum(transformed_doc, axis=0)
        elif transform_type == 'max':
            # YOUR CODE HERE
            transformed_doc = np.max(transformed_doc, axis=0)
        elif transform_type == 'extra':
            # YOUR CODE HERE
            # Power mean with p = 2 (root mean square); squaring discards the sign
            transformed_doc = np.sqrt(np.mean(np.power(transformed_doc, 2), axis=0))
        else:
            raise NotImplementedError()
        transformed_docs.append(transformed_doc)
    output = np.array(transformed_docs)
    return output
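For the extra-credit question, the referenced paper builds a document vector by concatenating several power means rather than using a single one. The sketch below is one possible reading, assuming the three power values p = 1 (arithmetic mean), p = +inf (element-wise maximum) and p = -inf (element-wise minimum); the choice of powers and the helper name `concat_power_means` are assumptions for illustration, not the paper's exact configuration.

```python
import numpy as np

def concat_power_means(word_vectors):
    """Concatenate the p = 1, +inf and -inf power means of a (num_words, dim) array."""
    word_vectors = np.asarray(word_vectors, dtype=float)
    return np.concatenate([
        word_vectors.mean(axis=0),  # p = 1: arithmetic mean
        word_vectors.max(axis=0),   # p = +inf: element-wise maximum
        word_vectors.min(axis=0),   # p = -inf: element-wise minimum
    ])

# Toy document: two words with 2-dimensional embeddings (hypothetical values)
toy_doc = np.array([[1.0, -2.0], [3.0, 4.0]])
doc_vector = concat_power_means(toy_doc)
print(doc_vector.shape)
print(doc_vector)
```

Note that the resulting document vector is three times the embedding size, so a classifier trained on it sees 900 features instead of 300 for the embeddings used here.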

Go through all the documents and tokenize them by words

In [None]:
documents_tokenized = []
for doc in documents:
    tokens = []
    # We want only lower case
    for word in tokenizer.tokenize(doc): # We are using the new tokeniser here
        tokens.append(word.lower())
    documents_tokenized.append(tokens)

In [None]:
y = df_ag_news_train.category # Get all the document labels

# Now let's split into training and testing.
doc_train, doc_test, y_train, y_test = train_test_split(documents_tokenized, y, test_size=0.3)

Here is where we transform the data with the mean transform.

In [None]:
X_train_mean = embedding_transform(doc_train, model_cbow.wv, embedding_size, transform_type='mean')

In [None]:
X_train_mean.shape

Note that each document is now represented by a vector of length embedding_size. Every document has only 300 features, instead of the much larger sizes produced by bag-of-words.

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

In [None]:
# Train and then check score.
clf = LogisticRegression(n_jobs=-1) # Use multicore
scores = cross_val_score(clf, X_train_mean, y_train, cv=5)

In [None]:
scores
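`cross_val_score` returns one accuracy per fold, so a single summary number is usually reported as the mean, with the standard deviation as a rough indication of spread. A small sketch, using hypothetical fold accuracies in place of the real `scores` array:

```python
import numpy as np

# Hypothetical fold accuracies standing in for the `scores` array above
example_scores = np.array([0.86, 0.87, 0.88, 0.87, 0.87])

mean_acc = example_scores.mean()
std_acc = example_scores.std()
print(f"Accuracy: {mean_acc:.3f} +/- {std_acc:.3f}")
```

Applying the same summary to each transform and each embedding model gives directly comparable numbers for the discussion.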

# Answers to all Task 4 questions below

Do your experimentation below to answer the questions. You can add code lines, visualisations and tests as needed, and also include a **discussion**.

### A 4.1

In [None]:
# YOUR CODE HERE
X_train_sum = embedding_transform(doc_train, model_cbow.wv, embedding_size, transform_type='sum')
X_train_sum.shape

### A 4.2

In [None]:
# YOUR CODE HERE
X_train_max = embedding_transform(doc_train, model_cbow.wv, embedding_size, transform_type='max')
X_train_max.shape

### A 4.3

In [None]:
# YOUR CODE HERE
X_train_sum_sg = embedding_transform(doc_train, model_skipgram.wv, embedding_size, transform_type='sum')
X_train_sum_sg.shape

### A 4.5 (optional)

In [None]:
# YOUR CODE HERE
X_train_extra = embedding_transform(doc_train, model_cbow.wv, embedding_size, transform_type='extra')
X_train_extra.shape

### Discussion 4

The Skip-Gram model performed better than the continuous bag-of-words (CBOW) model. It had better validation accuracy scores across all the embedding transformations, and its models also ran quicker than the CBOW models.

Average model accuracy for each embedding transformation:

- **Sum:** CBOW 87.6%, Skip-Gram 88.9%
- **Max:** CBOW 81.3%, Skip-Gram 85.3%
- **Mean:** CBOW 86.9%, Skip-Gram 88.8%
- **Power mean:** CBOW 86.1%, Skip-Gram 88.1%
