# Modern Data Science 
**(Module 05: Deep Learning)**

---
- Materials in this module include resources collected from various open-source online repositories.
- You are free to use, change and distribute this package.

Prepared by and for 
**Student Members** |
2006-2018 [TULIP Lab](http://www.tulip.org.au), Australia

---


# Session A - Neural Embedding

**The purpose of this session is to introduce Word2Vec, a family of vector space models (VSMs) that represent (embed) words in a continuous vector space where semantically similar words are mapped to nearby points. In this practical session, we present the following topics:**

1. Learning vector representations of words using Gensim, a Python library that includes an implementation of Word2Vec.
2. Conducting an extrinsic evaluation of the word vectors by using them to build features for sentiment classification of movie reviews.

**References, additional reading and resources**
- [Vector Representations of Words](https://www.tensorflow.org/tutorials/word2vec)
- [Word2Vec word embedding tutorial in Python and TensorFlow](http://adventuresinmachinelearning.com/word2vec-tutorial-tensorflow/)
- [Word2Vec Resources](http://mccormickml.com/2016/04/27/word2vec-resources/)


---





## 0. Preliminaries

This section explains how to install the software packages used in this notebook and introduces the datasets used throughout this practical session.

### 0.1. Installing Gensim

Gensim is a free Python library designed to automatically extract semantic topics from documents. It is one of the main libraries for handling textual data and is used in this and some subsequent sessions. However, in this session, we only use the Word2Vec implementation from this library.

To install gensim, from the command line, you can run:

``pip install gensim``

However, you can also install packages from within a notebook by prefixing the command with `!`, as follows:


In [None]:
# you only need to run this command once to install gensim
!pip install gensim

If Gensim is already installed, you will be notified.

<img src="https://raw.githubusercontent.com/tuliplab/mds/master/Jupyter/image/note.gif" width="40", align="left"> Installing a new package might require administrator privileges. If the above command fails, open an Anaconda command window with administrator rights and type:

                conda install gensim


### 0.2. Datasets

We are mainly working on two datasets throughout this practical session: Wikipedia and IMDb movie reviews.

1. **Wikipedia**: the first 100,000,000 bytes (~100MB) of the English Wikipedia dump of Mar. 3, 2006, provided by Matt Mahoney at http://mattmahoney.net/dc/text8.zip
The data was already downloaded and stored at <span style="color:blue">data/dl/wiki/wiki.txt</span>. If the file is not available, download it and store it in the corresponding folder.

2. **IMDb movie reviews**: 
This is a sentiment polarity dataset consisting of 25,000 movie reviews for training (*train-pos.txt and train-neg.txt* files) and 25,000 movie reviews for testing (*test-pos.txt and test-neg.txt* files). In each set, half of the reviews are labelled 'positive' and the other half 'negative'. There are also 50,000 unlabelled reviews (train-unsup.txt). The dataset is available from
http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
http://ai.stanford.edu/~amaas/data/sentiment/
The data was already downloaded and stored at <span style="color:blue">data/dl/sent</span>. If the files are not available, download them and store them in the corresponding folder.


## 1. Training a Word2Vec model

Word2Vec learns the semantic relationships between words. If you want to think of Word2Vec as a magical black box, you can think of it as taking a word as input and returning a vector of numbers that represents the word and its meaning. You can refer to Lecture 7 or [this tutorial](https://www.tensorflow.org/tutorials/word2vec) for details on how to implement a Word2Vec model in TensorFlow.
We can use the Gensim library to train word embedding vectors on the Wikipedia dataset and save the learned model into the folder <span style="color:blue"> model/wiki/</span>.

First, import the necessary libraries:

In [None]:
# import the Python libraries needed for training embedded vectors
from gensim.models.word2vec import LineSentence
from gensim.models import Word2Vec
import time # for checking how long the training process takes

Next, specify where the input data is located:

In [None]:
# where the data is located
input_data = 'data/wiki/wiki.txt'

Word2vec is a particularly computationally-efficient predictive model for learning word embeddings from raw text. It comes in two flavors, the **Continuous Bag-of-Words model (CBOW)** and the **Skip-Gram model**. Algorithmically, these models are similar, except that CBOW predicts target words (e.g. 'mat') from source context words ('the cat sits on the'), while skip-gram does the inverse and predicts source context words from the target words. This inversion might seem like an arbitrary choice, but statistically it has the effect that <span style="color:blue">CBOW</span> smooths over a lot of the distributional information (by treating an entire context as one observation). For the most part, this turns out to be a useful thing for <span style="color:blue">smaller datasets</span>. However, <span style="color:yellowgreen">skip-gram</span> treats each context-target pair as a new observation, and this tends to do better when we have <span style="color:yellowgreen">larger datasets</span>.
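
To make the difference between the two flavors concrete, the sketch below lists the training examples each one would see for a single sentence. It is only an illustration: the function name `training_examples` and the fixed window of 2 are assumptions made here, and a real Word2Vec implementation adds further steps (such as subsampling frequent words) that are left out.

```python
def training_examples(tokens, window=2):
    """Return (cbow, skipgram) training examples for a tokenised sentence."""
    cbow, skipgram = [], []
    for i, target in enumerate(tokens):
        # context = up to `window` words on each side of the target word
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        # CBOW: the whole context is one observation that predicts the target
        cbow.append((context, target))
        # skip-gram: every (target, context word) pair is its own observation
        skipgram.extend((target, c) for c in context)
    return cbow, skipgram


sentence = "the cat sits on the mat".split()
cbow, skipgram = training_examples(sentence, window=2)

print(len(cbow), "CBOW examples, last one:", cbow[-1])
print(len(skipgram), "skip-gram examples, last two:", skipgram[-2:])
```

For the same sentence, skip-gram produces more (and finer-grained) observations than CBOW, which matches the intuition that CBOW smooths over the context while skip-gram treats each pair separately.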

<img src="https://raw.githubusercontent.com/tuliplab/mds/master/Jupyter/image/note.gif" width="40", align="left">  The following code may take some time. You can read the following section while waiting for the code to finish. You **must** create the ``model/wiki`` folder to store the trained model; if the folder does not exist, an error will be raised. If you forgot to create the folder, you can re-run the code from ``model.save(model_file)`` onwards to avoid retraining.
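
One way to avoid the missing-folder error is to create the folder from Python before training. This is a minimal sketch using only the standard library; the variable name `model_dir` is an assumption introduced here.

```python
import os

model_dir = 'model/wiki'  # folder named in the note above
# create the folder (and any missing parents); no error if it already exists
os.makedirs(model_dir, exist_ok=True)
print("Folder ready:", os.path.isdir(model_dir))
```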

When training the model with given data, you can choose the corresponding algorithm as follows:

In [None]:
# parameters for training
sg_ = 1 # the training algorithm. If sg=0, CBOW is used. Otherwise (sg=1), skip-gram is employed.
alg = 'CBOW' if sg_ == 0 else 'sg'
size_ = 200 #  the dimensionality of the feature vectors
window_ = 5 # the context size or the maximum distance between the current and predicted word

# where to save the model learned
model_file = 'model/wiki/' + alg + '_' + str(size_) + '_' + str(window_)

# record the time at which training starts
start_time = time.time()
print("Running ...")

# training embedded vectors for the dataset with the parameters specified above
# (note: gensim >= 4.0 renamed the `size` argument to `vector_size`)
model = Word2Vec(LineSentence(input_data), sg=sg_, size=size_, window=window_)
# save the model learned into model file
model.save(model_file)

# show how long it took to train the word vectors
print("="*40)
runtime = time.time() - start_time
print("--- Running time: %s seconds ---" % (runtime))

We can check the vocabulary size of the corpus using the trained model:

In [None]:
# (note: gensim >= 4.0 replaced `model.wv.vocab` with `model.wv.key_to_index`)
words = list(model.wv.vocab.keys())
print("The number of words: {}".format(len(words)))
print("The 10th word in the vocabulary: {}".format(words[9]))


**<span style="color:red">  Exercise 1: </span>**
**<span style="color:#0b486b"> 
Train the embedded vectors for the IMDb movie reviews dataset, using only the training data, stored at <span style="color:blue"> data/sent/train-pos-neg-unsup.txt </span>.
Save the learned model in <span style="color:blue"> model/sent/</span>.
Use the same parameters as for training on the Wikipedia dataset above.
</span>**

In [None]:
# Enter your own code here


## 2. Vector calculus


With the vectors at hand, what can we do with them? Can they tell us how close *'dog'* and *'cat'* are? Which word is closest to *'sister'*? Or can we infer *'queen'* from the vectors of *'king'*, *'woman'*, and *'man'*?

Indeed, the proximity of two words can be computed as the cosine similarity of their vectors. Likewise, finding the closest word to a given word amounts to searching for the vector with the largest cosine similarity to that word's vector.

For the last case, of all the vectors, `model['queen']`, the vector for 'queen', should have the largest cosine similarity with `v = model['woman'] + model['king'] - model['man']`.
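
The arithmetic can be checked by hand on a toy example. The sketch below uses hypothetical two-dimensional vectors chosen for illustration only (first coordinate roughly "royalty", second roughly "gender"); they are not taken from any trained model.

```python
import math


def cosine(u, v):
    """Cosine similarity of two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


# hypothetical toy vectors, NOT learned embeddings
toy = {
    'king':  [1.0, 1.0],
    'queen': [1.0, -1.0],
    'man':   [0.0, 1.0],
    'woman': [0.0, -1.0],
    'apple': [-1.0, 0.5],
}

# v = woman + king - man
target = [w + k - m for w, k, m in zip(toy['woman'], toy['king'], toy['man'])]

# rank the remaining words by cosine similarity to the target vector
candidates = {word: cosine(vec, target)
              for word, vec in toy.items()
              if word not in ('woman', 'king', 'man')}
best = max(candidates, key=candidates.get)
print(best, candidates[best])
```

The three query words are excluded from the candidates, since otherwise one of them is often the nearest vector to the result.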

The following code lists the words closest to a given word using the vectorized representation of words. For example, you can list the 10 words most similar to the word 'sister':


In [None]:
# import the Python libraries needed
from gensim.models import Word2Vec
import numpy as np
from numpy import dot
from numpy.linalg import norm

# compute cosine similarity for two vector u & v
def cosine_similarity(u, v):
    return dot(u, v)/(norm(u)*norm(v))

# compute cosine similarity for vector v & all vectors in matrix W
def cosine_similarity_matrix(W, v):
    return dot(W, v)/(norm(W, axis=1)*norm(v))


# word vectors from the model in the order of words
W = np.asarray([model.wv[word] for word in words])

# which words are closest to a given word?
word = 'sister'

# vector of the word
v = model.wv[word]
# its similarity to all of the vectors
sim = cosine_similarity_matrix(W, v)
# set the word's similarity to itself to zero, as we do not want to see the word in the list
sim[words.index(word)] = 0
# indices of words sorted by similarity, from smallest to largest
indices = sim.argsort()
# reverse the order, so that the indices run from largest to smallest similarity
indices = indices[::-1]
# how many words you want
TOP = 10
indices = indices[:TOP]
# turn indices into words and their respective similarity to the word
top_words = [(words[i], sim[i]) for i in indices]
print("The top {} words similar to '{}'".format(TOP, word))
top_words

**<span style="color:red">  Exercise 2: </span>**
**<span style="color:#0b486b"> 
What are the ten words most similar to 'australia', and what are their similarity scores?
</span>**

In [None]:
# Enter your own code here


**<span style="color:red">  Exercise 3: </span>**
**<span style="color:#0b486b"> 
What are the ten words whose vectors are closest to `model['queen'] + model['man'] - model['king']`, and what are their similarity scores? Note that you have to set `sim[words.index(word)] = 0` for all words in `['queen', 'man', 'king']`.
</span>**

In [None]:
# Enter your own code here


## 3. Extrinsic evaluation of the word vectors

In this section we will use the word vectors learned in the previous sections to build feature vectors for a binary classification of movie reviews. For example, the review 'I love the movie' will be represented as

``( model['i'] + model['love'] + model['movie'] ) / 3``

given that all of its words except 'the' exist in the learned word vectors.
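
The averaging step can be sketched with plain Python lists. The vectors in `toy_vectors` are hypothetical values chosen so the result is easy to verify by hand, and lowercasing the review is an assumption made here so that 'I' matches the key 'i'.

```python
# hypothetical 2-dimensional vectors; 'the' is deliberately missing
toy_vectors = {'i': [1.0, 0.0], 'love': [0.0, 3.0], 'movie': [2.0, 3.0]}


def average_vector(review, vectors):
    """Average the vectors of the known words in a review; [] if none is known."""
    known = [vectors[w] for w in review.lower().split() if w in vectors]
    if not known:
        return []
    dim = len(known[0])
    return [sum(vec[d] for vec in known) / len(known) for d in range(dim)]


print(average_vector('I love the movie', toy_vectors))  # 'the' is skipped -> [1.0, 2.0]
print(average_vector('the', toy_vectors))               # no known word -> []
```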

Three popular classifiers, *DecisionTreeClassifier, RandomForestClassifier, GradientBoostingClassifier*, will be used as the classifiers in this task.

**<span style="color:#0b486b"> 
The following code evaluates how well the embedded vectors learned for the Wikipedia dataset contribute to the classification of the IMDb movie reviews.
</span>**

In [None]:
# import the Python libraries needed
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from gensim.models import Word2Vec

# This function returns the feature vector for an input review. It is computed as the average of the vectors
# of all words in the review that exist in the model; an empty list is returned if no word is known
def get_avg_vector(review, model):
    tokens = review.split()
    vecs = [model.wv[word] for word in tokens if word in model.wv]
    if len(vecs) > 0:
        vecs = np.asarray(vecs).mean(axis=0)
    return vecs

classifiers = [
    DecisionTreeClassifier(),
    RandomForestClassifier(),
    GradientBoostingClassifier()
]

# re-load the saved model learned from Wikipedia data
model_file = 'model/wiki/sg_200_5'
model = Word2Vec.load(model_file)


# Getting movie reviews data: one review per line, label 1 = positive, 0 = negative
def load_reviews(path, label, X, y):
    with open(path) as f:  # the file is closed automatically
        for line in f:
            vec = get_avg_vector(line, model)
            if len(vec) > 0:  # skip reviews with no known words
                X.append(vec)
                y.append(label)

X_train, y_train = [], []
load_reviews('data/sent/train-pos.txt', 1, X_train, y_train)
load_reviews('data/sent/train-neg.txt', 0, X_train, y_train)

X_test, y_test = [], []
load_reviews('data/sent/test-pos.txt', 1, X_test, y_test)
load_reviews('data/sent/test-neg.txt', 0, X_test, y_test)

X_train, X_test, y_train, y_test = np.asarray(X_train), np.asarray(X_test), np.asarray(y_train), np.asarray(y_test)

for clf in classifiers:
    clf.fit(X_train, y_train)
    name = clf.__class__.__name__
    test_predictions = clf.predict(X_test)  # predictions on the held-out test set
    acc = accuracy_score(y_test, test_predictions)
    print("Classifier: {}, Accuracy: {:.4%}".format(name, acc))

**<span style="color:red">  Exercise 4: </span>**
**<span style="color:#0b486b"> 
In the above code, we used the vector representation learned from the Wikipedia dataset to classify the IMDb movie reviews, and the accuracy does not seem good. Now load the model learned from the IMDb movie reviews in Exercise 1 and evaluate how well the embedded vectors learned for the IMDb movie reviews dataset contribute to the classification of the IMDb movie reviews.
</span>**

In [None]:
# Enter your own code here


**<span style="color:red">  Exercise 5: </span>**
**<span style="color:#0b486b"> 
Compare this classification performance with that of the word vectors learned for the Wikipedia dataset and explain the difference, if possible.
</span>**

**Enter your explanation here**


## 4. Intrinsic evaluation of the word vectors

In this section we will evaluate how well the learned word vectors perform in predicting the closest words in both semantic and syntactic contexts.

Report the accuracy of this analogy prediction against the file data/questions-words.txt. There are 14 categories of analogies in the file, consisting of five semantic and nine syntactic analogies:

### 4.1 Semantic analogy
Consists of the following sections:
- capital-common-countries
- capital-world
- currency
- city-in-state
- family

### 4.2 Syntactic analogy
Consists of the following sections:
- gram1-adjective-to-adverb
- gram2-opposite
- gram3-comparative
- gram4-superlative
- gram5-present-participle
- gram6-nationality-adjective
- gram7-past-tense
- gram8-plural
- gram9-plural-verbs
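
The analogy file is plain text: a line starting with `:` names a section, and each following line holds one quadruple `a b c d`. The sketch below parses a tiny made-up sample in that format and groups the sections into semantic and syntactic ones by the `gram` prefix. The sample text and the helper names `parse_analogies` and `group_of` are assumptions for illustration, not part of Gensim.

```python
# a tiny made-up sample in the same format as questions-words.txt
sample = """\
: family
king queen man woman
boy girl brother sister
: gram8-plural
car cars dog dogs
"""


def parse_analogies(text):
    """Map each section name to its list of [a, b, c, expected] quadruples."""
    sections, current = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(':'):
            # a new section header such as ': family'
            current = line[1:].strip()
            sections[current] = []
        else:
            sections[current].append(line.lower().split())
    return sections


def group_of(section_name):
    """All syntactic sections start with 'gram'; the rest are semantic."""
    return 'syntactic' if section_name.startswith('gram') else 'semantic'


sections = parse_analogies(sample)
counts = {'semantic': 0, 'syntactic': 0}
for name, quads in sections.items():
    counts[group_of(name)] += len(quads)
print(counts)
```

For each quadruple, the evaluation hides the fourth word and checks whether the model's nearest word to `b - a + c` (excluding `a`, `b` and `c`) equals it.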

**<span style="color:#0b486b"> 
The following code evaluates how well the embedded vectors learned for the Wikipedia dataset match with analogy quadruples stored at <span style="color:blue"> data/questions-words.txt </span>
</span>**

In [None]:
# import the Python libraries needed
from gensim.models import Word2Vec
import numpy as np

# load model learned above
model_file = 'model/wiki/sg_200_5'
model = Word2Vec.load(model_file)

# evaluate how well the model matches the analogies defined in data/questions-words.txt
# (note: gensim >= 4.0 removed `model.accuracy` in favour of
#  `model.wv.evaluate_word_analogies`, which returns a `(score, sections)` tuple)
sections = model.accuracy('data/questions-words.txt', restrict_vocab=None)
# This function returns the correct and incorrect predictions of the hidden word in the quadruples of
# [a, b, c, ?] words. These values are grouped by the 14 analogy categories. We will further group them into semantic and
# syntactic categories.

total = np.zeros(2)
semantic = np.zeros(2)
syntactic = np.zeros(2)
for section in sections:
    name = section['section'] 
    correct, incorrect = len(section['correct']), len(section['incorrect']) # numbers of correct and incorrect predictions
    if 'total' not in name.lower(): # skip the aggregated 'total' section
        total += [correct, incorrect]
        if 'gram' in name: # all syntactic section starts with 'gram'
            syntactic += [correct, incorrect]
        else: # otherwise, it's a semantic section
            semantic += [correct, incorrect]
            
# accuracy in percent; the small constant 1e-6 avoids division by zero
total = float( total[0]*100 ) / (sum(total) + 1e-6)
semantic = float( semantic[0]*100 ) / (sum(semantic) + 1e-6)
syntactic = float( syntactic[0]*100 ) / (sum(syntactic) + 1e-6)

print("="*80)
print('total accuracy: %0.2f%%, semantic accuracy: %0.2f%%, syntactic accuracy: %0.2f%%' % (total, semantic, syntactic))

**<span style="color:red">  Exercise 6: </span>**
**<span style="color:#0b486b"> 
Evaluate how well the embedded vectors learned for the IMDb movie reviews dataset match the analogy quadruples stored at <span style="color:blue"> data/questions-words.txt </span>
</span>**

In [None]:
# Enter your own code here


**<span style="color:red">  Exercise 7: </span>**
**<span style="color:#0b486b"> 
Compare this analogy accuracy with that of the word vectors learned for the Wikipedia dataset and explain the difference, if possible.
</span>**

**Enter your explanation here**