# Data Mining & Analytics
## Lab 6 (B)

Available software:
 - Python's Gensim module: https://radimrehurek.com/gensim/ (install using pip)
 - scikit-learn's TSNE module, in case you use t-SNE to reduce dimensionality (optional)
 - Python’s Matplotlib (optional)

_Note: The most important hyperparameters of skip-gram/CBOW are the vector size and the window size._
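
To make the window-size hyperparameter concrete, the sketch below enumerates the (target, context) pairs a skip-gram model would see for one tokenized sentence. The function name `skipgram_pairs` and the example sentence are illustrative assumptions and not part of Gensim. Real implementations, Gensim included, add refinements such as randomly shrinking the window and subsampling frequent words.

```python
def skipgram_pairs(tokens, window):
    """Return (target, context) pairs for every token within `window` positions."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

# Hypothetical example sentence, already tokenized.
sentence = ["the", "king", "loved", "his", "daughter"]
pairs_w1 = skipgram_pairs(sentence, window=1)
pairs_w2 = skipgram_pairs(sentence, window=2)
print(len(pairs_w1), len(pairs_w2))
```

A larger window pairs each word with more distant neighbours, which tends to capture topical relatedness, while a small window focuses on immediate neighbours.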

This assignment is broadly split into **2 parts**.

### Lab06 (A)

#### Part I
##### Preparation:
Download and extract Google's pretrained Word2Vec model (Google trained it on a large text corpus containing billions of words). To kick things off, we will use this pretrained model to explore Word2Vec.
(Download link: https://docs.google.com/a/berkeley.edu/uc?id=0B7XkCwpI5KDYNlNUTTlSS21pQmM&export=download)
Now load this pretrained model in Gensim and you are ready to start this assignment.

[ ... ] (Omitted)

#### Part II:

In Part I we used a Word2Vec model that was pretrained on a large corpus. In this part (in the next lab) you are going to train a Word2Vec model on your own text dataset/corpus. Choose a text corpus (a good place to start is the NLTK corpus, Project Gutenberg, or the Brown movie reviews) and tokenize the text (we will go through this in detail in the next lab).

You can also choose the dataset provided here.

Q7. Based on your knowledge or understanding of the text corpus you have chosen, form 3 hypotheses about analogies or relationships you expect will hold, and give a reason for each.


### Lab06 (B)

 
1. Generate embeddings from the corpus you had chosen in the previous lab.
2. Verify and test your hypotheses from the previous lab.
3. Use T-SNE or PCA to reduce the dimensionality of the vectors to two dimensions for: 
    1. The GoogleNews corpus. Feel free to down sample it to 10 - 20k words based on frequency.
    2. The embeddings you just generated.
4. Using [this](https://github.com/CAHLR/d3-scatterplot) library, visualize both reduced datasets from the step above and explore the visualization.
5. Submit the jupyter notebook (including lab6a) with the code and an embedded (screengrabbed) image of the viz(s) that you created and explored. Ensure that you have appropriate comments for both the code and the images.
6. For this library, you need to generate a tab separated text file in the following format: `Dim1 | Dim2 | Label (word)`

#### How to use the JS library

```
1) copy over the d3-scatterplot .html and .js files onto your machine
2) setup a local web server in the directory of the d3-scatterplot files (on terminal - "python -m http.server" for Python 3.x)
3) place a tab separated file in that same folder 
4) the tab separated file must have at least an x and y column and a third column of any value (in our case, the word itself)
5) go to http://localhost:8000/plot.html?dataset=name_of_dataset.txt
6) Hover over plot points to see description in the upper left. You can color the points (using cluster labels, perhaps) using additional columns in your text file.
7) lasso select a group of points with the left mouse button and look at summaries of the group to the right and all the selected point descriptions below the plot
 
```
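
The tab-separated input file described above can be produced with the standard library alone. The following is a minimal sketch: the helper name `write_scatter_tsv`, the file name, and the sample coordinates are assumptions made for illustration, and the header names `x`, `y`, `label` simply mirror the ones used later in this notebook.

```python
import csv
import os
import tempfile

def write_scatter_tsv(path, points, labels):
    """Write rows of x, y, label as a tab-separated file with a header line."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="\t", lineterminator="\n")
        writer.writerow(["x", "y", "label"])
        for (x, y), label in zip(points, labels):
            writer.writerow([x, y, label])

# Hypothetical 2-D coordinates standing in for t-SNE or PCA output.
points = [(0.12, -1.5), (3.4, 2.25)]
labels = ["king", "princess"]

# Written to a temp directory here; in practice the file goes next to plot.html.
path = os.path.join(tempfile.gettempdir(), "example_embedding.tsv")
write_scatter_tsv(path, points, labels)
```

Additional columns (for example a cluster id per word) could be appended to each row in the same way if the points should be colored.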



In [3]:
import numpy as np
import gensim
from gensim.models import Word2Vec

from sklearn.metrics.pairwise import cosine_similarity 
from scipy import sparse

In [5]:
from urllib.request import urlopen
import string
import nltk, re, pprint
from nltk import word_tokenize
from nltk import tokenize

In [14]:
model = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

## Functions and Setup

In [15]:
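def cos_similarity_for_2_words(model, left, right):
    """Return the cosine similarity between the vectors of two words."""
    M = np.array([model[left], model[right]])
    M_sim = cosine_similarity(M)
    return M_sim[0][1]

def euc_similarity_for_2_words(model, left, right):
    """Return the Euclidean distance between two word vectors (smaller means more similar)."""
    return np.linalg.norm(model[left]-model[right])

def print_cos_similarity_for_word_pair(model, pair):
    """Print the cosine similarity for a (left, right) word pair."""
    print("{}: {:.4f} ".format(pair, cos_similarity_for_2_words(model, pair[0], pair[1])))

def print_euc_similarity_for_word_pair(model, pair):
    """Print the Euclidean distance for a (left, right) word pair."""
    print("{}: {:.4f} ".format(pair, euc_similarity_for_2_words(model, pair[0], pair[1])))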
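
For reference, the two measures used by the helpers above can be written out without any library. This is a dependency-free sketch with made-up three-dimensional vectors. The names `cosine_similarity_pure` and `euclidean_distance_pure` are hypothetical and exist only for this illustration.

```python
import math

def cosine_similarity_pure(u, v):
    """Cosine of the angle between u and v: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def euclidean_distance_pure(u, v):
    """Straight-line distance between u and v."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

u = [1.0, 2.0, 2.0]
v = [2.0, 4.0, 4.0]   # same direction as u, twice the length
w = [2.0, -1.0, 0.0]  # orthogonal to u

sim_uv = cosine_similarity_pure(u, v)
sim_uw = cosine_similarity_pure(u, w)
dist_uv = euclidean_distance_pure(u, v)
```

Cosine similarity ignores vector length, so `u` and `v` count as identical in direction even though their Euclidean distance is non-zero. Also note that the Euclidean helper returns a distance, so smaller values mean more similar words, despite the `similarity` in its name.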

---
# Part I:


# [ ... ]

(**Note:** *Omitted code from Lab06a Part I. Have a look at the separately submitted Jupyter notebook for Lab06A*)

---
# Part II:
In the next lab you are going to train a Word2Vec model on your own text dataset/corpus. To prepare, do the following...


### Choose a text corpus (a good place to start is the NLTK corpus, Project Gutenberg, or the Brown movie reviews)

**ANSWER:** I selected [Grimms' Fairy Tales](https://www.gutenberg.org/ebooks/2591) from Project Gutenberg for the next lab. I will analyse it using NLTK and test the following three hypotheses.


##### H1: The words 'wolf' and 'evil' will have a high similarity.
**Reason:** The wolf is commonly cast as the evil character in fairy tales. Examples are Red Riding Hood and Three Little Pigs, and there are probably many more.

##### H2: 'King' + 'Daughter' = 'Princess'
**Reason:** My assumption is that this relationship appears in enough fairy tales that it can be 'uncovered' by training a model on this book.

##### H3: 'Hansel' + 'sister' = 'Gretel'
**Reason:** The last hypothesis is specific to one of the many fairy tales: Hansel and Gretel. They are siblings, so I assume that the above relationship can be detected.

---
# Lab06 B
---

### Tokenize the text (We will go through this in detail in the next Lab.)

In [16]:
# Note: Code adapted from gensim_tutorial.ipynb from the current lab.

sent_tokenizer=nltk.data.load('tokenizers/punkt/english.pickle')

## Download and load "Grimms' Fairy Tales" from Project Gutenberg: https://www.gutenberg.org
url = "https://www.gutenberg.org/files/2591/2591-0.txt"  # raw text file location
resp = urlopen(url)
raw = resp.read().decode('utf8')
firstlook = tokenize.sent_tokenize(raw)

pattern = r'''(?x)  # set flag to allow verbose regexps
(?:[A-Z]\.)+        # abbreviations, e.g. U.S.A.
|\w+(?:[-']\w+)*    # words with optional internal hyphens or apostrophes
|\$?\d+(?:\.\d+)?   # currency, e.g. $12.80
|\.\.\.             # ellipses
|[.,;"'?()_`-]      # these are separate tokens (hyphen placed last so it is literal, not a range)
'''
#print(nltk.regexp_tokenize(raw,pattern))
tokenized_raw = ' '.join( nltk.regexp_tokenize(raw,pattern))
tokenized_raw= tokenize.sent_tokenize(tokenized_raw)

nopunct=[]
for sent in tokenized_raw:
    a=[w for w in sent.split() if w not in string.punctuation]
    nopunct.append(' '.join(a))
# tokenize each punctuation-free sentence into a list of words
tok_corp= [nltk.word_tokenize(sent) for sent in nopunct]

### Create a list of unique words

unique_words = list({word for sent in tok_corp for word in sent})

### It's just a single command
# Note: `size` is the Gensim 3.x parameter name; Gensim 4.x renamed it to `vector_size`.
model = gensim.models.Word2Vec(tok_corp, min_count=1, size = 16, window=7)

## Extracting the respective vectors corresponding to the words

vector_list=[] ## n by d matrix containing words and their respective vectors
for word in unique_words:
    vector_list.append(model[word])
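
The training call above expects the corpus as a list of sentences, each given as a list of word tokens. The sketch below shows that target shape using only the standard library. It is a crude stand-in for the NLTK pipeline in the cell above (a naive sentence split instead of Punkt), and the function name `tokenize_corpus` and the sample text are hypothetical.

```python
import re

def tokenize_corpus(raw_text):
    """Split text into sentences, then into word tokens without punctuation."""
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", raw_text.strip())
    # Words may contain internal hyphens or apostrophes, e.g. "woman's".
    word_pattern = re.compile(r"\w+(?:[-']\w+)*")
    return [word_pattern.findall(s) for s in sentences if word_pattern.search(s)]

sample = "Hansel and Gretel sat by the fire. The old woman's house was made of bread!"
corpus = tokenize_corpus(sample)
vocabulary = sorted({word for sentence in corpus for word in sentence})
```

Here `corpus` plays the role of `tok_corp` and `vocabulary` the role of `unique_words`.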



In [17]:
print(vector_list[:5])

[array([ 0.05235949, -0.04953231,  0.03576385, -0.01601919, -0.0693511 ,
       -0.01456504,  0.01359499, -0.01525817,  0.085245  ,  0.00977326,
       -0.0210762 ,  0.00151788,  0.04061151, -0.01398589,  0.01363964,
        0.04431199], dtype=float32), array([ 0.02624226, -0.01981116,  0.03476242, -0.00232454, -0.04765066,
       -0.0327542 , -0.00440537,  0.0152738 ,  0.07384662,  0.00578531,
       -0.00191476,  0.00331511,  0.03962994, -0.00388025, -0.01304295,
        0.01633914], dtype=float32), array([ 0.09900264, -0.0471682 ,  0.07363572, -0.04363722, -0.07313373,
       -0.04096917, -0.07151716, -0.00578928,  0.17756607,  0.00371851,
        0.00570008, -0.01576044,  0.13996762, -0.02839047, -0.01232996,
        0.04880045], dtype=float32), array([ 0.18350405, -0.09647742,  0.14782602, -0.0462269 , -0.13688937,
       -0.08243618, -0.03592265,  0.02405367,  0.27839208, -0.04941341,
       -0.01147962, -0.03909972,  0.21690543, -0.01661538,  0.02661334,
        0.08772004], dty

## 2. Verify and test your hypotheses from the previous lab.

##### H1: The words 'wolf' and 'evil' will have a high similarity.

In [19]:
print_cos_similarity_for_word_pair(model, ('wolf', 'evil'))

('wolf', 'evil'): 0.9923 


  


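**Interpretation:** A cosine similarity of 0.9923 indicates that the two word vectors point in almost the same direction. Hypothesis confirmed, with the caveat that the queries below return hundreds of words with similarities above 0.99, so the absolute value on its own is weak evidence.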
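
One way to put a value such as 0.9923 into context is to compare it with the similarities of other word pairs in the same model. The sketch below does this on made-up two-dimensional vectors. The helper `fraction_of_pairs_below` and the toy vectors are assumptions for illustration only and do not come from the trained model.

```python
import itertools
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def fraction_of_pairs_below(vectors, pair):
    """Share of all other word pairs whose similarity is below that of `pair`."""
    target = cosine(vectors[pair[0]], vectors[pair[1]])
    others = [
        cosine(vectors[a], vectors[b])
        for a, b in itertools.combinations(sorted(vectors), 2)
        if {a, b} != set(pair)
    ]
    return sum(s < target for s in others) / len(others)

# Hypothetical toy vectors, not taken from the fairy-tale model.
toy_vectors = {
    "wolf": [1.0, 0.10],
    "evil": [1.0, 0.12],
    "king": [1.0, 0.50],
    "tree": [0.2, 1.00],
}
share_wolf_evil = fraction_of_pairs_below(toy_vectors, ("wolf", "evil"))
share_king_tree = fraction_of_pairs_below(toy_vectors, ("king", "tree"))
```

A share close to 1 means the pair is more similar than almost every other pair, which is a stronger statement than a high absolute cosine value.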

##### H2: 'King' + 'Daughter' = 'Princess'

In [36]:
top_1000 = model.most_similar(positive=['king','daughter'], topn=1000)
print(top_1000[:10])
print()

for i, (word, sim) in enumerate(top_1000):
    if word == 'princess':
        print('[Rank ' + str(i) + ']','::', (word, sim))

print()
print_cos_similarity_for_word_pair(model, ('king', 'princess'))
print_cos_similarity_for_word_pair(model, ('daughter', 'princess'))

[('chamber', 0.9985895156860352), ('fox', 0.9984092712402344), ('finger', 0.998346745967865), ('put', 0.9980343580245972), ('heard', 0.9980288743972778), ('Then', 0.9979771971702576), ('morning', 0.9979766607284546), ('in', 0.9979627728462219), ('room', 0.997961699962616), ('ring', 0.9979234933853149)]

[Rank 195] :: ('princess', 0.9941306710243225)

('king', 'princess'): 0.9867 
('daughter', 'princess'): 0.9980 



**Interpretation:** Both ('king', 'princess') and ('daughter', 'princess') have a high similarity. However, the relationship 'King' + 'Daughter' = 'Princess' is not confirmed, as many other words are more similar to the combined query. Inspecting the results further shows that 'princess' appears only at rank 195.
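
To see what such a combined query measures, the following sketch imitates the ranking on a handful of toy vectors: it averages the unit-normalized query vectors, then sorts all other words by cosine similarity to that mean. It is a simplified illustration under that assumption, not Gensim's actual code, and the function `most_similar_sketch` and the toy vectors are hypothetical.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    n = math.sqrt(sum(a * a for a in v))
    return [a / n for a in v]

def most_similar_sketch(vectors, positive, topn=3):
    """Rank words by cosine similarity to the mean of the unit-normalized query vectors."""
    parts = [unit(vectors[w]) for w in positive]
    dim = len(parts[0])
    mean = unit([sum(p[k] for p in parts) / len(parts) for k in range(dim)])
    scores = []
    for word, vec in vectors.items():
        if word in positive:
            continue  # query words are excluded from the results
        scores.append((word, sum(a * b for a, b in zip(mean, unit(vec)))))
    scores.sort(key=lambda ws: ws[1], reverse=True)
    return scores[:topn]

# Hypothetical toy vectors chosen so that the analogy holds exactly.
toy = {
    "king": [1.0, 0.0],
    "daughter": [0.0, 1.0],
    "princess": [1.0, 1.0],
    "chamber": [1.0, 0.2],
    "fox": [-1.0, 0.5],
}
ranked = most_similar_sketch(toy, ["king", "daughter"], topn=3)
rank_of = {word: i + 1 for i, (word, _) in enumerate(ranked)}
```

Note that `rank_of` is 1-based, whereas the loop in the cell above prints the 0-based index from `enumerate`.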

##### H3: 'Hansel' + 'sister' = 'Gretel'

In [38]:
top_1000 = model.most_similar(positive=['Hansel','sister'], topn=1000)
print(top_1000[:10])
print()

for i, (word, sim) in enumerate(top_1000):
    if word == 'Gretel':
        print('[Rank ' + str(i) + ']','::', (word, sim))

print()
print_cos_similarity_for_word_pair(model, ('Hansel', 'Gretel'))
print_cos_similarity_for_word_pair(model, ('sister', 'Gretel'))

[('an', 0.99968421459198), ('fine', 0.9995608329772949), ('however', 0.9995306730270386), ('people', 0.9995037317276001), ('to', 0.9994958639144897), ('without', 0.9994885921478271), ('four', 0.999474823474884), ('heart', 0.9994156360626221), ('large', 0.9994149804115295), ('mother', 0.9993584156036377)]

[Rank 226] :: ('Gretel', 0.9975356459617615)

('Hansel', 'Gretel'): 0.9979 
('sister', 'Gretel'): 0.9967 



**Interpretation:** As with the previous hypothesis, the assumption does not hold here either. 'Gretel' appears only at rank 226 for 'Hansel' + 'sister'. Even though there is a high similarity, the combination 'Hansel' + 'sister' does not seem to be descriptive enough.


## 3. Use T-SNE or PCA to reduce the dimensionality of the vectors to two dimensions for:

In [63]:
from sklearn.manifold import TSNE
import pandas as pd


#### The GoogleNews corpus. Feel free to down sample it to 10 - 20k words based on frequency.

In [41]:
google_news_model = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

In [55]:
# sample words by frequency
n = 20000

# Build (word, count) pairs and keep the n most frequent words.
word_count_list = [(word, google_news_model.vocab[word].count) for word in google_news_model.vocab]
sorted_word_count_list = sorted(word_count_list, key=lambda word_count: word_count[1], reverse=True)

sampled_words = [word for word, _ in sorted_word_count_list[:n]]

In [57]:
google_news_vector_list=[] ## n by d matrix containing words and their respective vectors
for word in sampled_words:
    google_news_vector_list.append(google_news_model[word])

In [60]:
# Let's reduce the 300-dimensional vectors to 2 dimensions to visualize the dataset
data_embed=TSNE(n_components=2, perplexity=50, verbose=2, method='barnes_hut').fit_transform(google_news_vector_list)

[t-SNE] Computing 151 nearest neighbors...
[t-SNE] Indexed 20000 samples in 0.281s...
[t-SNE] Computed neighbors for 20000 samples in 367.975s...
[t-SNE] Computed conditional probabilities for sample 1000 / 20000
[t-SNE] Computed conditional probabilities for sample 2000 / 20000
[t-SNE] Computed conditional probabilities for sample 3000 / 20000
[t-SNE] Computed conditional probabilities for sample 4000 / 20000
[t-SNE] Computed conditional probabilities for sample 5000 / 20000
[t-SNE] Computed conditional probabilities for sample 6000 / 20000
[t-SNE] Computed conditional probabilities for sample 7000 / 20000
[t-SNE] Computed conditional probabilities for sample 8000 / 20000
[t-SNE] Computed conditional probabilities for sample 9000 / 20000
[t-SNE] Computed conditional probabilities for sample 10000 / 20000
[t-SNE] Computed conditional probabilities for sample 11000 / 20000
[t-SNE] Computed conditional probabilities for sample 12000 / 20000
[t-SNE] Computed conditional probabilities for 

In [69]:
# save as tsv in format `Dim1 | Dim2 | Label (word)`
df = pd.DataFrame(data_embed)
df['word'] = sampled_words
df.columns = ['x', 'y', 'label']
df.head()

df.to_csv('d3-scatterplot/google_news.tsv', sep='\t', index=False)

#### The embeddings you just generated.

In [70]:
vector_list=[] ## n by d matrix containing words and their respective vectors
for word in unique_words:
    vector_list.append(model[word])



In [71]:
# Let's reduce the 16-dimensional vectors to 2 dimensions to visualize the dataset
data_embed=TSNE(n_components=2, perplexity=50, verbose=2, method='barnes_hut').fit_transform(vector_list)

[t-SNE] Computing 151 nearest neighbors...
[t-SNE] Indexed 5931 samples in 0.007s...
[t-SNE] Computed neighbors for 5931 samples in 0.919s...
[t-SNE] Computed conditional probabilities for sample 1000 / 5931
[t-SNE] Computed conditional probabilities for sample 2000 / 5931
[t-SNE] Computed conditional probabilities for sample 3000 / 5931
[t-SNE] Computed conditional probabilities for sample 4000 / 5931
[t-SNE] Computed conditional probabilities for sample 5000 / 5931
[t-SNE] Computed conditional probabilities for sample 5931 / 5931
[t-SNE] Mean sigma: 0.031960
[t-SNE] Computed conditional probabilities in 0.440s
[t-SNE] Iteration 50: error = 81.7736130, gradient norm = 0.0409066 (50 iterations in 12.875s)
[t-SNE] Iteration 100: error = 74.6888275, gradient norm = 0.0062669 (50 iterations in 8.772s)
[t-SNE] Iteration 150: error = 74.1573410, gradient norm = 0.0022043 (50 iterations in 9.863s)
[t-SNE] Iteration 200: error = 74.0356598, gradient norm = 0.0016372 (50 iterations in 15.362s)

In [73]:
# save as tsv in format `Dim1 | Dim2 | Label (word)`
df = pd.DataFrame(data_embed)
df['word'] = unique_words
df.columns = ['x', 'y', 'label']
df.head()

df.to_csv('d3-scatterplot/hansel_gretel.tsv', sep='\t', index=False)

## 4. Visualize both reduced datasets from the step above and explore the visualization.

#### The GoogleNews corpus. Feel free to down sample it to 10 - 20k words based on frequency.

This is the full visualization of the 20k Google News Corpus Dataset. 

![Google News Corpus Full](./img/ss_viz_google_news_corpus_01.png)

Zooming in on the cluster on the very top, we can (for example) find that the respective cluster contains capitals around the world.

![Google News Corpus Full](./img/ss_viz_google_news_corpus_02.png)

#### The embeddings you just generated.

This is the full visualization of the Fairy Tale Corpus. (It has quite a funny shape.)

![Hansel Gretel Corpus Full](./img/ss_viz_hansel_gretel_corpus_01.PNG)