# Data Mining & Analytics
## March 15, Lab 6 (A): Skip-Gram Models

Available software:
 - Python's Gensim module: https://radimrehurek.com/gensim/ (install using pip)
 - Sklearn’s TSNE module, in case you use t-SNE to reduce dimensionality (optional)
 - Python’s Matplotlib (optional)

_Note: The most important hyperparameters of skip-gram/CBOW are vector size and window size._

This assignment is broadly split into **2 parts**.

#### Part I
##### Preparation:
Download and extract Google’s pretrained Word2Vec model (Google trained it on a large text corpus containing billions of words). To kick things off, we will use this pretrained model to explore Word2Vec.
(Download Link: https://docs.google.com/a/berkeley.edu/uc?id=0B7XkCwpI5KDYNlNUTTlSS21pQmM&export=download)
Now load this pretrained model in Gensim and you should be ready to start the assignment.



Q1: Find the **cosine similarities** between the following word pairs/tuples:
- (France, England)
- (smaller, bigger)
- (England, London)
- (France, Rocket)
- (big, bigger)

Q2: Write an expression to extract the vector representations of the words (France, England, smaller, bigger, rocket, big).

Q3: Repeat the exercise from Q1 by finding the **Euclidean distances** between the word pairs.

Q4: What is the relationship between the magnitude of individual vectors, the vectors themselves, and the cosine distance for any pair of words? Use any tuple in Q1 as an example to support your answer.

Q5: Time to dabble with the power of Word2Vec. Find the 2 closest words  for the following conditions:  
- (King - Queen)
- (bigger - big + small)
- (man + programmer - woman)
- (school + shooting - guns)
- (Texas + Milwaukee - Wisconsin)

Q6: Using the vectors for the following words, explore the semantic representation of these words through K-means clustering and explain your findings.

Q7: What loss function does the skip-gram model use? Briefly describe what this function is minimizing.


#### Part II:

In Part I we used a pretrained Word2Vec model. In this part (in the next lab) you are going to train a Word2Vec model on your own dataset/text corpus. Choose a text corpus (a good place to start is the NLTK corpus, Project Gutenberg or the brown movie reviews) and tokenize the text (we will go through this in detail in the next lab).

You can also choose the dataset provided here.

Q7. Based on your knowledge or understanding of the text corpus you have chosen, form 3 hypotheses of analogies or relationships you expect will hold and give a reason why.




 
 



In [1]:
import numpy as np
import gensim
from gensim.models import Word2Vec

from sklearn.metrics.pairwise import cosine_similarity 
from scipy import sparse

In [2]:
model = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

## Functions and Setup

In [3]:
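def cos_similarity_for_2_words(model, left, right):
    # Stack both word vectors and read the off-diagonal entry of the 2x2 similarity matrix.
    M = np.array([model[left], model[right]])
    M_sim = cosine_similarity(M)
    return M_sim[0][1]

def euc_similarity_for_2_words(model, left, right):
    # Despite the name, this returns the Euclidean *distance*: smaller means more similar.
    return np.linalg.norm(model[left]-model[right])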

def print_cos_similarity_for_word_pair(model, pair):
    print("{}: {:.4f} ".format(pair, cos_similarity_for_2_words(model, pair[0], pair[1])))

def print_euc_similarity_for_word_pair(model, pair):
    print("{}: {:.4f} ".format(pair, euc_similarity_for_2_words(model, pair[0], pair[1])))


## Q1: Find the cosine similarities between the following word pairs/tuples:

#### (France, England)

In [4]:
print_cos_similarity_for_word_pair(model, ('France', 'England'))

('France', 'England'): 0.3980 


#### (smaller, bigger)

In [5]:
print_cos_similarity_for_word_pair(model, ('smaller', 'bigger'))

('smaller', 'bigger'): 0.7302 


#### (England, London)

In [6]:
print_cos_similarity_for_word_pair(model, ('England', 'London'))

('England', 'London'): 0.4399 


#### (France, Rocket)

In [7]:
print_cos_similarity_for_word_pair(model, ('France', 'Rocket'))

('France', 'Rocket'): 0.0711 


#### (big, bigger)

In [8]:
print_cos_similarity_for_word_pair(model, ('big', 'bigger'))

('big', 'bigger'): 0.6842 


## Q2 : Write an expression to extract the vector representations of the following words

#### France 

In [9]:
model['France']

array([ 4.85839844e-02,  7.86132812e-02,  3.24218750e-01,  3.49121094e-02,
        7.71484375e-02,  3.54003906e-02, -1.25976562e-01, -3.86718750e-01,
       -1.31835938e-01,  2.91748047e-02, -1.44531250e-01, -1.42578125e-01,
        1.79687500e-01, -2.75390625e-01, -1.65039062e-01,  9.32617188e-02,
        1.17187500e-01,  1.82617188e-01,  6.10351562e-02,  1.14257812e-01,
        1.82617188e-01, -1.16699219e-01, -3.24707031e-02, -7.56835938e-02,
        9.64355469e-03,  8.59375000e-02, -2.85156250e-01, -2.55859375e-01,
        3.01513672e-02,  2.16796875e-01, -1.00097656e-01,  2.85644531e-02,
       -2.81250000e-01, -8.39843750e-02, -2.02636719e-02, -1.96289062e-01,
       -4.78515625e-02,  7.12890625e-02, -1.42578125e-01, -1.13525391e-02,
        1.16210938e-01,  7.22656250e-02,  1.47460938e-01,  1.50390625e-01,
        1.40625000e-01,  2.47070312e-01, -1.69921875e-01,  7.76367188e-02,
       -5.44433594e-02,  1.66992188e-01, -1.45507812e-01,  2.12402344e-02,
       -7.51953125e-02,  

#### England

In [10]:
model['England']

array([-1.98242188e-01,  1.15234375e-01,  6.25000000e-02, -5.83496094e-02,
        2.26562500e-01,  4.58984375e-02, -6.22558594e-02, -2.02148438e-01,
        8.05664062e-02,  2.16064453e-02, -2.79541016e-02, -1.21093750e-01,
        1.24511719e-01,  5.39550781e-02,  3.29589844e-02,  6.88476562e-02,
       -5.12695312e-02,  1.83593750e-01,  1.32812500e-01, -7.22656250e-02,
        1.06933594e-01,  2.50244141e-03, -1.37695312e-01,  1.75781250e-02,
        1.06933594e-01,  1.08398438e-01, -2.34375000e-01,  8.05664062e-02,
        3.73535156e-02,  2.61718750e-01,  7.42187500e-02,  9.21630859e-03,
       -2.77343750e-01, -1.75781250e-01, -7.61718750e-02, -2.44140625e-02,
       -1.26953125e-01, -9.37500000e-02,  5.41992188e-02,  3.08593750e-01,
        2.16064453e-02,  1.19628906e-02,  1.66992188e-01, -4.39453125e-02,
       -2.96630859e-02,  2.81982422e-02,  1.26953125e-01,  1.12304688e-01,
        3.85742188e-02,  1.69921875e-01, -1.08886719e-01,  3.16406250e-01,
        2.51953125e-01, -

#### smaller

In [11]:
model['smaller']

array([-0.05004883,  0.03417969, -0.0703125 ,  0.17578125,  0.00689697,
       -0.13183594,  0.03686523, -0.04638672, -0.01092529,  0.10058594,
        0.03173828,  0.12011719,  0.06005859,  0.0859375 , -0.18652344,
       -0.10888672, -0.20507812,  0.10107422, -0.22070312,  0.06103516,
       -0.05200195,  0.0189209 , -0.05688477, -0.00646973, -0.20410156,
        0.01623535, -0.24316406,  0.04077148,  0.16113281, -0.13769531,
       -0.125     ,  0.11230469,  0.09326172, -0.09082031,  0.10009766,
       -0.1796875 , -0.03930664,  0.2109375 ,  0.15625   ,  0.33007812,
        0.21679688,  0.08984375,  0.11035156, -0.01141357, -0.06689453,
        0.140625  , -0.0859375 ,  0.07470703,  0.02148438, -0.27539062,
        0.10107422, -0.265625  , -0.27539062, -0.30078125, -0.05444336,
        0.07128906,  0.01708984, -0.01446533,  0.14941406, -0.11914062,
       -0.05004883,  0.07080078, -0.203125  , -0.11962891,  0.05078125,
       -0.10205078, -0.07128906,  0.2734375 , -0.12109375,  0.28

#### bigger

In [12]:
model['bigger']

array([-6.54296875e-02, -9.52148438e-02, -6.22558594e-02,  1.62109375e-01,
        1.98974609e-02, -1.74804688e-01,  1.16210938e-01, -7.42187500e-02,
       -3.41796875e-02,  2.33398438e-01,  1.03027344e-01, -3.17382812e-02,
        1.44531250e-01,  3.95507812e-02, -1.33789062e-01,  1.08337402e-03,
        6.25000000e-02,  6.83593750e-02,  7.12890625e-02,  1.49414062e-01,
       -1.55273438e-01,  7.71484375e-02,  1.06933594e-01,  1.15234375e-01,
       -2.16796875e-01, -8.59375000e-02, -2.59765625e-01, -2.90527344e-02,
        1.61132812e-01,  1.13281250e-01, -1.04003906e-01,  3.61328125e-01,
       -7.89642334e-04, -1.38671875e-01,  1.02539062e-01, -1.95312500e-01,
        5.12695312e-02,  2.50000000e-01,  3.94531250e-01,  4.16015625e-01,
        2.08984375e-01, -5.10253906e-02,  1.37695312e-01, -4.46777344e-02,
       -7.81250000e-02,  8.20312500e-02,  2.06298828e-02,  1.72851562e-01,
        7.08007812e-02, -2.45117188e-01,  3.71093750e-02,  6.68334961e-03,
       -2.63671875e-01, -

#### rocket

In [13]:
model['rocket']

array([-3.19824219e-02,  2.71484375e-01, -2.89062500e-01, -1.54296875e-01,
        1.68945312e-01, -3.65234375e-01,  3.33984375e-01, -4.39453125e-01,
        1.75781250e-01,  2.48046875e-01,  2.79541016e-02, -1.39648438e-01,
       -6.88476562e-02, -1.62109375e-01, -1.57226562e-01,  1.58203125e-01,
       -3.59375000e-01,  1.26953125e-02,  1.85546875e-01, -3.08593750e-01,
       -1.52343750e-01,  1.87500000e-01, -2.17773438e-01,  6.59179688e-03,
       -1.59263611e-04, -7.37304688e-02,  4.95605469e-02, -5.32226562e-02,
        2.81250000e-01, -2.21679688e-01, -3.59375000e-01, -4.43359375e-01,
       -1.02050781e-01, -1.63085938e-01,  2.96020508e-03, -3.30078125e-01,
        3.41796875e-02,  2.38281250e-01, -1.19140625e-01,  6.34765625e-02,
       -1.53320312e-01, -9.47265625e-02,  3.37890625e-01,  2.28515625e-01,
        1.25732422e-02, -1.08398438e-01,  1.21582031e-01, -5.24902344e-03,
       -3.36914062e-02, -3.28125000e-01, -2.77343750e-01,  3.16406250e-01,
        1.60156250e-01,  

#### big

In [14]:
model['big']

array([ 0.11132812,  0.10595703, -0.07373047,  0.18847656,  0.07666016,
       -0.3828125 , -0.0625    , -0.07470703,  0.05957031,  0.22167969,
        0.20507812, -0.09228516,  0.05395508,  0.01379395, -0.16992188,
        0.05493164,  0.09619141,  0.06103516, -0.14160156,  0.03173828,
       -0.08642578,  0.12011719,  0.06445312,  0.22070312,  0.06835938,
        0.04956055, -0.22460938, -0.06298828,  0.09179688, -0.00531006,
       -0.11425781,  0.20605469,  0.31054688, -0.0625    , -0.02026367,
       -0.13476562, -0.02697754,  0.2734375 ,  0.27929688,  0.21386719,
        0.25195312, -0.13964844,  0.19824219, -0.07421875,  0.09228516,
        0.125     ,  0.0612793 , -0.02990723,  0.0072937 , -0.05615234,
       -0.08447266,  0.1796875 , -0.17578125, -0.11328125, -0.17578125,
       -0.1171875 ,  0.09082031, -0.07177734,  0.30273438, -0.2734375 ,
       -0.07128906,  0.33007812, -0.13574219, -0.0390625 ,  0.01397705,
       -0.02526855,  0.05981445,  0.14550781, -0.11035156,  0.12

## Q3: Find the Euclidean distances between the following word pairs/tuples:

#### (France, England)

In [15]:
print_euc_similarity_for_word_pair(model, ('France', 'England'))

('France', 'England'): 3.0151 


#### (smaller, bigger)

In [16]:
print_euc_similarity_for_word_pair(model, ('smaller', 'bigger'))

('smaller', 'bigger'): 1.8619 


#### (England, London)

In [17]:
print_euc_similarity_for_word_pair(model, ('England', 'London'))

('England', 'London'): 2.8753 


#### (France, Rocket)

In [18]:
print_euc_similarity_for_word_pair(model, ('France', 'Rocket'))

('France', 'Rocket'): 3.8921 


#### (big, bigger)

In [19]:
print_euc_similarity_for_word_pair(model, ('big', 'bigger'))

('big', 'bigger'): 1.9586 


## Q4: What is the relationship between the magnitude of individual vectors, the vectors themselves, and the cosine distance for any pair of words? Use any tuple in Q1 as an example to support your answer.

#### Similarity Measures
- Cosine similarity takes values $\in [-1, 1]$, where $1$ means that two vectors point in exactly the same direction (100% similar).
- Euclidean distance takes values $\in [0, \infty)$, where $0$ means that two vectors are identical (100% similar).

#### Cosine Similarity

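The cosine similarity of two vectors is defined as follows: 
$\cos(A,B) = \frac{\sum_{i=1}^{n}{A_i B_i}}{\sqrt{\sum_{i=1}^{n}{A_i^2}} \sqrt{\sum_{i=1}^{n}{B_i^2}} }$

The numerator is the dot product of the two vectors and the denominator is the product of their magnitudes. The cosine similarity therefore depends only on the angle between the vectors and not on their lengths. For example, scaling the vector for `France` by any positive factor would in general change its Euclidean distance to `England`, but it would leave the cosine similarity of 0.3980 unchanged.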
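
The relationship between magnitude and cosine similarity can be illustrated with a minimal sketch. It uses two made-up 2-dimensional vectors (not taken from the model) and assumes only NumPy. It shows that rescaling a vector changes the Euclidean distance but not the cosine similarity, and that for unit-length vectors the two measures are tied together by $\|\hat{a}-\hat{b}\|^2 = 2 - 2\cos(a,b)$.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the magnitudes."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors, chosen by hand for illustration only.
a = np.array([3.0, 4.0])   # magnitude 5
b = np.array([4.0, 3.0])   # magnitude 5

cos_ab = cosine(a, b)                  # 24 / 25
cos_scaled = cosine(10 * a, b)         # scaling a leaves the angle unchanged

dist_ab = float(np.linalg.norm(a - b))            # distance before scaling
dist_scaled = float(np.linalg.norm(10 * a - b))   # distance after scaling a

# For unit-length vectors: ||a_hat - b_hat||^2 = 2 - 2 * cos(a, b)
a_hat = a / np.linalg.norm(a)
b_hat = b / np.linalg.norm(b)
unit_dist_sq = float(np.sum((a_hat - b_hat) ** 2))

print(cos_ab, cos_scaled)
print(dist_ab, dist_scaled)
print(unit_dist_sq, 2 - 2 * cos_ab)
```

The same check could be run on any tuple from Q1 by replacing `a` and `b` with the corresponding model vectors.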

In [20]:
print('Length of the Vector for one word ::', len(model['England']))
print('(Min, Max, Mean) of the vector\'s values for the word "England" ::', '({},{},{})'.format(np.min(model['England']), np.max(model['England']), np.mean(model['England'])))

Length of the Vector for one word :: 300
(Min, Max, Mean) of the vector's values for the word "England" :: (-0.5390625,0.5,-0.0010115846525877714)


## Q5: Time to dabble with the power of Word2Vec. Find the 2 closest words for the following conditions:

#### (King - Queen)

In [21]:
print(model.most_similar(positive=['King'], negative=['Queen'], topn=2))

[('Bon_Jovi_Canillas', 0.36876529455184937), ('Lomax', 0.36203983426094055)]


#### (bigger - big + small)

In [22]:
print(model.most_similar(positive=['bigger', 'small'], negative=['big'], topn=2))

[('larger', 0.7402472496032715), ('smaller', 0.732999324798584)]


#### (man + programmer - woman)

In [23]:
print(model.most_similar(positive=['man', 'programmer'], negative=['woman'], topn=2))

[('programer', 0.5371963977813721), ('programmers', 0.5310998558998108)]


#### (school + shooting - guns)

In [24]:
print(model.most_similar(positive=['school','shooting'], negative=['guns'], topn=2))

[('elementary', 0.5435296297073364), ('eighth_grade', 0.47330963611602783)]


#### (Texas + Milwaukee - Wisconsin)

In [25]:
print(model.most_similar(positive=['Texas', 'Milwaukee'], negative=['Wisconsin'], topn=2))

[('Houston', 0.7767744660377502), ('Fort_Worth', 0.7270511388778687)]
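
To make the arithmetic behind these queries more concrete, here is a minimal sketch on a made-up 2-dimensional vocabulary. It assumes that `most_similar` roughly works by length-normalizing each input vector, averaging the positive vectors with the negated negative vectors, and ranking all remaining words by cosine similarity to that mean. The helper `toy_most_similar` and the toy vectors are hypothetical and are not part of Gensim.

```python
import numpy as np

# Hand-made 2-d "embeddings": axis 0 ~ royalty, axis 1 ~ gender. Illustration only.
toy_vectors = {
    "king":  np.array([1.0,  1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0,  1.0]),
    "woman": np.array([0.0, -1.0]),
    "apple": np.array([-1.0, 0.1]),
}

def unit(v):
    """Scale a vector to length 1."""
    return v / np.linalg.norm(v)

def toy_most_similar(vectors, positive, negative, topn=2):
    # Mean of unit-normalized inputs, with negative words weighted by -1.
    parts = [unit(vectors[w]) for w in positive] + [-unit(vectors[w]) for w in negative]
    query = unit(np.mean(parts, axis=0))
    # Rank every word that is not part of the query by cosine similarity.
    exclude = set(positive) | set(negative)
    scored = [(w, float(np.dot(query, unit(v))))
              for w, v in vectors.items() if w not in exclude]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

# king - man + woman
result = toy_most_similar(toy_vectors, positive=["king", "woman"], negative=["man"])
print(result)
```

Because the input words are excluded from the ranking, the query words themselves never appear in the result, which matches the outputs above.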


## Q6: Using the vectors for the following words, explore the semantic representation of these words through K-means clustering and explain your findings.


In [26]:
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Note: The task description does not say which words "the following words" are,
# and there has been no response in the bCourses discussion so far.
# (https://bcourses.berkeley.edu/courses/1468844/discussion_topics/5334270)

# I therefore chose to cluster the first 25,000 words of the model's vocabulary.


# All words in the vocabulary of the model.
# (`model.vocab` is the gensim < 4.0 API; gensim >= 4.0 exposes `model.key_to_index` instead.)
words = list(model.vocab) 
words_25k = words[:25000]

In [27]:
ws = words_25k

# One row per word, one column per embedding dimension.
df = pd.DataFrame([model[w] for w in ws])
df['word'] = ws
df.head()

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 291 | 292 | 293 | 294 | 295 | 296 | 297 | 298 | 299 | word |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.001129 | -0.000896 | 0.000319 | 0.001534 | 0.001106 | -0.001404 | -3.1e-05 | -0.00042 | -0.000576 | 0.001076 | ... | 0.001373 | -6.1e-05 | -0.000824 | 0.001328 | 0.00116 | 0.000568 | -0.001564 | -0.000123 | -8.6e-05 | `</s>` |
| 1 | 0.070312 | 0.086914 | 0.087891 | 0.0625 | 0.069336 | -0.108887 | -0.081543 | -0.154297 | 0.020752 | 0.131836 | ... | -0.088867 | -0.080566 | 0.064941 | 0.061279 | -0.047363 | -0.058838 | -0.047607 | 0.014465 | -0.0625 | in |
| 2 | -0.01178 | -0.047363 | 0.044678 | 0.063477 | -0.018188 | -0.063965 | -0.001312 | -0.072266 | 0.064453 | 0.086426 | ... | 0.003723 | -0.08252 | 0.081543 | 0.007935 | 0.000477 | 0.018433 | 0.071289 | -0.034912 | 0.02417 | for |
| 3 | -0.015747 | -0.02832 | 0.083496 | 0.050293 | -0.110352 | 0.031738 | -0.014221 | -0.089844 | 0.117676 | 0.118164 | ... | -0.015625 | -0.033447 | -0.02063 | -0.019409 | 0.063965 | 0.020142 | 0.006866 | 0.061035 | -0.148438 | that |
| 4 | 0.00705 | -0.073242 | 0.171875 | 0.022583 | -0.132812 | 0.198242 | 0.112793 | -0.10791 | 0.071777 | 0.020874 | ... | -0.036377 | -0.09375 | 0.182617 | 0.0271 | 0.12793 | -0.02478 | 0.01123 | 0.164062 | 0.106934 | is |


In [28]:
X = df.set_index('word')
results = []
for i in [2,3,4,5,6,7,8,20,40,80]:
    kmeans = KMeans(n_clusters=i, random_state=1).fit(X)
    score = silhouette_score(X, kmeans.labels_)

    print('[' + str(i) + '] :: ', score)
    results.append((i, score))

[2] ::  0.5811315260411222
[3] ::  0.04608681286669453
[4] ::  0.002498790204143555
[5] ::  -0.006186132170751177
[6] ::  -0.030888140054599608
[7] ::  -0.02849355112494855
[8] ::  -0.02240108134000604
[20] ::  -0.04449654988168943
[40] ::  -0.029422972407840912
[80] ::  -0.022621608065209214


**Answer:** The rapid decline of the silhouette coefficient suggests that the best clustering is found for **k=2** for the selected words.
However, given that I selected the first 25,000 words without regard to their semantic meaning, the optimal number of clusters on a sub- or superset will most probably differ.
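
For reference, the silhouette coefficient used above compares, for every point, the mean distance $a$ to the other points of its own cluster with the mean distance $b$ to the points of the nearest other cluster, and scores the point as $(b - a) / \max(a, b)$. The following is a minimal NumPy sketch of that definition on four made-up 2-d points. The function `toy_silhouette` is a hypothetical helper for illustration and is not scikit-learn's implementation.

```python
import numpy as np

def toy_silhouette(X, labels):
    """Mean silhouette coefficient over all points, using Euclidean distance."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Matrix of pairwise Euclidean distances.
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    scores = []
    for i in range(len(X)):
        same = labels == labels[i]
        n_same = int(same.sum())
        if n_same == 1:
            scores.append(0.0)  # common convention for single-point clusters
            continue
        # Mean distance to the other members of the own cluster (D[i, i] is 0).
        a = D[i, same].sum() / (n_same - 1)
        # Mean distance to the closest other cluster.
        b = min(D[i, labels == c].mean()
                for c in set(labels.tolist()) if c != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

# Two tight, well-separated groups of points (made up for illustration).
points = [[0, 0], [0, 1], [10, 0], [10, 1]]
good = toy_silhouette(points, [0, 0, 1, 1])   # labels follow the groups
bad = toy_silhouette(points, [0, 1, 0, 1])    # labels cut across the groups
print(good, bad)
```

A value close to 1 indicates compact, well-separated clusters, while values near or below 0, like most of the scores above for k > 2, indicate overlapping or arbitrary assignments.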

## Q7: What loss function does the skipgram model use and briefly describe what this function is minimizing.

**Answer:**


The TensorFlow documentation provides a detailed writeup of how Word2Vec and the Skipgram model work.

- **Step 1:** Given a text, build tuples of $(context, target)$, where $target$ refers to a word in the text and $context$ to its neighboring words, e.g. $([the, brown], quick)$.
- **Step 2:** Expand the $(context, target)$ tuples into pairwise $(input, output)$ tuples. E.g. $([the, brown], quick)$ becomes $(quick, the), (quick, brown)$, the next tuple $([quick, fox], brown)$ becomes $(brown, quick), (brown, fox)$, and so on.
- **Step 3:** The model is trained on the $(input, output)$ tuples with **stochastic gradient descent**, one tuple at a time (which also explains the long training times).
    - For each $(input, output)$ tuple, a number of **contrastive** (noisy) examples are drawn and the objective (the negative of the loss) is calculated using the following formula:

$J^{(t)}_\text{NEG} = \log Q_\theta(D=1 | \text{the, quick}) +
  \log(Q_\theta(D=0 | \text{sheep, quick}))$
  
where $(the, quick)$ represents the training sample and $(sheep, quick)$ the contrastive sample with $sheep$ being the word drawn randomly.

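The goal is then to update the *embedding parameters* $\theta$ to improve (here: maximize) this objective by computing its gradient with respect to $\theta$, $\frac{\partial}{\partial \theta} J_\text{NEG}$, and taking a small step in the direction of the gradient. Equivalently, the loss being minimized is $-J_\text{NEG}$: it is small when the model assigns a high probability to real $(input, output)$ pairs and a low probability to randomly drawn noise pairs.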
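
A minimal numerical sketch of this objective follows. It assumes the usual word2vec parameterization, in which $Q_\theta(D=1 \mid w, h)$ is the sigmoid of the dot product between an output vector for $w$ and an input vector for $h$. The 3-dimensional vectors, the learning rate and all names are made up for illustration, and only the input word's vector is updated here, whereas real training also updates the output vectors.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_objective(v_in, u_true, u_noise):
    """J_NEG = log Q(D=1 | true, input) + log Q(D=0 | noise, input)."""
    return float(np.log(sigmoid(u_true @ v_in))
                 + np.log(1.0 - sigmoid(u_noise @ v_in)))

def grad_wrt_input(v_in, u_true, u_noise):
    """Gradient of J_NEG with respect to the input word vector."""
    return ((1.0 - sigmoid(u_true @ v_in)) * u_true
            - sigmoid(u_noise @ v_in) * u_noise)

# Made-up 3-d vectors: input word "quick", true context "the", noise word "sheep".
v_quick = np.array([0.1, -0.2, 0.3])
u_the = np.array([0.4, 0.1, -0.1])
u_sheep = np.array([0.2, -0.3, 0.5])

learning_rate = 0.5  # arbitrary value for the illustration
before = neg_objective(v_quick, u_the, u_sheep)
# Gradient *ascent*: step in the direction of the gradient to increase J_NEG.
v_quick_new = v_quick + learning_rate * grad_wrt_input(v_quick, u_the, u_sheep)
after = neg_objective(v_quick_new, u_the, u_sheep)

print(before, after)  # the objective increases, i.e. the loss -J_NEG decreases
```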
  

**Src:** https://www.tensorflow.org/tutorials/word2vec#the_skip-gram_model
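
Steps 1 and 2 above can be sketched in a few lines of plain Python. The function name `skipgram_pairs`, the window size of 1 and the example sentence are assumptions for illustration. This is not how Gensim or TensorFlow generate pairs internally.

```python
def skipgram_pairs(tokens, window=1):
    """Return (input, output) pairs: each target word paired with each neighbor."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

tokens = "the quick brown fox".split()
pairs = skipgram_pairs(tokens, window=1)
print(pairs)
```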

---
# Part II:
In the next lab you are going to train a Word2Vec model on your own dataset/corpus(text). To prepare do the following...



### Choose a text corpus (A good place to start will be the NLTK corpus, Project Gutenberg or the brown movie reviews)

**ANSWER:** I decided to select [Grimms' Fairy Tales](https://www.gutenberg.org/ebooks/2591) from Project Gutenberg for the next lab and analyse it using NLTK.

### Tokenize the text (We will go through this in detail in the next Lab.)

In [8]:
from urllib.request import urlopen
import string
import nltk, re, pprint
from nltk import word_tokenize
from nltk import tokenize

In [1]:
# Note: Code adapted from gensim_tutorial.ipynb from the current lab.
import string
from urllib.request import urlopen

import gensim
import nltk
from nltk import tokenize

sent_tokenizer=nltk.data.load('tokenizers/punkt/english.pickle')

## Download and load "Grimms' Fairy Tales" from Project Gutenberg: https://www.gutenberg.org
url = "https://www.gutenberg.org/files/2591/2591-0.txt" ## Your raw text file location 
resp = urlopen(url)
raw = resp.read().decode('utf8')
firstlook = tokenize.sent_tokenize(raw)

pattern = r'''(?x)  # set flag to allow verbose regexps
(?:[A-Z]\.)+        # abbreviations, e.g. U.S.A.
|\w+(?:[-']\w+)*    # words with optional internal hyphens
|\$?\d+(?:\.\d+)?   # currency, e.g. $12.80 
|\.\.\.             # ellipses
|[-.,;"'?()_`]      # these are separate tokens; the hyphen comes first so it is a literal, not a range
'''
#print(nltk.regexp_tokenize(raw,pattern))
tokenized_raw = ' '.join( nltk.regexp_tokenize(raw,pattern))
tokenized_raw= tokenize.sent_tokenize(tokenized_raw)

# drop tokens that are pure punctuation
nopunct=[]
for sent in tokenized_raw:
    a=[w for w in sent.split() if w not in string.punctuation]
    nopunct.append(' '.join(a))
# tokenize each sentence into a list of words
tok_corp= [nltk.word_tokenize(sent) for sent in nopunct]

### creating a list of unique words 
unique_words = list(set(w for sent in tok_corp for w in sent))

### It's just one single command
# (`size` is the gensim < 4.0 name of this parameter; gensim >= 4.0 calls it `vector_size`.)
model = gensim.models.Word2Vec(tok_corp, min_count=1, size = 16, window=7)

## Extracting the respective vectors corresponding to the words
## n by d matrix containing words and their respective vectors
vector_list = [model.wv[word] for word in unique_words]

## Q7. Based on your knowledge or understanding of the text corpus you have chosen, form 3 hypotheses of analogies or relationships you expect will hold and give a reason why.

**Note:** *I created the hypotheses below based on my knowledge of the stories in the Brothers Grimm's Fairy Tales. I did not inspect the text.*

#### H1: The words 'wolf' and 'evil' will have a high similarity.

**Reason:** The wolf is commonly cast as the evil character in fairy tales. Examples are *Red Riding Hood* and *Three Little Pigs*, and there are probably many more.

#### H2: 'King' + 'Daughter' = 'Princess'

**Reason:** My assumption here is that the above relationship is present in enough fairy tales that it can be 'uncovered' by building a corpus out of this book.

#### H3: 'Hansel' + 'sister' = 'Gretel'

**Reason:** The last hypothesis is specific to one of the many fairy tales: *Hansel and Gretel*. They are siblings, so I assume that the above relationship can be detected.