<a href="https://colab.research.google.com/github/GiuliaLanzillotta/NLU/blob/master/VectorSpaceModels.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [0]:
import numpy as np
import pandas as pd
import os
import scipy.spatial.distance as distance

# Getting the data






In [0]:
!ls 'drive/My Drive'

In [2]:
!wget http://web.stanford.edu/class/cs224u/data/data.tgz

--2020-03-06 16:14:55--  http://web.stanford.edu/class/cs224u/data/data.tgz
Resolving web.stanford.edu (web.stanford.edu)... 171.67.215.200
Connecting to web.stanford.edu (web.stanford.edu)|171.67.215.200|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1531647000 (1.4G) [application/x-gzip]
Saving to: ‘data.tgz’


2020-03-06 16:18:09 (7.57 MB/s) - ‘data.tgz’ saved [1531647000/1531647000]



In [0]:
!tar xvzf data.tgz

In [0]:
!cp -r './data/' 'drive/My Drive'

In [0]:
DATA_HOME = 'drive/My Drive/data/vsmdata'

# Build the vocabulary using a co-occurrence matrix

# Analysing pre-computed matrices

Syntactic information
- Source : IMDB movie reviews
- Window size: 5 
- Weighting: 1/distance

In [0]:
imdb5 = pd.read_csv(
    os.path.join(DATA_HOME, 'imdb_window5-scaled.csv.gz'), index_col=0)

Semantic information
- Source : IMDB movie reviews
- Window size: 20
- Weighting: 1

In [0]:
imdb20 = pd.read_csv(
    os.path.join(DATA_HOME, 'imdb_window20-flat.csv.gz'), index_col=0)
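The two matrices above differ in window size and in how each co-occurrence is weighted. As a rough illustration of what those two settings mean, here is a minimal sketch that counts co-occurrences on a toy sentence. The function name `cooccurrence_counts`, its parameters and the toy sentence are assumptions made for this example; this is not the code that produced the pre-computed matrices.

```python
from collections import defaultdict


def cooccurrence_counts(tokens, window=5, scaled=True):
    """Count co-occurrences within `window` tokens on either side.

    scaled=True weights each co-occurrence by 1/distance;
    scaled=False gives every co-occurrence in the window weight 1.
    """
    counts = defaultdict(float)
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j == i:
                continue  # a token does not co-occur with itself
            weight = 1.0 / abs(i - j) if scaled else 1.0
            counts[(target, tokens[j])] += weight
    return dict(counts)


# Hypothetical toy corpus, only for illustration
toy_tokens = "the zombie ate the brain".split()
scaled_counts = cooccurrence_counts(toy_tokens, window=2, scaled=True)
flat_counts = cooccurrence_counts(toy_tokens, window=2, scaled=False)

# "zombie" sees "the" at distance 1 and again at distance 2
print(scaled_counts[("zombie", "the")])  # 1/1 + 1/2
print(flat_counts[("zombie", "the")])    # 1 + 1
```

With 1/distance weighting, nearby words dominate the counts; with flat weighting, every word in the window contributes equally.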

## But what is our goal? 
> *Ideally*, **semantically related words should end up close together** in the vector space, <br>
and semantically unrelated words far apart.




## Distance metrics

The definition of distance in the vector space model (VSM) is therefore essential. <br>
I will now explore different distance definitions on the pre-computed matrices.
<br>
<br>

Note: <br>some of the measures presented here are not distance metrics in the strict sense, mainly because they don't satisfy the triangle inequality (cosine distance is one example).
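To make the remark about the triangle inequality concrete, the sketch below checks it for cosine distance on three small vectors. The vectors and the helper `cosine_distance` are chosen for this example only; the helper computes 1 minus the cosine of the angle between the vectors.

```python
import numpy as np


def cosine_distance(a, b):
    """1 - cos(angle between a and b)."""
    return 1.0 - float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))


# Illustrative vectors: v lies halfway between u and w
u = np.array([1.0, 0.0])
v = np.array([1.0, 1.0])
w = np.array([0.0, 1.0])

direct = cosine_distance(u, w)                           # u -> w
detour = cosine_distance(u, v) + cosine_distance(v, w)   # u -> v -> w

# A proper metric requires direct <= detour
triangle_holds = direct <= detour
print(direct, detour, triangle_holds)
```

Here going through `v` is "shorter" than going directly from `u` to `w`, which a proper metric does not allow.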

In [0]:
# Select three words to compare
# imdb5 (window 5, 1/distance weighting)
zombie5 = imdb5.loc["zombie"].to_numpy()
death5 = imdb5.loc["death"].to_numpy()
happy5 = imdb5.loc["happy"].to_numpy()
# imdb20 (window 20, flat weighting)
zombie20 = imdb20.loc["zombie"].to_numpy()
death20 = imdb20.loc["death"].to_numpy()
happy20 = imdb20.loc["happy"].to_numpy()

In [43]:
words5 = np.array([zombie5,death5,happy5])
words5.shape

(3, 5000)

In [45]:
words20 = np.array([zombie20,death20,happy20])
words20.shape

(3, 5000)

### Euclidean distance

In [0]:
d1 = distance.euclidean(zombie5,death5)
d2 = distance.euclidean(zombie5,happy5)
d3 = distance.euclidean(death5,happy5)

In [40]:
print("Zombie-death: ",d1)
print("Zombie-happy: ",d2)
print("Happy-death: ",d3)

Zombie-death:  140678.50846327867
Zombie-happy:  79913.87661592668
Happy-death:  153085.63009060753


This apparently odd result has a simple explanation: Euclidean distance is sensitive to the absolute co-occurrence counts, and therefore to how frequent each word is. Words with similar overall frequency tend to look closer to each other than words with very different frequencies, whatever they mean. Here *zombie* and *happy* are the closest pair, while *death* is far from both of them. This is what we would expect if the *death* vector were much larger in magnitude than the other two, although the vector norms are not checked here.
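The effect of raw counts on Euclidean distance can be seen on a toy example. The vectors below are made up for illustration: two of them have exactly the same co-occurrence profile but very different totals, and the third has a different profile at a similar scale.

```python
import numpy as np


def euclidean(a, b):
    return float(np.linalg.norm(a - b))


# Hypothetical count vectors, not taken from the IMDB matrices
rare = np.array([1.0, 2.0, 0.0])         # a rare word
frequent = 100 * rare                    # same profile, 100x the counts
other_rare = np.array([0.0, 1.0, 2.0])   # different profile, similar scale

same_profile = euclidean(rare, frequent)
different_profile = euclidean(rare, other_rare)
print(same_profile, different_profile)
```

The pair with identical profiles ends up far apart simply because one vector is much longer, while the pair with different profiles looks close.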
<br>
<br>
Let's now look at the same analysis on the flatter dataset:

In [0]:
d1 = distance.euclidean(zombie20,death20)
d2 = distance.euclidean(zombie20,happy20)
d3 = distance.euclidean(death20,happy20)

In [47]:
print("Zombie-death: ",d1)
print("Zombie-happy: ",d2)
print("Happy-death: ",d3)

Zombie-death:  222839.7455190613
Zombie-happy:  112295.34699176098
Happy-death:  190959.80210766871


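With the larger co-occurrence window and the flat weighting scheme the ranking changes: *happy* and *death* are now closer than *zombie* and *death*, although *zombie* and *happy* remain the closest pair. This may indicate that the larger window and flat weighting reduce the effect of frequency on similarity and bring out more of the semantic content of a word, but the distances are still on the scale of raw counts.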


### Normalised Euclidean Distance or Cosine-distance
Normalising the vectors to unit length removes the absolute frequency of the words from the similarity computation, so only the direction of each vector (its relative distribution of co-occurrences) matters. This should, in principle, separate words with different meanings more clearly.
<br>
<br>
Euclidean distance on L2-normed vectors is equivalent to cosine distance with respect to ranking.
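The ranking equivalence follows from the identity: for unit vectors, the squared Euclidean distance equals twice the cosine distance. A minimal numerical check, using random vectors and helper names chosen only for this sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
# Four random vectors with deliberately very different lengths
scales = np.array([[1.0], [10.0], [100.0], [1000.0]])
vectors = rng.random((4, 10)) * scales


def cosine_distance(a, b):
    return 1.0 - float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))


# L2-normalise every row to unit length
unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

# Distances from the first vector to the other three
cos_d = np.array([cosine_distance(vectors[0], v) for v in vectors[1:]])
euc_d = np.array([np.linalg.norm(unit[0] - u) for u in unit[1:]])

# For unit vectors: ||a - b||^2 = 2 - 2*cos(a, b) = 2 * cosine distance
print(np.allclose(euc_d ** 2, 2 * cos_d))
print(np.argsort(cos_d), np.argsort(euc_d))
```

Since the square root is monotonic, sorting neighbours by either quantity gives the same order.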


In [49]:
d1 = distance.cosine(zombie5,death5)
d2 = distance.cosine(zombie5,happy5)
d3 = distance.cosine(death5,happy5)
print("Zombie-death: ",d1)
print("Zombie-happy: ",d2)
print("Happy-death: ",d3)

Zombie-death:  0.9821138850310313
Zombie-happy:  0.9819876080334038
Happy-death:  0.9797060511902813


In [50]:
d1 = distance.cosine(zombie20,death20)
d2 = distance.cosine(zombie20,happy20)
d3 = distance.cosine(death20,happy20)
print("Zombie-death: ",d1)
print("Zombie-happy: ",d2)
print("Happy-death: ",d3)

Zombie-death:  0.33280097194242764
Zombie-happy:  0.3116146909460501
Happy-death:  0.28845341896736065


One major difference with respect to the Euclidean results is that *happy* and *death* now have the smallest distance, i.e. they are the most similar pair, in both datasets, whereas with the Euclidean distance *zombie* and *happy* were the closest pair. Note that on the window-5 matrix the three cosine distances differ only in the third decimal place, so the ranking there should be read with caution.

### Matching based distances: Jaccard distance
Here we explore matching as a way to compute distances between words. Since the matching coefficient depends heavily on the raw frequencies of the words, it may be necessary to *standardise* it, as is done in the *Jaccard distance*.
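For count vectors, one common way to write this is to take the matching coefficient as the sum of element-wise minima and standardise it by the sum of element-wise maxima. The sketch below implements that variant with a hypothetical helper `weighted_jaccard_distance`. SciPy's `distance.jaccard` is documented for boolean vectors, so on real-valued input it may not compute this same quantity.

```python
import numpy as np


def weighted_jaccard_distance(u, v):
    """1 - sum(min(u, v)) / sum(max(u, v)) for non-negative vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    denominator = np.maximum(u, v).sum()
    if denominator == 0:
        return 0.0  # both vectors are all zeros: treat them as identical
    return 1.0 - np.minimum(u, v).sum() / denominator


# Illustrative count vectors
a = np.array([3.0, 0.0, 1.0])
b = np.array([1.0, 2.0, 1.0])

# matching = 1 + 0 + 1 = 2, standardiser = 3 + 2 + 1 = 6
print(weighted_jaccard_distance(a, b))
print(weighted_jaccard_distance(a, a))
```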

In [52]:
d1 = distance.jaccard(zombie5,death5)
d2 = distance.jaccard(zombie5,happy5)
d3 = distance.jaccard(death5,happy5)
print("Zombie-death: ",d1)
print("Zombie-happy: ",d2)
print("Happy-death: ",d3)

Zombie-death:  0.9961356805495921
Zombie-happy:  0.9897213339424394
Happy-death:  0.9953241232731137


In [53]:
d1 = distance.jaccard(zombie20,death20)
d2 = distance.jaccard(zombie20,happy20)
d3 = distance.jaccard(death20,happy20)
print("Zombie-death: ",d1)
print("Zombie-happy: ",d2)
print("Happy-death: ",d3)

Zombie-death:  0.9937055837563452
Zombie-happy:  0.9768881551795295
Happy-death:  0.9846371538306045


### Probability-based distances

Probability-based distances turn each word vector into a probability distribution (assigning a probability mass to every other word in the vocabulary, proportional to its co-occurrence count) and then measure the distance between the two distributions.
<br>
<br>
An example of a symmetric probability-based measure is the *Jensen-Shannon* divergence, which is built on the *KL-divergence*. Note that `distance.jensenshannon` returns the Jensen-Shannon distance, i.e. the square root of the divergence.
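To show how the Jensen-Shannon divergence is assembled from the KL-divergence, here is a minimal NumPy sketch. The helper names and the two toy distributions are assumptions for this example: each distribution is compared with the average of the two, and the two KL terms are averaged.

```python
import numpy as np


def kl_divergence(p, q):
    """KL(p || q) in nats, treating 0 * log(0 / q) as 0."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))


def js_divergence(p, q):
    """Average KL divergence of p and q from their midpoint m."""
    m = 0.5 * (p + q)
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Illustrative distributions that overlap on one outcome only
p = np.array([0.5, 0.5, 0.0])
q = np.array([0.0, 0.5, 0.5])

jsd_pq = js_divergence(p, q)
jsd_qp = js_divergence(q, p)
js_distance = np.sqrt(jsd_pq)  # the square root is the Jensen-Shannon distance

print(jsd_pq, jsd_qp, js_distance)
```

Unlike KL(p || q), which is infinite here because `q` assigns zero mass where `p` does not, the Jensen-Shannon divergence is finite and symmetric.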

In [0]:
# First we normalise the arrays
zombie5n = zombie5/np.sum(zombie5)
death5n = death5/np.sum(death5)
happy5n = happy5/np.sum(happy5)

In [74]:
d1 = distance.jensenshannon(zombie5n,death5n)
d2 = distance.jensenshannon(zombie5n,happy5n)
d3 = distance.jensenshannon(death5n,happy5n)
print("Zombie-death: ",d1)
print("Zombie-happy: ",d2)
print("Happy-death: ",d3)

Zombie-death:  0.6471835294189927
Zombie-happy:  0.6472151522431021
Happy-death:  0.6472732505531924


Ranking by distance: *happy-death* > *zombie-happy* > *zombie-death*

In [0]:
# First we normalise the arrays
zombie20n = zombie20/np.sum(zombie20)
death20n = death20/np.sum(death20)
happy20n = happy20/np.sum(happy20)

In [68]:
d1 = distance.jensenshannon(zombie20n,death20n)
d2 = distance.jensenshannon(zombie20n,happy20n)
d3 = distance.jensenshannon(death20n,happy20n)
print("Zombie-death: ",d1)
print("Zombie-happy: ",d2)
print("Happy-death: ",d3)

Zombie-death:  0.3148975947848389
Zombie-happy:  0.31499963283886195
Happy-death:  0.27413988112993093


Ranking by distance: *zombie-happy* > *zombie-death* > *happy-death*

Interestingly, the words *happy* and *death* appear more similar in distribution when a larger window is used to scan the text.
<br>
Also, unlike the Jaccard distance results, here the choice of window and weighting scheme changes the ranking.

Note: 
> Both L2-norms and probability distributions can obscure
differences in the amount/strength of evidence, which
can in turn have an effect on the reliability of cosine,
normed-euclidean, and KL divergence. These
shortcomings might be addressed through weighting
schemes.
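
The point of the note can be illustrated with two made-up count vectors that share the same profile but rest on very different amounts of evidence. After L2-normalisation, or after conversion to a probability distribution, they become indistinguishable. The vectors and helper names are assumptions for this sketch.

```python
import numpy as np


def l2_normalise(x):
    return x / np.linalg.norm(x)


def to_distribution(x):
    return x / x.sum()


# Hypothetical counts: same profile, very different amounts of evidence
weak = np.array([1.0, 2.0, 1.0])             # word seen only a handful of times
strong = np.array([1000.0, 2000.0, 1000.0])  # word seen thousands of times

same_after_l2 = np.allclose(l2_normalise(weak), l2_normalise(strong))
same_as_distribution = np.allclose(to_distribution(weak), to_distribution(strong))
print(same_after_l2, same_as_distribution)
```

Any distance computed on the normalised vectors treats the two words as identical, even though the estimate for the rarer word is far less reliable.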


## Re-weighting schemes