# Initial Explorations

## Contents
1. [Foreword](#Foreword)
1. [Intro to VSMs](#Intro-to-Vector-Space-Models)
1. [Linguistic Motivation](#Linguistic-Motivation)
1. [Counts & Context Windows](#Counts-Context-Windows)


In [1]:
import os
import random

import numpy as np
import pandas as pd
from scipy.stats import spearmanr

import utils
import vsm

utils.fix_random_seeds()
DATA_HOME = os.path.join('data/data', 'vsmdata')
# Scaled co-occurrence counts from Gigaword, window size 5.
giga5 = pd.read_csv(
    os.path.join(DATA_HOME, 'giga_window5-scaled.csv.gz'), index_col=0)
giga5.to_csv('giga5.csv')  # keep an uncompressed local copy
# Skip the leading rows and columns; W starts at 'abc'.
giga = giga5.iloc[11:,11:]
W = giga5.iloc[12:,12:]

Let's use Gigaword as our corpus: a collection of 4 million articles from the Associated Press, Los Angeles Times, Washington Post, Bloomberg, and several other news agencies. We'll use a window that scans five words deep on either side of a center (target) word. Instead of keeping the context window flat, we'll weight it so that words closer to the target count for more. Every word we encounter in the corpus is added to the running list of possible neighbors (the vocabulary). After scanning the entire corpus we obtain a $w \times w$ co-occurrence matrix, stored as the DataFrame W, where $w$ is the vocabulary size.
For more information on Gigaword, see https://catalog.ldc.upenn.edu/LDC2011T07.
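
The windowing described above can be sketched on a toy corpus. This is a minimal illustration, not the code that produced the Gigaword file: the function name `cooccurrence_counts`, the toy sentences, and the `1/distance` weighting are all assumptions chosen to match the description "words closer to the target count for more".

```python
from collections import defaultdict


def cooccurrence_counts(sentences, window=5, scaled=True):
    """Count co-occurrences within `window` tokens on either side of a target."""
    counts = defaultdict(lambda: defaultdict(float))
    for tokens in sentences:
        for i, target in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j == i:
                    continue
                distance = abs(i - j)
                # Assumed weighting: a neighbor d positions away adds 1/d.
                # With scaled=False the window is flat and every neighbor adds 1.
                weight = 1.0 / distance if scaled else 1.0
                counts[target][tokens[j]] += weight
    return counts


# Hypothetical two-sentence corpus, already tokenized.
toy_corpus = [
    "the old man reached retirement age".split(),
    "the age of the old house".split(),
]
counts = cooccurrence_counts(toy_corpus, window=5)
flat_counts = cooccurrence_counts(toy_corpus, window=5, scaled=False)
print(counts["age"]["old"], flat_counts["age"]["old"])
```

Because every pair is counted once from each side, the resulting table is symmetric, and a word that appears twice inside one window (like "the" in the second sentence) contributes to its own diagonal entry.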

A look at W:

In [2]:
W.head(20)

Unnamed: 0,abc,ability,able,abortion,about,above,abraham,absolute,absolutely,absorbing,...,younger,your,yourself,youth,zebra,zero,zinc,zombie,zone,zoo
abc,143.8,1.3,12.75,0.733333,276.483333,3.8,0.0,0.416667,1.4,0.0,...,6.45,18.783333,0.866667,2.433333,0.0,0.6,0.0,0.0,2.316667,0.0
ability,1.3,86.266667,49.933333,4.8,2195.383333,39.2,0.766667,8.15,12.25,0.616667,...,13.25,834.333333,16.616667,7.7,0.0,10.966667,0.0,0.2,8.683333,0.0
able,12.75,49.933333,60.133333,12.416667,1262.466667,61.516667,0.983333,6.85,17.466667,1.333333,...,31.516667,595.05,75.966667,10.416667,0.2,10.6,0.166667,0.0,21.2,8.366667
abortion,0.733333,4.8,12.416667,313.466667,1010.783333,2.833333,0.866667,8.35,2.616667,0.0,...,6.516667,17.05,0.166667,1.666667,0.0,1.45,0.0,0.0,9.983333,0.0
about,276.483333,2195.383333,1262.466667,1010.783333,17040.166667,872.433333,66.116667,91.966667,428.133333,49.25,...,353.766667,8092.616667,1107.883333,511.533333,9.15,199.683333,16.083333,13.45,330.95,79.3
above,3.8,39.2,61.516667,2.833333,872.433333,142.133333,2.65,70.55,7.466667,0.533333,...,10.05,321.2,17.3,4.466667,0.0,183.75,1.75,0.0,203.516667,5.433333
abraham,0.0,0.766667,0.983333,0.866667,66.116667,2.65,36.966667,0.666667,0.0,0.0,...,1.783333,6.316667,0.2,2.2,0.0,0.666667,0.166667,0.0,0.7,0.25
absolute,0.416667,8.15,6.85,8.35,91.966667,70.55,0.666667,31.066667,18.133333,0.0,...,2.066667,36.133333,1.283333,3.266667,0.0,179.866667,0.0,0.0,1.566667,1.0
absolutely,1.4,12.25,17.466667,2.616667,428.133333,7.466667,0.0,18.133333,115.7,0.0,...,1.583333,70.183333,5.816667,2.083333,0.2,95.75,0.0,0.0,3.1,1.116667
absorbing,0.0,0.616667,1.333333,0.0,49.25,0.533333,0.0,0.0,0.0,1.666667,...,0.0,1.533333,0.916667,0.5,0.0,1.0,0.166667,0.0,0.583333,0.0


The output shows a shortened view of the columns in our VSM, W, from "abc" to "zoo", along with its leading rows (components), which happen to be arranged alphabetically. Notice W's diagonal: it holds each word's co-occurrence with itself. It's a useful landmark, since those values tend to be large and catch the eye. However, the meaningful view is in the components of each column. Let's look at a truncated word embedding for 'age'.

In [93]:
W.to_csv('W.csv')

In [11]:
print(W['age'].head(10))
print(W['age'].tail(10))
print(f"w-dimensional column, w = {W['age'].shape[0]}")

abc              5.800000
ability         82.000000
able            67.700000
abortion        15.216667
about         1950.483333
above          115.416667
abraham          3.650000
absolute        12.433333
absolutely      10.666667
absorbing        1.366667
Name: age, dtype: float64
younger     582.450000
your        951.233333
yourself      8.866667
youth       171.533333
zebra         0.000000
zero         10.650000
zinc          0.000000
zombie        0.166667
zone          4.783333
zoo           8.250000
Name: age, dtype: float64
w-dimensional column, w = 5988


And here are its components with the largest values:

In [23]:
W['age'].sort_values(ascending=False).head(40)

the           48993.700000
of            44506.316667
at            32964.916667
and           17284.166667
in            17146.216667
to            12623.733333
his            7344.650000
an             6249.683333
is             5946.750000
for            5842.133333
that           5768.833333
from           4940.383333
with           4408.350000
by             4369.433333
he             4156.883333
was            3893.666667
under          3767.250000
as             3705.200000
new            3608.683333
when           3290.550000
this           3231.100000
her            3163.133333
old            3012.550000
who            2846.000000
or             2807.750000
are            2801.433333
their          2708.800000
but            2535.233333
died           2503.600000
on             2460.566667
retirement     2452.100000
has            2363.866667
group          2344.916667
my             2209.983333
years          2196.416667
children       2110.616667
over           2068.250000
a

This is encouraging. We see some words that seem meaningful when paired with 'age': 'old', 'retirement', 'children', and 'died'. There are also words that form common phrases with 'age', such as golden age, under age, and average age. Then there are a bunch of stop words: words that co-occur heavily with everything because they are linguistic building blocks. We can expect their components to be high across the board in any word embedding. This is intrinsic to natural language: word usage in English follows a strikingly Zipfian distribution, as the usage chart of English words below shows. This noise is our primary challenge.
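
To get a rough feel for how much the stop words dominate, we can compare their mass with that of the meaningful neighbors. The sketch below uses seven values copied from the printed `W['age']` output above; the choice of which words count as stop words is a hand-picked assumption, not a list used elsewhere in this notebook.

```python
# Seven components of W['age'], copied from the sorted printout above.
age_column = {
    "the": 48993.700000,
    "of": 44506.316667,
    "at": 32964.916667,
    "old": 3012.550000,
    "died": 2503.600000,
    "retirement": 2452.100000,
    "children": 2110.616667,
}

# Hypothetical, hand-picked stop-word set for this illustration only.
stop_words = {"the", "of", "at"}

total = sum(age_column.values())
stop_mass = sum(value for word, value in age_column.items() if word in stop_words)
stop_share = stop_mass / total

print(f"Share of this sample held by stop words: {stop_share:.1%}")
```

Even in this small sample, three function words outweigh the four content words combined by a wide margin, which is the noise the rest of the notebook tries to address.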



## Visualizing word embeddings
Gallery: https://public.tableau.com/app/profile/jelan.samatar/viz/VisualizingWordEmbeddings/Dashboard1 

So far we've only seen shortened printouts of word embeddings.
Before we start transforming them, it's useful to get a better idea of what we're working with, so let's take another look at some embeddings. Each word vector (embedding) is a column in W, characterized by w components running along its rows. Each component corresponds to a unique word in the vocabulary of Gigaword, the corpus populating this VSM.
##### Age 
In the 'Age' sheet, the embedding for 'age' is represented as an abstract column. The co-occurrence between each component and the word is drawn as a band, so the largest elements in W['age'] have the widest bands in the column. I've color-tagged the components to make the connection more evident when you hover over them.
For reference, the top five co-occurrences in 'age' are: [the, of, at, and, in]

##### Trio
Since our VSM is composed of a set of word vectors, the column space of W as a matrix is equivalent to the embedding space of W as a VSM. So comparing word embeddings means making meaningful comparisons between columns. For example, how does 'age' compare against 'old' and 'clear'? The "Trio" sheet shows these three words. At first glance, 'age' and 'old' are more similar in structure to each other than either is to 'clear'. However, all three tend to share the largest bands. Click the marker icon in the top right of the legend title 'components' to activate the highlight function, which lets you compare by component. Although this is a tiny subspace of W, exploring it gives a nice feel for how word embeddings relate to each other. Moreover, it serves as a helpful, if incomplete, visual reference for where our VSM stands: it paints a picture of latent knowledge obscured by noise.
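
One way to make "comparing columns" concrete is a distance between vectors. The sketch below uses cosine distance, which is an assumption on my part since the text has not committed to a metric at this point. The three five-component vectors are made-up stand-ins for 'age', 'old', and 'clear', not values read from W.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cos(angle between u and v); 0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


# Hypothetical components: two stop words followed by three content words.
# The first two entries are large in every vector, mimicking shared wide bands.
age = [900.0, 850.0, 300.0, 250.0, 10.0]
old = [950.0, 800.0, 280.0, 200.0, 20.0]
clear = [920.0, 870.0, 10.0, 20.0, 400.0]

print("age vs old:  ", cosine_distance(age, old))
print("age vs clear:", cosine_distance(age, clear))
```

In this toy setup 'age' lands closer to 'old' than to 'clear', but both distances are small because the shared stop-word components dominate every vector.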


## Quantitative Evaluations
Another way to judge our VSM is to see how it performs on relevant tasks. How well does it mimic human judgment of how related a given pair of words is? A dataset formed by having people sit down and score the relatedness of word pairs is called a relatedness dataset. As our evaluation set, I've used the word relatedness dataset distributed with Stanford's CS224u course (https://web.stanford.edu/class/cs224u/data/).

In [99]:
DATA_HOME = os.path.join('data/data', 'wordrelatedness')
eval_df = pd.read_csv(
    os.path.join(DATA_HOME, "cs224u-wordrelatedness-dev.csv"))

In [95]:
eval_df

Unnamed: 0,word1,word2,score
0,abandon,button,0.180000
1,abandon,consigning,0.400000
2,abandon,crane,0.160000
3,abandon,ditch,0.630000
4,abandon,left,0.570000
...,...,...,...
4751,wife,woman,0.728438
4752,withdraw,withdraw,1.000000
4753,workings,workings,1.000000
4754,workplace,workshop,0.767677


The basic idea is to use the word embeddings to predict a relatedness score for each pair, then compare the human rankings with our VSM's rankings using Spearman's $\rho$. For now, as a sanity check, let's see how a random guesser does. As we apply transformations to our VSM, we'll track how it performs on evaluative tasks and explore its contents with more visualizations.
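
To see what Spearman's $\rho$ measures, here is a hand-rolled sketch that compares two lists by rank only. It uses the shortcut formula $\rho = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)}$, which is valid only when there are no ties; `scipy.stats.spearmanr` handles the general case. The helper names `ranks` and `spearman_rho` are hypothetical, and the five value pairs are the human scores and random predictions for the first five word pairs in this notebook's printed output.

```python
def ranks(values):
    """Rank values from 1 (smallest) to n (largest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman_rho(xs, ys):
    """Spearman's rho via the no-ties shortcut formula."""
    n = len(xs)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Human scores and random predictions for the first five 'abandon' pairs.
human = [0.18, 0.40, 0.16, 0.63, 0.57]
random_guess = [0.708161, 0.169370, 0.957292, 0.863825, 0.509684]

print(spearman_rho(human, random_guess))
print(spearman_rho(human, human))
```

Only the ordering matters: a scorer that ranks pairs exactly as the human annotators do gets $\rho = 1$ regardless of the scale of its raw predictions.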

In [96]:
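def random_scorer(x1, x2):
    """Baseline: ignore both vectors and return a random score in [0, 1)."""
    return random.random()

def distance2pred(pred_df):
    """Flip the sign of the 'prediction' column in place and return pred_df."""
    # Operate on the argument rather than on a global DataFrame.
    pred_df['prediction'] = -pred_df['prediction']
    return pred_df

random_pred_df, random_score = vsm.word_relatedness_evaluation(
    eval_df, giga5, distfunc=random_scorer)
random_pred_df = distance2pred(random_pred_df)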

In [97]:
print(f'Score: {random_score}')
random_pred_df

Score: -0.0053296008429690785


Unnamed: 0,word1,word2,score,prediction
0,abandon,button,0.180000,0.708161
1,abandon,consigning,0.400000,0.169370
2,abandon,crane,0.160000,0.957292
3,abandon,ditch,0.630000,0.863825
4,abandon,left,0.570000,0.509684
...,...,...,...,...
4751,wife,woman,0.728438,0.999176
4752,withdraw,withdraw,1.000000,0.461827
4753,workings,workings,1.000000,0.948790
4754,workplace,workshop,0.767677,0.634606
