# Statistical Methods (in progress)

### Contents
1. [Motivations](#Motivations)
1. [Defining a clearer space](#Defining-a-clearer-space)
    1. [Defining Length](#Defining-length)
        1. [L-1](#L-1)
        1. [L-2](#L-2)
    1. [Defining Distance](#Defining-Distance)
        1. [Cosine](#Cosine)
1. [Evaluation Update](#Quick-Evaluation-Update)
1. [Nearest Neighbors](#Nearest-Neighbors)

In [15]:
import utils
import vsm
import pandas as pd
import numpy as np
import os
import random
W = pd.read_csv('giga5.csv',index_col=0) #read in raw giga word dataset
DATA_HOME = os.path.join('data/data', 'wordrelatedness')
eval_df = pd.read_csv( # read in evaluation dataset
    os.path.join(DATA_HOME, "wordrelatedness-dev.csv"))
giga5 = pd.read_csv('giga5.csv',index_col=0)
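def distance2pred(pred_df): # haven't added these to vsm.py yet
    """Flip the sign of the 'prediction' column (in place) and return the frame."""
    pred_df['prediction'] = -pred_df['prediction']
    return pred_df
def random_scorer(x1, x2):
    """Baseline scorer: ignore both vectors and return a random value in [0, 1)."""
    return random.random()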

### Motivations
We've seen how W, a set of w-dimensional word vectors derived from raw co-occurrence counts, seems to carry an interesting embedding space. This is our initial VSM. Let's apply some statistical thinking to this space and reweight the matrix/dataset.

## Defining a clearer space

### Defining length
#### L-1 norm, Magnitude 
If we want to treat the colorful embeddings in Tableau as true vectors, we'll have to define them in a way that suits a vector space. As of now, the identity of each embedding comes from the co-occurrence values present in its components (the band widths). In the Trio viz, notice that the summed co-occurrence values are unique for each embedding. The Y-axis for 'age' tops out at 500k, while 'old' is almost twice as large. These values represent the embedding's *magnitude*, the sum of the absolute values of its components.
It might make sense to use magnitude, the L-1 norm, as our notion of length.
- L-1

$$\|u\|_{1} = {\sum_{i=1}^{w} |{u_{i}|}}$$
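As a quick illustration of the formula above, here is a minimal NumPy sketch. The vector `u` is a made-up toy row, not a real row of `W`.

```python
import numpy as np

# Toy co-occurrence row; the values are invented for illustration.
u = np.array([3.0, 0.0, 4.0, 1.0])

# L-1 norm: sum of the absolute values of the components.
l1 = np.abs(u).sum()

# np.linalg.norm with ord=1 computes the same quantity for a 1-D array.
assert np.isclose(l1, np.linalg.norm(u, ord=1))
print(l1)  # 8.0
```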

#### L-2 norm, Length
Imagine, however, we're looking at two words that are roughly synonyms. How about 'cookie' and 'biscuit'? We can be idealistic and assume there's no noise for now. Since they share so many of the same usages, as we run along their rows and compare the two, we'd find the co-occurrence values (band widths) to be pretty much the same at each component. However, they're **not** identical. If I wanted to dip a treat in tea, and asked a Brit to bring some biscuits, I might wind up with soggy Oreos. We want to capture and amplify these slight differences in usage. Instead of summing the co-occurrences at face value, we'll sum the *squares* of each component and take the square root at the end. This notion of length, L-2, is also more compatible with statistical Least Squares methods and happens to be the standard Linear Algebra definition of vector length.
- L-2  $$\|u\|_{2} = \sqrt{\sum_{i=1}^{w} u_{i}^{2}}$$
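To see how squaring amplifies differences, the sketch below compares two toy vectors that have the same L-1 norm but different L-2 norms. The helper names `l1_norm` and `l2_norm` and the vectors are assumptions made for this example, not part of `vsm.py`.

```python
import numpy as np

def l1_norm(u):
    """Sum of absolute values of the components."""
    return np.abs(u).sum()

def l2_norm(u):
    """Square root of the sum of squared components."""
    return np.sqrt((u ** 2).sum())

# Two invented rows with the same total count, distributed differently.
a = np.array([5.0, 5.0])  # counts spread evenly
b = np.array([9.0, 1.0])  # counts concentrated in one component

print(l1_norm(a), l1_norm(b))  # 10.0 10.0
print(round(float(l2_norm(a)), 3), round(float(l2_norm(b)), 3))  # 7.071 9.055
```

The L-1 norm cannot tell the two rows apart, while the L-2 norm is larger for the row whose mass sits in a single component.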

### Defining Distance
There are even more ways to think about distance. We could treat the space like our physical reality and use Euclidean distance. There's the grid-like Manhattan distance, and the probability-oriented KL divergence. The slew of options reflects what makes ML interesting but also why new problems can feel overwhelming. The decisions, from matrix design to hyperparameter choices, are not formulaic. Instead, they're informed by the domain and your downstream tasks. Since ours is to make meaningful vector comparisons, we want to start addressing the noise from common words. The cliff between high and medium usage dwarfs a lot of nuance. We can normalize the word vectors so the components are expressed as a fraction of the vector's length (like comparing per capita). Then we can just use Euclidean distance to express the space between normalized vectors.
- Normalization of vector *u* with *w* components
$$\textbf{normalize}(u) =
\left[ 
  \frac{u_{1}}{\|u\|_{2}}, 
  \frac{u_{2}}{\|u\|_{2}}, 
  \ldots, 
  \frac{u_{w}}{\|u\|_{2}} 
 \right]$$
- Euclidean Distance between vectors *u* and *v*
 $$\textbf{euclidean}(u, v) = 
\sqrt{\sum_{i=1}^{w}|u_{i} - v_{i}|^{2}}$$
 
The end result is a space in which the components of our word vectors have more meaning, since they share a common reference. Moreover, instead of length alone serving as an identity, the *orientation* of these vectors in Euclidean space has meaning. We can think of them as arrows in 3-D space pointing in different directions.
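The two formulas above can be sketched in a few lines of NumPy. The function names mirror the formulas, and the three toy vectors are invented for illustration. The sketch assumes no vector is all zeros, since normalizing one would divide by zero.

```python
import numpy as np

def normalize(u):
    """Divide every component by the vector's L-2 norm."""
    return u / np.sqrt((u ** 2).sum())

def euclidean(u, v):
    """Straight-line distance between two vectors."""
    return np.sqrt(((u - v) ** 2).sum())

# A frequent word and a rare word with the same usage pattern,
# plus a third word with a different pattern (all values invented).
frequent = np.array([300.0, 400.0])
rare = np.array([3.0, 4.0])
other = np.array([4.0, 3.0])

# Raw counts: frequency alone pushes the first pair far apart.
print(euclidean(frequent, rare))  # 495.0

# After normalization, only orientation matters.
same_pattern = euclidean(normalize(frequent), normalize(rare))
diff_pattern = euclidean(normalize(rare), normalize(other))
print(round(float(same_pattern), 4))  # 0.0
print(round(float(diff_pattern), 4))  # 0.2828
```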

### Cosine
If you are more familiar with Linear Algebra, you might notice we can achieve the same effect in one transformation. We take the dot product of the two vectors (giving us Euclidean-like orientation) and control for length by dividing this value by the product of their norms (normalizing them). That ratio is the cosine similarity between the two: 1 when the vectors point the same way, 0 when they are orthogonal. The rest of this notebook talks about how far apart things are rather than how similar they are, so we simply flip the script and subtract the result from 1 to get the cosine distance.
- CosineDist
$$\textbf{CosineDist}(u, v) = 
1 - \frac{\sum_{i=1}^{w} u_{i} \cdot v_{i}}{\|u\|_{2} \cdot \|v\|_{2}}$$
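A minimal NumPy sketch of the formula above, split into the ratio itself and one minus the ratio. The function names are illustrative and are not taken from `vsm.py`, and the vectors are toy values.

```python
import numpy as np

def cosine_similarity(u, v):
    """Dot product divided by the product of the L-2 norms."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def cosine_distance(u, v):
    """One minus the cosine similarity; 0 means identical orientation."""
    return 1.0 - cosine_similarity(u, v)

u = np.array([3.0, 4.0])
v = np.array([4.0, 3.0])

print(round(float(cosine_similarity(u, v)), 2))  # 0.96
print(round(float(cosine_distance(u, v)), 2))    # 0.04

# Scaling a vector changes its length but not its orientation,
# so the distance to the original stays (numerically) zero.
scaled = cosine_distance(u, 10 * u)
# Orthogonal vectors share nothing: similarity 0, distance 1.
orthogonal = cosine_distance(np.array([1.0, 0.0]), np.array([0.0, 1.0]))
```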

### Summary
I'll use the cosine distance since it's more practical. It isn't numerically identical to normalizing and taking Euclidean distance, but it ranks every pair of vectors in exactly the same order.
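The link between the two approaches can be made explicit with a short derivation. Here $\hat{u}$ and $\hat{v}$ stand for the L-2-normalized versions of *u* and *v* (the hat notation is introduced only for this derivation), so $\|\hat{u}\|_{2} = \|\hat{v}\|_{2} = 1$:

$$\textbf{euclidean}(\hat{u}, \hat{v})^{2}
= \sum_{i=1}^{w} (\hat{u}_{i} - \hat{v}_{i})^{2}
= \|\hat{u}\|_{2}^{2} + \|\hat{v}\|_{2}^{2} - 2\sum_{i=1}^{w} \hat{u}_{i} \cdot \hat{v}_{i}
= 2\left(1 - \frac{\sum_{i=1}^{w} u_{i} \cdot v_{i}}{\|u\|_{2} \cdot \|v\|_{2}}\right)$$

The squared Euclidean distance between normalized vectors is twice the quantity in the cosine formula. One is a monotone function of the other, so sorting pairs by either gives the same order.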

## Quick Evaluation Update

In [60]:
pred_df, score = vsm.word_relatedness_evaluation(eval_df, giga5, distfunc=vsm.cosine)
random_pred_df, random_score = vsm.word_relatedness_evaluation(eval_df, giga5, distfunc=random_scorer)
pred_df = distance2pred(pred_df)
pred_df['prediction'] = abs(1 - pred_df['prediction'])
pred_df['percent error'] = abs(pred_df['score']-pred_df['prediction'])*100
pred_df.rename(columns={'score': 'relatedness score'},inplace=True)
pred_df.rename(columns={'prediction': 'vsm prediction'},inplace=True)
print(f'VSM Score Percent: {score*100}')
print(f'Random Score Percent: {random_score*-100}')
print('Glimpse of how our VSM answered')
pred_df[59:90]

VSM Score Percent: 27.76320615138188
Random Score Percent: 2.4253955339491053
Glimpse of how our VSM answered


| | word1 | word2 | relatedness score | vsm prediction | percent error |
|---|---|---|---|---|---|
| 59 | action | involvement | 0.686 | 0.744897 | 5.889677 |
| 60 | action | operation | 0.66 | 0.928016 | 26.801612 |
| 61 | action | physician | 0.227454 | 0.838599 | 61.114498 |
| 62 | action | subway | 0.32 | 0.854321 | 53.432081 |
| 63 | action | truck | 0.44 | 0.874324 | 43.43243 |
| 64 | activity | activity | 1.0 | 1.0 | 0.0 |
| 65 | activity | attempt | 0.4 | 0.561553 | 16.155269 |
| 66 | activity | event | 0.77862 | 0.848844 | 7.022406 |
| 67 | activity | facility | 0.502664 | 0.849441 | 34.677691 |
| 68 | activity | music | 0.424702 | 0.901657 | 47.69556 |


The significant jump in score over blind guessing (roughly 27.8 versus 2.4) reflects the basic meaning these word embeddings take on in a well-defined space. Scrolling through the predictions shows the VSM has begun capturing some basic connections.
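The notebook does not say how `vsm.word_relatedness_evaluation` computes its score. Assuming it is a rank correlation between the gold relatedness scores and the predictions (an assumption, not something confirmed here), the idea can be sketched with a hand-rolled Spearman correlation on the ten rows shown above. The helpers `ranks` and `spearman` are hypothetical and ignore ties, which is safe for these rows because every value is distinct.

```python
import numpy as np

def ranks(x):
    """Rank positions 0..n-1 of each value; assumes no ties."""
    return np.argsort(np.argsort(x))

def spearman(x, y):
    """Pearson correlation of the ranks (Spearman's rho when there are no ties)."""
    return np.corrcoef(ranks(x), ranks(y))[0, 1]

# The ten rows displayed above (relatedness score, vsm prediction).
gold = np.array([0.686, 0.66, 0.227454, 0.32, 0.44,
                 1.0, 0.4, 0.77862, 0.502664, 0.424702])
pred = np.array([0.744897, 0.928016, 0.838599, 0.854321, 0.874324,
                 1.0, 0.561553, 0.848844, 0.849441, 0.901657])

rho = spearman(gold, pred)       # agreement between the VSM and the gold ranking
rho_self = spearman(gold, gold)  # a perfect ranking scores 1.0

# A random scorer, for comparison (seeded so the run is repeatable).
rng = np.random.default_rng(0)
rho_random = spearman(gold, rng.random(len(gold)))

print(rho, rho_random)
```

A rank-based score rewards putting pairs in the right order, which is why it does not matter whether the predictions are expressed as similarities or as negated distances.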

###  Nearest Neighbors
Since our vectors have meaningful orientation now, we can start doing what Semantle (https://semantle.novalis.org/) does. We can search for a word in our VSM and inspect the embeddings that are closest to it! Our hope is that words sharing a local space have something in common. It's a quick way to get a glimpse of what the landscape looks like. Nearest neighbors for 'age' and 'beauty' are shown below.
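For readers curious what a neighbor search involves, here is a minimal sketch under stated assumptions. It is not the code of `vsm.neighbors`: the function `neighbors_sketch` and the toy matrix are hypothetical. It computes the cosine distance from one row to every row and sorts the result.

```python
import numpy as np
import pandas as pd

def neighbors_sketch(word, df):
    """Cosine distance from `word` to every row of `df`, closest first."""
    M = df.to_numpy(dtype=float)
    target = df.loc[word].to_numpy(dtype=float)
    # Cosine similarity of every row with the target row, in one shot.
    sims = (M @ target) / (np.linalg.norm(M, axis=1) * np.linalg.norm(target))
    return pd.Series(1.0 - sims, index=df.index).sort_values()

# Invented co-occurrence counts: two treat-like rows, two vehicle-like rows.
toy = pd.DataFrame(
    [[10.0, 1.0, 0.0],
     [20.0, 2.0, 1.0],
     [0.0, 5.0, 9.0],
     [1.0, 4.0, 10.0]],
    index=['cookie', 'biscuit', 'truck', 'subway'],
    columns=['c1', 'c2', 'c3'])

result = neighbors_sketch('cookie', toy)
print(result)
```

The query word always comes first with a distance of (numerically) zero, which matches the `0.0` in the first row of each output below.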

In [55]:
pd.DataFrame(vsm.neighbors('age',(W))).head(10)

| | 0 |
|---|---|
| age | 0.0 |
| university | 0.04385 |
| expense | 0.04683 |
| risk | 0.046907 |
| length | 0.052734 |
| height | 0.061168 |
| gathering | 0.065073 |
| saint | 0.069794 |
| peak | 0.077287 |
| level | 0.077732 |


In [59]:
pd.DataFrame(vsm.neighbors('beauty',(W))).head(10)

| | 0 |
|---|---|
| beauty | 0.0 |
| integrity | 0.023809 |
| architecture | 0.024793 |
| makeup | 0.025417 |
| freshness | 0.026953 |
| brightness | 0.027701 |
| flavor | 0.032795 |
| culture | 0.033562 |
| coldness | 0.033635 |
| glass | 0.034652 |


The closest words for each concept are not exactly inspiring. We address this in the next notebook, *Leveraging Probability*.

Here's a sneak peek at where this work will take us.

In [57]:
pd.DataFrame(vsm.neighbors('age',vsm.pmi(W))).head(10)

| | 0 |
|---|---|
| age | 0.0 |
| ages | 0.60008 |
| gender | 0.702292 |
| older | 0.705287 |
| aged | 0.71683 |
| women | 0.717285 |
| young | 0.721127 |
| generation | 0.721446 |
| younger | 0.72412 |
| adults | 0.73521 |


In [58]:
pd.DataFrame(vsm.neighbors('beauty',vsm.pmi(W))).head(10)

| | 0 |
|---|---|
| beauty | 0.0 |
| beautiful | 0.523039 |
| gorgeous | 0.555254 |
| style | 0.573829 |
| fashion | 0.584708 |
| romance | 0.587333 |
| romantic | 0.591599 |
| lovely | 0.596409 |
| fantasy | 0.606127 |
| colors | 0.606353 |
