# Demo: Embedding Distance Measures 

In this notebook, we demonstrate how to use different embedding distance measures to select the embedding dimension that minimizes downstream instability.

## Task

We compare two pairs of embeddings: 25-dim and 100-dim embeddings, each trained on the Wiki'2017 and Wiki'2018 datasets. The goal is to choose the pair of embeddings with the smaller embedding distance, and thereby select the dimension that is expected to have lower downstream instability. Our MLSys 2020 paper, "Understanding the Downstream Instability of Word Embeddings", found that a higher dimension generally improves stability, so we want each embedding distance measure to select the 100-dim (higher-dimensional) embeddings.

## Embedding Distance Computation 

First, we use five embedding distance measures (eigenspace instability measure, k-NN measure, semantic displacement, PIP loss, and eigenspace overlap score) to compute the embedding distance between the two pairs of embeddings (i.e., `dist(emb_2017_dim_25, emb_2018_dim_25)` and `dist(emb_2017_dim_100, emb_2018_dim_100)`).

Note: we subtract the k-NN and eigenspace overlap values from 1 since a larger value for these measures indicates greater stability (and we want a smaller value to uniformly indicate greater stability across the distance measures). 

In [1]:
from anchor.embedding import Embedding
import numpy as np 
import pandas

In [2]:
# load embeddings 

# same anchor embeddings for both dimension settings (use largest dimension)
emb1_anchor = Embedding('../demo/glove_wiki_2017_dim_100.txt')
emb2_anchor = Embedding('../demo/glove_wiki_2018_dim_100.txt')

emb1_dim_25 = Embedding('../demo/glove_wiki_2017_dim_25.txt')
emb2_dim_25 = Embedding('../demo/glove_wiki_2018_dim_25.txt')

emb1_dim_100 = Embedding('../demo/glove_wiki_2017_dim_100.txt')
emb2_dim_100 = Embedding('../demo/glove_wiki_2018_dim_100.txt')

# use the top-10000 most frequent words 
n = 10000

### Eigenspace Instability Measure (EIS)

In [3]:
eis_dim_25 = emb2_dim_25.eis(emb1_dim_25, curr_anchor=emb2_anchor, other_anchor=emb1_anchor, n=n, exp=3)
eis_dim_100 = emb2_dim_100.eis(emb1_dim_100, curr_anchor=emb2_anchor, other_anchor=emb1_anchor, n=n, exp=3)

### k-NN Measure 

In [4]:
knn_dim_25 = 1-emb2_dim_25.knn(emb1_dim_25, n=n)
knn_dim_100 = 1-emb2_dim_100.knn(emb1_dim_100, n=n)
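
The cell above does not show what the k-NN measure computes. As a rough illustration, assuming the measure is the average fraction of shared k nearest neighbors (by cosine similarity) over a set of query words, a minimal NumPy sketch could look like the following. The function name `knn_overlap`, the choice `k=5`, the query set, and the random data are illustrative assumptions, not the `anchor` library's implementation.

```python
import numpy as np

def knn_overlap(X, Y, queries, k=5):
    """Mean fraction of shared k nearest neighbors (cosine) for the query rows.

    X, Y: (n_words, dim) arrays with rows aligned to the same vocabulary.
    queries: integer array of row indices to use as query words.
    """
    def top_k(E):
        # Row-normalize so that dot products are cosine similarities.
        E = E / np.linalg.norm(E, axis=1, keepdims=True)
        sims = E[queries] @ E.T
        # Exclude each query word from its own neighbor list.
        sims[np.arange(len(queries)), queries] = -np.inf
        return np.argsort(-sims, axis=1)[:, :k]

    nn_x, nn_y = top_k(X), top_k(Y)
    overlaps = [len(set(a) & set(b)) / k for a, b in zip(nn_x, nn_y)]
    return float(np.mean(overlaps))

# Hypothetical data: 50 "words" in 8 dimensions, plus a perturbed copy.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))
Y = X + 0.5 * rng.normal(size=X.shape)
queries = np.arange(10)

same = knn_overlap(X, X, queries)    # identical embeddings share all neighbors
noisy = knn_overlap(X, Y, queries)   # perturbed embeddings share fewer
knn_distance = 1 - noisy             # converted to a distance, as in the cell above
```

Because a higher overlap means greater stability, the value is subtracted from 1 before it is compared with the other distance measures.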

### Semantic Displacement (SD)

In [5]:
sem_disp_dim_25 = emb2_dim_25.sem_disp(emb1_dim_25, n=n)
sem_disp_dim_100 = emb2_dim_100.sem_disp(emb1_dim_100, n=n)
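
For intuition, semantic displacement is commonly described as the average cosine distance between corresponding word vectors after the two embeddings have been aligned with an orthogonal Procrustes rotation. The sketch below follows that description. The function name `semantic_displacement` and the random data are illustrative assumptions, and the `sem_disp` method above may differ in details.

```python
import numpy as np

def semantic_displacement(X, Y):
    """Mean cosine distance between rows of X and rows of Y after alignment.

    X, Y: (n_words, dim) arrays with rows aligned to the same vocabulary.
    """
    # Orthogonal Procrustes: R = argmin ||Y R - X||_F over orthogonal R,
    # solved in closed form from the SVD of Y^T X.
    U, _, Vt = np.linalg.svd(Y.T @ X)
    R = U @ Vt
    Y_aligned = Y @ R
    cos = np.sum(X * Y_aligned, axis=1) / (
        np.linalg.norm(X, axis=1) * np.linalg.norm(Y_aligned, axis=1)
    )
    return float(np.mean(1.0 - cos))

# Hypothetical data: a rotated copy of X and an unrelated embedding.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))
Q, _ = np.linalg.qr(rng.normal(size=(8, 8)))   # random orthogonal matrix

sd_rotated = semantic_displacement(X, X @ Q)                      # alignment undoes the rotation
sd_unrelated = semantic_displacement(X, rng.normal(size=(50, 8)))
```

The alignment step matters because two embeddings can differ by a rotation without any change in the relationships between words; without it, a pure rotation would register as a large displacement.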

### PIP Loss

In [6]:
pip_dim_25 = emb2_dim_25.pip_loss(emb1_dim_25, n=n)
pip_dim_100 = emb2_dim_100.pip_loss(emb1_dim_100, n=n)
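
The PIP (pairwise inner product) loss is usually defined as the Frobenius norm of the difference between the two Gram matrices, $\lVert XX^T - YY^T \rVert_F$. The sketch below shows that definition together with an algebraically equivalent form that avoids building the n-by-n Gram matrices. Both function names and the random data are illustrative assumptions; the `pip_loss` method above may be implemented differently.

```python
import numpy as np

def pip_loss_direct(X, Y):
    """||X X^T - Y Y^T||_F, forming the (n_words x n_words) Gram matrices."""
    return float(np.linalg.norm(X @ X.T - Y @ Y.T, ord="fro"))

def pip_loss_lowmem(X, Y):
    """Same quantity using only (dim x dim) products.

    Uses ||XX^T - YY^T||_F^2 = ||X^T X||_F^2 + ||Y^T Y||_F^2 - 2 ||X^T Y||_F^2.
    """
    sq = (
        np.linalg.norm(X.T @ X, ord="fro") ** 2
        + np.linalg.norm(Y.T @ Y, ord="fro") ** 2
        - 2.0 * np.linalg.norm(X.T @ Y, ord="fro") ** 2
    )
    # Clamp tiny negative values caused by floating-point cancellation.
    return float(np.sqrt(max(sq, 0.0)))

# Hypothetical data: two unrelated embeddings and a rotated copy of X.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))
Y = rng.normal(size=(50, 8))
Q, _ = np.linalg.qr(rng.normal(size=(8, 8)))   # random orthogonal matrix

direct = pip_loss_direct(X, Y)
lowmem = pip_loss_lowmem(X, Y)
rotated = pip_loss_direct(X, X @ Q)            # rotations leave inner products unchanged
```

Unlike the k-NN and eigenspace overlap values, the PIP loss is not bounded by 1 and scales with the vocabulary size and the vector norms, which is why its entries in the results table are several orders of magnitude larger than the others.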

### Eigenspace Overlap Score (EO)

In [7]:
eo_dim_25 = 1-emb2_dim_25.eigen_overlap(emb1_dim_25, n=n)
eo_dim_100 = 1-emb2_dim_100.eigen_overlap(emb1_dim_100, n=n)
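
As a rough guide to what this score measures, the eigenspace overlap is commonly defined as $\lVert U^T V \rVert_F^2 / \max(d_1, d_2)$, where $U$ and $V$ hold the left singular vectors of the two embedding matrices and $d_1$, $d_2$ are their dimensions. The sketch below implements that definition. The function name `eigenspace_overlap` and the random data are illustrative assumptions, and the `eigen_overlap` method above may differ in details.

```python
import numpy as np

def eigenspace_overlap(X, Y):
    """Overlap between the column spaces of X and Y, in [0, 1].

    X: (n_words, d1) array, Y: (n_words, d2) array, rows aligned.
    """
    # Left singular vectors give an orthonormal basis for each column space.
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    V, _, _ = np.linalg.svd(Y, full_matrices=False)
    return float(np.linalg.norm(U.T @ V, ord="fro") ** 2 / max(U.shape[1], V.shape[1]))

# Hypothetical data: a rotated copy of X and an unrelated embedding.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))
Q, _ = np.linalg.qr(rng.normal(size=(8, 8)))   # random orthogonal matrix

eo_rotated = eigenspace_overlap(X, X @ Q)                      # same column space
eo_unrelated = eigenspace_overlap(X, rng.normal(size=(50, 8)))
eo_distance = 1 - eo_unrelated                                 # converted to a distance
```

A score of 1 means the two embeddings span the same subspace, so, as with the k-NN measure, the score is subtracted from 1 to obtain a quantity where smaller means more stable.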

## Predictions

Now we use the above results to predict the more stable dimension under each embedding distance measure.

In [9]:
def get_vote(dim_25, dim_100): 
    return ["25-dim", "100-dim"][np.argmin([dim_25, dim_100])]

# Create a table with predictions 
cols = ["25-dim distance", "100-dim distance", "Vote"]
rows = ["EIS", "1 - k-NN", "SD", "PIP", "1 - EO"]
data = np.array([[eis_dim_25, eis_dim_100, get_vote(eis_dim_25, eis_dim_100)],
                 [knn_dim_25, knn_dim_100, get_vote(knn_dim_25, knn_dim_100)],
                 [sem_disp_dim_25, sem_disp_dim_100, get_vote(sem_disp_dim_25, sem_disp_dim_100)], 
                 [pip_dim_25, pip_dim_100, get_vote(pip_dim_25, pip_dim_100)], 
                 [eo_dim_25, eo_dim_100, get_vote(eo_dim_25, eo_dim_100)]])
df = pandas.DataFrame(data, rows, cols)
df[['25-dim distance', '100-dim distance']] = df[['25-dim distance', '100-dim distance']].astype(float)
df.round(3)

| Measure | 25-dim distance | 100-dim distance | Vote |
|---|---|---|---|
| EIS | 0.001 | 0.001 | 100-dim |
| 1 - k-NN | 0.231 | 0.156 | 100-dim |
| SD | 0.021 | 0.036 | 25-dim |
| PIP | 11697.035 | 11563.472 | 100-dim |
| 1 - EO | 0.127 | 0.199 | 25-dim |
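
The EIS row shows the same rounded distance for both dimensions yet a definite vote. That is consistent with `get_vote` being applied to the unrounded values, with rounding done only for display. The minimal sketch below reproduces the effect with hypothetical numbers (they are not the actual EIS values) and also shows how `np.argmin` resolves an exact tie.

```python
import numpy as np

def get_vote(dim_25, dim_100):
    # Same voting rule as in the notebook: the smaller distance wins.
    return ["25-dim", "100-dim"][np.argmin([dim_25, dim_100])]

# Hypothetical distances that differ only beyond the third decimal place.
d25, d100 = 0.00112, 0.00108

rounded = (round(d25, 3), round(d100, 3))   # both display as 0.001
vote = get_vote(d25, d100)                  # decided on the unrounded values

# np.argmin returns the first index on an exact tie, so a tie goes to 25-dim.
tie_vote = get_vote(0.5, 0.5)
```

When two distances are this close, the vote is worth treating with some caution, since a small change in either value would flip it.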


As the table above shows, on these pairs of embeddings the EIS measure, the k-NN measure, and PIP loss correctly choose the 100-dim embedding pair as more stable, while SD and EO choose the 25-dim pair. Across different precision and dimension configurations, we find that our theoretically grounded EIS measure and the k-NN measure, for which we have no theoretical guarantees, are the top-performing measures.