# Evaluating Document Embeddings

This notebook provides code for performing several qualitative and quantitative analyses of document embeddings.

## Getting Started

First, import the required packages.

In [None]:
import torch
import pandas as pd
import seaborn as sns; sns.set()
import matplotlib.pyplot as plt
import random
import pathlib
import json
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from tqdm import tqdm, trange
from transformers import AutoModel
from transformers import AutoTokenizer

RANDOM_STATE = 42

### Upload the Required Files

### Process the document embeddings

In the following cell, we read in the document embeddings and convert them to torch tensors. We also read in the corresponding document class labels.
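As a minimal sketch of the input format the next cell assumes: each line of a predictions file is a standalone JSON object whose `doc_embeddings` field holds one embedding as a list of floats. The in-memory file and the two-dimensional vectors below are made up for illustration; real embeddings are much longer.

```python
import io
import json

# Hypothetical stand-in for a predictions file: one JSON object per line.
fake_file = io.StringIO(
    '{"doc_embeddings": [0.1, 0.2]}\n'
    '{"doc_embeddings": [0.3, 0.4]}\n'
)

# Parse each line independently and keep only the embedding vector.
embeddings = [json.loads(line)["doc_embeddings"] for line in fake_file]
```

The result is a list of lists, which is the shape `torch.as_tensor` expects when building a 2-D tensor.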

In [None]:
# Read in the document embeddings
train_embeddings, test_embeddings = [], []
with open('/Users/johngiorgi/Downloads/train_predictions.jsonl', 'r') as f:
    for line in tqdm(f):
        train_embeddings.append(json.loads(line)['doc_embeddings'])

with open('/Users/johngiorgi/Downloads/valid_predictions.jsonl', 'r') as f:
    for line in tqdm(f):
        test_embeddings.append(json.loads(line)['doc_embeddings'])

# Convert them to torch tensors
train_embeddings = torch.as_tensor(train_embeddings)
test_embeddings = torch.as_tensor(test_embeddings)

# Read in the corresponding document labels (one label per line).
# splitlines() avoids a spurious empty label if the file ends with a newline.
train_labels = pathlib.Path('/Users/johngiorgi/Documents/dev/t2t/datasets/biorxiv/train_labels.tsv').read_text().splitlines()
test_labels = pathlib.Path('/Users/johngiorgi/Documents/dev/t2t/datasets/biorxiv/valid_labels.tsv').read_text().splitlines()

# We should have one label per document embedding
assert len(train_labels) == train_embeddings.size(0)
assert len(test_labels) == test_embeddings.size(0)

# TEMP (John): A hack to filter labels down to those that appear in the KATE paper
filtered_labels = ['Animal Behavior and Cognition', 'Ecology', 'Bioinformatics', 'Neuroscience', 'Genetics', 'Microbiology']
# Build the mask from the unfiltered labels so embeddings and labels stay aligned
keep_mask = torch.as_tensor([label in filtered_labels for label in train_labels])
train_embeddings = train_embeddings[keep_mask]
train_labels = [label for label in train_labels if label in filtered_labels]
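The filtering step above only works if embeddings and labels are filtered by the same criterion and in the same order. A minimal sketch of that idea in plain Python follows; `filter_by_label` and the toy data are hypothetical names introduced for illustration, not part of the notebook.

```python
def filter_by_label(embeddings, labels, keep):
    """Return aligned lists of embeddings and labels whose label is in `keep`."""
    keep = set(keep)
    # Pair each embedding with its label first, so both are filtered together.
    pairs = [(emb, lab) for emb, lab in zip(embeddings, labels) if lab in keep]
    kept_embeddings = [emb for emb, _ in pairs]
    kept_labels = [lab for _, lab in pairs]
    return kept_embeddings, kept_labels


# Toy data: three 2-D "embeddings" with one label each.
toy_embeddings = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]
toy_labels = ["Ecology", "Physiology", "Genetics"]

kept_embeddings, kept_labels = filter_by_label(
    toy_embeddings, toy_labels, {"Ecology", "Genetics"}
)
```

The same helper could be applied to the test split if queries should be restricted to the same classes.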

## Qualitative Analysis

### PCA

Perform dimensionality reduction with PCA and plot the resulting principal components.

In [None]:
pca = PCA(n_components=2, random_state=RANDOM_STATE)
principal_components = pca.fit_transform(train_embeddings.cpu().numpy())
pca_df = pd.DataFrame({'pc_1': principal_components[:, 0], 'pc_2': principal_components[:, 1], 'labels': train_labels})

In [None]:
ax = sns.scatterplot(x="pc_1", y="pc_2", hue='labels', data=pca_df, palette='deep')
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)  # move legend outside the plot
fig = ax.get_figure()
fig.savefig('pca.png', dpi=300)

### t-SNE

Perform dimensionality reduction with t-SNE and plot the resulting vectors (note: this may take a few minutes).

In [None]:
tsne = TSNE(n_components=2, random_state=RANDOM_STATE)
reduced_dims = tsne.fit_transform(train_embeddings.cpu().numpy())
tsne_df = pd.DataFrame({'dim_1': reduced_dims[:, 0], 'dim_2': reduced_dims[:, 1], 'labels': train_labels})

In [None]:
ax = sns.scatterplot(x="dim_1", y="dim_2", hue="labels", data=tsne_df, palette='deep')
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)  # move legend outside the plot
fig = ax.get_figure()
fig.savefig('tsne.png', dpi=300) 

## Quantitative Analysis

### Document Retrieval

Here, we set up a quantitative evaluation of our document embeddings by using them to perform document retrieval. The setup:

- Using each document embedding in the test set as a query, perform a search over all document embeddings in the train set (using some vector-based similarity metric).
- For each query, compute the precision as the fraction of retrieved documents that belong to the same class as the query document, then average over all queries.
- Repeat the evaluation for several retrieval fractions (the share of the train set that is retrieved) and plot the precision curve.
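To make the metric concrete, here is a minimal sketch of precision at a given retrieval fraction for a single query, on a made-up ranking. The function name `precision_at_fraction` and the `max(1, ...)` guard (so that very small fractions still retrieve one document) are assumptions for illustration.

```python
def precision_at_fraction(ranked_labels, query_label, fraction):
    """Share of the top `fraction` of ranked documents that match the query's label."""
    # Assumption: always retrieve at least one document.
    k = max(1, int(fraction * len(ranked_labels)))
    top = ranked_labels[:k]
    return sum(label == query_label for label in top) / len(top)


# Hypothetical ranking: labels of train documents, most similar first.
ranked = ["Ecology", "Ecology", "Genetics", "Ecology", "Genetics",
          "Genetics", "Ecology", "Genetics", "Genetics", "Genetics"]

p_20 = precision_at_fraction(ranked, "Ecology", 0.2)   # top 2 documents
p_50 = precision_at_fraction(ranked, "Ecology", 0.5)   # top 5 documents
p_100 = precision_at_fraction(ranked, "Ecology", 1.0)  # all 10 documents
```

At a fraction of 1.0 every train document is retrieved, so the precision reduces to the share of train documents that carry the query's label. Averaging these per-query values over all test queries gives one point on the precision curve.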

First, define the similarity metric. Here, we use cosine similarity.



In [None]:
sim_metric = torch.nn.CosineSimilarity(-1)
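For intuition, cosine similarity is the dot product of two vectors divided by the product of their norms. The sketch below computes it in plain Python on made-up 2-D vectors and ranks the documents by score. It omits the small `eps` term that `torch.nn.CosineSimilarity` uses to avoid division by zero, so it assumes no zero vectors.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between u and v (assumes neither is the zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy query and three toy "document" vectors.
query = [1.0, 0.0]
docs = [[0.0, 1.0], [2.0, 0.0], [1.0, 1.0]]

scores = [cosine_similarity(query, doc) for doc in docs]
# Indices of the documents, most similar first.
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
```

Because the score depends only on direction, `[2.0, 0.0]` is a perfect match for the query despite having twice its length.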

Then perform the search (see [this paper]() for more information).

In [None]:
retrived_docs = []  # (query label, ranked train indices) tuples, one per query
precision = []      # Per fraction precision scores

# Fractions to evaluate the average precision at
fractions = [0.001, 0.002, 0.005, 0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 1.0]

# Using each item in the test set as a query, perform a search over the train set
for query, label in tqdm(
        zip(test_embeddings, test_labels), 
        total=test_embeddings.size(0), 
        desc='Performing document retrieval'
    ):
    similarity_scores = sim_metric(query, train_embeddings)
    # Indices of train documents, ordered from most to least similar to the query
    retrived_indices = torch.argsort(similarity_scores, descending=True).tolist()
    retrived_docs.append((label, retrived_indices))

# Evaluate the average precision according to document class for each fraction of retrieved documents
for fraction in tqdm(fractions, desc='Evaluating retrieved documents'):
    precision.append(0)
    # Retrieve at least one document so that small fractions cannot cause a division by zero
    num_docs_to_retrive = max(1, int(fraction * train_embeddings.size(0)))
    for label, indices in retrived_docs:
        top_retrived = indices[:num_docs_to_retrive]
        precision[-1] += sum(train_labels[idx] == label for idx in top_retrived) / len(top_retrived)
    
    precision[-1] /= test_embeddings.size(0)

Finally, plot the precision curve.

In [None]:
doc_retrival_results = {'fraction': fractions, 'precision': precision}
doc_retrival_df = pd.DataFrame(doc_retrival_results)

In [None]:
_ = sns.pointplot(x='fraction', y='precision', data=doc_retrival_df)