# Assignment 2 – Word Embeddings

In this assignment, we are going to train a dynamic word embedding from scratch on newspaper data. The data is available [here]() and covers news articles from 2005 to 2012.

We are going to use the ```probabilistic_word_embeddings``` package, which can be installed from PyPI. We also need ```networkx``` and ```pandas```, and a library for plotting, e.g. ```seaborn```.

The documentation for the PWE module is available [here](https://ninpnin.github.io/probabilistic-word-embeddings/probabilistic_word_embeddings.html). An example of training a dynamic embedding, which might come in handy, is available [here](https://github.com/ninpnin/probabilistic-word-embeddings/blob/main/examples/dynamic.py).

In [None]:
!pip install probabilistic_word_embeddings
!pip install networkx pandas seaborn

First, let's import the modules.

In [None]:
import networkx as nx
import probabilistic_word_embeddings as pwe
from probabilistic_word_embeddings.preprocessing import preprocess_standard, preprocess_partitioned
from probabilistic_word_embeddings.embeddings import LaplacianEmbedding
from probabilistic_word_embeddings.estimation import map_estimate
from probabilistic_word_embeddings.evaluation import evaluate_on_holdout_set
import pandas as pd
import numpy as np
import seaborn as sns

The next step is to read in the data (CSV), remove null rows, filter out languages other than English, and take a subset of about 1000 rows for development. Come back and train on the full data later.

In [None]:
df = pd.read_csv # ...
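
The cleaning steps described above can be sketched on a toy table. This is a minimal illustration, not the assignment solution: the ```Text``` and ```Year``` columns appear later in this notebook, but the ```Language``` column name and its ```"english"``` value are assumptions, so check the actual CSV header before reusing this.

```python
import pandas as pd

# Toy stand-in for the real CSV. "Language" is an assumed column name.
df = pd.DataFrame({
    "Text": ["the dog barks", None, "der Hund bellt", "bread prices rise"],
    "Year": [2005, 2006, 2007, 2012],
    "Language": ["english", "english", "german", "english"],
})

df = df.dropna(subset=["Text", "Year"])    # remove rows with missing text or year
df = df[df["Language"] == "english"]       # keep English articles only
df = df.head(1000).reset_index(drop=True)  # small development subset

print(len(df))
```

Note that ```head``` takes the first rows of the file, which may over-represent some years; ```df.sample(n=1000, random_state=0)``` is an alternative when the table has at least 1000 rows.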

Now it's time to preprocess the data. First, save the contents of the ```Text``` column as a list, and split each article on whitespace to get a list of lists. Then, use the dynamic.py example as a reference for how to use the ```preprocess_partitioned``` function. Provide the years as labels, and use the ```downsample=False``` flag in the preprocessing function. After this, you should be left with the preprocessed articles, as well as the resulting vocabulary for the embedding.

In [None]:
texts = df["Text"]
texts = [t.split() for t in texts]
labels = df["Year"]
texts, vocabulary = preprocess_partitioned # ... 

Print the first and the last article to make sure they have been processed properly. You should see lists of words, e.g. ```["the_2024", "dog_2024"]```.

In [None]:
print(texts[0], texts[-1])
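
To make the expected output format concrete, here is a plain-Python illustration of label-suffixed tokens. The helper ```label_tokens``` is hypothetical and is not how ```preprocess_partitioned``` is implemented; the library function also builds the vocabulary and may apply further processing that this sketch does not attempt to reproduce.

```python
def label_tokens(texts, labels):
    """Append each article's label to every token in that article."""
    return [
        [f"{token}_{label}" for token in text]
        for text, label in zip(texts, labels)
    ]

# Two toy articles, already split on whitespace, with their years as labels
toy_texts = [["the", "dog"], ["the", "bread"]]
toy_labels = [2005, 2012]

labeled = label_tokens(toy_texts, toy_labels)
print(labeled[0], labeled[-1])
```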

Since we are creating a dynamic model, we need a prior graph. Each word vector is connected to the same word vector for the previous and next year. For instance, $\text{dog}_{2023}$ would be connected to $\text{dog}_{2024}$. The dynamic.py example file is helpful when creating the graph.

In [None]:
g = nx.Graph()
for # ...:
    # ...
    g.add_edge()
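
The chain structure described above, where each word is linked to itself in the neighboring years, might look as follows on a toy word list. This is a sketch under assumptions: ```build_dynamic_graph``` is a hypothetical helper, and it presumes node names follow the ```word_year``` pattern shown earlier. The dynamic.py example remains the reference for what the library expects.

```python
import networkx as nx

def build_dynamic_graph(words, years):
    """Link word_year to the same word in the next year, for every word."""
    g = nx.Graph()
    years = sorted(years)
    for word in words:
        # zip(years, years[1:]) yields consecutive pairs: (2005, 2006), (2006, 2007), ...
        for prev_year, next_year in zip(years, years[1:]):
            g.add_edge(f"{word}_{prev_year}", f"{word}_{next_year}")
    return g

g_toy = build_dynamic_graph(["dog", "bread"], range(2005, 2013))
print(g_toy.number_of_nodes(), g_toy.number_of_edges())
```

With real data, a word may be missing from some years' vocabulary, so it can be worth adding an edge only when both endpoints are actually in the vocabulary.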

Create an embedding from the vocabulary and the prior graph.

In [None]:
e = LaplacianEmbedding(vocabulary, graph=g, dimensionality=100)

Train the embedding using ```map_estimate```. Feel free to set ```model``` to ```sgns``` or ```cbow```, and the window size ```ws``` to anything between 2 and 10. Other reasonable hyperparameters are ```epochs=10``` and ```batch_size=5000``` or ```10000```. Finally, save the embedding using ```e.save```.

In [None]:
e = map_estimate # ...
e.save("embedding.pkl") # if you train multiple models, save them under different names

# load by:
# e = LaplacianEmbedding(saved_model_path="embedding.pkl")

## Analysis

At this stage, you want to analyze the word embeddings you have trained. Oftentimes, it is useful to know which words are similar to each other. In word embeddings, this can be done with cosine similarity.

Implement the cosine similarity metric. It is defined as

$$
\text{cossim}(a, b) = \frac{a \cdot b}{\lVert a\rVert \lVert b\rVert}
$$

i.e. the dot product of the vectors $a$ and $b$, divided by the product of their norms. The dot product is available as a function in NumPy; the norm is available in NumPy's ```linalg``` submodule.

In [None]:
def cosine_similarity(vec1, vec2):
    return # ...
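
One possible way to write the formula above with NumPy is sketched here on two small hand-made vectors. The ```np.asarray``` conversion is a precaution in case the inputs are not already NumPy arrays; whether that is needed for vectors returned by the embedding is an assumption, not something this notebook states.

```python
import numpy as np

def cosine_similarity(vec1, vec2):
    """Dot product of the two vectors divided by the product of their norms."""
    vec1 = np.asarray(vec1, dtype=float)
    vec2 = np.asarray(vec2, dtype=float)
    return float(np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2)))

a = np.array([1.0, 0.0])
b = np.array([1.0, 1.0])   # 45 degrees away from a
c = np.array([0.0, 2.0])   # orthogonal to a

print(cosine_similarity(a, b), cosine_similarity(a, c))
```

The function is undefined for an all-zero vector, since the denominator is then zero.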


Pick two words from the same year (using the syntax ```e["dog_2024"]```) and calculate their similarity.

In [None]:
w1, w2 = # ...
similarity = cosine_similarity(e[w1], e[w2])
print(similarity)

Select another pair of words. Your task is to plot the similarity of the pair of words over time on a line plot.

In [None]:
# Calculate the similarity for each year, and plot it with
# e.g. Matplotlib: plt.plot()
# or Seaborn: sns.lineplot()
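
The per-year computation can be sketched with a toy stand-in for the trained embedding. Here ```toy``` is a hypothetical dict of random vectors keyed by ```word_year```, so the resulting numbers carry no meaning; only the loop structure is the point.

```python
import numpy as np

def cosine_similarity(vec1, vec2):
    return float(np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2)))

# Hypothetical toy embedding: random vectors keyed by "word_year"
rng = np.random.default_rng(0)
years = list(range(2005, 2013))
toy = {f"{word}_{year}": rng.normal(size=8) for word in ("dog", "cat") for year in years}

# One similarity value per year, computed within that year's vectors
sims = [cosine_similarity(toy[f"dog_{year}"], toy[f"cat_{year}"]) for year in years]

for year, sim in zip(years, sims):
    print(year, round(sim, 3))
```

With the trained model, the lookups would go through ```e[...]``` instead of ```toy[...]```, and the two lists can be passed to ```plt.plot(years, sims)``` or ```sns.lineplot(x=years, y=sims)```.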

Sometimes, we want to know which words are semantically closest to a given word. There is a function for this, ```nearest_neighbors```. Use it to extract the 10 closest words to "bread", both in 2005 and in 2012.

In [None]:
from probabilistic_word_embeddings.utils import nearest_neighbors

w3 = # ...
results = nearest_neighbors # ..
print(results)
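
For intuition about what a nearest-neighbor lookup computes, here is a brute-force sketch that ranks every other key by cosine similarity. The function ```brute_force_neighbors``` and the toy vectors are hypothetical; this is not the implementation or the signature of the library's ```nearest_neighbors```, whose documentation should be consulted for the actual arguments.

```python
import numpy as np

def brute_force_neighbors(vectors, query, k=10):
    """Return the k keys most cosine-similar to `query`, with their scores."""
    keys = [key for key in vectors if key != query]
    matrix = np.stack([vectors[key] for key in keys])
    # Normalize rows and the query so that a dot product equals cosine similarity
    matrix = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    q = vectors[query] / np.linalg.norm(vectors[query])
    scores = matrix @ q
    order = np.argsort(-scores)[:k]  # indices of the highest scores first
    return [(keys[i], float(scores[i])) for i in order]

toy = {
    "bread_2005": np.array([1.0, 0.0]),
    "butter_2005": np.array([0.9, 0.1]),
    "war_2005": np.array([-1.0, 0.2]),
}
neighbors = brute_force_neighbors(toy, "bread_2005", k=2)
print(neighbors)
```

To compare 2005 against 2012, the candidate keys could be restricted to those ending in the year of interest before ranking.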