# Static Embeddings: Exercises

## 1) Model Training

Train a custom Word2Vec model. This time, use the full set of example sentences (rather than only the first 10).

Think about whether you should convert the tokens to lowercase, and whether you should change the parameters of the Word2Vec algorithm (such as the embedding dimensions and the context window).
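As a quick illustration of why casing matters, the sketch below compares vocabulary sizes with and without lowercasing. The three sentences are made up for this example (they are not taken from `example_sentences.csv`), and a plain whitespace split stands in for `word_tokenize`.

```python
# Toy sentences (hypothetical, invented for this illustration).
toy_sentences = [
    "NLP is fun",
    "nlp is useful",
    "Nlp models learn",
]


def build_vocab(sentences, lowercase):
    """Collect the set of distinct tokens, optionally lowercased."""
    vocab = set()
    for sentence in sentences:
        # Simple whitespace split; a stand-in for a real tokenizer.
        tokens = sentence.split()
        if lowercase:
            tokens = [token.lower() for token in tokens]
        vocab.update(tokens)
    return vocab


cased_vocab = build_vocab(toy_sentences, lowercase=False)
lowercased_vocab = build_vocab(toy_sentences, lowercase=True)

print(len(cased_vocab), sorted(cased_vocab))
print(len(lowercased_vocab), sorted(lowercased_vocab))
```

With lowercasing, the three spellings of "nlp" collapse into one token that sees all of their training contexts; without it, each spelling gets its own embedding learned from fewer examples.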

In [1]:
import pandas as pd
from nltk.tokenize import word_tokenize
from gensim.models import Word2Vec

full_docs = pd.read_csv('example_sentences.csv')['sentence'].tolist()

# ...


## 2) Word Similarity

a) Compute the cosine similarity between "nlp" and five words of your choosing. Note: the words must be present in the model's vocabulary!

b) Get the 5 most similar words for "nlp" and for three other words of your choosing.

Hint: to access the similarity methods of your model, you need to go through its `wv` attribute (the model's keyed vectors) first, e.g. `word2vec_model.wv.similarity()`
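To make clear what a cosine similarity score means, here is a minimal sketch that computes it by hand. The three-dimensional vectors are invented for illustration and do not come from any trained model; `cosine_similarity` is a helper defined here, not a gensim function.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy embeddings (values chosen by hand).
toy_vectors = {
    "nlp": [1.0, 2.0, 0.0],
    "language": [2.0, 4.0, 0.0],  # same direction as "nlp", twice as long
    "banana": [0.0, 0.0, 3.0],    # orthogonal to "nlp"
}

sim_related = cosine_similarity(toy_vectors["nlp"], toy_vectors["language"])
sim_unrelated = cosine_similarity(toy_vectors["nlp"], toy_vectors["banana"])

print(round(sim_related, 3), round(sim_unrelated, 3))
```

Vectors that point in the same direction score close to 1 regardless of their length, while orthogonal vectors score 0.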

In [3]:
# ...

## 3) Visualization

Load the `word2vec-google-news-300` model we used before, and visualize the embeddings. Since this model has more tokens than the one you trained, plotting only a subset of words will improve visual clarity. You can pick whichever words you like; a suggestion is given below.

Before visualizing, you will need to reduce the dimensions of the embeddings. Try both the PCA algorithm (as before) and the t-SNE algorithm. What do you notice in the visualizations?

Hint: Depending on the number of words you choose, you may need to lower the "perplexity" parameter of the t-SNE algorithm so that it is smaller than your number of words.

More info on the t-SNE algorithm here: https://scikit-learn.org/stable/modules/manifold.html#t-sne
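The perplexity constraint from the hint can be sketched as a small helper. `choose_perplexity` is a hypothetical name introduced here; the cap of 30.0 mirrors scikit-learn's default perplexity, and recent scikit-learn versions raise an error when the perplexity is not smaller than the number of samples.

```python
def choose_perplexity(n_words, preferred=30.0):
    """Return a perplexity strictly below n_words, capped at `preferred`."""
    if n_words < 2:
        raise ValueError("need at least two words to pick a perplexity")
    # n_words - 1 is the largest whole number that is still below n_words.
    return min(preferred, float(n_words - 1))


print(choose_perplexity(10))   # 9.0
print(choose_perplexity(100))  # 30.0
```

The returned value could then be passed on as `TSNE(perplexity=...)`. Smaller values emphasize local neighbourhoods, so it is worth trying a few settings and comparing the plots.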

In [6]:
words = ["NLP", "sentiment", "analysis", "physics", "scientist", 
         "man", "woman", "doctor", "nurse", "research", "medicine",
         "engineering", "technology", "AI", "machine", "learning",
         "data", "big", "small", "fast", "slow", "good", "bad",
         "up", "down", "left", "right", "positive", "negative",
         "productivity", "efficiency", "accuracy", "precision",]

# t-SNE requires the perplexity to be smaller than the number of plotted words
perplexity = len(words) - 1

In [None]:
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import numpy as np
import matplotlib.pyplot as plt
import gensim.downloader as api

word2vec = api.load('...')

# ...

## 4) Model Bias

Think about words that may carry a certain bias in specific domains. Find term pairs (like "men" and "women" for gender bias) and compute the cosine similarity between each term of the pair and each word from a specific domain (e.g. "nurse", "doctor"). How could you analyse these differences in your own research?
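One possible way to quantify such a difference is to subtract the two similarities, as in the sketch below. The two-dimensional vectors are invented purely to show the computation and say nothing about any real model; `similarity_gap` is a hypothetical helper, not part of gensim.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy embeddings (hand-picked values, NOT model output).
toy_vectors = {
    "men": [1.0, 0.0],
    "women": [0.0, 1.0],
    "doctor": [0.8, 0.6],
    "nurse": [0.6, 0.8],
}


def similarity_gap(word, term_a, term_b, vectors):
    """Positive if `word` is closer to `term_a`, negative if closer to `term_b`."""
    sim_a = cosine_similarity(vectors[word], vectors[term_a])
    sim_b = cosine_similarity(vectors[word], vectors[term_b])
    return sim_a - sim_b


gaps = {
    word: similarity_gap(word, "men", "women", toy_vectors)
    for word in ("doctor", "nurse")
}

for word, gap in gaps.items():
    print(word, round(gap, 2))
```

With a real model, the same gap could be computed from `similarity()` scores for a whole list of domain words and then compared across words or across models.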

In [None]:
# ...

## Bonus

### Other Models

Load another model from the ones available in *Gensim*. Are the same biases present?

In [None]:
# List of all available models
for model_name in api.info()['models']:
    print(model_name)

# ...

### Analogies

Find more analogy tasks to run on the `word2vec-google-news-300` model. Is the model good at solving them? What does that say about the model? How do other models perform on the same tasks?

Hint: The `most_similar` method can solve analogy tasks using its `positive` and `negative` arguments. Remember that the pattern is `X2 - X1 + Y1 = Y2`. It may help to write out your analogy and formalize it before filling in the function arguments!
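The arithmetic behind the pattern can be sketched on toy data. The vectors below are invented so that the analogy works out exactly, and `solve_analogy` is a hypothetical helper. It uses plain vector arithmetic, whereas gensim's `most_similar` works on normalized vectors, so results on real models can differ. In terms of the method's arguments, the added terms (`X2`, `Y1`) go into `positive` and the subtracted term (`X1`) goes into `negative`.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy embeddings (hand-picked values, NOT model output).
toy_vectors = {
    "man": [1.0, 0.0],
    "woman": [1.0, 1.0],
    "king": [3.0, 0.0],
    "queen": [3.0, 1.0],
    "apple": [0.0, 3.0],
}


def solve_analogy(x1, x2, y1, vectors):
    """Return the word closest to X2 - X1 + Y1, ignoring the three input words."""
    target = [
        b - a + c
        for a, b, c in zip(vectors[x1], vectors[x2], vectors[y1])
    ]
    candidates = [word for word in vectors if word not in (x1, x2, y1)]
    return max(candidates, key=lambda word: cosine_similarity(target, vectors[word]))


# "man" is to "king" as "woman" is to ...?
answer = solve_analogy("man", "king", "woman", toy_vectors)
print(answer)
```

Excluding the input words from the candidates matters: on real embeddings the nearest neighbour of the target vector is often one of the inputs themselves.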

In [None]:
# ...