# Week 2 Exercise

In this first exercise, we will learn about word embeddings. Here, you will:

- Try out static embeddings
- Try out contextual embeddings
- Visualize embeddings in a 2D-Space

Since this is the first exercise, there is not much programming work here; it is more about experimentation :)

To start, install the requirements. You should do this inside a virtual environment, so create one first.

If you are comfortable using conda, you can create an environment from the environment.yml file.

```
conda env create -f environment.yml
```

If you want to use pip:

```
python -m venv venv
source venv/bin/activate  # on Windows: venv\Scripts\activate
pip install -r requirements.txt
```

If you use VS Code, you can of course create and select the environment from within the editor, too.

In [None]:
# Quick check if your dependencies are installed.

import numpy as np

print("Hello world!")

# Part 1: Static Embeddings

First, let's look at static embeddings like Word2Vec and GloVe. 

Let's download some GloVe embeddings. GloVe is second-order: if words ('word instances') appear in similar contexts (i.e., surrounded by the same words), they will be close together in the embedding space, even if they do not occur next to each other themselves.
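The idea of second-order similarity can be illustrated without GloVe itself. The following is a toy sketch, not the GloVe training algorithm: it counts which words occur near a target word in a tiny made-up corpus and compares the resulting count vectors. The corpus and the helper names `context_counts` and `count_vector` are assumptions made for illustration.

```python
from collections import Counter

import numpy as np

# Tiny made-up corpus; "cat" and "dog" never share a sentence.
corpus = [
    "cat chased mouse",
    "dog chased mouse",
    "cat ate cheese",
    "dog ate cheese",
    "cat climbed tree",
    "plane left airport",
]

def context_counts(target, sentences, window=2):
    """Count the words within `window` positions of each occurrence of target."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.split()
        for i, word in enumerate(words):
            if word == target:
                lo, hi = max(0, i - window), min(len(words), i + window + 1)
                counts.update(words[lo:i] + words[i + 1:hi])
    return counts

vocab = sorted({w for s in corpus for w in s.split()})

def count_vector(target):
    """Represent a word by its context counts over the whole vocabulary."""
    counts = context_counts(target, corpus)
    return np.array([counts[w] for w in vocab], dtype=float)

def cosine_similarity(vec1, vec2):
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

sim_cat_dog = cosine_similarity(count_vector("cat"), count_vector("dog"))
sim_cat_plane = cosine_similarity(count_vector("cat"), count_vector("plane"))
print(f"cat/dog:   {sim_cat_dog:.4f}")
print(f"cat/plane: {sim_cat_plane:.4f}")
```

Here "cat" and "dog" never occur in the same sentence, yet their count vectors are similar because they share context words such as "chased" and "ate". "plane" shares no context words with "cat", so their similarity is zero.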

We will work with relatively small 50-dimensional GloVe embeddings for now.

Run the code below to load the embeddings.

In [None]:
import numpy as np

from gensim import downloader
glove = downloader.load("glove-wiki-gigaword-50")

def cosine_similarity(vec1, vec2):
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

print("Glove loaded!")

We can use cosine similarity to get a feel for how close words are in the embedding space.

Try running this code. Also try replacing these word pairs with some of your own and calculating their similarity. Take note of cases where the similarity is unintuitively high or low.

In [None]:
word_pairs = [
    ("mouse", "cat"),
    ("mouse", "dog"),
    ("mouse", "hamster"),
    ("mouse", "hole"),
    ("mouse", "cheese"),
    ("mouse", "sewer"),
    ("mouse", "computer"),
    ("mouse", "airport"),
    ("mouse", "politics")
]

for word1, word2 in word_pairs:
    sim = cosine_similarity(glove[word1], glove[word2])
    print(f"Cosine similarity between '{word1}' and '{word2}': {sim:.4f}")
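To calibrate your reading of the numbers above, it can help to check the extreme cases. A minimal sketch on hand-made toy vectors (the vectors are assumptions for illustration, not embeddings): cosine similarity depends only on the angle between two vectors, not on their length.

```python
import numpy as np

def cosine_similarity(vec1, vec2):
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

a = np.array([1.0, 2.0, 0.0])
same_direction = 3.0 * a                 # longer, but points the same way
orthogonal = np.array([0.0, 0.0, 5.0])   # shares no component with a
opposite = -a                            # points the other way

sim_same = cosine_similarity(a, same_direction)
sim_orthogonal = cosine_similarity(a, orthogonal)
sim_opposite = cosine_similarity(a, opposite)

print(f"same direction: {sim_same:.4f}")
print(f"orthogonal:     {sim_orthogonal:.4f}")
print(f"opposite:       {sim_opposite:.4f}")
```

Values close to 1 mean the vectors point in nearly the same direction, values near 0 mean they are unrelated in direction, and negative values mean they point in opposing directions.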

## Task 1

Examine the cosine similarities. Are they as you would expect? Are there any pairs with unintuitively high or low similarity, and what do they tell you about how second-order/static embeddings work and about their potential issues? Also try out other word pairs to make your points.

### Your Answer:

# Part 2: Contextual Embeddings

Popular embedding models like BERT are contextualized: a word gets a different embedding depending on the context it appears in.

Let's load up BERT:

In [None]:
from transformers import AutoTokenizer, AutoModel
import torch
import matplotlib.pyplot as plt

model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()

Let's define a simple function for getting the embedding of a target word from a sentence.

In [None]:
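def get_token_embedding(sentence, target_word):
    """Return the last-layer BERT vector of the first token equal to target_word."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():  # inference only, no gradients needed
        outputs = model(**inputs)
        # Drop the batch dimension: (1, seq_len, hidden) -> (seq_len, hidden)
        last_hidden_state = outputs.last_hidden_state.squeeze(0)

    # Token strings, including the special [CLS] and [SEP] tokens
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"].squeeze(0))

    # The uncased model lowercases its input, so compare in lowercase.
    # Note: a word that is split into several WordPiece tokens will not be found.
    target = target_word.lower()
    for idx, token in enumerate(tokens):
        if token == target:
            return last_hidden_state[idx]

    raise ValueError(f"Token '{target_word}' not found in: {tokens}")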

get_token_embedding("penguins are cool", "penguins")
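One limitation to keep in mind: the function compares whole tokens, so a target word that the tokenizer splits into several WordPiece pieces is never matched and a ValueError is raised. One possible workaround is to locate the pieces as a contiguous run and average their vectors. The sketch below shows only that lookup step. The token list is hand-written for illustration and is not actual tokenizer output, the matrix is a stand-in for `last_hidden_state`, and `find_piece_span` is a hypothetical helper.

```python
import numpy as np

def find_piece_span(tokens, pieces):
    """Return (start, end) of the first contiguous occurrence of pieces in tokens."""
    n = len(pieces)
    for start in range(len(tokens) - n + 1):
        if tokens[start:start + n] == pieces:
            return start, start + n
    raise ValueError(f"Pieces {pieces} not found in: {tokens}")

# Hand-written toy tokens (NOT real tokenizer output).
tokens = ["[CLS]", "i", "like", "pan", "##cakes", "[SEP]"]
pieces = ["pan", "##cakes"]

# Stand-in for last_hidden_state: one 4-dimensional row per token.
hidden = np.arange(len(tokens) * 4, dtype=float).reshape(len(tokens), 4)

start, end = find_piece_span(tokens, pieces)
word_vector = hidden[start:end].mean(axis=0)  # average over the word's pieces
print(start, end, word_vector)
```

In the notebook, `pieces` could come from `tokenizer.tokenize(target_word)`. Averaging is only one option; taking the vector of the first piece is another common choice.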

Just like with GloVe, we can also compute cosine similarities. Since BERT embeddings are context-dependent, even the same word can have different embeddings in different sentences (and thus a cosine similarity below 1).

In [None]:
word1 = get_token_embedding("I feel great", "feel")
word2 = get_token_embedding("I feel good", "feel")

print(cosine_similarity(word1, word2))

## Task 2

Play around with BERT and cosine similarities. Does the cosine similarity correlate with the similarity of context semantics, syntax, and word choice? When do vectors of the same word have very high or very low similarities?

### Your Answer:

# Part 3: Visualizing Embeddings

Let's try to visualize BERT embeddings. PCA (sklearn.decomposition.PCA) is a common algorithm for dimensionality reduction, but we will use t-SNE here (sklearn.manifold.TSNE), which is designed for visualization.
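For comparison, PCA can be written in a few lines of NumPy, which may help to see what a linear reduction does. This is a minimal sketch on random stand-in vectors (not real BERT embeddings); `pca_2d` is a hypothetical helper, and in practice sklearn.decomposition.PCA would be the usual choice.

```python
import numpy as np

def pca_2d(vectors):
    """Project vectors onto their two directions of largest variance."""
    X = np.asarray(vectors, dtype=float)
    X_centered = X - X.mean(axis=0)
    # Rows of vt are the principal directions, sorted by decreasing variance.
    _, _, vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ vt[:2].T

# Random stand-ins with the same width as bert-base hidden states (768).
rng = np.random.default_rng(0)
toy_vectors = rng.normal(size=(8, 768))

reduced = pca_2d(toy_vectors)
print(reduced.shape)  # (8, 2)
```

PCA is a linear projection, so it preserves the global directions of largest variance. The visualization in this exercise uses a nonlinear method instead.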

## Task 3

Familiarize yourself with the t-SNE algorithm. Implement the function below so that it takes an array/list of embedding vectors as input and returns their reduced 2D representations. Check out the scikit-learn documentation https://scikit-learn.org/stable/api/index.html for more information.

Hint: Using scikit-learn, you can easily do this in 1 or 2 lines of code.

In [None]:
from sklearn.manifold import TSNE

def reduce_dimensionality(vectors: list):
    # TODO implement here
    return []    

Let's test if we can plot BERT embeddings with matplotlib.

In [None]:
embeddings = [
    # [context, target word]
    ["I like books", "books"],
    ["I like learning", "learning"],
    ["I like music", "music"],
    ["I like food", "food"],
    ["I like cake", "cake"],
    ["I like cookies", "cookies"],
    ["I like rice", "rice"],
    ["I like pancakes", "pancakes"]
]

def plot_vectors(embeddings):
    labels = [x[0] for x in embeddings]

    vectors = []
    for sentence, word in embeddings:
        vectors.append(get_token_embedding(sentence, word))

    vectors = reduce_dimensionality(vectors)

    plt.figure(figsize=(8, 6))
    for i, label in enumerate(labels):
        x, y = vectors[i]
        plt.scatter(x, y)
        plt.text(x + 0.01, y + 0.01, label, fontsize=9)

    plt.show()

plot_vectors(embeddings)


## Task 4

Let's get to the issue of homonymy/polysemy. When a word has multiple senses, the context of the word decides its meaning. So perhaps it is possible to see which contexts imply the same word sense using these visualizations.

For example: Regarding the vector of the word "bat", perhaps the vectors with the contexts "I have a baseball bat" and "I hit the ball with the bat" are close together, while "The bat flies out of the cave" will be further away.

Your Task: Choose one word with two or more senses (e.g., key, run, set) and think of contexts that each imply one of its senses. Create a graphic using data points of the same word in different contexts.

In [None]:
# Your code here



## Task 5

Interpret the graphic you just created. Can you see any tendencies regarding word senses? Hint: If you cannot see any, try adding more data points until you see clusters forming.

### Your Answer:

# Additional Task

If you have no experience with regular expressions, I recommend reading up on the basics. Regular expressions are very useful for extracting information from complex structures.

You can read up on regular expressions here: https://docs.python.org/3/library/re.html
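One detail of `re.findall` is worth knowing before you start: what it returns depends on the number of capture groups in the pattern. A short sketch on a made-up string (the string and patterns are illustrative only):

```python
import re

text = "alice=3 bob=12 carol=7"

# No capture group: findall returns the whole matches.
whole = re.findall(r"\w+=\d+", text)
# One capture group: findall returns only the group's content.
names = re.findall(r"(\w+)=\d+", text)
# Several capture groups: findall returns one tuple per match.
pairs = re.findall(r"(\w+)=(\d+)", text)

print(whole)  # ['alice=3', 'bob=12', 'carol=7']
print(names)  # ['alice', 'bob', 'carol']
print(pairs)  # [('alice', '3'), ('bob', '12'), ('carol', '7')]
```

Parentheses therefore let you match a larger pattern while returning only the part you are interested in.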

With that in mind, try solving the following exercise. Write regular expressions to get the outputs to match the comments.

In [None]:
import re

# Task 1
document = """
a b c d e <f> g h i j k a b c d <e> f g h i j k
"""
match = re.findall(r"your pattern here", document)
# Return the letters inside the <> brackets.
# Expected output:
# ['f', 'e']
print(match)

# Task 2
document = """
qualification well-known finger introduction high-spirited dataset long-term story container six-pack music notebook
"""
match = re.findall(r"your pattern here", document)
# Return the words that include a hyphen (-).
# Expected output:
# ['well-known', 'high-spirited', 'long-term', 'six-pack']
print(match)
