<div class="alert alert-block alert-info">
    <h1>Natural Language Processing</h1>
    04
    <h3>General Information:</h3>
    <p>Please do not add or delete any cells. Answers belong into the corresponding cells (below the question). If a function is given (either as a signature or a full function), you should not change the name, arguments or return value of the function.<br><br> If you encounter empty cells underneath the answer that can not be edited, please ignore them, they are for testing purposes.<br><br>When editing an assignment there can be the case that there are variables in the kernel. To make sure your assignment works, please restart the kernel and run all cells before submitting (e.g. via <i>Kernel -> Restart & Run All</i>).</p>
    <p>Code cells where you are supposed to give your answer often include the line  ```raise NotImplementedError```. This makes it easier to automatically grade answers. If you edit the cell please outcomment or delete this line.</p>
    <h3>Submission:</h3>
    <p>Please submit your notebook via the web interface (in the main view -> Assignments -> Submit). The assignments are due on <b>Monday at 15:00</b>.</p>
    <h3>Group Work:</h3>
    <p>You are allowed to work in groups of up to three people. Please enter the UID (your username here) of each member of the group into the next cell. We apply plagiarism checking, so do not submit solutions from other people except your team members. If an assignment has a copied solution, the task will be graded with 0 points for all people with the same solution.</p>
    <h3>Questions about the Assignment:</h3>
    <p>If you have questions about the assignment please post them in the LEA forum before the deadline. Don't wait until the last day to post questions.</p>
    
</div>

In [1]:
'''
Group Work:
Enter the username of each team member into the variables. 
If you work alone please leave the other variables empty.
'''
member1 = 'tghane2s'
member2 = 'rmore2s'
member3 = 'psheth2s'


# Word2Vec and FastText Embeddings

In this assignment we will work on Word2Vec embeddings and FastText embeddings.

I prepared three dictionaries for you:

- ```word2vec_yelp_vectors.pkl```: A dictionary with 300 dimensional word2vec embeddings trained on the Google News Corpus, contains only words that are present in our Yelp reviews (key is the word, value is the embedding)
- ```fasttext_yelp_vectors.pkl```: A dictionary with 300 dimensional FastText embeddings trained on the English version of Wikipedia, contains only words that are present in our Yelp reviews (key is the word, value is the embedding)
- ```tfidf_yelp_vectors.pkl```: A dictionary with 400 dimensional TfIdf embeddings trained on the Yelp training dataset from last assignment (key is the word, value is the embedding)

In the next cell we load those into the dictionaries ```w2v_vectors```, ```ft_vectors``` and ```tfidf_vectors```.

© Tim Metzler, Hochschule Bonn-Rhein-Sieg

In [2]:
import pickle

with open('/srv/shares/NLP/embeddings/word2vec_yelp_vectors.pkl', 'rb') as f:
    w2v_vectors = pickle.loads(f.read())
    
with open('/srv/shares/NLP/embeddings/fasttext_yelp_vectors.pkl', 'rb') as f:
    ft_vectors = pickle.loads(f.read())
    
with open('/srv/shares/NLP/embeddings/tfidf_yelp_vectors.pkl', 'rb') as f:
    tfidf_vectors = pickle.loads(f.read())
    
with open('/srv/shares/NLP/datasets/yelp/reviews_train.pkl', 'rb') as f:
    train = pickle.load(f)
    
with open('/srv/shares/NLP/datasets/yelp/reviews_test.pkl', 'rb') as f:
    test = pickle.load(f)
    
reviews = train + test

## Creating a vector model with helper functions [30 points]

In the next cell we have the class ```VectorModel``` with the methods:

- ```vector_size```: Returns the vector size of the model
- ```embed```: Returns the embedding for a word. Returns None if there is no embedding present for the word
- ```cosine_similarity```: Calculates the cosine similarity between two vectors
- ```most_similar```: Given a word returns the ```top_n``` most similar words from the model, together with the similarity value, **sorted by similarity (descending)**. We do not want to return the word itself as the most similar one. So we only return the most similar words except for the first one.
- ```most_similar_vec```: Given a vector returns the ```top_n``` most similar words from the model, together with the similarity value, **sorted by similarity (descending)**. Here we want to keep the most similar one.

Your task is to complete these methods.

Example output:
```
model = VectorModel(w2v_vectors)

vector_good = model.embed('good')
vector_tomato = model.embed('tomato')

print(model.cosine_similarity(vector_good, vector_tomato)) # Prints: 0.05318105

print(model.most_similar('tomato')) 
'''
[('tomatoes', 0.8442263), 
 ('lettuce', 0.70699364),
 ('strawberry', 0.6888598), 
 ('strawberries', 0.68325955), 
 ('potato', 0.67841727)]
'''

print(model.most_similar_vec(vector_good)) 
'''
[('good', 1.0), 
 ('great', 0.72915095), 
 ('bad', 0.7190051), 
 ('decent', 0.6837349), 
 ('nice', 0.68360925)]
'''

```

In [3]:
from typing import List, Tuple, Dict
import numpy as np

   
class VectorModel:
    
    def __init__(self, vector_dict: Dict[str, np.ndarray]):
        # YOUR CODE HERE
        self.vector = vector_dict
        
    def embed(self, word: str) -> np.ndarray:
        # YOUR CODE HERE
        return self.vector.get(word, None)
    
    def vector_size(self) -> int:
        # YOUR CODE HERE
        return len(next(iter(self.vector.values())))
    
    def cosine_similarity(self, vec1: np.ndarray, vec2: np.ndarray) -> float:
        dot = np.dot(vec1, vec2)
        norm = np.linalg.norm(vec1) * np.linalg.norm(vec2)
        return dot / norm if norm != 0 else 0.0

    def most_similar(self, word: str, top_n: int=5) -> List[Tuple[str, float]]:
        target_vec = self.embed(word)
        if target_vec is None:
            return []
        
        similarities = []
        for other_word, vec in self.vector.items():
            if other_word == word:
                continue
            similarity = self.cosine_similarity(target_vec, vec)
            similarities.append((other_word, similarity))
        
        similarities.sort(key=lambda x: x[1], reverse=True)
        return similarities[:top_n]
        
    def most_similar_vec(self, vec: np.ndarray, top_n: int=5) -> List[Tuple[str, float]]:
        similarities = []
        for word, word_vec in self.vector.items():
            similarity = self.cosine_similarity(vec, word_vec)
            similarities.append((word, similarity))
        
        similarities.sort(key=lambda x: x[1], reverse=True)
        return similarities[:top_n]
model = VectorModel(w2v_vectors)
vector_good = model.embed('good')
vector_tomato = model.embed('tomato')

print(model.cosine_similarity(vector_good, vector_tomato))
print(model.most_similar('tomato'))
print(model.most_similar_vec(vector_good))

0.05318105
[('tomatoes', 0.8442263), ('lettuce', 0.70699376), ('strawberry', 0.6888598), ('strawberries', 0.6832595), ('potato', 0.67841715)]
[('good', 1.0), ('great', 0.72915095), ('bad', 0.7190051), ('decent', 0.6837348), ('nice', 0.68360925)]


## Investigating similarity A) [10 points]

We now want to find the most similar words for a given input word for each model (Word2Vec, FastText and TfIdf).

Your input words are: ```['good', 'tomato', 'restaurant', 'beer', 'wonderful']```.

For each model and input word print the top three most similar words.

In [4]:
input_words = ['good', 'tomato', 'restaurant', 'beer', 'wonderful', 'dinner']

w2v_vm = VectorModel(w2v_vectors)          # Word2Vec
fasttext_vm = VectorModel(ft_vectors) # FastText
tfidf_vm = VectorModel(tfidf_vectors)       # TF-IDF vectors 
models = {
    'Word2Vec': w2v_vm,
    'FastText': fasttext_vm,
    'TfIdf': tfidf_vm
}

for model_name, model in models.items():
    print(f"\nModel: {model_name}")
    for word in input_words:
        if model.embed(word) is not None:
            top_similar = model.most_similar(word, top_n=3)
            print(f"{word}: {top_similar}")
        else:
            print(f"{word}: Not found in vocabulary")



Model: Word2Vec
good: [('great', 0.72915095), ('bad', 0.7190051), ('decent', 0.6837348)]
tomato: [('tomatoes', 0.8442263), ('lettuce', 0.70699376), ('strawberry', 0.6888598)]
restaurant: [('restaurants', 0.77228934), ('diner', 0.72802156), ('steakhouse', 0.72698534)]
beer: [('beers', 0.8409688), ('drinks', 0.66893125), ('ale', 0.63828725)]
wonderful: [('fantastic', 0.8047919), ('great', 0.76478696), ('fabulous', 0.7614761)]
dinner: [('dinners', 0.7902064), ('brunch', 0.79005134), ('breakfast', 0.7007028)]

Model: FastText
good: [('excellent', 0.7223856825801254), ('decent', 0.7202461451724537), ('bad', 0.6704173041669614)]
tomato: [('eggplant', 0.7518509618329048), ('spinach', 0.7422800959168396), ('onions', 0.7328857483500281)]
restaurant: [('restaurants', 0.8384667264823358), ('bistro', 0.7845601578005464), ('bakery', 0.7155727705943096)]
beer: [('beers', 0.7944971406865431), ('brewed', 0.7929903321082489), ('brewery', 0.7520785637582763)]
wonderful: [('lovely', 0.6808215868395576),

## Investigating similarity B) [10 points]

Comment on the output from the previous task. Let us look at the output for the word ```wonderful```. How do the models differ for this word? Can you reason why the TfIdf model shows so different results?

# - Word2Vec

### wonderful: [('fantastic', 0.8047919), ('great', 0.76478696), ('fabulous', 0.7614761)]
These are clear synonyms or near-synonyms of "wonderful". This shows that Word2Vec captures semantic relationships very well, since it learns from context; if words appear in similar contexts, they get similar vectors.

# - FastText

### wonderful: [('lovely', 0.6808215868395576), ('fascinating', 0.6745727685452472), ('amazing', 0.6457084279396067)]
FastText works similarly to Word2Vec but also includes subword information (n-grams), which helps with rare or misspelled words.

# - TfIdf

### wonderful: [('truffle', 0.6264995084522798), ('accident', 0.5432509277196604), ('equally', 0.5432509277196604)]
These words are not semantically similar to "wonderful". It doesn't understand meaning, only surface-level correlations.

TfIdf gives dissimilar results for "wonderful" because it lacks understanding of semantic similarity. Word2Vec and FastText are trained on massive corpora using context windows, which enables them to reflect true meaning-based similarity, making them more suitable for tasks requiring semantic understanding.

## Investigating similarity C) [10 points]

Instead of just finding the most similar word to a single word, we can also find the most similar word given a list of positive and negative words.

For this we just sum up the positive and negative words into a single vector by calculating a weighted mean. For this we multiply each positive word with a factor of $+1$ and each negative word with a factor of $-1$. Then we get the most similar words to that vector.

You are given the following examples:

```
inputs = [
    {
        'positive': ['good', 'wonderful'],
        'negative': ['bad']
    },
    {
        'positive': ['tomato', 'lettuce'],
        'negative': ['strawberry', 'salad']
    }    
]
```

In [5]:
# Answer

def find_analogy_vector(model_vectors: Dict[str, np.ndarray], word_groups: Dict[str, List[str]]) -> List[Tuple[str, float]]:
    model = VectorModel(model_vectors)

    # Lists to store valid vectors
    pos_vectors = []
    neg_vectors = []

    # Collect positive vectors
    for pos_word in word_groups.get("positive", []):
        vec = model.embed(pos_word)
        if vec is not None:
            pos_vectors.append(vec)
        else:
            print(f" Warning: Positive word '{pos_word}' not in vocabulary.")

    # Collect negative vectors
    for neg_word in word_groups.get("negative", []):
        vec = model.embed(neg_word)
        if vec is not None:
            neg_vectors.append(vec)
        else:
            print(f" Warning: Negative word '{neg_word}' not in vocabulary.")

    # Compute the average positive and negative vectors
    pos_mean = np.mean(pos_vectors, axis=0) if pos_vectors else np.zeros(model.vector_size())
    neg_mean = np.mean(neg_vectors, axis=0) if neg_vectors else np.zeros(model.vector_size())

    # Final analogy vector
    analogy_vec = pos_mean - neg_mean

    # Return top 1 most similar word to the resulting vector
    return model.most_similar_vec(analogy_vec, top_n=1)

inputs = [
    {
        'positive': ['good', 'wonderful'],
        'negative': ['bad']
    },
    {
        'positive': ['tomato', 'lettuce'],
        'negative': ['strawberry', 'fruit']
    },
    {
        'positive': ['ceasar', 'chicken'],
        'negative': []
    }    
]
for idx, entry in enumerate(inputs):
    result = find_analogy_vector(w2v_vectors, entry)
    print(f"\nInput {idx + 1}: {entry}")
    print(f"Top match: {result}")



Input 1: {'positive': ['good', 'wonderful'], 'negative': ['bad']}
Top match: [('wonderful', 0.5333514)]

Input 2: {'positive': ['tomato', 'lettuce'], 'negative': ['strawberry', 'fruit']}
Top match: [('lettuce', 0.54421157)]

Input 3: {'positive': ['ceasar', 'chicken'], 'negative': []}
Top match: [('chicken', 0.9999999977182473)]


## Investigating similarity D) [15 points]

We can use our model to find out which word does not match given a list of words.

For this we build the mean vector of all embeddings in the list.  
Then we calculate the cosine similarity between the mean and all those vectors.

The word that does not match is then the word with the lowest cosine similarity to the mean.

Example:

```
model = VectorModel(w2v_vectors)
doesnt_match(model, ['potato', 'tomato', 'beer']) # -> 'beer'
```

In [6]:
def doesnt_match(model, word_list):
    # List to store cosine similarity scores
    sim_scores = []

    # Vector to hold the sum of embeddings
    combined_vec = np.zeros(model.vector_size())

    # Sum up all the word embeddings
    for token in word_list:
        vec = model.embed(token)
        combined_vec += vec

    # Calculate the average (mean) embedding
    avg_vec = combined_vec / len(word_list)

    # Compute cosine similarity of each word vector with the mean vector
    for token in word_list:
        vec = model.embed(token)
        sim = model.cosine_similarity(vec, avg_vec)
        sim_scores.append(sim)

    # Convert similarity list to numpy array and find index of lowest score
    sim_scores = np.array(sim_scores)
    least_similar_idx = np.argmin(sim_scores)

    # Return the word least similar to the group
    return word_list[least_similar_idx]


doesnt_match(VectorModel(w2v_vectors), ['vegetable', 'strawberry', 'tomato', 'lettuce'])
words_1 = ['vegetable', 'strawberry', 'tomato', 'lettuce']
odd_word = doesnt_match(VectorModel(w2v_vectors), words_1)
print(f"\nWORDS: {words_1}, \nDoesn't match: {odd_word}")

words_2 = ['potato', 'tomato', 'beer']
odd_word = doesnt_match(VectorModel(w2v_vectors), words_2)
print(f"\nWORDS: {words_2}, \nDoesn't match: {odd_word}")


WORDS: ['vegetable', 'strawberry', 'tomato', 'lettuce'], 
Doesn't match: vegetable

WORDS: ['potato', 'tomato', 'beer'], 
Doesn't match: beer


## Document Embeddings A) [15 points]

Now we want to create document embeddings similar to the last assignment. For this you are given the function ```bagOfWords```. In the context of Word2Vec and FastText embeddings this is also called ```SOWE``` for sum of word embeddings.

Take the yelp reviews (```reviews```) and create a dictionary containing the document id as a key and the document embedding as a value.

Create the document embeddings from the Word2Vec, FastText and TfIdf embeddings.

Store these in the variables ```ft_doc_embeddings```, ```w2v_doc_embeddings``` and ```tfidf_doc_embeddings```

In [7]:
def bagOfWords(model, doc: List[str]) -> np.ndarray:
    '''
    Create a document embedding using the bag of words approach
    
    Args:
        model     -- The embedding model to use
        doc       -- A document as a list of tokens
        
    Returns:
        embedding -- The embedding for the document as a single vector 
    '''
    embeddings = [np.zeros(model.vector_size())]
    n_tokens = 0
    for token in doc:
        embedding = model.embed(token)
        if embedding is not None:
            n_tokens += 1
            embeddings.append(embedding)
    if n_tokens > 0:
        return sum(embeddings)/n_tokens
    return sum(embeddings)

ft_doc_embeddings = dict()
w2v_doc_embeddings = dict()
tfidf_doc_embeddings = dict()

# document-level embeddings
for review in reviews:
    doc_id = review['id']
    tokens = review['tokens']
    
    w2v_doc_embeddings[doc_id] = bagOfWords(w2v_vm, tokens)
    ft_doc_embeddings[doc_id] = bagOfWords(fasttext_vm, tokens)
    tfidf_doc_embeddings[doc_id] = bagOfWords(tfidf_vm, tokens)


## Document Embeddings B) [10 points]

Create a vector model from each of the document embedding dictionaries. Call these ```model_w2v_doc```, ```model_ft_doc``` and ```model_tfidf_doc```.

Now find the most similar document (```top_n=1```) for document $438$ with each of these models. Use the method `most_similar`. For example `model.most_similar(438)`.

Print the text for each of the most similar reviews.

In [11]:
# First find the text for review 438
def find_doc(doc_id, reviews):
    for review in reviews:
        if review['id'] == doc_id:
            return review['text']
    
doc_id = 438

# Print it
print('Source document:')
print(find_doc(doc_id, reviews))

# Create the models
model_w2v_doc = VectorModel(w2v_doc_embeddings)
model_ft_doc = VectorModel(ft_doc_embeddings)
model_tfidf_doc = VectorModel(tfidf_doc_embeddings)

# similar documents
similar_doc_w2v = model_w2v_doc.most_similar(doc_id, top_n=2)[1][0]
similar_doc_ft = model_ft_doc.most_similar(doc_id, top_n=2)[1][0]
similar_doc_tfidf = model_tfidf_doc.most_similar(doc_id, top_n=2)[1][0]

# print
print("\nMost similar document (Word2Vec):")
print(find_doc(similar_doc_w2v, reviews))

print("\nMost similar document (FastText):")
print(find_doc(similar_doc_ft, reviews))

print("\nMost similar document (TF-IDF):")
print(find_doc(similar_doc_tfidf, reviews))


# YOUR CODE HERE
#raise NotImplementedError()

Source document:
Absolutely ridiculously amazing! Chicken Tikka masala was perfect. Best I've ever had!

Most similar document (Word2Vec):
I LOVE THIS PLACE. ever since i frist tried it in atl.

ok. first thing first. YOU MUST ORDER COCONUT SHRIMP (unless your allergic!)

we ordered:

onion rings ( A MUST its frikkin huge and the sauce is awesome!)

Crab, Shrimp, Mango and Avocado Stack Crab (its frikkin good as hell.. mix it in the sauce that is circled around the plate they go perfect together!)

Cuban Sandwich (eggs were missing from it for me but the bf said it was good)

1/2 coconut shrimp & Caesar salad. (like I said the coconut shrimp is to die for ::and if your allergic i mean literally::  the caesar salad was... meh.. the cruton were cute but the salad itself was whatevers.

SERVICE AS USUAL WAS AWESOME! waiters were nice, patient and quick. waters were always filled as well.

Most similar document (FastText):
Been there three times, and have tried out different dishes.

First