<div class="alert alert-block alert-info">
    <h1>Natural Language Processing</h1>
    04
    <h3>General Information:</h3>
    <p>Please do not add or delete any cells. Answers belong in the corresponding cells (below the question). If a function is given (either as a signature or a full function), you should not change the name, arguments or return value of the function.<br><br> If you encounter empty cells underneath the answer that cannot be edited, please ignore them; they are for testing purposes.<br><br>When editing an assignment, variables may still be present in the kernel. To make sure your assignment works, please restart the kernel and run all cells before submitting (e.g. via <i>Kernel -> Restart & Run All</i>).</p>
    <p>Code cells where you are supposed to give your answer often include the line <code>raise NotImplementedError</code>. This makes it easier to automatically grade answers. If you edit the cell, please comment out or delete this line.</p>
    <h3>Submission:</h3>
    <p>Please submit your notebook via the web interface (in the main view -> Assignments -> Submit). The assignments are due on <b>Monday at 13:00</b>.</p>
    <h3>Group Work:</h3>
    <p>You are allowed to work in groups of up to three people. Please enter the UID (your username here) of each member of the group into the next cell. We apply plagiarism checking, so do not submit solutions from other people except your team members. If an assignment has a copied solution, the task will be graded with 0 points for all people with the same solution.</p>
    <h3>Questions about the Assignment:</h3>
    <p>If you have questions about the assignment please post them in the LEA forum before the deadline. Don't wait until the last day to post questions.</p>
    
</div>

In [1]:
'''
Group Work:
Enter the username of each team member into the variables. 
If you work alone please leave the other variables empty.
'''
member1 = 'mfarra2s'
member2 = 'rhusai2s'
member3 = ''


# Word2Vec and FastText Embeddings

In this assignment we will work on Word2Vec embeddings and FastText embeddings.

I prepared three dictionaries for you:

- ```word2vec_yelp_vectors.pkl```: A dictionary of 300-dimensional word2vec embeddings trained on the Google News corpus, restricted to words that occur in our Yelp reviews (key is the word, value is the embedding)
- ```fasttext_yelp_vectors.pkl```: A dictionary of 300-dimensional FastText embeddings trained on the English version of Wikipedia, restricted to words that occur in our Yelp reviews (key is the word, value is the embedding)
- ```tfidf_yelp_vectors.pkl```: A dictionary of 400-dimensional TfIdf embeddings trained on the Yelp training dataset from the last assignment (key is the word, value is the embedding)

In the next cell we load those into the dictionaries ```w2v_vectors```, ```ft_vectors``` and ```tfidf_vectors```.

© Tim Metzler, Hochschule Bonn-Rhein-Sieg

In [2]:
import pickle

with open('/srv/shares/NLP/embeddings/word2vec_yelp_vectors.pkl', 'rb') as f:
    w2v_vectors = pickle.load(f)
    
with open('/srv/shares/NLP/embeddings/fasttext_yelp_vectors.pkl', 'rb') as f:
    ft_vectors = pickle.load(f)
    
with open('/srv/shares/NLP/embeddings/tfidf_yelp_vectors.pkl', 'rb') as f:
    tfidf_vectors = pickle.load(f)
    
with open('/srv/shares/NLP/datasets/yelp/reviews_train.pkl', 'rb') as f:
    train = pickle.load(f)
    
with open('/srv/shares/NLP/datasets/yelp/reviews_test.pkl', 'rb') as f:
    test = pickle.load(f)
    
reviews = train + test

## Creating a vector model with helper functions [30 points]

In the next cell we have the class ```VectorModel``` with the methods:

- ```vector_size```: Returns the vector size of the model
- ```embed```: Returns the embedding for a word. Returns None if there is no embedding present for the word
- ```cosine_similarity```: Calculates the cosine similarity between two vectors
- ```most_similar```: Given a word returns the ```top_n``` most similar words from the model, together with the similarity value, **sorted by similarity (descending)**.
- ```most_similar_vec```: Given a vector returns the ```top_n``` most similar words from the model, together with the similarity value, **sorted by similarity (descending)**.

Your task is to complete these methods.

Example output:
```
model = VectorModel(w2v_vectors)

vector_good = model.embed('good')
vector_tomato = model.embed('tomato')

print(model.cosine_similarity(vector_good, vector_tomato)) # Prints: 0.05318105

print(model.most_similar('tomato')) 
'''
[('tomatoes', 0.8442263), 
 ('lettuce', 0.70699364),
 ('strawberry', 0.6888598), 
 ('strawberries', 0.68325955), 
 ('potato', 0.67841727)]
'''

print(model.most_similar_vec(vector_good)) 
'''
[('good', 1.0), 
 ('great', 0.72915095), 
 ('bad', 0.7190051), 
 ('decent', 0.6837349), 
 ('nice', 0.68360925)]
'''

```
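
As a rough illustration of how such a ranking comes about, here is a minimal sketch on a hypothetical three-word vocabulary. The names `toy_vectors`, `cosine` and `rank_by_similarity` are made up for this example and are not part of the assignment; the real dictionaries hold 300-dimensional vectors.

```python
import numpy as np

# Hypothetical toy vocabulary (assumption: 3 dimensions instead of 300)
toy_vectors = {
    'tomato': np.array([1.0, 0.0, 0.0]),
    'potato': np.array([0.8, 0.6, 0.0]),
    'beer':   np.array([0.0, 0.0, 1.0]),
}

def cosine(u, v):
    # cos(u, v) = (u . v) / (||u|| * ||v||)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def rank_by_similarity(query_vec, vectors, top_n=2):
    # Score every word against the query vector, then sort descending
    scored = [(word, cosine(vec, query_vec)) for word, vec in vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]

ranking = rank_by_similarity(toy_vectors['tomato'], toy_vectors)
print(ranking)  # 'tomato' itself comes first, followed by 'potato'
```

Because the query vector is compared against every entry, the query word itself shows up with similarity 1, which matches the `('good', 1.0)` entry in the example output for ```most_similar_vec```.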

In [3]:
from typing import List, Tuple, Dict
import numpy as np

   
class VectorModel:
    
    def __init__(self, vector_dict: Dict[str, np.ndarray]):
        
        self.vector_dict =vector_dict
        
    def embed(self, word: str) -> np.ndarray:
        
        if word in self.vector_dict:
            
            return self.vector_dict[word]
        
        else:
            return None
    
    def vector_size(self) -> int:

        return len(next(iter(self.vector_dict.values())))
    
    def cosine_similarity(self, vec1: np.ndarray, vec2: np.ndarray) -> float:

        dot_product = np.dot(vec1, vec2)
        magnitude_vector1 = np.linalg.norm(vec1)
        magnitude_vector2 = np.linalg.norm(vec2)
        similarity = dot_product / (magnitude_vector1 * magnitude_vector2)
        
        return similarity
    
    def most_similar(self, word: str, top_n: int=5) -> List[Tuple[str, float]]:

        if word not in self.vector_dict:
            return None
        
        else:
            
            top_words = []
            similarity = []
            word_vec = self.embed(word)
            
            for w in self.vector_dict:
                
                if word == w:
                    continue
                
                similarity.append((w , self.cosine_similarity(self.vector_dict[w] , word_vec)))
            
            top_words = sorted(similarity, key=lambda x: x[1], reverse=True)[0:top_n]   
            
            return top_words
        
        
    def most_similar_vec(self, vec: np.ndarray, top_n: int=5) -> List[Tuple[str, float]]:
        # YOUR CODE HERE
        if len(vec) == self.vector_size():
           
            top_words = []
            similarity = []
            
            for w in self.vector_dict:
                
                similarity.append((w , self.cosine_similarity(self.vector_dict[w] , vec)))
            
            top_words = sorted(similarity, key=lambda x: x[1], reverse=True)[0:top_n]   
            
            return top_words
        
        else:
            return None
        
model = VectorModel(w2v_vectors)

vector_good = model.embed('good')
vector_tomato = model.embed('tomato')

print(model.cosine_similarity(vector_good, vector_tomato)) # Prints: 0.05318105

print(model.most_similar('tomato')) 

print(model.most_similar_vec(vector_good)) 


0.05318105
[('tomatoes', 0.8442263), ('lettuce', 0.70699376), ('strawberry', 0.6888598), ('strawberries', 0.6832595), ('potato', 0.67841715)]
[('good', 1.0), ('great', 0.72915095), ('bad', 0.7190051), ('decent', 0.6837348), ('nice', 0.68360925)]


## Investigating similarity A) [10 points]

We now want to find the most similar words for a given input word for each model (Word2Vec, FastText and TfIdf).

Your input words are: ```['good', 'tomato', 'restaurant', 'beer', 'wonderful']```.

For each model and input word print the top three most similar words.

In [4]:
input_words = ['good', 'tomato', 'restaurant', 'beer', 'wonderful', 'dinner']

model1= VectorModel(w2v_vectors)
model2= VectorModel(ft_vectors)
model3= VectorModel(tfidf_vectors)

for w in input_words:
    print(f' the top three most similar words for {w} using Word2Vec are :\n\n', model1.most_similar(w, top_n=3),'\n\n')
    print(f' the top three most similar words for {w} using FastText are :\n\n', model2.most_similar(w, top_n=3),'\n\n')
    print(f' the top three most similar words for {w} using TfIdf are :\n\n', model3.most_similar(w, top_n=3),'\n\n')
    print('***********************************************************************************************')

    

 the top three most similar words for good using Word2Vec are :

 [('great', 0.72915095), ('bad', 0.7190051), ('decent', 0.6837348)] 


 the top three most similar words for good using FastText are :

 [('excellent', 0.7223856825801254), ('decent', 0.7202461451724537), ('bad', 0.6704173041669614)] 


 the top three most similar words for good using TfIdf are :

 [('the', 0.6199071144484399), ('a', 0.6170194328254505), ('and', 0.6121212998064655)] 


***********************************************************************************************
 the top three most similar words for tomato using Word2Vec are :

 [('tomatoes', 0.8442263), ('lettuce', 0.70699376), ('strawberry', 0.6888598)] 


 the top three most similar words for tomato using FastText are :

 [('eggplant', 

## Investigating similarity B) [10 points]

Comment on the output from the previous task. Let us look at the output for the word ```wonderful```. How do the models differ for this word? Can you explain why the TfIdf model shows such different results?

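In short, Word2Vec and FastText both capture the context and semantics of a word, so their nearest neighbours are words with related meanings. The TfIdf embeddings, on the other hand, only reflect how often words occur in the documents, so their nearest neighbours are words with similar frequency patterns rather than similar meanings (for ```good``` above, the top matches are ```the```, ```a``` and ```and```). This difference also shows up in the cosine similarity scores of the TfIdf embeddings.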

## Investigating similarity C) [10 points]

Instead of just finding the most similar word to a single word, we can also find the most similar word given a list of positive and negative words.

For this we just sum up the positive and negative words into a single vector by calculating a weighted mean. For this we multiply each positive word with a factor of $+1$ and each negative word with a factor of $-1$. Then we get the most similar words to that vector.

You are given the following examples:

```
inputs = [
    {
        'positive': ['good', 'wonderful'],
        'negative': ['bad']
    },
    {
        'positive': ['tomato', 'lettuce'],
        'negative': ['strawberry', 'salad']
    }    
]
```
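
One possible reading of the weighted mean described above can be sketched on hypothetical 2-dimensional embeddings. The dictionary `toy` and the helper `combine` are assumptions made for this illustration only, and other ways of averaging (for example, averaging positives and negatives separately) are equally plausible.

```python
import numpy as np

# Hypothetical 2-dimensional embeddings (assumption, for illustration only)
toy = {
    'good':      np.array([1.0, 1.0]),
    'wonderful': np.array([1.0, 3.0]),
    'bad':       np.array([1.0, -1.0]),
}

def combine(positive, negative, vectors):
    # Weight +1 for positive words and -1 for negative words,
    # then average over all words that have an embedding
    weighted = [vectors[w] for w in positive if w in vectors]
    weighted += [-vectors[w] for w in negative if w in vectors]
    return np.mean(weighted, axis=0)

query_vec = combine(['good', 'wonderful'], ['bad'], toy)
print(query_vec)  # ([1, 1] + [1, 3] - [1, -1]) / 3
```

Since cosine similarity ignores vector length, dividing by the number of words does not change which words end up closest to the combined vector.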

In [5]:
# Answer
'''
[[('wonderful', 0.5333514383003238),
  ('fantastic', 0.388105050486788),
  ('fabulous', 0.373731015942465),
  ('great', 0.3706483732020164),
  ('exciting', 0.3641947265478558)],
 [('lettuce', 0.5442114749683798),
  ('spinach', 0.4292242312193039),
  ('tomatoes', 0.4068589989558246),
  ('tomato', 0.38179436632605823),
  ('broccoli', 0.33117993870250645)],
 [('chicken', 0.9999999977182473),
  ('meat', 0.6799130333713304),
  ('pork', 0.6541997485956462),
  ('turkey', 0.6282519126780681),
  ('shrimp', 0.6004992064697144)]]
'''
# Please write your code answer here!

inputs = [
    {
        'positive': ['good', 'wonderful'],
        'negative': ['bad']
    },
    {
        'positive': ['tomato', 'lettuce'],
        'negative': ['strawberry', 'fruit']
    },
    {
        'positive': ['ceasar', 'chicken'],
        'negative': []
    }    
]
# YOUR CODE HERE



def positive_negative_query(queries):
    # For each query, rank words by similarity to mean(positive) - mean(negative)
    model = VectorModel(w2v_vectors)
    most_similar = []
    
    for query in queries:
        # Start from fresh zero vectors for every query
        positive_vec = np.zeros(model.vector_size())
        negative_vec = np.zeros(model.vector_size())
       
        if len(query['positive']) != 0:
            for value in query['positive']:
                embed = model.embed(value)
                if embed is not None:  # skip words without an embedding
                    positive_vec = positive_vec + embed
            positive_vec = positive_vec / len(query['positive'])
       
        if len(query['negative']) != 0:
            for value in query['negative']:
                embed = model.embed(value)
                if embed is not None:  # skip words without an embedding
                    negative_vec = negative_vec + embed
            negative_vec = negative_vec / len(query['negative'])
        
        vec = positive_vec - negative_vec
        most_similar.append(model.most_similar_vec(vec))
        
    return most_similar    
    
positive_negative_query(inputs)

[[('wonderful', 0.5333514383003238),
  ('fantastic', 0.388105050486788),
  ('fabulous', 0.373731015942465),
  ('great', 0.3706483732020164),
  ('exciting', 0.3641947265478558)],
 [('lettuce', 0.5442114749683798),
  ('spinach', 0.4292242312193039),
  ('tomatoes', 0.4068589989558246),
  ('tomato', 0.38179436632605823),
  ('broccoli', 0.33117993870250645)],
 [('chicken', 0.9999999977182473),
  ('meat', 0.6799130333713304),
  ('pork', 0.6541997485956462),
  ('turkey', 0.6282519126780681),
  ('shrimp', 0.6004992064697144)]]

## Investigating similarity D) [15 points]

We can use our model to find out which word does not match given a list of words.

For this we build the mean vector of all embeddings in the list.  
Then we calculate the cosine similarity between the mean and all those vectors.

The word that does not match is then the word with the lowest cosine similarity to the mean.

Example:

```
model = VectorModel(w2v_vectors)
doesnt_match(model, ['potato', 'tomato', 'beer']) # -> 'beer'
```
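
The procedure can be traced by hand on a hypothetical toy vocabulary. The vectors in `toy` and the helper `odd_one_out` are assumptions for this sketch and are independent of the ```VectorModel``` class.

```python
import numpy as np

# Hypothetical 2-dimensional embeddings (assumption, for illustration only)
toy = {
    'potato': np.array([1.0, 0.1]),
    'tomato': np.array([0.9, 0.2]),
    'beer':   np.array([0.1, 1.0]),
}

def odd_one_out(words, vectors):
    # Keep only the words that have an embedding
    known = [w for w in words if w in vectors]
    # Mean vector of all known words
    mean_vec = np.mean([vectors[w] for w in known], axis=0)

    def cos(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

    # The word least similar to the mean is the one that does not match
    return min(known, key=lambda w: cos(vectors[w], mean_vec))

result = odd_one_out(['potato', 'tomato', 'beer'], toy)
print(result)  # beer
```

In this toy setup the mean is pulled towards the two similar vectors, so the remaining word ends up with the lowest similarity.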

In [6]:
def doesnt_match(model, words):
    # YOUR CODE HERE
    similarities = {}
    embeds = []
    for word in words:
        embed = model.embed(word)
        if embed is not None:
            embeds.append(embed)
    # Mean vector of all words that have an embedding
    means = np.mean(embeds, axis=0)
    
    for i, word in enumerate(words):
        embed = model.embed(word)
        if embed is not None:
            similarities[i] = model.cosine_similarity(means, embed)
    print(similarities)
    # The word that does not match has the lowest similarity to the mean
    return words[min(similarities, key=similarities.get)]
    
    
# doesnt_match(VectorModel(w2v_vectors), ['vegetable', 'strawberry', 'tomato', 'lettuce'])

# YOUR CODE HERE
model = VectorModel(w2v_vectors)
print(doesnt_match(model, ['potato', 'tomato', 'beer'])) # -> 'beer'
print(doesnt_match(VectorModel(w2v_vectors), ['vegetable', 'strawberry', 'tomato', 'lettuce']))

{0: 0.8235691, 1: 0.8485855, 2: 0.6721274}
beer
{0: 0.800388, 1: 0.82798094, 2: 0.9011535, 3: 0.8445007}
vegetable


## Document Embeddings A) [15 points]

Now we want to create document embeddings similar to the last assignment. For this you are given the function ```bagOfWords```. In the context of Word2Vec and FastText embeddings this is also called ```SOWE``` for sum of word embeddings.

Take the yelp reviews (```reviews```) and create a dictionary containing the document id as a key and the document embedding as a value.

Create the document embeddings from the Word2Vec, FastText and TfIdf embeddings.

Store these in the variables ```ft_doc_embeddings```, ```w2v_doc_embeddings``` and ```tfidf_doc_embeddings```
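
To make the idea concrete, here is a minimal sketch on a hypothetical two-word vocabulary. The names `toy` and `mean_embedding` are assumptions for this illustration. Like the given ```bagOfWords```, which divides the sum by the number of embedded tokens, the sketch averages the embeddings of all tokens that are found and skips the rest.

```python
import numpy as np

# Hypothetical 2-dimensional embeddings (assumption, for illustration only)
toy = {
    'great': np.array([1.0, 0.0]),
    'pizza': np.array([0.0, 1.0]),
}

def mean_embedding(tokens, vectors, dim):
    # Collect the embeddings of all tokens that are in the vocabulary
    found = [vectors[t] for t in tokens if t in vectors]
    if not found:
        # No known token: fall back to the zero vector
        return np.zeros(dim)
    return np.mean(found, axis=0)

doc_vec = mean_embedding(['great', 'pizza', 'unknownword'], toy, dim=2)
print(doc_vec)  # average of the two known tokens
```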

In [7]:
def bagOfWords(model, doc: List[str]) -> np.ndarray:
    '''
    Create a document embedding using the bag of words approach
    
    Args:
        model     -- The embedding model to use
        doc       -- A document as a list of tokens
        
    Returns:
        embedding -- The embedding for the document as a single vector 
    '''
    embeddings = [np.zeros(model.vector_size())]
    n_tokens = 0
    for token in doc:
        embedding = model.embed(token)
        if embedding is not None:
            n_tokens += 1
            embeddings.append(embedding)
    if n_tokens > 0:
        return sum(embeddings)/n_tokens
    return sum(embeddings)


ft_doc_embeddings = dict()
w2v_doc_embeddings = dict()
tfidf_doc_embeddings = dict()

ft_model = VectorModel(ft_vectors)
w2v_model = VectorModel(w2v_vectors)
tfidf_model = VectorModel(tfidf_vectors)

for review in reviews:
    ft_doc_embeddings[str(review['id'])] = bagOfWords(ft_model ,review['tokens'])
    w2v_doc_embeddings[str(review['id'])] = bagOfWords(w2v_model ,review['tokens'])
    tfidf_doc_embeddings[str(review['id'])] = bagOfWords(tfidf_model ,review['tokens'])
    
print(w2v_doc_embeddings)

{'0': array([ 1.46077474e-02,  1.32617156e-02,  3.73287201e-02,  9.77732340e-02,
       -6.71564738e-02, -6.52949015e-03,  5.12431463e-02, -9.39474106e-02,
        6.16731644e-02,  4.41474915e-02, -9.23347473e-03, -1.37972514e-01,
       -3.64240011e-03, -3.25234731e-02, -9.14436976e-02,  8.35386912e-02,
        4.61819967e-02,  6.26602173e-02,  3.15742493e-02, -5.60677846e-02,
       -2.58313417e-02,  2.33227412e-02,  4.66791789e-02, -2.00643539e-02,
        8.37974548e-02, -3.37748130e-02, -1.24649048e-01,  5.11404673e-02,
       -1.01089478e-04,  4.34875488e-04, -3.10757806e-02,  6.23687108e-03,
       -9.70395406e-03, -5.30322393e-04,  1.44716899e-02,  1.63919131e-02,
        1.42235756e-02,  3.28191121e-03,  1.59168243e-02,  5.12651602e-02,
        6.95549647e-02, -5.17870585e-02,  1.47443136e-01, -4.53297297e-02,
       -1.08873049e-02, -2.23019918e-02, -5.96621831e-02,  5.00138005e-02,
        3.49936485e-02,  2.81174978e-02,  1.22076670e-02,  9.11537806e-03,
        2.30712891e

## Document Embeddings B) [10 points]

Create a vector model from each of the document embedding dictionaries. Call these ```model_w2v_doc```, ```model_ft_doc``` and ```model_tfidf_doc```.

Now find the most similar document (```top_n=1```) for document $438$ with each of these models. Print the text for each of the most similar reviews.
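
One detail worth keeping in mind: a document is always most similar to itself, so the query document has to be skipped when looking for its nearest neighbour. A small sketch with hypothetical document vectors (`doc_vecs` and `nearest_other_doc` are made-up names for this illustration):

```python
import numpy as np

# Hypothetical document embeddings keyed by document id (assumption)
doc_vecs = {
    '1': np.array([1.0, 0.0]),
    '2': np.array([0.9, 0.1]),
    '3': np.array([0.0, 1.0]),
}

def nearest_other_doc(doc_id, vectors):
    query = vectors[doc_id]
    best_id, best_sim = None, -np.inf
    for other_id, vec in vectors.items():
        if other_id == doc_id:
            continue  # skip the query document itself
        sim = float(np.dot(query, vec) / (np.linalg.norm(query) * np.linalg.norm(vec)))
        if sim > best_sim:
            best_id, best_sim = other_id, sim
    return best_id, best_sim

nearest = nearest_other_doc('1', doc_vecs)
print(nearest[0])  # 2
```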

In [14]:
# First find the text for review 438
def find_doc(doc_id, reviews):
    for review in reviews:
        if review['id'] == doc_id:
            return review['text']
# print(review)
doc_id = 438

# Print it
print('Source document:')
print(find_doc(doc_id, reviews))

# Create the models
model_w2v_doc = VectorModel(w2v_doc_embeddings)
model_ft_doc = VectorModel(ft_doc_embeddings)
model_tfidf_doc = VectorModel(tfidf_doc_embeddings)

# The best match is always the source document itself,
# so we ask for two results and take the second one
w2v_hits = model_w2v_doc.most_similar_vec(w2v_doc_embeddings[str(doc_id)], top_n=2)
print('Text of the most similar doc using w2v:\n', w2v_hits, '\n')
print(find_doc(int(w2v_hits[1][0]), reviews), '\n')

ft_hits = model_ft_doc.most_similar_vec(ft_doc_embeddings[str(doc_id)], top_n=2)
print('Text of the most similar doc using ft:\n', ft_hits, '\n')
print(find_doc(int(ft_hits[1][0]), reviews), '\n')

tfidf_hits = model_tfidf_doc.most_similar_vec(tfidf_doc_embeddings[str(doc_id)], top_n=2)
print('Text of the most similar doc using tfidf:\n', tfidf_hits, '\n')
print(find_doc(int(tfidf_hits[1][0]), reviews), '\n')



Source document:
Absolutely ridiculously amazing! Chicken Tikka masala was perfect. Best I've ever had!
Text of the most similar doc using w2v:
 [('438', 1.0000000000000002), ('145', 0.8126940000171786)] 

I think I've been spoiled by eating delicious quesadillas quite frequently because the chicken quesadilla I ate was sub par.  It was greasy and the quality of chicken was not impressive. I gave an extra star because you can choose the fillings and those were fresh. 

Text of the most similar doc using ft:
 [('438', 1.0), ('145', 0.8609679394500404)] 

I think I've been spoiled by eating delicious quesadillas quite frequently because the chicken quesadilla I ate was sub par.  It was greasy and the quality of chicken was not impressive. I gave an extra star because you can choose the fillings and those were fresh. 

Text of the most similar doc using tfidf:
 [('438', 1.0), ('312', 0.7910624510317715)] 

Pretty good.. had sweet and sour chicken.. i had to have them reheat the food but o