# Create a word2vec model

For this assignment, we will use PyTorch to create a word2vec model that learns numerical vectors for words that capture their meaning. Word2vec was introduced in 2013 by Mikolov et al. at Google. Their paper can be found [here](https://arxiv.org/pdf/1301.3781.pdf), though you do not need to read and understand it in order to implement the model. It is a very popular machine learning model that has been used to capture the meaning of text in many real-world applications.

This [blog post](https://blog.acolyer.org/2016/04/21/the-amazing-power-of-word-vectors/) is a great overview of word2vec. Please read it carefully before you create the word2vec model for this assignment. Specifically, you will build a "Continuous Bag-of-Words (CBOW)" word2vec model. CBOW predicts a focal (target) word from its context (the words surrounding it). The following YouTube videos also explain the concept of the CBOW model.
- https://www.youtube.com/watch?v=UqRCEmrv1gQ
- https://www.youtube.com/watch?v=gQddtTdmG_8 


<img src="https://analyticsindiamag.com/wp-content/uploads/2020/09/q.png">

[CBOW structure from https://analyticsindiamag.com/the-continuous-bag-of-words-cbow-model-in-nlp-hands-on-implementation-with-codes/]

Your task is to create a CBOW neural network model class called `CBOW`. `CBOW` has the structure shown above and the following properties:

- `vocab_size` - Size of the vocabulary ($V$). The vocabulary is the set of unique words in a corpus (a collection of text documents).
- `embed_dim` - Dimension of the embedding vector
- `window_size` - Size of the context window. If the focal word is at position $t$, the CBOW model uses the embedding vectors of the words between positions ($t$ - `window_size`) and ($t$ + `window_size`), excluding the focal word itself, to predict the focal word
- `hidden_dim` - Dimension of the hidden layer ($N$)

`CBOW` consists of three layers:

- `embedding` - An embedding layer that is initialized with `torch.nn.Embedding`
- `fc1` - A linear transformation that connects the embedding layer to the hidden layer. `torch.nn.functional.relu` activation should be applied to the output of `fc1`.
- `fc2` - A linear transformation that connects the activation of `fc1` to a tensor of length `vocab_size`. 
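
As a rough illustration of the three layers described above, here is a minimal sketch of what such a class could look like. It assumes that the 2 * `window_size` context embeddings are concatenated into one vector before `fc1`; the assignment does not specify this, and averaging or summing the embeddings would be another reasonable choice. The vocabulary size and word indices in the usage lines are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CBOW(nn.Module):
    def __init__(self, vocab_size, embed_dim, window_size, hidden_dim):
        super().__init__()
        self.vocab_size = vocab_size
        self.embed_dim = embed_dim
        self.window_size = window_size
        self.hidden_dim = hidden_dim
        # One embed_dim-sized vector per word in the vocabulary
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # Assumption: context embeddings are concatenated, so the input
        # to fc1 has 2 * window_size * embed_dim features
        self.fc1 = nn.Linear(2 * window_size * embed_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x):
        # x: LongTensor of shape (2 * window_size,) holding context word indices
        embedded = self.embedding(x).view(1, -1)  # (1, 2 * window_size * embed_dim)
        hidden = F.relu(self.fc1(embedded))       # (1, hidden_dim)
        return self.fc2(hidden)                   # (1, vocab_size) unnormalized scores


# Placeholder vocabulary size and context indices, for shape checking only
model = CBOW(vocab_size=483, embed_dim=100, window_size=2, hidden_dim=30)
logits = model(torch.tensor([109, 37, 115, 332]))
print(logits.shape)
```

The output of `fc2` is left as raw scores (no softmax), which is the form `torch.nn.CrossEntropyLoss` expects.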

The training data (i.e., the features `X` and the labels `y`) that we will use to train the `CBOW` model will be:
- `X` will be a tensor of length (2 * `window_size`) containing the indices of all words in the window before and after the focal word.
- `y` (the label that our model is trying to predict) will be a tensor containing the index of the focal word.

Note that a single review in our data will produce multiple items of training data. For example, suppose a single review is:

 "the food was not good at all"
 
If our `window_size` = 2, then this would generate the following (context, focal_word) training data tuples:

```python
(['the','food','not','good'], ['was']) # 'was' is the focal word
(['food','was','good','at'], ['not']) # 'not' is the focal word
(['was','not','at','all'],['good']) # 'good' is the focal word
```

However, we can't directly use these tuples to train our model. First we have to replace each word with a unique integer (its index) and then convert these to PyTorch tensors. Note that we will be using an embedding layer (`torch.nn.Embedding`), which takes these indices directly and looks up the corresponding embedding vector. This is equivalent to multiplying the one-hot vectors described in the videos by a weight matrix, so you do not need to build the one-hot vectors yourself.
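
One way to see how `torch.nn.Embedding` relates to the one-hot vectors from the videos: looking up rows by index gives the same result as multiplying one-hot vectors by the embedding weight matrix. The sketch below uses a made-up vocabulary of 5 words and 3-dimensional embeddings purely for illustration.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Toy sizes chosen only for this illustration
embedding = torch.nn.Embedding(num_embeddings=5, embedding_dim=3)
indices = torch.tensor([1, 4])

# Direct lookup: one row of the weight matrix per index
looked_up = embedding(indices)

# Same result via explicit one-hot vectors and a matrix product
one_hot = F.one_hot(indices, num_classes=5).float()  # shape (2, 5)
via_matmul = one_hot @ embedding.weight              # shape (2, 3)

print(torch.allclose(looked_up, via_matmul))
```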

To get tensors from the original data, you will need to:

- Create a list (or `set`) of all unique words in the cleaned text, called `vocab`.
- Create a dictionary called `word_to_index` where the key is a word and the value is the index of a word (a unique number for each word). You will have to figure out how to create this dictionary from the cleaned dataset.
- Write a function `make_cbow_data` that accepts a single review from `cleaned_text` as input and outputs a **list of tuples** where:
  - The first part of the tuple is a tensor containing the indices of the words in the window before and after each focal word.
  - The second part of the tuple is a tensor containing the index of the focal word.
  - The dtype of both tensors in the tuple should be `torch.long`.
  - You will have to figure out how to create multiple tuples of tensors from a single review (an item from `cleaned_text`) using loops.
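
The steps above can be sketched as follows. This is one possible approach, not the reference solution: the three-argument signature mirrors how `make_cbow_data` is called later in this notebook, the vocabulary is sorted only to make the indices reproducible (an assumption), and `toy_reviews` is a stand-in for `cleaned_text`.

```python
import torch


def make_cbow_data(review, window_size, word_to_index):
    """Return a list of (context_tensor, focal_tensor) tuples for one review."""
    data = []
    # Only positions with a full window on both sides can be focal words
    for t in range(window_size, len(review) - window_size):
        context = review[t - window_size:t] + review[t + 1:t + window_size + 1]
        X = torch.tensor([word_to_index[w] for w in context], dtype=torch.long)
        y = torch.tensor([word_to_index[review[t]]], dtype=torch.long)
        data.append((X, y))
    return data


# Stand-in for cleaned_text: a list of tokenized reviews
toy_reviews = [["the", "food", "was", "not", "good", "at", "all"]]

# Sorting is an assumption made here so that indices are deterministic
vocab = sorted({word for review in toy_reviews for word in review})
word_to_index = {word: index for index, word in enumerate(vocab)}

pairs = make_cbow_data(toy_reviews[0], 2, word_to_index)
for X, y in pairs:
    print(X, y)
```

Reviews shorter than 2 * `window_size` + 1 words produce no training tuples under this scheme, since no position has a full window on both sides.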


We will use restaurant customer reviews data for this assignment.

**Do not change the code block below**. Below is a function that cleans up the text of a review and returns a list of all the words in the review.

You will use `cleaned_text`, which is defined below, to create a training dataset for your `CBOW` model.

In [18]:
# DO NOT CHANGE THIS CODEBLOCK
import pandas as pd
import string
import Word2VecSupport

def clean_text(text):    
    x = text.translate(str.maketrans('', '', string.punctuation)) # remove punctuation
    x = x.lower().split() # lower case and split by whitespace to differentiate words
    return x

example_text = pd.read_csv('text.csv')
cleaned_text = example_text.Review[:100].apply(clean_text)

### Data Description
Here, `example_text` contains 1000 reviews. The first 5 reviews are shown below.

Because `example_text` contains punctuation, we use the `clean_text` function to remove it, convert all words to lower case, and split them by whitespace. Note that `cleaned_text` keeps only the first 100 reviews.

In [19]:
print(example_text.count())
example_text.head()

Review    1000
dtype: int64


Unnamed: 0,Review
0,Wow... Loved this place.
1,Crust is not good.
2,Not tasty and the texture was just nasty.
3,Stopped by during the late May bank holiday of...
4,The selection on the menu was great and so wer...


In [20]:
cleaned_text.head()

0                            [wow, loved, this, place]
1                               [crust, is, not, good]
2    [not, tasty, and, the, texture, was, just, nasty]
3    [stopped, by, during, the, late, may, bank, ho...
4    [the, selection, on, the, menu, was, great, an...
Name: Review, dtype: object

### Create a CBOW Class

The first step is to create `vocab` and `word_to_index` according to the instructions above.

Here, to create the vocabulary, nested for loops collect every word in `cleaned_text`, and the `set()` function removes the duplicates.

The resulting `vocab` has a size of 483, which means there are 483 unique words in total in `cleaned_text`.

`word_to_index` is a dictionary that gives each of the 483 words an index number.

You can click the variables explorer to find more details about `word_to_index` and `vocab`.

In [21]:
# Create your vocab here
all_words = []
for review in cleaned_text:
    for word in review:
        all_words.append(word)
vocab = list(set(all_words))  # set() removes duplicate words

# Create word2Vec from the supporting class; the word_to_index dictionary
# is accessed below as word2Vec.word_to_index_dictionary
word2Vec = Word2VecSupport.Word2Vec(vocab)


In [22]:
cleaned_text[0]

['wow', 'loved', 'this', 'place']

In [23]:
word2Vec.make_cbow_data(['the', 'food', 'was', 'not', 'good', 'at', 'all'], 2, word2Vec.word_to_index_dictionary)

[(tensor([109,  37, 115, 332]), tensor([318])),
 (tensor([ 37, 318, 332,  23]), tensor([115])),
 (tensor([318, 115,  23,  72]), tensor([332]))]

In [35]:
word2Vec.word_to_index_dictionary

{'lady': 0,
 'waiter': 1,
 'jeff': 2,
 'fear': 3,
 'only': 4,
 'just': 5,
 'under': 6,
 'thing': 7,
 'seafood': 8,
 'to': 9,
 'indicate': 10,
 'portions': 11,
 'cute': 12,
 'finish': 13,
 'rice': 14,
 'people': 15,
 'got': 16,
 'as': 17,
 'happier': 18,
 'sure': 19,
 'meat': 20,
 'second': 21,
 'cafe': 22,
 'at': 23,
 'dessert': 24,
 'sauce': 25,
 'delight': 26,
 'selection': 27,
 'our': 28,
 'there': 29,
 'final': 30,
 'palate': 31,
 'rib': 32,
 'my': 33,
 'dish': 34,
 'chocolate': 35,
 'and': 36,
 'food': 37,
 'tried': 38,
 'wayyy': 39,
 'greek': 40,
 'melted': 41,
 'cakes': 42,
 'cooked': 43,
 'well': 44,
 'totally': 45,
 'hour': 46,
 'drag': 47,
 'first': 48,
 'had': 49,
 'dressing': 50,
 'pita': 51,
 'time': 52,
 'coming': 53,
 'tables': 54,
 'if': 55,
 'arrived': 56,
 'rightthe': 57,
 'generic': 58,
 'give': 59,
 'grossed': 60,
 'overpriced': 61,
 'cod': 62,
 'bye': 63,
 'prime': 64,
 'judge': 65,
 'also': 66,
 'pasta': 67,
 'crust': 68,
 'gringos': 69,
 'apparently': 70,
 'cheat

Now define your CBOW model class.

In [24]:
# Define your CBOW model here (TODO)
import torch


## Train the CBOW model

Now that your model class is written, you must create an instance of the model and train it using the loss function `torch.nn.CrossEntropyLoss` on the output of `fc2` and `y` (the labels).

Train your CBOW model for 300 epochs with `embed_dim` = 100, `window_size` = 2, and `hidden_dim` = 30.
- Do not split the data into training and test sets (we will not be evaluating the performance of this model). 
- Use the SGD optimizer with learning rate = 0.001.
- Append the loss at every epoch to a list (return the list if you use a function to fit your model), so that we can plot it later. 
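
A training loop following these requirements might look roughly like the sketch below. Several details are assumptions rather than part of the assignment: the compact `CBOW` class concatenates the context embeddings, the loss recorded per epoch is the mean over all training tuples, and `toy_data` with a 7-word vocabulary stands in for the real data so the example runs quickly (with only 5 epochs instead of 300).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CBOW(nn.Module):
    # Compact version for this example; concatenating embeddings is an assumption
    def __init__(self, vocab_size, embed_dim, window_size, hidden_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.fc1 = nn.Linear(2 * window_size * embed_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x):
        embedded = self.embedding(x).view(1, -1)
        return self.fc2(F.relu(self.fc1(embedded)))


def train_cbow(model, data, n_epochs, lr=0.001):
    """Train with SGD and return the mean loss of each epoch as a list."""
    loss_fn = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    losses = []
    for epoch in range(n_epochs):
        total_loss = 0.0
        for X, y in data:
            optimizer.zero_grad()
            logits = model(X)          # shape (1, vocab_size)
            loss = loss_fn(logits, y)  # y has shape (1,)
            loss.backward()
            optimizer.step()
            total_loss += loss.item()
        losses.append(total_loss / len(data))
    return losses


torch.manual_seed(0)

# Toy (context, focal) tuples over a hypothetical 7-word vocabulary
toy_data = [
    (torch.tensor([5, 2, 4, 3]), torch.tensor([6])),
    (torch.tensor([2, 6, 3, 1]), torch.tensor([4])),
    (torch.tensor([6, 4, 1, 0]), torch.tensor([3])),
]
model = CBOW(vocab_size=7, embed_dim=100, window_size=2, hidden_dim=30)
losses = train_cbow(model, toy_data, n_epochs=5)
print(losses)
```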

In [25]:
# Parameters
VOCAB_SIZE = len(vocab)
EMBED_DIM = 100
WINDOW_SIZE = 2
HIDDEN_DIM = 30
N_EPOCHS = 300

# Construct input data
data = word2Vec.createTupleList(cleaned_text, WINDOW_SIZE, word2Vec.word_to_index_dictionary)

# Train your CBOW model here (TODO)



In [32]:
data[:5]

[(tensor([115, 416, 109, 420]), tensor([36])),
 (tensor([416,  36, 420, 318]), tensor([109])),
 (tensor([ 36, 109, 318,   5]), tensor([420])),
 (tensor([109, 420,   5, 244]), tensor([318])),
 (tensor([470, 209, 109, 421]), tensor([353]))]

## Plot losses by epochs (x-axis: epoch, y-axis: loss)

In [26]:
# Insert your code here to plot losses vs epoch (TODO)
import matplotlib.pyplot as plt



## Print the five words most similar to the word "delicious"

The whole point of training an embedding model is to get an embedding vector for each word. The idea is that the vector somehow captures the meaning of the word. This is useful because data scientists often face scenarios where they must derive meaning from unstructured text data.

Once your model has been trained, you can access the embedding vectors through `model.embedding.weight.data`. You can convert these vectors to a numpy matrix or numpy arrays if needed.

Find the five words most similar to "delicious" by calculating the cosine similarity between the embedding vector of "delicious" and the embedding vectors of all other words in the vocabulary.

Hint: cosine similarity is a common metric so you should be able to find one that you can use in an existing library. 
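
As a sketch of the idea, the example below ranks words by cosine similarity using `torch.nn.functional.cosine_similarity` (`sklearn.metrics.pairwise.cosine_similarity` would work equally well). The helper name `most_similar`, the 2-dimensional `toy_embeddings`, and the four-word `toy_word_to_index` are made up for illustration; with a trained model you would pass `model.embedding.weight.data` and your own `word_to_index` instead.

```python
import torch
import torch.nn.functional as F


def most_similar(word, embeddings, word_to_index, top_k=5):
    """Return the top_k (word, similarity) pairs closest to `word`."""
    index_to_word = {index: w for w, index in word_to_index.items()}
    target_index = word_to_index[word]
    target = embeddings[target_index].unsqueeze(0)  # shape (1, embed_dim)
    # Broadcasts the target against every row of the embedding matrix
    similarities = F.cosine_similarity(target, embeddings, dim=1)
    # Exclude the query word itself from the ranking
    similarities[target_index] = float("-inf")
    top = torch.topk(similarities, k=top_k)
    return [
        (index_to_word[i], score)
        for i, score in zip(top.indices.tolist(), top.values.tolist())
    ]


# Hypothetical embeddings, not the output of a trained model
toy_word_to_index = {"delicious": 0, "tasty": 1, "slow": 2, "bland": 3}
toy_embeddings = torch.tensor([
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [-1.0, 0.0],
])

result = most_similar("delicious", toy_embeddings, toy_word_to_index, top_k=2)
print(result)
```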

In [27]:
# Insert your code here (TODO)
import sklearn.metrics


While the model learns embedding vectors (that best predict the focal word from its contexts), the vectors that it learns don't seem to truly capture the meaning of words. However, this is mainly due to the small size of our training data. Google trained a word2vec model based on large-scale data (about 100 billion words), and this model captures similarity between words well. You can find the pretrained model at https://code.google.com/archive/p/word2vec/.