# Generating Word Embeddings - Lab

## Introduction

In this lab, you'll learn how to generate word embeddings by training a Word2Vec model, and then embedding layers into Deep Neural Networks for NLP!

## Objectives

You will be able to:

* Demonstrate a basic understanding of the architecture of the Word2Vec model
* Demonstrate an understanding of the various tunable parameters of word2vec such as vector size and window size

## Getting Started

In this lab, you'll start by creating your own word embeddings by making use of the Word2Vec Model. Then, you'll move onto building Neural Networks that make use of **_Embedding Layers_** to accomplish the same end-goal, but directly in your model. 

As you've seen, the easiest way to make use of Word2Vec is to import it from the [Gensim Library](https://radimrehurek.com/gensim/). This model contains a full implementation of Word2Vec, which you can use to begin training immediately. For this lab, you'll be working with the [News Category Dataset from Kaggle](https://www.kaggle.com/rmisra/news-category-dataset/version/2#_=_).  This dataset contains headlines and article descriptions from the news, as well as categories for which type of article they belong to.

Run the cell below to import everything you'll need for this lab. 

In [1]:
import pandas as pd
import numpy as np
np.random.seed(0)
from gensim.models import Word2Vec
from nltk import word_tokenize
import nltk



Now, import the data. The data stored in the file `'News_Category_Dataset_v2.json'`.  This file is compressed, so that it can be more easily stored in a github repo. **_Make sure to unzip the file before continuing!_**

In the cell below, use the `read_json` function from pandas to read the dataset into a DataFrame. Be sure to include the parameter `lines=True` when reading in the dataset!

Once you've loaded in the data, inspect the head of the DataFrame to see what your data looks like. 

In [2]:
!ls

CONTRIBUTING.md  LICENSE.md			News_Category_Dataset_v2.zip
index.ipynb	 News_Category_Dataset_v2.json	README.md


In [3]:
raw_df = pd.read_json("News_Category_Dataset_v2.json", lines = True)

In [4]:
raw_df.head()

Unnamed: 0,authors,category,date,headline,link,short_description
0,Melissa Jeltsen,CRIME,2018-05-26,There Were 2 Mass Shootings In Texas Last Week...,https://www.huffingtonpost.com/entry/texas-ama...,She left her husband. He killed their children...
1,Andy McDonald,ENTERTAINMENT,2018-05-26,Will Smith Joins Diplo And Nicky Jam For The 2...,https://www.huffingtonpost.com/entry/will-smit...,Of course it has a song.
2,Ron Dicker,ENTERTAINMENT,2018-05-26,Hugh Grant Marries For The First Time At Age 57,https://www.huffingtonpost.com/entry/hugh-gran...,The actor and his longtime girlfriend Anna Ebe...
3,Ron Dicker,ENTERTAINMENT,2018-05-26,Jim Carrey Blasts 'Castrato' Adam Schiff And D...,https://www.huffingtonpost.com/entry/jim-carre...,The actor gives Dems an ass-kicking for not fi...
4,Ron Dicker,ENTERTAINMENT,2018-05-26,Julianna Margulies Uses Donald Trump Poop Bags...,https://www.huffingtonpost.com/entry/julianna-...,"The ""Dietland"" actress said using the bags is ..."


## Preparing the Data

Since you're working with text data, you need to do some basic preprocessing including tokenization. Notice from the data sample that two different columns contain text data--`headline` and `short_description`. The more text data your Word2Vec model has, the better it will perform. Therefore, you'll want to combine the two columns before tokenizing each comment and training your Word2Vec model. 

In the cell below:

* Create a column called `combined_text` that consists of the data from `df.headline` plus a space character (`' '`) plus the data from `df.short_description`.
* Use the `combined_text` column's `map()` function and pass in `word_tokenize`. Store the result returned in `data`.

In [5]:
nltk.download("punkt")

[nltk_data] Downloading package punkt to /home/gentle-
[nltk_data]     sailor-8400/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [6]:
nltk.download("stopwords")

[nltk_data] Downloading package stopwords to /home/gentle-
[nltk_data]     sailor-8400/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [7]:
from nltk.corpus import stopwords
import string
pattern = "([a-zA-Z]+(?:'[a-z]+)?)"
#stop_words_list = stopwords.words("english")
stop_words_list = string.punctuation

def clean(list_): 
    list_ = nltk.regexp_tokenize(list_, pattern)
    list_ = [x.lower() for x in list_]
    list_ = list(filter(lambda w: w not in stop_words_list, list_))
    return list_
    
    

In [8]:
raw_df['combined_text'] = raw_df.headline + " " + raw_df.short_description
data = raw_df['combined_text'].map(word_tokenize)

Inspect the first 5 items in `data` to see how everything looks. 

In [9]:
data[:5]

0    [There, Were, 2, Mass, Shootings, In, Texas, L...
1    [Will, Smith, Joins, Diplo, And, Nicky, Jam, F...
2    [Hugh, Grant, Marries, For, The, First, Time, ...
3    [Jim, Carrey, Blasts, 'Castrato, ', Adam, Schi...
4    [Julianna, Margulies, Uses, Donald, Trump, Poo...
Name: combined_text, dtype: object

Notice that although the words are tokenized, they are still in the same order they were in as headlines. This is important, because the words need to be in their original order for Word2Vec to establish the meaning of them. Remember that for a Word2Vec model you can specify a  **_Window Size_** that tells the model how many words to take into consideration at one time. 

If your window size was 5, then the model would start by looking at the words "Will Smith joins Diplo and", and then slide the window by one, so that it's looking at "Smith joins Diplo and Nicky", and so on, until it had completely processed the text example at index 1 above. By doing this for every piece of text in the entire dataset, the Word2Vec model learns excellent vector representations for each word in an **_Embedding Space_**, where the relationships between vectors capture semantic meaning (recall the vector that captures gender in the previous "king - man + woman = queen" example you saw).

Now that you've prepared the data, train your model and explore a bit!

## Training the Model

Start by instantiating a Word2Vec Model from gensim below. 

In the cell below:

* Create a `Word2Vec` model and pass in the following arguments:
    * The dataset we'll be training on, `data`
    * The size of the word vectors to create, `size=100`
    * The window size, `window=5`
    * The minimum number of times a word needs to appear in order to be counted in  the model, `min_count=1`.
    * The number of threads to use during training, `workers=4`

In [None]:
model = Word2Vec(data, size = 100, window = 5, min_count = 1, workers = 4)

Now, that you've instantiated Word2Vec model, train it on your text data. 

In the cell below:

* Call `model.train()` and pass in the following parameters:
    * The dataset we'll be training on, `data`
    * The `total_examples`  of sentences in the dataset, which you can find in `model.corpus_count`. 
    * The number of `epochs` you want to train for, which we'll set to `10`

In [12]:
model.train(data,total_examples=model.corpus_count,epochs=10)

(48581891, 58846140)

Great! you now have a fully trained model! The word vectors themselves are stored inside of a `Word2VecKeyedVectors` instance, which is stored inside of `model.wv`. To simplify this, restore this object inside of the variable `wv` to save yourself some keystrokes down the line. 

In [13]:
wv = model.wv

## Examining Your Word Vectors

Now that you have a trained Word2Vec model, go ahead and explore the relationships between some of the words in the corpus! 

One cool thing you can use Word2Vec for is to get the most similar words to a given word. You can do this passing in the word to `wv.most_similar()`. 

In the cell below, try getting the most similar word to `'Texas'`.

In [20]:
wv.most_similar("texas")

[('california', 0.7956309914588928),
 ('maryland', 0.7830812335014343),
 ('florida', 0.759600043296814),
 ('ohio', 0.7581177353858948),
 ('louisiana', 0.7564389705657959),
 ('illinois', 0.7384380102157593),
 ('massachusetts', 0.7369791865348816),
 ('arkansas', 0.7245210409164429),
 ('colorado', 0.724008321762085),
 ('maine', 0.7235292196273804)]

Interesting! All of the most similar words are also states. 

You can also get the least similar vectors to a given word by passing in the word to the `most_similar()` function's `negative` parameter. 

In the cell below, get the least similar words to `'Texas'`.

In [21]:
wv.most_similar(negative = "texas")

[('untying', 0.3674625754356384),
 ('gaudiness', 0.33713918924331665),
 ('consumable', 0.30903759598731995),
 ('plebeians', 0.3052848279476166),
 ('ejecting', 0.29790550470352173),
 ('admon', 0.2939297556877136),
 ('endlessly', 0.29351308941841125),
 ("bloom's", 0.28974437713623047),
 ('arrestor', 0.286624550819397),
 ('wounds', 0.2850342392921448)]

This seems like random noise. It is a result of the way Word2Vec is computing the similarity between word vectors in the embedding space. Although the word vectors closest to a given word vector are almost certainly going to have similar meaning or connotation with your given word, the word vectors that the model considers 'least similar' are just the word vectors that are farthest away, or have the lowest cosine similarity. It's important to understand that while the closest vectors in the embedding space will almost certainly share some level of semantic meaning with a given word, there is no guarantee that this relationship will hold at large distances. 

You can also get the vector for a given word by passing in the word as if you were passing in a key to a dictionary. 

In the cell below, get the word vector for `'Texas'`.

In [23]:
wv["texas"]

array([ 2.0090384 ,  0.75508404,  1.8611152 , -0.35667685, -1.4326009 ,
       -2.2556093 ,  1.7592149 ,  0.5077297 ,  0.11268356,  1.0729005 ,
       -1.7922043 , -1.3332956 ,  0.1303678 ,  0.12776257,  0.98619777,
        2.4032702 , -0.6372821 ,  0.63103986, -1.0218865 ,  0.48107   ,
       -0.38277677, -0.10784437,  0.82842827, -0.35300058, -1.0195578 ,
       -1.5294015 ,  1.193883  ,  0.21973768,  3.2263632 , -1.4690269 ,
       -0.7964715 ,  1.3089036 ,  0.1809012 ,  0.69344604, -2.1082573 ,
       -1.1487836 ,  0.787975  ,  2.016956  ,  0.80880976, -1.3031173 ,
        0.8145309 ,  0.10848451,  0.3177543 ,  0.03470489,  0.74653053,
        0.45288604,  0.6208327 ,  1.8511808 , -0.08500962, -0.42039064,
        0.01490376, -0.7793061 ,  0.16366908,  0.07250556,  0.48366475,
        2.912557  , -0.2525233 , -1.1527581 ,  0.8865502 , -0.20991236,
        0.8834063 ,  1.0445212 ,  0.42663732,  1.1076812 ,  0.49889156,
        1.4466814 ,  0.7606143 ,  0.18570831, -1.3340117 ,  2.32

Now get all of the word vectors from the object at once. You can find these inside of `wv.vectors`. Try it out in the cell below.  

In [26]:
wv.vectors

array([[-4.7077924e-01,  1.1556065e+00, -1.3402198e-02, ...,
         1.7398964e-01, -2.0110080e+00,  9.0257162e-01],
       [-1.0383769e+00, -3.7016425e+00,  3.7340254e-01, ...,
        -5.3157401e-01, -1.3925468e+00,  8.3121645e-01],
       [ 4.8119392e-02,  2.3471232e-01,  2.5955519e-01, ...,
        -4.9096093e-01, -2.8872790e+00,  4.5041013e-01],
       ...,
       [-1.7314790e-02,  3.3184584e-02, -1.3293274e-01, ...,
         1.1087551e-02, -2.8950747e-02,  1.3506210e-02],
       [-5.8289520e-03,  3.8061773e-03, -7.1094491e-02, ...,
         7.0535250e-02, -5.0932974e-02,  7.4596122e-02],
       [ 1.7670708e-02, -3.0869486e-02, -1.7925272e-02, ...,
         3.3065942e-04, -5.3495005e-02, -1.4450377e-02]], dtype=float32)

As a final exercise, try to recreate the _'king' - 'man' + 'woman' = 'queen'_ example previously mentioned. You can do this by using the `most_similar` function and translating the word analogies into an addition/subtraction formulation (as shown above). Pass the original comparison, which you are calculating a difference between, to the negative parameter, and the analogous starter you want to apply the same transformation to, to the `positive` parameter.

Do this now in the cell below. 

In [32]:
wv.most_similar(positive = ["king","woman"], negative = ["man"])

[("hrer's", 0.5673308372497559),
 ('ck', 0.5487414598464966),
 ('hrer', 0.533789873123169),
 ('ltskog', 0.5326483249664307),
 ('cking', 0.5219861268997192),
 ('kface', 0.5002694725990295),
 ('nationale', 0.4922553300857544),
 ('agnetha', 0.4905812442302704),
 ('ckable', 0.4851534366607666),
 ('gloria', 0.47879135608673096)]

As you can see from the output above, your model isn't perfect, but 'Queen' is still in the top 3, and with 'Princess' not too far behind. As you can see from the word in first place, 'reminiscent', your model is far from perfect. This is likely because you didn't have enough training data. That said, given the small amount of training data provided, the model still performs remarkably well! 

In the next lab, you'll reinvestigate transfer learning, loading in the weights from an open-sourced model that has already been trained for a very long time on a massive amount of data. Specifically, you'll work with the GloVe model from the Stanford NLP Group. There's not really any benefit from training the model ourselves, unless your text uses different, specialized vocabulary that isn't likely to be well represented inside an open-source model.

## Summary

In this lab, you learned how to train and use a Word2Vec model to created vectorized word embeddings!