
# Generating Word Embeddings - Lab

## Introduction

In this lab, you'll learn how to generate word embeddings by training a Word2Vec model, and then how to use embedding layers in deep neural networks for NLP!

## Objectives

You will be able to:

- Train a Word2Vec model and transform words into vectors
- Obtain most similar words by using methods associated with word vectors


## Getting Started

In this lab, you'll start by creating your own word embeddings by making use of the Word2Vec model. Then, you'll move on to building neural networks that make use of **_Embedding Layers_** to accomplish the same end-goal, but directly in your model.

As you've seen, the easiest way to make use of Word2Vec is to import it from the [Gensim Library](https://radimrehurek.com/gensim/). This library contains a full implementation of Word2Vec, which you can use to begin training immediately. For this lab, you'll be working with the [News Category Dataset from Kaggle](https://www.kaggle.com/rmisra/news-category-dataset/version/2#_=_). This dataset contains headlines and short descriptions of news articles, as well as the category each article belongs to.

Run the cell below to import everything you'll need for this lab.

In [5]:
import pandas as pd
import numpy as np
np.random.seed(0)
from gensim.models import Word2Vec
import nltk
from nltk import word_tokenize

Now, import the data. The data is stored in the file `'News_Category_Dataset_v2.json'`. This file is compressed so that it can be stored more easily in a GitHub repo. **_Make sure to unzip the file before continuing!_**

In the cell below, use the `read_json()` function from Pandas to read the dataset into a DataFrame. Be sure to include the parameter `lines=True` when reading in the dataset!

Once you've imported the data, inspect the first few rows of the DataFrame to see what your data looks like.
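
As a minimal sketch of what `lines=True` does, the example below reads a tiny in-memory JSON Lines document, where each line is one standalone JSON record. The two records and the name `df_example` are made up for illustration and are not part of the real dataset.

```python
import io
import pandas as pd

# Two hypothetical records, one JSON object per line (JSON Lines format)
json_lines = io.StringIO(
    '{"category": "CRIME", "headline": "First headline", "short_description": "First description."}\n'
    '{"category": "ENTERTAINMENT", "headline": "Second headline", "short_description": "Second description."}\n'
)

# lines=True tells pandas to parse each line as a separate record
df_example = pd.read_json(json_lines, lines=True)

print(df_example.shape)
print(df_example.columns.tolist())
```

Without `lines=True`, pandas would try to parse the whole input as a single JSON document and fail on a file laid out this way.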

In [3]:
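# This cell reads a CSV copy of the dataset; for the JSON file described above,
# use pd.read_json('News_Category_Dataset_v2.json', lines=True) instead
df = pd.read_csv('News_Category_Dataset_v2.csv')
df.head()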

| Unnamed: 0 | authors | category | date | headline | link | short_description |
|---|---|---|---|---|---|---|
| 0 | Melissa Jeltsen | CRIME | 2018-05-26 | There Were 2 Mass Shootings In Texas Last Week... | https://www.huffingtonpost.com/entry/texas-ama... | She left her husband. He killed their children... |
| 1 | Andy McDonald | ENTERTAINMENT | 2018-05-26 | Will Smith Joins Diplo And Nicky Jam For The 2... | https://www.huffingtonpost.com/entry/will-smit... | Of course it has a song. |
| 2 | Ron Dicker | ENTERTAINMENT | 2018-05-26 | Hugh Grant Marries For The First Time At Age 57 | https://www.huffingtonpost.com/entry/hugh-gran... | The actor and his longtime girlfriend Anna Ebe... |
| 3 | Ron Dicker | ENTERTAINMENT | 2018-05-26 | Jim Carrey Blasts 'Castrato' Adam Schiff And D... | https://www.huffingtonpost.com/entry/jim-carre... | The actor gives Dems an ass-kicking for not fi... |
| 4 | Ron Dicker | ENTERTAINMENT | 2018-05-26 | Julianna Margulies Uses Donald Trump Poop Bags... | https://www.huffingtonpost.com/entry/julianna-... | The "Dietland" actress said using the bags is ... |


## Preparing the Data

Since you're working with text data, you need to do some basic preprocessing, including tokenization. Notice from the data sample that two different columns contain text data: `headline` and `short_description`. A Word2Vec model generally performs better the more text it has to learn from. Therefore, you'll want to combine the two columns before tokenizing each entry and training your Word2Vec model.

In the cell below:

* Create a column called `'combined_text'` that consists of the data from the `'headline'` column plus a space character (`' '`) plus the data from the `'short_description'` column
* Use the `'combined_text'` column's `.map()` method and pass in `word_tokenize`. Store the result returned in `data`

In [8]:
# Download the Punkt tokenizer data that word_tokenize relies on
nltk.download('punkt')

# Combine the headline and the description into a single text column
df['combined_text'] = df['headline'] + ' ' + df['short_description']

# Cast to str so that rows with missing text don't break the tokenizer
df['combined_text'] = df['combined_text'].astype(str)

# Tokenize the combined text and store the result in `data`
data = df['combined_text'].map(word_tokenize)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
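
One subtlety in the preprocessing cell above: if either text column has a missing value, the concatenation yields a missing value for the whole row, and depending on the pandas version, casting to `str` can turn that into the literal text `nan`. The sketch below uses a made-up two-row DataFrame (`df_toy` is a hypothetical name) to show the effect and one possible alternative, filling missing text with an empty string first.

```python
import pandas as pd

# Hypothetical data: the second row has no short description
df_toy = pd.DataFrame({
    'headline': ['Markets rally', 'Storm hits coast'],
    'short_description': ['Stocks close higher.', None],
})

# Concatenating with a missing value propagates the missing value
naive = df_toy['headline'] + ' ' + df_toy['short_description']
print(naive.isna().tolist())

# Filling missing text with '' first keeps the headline intact;
# .str.strip() removes the trailing space left by the empty description
combined = (
    df_toy['headline'].fillna('') + ' ' + df_toy['short_description'].fillna('')
).str.strip()
print(combined.tolist())
```

Whether this matters for the lab depends on how many rows in the real dataset lack a headline or description, which is worth checking with `df.isna().sum()`.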


Inspect the first 5 items in `data` to see how everything looks.

In [9]:
data[:5]

0    [There, Were, 2, Mass, Shootings, In, Texas, L...
1    [Will, Smith, Joins, Diplo, And, Nicky, Jam, F...
2    [Hugh, Grant, Marries, For, The, First, Time, ...
3    [Jim, Carrey, Blasts, 'Castrato, ', Adam, Schi...
4    [Julianna, Margulies, Uses, Donald, Trump, Poo...
Name: combined_text, dtype: object

Notice that although the words are tokenized, they are still in the same order as in the original text. This is important, because Word2Vec learns the meaning of a word from the words that surround it. Remember that for a Word2Vec model you can specify a **_Window Size_** that tells the model how many neighboring words to take into consideration at one time.

As a simplified picture, if your window size was 5, the model would start by looking at the words "Will Smith Joins Diplo And", then slide the window by one so that it's looking at "Smith Joins Diplo And Nicky", and so on, until it had processed the whole text example at index 1 above. (In Gensim, `window` is the maximum distance between the current word and a context word, so `window=5` reaches up to five words on either side.) By doing this for every piece of text in the dataset, the Word2Vec model learns a vector representation for each word in an **_Embedding Space_**, where the relationships between vectors capture semantic meaning (recall the vector that captures gender in the "king - man + woman = queen" example you saw previously).
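
To make the sliding window concrete, here is a minimal sketch that lists the (target, context) pairs a window would produce for one headline. The function `context_pairs` is a hypothetical helper written for this illustration, not part of Gensim, and it ignores details of real implementations such as randomly shrinking the window during training.

```python
def context_pairs(tokens, window):
    """Return (target, context) pairs for every word within `window` positions."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

tokens = ['Will', 'Smith', 'Joins', 'Diplo', 'And', 'Nicky', 'Jam']
pairs = context_pairs(tokens, window=2)

# Context words seen for the target 'Joins' with a window of 2
print([context for target, context in pairs if target == 'Joins'])
```

With `window=2`, the target `'Joins'` is paired with the two words before it and the two words after it. A larger window pulls in more distant words, which tends to capture broader topical similarity.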

Now that you've prepared the data, train your model and explore a bit!

## Training the Model

Start by instantiating a Word2Vec Model from `gensim`.

In the cell below:

* Create a `Word2Vec` model and pass in the following arguments:
    * The dataset we'll be training on, `data`
    * The size of the word vectors to create, `vector_size=100` (this parameter was named `size` in Gensim versions before 4.0)
    * The window size, `window=5`
    * The minimum number of times a word needs to appear in order to be counted in the model, `min_count=1`
    * The number of threads to use during training, `workers=4`

In [11]:
model = Word2Vec(data, vector_size=100, window=5, min_count=1, workers=4)

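Now that you've instantiated a Word2Vec model, train it on your text data. Note that passing `data` to the constructor already builds the vocabulary and runs an initial round of training, so the `.train()` call below continues training for additional epochs.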

In the cell below:

* Call the `.train()` method on your model and pass in the following parameters:
    * The dataset we'll be training on, `data`
    * The `total_examples`, the number of sentences in the dataset, which you can find in `model.corpus_count`
    * The number of `epochs` you want to train for, which we'll set to `10`

In [13]:
model.train(data, total_examples=model.corpus_count, epochs=10)



(53948506, 65591130)

Great! You now have a fully trained model! The word vectors themselves are stored in a `KeyedVectors` instance (named `Word2VecKeyedVectors` in Gensim versions before 4.0), which is available through the `.wv` attribute. To simplify this, store this object in the variable `wv` to save yourself some keystrokes down the line.

In [14]:
wv = model.wv

## Examining Your Word Vectors

Now that you have a trained Word2Vec model, go ahead and explore the relationships between some of the words in the corpus!

One cool thing you can use Word2Vec for is to get the most similar words to a given word. You can do this by passing in the word to `wv.most_similar()`.

In the cell below, try getting the most similar words to `'Texas'`.

In [15]:
wv.most_similar('Texas')

[('Maryland', 0.8139601349830627),
 ('Pennsylvania', 0.8036527037620544),
 ('Louisiana', 0.8002697825431824),
 ('Arkansas', 0.795661211013794),
 ('Oregon', 0.7946459650993347),
 ('Georgia', 0.7909244894981384),
 ('Ohio', 0.7888235449790955),
 ('California', 0.7886297106742859),
 ('Utah', 0.7870084643363953),
 ('Florida', 0.7796040177345276)]

Interesting! All of the most similar words are also states.

You can also get the least similar vectors to a given word by passing in the word to the `.most_similar()` method's `negative` parameter.

In the cell below, get the least similar words to `'Texas'`.

In [16]:
wv.most_similar(negative=['Texas'])

[('dreams…', 0.4211546778678894),
 ('barbers', 0.42086607217788696),
 ('book-selling', 0.4177745282649994),
 ('bean-to-bar', 0.41288691759109497),
 ('Butchering', 0.409580796957016),
 ('heave-ho', 0.40630730986595154),
 ('clothe', 0.4032965898513794),
 ('squelches', 0.3997737765312195),
 ('crave', 0.3976283669471741),
 ('savor', 0.3910464644432068)]

This looks like random noise, and it largely is. Word2Vec measures similarity as the cosine similarity between word vectors in the embedding space. The vectors closest to a given word vector tend to share meaning or connotation with that word, but the 'least similar' words are simply the ones whose vectors are farthest away, that is, those with the lowest cosine similarity. Nothing guarantees that semantic relationships hold at large distances.
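
As a quick illustration of the cosine similarity measure mentioned above, the sketch below computes it by hand with NumPy for a few made-up 2-dimensional vectors. The function `cosine_similarity` and the vectors are hypothetical and chosen so the results are easy to verify.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot product over the product of norms."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 0.0])
same_direction = np.array([2.0, 0.0])   # longer, but points the same way
orthogonal = np.array([0.0, 3.0])       # at a right angle to a
opposite = np.array([-1.0, 0.0])        # points the opposite way

print(cosine_similarity(a, same_direction))  # 1.0
print(cosine_similarity(a, orthogonal))      # 0.0
print(cosine_similarity(a, opposite))        # -1.0
```

Cosine similarity depends only on direction, not on vector length, which is why the first pair scores 1.0 even though the vectors have different magnitudes.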

You can also get the vector for a given word by passing in the word as if you were passing in a key to a dictionary.

In the cell below, get the word vector for `'Texas'`.

In [18]:
wv['Texas']

array([ 0.22217141,  0.87112224,  0.7391627 ,  0.515254  ,  0.30281776,
        1.7726098 ,  0.8205501 ,  0.6442181 , -0.80416566, -0.51955104,
       -0.86031854, -0.8204836 ,  1.6551604 , -1.933038  , -1.3417718 ,
       -2.1606588 ,  0.09147494, -0.7334369 , -0.6124126 , -0.5225271 ,
        0.6346694 , -0.22254959,  0.02006735, -1.0744045 ,  1.4458474 ,
       -0.40388706,  0.9141012 ,  0.54518735,  0.4294047 ,  0.39026847,
        1.4258251 , -1.0647564 , -2.9693897 , -1.6051122 , -1.2095877 ,
       -0.22459689,  2.412939  , -1.7531189 ,  1.3400723 , -0.6921844 ,
        0.85666686, -0.11849586, -0.6490497 , -2.001796  , -2.221216  ,
       -0.8083835 , -3.0974617 ,  1.1626399 , -0.68072516,  1.4913346 ,
       -0.13961944, -0.86258024,  0.513815  , -2.067179  , -2.698512  ,
       -0.54548687, -0.27285355,  1.6630605 , -0.264176  ,  0.5397529 ,
        1.5965425 , -0.55283374,  0.62561196, -1.7701678 , -0.17797868,
        2.649918  ,  0.01349298,  1.7463522 , -1.5964756 ,  1.79

Now get all of the word vectors from the object at once. You can find these inside of `wv.vectors`. Try it out in the cell below.  

In [19]:
wv.vectors

array([[ 4.2542290e-02, -2.1577649e-01, -1.4970371e+00, ...,
        -2.0171845e+00, -6.9785565e-01,  2.4467449e+00],
       [-1.2703447e+00, -4.9568641e-01, -1.6766056e+00, ...,
        -2.8246958e+00, -1.9154435e+00,  1.3611634e+00],
       [-8.7466615e-01, -7.9602563e-01, -1.3294630e+00, ...,
        -5.7183206e-01,  2.4109747e+00,  5.0759214e-01],
       ...,
       [ 5.4313704e-02, -2.6301863e-02,  4.3602679e-02, ...,
        -5.2183498e-02, -4.3695364e-02, -1.1918947e-02],
       [-2.2647740e-02,  1.4364987e-02, -4.1366119e-02, ...,
        -1.2917069e-01,  4.4304941e-02,  4.1366827e-02],
       [-1.7938606e-02,  3.6227588e-02,  3.9658576e-02, ...,
        -1.7293084e-02, -2.6459325e-02, -1.2370693e-03]], dtype=float32)
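
The matrix above has one row per vocabulary word and one column per embedding dimension. The sketch below mimics that layout with a made-up three-word vocabulary to show how a word is mapped to its row. The names mirror the `index_to_key` and `key_to_index` attributes of Gensim 4 `KeyedVectors`, but everything here is plain NumPy and the values are placeholders.

```python
import numpy as np

# Hypothetical three-word vocabulary with 4-dimensional vectors
index_to_key = ['the', 'Texas', 'news']
key_to_index = {word: i for i, word in enumerate(index_to_key)}
vectors = np.arange(12, dtype=np.float32).reshape(3, 4)

def lookup(word):
    """Return the row of the matrix that belongs to `word`."""
    return vectors[key_to_index[word]]

texas_vector = lookup('Texas')
print(vectors.shape)   # (3, 4)
print(texas_vector)    # [4. 5. 6. 7.]
```

In the real model, the same idea means `wv.vectors.shape` should be the vocabulary size by the `vector_size` chosen at training time.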

As a final exercise, try to recreate the _'king' - 'man' + 'woman' = 'queen'_ example previously mentioned. You can do this by using the `.most_similar()` method and translating the word analogy into an addition/subtraction formulation (as shown above). Pass the words being added (`'king'` and `'woman'`) to the `positive` parameter, and the word being subtracted (`'man'`) to the `negative` parameter.

Do this now in the cell below.

In [20]:
wv.most_similar(positive=['woman', 'king'], negative=['man'])

[("'Scooby-Doo", 0.5563579201698303),
 ('regents', 0.527040421962738),
 ('stinkin', 0.525495707988739),
 ('y', 0.5254740715026855),
 ('80s', 0.5180624723434448),
 ('mon', 0.5107259154319763),
 ('Katwe', 0.507721483707428),
 ('Orcia', 0.5039877891540527),
 ('nina', 0.5038350224494934),
 ('Livin', 0.5032905340194702)]

As you can see from the output above, the model did not recover 'queen' here: the top results, such as "'Scooby-Doo", 'regents', and 'stinkin', have no obvious connection to the analogy. Your model is far from perfect, most likely because it didn't have enough training data. That said, given the small amount of training data provided, it still found sensible neighbors for a word like 'Texas'.

In the next lab, you'll revisit transfer learning by loading in the weights from an open-source model that has already been trained for a very long time on a massive amount of data. Specifically, you'll work with the GloVe model from the Stanford NLP Group. There's usually little benefit to training the model yourself, unless your text uses specialized vocabulary that isn't likely to be well represented in an open-source model.
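
To see the arithmetic behind an analogy query, here is a minimal sketch using hand-built 2-dimensional vectors in which one axis loosely stands for royalty and the other for gender. The vectors, the `analogy` helper, and the extra word `'apple'` are all made up for illustration. Gensim's `most_similar()` additionally normalizes the vectors before combining them, which this simplified version skips.

```python
import numpy as np

# Hypothetical embeddings: [royalty, gender]
toy_vectors = {
    'king':  np.array([1.0, 1.0]),
    'queen': np.array([1.0, -1.0]),
    'man':   np.array([0.0, 1.0]),
    'woman': np.array([0.0, -1.0]),
    'apple': np.array([-1.0, 0.2]),
}

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def analogy(positive, negative, vectors):
    """Add the positive vectors, subtract the negative ones, return the nearest other word."""
    query = sum(vectors[w] for w in positive) - sum(vectors[w] for w in negative)
    # Exclude the query words themselves from the candidates
    candidates = [w for w in vectors if w not in positive and w not in negative]
    return max(candidates, key=lambda w: cosine(query, vectors[w]))

result = analogy(positive=['king', 'woman'], negative=['man'], vectors=toy_vectors)
print(result)  # queen
```

Here the analogy works because the toy vectors were constructed so that the gender offset is identical for both word pairs. A trained model only approximates that structure, and only when it has seen enough examples of each word.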

## Summary

In this lab, you learned how to train and use a Word2Vec model to create vectorized word embeddings!