## Word Embeddings in Action - Word2Vec

 - Word embeddings are a useful way of converting text into a numerical format that a model can work with while keeping its semantic meaning intact.
 - Now that you have learnt the theory behind Word2Vec, in this notebook you will learn how to apply it to a real-world data set.
 
![](https://www.tensorflow.org/images/linear-relationships.png)

In [1]:
import pandas as pd
import numpy as np

# Load the dataset
bbc_news = pd.read_csv('../datasets/bbc_news_mixed.csv')
bbc_news.head()

| Unnamed: 0 | text | label |
|---|---|---|
| 0 | Cairn shares slump on oil setback\n\nShares in... | business |
| 1 | Egypt to sell off state-owned bank\n\nThe Egyp... | business |
| 2 | Cairn shares up on new oil find\n\nShares in C... | business |
| 3 | Low-cost airlines hit Eurotunnel\n\nChannel Tu... | business |
| 4 | Parmalat to return to stockmarket\n\nParmalat,... | business |


The dataset that you are going to use is a collection of news articles from the BBC across 5 major categories, namely:
 
 - Business
 - Entertainment
 - Politics
 - Sport
 - Tech

In [71]:
# print first 3 articles
for art in bbc_news.text[:3]:
    print(art[:200])

Cairn shares slump on oil setback


The company said tests ha
Egypt to sell off state-owned bank

The Egyptian government is reportedly planning to privatise one of the country's big public banks.

An Investment Ministry official has told the Reuters news agency


Now that you have an idea of what your data looks like, let's see the count of each category in the dataset!

In [3]:
# category-wise count
bbc_news.label.value_counts()

sport            511
business         510
politics         417
tech             401
entertainment    386
Name: label, dtype: int64

### Using Google's pre-trained Word2Vec

Google has released a pre-trained Word2Vec model that was trained on part of the **Google News data set (about 100 billion words)** and contains 300-dimensional vectors for **3 million words and phrases**. You can __download__ the word2vec embeddings from this [link](https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?usp=sharing).

__Installation__

 - Make sure you download the file into the same folder as this Jupyter notebook.
 
 - Once the download has finished, decompress the file by running the following command in your terminal:
 
 `gzip -d GoogleNews-vectors-negative300.bin.gz`
 - Now that you have the model downloaded and ready, let's see how to import it in Gensim.

In [15]:
from gensim.models import KeyedVectors

# path of the downloaded model
filename = 'GoogleNews-vectors-negative300.bin'
# load into gensim
w2vec = KeyedVectors.load_word2vec_format(filename, binary=True)

Once you have executed the above code, the word2vec model is loaded in gensim. Let's explore some of the features of this model.

__Contextual Relationship Between Words__

 - One of the impressive things about word2vec is its ability to capture semantic relationships between words. That is why you can do things like perform vector arithmetic on words and get a sensible output. Have a look at the following example:

    `airplane - fly + drive = car`

 - If you pass the left-hand side of the above equation to the model, it returns the right-hand side. This makes sense: what would you get if you removed the ability to fly from an airplane and added the ability to drive? You would get a car!
 - The embeddings capture the implicit relationship between airplane and fly, and how swapping the mode of travel changes the vehicle used to travel.

In [20]:
# airplane - fly + drive
w2vec.most_similar(positive=['airplane', 'drive'], negative=['fly'], topn=5)

[('car', 0.511200487613678),
 ('drives', 0.47777241468429565),
 ('automobile', 0.45616623759269714),
 ('vehicle', 0.44856154918670654),
 ('SUV', 0.44360122084617615)]
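
To make the arithmetic concrete, here is a minimal sketch of the idea behind `most_similar`: normalize each input vector, add the positive ones and subtract the negative ones, then rank the remaining words by cosine similarity to the result. The three-dimensional vectors below are invented toy values, not real Word2Vec embeddings, and `toy_most_similar` is a hypothetical helper written in plain Python so that it runs without the downloaded model. gensim's own implementation differs in its details.

```python
import math

# Invented toy vectors (NOT real Word2Vec values); loosely, the axes stand for
# "royalty", "gender" and "something unrelated".
toy_vectors = {
    "king":   [0.9,  0.8, 0.1],
    "queen":  [0.9, -0.8, 0.1],
    "man":    [0.1,  0.9, 0.0],
    "woman":  [0.1, -0.9, 0.0],
    "prince": [0.8,  0.7, -0.3],
    "apple":  [0.0,  0.0, 1.0],
}

def unit(vec):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(unit(a), unit(b)))

def toy_most_similar(vectors, positive, negative, topn=3):
    """Hypothetical helper: rank words by cosine similarity to
    sum(positive) - sum(negative), skipping the input words."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word in positive:
        query = [q + x for q, x in zip(query, unit(vectors[word]))]
    for word in negative:
        query = [q - x for q, x in zip(query, unit(vectors[word]))]
    inputs = set(positive) | set(negative)
    scores = [(w, cosine(query, v)) for w, v in vectors.items() if w not in inputs]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# king - man + woman
result = toy_most_similar(toy_vectors, positive=["king", "woman"], negative=["man"])
print(result)
```

With these toy values the top-ranked word is `queen`: subtracting `man` and adding `woman` flips the "gender" axis of `king` while leaving the "royalty" axis largely untouched.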

Here are a few other examples:

1. `king - man + woman = queen`
Removing man from king and adding woman gives queen.
2. `moscow - russia + japan = tokyo`
Removing russia from moscow (the capital of Russia) and adding japan gives tokyo (the capital of Japan).

In [33]:
# king - man + woman
print(w2vec.most_similar(positive=['king', 'woman'], negative=['man'], topn=5))
# moscow - russia + japan
print(w2vec.most_similar(positive=['moscow', 'japan'], negative=['russia'], topn=5))

[('queen', 0.7118192911148071), ('monarch', 0.6189674139022827), ('princess', 0.5902431607246399), ('crown_prince', 0.5499460697174072), ('prince', 0.5377321243286133)]
[('tokyo', 0.49488699436187744), ('hawaii', 0.4802300035953522), ('malta', 0.4749153256416321), ('japanese', 0.45723214745521545), ('seattle', 0.4502826929092407)]


### Text Classification using Word2Vec

Let's now get back to our BBC News data set. We will classify the articles into their categories using Word2Vec vectors as text features. However, Word2Vec gives a vector representation for individual words. To get one for a sentence or a document, you can take the mean of the vectors of its constituent words.

This is what we will do in `get_embedding_w2v()`. It iterates through all the tokens in a document and collects the vector of each token that is present in the vocabulary of the word2vec model.


In [62]:
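# returns the vector representation of a tokenized document
def get_embedding_w2v(doc_tokens, pre_trained=True):
    # pre_trained is kept for compatibility; only the pre-trained w2vec model is used here
    embeddings = []
    for tok in doc_tokens:
        # skip tokens that are missing from the model's vocabulary
        if tok in w2vec:
            embeddings.append(w2vec[tok])
    # no known tokens: fall back to a zero vector
    if not embeddings:
        return np.zeros(w2vec.vector_size)
    # average the word vectors to get the vector of the entire text
    return np.mean(embeddings, axis=0)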

<br>

Now we will follow the steps below:

 - Generate a vector representation for each document. Note that here `pre_trained=1` should be used.
 - Create the X data set and split it into train and test sets.
 - Train a text classification (Naive Bayes) model and compute its accuracy.
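
The first step expects each article as a list of tokens, but the cell that builds `preprocessed_bbc` is not shown in this section. A minimal sketch of one possible preprocessing step follows. `simple_tokenize` and its regular expression are assumptions for illustration, not the preprocessing that produced the results in this notebook.

```python
import re

# Hypothetical tokenizer: keeps runs of letters (and inner apostrophes), drops
# digits and punctuation. The lowercase flag is an assumption; the Google News
# vectors are case-sensitive, so lowercasing changes which tokens are found.
TOKEN_PATTERN = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?")

def simple_tokenize(text, lowercase=True):
    """Split raw text into a list of word tokens."""
    if lowercase:
        text = text.lower()
    return TOKEN_PATTERN.findall(text)

# A headline taken from the dataset preview above
sample = "Egypt to sell off state-owned bank"
tokens = simple_tokenize(sample)
print(tokens)
```

Under this assumption, `preprocessed_bbc = bbc_news.text.apply(simple_tokenize)` would give a pandas Series of token lists that `get_embedding_w2v()` can consume.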

In [68]:
# utility for splitting the data
from sklearn.model_selection import train_test_split

# create X from w2vec (preprocessed_bbc is expected to hold the tokenized articles)
X_w2v = preprocessed_bbc.apply(lambda x: get_embedding_w2v(x, pre_trained=1))
X_w2v = pd.DataFrame(X_w2v.tolist())
print('X shape:', X_w2v.shape)

# split into train and test
y = bbc_news.label
X_train_wv, X_test_wv, y_train_wv, y_test_wv = train_test_split(X_w2v, y, test_size=0.2, random_state=42)

X shape: (2225, 300)


In [70]:
# build a text classification model
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Initialize GaussianNB classifier
model = GaussianNB()
# Fit the model on the train dataset
model = model.fit(X_train_wv, y_train_wv)
# Make predictions on the test dataset
pred = model.predict(X_test_wv)

# check the accuracy of the model
print("Accuracy:", accuracy_score(y_test_wv, pred)*100, "%")

Accuracy: 92.80898876404494 %


The accuracy of the model using Google's pre-trained embeddings is impressive (92.8%). One likely reason is that these embeddings were trained on a very large corpus and cover a vocabulary of 3 million words and phrases. The richer the training data and vocabulary, the better the vectors tend to capture the meaning of a word.
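
One way to check how much the vocabulary is actually helping is to measure what fraction of the corpus's tokens the model can embed, since out-of-vocabulary tokens are skipped when averaging. The sketch below is an illustrative assumption, not part of the original notebook: `vocab_coverage` is a hypothetical helper, and the small `set` stands in for the loaded model, which also supports membership tests with `in`.

```python
def vocab_coverage(docs_tokens, vocab):
    """Return the fraction of tokens (counted with repeats) found in vocab.

    docs_tokens: iterable of token lists, one per document.
    vocab: any container supporting `in`.
    """
    total = 0
    known = 0
    for tokens in docs_tokens:
        for tok in tokens:
            total += 1
            if tok in vocab:
                known += 1
    # avoid dividing by zero when there are no tokens at all
    return known / total if total else 0.0

# Toy stand-ins for the tokenized corpus and the model vocabulary
toy_docs = [["oil", "shares", "slump"], ["bank", "privatise", "zzzz"]]
toy_vocab = {"oil", "shares", "slump", "bank", "privatise"}
coverage = vocab_coverage(toy_docs, toy_vocab)
print(f"coverage: {coverage:.2%}")
```

On the real data, a call such as `vocab_coverage(preprocessed_bbc, w2vec)` would report how many article tokens contribute to the document vectors. A low value would suggest revisiting the preprocessing, for example the choice to lowercase.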