## Part 3 - Word vectors

So far we've seen simple feature selection methods, a statistical feature selection approach, and dimensionality reduction techniques such as PCA and SVD. In the last few years, however, with the rise in popularity of neural networks, a new technique has become the state of the art for representing words in NLP tasks.

This technique is commonly referred to as **word vectors** or **word embeddings**, and the idea behind it is simple. We define a vocabulary and assign each word in it a vector with a fixed number of dimensions. The values of all these vectors are then learned through the use of **neural networks**. In essence, word embeddings try to capture information about a word's meaning and usage. This not only lets us significantly reduce the number of features fed to our models, but also gives us meaningful representations that are easy to reuse across datasets and transferable among tasks. 

Pretty cool, huh?

<img src="media/what-year-is-this.jpg" width="400">




### Table of contents
[1. Word vectors explained](#1.-Word-vectors-explained)   
&emsp;[1.1 Training word vectors](#1.1-Training-word-vectors)   
&emsp;[1.2 Pretrained word vectors](#1.2-Pretrained-word-vectors)   
[2. Word representations in Spacy](#2.-Word-representations-in-Spacy)   
[3. Cosine similarity](#3.-Cosine-similarity)   
[4. Word relations](#4.-Word-relations)   
[5. Applying word vectors to sentences](#5.-Applying-word-vectors-to-sentences)   
[6. NLP practical example](#6.-NLP-practical-example)   
[7. Final remarks](#7.-Final-remarks)

In [4]:
import spacy

import numpy as np
from numpy import dot
from numpy.linalg import norm
import pandas as pd
import re
import nltk
from nltk.tokenize import WordPunctTokenizer

from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier

import warnings
warnings.filterwarnings('ignore')

seed=42

## 1. Word vectors explained

First of all, by now you could be thinking: "But wait Doc, didn't I get a bunch of vectors before also?". Why yes, yes you did, Marty. You could consider the columns of the matrix with document-term counts a possible word vector representation. You could even construct a simpler matrix. 

If you assume a vocabulary of size V, and each word having an index in this vocabulary, a natural representation would be what is called a one-hot encoding, where each word is represented by a vector of size V - the vocabulary size - with the single component corresponding to this word set to 1, and the remaining fields zeroed out.
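
As a minimal sketch of this encoding, assuming a made-up four-word vocabulary (the names `vocab`, `word_to_index`, `one_hot` and `dot` are only illustrative), we can build the vectors by hand. Note that the dot product between two different words is always 0, so this representation carries no notion of similarity:

```python
# Toy vocabulary, assumed purely for illustration
vocab = ["back", "future", "the", "to"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a vector of size V with a single 1 at the word's index."""
    vector = [0] * len(vocab)
    vector[word_to_index[word]] = 1
    return vector

def dot(u, v):
    """Plain dot product of two equally sized lists."""
    return sum(a * b for a, b in zip(u, v))

print(one_hot("future"))                         # [0, 1, 0, 0]
print(dot(one_hot("future"), one_hot("back")))   # 0
```

With a real vocabulary, V easily reaches tens of thousands of entries, and each vector would be that long.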

<img src="media/one-hot-vec.png" width="300">


We are going in the right direction! But keep in mind that this representation lives in a very large space, and we fall straight into the pitfalls of high dimensionality. On top of that, every pair of one-hot vectors is equally far apart, so the representation says nothing about which words are related. You could think of applying PCA or SVD to these one-hot vectors but, as for most tasks nowadays, neural networks have proven to be better at this. To put it simply, there is a more elegant way. 

<img src="media/but-how-doc.jpg" width="450">



## 1.1 Training word vectors

So you know the data - a bunch of words. You know the goal - a vector with K features for each word, where K is a number we choose. And you know the means - neural networks. So how does it all work? "You shall know a word by the company it keeps". These are the words of John Rupert Firth (at least according to Wikipedia), and they are the basis of the following method - Word2vec. In other words, words that show up in similar contexts tend to have similar meanings. 

**Word2vec** is a popular technique for using neural models to produce word embeddings, and it encompasses two main approaches - Continuous Bag Of Words (CBOW) and Skip-gram - that we will describe here.

Initially, we center a window of length n around each word in our training text. The word at the center is called the center word and the rest are context words. Each window will produce a training example that we will plug into a neural network. There are two approaches to the training:

1 - **CBOW**: the input words are the context words, and we predict the center word 

2 - **Skip-gram**: complementary to the previous method, the input is the center word, and the predictions are the context words
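
To make the two setups concrete, here is a rough sketch of how the training examples could be generated from a tokenized sentence. It only builds the input/target pairs and does not train any network; the function `training_examples` and its `half_window` parameter (the number of context words taken on each side) are assumptions made for illustration:

```python
def training_examples(tokens, half_window=1):
    """Return (center, context_words) for a window centered on each token."""
    examples = []
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - half_window):i]
        right = tokens[i + 1:i + 1 + half_window]
        examples.append((center, left + right))
    return examples

tokens = "you shall know a word".split()

# CBOW: the context words are the input, the center word is the target
cbow_pairs = [(context, center) for center, context in training_examples(tokens)]

# Skip-gram: the center word is the input, each context word is a target
skipgram_pairs = [(center, ctx)
                  for center, context in training_examples(tokens)
                  for ctx in context]

print(cbow_pairs[1])       # (['you', 'know'], 'shall')
print(skipgram_pairs[:3])  # [('you', 'shall'), ('shall', 'you'), ('shall', 'know')]
```

Words at the edges of the text simply get a shorter context in this sketch.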

The values we are trying to learn are the **weights** of the neural network, and they map directly to our word vectors. Predicting the center or context words is only the training task: once training is done, we keep the weights and discard the predictions. We aren't going to dive deep into neural networks at the moment, and there are definitely more details on how to set up these models, but the basic intuition can be seen in the following image:

<img src="./media/word2vec.png" width="500">

The projection layer depicted contains these **weights** we mentioned, which we'll iteratively train based on a large amount of data so that we get strong word vectors at the end of training.


## 1.2 Pretrained word vectors

The best thing about these vectors, however, is that they are general-purpose and ready to use off the shelf. They were trained on a huge amount of text in the language we are working with, so we can simply take them and use them for our task. It saves us the time and effort of gathering and processing our own data and training on it.

One set of such pretrained vectors comes with **spacy**. [Spacy](https://spacy.io) is a toolkit similar to NLTK, but it contains embedded deep learning models for NLP and it typically has better performance for industrial applications. The pretrained word vectors are easy to use out of the box once you import the spacy library and load a model.
Spacy ships several English models with different vocabulary and vector sizes; we are loading the medium one. You can try switching between models in the following experiments to see the impact.

In [6]:
nlp = spacy.load('en_core_web_md')

If the previous cell throws an error, the model was not downloaded when the virtual environment was created, so you need to download it manually by running `python -m spacy download en_core_web_md` in your terminal.

There are several other sets of pretrained word vectors out there, such as [FastText](https://fasttext.cc) and [GloVe](https://nlp.stanford.edu/projects/glove/). These all provide good quality embeddings for your NLP tasks. Their training methods share the same intuition as Word2vec but differ in the details: FastText, for example, also uses subword (character n-gram) information, while GloVe is trained on global word co-occurrence counts.

## 2. Word representations in Spacy

Now let's dig into the vectors and see what we can get from them. We can start by seeing the representation for a particular word, for example *house*.


In [7]:
nlp('house').vector

array([-0.90543 , -2.6086  , -5.0713  , -2.0154  ,  2.7027  , -2.7953  ,
        4.8408  ,  4.7418  ,  1.686   ,  6.492   ,  5.1332  , -0.95634 ,
       -2.5573  ,  5.4958  , -1.9762  , -2.6233  , -2.1019  , -1.9035  ,
        2.0439  , -5.2747  ,  1.0465  ,  0.35399 ,  4.4307  , -0.11785 ,
        2.5747  ,  0.33719 , -0.87352 , -5.2227  , -3.2541  , -1.9433  ,
       -0.50336 , -0.058398,  3.3487  , -3.742   , -3.3131  ,  0.12359 ,
       -0.55388 ,  5.7538  ,  2.7122  , -1.1781  , -1.585   , -1.6623  ,
        3.2545  , -0.41248 , -1.1713  ,  2.408   , -0.44603 , -4.2474  ,
        0.087175,  2.1871  , -1.3372  ,  6.4992  , -0.354   , -3.6692  ,
        1.0833  ,  0.38513 ,  2.6469  , -1.651   ,  6.7615  , -1.855   ,
       -1.4257  , -4.3342  ,  1.3257  ,  0.12542 ,  2.6058  , -3.6989  ,
       -2.8216  , -2.4928  ,  2.1173  ,  5.8335  , -0.99999 , -1.8232  ,
        2.0089  , -2.694   ,  5.8093  ,  2.9267  , -0.89982 ,  4.6008  ,
       -1.5294  ,  1.4374  , -3.3156  ,  0.20706 , 

We can define a simple function just to make it easier and avoid rewriting the same thing over and over again.

In [8]:
def vec(s):
    return nlp.vocab[s].vector

Let's also check the size of the vector:

In [9]:
nlp('house').vector.shape

(300,)

These word embeddings are 300-dimensional, or, in other words, they have 300 features. We'll come back to this later.

## 3. Cosine similarity

We can check similarities between words using cosine similarity. The cosine similarity is a measure of similarity between the vectors expressed as the cosine of the angle between them. It is defined by the following equation:

$$\text{cos-similarity} = \frac{A \cdot B}{\| A \| \| B \|}$$

Vectors that are more similar point in a similar direction, so the angle between them is small and the cosine similarity is high. The values of cosine similarity range from -1 to 1. At 1, the vectors point in the same direction; at 0, they are perpendicular; and at -1, they point in opposite directions. It is very easy to see this in the 2D plane.

<img src="./media/cosine.png" width="400">

In this example, there are three animals with two features - whether the animal lives in the woods and how much it hunts. The vectors represent where each animal sits in this feature space, so the smaller the angle between two vectors, the more similar the animals are.

In [10]:
# cosine similarity
def cosine(v1, v2):
    if norm(v1) > 0 and norm(v2) > 0:
        return dot(v1, v2) / (norm(v1) * norm(v2))
    else:
        return 0.0
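
Before trying it on real word vectors, a quick sanity check on hand-picked 2D vectors (made up for illustration) shows the three reference values from the definition. The function is repeated here only so the snippet runs on its own. The last call also shows the guard: the cosine is undefined for a zero vector, and this implementation chooses to return 0.0 in that case.

```python
import numpy as np
from numpy import dot
from numpy.linalg import norm

def cosine(v1, v2):
    # same definition as above, repeated so this snippet is self-contained
    if norm(v1) > 0 and norm(v2) > 0:
        return dot(v1, v2) / (norm(v1) * norm(v2))
    else:
        return 0.0

a = np.array([1.0, 0.0])
print(cosine(a, np.array([2.0, 0.0])))   # 1.0  (same direction, different length)
print(cosine(a, np.array([0.0, 3.0])))   # 0.0  (perpendicular)
print(cosine(a, np.array([-1.0, 0.0])))  # -1.0 (opposite direction)
print(cosine(a, np.zeros(2)))            # 0.0  (zero vector, handled by the guard)
```

Notice that the length of the vectors does not matter, only their direction.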

Let's test it out. Using cosine similarity, closer words - like *house* and *home* - should have higher scores. On the other hand words with different meanings, even if they are close in terms of characters - like *house* and *mouse* - should produce a low score, if our word vectors really capture meaning:

In [11]:
cosine(vec('house'), vec('home'))

0.70501816

In [12]:
cosine(vec('house'), vec('mouse'))

0.30104372

As expected, *house* is closer to *home* than it is to *mouse*. Makes sense!

<img src="./media/future.jpg" width="400">




Once again, to simplify our next examples, let's create a function that gets us the closest words to the vector that we are interested in:

In [13]:
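def spacy_closest(token_list, vec_to_check, n=10, dont_include_list=None):
    # Rank every candidate token by cosine similarity to vec_to_check, highest first.
    # A None default avoids sharing one mutable list between calls.
    excluded = set(dont_include_list or [])
    return sorted([(x, cosine(vec_to_check, vec(x))) for x in token_list if x not in excluded],
                  key=lambda x: x[1],
                  reverse=True)[:n]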

We are going to apply this function in further examples. To simplify a bit, let's limit the vocabulary to the one from the Twitter dataset we used in the previous learning notebook. We can then find the closest words to the word *house*.  We start by reading the dataset and getting its vocabulary:

In [15]:
df = pd.read_csv('./data/twitter_rep_dem_data_small.csv')

handle_removal = lambda doc: re.subn(r'@\w+','', doc.lower())[0]
df['Tweet'] = df['Tweet'].map(handle_removal)

simple_tokenizer = lambda doc: " ".join(WordPunctTokenizer().tokenize(doc))
df['Tweet'] = df['Tweet'].map(simple_tokenizer)

vectorizer = TfidfVectorizer()
vectorizer.fit_transform(df.Tweet)

tweet_vocab = vectorizer.vocabulary_

These are the 10 closest words to *house*: 

In [16]:
spacy_closest(tweet_vocab.keys(),
              vec('house'),
              dont_include_list=['house'])

[('hothouse', 0.99999994),
 ('houses', 0.7942134),
 ('playhouses', 0.7942134),
 ('apartment', 0.7159029),
 ('home', 0.70501816),
 ('homenaje', 0.70501816),
 ('mansion', 0.68605477),
 ('courthouse', 0.67482996),
 ('warehouse', 0.6685663),
 ('plazo', 0.66722745)]

## 4. Word relations

There is much more that we can do to show you that these vectors capture the meaning, or at least some semantic information, of our vocabulary. Hopefully, if you still don't believe it, this will help. For example, what do you think will happen if we subtract man from king and add woman?

In [17]:
spacy_closest(tweet_vocab.keys(), 
              vec('king') - vec('man') + vec('woman'),
              dont_include_list=['king', 'man', 'woman'])

[('monarchs', 0.6899287),
 ('prince', 0.65628284),
 ('princess', 0.6480763),
 ('queens', 0.6178014),
 ('queen', 0.6178014),
 ('throne', 0.6081843),
 ('nobility', 0.6025089),
 ('imperial', 0.6010856),
 ('kingdom', 0.5926837),
 ('royal', 0.589522)]
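
To see why this kind of arithmetic can work at all, here is a toy sketch with hand-made 3-dimensional vectors whose dimensions loosely stand for "royalty", "male" and "female". These values are invented for illustration and have nothing to do with the Spacy vectors:

```python
import numpy as np

# Hypothetical vectors: [royalty, male, female]
toy = {
    "king":   np.array([1.0, 1.0, 0.0]),
    "queen":  np.array([1.0, 0.0, 1.0]),
    "man":    np.array([0.0, 1.0, 0.0]),
    "woman":  np.array([0.0, 0.0, 1.0]),
    "prince": np.array([0.7, 1.0, 0.0]),
}

def cosine(v1, v2):
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

# Removing "male" from king and adding "female" lands exactly on queen
target = toy["king"] - toy["man"] + toy["woman"]
candidates = [w for w in toy if w not in ("king", "man", "woman")]
best = max(candidates, key=lambda w: cosine(toy[w], target))
print(best)  # queen
```

In real embeddings the directions are learned rather than hand-picked, and they are much noisier, which is why the result above is a neighbourhood of royalty-related words rather than an exact hit.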

<img src="./media/mind-blown-2.png" width="300">


And what is the mean between morning and evening?

In [18]:
spacy_closest(tweet_vocab.keys(),
              np.mean(np.array([vec('morning'), vec('evening')]), axis=0),
              dont_include_list=['morning', 'evening'])

[('mornin', 0.9607939),
 ('convening', 0.95948684),
 ('afternoon', 0.8967448),
 ('mornings', 0.8549312),
 ('night', 0.83286226),
 ('nigh', 0.83286226),
 ('nighter', 0.83286226),
 ('evenings', 0.7993388),
 ('midnight', 0.75258917),
 ('daybreak', 0.75258917)]

<img src="./media/mind-blown-3.png" width="300">


What the sky is to blue, the grass is to ...

In [19]:
spacy_closest(tweet_vocab.keys(), 
              vec('blue') - vec('sky') + vec('grass'),
              dont_include_list=['blue', 'sky', 'grass'])

[('green', 0.49475595),
 ('greener', 0.49475595),
 ('greene', 0.49475595),
 ('greenest', 0.49475595),
 ('greenbrier', 0.49475595),
 ('brown', 0.47857535),
 ('walgreens', 0.4525114),
 ('coloured', 0.43883103),
 ('synthetic', 0.43828773),
 ('white', 0.43364882)]

<img src="./media/mind-blown-4.png" width="300">

<br>

## 5. Applying word vectors to sentences

There are several ways you could think of to construct a sentence representation from these vectors, such as:

* sum
* average 
* concatenation
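
As a small sketch of how these options differ, assume a three-word sentence whose words have made-up 4-dimensional vectors. The sum and the average keep the size of a single word vector, while the concatenation grows with the number of words:

```python
import numpy as np

# Three made-up 4-dimensional word vectors standing in for a 3-word sentence
word_vectors = np.array([
    [1.0, 0.0, 2.0, 1.0],
    [3.0, 2.0, 0.0, 1.0],
    [2.0, 1.0, 1.0, 1.0],
])

sent_sum = word_vectors.sum(axis=0)      # size 4, magnitude grows with sentence length
sent_avg = word_vectors.mean(axis=0)     # size 4, magnitude independent of sentence length
sent_concat = word_vectors.reshape(-1)   # size 3 * 4, depends on sentence length

print(sent_sum)           # [6. 3. 3. 3.]
print(sent_avg)           # [2. 1. 1. 1.]
print(sent_concat.shape)  # (12,)
```

Most classifiers expect every example to have the same number of features, which is one reason the fixed-size sum and average are convenient for sentences of varying length.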

The average is a good enough approach to start with, so let's implement a function to get the sentence vector representation from the average of its words:

In [21]:
def sentvec(s):
    sent = nlp(s)
    return np.mean(np.array([w.vector for w in sent]), axis=0)

We can then use the same logic to get the closest sentence according to the sentence representation we chose. Below you have the implementation of the previous function that used cosine similarity, but for sentences.

In [22]:
def spacy_closest_sent(space, input_str, n=10):
    input_vec = sentvec(input_str)
    return sorted(space,
                  key=lambda x: cosine(sentvec(x), input_vec),
                  reverse=True)[:n]

Let's try it out with a sentence:

In [24]:
for sent in spacy_closest_sent(df.Tweet.values[:2000], "i am against the trump administration ."):
    print(sent)
    print('---')

i am opposed to this proposal and will fight to keep fair and just .
---
i am not the least bit surprised that is pushing to weaken # airquality regulations . we must stand for # environmentaljustice !
---
rt : i am delighted that will be voting for the cra to overrule the fcc and save our # netneutrality rules . find out …
---
“ and i am sure that one day the united states will come back and join the paris climate agreement !”
---
rt : i am looking forward to joining and this sunday in nyc for this important discussion on the call f …
---
earlier this week i was honored to welcome the to d . c . for their first ever hill day . i am committ … https :// t . co / euvcuwlh6e
---
he turned a tragedy into a legacy of helping others . i am proud to call him my constituent and look forward to cont … https :// t . co / pbr4fpewys
---
i call on mr . pruitt to resign because americans cannot and should not be the victim of his abuse of power .
---
rt : we will be following this case closely . fo

It seems to have worked quite well, wouldn't you agree, Marty?

If you are still not convinced about this, you can try to project all your vectors into a 2D space (by applying PCA, for example) and convince yourself that words are somewhat organized by meaning, and that we can extract word relations from their distances. If you project your vectors, you should get something similar to this:

<img src="./media/word-vectors-projection.png" width="600">
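
If you want to try the projection yourself, the sketch below shows one way to do it with plain NumPy, computing PCA through an SVD of the centered data. The random matrix is only a stand-in so the snippet runs on its own; with the real vectors you would stack `vec(word)` for the words you care about. The helper name `project_2d` is an assumption made for this example:

```python
import numpy as np

def project_2d(vectors):
    """Project row vectors onto their first two principal components."""
    centered = vectors - vectors.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

rng = np.random.default_rng(42)
fake_vectors = rng.normal(size=(50, 300))  # stand-in for 50 word vectors of size 300

points = project_2d(fake_vectors)
print(points.shape)  # (50, 2)
```

Each row of `points` can then be drawn as a labelled dot in a scatter plot. Using `PCA(n_components=2)` from scikit-learn, which this notebook already imports, should give the same coordinates up to a possible sign flip of each axis.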


## 6. NLP practical example

All that is left is to use these new word representations as the features of our models. We start by defining a function to build sentence vectors for this dataset.

In [25]:
def build_sentence_vecs(docs):
    num_examples = len(docs)
    word_vector_shape = nlp.vocab.vectors.shape[-1]
    vectors = np.zeros([num_examples, word_vector_shape])
    for ii, doc in enumerate(docs):
        vector = sentvec(doc)
        vectors[ii] = vector
    
    # in case we get any NaN's or Inf, replace them with 0s
    return np.nan_to_num(vectors)

First let's get a baseline as we did before (it should match the one from the previous notebook). 

In [26]:
handle_removal = lambda doc: re.subn(r'@\w+','', doc.lower())[0]
df['Tweet'] = df['Tweet'].map(handle_removal)

simple_tokenizer = lambda doc: " ".join(WordPunctTokenizer().tokenize(doc))
df['Tweet'] = df['Tweet'].map(simple_tokenizer)

train_data, test_data = train_test_split(df, test_size=0.3, random_state=seed)

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_data.Tweet)
X_test = vectorizer.transform(test_data.Tweet)

y_train = train_data.Party
y_test = test_data.Party

clf =  KNeighborsClassifier()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print('Accuracy: {}'.format(accuracy_score(y_pred, y_test)))

Accuracy: 0.5062994960403168


Let's also get baselines for our previous methods - SVD and PCA. We'll use 300 as the number of components to keep so we can compare with the new technique.

In [27]:
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_data.Tweet)
X_test = vectorizer.transform(test_data.Tweet)

svd = TruncatedSVD(n_components=300, random_state=seed)
svd.fit(X_train)
X_train_svd = svd.transform(X_train)
X_test_svd =  svd.transform(X_test)

clf =  KNeighborsClassifier()
clf.fit(X_train_svd, y_train)
y_pred = clf.predict(X_test_svd)
print('Truncated SVD Accuracy: {}'.format(accuracy_score(y_pred, y_test)))

pca = PCA(n_components=300, random_state=seed)
X_train_dense = X_train.toarray()
X_test_dense = X_test.toarray()
pca.fit(X_train_dense)
X_train_pca = pca.transform(X_train_dense)
X_test_pca =  pca.transform(X_test_dense)

clf =  KNeighborsClassifier()
clf.fit(X_train_pca, y_train)
y_pred = clf.predict(X_test_pca)
print('PCA Accuracy: {}'.format(accuracy_score(y_pred, y_test)))

Truncated SVD Accuracy: 0.6040316774658028
PCA Accuracy: 0.599712023038157


With 300 components, SVD and PCA both reach about 60% accuracy, compared with roughly 51% for the raw TF-IDF baseline. Now let's build the sentence vectors for the data we already split - it might take a few minutes to get the vectors for all training and test data. We'll print the shape of the output - you should see that our feature vector now has 300 dimensions.

In [28]:
#calculate sentence vectors for each tweet
X_train = build_sentence_vecs(train_data.Tweet.values)
X_test = build_sentence_vecs(test_data.Tweet.values)

print(X_train.shape)

(12963, 300)


Let's run the same model and see how much accuracy we can get out of our 300 features:

In [29]:
clf = KNeighborsClassifier()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print('Accuracy: {}'.format(accuracy_score(y_pred, y_test)))

Accuracy: 0.6002519798416127


With the same 300 features, this accuracy is pretty close to what SVD and PCA gave us. We can go even further: for example, let's try to remove stopwords from the equation.

First we need to download the set of stopwords by running `nltk.download('stopwords')` in the cell below.

In [30]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/sanctus/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [31]:
# Redefine functions to use stopwords information

def sentvec_tfidf(s, stopwords):
    sent = nlp(s)
    return np.average(np.array([w.vector for w in sent if w.text not in stopwords]), axis=0)
    
def build_sentence_vecs_tfidf(docs, stopwords):
    num_examples = len(docs)
    word_vector_shape = nlp.vocab.vectors.shape[-1]
    vectors = np.zeros([num_examples, word_vector_shape])
    for ii, doc in enumerate(docs):
        vector = sentvec_tfidf(doc, stopwords)
        vectors[ii] = vector
    
    # in case we get any NaN's or Inf, replace them with 0s
    return np.nan_to_num(vectors)

# Run with english stopwords
stopwords = set(nltk.corpus.stopwords.words('english'))
X_train = build_sentence_vecs_tfidf(train_data.Tweet.values, stopwords)
X_test = build_sentence_vecs_tfidf(test_data.Tweet.values, stopwords)

clf =  KNeighborsClassifier()
clf.fit(X_train, train_data.Party)
pred = clf.predict(X_test)
print('Accuracy: {}'.format(accuracy_score(pred, test_data.Party)))

Accuracy: 0.6195104391648668


Awesome! We gained about two percentage points of accuracy with a simple strategy. There are many more tweaks that you can (and should) try to improve your accuracy. 

<br> 

## 7. Final remarks

In this last part, we've shown you that word vectors are pretty useful and intuitive, keeping meaningful information about words in a compact feature space. If you wish to dig further into these word representations we suggest this [paper](https://arxiv.org/pdf/1301.3781.pdf). As before, take into consideration that although word vectors can be used as an out of the box solution for several NLP tasks, all the factors mentioned before will affect your model performance. So once again, you should still be careful with:

- Initial text preprocessing
- Choice of classifier
- Parameter selection

In particular, for most NLP tasks, neural networks have been showing extremely good performance, and if you really want to get into this field, you should learn more about them. However, these basic techniques are essential to understand some of the reasoning when handling text and can still prove quite useful to us.

And that's it for this BLU. You have come out on the other side with a much wider view of the different methods and reasoning you can take when handling features in NLP (and outside NLP) in a high dimensional space. There is so much more, but these basic tools should suffice for you to start working with text data and to understand more complex approaches built on top of these methods. See you in the next BLU!

<br>

<img src="./media/see-you-in-the-future.png" width="500">

