## Part 3 - Word Vectors

So far we've seen simple feature selection methods, a statistical feature selection approach, and dimensionality reduction techniques such as PCA and SVD. In the last few years, however, with the rise in popularity of neural networks, a new technique has become the state of the art for representing words in NLP tasks.

This technique is commonly referred to as word vectors or word embeddings, and its inner workings are really simple. It consists of defining a vocabulary and assigning each word in it a vector with a fixed number of dimensions. The weights of these vectors are then learned by training a neural network. In essence, word embeddings try to capture information about a word's meaning and usage. This not only allows us to significantly reduce the number of features fed to our models, but also gives us meaningful and compact representations of the data that are transferable among tasks.

Pretty cool, huh?

<img src="./media/what-year-is-this.jpg" width="400">




In [1]:
import spacy

import numpy as np
from numpy import dot
from numpy.linalg import norm
import pandas as pd
import re
import nltk
from nltk.tokenize import WordPunctTokenizer

from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier

seed=42



## 1 - Word Vectors Explained

First of all, by now you could be thinking: "But wait Doc, didn't I get a bunch of vectors before also?". Why yes, yes you did Marty. You could consider each column of the document-term count matrix to be a possible vector representation of a word. You could even construct a simpler matrix.

If you assume a vocabulary of size V, with each word having an index in this vocabulary, a natural representation would be what is called a 1-hot encoding, where each word is represented by a vector of size V - the vocabulary size - with the single component corresponding to that word set to 1, and the remaining ones zeroed out.

<img src="./media/one-hot-vec.png" width="300">
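
To make the 1-hot idea concrete, here is a minimal sketch. The four-word vocabulary and the helper name `one_hot` are made up for illustration and are not part of the dataset used later.

```python
import numpy as np

# Toy vocabulary (hypothetical); each word gets a fixed index.
vocab = ["back", "future", "the", "to"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a vector of size V with a 1 at the word's index and 0 elsewhere."""
    vector = np.zeros(len(vocab))
    vector[word_to_index[word]] = 1.0
    return vector

print(one_hot("future"))  # [0. 1. 0. 0.]

# Any two different 1-hot vectors are orthogonal, so every pair of
# distinct words looks equally unrelated in this representation.
print(np.dot(one_hot("back"), one_hot("future")))  # 0.0
```

Note that the vector length grows with the vocabulary, and the representation carries no notion of similarity between words.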


We are going in the right direction! But keep in mind that this representation lives in a very large space, and we quickly fall into the pitfalls of high dimensionality. You could think of applying PCA or SVD to these 1-hot vectors but, as for most tasks nowadays, neural networks have proven to be better at the job. Simply put, there is a more elegant way.

<img src="./media/but-how-doc.jpg" width="450">



### Training Word Vectors

So you know the data - a bunch of words. You know the goal - a vector of K features for each word, where K is a number we choose. And you know the means - neural networks. So how does it all work? "You shall know a word by the company it keeps". These are the words of John Rupert Firth (at least according to Wikipedia), and they are the basis of the following method - Word2vec.

**Word2vec** is a popular technique for using neural models to produce word embeddings, and it encompasses two main approaches - Continuous Bag Of Words (CBOW) and Skip-gram - that we will describe here.

Initially, we prepare the dataset by sliding a window of length n over each sentence, centered on each word in turn. Each of these windows creates training examples that we plug into our neural network, in one of two ways:

1 - **CBOW**: the input words are the context words, and we predict the center word, that is, the model output.

2 - **Skip-gram**: complementary to the previous method, the input is the center word, and the predictions are the context words.

The weights of the network are shared in both cases for the side that has more than one word. There are a few more details on how to set up these models, but the basic intuition can be seen in the following image:

<img src="./media/word2vec.png" width="500">
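
As a rough sketch of the data preparation step only (not the neural network itself), the snippet below turns a tokenized sentence into CBOW and Skip-gram training examples. The function names and the `half_window` parameter (number of context words on each side) are hypothetical choices made for this illustration.

```python
def context_windows(tokens, half_window=2):
    """Yield (center, context_words) for each position in the sentence."""
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - half_window):i]
        right = tokens[i + 1:i + 1 + half_window]
        yield center, left + right

def cbow_examples(tokens, half_window=2):
    # CBOW: input = context words, target = center word
    return [(context, center)
            for center, context in context_windows(tokens, half_window)]

def skipgram_examples(tokens, half_window=2):
    # Skip-gram: input = center word, one target per context word
    return [(center, word)
            for center, context in context_windows(tokens, half_window)
            for word in context]

sentence = "you shall know a word".split()

print(cbow_examples(sentence, half_window=1)[:2])
# [(['shall'], 'you'), (['you', 'know'], 'shall')]

print(skipgram_examples(sentence, half_window=1)[:3])
# [('you', 'shall'), ('shall', 'you'), ('shall', 'know')]
```

Both approaches see exactly the same windows. They differ only in which side is the input and which side is the prediction.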


### Pretrained word vectors

The best thing about these vectors, however, is that we can transfer them among tasks. What this means is that we don't need to go through the painful task of training them, and we can rely on pretrained vectors. Most of these pretrained vectors were trained on a huge amount of data in the same language, which would take a long time to gather, process and iterate over to train the network.

One such set of pretrained vectors comes from **spacy**. [Spacy](https://spacy.io) is a toolkit similar to NLTK, but it ships with deep learning models for NLP and typically performs better in industrial applications. The pretrained word vectors are easy to use out of the box once you import the spacy library. At this point, if you did not go through the README carefully, you should run this command to download the required models:

`python -m spacy download en_core_web_md`

Spacy offers models of different sizes, and the one we are downloading is the medium one. You can try switching between them to see the impact this has on the following experiments. The different sizes correspond to different vocabulary sizes and numbers of features. Load the medium pretrained model:


In [2]:
nlp = spacy.load('en_core_web_md')

There are several other sources of pretrained word vectors out there, such as [FastText](https://fasttext.cc) and [Glove](https://nlp.stanford.edu/projects/glove/). These all provide good quality embeddings for your NLP tasks. Their training methods share the same intuition as Word2vec but differ in the details: FastText, for example, also uses subword information, while Glove is trained on global word co-occurrence statistics.

## 2 - Word Representations in Spacy

Now let's dig into the vectors and see what we can get from them. We can start by seeing the representation for a particular word, for example *house*.


In [3]:
nlp('house').vector

array([ 1.9847e-01,  1.8087e-01, -8.9119e-02, -2.5626e-01,  7.4104e-02,
        5.9422e-03, -8.0814e-02, -8.7499e-01,  1.6353e-01,  2.7836e+00,
       -8.9134e-01,  3.7017e-02, -5.5995e-01, -2.1853e-01, -3.6847e-01,
        4.2609e-01,  2.5508e-02,  1.1834e+00, -5.9869e-02, -1.6261e-02,
        3.6331e-01,  1.2664e-01,  3.1424e-01,  2.3845e-02,  5.7331e-02,
       -4.7905e-01, -2.3247e-01,  2.3379e-02, -2.9739e-01,  1.0735e-01,
        2.9723e-01,  5.4123e-02, -2.6837e-01,  4.8272e-01, -4.8055e-02,
       -1.0766e-02,  1.6169e-01, -7.4395e-02,  1.2789e-03, -6.1155e-02,
        2.4258e-01,  1.4165e-02,  8.3789e-02, -3.5793e-01, -4.8655e-02,
        1.1436e-01,  2.7535e-01, -9.2720e-01,  3.2332e-01,  1.6197e-01,
       -2.6260e-01, -3.2542e-01,  1.8347e-01,  5.7849e-01,  1.9925e-01,
       -3.7611e-01,  1.8520e-01,  1.3349e-01,  1.9571e-01,  5.1844e-01,
        2.0733e-01,  2.0470e-01,  8.3850e-02,  4.2725e-01,  1.1571e-01,
       -1.2066e-01, -7.6344e-02,  2.2959e-01, -1.9066e-01,  2.88

We can define a simple function just to make it easier and avoid rewriting the same thing over and over again.

In [4]:
def vec(s):
    return nlp.vocab[s].vector

Let's also check the size of the vector:

In [5]:
nlp('house').vector.shape

(300,)

These word embeddings are 300-dimensional, or, in other words, they have 300 features. We'll come back to this later.

## 3 - Cosine similarity

We can check similarities between words using cosine similarity. Cosine similarity measures how similar two vectors are by looking at the angle between them, regardless of their lengths. It is defined by the following equation:

$$\text{cos-similarity} = \frac{A \cdot B}{\| A \| \| B \|}$$

Its computation is very intuitive in the 2D plane.

<img src="./media/cosine.png" width="400">

In this example, there are three animals, each represented by two features - whether the animal lives in the woods and how much it hunts. The vectors show where each animal sits in this feature space, so the closer together two vectors are, the more similar the animals. This can be measured by the cosine of the angle between them - if the angle between two vectors is small (similar vectors), the cosine of that angle is larger, and thus the similarity between the words in this feature space is greater!

In [6]:
# cosine similarity
def cosine(v1, v2):
    if norm(v1) > 0 and norm(v2) > 0:
        return dot(v1, v2) / (norm(v1) * norm(v2))
    else:
        return 0.0
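
To see the geometry at work, here is a small standalone sketch in 2D. The animals and their feature values are made up for illustration (they are not read from the figure), and `cosine_2d` is a helper written just for this example.

```python
import math

def cosine_2d(a, b):
    """Cosine similarity of two non-zero 2D vectors given as (x, y) tuples."""
    dot_product = a[0] * b[0] + a[1] * b[1]
    return dot_product / (math.hypot(*a) * math.hypot(*b))

# Made-up feature values: (lives in the woods, how much it hunts)
wolf = (0.9, 0.8)
fox = (0.8, 0.6)
rabbit = (0.9, 0.05)

# Two hunters point in almost the same direction...
print(round(cosine_2d(wolf, fox), 3))
# ...while the rabbit shares the habitat but not the hunting.
print(round(cosine_2d(wolf, rabbit), 3))

# Only the direction matters: scaling a vector does not change the score.
print(math.isclose(cosine_2d((1, 1), (2, 2)), 1.0))  # True
```

Because vector length is ignored, a word that simply has larger weights is not treated as more similar, which is usually what we want when comparing embeddings.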

Let's test it out. Using cosine similarity, words with related meanings - like *house* and *home* - should have higher scores. On the other hand, words with different meanings, even if they are close in terms of characters - like *house* and *mouse* - should produce a low score, if our word vectors really capture meaning:

In [7]:
cosine(vec('house'), vec('home'))

0.73886245

In [8]:
cosine(vec('house'), vec('mouse'))

0.16095257

As expected, *house* is closer to *home* than it is to *mouse*. Makes sense!

<img src="./media/future.jpg" width="400">




Once again, to simplify our next examples, let's create a function that gets us the closest words to the vector that we are interested in:

In [9]:
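def spacy_closest(token_list, vec_to_check, n=10, dont_include_list=()):
    # Score every candidate token against the query vector, skipping excluded tokens
    scored = [(x, cosine(vec_to_check, vec(x))) for x in token_list if x not in dont_include_list]
    # Keep the n tokens with the highest cosine similarity
    return sorted(scored, key=lambda x: x[1], reverse=True)[:n]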

We are going to apply this function in further examples. To simplify a bit, let's limit the vocabulary to the one from our previous example. We can then find the closest words to the word *house*.  We start by reading the dataset and getting its vocabulary:

In [10]:
df = pd.read_csv('./datasets/twitter_rep_dem_data_small.csv')

handle_removal = lambda doc: re.subn(r'@\w+','', doc.lower())[0]
df['Tweet'] = df['Tweet'].map(handle_removal)

simple_tokenizer = lambda doc: " ".join(WordPunctTokenizer().tokenize(doc))
df['Tweet'] = df['Tweet'].map(simple_tokenizer)

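# Fit the vectorizer only to collect the vocabulary; the TF-IDF matrix is not needed here
vectorizer = TfidfVectorizer()
vectorizer.fit(df.Tweet)

# Dictionary mapping each token to its column index
tweet_vocab = vectorizer.vocabulary_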

We can now obtain the 10 words closest to *house*:

In [11]:
spacy_closest(tweet_vocab.keys(),
              vec('house'),
              dont_include_list=['house'])

[('hous', 0.9999999),
 ('housed', 0.75457484),
 ('houses', 0.75457484),
 ('home', 0.73886245),
 ('apartment', 0.71463555),
 ('residence', 0.656779),
 ('homes', 0.6402517),
 ('mansion', 0.6243122),
 ('room', 0.61347073),
 ('pantry', 0.608584)]

## 4 - Word relations

There is much more that we can do to show you that these vectors capture the meaning, or at least some semantic information, of our vocabulary. Hopefully, if you still don't believe it, this will help. For example, what do you think will happen if we subtract man from king and add woman?

In [12]:
spacy_closest(tweet_vocab.keys(), 
              vec('king') - vec('man') + vec('woman'),
              dont_include_list=['king', 'man', 'woman'])

[('queen', 0.78808445),
 ('prince', 0.6401078),
 ('princess', 0.61256355),
 ('royal', 0.580097),
 ('throne', 0.57870126),
 ('queens', 0.5743794),
 ('kingdom', 0.55209804),
 ('lady', 0.5254389),
 ('woma', 0.5150813),
 ('mother', 0.49758324)]

<img src="./media/mind-blown-2.png" width="300">
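
Why does this arithmetic work? One way to build intuition is a toy example with hand-crafted vectors, where each dimension has an explicit meaning. Everything below (the vectors, `toy_cosine`, `toy_closest`) is invented for illustration. Real embeddings do not have such clean, interpretable dimensions.

```python
import numpy as np

# Hand-crafted 3D "embeddings": (royalty, male, female). Purely illustrative.
toy_vectors = {
    "king":     np.array([1.0, 1.0, 0.0]),
    "queen":    np.array([1.0, 0.0, 1.0]),
    "man":      np.array([0.0, 1.0, 0.0]),
    "woman":    np.array([0.0, 0.0, 1.0]),
    "prince":   np.array([0.6, 1.0, 0.0]),
    "princess": np.array([0.6, 0.0, 1.0]),
}

def toy_cosine(v1, v2):
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))

def toy_closest(target, exclude):
    """Rank the remaining toy words by cosine similarity to the target vector."""
    candidates = [w for w in toy_vectors if w not in exclude]
    return sorted(candidates,
                  key=lambda w: toy_cosine(target, toy_vectors[w]),
                  reverse=True)

# Removing the "male" component from king and adding the "female" one
target = toy_vectors["king"] - toy_vectors["man"] + toy_vectors["woman"]

print(toy_closest(target, exclude={"king", "man", "woman"}))
# ['queen', 'princess', 'prince']
```

In the pretrained vectors the same effect shows up only approximately, which is why *queen* scores about 0.79 above rather than 1.0.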


And what is the mean of morning and evening?

In [13]:
spacy_closest(tweet_vocab.keys(),
              np.mean(np.array([vec('morning'), vec('evening')]), axis=0),
              dont_include_list=['morning', 'evening'])

[('afternoon', 0.9260557),
 ('night', 0.8225889),
 ('mornings', 0.7452829),
 ('noon', 0.7396107),
 ('10am', 0.7396107),
 ('2pm', 0.7396107),
 ('11am', 0.7396107),
 ('4pm', 0.7396107),
 ('8am', 0.7396107),
 ('5pm', 0.7396107)]

<img src="./media/mind-blown-3.png" width="300">


If sky is to blue, grass is to ...

In [14]:
spacy_closest(tweet_vocab.keys(), 
              vec('blue') - vec('sky') + vec('grass'),
              dont_include_list=['blue', 'sky', 'grass'])

[('green', 0.6426651),
 ('red', 0.5832038),
 ('purple', 0.5736932),
 ('lilac', 0.5736932),
 ('lawn', 0.5727839),
 ('landscaping', 0.5727839),
 ('turf', 0.54457664),
 ('orange', 0.5386561),
 ('brown', 0.52307475),
 ('hazel', 0.52307475)]

<img src="./media/mind-blown-4.png" width="300">

<br>

## 5 - Applying word vectors to sentences

There are several ways you could think of to construct a sentence representation from these vectors, such as:

* sum
* average 
* concatenation

The average is a good enough approach to start with, so let's implement a function to get the sentence vector representation from the average of its words:

In [15]:
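def sentvec(s):
    sent = nlp(s)
    # An empty document has no tokens to average, so fall back to a zero vector
    if len(sent) == 0:
        return np.zeros(nlp.vocab.vectors.shape[-1])
    return np.mean(np.array([w.vector for w in sent]), axis=0)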
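
To compare the three options listed above, here is a small sketch on made-up word vectors (three words with four dimensions each, rather than real 300-dimensional embeddings).

```python
import numpy as np

# Three made-up 4-dimensional word vectors standing in for one sentence.
word_vectors = np.array([
    [1.0, 0.0, 2.0, 0.0],
    [0.0, 1.0, 0.0, 2.0],
    [2.0, 2.0, 1.0, 1.0],
])

# Sum: fixed size, but the magnitude grows with the number of words
sum_vec = word_vectors.sum(axis=0)

# Average: fixed size, and the magnitude does not depend on sentence length
mean_vec = word_vectors.mean(axis=0)

# Concatenation: keeps word order, but the size depends on the number of words
concat_vec = word_vectors.reshape(-1)

print(sum_vec.shape, mean_vec.shape, concat_vec.shape)  # (4,) (4,) (12,)
```

Sum and average always give a vector of the same size as a single word vector, which is convenient for classifiers that expect a fixed number of features. Concatenation would require padding or truncating sentences to a common length first.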

We can then use the same logic to get the closest sentences according to the sentence representation we chose. Below is the sentence-level counterpart of the previous function, again based on cosine similarity.

In [16]:
def spacy_closest_sent(space, input_str, n=10):
    input_vec = sentvec(input_str)
    return sorted(space,
                  key=lambda x: cosine(sentvec(x), input_vec),
                  reverse=True)[:n]

Let's try it out with a sentence:

In [17]:
for sent in spacy_closest_sent(df.Tweet.values[:2000], "i am against the trump administration ."):
    print(sent)
    print('---')


last night , i voted against advancing the republican tax bill because i ' ve heard from delawareans about the need fo … https :// t . co / sseek7kbaq
---
congressional republicans and the trump administration must realize that now is the time for progress – not rollbacks . # earthday
---
i call on mr . pruitt to resign because americans cannot and should not be the victim of his abuse of power .
---
rt : the court is giving the trump administration 90 days to explain why they ended # daca , which the administrati …
---
i am opposed to this proposal and will fight to keep fair and just .
---
rt : a civil war that gets so little attention from u . s . public . in congress it has become about the proxy war between iran and saud …
---
this week , i joined 73 of my colleagues in calling on the administration to stop playing politics with the cdc . the … https :// t . co / gqr1cesbar
---
“ stop donald trump and his politics of fear .” - i support this message from the brave group of migrant

It seems to have worked quite well, wouldn't you agree, Marty?

If you are still not convinced about this, you can try to project all your vectors into a 2D space (by applying PCA, for example) and convince yourself that words are somewhat organized by meaning, and that we can extract word relations from the distances between them. If you project your vectors, you should get something similar to this:

<img src="./media/word-vectors-projection.png" width="600">
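
If you want to try such a projection yourself, the sketch below shows one way to do it. To stay dependency-light it computes the PCA directly with NumPy's SVD instead of sklearn's `PCA`, and it uses random stand-in vectors, so the printed coordinates carry no meaning. With the real embeddings you would stack `vec(w)` for each word of interest instead. The function name `project_2d` is made up for this example.

```python
import numpy as np

def project_2d(vectors):
    """Project row vectors onto their first two principal components."""
    centered = vectors - vectors.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

# Stand-in data: random 300-dimensional vectors for a few words.
rng = np.random.default_rng(42)
words = ["house", "home", "mouse", "king", "queen"]
vectors = rng.normal(size=(len(words), 300))

coords = project_2d(vectors)
for word, (x, y) in zip(words, coords):
    print(f"{word:>6}: ({x:+.2f}, {y:+.2f})")
```

The resulting `(x, y)` pairs can be passed to any plotting library as a scatter plot, with each point labelled by its word.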


## 6 - NLP practical example

All that is left is to try to use these new representations as the features of our models. We start by defining a function to build our vectors for this dataset.

In [18]:
def build_sentence_vecs(docs):
    num_examples = len(docs)
    word_vector_shape = nlp.vocab.vectors.shape[-1]
    vectors = np.zeros([num_examples, word_vector_shape])
    for ii, doc in enumerate(docs):
        vector = sentvec(doc)
        vectors[ii] = vector
    
    # in case we get any NaN's or Inf, replace them with 0s
    return np.nan_to_num(vectors)

First let's get a baseline as we did before (it should match the one from the previous notebook). 

In [19]:
handle_removal = lambda doc: re.subn(r'@\w+','', doc.lower())[0]
df['Tweet'] = df['Tweet'].map(handle_removal)

simple_tokenizer = lambda doc: " ".join(WordPunctTokenizer().tokenize(doc))
df['Tweet'] = df['Tweet'].map(simple_tokenizer)

train_data, test_data = train_test_split(df, test_size=0.3, random_state=seed)

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_data.Tweet)
X_test = vectorizer.transform(test_data.Tweet)

y_train = train_data.Party
y_test = test_data.Party

clf =  KNeighborsClassifier()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print('Accuracy: {}'.format(accuracy_score(y_pred, y_test)))

Accuracy: 0.5062994960403168


Let's also get baselines for our previous methods - SVD and PCA. We'll use 300 as the number of components to keep so we can compare with the new technique.

In [20]:
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_data.Tweet)
X_test = vectorizer.transform(test_data.Tweet)

svd = TruncatedSVD(n_components=300, random_state=seed)
svd.fit(X_train)
X_train_svd = svd.transform(X_train)
X_test_svd =  svd.transform(X_test)

clf =  KNeighborsClassifier()
clf.fit(X_train_svd, y_train)
y_pred = clf.predict(X_test_svd)
print('Truncated SVD Accuracy: {}'.format(accuracy_score(y_pred, y_test)))

pca = PCA(n_components=300, random_state=seed)
X_train_dense = X_train.toarray()
X_test_dense = X_test.toarray()
pca.fit(X_train_dense)
X_train_pca = pca.transform(X_train_dense)
X_test_pca =  pca.transform(X_test_dense)

clf =  KNeighborsClassifier()
clf.fit(X_train_pca, y_train)
y_pred = clf.predict(X_test_pca)
print('PCA Accuracy: {}'.format(accuracy_score(y_pred, y_test)))

Truncated SVD Accuracy: 0.6040316774658028
PCA Accuracy: 0.599712023038157


With 300 components, Truncated SVD and PCA reach roughly 60% accuracy, which is still pretty low. Now let's build the sentence vectors for our train and test sets - it might take a few minutes to get the vectors for all training and test data. We print the shape of the output so we get an idea of the number of features that our model is going to use now. You should see that each feature vector now has only 300 features.

In [21]:
X_train = build_sentence_vecs(train_data.Tweet.values)
X_test = build_sentence_vecs(test_data.Tweet.values)

print(X_train.shape)


(12963, 300)


Let's run the same model and see how much accuracy we can get out of our 300 features:

In [22]:
clf = KNeighborsClassifier()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print('Accuracy: {}'.format(accuracy_score(y_pred, y_test)))

Accuracy: 0.6384089272858171


Using the same number of features, this is slightly better than what we got with SVD and PCA (about 64% versus about 60%). We can even go further, for example let's try to remove stopwords from the equation.

First we need to download the set of stopwords. Uncomment the line below to run `nltk.download('stopwords')`.

In [23]:
# nltk.download('stopwords')

In [24]:
# Redefine functions to ignore stopwords when averaging
# (despite the _tfidf suffix, no TF-IDF weighting is applied here)

def sentvec_tfidf(s, stopwords):
    sent = nlp(s)
    return np.average(np.array([w.vector for w in sent if w.text not in stopwords]), axis=0)
    
def build_sentence_vecs_tfidf(docs, stopwords):
    num_examples = len(docs)
    word_vector_shape = nlp.vocab.vectors.shape[-1]
    vectors = np.zeros([num_examples, word_vector_shape])
    for ii, doc in enumerate(docs):
        vector = sentvec_tfidf(doc, stopwords)
        vectors[ii] = vector
    
    # in case we get any NaN's or Inf, replace them with 0s
    return np.nan_to_num(vectors)

# Run with english stopwords
stopwords = set(nltk.corpus.stopwords.words('english'))
X_train = build_sentence_vecs_tfidf(train_data.Tweet.values, stopwords)
X_test = build_sentence_vecs_tfidf(test_data.Tweet.values, stopwords)

clf =  KNeighborsClassifier()
clf.fit(X_train, train_data.Party)
pred = clf.predict(X_test)
print('Accuracy: {}'.format(accuracy_score(pred, test_data.Party)))


Accuracy: 0.6576673866090713


Awesome! We got a little bit more accuracy with a simple strategy. There are many more tweaks that you can (and should) try to improve your accuracy.
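
One such tweak, sketched here as an idea to try rather than something evaluated in this notebook, is to replace the plain average with a weighted one, so that informative words count more than frequent ones. The function `weighted_sentence_vector`, the toy embeddings and the weights below are all made up for illustration. In practice the weights could come, for example, from the `idf_` attribute of a fitted `TfidfVectorizer`.

```python
import numpy as np

def weighted_sentence_vector(tokens, embeddings, weights, default_weight=1.0):
    """Weighted average of the vectors of the tokens found in `embeddings`."""
    known = [t for t in tokens if t in embeddings]
    if not known:
        # No known tokens: fall back to a zero vector of the right size
        dim = len(next(iter(embeddings.values())))
        return np.zeros(dim)
    vectors = np.array([embeddings[t] for t in known])
    token_weights = np.array([weights.get(t, default_weight) for t in known])
    return np.average(vectors, axis=0, weights=token_weights)

# Made-up 2D embeddings and positive weights, for illustration only.
embeddings = {
    "the": np.array([1.0, 1.0]),
    "tax": np.array([0.0, 4.0]),
}
weights = {"the": 1.0, "tax": 3.0}

print(weighted_sentence_vector(["the", "tax"], embeddings, weights))
# [0.25 3.25]
```

With a plain average the result would be `[0.5, 2.5]`. The weighting pulls the sentence vector towards *tax*, the more informative word in this toy setup. Whether this helps on the tweets dataset is something you would need to check experimentally.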

<br> 

## Final Remarks

In this last part, we've shown you that word vectors are pretty useful and intuitive, keeping meaningful information about words in a compact feature space. If you wish to dig further into these word representations we would suggest this [paper](https://arxiv.org/pdf/1301.3781.pdf). As before, take into consideration that although they can be used as an out of the box solution for several NLP tasks, all the factors mentioned before will affect your model performance. So once again, you should still be careful with:

- Initial text preprocessing
- Choice of classifier
- Parameter selection

In particular, for most NLP tasks, neural networks have been showing extremely good performance, and if you really want to get into this field, you should learn more about them. However, these basic techniques are essential to understand some of the reasoning when handling text and can still prove quite useful to us.

And that's it for this BLU. You have come out the other side with a much wider view of the different methods and reasoning you can take when handling features in NLP (and outside NLP) in a high dimensional space. There is so much more, but these basic tools should suffice for you to start working with text data and to understand more complex approaches built on top of these methods. See you in the next BLU!

<br>

<img src="./media/see-you-in-the-future.png" width="500">

