# Part 3 - Word vectors

So far we've seen simple feature selection methods, a statistical feature selection approach, and dimensionality reduction techniques such as PCA and SVD. In the last few years, however, with the rise in popularity of neural networks, a new technique has become the state of the art for representing words in NLP tasks.

This technique is commonly referred to as **word vectors** or **word embeddings**, and the idea behind it is really simple. It consists of defining a vocabulary and assigning each word in it a vector with a fixed number of dimensions. The vectors are then learned with **neural networks**, and we can use them off the shelf. In essence, word embeddings try to capture information about a word's meaning and usage. This not only allows us to significantly reduce the number of features fed to our models, but also gives us meaningful representations that transfer easily across data sets.

Pretty cool, huh?

<img src="media/what-year-is-this.jpg" width="400">

In [1]:
import spacy

import numpy as np
from numpy import dot
from numpy.linalg import norm
import pandas as pd
import re
import nltk
from nltk.tokenize import WordPunctTokenizer

from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier

seed = 42  # shared random seed so that splits and decompositions are reproducible

## 1. Word vectors explained

First of all, by now you could be thinking: "But wait Doc, didn't I get a bunch of vectors before also?". Why yes, yes you did, Marty. You could consider the columns of the matrix with document-term counts a possible word vector representation. You could construct an even simpler matrix. 

If you assume a vocabulary of size V, and each word having an index in this vocabulary, a natural representation would be a one-hot encoding, where each word is represented by a vector of size V - the vocabulary size - with the single component corresponding to this word set to 1, and the remaining fields zeroed out.

<img src="media/one-hot-vec.png" width="300">
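
As a minimal sketch of this one-hot encoding, assuming a tiny four-word vocabulary made up for illustration (the names `vocab`, `word_to_index` and `one_hot` are not used elsewhere in this notebook):

```python
import numpy as np

# Toy vocabulary, invented for this example
vocab = ["back", "future", "the", "to"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a vector of size V with a single 1 at the word's index."""
    vector = np.zeros(len(vocab), dtype=int)
    vector[word_to_index[word]] = 1
    return vector

print(one_hot("future"))                    # [0 1 0 0]
print(one_hot("back") @ one_hot("future"))  # 0
```

Note that any two different one-hot vectors have a dot product of 0, so this representation carries no information about which words are similar.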

We are going in the right direction! But keep in mind that this representation lives in a very large space, so we quickly fall into the pitfalls of high dimensionality. You could think of applying PCA or SVD to these one-hot vectors but, as with most tasks nowadays, neural networks have proven to be better at this. To put it simply, there is a more elegant way.

<img src="media/but-how-doc.jpg" width="450">

### 1.1 Training word vectors

So you know the data - a bunch of words. You know the goal - a vector with K features. And you know the means - neural networks. So how does it all work? "You shall know a word by the company it keeps". These are the words of John Rupert Firth (at least according to Wikipedia), and they are the basis of the following method - Word2vec. 

**Word2vec** is a popular technique to produce word embeddings with neural networks, and it encompasses two main approaches - Continuous Bag Of Words (CBOW) and Skip-gram - that work as follows.

Initially, we center a window of length n around each word in the training text. The word at the center is called the center word and the rest are context words. Each window will produce a training example that we will plug into a neural network. There are two approaches to the training:

1 - **CBOW**: the input words are the context words, and we predict the center word 

2 - **Skip-gram**: complementary to the previous method, the input is the center word, and the predictions are the context words
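
To make the two setups concrete, here is a rough sketch in plain Python of how windows become training examples. The `half_window` parameter (the number of words taken on each side of the center word) and the helper names are assumptions made for this illustration, not part of any Word2vec implementation.

```python
def windows(tokens, half_window=2):
    """Return (center_word, context_words) for every position in tokens."""
    examples = []
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - half_window):i]
        right = tokens[i + 1:i + 1 + half_window]
        examples.append((center, left + right))
    return examples

def cbow_pairs(tokens, half_window=2):
    # CBOW: the input is the list of context words, the target is the center word
    return [(context, center) for center, context in windows(tokens, half_window)]

def skipgram_pairs(tokens, half_window=2):
    # Skip-gram: the input is the center word, the target is each context word
    return [(center, c) for center, context in windows(tokens, half_window) for c in context]

tokens = "you shall know a word by the company it keeps".split()
print(cbow_pairs(tokens)[2])       # (['you', 'shall', 'a', 'word'], 'know')
print(skipgram_pairs(tokens)[:2])  # [('you', 'shall'), ('you', 'know')]
```

Words near the edges of the text simply get a shorter context in this sketch.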

The values we are trying to learn are the **weights** of the neural network, and they will map to our word vectors directly. We aren't going to dive deep into neural networks at the moment, and there are definitely more details on how to set up these models, but the basic intuition can be seen in the following image:

<img src="./media/word2vec.png" width="500">

The projection layer contains these **weights** we mentioned, which we'll iteratively train based on a large amount of data so that we get strong word vectors at the end of training.
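
One way to picture the projection layer, as a sketch with randomly initialized weights and made-up sizes `V` and `K`: multiplying a one-hot vector by the weight matrix just selects one row, and that row is the word's vector. Training, which is not shown here, is what turns those random rows into useful ones.

```python
import numpy as np

rng = np.random.default_rng(42)

V, K = 5, 3                  # toy vocabulary size and vector size (assumed values)
W = rng.normal(size=(V, K))  # projection-layer weights before any training

word_index = 2
one_hot = np.zeros(V)
one_hot[word_index] = 1.0

# Multiplying a one-hot vector by W picks out exactly one row of W
projected = one_hot @ W
print(np.allclose(projected, W[word_index]))  # True
```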

Remember this next time you use ChatGPT - it is only putting together the most probable chain of words.

### 1.2 Pretrained word vectors

The best thing about these vectors, however, is that they are general-purpose and ready to use off the shelf. They were trained on a huge amount of text in a given language, and we can just take them and use them for our task. This saves us the time and effort of gathering and processing our own data and training on it.

One set of such pretrained vectors comes with **spaCy**. [spaCy](https://spacy.io) is a toolkit similar to NLTK, but it ships with word embeddings trained with deep learning models and it typically performs better in industrial applications. The pretrained word vectors are easy to use out of the box by importing the spaCy library.

spaCy offers several English models with different vocabulary and vector sizes; we are loading the medium one (`en_core_web_md`). You can try switching between models in the following experiments and see the impact.

In [2]:
nlp = spacy.load('en_core_web_md')

If the previous cell throws an error, the model was not downloaded when the virtual environment was created, so you need to download it manually by running `python -m spacy download en_core_web_md` in your terminal.

There are other sets of pretrained word vectors out there, such as [fastText](https://fasttext.cc) and [GloVe](https://nlp.stanford.edu/projects/glove/). These all provide good quality embeddings for NLP tasks. Their training methods share the same intuition as Word2vec, but they differ in the details: fastText extends Word2vec with subword (character n-gram) information, while GloVe is fitted on global word co-occurrence counts.

## 2. Word representations in spaCy

Now let's dig into the vectors and see what we can get from them. We can start by seeing the representation for a particular word, for example *house*.


In [3]:
nlp('house').vector

array([-7.5073e-01,  8.1650e-02,  9.0288e-02, -3.4719e-01, -6.0598e-01,
       -2.6782e-02, -1.7644e-01, -4.5973e-01,  4.8586e-01,  2.9120e+00,
       -8.2821e-01,  4.4448e-01,  1.4028e-01,  1.1009e-01, -4.5023e-01,
       -2.1889e-01, -4.8917e-01,  7.4006e-01,  1.5316e-01,  5.5353e-01,
        9.6078e-02,  1.7717e-01,  2.0261e-02, -3.5839e-02, -4.3881e-02,
       -1.1955e-01, -1.9034e-02, -3.0087e-01, -5.3294e-03, -1.6692e-01,
        1.7790e-01, -4.8102e-01,  1.5397e-01,  1.5131e-01,  1.2383e-01,
       -1.0739e-01, -9.7154e-02, -1.4213e-01,  1.8673e-01,  2.1388e-01,
       -1.6718e-01,  4.0141e-01,  4.9244e-01, -7.0160e-01, -1.6582e-01,
        1.9418e-01,  2.6764e-01, -3.4181e-01, -1.5499e-01,  2.2845e-01,
       -5.2560e-02,  2.7192e-01, -2.0114e-01,  3.4382e-02,  3.0014e-01,
       -1.8081e-01,  1.8739e-01, -1.0126e-01, -3.3541e-01,  6.5063e-03,
       -9.6152e-02, -3.9275e-01,  8.8390e-02, -1.4326e-01,  5.5553e-01,
       -2.0286e-01,  4.2895e-01,  6.1838e-01,  3.3416e-01, -4.76

We can define a simple function to get a word vector just to make it easier and avoid rewriting the same thing over and over again.

In [4]:
def vec(s):
    return nlp.vocab[s].vector

Let's also check the size of the vector:

In [5]:
nlp('house').vector.shape

(300,)

These word embeddings are 300-dimensional, or, in other words, they have 300 features. We'll come back to this later.

## 3. Cosine similarity

As the words are represented as vectors, we can measure similarities between words with cosine similarity. The cosine similarity is a measure of similarity between vectors expressed as the cosine of the angle between them. It is defined by the following expression:

$$\text{cos-similarity} = \frac{A \cdot B}{\| A \| \| B \|}$$

More similar vectors point in a similar direction, so the angle between them is small and the cosine similarity is high. The values of cosine similarity are between -1 and 1. At 1, the vectors are pointing in the same direction; at 0, they are perpendicular; and at -1, they are pointing in opposite directions. It is very easy to see this in the 2D plane.

<img src="./media/cosine.png" width="400">

In this example, there are three animals with two features - if the animal lives in the woods and how much it hunts. The vectors represent where each animal is in this feature space. If the vectors are closer together, the animals are more similar.
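
The three reference values can be checked by hand with a few 2D vectors. This is only a worked example with invented vectors, applying the formula above directly:

```python
import numpy as np

a = np.array([1.0, 0.0])
others = {
    "same direction": np.array([3.0, 0.0]),
    "perpendicular": np.array([0.0, 2.0]),
    "opposite": np.array([-4.0, 0.0]),
}

results = {}
for label, b in others.items():
    # dot product divided by the product of the two vector lengths
    results[label] = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    print(label, results[label])
# same direction 1.0
# perpendicular 0.0
# opposite -1.0
```

Notice that the lengths of the vectors do not matter, only their directions.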

Let's define a function to compute cosine similarity:

In [6]:
def cosine(v1, v2):
    if norm(v1) > 0 and norm(v2) > 0:
        return dot(v1, v2) / (norm(v1) * norm(v2))
    else:
        return 0.0

Let's test it out. Using cosine similarity, words with close meanings - like *house* and *home* - should have higher scores. On the other hand, words with different meanings, even if they are close in terms of characters - like *house* and *mouse* - should produce a low score, if our word vectors really capture meaning:

In [7]:
cosine(vec('house'), vec('home'))

0.47925475

In [8]:
cosine(vec('house'), vec('mouse'))

-0.009982352

As expected, *house* is closer to *home* than it is to *mouse*. Makes sense!

<img src="./media/future.jpg" width="400">

Once again, to simplify our next steps, let's create a function that gets us the closest words to the word that we are interested in:

In [9]:
def spacy_closest(token_list, vec_to_check, n=10, dont_include_list=()):  # immutable default avoids the shared-list pitfall
    return sorted([(x, cosine(vec_to_check, vec(x))) for x in token_list if x not in dont_include_list],
                  key=lambda x: x[1],
                  reverse=True)[:n]

We are going to apply this function in further examples. To simplify a bit, let's limit the vocabulary to the one from the Twitter data set we used in the previous learning notebook. We can then find the closest words from this data set to the word *house*.  We start by reading the dataset and getting its vocabulary:

In [10]:
df = pd.read_csv('./data/twitter_rep_dem_data_small.csv')

handle_removal = lambda doc: re.subn(r'@\w+','', doc.lower())[0]
df['Tweet'] = df['Tweet'].map(handle_removal)

simple_tokenizer = lambda doc: " ".join(WordPunctTokenizer().tokenize(doc))
df['Tweet'] = df['Tweet'].map(simple_tokenizer)

vectorizer = TfidfVectorizer()
# only the fitted vocabulary is needed here, so fit() is enough
vectorizer.fit(df.Tweet)

tweet_vocab = vectorizer.vocabulary_

These are the 10 closest words to *house*: 

In [11]:
spacy_closest(tweet_vocab.keys(),
              vec('house'),
              dont_include_list=['house'])

[('you', 1.0000001),
 ('making', 1.0000001),
 ('your', 1.0000001),
 ('invest', 1.0000001),
 ('investing', 1.0000001),
 ('own', 1.0000001),
 ('saving', 1.0000001),
 ('funds', 1.0000001),
 ('keep', 1.0000001),
 ('fund', 1.0000001)]

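Interesting, so 'house' is all about owning and investing nowadays. Take this ranking with a grain of salt, though: every word in the list gets exactly the same score, which probably means these words share a single vector in the medium model (it stores fewer unique vectors than it has vocabulary entries) rather than being equally close in meaning.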
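
The `spacy_closest` helper calls `cosine` once per word, which is fine for a small vocabulary. As a sketch of an alternative, the same ranking can be computed with a single matrix product. The function `closest_rows` and the toy matrix below are hypothetical and only illustrate the idea; in the notebook the rows would be the vectors of the vocabulary words.

```python
import numpy as np

def closest_rows(matrix, query, n=3):
    """Return (row_index, cosine) pairs for the n rows most similar to query."""
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(query)
    # Rows with zero norm keep a similarity of 0, mirroring the cosine() helper
    sims = np.divide(matrix @ query, norms, out=np.zeros(len(matrix)), where=norms > 0)
    order = np.argsort(-sims, kind="stable")[:n]
    return [(int(i), float(sims[i])) for i in order]

# Invented 2D vectors standing in for real word vectors
toy_words = ["house", "home", "mouse", "empty"]
toy_matrix = np.array([[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [0.0, 0.0]])

for i, score in closest_rows(toy_matrix, np.array([1.0, 0.0])):
    print(toy_words[i], round(score, 2))
```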

## 4. Word relations

There is much more that we can do to show you that these vectors capture the meaning, or at least some semantic information, of our vocabulary. Hopefully, if you still don't believe it, this will help. For example, what do you think will happen if we subtract man from king and add woman?

In [12]:
spacy_closest(tweet_vocab.keys(),
              vec('king') - vec('man') + vec('woman'),
              dont_include_list=['king', 'man', 'woman'])

[('girl', 0.7344864),
 ('black', 0.7344864),
 ('gal', 0.7344864),
 ('doll', 0.7344864),
 ('actress', 0.7344864),
 ('nurse', 0.6498234),
 ('chick', 0.6498234),
 ('latin', 0.63816786),
 ('latina', 0.63816786),
 ('mature', 0.63816786)]

<img src="./media/mind-blown-2.png" width="300">
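
The arithmetic behind this kind of analogy is easier to follow with hand-made vectors. In the sketch below the two dimensions are imagined to mean "royalty" and "gender", and all values are invented for illustration; real embeddings have no such labeled axes and are much noisier.

```python
import numpy as np

# Invented 2D vectors: [royalty, gender]
toy = {
    "king":     np.array([1.0,  1.0]),
    "queen":    np.array([1.0, -1.0]),
    "man":      np.array([0.0,  1.0]),
    "woman":    np.array([0.0, -1.0]),
    "prince":   np.array([0.8,  1.0]),
    "princess": np.array([0.8, -1.0]),
}

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Remove the "male" direction from king and add the "female" one
target = toy["king"] - toy["man"] + toy["woman"]

candidates = [w for w in toy if w not in ("king", "man", "woman")]
best = max(candidates, key=lambda w: cosine_similarity(target, toy[w]))
print(best)  # queen
```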

And what is the mean between morning and evening?

In [13]:
spacy_closest(tweet_vocab.keys(),
              np.mean(np.array([vec('morning'), vec('evening')]), axis=0),
              dont_include_list=['morning', 'evening'])

[('scenic', 0.7720237),
 ('tour', 0.7720237),
 ('breakfast', 0.7720237),
 ('afternoon', 0.7720237),
 ('trip', 0.7720237),
 ('bar', 0.7720237),
 ('lunch', 0.7720237),
 ('restaurant', 0.7720237),
 ('guests', 0.7720237),
 ('dinner', 0.7720237)]

<img src="./media/mind-blown-3.png" width="300">


What the sky is to blue, the grass is to ...

In [14]:
spacy_closest(tweet_vocab.keys(), 
              vec('blue') - vec('sky') + vec('grass'),
              dont_include_list=['blue', 'sky', 'grass'])

[('hay', 0.6579658),
 ('hollow', 0.6579658),
 ('comb', 0.6579658),
 ('tent', 0.6579658),
 ('hats', 0.6579658),
 ('blanket', 0.6579658),
 ('hat', 0.6579658),
 ('sand', 0.6579658),
 ('tie', 0.6579658),
 ('cotton', 0.6579658)]

<img src="./media/mind-blown-4.png" width="300">

## 5. Applying word vectors to sentences

There are several ways to construct a sentence representation from these vectors, such as:

* sum
* average 
* concatenation
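
To see how these options differ, here is a small sketch with three made-up 2D word vectors standing in for a three-word sentence:

```python
import numpy as np

# Three invented word vectors, one row per word in the sentence
word_vectors = np.array([[1.0, 0.0],
                         [0.0, 2.0],
                         [2.0, 4.0]])

summed = word_vectors.sum(axis=0)        # fixed size, but magnitude grows with sentence length
averaged = word_vectors.mean(axis=0)     # fixed size, independent of sentence length
concatenated = word_vectors.reshape(-1)  # size depends on the number of words

print(summed)              # [3. 6.]
print(averaged)            # [1. 2.]
print(concatenated.shape)  # (6,)
```

Sum and average always produce a vector of the same size as a single word vector, whereas concatenation would need padding or truncation before sentences of different lengths could be fed to a model.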

The average is a good enough approach to start with, so let's implement a function to get the sentence vector representation from the average of its words.

In [15]:
def sentvec(s):
    sent = nlp(s)
    sent_vec = np.array([w.vector for w in sent])
    if len(sent_vec) > 0:
        return np.mean(sent_vec, axis=0)
    else:
        # no tokens: fall back to a zero vector of the right size
        return np.zeros(nlp.vocab.vectors.shape[-1])

We can then use the same logic to get the closest sentence according to the sentence representation we chose. Below you have the implementation of the previous function that used cosine similarity, but for sentences.

In [16]:
def spacy_closest_sent(space, input_str, n=10):
    input_vec = sentvec(input_str)
    return sorted(space,
                  key=lambda x: cosine(sentvec(x), input_vec),
                  reverse=True)[:n]

Let's find the closest sentences from the Twitter data for the sentence 'i am against the trump administration .'.

In [17]:
for sent in spacy_closest_sent(df.Tweet.values[:2000], "i am against the trump administration ."):
    print(sent)
    print('---')

i am opposed to this proposal and will fight to keep fair and just .
---
rt : we cannot allow this # citizenship question to pit communities against each other . we all have a stake in this fight . we must …
---
rt : " it is sad that we have to continue to remind this administration that immigrants founded this country and have fought …
---
i am not the least bit surprised that is pushing to weaken # airquality regulations . we must stand for # environmentaljustice !
---
congressional republicans and the trump administration must realize that now is the time for progress – not rollbacks . # earthday
---
rt : we need to respect the work people do . a man in wv told me his brother goes 6 feet underground to get coal so i can have …
---
senator mccain has faced every battle in his life with dignity , respect and heroism .
---
rt : once again administration showing inhumanity . we need compassion ! thank you for stepping u …
---
“ and i am sure that one day the united states will come back

It seems to have worked quite well, wouldn't you agree, Marty?

If you are still not convinced about this, you can try to project all your vectors into a 2D space (by applying PCA, for example) and convince yourself that words are somewhat organized by meaning and we can extract word relations from their distances. If you project your vectors, you should get something similar to this:

<img src="./media/word-vectors-projection.png" width="600">
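
If you want to try the projection yourself, a minimal sketch could look like the following. It computes the PCA projection directly with NumPy's SVD, and it uses random vectors as a stand-in for real word vectors so that it runs on its own; `project_2d` and `toy_vectors` are names made up for this example.

```python
import numpy as np

def project_2d(vectors):
    """Project row vectors onto their first two principal components."""
    centered = vectors - vectors.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

rng = np.random.default_rng(42)
toy_vectors = rng.normal(size=(6, 300))  # stand-in for six 300-dimensional word vectors
points = project_2d(toy_vectors)
print(points.shape)  # (6, 2)
```

With real data you would stack the vectors of the words you care about into the input matrix and scatter-plot the two resulting columns. The `PCA` class imported at the top of the notebook, called as `PCA(n_components=2).fit_transform(...)`, should give the same coordinates up to a sign flip per axis.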

## 6. NLP practical example

All that is left is to use these new word representations as the features for our model. We start by defining a function to build sentence vectors for the Twitter data set.

In [18]:
def build_sentence_vecs(docs):
    num_examples = len(docs)
    word_vector_shape = nlp.vocab.vectors.shape[-1]
    vectors = np.zeros([num_examples, word_vector_shape])
    for ii, doc in enumerate(docs):
        vector = sentvec(doc)
        vectors[ii] = vector
    
    # in case we get any NaN's or Inf, replace them with 0s
    return np.nan_to_num(vectors)

First let's get a baseline (it should match the one from the previous notebook). 

In [19]:
handle_removal = lambda doc: re.subn(r'@\w+','', doc.lower())[0]
df['Tweet'] = df['Tweet'].map(handle_removal)

simple_tokenizer = lambda doc: " ".join(WordPunctTokenizer().tokenize(doc))
df['Tweet'] = df['Tweet'].map(simple_tokenizer)

train_data, test_data = train_test_split(df, test_size=0.3, random_state=seed)

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_data.Tweet)
X_test = vectorizer.transform(test_data.Tweet)

y_train = train_data.Party
y_test = test_data.Party

clf = KNeighborsClassifier()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
# accuracy_score expects (y_true, y_pred)
print('Accuracy: {}'.format(accuracy_score(y_test, y_pred)))

Accuracy: 0.5062994960403168


Let's also get baselines for our previous methods - SVD and PCA. We'll use 300 components so that we can compare with the word vector technique.

In [20]:
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_data.Tweet)
X_test = vectorizer.transform(test_data.Tweet)

svd = TruncatedSVD(n_components=300, random_state=seed)
svd.fit(X_train)
X_train_svd = svd.transform(X_train)
X_test_svd = svd.transform(X_test)

clf = KNeighborsClassifier()
clf.fit(X_train_svd, y_train)
y_pred = clf.predict(X_test_svd)
print('Truncated SVD Accuracy: {}'.format(accuracy_score(y_test, y_pred)))

pca = PCA(n_components=300, random_state=seed)
X_train_dense = X_train.toarray()
X_test_dense = X_test.toarray()
pca.fit(X_train_dense)
X_train_pca = pca.transform(X_train_dense)
X_test_pca = pca.transform(X_test_dense)

clf = KNeighborsClassifier()
clf.fit(X_train_pca, y_train)
y_pred = clf.predict(X_test_pca)
print('PCA Accuracy: {}'.format(accuracy_score(y_test, y_pred)))

Truncated SVD Accuracy: 0.6040316774658028
PCA Accuracy: 0.599712023038157


With 300 features, PCA and SVD reach a reasonable accuracy. Now let's build the sentence vectors for the train and test sets we already split. We'll print the shape of the output - you should see that our feature vectors now have 300 dimensions.

In [21]:
# calculate sentence vectors for each tweet
X_train = build_sentence_vecs(train_data.Tweet.values)
X_test = build_sentence_vecs(test_data.Tweet.values)

print(X_train.shape)

(12963, 300)


Let's run the same model and see how much accuracy we can get out of our 300 features.

In [22]:
clf = KNeighborsClassifier()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print('Accuracy: {}'.format(accuracy_score(y_test, y_pred)))

Accuracy: 0.5890928725701944


This is a bit worse than SVD and PCA. But we can go further: let's try to remove stopwords from the equation.

First we need to download the set of stopwords.

In [23]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /home/maria/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [24]:
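# Redefine functions to use stopwords information
# (despite the `_tfidf` suffix, no TF-IDF weighting is applied here:
# stopwords are simply dropped before averaging)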

def sentvec_tfidf(s, stopwords):
    sent = nlp(s)
    sent_vec = np.array([w.vector for w in sent if w.text not in stopwords])
    if len(sent_vec)>0:
        return np.average(sent_vec, axis=0)
    else:
        return 0
    
def build_sentence_vecs_tfidf(docs, stopwords):
    num_examples = len(docs)
    word_vector_shape = nlp.vocab.vectors.shape[-1]
    vectors = np.zeros([num_examples, word_vector_shape])
    for ii, doc in enumerate(docs):
        vector = sentvec_tfidf(doc, stopwords)
        vectors[ii] = vector
    
    # in case we get any NaN's or Inf, replace them with 0s
    return np.nan_to_num(vectors)

# Run with english stopwords
stopwords = set(nltk.corpus.stopwords.words('english'))
X_train = build_sentence_vecs_tfidf(train_data.Tweet.values, stopwords)
X_test = build_sentence_vecs_tfidf(test_data.Tweet.values, stopwords)

clf = KNeighborsClassifier()
clf.fit(X_train, train_data.Party)
pred = clf.predict(X_test)
print('Accuracy: {}'.format(accuracy_score(test_data.Party, pred)))

Accuracy: 0.615370770338373


Awesome! We got a little bit more accuracy with this simple improvement and got ahead of SVD and PCA.

## 7. Final remarks

In this last part, we've shown you that word vectors are pretty useful and intuitive, keeping meaningful information about words in a compact feature space. If you wish to dig further into these word representations, we suggest the original Word2vec [paper](https://arxiv.org/pdf/1301.3781.pdf). As before, keep in mind that although word vectors can be used as an out-of-the-box solution for several NLP tasks, you still need to work on a few things that can affect model performance. So once again, you should still be careful with:

- initial text preprocessing
- choice of classifier
- parameter selection.

Neural networks show extremely good performance for most NLP tasks, and if you really want to get into this field, you should learn more about that. However, these basic techniques are essential to understand some of the reasoning when handling text and can still prove quite useful.

And that's it for this BLU. You have come out on the other side with a much wider view of the different methods you can use when handling features in NLP (and outside NLP) in a high dimensional space. There is so much more, but these basic tools should suffice for you to start working with text data and to understand more complex approaches built on top of these methods. See you in the next BLU!

<img src="./media/see-you-in-the-future.png" width="500">