# Word Embeddings

Bag of words is not the only way to encode text. Word embeddings map each word to a vector in a high-dimensional space. Unlike bag of words, word embeddings are a dense representation, and they retain some sense of similarity between words.

## A bit of theory

The idea behind word embeddings is captured by J. R. Firth's remark: "You shall know a word by the company it keeps."

Embeddings are learned using neural networks. There are two ways of training the networks:
 1. Learn to predict the target word given the context (continuous bag of words, CBOW).
 2. Learn to predict the words in the context window given the target word (skip-gram).

CBOW is usually faster to train, but skip-gram tends to give better results, especially for infrequent words.
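To make the difference between the two objectives concrete, here is a minimal sketch of how training examples could be built from a sentence. The helper `training_pairs` and the window size of 2 are assumptions chosen for illustration; this is not how Word2Vec is actually implemented.

```python
def training_pairs(tokens, window=2):
    """Yield (target, context_words) for each position in a token list."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield target, left + right


sentence = "the quick brown fox jumps".split()

# CBOW: the context words are the input, the target word is the label.
cbow_examples = [(context, target) for target, context in training_pairs(sentence)]

# Skip-gram: the target word is the input, each context word is a separate label.
skipgram_examples = [
    (target, word)
    for target, context in training_pairs(sentence)
    for word in context
]

print(cbow_examples[2])       # (['the', 'quick', 'fox', 'jumps'], 'brown')
print(skipgram_examples[:2])  # [('the', 'quick'), ('the', 'brown')]
```

Skip-gram produces one example per (target, context word) pair, so it sees many more examples for the same text, which is one intuition for why it is slower to train.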

Learning word embeddings requires a very large text corpus, which is why most of the time we use pretrained embeddings, such as Word2Vec or GloVe.

## Play around with word embeddings

Now, let's load an embedding and run a few tests.

You need to download the pretrained Word2Vec embedding [here](https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit).

In [None]:
!gzip -d data/GoogleNews-vectors-negative300.bin.gz

In [None]:
from gensim.models import KeyedVectors

wordVectors = KeyedVectors.load_word2vec_format('data/GoogleNews-vectors-negative300.bin', binary=True)

We can check how similar some words are for this embedding.

In [None]:
wordVectors.similarity('apple', 'banana')

In [None]:
wordVectors.similarity('apple', 'python')

In [None]:
wordVectors.similarity('python', 'snake')
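The `similarity` method returns the cosine similarity between the two word vectors. As a rough sketch of what that means, here is the computation on toy 3-dimensional vectors. The vectors are invented for illustration and are not taken from the Word2Vec model.

```python
import numpy as np


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1 = same direction, 0 = orthogonal."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


# Toy 3-dimensional vectors, invented for illustration only.
apple = np.array([0.9, 0.8, 0.1])
banana = np.array([0.8, 0.9, 0.2])
python = np.array([0.1, 0.2, 0.9])

print(cosine_similarity(apple, banana))  # close to 1: nearly the same direction
print(cosine_similarity(apple, python))  # much lower: different directions
```

Because the measure depends only on the angle between the vectors, it ignores their lengths.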

One nice feature of word embeddings is that, in addition to similarities, they capture some analogies.

In [None]:
wordVectors.similar_by_vector(wordVectors['king'] - wordVectors['man'] + wordVectors['woman'], topn=2)
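The arithmetic behind this analogy can be illustrated with hand-made vectors. In the sketch below, the three axes (royalty, femaleness, personhood) and the helper `nearest` are assumptions made up for the example; real embeddings have no such clean, interpretable axes.

```python
import numpy as np

# Hand-made vectors (not real embeddings); axes: [royalty, femaleness, personhood].
toy_vectors = {
    "man": np.array([0.0, 0.0, 1.0]),
    "woman": np.array([0.0, 1.0, 1.0]),
    "king": np.array([1.0, 0.0, 1.0]),
    "queen": np.array([1.0, 1.0, 1.0]),
}


def nearest(query, vectors, exclude=()):
    """Return the word whose vector has the highest cosine similarity to `query`."""
    def cosine(u, v):
        return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

    candidates = {w: v for w, v in vectors.items() if w not in exclude}
    return max(candidates, key=lambda w: cosine(query, candidates[w]))


# Remove the "man" component from "king" and add the "woman" component.
query = toy_vectors["king"] - toy_vectors["man"] + toy_vectors["woman"]
print(nearest(query, toy_vectors))  # queen
```

With real embeddings, the nearest neighbour of the resulting vector is often one of the input words (typically `king`), which is presumably why the cell above asks for `topn=2`. gensim's `most_similar(positive=['king', 'woman'], negative=['man'])` excludes the input words from its results.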

## Applying Word Embeddings to our dataset

We have a new way of encoding text. How can we apply it to our initial classification task?

If we load our dataset again, each sample is a list of words, and each word corresponds to a numerical vector. Numerical vectors are something we can work with, except that each sample contains a different number of words, so the samples cannot easily be stacked into a matrix.

A common solution is to create a document vector, computed by averaging the word vectors in the extract. Let's implement that!

In [None]:
from sklearn.datasets import fetch_20newsgroups
newsgroups_train = fetch_20newsgroups(subset='train',
                                      remove=('headers', 'footers', 'quotes'))
newsgroups_test = fetch_20newsgroups(subset='test',
                                     remove=('headers', 'footers', 'quotes'))

We first need to do a bit of preprocessing that was previously handled for us by scikit-learn. We will tokenize the text (split it into words), convert all strings to lowercase, and get rid of non-alphabetical tokens and stop words.

In [None]:
import nltk

nltk.download('stopwords')

In [None]:
from nltk.tokenize import wordpunct_tokenize
from nltk.corpus import stopwords

stopwords_en = set(stopwords.words('english'))

def clean_text(string):
    """Tokenize, lowercase, and drop non-alphabetic tokens and English stop words."""
    tokens = wordpunct_tokenize(string)
    return [
        token.lower()
        for token in tokens
        if token.isalpha() and token.lower() not in stopwords_en
    ]


Let's see what our cleaning function does on the first extract.

In [None]:
newsgroups_train.data[0]

In [None]:
clean_text(newsgroups_train.data[0])

Now, we can write the function to compute the document vector for each text extract, and use it to build the encoded training set. 

The document vector is computed by taking the average of all word vectors present in the extract.

Don't forget to also encode the test set! 
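If you get stuck, here is one possible shape for the averaging step. It is a sketch on a toy vocabulary, not the reference solution in `solutions/word_encoding.py`. The function name `document_vector`, its `dim` parameter, and the choice to map documents without any known word to the zero vector are all assumptions.

```python
import numpy as np


def document_vector(tokens, vectors, dim):
    """Average the vectors of the tokens found in `vectors`.

    Tokens missing from the vocabulary are skipped; a document with no known
    token maps to the zero vector so that every row has the same shape.
    """
    known = [vectors[token] for token in tokens if token in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)


# Tiny stand-in vocabulary, invented for illustration.
toy_vectors = {
    "car": np.array([1.0, 0.0]),
    "engine": np.array([0.0, 1.0]),
}

docs = [["car", "engine", "xyzzy"], ["unknown"]]

# Stack one document vector per row to obtain a fixed-size feature matrix.
X = np.vstack([document_vector(doc, toy_vectors, dim=2) for doc in docs])
print(X.shape)  # (2, 2)
print(X[0])     # [0.5 0.5]
```

The same function should work with the loaded model, since `KeyedVectors` supports both `token in wordVectors` and `wordVectors[token]`; in that case `dim` would be `wordVectors.vector_size`.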

In [None]:
# %load solutions/word_encoding.py


## Classify with Logistic Regression

Now we can try and apply Logistic Regression as we did before.

In [None]:
# %load solutions/classify_word_embedding.py


In [None]:
# %load solutions/confusion_matrix.py
from sklearn.metrics import confusion_matrix as sk_confusion_matrix
import pandas as pd

def confusion_matrix(y_true, y_predicted, labels):
    df = pd.DataFrame(data=sk_confusion_matrix(y_true, y_predicted), index=labels, columns=labels)
    df.index.name = 'true classes'
    df.columns.name = 'predicted classes'
    return df


In [None]:
y_estimated_test = lr_classifier.predict(X_test) 
confusion_mat = confusion_matrix(newsgroups_test.target, y_estimated_test, newsgroups_test.target_names)
confusion_mat

In [None]:
import seaborn as sns
%matplotlib inline

sns.heatmap(confusion_mat)

We have improved our accuracy score with respect to the bag-of-words approach. 

Our confusion matrix still looks similar, though: we are still having a hard time classifying `talk.religion.misc` extracts.
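To put numbers on which classes are hard, the overall accuracy and the per-class recall can be read directly off a confusion matrix. The sketch below uses a toy 3-class matrix with invented counts, not the results of this notebook.

```python
import numpy as np

# Toy 3-class confusion matrix (rows: true classes, columns: predicted classes).
# The counts are invented for illustration.
toy_confusion = np.array([
    [8, 1, 1],
    [2, 6, 2],
    [4, 3, 3],
])

# Overall accuracy: correctly classified samples (the diagonal) over all samples.
accuracy = np.trace(toy_confusion) / toy_confusion.sum()

# Per-class recall: for each true class, the fraction of its samples predicted correctly.
recall = np.diag(toy_confusion) / toy_confusion.sum(axis=1)

print(accuracy)  # 17 correct out of 30
print(recall)    # [0.8 0.6 0.3]
```

With the DataFrame built above, `confusion_mat.values` would give the corresponding array. A class with low recall, like the third one here, is one the classifier often mistakes for another class.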