# Word Embeddings

As I noted earlier, when modeling natural language you need to represent text or speech as vectors. In the last exercise you used bag of words vectors as the model input. Bag of words is typically considered a baseline encoding, though, and we can improve on it.

Word embeddings (also called word vectors) are vector encodings for words that are learned by considering the context in which the words appear. Words that appear in similar contexts will have similar vectors. For example, vectors for "leopard", "lion", and "tiger" will be close together, while they'll be far away from "galaxy", "oolong", or "castle".
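"Close together" is usually measured with cosine similarity, the cosine of the angle between two vectors. Here is a minimal sketch with made-up 3-dimensional vectors. Real embeddings have hundreds of dimensions, and the numbers and the `toy_vectors` name below are purely hypothetical.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between a and b; values near 1.0 mean similar direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 3-dimensional "embeddings", invented for illustration only
toy_vectors = {
    "lion":   np.array([0.9, 0.8, 0.1]),
    "tiger":  np.array([0.8, 0.9, 0.2]),
    "galaxy": np.array([0.1, 0.2, 0.9]),
}

sim_cats = cosine_similarity(toy_vectors["lion"], toy_vectors["tiger"])
sim_far = cosine_similarity(toy_vectors["lion"], toy_vectors["galaxy"])

print(f"lion vs tiger:  {sim_cats:.3f}")
print(f"lion vs galaxy: {sim_far:.3f}")
```

The two big cats point in nearly the same direction, so their similarity is much higher than the similarity between "lion" and "galaxy".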

What's even cooler is that relations between words are often maintained. Subtracting the vector for "man" from the vector for "woman" gives another vector. If you add that to the vector for "king", you'll end up at the vector for "queen" (or at least close by).

![Word vector examples](https://www.tensorflow.org/images/linear-relationships.png)
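The arithmetic can be sketched with toy vectors. The 2-dimensional values below are hypothetical and constructed so the analogy works out exactly. With real embeddings you would only land near "queen", not exactly on it. The `nearest` helper is also just an illustration, not part of any library.

```python
import numpy as np

# Hypothetical 2-dimensional vectors: axis 0 ~ "royalty", axis 1 ~ "gender"
toy = {
    "man":    np.array([0.0, 1.0]),
    "woman":  np.array([0.0, -1.0]),
    "king":   np.array([1.0, 1.0]),
    "queen":  np.array([1.0, -1.0]),
    "castle": np.array([1.0, 0.0]),
}

# Offset that takes "man" to "woman", applied to "king"
target = toy["king"] + (toy["woman"] - toy["man"])

def nearest(vec, vocab, exclude=()):
    """Return the word whose vector has the smallest Euclidean distance to vec."""
    candidates = {w: v for w, v in vocab.items() if w not in exclude}
    return min(candidates, key=lambda w: np.linalg.norm(candidates[w] - vec))

# The input words are excluded so the query doesn't just return one of them
result = nearest(target, toy, exclude=("king", "man", "woman"))
print(result)
```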

These vectors can be used as features for machine learning models. Word vectors will typically improve the performance of your models compared to bag of words encoding. SpaCy uses embeddings learned from the Word2Vec model. To access them, you need to load one of the large models, like `en_core_web_lg`. The vectors are then available on each token through the `.vector` attribute.

In [1]:
import numpy as np
import spacy

# Need to load the large model to get the vectors
nlp = spacy.load('en_core_web_lg')

In [2]:
# Disabling the other pipes because we don't need them and it'll speed up this part a bit.
# The pipe names are passed explicitly; with no names given, nothing gets disabled.
text = "These vectors can be used as features for machine learning models."
with nlp.disable_pipes(*nlp.pipe_names):
    vectors = np.array([token.vector for token in nlp(text)])

In [3]:
vectors.shape

(12, 300)

The embeddings from SpaCy are 300-dimensional vectors. You can see here that we get one vector for each token (11 words plus the final period). However, we only have document-level labels, so our models can't use the word-level embeddings directly. Instead, you need a vector representation for the entire document. A simple way to get one is to average the word vectors of all the tokens in the document. These document vectors are then used to train the model.
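The averaging step is just a mean over the token axis. A small sketch with a hypothetical 3-token document and 4-dimensional vectors (the numbers are invented so the result is easy to check by hand):

```python
import numpy as np

# Hypothetical token vectors: 3 tokens, 4 dimensions each
token_vectors = np.array([
    [1.0, 0.0, 2.0, 4.0],
    [3.0, 2.0, 0.0, 0.0],
    [2.0, 4.0, 1.0, 2.0],
])

# Average over the token axis (axis 0) to get one vector for the whole document
doc_vector = token_vectors.mean(axis=0)

print(token_vectors.shape, "->", doc_vector.shape)
print(doc_vector)
```

Applied to the `vectors` array from the cell above, `vectors.mean(axis=0)` would collapse the `(12, 300)` array into a single 300-dimensional vector for the sentence.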

SpaCy calculates the average document vector which you can get with `doc.vector`. Here I'll load in the spam data and convert the text to document vectors.

In [4]:
import pandas as pd

# Loading the spam data
# ham is the label for non-spam messages
spam = pd.read_csv('../input/nlp-course/spam.csv')

In [5]:
# This takes a minute or so
with nlp.disable_pipes(*nlp.pipe_names):
    doc_vectors = np.array([nlp(text).vector for text in spam.text])

In [6]:
doc_vectors

array([[ 0.02204493,  0.09757433,  0.00255367, ..., -0.09383627,
        -0.00796944,  0.19099015],
       [-0.07367852, -0.19237824, -0.1709596 , ..., -0.003535  ,
         0.02119275,  0.17852125],
       [ 0.01289313, -0.0072732 , -0.00627819, ..., -0.03291722,
         0.08602007,  0.12489114],
       ...,
       [-0.05494839,  0.19570266, -0.13729948, ..., -0.10815047,
        -0.02305553,  0.18632641],
       [-0.06460267,  0.17402254, -0.21391848, ..., -0.02812874,
         0.04536904,  0.21436742],
       [ 0.09184885,  0.14416684, -0.2082083 , ...,  0.03214014,
        -0.04769757,  0.19402859]], dtype=float32)

With the document vectors, we can train scikit-learn models using all the normal methods like cross validation. Here we'll start simple and hold out 10% of the messages as a test set.

In [9]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(doc_vectors, spam.label, test_size=0.1, random_state=1)

In [10]:
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

model = LogisticRegression(random_state=1, solver='lbfgs')
model.fit(X_train, y_train)

accuracy = metrics.accuracy_score(y_test, model.predict(X_test))
print("Accuracy: ", accuracy)

Accuracy:  0.967741935483871


Note that scikit-learn's `LogisticRegression` model uses L2 regularization by default. You typically want to regularize the model, but the default parameter `C=1` is likely not the best choice (`C` is the inverse of the regularization strength, so smaller values regularize more strongly). You can find the best regularization parameter with something like `LogisticRegressionCV`, which will perform cross validation over a range of values for `C`.
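Here's a rough sketch of what that search could look like. To keep it self-contained it assumes synthetic data from `make_classification` in place of the spam document vectors. The `*_demo` names and the grid of `C` values are made up for illustration, so treat this as a starting point rather than a tuned setup.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the document vectors and spam labels (assumption)
X_demo, y_demo = make_classification(n_samples=500, n_features=20,
                                     n_informative=5, random_state=1)
X_demo_train, X_demo_test, y_demo_train, y_demo_test = train_test_split(
    X_demo, y_demo, test_size=0.1, random_state=1)

# Ten candidate values for C, evenly spaced on a log scale
Cs_grid = np.logspace(-4, 4, 10)

# Each candidate C is scored with 5-fold cross validation on the training set,
# then the model is refit with the best one
model_cv = LogisticRegressionCV(Cs=Cs_grid, cv=5, max_iter=1000)
model_cv.fit(X_demo_train, y_demo_train)

best_C = float(np.ravel(model_cv.C_)[0])
demo_accuracy = model_cv.score(X_demo_test, y_demo_test)
print("Best C:", best_C)
print("Accuracy:", demo_accuracy)
```

With the real data you would pass the `X_train` and `y_train` arrays from the split above to `fit` instead of the synthetic ones.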

Another good model to try for text classification is a support vector machine (SVM). A convenient model to use for this is the linear support vector classifier, `LinearSVC`.

In [11]:
from sklearn.svm import LinearSVC

# Set dual=False to speed up training: the dual formulation isn't needed
# when there are more samples than features
svc = LinearSVC(random_state=1, dual=False, max_iter=10000)
svc.fit(X_train, y_train)
print("Accuracy: ", svc.score(X_test, y_test))

Accuracy:  0.9731182795698925


Again, note that `LinearSVC` uses L2 regularization by default. So, you'll want to use cross validation to find the best regularization parameter. Scikit-learn doesn't have a convenient class like `LogisticRegressionCV` to do this, but you can use `cross_val_score` (or `GridSearchCV`) and write the equivalent code.
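One way to write that search by hand is to loop over a few candidate values and score each with scikit-learn's `cross_val_score`. This is only a sketch: the data is synthetic (from `make_classification`), and the candidate list and the `*_demo` names are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

# Synthetic stand-in for the training document vectors and labels (assumption)
X_demo, y_demo = make_classification(n_samples=500, n_features=20,
                                     n_informative=5, random_state=1)

# Hypothetical candidate values for the regularization parameter C
candidate_Cs = [0.01, 0.1, 1.0, 10.0]

mean_scores = {}
for C in candidate_Cs:
    svc_demo = LinearSVC(C=C, dual=False, max_iter=10000, random_state=1)
    # 5-fold cross validation; the default score for a classifier is accuracy
    scores = cross_val_score(svc_demo, X_demo, y_demo, cv=5)
    mean_scores[C] = scores.mean()

# Keep the C with the highest mean cross-validated accuracy
best_C = max(mean_scores, key=mean_scores.get)
print("Mean CV accuracy per C:", mean_scores)
print("Best C:", best_C)
```

On real data, run the loop on the training set only, refit a `LinearSVC` with the chosen `C`, and then check it once against the held-out test set.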

Next, you'll get word vectors for the Yelp reviews and try to improve on the classifier you trained with SpaCy. If you aren't familiar with `LogisticRegressionCV` and `LinearSVC`, you'll want to work through our Intermediate Machine Learning course.