# Word Embeddings

As I noted earlier, when modeling natural language you need to represent text or speech as vectors. In the last exercise you used bag-of-words vectors as the model input. We can, of course, improve on bag of words, which is typically considered a baseline encoding.

Word embeddings (also called word vectors) are vector encodings for words that are learned by considering the context in which the words appear. Words that appear in similar contexts will have similar vectors. For example, vectors for "leopard", "lion", and "tiger" will be close together, while they'll be far away from "galaxy", "oolong", or "castle".

What's even cooler is that relations between words are often maintained. Subtracting the vector for "man" from the vector for "woman" gives you another vector. If you add that to the vector for "king" you'll end up at the vector for "queen" (or at least close by).

![Word vector examples](https://www.tensorflow.org/images/linear-relationships.png)
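
The arithmetic behind this analogy can be sketched with made-up vectors. The two-dimensional values below are invented purely for illustration (real embeddings have hundreds of dimensions and only approximately satisfy the relation), and `toy_vectors` and `nearest_word` are hypothetical names, not part of any library.

```python
import numpy as np

# Toy 2-D "embeddings", invented for illustration only.
# The first axis loosely stands for royalty, the second for gender.
toy_vectors = {
    "man": np.array([0.0, 1.0]),
    "woman": np.array([0.0, -1.0]),
    "king": np.array([1.0, 1.0]),
    "queen": np.array([1.0, -1.0]),
    "apple": np.array([-1.0, 0.2]),
}

def nearest_word(target, vectors):
    """Return the word whose vector has the smallest Euclidean distance to target."""
    return min(vectors, key=lambda word: np.linalg.norm(vectors[word] - target))

# woman - man captures the "gender" offset; adding it to king should land near queen
offset = toy_vectors["woman"] - toy_vectors["man"]
result = nearest_word(toy_vectors["king"] + offset, toy_vectors)
print(result)
```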

These vectors can be used as features for machine learning models. Word vectors will typically improve the performance of your models over a bag-of-words encoding. SpaCy ships pretrained embeddings with its larger models. To access them, you need to load one of the large models, like `en_core_web_lg`. The embeddings are then available on each token through the `.vector` attribute.

In [None]:
import numpy as np
import spacy

# Need to load the large model to get the vectors
nlp = spacy.load('en_core_web_lg')

In [None]:
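# Disable the other pipeline components: we don't need them here and
# skipping them speeds this part up a bit. Calling disable_pipes() with no
# names disables nothing, so pass the component names explicitly.
# (In spaCy v3, nlp.select_pipes(disable=...) is the preferred spelling.)
text = "These vectors can be used as features for machine learning models."
with nlp.disable_pipes(*nlp.pipe_names):
    vectors = np.array([token.vector for token in nlp(text)])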

In [None]:
vectors.shape

The embeddings from SpaCy are 300-dimensional vectors. You see here that we get one vector for each token (the words plus the final period). However, we only have document-level labels, so our models can't use the word-level embeddings directly. Instead, you need a vector representation for the entire document. A document vector is calculated by averaging the word vectors of all the tokens in the document. These document vectors are then used to train the model.
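
As a minimal sketch of that averaging step, the snippet below uses a small invented matrix in place of real token vectors. The name `token_vectors` and its values are hypothetical stand-ins for the `vectors` array computed above.

```python
import numpy as np

# Hypothetical stand-in for token vectors: 3 tokens, 4 dimensions each
token_vectors = np.array([
    [1.0, 0.0, 2.0, 4.0],
    [3.0, 2.0, 0.0, 0.0],
    [2.0, 4.0, 1.0, 2.0],
])

# Average over the token axis to get a single vector for the whole document
doc_vector = token_vectors.mean(axis=0)
print(doc_vector.shape)
print(doc_vector)
```

On the real data the same operation would be `vectors.mean(axis=0)`, which reduces the per-token array to one 300-dimensional vector.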

SpaCy calculates this average document vector for you, and you can get it with `doc.vector`. Here I'll load in the spam data and convert the text to document vectors.

In [None]:
import pandas as pd

# Loading the spam data
# ham is the label for non-spam messages
spam = pd.read_csv('../input/nlp-course/spam.csv')

In [None]:
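# This takes a minute or so.
# Pass the component names explicitly; disable_pipes() with no names disables nothing.
with nlp.disable_pipes(*nlp.pipe_names):
    doc_vectors = np.array([nlp(text).vector for text in spam.text])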

In [None]:
doc_vectors.shape

## Classification Models

With the document vectors, we can train scikit-learn models using all the normal methods, like cross-validation.

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(doc_vectors, spam.label,
                                                    test_size=0.1, random_state=1)

[Support vector machines](https://scikit-learn.org/stable/modules/svm.html#svm) are a common model for text classification as they work well with high-dimensional data. Scikit-learn provides an SVM classifier, `LinearSVC`, which works similarly to other scikit-learn models.

In [None]:
from sklearn.svm import LinearSVC

# dual=False solves the primal problem instead of the dual, which is
# preferred (and faster) when there are more samples than features
svc = LinearSVC(random_state=1, dual=False, max_iter=10000)
svc.fit(X_train, y_train)
print(f"Accuracy: {svc.score(X_test, y_test) * 100:.3f}%")

Note that `LinearSVC` uses L2 regularization by default. You'll generally want to tune the regularization parameter `C` to find the value that works best for your data.
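
One way to do that search is a cross-validated grid search over `C`. This is a sketch under stated assumptions rather than a recommended recipe: the data is synthetic (generated with `make_classification` as a stand-in for the document vectors and labels), and the candidate values in `param_grid` are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

# Synthetic stand-in for the document vectors and spam labels
X, y = make_classification(n_samples=200, n_features=20, random_state=1)

# Smaller C means stronger regularization; these candidates are arbitrary
param_grid = {"C": [0.01, 0.1, 1, 10]}

search = GridSearchCV(
    LinearSVC(random_state=1, dual=False, max_iter=10000),
    param_grid,
    cv=5,  # 5-fold cross-validation for each candidate value
)
search.fit(X, y)

print(f"Best C: {search.best_params_['C']}")
print(f"Mean cross-validated accuracy: {search.best_score_:.3f}")
```

With the real data you would fit the search on `X_train` and `y_train` only, and keep the test set aside for the final accuracy check.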

## Document Similarity

Documents with similar content will in general have similar vectors. This means we can find similar documents by measuring the similarity between the vectors. A common metric for this is the *cosine similarity*, which is the cosine of the angle $\theta$ between two vectors, $\mathbf{a}$ and $\mathbf{b}$.

$$
\cos \theta = \frac{\mathbf{a}\cdot\mathbf{b}}{\| \mathbf{a} \| \, \| \mathbf{b} \|}
$$

This is the dot product of $\mathbf{a}$ and $\mathbf{b}$, divided by the product of their magnitudes. The cosine similarity ranges from -1 (the vectors point in opposite directions) to 1 (they point in the same direction). To calculate it, you can use [the metric from scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html) or write your own function.

In [None]:
def cosine_similarity(a, b):
    return a.dot(b)/np.sqrt(a.dot(a) * b.dot(b))

In [None]:
a = nlp("REPLY NOW FOR FREE TEA").vector
b = nlp("According to legend, Emperor Shen Nung discovered tea when leaves from a wild tree blew into his pot of boiling water.").vector
cosine_similarity(a, b)
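
To find the most similar document in a whole corpus, the same calculation can be applied to every row of a document-vector matrix at once. The sketch below is one possible approach, using a tiny invented matrix; `most_similar` and `toy_docs` are hypothetical names, and the code assumes that no vector is all zeros (which would cause a division by zero).

```python
import numpy as np

def most_similar(query, doc_matrix):
    """Return (row index, cosine similarity) of the row of doc_matrix closest to query."""
    # Scale everything to unit length so dot products equal cosine similarities
    query_unit = query / np.linalg.norm(query)
    docs_unit = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    sims = docs_unit @ query_unit
    best = int(np.argmax(sims))
    return best, float(sims[best])

# Invented 3-D "document vectors", for illustration only
toy_docs = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 2.0, 2.0],
    [3.0, 3.0, 0.0],
])
query = np.array([0.0, 1.0, 1.0])

index, similarity = most_similar(query, toy_docs)
print(index, round(similarity, 3))
```

Here the second row points in the same direction as the query, so it is returned even though its magnitude differs: cosine similarity ignores vector length.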

Next, you'll get word vectors for the Yelp reviews and try to improve on the classifier you trained with SpaCy. You'll also use document vectors to find similar reviews in the corpus.