# Vectorizing Language

Embeddings are both conceptually clever and practically effective. 

Let's apply them to the sentiment analysis model you built for the restaurant. Then you'll find the review in the dataset that is most similar to some example text. It's a task where you can easily judge for yourself how well the embeddings work.

In [None]:
%matplotlib inline

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import spacy

# Set up code checking
from learntools.core import binder
binder.bind(globals())
from learntools.nlp.ex3 import *
print("\nSetup complete")

In [None]:
# Load the large model to get the vectors
nlp = spacy.load('en_core_web_lg')

review_data = pd.read_csv('../input/nlp-course/yelp_ratings.csv')
review_data.head()

Here's an example of loading some document vectors. 

Calculating 44,500 document vectors takes about 20 minutes, so we'll get only the first 100. To save time, we'll load pre-saved document vectors for the hands-on coding exercises.

The result is a matrix of 100 rows and 300 columns. 

Why 100 rows?
Because we have one row for each of the 100 reviews.

Why 300 columns?
This is the same length as word vectors. See if you can figure out why document vectors have the same length as word vectors (some knowledge of linear algebra or vector math would be needed to figure this out).
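
If you'd like to check your reasoning: by default, spaCy's `Doc.vector` is the average of the document's token vectors. The sketch below uses random stand-in word vectors (not real spaCy output; the names `word_vectors` and `doc_vector` are illustrative) to show how averaging affects the shape.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in data: a 7-token "document", each token with a 300-dimensional vector.
# Real word vectors would come from the spaCy model; these are random placeholders.
word_vectors = rng.normal(size=(7, 300))

# Averaging over the token axis collapses the 7 rows into one 300-dimensional vector
doc_vector = word_vectors.mean(axis=0)

print(word_vectors.shape)  # (7, 300)
print(doc_vector.shape)    # (300,)
```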

Go ahead and run the following cell to load in the rest of the document vectors.

In [None]:
# Loading all document vectors from file
vectors = np.load('../input/nlp-course/review_vectors.npy')

## Exercise: Train a Scikit-learn model

Next you'll train a `LinearSVC` model using the document vectors. We've provided some reasonable values for the model parameters (specifically, setting `dual=False` speeds up training).

After running the LinearSVC model, you might try experimenting with other types of models you know to see whether it improves your results.

In [None]:
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(vectors, review_data.sentiment, 
                                                    test_size=0.1, random_state=1)

# Create the LinearSVC model
model = LinearSVC(C=10, random_state=1, dual=False)
# Fit the model
____

# Run this cell to see the model accuracy
print(f'Model test accuracy: {model.score(X_test, y_test)*100:.3f}%')

q_1.check()

In [None]:
# Uncomment if you need some guidance
# q_1.hint()
# q_1.solution()

In [None]:
#%%RM_IF(PROD)%%
model = LinearSVC(C=10, random_state=1, dual=False)
model.fit(X_train, y_train)
q_1.assert_check_passed()

In [None]:
# Scratch space in case you want to experiment with other models

#second_model = ____
#second_model.fit(X_train, y_train)
#print(f'Model test accuracy: {second_model.score(X_test, y_test)*100:.3f}%')
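
As one possible starting point for the scratch space, here is a minimal sketch that fits a logistic regression instead of a linear SVM. It runs on synthetic data so it is self-contained; `X_demo`, `y_demo`, and `demo_model` are made-up names, and the accuracy it prints says nothing about the Yelp reviews. To try it on the real data, you would fit on `X_train` and `y_train` instead.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)

# Synthetic stand-in for (vectors, review_data.sentiment): 400 "documents" with
# 20 features, labeled by the sign of the first feature plus a little noise.
X_demo = rng.normal(size=(400, 20))
y_demo = (X_demo[:, 0] + 0.1 * rng.normal(size=400) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.1, random_state=1)

# Same fit/score interface as LinearSVC, so the models are easy to swap
demo_model = LogisticRegression(max_iter=1000, random_state=1)
demo_model.fit(X_tr, y_tr)
demo_accuracy = demo_model.score(X_te, y_te)
print(f'Demo test accuracy: {demo_accuracy*100:.3f}%')
```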

## Exercise: Make a prediction

With the model trained, you can use it to predict the sentiment of other reviews. The review below is for a tea house in San Francisco. Use your model to predict whether the sentiment of the review is positive or negative.

In [None]:
review="""I absolutely love this place. The 360 degree glass windows with the Yerba buena garden view, tea pots all around and the smell of fresh tea everywhere transports you to what feels like a different zen zone within the city. I know the price is slightly more compared to the normal American size, however the food is very wholesome, the tea selection is incredible and I know service can be hit or miss often but it was on point during our most recent visit. Definitely recommend!

I would recommend the butternut squash gyoza and ideally the tea sets as I feel like it is better value!"""
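
scikit-learn estimators expect a 2-D array of shape `(n_samples, n_features)`, which is why the single document vector is reshaped in the next cell. The small NumPy sketch below shows what that reshape does, using a placeholder vector of zeros rather than a real spaCy vector.

```python
import numpy as np

# Placeholder for nlp(review).vector, which has shape (300,)
doc_vector = np.zeros(300)

# -1 tells NumPy to infer that dimension, giving one row with 300 columns
as_row = doc_vector.reshape((1, -1))

print(doc_vector.shape)  # (300,)
print(as_row.shape)      # (1, 300)
```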

In [None]:
# Fill in the ____ values
vector = nlp(____).vector.reshape((1, -1))
sentiment = ____
q_2.check()

In [None]:
# Uncomment if you need some guidance
# q_2.hint()
# q_2.solution()

In [None]:
#%%RM_IF(PROD)%%
vector = nlp(review).vector.reshape((1, -1))
sentiment = model.predict(vector)[0]
q_2.assert_check_passed()

In [None]:
print(f"Sentiment = {'Positive' if sentiment else 'Negative'}")

# Document Similarity

For the same tea house review, find the most similar review in the dataset using the cosine similarity.

## Exercise: Centering the Vectors

Sometimes people center document vectors when calculating similarities. That is, they calculate the mean vector from all documents, and they subtract this from each individual document's vector. Why do you think this could help with similarity metrics?

Uncomment the following line after you've decided your answer.

In [None]:
#q_3.solution()
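
If you'd like to check your answer numerically, the toy example below (two made-up 2-D "document vectors", not real embeddings) compares cosine similarity before and after centering. The two documents differ in opposite directions but share a large common offset.

```python
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / np.sqrt(a.dot(a) * b.dot(b))

# Toy "document vectors": two documents that point in opposite directions
# around a large shared offset (the offset stands in for common content).
offset = np.array([10.0, 10.0])
docs = np.array([[1.0, -1.0], [-1.0, 1.0]]) + offset

raw_sim = cosine_similarity(docs[0], docs[1])

# Center by subtracting the mean document vector
centered_docs = docs - docs.mean(axis=0)
centered_sim = cosine_similarity(centered_docs[0], centered_docs[1])

print(f"raw: {raw_sim:.3f}, centered: {centered_sim:.3f}")
# raw: 0.980, centered: -1.000
```

Before centering, the shared offset dominates and the two documents look nearly identical. After centering, the similarity reflects only how they differ from the average document.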

## Exercise: Find the most similar review

Given the review above, find the most similar document within the Yelp dataset using the cosine similarity.

In [None]:
def cosine_similarity(a, b):
    return np.dot(a, b)/np.sqrt(a.dot(a)*b.dot(b))

review_vec = nlp(review).vector

## Center the document vectors
# Calculate the mean for the document vectors, should have shape (300,)
vec_mean = vectors.mean(axis=0)
# Subtract the mean from the vectors
centered = ____

# Calculate similarities for each document in the dataset
# Make sure to subtract the mean from the review vector
sims = ____

# Get the index for the most similar document
most_similar = ____
q_4.check()

In [None]:
# Uncomment if you need some guidance
# q_4.hint()
# q_4.solution()

In [None]:
#%%RM_IF(PROD)%%
review_vec = nlp(review).vector

## Center the document vectors
# Calculate the mean for the document vectors
vec_mean = vectors.mean(axis=0)
# Subtract the mean from the vectors
centered = vectors - vec_mean

# Calculate similarities for each document in the dataset
# Make sure to subtract the mean from the review vector
sims = np.array([cosine_similarity(review_vec - vec_mean, vec) for vec in centered])

# Get the index for the most similar document
most_similar = sims.argmax()
q_4.assert_check_passed()

In [None]:
print(review_data.iloc[most_similar].text)

Even though there are many different sorts of businesses in our Yelp dataset, you should have found another tea shop. 
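
Computing similarities one document at a time works, but it calls `cosine_similarity` once per review. As an alternative sketch (not the course's solution), the same values can be computed with a single matrix-vector product. The random arrays and the names `demo_vectors` and `demo_query` are placeholders for the review vectors and the query review vector.

```python
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / np.sqrt(a.dot(a) * b.dot(b))

rng = np.random.default_rng(0)
demo_vectors = rng.normal(size=(50, 300))  # placeholder for the review vectors
demo_query = rng.normal(size=300)          # placeholder for the query review vector

# Center both the documents and the query with the same mean
demo_mean = demo_vectors.mean(axis=0)
demo_centered = demo_vectors - demo_mean
query_centered = demo_query - demo_mean

# One matrix-vector product gives every dot product; divide by the norms
norms = np.linalg.norm(demo_centered, axis=1) * np.linalg.norm(query_centered)
sims_fast = demo_centered @ query_centered / norms

# Same result as computing the similarity one document at a time
sims_loop = np.array([cosine_similarity(query_centered, v) for v in demo_centered])
print(np.allclose(sims_fast, sims_loop))  # True
```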

## Exercise: Other similar reviews

If you look at other similar reviews, you'll see many coffee shops. Why do you think reviews of coffee shops are similar to the example review, which mentions only tea?

In [None]:
#q_5.solution()
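
To look at more than the single best match, one option is to rank all similarity scores with `np.argsort`. The sketch below uses a short array of made-up scores (`demo_sims`) in place of the real `sims` array.

```python
import numpy as np

# Placeholder similarity scores; in the exercise these would be `sims`
demo_sims = np.array([0.12, 0.87, -0.30, 0.55, 0.91, 0.40])

# argsort is ascending, so reverse it and keep the first k indices
k = 3
top_k = np.argsort(demo_sims)[::-1][:k]

print(top_k)             # [4 1 3]
print(demo_sims[top_k])  # [0.91 0.87 0.55]
```

With the real data, you could then print `review_data.iloc[i].text` for each index `i` in `top_k`.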

# Congratulations!

You've finished the NLP course. It's an exciting field that will help you make use of vast amounts of data you didn't know how to work with before.

This course should be just your introduction. Try a project with text. You'll have fun with it, and your skills will continue growing.