# Vectorizing Language

In this exercise you'll use spaCy to convert the review text into word vectors, then train a scikit-learn model on those vectors. You'll also find the review in the dataset that is most similar to some example text.

In [None]:
%matplotlib inline

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Set up code checking
from learntools.core import binder
binder.bind(globals())
from learntools.nlp.ex3 import *
print("\nSetup complete")

In [None]:
review_data = pd.read_csv('../input/nlp-course/yelp_ratings.csv')
review_data.head()

In [None]:
import spacy

# Need to load the large model to get the vectors
nlp = spacy.load('en_core_web_lg')

## Exercise: Get document vectors

To start, use spaCy to get document vectors from the review text.

Returning all 44,500 document vectors takes about 20 minutes, so here you'll need to get only the first 100. For the rest of this exercise, I've provided a file with all of the document vectors.
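
For intuition about what a document vector is: by default, spaCy's `Doc.vector` is the average of the document's token vectors. The sketch below mimics that averaging with NumPy only. The 4-dimensional `toy_vectors` table, the `toy_doc_vector` helper, and the whitespace tokenizer are made up for illustration; they are not spaCy's vectors or tokenization.

```python
import numpy as np

# Hypothetical 4-dimensional word vectors (the vectors in this exercise have 300 dimensions)
toy_vectors = {
    "great": np.array([0.9, 0.1, 0.0, 0.2]),
    "tea": np.array([0.1, 0.8, 0.3, 0.0]),
    "shop": np.array([0.0, 0.7, 0.5, 0.1]),
}

def toy_doc_vector(text):
    """Average the vectors of the known words in `text` (naive whitespace split)."""
    words = [w for w in text.lower().split() if w in toy_vectors]
    if not words:
        # No known words: fall back to a zero vector
        return np.zeros(4)
    return np.mean([toy_vectors[w] for w in words], axis=0)

doc_vec = toy_doc_vector("Great tea shop")
print(doc_vec.shape)  # (4,)
```

Stacking one such vector per review is what produces an array with one row per document, which is the shape the exercise below asks for.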

In [None]:
reviews = review_data[:100]
# We just want the vectors so we can turn off other models in the pipeline
with nlp.disable_pipes():
    # vectors should be a Numpy array with shape (100, 300)
    vectors = ____
    
q_1.check()

In [None]:
# Uncomment if you need some guidance
# q_1.hint()
# q_1.solution()

In [None]:
#%%RM_IF(PROD)%%
reviews = review_data[:100]
# We just want the vectors so we can turn off other models in the pipeline
with nlp.disable_pipes():
    vectors = np.array([nlp(review.text).vector for idx, review in reviews.iterrows()])
    
q_1.assert_check_passed()

Run the cell below to load in the rest of the document vectors.

In [None]:
# Loading all document vectors from file
vectors = np.load('../input/nlp-course/review_vectors.npy')

## Exercise: Train a Scikit-learn model

Next up, train a `LinearSVC` model using the document vectors. Set the regularization parameter to 10, which gives better results than the default. Also set the random state to 1 and `dual=False` (this speeds up training without loss in accuracy).
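
As a rough illustration of the workflow on data that is not the Yelp reviews, here is a sketch that fits a `LinearSVC` on synthetic points. The `toy_` names and the two Gaussian clusters are assumptions made up for this example; in `LinearSVC` the regularization parameter is the `C` argument.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Synthetic stand-in for document vectors: two well-separated clusters in 5 dimensions
rng = np.random.default_rng(0)
toy_X = np.vstack([rng.normal(loc=2.0, size=(100, 5)),
                   rng.normal(loc=-2.0, size=(100, 5))])
toy_y = np.array([1] * 100 + [0] * 100)

# Hold out 10% of the rows for testing, as in the exercise
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.1, random_state=1)

# Same settings as described above, applied to the toy data
toy_model = LinearSVC(C=10, random_state=1, dual=False)
toy_model.fit(toy_X_train, toy_y_train)

toy_accuracy = toy_model.score(toy_X_test, toy_y_test)
print(f"Toy test accuracy: {toy_accuracy:.3f}")
```

The toy clusters are far easier to separate than real reviews, so the accuracy printed here says nothing about what to expect on the Yelp data.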

In [None]:
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(vectors, review_data.sentiment, 
                                                    test_size=0.1, random_state=1)

In [None]:
# Create the LinearSVC model
model = ____
# Fit the model
____

q_2.check()

In [None]:
# Uncomment if you need some guidance
# q_2.hint()
# q_2.solution()

In [None]:
#%%RM_IF(PROD)%%
model = LinearSVC(C=10, random_state=1, dual=False)
model.fit(X_train, y_train)
q_2.assert_check_passed()

In [None]:
# Run this cell to see the model accuracy
print(f'Model test accuracy: {model.score(X_test, y_test)*100:.3f}%')

You should get an accuracy of 93.9%.

## Exercise: Make a prediction

With the model trained, you can use it to predict the sentiment of other reviews. The review below is for a tea house in San Francisco. Use your model to predict whether the sentiment of the review is positive or negative.

In [None]:
review="""I absolutely love this place. The 360 degree glass windows with the Yerba buena garden view, tea pots all around and the smell of fresh tea everywhere transports you to what feels like a different zen zone within the city. I know the price is slightly more compared to the normal American size, however the food is very wholesome, the tea selection is incredible and I know service can be hit or miss often but it was on point during our most recent visit. Definitely recommend!

I would recommend the butternut squash gyoza and ideally the tea sets as I feel like it is better value!"""

In [None]:
vector = ____
sentiment = ____
q_3.check()

In [None]:
# Uncomment if you need some guidance
# q_3.hint()
# q_3.solution()

In [None]:
#%%RM_IF(PROD)%%
vector = nlp(review).vector.reshape((1, -1))
sentiment = model.predict(vector)[0]
q_3.assert_check_passed()

In [None]:
print(f"Sentiment = {'Positive' if sentiment else 'Negative'}")

# Document Similarity

For the same tea house review, find the most similar review in the dataset using the cosine similarity.

## Exercise: Centering the Vectors

Sometimes you'll get better results when measuring similarities if you center the document vectors. This means you subtract the mean of the vectors from each vector, so the new mean is 0. Why do you think this could help with similarity metrics?
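
The mechanics of centering can be sketched with a small NumPy array. The `toy_docs` values are invented for illustration (5 documents, 3 dimensions) and only show the operation itself, not why it helps.

```python
import numpy as np

# Hypothetical stand-in for document vectors: 5 documents, 3 dimensions
toy_docs = np.array([
    [10.0, 10.2, 9.9],
    [10.1, 9.8, 10.0],
    [9.9, 10.0, 10.3],
    [10.2, 10.1, 9.7],
    [9.8, 9.9, 10.1],
])

toy_mean = toy_docs.mean(axis=0)    # per-dimension mean, shape (3,)
toy_centered = toy_docs - toy_mean  # broadcasting subtracts the mean from every row

print(toy_mean.shape)
# After centering, the mean of every dimension is (numerically) zero
print(np.allclose(toy_centered.mean(axis=0), 0.0))
```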

Uncomment the following line after you've decided your answer.

In [None]:
#q_4.solution()

## Exercise: Find the most similar review

Given the review above, find the most similar document within the Yelp dataset using the cosine similarity.

In [None]:
def cosine_similarity(a, b):
    return np.dot(a, b)/np.sqrt(a.dot(a)*b.dot(b))
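
To get a feel for the values this function returns, the sketch below repeats the same definition and applies it to a few hand-picked 2-dimensional vectors (made up for illustration). Vectors pointing the same way score 1, perpendicular vectors score 0, and opposite vectors score -1, regardless of their lengths.

```python
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / np.sqrt(a.dot(a) * b.dot(b))

a = np.array([1.0, 0.0])
same_direction = np.array([3.0, 0.0])  # longer, but points the same way as a
orthogonal = np.array([0.0, 2.0])      # perpendicular to a
opposite = np.array([-1.0, 0.0])       # points the opposite way

print(cosine_similarity(a, same_direction))  # 1.0
print(cosine_similarity(a, orthogonal))      # 0.0
print(cosine_similarity(a, opposite))        # -1.0
```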

In [None]:
review_vec = nlp(review).vector

## Center the document vectors
# Calculate the mean for the document vectors, should have shape (300,)
vec_mean = ____
# Subtract the mean from the vectors
centered = ____

# Calculate similarities for each document in the dataset
# Make sure to subtract the mean from the review vector
sims = ____

# Get the index for the most similar document
most_similar = ____
q_5.check()

In [None]:
# Uncomment if you need some guidance
# q_5.hint()
# q_5.solution()

In [None]:
#%%RM_IF(PROD)%%
review_vec = nlp(review).vector

## Center the document vectors
# Calculate the mean for the document vectors
vec_mean = vectors.mean(axis=0)
# Subtract the mean from the vectors
centered = vectors - vec_mean

# Calculate similarities for each document in the dataset
# Make sure to subtract the mean from the review vector
sims = np.array([cosine_similarity(review_vec - vec_mean, vec) for vec in centered])

# Get the index for the most similar document
most_similar = sims.argmax()
q_5.assert_check_passed()

In [None]:
print(review_data.iloc[most_similar].text)

Even though there are many different sorts of businesses in our Yelp dataset, you should have found another tea shop. 

## Exercise: Other similar reviews

If you look at other similar reviews, you'll see many coffee shops. Why do you think reviews for coffee are similar to the example review which mentions only tea?

In [None]:
#q_6.solution()

Congratulations on finishing this course! At this point you know how to get embeddings for each word in the documents, but you're only using the averaged document vectors with these models. Using the word vectors themselves might result in even better performing models. To do this you'll want to use a recurrent neural network (RNN for short). We won't cover RNNs in this course, but look them up if you want to learn about state-of-the-art NLP models.