# Vectorizing Language

In this exercise you will use spaCy to convert the review text into document vectors.

You'll first get these vectors with spaCy. Then, you'll use them to train a logistic regression model and a linear support vector classifier.

In [1]:
%matplotlib inline

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Set up code checking
from learntools.core import binder
binder.bind(globals())
from learntools.nlp.ex2 import *
print("\nSetup complete")


Setup complete


In [2]:
review_data = pd.read_csv('../input/nlp-course/yelp_ratings.csv', index_col=0)
review_data.head()

| | text | stars | sentiment |
|---|---|---|---|
| 0 | Total bill for this horrible service? Over $8G... | 1.0 | 0 |
| 1 | I \*adore\* Travis at the Hard Rock's new Kelly ... | 5.0 | 1 |
| 2 | I have to say that this office really has it t... | 5.0 | 1 |
| 3 | Went in for a lunch. Steak sandwich was delici... | 5.0 | 1 |
| 4 | Today was my second out of three sessions I ha... | 1.0 | 0 |


In [3]:
review_data = review_data.drop(index=[7455, 26631, 34073])

In [4]:
import spacy

# Need to load the large model to get the vectors
nlp = spacy.load('en_core_web_lg')

### Exercise: Get document vectors

Use spaCy to get document vectors from the review text.

Computing all 44,500 document vectors takes about 20 minutes, so here you'll compute only the first 100. For the rest of this exercise, I've provided a file with all 44,500 document vectors.

In [5]:
# Compute vectors for only the first 100 reviews
reviews = review_data[:100]
# We just want the vectors, so we can turn off every component in the pipeline
with nlp.disable_pipes(*nlp.pipe_names):
    vectors = np.array([nlp(review.text).vector for idx, review in reviews.iterrows()])

The cell below will load in the rest of the document vectors.

In [12]:
# Loading all document vectors from file
vectors = np.load('../input/nlp-course/review_vectors.npy')

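### Exercise: Document similarity

Find the document most similar to an example. To do this, subtract the mean document vector from both the example's vector and every review vector, then compute the cosine similarity between the example and each review.
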
In [32]:
text = """I can't say enough about the amazing experience we had at Mazda San Francisco. My wife and I just purchased a new Mazda3 Grand Touring and couldn't have worked with a better team to make it happen. Michael was incredibly friendly and helpful, as were Brandon and Chris, who all contributed to working the perfect deal out for us. We were so impressed by their teamwork and the very laid back vibe of the dealership.

The car itself is impeccable, as we new it would be, but the unknowns about the car-buying experience can be a bit intimidating. These guys changed our outlook on that completely. If you're in the market for a new car or are unsure whether to go new or used, swing by here first. 100% recommend Mazda SF. Super knowledgeable, great selection, and the most easy-going car-buying experience you'll find."""

In [33]:
doc = nlp(text)

In [52]:
vec_mean = vectors.mean(axis=0)

In [55]:
centered = vectors - vec_mean
centered.shape

(44530, 300)

In [28]:
def cosine_similarity(a, b):
    # Dot product of a and b divided by the product of their lengths
    return np.dot(a, b)/np.sqrt(a.dot(a)*b.dot(b))

In [57]:
sims = np.array([cosine_similarity(doc.vector - vec_mean, vec) for vec in centered])

In [58]:
sims[:10]

array([ 0.07282877,  0.24734822,  0.49633935, -0.3401171 ,  0.08971624,
       -0.22052783,  0.17142576,  0.21271807,  0.14273277,  0.12064052],
      dtype=float32)
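The loop above calls `cosine_similarity` once per review. As a rough sketch of an equivalent vectorized computation, the snippet below uses small random arrays as stand-ins for `vectors` and `doc.vector` (the names `toy_vectors` and `toy_query` are illustrative assumptions, not part of the exercise) and checks that both approaches agree.

```python
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / np.sqrt(a.dot(a) * b.dot(b))

# Toy stand-ins (assumptions): 5 "documents" with 4-dimensional vectors
rng = np.random.default_rng(0)
toy_vectors = rng.normal(size=(5, 4))
toy_query = rng.normal(size=4)

# Center everything on the mean document vector, as in the cells above
toy_mean = toy_vectors.mean(axis=0)
toy_centered = toy_vectors - toy_mean
query_centered = toy_query - toy_mean

# Loop version, mirroring the exercise
sims_loop = np.array([cosine_similarity(query_centered, vec) for vec in toy_centered])

# Vectorized version: one matrix-vector product, then divide by the norms
norms = np.linalg.norm(toy_centered, axis=1) * np.linalg.norm(query_centered)
sims_vectorized = toy_centered @ query_centered / norms

print(np.allclose(sims_loop, sims_vectorized))
```

On the full set of review vectors, the vectorized form replaces tens of thousands of Python-level function calls with a single matrix-vector product.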

In [45]:
review_data.iloc[2].text

"I have to say that this office really has it together, they are so organized and friendly!  Dr. J. Phillipp is a great dentist, very friendly and professional.  The dental assistants that helped in my procedure were amazing, Jewel and Bailey helped me to feel comfortable!  I don't have dental insurance, but they have this insurance through their office you can purchase for $80 something a year and this gave me 25% off all of my dental work, plus they helped me get signed up for care credit which I knew nothing about before this visit!  I highly recommend this office for the nice synergy the whole office has!"

In [61]:
review_data.iloc[sims.argmax()].text

'If you are looking to buy a new car and want the best possible experience then I recommend you visit Kwasi at BMW of Las Vegas.  My husband and I recently purchased a new 2014 BMW 328i and we couldn\'t be happier!  Our sales representative, Kwasi, took his time to listen to our wants and needs, took us for a test drive and searched his computer system for about 30 minutes until he found the exact car we were looking for.  He even had his sales manager, Steve, go to another location to pick up the exact car for us to see and test drive.  Both Kwasi and Steve were phenomenal to work with.  After the test drive we were sold, these cars drive like a dream.  They truly live up to the title "ultimate driving machine".  Kwasi quickly had the paperwork drawn up and after pairing our phones and giving us a brief overview of the cars features we were on our way.  Kwasi continued to follow-up with us after our purchase to see if we had any questions about the car and offered to have us come back

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(vectors, review_data.sentiment, test_size=0.1, random_state=1)

### Exercise: Fit a logistic regression model

Use the document vectors to train a scikit-learn logistic regression model and calculate the accuracy.

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

model = LogisticRegression(random_state=1)
model.fit(X_train, y_train)

accuracy = metrics.accuracy_score(y_test, model.predict(X_test))
print(accuracy)

### Exercise: Find the best regularization parameter

You can find a model with better performance by searching through hyperparameter values. For example, here you can adjust the regularization parameter `C`. The easiest way to do this is with cross-validation using `LogisticRegressionCV`. This will automatically split up your data into folds and measure the scoring metric for each fold. Using this, you can search through a range of values for `C` to maximize the accuracy.

In [None]:
from sklearn.linear_model import LogisticRegressionCV
c_vals = [0.1, 1, 10, 100, 1000, 10000]
model = LogisticRegressionCV(Cs=c_vals, scoring='accuracy',
                             cv=5, max_iter=10000,
                             random_state=1).fit(X_train, y_train)

Now you can test the model's performance on the hold-out test data.

In [None]:
print(f'Model test accuracy: {model.score(X_test, y_test)}')
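It can also be useful to check which value of `C` the cross-validation selected. The sketch below is only an illustration: it fits the same kind of model on synthetic data from `make_classification` (an assumption standing in for the review vectors) and reads the fitted `C_` attribute.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV

# Synthetic stand-in data (assumption): 300 samples, 20 features
X_toy, y_toy = make_classification(n_samples=300, n_features=20, random_state=1)

c_vals = [0.1, 1, 10, 100, 1000, 10000]
toy_model = LogisticRegressionCV(Cs=c_vals, scoring='accuracy',
                                 cv=5, max_iter=10000,
                                 random_state=1).fit(X_toy, y_toy)

# C_ holds the selected regularization strength; ravel handles array or scalar
best_c = float(np.ravel(toy_model.C_)[0])
print(f'Selected C: {best_c}')
```

If the selected value sits at either end of `c_vals`, that may be a hint to widen the search range.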

### Exercise: Try a different model

It's possible to get better accuracy using a different model. Here, try scikit-learn's `LinearSVC` model to see if you can improve on the logistic regression model. Again, use cross-validation to find the best value for the regularization parameter. This time you'll need to do the cross-validation yourself with `cross_val_score`.

In [None]:
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

In [None]:
scores, models = [], []
for c in c_vals:
    clf = LinearSVC(C=c, random_state=1, dual=False, max_iter=10000)
    cv_score = cross_val_score(clf, X_train, y_train, cv=5, scoring='accuracy')
    scores.append(cv_score.mean())
    models.append(clf)

In [None]:
# Index of the C value with the highest mean cross-validation accuracy
best_idx = np.array(scores).argmax()
model = models[best_idx].fit(X_train, y_train)
print(f'Model test accuracy: {model.score(X_test, y_test)}')

You should get an accuracy of 94.6%, slightly better than the best logistic regression model.
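As an alternative to the manual loop, scikit-learn's `GridSearchCV` can run the same kind of search and refit the best model on the full training set. This is a different tool from the `cross_val_score` approach the exercise asks for, shown here only as a sketch on synthetic data (an assumption standing in for the review vectors).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

# Synthetic stand-in data (assumption): 300 samples, 20 features
X_toy, y_toy = make_classification(n_samples=300, n_features=20, random_state=1)

c_vals = [0.1, 1, 10, 100, 1000, 10000]
search = GridSearchCV(
    LinearSVC(random_state=1, dual=False, max_iter=10000),
    param_grid={'C': c_vals},
    cv=5,
    scoring='accuracy',
)
# With the default refit=True, the best model is retrained on all of X_toy
search.fit(X_toy, y_toy)

print(search.best_params_, search.best_score_)
```

After fitting, `search.best_estimator_` is the refit model and can be scored on held-out data in the same way as `model` above.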

Congratulations on finishing this course! At this point you know how to get embeddings for each word in the documents, but you're only using the averaged document vectors with these models. Using the word vectors themselves might result in even better performing models. To do this you'll want to use a recurrent neural network (RNN for short). We won't cover RNNs in this course, but look them up if you want to learn about state-of-the-art NLP models.
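One reason the word vectors themselves might help is that averaging discards word order. The toy sketch below uses made-up 3-dimensional embeddings (the `toy_embeddings` values are assumptions for illustration, not real spaCy vectors) to show two sentences with opposite meanings collapsing to the same averaged vector.

```python
import numpy as np

# Made-up 3-dimensional word vectors (assumption, for illustration only)
toy_embeddings = {
    'dog':   np.array([1.0, 0.0, 2.0]),
    'bites': np.array([0.0, 3.0, 1.0]),
    'man':   np.array([2.0, 1.0, 0.0]),
}

def average_vector(tokens):
    """Average the word vectors of a token list into one document vector."""
    return np.mean([toy_embeddings[t] for t in tokens], axis=0)

doc_a = average_vector(['dog', 'bites', 'man'])
doc_b = average_vector(['man', 'bites', 'dog'])

# Same words in a different order give the same averaged vector
print(np.allclose(doc_a, doc_b))
```

A model that reads the word vectors in sequence can, in principle, tell these two sentences apart, while a model trained on the averages cannot.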