# Vectorizing Language

In this exercise you will use spaCy to convert the review text into document vectors.

You'll first get these vectors with spaCy. Then you'll use them to train a logistic regression model and a linear support vector classifier.

In [1]:
%matplotlib inline

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Set up code checking
from learntools.core import binder
binder.bind(globals())
from learntools.nlp.ex2 import *
print("\nSetup complete")


Setup complete


In [2]:
review_data = pd.read_csv('../input/nlp-course/yelp_ratings.csv', index_col=0)
review_data.head()

|   | text | stars | sentiment |
|---|---|---|---|
| 0 | Total bill for this horrible service? Over $8G... | 1.0 | 0 |
| 1 | I \*adore\* Travis at the Hard Rock's new Kelly ... | 5.0 | 1 |
| 2 | I have to say that this office really has it t... | 5.0 | 1 |
| 3 | Went in for a lunch. Steak sandwich was delici... | 5.0 | 1 |
| 4 | Today was my second out of three sessions I ha... | 1.0 | 0 |


In [3]:
import spacy

# Need to load the large model to get the vectors
nlp = spacy.load('en_core_web_lg')

### Exercise: Get document vectors

Use spaCy to get document vectors from the review text.

Computing all 44,500 document vectors takes about 20 minutes, so here you only need to compute the first 100. For the rest of this exercise, I've provided a file with all 44,500 document vectors.
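
As a rough illustration of what a document vector is, the sketch below averages a few word vectors with NumPy. The three-dimensional `toy_vectors` and the `document_vector` helper are hypothetical stand-ins, not spaCy's implementation. Skipping tokens that have no vector is a simplification made here.

```python
import numpy as np

# Hypothetical 3-dimensional word vectors, made up for illustration
toy_vectors = {
    "great": np.array([0.9, 0.1, 0.0]),
    "food": np.array([0.2, 0.8, 0.1]),
    "service": np.array([0.1, 0.7, 0.4]),
}

def document_vector(tokens, word_vectors):
    """Average the vectors of the tokens that have one (zeros if none do)."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        dim = len(next(iter(word_vectors.values())))
        return np.zeros(dim)
    return np.mean(known, axis=0)

doc_vec = document_vector(["great", "food"], toy_vectors)
print(doc_vec)
```

A document made up only of tokens without vectors ends up as an all-zero vector in this sketch.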

In [25]:
review_data.iloc[7455]

text         :(
stars         1
sentiment     0
Name: 7455, dtype: object

In [None]:
# We only need the vectors, so disable the other pipeline components.
# Calling disable_pipes() with no names would leave every component enabled.
with nlp.disable_pipes(*nlp.pipe_names):
    vectors = np.array([nlp(text).vector for text in review_data.text[:100]])

The cell below loads the full set of precomputed document vectors.

In [4]:
# Loading all document vectors from file
vectors = np.load('../input/nlp-course/review_vectors.npy')

You'll split the data into train and test sets further down, after a short exercise on document similarity.

### Exercise: Document similarity

Find the review whose document vector is most similar to the vector of an example review, using cosine similarity.

In [6]:
text = "One of the last real pubs in San Francisco. They have solid food (the bistro burger, chicken tenders and hummus plate are my favorites). They have a great selection of taps and the owner Neil, is a great guy. It's one of the only non-posh San Francisco pubs you'll find with a clean atmosphere, food and beer. I've been in the neighborhood for 2-3 years and come here weekly. I highly recommend for a relaxing night and some decent food!"

In [7]:
text_vector = nlp(text).vector

In [8]:
a = text_vector

In [12]:
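def cosine_similarity(a, b):
    # Dot product divided by the product of the two vector norms.
    # Undefined (NaN) if either vector is all zeros.
    return np.dot(a, b)/np.sqrt(a.dot(a)*b.dot(b))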

In [19]:
sims = np.array([cosine_similarity(text_vector, vec) for vec in vectors])

  


In [23]:
vectors[7456]

array([-1.59594957e-02,  1.74750701e-01, -1.43281862e-01, -1.05491593e-01,
        1.07150845e-01, -2.63998043e-02,  4.11011614e-02, -2.59563357e-01,
        9.51598398e-03,  2.15807128e+00, -7.23177046e-02,  3.73177268e-02,
        9.84339975e-03, -6.19993657e-02, -1.73671991e-01, -1.39454722e-01,
       -4.87762317e-02,  1.06987953e+00, -1.33403018e-01,  1.82681810e-02,
       -5.22443168e-02, -5.80078661e-02, -4.15326608e-03, -1.38557283e-02,
       -4.22815606e-02,  5.30283479e-03, -1.57636940e-01, -1.19175091e-01,
        1.12210467e-01, -8.98163170e-02, -4.70235758e-02,  4.73255143e-02,
       -3.16928029e-02,  5.19502647e-02,  7.86208361e-02, -6.96767718e-02,
        3.39470953e-02, -6.29663328e-03, -1.46525025e-01, -1.36079848e-01,
        2.27249358e-02,  1.06234916e-01,  7.26362392e-02, -6.96387962e-02,
        4.26478758e-02,  7.52198473e-02, -1.64077535e-01, -5.70037402e-02,
       -6.06145374e-02,  4.77300072e-03, -8.43887627e-02,  6.30923882e-02,
        2.24606059e-02, -

In [21]:
sims.argmax()

7455
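
Index 7455 is the `:(` review inspected earlier, which is an unlikely best match for a pub review. One possible explanation, which is an assumption the notebook does not confirm, is that this review's document vector is all zeros. In that case the cosine similarity is `0/0 = NaN`, and `argmax` returns the position of the first NaN. The sketch below reproduces that behavior on made-up vectors and shows a hypothetical `safe_cosine_similarity` helper that returns 0.0 for zero-length vectors.

```python
import numpy as np

def safe_cosine_similarity(a, b):
    """Cosine similarity that returns 0.0 when either vector has zero length."""
    denom = np.sqrt(a.dot(a) * b.dot(b))
    if denom == 0:
        return 0.0
    return float(np.dot(a, b) / denom)

# Made-up data: one query and three "documents"
query = np.array([1.0, 2.0, 0.0])
docs = np.array([
    [0.0, 0.0, 0.0],  # zero vector, similarity is undefined
    [2.0, 4.0, 0.0],  # same direction as the query
    [0.0, 0.0, 3.0],  # orthogonal to the query
])

# Naive version: the zero vector produces NaN (warning silenced here)
with np.errstate(invalid="ignore"):
    naive = np.array([np.dot(query, d) / np.sqrt(query.dot(query) * d.dot(d))
                      for d in docs])

safe = np.array([safe_cosine_similarity(query, d) for d in docs])

print(naive.argmax(), safe.argmax())  # 0 1
```

With the guard in place, the document pointing in the same direction as the query wins, as expected.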

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(vectors, review_data.sentiment, test_size=0.1, random_state=1)

### Exercise: Fit a logistic regression model

Use the document vectors to train a scikit-learn logistic regression model and calculate the accuracy.

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

model = LogisticRegression(random_state=1)
model.fit(X_train, y_train)

accuracy = metrics.accuracy_score(y_test, model.predict(X_test))
print(accuracy)
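
For reference, accuracy is the fraction of predictions that match the true labels. A minimal NumPy sketch with made-up labels (the `toy_` names are illustrative only):

```python
import numpy as np

# Made-up labels and predictions for five reviews
toy_true = np.array([1, 0, 1, 1, 0])
toy_pred = np.array([1, 0, 0, 1, 0])

# Fraction of positions where the prediction equals the label
toy_accuracy = float(np.mean(toy_true == toy_pred))
print(toy_accuracy)  # 0.8
```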

### Exercise: Find the best regularization parameter

You can find a model with better performance by searching through hyperparameter values. For example, here you can adjust the regularization parameter `C`. The easiest way to do this is with cross-validation using `LogisticRegressionCV`. This will automatically split up your data into folds and measure the scoring metric for each fold. Using this, you can search through a range of values for `C` to maximize the accuracy.
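
To make the idea of folds concrete, here is a simplified sketch of how a data set can be split into k train/validation pairs. The `k_fold_indices` helper is hypothetical and uses plain shuffled folds. It is not scikit-learn's implementation, which stratifies the folds by class for classifiers when `cv` is an integer.

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) pairs for k shuffled, non-overlapping folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), k)
    for i, val_idx in enumerate(folds):
        # Every fold except the i-th one is used for training
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train_idx, val_idx

for train_idx, val_idx in k_fold_indices(n_samples=10, k=5):
    print(len(train_idx), len(val_idx))
```

Each sample lands in exactly one validation fold. The model is trained k times and the scoring metric is averaged over the k validation folds.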

In [None]:
from sklearn.linear_model import LogisticRegressionCV
c_vals = [0.1, 1, 10, 100, 1000, 10000]
model = LogisticRegressionCV(Cs=c_vals, scoring='accuracy',
                             cv=5, max_iter=10000,
                             random_state=1).fit(X_train, y_train)

Now you can test the model's performance on the hold-out test data.

In [None]:
print(f'Model test accuracy: {model.score(X_test, y_test)}')

### Exercise: Try a different model

It's possible to get better accuracy using a different model. Here, try scikit-learn's `LinearSVC` model to see if you can improve on the logistic regression model. Again, use cross-validation to find the best value for the regularization parameter. This time you'll need to do the cross-validation yourself with `cross_val_score`.

In [None]:
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

In [None]:
scores, models = [], []
for c in c_vals:
    clf = LinearSVC(C=c, random_state=1, dual=False, max_iter=10000)
    cv_score = cross_val_score(clf, X_train, y_train, cv=5, scoring='accuracy')
    scores.append(cv_score.mean())
    models.append(clf)

In [None]:
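# Index of the C value with the highest mean cross-validation accuracy
best_idx = np.array(scores).argmax()
# Refit the chosen (still unfitted) classifier on the full training set
model = models[best_idx].fit(X_train, y_train)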
print(f'Model test accuracy: {model.score(X_test, y_test)}')

You should get an accuracy of 94.6%, slightly better than the best logistic regression model.
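
As an alternative to the manual loop, the same kind of search can be written with scikit-learn's `GridSearchCV`. This is a different tool from the `cross_val_score` approach the exercise asks for. The sketch below runs on synthetic data from `make_classification` as a stand-in for the document vectors. The names `X_demo`, `y_demo`, and `c_grid` are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

# Synthetic stand-in for the document vectors and sentiment labels
X_demo, y_demo = make_classification(n_samples=200, n_features=20, random_state=1)

c_grid = [0.1, 1, 10, 100]
search = GridSearchCV(
    LinearSVC(dual=False, max_iter=10000, random_state=1),
    param_grid={"C": c_grid},
    cv=5,
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

# Best C and its mean cross-validation accuracy
print(search.best_params_["C"], search.best_score_)

# By default the best estimator is refit on all of the data passed to fit()
best_model = search.best_estimator_
```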

Congratulations on finishing this course! At this point you know how to get embeddings for each word in the documents, but these models only use the averaged document vectors. Using the word vectors themselves might result in even better-performing models. To do this you'll want to use a recurrent neural network (RNN for short). We won't cover RNNs in this course, but look them up if you want to learn about state-of-the-art NLP models.