# Vectorizing Language

In this exercise you will use spaCy to convert review text into word vectors (also known as word embeddings). These vectors can be used as features for machine learning models, and they typically give better model performance than one-hot or bag-of-words encodings.

You'll first get these vectors with spaCy, then use them to train a logistic regression model. You can get vectors for the individual words in a document, but a logistic regression model expects one fixed-length feature vector per example, so it can't make use of word-level encodings directly. Instead, you need a vector representation for the entire document. A document vector is calculated by averaging the word vectors of all the tokens in the document. These document vectors are then used to train the model.
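The averaging step described above can be sketched with plain NumPy. The three 4-dimensional word vectors here are made-up values for illustration only; the vectors in spaCy's `en_core_web_lg` model have 300 dimensions.

```python
import numpy as np

# Hypothetical word vectors for a three-token document (one row per token).
# These numbers are invented for illustration, not taken from spaCy.
word_vectors = np.array([
    [1.0, 0.0, 2.0, 4.0],   # "great"
    [3.0, 2.0, 0.0, 0.0],   # "steak"
    [2.0, 4.0, 1.0, 2.0],   # "sandwich"
])

# The document vector is the element-wise mean over the tokens (axis 0),
# so it has the same length as a single word vector.
doc_vector = word_vectors.mean(axis=0)
print(doc_vector)  # [2. 2. 1. 2.]
```

However many tokens a review contains, the result always has the same length, which is what lets a model such as logistic regression consume it as a feature vector.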

In [1]:
%matplotlib inline

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Set up code checking
from learntools.core import binder
binder.bind(globals())
from learntools.nlp.ex2 import *
print("\nSetup complete")


Setup complete


In [2]:
review_data = pd.read_csv('../input/yelp_ratings.csv', index_col=0)
review_data.head()

Unnamed: 0,text,stars,sentiment
0,Total bill for this horrible service? Over $8G...,1.0,0
1,I *adore* Travis at the Hard Rock's new Kelly ...,5.0,1
2,I have to say that this office really has it t...,5.0,1
3,Went in for a lunch. Steak sandwich was delici...,5.0,1
4,Today was my second out of three sessions I ha...,1.0,0


In [3]:
import spacy

# Need to load the large model to get the vectors
nlp = spacy.load('en_core_web_lg')

### Exercise: Get document vectors

Use spaCy to get document vectors from the review text.

Returning all 44,500 document vectors takes about 20 minutes, so here you'll need to get only the first 100. For the rest of this exercise, I've provided a file with all 44,500 document vectors.

In [4]:
# We only need the vectors, so disable every component in the pipeline.
# Calling disable_pipes() with no names would leave all of them enabled.
with nlp.disable_pipes(*nlp.pipe_names):
    vectors = np.array([nlp(text).vector for text in review_data.text[:100]])

The cell below loads the full set of precomputed document vectors from a file.

In [8]:
# Loading all document vectors from file
vectors = np.load('../input/review_vectors.npy')

Split the data into train and test sets.

In [54]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(vectors, review_data.sentiment, test_size=0.1, random_state=1)
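As a quick check of what `test_size=0.1` does, here is a minimal sketch on synthetic data. The array sizes and the random features are assumptions for illustration; they are not the review vectors.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins: 1000 "documents" with 300-dimensional vectors
# and random binary labels (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 300))
y = rng.integers(0, 2, size=1000)

# Hold out 10% of the rows; random_state fixes the shuffle so the
# same split is produced on every run.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.1, random_state=1)
print(X_tr.shape, X_te.shape)  # (900, 300) (100, 300)
```

Fixing `random_state` matters in this exercise because every model below is scored on the same held-out rows, which keeps the accuracies comparable.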

### Exercise: Fit a logistic regression model

Use the document vectors to train a scikit-learn logistic regression model and calculate its accuracy on the test set.

In [64]:
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

model = LogisticRegression()
model.fit(X_train, y_train)

accuracy = metrics.accuracy_score(y_test, model.predict(X_test))
print(accuracy)



0.9387067804220925


### Exercise: Find the best regularization parameter

You can get better performance by searching through different hyperparameters. Here you can adjust the regularization parameter `C`. The easiest way to do this is with cross-validation using `LogisticRegressionCV`. This automatically splits your training data into k folds and measures the scoring metric on each fold. Using this, you can search through a range of values for `C` to find the one that maximizes accuracy.

In [57]:
from sklearn.linear_model import LogisticRegressionCV
c_vals = [1, 10, 100, 1000, 10000, 100000]
model = LogisticRegressionCV(Cs=c_vals, scoring='accuracy',
                             cv=5, max_iter=10000,
                             random_state=1).fit(X_train, y_train)

Now you can test the model's performance on the hold-out test data.

In [63]:
model.score(X_test, y_test)

0.9461158509205209
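The search that `LogisticRegressionCV` performs can be approximated by hand, which may make the per-fold scoring easier to see. This is a rough sketch on synthetic data from `make_classification`, not the review vectors, and the candidate values of `C` are chosen only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic binary classification data (illustrative stand-in).
X, y = make_classification(n_samples=500, n_features=20, random_state=1)

c_vals = [0.01, 1, 100]
mean_scores = {}
for c in c_vals:
    # cross_val_score splits the data into 5 folds, trains on 4 of them,
    # and returns the accuracy on the remaining fold, once per fold.
    fold_scores = cross_val_score(
        LogisticRegression(C=c, max_iter=1000), X, y, cv=5, scoring="accuracy"
    )
    mean_scores[c] = fold_scores.mean()

# Pick the C with the highest mean cross-validated accuracy.
best_c = max(mean_scores, key=mean_scores.get)
print(mean_scores)
print("best C:", best_c)
```

Because every candidate is scored only on folds of the training data, the test set stays untouched until the final evaluation.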

### Exercise: Try a different model

Now see if you can get better performance with a different model. Support vector machines are also popular for natural language classification. Use scikit-learn's `LinearSVC` model to classify the reviews and compare its test accuracy with the logistic regression results.

In [33]:
from sklearn.svm import LinearSVC

In [58]:
model = LinearSVC()
model.fit(X_train, y_train)

accuracy = metrics.accuracy_score(y_test, model.predict(X_test))
print(accuracy)

0.9447687471935339


In [65]:
# Record the test accuracy for each value of C
metric_log = {}
c_vals = [0.1, 1, 10, 100, 1000, 10000]
for c in c_vals:
    model = LinearSVC(C=c, dual=False, max_iter=100000)
    model.fit(X_train, y_train)
    metric_log[c] = metrics.accuracy_score(y_test, model.predict(X_test))

In [66]:
metric_log

{0.1: 0.9387067804220925,
 1: 0.9447687471935339,
 10: 0.9456668163448586,
 100: 0.9461158509205209,
 1000: 0.9461158509205209,
 10000: 0.9461158509205209}

You should get an accuracy of about 94.6% for `C` values of 100 and above. That is slightly better than the default `LinearSVC` (94.5%) and matches the best logistic regression model.
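The loop above compares values of `C` by their accuracy on the test set, which can make the reported score slightly optimistic. One possible alternative, sketched here on synthetic data rather than the review vectors, is to choose `C` with `GridSearchCV` on the training split and score on the test split only once at the end. The grid values and dataset sizes are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import LinearSVC

# Synthetic binary classification data (illustrative stand-in).
X, y = make_classification(n_samples=600, n_features=20, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.1, random_state=1)

# Search over C using 5-fold cross-validation on the training split only.
search = GridSearchCV(
    LinearSVC(dual=False, max_iter=10000),
    param_grid={"C": [0.1, 1, 10, 100]},
    cv=5,
    scoring="accuracy",
)
search.fit(X_tr, y_tr)

# best_params_ holds the selected C; score() evaluates the model refit
# with that C on the held-out test split.
print(search.best_params_)
print(search.score(X_te, y_te))
```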