# Pre-processing Text Data

In this exercise, you'll pre-process the data in the hope of improving the model. Text data can contain words that add little useful signal, such as the stop words dropped later in this exercise.

As pre-processing steps in this exercise, you'll use the spaCy package for tokenizing, lemmatizing, and dropping stop words.

In [1]:
import pandas as pd

Load the data and take a 100,000-row sample so the later steps don't take forever to run.

In [97]:
all_data = pd.read_csv('../input/amazon/train.csv')
data = all_data.sample(100_000, random_state=7)
data.head()

| Unnamed: 0 | text | rating |
|---|---|---|
| 2970342 | Plano Gun Cse: I have an FN AR with a scope th... | 1 |
| 2811203 | Avoid...: As a long time Byrdmaniac own virtua... | 0 |
| 2395023 | Great Story, Defective Copy: Ender's Shadow is... | 0 |
| 3115925 | Typical Comic Book Come-On Cover: As you would... | 0 |
| 1876317 | perfect: This product is even better than it a... | 1 |


## Use spaCy for a bit of preprocessing

Here we're going to replace each token with its lemma and drop stop words. This actually makes the model perform worse, but it's one of the settings to try when iterating on hyperparameters. It may help other models even though it hurts Naive Bayes.

In [100]:
from spacy.lang.en import English

# A blank English pipeline; it already includes a tokenizer,
# so no separate Tokenizer object is needed
nlp = English()

docs = nlp.pipe(data.text)

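# Get lemmas of each token and drop stop words
# (is_stop is spaCy's stop-word flag; whether a blank pipeline
# populates lemma_ depends on the spaCy version)
tokenized = ([token.lemma_ for token in doc if not token.is_stop] for doc in docs)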

# Convert tokens back into strings for CountVectorizer
processed = [' '.join(tokens) for tokens in tokenized]
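To make the two steps concrete, here is a minimal pure-Python sketch of "drop stop words, then map each remaining token to its lemma". The tables `TOY_STOP_WORDS` and `TOY_LEMMAS` and the function `preprocess` are hypothetical and invented for illustration; they are not spaCy's data or API, and spaCy's tokenizer is more careful than a whitespace split.

```python
# Hypothetical miniature tables, for illustration only; spaCy's real
# stop-word list and lemma data are much larger.
TOY_STOP_WORDS = {"the", "is", "a", "it", "than", "was"}
TOY_LEMMAS = {"batteries": "battery", "lasted": "last", "better": "good"}


def preprocess(text):
    """Lowercase, split on whitespace, drop stop words, map tokens to lemmas."""
    tokens = text.lower().split()
    # Same order as the notebook cell: filter on the token, then take its lemma
    kept = [TOY_LEMMAS.get(tok, tok) for tok in tokens if tok not in TOY_STOP_WORDS]
    # Join back into one string, the form CountVectorizer expects
    return " ".join(kept)


example = preprocess("The batteries lasted longer than expected")
print(example)  # battery last longer expected
```

Tokens without an entry in the toy lemma table pass through unchanged, which is why "longer" and "expected" survive as they are.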

Fit a Naive Bayes model on the processed text.

In [104]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(processed)

y = data.rating
# split data into training and validation data, for both features and target
train_X, val_X, train_y, val_y = train_test_split(X_train_counts, y, random_state=1)

nb_model = MultinomialNB()
nb_model.fit(train_X, train_y)

accuracy = nb_model.score(val_X, val_y)
print(accuracy)

0.84352
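The notebook says pre-processing makes the model worse but doesn't show the baseline it is compared against. One way to make that comparison explicit is to wrap the vectorize, split, fit, and score steps in a helper and call it on different inputs. The sketch below assumes scikit-learn is installed; the name `validation_accuracy` and the toy data are hypothetical and only there so the block runs on its own.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB


def validation_accuracy(texts, labels, **vectorizer_kwargs):
    """Vectorize, split, fit MultinomialNB, and return validation accuracy."""
    counts = CountVectorizer(**vectorizer_kwargs).fit_transform(texts)
    # Same split settings as the notebook cells (random_state=1)
    train_X, val_X, train_y, val_y = train_test_split(counts, labels, random_state=1)
    model = MultinomialNB().fit(train_X, train_y)
    return model.score(val_X, val_y)


# Toy data, invented for illustration; not the Amazon reviews.
toy_texts = ["great product", "love it", "works great", "really love this",
             "terrible product", "hate it", "broke fast", "really hate this"]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

toy_accuracy = validation_accuracy(toy_texts, toy_labels)
print(toy_accuracy)
```

With the notebook's variables, the raw-text baseline would be `validation_accuracy(data.text, data.rating)` and the pre-processed run `validation_accuracy(processed, data.rating)`. Extra keyword arguments are passed straight to `CountVectorizer`, so other vectorizer settings can be tried the same way.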


## Bigrams and Trigrams

Have CountVectorizer produce bigrams and trigrams as well as single words by setting `ngram_range=(1, 3)`. This improves validation accuracy quite a bit, from 0.84352 to 0.89172.

In [105]:
count_vect = CountVectorizer(ngram_range=(1, 3))
X_train_counts = count_vect.fit_transform(processed)

y = data.rating
# split data into training and validation data, for both features and target
train_X, val_X, train_y, val_y = train_test_split(X_train_counts, y, random_state=1)

nb_model = MultinomialNB()
nb_model.fit(train_X, train_y)

accuracy = nb_model.score(val_X, val_y)
print(accuracy)

0.89172
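To see what `ngram_range=(1, 3)` asks for, here is a simplified pure-Python sketch that lists every contiguous run of one, two, or three words in a token list. The function `word_ngrams` is a hypothetical helper, not part of scikit-learn, and CountVectorizer applies its own lowercasing and tokenization before building n-grams.

```python
def word_ngrams(tokens, min_n=1, max_n=3):
    """Return all contiguous word n-grams with min_n <= n <= max_n."""
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n tokens across the list
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams


features = word_ngrams("not worth the money".split())
print(features)
# ['not', 'worth', 'the', 'money', 'not worth', 'worth the', 'the money',
#  'not worth the', 'worth the money']
```

A bigram such as "not worth" keeps word-order information that separate counts for "not" and "worth" lose. The cost is a much larger vocabulary: four words already yield nine features here.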
