# Natural Language Classification

In this first exercise, you will train a model that classifies product reviews as having "good" or "bad" sentiment. You'll use a basic Naive Bayes model to get a baseline, which you'll improve on over the next exercises.

Original data is from here: https://www.kaggle.com/bittlingmayer/amazonreviews/downloads/amazonreviews.zip/2
But I think I'm going to trim it down and make my own dataset for the course.

Alternatively, there is a dataset on SMS spam: https://www.kaggle.com/uciml/sms-spam-collection-dataset. I might use this for the tutorials.

In [1]:
import pandas as pd

# Set up code checking
from learntools.core import binder
binder.bind(globals())
from learntools.nlp.ex1 import *
print("\nSetup complete")


Setup complete


In [2]:
data = pd.read_csv('../input/amazon/train.csv')
data.head()

Unnamed: 0,text,rating
0,Stuning even for the non-gamer: This sound tra...,1
1,The best soundtrack ever to anything.: I'm rea...,1
2,Amazing!: This soundtrack is my favorite music...,1
3,Excellent Soundtrack: I truly like this soundt...,1
4,"Remember, Pull Your Jaw Off The Floor After He...",1


Okay, this code takes a really long time, so I'm actually only going to use a small sample from this dataset. Maybe I should make the whole thing available, but have students work with a small sample to optimize for time.

In [3]:
data = data.sample(1_000_000, random_state=7)

## Vectorizing Input Text

In this step, you'll create your feature vectors from the raw text data using scikit-learn's `CountVectorizer`.

In [4]:
from sklearn.feature_extraction.text import CountVectorizer

# Create the CountVectorizer
count_vect = ____

# Convert the text data to token counts
X_train_counts = ____

#step_1.check()

In [4]:
#%%RM_IF(PROD)%%
from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(data.text)

#step_1.assert_check_passed()
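
To see what `CountVectorizer` produces, here is a minimal sketch on a tiny corpus. The three reviews and the `toy_` names are made up for illustration and are not part of the exercise data.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical three-review corpus, made up for illustration
toy_reviews = [
    "great sound great price",
    "terrible sound",
    "great tea",
]

toy_vect = CountVectorizer()
toy_counts = toy_vect.fit_transform(toy_reviews)

# vocabulary_ maps each token to the index of its column
toy_vocab = sorted(toy_vect.vocabulary_.items(), key=lambda kv: kv[1])
print(toy_vocab)

# One row per review, one column per token.
# A dense view is fine for a toy corpus; avoid .toarray() on the full dataset.
toy_dense = toy_counts.toarray()
print(toy_dense)
```

Each row holds the token counts for one review, which is why the first row has a 2 in the "great" column.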

How many times does the word "tea" appear across all the reviews? Should I do this? Might be too advanced.

In [5]:
X_train_counts[:, count_vect.vocabulary_['tea']].data.sum()

6169
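
Note that summing the stored counts in a column gives the total number of occurrences of a word, which can be larger than the number of texts that contain it. A minimal sketch of the difference, using a made-up corpus (the `toy_` names are hypothetical):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical corpus: "tea" appears 4 times in total, but in only 2 texts
toy_reviews = ["tea tea tea", "no drinks here", "green tea"]

toy_vect = CountVectorizer()
toy_counts = toy_vect.fit_transform(toy_reviews)

# Select the column for "tea"
tea_col = toy_counts[:, toy_vect.vocabulary_["tea"]]

# Total occurrences: sum of the counts in the column
total_occurrences = int(tea_col.sum())

# Texts containing the word: number of stored entries in the column
# (CountVectorizer only stores non-zero counts)
docs_with_tea = tea_col.nnz

print(total_occurrences, docs_with_tea)
```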

## Normalize with TF-IDF

You have the text data encoded as occurrence counts. Now normalize those counts relative to how frequently each word occurs across all the reviews. This down-weights words that appear almost everywhere and tends to give models a better representation than raw counts.

In [7]:
from sklearn.feature_extraction.text import TfidfTransformer

# Create the transformer for TF-IDF
tfidf_transformer = ____

# Use the transformer to convert the count data to tf-idf frequencies
X_train_tfidf = ____

In [6]:
#%%RM_IF(PROD)%%
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

#step_2.assert_check_passed()

In [7]:
X_train_tfidf

<1000000x448236 sparse matrix of type '<class 'numpy.float64'>'
	with 54884567 stored elements in Compressed Sparse Row format>
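
As a rough illustration of what the transform does, the sketch below applies `TfidfTransformer` with its default settings to a made-up corpus in which "good" appears in every text and "tea" in only one. The corpus and the `toy_` names are assumptions for illustration only.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical corpus: "good" is in every text, "tea" in only one
toy_reviews = ["good sound", "good price", "good tea"]

toy_vect = CountVectorizer()
toy_counts = toy_vect.fit_transform(toy_reviews)

toy_transformer = TfidfTransformer()
toy_tfidf = toy_transformer.fit_transform(toy_counts)

good = toy_vect.vocabulary_["good"]
tea = toy_vect.vocabulary_["tea"]

# Both words have a raw count of 1 in the last text,
# but the rarer word ends up with the larger weight
row = toy_tfidf[2].toarray().ravel()
print("good:", row[good], "tea:", row[tea])

# With default settings, each row is scaled to unit Euclidean length
row_norm_sq = float((row ** 2).sum())
print("squared row norm:", row_norm_sq)
```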

## Train the Naive Bayes model

With the data processed, we're ready to use it as input to a classification model. Use the `MultinomialNB` model as the classifier. Remember that the Naive Bayes model assumes the input words are independent of one another given the class, so it ignores any relationships or correlations between words. This makes it one of the simplest models you can use, and a good baseline for developing more powerful ones.
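
To make the independence assumption concrete: in `MultinomialNB`, the score for each class is the class log prior plus a sum of per-word log probabilities weighted by the word counts. The sketch below checks this by hand on a made-up four-review corpus; the data, labels, and `toy_` names are hypothetical and raw counts are used instead of TF-IDF to keep the arithmetic easy to follow.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical labelled corpus: 1 = good, 0 = bad
toy_reviews = ["great sound", "great price", "terrible sound", "terrible price"]
toy_labels = [1, 1, 0, 0]

toy_vect = CountVectorizer()
toy_X = toy_vect.fit_transform(toy_reviews)
toy_nb = MultinomialNB().fit(toy_X, toy_labels)

# Score a new text by hand: each word contributes independently
new_X = toy_vect.transform(["great great sound"])
manual_scores = toy_nb.class_log_prior_ + new_X.toarray() @ toy_nb.feature_log_prob_.T
manual_pred = toy_nb.classes_[manual_scores.argmax(axis=1)]

# The hand-computed prediction matches the model's own prediction
model_pred = toy_nb.predict(new_X)
print(manual_pred, model_pred)
```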

So that you can measure the performance of the model after training, we'll first split the data into training and validation sets. If this is unfamiliar to you, please take our Intro to Machine Learning mini-course.

In [9]:
from sklearn.model_selection import train_test_split

y = data.rating
# split data into training and validation data, for both features and target
train_X, val_X, train_y, val_y = train_test_split(X_train_tfidf, y, random_state=1)
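
When neither `test_size` nor `train_size` is given, `train_test_split` holds out 25% of the rows, which is why the validation set here ends up with 250,000 of the 1,000,000 sampled reviews. A small sketch with a made-up 8-row dataset (the arrays are hypothetical):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical 8-row feature matrix and alternating 0/1 labels
toy_X = np.arange(16).reshape(8, 2)
toy_y = np.arange(8) % 2

toy_train_X, toy_val_X, toy_train_y, toy_val_y = train_test_split(
    toy_X, toy_y, random_state=1
)

# Default split: 75% training rows, 25% validation rows
print(toy_train_X.shape, toy_val_X.shape)
```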

Here, train the Naive Bayes model and calculate the accuracy.

In [10]:
from sklearn.naive_bayes import MultinomialNB

nb_model = ____
# Fit the model on the training data
____

# Calculate the accuracy using the validation data
accuracy = ____

#step_3.check()

In [11]:
#%%RM_IF(PROD)%%
from sklearn.naive_bayes import MultinomialNB

nb_model = MultinomialNB()
nb_model.fit(train_X, train_y)

accuracy = nb_model.score(val_X, val_y)
print(accuracy)
#step_3.assert_check_passed()

0.846124


You should have found that the accuracy is reasonably high, around 85%. However, we should check the confusion matrix to see how the model is misclassifying the data.

In [12]:
from sklearn.metrics import confusion_matrix
predicted = nb_model.predict(val_X)
cm = confusion_matrix(val_y, predicted)

print("Confusion matrix\n", cm)
print("\nNormalized confusion matrix")
for row in cm:
    print(row / row.sum())

Confusion matrix
 [[108564  16664]
 [ 21805 102967]]

Normalized confusion matrix
[0.86693072 0.13306928]
[0.17475876 0.82524124]
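
The numbers printed above can be tied back to the accuracy score directly. The sketch below copies the confusion matrix from the output and recomputes the accuracy and the per-class rates with NumPy; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix copied from the output above
# (rows are true labels, columns are predicted labels)
cm = np.array([[108564, 16664],
               [21805, 102967]])

# Accuracy: correctly classified reviews (the diagonal) over all reviews
accuracy = np.trace(cm) / cm.sum()

# Fraction of each true class that was classified correctly
per_class_recall = np.diag(cm) / cm.sum(axis=1)

# Row-normalized matrix, equivalent to the loop above
row_normalized = cm / cm.sum(axis=1, keepdims=True)

print(accuracy)
print(per_class_recall)
print(row_normalized)
```

scikit-learn's `confusion_matrix` also accepts `normalize="true"` to return the row-normalized matrix directly.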


What we see here is that the two classes are roughly balanced in the validation set (125,228 and 124,772 reviews), so accuracy is a reasonable summary of performance. Each row of the confusion matrix corresponds to a true label, in sorted label order. The model correctly classifies about 87% of the reviews in the first class and about 83% in the second, which means it misclassifies about 13% and 17% of them respectively. In the next tutorial, we'll look at improving our model.