# Text Classification Project
Now we're at the point where we should be able to:
* Read in a collection of documents - a *corpus*
* Transform text into numerical vector data using a pipeline
* Create a classifier
* Fit/train the classifier
* Test the classifier on new data
* Evaluate performance

For this project we'll use the Cornell University Movie Review polarity dataset v2.0 obtained from http://www.cs.cornell.edu/people/pabo/movie-review-data/



## Perform imports and load the dataset
The dataset contains the text of 2000 movie reviews. 1000 are positive, 1000 are negative, and the text has been preprocessed as a tab-delimited file.

In [1]:
import numpy as np
import pandas as pd
from sklearn import metrics

In [2]:
df=pd.read_csv('moviereviews.tsv',sep='\t')
df.head()

| | label | review |
|---|---|---|
| 0 | neg | how do films like mouse hunt get into theatres... |
| 1 | neg | some talented actresses are blessed with a dem... |
| 2 | pos | this has been an extraordinary year for austra... |
| 3 | pos | according to hollywood movies made in last few... |
| 4 | neg | my first press screening of 1998 and already i... |


In [3]:
len(df)
# the DataFrame holds 2000 reviews


2000

### Take a look at a typical review. This one is labeled "negative":

In [4]:
# Goal: use the raw review text to predict whether a review is positive or negative
df['review'][0]

'how do films like mouse hunt get into theatres ? \r\nisn\'t there a law or something ? \r\nthis diabolical load of claptrap from steven speilberg\'s dreamworks studio is hollywood family fare at its deadly worst . \r\nmouse hunt takes the bare threads of a plot and tries to prop it up with overacting and flat-out stupid slapstick that makes comedies like jingle all the way look decent by comparison . \r\nwriter adam rifkin and director gore verbinski are the names chiefly responsible for this swill . \r\nthe plot , for what its worth , concerns two brothers ( nathan lane and an appalling lee evens ) who inherit a poorly run string factory and a seemingly worthless house from their eccentric father . \r\ndeciding to check out the long-abandoned house , they soon learn that it\'s worth a fortune and set about selling it in auction to the highest bidder . \r\nbut battling them at every turn is a very smart mouse , happy with his run-down little abode and wanting it to stay that way . \r\

## Check for missing values:

### Detect & remove NaN values:

In [5]:
# Check for the existence of NaN values in a cell:
df.isnull().sum()

label      0
review    35
dtype: int64

35 records contain **NaN** in the `review` column. NaN stands for "not a number"; pandas treats both NaN and *None* as missing values. These rows are easily removed using the `.dropna()` DataFrame method.
<div class="alert alert-info" style="margin: 20px">CAUTION: By setting inplace=True, we permanently affect the DataFrame currently in memory, and this can't be undone. However, it does *not* affect the original source data. If we needed to, we could always load the original DataFrame from scratch.</div>
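
As a minimal sketch of how pandas handles missing values, the example below builds a small hypothetical frame (`toy` and its contents are invented for illustration, not taken from the dataset). It shows that `None` and `NaN` are both counted as missing, and that calling `.dropna()` without `inplace=True` returns a new DataFrame and leaves the original untouched.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row frame; one review is None, one is NaN
toy = pd.DataFrame({
    'label': ['pos', 'neg', 'pos'],
    'review': ['great film', None, np.nan],
})

# Both None and NaN are reported as missing
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Without inplace=True, dropna() returns a new DataFrame
cleaned = toy.dropna()
print(len(toy), len(cleaned))

# NaN never compares equal to itself, which is why isnull() is used
# instead of an equality test
nan_equals_itself = np.nan == np.nan
print(nan_equals_itself)
```

Assigning the result to a new name, as with `cleaned` here, is a reasonable alternative when the unfiltered frame is still needed later.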

In [6]:
df.dropna(inplace=True)
len(df)

1965

In [7]:
df.isnull().sum()

label     0
review    0
dtype: int64

### Detect & remove empty strings


In [8]:

blanks=[]    # start with an empty list
for i,lb,rv in df.itertuples():  # iterate over (index, label, review) rows
    if isinstance(rv,str):       # skip non-string values such as NaN
        if rv.isspace():         # True only for non-empty, all-whitespace text
            blanks.append(i)     # record the index label of the blank review

print(len(blanks),'blanks',blanks)

27 blanks [57, 71, 147, 151, 283, 307, 313, 323, 343, 351, 427, 501, 633, 675, 815, 851, 977, 1079, 1299, 1455, 1493, 1525, 1531, 1763, 1851, 1905, 1993]
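
One caveat about the loop above: `str.isspace()` returns `False` for a zero-length string, so it only catches reviews made up entirely of whitespace. The sketch below uses hypothetical sample strings (not from the dataset) to show the difference, along with `not s.strip()` as one possible test that flags both cases.

```python
# Hypothetical sample strings, not taken from the dataset
samples = ['', ' ', '\r\n', 'a fine film']

# str.isspace() is True only for non-empty strings made of whitespace
only_whitespace = [s.isspace() for s in samples]

# Stripping first also flags the zero-length string
blank_or_empty = [not s.strip() for s in samples]

print(only_whitespace)   # [False, True, True, False]
print(blank_or_empty)    # [True, True, True, False]
```

This is unlikely to matter for this file, since `pd.read_csv` treats empty fields as missing by default, so zero-length reviews would most likely already have been dropped with the NaN rows.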


Next we'll pass our list of index numbers to the **.drop()** method, and set `inplace=True` to make the change permanent.

In [9]:
# blanks holds the index labels of the whitespace-only reviews; drop those rows from the DataFrame
df.drop(blanks,inplace=True)

In [10]:
len(df)

1938

Great! We dropped 62 records from the original 2000. Let's continue with the analysis.

## Take a quick look at the `label` column:

In [11]:
df['label'].value_counts()

neg    969
pos    969
Name: label, dtype: int64

## Split the data into train & test sets:

In [12]:
from sklearn.model_selection import train_test_split 
X=df['review']
y=df['label']
X_train, X_test, y_train, y_test=train_test_split(X,y,test_size=0.3,random_state=42)
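
The split above is random but not stratified, which is why the test set ends up with 282 negative and 300 positive reviews (see the `support` column in the reports further down) even though the full dataset is balanced. If equal class proportions in both sets were wanted, `train_test_split` accepts a `stratify` argument. The sketch below illustrates this on hypothetical toy data; the `texts` and `labels` lists are invented for the example.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical balanced toy data: 10 'pos' and 10 'neg' items
texts = [f'review {i}' for i in range(20)]
labels = ['pos'] * 10 + ['neg'] * 10

# stratify=labels keeps the pos/neg ratio the same in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.3, random_state=42, stratify=labels)

print(len(X_tr), len(X_te))
print(Counter(y_tr), Counter(y_te))
```

Using this in the project would change which reviews land in the test set, so the scores reported below would no longer be reproduced exactly.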

## Build pipelines to vectorize the data, then train and fit a model
Now that we have sets to train and test, we'll develop a selection of pipelines, each with a different model.

In [13]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB

#naive_bayes
text_clf_nb=Pipeline([('tfidf',TfidfVectorizer()),('clf',MultinomialNB())])

#Linear SVC
text_clf_lsvc=Pipeline([('tfidf',TfidfVectorizer()),('clf',LinearSVC())])
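
To make clear what a pipeline does, the sketch below compares a `Pipeline` with the equivalent manual steps: fit the vectorizer on the training text, train the classifier on the resulting matrix, then transform new text with the same fitted vectorizer before predicting. The tiny corpus and all variable names here are hypothetical and serve only as illustration.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus, invented for this example
train_texts = ['a wonderful and moving film',
               'brilliant acting and a great story',
               'dull and boring from start to finish',
               'a terrible waste of time']
train_labels = ['pos', 'pos', 'neg', 'neg']
new_texts = ['a great and moving story', 'boring and terrible']

# Pipeline version: vectorizing and classifying happen in one object
pipe = Pipeline([('tfidf', TfidfVectorizer()), ('clf', MultinomialNB())])
pipe.fit(train_texts, train_labels)
pipe_pred = pipe.predict(new_texts)

# Manual version: the same two steps written out explicitly
vec = TfidfVectorizer()
X_train_tfidf = vec.fit_transform(train_texts)   # learn vocabulary and weights
clf = MultinomialNB().fit(X_train_tfidf, train_labels)
manual_pred = clf.predict(vec.transform(new_texts))  # reuse fitted vocabulary

same = list(pipe_pred) == list(manual_pred)
print(same)
```

The pipeline's main benefit is that raw text can be passed straight to `fit` and `predict`, so the test data is never used to fit the vectorizer by accident.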

## Feed the training data through the first pipeline
We'll run naïve Bayes first.

In [14]:
text_clf_nb.fit(X_train,y_train)

Pipeline(steps=[('tfidf', TfidfVectorizer()), ('clf', MultinomialNB())])

## Run predictions and analyze the results (naïve Bayes)

In [15]:
# Form a prediction set
predictions=text_clf_nb.predict(X_test)

In [16]:
# Report the confusion matrix
print(metrics.confusion_matrix(y_test,predictions))

[[259  23]
 [102 198]]


In [17]:
# Print a classification report
print(metrics.classification_report(y_test,predictions))

              precision    recall  f1-score   support

         neg       0.72      0.92      0.81       282
         pos       0.90      0.66      0.76       300

    accuracy                           0.79       582
   macro avg       0.81      0.79      0.78       582
weighted avg       0.81      0.79      0.78       582



In [18]:
# Print the overall accuracy
print(metrics.accuracy_score(y_test,predictions))

0.7852233676975945


Naïve Bayes classified reviews as positive or negative with 78.5% accuracy based on text alone, well above the roughly 50% that random guessing would achieve on a balanced dataset. Let's see if we can do better.
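
The per-class figures in the classification report can be recomputed by hand from the confusion matrix, which helps in reading both outputs. The sketch below assumes scikit-learn's layout (rows are true labels, columns are predicted labels, both in the order `neg`, `pos`) and uses the naïve Bayes matrix printed above; the variable names are chosen for illustration.

```python
# Confusion matrix from the naïve Bayes output above.
# Rows are true labels, columns are predicted labels, ordered ['neg', 'pos'].
cm = [[259, 23],
      [102, 198]]
labels = ['neg', 'pos']

report = {}
for i, label in enumerate(labels):
    true_count = sum(cm[i])                      # support: reviews truly in this class
    predicted_count = sum(row[i] for row in cm)  # reviews predicted as this class
    correct = cm[i][i]
    precision = correct / predicted_count
    recall = correct / true_count
    f1 = 2 * precision * recall / (precision + recall)
    report[label] = (round(precision, 2), round(recall, 2), round(f1, 2), true_count)
    print(label, report[label])
```

The low recall for `pos` (0.66) reflects the 102 positive reviews that the model labeled as negative.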

## Feed the training data through the second pipeline
Next we'll run Linear SVC.

In [19]:
text_clf_lsvc.fit(X_train,y_train)

Pipeline(steps=[('tfidf', TfidfVectorizer()), ('clf', LinearSVC())])

## Run predictions and analyze the results (Linear SVC)

In [20]:
# Form a prediction set
predictions=text_clf_lsvc.predict(X_test)

In [21]:
# Report the confusion matrix
print(metrics.confusion_matrix(y_test,predictions))

[[235  47]
 [ 41 259]]


In [22]:
# Print a classification report
print(metrics.classification_report(y_test,predictions))

              precision    recall  f1-score   support

         neg       0.85      0.83      0.84       282
         pos       0.85      0.86      0.85       300

    accuracy                           0.85       582
   macro avg       0.85      0.85      0.85       582
weighted avg       0.85      0.85      0.85       582



In [23]:
# Print the overall accuracy
print(metrics.accuracy_score(y_test,predictions))

0.8487972508591065


Not bad! Based on text alone we correctly classified reviews as positive or negative **84.9%** of the time.
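
To put the two models side by side, the sketch below recomputes accuracy and the two kinds of error from the confusion matrices printed earlier. The `results` and `summary` dictionaries are names invented for this illustration; the numbers are copied from the outputs above.

```python
# Confusion matrices copied from the outputs above.
# Each is [[neg->neg, neg->pos], [pos->neg, pos->pos]] (true -> predicted).
results = {
    'MultinomialNB': [[259, 23], [102, 198]],
    'LinearSVC': [[235, 47], [41, 259]],
}

summary = {}
for name, ((neg_ok, neg_as_pos), (pos_as_neg, pos_ok)) in results.items():
    total = neg_ok + neg_as_pos + pos_as_neg + pos_ok
    summary[name] = {
        'accuracy': round((neg_ok + pos_ok) / total, 4),
        'neg_as_pos': neg_as_pos,   # negative reviews labeled positive
        'pos_as_neg': pos_as_neg,   # positive reviews labeled negative
        'errors': neg_as_pos + pos_as_neg,
    }
    print(name, summary[name])
```

Linear SVC makes fewer mistakes overall (88 versus 125), and its errors are spread far more evenly across the two classes.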

In [24]:
print("The End")

The End
