# Text Classification Project
Now we're at the point where we should be able to:
* Read in a collection of documents - a *corpus*
* Transform text into numerical vector data using a pipeline
* Create a classifier
* Fit/train the classifier
* Test the classifier on new data
* Evaluate performance

For this project we'll use the Cornell University Movie Review polarity dataset v2.0 obtained from http://www.cs.cornell.edu/people/pabo/movie-review-data/



## Perform imports and load the dataset
The dataset contains the text of 2000 movie reviews: 1000 positive and 1000 negative. The text has been preprocessed into a tab-delimited file.

In [None]:
import numpy as np
import pandas as pd
from sklearn import metrics

In [None]:
df=pd.read_csv('moviereviews.tsv',sep='\t')
df.head()

In [None]:
len(df)
# The dataset has 2000 reviews


### Take a look at a typical review. This one is labeled "negative":

In [None]:
# Goal: use the raw text to predict whether a review is positive or negative
df['review'][0]

## Check for missing values:

### Detect & remove NaN values:

In [None]:
# Check for the existence of NaN values in a cell:
df.isnull().sum()

35 records show **NaN** ("not a number", the marker pandas uses for missing values; *None* is treated the same way). These are easily removed using the pandas `.dropna()` method.
<div class="alert alert-info" style="margin: 20px">CAUTION: By setting inplace=True, we permanently affect the DataFrame currently in memory, and this can't be undone. However, it does *not* affect the original source data. If we needed to, we could always load the original DataFrame from scratch.</div>

In [None]:
df.dropna(inplace=True)
len(df)
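As a hedged illustration of the caution above, the sketch below uses a small made-up DataFrame (`toy` and its contents are assumptions, not rows from the movie review file). It shows that pandas counts both `NaN` and `None` as missing, and that calling `.dropna()` without `inplace=True` returns a new DataFrame and leaves the original untouched.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two missing reviews (one NaN, one None)
toy = pd.DataFrame({
    'label': ['pos', 'neg', 'neg', 'pos'],
    'review': ['A great film.', np.nan, None, 'Loved it.'],
})

# Both NaN and None are counted as missing values
print(toy.isnull().sum())

# Without inplace=True, dropna() returns a new DataFrame
cleaned = toy.dropna()

print(len(toy))      # the original still has all four rows
print(len(cleaned))  # only rows without missing values remain
```

Assigning the result to a new name keeps the uncleaned data available for comparison, at the cost of holding two copies in memory.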

In [None]:
df.isnull().sum()

### Detect & remove empty strings


In [None]:

blanks = []  # start with an empty list
for i, lb, rv in df.itertuples():  # iterate over (index, label, review) rows
    # skip non-string values such as NaN, then test 'review' for whitespace only
    if isinstance(rv, str) and rv.isspace():
        blanks.append(i)  # add matching index labels to the list

print(len(blanks), 'blanks:', blanks)
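The same check can be expressed without an explicit loop. The sketch below is one possible alternative, not the notebook's method: it uses the pandas `.str.isspace()` accessor on a made-up DataFrame (`toy` is hypothetical) and assumes NaN values have already been dropped, since `.str` methods return NaN for missing entries.

```python
import pandas as pd

# Hypothetical toy data with a non-default index, as after dropna()
toy = pd.DataFrame(
    {'label': ['pos', 'neg', 'neg'],
     'review': ['Fine film.', '   ', '\n']},
    index=[10, 11, 12],
)

# Boolean mask: True where the review consists only of whitespace
blank_mask = toy['review'].str.isspace()

# Collect the index labels of the matching rows
blanks = toy.index[blank_mask].tolist()
print(len(blanks), 'blanks:', blanks)
```

Both the loop and this mask rely on `isspace()`, which returns `False` for an empty string `''`, so neither version would flag a review that is a zero-length string.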

Next we'll pass our list of index numbers to the **.drop()** method, and set `inplace=True` to make the change permanent.

In [None]:
# The blanks list holds the index labels of all whitespace-only reviews; drop those rows from the DataFrame
df.drop(blanks,inplace=True)

In [None]:
len(df)

Great! We dropped 62 records from the original 2000. Let's continue with the analysis.

## Take a quick look at the `label` column:

In [None]:
df['label'].value_counts()

## Split the data into train & test sets:

In [None]:
from sklearn.model_selection import train_test_split 
X=df['review']
y=df['label']
X_train, X_test, y_train, y_test=train_test_split(X,y,test_size=0.3,random_state=42)
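The split above shuffles the rows without regard to the label. As an optional variation (not used in this notebook), `train_test_split` accepts a `stratify` argument that keeps the class proportions the same in both subsets. The sketch below uses a made-up, perfectly balanced DataFrame; `toy` and its contents are assumptions for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 positive and 10 negative placeholder reviews
toy = pd.DataFrame({
    'review': [f'review {i}' for i in range(20)],
    'label': ['pos'] * 10 + ['neg'] * 10,
})

# stratify=... preserves the pos/neg ratio in the train and test subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    toy['review'], toy['label'],
    test_size=0.3, random_state=42, stratify=toy['label'],
)

print(len(X_tr), len(X_te))
print(y_te.value_counts())
```

With a dataset that is already close to balanced, a plain random split usually gives similar proportions; stratifying matters more when one class is rare.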

## Build pipelines to vectorize the data, then train and fit a model
Now that we have training and test sets, we'll build two pipelines, each pairing a TF-IDF vectorizer with a different model.

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB

# Naïve Bayes
text_clf_nb = Pipeline([('tfidf', TfidfVectorizer()), ('clf', MultinomialNB())])

# Linear SVC
text_clf_lsvc = Pipeline([('tfidf', TfidfVectorizer()), ('clf', LinearSVC())])
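To make the `tfidf` step less opaque, here is a small sketch of what `TfidfVectorizer` produces on its own. The three-sentence corpus is made up for illustration. With default settings the vectorizer lowercases the text, ignores single-character tokens, and scales each row to unit length.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical three-document corpus
corpus = ['A great movie', 'A boring movie', 'Great acting']

vectorizer = TfidfVectorizer()
X_toy = vectorizer.fit_transform(corpus)  # sparse matrix: documents x terms

# The learned vocabulary (the single-character token 'a' is ignored)
vocab = sorted(vectorizer.vocabulary_)
print(vocab)

# One row per document, one column per vocabulary term
print(X_toy.shape)

# Each document vector is L2-normalized by default
row_norms = np.linalg.norm(X_toy.toarray(), axis=1)
print(row_norms)
```

Inside a `Pipeline`, this transformation is applied automatically: the vocabulary is learned during `fit`, and the same vocabulary is reused when predicting on new text.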

## Feed the training data through the first pipeline
We'll run naïve Bayes first.

In [None]:
text_clf_nb.fit(X_train,y_train)

## Run predictions and analyze the results (naïve Bayes)

In [None]:
# Form a prediction set
predictions=text_clf_nb.predict(X_test)

In [None]:
# Report the confusion matrix
print(metrics.confusion_matrix(y_test,predictions))

In [None]:
# Print a classification report
print(metrics.classification_report(y_test,predictions))

In [None]:
# Print the overall accuracy
print(metrics.accuracy_score(y_test,predictions))
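To connect the confusion matrix to the accuracy score, the sketch below uses five made-up labels (the `neg`/`pos` strings are illustrative values). In scikit-learn's `confusion_matrix`, rows correspond to true labels and columns to predicted labels, in sorted label order, so the diagonal holds the correct predictions.

```python
import numpy as np
from sklearn import metrics

# Hypothetical true and predicted labels
y_true = ['neg', 'neg', 'pos', 'pos', 'pos']
y_pred = ['neg', 'pos', 'pos', 'pos', 'neg']

# Rows: true labels, columns: predicted labels, both ordered ['neg', 'pos']
cm = metrics.confusion_matrix(y_true, y_pred)
print(cm)

# Accuracy is the sum of the diagonal divided by the total count
acc_from_cm = np.trace(cm) / cm.sum()
acc = metrics.accuracy_score(y_true, y_pred)
print(acc_from_cm, acc)
```

The off-diagonal cells separate the two kinds of error (negative reviews predicted as positive, and the reverse), which a single accuracy number hides.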

Naïve Bayes classified reviews as positive or negative with 76.4% accuracy based on text alone, well above the roughly 50% expected from guessing on this nearly balanced dataset. Let's see if we can do better.

## Feed the training data through the second pipeline
Next we'll run Linear SVC.

In [None]:
text_clf_lsvc.fit(X_train,y_train)

## Run predictions and analyze the results (Linear SVC)

In [None]:
# Form a prediction set
predictions=text_clf_lsvc.predict(X_test)

In [None]:
# Report the confusion matrix
print(metrics.confusion_matrix(y_test,predictions))

In [None]:
# Print a classification report
print(metrics.classification_report(y_test,predictions))

In [None]:
# Print the overall accuracy
print(metrics.accuracy_score(y_test,predictions))

Not bad! Based on text alone we correctly classified reviews as positive or negative **84.7%** of the time. 
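Because the vectorizer lives inside the pipeline, a fitted pipeline can be handed raw strings directly. The sketch below shows this on a tiny made-up training set, so the names `toy_reviews`, `toy_labels`, and `toy_clf` and the example sentences are assumptions. With the notebook's fitted model, the equivalent call would be `text_clf_lsvc.predict([...])` with a list of review strings.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Hypothetical miniature training set
toy_reviews = [
    'a wonderful great movie',
    'great acting wonderful story',
    'a terrible boring movie',
    'boring plot terrible acting',
]
toy_labels = ['pos', 'pos', 'neg', 'neg']

# Same pipeline structure as text_clf_lsvc above
toy_clf = Pipeline([('tfidf', TfidfVectorizer()), ('clf', LinearSVC())])
toy_clf.fit(toy_reviews, toy_labels)

# predict() takes an iterable of raw strings; vectorizing happens internally
new_reviews = ['wonderful great', 'terrible boring']
preds = toy_clf.predict(new_reviews)
print(preds)
```

Note that `predict` expects a list (or other iterable) of documents, even when classifying a single review.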

In [None]:
print("The End")