# Text Classification

The moviereviews2.tsv dataset contains the text of 6000 movie reviews, reduced and 
preprocessed into a tab-delimited file. For more information on this dataset, visit 
http://ai.stanford.edu/~amaas/data/sentiment/ 
* Perform imports and load the dataset into a pandas DataFrame.
* Data cleanup: handle missing values (NaN) and blank, whitespace-only reviews.
* Split the data into train & test sets. Use test_size=0.33, random_state=42
* Build a pipeline to vectorize the data, then train and fit a model. You may use any 
model you like; both Naïve Bayes and LinearSVC are used here. 
* Run predictions and analyze the results. Report the confusion matrix and classification report.

### Perform imports and load the dataset into a pandas DataFrame.

In [1]:
import numpy as np
import pandas as pd

df = pd.read_csv('../TextFiles/moviereviews2.tsv', sep='\t')

### Data cleanup: handle missing values (NaN) and blank, whitespace-only reviews

In [2]:
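# Drop rows where either column is NaN
df.dropna(inplace=True)

# Collect the index labels of reviews that contain only whitespace
blanks = []

for i, lb, rv in df.itertuples():
    if isinstance(rv, str) and rv.isspace():
        blanks.append(i)

df.drop(blanks, inplace=True)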
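
The cleanup loop above can also be written with vectorized string methods. The following is a minimal sketch on a hypothetical toy DataFrame (the toy frame and its rows are made up for illustration and are not taken from moviereviews2.tsv). Unlike isspace(), comparing the stripped string to '' also flags completely empty strings.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real dataset
toy = pd.DataFrame({
    'label': ['pos', 'neg', 'pos', 'neg'],
    'review': ['Great film', np.nan, '   ', 'Dull plot'],
})

# Drop rows that contain NaN in any column
toy = toy.dropna()

# Boolean mask: True where the review is empty or whitespace-only
is_blank = toy['review'].str.strip().eq('')

# Keep only the rows with real text
cleaned = toy[~is_blank]
print(cleaned)
```

Only the rows with index 0 and 3 survive: row 1 is removed by dropna() and row 2 by the blank mask.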

### Split the data into train & test sets. Use test_size=0.33, random_state=42

In [3]:
from sklearn.model_selection import train_test_split

X = df['review']
y = df['label']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
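
To illustrate what test_size and random_state do, here is a small sketch on hypothetical toy data (X_toy and y_toy are invented for this example). It also passes stratify, which this notebook does not use; it is shown only as an optional way to keep the class proportions similar in both subsets.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical balanced toy data: 100 short "reviews", 50 per class
X_toy = pd.Series([f'review {i}' for i in range(100)])
y_toy = pd.Series(['neg', 'pos'] * 50)

# Same test_size and random_state as above, plus stratify (an addition for illustration)
Xa_train, Xa_test, ya_train, ya_test = train_test_split(
    X_toy, y_toy, test_size=0.33, random_state=42, stratify=y_toy)

# Repeating the call with the same random_state reproduces the same split
Xb_train, Xb_test, yb_train, yb_test = train_test_split(
    X_toy, y_toy, test_size=0.33, random_state=42, stratify=y_toy)

print(len(Xa_train), len(Xa_test))           # 67 33
print(Xa_test.index.equals(Xb_test.index))   # True
print(ya_test.value_counts().to_dict())
```

With 100 samples, test_size=0.33 places 33 rows in the test set and the remaining 67 in the training set.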

### Build a pipeline to vectorize the data, then train and fit a model. You may use any model you like; both Naïve Bayes and LinearSVC are used here.

#### Build a pipeline to vectorize the data

In [4]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer

#### Naïve Bayes

In [5]:
from sklearn.naive_bayes import MultinomialNB

# TF-IDF features feeding a multinomial Naïve Bayes classifier
text_clf_nb = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', MultinomialNB()),
])

#### Linear SVC

In [6]:
from sklearn.svm import LinearSVC  # Linear Support Vector Classification

# TF-IDF features feeding a linear SVM; dual=False solves the primal problem
text_clf_lsvc = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', LinearSVC(dual=False)),
])
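
To make the pipeline mechanics concrete, the sketch below fits the same two-step structure on a hypothetical four-document corpus (the texts, labels, and the name toy_clf are made up for illustration). It shows that the pipeline accepts raw strings directly and that the fitted steps stay accessible by name.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Hypothetical toy corpus, invented for this example
train_texts = ['a wonderful, moving film',
               'great acting and a great story',
               'boring and far too long',
               'a terrible, dull script']
train_labels = ['pos', 'pos', 'neg', 'neg']

toy_clf = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', LinearSVC(dual=False)),
])

# fit() calls fit_transform on the vectorizer, then fit on the classifier
toy_clf.fit(train_texts, train_labels)

# predict() calls transform (reusing the learned vocabulary), then predict
preds = toy_clf.predict(['a great film', 'dull and boring'])
print(preds)

# The fitted vectorizer can be inspected through named_steps
vocab_size = len(toy_clf.named_steps['tfidf'].vocabulary_)
print(vocab_size)
```

Because the vectorizer lives inside the pipeline, its vocabulary is learned from the training texts only, and new texts are transformed with that same vocabulary at prediction time.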

#### Train and fit a model

#### Naïve Bayes

In [7]:
text_clf_nb.fit(X_train, y_train)

#### Linear SVC

In [8]:
text_clf_lsvc.fit(X_train, y_train)

### Run predictions and analyze the results. Report the confusion matrix and classification report.


#### Naïve Bayes

In [9]:
predictions = text_clf_nb.predict(X_test)

In [10]:
from sklearn import metrics
print(metrics.confusion_matrix(y_test,predictions))

[[940  51]
 [136 847]]


In [11]:
print(metrics.classification_report(y_test,predictions))

              precision    recall  f1-score   support

         neg       0.87      0.95      0.91       991
         pos       0.94      0.86      0.90       983

    accuracy                           0.91      1974
   macro avg       0.91      0.91      0.91      1974
weighted avg       0.91      0.91      0.91      1974



In [12]:
# Print the overall accuracy
print(metrics.accuracy_score(y_test,predictions))

0.9052684903748733
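
The figures in the classification report and the accuracy score can be recomputed by hand from the confusion matrix. The sketch below does this for the Naïve Bayes matrix printed above, assuming scikit-learn's layout of true labels in rows and predicted labels in columns, ordered neg then pos. Treating pos as the "positive" class is a naming choice made here for illustration.

```python
# Confusion matrix reported above for the Naïve Bayes pipeline
# Rows: true neg, true pos; columns: predicted neg, predicted pos
cm = [[940, 51],
      [136, 847]]

tn, fp = cm[0]   # true neg predicted neg / true neg predicted pos
fn, tp = cm[1]   # true pos predicted neg / true pos predicted pos
total = tn + fp + fn + tp

# Accuracy: correct predictions over all predictions
accuracy = (tn + tp) / total

# Precision: of everything predicted as a class, how much was right
neg_precision = tn / (tn + fn)
pos_precision = tp / (tp + fp)

# Recall: of everything truly in a class, how much was found
neg_recall = tn / (tn + fp)
pos_recall = tp / (tp + fn)

print(total, round(accuracy, 4))                          # 1974 0.9053
print(round(neg_precision, 2), round(neg_recall, 2))      # 0.87 0.95
print(round(pos_precision, 2), round(pos_recall, 2))      # 0.94 0.86
```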


#### Linear SVC

In [13]:
predictions = text_clf_lsvc.predict(X_test)

In [14]:
# metrics was already imported in the Naïve Bayes section above
print(metrics.confusion_matrix(y_test,predictions))

[[900  91]
 [ 63 920]]


In [15]:
print(metrics.classification_report(y_test,predictions))

              precision    recall  f1-score   support

         neg       0.93      0.91      0.92       991
         pos       0.91      0.94      0.92       983

    accuracy                           0.92      1974
   macro avg       0.92      0.92      0.92      1974
weighted avg       0.92      0.92      0.92      1974



In [16]:
# Print the overall accuracy
print(metrics.accuracy_score(y_test,predictions))

0.9219858156028369
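
To compare the two models side by side, the sketch below tallies the errors in the two confusion matrices printed above. The dictionary names and the summary layout are choices made for this example, and pos is again treated as the positive class.

```python
# Confusion matrices reported above
# Rows: true neg, true pos; columns: predicted neg, predicted pos
results = {
    'Naive Bayes': [[940, 51], [136, 847]],
    'LinearSVC': [[900, 91], [63, 920]],
}

summary = {}
for name, ((tn, fp), (fn, tp)) in results.items():
    total = tn + fp + fn + tp
    summary[name] = {
        'false_pos': fp,               # negative reviews labeled pos
        'false_neg': fn,               # positive reviews labeled neg
        'errors': fp + fn,
        'accuracy': (tn + tp) / total,
    }
    print(name, summary[name])
```

On this test split, LinearSVC makes fewer errors overall (154 versus 187). It misses far fewer positive reviews (63 versus 136) but mislabels more negative reviews as positive (91 versus 51).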
