# Text Classification Assessment
This assessment closely follows the Text Classification Project we just completed, and it uses a very similar dataset.

The **moviereviews2.tsv** dataset contains the text of 6000 movie reviews. 3000 are positive, 3000 are negative, and the text has been preprocessed as a tab-delimited file. As before, labels are given as `pos` and `neg`.

We've included 20 reviews that either contain `NaN` data or consist only of whitespace.

For more information on this dataset visit http://ai.stanford.edu/~amaas/data/sentiment/

### Task #1: Perform imports and load the dataset into a pandas DataFrame
For this exercise you can load the dataset from `'../TextFiles/moviereviews2.tsv'`.

In [1]:
import pandas as pd

df = pd.read_csv('../TextFiles/moviereviews2.tsv', sep='\t')
df.head()
df.describe()

| | label | review |
|---|---|---|
| count | 6000 | 5980 |
| unique | 2 | 5966 |
| top | pos | What was an exciting and fairly original serie... |
| freq | 3000 | 2 |


### Task #2: Check for missing values:

In [2]:
# Check for NaN values:
df.isnull().sum()

label      0
review    20
dtype: int64

In [3]:
# Check for whitespace strings (it's OK if there aren't any!):
blanks = []  # start with an empty list

for i,lb,rv in df.itertuples():  # iterate over the DataFrame
    if isinstance(rv, str):      # avoid NaN values
        if rv.isspace():         # test 'review' for whitespace
            blanks.append(i)     # add matching index numbers to the list
        
print(len(blanks), 'blanks: ', blanks)

0 blanks:  []
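The same check can also be written without an explicit loop. The sketch below is a minimal illustration on a hypothetical toy DataFrame (the name `toy` and its rows are assumptions, not part of the dataset). It fills missing values with an empty string first, so that `.str.isspace()` returns a plain boolean for every row.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the reviews (not from moviereviews2.tsv)
toy = pd.DataFrame({
    'label': ['pos', 'neg', 'pos', 'neg'],
    'review': ['Great film', '   ', np.nan, 'Dull plot'],
})

# An empty string is not whitespace, so NaN rows are reported as non-blank here
is_blank = toy['review'].fillna('').str.isspace()

# Collect the index labels of the whitespace-only reviews
blanks = toy.index[is_blank].tolist()
print(len(blanks), 'blanks: ', blanks)
```

On the toy frame only the row at index 1 is whitespace-only; the `NaN` row is left for `dropna()` to handle.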


### Task #3: Remove NaN values:

In [4]:
df.dropna(inplace=True)
df.describe()

| | label | review |
|---|---|---|
| count | 5980 | 5980 |
| unique | 2 | 5966 |
| top | pos | What was an exciting and fairly original serie... |
| freq | 2990 | 2 |


### Task #4: Take a quick look at the `label` column:

In [5]:
df['label'].value_counts()

label
pos    2990
neg    2990
Name: count, dtype: int64

### Task #5: Split the data into train & test sets:
You may use whatever settings you like. To compare your results to the solution notebook, use `test_size=0.33, random_state=42`

In [6]:
from sklearn.model_selection import train_test_split

X = df['review']
y = df['label']
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.33, 
                                                    random_state=42)

y_train.value_counts(), y_test.value_counts()

(label
 pos    2007
 neg    1999
 Name: count, dtype: int64,
 label
 neg    991
 pos    983
 Name: count, dtype: int64)
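The split above leaves the two classes slightly uneven between the train and test sets. If you want the class proportions preserved exactly, `train_test_split` accepts a `stratify` argument. The sketch below uses a hypothetical 20-row toy frame and `test_size=0.5` purely for illustration; using it on the real data would give different counts than the solution notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical balanced toy data: 10 'pos' and 10 'neg' rows
toy = pd.DataFrame({
    'review': [f'review {i}' for i in range(20)],
    'label': ['pos', 'neg'] * 10,
})

# stratify=... keeps the pos/neg ratio the same in both halves
X_tr, X_te, y_tr, y_te = train_test_split(
    toy['review'], toy['label'],
    test_size=0.5, random_state=42, stratify=toy['label'])

print(y_tr.value_counts().to_dict(), y_te.value_counts().to_dict())
```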

### Task #6: Build a pipeline to vectorize the data, then train and fit a model
You may use whatever model you like. To compare your results to the solution notebook, use `LinearSVC`.

In [7]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.dummy import DummyClassifier
from sklearn.svm import LinearSVC
from sklearn.svm import NuSVC
from sklearn.ensemble import RandomForestClassifier

text_clf_dummy = Pipeline([('tfidf', TfidfVectorizer()),
                     ('clf', DummyClassifier()),
])
text_clf_svc = Pipeline([('tfidf', TfidfVectorizer()),
                     ('clf', LinearSVC()),
])
text_clf_nusvc = Pipeline([('tfidf', TfidfVectorizer()),
                     ('clf', NuSVC()),
])
text_clf_rf = Pipeline([('tfidf', TfidfVectorizer()),
                     ('clf', RandomForestClassifier()),
])


In [8]:
text_clf_dummy.fit(X_train, y_train)  

In [9]:
text_clf_svc.fit(X_train, y_train)  



In [10]:
text_clf_nusvc.fit(X_train, y_train)  

In [11]:
text_clf_rf.fit(X_train, y_train)  
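Because each pipeline bundles the vectorizer with the classifier, a fitted pipeline can be called directly on raw strings. The sketch below shows this on a handful of made-up reviews (`toy_reviews`, `toy_labels`, and `toy_clf` are hypothetical names introduced for illustration); with so little training data the predictions themselves are not meaningful.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Hypothetical toy reviews, not taken from moviereviews2.tsv
toy_reviews = [
    'a wonderful, moving film with great acting',
    'brilliant story and a great cast',
    'dull, boring and far too long',
    'a terrible script and awful acting',
]
toy_labels = ['pos', 'pos', 'neg', 'neg']

# Same two-step structure as the pipelines above
toy_clf = Pipeline([('tfidf', TfidfVectorizer()),
                    ('clf', LinearSVC())])
toy_clf.fit(toy_reviews, toy_labels)

# The pipeline vectorizes raw strings itself, so predict() takes plain text
new_reviews = ['great film', 'boring and awful']
predictions = toy_clf.predict(new_reviews)
print(list(zip(new_reviews, predictions)))
```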

### Task #7: Run predictions and analyze the results

In [12]:
# Form a prediction set
y_test_dummy = text_clf_dummy.predict(X_test)
y_test_svc = text_clf_svc.predict(X_test)
y_test_nusvc = text_clf_nusvc.predict(X_test)
y_test_rf = text_clf_rf.predict(X_test)

In [13]:
# Report the confusion matrix
from sklearn.metrics import confusion_matrix
print("Dummy: \n",confusion_matrix(y_test,y_test_dummy))
print("SVC: \n",confusion_matrix(y_test,y_test_svc))
print("NuSVC: \n",confusion_matrix(y_test,y_test_nusvc))
print("Random Forest: \n",confusion_matrix(y_test,y_test_rf))


Dummy: 
 [[  0 991]
 [  0 983]]
SVC: 
 [[900  91]
 [ 63 920]]
NuSVC: 
 [[894  97]
 [ 61 922]]
Random Forest: 
 [[875 116]
 [113 870]]
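
To read these matrices: scikit-learn sorts the labels, so rows are the true classes (`neg`, `pos`) and columns are the predicted classes in the same order. As a worked check, the sketch below recomputes a few metrics by hand from the SVC matrix printed above, treating `pos` as the positive class.

```python
import numpy as np

# SVC confusion matrix copied from the output above
# rows = true (neg, pos), columns = predicted (neg, pos)
cm = np.array([[900, 91],
               [63, 920]])

# With 'pos' as the positive class, ravel() yields tn, fp, fn, tp
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision_pos = tp / (tp + fp)   # of all predicted 'pos', how many were right
recall_pos = tp / (tp + fn)      # of all true 'pos', how many were found

print(f'{accuracy:.4f} {precision_pos:.4f} {recall_pos:.4f}')
```

These values should agree with the `pos` row and the accuracy line of the SVC classification report.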


In [14]:
# Print a classification report
from sklearn.metrics import classification_report
print("Dummy: \n",classification_report(y_test,y_test_dummy,digits=4))
print("SVC: \n",classification_report(y_test,y_test_svc,digits=4))
print("NuSVC: \n",classification_report(y_test,y_test_nusvc,digits=4))
print("Random Forest: \n",classification_report(y_test,y_test_rf,digits=4))

Dummy: 
               precision    recall  f1-score   support

         neg     0.0000    0.0000    0.0000       991
         pos     0.4980    1.0000    0.6649       983

    accuracy                         0.4980      1974
   macro avg     0.2490    0.5000    0.3324      1974
weighted avg     0.2480    0.4980    0.3311      1974

SVC: 
               precision    recall  f1-score   support

         neg     0.9346    0.9082    0.9212       991
         pos     0.9100    0.9359    0.9228       983

    accuracy                         0.9220      1974
   macro avg     0.9223    0.9220    0.9220      1974
weighted avg     0.9223    0.9220    0.9220      1974

NuSVC: 
               precision    recall  f1-score   support

         neg     0.9361    0.9021    0.9188       991
         pos     0.9048    0.9379    0.9211       983

    accuracy                         0.9200      1974
   macro avg     0.9205    0.9200    0.9199      1974
weighted avg     0.9205    0.9200    0.9199      1974



In [15]:
# Print the overall accuracy
from sklearn import metrics
print(f'Dummy:        {metrics.accuracy_score(y_test,y_test_dummy):.4f}')
print(f'SVC:          {metrics.accuracy_score(y_test,y_test_svc):.4f}')
print(f'NuSVC:        {metrics.accuracy_score(y_test,y_test_nusvc):.4f}')
print(f'RandomForest: {metrics.accuracy_score(y_test,y_test_rf):.4f}')

Dummy:        0.4980
SVC:          0.9220
NuSVC:        0.9200
RandomForest: 0.8840
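
The Dummy score is a useful floor for the other models. With its default strategy, `DummyClassifier` always predicts the most frequent class in the training labels, so its accuracy is simply that class's share of the test set. The sketch below reproduces the figure from the class counts printed in Task #5 (the variable names are illustrative).

```python
# Class counts copied from the train/test split output in Task #5
train_counts = {'pos': 2007, 'neg': 1999}
test_counts = {'neg': 991, 'pos': 983}

# The majority class in the training set is what the dummy model always predicts
majority = max(train_counts, key=train_counts.get)

# Accuracy = share of the test set that happens to belong to that class
baseline_acc = test_counts[majority] / sum(test_counts.values())
print(majority, f'{baseline_acc:.4f}')
```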
