# Text Classification Assessment
This assessment closely follows the Text Classification Project we just completed, and the dataset is very similar.

The **moviereviews2.tsv** dataset contains the text of 6000 movie reviews. 3000 are positive, 3000 are negative, and the text has been preprocessed as a tab-delimited file. As before, labels are given as `pos` and `neg`.

We've included 20 reviews whose text is either `NaN` or a string made up entirely of whitespace.

For more information on this dataset visit http://ai.stanford.edu/~amaas/data/sentiment/

### Task #1: Perform imports and load the dataset into a pandas DataFrame
For this exercise you can load the dataset from `'../TextFiles/moviereviews2.tsv'`.

In [1]:
import numpy as np
import pandas as pd

df = pd.read_csv('../TextFiles/moviereviews2.tsv', sep='\t')
df.head()

  label                                             review
0   pos  I loved this movie and will watch it again. Or...
1   pos  A warm, touching movie that has a fantasy-like...
2   pos  I was not expecting the powerful filmmaking ex...
3   neg  This so-called "documentary" tries to tell tha...
4   pos  This show has been my escape from reality for ...


### Task #2: Check for missing values:

In [2]:
# Check for NaN values:
df.isnull().sum()

label      0
review    20
dtype: int64

In [3]:
# Check for whitespace strings (it's OK if there aren't any!):
whitespace = []

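for i, lb, rv in df.itertuples():
    if isinstance(rv, str):       # skip NaN entries, which are floats
        if rv.isspace():          # True only for non-empty, whitespace-only strings
            whitespace.append(i)  # record the row index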
print('whitespace strings: ', len(whitespace))

whitespace strings:  0
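
The same check can be written more compactly with `Series.map`. The sketch below runs on a small made-up DataFrame (the `toy` frame and its contents are assumptions for illustration, not rows from `moviereviews2.tsv`), so that both a whitespace-only string and a `NaN` are present:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df, with two whitespace-only reviews and one NaN
toy = pd.DataFrame({
    'label': ['pos', 'neg', 'pos', 'neg'],
    'review': ['Great film.', '   ', np.nan, '\t\n'],
})

# True where the review is a string consisting only of whitespace.
# NaN is a float, so the isinstance test filters it out before isspace() runs.
is_blank = toy['review'].map(lambda rv: isinstance(rv, str) and rv.isspace())

blank_index = toy.index[is_blank].tolist()
print('whitespace strings: ', len(blank_index))
print(blank_index)  # [1, 3]
```

Note that `''.isspace()` is `False` in Python, so neither this sketch nor the loop above would flag an empty string; when reading a file with `pd.read_csv`, empty fields normally arrive as `NaN` instead.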


### Task #3: Remove NaN values:

In [4]:
print(f"Rows before removal: {len(df)}")
df.dropna(inplace=True)
print(f"Rows after removal: {len(df)}")

Rows before removal: 6000
Rows after removal: 5980


### Task #4: Take a quick look at the `label` column:

In [5]:
df['label'].value_counts()

label
pos    2990
neg    2990
Name: count, dtype: int64
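
To read the class balance as proportions rather than raw counts, `value_counts` accepts `normalize=True`. The sketch below rebuilds the counts shown above as a stand-in Series (the `labels` variable is an assumption for illustration; in the notebook you would call this on `df['label']` directly):

```python
import pandas as pd

# Stand-in for df['label'] with the same counts as the output above
labels = pd.Series(['pos'] * 2990 + ['neg'] * 2990, name='label')

# normalize=True returns each class's share of the rows instead of its count
proportions = labels.value_counts(normalize=True)
print(proportions)
```

Equal shares for both classes mean that plain accuracy is a reasonable headline metric for this dataset.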

### Task #5: Split the data into train & test sets:
You may use whatever settings you like. To compare your results to the solution notebook, use `test_size=0.33, random_state=42`

In [6]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Initialize LabelEncoder
label_encoder = LabelEncoder()

X = df['review']
y = label_encoder.fit_transform(df['label'])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
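
Because the labels are now integers, it helps to know which integer stands for which class when reading the reports later on. A minimal sketch on a few made-up labels (the list passed to `fit_transform` is illustrative only):

```python
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
encoded = label_encoder.fit_transform(['pos', 'neg', 'neg', 'pos'])

# classes_ holds the original labels in sorted order; each label's position
# in that array is the integer it was encoded as.
mapping = {str(name): code for code, name in enumerate(label_encoder.classes_)}
print(mapping)  # {'neg': 0, 'pos': 1}

# inverse_transform converts integer predictions back to the string labels
print(label_encoder.inverse_transform([0, 1]))
```

So in the confusion matrices and classification reports, class `0` is `neg` and class `1` is `pos`.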

### Task #6: Build a pipeline to vectorize the data, then train and fit a model
You may use whatever model you like. To compare your results to the solution notebook, use `LinearSVC`.

In [7]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from xgboost import XGBClassifier  # third-party package; install it separately (e.g. `pip install xgboost`)

# Linear SVC:
text_clf_lsvc = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', LinearSVC()),
])

# XGBoost:
text_clf_xgb = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', XGBClassifier()),
])

text_clf_lsvc.fit(X_train, y_train)
text_clf_xgb.fit(X_train, y_train)

ModuleNotFoundError: No module named 'xgboost'
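
The `ModuleNotFoundError` above means the third-party `xgboost` package is not installed in the active environment. One way to guard against this, sketched here with only the standard library, is to check for the package before importing it. The helper name `is_installed` is hypothetical and not part of the notebook:

```python
import importlib.util

def is_installed(package_name):
    """Return True if a top-level package can be located, without importing it."""
    return importlib.util.find_spec(package_name) is not None

if is_installed("xgboost"):
    print("xgboost is available; the XGBoost pipeline can be built.")
else:
    print("xgboost is missing; install it (for example with `pip install xgboost`) "
          "or skip that pipeline.")
```

With a check like this, the `LinearSVC` pipeline can still be trained and evaluated when `xgboost` is absent.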

### Task #7: Run predictions and analyze the results

In [None]:
# Form a prediction set
predictions_lsvc = text_clf_lsvc.predict(X_test)
predictions_xgb = text_clf_xgb.predict(X_test)

In [None]:
# Report the confusion matrix
from sklearn import metrics
print("Linear SVC: \n", metrics.confusion_matrix(y_test,predictions_lsvc))
print("XGBoost: \n", metrics.confusion_matrix(y_test,predictions_xgb))

Linear SVC: 
 [[900  91]
 [ 63 920]]
XGBoost: 
 [[846 145]
 [ 95 888]]


In [None]:
# Print a classification report
print("Linear SVC: \n", metrics.classification_report(y_test,predictions_lsvc))
print("XGBoost: \n", metrics.classification_report(y_test,predictions_xgb))

Linear SVC: 
               precision    recall  f1-score   support

           0       0.93      0.91      0.92       991
           1       0.91      0.94      0.92       983

    accuracy                           0.92      1974
   macro avg       0.92      0.92      0.92      1974
weighted avg       0.92      0.92      0.92      1974

XGBoost: 
               precision    recall  f1-score   support

           0       0.90      0.85      0.88       991
           1       0.86      0.90      0.88       983

    accuracy                           0.88      1974
   macro avg       0.88      0.88      0.88      1974
weighted avg       0.88      0.88      0.88      1974
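
The per-class figures in these reports can be recomputed by hand from the confusion matrices. The sketch below does so for the Linear SVC matrix, assuming scikit-learn's convention that rows are true labels and columns are predicted labels; the helper `per_class_scores` is a hypothetical name introduced for illustration:

```python
# Linear SVC confusion matrix copied from the earlier output.
# Rows are true labels, columns are predicted labels.
cm = [[900, 91],
      [63, 920]]

def per_class_scores(cm, k):
    """Precision, recall and F1 for class k of a square confusion matrix."""
    tp = cm[k][k]                            # correctly predicted as class k
    predicted_k = sum(row[k] for row in cm)  # column sum: everything predicted as k
    actual_k = sum(cm[k])                    # row sum: everything that truly is k
    precision = tp / predicted_k
    recall = tp / actual_k
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

for k in range(len(cm)):
    p, r, f = per_class_scores(cm, k)
    print(f"class {k}: precision={p:.2f} recall={r:.2f} f1={f:.2f} support={sum(cm[k])}")
```

The row sums (991 and 983) are the `support` values in the report, and the rounded scores should agree with the Linear SVC table.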



In [None]:
# Print the overall accuracy
print("Linear SVC: \n", metrics.accuracy_score(y_test,predictions_lsvc))
print("XGBoost: \n", metrics.accuracy_score(y_test,predictions_xgb))

Linear SVC: 
 0.9219858156028369
XGBoost: 
 0.878419452887538
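
As a cross-check, overall accuracy is the sum of the confusion matrix diagonal divided by the total number of test samples. A short sketch using the two matrices reported earlier (the function name `accuracy_from_confusion` is hypothetical):

```python
# Confusion matrices copied from the earlier output
confusion = {
    "Linear SVC": [[900, 91], [63, 920]],
    "XGBoost": [[846, 145], [95, 888]],
}

def accuracy_from_confusion(cm):
    """Fraction of samples on the diagonal, i.e. predicted correctly."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total

for name, cm in confusion.items():
    print(f"{name}: {accuracy_from_confusion(cm):.4f}")
```

For Linear SVC this is 1820 / 1974 and for XGBoost 1734 / 1974, which match the `accuracy_score` values above.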
