___

<a href='http://www.pieriandata.com'> <img src='../Pierian_Data_Logo.png' /></a>
___

# Text Classification Assessment
This assessment is very much like the Text Classification Project we just completed, and the dataset is very similar.

The **moviereviews2.tsv** dataset contains the text of 6000 movie reviews. 3000 are positive, 3000 are negative, and the text has been preprocessed as a tab-delimited file. As before, labels are given as `pos` and `neg`.

We've included 20 reviews that either contain `NaN` data or consist entirely of whitespace.

For more information on this dataset visit http://ai.stanford.edu/~amaas/data/sentiment/

### Task #1: Perform imports and load the dataset into a pandas DataFrame
For this exercise you can load the dataset from `'../TextFiles/moviereviews2.tsv'`.

In [1]:
from sklearn.svm import LinearSVC
from sklearn.metrics import confusion_matrix,classification_report
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv('moviereviews.tsv', sep='\t')
df.sample(5)

Unnamed: 0,label,review
89,neg,"wow , a film without any redeeming qualities w..."
1538,neg,
334,neg,
1670,neg,the second serial-killer thriller of the month...
120,pos,"available for rental - october 12 , 1999 \r\n1..."


### Task #2: Check for missing values:

In [3]:
# Check for NaN values:
df.isnull().sum()

label      0
review    35
dtype: int64

In [4]:
# Check for whitespace strings (it's OK if there aren't any!):
blanks = []
for index,label,review in df.itertuples():
    if str(review).isspace():
        blanks.append(index)
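
The same two checks can also be written without an explicit loop. The snippet below is a minimal sketch on a small made-up frame (`toy` and its rows are hypothetical, not taken from the dataset); it assumes the review column holds strings or `NaN` only.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the review data.
toy = pd.DataFrame({
    'label': ['pos', 'neg', 'pos', 'neg'],
    'review': ['a fine film', np.nan, '   ', 'dull and slow'],
})

# NaN reviews: isnull() flags them directly.
nan_mask = toy['review'].isnull()

# Whitespace-only reviews: replace NaN with '' first so that
# str.isspace() returns a clean boolean mask ('' is not whitespace).
blank_mask = toy['review'].fillna('').str.isspace()

toy_blanks = toy.index[blank_mask].tolist()
print(int(nan_mask.sum()), toy_blanks)
```

Here `toy_blanks` plays the same role as `blanks` in the loop above: a list of index labels that can be passed to `drop`.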

### Task #3: Remove NaN values:

In [5]:
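df.dropna(inplace=True)
# Note: drop() returns a new DataFrame; df itself only changes if the
# result is reassigned (df = df.drop(blanks)) or inplace=True is passed.
df.drop(blanks)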

Unnamed: 0,label,review
0,neg,how do films like mouse hunt get into theatres...
1,neg,some talented actresses are blessed with a dem...
2,pos,this has been an extraordinary year for austra...
3,pos,according to hollywood movies made in last few...
4,neg,my first press screening of 1998 and already i...
...,...,...
1995,pos,"i like movies with albert brooks , and i reall..."
1996,pos,it might surprise some to know that joel and e...
1997,pos,the verdict : spine-chilling drama from horror...
1998,pos,i want to correct what i wrote in a former ret...
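
One detail worth checking at this step is whether a call modifies the frame or returns a copy. The sketch below uses a hypothetical three-row frame (`toy` and `toy_blanks` are illustrative names) to show that `drop` leaves the original untouched unless its result is kept.

```python
import pandas as pd

# Hypothetical frame with one whitespace-only review at index 1.
toy = pd.DataFrame({'label': ['pos', 'neg', 'pos'],
                    'review': ['good', '   ', 'great']})
toy_blanks = [1]

toy.drop(toy_blanks)            # returns a new frame; toy is unchanged
rows_before = len(toy)

toy = toy.drop(toy_blanks)      # reassign to keep the result
rows_after = len(toy)

print(rows_before, rows_after)
```

Comparing `len(df)` before and after a cleaning step is a quick way to confirm that rows were actually removed.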


### Task #4: Take a quick look at the `label` column:

In [6]:
df.value_counts(['label'])

label
neg      983
pos      982
dtype: int64
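
The near-even split matters when reading the accuracy later on: a model that always predicts the majority label sets the floor. A small sketch of that arithmetic, using the counts printed above:

```python
# Label counts copied from the output above.
counts = {'neg': 983, 'pos': 982}

total = sum(counts.values())
majority_label = max(counts, key=counts.get)

# Accuracy of a classifier that always predicts the majority label.
baseline_accuracy = counts[majority_label] / total

print(majority_label, round(baseline_accuracy, 4))
```

With classes this balanced, the baseline sits at roughly one half, so any accuracy well above that reflects something the model has learned.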

### Task #5: Split the data into train & test sets:
You may use whatever settings you like. To compare your results to the solution notebook, use `test_size=0.33, random_state=42`

In [7]:
X = df['review']
y = df['label']
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.18,random_state=37)
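
To see how a fractional `test_size` translates into row counts, the sketch below runs the same split settings on placeholder data with the same number of rows as the cleaned frame (1965). The `stratify` argument is an addition not used in the cell above; it asks `train_test_split` to preserve the label proportions in both subsets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same row and label counts as the cleaned frame.
X_demo = np.arange(1965)
y_demo = np.array(['neg'] * 983 + ['pos'] * 982)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo,
    test_size=0.18,
    random_state=37,
    stratify=y_demo,   # assumption: optional, keeps pos/neg proportions similar
)

print(len(X_tr), len(X_te))
```

The test-set size here should match the `support` total that appears in the classification report further down.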

### Task #6: Build a pipeline to vectorize the data, then train and fit a model
You may use whatever model you like. To compare your results to the solution notebook, use `LinearSVC`.

In [8]:
train_clf = Pipeline([('tfidf',TfidfVectorizer()),('clf',LinearSVC())])
train_clf.fit(X_train,y_train)

Pipeline(steps=[('tfidf', TfidfVectorizer()), ('clf', LinearSVC())])
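
Because the vectorizer lives inside the pipeline, the fitted object accepts raw strings directly. The following is a minimal sketch on a tiny invented corpus (`docs`, `labels`, and `demo_clf` are hypothetical names); four documents are far too few for a meaningful model, so it only illustrates the mechanics.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Invented mini-corpus, for illustration only.
docs = ['a wonderful , moving film',
        'great acting and a great script',
        'a dull , boring mess',
        'terrible plot and awful acting']
labels = ['pos', 'pos', 'neg', 'neg']

demo_clf = Pipeline([('tfidf', TfidfVectorizer()), ('clf', LinearSVC())])
demo_clf.fit(docs, labels)

# predict() runs the text through the fitted vectorizer, then the classifier.
preds = demo_clf.predict(['a great and wonderful script', 'boring and awful'])

# Individual steps remain accessible by name.
vocab_size = len(demo_clf.named_steps['tfidf'].vocabulary_)
print(list(preds), vocab_size)
```

Keeping both steps in one object also means the vectorizer is fitted on the training text only, which avoids leaking test-set vocabulary into training.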

### Task #7: Run predictions and analyze the results

In [9]:
# Form a prediction set
y_pred = train_clf.predict(X_test)

In [10]:
# Report the confusion matrix
print(confusion_matrix(y_test,y_pred))

[[137  32]
 [ 24 161]]


In [11]:
# Print a classification report
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

         neg       0.85      0.81      0.83       169
         pos       0.83      0.87      0.85       185

    accuracy                           0.84       354
   macro avg       0.84      0.84      0.84       354
weighted avg       0.84      0.84      0.84       354
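
The per-class figures in this report can be recomputed from the confusion matrix printed earlier. The sketch below copies that matrix by hand and assumes scikit-learn's default layout: rows are true labels, columns are predicted labels, both in sorted order (`neg`, then `pos`).

```python
import numpy as np

# Confusion matrix copied from the output above.
# Rows: true neg, true pos. Columns: predicted neg, predicted pos.
cm = np.array([[137, 32],
               [24, 161]])

correct = cm.diagonal()

precision = correct / cm.sum(axis=0)   # correct / everything predicted as that class
recall = correct / cm.sum(axis=1)      # correct / everything truly in that class
f1 = 2 * precision * recall / (precision + recall)
support = cm.sum(axis=1)

for name, p, r, f, s in zip(['neg', 'pos'], precision, recall, f1, support):
    print(f'{name}  precision={p:.2f}  recall={r:.2f}  f1={f:.2f}  support={s}')
```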



In [12]:
# Print the overall accuracy
train_clf.score(X_test,y_test)

0.8418079096045198
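
This score is simply the share of test reviews on the diagonal of the confusion matrix. A short check of that arithmetic, with the matrix values copied from the earlier output:

```python
import numpy as np

# Confusion matrix copied from the earlier output.
cm = np.array([[137, 32],
               [24, 161]])

# Accuracy = correctly classified reviews / all test reviews.
accuracy = cm.trace() / cm.sum()
print(cm.trace(), cm.sum(), accuracy)
```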

## Great job!