### Task #1: Perform imports and load the dataset into a pandas DataFrame
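For this exercise you can load the dataset from `'../TextFiles/moviereviews2.tsv'`; adjust the path if your copy is stored elsewhere (the cell below reads it from the current directory).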

In [2]:
import pandas as pd
import numpy as np

In [4]:
df = pd.read_csv('./moviereviews2.tsv', sep='\t')

In [5]:
df.sample(5)

     label                                             review
2429   neg  Don't waste your time on this dreck. As portra...
941    pos  This short film that inspired the soon-to-be f...
4226   pos  This James bond game is the best bond game i h...
1725   neg  I've had to change my view on the worst film i...
4975   neg  I´m not surprised that even cowgirls get the b...
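
As a quick illustration of what `sep='\t'` does, the sketch below parses a tiny tab-separated string held in memory. The two rows are made up for the example and are not taken from `moviereviews2.tsv`.

```python
import io
import pandas as pd

# Hypothetical two-row TSV; the columns mirror the label/review layout above.
tsv_text = (
    "label\treview\n"
    "pos\tLoved it, would watch again.\n"
    "neg\tDull, and far too long.\n"
)

# sep='\t' tells pandas to split fields on tabs, so commas inside a review
# stay part of the text instead of starting a new column.
toy = pd.read_csv(io.StringIO(tsv_text), sep='\t')

print(toy.shape)
print(list(toy.columns))
```

Because reviews are free text full of commas, a tab delimiter is the safer choice for this kind of file.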


### Task #2: Check for missing values:

In [7]:
# Check for NaN values:
df.isnull().sum()

label      0
review    20
dtype: int64

After the NaN rows are removed in Task #3, the same check returns:

label     0
review    0
dtype: int64

In [10]:
# Check for whitespace strings (it's OK if there aren't any!):
blanks = []

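for i,lb,rv in df.itertuples():
    # NaN reviews are floats, not strings, so skip them before calling isspace()
    if isinstance(rv, str) and rv.isspace():
        blanks.append(i)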

print(blanks)

[]
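
The same two checks can also be written with pandas string methods instead of a Python loop. This is a minimal sketch on a made-up frame (the rows are hypothetical, not from the dataset), so the names `toy` and `blank_index` are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one NaN review and one whitespace-only review.
toy = pd.DataFrame({
    'label': ['pos', 'neg', 'pos', 'neg'],
    'review': ['Great film.', np.nan, '   ', 'Terrible.'],
})

# Count NaN values per column, as in the cell above.
nan_counts = toy.isnull().sum()

# Replace NaN with an empty string first: ''.isspace() is False, so missing
# values are not reported as blanks and the mask stays purely boolean.
is_blank = toy['review'].fillna('').str.isspace()
blank_index = toy.index[is_blank].tolist()

print(nan_counts)
print(blank_index)
```

On a frame of a few thousand rows either approach is fast enough; the vectorized form mainly avoids the type check inside the loop.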


### Task #3: Remove NaN values:

In [None]:
df.dropna(inplace=True)
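
`dropna(inplace=True)` modifies `df` directly and returns `None`. A possible alternative, sketched here on made-up rows, returns a new frame and limits the check to the `review` column via the `subset` parameter, leaving the original untouched.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a single missing review.
toy = pd.DataFrame({
    'label': ['pos', 'neg', 'neg'],
    'review': ['Great film.', np.nan, 'Terrible.'],
})

# Without inplace=True, dropna returns a filtered copy; subset restricts the
# NaN check to the listed columns.
cleaned = toy.dropna(subset=['review'])

print(len(toy), len(cleaned))
print(cleaned.isnull().sum())
```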

### Task #4: Take a quick look at the `label` column:

In [12]:
df['label'].value_counts()

label
pos    2990
neg    2990
Name: count, dtype: int64

### Task #5: Split the data into train & test sets:
You may use whatever settings you like. To compare your results to the solution notebook, use `test_size=0.33, random_state=42`.

In [13]:
X = df['review']
y = df['label']

In [14]:
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

In [16]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [23]:
from sklearn import metrics

In [15]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
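
To check how many rows land in each split, the sketch below runs the same call on stand-in data with the same size and class balance as the cleaned dataset (5980 rows, half `pos` and half `neg`). The `stratify` argument is an optional addition that the notebook does not use; it keeps the class proportions identical in both splits.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Stand-in data: placeholder strings instead of real reviews.
X_demo = [f"review {i}" for i in range(5980)]
y_demo = ['pos', 'neg'] * 2990

# Same settings as above, plus stratify (an assumption, not in the notebook).
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42, stratify=y_demo)

print(len(X_tr), len(X_te))
print(Counter(y_te))
```

scikit-learn rounds the test set size up, so 33% of 5980 rows becomes 1974 test rows and 4006 training rows.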

### Task #6: Build a pipeline to vectorize the data, then train and fit a model
You may use whatever model you like. To compare your results to the solution notebook, use `LinearSVC`.

In [19]:
txt_clf = Pipeline([('tfidf', TfidfVectorizer()),
                   ('clf', LinearSVC())])

In [21]:
txt_clf.fit(X_train,y_train)
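
To see what the pipeline does with raw text, the sketch below fits the same two steps on a handful of made-up sentences. The sentences, labels, and the names `toy_clf` and `vocab_size` are hypothetical, and a model trained on six examples is only meant to show the mechanics.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical mini corpus, three examples per class.
toy_reviews = [
    "a wonderful and moving film",
    "great acting and a great story",
    "I loved every minute of it",
    "a dull and boring mess",
    "terrible acting and a weak story",
    "I hated every minute of it",
]
toy_labels = ['pos', 'pos', 'pos', 'neg', 'neg', 'neg']

# Same structure as txt_clf: the vectorizer turns strings into TF-IDF
# features, and the classifier is trained on those features.
toy_clf = Pipeline([('tfidf', TfidfVectorizer()),
                    ('clf', LinearSVC())])
toy_clf.fit(toy_reviews, toy_labels)

# The fitted vectorizer exposes the vocabulary it learned from the corpus.
vocab_size = len(toy_clf.named_steps['tfidf'].vocabulary_)

# predict() accepts raw strings; the pipeline vectorizes them internally.
prediction = toy_clf.predict(["a wonderful story"])

print(vocab_size)
print(prediction)
```

Because the vectorizer lives inside the pipeline, it is fitted on the training text only, and new text passes through the same learned vocabulary at prediction time.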

### Task #7: Run predictions and analyze the results

In [22]:
# Form a prediction set
pred = txt_clf.predict(X_test)

In [24]:
# Report the confusion matrix

print(metrics.confusion_matrix(y_test, pred))

[[900  91]
 [ 63 920]]


In [25]:
# Print a classification report
print(metrics.classification_report(y_test, pred))

              precision    recall  f1-score   support

         neg       0.93      0.91      0.92       991
         pos       0.91      0.94      0.92       983

    accuracy                           0.92      1974
   macro avg       0.92      0.92      0.92      1974
weighted avg       0.92      0.92      0.92      1974
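
The figures in the report can be derived directly from the confusion matrix. In scikit-learn's layout, rows are true labels and columns are predicted labels, both in sorted order (`neg`, then `pos`). A short sketch that recomputes them with NumPy:

```python
import numpy as np

# Confusion matrix reported above: rows = true labels, columns = predictions,
# both ordered ['neg', 'pos'].
cm = np.array([[900, 91],
               [63, 920]])

correct = np.diag(cm)                  # correctly classified count per class
precision = correct / cm.sum(axis=0)   # divide by column totals (predicted counts)
recall = correct / cm.sum(axis=1)      # divide by row totals (true counts = support)
f1 = 2 * precision * recall / (precision + recall)
accuracy = correct.sum() / cm.sum()

for name, p, r, f in zip(['neg', 'pos'], precision, recall, f1):
    print(f"{name}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
print(f"accuracy={accuracy:.4f}")
```

The row totals (991 and 983) are the `support` values in the report, and the diagonal sum divided by the grand total (1820 / 1974) is the overall accuracy.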



In [26]:
# Print the overall accuracy
print(metrics.accuracy_score(y_test, pred))

0.9219858156028369


## Great job!