# Text Classification Project - Predict Movie Review Sentiment

The **moviereviews2.tsv** dataset contains the text of 6000 movie reviews: 3000 positive and 3000 negative. The text has been preprocessed and stored as a tab-delimited file. As before, labels are given as `pos` and `neg`.


For more information on this dataset, visit http://ai.stanford.edu/~amaas/data/sentiment/

In [1]:
import numpy as np
import pandas as pd

In [2]:
df = pd.read_csv('moviereviews2.tsv',sep='\t')
raw_leng = len(df)
df.head()

|   | label | review |
|---|-------|--------|
| 0 | pos | I loved this movie and will watch it again. Or... |
| 1 | pos | A warm, touching movie that has a fantasy-like... |
| 2 | pos | I was not expecting the powerful filmmaking ex... |
| 3 | neg | This so-called "documentary" tries to tell tha... |
| 4 | pos | This show has been my escape from reality for ... |


## Clean missing values

In [3]:
display(df.isnull().sum())
df.dropna(inplace=True)
display(df.isnull().sum())

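# Collect the index of every review that contains only whitespace
blanks = []
for i, lb, rv in df.itertuples():
    if isinstance(rv, str) and rv.isspace():
        blanks.append(i)
# Drop the whitespace-only rows in place
df.drop(blanks, inplace=True)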

clean_leng = len(df)

print("Removed {} empty rows from the data; now contains {} rows".format(raw_leng - clean_leng, clean_leng))

display(df['label'].value_counts())

label      0
review    20
dtype: int64

label     0
review    0
dtype: int64

Removed 20 empty rows from the data; now contains 5980 rows


neg    2990
pos    2990
Name: label, dtype: int64
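
The loop above can also be written with pandas string methods. The following is a minimal sketch on a made-up four-row frame (`toy` and its contents are illustrative, not taken from the dataset). Unlike `isspace()`, the `str.strip()` comparison also drops empty strings.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the reviews table: one NaN and one whitespace-only review
toy = pd.DataFrame({
    'label': ['pos', 'neg', 'pos', 'neg'],
    'review': ['Great film.', np.nan, '   ', 'Dull and slow.'],
})

# Step 1: drop rows whose review is missing
cleaned = toy.dropna(subset=['review'])

# Step 2: drop rows whose review is empty after stripping whitespace
cleaned = cleaned[cleaned['review'].str.strip() != '']

print("Removed {} rows; {} rows kept".format(len(toy) - len(cleaned), len(cleaned)))
```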

## Prepare data for classification

In [4]:
from sklearn.model_selection import train_test_split

In [5]:
X = df['review']
y = df['label']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
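
With 5980 cleaned rows and `test_size=0.33`, the split should leave 1974 rows for testing, which is the total the evaluation section reports. A quick sketch that checks the split sizes on placeholder indices (`row_ids` is a stand-in for the real reviews, not part of the notebook):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in for the 5980 cleaned reviews: only the row count matters here
row_ids = np.arange(5980)

train_ids, test_ids = train_test_split(row_ids, test_size=0.33, random_state=42)

print(len(train_ids), len(test_ids))
```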

## Build a pipeline to vectorize the data and train the model

In [6]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

In [7]:
text_clf = Pipeline([('tfidf', TfidfVectorizer()),
                    ('clf', LinearSVC())])
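
To see what the `tfidf` step hands to the classifier, here is a small sketch that runs `TfidfVectorizer` with default settings on three invented documents (`toy_docs` is illustrative only). Each document becomes one row of a sparse matrix with one column per vocabulary term, and with the default `norm='l2'` every row has unit length.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Three made-up documents, not taken from the dataset
toy_docs = ["good movie", "bad movie", "good good fun"]

toy_vectorizer = TfidfVectorizer()
toy_matrix = toy_vectorizer.fit_transform(toy_docs)

# Vocabulary learned from the documents and the resulting matrix shape
print(sorted(toy_vectorizer.vocabulary_))
print(toy_matrix.shape)

# Euclidean length of each row after the default L2 normalization
row_norms = np.linalg.norm(toy_matrix.toarray(), axis=1)
print(row_norms)
```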

## Train the model

In [8]:
text_clf.fit(X_train, y_train)

Pipeline(memory=None,
     steps=[('tfidf', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,...ax_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0))])

## Make predictions

In [9]:
predictions = text_clf.predict(X_test)
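
Because the pipeline bundles the vectorizer with the classifier, `predict` accepts raw strings directly. A minimal sketch, assuming a tiny invented training set in place of the real reviews (`toy_clf`, `train_texts`, and `new_reviews` are hypothetical names); with so little data the predicted labels are only illustrative.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Invented training examples standing in for the movie reviews
train_texts = [
    "I loved this wonderful film",
    "A wonderful, touching story",
    "Terrible acting and a boring plot",
    "Boring, terrible waste of time",
]
train_labels = ['pos', 'pos', 'neg', 'neg']

# Same two-step structure as text_clf above
toy_clf = Pipeline([('tfidf', TfidfVectorizer()),
                    ('clf', LinearSVC())])
toy_clf.fit(train_texts, train_labels)

# Unseen raw text goes straight into predict; no manual vectorizing needed
new_reviews = ["What a wonderful film", "Such a boring plot"]
toy_predictions = toy_clf.predict(new_reviews)
print(toy_predictions)
```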

## Evaluate performance

In [10]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

In [11]:
display(pd.DataFrame(confusion_matrix(y_test,predictions), 
                            index=['TrueNeg','TruePos'], 
                            columns=['PredNeg','PredPos']))
print(classification_report(y_test,predictions))
print(accuracy_score(y_test,predictions))

|         | PredNeg | PredPos |
|---------|---------|---------|
| TrueNeg | 900 | 91 |
| TruePos | 63 | 920 |


              precision    recall  f1-score   support

         neg       0.93      0.91      0.92       991
         pos       0.91      0.94      0.92       983

   micro avg       0.92      0.92      0.92      1974
   macro avg       0.92      0.92      0.92      1974
weighted avg       0.92      0.92      0.92      1974

0.9219858156028369
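
The scores in the report can be reproduced by hand from the four confusion-matrix counts. The sketch below uses only those counts (the variable names are illustrative) and applies the standard definitions of accuracy, precision, and recall.

```python
# Counts copied from the confusion matrix (rows = true label, columns = predicted label)
tn, fp = 900, 91   # true neg predicted neg, true neg predicted pos
fn, tp = 63, 920   # true pos predicted neg, true pos predicted pos

total = tn + fp + fn + tp
accuracy = (tn + tp) / total

# Precision: of everything predicted as a class, how much was correct
precision_neg = tn / (tn + fn)
precision_pos = tp / (tp + fp)

# Recall: of everything truly in a class, how much was found
recall_neg = tn / (tn + fp)
recall_pos = tp / (tp + fn)

print("total test rows:", total)
print("accuracy: {:.4f}".format(accuracy))
print("neg precision/recall: {:.2f} / {:.2f}".format(precision_neg, recall_neg))
print("pos precision/recall: {:.2f} / {:.2f}".format(precision_pos, recall_pos))
```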
