# Text Classification
In this project, we explore, pre-process, vectorize, and classify a movie review dataset (more information at http://ai.stanford.edu/~amaas/data/sentiment/). We build a simple text classifier and evaluate its predictions.

The **moviereviews2.tsv** dataset contains the text of 6000 movie reviews: 3000 positive and 3000 negative. The text has been preprocessed and stored as a tab-delimited file, with labels given as `pos` and `neg`.

In [1]:
import numpy as np
import pandas as pd

df = pd.read_csv('moviereviews2.tsv', sep='\t')
df.head()

  label                                             review
0   pos  I loved this movie and will watch it again. Or...
1   pos  A warm, touching movie that has a fantasy-like...
2   pos  I was not expecting the powerful filmmaking ex...
3   neg  This so-called "documentary" tries to tell tha...
4   pos  This show has been my escape from reality for ...


In [2]:
# Check for NaN values:
df.isna().sum()

label      0
review    20
dtype: int64

In [4]:
# Check for empty or whitespace-only strings:
blanks = []

for idx, lb, rv in df.itertuples():
    # NaN reviews are floats, so only test actual strings
    if isinstance(rv, str):
        if rv.isspace() or rv == '':
            blanks.append(idx)

len(blanks)

0
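
The loop above can also be written as a vectorized pandas expression. The following is a minimal sketch on a small made-up DataFrame (the `sample` rows are assumptions for illustration, not taken from moviereviews2.tsv). It flags reviews that are empty or contain only whitespace, and leaves missing values to the separate `isna()` check.

```python
import pandas as pd

# Hypothetical sample data, not taken from the real dataset
sample = pd.DataFrame({
    'label': ['pos', 'neg', 'pos', 'neg'],
    'review': ['Great film.', '   ', '', None],
})

# Strip whitespace, then compare to the empty string;
# any missing comparison result is treated as "not blank"
is_blank = (sample['review'].str.strip() == '').fillna(False).astype(bool)
blank_idx = sample.index[is_blank].tolist()
print(blank_idx)
```

On the real data, the same expression would be applied to `df['review']` in place of `sample['review']`.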

### Remove NaN values:

In [5]:
df.dropna(inplace=True)

In [6]:
df.label.value_counts()

pos    2990
neg    2990
Name: label, dtype: int64

### Split the data into train & test sets:

In [15]:
from sklearn.model_selection import train_test_split

X = df['review']
y = df['label']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
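
The split above is purely random, so the class proportions in the test set can drift slightly away from 50/50. If a proportional split is wanted, `train_test_split` accepts a `stratify` argument. This is a minimal sketch on made-up data (the `X_toy` and `y_toy` Series are assumptions for illustration, standing in for `df['review']` and `df['label']`), not a change to the split used in this notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical balanced toy data standing in for the reviews and labels
X_toy = pd.Series([f'review {i}' for i in range(100)])
y_toy = pd.Series(['pos'] * 50 + ['neg'] * 50)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.33, random_state=42, stratify=y_toy)

# With stratify, each class keeps (almost exactly) its original share in both splits
test_counts = y_te.value_counts()
print(test_counts)
```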

### Build a pipeline to vectorize the data, then fit a model

In [16]:
# Custom list of stop words for the vectorizer to filter out
stopwords = ['a', 'about', 'an', 'and', 'are', 'as', 'at', 'be', 'been', 'but', 'by', 'can',
             'even', 'ever', 'for', 'from', 'get', 'had', 'has', 'have', 'he', 'her', 'hers', 'his',
             'how', 'i', 'if', 'in', 'into', 'is', 'it', 'its', 'just', 'me', 'my', 'of', 'on', 'or',
             'see', 'seen', 'she', 'so', 'than', 'that', 'the', 'their', 'there', 'they', 'this',
             'to', 'was', 'we', 'were', 'what', 'when', 'which', 'who', 'will', 'with', 'you']

In [17]:
# LinearSVC works well with sparse, high-dimensional TF-IDF features, so we use it as the classifier
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

text_clf = Pipeline([('tfidf', TfidfVectorizer(stop_words=stopwords)), ('clf', LinearSVC())])

text_clf.fit(X_train, y_train)

Pipeline(steps=[('tfidf',
                 TfidfVectorizer(stop_words=['a', 'about', 'an', 'and', 'are',
                                             'as', 'at', 'be', 'been', 'but',
                                             'by', 'can', 'even', 'ever', 'for',
                                             'from', 'get', 'had', 'has',
                                             'have', 'he', 'her', 'hers', 'his',
                                             'how', 'i', 'if', 'in', 'into',
                                             'is', ...])),
                ('clf', LinearSVC())])

### Run predictions and analyze the results

In [18]:
# Form a prediction set
predictions = text_clf.predict(X_test)

In [20]:
# Report the confusion matrix
from sklearn import metrics
print(metrics.confusion_matrix(y_test, predictions))

[[884 107]
 [ 62 921]]
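
In scikit-learn's confusion matrix, rows are the true labels and columns are the predicted labels, both in sorted order (`neg`, then `pos`). As a worked check, per-class precision and recall and the overall accuracy can be derived directly from the four counts printed above. The variable names in this sketch are illustrative.

```python
import numpy as np

# Confusion matrix reported above: rows are true labels, columns are predictions,
# both in sorted label order ('neg', 'pos')
cm = np.array([[884, 107],
               [62, 921]])

recall = cm.diagonal() / cm.sum(axis=1)      # correct / all true members of the class
precision = cm.diagonal() / cm.sum(axis=0)   # correct / all predictions of the class
accuracy = cm.diagonal().sum() / cm.sum()    # all correct / all test reviews

for name, p, r in zip(['neg', 'pos'], precision, recall):
    print(f'{name}: precision={p:.2f} recall={r:.2f}')
print(f'accuracy={accuracy:.4f}')
```

Read row by row: of the 991 negative reviews, 107 were misclassified as positive, and of the 983 positive reviews, 62 were misclassified as negative.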


In [21]:
# Print a classification report
print(metrics.classification_report(y_test, predictions))

              precision    recall  f1-score   support

         neg       0.93      0.89      0.91       991
         pos       0.90      0.94      0.92       983

    accuracy                           0.91      1974
   macro avg       0.92      0.91      0.91      1974
weighted avg       0.92      0.91      0.91      1974



In [22]:
# Print the overall accuracy
print(metrics.accuracy_score(y_test, predictions))

0.914387031408308
