**Text Classification Project**

In this exercise we'll develop a classification model that predicts the Positive/Negative label of a movie review from its text content alone.

**Perform imports and load the dataset**

The dataset contains the text of 2000 movie reviews. 1000 are positive, 1000 are negative, and the text has been preprocessed as a tab-delimited file.

In [25]:
import pandas as pd
import numpy as np

In [26]:
df = pd.read_csv("moviereviews.tsv",sep="\t")

In [27]:
df.head()

Unnamed: 0,label,review
0,neg,how do films like mouse hunt get into theatres...
1,neg,some talented actresses are blessed with a dem...
2,pos,this has been an extraordinary year for austra...
3,pos,according to hollywood movies made in last few...
4,neg,my first press screening of 1998 and already i...


In [28]:
len(df)

2000

In [4]:
df["review"][0]

'how do films like mouse hunt get into theatres ? \r\nisn\'t there a law or something ? \r\nthis diabolical load of claptrap from steven speilberg\'s dreamworks studio is hollywood family fare at its deadly worst . \r\nmouse hunt takes the bare threads of a plot and tries to prop it up with overacting and flat-out stupid slapstick that makes comedies like jingle all the way look decent by comparison . \r\nwriter adam rifkin and director gore verbinski are the names chiefly responsible for this swill . \r\nthe plot , for what its worth , concerns two brothers ( nathan lane and an appalling lee evens ) who inherit a poorly run string factory and a seemingly worthless house from their eccentric father . \r\ndeciding to check out the long-abandoned house , they soon learn that it\'s worth a fortune and set about selling it in auction to the highest bidder . \r\nbut battling them at every turn is a very smart mouse , happy with his run-down little abode and wanting it to stay that way . \r\

**Check for missing values:**
 
**Detect & remove NaN values**

In [29]:
df.isnull().sum()

label      0
review    35
dtype: int64

35 records have NaN in the review column. These rows are easily removed with the pandas .dropna() method.

In [30]:
df.dropna(inplace=True)

In [31]:
df.isnull().sum()

label     0
review    0
dtype: int64

**Detect & remove empty strings**

Technically, we're dealing with "whitespace only" strings. If the original .tsv file had contained empty strings, pandas .read_csv() would have assigned NaN values to those cells by default.
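The difference between an empty field and a whitespace-only field can be checked on a small in-memory example. This is only a sketch: the three-row TSV text below is invented for illustration and is not taken from moviereviews.tsv.

```python
import io
import pandas as pd

# Hypothetical TSV: row 0 has an empty review, row 1 a whitespace-only review
tsv_text = "label\treview\nneg\t\npos\t   \npos\ta fine film\n"
toy = pd.read_csv(io.StringIO(tsv_text), sep="\t")

# The empty field is read as NaN; the whitespace-only field stays a string
print(toy["review"].isnull().tolist())   # [True, False, False]
print(repr(toy.loc[1, "review"]))        # '   '
```

This is why .dropna() alone does not catch the whitespace-only reviews.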

One way to detect these strings is to iterate over each row in the DataFrame. The pandas .itertuples() method is a good tool for this, as it yields the index followed by every field of the row. For brevity we'll assign the names i, lb and rv to the index, label and review values.

In [32]:
# Collect the index labels of rows whose review is whitespace-only

blanks = []

for i, lb, rv in df.itertuples():
    if rv.isspace():
        blanks.append(i)


In [33]:
blanks

[57,
 71,
 147,
 151,
 283,
 307,
 313,
 323,
 343,
 351,
 427,
 501,
 633,
 675,
 815,
 851,
 977,
 1079,
 1299,
 1455,
 1493,
 1525,
 1531,
 1763,
 1851,
 1905,
 1993]

Next we'll pass our list of index numbers to the .drop() method, and set inplace=True to make the change permanent.

In [34]:
df.drop(blanks,inplace=True)

In [35]:
len(df)

1938

We dropped 62 records from the original 2000: 35 with NaN reviews and 27 with whitespace-only reviews. Let's continue with the analysis.
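The same cleanup can also be sketched without an explicit loop, using a boolean mask. The toy frame below is invented for illustration, and the sketch assumes (as in this notebook) that the review column holds only strings or NaN.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real reviews
toy = pd.DataFrame({
    "label": ["neg", "pos", "pos", "neg"],
    "review": ["dull and overlong", np.nan, "   ", "a real treat"],
})

# Treat NaN as an empty string, strip whitespace, and keep non-empty reviews
keep = toy["review"].fillna("").str.strip().ne("")
cleaned = toy[keep]

print(cleaned.index.tolist())  # [0, 3]
```

Unlike the in-place approach, this leaves the original frame untouched and returns a filtered copy.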

**Split the data into train & test sets:**

In [36]:
from sklearn.model_selection import train_test_split

In [38]:
x = df["review"]

y = df["label"]

In [39]:
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.3,random_state=42)
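As a quick sanity check on the split sizes, the sketch below applies the same test_size and random_state to a stand-in list of 1938 row numbers (the real reviews are not needed to count rows). The variable names are illustrative only.

```python
from sklearn.model_selection import train_test_split

# Stand-in for the 1938 cleaned rows
rows = list(range(1938))

train_rows, test_rows = train_test_split(rows, test_size=0.3, random_state=42)
print(len(train_rows), len(test_rows))  # 1356 582

# A fixed random_state makes the split reproducible
train_again, test_again = train_test_split(rows, test_size=0.3, random_state=42)
print(test_rows == test_again)  # True
```

The 582 held-out rows match the support total shown in the classification reports.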

**Build pipelines to vectorize the data, then train and fit a model**

In [40]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

In [41]:
# Naïve Bayes:
text_clf_nb = Pipeline([('tfidf', TfidfVectorizer()),
                     ('clf', MultinomialNB()),
])

# Linear SVC:
text_clf_lsvc = Pipeline([("tfidf",TfidfVectorizer()),("clf",LinearSVC())])
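To make clear what a pipeline does, the following sketch compares a pipeline with the equivalent manual steps: fit the vectorizer and transform the training text, fit the classifier on the resulting matrix, then transform and predict on new text. The tiny corpus is invented for illustration and is far too small to say anything about model quality.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy corpus, invented for illustration only
train_docs = [
    "a wonderful and moving film",
    "dull plot and terrible acting",
    "wonderful acting and a great story",
    "a terrible and boring film",
]
train_labels = ["pos", "neg", "pos", "neg"]
new_docs = ["a great and wonderful story", "boring and dull"]

# Pipeline: vectorize and classify in one object
pipe = Pipeline([("tfidf", TfidfVectorizer()), ("clf", MultinomialNB())])
pipe.fit(train_docs, train_labels)
pipe_preds = pipe.predict(new_docs)

# Manual equivalent: the vectorizer is fitted on training text only
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_docs)
clf = MultinomialNB().fit(X_train, train_labels)
manual_preds = clf.predict(vectorizer.transform(new_docs))

print(list(pipe_preds) == list(manual_preds))  # True
```

The pipeline form is convenient because raw text can be passed straight to .fit() and .predict(), and the vectorizer vocabulary is learned only from the training data.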

**Feed the training data through the first pipeline**

In [43]:
text_clf_nb.fit(x_train, y_train)

Pipeline(memory=None,
         steps=[('tfidf',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=1.0, max_features=None,
                                 min_df=1, ngram_range=(1, 1), norm='l2',
                                 preprocessor=None, smooth_idf=True,
                                 stop_words=None, strip_accents=None,
                                 sublinear_tf=False,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, use_idf=True,
                                 vocabulary=None)),
                ('clf',
                 MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))],
         verbose=False)

**Run predictions and analyze the results (naïve Bayes)**

In [45]:
# Form a prediction set
predictions = text_clf_nb.predict(x_test)

In [46]:
# Report the confusion matrix
from sklearn import metrics
print(metrics.confusion_matrix(y_test,predictions))

[[259  23]
 [102 198]]


In [47]:
# Print a classification report
print(metrics.classification_report(y_test,predictions))

              precision    recall  f1-score   support

         neg       0.72      0.92      0.81       282
         pos       0.90      0.66      0.76       300

    accuracy                           0.79       582
   macro avg       0.81      0.79      0.78       582
weighted avg       0.81      0.79      0.78       582



In [48]:
# Print the overall accuracy
print(metrics.accuracy_score(y_test,predictions))

0.7852233676975945
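The figures in the classification report can be recomputed by hand from the confusion matrix, which helps when reading the two outputs together. The sketch below uses plain Python and assumes the scikit-learn layout, where rows are true labels and columns are predicted labels, in the order neg, pos.

```python
labels = ["neg", "pos"]
# Confusion matrix from the naïve Bayes run above
cm = [[259, 23],
      [102, 198]]

total = sum(sum(row) for row in cm)
accuracy = sum(cm[i][i] for i in range(len(labels))) / total

report = {}
for i, name in enumerate(labels):
    tp = cm[i][i]                          # correctly predicted as this class
    predicted = sum(row[i] for row in cm)  # column total: predicted as this class
    actual = sum(cm[i])                    # row total: truly this class (support)
    precision = tp / predicted
    recall = tp / actual
    f1 = 2 * precision * recall / (precision + recall)
    report[name] = (round(precision, 2), round(recall, 2), round(f1, 2))

print(report)              # {'neg': (0.72, 0.92, 0.81), 'pos': (0.9, 0.66, 0.76)}
print(round(accuracy, 4))  # 0.7852
```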


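Naïve Bayes classified 78.5% of the test reviews correctly based on text alone, well above the roughly 50% expected from guessing on this nearly balanced test set. It was much weaker on positive reviews, though: 102 of the 300 were labeled negative, a recall of only 66%. Let's see if we can do better.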
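One way to put the 78.5% figure in context is a majority-class baseline: the accuracy obtained by always predicting the more common label. The sketch below computes it from the support counts in the classification report; the baseline itself is an added point of comparison, not part of the original analysis.

```python
# Support counts from the classification report above
support = {"neg": 282, "pos": 300}

majority_label = max(support, key=support.get)
baseline_accuracy = support[majority_label] / sum(support.values())

print(majority_label, round(baseline_accuracy, 3))  # pos 0.515
```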

**Feed the training data through the second pipeline**

Next we'll run the Linear SVC pipeline on the same training data.

In [49]:
text_clf_lsvc.fit(x_train,y_train)

Pipeline(memory=None,
         steps=[('tfidf',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=1.0, max_features=None,
                                 min_df=1, ngram_range=(1, 1), norm='l2',
                                 preprocessor=None, smooth_idf=True,
                                 stop_words=None, strip_accents=None,
                                 sublinear_tf=False,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, use_idf=True,
                                 vocabulary=None)),
                ('clf',
                 LinearSVC(C=1.0, class_weight=None, dual=True,
                           fit_intercept=True, intercept_scaling=1,
               

**Run predictions and analyze the results (Linear SVC)**

In [50]:
predictions = text_clf_lsvc.predict(x_test)

In [51]:
from sklearn import metrics

In [52]:
print(metrics.confusion_matrix(y_test,predictions))

[[235  47]
 [ 41 259]]


In [53]:
print(metrics.classification_report(y_test,predictions))

              precision    recall  f1-score   support

         neg       0.85      0.83      0.84       282
         pos       0.85      0.86      0.85       300

    accuracy                           0.85       582
   macro avg       0.85      0.85      0.85       582
weighted avg       0.85      0.85      0.85       582



In [54]:
metrics.accuracy_score(y_test,predictions)

0.8487972508591065

In [23]:
text_clf_lsvc.predict(["This movie is very Good"])

array(['pos'], dtype=object)

Not bad! Based on text alone, the Linear SVC pipeline correctly classified reviews as positive or negative 84.9% of the time, up from 78.5% with naïve Bayes. In an upcoming section we'll try to improve this score even further by performing sentiment analysis on the reviews.