# Text Classification Project
Now we're at the point where we should be able to:
* Read in a collection of documents - a *corpus*
* Transform text into numerical vector data using a pipeline
* Create a classifier
* Fit/train the classifier
* Test the classifier on new data
* Evaluate performance

For this project we'll use the Cornell University Movie Review polarity dataset v2.0 obtained from http://www.cs.cornell.edu/people/pabo/movie-review-data/

In this exercise we'll try to develop a classification model as we did for the SMSSpamCollection dataset - that is, we'll try to predict the Positive/Negative labels based on text content alone. In an upcoming section we'll apply *Sentiment Analysis* to train models that have a deeper understanding of each review.

## Perform imports and load the dataset
The dataset contains the text of 2000 movie reviews. 1000 are positive, 1000 are negative, and the text has been preprocessed as a tab-delimited file.

In [1]:
import numpy as np
import pandas as pd

df = pd.read_csv('../TextFiles/moviereviews.tsv', sep='\t')
df.head()

Unnamed: 0,label,review
0,neg,how do films like mouse hunt get into theatres...
1,neg,some talented actresses are blessed with a dem...
2,pos,this has been an extraordinary year for austra...
3,pos,according to hollywood movies made in last few...
4,neg,my first press screening of 1998 and already i...


In [2]:
len(df)

2000

## Check for missing values:
We have intentionally included records with missing data. Some have NaN values, others have short strings composed of only spaces. This might happen if a reviewer declined to provide a comment with their review. We will show two ways using pandas to identify and remove records containing empty data.
* NaN records are efficiently handled with [.isnull()](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.isnull.html) and [.dropna()](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html)
* Strings that contain only whitespace can be handled with [.isspace()](https://docs.python.org/3/library/stdtypes.html#str.isspace), [.itertuples()](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.itertuples.html), and [.drop()](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop.html)

### Detect & remove NaN values:

In [4]:
df.isnull().sum()

label      0
review    35
dtype: int64

35 records show **NaN** (this stands for "not a number"; pandas uses it to mark missing values, much as Python uses *None*). These are easily removed using the pandas `.dropna()` method.
<div class="alert alert-info" style="margin: 20px">CAUTION: By setting inplace=True, we permanently affect the DataFrame currently in memory, and this can't be undone. However, it does *not* affect the original source data. If we needed to, we could always load the original DataFrame from scratch.</div>

In [5]:
df.dropna(inplace=True)

In [6]:
len(df)

1965

### Detect & remove empty strings
Technically, we're dealing with "whitespace only" strings. If the original .tsv file had contained empty strings, pandas **.read_csv()** would have assigned NaN values to those cells by default.
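
To illustrate the difference, here is a minimal sketch using a made-up three-row file held in memory (the `tsv` string and the `toy` DataFrame are hypothetical, not part of the review data). With the default settings, the empty field should be read as NaN while the whitespace-only field should survive as a string:

```python
import io
import pandas as pd

# Hypothetical file: one normal review, one empty field, one whitespace-only field
tsv = "label\treview\npos\tgreat film\nneg\t\npos\t   \n"

toy = pd.read_csv(io.StringIO(tsv), sep='\t')

print(toy['review'].isnull().tolist())   # which cells were read as NaN
print(repr(toy.loc[2, 'review']))        # the whitespace-only cell
```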

In order to detect these strings we need to iterate over each row in the DataFrame. The **.itertuples()** pandas method is a good tool for this as it provides access to every field. For brevity we'll assign the names `i`, `lb` and `rv` to the `index`, `label` and `review` columns.

In [7]:
sample = 'sample data'
empty = '   '

In [8]:
sample.isspace()

False

In [9]:
empty.isspace()

True

In [10]:
blanks = []  # start with an empty list

for i,lb,rv in df.itertuples():  # iterate over the DataFrame
    if isinstance(rv, str):      # skip non-string values such as NaN
        if rv.isspace():         # test 'review' for whitespace
            blanks.append(i)     # add matching index numbers to the list
        
print(len(blanks), 'blanks: ', blanks)

27 blanks:  [57, 71, 147, 151, 283, 307, 313, 323, 343, 351, 427, 501, 633, 675, 815, 851, 977, 1079, 1299, 1455, 1493, 1525, 1531, 1763, 1851, 1905, 1993]
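
As an alternative to the explicit loop, the same rows can probably be found with pandas string methods. The sketch below runs on a small made-up DataFrame (`toy` is hypothetical) and assumes that stripping a review and comparing it with the empty string is an acceptable test; note that, unlike `.isspace()`, this also flags truly empty strings:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one whitespace-only review and one NaN
toy = pd.DataFrame({'label': ['neg', 'pos', 'neg', 'pos'],
                    'review': ['bad film', '   ', np.nan, 'great film']})

# Strip whitespace, then flag rows that are left empty; NaN rows compare as False
is_blank = toy['review'].str.strip().eq('')
blank_idx = toy.index[is_blank].tolist()

print(blank_idx)
```

On the toy data only the whitespace-only row is flagged; the NaN row is left for `.dropna()` to handle.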


In [11]:
df.drop(blanks)  # pass the blank indexes; without inplace=True this returns a new DataFrame and df itself is unchanged

Unnamed: 0,label,review
0,neg,how do films like mouse hunt get into theatres...
1,neg,some talented actresses are blessed with a dem...
2,pos,this has been an extraordinary year for austra...
3,pos,according to hollywood movies made in last few...
4,neg,my first press screening of 1998 and already i...
...,...,...
1995,pos,"i like movies with albert brooks , and i reall..."
1996,pos,it might surprise some to know that joel and e...
1997,pos,the verdict : spine-chilling drama from horror...
1998,pos,i want to correct what i wrote in a former ret...


In [12]:
len(df)

1965
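
The length is still 1965 because `.drop()` returns a new DataFrame by default and leaves `df` untouched, so the 27 whitespace-only reviews remain in the data used below. A minimal sketch on made-up data (`toy` and `blank_rows` are hypothetical names) shows the difference between discarding the result and keeping it:

```python
import pandas as pd

# Hypothetical toy data with one whitespace-only review at index 1
toy = pd.DataFrame({'label': ['neg', 'pos', 'neg'],
                    'review': ['bad film', '   ', 'dull film']})
blank_rows = [1]

toy.drop(blank_rows)          # result is discarded; toy still has 3 rows
print(len(toy))

toy = toy.drop(blank_rows)    # reassign (or pass inplace=True) to keep the change
print(len(toy))
```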

In [13]:
df['label'].value_counts()

neg    983
pos    982
Name: label, dtype: int64

In [14]:
# split the data into training and test sets
X = df['review']
y = df['label']

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [15]:
len(X_train), len(X_test)

(1572, 393)
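
The split sizes can be checked by hand. The sketch below assumes the test share is rounded up to a whole number of rows and the remainder goes to training, which matches the counts reported in this notebook for both `test_size=0.2` and the later `test_size=0.33`:

```python
import math

n_samples = 1965   # rows in df at the time of the split

sizes = {}
for test_size in (0.2, 0.33):
    n_test = math.ceil(test_size * n_samples)   # assumed rounding rule
    n_train = n_samples - n_test
    sizes[test_size] = (n_train, n_test)

print(sizes)
```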

In [16]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# create two different models that share the same vectorizer step
text_clf_nb = Pipeline([
    ('tfidf', TfidfVectorizer()), 
    ('clf', MultinomialNB())
])

text_clf_svm = Pipeline([
    ('tfidf', TfidfVectorizer()), 
    ('clf', LinearSVC())
])

In [17]:
text_clf_nb.fit(X_train, y_train)

In [18]:
predictions = text_clf_nb.predict(X_test)

In [19]:
from sklearn import metrics
print(metrics.confusion_matrix(y_test, predictions))

[[174  28]
 [ 50 141]]


In [20]:
print(metrics.classification_report(y_test, predictions))

              precision    recall  f1-score   support

         neg       0.78      0.86      0.82       202
         pos       0.83      0.74      0.78       191

    accuracy                           0.80       393
   macro avg       0.81      0.80      0.80       393
weighted avg       0.80      0.80      0.80       393



In [22]:
print(metrics.accuracy_score(y_test, predictions))  # Naive Bayes accuracy

0.8015267175572519
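
To see where the numbers in the report come from, the figures can be recomputed from the confusion matrix alone. This is a plain-Python sketch, assuming the usual scikit-learn layout (rows are true classes, columns are predicted classes, in the order `neg`, `pos`):

```python
# Confusion matrix from the Naive Bayes run above
cm = [[174, 28],    # true neg: 174 predicted neg, 28 predicted pos
      [50, 141]]    # true pos: 50 predicted neg, 141 predicted pos
labels = ['neg', 'pos']

total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total   # correct predictions / all predictions

report = {}
for k, name in enumerate(labels):
    predicted_k = cm[0][k] + cm[1][k]      # column sum: everything predicted as class k
    actual_k = sum(cm[k])                  # row sum: everything that truly is class k
    precision = cm[k][k] / predicted_k
    recall = cm[k][k] / actual_k
    f1 = 2 * precision * recall / (precision + recall)
    report[name] = (round(precision, 2), round(recall, 2), round(f1, 2))

print(report)
print(round(accuracy, 4))
```

The same arithmetic applies to any other 2x2 confusion matrix in this notebook.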


In [23]:
text_clf_svm.fit(X_train, y_train)

In [24]:
svm_predictions = text_clf_svm.predict(X_test)

In [25]:
from sklearn import metrics
print(metrics.confusion_matrix(y_test, svm_predictions))

[[175  27]
 [ 41 150]]


In [26]:
print(metrics.classification_report(y_test, svm_predictions))

              precision    recall  f1-score   support

         neg       0.81      0.87      0.84       202
         pos       0.85      0.79      0.82       191

    accuracy                           0.83       393
   macro avg       0.83      0.83      0.83       393
weighted avg       0.83      0.83      0.83       393



In [None]:
print(metrics.accuracy_score(y_test, svm_predictions))  # SVM accuracy

In [33]:
# does a larger test set (33% instead of 20%) change the result? let's see
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

text_clf_svm2 = Pipeline([
    ('tfidf', TfidfVectorizer()), 
    ('clf', LinearSVC())
])

text_clf_svm2.fit(X_train, y_train)
predictions = text_clf_svm2.predict(X_test)

#confusion matrix
print("Confusion matrix:\n ", metrics.confusion_matrix(y_test, predictions), end='\n\n')

#classification report
print("Classification report:\n ",metrics.classification_report(y_test, predictions), end='\n\n')

#Accuracy
print(metrics.accuracy_score(y_test, predictions))

Confusion matrix:
  [[281  41]
 [ 56 271]]

Classification report:
                precision    recall  f1-score   support

         neg       0.83      0.87      0.85       322
         pos       0.87      0.83      0.85       327

    accuracy                           0.85       649
   macro avg       0.85      0.85      0.85       649
weighted avg       0.85      0.85      0.85       649


0.8505392912172574


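### Accuracy went from about 83% to 85%
Keep in mind that changing `test_size` changes both the training set and the test set, so the two scores are measured on different reviews. Part of the gain may reflect the particular split rather than a better model.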
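
Because a single train/test split can be lucky or unlucky, one common way to compare settings more fairly is k-fold cross-validation. The sketch below is only an illustration: the `texts` and `labels` lists are a made-up mini-corpus standing in for `X` and `y`, and 5 folds is an arbitrary choice.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical mini-corpus; in the notebook these would be X and y
texts = ['great film', 'wonderful acting', 'loved every minute',
         'a fantastic story', 'superb and moving',
         'terrible film', 'awful acting', 'hated every minute',
         'a boring story', 'dull and lifeless']
labels = ['pos'] * 5 + ['neg'] * 5

pipe = Pipeline([('tfidf', TfidfVectorizer()), ('clf', LinearSVC())])

# 5-fold cross-validation: each document is used for testing exactly once
scores = cross_val_score(pipe, texts, labels, cv=5, scoring='accuracy')

print(scores.mean(), scores.std())
```

Reporting the mean and spread over folds gives a rough sense of how much a score can move from one split to another.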

In [34]:
text_clf_svm2.predict(["The movie red notice was a semi decent movie for general audience and wheread it is a interesting movie for movie enthu's who are waiting for trill and fun"])

array(['neg'], dtype=object)

In [36]:
text_clf_svm2.predict(["The movie red notice was a semi decent movie for general audience and wheread it is a great movie for movie enthu's who are waiting for trill and fun"])
# only one word changed: 'interesting' -> 'great'

array(['pos'], dtype=object)

In [37]:
text_clf_svm2.predict(["funny"])

array(['neg'], dtype=object)

In [38]:
text_clf_svm2.predict(["okish"])

array(['neg'], dtype=object)

In [39]:
text_clf_svm2.predict(["nice movie"])

array(['neg'], dtype=object)

In [40]:
text_clf_svm2.predict(["good movie"])

array(['neg'], dtype=object)

In [41]:
text_clf_svm.predict(['good movie'])

array(['neg'], dtype=object)

In [42]:
text_clf_svm2.predict(["great movie"])

array(['pos'], dtype=object)

In [43]:
text_clf_svm2.predict(["fantastic movie"])

array(['pos'], dtype=object)

In [45]:
text_clf_svm2.predict(["chill movie"])

array(['neg'], dtype=object)
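
Short inputs such as these contain very few known words, so the prediction rests on one or two learned weights plus the intercept. One way to look into this, sketched here on a made-up mini-corpus (`texts`, `labels` and `pipe` are hypothetical; the same calls should work on `text_clf_svm2`), is to print the signed decision score and the weight of individual terms:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical mini-corpus standing in for the movie reviews
texts = ['great film', 'wonderful acting', 'a fantastic story',
         'terrible film', 'awful acting', 'a boring story']
labels = ['pos', 'pos', 'pos', 'neg', 'neg', 'neg']

pipe = Pipeline([('tfidf', TfidfVectorizer()), ('clf', LinearSVC())])
pipe.fit(texts, labels)

clf = pipe.named_steps['clf']
vocab = pipe.named_steps['tfidf'].vocabulary_   # maps each term to its column index

# Signed distance from the decision boundary; positive values favour clf.classes_[1]
results = []
for phrase in ['great film', 'film', 'chill film']:
    score = pipe.decision_function([phrase])[0]
    pred = pipe.predict([phrase])[0]
    results.append((phrase, score, pred))
    print(phrase, round(score, 3), pred)

# Learned weight for a single term, if the vectorizer has seen it
for term in ['great', 'film', 'chill']:
    if term in vocab:
        print(term, round(clf.coef_[0][vocab[term]], 3))
    else:
        print(term, 'is not in the vocabulary, so it adds nothing to the score')
```

A score close to zero means the model is barely leaning either way, and a word outside the vocabulary is ignored entirely, which may help explain some of the surprising labels above.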