This project classifies whether a movie review is positive or negative based on its text. The dataset contains about 6,000 movie reviews, 3,000 of which are positive and 3,000 negative. Several reviews have NaN values. Two machine learning algorithms are applied in this project:
1. Support Vector Machine (SVM)
2. GradientBoostingClassifier

The dataset is taken from http://ai.stanford.edu/~amaas/data/sentiment/ 



### Movie Rating Classification

In [1]:
# importing required libraries
import numpy as np
import pandas as pd

# to split dataset into train & test set
from sklearn.model_selection import train_test_split

# To build a pipeline & vectorization
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer

# To build the model
from sklearn.svm import LinearSVC
from sklearn.ensemble import GradientBoostingClassifier

# To measure the accuracy
from sklearn import metrics

In [2]:
# loading dataset
df = pd.read_csv('moviereviews2.tsv', sep='\t')
df.head() # view top 5 rows of df

Unnamed: 0,label,review
0,pos,I loved this movie and will watch it again. Or...
1,pos,"A warm, touching movie that has a fantasy-like..."
2,pos,I was not expecting the powerful filmmaking ex...
3,neg,"This so-called ""documentary"" tries to tell tha..."
4,pos,This show has been my escape from reality for ...


In [3]:
# now let's see how many positive and negative reviews there are
df.label.value_counts()

neg    3000
pos    3000
Name: label, dtype: int64

In [4]:
# Checking whether there are any null values
print(df.isnull().sum())

label      0
review    20
dtype: int64


There are 20 null values in the review column.

In [5]:
# Checking whether any review is duplicated
print(df.review.duplicated().sum())

33


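33 rows have a review that repeats an earlier row. Note that `duplicated()` treats missing values as equal to each other, so some of these are the null reviews counted above.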

Now let's remove the null values & also the duplicate values

In [6]:
df.dropna(inplace=True)
df.drop_duplicates(keep='first', inplace=True)

In [7]:
# check again for the null and duplicates
print(df.isnull().sum())
print(df.review.duplicated().sum())

label     0
review    0
dtype: int64
0
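
The cleanup above can be sketched on a small, made-up frame. This is a minimal illustration, not the notebook's data: the rows are hypothetical, and `subset=` is an optional addition. The notebook drops fully identical rows, whereas `subset="review"` drops repeated review text even when the labels differ.

```python
import numpy as np
import pandas as pd

# Hypothetical rows standing in for the review table.
toy = pd.DataFrame({
    "label": ["pos", "neg", "neg", "pos", "pos"],
    "review": ["Great film", "Dull plot", "Great film", np.nan, "Fine cast"],
})

# Same calls as the notebook: drop rows with nulls, then fully identical rows.
# "Great film" survives twice because its two rows carry different labels.
whole_row = toy.dropna().drop_duplicates(keep="first")

# Variant: compare only the review text when looking for duplicates.
clean = toy.dropna(subset=["review"]).drop_duplicates(subset="review", keep="first")

print(len(whole_row), len(clean))
print(clean["review"].duplicated().sum())
```

Which variant is appropriate depends on whether the same text with conflicting labels should count as a duplicate.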


In [8]:
# Now let's check whether any review consists only of whitespace
space = []

for idx, lb, rv in df.itertuples():
    if rv.isspace():
        space.append(idx)  # record the row index of the blank review

In [9]:
space

[]

No review consists only of whitespace.
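
The same check can be written without an explicit loop. The sketch below uses hypothetical rows and pandas string methods. It also shows one difference worth knowing: `isspace()` is False for an empty string, while stripping and comparing to `""` catches it.

```python
import pandas as pd

# Hypothetical rows: one normal review, two whitespace-only, one empty string.
toy = pd.DataFrame({
    "label": ["pos", "neg", "pos", "neg"],
    "review": ["Loved it", "   ", "\t\n", ""],
})

# True only for non-empty strings made up entirely of whitespace.
only_space = toy["review"].str.isspace()
space_idx = toy.index[only_space.to_numpy()].tolist()

# Also True for the empty string.
blank = toy["review"].str.strip().eq("")
blank_idx = toy.index[blank.to_numpy()].tolist()

print(space_idx)
print(blank_idx)
```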

Now let's split the dataset into train & test set

In [10]:
# split dataset
X_train, X_test, y_train, y_test = train_test_split(df['review'], df['label'], test_size=0.25, random_state=42)
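
The split above is random and not stratified. As an optional variation (an assumption, not what the notebook does), passing `stratify=` keeps the class proportions identical in the train and test sets. A minimal sketch on made-up data:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical balanced data: 20 positive and 20 negative rows.
toy = pd.DataFrame({
    "review": [f"review {i}" for i in range(40)],
    "label": ["pos"] * 20 + ["neg"] * 20,
})

X_tr, X_te, y_tr, y_te = train_test_split(
    toy["review"], toy["label"],
    test_size=0.25, random_state=42,
    stratify=toy["label"],  # preserve the pos/neg ratio in both parts
)

print(y_te.value_counts().to_dict())
```

With a dataset as balanced as this one, an unstratified split already lands close to 50/50, so the effect is small.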

In [11]:
# Now let's vectorize the data for the ml model
tfidf = TfidfVectorizer() # creating instance

In [12]:
X_train_tfid = tfidf.fit_transform(X_train) # fit transform train data
X_test_tfid = tfidf.transform(X_test) # transform test data
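
To illustrate why the vectorizer is fitted on the training data only and then merely applied to the test data, here is a small sketch with made-up sentences. The vocabulary comes from the training documents, and words seen only at test time are ignored.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical documents, not taken from the dataset.
train_docs = ["a great movie", "a boring movie", "great acting"]
test_docs = ["great unknown words"]

vec = TfidfVectorizer()
X_tr = vec.fit_transform(train_docs)  # learns the vocabulary and IDF weights
X_te = vec.transform(test_docs)       # reuses them; unseen words are dropped

# The default tokenizer skips one-character tokens such as "a".
print(sorted(vec.vocabulary_))
print(X_tr.shape, X_te.shape)
```

Both matrices have the same number of columns, which is what lets a model trained on one be applied to the other.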

Now let's build the model
#### Support Vector Machine

In [13]:
svc_ = LinearSVC()

In [14]:
svc_.fit(X_train_tfid, y_train)

LinearSVC()

In [15]:
# predictions
pred_svc = svc_.predict(X_test_tfid)

In [16]:
# let's evaluate the model
print(metrics.accuracy_score(y_test, pred_svc))

0.9349865951742627


In [17]:
# the model has an accuracy of about 93%.
# let's see the confusion matrix and classification report

print(metrics.confusion_matrix(y_test, pred_svc, labels=['pos','neg']))
print('============================================')
print(metrics.classification_report(y_test, pred_svc))

[[702  47]
 [ 50 693]]
              precision    recall  f1-score   support

         neg       0.94      0.93      0.93       743
         pos       0.93      0.94      0.94       749

    accuracy                           0.93      1492
   macro avg       0.93      0.93      0.93      1492
weighted avg       0.93      0.93      0.93      1492



Here we can see that the model performs very well. With rows as true labels and columns as predictions, it labelled 47 positive reviews as negative and 50 negative reviews as positive. Its precision and recall values are also quite good, at 0.93 to 0.94 for both classes.

Let's move to the second algorithm.
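
As a worked check, the accuracy, precision, recall and F1 for the `pos` class can be recomputed by hand from the confusion matrix printed above. The sketch assumes the `labels=['pos','neg']` ordering used in that cell, so rows are true labels and columns are predicted labels.

```python
import numpy as np

# Confusion matrix from the SVC cell, ordered as ['pos', 'neg'].
cm = np.array([[702, 47],
               [50, 693]])

tp, fn = cm[0]  # true 'pos': predicted pos, predicted neg
fp, tn = cm[1]  # true 'neg': predicted pos, predicted neg

accuracy = (tp + tn) / cm.sum()
precision_pos = tp / (tp + fp)
recall_pos = tp / (tp + fn)
f1_pos = 2 * precision_pos * recall_pos / (precision_pos + recall_pos)

print(accuracy, precision_pos, recall_pos, f1_pos)
```

Rounded to two decimals, these agree with the `pos` row of the classification report.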

#### GradientBoostingClassifier

In [18]:
gbc = GradientBoostingClassifier() # creating instance with all default parameters

In [19]:
gbc.fit(X_train_tfid, y_train) # fitting data

GradientBoostingClassifier()

In [20]:
# prediction on test data

pred_gbc = gbc.predict(X_test_tfid)

In [21]:
# checking the model performance
print(metrics.accuracy_score(y_test, pred_gbc))

0.863941018766756


In [22]:
# the model has an accuracy of about 86%.
# let's see the confusion matrix and classification report

print(metrics.confusion_matrix(y_test, pred_gbc, labels=['pos','neg']))
print('============================================')
print(metrics.classification_report(y_test, pred_gbc))

[[685  64]
 [139 604]]
              precision    recall  f1-score   support

         neg       0.90      0.81      0.86       743
         pos       0.83      0.91      0.87       749

    accuracy                           0.86      1492
   macro avg       0.87      0.86      0.86      1492
weighted avg       0.87      0.86      0.86      1492



Gradient boosting performs worse than the SVC (0.86 vs. 0.93 accuracy in the classification reports), so we'll use the SVC as the final model.

Now let's create a pipeline for the final model.

In [23]:
# creating pipeline
text_clf = Pipeline([('tfidf',TfidfVectorizer()),
                    ('clf',LinearSVC())])

In [24]:
# fit the raw data into the pipeline
text_clf.fit(X_train, y_train)

Pipeline(steps=[('tfidf', TfidfVectorizer()), ('clf', LinearSVC())])

In [25]:
# predictions
pred = text_clf.predict(X_test)

In [26]:
# Model evaluation

print(metrics.confusion_matrix(y_test, pred))
print('==========================')
print(metrics.classification_report(y_test, pred))

[[693  50]
 [ 47 702]]
              precision    recall  f1-score   support

         neg       0.94      0.93      0.93       743
         pos       0.93      0.94      0.94       749

    accuracy                           0.93      1492
   macro avg       0.93      0.93      0.93      1492
weighted avg       0.93      0.93      0.93      1492



In [27]:
print(metrics.accuracy_score(y_test, pred))

0.9349865951742627


The final model reaches about 93% accuracy, with F1 scores of 0.93 (neg) and 0.94 (pos), which is quite good.
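
Because the pipeline bundles vectorization and classification, it can score raw text directly. The sketch below shows that usage on a tiny, made-up training set. The name `demo_clf` and all sentences are hypothetical, and a model trained on four sentences is only meant to show the call pattern.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical training data, far smaller than the real dataset.
train_reviews = [
    "I loved this movie, wonderful acting",
    "A wonderful and touching story, loved it",
    "Terrible plot and boring acting",
    "Boring, terrible, a waste of time",
]
train_labels = ["pos", "pos", "neg", "neg"]

# Same structure as text_clf above: TF-IDF features fed to a linear SVM.
demo_clf = Pipeline([("tfidf", TfidfVectorizer()), ("clf", LinearSVC())])
demo_clf.fit(train_reviews, train_labels)

# Raw strings go straight in; the pipeline vectorizes them internally.
new_reviews = ["wonderful movie, loved it", "boring and terrible"]
predictions = demo_clf.predict(new_reviews)
print(predictions)
```

With the notebook's fitted `text_clf`, the equivalent call would be `text_clf.predict([...])` on a list of review strings.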