# Movie Review Dataset

### The polarity dataset from Cornell University consists of movie review documents which are labelled according to the overall sentiment of the rating as positibe or negative. 
### The dataset consists of 1000 positive and 1000 negative review. The purpose of this model is to be able to correctly train and predict the sentiment of the movie review for the tes dataset.

In [None]:
import tarfile
import numpy as np
import pandas as pd

In [54]:
import sys
import os

In [55]:
train_data = []
train_labels = []
test_data = []
test_labels =[]

In [56]:
data_dir= "C:\\Users\\sushsiva\\Desktop\\txt_sentoken"

In [57]:
classes = ['pos','neg']

In [58]:
data_dir

'C:\\Users\\sushsiva\\Desktop\\txt_sentoken'

### The dataset is divided into 90% training and the rest 10% is the test dataset, each dataset is in the form of a document 

In [59]:
for curr_class in classes:
    dirname= os.path.join(data_dir,curr_class)
    for fname in os.listdir(dirname):
        with open(os.path.join(dirname,fname),'r') as f:
                content = f.read()
                if fname.startswith('cv9'):
                    test_data.append(content)
                    test_labels.append(curr_class)
                    
                else:
                    train_data.append(content)
                    train_labels.append(curr_class)

### Step 1 : Perform TFIDF on the train and test feature dataset in order to obtain all the most important words occuring in the dataset 

In [60]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
vectorizer = TfidfVectorizer(sublinear_tf=True,min_df=5,use_idf=True, max_df=0.8, analyzer='word', stop_words='english')
train_idf = vectorizer.fit_transform(train_data)
test_idf = vectorizer.transform(test_data)

## Following is a list of all the features in the dataset with their respective TF-IDF score in the sorted order.

In [12]:
features = vectorizer.get_feature_names()

In [81]:
scores = zip(vectorizer.get_feature_names(),np.asarray(train_idf.sum(axis=0)).ravel())
sorted_scores = pd.DataFrame(sorted(scores,key=lambda x:x[1],reverse=True))

In [83]:
sorted_scores

Unnamed: 0,0,1
0,movie,49.508609
1,like,39.897191
2,just,36.504413
3,good,33.707173
4,time,32.859499
5,story,31.742505
6,character,30.109930
7,characters,29.207121
8,way,27.838372
9,make,27.580741


### Performing classification on the tf-idf processed data using Support Vector Machines with a Radial Basis Function Kernel for not linearly separable data. The Kernel function is used to transform non linearly separable data into a higher dimensional space thus converting a not separable problem into a separable problem. The main parameters used for tuning here is "kernel", "gamma" and "C".
### Kernel values can be linear, rbf or poly, where "rbf" anf "poly" are used for non linear hyperplane to project the data to a higher dimensional space. 
### Gamma values on increasing them can make the training data set exactly fit which can lead to generalization and over-fitting error, hence changing this value is crucial to reduce the error.
### C is the penalty parameter which on increasing is used to smoothen the boundary and classify the points correctly.


In [62]:
from sklearn import svm
import time
classifier_rbf = svm.SVC()
t0 = time.time()
classifier_rbf.fit(train_idf,train_labels)
t1 = time.time()
prediction_rbf = classifier_rbf.predict(test_idf)
t2 = time.time()
time_rbf_train = t1-t0
time_rbf_predict = t2-t1


In [63]:
classifier_rbf.fit(train_idf,train_labels)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [93]:
print("Results for SVC(kernel=rbf)")
print("Training time: %fs; Prediction time: %fs" % (time_rbf_train, time_rbf_predict))
from sklearn.metrics import classification_report, accuracy_score
print(classification_report(test_labels, prediction_rbf))
print(accuracy_score(test_labels,prediction_rbf))

Results for SVC(kernel=rbf)
Training time: 24.435744s; Prediction time: 2.476587s
             precision    recall  f1-score   support

        neg       1.00      1.00      1.00        97
        pos       1.00      1.00      1.00       103

avg / total       1.00      1.00      1.00       200

1.0


In [73]:
test_labels=np.array(prediction_rbf.tolist())

In [74]:
test = pd.DataFrame({"movie review":test_data,'polarity prediction':test_labels})

In [84]:
test.head()

Unnamed: 0,movie review,polarity prediction
0,"in 1912 , a ship set sail on her maiden voyage...",pos
1,the start of this movie reminded me of parts f...,neg
2,note : some may consider portions of the follo...,pos
3,robert altman's cookie's fortune is that rare ...,pos
4,well i'll be damned . . . \nthe canadians can ...,pos
