In [4]:
import pandas as pd
import numpy as np

In [5]:
df = pd.read_csv("C:\\Users\\Priyanka\\Downloads\\IMDB Dataset.csv")
df.head()

,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [6]:
df.shape

(50000, 2)

In [7]:
df.isnull().sum()

review       0
sentiment    0
dtype: int64

In [8]:
# Encode the label as an integer: 1 for 'positive', 0 for anything else
df['sentiment_int'] = df['sentiment'].apply(lambda x: 1 if x == 'positive' else 0)
df.head()

,review,sentiment,sentiment_int
0,One of the other reviewers has mentioned that ...,positive,1
1,A wonderful little production. <br /><br />The...,positive,1
2,I thought this was a wonderful way to spend ti...,positive,1
3,Basically there's a family where a little boy ...,negative,0
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive,1
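
A possible alternative to the lambda above, sketched on a small made-up frame rather than the IMDB data: `Series.map` with an explicit dictionary gives the same encoding, but leaves any unexpected label as NaN instead of silently turning it into 0. The `toy` frame and the `label_to_int` name are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Hypothetical three-row frame standing in for the real dataset
toy = pd.DataFrame({'sentiment': ['positive', 'negative', 'positive']})

# Explicit mapping; any label missing from the dict would become NaN
label_to_int = {'positive': 1, 'negative': 0}
toy['sentiment_int'] = toy['sentiment'].map(label_to_int)

encoded = toy['sentiment_int'].tolist()
```

Checking `toy['sentiment_int'].isnull().sum()` afterwards would then reveal misspelled or unexpected labels.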


In [9]:
df.sentiment.value_counts()

sentiment
positive    25000
negative    25000
Name: count, dtype: int64

In [10]:
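# Split the data into training (70%) and test (30%) sets.
# No random_state is set, so the split (and the scores below) will vary between runs.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df['review'], df['sentiment_int'], test_size=0.3)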

In [11]:
X_train.shape, X_test.shape

((35000,), (15000,))
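
The split above is unseeded, so rerunning the notebook gives slightly different scores. Below is a minimal sketch of a reproducible, class-balanced split. It assumes a fixed `random_state` of 42 (an arbitrary choice) and uses toy lists in place of the real columns; the `split` helper is hypothetical.

```python
from sklearn.model_selection import train_test_split

# Toy stand-ins for df['review'] and df['sentiment_int'] (illustrative only)
reviews = [f"review {i}" for i in range(10)]
labels = [1, 0] * 5

def split(seed):
    # stratify keeps the positive/negative ratio similar in both parts;
    # random_state makes the shuffle repeatable
    return train_test_split(
        reviews, labels, test_size=0.3, stratify=labels, random_state=seed
    )

X_tr, X_te, y_tr, y_te = split(42)
X_tr2, X_te2, y_tr2, y_te2 = split(42)  # same seed, same split
```

With the same seed, both calls return identical partitions, which makes the accuracy figures comparable across runs.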

In [12]:
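# Label-based lookup: returns the review whose original index label is 2.
# This raises a KeyError if that row fell into the test set; use X_train.iloc[2] for positional access.
X_train[2]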

'I thought this was a wonderful way to spend time on a too hot summer weekend, sitting in the air conditioned theater and watching a light-hearted comedy. The plot is simplistic, but the dialogue is witty and the characters are likable (even the well bread suspected serial killer). While some may be disappointed when they realize this is not Match Point 2: Risk Addiction, I thought it was proof that Woody Allen is still fully in control of the style many of us have grown to love.<br /><br />This was the most I\'d laughed at one of Woody\'s comedies in years (dare I say a decade?). While I\'ve never been impressed with Scarlet Johanson, in this she managed to tone down her "sexy" image and jumped right into a average, but spirited young woman.<br /><br />This may not be the crown jewel of his career, but it was wittier than "Devil Wears Prada" and more interesting than "Superman" a great comedy to go see with friends.'

In [14]:
type(X_train)

pandas.core.series.Series

In [15]:
X_train.values[0]

'Had the League been unknowns pitching this script, the backers would simply have turned around and said "no - you\'re not having the money - this is dreadful". As a fan of the League of Gentlemen, this is their poorest outing to date. Not particularly funny, not particularly entertaining, there are few laugh out loud moments. They do exist, but they are few and far between. I felt the format was tired and really dragging. The film refers to the writers being bored of the characters and it shows. As for being a film. I felt the Xmas special had better production value; the FX are generally pretty poor and it is clearly obvious that they didn\'t film in the original Royston Vasey (they filmed this on the cheap in Ireland). The musical score is weak and the dialogue is terrible. Also, the accents of the characters were largely off from their TV equivalents. Tubs and Edward, much underused (again), just didn\'t sound like themselves. Disappointing really, because I was hoping for somethin

In [16]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB

# Each pipeline first vectorizes the reviews with CountVectorizer (bag of words),
# then passes the resulting word counts to a classifier.

# model_nb uses Multinomial Naive Bayes as the model
model_nb = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('nb', MultinomialNB())
])

# model_rfc uses a random forest classifier as the model
model_rfc = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('rf', RandomForestClassifier(n_estimators=50, criterion='entropy'))
])

# model_knn uses k-nearest neighbours (KNN) as the model
model_knn = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('knn', KNeighborsClassifier(n_neighbors=10, metric='euclidean'))
])


In [17]:
model_nb.fit(X_train,y_train)

In [18]:
y_pred_nb = model_nb.predict(X_test)

In [19]:
model_nb.score(X_test,y_test)

0.8506

In [20]:
from sklearn.metrics import classification_report
print(classification_report(y_test,y_pred_nb))

              precision    recall  f1-score   support

           0       0.83      0.88      0.86      7571
           1       0.87      0.82      0.84      7429

    accuracy                           0.85     15000
   macro avg       0.85      0.85      0.85     15000
weighted avg       0.85      0.85      0.85     15000
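
For reference, the precision, recall and f1-score columns in the report above can be derived from simple counts. The sketch below recomputes them for one class on made-up labels; `prf` is a hypothetical helper, and the toy labels are not the notebook's predictions.

```python
# Toy labels (illustrative, not from the notebook's predictions)
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 0, 0]

def prf(y_true, y_pred, positive):
    """Precision, recall and F1 for one class, treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

precision_1, recall_1, f1_1 = prf(y_true, y_pred, positive=1)
```

The support column is the number of true samples of each class in the test set.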



In [21]:
model_rfc.fit(X_train,y_train)

In [22]:
y_pred_rfc = model_rfc.predict(X_test)

In [23]:
model_rfc.score(X_test,y_test)

0.8381333333333333

In [24]:
print(classification_report(y_test,y_pred_rfc))

              precision    recall  f1-score   support

           0       0.84      0.84      0.84      7571
           1       0.84      0.84      0.84      7429

    accuracy                           0.84     15000
   macro avg       0.84      0.84      0.84     15000
weighted avg       0.84      0.84      0.84     15000



In [25]:
model_knn.fit(X_train,y_train)

In [26]:
y_pred_knn = model_knn.predict(X_test)

In [27]:
model_knn.score(X_test,y_test)

0.6512666666666667

In [28]:
print(classification_report(y_test,y_pred_knn))

              precision    recall  f1-score   support

           0       0.66      0.64      0.65      7571
           1       0.64      0.66      0.65      7429

    accuracy                           0.65     15000
   macro avg       0.65      0.65      0.65     15000
weighted avg       0.65      0.65      0.65     15000



In [29]:
'''
Observations:

In this process, we convert each review into a very high-dimensional, sparse numeric vector using the bag-of-words technique.
Models like K-Nearest Neighbours (KNN) do not work well with such data: as the number of dimensions grows, the distances between points tend to become nearly equal, so the "nearest" neighbours carry little information about the class. Computing the distance to every training vector is also expensive at prediction time. Here KNN reached only about 0.65 accuracy, against 0.85 for Naive Bayes and 0.84 for Random Forest.
Multinomial Naive Bayes suits text classification because it only needs per-class word probabilities, which are estimated by counting the words of the corpus (bag of words) per class, in effect a contingency table.
Random Forest trains many decision trees on bootstrapped rows and random subsets of features (row and column sampling). Combining the trees reduces the high variance and overfitting that a single tree shows on high-dimensional data, and each split favours the words that best separate the categories.
Machine learning is an empirical, trial-and-error process: we try the candidate algorithms and select the one that gives good results while satisfying requirements such as latency and interpretability.

'''
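
One factor that may contribute to KNN's weak score, offered as an illustration rather than a diagnosis of this notebook's result: with raw word counts, Euclidean distance is sensitive to document length. The sketch below uses three made-up count vectors over a three-word vocabulary and compares Euclidean distance with cosine similarity; all names and documents are assumptions.

```python
import math

# Count vectors over the vocabulary [awful, film, great] (made-up documents)
short_pos = [0, 1, 1]   # "great film"
short_neg = [1, 1, 0]   # "awful film"
long_pos = [0, 2, 4]    # "great great great great film film"

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Euclidean distance: the short negative review looks closer to short_pos
d_neg = euclidean(short_pos, short_neg)   # sqrt(2)
d_pos = euclidean(short_pos, long_pos)    # sqrt(10)

# Cosine similarity ignores vector length, so the long positive review ranks nearer
s_neg = cosine_similarity(short_pos, short_neg)
s_pos = cosine_similarity(short_pos, long_pos)
```

Under Euclidean distance the short positive review sits nearer to the short negative one than to the longer positive one, while cosine similarity ranks them the other way round. Whether a different metric would actually help on the IMDB data would need to be measured.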
