## Project: IMDB Movie Review

The IMDB Large Movie Review dataset comprises 50,000 movie reviews labeled for binary sentiment classification, making it a valuable resource for natural language processing and text analytics. It is substantially larger than earlier benchmark datasets and is split into 25,000 highly polar reviews for training and 25,000 for testing. This notebook works with a reduced, balanced subset of 6,000 reviews (`IMDB_dataset_reduced.csv`).

The primary objective of this project is to predict whether each review is positive or negative using classification techniques.

In summary, this dataset provides a comprehensive and robust platform for analyzing movie reviews and developing classification models to accurately predict sentiment.

The project will involve a series of fundamental procedures, which include the following steps:

- Preprocess the text data (remove punctuation, tokenize, remove stopwords, and lemmatize/stem)
- Perform TF-IDF vectorization
- Explore parameter settings using GridSearchCV on Random Forest and Gradient Boosting classifiers
- Perform a final evaluation of the models on the best parameter settings using evaluation metrics (precision, recall, accuracy)
- Report the best-performing model

In [1]:
import nltk
import pandas as pd
import re
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
import string

In [2]:
stopwords = nltk.corpus.stopwords.words('english')
ps = nltk.PorterStemmer()

### Read in text data

In [3]:
data = pd.read_csv('IMDB_dataset_reduced.csv')
data.columns = ['body_text','label']
data.head()

Unnamed: 0,body_text,label
0,"This show was an amazing, fresh & innovative i...",negative
1,Encouraged by the positive comments about this...,negative
2,Phil the Alien is one of those quirky films wh...,negative
3,I saw this movie when I was about 12 when it c...,negative
4,So im not a big fan of Boll's work but then ag...,negative


### Exploring the dataset

In [4]:
# Shape of the dataset

print("Input data has {} rows and {} columns".format(len(data), len(data.columns)))

Input data has 6000 rows and 2 columns


In [5]:
# Positive/negative values

print("Out of {} rows, {} are positive, {} are negative".format(len(data),
                                                       len(data[data['label']=='positive']),
                                                       len(data[data['label']=='negative'])))

Out of 6000 rows, 3000 are positive, 3000 are negative


In [6]:
# How much missing data is there?

print("Number of null in label: {}".format(data['label'].isnull().sum()))
print("Number of null in text: {}".format(data['body_text'].isnull().sum()))

Number of null in label: 0
Number of null in text: 0


In [2]:
# The data contains no null text or labels, and the dataset has an equal number of positive and negative reviews.

In [7]:
# Compute the percentage of non-space characters in a text that are punctuation
def count_punct(text):
    count = sum([1 for char in text if char in string.punctuation])
    return round(count/(len(text) - text.count(" ")), 3)*100
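
The function above divides by the number of non-space characters, so an empty or all-space review would raise `ZeroDivisionError`. The null check shows no missing text in this dataset, so the issue does not arise here, but a guarded variant might look like the sketch below. The name `count_punct_safe` and the choice to return `0.0` for empty input are assumptions, not part of the original notebook.

```python
import string

def count_punct_safe(text):
    """Percentage of non-space characters that are punctuation (hypothetical guarded variant)."""
    non_space = len(text) - text.count(" ")
    if non_space == 0:
        # Assumption: treat empty or all-space text as 0% punctuation
        return 0.0
    count = sum(1 for char in text if char in string.punctuation)
    # Multiply before rounding so the result keeps one clean decimal place
    return round(100 * count / non_space, 1)

print(count_punct_safe("Hello, world!!"))  # 3 punctuation marks out of 13 non-space characters
print(count_punct_safe("   "))             # all-space input falls back to 0.0
```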

In [8]:
# Add the number of non-space characters and the punctuation percentage for each text
data['body_len'] = data['body_text'].apply(lambda x: len(x) - x.count(" "))
data['punct%'] = data['body_text'].apply(lambda x: count_punct(x))

data['body_len']

0        761
1        552
2        483
3        758
4       1830
        ... 
5995     728
5996    2583
5997     448
5998    2027
5999    1209
Name: body_len, Length: 6000, dtype: int64

In [None]:
# The clean_text function removes punctuation, tokenizes the text, removes stopwords, and stems the remaining tokens.

In [9]:
def clean_text(text):
    # Remove punctuation and lowercase the text, character by character
    text = "".join([char.lower() for char in text if char not in string.punctuation])
    
    # Tokenize by splitting on runs of non-word characters (raw string avoids an invalid-escape warning)
    tokens = re.split(r'\W+', text)
    
    # Remove stopwords, then pass each remaining token to the Porter stemmer
    text = [ps.stem(word) for word in tokens if word not in stopwords]
    
    return text
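
One detail of regex tokenization worth knowing: `re.split` returns an empty string when the text starts or ends with a non-word character, and that empty token then becomes a feature. The sketch below shows the effect in plain Python. It uses a tiny hypothetical stopword set (`demo_stopwords`) in place of NLTK's list and omits stemming, so it illustrates the steps rather than reproducing `clean_text`. Filtering the empty token is a design choice; applying it in the notebook would change the vocabulary size.

```python
import re
import string

# Hypothetical mini stopword set, standing in for NLTK's English stopwords
demo_stopwords = {"this", "was", "a", "the"}

def tokenize_demo(text, drop_empty=False):
    # Remove punctuation and lowercase, character by character
    text = "".join(ch.lower() for ch in text if ch not in string.punctuation)
    # Split on runs of non-word characters
    tokens = re.split(r"\W+", text)
    # Drop stopwords (stemming is omitted in this sketch)
    tokens = [t for t in tokens if t not in demo_stopwords]
    if drop_empty:
        # Optionally discard the empty strings left by leading/trailing separators
        tokens = [t for t in tokens if t]
    return tokens

sample = "This movie was great, really! "
print(tokenize_demo(sample))                   # keeps a trailing '' token
print(tokenize_demo(sample, drop_empty=True))  # empty token removed
```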

### TF-IDF Vectorization

In [10]:
# TF-IDF
tfidf_vect = TfidfVectorizer(analyzer=clean_text)
X_tfidf = tfidf_vect.fit_transform(data['body_text'])
X_tfidf_feat = pd.concat([data['body_len'], data['punct%'], pd.DataFrame(X_tfidf.toarray())], axis=1)

# CountVectorizer
# count_vect = CountVectorizer(analyzer=clean_text)
# X_count = count_vect.fit_transform(data['body_text'])
# X_count_feat = pd.concat([data['body_len'], data['punct%'], pd.DataFrame(X_count.toarray())], axis=1)

# X_count_feat.head()
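
To make the weighting concrete, here is a minimal pure-Python sketch of TF-IDF on already-tokenized documents. It assumes the defaults documented for scikit-learn's `TfidfVectorizer` (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, L2 normalization); `tfidf_demo` and the toy documents are made up for illustration.

```python
import math
from collections import Counter

def tfidf_demo(docs):
    """TF-IDF with smoothed IDF and L2 normalization for tokenized documents."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term
    df = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts
        weights = {term: freq * idf[term] for term, freq in tf.items()}
        # L2-normalize each document vector; guard against empty documents
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({term: w / norm for term, w in weights.items()})
    return idf, vectors

# Toy documents (already stemmed tokens), invented for this example
docs = [["great", "movi"], ["bad", "movi"], ["great", "great", "act"]]
idf, vectors = tfidf_demo(docs)
print(idf)
print(vectors[0])
```

Terms that appear in fewer documents (such as `act` here) receive a larger IDF than terms shared across documents (such as `movi`).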

### Exploring parameter settings using GridSearchCV for RandomForest

In [11]:
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

In [12]:
rf = RandomForestClassifier(n_jobs=-1)
param = {'n_estimators': [10, 50, 100],
        'max_depth': [10, 20, 30]}

gs = GridSearchCV(rf, param, cv=5)
gs_fit = gs.fit(X_tfidf_feat, data['label'])
pd.DataFrame(gs_fit.cv_results_).sort_values('mean_test_score', ascending=False)[0:5]

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_max_depth,param_n_estimators,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
5,10.729252,0.893545,0.724543,0.11698,20,100,"{'max_depth': 20, 'n_estimators': 100}",0.82,0.836667,0.860833,0.835833,0.844167,0.8395,0.013256,1
8,11.966633,0.505469,0.631164,0.039127,30,100,"{'max_depth': 30, 'n_estimators': 100}",0.8175,0.821667,0.84,0.824167,0.845,0.829667,0.010809,2
2,7.120791,0.724328,0.581538,0.078089,10,100,"{'max_depth': 10, 'n_estimators': 100}",0.8175,0.8225,0.8275,0.809167,0.839167,0.823167,0.010033,3
7,8.301973,0.562558,0.542519,0.06869,30,50,"{'max_depth': 30, 'n_estimators': 50}",0.798333,0.815,0.826667,0.808333,0.8275,0.815167,0.011086,4
4,7.591951,0.263456,0.550727,0.045098,20,50,"{'max_depth': 20, 'n_estimators': 50}",0.786667,0.815,0.8375,0.8,0.8275,0.813333,0.018311,5


In [None]:
# After running GridSearchCV for Random Forest, the best-performing parameters are:
# n_estimators -> 100
# max_depth -> 20
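
As a sanity check on the results table, the sketch below enumerates the grid the way an exhaustive search would and recomputes the top row's summary statistics from its fold scores. The fold scores are copied from the table above; the reported `std_test_score` appears consistent with a population standard deviation, which is what this sketch assumes.

```python
from itertools import product
from statistics import fmean, pstdev

param = {"n_estimators": [10, 50, 100], "max_depth": [10, 20, 30]}
cv = 5

# Every combination of the listed values is one candidate model
candidates = [dict(zip(param, values)) for values in product(*param.values())]
n_fits = len(candidates) * cv  # cross-validation fits, excluding the final refit
print(len(candidates), n_fits)

# Fold scores for max_depth=20, n_estimators=100 (rank 1 in the table)
split_scores = [0.82, 0.836667, 0.860833, 0.835833, 0.844167]
mean_score = fmean(split_scores)
std_score = pstdev(split_scores)  # population standard deviation (assumption)
print(round(mean_score, 4), round(std_score, 6))
```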

In [13]:
# Exploring parameter setting using GradientBoostingClassifier 
from sklearn.ensemble import GradientBoostingClassifier

In [14]:
gb = GradientBoostingClassifier()
param = {
    'n_estimators': [10], 
    'max_depth': [3, 7, 11],
    'learning_rate': [0.1]
}

clf = GridSearchCV(gb, param, cv=5)
cv_fit = clf.fit(X_tfidf_feat, data['label'])
pd.DataFrame(cv_fit.cv_results_).sort_values('mean_test_score', ascending=False)[0:5]

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_learning_rate,param_max_depth,param_n_estimators,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
2,63.326527,0.47477,0.788108,0.009424,0.1,11,10,"{'learning_rate': 0.1, 'max_depth': 11, 'n_est...",0.726667,0.745,0.765833,0.740833,0.754167,0.7465,0.013117,1
1,46.059748,0.734426,0.815267,0.031587,0.1,7,10,"{'learning_rate': 0.1, 'max_depth': 7, 'n_esti...",0.705,0.748333,0.759167,0.736667,0.780833,0.746,0.025141,2
0,23.276588,0.204028,0.746569,0.03064,0.1,3,10,"{'learning_rate': 0.1, 'max_depth': 3, 'n_esti...",0.694167,0.7025,0.729167,0.704167,0.76,0.718,0.024035,3


In [None]:
# After running GridSearchCV for Gradient Boosting, the best-performing parameters are:
# n_estimators -> 10 (the only value searched)
# max_depth -> 11
# learning_rate -> 0.1 (the only value searched)

In [15]:
#Splitting to train and test set

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(data[['body_text', 'body_len', 'punct%']], data['label'], test_size=0.2)
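
The split above is unseeded, so the metrics reported later will vary from run to run. `train_test_split` accepts `random_state` and `stratify` arguments for reproducible, class-balanced splits. The plain-Python sketch below illustrates that idea; it is not scikit-learn's implementation, and `stratified_split` is a hypothetical helper.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) with each class split in the same proportion."""
    rng = random.Random(seed)  # fixed seed makes the split repeatable
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Same class balance as this dataset: 3000 positive, 3000 negative
labels = ["positive"] * 3000 + ["negative"] * 3000
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```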

In [16]:
# Vectorize the text: fit TF-IDF on the training set only, then transform both sets
tfidf_vect = TfidfVectorizer(analyzer=clean_text)
tfidf_vect_fit = tfidf_vect.fit(X_train['body_text'])

tfidf_train = tfidf_vect_fit.transform(X_train['body_text'])
tfidf_test = tfidf_vect_fit.transform(X_test['body_text'])

X_train_vect = pd.concat([X_train[['body_len', 'punct%']].reset_index(drop=True), 
           pd.DataFrame(tfidf_train.toarray())], axis=1)
X_test_vect = pd.concat([X_test[['body_len', 'punct%']].reset_index(drop=True), 
           pd.DataFrame(tfidf_test.toarray())], axis=1)
X_train_vect.head()

Unnamed: 0,body_len,punct%,0,1,2,3,4,5,6,7,...,36750,36751,36752,36753,36754,36755,36756,36757,36758,36759
0,594,7.6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,560,3.4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,691,3.9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,1842,3.7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,577,4.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
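
Calling `.toarray()` turns the sparse TF-IDF matrix into a dense one, which is costly at this width. The rough estimate below assumes 4,800 training rows (80% of 6,000), the 36,760 TF-IDF columns shown above plus the two extra features, and 8 bytes per value. Keeping the matrix sparse, for example by combining features with `scipy.sparse.hstack`, is one possible way to reduce this; that alternative is a suggestion, not something the notebook does.

```python
n_rows = 4800          # assumption: 80% of the 6000 reviews
n_cols = 36760 + 2     # TF-IDF columns plus body_len and punct%
bytes_per_value = 8    # float64 / int64

# Total size of the dense training matrix in bytes
dense_bytes = n_rows * n_cols * bytes_per_value
print(f"{dense_bytes / 1e9:.2f} GB")
```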


In [17]:
#Final evaluation of models

from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import precision_recall_fscore_support as score
import time

In [18]:
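# Note: the grid search above ranked max_depth=20 first; this run uses max_depth=30 (ranked second)
rf = RandomForestClassifier(n_estimators=100, max_depth=30, n_jobs=-1)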

start = time.time()
rf_model = rf.fit(X_train_vect, y_train)
end = time.time()
fit_time = (end - start)

start = time.time()
y_pred = rf_model.predict(X_test_vect)
end = time.time()
pred_time = (end - start)

precision, recall, fscore, train_support = score(y_test, y_pred, pos_label='positive', average='binary')
print('Fit time: {} / Predict time: {} ---- Precision: {} / Recall: {} / Accuracy: {}'.format(
    round(fit_time, 3), round(pred_time, 3), round(precision, 3), round(recall, 3), round((y_pred==y_test).sum()/len(y_pred), 3)))

Fit time: 6.375 / Predict time: 0.609 ---- Precision: 0.839 / Recall: 0.852 / Accuracy: 0.849
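
For reference, the three reported metrics can be derived from the confusion counts. The sketch below computes them in plain Python on a handful of made-up labels; `binary_metrics` and the toy data are illustrative only and are not the notebook's predictions.

```python
def binary_metrics(y_true, y_pred, pos_label="positive"):
    """Precision, recall and accuracy for a binary task, from confusion counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == pos_label and p == pos_label)
    fp = sum(1 for t, p in pairs if t != pos_label and p == pos_label)
    fn = sum(1 for t, p in pairs if t == pos_label and p != pos_label)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0  # of predicted positives, how many are right
    recall = tp / (tp + fn) if tp + fn else 0.0     # of actual positives, how many were found
    accuracy = (tp + tn) / len(pairs)               # overall fraction correct
    return precision, recall, accuracy

# Toy labels, invented for this example
y_true = ["positive", "positive", "negative", "negative", "positive"]
y_pred = ["positive", "negative", "negative", "positive", "positive"]
print(binary_metrics(y_true, y_pred))
```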


In [19]:
gb = GradientBoostingClassifier(n_estimators=10, max_depth=11)

start = time.time()
gb_model = gb.fit(X_train_vect, y_train)
end = time.time()
fit_time = (end - start)

start = time.time()
y_pred = gb_model.predict(X_test_vect)
end = time.time()
pred_time = (end - start)

precision, recall, fscore, train_support = score(y_test, y_pred, pos_label='positive', average='binary')
print('Fit time: {} / Predict time: {} ---- Precision: {} / Recall: {} / Accuracy: {}'.format(
    round(fit_time, 3), round(pred_time, 3), round(precision, 3), round(recall, 3), round((y_pred==y_test).sum()/len(y_pred), 3)))

Fit time: 60.901 / Predict time: 0.569 ---- Precision: 0.712 / Recall: 0.79 / Accuracy: 0.743


In [None]:
# Random Forest is the best-performing model, with a test accuracy of 84.9% versus 74.3% for Gradient Boosting.