<a href="https://colab.research.google.com/github/DhanuMW/Data_Science_Projects/blob/main/IMDB_Movie_Review_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## IMDB Movie Review Analysis - NLP Project

The IMDB movie review dataset contains 25,000 movie reviews, each labelled as either positive or negative. However, because of limited RAM, 10% of the positive reviews and 10% of the negative reviews were sampled and shuffled to form the dataset used for project 3.

First, the text data were analyzed and then preprocessed by removing punctuation, tokenizing, removing stopwords, and lemmatizing.

The required Python libraries are imported.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

The required Natural Language Processing (NLP) libraries are imported.

In [None]:
import nltk
from nltk.corpus import stopwords
import string
import re

In [None]:
from google.colab import drive

drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
#load the IMDB_dataset.csv dataset file
imdb = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/IMDB_dataset.csv')
imdb.head()

Unnamed: 0,review,sentiment
0,I thought this was a wonderful way to spend ti...,positive
1,"Probably my all-time favorite movie, a story o...",positive
2,I sure would like to see a resurrection of a u...,positive
3,"This show was an amazing, fresh & innovative i...",negative
4,Encouraged by the positive comments about this...,negative


In [None]:
print("Input data has {} rows and {} columns".format(len(imdb), len(imdb.columns)))

Input data has 25000 rows and 2 columns


The initial dataset had 25000 rows and 2 columns.

In [None]:
print("Out of {} rows, {} are positive, {} are negative".format(len(imdb),
                                                       len(imdb[imdb['sentiment']=='positive']),
                                                       len(imdb[imdb['sentiment']=='negative'])))

Out of 25000 rows, 12500 are positive, 12500 are negative


Out of the given data, 50% was positive and 50% was negative.

In [None]:
print("Number of null in sentiments: {}".format(imdb['sentiment'].isnull().sum()))
print("Number of null in reviews: {}".format(imdb['review'].isnull().sum()))

Number of null in sentiments: 0
Number of null in reviews: 0


There were no null values in the dataset.

#### Text Data Preprocessing

In [None]:
# Splitting positive and negative reviews
positive_reviews = imdb[imdb['sentiment'] == 'positive']
negative_reviews = imdb[imdb['sentiment'] == 'negative']

# Selecting 10% of the positive and 10% of the negative reviews
frac_positive = positive_reviews.sample(frac=0.1, random_state=42)
frac_negative = negative_reviews.sample(frac=0.1, random_state=42)

# Combine selected positive and negative reviews
imdb_frac = pd.concat([frac_positive, frac_negative])

# Shuffle the combined data
imdb_frac = imdb_frac.sample(frac=1, random_state=42)

# Display information about the selected dataset
print(imdb_frac['sentiment'].value_counts())

negative    1250
positive    1250
Name: sentiment, dtype: int64


From the original dataset of 12,500 positive and 12,500 negative reviews, 1,250 positive and 1,250 negative reviews were selected at random and then shuffled, giving a balanced dataset.

In [None]:
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

The review length (excluding spaces) and the proportion of punctuation characters (stored as a fraction in the punct% column) are calculated and added to the dataset as engineered features, with the aim of making the predictions more accurate.

In [None]:
def count_punct(review):
    count = sum([1 for char in review if char in string.punctuation])
    return round(count/(len(review) - review.count(" ")), 3)

imdb_frac['body_len'] = imdb_frac['review'].apply(lambda x: len(x) - x.count(" "))
imdb_frac['punct%'] = imdb_frac['review'].apply(lambda x: count_punct(x))
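
As a quick check of the two engineered features, the sketch below applies the same formulas to a single review string. The sample text, the function names, and the guard against empty or space-only input are assumptions added for illustration; they are not part of the notebook.

```python
import string

def body_len(review):
    # Number of characters excluding spaces, as in the body_len column
    return len(review) - review.count(" ")

def punct_fraction(review):
    # Share of non-space characters that are punctuation (a fraction, not 0-100)
    n_chars = body_len(review)
    if n_chars == 0:  # assumption: return 0.0 for empty or space-only reviews
        return 0.0
    n_punct = sum(1 for char in review if char in string.punctuation)
    return round(n_punct / n_chars, 3)

sample = "Great movie, loved it!"  # hypothetical review
print(body_len(sample))        # 19
print(punct_fraction(sample))  # 0.105
```

Without the guard, a review made up only of spaces would raise a ZeroDivisionError in the original count_punct function.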

Then the text preprocessing starts. First, the punctuation characters listed in string.punctuation above are removed from each review and the cleaned text is returned.

In [None]:
def remove_punct(review):
    review_nopunct = "".join([char for char in review if char not in string.punctuation])
    return review_nopunct

imdb_frac['review_cleaned'] = imdb_frac['review'].apply(lambda x: remove_punct(x))

imdb_frac.head()

Unnamed: 0,review,sentiment,body_len,punct%,review_cleaned
7274,"Dahmer, a young confused man. Dahmer, a confus...",negative,811,0.09,Dahmer a young confused man Dahmer a confusing...
6248,"I couldn't stop watching this movie, though it...",positive,1329,0.065,I couldnt stop watching this movie though it w...
14422,I have so much hope for the sequel to Gen-X. L...,positive,557,0.068,I have so much hope for the sequel to GenX Luc...
11562,...But it definitely still only deserves 4/10 ...,negative,763,0.042,But it definitely still only deserves 410 star...
24325,This movie really deserves the MST3K treatment...,negative,304,0.049,This movie really deserves the MST3K treatment...


Then each review is split into tokens (words) for easier processing, using the regular expression \W+ to separate words. These tokens are used in all subsequent steps.

In [None]:
def tokenize(review):
    tokens = re.split(r'\W+', review)
    return tokens

imdb_frac['review_tokenized'] = imdb_frac['review_cleaned'].apply(lambda x: tokenize(x.lower()))

imdb_frac.head()

Unnamed: 0,review,sentiment,body_len,punct%,review_cleaned,review_tokenized
7274,"Dahmer, a young confused man. Dahmer, a confus...",negative,811,0.09,Dahmer a young confused man Dahmer a confusing...,"[dahmer, a, young, confused, man, dahmer, a, c..."
6248,"I couldn't stop watching this movie, though it...",positive,1329,0.065,I couldnt stop watching this movie though it w...,"[i, couldnt, stop, watching, this, movie, thou..."
14422,I have so much hope for the sequel to Gen-X. L...,positive,557,0.068,I have so much hope for the sequel to GenX Luc...,"[i, have, so, much, hope, for, the, sequel, to..."
11562,...But it definitely still only deserves 4/10 ...,negative,763,0.042,But it definitely still only deserves 410 star...,"[but, it, definitely, still, only, deserves, 4..."
24325,This movie really deserves the MST3K treatment...,negative,304,0.049,This movie really deserves the MST3K treatment...,"[this, movie, really, deserves, the, mst3k, tr..."
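
One detail of splitting on \W+ is worth seeing on a small input: when the text starts or ends with a non-word character (such as a space), re.split returns empty strings at the edges. This may be why an empty string '' shows up as the first entry of the TF-IDF vocabulary printed later in the notebook. The sketch below is illustrative only; the sample text and the tokenize_nonempty helper are hypothetical and not part of the notebook.

```python
import re

def tokenize(review):
    # Split on runs of non-word characters; the raw string avoids escape warnings
    return re.split(r'\W+', review)

def tokenize_nonempty(review):
    # Hypothetical variant: drop the empty strings produced at the edges
    return [token for token in tokenize(review) if token]

sample = " this movie  was great "  # hypothetical text with edge spaces
print(tokenize(sample))           # ['', 'this', 'movie', 'was', 'great', '']
print(tokenize_nonempty(sample))  # ['this', 'movie', 'was', 'great']
```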


The stopwords corpus is downloaded through NLTK (a widely used Python library for natural language processing). Stopwords are words such as a, the, is, and are, which contribute little to the sentiment of a sentence.

In [None]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

Stopwords from the corpus package are used to remove the stop words from the reviews.

In [None]:
stopword = nltk.corpus.stopwords.words('english')
def remove_stopwords(tokenized_list):
    review = [word for word in tokenized_list if word not in stopword]
    return review

imdb_frac['review_nostopwords'] = imdb_frac['review_tokenized'].apply(lambda x: remove_stopwords(x))

imdb_frac.head()

Unnamed: 0,review,sentiment,body_len,punct%,review_cleaned,review_tokenized,review_nostopwords
7274,"Dahmer, a young confused man. Dahmer, a confus...",negative,811,0.09,Dahmer a young confused man Dahmer a confusing...,"[dahmer, a, young, confused, man, dahmer, a, c...","[dahmer, young, confused, man, dahmer, confusi..."
6248,"I couldn't stop watching this movie, though it...",positive,1329,0.065,I couldnt stop watching this movie though it w...,"[i, couldnt, stop, watching, this, movie, thou...","[couldnt, stop, watching, movie, though, far, ..."
14422,I have so much hope for the sequel to Gen-X. L...,positive,557,0.068,I have so much hope for the sequel to GenX Luc...,"[i, have, so, much, hope, for, the, sequel, to...","[much, hope, sequel, genx, luckily, hopes, cam..."
11562,...But it definitely still only deserves 4/10 ...,negative,763,0.042,But it definitely still only deserves 410 star...,"[but, it, definitely, still, only, deserves, 4...","[definitely, still, deserves, 410, stars, moro..."
24325,This movie really deserves the MST3K treatment...,negative,304,0.049,This movie really deserves the MST3K treatment...,"[this, movie, really, deserves, the, mst3k, tr...","[movie, really, deserves, mst3k, treatment, ps..."
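
The filtering step can be sketched on a toy input. Note that because punctuation is stripped before this step, a contraction such as "couldnt" no longer matches an apostrophe form in a stopword list, which is consistent with "couldnt" surviving in the output above. The miniature stopword set below is a hypothetical stand-in for NLTK's English list, and using a set rather than a list is a suggestion for faster lookups, not what the notebook does.

```python
# Hypothetical miniature stopword set; the notebook uses NLTK's English list
STOPWORDS = {"i", "a", "the", "this", "it", "was", "couldn't"}

def remove_stopwords(tokens, stopwords=STOPWORDS):
    # Set membership keeps each lookup O(1) on average
    return [token for token in tokens if token not in stopwords]

tokens = ["i", "couldnt", "stop", "watching", "this", "movie"]
print(remove_stopwords(tokens))  # ['couldnt', 'stop', 'watching', 'movie']
```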


Then the words are lemmatized using the WordNetLemmatizer class from the nltk.stem module. Lemmatization is the step of converting a word to its meaningful base form by considering its contextual meaning.

In [None]:
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
nltk.download('punkt')
wnl = WordNetLemmatizer()

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [None]:
def lemmatizing(tokenized_reviews):
    review = [wnl.lemmatize(word) for word in tokenized_reviews]
    return review

imdb_frac['review_lemmatized'] = imdb_frac['review_nostopwords'].apply(lambda x: lemmatizing(x))

imdb_frac.head()

Unnamed: 0,review,sentiment,body_len,punct%,review_cleaned,review_tokenized,review_nostopwords,review_lemmatized
7274,"Dahmer, a young confused man. Dahmer, a confus...",negative,811,0.09,Dahmer a young confused man Dahmer a confusing...,"[dahmer, a, young, confused, man, dahmer, a, c...","[dahmer, young, confused, man, dahmer, confusi...","[dahmer, young, confused, man, dahmer, confusi..."
6248,"I couldn't stop watching this movie, though it...",positive,1329,0.065,I couldnt stop watching this movie though it w...,"[i, couldnt, stop, watching, this, movie, thou...","[couldnt, stop, watching, movie, though, far, ...","[couldnt, stop, watching, movie, though, far, ..."
14422,I have so much hope for the sequel to Gen-X. L...,positive,557,0.068,I have so much hope for the sequel to GenX Luc...,"[i, have, so, much, hope, for, the, sequel, to...","[much, hope, sequel, genx, luckily, hopes, cam...","[much, hope, sequel, genx, luckily, hope, came..."
11562,...But it definitely still only deserves 4/10 ...,negative,763,0.042,But it definitely still only deserves 410 star...,"[but, it, definitely, still, only, deserves, 4...","[definitely, still, deserves, 410, stars, moro...","[definitely, still, deserves, 410, star, moron..."
24325,This movie really deserves the MST3K treatment...,negative,304,0.049,This movie really deserves the MST3K treatment...,"[this, movie, really, deserves, the, mst3k, tr...","[movie, really, deserves, mst3k, treatment, ps...","[movie, really, deserves, mst3k, treatment, ps..."


The previous step lemmatized only nouns, because the default part-of-speech tag is noun. Hence lemmatization is performed again below with pos='v' so that verbs are reduced to their base form as well.

In [None]:
def preprocess(tokenized_reviews):
    verbs_lemmatized = [wnl.lemmatize(word, pos='v') for word in tokenized_reviews]
    return verbs_lemmatized

imdb_frac['review_preprocessed'] = imdb_frac['review_lemmatized'].apply(lambda x: preprocess(x))
imdb_frac.head()

Unnamed: 0,review,sentiment,body_len,punct%,review_cleaned,review_tokenized,review_nostopwords,review_lemmatized,review_preprocessed
7274,"Dahmer, a young confused man. Dahmer, a confus...",negative,811,0.09,Dahmer a young confused man Dahmer a confusing...,"[dahmer, a, young, confused, man, dahmer, a, c...","[dahmer, young, confused, man, dahmer, confusi...","[dahmer, young, confused, man, dahmer, confusi...","[dahmer, young, confuse, man, dahmer, confuse,..."
6248,"I couldn't stop watching this movie, though it...",positive,1329,0.065,I couldnt stop watching this movie though it w...,"[i, couldnt, stop, watching, this, movie, thou...","[couldnt, stop, watching, movie, though, far, ...","[couldnt, stop, watching, movie, though, far, ...","[couldnt, stop, watch, movie, though, far, pas..."
14422,I have so much hope for the sequel to Gen-X. L...,positive,557,0.068,I have so much hope for the sequel to GenX Luc...,"[i, have, so, much, hope, for, the, sequel, to...","[much, hope, sequel, genx, luckily, hopes, cam...","[much, hope, sequel, genx, luckily, hope, came...","[much, hope, sequel, genx, luckily, hope, come..."
11562,...But it definitely still only deserves 4/10 ...,negative,763,0.042,But it definitely still only deserves 410 star...,"[but, it, definitely, still, only, deserves, 4...","[definitely, still, deserves, 410, stars, moro...","[definitely, still, deserves, 410, star, moron...","[definitely, still, deserve, 410, star, moroni..."
24325,This movie really deserves the MST3K treatment...,negative,304,0.049,This movie really deserves the MST3K treatment...,"[this, movie, really, deserves, the, mst3k, tr...","[movie, really, deserves, mst3k, treatment, ps...","[movie, really, deserves, mst3k, treatment, ps...","[movie, really, deserve, mst3k, treatment, pse..."


#### TF-IDF Vectorization

At this point the text preprocessing is complete. Next, TF-IDF (Term Frequency - Inverse Document Frequency) vectorization is performed to convert the collection of preprocessed documents into a matrix of TF-IDF features. TF-IDF measures how important a term is within a document relative to the whole collection of documents. The TfidfVectorizer class from scikit-learn is used to vectorize the documents in this project.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
tfidf = TfidfVectorizer(analyzer=preprocess)
X_tfidf = tfidf.fit_transform(imdb_frac['review_preprocessed'])
print(X_tfidf.shape)
print(tfidf.get_feature_names_out())

(2500, 28603)
['' '0' '0000000000001' ... 'œextended' 'œpuppydog' 'ž']


Out of the 2500 reviews, 28603 features are extracted and vectorized.
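
To make the weighting concrete, the sketch below works through TF-IDF on three tiny token lists in plain Python. It assumes scikit-learn's default TfidfVectorizer settings (raw term counts, smoothed idf = ln((1 + n) / (1 + df)) + 1, and L2 normalization of each row); the function name tfidf_rows and the toy documents are hypothetical and only illustrate the idea.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    # docs: list of token lists, like the review_preprocessed column
    n_docs = len(docs)
    vocab = sorted({tok for doc in docs for tok in doc})
    # Document frequency: number of documents containing each token
    df = Counter(tok for doc in docs for tok in set(doc))
    # Smoothed inverse document frequency (assumed scikit-learn default)
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1.0 for tok in vocab}
    rows = []
    for doc in docs:
        counts = Counter(doc)
        raw = [counts[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0  # L2 norm of the row
        rows.append([v / norm for v in raw])
    return vocab, rows

docs = [["great", "movie"], ["bad", "movie"], ["great", "great", "act"]]
vocab, rows = tfidf_rows(docs)
print(vocab)  # ['act', 'bad', 'great', 'movie']
for row in rows:
    print([round(v, 3) for v in row])
```

In the second toy document, "bad" appears in only one document and therefore receives a larger weight than "movie", which appears in two.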

In [None]:
X_features = pd.concat([imdb_frac[['body_len', 'punct%']].reset_index(drop=True), pd.DataFrame(X_tfidf.toarray())], axis=1)
X_features.head()

Unnamed: 0,body_len,punct%,0,1,2,3,4,5,6,7,...,28593,28594,28595,28596,28597,28598,28599,28600,28601,28602
0,811,0.09,0.092318,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1329,0.065,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,557,0.068,0.107195,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,763,0.042,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,304,0.049,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


The previously engineered features (body_len and punct%) are concatenated with the TF-IDF features to create the final feature set used for training and testing.

In [None]:
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()

imdb_frac['sentiment'] = label_encoder.fit_transform(imdb_frac['sentiment'])
print(imdb_frac['sentiment'])

7274     0
6248     1
14422    1
11562    0
24325    0
        ..
4758     0
9596     1
15256    1
1958     0
494      1
Name: sentiment, Length: 2500, dtype: int64


The LabelEncoder class is used to convert the string labels in the sentiment column to numeric values, where negative=0 and positive=1.
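
The reason negative maps to 0 and positive to 1 is that LabelEncoder numbers the distinct labels in sorted order. A plain-Python sketch of that mapping follows; the helper name fit_label_mapping and the sample labels are hypothetical.

```python
def fit_label_mapping(labels):
    # Sort the distinct labels and number them from 0, as LabelEncoder does
    classes = sorted(set(labels))
    return {label: index for index, label in enumerate(classes)}

labels = ["negative", "positive", "positive", "negative"]  # hypothetical sample
mapping = fit_label_mapping(labels)
encoded = [mapping[label] for label in labels]
print(mapping)  # {'negative': 0, 'positive': 1}
print(encoded)  # [0, 1, 1, 0]
```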

#### Exploring Parameter Settings using GridSearchCV

##### Random Forest Classifier

First, GridSearchCV is used on the Random Forest Classifier.

In [None]:
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

Several parameters are tested and the best parameters are determined using the highest mean_test_score value.

In [None]:
rfc = RandomForestClassifier()
param = {'n_estimators': [10, 150, 300],
        'max_depth': [30, 60, 90, None]}
X_features.columns = X_features.columns.astype(str)
gscv = GridSearchCV(rfc, param, cv=5, n_jobs=-1)
gscv_fit = gscv.fit(X_features, imdb_frac['sentiment'])
pd.DataFrame(gscv_fit.cv_results_).sort_values('mean_test_score', ascending=False)[0:5]

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_max_depth,param_n_estimators,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
8,23.719408,0.647322,0.358495,0.090954,90.0,300,"{'max_depth': 90, 'n_estimators': 300}",0.822,0.84,0.852,0.83,0.848,0.8384,0.011128,1
11,22.956765,1.56806,0.291329,0.019352,,300,"{'max_depth': None, 'n_estimators': 300}",0.822,0.842,0.856,0.816,0.846,0.8364,0.015041,2
5,23.193718,0.774693,0.406473,0.173358,60.0,300,"{'max_depth': 60, 'n_estimators': 300}",0.818,0.852,0.842,0.842,0.824,0.8356,0.012611,3
2,15.819389,0.34655,0.30252,0.048503,30.0,300,"{'max_depth': 30, 'n_estimators': 300}",0.82,0.85,0.826,0.83,0.842,0.8336,0.010911,4
4,11.921916,0.250159,0.300906,0.039892,60.0,150,"{'max_depth': 60, 'n_estimators': 150}",0.812,0.838,0.852,0.834,0.82,0.8312,0.014006,5


The run time for GridSearchCV on Random Forest Classifier was 6 minutes. According to the results of grid search cross validation process, it can be determined that n_estimators=300 and max_depth=90 are the best parameters for Random Forest Classifier.
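
The mean_test_score and std_test_score columns can be reproduced from the per-fold scores. The sketch below does this for the top-ranked row of the table above, using only the standard library; the fold scores are copied from that row.

```python
from statistics import fmean, pstdev

# Per-fold accuracies of the top-ranked row (max_depth=90, n_estimators=300)
split_scores = [0.822, 0.84, 0.852, 0.83, 0.848]

mean_test_score = fmean(split_scores)
std_test_score = pstdev(split_scores)  # population standard deviation

print(round(mean_test_score, 4))  # 0.8384
print(round(std_test_score, 6))   # 0.011128

# Cross-validation fits: 3 n_estimators values x 4 max_depth values x 5 folds
n_fits = 3 * 4 * 5
print(n_fits)  # 60
```

The gap between the best and second-best mean scores (0.8384 vs. 0.8364) is smaller than the standard deviation across folds, so the ranking among the top settings should be read with some caution. By default GridSearchCV also refits the best setting once on all the data after the 60 cross-validation fits.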

##### XGBoost Classifier

Next the GridSearchCV is used on XGBoost Classifier (because Gradient Boosting Classifier needed a very long time and the notebook crashed trying to use it).

In [None]:
import xgboost as xgb
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV

In [None]:
xgbc = XGBClassifier(
    objective= 'binary:logistic',
    nthread=4,
    seed=42
)
param = {'n_estimators': [100, 150],
        'max_depth': [7, 11, 15],
         }
gscv_xgb = GridSearchCV(xgbc, param, cv=5, n_jobs=-1)

Several parameters are tested and the best parameters are determined using the highest mean_test_score value.

In [None]:
gscv_xgb_fit = gscv_xgb.fit(X_features, imdb_frac['sentiment'])
pd.DataFrame(gscv_xgb_fit.cv_results_).sort_values('mean_test_score', ascending=False)[0:5]



Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_max_depth,param_n_estimators,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
3,129.352518,5.115767,4.871622,1.59332,11,150,"{'max_depth': 11, 'n_estimators': 150}",0.814,0.826,0.83,0.81,0.816,0.8192,0.007547,1
1,99.549906,3.629468,5.007033,1.907757,7,150,"{'max_depth': 7, 'n_estimators': 150}",0.826,0.814,0.84,0.794,0.814,0.8176,0.0152,2
4,105.087633,2.848881,4.765309,0.776075,15,100,"{'max_depth': 15, 'n_estimators': 100}",0.808,0.812,0.842,0.812,0.814,0.8176,0.012355,2
0,75.698901,2.224154,4.916913,1.013831,7,100,"{'max_depth': 7, 'n_estimators': 100}",0.824,0.808,0.836,0.798,0.81,0.8152,0.013303,4
5,137.848981,14.829403,4.326229,1.367475,15,150,"{'max_depth': 15, 'n_estimators': 150}",0.808,0.814,0.834,0.81,0.804,0.814,0.010507,5


The run time for GridSearchCV on the XGBoost Classifier was 29 minutes. From the grid search cross-validation results, n_estimators=150 and max_depth=11 are identified as the best parameters for the XGBoost Classifier.

#### Final Evaluation for Model Selection

The Random Forest Classifier and the XGBoost Classifier, each with its best parameters, are compared to choose the best model. The evaluation metrics accuracy, precision, and recall are used to determine the better of the two tuned models.

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_fscore_support as score
from sklearn.model_selection import train_test_split
import time

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_features, imdb_frac['sentiment'], test_size=0.2)

The best parameters for Random Forest Classifier are n_estimators=300 and max_depth=90.

In [None]:
rfc = RandomForestClassifier(n_estimators=300, max_depth=90, n_jobs=-1)

start = time.time()
rfc_model = rfc.fit(X_train, y_train)
end = time.time()
fit_time = (end - start)

start = time.time()
y_pred = rfc_model.predict(X_test)
end = time.time()
pred_time = (end - start)

precision, recall, fscore, train_support = score(y_test, y_pred, pos_label=1, average='binary')
print('Fit time: {} / Predict time: {} ---- Precision: {} / Recall: {} / Accuracy: {}'.format(
    round(fit_time, 3), round(pred_time, 3), round(precision, 3), round(recall, 3), round((y_pred==y_test).sum()/len(y_pred), 3)))

Fit time: 19.703 / Predict time: 0.244 ---- Precision: 0.802 / Recall: 0.858 / Accuracy: 0.826


The run time for the evaluation of the Random Forest Classifier with n_estimators=300 and max_depth=90 was about 20 seconds (19.7 s to fit and 0.24 s to predict). The accuracy is 82.6%, precision is 80.2%, and recall is 85.8%.
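
For reference, the three metrics can be written out from the confusion counts for the positive class (label 1). The sketch below is a plain-Python illustration on hypothetical labels; the function name binary_metrics and the toy data are assumptions, and it is not the scikit-learn implementation used in the notebook.

```python
def binary_metrics(y_true, y_pred, pos_label=1):
    # Count true positives, false positives, false negatives, true negatives
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == pos_label and p == pos_label)
    fp = sum(1 for t, p in pairs if t != pos_label and p == pos_label)
    fn = sum(1 for t, p in pairs if t == pos_label and p != pos_label)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # correct share of predicted positives
    recall = tp / (tp + fn) if (tp + fn) else 0.0     # share of actual positives found
    accuracy = (tp + tn) / len(pairs)                 # share of all predictions correct
    return precision, recall, accuracy

y_true = [1, 1, 1, 0, 0, 0, 1, 0]  # hypothetical true labels
y_pred = [1, 1, 0, 0, 1, 0, 1, 0]  # hypothetical predictions
print(binary_metrics(y_true, y_pred))  # (0.75, 0.75, 0.75)
```

A recall higher than precision, as in the result above, means the model misses few positive reviews but labels some negative reviews as positive.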

Next the same procedure is followed for XGBoost Classifier.

The best parameters for XGBoost Classifier are n_estimators=150 and max_depth=11.

In [None]:
xgbc = XGBClassifier(n_estimators=150, max_depth=11)
start = time.time()
xgbc_model = xgbc.fit(X_train, y_train)
end = time.time()
fit_time = (end - start)

start = time.time()
y_pred = xgbc_model.predict(X_test)
end = time.time()
pred_time = (end - start)

precision, recall, fscore, train_support = score(y_test, y_pred, pos_label=1, average='binary')
print('Fit time: {} / Predict time: {} ---- Precision: {} / Recall: {} / Accuracy: {}'.format(
    round(fit_time, 3), round(pred_time, 3), round(precision, 3), round(recall, 3), round((y_pred==y_test).sum()/len(y_pred), 3)))

Fit time: 79.834 / Predict time: 1.804 ---- Precision: 0.804 / Recall: 0.866 / Accuracy: 0.83


The run time for the evaluation of the XGBoost Classifier with n_estimators=150 and max_depth=11 was about 82 seconds (79.8 s to fit and 1.8 s to predict). The accuracy is 83.0%, precision is 80.4%, and recall is 86.6%.

#### Conclusion

According to the results of the evaluation, the XGBoost Classifier with n_estimators=150 and max_depth=11 scores slightly higher than the Random Forest Classifier with n_estimators=300 and max_depth=90 (accuracy of 83.0% vs. 82.6%). However, considering the runtime (fit time plus predict time) of each model, the Random Forest Classifier can be considered the best performing model on the IMDB movie review dataset.
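
The trade-off can be put into numbers with a few lines of arithmetic. The figures below are copied from the two evaluation outputs; the dictionary layout is just an illustrative way to hold them.

```python
# Figures copied from the two evaluation outputs above
rf = {"fit": 19.703, "predict": 0.244, "accuracy": 0.826}
xgb = {"fit": 79.834, "predict": 1.804, "accuracy": 0.83}

rf_total = rf["fit"] + rf["predict"]
xgb_total = xgb["fit"] + xgb["predict"]

print(round(rf_total, 3))                           # 19.947
print(round(xgb_total, 3))                          # 81.638
print(round(xgb_total / rf_total, 1))               # 4.1
print(round(xgb["accuracy"] - rf["accuracy"], 3))   # 0.004
```

In this run XGBoost took roughly four times as long for a gain of 0.4 percentage points in accuracy.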

The results could be improved further by exploring a wider range of parameters and by using the complete dataset. However, very long runtimes, notebook crashes, and limited RAM restricted what could be tried.

Finally, it can be concluded that the chosen model, the Random Forest Classifier with n_estimators=300 and max_depth=90, can be used to predict whether an IMDB movie review carries positive or negative sentiment, with an accuracy of 82.6%, precision of 80.2%, and recall of 85.8%.