
### Project 3 – IMDB MOVIE REVIEW
Due on Sat 9th Apr 11:59pm EST
Context: The IMDB dataset has 25K movie reviews for natural language processing or text analytics.
This is a dataset for binary sentiment classification containing substantially more data than previous
benchmark datasets. We provide a set of 12,500 highly polar movie reviews for training and 12,500 for
testing. Please use less data, e.g. 6K reviews, if you are facing memory issues, but make sure to use an equal
number of positive and negative sentiment reviews. Mention clearly in the notebook if you have used a
reduced dataset.
#### For more dataset information, please go through the following link,
http://ai.stanford.edu/~amaas/data/sentiment/
#### Dataset Source: Click here
#### Task: Goal of this project is to predict the number of positive and negative reviews using classification
Implementation:
- Preprocess Text Data(Remove punctuation, Perform Tokenization, Remove stopwords and
Lemmatize/Stem)
- Perform TFIDF Vectorization
- Exploring parameter settings using GridSearchCV on Random Forest & Gradient Boosting
Classifier. Use Xgboost instead of Gradient Boosting if it's taking a very long time in
GridSearchCV
- Perform Final evaluation of models on the best parameter settings using the evaluation metrics
- Report the best performing model
#### Submission Instructions: Please just submit one jupyter notebook containing all the code and make use of markdown cells to include the comments, answers, reasoning, analysis, etc.
#### Note: Name of your file should be your “Project3-id_Firstname_Lastname.ipynb”

In [1]:
# Import necessary libraries
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import string
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer,CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report


In [2]:
# Load the IMDB dataset
# Replace 'IMDB_dataset.xlsx' with the actual path to your dataset file
imdb = pd.read_excel('IMDB_dataset.xlsx')
imdb.head()

Unnamed: 0,review,sentiment
0,I thought this was a wonderful way to spend ti...,positive
1,"Probably my all-time favorite movie, a story o...",positive
2,I sure would like to see a resurrection of a u...,positive
3,"This show was an amazing, fresh & innovative i...",negative
4,Encouraged by the positive comments about this...,negative


### Reduce the Dataset to 6000 reviews

In [3]:
# Randomly select an equal number of positive and negative reviews from the dataset
positive_reviews = imdb[imdb['sentiment'] == 'positive'].sample(n=3000, random_state=42)
negative_reviews = imdb[imdb['sentiment'] == 'negative'].sample(n=3000, random_state=42)

# Combine extracted positive and negative sentiment reviews into a smaller dataset
imdb_sample = pd.concat([positive_reviews, negative_reviews])

# Reset index
imdb_sample.reset_index(drop=True, inplace=True)

# Print the shape of a smaller data set
print("Reduced dataset shape:", imdb_sample.shape)

Reduced dataset shape: (6000, 2)


In [4]:
df = imdb_sample
df

Unnamed: 0,review,sentiment
0,"Of course the average ""Sci-Fi"" Battle Star Gal...",positive
1,Sorry to say I have no idea what Hollywood is ...,positive
2,"""The Lady from Shanghai"" is well known as one ...",positive
3,Ed Harris and Cuba Gooding Jr. where cast perf...,positive
4,Kate Miller (Angie Dickinson) is having proble...,positive
...,...,...
5995,Let me start out by saying that I used to real...,negative
5996,What we have here is a classic case of TOO muc...,negative
5997,Oh it really really is. I've seen films that I...,negative
5998,OK well i found this movie in my dads old pile...,negative


### Preprocess Text Data (Remove punctuation, Perform Tokenization, Remove stopwords and Lemmatize/Stem)


In [5]:
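# English stopword list; requires nltk.download('stopwords') on first use
stopwords = nltk.corpus.stopwords.words('english')
# Porter stemmer is created here but is not applied in clean_text below
ps = nltk.PorterStemmer()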

def count_punct(text):
    count = sum([1 for char in text if char in string.punctuation])
    return round(count/(len(text) - text.count(" ")), 3)*100

df['body_len'] = df['review'].apply(lambda x: len(x) - x.count(" "))
df['punct%'] = df['review'].apply(lambda x: count_punct(x))


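def clean_text(text):
    # Lowercase the text and drop punctuation characters
    text = "".join([char.lower() for char in text if char not in string.punctuation])
    # Split on runs of non-word characters (raw string avoids an invalid-escape warning)
    tokens = re.split(r'\W+', text)
    # Remove English stopwords
    text = [word for word in tokens if word not in stopwords]
    return text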

df['clean_review'] = df['review'].apply(lambda x: clean_text(x))
df.head()

Unnamed: 0,review,sentiment,body_len,punct%,clean_review
0,"Of course the average ""Sci-Fi"" Battle Star Gal...",positive,744,5.8,"[course, average, scifi, battle, star, gallact..."
1,Sorry to say I have no idea what Hollywood is ...,positive,561,2.7,"[sorry, say, idea, hollywood, sure, give, us, ..."
2,"""The Lady from Shanghai"" is well known as one ...",positive,1350,3.8,"[lady, shanghai, well, known, one, hollywoods,..."
3,Ed Harris and Cuba Gooding Jr. where cast perf...,positive,617,4.1,"[ed, harris, cuba, gooding, jr, cast, perfectl..."
4,Kate Miller (Angie Dickinson) is having proble...,positive,3643,4.3,"[kate, miller, angie, dickinson, problems, mar..."
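
The heading for this step mentions lemmatizing or stemming, and the cell creates a `PorterStemmer`, but `clean_text` never calls it. The following is a minimal sketch of a stemming variant, assuming NLTK is installed. The function name `clean_text_stemmed` and the small inline stopword set `STOPWORDS_DEMO` are hypothetical placeholders (the notebook uses NLTK's full English list), and switching to this function would change the `clean_review` values and the TF-IDF vocabulary shown in this notebook.

```python
import re
import string

from nltk.stem import PorterStemmer

# Hypothetical stand-in for nltk.corpus.stopwords.words('english'),
# kept inline so the sketch runs without downloading NLTK data.
STOPWORDS_DEMO = {"the", "a", "an", "and", "is", "was", "were", "to", "of"}

stemmer = PorterStemmer()


def clean_text_stemmed(text, stop_words=STOPWORDS_DEMO):
    """Lowercase, strip punctuation, tokenize, drop stopwords, then stem."""
    no_punct = "".join(ch.lower() for ch in text if ch not in string.punctuation)
    tokens = re.split(r"\W+", no_punct)
    # Skip empty strings produced by leading or trailing separators
    kept = [tok for tok in tokens if tok and tok not in stop_words]
    return [stemmer.stem(tok) for tok in kept]


print(clean_text_stemmed("The problems were running!"))
```

Stemming merges inflected forms into one token, which usually shrinks the vocabulary; whether that helps the classifiers here would need to be measured.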


### Split into train/test

In [6]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df[['review', 'body_len', 'punct%']], df['sentiment'], test_size=0.2)
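
The split above has no fixed seed, so the metrics reported later will vary from run to run, and the class balance of the test fold is left to chance. A hedged sketch of a reproducible, stratified split is shown below on a toy frame; `toy_df` is a hypothetical stand-in for `df`, while `random_state` and `stratify` are standard `train_test_split` parameters.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the notebook's df: 10 positive and 10 negative rows
toy_df = pd.DataFrame({
    "review": [f"review {i}" for i in range(20)],
    "sentiment": ["positive"] * 10 + ["negative"] * 10,
})

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_df[["review"]],
    toy_df["sentiment"],
    test_size=0.2,
    random_state=42,               # fixed seed: same split on every run
    stratify=toy_df["sentiment"],  # keep the 50/50 class ratio in both folds
)

print(y_te.value_counts().to_dict())
```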

### Perform TFIDF Vectorization

In [7]:
tfidf_vect = TfidfVectorizer(analyzer=clean_text)
tfidf_vect_fit = tfidf_vect.fit(X_train['review'])

tfidf_train = tfidf_vect_fit.transform(X_train['review'])
tfidf_test = tfidf_vect_fit.transform(X_test['review'])

X_train_vect = pd.concat([X_train[['body_len', 'punct%']].reset_index(drop=True), 
           pd.DataFrame(tfidf_train.toarray())], axis=1)
X_test_vect = pd.concat([X_test[['body_len', 'punct%']].reset_index(drop=True), 
           pd.DataFrame(tfidf_test.toarray())], axis=1)

X_train_vect

Unnamed: 0,body_len,punct%,0,1,2,3,4,5,6,7,...,50780,50781,50782,50783,50784,50785,50786,50787,50788,50789
0,510,4.9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2639,5.3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,562,2.1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,1916,4.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,1760,4.8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4795,468,6.4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4796,740,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4797,705,4.8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4798,1270,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
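
Calling `toarray()` materializes a dense matrix of about 4,800 rows by 50,790 columns, which is close to 2 GB as float64 and is a likely source of the memory issues the assignment warns about. One alternative, sketched here under assumptions, is to keep the TF-IDF output sparse and append the numeric columns with `scipy.sparse.hstack`. The toy `docs` and `extra` arrays are hypothetical stand-ins for `X_train['review']` and the `body_len` / `punct%` columns.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins for X_train['review'] and the two numeric columns
docs = ["great movie loved it", "terrible plot bad acting", "great acting"]
extra = np.array([[17, 0.0], [21, 0.0], [11, 0.0]])  # body_len, punct%

vect = TfidfVectorizer()
tfidf = vect.fit_transform(docs)  # scipy sparse matrix, no toarray()

# Stack the numeric columns next to the TF-IDF columns, staying sparse
X_sparse = hstack([csr_matrix(extra), tfidf]).tocsr()

print(X_sparse.shape)
```

scikit-learn's `RandomForestClassifier` and `GradientBoostingClassifier` both accept sparse input in `fit`, so the dense `DataFrame` is not strictly required.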


### Exploring parameter settings using GridSearchCV on Random Forest & Gradient Boosting Classifier. Use Xgboost instead of Gradient Boosting if it's taking a very long time in GridSearchCV

In [8]:
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import precision_recall_fscore_support as score
import time

#### Exploring parameter settings using GridSearchCV on Random Forest

In [9]:
# Convert feature name to string type
X_train_vect.columns = X_train_vect.columns.astype(str)
X_test_vect.columns = X_test_vect.columns.astype(str)

rf = RandomForestClassifier(n_estimators=150, max_depth=None, n_jobs=-1)

start = time.time()
rf_model = rf.fit(X_train_vect, y_train)
end = time.time()
fit_time = (end - start)

start = time.time()
y_pred = rf_model.predict(X_test_vect)
end = time.time()
pred_time = (end - start)

precision, recall, fscore, train_support = score(y_test, y_pred, pos_label='positive', average='binary')
print('Fit time: {} / Predict time: {} ---- Precision: {} / Recall: {} / Accuracy: {}'.format(
    round(fit_time, 3), round(pred_time, 3), round(precision, 3), round(recall, 3), round((y_pred==y_test).sum()/len(y_pred), 3)))

Fit time: 6.84 / Predict time: 0.387 ---- Precision: 0.879 / Recall: 0.811 / Accuracy: 0.841


#### Exploring parameter settings using GridSearchCV on Gradient Boosting Classifier

In [10]:
gb = GradientBoostingClassifier(n_estimators=150, max_depth=11)

start = time.time()
gb_model = gb.fit(X_train_vect, y_train)
end = time.time()
fit_time = (end - start)

start = time.time()
y_pred = gb_model.predict(X_test_vect)
end = time.time()
pred_time = (end - start)

precision, recall, fscore, train_support = score(y_test, y_pred, pos_label='positive', average='binary')
print('Fit time: {} / Predict time: {} ---- Precision: {} / Recall: {} / Accuracy: {}'.format(
    round(fit_time, 3), round(pred_time, 3), round(precision, 3), round(recall, 3), round((y_pred==y_test).sum()/len(y_pred), 3)))

Fit time: 634.449 / Predict time: 0.553 ---- Precision: 0.836 / Recall: 0.843 / Accuracy: 0.829
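
The two cells above each fit a single hand-picked configuration rather than running `GridSearchCV`. A minimal sketch of what the grid search could look like follows. It uses a small synthetic dataset from `make_classification` in place of `X_train_vect` / `y_train` so that it runs in seconds, and the parameter grids are illustrative assumptions, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for X_train_vect / y_train so the sketch runs quickly
X_demo, y_demo = make_classification(n_samples=200, n_features=20, random_state=42)

# Deliberately tiny, illustrative grids; real grids would be larger
searches = {
    "random_forest": GridSearchCV(
        RandomForestClassifier(random_state=42),
        param_grid={"n_estimators": [10, 50], "max_depth": [5, None]},
        cv=3,
        scoring="accuracy",
    ),
    "gradient_boosting": GridSearchCV(
        GradientBoostingClassifier(random_state=42),
        param_grid={"n_estimators": [10, 50], "max_depth": [3, 5]},
        cv=3,
        scoring="accuracy",
    ),
}

for name, search in searches.items():
    search.fit(X_demo, y_demo)
    print(name, search.best_params_, round(search.best_score_, 3))
```

After fitting, `search.best_estimator_` holds the model refit on the full training data with the best parameters, which is the model to evaluate on the held-out test set. Given the 634-second single fit reported above for Gradient Boosting, a full grid on the real features could take hours, which is the situation where the assignment suggests XGBoost instead.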


### Perform Final evaluation of models on the best parameter settings using the evaluation metrics


#### The two results above report each classification model's training time, prediction time, precision, recall, and accuracy. The first model is the Random Forest and the second is the Gradient Boosting classifier.

#### Fit time: 
The first model (Random Forest) trains in 6.84 seconds, while the second model (Gradient Boosting) takes 634.449 seconds. The first model is therefore roughly 90 times faster to train.

#### Predict time: 
The prediction time is 0.387 seconds for the first model (Random Forest) and 0.553 seconds for the second (Gradient Boosting). Both predict in under a second, with the second model slightly slower.

#### Precision: 
The precision of the first model (Random Forest) is 0.879, whereas for the second model (Gradient Boosting) it is 0.836. Precision measures the proportion of true positives among all positive predictions, so the first model produces fewer false positives among its positive predictions.

#### Recall:
The recall of the first model (Random Forest) is 0.811, and for the second model (Gradient Boosting) it is 0.843. Recall measures the proportion of actual positive samples that the model correctly identifies, so the second model misses fewer positive reviews.

#### Accuracy:
The accuracy of the first model (Random Forest) is 0.841, and for the second model (Gradient Boosting) it is 0.829. Accuracy measures the proportion of correctly predicted samples among all samples. The two models are close, with the first model slightly ahead.
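
Because one model leads on precision and the other on recall, a single combined figure can help the comparison. The `score(...)` calls in the earlier cells already return `fscore`, but the print statements omit it. As a worked check, the F1 score (the harmonic mean of precision and recall) can be recomputed from the rounded values printed above; the helper name `f1_score_from` is introduced here for illustration only.

```python
def f1_score_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the cell outputs above
reported = {
    "Random Forest": (0.879, 0.811),
    "Gradient Boosting": (0.836, 0.843),
}

for model, (p, r) in reported.items():
    print(f"{model}: F1 = {f1_score_from(p, r):.3f}")
# Random Forest: F1 = 0.844
# Gradient Boosting: F1 = 0.839
```

Since the inputs are already rounded to three decimals, these F1 values are approximate; the gap between the two models is small either way.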

#### In conclusion, the first model (Random Forest) trains far faster and leads on precision and accuracy, while the second model (Gradient Boosting) leads on recall. The choice between the two should therefore depend on the specific application scenario and its requirements.

### Report the best performing model
#### Taking the various indicators into consideration, the first model (Random Forest) is slightly better in precision and accuracy, while the second model (Gradient Boosting) is slightly better in recall. If recall is the priority, the second model can be considered the better performing model; note, however, that the first model has the higher accuracy and trains roughly 90 times faster.