# Sentiment Analysis of Movie Reviews Using Deep Learning
Team Members: Mik Vattakandy, Aidan Sim

## Project Overview
This project explores sentiment analysis using Natural Language Processing. Sentiment analysis classifies text as expressing positive, negative, or neutral sentiment; the IMDB movie reviews used here are labeled only positive or negative. The aim is to compare a traditional machine learning model (TF-IDF with logistic regression) against a deep learning model (LSTM) on the same task and examine how they differ in accuracy, training time, and size.

In [13]:
#Imports
import pandas as pd
import numpy as np
import re
import time
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelEncoder, OneHotEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout
from tensorflow.keras.callbacks import EarlyStopping

## Dataset Description

### Dataset Preprocessing

The code below loads the dataset, cleans the review text, and converts the sentiment labels to 1 (positive) and 0 (negative). Cleaning is needed because some reviews contain HTML tags, punctuation, and other symbols that would add noise to the models' input. The cleaned data is then split into 80% training and 20% test sets.

In [14]:
#Dataset Cleaning and processing
# Forward slash avoids the invalid "\I" escape sequence and works on all platforms
df = pd.read_csv("Data/IMDB_dataset.csv")

def clean_text(text):
    text = re.sub(r"<.*?>", "", text)
    text = re.sub(r"[^a-zA-Z']", " ", text)
    return text.lower()

df['review_clean'] = df['review'].apply(clean_text)
df['label'] = df['sentiment'].map({'positive':1, 'negative':0})

X_train, X_test, y_train, y_test = train_test_split(df['review_clean'], df['label'], test_size=0.2, random_state=42)
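
To make the cleaning step concrete, here is a small sketch that applies the same `clean_text` function to two made-up review snippets. The sample strings are hypothetical and are not taken from the dataset.

```python
import re

def clean_text(text):
    # Same three steps as the notebook's clean_text
    text = re.sub(r"<.*?>", "", text)        # drop HTML tags such as <br />
    text = re.sub(r"[^a-zA-Z']", " ", text)  # keep only letters and apostrophes
    return text.lower()

sample = "A <b>great</b> film! 10/10, isn't it?"  # hypothetical review text
print(clean_text(sample).split())
# ['a', 'great', 'film', "isn't", 'it']

# Tags are replaced with an empty string, so words separated only by a tag are joined
print(clean_text("boring<br />plot"))
# boringplot
```

The second call shows a side effect worth knowing about: replacing tags with a space instead of an empty string would keep such words apart, although that would differ from the cleaning used for the results in this notebook.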

### TF-IDF Model

In [15]:
#TF-IDF Model Code

tfidf = TfidfVectorizer(max_features=20000)
X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf = tfidf.transform(X_test)
start = time.time()
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train_tfidf, y_train)
end = time.time()
y_pred_tfidf = clf.predict(X_test_tfidf)

tfidf_features = X_train_tfidf.shape[1]
tfidf_time = end - start

print(f"TF-IDF Model Accuracy: {accuracy_score(y_test, y_pred_tfidf)}\n{classification_report(y_test, y_pred_tfidf)}")
print(f"TF-IDF Features: {tfidf_features}")
print(f"TF-IDF Model Training Time: {tfidf_time} seconds")

TF-IDF Model Accuracy: 0.9005
              precision    recall  f1-score   support

           0       0.91      0.89      0.90      4961
           1       0.89      0.91      0.90      5039

    accuracy                           0.90     10000
   macro avg       0.90      0.90      0.90     10000
weighted avg       0.90      0.90      0.90     10000

TF-IDF Features: 20000
TF-IDF Model Training Time: 0.5110175609588623 seconds
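
For intuition about the features the logistic regression receives, the sketch below computes TF-IDF weights by hand for a toy corpus. It assumes scikit-learn's documented defaults for `TfidfVectorizer` (smoothed IDF and L2 normalization) and uses plain whitespace tokenization, which is simpler than the vectorizer's own tokenizer. The corpus and the helper name `tfidf_vector` are hypothetical.

```python
import math
from collections import Counter

docs = ["great movie great cast", "terrible movie", "great fun"]  # hypothetical toy corpus
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term
df = Counter(term for doc in tokenized for term in set(doc))

def tfidf_vector(tokens):
    tf = Counter(tokens)
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1, multiplied by the raw term count
    weights = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t, c in tf.items()}
    # L2-normalize so every document vector has unit length
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

vec = tfidf_vector(tokenized[0])
for term, weight in sorted(vec.items()):
    print(f"{term}: {weight:.3f}")
```

In the first document, "cast" appears in only one document and so receives a higher weight than "movie", which appears in two. "great" occurs twice in the document and ends up with exactly twice the weight of "movie", since both terms have the same document frequency.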


### LSTM Model

In [16]:
#LSTM Model Code
tokenizer = Tokenizer(num_words=20000)
tokenizer.fit_on_texts(X_train)
X_train_seq = pad_sequences(tokenizer.texts_to_sequences(X_train), maxlen=200)
X_test_seq = pad_sequences(tokenizer.texts_to_sequences(X_test), maxlen=200)

model = Sequential([
    Embedding(input_dim=20000, output_dim=128),
    LSTM(128, dropout=0.2, recurrent_dropout=0.2),
    Dense(1, activation='sigmoid')
])
model.compile(
    loss='binary_crossentropy',
    optimizer='adam',
    metrics=['accuracy']
)
early_stop = EarlyStopping(
    monitor='val_loss',
    patience=2,
    restore_best_weights=True
)
start = time.time()
history = model.fit(
    X_train_seq, y_train,
    validation_split=0.2,
    batch_size=64,
    epochs=20,
    callbacks=[early_stop]
)
end = time.time()

lstm_time = end - start

lstm_loss, lstm_acc = model.evaluate(X_test_seq, y_test)
print(f"LSTM Model Accuracy: {lstm_acc}")
model.summary()  # prints the summary itself; wrapping it in print() also prints "None"
print(f"LSTM Model Training Time: {lstm_time} seconds")

Epoch 1/20
500/500 ━━━━━━━━━━━━━━━━━━━━ 121s 239ms/step - accuracy: 0.7924 - loss: 0.4545 - val_accuracy: 0.8420 - val_loss: 0.3870
Epoch 2/20
500/500 ━━━━━━━━━━━━━━━━━━━━ 98s 195ms/step - accuracy: 0.8681 - loss: 0.3312 - val_accuracy: 0.8360 - val_loss: 0.4032
Epoch 3/20
500/500 ━━━━━━━━━━━━━━━━━━━━ 108s 217ms/step - accuracy: 0.9013 - loss: 0.2561 - val_accuracy: 0.8543 - val_loss: 0.3460
Epoch 4/20
500/500 ━━━━━━━━━━━━━━━━━━━━ 107s 213ms/step - accuracy: 0.9276 - loss: 0.1947 - val_accuracy: 0.8680 - val_loss: 0.3442
Epoch 5/20
500/500 ━━━━━━━━━━━━━━━━━━━━ 100s 199ms/step - accuracy: 0.9398 - loss: 0.1660 - val_accuracy: 0.8670 - val_loss: 0.3954
Epoch 6/20
500/500 ━━━━━━━━━━━━━━━━━━━━ 98s 195ms/step - accuracy: 0.9563 - loss: 0.1231 - val_accuracy: 0.8645 - val_loss: 0.4153

LSTM Model Training Time: 630.7716052532196 seconds
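
Training stopped after 6 of the 20 requested epochs. The sketch below replays the `val_loss` values from the log above through a simplified version of the patience rule to show why. It is an illustration of the logic, not Keras's actual implementation, and it ignores options such as `min_delta`.

```python
val_losses = [0.3870, 0.4032, 0.3460, 0.3442, 0.3954, 0.4153]  # val_loss per epoch, from the log
patience = 2  # same value passed to EarlyStopping

best_loss = float("inf")
best_epoch = None
wait = 0              # epochs since the last improvement
stopped_epoch = None

for epoch, loss in enumerate(val_losses, start=1):
    if loss < best_loss:
        # New best validation loss: remember it and reset the counter
        best_loss, best_epoch, wait = loss, epoch, 0
    else:
        wait += 1
        if wait >= patience:
            stopped_epoch = epoch
            break

print(f"best epoch: {best_epoch}, val_loss: {best_loss}")
print(f"stopped after epoch: {stopped_epoch}")
# best epoch: 4, val_loss: 0.3442
# stopped after epoch: 6
```

Because `restore_best_weights=True` is set, the model evaluated on the test set uses the weights from epoch 4 rather than those from the final epoch.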


In [17]:
#Comparison Code
print(f"Accuracy Differential (TF-IDF - LSTM): {accuracy_score(y_test, y_pred_tfidf) - lstm_acc}")
print(f"TF-IDF Model Time: {tfidf_time}\nLSTM Model Time: {lstm_time}")
print(f"Time Differential (TF-IDF - LSTM): {tfidf_time - lstm_time}")
print(f"TF-IDF Parameter Count: {tfidf_features}\nLSTM Parameter Count: {model.count_params()}")

Accuracy Differential (TF-IDF - LSTM): 0.029699981689453092
TF-IDF Model Time: 0.5110175609588623
LSTM Model Time: 630.7716052532196
Time Differential (TF-IDF - LSTM): -630.2605876922607
TF-IDF Parameter Count: 20000
LSTM Parameter Count: 2691713
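
The LSTM parameter count can be checked by hand from the layer sizes in the model definition. The sketch below uses the usual per-layer formulas for an embedding, an LSTM, and a dense layer; the variable names are illustrative only.

```python
vocab_size, embed_dim, lstm_units = 20000, 128, 128  # values from the model definition

# Embedding: one vector of size embed_dim per vocabulary index
embedding_params = vocab_size * embed_dim
# LSTM: four gates, each with input weights, recurrent weights, and a bias vector
lstm_params = 4 * (lstm_units * (embed_dim + lstm_units) + lstm_units)
# Dense: one weight per LSTM unit plus a bias
dense_params = lstm_units * 1 + 1

total = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total)
# 2560000 131584 129 2691713

# Logistic regression on 20000 TF-IDF features: one weight per feature plus an intercept
logreg_params = 20000 + 1
print(logreg_params)
# 20001
```

Most of the LSTM model's parameters sit in the embedding layer. Note also that the "TF-IDF Parameter Count" printed above is the number of features, so the fitted logistic regression has one more parameter (the intercept) than that figure.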


## Analysis and Results

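The results show a clear trade-off between the classical model (TF-IDF with logistic regression) and the deep learning model (LSTM). On the 10,000-review test set, the TF-IDF model reached an accuracy of 0.9005, and the LSTM scored about 0.0297 lower (roughly 0.871). The TF-IDF classifier trained in about 0.5 seconds, compared with about 631 seconds for the LSTM, although the TF-IDF timing covers only the logistic regression fit and not the vectorization step. The LSTM is also far larger, with 2,691,713 parameters against 20,000 TF-IDF features.

The LSTM training log points to overfitting. Training accuracy climbed from 0.79 to 0.96 over six epochs, while validation accuracy stayed between 0.84 and 0.87 and validation loss reached its minimum (0.3442) at epoch 4. Early stopping halted training after epoch 6 and restored the epoch 4 weights.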

## References