# Sentiment Analysis of Movie Reviews Using Deep Learning
Team Members: Mik Vattakandy, Aidan Sim

## Project Overview
This project explores sentiment analysis, a Natural Language Processing task, to classify movie reviews as expressing positive or negative sentiment. The aim is to compare several machine learning models with each other in order to see the differences between traditional models (TF-IDF with logistic regression) and deep learning models (LSTM).

In [1]:
#Imports
import pandas as pd
import numpy as np
import re
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelEncoder, OneHotEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

## Dataset Description

### Dataset Preprocessing

The code cell below loads the dataset, cleans the review text, and converts the sentiment labels to 1 (positive) and 0 (negative). The reviews need cleaning because some of them contain HTML tags, punctuation, and other symbols that would be problematic for the models. The cleaned data is then split into a training set (80%) and a test set (20%).

In [2]:
#Dataset Cleaning and processing
df = pd.read_csv("Data/IMDB_dataset.csv")  # forward slash avoids the invalid "\I" escape sequence

def clean_text(text):
    # Remove HTML tags such as <br />
    text = re.sub(r"<.*?>", "", text)
    # Replace everything except letters and apostrophes with a space
    text = re.sub(r"[^a-zA-Z']", " ", text)
    return text.lower()

df['review_clean'] = df['review'].apply(clean_text)
df['label'] = df['sentiment'].map({'positive':1, 'negative':0})

X_train, X_test, y_train, y_test = train_test_split(df['review_clean'], df['label'], test_size=0.2, random_state=42)
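
To make the effect of the cleaning step concrete, here is a small sketch that applies the same two regular expressions to a single review. The sample review is made up for illustration, and the whitespace collapsing at the end is only there to make the printed result easier to read; it is not part of the pipeline above.

```python
import re

def clean_text(text):
    # Same logic as the notebook's clean_text: drop HTML tags,
    # replace non-letter characters (except apostrophes) with spaces, lowercase.
    text = re.sub(r"<.*?>", "", text)
    text = re.sub(r"[^a-zA-Z']", " ", text)
    return text.lower()

# Made-up example review (assumption, not taken from the dataset)
sample_review = "This movie was <br /><br />GREAT!!! Didn't expect that."

cleaned = clean_text(sample_review)
# Collapse runs of spaces for display only
collapsed = " ".join(cleaned.split())
print(collapsed)
```

Punctuation is replaced by spaces rather than deleted, so the cleaned text can contain runs of spaces. Whitespace-insensitive tokenizers such as the one in `TfidfVectorizer` are not affected by this.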



### TF-IDF Model
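
As a rough illustration of what TF-IDF weighting does, the sketch below computes the weights by hand for a toy corpus. It mirrors the smoothed inverse document frequency and L2 normalisation that scikit-learn's `TfidfVectorizer` applies by default, but it uses plain whitespace tokenisation as a simplifying assumption. The helper name `tfidf_weights` and the toy reviews are hypothetical and are not part of the project code.

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """Return one {term: weight} dict per document (hypothetical helper)."""
    # Simplified tokenisation: split on whitespace (assumption)
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: number of documents that contain each term
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}

    weighted = []
    for tokens in tokenized:
        counts = Counter(tokens)
        raw = {t: c * idf[t] for t, c in counts.items()}
        # L2-normalise so every document vector has unit length
        norm = math.sqrt(sum(w * w for w in raw.values()))
        weighted.append({t: w / norm for t, w in raw.items()})
    return weighted

# Toy corpus, made up for illustration
toy_reviews = [
    "the movie was wonderful",
    "the movie was dreadful",
    "the plot was wonderful",
]

weights = tfidf_weights(toy_reviews)
for term, weight in sorted(weights[1].items(), key=lambda kv: -kv[1]):
    print(f"{term:10s} {weight:.3f}")
```

Terms that occur in every review ("the", "was") receive the lowest weight, while a term unique to one review ("dreadful") receives the highest. This lets the logistic regression focus on words that help tell reviews apart.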

In [3]:
#TF-IDF Model Code

tfidf = TfidfVectorizer(max_features=20000)
X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf = tfidf.transform(X_test)
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train_tfidf, y_train)
y_pred_tfidf = clf.predict(X_test_tfidf)

print(f"TF-IDF Model Accuracy: {accuracy_score(y_test, y_pred_tfidf)}\n{classification_report(y_test, y_pred_tfidf)}")

TF-IDF Model Accuracy: 0.9005
              precision    recall  f1-score   support

           0       0.91      0.89      0.90      4961
           1       0.89      0.91      0.90      5039

    accuracy                           0.90     10000
   macro avg       0.90      0.90      0.90     10000
weighted avg       0.90      0.90      0.90     10000



### LSTM Model
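
As background for this section, the sketch below shows the computation an LSTM layer performs at each time step, written in plain NumPy. It is not the model used in this project: the weights are random, the dimensions are arbitrary, and the function and variable names (`lstm_step`, `W`, `U`, `b`) are hypothetical. In practice a deep learning framework would supply the layer and learn the weights during training.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step.

    W: (4 * hidden, input), U: (4 * hidden, hidden), b: (4 * hidden,)
    The four blocks hold the input, forget, and output gates and the candidate.
    """
    hidden = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b
    i = sigmoid(z[0:hidden])               # input gate
    f = sigmoid(z[hidden:2 * hidden])      # forget gate
    o = sigmoid(z[2 * hidden:3 * hidden])  # output gate
    g = np.tanh(z[3 * hidden:])            # candidate cell state
    c_t = f * c_prev + i * g               # keep part of the old memory, add new
    h_t = o * np.tanh(c_t)                 # hidden state exposed to the next layer
    return h_t, c_t

# Arbitrary illustrative sizes and random weights (assumptions)
rng = np.random.default_rng(0)
input_dim, hidden_dim, seq_len = 8, 4, 5
W = rng.normal(scale=0.1, size=(4 * hidden_dim, input_dim))
U = rng.normal(scale=0.1, size=(4 * hidden_dim, hidden_dim))
b = np.zeros(4 * hidden_dim)

# Stand-in for a review encoded as a sequence of word embeddings
sequence = rng.normal(size=(seq_len, input_dim))

h = np.zeros(hidden_dim)
c = np.zeros(hidden_dim)
for x_t in sequence:
    h, c = lstm_step(x_t, h, c, W, U, b)

print(h.shape)
```

Unlike the TF-IDF representation, which ignores word order, the hidden state `h` is updated word by word, so the final state depends on the order in which the words appear. A classifier for sentiment would typically feed this final state into a dense layer with a sigmoid output.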

In [4]:
#LSTM Model Code

In [5]:
#Comparison Code

## Analysis and Results

## References