# Sentiment Analysis on Movie Reviews

This project trains a machine learning model that predicts whether a given movie review is positive or negative.
It uses Stanford's [Large Movie Review Dataset](https://ai.stanford.edu/~amaas/data/sentiment/).


## Setup

The required libraries are listed in the `requirements.txt` file and can be installed with:
`pip install -r requirements.txt`


In [25]:
# Expanding contractions
import contractions

# Working with datasets
import pandas as pd

# Text cleaning
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Regular expressions
import re

# ML related
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.metrics import classification_report

## Data Reading


In [26]:
# Path to where the dataset is located
DATASET_PATH = "./dataset/IMDB Dataset.csv"

# Read the local dataset containing movie reviews and their sentiments
# Only the first 1000 rows are loaded to keep preprocessing and training fast
df = pd.read_csv(DATASET_PATH, nrows=1000)

print(df)

                                                review sentiment
0    One of the other reviewers has mentioned that ...  positive
1    A wonderful little production. <br /><br />The...  positive
2    I thought this was a wonderful way to spend ti...  positive
3    Basically there's a family where a little boy ...  negative
4    Petter Mattei's "Love in the Time of Money" is...  positive
..                                                 ...       ...
995  Nothing is sacred. Just ask Ernie Fosselius. T...  positive
996  I hated it. I hate self-aware pretentious inan...  negative
997  I usually try to be professional and construct...  negative
998  If you like me is going to see this in a film ...  negative
999  This is like a zoology textbook, given that it...  negative

[1000 rows x 2 columns]
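
Since only a slice of the dataset is loaded, it can be worth checking how the two labels are distributed in that slice. A minimal sketch, using a small made-up in-memory CSV (`csv_text` and `sample_df` are illustrative names, not part of the notebook) in place of the real file:

```python
import io

import pandas as pd

# Hypothetical stand-in for "IMDB Dataset.csv" with the same two columns
csv_text = (
    "review,sentiment\n"
    '"Loved it",positive\n'
    '"Hated it",negative\n'
    '"Great fun",positive\n'
    '"Not for me",negative\n'
)

# nrows limits how many data rows are parsed
sample_df = pd.read_csv(io.StringIO(csv_text), nrows=3)

# Count how many reviews carry each label
counts = sample_df["sentiment"].value_counts()
print(counts)
```

Running the same `value_counts()` call on `df["sentiment"]` would show the balance of the 1000 rows used here.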


## Data Preprocessing


In [None]:
# Download the NLTK resources used below
nltk.download("stopwords")
nltk.download("wordnet")

# Build the tag pattern, lemmatizer, and stop word set once
html_tag = re.compile(r"<.*?>")
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

def remove_html(text):
    # Strip HTML tags such as <br />
    return html_tag.sub("", text)

def expand_contractions(text):
    # Expand contractions, e.g. "don't" -> "do not"
    expanded = [contractions.fix(word) for word in text.split()]
    return " ".join(expanded)

def remove_stopwords(text):
    # Drop common English stop words (expects lowercased text)
    filtered_text = [word for word in text.split() if word not in stop_words]
    return " ".join(filtered_text)

def lemmatize(text):
    lemmas = [lemmatizer.lemmatize(word) for word in text.split()]
    return " ".join(lemmas)

def clean(text):
    text = remove_html(text)
    text = expand_contractions(text)
    text = remove_stopwords(text.lower())
    return lemmatize(text)

df["review"] = df["review"].apply(clean)
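
One thing to watch for: `remove_html` replaces each tag with an empty string, so sentences separated only by `<br />` tags end up glued together. A minimal sketch comparing that with a space replacement, on a made-up sample string (`TAG_RE`, `sample`, `glued`, and `spaced` are illustrative names):

```python
import re

TAG_RE = re.compile(r"<.*?>")

# Hypothetical review text, not taken from the dataset
sample = "Great movie!<br /><br />It was fun."

# Replacing tags with nothing joins the neighbouring words
glued = TAG_RE.sub("", sample)

# Replacing tags with a space, then collapsing repeated whitespace
spaced = " ".join(TAG_RE.sub(" ", sample).split())

print(glued)   # Great movie!It was fun.
print(spaced)  # Great movie! It was fun.
```

Whether this matters depends on the tokenizer: `TfidfVectorizer` splits on punctuation by default, but the whitespace-based stop word and lemmatization steps above treat `movie!It` as a single token.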

## Data Splitting


In [28]:
# Split the dataset: the first 70% of rows for training, the last 30% for testing (no shuffling)
train_data, test_data = train_test_split(df, test_size=0.3, shuffle=False)

# Verify data
print(train_data)
print(test_data)

                                                review sentiment
0    one reviewer mentioned watching 1 oz episode h...  positive
1    wonderful little production. filming technique...  positive
2    thought wonderful way spend time hot summer we...  positive
3    basically family little boy (jake) think zombi...  negative
4    petter mattei's "love time money" visually stu...  positive
..                                                 ...       ...
695  okay stupid,they say making another nightmare ...  negative
696  everyone, name may sound weird, nothing else! ...  positive
697  finally released good modesty blaise movie, te...  positive
698  now, game's stale, right?the joke done. over. ...  positive
699  decided watch movie would seen carol lombard m...  negative

[700 rows x 2 columns]
                                                review sentiment
700  unfortunately spoiler review nothing spoil mov...  negative
701  enjoyed watching well acted movie much!it well...  positive
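
With `shuffle=False`, the test set is simply the last 300 rows, so its label balance is whatever those rows happen to contain (the report below shows 163 positive and 137 negative reviews). A stratified, seeded split is one alternative. A minimal sketch on toy data, where `toy`, `train_toy`, and `test_toy` are illustrative names:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with the same column names as the notebook's df
toy = pd.DataFrame({
    "review": [f"review {i}" for i in range(10)],
    "sentiment": ["positive"] * 5 + ["negative"] * 5,
})

# stratify keeps the label proportions equal in both splits;
# random_state makes the shuffle reproducible
train_toy, test_toy = train_test_split(
    toy, test_size=0.4, stratify=toy["sentiment"], random_state=42
)

print(test_toy["sentiment"].value_counts())
```

Switching the notebook to a stratified split would change which rows land in the test set, so the scores reported below would no longer apply as-is.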

## Vectorization


In [29]:
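# Set up a TF-IDF vectorizer that turns each review into a sparse feature vector.
# Terms in fewer than 5 reviews or in more than 80% of reviews are ignored.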
vectorizer = TfidfVectorizer(
    min_df=5, max_df=0.8, sublinear_tf=True
)
train_vectors = vectorizer.fit_transform(train_data["review"])
test_vectors = vectorizer.transform(test_data["review"])

## Training


In [None]:
# Train a linear Support Vector Machine.
# probability=True enables predict_proba, which is used in "Further Testing".
clf = SVC(kernel="linear", probability=True, random_state=42)
clf.fit(train_vectors, train_data["sentiment"])
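
The vectorizer and classifier can also be bundled into a single scikit-learn `Pipeline`, so raw text goes in and a label comes out. This is an optional alternative, not what the notebook does. A minimal sketch on a tiny made-up corpus (`toy_texts`, `toy_labels`, and `toy_model` are illustrative names):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Made-up training data, far too small for a real model
toy_texts = [
    "good great fun",
    "great good film",
    "bad awful boring",
    "awful bad film",
]
toy_labels = ["positive", "positive", "negative", "negative"]

# fit() runs the vectorizer's fit_transform, then trains the SVM on the result
toy_model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("svm", SVC(kernel="linear", random_state=42)),
])
toy_model.fit(toy_texts, toy_labels)

# predict() vectorizes the raw string with the fitted vocabulary first
pred = toy_model.predict(["good fun"])
print(pred)
```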

## Testing and Accuracy


In [31]:
# Predict the sentiments of the test data and compare to the actual sentiments
predictions = clf.predict(test_vectors)
report = classification_report(test_data["sentiment"], predictions, output_dict=True)

print("positive: ", report["positive"])
print("negative: ", report["negative"])
print("accuracy:", report["accuracy"])

positive:  {'precision': 0.8835616438356164, 'recall': 0.7914110429447853, 'f1-score': 0.8349514563106796, 'support': 163.0}
negative:  {'precision': 0.7792207792207793, 'recall': 0.8759124087591241, 'f1-score': 0.8247422680412371, 'support': 137.0}
accuracy: 0.83
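
These numbers can be checked by hand. The counts in the sketch below are not printed by the notebook; they are inferred from the reported recall and support (for example, 0.7914 × 163 ≈ 129 positive reviews classified correctly), and `prf` is an illustrative helper:

```python
# Counts inferred from the report (support: 163 positive, 137 negative)
pos_correct, pos_missed = 129, 34   # positive reviews predicted positive / negative
neg_correct, neg_missed = 120, 17   # negative reviews predicted negative / positive

def prf(tp, fp, fn):
    """Return precision, recall, and F1 from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# A missed negative review is a false positive for the positive class, and vice versa
positive = prf(pos_correct, fp=neg_missed, fn=pos_missed)
negative = prf(neg_correct, fp=pos_missed, fn=neg_missed)
accuracy = (pos_correct + neg_correct) / 300

print("positive:", positive)
print("negative:", negative)
print("accuracy:", accuracy)
```

The recomputed values match the report: precision 129/146 ≈ 0.884 and recall 129/163 ≈ 0.791 for the positive class, and 249/300 = 0.83 overall.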


## Further Testing


In [32]:
# Test with a custom review
# Replace this string with your own review
review = "good"
# Apply the same cleaning as the training data before vectorizing
prediction_transformed = vectorizer.transform([clean(review)])

print(clf.predict(prediction_transformed), clf.predict_proba(prediction_transformed))

['positive'] [[0.11412153 0.88587847]]
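
The two probabilities are listed in the order of `clf.classes_`, which scikit-learn sorts, so here the first column is `negative` and the second is `positive`. A minimal sketch that pairs labels with probabilities, using a toy one-feature classifier (`X`, `y`, and `toy_clf` are illustrative and unrelated to the review model):

```python
import numpy as np
from sklearn.svm import SVC

# Toy data: low feature values are "negative", high values are "positive"
X = np.concatenate([np.linspace(0.0, 0.4, 10), np.linspace(0.6, 1.0, 10)]).reshape(-1, 1)
y = ["negative"] * 10 + ["positive"] * 10

toy_clf = SVC(kernel="linear", probability=True, random_state=42)
toy_clf.fit(X, y)

# predict_proba columns follow the order of classes_
proba = toy_clf.predict_proba([[0.9]])[0]
for label, p in zip(toy_clf.classes_, proba):
    print(f"{label}: {p:.3f}")
```

The same `zip(clf.classes_, ...)` pattern applies to the notebook's `clf` and makes the printed probabilities self-describing.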
