# **ReelFeel - IMDB Reviews Sentiment Classification Model with Natural Language Processing - using Recurrent Neural Networks**

---

**Alam Rincon - [GitHub: MrRincon](https://github.com/MrRincon)**

**Petar Atanasov - [GitHub: petar-Atanasov](https://github.com/petar-Atanasov)**

**Teon Morgan - [GitHub: Mi1kDev](https://github.com/Mi1kDev)**

---

**Lakshmipathi N. (2019) ‘IMDB Dataset of 50K Movie Reviews’. Available at: https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews (Accessed: 14 April 2025).**

# **Preinstalling Libraries**

Run this cell once, then restart the kernel. After the restart, skip this cell and continue with the rest of the notebook.

In [None]:
!pip install gensim

# **Preprocessing Data**

Importing the core Python libraries:
*   pandas for dataset manipulation
*   numpy for numerical operations
*   matplotlib.pyplot and seaborn for data visualization

In [2]:
# importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

nlp_dataset = pd.read_csv("./datasets/IMDB Dataset.csv")
nlp_dataset.head()

| | review | sentiment |
|---|---|---|
| 0 | One of the other reviewers has mentioned that ... | positive |
| 1 | A wonderful little production. <br /><br />The... | positive |
| 2 | I thought this was a wonderful way to spend ti... | positive |
| 3 | Basically there's a family where a little boy ... | negative |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive |


Displaying summary statistics for the dataset

In [3]:
# describes basic information regarding the dataset
nlp_dataset.describe()

| | review | sentiment |
|---|---|---|
| count | 50000 | 50000 |
| unique | 49582 | 2 |
| top | Loved today's show!!! It was a variety and not... | positive |
| freq | 5 | 25000 |


In [4]:
# shows the data type and non-null count of each column
nlp_dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     50000 non-null  object
 1   sentiment  50000 non-null  object
dtypes: object(2)
memory usage: 781.4+ KB


Reviewing the dataset structure for preprocessing.

In [5]:
nlp_dataset.shape

(50000, 2)

In [6]:
# checks for null values in the dataset
nlp_dataset.isnull().sum()

review       0
sentiment    0
dtype: int64

In [7]:
nlp_dataset.isnull().sum().sum()

np.int64(0)

Checking for duplicates and removing them

In [8]:
# checks for duplicate values in the dataset
nlp_dataset.duplicated().sum()

np.int64(418)

In [9]:
# removes existing duplicates
nlp_dataset.drop_duplicates(inplace=True)
nlp_dataset.shape

(49582, 2)
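
`drop_duplicates` keeps the original row labels, so the index still runs up to 49999 with gaps where rows were removed. The sketch below illustrates this on a toy frame (the rows are hypothetical, not taken from the dataset) and shows an optional `reset_index` step. Whether later cells need a contiguous index is an assumption; the notebook does not state it.

```python
import pandas as pd

# Toy stand-in for the review data (hypothetical rows, not from the dataset)
toy = pd.DataFrame({
    "review": ["great film", "dull plot", "great film", "fine acting"],
    "sentiment": ["positive", "negative", "positive", "positive"],
})

n_duplicates = int(toy.duplicated().sum())  # rows identical to an earlier row
deduped = toy.drop_duplicates()             # keeps the first occurrence
print(n_duplicates, list(deduped.index))    # the index keeps its gap

# Optional: renumber the rows so labels and positions agree again
deduped = deduped.reset_index(drop=True)
print(list(deduped.index))
```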

# **Deep Learning Model Implementation**

Importing the necessary libraries and downloading the NLTK resources needed to tokenise and lemmatise the reviews.

In [10]:
# Regular Expressions Library to Clean the data
import re
# Natural Language Toolkit Library to Preprocess the data
import nltk
from nltk.corpus import wordnet
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
# Download the necessary NLTK resources
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger_eng')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\techn\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\techn\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\techn\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\techn\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     C:\Users\techn\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!


True

Function to determine the part-of-speech (POS) tag of a word and map it to the matching WordNet category, defaulting to noun.

In [11]:
def get_wordnet_pos(word):
  tag = nltk.pos_tag([word])[0][1][0].upper()
  # pos_tag access = [tuple][POS tag][first letter of POS tag]
  tag_dict = {
      "J": wordnet.ADJ, # Adjectives
      "N": wordnet.NOUN, # Nouns
      "V": wordnet.VERB, # Verbs
      "R": wordnet.ADV # Adverb
      }
  return tag_dict.get(tag, wordnet.NOUN)
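
To make the lookup easier to follow without downloading any NLTK data, here is a dependency-free sketch of the same first-letter mapping. It assumes the tags are Penn Treebank tags (the default tagset of `nltk.pos_tag`) and writes the WordNet constants as their literal string values. The helper name `penn_to_wordnet` is hypothetical.

```python
# Literal values of wordnet.ADJ, wordnet.NOUN, wordnet.VERB and wordnet.ADV
ADJ, NOUN, VERB, ADV = "a", "n", "v", "r"

TAG_TO_WORDNET = {"J": ADJ, "N": NOUN, "V": VERB, "R": ADV}

def penn_to_wordnet(penn_tag):
    """Map a Penn Treebank tag (e.g. 'VBD') to a WordNet POS, defaulting to noun."""
    return TAG_TO_WORDNET.get(penn_tag[:1].upper(), NOUN)

# Adjective, plural noun, past-tense verb, adverb, and a preposition (falls back to noun)
for tag in ["JJ", "NNS", "VBD", "RB", "IN"]:
    print(tag, "->", penn_to_wordnet(tag))
```

The fallback matters because the lemmatizer only accepts WordNet categories; any tag outside the four mapped groups is treated as a noun.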

Cleaning the dataset
*   Converting the text to lower case
*   Removing HTML tags
*   Replacing non-alphanumeric characters with spaces
*   Tokenising the words
*   Removing stopwords
*   Applying lemmatization

In [12]:
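# built once so they are not recreated for every review
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

# cleans a review and returns its lemmatised tokens joined into one string
def preprocess_text(review):
  review = review.lower()
  # replace HTML tags with a space so the words around a tag are not merged
  review = re.sub(r'<[^>]+>', ' ', review)
  # replace every non-alphanumeric character with a space
  review = re.sub(r'[^a-zA-Z0-9]', ' ', review)
  tokens = word_tokenize(review)
  tokens = [word for word in tokens if word not in stop_words]
  tokens = [lemmatizer.lemmatize(word, get_wordnet_pos(word)) for word in tokens]

  return " ".join(tokens)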
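
The regular-expression steps can be tried in isolation, without any NLTK resources. The sketch below uses a hypothetical helper, `strip_markup`, that covers only the lower-casing and character clean-up. In this sketch tags are replaced with a space so that words on either side of a `<br />` stay separate, which matters when a tag sits directly between two words.

```python
import re

def strip_markup(review):
    """Lowercase, drop HTML tags, and replace non-alphanumerics with spaces."""
    review = review.lower()
    review = re.sub(r"<[^>]+>", " ", review)    # a space keeps neighbouring words apart
    review = re.sub(r"[^a-z0-9]", " ", review)  # punctuation and symbols become spaces
    return " ".join(review.split())             # collapse repeated whitespace

sample = "A wonderful little production.<br /><br />The filming technique is very unassuming!"
print(strip_markup(sample))
# prints: a wonderful little production the filming technique is very unassuming
```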

Extracting the cleaned review text and encoding the sentiment labels as binary (1 for positive, 0 for negative):

In [None]:
def extractTokenisedReview(row):
    return preprocess_text(row['review'])

def extractSentimentLabels(row):
    # 1 for a positive review, 0 otherwise
    return 1 if row['sentiment'] == 'positive' else 0

nlp_tokenised_reviews = nlp_dataset.apply(extractTokenisedReview, axis=1)
nlp_sentiment_labels = nlp_dataset.apply(extractSentimentLabels, axis=1)

# preview the first five cleaned reviews and their binary labels
print(nlp_tokenised_reviews[:5])
print(nlp_sentiment_labels[:5])

0    one reviewer mention watch 1 oz episode hooked...
1    wonderful little production film technique una...
2    thought wonderful way spend time hot summer we...
3    basically family little boy jake think zombie ...
4    petter mattei love time money visually stun fi...
dtype: object
0    1
1    1
2    1
3    0
4    1
dtype: int64
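
As an aside, the label encoding can also be written as a column-wise lookup instead of a row-wise `apply`. This is a minimal sketch on a toy frame, not the notebook's method. Note one difference in behaviour: with `Series.map`, a value missing from the dictionary becomes `NaN` rather than 0.

```python
import pandas as pd

# Toy sentiment column (hypothetical rows)
toy = pd.DataFrame({"sentiment": ["positive", "negative", "positive"]})

# Look each label up in a dictionary, one column at a time
labels = toy["sentiment"].map({"positive": 1, "negative": 0})
print(labels.tolist())
```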


In [None]:
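import gensim.downloader as api
# loads the pretrained 300-dimensional Google News word2vec vectors
# (a large download the first time it runs)
word2vec_model = api.load('word2vec-google-news-300')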
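
The notebook does not yet show how these vectors are used. One common approach for a recurrent model is to copy them into an embedding matrix indexed by vocabulary position. The sketch below is an assumption about that step, not the authors' implementation: `build_embedding_matrix` is a hypothetical helper, a small dictionary stands in for the pretrained model so the example runs offline, and row 0 is reserved for padding by convention.

```python
import numpy as np

def build_embedding_matrix(vocabulary, vectors, dim):
    """Return a (len(vocabulary) + 1, dim) matrix; row 0 is reserved for padding."""
    matrix = np.zeros((len(vocabulary) + 1, dim), dtype=np.float32)
    for index, word in enumerate(vocabulary, start=1):
        if word in vectors:  # membership test works for a dict and for gensim KeyedVectors
            matrix[index] = vectors[word]
    return matrix

# Toy 3-dimensional vectors standing in for the 300-dimensional word2vec model
toy_vectors = {
    "film": np.array([0.1, 0.2, 0.3]),
    "love": np.array([0.4, 0.5, 0.6]),
}
vocabulary = ["film", "love", "zzzunknown"]

embedding_matrix = build_embedding_matrix(vocabulary, toy_vectors, dim=3)
print(embedding_matrix.shape)
```

Words missing from the pretrained vocabulary keep an all-zero row here; initialising them randomly instead is another common choice.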

## **Recurrent Neural Network (RNN)**

## **Model Training**

# **Evaluation and Insights**