<a href="https://colab.research.google.com/github/AmeerAliSaleem/MA4J5_Project/blob/main/Ameer_Ali_Saleem_MA4J5_Main_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# MA4J5 Project: Investigating the Utility of Natural Language Processing for Sentiment Analysis
---
### Ameer Ali Saleem

This notebook contains the Python code supporting my MA4J5 report. It was written in Google Colab, which provides the TensorFlow package without a local installation (which would take roughly 1.1GB of storage). If you do not have TensorFlow installed, please click the "Open in Colab" badge at the top of the notebook.

# Imports and checks

In [7]:
# imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import random
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import regex as re
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

Our goal is to build a neural network to predict the rating (out of 5 stars) of an input Amazon Kindle review. The data used in this project can be found at <a href="https://nijianmo.github.io/amazon/index.html">this link</a>. Due to the sheer quantity of data, I have chosen to use just 10% of the 5-core Kindle Store data. We begin by importing and cleaning the data:

In [8]:
df = pd.read_csv("kindle_cleaned.csv")

In [9]:
df.head()

Unnamed: 0.1,Unnamed: 0,overall,reviewTime,reviewText,summary
0,0,5.0,"01 24, 2015",Great Classic for free. only wish they had th...,Five Stars
1,1,4.0,"05 16, 2013","I liked it, wished for more about her grandpar...",Pretty good
2,2,5.0,"01 2, 2013","I really liked this short story, but I really ...",Wished it was longer...
3,3,5.0,"03 23, 2016",Wow I love this series,Five Stars
4,4,5.0,"09 19, 2015",Great series. Can't wait to read more. Love t...,Love it


In [10]:
df = df.drop(columns="Unnamed: 0")

In [11]:
df.head()

Unnamed: 0,overall,reviewTime,reviewText,summary
0,5.0,"01 24, 2015",Great Classic for free. only wish they had th...,Five Stars
1,4.0,"05 16, 2013","I liked it, wished for more about her grandpar...",Pretty good
2,5.0,"01 2, 2013","I really liked this short story, but I really ...",Wished it was longer...
3,5.0,"03 23, 2016",Wow I love this series,Five Stars
4,5.0,"09 19, 2015",Great series. Can't wait to read more. Love t...,Love it


In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 141992 entries, 0 to 141991
Data columns (total 4 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   overall     141992 non-null  float64
 1   reviewTime  141992 non-null  object 
 2   reviewText  141961 non-null  object 
 3   summary     141895 non-null  object 
dtypes: float64(1), object(3)
memory usage: 4.3+ MB


# Text Preprocessing

In [16]:
type(df["reviewText"].values)

numpy.ndarray

In [23]:
print(type(df["reviewText"].iloc[0]))

<class 'str'>


In [24]:
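# Cast every review to str so that .lower() works on each row.
# Note: the 31 missing reviews (141992 rows, 141961 non-null) become the literal string "nan".
df["reviewText"] = df["reviewText"].astype(str)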

In [13]:
stopwords = nltk.corpus.stopwords.words('english')
print(stopwords)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [32]:
reviews_cleaned = []

for review_raw in df["reviewText"]:
  review_lower = review_raw.lower()
  # Keep only letters, digits, whitespace and full stops; all other punctuation is removed
  review_stripped = re.sub(r"[^a-zA-Z0-9\s\.]", "", review_lower)
  # reviews_edit_3 = re.sub(" \.|\. ", "", reviews_edit_2) # Get rid of fullstops that are outside of words
  # Trim leading/trailing whitespace and append to the list of cleaned reviews
  reviews_cleaned.append(review_stripped.strip())
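
The cleaning step above can be wrapped in a small function so that it can be tried on single strings. This is only a sketch: the name `clean_review` is hypothetical, and it uses the standard-library `re` module rather than the `regex` package imported earlier, on the assumption that the two behave identically for this simple pattern.

```python
import re

def clean_review(text):
    """Lowercase a review and drop everything except letters, digits, whitespace and full stops."""
    lowered = text.lower()
    stripped = re.sub(r"[^a-zA-Z0-9\s\.]", "", lowered)
    return stripped.strip()

# Apostrophes and exclamation marks are removed; full stops survive
print(clean_review("Great series. Can't wait to read more!"))
```

Because apostrophes are deleted rather than replaced by a space, contractions such as "can't" collapse into a single word ("cant"), which matches the sample output shown further down.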

In [33]:
# remove stopwords from the cleaned list
filtered_reviews = [
    ' '.join(word for word in sentence.split() if word.lower() not in stopwords)
    for sentence in reviews_cleaned
]
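
One side effect of filtering on whitespace-separated words is that a stopword followed by a full stop (for example "more.") does not match its entry in the stopword list and is therefore kept. The sketch below illustrates this with a small hand-picked subset of the stopword list; the function name `remove_stopwords` and the `strip_full_stops` option are hypothetical and are not part of the notebook's pipeline.

```python
# Illustrative subset of the NLTK English stopword list printed above.
# A set is used because membership tests on a set are faster than on a list.
STOPWORDS = {"i", "it", "was", "to", "the", "more", "you're"}

def remove_stopwords(sentence, stopwords=STOPWORDS, strip_full_stops=False):
    """Drop stopwords from a whitespace-separated sentence."""
    kept = []
    for word in sentence.split():
        # Optionally compare without trailing full stops, so "more." matches "more"
        key = word.rstrip(".") if strip_full_stops else word
        if key.lower() not in stopwords:
            kept.append(word)
    return " ".join(kept)

sample = "cant wait to read more. youre going to love it"
print(remove_stopwords(sample))                         # "more." is kept
print(remove_stopwords(sample, strip_full_stops=True))  # "more." is dropped
```

The same sample also shows that "youre" is never removed: the stopword list stores the contraction with its apostrophe ("you're"), but the cleaning step has already deleted apostrophes.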

In [34]:
# take a look at some sample data
reviews_cleaned[:5]

['great classic for free.  only wish they had the whole pepper set',
 'i liked it wished for more about her grandparents n his family history but it was enough. enjoy the read',
 'i really liked this short story but i really do wish it was longer. it was like reading a book and getting to the end and the rest of the book is missing. it left me wanting more and to me personally that means it was a very good story. i have several of stephen king s books and this story is now one of my favorites. i would recommend it to everyone',
 'wow i love this series',
 'great series. cant wait to read more.  love the family and the terrific group of amazing characters.  already bought the rest.  cant wait to read them']

In [35]:
# compare the above with stopwords removed
filtered_reviews[:5]

['great classic free. wish whole pepper set',
 'liked wished grandparents n family history enough. enjoy read',
 'really liked short story really wish longer. like reading book getting end rest book missing. left wanting personally means good story. several stephen king books story one favorites. would recommend everyone',
 'wow love series',
 'great series. cant wait read more. love family terrific group amazing characters. already bought rest. cant wait read']

In [37]:
labels = list(df["overall"])

# Shuffle the data.

filtered_reviews, labels = zip(*random.sample(list(zip(filtered_reviews,labels)), len(filtered_reviews)))
filtered_reviews = list(filtered_reviews)
labels = list(labels)

# Train-test split (80:20)

trainsize = int(len(filtered_reviews)*0.8)

train_reviews, train_labels = filtered_reviews[:trainsize], labels[:trainsize]
test_reviews, test_labels = filtered_reviews[trainsize:], labels[trainsize:]
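
The shuffle above is unseeded, so each run of the notebook produces a different train-test split. If a repeatable split is wanted, one option is a seeded random generator. The following is a minimal sketch under that assumption; the helper name `shuffle_and_split` and the seed value 42 are arbitrary choices, not taken from the notebook.

```python
import random

def shuffle_and_split(texts, labels, train_fraction=0.8, seed=42):
    """Shuffle (text, label) pairs together with a fixed seed, then split them."""
    pairs = list(zip(texts, labels))
    rng = random.Random(seed)   # local generator: leaves the global random state untouched
    rng.shuffle(pairs)
    cut = int(len(pairs) * train_fraction)
    train_x = [text for text, _ in pairs[:cut]]
    train_y = [label for _, label in pairs[:cut]]
    test_x = [text for text, _ in pairs[cut:]]
    test_y = [label for _, label in pairs[cut:]]
    return train_x, train_y, test_x, test_y

# Toy data: ten "reviews" with ten labels
train_x, train_y, test_x, test_y = shuffle_and_split(list("abcdefghij"), list(range(10)))
print(len(train_x), len(test_x))
```

Shuffling the zipped pairs keeps every review attached to its own label, just as the `zip(*random.sample(...))` line does.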

In [40]:
# Check the distribution of star ratings (1-5) in the training set

train_labels_arr = np.array(train_labels)  # convert once instead of once per print

print("TRAINING SET")
print("Number of reviews to use for training is: {}.".format(len(train_labels)))
for stars in range(5, 0, -1):
  count = int((train_labels_arr == stars).sum())
  print("Number of {}-star reviews is: {} (or {:.1f}%).".format(stars, count, 100*count/len(train_labels)))

TRAINING SET
Number of reviews to use for training is: 113593.
Number of 5-star reviews is: 69085 (or 60.8%).
Number of 4-star reviews is: 26784 (or 23.6%).
Number of 3-star reviews is: 10855 (or 9.6%).
Number of 2-star reviews is: 3925 (or 3.5%).
Number of 1-star reviews is: 2944 (or 2.6%).


The class imbalance in the training set above can bias the model towards the more common classes: for example, the model may predict 5 stars far more often than any other score. We will see later how to mitigate this.
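
To give a sense of the scale of the imbalance, the sketch below computes inverse-frequency class weights from the counts printed above, using the common "balanced" heuristic $w_c = N / (K \cdot n_c)$, where $N$ is the number of training reviews, $K$ the number of classes and $n_c$ the count for class $c$. This is one standard mitigation, shown purely as an illustration; it is not necessarily the approach taken later in this project, and the function name `balanced_class_weights` is hypothetical.

```python
# Training-set counts per star rating, copied from the output above
counts = {5: 69085, 4: 26784, 3: 10855, 2: 3925, 1: 2944}

def balanced_class_weights(class_counts):
    """Weight each class by total / (number_of_classes * class_count)."""
    total = sum(class_counts.values())
    n_classes = len(class_counts)
    return {label: total / (n_classes * n) for label, n in class_counts.items()}

weights = balanced_class_weights(counts)
for stars in sorted(weights, reverse=True):
    print("{}-star weight: {:.3f}".format(stars, weights[stars]))
```

Rare classes receive larger weights, so each class contributes equally to a weighted loss in aggregate. Keras accepts a dictionary of this kind through the `class_weight` argument of `fit`, keyed by integer class index, so the star ratings would first need mapping to indices 0 to 4.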

The Keras `Tokenizer` filters out full stops (along with most other punctuation) by default, so "missing." and "missing" will be treated as the same token. Now for the tokenisation:

In [None]:
tokenizer = Tokenizer(num_words = 100)
tokenizer.fit_on_texts(train_reviews)
word_index = tokenizer.word_index
print(word_index)
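
To make the handling of full stops concrete without requiring TensorFlow, here is a pure-Python approximation of what the tokenizer does with its default settings: lowercase the text, replace filtered characters with spaces, split on whitespace, and rank words by frequency starting from index 1. The `FILTERS` string is an assumed copy of the Keras default, and the helper names `to_word_sequence` and `build_word_index` are hypothetical; this is a sketch of the idea, not the library's implementation.

```python
from collections import Counter

# Assumed copy of the Keras Tokenizer default `filters` string (note that it includes ".")
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def to_word_sequence(text):
    """Lowercase, replace filtered characters with spaces, then split on whitespace."""
    table = str.maketrans({ch: " " for ch in FILTERS})
    return text.lower().translate(table).split()

def build_word_index(texts):
    """Map each word to its frequency rank, most frequent first, starting at 1."""
    counts = Counter(word for text in texts for word in to_word_sequence(text))
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

sample = ["book missing. left wanting", "good book", "missing book"]
print(to_word_sequence("rest book missing."))
print(build_word_index(sample))
```

In Keras itself, `word_index` lists every word seen during fitting regardless of `num_words`; the `num_words` limit is only applied later, when texts are converted to sequences.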

# Model Building and Evaluation

# Model Application

# Model Extensions