## ANLP Assignment 1: Sentiment Analysis
### Christopher Hamilton,  a1766121

In [None]:
import json
import os

import pandas as pd
import numpy as np

### 1. Reading dataset and initial pre-processing

In [None]:
def read_json_to_df(file_name):
    data = []
    with open(file_name) as data_file:
        for line in data_file:
            # Load each line of the JSON file as a dictionary
            data.append(json.loads(line))

    # Form a Pandas DataFrame from the dictionaries
    return pd.json_normalize(data)

# Load the training and test data
raw_train_df = read_json_to_df("hotel_reviews_train.json")
raw_test_df = read_json_to_df("hotel_reviews_test.json")
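
The dotted column name `ratings.overall` used in the following cells comes from `pd.json_normalize`, which flattens nested dictionaries into dot-separated column names. As a rough illustration of that flattening (not the pandas implementation), the hypothetical helper `flatten_record` below does the same for a single made-up review record; the `service` field is an assumption used only for the example.

```python
import json

def flatten_record(record, parent_key="", sep="."):
    """Flatten nested dictionaries into a single level with dotted keys."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested dictionaries, carrying the dotted prefix along
            flat.update(flatten_record(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat

# A made-up line in the same JSON-lines style as the review files
sample_line = '{"title": "Lovely stay", "ratings": {"overall": 5.0, "service": 4.0}}'
flat_review = flatten_record(json.loads(sample_line))
print(flat_review)
```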

In [None]:
# Print out the initially loaded dataframes
raw_train_df.head()

In [None]:
raw_test_df.head()

In [None]:
# Select the title, text and overall rating columns to make a new dataframe
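# Copy the selection so that later column assignments do not act on a view of the raw DataFrame
train_df = raw_train_df[["title", "text", "ratings.overall"]].copy()
test_df = raw_test_df[["title", "text", "ratings.overall"]].copy()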

# Check the value counts for the ratings
print("Training data ratings")
print(train_df["ratings.overall"].value_counts())

print()

print("Test data ratings")
print(test_df["ratings.overall"].value_counts())

In [None]:
# Find indices of rows where the rating is 0
zero_rating_indices = test_df[test_df['ratings.overall'] == 0].index
for index in zero_rating_indices:
    # Print the text corresponding to the zero rating
    print(test_df['text'][index])

In [None]:
# Based on the above text, it is unlikely the reviewer meant to give a low rating
# Instead, we will remove the 0 rating from the dataset
test_df = test_df.drop(zero_rating_indices)

In [None]:
# Check the value counts for the ratings after the 0 rating has been removed
print("Test data ratings")
print(test_df["ratings.overall"].value_counts())

Python's lambda functions can be used to remove the special characters from the dataset. A Pandas DataFrame column has an `apply` method that calls a given function on each cell in the column. Passing a lambda function that keeps only alphanumeric characters and whitespace removes the special characters from the dataset (Saturn Cloud 2024).

At the same time, calling `lower()` on each retained character converts all the text to lowercase. The result can be checked by viewing the first few rows of each DataFrame with the `head()` function.

In [None]:
# Remove non-alphanumeric characters from the title and text columns
train_df.loc[:, 'title'] = train_df['title'].apply(lambda x: ''.join(char.lower() for char in x if char.isalnum() or char.isspace()))
train_df.loc[:, 'text'] = train_df['text'].apply(lambda x: ''.join(char.lower() for char in x if char.isalnum() or char.isspace()))

test_df.loc[:, 'title'] = test_df['title'].apply(lambda x: ''.join(char.lower() for char in x if char.isalnum() or char.isspace()))
test_df.loc[:, 'text'] = test_df['text'].apply(lambda x: ''.join(char.lower() for char in x if char.isalnum() or char.isspace()))
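
To make the effect of this filter concrete, the same expression can be applied to a couple of made-up strings. This is only a small sketch; `clean_text` is a hypothetical name for the lambda body used above.

```python
def clean_text(text):
    """Keep only alphanumeric characters and whitespace, in lowercase."""
    return ''.join(char.lower() for char in text if char.isalnum() or char.isspace())

# Made-up review fragments
cleaned_review = clean_text("Great Hotel!!! 5* stay.")
cleaned_negation = clean_text("We didn't like it")
print(cleaned_review)
print(cleaned_negation)
```

Note that apostrophes are dropped as well, so a contraction such as "didn't" becomes "didnt". Later steps that match on the original contraction, for example a stop word list containing "didn't", may therefore no longer match it.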

In [None]:
train_df.head()

In [None]:
test_df.head()

The provided `language_filter.py` file includes an example of using the `langdetect` Python package to keep only English text. Rather than filtering for English reviews while reading the file, the filter can be applied to the loaded DataFrames using a similar method to the one above. By using the Pandas `apply` method on the text and title columns, the returned DataFrame will only include rows where both the title and text are in English, as determined by the `langdetect` package.

In [None]:
from langdetect import detect as detect_language

def filter_english_reviews(df):
    def is_english(text):
        try:
            return detect_language(text) == "en"
        except Exception:
            # Text that cannot be classified (e.g. an empty string) is treated as non-English
            return False

    # Filter the DataFrame for reviews where both title and text are in English
    return df[df['text'].apply(is_english) & df['title'].apply(is_english)]

Because language detection takes some time over the whole dataset, the filtered DataFrames are saved to CSV and reloaded on later runs to save time during development. These DataFrames do not change and all earlier preprocessing steps are the same, so running the language filter each time is not necessary. The code below checks whether the files have already been saved; if so it loads them, otherwise it runs the language filter and saves the results for later.

In [None]:
# Save the English reviews to a CSV file to save time filtering when running again (NumFOCUS, Inc. 2024)
if os.path.exists("english_hotel_reviews_train.csv"):
    train_df = pd.read_csv("english_hotel_reviews_train.csv")
else:
    train_df = filter_english_reviews(train_df)
    train_df.to_csv("english_hotel_reviews_train.csv", index=False)

if os.path.exists("english_hotel_reviews_test.csv"):
    test_df = pd.read_csv("english_hotel_reviews_test.csv")
else:
    test_df = filter_english_reviews(test_df)
    test_df.to_csv("english_hotel_reviews_test.csv", index=False)
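
One detail worth keeping in mind with this caching approach is that a CSV round trip does not always return exactly the same values. By default, `pd.read_csv` reads empty fields back as `NaN`, so a title that became an empty string during cleaning would come back as a float rather than a string. The sketch below uses a made-up two-row frame and in-memory buffers to show the default behaviour, and the `keep_default_na=False` option as one possible way to avoid it.

```python
import io
import pandas as pd

# Made-up frame in which one title became empty after cleaning
df = pd.DataFrame({"title": ["", "great stay"], "text": ["ok", "loved it"]})

buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

# Default behaviour: the empty field is read back as NaN
default_df = pd.read_csv(io.StringIO(csv_text))

# keep_default_na=False keeps empty fields as empty strings
kept_df = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)

print(default_df["title"].isna().tolist())
print(kept_df["title"].tolist())
```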

In [None]:
print(train_df.info())

In [None]:
print(test_df.info())

### 2. Exploratory Data Analysis (EDA)

In [None]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')

In [None]:
import matplotlib.pyplot as plt

# Plot distribution of ratings
train_df['ratings.overall'].value_counts().sort_index().plot(kind='bar', figsize=(8,5), color='skyblue')

plt.xlabel("Rating")
plt.ylabel("Count")
plt.title("Distribution of Ratings")
plt.show()

In [None]:
import matplotlib.pyplot as plt

# Plot distribution of ratings
test_df['ratings.overall'].value_counts().sort_index().plot(kind='bar', figsize=(8,5), color='green')

plt.xlabel("Rating")
plt.ylabel("Count")
plt.title("Distribution of Ratings")
plt.show()

The distribution of the ratings can be plotted on a bar chart for both the training and test data. From the charts above, it is clear that most of the ratings for the hotels listed by the hotel booking company are positive, with a similar distribution of ratings across the training and test sets.

Based on the code provided as part of Workshop 2, the predictive and non-predictive words in the dataset can be found by correlating each word's TF-IDF (Term Frequency-Inverse Document Frequency) weight with the rating (Feature Engineering 2025). Words with a correlation close to 0 have very little effect on the prediction, whereas words with a strongly positive correlation are associated with higher ratings and words with a strongly negative correlation are associated with lower ratings.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
tf_idf_train = vectorizer.fit_transform(train_df["text"])

# Convert to DataFrame
tfidf_df = pd.DataFrame(tf_idf_train.toarray(), columns=vectorizer.get_feature_names_out())

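# Find the correlations with the ratings
# Reset the index so the ratings align row-by-row with tfidf_df, which is indexed 0..n-1
ratings = train_df["ratings.overall"].reset_index(drop=True)
correlations = tfidf_df.corrwith(ratings)
correlations = correlations.sort_values(ascending=False)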

# Find 10 words with the weakest correlation by sorting
non_predictive_words = correlations.sort_values(key=lambda x: np.abs(x))
print("Non-Predictive Words:\n", non_predictive_words.head(10))

# Display top 10 positive and negative correlated words
print("Most Positive Words:\n", correlations.head(10))
print("\nMost Negative Words:\n", correlations.tail(10))
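
The correlation computed by `corrwith` is the Pearson correlation by default. As a small worked sketch of what that number means for a single word, the example below computes it by hand with `numpy` for three words over five made-up reviews; the words, weights, and ratings are all invented for illustration.

```python
import numpy as np

# Made-up TF-IDF weights for three words across five reviews (rows = reviews)
words = ["excellent", "dirty", "room"]
weights = np.array([
    [0.9, 0.0, 0.3],
    [0.7, 0.0, 0.2],
    [0.0, 0.1, 0.3],
    [0.0, 0.6, 0.2],
    [0.0, 0.8, 0.3],
])
ratings = np.array([5, 4, 3, 2, 1])

def pearson(x, y):
    """Pearson correlation coefficient between two 1-D arrays."""
    x_centred = x - x.mean()
    y_centred = y - y.mean()
    return float((x_centred @ y_centred) / np.sqrt((x_centred @ x_centred) * (y_centred @ y_centred)))

word_correlations = {word: pearson(weights[:, i], ratings) for i, word in enumerate(words)}
for word, r in word_correlations.items():
    print(f"{word}: {r:+.2f}")
```

In this toy data, the word that appears only in highly rated reviews has a strongly positive correlation, the word that appears only in poorly rated reviews has a strongly negative one, and the word that appears evenly across all ratings has a correlation of approximately zero.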

In order to find the number of unique words, the text can be converted into a list of tokens, and the number of unique tokens can then easily be found with `numpy`. Given that the data to be used for classification into the ratings is the textual review data, the title and text columns can be combined into a single text column. To make analysis simpler, the overall rating column can also be renamed to just rating. At this stage the stop words are also removed from the dataset.

In [None]:
from nltk.corpus import stopwords

# Create a column with the title and text together
train_df["combined_text"] = train_df["title"] + " " + train_df["text"]
test_df["combined_text"] = test_df["title"] + " " + test_df["text"]

train_df = train_df.drop(columns=["title", "text"])
test_df = test_df.drop(columns=["title", "text"])
train_df = train_df.rename(columns={"ratings.overall": "rating", "combined_text": "text"})
test_df = test_df.rename(columns={"ratings.overall": "rating", "combined_text": "text"})

stop_words = set(stopwords.words('english'))
train_df["text"] = train_df["text"].apply(lambda text: ' '.join([word for word in text.split(' ') if word not in stop_words]))
test_df["text"] = test_df["text"].apply(lambda text: ' '.join([word for word in text.split(' ') if word not in stop_words]))

# Split all reviews into words and find unique ones
all_words_text = np.concatenate(train_df.text.apply(nltk.word_tokenize).to_numpy())

unique_words = np.unique(all_words_text)

print("Total Unique Words:", len(unique_words))

In [None]:
train_df.head()

In [None]:
test_df.head()

The most frequent words in the dataset can be plotted on a bar chart. Stop words are excluded from this analysis so that the chart is not dominated by very common words such as 'the' or 'is'.

In [None]:
from collections import Counter
import matplotlib.pyplot as plt

tokens = [word for word in all_words_text if word not in stop_words]
word_freq = Counter(tokens)

plt.figure(figsize=(12, 5))
plt.bar(*zip(*word_freq.most_common(20)))
plt.xlabel("Words")
plt.ylabel("Frequency")
plt.title("Word Frequency Plot")
plt.show()


The most common trigrams in the dataset can give insight into the phrases that are commonly used in the reviews (Exploratory Data Analysis 2025). These sequences can be counted and listed, as well as plotted on a chart for viewing.

In [None]:
from nltk import ngrams
from collections import Counter
import matplotlib.pyplot as plt

# Function to generate n-grams
def generate_ngrams(text, n):
    n_grams = ngrams(text, n)
    return [' '.join(gram) for gram in n_grams]

# Specify the value of n for n-grams
n_value = 3

# Generate n-grams
ngrams_list = generate_ngrams(tokens, n_value)

# Count the occurrences of each n-gram
ngrams_count = Counter(ngrams_list)
most_common_ngrams = ngrams_count.most_common(100)

# Display the distribution
print(f"Distribution of {n_value}-grams:")
for ngram, count in most_common_ngrams:
    print(f"{ngram}: {count}")

# Plot the distribution
labels, values = zip(*most_common_ngrams)
indexes = range(len(labels))

plt.figure(figsize=(20, 10))
plt.bar(indexes, values)
plt.xlabel(f'{n_value}-grams')
plt.ylabel('Frequency')
plt.xticks(indexes, labels, rotation='vertical')
plt.title(f'Distribution of {n_value}-grams')
plt.show()
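
Because `tokens` is a single list built from all reviews joined together, some of the trigrams counted above span the boundary between two neighbouring reviews. A possible alternative, sketched below on two made-up tokenised reviews, is to generate the n-grams one review at a time and accumulate the counts. `review_ngrams` is a hypothetical helper, not part of NLTK.

```python
from collections import Counter

def review_ngrams(tokens, n):
    """Return the n-grams of a single tokenised review as space-joined strings."""
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Two made-up tokenised reviews
reviews = [
    ["front", "desk", "staff", "friendly"],
    ["front", "desk", "staff", "rude"],
]

# Count n-grams review by review so that no n-gram crosses a review boundary
per_review_counts = Counter()
for review_tokens in reviews:
    per_review_counts.update(review_ngrams(review_tokens, 3))

# Flattening first, as a single token list, also produces n-grams that cross the boundary
flat_tokens = [token for review_tokens in reviews for token in review_tokens]
flat_counts = Counter(review_ngrams(flat_tokens, 3))

print(per_review_counts.most_common(3))
print(sorted(set(flat_counts) - set(per_review_counts)))
```

The second line printed lists the trigrams that only exist because the two reviews were joined end to end.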

### 3. Selection and training Machine Learning models

When training machine learning models, the dataset should be balanced to ensure that there is no bias towards any one category. In the training dataset, there are more positive reviews than negative, and as a result the trained model may become biased towards classifying text positively. To address this, oversampling can be used to create a training dataset that includes an equal number of reviews for each category (Income Evaluation Notebook 2025).

In [None]:
# Balance the training data by oversampling
def balance_data_oversample(df):
    max_count = df['rating'].value_counts().max()
    balanced_df = pd.DataFrame()

    for rating in df['rating'].unique():
        rating_df = df[df['rating'] == rating]
        balanced_df = pd.concat([balanced_df, rating_df.sample(max_count, replace=True)])

    return balanced_df
balanced_train_df = balance_data_oversample(train_df)

# Plot distribution of ratings
balanced_train_df['rating'].value_counts().sort_index().plot(kind='bar', figsize=(8,5), color='skyblue')

plt.xlabel("Rating")
plt.ylabel("Count")
plt.title("Distribution of Ratings")
plt.show()
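
One consideration with oversampling with replacement is the order of operations. If the balanced data is split into training and validation sets afterwards, copies of the same review can end up on both sides of the split, which may make validation scores look better than they are. A possible variation, sketched below in plain Python on made-up `(text, rating)` pairs, is to hold out the validation rows first and oversample only the training portion. The helper name `oversample` and the data are assumptions for illustration.

```python
import random
from collections import Counter

def oversample(rows, seed=0):
    """Resample (text, rating) rows with replacement so every rating is equally common."""
    rng = random.Random(seed)
    by_rating = {}
    for row in rows:
        by_rating.setdefault(row[1], []).append(row)
    # Every rating is resampled up to the size of the largest group
    target = max(len(group) for group in by_rating.values())
    balanced = []
    for group in by_rating.values():
        balanced.extend(rng.choices(group, k=target))
    return balanced

# Made-up (text, rating) pairs with an imbalance towards rating 5
rows = [(f"review {i}", 5) for i in range(8)] + [(f"review {i}", 1) for i in range(8, 12)]

# Hold out validation rows first, then oversample only the training portion
rng = random.Random(42)
shuffled = rows[:]
rng.shuffle(shuffled)
val_rows, train_rows = shuffled[:3], shuffled[3:]
balanced_train_rows = oversample(train_rows)

print(Counter(rating for _, rating in balanced_train_rows))
print(set(balanced_train_rows) & set(val_rows))
```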

The text is already in lowercase and stop words have been removed from the dataset. To prepare the data for machine learning, the text can be lemmatised. Lemmatisation is one method for reducing words to their base forms, and it can be included in the preprocessing of data before a machine learning technique is applied to improve results (Murel & Kavlakoglu 2023).

In [None]:
# Lemmatize the text
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()
balanced_train_df.loc[:, 'text'] = balanced_train_df['text'].apply(lambda x: ' '.join(lemmatizer.lemmatize(word) for word in x.split()))
test_df.loc[:, 'text'] = test_df['text'].apply(lambda x: ' '.join(lemmatizer.lemmatize(word) for word in x.split()))

The classical machine learning method that will be used in this experiment is Multinomial Naive Bayes. This classification algorithm "simplifies the process of classifying text by assuming that the presence of one word doesn’t depend on others", which "makes it computationally efficient and reliable for a range of tasks" (Sriram 2024). In order to train the Multinomial Naive Bayes classifier, the data must be arranged into a training and validation set.
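
As a reminder of what that independence assumption means in practice, the standard Multinomial Naive Bayes decision rule can be written as follows. Here $n_w$ is the count of word $w$ in the review being classified, $N_{wc}$ is the count of $w$ across the training reviews of class $c$, $N_c = \sum_w N_{wc}$, $|V|$ is the vocabulary size, and $\alpha$ is the smoothing parameter (scikit-learn's `MultinomialNB` uses $\alpha = 1$ by default).

$$
\hat{c} = \arg\max_{c} \left[ \log P(c) + \sum_{w \in V} n_w \log P(w \mid c) \right],
\qquad
P(w \mid c) = \frac{N_{wc} + \alpha}{N_c + \alpha |V|}
$$

Each word contributes its own term to the sum independently of every other word, which is why word counts from a `CountVectorizer` are a natural input for this classifier.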

The scikit-learn Python module includes a function to automatically split a dataset into a training and testing set, or a training and validation set. For the training in this experiment, 80% of the data will be used for training and 20% will be used for validation.

In [None]:
from sklearn.model_selection import train_test_split

X_res = balanced_train_df["text"]
y_res = balanced_train_df["rating"]

X_train, X_val, y_train, y_val = train_test_split(X_res, y_res, test_size=0.2, shuffle=True)

In [None]:
# (Feature Engineering 2025)
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
X_train_vectors = vectorizer.fit_transform(X_train)
X_val_vectors = vectorizer.transform(X_val)

In [None]:
from sklearn.model_selection import cross_val_score

#### Multinomial Naive Bayes

In [None]:
from sklearn.naive_bayes import MultinomialNB

classifier = MultinomialNB()

In [None]:
# (Income Evaluation Notebook 2025)
nb_accuracies = cross_val_score(classifier, X_train_vectors, y_train, cv=5)
classifier.fit(X_train_vectors, y_train)
print(f"Naive Bayes Train Score: {round(np.mean(nb_accuracies) * 100, 2)}%")

In [None]:
naive_bayes_score = classifier.score(X_val_vectors, y_val)
print(f"Naive Bayes Validation Score: {round(naive_bayes_score * 100, 2)}%")

After training the Multinomial Naive Bayes classifier on the training data and measuring its accuracy on the validation data, the classifier appears to perform quite well. The accuracy percentages are shown above, and this model could also be evaluated on the test data. However, a deep learning model should also be trained to determine how well it performs.

To do this, TensorFlow and Keras will be used. Some extra configuration is needed for TensorFlow to make use of the GPU without encountering memory issues, as shown below.

In [None]:
import tensorflow as tf

# Limit GPU memory usage
gpus = tf.config.list_physical_devices('GPU')
if gpus:
    try:
        for gpu in gpus:
            tf.config.set_logical_device_configuration(
                gpu,
                [tf.config.LogicalDeviceConfiguration(memory_limit=(6 * 1024))])
        logical_gpus = tf.config.list_logical_devices('GPU')
        print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
    except RuntimeError as e:
        print(e)

Since the problem to be solved is to classify text data into one of 5 rating categories, it may make sense to use a classification model. However, the problem is also to understand how reliable the ratings are, and therefore it may be useful to understand how different the model's prediction is compared to the actual rating.

To do this, a regression model will be used. The same text that the Multinomial Naive Bayes algorithm was trained on will be used for training the regression model, and as outlined by Poliak, GloVe (Global Vectors for Word Representation) embeddings can be used to represent the words in the text for the machine learning model (2020).

In [None]:
train_Y = balanced_train_df["rating"]

test_Y = test_df["rating"]

In [None]:
import requests
import zipfile

# Store the GloVe files in a directory in this repository
glove_dir = '../glove'
if not os.path.exists(glove_dir):
    os.makedirs(glove_dir)

glove_url = "http://nlp.stanford.edu/data/glove.6B.zip"
glove_zip_path = os.path.join(glove_dir, "glove.6B.zip")

# Download the GloVe file
if not os.path.exists(glove_zip_path):
    print("Downloading GloVe embeddings...")
    # (Reitz 2016)
    response = requests.get(glove_url, stream=True)
    with open(glove_zip_path, "wb") as f:
        for chunk in response.iter_content(chunk_size=1024):
            if chunk:
                f.write(chunk)
    print("Download complete.")

# Extract the GloVe file
if not os.path.exists(os.path.join(glove_dir, "glove.6B.100d.txt")):
    print("Extracting GloVe embeddings...")
    with zipfile.ZipFile(glove_zip_path, "r") as zip_ref:
        zip_ref.extractall(glove_dir)
    print("Extraction complete.")

# (Poliak 2020)
# Read each GloVe line as a word followed by its vector coefficients
embedding_index = {}
with open(os.path.join(glove_dir, 'glove.6B.100d.txt'), encoding='utf8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embedding_index[word] = coefs
print('Found %s word vectors ' % len(embedding_index))

In [None]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# (Poliak 2020)
tokenizer=Tokenizer(oov_token="'oov'")
tokenizer.fit_on_texts(balanced_train_df['text'])

max_words = len(tokenizer.word_index) + 1
embedding_dim = 100
embedding_matrix = np.zeros((max_words,embedding_dim))

for word, idx in tokenizer.word_index.items():
    embedding_vector = embedding_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[idx]=embedding_vector

maxlen = 200
train_X = pad_sequences(tokenizer.texts_to_sequences(balanced_train_df['text']), maxlen=maxlen)
test_X = pad_sequences(tokenizer.texts_to_sequences(test_df['text']), maxlen=maxlen)
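
With this construction, any word in the tokenizer's vocabulary that has no GloVe vector keeps an all-zero row in the embedding matrix. It can be useful to check how much of the vocabulary is actually covered. The sketch below repeats the same loop on a made-up three-dimensional embedding index and word index (all names prefixed with `toy_` are assumptions for illustration) and reports the words left without a vector.

```python
import numpy as np

# Made-up 3-dimensional "pretrained" vectors standing in for GloVe
toy_embedding_index = {
    "hotel": np.array([0.1, 0.2, 0.3], dtype="float32"),
    "clean": np.array([0.4, 0.5, 0.6], dtype="float32"),
}
# Made-up word index in the style of Tokenizer.word_index (index 0 is reserved for padding)
toy_word_index = {"'oov'": 1, "hotel": 2, "clean": 3, "didnt": 4}

toy_dim = 3
toy_matrix = np.zeros((len(toy_word_index) + 1, toy_dim))
for word, idx in toy_word_index.items():
    vector = toy_embedding_index.get(word)
    if vector is not None:
        toy_matrix[idx] = vector

# Words without a pretrained vector keep an all-zero row
missing_words = [w for w, idx in toy_word_index.items() if not toy_matrix[idx].any()]
coverage = 1 - len(missing_words) / len(toy_word_index)
print(missing_words, f"coverage={coverage:.2f}")
```

The same check could be run against `tokenizer.word_index` and `embedding_matrix` to see how many review words fall back to a zero vector.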

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Bidirectional

def create_regression_model():
    # Define a regression model
    model=Sequential()
    model.add(Embedding(max_words, embedding_dim, weights=[embedding_matrix], trainable=False))
    model.add(Bidirectional(LSTM(8)))
    model.add(Dense(4, activation="relu"))
    model.add(Dense(1, activation="linear"))

    return model

In [None]:
model = create_regression_model()

# Compile the model
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.0001), loss='mean_squared_error')
model.build(train_X.shape)
print(model.summary())

# Train the model with 20% of data used for validation
history = model.fit(
    train_X,
    train_Y,
    epochs=25,
    batch_size=256,
    validation_split=0.2,
)

In [None]:
from matplotlib import pyplot as plt

# Plot the training history
plt.figure(figsize=(12, 5))
plt.plot(history.history['loss'], label='Train MSE')
plt.plot(history.history['val_loss'], label='Validation MSE')
plt.xlabel('Epoch')
plt.ylabel('MSE')
plt.title('Model Training')
plt.legend()
plt.show()

As shown in the training and the graph above, the model was trained successfully, with both the training loss and validation loss decreasing over the time spent training. Testing will need to be done with this model for further analysis.

### 4. Experiment with VADER sentiment lexicon

In [None]:
import numpy as np

import nltk
nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer

def include_sentiment_analysis(df):
    df2 = df.copy()
    # Use the combined title and text column as the input to VADER
    text_data = df2["text"].to_numpy()

    # Create target vector for VADER. Define a rating of 4 or 5 to be positive, 1 or 2 to be negative and 3 to be neutral
    y = df2["rating"].apply(lambda x: "positive" if x > 3 else ("negative" if x < 3 else "neutral")).tolist()

    # Analyse with VADER
    analyser = SentimentIntensityAnalyzer()
    correct_predictions = 0
    sentiments = []

    # (VADER Sentiment Example 2025)
    for text, actual in zip(text_data, y):
        score = analyser.polarity_scores(text)
        sentiment = "neutral"
        # Classify the sentiment based on the compound score from the analyser
        if score['compound'] > 0.05:
            sentiment = "positive"
        elif score['compound'] < -0.05:
            sentiment = "negative"

        # Compare the predicted sentiment with the actual sentiment for the same row
        if sentiment == actual:
            correct_predictions += 1
        sentiments.append(sentiment)

    # Add the predicted sentiment to the copied DataFrame in a new column
    df2["VADER_Sentiment"] = sentiments

    print(f"VADER accuracy: {round(correct_predictions/len(text_data) * 100, 2)}%")
    return df2


In [None]:
balanced_train_df2 = include_sentiment_analysis(balanced_train_df)
train_X = pad_sequences(tokenizer.texts_to_sequences(balanced_train_df2['text']), maxlen=maxlen)
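# Create a training set with the VADER sentiment appended as a final column: -1 if negative, 0 if neutral and 1 if positive
# Note: this column is passed through the Embedding layer with the token indices, and that layer expects non-negative integer indices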
train_X = np.concatenate((train_X, np.array(balanced_train_df2["VADER_Sentiment"].apply(lambda x: 1 if x == "positive" else (-1 if x == "negative" else 0)).tolist()).reshape(-1, 1)), axis=1)

In order to make use of the VADER sentiment analysis in this experiment, an assumption is made that the ratings which are rated higher would have more positive text, and lower ratings would have more negative text. However, after running the VADER sentiment analysis code over the training data, only 54.1% of the training data was classified correctly by VADER into positive, negative, or neutral, where positive was equivalent to ratings of 4 or 5, neutral was equivalent to a rating of 3, and negative was equivalent to a rating of 1 or 2.

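This may indicate that the ratings in the dataset are not fully reliable, since it is unlikely that positive wording in a review would accompany a low score, and vice versa. It may also partly reflect the preprocessing: VADER was run on text that had already been lowercased and stripped of punctuation and stop words, and NLTK's English stop word list includes negations such as 'not' and 'no'. The VADER sentiment was nevertheless added to the training dataset so that the regression model could train with it as an additional input. Each sentiment was assigned a numerical value: 1 for positive text, 0 for neutral and -1 for negative.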
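
A single accuracy figure does not show which of the three classes VADER disagrees with most often. As a rough sketch of how the comparison could be broken down, the example below applies the same rating and compound-score thresholds to a handful of made-up `(rating, compound)` pairs and counts each combination of actual and predicted class. The function names and the scores are assumptions for illustration; real scores would come from `polarity_scores`.

```python
from collections import Counter

def rating_to_label(rating):
    """Map a 1-5 rating to the three classes used above."""
    if rating > 3:
        return "positive"
    if rating < 3:
        return "negative"
    return "neutral"

def compound_to_label(compound, threshold=0.05):
    """Map a VADER compound score to a class using the +/-0.05 thresholds."""
    if compound > threshold:
        return "positive"
    if compound < -threshold:
        return "negative"
    return "neutral"

# Made-up (rating, compound score) pairs
examples = [(5, 0.91), (4, 0.42), (3, 0.65), (3, 0.01), (2, 0.30), (1, -0.77)]

# Count every (actual class, predicted class) combination
confusion = Counter(
    (rating_to_label(rating), compound_to_label(compound))
    for rating, compound in examples
)
correct = sum(count for (actual, predicted), count in confusion.items() if actual == predicted)
toy_accuracy = correct / len(examples)

for (actual, predicted), count in sorted(confusion.items()):
    print(f"actual={actual:<8} predicted={predicted:<8} count={count}")
print(f"accuracy={toy_accuracy:.2f}")
```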

In [None]:
vader_model = create_regression_model()

vader_model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.0001), loss='mean_squared_error')
vader_model.build(train_X.shape)

# Train the model
history = vader_model.fit(
    train_X,
    train_Y,
    epochs=25,
    batch_size=256,
    validation_split=0.2,
)

In [None]:
from matplotlib import pyplot as plt

# Plot the training history
plt.figure(figsize=(12, 5))
plt.plot(history.history['loss'], label='Train MSE')
plt.plot(history.history['val_loss'], label='Validation MSE')
plt.xlabel('Epoch')
plt.ylabel('MSE')
plt.title('Model Training')
plt.legend()
plt.show()

### 5. Final testing on test set and discussion of results

In [None]:
# Predict the ratings for the test set and check the value compared to the actual ratings
predictions = model.predict(test_X)

# Calculate the mean squared error
mse = np.mean((predictions.flatten() - test_Y.to_numpy().flatten())**2)
print(f"Mean Squared Error: {mse:.2f}")

# Round to the nearest whole number for the prediction
predictions = np.round(predictions).astype(int)

correct_predictions = np.sum(predictions.flatten() == test_Y.to_numpy().flatten())
total_predictions = len(predictions)
accuracy = correct_predictions / total_predictions
print(f"Model Accuracy: {accuracy:.2f}")
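
Because the final layer of the regression model is linear, its raw output is not restricted to the range 1 to 5, so a rounded prediction could in principle be 0 or 6. One possible refinement, sketched below with `numpy` on made-up predictions, is to clip the rounded values to the valid range and to report the share of predictions within one star of the actual rating alongside the exact-match accuracy.

```python
import numpy as np

# Made-up raw regression outputs and true ratings
raw_predictions = np.array([5.6, 4.4, 2.6, 0.3, 3.2, 1.4])
true_ratings = np.array([5, 5, 3, 1, 5, 1])

# Mean squared error on the raw (unrounded) outputs
toy_mse = np.mean((raw_predictions - true_ratings) ** 2)

# Round, then clip so that every prediction is a valid rating between 1 and 5
rounded = np.clip(np.round(raw_predictions), 1, 5).astype(int)

exact_accuracy = np.mean(rounded == true_ratings)
within_one_accuracy = np.mean(np.abs(rounded - true_ratings) <= 1)
print(rounded.tolist(), f"mse={toy_mse:.3f}", exact_accuracy, within_one_accuracy)
```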

In [None]:
# Show the predictions which were incorrect by more than 1
actual = test_Y.to_numpy().flatten()
predicted = predictions.flatten()
incorrect_positions = np.flatnonzero(np.abs(predicted - actual) > 1)
print("Incorrect Predictions:")
# Index by position, since the DataFrame index labels may not run from 0 to n-1
for pos in incorrect_positions:
    row = test_df.iloc[pos]
    print(f"Text: {row['text']}")
    print(f"Predicted Rating: {predicted[pos]}")
    print(f"Actual Rating: {row['rating']}")
    print("-" * 50)

# Print the number of predictions off by more than 1 compared to the total number of predictions
num_incorrect = len(incorrect_positions)
num_total = len(test_df)
print(f"Total Predictions: {num_total}")
print(f"Number of Predictions Within 1 of the Actual Rating: {num_total - num_incorrect}")
print(f"Number of Predictions Off by More Than 1: {num_incorrect}")
# Print the accuracy when a prediction within 1 of the actual rating counts as correct
accuracy = (num_total - num_incorrect) / num_total
print(f"Within-1 Accuracy: {accuracy:.2f}")

### 6. Propose a method to predict aspects

***(COMP SCI 7417 and COMP SCI 7717 only)***

### 7. Reflection on the ***Product*** development.

### 9. References

'Exploratory Data Analysis', Applied Natural Language Processing workshop 2 code files, The University of Adelaide, in Week 2, Semester 1, 2025.

'Feature Engineering', Applied Natural Language Processing workshop 2 code files, The University of Adelaide, in Week 2, Semester 1, 2025.

'VADER Sentiment Example', Applied Natural Language Processing Assignment 1 code files, The University of Adelaide, in Semester 1, 2025.

'Income Evaluation Notebook', Mining Big Data workshop 1 code files, The University of Adelaide, in Week 1, Semester 1, 2025.

malamahadevan, 2025, Step-by-Step Exploratory Data Analysis (EDA) using Python, Analytics Vidhya, viewed 24 Mar 2025, <https://www.analyticsvidhya.com/blog/2022/07/step-by-step-exploratory-data-analysis-eda-using-python/>

Murel, J & Kavlakoglu, E, 2023, What are stemming and lemmatization?, IBM, viewed 01 Apr 2025, <https://www.ibm.com/think/topics/stemming-lemmatization>

NumFOCUS, Inc., 2024, pandas.DataFrame.to_csv, pandas, viewed 29 Mar 2025, <https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html>

NumFOCUS, Inc., 2024, pandas.read_csv, pandas, viewed 29 Mar 2025, <https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html>

NumFOCUS, Inc., 2024, pandas.json_normalize, pandas, viewed 29 Mar 2025, <https://pandas.pydata.org/pandas-docs/version/1.2.0/reference/api/pandas.json_normalize.html>

Saturn Cloud, 2024, How to Remove Special Characters in Pandas Dataframe, Saturn Cloud, viewed 29 Mar 2025, <https://saturncloud.io/blog/how-to-remove-special-characters-in-pandas-dataframe/#use-lambda-function>

Sriram, 2024, Multinomial Naive Bayes Explained: Function, Advantages & Disadvantages, Applications, UpGrad, viewed 3 Apr 2025, <https://www.upgrad.com/blog/multinomial-naive-bayes-explained/>

Poliak, S, 2020, 1 to 5 Star Ratings – Classification or Regression?, towards data science, viewed 29 Mar 2025, <https://towardsdatascience.com/1-to-5-star-ratings-classification-or-regression-b0462708a4df/>

Reitz, K, 2016, Raw Response Content, Requests Documentation, viewed 29 Mar 2025, <https://requests.readthedocs.io/en/latest/user/quickstart/#raw-response-content>

### Appendix