**Dataset information:**
The dataset contains reviews of an amusement park, written by 42656 visitors. The following fields are available:
1. ReviewID, numeric and distinct code;
2. Rating, ranging from 1 (unsatisfied) to 5 (satisfied);
3. YearMonth, string, e.g. 2023-12. When the reviewer visited the theme park;
4. ReviewerLocation, string, country of origin of visitor;
5. ReviewText, text. The whole text of the visitor review;
6. Branch, string, which branch of the park. It has three branches.
The mean and maximum lengths of the ReviewText field are 129.7 and 3963 words, respectively.
The purpose of the project is to design a deep neural network model trained to predict the Rating value of reviews.

In [None]:
import re

import numpy as np
import pandas as pd
import nltk
from matplotlib import pyplot as plt
from google.colab import drive
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from keras.models import Model, Sequential
from keras.layers import Input, Dense, Dropout, BatchNormalization
from tensorflow.keras.initializers import he_normal
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

# Download the NLTK resources used for stop-word removal and lemmatization
nltk.download('stopwords')
nltk.download('wordnet')


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [None]:
shared_link = 'https://drive.google.com/file/d/1ltl0n-tBzMTrUVaB-eN_g9vmTgEDE7wL/view?usp=share_link'
# Extract the file ID from the shared link
file_id = shared_link.split('/')[-2]
# Build a direct-download link and load the CSV
download_link = f'https://drive.google.com/uc?id={file_id}'
data = pd.read_csv(download_link, encoding='latin-1')

In [None]:
print(data.head())

   Review_ID  Rating Year_Month     Reviewer_Location  \
0  670772142       4     2019-4             Australia   
1  670682799       4     2019-5           Philippines   
2  670623270       4     2019-4  United Arab Emirates   
3  670607911       4     2019-4             Australia   
4  670607296       4     2019-4        United Kingdom   

                                         Review_Text               Branch  
0  If you've ever been to Disneyland anywhere you...  Disneyland_HongKong  
1  Its been a while since d last time we visit HK...  Disneyland_HongKong  
2  Thanks God it wasn   t too hot or too humid wh...  Disneyland_HongKong  
3  HK Disneyland is a great compact park. Unfortu...  Disneyland_HongKong  
4  the location is not in the city, took around 1...  Disneyland_HongKong  


1) Fields to use

Although other fields may have a significant impact on the rating, I decided to focus solely on the review text.


In [None]:
columns_to_drop = ['Review_ID', 'Year_Month', 'Reviewer_Location', 'Branch']
data = data.drop(columns_to_drop, axis=1)  # drop the unused columns, keeping only Review_Text and Rating

In [None]:
data #Visualizing the data

Unnamed: 0,Rating,Review_Text
0,4,If you've ever been to Disneyland anywhere you...
1,4,Its been a while since d last time we visit HK...
2,4,Thanks God it wasn t too hot or too humid wh...
3,4,HK Disneyland is a great compact park. Unfortu...
4,4,"the location is not in the city, took around 1..."
...,...,...
42651,5,i went to disneyland paris in july 03 and thou...
42652,5,2 adults and 1 child of 11 visited Disneyland ...
42653,5,My eleven year old daughter and myself went to...
42654,4,"This hotel, part of the Disneyland Paris compl..."


Because encoding the full dataset exhausted the available RAM, only a small subset is used below. A subset this small will not fully capture the patterns and variation present in the complete dataset, so training is incomplete.

In [None]:
subset_size = 500  # small subset, because the size of the encoded data was exhausting the available RAM
data = data.head(subset_size)

Pre-processing

In [None]:
# Preprocess your target variable to 5 classes
data['Rating'] = data['Rating'].astype(int)  # Ensure ratings are integers
data['Rating'] -= 1  # Shift the ratings from 1-5 to 0-4

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data['Rating'] = data['Rating'].astype(int)  # Ensure ratings are integers
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data['Rating'] -= 1  # Shift the ratings from 1-5 to 0-4
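
The messages above are pandas' SettingWithCopyWarning. The most likely cause is that `data.head(subset_size)` returns a slice of the original frame and the later assignments write into that slice. A minimal sketch of one way to avoid it, assuming an explicit copy of the subset is acceptable; the small frame `df` is a made-up stand-in, not the real data:

```python
import pandas as pd

# Hypothetical stand-in for the review data (not the real dataset)
df = pd.DataFrame({"Rating": [4, 5, 3, 1], "Review_Text": ["a", "b", "c", "d"]})

# Take the subset as an independent copy rather than a slice of df
subset = df.head(2).copy()

# These assignments now act on the copy only
subset["Rating"] = subset["Rating"].astype(int)
subset["Rating"] -= 1  # shift ratings from 1-5 to 0-4

print(subset["Rating"].tolist())
print(df["Rating"].tolist())
```

In the notebook this would correspond to writing `data = data.head(subset_size).copy()` in the subsetting cell.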


In [None]:
# Build the stop-word set and the lemmatizer once, rather than on every call
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

# Function for text pre-processing
def preprocess_text(text):
    text = text.lower()  # lowercase the text
    text = re.sub(r'[^\w\s]', '', text)  # remove punctuation and special characters
    tokens = text.split()  # tokenize on whitespace
    tokens = [word for word in tokens if word not in stop_words]  # remove stop words
    tokens = [lemmatizer.lemmatize(word) for word in tokens]  # lemmatize the words
    return ' '.join(tokens)

# Applying text pre-processing to the review text
data['Review_Text'] = data['Review_Text'].apply(preprocess_text)

# Tokenizing the pre-processed text
tokenizer = Tokenizer(num_words=10000)  # limit the vocabulary to the most frequent words
tokenizer.fit_on_texts(data['Review_Text'])
text_sequences = tokenizer.texts_to_sequences(data['Review_Text'])  # convert the reviews into sequences of integers
max_sequence_length = max(len(seq) for seq in text_sequences)  # length of the longest review
X = pad_sequences(text_sequences, maxlen=max_sequence_length)  # pad so that all sequences have the same length
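
To make the padding step concrete, here is a plain-Python sketch of what `pad_sequences` does with its default settings as I understand them (zeros added at the front, and longer sequences cut from the front). The helper `pad_left` and the toy sequences are hypothetical. The second call uses a cap of 3 tokens, an assumed value that only illustrates how a fixed `maxlen` bounds the width of `X` instead of letting the longest review decide it:

```python
def pad_left(sequences, maxlen):
    """Left-pad with zeros; keep only the last `maxlen` tokens of longer sequences."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # truncate from the front if too long
        padded.append([0] * (maxlen - len(seq)) + seq)
    return padded

# Made-up integer sequences standing in for tokenized reviews
sequences = [[5, 2], [7, 1, 9, 4], [3]]

# Same choice as the notebook: pad to the length of the longest sequence
longest = max(len(seq) for seq in sequences)
full = pad_left(sequences, longest)

# Hypothetical alternative: cap the length to keep the matrix small
capped = pad_left(sequences, 3)

print(full)
print(capped)
```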


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data['Review_Text'] = data['Review_Text'].apply(preprocess_text)


Creating the RNN model for sentiment analysis

In [None]:
# Splitting the data into training and testing sets
X_train_sentiment, X_test_sentiment, y_train_sentiment, y_test_sentiment = train_test_split(X, data['Rating'], test_size=0.2, random_state=42)

In [None]:
from keras.layers import Embedding, LSTM
rnn_model = Sequential()
rnn_model.add(Embedding(input_dim=10000, output_dim=128, input_length=max_sequence_length))
rnn_model.add(LSTM(64, return_sequences=False))
rnn_model.add(Dropout(0.2))
rnn_model.add(Dense(5, activation='softmax'))  # Multi-class sentiment classification with 5 classes
rnn_model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

In [None]:
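# Get the output of the RNN model as features.
# Note: rnn_model is compiled but never fit in this notebook, so these
# outputs come from randomly initialised weights.
rnn_features = rnn_model.predict(X)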




MLP model to predict the ratings

The target variable is converted to one-hot encoded vectors with 5 classes (one per possible rating), which is the label format expected by the categorical cross-entropy loss used for the classification.

In [None]:
# Split the data into train and test sets for rating prediction
X_train_rating, X_test_rating, y_train_rating, y_test_rating = train_test_split(rnn_features, pd.get_dummies(data['Rating']).values, test_size=0.2, random_state=42)
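
As an illustration of the one-hot encoding, the sketch below builds the vectors with NumPy on made-up labels. It is not the notebook's method (which uses `pd.get_dummies`); indexing an identity matrix is an alternative that always yields 5 columns, even if some rating happens to be absent from a small subset:

```python
import numpy as np

num_classes = 5
labels = np.array([3, 4, 0, 3])  # made-up ratings, already shifted to 0-4

# Row i of the identity matrix is the one-hot vector for class i
one_hot = np.eye(num_classes, dtype=int)[labels]

print(one_hot)
```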

In [None]:
#defining the MLP model
model =  Sequential()

In [None]:
#Input layer
model.add(Dense(64, kernel_initializer='he_normal', input_shape=(rnn_features.shape[1],)))  # 64 units; the input size is the number of RNN output features

Regularizers, initializers, normalizers

I used dropout regularization to prevent overfitting: randomly dropping a fraction of the input units during training reduces reliance on specific features and forces the network to learn more robust representations, which improves the model's generalization ability. He initialization is used as the weight initializer, which helps alleviate the vanishing or exploding gradient problem with ReLU activation functions. Batch normalization is applied after each hidden layer to normalize the activations within each batch.
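
The three techniques described above can be sketched in NumPy for a single hidden layer. This is only an illustration under stated assumptions: the batch is random made-up data, the weights are drawn from a plain normal distribution (Keras' `he_normal` uses a truncated one), and batch normalization is shown without its learned scale and shift:

```python
import numpy as np

rng = np.random.default_rng(0)

# He-normal initialization: zero-mean weights with std sqrt(2 / fan_in)
fan_in, fan_out = 64, 512
W = rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))

# Dense layer with ReLU on a made-up batch of 16 samples
x = rng.normal(size=(16, fan_in))
h = np.maximum(0.0, x @ W)

# Batch normalization (training mode): standardize each unit over the batch
h_norm = (h - h.mean(axis=0)) / np.sqrt(h.var(axis=0) + 1e-3)

# Inverted dropout with rate 0.2: zero about 20% of units, rescale the rest
rate = 0.2
mask = rng.random(h_norm.shape) >= rate
h_drop = h_norm * mask / (1.0 - rate)

print(W.std(), h_norm.mean(), mask.mean())
```

The rescaling by `1 / (1 - rate)` keeps the expected activation unchanged, so nothing has to be adjusted at prediction time when dropout is switched off.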

In [None]:
#First hidden layer
model.add(Dense(512, activation='relu', kernel_initializer='he_normal')) #starting with 512 neurons
model.add(BatchNormalization())
model.add(Dropout(0.2))

In [None]:
#Second hidden layer
model.add(Dense(256, activation='relu', kernel_initializer='he_normal')) #256 neurons
model.add(BatchNormalization())
model.add(Dropout(0.2))

In [None]:
#Third hidden layer
model.add(Dense(128, activation='relu', kernel_initializer='he_normal')) #128 neurons
model.add(BatchNormalization())
model.add(Dropout(0.2))

ReLU helps introduce non-linearity and capture complex patterns in the data.

In [None]:
# Output layer: 5 neurons, one per rating value, with a softmax activation function
model.add(Dense(5, activation='softmax'))

The softmax activation function is used to obtain a probability distribution over the five possible rating values, enabling the model to make predictions for each class.
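
A small NumPy sketch of the softmax computation, using made-up output-layer values for one review. The `+ 1` at the end maps the class index 0-4 back to a rating 1-5:

```python
import numpy as np

def softmax(logits):
    # Subtract the row maximum for numerical stability; the result is unchanged
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# Made-up raw outputs of the 5 output neurons for a single review
logits = np.array([[0.5, -1.0, 2.0, 0.1, 1.2]])

probs = softmax(logits)                      # probability per rating class
predicted_rating = probs.argmax(axis=-1) + 1  # most probable rating, 1-5

print(probs.round(3), predicted_rating)
```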

The loss function is specified in the compile call below; the activation functions were set on the individual layers when the MLP model was built in the code cells above.


In [None]:
model.compile(loss='categorical_crossentropy', optimizer=Adam(learning_rate=0.01), metrics=['accuracy'])
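
For reference, the categorical cross-entropy loss named in the compile call can be sketched in NumPy as follows. The labels and predicted probabilities are made up, and the clipping constant `eps` is an assumption added to avoid taking the logarithm of zero:

```python
import numpy as np

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    # Clip the probabilities so that log(0) cannot occur
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    # For one-hot labels only the true class contributes to the sum
    return -np.sum(y_true * np.log(y_pred), axis=-1)

# Two made-up reviews: true classes 3 and 4 (ratings 4 and 5)
y_true = np.array([[0, 0, 0, 1, 0],
                   [0, 0, 0, 0, 1]])
y_pred = np.array([[0.05, 0.05, 0.10, 0.70, 0.10],
                   [0.20, 0.20, 0.20, 0.20, 0.20]])

losses = categorical_crossentropy(y_true, y_pred)
print(losses, losses.mean())
```

The confident, correct first prediction gets a small loss, while the uniform second prediction is penalized more heavily.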

In [None]:
#Training the model
train = True
if train:
    batch_size = 16 #number of samples to be used in each training batch
    epochs = 15 #number of times the entire training dataset will be passed through the model during training
    history = model.fit(X_train_rating, y_train_rating,
                        epochs=epochs,
                        batch_size=batch_size,
                        shuffle=True,
                        validation_data=(X_test_rating, y_test_rating))



Epoch 1/15


Evaluation

The classification report was generated to obtain the metrics I specified in my exam. They offer insight into the model's performance on a per-class basis and help identify biases or discrepancies in predicting specific ratings.

I then added visualizations, namely the learning curves for the accuracy and the loss, which help gain insight into the model's generalization capabilities.

In [None]:
# Predicting on the test data using the MLP model
y_pred_rating = model.predict(X_test_rating)  # Use the correct model

# Converting MLP model predictions to class labels
y_pred_rating_labels = np.argmax(y_pred_rating, axis=1)
y_true_rating_labels = np.argmax(y_test_rating, axis=1)

# Incrementing by 1 to adjust the range from 0-4 to 1-5
y_true_rating_labels = y_true_rating_labels + 1
y_pred_rating_labels = y_pred_rating_labels + 1

# Generating the classification report for rating prediction
from sklearn.metrics import classification_report
classification_rep_rating = classification_report(y_true_rating_labels, y_pred_rating_labels)
print("Classification Report for Rating Prediction:")
print(classification_rep_rating)
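
To show what the per-class numbers in such a report mean, here is a minimal NumPy sketch of precision, recall and F1 for a single rating. The helper `per_class_metrics` and the label arrays are hypothetical, and returning 0.0 when a denominator is zero is an assumption made here for simplicity:

```python
import numpy as np

def per_class_metrics(y_true, y_pred, label):
    """Precision, recall and F1 for one rating value."""
    tp = np.sum((y_pred == label) & (y_true == label))  # correctly predicted
    fp = np.sum((y_pred == label) & (y_true != label))  # predicted but wrong
    fn = np.sum((y_pred != label) & (y_true == label))  # missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Made-up true and predicted ratings on the 1-5 scale
y_true = np.array([5, 4, 4, 3, 5, 4])
y_pred = np.array([5, 4, 5, 4, 5, 4])

for rating in (3, 4, 5):
    print(rating, per_class_metrics(y_true, y_pred, rating))
```

A rating that is never predicted, like 3 in this toy example, ends up with zero precision and recall, which is the kind of per-class discrepancy the report is meant to expose.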


In [None]:
#Plotting the learning curve for the accuracy for better visualization
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()

In [None]:
#Plotting the learning curve for the loss
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()

However, the results obtained from evaluating the model on unseen data may not be valid or representative, because only a very small subset of the available data was used. The evaluation therefore may not capture the full complexity and diversity present in the entire dataset.