# INFO284 - Group Exam 2026

## 1. Introduction

## 2. Task I: Sentiment Analysis

### 2.1 Configuration

In [31]:
# Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split

In [32]:
# Load our data
df = pd.read_csv("reviews.csv")

# Check first few rows to see if everything is loaded correctly and to get an idea of the structure of the data
df.head()

| | review_id | rating | review_text | review_date | helpful |
|---|---|---|---|---|---|
| 0 | 981e465b-d3ba-4632-9c60-25051efac38a | 5 | It's good | 11/22/2025 1:19 | 0 |
| 1 | 964d3555-9429-4c20-8127-ce3c71ce9273 | 5 | WhatsApp not working well always shows offline... | 11/24/2025 20:03 | 0 |
| 2 | 6c28859f-1554-4ca1-9aa8-9d66f204be0a | 5 | Oppo not corresponding, share with me the offi... | 11/25/2025 6:26 | 0 |
| 3 | a7efafc3-5871-4020-a398-9cc12cb4072a | 5 | Excellent app, great communication super conne... | 11/25/2025 18:09 | 0 |
| 4 | de142b31-a5ad-446f-b7c8-51b264728478 | 4 | simply the ɓest for chat and calls.i love it | 11/24/2025 1:10 | 1 |


The dataset has five columns: review_id, rating, review_text, review_date and helpful. The column names are self-explanatory, so we keep them. The reviews appear to be of the WhatsApp mobile app.

In [33]:
# Quick overview of the dataset before preprocessing and EDA

# Dataset size
print("Shape:", df.shape)

# Duplicates
print("Duplicate rows:", df.duplicated().sum())
print("Duplicate review_id:", df.duplicated(subset=["review_id"]).sum())
print("Duplicate review_text:", df.duplicated(subset=["review_text"]).sum())

# Missing values
print("\nMissing values per column:")
print(df.isna().sum())


Shape: (6210, 5)
Duplicate rows: 0
Duplicate review_id: 0
Duplicate review_text: 384

Missing values per column:
review_id      0
rating         0
review_text    0
review_date    0
helpful        0
dtype: int64


The dataset has 6210 rows and 5 columns, with no missing values and no duplicated rows or review IDs. The review_text column does contain 384 duplicated values, about 6.2% of the observations. Because many reviews are very short (for example "It's good"), it is plausible that different users wrote identical texts, so we treat these as coincidental rather than as true duplicates. Even if some of them are true duplicates, we consider the share too small to have practical implications for our purposes. Since there are also no missing values, no adjustments are needed at this stage and we proceed directly to preprocessing.
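The assumption that repeated texts are mostly short, generic reviews can be checked directly. The following is a minimal sketch on toy data, not part of the original notebook; the helper name `duplicate_text_summary` and the toy reviews are invented for illustration.

```python
import pandas as pd

def duplicate_text_summary(df, text_col="review_text"):
    """Word-count statistics for duplicated vs. unique texts (hypothetical helper)."""
    # keep=False flags every member of a duplicated group, not only the repeats
    is_dup = df[text_col].duplicated(keep=False)
    status = is_dup.map({True: "duplicated", False: "unique"}).rename("text_status")
    # Same word-count definition as review_length: whitespace-separated tokens
    n_words = df[text_col].fillna("").str.split().str.len()
    return n_words.groupby(status).agg(["count", "median", "max"])

# Toy data standing in for reviews.csv (invented for illustration)
toy = pd.DataFrame({
    "review_text": [
        "good",
        "good",
        "nice app",
        "nice app",
        "calls drop whenever I switch networks",
        "good",
    ]
})

summary = duplicate_text_summary(toy)
print(summary)
```

Because `keep=False` counts every member of a duplicated group, the "duplicated" count on the real data would be larger than the 384 reported by `df.duplicated(subset=["review_text"])`, which counts only the repeats. If the median word count of the duplicated group turns out to be small, that would support treating the duplicates as coincidental.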

### 2.2 Preprocessing

We split the dataset into training, validation, and test sets using stratified random sampling to preserve the distribution of rating classes across all subsets. A random split was chosen because the reviews represent independent observations without temporal dependency, making time-based splitting unnecessary.

A fixed split was preferred over cross-validation to maintain methodological consistency across all four models, including the neural network. Although cross-validation can yield a more robust estimate of performance, it requires repeated model training and significantly increases computational cost, particularly for neural networks, which we will be using later. The chosen approach therefore provides a balanced trade-off between reliability, computational feasibility, and fair model comparison.

In [34]:
# Define features and target
X = df[["review_text", "review_date", "helpful"]]
y = df["rating"]

# First split: Train (70%) and Temp (30%)
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y,
    test_size=0.30,
    stratify=y,
    random_state=1
)

# Second split: Validation (15%) and Test (15%)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp,
    test_size=0.50,
    stratify=y_temp,
    random_state=1
)

# Print sizes
print("Train size:", X_train.shape[0])
print("Validation size:", X_val.shape[0])
print("Test size:", X_test.shape[0])

Train size: 4347
Validation size: 931
Test size: 932


We keep the raw input features review_text and helpful from the original dataframe, and we also create a small set of engineered features. From review_date, we extract month, day_of_week, and hour to capture when reviews are posted, and we compute review_length as the number of words in review_text. The add_features function adds these engineered columns separately to the train, validation, and test splits, and we then keep only the final feature set used for modeling: review_text, helpful, month, day_of_week, hour, and review_length.

In [35]:
def add_features(split_df):
    split_df_copy = split_df.copy()    # Make a copy so we don't change the original split

    # Convert date text to datetime
    split_df_copy["review_date"] = pd.to_datetime(split_df_copy["review_date"], errors="coerce")    # errors="coerce" will set invalid parsing to NaT (Not a Time)

    # Time features
    split_df_copy["month"] = split_df_copy["review_date"].dt.month
    split_df_copy["day_of_week"] = split_df_copy["review_date"].dt.dayofweek
    split_df_copy["hour"] = split_df_copy["review_date"].dt.hour

    # Review length (number of words)
    text = split_df_copy["review_text"].fillna("")   # Replace missing text with empty string
    words = text.str.split()                         # Split each review into a list of words
    split_df_copy["review_length"] = words.str.len() # Count words in each list


    return split_df_copy

# Add features to each dataset split
X_train = add_features(X_train)
X_val = add_features(X_val)
X_test = add_features(X_test)

# Columns we want the models to use
cols = ["review_text", "helpful", "month", "day_of_week", "hour", "review_length"]
X_train = X_train[cols]
X_val = X_val[cols]
X_test = X_test[cols]

print(X_train.shape, X_val.shape, X_test.shape)

(4347, 6) (931, 6) (932, 6)


In [36]:
# Sanity check that stratification worked
print("Train rating distribution:\n", y_train.value_counts(normalize=True).sort_index())
print("Val rating distribution:\n", y_val.value_counts(normalize=True).sort_index())
print("Test rating distribution:\n", y_test.value_counts(normalize=True).sort_index())


Train rating distribution:
 rating
1    0.254658
2    0.075224
3    0.081896
4    0.101679
5    0.486542
Name: proportion, dtype: float64
Val rating distribution:
 rating
1    0.254565
2    0.075188
3    0.081633
4    0.102041
5    0.486574
Name: proportion, dtype: float64
Test rating distribution:
 rating
1    0.255365
2    0.075107
3    0.081545
4    0.101931
5    0.486052
Name: proportion, dtype: float64


The sanity check looks good. The class proportions are almost identical across train, validation, and test, which confirms that stratified sampling preserved the rating distribution as intended.
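The visual comparison can also be reduced to a single number: the largest absolute difference in class share between any two splits. This is a minimal sketch with toy labels; the helper name `max_proportion_gap` is hypothetical and not part of the notebook.

```python
import pandas as pd

def max_proportion_gap(*label_series):
    """Largest absolute difference in class share across splits (hypothetical helper)."""
    # One column of class proportions per split, aligned on the class label
    props = pd.concat(
        [s.value_counts(normalize=True) for s in label_series], axis=1
    ).fillna(0.0)  # a class missing from a split has share 0
    return float((props.max(axis=1) - props.min(axis=1)).max())

# Toy labels standing in for y_train and y_val (invented for illustration)
y_a = pd.Series([1, 1, 5, 5])
y_b = pd.Series([1, 5, 5, 5])

gap = max_proportion_gap(y_a, y_b)
print(gap)
```

In the notebook this would be called as `max_proportion_gap(y_train, y_val, y_test)`. Reading the printed distributions above, the largest gap is about 0.0008 (rating 1, validation versus test).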

### 2.3 Exploratory Data Analysis

We start our exploratory data analysis (EDA) with a statistical summary of the dataframe we created.

In [None]:
# We only want to summarize the numeric features, so we select those columns
numeric_cols = ["helpful", "month", "day_of_week", "hour", "review_length"]

# Basic summary statistics (count, mean, std, min, quartiles, max)
stats_table = X_train[numeric_cols].describe().T

# Add percent of missing values for each feature
stats_table["missing_percent"] = (X_train[numeric_cols].isna().mean() * 100)

# Keep the columns we care about and round to 2 decimals
stats_table = stats_table[["count", "missing_percent", "mean", "std", "min", "25%", "50%", "75%", "max"]].round(2)

stats_table


Statistical summary of numeric features in training data



| | count | missing_percent | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|
| helpful | 4347.0 | 0.0 | 60.74 | 3778.5 | 0.0 | 0.0 | 0.0 | 0.0 | 248962.0 |
| month | 3795.0 | 12.7 | 11.0 | 0.0 | 11.0 | 11.0 | 11.0 | 11.0 | 11.0 |
| day_of_week | 3795.0 | 12.7 | 2.84 | 2.27 | 0.0 | 1.0 | 2.0 | 5.0 | 6.0 |
| hour | 3795.0 | 12.7 | 13.44 | 6.98 | 0.0 | 9.0 | 15.0 | 20.0 | 23.0 |
| review_length | 4347.0 | 0.0 | 15.32 | 19.42 | 1.0 | 3.0 | 7.0 | 19.0 | 105.0 |


See Chat for how we interpret this and what we should do about it.
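Two observations in the summary table can be followed up with a few lines of code: the date-derived features are missing for 12.7% of the training rows even though review_date had no missing values before `errors="coerce"` was applied, and month has a standard deviation of 0. The following is a minimal sketch on toy data. The helper names `unparsed_dates` and `constant_columns` are hypothetical, the toy date strings are invented, and the format string is an assumption based only on the dates visible in the `df.head()` output.

```python
import pandas as pd

def unparsed_dates(raw_dates, fmt="%m/%d/%Y %H:%M"):
    """Return the raw strings that do not match the assumed date format."""
    parsed = pd.to_datetime(raw_dates, format=fmt, errors="coerce")
    # Keep values that were present in the raw data but became NaT after parsing
    return raw_dates[parsed.isna() & raw_dates.notna()]

def constant_columns(df):
    """Return columns with at most one distinct non-missing value."""
    return [c for c in df.columns if df[c].nunique(dropna=True) <= 1]

# Toy data (invented for illustration, not taken from reviews.csv)
raw = pd.Series(["11/22/2025 1:19", "2025-11-23 08:15:00", "11/24/2025 20:03"])
bad = unparsed_dates(raw)
print(bad.tolist())

features = pd.DataFrame({"month": [11, 11, 11], "hour": [1, 8, 20]})
const = constant_columns(features)
print(const)
```

Applied to the raw `review_date` column, `unparsed_dates` would show whether the missing values come from a second date format rather than from genuinely invalid dates. A column flagged by `constant_columns`, such as month in the table above, carries no information for a model and is a candidate for removal.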

### 2.4 Models

### 2.5 Evaluation

### 2.6 Discussion

## 3. Task II: Convolutional Neural Networks

### 3.X Exploratory Data Analysis

### 3.X Configuration

### 3.X Preprocessing

### 3.X Model Choice and Justification

### 3.X Training

### 3.X Evaluation

### 3.X Testing on New Images

### 3.X Discussion

## 4. Final Summary and Reflection

## 5. References & Tools Used