# Backpack Prediction Challenge

**FEUP 2024/2025 - L.EIC029 IART**

- Bruno Oliveira - 202208700  
- Henrique Fernandes - 202204988  
- Rodrigo Coelho - 202205188  

> Based on Kaggle Playground Season 5, Episode 2  
> April 2025

___

## Project setup

### Virtual Environment

To set up the project, run the following commands to create a virtual environment and install the required dependencies:

In [None]:
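# Note: each `!` command runs in its own subshell, so the activation below does not
# carry over to the notebook kernel. Prefer running these two commands in a terminal.
!python3 -m venv .venv
!source .venv/bin/activate
%pip install -r requirements.txt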
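
To confirm which interpreter the notebook kernel is actually using, a small check like the one below can help. This is a minimal sketch: the helper name `in_virtualenv` is hypothetical, and it relies only on the standard `sys.prefix` and `sys.base_prefix` attributes.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the running interpreter belongs to a virtual environment."""
    # Inside a venv, sys.prefix points at the environment directory,
    # while sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


print(f"Interpreter: {sys.executable}")
print(f"Inside a virtual environment: {in_virtualenv()}")
```

If this reports `False`, the kernel is running on the base installation and `%pip install` will install the dependencies there.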

Once the dependencies are installed, the script below can be used to download the dataset from the Kaggle competition, using your Kaggle account.

<div class="alert alert-block alert-warning">
<b>Warning:</b> Don't forget to download the Kaggle token associated with your account from the <a href="https://www.kaggle.com/settings">Settings page</a>, move it to the current folder and join the <a href="https://www.kaggle.com/competitions/playground-series-s5e2/">Kaggle playground competition</a>.
</div>

In [None]:
import os
from pathlib import Path

os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()

!kaggle competitions download -c playground-series-s5e2
data_dir = Path("data")
data_dir.mkdir(parents=True, exist_ok=True)
!mv playground-series-s5e2.zip data/
os.chdir(data_dir)
!unzip -o playground-series-s5e2.zip
!rm playground-series-s5e2.zip
!ls -lh
os.chdir("..")
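
On systems without the `unzip`, `mv` and `rm` shell utilities, the extraction step could also be done with the standard library. The sketch below is an alternative, not the approach used in this notebook; the helper name `extract_archive` and the throwaway demo archive are hypothetical.

```python
import tempfile
import zipfile
from pathlib import Path


def extract_archive(archive_path, target_dir):
    """Extract every member of a zip archive into target_dir and return their names."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(target)
        return sorted(archive.namelist())


# Demonstration on a throwaway archive (file name and contents are made up).
with tempfile.TemporaryDirectory() as tmp:
    demo_zip = Path(tmp) / "demo.zip"
    with zipfile.ZipFile(demo_zip, "w") as archive:
        archive.writestr("train.csv", "id,Price\n0,10.0\n")
    extracted = extract_archive(demo_zip, Path(tmp) / "data")
    demo_file_exists = (Path(tmp) / "data" / "train.csv").exists()

print(extracted, demo_file_exists)
```

For the competition archive, the equivalent call would be `extract_archive("playground-series-s5e2.zip", "data")`.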

### Loading the Datasets

With the dependencies installed and the dataset downloaded, we can now load it into our environment. The following commands will load both datasets:

- `train.csv` which contains 300000 entries and is used to train the models
- `test.csv` which contains 200000 entries and is used to test the models

In [None]:
import pandas as pd

train_data = pd.read_csv('data/train.csv')
test_data = pd.read_csv('data/test.csv')

train_data.info()
train_data.head()

___

## Exploratory Data Analysis

### Missing Data

In [None]:
missing_values_train = pd.DataFrame({
    'Column': train_data.columns,
    'Missing Train Values': train_data.isnull().sum().values,
    'Percentage of Missing Train Values': train_data.isnull().sum().values / len(train_data) * 100
})

missing_values_test = pd.DataFrame({
    'Column': test_data.columns,
    'Missing Test Values': test_data.isnull().sum().values,
    'Percentage of Missing Test Values': test_data.isnull().sum().values / len(test_data) * 100
})

merged_missing_values = pd.merge(missing_values_train, missing_values_test, on='Column', how='outer')
merged_missing_values = merged_missing_values[~merged_missing_values['Column'].isin(['id', 'Price'])]
merged_missing_values
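
The same kind of summary can be written as a small reusable helper. The sketch below is illustrative only: `missing_summary` is a hypothetical name, and the toy frame uses made-up values rather than the competition data.

```python
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Count missing values per column and express them as a percentage of rows."""
    counts = df.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": counts / len(df) * 100,
    })
    # Keep only the columns that actually have missing values.
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Toy data (made up) to show the shape of the result.
toy = pd.DataFrame({
    "id": [0, 1, 2, 3],
    "Brand": ["A", None, "B", None],
    "Price": [10.0, 20.0, None, 40.0],
})
toy_summary = missing_summary(toy)
print(toy_summary)
```

Calling `missing_summary(train_data)` and `missing_summary(test_data)` would give one table per dataset, which can then be merged as in the cell above.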

### Duplicated Data

In [None]:
train_data_duplicates = train_data.drop('id', axis=1).duplicated().sum()
test_data_duplicates = test_data.drop('id', axis=1).duplicated().sum()
print(f"Train data duplicates: {train_data_duplicates}")
print(f"Test data duplicates: {test_data_duplicates}")

### Data Description

In [None]:
train_data.describe()

### Distribution of Data

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

def plot_text_columns(data, columns):
    for column in columns:
        plt.figure(figsize=(10, 6))
        sns.countplot(x=data[column], order=data[column].value_counts().index)
        plt.title(f"Distribution of {column}")
        plt.show()

def plot_numeric_columns(data, columns):
    for column in columns:
        plt.figure(figsize=(10, 6))
        sns.histplot(data[column], bins=30)
        plt.title(f"Distribution of {column}")
        plt.show()
        
def plot_pairplot(data, column1, column2):
    # pairplot creates its own figure, so no separate plt.figure() call is needed
    sns.pairplot(data[[column1, column2, 'Price']], hue=column1)
    plt.suptitle(f"Pair plot of {column1} and {column2} against Price", y=1.02)
    plt.show()
        
# plot_text_columns(train_data, train_data.columns[train_data.dtypes == 'object'].tolist())
# plot_numeric_columns(train_data, [col for col in train_data.columns if col != 'id' and train_data[col].dtypes != 'object'])
plot_pairplot(train_data, 'Brand', 'Material')

### Data Correlation

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

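# Brand and Material are still categorical (string) columns at this point, so pairplot,
# which only draws numeric columns by default, will plot Price alone.
pairplot_cols = ['Brand', 'Material', 'Price']

df_pairplot = train_data[pairplot_cols].dropna()

sns.pairplot(df_pairplot)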


## Encode categorical features and define features and targets

In [None]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import mean_squared_error

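# Drop rows with missing values so the encoders and the model only see complete rows
train_data.dropna(inplace=True)

# Encode each categorical column as integers, keeping the fitted encoder for later reuse
label_encoders = {}
for col in ['Brand', 'Material', 'Size', 'Laptop Compartment', 'Waterproof', 'Style', 'Color']:
    le = LabelEncoder()
    train_data[col] = le.fit_transform(train_data[col])
    label_encoders[col] = le

# 'id' is only a row identifier, so it is excluded from the features
X = train_data.drop(['id', 'Price'], axis=1)
y = train_data['Price']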
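
The encoders above are fitted on the training data only. To apply the same integer mapping to another frame, such as `test_data`, one option is scikit-learn's `OrdinalEncoder`, which encodes several columns at once and can map unseen categories to a sentinel value. This is a sketch of an alternative, not what the notebook does; the toy frames and their values are made up.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy frames (made-up values); "C" appears only in the test frame.
toy_train = pd.DataFrame({"Brand": ["A", "B", "A"], "Size": ["Small", "Large", "Small"]})
toy_test = pd.DataFrame({"Brand": ["B", "C"], "Size": ["Large", "Small"]})
categorical_cols = ["Brand", "Size"]

# Categories unseen during fit are encoded as -1 instead of raising an error.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)

encoded_train = pd.DataFrame(
    encoder.fit_transform(toy_train[categorical_cols]), columns=categorical_cols
)
# Reuse the mapping learned on the training frame; do not refit on the test frame.
encoded_test = pd.DataFrame(
    encoder.transform(toy_test[categorical_cols]), columns=categorical_cols
)

print(encoded_train)
print(encoded_test)
```

Fitting once and calling only `transform` on the second frame keeps the integer codes consistent between the two datasets.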

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)

model = DecisionTreeRegressor(random_state=1, max_depth=5)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse:.2f}")
print(f"Root Mean Squared Error: {mse**0.5:.2f}")
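
An RMSE value is easier to interpret next to a naive baseline. The sketch below computes the RMSE of always predicting the mean price; the `rmse` helper and the toy prices are hypothetical and only illustrate the calculation.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error: the square root of the mean squared difference."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Toy prices (made up). The baseline predicts the mean for every row.
toy_prices = np.array([10.0, 20.0, 30.0, 40.0])
baseline_pred = np.full_like(toy_prices, toy_prices.mean())
baseline_rmse = rmse(toy_prices, baseline_pred)

print(f"Baseline RMSE: {baseline_rmse:.2f}")
```

With the real split, the baseline would be `rmse(y_test, np.full(len(y_test), y_train.mean()))`; a model is only useful if its RMSE is clearly below that figure.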



In [None]:
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

# A wide figure keeps the nodes of the depth-5 tree readable
plt.figure(figsize=(100, 20))
plot_tree(model, filled=True, feature_names=list(X.columns), fontsize=10, max_depth=5)
plt.title("Decision Tree Visualization")
plt.show()


In [None]:
# get all the values from the Brand column
brand_values = train_data['Brand'].unique()
# get average price per brand
avg_price_per_brand = train_data.groupby('Brand')['Price'].mean().sort_values(ascending=False)
avg_price_per_brand
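
Because `Brand` was label-encoded earlier, the index of this result holds integer codes rather than brand names. A fitted `LabelEncoder` can translate the codes back with `inverse_transform`. The sketch below shows the idea on made-up toy data.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data (made up): two brands with two prices each.
toy = pd.DataFrame({
    "Brand": ["B", "A", "B", "A"],
    "Price": [30.0, 10.0, 50.0, 20.0],
})

encoder = LabelEncoder()
toy["Brand"] = encoder.fit_transform(toy["Brand"])  # "A" -> 0, "B" -> 1

avg_price = toy.groupby("Brand")["Price"].mean().sort_values(ascending=False)
# Replace the integer codes in the index with the original labels.
avg_price.index = encoder.inverse_transform(avg_price.index.to_numpy())

print(avg_price)
```

In this notebook, the fitted encoder for the column is stored in `label_encoders['Brand']`, so the same `inverse_transform` call could be applied to the index of `avg_price_per_brand`.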
