## **Objective**  
In this episode of the Playground Series, the task is to predict the price of backpacks from a set of product attributes.  
Submissions are scored on the root mean squared error (RMSE).

RMSE is defined as:

$$
\textrm{RMSE} =  \left( \frac{1}{N} \sum_{i=1}^{N} (y_i - \widehat{y}_i)^2 \right)^{\frac{1}{2}}
$$
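
As a quick check of the formula, here is a minimal sketch computed on made-up values. The helper name `rmse_score` and the numbers are assumptions for illustration only and are not part of the competition data.

```python
import numpy as np


def rmse_score(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Square the errors, average them, then take the square root
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Made-up prices: errors are 10, -10 and 0
example_rmse = rmse_score([100.0, 50.0, 75.0], [90.0, 60.0, 75.0])
print(example_rmse)
```

Because the errors are squared before averaging, a few large misses raise the score more than many small ones.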

Predicting backpack prices based on material, size, and brand is a useful real-world application for e-commerce platforms, retailers and manufacturers. A robust model can help companies optimize pricing strategies, identify desirable features and understand key cost drivers.  

## **Data**  
The dataset for this competition is generated from a deep learning model trained on the [Student Bag Price Prediction Dataset](https://www.kaggle.com/datasets/souradippal/student-bag-price-prediction-dataset/data).


In [None]:
import warnings
warnings.filterwarnings("ignore")

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import lightgbm as lgb
import xgboost as xgb
from catboost import CatBoostRegressor

print('Libraries imported')

In [None]:
df_train = pd.read_csv('/kaggle/input/playground-series-s5e2/train.csv')
df_train_ex = pd.read_csv('/kaggle/input/playground-series-s5e2/training_extra.csv')
df_original = pd.read_csv('/kaggle/input/student-bag-price-prediction-dataset/Noisy_Student_Bag_Price_Prediction_Dataset.csv')
df_test = pd.read_csv('/kaggle/input/playground-series-s5e2/test.csv')
df_sub = pd.read_csv('/kaggle/input/playground-series-s5e2/sample_submission.csv')
print('Data imported')

### **Initial Observations**  


In [None]:
df_test

In [None]:
df_train

In [None]:
df_train_ex

In [None]:
df_original

In [None]:
df_train_combined = pd.concat([df_train, df_train_ex, df_original], ignore_index=True)
df_train_combined["id"] = df_train_combined.index

In [None]:
df_train_combined

In [None]:
df_train = df_train_combined

In [None]:
train_duplicates = df_train.duplicated().sum()
test_duplicates = df_test.duplicated().sum()

print(f"Number of duplicate rows in df_train: {train_duplicates}")
print(f"Number of duplicate rows in df_test: {test_duplicates}")

### **Summary of Datasets**

In [None]:
missing_values_train = pd.DataFrame({'Feature': df_train.columns,
                              '[TRAIN] No. of Missing Values': df_train.isnull().sum().values,
                              '[TRAIN] % of Missing Values': ((df_train.isnull().sum().values)/len(df_train)*100)})

missing_values_test = pd.DataFrame({'Feature': df_test.columns,
                             '[TEST] No. of Missing Values': df_test.isnull().sum().values,
                             '[TEST] % of Missing Values': ((df_test.isnull().sum().values)/len(df_test)*100)})

unique_values = pd.DataFrame({'Feature': df_train.columns,
                              'No. of Unique Values[FROM TRAIN]': df_train.nunique().values})

feature_types = pd.DataFrame({'Feature': df_train.columns,
                              'DataType': df_train.dtypes.values})

df_summary = pd.merge(missing_values_train, missing_values_test, on='Feature', how='left')
df_summary = pd.merge(df_summary, unique_values, on='Feature', how='left')
df_summary = pd.merge(df_summary, feature_types, on='Feature', how='left')

df_summary

## **Dataset Observations**

### **Shape**
Training Data: 3,994,318 rows × 11 columns  
Test Data: 200,000 rows × 10 columns

### **Missing Values**
Several features contain missing values in both the training and test sets:  

**Training set**
- Brand: ~3.20%
- Material: ~2.80%
- Size: ~2.23%
- Compartments: ~0.06%
- Laptop Compartment: ~2.50%
- Waterproof: ~2.40%
- Style: ~2.64%
- Color: ~3.37%
- Weight Capacity (kg): ~0.11%
- Price: ~0.06%

**Test set**
- Brand: ~3.11%
- Material: ~2.81%
- Size: ~2.19%
- Laptop Compartment: ~2.48%
- Waterproof: ~2.41%
- Style: ~2.58%
- Color: ~3.39%
- Weight Capacity (kg): ~0.04%


### **Feature Breakdown**

- **ID:** A unique identifier for each backpack.
- **Brand, Material, Size, Style:** Categorical variables.
- **Compartments:** Numeric, range from 1 to 10.
- **Laptop Compartment & Waterproof:** Binary categorical.
- **Color:** 6 unique values, with missing data.
- **Weight Capacity (kg):** Numerical.
- **Price:** The target variable in the training set.

## **EDA**

In [None]:
plt.figure(figsize=(12, 6))
sns.histplot(df_train["Price"], bins=50, kde=True)
plt.xlabel("Price")
plt.ylabel("Frequency")
plt.title("Distribution of Price (Target Variable)")
plt.show()

In [None]:
plt.figure(figsize=(12, 6))
sns.heatmap(df_train.isnull(), cbar=False, cmap="viridis")
plt.title("Missing Values Heatmap")
plt.show()

In [None]:
plt.figure(figsize=(12, 6))
sns.boxplot(x="Brand", y="Price", data=df_train)
plt.title("Price Distribution by Brand")
plt.xticks(rotation=45)
plt.show()

In [None]:
plt.figure(figsize=(12, 6))
sns.boxplot(x="Material", y="Price", data=df_train)
plt.title("Price Distribution by Material")
plt.xticks(rotation=45)
plt.show()

In [None]:
plt.figure(figsize=(12, 6))
sns.boxplot(x="Size", y="Price", data=df_train)
plt.title("Price Distribution by Size")
plt.xticks(rotation=45)
plt.show()

In [None]:
plt.figure(figsize=(12, 6))
sns.boxplot(x="Color", y="Price", data=df_train)
plt.title("Price Distribution by Color")
plt.xticks(rotation=45)
plt.show()

In [None]:
plt.figure(figsize=(12, 6))
sns.boxplot(x="Style", y="Price", data=df_train)
plt.title("Price Distribution by Style")
plt.xticks(rotation=45)
plt.show()

In [None]:
plt.figure(figsize=(12, 6))
sns.boxplot(x="Laptop Compartment", y="Price", data=df_train)
plt.title("Price Distribution by Laptop Compartment")
plt.xticks(rotation=45)
plt.show()

In [None]:
plt.figure(figsize=(12, 6))
sns.boxplot(x="Waterproof", y="Price", data=df_train)
plt.title("Price Distribution by Waterproof")
plt.xticks(rotation=45)
plt.show()

In [None]:
plt.figure(figsize=(12, 6))
sns.scatterplot(x="Compartments", y="Price", data=df_train)
plt.title("Price Distribution by Compartments")
plt.xticks(rotation=45)
plt.show()

In [None]:
plt.figure(figsize=(12, 6))
sns.scatterplot(x="Weight Capacity (kg)", y="Price", data=df_train)
plt.title("Price Distribution by Weight Capacity (kg)")
plt.xticks(rotation=45)
plt.show()


## **Data Imputation**

**Brand, Material, Size, Compartments, Laptop Compartment, Waterproof, Style, Color**  
These features take only a small number of distinct values, and the EDA plots show little that would justify a more elaborate scheme, so missing values are filled with the mode.

**Weight Capacity (kg), Price**  
These are continuous numerical variables, so missing values are filled with the median.

In [None]:
categorical_features = ["Brand", "Material", "Size", "Compartments", "Laptop Compartment", "Waterproof", "Style", "Color"]
numerical_features = ["Weight Capacity (kg)", "Price"]
numerical_features_test = ["Weight Capacity (kg)"]

# Assign the result back instead of calling fillna(inplace=True) on a column
# selection, which newer pandas versions warn about and may not apply.
for col in categorical_features:
    df_train[col] = df_train[col].fillna(df_train[col].mode()[0])
    df_test[col] = df_test[col].fillna(df_test[col].mode()[0])

for col in numerical_features:
    df_train[col] = df_train[col].fillna(df_train[col].median())
for col in numerical_features_test:
    df_test[col] = df_test[col].fillna(df_test[col].median())
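
The cell above imputes each frame with its own mode and median. One possible variation, sketched here on made-up toy frames (`toy_train`, `toy_test` and `fill_values` are hypothetical names, not part of the notebook), is to compute the fill values once on the training data and reuse them for the test data so that both sets receive identical values.

```python
import numpy as np
import pandas as pd

# Made-up frames standing in for the real train and test data
toy_train = pd.DataFrame({
    "Color": ["Red", "Blue", np.nan, "Red"],
    "Weight Capacity (kg)": [10.0, np.nan, 20.0, 30.0],
})
toy_test = pd.DataFrame({
    "Color": [np.nan, "Blue"],
    "Weight Capacity (kg)": [np.nan, 12.0],
})

# Learn the fill values from the training frame only
fill_values = {
    "Color": toy_train["Color"].mode()[0],                               # most frequent category
    "Weight Capacity (kg)": toy_train["Weight Capacity (kg)"].median(),  # middle value
}

# Apply the same values to both frames
toy_train_filled = toy_train.fillna(fill_values)
toy_test_filled = toy_test.fillna(fill_values)

print(toy_test_filled)
```

Whether this changes the score on this dataset is not something the notebook tests. With missing rates of a few percent, the two approaches are likely to give very similar fill values.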

## **Encoding Categorical Features**

In [None]:
# One-Hot Encoding for categorical variables
one_hot_features = ["Brand", "Material", "Size", "Style", "Color"]
df_train = pd.get_dummies(df_train, columns=one_hot_features)
df_test = pd.get_dummies(df_test, columns=one_hot_features)


# Label Encoding for binary variables
# Fit each encoder on the training column and reuse it for the test column
# so that both frames share the same label-to-integer mapping.
binary_features = ["Laptop Compartment", "Waterproof"]
for col in binary_features:
    le = LabelEncoder()
    df_train[col] = le.fit_transform(df_train[col])
    df_test[col] = le.transform(df_test[col])
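
Since `pd.get_dummies` is applied to each frame separately, a category that appears in only one frame produces a column in only that frame. The sketch below uses made-up toy frames (all names are hypothetical) to show how `reindex` can align the test columns to the training columns, which is the same call used later when building the submission.

```python
import pandas as pd

# Made-up frames: "Green" appears in the training data only
toy_train = pd.DataFrame({"Color": ["Red", "Blue", "Green"]})
toy_test = pd.DataFrame({"Color": ["Red", "Blue"]})

train_encoded = pd.get_dummies(toy_train, columns=["Color"])
test_encoded = pd.get_dummies(toy_test, columns=["Color"])

# Give the test frame exactly the training columns, in the same order;
# columns missing from the test frame are created and filled with 0
test_aligned = test_encoded.reindex(columns=train_encoded.columns, fill_value=0)

print(list(test_aligned.columns))
```

Columns that exist only in the test frame would be dropped by the same call, so the model always sees the feature set it was trained on.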

## **Baseline Model - Linear Regression**

In [None]:
X = df_train.drop(columns=["Price", "id"])
y = df_train["Price"]


X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=42)

lr = LinearRegression()
lr.fit(X_train, y_train)

y_pred = lr.predict(X_valid)

rmse = np.sqrt(mean_squared_error(y_valid, y_pred))
print("Baseline RMSE:", rmse)

## **LightGBM Model**

In [None]:
# X = df_train.drop(columns=["Price", "id"])
y = df_train["Price"]
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=42)

lgb_model = lgb.LGBMRegressor(
    n_estimators=5000, 
    learning_rate=0.02, 
    max_depth=8
)
lgb_model.fit(
    X_train, y_train,
    eval_set=[(X_valid, y_valid)],
    eval_metric="rmse",
    callbacks=[lgb.early_stopping(100)]
)

y_pred = lgb_model.predict(X_valid)

rmse = np.sqrt(mean_squared_error(y_valid, y_pred))
print("LightGBM RMSE:", rmse)


## **XGBoost Model**

In [None]:
dtrain = xgb.DMatrix(X_train, label=y_train)
dvalid = xgb.DMatrix(X_valid, label=y_valid)

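xgb_params = {
    "objective": "reg:squarederror",
    "eval_metric": "rmse",
    "learning_rate": 0.02,
    "max_depth": 8
}

# xgb.train does not read "n_estimators" from the params dict and defaults to
# 10 boosting rounds, so the round count is passed as num_boost_round instead.
xgb_model = xgb.train(
    xgb_params, 
    dtrain, 
    num_boost_round=5000,
    evals=[(dvalid, "valid")], 
    early_stopping_rounds=100, 
    verbose_eval=100
)

# Predict with the trees up to the best iteration found by early stopping
y_pred_xgb = xgb_model.predict(dvalid, iteration_range=(0, xgb_model.best_iteration + 1))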

rmse_xgb = np.sqrt(mean_squared_error(y_valid, y_pred_xgb))
print("XGBoost RMSE:", rmse_xgb)

## **CatBoost Model**

In [None]:
cat_model = CatBoostRegressor(
    iterations=5000,
    learning_rate=0.02,
    depth=8,
    loss_function="RMSE",
    verbose=100,
    early_stopping_rounds=100
)

cat_model.fit(X_train, y_train, eval_set=(X_valid, y_valid))

y_pred_cat = cat_model.predict(X_valid)

rmse_cat = np.sqrt(mean_squared_error(y_valid, y_pred_cat))
print("CatBoost RMSE:", rmse_cat)

## **Model Performance Comparison**

In [None]:
print(f"LightGBM RMSE: {rmse}")
print(f"XGBoost RMSE: {rmse_xgb}")
print(f"CatBoost RMSE: {rmse_cat}")

**LightGBM** consistently outperforms the other models, although the differences are marginal. This is largely due to the dataset's structure: no feature is highly predictive of the target variable.

Comparing the baseline linear model to the more advanced models, we see a noticeable but not substantial improvement. This raises an important question: how much effort is worth investing in further improvements?

While feature engineering could enhance predictive accuracy, the dataset does not provide many features that stand out. This project highlights a common challenge in machine learning: the trade-off between time, resources and model performance.

One of the bigger challenges is that the dataset is synthetically generated, so real-world factors that typically influence backpack prices, such as brand perception, country-specific trends and GDP per capita, are absent. Had the data been real, it would have been worthwhile to explore these external factors. Because it is synthetic, incorporating such external sources may not lead to meaningful improvements. This highlights another important takeaway: when working with synthetic data, traditional feature engineering and real-world data augmentation may have limited benefits.

Ultimately, the best-performing model (**LightGBM**) showed only marginal improvements over simpler models, reinforcing that this dataset lacks strong predictive signals beyond its given features.

## **Submission File**

In [None]:
X_test = df_test.drop(columns=["id"])
X_test = X_test.reindex(columns=X_train.columns, fill_value=0)

y_test_pred = lgb_model.predict(X_test)

submission = pd.DataFrame({
    "id": df_test["id"],
    "Price": y_test_pred
})

submission.to_csv("submission.csv", index=False)
print("Submission File Saved: submission.csv")