# Work Package 3 (WP3): Open Kaggle Competition

Supervised Machine Learning Model Development and Deployment

---

**Universitat de Lleida**  
**Computer Engineering (Enginyeria Informàtica)**  
**Intelligent Systems (Sistemes Intel·ligents)**  

---

**Professor:** Mariano Garralda Barrio  

**Authors:**  
- Jordi García Ventura  
- Christian López García  

**Date:** 12/01/2025  


## 0. Setup

In [1]:
!python --version

Python 3.9.13


In [111]:
%%capture
%pip install -r ../requirements.txt

In [3]:
%matplotlib inline

In [118]:
import sys
sys.path.append("../")

%load_ext autoreload
%autoreload 2
from lib.cache import cache, DataFrameCache

In [131]:
import webbrowser
from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from ydata_profiling import ProfileReport
from sklearn.preprocessing import StandardScaler
from xgboost import XGBRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import root_mean_squared_log_error

In [20]:
TRAIN_FILENAME = "train.csv"
TEST_FILENAME = "test.csv"
DATA_PATH = Path.cwd().parent / "data"
CACHE_PATH = Path.cwd().parent / "cache"
CACHE_IMAGES_PATH = CACHE_PATH / "images"
CACHE_MODELS_PATH = CACHE_PATH / "models"
CACHE_DATAFRAMES_PATH = CACHE_PATH / "dataframes"
CACHE_NUMPY_PATH = CACHE_PATH / "numpy"

In [35]:
df = pd.read_csv(DATA_PATH / TRAIN_FILENAME)

In [36]:
dataframe_cache = DataFrameCache(CACHE_DATAFRAMES_PATH)
dataframe_cache.save("0-original", df)
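
The `DataFrameCache` class comes from the project's own `lib.cache` module, whose source is not shown here. As a rough illustration of the `save`/`load` pattern used throughout this notebook, the following is a minimal sketch that assumes one pickle file per named snapshot. The class name `PickleDataFrameCache` and the `.pkl` layout are hypothetical and may differ from the real implementation.

```python
import tempfile
from pathlib import Path

import pandas as pd


class PickleDataFrameCache:
    """Hypothetical stand-in for lib.cache.DataFrameCache (assumed behaviour)."""

    def __init__(self, directory):
        self.directory = Path(directory)
        # Create the cache folder on first use.
        self.directory.mkdir(parents=True, exist_ok=True)

    def _path(self, name):
        return self.directory / f"{name}.pkl"

    def save(self, name, frame):
        # Pickle keeps dtypes and the index, unlike a CSV round trip.
        frame.to_pickle(self._path(name))

    def load(self, name):
        return pd.read_pickle(self._path(name))


with tempfile.TemporaryDirectory() as tmp:
    demo_cache = PickleDataFrameCache(Path(tmp) / "dataframes")
    original = pd.DataFrame({"id": [0, 1], "Age": [19.0, 39.0]})
    demo_cache.save("0-original", original)
    restored = demo_cache.load("0-original")
```

Named snapshots such as `"0-original"` let each later stage reload a known state instead of re-running every earlier cell.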

## 1. EDA

In [37]:
PROFILE_PATH = CACHE_PATH / "profiling_report.html"

if not PROFILE_PATH.exists():
    print("Generating profiling report...")
    profile = ProfileReport(df, title="Regression with an Insurance Dataset Report", explorative=True)
    profile.to_file(PROFILE_PATH)
    print("Profiling report generated.")

Generating profiling report...



Profiling report generated.


In [73]:
webbrowser.open(PROFILE_PATH.as_uri())

True

## 2. Feature engineering

In [28]:
df = dataframe_cache.load("0-original")

### Features

1. `id`: Unique identifier for the record (Numerical)
1. `Age`: Age of the insured individual (Numerical)
1. `Gender`: Gender of the insured individual (Categorical: Male, Female)
1. `Annual Income`: Annual income of the insured individual (Numerical, skewed)
1. `Marital Status`: Marital status of the insured individual (Categorical: Single, Married, Divorced)
1. `Number of Dependents`: Number of dependents (Numerical, with missing values)
1. `Education Level`: Highest education level attained (Categorical: High School, Bachelor's, Master's, PhD)
1. `Occupation`: Occupation of the insured individual (Categorical: Employed, Self-Employed, Unemployed)
1. `Health Score`: A score representing the health status (Numerical, skewed)
1. `Location`: Type of location (Categorical: Urban, Suburban, Rural)
1. `Policy Type`: Type of insurance policy (Categorical: Basic, Comprehensive, Premium)
1. `Previous Claims`: Number of previous claims made (Numerical, with outliers)
1. `Vehicle Age`: Age of the vehicle insured (Numerical)
1. `Credit Score`: Credit score of the insured individual (Numerical, with missing values)
1. `Insurance Duration`: Duration of the insurance policy (Numerical, in years)
1. `Policy Start Date`: Start date of the insurance policy (Text, improperly formatted)
1. `Customer Feedback`: Short feedback comments from customers (Text)
1. `Smoking Status`: Smoking status of the insured individual (Categorical: Yes, No)
1. `Exercise Frequency`: Frequency of exercise (Categorical: Daily, Weekly, Monthly, Rarely)
1. `Property Type`: Type of property owned (Categorical: House, Apartment, Condo)
1. `Premium Amount`: Target variable representing the insurance premium amount (Numerical, skewed)

We expect the 10 features most strongly related to `Premium Amount` (the target variable) to be:
1. `Policy Type`
1. `Vehicle Age`
1. `Previous Claims`
1. `Annual Income`
1. `Credit Score`
1. `Age`
1. `Insurance Duration`
1. `Marital Status`
1. `Occupation`
1. `Location`
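
The ranking above is a hypothesis. One quick way to check it against the data is a rank correlation between each numerical column and the target. The sketch below uses a small synthetic frame with made-up column names (`strong`, `weak`, `target`); on the real data the same helper could be called with `df` and `"Premium Amount"`. It ranks the values and then takes the Pearson correlation, which matches Spearman's coefficient when there are no missing values, and it only captures monotonic relationships for numerical columns.

```python
import numpy as np
import pandas as pd


def rank_correlation_with_target(frame, target):
    """Rank-based correlation of every numerical column with the target,
    sorted by absolute strength (strongest first)."""
    numeric = frame.select_dtypes(include=[np.number])
    corr = numeric.rank().corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)


# Synthetic demo data (not the insurance dataset).
rng = np.random.default_rng(0)
strong = rng.normal(size=200)
demo = pd.DataFrame({
    "strong": strong,
    "weak": rng.normal(size=200),
    "target": 3.0 * strong + rng.normal(scale=0.1, size=200),
})
ranking = rank_correlation_with_target(demo, "target")
```

Categorical features such as `Policy Type` would need a different check, for example comparing the target distribution per category.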

In [58]:
df_numerical = df.select_dtypes(include=[np.number])
df_numerical.head()

| | id | Age | Annual Income | Number of Dependents | Health Score | Previous Claims | Vehicle Age | Credit Score | Insurance Duration | Premium Amount |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 19.0 | 10049.0 | 1.0 | 22.598761 | 2.0 | 17.0 | 372.0 | 5.0 | 2869.0 |
| 1 | 1 | 39.0 | 31678.0 | 3.0 | 15.569731 | 1.0 | 12.0 | 694.0 | 2.0 | 1483.0 |
| 2 | 2 | 23.0 | 25602.0 | 3.0 | 47.177549 | 1.0 | 14.0 | NaN | 3.0 | 567.0 |
| 3 | 3 | 21.0 | 141855.0 | 2.0 | 10.938144 | 1.0 | 0.0 | 367.0 | 1.0 | 765.0 |
| 4 | 4 | 21.0 | 39651.0 | 1.0 | 20.376094 | 0.0 | 8.0 | 598.0 | 4.0 | 2022.0 |


In [56]:
df_categorical = df.select_dtypes(include=["object"])
df_categorical.head()

| | Gender | Marital Status | Education Level | Occupation | Location | Policy Type | Policy Start Date | Customer Feedback | Smoking Status | Exercise Frequency | Property Type |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Female | Married | Bachelor's | Self-Employed | Urban | Premium | 2023-12-23 15:21:39.134960 | Poor | No | Weekly | House |
| 1 | Female | Divorced | Master's | NaN | Rural | Comprehensive | 2023-06-12 15:21:39.111551 | Average | Yes | Monthly | House |
| 2 | Male | Divorced | High School | Self-Employed | Suburban | Premium | 2023-09-30 15:21:39.221386 | Good | Yes | Weekly | House |
| 3 | Male | Married | Bachelor's | NaN | Rural | Basic | 2024-06-12 15:21:39.226954 | Poor | Yes | Daily | Apartment |
| 4 | Male | Single | Bachelor's | Self-Employed | Rural | Premium | 2021-12-01 15:21:39.252145 | Poor | Yes | Weekly | House |


### Null values

Missing values are imputed per column as follows:

- `Age`: median
- `Annual Income`: median
- `Marital Status`: mode
- `Number of Dependents`: mode
- `Occupation`: "Unknown" class
- `Health Score`: mean
- `Previous Claims`: median
- `Vehicle Age`: median
- `Credit Score`: median (alternatively, it could be binned with an added "Unknown" class)
- `Insurance Duration`: median
- `Customer Feedback`: "No feedback" class

In [72]:
nulls = df.isnull().sum()
nulls[nulls > 0]

Age                      18705
Annual Income            44949
Marital Status           18529
Number of Dependents    109672
Occupation              358075
Health Score             74076
Previous Claims         364029
Vehicle Age                  6
Credit Score            137882
Insurance Duration           1
Customer Feedback        77824
dtype: int64

In [75]:
df_imputed = dataframe_cache.load("0-original")
df_imputed["Age"] = df_imputed["Age"].fillna(df_imputed["Age"].median())
df_imputed["Annual Income"] = df_imputed["Annual Income"].fillna(df_imputed["Annual Income"].median())
df_imputed["Marital Status"] = df_imputed["Marital Status"].fillna(df_imputed["Marital Status"].mode().values[0])
df_imputed["Number of Dependents"] = df_imputed["Number of Dependents"].fillna(df_imputed["Number of Dependents"].mode().values[0])
df_imputed["Occupation"] = df_imputed["Occupation"].fillna("Unknown")
df_imputed["Health Score"] = df_imputed["Health Score"].fillna(df_imputed["Health Score"].mean())
df_imputed["Previous Claims"] = df_imputed["Previous Claims"].fillna(df_imputed["Previous Claims"].median())
df_imputed["Vehicle Age"] = df_imputed["Vehicle Age"].fillna(df_imputed["Vehicle Age"].median())
df_imputed["Credit Score"] = df_imputed["Credit Score"].fillna(df_imputed["Credit Score"].median())
df_imputed["Insurance Duration"] = df_imputed["Insurance Duration"].fillna(df_imputed["Insurance Duration"].median())
df_imputed["Customer Feedback"] = df_imputed["Customer Feedback"].fillna("No feedback")
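
The cell above computes each fill value from the full training file. Since `TEST_FILENAME` will eventually need the same treatment, it may help to learn the fill values once and reuse them. The following is a minimal sketch of a table-driven variant; the helper name `fit_fill_values`, the `(strategy, constant)` plan format, and the tiny demo frames are all assumptions made for illustration.

```python
import pandas as pd


def fit_fill_values(train_df, plan):
    """Learn one fill value per column from the training data only.

    plan maps column -> (strategy, constant), where strategy is one of
    "median", "mean", "mode" or "constant".
    """
    fill = {}
    for column, (strategy, constant) in plan.items():
        if strategy == "median":
            fill[column] = train_df[column].median()
        elif strategy == "mean":
            fill[column] = train_df[column].mean()
        elif strategy == "mode":
            fill[column] = train_df[column].mode().iloc[0]
        elif strategy == "constant":
            fill[column] = constant
        else:
            raise ValueError(f"Unknown strategy: {strategy}")
    return fill


# Tiny demo frames (not the real data).
train_demo = pd.DataFrame({
    "Age": [20.0, None, 40.0],
    "Marital Status": ["Single", None, "Single"],
    "Occupation": [None, "Employed", "Employed"],
})
test_demo = pd.DataFrame({
    "Age": [float("nan")],
    "Marital Status": [None],
    "Occupation": [None],
})
plan = {
    "Age": ("median", None),
    "Marital Status": ("mode", None),
    "Occupation": ("constant", "Unknown"),
}
fill_values = fit_fill_values(train_demo, plan)
# The same learned values are applied to both frames.
train_filled = train_demo.fillna(value=fill_values)
test_filled = test_demo.fillna(value=fill_values)
```

Keeping the plan in one dictionary also makes it easy to compare against the bullet list of strategies.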

In [80]:
df_imputed.isnull().sum().sum()

np.int64(0)

In [81]:
dataframe_cache.save("1-imputed", df_imputed)

### Encoding

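Each categorical column is encoded as follows (one-hot columns drop the first category):

- `Gender`: one-hot encoding
- `Marital Status`: one-hot encoding
- `Education Level`: ordinal encoding (High School < Bachelor's < Master's < PhD)
- `Occupation`: one-hot encoding
- `Location`: one-hot encoding
- `Policy Type`: ordinal encoding (Basic < Comprehensive < Premium)
- `Policy Start Date`: year extraction
- `Customer Feedback`: ordinal encoding (Poor < Average < Good)
- `Smoking Status`: binary encoding (No = 0, Yes = 1)
- `Exercise Frequency`: ordinal encoding (Rarely < Monthly < Weekly < Daily)
- `Property Type`: one-hot encoding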

In [101]:
df_encoded = dataframe_cache.load("1-imputed")
df_encoded["Education Level"] = df_encoded["Education Level"].map({"High School": 0, "Bachelor's": 1, "Master's": 2, "PhD": 3})
df_encoded["Policy Type"] = df_encoded["Policy Type"].map({"Basic": 0, "Comprehensive": 1, "Premium": 2})
df_encoded["Policy Start Year"] = pd.to_datetime(df_encoded["Policy Start Date"]).dt.year
df_encoded.drop(columns=["Policy Start Date"], inplace=True)
# Note: the imputed "No feedback" class is not in this mapping, so those rows become NaN again.
df_encoded["Customer Feedback"] = df_encoded["Customer Feedback"].map({"Poor": 0, "Average": 1, "Good": 2})
df_encoded["Smoking Status"] = df_encoded["Smoking Status"].map({"No": 0, "Yes": 1})
df_encoded["Exercise Frequency"] = df_encoded["Exercise Frequency"].map({"Rarely": 0, "Monthly": 1, "Weekly": 2, "Daily": 3})
df_encoded = pd.get_dummies(df_encoded, columns=["Gender", "Marital Status", "Occupation", "Location", "Property Type"], drop_first=True)
df_encoded.head()

| | id | Age | Annual Income | Number of Dependents | Education Level | Health Score | Policy Type | Previous Claims | Vehicle Age | Credit Score | ... | Gender_Male | Marital Status_Married | Marital Status_Single | Occupation_Self-Employed | Occupation_Unemployed | Occupation_Unknown | Location_Suburban | Location_Urban | Property Type_Condo | Property Type_House |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 19.0 | 10049.0 | 1.0 | 1 | 22.598761 | 2 | 2.0 | 17.0 | 372.0 | ... | False | True | False | True | False | False | False | True | False | True |
| 1 | 1 | 39.0 | 31678.0 | 3.0 | 2 | 15.569731 | 1 | 1.0 | 12.0 | 694.0 | ... | False | False | False | False | False | True | False | False | False | True |
| 2 | 2 | 23.0 | 25602.0 | 3.0 | 0 | 47.177549 | 2 | 1.0 | 14.0 | 595.0 | ... | True | False | False | True | False | False | True | False | False | True |
| 3 | 3 | 21.0 | 141855.0 | 2.0 | 1 | 10.938144 | 0 | 1.0 | 0.0 | 367.0 | ... | True | True | False | False | False | True | False | False | False | False |
| 4 | 4 | 21.0 | 39651.0 | 1.0 | 1 | 20.376094 | 2 | 0.0 | 8.0 | 598.0 | ... | True | False | True | True | False | False | False | False | False | True |


In [102]:
dataframe_cache.save("2-encoded", df_encoded)

### Scaling

In [126]:
df_scaled = dataframe_cache.load("2-encoded")

scaler = StandardScaler()
columns_to_scale = df_scaled.columns.difference(["Premium Amount"])
df_scaled[columns_to_scale] = scaler.fit_transform(df_scaled[columns_to_scale])
df_scaled.head()

| | id | Age | Annual Income | Number of Dependents | Education Level | Health Score | Policy Type | Previous Claims | Vehicle Age | Credit Score | ... | Gender_Male | Marital Status_Married | Marital Status_Single | Occupation_Self-Employed | Occupation_Unemployed | Occupation_Unknown | Location_Suburban | Location_Urban | Property Type_Condo | Property Type_House |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -1.732049 | -1.648301 | -0.707414 | -0.796935 | -0.46541 | -0.255071 | 1.221087 | 1.216739 | 1.286338 | -1.567375 | ... | -1.004294 | 1.429421 | -0.725646 | 1.801557 | -0.547217 | -0.652154 | -0.709152 | 1.420839 | -0.706673 | 1.413289 |
| 1 | -1.732046 | -0.159542 | -0.023289 | 0.651486 | 0.433367 | -0.849704 | -0.003359 | -0.002284 | 0.420713 | 0.71463 | ... | -1.004294 | -0.699584 | -0.725646 | -0.555075 | -0.547217 | 1.53338 | -0.709152 | -0.703809 | -0.706673 | 1.413289 |
| 2 | -1.732044 | -1.350549 | -0.215473 | 0.651486 | -1.364188 | 1.824212 | 1.221087 | -0.002284 | 0.766963 | 0.01302 | ... | 0.995724 | -0.699584 | -0.725646 | 1.801557 | -0.547217 | -0.652154 | 1.410135 | -0.703809 | -0.706673 | 1.413289 |
| 3 | -1.732041 | -1.499425 | 3.461605 | -0.072725 | -0.46541 | -1.241521 | -1.227805 | -0.002284 | -1.656788 | -1.60281 | ... | 0.995724 | 1.429421 | -0.725646 | -0.555075 | -0.547217 | 1.53338 | -0.709152 | -0.703809 | -0.706673 | -0.70757 |
| 4 | -1.732038 | -1.499425 | 0.228896 | -0.796935 | -0.46541 | -0.443102 | 1.221087 | -1.221307 | -0.271787 | 0.034281 | ... | 0.995724 | -0.699584 | 1.378082 | 1.801557 | -0.547217 | -0.652154 | -0.709152 | -0.703809 | -0.706673 | 1.413289 |


In [127]:
dataframe_cache.save("3-scaled", df_scaled)
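
The scaling cell fits `StandardScaler` on every row and on every column except the target, which includes `id`. A common alternative is to split first, fit the scaler on the training part only, and leave identifier columns alone, so that no statistics from the hold-out rows reach the model. The following is a minimal sketch on made-up data; the frame `demo` and its values are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Made-up demo data (not the insurance dataset).
demo = pd.DataFrame({
    "id": range(8),
    "Age": [19.0, 39.0, 23.0, 21.0, 21.0, 45.0, 60.0, 33.0],
    "Annual Income": [10049.0, 31678.0, 25602.0, 141855.0,
                      39651.0, 52000.0, 8700.0, 61000.0],
    "Premium Amount": [2869.0, 1483.0, 567.0, 765.0,
                       2022.0, 1100.0, 950.0, 1300.0],
})

# Scale only real features: not the identifier and not the target.
feature_cols = demo.columns.difference(["id", "Premium Amount"])

train_part, test_part = train_test_split(demo, test_size=0.25, random_state=42)
train_part = train_part.copy()
test_part = test_part.copy()

# Fit on the training rows only, then apply the same transform to both parts.
scaler = StandardScaler().fit(train_part[feature_cols])
train_part[feature_cols] = scaler.transform(train_part[feature_cols])
test_part[feature_cols] = scaler.transform(test_part[feature_cols])
```

For tree-based models such as XGBoost, scaling usually has little effect on the result, so the main benefit here is a cleaner evaluation protocol.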

## 3. Model training

In [128]:
df_train = dataframe_cache.load("3-scaled")

In [134]:
# XGBoost
@cache(CACHE_MODELS_PATH / "xgboost.joblib")
def train_xgboost(X_train, y_train):
    model = XGBRegressor(objective="reg:squarederror")
    model.fit(X_train, y_train)
    return model

target = "Premium Amount"
X = df_train.drop(columns=[target])
y = df_train[target]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = train_xgboost(X_train, y_train)

y_true = y_test
y_pred = model.predict(X_test)
root_mean_squared_log_error(y_true, y_pred)

np.float64(1.1387584826591357)
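
The model above is trained with a squared-error objective on the raw target but evaluated with RMSLE, which measures error on `log1p` values. Two things may help interpret and improve the score: a constant baseline that is optimal for RMSLE, and training on the log-transformed target. The sketch below illustrates both on synthetic data. `LinearRegression` is only a stand-in so the example stays light; in this notebook `XGBRegressor` could be passed as `regressor` instead. All names ending in `_demo` are hypothetical.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression


def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error, written out with NumPy."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))


def best_constant_for_rmsle(y):
    """The single value that minimises RMSLE: the mean taken in log1p space."""
    return float(np.expm1(np.mean(np.log1p(y))))


# Synthetic, positive, right-skewed target (not the insurance data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 3))
log_signal = 5.0 + X_demo @ np.array([0.5, -0.3, 0.2])
y_demo = np.exp(log_signal + rng.normal(scale=0.1, size=500))

# Baseline: predict the same constant for every row.
constant = best_constant_for_rmsle(y_demo)
baseline_score = rmsle(y_demo, np.full(len(y_demo), constant))

# Fit in log1p space and map predictions back with expm1.
model_demo = TransformedTargetRegressor(
    regressor=LinearRegression(), func=np.log1p, inverse_func=np.expm1
)
model_demo.fit(X_demo, y_demo)
model_score = rmsle(y_demo, model_demo.predict(X_demo))
```

Minimising squared error on `log1p(y)` is the same as minimising RMSLE, so the transformed model optimises the competition metric directly. Comparing the hold-out score against the constant baseline shows how much the features actually contribute.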