# TI3145TU Final Assignment
## Health Insurance 

We hope you enjoy this assignment, good luck!

Student names: XXX

Student numbers: XXX

## Part A - Data Exploration and Preprocessing

### Imports

In [1]:
import pandas as pd
import numpy as np

from sklearn.preprocessing import StandardScaler, OneHotEncoder, MinMaxScaler, PolynomialFeatures
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import SGDRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import cross_val_score

import category_encoders as ce

### Load data

In [2]:
# These are your training samples along with their labels
data = pd.read_csv('/kaggle/input/health-insurance/health_insurance_train.csv')

# You need to extract the features and the regression target. The regression target is 'whrswk'.
X = data.drop('whrswk', axis=1)
y = data['whrswk']

print(data.head())

   whrswk  hhi  whi hhi2   education   race hispanic  experience  kidslt6  \
0    40.0   no  yes  yes  13-15years  white       no        17.0      0.0   
1    40.0   no  yes  yes  13-15years  white       no         4.0      1.0   
2     0.0  yes   no  yes     16years  white       no        21.0      0.0   
3    40.0   no   no  yes  13-15years  white       no        22.0      NaN   
4    35.0   no  yes   no     12years  white       no        15.0      0.0   

   kids618   husby        region  
0      1.0  22.000         south  
1      0.0  15.000         south  
2      1.0  99.999         other  
3      NaN  60.000  northcentral  
4      2.0   0.000         south  



In [3]:
pd.set_option('display.max_rows', 200)
print(data.head(20))

    whrswk  hhi  whi hhi2   education   race hispanic  experience  kidslt6  \
0     40.0   no  yes  yes  13-15years  white       no        17.0      0.0   
1     40.0   no  yes  yes  13-15years  white       no         4.0      1.0   
2      0.0  yes   no  yes     16years  white       no        21.0      0.0   
3     40.0   no   no  yes  13-15years  white       no        22.0      NaN   
4     35.0   no  yes   no     12years  white       no        15.0      0.0   
5     20.0   no   no   no   9-11years  white       no        25.5      0.0   
6      0.0  yes   no  yes  13-15years  white       no        45.0      0.0   
7     40.0  yes  yes  yes  13-15years  black       no        23.0      0.0   
8      5.0  yes   no  yes  13-15years    NaN       no        20.0      1.0   
9     40.0   no  yes  yes     12years  white       no         9.0      1.0   
10    40.0   no   no   no     12years  white       no        33.0      0.0   
11    40.0   no   no   no   9-11years  white       no         5.


#### Checking for anomalous data

In [4]:
cols = ['whrswk', 'hhi', 'whi', 'hhi2', 'education', 'race','hispanic',
        'experience', 'kidslt6', 'kids618', 'husby', 'region']

for col in cols:
    zeros = (data[col] == 0).sum()         # exact zeros (only meaningful for numeric columns)
    others = (data[col] == 'other').sum()  # catch-all 'other' category
    nans = data[col].isna().sum()          # missing values
    print(f"{col}: {zeros} zeros, {nans} NaNs, {others} other")

whrswk: 1477 zeros, 0 NaNs, 0 other
hhi: 0 zeros, 0 NaNs, 0 other
whi: 0 zeros, 0 NaNs, 0 other
hhi2: 0 zeros, 0 NaNs, 0 other
education: 0 zeros, 0 NaNs, 0 other
race: 0 zeros, 500 NaNs, 37 other
hispanic: 0 zeros, 0 NaNs, 0 other
experience: 6 zeros, 0 NaNs, 0 other
kidslt6: 3379 zeros, 500 NaNs, 0 other
kids618: 2641 zeros, 500 NaNs, 0 other
husby: 950 zeros, 0 NaNs, 0 other
region: 0 zeros, 0 NaNs, 1179 other
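
The per-column loop above can also be written as a single vectorized summary. The following is a minimal sketch on a made-up toy frame; the `toy` values and the `summary` name are illustrative assumptions, not part of the assignment data.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the training data (values are made up).
toy = pd.DataFrame({
    "kidslt6": [0.0, 1.0, np.nan, 0.0],
    "race": ["white", np.nan, "other", "black"],
    "region": ["south", "other", "other", "west"],
})

# One row per column, one count per kind of suspicious value.
summary = pd.DataFrame({
    "zeros": toy.eq(0).sum(),        # exact zeros
    "nans": toy.isna().sum(),        # missing values
    "other": toy.eq("other").sum(),  # catch-all category
})
print(summary)
```

On the real data, replacing `toy` with `data` should reproduce the counts printed by the loop.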


### Pipeline 1

In [5]:
Num = ['experience', 'kidslt6', 'kids618', 'husby']
Bin_cat = ['hhi', 'whi', 'hhi2', 'hispanic']
Multi_cat = ['race', 'region', 'education']


num_pipe = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])


cat_pipe = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False)),
    ('scaler', StandardScaler())
])


prep_1 = ColumnTransformer(
    transformers=[
        ('num_scaled', num_pipe, Num),
        ('bin_cat_ohe', cat_pipe, Bin_cat), 
        ('multi_cat_ohe', cat_pipe, Multi_cat) 
    ],
    remainder='drop'
)

In [6]:
X_transformed = prep_1.fit_transform(X)

print(f"1. Initial number of rows and columns (X): {X.shape}")
print(f"   Number of rows and columns after P1:    {X_transformed.shape}")

# Use .get_feature_names_out() to inspect the column names.
feature_names = prep_1.get_feature_names_out()
print(f"\n2. Total number of features after P1: {len(feature_names)}")
print("   All feature names:")
print(feature_names)

X_transformed_df = pd.DataFrame(X_transformed, columns=feature_names)
# print("\n3. First 5 rows of the transformed data (check scaling/OHE):")
# print(X_transformed_df.head())

1. Initial number of rows and columns (X): (5000, 11)
   Number of rows and columns after P1:    (5000, 25)

2. Total number of features after P1: 25
   All feature names:
['num_scaled__experience' 'num_scaled__kidslt6' 'num_scaled__kids618'
 'num_scaled__husby' 'bin_cat_ohe__hhi_no' 'bin_cat_ohe__hhi_yes'
 'bin_cat_ohe__whi_no' 'bin_cat_ohe__whi_yes' 'bin_cat_ohe__hhi2_no'
 'bin_cat_ohe__hhi2_yes' 'bin_cat_ohe__hispanic_no'
 'bin_cat_ohe__hispanic_yes' 'multi_cat_ohe__race_black'
 'multi_cat_ohe__race_other' 'multi_cat_ohe__race_white'
 'multi_cat_ohe__region_northcentral' 'multi_cat_ohe__region_other'
 'multi_cat_ohe__region_south' 'multi_cat_ohe__region_west'
 'multi_cat_ohe__education_12years' 'multi_cat_ohe__education_13-15years'
 'multi_cat_ohe__education_16years' 'multi_cat_ohe__education_9-11years'
 'multi_cat_ohe__education_<9years' 'multi_cat_ohe__education_>16years']
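
The jump from 11 to 25 columns can be checked by hand from the feature names above: each numeric column stays a single column, and each categorical column becomes one column per level. A small sketch of that arithmetic follows; the level counts are read off the printed names, and the variable names are illustrative.

```python
# Level counts read off the feature names printed above.
n_numeric = 4  # experience, kidslt6, kids618, husby
binary_levels = {"hhi": 2, "whi": 2, "hhi2": 2, "hispanic": 2}
multi_levels = {"race": 3, "region": 4, "education": 6}

# One output column per numeric feature plus one per category level.
n_features = n_numeric + sum(binary_levels.values()) + sum(multi_levels.values())
print(n_features)  # 4 + 8 + 13 = 25
```

Missing `race` values do not create an extra column because the imputer fills them with the most frequent level before encoding.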


### Pipeline 2

In [7]:
num_pipe2 = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median'))
])

cat_pipe2 = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])

prep_2_unscaled = ColumnTransformer(
    transformers=[
        ('num_raw', num_pipe2, Num),
        ('bin_cat_ohe', cat_pipe2, Bin_cat),
        ('multi_cat_ohe', cat_pipe2, Multi_cat)
    ],
    remainder='drop'
)

prep_2_scaled = Pipeline(steps=[
    ('base_prep', prep_2_unscaled),
    ('poly', PolynomialFeatures(degree=2, include_bias=False)),
    ('final_scaler', StandardScaler())
])

In [8]:
# Sanity check of the feature expansion on the P1 output.
# Note: interaction_only=True is used here, whereas prep_2_scaled above
# uses the full degree-2 expansion, which also adds the squared terms.
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
features_poly = poly.fit_transform(X_transformed)

print("Original features:", X_transformed.shape[1])
print("Polynomial features:", features_poly.shape[1])

Original features: 25
Polynomial features: 325
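
The count of 325 follows from simple combinatorics: with `interaction_only=True` the output holds the 25 original columns plus one product per pair of distinct columns. The sketch below (the helper name `n_poly_features` is an illustrative assumption) also shows what the full degree-2 expansion in `prep_2_scaled` would produce for the same 25 inputs.

```python
from math import comb

def n_poly_features(n, interaction_only):
    """Number of degree-2 output columns for n inputs, without the bias column."""
    if interaction_only:
        return n + comb(n, 2)      # originals + pairwise products
    return comb(n + 2, 2) - 1      # all monomials of degree 1 and 2

print(n_poly_features(25, interaction_only=True))   # 325
print(n_poly_features(25, interaction_only=False))  # 350
```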


## Part B - Regression with Default Hyperparameters

### Define models

In [9]:
models_p1 = {
    'KNN Regression': Pipeline(steps=[
        ('preprocessor', prep_1), 
        ('regressor', KNeighborsRegressor(n_neighbors=5))
    ]),
    'SGD Linear Regression': Pipeline(steps=[
        ('preprocessor', prep_1), 
        ('regressor', SGDRegressor(random_state=42))
    ]),   
    'Random Forest (RF)': Pipeline(steps=[
        ('preprocessor', prep_1), 
        ('regressor', RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1))  # n_jobs=-1 for faster training
    ]),    
    'Regression Tree (RT)': Pipeline(steps=[
        ('preprocessor', prep_1), 
        ('regressor', DecisionTreeRegressor(random_state=42))
    ])
}


models_p2 = {
    'KNN Regression (P2)': Pipeline(steps=[
        ('preprocessor', prep_2_scaled), 
        ('regressor', KNeighborsRegressor(n_neighbors=5))
    ]),
    'SGD Linear Regression (P2)': Pipeline(steps=[
        ('preprocessor', prep_2_scaled), 
        ('regressor', SGDRegressor(random_state=42))
    ]),
    'Random Forest (P2)': Pipeline(steps=[
        ('preprocessor', prep_2_scaled), 
        ('regressor', RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1))
    ]),
    'Regression Tree (P2)': Pipeline(steps=[
        ('preprocessor', prep_2_scaled), 
        ('regressor', DecisionTreeRegressor(random_state=42))
    ])
}

In [10]:
print("\n" + "="*50)
print("--- Training the 4 models (with Pipeline 1) ---")
print("="*50)

trained_models = {}
for name, model_pipeline in models_p1.items():
    # Train on the full dataset (X, y)
    print(f"Start training: {name}...")
    model_pipeline.fit(X, y)
    trained_models[name] = model_pipeline
    print(f"✅ {name} trained and stored.")

print("\nAll models are trained. They are now ready to be evaluated on the unseen test dataset (MAE).")


--- Training the 4 models (with Pipeline 1) ---
Start training: KNN Regression...
✅ KNN Regression trained and stored.
Start training: SGD Linear Regression...
✅ SGD Linear Regression trained and stored.
Start training: Random Forest (RF)...
✅ Random Forest (RF) trained and stored.
Start training: Regression Tree (RT)...
✅ Regression Tree (RT) trained and stored.

All models are trained. They are now ready to be evaluated on the unseen test dataset (MAE).


### MAE

In [11]:
baseline_guess = y.median()

# Constant prediction: the median for every sample
y_baseline_pred = np.full_like(y, baseline_guess)
mae_baseline = mean_absolute_error(y, y_baseline_pred)

print("--- Baseline Model MAE ---")
print(f"Median hours worked (baseline guess): {baseline_guess:.2f} hours")
print(f"MAE of this baseline guess: {mae_baseline:.2f} hours")

--- Baseline Model MAE ---
Median hours worked (baseline guess): 35.00 hours
MAE of this baseline guess: 15.92 hours
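
The median is a natural baseline for MAE because no other constant guess gives a lower mean absolute error. A minimal sketch on made-up hours follows; the `hours` list and the helper `mae_of_constant` are illustrative assumptions, not assignment data.

```python
from statistics import mean, median

def mae_of_constant(values, guess):
    """Mean absolute error when predicting the same guess for every value."""
    return sum(abs(v - guess) for v in values) / len(values)

hours = [0, 0, 20, 35, 40, 40, 40]  # made-up weekly hours

mae_median = mae_of_constant(hours, median(hours))
mae_mean = mae_of_constant(hours, mean(hours))
print(f"median guess: {mae_median:.2f}, mean guess: {mae_mean:.2f}")
```

Any model worth keeping should therefore beat the 15.92 hours reported for the baseline above.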


In [12]:
all_models = {**models_p1, **models_p2}
results_mae = {}
K_FOLDS = 5

for name, model in all_models.items():
    print(f"Start CV for: {name}...")

    # Run cross-validation
    scores = cross_val_score(
        model,
        X, y,
        scoring='neg_mean_absolute_error',
        cv=K_FOLDS,
        n_jobs=-1
    )

    mean_mae = -scores.mean()  # scores are negated MAEs, so flip the sign
    std_mae = scores.std()
    results_mae[name] = mean_mae

    print(f"✅ {name:<30} - Mean MAE: {mean_mae:.2f} hours (+/- {std_mae:.2f} hours)")

print("\n--- Summary of generalization performance (P1 vs. P2 Poly) ---")
results_df = pd.Series(results_mae).sort_values()
print(results_df)

Start CV for: KNN Regression...
✅ KNN Regression                 - Mean MAE: 12.25 hours (+/- 0.34 hours)
Start CV for: SGD Linear Regression...
✅ SGD Linear Regression          - Mean MAE: 12.30 hours (+/- 0.32 hours)
Start CV for: Random Forest (RF)...
✅ Random Forest (RF)             - Mean MAE: 11.80 hours (+/- 0.34 hours)
Start CV for: Regression Tree (RT)...
✅ Regression Tree (RT)           - Mean MAE: 14.22 hours (+/- 0.11 hours)
Start CV for: KNN Regression (P2)...
✅ KNN Regression (P2)            - Mean MAE: 12.35 hours (+/- 0.31 hours)
Start CV for: SGD Linear Regression (P2)...
✅ SGD Linear Regression (P2)     - Mean MAE: 9409100856.13 hours (+/- 10243813289.61 hours)
Start CV for: Random Forest (P2)...
✅ Random Forest (P2)             - Mean MAE: 11.90 hours (+/- 0.33 hours)
Start CV for: Regression Tree (P2)...
✅ Regression Tree (P2)           - Mean MAE: 14.06 hours (+/- 0.50 hours)

--- Summary of generalization performance (P1 vs. P2 Poly) ---
Random Forest (RF)            1.180247e+01
R
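
The exploding MAE of the SGD model under Pipeline 2 is consistent with a step size that is too large for the number of features, although this was not verified on the assignment data. The sketch below assumes plain per-sample SGD on a noise-free squared loss with a constant learning rate of 0.01, the default `eta0` of `SGDRegressor`; the real estimator additionally decays the rate and applies an L2 penalty. The function `sgd_weight_error` and the synthetic Gaussian features are illustrative assumptions.

```python
import numpy as np

def sgd_weight_error(n_features, lr=0.01, n_steps=5000, seed=0):
    """Distance to the true weights after constant-rate SGD on synthetic data."""
    rng = np.random.default_rng(seed)
    w_true = np.ones(n_features)
    w = np.zeros(n_features)
    for _ in range(n_steps):
        x = rng.standard_normal(n_features)  # one standardized sample
        residual = w @ x - w_true @ x        # prediction error on this sample
        w -= lr * residual * x               # squared-loss SGD step
    return np.linalg.norm(w - w_true)

# 25 inputs (as in Pipeline 1) versus 350 (full degree-2 expansion of 25 inputs)
print("25 features: ", sgd_weight_error(25))
print("350 features:", sgd_weight_error(350))
```

In this toy setting the error shrinks toward zero with 25 features and grows by many orders of magnitude with 350, because each update scales with the squared norm of the sample. Lowering `eta0` or reducing the polynomial expansion would be natural things to try for the P2 model.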

# Autograder 

In the autograder you will need to provide two things: 1) an estimate of the MAE of your model on unseen data, and 2) the predictions on the autograder data. For the autograder data we provide only the features, not the regression targets. You therefore cannot compute the MAE on this data yourself; you need to estimate it from the data provided above.

In [13]:
data_autograder = pd.read_csv('/kaggle/input/autograder/health_insurance_autograde.csv')
data_autograder.head()


Unnamed: 0,hhi,whi,hhi2,education,race,hispanic,experience,kidslt6,kids618,husby,region
0,no,yes,no,16years,white,no,25.0,0.0,0.0,0.0,west
1,no,no,no,12years,white,no,41.0,0.0,0.0,0.0,south
2,yes,yes,yes,>16years,white,no,21.0,0.0,2.0,34.3,other
3,no,yes,yes,9-11years,,no,46.5,0.0,0.0,0.0,other
4,no,no,no,13-15years,white,no,9.0,0.0,0.0,25.0,northcentral


In [14]:
# TODO Replace this with your own estimate of the MAE of your best model
estimate_MAE_on_new_data = np.array([1.0])

# TODO Replace this with the predictions of your best model
# via e.g. prediction = model.predict(data_autograder)
predictions_autograder_data = np.array([-1] * 17272)

# Upload this file to the Vocareum autograder:
result = np.append(estimate_MAE_on_new_data, predictions_autograder_data)
pd.DataFrame(result).to_csv("autograder_submission.txt", index=False, header=False)
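
Before writing the file, it can help to check that the submission has the layout the template implies: one MAE estimate followed by 17272 predictions. The following is a minimal sketch; `build_submission` is a hypothetical helper, and the demo inputs are placeholders rather than real model output.

```python
import numpy as np

def build_submission(mae_estimate, predictions, n_expected=17272):
    """Put the MAE estimate in front of the predictions, as the template does."""
    predictions = np.asarray(predictions, dtype=float).ravel()
    if predictions.shape[0] != n_expected:
        raise ValueError(f"expected {n_expected} predictions, got {predictions.shape[0]}")
    if not np.isfinite(predictions).all():
        raise ValueError("predictions contain NaN or infinite values")
    return np.append(np.array([float(mae_estimate)]), predictions)

# Placeholder inputs: in the notebook these would be a cross-validated MAE
# estimate and the output of model.predict(...) on the autograder features.
demo = build_submission(12.0, np.full(17272, 35.0))
print(demo.shape)  # (17273,)
```

The returned array can be passed to `pd.DataFrame(...).to_csv(...)` exactly as in the template cell.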