# **Experiment Notebook**



---
## Setup Environment

In [None]:
# DO NOT MODIFY THE CODE IN THIS CELL
!pip install -q utstd

from utstd.folders import *
from utstd.ipyrenders import *

at = AtFolder(
    course_code=36106,
    assignment="AT1",
)
at.run()

import warnings
warnings.simplefilter(action='ignore')

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
umap-learn 0.5.9.post2 requires scikit-learn>=1.6, but you have scikit-learn 1.5.2 which is incompatible.
Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).

You can now save your data files in: /content/gdrive/MyDrive/36106/assignment/AT1/data


---
## Student Information

In [None]:
student_name = "Parisasadat Kalaki"
student_id = "25969686"

In [None]:
# DO NOT MODIFY THE CODE IN THIS CELL
print_tile(size="h1", key='student_name', value=student_name)

In [None]:
# DO NOT MODIFY THE CODE IN THIS CELL
print_tile(size="h1", key='student_id', value=student_id)

---
## 0. Python Packages

### 0.a Install Additional Packages

> If you are using additional packages, you need to install them here using the command: `! pip install <package_name>`

In [None]:
# <Student to fill this section and then remove this comment>

### 0.b Import Packages

In [None]:
# DO NOT MODIFY THE CODE IN THIS CELL
import pandas as pd
import altair as alt
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error


---
## A. Experiment Description

In [None]:
# DO NOT MODIFY THE CODE IN THIS CELL
experiment_id = "1"
print_tile(size="h1", key='experiment_id', value=experiment_id)

In [None]:
experiment_hypothesis = """
The hypothesis is that Ridge Linear Regression will improve the prediction of net premium amount compared to the dummy baseline, because it reduces the impact of multicollinearity among vehicle-related features while keeping all predictors in the model. This is worthwhile since many features appear moderately informative, and Ridge’s coefficient shrinkage should lead to better generalization and lower test MAE.

"""

In [None]:
# DO NOT MODIFY THE CODE IN THIS CELL
print_tile(size="h3", key='experiment_hypothesis', value=experiment_hypothesis)

In [None]:
experiment_expectations = """
I expect Ridge Regression to significantly outperform the dummy baseline (which simply predicts the mean premium). Because Ridge applies L2 regularization, it should handle feature correlations more effectively and lead to more stable predictions. The goal is to achieve a substantially lower MAE than the dummy baseline.

Possible Scenarios

Ridge >> Dummy → Hypothesis confirmed; Ridge is able to extract signal from features.

Ridge slightly better than Dummy → Linear patterns exist, but the predictive power is weak; may need feature engineering or nonlinear models.

Ridge ≈ Dummy → Ridge fails to capture useful patterns; features may lack predictive power for the target.

Ridge underperforms Dummy → Possible data leakage, preprocessing errors, or inappropriate feature scaling.
"""

In [None]:
# DO NOT MODIFY THE CODE IN THIS CELL
print_tile(size="h3", key='experiment_expectations', value=experiment_expectations)
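
The dummy baseline referred to in the expectations is not computed in this notebook. As a minimal sketch of how such a comparison could be set up, the snippet below fits scikit-learn's `DummyRegressor` (mean strategy) and a `Ridge` model on the same split. The arrays `X_demo` and `y_demo` are synthetic stand-ins invented for illustration, not the assignment data.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error

# Synthetic stand-in data (assumption: NOT the assignment dataset)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = 500 + 40 * X_demo[:, 0] - 25 * X_demo[:, 1] + rng.normal(scale=5, size=200)

# Simple ordered split into training and validation parts
X_tr, X_va = X_demo[:150], X_demo[150:]
y_tr, y_va = y_demo[:150], y_demo[150:]

# Baseline: always predicts the mean of the training target
dummy = DummyRegressor(strategy="mean").fit(X_tr, y_tr)
dummy_mae = mean_absolute_error(y_va, dummy.predict(X_va))

# Ridge on the same split, for a like-for-like comparison
ridge = Ridge(alpha=1.0).fit(X_tr, y_tr)
ridge_mae = mean_absolute_error(y_va, ridge.predict(X_va))

print(f"Dummy MAE: {dummy_mae:.2f} | Ridge MAE: {ridge_mae:.2f}")
```

Reporting both numbers side by side makes it possible to tell which of the four scenarios listed above actually occurred.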

---
## B. Feature Selection


In [None]:
# DO NOT MODIFY THE CODE IN THIS CELL
# Load data
try:
  X_train = pd.read_csv(at.folder_path / 'X_train.csv')
  y_train = pd.read_csv(at.folder_path / 'y_train.csv')

  X_val = pd.read_csv(at.folder_path / 'X_val.csv')
  y_val = pd.read_csv(at.folder_path / 'y_val.csv')

  X_test = pd.read_csv(at.folder_path / 'X_test.csv')
  y_test = pd.read_csv(at.folder_path / 'y_test.csv')
except Exception as e:
  print(e)

In [None]:
X_val.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10700 entries, 0 to 10699
Data columns (total 20 columns):
 #   Column                               Non-Null Count  Dtype  
---  ------                               --------------  -----  
 0   vehicle_value                        10700 non-null  float64
 1   vehicle_weight                       10700 non-null  float64
 2   vehicle_horsepower                   10700 non-null  float64
 3   vehicle_cylinder                     10700 non-null  float64
 4   matriculation_year                   10700 non-null  float64
 5   seniority                            10700 non-null  float64
 6   current_policies_held                10700 non-null  float64
 7   max_products_held                    10700 non-null  float64
 8   lapsed_policies                      10700 non-null  float64
 9   total_claims_cost_in_current_year    10700 non-null  float64
 10  total_claims_number_in_current_year  10700 non-null  float64
 11  total_claims_number_in_histo

In [None]:
features_list = ['vehicle_value', 'vehicle_weight', 'vehicle_horsepower',
       'vehicle_cylinder', 'matriculation_year', 'seniority',
       'current_policies_held', 'max_products_held',
       'lapsed_policies', 'total_claims_cost_in_current_year',
       'total_claims_number_in_current_year', 'total_claims_number_in_history',
       'total_claims_number_ratio', 'vehicle_doors', 'distribution_channel',
       'vehicle_fuel_type', 'policy_type', 'second_driver',
       'driving_experience_years',
       'years_since_last_renewal']

In [58]:
feature_selection_explanations = """
For modeling, features that could cause data leakage or reveal future outcomes (payment_method, lapsed_date, next_renewal_date, and contract_start_date) were removed, because this information would not be available at prediction time. max_policies_held and vehicle_length were removed because of multicollinearity. To address ethical considerations and reduce potential bias, gender and age were excluded from the model. Other fields with low predictive value, high missingness, or potential privacy concerns (customer_id, prefix, first_name, last_name, birth_date, phone_number, email, and address-related fields) were also dropped. The remaining features were retained for modeling, balancing predictive relevance with ethical responsibility. net_premium_amount is the target variable.
"""

In [59]:
# DO NOT MODIFY THE CODE IN THIS CELL
print_tile(size="h3", key='feature_selection_explanations', value=feature_selection_explanations)
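
The multicollinearity screening mentioned above is not shown in this notebook. One common way to do such a check is to look at absolute pairwise correlations and flag pairs above a threshold. The sketch below does this on a small synthetic frame; the column names (`length_demo`, `weight_demo`, `doors_demo`) and the 0.8 threshold are illustrative assumptions, not values taken from the assignment.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (assumption: NOT the assignment dataset)
rng = np.random.default_rng(42)
n = 500
length = rng.normal(4.2, 0.4, n)
demo = pd.DataFrame({
    "length_demo": length,
    "weight_demo": 300 * length + rng.normal(0, 20, n),  # strongly tied to length
    "doors_demo": rng.integers(2, 6, n).astype(float),   # unrelated to the others
})

# Absolute Pearson correlations between all numeric columns
corr = demo.corr().abs()

# Collect each unordered pair once and keep those above the threshold
threshold = 0.8  # hypothetical cut-off
cols = list(corr.columns)
high_pairs = [
    (a, b, round(float(corr.loc[a, b]), 3))
    for i, a in enumerate(cols)
    for b in cols[i + 1:]
    if corr.loc[a, b] > threshold
]
print(high_pairs)
```

For each flagged pair, one of the two features can then be dropped, or both kept and left to the Ridge penalty to handle.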

---
## C. Train Machine Learning Model

### C.1 Import Algorithm



In [None]:
from sklearn.linear_model import Ridge

In [None]:
algorithm_selection_explanations = """
Ridge Regression is a good fit for this task because our dataset contains many correlated numeric and categorical features. Ridge addresses this issue by applying L2 regularization, which shrinks coefficients and distributes their influence more evenly, improving model stability. Since our goal is to predict a continuous outcome (net premium amount) from a moderate number of features (20), Ridge provides a balance between interpretability and robustness, making it a strong candidate as a first regularized linear model."""

In [None]:
# DO NOT MODIFY THE CODE IN THIS CELL
print_tile(size="h3", key='algorithm_selection_explanations', value=algorithm_selection_explanations)

### C.2 Set Hyperparameters


In [None]:
alpha_values = [0.001, 0.01, 0.1, 1, 10, 100]

In [None]:
hyperparameters_selection_explanations = """
For Ridge Regression, the main hyperparameter to tune is alpha.
It controls how strongly coefficients are penalized. A very small α (close to 0) makes Ridge behave like ordinary linear regression, which risks overfitting if multicollinearity is present. A very large α forces coefficients to shrink too much, leading to underfitting. Tuning α helps us find the balance where the model generalizes best on unseen data.
"""

In [None]:
# DO NOT MODIFY THE CODE IN THIS CELL
print_tile(size="h3", key='hyperparameters_selection_explanations', value=hyperparameters_selection_explanations)
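
To make the role of α described above concrete, the following sketch fits Ridge on synthetic data with two nearly collinear predictors and prints the coefficient vector and its L2 norm for the same α grid. The data (`x1`, `x2`, `y_demo`) are invented for illustration only.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic stand-in data (assumption: NOT the assignment dataset)
rng = np.random.default_rng(1)
x1 = rng.normal(size=300)
x2 = x1 + rng.normal(scale=0.01, size=300)   # almost an exact copy of x1
X_demo = np.column_stack([x1, x2])
y_demo = 3 * x1 + rng.normal(scale=0.5, size=300)

norms = {}
for alpha in [0.001, 0.01, 0.1, 1, 10, 100]:
    coef = Ridge(alpha=alpha).fit(X_demo, y_demo).coef_
    norms[alpha] = float(np.linalg.norm(coef))  # overall size of the coefficients
    print(f"alpha={alpha:<7} coef={np.round(coef, 3)}  L2 norm={norms[alpha]:.3f}")
```

As α grows, the norm of the coefficient vector shrinks and the weight is shared more evenly between the two near-duplicate predictors.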

### C.3 Fit Model

In [None]:
trained_ridge_models = {}

for alpha in alpha_values:
    model = Ridge(alpha=alpha)
    model.fit(X_train, y_train)
    trained_ridge_models[alpha] = model

---
## D. Model Evaluation

### D.1 Model Technical Performance

In [None]:
from sklearn.metrics import mean_absolute_error


ridge_results = {}

for alpha, model in trained_ridge_models.items():
    y_train_pred = model.predict(X_train)
    y_val_pred = model.predict(X_val)

    ridge_results[alpha] = {
        "MAE_train": mean_absolute_error(y_train, y_train_pred),
        "MAE_val": mean_absolute_error(y_val, y_val_pred)
    }

ridge_results_df = pd.DataFrame(ridge_results).T
ridge_results_df.sort_index(inplace=True)
print(ridge_results_df)


         MAE_train     MAE_val
0.001    35.519678  132.591655
0.010    35.519678  132.591657
0.100    35.519680  132.591669
1.000    35.519700  132.591796
10.000   35.519908  132.593063
100.000  35.521981  132.605806
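
The choice of α from this table can be made programmatically rather than by eye. The sketch below re-enters the validation MAE values printed above into a pandas Series (the name `val_mae` is introduced here for illustration) and picks the α with the lowest validation MAE.

```python
import pandas as pd

# Validation MAE values copied from the results table above
val_mae = pd.Series(
    {0.001: 132.591655, 0.01: 132.591657, 0.1: 132.591669,
     1.0: 132.591796, 10.0: 132.593063, 100.0: 132.605806},
    name="MAE_val",
)

best_alpha_demo = val_mae.idxmin()        # alpha with the lowest validation MAE
spread = val_mae.max() - val_mae.min()    # total variation across the alpha grid

print("best alpha:", best_alpha_demo)
print("spread in validation MAE:", round(spread, 6))
```

The spread across the whole grid is only about 0.014, so the choice of α makes almost no practical difference for this model.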


In [None]:
# Evaluate the model with the lowest validation MAE on the test set
best_alpha = ridge_results_df["MAE_val"].idxmin()  # 0.001 in the run above
best_model = trained_ridge_models[best_alpha]
y_test_pred = best_model.predict(X_test)
test_mae = mean_absolute_error(y_test, y_test_pred)
print("Test MAE:", test_mae)


Test MAE: 219.43152627020325


In [None]:


# ---------------------------
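# 1. Get top 15 features from best model
#    (ranking by |coefficient| assumes the features are on comparable scales)
# ---------------------------
feature_names = X_train.columns
coefficients = np.ravel(best_model.coef_)  # flatten: coef_ is 2D if y_train is a one-column DataFrame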

importance = pd.DataFrame({
    "feature": feature_names,
    "coefficient": coefficients,
    "abs_coef": np.abs(coefficients)
}).sort_values("abs_coef", ascending=False)

top_features = importance.head(15)["feature"].tolist()
print("Top 15 features:", top_features)

# ---------------------------
# 2. Subset the data
# ---------------------------
X_train_top = X_train[top_features]
X_val_top   = X_val[top_features]
X_test_top  = X_test[top_features]

# ---------------------------
# 3. Retrain Ridge with best alpha
# ---------------------------
best_alpha = 0.001  # keep same alpha
ridge_top = Ridge(alpha=best_alpha)
ridge_top.fit(X_train_top, y_train)

# ---------------------------
# 4. Evaluate model performance
# ---------------------------
y_train_pred = ridge_top.predict(X_train_top)
y_val_pred   = ridge_top.predict(X_val_top)
y_test_pred  = ridge_top.predict(X_test_top)

results = {
    "MAE_train": mean_absolute_error(y_train, y_train_pred),
    "MAE_val": mean_absolute_error(y_val, y_val_pred),
    "MAE_test": mean_absolute_error(y_test, y_test_pred)
}

print("Ridge with top 15 features performance:")
print(pd.DataFrame([results]))


Top 15 features: ['vehicle_value', 'vehicle_weight', 'second_driver', 'driving_experience_years', 'distribution_channel', 'total_claims_number_in_history', 'total_claims_number_ratio', 'matriculation_year', 'lapsed_policies', 'years_since_last_renewal', 'current_policies_held', 'policy_type', 'total_claims_number_in_current_year', 'seniority', 'vehicle_horsepower']
Ridge with top 15 features performance:
   MAE_train     MAE_val   MAE_test
0  35.528861  132.492522  219.51878


In [None]:
importance

                           feature  coefficient  abs_coef
0                    vehicle_value     5.566607  5.566607
1                   vehicle_weight     4.793926  4.793926
17                   second_driver     4.303044  4.303044
18        driving_experience_years    -4.206639  4.206639
14            distribution_channel     3.782557  3.782557
11  total_claims_number_in_history     3.700698  3.700698
12       total_claims_number_ratio     3.579113  3.579113
4               matriculation_year     2.919135  2.919135
8                  lapsed_policies     2.912117  2.912117
19        years_since_last_renewal     2.764648  2.764648
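
Coefficient magnitudes are only comparable across features when the features share a common scale, and the Ridge penalty itself is scale-sensitive. Whether the CSV files loaded in this notebook were already standardized is not shown here. If they were not, one option is to put a `StandardScaler` in front of Ridge inside a pipeline, as in this sketch on synthetic data (the feature names `value` and `doors` are illustrative assumptions).

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with two features on very different scales
rng = np.random.default_rng(7)
value = rng.normal(20000, 5000, 400)            # e.g. a monetary amount
doors = rng.integers(2, 6, 400).astype(float)   # small integer counts
X_demo = np.column_stack([value, doors])
y_demo = 0.01 * value + 30 * doors + rng.normal(0, 10, 400)

# The scaler is fitted on the training data only, inside the pipeline
scaled_ridge = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
scaled_ridge.fit(X_demo, y_demo)

# Coefficients are now expressed per standard deviation of each feature
coefs = scaled_ridge.named_steps["ridge"].coef_
print(dict(zip(["value", "doors"], np.round(coefs, 2))))
```

With standardized inputs, each coefficient reads as the change in the prediction per one standard deviation of that feature, which makes a ranking by absolute value meaningful.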


In [None]:

# Convert to 1D arrays if they are DataFrames
y_test_arr = np.ravel(y_test)
y_test_preds_arr = np.ravel(y_test_pred)

me_test = np.mean(y_test_arr - y_test_preds_arr)
print("testing ME:", me_test)


testing ME: 186.37123597393008
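
The mean error (ME) keeps the sign of each residual, whereas MAE discards it. The toy example below, using made-up numbers rather than assignment data, follows the same `actual - predicted` convention as the cell above.

```python
import numpy as np

# Hypothetical premiums and predictions (illustrative values only)
actual = np.array([600.0, 550.0, 700.0, 480.0])
predicted = np.array([400.0, 500.0, 450.0, 500.0])

errors = actual - predicted     # same sign convention as the cell above
me = errors.mean()              # signed: positive means under-prediction on average
mae = np.abs(errors).mean()     # unsigned: average size of the error

print("ME:", me, "MAE:", mae)
```

Because |ME| can never exceed MAE, an ME close to the MAE means that nearly all errors point in the same direction.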


In [None]:
model_performance_explanations = """
Based on the results, we can make the following observations about model performance as alpha changes in Ridge Regression:

Across all tested values of α (0.001 → 100), the training and validation MAE remain almost unchanged (validation MAE varies by only about 0.014). This indicates that the model is not very sensitive to regularization strength for this dataset. The test MAE was computed only for the selected α = 0.001.

The training MAE (~35.52) is consistently lower than the validation (~132.59) and test (~219.43) MAE, which shows a clear generalization gap: the model performs well on seen data but less accurately on unseen data. The validation and test MAE also differ considerably, which may indicate that the two splits differ in distribution.

Increasing α slightly increases the training and validation MAE, which is expected because stronger regularization shrinks coefficients more, slightly reducing model flexibility.

Very small α (0.001 → 0.1) essentially behaves like ordinary linear regression, giving minimal penalty and almost identical performance to α = 1.

Restricting the model to the top 15 features by absolute coefficient leaves performance essentially unchanged (test MAE ~219.52).

Overall, the model’s performance is stable across α, suggesting that any multicollinearity present is not severe enough to affect predictions when α changes."""

In [None]:
# DO NOT MODIFY THE CODE IN THIS CELL
print_tile(size="h3", key='model_performance_explanations', value=model_performance_explanations)

### D.2 Business Impact from Current Model Performance


In [None]:
business_impacts_explanations = """
The training MAE (~35.52) is much lower than the validation (~132.59) and test MAE (~219.52), showing that while the model fits the training data well, it generalizes less effectively to unseen customers.

Because the test ME is positive (~186.37), the model systematically underestimates premiums on average, which could directly lead to revenue loss if applied in practice. At the same time, overestimation in some cases could still discourage customers from purchasing.

Since the test MAE (~219) is about 38% of the average premium (~579), prediction errors are substantial and limit reliability for precise pricing. This suggests that improvements require better feature engineering or more advanced nonlinear models rather than only tuning regularization.

These limitations mean that while the model identifies key risk drivers, using it in isolation could bias business outcomes, so additional checks and refinements are needed before deployment.
"""


In [None]:
# DO NOT MODIFY THE CODE IN THIS CELL
print_tile(size="h3", key='business_impacts_explanations', value=business_impacts_explanations)
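
The arithmetic behind these statements can be reproduced from the figures reported above. The snippet uses the test MAE and ME of the top-15-feature model, since both were computed from the same predictions, together with the approximate average premium quoted in the text. The variable names are introduced here for illustration.

```python
# Figures taken from the outputs and text above
test_mae = 219.52      # test MAE of the top-15-feature Ridge model
test_me = 186.37       # test mean error (actual - predicted)
avg_premium = 579.0    # approximate average premium quoted in the text

relative_mae = test_mae / avg_premium   # error size relative to a typical premium
bias_share = test_me / test_mae         # how much of the error is one-directional

print(f"MAE as share of average premium: {relative_mae:.1%}")
print(f"ME as share of MAE: {bias_share:.1%}")
```

An ME that is a large fraction of the MAE suggests that much of the error is a systematic under-prediction rather than random scatter, which would be consistent with the revenue-loss concern raised above.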

## E. Conclusion

In [None]:
experiment_outcome = "Hypothesis Partially Confirmed" # Either 'Hypothesis Confirmed', 'Hypothesis Partially Confirmed' or 'Hypothesis Rejected'

In [None]:
# DO NOT MODIFY THE CODE IN THIS CELL
print_tile(size="h2", key='experiment_outcomes_explanations', value=experiment_outcome)

In [None]:
experiment_results_explanations = """
In this experiment, Ridge Regression was applied with α (alpha) values ranging from 0.001 to 100. Performance remained stable across α, with low training MAE (~35) but much higher validation (~132) and test (~219) MAE, showing limited generalization.

The results indicate that although Ridge is designed to mitigate overfitting by penalizing large coefficients, changing α had almost no effect here. Its linear assumptions restrict further improvements, and regularization alone cannot capture the complexity of the data. Ridge does perform better than the Dummy Regressor, but the gain is modest.

Future experiments should therefore explore Lasso Regression for feature selection, as well as more flexible nonlinear models."""


In [None]:
# DO NOT MODIFY THE CODE IN THIS CELL
print_tile(size="h2", key='experiment_results_explanations', value=experiment_results_explanations)
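
As a pointer for the Lasso follow-up mentioned above, the sketch below shows how an L1 penalty can set the coefficients of uninformative features exactly to zero. It uses synthetic data in which only the first two of six columns carry signal, and the value `alpha=0.5` is an arbitrary illustrative choice rather than a recommendation for the assignment data.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic stand-in data: only columns 0 and 1 influence the target
rng = np.random.default_rng(3)
X_demo = rng.normal(size=(300, 6))
y_demo = 4 * X_demo[:, 0] - 3 * X_demo[:, 1] + rng.normal(scale=0.5, size=300)

# The L1 penalty drives coefficients of irrelevant features to exactly zero
lasso = Lasso(alpha=0.5).fit(X_demo, y_demo)
kept = np.flatnonzero(lasso.coef_)   # indices of features with non-zero weight

print("coefficients:", np.round(lasso.coef_, 2))
print("kept feature indices:", kept)
```

Unlike Ridge, which shrinks all coefficients but keeps every feature, Lasso performs an implicit feature selection, at the cost of also shrinking the coefficients it keeps.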