# **Experiment Notebook**



---
## Setup Environment

In [1]:
# DO NOT MODIFY THE CODE IN THIS CELL
!pip install -q utstd

from utstd.folders import *
from utstd.ipyrenders import *

at = AtFolder(
    course_code=36106,
    assignment="AT1",
)
at.run()

import warnings
warnings.simplefilter(action='ignore')

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
umap-learn 0.5.9.post2 requires scikit-learn>=1.6, but you have scikit-learn 1.5.2 which is incompatible.
Mounted at /content/gdrive

You can now save your data files in: /content/gdrive/MyDrive/36106/assignment/AT1/data


---
## Student Information

In [2]:
student_name = "Parisasadat"
student_id = "kalaki"

In [3]:
# DO NOT MODIFY THE CODE IN THIS CELL
print_tile(size="h1", key='student_name', value=student_name)

In [4]:
# DO NOT MODIFY THE CODE IN THIS CELL
print_tile(size="h1", key='student_id', value=student_id)

---
## 0. Python Packages

### 0.a Install Additional Packages

> If you are using additional packages, you need to install them here using the command: `! pip install <package_name>`

In [5]:
# <Student to fill this section and then remove this comment>

### 0.b Import Packages

In [6]:
# DO NOT MODIFY THE CODE IN THIS CELL
import pandas as pd
import numpy as np
import altair as alt
from sklearn.metrics import mean_absolute_error
from sklearn.linear_model import Lasso


---
## A. Experiment Description

In [7]:
# DO NOT MODIFY THE CODE IN THIS CELL
experiment_id = "2"
print_tile(size="h1", key='experiment_id', value=experiment_id)

In [8]:
experiment_hypothesis = """
I want to test whether using Lasso Regression (L1 regularization) can improve the prediction of net_premium_amount.
Unlike Ridge, which only shrinks coefficients, Lasso can push some coefficients all the way to zero.
This means it can automatically filter out less important features and give us a simpler, more interpretable model.
Since our data has correlated features and some noise, I believe Lasso might help improve generalization on unseen data
and highlight the most important drivers of premium amounts.
"""
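
The contrast between Ridge and Lasso described in the hypothesis can be illustrated with a textbook simplification. Assuming an orthonormal design, a Lasso objective of the form ½‖y − Xw‖² + λ‖w‖₁, and a Ridge objective of the form ½‖y − Xw‖² + (λ/2)‖w‖², each coefficient has a closed-form solution in terms of its least-squares value. The coefficient values and `lam` below are hypothetical and chosen only for illustration; they are not taken from the insurance data.

```python
import numpy as np

def lasso_shrink(z, lam):
    """Soft-thresholding: per-coefficient Lasso solution for an orthonormal design."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def ridge_shrink(z, lam):
    """Proportional shrinkage: per-coefficient Ridge solution for an orthonormal design."""
    return z / (1.0 + lam)

# Hypothetical least-squares coefficients (illustration only)
ols_coefs = np.array([5.0, -3.0, 0.4, -0.2])
lam = 0.5  # hypothetical penalty strength

lasso_coefs = lasso_shrink(ols_coefs, lam)
ridge_coefs = ridge_shrink(ols_coefs, lam)

print("Lasso:", lasso_coefs)
print("Ridge:", ridge_coefs)
print("Non-zero (Lasso):", np.count_nonzero(lasso_coefs))
print("Non-zero (Ridge):", np.count_nonzero(ridge_coefs))
```

In this simplified setting, coefficients whose least-squares magnitude is below `lam` become exactly zero under Lasso, while Ridge only scales every coefficient toward zero.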

In [9]:
# DO NOT MODIFY THE CODE IN THIS CELL
print_tile(size="h3", key='experiment_hypothesis', value=experiment_hypothesis)

In [10]:
experiment_expectations = """
Expected outcome:
I expect Lasso to give a sparser, more interpretable linear model by zeroing out unhelpful features, and to slightly improve out-of-sample MAE compared to the Ridge and dummy baselines. A reasonable target is a **modest reduction in test MAE (~5–15%)** relative to the current Ridge result: moving the test MAE from ~217 down to roughly 184–206, if Lasso finds useful sparsity and removes noisy predictors.

Why this makes sense:
Our data contains many correlated and moderately informative features. Ridge showed that regularization helps but tuning alpha alone didn’t change performance much. Lasso’s L1 penalty can remove irrelevant/noisy features entirely, which both reduces variance and simplifies interpretation for the business.

Possible scenarios:
1. Lasso substantially improves performance (best case) — test MAE drops 10–15% and many coefficients become zero.
   - Interpretation: noise/irrelevant features were hurting generalization; use the smaller feature set going forward and consider simpler models for deployment.

2. Lasso provides modest improvement (likely) — test MAE improves 3–10%, with a moderate number of features zeroed.
   - Interpretation: some noise existed and Lasso helps; follow up with ElasticNet / feature engineering to squeeze more gain.

3. Lasso performs similar to Ridge (no real change) — MAE roughly unchanged and few coefficients are zeroed.
   - Interpretation: linear sparsity isn’t the limiting factor; focus on non-linear models or richer features.

4. Lasso performs worse (over-sparse / underfitting) — MAE increases noticeably.
   - Interpretation: L1 removed useful predictors or α is too large. Action: reduce alpha, try ElasticNet, or move to tree-based models.

Next steps depending on outcome:
- If Lasso selects a small, stable feature set → validate those features, retrain simpler models, consider deployment pipeline.
- If modest gains → run KNN next.

Overall, Lasso is a low-risk experiment that can yield interpretability and modest performance gains; whether we proceed further depends on how much MAE improves and how sparse/consistent the selected features are.
"""
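
The target range in the expectations follows from simple arithmetic on the approximate Ridge test MAE. A minimal sketch, assuming the ~217 figure quoted in the text (the variable names are illustrative):

```python
# Approximate Ridge test MAE quoted in the expectations (assumed value)
ridge_test_mae = 217.0

# Target test MAE for a 5% and a 15% relative reduction
targets = {r: ridge_test_mae * (1 - r) for r in (0.05, 0.15)}

for reduction, target in targets.items():
    print(f"{reduction:.0%} reduction -> target test MAE of {target:.2f}")
```

A 5% reduction gives about 206.15 and a 15% reduction gives about 184.45, which lands close to the range quoted in the expectations.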


In [11]:
# DO NOT MODIFY THE CODE IN THIS CELL
print_tile(size="h3", key='experiment_expectations', value=experiment_expectations)

---
## B. Feature Selection


In [12]:
# DO NOT MODIFY THE CODE IN THIS CELL
# Load data
try:
  X_train = pd.read_csv(at.folder_path / 'X_train.csv')
  y_train = pd.read_csv(at.folder_path / 'y_train.csv')

  X_val = pd.read_csv(at.folder_path / 'X_val.csv')
  y_val = pd.read_csv(at.folder_path / 'y_val.csv')

  X_test = pd.read_csv(at.folder_path / 'X_test.csv')
  y_test = pd.read_csv(at.folder_path / 'y_test.csv')
except Exception as e:
  print(e)

In [13]:
features_list = ['vehicle_value', 'vehicle_weight', 'vehicle_horsepower',
       'vehicle_cylinder', 'matriculation_year', 'seniority',
       'current_policies_held', 'max_products_held',
       'lapsed_policies', 'total_claims_cost_in_current_year',
       'total_claims_number_in_current_year', 'total_claims_number_in_history',
       'total_claims_number_ratio', 'vehicle_doors', 'distribution_channel',
       'vehicle_fuel_type', 'policy_type', 'second_driver',
       'driving_experience_years',
       'years_since_last_renewal']

In [14]:
feature_selection_explanations = """
Features that could cause data leakage or reveal future outcomes (payment_method, lapsed_date, next_renewal_date, and contract_start_date) were removed, because this information would not be available at prediction time. max_policies_held and vehicle_length were removed because of multicollinearity. To address ethical considerations and reduce potential bias, gender and age were excluded from the model. Other fields with low predictive value, high missingness, or potential privacy concerns (customer_id, prefix, first_name, last_name, birth_date, phone_number, email, and address-related fields) were also dropped. The remaining features were retained for modeling, balancing predictive relevance with ethical responsibility.
"""
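
The multicollinearity screen mentioned above is not shown in this notebook. One way such a check could look is a pairwise correlation scan. The sketch below uses synthetic columns (`feat_a`, `feat_b`, `feat_c`) and an assumed cut-off of 0.9; neither comes from the original analysis.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
base = rng.normal(size=n)

# Synthetic data: feat_b is nearly a rescaled copy of feat_a, feat_c is independent
df = pd.DataFrame({
    "feat_a": base,
    "feat_b": base * 2.0 + rng.normal(scale=0.05, size=n),
    "feat_c": rng.normal(size=n),
})

corr = df.corr().abs()
threshold = 0.9  # assumed cut-off for "highly correlated"

# Collect each unordered pair of columns whose absolute correlation exceeds the cut-off
high_pairs = [
    (a, b)
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if corr.loc[a, b] > threshold
]
print(high_pairs)
```

For each flagged pair, one of the two columns would typically be dropped before fitting a linear model.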

In [15]:
# DO NOT MODIFY THE CODE IN THIS CELL
print_tile(size="h3", key='feature_selection_explanations', value=feature_selection_explanations)

---
## C. Train Machine Learning Model

### C.1 Import Algorithm



In [16]:
from sklearn.linear_model import Lasso

In [17]:
algorithm_selection_explanations = """
Lasso Regression is a good fit because it not only regularizes the model like Ridge,
but also performs feature selection by shrinking some coefficients to exactly zero.
This is useful for our dataset, which contains many correlated and potentially noisy
vehicle and customer features. By automatically removing irrelevant predictors,
Lasso can simplify the model, improve interpretability, and help reduce overfitting.
Since our goal is to predict net premium while keeping the model both accurate and
understandable for the business, Lasso offers a balance of predictive performance
and transparency.
"""


In [18]:
# DO NOT MODIFY THE CODE IN THIS CELL
print_tile(size="h3", key='algorithm_selection_explanations', value=algorithm_selection_explanations)

### C.2 Set Hyperparameters


In [19]:
alpha_values = [0.0001, 0.001, 0.01, 0.1, 1, 10, 100]

In [20]:
hyperparameters_selection_explanations = """
For Lasso Regression, the main hyperparameter to tune is alpha, which controls
the strength of the L1 penalty. A small alpha makes the model behave more like
ordinary linear regression, keeping most features but risking overfitting. A large
alpha increases regularization, forcing more coefficients to zero, which can improve
generalization but risks underfitting. Tuning alpha allows us to balance predictive
accuracy with feature selection, ensuring the model captures the most relevant
drivers of net premium while avoiding unnecessary complexity.
"""
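
The effect of alpha on sparsity can be sketched on synthetic data. The example below reuses the same alpha grid but fits on data from `make_regression`, so the counts it prints say nothing about the insurance dataset; the sample sizes and noise level are arbitrary assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic regression problem: 20 features, only 5 of them informative
X, y = make_regression(
    n_samples=300, n_features=20, n_informative=5, noise=10.0, random_state=0
)
X = StandardScaler().fit_transform(X)

alphas = [0.0001, 0.001, 0.01, 0.1, 1, 10, 100]
nonzero_counts = {}

for alpha in alphas:
    model = Lasso(alpha=alpha, max_iter=10000).fit(X, y)
    # Count how many coefficients survive the L1 penalty at this alpha
    nonzero_counts[alpha] = int(np.count_nonzero(model.coef_))
    print(f"alpha={alpha:<8} non-zero coefficients: {nonzero_counts[alpha]}")
```

As alpha grows, fewer coefficients remain non-zero, which is the trade-off between model complexity and fit that the explanation describes.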


In [21]:
# DO NOT MODIFY THE CODE IN THIS CELL
print_tile(size="h3", key='hyperparameters_selection_explanations', value=hyperparameters_selection_explanations)

### C.3 Fit Model

In [22]:
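# Fit one Lasso model per candidate alpha and keep every fitted model,
# keyed by alpha, so they can all be compared on the validation set
trained_lasso_models = {}

for alpha in alpha_values:
    model = Lasso(alpha=alpha)
    model.fit(X_train, y_train)
    trained_lasso_models[alpha] = model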

---
## D. Model Evaluation

### D.1 Model Technical Performance

In [23]:

# Compute training and validation MAE for every fitted model
lasso_results = {}

for alpha, model in trained_lasso_models.items():
    y_train_pred = model.predict(X_train)
    y_val_pred = model.predict(X_val)

    lasso_results[alpha] = {
        "MAE_train": mean_absolute_error(y_train, y_train_pred),
        "MAE_val": mean_absolute_error(y_val, y_val_pred)
    }

# One row per alpha, sorted from weakest to strongest regularization
lasso_results_df = pd.DataFrame(lasso_results).T.sort_index()
print(lasso_results_df)


          MAE_train     MAE_val
0.0001    35.519698  132.591847
0.0010    35.519893  132.593502
0.0100    35.521855  132.609830
0.1000    35.543447  132.766172
1.0000    35.852378  134.518355
10.0000   38.091688  147.717284
100.0000  38.091688  147.717284
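
The identical rows for alpha = 10 and alpha = 100 are what one would expect if every coefficient has already been shrunk to zero, leaving an intercept-only model. That is an interpretation, not something the notebook verifies. The sketch below shows the mechanism on synthetic data: for scikit-learn's Lasso objective with an intercept, any alpha above `max|Xc.T @ yc| / n` (computed on centered data) zeroes all coefficients. The dataset and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)

# Smallest alpha at which all Lasso coefficients are zero (data centered
# because the model fits an intercept)
X_centered = X - X.mean(axis=0)
y_centered = y - y.mean()
alpha_max = np.max(np.abs(X_centered.T @ y_centered)) / len(y)

# Fit slightly above that threshold
model = Lasso(alpha=1.01 * alpha_max).fit(X, y)
preds = model.predict(X)

print("alpha_max:", alpha_max)
print("non-zero coefficients:", np.count_nonzero(model.coef_))
print("predictions all equal the training mean:", np.allclose(preds, y.mean()))
```

On the real data, the same check would be to count the non-zero entries of `coef_` for the models fitted at alpha = 10 and alpha = 100.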


In [24]:
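# Pick the alpha with the lowest validation MAE (0.0001 in the table above),
# then evaluate that model on the test data
best_alpha = lasso_results_df["MAE_val"].idxmin()
best_lasso = trained_lasso_models[best_alpha]
y_test_pred = best_lasso.predict(X_test)
test_mae = mean_absolute_error(y_test, y_test_pred)
print("Test MAE:", test_mae)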


Test MAE: 219.43178612338758


In [25]:


# ---------------------------
# 1. Rank features by absolute coefficient
# ---------------------------
feature_names = X_train.columns
coefficients = best_lasso.coef_

importance = pd.DataFrame({
    "feature": feature_names,
    "coefficient": coefficients,
    "abs_coef": np.abs(coefficients)
}).sort_values("abs_coef", ascending=False)

top_features = importance.head(15)["feature"].tolist()
print("Top 15 features (Lasso):", top_features)

# ---------------------------
# 2. Subset data with top 15 features
# ---------------------------
X_train_top = X_train[top_features]
X_val_top   = X_val[top_features]
X_test_top  = X_test[top_features]

# ---------------------------
# 3. Retrain Lasso with top 15 features
# ---------------------------
lasso_top = Lasso(alpha=best_alpha, max_iter=10000)
lasso_top.fit(X_train_top, y_train)

# ---------------------------
# 4. Evaluate retrained model
# ---------------------------
y_train_pred = lasso_top.predict(X_train_top)
y_val_pred   = lasso_top.predict(X_val_top)
y_test_pred  = lasso_top.predict(X_test_top)

results = {
    "MAE_train": mean_absolute_error(y_train, y_train_pred),
    "MAE_val": mean_absolute_error(y_val, y_val_pred),
    "MAE_test": mean_absolute_error(y_test, y_test_pred)
}

print("Lasso retrained with top 15 features:")
print(pd.DataFrame([results]))


Top 15 features (Lasso): ['vehicle_value', 'vehicle_weight', 'second_driver', 'driving_experience_years', 'distribution_channel', 'total_claims_number_in_history', 'total_claims_number_ratio', 'matriculation_year', 'lapsed_policies', 'years_since_last_renewal', 'current_policies_held', 'policy_type', 'total_claims_number_in_current_year', 'seniority', 'vehicle_horsepower']
Lasso retrained with top 15 features:
   MAE_train     MAE_val   MAE_test
0  35.528882  132.492727  219.51901


In [26]:
importance

                           feature  coefficient  abs_coef
0                    vehicle_value     5.566486  5.566486
1                   vehicle_weight     4.793618  4.793618
17                   second_driver     4.302906  4.302906
18        driving_experience_years    -4.206534  4.206534
14            distribution_channel     3.782486  3.782486
11  total_claims_number_in_history     3.700425  3.700425
12       total_claims_number_ratio     3.579003  3.579003
4               matriculation_year     2.919195  2.919195
8                  lapsed_policies     2.912041  2.912041
19        years_since_last_renewal     2.764567  2.764567


In [27]:

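# Convert to 1D arrays if they are DataFrames
# (y_test_pred holds the predictions of the model retrained on the top 15 features)
y_test_arr = np.ravel(y_test)
y_test_preds_arr = np.ravel(y_test_pred)

# Mean error = mean(actual - predicted); a positive value means the model
# predicts too low on average
me_test = np.mean(y_test_arr - y_test_preds_arr)
print("testing ME:", me_test)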


testing ME: 186.37145492557198
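
The difference between the mean error (ME) and the mean absolute error (MAE) is easiest to see on a tiny example. The numbers below are hypothetical and only illustrate the sign convention used here (actual minus predicted).

```python
import numpy as np

y_true = np.array([100.0, 200.0, 300.0])  # hypothetical actual premiums
y_pred = np.array([80.0, 150.0, 310.0])   # hypothetical predictions

errors = y_true - y_pred                # positive = prediction was too low
mean_error = errors.mean()              # signed: over- and under-predictions cancel
mean_abs_error = np.abs(errors).mean()  # unsigned: size of the typical miss

print("ME: ", mean_error)
print("MAE:", mean_abs_error)
```

Here the ME is 20.0 and the MAE is about 26.67. A positive ME that is close in size to the MAE, as in the test results above, means that most of the error comes from predicting too low.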


In [28]:
model_performance_explanations = """
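Lasso's MAE on the training (~35.5), validation (~132.5), and test (~219.5) sets is close to what Ridge achieved, so both models predict net premium amounts about equally well. The error rises sharply from training to validation to test, which indicates that the model generalizes poorly to unseen data.
For alpha between 0.0001 and 1, the validation MAE barely moves (132.59 to 134.52), so the model is insensitive to regularization strength in that range. At alpha = 10 and alpha = 100 the results are identical (validation MAE 147.72), which is consistent with every coefficient having been shrunk to zero.
The key advantage of Lasso is its ability to shrink some coefficients to exactly zero, which can simplify the model and improve interpretability, although prediction accuracy remains comparable to Ridge.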
"""

In [29]:
# DO NOT MODIFY THE CODE IN THIS CELL
print_tile(size="h3", key='model_performance_explanations', value=model_performance_explanations)

### D.2 Business Impact from Current Model Performance


In [30]:
business_impacts_explanations = """
Lasso achieved nearly identical performance to Ridge, with a training MAE of ~35.53, a validation MAE of ~132.49, and a test MAE of ~219.52.
Because the test ME (actual minus predicted) is positive at ~186.37, the model underestimates premiums on average, which creates potential revenue loss if it is applied directly.
Prediction errors remain substantial (about 37% of the average premium). Even so, Lasso adds business value by highlighting a smaller set of key predictors,
making the model more interpretable and helping the business focus on the most influential factors driving premium amounts.
"""


In [31]:
# DO NOT MODIFY THE CODE IN THIS CELL
print_tile(size="h3", key='business_impacts_explanations', value=business_impacts_explanations)

## E. Conclusion

In [32]:
experiment_outcome = "Hypothesis Partially Confirmed" # Either 'Hypothesis Confirmed', 'Hypothesis Partially Confirmed' or 'Hypothesis Rejected'

In [33]:
# DO NOT MODIFY THE CODE IN THIS CELL
print_tile(size="h2", key='experiment_outcome', value=experiment_outcome)

In [34]:
experiment_results_explanations = """
The experiment partially confirmed the hypothesis: Lasso performs comparably to Ridge in terms of MAE (test MAE of ~219.4 versus ~217 for Ridge). It slightly improves interpretability through feature selection but does not meaningfully reduce test error.
New insights include the stability of the model across a range of alpha values and the identification of a smaller set of key predictors.
Next steps could involve testing non-linear models such as KNN to capture local patterns and potentially improve predictions beyond what linear regularized models achieve.
If business goals prioritize model interpretability, Lasso remains useful; otherwise, exploration of non-linear approaches is recommended for better generalization.
"""
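
The KNN follow-up named in the next steps could start from something like the sketch below. It runs on synthetic data, and the candidate values of `k`, the scaling step, and the split are all assumptions made for illustration, not choices from this project.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training and validation sets
X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=0)

knn_mae = {}
for k in (3, 5, 10, 20):
    # Scale features first, because KNN relies on distances between rows
    knn = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=k))
    knn.fit(X_tr, y_tr)
    knn_mae[k] = mean_absolute_error(y_va, knn.predict(X_va))
    print(f"k={k:<3} validation MAE: {knn_mae[k]:.2f}")

# Keep the k with the lowest validation MAE
best_k = min(knn_mae, key=knn_mae.get)
print("best k:", best_k)
```

As with the alpha search above, the number of neighbors would be chosen on the validation set and the test set used only once at the end.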


In [35]:
# DO NOT MODIFY THE CODE IN THIS CELL
print_tile(size="h2", key='experiment_outcomes_explanations', value=experiment_results_explanations)