# Part B – LASSO Regression (No Outlier Handling, Tuned)

This notebook implements a **complete machine learning pipeline** using **LASSO regression**
to predict `metabolic_pressure` from the SMHS dataset.

Rules respected:
- `latent_cluster` is **not** used as a predictor.
- Full ML pipeline: EDA → preprocessing → model → evaluation → interpretation.
- Hyperparameter tuning for LASSO (`alpha`) with cross-validation.
- Final model is used to predict the external validation set
  (`SMHS_validation_students-1.csv`).


## 1. Imports

In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_absolute_error, mean_squared_error

pd.set_option("display.max_columns", 100)
pd.set_option("display.width", 140)

## 2. Load data

We load the training dataset (`SMHS_general-1.csv`) and the external validation dataset
(`SMHS_validation_students-1.csv`). Place them in the same folder as this notebook.

In [2]:
# Paths (adapt if needed)
general_path = "SMHS_general-1.csv"
validation_path = "SMHS_validation_students-1.csv"

general = pd.read_csv(general_path)
validation = pd.read_csv(validation_path)

general.shape, validation.shape

((1500, 29), (100, 28))
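
The validation file has one column fewer than the training file (28 versus 29). Before modelling, it can help to confirm that every predictor used in training is also present in the validation data. The following is a minimal sketch on made-up toy frames; the helper `missing_predictors` and the toy values are hypothetical and are not part of the notebook.

```python
import pandas as pd


def missing_predictors(train_df, val_df, exclude):
    """Return training predictor columns that are absent from val_df."""
    predictors = [c for c in train_df.columns if c not in exclude]
    return [c for c in predictors if c not in val_df.columns]


# Toy frames standing in for the real CSV files (values are made up).
train_toy = pd.DataFrame({
    "id": [1, 2],
    "age": [50, 61],
    "bmi": [27.1, 31.4],
    "metabolic_pressure": [55.0, 68.2],
})
val_toy = pd.DataFrame({
    "id": ["New_1", "New_2"],
    "age": [44, 58],
    "bmi": [24.9, 29.3],
})

absent = missing_predictors(train_toy, val_toy, exclude={"id", "metabolic_pressure"})
print(absent)  # an empty list means every predictor is available at prediction time
```

On the real data, the equivalent call would be `missing_predictors(general, validation, exclude={"id", "latent_cluster", "metabolic_pressure"})`.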

## 3. Basic structure and missingness (quick EDA)


In [3]:
general.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1500 entries, 0 to 1499
Data columns (total 29 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   id                         1500 non-null   int64  
 1   age                        1500 non-null   int64  
 2   sex                        1500 non-null   object 
 3   ethnicity                  1500 non-null   object 
 4   latent_cluster             1500 non-null   int64  
 5   deprivation_index          1500 non-null   float64
 6   smoking_status             1500 non-null   object 
 7   alcohol_units_week         259 non-null    float64
 8   physical_activity_minutes  1500 non-null   float64
 9   diet_quality_score         1500 non-null   float64
 10  depression_score           1500 non-null   float64
 11  bmi                        1500 non-null   float64
 12  waist_hip_ratio            1500 non-null   float64
 13  sbp                        1500 non-null   float

In [4]:
general.head()

,id,age,sex,ethnicity,latent_cluster,deprivation_index,smoking_status,alcohol_units_week,physical_activity_minutes,diet_quality_score,depression_score,bmi,waist_hip_ratio,sbp,dbp,sleep_duration_hours,sleep_variability_hours,sleep_efficiency,steps_per_day,fasting_glucose,fasting_insulin,triglycerides,hdl,ldl,crp,alt,shift_worker,family_history_diabetes,metabolic_pressure
0,1,56,F,White,3,74.968216,Former,,92.872646,60.614177,5.584738,29.748838,1.046043,149.311516,93.937432,6.021335,1.887725,74.954053,5957.130999,7.367951,21.589497,2.607227,0.73726,2.7041,28.432087,37.267948,No,0,73.087968
1,2,59,M,White,2,34.912176,Never,,26.103352,72.10727,5.188831,34.552028,0.875128,133.590633,78.725356,6.182194,1.079962,95.105654,7237.822331,6.717636,16.379792,3.112712,1.654086,3.750358,1.62455,29.042661,No,0,59.54175
2,3,49,M,Black,2,54.207211,Former,,210.401582,49.249182,7.037791,32.157909,0.843049,123.986203,62.293022,5.295797,1.013001,88.639733,10968.377469,5.052053,17.104307,4.525291,1.031543,3.8125,2.236999,33.364052,No,1,54.864634
3,4,39,F,White,2,27.870993,Never,,214.605575,61.195129,1.599334,36.395585,0.859905,130.854414,76.428276,6.495996,0.52839,83.566288,8455.317304,6.018634,21.739572,1.322513,1.232799,4.086235,,22.496442,No,0,49.199538
4,5,45,F,Other,3,72.684155,Never,21.249861,135.063492,49.828539,15.358964,34.769534,1.052577,156.756304,98.913783,4.906051,1.685047,90.506972,6689.506787,5.826059,16.918793,2.312573,1.014607,3.390955,5.999061,51.853812,No,0,70.674299


In [5]:
missing_ratio = general.isna().mean().sort_values(ascending=False)
missing_ratio

| Column | Missing ratio |
| --- | --- |
| alcohol_units_week | 0.827333 |
| steps_per_day | 0.144 |
| crp | 0.080667 |
| sleep_variability_hours | 0.046 |
| fasting_insulin | 0.024667 |
| latent_cluster | 0.0 |
| deprivation_index | 0.0 |
| smoking_status | 0.0 |
| physical_activity_minutes | 0.0 |
| age | 0.0 |
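
With roughly 83% of `alcohol_units_week` missing, median imputation assigns the same value to most rows of that column. One option, which this notebook does not use, is to keep a missingness flag next to the imputed value so that the model can still distinguish observed from imputed entries. A minimal sketch on made-up data, using scikit-learn's `add_indicator` option:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up toy column in which most values are missing.
x = np.array([[np.nan], [21.0], [np.nan], [np.nan], [7.0], [np.nan]])

# add_indicator=True appends a 0/1 column flagging the originally missing rows.
imputer = SimpleImputer(strategy="median", add_indicator=True)
out = imputer.fit_transform(x)

print(out[:, 0])  # imputed values: missing entries become the median of 21 and 7, i.e. 14
print(out[:, 1])  # indicator: 1.0 where the value was missing
```

Whether the extra column helps is an empirical question that would need to be checked with cross-validation.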


## 4. Define target and predictor variables

- Target: `metabolic_pressure`
- Exclude: `id` (identifier) and `latent_cluster` (not permitted as a predictor)
- Predictors: all 26 remaining columns (22 numeric, 4 categorical)


In [6]:
target_col = "metabolic_pressure"
cluster_col = "latent_cluster"
id_col = "id"

y = general[target_col]
X = general.drop(columns=[target_col, cluster_col, id_col])

numeric_features = X.select_dtypes(include=["int64", "float64"]).columns.tolist()
categorical_features = X.select_dtypes(include=["object"]).columns.tolist()

numeric_features, categorical_features

(['age',
  'deprivation_index',
  'alcohol_units_week',
  'physical_activity_minutes',
  'diet_quality_score',
  'depression_score',
  'bmi',
  'waist_hip_ratio',
  'sbp',
  'dbp',
  'sleep_duration_hours',
  'sleep_variability_hours',
  'sleep_efficiency',
  'steps_per_day',
  'fasting_glucose',
  'fasting_insulin',
  'triglycerides',
  'hdl',
  'ldl',
  'crp',
  'alt',
  'family_history_diabetes'],
 ['sex', 'ethnicity', 'smoking_status', 'shift_worker'])

## 5. Train–test split

In [7]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

X_train.shape, X_test.shape

((1200, 26), (300, 26))

## 6. Preprocessing pipeline

- Numeric: median imputation + standardisation
- Categorical: most-frequent imputation + one-hot encoding


In [8]:
numeric_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler())
    ]
)

categorical_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore"))
    ]
)

preprocess = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features)
    ]
)
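
To make the behaviour of this preprocessor concrete, the sketch below rebuilds the same two branches on a made-up two-column frame and transforms a row that contains a missing number and an unseen category. The column names `bmi` and `sex` are reused from the dataset, but the values and the category `"X"` are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up training rows: one numeric column with a gap, one categorical column.
train_toy = pd.DataFrame({
    "bmi": [22.0, np.nan, 30.0, 26.0],
    "sex": ["F", "M", "M", "F"],
})
# New row with a missing number and a category never seen during fitting.
new_toy = pd.DataFrame({"bmi": [np.nan], "sex": ["X"]})

toy_preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
    ]), ["bmi"]),
    ("cat", Pipeline([
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), ["sex"]),
])

toy_preprocess.fit(train_toy)
row = toy_preprocess.transform(new_toy)
# The result may be sparse or dense depending on its density; normalise to dense.
row = row.toarray() if hasattr(row, "toarray") else np.asarray(row)

# The missing bmi is imputed to the training median (26), which is also the mean
# of the imputed column, so it scales to 0. The unseen category encodes as zeros.
print(row)
```

All statistics (median, mean, standard deviation, category list) come from the data passed to `fit`, which is why the preprocessor sits inside the pipeline rather than being applied before the train-test split.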

## 7. Baseline LASSO model

In [9]:
lasso_baseline = Lasso(alpha=0.1, max_iter=10000, random_state=42)

lasso_pipe_baseline = Pipeline(
    steps=[
        ("preprocess", preprocess),
        ("model", lasso_baseline)
    ]
)

lasso_pipe_baseline.fit(X_train, y_train)

y_pred_test_baseline = lasso_pipe_baseline.predict(X_test)

mae_baseline = mean_absolute_error(y_test, y_pred_test_baseline)
rmse_baseline = np.sqrt(mean_squared_error(y_test, y_pred_test_baseline))

print("Baseline LASSO")
print("Test MAE :", mae_baseline)
print("Test RMSE:", rmse_baseline)

Baseline LASSO
Test MAE : 4.122845078166603
Test RMSE: 5.510861105566081
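
RMSE is never smaller than MAE on the same predictions, and the gap widens when a few errors are large. That is one way to read the baseline's RMSE of about 5.51 against an MAE of about 4.12. A small worked example with made-up residuals:

```python
import numpy as np

# Made-up residuals (true minus predicted); one is much larger than the rest.
residuals = np.array([1.0, -1.0, 2.0, -6.0])

mae = np.mean(np.abs(residuals))         # (1 + 1 + 2 + 6) / 4 = 2.5
rmse = np.sqrt(np.mean(residuals ** 2))  # sqrt((1 + 1 + 4 + 36) / 4) = sqrt(10.5)

print(mae, rmse)
```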


## 8. Hyperparameter tuning for LASSO (`alpha`)

In [10]:
lasso = Lasso(max_iter=10000, random_state=42)

lasso_pipe = Pipeline(
    steps=[
        ("preprocess", preprocess),
        ("model", lasso)
    ]
)

alpha_grid = np.logspace(-3, 1, 9)  # 0.001 to 10

param_grid = {"model__alpha": alpha_grid}

grid_search_lasso = GridSearchCV(
    estimator=lasso_pipe,
    param_grid=param_grid,
    scoring="neg_mean_absolute_error",
    cv=5,
    n_jobs=-1
)

grid_search_lasso.fit(X_train, y_train)

best_alpha = grid_search_lasso.best_params_["model__alpha"]
best_cv_mae = -grid_search_lasso.best_score_

print("Best alpha (LASSO):", best_alpha)
print("Best CV MAE       :", best_cv_mae)

Best alpha (LASSO): 0.1
Best CV MAE       : 4.1931474412844425
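
`best_params_` reports only the winning value. To see how sharply the CV error changes across the grid, the per-candidate scores in `cv_results_` can be tabulated. The sketch below runs on synthetic data from `make_regression` (an assumption made so that the block is self-contained) and uses a bare `Lasso` rather than the notebook's pipeline, so the parameter key is `alpha` instead of `model__alpha`.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training data.
X_toy, y_toy = make_regression(
    n_samples=200, n_features=10, n_informative=4, noise=5.0, random_state=0
)

search = GridSearchCV(
    Lasso(max_iter=10000),
    {"alpha": np.logspace(-3, 1, 9)},
    scoring="neg_mean_absolute_error",
    cv=5,
)
search.fit(X_toy, y_toy)

# One row per candidate alpha; scores are negated back into MAE.
cv_table = pd.DataFrame({
    "alpha": [p["alpha"] for p in search.cv_results_["params"]],
    "cv_mae": -search.cv_results_["mean_test_score"],
    "cv_mae_sd": search.cv_results_["std_test_score"],
}).sort_values("cv_mae")

print(cv_table)
```

For the notebook's search, the same table can be built from `grid_search_lasso.cv_results_`, reading `model__alpha` from each entry of `params`.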


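## 9. Evaluation of tuned LASSO model

The grid search selected `alpha = 0.1`, the same value used for the baseline in Section 7, so the tuned model's test metrics below match the baseline's exactly.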

In [11]:
best_lasso_pipe = grid_search_lasso.best_estimator_

y_pred_test_best = best_lasso_pipe.predict(X_test)

test_mae = mean_absolute_error(y_test, y_pred_test_best)
test_rmse = np.sqrt(mean_squared_error(y_test, y_pred_test_best))

print("Tuned LASSO – Test MAE :", test_mae)
print("Tuned LASSO – Test RMSE:", test_rmse)

Tuned LASSO – Test MAE : 4.122845078166603
Tuned LASSO – Test RMSE: 5.510861105566081


## 10. Cross-validation on full dataset

In [12]:
cv_scores_mae = cross_val_score(
    best_lasso_pipe,
    X, y,
    cv=5,
    scoring="neg_mean_absolute_error",
    n_jobs=-1
)

cv_scores_mse = cross_val_score(
    best_lasso_pipe,
    X, y,
    cv=5,
    scoring="neg_mean_squared_error",
    n_jobs=-1
)

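# Scores are negated errors, so flip the sign to report MAE.
cv_mae_mean = -cv_scores_mae.mean()
cv_mae_std = cv_scores_mae.std()

# Root of the mean fold MSE (not the mean of the per-fold RMSEs).
cv_rmse_mean = np.sqrt(-cv_scores_mse.mean())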

print("LASSO CV MAE  (mean ± sd):", cv_mae_mean, "+/-", cv_mae_std)
print("LASSO CV RMSE (mean):     ", cv_rmse_mean)

LASSO CV MAE  (mean ± sd): 4.1668843671565945 +/- 0.24106908357540494
LASSO CV RMSE (mean):      5.412487581468929
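
The RMSE reported here is the square root of the mean fold MSE, which is not the same quantity as the mean of the per-fold RMSEs. By Jensen's inequality the former is at least as large as the latter. A short illustration with made-up fold MSEs:

```python
import numpy as np

# Made-up MSE values for five folds.
fold_mse = np.array([25.0, 36.0, 16.0, 49.0, 30.25])

rmse_of_mean_mse = np.sqrt(fold_mse.mean())   # sqrt(31.25), about 5.59
mean_of_fold_rmse = np.sqrt(fold_mse).mean()  # (5 + 6 + 4 + 7 + 5.5) / 5 = 5.5

print(rmse_of_mean_mse, mean_of_fold_rmse)
```

Either convention is defensible as long as the same one is used when comparing models.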


## 11. Inspect LASSO coefficients

In [13]:
preprocessor = best_lasso_pipe.named_steps["preprocess"]
model = best_lasso_pipe.named_steps["model"]

num_features_out = numeric_features

cat_encoder = preprocessor.named_transformers_["cat"].named_steps["onehot"]
cat_features_out = cat_encoder.get_feature_names_out(categorical_features)

all_feature_names = np.concatenate([num_features_out, cat_features_out])

coeffs = model.coef_

coef_df = pd.DataFrame({
    "feature": all_feature_names,
    "coefficient": coeffs
}).sort_values(by="coefficient", key=lambda s: s.abs(), ascending=False)

coef_df.head(20)

|  | feature | coefficient |
| --- | --- | --- |
| 14 | fasting_glucose | 5.868792 |
| 6 | bmi | 3.835097 |
| 10 | sleep_duration_hours | -2.656455 |
| 15 | fasting_insulin | 2.654778 |
| 20 | alt | 1.227456 |
| 16 | triglycerides | 0.80943 |
| 17 | hdl | -0.700729 |
| 0 | age | 0.568105 |
| 8 | sbp | 0.413606 |
| 12 | sleep_efficiency | -0.295621 |


## 12. Fit tuned LASSO on full data and predict validation set

In [14]:
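# Refit the tuned pipeline (alpha fixed at the selected value) on all labelled rows.
best_lasso_pipe.fit(X, y)

# Drop the identifier and, if present, the excluded cluster label before predicting.
X_val = validation.drop(columns=[id_col, cluster_col], errors="ignore")
val_pred_lasso = best_lasso_pipe.predict(X_val)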

submission_lasso = pd.DataFrame({
    "id": validation[id_col],
    "predicted_metabolic_pressure": val_pred_lasso
})

submission_lasso.head()

|  | id | predicted_metabolic_pressure |
| --- | --- | --- |
| 0 | New_1501 | 57.822207 |
| 1 | New_1502 | 58.308866 |
| 2 | New_1503 | 70.784126 |
| 3 | New_1504 | 34.080968 |
| 4 | New_1505 | 37.227208 |


In [15]:
lasso_submission_path = "SMHS_LASSO_validation_predictions.csv"
submission_lasso.to_csv(lasso_submission_path, index=False)
lasso_submission_path

'SMHS_LASSO_validation_predictions.csv'
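
Before handing in the file, a few cheap checks can catch common problems such as a wrong row count, duplicated ids, or missing predictions. The helper `submission_problems` below is hypothetical and the toy frame is made up; only the two column names are taken from the notebook.

```python
import numpy as np
import pandas as pd


def submission_problems(sub, expected_rows):
    """Return a list of human-readable problems found in a submission frame."""
    problems = []
    if len(sub) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(sub)}")
    if sub["id"].duplicated().any():
        problems.append("duplicate ids")
    if not np.isfinite(sub["predicted_metabolic_pressure"]).all():
        problems.append("missing or non-finite predictions")
    return problems


# Made-up submission with the same two columns as the notebook's output.
toy_sub = pd.DataFrame({
    "id": ["New_1", "New_2", "New_3"],
    "predicted_metabolic_pressure": [57.8, 58.3, np.nan],
})

print(submission_problems(toy_sub, expected_rows=3))
```

For the notebook's frame, the call would be `submission_problems(submission_lasso, expected_rows=len(validation))`, and an empty list indicates that none of these checks failed.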

## 13. Summary

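- Built a full LASSO regression pipeline with median/most-frequent imputation, standardisation, and one-hot encoding.
- Tuned `alpha` with 5-fold `GridSearchCV`; the selected value was 0.1 (CV MAE ≈ 4.19), identical to the baseline setting.
- Evaluated performance on the held-out test set (MAE ≈ 4.12, RMSE ≈ 5.51) and with 5-fold cross-validation on the full dataset (MAE ≈ 4.17 ± 0.24, RMSE ≈ 5.41).
- Inspected LASSO coefficients for interpretation; `fasting_glucose`, `bmi`, `sleep_duration_hours`, and `fasting_insulin` had the largest absolute coefficients.
- Generated predictions for the external validation set and saved them to `SMHS_LASSO_validation_predictions.csv`.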
