<a href="https://colab.research.google.com/github/ghrmzn/data-driven-materials-optimization/blob/main/notebooks/02_feature_selection_lasso.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Feature Selection with LASSO (L1-Regularized Regression)

This notebook performs feature selection using **LASSO regression** to identify the most influential composition and processing parameters affecting the target property in nanocomposite systems.
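
For reference, the objective that scikit-learn's `Lasso` and `LassoCV` estimators minimize can be written as below. The symbols are notation introduced here, not taken from the dataset: $n$ is the number of training samples, $X$ the standardized feature matrix, $y$ the target vector, $w$ the coefficient vector, and $\alpha \ge 0$ the regularization strength. The intercept is omitted for brevity.

```latex
\hat{w} = \arg\min_{w} \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1
```

The $\ell_1$ term is what drives some entries of $\hat{w}$ exactly to zero as $\alpha$ grows, which is the basis for using LASSO as a feature selector.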

## Objective
- Select a sparse subset of key predictors
- Reduce model complexity and multicollinearity
- Improve interpretability for materials design and optimization


In [None]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LassoCV
from sklearn.pipeline import Pipeline
from sklearn.metrics import r2_score, mean_squared_error

# TODO: update the dataset path
# Example: df = pd.read_csv("../data/data.csv")
# If you already loaded df in Notebook 01, copy the same loading block here.
df = pd.read_csv("YOUR_DATA.csv")

# TODO: set your target column name
target_col = "y"

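# All feature columns must be numeric: StandardScaler cannot process string
# columns, so encode or drop any categorical columns before this step.
X = df.drop(columns=[target_col])
y = df[target_col]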


In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

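# Scale inside the pipeline so the scaler is fit on training data only.
model = Pipeline([
    ("scaler", StandardScaler()),
    # Use the default alpha grid; passing alphas=None explicitly is
    # deprecated in recent scikit-learn releases.
    ("lasso_cv", LassoCV(cv=5, random_state=42, max_iter=10000))
])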

model.fit(X_train, y_train)

best_alpha = model.named_steps["lasso_cv"].alpha_
best_alpha


In [None]:
lasso = model.named_steps["lasso_cv"]
coef = pd.Series(lasso.coef_, index=X.columns)

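# Coefficients refer to standardized features, so magnitudes are comparable.
selected = coef[coef != 0].sort_values(key=np.abs, ascending=False)
dropped = coef[coef == 0].index.tolist()

print("Best alpha:", best_alpha)
print("Selected features (non-zero coefficients):")
display(selected.to_frame("Coefficient"))
print("Dropped features (zero coefficients):", dropped)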

y_pred = model.predict(X_test)
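# The squared=False argument was removed from mean_squared_error in recent
# scikit-learn releases; taking the square root works on every version.
rmse = np.sqrt(mean_squared_error(y_test, y_pred))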
r2 = r2_score(y_test, y_pred)

print(f"Test RMSE: {rmse:.4f}")
print(f"Test R^2 : {r2:.4f}")


## Interpretation

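- The L1 penalty shrinks all coefficients and sets those of less-informative predictors exactly to zero, leaving a **sparse** subset of predictors.
- Because the features are standardized before fitting, the magnitudes of the retained (non-zero) coefficients are directly comparable and indicate which variables are most strongly associated with the target property. When predictors are highly correlated, LASSO tends to keep one and drop the others, so a zero coefficient does not prove that a variable is irrelevant.
- This feature set can be used in subsequent models to reduce complexity. Confirm on held-out data that predictive performance is preserved before relying on the reduced set.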
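
As a rough illustration of the last point, the sketch below refits a plain linear model on only the LASSO-selected columns and compares its test R^2 with that of the full LASSO pipeline. It runs on synthetic data from `make_regression` as a stand-in for the real dataset, which is an assumption made so the example is self-contained. The names `X_demo`, `y_demo`, `selector`, `reduced`, and `selected_cols` are hypothetical and chosen to avoid clashing with the variables defined above. The downstream model (`LinearRegression`) is just one possible choice.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV, LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data (assumption): 20 features, 5 informative.
X_arr, y_arr = make_regression(
    n_samples=200, n_features=20, n_informative=5, noise=10.0, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=[f"x{i}" for i in range(X_arr.shape[1])])
y_demo = pd.Series(y_arr, name="y")

Xtr, Xte, ytr, yte = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Step 1: LASSO with a cross-validated alpha, mirroring the pipeline above.
selector = Pipeline([
    ("scaler", StandardScaler()),
    ("lasso_cv", LassoCV(cv=5, random_state=42, max_iter=10000)),
])
selector.fit(Xtr, ytr)

coef_demo = pd.Series(
    selector.named_steps["lasso_cv"].coef_, index=X_demo.columns
)
selected_cols = coef_demo[coef_demo != 0].index.tolist()

# Step 2: refit an unpenalized linear model on the selected columns only.
reduced = Pipeline([
    ("scaler", StandardScaler()),
    ("ols", LinearRegression()),
])
reduced.fit(Xtr[selected_cols], ytr)

# Compare held-out performance of the full LASSO fit and the reduced model.
r2_lasso = selector.score(Xte, yte)
r2_reduced = reduced.score(Xte[selected_cols], yte)

print(f"Selected {len(selected_cols)} of {X_demo.shape[1]} features")
print(f"Test R^2, LASSO on all features   : {r2_lasso:.4f}")
print(f"Test R^2, OLS on selected features: {r2_reduced:.4f}")
```

If the two scores are close, the reduced feature set is a reasonable input for later models. A large drop would suggest that the selection is too aggressive or that correlated predictors were discarded.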
