In [11]:
import numpy as np
import pandas as pd

from sklearn.compose import ColumnTransformer, TransformedTargetRegressor
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler, PolynomialFeatures


In [12]:
# Upload insurance.csv in Colab before running this
df = pd.read_csv("insurance.csv")
df.head()


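|   | age | sex | bmi | children | smoker | region | charges |
|---|-----|--------|--------|----------|--------|-----------|-------------|
| 0 | 19 | female | 27.9 | 0 | yes | southwest | 16884.924 |
| 1 | 18 | male | 33.77 | 1 | no | southeast | 1725.5523 |
| 2 | 28 | male | 33.0 | 3 | no | southeast | 4449.462 |
| 3 | 33 | male | 22.705 | 0 | no | northwest | 21984.47061 |
| 4 | 32 | male | 28.88 | 0 | no | northwest | 3866.8552 |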


In [13]:
X = df.drop("charges", axis=1)
y = df["charges"]


In [14]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)


In [15]:
numeric_features = ["age", "bmi", "children"]
categorical_features = ["sex", "smoker", "region"]

numeric_transformer = Pipeline(steps=[
    ("scaler", StandardScaler())
])

categorical_transformer = OneHotEncoder(drop="first")

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features)
    ]
)


In [18]:
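def evaluate_model(model, X_test, y_test):
    """Return (RMSE, R^2) of a fitted model on the held-out test set."""
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    rmse = np.sqrt(mse)  # RMSE is in the same units as charges
    r2 = r2_score(y_test, y_pred)
    return rmse, r2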



In [19]:
baseline_model = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("regressor", LinearRegression())
])

baseline_model.fit(X_train, y_train)

baseline_rmse, baseline_r2 = evaluate_model(
    baseline_model, X_test, y_test
)

baseline_rmse, baseline_r2


(np.float64(5796.2846592762735), 0.7835929767120723)

In [20]:
poly_linear_model = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("poly", PolynomialFeatures(degree=2, include_bias=False)),
    ("regressor", LinearRegression())
])

poly_linear_model.fit(X_train, y_train)

poly_rmse, poly_r2 = evaluate_model(
    poly_linear_model, X_test, y_test
)

poly_rmse, poly_r2


(np.float64(4551.132385233194), 0.866583090316484)

In [21]:
ridge_model = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("poly", PolynomialFeatures(degree=2, include_bias=False)),
    ("regressor", Ridge(alpha=1.0))
])

ridge_model.fit(X_train, y_train)

ridge_rmse, ridge_r2 = evaluate_model(
    ridge_model, X_test, y_test
)

ridge_rmse, ridge_r2


(np.float64(4550.233996414042), 0.8666357578372921)

In [22]:
lasso_model = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("poly", PolynomialFeatures(degree=2, include_bias=False)),
    ("regressor", Lasso(alpha=0.001, max_iter=10000))
])

lasso_model.fit(X_train, y_train)

lasso_rmse, lasso_r2 = evaluate_model(
    lasso_model, X_test, y_test
)

lasso_rmse, lasso_r2


(stderr: truncated warning raised from scikit-learn's coordinate descent solver, cd_fast.enet_coordinate_descent, during the Lasso fit)


(np.float64(4551.126815587279), 0.8665834168657593)

In [23]:
elastic_model = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("poly", PolynomialFeatures(degree=2, include_bias=False)),
    ("regressor", ElasticNet(alpha=0.001, l1_ratio=0.5, max_iter=10000))
])

elastic_model.fit(X_train, y_train)

elastic_rmse, elastic_r2 = evaluate_model(
    elastic_model, X_test, y_test
)

elastic_rmse, elastic_r2


(np.float64(4550.49963540955), 0.8666201859887035)

In [24]:
results = pd.DataFrame({
    "Model": [
        "Baseline Linear",
        "Polynomial Linear",
        "Ridge + Polynomial",
        "Lasso + Polynomial",
        "Elastic Net + Polynomial"
    ],
    "RMSE": [
        baseline_rmse,
        poly_rmse,
        ridge_rmse,
        lasso_rmse,
        elastic_rmse
    ],
    "R2 Score": [
        baseline_r2,
        poly_r2,
        ridge_r2,
        lasso_r2,
        elastic_r2
    ]
})

results


|   | Model | RMSE | R2 Score |
|---|--------------------------|-------------|----------|
| 0 | Baseline Linear | 5796.284659 | 0.783593 |
| 1 | Polynomial Linear | 4551.132385 | 0.866583 |
| 2 | Ridge + Polynomial | 4550.233996 | 0.866636 |
| 3 | Lasso + Polynomial | 4551.126816 | 0.866583 |
| 4 | Elastic Net + Polynomial | 4550.499635 | 0.86662 |


## Summary of Findings

This assignment trains five regression models to predict insurance charges from the insurance dataset and compares them on a held-out 20% test split.

A baseline linear regression model was fitted first, with the numerical features (`age`, `bmi`, `children`) standardized and the categorical features (`sex`, `smoker`, `region`) one-hot encoded. It reached a test RMSE of 5796.28 and an R² of 0.784, but it can only represent additive linear effects of the predictors.
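
As a rough illustration of what the preprocessing step does to each column, the sketch below standardizes a numeric column and one-hot encodes a categorical column with the first category dropped, using only the five rows shown by `df.head()`. The helper names `standardize` and `one_hot_drop_first` are hypothetical stand-ins written for this example; the notebook itself uses `StandardScaler` and `OneHotEncoder(drop="first")`.

```python
from statistics import mean, pstdev


def standardize(values):
    """Scale values to zero mean and unit variance (population std, ddof=0)."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def one_hot_drop_first(values):
    """One-hot encode values, dropping the first category in sorted order."""
    categories = sorted(set(values))
    kept = categories[1:]  # the dropped category becomes the all-zeros row
    return kept, [[1 if v == c else 0 for c in kept] for v in values]


# Values taken from the first five rows of the dataset shown above.
ages = [19, 18, 28, 33, 32]
smoker = ["yes", "no", "no", "no", "no"]

scaled_ages = standardize(ages)
kept_categories, smoker_encoded = one_hot_drop_first(smoker)

print([round(a, 3) for a in scaled_ages])
print(kept_categories, smoker_encoded)
```

Dropping one category per feature avoids perfectly collinear dummy columns, which matters for an unregularized linear model with an intercept.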

To improve on this, polynomial features of degree 2 were added after preprocessing. The squared and pairwise interaction terms let the model capture non-linear effects and interactions between predictors such as age, BMI, and smoking status. The polynomial model lowered test RMSE from 5796.28 to 4551.13 (about 21%) and raised R² from 0.784 to 0.867.
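
To make the size of the degree-2 expansion concrete, the sketch below enumerates the terms that a degree-2 expansion without a bias column produces: the original features plus every square and pairwise product. The column names are assumptions for illustration; in particular, it assumes `region` has four categories so that three one-hot columns remain after dropping the first, which the notebook output does not show directly.

```python
from itertools import combinations_with_replacement


def degree2_feature_names(names):
    """Return the original features plus all squares and pairwise products."""
    products = [
        f"{a}^2" if a == b else f"{a} {b}"
        for a, b in combinations_with_replacement(names, 2)
    ]
    return list(names) + products


# Assumed post-preprocessing columns: 3 scaled numerics plus one-hot
# columns with the first category of each feature dropped.
preprocessed = [
    "age", "bmi", "children", "sex_male", "smoker_yes",
    "region_northwest", "region_southeast", "region_southwest",
]

expanded = degree2_feature_names(preprocessed)
print(len(preprocessed), "->", len(expanded))
print([name for name in expanded if "bmi" in name and "smoker_yes" in name])
```

Under these assumptions, 8 input columns grow to 44, including an interaction term between BMI and smoking status. Squares of 0/1 dummy columns duplicate the original column, so part of the expansion is redundant.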

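Next, regularization was applied to the polynomial model to control complexity and reduce overfitting. In general, Ridge shrinks coefficients toward zero, Lasso can set some coefficients exactly to zero (a form of feature selection), and Elastic Net mixes both penalties. With the settings used here (Ridge `alpha=1.0`; Lasso and Elastic Net `alpha=0.001`), all three scored within one RMSE unit of the unregularized polynomial model (4550.23, 4551.13, and 4550.50 versus 4551.13). The fitted coefficients were not inspected, so whether Lasso actually zeroed any of them at this small `alpha` is not established.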
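
The different behavior of the three penalties can be sketched for a single coefficient. The functions below are a simplified one-dimensional picture, not scikit-learn's solvers: each returns the minimizer of the objective stated in its docstring, where `b_ols` is a hypothetical unpenalized estimate. scikit-learn scales its objectives differently, so the `alpha` values here are not comparable to those in the notebook.

```python
def ridge_shrink(b_ols, alpha):
    """Minimizer of 0.5*(b - b_ols)**2 + 0.5*alpha*b**2."""
    return b_ols / (1.0 + alpha)


def lasso_shrink(b_ols, alpha):
    """Minimizer of 0.5*(b - b_ols)**2 + alpha*abs(b) (soft thresholding)."""
    if abs(b_ols) <= alpha:
        return 0.0  # small coefficients are set exactly to zero
    return b_ols - alpha if b_ols > 0 else b_ols + alpha


def elastic_net_shrink(b_ols, alpha, l1_ratio):
    """Minimizer of 0.5*(b - b_ols)**2 + alpha*l1_ratio*abs(b)
    + 0.5*alpha*(1 - l1_ratio)*b**2."""
    thresholded = lasso_shrink(b_ols, alpha * l1_ratio)
    return thresholded / (1.0 + alpha * (1.0 - l1_ratio))


# Hypothetical unpenalized coefficients, chosen only for illustration.
coefficients = [3.0, 0.4, -0.2]
alpha = 0.5

for b in coefficients:
    print(
        b,
        round(ridge_shrink(b, alpha), 4),
        round(lasso_shrink(b, alpha), 4),
        round(elastic_net_shrink(b, alpha, l1_ratio=0.5), 4),
    )
```

In this picture Ridge scales every coefficient by the same factor and never reaches zero, while the L1 term zeroes any coefficient whose magnitude falls below the threshold. A very small `alpha` leaves the coefficients almost unchanged under all three penalties.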

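Among all models tested, **Ridge with polynomial features achieved the best test performance**, with the lowest RMSE (4550.23) and the highest R² (0.8666), followed closely by Elastic Net (4550.50). The margin over the unregularized polynomial model is under one RMSE unit (about 0.02%), so the four polynomial models are effectively tied on this single train/test split. Nearly all of the improvement over the baseline comes from the degree-2 feature expansion; regularization at these untuned strengths adds little.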
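
The ranking can be checked directly from the metrics in the results table. The sketch below copies the reported values into a plain dictionary (the name `reported` is introduced here for illustration), sorts by RMSE, and measures how far apart the four polynomial models are.

```python
# (RMSE, R2) values copied from the results table above.
reported = {
    "Baseline Linear": (5796.284659, 0.783593),
    "Polynomial Linear": (4551.132385, 0.866583),
    "Ridge + Polynomial": (4550.233996, 0.866636),
    "Lasso + Polynomial": (4551.126816, 0.866583),
    "Elastic Net + Polynomial": (4550.499635, 0.86662),
}

# Rank models from lowest to highest test RMSE.
ranked = sorted(reported.items(), key=lambda item: item[1][0])
best_name, (best_rmse, best_r2) = ranked[0]

# Spread in RMSE among the four models that use polynomial features.
poly_rmses = [
    rmse for name, (rmse, _) in reported.items() if name != "Baseline Linear"
]
poly_spread = max(poly_rmses) - min(poly_rmses)

for name, (rmse, r2) in ranked:
    print(f"{name:<26} RMSE={rmse:10.3f}  R2={r2:.6f}")
print("best:", best_name)
print("RMSE spread among polynomial models:", round(poly_spread, 3))
```

A spread this small is likely within the variation from choosing a different `random_state`, so cross-validation over several splits and `alpha` values would be needed before preferring one regularizer over another.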
