# Cross-Validation Comparison of Regression Models for Predicting Insurance Charges

## Libraries
#### pandas
For data manipulation and analysis.
#### numpy
Provides support for numerical operations and arrays.
#### OneHotEncoder
Encodes categorical variables into binary format.
#### cross_val_score
Evaluates a model’s performance using cross-validation.
#### LinearRegression, Lasso, Ridge
Implements linear regression models; Lasso uses L1 regularization, Ridge uses L2 regularization.
#### DecisionTreeRegressor
Implements decision tree regression for modeling non-linear relationships.

In [8]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.tree import DecisionTreeRegressor

## Load the dataset

In [9]:
df = pd.read_csv('../dataset/insurance.csv')

## Drop duplicate rows

In [10]:
df = df.drop_duplicates()
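
Before dropping duplicates, it can help to check how many rows are affected. The following is a minimal sketch on a made-up three-row frame; the column values are invented for illustration and are not taken from insurance.csv.

```python
import pandas as pd

# Toy data, invented for illustration only (not rows from insurance.csv).
toy = pd.DataFrame({
    "age": [19, 19, 28],
    "smoker": ["yes", "yes", "no"],
    "charges": [100.0, 100.0, 250.0],
})

# duplicated() marks every repeat of an earlier row as True.
n_dupes = int(toy.duplicated().sum())
deduped = toy.drop_duplicates()

print(f"Duplicate rows: {n_dupes}")
print(f"Rows before: {len(toy)}, after: {len(deduped)}")
print(f"Remaining index labels: {list(deduped.index)}")
```

Note that drop_duplicates() keeps the original index labels, so the index can have gaps afterwards. That matters whenever a new frame is later aligned with the deduplicated one.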

## Apply one-hot encoding

In [11]:
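# drop='first' omits one level per feature to avoid redundant (collinear) columns
encoder = OneHotEncoder(sparse_output=False, drop='first')
encoded_data = encoder.fit_transform(df[['smoker', 'sex', 'region']])
encoded_columns = encoder.get_feature_names_out(['smoker', 'sex', 'region'])
# Reuse df.index so the concat below aligns rows correctly after drop_duplicates()
df_encoded = pd.DataFrame(encoded_data, columns=encoded_columns, index=df.index)
# Replace the raw categorical columns with their encoded counterparts
df = pd.concat([df.drop(columns=['smoker', 'sex', 'region']), df_encoded], axis=1)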

## Split the data into features (X) and target (y)

In [12]:
X = df.drop(columns=['charges'])  # Features
y = df['charges']  # Target

## Define models

In [13]:
models = {
    'Linear Regression': LinearRegression(),
    'Decision Tree': DecisionTreeRegressor(random_state=42),
    'Lasso Regression': Lasso(alpha=1.0, random_state=42, max_iter=10000),
    'Ridge Regression': Ridge(alpha=1.0, random_state=42, max_iter=10000)
}

## Perform 5-fold cross-validation and print the mean and standard deviation of RMSE
Linear Regression, Lasso Regression, and Ridge Regression are better suited to predicting charges in this insurance dataset than Decision Tree Regression: their mean RMSE is about 6,074 versus about 6,597 for the tree, and their fold-to-fold standard deviation is lower (about 190 to 194 versus about 311).

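Regularization (Lasso and Ridge) did not meaningfully change performance relative to standard Linear Regression: the three mean RMSE values differ by less than 1. A likely explanation is that the penalty at alpha=1.0 is too weak, relative to the scale of this data, to change the fitted coefficients much.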
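
One way to probe the regularization observation is to standardize the features and try several penalty strengths, because L1 and L2 penalties depend on feature scale. The sketch below is not part of the original analysis: it uses synthetic data from make_regression rather than the insurance data, and the alpha grid and the names X_demo, y_demo, and results are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: NOT the insurance dataset).
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=42
)

results = {}
for alpha in (0.1, 1.0, 10.0, 100.0):  # illustrative grid, not tuned
    # Scaling inside the pipeline is refit on each training fold,
    # so no information leaks from the validation fold.
    model = make_pipeline(StandardScaler(), Ridge(alpha=alpha))
    scores = cross_val_score(
        model, X_demo, y_demo, cv=5, scoring="neg_mean_squared_error"
    )
    results[alpha] = np.sqrt(-scores).mean()
    print(f"alpha={alpha}: Mean RMSE = {results[alpha]:.2f}")
```

The same pipeline could be pointed at X and y from this notebook to see whether any alpha moves the RMSE away from the plain Linear Regression result.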

In [14]:
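for name, model in models.items():
    # scikit-learn reports negated MSE so that a higher score is always better
    scores = cross_val_score(model, X, y, cv=5, scoring='neg_mean_squared_error')
    # Flip the sign and take the square root to get one RMSE per fold
    rmse_scores = np.sqrt(-scores)
    print(f"{name}: Mean RMSE = {rmse_scores.mean():.2f}, Std = {rmse_scores.std():.2f}")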

Linear Regression: Mean RMSE = 6074.29, Std = 194.30
Decision Tree: Mean RMSE = 6597.09, Std = 310.63
Lasso Regression: Mean RMSE = 6074.19, Std = 193.97
Ridge Regression: Mean RMSE = 6074.48, Std = 189.73
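
The mean and standard deviation printed above are computed over per-fold RMSE values, which are derived from the negated MSE scores that cross_val_score returns. The sketch below walks through that arithmetic with hypothetical fold scores; the numbers are invented for illustration and are not results from this notebook.

```python
import numpy as np

# Hypothetical per-fold scores, shaped like the output of
# cross_val_score(..., scoring='neg_mean_squared_error').
neg_mse = np.array([-36.0, -49.0, -25.0, -64.0, -16.0])

# Negate to recover MSE, then take the square root fold by fold.
rmse_per_fold = np.sqrt(-neg_mse)

mean_rmse = rmse_per_fold.mean()
std_rmse = rmse_per_fold.std()  # population std (ddof=0), NumPy's default

# Pooling the MSE first and then taking the root gives a different number.
pooled_rmse = np.sqrt((-neg_mse).mean())

print(f"Per-fold RMSE: {rmse_per_fold}")
print(f"Mean RMSE = {mean_rmse:.2f}, Std = {std_rmse:.2f}")
print(f"RMSE of pooled MSE = {pooled_rmse:.2f}")
```

The mean of per-fold RMSE values is generally not equal to the square root of the mean MSE, so results are only comparable when they use the same convention.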
