# Load Libraries

In [1]:
import os
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.svm import SVR
from sklearn.neural_network import MLPRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

Import the libraries needed for data handling, model training, and evaluation.

# Data Preparation

In [6]:
# Load dataset from Google Drive
os.chdir('/content/drive/MyDrive/Colab/Datasets/')
df = pd.read_csv('insurance.csv')

# Check for null values
df.isnull().sum()

# Drop nan values if any
df.dropna(inplace=True)

# Remove duplicates
df.drop_duplicates(inplace=True)

# Encoding categorical variables
df = pd.get_dummies(df, columns=['sex', 'smoker', 'region'], drop_first=True)

# Split the data: 80% training, 20% test
X = df.drop('charges', axis=1)
y = df['charges']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Check sizes
print(f'Training set: {X_train.shape}')
print(f'Test set: {X_test.shape}')

Training set: (1069, 8)
Test set: (268, 8)


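Load the dataset from Google Drive, check for null values and drop any that exist, then remove duplicate rows. One-hot encode the categorical columns (`sex`, `smoker`, `region`) with `drop_first=True`, separate the features from the target variable (`charges`), and split the data into 80% training and 20% test sets.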
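To make the encoding step concrete, here is a minimal sketch on a few made-up rows. The values are illustrative and are not taken from `insurance.csv`; only the column names match the notebook. It shows how `drop_first=True` drops one category per column, which is why three categorical columns plus three numeric ones end up as 8 features.

```python
import pandas as pd

# Hypothetical toy rows, invented for illustration only.
toy = pd.DataFrame({
    'age': [25, 40, 33, 58],
    'bmi': [22.5, 30.1, 27.8, 35.0],
    'children': [0, 2, 1, 3],
    'sex': ['female', 'male', 'male', 'female'],
    'smoker': ['no', 'yes', 'no', 'no'],
    'region': ['northeast', 'northwest', 'southeast', 'southwest'],
})

# Same call as in the notebook: the first category of each column
# (alphabetically) is dropped and becomes the implicit baseline.
encoded = pd.get_dummies(toy, columns=['sex', 'smoker', 'region'], drop_first=True)

print(list(encoded.columns))
print(encoded.shape)
```

The dropped categories (`female`, `no`, `northeast` here) are represented by all of that column's dummies being zero, which avoids perfectly collinear columns in the linear models.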

# Fitting the Models

In [11]:
# Run Dataset on LinearRegression
model_lr = LinearRegression()
model_lr.fit(X_train, y_train)

y_pred_lr = model_lr.predict(X_test)

# Run Dataset on Lasso
model_lasso = Lasso(alpha=2.0)
model_lasso.fit(X_train, y_train)

y_pred_lasso = model_lasso.predict(X_test)

# Run Dataset on MLPRegressor
model_mlp = MLPRegressor(hidden_layer_sizes=(10, 5), max_iter=5000)
model_mlp.fit(X_train, y_train)

y_pred_mlp = model_mlp.predict(X_test)

# Run Dataset on Gradient Boosting Regression
model_gbr = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)
model_gbr.fit(X_train, y_train)

y_pred_gbr = model_gbr.predict(X_test)

# Run Dataset on SVR
model_svr = SVR(kernel='rbf', C=1.0, epsilon=0.1)
model_svr.fit(X_train, y_train)

y_pred_svr = model_svr.predict(X_test)

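Training on different algorithms:
- LinearRegression is fitted with its default settings; ordinary least squares has no regularization hyperparameter to tune.
- Lasso uses an L1 penalty of `alpha=2.0`. The higher the penalty, the more coefficients are shrunk to exactly zero, which acts as feature selection. A larger penalty is stronger, not necessarily better.
- MLPRegressor uses two hidden layers with 10 and 5 neurons and trains for up to 5000 iterations, stopping earlier if it converges. No `random_state` is set, so its results can vary between runs.
- GradientBoostingRegressor uses 100 trees (`n_estimators`), a `learning_rate` of 0.1 (the scikit-learn default) that shrinks each tree's contribution for stable convergence, `max_depth=3` to limit the complexity of each tree, and `random_state` for reproducibility.
- SVR uses the RBF kernel so it can fit non-linear relationships. `C=1.0` is the regularization parameter (smaller values regularize more strongly and reduce overfitting), and `epsilon=0.1` is the margin within which errors are not penalized.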
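The effect of the Lasso penalty described above can be seen in a small sketch. The data here is synthetic and the `alpha` values are chosen only for illustration; they are assumptions, not settings from the notebook. Only the first two of five features actually influence the target.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data (assumption): 5 features, only the first two matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)

alphas = [0.01, 1.0, 10.0]  # illustrative penalties, weakest to strongest
nonzero_counts = []
for alpha in alphas:
    model = Lasso(alpha=alpha).fit(X, y)
    # Count the coefficients that the L1 penalty has not shrunk to zero.
    n_nonzero = int(np.sum(model.coef_ != 0))
    nonzero_counts.append(n_nonzero)
    print(f'alpha={alpha}: {n_nonzero} non-zero coefficients')
```

As the penalty grows, fewer coefficients survive, and a large enough penalty removes even the informative features. That is why a higher `alpha` means stronger selection rather than a better model.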

# Evaluation Metrics

## Metrics on Test Data

In [13]:
# Calculate the MSE for model predictions
mse_lr = mean_squared_error(y_test, y_pred_lr)
mse_lasso = mean_squared_error(y_test, y_pred_lasso)
mse_neural = mean_squared_error(y_test, y_pred_mlp)
mse_gradient = mean_squared_error(y_test, y_pred_gbr)
mse_svr = mean_squared_error(y_test, y_pred_svr)

print(f'Linear Regression - MSE: {mse_lr:.2f}')
print(f'Lasso Regression - MSE: {mse_lasso:.2f}')
print(f'Neural Network - MSE: {mse_neural:.2f}')
print(f'Gradient Boosting - MSE: {mse_gradient:.2f}')
print(f'SVR - MSE: {mse_svr:.2f}')

Linear Regression - MSE: 35478020.68
Lasso Regression - MSE: 35491022.54
Neural Network - MSE: 36263054.07
Gradient Boosting - MSE: 18218239.92
SVR - MSE: 208462453.98


Mean squared error (MSE) for each model: the average squared difference between actual and predicted values. Lower MSE means better model performance. Because the errors are squared, the values are in squared units of `charges`.
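Since MSE is in squared units, taking the square root (RMSE) gives an error in the same units as `charges`, which may be easier to read. The sketch below only reuses the MSE values printed above; the dictionary and variable names are introduced here for illustration.

```python
import math

# MSE values copied from the printed output above.
mse_by_model = {
    'Linear Regression': 35478020.68,
    'Lasso Regression': 35491022.54,
    'Neural Network': 36263054.07,
    'Gradient Boosting': 18218239.92,
    'SVR': 208462453.98,
}

# RMSE is the square root of MSE, in the same units as the target.
rmse_by_model = {name: math.sqrt(mse) for name, mse in mse_by_model.items()}

# Print from best (lowest error) to worst.
for name, rmse in sorted(rmse_by_model.items(), key=lambda item: item[1]):
    print(f'{name} - RMSE: {rmse:.2f}')
```

The ranking is the same as with MSE, because the square root preserves order.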

## Comparison Analysis

The **Mean Squared Error (MSE)** shows how far predictions are from actual values: the lower, the better. **Gradient Boosting performed best** with the lowest MSE (about 18.2M), meaning it made the most accurate predictions on the test set. **Linear Regression and Lasso Regression** had nearly identical MSE (about 35.5M), with Lasso marginally higher, showing that L1 regularization with `alpha=2.0` did not help. **Neural Network (MLP)** did slightly worse (about 36.3M), possibly because the features were not scaled or because the network is small. **SVR performed the worst** (about 208.5M). The RBF kernel can handle non-linear data, so the more likely cause is that the features and target were not scaled and that `C=1.0` and `epsilon=0.1` are tiny relative to the scale of `charges`. On this single train/test split, **Gradient Boosting is the best choice**, as it captures complex patterns better than the other models as configured here.
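One way to check whether scaling explains a weak SVR result is to compare the same SVR settings with and without standardization. The sketch below uses synthetic data with a target in the tens of thousands; the data-generating formula is an assumption made for illustration and is not the insurance dataset. It therefore shows the general effect and does not confirm the cause in this notebook.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic data (assumption): a large-valued target driven by two features.
rng = np.random.default_rng(0)
n = 500
age = rng.integers(18, 65, size=n)
smoker = rng.integers(0, 2, size=n)
X = np.column_stack([age, smoker]).astype(float)
y = 2000 + 250 * age + 20000 * smoker + rng.normal(0, 1000, size=n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# Same SVR settings as the notebook, fitted on raw features and target.
raw_svr = SVR(kernel='rbf', C=1.0, epsilon=0.1).fit(X_tr, y_tr)

# Same settings, but features and target are standardized before fitting;
# predictions are transformed back to the original target scale.
scaled_svr = TransformedTargetRegressor(
    regressor=make_pipeline(StandardScaler(), SVR(kernel='rbf', C=1.0, epsilon=0.1)),
    transformer=StandardScaler(),
).fit(X_tr, y_tr)

mse_raw = mean_squared_error(y_te, raw_svr.predict(X_te))
mse_scaled = mean_squared_error(y_te, scaled_svr.predict(X_te))
print(f'SVR (raw) - MSE: {mse_raw:.2f}')
print(f'SVR (scaled) - MSE: {mse_scaled:.2f}')
```

If the same pattern appears on the insurance data, scaling (and possibly a larger `C`) would be worth trying before ruling SVR out.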