# Insurance Cost Prediction: Linear vs Polynomial Regression Analysis

## Overview
This notebook compares the performance of **Linear Regression** and **Degree 2 Polynomial Regression** models for predicting insurance costs based on patient demographics and health factors.

### Dataset
- **Target Variable**: `charges` - Insurance cost charged to patients
- **Features**: age, sex, BMI, children, smoker status, region (US northeast, southeast, southwest, or northwest)
- **Total Records**: 1,338 patients

### Models to Compare
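1. **Linear Regression** - Baseline linear model on the eight encoded features, fitted as `Ridge` with a very small penalty (`alpha=0.01`)
2. **Polynomial Regression (Degree 2)** - The same features plus their pairwise interaction terms (`interaction_only=True`, so no squared terms), fitted as `Ridge` with `alpha=0.1`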

## Section 1: Imports, Data Loading, and Linear Regression

**Purpose**: Import the required libraries, load the insurance dataset, then train and evaluate a standard linear regression model
- Data preprocessing and feature encoding
- Train-test split
- Model training and prediction
- Performance evaluation (MSE, R²)


In [5]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# reading data
data = pd.read_csv('insurance.csv')
print("Dataset shape:", data.shape)
print("\nData head:")
print(data.head())

# Basic data info
print("\nBASIC DATA INFO")
print("\nData types:")
print(data.dtypes)

# Preprocessing features, target variable is "charges"
X = data.drop("charges", axis=1)
y = data["charges"]

print(f"\nTarget variable (charges) range: {y.min():.2f} to {y.max():.2f}")

# One-hot encode categorical variables
X = pd.get_dummies(X, drop_first=True)  # avoids dummy variable trap
print(f"\nFeatures after encoding (shape: {X.shape}):")
print(X.head())

# Convert boolean columns to numeric (0/1) to prevent scaling issues
print(f"\n=== CONVERTING BOOLEAN TO NUMERIC ===")
for col in X.columns:
    if X[col].dtype == 'bool':
        X[col] = X[col].astype(int)
        print(f"Converted {col} from bool to int")

print(f"\nData types after conversion:")
print(X.dtypes)

# Splitting dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=333
)

print(f"\nFEATURE SCALING")
# Apply feature scaling for stability (fit on the training split only to avoid leakage)
feature_scaler = StandardScaler()
X_train_scaled = feature_scaler.fit_transform(X_train)
X_test_scaled = feature_scaler.transform(X_test)

print("Features standardized with StandardScaler for numerical stability")
print("All features are now numeric; the 0/1 dummy columns are standardized along with the continuous ones")

# Fit the linear baseline: Ridge with alpha=0.01 is least squares plus a very small L2 penalty
print(f"\nLINEAR REGRESSION - MINIMAL APPROACH")
lin_reg = Ridge(alpha=0.01, random_state=333)  # Very small regularization
lin_reg.fit(X_train_scaled, y_train)

print("Linear regression intercept:", lin_reg.intercept_)
print("Linear regression coefficients:", lin_reg.coef_)

# Make predictions
y_pred = lin_reg.predict(X_test_scaled)

print(f"\nLINEAR REGRESSION EVALUATION")
# Evaluate model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
rmse = np.sqrt(mse)

print("Mean Squared Error:", mse)
print("Root Mean Squared Error:", rmse)
print("R^2 Score:", r2)

Dataset shape: (1338, 7)

Data head:
   age     sex     bmi  children smoker     region      charges
0   19  female  27.900         0    yes  southwest  16884.92400
1   18    male  33.770         1     no  southeast   1725.55230
2   28    male  33.000         3     no  southeast   4449.46200
3   33    male  22.705         0     no  northwest  21984.47061
4   32    male  28.880         0     no  northwest   3866.85520

BASIC DATA INFO

Data types:
age           int64
sex          object
bmi         float64
children      int64
smoker       object
region       object
charges     float64
dtype: object

Target variable (charges) range: 1121.87 to 63770.43

Features after encoding (shape: (1338, 8)):
   age     bmi  children  sex_male  smoker_yes  region_northwest  \
0   19  27.900         0     False        True             False   
1   18  33.770         1      True       False             False   
2   28  33.000         3      True       False             False   
3   33  22.705         0
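
The baseline in this section is fitted with `Ridge(alpha=0.01)` rather than `LinearRegression`. The sketch below suggests why the two should behave almost identically at this penalty strength: on standardized features, the Ridge coefficients should differ from the ordinary least squares ones only marginally. It runs on synthetic data, not `insurance.csv`; the data, the seed, and every variable name are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the eight encoded features (assumption, not the real data)
rng = np.random.default_rng(333)
X_demo = rng.normal(size=(500, 8))
true_coef = np.array([3.0, -2.0, 0.5, 0.0, 10.0, 1.0, -1.0, 0.25])
y_demo = X_demo @ true_coef + rng.normal(scale=0.5, size=500)

# Standardize, as the notebook does before fitting
X_demo_scaled = StandardScaler().fit_transform(X_demo)

ols = LinearRegression().fit(X_demo_scaled, y_demo)
ridge = Ridge(alpha=0.01).fit(X_demo_scaled, y_demo)

# Largest absolute gap between the two coefficient vectors
max_coef_gap = np.max(np.abs(ols.coef_ - ridge.coef_))
print("Largest coefficient difference:", max_coef_gap)
```

If the gap is negligible, as expected here, labelling the Ridge fit "Linear Regression" in the comparison tables is a reasonable shorthand.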

## Section 2: Polynomial Regression Model (Degree 2)

**Purpose**: Train and evaluate a polynomial regression model with degree 2
- Create polynomial features
- Train polynomial regression model
- Make predictions and evaluate performance
- Compare with linear regression results

In [6]:
# Create polynomial features (degree 2) with minimal approach
print("POLYNOMIAL REGRESSION")

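# Create degree-2 features from the scaled inputs.
# interaction_only=True keeps the 8 original columns plus their 28 pairwise
# products and omits squared terms such as age^2 and bmi^2.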
print("Creating polynomial features...")
poly_features = PolynomialFeatures(degree=2, include_bias=False, interaction_only=True)
X_train_poly = poly_features.fit_transform(X_train_scaled)
X_test_poly = poly_features.transform(X_test_scaled)

print(f"Polynomial features created: {X_train_poly.shape[1]} features")

# Apply scaling to polynomial features for stability
print("Scaling polynomial features...")
poly_scaler = StandardScaler()
X_train_poly_scaled = poly_scaler.fit_transform(X_train_poly)
X_test_poly_scaled = poly_scaler.transform(X_test_poly)

print("Polynomial features scaled")

# Train with Ridge regression
print("Training Ridge regression model...")
poly_reg = Ridge(alpha=0.1, random_state=333, max_iter=1000)
poly_reg.fit(X_train_poly_scaled, y_train)

print("Model trained successfully")

# Make predictions
print("Making predictions...")
y_pred_poly = poly_reg.predict(X_test_poly_scaled)

# Evaluate polynomial model
print("Evaluating model performance...")
mse_poly = mean_squared_error(y_test, y_pred_poly)
r2_poly = r2_score(y_test, y_pred_poly)
rmse_poly = np.sqrt(mse_poly)

print("\nPOLYNOMIAL REGRESSION RESULTS...")
print("Mean Squared Error:", mse_poly)
print("Root Mean Squared Error:", rmse_poly)
print("R^2 Score:", r2_poly)
print("Polynomial intercept:", poly_reg.intercept_)
print("Number of polynomial features:", X_train_poly.shape[1])
print("Regularization alpha:", poly_reg.alpha)


POLYNOMIAL REGRESSION
Creating polynomial features...
Polynomial features created: 36 features
Scaling polynomial features...
Polynomial features scaled
Training Ridge regression model...
Model trained successfully
Making predictions...
Evaluating model performance...

POLYNOMIAL REGRESSION RESULTS...
Mean Squared Error: 20935700.36815831
Root Mean Squared Error: 4575.554651422963
R^2 Score: 0.8666711064844506
Polynomial intercept: 13060.530650756074
Number of polynomial features: 36
Regularization alpha: 0.1
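
The count of 36 features follows from `interaction_only=True`: 8 original columns plus the 28 distinct pairwise products. A minimal sketch, using a placeholder matrix with 8 columns (an assumption standing in for the encoded data), shows how that count compares with a full degree-2 expansion, which would also include the 8 squared terms.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Placeholder matrix with 8 columns, matching the number of encoded features
X_demo = np.arange(16, dtype=float).reshape(2, 8)

interactions_only = PolynomialFeatures(degree=2, include_bias=False, interaction_only=True)
full_degree_2 = PolynomialFeatures(degree=2, include_bias=False)

n_interaction = interactions_only.fit_transform(X_demo).shape[1]
n_full = full_degree_2.fit_transform(X_demo).shape[1]

print("interaction_only=True:", n_interaction)  # 8 originals + 28 pairwise products
print("full degree 2:", n_full)                 # additionally includes 8 squared terms
```

Squares of the 0/1 dummy columns would carry no new information before scaling, but terms such as `age^2` and `bmi^2` are excluded as well, so the model here is better described as linear with pairwise interactions than as a full quadratic.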


## Section 3: Sample Predictions Comparison

**Purpose**: Compare predictions from both models using 3 sample cases
- Create 3 realistic sample patients
- Make predictions using both linear and polynomial models
- Display results in a comparison dataframe
- Show differences between model predictions


In [7]:
# Create 3 sample predictions using dataframes
print("SAMPLE PREDICTIONS COMPARISON")

# Create sample data for predictions
sample_data = pd.DataFrame({
    'age': [25, 45, 60],
    'sex': ['male', 'female', 'male'],
    'bmi': [22.5, 28.3, 35.2],
    'children': [0, 2, 1],
    'smoker': ['no', 'yes', 'no'],
    'region': ['northeast', 'southwest', 'northwest']
})

print("Sample data for predictions:")
print(sample_data)

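# One-hot encode the sample data with the same preprocessing as the training data.
# Caution: drop_first drops the first category present in THIS frame, so the result
# matches the training encoding only because every baseline level
# (female, no, northeast) appears among the samples.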
sample_encoded = pd.get_dummies(sample_data, drop_first=True)

# Convert boolean columns to numeric, same as training data
for col in sample_encoded.columns:
    if sample_encoded[col].dtype == 'bool':
        sample_encoded[col] = sample_encoded[col].astype(int)

# Ensure all columns match the training data
# Add missing columns with 0 values
for col in X.columns:
    if col not in sample_encoded.columns:
        sample_encoded[col] = 0

# Reorder columns to match training data
sample_encoded = sample_encoded[X.columns]

print("\nEncoded sample data (with boolean to numeric conversion):")
print(sample_encoded)

# Make predictions with linear regression
sample_scaled = feature_scaler.transform(sample_encoded)
linear_predictions = lin_reg.predict(sample_scaled)

# Make predictions with polynomial regression
sample_poly = poly_features.transform(sample_scaled)
sample_poly_scaled = poly_scaler.transform(sample_poly)
poly_predictions = poly_reg.predict(sample_poly_scaled)

# Create comparison dataframe
comparison_df = pd.DataFrame({
    'Sample': [1, 2, 3],
    'Age': sample_data['age'],
    'Sex': sample_data['sex'],
    'BMI': sample_data['bmi'],
    'Children': sample_data['children'],
    'Smoker': sample_data['smoker'],
    'Region': sample_data['region'],
    'Linear_Prediction': linear_predictions,
    'Polynomial_Prediction': poly_predictions,
    'Difference': poly_predictions - linear_predictions
})

print("\nPREDICTION COMPARISON")
print(comparison_df.round(2))


SAMPLE PREDICTIONS COMPARISON
Sample data for predictions:
   age     sex   bmi  children smoker     region
0   25    male  22.5         0     no  northeast
1   45  female  28.3         2    yes  southwest
2   60    male  35.2         1     no  northwest

Encoded sample data (with boolean to numeric conversion):
   age   bmi  children  sex_male  smoker_yes  region_northwest  \
0   25  22.5         0         1           0                 0   
1   45  28.3         2         0           1                 0   
2   60  35.2         1         1           0                 1   

   region_southeast  region_southwest  
0                 0                 0  
1                 0                 1  
2                 0                 0  

PREDICTION COMPARISON
   Sample  Age     Sex   BMI  Children Smoker     Region  Linear_Prediction  \
0       1   25    male  22.5         0     no  northeast            1802.07   
1       2   45  female  28.3         2    yes  southwest           32983.20   
2
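
The encoding step in this cell relies on the sample frame containing every baseline category. A hedged sketch of one way to remove that dependency is to declare the full category lists up front with `pd.Categorical`. The helper `encode_with_fixed_levels` and the `CATEGORY_LEVELS` mapping are hypothetical names introduced here, and the level lists are taken from the dataset description rather than read from `insurance.csv`.

```python
import pandas as pd

# Full category lists (assumed from the dataset description); the first entry
# of each list is the baseline level that drop_first=True removes
CATEGORY_LEVELS = {
    "sex": ["female", "male"],
    "smoker": ["no", "yes"],
    "region": ["northeast", "northwest", "southeast", "southwest"],
}

def encode_with_fixed_levels(frame):
    """One-hot encode using the declared levels instead of the values present."""
    typed = frame.copy()
    for col, levels in CATEGORY_LEVELS.items():
        typed[col] = pd.Categorical(typed[col], categories=levels)
    return pd.get_dummies(typed, drop_first=True, dtype=int)

# A single patient whose frame contains none of the baseline levels
one_patient = pd.DataFrame({
    "age": [45], "sex": ["male"], "bmi": [28.3],
    "children": [2], "smoker": ["yes"], "region": ["southwest"],
})

naive = pd.get_dummies(one_patient, drop_first=True, dtype=int)
fixed = encode_with_fixed_levels(one_patient)

print("Naive columns:", list(naive.columns))
print("Fixed columns:", list(fixed.columns))
```

With the naive call, each categorical column has a single level, which `drop_first` removes, so filling the missing columns with zeros would silently encode this patient as a female non-smoker from the northeast. Declaring the levels keeps the dummy columns and their order stable regardless of which patients are in the frame.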

## Section 4: Model Performance Comparison & Analysis

**Purpose**: Comprehensive comparison of both models' performance
- Compare MSE, RMSE, and R² scores
- Calculate improvement percentages
- Determine which model performs better
- Provide final summary and recommendations


In [8]:
# Comprehensive comparison of models
print("MODEL PERFORMANCE COMPARISON")

# Create comparison dataframe for metrics
metrics_comparison = pd.DataFrame({
    'Model': ['Linear Regression', 'Polynomial Regression (Degree 2)'],
    'MSE': [mse, mse_poly],
    'RMSE': [rmse, rmse_poly],
    'R² Score': [r2, r2_poly],
    'Number of Features': [X_train.shape[1], X_train_poly.shape[1]]
})

print(metrics_comparison.round(4))

# Relative improvement of the polynomial model over the linear baseline, in percent
mse_improvement = ((mse - mse_poly) / mse) * 100
r2_improvement = ((r2_poly - r2) / abs(r2)) * 100

print(f"\nIMPROVEMENT ANALYSIS")
print(f"MSE Improvement: {mse_improvement:.2f}%")
print(f"R² Score Improvement: {r2_improvement:.2f}%")

if mse_poly < mse:
    print("Polynomial regression has LOWER MSE (better)")
else:
    print("Linear regression has LOWER MSE (better)")

if r2_poly > r2:
    print("Polynomial regression has HIGHER R² (better)")
else:
    print("Linear regression has HIGHER R² (better)")

print(f"\nSUMMARY")
print(f"Linear Regression - MSE: {mse:.2f}, R²: {r2:.4f}")
print(f"Polynomial Regression - MSE: {mse_poly:.2f}, R²: {r2_poly:.4f}")
print(f"Best performing model: {'Polynomial Regression' if mse_poly < mse and r2_poly > r2 else 'Linear Regression'}")


MODEL PERFORMANCE COMPARISON
                              Model           MSE       RMSE  R² Score  \
0                 Linear Regression  3.352968e+07  5790.4819    0.7865   
1  Polynomial Regression (Degree 2)  2.093570e+07  4575.5547    0.8667   

   Number of Features  
0                   8  
1                  36  

IMPROVEMENT ANALYSIS
MSE Improvement: 37.56%
R² Score Improvement: 10.20%
Polynomial regression has LOWER MSE (better)
Polynomial regression has HIGHER R² (better)

SUMMARY
Linear Regression - MSE: 33529681.10, R²: 0.7865
Polynomial Regression - MSE: 20935700.37, R²: 0.8667
Best performing model: Polynomial Regression
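
The percentages above are relative changes, which can read larger or smaller than the underlying gap depending on the metric. The sketch below recomputes them from the rounded values printed in the summary and adds two other views of the same difference: the reduction in RMSE and the absolute gain in R². The variable names are illustrative, and the inputs are the displayed figures rather than the unrounded values held in the notebook.

```python
# Metrics copied from the printed summary (rounded as displayed)
mse_linear, mse_poly = 33529681.10, 20935700.37
r2_linear, r2_poly = 0.7865, 0.8667

# Same formulas as the notebook cell
mse_improvement = (mse_linear - mse_poly) / mse_linear * 100
r2_improvement = (r2_poly - r2_linear) / abs(r2_linear) * 100

# Alternative views of the same gap
rmse_improvement = (1 - (mse_poly / mse_linear) ** 0.5) * 100  # error in dollars
r2_gain_points = (r2_poly - r2_linear) * 100                   # percentage points

print(f"MSE reduction:       {mse_improvement:.2f}%")
print(f"Relative R² change:  {r2_improvement:.2f}%")
print(f"RMSE reduction:      {rmse_improvement:.2f}%")
print(f"Absolute R² gain:    {r2_gain_points:.2f} points")
```

Because MSE is in squared dollars, the roughly 38% MSE reduction corresponds to a smaller reduction of about 21% in RMSE, and the 10.20% relative R² change is a gain of about 8 percentage points of explained variance. All of these figures come from a single train-test split with `random_state=333`.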
