# Health Insurance Cost Prediction - Traditional ML Approach
# Author: Data Science Team
# Date: July 2025

## Project Objective

The goal of this project is to build a predictive model that estimates individual health insurance costs. The model is trained on a dataset with features such as age, sex, BMI, number of children, smoker status, and region. We compare several traditional machine learning algorithms, select the best performer based on its evaluation metrics, and prepare that model for deployment.

### Key Tasks

1.  **Data Collection**: Load a suitable dataset (e.g., "Medical Cost Personal Datasets" from Kaggle).
2.  **Data Preprocessing**: Clean the data, handle any missing values and outliers, and convert categorical variables into a numeric format.
3.  **Feature Engineering & EDA**: Explore the data to find patterns and relationships between variables using visualizations and statistical analysis.
4.  **Model Building**: Develop several regression models using algorithms like Linear Regression, Decision Trees, and Gradient Boosting.
5.  **Model Evaluation**: Assess model performance using metrics such as Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and R-squared score.
6.  **Model Deployment Preparation**: Save the best-performing model and related artifacts for future deployment on platforms like GitHub and Hugging Face Spaces.
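
Tasks 2, 4, and 5 above can be sketched end to end on synthetic data. The snippet below is only a minimal illustration, not this project's final pipeline: the `toy` DataFrame and its made-up `charges` formula are assumptions for the example, and only the column names mirror the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data (NOT the Kaggle dataset); column names mirror it.
rng = np.random.default_rng(42)
n = 200
toy = pd.DataFrame({
    "age": rng.integers(18, 65, size=n),
    "bmi": rng.normal(30, 6, size=n),
    "children": rng.integers(0, 6, size=n),
    "sex": rng.choice(["female", "male"], size=n),
    "smoker": rng.choice(["yes", "no"], size=n, p=[0.2, 0.8]),
    "region": rng.choice(
        ["northeast", "northwest", "southeast", "southwest"], size=n
    ),
})
# Hypothetical target: an invented linear relationship plus noise.
toy["charges"] = (
    250 * toy["age"]
    + 300 * toy["bmi"]
    + 20000 * (toy["smoker"] == "yes")
    + rng.normal(0, 2000, size=n)
)

numeric_cols = ["age", "bmi", "children"]
categorical_cols = ["sex", "smoker", "region"]

# Task 2: scale numeric columns and one-hot encode categorical columns.
preprocessor = ColumnTransformer([
    ("num", StandardScaler(), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

# Task 4: chain preprocessing and a regressor in a single pipeline.
model = Pipeline([("prep", preprocessor), ("reg", LinearRegression())])

X_train, X_test, y_train, y_test = train_test_split(
    toy.drop(columns="charges"), toy["charges"], test_size=0.2, random_state=42
)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Task 5: MSE, RMSE (the square root of MSE), and R-squared on held-out data.
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)
print(f"MSE: {mse:.2f} | RMSE: {rmse:.2f} | R2: {r2:.3f}")
```

Keeping the preprocessing inside the `Pipeline` means the scaler and encoder are fitted on the training split only, which avoids leaking test-set statistics into the model.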

# ======================================================================
# 1. SETUP AND IMPORTS
# ======================================================================

In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
import warnings
import joblib
import gradio as gr

warnings.filterwarnings('ignore')

# Set plotting style
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

print("📊 Health Insurance Cost Prediction Project")
print("=" * 50)

📊 Health Insurance Cost Prediction Project
==================================================


# =============================================================================
# 2. DATA LOADING AND INITIAL EXPLORATION
# =============================================================================

# Load the dataset. The insurance.csv file is assumed to be in the same directory as the notebook.
# Note: This dataset is available on Kaggle (mirichoi0218/insurance).
try:
    df = pd.read_csv('insurance.csv')
    print("Dataset loaded successfully.")
except FileNotFoundError:
    print("Error: 'insurance.csv' not found. Please ensure the dataset file is uploaded.")
    raise  # Re-raise so execution stops here; exit() would shut down the notebook kernel.

In [2]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   object 
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   object 
 5   region    1338 non-null   object 
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB
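
The summary above lists three non-numeric columns (`sex`, `smoker`, `region`). Before encoding them, it can help to confirm how many distinct values each one holds. The sketch below uses a tiny hand-made `toy` frame as an assumption for illustration; on the real data, the same calls would be applied to `df`.

```python
import pandas as pd

# Hand-made illustrative rows (an assumption, not the full dataset).
toy = pd.DataFrame({
    "age": [19, 18, 28, 33],
    "sex": ["female", "male", "male", "male"],
    "smoker": ["yes", "no", "no", "no"],
    "region": ["southwest", "southeast", "southeast", "northwest"],
})

# Select every non-numeric column; these are the candidates for encoding.
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

# Count distinct values per categorical column.
cardinality = {col: toy[col].nunique() for col in categorical_cols}
print(cardinality)
```

Low-cardinality columns like these are usually good candidates for one-hot encoding, since each adds only a handful of indicator columns.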


In [3]:
df.describe()

             age        bmi  children       charges
count     1338.0     1338.0    1338.0        1338.0
mean   39.207025  30.663397  1.094918  13270.422265
std     14.04996   6.098187  1.205493  12110.011237
min         18.0      15.96       0.0     1121.8739
25%         27.0   26.29625       0.0    4740.28715
50%         39.0       30.4       1.0      9382.033
75%         51.0   34.69375       2.0  16639.912515
max         64.0      53.13       5.0   63770.42801
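
In the summary above, the mean of `charges` (about 13,270) sits well above the median (about 9,382), which usually points to a right-skewed distribution. The sketch below shows one way to quantify that skew and how a log transform can reduce it. It uses synthetic log-normal values as a stand-in (an assumption, not the real `charges` column), and the log transform is shown as a common option rather than a step this notebook necessarily takes.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed values standing in for 'charges' (an assumption).
rng = np.random.default_rng(0)
charges = pd.Series(rng.lognormal(mean=9.1, sigma=0.9, size=1000), name="charges")

# For a right-skewed variable the mean exceeds the median.
mean_value = charges.mean()
median_value = charges.median()

# Sample skewness: values well above 0 indicate a long right tail.
raw_skew = charges.skew()

# log1p compresses the right tail; skewness should move toward 0.
log_charges = np.log1p(charges)
log_skew = log_charges.skew()

print(f"mean={mean_value:.0f}, median={median_value:.0f}")
print(f"skew before log: {raw_skew:.2f}, after log: {log_skew:.2f}")
```

If a model is trained on a log-transformed target, its predictions need to be mapped back with `np.expm1` before computing error metrics in dollars.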


In [4]:
df.head()

   age     sex     bmi  children smoker     region      charges
0   19  female    27.9         0    yes  southwest    16884.924
1   18    male   33.77         1     no  southeast    1725.5523
2   28    male    33.0         3     no  southeast     4449.462
3   33    male  22.705         0     no  northwest  21984.47061
4   32    male   28.88         0     no  northwest    3866.8552


# =============================================================================
# 3. EXPLORATORY DATA ANALYSIS (EDA)
# =============================================================================

In [5]:
def perform_eda(df):
    """Comprehensive EDA function for the dataset."""

    # Check for missing values
    print("\n🔍 Missing Values Check:")
    missing_values = df.isnull().sum()
    print(missing_values[missing_values > 0] if missing_values.sum() > 0 else "No missing values found! ✅")

    # Check for duplicates
    duplicates = df.duplicated().sum()
    print(f"\n🔄 Duplicate rows: {duplicates}")
    if duplicates > 0:
        # Note: inplace=True also modifies the DataFrame passed in by the caller.
        df.drop_duplicates(inplace=True)
        print(f"✅ Duplicate rows removed. New shape: {df.shape}")

    # Visualize distributions
    plt.figure(figsize=(15, 10))

    plt.subplot(2, 3, 1)
    sns.histplot(df['charges'], kde=True, alpha=0.7, edgecolor='black')
    plt.title('Distribution of Insurance Charges')
    plt.xlabel('Charges ($)')
    plt.ylabel('Frequency')

    plt.subplot(2, 3, 2)
    sns.histplot(df['age'], kde=True, bins=30, alpha=0.7, edgecolor='black')
    plt.title('Age Distribution')
    plt.xlabel('Age')
    plt.ylabel('Frequency')

    plt.subplot(2, 3, 3)
    sns.histplot(df['bmi'], kde=True, bins=30, alpha=0.7, edgecolor='black')
    plt.title('BMI Distribution')
    plt.xlabel('BMI')
    plt.ylabel('Frequency')

    # Visualize relationships with 'charges'
    plt.subplot(2, 3, 4)
    sns.boxplot(data=df, x='smoker', y='charges')
    plt.title('Charges by Smoker Status')
    plt.xticks(rotation=45)

    plt.subplot(2, 3, 5)
    sns.boxplot(data=df, x='region', y='charges')
    plt.title('Charges by Region')
    plt.xticks(rotation=45)

    plt.subplot(2, 3, 6)
    sns.boxplot(data=df, x='children', y='charges')
    plt.title('Charges by Number of Children')

    plt.tight_layout()
    plt.show()

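    # Correlation analysis
    print("\n📊 Correlation Analysis:")
    df_corr = df.copy()
    # Label-encode the categorical columns so they can enter the correlation matrix.
    # Caveat: 'region' has no natural order and its codes are assigned alphabetically,
    # so its correlation coefficients should be read with caution.
    le = LabelEncoder()
    for col in ['sex', 'smoker', 'region']:
        df_corr[col] = le.fit_transform(df_corr[col])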

    plt.figure(figsize=(10, 8))
    correlation_matrix = df_corr.corr()
    sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0)
    plt.title('Correlation Matrix')
    plt.show()

    return df

# Run EDA on the loaded dataframe
df_clean = perform_eda(df)


🔍 Missing Values Check:
No missing values found! ✅

🔄 Duplicate rows: 1
✅ Duplicate rows removed. New shape: (1337, 7)
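
The output above reports one duplicate row. Before dropping duplicates it can be worth looking at every copy of the repeated record, and doing so without modifying the original frame. The sketch below shows one way to do this on a small hand-made `toy` frame (an assumption for illustration; the values are not taken from the dataset).

```python
import pandas as pd

# Hand-made rows in which the first and third records are identical.
toy = pd.DataFrame({
    "age": [19, 23, 19, 40],
    "sex": ["male", "female", "male", "female"],
    "bmi": [30.59, 27.9, 30.59, 33.0],
    "smoker": ["no", "yes", "no", "no"],
})

# keep=False flags every copy of a duplicated row, not just the later ones.
all_copies = toy[toy.duplicated(keep=False)]
print(all_copies)

# Return a new de-duplicated frame and leave 'toy' untouched.
deduped = toy.drop_duplicates().reset_index(drop=True)
print(f"original rows: {len(toy)}, after de-duplication: {len(deduped)}")
```

Resetting the index after dropping rows keeps positional and label-based indexing aligned in later steps.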
