# Medical Insurance Cost Prediction
### Regression Analysis Project

---

## 1.) Problem Statement & Dataset

I am using the Medical Cost Personal Dataset from Kaggle:
https://www.kaggle.com/datasets/mirichoi0218/insurance

Goal: 
Predict a person's medical insurance charges using factors like age, sex, BMI, number of children, smoking status, and region.

Target Variable: charges (continuous numeric variable)

Prediction Type: Regression

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from scipy import stats

## 2.) Load and Inspect Dataset

In [None]:
data = pd.read_csv("insurance.csv")

print("First 5 rows of dataset:")
display(data.head())

print("\nDataset Info:")
data.info()

print("\nMissing values per column:")
print(data.isnull().sum())

print("\nNumber of duplicates:", data.duplicated().sum())

## 3.) Basic Data Cleaning

In [None]:
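# Drop exact duplicate rows
data = data.drop_duplicates()

# Lowercase the text columns so category labels are consistent
for col in ['sex', 'smoker', 'region']:
    data[col] = data[col].str.lower()

# One-hot encode the categorical columns; drop_first removes one redundant
# (perfectly collinear) dummy per column, and dtype=int yields 0/1 values
data = pd.get_dummies(data, drop_first=True, dtype=int)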

print("Cleaned Data Sample:")
display(data.head())

print("\nNumeric Summary:")
display(data.describe())
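
To make the effect of `drop_first=True` concrete, here is a minimal sketch on a toy frame. The rows are made up for illustration and are not taken from the dataset; only the column names follow the notebook.

```python
import pandas as pd

# Toy frame with made-up rows, used only to illustrate the encoding.
toy = pd.DataFrame({
    "age": [19, 33, 45],
    "smoker": ["yes", "no", "no"],
    "region": ["southwest", "northwest", "southeast"],
})

# drop_first=True drops the alphabetically first level of each categorical
# column, which becomes the implicit baseline (here "no" and "northwest").
encoded = pd.get_dummies(toy, drop_first=True, dtype=int)
print(encoded.columns.tolist())
```

A row with zeros in every `region_*` column therefore belongs to the dropped baseline region.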

## 4.) Identify and Remove Outliers

In [None]:
# Plot boxplots to visualize outliers
numeric_cols = ['age', 'bmi', 'children', 'charges']
for col in numeric_cols:
    plt.figure(figsize=(5,3))
    sns.boxplot(x=data[col])
    plt.title(f'Boxplot of {col}')
    plt.show()

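# Remove rows whose 'charges' value lies 3 or more standard deviations from the mean
z = np.abs(stats.zscore(data['charges']))
data = data[z < 3].copy()  # .copy() avoids SettingWithCopyWarning when adding columns later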

print("Shape after removing outliers:", data.shape)
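
The Z-score rule assumes roughly symmetric data, while cost data is usually right-skewed. As a rough comparison, the sketch below applies the Z-score rule and the common 1.5×IQR rule to synthetic lognormal values. The data is generated here purely for illustration and is an assumption, not the real `charges` column.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed values standing in for 'charges' (not the real data).
rng = np.random.default_rng(0)
charges = pd.Series(rng.lognormal(mean=9.0, sigma=0.9, size=1000))

# Z-score rule: keep rows within 3 standard deviations of the mean.
# ddof=0 matches the default of scipy.stats.zscore.
z = np.abs((charges - charges.mean()) / charges.std(ddof=0))
kept_z = charges[z < 3]

# IQR rule: keep rows inside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = charges.quantile([0.25, 0.75])
iqr = q3 - q1
kept_iqr = charges[charges.between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

print("Z-score rule keeps:", len(kept_z))
print("IQR rule keeps:", len(kept_iqr))
```

On skewed data like this, the IQR rule tends to discard more of the upper tail, so the choice of rule changes how many high-cost rows survive.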

## 5.) Feature Engineering

In [None]:
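# Interaction term: equals BMI for smokers and 0 for non-smokers
if 'smoker_yes' in data.columns:
    data['bmi_smoker_interaction'] = data['bmi'] * data['smoker_yes']

# Create BMI category feature; right=False makes the bins left-closed, so a BMI
# of exactly 25 is 'Overweight' and exactly 30 is 'Obese'
data['bmi_category'] = pd.cut(data['bmi'], bins=[0, 18.5, 25, 30, 100], right=False,
                              labels=['Underweight', 'Normal', 'Overweight', 'Obese'])
data = pd.get_dummies(data, columns=['bmi_category'], drop_first=True, dtype=int)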

print("Feature engineered data sample:")
display(data.head())
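
How `pd.cut` treats a value that sits exactly on a bin edge depends on its `right` argument. The sketch below uses three hypothetical BMI values on the boundaries to show both behaviours.

```python
import pandas as pd

# Hypothetical BMI values sitting exactly on the category boundaries.
bmi = pd.Series([18.5, 25.0, 30.0])
bins = [0, 18.5, 25, 30, 100]
labels = ["Underweight", "Normal", "Overweight", "Obese"]

# Default right=True: intervals are (low, high], so 25.0 falls in "Normal".
right_closed = pd.cut(bmi, bins=bins, labels=labels)

# right=False: intervals are [low, high), so 25.0 falls in "Overweight".
left_closed = pd.cut(bmi, bins=bins, labels=labels, right=False)

print(pd.DataFrame({"bmi": bmi, "right=True": right_closed, "right=False": left_closed}))
```

The left-closed variant matches the usual convention in which a BMI of 25 counts as overweight and 30 as obese.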

## 6.) Feature Selection

In [None]:
# Correlation heatmap
plt.figure(figsize=(10,8))
sns.heatmap(data.corr(), annot=True, cmap="coolwarm")
plt.title("Feature Correlation Heatmap")
plt.show()

# Check for low-variance features (the target 'charges' is excluded)
variance = data.drop(columns='charges').var()
low_var = variance[variance < 0.01]
print("Low variance features:")
print(low_var)
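
A heatmap can be hard to read once there are many columns. One option is to rank features by their absolute correlation with the target, as in the sketch below. The frame is synthetic: the column names mirror the notebook, but the values and coefficients are made up for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the encoded frame (values and coefficients are made up).
rng = np.random.default_rng(42)
n = 500
demo = pd.DataFrame({
    "age": rng.integers(18, 65, size=n),
    "bmi": rng.normal(30, 6, size=n),
    "smoker_yes": rng.integers(0, 2, size=n),
})
demo["charges"] = (
    250 * demo["age"] + 300 * demo["bmi"] + 20000 * demo["smoker_yes"]
    + rng.normal(0, 3000, size=n)
)

# Absolute Pearson correlation of each feature with the target, strongest first.
target_corr = (
    demo.corr()["charges"].drop("charges").abs().sort_values(ascending=False)
)
print(target_corr)
```

Pearson correlation only measures linear association, so a low rank here does not rule out a non-linear effect.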

## 7.) Feature Scaling

In [None]:
# Separate the features from the target
X = data.drop('charges', axis=1)
y = data['charges']

# Standardize each feature to zero mean and unit variance
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
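
Fitting the scaler on all rows lets test-set statistics influence the transformation. For plain least squares this does not change the predictions, but it can matter for regularized models. A common alternative, sketched below on synthetic data (an assumption made only so the example runs on its own), is to wrap the scaler and model in a scikit-learn pipeline so the scaler is fit on the training rows only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features and target, used only to make the sketch runnable.
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=[40, 30], scale=[12, 6], size=(300, 2))
y_demo = 250 * X_demo[:, 0] + 300 * X_demo[:, 1] + rng.normal(0, 500, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# The pipeline fits the scaler on the training rows only, then reuses those
# training statistics to transform the test rows inside score().
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(X_tr, y_tr)
print("Test R²:", round(pipe.score(X_te, y_te), 3))
```

The same pipeline object can be passed to cross-validation helpers, which then refit the scaler inside each fold.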

## 8.) Build the Prediction Model

In [None]:
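# Hold out 20% of the rows for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42
)

# Fit ordinary least squares on the training set
model = LinearRegression()
model.fit(X_train, y_train)

# Predict charges for the held-out rows
y_pred = model.predict(X_test)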

mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Model Evaluation Metrics:")
print("MAE:", round(mae, 2))
print("MSE:", round(mse, 2))
print("R² Score:", round(r2, 3))
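
MSE is reported in squared units of the target, and a single train/test split can give an optimistic or pessimistic score by chance. The sketch below shows two common complements: RMSE, which is in the same units as the target, and 5-fold cross-validated R² via `cross_val_score`. It runs on synthetic data with made-up coefficients, not on the insurance dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score

# Synthetic data with made-up coefficients, used only to make the sketch runnable.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(400, 3))
y_demo = (
    X_demo @ np.array([3000.0, 1500.0, -800.0]) + 12000
    + rng.normal(0, 1000, size=400)
)

model_demo = LinearRegression().fit(X_demo, y_demo)

# RMSE is the square root of MSE, so it is in the same units as the target.
mse_demo = mean_squared_error(y_demo, model_demo.predict(X_demo))
rmse = np.sqrt(mse_demo)

# 5-fold cross-validated R²: each fold is held out once for scoring.
cv_r2 = cross_val_score(LinearRegression(), X_demo, y_demo, cv=5, scoring="r2")

print("RMSE:", round(rmse, 2))
print("CV R² mean:", round(cv_r2.mean(), 3), "std:", round(cv_r2.std(), 3))
```

A large spread across folds suggests the single-split score depends heavily on which rows landed in the test set.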

## 9.) Model Performance Discussion

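### Model Interpretation:
- The R² score is the fraction of the variance in insurance charges that the model explains on the test set.
- A low R² (below about 0.6) can have several causes:
  - Important cost drivers, such as medical history, are missing from the dataset, so much of the variation looks like noise.
  - A linear model cannot capture non-linear relationships or interactions it is not given explicitly.
  - The engineered features are too limited and could be expanded.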

### Possible Improvements:
- Try non-linear models such as decision trees, random forests, or gradient boosting
- Add polynomial features
- Collect more features (medical history, lifestyle)
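
As a rough illustration of the polynomial-features idea, the sketch below compares a plain linear model with a degree-2 polynomial pipeline. The data is synthetic and deliberately built with a BMI × smoker interaction (the coefficients are made up), so it shows the mechanism rather than a result for the insurance dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data with a bmi x smoker interaction (made-up coefficients).
rng = np.random.default_rng(7)
n = 600
bmi = rng.normal(30, 6, size=n)
smoker = rng.integers(0, 2, size=n)
X_demo = np.column_stack([bmi, smoker])
y_demo = 200 * bmi + 1000 * bmi * smoker + rng.normal(0, 2000, size=n)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Baseline: linear terms only.
linear = LinearRegression().fit(X_tr, y_tr)

# Degree-2 expansion adds squared terms and the bmi*smoker product.
poly = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False), LinearRegression()
).fit(X_tr, y_tr)

linear_r2 = linear.score(X_te, y_te)
poly_r2 = poly.score(X_te, y_te)
print("Linear R²:", round(linear_r2, 3))
print("Polynomial R²:", round(poly_r2, 3))
```

Higher degrees add many more columns and can overfit, so the degree is best chosen with cross-validation.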

## Conclusion

We successfully:
- Loaded and cleaned the dataset
- Removed duplicates and outliers
- Transformed categorical variables
- Engineered useful features
- Inspected feature correlations and variances, then scaled the features
- Built and evaluated a regression model