# Task 4: Predicting Insurance Claim Amounts

**Objective:** Estimate the medical insurance claim amount (`charges`) from personal data.

**Dataset:** Medical Cost Personal Dataset (synthetic replica generated for this notebook).

**Instructions:**
- Train a Linear Regression model to predict `charges`.
- Visualize how BMI, age, and smoking status impact insurance charges.
- Evaluate performance using MAE and RMSE.

**Skills:** Regression modeling, feature correlation & visualization, error evaluation (MAE/RMSE).

In [None]:

import math

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split

# Load dataset
df = pd.read_csv("insurance.csv")
df.head()


## Basic EDA

In [None]:

df.info()


In [None]:

df.describe(include='all')


## One-Hot Encode Categorical Features

In [None]:

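# Encode categorical columns as indicator variables; drop_first=True drops one
# level per category so the indicators are not perfectly collinear.
X = pd.get_dummies(df.drop(columns=["charges"]), drop_first=True)
y = df["charges"]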
X.head()
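
To make the effect of `drop_first=True` concrete, here is a minimal sketch on a hypothetical three-row frame. The column names mirror the notebook, but the values are invented for illustration and are not taken from `insurance.csv`. A categorical column with k levels becomes k-1 indicator columns.

```python
import pandas as pd

# Hypothetical toy rows; values are made up for illustration.
toy = pd.DataFrame({
    "age": [19, 45, 33],
    "sex": ["female", "male", "male"],
    "smoker": ["yes", "no", "no"],
})

# Levels are sorted, so "female" and "no" become the dropped baseline levels.
encoded = pd.get_dummies(toy, drop_first=True)
print(list(encoded.columns))
```

The surviving indicator for smoking is named `smoker_yes`, which is the column the smoking-status plot later in this notebook relies on.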


## Train/Test Split

In [None]:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train.shape, X_test.shape


## Train Linear Regression

In [None]:

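# Fit an ordinary least-squares model on the training split.
linreg = LinearRegression()
linreg.fit(X_train, y_train)

# Evaluate on the held-out test split.
y_pred = linreg.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
rmse = math.sqrt(mean_squared_error(y_test, y_pred))  # RMSE is the square root of MSE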

print(f"Test MAE:  {mae:,.2f}")
print(f"Test RMSE: {rmse:,.2f}")
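
As a sanity check on what the two metrics measure, the sketch below computes MAE and RMSE by hand with NumPy. The three actual/predicted pairs are hypothetical numbers chosen to keep the arithmetic easy to follow; they are not results from this notebook.

```python
import numpy as np

# Hypothetical actual and predicted values, chosen only to illustrate the formulas.
y_true = np.array([100.0, 200.0, 300.0])
y_hat = np.array([110.0, 190.0, 330.0])

errors = y_hat - y_true                       # residuals: 10, -10, 30
mae_manual = np.mean(np.abs(errors))          # (10 + 10 + 30) / 3
rmse_manual = np.sqrt(np.mean(errors ** 2))   # sqrt((100 + 100 + 900) / 3)

print(f"MAE:  {mae_manual:.2f}")
print(f"RMSE: {rmse_manual:.2f}")
```

RMSE is never smaller than MAE, because squaring weights large errors more heavily. A wide gap between the two suggests that a few predictions are far off.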


## Visualizations

In [None]:

# Age vs Charges
plt.figure()
plt.scatter(X_test["age"], y_test, alpha=0.6, label="Actual")
# Trend line: degree-1 least-squares fit of charges on age alone (not the multivariate model).
coef_age_only = np.polyfit(X_test["age"], y_test, 1)
x_line = np.linspace(X_test["age"].min(), X_test["age"].max(), 100)
y_line = np.polyval(coef_age_only, x_line)
plt.plot(x_line, y_line, linewidth=2, label="Trend")
plt.xlabel("Age")
plt.ylabel("Insurance Charges")
plt.title("Impact of Age on Charges")
plt.legend()
plt.show()


In [None]:

# BMI vs Charges
plt.figure()
plt.scatter(X_test["bmi"], y_test, alpha=0.6, label="Actual")
# Trend line: degree-1 least-squares fit of charges on BMI alone.
coef_bmi_only = np.polyfit(X_test["bmi"], y_test, 1)
x_line_bmi = np.linspace(X_test["bmi"].min(), X_test["bmi"].max(), 100)
y_line_bmi = np.polyval(coef_bmi_only, x_line_bmi)
plt.plot(x_line_bmi, y_line_bmi, linewidth=2, label="Trend")
plt.xlabel("BMI")
plt.ylabel("Insurance Charges")
plt.title("Impact of BMI on Charges")
plt.legend()
plt.show()
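
The trend lines in the two scatter plots can be paired with a single number, the Pearson correlation. The sketch below uses four hypothetical (BMI, charges) pairs, invented for illustration, to show how the `np.polyfit` slope and the correlation coefficient are computed from the same data.

```python
import numpy as np

# Hypothetical BMI and charge values, invented only to illustrate the calls.
bmi_demo = np.array([20.0, 25.0, 30.0, 35.0])
charges_demo = np.array([2000.0, 3000.0, 5000.0, 9000.0])

# Degree-1 least-squares fit: returns [slope, intercept].
slope, intercept = np.polyfit(bmi_demo, charges_demo, 1)

# Pearson correlation: off-diagonal entry of the 2x2 correlation matrix.
r = np.corrcoef(bmi_demo, charges_demo)[0, 1]

print(f"slope: {slope:.1f} per BMI unit, r: {r:.3f}")
```

In the notebook itself, the same calls would take `X_test["bmi"]` (or `X_test["age"]`) and `y_test` as inputs.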


In [None]:

# Charges by Smoking Status
plt.figure()
data_no = y_test[X_test["smoker_yes"] == 0]
data_yes = y_test[X_test["smoker_yes"] == 1]
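# Boxes sit at x = 1 and x = 2 by default. Setting tick labels with plt.xticks
# avoids the `labels` keyword, which Matplotlib 3.9 deprecated in favor of `tick_labels`.
plt.boxplot([data_no, data_yes], showfliers=False)
plt.xticks([1, 2], ["non-smoker", "smoker"])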
plt.ylabel("Insurance Charges")
plt.title("Impact of Smoking on Charges")
plt.show()
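
Because the box plot hides outliers (`showfliers=False`), a numeric summary per group is a useful complement. The sketch below uses a hypothetical five-row frame with invented charges to show the grouping call.

```python
import pandas as pd

# Hypothetical charges, invented for illustration only.
toy = pd.DataFrame({
    "smoker_yes": [0, 0, 0, 1, 1],
    "charges": [2000.0, 3000.0, 4000.0, 20000.0, 30000.0],
})

# Count, median, and mean of charges for each smoking group.
summary = toy.groupby("smoker_yes")["charges"].agg(["count", "median", "mean"])
print(summary)
```

On the notebook's data, a comparable table could be obtained with `y_test.groupby(X_test["smoker_yes"]).agg(["count", "median", "mean"])`, assuming the variables from the earlier cells are still in scope.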


## Notes
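- The scatter plots show how charges vary with age and BMI on the test set; their trend lines are single-feature least-squares fits (`np.polyfit`), not predictions from the multivariate model. The box plot compares charges for smokers and non-smokers with outliers hidden, and smokers typically incur higher charges.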
- The Linear Regression model is a baseline; consider trying regularization (Ridge/Lasso) or non-linear models for improvements.
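
As a starting point for the regularization suggestion, here is a minimal sketch comparing `LinearRegression` with `Ridge` from scikit-learn. It runs on synthetic data generated inside the block, not on the insurance dataset, and `alpha=10.0` is an arbitrary value picked for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Synthetic data, generated only to illustrate the API; not the insurance dataset.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([3.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=200)

ols = LinearRegression().fit(X_demo, y_demo)
ridge = Ridge(alpha=10.0).fit(X_demo, y_demo)  # alpha=10.0 is an arbitrary choice

# The L2 penalty shrinks the coefficient vector relative to ordinary least squares.
ols_norm = np.linalg.norm(ols.coef_)
ridge_norm = np.linalg.norm(ridge.coef_)
print(f"OLS coefficient norm:   {ols_norm:.3f}")
print(f"Ridge coefficient norm: {ridge_norm:.3f}")
```

In practice `alpha` would be tuned, for example by cross-validation, and the features are usually standardized first so the penalty treats them on a common scale.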