# Healthcare Cost Prediction

This notebook walks through exploratory data analysis (EDA), hypothesis testing, preprocessing, linear regression modeling, and evaluation for predicting `charges` from `insurance.csv`.

## 0. Requirements
Run this to install recommended packages:
```
pip install pandas numpy matplotlib seaborn scikit-learn scipy statsmodels
```

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import ttest_ind
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler

# Jupyter magic: render plots inline (remove this line when running as a plain Python script)
%matplotlib inline


## 1. Load data
Place `insurance.csv` in a `data/` folder, either one level above this notebook or beside it. The loader below checks both locations.

In [None]:
from pathlib import Path

# Look in ../data first (notebook run from a subfolder such as notebook/),
# then fall back to ./data (notebook run from the repository root)
data_path = Path('..') / 'data' / 'insurance.csv'
if not data_path.exists():
    data_path = Path('data') / 'insurance.csv'
df = pd.read_csv(data_path)
print('Rows, columns:', df.shape)
df.head()

## 2. Quick info and descriptive statistics

In [None]:
df.info()
df.describe()

## 3. EDA — plots

In [None]:
plt.figure(figsize=(8,5))
plt.title('Charges distribution')
plt.hist(df['charges'], bins=50)
plt.xlabel('Charges')
plt.show()

plt.figure(figsize=(8,5))
sns.boxplot(x='smoker', y='charges', data=df)
plt.title('Charges by smoker status')
plt.show()

plt.figure(figsize=(8,5))
sns.scatterplot(x='bmi', y='charges', hue='smoker', data=df)
plt.title('BMI vs Charges colored by smoker')
plt.show()


## 4. Descriptive stats examples

In [None]:
print('Average BMI:', df['bmi'].mean())
print('Median charges:', df['charges'].median())

## 5. Hypothesis test: Do smokers pay more than non-smokers?

In [None]:
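# Split charges by smoking status
smokers = df[df['smoker']=='yes']['charges']
non_smokers = df[df['smoker']=='no']['charges']

# Welch's t-test (equal_var=False) does not assume equal group variances.
# The p-value is two-sided; a positive t-statistic means smokers have the higher mean.
tstat, pval = ttest_ind(smokers, non_smokers, equal_var=False)
print('T-statistic:', tstat)
print('P-value (two-sided):', pval)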
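
The section heading asks a directional question (do smokers pay *more*?), which corresponds to a one-sided test. The sketch below shows how that could look with `ttest_ind(..., alternative='greater')`, an argument available in SciPy 1.6 and later. The arrays `smokers_demo` and `non_smokers_demo` are synthetic stand-ins with arbitrary means and spreads, not values from `insurance.csv`; in the notebook you would pass `smokers` and `non_smokers` instead.

```python
import numpy as np
from scipy.stats import ttest_ind

# Synthetic stand-in data (hypothetical, NOT the real insurance.csv values)
rng = np.random.default_rng(0)
smokers_demo = rng.normal(loc=32000, scale=11000, size=200)
non_smokers_demo = rng.normal(loc=8500, scale=6000, size=800)

# One-sided Welch test: the alternative hypothesis is mean(smokers) > mean(non-smokers)
result = ttest_ind(smokers_demo, non_smokers_demo,
                   equal_var=False, alternative='greater')
tstat_demo, pval_demo = result.statistic, result.pvalue

print('T-statistic:', tstat_demo)
print('One-sided p-value:', pval_demo)
```

When the t-statistic is positive, the one-sided p-value is half the two-sided one, so the two versions only disagree near the significance threshold.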

## 6. Preprocessing — encoding and scaling

In [None]:
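# One-hot encode categorical columns; drop_first avoids redundant dummy columns
df_encoded = pd.get_dummies(df, drop_first=True)
X = df_encoded.drop('charges', axis=1)
y = df_encoded['charges']

# Split before scaling so test-set statistics do not leak into the scaler
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_test = X_train.copy(), X_test.copy()

# Fit the scaler on the training rows only, then apply it to both sets
num_cols = ['age','bmi','children']
scaler = StandardScaler()
X_train[num_cols] = scaler.fit_transform(X_train[num_cols])
X_test[num_cols] = scaler.transform(X_test[num_cols])
print('Train shape:', X_train.shape, 'Test shape:', X_test.shape)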

## 7. Linear Regression baseline

In [None]:
lr = LinearRegression()
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)

print('MAE:', mean_absolute_error(y_test, y_pred))
print('RMSE:', np.sqrt(mean_squared_error(y_test, y_pred)))
print('R2:', r2_score(y_test, y_pred))
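
A single 80/20 split yields one estimate of each metric, and that estimate can shift with `random_state`. One possible way to get a steadier number is k-fold cross-validation with preprocessing wrapped in a scikit-learn `Pipeline`, so the scaler is refit inside every fold. The sketch below assumes this setup rather than reproducing the notebook's exact steps: the `demo` frame is synthetic (made-up coefficients and noise, only a subset of the columns), and the names `preprocess`, `model`, and `cv_r2` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for insurance.csv (hypothetical values, for mechanics only)
rng = np.random.default_rng(42)
n = 300
demo = pd.DataFrame({
    'age': rng.integers(18, 65, size=n),
    'bmi': rng.normal(30, 6, size=n),
    'children': rng.integers(0, 5, size=n),
    'smoker': rng.choice(['yes', 'no'], size=n, p=[0.2, 0.8]),
})
demo['charges'] = (250 * demo['age'] + 300 * demo['bmi']
                   + 20000 * (demo['smoker'] == 'yes')
                   + rng.normal(0, 3000, size=n))

# Scale numeric columns and one-hot encode the categorical one
preprocess = ColumnTransformer([
    ('num', StandardScaler(), ['age', 'bmi', 'children']),
    ('cat', OneHotEncoder(drop='first'), ['smoker']),
])
model = Pipeline([('prep', preprocess), ('lr', LinearRegression())])

# 5-fold cross-validated R2; preprocessing is refit on each training fold
cv_r2 = cross_val_score(model, demo.drop(columns='charges'), demo['charges'],
                        cv=5, scoring='r2')
print('R2 per fold:', cv_r2.round(3))
print('Mean R2:', cv_r2.mean().round(3))
```

The spread across folds gives a rough sense of how much the single-split R2 above might vary.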

## 8. Coefficients — feature influence

In [None]:
# Pair each feature with its fitted coefficient, largest first
coef_df = (
    pd.DataFrame({'feature': X.columns, 'coefficient': lr.coef_})
    .sort_values(by='coefficient', ascending=False)
)
coef_df.head(12)
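
Because `age`, `bmi`, and `children` were standardized in section 6, their coefficients measure the change in `charges` per one standard deviation, while the dummy-column coefficients measure the change when a category switches from 0 to 1. To read a numeric coefficient in original units, one option is to divide it by the matching entry of the scaler's `scale_` attribute. The sketch below illustrates the idea on a noise-free synthetic example with a made-up slope of 250 per year of age; all `*_demo` names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic single-feature example (hypothetical numbers, noise-free on purpose)
rng = np.random.default_rng(1)
age_demo = rng.uniform(18, 64, size=500).reshape(-1, 1)
charges_demo = 250.0 * age_demo.ravel() + 1000.0  # true slope: 250 per year

# Standardize the feature, then fit on the scaled values
scaler_demo = StandardScaler()
age_scaled = scaler_demo.fit_transform(age_demo)
lr_demo = LinearRegression().fit(age_scaled, charges_demo)

# Coefficient per standard deviation vs. per original unit (year)
coef_per_sd = lr_demo.coef_[0]
coef_per_year = coef_per_sd / scaler_demo.scale_[0]

print('Per standard deviation:', coef_per_sd)
print('Per year of age:', coef_per_year)
```

In the notebook, the same division would use `scaler.scale_`, whose entries follow the order of `num_cols`.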

## 9. Predictions sample

In [None]:
preds = pd.DataFrame({'actual': y_test, 'predicted': y_pred})
preds.head()

## 10. Next steps
- Try Ridge/Lasso and tree-based models
- Add interaction features (e.g., bmi*smoker)
- Deploy as a simple Flask app or Streamlit dashboard
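
As a rough illustration of the first two items, the sketch below adds a `bmi_smoker` interaction column and fits `Ridge` with and without it. It assumes a dummy column named `smoker_yes`, as `pd.get_dummies(..., drop_first=True)` would produce for a yes/no `smoker` column. The data are synthetic and deliberately built with a bmi-by-smoker effect, so the comparison only demonstrates the mechanics and says nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic data with a built-in interaction (hypothetical coefficients)
rng = np.random.default_rng(7)
n = 1000
demo = pd.DataFrame({
    'bmi': rng.normal(30, 6, size=n),
    'smoker_yes': rng.binomial(1, 0.2, size=n),
})
demo['charges'] = (200 * demo['bmi']
                   + 1000 * demo['bmi'] * demo['smoker_yes']
                   + rng.normal(0, 2000, size=n))

# Interaction feature: bmi counts only for smokers, zero otherwise
demo['bmi_smoker'] = demo['bmi'] * demo['smoker_yes']

def ridge_test_r2(feature_cols):
    """Fit Ridge on the given columns and return R2 on a held-out split."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        demo[feature_cols], demo['charges'], test_size=0.2, random_state=42)
    ridge = Ridge(alpha=1.0).fit(X_tr, y_tr)
    return r2_score(y_te, ridge.predict(X_te))

r2_without = ridge_test_r2(['bmi', 'smoker_yes'])
r2_with = ridge_test_r2(['bmi', 'smoker_yes', 'bmi_smoker'])
print('R2 without interaction:', round(r2_without, 3))
print('R2 with interaction:', round(r2_with, 3))
```

Swapping `Ridge` for `Lasso` from the same module follows the same pattern; in either case `alpha` would normally be tuned, for example with cross-validation.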