# 🏥 Healthcare Insurance Cost Analysis
## 📊 Notebook 04 – Smoking status significantly increases insurance charges

| Field | Description |
|-------|-------------|
| **Author:** | Robert Steven Elliott |
| **Course:** | Code Institute – Data Analytics with AI Bootcamp |
| **Project Type:** | Individual Formative Project |
| **Date:** | October 2025 |


$$
    \begin{align}
        H_{0}&: \text{Smoking status has no effect on insurance charges.} \\
        H_{1}&: \text{Smoking status significantly increases insurance charges.}
    \end{align}
$$
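
If the comparison is made on mean charges, the hypotheses above could be written formally as a one-sided test. This is an assumption, since the notebook does not state which statistic is tested. The symbols $\mu_{\text{smoker}}$ and $\mu_{\text{non-smoker}}$ are introduced here for illustration and denote the mean charges of each group.

$$
    \begin{align}
        H_{0}&: \mu_{\text{smoker}} = \mu_{\text{non-smoker}} \\
        H_{1}&: \mu_{\text{smoker}} > \mu_{\text{non-smoker}}
    \end{align}
$$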

---

## Change Working Directory

In [1]:
import sys
from pathlib import Path
PROJECT_ROOT = Path.cwd().parent
sys.path.insert(0, str(PROJECT_ROOT))
print("✅ Working directory set to project root:", PROJECT_ROOT)

✅ Working directory set to project root: /home/robert/Projects/health-insurance-cost-analysis


---

## Import Libraries and Dataset

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go

from utils.data_handler import load_data, data_overview, clean_data

pd.set_option('display.max_columns', None)
sns.set_theme(style="whitegrid")

input_path = PROJECT_ROOT / "data" / "final" / "insurance_final.csv"
figure_path = PROJECT_ROOT / "figures"

if not figure_path.exists():
    figure_path.mkdir(parents=True, exist_ok=True)
    print(f"✅ Created figure directory at: {figure_path}")

df = load_data(input_path)
df = clean_data(df, categorical_cols=['sex', 'smoker', 'region', 'bmi_category', 'age_group', 'family_size_category'])
data_overview(df)
print("✅ Data loaded successfully.")
df.head()

DataFrame Shape: (1337, 10)

Data Types:
 age                        int64
sex                     category
bmi                      float64
children                   int64
smoker                  category
region                  category
charges                  float64
bmi_category            category
age_group               category
family_size_category    category
dtype: object

Missing Values:
 age                     0
sex                     0
bmi                     0
children                0
smoker                  0
region                  0
charges                 0
bmi_category            0
age_group               0
family_size_category    0
dtype: int64
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1337 entries, 0 to 1336
Data columns (total 10 columns):
 #   Column                Non-Null Count  Dtype   
---  ------                --------------  -----   
 0   age                   1337 non-null   int64   
 1   sex                   1337 non-null   category
 2   bmi

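|   | age | sex | bmi | children | smoker | region | charges | bmi_category | age_group | family_size_category |
|---|-----|-----|-----|----------|--------|--------|---------|--------------|-----------|----------------------|
| 0 | 19 | female | 27.9 | 0 | yes | southwest | 16884.924 | Overweight | 18-25 | No Children |
| 1 | 18 | male | 33.77 | 1 | no | southeast | 1725.5523 | Obese | 18-25 | Small Family |
| 2 | 28 | male | 33.0 | 3 | no | southeast | 4449.462 | Obese | 26-35 | Medium Family |
| 3 | 33 | male | 22.705 | 0 | no | northwest | 21984.47061 | Normal | 26-35 | No Children |
| 4 | 32 | male | 28.88 | 0 | no | northwest | 3866.8552 | Overweight | 26-35 | No Children |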


## Smoking Status vs Charges

In [3]:
fig = px.box(
    df,
    x='smoker',
    y='charges',
    color='smoker',
    title="H1: Insurance Charges by Smoking Status",
    points='all'
)
fig.write_html(figure_path / "charges_by_smoking_status.html")
fig.write_image(figure_path / "charges_by_smoking_status.png")
fig.show()

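### Observation:
- Smokers have dramatically higher median and mean insurance charges than non-smokers.
- This is consistent with H1. The box plot is descriptive, so formally rejecting H0 still requires a significance test.
- The plot compares smoking status only, so it does not by itself show that smoking is the single largest driver of cost; that would require a comparison against the other variables.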
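
One way to follow up the visual comparison with a formal check of H1 is a one-sided permutation test on the difference in mean charges. The block below is a minimal sketch, not the notebook's own method: the function name `one_sided_permutation_test` is hypothetical, and the charge values are illustrative numbers, not rows from the dataset. It uses only the standard library so it runs on its own.

```python
import random
from statistics import mean


def one_sided_permutation_test(treated, control, n_permutations=10_000, seed=0):
    """Estimate P(mean difference >= observed) under the null of no group effect.

    Hypothetical helper for illustration; not part of the project's utils.
    """
    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    observed = mean(treated) - mean(control)
    pooled = list(treated) + list(control)
    n_treated = len(treated)

    hits = 0
    for _ in range(n_permutations):
        # Under H0 the group labels are exchangeable, so reshuffle them.
        rng.shuffle(pooled)
        diff = mean(pooled[:n_treated]) - mean(pooled[n_treated:])
        if diff >= observed:
            hits += 1

    # Add-one correction keeps the estimated p-value strictly above zero.
    p_value = (hits + 1) / (n_permutations + 1)
    return observed, p_value


# Illustrative values only (assumed for the sketch, not taken from the data).
smoker_charges = [32000.0, 41000.0, 28000.0, 36500.0, 45000.0, 30500.0]
non_smoker_charges = [4200.0, 8100.0, 6300.0, 11000.0, 3900.0, 7400.0, 9800.0, 5100.0]

observed_diff, p_value = one_sided_permutation_test(smoker_charges, non_smoker_charges)
print(f"Observed mean difference: {observed_diff:,.2f}")
print(f"One-sided permutation p-value: {p_value:.4f}")
```

To apply this to the notebook's data, the two lists could be replaced with the `charges` values for `smoker == 'yes'` and `smoker == 'no'` from `df`. A permutation test is used here because it makes no normality assumption, which suits right-skewed cost data. A Welch t-test or Mann-Whitney U test would be reasonable alternatives.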