# 🏥 Healthcare Insurance Cost Analysis  
## 📊 Notebook 05 – Higher BMI correlates with higher insurance charges.

| Field | Description |
|-------|-------------|
|**Author:** | Robert Steven Elliott  |
|**Course:** | Code Institute – Data Analytics with AI Bootcamp |  
|**Project Type:** | Individual Formative Project  | 
|**Date:** | December 2025  |


$$
    \begin{align}
        H_{0}&: \text{BMI has no effect on insurance charges.} \\
        H_{1}&: \text{Higher BMI correlates with higher insurance charges.}
    \end{align}
$$
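
One common way to put these hypotheses to a formal test is Pearson's correlation coefficient $r$ together with the statistic $t = r\sqrt{(n-2)/(1-r^2)}$, which follows a t distribution with $n-2$ degrees of freedom under $H_0$ (assuming roughly normal data). The sketch below is illustrative only: the helper name `pearson_r_and_t` and the toy values are assumptions, not part of this notebook or its dataset.

```python
import math


def pearson_r_and_t(x, y):
    """Return Pearson's r and the t statistic for testing H0: no linear correlation."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("x and y must have the same length, with at least 3 points")
    mean_x = math.fsum(x) / n
    mean_y = math.fsum(y) / n
    sxy = math.fsum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = math.fsum((a - mean_x) ** 2 for a in x)
    syy = math.fsum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when either variable is constant")
    r = sxy / math.sqrt(sxx * syy)
    if abs(r) >= 1.0:
        # Perfectly linear data: the t statistic diverges.
        return r, math.copysign(math.inf, r)
    t = r * math.sqrt((n - 2) / (1.0 - r * r))
    return r, t


# Toy values invented for illustration (not taken from insurance_final.csv).
toy_bmi = [18.5, 22.0, 25.3, 27.9, 30.1, 33.8, 36.4, 40.2]
toy_charges = [2100.0, 3900.0, 3500.0, 6200.0, 5800.0, 9100.0, 8700.0, 12400.0]

r, t = pearson_r_and_t(toy_bmi, toy_charges)
print(f"r = {r:.3f}, t = {t:.3f} (n - 2 = {len(toy_bmi) - 2} degrees of freedom)")
```

With the notebook's DataFrame, the inputs would be `df['bmi'].tolist()` and `df['charges'].tolist()`. A large positive $t$ favours $H_1$; if a recent SciPy is installed, `scipy.stats.pearsonr(x, y, alternative='greater')` reports the matching one-sided p-value.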

---

## Change Working Directory

In [1]:
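import sys
from pathlib import Path

# Assumes the notebook runs from a folder directly under the project root.
PROJECT_ROOT = Path.cwd().parent
# Put the project root on sys.path so that `utils` can be imported below.
# Note: this does not change the current working directory.
sys.path.insert(0, str(PROJECT_ROOT))
print("✅ Working directory set to project root:", PROJECT_ROOT)

✅ Working directory set to project root: /home/robert/Projects/health-insurance-cost-analysis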


---

## Import Libraries and Dataset

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go

from utils.data_handler import load_data, data_overview, clean_data

pd.set_option('display.max_columns', None)
sns.set_theme(style="whitegrid")

input_path = PROJECT_ROOT / "data" / "final" / "insurance_final.csv"
figure_path = PROJECT_ROOT / "figures"

if not figure_path.exists():
    figure_path.mkdir(parents=True, exist_ok=True)
    print(f"✅ Created figure directory at: {figure_path}")

df = load_data(input_path)
df = clean_data(df, categorical_cols=['sex', 'smoker', 'region', 'bmi_category', 'age_group', 'family_size_category'])
data_overview(df)
print("✅ Data loaded successfully.")
df.head()

DataFrame Shape: (1337, 10)

Data Types:
 age                        int64
sex                     category
bmi                      float64
children                   int64
smoker                  category
region                  category
charges                  float64
bmi_category            category
age_group               category
family_size_category    category
dtype: object

Missing Values:
 age                     0
sex                     0
bmi                     0
children                0
smoker                  0
region                  0
charges                 0
bmi_category            0
age_group               0
family_size_category    0
dtype: int64
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1337 entries, 0 to 1336
Data columns (total 10 columns):
 #   Column                Non-Null Count  Dtype   
---  ------                --------------  -----   
 0   age                   1337 non-null   int64   
 1   sex                   1337 non-null   category
 2   bmi

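|   | age | sex | bmi | children | smoker | region | charges | bmi_category | age_group | family_size_category |
|---|-----|-----|-----|----------|--------|--------|---------|--------------|-----------|----------------------|
| 0 | 19 | female | 27.9 | 0 | yes | southwest | 16884.924 | Overweight | 18-25 | No Children |
| 1 | 18 | male | 33.77 | 1 | no | southeast | 1725.5523 | Obese | 18-25 | Small Family |
| 2 | 28 | male | 33.0 | 3 | no | southeast | 4449.462 | Obese | 26-35 | Medium Family |
| 3 | 33 | male | 22.705 | 0 | no | northwest | 21984.47061 | Normal | 26-35 | No Children |
| 4 | 32 | male | 28.88 | 0 | no | northwest | 3866.8552 | Overweight | 26-35 | No Children |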


## BMI vs Charges

In [3]:
# Create a subplot figure: 1 row, 2 columns
fig = make_subplots(
    rows=1,
    cols=2,
    subplot_titles=[
        "BMI vs Charges",
        "Average Charges by BMI Category & Smoking Status"
    ]
)

color_map = {'yes': '#EF553B', 'no': '#636EFA'}

# -----------------------------
# Scatter plot with trendline
# -----------------------------

scatter_fig = px.scatter(
    df,
    x='bmi',
    y='charges',
    color='smoker',
    color_discrete_map=color_map,
    trendline='ols'
)

# Add scatter traces, only show legend for the first trace of each color

added_legends = set()
for trace in scatter_fig['data']:
    # Only show legend if we haven't added this name yet
    if trace.name in added_legends:
        trace.showlegend = False
    else:
        trace.showlegend = True
        added_legends.add(trace.name)
    fig.add_trace(trace, row=1, col=1)


# -----------------------------
# Grouped bar chart
# -----------------------------
# Mean charges for each BMI category and smoking status.
charges_avg = df.groupby(['bmi_category', 'smoker'], observed=False)['charges'].mean().reset_index()

barchart_fig = px.bar(
    charges_avg,
    x='bmi_category',
    y='charges',
    color='smoker',
    color_discrete_map=color_map,
    barmode='group',
    category_orders={'bmi_category': ['Underweight', 'Normal', 'Overweight', 'Obese', 'Severely Obese']}
)

for trace in barchart_fig['data']:
    if trace.name in added_legends:
        trace.showlegend = False
    else:
        trace.showlegend = True
        added_legends.add(trace.name)
    fig.add_trace(trace, row=1, col=2)

# -----------------------------
# Layout adjustments
# -----------------------------
fig.update_layout(
    showlegend=True,
    legend=dict(
        title=dict(text="Smoking Status"), 
        x=1.05, 
        y=1
    ),
    height=500,
    width=1200,
    title_text="H2: BMI vs Insurance Charges (coloured by Smoker)",
)
fig.write_html(figure_path / "bmi_charges_smoking_analysis.html")
# Static image export requires the kaleido package to be installed.
fig.write_image(figure_path / "bmi_charges_smoking_analysis.png")
fig.show()



### Observation:
- BMI is positively correlated with charges.
- Smokers with a high BMI incur particularly high charges, which supports H2, the hypothesis that higher BMI correlates with higher insurance charges.
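
The trend lines in the scatter plot are ordinary least-squares fits drawn separately for each smoker group, so comparing their slopes is one way to check the second observation numerically. The sketch below reproduces that idea with the standard library; `ols_by_group` and the toy rows are hypothetical and do not come from the dataset.

```python
import math
from collections import defaultdict


def ols_by_group(rows, group_key, x_key, y_key):
    """Fit y = intercept + slope * x by least squares within each group."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[group_key]].append((row[x_key], row[y_key]))

    fits = {}
    for name, points in grouped.items():
        n = len(points)
        mean_x = math.fsum(x for x, _ in points) / n
        mean_y = math.fsum(y for _, y in points) / n
        sxx = math.fsum((x - mean_x) ** 2 for x, _ in points)
        if sxx == 0:
            continue  # slope is undefined when every x in the group is identical
        sxy = math.fsum((x - mean_x) * (y - mean_y) for x, y in points)
        slope = sxy / sxx
        fits[name] = {"slope": slope, "intercept": mean_y - slope * mean_x, "n": n}
    return fits


# Toy rows invented for illustration (not taken from insurance_final.csv).
toy_rows = [
    {"smoker": "no", "bmi": 20.0, "charges": 3000.0},
    {"smoker": "no", "bmi": 25.0, "charges": 3500.0},
    {"smoker": "no", "bmi": 30.0, "charges": 4000.0},
    {"smoker": "no", "bmi": 35.0, "charges": 4500.0},
    {"smoker": "yes", "bmi": 20.0, "charges": 15000.0},
    {"smoker": "yes", "bmi": 25.0, "charges": 20000.0},
    {"smoker": "yes", "bmi": 30.0, "charges": 25000.0},
    {"smoker": "yes", "bmi": 35.0, "charges": 30000.0},
]

for group, fit in ols_by_group(toy_rows, "smoker", "bmi", "charges").items():
    print(f"smoker={group}: slope={fit['slope']:.1f} per BMI unit, n={fit['n']}")
```

A steeper slope for smokers than for non-smokers would be consistent with BMI mattering most among smokers. With pandas, the input rows could be produced by `df[['smoker', 'bmi', 'charges']].to_dict('records')`.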