# 🏥 Healthcare Insurance Cost Analysis  
## 📊 Notebook 04 – Exploratory Data Analysis and Visualisations 

| Field | Description |
|-------|-------------|
| **Author:** | Robert Steven Elliott |
| **Course:** | Code Institute – Data Analytics with AI Bootcamp |
| **Project Type:** | Individual Formative Project |
| **Date:** | October 2025 |

---

## **Objectives**
- Conduct exploratory data analysis (EDA) to uncover trends and patterns.  
- Produce static visualisations using Matplotlib and Seaborn.  
- Create interactive dashboards using Plotly.  
- Summarise relationships between key features and healthcare insurance charges.

## **Inputs**
- `data/final/insurance_final.csv` (the enriched dataset loaded in the "Load Enriched Dataset" cell)

## **Outputs**
- Static and interactive visualisations saved to `reports/figures/`  
- Analytical insights summarised for presentation and documentation.

## **Additional Comments**
Run after `03_feature_engineering.ipynb`.  
Ensure the enriched dataset is available at `data/final/insurance_final.csv`, the path the load cell reads from.


---

# Change Working Directory

In [1]:
import os
PROJECT_ROOT = os.path.join(os.getcwd(), "..")
os.chdir(PROJECT_ROOT)
print("✅ Working directory set to project root:", os.getcwd())

✅ Working directory set to project root: /home/robert/Projects/health-insurance-cost-analysis
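
Re-running the cell above moves up one more directory each time, because it always steps to the parent of the current directory. A minimal sketch of an idempotent alternative follows. The helper name `find_project_root` and the use of a `data` folder as the root marker are assumptions for illustration, not part of the original notebook.

```python
from pathlib import Path


def find_project_root(start, marker="data"):
    """Return the nearest directory at or above `start` that contains `marker`.

    `marker` is an assumed landmark (here a `data` folder); change it to
    whatever reliably identifies the project root.
    """
    start = Path(start).resolve()
    # Check the starting directory first, then each parent in turn.
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"No directory containing '{marker}' found at or above {start}")
```

Calling `os.chdir(find_project_root(os.getcwd()))` would then land in the same place no matter how many times the cell is executed, provided the marker folder exists.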


---

# Import Libraries

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

# Aesthetic setup
sns.set(style="whitegrid", palette="muted")
print("✅ Libraries imported successfully.")

✅ Libraries imported successfully.


---

# Load Enriched Dataset

In [5]:
data_path = "data/final/insurance_final.csv"

try:
    df = pd.read_csv(data_path)
    print(f"✅ Enriched dataset loaded. Shape: {df.shape}")
except FileNotFoundError:
    raise FileNotFoundError("❌ insurance_final.csv not found. Please run Notebook 03 first.")

display(df.head())

✅ Enriched dataset loaded. Shape: (1337, 10)


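|   | age | sex | bmi | children | smoker | region | charges | bmi_category | age_group | family_size_category |
|---|-----|-----|-----|----------|--------|--------|---------|--------------|-----------|----------------------|
| 0 | 19 | female | 27.9 | 0 | yes | southwest | 16884.924 | Overweight | 18-25 | No Children |
| 1 | 18 | male | 33.77 | 1 | no | southeast | 1725.5523 | Obese | 18-25 | Small Family |
| 2 | 28 | male | 33.0 | 3 | no | southeast | 4449.462 | Obese | 26-35 | Medium Family |
| 3 | 33 | male | 22.705 | 0 | no | northwest | 21984.47061 | Normal | 26-35 | No Children |
| 4 | 32 | male | 28.88 | 0 | no | northwest | 3866.8552 | Overweight | 26-35 | No Children |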


## Data Type Restoration
CSV files do not store pandas dtypes, so the categorical columns (`sex`, `smoker`, `region`, `bmi_category`, `age_group`, `family_size_category`) come back as plain `object` columns on load.  
They are re-cast to the `category` dtype here to stay consistent with the ETL pipeline and to improve plotting performance.

In [6]:
for col in ['sex', 'smoker', 'region', 'bmi_category', 'age_group', 'family_size_category']:
    df[col] = df[col].astype('category')
print("✅ Data types after restoration:")
print(df.dtypes)

✅ Data types after restoration:
age                        int64
sex                     category
bmi                      float64
children                   int64
smoker                  category
region                  category
charges                  float64
bmi_category            category
age_group               category
family_size_category    category
dtype: object
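
The same restoration can also be done at load time by passing a `dtype` mapping to `pd.read_csv`. The sketch below uses a two-row in-memory sample copied from the preview above rather than the real file, so the names `sample_csv` and `sample_df` are illustrative only.

```python
import io

import pandas as pd

# Two rows copied from the preview above, standing in for the real CSV file.
sample_csv = io.StringIO(
    "age,sex,bmi,children,smoker,region,charges,bmi_category,age_group,family_size_category\n"
    "19,female,27.9,0,yes,southwest,16884.924,Overweight,18-25,No Children\n"
    "18,male,33.77,1,no,southeast,1725.5523,Obese,18-25,Small Family\n"
)

categorical_cols = [
    "sex", "smoker", "region",
    "bmi_category", "age_group", "family_size_category",
]

# Cast the categorical columns while parsing, so no second pass is needed.
sample_df = pd.read_csv(sample_csv, dtype={col: "category" for col in categorical_cols})
print(sample_df.dtypes)
```

Casting during the read keeps the load and the dtype rules in one place. The explicit loop used in the notebook is equally valid and easier to extend with per-column logic.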


---

# Summary Statistics and Overview

In [12]:
# Basic descriptive stats
df.describe()


|       | age | bmi | children | charges |
|-------|-----|-----|----------|---------|
| count | 1337.0 | 1337.0 | 1337.0 | 1337.0 |
| mean | 39.222139 | 30.663452 | 1.095737 | 13279.121487 |
| std | 14.044333 | 6.100468 | 1.205571 | 12110.359656 |
| min | 18.0 | 15.96 | 0.0 | 1121.8739 |
| 25% | 27.0 | 26.29 | 0.0 | 4746.344 |
| 50% | 39.0 | 30.4 | 1.0 | 9386.1613 |
| 75% | 51.0 | 34.7 | 2.0 | 16657.71745 |
| max | 64.0 | 53.13 | 5.0 | 63770.42801 |
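
In the summary above, the mean of `charges` (about 13,279) sits well above the median (about 9,386), which points to a right-skewed distribution. One quick way to quantify this is the sample skewness, before and after a log transform. The sketch below uses only the five `charges` values from the preview rows as stand-in data, so its printed numbers say nothing about the full dataset.

```python
import numpy as np
import pandas as pd

# Five charges values from the preview rows; illustrative stand-in data only.
charges = pd.Series(
    [16884.924, 1725.5523, 4449.462, 21984.47061, 3866.8552],
    name="charges",
)

raw_skew = charges.skew()            # positive values indicate a longer right tail
log_skew = np.log1p(charges).skew()  # log1p compresses the largest values

print(f"Skewness (raw):   {raw_skew:.3f}")
print(f"Skewness (log1p): {log_skew:.3f}")
```

Running the same two lines on `df['charges']` would show whether a log scale is worth using for the charges axis in later plots.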


In [13]:

# Correlation with target variable
corr = df.corr(numeric_only=True)
display(corr['charges'].sort_values(ascending=False))

charges     1.000000
age         0.298308
bmi         0.198401
children    0.067389
Name: charges, dtype: float64
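
`numeric_only=True` leaves categorical columns such as `smoker` out of the correlation table. One common way to bring a binary category in is to encode it as 0/1 first. The sketch below does this on the five preview rows only. The column name `smoker_flag` is a hypothetical addition, and the resulting coefficients are illustrative, not findings for the full dataset.

```python
import pandas as pd

# Five preview rows, used as stand-in data.
sample = pd.DataFrame({
    "age": [19, 18, 28, 33, 32],
    "bmi": [27.9, 33.77, 33.0, 22.705, 28.88],
    "children": [0, 1, 3, 0, 0],
    "smoker": ["yes", "no", "no", "no", "no"],
    "charges": [16884.924, 1725.5523, 4449.462, 21984.47061, 3866.8552],
})

# Hypothetical helper column: 1 for smokers, 0 otherwise.
sample["smoker_flag"] = (sample["smoker"] == "yes").astype(int)

# Correlation of every numeric column with charges, strongest first.
corr_with_charges = (
    sample.corr(numeric_only=True)["charges"]
    .drop("charges")
    .sort_values(ascending=False)
)
print(corr_with_charges)
```

Applied to `df`, the same encoding would let `smoker` appear alongside `age`, `bmi`, and `children` in the ranking.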

---

# Identify Top Correlations

In [11]:
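# `corr` is the full correlation matrix, so select the `charges` column before ranking.
# Only three numeric features exist besides charges, so head(5) returns three rows.
top_corr = corr['charges'].drop('charges').sort_values(ascending=False).head(5)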
top_corr_df = top_corr.reset_index()
top_corr_df.columns = ['Feature', 'Correlation_with_Charges']

print("🔍 Top 5 correlated features with Charges:")
display(top_corr_df)

🔍 Top 5 correlated features with Charges:


|   | Feature | Correlation_with_Charges |
|---|---------|--------------------------|
| 0 | age | 0.298308 |
| 1 | bmi | 0.198401 |
| 2 | children | 0.067389 |
