# 🏥 Healthcare Insurance Cost Analysis  
## 📊 Notebook 04 – Exploratory Data Analysis and Visualisations 

| Field | Description |
|-------|-------------|
| **Author:** | Robert Steven Elliott |
| **Course:** | Code Institute – Data Analytics with AI Bootcamp |
| **Project Type:** | Individual Formative Project |
| **Date:** | October 2025 |

---

## **Objectives**
- Conduct exploratory data analysis (EDA) to uncover trends and patterns.  
- Produce static visualisations using Matplotlib and Seaborn.  
- Create interactive dashboards using Plotly.  
- Summarise relationships between key features and healthcare insurance charges.

## **Inputs**
- `data/processed/insurance_enriched.csv`

## **Outputs**
- Static and interactive visualisations saved to `reports/figures/`  
- Analytical insights summarised for presentation and documentation.

## **Additional Comments**
Run this notebook after `03_feature_engineering.ipynb`.  
Ensure the enriched dataset is available at `data/processed/insurance_enriched.csv`.


---

# Change Working Directory

In [None]:
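import os

# Assumes the notebook is run from a folder directly under the project root.
# Re-running this cell moves up one more level, so run it once per kernel session.
PROJECT_ROOT = os.path.dirname(os.getcwd())
os.chdir(PROJECT_ROOT)
print("✅ Working directory set to project root:", os.getcwd())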

---

# Import Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

# Aesthetic setup
sns.set(style="whitegrid", palette="muted")
print("✅ Libraries imported successfully.")

---

# Load Enriched Dataset

In [None]:
# Path matches the input listed in the notebook header
data_path = "data/processed/insurance_enriched.csv"

try:
    df = pd.read_csv(data_path)
    print(f"✅ Enriched dataset loaded. Shape: {df.shape}")
except FileNotFoundError as err:
    raise FileNotFoundError(f"❌ {data_path} not found. Please run Notebook 03 first.") from err

display(df.head())

## Data Type Restoration
CSV files do not store dtype information, so pandas reads categorical columns back as plain strings. The columns `sex`, `smoker`, `region`, `bmi_category`, `age_group` and `family_size_category` are therefore re-cast to the `category` dtype  
to maintain consistency with the ETL pipeline and improve plotting performance.

In [None]:
for col in ['sex', 'smoker', 'region', 'bmi_category', 'age_group', 'family_size_category']:
    df[col] = df[col].astype('category')
print("✅ Data types after restoration:")
print(df.dtypes)
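
The loop above raises a `KeyError` if any of the six columns is missing from the CSV. As a minimal sketch, assuming a more defensive variant is wanted, the helper below casts only the columns that are present. The name `restore_categories` and the toy frame are hypothetical illustrations, not part of the project's ETL pipeline.

```python
import pandas as pd

# Column names taken from the restoration cell above
CATEGORICAL_COLS = ['sex', 'smoker', 'region', 'bmi_category',
                    'age_group', 'family_size_category']


def restore_categories(frame, columns=CATEGORICAL_COLS):
    """Return a copy with the listed columns cast to 'category', skipping absent ones."""
    present = [col for col in columns if col in frame.columns]
    return frame.astype({col: 'category' for col in present})


# Toy data (made up) containing only some of the expected columns
toy = pd.DataFrame({
    'sex': ['male', 'female'],
    'smoker': ['yes', 'no'],
    'charges': [100.0, 200.0],
})

restored = restore_categories(toy)
print(restored.dtypes)
```

Because `astype` returns a new frame, the original `toy` frame is left unchanged, which makes the step safe to re-run.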

---

# Summary Statistics and Overview

In [None]:
# Basic descriptive stats
df.describe()


In [None]:

# Correlation with target variable
corr = df.corr(numeric_only=True)
display(corr['charges'].sort_values(ascending=False))
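
Note that `numeric_only=True` drops every non-numeric column, so the categorical features restored earlier do not appear in this ranking. The sketch below illustrates this on a small made-up frame and shows one possible way to include a binary category by encoding it as 0/1. The `smoker_flag` column is a hypothetical name used only for illustration; the enriched dataset may already contain an equivalent feature.

```python
import pandas as pd

# Toy data (made up); 'smoker' is categorical, as in the notebook after dtype restoration
toy = pd.DataFrame({
    'age': [20, 30, 40, 50],
    'bmi': [30.0, 22.0, 28.0, 25.0],
    'smoker': pd.Categorical(['no', 'yes', 'no', 'yes']),
    'charges': [1000.0, 2000.0, 3000.0, 4000.0],
})

# Categorical columns are excluded from the correlation matrix
toy_corr = toy.corr(numeric_only=True)
print(toy_corr['charges'].sort_values(ascending=False))

# Assumption: a binary category can be encoded as 0/1 to bring it into the ranking
toy['smoker_flag'] = (toy['smoker'] == 'yes').astype(int)
toy_corr_encoded = toy.corr(numeric_only=True)
print(toy_corr_encoded['charges'].sort_values(ascending=False))
```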

---

# Identify Top Correlations

In [None]:
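# Rank features by absolute correlation with charges, excluding charges itself
top_corr = corr['charges'].drop('charges').sort_values(key=abs, ascending=False).head(5)
top_corr_df = top_corr.reset_index()
top_corr_df.columns = ['Feature', 'Correlation_with_Charges']

print("🔍 Top 5 features most correlated with Charges:")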
display(top_corr_df)

---

# Distribution of Key Variables

## Distribution of Charges

In [None]:
plt.figure(figsize=(10,5))
sns.histplot(df['charges'], bins=30, kde=True, color='skyblue')
plt.title("Distribution of Insurance Charges")
plt.xlabel("Charges ($)")
plt.ylabel("Frequency")
plt.tight_layout()
plt.show()

## Distribution of BMI

In [None]:
plt.figure(figsize=(10,5))
sns.histplot(df['bmi'], bins=20, kde=True, color='salmon')
plt.title("Distribution of BMI")
plt.xlabel("BMI")
plt.ylabel("Frequency")
plt.tight_layout()
plt.show()

## Distribution of Age

In [None]:
plt.figure(figsize=(10,5))
sns.histplot(df['age'], bins=15, kde=True, color='lightgreen')
plt.title("Distribution of Age")
plt.xlabel("Age")
plt.ylabel("Frequency")
plt.tight_layout()
plt.show()
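
The notebook header lists figures saved to `reports/figures/`, while the cells above only display the plots. A minimal sketch of one way to save them follows. The helper name `save_figure`, the `dpi` value and the example file name are assumptions for illustration, not the project's actual convention.

```python
import os

import matplotlib.pyplot as plt


def save_figure(fig, filename, out_dir="reports/figures"):
    """Save a Matplotlib figure under out_dir, creating the folder if needed."""
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, filename)
    fig.savefig(path, dpi=150, bbox_inches="tight")
    return path


# Toy histogram with made-up values, standing in for one of the plots above
fig, ax = plt.subplots(figsize=(10, 5))
ax.hist([1, 2, 2, 3, 3, 3, 4], bins=4)
ax.set_title("Example histogram")
ax.set_xlabel("Value")
ax.set_ylabel("Frequency")

saved_path = save_figure(fig, "example_histogram.png")
plt.close(fig)
print("Saved:", saved_path)
```

In the plotting cells above, the same helper could be called as `save_figure(plt.gcf(), "charges_distribution.png")` just before `plt.show()`, since the current figure may be cleared once it has been shown.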