# 🏥 Healthcare Insurance Cost Analysis  
## 🧩 Notebook 01 – Data Cleaning and Transformation  
| Field | Description |
|-------|-------------|
|**Author:** | Robert Steven Elliott  |
|**Course:** | Code Institute – Data Analytics with AI Bootcamp  |
|**Project Type:** | Individual Formative Project  |
|**Date:** | October 2025  |

---

## **Objectives**
- Clean the raw healthcare dataset.
- Handle missing values and remove duplicates.
- Encode categorical variables for analysis.
- Prepare the dataset for feature engineering and visualisation.

## **Inputs**
- `data/raw/insurance.csv`

## **Outputs**
- `data/processed/insurance_clean.csv`  
- Cleaned, encoded DataFrame ready for analysis.

## **Additional Comments**
This is the first notebook in the project, so no earlier notebook needs to be run; it only requires the raw dataset at `data/raw/insurance.csv`.
All cleaning steps must be reproducible and clearly documented.

---

# Change Working Directory

In [1]:
import sys
from pathlib import Path
PROJECT_ROOT = Path.cwd().parent
sys.path.append(str(PROJECT_ROOT))
print("✅ Working directory set to project root:", PROJECT_ROOT)

✅ Working directory set to project root: /home/robert/Projects/health-insurance-cost-analysis
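Note that the cell above extends the module search path rather than changing the process working directory. That is enough for `from utils.data_handler import ...` to resolve, and the later cells build absolute paths from `PROJECT_ROOT`, so nothing depends on the working directory itself. A minimal sketch of the distinction, assuming (as the cell does) that the notebook lives one level below the project root; the variable names here are illustrative only:

```python
import sys
from pathlib import Path

cwd_before = Path.cwd()
# Assumption mirrored from the notebook: the project root is the parent folder.
project_root = cwd_before.parent

# Extending sys.path only affects where `import` looks for modules.
if str(project_root) not in sys.path:
    sys.path.append(str(project_root))

# The working directory is untouched, so relative file paths still
# resolve against the notebook's own folder.
print("cwd unchanged:", Path.cwd() == cwd_before)
print("root on import path:", str(project_root) in sys.path)
```

The membership check before `append` keeps the path list from growing each time the cell is re-run.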


---

# Import Dependencies

In [2]:
import pandas as pd
import numpy as np
from utils.data_handler import (load_data, 
                                data_overview, 
                                clean_data)

---

# Load Dataset

In [3]:
input_file = PROJECT_ROOT / "data" / "raw" / "insurance.csv"
df = load_data(input_file)
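The source of `load_data` is not shown in this notebook. As a rough idea of what such a loader might do, the sketch below uses a hypothetical `load_data_sketch` that checks the file exists before reading it with `pandas.read_csv`. The function name, the error message, and the sample rows are all assumptions made for illustration, not the contents of `utils.data_handler` or `insurance.csv`.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_data_sketch(path):
    """Hypothetical loader: fail early with a clear message, then read the CSV."""
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(f"Dataset not found: {path}")
    return pd.read_csv(path)


# Demonstrate on a throwaway file containing made-up rows.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "sample.csv"
    sample_path.write_text("age,sex,charges\n19,female,100.0\n33,male,250.5\n")
    sample = load_data_sketch(sample_path)

print(sample.shape)
print(list(sample.columns))
```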


---

# Inspect Dataset

In [4]:
data_overview(df)

DataFrame Shape: (1337, 7)

Data Types:
 age           int64
sex          object
bmi         float64
children      int64
smoker       object
region       object
charges     float64
dtype: object

Missing Values:
 age         0
sex         0
bmi         0
children    0
smoker      0
region      0
charges     0
dtype: int64
<class 'pandas.core.frame.DataFrame'>
Index: 1337 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1337 non-null   int64  
 1   sex       1337 non-null   object 
 2   bmi       1337 non-null   float64
 3   children  1337 non-null   int64  
 4   smoker    1337 non-null   object 
 5   region    1337 non-null   object 
 6   charges   1337 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 83.6+ KB

📋 Dataset Info:
 None

Statistical Summary:
                 age   sex          bmi     children smoker     region  \
count   1337.000000  1337  1337.000000 
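The `None` shown after `Dataset Info:` is most likely the return value of `DataFrame.info()`. That method writes its report straight to standard output and returns `None`, so passing it to `print` shows the report first and `None` afterwards. One way to avoid this, sketched here with a hypothetical helper `info_as_text` (not part of `utils.data_handler`), is to capture the report through the method's `buf` parameter:

```python
import io

import pandas as pd


def info_as_text(df):
    """Return the DataFrame.info() report as a string instead of printing it."""
    buffer = io.StringIO()
    df.info(buf=buffer)  # info() writes into the buffer and still returns None
    return buffer.getvalue()


# Made-up frame for illustration only.
demo_info = pd.DataFrame({"age": [19, 33], "bmi": [27.9, 22.7]})
report = info_as_text(demo_info)
print("Dataset Info:\n" + report)
```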

---

# Use New Cleaning Function

In [9]:
df = clean_data(df, categorical_cols=['sex', 'smoker', 'region'])
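The body of `clean_data` lives in `utils/data_handler.py` and is not reproduced here. Based on the objectives listed at the top of this notebook, a function of this kind could look roughly like the sketch below. `clean_data_sketch` is a hypothetical stand-in; dropping incomplete rows (rather than imputing) and lower-casing labels are assumptions, not confirmed behaviour of the real function.

```python
import pandas as pd


def clean_data_sketch(df: pd.DataFrame, categorical_cols: list[str]) -> pd.DataFrame:
    """Hypothetical stand-in for utils.data_handler.clean_data."""
    cleaned = df.copy()
    # Remove exact duplicate rows, keeping the first occurrence.
    cleaned = cleaned.drop_duplicates()
    # Assumption: rows with any missing value are dropped, not imputed.
    cleaned = cleaned.dropna()
    # Normalise text labels, then store them using the category dtype.
    for col in categorical_cols:
        cleaned[col] = (
            cleaned[col].astype(str).str.strip().str.lower().astype("category")
        )
    return cleaned


# Tiny made-up frame for illustration only (not taken from insurance.csv).
demo = pd.DataFrame({
    "age": [19, 19, 33, 45],
    "sex": ["female", "female", " Male", "male"],
    "smoker": ["yes", "yes", "no", None],
    "charges": [100.0, 100.0, 250.5, 300.0],
})
demo_clean = clean_data_sketch(demo, categorical_cols=["sex", "smoker"])
print(demo_clean.shape)
print(demo_clean.dtypes)
```

Working on a copy leaves the caller's DataFrame untouched, which makes the cell safe to re-run.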

# Basic Integrity Checks

In [10]:

print(f"✅ DataFrame shape: {df.shape}")
print("✅ No nulls remaining:", df.isnull().sum().sum() == 0)
print("✅ Numeric columns:", df.select_dtypes(include=np.number).columns.tolist())
print("✅ Categorical columns:", df.select_dtypes(include='category').columns.tolist())

✅ DataFrame shape: (1337, 7)
✅ No nulls remaining: True
✅ Numeric columns: ['age', 'bmi', 'children', 'charges']
✅ Categorical columns: ['sex', 'smoker', 'region']
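The printed checks above rely on someone reading the output. If the notebook should instead stop as soon as an expectation is violated, the same checks can be written as assertions. The helper `check_integrity` below is a hypothetical addition, shown on a made-up frame rather than the insurance data:

```python
import pandas as pd


def check_integrity(df, categorical_cols):
    """Raise AssertionError if a basic expectation about the frame is violated."""
    assert df.isnull().sum().sum() == 0, "null values remain"
    assert not df.duplicated().any(), "duplicate rows remain"
    for col in categorical_cols:
        assert isinstance(df[col].dtype, pd.CategoricalDtype), f"{col} is not categorical"
    return True


# Made-up frames for illustration only.
good = pd.DataFrame({"age": [19, 33], "smoker": pd.Categorical(["yes", "no"])})
print("good frame passes:", check_integrity(good, ["smoker"]))

# Appending a copy of the first row introduces a duplicate.
bad = pd.concat([good, good.iloc[[0]]], ignore_index=True)
try:
    check_integrity(bad, ["smoker"])
except AssertionError as err:
    print("bad frame rejected:", err)
```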


---

# Save Cleaned Dataset

In [11]:
output_path = PROJECT_ROOT / "data" / "processed" / "insurance_clean.csv"
df.to_csv(output_path, index=False)
print(f"📁 Cleaned dataset saved to: {output_path}")

📁 Cleaned dataset saved to: /home/robert/Projects/health-insurance-cost-analysis/data/processed/insurance_clean.csv
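One caveat when saving to CSV: the format stores only text, so the `category` dtype assigned to `sex`, `smoker`, and `region` is not preserved in the file. A later notebook that needs those columns as categories has to request it when reading. The sketch below shows the round trip on a made-up two-row frame, using an in-memory buffer in place of `insurance_clean.csv`:

```python
import io

import pandas as pd

# Made-up frame standing in for the cleaned dataset.
frame = pd.DataFrame({
    "sex": pd.Categorical(["female", "male"]),
    "charges": [1.5, 2.5],
})

buffer = io.StringIO()
frame.to_csv(buffer, index=False)

# Reading back without hints loses the category dtype.
buffer.seek(0)
plain = pd.read_csv(buffer)

# Passing dtype= restores it for the named columns.
buffer.seek(0)
restored = pd.read_csv(buffer, dtype={"sex": "category"})

print("without dtype hint:", plain["sex"].dtype)
print("with dtype hint:", restored["sex"].dtype)
```

For the real file, the same `dtype` mapping would list all three categorical columns.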
