# 🏥 Healthcare Insurance Cost Analysis  
## 🧩 Notebook 02 – Data Cleaning and Transformation  
| Field | Description |
|-------|-------------|
|**Author:** | Robert Steven Elliott  |
|**Course:** | Code Institute – Data Analytics with AI Bootcamp  |
|**Project Type:** | Individual Formative Project  |
|**Date:** | October 2025  |

---

## **Objectives**
- Clean the raw healthcare dataset.
- Handle missing values and remove duplicates.
- Encode categorical variables for analysis.
- Prepare the dataset for feature engineering and visualisation.

## **Inputs**
- `data/raw/insurance.csv`

## **Outputs**
- `data/processed/insurance_clean.csv`  
- Cleaned, encoded DataFrame ready for analysis.

## **Additional Comments**
Ensure that Notebook 01 was executed successfully before running this notebook.
All cleaning steps must be reproducible and clearly documented.

---

# Change Working Directory

In [None]:
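import os

# Move up one level from the notebook folder to the project root.
# Run this cell only once per session: each re-run moves up another directory.
PROJECT_ROOT = os.path.dirname(os.getcwd())
os.chdir(PROJECT_ROOT)
print("✅ Working directory set to project root:", os.getcwd())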

✅ Working directory set to project root: /home/robert/Projects/health-insurance-cost-analysis


Confirm the new working directory.

In [2]:
current_dir = os.getcwd()
current_dir

'/home/robert/Projects/health-insurance-cost-analysis'

---

# Import Dependencies

In [3]:
import pandas as pd
import numpy as np

---

# Load Dataset

In [4]:
data_path = "data/raw/insurance.csv"

try:
    df = pd.read_csv(data_path)
    print(f"✅ Dataset loaded. Shape: {df.shape}")
except FileNotFoundError:
    raise FileNotFoundError("❌ insurance.csv not found. Place it in data/raw/")

✅ Dataset loaded. Shape: (1338, 7)


---

# Inspect Dataset

In [5]:
print("📋 DataFrame info:")
df.info()

📋 DataFrame info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   object 
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   object 
 5   region    1338 non-null   object 
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB


In [6]:
print("\n🔍 Preview:")
display(df.head())


🔍 Preview:


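|   | age | sex | bmi | children | smoker | region | charges |
|---|-----|-----|-----|----------|--------|--------|---------|
| 0 | 19 | female | 27.9 | 0 | yes | southwest | 16884.924 |
| 1 | 18 | male | 33.77 | 1 | no | southeast | 1725.5523 |
| 2 | 28 | male | 33.0 | 3 | no | southeast | 4449.462 |
| 3 | 33 | male | 22.705 | 0 | no | northwest | 21984.47061 |
| 4 | 32 | male | 28.88 | 0 | no | northwest | 3866.8552 |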


In [7]:
print("\n🔢 Missing Values:")
display(df.isna().sum())


🔢 Missing Values:


age         0
sex         0
bmi         0
children    0
smoker      0
region      0
charges     0
dtype: int64

---

# Handle Missing Values

The check above found no missing values in any of the seven columns, so no imputation or row removal is required.
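
If a future version of the data did contain gaps, one common approach is to fill numeric columns with the median and text columns with the most frequent value. The sketch below illustrates this on a small made-up frame; `toy` and `fill_missing` are hypothetical names, and the choice of median/mode is an assumption rather than something this project applies.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with gaps; the real insurance dataset has none.
toy = pd.DataFrame({
    "bmi": [27.9, np.nan, 33.0, 22.705],
    "smoker": ["yes", "no", None, "no"],
})

def fill_missing(frame: pd.DataFrame) -> pd.DataFrame:
    """Fill numeric gaps with the column median and other gaps with the mode."""
    filled = frame.copy()
    for col in filled.columns:
        if pd.api.types.is_numeric_dtype(filled[col]):
            filled[col] = filled[col].fillna(filled[col].median())
        else:
            # mode() can return several values; take the first one.
            filled[col] = filled[col].fillna(filled[col].mode().iloc[0])
    return filled

toy_filled = fill_missing(toy)
print(toy_filled)
```

Whether imputation or row removal is more appropriate would depend on how many values are missing and why.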

---

# Remove Duplicates

In [8]:
before = df.shape[0]
df.drop_duplicates(inplace=True)
after = df.shape[0]
print(f"✅ Removed {before - after} duplicate rows.")

✅ Removed 1 duplicate rows.


One exact duplicate row was found and removed, reducing the dataset from 1,338 to 1,337 rows.  
Removing it prevents the repeated record from biasing descriptive and correlation analyses, and improves the dataset’s integrity before further encoding and transformation.
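
Before dropping duplicates it can help to look at them. The sketch below uses a small made-up frame (`toy` is a hypothetical name) to show how `duplicated(keep=False)` lists every member of a duplicate group, and how `reset_index(drop=True)` closes the gap that `drop_duplicates` leaves in the index.

```python
import pandas as pd

# Hypothetical toy frame: rows 0 and 2 are exact duplicates.
toy = pd.DataFrame({
    "age": [19, 18, 19, 28],
    "region": ["southwest", "southeast", "southwest", "southeast"],
    "charges": [16884.924, 1725.5523, 16884.924, 4449.462],
})

# keep=False flags every row in a duplicate group, not only the repeats.
dupes = toy[toy.duplicated(keep=False)]
print(dupes)

# Drop the repeats and renumber the index so it has no gaps.
deduped = toy.drop_duplicates().reset_index(drop=True)
print(f"Removed {len(toy) - len(deduped)} duplicate rows.")
```

In this notebook the index gap is harmless, because the cleaned data is saved with `index=False`.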

---

# Encode Categorical Variables

In [9]:
categorical_columns = ['sex', 'smoker', 'region']
df[categorical_columns] = df[categorical_columns].astype('category')

print("✅ All categorical variables set to 'category' dtype for readability.")
display(df[categorical_columns].head())

✅ All categorical variables set to 'category' dtype for readability.


|   | sex | smoker | region |
|---|-----|--------|--------|
| 0 | female | yes | southwest |
| 1 | male | no | southeast |
| 2 | male | no | southeast |
| 3 | male | no | northwest |
| 4 | male | no | northwest |


All categorical variables (`sex`, `smoker`, and `region`) were retained in their original text format  
and converted to the pandas `category` dtype for clarity and consistency.  
No numeric encoding is applied at this stage, so visualisations remain easy to interpret.
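
Should a later step need numbers instead of labels, the `category` dtype makes that conversion straightforward. The following is a minimal sketch on made-up rows (`toy` is a hypothetical name), showing integer codes via `.cat.codes` and one-hot columns via `pd.get_dummies`. It is not part of this notebook's pipeline.

```python
import pandas as pd

# Hypothetical toy frame using the same three categorical columns.
toy = pd.DataFrame({
    "sex": ["female", "male", "male"],
    "smoker": ["yes", "no", "no"],
    "region": ["southwest", "southeast", "northwest"],
}).astype("category")

# Integer codes follow the sorted category order: 'no' -> 0, 'yes' -> 1.
smoker_codes = toy["smoker"].cat.codes
print(smoker_codes.tolist())

# One-hot encoding: one indicator column per region label.
region_dummies = pd.get_dummies(toy["region"], prefix="region")
print(region_dummies.columns.tolist())
```

Integer codes impose an arbitrary order, so one-hot columns are usually the safer choice for an unordered variable such as `region`.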

---

# Basic integrity checks

In [11]:

print(f"✅ DataFrame shape: {df.shape}")
print("✅ No nulls remaining:", df.isnull().sum().sum() == 0)
print("✅ Numeric columns:", df.select_dtypes(include=np.number).columns.tolist())
print("✅ Categorical columns:", df.select_dtypes(include='category').columns.tolist())

✅ DataFrame shape: (1337, 7)
✅ No nulls remaining: True
✅ Numeric columns: ['age', 'bmi', 'children', 'charges']
✅ Categorical columns: ['sex', 'smoker', 'region']
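
The checks above cover shape, nulls and dtypes. A possible extension is to test that values are plausible as well. The sketch below is illustrative only: `find_range_problems` and its rules (non-negative age and children, positive BMI and charges, `yes`/`no` smoker labels) are assumptions, not rules defined by this project.

```python
import pandas as pd

def find_range_problems(frame: pd.DataFrame) -> list[str]:
    """Return human-readable rule violations (an empty list if none)."""
    problems = []
    if (frame["age"] < 0).any():
        problems.append("negative age")
    if (frame["bmi"] <= 0).any():
        problems.append("non-positive bmi")
    if (frame["children"] < 0).any():
        problems.append("negative children count")
    if (frame["charges"] <= 0).any():
        problems.append("non-positive charges")
    if not frame["smoker"].isin(["yes", "no"]).all():
        problems.append("unexpected smoker label")
    return problems

# Hypothetical rows: 'good' passes every rule, 'bad' breaks two of them.
good = pd.DataFrame({
    "age": [19, 18], "bmi": [27.9, 33.77], "children": [0, 1],
    "smoker": ["yes", "no"], "charges": [16884.924, 1725.5523],
})
bad = good.assign(bmi=[27.9, -1.0], smoker=["yes", "maybe"])

print(find_range_problems(good))
print(find_range_problems(bad))
```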


---

# Save Cleaned Dataset

In [12]:
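output_path = "data/processed/insurance_clean.csv"
# Create data/processed/ if it does not exist yet, so the save works on a fresh clone.
os.makedirs(os.path.dirname(output_path), exist_ok=True)
df.to_csv(output_path, index=False)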
print(f"📁 Cleaned dataset saved to: {output_path}")

📁 Cleaned dataset saved to: data/processed/insurance_clean.csv
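
Note that CSV files store plain text, so the `category` dtype is not preserved on disk. A later notebook that reloads the file would need to restore it, for example by passing `dtype` to `pd.read_csv`. The sketch below demonstrates this with an in-memory buffer and made-up rows (`toy`, `plain` and `restored` are hypothetical names).

```python
import io
import pandas as pd

categorical_columns = ["sex", "smoker", "region"]

# Hypothetical toy frame with the categorical columns set to 'category'.
toy = pd.DataFrame({
    "age": [19, 18],
    "sex": ["female", "male"],
    "smoker": ["yes", "no"],
    "region": ["southwest", "southeast"],
})
toy[categorical_columns] = toy[categorical_columns].astype("category")

# Write to an in-memory buffer instead of a real file.
csv_text = toy.to_csv(index=False)

# A plain reload does not bring the category dtype back.
plain = pd.read_csv(io.StringIO(csv_text))

# Passing dtype restores it at load time.
restored = pd.read_csv(
    io.StringIO(csv_text),
    dtype={col: "category" for col in categorical_columns},
)
print(restored.dtypes)
```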
