# Cleaning and Validating Healthcare Data Using Python

Time estimate: **30** minutes

## Objectives

After completing this lab, you will be able to:

 - Identify and handle common data quality issues including missing values, duplicates, inconsistent entries, and outliers.
 - Remove Personally Identifiable Information (PII) to ensure data privacy compliance with HIPAA and GDPR.
 - Apply data transformation techniques including standardization, normalization, and encoding.


## What you will do in this lab

In this hands-on lab, you will work with real-world healthcare data that contains typical quality issues found in medical datasets. You'll learn to clean and prepare this data for machine learning applications while maintaining patient privacy.

You will:

- Explore and identify data quality issues in a healthcare dataset
- Remove sensitive patient information (PII) to comply with privacy regulations
- Handle missing values using appropriate imputation strategies
- Standardize inconsistent categorical data and mixed units
- Detect and handle outliers using statistical methods
- Engineer meaningful features such as BMI and temporal indicators
- Encode categorical variables and scale numeric features for ML readiness

## Overview

Data preprocessing is a critical step in any machine learning pipeline, but it becomes especially important in healthcare applications where data quality directly impacts patient outcomes. Raw healthcare data often contains inconsistencies, missing values, mixed formats, and privacy-sensitive information that must be carefully addressed before building predictive models.

In this lab, you'll work with a synthetic healthcare dataset that simulates real-world challenges such as inconsistent gender labels (M/Male/F/Female), mixed measurement units (lbs/kg), various date formats, and missing diagnostic information. You'll learn systematic approaches to clean this data while maintaining its utility for analysis.

The preprocessing pipeline you'll build follows industry best practices: first removing PII for privacy, then addressing data quality issues, followed by feature engineering to create more informative variables, and finally transforming the data into a format suitable for machine learning algorithms. These skills are directly applicable to real healthcare analytics projects where clean, privacy-compliant data is essential.

By the end of this lab, you'll have a complete understanding of how to transform messy healthcare data into a clean, standardized dataset ready for predictive modeling tasks such as risk assessment or disease diagnosis.

## About the dataset

This lab uses a synthetic healthcare dataset designed to simulate real-world medical data challenges.

### Dataset overview

The dataset contains patient health records including demographics, vital measurements, diagnostic information, and risk indicators. This data simulates what you might encounter in electronic health records (EHR) systems, complete with the messiness and inconsistencies typical of real medical data. The dataset includes 200 patient records with intentionally introduced quality issues such as missing values, inconsistent formatting, mixed units, and duplicate entries to provide realistic preprocessing practice.

### Column descriptions

1. **Patient_ID** - Unique identifier for each patient (e.g., P016, P_new_126)
2. **Age** - Age of the patient in years (may contain missing values or outliers)
3. **Gender** - Gender of the patient (inconsistent formats: M, Male, F, Female, Other)
4. **Ethnicity** - Ethnic background of the patient (Asian, African, Caucasian, Hispanic with inconsistent capitalization)
5. **Weight** - Weight of the patient in mixed units (kg or lbs, e.g., 70, 150lbs)
6. **Height_cm** - Height of the patient in centimeters (numeric values)
7. **Diagnosis_Date** - Date when diagnosis was made (multiple date formats: YYYY-MM-DD, DD/MM/YYYY)
8. **Diagnosis_Code** - Medical diagnosis code or abbreviation (DEP=Depression, OCD=Obsessive Compulsive Disorder, ANX=Anxiety, ANXITY=typo for Anxiety)
9. **Glucose_mg_dL** - Blood glucose level in mg/dL (may indicate diabetes risk)
10. **Risk** - Binary risk indicator (0 = low risk, 1 = high risk for adverse health outcomes)
11. **Patient_Name** - Full name of the patient (PII - to be removed)
12. **EmailID** - Email address of the patient (PII - to be removed)

## Setup

### Installing required libraries

The following libraries are required to run this lab. Pandas will be used for data manipulation, NumPy for numerical operations, SciPy for statistical functions, and Scikit-learn for preprocessing utilities.

In [1]:
# Install the libraries required for this lab
!pip install pandas
!pip install numpy
!pip install scipy
!pip install scikit-learn



In [2]:
# Optional: suppress warnings for cleaner output
def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn
warnings.filterwarnings('ignore')

### Importing required libraries

In [3]:
import pandas as pd   # For data loading, manipulation, cleaning, and saving (DataFrame operations)
import numpy as np    # For numerical operations, array handling, and missing value operations
import re             # For regular expression pattern matching in text processing

from datetime import datetime, timedelta  # For parsing and manipulating date/time values
import random  # For generating random values during data exploration

from scipy import stats  # For statistical functions like z-score for outlier detection
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder
# StandardScaler: standardizes features (mean=0, std=1) for ML algorithms
# MinMaxScaler: scales features to a fixed range (typically 0-1)
# OneHotEncoder: converts categorical features into binary indicator variables

print("All libraries imported successfully!")
print("Ready to begin healthcare data preprocessing.")

All libraries imported successfully!
Ready to begin healthcare data preprocessing.


## Step 1: Load and explore the raw data

Before cleaning data, it's essential to understand what you're working with. In this step, you'll load the healthcare dataset and perform an initial exploration to identify data quality issues. This exploration phase helps you make informed decisions about which preprocessing techniques to apply.

In [4]:
# Load the raw healthcare data from CSV file
df = pd.read_csv("https://foundations-of-healthcare-data-analytics-4e579d.gitlab.io/labs/Cleaning_and_Validating_Healthcare_Data_Using_Python/raw_data.csv")

# Display the first few rows to get an initial sense of the data
print("First 5 rows of the dataset:")
df.head()

First 5 rows of the dataset:


Unnamed: 0,Patient_ID,Age,Gender,Ethnicity,Weight,Height_cm,Diagnosis_Date,Diagnosis_Code,Glucose_mg_dL,Risk,Patient_Name,EmailID
0,P_new_63,,,asian,,175.0,2021-10-21,,140.0,0,David Jones,kurt31@example.org
1,P016,50.0,M,,,165.0,2020-05-20,DEP,100.0,0,Cristina White,parkerjennifer@example.org
2,P006,25.0,F,Caucasian,,,15/06/2021,OCD,110.0,0,Laura Bates,sean22@example.net
3,P_new_126,34.0,M,asian,70.0,160.0,2016-10-19,ocd,,1,Rebecca Smith,johnmcclure@example.net
4,P_new_96,25.0,,Asian,70.0,180.0,2016-02-09,ANXITY,140.0,0,Brittany Wise,nholloway@example.net


## Step 2: Understand data quality issues

Real-world healthcare data commonly suffers from four main quality issues:

1. **Missing data**: Important fields left blank or null
2. **Duplicates**: Identical records appearing multiple times
3. **Inconsistent entries**: Same category with different labels (e.g., M vs Male)
4. **Outliers**: Extreme or impossible values (e.g., Age=200, Glucose=500)

Let's systematically identify these issues in the dataset.

In [5]:
# Basic dataset structure
print("Dataset dimensions:")
print(f"Number of rows: {df.shape[0]}")
print(f"Number of columns: {df.shape[1]}")
print("\nColumn names and data types:")
print(df.dtypes)

Dataset dimensions:
Number of rows: 200
Number of columns: 12

Column names and data types:
Patient_ID         object
Age               float64
Gender             object
Ethnicity          object
Weight             object
Height_cm         float64
Diagnosis_Date     object
Diagnosis_Code     object
Glucose_mg_dL     float64
Risk                int64
Patient_Name       object
EmailID            object
dtype: object


In [6]:
# Check for missing values
print("\nMissing values per column:")
print(df.isnull().sum())
print("\nPercentage of missing values:")
print((df.isnull().sum() / len(df) * 100).round(2))


Missing values per column:
Patient_ID         0
Age               28
Gender            38
Ethnicity         35
Weight            44
Height_cm         40
Diagnosis_Date    10
Diagnosis_Code    33
Glucose_mg_dL     42
Risk               0
Patient_Name       0
EmailID            0
dtype: int64

Percentage of missing values:
Patient_ID         0.0
Age               14.0
Gender            19.0
Ethnicity         17.5
Weight            22.0
Height_cm         20.0
Diagnosis_Date     5.0
Diagnosis_Code    16.5
Glucose_mg_dL     21.0
Risk               0.0
Patient_Name       0.0
EmailID            0.0
dtype: float64


In [7]:
# Check for duplicate rows
dup_rows = df.duplicated(keep=False)
print(f"\nNumber of duplicate rows: {dup_rows.sum()}")
if dup_rows.any():
    display(df[dup_rows])


Number of duplicate rows: 0


In [8]:
# Identify inconsistent categorical entries
print("\nUnique values in Gender column:")
print(df['Gender'].unique())
print("\nUnique values in Ethnicity column:")
print(df['Ethnicity'].unique())
print("\nUnique values in Diagnosis_Code column:")
print(df['Diagnosis_Code'].unique())


Unique values in Gender column:
[nan 'M' 'F' 'Male' 'Female' 'Other']

Unique values in Ethnicity column:
['asian' nan 'Caucasian' 'Asian' 'African' 'Hispanic' 'caucasian']

Unique values in Diagnosis_Code column:
[nan 'DEP' 'OCD' 'ocd' 'ANXITY' 'ANX']


In [9]:
# Check for mixed units in Weight column
print("\nUnique Weight values (showing mixed units):")
display(df['Weight'].unique())


Unique Weight values (showing mixed units):


array([nan, '70', '110', '150lbs', '90', '75', '160lbs', '80'],
      dtype=object)

In [10]:
# Statistical summary to identify potential outliers
print("\nStatistical summary of numeric columns:")
display(df[['Age', 'Glucose_mg_dL']].describe())


Statistical summary of numeric columns:


Unnamed: 0,Age,Glucose_mg_dL
count,172.0,158.0
mean,70.087209,220.981013
std,62.340026,177.443324
min,25.0,85.0
25%,34.0,90.0
50%,45.0,110.0
75%,60.0,500.0
max,200.0,500.0


In [11]:
# Check date format inconsistencies
print("\nSample of Diagnosis_Date values (showing mixed formats):")
display(df['Diagnosis_Date'].sample(10, random_state=42))


Sample of Diagnosis_Date values (showing mixed formats):


95     2023-03-23
15     10/01/2018
30     2022-01-10
158    15/06/2021
128    19/03/2019
115    25/02/2016
69     2018-03-31
170    2020-02-04
174    26/09/2021
45     2020-12-31
Name: Diagnosis_Date, dtype: object

In [12]:
# Check class balance for the target variable
print("\nRisk value distribution:")
print(df['Risk'].value_counts())
print("\nRisk percentage distribution:")
print((df['Risk'].value_counts() / len(df) * 100).round(2))


Risk value distribution:
0    107
1     93
Name: Risk, dtype: int64

Risk percentage distribution:
0    53.5
1    46.5
Name: Risk, dtype: float64


## Step 3: Detect outliers using statistical methods

Outliers can significantly impact machine learning models. You'll use two common statistical methods to detect them:

### Interquartile Range (IQR) method
- **Formula**: IQR = Q3 − Q1 (difference between 75th and 25th percentiles)
- **Outlier definition**: Values below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR
- **Best for**: Non-normally distributed data (robust against skewness)

### Z-Score method
- **Formula**: Z = (Value − Mean) / Standard Deviation
- **Outlier definition**: |Z-score| > 3 (more than 3 standard deviations from mean)
- **Best for**: Normally distributed data

In [13]:
# Function to detect outliers using IQR method
def iqr_outliers(series):
    """
    Detect outliers using the Interquartile Range (IQR) method.

    Parameters:
    series: pandas Series - numeric column to check for outliers

    Returns:
    pandas Series - containing only the outlier values
    """
    q1 = series.quantile(0.25)  # 25th percentile
    q3 = series.quantile(0.75)  # 75th percentile
    iqr = q3 - q1                # Interquartile range
    lower_bound = q1 - 1.5 * iqr
    upper_bound = q3 + 1.5 * iqr
    return series[(series < lower_bound) | (series > upper_bound)]

# Detect outliers in Age
print("Age outliers (IQR method):")
age_outliers = iqr_outliers(df['Age'].dropna())
print(f"Found {len(age_outliers)} outliers in Age")
print(f"Outlier values: {age_outliers.unique()}")

# Detect outliers in Glucose
print("\nGlucose outliers (IQR method):")
glucose_outliers = iqr_outliers(df['Glucose_mg_dL'].dropna())
print(f"Found {len(glucose_outliers)} outliers in Glucose_mg_dL")
if len(glucose_outliers) > 0:
    print(f"Outlier values: {glucose_outliers.unique()}")

Age outliers (IQR method):
Found 31 outliers in Age
Outlier values: [200.]

Glucose outliers (IQR method):
Found 0 outliers in Glucose_mg_dL


## Step 4: Create a clean copy and remove PII

Privacy protection is paramount in healthcare data. **Personally Identifiable Information (PII)** includes any data that can directly or indirectly identify an individual. Common PII in healthcare includes:

- **Direct identifiers**: Patient names, email addresses, phone numbers, addresses
- **Semi-identifiers**: Patient IDs (can be kept if properly anonymized)
- **Sensitive dates**: Birth dates, exact diagnosis dates (often generalized)

Regulations like **HIPAA** (USA) and **GDPR** (Europe) require removing or anonymizing PII before data analysis or sharing.

You'll create a copy of the original data (to preserve the raw data) and remove PII columns.

In [14]:
# Create a working copy - keep original data untouched for reference
df_clean = df.copy()
print("Working copy created. Original data preserved.")

Working copy created. Original data preserved.


In [15]:
# Identify and remove PII columns
pii_columns = ['Patient_Name', 'EmailID']
print(f"Removing PII columns: {pii_columns}")

df_clean = df_clean.drop(columns=pii_columns, errors='ignore')

print("\nColumns after removing PII:")
print(df_clean.columns.tolist())
print(f"\nReduced from {len(df.columns)} to {len(df_clean.columns)} columns")

Removing PII columns: ['Patient_Name', 'EmailID']

Columns after removing PII:
['Patient_ID', 'Age', 'Gender', 'Ethnicity', 'Weight', 'Height_cm', 'Diagnosis_Date', 'Diagnosis_Code', 'Glucose_mg_dL', 'Risk']

Reduced from 12 to 10 columns


## Step 5: Remove duplicate rows

Duplicate records can occur due to data entry errors, system glitches, or merging datasets. They can:
- Bias analysis by overrepresenting certain patients
- Inflate dataset size artificially
- Cause data leakage in train-test splits

You'll identify and remove exact duplicate rows, keeping only the first occurrence.

In [16]:
# Count duplicates before removal
duplicates_before = df_clean.duplicated().sum()
rows_before = len(df_clean)

# Remove exact duplicate rows (keep first occurrence)
df_clean = df_clean.drop_duplicates(keep='first')

# Report results
rows_after = len(df_clean)
print(f"Rows before deduplication: {rows_before}")
print(f"Duplicate rows found: {duplicates_before}")
print(f"Rows after deduplication: {rows_after}")
print(f"Rows removed: {rows_before - rows_after}")

Rows before deduplication: 200
Duplicate rows found: 2
Rows after deduplication: 198
Rows removed: 2


## Step 6: Standardize inconsistent categorical variables

Inconsistent categorical data is common in healthcare due to:
- Multiple data entry personnel with different conventions
- Data merging from different systems
- Typos and abbreviations

You need to standardize:
- **Gender**: Convert M/Male/m → 'Male', F/Female/f → 'Female'
- **Ethnicity**: Standardize capitalization (asian/Asian/ASIAN → 'Asian')
- **Diagnosis_Code**: Fix typos and standardize (OCD/ocd → 'OCD', ANXITY → 'ANX')

This ensures categorical data is **clean, consistent, and machine-readable**.

In [17]:
# Standardize Gender column
print("Before standardization - Gender unique values:")
print(df_clean['Gender'].value_counts(dropna=False))

# Convert to lowercase and strip whitespace for consistent matching
df_clean['Gender'] = df_clean['Gender'].astype(str).str.strip().str.lower()

# Define mapping for known variations
gender_map = {
    'male': 'Male', 'm': 'Male',
    'female': 'Female', 'f': 'Female',
    'other': 'Other',
    'nan': np.nan, 'none': np.nan
}

# Apply mapping
df_clean['Gender'] = df_clean['Gender'].replace({'nan': np.nan})
df_clean['Gender'] = df_clean['Gender'].map(
    lambda x: gender_map.get(x, x.capitalize() if pd.notna(x) else x)
)

print("\nAfter standardization - Gender unique values:")
print(df_clean['Gender'].value_counts(dropna=False))

Before standardization - Gender unique values:
M         46
F         39
NaN       38
Female    35
Male      33
Other      7
Name: Gender, dtype: int64

After standardization - Gender unique values:
Male      79
Female    74
NaN       38
Other      7
Name: Gender, dtype: int64


In [18]:
# Standardize Ethnicity column
print("Before standardization - Ethnicity unique values:")
print(df_clean['Ethnicity'].value_counts(dropna=False))

# Standardize capitalization
df_clean['Ethnicity'] = df_clean['Ethnicity'].astype(str).str.strip()
df_clean['Ethnicity'] = df_clean['Ethnicity'].replace({'nan': np.nan})
df_clean['Ethnicity'] = df_clean['Ethnicity'].where(
    df_clean['Ethnicity'].isna(),
    df_clean['Ethnicity'].str.capitalize()
)

print("\nAfter standardization - Ethnicity unique values:")
print(df_clean['Ethnicity'].value_counts(dropna=False))

Before standardization - Ethnicity unique values:
Asian        41
NaN          35
asian        34
African      32
Hispanic     31
Caucasian    18
caucasian     7
Name: Ethnicity, dtype: int64

After standardization - Ethnicity unique values:
Asian        75
NaN          35
African      32
Hispanic     31
Caucasian    25
Name: Ethnicity, dtype: int64


In [19]:
# Standardize Diagnosis_Code column
print("Before standardization - Diagnosis_Code unique values:")
print(df_clean['Diagnosis_Code'].value_counts(dropna=False))

# Convert to uppercase and fix common typos
df_clean['Diagnosis_Code'] = df_clean['Diagnosis_Code'].astype(str).str.strip().str.upper()
df_clean['Diagnosis_Code'] = df_clean['Diagnosis_Code'].replace({
    'NAN': np.nan,
    'ANXITY': 'ANX',  # Fix typo
    'OCD.': 'OCD'      # Remove trailing period
})

print("\nAfter standardization - Diagnosis_Code unique values:")
print(df_clean['Diagnosis_Code'].value_counts(dropna=False))

Before standardization - Diagnosis_Code unique values:
OCD       40
DEP       39
ocd       35
NaN       32
ANX       31
ANXITY    21
Name: Diagnosis_Code, dtype: int64

After standardization - Diagnosis_Code unique values:
OCD    75
ANX    52
DEP    39
NaN    32
Name: Diagnosis_Code, dtype: int64


In [20]:
# Fill missing categorical values with 'Unknown'
df_clean['Gender'] = df_clean['Gender'].fillna('Unknown')
df_clean['Ethnicity'] = df_clean['Ethnicity'].fillna('Unknown')
df_clean['Diagnosis_Code'] = df_clean['Diagnosis_Code'].fillna('Unknown')

print("Missing categorical values filled with 'Unknown'")
print("\nFinal categorical value counts:")
print("\nGender:")
print(df_clean['Gender'].value_counts())
print("\nEthnicity:")
print(df_clean['Ethnicity'].value_counts())
print("\nDiagnosis_Code:")
print(df_clean['Diagnosis_Code'].value_counts())

Missing categorical values filled with 'Unknown'

Final categorical value counts:

Gender:
Male       79
Female     74
Unknown    38
Other       7
Name: Gender, dtype: int64

Ethnicity:
Asian        75
Unknown      35
African      32
Hispanic     31
Caucasian    25
Name: Ethnicity, dtype: int64

Diagnosis_Code:
OCD        75
ANX        52
DEP        39
Unknown    32
Name: Diagnosis_Code, dtype: int64


## Step 7: Normalize mixed units and engineer BMI feature

Healthcare data often contains mixed measurement units due to different countries or systems using different standards (metric vs imperial). You need to:

1. **Normalize Weight**: Convert all weights to kg (from mixed kg and lbs)
2. **Convert Height**: Convert cm to meters for BMI calculation
3. **Engineer BMI**: Body Mass Index is a clinically important derived feature

**BMI Formula**: BMI = Weight(kg) / Height(m)²

**BMI Categories**:
- Underweight: < 18.5
- Normal: 18.5 - 24.9
- Overweight: 25 - 29.9
- Obese: ≥ 30

In [21]:
# Function to convert weight to kg (handles both numeric kg and string 'lbs' format)
def weight_to_kg(x):
    """
    Convert weight to kilograms.
    Handles numeric values (assumed kg) and strings with 'lbs' suffix.

    Examples:
    70 -> 70.0 kg
    '150lbs' -> 68.04 kg
    '150 lbs' -> 68.04 kg
    """
    if pd.isna(x):
        return np.nan

    # If already numeric, assume it's in kg
    if isinstance(x, (int, float, np.integer, np.floating)):
        return float(x)

    # Handle string values
    s = str(x).strip().lower()

    # Check for lbs pattern (e.g., '150lbs' or '150 lbs')
    match = re.match(r'^\s*([0-9]+(?:\.[0-9]+)?)\s*lbs?\s*$', s)
    if match:
        lbs = float(match.group(1))
        return round(lbs * 0.45359237, 2)  # Convert lbs to kg

    # Try to parse as numeric (assume kg)
    try:
        return float(s)
    except:
        return np.nan

# Apply weight conversion
df_clean['Weight_kg'] = df_clean['Weight'].apply(weight_to_kg)

print("Weight conversion examples:")
display(df_clean[['Weight', 'Weight_kg']].head(10))

Weight conversion examples:


Unnamed: 0,Weight,Weight_kg
0,,
1,,
2,,
3,70,70.0
4,70,70.0
5,110,110.0
6,,
7,,
8,150lbs,68.04
9,150lbs,68.04


In [27]:
# Convert height from cm to meters
df_clean['Height_cm'] = pd.to_numeric(df_clean['Height_cm'], errors='coerce')
df_clean['Height_m'] = df_clean['Height_cm'] / 100.0

print("Height converted from cm to meters")

Height converted from cm to meters


In [28]:
# Calculate BMI (Body Mass Index)
df_clean['BMI'] = df_clean.apply(
    lambda row: round(row['Weight_kg'] / (row['Height_m'] ** 2), 2)
    if pd.notna(row['Weight_kg']) and pd.notna(row['Height_m']) and row['Height_m'] > 0
    else np.nan,
    axis=1
)

print("BMI calculated successfully")
print("\nSample of engineered features:")
display(df_clean[['Weight', 'Weight_kg', 'Height_cm', 'Height_m', 'BMI']].head(10))

BMI calculated successfully

Sample of engineered features:


Unnamed: 0,Weight,Weight_kg,Height_cm,Height_m,BMI
0,,,175.0,1.75,
1,,,165.0,1.65,
2,,,,,
3,70,70.0,160.0,1.6,27.34
4,70,70.0,180.0,1.8,21.6
5,110,110.0,180.0,1.8,33.95
6,,,175.0,1.75,
7,,,175.0,1.75,
8,150lbs,68.04,,,
9,150lbs,68.04,,,


## Step 8: Handle missing values with imputation

Missing values are inevitable in healthcare data. Common causes include:
- Tests not performed for all patients
- Data entry errors
- Equipment failures
- Patient privacy restrictions

### Why use Median instead of Mean?

For healthcare data, **median imputation** is often preferred over mean because:

1. **Robust to outliers**: Healthcare data often contains extreme values (very high glucose, unusual ages)
2. **Mean is sensitive**: A few extreme values can skew the mean significantly
3. **Median represents center**: The middle value of sorted data, unaffected by extremes
4. **Preserves distribution**: Better maintains the shape of skewed distributions
5. **Simple and fast**: Computationally efficient with no assumptions about distribution

You'll impute missing values in numeric columns (Age, Weight_kg, Height_cm, BMI, Glucose_mg_dL) using their respective medians.

In [29]:
# Check missing values before imputation
print("Missing values before imputation:")
numeric_cols = ['Age', 'Weight_kg', 'Height_cm', 'BMI', 'Glucose_mg_dL']
print(df_clean[numeric_cols].isnull().sum())

Missing values before imputation:
Age              28
Weight_kg        43
Height_cm        39
BMI              75
Glucose_mg_dL    42
dtype: int64


In [32]:
# Median imputation for numeric columns
for col in numeric_cols:
    median_val = df_clean[col].median()
    df_clean[col] = df_clean[col].fillna(median_val)
    print(f"Imputed {col} with median = {median_val}")

Imputed Age with median = 45.0
Imputed Weight_kg with median = 75.0
Imputed Height_cm with median = 165.0
Imputed BMI with median = 27.55
Imputed Glucose_mg_dL with median = 110.0


In [31]:
# Verify imputation
print("\nMissing values after imputation:")
print(df_clean[numeric_cols].isnull().sum())
print("\nAll numeric missing values successfully imputed!")


Missing values after imputation:
Age              0
Weight_kg        0
Height_cm        0
BMI              0
Glucose_mg_dL    0
dtype: int64

All numeric missing values successfully imputed!


## Step 9: Parse dates and engineer temporal features

Temporal features can be highly informative in healthcare:
- **Diagnosis year**: May reflect changes in diagnostic practices or disease prevalence
- **Time since diagnosis**: Important for understanding disease progression
- **Seasonal patterns**: Some conditions vary by time of year

You'll parse the inconsistent date formats and extract useful temporal features.

In [33]:
# Parse dates with mixed formats (YYYY-MM-DD and DD/MM/YYYY)
df_clean['Diagnosis_Date_parsed'] = pd.to_datetime(
    df_clean['Diagnosis_Date'],
    errors='coerce',  # Convert unparseable dates to NaT (Not a Time)
    dayfirst=True     # Assume day comes first in ambiguous formats
)

print("Date parsing results:")
print(f"Successfully parsed: {df_clean['Diagnosis_Date_parsed'].notna().sum()} dates")
print(f"Failed to parse: {df_clean['Diagnosis_Date_parsed'].isna().sum()} dates")

print("\nSample of original vs parsed dates:")
display(df_clean[['Diagnosis_Date', 'Diagnosis_Date_parsed']].head(10))

Date parsing results:
Successfully parsed: 188 dates
Failed to parse: 10 dates

Sample of original vs parsed dates:


Unnamed: 0,Diagnosis_Date,Diagnosis_Date_parsed
0,2021-10-21,2021-10-21
1,2020-05-20,2020-05-20
2,15/06/2021,2021-06-15
3,2016-10-19,2016-10-19
4,2016-02-09,2016-02-09
5,2024-05-13,2024-05-13
6,2017-12-31,2017-12-31
7,2022-12-15,2022-12-15
8,2024-03-04,2024-03-04
9,2022-10-10,2022-10-10


In [34]:
# Extract year from diagnosis date
df_clean['Diagnosis_Year'] = df_clean['Diagnosis_Date_parsed'].dt.year

# Calculate days since diagnosis (relative to most recent date in dataset)
ref_date = df_clean['Diagnosis_Date_parsed'].max()
if pd.isna(ref_date):
    ref_date = pd.to_datetime("today")

df_clean['Days_Since_Diagnosis'] = (
    ref_date - df_clean['Diagnosis_Date_parsed']
).dt.days

print(f"\nReference date for calculating time since diagnosis: {ref_date.date()}")
print("\nSample of engineered temporal features:")
display(df_clean[['Diagnosis_Date_parsed', 'Diagnosis_Year', 'Days_Since_Diagnosis']].head(10))


Reference date for calculating time since diagnosis: 2025-08-20

Sample of engineered temporal features:


Unnamed: 0,Diagnosis_Date_parsed,Diagnosis_Year,Days_Since_Diagnosis
0,2021-10-21,2021.0,1399.0
1,2020-05-20,2020.0,1918.0
2,2021-06-15,2021.0,1527.0
3,2016-10-19,2016.0,3227.0
4,2016-02-09,2016.0,3480.0
5,2024-05-13,2024.0,464.0
6,2017-12-31,2017.0,2789.0
7,2022-12-15,2022.0,979.0
8,2024-03-04,2024.0,534.0
9,2022-10-10,2022.0,1045.0


## Step 10: Encode categorical variables

Machine learning algorithms require numeric input. Categorical variables must be converted to numbers through **encoding**.

### One-hot encoding

One-hot encoding creates **binary (0/1) columns** for each category:

**Example**: If Diagnosis_Code has values ['DEP', 'OCD', 'ANX']
- Creates columns: `Diagnosis_Code_DEP`, `Diagnosis_Code_OCD`, `Diagnosis_Code_ANX`
- A patient with 'OCD' gets: [0, 1, 0]

**Why use one-hot encoding?**
- Treats all categories equally (no implicit ordering)
- Works with all ML algorithms
- Prevents models from assuming numerical relationships between categories

**Alternative**: Label Encoding (1, 2, 3...) should only be used for ordinal data with natural ordering.

You'll apply one-hot encoding to Gender, Ethnicity, and Diagnosis_Code.

In [35]:
# Apply one-hot encoding to categorical columns
print("Columns before encoding:")
print(df_clean.columns.tolist())
print(f"Total columns: {len(df_clean.columns)}")

df_final = pd.get_dummies(
    df_clean,
    columns=['Diagnosis_Code', 'Gender', 'Ethnicity'],
    drop_first=False  # Keep all columns (set True to drop one for linear models)
)

print("\nColumns after encoding:")
print(df_final.columns.tolist())
print(f"Total columns: {len(df_final.columns)}")
print(f"\nNew encoded columns created: {len(df_final.columns) - len(df_clean.columns)}")

Columns before encoding:
['Patient_ID', 'Age', 'Gender', 'Ethnicity', 'Weight', 'Height_cm', 'Diagnosis_Date', 'Diagnosis_Code', 'Glucose_mg_dL', 'Risk', 'Weight_kg', 'Height_m', 'BMI', 'Diagnosis_Date_parsed', 'Diagnosis_Year', 'Days_Since_Diagnosis']
Total columns: 16

Columns after encoding:
['Patient_ID', 'Age', 'Weight', 'Height_cm', 'Diagnosis_Date', 'Glucose_mg_dL', 'Risk', 'Weight_kg', 'Height_m', 'BMI', 'Diagnosis_Date_parsed', 'Diagnosis_Year', 'Days_Since_Diagnosis', 'Diagnosis_Code_ANX', 'Diagnosis_Code_DEP', 'Diagnosis_Code_OCD', 'Diagnosis_Code_Unknown', 'Gender_Female', 'Gender_Male', 'Gender_Other', 'Gender_Unknown', 'Ethnicity_African', 'Ethnicity_Asian', 'Ethnicity_Caucasian', 'Ethnicity_Hispanic', 'Ethnicity_Unknown']
Total columns: 26

New encoded columns created: 10


In [36]:
# Preview the encoded dataset
print("Sample of encoded data:")
display(df_final.head())

Sample of encoded data:


Unnamed: 0,Patient_ID,Age,Weight,Height_cm,Diagnosis_Date,Glucose_mg_dL,Risk,Weight_kg,Height_m,BMI,...,Diagnosis_Code_Unknown,Gender_Female,Gender_Male,Gender_Other,Gender_Unknown,Ethnicity_African,Ethnicity_Asian,Ethnicity_Caucasian,Ethnicity_Hispanic,Ethnicity_Unknown
0,P_new_63,45.0,,175.0,2021-10-21,140.0,0,75.0,1.75,27.55,...,1,0,0,0,1,0,1,0,0,0
1,P016,50.0,,165.0,2020-05-20,100.0,0,75.0,1.65,27.55,...,0,0,1,0,0,0,0,0,0,1
2,P006,25.0,,165.0,15/06/2021,110.0,0,75.0,,27.55,...,0,1,0,0,0,0,0,1,0,0
3,P_new_126,34.0,70.0,160.0,2016-10-19,110.0,1,70.0,1.6,27.34,...,0,0,1,0,0,0,1,0,0,0
4,P_new_96,25.0,70.0,180.0,2016-02-09,140.0,0,70.0,1.8,21.6,...,0,0,0,0,1,0,1,0,0,0


## Step 11: Scale numeric features

### Why scale features?

Many machine learning algorithms are sensitive to feature scale:
- **Example**: Age (range 0-100) vs Glucose (range 70-500)
- Without scaling, algorithms may give more importance to features with larger values
- Algorithms affected: Logistic Regression, SVM, KNN, Neural Networks, K-Means
- Algorithms NOT affected: Tree-based models (Decision Trees, Random Forest, XGBoost)

### StandardScaler (Z-score normalization)

**Formula**: z = (x - μ) / σ
- Transforms data to have **mean = 0** and **standard deviation = 1**
- **Best for**: Algorithms assuming normal distribution (Linear/Logistic Regression, SVM)
- **Range**: Typically between -3 and +3 (but unbounded)

You'll apply StandardScaler to all numeric features.

In [39]:
# Define numeric columns to scale
numeric_cols_to_scale = ['Age', 'Weight_kg', 'Height_cm', 'BMI', 'Glucose_mg_dL', 'Days_Since_Diagnosis']

# Filter to only existing columns
numeric_cols_existing = [col for col in numeric_cols_to_scale if col in df_final.columns]

print(f"Scaling {len(numeric_cols_existing)} numeric features:")
print(numeric_cols_existing)

Scaling 6 numeric features:
['Age', 'Weight_kg', 'Height_cm', 'BMI', 'Glucose_mg_dL', 'Days_Since_Diagnosis']


In [40]:
# Fill any remaining missing values with median before scaling
df_final[numeric_cols_existing] = df_final[numeric_cols_existing].fillna(
    df_final[numeric_cols_existing].median()
)

print("Verified no missing values before scaling:")
print(df_final[numeric_cols_existing].isna().sum())

Verified no missing values before scaling:
Age                     0
Weight_kg               0
Height_cm               0
BMI                     0
Glucose_mg_dL           0
Days_Since_Diagnosis    0
dtype: int64


In [41]:
# Apply StandardScaler
scaler = StandardScaler()
scaled_columns = [col + '_scaled' for col in numeric_cols_existing]
df_final[scaled_columns] = scaler.fit_transform(df_final[numeric_cols_existing])

print("Scaling complete!")
print("\nScaled features statistics (should have mean≈0, std≈1):")
display(df_final[scaled_columns].describe())

Scaling complete!

Scaled features statistics (should have mean≈0, std≈1):


Unnamed: 0,Age_scaled,Weight_kg_scaled,Height_cm_scaled,BMI_scaled,Glucose_mg_dL_scaled,Days_Since_Diagnosis_scaled
count,198.0,198.0,198.0,198.0,198.0,198.0
mean,1.278439e-16,-6.902447e-16,9.621933e-16,-5.517472e-16,2.607342e-17,-5.3829000000000005e-17
std,1.002535,1.002535,1.002535,1.002535,1.002535,1.002535
min,-0.7156602,-0.8994704,-1.154814,-1.558938,-0.6928224,-1.839041
25%,-0.5618235,-0.7654141,-0.4555688,-0.4181248,-0.5404692,-0.7716356
50%,-0.3738008,-0.4234338,-0.4555688,-0.2198113,-0.5404692,-0.07427804
75%,-0.1174063,0.6025072,0.9429215,0.1563709,-0.3576453,0.8161819
max,2.275609,1.970428,1.642167,2.932759,1.836241,1.912513


In [42]:
# Compare original vs scaled values
print("\nComparison of original vs scaled values:")
comparison_cols = ['Age', 'Age_scaled', 'Glucose_mg_dL', 'Glucose_mg_dL_scaled']
display(df_final[comparison_cols].head(10))


Comparison of original vs scaled values:


Unnamed: 0,Age,Age_scaled,Glucose_mg_dL,Glucose_mg_dL_scaled
0,45.0,-0.373801,140.0,-0.357645
1,50.0,-0.288336,100.0,-0.60141
2,25.0,-0.71566,110.0,-0.540469
3,34.0,-0.561823,110.0,-0.540469
4,25.0,-0.71566,140.0,-0.357645
5,25.0,-0.71566,110.0,-0.540469
6,45.0,-0.373801,90.0,-0.662352
7,60.0,-0.117406,110.0,-0.540469
8,45.0,-0.373801,140.0,-0.357645
9,45.0,-0.373801,90.0,-0.662352


## Step 12: Save the cleaned dataset

Now that you've completed all preprocessing steps, you'll save the cleaned dataset to a CSV file. This file is now ready for:
- Exploratory data analysis (EDA)
- Machine learning model training
- Statistical analysis
- Sharing with team members (with PII removed)

In [43]:
# Save cleaned dataset
output_path = "healthcare_cleaned_data.csv"
df_final.to_csv(output_path, index=False)

print(f"✓ Cleaned dataset saved to: {output_path}")
print(f"\nFinal dataset shape: {df_final.shape[0]} rows × {df_final.shape[1]} columns")
print(f"Original dataset shape: {df.shape[0]} rows × {df.shape[1]} columns")

✓ Cleaned dataset saved to: healthcare_cleaned_data.csv

Final dataset shape: 198 rows × 32 columns
Original dataset shape: 200 rows × 12 columns


In [44]:
# Display working directory
import os
print(f"\nFile saved in directory: {os.getcwd()}")


File saved in directory: C:\Users\sugne\Documents\HealthcareData


## Step 13: Evaluate data cleaning results

Let's compare the raw and cleaned datasets to verify preprocessing was successful.

In [45]:
# Compare column structures
print("="*60)
print("COLUMN COMPARISON")
print("="*60)
print(f"\nRaw data columns ({len(df.columns)}):")
print(df.columns.tolist())
print(f"\nCleaned data columns ({len(df_clean.columns)}):")
print(df_clean.columns.tolist())
print(f"\nFinal encoded data columns ({len(df_final.columns)}):")
print(df_final.columns.tolist())

COLUMN COMPARISON

Raw data columns (12):
['Patient_ID', 'Age', 'Gender', 'Ethnicity', 'Weight', 'Height_cm', 'Diagnosis_Date', 'Diagnosis_Code', 'Glucose_mg_dL', 'Risk', 'Patient_Name', 'EmailID']

Cleaned data columns (16):
['Patient_ID', 'Age', 'Gender', 'Ethnicity', 'Weight', 'Height_cm', 'Diagnosis_Date', 'Diagnosis_Code', 'Glucose_mg_dL', 'Risk', 'Weight_kg', 'Height_m', 'BMI', 'Diagnosis_Date_parsed', 'Diagnosis_Year', 'Days_Since_Diagnosis']

Final encoded data columns (32):
['Patient_ID', 'Age', 'Weight', 'Height_cm', 'Diagnosis_Date', 'Glucose_mg_dL', 'Risk', 'Weight_kg', 'Height_m', 'BMI', 'Diagnosis_Date_parsed', 'Diagnosis_Year', 'Days_Since_Diagnosis', 'Diagnosis_Code_ANX', 'Diagnosis_Code_DEP', 'Diagnosis_Code_OCD', 'Diagnosis_Code_Unknown', 'Gender_Female', 'Gender_Male', 'Gender_Other', 'Gender_Unknown', 'Ethnicity_African', 'Ethnicity_Asian', 'Ethnicity_Caucasian', 'Ethnicity_Hispanic', 'Ethnicity_Unknown', 'Age_scaled', 'Weight_kg_scaled', 'Height_cm_scaled', 'BMI_sc

In [38]:
# Compare missing values
print("\n" + "="*60)
print("MISSING VALUES COMPARISON")
print("="*60)
print("\nRaw data missing values:")
print(df.isna().sum())
print(f"\nTotal missing values in raw data: {df.isna().sum().sum()}")

print("\nCleaned data missing values:")
print(df_clean.isna().sum())
print(f"\nTotal missing values in cleaned data: {df_clean.isna().sum().sum()}")


MISSING VALUES COMPARISON

Raw data missing values:
Patient_ID         0
Age               28
Gender            38
Ethnicity         35
Weight            44
Height_cm         40
Diagnosis_Date    10
Diagnosis_Code    33
Glucose_mg_dL     42
Risk               0
Patient_Name       0
EmailID            0
dtype: int64

Total missing values in raw data: 270

Cleaned data missing values:
Patient_ID                0
Age                       0
Gender                    0
Ethnicity                 0
Weight                   43
Height_cm                 0
Diagnosis_Date           10
Diagnosis_Code            0
Glucose_mg_dL             0
Risk                      0
Weight_kg                 0
Height_m                 39
BMI                       0
Diagnosis_Date_parsed    10
Diagnosis_Year           10
Days_Since_Diagnosis     10
dtype: int64

Total missing values in cleaned data: 122


In [46]:
# Compare statistical summaries
print("\n" + "="*60)
print("STATISTICAL SUMMARY COMPARISON")
print("="*60)
print("\nRaw data summary:")
display(df.describe())

print("\nCleaned data summary:")
display(df_clean.describe())


STATISTICAL SUMMARY COMPARISON

Raw data summary:


Unnamed: 0,Age,Height_cm,Glucose_mg_dL,Risk
count,172.0,160.0,158.0,200.0
mean,70.087209,169.03125,220.981013,0.465
std,62.340026,7.780335,177.443324,0.500025
min,25.0,160.0,85.0,0.0
25%,34.0,160.0,90.0,0.0
50%,45.0,165.0,110.0,0.0
75%,60.0,175.0,500.0,1.0
max,200.0,180.0,500.0,1.0



Cleaned data summary:


Unnamed: 0,Age,Height_cm,Glucose_mg_dL,Risk,Weight_kg,Height_m,BMI,Diagnosis_Year,Days_Since_Diagnosis
count,198.0,198.0,198.0,198.0,198.0,159.0,198.0,188.0,188.0
mean,66.868687,168.257576,198.686869,0.469697,81.190909,1.690566,28.625152,2020.319149,1768.047872
std,58.651885,7.168694,164.508321,0.500346,14.657786,0.077983,4.903646,2.737002,987.00345
min,25.0,160.0,85.0,0.0,68.04,1.6,21.0,2015.0,0.0
25%,34.0,165.0,110.0,0.0,70.0,1.6,26.58,2018.0,957.25
50%,45.0,165.0,110.0,0.0,75.0,1.65,27.55,2020.0,1693.0
75%,60.0,175.0,140.0,1.0,90.0,1.75,29.39,2023.0,2576.25
max,200.0,180.0,500.0,1.0,110.0,1.8,42.97,2025.0,3599.0


In [47]:
# Compare categorical standardization (Gender example)
print("\n" + "="*60)
print("CATEGORICAL STANDARDIZATION - GENDER EXAMPLE")
print("="*60)
print("\nRaw Gender value counts:")
print(df['Gender'].value_counts(dropna=False))

print("\nCleaned Gender value counts:")
print(df_clean['Gender'].value_counts(dropna=False))

print(" Successfully standardized from 6 variations to 4 consistent categories!")


CATEGORICAL STANDARDIZATION - GENDER EXAMPLE

Raw Gender value counts:
M         46
F         40
NaN       38
Female    36
Male      33
Other      7
Name: Gender, dtype: int64

Cleaned Gender value counts:
Male       79
Female     74
Unknown    38
Other       7
Name: Gender, dtype: int64
 Successfully standardized from 6 variations to 4 consistent categories!




# Exercises

Now it's your turn! Apply what you've learned to a new synthetic healthcare dataset. The following exercises will test your understanding of the data preprocessing pipeline.

## Exercise 1: Load and prepare data

Load the `synthetic_data.csv` file into a DataFrame and create a clean working copy.

In [53]:
# your code goes here
# Load the raw healthcare data from CSV file
df = pd.read_csv("https://foundations-of-healthcare-data-analytics-4e579d.gitlab.io/labs/Cleaning_and_Validating_Healthcare_Data_Using_Python/synthetic_data.csv")

# Display the first few rows to get an initial sense of the data
print("First 5 rows of the dataset:")
df.head()

# Create a working copy
df_clean = df.copy()

First 5 rows of the dataset:


<details>
    <summary>Click here for a hint</summary>
    
Use the `read_csv()` function to load the data, then use `.copy()` to create a working copy. Reference **Step 1** for the exact syntax.

</details>

<details>
    <summary>Click here for solution</summary>

```python
# Load the synthetic healthcare data
df = pd.read_csv("https://foundations-of-healthcare-data-analytics-4e579d.gitlab.io/labs/Cleaning_and_Validating_Healthcare_Data_Using_Python/synthetic_data.csv")

# Create a working copy
df_clean = df.copy()

# Display column names to verify
print("Columns in dataset:")
print(df_clean.columns.tolist())
print(f"\nDataset loaded: {df_clean.shape[0]} rows × {df_clean.shape[1]} columns")
```

</details>

## Exercise 2: Remove personal data

Identify and remove all PII (Personally Identifiable Information) columns from the dataset. Common PII includes: Patient_ID, Name, Address, Phone, Email.

In [54]:
# your code goes here
# Identify and remove PII columns
pii_columns = ['Patient_Name', 'EmailID']
print(f"Removing PII columns: {pii_columns}")

df_clean = df_clean.drop(columns=pii_columns, errors='ignore')

print("\nColumns after removing PII:")
print(df_clean.columns.tolist())
print(f"\nReduced from {len(df.columns)} to {len(df_clean.columns)} columns")

Removing PII columns: ['Patient_Name', 'EmailID']

Columns after removing PII:
['Patient_ID', 'Age', 'Gender', 'Ethnicity', 'Weight', 'Height_cm', 'Diagnosis_Date', 'Diagnosis_Code', 'Glucose_mg_dL', 'Risk']

Reduced from 12 to 10 columns


<details>
    <summary>Click here for a hint</summary>
    
Use the `.drop()` method with `columns` parameter. Set `errors='ignore'` to avoid errors if a column doesn't exist. Reference **Step 4** for the syntax.

</details>

<details>
    <summary>Click here for solution</summary>

```python
# Define PII columns to remove
pii_cols = ['Patient_ID', 'Name', 'Address', 'Phone', 'Email']

# Remove PII columns (only if they exist)
df_clean = df_clean.drop(
    columns=[col for col in pii_cols if col in df_clean.columns],
    errors='ignore'
)

print("After removing PII columns:")
print(df_clean.columns.tolist())
print(f"\nColumns remaining: {len(df_clean.columns)}")
```

</details>

## Exercise 3: Drop duplicate rows

Check for and remove any duplicate rows in the dataset. Report how many duplicates were found and removed.

In [61]:
# your code goes here
# Check for duplicate rows
rows_before=len(df_clean)
print(f"rows before: {rows_before}")

# Remove exact duplicate rows (keep first occurrence)
df_clean = df_clean.drop_duplicates(keep='first')

# Count rows after deduplication
rows_after = len(df_clean)
print(f"rows after: {rows_after}")

rows_removed=rows_before-rows_after
print(f"rows removed :{rows_removed}")

rows before: 198
rows after: 198
rows removed :0


<details>
    <summary>Click here for a hint</summary>
    
Use `.drop_duplicates()` method with `keep='first'` parameter. Count rows before and after to see how many were removed. Reference **Step 5**.

</details>

<details>
    <summary>Click here for solution</summary>

```python
# Count rows before deduplication
rows_before = len(df_clean)

# Remove exact duplicate rows (keep first occurrence)
df_clean = df_clean.drop_duplicates(keep='first')

# Count rows after deduplication
rows_after = len(df_clean)

# Report results
print(f"Rows before deduplication: {rows_before}")
print(f"Rows after deduplication: {rows_after}")
print(f"Duplicate rows removed: {rows_before - rows_after}")
```

</details>

## Exercise 4: Handle missing values in numeric columns

Identify all numeric columns, check for missing values, and impute them using the median strategy. Verify that all missing values have been filled.

In [70]:
# your code goes here
# Check missing values before imputation

number_cols=df_clean.select_dtypes(include=[np.number]).columns.tolist()
print(number_cols)

print(df_clean[numeric_cols].isnull().sum())

# Convert to numeric (coerce invalid values to NaN)
for col in numeric_cols:
    df_clean[col] = pd.to_numeric(df_clean[col], errors='coerce')

# Median imputation
for col in numeric_cols:
    median_val = df_clean[col].median()
    df_clean[col] = df_clean[col].fillna(median_val)
    print(f"Imputed {col} with median = {median_val}")

# Verify
print("\nMissing values after imputation:")
print(df_clean[numeric_cols].isnull().sum())
print("\n✓ All numeric missing values imputed successfully!")


['Age', 'Height_cm', 'Glucose_mg_dL', 'Risk']
Age               28
Gender            38
Ethnicity         35
Weight            43
Height_cm         39
Diagnosis_Date    10
Diagnosis_Code    32
Glucose_mg_dL     42
Risk               0
dtype: int64
Imputed Age with median = 45.0
Imputed Gender with median = nan
Imputed Ethnicity with median = nan
Imputed Weight with median = 90.0
Imputed Height_cm with median = 165.0
Imputed Diagnosis_Date with median = nan
Imputed Diagnosis_Code with median = nan
Imputed Glucose_mg_dL with median = 110.0
Imputed Risk with median = 0.0

Missing values after imputation:
Age                 0
Gender            198
Ethnicity         198
Weight              0
Height_cm           0
Diagnosis_Date    198
Diagnosis_Code    198
Glucose_mg_dL       0
Risk                0
dtype: int64

✓ All numeric missing values imputed successfully!


<details>
    <summary>Click here for a hint</summary>
    
First, use `.select_dtypes(include=[np.number])` to get numeric columns. Then use `.median()` and `.fillna()` for each column. Reference **Step 8** for the complete approach.

</details>

<details>
    <summary>Click here for solution</summary>

```python
# Identify numeric columns
numeric_cols = df_clean.select_dtypes(include=[np.number]).columns.tolist()
print(f"Numeric columns found: {numeric_cols}")

# Check missing values before imputation
print("\nMissing values before imputation:")
print(df_clean[numeric_cols].isnull().sum())

# Convert to numeric (coerce invalid values to NaN)
for col in numeric_cols:
    df_clean[col] = pd.to_numeric(df_clean[col], errors='coerce')

# Median imputation
for col in numeric_cols:
    median_val = df_clean[col].median()
    df_clean[col] = df_clean[col].fillna(median_val)
    print(f"Imputed {col} with median = {median_val}")

# Verify
print("\nMissing values after imputation:")
print(df_clean[numeric_cols].isnull().sum())
print("\n✓ All numeric missing values imputed successfully!")
```

</details>

## Exercise 5: Standardize categorical variables

Standardize the inconsistent categorical entries in the `Gender` and `Disease_Type` columns. Print the unique values before and after standardization.

**Hint**: Gender variations might include: M/Male/m/F/Female/f  
**Hint**: Disease_Type variations might include: ckd/CKD, LD/ld/Liver Disease

In [73]:
# Compare categorical standardization (Gender example)
print("\n" + "="*60)
print("CATEGORICAL STANDARDIZATION - GENDER EXAMPLE")
print("="*60)
print("\nRaw Gender value counts:")
print(df['Gender'].value_counts(dropna=False))

print("\nCleaned Gender value counts:")
print(df_clean['Gender'].value_counts(dropna=False))


# Standardize Gender column
print("Before standardization - Gender unique values:")
print(df_clean['Gender'].value_counts(dropna=False))

# Convert to lowercase and strip whitespace for consistent matching
df_clean['Gender'] = df_clean['Gender'].astype(str).str.strip().str.lower()

# Define mapping for known variations
gender_map = {
    'male': 'Male', 'm': 'Male',
    'female': 'Female', 'f': 'Female',
    'other': 'Other',
    'nan': np.nan, 'none': np.nan
}

# Apply mapping
df_clean['Gender'] = df_clean['Gender'].replace({'nan': np.nan})
df_clean['Gender'] = df_clean['Gender'].map(
    lambda x: gender_map.get(x, x.capitalize() if pd.notna(x) else x)
)



print("\nAfter standardization - Gender unique values:")
print(df_clean['Gender'].value_counts(dropna=False))


CATEGORICAL STANDARDIZATION - GENDER EXAMPLE

Raw Gender value counts:
M         46
F         40
NaN       38
Female    36
Male      33
Other      7
Name: Gender, dtype: int64

Cleaned Gender value counts:
NaN    198
Name: Gender, dtype: int64
Before standardization - Gender unique values:
NaN    198
Name: Gender, dtype: int64

After standardization - Gender unique values:
NaN    198
Name: Gender, dtype: int64


1<details>
    <summary>Click here for a hint</summary>
    
Use `.str.lower()` and `.str.strip()` first, then create a mapping dictionary to standardize variations. Reference **Step 6** for the complete pattern.

</details>

<details>
    <summary>Click here for solution</summary>

```python
# Standardize Gender
print("Before standardization - Gender:")
print(df_clean['Gender'].value_counts(dropna=False))

df_clean['Gender'] = df_clean['Gender'].astype(str).str.strip().str.lower()
gender_map = {
    'male': 'Male', 'm': 'Male',
    'female': 'Female', 'f': 'Female',
    'other': 'Other',
    'nan': np.nan, 'none': np.nan
}
df_clean['Gender'] = df_clean['Gender'].replace({'nan': np.nan})
df_clean['Gender'] = df_clean['Gender'].map(
    lambda x: gender_map.get(x, x.capitalize() if pd.notna(x) else x)
)

print("\nAfter standardization - Gender:")
print(df_clean['Gender'].value_counts(dropna=False))


```

</details>

---

# Congratulations!

You have successfully completed this lab on healthcare data preprocessing! You've learned how to systematically clean messy real-world data by handling missing values, removing duplicates, standardizing inconsistent entries, engineering meaningful features, and preparing data for machine learning. These skills are essential for any data science project, especially in healthcare where data quality directly impacts patient outcomes and model reliability.

## Authors

[Ramesh Sannareddy](https://www.linkedin.com/in/rsannareddy/)

Copyright © 2025 SkillUp. All rights reserved.