# 🧼 Healthcare Data Cleaning Notebook

This notebook performs basic data cleaning operations on the raw healthcare CSV files.

**Steps:**
- Load all raw data from `data/datasets/`
- Clean column names (trim whitespace, replace spaces and dashes with underscores, remove parentheses)
- Handle missing values (fill with `'NA'`)
- Drop duplicates
- Export cleaned files to `data/outputs/`
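The duplicate and missing-value steps above can be sketched on a small in-memory frame. This is a minimal illustration only: the column names and values are hypothetical and not taken from the real files. It also shows one thing worth knowing about the `'NA'` placeholder: `pd.read_csv` treats the string `NA` as missing by default, so the placeholder loads as `NaN` again unless `keep_default_na=False` is passed.

```python
import io
import pandas as pd

# Hypothetical toy data (not from the real CSV files).
raw = pd.DataFrame({
    "Patient ID": [1, 1, 2],
    "City": ["Austin", "Austin", None],
})

# Same two operations the notebook applies: drop exact duplicates, fill nulls.
cleaned = raw.drop_duplicates().fillna("NA")

# Round-trip through CSV text to see how the placeholder is read back.
csv_text = cleaned.to_csv(index=False)
default_read = pd.read_csv(io.StringIO(csv_text))
literal_read = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)

print(default_read["City"].isna().sum())  # count of values parsed as missing
print(literal_read["City"].tolist())      # placeholder kept as a literal string
```

If downstream tools need the literal text `NA`, reading with `keep_default_na=False` (or choosing a different placeholder) is one option.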

In [2]:
import pandas as pd
import os

input_dir = "../data/datasets/"
output_dir = "../data/outputs/"
os.makedirs(output_dir, exist_ok=True)

## 📄 Step 1: Define list of raw CSV files to clean

In [3]:
files = [
    "FactTable.csv",
    "DimPatient.csv",
    "DimPhysician.csv",
    "DimSpeciality.csv",
    "DimHospital.csv",
    "DimPayer.csv",
    "DimCptCode.csv",
    "DimDiagnosisCode.csv",
    "DimDate.csv",
    "DimTransaction.csv"
]

## 🧽 Step 2: Define a function to clean column names
- Strips leading and trailing whitespace
- Replaces spaces and dashes with underscores
- Removes parentheses

In [4]:
def clean_column(col):
    return col.strip().replace(" ", "_").replace("(", "").replace(")", "").replace("-", "_")
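As a quick check of the rules above, the function can be run on a few sample headers. The header strings here are hypothetical, chosen only to exercise each replacement; `clean_column` is repeated so the snippet runs on its own.

```python
def clean_column(col):
    # Same logic as the notebook cell, repeated so this example is self-contained.
    return col.strip().replace(" ", "_").replace("(", "").replace(")", "").replace("-", "_")

# Hypothetical raw headers: padded whitespace, parentheses, and a dash.
samples = [" Patient Name ", "Gross Charge (USD)", "Co-Pay"]
cleaned = [clean_column(s) for s in samples]
print(cleaned)  # ['Patient_Name', 'Gross_Charge_USD', 'Co_Pay']
```

Other special characters (for example `/` or `.`) are left untouched by this function.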

## 🧹 Step 3: Loop through each file
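- Read, clean column names, drop duplicates, fill nulls with `'NA'`
- Normalize date columns in `DimDate.csv` and `FactTable.csv` to `YYYY-MM-DD`
- Replace Excel error strings (`#NUM!`, `#DIV/0!`) in `FactTable.csv` with `0`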
- Save cleaned file to `data/outputs/`

In [15]:
for file in files:
    try:
        df = pd.read_csv(os.path.join(input_dir, file))
        df.columns = [clean_column(c) for c in df.columns]

        # Drop exact duplicate rows, then fill remaining nulls with the "NA" placeholder
        df.drop_duplicates(inplace=True)
        df.fillna("NA", inplace=True)

        # ✅ DimDate: parse day-first date columns and rewrite them as YYYY-MM-DD strings
        if file == "DimDate.csv":
            for col in df.columns:
                if "date" in col.lower():
                    try:
                        df[col] = pd.to_datetime(df[col], dayfirst=True).dt.strftime('%Y-%m-%d')
                    except Exception as e:
                        print(f"Date conversion failed for {col}: {e}")
            # Recompute Month from the normalized Date column
            if "Month" in df.columns:
                try:
                    df["Month"] = pd.to_datetime(df["Date"]).dt.month
                except Exception as e:
                    print(f"Month conversion failed: {e}")

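        # ✅ FactTable: replace Excel error strings with 0, then normalize date columns
        # (errors='coerce' turns unparseable values, including "NA", into missing values)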
        if file == "FactTable.csv":
            df.replace({"#NUM!": 0, "#DIV/0!": 0}, inplace=True)
            for col in df.columns:
                if "date" in col.lower():
                    try:
                        df[col] = pd.to_datetime(df[col], errors='coerce', dayfirst=True).dt.strftime('%Y-%m-%d')
                    except Exception as e:
                        print(f"❌ Failed to convert {col}: {e}")


        # ✅ Save after conversion
        cleaned_filename = "cleaned_" + file
        df.to_csv(os.path.join(output_dir, cleaned_filename), index=False)
        print(f"✅ Cleaned and saved: {cleaned_filename}")

    except Exception as e:
        print(f"❌ Error processing {file}: {e}")

✅ Cleaned and saved: cleaned_FactTable.csv
✅ Cleaned and saved: cleaned_DimPatient.csv
✅ Cleaned and saved: cleaned_DimPhysician.csv
✅ Cleaned and saved: cleaned_DimSpeciality.csv
✅ Cleaned and saved: cleaned_DimHospital.csv
✅ Cleaned and saved: cleaned_DimPayer.csv
✅ Cleaned and saved: cleaned_DimCptCode.csv
✅ Cleaned and saved: cleaned_DimDiagnosisCode.csv
✅ Cleaned and saved: cleaned_DimDate.csv
✅ Cleaned and saved: cleaned_DimTransaction.csv
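Because nulls are filled with `"NA"` before the date columns are parsed, it helps to see what the `FactTable.csv` date conversion does to that placeholder. The sketch below uses a hypothetical series of day-first date strings; it is an illustration of `pd.to_datetime` with `errors='coerce'` and `dayfirst=True`, not a sample of the real data.

```python
import pandas as pd

# Hypothetical day-first date strings plus the "NA" placeholder written by fillna.
s = pd.Series(["25/12/2023", "01/02/2023", "NA"])

# Same call pattern as the FactTable branch of the loop.
parsed = pd.to_datetime(s, errors="coerce", dayfirst=True)
formatted = parsed.dt.strftime("%Y-%m-%d")

print(formatted.tolist())
print(formatted.isna().sum())  # number of values that could not be parsed
```

The placeholder cannot be parsed as a date, so it comes back as a missing value and is written to the cleaned CSV as an empty field. If the `"NA"` text should survive in date columns, one option is to call `fillna("NA")` again after the conversion.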
