# 🧠 2_eda_analysis.ipynb: Exploratory Data Analysis

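This notebook loads the ten cleaned healthcare tables (one fact table and nine dimension tables) and runs quick checks on their structure, missing values, and category distributions.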

**Key Sections:**
- Load cleaned CSVs
- Summary stats (shape, dtypes, describe)
- Null value checks
- Value counts (e.g. gender, state, diagnosis groups)


In [1]:
import pandas as pd
import os

# Load cleaned data from output folder
data_dir = "../data/outputs/"

files = [
    "cleaned_FactTable.csv",
    "cleaned_DimPatient.csv",
    "cleaned_DimPhysician.csv",
    "cleaned_DimSpeciality.csv",
    "cleaned_DimHospital.csv",
    "cleaned_DimPayer.csv",
    "cleaned_DimCptCode.csv",
    "cleaned_DimDiagnosisCode.csv",
    "cleaned_DimDate.csv",
    "cleaned_DimTransaction.csv"
]

dfs = {}
for file in files:
    name = file.replace("cleaned_", "").replace(".csv", "")
    dfs[name] = pd.read_csv(os.path.join(data_dir, file))
    print(f"✅ Loaded: {file} → {name}, Shape: {dfs[name].shape}")


✅ Loaded: cleaned_FactTable.csv → FactTable, Shape: (83999, 18)
✅ Loaded: cleaned_DimPatient.csv → DimPatient, Shape: (5117, 22)
✅ Loaded: cleaned_DimPhysician.csv → DimPhysician, Shape: (932, 5)
✅ Loaded: cleaned_DimSpeciality.csv → DimSpeciality, Shape: (34, 4)
✅ Loaded: cleaned_DimHospital.csv → DimHospital, Shape: (11, 2)
✅ Loaded: cleaned_DimPayer.csv → DimPayer, Shape: (4, 2)
✅ Loaded: cleaned_DimCptCode.csv → DimCptCode, Shape: (1256, 4)
✅ Loaded: cleaned_DimDiagnosisCode.csv → DimDiagnosisCode, Shape: (4801, 4)
✅ Loaded: cleaned_DimDate.csv → DimDate, Shape: (383, 7)
✅ Loaded: cleaned_DimTransaction.csv → DimTransaction, Shape: (955, 4)
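
The loading loop above stops with a `FileNotFoundError` at the first CSV that is absent. A minimal sketch of a more forgiving loader follows, assuming it is acceptable to skip missing files and report them afterwards. The helper name `load_cleaned_tables` and the toy `DimPayer` contents are hypothetical and are not part of the notebook.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_cleaned_tables(data_dir, files):
    """Read every listed CSV that exists; return (tables, missing_files)."""
    tables, missing = {}, []
    for file in files:
        path = Path(data_dir) / file
        if not path.is_file():
            missing.append(file)  # remember the gap instead of raising
            continue
        # "cleaned_DimPayer.csv" -> "DimPayer"
        name = Path(file).stem.replace("cleaned_", "", 1)
        tables[name] = pd.read_csv(path)
    return tables, missing


# Demo on a throwaway directory with one toy file (invented data)
with tempfile.TemporaryDirectory() as tmp:
    toy = pd.DataFrame({"dimPayerPK": [1, 2]})
    toy.to_csv(Path(tmp) / "cleaned_DimPayer.csv", index=False)
    tables, missing = load_cleaned_tables(
        tmp, ["cleaned_DimPayer.csv", "cleaned_DimDate.csv"]
    )

print(list(tables), missing)
```

In the notebook this would be called as `load_cleaned_tables(data_dir, files)`, and a non-empty `missing` list is a cue to rerun the cleaning step.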


In [2]:
# Example: Summary of FactTable
df = dfs['FactTable']
df.info()  # info() prints its report itself and returns None, so no print() is needed
print(df.describe())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 83999 entries, 0 to 83998
Data columns (total 18 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   FactTablePK         83999 non-null  int64  
 1   Check_Dimension     83999 non-null  int64  
 2   dimPatientPK        83999 non-null  int64  
 3   dimPhysicianPK      83999 non-null  int64  
 4   dimDateServicePK    50706 non-null  object 
 5   dimDatePostPK       48031 non-null  object 
 6   dimCPTCodePK        83999 non-null  int64  
 7   dimPayerPK          83999 non-null  int64  
 8   dimTransactionPK    83999 non-null  int64  
 9   dimLocationPK       83999 non-null  int64  
 10  PatientNumber       83999 non-null  int64  
 11  dimDiagnosisCodePK  83999 non-null  int64  
 12  CPTUnits            83999 non-null  int64  
 13  Gross_Expenses      83999 non-null  float64
 14  Adjustment          83999 non-null  int64  
 15  Insurance_Payment   83999 non-null  int64  
 16  Pati
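
Most `FactTable` columns are integer surrogate keys, so `describe()` spends much of its output on means and quartiles of identifiers. One possible refinement, sketched here on a toy frame with invented values, is to drop key columns before summarising. Treating every column whose name ends in `PK` as a key is an assumption. Identifier-like columns without that suffix (for example `PatientNumber` or `Check_Dimension`) would need to be dropped by name.

```python
import pandas as pd

# Toy stand-in for dfs['FactTable']; column names come from the notebook,
# the values are invented for illustration.
fact = pd.DataFrame({
    "FactTablePK": [1, 2, 3],
    "dimPatientPK": [10, 11, 10],
    "CPTUnits": [1, 2, 1],
    "Gross_Expenses": [100.0, 250.5, 80.0],
    "Insurance_Payment": [60, 200, 0],
})

# Assumption: columns ending in "PK" are keys, not measures
key_cols = [c for c in fact.columns if c.endswith("PK")]
measures = fact.drop(columns=key_cols).select_dtypes(include="number")

# Transpose so each measure is one row of the summary
summary = measures.describe().T
print(summary)
```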

In [3]:
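# Null counts per column for every loaded table
# (the loop variable is named `table` so it does not overwrite `df` from the previous cell)
for name, table in dfs.items():
    print(f"📊 {name} → Nulls:\n{table.isnull().sum()}\n")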

📊 FactTable → Nulls:
FactTablePK               0
Check_Dimension           0
dimPatientPK              0
dimPhysicianPK            0
dimDateServicePK      33293
dimDatePostPK         35968
dimCPTCodePK              0
dimPayerPK                0
dimTransactionPK          0
dimLocationPK             0
PatientNumber             0
dimDiagnosisCodePK        0
CPTUnits                  0
Gross_Expenses            0
Adjustment                0
Insurance_Payment         0
Patient_Payment           0
AR                        0
dtype: int64

📊 DimPatient → Nulls:
dimPatientPK           0
PatientNumber          0
FirstName              0
LastName               0
Email                  0
PatientGender          0
PatientAge             0
PatientHeightin_cms    0
Year_of_Birth          0
Month                  0
Day                    0
BloodGroup             0
Tobacco                0
Alcohol                0
Exercise               0
Diet                   0
Ethinicity             0
Zip_Codes     
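
In the `FactTable` output above, only `dimDateServicePK` (33293 of 83999 rows, about 39.6%) and `dimDatePostPK` (35968 rows, about 42.8%) contain nulls. Printing every column of every table buries that signal, so a compact summary that lists only the affected columns may be easier to scan. The sketch below uses a hypothetical helper, `null_summary`, and toy tables with invented values.

```python
import pandas as pd


def null_summary(tables):
    """Return one row per column that has at least one null."""
    rows = []
    for name, frame in tables.items():
        counts = frame.isnull().sum()
        for col, n in counts[counts > 0].items():
            rows.append({
                "table": name,
                "column": col,
                "nulls": int(n),
                "pct": round(100 * float(n) / len(frame), 1),
            })
    return pd.DataFrame(rows, columns=["table", "column", "nulls", "pct"])


# Toy tables (invented values) standing in for the notebook's `dfs`
toy_tables = {
    "FactTable": pd.DataFrame({
        "FactTablePK": [1, 2, 3, 4],
        "dimDateServicePK": ["2019-01-01", None, "2019-01-03", "2019-01-04"],
    }),
    "DimPayer": pd.DataFrame({"dimPayerPK": [1, 2]}),
}

report = null_summary(toy_tables)
print(report)
```

Calling `null_summary(dfs)` in the notebook would produce the same kind of table for the real data.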

In [4]:
# Gender distribution
print(dfs['DimPatient']['PatientGender'].value_counts())


PatientGender
Female    3006
Male      2111
Name: count, dtype: int64
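
Raw counts are easier to compare as shares. The sketch below rebuilds a series with the same counts as the output above (3006 Female, 2111 Male) and adds percentages with `value_counts(normalize=True)`. In the notebook itself the series would simply be `dfs['DimPatient']['PatientGender']`.

```python
import pandas as pd

# Rebuild a series with the counts printed above
gender = pd.Series(["Female"] * 3006 + ["Male"] * 2111, name="PatientGender")

counts = gender.value_counts()
# normalize=True returns fractions; convert to percentages with one decimal
percent = gender.value_counts(normalize=True).mul(100).round(1)

distribution = pd.DataFrame({"count": counts, "percent": percent})
print(distribution)
```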


In [5]:
# Top 10 diagnosis groups, counted over rows of DimDiagnosisCode (not over FactTable rows)
if 'DiagnosisCodeGroup' in dfs['DimDiagnosisCode'].columns:
    print(dfs['DimDiagnosisCode']['DiagnosisCodeGroup'].value_counts().head(10))

DiagnosisCodeGroup
Persons encountering health services for examination and investigation    413
Hypertensive diseases                                                     246
Symptoms and signs involving the circulatory and respiratory systems      232
Other forms of heart disease                                              227
Diabetes mellitus                                                         167
Ischaemic heart diseases                                                  144
Symptoms and signs involving the digestive system and \tabdomen           141
General symptoms and signs                                                141
Dorsopathies: Deforming dorsopathies                                      116
Acute upper respiratory infections                                         83
Name: count, dtype: int64
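
Two things stand out in this output. First, one label prints with `\t`, which suggests a stray tab character in the source data. Second, the counts are over rows of `DimDiagnosisCode`, so they say nothing about how often each group appears in `FactTable`. The sketch below addresses both on toy data with invented values. It assumes `DimDiagnosisCode` has a `dimDiagnosisCodePK` column matching the one in `FactTable`, which the notebook output does not confirm.

```python
import pandas as pd

# Toy stand-ins (invented values). Assumption: both tables share the key
# column dimDiagnosisCodePK.
dim_dx = pd.DataFrame({
    "dimDiagnosisCodePK": [1, 2, 3],
    "DiagnosisCodeGroup": [
        "Diabetes mellitus",
        "Symptoms and signs involving the digestive system and \tabdomen",
        "Diabetes mellitus",
    ],
})
fact = pd.DataFrame({"dimDiagnosisCodePK": [1, 1, 2, 3, 3, 3]})

# Collapse any run of whitespace (including tabs) to a single space
dim_dx["DiagnosisCodeGroup"] = (
    dim_dx["DiagnosisCodeGroup"]
    .str.replace(r"\s+", " ", regex=True)
    .str.strip()
)

# Count fact rows per group instead of dimension rows per group
rows_per_group = (
    fact.merge(dim_dx, on="dimDiagnosisCodePK", how="left")["DiagnosisCodeGroup"]
    .value_counts()
)
print(rows_per_group)
```

A left merge keeps fact rows whose key has no match in the dimension. Those rows get a null group, which `value_counts()` leaves out by default.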
