# hospital data curation project
## phase 2: data profiling and quality assessment

this notebook generates comprehensive profiling reports using ydata-profiling and sweetviz to identify data quality issues, missing values, and anomalies.

In [14]:
# import required libraries
import sys
import os
from pathlib import Path

# add src directory to python path
notebook_dir = Path(os.getcwd())
src_dir = notebook_dir / 'src'
if str(src_dir) not in sys.path:
    sys.path.insert(0, str(src_dir))

import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

from ydata_profiling import ProfileReport
import sweetviz as sv

# import project modules
import config
import data_loader
import utils

# use imported modules
RAW_DATA_DIR = config.RAW_DATA_DIR
CLEANED_DATA_DIR = config.CLEANED_DATA_DIR
LOGS_DIR = config.LOGS_DIR
DATA_FILES = config.DATA_FILES
PROFILING_DIR = config.PROFILING_DIR
SWEETVIZ_DIR = config.SWEETVIZ_DIR
DataLoader = data_loader.DataLoader
setup_logging = utils.setup_logging
print_section_header = utils.print_section_header
generate_report_timestamp = utils.generate_report_timestamp

## 1. load datasets

In [15]:
# setup logging
logger = setup_logging()

# load all datasets
loader = DataLoader(data_dir=RAW_DATA_DIR)
datasets = loader.load_all_datasets()

print_section_header("data profiling initialization")
print(f"datasets loaded: {list(datasets.keys())}")
print(f"profiling reports directory: {PROFILING_DIR}")
print(f"sweetviz reports directory: {SWEETVIZ_DIR}")

2025-11-10 17:43:48,729 - utils - INFO - successfully loaded patients.csv: 3000 rows, 7 columns
2025-11-10 17:43:48,729 - utils - INFO - loaded patients: 3000 rows
2025-11-10 17:43:48,745 - utils - INFO - successfully loaded visits.csv: 5000 rows, 7 columns
2025-11-10 17:43:48,747 - utils - INFO - loaded visits: 5000 rows
2025-11-10 17:43:48,764 - utils - INFO - successfully loaded diagnoses.csv: 8000 rows, 4 columns
2025-11-10 17:43:48,767 - utils - INFO - loaded diagnoses: 8000 rows
2025-11-10 17:43:48,781 - utils - INFO - successfully loaded medications.csv: 6000 rows, 6 columns
2025-11-10 17:43:48,783 - utils - INFO - loaded medications: 6000 rows
2025-11-10 17:43:48,787 - utils - INFO - successfully loaded staff.csv: 500 rows, 5 columns
2025-11-10 17:43:48,789 - utils - INFO - loaded staff: 500 rows
2025-11-10 17:43:48,792 - utils - INFO - successfully loaded hospital_info.csv: 20 rows, 5 columns
2025-11


                         data profiling initialization                          

datasets loaded: ['patients', 'visits', 'diagnoses', 'medications', 'staff', 'hospital_info']
profiling reports directory: d:\Github Desktop\Python\Hospital Data Curation\reports\profiling
sweetviz reports directory: d:\Github Desktop\Python\Hospital Data Curation\reports\sweetviz


## 2. generate ydata profiling reports

comprehensive profiling reports for each dataset identifying:
- missing values and nulls
- duplicate records
- data types and distributions
- correlations
- outliers and anomalies
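
as a lightweight cross-check on the items listed above, the same headline metrics can be computed directly with pandas. this is a minimal sketch on made-up data; `column_quality_summary` and the sample columns are hypothetical names, not part of the project modules.

```python
import pandas as pd


def column_quality_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, missing count, unique count, missing percent."""
    n_rows = len(df)
    summary = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing": df.isnull().sum(),
        "unique": df.nunique(dropna=True),
    })
    # guard against division by zero on an empty frame
    if n_rows:
        summary["missing_pct"] = (summary["missing"] / n_rows * 100).round(2)
    else:
        summary["missing_pct"] = 0.0
    return summary


# hypothetical sample data, for illustration only
sample = pd.DataFrame({
    "patient_id": ["P1", "P2", "P3", "P3"],
    "age": [34, None, 51, 51],
})
summary = column_quality_summary(sample)
duplicate_rows = int(sample.duplicated().sum())
print(summary)
print(f"duplicate rows: {duplicate_rows}")
```

numbers from a helper like this should agree with the overview tab of the html report for the same dataset; a mismatch would suggest the report was built from a different frame.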

In [16]:
# generate ydata profiling reports for each dataset
timestamp = generate_report_timestamp()

print_section_header("generating ydata profiling reports")

for dataset_name, df in datasets.items():
    print(f"\ngenerating profile for {dataset_name}...")
    
    # create profile report
    profile = ProfileReport(
        df,
        title=f"hospital data profiling: {dataset_name}",
        explorative=True,
        minimal=False
    )
    
    # save report
    report_file = PROFILING_DIR / f"{dataset_name}_profile_{timestamp}.html"
    profile.to_file(report_file)
    
    print(f"  ✓ saved: {report_file}")
    
    # display key statistics
    print(f"  - rows: {len(df)}")
    print(f"  - columns: {len(df.columns)}")
    print(f"  - missing cells: {df.isnull().sum().sum()}")
    print(f"  - duplicate rows: {df.duplicated().sum()}")

print("\n✓ all ydata profiling reports generated successfully")



                       generating ydata profiling reports                       


generating profile for patients...


  ✓ saved: d:\Github Desktop\Python\Hospital Data Curation\reports\profiling\patients_profile_20251110_174348.html
  - rows: 3000
  - columns: 7
  - missing cells: 0
  - duplicate rows: 0

generating profile for visits...


  ✓ saved: d:\Github Desktop\Python\Hospital Data Curation\reports\profiling\visits_profile_20251110_174348.html
  - rows: 5000
  - columns: 7
  - missing cells: 0
  - duplicate rows: 0

generating profile for diagnoses...


  ✓ saved: d:\Github Desktop\Python\Hospital Data Curation\reports\profiling\diagnoses_profile_20251110_174348.html
  - rows: 8000
  - columns: 4
  - missing cells: 0
  - duplicate rows: 0

generating profile for medications...


  ✓ saved: d:\Github Desktop\Python\Hospital Data Curation\reports\profiling\medications_profile_20251110_174348.html
  - rows: 6000
  - columns: 6
  - missing cells: 0
  - duplicate rows: 0

generating profile for staff...


  ✓ saved: d:\Github Desktop\Python\Hospital Data Curation\reports\profiling\staff_profile_20251110_174348.html
  - rows: 500
  - columns: 5
  - missing cells: 0
  - duplicate rows: 0

generating profile for hospital_info...


  ✓ saved: d:\Github Desktop\Python\Hospital Data Curation\reports\profiling\hospital_info_profile_20251110_174348.html
  - rows: 20
  - columns: 5
  - missing cells: 0
  - duplicate rows: 0

✓ all ydata profiling reports generated successfully


## 3. generate sweetviz reports

interactive visualizations for data exploration and comparison.

In [17]:
# generate sweetviz reports for each dataset
print_section_header("generating sweetviz reports")

for dataset_name, df in datasets.items():
    print(f"\ngenerating sweetviz report for {dataset_name}...")
    
    # create sweetviz analysis
    analysis = sv.analyze(df, target_feat=None)
    
    # save report
    report_file = SWEETVIZ_DIR / f"{dataset_name}_sweetviz_{timestamp}.html"
    analysis.show_html(str(report_file), open_browser=False)
    
    print(f"  ✓ saved: {report_file}")

print("\n✓ all sweetviz reports generated successfully")


                          generating sweetviz reports                           


generating sweetviz report for patients...


Report d:\Github Desktop\Python\Hospital Data Curation\reports\sweetviz\patients_sweetviz_20251110_174348.html was generated.
  ✓ saved: d:\Github Desktop\Python\Hospital Data Curation\reports\sweetviz\patients_sweetviz_20251110_174348.html

generating sweetviz report for visits...


Report d:\Github Desktop\Python\Hospital Data Curation\reports\sweetviz\visits_sweetviz_20251110_174348.html was generated.
  ✓ saved: d:\Github Desktop\Python\Hospital Data Curation\reports\sweetviz\visits_sweetviz_20251110_174348.html

generating sweetviz report for diagnoses...


Report d:\Github Desktop\Python\Hospital Data Curation\reports\sweetviz\diagnoses_sweetviz_20251110_174348.html was generated.
  ✓ saved: d:\Github Desktop\Python\Hospital Data Curation\reports\sweetviz\diagnoses_sweetviz_20251110_174348.html

generating sweetviz report for medications...


Report d:\Github Desktop\Python\Hospital Data Curation\reports\sweetviz\medications_sweetviz_20251110_174348.html was generated.
  ✓ saved: d:\Github Desktop\Python\Hospital Data Curation\reports\sweetviz\medications_sweetviz_20251110_174348.html

generating sweetviz report for staff...


Report d:\Github Desktop\Python\Hospital Data Curation\reports\sweetviz\staff_sweetviz_20251110_174348.html was generated.
  ✓ saved: d:\Github Desktop\Python\Hospital Data Curation\reports\sweetviz\staff_sweetviz_20251110_174348.html

generating sweetviz report for hospital_info...


Report d:\Github Desktop\Python\Hospital Data Curation\reports\sweetviz\hospital_info_sweetviz_20251110_174348.html was generated.
  ✓ saved: d:\Github Desktop\Python\Hospital Data Curation\reports\sweetviz\hospital_info_sweetviz_20251110_174348.html

✓ all sweetviz reports generated successfully


## 4. identify data quality issues

summary of key issues found across all datasets.
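
the outlier check in this section flags values outside the 1.5 × iqr fences. as a standalone sketch of that rule, the helper below counts flagged values in a single series; `count_iqr_outliers` and the sample values are hypothetical and used only for illustration.

```python
import pandas as pd


def count_iqr_outliers(values: pd.Series, k: float = 1.5) -> int:
    """Count values below q1 - k*iqr or above q3 + k*iqr (NaN is never counted)."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return int(((values < lower) | (values > upper)).sum())


# hypothetical length-of-stay values in days, for illustration only
stay_days = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 5, 60])
n_outliers = count_iqr_outliers(stay_days)
print(f"potential outliers: {n_outliers}")
```

the rule is a screening heuristic: a flagged value is a candidate for review, not proof of an error, and skewed columns can produce many flags on valid data.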

In [18]:
# analyze data quality issues
print_section_header("data quality issues summary")


for dataset_name, df in datasets.items():
    print(f"\n{dataset_name.upper()} DATASET:")
    print("-" * 60)
    
    # missing values analysis
    missing_cols = df.columns[df.isnull().any()].tolist()
    if missing_cols:
        print(f"\n✗ columns with missing values ({len(missing_cols)}):")
        for col in missing_cols:
            missing_count = df[col].isnull().sum()
            missing_pct = (missing_count / len(df)) * 100
            print(f"  - {col}: {missing_count} ({missing_pct:.2f}%)")
    else:
        print("\n✓ no missing values")
    
    # duplicates
    dup_count = df.duplicated().sum()
    if dup_count > 0:
        print(f"\n✗ duplicate rows: {dup_count}")
    else:
        print("\n✓ no duplicate rows")
    
    # high cardinality columns
    high_card_cols = []
    for col in df.select_dtypes(include=['object']).columns:
        unique_ratio = df[col].nunique() / len(df)
        if unique_ratio > 0.8:
            high_card_cols.append((col, df[col].nunique()))
    
    if high_card_cols:
        print("\n⚠ high cardinality columns:")
        for col, unique_count in high_card_cols:
            print(f"  - {col}: {unique_count} unique values")
    
    # numeric columns with potential outliers (1.5 * iqr rule)
    numeric_cols = df.select_dtypes(include='number').columns
    outlier_cols = []
    for col in numeric_cols:
        q1 = df[col].quantile(0.25)
        q3 = df[col].quantile(0.75)
        iqr = q3 - q1
        outliers = ((df[col] < (q1 - 1.5 * iqr)) | (df[col] > (q3 + 1.5 * iqr))).sum()
        if outliers > 0:
            outlier_cols.append((col, outliers))
    
    # print the header only when at least one column has flagged values
    if outlier_cols:
        print("\n⚠ numeric columns with potential outliers:")
        for col, outliers in outlier_cols:
            print(f"  - {col}: {outliers} potential outliers")
    
    print("\n" + "=" * 60)


                          data quality issues summary                           


PATIENTS DATASET:
------------------------------------------------------------

✓ no missing values

✓ no duplicate rows

⚠ high cardinality columns:
  - patient_id: 3000 unique values
  - name: 2945 unique values
  - dob: 2876 unique values
  - contact_number: 3000 unique values
  - email: 2996 unique values
  - address: 3000 unique values


VISITS DATASET:
------------------------------------------------------------

✓ no missing values

✓ no duplicate rows

⚠ high cardinality columns:
  - visit_id: 5000 unique values


DIAGNOSES DATASET:
------------------------------------------------------------

✓ no missing values

✓ no duplicate rows

⚠ high cardinality columns:
  - diagnosis_id: 8000 unique values
  - description: 7999 unique values


MEDICATIONS DATASET:
------------------------------------------------------------

✓ no missing values

✓ no duplicate rows

⚠ high cardinality columns:
  - med_i

## 5. dataset-specific validations

In [19]:
# perform dataset-specific checks
print_section_header("dataset-specific validations")

# patients dataset
if 'patients' in datasets:
    patients_df = datasets['patients']
    print("\nPATIENTS DATASET CHECKS:")
    
    # check age distribution
    if 'age' in patients_df.columns:
        print(f"  age range: {patients_df['age'].min()} - {patients_df['age'].max()}")
        print(f"  mean age: {patients_df['age'].mean():.1f}")
    
    # check gender distribution
    if 'gender' in patients_df.columns:
        print(f"\n  gender distribution:")
        print(patients_df['gender'].value_counts().to_string())

# visits dataset
if 'visits' in datasets:
    visits_df = datasets['visits']
    print("\n\nVISITS DATASET CHECKS:")
    
    # check date columns (parse into local variables so the loaded frame is not modified)
    if 'admission_date' in visits_df.columns and 'discharge_date' in visits_df.columns:
        admission = pd.to_datetime(visits_df['admission_date'], errors='coerce')
        discharge = pd.to_datetime(visits_df['discharge_date'], errors='coerce')
        
        # invalid date sequences (rows where either date failed to parse are not counted)
        invalid_dates = (discharge < admission).sum()
        print(f"  ✗ records with discharge before admission: {invalid_dates}")
        
        # date range
        print(f"  admission date range: {admission.min()} to {admission.max()}")

# diagnoses dataset
if 'diagnoses' in datasets:
    diagnoses_df = datasets['diagnoses']
    print("\n\nDIAGNOSES DATASET CHECKS:")
    
    if 'icd_code' in diagnoses_df.columns:
        # icd code prefix check: letter, digit, then letter or digit
        # (only the first three characters are tested, not the full code)
        icd_pattern = r'^[A-Z][0-9][0-9A-Z]'
        valid_codes = diagnoses_df['icd_code'].astype(str).str.match(icd_pattern).sum()
        total_codes = len(diagnoses_df)
        print(f"  valid icd-10 format: {valid_codes}/{total_codes} ({valid_codes/total_codes*100:.1f}%)")
        print(f"  unique diagnoses: {diagnoses_df['icd_code'].nunique()}")

# medications dataset
if 'medications' in datasets:
    medications_df = datasets['medications']
    print("\n\nMEDICATIONS DATASET CHECKS:")
    
    if 'medication_name' in medications_df.columns:
        print(f"  unique medications: {medications_df['medication_name'].nunique()}")
        print(f"  top 5 prescribed medications:")
        print(medications_df['medication_name'].value_counts().head().to_string())


                          dataset-specific validations                          


PATIENTS DATASET CHECKS:

  gender distribution:
gender
M         518
Male      514
Female    500
O         499
Other     494
F         475
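
the distribution above uses six labels for what look like three categories. a minimal sketch of collapsing them, assuming `M`/`Male`, `F`/`Female`, and `O`/`Other` are meant as synonyms; `GENDER_MAP` and `normalize_gender` are hypothetical names, and the counts are copied from the output above.

```python
import pandas as pd

# assumed mapping from the observed labels to three canonical values
GENDER_MAP = {
    "m": "Male", "male": "Male",
    "f": "Female", "female": "Female",
    "o": "Other", "other": "Other",
}


def normalize_gender(values: pd.Series) -> pd.Series:
    """Map label variants to canonical values; unmapped labels become NaN."""
    cleaned = values.astype(str).str.strip().str.lower()
    return cleaned.map(GENDER_MAP)


# label counts as reported in the gender distribution output
observed = pd.DataFrame({
    "label": ["M", "Male", "Female", "O", "Other", "F"],
    "count": [518, 514, 500, 499, 494, 475],
})
observed["canonical"] = normalize_gender(observed["label"])
merged = observed.groupby("canonical")["count"].sum()
print(merged.to_string())
```

leaving unmapped labels as NaN rather than guessing keeps unexpected values visible for review in the cleaning phase.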


VISITS DATASET CHECKS:
  ✗ records with discharge before admission: 1254
  admission date range: 2023-07-31 00:00:00 to 2025-07-29 00:00:00
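
because the dates are parsed with `errors='coerce'`, values that fail to parse become `NaT` and drop out of the comparison silently. the sketch below reports them separately from reversed sequences; `check_date_sequence` and the sample rows are hypothetical.

```python
import pandas as pd


def check_date_sequence(df: pd.DataFrame, start_col: str, end_col: str) -> dict:
    """Count missing/unparseable dates and rows whose end date precedes the start date."""
    start = pd.to_datetime(df[start_col], errors="coerce")
    end = pd.to_datetime(df[end_col], errors="coerce")
    return {
        # NaT after parsing covers both originally missing and unparseable values
        "invalid_start": int(start.isna().sum()),
        "invalid_end": int(end.isna().sum()),
        # comparisons involving NaT evaluate to False, so only fully parsed rows count
        "end_before_start": int((end < start).sum()),
    }


# hypothetical sample rows, for illustration only
visits_sample = pd.DataFrame({
    "admission_date": ["2024-01-10", "2024-02-01", "not a date", "2024-03-05"],
    "discharge_date": ["2024-01-12", "2024-01-28", "2024-02-20", None],
})
result = check_date_sequence(visits_sample, "admission_date", "discharge_date")
print(result)
```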


DIAGNOSES DATASET CHECKS:
  valid icd-10 format: 8000/8000 (100.0%)
  unique diagnoses: 3711
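
the 100% figure above comes from a check on the first three characters only. a stricter sketch follows, assuming the codes are meant to follow the icd-10-cm layout (letter, digit, alphanumeric, then an optional decimal part of one to four alphanumerics); `is_icd10_shaped` is a hypothetical helper and it tests shape only, not whether a code exists in the classification.

```python
import re

# assumed icd-10-cm shape: letter, digit, alphanumeric, then an optional
# decimal part of one to four alphanumerics; the dot itself is optional so
# undotted exports also pass
ICD10_CM_PATTERN = re.compile(r"[A-Z][0-9][0-9A-Z](?:\.?[0-9A-Z]{1,4})?")


def is_icd10_shaped(code: str) -> bool:
    """Return True when the whole string matches the assumed icd-10-cm shape."""
    return ICD10_CM_PATTERN.fullmatch(code.strip().upper()) is not None


# hypothetical example codes, for illustration only
examples = ["E11.9", "S72.001A", "J45", "A0", "E11.9-extra", "123.4"]
results = {code: is_icd10_shaped(code) for code in examples}
print(results)
```

using `fullmatch` rather than a prefix match rejects strings with trailing characters that a prefix-only check accepts.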


MEDICATIONS DATASET CHECKS:
  unique medications: 5
  top 5 prescribed medications:
medication_name
Ibuprofen       1271
Amoxicillin     1199
Metformin       1198
Paracetamol     1177
Atorvastatin    1155


## summary

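key findings from profiling:
1. html reports were generated for all six datasets with both ydata-profiling and sweetviz
2. no missing cells or duplicate rows were found in any dataset
3. 1254 of 5000 visit records have a discharge date earlier than the admission date
4. `gender` in patients uses six labels for three categories (M/Male, F/Female, O/Other)
5. all 8000 icd codes pass the three-character prefix check, which does not validate the full code
6. these issues are the starting point for the data cleaning phase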

reports location:
- ydata profiling: `reports/profiling/`
- sweetviz: `reports/sweetviz/`