# Project Requirements
## 01 - Data Ingestion
### Objective:
Ingest Synthea-generated Parquet data into the Bronze Delta layer on Databricks.

### Criteria:
- Input source: dbfs:/mnt/raw/synthea/
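
A minimal sketch of the ingestion step, under several assumptions the requirements do not state: each Synthea table sits in its own subdirectory of the input source, Bronze tables live under `dbfs:/mnt/bronze`, loads are appended, and an `_ingested_at` audit column is wanted. `bronze_path` and `ingest_table` are hypothetical helper names.

```python
RAW_ROOT = "dbfs:/mnt/raw/synthea/"  # input source from the criteria
BRONZE_ROOT = "dbfs:/mnt/bronze"     # assumption: no Bronze path is specified


def bronze_path(table_name: str, bronze_root: str = BRONZE_ROOT) -> str:
    """Build the Delta location for one Bronze table."""
    return f"{bronze_root.rstrip('/')}/{table_name}"


def ingest_table(spark, table_name: str, raw_root: str = RAW_ROOT) -> None:
    """Read one Parquet dataset and append it to its Bronze Delta table."""
    # Imported lazily so the path helper can be used without Spark installed.
    from pyspark.sql import functions as F

    df = (
        spark.read.parquet(f"{raw_root.rstrip('/')}/{table_name}")
        # Assumed audit column recording when the rows were loaded.
        .withColumn("_ingested_at", F.current_timestamp())
    )
    df.write.format("delta").mode("append").save(bronze_path(table_name))
```

On a Databricks cluster this could be called as `ingest_table(spark, "patients")`, once per source table.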

## 02 - Data Normalization (Silver Layer)
### Objective:
Transform Bronze data into normalized, schema-enforced Silver tables.

### Criteria:
- Deduplicate by patient_id, encounter_id
- Normalize gender values (M/F/Unknown)
- Enforce data types (DateType for date-only columns, TimestampType for timestamps)
- Null handling: replace NULL with N/A for categorical columns and with 0 for numeric columns
- Schema validation before write
- Output path: dbfs:/mnt/silver/table_name (one path per Silver table)
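
The criteria above could be sketched roughly as follows. This is an illustration, not a definitive implementation: the accepted raw gender spellings, the helper names (`normalize_gender`, `schema_errors`, `to_silver`), and the idea of expressing the expected schema as a column-to-type-name mapping are all assumptions. Type casts are left out for brevity.

```python
from typing import Dict, List, Optional

# Assumed raw spellings; extend as the source data requires.
_GENDER_MAP = {"m": "M", "male": "M", "f": "F", "female": "F"}


def normalize_gender(value: Optional[str]) -> str:
    """Map a raw gender value to M, F, or Unknown."""
    if value is None:
        return "Unknown"
    return _GENDER_MAP.get(value.strip().lower(), "Unknown")


def schema_errors(actual: Dict[str, str], expected: Dict[str, str]) -> List[str]:
    """Compare column -> type-name mappings, e.g. dict(df.dtypes)."""
    errors = []
    for column, expected_type in expected.items():
        if column not in actual:
            errors.append(f"missing column: {column}")
        elif actual[column] != expected_type:
            errors.append(f"{column}: expected {expected_type}, got {actual[column]}")
    return errors


def to_silver(bronze_df, key_cols, categorical_cols, numeric_cols, expected_schema):
    """Deduplicate, normalize gender, fill nulls, then validate before any write."""
    from pyspark.sql import functions as F  # lazy import; available on Databricks

    df = bronze_df.dropDuplicates(key_cols)
    if "gender" in df.columns:
        raw = F.lower(F.trim(F.col("gender")))
        df = df.withColumn(
            "gender",
            F.when(raw.isin("m", "male"), "M")
            .when(raw.isin("f", "female"), "F")
            .otherwise("Unknown"),  # also covers NULL
        )
    if categorical_cols:
        df = df.fillna("N/A", subset=categorical_cols)
    if numeric_cols:
        df = df.fillna(0, subset=numeric_cols)

    errors = schema_errors(dict(df.dtypes), expected_schema)
    if errors:
        raise ValueError("; ".join(errors))
    return df
```

Raising before the write keeps a table that fails validation from reaching the Silver output path. The pure-Python `normalize_gender` mirrors the Spark expression and is convenient for unit tests.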

## 03 - Data Aggregation (Gold Layer)

### Objective:
Create analytics-ready datasets and metrics.

### Criteria:
- Joins between Silver tables (patients, conditions, encounters, observations)
- Metrics:
  - encounters_per_disease
  - avg_stay_duration
  - readmission_rate
  - disease_prevalence_by_age_group
- Write to Delta with Z-order indexing on condition_code and encounter_date
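
As a rough sketch of one metric and the Z-order step: the join key `encounter_id`, the age bands, the overwrite mode, and the helper names below are assumptions, since the requirements do not define them. `OPTIMIZE ... ZORDER BY` is the Databricks SQL command for Z-ordering a Delta table.

```python
from typing import List

# Assumed age bands; the requirements do not define the groups.
AGE_BANDS = [(0, 17, "0-17"), (18, 39, "18-39"), (40, 64, "40-64")]


def age_group(age: int) -> str:
    """Bucket an age in whole years for disease_prevalence_by_age_group."""
    if age < 0:
        raise ValueError("age must be non-negative")
    for low, high, label in AGE_BANDS:
        if low <= age <= high:
            return label
    return "65+"


def zorder_sql(path: str, columns: List[str]) -> str:
    """Build the OPTIMIZE statement for a path-based Delta table."""
    return f"OPTIMIZE delta.`{path}` ZORDER BY ({', '.join(columns)})"


def encounters_per_disease(conditions_df, encounters_df):
    """Count distinct encounters per condition code."""
    from pyspark.sql import functions as F  # lazy import; available on Databricks

    return (
        conditions_df.join(encounters_df, on="encounter_id", how="inner")
        .groupBy("condition_code")
        .agg(F.countDistinct("encounter_id").alias("encounter_count"))
    )


def write_gold(spark, df, path: str, zorder_cols: List[str]) -> None:
    """Write a Gold table, then Z-order it on columns present in that table."""
    df.write.format("delta").mode("overwrite").save(path)
    spark.sql(zorder_sql(path, zorder_cols))
```

Z-ordering only works on columns that exist in the written table, so `zorder_cols` would be `["condition_code", "encounter_date"]` for tables that carry both.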



## 04 - ML Feature Store

### Objective:
Create a curated dataset for ML modeling and register it in MLflow.

### Criteria:
- Merge features from Gold tables into patient_feature_store
- Include derived metrics (avg_BMI, condition_count, medication_count)
- Version-controlled Delta table (/mnt/ml/features/v1)
- Track training runs in MLflow
- Model metrics: AUC ≥ 0.80 for baseline logistic regression
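
One way the tracking and the AUC criterion might fit together, assuming scikit-learn for the baseline model and a binary target (the requirements name neither). The run name, the `meets_baseline` tag, and the function names are hypothetical.

```python
AUC_THRESHOLD = 0.80  # from the criteria


def meets_baseline(auc: float, threshold: float = AUC_THRESHOLD) -> bool:
    """Return True when the model satisfies the AUC >= 0.80 requirement."""
    return auc >= threshold


def train_baseline(X_train, y_train, X_test, y_test,
                   feature_path: str = "/mnt/ml/features/v1") -> float:
    """Fit a baseline logistic regression and record the run in MLflow."""
    # Lazy imports so the threshold check is usable without ML libraries.
    import mlflow
    import mlflow.sklearn
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score

    with mlflow.start_run(run_name="baseline_logistic_regression"):
        model = LogisticRegression(max_iter=1000)
        model.fit(X_train, y_train)
        # AUC is computed from the predicted probability of the positive class.
        auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

        mlflow.log_param("feature_table", feature_path)
        mlflow.log_metric("auc", auc)
        mlflow.set_tag("meets_baseline", str(meets_baseline(auc)))
        mlflow.sklearn.log_model(model, "model")
    return auc
```

Logging the feature table path with each run ties a model back to the feature version it was trained on.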


## 05 - Dashboard Layer

### Objective:
Expose Gold tables to the analytics layer for visualization.

### Criteria:
- Connect Databricks SQL endpoint to Power BI
- Create base views:
  - vw_patient_journey
  - vw_disease_trends
  - vw_quality_metrics
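
The base views could be generated from a small mapping, sketched below. The Gold table names on the right-hand side are placeholders, because the requirements do not say which tables back each view. A real view would likely select specific columns rather than `SELECT *`.

```python
from typing import Dict, List

# View names come from the criteria; source table names are hypothetical.
BASE_VIEWS: Dict[str, str] = {
    "vw_patient_journey": "gold.patient_journey",
    "vw_disease_trends": "gold.disease_trends",
    "vw_quality_metrics": "gold.quality_metrics",
}


def create_view_statements(views: Dict[str, str] = BASE_VIEWS) -> List[str]:
    """Build one CREATE OR REPLACE VIEW statement per base view."""
    return [
        f"CREATE OR REPLACE VIEW {view} AS SELECT * FROM {table}"
        for view, table in views.items()
    ]
```

On Databricks each statement could be run with `spark.sql(stmt)`, after which Power BI reads the views through the SQL endpoint.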


Pipeline order: Ingestion → Silver → Gold → ML → Dashboard refresh
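
The stage order above can be sketched as a plain sequential runner. This is only an illustration of the dependency order; the stage keys and `run_pipeline` are hypothetical names, and a real deployment might use a Databricks job with task dependencies instead.

```python
from typing import Callable, Dict, List

STAGE_ORDER = ["ingestion", "silver", "gold", "ml", "dashboard_refresh"]


def run_pipeline(stages: Dict[str, Callable[[], None]]) -> List[str]:
    """Run every stage in order; an exception stops all downstream stages."""
    missing = [name for name in STAGE_ORDER if name not in stages]
    if missing:
        raise KeyError(f"missing stages: {missing}")

    completed = []
    for name in STAGE_ORDER:
        stages[name]()
        completed.append(name)
    return completed
```

Because each stage reads the output of the previous one, failing fast keeps a broken Silver load from propagating into Gold tables, features, or dashboards.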



