# UCI Heart Disease Data Multi-source Parser

## Project Overview
This project provides a modular Python pipeline that extracts and standardizes clinical heart disease data from four international research sites:
* **Cleveland Clinic Foundation**
* **Hungarian Institute of Cardiology, Budapest**
* **University Hospital, Zurich, Switzerland**
* **VA Medical Center, Long Beach, CA**

The main problem it addresses is inconsistent formatting across sites: raw record lengths range from 76 to 90 attributes.

## Technical Implementation
The core logic is split into **single-responsibility functions** to keep it maintainable and easy to extend. Key features include:

* **Custom Stream Parsing:** Reads each raw `.data` file inside a `with open(...)` block and segments the token stream into records at the clinical `name` anchor.
* **Schema Standardization:** Validates every incoming record and slices it to the "Golden 76" attribute format, discarding the extra metadata columns found in 90-column datasets.
* **Data Integrity:** Automatically injects a `source_dataset` identifier as the final column of each record before concatenation to preserve data provenance without shifting clinical feature indices.
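
The stream-parsing step above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function name `split_records` is hypothetical, and it assumes tokens are whitespace-separated and that the literal token `name` is the last token of each record. An in-memory `io.StringIO` stands in for a file opened with `with open(...)`.

```python
import io


def split_records(stream, anchor="name"):
    """Yield one list of tokens per record (hypothetical helper).

    Assumes whitespace-separated tokens and that `anchor` closes each record.
    """
    record = []
    for line in stream:
        for token in line.split():
            record.append(token)
            if token == anchor:
                yield record
                record = []
    # Tokens left over without a closing anchor are treated as an
    # incomplete record and dropped.


# A toy stream: two short records, each spread over two physical lines.
sample = io.StringIO("1 63 1\n-9 0 name\n2 67 1 4\nname\n")
records = list(split_records(sample))
```

Because the function accepts any iterable of lines, the same call should work on a real file handle, for example `with open(path) as fh: records = list(split_records(fh))`.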


## Pipeline Flow
1. **Ingestion:** Reads raw text files from the UCI Machine Learning Repository.
2. **Validation:** Loops through records to audit their lengths, identifying and filtering out anomalies (records with fewer than 76 columns).
3. **Transformation:** Slices records to 76 attributes and appends the source site ID.
4. **Consolidation:** Vertically concatenates site-specific DataFrames into a single `heart_disease_master.csv`.
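
Steps 2 and 3 can be sketched with the standard library alone. The helper names `audit_lengths` and `standardize` are hypothetical, and the sketch assumes each record is already a list of tokens; the real `parse_clinical_site` may be organized differently.

```python
from collections import Counter

EXPECTED_WIDTH = 76  # the "Golden 76" attribute count


def audit_lengths(records):
    """Count how many records occur at each length (hypothetical helper)."""
    return Counter(len(record) for record in records)


def standardize(records, source, width=EXPECTED_WIDTH):
    """Drop short records, slice the rest to `width`, append the source label."""
    rows = []
    for record in records:
        if len(record) < width:
            continue  # anomaly: too short to hold every clinical attribute
        # The label goes last so clinical feature indices 0..width-1 are unchanged.
        rows.append(list(record[:width]) + [source])
    return rows


# Toy input: one 76-column record, one 90-column record, one truncated record.
raw = [["0"] * 76, ["0"] * 90, ["0"] * 12]
length_report = audit_lengths(raw)
rows = standardize(raw, source="cleveland")
```

Every surviving row here has 77 entries: the 76 clinical attributes followed by the `source_dataset` value. A list of such rows can then be turned into a DataFrame per site before the concatenation in step 4.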

## Quick Start
To generate the master dataset, ensure you have `pandas` and `numpy` installed, then run:

```python
from heart_parser import parse_clinical_site
import pandas as pd

# Raw data files, one per research site
files = ['cleveland.data', 'hungarian.data', 'switzerland.data', 'long-beach-va.data']

# Parse each site into its own DataFrame, then stack them vertically
all_dfs = [parse_clinical_site(f) for f in files]
master_df = pd.concat(all_dfs, ignore_index=True)
master_df.to_csv('heart_disease_master.csv', index=False)