In [1]:
import os
import pandas as pd

## PRA_HIST Subset Cleaning

1. **Load the raw subset**  
   Read in the extracted `pra_hist_subset.csv`, which contains all candidate-level PRA history records.

2. **Drop unneeded columns**  
   We drop three PRA fields that either have high missingness or are useful only when grouping:
   - **CANHX_ALLOC_PRA** (Allocation PRA; 60.3% missing)  
   - **CANHX_CUR_PRA** (Current PRA; 57.9% missing)  
   - **CANHX_SRTR_PEAK_PRA** (Peak PRA; 57.7% missing)

3. **Retain the core PRA features**  
   Keep the fields that have low missingness or are of primary importance:
   - **PX_ID** – Patient identifier (key)  
   - **WL_ORG** – Organ type  
   - **CANHX_BEGIN_DT** – Date the PRA record was last changed  
   - **CANHX_CPRA** – Calculated PRA (raw fraction; only 3.4% missing)

4. **Save the cleaned dataset**  
   Write out to `clean_subsets_ver1/pra_hist_clean.csv` for downstream analysis.
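
The missingness percentages quoted in step 2 can be reproduced with a per-column null rate. The following is a minimal sketch, not the code that produced the figures above: the helper name `missing_report` and the toy frame are hypothetical, and the toy values are made up purely to show the calculation.

```python
import pandas as pd


def missing_report(df: pd.DataFrame) -> pd.Series:
    """Return the percentage of missing values per column, highest first."""
    # isna() marks missing cells; the column mean of those booleans
    # is the fraction missing, which we scale to a percentage.
    return (df.isna().mean() * 100).round(1).sort_values(ascending=False)


# Toy frame standing in for the real subset (values are invented).
toy = pd.DataFrame(
    {
        "PX_ID": [1, 2, 3, 4],
        "CANHX_CPRA": [0.10, None, 0.55, 0.0],
        "CANHX_CUR_PRA": [None, None, 12.0, None],
    }
)

report = missing_report(toy)
print(report)
```

Running `missing_report(pra)` on the loaded subset before dropping anything would give the figures to compare against the list in step 2.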


In [3]:
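# Input and output locations; create the output folder if it does not exist.
SUBSET_FOLDER = "/Users/chanyoungwoo/Thesis/Data_Extraction/extracted_subsets"
CLEAN_FOLDER = "/Users/chanyoungwoo/Thesis/Data_Extraction/clean_subsets_ver1"
os.makedirs(CLEAN_FOLDER, exist_ok=True)

# Load the raw PRA history subset.
in_path = os.path.join(SUBSET_FOLDER, "pra_hist_subset.csv")
pra = pd.read_csv(in_path)

# PRA fields with high missingness (see the notes above).
to_drop = [
    "CANHX_ALLOC_PRA",
    "CANHX_CUR_PRA",
    "CANHX_SRTR_PEAK_PRA",
]
# drop() raises a KeyError if any listed column is absent from the file.
pra_clean = pra.drop(columns=to_drop)

# Write the cleaned subset without the pandas index column.
out_path = os.path.join(CLEAN_FOLDER, "pra_hist_clean.csv")
pra_clean.to_csv(out_path, index=False)

print(f"Saved cleaned PRA_HIST to {out_path} (shape {pra_clean.shape})")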

Saved cleaned PRA_HIST to /Users/chanyoungwoo/Thesis/Data_Extraction/clean_subsets_ver1/pra_hist_clean.csv (shape (4425413, 4))
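
The four columns in the reported shape match the four core fields listed in step 3. One way to make that expectation explicit is to select by a keep-list instead of a drop-list, so that an unexpected extra column in the raw file cannot slip through. This is only a sketch of that alternative: `KEEP` and `select_core` are hypothetical names, and the toy frame is invented for illustration.

```python
import pandas as pd

# Core fields from step 3 of the notes above.
KEEP = ["PX_ID", "WL_ORG", "CANHX_BEGIN_DT", "CANHX_CPRA"]


def select_core(df: pd.DataFrame, keep=KEEP) -> pd.DataFrame:
    """Return only the columns in `keep`, failing loudly if any are absent."""
    missing = [c for c in keep if c not in df.columns]
    if missing:
        raise KeyError(f"Expected columns not found: {missing}")
    # copy() so later edits do not touch the original frame.
    return df.loc[:, keep].copy()


# Toy frame with one droppable column (values are invented).
toy = pd.DataFrame(
    {
        "PX_ID": [1, 2],
        "WL_ORG": ["KI", "LI"],
        "CANHX_BEGIN_DT": ["2010-01-01", "2011-06-15"],
        "CANHX_CPRA": [0.2, None],
        "CANHX_CUR_PRA": [None, 5.0],
    }
)

core = select_core(toy)
print(core.shape)
```

On the real data, `select_core(pra)` should yield the same columns as `pra.drop(columns=to_drop)`, assuming the raw subset holds only the seven fields named in this notebook.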
