In [2]:
import os
import pandas as pd

**Drop (remove) the following features**

- REC_COD3 & REC_COD2: 99% Missing value, the contributory cause of death is not important for our analysis

- REC_ACUTE_REJ_BIOPSY_CONFIRMED: 99% Missing value, feature not important for our analysis

- REC_AVN: 97% Missing value, holds no information

- DON_HAPLO_TY_MATCH: 78.7% Missing value, holds no information

- REC_ANTIVRL_THERAPY_TY: 69% Missing value, use REC_ANTIVRL_THERAPY instead for simpler true/false data

- REC_DIAL_DT: 18% Missing value, but the date on which the recipient started dialysis holds no important information

- DON_HIST_CIGARETTE_GT20_PKYR: 39% Missing value, the donor's cigarette consumption is considered irrelevant

- DON_HIST_COCAINE: 36% Missing value, the donor's cocaine use is considered irrelevant

- ORG_TY: 0.0% Missing value, already covered by CAND_KIPA

- ORG_AR: 0.0% Missing value, already covered by CAND_KIPA
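
The missing-value percentages quoted in this list can be recomputed with a few lines of pandas. The following is a minimal sketch, not the code that produced the numbers above: `missing_report` is a hypothetical helper name, and the toy frame only stands in for the real TX_KI subset.

```python
import pandas as pd


def missing_report(df: pd.DataFrame) -> pd.Series:
    """Return the percentage of missing values per column, highest first."""
    return (df.isna().mean() * 100).round(1).sort_values(ascending=False)


# Toy frame standing in for the TX_KI subset (values are made up).
toy = pd.DataFrame({
    "REC_COD3": [None, None, None, 998.0],
    "ORG_TY": ["KI", "KI", "KI", "KI"],
})

report = missing_report(toy)
print(report)
```

Applied to the real table, `missing_report(tx).head(20)` would list the columns with the highest share of missing values first.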



**Features that are potentially redundant but are retained for now**

- PERS_RETX & PERS_RETX_TRR_ID & PERS_RELIST: 91% Missing value, to evaluate: is the relisting important?

- CAN_LAST_CUR_PRA & CAN_LAST_ALLOC_PR: 67% Missing value, check whether these features are redundant with PRA_HIST

- REC_PRA_MOST_RECENT: 42% Missing value, check whether this feature is redundant with PRA_HIST

- CAN_LAST_SRTR_PEAK_PRA: 39% Missing value, check whether this feature is redundant with PRA_HIST

- REC_BMI: 10% Missing value, redundant with the BMI in CAND_KIPA?

1. Features that are potentially redundant with features in CAND_KIPA:

- CAN_TOT_ALBUMIN: 30% Missing value
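
One way to approach these redundancy checks is to compare two candidate columns row by row: how often are both filled, and how often do they agree? The sketch below is only an illustration under assumptions: `compare_columns` is a hypothetical helper, the toy values are invented, and comparing against PRA_HIST or CAND_KIPA would first require merging those tables, which is not shown here.

```python
import pandas as pd


def compare_columns(df: pd.DataFrame, a: str, b: str) -> dict:
    """Summarise the overlap and agreement of two columns."""
    both = df[[a, b]].dropna()  # rows where both columns are filled
    return {
        "rows_with_both": len(both),
        "only_a": int((df[a].notna() & df[b].isna()).sum()),
        "only_b": int((df[a].isna() & df[b].notna()).sum()),
        # Share of identical values among rows where both are present
        "share_equal": float((both[a] == both[b]).mean()) if len(both) else float("nan"),
    }


# Toy data (made up) using two PRA columns from the list above.
toy = pd.DataFrame({
    "CAN_LAST_CUR_PRA": [0.0, 20.0, None, 80.0],
    "REC_PRA_MOST_RECENT": [0.0, 25.0, 10.0, None],
})

result = compare_columns(toy, "CAN_LAST_CUR_PRA", "REC_PRA_MOST_RECENT")
print(result)
```

A high `share_equal` together with small `only_a` / `only_b` counts would suggest that one of the two columns adds little information.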


**Features with high Missing value but with potential importance** 

- REC_PREV_GRAFT1_DT & REC_PREV_GRAFT2_DT & REC_PREV_GRAFT3_DT: about 98-99% Missing value, but missingness may be this high because very few recipients have a 1st/2nd/3rd most recent Graft Failure (date)

- REC_MALIG_TY: 99.1% Missing value, but missingness could indicate that the recipient did not have any malignancies

- REC_COD: 99% Missing value, but missingness could indicate that the recipient did not die after TX

- DON_CMV_IGG: 75.34% Missing value, but holds valuable Information about CMV-IgG-Serostatus of Donor

- DON_KI_CREAT_PREOP: 74% Missing value, but holds information about the Creatinine Level of Donor pre TX

- REC_CREAT_DECLINE_GE25: 62% Missing value, but important predictor for Graft Failure

- REC_PROD_URINE_GT40_24HRS: 62% Missing value, but important predictor for Graft Failure
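
If missingness itself is suspected to carry information (for example, no entry in REC_MALIG_TY possibly meaning no malignancy), one option is to keep it as an explicit 0/1 indicator column next to the original feature. This is a sketch under assumptions: `add_missing_flags` and the `_MISSING` suffix are hypothetical names, and the flag only records that a value is absent; it does not prove the clinical interpretation.

```python
import pandas as pd


def add_missing_flags(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Return a copy of df with one 0/1 '<col>_MISSING' flag per given column."""
    out = df.copy()
    for col in columns:
        out[f"{col}_MISSING"] = out[col].isna().astype(int)
    return out


# Toy data (made up) with two of the sparsely filled columns.
toy = pd.DataFrame({
    "REC_MALIG_TY": [None, 4.0, None],
    "REC_COD": [None, None, 998.0],
})

flagged = add_missing_flags(toy, ["REC_MALIG_TY", "REC_COD"])
print(flagged)
```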

**Keep (retain) the following core features**

1. Receiver Features at TX: 

- REC_FAIL_DT: 97% Missing value, BUT indicates whether the recipient had Graft Failure or not

- REC_FAIL_CAUSE_TY: 97% Missing value, main reason for Graft Failure

- REC_FAIL_SURG_COMPL & REC_FAIL_RECUR_DISEASE & REC_FAIL_UROL_COMPL & REC_FAIL_INFECT & REC_FAIL_REJ_ACUTE: about 98% Missing value, BUT hold information about the reason for the Graft Failure (Graft Failure yes --> one of those reasons)

- REC_ANTIVRL_THERAPY: 42% Missing value

- REC_CMV_IGM & REC_CMV_IGG: 41% Missing value

- REC_MED_COND: 38% Missing value

- REC_ACUTE_REJ_EPISODE: 32% Missing value

- REC_EBV_STAT: 21% Missing value

- REC_CREAT: 20% Missing value

- REC_HCV_STAT: 12% Missing value

- REC_MM_EQUIV_TX & REC_DR_MM_EQUIV_TX & REC_B_MM_EQUIV_TX & REC_A_MM_EQUIV_TX: 6% Missing value

- REC_CMV_STAT: 4% Missing value

- REC_HBV_ANTIBODY & REC_HBV_SURF_ANTIGEN: 3% Missing value

- REC_MALIG: 2% Missing value

- REC_DISCHRG_CREAT: 2% Missing value

- REC_DISCHRG_DT: 1% Missing value

- REC_GRAFT_STAT: 0% Missing value

- REC_MM_EQUIV_CUR & REC_DR_MM_EQUIV_CUR & REC_B_MM_EQUIV_CUR & REC_A_MM_EQUIV_CUR: 1% Missing value

- REC_A1 & REC_A2 & REC_B1 & REC_B2 & REC_DR1 & REC_DR2: 0% Missing value

- REC_DGN: 0% Missing value

- REC_PX_STAT_DT & REC_PX_STAT: 0% Missing value

- REC_FIRST_WEEK_DIAL: 0% Missing value

- REC_FUNCTN_STAT: 0% Missing value

- REC_AGE_AT_TX: 0% Missing value

- REC_TX_DT: 0.0% Missing value


2. Person Features at TX:

- PERS_OPTN_COD: 71% Missing value but important feature

- PERS_OPTN_DEATH_DT: 63% Missing value, Death date of person


3. Follow-up features after TX:

- TFL_GRAFT_DT: 77% Missing value, BUT indicates the date of the Graft Failure

- TFL_COD & TFL_DEATH_DT: 75% Missing value, BUT indicates if the receiver is dead or not (with date)

- TFL_LASTATUS: 0% Missing value

- TFL_LAFUDATE: 0.0% Missing value

- TFL_ENDTXFU: 0.0% Missing value


4. Donor features at TX:

- DON_CREAT & DON_HIGH_CREAT: 39% Missing value

- DON_HIST_DIAB: 39% Missing value

- DON_ANTI_HCV: 39% Missing value

- DON_ANTI_CMV: 31% Missing value

- DON_EXPAND_DON_KI: 31% Missing value, Meets expanded donor criteria for kidney (1 = yes, 0 = no)

- DON_HIST_CANCER: 18% Missing value

- DON_A1 & DON_A2 & DON_B1 & DON_B2 & DON_DR1 & DON_DR2: 1% Missing value

- DON_RACE_SRTR & DON_RACE & DON_ETHNICITY_SRTR: 0% Missing value

- DON_AGE: 0% Missing value

- DON_ABO: 0% Missing value

- DON_GENDER: 0% Missing value


5. KEYS/IDs

- PERS_ID: 0.0% Missing value

- REC_HISTO_TX_ID: 0.0% Missing value

- TX_ID: 0.0% Missing value

- TRR_ID: 0.0% Missing value

- PX_ID: 0.0% Missing value

- DONOR_ID: 0.0% Missing value


ToDo & Questions:

- REC_FAIL_DT vs. TFL_GRAFT_DT: Both features hold the date of the Graft Failure -> which feature should we use for our outcome variable?

- REC_FAIL_CAUSE_TY vs. TFL_COD: REC_FAIL_CAUSE_TY is more accurate because it directly records the cause of the Graft Failure
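
To help decide between REC_FAIL_DT and TFL_GRAFT_DT, it may be useful to count how often each date is present and how often the two agree. The snippet below is a sketch on invented toy data, not a result from the real table; only the two column names come from the notes above.

```python
import pandas as pd

# Toy data (made up); in the notebook this would be the loaded TX_KI table.
toy = pd.DataFrame({
    "REC_FAIL_DT": ["2010-01-05", None, None, "2012-03-01"],
    "TFL_GRAFT_DT": ["2010-01-05", "2011-06-30", None, "2012-03-15"],
})

# Parse both columns as dates; unparseable or missing entries become NaT.
rec = pd.to_datetime(toy["REC_FAIL_DT"], errors="coerce")
tfl = pd.to_datetime(toy["TFL_GRAFT_DT"], errors="coerce")

summary = {
    "both_present": int((rec.notna() & tfl.notna()).sum()),
    "only_REC_FAIL_DT": int((rec.notna() & tfl.isna()).sum()),
    "only_TFL_GRAFT_DT": int((rec.isna() & tfl.notna()).sum()),
    "neither": int((rec.isna() & tfl.isna()).sum()),
    # NaT never compares equal, so only rows with two identical dates count.
    "same_date": int((rec == tfl).sum()),
}
print(summary)
```

If one column turns out to be a near superset of the other, that column is the more natural candidate for the outcome variable.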

In [3]:
SUBSET_FOLDER = "/Users/chanyoungwoo/Thesis/extracted_subsets"
CLEAN_FOLDER = "/Users/chanyoungwoo/Thesis/Data_Extraction/clean_subsets_ver2"
os.makedirs(CLEAN_FOLDER, exist_ok=True)

# Load the TX_KI subset (ver1)
in_path = os.path.join(SUBSET_FOLDER, "tx_ki_subset_ver1.csv")
tx = pd.read_csv(in_path)

# Columns marked for removal in the "Drop" list above
to_drop = [
    "REC_COD3", "REC_COD2",
    "REC_ACUTE_REJ_BIOPSY_CONFIRMED", "REC_AVN",
    "DON_HAPLO_TY_MATCH", "REC_ANTIVRL_THERAPY_TY",
    "REC_DIAL_DT", "DON_HIST_CIGARETTE_GT20_PKYR",
    "DON_HIST_COCAINE", "ORG_TY",
    "ORG_AR",
]
tx_clean = tx.drop(columns=to_drop)

# Save the reduced table as ver2
out_path = os.path.join(CLEAN_FOLDER, "tx_ki_subset_ver2.csv")
tx_clean.to_csv(out_path, index=False)

print(f"Saved cleaned TX_KI to {out_path} (shape {tx_clean.shape})")
tx_clean.head()


Saved cleaned TX_KI to /Users/chanyoungwoo/Thesis/Data_Extraction/clean_subsets_ver2/tx_ki_subset_ver2.csv (shape (604267, 96))


Unnamed: 0,PERS_ID,REC_TX_DT,PX_ID,REC_A_MM_EQUIV_TX,REC_A_MM_EQUIV_CUR,REC_B_MM_EQUIV_TX,REC_B_MM_EQUIV_CUR,REC_DR_MM_EQUIV_TX,REC_DR_MM_EQUIV_CUR,DON_AGE,...,TFL_DEATH_DT,TFL_COD,PERS_RELIST,PERS_RETX,PERS_RETX_TRR_ID,TFL_LAFUDATE,REC_MM_EQUIV_TX,REC_MM_EQUIV_CUR,PERS_OPTN_COD,DONOR_ID
0,4001087.0,2008-12-22,465596.0,0.0,0.0,1.0,1.0,1.0,1.0,35.0,...,2016-02-10,998.0,,,,2016-02-10,2.0,2.0,998.0,324508.0
1,4543228.0,2008-11-26,656153.0,2.0,2.0,2.0,2.0,2.0,2.0,48.0,...,2015-08-05,999.0,,,,2015-08-05,6.0,6.0,999.0,323406.0
2,3868540.0,2008-10-15,464088.0,2.0,2.0,1.0,1.0,0.0,0.0,27.0,...,,,,,,2018-10-24,3.0,3.0,,193071.0
3,2682274.0,2009-01-13,668706.0,2.0,2.0,2.0,2.0,2.0,2.0,42.0,...,,,,,,2012-01-20,6.0,6.0,,325616.0
4,4359956.0,2008-11-21,647400.0,0.0,0.0,2.0,2.0,1.0,1.0,17.0,...,,,2018-07-27,2021-11-03,921749.0,2011-11-13,3.0,3.0,,322736.0


**Important features**

Information about Graft Failure/Death of Receiver:

- REC_FAIL_DT: 97% Missing value, BUT indicates whether the recipient had Graft Failure or not

- REC_FAIL_CAUSE_TY: 97% Missing value, main reason for Graft Failure

- REC_FAIL_SURG_COMPL & REC_FAIL_RECUR_DISEASE & REC_FAIL_UROL_COMPL & REC_FAIL_INFECT & REC_FAIL_REJ_ACUTE: Hold detailed information about the cause of the Graft Failure

- REC_MED_COND: Medical Condition of Receiver

- REC_ACUTE_REJ_EPISODE: Whether the patient had any acute rejection episodes between transplant and discharge

- REC_GRAFT_STAT: Graft Status

- REC_PX_STAT & REC_PX_STAT_DT: Patient's Status & Date

- REC_FUNCTN_STAT: Patient's Functional Status

Information about TFL (Follow-Up): 

- TFL_GRAFT_DT: 77% Missing value, BUT indicates the date of the Graft Failure 

- TFL_COD & TFL_DEATH_DT: 75% Missing value, BUT indicates if the receiver is dead or not (with date)

- TFL_ENDTXFU: Cohort Date

- TFL_LAFUDATE: Last Graft Follow up Date

- TFL_LASTATUS: Last Status

Information about Person: 

- PERS_OPTN_COD: 71% Missing value but important feature

- PERS_OPTN_DEATH_DT: 63% Missing value, Death date of person



Check whether there are several entries for the same PX_ID:

In [4]:
file_path = "/Users/chanyoungwoo/Thesis/Data_Extraction/clean_subsets_ver2/tx_ki_subset_ver2.csv"
tx = pd.read_csv(file_path)

# Count how often each PX_ID occurs and keep those that occur more than once
px_counts = tx["PX_ID"].value_counts()
duplicates = px_counts[px_counts > 1]

print(f"Total records: {len(tx)}")
print(f"Unique PX_IDs: {px_counts.size}")
print(f"PX_IDs with duplicates: {len(duplicates)}")

if not duplicates.empty:
    print("\nExamples of PX_IDs that occur more than once, with their counts:")
    print(duplicates.head(10))
else:
    print("\nAll PX_IDs occur only once.")


Total records: 604267
Unique PX_IDs: 604267
PX_IDs with duplicates: 0

All PX_IDs occur only once.
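
The same uniqueness check could be run over several of the key columns listed under KEYS/IDs at once. The following is a sketch on invented toy data; `key_uniqueness` is a hypothetical helper, and whether a key such as PERS_ID is expected to repeat (for example through re-transplants) is an assumption that would have to be checked against the real table.

```python
import pandas as pd


def key_uniqueness(df: pd.DataFrame, keys: list) -> pd.DataFrame:
    """For each key column, report distinct values and rows sharing a value."""
    rows = []
    for key in keys:
        col = df[key]
        rows.append({
            "key": key,
            "unique_values": int(col.nunique(dropna=True)),
            # keep=False marks every row whose value occurs more than once
            # (note: several missing values would also count as duplicates)
            "rows_in_duplicates": int(col.duplicated(keep=False).sum()),
        })
    return pd.DataFrame(rows).set_index("key")


# Toy data (made up): PX_ID is unique, one PERS_ID appears twice.
toy = pd.DataFrame({"PX_ID": [1, 2, 3, 4], "PERS_ID": [10, 10, 11, 12]})

overview = key_uniqueness(toy, ["PX_ID", "PERS_ID"])
print(overview)
```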
