# Introduction to this Notebook

This Jupyter Notebook encompasses a series of scripts written in Python by Daniel Teixeira dos Santos, a Data Community Innovator at the Data Community of Practice ([link to my forum account](https://rcop.michaeljfox.org/u/danieltds/summary)). These scripts were written using data from PPMI, obtained through LONI. These files are linked to the MJFF Research Community's GitHub repository ([link here](https://github.com/MJFF-ResearchCommunity/Useful-PPMI-Clinical-Codes)).

The goal of these scripts is to provide researchers with relevant clinical variables, extracted in a meaningful way from the data that is already available in PPMI. All the necessary input datasets can be obtained [here](https://ida.loni.usc.edu/pages/access/studyData.jsp?project=PPMI) after registering for access to the PPMI data. All outputs from the analyses were removed to comply with privacy and data sharing principles. Some of these scripts were developed with the help of AI tools such as ChatGPT 5o. However, all code was reviewed and confirmed to be working as intended.

This analysis requires two folders inside the main folder: "data" and "priv". The "data" folder is where you should store the datasets downloaded from LONI. The "priv" folder is where the results will be exported. Both folders are created automatically at the beginning of this script if they don't exist.

# Importing and Setting Paths

In [None]:
import os
import pandas as pd
import numpy as np
import warnings
import sys

# Add path to utils folder with shared functions
sys.path.append("../utils")
from helpers import get_latest_file, safe_to_numeric

# Automatically find the "Useful-PPMI-Clinical-Codes" directory by walking up from the working directory
CURRENT_DIR = os.getcwd()
while not CURRENT_DIR.endswith("Useful-PPMI-Clinical-Codes") and os.path.dirname(CURRENT_DIR) != CURRENT_DIR:
    CURRENT_DIR = os.path.dirname(CURRENT_DIR)

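# If the repository folder was not found (the loop reached the filesystem root),
# fall back to the current working directory instead of creating folders at the root
BASE_DIR = CURRENT_DIR if CURRENT_DIR.endswith("Useful-PPMI-Clinical-Codes") else os.getcwd()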

# Define paths for "data" and "priv" directories
DATA_DIR = os.path.join(BASE_DIR, "data")
PRIV_DIR = os.path.join(BASE_DIR, "priv")

# Ensure both directories exist, create them if not
for directory in [DATA_DIR, PRIV_DIR]:
    if not os.path.exists(directory):
        os.makedirs(directory)
        print(f"Created missing folder: {directory}")
    else:
        print(f"Found folder: {directory}")

# Ignore persistent warnings
warnings.simplefilter("ignore", UserWarning)

# Configure Pandas for better data visualization
pd.set_option('display.max_rows', 250)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)
pd.options.display.float_format = "{:,.3f}".format

# List available files in both directories
print("Files in data directory:", os.listdir(DATA_DIR))
print("Files in priv directory:", os.listdir(PRIV_DIR))
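The helper `get_latest_file` is imported from the shared `utils` folder and its source is not shown in this notebook. The sketch below is only a guess at what such a helper could do, assuming it returns the most recently modified CSV whose name starts with the given prefix. The function name `get_latest_file_sketch` and the file names are made up for illustration, and this is not the repository's implementation.

```python
import os
import tempfile


def get_latest_file_sketch(prefix, directory):
    """Hypothetical stand-in: newest CSV in `directory` whose name starts with `prefix`."""
    matches = [
        os.path.join(directory, name)
        for name in os.listdir(directory)
        if name.startswith(prefix) and name.endswith(".csv")
    ]
    if not matches:
        raise FileNotFoundError(f"No CSV starting with '{prefix}' in {directory}")
    # Assumption: "latest" means most recently modified on disk
    return max(matches, key=os.path.getmtime)


# Small demonstration on two made-up files in a temporary folder
with tempfile.TemporaryDirectory() as tmp:
    old = os.path.join(tmp, "Participant_Status_old.csv")
    new = os.path.join(tmp, "Participant_Status_new.csv")
    for path in (old, new):
        with open(path, "w") as fh:
            fh.write("PATNO\n")
    # Force distinct modification times so the choice is unambiguous
    os.utime(old, (1_000_000_000, 1_000_000_000))
    os.utime(new, (2_000_000_000, 2_000_000_000))
    latest_name = os.path.basename(get_latest_file_sketch("Participant_Status", tmp))

print(latest_name)
```

If several downloads of the same dataset sit in the "data" folder, a rule like this picks one of them silently, so it can be worth printing the chosen path as the cells below do.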


# Death (PPMI)

Several outcomes could be useful for PD prediction, such as falls, medication response and dementia status. Progression to death is another such outcome. The PPMI dataset also collects this variable (here taken from ENROLL_STATUS in Participant_Status), which could be used for several correlation or prediction analyses.

**Necessary PPMI datasets:** Participant_Status, MDS-UPDRS_Part_III, PD_Diagnosis_History

**Last Update:** November 11, 2025

## Loading and Subsetting

In [None]:
COHORT_FILE = get_latest_file(prefix="Participant_Status", directory=DATA_DIR)
df = pd.read_csv(COHORT_FILE)
print('Length of the dataset:', len(df))
df.head()

Selecting the participants who make sense for this analysis

In [None]:
df['ENROLL_STATUS'].value_counts()

In [None]:
# Values to filter
filter_values = ['Enrolled', 'Withdrew', 'Withdraw Deceased', 'Excluded', 'Complete']

# Subsetting the data
df = df[df['ENROLL_STATUS'].isin(filter_values)]

df['ENROLL_STATUS'].value_counts()

In [None]:
df['COHORT_DEFINITION'].value_counts(dropna=False)

Now we will map the most common statuses a patient can have, those for which it makes sense to presume the patient is or was actively being seen, to a yes/no deceased flag

In [None]:
# Define mapping from ENROLL_STATUS to new flag values
mapping = {
    'Withdraw Deceased': 'yes',
    'Enrolled':          'no',
    'Withdrew':          'no',
    'Complete':          'no',
    'Excluded':          'no'
}

# Create the new column; unmapped or NaN statuses remain NaN
df['Withdraw Deceased'] = df['ENROLL_STATUS'].map(mapping)

# Quick check
print(df['Withdraw Deceased'].value_counts(dropna=False))
df.head()

Checking data across cases and controls

In [None]:
# Raw counts of Withdraw Deceased by cohort
pd.crosstab(df["COHORT_DEFINITION"], df["Withdraw Deceased"], dropna=False)
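Raw counts are hard to compare when cohorts differ a lot in size. As a minimal sketch, `pd.crosstab` can also return row proportions through `normalize="index"`. The toy table below uses made-up rows, not PPMI data.

```python
import pandas as pd

# Made-up toy data with the same two columns used in the crosstab above
toy = pd.DataFrame({
    "COHORT_DEFINITION": ["Parkinson's Disease"] * 4 + ["Healthy Control"] * 2,
    "Withdraw Deceased": ["yes", "no", "no", "no", "no", "no"],
})

# Each row sums to 1: share of "yes" and "no" within each cohort
prop = pd.crosstab(
    toy["COHORT_DEFINITION"],
    toy["Withdraw Deceased"],
    normalize="index",
)
print(prop)
```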

In [None]:
df.columns

In [None]:
df = df[['PATNO', 'COHORT', 'COHORT_DEFINITION','ENROLL_DATE','ENROLL_STATUS','STATUS_DATE','Withdraw Deceased']]
df.head()

## Exporting cross sectional dataset

This dataset has data from all patient groups (controls, PD and prodromal)

In [None]:
# Exporting
df.to_csv(os.path.join(PRIV_DIR, "Deceased_patients_cross_sectional.csv"), index=False)

## Standardizing timepoints

Loading the MDS-UPDRS Part III dataset (it will be used to align longitudinal timepoints to PPMI's EVENT_ID)

In [None]:
PPMI_FILE = get_latest_file(prefix="MDS-UPDRS_Part_III", directory=DATA_DIR)
ppmiupdrs = pd.read_csv(PPMI_FILE)
print('Length of the dataset:', len(ppmiupdrs))
ppmiupdrs.head()

First, let's drop duplicate rows for the same timepoint per PATNO. Most of them are ON/OFF testing, but we don't care about any values here: we are just using this dataset as a proxy for the timepoints of each patient.

In [None]:
# Number of rows before dropping duplicates
before = ppmiupdrs.shape[0]

# Drop duplicates
ppmiupdrs = ppmiupdrs.drop_duplicates(subset=["PATNO", "EVENT_ID"], keep="first")

# Number of rows after
after = ppmiupdrs.shape[0]

# How many were removed
dropped = before - after
print(f"Rows before: {before}")
print(f"Rows after:  {after}")
print(f"Rows dropped: {dropped}")
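Before trusting `keep="first"`, it can help to look at which PATNO/EVENT_ID pairs are actually duplicated. A minimal sketch on made-up rows (the PATNO values and dates are invented):

```python
import pandas as pd

# Made-up toy rows: PATNO 1001 has two entries for V04
toy = pd.DataFrame({
    "PATNO": [1001, 1001, 1001, 1002],
    "EVENT_ID": ["BL", "V04", "V04", "BL"],
    "INFODT": ["01/2015", "01/2016", "01/2016", "03/2015"],
})

# keep=False flags every member of a duplicated group, not only the repeats
dup_rows = toy[toy.duplicated(subset=["PATNO", "EVENT_ID"], keep=False)]

# How many rows each duplicated PATNO/EVENT_ID pair has
dup_counts = dup_rows.groupby(["PATNO", "EVENT_ID"]).size()
print(dup_counts)
```

Since only the visit date is used downstream, duplicates that share the same INFODT are harmless to drop. Duplicates with different dates would deserve a closer look.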


Let's just subset what we need

In [None]:
ppmiupdrs = ppmiupdrs[['PATNO','EVENT_ID','INFODT']]
ppmiupdrs.head()

For this code to work, we will subset the dataset to standard follow-ups

In [None]:
event_id_mapping = {
    'BL': 0,
    'V01': 0.25,
    'V02': 0.5,
    'V03': 0.75,
    'V04': 1,
    'V05': 1.5,
    'V06': 2,
    'V07': 2.5,
    'V08': 3,
    'V09': 3.5,
    'V10': 4,
    'V11': 4.5,
    'V12': 5,
    'V13': 6,
    'V14': 7,
    'V15': 8,
    'V16': 9,
    'V17': 10,
    'V18': 11,
    'V19': 12,
    'V20': 13,
    'V21': 14,
    'V22': 15,
    'V23': 16
}

# Keep only rows whose EVENT_ID is one of BL, V01, V02, ...
valid_events = set(event_id_mapping.keys())

ppmiupdrs = ppmiupdrs[ppmiupdrs["EVENT_ID"].isin(valid_events)].copy()

print("Unique EVENT_ID after filtering:", sorted(ppmiupdrs["EVENT_ID"].unique()))
ppmiupdrs[["PATNO", "EVENT_ID"]].head()
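The mapping above is only used to filter rows. If the nominal visit time in years is also wanted as a column, `Series.map` can attach it. This is a sketch on made-up rows, and the column name `VISIT_YEAR` is hypothetical (it does not exist in PPMI or elsewhere in this notebook).

```python
import pandas as pd

# Subset of the EVENT_ID to years mapping defined above
event_years = {"BL": 0, "V04": 1, "V06": 2}

# Made-up toy visits
toy = pd.DataFrame({
    "PATNO": [1001, 1001, 1002],
    "EVENT_ID": ["BL", "V06", "V04"],
})

# Nominal years since baseline for each scheduled visit
toy["VISIT_YEAR"] = toy["EVENT_ID"].map(event_years)
print(toy)
```

Note that this is the scheduled time of the visit, which can differ from the time computed from the actual visit dates.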


In [None]:
# Let's rename some cols to work better
ppmiupdrs = ppmiupdrs.rename(columns={
    "EVENT_ID": "EVENT_ID_MDS",
    "INFODT": "INFODT_MDS"
})

ppmiupdrs.head()


In [None]:
# Merging: add UPDRS data (ppmiupdrs) to df
df_long = pd.merge(df, ppmiupdrs, on="PATNO", how="inner")

# Unique PATNO counts
n_left   = df["PATNO"].nunique()
n_right  = ppmiupdrs["PATNO"].nunique()
n_merged = df_long["PATNO"].nunique()

# Shapes and PATNO counts
print(f"Left dataset (df):        shape = {df.shape},       unique PATNO = {n_left}")
print(f"Right dataset (ppmiupdrs): shape = {ppmiupdrs.shape}, unique PATNO = {n_right}")
print(f"Merged dataset (df_long): shape = {df_long.shape},   unique PATNO = {n_merged}")

# Proportion of left dataset PATNO retained in merged
retained_pct = (n_merged / n_left) * 100
print(f"PATNO retained from left into merged: {n_merged}/{n_left} ({retained_pct:.2f}%)")

# Update df to the merged version
df = df_long.copy()

# Keep head() as the last statement so the notebook displays it
df_long.head()
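The inner merge silently drops participants with no MDS-UPDRS Part III visit. To see who they are, one option is a left merge with `indicator=True`, sketched here on made-up PATNO values:

```python
import pandas as pd

# Made-up toy data: participant 1003 has no visit rows
status = pd.DataFrame({"PATNO": [1001, 1002, 1003]})
visits = pd.DataFrame({
    "PATNO": [1001, 1001, 1002],
    "EVENT_ID_MDS": ["BL", "V04", "BL"],
})

# One row per participant, with a _merge column saying where the PATNO was found
check = status.merge(
    visits[["PATNO"]].drop_duplicates(),
    on="PATNO",
    how="left",
    indicator=True,
)

# Participants present in the status table but absent from the visits table
missing = check.loc[check["_merge"] == "left_only", "PATNO"].tolist()
print(missing)
```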


This new dataset now has several rows for each patient, one per MDS-UPDRS Part III timepoint (EVENT_ID_MDS) with its visit date (INFODT_MDS). The time in years is computed from these dates further down. We will clean this later

## Time since disease onset

Now we will add to the dataset how many years had passed since diagnosis at enrolment, at the latest status, and at each follow-up

In [None]:
DIAGNOSIS_FILE = get_latest_file(prefix="PD_Diagnosis_History", directory=DATA_DIR)
dxtime = pd.read_csv(DIAGNOSIS_FILE)
print('Length of the dataset:', len(dxtime))
dxtime.head()

In [None]:
# Best to use PDDXDT (more data) - PD Diagnosis
# SXDT = Symptom onset
dxtime[['SXDT','PDDXDT']].describe(include='all')

Check for missingness in INFODT_MDS (for later calculations)

In [None]:
df = df_long.copy()

In [None]:
print(df["INFODT_MDS"].isna().sum())
df[df["INFODT_MDS"].isna()].head(10)

Merging datasets

In [None]:
# Merging: add PDDXDT from dxtime to df
df_final = pd.merge(
    df,
    dxtime[["PATNO", "PDDXDT"]],
    on="PATNO",
    how="inner"
)

# Unique PATNO counts
n_left   = df["PATNO"].nunique()
n_right  = dxtime["PATNO"].nunique()
n_merged = df_final["PATNO"].nunique()

# Shapes and PATNO counts
print(f"Left dataset (df):        shape = {df.shape},       unique PATNO = {n_left}")
print(f"Right dataset (dxtime):   shape = {dxtime.shape},   unique PATNO = {n_right}")
print(f"Merged dataset (df_final): shape = {df_final.shape}, unique PATNO = {n_merged}")

# Proportion of left dataset PATNO retained in merged
retained_pct = (n_merged / n_left) * 100
print(f"PATNO retained from left into merged: {n_merged}/{n_left} ({retained_pct:.2f}%)")

df_final.head(20)


One relevant note: by requiring a diagnosis date (inner merge with PD_Diagnosis_History), we lose participants who have no record in that dataset. This can be adapted by not requiring it. However, I think the diagnosis date is relevant here, since all the durations below are measured from it
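The note above says the diagnosis-date requirement can be relaxed. A minimal sketch of that, on made-up rows: a left merge keeps every participant and leaves PDDXDT missing where there is no diagnosis record. The `validate="many_to_one"` argument is an extra safety check I am adding here, and it raises an error if the diagnosis table has more than one row per PATNO.

```python
import pandas as pd

# Made-up toy data: participant 1002 has no diagnosis record
visits = pd.DataFrame({
    "PATNO": [1001, 1002],
    "INFODT_MDS": ["01/2016", "02/2016"],
})
dx = pd.DataFrame({"PATNO": [1001], "PDDXDT": ["06/2014"]})

# how="left" keeps all rows from visits; unmatched PDDXDT becomes NaN
kept = visits.merge(dx, on="PATNO", how="left", validate="many_to_one")
print(kept)
```

With this variant, the duration columns computed below would simply be missing for participants without a diagnosis date.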

Now we create three useful columns that tell us how many years had passed since diagnosis at enrolment, at the latest status, and at each follow-up

In [None]:
# Convert date columns to datetime format (month/year)
date_cols = ["PDDXDT", "ENROLL_DATE", "STATUS_DATE", "INFODT_MDS"]
for col in date_cols:
    df_final[col] = pd.to_datetime(df_final[col], format="%m/%Y", errors="coerce")

# 1) Time from diagnosis to enrolment
df_final["duration_at_enrolment"] = (
    (df_final["ENROLL_DATE"] - df_final["PDDXDT"]).dt.days / 365.25
)

# 2) Time from diagnosis to latest status
df_final["duration_at_latest_status"] = (
    (df_final["STATUS_DATE"] - df_final["PDDXDT"]).dt.days / 365.25
)

# 3) Time from diagnosis to each follow-up (MDS visit)
df_final["duration_at_follow_up"] = (
    (df_final["INFODT_MDS"] - df_final["PDDXDT"]).dt.days / 365.25
)

df_final.head()
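To make the arithmetic concrete, here is a worked example with two made-up dates. Month/year strings parsed with `format="%m/%Y"` land on the first day of the month, and the difference in days is divided by 365.25 to approximate years.

```python
import pandas as pd

# Made-up dates in the month/year format used above
dx = pd.to_datetime("01/2010", format="%m/%Y")      # parsed as 2010-01-01
visit = pd.to_datetime("01/2015", format="%m/%Y")   # parsed as 2015-01-01

# 5 calendar years including one leap day (2012) = 1826 days
days = (visit - dx).days
years = days / 365.25
print(days, round(years, 4))

# A value that does not match the format becomes NaT with errors="coerce"
bad = pd.to_datetime("unknown", format="%m/%Y", errors="coerce")
print(pd.isna(bad))
```

Because the day of the month is not available, each duration is only precise to about a month, and unparseable dates propagate as missing durations rather than raising an error.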


In [None]:
# Work on a copy
df_longitudinal = df_final.copy()

# Keep only patients with Withdraw Deceased == "yes"
wd = df_longitudinal[df_longitudinal["Withdraw Deceased"] == "yes"].copy()

# Ensure INFODT_MDS is datetime (if not already)
wd["INFODT_MDS"] = pd.to_datetime(wd["INFODT_MDS"], errors="coerce")

# For each PATNO, get the row with the latest available follow-up (max INFODT_MDS)
idx_latest = wd.groupby("PATNO")["INFODT_MDS"].idxmax()
wd_latest_followup = wd.loc[idx_latest].sort_values("PATNO")

# Look at the latest follow up rows for Withdraw Deceased == "yes"
wd_latest_followup.head(5)
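`groupby(...).idxmax()` may warn or fail, depending on the pandas version, when a patient has no valid visit date at all. An alternative sketch that avoids this is to drop missing dates, sort, and keep the last row per PATNO. The rows below are made up.

```python
import pandas as pd

# Made-up toy visits; one date for PATNO 1002 is missing
toy = pd.DataFrame({
    "PATNO": [1001, 1001, 1002, 1002],
    "INFODT_MDS": pd.to_datetime(
        ["01/2015", "01/2017", "03/2016", None],
        format="%m/%Y",
        errors="coerce",
    ),
})

# Latest dated visit per patient: sort by date within PATNO and keep the last row
latest = (
    toy.dropna(subset=["INFODT_MDS"])
       .sort_values(["PATNO", "INFODT_MDS"])
       .drop_duplicates(subset="PATNO", keep="last")
)
print(latest)
```

Patients whose visit dates are all missing drop out of `latest` entirely, which is worth checking against the number of deceased patients.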

In [None]:
# Unique PATNO counts of Withdraw Deceased by cohort
patno_ct = (
    df_longitudinal
    .groupby(["COHORT_DEFINITION", "Withdraw Deceased"])["PATNO"]
    .nunique()
    .unstack(fill_value=0)
)

patno_ct


## Exporting longitudinal dataset

This dataset has data mostly from PD and can have some prodromal, as it requires MDS-UPDRS Part III visits and a PD diagnosis date

In [None]:
# Exporting
df_longitudinal.to_csv(os.path.join(PRIV_DIR, "Deceased_patients_longitudinal_duration_at_death.csv"), index=False)