# MDS-UPDRS Longitudinal Deltas & Minimal Clinically Important Difference Helper

This Jupyter notebook belongs to a series of Python scripts written by Daniel Teixeira dos Santos; this one specifically was authored by Ana Jimena Hernández Medrano. The scripts were written based on data from PPMI, obtained through LONI. These files are linked to the MJFF Research Community's GitHub repository ([link here](https://github.com/MJFF-ResearchCommunity/Useful-PPMI-Clinical-Codes)).

The goal of these scripts is to give researchers clinically relevant variables derived from the data already available in PPMI. All the necessary input datasets can be obtained [here](https://ida.loni.usc.edu/pages/access/studyData.jsp?project=PPMI) after applying for access to the PPMI data. All outputs from the analyses were removed to comply with privacy and data sharing principles.

This analysis requires two folders inside the main folder: "data" and "priv". Store the datasets downloaded from LONI in "data"; results are exported to "priv". Both folders are created automatically at the beginning of this script if they don't exist.

**Information regarding this specific script**  
**Author:** Ana Jimena Hernández Medrano, MD, MSc  
**Contact:** anajimenahdz@gmail.com & ajhernandezmedrano@liigh.unam.mx  
**GitHub:** @jimenahmedrano

This notebook builds a clean, analysis-ready longitudinal dataset of MDS-UPDRS scores from PPMI raw CSVs and derives clinically meaningful progression metrics.

## What this notebook does

1. **Reads PPMI MDS-UPDRS CSVs**  
   - Part I (rater and patient questionnaire)  
   - Part II (patient questionnaire)  
   - Part III (motor exam, with PDSTATE)  
   - Part IV (motor complications)

2. **Computes total scores per visit and patient**
   - `UPDRS1_rater_total`, `UPDRS1_patient_total`, `UPDRS2_total`, `UPDRS3_total`, `UPDRS4_total`
   - Combined Part I total: `UPDRS1_total`
   - Global total: `UPDRST_total`
   - Keeps `INFODT` (assessment date) and `PDSTATE` (ON/OFF, when available)

3. **Creates visit-to-visit longitudinal intervals**
   - For each `PATNO`, visits are ordered by `INFODT`
   - Consecutive intervals are defined: current visit → next visit
   - Time between visits is computed in days and years (`delta_days`, `delta_years`)

4. **Derives MCID-based progression flags**
   - Visit-to-visit deltas:
     - `Δ_UPDRS1`, `Δ_UPDRS2`, `Δ_UPDRS3`, `Δ_UPDRS4`, `Δ_UPDRST`
   - MCID worsening flags per part:
     - `MCID_updrs1`, `MCID_updrs2`, `MCID_updrs3`, `MCID_updrs4`, `MCID_updrst`
   - Composite progression:
     - `MCID_composite_parts` (any part crosses its MCID)
     - `MCID_composite_parts_plus_total` (parts and/or total cross MCID)

   MCID thresholds can be edited in one place to match your protocol or analysis plan.
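
As a rough illustration of steps 3 and 4, the sketch below builds visit-to-visit intervals, deltas and a nullable MCID flag for a single score. The data are made up (not PPMI), the column `score` is a placeholder, and the threshold of 4.63 is simply the Part III default used later in this notebook.

```python
import pandas as pd

# Toy data (not PPMI): two participants, one score per visit
toy = pd.DataFrame({
    "PATNO": [1, 1, 1, 2, 2],
    "INFODT_dt": pd.to_datetime(
        ["2020-01-01", "2020-07-01", "2021-01-01", "2020-03-01", "2021-03-01"]
    ),
    "score": [20.0, 26.0, 27.0, 15.0, None],
})

toy = toy.sort_values(["PATNO", "INFODT_dt"])

# Pair each visit with the next visit of the same participant
toy["date_next"] = toy.groupby("PATNO")["INFODT_dt"].shift(-1)
toy["score_next"] = toy.groupby("PATNO")["score"].shift(-1)

# Keep only rows that have a following visit (one row per interval)
toy_intervals = toy.dropna(subset=["date_next"]).copy()
toy_intervals["delta_days"] = (
    toy_intervals["date_next"] - toy_intervals["INFODT_dt"]
).dt.days
toy_intervals["delta"] = toy_intervals["score_next"] - toy_intervals["score"]

# Nullable flag: 1 = increase of at least the threshold, 0 = below it,
# <NA> = delta could not be computed
threshold = 4.63
flag = pd.Series(pd.NA, index=toy_intervals.index, dtype="Int64")
known = toy_intervals["delta"].notna()
flag[known] = (toy_intervals.loc[known, "delta"] >= threshold).astype("Int64")
toy_intervals["MCID_flag"] = flag

print(toy_intervals[["PATNO", "delta_days", "delta", "MCID_flag"]])
```

The last interval has a missing follow-up score, so its delta is NaN and its flag stays `<NA>` rather than being forced to 0.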

## Intended use

- As a **preprocessing layer** to generate longitudinal outcomes for:
  - Classical regression / survival models
  - Mixed-effects models
  - ML pipelines for progression prediction

- As a **transparent, reproducible workflow** for turning raw PPMI MDS-UPDRS files into clean longitudinal deltas and MCID-based labels.

**Necessary PPMI datasets:**

- MDS-UPDRS Part I Patient Questionnaire: Non-Motor Aspects of Experiences of Daily Living (nM-EDL)
- MDS-UPDRS Part I: Non-Motor Aspects of Experiences of Daily Living (nM-EDL)
- MDS-UPDRS Part II Patient Questionnaire: Motor Aspects of Experiences of Daily Living (M-EDL)
- MDS-UPDRS Part III Treatment Determination and Part III: Motor Examination
- MDS-UPDRS Part IV: Motor Complications

**Last Update:** December 10, 2025

**NEXT STEPS**

* Individual participants' progression slope estimation
* MCID time-to-event
* Next-visit ML prediction

# Importing

In [None]:
import os
import pandas as pd
import numpy as np
import warnings
import sys
import seaborn as sns
import matplotlib.pyplot as plt

# Add path to utils folder with shared functions
sys.path.append("../utils")
from helpers import get_latest_file, safe_to_numeric

# Automatically find the "Useful-PPMI-Clinical-Codes" directory
CURRENT_DIR = os.getcwd()
while not CURRENT_DIR.endswith("Useful-PPMI-Clinical-Codes") and os.path.dirname(CURRENT_DIR) != CURRENT_DIR:
    CURRENT_DIR = os.path.dirname(CURRENT_DIR)

BASE_DIR = CURRENT_DIR

# Define paths for "data" and "priv" directories
DATA_DIR = os.path.join(BASE_DIR, "data")
PRIV_DIR = os.path.join(BASE_DIR, "priv")

# Ensure both directories exist, create them if not
for directory in [DATA_DIR, PRIV_DIR]:
    if not os.path.exists(directory):
        os.makedirs(directory)
        print(f"Created missing folder: {directory}")
    else:
        print(f"Found folder: {directory}")

# Ignore persistent warnings
warnings.simplefilter("ignore", UserWarning)

# Configure Pandas for better data visualization
pd.set_option('display.max_rows', 250)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)
pd.options.display.float_format = "{:,.3f}".format

# List available files in both directories
print("Files in data directory:", os.listdir(DATA_DIR))
print("Files in priv directory:", os.listdir(PRIV_DIR))


In [None]:
# -------------------------------------------------------------------
# 1) Read each raw MDS UPDRS file exported from PPMI (dynamic latest)
# -------------------------------------------------------------------

# Get latest file for each part by prefix
f_updrs1_rater = get_latest_file(
    prefix="MDS-UPDRS_Part_I",
    directory=DATA_DIR
)

f_updrs1_patient = get_latest_file(
    prefix="MDS-UPDRS_Part_I_Patient_Questionnaire",
    directory=DATA_DIR
)

f_updrs2_patient = get_latest_file(
    prefix="MDS_UPDRS_Part_II__Patient_Questionnaire",
    directory=DATA_DIR
)

f_updrs3 = get_latest_file(
    prefix="MDS-UPDRS_Part_III",
    directory=DATA_DIR
)

f_updrs4 = get_latest_file(
    prefix="MDS-UPDRS_Part_IV__Motor_Complications",
    directory=DATA_DIR
)

updrs1_rater   = pd.read_csv(f_updrs1_rater)
updrs1_patient = pd.read_csv(f_updrs1_patient)
updrs2_patient = pd.read_csv(f_updrs2_patient)
updrs3_df      = pd.read_csv(f_updrs3)
updrs4_df      = pd.read_csv(f_updrs4)

# Optional quick sanity checks
print("UPDRS I rater rows:", len(updrs1_rater))
print("UPDRS I patient rows:", len(updrs1_patient))
print("UPDRS II patient rows:", len(updrs2_patient))
print("UPDRS III rows:", len(updrs3_df))
print("UPDRS IV rows:", len(updrs4_df))

updrs1_rater.head()
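
Several of the prefixes above share a leading string: `MDS-UPDRS_Part_I` is also the start of the Part I patient questionnaire, Part III and Part IV prefixes. Whether that matters depends on how `get_latest_file` filters names; its implementation lives in `../utils` and is not shown here. The sketch below uses hypothetical file names and a hypothetical `match_prefix` rule to show the overlap under plain `str.startswith`, and one possible way to narrow the match.

```python
import re

# Hypothetical file names, for illustration only (real export names may differ)
names = [
    "MDS-UPDRS_Part_I_01Jan2025.csv",
    "MDS-UPDRS_Part_I_Patient_Questionnaire_01Jan2025.csv",
    "MDS-UPDRS_Part_III_01Jan2025.csv",
    "MDS-UPDRS_Part_IV__Motor_Complications_01Jan2025.csv",
]

prefix = "MDS-UPDRS_Part_I"

# Plain prefix matching picks up every file in the list
loose = [n for n in names if n.startswith(prefix)]


def match_prefix(name, prefix):
    """Hypothetical stricter rule: the prefix, one underscore, then a digit."""
    return re.match(re.escape(prefix) + r"_\d", name) is not None


strict = [n for n in names if match_prefix(n, prefix)]

print(len(loose), "loose matches;", len(strict), "strict match:", strict)
```

Printing `f_updrs1_rater` after the call is a quick way to confirm which file was actually selected.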

# Full script

In [None]:
# -------------------------------------------------------------------
# 2) Clean and keep only total scores per MDS-UPDRS part
#    (adjust INFODT if date column is named differently in your files)
# -------------------------------------------------------------------

# Part I - Rater (NP1RTOT)
# Convert NP1RTOT to numeric and keep only ID, visit, date, and total score
updrs1_rater["NP1RTOT"] = pd.to_numeric(updrs1_rater["NP1RTOT"], errors="coerce")
updrs1_tot = updrs1_rater[["PATNO", "EVENT_ID", "INFODT", "NP1RTOT"]].copy()
updrs1_tot = updrs1_tot.rename(columns={"NP1RTOT": "UPDRS1_rater_total"})

# Part I - Patient (NP1PTOT)
# Patient-completed non-motor Part I total
updrs1_patient["NP1PTOT"] = pd.to_numeric(updrs1_patient["NP1PTOT"], errors="coerce")
updrs1p_tot = updrs1_patient[["PATNO", "EVENT_ID", "INFODT", "NP1PTOT"]].copy()
updrs1p_tot = updrs1p_tot.rename(columns={"NP1PTOT": "UPDRS1_patient_total"})

# Part II - Patient (NP2PTOT)
# Motor aspects of daily living (patient questionnaire) total
updrs2_patient["NP2PTOT"] = pd.to_numeric(updrs2_patient["NP2PTOT"], errors="coerce")
updrs2_tot = updrs2_patient[["PATNO", "EVENT_ID", "INFODT", "NP2PTOT"]].copy()
updrs2_tot = updrs2_tot.rename(columns={"NP2PTOT": "UPDRS2_total"})

# Part III (NP3TOT) + PDSTATE (ON/OFF) if available
# Motor examination total; PDSTATE indicates ON/OFF state for the exam
updrs3_df["NP3TOT"] = pd.to_numeric(updrs3_df["NP3TOT"], errors="coerce")

cols_3 = ["PATNO", "EVENT_ID", "INFODT", "NP3TOT"]
# In PPMI, PDSTATE only exists in Part III; include it if present
if "PDSTATE" in updrs3_df.columns:
    cols_3.append("PDSTATE")

updrs3_tot = updrs3_df[cols_3].copy()
updrs3_tot = updrs3_tot.rename(columns={"NP3TOT": "UPDRS3_total"})

# Part IV (NP4TOT)
# Motor complications total
updrs4_df["NP4TOT"] = pd.to_numeric(updrs4_df["NP4TOT"], errors="coerce")
updrs4_tot = updrs4_df[["PATNO", "EVENT_ID", "INFODT", "NP4TOT"]].copy()
updrs4_tot = updrs4_tot.rename(columns={"NP4TOT": "UPDRS4_total"})
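
Step 2 repeats the same three operations for every part: coerce the total to numeric, keep the key columns, and rename the total. A minimal sketch of how this could be factored into one function follows. The name `extract_total` and the toy frame are hypothetical, and unlike the cell above this version leaves the input DataFrame unmodified.

```python
import pandas as pd


def extract_total(df, total_col, new_name, extra_cols=()):
    """Hypothetical helper: keep key columns plus one numeric total."""
    keep = ["PATNO", "EVENT_ID", "INFODT", total_col]
    # Optional columns (e.g. PDSTATE) are kept only when present
    keep += [c for c in extra_cols if c in df.columns]
    out = df[keep].copy()
    # Non-numeric entries become NaN instead of raising
    out[total_col] = pd.to_numeric(out[total_col], errors="coerce")
    return out.rename(columns={total_col: new_name})


# Toy frame standing in for a Part III export (values are made up)
toy_part3 = pd.DataFrame({
    "PATNO": [1, 1],
    "EVENT_ID": ["BL", "V04"],
    "INFODT": ["01/2020", "01/2021"],
    "NP3TOT": ["21", "not scored"],
    "PDSTATE": ["OFF", "ON"],
    "OTHER": [0, 0],
})

toy_part3_tot = extract_total(
    toy_part3, "NP3TOT", "UPDRS3_total", extra_cols=["PDSTATE"]
)
print(toy_part3_tot)
```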

In [None]:
# -------------------------------------------------------------------
# 3) Merge into a single table with all totals per PATNO + EVENT_ID
#    -> Keep INFODT only from Part I (rater)
#       and PDSTATE only from Part III (UPDRS3)
# -------------------------------------------------------------------

def drop_cols_safe(df, cols):
    """
    Drop columns if they exist in the DataFrame.

    This is a small helper to avoid errors when a column listed in `cols`
    is not present in a given DataFrame.
    """
    cols_to_drop = [c for c in cols if c in df.columns]
    return df.drop(columns=cols_to_drop)

df_totals = (
    updrs1_tot
    .merge(
        drop_cols_safe(updrs1p_tot, ["INFODT", "PDSTATE"]),   # do not take INFODT/PDSTATE from here
        on=["PATNO", "EVENT_ID"],
        how="outer"
    )
    .merge(
        drop_cols_safe(updrs2_tot, ["INFODT", "PDSTATE"]),    # same here: keep only totals
        on=["PATNO", "EVENT_ID"],
        how="outer"
    )
    .merge(
        drop_cols_safe(updrs3_tot, ["INFODT"]),               # here we DO want PDSTATE, only drop INFODT
        on=["PATNO", "EVENT_ID"],
        how="outer"
    )
    .merge(
        drop_cols_safe(updrs4_tot, ["INFODT", "PDSTATE"]),    # only Part IV total
        on=["PATNO", "EVENT_ID"],
        how="outer"
    )
)

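# In df_totals now:
# - INFODT comes from updrs1_tot (Part I rater); visits with no Part I rater
#   row have a missing INFODT and are dropped when dates are parsed in step 5
# - PDSTATE (ON/OFF) comes from updrs3_tot (Part III), when available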
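
The merges in step 3 assume one row per `PATNO` + `EVENT_ID` in each table. If a table holds more than one row for the same visit (for example, a Part III exam recorded in more than one `PDSTATE`, which is an assumption to check against your export), an outer merge repeats the other columns for each of those rows. The sketch below shows a quick duplicate check on made-up tables.

```python
import pandas as pd

keys = ["PATNO", "EVENT_ID"]

# Toy tables (made-up values): the right table has two rows for visit V04
left = pd.DataFrame({
    "PATNO": [1, 1],
    "EVENT_ID": ["BL", "V04"],
    "UPDRS1_rater_total": [3, 5],
})
right = pd.DataFrame({
    "PATNO": [1, 1, 1],
    "EVENT_ID": ["BL", "V04", "V04"],
    "UPDRS3_total": [20, 30, 18],
    "PDSTATE": ["OFF", "OFF", "ON"],
})

# Rows whose key appears more than once in the table
dup_rows = right[right.duplicated(subset=keys, keep=False)]
print(f"{len(dup_rows)} rows share a key with another row")

# The merged table ends up with more rows than the left table
merged = left.merge(right, on=keys, how="outer")
print(len(left), "->", len(merged), "rows after merge")
```

Passing `validate="one_to_one"` to `merge` makes pandas raise an error instead of silently repeating rows, which can be a useful guard while exploring a new export.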

In [None]:
# -------------------------------------------------------------------
# 4) Build combined Part I total and overall UPDRS total
# -------------------------------------------------------------------

# Combined Part I = rater + patient questionnaire (when available)
df_totals["UPDRS1_total"] = df_totals[
    ["UPDRS1_rater_total", "UPDRS1_patient_total"]
].sum(axis=1, min_count=1)

# Global total = sum of Parts I–IV
df_totals["UPDRST_total"] = df_totals[
    ["UPDRS1_total", "UPDRS2_total", "UPDRS3_total", "UPDRS4_total"]
].sum(axis=1, min_count=1)

# Quick sanity check of the merged table
print(df_totals.head())
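
With `min_count=1`, a row-wise sum is NaN only when every component is missing; if at least one component is present, the missing ones contribute nothing to the total. The toy example below (made-up values) shows this behaviour next to a stricter variant that requires both components. The `UPDRS1_total_strict` column is a hypothetical addition and not part of this notebook's output.

```python
import numpy as np
import pandas as pd

cols = ["UPDRS1_rater_total", "UPDRS1_patient_total"]

toy = pd.DataFrame({
    "UPDRS1_rater_total": [4.0, 4.0, np.nan],
    "UPDRS1_patient_total": [6.0, np.nan, np.nan],
})

# Same rule as the notebook: at least one component must be present
toy["UPDRS1_total"] = toy[cols].sum(axis=1, min_count=1)

# Stricter alternative: both components must be present
toy["UPDRS1_total_strict"] = toy[cols].sum(axis=1, min_count=2)

print(toy)
```

In the second row the notebook's rule reports 4.0 even though the patient questionnaire is missing, so partial totals may understate the true score. Whether that is acceptable depends on the analysis plan.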

In [None]:
# -------------------------------------------------------------------
# 5) Parse INFODT into a proper datetime column (INFODT_dt)
#     e.g., handles "MM/YYYY" or "MM/DD/YYYY" formats
# -------------------------------------------------------------------

df_totals = df_totals.copy()
df_totals["INFODT_dt"] = pd.to_datetime(df_totals["INFODT"], errors="coerce")

# Drop rows without a valid date
df_totals = df_totals.dropna(subset=["INFODT_dt"])
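
When a date column may contain more than one layout, parsing each layout explicitly makes the outcome easier to predict than relying on format inference. This is a sketch on made-up strings, assuming the two layouts mentioned in the comment above (`MM/YYYY` and `MM/DD/YYYY`); the real `INFODT` column may only ever use one of them.

```python
import pandas as pd

raw = pd.Series(["01/2020", "03/15/2021", "not a date", None])

# Try month/year first; anything that does not match becomes NaT
parsed = pd.to_datetime(raw, format="%m/%Y", errors="coerce")

# Fall back to month/day/year for values that are still missing
fallback = pd.to_datetime(raw, format="%m/%d/%Y", errors="coerce")
parsed = parsed.fillna(fallback)

print(parsed)
```

Month/year values are placed on the first day of the month, so `delta_days` computed from them is approximate to within a month.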

In [None]:
# Quick look at the cleaned, merged UPDRS totals table
df_totals.head()

In [None]:
# ---------------------------------------------------------
# Helper to compute visit-to-visit deltas and MCID flags
# using INFODT as the time axis
#
#   - For each PATNO:
#       * Sort visits by date (INFODT_dt)
#       * Define intervals: current visit → next visit
#       * Compute time deltas and UPDRS deltas per interval
#   - We do not drop intervals with NaN scores:
#       * If a score is missing on either side, the corresponding delta is NaN
#       * MCID flags for that delta are set to <NA>, not forced to 0
# ---------------------------------------------------------

def compute_updrs_deltas_and_mcid_visit_to_visit(df_totals):
    score_cols = [
        "UPDRS1_total",
        "UPDRS2_total",
        "UPDRS3_total",
        "UPDRS4_total",
        "UPDRST_total",
    ]

    df = df_totals.copy()

    # Ensure we have a datetime column for the visit date
    if "INFODT_dt" not in df.columns:
        df["INFODT_dt"] = pd.to_datetime(df["INFODT"], errors="coerce")

    # Require only a valid date; scores may be partially missing
    df_sorted = (
        df
        .dropna(subset=["INFODT_dt"])
        .sort_values(["PATNO", "INFODT_dt"])
    )

    # For each patient, create "next visit" columns
    df_sorted["next_INFODT_dt"]  = df_sorted.groupby("PATNO")["INFODT_dt"].shift(-1)
    df_sorted["next_EVENT_ID"]   = df_sorted.groupby("PATNO")["EVENT_ID"].shift(-1)

    for col in score_cols:
        df_sorted[f"{col}_next"] = df_sorted.groupby("PATNO")[col].shift(-1)

    # Keep only rows that have a next visit (this is required for an interval)
    intervals = df_sorted.dropna(subset=["next_INFODT_dt"]).copy()

    # Ensure the next visit is strictly later in time
    intervals = intervals[intervals["next_INFODT_dt"] > intervals["INFODT_dt"]].copy()

    # Rename dates and EVENT_IDs for clarity
    intervals = intervals.rename(
        columns={
            "INFODT_dt": "date_current",
            "next_INFODT_dt": "date_next",
            "EVENT_ID": "event_current",
            "next_EVENT_ID": "event_next",
        }
    )

    # Time differences between visits
    intervals["delta_days"]  = (intervals["date_next"] - intervals["date_current"]).dt.days
    intervals["delta_years"] = intervals["delta_days"] / 365.25

    # UPDRS deltas (next visit - current visit)
    intervals["Δ_UPDRS1"]  = intervals["UPDRS1_total_next"] - intervals["UPDRS1_total"]
    intervals["Δ_UPDRS2"]  = intervals["UPDRS2_total_next"] - intervals["UPDRS2_total"]
    intervals["Δ_UPDRS3"]  = intervals["UPDRS3_total_next"] - intervals["UPDRS3_total"]
    intervals["Δ_UPDRS4"]  = intervals["UPDRS4_total_next"] - intervals["UPDRS4_total"]
    intervals["Δ_UPDRST"]  = intervals["UPDRST_total_next"] - intervals["UPDRST_total"]

    # --- Helper: MCID flag that respects NaNs in the delta ---
    def mcid_from_delta(delta_series, threshold):
        """
        Given a delta series and a threshold, return an Int64 series where:
        - 1 = delta >= threshold (worsening meets MCID)
        - 0 = delta < threshold
        - <NA> = delta is NaN (cannot be evaluated)
        """
        mcid = pd.Series(pd.NA, index=delta_series.index, dtype="Int64")
        mask = delta_series.notna()
        mcid[mask] = (delta_series[mask] >= threshold).astype("Int64")
        return mcid

    # MCID (worsening = increase >= threshold in that interval)
    intervals["MCID_updrs1"] = mcid_from_delta(intervals["Δ_UPDRS1"],  2.45)  # optional
    intervals["MCID_updrs2"] = mcid_from_delta(intervals["Δ_UPDRS2"],  2.51)
    intervals["MCID_updrs3"] = mcid_from_delta(intervals["Δ_UPDRS3"],  4.63)
    intervals["MCID_updrs4"] = mcid_from_delta(intervals["Δ_UPDRS4"],  1.00)
    intervals["MCID_updrst"] = mcid_from_delta(intervals["Δ_UPDRST"], 10.59)

    # Composite: at least one part crosses its MCID in that interval
    # (if all part-level MCIDs are NA, composite is also NA)
    mcid_parts = intervals[["MCID_updrs1", "MCID_updrs2", "MCID_updrs3", "MCID_updrs4"]]

    # any==1 across parts, ignoring NAs; if everything is NA → composite = NA
    any_mcid = mcid_parts.eq(1).any(axis=1)
    any_notna = mcid_parts.notna().any(axis=1)
    composite = pd.Series(pd.NA, index=intervals.index, dtype="Int64")
    composite[any_notna] = any_mcid[any_notna].astype("Int64")

    intervals["MCID_composite_parts"] = composite

    # Composite including total (parts and/or UPDRST)
    mcid_total = intervals["MCID_updrst"]
    comp_plus = pd.Series(pd.NA, index=intervals.index, dtype="Int64")
    mask_any_info = composite.notna() | mcid_total.notna()
    comp_plus[mask_any_info] = (
        (composite.fillna(0) == 1) | (mcid_total.fillna(0) == 1)
    )[mask_any_info].astype("Int64")

    intervals["MCID_composite_parts_plus_total"] = comp_plus

    return intervals
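
The composite rule inside the helper can be easier to follow on a small table. The sketch below reproduces the same logic for two part-level flags with made-up values; it converts to plain booleans before the row-wise `any`, which is a slightly different route from the helper but gives the same result for this purpose.

```python
import pandas as pd

# Made-up part-level flags: 1 = MCID met, 0 = not met, <NA> = not evaluable
parts = pd.DataFrame({
    "MCID_updrs2": pd.array([1, 0, pd.NA, pd.NA, 0], dtype="Int64"),
    "MCID_updrs3": pd.array([0, 0, 1, pd.NA, pd.NA], dtype="Int64"),
})

# True when at least one part equals 1 (missing values count as "not 1")
hit = parts.eq(1).fillna(False).astype(bool).any(axis=1)

# True when at least one part could be evaluated
known = parts.notna().any(axis=1)

# Composite stays <NA> only when no part could be evaluated
composite = pd.Series(pd.NA, index=parts.index, dtype="Int64")
composite[known] = hit[known].astype("Int64")

print(composite.tolist())
```

Note the last row: one part is 0 and the other is missing, yet the composite is 0. A composite of 0 therefore means "no evaluable part crossed its MCID", not "every part was evaluated and none crossed".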

In [None]:
# Run the helper on df_totals to build visit-to-visit intervals
intervals = compute_updrs_deltas_and_mcid_visit_to_visit(df_totals)

# Inspect the first 50 intervals across patients
intervals.head(50)

In [None]:
# Display the full intervals DataFrame (use with care if very large)
intervals

In [None]:
# Quick check of all available columns in the intervals table
list(intervals.columns)

In [None]:
# -------------------------------------------------------------------
# 6) Percentage of patients with ≥1 interval meeting MCID composite
# -------------------------------------------------------------------

# For each PATNO, check if there is at least one interval with MCID_composite_parts_plus_total == 1
mcid_by_patient = (
    intervals
    .groupby("PATNO")["MCID_composite_parts_plus_total"]
    .max(min_count=1)   # if all are <NA> for a patient, result stays <NA>
)

# Patients with at least one MCID event (ignoring patients with only <NA>)
pct_patients = (mcid_by_patient == 1).mean() * 100

print(f"{pct_patients:.2f}% of patients have at least one interval with MCID_composite_parts_plus_total = 1.")
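
The percentage reported above depends on which patients are counted in the denominator. The toy example below (made-up flags, with a placeholder column `flag`) contrasts two choices: patients with at least one evaluable interval, and all patients with any interval.

```python
import pandas as pd

# Made-up interval-level flags; patient 3 has no evaluable interval
toy = pd.DataFrame({
    "PATNO": [1, 1, 2, 2, 3],
    "flag": pd.array([0, 1, 0, 0, pd.NA], dtype="Int64"),
})

# Per-patient maximum over evaluable intervals only
evaluable = toy.dropna(subset=["flag"])
per_patient = evaluable.groupby("PATNO")["flag"].max()

# Denominator = patients with at least one evaluable interval
pct_evaluable = (per_patient == 1).mean() * 100

# Denominator = all patients that appear in the table
pct_all = (per_patient == 1).sum() / toy["PATNO"].nunique() * 100

print(f"{pct_evaluable:.1f}% of evaluable patients, {pct_all:.1f}% of all patients")
```

Stating the denominator alongside the percentage avoids ambiguity when results are compared across cohorts.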

In [None]:
# Exporting
intervals.to_csv(os.path.join(PRIV_DIR, "Longitudinal_MDS_and_MCID_dataset.csv"), index=False)

> Notebook curated by Ana Jimena Hernández-Medrano (PPMI/GP2 nerd squad 🤓)