# Introduction to this Notebook

This Jupyter Notebook encompasses a series of scripts written in Python by Daniel Teixeira dos Santos, a Data Community Innovator at the Data Community of Practice ([link to my forum account](https://rcop.michaeljfox.org/u/danieltds/summary)). These scripts were written in Google Colab and read local files in my Google Drive that I downloaded from LONI. These files are linked to the MJFF Research Community's GitHub repository ([link here](https://github.com/MJFF-ResearchCommunity/Useful-PPMI-Clinical-Codes)).

The goal of these scripts is to provide researchers with relevant clinical data extracted in a meaningful way from the data already available in PPMI. All the necessary input datasets can be obtained [here](https://ida.loni.usc.edu/pages/access/studyData.jsp?project=PPMI) after registering for access to the PPMI data. All outputs from the analyses were removed to comply with privacy and data sharing principles. Some of these scripts were developed with the help of AI tools such as ChatGPT 4o.

# Importing Google Drive

In [None]:
# Loading Google Drive (requires login)
from google.colab import drive
drive.mount('/content/drive')

# Setting working directory (working path for me)
import os
os.chdir("/content/drive/MyDrive/Colab/Useful PPMI Clinical Codes/")

# Importing useful modules
import pandas as pd # Data analysis and manipulation tool - https://pandas.pydata.org/
import numpy as np # Scientific computing - https://numpy.org/
import math # Math!
import glob # Needed for a search for filetypes within a folder

# Some warnings are persistent and I tend to ignore them
import warnings
warnings.simplefilter("ignore", UserWarning)

# More data displayed in pandas columns
pd.set_option('display.max_rows', 250)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None) # Maximum width of each column
pd.options.display.float_format = "{:,.3f}".format # Show values with 3 decimal places

# Print what is in the loaded directory
os.listdir()

# LEDD calculation

Levodopa Equivalent Daily Dose (LEDD) is a concept that expresses the total dosage of the medications used to treat PD. It is important because it gives us information on how difficult it is to treat a specific patient. This value is already present in a PPMI data cut; however, this script also tries to produce the LEDD of specific subgroups of medications, such as levodopa, MAO-B inhibitors, dopamine agonists, etc.

The latest LEDD calculation formulas are based on: https://movementdisorders.onlinelibrary.wiley.com/doi/full/10.1002/mds.29410

**Necessary PPMI datasets:** LEDD Concomitant Medication Log and MDS-UPDRS Part III Treatment Determination and Part III: Motor Examination

**Last Update:** February 9, 2025

## Reading

MDS-UPDRS Part III. Its visits will be used as the reference list of EVENT_IDs onto which the LEDD data are merged

In [None]:
MDS3 = pd.read_csv('data/MDS-UPDRS_Part_III_09Feb2025.csv')
print('Length of the dataset:', len(MDS3))
MDS3.head()

LEDD datasheet

In [None]:
LEDD = pd.read_csv('data/LEDD_Concomitant_Medication_Log_12Jan2025.csv')
print('Length of the dataset:', len(LEDD))
LEDD.head()

**Explanation:** Unlike most PPMI datasets, the data in this dataset are not organized by EVENT_ID, so we will need to compute the total LEDD for each EVENT_ID. Columns of interest are "LEDTRT" (name of the medication), "STARTDT" (start date), "STOPDT" (stop date) and "LEDD" (the already calculated total LEDD). This script will, therefore, try to interpret this dataset and create columns that represent the LEDD for all medication types applied to each timepoint, to facilitate analyses.

However, some rows don't have their LEDD calculated, and some adjustments can be necessary. Also, medication-specific types of LEDD are not calculated (e.g. dopamine agonists, MAO-B inhibitors etc). This script will address these issues.
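The interval logic described above can be sketched on toy data. This is only an illustration with made-up values (not PPMI data); the helper name `ledd_at` is hypothetical, and a missing stop date is assumed to mean that the medication is still being taken.

```python
import pandas as pd

# Toy medication log (hypothetical values, not PPMI data)
med_log = pd.DataFrame({
    "PATNO": [1, 1, 1],
    "LEDTRT": ["Levodopa", "Rasagiline", "Pramipexole"],
    "STARTDT": pd.to_datetime(["2020-01-01", "2020-06-01", "2021-01-01"]),
    "STOPDT": pd.to_datetime(["2021-06-01", None, "2021-03-01"]),
    "LEDD": [300.0, 100.0, 150.0],
})

def ledd_at(log, patno, visit_date):
    """Sum the LEDD of every medication active for `patno` on `visit_date`."""
    active = log[
        (log["PATNO"] == patno)
        & (log["STARTDT"] <= visit_date)
        & ((log["STOPDT"] >= visit_date) | log["STOPDT"].isna())  # NaT = still ongoing
    ]
    return active["LEDD"].sum()

total_early = ledd_at(med_log, 1, pd.Timestamp("2020-03-01"))  # only levodopa is active
total_late = ledd_at(med_log, 1, pd.Timestamp("2021-02-01"))   # all three are active
print(total_early, total_late)
```

Each visit date is compared against every medication interval of the same patient, and the LEDD values of the active rows are summed.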

## Adjusting Levodopa to COMT

Entacapone extends the half-life of levodopa, so the levodopa LEDD usually has to be multiplied when someone uses Entacapone. Some rows have an incomplete LEDD value that just says "LD x 0.33" or something similar, which indicates that the total levodopa dose must be multiplied by that factor and the result added to it (example: Carbidopa/Levodopa/Entacapone, where the row gives the doses of all components). Other rows have only this multiplication information but no adjacent levodopa dose (example: Entacapone).

For privacy reasons, I can't give direct numerical examples of the patients that have this type of entry, but to identify them, just look for PATNOs that have "Carbidopa/Levodopa/Entacapone" in the column "LEDTRT". In those patients, the LEDD column isn't calculated; it is "LD x 0.33" instead. You can also notice that "LEDDSTRMG" gives the dosage of levodopa present in that compound, so we can calculate the LEDD from it.
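As a worked example of this arithmetic, here is a minimal sketch with made-up numbers. The function name `combined_ledd` is hypothetical, and it assumes that "LEDDOSE" is the number of units per intake, "LEDDOSFRQ" the number of intakes per day and "LEDDSTRMG" the levodopa strength per unit in mg.

```python
def combined_ledd(dose, freq, strength_mg, ld_multiplier):
    """LEDD for a levodopa product that also carries an "LD x ..." multiplier.

    dose: units per intake (assumed meaning of LEDDOSE)
    freq: intakes per day (assumed meaning of LEDDOSFRQ)
    strength_mg: levodopa per unit in mg (assumed meaning of LEDDSTRMG)
    ld_multiplier: the number after "LD x", e.g. 0.33
    """
    levodopa_mg = dose * freq * strength_mg
    return levodopa_mg + levodopa_mg * ld_multiplier

# Hypothetical example: 1 tablet, 3 times a day, 100 mg levodopa per tablet, "LD x 0.33"
example = combined_ledd(dose=1, freq=3, strength_mg=100, ld_multiplier=0.33)
print(round(example, 2))
```

The daily levodopa amount (300 mg here) is counted once in full and once more scaled by the multiplier, so the row contributes roughly 399 to the total LEDD.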


This code creates a LEDD for instances where entacapone formulations are combined with levodopa

In [None]:
LEDD = pd.read_csv('data/LEDD_Concomitant_Medication_Log_12Jan2025.csv')
LEDD.head()

In [None]:
# Extract the number that appears after "LD x" in the "LEDD" column
LEDD['LD_multiplier'] = LEDD['LEDD'].str.extract(r'LD x (\d+\.\d+)').astype(float)

Adding an already converted value of LEDD in the first type of case

In [None]:
# Function to calculate LEDD
def calculate_ledd_ld(row):
    # Check if "levodopa" is present in LEDTRT (case insensitive)
    contains_levodopa = isinstance(row['LEDTRT'], str) and 'levodopa' in row['LEDTRT'].lower()

    # Proceed with calculation if LD_multiplier and LEDDSTRMG are not NaN and LEDTRT contains "levodopa"
    if pd.notna(row['LD_multiplier']) and pd.notna(row['LEDDSTRMG']) and contains_levodopa:
        result = (row['LEDDOSE'] * row['LEDDOSFRQ'] * row['LEDDSTRMG']) + \
                 (row['LEDDOSE'] * row['LEDDOSFRQ'] * row['LEDDSTRMG'] * row['LD_multiplier'])
        return result
    return np.nan  # Return NaN for rows where conditions are not met

# Running the function on all rows
LEDD['Calculated_LEDD_LD'] = LEDD.apply(calculate_ledd_ld, axis=1)

# Count number of modified rows
num_modified = LEDD['Calculated_LEDD_LD'].notna().sum()

# Update LEDD where Calculated_LEDD_LD is not NaN
LEDD.loc[pd.notna(LEDD['Calculated_LEDD_LD']), 'LEDD'] = LEDD['Calculated_LEDD_LD']

# Print number of modified rows
print(f"Number of rows modified: {num_modified}")

# Checking the end result and all columns involved in their calculations
LEDD[LEDD['LEDTRT'] == 'Carbidopa/Levodopa/Entacapone'][['LEDTRT', 'LEDDOSSTR', 'LEDDSTRMG', 'LEDDOSE', 'LEDDOSFRQ', 'LD_multiplier', 'LEDD','Calculated_LEDD_LD']].head(5)

In [None]:
# Now just applying this to the original LEDD column
# This only applies to columns where Calculated_LEDD_LD is not nan
# Count the number of rows where LEDD will be updated
num_modified = LEDD['Calculated_LEDD_LD'].notna().sum()

# Update LEDD values where Calculated_LEDD_LD is not NaN
LEDD.loc[pd.notna(LEDD['Calculated_LEDD_LD']), 'LEDD'] = LEDD['Calculated_LEDD_LD']

# Print the number of modified rows
print(f"Number of rows modified: {num_modified}")

## LD x 0.2, 0.33 and 0.5

Some rows list the COMT inhibitor on its own; therefore, we need to apply the calculation to the levodopa dosage that matches the period in which the patient was taking that medication.

The most common options are Entacapone (0.33) and Opicapone (0.5). However, Istradefylline (which is not a COMT inhibitor) also uses a multiplier of 0.2.

In [None]:
# Identifying how many entries had one medication that needs LEDD correction
LEDD['LD_multiplier'].value_counts()

Before analysing: for the code to work, levodopa entries must have exactly the same start and stop dates as the COMT inhibitor entry. If they do not, these calculations won't work. Therefore, we have to subdivide the levodopa entries into new rows.

This script was verified by selecting some specific patients and confirming that it creates the new rows that align with medication usage and that the original rows are deleted, to avoid double counting.

To verify it yourself, select some specific PATNOs modified by this script and inspect their rows.

**IMPORTANT:** as patients currently taking a medication have a STOPDT of NaN (which invalidates the script), the first two lines of the script are a comment and a setting for the date you downloaded the dataset, and all NaN values are filled with the specified date!
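The subdivision idea can be sketched on a single pair of intervals. This is a simplified illustration, not the cell that follows: the helper name `split_interval` is hypothetical, and here the overlapping part is clipped to the levodopa period, which is an assumption of this sketch.

```python
import pandas as pd

def split_interval(levo_start, levo_stop, comt_start, comt_stop):
    """Split a levodopa interval into the parts before, during and after a COMT interval.

    Returns a list of (start, stop, label) tuples; empty parts are skipped.
    """
    overlap_start = max(levo_start, comt_start)
    overlap_stop = min(levo_stop, comt_stop)
    if overlap_start >= overlap_stop:
        # No overlap: keep the original interval untouched
        return [(levo_start, levo_stop, "unchanged")]
    parts = []
    if levo_start < overlap_start:
        parts.append((levo_start, overlap_start, "before"))
    parts.append((overlap_start, overlap_stop, "during"))
    if levo_stop > overlap_stop:
        parts.append((overlap_stop, levo_stop, "after"))
    return parts

segments = split_interval(
    pd.Timestamp("2020-01-01"), pd.Timestamp("2022-01-01"),  # levodopa period
    pd.Timestamp("2020-06-01"), pd.Timestamp("2021-06-01"),  # COMT inhibitor period
)
for start, stop, label in segments:
    print(label, start.date(), stop.date())
```

Only the "during" segment shares its dates with the COMT inhibitor row, so only that segment would later receive the multiplier.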

In [None]:
# Input the date you downloaded the dataset (this is important for this script to work)
download_date = pd.to_datetime('2025-01-01')  # YYYY-MM-DD

# Mark original rows
LEDD['Row_Type'] = 'Original'

# Convert dates for processing
LEDD['STARTDT'] = pd.to_datetime(LEDD['STARTDT'], format='%m/%Y', errors='coerce')
LEDD['STOPDT'] = pd.to_datetime(LEDD['STOPDT'], format='%m/%Y', errors='coerce')

# Fill missing stop dates (ongoing medication) only after the conversion,
# so the filled date is not re-parsed with the '%m/%Y' format
LEDD['STOPDT'] = LEDD['STOPDT'].fillna(download_date)

# Identify patients (PATNO) with a non-NaN LD_multiplier
patients_with_ldx = LEDD.loc[LEDD['LD_multiplier'].notna(), 'PATNO'].unique()

# Create lists to store new and junk rows
new_rows = []
junk_rows = []
modified_patnos = set()  # Track PATNOs where at least one change occurred

# Process each patient separately
for patno in patients_with_ldx:
    # Extract the periods where the patient has a valid LD_multiplier
    ldx_periods = LEDD[(LEDD['PATNO'] == patno) & (LEDD['LD_multiplier'].notna())][['STARTDT', 'STOPDT', 'LD_multiplier']]

    # Extract all Levodopa rows for the patient
    levodopa_rows = LEDD[(LEDD['PATNO'] == patno) & (LEDD['LEDTRT'].str.contains('Levodopa', case=False, na=False))]

    for _, ldx in ldx_periods.iterrows():
        ld_start, ld_stop, ld_multiplier = ldx['STARTDT'], ldx['STOPDT'], ldx['LD_multiplier']

        for _, levodopa in levodopa_rows.iterrows():
            levo_start, levo_stop = levodopa['STARTDT'], levodopa['STOPDT']

            # Identify non-exact but overlapping matches
            if (levo_start < ld_stop) and (levo_stop > ld_start) and not ((levo_start == ld_start) and (levo_stop == ld_stop)):

                # Move the original row to the junk dataset before replacing it
                junk_rows.append(levodopa.copy())

                # Mark PATNO as modified
                modified_patnos.add(patno)

                # First row: Exact match to LD multiplier time
                exact_match = levodopa.copy()
                exact_match['STARTDT'] = ld_start
                exact_match['STOPDT'] = ld_stop
                exact_match['Row_Type'] = 'Generated'
                new_rows.append(exact_match)

                # Second row: Period before LD multiplier medication
                if levo_start < ld_start:
                    before_match = levodopa.copy()
                    before_match['STOPDT'] = ld_start
                    before_match['Row_Type'] = 'Generated'
                    new_rows.append(before_match)

                # Third row: Period after LD multiplier medication
                if (levo_stop is pd.NaT) or (levo_stop > ld_stop):
                    after_match = levodopa.copy()
                    after_match['STARTDT'] = ld_stop
                    after_match['Row_Type'] = 'Generated'
                    new_rows.append(after_match)

# Convert the new and junk rows into DataFrames
new_rows_df = pd.DataFrame(new_rows)
junk_rows_df = pd.DataFrame(junk_rows)

# Remove the junk rows from the main dataset
LEDD = LEDD[~LEDD.index.isin(junk_rows_df.index)]

# Append the generated rows to the original dataset
LEDD = pd.concat([LEDD, new_rows_df], ignore_index=True)

# Sort the dataset by PATNO and STARTDT to maintain alignment
LEDD = LEDD.sort_values(by=['PATNO', 'STARTDT']).reset_index(drop=True)

# Print summary statistics
print(f"Number of unique PATNOs modified: {len(modified_patnos)}")
print(f"Total number of new rows added: {len(new_rows_df)}")

Now we are going to prepare to run the script that recalculates the LEDD. But first, we need to list the different possible names for levodopa

In [None]:
# Creating the list
levodopa_names = ['Levodopa', 'Dhivy', 'Duodopa', 'Duopa', 'Inbrija',
                  'Parcopa','Prolopa','Rytary','Sinemet','Stalevo']

# Create regex pattern by joining the list elements with "|"
levodopa_pattern = '|'.join(levodopa_names)

# Seeing the result
print(levodopa_pattern)

Now the real deal: multiplying the LEDD of the levodopa rows whose start and stop dates match those of the row carrying the multiplier
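The core of this step can be sketched with a merge on toy data. This is an alternative, vectorized illustration with made-up values, not the loop used in this notebook; it assumes at most one multiplier row per patient and period.

```python
import pandas as pd

# Toy rows (hypothetical values): levodopa entries and one isolated COMT inhibitor entry
levodopa = pd.DataFrame({
    "PATNO": [1, 2],
    "STARTDT": pd.to_datetime(["2020-06-01", "2020-06-01"]),
    "STOPDT": pd.to_datetime(["2021-06-01", "2021-06-01"]),
    "LEDD": [400.0, 300.0],
})
comt = pd.DataFrame({
    "PATNO": [1],
    "STARTDT": pd.to_datetime(["2020-06-01"]),
    "STOPDT": pd.to_datetime(["2021-06-01"]),
    "LD_multiplier": [0.5],
})

# Left merge keeps every levodopa row; rows without a matching period get NaN
merged = levodopa.merge(comt, on=["PATNO", "STARTDT", "STOPDT"], how="left")
merged["LD_multiplier"] = merged["LD_multiplier"].fillna(0.0)

# Same arithmetic as in the text: original LEDD plus original LEDD times the multiplier
merged["Adjusted_LEDD"] = merged["LEDD"] * (1 + merged["LD_multiplier"])
print(merged[["PATNO", "LEDD", "LD_multiplier", "Adjusted_LEDD"]])
```

Patient 1 has a matching period and goes from 400 to 600, while patient 2 has no multiplier row and keeps the original value.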

In [None]:
# Initialize 'Multiplied' column
LEDD['Multiplied'] = 'No'

# Creating the "Original_LEDD" values to check if any inconsistencies may happen
LEDD['LEDD'] = LEDD['LEDD'].astype(str)

# Extract numeric values only when "LD" is NOT in the string
LEDD['Original_LEDD'] = np.where(
    LEDD['LEDD'].str.contains('LD', case=False, na=False),
    np.nan,
    LEDD['LEDD'].str.extract(r'(\d+\.\d+|\d+)', expand=False).astype(float)
)

# Iterate over unique PATNO values to process each patient separately
for patno in LEDD['PATNO'].unique():
    # Filter dataset for the specific patient
    patient_data = LEDD[LEDD['PATNO'] == patno]

    # Identify rows with 'LD x' in LEDD for the specific patient
    rows_with_ldx = patient_data[patient_data['LEDD'].str.contains('LD x', na=False)]

    # Iterating over identified rows
    for index, row in rows_with_ldx.iterrows():
        multiplier = row['LD_multiplier']
        start_date = row['STARTDT']
        stop_date = row['STOPDT']

        # Identify corresponding Levodopa rows within the same date range for the same patient
        corresponding_rows = LEDD[
            (LEDD['PATNO'] == patno) &  # Ensuring only within the same patient
            (LEDD['LEDTRT'].str.contains(levodopa_pattern, case=False, na=False)) &
            (LEDD['STARTDT'] == start_date) &
            ((LEDD['STOPDT'] == stop_date) | (LEDD['STOPDT'].isna() & pd.isna(stop_date))) &
            (LEDD['Multiplied'] == 'No')
        ]

        # Apply the multiplication logic correctly
        for corr_index, corr_row in corresponding_rows.iterrows():
            original_ledd = LEDD.loc[corr_index, 'Original_LEDD']
            if pd.notna(multiplier):  # Ensure multiplier is not NaN
                new_ledd = original_ledd + (original_ledd * multiplier)
                LEDD.loc[corr_index, 'LEDD'] = new_ledd
                LEDD.loc[corr_index, 'Multiplied'] = 'Yes'

## Specific medication LEDD and joining to the timepoint organization

This part of the script accomplishes two different goals. First, it merges the data present in the LEDD dataset in a way that is aligned with the EVENT_ID in the dataset (BL, V02, V04 etc). For this, we use the MDS-UPDRS Part III questionnaire as a proxy to gather these timeframes, as it seems to be one of the most complete questionnaires. Second, it calculates medication-specific LEDD by matching the text of the medication names. It calculates these for levodopa, MAO-B inhibitors, amantadine, anticholinergics etc

This script just showcases how you can identify the different ways levodopa is written. I did a manual check for most medication types presented below to ensure most of them could be included in the medication-specific LEDD calculations
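A minimal sketch of this kind of name matching on made-up entries is shown below. The use of `re.escape` is an addition of this sketch (it protects names containing characters such as parentheses), and the example strings are hypothetical.

```python
import re
import pandas as pd

# Hypothetical free-text entries, similar in spirit to the "LEDTRT" column
treatments = pd.Series(["SINEMET 25/100", "carbidopa/levodopa", "Azilect", "Rytary (ER)", None])

levodopa_terms = ["Levodopa", "Sinemet", "Rytary"]
# Join the escaped terms with "|" so that any of them counts as a match
pattern = "|".join(re.escape(term) for term in levodopa_terms)

# case=False ignores capitalization; na=False treats missing names as "no match"
is_levodopa = treatments.str.contains(pattern, case=False, regex=True, na=False)
print(treatments[is_levodopa].tolist())
```

Passing `na=False` keeps the boolean mask free of missing values, so it can be used directly for row selection.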

In [None]:
# Selecting rows where the "LEDTRT" column contains 'Levodopa'
levodopa_rows = LEDD[LEDD['LEDTRT'].str.contains('Levodopa', case=False, na=False)]

# Printing or further processing the selected rows
levodopa_rows['LEDTRT'].value_counts().index.tolist()[0:5]

## Uniting LEDD to timepoints

In [None]:
# Defining names for specific categories
# Levodopa
levodopa_names = ['Levodopa', 'Dhivy', 'Duodopa', 'Duopa', 'Inbrija',
                  'Parcopa','Prolopa','Rytary','Sinemet','Stalevo']

# Create a regex pattern that matches any of the terms
levodopopa_pattern = '|'.join(levodopa_names)

# Dopamine agonists
dopamine_agonist_names = ['Pramipexol', 'Mirapex', 'Mirapexin', 'Sifrol',
                          'Ropinirol', 'Ropirinol', 'Requip', 'Rotigotin', 'Neupro',
                          'Apomorphin', 'Apokyn']  # 'Ropirinol' kept to catch that misspelling

# Create a regex pattern that matches any of the terms
dopamine_agonist_pattern = '|'.join(dopamine_agonist_names)

# MAO-B
maob_names = ['Selegilin', 'Eldepryl', 'Zelapar', 'Rasagilin', 'Azilect',
              'Safinamid', 'Xadago']

# Create a regex pattern that matches any of the terms
maob_pattern = '|'.join(maob_names)

# COMT
comt_names = ['Entacapon', 'Comtan', 'Tolcapon', 'Tasmar',
              'Opicapon', 'Ongentys']

# Create a regex pattern that matches any of the terms
comt_pattern = '|'.join(comt_names)

# Muscarinic antagonist
anticholingergic_names = ['Trihexyphenidyl', 'Artane', 'Artanis', 'Biperiden', 'Akineton']

# Create a regex pattern that matches any of the terms
anticholingergic_pattern = '|'.join(anticholingergic_names)

# Amantadine (some typos present in the dataset, so giving all options)
amantadine_names = ['AMANDATINE', 'AMANDTADINE', 'AMANTADIN', 'AMANTADIN 100',
                    'AMANTADIN 150', 'AMANTADINA', 'AMANTADINE', 'AMANTADINE (100 MG)',
                    'AMANTADINE 100 MG', 'AMANTADINE 100MG', 'AMANTADINE HCL',
                    'AMANTADINE09', 'GOCOVERI', 'GOCOVRI', 'GOCOVRI 137 MG',
                    'GOCOVRI ER', 'Gocovri (Amantadine CR )', 'Gocovri (Amantadine CR)',
                     'OSMOLEX ER', 'Osmolex (Amantadine ER)']

# Create a regex pattern that matches any of the terms
amantadine_pattern = '|'.join(amantadine_names)

# Creating lists of names for drugs (na=False treats missing medication names as "no match")
levodopa_values = LEDD[LEDD['LEDTRT'].str.contains(levodopopa_pattern, case=False, regex=True, na=False)]['LEDTRT'].drop_duplicates().tolist()
dopamine_agonist_values = LEDD[LEDD['LEDTRT'].str.contains(dopamine_agonist_pattern, case=False, regex=True, na=False)]['LEDTRT'].drop_duplicates().tolist()
maob_values = LEDD[LEDD['LEDTRT'].str.contains(maob_pattern, case=False, regex=True, na=False)]['LEDTRT'].drop_duplicates().tolist()
comt_values = LEDD[LEDD['LEDTRT'].str.contains(comt_pattern, case=False, regex=True, na=False)]['LEDTRT'].drop_duplicates().tolist()
anticholingergic_values = LEDD[LEDD['LEDTRT'].str.contains(anticholingergic_pattern, case=False, regex=True, na=False)]['LEDTRT'].drop_duplicates().tolist()
amantadine_values = LEDD[LEDD['LEDTRT'].str.contains(amantadine_pattern, case=False, regex=True, na=False)]['LEDTRT'].drop_duplicates().tolist()

In [None]:
# Convert the date columns to datetime format
MDS3['INFODT'] = pd.to_datetime(MDS3['INFODT'], format='%m/%Y')
LEDD['STARTDT'] = pd.to_datetime(LEDD['STARTDT'], format='%m/%Y')
LEDD['STOPDT'] = pd.to_datetime(LEDD['STOPDT'], format='%m/%Y', errors='coerce')  # Handle NaN

# Convert 'LEDD' column to numeric
LEDD['LEDD'] = pd.to_numeric(LEDD['LEDD'], errors='coerce')

# Function to calculate LEDD sums for specific medication categories
def calculate_ledd_sums(relevant_meds, values_list):
    return relevant_meds[relevant_meds['LEDTRT'].str.lower().isin([val.lower() for val in values_list])]['LEDD'].sum()

# Initialize lists to collect results
led_values = []
amantadine_values_list = []
levodopa_values_list = []
dopamine_agonist_values_list = []
maob_values_list = []
comt_values_list = []
anticholingergic_values_list = []

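# Sort by PATNO so the result lists, which are built patient by patient,
# line up with the row order of MDS3 when they are assigned back by position
MDS3 = MDS3.sort_values('PATNO', kind='stable').reset_index(drop=True)

# Iterate through unique patients in MDS3
for patno in MDS3['PATNO'].unique():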
    patient_mds3 = MDS3[MDS3['PATNO'] == patno]
    patient_ledd = LEDD[LEDD['PATNO'] == patno]

    for index, row in patient_mds3.iterrows():
        infodt = row['INFODT']
        relevant_meds = patient_ledd[(patient_ledd['STARTDT'] <= infodt) &
                                     ((patient_ledd['STOPDT'] >= infodt) | pd.isna(patient_ledd['STOPDT']))]

        # Sum the total LEDD values
        led_values.append(relevant_meds['LEDD'].sum())

        # Sum the LEDD values for the specified 'LEDTRT' names for each category
        amantadine_values_list.append(calculate_ledd_sums(relevant_meds, amantadine_values))
        levodopa_values_list.append(calculate_ledd_sums(relevant_meds, levodopa_values))
        dopamine_agonist_values_list.append(calculate_ledd_sums(relevant_meds, dopamine_agonist_values))
        maob_values_list.append(calculate_ledd_sums(relevant_meds, maob_values))
        comt_values_list.append(calculate_ledd_sums(relevant_meds, comt_values))
        anticholingergic_values_list.append(calculate_ledd_sums(relevant_meds, anticholingergic_values))

# Assign the collected results back to the DataFrame
MDS3['LEDD'] = led_values
MDS3['AMANTADINE_LEDD'] = amantadine_values_list
MDS3['LEVODOPA_LEDD'] = levodopa_values_list
MDS3['DOPAMINE_AGONIST_LEDD'] = dopamine_agonist_values_list
MDS3['MAOB_LEDD'] = maob_values_list
MDS3['COMT_LEDD'] = comt_values_list
MDS3['ANTICHOLINERGIC_LEDD'] = anticholingergic_values_list

MDS3[['PATNO','INFODT','LEDD', 'AMANTADINE_LEDD', 'DOPAMINE_AGONIST_LEDD', 'MAOB_LEDD', 'COMT_LEDD', 'ANTICHOLINERGIC_LEDD', 'LEVODOPA_LEDD']].head()

In [None]:
# Convert the date columns to datetime format
MDS3['INFODT'] = pd.to_datetime(MDS3['INFODT'], format='%m/%Y')
LEDD['STARTDT'] = pd.to_datetime(LEDD['STARTDT'], format='%m/%Y')
LEDD['STOPDT'] = pd.to_datetime(LEDD['STOPDT'], format='%m/%Y', errors='coerce')  # Handle NaN

# Convert 'LEDD' column to numeric
LEDD['LEDD'] = pd.to_numeric(LEDD['LEDD'], errors='coerce')

# Function to calculate LEDD sums for specific medication categories
def calculate_ledd_sums(relevant_meds, values_list):
    return relevant_meds[relevant_meds['LEDTRT'].str.lower().isin([val.lower() for val in values_list])]['LEDD'].sum()

# Function to check COMT inhibitor presence
def check_comt_inhibitor(relevant_meds, comt_names):
    return "Yes" if relevant_meds['LEDTRT'].str.lower().isin([val.lower() for val in comt_names]).any() else "No"

# Initialize lists to collect results
led_values = []
amantadine_values_list = []
levodopa_values_list = []
dopamine_agonist_values_list = []
maob_values_list = []
comt_presence_list = []  # Store Yes/No instead of sum
anticholingergic_values_list = []

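# Sort by PATNO so the result lists, which are built patient by patient,
# line up with the row order of MDS3 when they are assigned back by position
MDS3 = MDS3.sort_values('PATNO', kind='stable').reset_index(drop=True)

# Iterate through unique patients in MDS3
for patno in MDS3['PATNO'].unique():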
    patient_mds3 = MDS3[MDS3['PATNO'] == patno]
    patient_ledd = LEDD[LEDD['PATNO'] == patno]

    for index, row in patient_mds3.iterrows():
        infodt = row['INFODT']
        relevant_meds = patient_ledd[(patient_ledd['STARTDT'] <= infodt) &
                                     ((patient_ledd['STOPDT'] >= infodt) | pd.isna(patient_ledd['STOPDT']))]

        # Sum the total LEDD values
        led_values.append(relevant_meds['LEDD'].sum())

        # Sum the LEDD values for the specified 'LEDTRT' names for each category
        amantadine_values_list.append(calculate_ledd_sums(relevant_meds, amantadine_values))
        levodopa_values_list.append(calculate_ledd_sums(relevant_meds, levodopa_values))
        dopamine_agonist_values_list.append(calculate_ledd_sums(relevant_meds, dopamine_agonist_values))
        maob_values_list.append(calculate_ledd_sums(relevant_meds, maob_values))
        anticholingergic_values_list.append(calculate_ledd_sums(relevant_meds, anticholingergic_values))

        # Check for COMT inhibitor presence
        comt_presence_list.append(check_comt_inhibitor(relevant_meds, comt_values))

# Assign the collected results back to the DataFrame
MDS3['LEDD'] = led_values
MDS3['AMANTADINE_LEDD'] = amantadine_values_list
MDS3['LEVODOPA_LEDD'] = levodopa_values_list
MDS3['DOPAMINE_AGONIST_LEDD'] = dopamine_agonist_values_list
MDS3['MAOB_LEDD'] = maob_values_list
MDS3['COMT_INHIBITOR'] = comt_presence_list  # Changed from sum to Yes/No
MDS3['ANTICHOLINERGIC_LEDD'] = anticholingergic_values_list

# Display the results
MDS3[['PATNO','INFODT','LEDD', 'AMANTADINE_LEDD', 'DOPAMINE_AGONIST_LEDD', 'MAOB_LEDD', 'COMT_INHIBITOR', 'ANTICHOLINERGIC_LEDD', 'LEVODOPA_LEDD']].head()

## Exporting

In [None]:
# Subsetting the dataset to important variables
LEDD_dataset = MDS3[['PATNO','INFODT','LEDD', 'AMANTADINE_LEDD', 'DOPAMINE_AGONIST_LEDD', 'MAOB_LEDD', 'COMT_INHIBITOR', 'ANTICHOLINERGIC_LEDD', 'LEVODOPA_LEDD']]
LEDD_dataset.head()

In [None]:
# Exporting
LEDD_dataset.to_csv('data/LEDD_Dataset.csv', index=False)

# Levodopa responsiveness

Levodopa responsiveness is a very interesting marker that different studies have approached (example: https://pubmed.ncbi.nlm.nih.gov/38898616/). The PPMI protocol states that, whenever possible, patients should be evaluated both in the OFF and ON states. In that way, we can calculate levodopa challenge responses by using the formula: response = (off - on) / off x 100.
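As a worked example of this formula, here is a small sketch with made-up scores. The function name is hypothetical; it returns NaN when a score is missing or when the OFF score is zero, since the percentage is undefined in those cases.

```python
import math

def levodopa_response(off_score, on_score):
    """Percentage improvement from the OFF to the ON state: (off - on) / off * 100."""
    if off_score is None or on_score is None:
        return math.nan
    if math.isnan(off_score) or math.isnan(on_score) or off_score == 0:
        return math.nan  # undefined when a score is missing or the OFF score is zero
    return (off_score - on_score) / off_score * 100

# Hypothetical scores: 40 points OFF and 28 points ON
improvement = levodopa_response(40, 28)
# A higher score in the ON state gives a negative response
worsening = levodopa_response(20, 25)
print(improvement, worsening)
```

A drop from 40 to 28 points corresponds to a response of about 30%, and a rise from 20 to 25 points to a response of -25%.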

**Necessary PPMI datasets:** MDS-UPDRS Part III Treatment Determination and Part III: Motor Examination

**Last Update:** February 9, 2025

## Organizing

### General read and view

In [None]:
MDS3 = pd.read_csv('data/MDS-UPDRS_Part_III_09Feb2025.csv')
print('Length of the dataset:', len(MDS3))
MDS3.head()

In [None]:
# Define concepts with their corresponding columns
# This can be useful to calculate characteristic-specific levodopa responses
concepts = {
    'Rigidity': ["NP3RIGLL", "NP3RIGLU", "NP3RIGN", "NP3RIGRL", "NP3RIGRU"], # 3.3 (all elements)
    "Tremor": ["NP3KTRML", "NP3KTRMR", "NP3PTRML", "NP3PTRMR", "NP3RTALJ", "NP3RTALL", "NP3RTALU", "NP3RTARL", "NP3RTARU", "NP3RTCON"], # 3.15 + 3.16 + 3.17 + 3.18
    "Gait_and_Posture": ["NP3RISNG", "NP3GAIT", "NP3FRZGT", "NP3PSTBL", "NP3POSTR"], # 3.9 + 3.10 + 3.11 + 3.12 + 3.13
    "Bradykinesia": ["NP3FTAPR", "NP3FTAPL", "NP3HMOVR", "NP3HMOVL", "NP3PRSPR", "NP3PRSPL", "NP3TTAPR", "NP3TTAPL", "NP3LGAGR", "NP3LGAGL", "NP3BRADY"], # 3.4 + 3.5 + 3.6 + 3.7 + 3.8 + 3.14
    'All_MDS3': ["NP3SPCH", "NP3FACXP", "NP3RIGN", "NP3RIGRU", "NP3RIGLU", "NP3RIGRL", "NP3RIGLL",
    "NP3FTAPR", "NP3FTAPL", "NP3HMOVR", "NP3HMOVL", "NP3PRSPR", "NP3PRSPL", "NP3TTAPR",
    "NP3TTAPL", "NP3LGAGR", "NP3LGAGL", "NP3RISNG", "NP3GAIT", "NP3FRZGT", "NP3PSTBL",
    "NP3POSTR", "NP3BRADY", "NP3PTRMR", "NP3PTRML", "NP3KTRMR", "NP3KTRML", "NP3RTARU",
    "NP3RTALU", "NP3RTARL", "NP3RTALL", "NP3RTALJ", "NP3RTCON"]}

In [None]:
# Function to compute the sum, considering NaN
def sum_with_nan(series):
    if series.isna().any():
        return np.nan
    else:
        return series.sum()

concepts_list = []

# Add new columns with the sum of values for each concept
for concept, columns in concepts.items():
    sum_column = concept
    concepts_list.append(sum_column)
    MDS3[sum_column] = MDS3[columns].apply(sum_with_nan, axis=1)

In [None]:
# Let's check the overall distribution of the PDSTATE (situation in which the patient was examined)
MDS3['PDSTATE'].value_counts()

Check how many entries (PATNO and EVENT_ID pairs) have at least 2 MDS-UPDRS Part III evaluations (typically one in the OFF and one in the ON state)
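The counting idea can be sketched with `groupby(...).size()` on made-up rows. The values below are hypothetical and are only there to show the shape of the result.

```python
import pandas as pd

# Toy exam table (hypothetical values): visit V04 of patient 1 has two exams
exams = pd.DataFrame({
    "PATNO": [1, 1, 1, 2],
    "EVENT_ID": ["BL", "V04", "V04", "BL"],
    "PDSTATE": [None, "OFF", "ON", "ON"],
})

# Number of exams per patient and visit
exam_counts = exams.groupby(["PATNO", "EVENT_ID"]).size()

# Keep only the visits with at least two exams
paired_visits = exam_counts[exam_counts >= 2]
print(paired_visits)
```

Only the visit with two exams remains, which is the kind of entry that can later be used to compare OFF and ON scores.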

In [None]:
# Group by PATNO and EVENT_ID and filter groups with more than one entry
duplicates = MDS3.groupby(['PATNO', 'EVENT_ID']).filter(lambda x: len(x) > 1)

# Extract unique PATNO and EVENT_ID pairs with values in the specified columns
result_test = duplicates.groupby(['PATNO', 'EVENT_ID']).agg({
    'PDSTATE': lambda x: tuple(x), # Which functional state is the participant currently in?
    'HRPOSTMED': lambda x: tuple(x), # Hours between last dose of PD medication and NUPDRS3 exam
    'EXAMTM': lambda x: tuple(x), # Time of NUPDRS3 exam
    'HRDBSON': lambda x: tuple(x), # Hours between DBS device turned on and NUPDRS3 exam
    'DBSYN': lambda x: tuple(x), # Does participant have DBS
    'ONOFFORDER': lambda x: tuple(x), # First Part III exam OFF or ON
    'OFFEXAM': lambda x: tuple(x), # OFF exam performed
    'OFFNORSN': lambda x: tuple(x), # Reason OFF exam not performed
    'DBSOFFTM': lambda x: tuple(x), # Time DBS turned off before OFF exam
    'ONEXAM': lambda x: tuple(x), # ON exam performed
    'ONNORSN': lambda x: tuple(x), # Reason ON exam not performed
    'DBSONTM': lambda x: tuple(x), # Time DBS turned on before ON exam
    'PDMEDDT': lambda x: tuple(x), # Date of most recent PD med dose before exam
    'PDMEDTM': lambda x: tuple(x), # Time of most recent PD med dose before exam
}).reset_index()

print('Length of the dataset considering at least 2 evaluations:', len(result_test))
result_test.head()

Check how many entries (PATNO and EVENT_ID pairs) have at least 3 MDS-UPDRS Part III evaluations.

This is unusual, but the number is low. I did this just to check how the data are displayed.

In [None]:
# Group by PATNO and EVENT_ID and filter groups with more than two entries
duplicates = MDS3.groupby(['PATNO', 'EVENT_ID']).filter(lambda x: len(x) > 2)

# Extract unique PATNO and EVENT_ID pairs with values in the specified columns
result_test = duplicates.groupby(['PATNO', 'EVENT_ID']).agg({
    'PDSTATE': lambda x: tuple(x), # Which functional state is the participant currently in?
    'HRPOSTMED': lambda x: tuple(x), # Hours between last dose of PD medication and NUPDRS3 exam
    'HRDBSON': lambda x: tuple(x), # Hours between DBS device turned on and NUPDRS3 exam
    'DBSYN': lambda x: tuple(x), # Does participant have DBS
    'ONOFFORDER': lambda x: tuple(x), # First Part III exam OFF or ON
    'OFFEXAM': lambda x: tuple(x), # OFF exam performed
    'OFFNORSN': lambda x: tuple(x), # Reason OFF exam not performed
    'DBSOFFTM': lambda x: tuple(x), # Time DBS turned off before OFF exam
    'ONEXAM': lambda x: tuple(x), # ON exam performed
    'ONNORSN': lambda x: tuple(x), # Reason ON exam not performed
    'DBSONTM': lambda x: tuple(x), # Time DBS turned on before ON exam
    'PDMEDDT': lambda x: tuple(x), # Date of most recent PD med dose before exam
    'PDMEDTM': lambda x: tuple(x), # Time of most recent PD med dose before exam
}).reset_index()

print('Length of the dataset considering more than 2 evaluations:', len(result_test))
result_test.head()

### Missing correction

Missing values in the MDS-UPDRS can be written as 101 (Unable to Rate). Let's identify those and treat them as NaN, so that our analysis is not biased by those high numbers.
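A minimal sketch of this replacement on made-up data is shown below. It restricts the replacement to the item columns, which is a design choice of this sketch: a column such as a total score could legitimately contain the value 101.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): two item columns and a precomputed total
scores = pd.DataFrame({
    "NP3RIGN": [1, 101, 2],
    "NP3GAIT": [0, 1, 101],
    "TOTAL": [101, 50, 60],  # a legitimate total that happens to equal 101
})

item_cols = ["NP3RIGN", "NP3GAIT"]
# Replace the "Unable to Rate" code only in the item columns
scores[item_cols] = scores[item_cols].replace(101, np.nan)
print(scores)
```

The two item cells coded as 101 become NaN, while the total of 101 is left untouched.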

In [None]:
# Dataset has some "101", which are "Unable to Rate"
MDS3['NP3RIGLL'].value_counts(dropna=False)

In [None]:
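# Forcing the item columns to be numeric (non-numeric entries become NaN)
item_cols = concepts['All_MDS3']
MDS3[item_cols] = MDS3[item_cols].apply(pd.to_numeric, errors='coerce')

# Converting "Unable to Rate" (101) to NaN in the item columns only,
# so other columns that may legitimately contain 101 are untouched
MDS3[item_cols] = MDS3[item_cols].replace(101, np.nan)

# Recomputing the concept sums, which were first calculated before this correction
for concept, columns in concepts.items():
    MDS3[concept] = MDS3[columns].apply(sum_with_nan, axis=1)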

# Checking
MDS3['NP3RIGLL'].value_counts(dropna=False)

### DBS categories

Some patients have undergone DBS, which is recorded in the MDS-UPDRS Part III dataset (column DBSYN). In this code, we want to generate separate values for evaluations with DBS and evaluations without DBS

In [None]:
print('Length of the dataset before removing DBS:', len(MDS3))
MDS3_nodbs = MDS3[MDS3['DBSYN'].isin([0, np.nan])].reset_index(drop=True)
print('Length of the dataset after removing DBS:', len(MDS3_nodbs))

In [None]:
print('Length of the dataset before subsetting by DBS:', len(MDS3))
MDS3_dbs = MDS3[MDS3['DBSYN'].isin([1])].reset_index(drop=True)
print('Length of the dataset after subsetting by DBS:', len(MDS3_dbs))

## Calculating response (no DBS)

In [None]:
# Group by PATNO and EVENT_ID and keep groups with exactly two entries
duplicates = MDS3_nodbs.groupby(['PATNO', 'EVENT_ID']).filter(lambda x: len(x) == 2)

# Extract unique PATNO and EVENT_ID pairs with values in the specified columns
result = duplicates.groupby(['PATNO', 'EVENT_ID']).agg({
    'PDSTATE': lambda x: tuple(x), # Which functional state is the participant currently in?
    'HRPOSTMED': lambda x: tuple(x), # Hours between last dose of PD medication and NUPDRS3 exam
    'ONOFFORDER': lambda x: tuple(x), # First Part III exam OFF or ON
    'OFFEXAM': lambda x: tuple(x), # OFF exam performed
    'ONEXAM': lambda x: tuple(x), # ON exam performed
    'Rigidity': lambda x: tuple(x), # Rigidity
    'Tremor': lambda x: tuple(x), # Tremor
    'Gait_and_Posture': lambda x: tuple(x), # Gait and Posture
    'Bradykinesia': lambda x: tuple(x), # Bradykinesia
    'All_MDS3': lambda x: tuple(x), # MDS_3 Complete
}).reset_index()

print('Length of the dataset considering exactly 2 evaluations:', len(result))
result.head()

Main function to calculate the response. It uses the tuples of OFF and ON scores to calculate these responses, then generates new columns detailing the actual responses. There are also other columns that help us better understand what is happening, such as "Time_since_levodopa", which helps us be sure that the responses are calculated as OFF versus ON, and not the opposite.

The lowest HRPOSTMED (hours after last medication) is used to define the ON state

In [None]:
# Define the function to calculate the response
def calculate_response(on, off):
    if pd.isna(on) or pd.isna(off) or off == 0:
        return np.nan
    return ((off - on) / off) * 100

# Define a function to round values and handle errors
def round_to_int(value):
    try:
        return int(round(value))
    except (ValueError, TypeError):
        return np.nan

# Function to process the dataset
def process_dataset(df, concepts_list):
    new_columns = []
    df['Time_since_levodopa'] = df['HRPOSTMED'].apply(lambda x: min(x) if not any(pd.isna(v) for v in x) else np.nan) * 60  # Convert to minutes
    df['Time_since_levodopa'] = df['Time_since_levodopa'].apply(round_to_int)
    for var in concepts_list:
        new_col_name = f'{var}_resp'
        new_columns.append(new_col_name)
        df[new_col_name] = df.apply(lambda row: calculate_response(
            row[var][0] if not any(pd.isna(v) for v in row['HRPOSTMED']) and row['HRPOSTMED'][0] < row['HRPOSTMED'][1] else row[var][1],  # on
            row[var][1] if not any(pd.isna(v) for v in row['HRPOSTMED']) and row['HRPOSTMED'][0] < row['HRPOSTMED'][1] else row[var][0]   # off
        ) if not any(pd.isna(v) for v in row['HRPOSTMED']) else np.nan, axis=1)
    return df, new_columns

# Assuming result is your DataFrame
levodopa_response, newcols = process_dataset(result, concepts_list)

# Display the processed DataFrame
levodopa_response.head()
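
To make the formula concrete, here is a minimal sketch on made-up numbers (not PPMI data). It follows the same assumption as the function above, namely that the evaluation with the lower HRPOSTMED is the ON state. The helper name `pair_response` is hypothetical and only used for this illustration.

```python
import numpy as np
import pandas as pd

def pair_response(scores, hours_post_med):
    """Percent improvement from OFF to ON for one pair of evaluations.

    scores and hours_post_med are 2-tuples given in the same row order.
    Assumption: the evaluation with the lower HRPOSTMED is the ON state.
    """
    if any(pd.isna(v) for v in hours_post_med) or any(pd.isna(v) for v in scores):
        return np.nan
    on_idx = 0 if hours_post_med[0] < hours_post_med[1] else 1
    on, off = scores[on_idx], scores[1 - on_idx]
    if off == 0:
        return np.nan  # avoid dividing by zero
    return (off - on) / off * 100

# OFF score 40 (14 h after medication), ON score 20 (1 h after medication)
example_improved = pair_response(scores=(40, 20), hours_post_med=(14.0, 1.0))

# OFF score 20, ON score 25: a paradoxical worsening gives a negative response
example_worse = pair_response(scores=(20, 25), hours_post_med=(12.0, 0.5))

print(example_improved, example_worse)
```

A positive value means the score improved after medication, and a negative value means the ON score was higher (worse) than the OFF score.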

There are some negative values, but they are a minority of the data. Some of the largest ones may be typos, and others just a slight paradoxical worsening or variation due to the clinician's judgment.

In [None]:
levodopa_response[newcols].describe(include='all')

Let's check these patients as an example

In [None]:
# First, let's have a general idea of some patients with negative responses
print(len(levodopa_response[levodopa_response['All_MDS3_resp'] < 0]))
levodopa_response[levodopa_response['All_MDS3_resp'] < 0].head(10)[['PDSTATE','HRPOSTMED','All_MDS3','All_MDS3_resp']]

In [None]:
# Second, let's have a general idea of some patients with EXTREME negative responses
print(len(levodopa_response[levodopa_response['All_MDS3_resp'] < -100]))
levodopa_response[levodopa_response['All_MDS3_resp'] < -100].head(10)[['PDSTATE','HRPOSTMED','All_MDS3','All_MDS3_resp']]

As you can see, most patients with responses between -100 and 0 have just a mild paradoxical response or variation due to clinical judgment. However, most patients with responses below -100 probably have typos that invalidate the calculation. Fortunately, there are only a few.

I will not remove those, but **I highly suggest you take that into account in your analysis**
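
One possible way to take this into account without deleting anything is to add a flag column and decide later how to treat each group. The sketch below uses made-up values; the column name `resp_flag` and the -100 cut-off are my own assumptions for illustration, not something defined by PPMI.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the 'All_MDS3_resp' column (made-up values)
toy = pd.DataFrame({'All_MDS3_resp': [45.0, -12.5, -250.0, np.nan, 0.0]})

# Hypothetical cut-off: responses below -100% are treated as suspicious
SUSPICIOUS_BELOW = -100

# Conditions are checked in order, so 'suspicious' wins over 'negative'
toy['resp_flag'] = np.select(
    [toy['All_MDS3_resp'].isna(),
     toy['All_MDS3_resp'] < SUSPICIOUS_BELOW,
     toy['All_MDS3_resp'] < 0],
    ['missing', 'suspicious', 'negative'],
    default='ok',
)
print(toy)
```

With a flag like this, a sensitivity analysis can be run with and without the suspicious rows.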

Exporting

In [None]:
# Exporting
levodopa_response.to_csv('data/levodopa_challenge_no_DBS.csv', index=False)

## Calculating response (DBS)

In [None]:
# Group by PATNO and EVENT_ID and keep only the groups with exactly two entries
duplicates = MDS3_dbs.groupby(['PATNO', 'EVENT_ID']).filter(lambda x: len(x) == 2)

# Extract unique PATNO and EVENT_ID pairs with values in the specified columns
result = duplicates.groupby(['PATNO', 'EVENT_ID']).agg({
    'PDSTATE': lambda x: tuple(x), # Which functional state is the participant currently in?
    'HRPOSTMED': lambda x: tuple(x), # Hours between last dose of PD medication and NUPDRS3 exam
    'EXAMTM': lambda x: tuple(x), # Time of NUPDRS3 exam
    'ONOFFORDER': lambda x: tuple(x), # First Part III exam OFF or ON
    'OFFEXAM': lambda x: tuple(x), # OFF exam performed
    'ONEXAM': lambda x: tuple(x), # ON exam performed
    'DBSYN': lambda x: tuple(x), # Does participant have DBS
    'DBSOFFTM': lambda x: tuple(x), # Time DBS turned off before OFF exam
    'DBSONTM': lambda x: tuple(x), # Time DBS turned on before ON exam
    'HRDBSON': lambda x: tuple(x), # Hours between DBS device turned on and NUPDRS3 exam
    'Rigidity': lambda x: tuple(x), # Rigidity
    'Tremor': lambda x: tuple(x), # Tremor
    'Gait_and_Posture': lambda x: tuple(x), # Gait and Posture
    'Bradykinesia': lambda x: tuple(x), # Bradykinesia
    'All_MDS3': lambda x: tuple(x), # MDS_3 Complete
}).reset_index()

print('Length of the dataset considering exactly 2 evaluations:', len(result))
result.head()

In [None]:
# Define the function to calculate the response
def calculate_response(on, off):
    if pd.isna(on) or pd.isna(off) or off == 0:
        return np.nan
    return ((off - on) / off) * 100

# Define a function to round values and handle errors
def round_to_int(value):
    try:
        return int(round(value))
    except (ValueError, TypeError):
        return np.nan

# Function to process the dataset
def process_dataset(df, concepts_list):
    new_columns = []
    df['Time_since_levodopa'] = df['HRPOSTMED'].apply(lambda x: min(x) if not any(pd.isna(v) for v in x) else np.nan) * 60  # Convert to minutes
    df['Time_since_levodopa'] = df['Time_since_levodopa'].apply(round_to_int)
    for var in concepts_list:
        new_col_name = f'{var}_resp'
        new_columns.append(new_col_name)
        df[new_col_name] = df.apply(lambda row: calculate_response(
            row[var][0] if not any(pd.isna(v) for v in row['HRPOSTMED']) and row['HRPOSTMED'][0] < row['HRPOSTMED'][1] else row[var][1],  # on
            row[var][1] if not any(pd.isna(v) for v in row['HRPOSTMED']) and row['HRPOSTMED'][0] < row['HRPOSTMED'][1] else row[var][0]   # off
        ) if not any(pd.isna(v) for v in row['HRPOSTMED']) else np.nan, axis=1)
    return df, new_columns

# Assuming result is your DataFrame
levodopa_response, newcols = process_dataset(result, concepts_list)

# Keep only the evaluations for which the time since levodopa could be calculated
filtered_df = levodopa_response[levodopa_response['Time_since_levodopa'].notna()]

# Count the unique patients (PATNO) in the subsetted dataset
unique_patients_levodopa = filtered_df['PATNO'].nunique()
print('Length of entire dataset:', len(levodopa_response))
print('Length of subsetted dataset:', len(filtered_df))
print('Number of unique patients using DBS that took levodopa in a challenge:', unique_patients_levodopa)

# Showing the subsetted dataset
filtered_df.head()

In [None]:
# Exporting
levodopa_response.to_csv('data/levodopa_challenge_DBS.csv', index=False)

# Medical Conditions

Several medical conditions are associated with a higher PD risk and/or progression (examples: https://pubmed.ncbi.nlm.nih.gov/36865411/ and https://pubmed.ncbi.nlm.nih.gov/33682937/). So having a way to understand in more detail each patient's diagnosis may be useful for correlation analyses.

**Necessary PPMI datasets:** Medical Conditions Log and MDS-UPDRS Part III Treatment Determination and Part III: Motor Examination

**Last Update:** February 9, 2025

## Reading

Reading MDS data to use as a surrogate for the timepoints

In [None]:
# Using MDS3 as a timepoint proxy
MDS3 = pd.read_csv('data/MDS-UPDRS_Part_III_09Feb2025.csv')
print('Length of the dataset:', len(MDS3))
MDS3.head()

In [None]:
# Reading the medical conditions dataset
conditions = pd.read_csv('data/Medical_Conditions_Log_09Feb2025.csv')
print('Length of the dataset:', len(conditions))
conditions.head()

This and other datasets do not organize their information by EVENT_ID. Instead, they provide "INFODT" (Assessment Date), "RESYR" (Year of Resolution), "MHDIAGYR" (Year of Diagnosis), "MHDIAGDT" (Date of Diagnosis) and "RESOLVD" (Resolved).

The most logical way to extract this information, I think, is to check whether the condition was present at the assessment date of each EVENT_ID, and then label whether or not the patient had the condition at that timepoint (BL, V02, V04, etc.).

So, for a patient to count as having a condition at a given EVENT_ID, two criteria must be met: (1) the diagnosis ("MHTERM") must have a diagnosis date ("MHDIAGDT") earlier than or equal to the date of the EVENT_ID, and (2) the condition must not have been resolved ("RESOLVD") by that date.
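
The two criteria above can be sketched as a small function. This is only an illustration on made-up dates: the helper name `condition_active` is hypothetical, and it assumes dates are stored as 'MM/YYYY' strings and that a RESOLVD value of 0 means "not resolved".

```python
import pandas as pd

def condition_active(diag_date, visit_date, resolved_flag, resolved_date=None):
    """Return True if a condition counts as present at a visit.

    Assumptions: dates are 'MM/YYYY' strings; resolved_flag mirrors RESOLVD
    (0 = not resolved); resolved_date is only needed when resolved_flag != 0.
    """
    diag = pd.to_datetime(diag_date, format='%m/%Y')
    visit = pd.to_datetime(visit_date, format='%m/%Y')

    # Criterion 1: diagnosed on or before the visit
    if pd.isna(diag) or pd.isna(visit) or diag > visit:
        return False

    # Criterion 2: not yet resolved at the visit
    if resolved_flag == 0:
        return True
    if resolved_date is None or pd.isna(resolved_date):
        return False  # marked as resolved, but no date to prove it was still active
    return pd.to_datetime(resolved_date, format='%m/%Y') >= visit

print(condition_active('03/2015', '06/2018', 0))             # diagnosed before, unresolved
print(condition_active('03/2019', '06/2018', 0))             # diagnosed after the visit
print(condition_active('03/2015', '06/2018', 1, '01/2017'))  # resolved before the visit
print(condition_active('03/2015', '06/2018', 1, '01/2020'))  # resolved after the visit
```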

Diabetes example

In [None]:
# Getting rows with the diagnosis we want
elements = ['diabetes']

# Converting 'elements' to lowercase to ensure case-insensitive matching
elements_lower = [element.lower() for element in elements]

# Selecting the entries that meet at least one of the criteria
tempdf = conditions[conditions['MHTERM'].astype(str).str.lower().apply(lambda x: any(element in x for element in elements_lower))]
print('Number of entries with the desired condition:', len(tempdf))
print('Different values of the obtained dataset:', list(set(tempdf['MHTERM']))) # Printing without duplicates
tempdf.head()

## Definitions

For this code, we will use the Charlson comorbidity index as an example (https://www.mdcalc.com/calc/3917/charlson-comorbidity-index-cci) and extract the conditions present in that score. Osteoporosis was also added as a test.

Of course, you can adapt this to any condition you like. You just need to think about all the different ways it could be written in the dataset in order to extract it.
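
When choosing terms, short abbreviations deserve extra care, because a plain substring search can match unrelated words. The sketch below shows the difference on made-up strings (not PPMI entries); `term_matches` is a hypothetical helper, and whole-word matching is just one possible way to reduce false positives.

```python
import re

def term_matches(term, text, whole_word=False):
    """Case-insensitive search for `term` inside `text`."""
    term, text = term.lower(), text.lower()
    if whole_word:
        # \b marks a word boundary, so 'mi' will not match inside 'migraine'
        return re.search(rf'\b{re.escape(term)}\b', text) is not None
    return term in text

# Made-up free-text entries
samples = ['Migraine', 'Acute MI in 2010', 'Vitamin D deficiency']

substring_hits = [s for s in samples if term_matches('MI', s)]
whole_word_hits = [s for s in samples if term_matches('MI', s, whole_word=True)]

print('Substring search:', substring_hits)
print('Whole-word search:', whole_word_hits)
```

Printing the distinct matched strings for each term, as done in the diabetes example, is a quick way to spot this kind of problem before running the full extraction.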

In [None]:
# List of Charlson Comorbidity Index conditions
charlson_conditions = {
    'Myocardial Infarction': ['myocardial infarction', 'heart attack', 'MI'],
    'Congestive Heart Failure': ['heart failure', 'CHF', 'congestive heart failure'],
    'Peripheral Vascular Disease': ['peripheral vascular disease', 'PVD', 'peripheral artery disease'],
    'Cerebrovascular Disease': ['cerebrovascular disease', 'stroke', 'CVA', 'cerebrovascular accident'],
    'Dementia': ['dementia', 'Alzheimer\'s disease', 'alzheimer'],
    'Chronic Pulmonary Disease': ['chronic pulmonary disease', 'COPD', 'chronic obstructive pulmonary disease', 'emphysema', 'chronic bronchitis'],
    'Connective Tissue Disease': ['connective tissue disease', 'lupus', 'rheumatoid arthritis', 'systemic lupus erythematosus', 'SLE'],
    'Peptic Ulcer Disease': ['peptic ulcer disease', 'PUD', 'stomach ulcer', 'gastric ulcer'],
    'Mild Liver Disease': ['mild liver disease', 'chronic hepatitis', 'hepatitis B', 'hepatitis C'],
    'Diabetes without Complication': ['diabetes', 'diabetes mellitus'],
    'Diabetes with Complication': ['diabetic retinopathy', 'diabetic nephropathy', 'diabetes with complications', 'diabetic neuropathy'],
    'Hemiplegia or Paraplegia': ['hemiplegia', 'paraplegia', 'paralysis'],
    'Renal Disease': ['renal disease', 'chronic kidney disease', 'CKD', 'kidney failure', 'chronic renal failure', 'reduced kidney function'],
    'Cancer (non-metastatic)': ['cancer', 'tumor', 'carcinoma', 'malignancy'],
    'Leukemia': ['leukemia', 'blood cancer'],
    'Lymphoma': ['lymphoma', 'lymphatic cancer', 'Hodgkin\'s lymphoma', 'non-Hodgkin\'s lymphoma'],
    'Moderate or Severe Liver Disease': ['cirrhosis', 'severe liver disease', 'liver cirrhosis', 'end-stage liver disease'],
    'Metastatic Solid Tumor': ['metastatic cancer', 'metastasis',  'metastatic', 'stage IV', 'advanced cancer'],
    'AIDS': ['AIDS', 'HIV', 'acquired immunodeficiency syndrome', 'human immunodeficiency virus'],
    'Osteoporosis':['osteoporosis']}

## Running

Working code that evaluates each condition at each timepoint

In [None]:
# Convert 'MHTERM' to lowercase to ensure case-insensitive matching
conditions['MHTERM_lower'] = conditions['MHTERM'].str.lower()

# Merge conditions and events on 'PATNO'
merged_df = pd.merge(MDS3, conditions, on='PATNO', suffixes=('_event', '_condition'))

# Initialize an empty list to collect results
results = []

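# Function to check if any condition term is in the (already lowercased) disease name.
# Terms are lowercased before comparing; all-uppercase abbreviations (e.g. 'MI', 'CHF')
# are matched as whole words so that 'mi' does not also match 'migraine'.
import re

def term_in_name(term, disease_name):
    if term.isupper():
        return re.search(rf'\b{re.escape(term.lower())}\b', disease_name) is not None
    return term.lower() in disease_name

def check_conditions(disease_name):
    if not isinstance(disease_name, str):
        return []
    conditions_found = []
    for condition, terms in charlson_conditions.items():
        if any(term_in_name(term, disease_name) for term in terms):
            conditions_found.append(condition)
    return conditions_found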

# Determine the active status of each condition for each timepoint
for index, row in merged_df.iterrows():
    diag_date = pd.to_datetime(row['MHDIAGDT'], format='%m/%Y')
    info_date = pd.to_datetime(row['INFODT_event'], format='%m/%Y')
    resolved_date = pd.to_datetime(row['RESDT'], format='%m/%Y') if pd.notna(row['RESDT']) else None

    # Initialize conditions for this patient and event
    patient_condition = {'PATNO': row['PATNO'], 'EVENT_ID': row['EVENT_ID_event']}
    for condition in charlson_conditions.keys():
        patient_condition[condition] = 0

    # Calculate years since diagnosis for "BL" and "SC" timepoints, only if the diagnosis was discovered on or before the timepoint
    if row['EVENT_ID_event'] in ['BL', 'SC']:
        if diag_date <= info_date:
            years_since_diag = (info_date.year - diag_date.year) + (info_date.month - diag_date.month) / 12.0
            conditions_found = check_conditions(row['MHTERM_lower'])
            for condition in conditions_found:
                patient_condition[condition] = years_since_diag
    else:
        # Check if the diagnosis was active at the timepoint
        if (diag_date <= info_date) and (row['RESOLVD'] == 0 or (resolved_date and resolved_date >= info_date)):
            conditions_found = check_conditions(row['MHTERM_lower'])
            for condition in conditions_found:
                patient_condition[condition] = 1

    # Collect the result for this patient and event
    results.append(patient_condition)

# Create a DataFrame from the collected results
patients_conditions = pd.DataFrame(results)

# The loop above produces one row per (timepoint, logged condition) pair, so the same PATNO/EVENT_ID
# appears several times, and each row only carries the conditions matched by a single log entry.
# Collapse to one row per PATNO/EVENT_ID, taking the maximum of each condition column,
# so that every condition found in any log entry is kept.
columns_to_check = list(charlson_conditions.keys())

patients_conditions_correct = (
    patients_conditions
    .groupby(['PATNO', 'EVENT_ID'], as_index=False)[columns_to_check]
    .max()
)

# Display the first few rows of the resulting DataFrame
patients_conditions_correct.head(5)

In [None]:
patients_conditions_correct.describe()

In [None]:
# Identifying which patients ever had a diagnosis of osteoporosis
# (values are years since diagnosis at BL/SC and 1 at later timepoints, so any value above 0 is positive)
osteoporosis_rows = patients_conditions_correct[patients_conditions_correct['Osteoporosis'] > 0]
print('Number of patients with osteoporosis:', osteoporosis_rows['PATNO'].nunique())
osteoporosis_rows.head()

## Testing

Doing some testing to confirm the accuracy of these measures

In [None]:
# Reshape the DataFrame to long format (one row per PATNO, EVENT_ID and condition)
long_df = pd.melt(patients_conditions_correct, id_vars=['PATNO', 'EVENT_ID'], var_name='Condition', value_name='Status')

# Group by PATNO and Condition, then check whether any / all timepoints are positive
grouped = long_df.groupby(['PATNO', 'Condition'])['Status'].agg(['any', 'all']).reset_index()

# Find PATNOs whose status for the same condition is positive at some timepoints and absent at others
testing = grouped[(grouped['any'] == True) & (grouped['all'] == False)]
testing.head(10)
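
Since I can't show real outputs, here is a toy version of the same check on made-up data. It only assumes a table with one row per PATNO and EVENT_ID and one column per condition; the names `toy`, `summary` and `changed` are just for this illustration.

```python
import pandas as pd

# Made-up table: patient 1 develops osteoporosis after BL, patient 2 has nothing
toy = pd.DataFrame({
    'PATNO': [1, 1, 2, 2],
    'EVENT_ID': ['BL', 'V04', 'BL', 'V04'],
    'Osteoporosis': [0, 1, 0, 0],
    'Dementia': [3.5, 1, 0, 0],
})

# Long format: one row per PATNO, EVENT_ID and condition
long_toy = toy.melt(id_vars=['PATNO', 'EVENT_ID'], var_name='Condition', value_name='Status')

# 'any' = positive at some timepoint, 'all' = positive at every timepoint
summary = long_toy.groupby(['PATNO', 'Condition'])['Status'].agg(['any', 'all']).reset_index()

# Positive at some timepoints but not all: the status changed during follow-up
changed = summary[summary['any'] & ~summary['all']]
print(changed)
```

Here only patient 1's osteoporosis is listed, because it is absent at BL and present at V04, while the dementia column is positive at both timepoints.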

For privacy reasons, I can't share individual patients' data, not even as a comment. I encourage you to pick some PATNOs, look up the description of their conditions (see code above), and confirm in the original dataset that the code was able to extract them!

Exporting

In [None]:
# Exporting
patients_conditions_correct.to_csv('data/Medical Conditions.csv', index=False)

# Medications

Several medications are associated with a lower/higher PD risk and/or progression. So having a way to understand in more detail each patient's non-PD medication may be useful for correlation analyses.

**Necessary PPMI datasets:** Concomitant Medication Log and MDS-UPDRS Part III Treatment Determination and Part III: Motor Examination

**Last Update:** February 9, 2025

**Useful links to find all the different names a medication can have:**

Link 1: https://go.drugbank.com/

Link 2: https://www.rxlist.com/search/rxl/exenat


## Reading

Reading MDS data to use as a surrogate for the timepoints

In [None]:
# Using MDS3 as a timepoint proxy
MDS3 = pd.read_csv('data/MDS-UPDRS_Part_III_09Feb2025.csv')
print('Length of the dataset:', len(MDS3))
MDS3.head()

In [None]:
# Reading the medication dataset
medications = pd.read_csv('data/Concomitant_Medication_Log_09Feb2025.csv')
medications.head(5)

Looking at an example drawn from GLP-1 agonists

In [None]:
# Getting rows with the medication we want
elements = ['liraglutide', 'victoza', 'saxenda']

# Converting 'elements' to lowercase to ensure case-insensitive matching
elements_lower = [element.lower() for element in elements]

# Selecting the entries that meet at least one of the criteria
tempdf = medications[medications['CMTRT'].astype(str).str.lower().apply(lambda x: any(element in x for element in elements_lower))]
print('Number of entries with the desired medication:', len(tempdf))
print('Different values of the obtained dataset:', list(set(tempdf['CMTRT']))) # Printing without duplicates
tempdf.head()

## Creating doses for medications

There are multiple ways to describe a medication dosage. This part of the code interprets the frequency strings in an organized manner and consolidates them into a single daily dose.

In [None]:
# Identifying the different patterns used to report the dosage frequency
top_elements = medications['CMDOSFRQ'].value_counts().index[:100] # The 100 most common entries

# Create a new dataset with one example of each of the most common entries
new_df = medications[medications['CMDOSFRQ'].isin(top_elements)].drop_duplicates(subset=['CMDOSFRQ'])

# Show
new_df['CMDOSFRQ'].tolist()

In [None]:
# Doses dict setting
# This dict lists the most commonly used terms to describe each regimen
# The keys are the factors (doses per day) that will be used to multiply the dose
# The values are the terms that represent each regimen

daily_dose = {
    '1': ['QD', 'SD', 'OD', 'QHS', 'DAILY', '1X', 'HS',
          'X1', 'QAM', 'QPM', '1XQD', 'NOCTE', '1 X QD',
          ' QD', 'QS', 'QDHS'],  # Once daily
    '2': ['BID', '2X', 'BD', '2 X DAILY', 'TT OD'],  # Twice daily
    '3': ['TID', '3/DAY', '3X', 'TDS'],  # Thrice daily (TDS = three times a day)
    '4': ['QID', 'QDS', '4/DAY', 'Q6H', '4X', '4XD', '4XQD'],  # Four times a day
    '6': ['Q4H', '6XD', '6XDAY'],  # Six times a day
    '0.5': ['QOD', 'EOD', 'QAD', 'Q48H', 'ALT DAY', 'Q2 DAYS', 'Q 2 DAYS'],  # Every two days (QAD = every other day)
    '0.714': ['5X WEEK', '5XWK'],  # Five times a week (5/7)
    '0.429': ['TIW', '3X/WEEK', '3X WEEK', '3X A WEEK'],  # Thrice a week (3/7)
    '0.2857': ['BIW', '2/WEEK', '2X WEEK'],  # Twice a week (2/7)
    '0.1429': ['QW', 'QWK', 'WEEKLY', 'QWEEK', 'X1/WK', '1XWK', 'QIW', '1X WEEK', 'WK', '1/WK',
               'Q WK', '1XWEEK', '1/WEEK', 'Q1WK', 'QWEEKLY', '1X WEEKLY'],  # Weekly (1/7)
    '0.0714': ['Q2WK'], # Every two weeks (1/14)
    '0.0333': ['MONTHLY', 'QM', '1XMONTH', '1/MONTH', 'QMONTH', 'Q4WK', '1X MONTH', 'MONTH'],  # Monthly (1/30)
    '0.0111': ['Q3MONTH', 'Q3MON', 'Q 3 MONTHS', 'Q3 MOS', 'Q3M','EVERY 3 MO', 'Q3MOS', 'Q3 MONTH'],  # Every three months (1/90)
    '0.0056': ['Q6M', 'Q6MONTHS', 'Q 6 MONTHS', 'Q6MTHS']  # Every 6 months (1/180)
}

In [None]:
# Convert CMDOSFRQ to lowercase
medications['CMDOSFRQ_lower'] = medications['CMDOSFRQ'].str.lower()

# Function to find the multiplication factor
def get_multiplication_factor(dosage_frequency):
    for factor, terms in daily_dose.items():
        if dosage_frequency in [term.lower() for term in terms]:
            return float(factor)
    return None  # Default factor if no match is found

# Apply the function to each row
medications['dose_factor'] = medications['CMDOSFRQ_lower'].apply(get_multiplication_factor)

# Calculate the final dose
medications['final_dose'] = medications['CMDOSE'] * medications['dose_factor']

# Drop the helper column
medications = medications.drop(columns=['CMDOSFRQ_lower'])

# Display the result
medications[['CMTRT','CMDOSE','CMDOSU','CMDOSFRQ','dose_factor','final_dose']].head(5)
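
Because the lookup above needs an exact match, small differences such as extra spaces leave the dose factor empty. The sketch below, on made-up strings, shows one possible way to normalise the text before the lookup and to list what is still unmatched. The names `freq_to_factor` and `normalise_freq` are hypothetical, and the small dictionary is just an illustrative subset of the one above.

```python
import re
import pandas as pd

# Illustrative subset of the frequency dictionary (term -> doses per day)
freq_to_factor = {'QD': 1.0, 'BID': 2.0, 'TID': 3.0, 'QW': 1 / 7}

def normalise_freq(raw):
    """Upper-case, trim and collapse repeated whitespace; missing values stay missing."""
    if pd.isna(raw):
        return raw
    return re.sub(r'\s+', ' ', str(raw)).strip().upper()

# Made-up frequency entries
toy = pd.DataFrame({'CMDOSFRQ': [' qd', 'Bid ', 'T I D', None, 'qw']})
toy['freq_norm'] = toy['CMDOSFRQ'].map(normalise_freq)
toy['dose_factor'] = toy['freq_norm'].map(freq_to_factor)

# Entries that were filled in but could not be translated into a factor
unmatched = toy.loc[toy['CMDOSFRQ'].notna() & toy['dose_factor'].isna(), 'CMDOSFRQ'].value_counts()
print(toy)
print(unmatched)
```

Reviewing the unmatched list from time to time is a simple way to decide which new terms are worth adding to the dictionary.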

Now let's do some testing with groups of medications

In [None]:
# Combined dictionary of medications with prefixes
# All terms are written in lowercase because they are compared against the lowercased medication name
medications_dict = {
    'glp1_Exenatide': ['exenatide', 'byetta', 'bydureon'],
    'glp1_Liraglutide': ['liraglutide', 'victoza', 'saxenda', 'xultophy'],
    'glp1_Lixisenatide': ['lixisenatide', 'adlyxin', 'lyxumia', 'soliqua'],
    'glp1_Dulaglutide': ['dulaglutide', 'trulicity'],
    'glp1_Semaglutide': ['semaglutide', 'ozempic', 'rybelsus', 'wegovy'],
    'glp1_Albiglutide': ['albiglutide', 'tanzeum', 'eperzan'],
    'glp1_Efpeglenatide': ['efpeglenatide'],
    'glp1_Tirzepatide': ['tirzepatide', 'mounjaro', 'zepbound']}

## Running the function

This function will identify, at each specific timepoint, if the patient was taking the medication or not. It will also try to calculate the dosage of that specific medication the patient was taking at each timepoint.

In each medication's column, whenever it is positive at BL or SC, the value is the number of years between the medication start date ("STARTDT") and that visit. At later timepoints the value is simply 1 while the medication is active. So, for example, if a patient started liraglutide 2 years before the first BL or SC visit, that column for BL or SC will be 2.
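
As a worked example of the year calculation, here is a minimal sketch on made-up dates. It assumes dates in 'MM/YYYY' format, and the helper name `years_between` is hypothetical.

```python
import pandas as pd

def years_between(start, visit, fmt='%m/%Y'):
    """Years from the medication start to the visit, counted in whole months."""
    start_dt = pd.to_datetime(start, format=fmt)
    visit_dt = pd.to_datetime(visit, format=fmt)
    return (visit_dt.year - start_dt.year) + (visit_dt.month - start_dt.month) / 12.0

two_years = years_between('01/2016', '01/2018')
half_year = years_between('09/2017', '03/2018')
same_month = years_between('03/2018', '03/2018')

print(two_years, half_year, same_month)
```

Note that a medication started in the same month as the visit gives 0, which cannot be told apart from "not taking the medication" in the output column, so keep that edge case in mind.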

In [None]:
# Convert 'CMTRT' to lowercase to ensure case-insensitive matching
medications['CMTRT_lower'] = medications['CMTRT'].str.lower()

# Merge medications and events on 'PATNO'
merged_df = pd.merge(MDS3, medications, on='PATNO', suffixes=('_event', '_medication'))

# Initialize an empty list to collect results
results = []

# Function to check if any medication term is in the medication name
def check_medications(medication_name):
    if not isinstance(medication_name, str):
        return []
    medications_found = []
    for medication, terms in medications_dict.items():
        if any(term in medication_name for term in terms):
            medications_found.append(medication)
    return medications_found

# Determine the active status and dose of each medication for each timepoint
for index, row in merged_df.iterrows():
    diag_date = pd.to_datetime(row['STARTDT'], format='%m/%Y')
    info_date = pd.to_datetime(row['INFODT'], format='%m/%Y')
    resolved_date = pd.to_datetime(row['STOPDT'], format='%m/%Y') if pd.notna(row['STOPDT']) else None

    # Initialize medications for this patient and event
    patient_medication = {'PATNO': row['PATNO'], 'EVENT_ID': row['EVENT_ID_event']}
    patient_medication_dose = {'PATNO': row['PATNO'], 'EVENT_ID': row['EVENT_ID_event']}
    for medication in medications_dict.keys():
        patient_medication[medication] = 0
        patient_medication_dose[medication + '_dose'] = None

    # Calculate years since the medication start for "BL" and "SC" timepoints, only if it was started on or before the timepoint
    if row['EVENT_ID_event'] in ['BL', 'SC']:
        if diag_date <= info_date:
            years_since_diag = (info_date.year - diag_date.year) + (info_date.month - diag_date.month) / 12.0
            medications_found = check_medications(row['CMTRT_lower'])
            for medication in medications_found:
                patient_medication[medication] = years_since_diag
    else:
        # Check if the medication was active at the timepoint
        if (diag_date <= info_date) and (resolved_date is None or resolved_date >= info_date):
            medications_found = check_medications(row['CMTRT_lower'])
            for medication in medications_found:
                patient_medication[medication] = 1
                patient_medication_dose[medication + '_dose'] = row['final_dose']

    # Collect the result for this patient and event
    results.append({**patient_medication, **patient_medication_dose})

# Create a DataFrame from the collected results
patients_medications = pd.DataFrame(results)

# The loop above produces one row per (timepoint, logged medication) pair, so collapse to one row
# per PATNO/EVENT_ID, keeping the maximum of each column so that no medication found is lost.
# If the same medication is logged more than once at a timepoint, the highest dose is kept.
columns_to_check = list(medications_dict.keys())
dose_columns = [medication + '_dose' for medication in columns_to_check]

# Dose columns were initialised with None, so make them numeric before aggregating
patients_medications[dose_columns] = patients_medications[dose_columns].apply(pd.to_numeric, errors='coerce')

patients_medications_correct = (
    patients_medications
    .groupby(['PATNO', 'EVENT_ID'], as_index=False)[columns_to_check + dose_columns]
    .max()
)

# Display the result
patients_medications_correct.head()

Checking moments in which a patient is taking Liraglutide

In [None]:
# Identifying which patients were ever taking liraglutide
# (values are years since the start at BL/SC and 1 at later timepoints, so any value above 0 is positive)
liraglutide_rows = patients_medications_correct[patients_medications_correct['glp1_Liraglutide'] > 0]
print('Number of patients with liraglutide use:', liraglutide_rows['PATNO'].nunique())
liraglutide_rows.head()

## Testing

Doing some testing to confirm the accuracy of these measures

In [None]:
# Reshape the DataFrame to long format (one row per PATNO, EVENT_ID and medication; dose columns are left out)
long_df = pd.melt(patients_medications_correct, id_vars=['PATNO', 'EVENT_ID'], value_vars=list(medications_dict.keys()), var_name='Medication', value_name='Status')

# Group by PATNO and Medication, then check whether any / all timepoints are positive
grouped = long_df.groupby(['PATNO', 'Medication'])['Status'].agg(['any', 'all']).reset_index()

# Find PATNOs whose status for the same medication is positive at some timepoints and absent at others
testing = grouped[(grouped['any'] == True) & (grouped['all'] == False)]
testing.head(10)

For privacy reasons, I can't share individual patients' data, not even as a comment. I encourage you to pick some PATNOs, look up the description of their medications (see code above), and confirm in the original dataset that the code was able to extract them!

Exporting

In [None]:
# Exporting
patients_medications_correct.to_csv('data/Non PD Medications.csv', index=False)