# Introduction to this Notebook

This Jupyter Notebook contains a series of scripts written in Python by Daniel Teixeira dos Santos, a Data Community Innovator at the Data Community of Practice ([link to my forum account](https://rcop.michaeljfox.org/u/danieltds/summary)). These scripts were written using Google Colab by accessing local files present in my Google Drive that I downloaded from LONI. These files are linked to the MJFF Research Community's GitHub repository ([link here](https://github.com/MJFF-ResearchCommunity/Useful-PPMI-Clinical-Codes)).

The goal of these scripts is to provide researchers with relevant clinical data extracted in a meaningful way from the data that is already available in PPMI. All the necessary input datasets can be obtained [here](https://ida.loni.usc.edu/pages/access/studyData.jsp?project=PPMI) after applying for registration for access to the PPMI data. All outputs from the analyses were removed to comply with privacy and data sharing principles. Some of these scripts were developed with the help of AI tools such as ChatGPT 4o.

This analysis requires two different folders to exist within the main folder. Those are "data" and "priv". The "data" folder is the place where you should store your datasets downloaded from LONI. The priv folder is the one the results will be exported to. These folders will be generated automatically at the beginning of this script, if they don't exist.

# Importing and Setting Paths

In [None]:
import os
import pandas as pd
import numpy as np
import math
import glob
import warnings

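# Automatically find the "Useful PPMI Clinical Codes" directory by walking up from the working directory
START_DIR = os.getcwd()
CURRENT_DIR = START_DIR
while not CURRENT_DIR.endswith("Useful PPMI Clinical Codes") and os.path.dirname(CURRENT_DIR) != CURRENT_DIR:
    CURRENT_DIR = os.path.dirname(CURRENT_DIR)

# Fall back to the working directory if the folder was not found (avoids using the filesystem root)
BASE_DIR = CURRENT_DIR if CURRENT_DIR.endswith("Useful PPMI Clinical Codes") else START_DIR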

# Define paths for "data" and "priv" directories
DATA_DIR = os.path.join(BASE_DIR, "data")
PRIV_DIR = os.path.join(BASE_DIR, "priv")

# Ensure both directories exist, create them if not
for directory in [DATA_DIR, PRIV_DIR]:
    if not os.path.exists(directory):
        os.makedirs(directory)
        print(f"Created missing folder: {directory}")
    else:
        print(f"Found folder: {directory}")

# Ignore persistent warnings
warnings.simplefilter("ignore", UserWarning)

# Configure Pandas for better data visualization
pd.set_option('display.max_rows', 250)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)
pd.options.display.float_format = "{:,.3f}".format

# List available files in both directories
print("Files in data directory:", os.listdir(DATA_DIR))
print("Files in priv directory:", os.listdir(PRIV_DIR))


# Medical Conditions

Several medical conditions are associated with a higher PD risk and/or progression (examples: https://pubmed.ncbi.nlm.nih.gov/36865411/ and https://pubmed.ncbi.nlm.nih.gov/33682937/), so having a way to understand each patient's diagnoses in more detail may be useful for correlation analyses.

**Necessary PPMI datasets:** Medical Conditions Log and MDS-UPDRS Part III Treatment Determination and Part III: Motor Examination

**Last Update:** February 9, 2025

## Reading

Reading MDS data to use as a surrogate for the timepoints

In [None]:
# Using MDS3 as a timepoint proxy
MDS3 = pd.read_csv(os.path.join(DATA_DIR, "MDS-UPDRS_Part_III_09Feb2025.csv"))
print('Length of the dataset:', len(MDS3))
MDS3.head()

In [None]:
# Reading the medical conditions dataset
conditions = pd.read_csv(os.path.join(DATA_DIR, "Medical_Conditions_Log_09Feb2025.csv"))
print('Length of the dataset:', len(conditions))
conditions.head()

This and other datasets don't carry their information in the EVENT_ID format. However, they provide "INFODT" (Assessment Date), "RESYR" (Year of Resolution), "MHDIAGYR" (Year of Diagnosis), "MHDIAGDT" (Date at Diagnosis) and "RESOLVD" (Resolved).

The most logical way to extract this information, I think, is to check whether the condition was present at each EVENT_ID assessment, then label whether or not the patient had the condition at that timepoint (BL, V02, V04 etc).

So, for a patient to have a condition at a given timepoint, two things must hold: (1) the diagnosis ("MHTERM", dated by "MHDIAGDT") was made on or before the date of that EVENT_ID, and (2) the condition had not been resolved by that date (according to "RESOLVD").
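
The two criteria can be sketched as a small standalone function. This is only an illustration under simplifying assumptions: the function name `condition_active_at` and the example dates are hypothetical, and dates are reduced to month precision with the day fixed to 1.

```python
from datetime import date
from typing import Optional


def condition_active_at(visit: date, diagnosed: date, resolved: Optional[date] = None) -> bool:
    """Return True if the condition was diagnosed on or before the visit
    and had not been resolved before the visit."""
    if diagnosed > visit:
        return False          # diagnosed only after this visit
    if resolved is None:
        return True           # never marked as resolved
    return resolved >= visit  # still present at the visit


# Hypothetical example dates (month precision, day fixed to 1)
baseline = date(2020, 6, 1)
print(condition_active_at(baseline, diagnosed=date(2015, 3, 1)))                    # True
print(condition_active_at(baseline, date(2015, 3, 1), resolved=date(2018, 1, 1)))  # False
print(condition_active_at(baseline, diagnosed=date(2021, 1, 1)))                    # False
```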

The code below shows the different ways the word "diabetes" is written in the dataset.

In [None]:
# Terms for the diagnosis we want to find
elements = ['diabetes']

# Converting 'elements' to lowercase to ensure case-insensitive matching
elements_lower = [element.lower() for element in elements]

# Selecting the rows whose MHTERM contains at least one of the terms
tempdf = conditions[conditions['MHTERM'].astype(str).str.lower().apply(lambda x: any(element in x for element in elements_lower))]
print('Number of entries with the desired condition:', len(tempdf))
print('Different values of the obtained dataset:', list(set(tempdf['MHTERM']))) # Printing without duplicates
tempdf.head()

## Definitions

For this code, we will use the Charlson Comorbidity Index (https://www.mdcalc.com/calc/3917/charlson-comorbidity-index-cci) as an example and extract the conditions present in that score. Osteoporosis was also added as a test.

Of course, you can adapt this to any condition you like. You just have to think about all the different ways it could be written in the dataset in order to extract it.

In [None]:
# List of Charlson Comorbidity Index conditions
charlson_conditions = {
    'Myocardial Infarction': ['myocardial infarction', 'heart attack', 'MI'],
    'Congestive Heart Failure': ['heart failure', 'CHF', 'congestive heart failure'],
    'Peripheral Vascular Disease': ['peripheral vascular disease', 'PVD', 'peripheral artery disease'],
    'Cerebrovascular Disease': ['cerebrovascular disease', 'stroke', 'CVA', 'cerebrovascular accident'],
    'Dementia': ['dementia', 'Alzheimer\'s disease', 'alzheimer'],
    'Chronic Pulmonary Disease': ['chronic pulmonary disease', 'COPD', 'chronic obstructive pulmonary disease', 'emphysema', 'chronic bronchitis'],
    'Connective Tissue Disease': ['connective tissue disease', 'lupus', 'rheumatoid arthritis', 'systemic lupus erythematosus', 'SLE'],
    'Peptic Ulcer Disease': ['peptic ulcer disease', 'PUD', 'stomach ulcer', 'gastric ulcer'],
    'Mild Liver Disease': ['mild liver disease', 'chronic hepatitis', 'hepatitis B', 'hepatitis C'],
    'Diabetes without Complication': ['diabetes', 'diabetes mellitus'],
    'Diabetes with Complication': ['diabetic retinopathy', 'diabetic nephropathy', 'diabetes with complications', 'diabetic neuropathy'],
    'Hemiplegia or Paraplegia': ['hemiplegia', 'paraplegia', 'paralysis'],
    'Renal Disease': ['renal disease', 'chronic kidney disease', 'CKD', 'kidney failure', 'chronic renal failure', 'reduced kidney function'],
    'Cancer (non-metastatic)': ['cancer', 'tumor', 'carcinoma', 'malignancy'],
    'Leukemia': ['leukemia', 'blood cancer'],
    'Lymphoma': ['lymphoma', 'lymphatic cancer', 'Hodgkin\'s lymphoma', 'non-Hodgkin\'s lymphoma'],
    'Moderate or Severe Liver Disease': ['cirrhosis', 'severe liver disease', 'liver cirrhosis', 'end-stage liver disease'],
    'Metastatic Solid Tumor': ['metastatic cancer', 'metastasis',  'metastatic', 'stage IV', 'advanced cancer'],
    'AIDS': ['AIDS', 'HIV', 'acquired immunodeficiency syndrome', 'human immunodeficiency virus'],
    'Osteoporosis':['osteoporosis']}

## Running

Working code, which flags each condition at each timepoint

In [None]:
# Convert 'MHTERM' to lowercase to ensure case-insensitive matching
conditions['MHTERM_lower'] = conditions['MHTERM'].str.lower()

# Merge conditions and events on 'PATNO'
merged_df = pd.merge(MDS3, conditions, on='PATNO', suffixes=('_event', '_condition'))

# Initialize an empty list to collect results
results = []

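# Function to check if any condition term is in the disease name
# 'MHTERM_lower' is lowercase, so terms are lowercased before comparing.
# All-uppercase terms (abbreviations such as 'MI' or 'CHF') must match as whole words,
# otherwise 'mi' would also match inside words such as 'migraine'.
import re

def check_conditions(disease_name):
    if not isinstance(disease_name, str):
        return []
    conditions_found = []
    for condition, terms in charlson_conditions.items():
        for term in terms:
            pattern = rf"\b{re.escape(term.lower())}\b" if term.isupper() else re.escape(term.lower())
            if re.search(pattern, disease_name):
                conditions_found.append(condition)
                break
    return conditions_found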

# Determine the active status of each condition for each timepoint
for index, row in merged_df.iterrows():
    diag_date = pd.to_datetime(row['MHDIAGDT'], format='%m/%Y')
    info_date = pd.to_datetime(row['INFODT_event'], format='%m/%Y')
    resolved_date = pd.to_datetime(row['RESDT'], format='%m/%Y') if pd.notna(row['RESDT']) else None

    # Initialize conditions for this patient and event
    patient_condition = {'PATNO': row['PATNO'], 'EVENT_ID': row['EVENT_ID_event']}
    for condition in charlson_conditions.keys():
        patient_condition[condition] = 0

    # Calculate years since diagnosis for "BL" and "SC" timepoints, only if the diagnosis was discovered on or before the timepoint
    if row['EVENT_ID_event'] in ['BL', 'SC']:
        if diag_date <= info_date:
            years_since_diag = (info_date.year - diag_date.year) + (info_date.month - diag_date.month) / 12.0
            conditions_found = check_conditions(row['MHTERM_lower'])
            for condition in conditions_found:
                patient_condition[condition] = years_since_diag
    else:
        # Check if the diagnosis was active at the timepoint
        if (diag_date <= info_date) and (row['RESOLVD'] == 0 or (resolved_date and resolved_date >= info_date)):
            conditions_found = check_conditions(row['MHTERM_lower'])
            for condition in conditions_found:
                patient_condition[condition] = 1

    # Collect the result for this patient and event
    results.append(patient_condition)

# Create a DataFrame from the collected results
patients_conditions = pd.DataFrame(results)

# The merge above creates one row per visit and condition record, so each PATNO/EVENT_ID appears several times,
# sometimes with zeros alongside non-zero values for the same timepoint (the non-zero values are correct), so let's subset
# Define columns to check for non-zero ("True") values
columns_to_check = list(charlson_conditions.keys())

# Create a column that will be True if any of the columns_to_check are True
patients_conditions['any_true'] = patients_conditions[columns_to_check].any(axis=1)

# Sort by PATNO, EVENT_ID and the 'any_true' column
df_sorted = patients_conditions.sort_values(by=['PATNO', 'EVENT_ID', 'any_true'], ascending=[True, True, False])

# Drop duplicates, keeping the first (which has 'True' if there was any)
df_deduplicated = df_sorted.drop_duplicates(subset=['PATNO', 'EVENT_ID'], keep='first')

# Drop the helper column
patients_conditions_correct = df_deduplicated.drop(columns=['any_true'])

# Display the first few rows of the resulting DataFrame
patients_conditions_correct.head(5)
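
One caveat of keeping a single row per PATNO/EVENT_ID: each merged row only carries the flags of one condition record, so a patient whose conditions come from different records may end up with only one of them flagged. A possible alternative, sketched here on hypothetical toy data and not part of the original analysis, is to take the maximum of every flag within each patient and visit.

```python
import pandas as pd

# Hypothetical toy data: one row per (visit, condition record), as produced by the merge
rows = pd.DataFrame({
    "PATNO": [1, 1, 1, 1],
    "EVENT_ID": ["V04", "V04", "V06", "V06"],
    "Diabetes without Complication": [1, 0, 1, 0],
    "Osteoporosis": [0, 1, 0, 0],
})

flag_columns = ["Diabetes without Complication", "Osteoporosis"]

# Take the maximum of every flag within each patient/visit,
# so that flags coming from different records are combined
combined = rows.groupby(["PATNO", "EVENT_ID"], as_index=False)[flag_columns].max()
print(combined)
```

In this toy example, visit V04 ends up with both conditions flagged, whereas keeping only the first row would retain just one of them.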

In [None]:
patients_conditions_correct.describe()

In [None]:
# Identifying which patients ever had a diagnosis of osteoporosis (any non-zero value counts)
osteoporosis_rows = patients_conditions_correct[patients_conditions_correct['Osteoporosis'] > 0]
print('Number of patients with osteoporosis:', osteoporosis_rows['PATNO'].nunique())
osteoporosis_rows.head()

## Testing

Doing some testing to confirm the accuracy of these measures

In [None]:
# Reshape the DataFrame to long format
long_df = pd.melt(patients_conditions_correct, id_vars=['PATNO', 'EVENT_ID'], var_name='Condition', value_name='Status')

# Group by PATNO and Condition, then check if there are both True and False values
grouped = long_df.groupby(['PATNO', 'Condition'])['Status'].agg(['any', 'all']).reset_index()

# Find PATNOs with both True and False statuses for the same condition
testing = grouped[(grouped['any'] == True) & (grouped['all'] == False)]
testing.head(10)

For privacy reasons, I can't share individual patients' data, even in a comment. I encourage you to look up some PATNOs, read the description of their conditions (see code above) and confirm in the original dataset that the code was able to extract them!

## Exporting

In [None]:
# Exporting
patients_conditions_correct.to_csv(os.path.join(PRIV_DIR, "Medical_Conditions_Charlson.csv"), index=False)

# Medications

Several medications are associated with a lower or higher PD risk and/or progression, so having a way to understand each patient's non-PD medication in more detail may be useful for correlation analyses.

**Necessary PPMI datasets:** Concomitant Medication Log and MDS-UPDRS Part III Treatment Determination and Part III: Motor Examination

**Last Update:** February 9, 2025

**Useful links to find all the different names a medication can have:**

Link 1: https://go.drugbank.com/

Link 2: https://www.rxlist.com/search/rxl/exenat


## Reading

Reading MDS data to use as a surrogate for the timepoints

In [None]:
# Using MDS3 as a timepoint proxy
MDS3 = pd.read_csv(os.path.join(DATA_DIR, "MDS-UPDRS_Part_III_09Feb2025.csv"))
print('Length of the dataset:', len(MDS3))
MDS3.head()

In [None]:
# Reading the medication dataset
medications = pd.read_csv(os.path.join(DATA_DIR, "Concomitant_Medication_Log_09Feb2025.csv"), low_memory=False)
print('Length of the dataset:', len(medications))
medications.head()

Looking at an example drawn from GLP-1 agonists (one recently published article suggested they could be neuroprotective).

Link: https://www.nejm.org/doi/full/10.1056/NEJMoa2312323

In [None]:
# Terms for the medication we want to find
elements = ['liraglutide', 'victoza', 'saxenda']

# Converting 'elements' to lowercase to ensure case-insensitive matching
elements_lower = [element.lower() for element in elements]

# Selecting the rows whose CMTRT contains at least one of the terms
tempdf = medications[medications['CMTRT'].astype(str).str.lower().apply(lambda x: any(element in x for element in elements_lower))]
print('Number of entries with the desired medication:', len(tempdf))
print('Different values of the obtained dataset:', list(set(tempdf['CMTRT']))) # Printing without duplicates
tempdf.head()

## Creating doses for medications

There are multiple ways to describe a medication dosage. This part of the code tries to interpret the strings written in an organized manner to consolidate everything
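
The underlying idea is to turn each frequency description into an average number of administrations per day, which can then be multiplied by the dose. A minimal sketch of that arithmetic follows; the helper name `per_day_factor` is hypothetical and a 30-day month is assumed.

```python
def per_day_factor(administrations: float, every_n_days: float) -> float:
    """Average number of administrations per day."""
    return administrations / every_n_days


print(round(per_day_factor(2, 1), 4))   # twice daily        -> 2.0
print(round(per_day_factor(1, 2), 4))   # every other day    -> 0.5
print(round(per_day_factor(3, 7), 4))   # three times a week -> 0.4286
print(round(per_day_factor(5, 7), 4))   # five times a week  -> 0.7143
print(round(per_day_factor(1, 7), 4))   # weekly             -> 0.1429
print(round(per_day_factor(1, 30), 4))  # monthly (30 days)  -> 0.0333

# Example: 10 units taken three times a week is about 4.29 units per day on average
print(round(10 * per_day_factor(3, 7), 2))  # -> 4.29
```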

In [None]:
# Identifying the different patterns used to report dosage frequency
top_elements = medications['CMDOSFRQ'].value_counts().index[:100] # The 100 most common entries

# Create a new dataset with one example of each of the 100 most common entries
new_df = medications[medications['CMDOSFRQ'].isin(top_elements)].drop_duplicates(subset=['CMDOSFRQ'])

# Show
new_df['CMDOSFRQ'].value_counts().index.tolist()

In [None]:
# Doses dict setting
# This is a dict that uses the most commonly used terms to describe each regimen
# The keys are the factors (administrations per day) that will be used to multiply the dose
# The values are the terms that represent each regimen
# Each term appears under one key only, since the first matching key is the one used

daily_dose = {
    '1': ['QD', 'SD', 'OD', 'QHS', 'DAILY', '1X', 'HS',
          'X1', 'QAM', 'QPM', '1XQD', 'NOCTE', '1 X QD',
          ' QD', 'QS', 'QDHS'],  # Once daily
    '2': ['BID', '2X', 'BD', '2 X DAILY', 'TT OD'],  # Twice daily
    '3': ['TID', '3/DAY', '3X', 'TDS'],  # Thrice daily
    '4': ['QID', 'QDS', '4/DAY', 'Q6H', '4X', '4XD', '4XQD'],  # Four times a day
    '6': ['Q4H', '6XD', '6XDAY'],  # Six times a day
    '0.5': ['QOD', 'EOD', 'QAD', 'Q48H', 'ALT DAY', 'Q2 DAYS', 'Q 2 DAYS'],  # Every two days
    '0.714': ['5X WEEK', '5XWK'],  # Five times a week (5/7)
    '0.429': ['TIW', '3X/WEEK', '3X WEEK', '3X A WEEK'],  # Thrice a week (3/7)
    '0.2857': ['BIW', '2/WEEK', '2X WEEK'],  # Twice a week (2/7)
    '0.1429': ['QW', 'QWK', 'WEEKLY', 'QWEEK', 'X1/WK', '1XWK', 'QIW', '1X WEEK', 'WK', '1/WK',
               'Q WK', '1XWEEK', '1/WEEK', 'Q1WK', 'QWEEKLY', '1X WEEKLY'],  # Weekly (1/7)
    '0.0714': ['Q2WK'], # Every two weeks
    '0.0333': ['MONTHLY', 'QM', '1XMONTH', '1/MONTH', 'QMONTH', 'Q4WK', '1X MONTH', 'MONTH'],  # Monthly
    '0.0111': ['Q3MONTH', 'Q3MON', 'Q 3 MONTHS', 'Q3 MOS', 'Q3M','EVERY 3 MO', 'Q3MOS', 'Q3 MONTH'],  # Every three months
    '0.0056': ['Q6M', 'Q6MONTHS', 'Q 6 MONTHS', 'Q6MTHS']  # Every 6 months
}

In [None]:
# Convert CMDOSFRQ to lowercase
medications['CMDOSFRQ_lower'] = medications['CMDOSFRQ'].str.lower()

# Function to find the multiplication factor
def get_multiplication_factor(dosage_frequency):
    for factor, terms in daily_dose.items():
        if dosage_frequency in [term.lower() for term in terms]:
            return float(factor)
    return None  # Default factor if no match is found

# Apply the function to each row
medications['dose_factor'] = medications['CMDOSFRQ_lower'].apply(get_multiplication_factor)

# Calculate the final dose (CMDOSE is coerced to numeric; non-numeric entries become NaN)
medications['final_dose'] = pd.to_numeric(medications['CMDOSE'], errors='coerce') * medications['dose_factor']

# Drop the helper column
medications = medications.drop(columns=['CMDOSFRQ_lower'])

# Display the result
medications[['CMTRT','CMDOSE','CMDOSU','CMDOSFRQ','dose_factor','final_dose']].head(5)
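
Entries whose frequency string is not listed in the dictionary get no factor, so their final dose stays empty. A minimal sketch for listing those unmapped strings, most frequent first, is shown here. The function name `unmapped_frequencies` and the example values are hypothetical.

```python
from collections import Counter


def unmapped_frequencies(values, known_terms):
    """Count frequency strings that have no entry in known_terms."""
    known = {term.strip().lower() for term in known_terms}
    counts = Counter()
    for value in values:
        if not isinstance(value, str):
            continue  # skip missing values
        cleaned = value.strip().lower()
        if cleaned not in known:
            counts[cleaned] += 1
    return counts.most_common()


# Hypothetical example values
example = ["QD", "bid", "as needed", "AS NEEDED", None, "prn"]
print(unmapped_frequencies(example, ["QD", "BID"]))
# [('as needed', 2), ('prn', 1)]
```

With the objects of this notebook, it could presumably be called as `unmapped_frequencies(medications['CMDOSFRQ'], [t for terms in daily_dose.values() for t in terms])` to decide which terms are worth adding to the dictionary.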

Now let's do some testing with groups of medications

In [None]:
# Combined dictionary of medications with prefixes
medications_dict = {
    'glp1_Exenatide': ['exenatide', 'byetta', 'bydureon'],
    'glp1_Liraglutide': ['liraglutide', 'victoza', 'saxenda', 'Xultophy'],
    'glp1_Lixisenatide': ['lixisenatide', 'adlyxin', 'lyxumia', 'Soliqua'],
    'glp1_Dulaglutide': ['dulaglutide', 'trulicity'],
    'glp1_Semaglutide': ['semaglutide', 'ozempic', 'rybelsus', 'Wegovy'],
    'glp1_Albiglutide': ['albiglutide', 'tanzeum', 'eperzan'],
    'glp1_Efpeglenatide': ['efpeglenatide'],
    'glp1_Tirzepatide': ['tirzepatide', 'mounjaro', 'zepbound']}

## Running the function

This function will identify, at each specific timepoint, if the patient was taking the medication or not. It will also try to calculate the dosage of that specific medication the patient was taking at each timepoint.

For the BL and SC visits, each medication's column, whenever positive, holds the number of years between the start of that medication ("STARTDT") and the visit, instead of a simple 1. So, for example, if a patient has been taking liraglutide roughly since the year his disease started and his first BL or SC visit is 2 years after the beginning of his symptoms, that column for BL or SC will be 2.
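
The year count used at BL and SC is a year difference plus a month difference divided by 12. A minimal standalone sketch of that calculation follows, with hypothetical dates in the MM/YYYY format; the helper name `years_between` is an assumption.

```python
from datetime import datetime


def years_between(start: str, end: str, fmt: str = "%m/%Y") -> float:
    """Years elapsed between two month/year strings, at month precision."""
    start_date = datetime.strptime(start, fmt)
    end_date = datetime.strptime(end, fmt)
    return (end_date.year - start_date.year) + (end_date.month - start_date.month) / 12.0


print(years_between("03/2018", "09/2020"))  # 2.5
print(years_between("11/2019", "02/2020"))  # 0.25
```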

In [None]:
# Convert 'CMTRT' to lowercase to ensure case-insensitive matching
medications['CMTRT_lower'] = medications['CMTRT'].str.lower()

# Merge medications and events on 'PATNO'
merged_df = pd.merge(MDS3, medications, on='PATNO', suffixes=('_event', '_medication'))

# Initialize an empty list to collect results
results = []

# Function to check if any medication term is in the medication name
def check_medications(medication_name):
    if not isinstance(medication_name, str):
        return []
    medications_found = []
    for medication, terms in medications_dict.items():
        # Terms are lowercased because 'CMTRT_lower' is lowercase (e.g. 'Wegovy' would never match otherwise)
        if any(term.lower() in medication_name for term in terms):
            medications_found.append(medication)
    return medications_found

# Determine the active status and dose of each medication for each timepoint
for index, row in merged_df.iterrows():
    diag_date = pd.to_datetime(row['STARTDT'], format='%m/%Y')
    info_date = pd.to_datetime(row['INFODT'], format='%m/%Y')
    resolved_date = pd.to_datetime(row['STOPDT'], format='%m/%Y') if pd.notna(row['STOPDT']) else None

    # Initialize medications for this patient and event
    patient_medication = {'PATNO': row['PATNO'], 'EVENT_ID': row['EVENT_ID_event']}
    patient_medication_dose = {'PATNO': row['PATNO'], 'EVENT_ID': row['EVENT_ID_event']}
    for medication in medications_dict.keys():
        patient_medication[medication] = 0
        patient_medication_dose[medication + '_dose'] = None

    # For "BL" and "SC" timepoints, calculate years since the medication was started, only if it was started on or before the timepoint
    if row['EVENT_ID_event'] in ['BL', 'SC']:
        if diag_date <= info_date:
            years_since_diag = (info_date.year - diag_date.year) + (info_date.month - diag_date.month) / 12.0
            medications_found = check_medications(row['CMTRT_lower'])
            for medication in medications_found:
                patient_medication[medication] = years_since_diag
    else:
        # Check if the medication was active at the timepoint
        if (diag_date <= info_date) and (resolved_date is None or resolved_date >= info_date):
            medications_found = check_medications(row['CMTRT_lower'])
            for medication in medications_found:
                patient_medication[medication] = 1
                patient_medication_dose[medication + '_dose'] = row['final_dose']

    # Collect the result for this patient and event
    results.append({**patient_medication, **patient_medication_dose})

# Create a DataFrame from the collected results
patients_medications = pd.DataFrame(results)

# Define columns to check for "True" values
columns_to_check = list(medications_dict.keys())

# Create a column that will be True if any of the columns_to_check are True
patients_medications['any_true'] = patients_medications[columns_to_check].any(axis=1)

# Sort by PATNO, EVENT_ID and the 'any_true' column
df_sorted = patients_medications.sort_values(by=['PATNO', 'EVENT_ID', 'any_true'], ascending=[True, True, False])

# Drop duplicates, keeping the first (which has 'True' if there was any)
df_deduplicated = df_sorted.drop_duplicates(subset=['PATNO', 'EVENT_ID'], keep='first')

# Drop the helper column
patients_medications_correct = df_deduplicated.drop(columns=['any_true'])

# Display the result
patients_medications_correct.head()

Checking moments in which a patient is taking Liraglutide

In [None]:
# Identifying the timepoints at which a patient was taking liraglutide (any non-zero value counts)
print('Number of entries with liraglutide use:', len(patients_medications_correct[patients_medications_correct['glp1_Liraglutide'] > 0]))
patients_medications_correct[patients_medications_correct['glp1_Liraglutide'] > 0].head()

## Testing

Doing some testing to confirm the accuracy of these measures

In [None]:
# Reshape the DataFrame to long format (only the medication flag columns, not the dose columns)
long_df = pd.melt(patients_medications_correct, id_vars=['PATNO', 'EVENT_ID'], value_vars=list(medications_dict.keys()), var_name='Medication', value_name='Status')

# Group by PATNO and Medication, then check if there are both True and False values
grouped = long_df.groupby(['PATNO', 'Medication'])['Status'].agg(['any', 'all']).reset_index()

# Find PATNOs with both True and False statuses for the same condition
testing = grouped[(grouped['any'] == True) & (grouped['all'] == False)]
testing.head(10)

For privacy reasons, I can't share individual patients' data, even in a comment. I encourage you to look up some PATNOs, read the description of their medications (see code above) and confirm in the original dataset that the code was able to extract them!

## Exporting

In [None]:
# Exporting
patients_medications_correct.to_csv(os.path.join(PRIV_DIR, "Non_PD_Medications_GLP1.csv"), index=False)