## ADNIMERGE Data Preprocessing - Baseline Data Cleaning

Objective: Clean and prepare the baseline data from the ADNIMERGE dataset for downstream analysis, specifically CSF prediction.

Input Data:

ADNIMERGE dataset (`ADNIMERGE_25Apr2025.csv`) containing clinical, demographic, biomarker, and imaging data.

Processing Steps:

1.  **Load Raw Data**: Load the raw ADNIMERGE dataset into a pandas DataFrame.
2.  **Select Baseline Columns**: Keep the key identifiers (`RID`, `PTID`), the demographic columns, and the baseline (`_bl`) measurement columns. No rows are filtered at this step.
3.  **Drop Ecog Scores**: Remove the 14 Ecog score columns.
4.  **Drop High-Missingness Columns**: Drop the columns previously identified as having more than 50% missing values (`PIB_bl`, `FBB_bl`, `DIGITSCOR_bl`, `AV45_bl`).
5.  **Deduplicate**: Remove exact duplicate rows, then keep the first entry for each unique participant ID (`PTID`).
6.  **Impute Missing Values**: Apply mean imputation to continuous variables and most-frequent imputation to the categorical variables `APOE4` and `DX_bl`. The CSF columns (`ABETA_bl`, `TAU_bl`, `PTAU_bl`) are left unimputed.

Output:

Cleaned baseline dataset (`df_clean.csv`) with reduced missingness and a unique entry per participant.

**Reasoning**:
The goal is to create a function to load and filter the baseline data. This involves defining the function, reading the CSV, selecting columns, and returning the filtered DataFrame.



In [1]:
import pandas as pd

def load_and_filter_baseline_data(file_path):
    """
    Loads the ADNIMERGE data from a CSV file and selects a predefined set of
    baseline columns.

    Args:
        file_path (str): The path to the ADNIMERGE CSV file.

    Returns:
        pandas.DataFrame: A DataFrame containing only the baseline columns.
    """
    # low_memory=False parses the file in one pass, so each column gets a single
    # inferred dtype (avoids the mixed-type DtypeWarning seen earlier)
    df = pd.read_csv(file_path, low_memory=False)

    baseline_cols = [
        'RID', 'PTID', 'DX_bl', 'AGE', 'PTGENDER', 'PTEDUCAT', 'APOE4',
        # Biomarkers
        'ABETA_bl', 'TAU_bl', 'PTAU_bl', 'FDG_bl', 'PIB_bl', 'AV45_bl', 'FBB_bl',
        # Clinical scores
        'CDRSB_bl', 'ADAS11_bl', 'ADAS13_bl', 'ADASQ4_bl', 'MMSE_bl',
        'RAVLT_immediate_bl', 'RAVLT_learning_bl', 'RAVLT_forgetting_bl', 'RAVLT_perc_forgetting_bl',
        'LDELTOTAL_BL', 'DIGITSCOR_bl', 'TRABSCOR_bl', 'FAQ_bl',
        'MOCA_bl', 'EcogPtMem_bl', 'EcogPtLang_bl', 'EcogPtVisspat_bl',
        'EcogPtPlan_bl', 'EcogPtOrgan_bl', 'EcogPtDivatt_bl', 'EcogPtTotal_bl',
        'EcogSPMem_bl', 'EcogSPLang_bl', 'EcogSPVisspat_bl', 'EcogSPPlan_bl',
        'EcogSPOrgan_bl', 'EcogSPDivatt_bl', 'EcogSPTotal_bl',
        # MRI volumes
        'Ventricles_bl', 'Hippocampus_bl', 'WholeBrain_bl', 'Entorhinal_bl',
        'Fusiform_bl', 'MidTemp_bl', 'ICV_bl'
    ]

    df_bl = df[baseline_cols]

    return df_bl

# Example usage (assuming the file is in the same directory)
# df_bl = load_and_filter_baseline_data("ADNIMERGE_25Apr2025.csv")
# display(df_bl.head())
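
Selecting with `df[baseline_cols]` raises a `KeyError` if any listed column is absent, which could happen with a different ADNIMERGE release. The following is a minimal sketch of a more defensive selection, not part of the original notebook; the helper name `select_existing_columns` and the toy DataFrame are hypothetical.

```python
import pandas as pd

def select_existing_columns(df, wanted):
    """Return (subset, missing): the wanted columns that exist, and the names that do not."""
    missing = [col for col in wanted if col not in df.columns]
    present = [col for col in wanted if col in df.columns]
    # .copy() so later in-place edits do not touch the full raw DataFrame
    return df[present].copy(), missing

# Toy data standing in for the raw file (illustrative values only)
raw = pd.DataFrame({"PTID": ["A", "B"], "AGE": [70.1, 65.3], "EXTRA": [1, 2]})
subset, missing = select_existing_columns(raw, ["PTID", "AGE", "MMSE_bl"])
print(list(subset.columns))
print(missing)
```

Reporting the missing names instead of failing lets the caller decide whether an absent column is acceptable for the analysis at hand.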

## Refactor column dropping

### Subtask:
Create a function to drop the specified Ecog score columns and the high-missingness columns from the DataFrame.


**Reasoning**:
Define a function to drop specified columns from the DataFrame, including Ecog scores and high-missingness features.



In [2]:
def drop_irrelevant_and_high_missingness_columns(df):
    """
    Drops specified Ecog score columns and high-missingness columns from the DataFrame.

    Args:
        df (pandas.DataFrame): The input DataFrame.

    Returns:
        pandas.DataFrame: The DataFrame with specified columns dropped.
    """
    ecog_cols = [
        'EcogPtMem_bl', 'EcogPtLang_bl', 'EcogPtVisspat_bl',
        'EcogPtPlan_bl', 'EcogPtOrgan_bl', 'EcogPtDivatt_bl', 'EcogPtTotal_bl',
        'EcogSPMem_bl', 'EcogSPLang_bl', 'EcogSPVisspat_bl', 'EcogSPPlan_bl',
        'EcogSPOrgan_bl', 'EcogSPDivatt_bl', "EcogSPTotal_bl",
    ]
    high_missingness_cols = ['PIB_bl', 'FBB_bl', 'DIGITSCOR_bl', 'AV45_bl']

    cols_to_drop = ecog_cols + high_missingness_cols

    df_dropped = df.drop(columns=cols_to_drop, errors='ignore') # Use errors='ignore' to avoid issues if a column is already missing

    return df_dropped

# Example usage (assuming df_bl is the DataFrame after loading and filtering)
# df_dropped = drop_irrelevant_and_high_missingness_columns(df_bl)
# display(df_dropped.head())
# print(df_dropped.shape)
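
The function above hardcodes the high-missingness columns found in an earlier analysis. As a rough sketch of how such a list could be derived from the data instead, assuming the same 50% cutoff, the hypothetical helper `find_high_missingness_columns` below computes the per-column fraction of missing values on a toy DataFrame.

```python
import numpy as np
import pandas as pd

def find_high_missingness_columns(df, threshold=0.5):
    """Return names of columns whose fraction of missing values exceeds threshold."""
    missing_fraction = df.isna().mean()  # per-column share of NaN values
    return missing_fraction[missing_fraction > threshold].index.tolist()

# Toy data (illustrative values only)
toy = pd.DataFrame({
    "PTID": ["A", "B", "C", "D"],
    "PIB_bl": [np.nan, np.nan, np.nan, 1.2],       # 75% missing
    "ABETA_bl": [np.nan, np.nan, 900.0, 1100.0],   # exactly 50% missing
    "AGE": [70.0, 65.0, 80.0, 72.0],               # complete
})
high_missing = find_high_missingness_columns(toy)
print(high_missing)
```

Because the comparison is strict, a column at exactly 50% missing is kept. The fraction also depends on whether it is computed before or after deduplication, since participants with more rows weigh more heavily before it.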

## Refactor deduplication

### Subtask:
Create a function to handle deduplication of the DataFrame based on the 'PTID' column, keeping the first entry for each unique participant.


**Reasoning**:
Define a function to handle deduplication based on 'PTID'.



In [3]:
def deduplicate_by_ptid(df):
    """
    Deduplicates the DataFrame based on the 'PTID' column, keeping the first
    entry for each unique participant.

    Args:
        df (pandas.DataFrame): The input DataFrame.

    Returns:
        pandas.DataFrame: The deduplicated DataFrame with reset index.
    """
    # Remove exact duplicate rows first
    df_nodup = df.drop_duplicates()

    # Then deduplicate based on PTID, keeping the first entry
    df_nodup = df_nodup.drop_duplicates(subset='PTID', keep='first')

    # Reset index
    df_nodup.reset_index(drop=True, inplace=True)

    return df_nodup

# Example usage (assuming df_bl1 is the DataFrame after dropping columns)
# df_bl1_nodup = deduplicate_by_ptid(df_bl1)
# print("Unique PTIDs after deduplication:", df_bl1_nodup['PTID'].nunique())
# print("Shape of cleaned baseline DataFrame:", df_bl1_nodup.shape)
# display(df_bl1_nodup.head())
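
Keeping the first row per `PTID` is only safe if the selected columns hold the same values on every row of a participant. The sketch below, with a hypothetical helper name and toy data, is one way that assumption might be checked before deduplicating.

```python
import pandas as pd

def columns_varying_within_participant(df, id_col="PTID"):
    """List columns that take more than one distinct value for any single participant."""
    # dropna=False counts NaN as its own value, so "70 then NaN" is flagged too
    distinct_counts = df.groupby(id_col).nunique(dropna=False)
    varies = (distinct_counts > 1).any()
    return [col for col in varies[varies].index if col != id_col]

# Toy data (illustrative values only): MMSE_bl differs between A's two rows
toy = pd.DataFrame({
    "PTID": ["A", "A", "B"],
    "AGE": [70.0, 70.0, 65.0],
    "MMSE_bl": [28.0, 27.0, 30.0],
})
varying_cols = columns_varying_within_participant(toy)
print(varying_cols)
```

An empty result would mean that the choice of `keep='first'` does not affect the values retained.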

## Refactor missing value imputation

### Subtask:
Create a function to impute missing values in the DataFrame, applying mean imputation to continuous variables and most frequent imputation to categorical variables.


**Reasoning**:
Define a function to impute missing values in the DataFrame using mean for continuous and most frequent for categorical columns.



In [4]:
from sklearn.impute import SimpleImputer

def impute_missing_values(df):
    """
    Imputes missing values in the DataFrame, applying mean imputation to
    continuous variables and most frequent imputation to categorical variables.

    Args:
        df (pandas.DataFrame): The input DataFrame with missing values.

    Returns:
        pandas.DataFrame: The DataFrame with missing values imputed.
    """
    # Separate continuous and categorical columns based on previous analysis
    continuous_cols = [
        'FDG_bl', 'MOCA_bl', 'WholeBrain_bl', 'Entorhinal_bl', 'Fusiform_bl',
        'MidTemp_bl', 'ICV_bl', 'AGE','Ventricles_bl', 'Hippocampus_bl',
        'PTEDUCAT', 'CDRSB_bl', 'ADAS11_bl', 'ADAS13_bl', 'ADASQ4_bl',
        'MMSE_bl', 'RAVLT_immediate_bl', 'RAVLT_learning_bl',
        'RAVLT_forgetting_bl', 'RAVLT_perc_forgetting_bl', 'LDELTOTAL_BL',
        'TRABSCOR_bl', 'FAQ_bl'
    ]
    categorical_cols = ['APOE4', 'DX_bl']

    # Mean imputation for continuous
    imp_mean = SimpleImputer(strategy='mean')
    df[continuous_cols] = imp_mean.fit_transform(df[continuous_cols])

    # Most frequent for categorical
    imp_freq = SimpleImputer(strategy='most_frequent')
    df[categorical_cols] = imp_freq.fit_transform(df[categorical_cols])

    return df

# Example usage (assuming df_clean is the DataFrame after dropping columns and deduplication)
# df_imputed = impute_missing_values(df_clean.copy()) # Use a copy to avoid modifying the original DataFrame
# missing_percent = df_imputed.isnull().mean() * 100
# print("\nMissingness after imputation:\n", missing_percent[missing_percent > 0])
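
To make the two strategies concrete, here is a small worked example on toy data. It is a pandas-only equivalent written for illustration, rather than the `SimpleImputer` calls used in the notebook, and the values are invented.

```python
import numpy as np
import pandas as pd

# Toy data (illustrative values only)
toy = pd.DataFrame({
    "AGE": [70.0, np.nan, 80.0],
    "DX_bl": pd.Series(["CN", np.nan, "CN"], dtype=object),
})

imputed = toy.copy()
# Mean imputation: the missing AGE becomes (70 + 80) / 2
imputed["AGE"] = imputed["AGE"].fillna(imputed["AGE"].mean())
# Most-frequent imputation: mode() ignores NaN and returns the most common value(s)
imputed["DX_bl"] = imputed["DX_bl"].fillna(imputed["DX_bl"].mode().iloc[0])
print(imputed)
```

Mean imputation keeps the column mean unchanged but shrinks its variance, which is worth keeping in mind for columns with many missing entries.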

## Integrate functions into a main script

### Subtask:
Create a main script that calls the refactored functions in the correct order to perform the entire data cleaning and preprocessing pipeline.


**Reasoning**:
Define the main function to orchestrate the data cleaning pipeline by calling the previously refactored functions in sequence and save the cleaned data to a CSV file.



In [5]:
import pandas as pd
from sklearn.impute import SimpleImputer # Import SimpleImputer here as it's used in impute_missing_values

def load_and_filter_baseline_data(file_path):
    """
    Loads the ADNIMERGE data from a CSV file and selects a predefined set of
    baseline columns.

    Args:
        file_path (str): The path to the ADNIMERGE CSV file.

    Returns:
        pandas.DataFrame: A DataFrame containing only the baseline columns.
    """
    # low_memory=False parses the file in one pass, so each column gets a single
    # inferred dtype (avoids the mixed-type DtypeWarning seen earlier)
    df = pd.read_csv(file_path, low_memory=False)

    baseline_cols = [
        'RID', 'PTID', 'DX_bl', 'AGE', 'PTGENDER', 'PTEDUCAT', 'APOE4',
        # Biomarkers
        'ABETA_bl', 'TAU_bl', 'PTAU_bl', 'FDG_bl', 'PIB_bl', 'AV45_bl', 'FBB_bl',
        # Clinical scores
        'CDRSB_bl', 'ADAS11_bl', 'ADAS13_bl', 'ADASQ4_bl', 'MMSE_bl',
        'RAVLT_immediate_bl', 'RAVLT_learning_bl', 'RAVLT_forgetting_bl', 'RAVLT_perc_forgetting_bl',
        'LDELTOTAL_BL', 'DIGITSCOR_bl', 'TRABSCOR_bl', 'FAQ_bl',
        'MOCA_bl', 'EcogPtMem_bl', 'EcogPtLang_bl', 'EcogPtVisspat_bl',
        'EcogPtPlan_bl', 'EcogPtOrgan_bl', 'EcogPtDivatt_bl', 'EcogPtTotal_bl',
        'EcogSPMem_bl', 'EcogSPLang_bl', 'EcogSPVisspat_bl', 'EcogSPPlan_bl',
        'EcogSPOrgan_bl', 'EcogSPDivatt_bl', 'EcogSPTotal_bl',
        # MRI volumes
        'Ventricles_bl', 'Hippocampus_bl', 'WholeBrain_bl', 'Entorhinal_bl',
        'Fusiform_bl', 'MidTemp_bl', 'ICV_bl'
    ]

    df_bl = df[baseline_cols]

    return df_bl

def drop_irrelevant_and_high_missingness_columns(df):
    """
    Drops specified Ecog score columns and high-missingness columns from the DataFrame.

    Args:
        df (pandas.DataFrame): The input DataFrame.

    Returns:
        pandas.DataFrame: The DataFrame with specified columns dropped.
    """
    ecog_cols = [
        'EcogPtMem_bl', 'EcogPtLang_bl', 'EcogPtVisspat_bl',
        'EcogPtPlan_bl', 'EcogPtOrgan_bl', 'EcogPtDivatt_bl', 'EcogPtTotal_bl',
        'EcogSPMem_bl', 'EcogSPLang_bl', 'EcogSPVisspat_bl', 'EcogSPPlan_bl',
        'EcogSPOrgan_bl', 'EcogSPDivatt_bl', "EcogSPTotal_bl",
    ]
    high_missingness_cols = ['PIB_bl', 'FBB_bl', 'DIGITSCOR_bl', 'AV45_bl']

    cols_to_drop = ecog_cols + high_missingness_cols

    df_dropped = df.drop(columns=cols_to_drop, errors='ignore') # Use errors='ignore' to avoid issues if a column is already missing

    return df_dropped

def deduplicate_by_ptid(df):
    """
    Deduplicates the DataFrame based on the 'PTID' column, keeping the first
    entry for each unique participant.

    Args:
        df (pandas.DataFrame): The input DataFrame.

    Returns:
        pandas.DataFrame: The deduplicated DataFrame with reset index.
    """
    # Remove exact duplicate rows first
    df_nodup = df.drop_duplicates()

    # Then deduplicate based on PTID, keeping the first entry
    df_nodup = df_nodup.drop_duplicates(subset='PTID', keep='first')

    # Reset index
    df_nodup.reset_index(drop=True, inplace=True)

    return df_nodup

def impute_missing_values(df):
    """
    Imputes missing values in the DataFrame, applying mean imputation to
    continuous variables and most frequent imputation to categorical variables.

    Args:
        df (pandas.DataFrame): The input DataFrame with missing values.

    Returns:
        pandas.DataFrame: The DataFrame with missing values imputed.
    """
    # Separate continuous and categorical columns based on previous analysis
    continuous_cols = [
        'FDG_bl', 'MOCA_bl', 'WholeBrain_bl', 'Entorhinal_bl', 'Fusiform_bl',
        'MidTemp_bl', 'ICV_bl', 'AGE','Ventricles_bl', 'Hippocampus_bl',
        'PTEDUCAT', 'CDRSB_bl', 'ADAS11_bl', 'ADAS13_bl', 'ADASQ4_bl',
        'MMSE_bl', 'RAVLT_immediate_bl', 'RAVLT_learning_bl',
        'RAVLT_forgetting_bl', 'RAVLT_perc_forgetting_bl', 'LDELTOTAL_BL',
        'TRABSCOR_bl', 'FAQ_bl'
    ]
    categorical_cols = ['APOE4', 'DX_bl']

    # Mean imputation for continuous
    imp_mean = SimpleImputer(strategy='mean')
    # Only impute if columns exist in the dataframe
    existing_continuous_cols = [col for col in continuous_cols if col in df.columns]
    if existing_continuous_cols:
        df[existing_continuous_cols] = imp_mean.fit_transform(df[existing_continuous_cols])

    # Most frequent for categorical
    imp_freq = SimpleImputer(strategy='most_frequent')
    # Only impute if columns exist in the dataframe
    existing_categorical_cols = [col for col in categorical_cols if col in df.columns]
    if existing_categorical_cols:
        df[existing_categorical_cols] = imp_freq.fit_transform(df[existing_categorical_cols])

    return df

def main():
    """
    Main function to run the ADNIMERGE baseline data cleaning pipeline.
    """
    file_path = "ADNIMERGE_25Apr2025.csv"
    output_file_path = "df_clean.csv"

    print(f"Loading data from {file_path} and filtering for baseline...")
    df_bl = load_and_filter_baseline_data(file_path)
    print(f"Initial baseline data shape: {df_bl.shape}")

    print("Dropping irrelevant and high-missingness columns...")
    df_dropped = drop_irrelevant_and_high_missingness_columns(df_bl)
    print(f"Shape after dropping columns: {df_dropped.shape}")

    print("Deduplicating data by PTID...")
    df_dedup = deduplicate_by_ptid(df_dropped)
    print(f"Shape after deduplication: {df_dedup.shape}")
    print(f"Unique PTIDs after deduplication: {df_dedup['PTID'].nunique()}")


    print("Imputing missing values...")
    df_clean = impute_missing_values(df_dedup.copy()) # Use a copy to avoid modifying the original DataFrame
    print("Missingness after imputation:")
    missing_percent_after_imputation = df_clean.isnull().mean() * 100
    print(missing_percent_after_imputation[missing_percent_after_imputation > 0])


    print(f"Saving cleaned data to {output_file_path}...")
    df_clean.to_csv(output_file_path, index=False)
    print("Data cleaning pipeline completed.")

if __name__ == "__main__":
    main()

Loading data from ADNIMERGE_25Apr2025.csv and filtering for baseline...
Initial baseline data shape: (16421, 49)
Dropping irrelevant and high-missingness columns...
Shape after dropping columns: (16421, 31)
Deduplicating data by PTID...
Shape after deduplication: (2430, 31)
Unique PTIDs after deduplication: 2430
Imputing missing values...
Missingness after imputation:
ABETA_bl    50.0
TAU_bl      50.0
PTAU_bl     50.0
dtype: float64
Saving cleaned data to df_clean.csv...
Data cleaning pipeline completed.
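
The printed output can be turned into automated checks. The sketch below is an assumption about what might be worth asserting (one row per `PTID`, and missing values only in the three CSF columns); the helper name `check_clean_frame` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd

def check_clean_frame(df, id_col="PTID",
                      allowed_missing=("ABETA_bl", "TAU_bl", "PTAU_bl")):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    if df[id_col].duplicated().any():
        problems.append(f"duplicate values in {id_col}")
    has_na = df.isna().any()
    unexpected = [col for col in has_na[has_na].index if col not in allowed_missing]
    if unexpected:
        problems.append(f"unexpected missing values in: {unexpected}")
    return problems

# Toy data (illustrative values only)
good = pd.DataFrame({"PTID": ["A", "B"], "AGE": [70.0, 65.0], "ABETA_bl": [np.nan, 900.0]})
bad = pd.DataFrame({"PTID": ["A", "A"], "AGE": [70.0, np.nan]})
print(check_clean_frame(good))
print(check_clean_frame(bad))
```

A check like this could run just before the `to_csv` call in `main()` so that a changed input file fails loudly instead of silently producing a different `df_clean.csv`.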


## Add comments and docstrings

### Subtask:
Add comments to explain each step and function, and add docstrings to the functions in the main script.


**Reasoning**:
Add comments and docstrings to the main script to explain each step and function. The commented version is the script shown in the final code cell (In [8]).



## Save the cleaned data

### Subtask:
Save the cleaned DataFrame to a CSV file named 'df_clean.csv' without including the DataFrame index.


**Reasoning**:
The final step of `main()` writes `df_clean` with `to_csv(output_file_path, index=False)`, so the row index is not stored as an extra column. The complete, commented script follows.



In [8]:
import pandas as pd
from sklearn.impute import SimpleImputer

def load_and_filter_baseline_data(file_path):
    """
    Loads the ADNIMERGE data from a CSV file and selects a predefined set of
    baseline columns.

    Args:
        file_path (str): The path to the ADNIMERGE CSV file.

    Returns:
        pandas.DataFrame: A DataFrame containing only the baseline columns.
    """
    # low_memory=False parses the file in one pass, so each column gets a single
    # inferred dtype (avoids the mixed-type DtypeWarning seen earlier)
    df = pd.read_csv(file_path, low_memory=False)

    # Define the list of baseline columns to keep
    baseline_cols = [
        'RID', 'PTID', 'DX_bl', 'AGE', 'PTGENDER', 'PTEDUCAT', 'APOE4',
        # Biomarkers
        'ABETA_bl', 'TAU_bl', 'PTAU_bl', 'FDG_bl', 'PIB_bl', 'AV45_bl', 'FBB_bl',
        # Clinical scores
        'CDRSB_bl', 'ADAS11_bl', 'ADAS13_bl', 'ADASQ4_bl', 'MMSE_bl',
        'RAVLT_immediate_bl', 'RAVLT_learning_bl', 'RAVLT_forgetting_bl', 'RAVLT_perc_forgetting_bl',
        'LDELTOTAL_BL', 'DIGITSCOR_bl', 'TRABSCOR_bl', 'FAQ_bl',
        'MOCA_bl', 'EcogPtMem_bl', 'EcogPtLang_bl', 'EcogPtVisspat_bl',
        'EcogPtPlan_bl', 'EcogPtOrgan_bl', 'EcogPtDivatt_bl', 'EcogPtTotal_bl',
        'EcogSPMem_bl', 'EcogSPLang_bl', 'EcogSPVisspat_bl', 'EcogSPPlan_bl',
        'EcogSPOrgan_bl', 'EcogSPDivatt_bl', 'EcogSPTotal_bl',
        # MRI volumes
        'Ventricles_bl', 'Hippocampus_bl', 'WholeBrain_bl', 'Entorhinal_bl',
        'Fusiform_bl', 'MidTemp_bl', 'ICV_bl'
    ]

    # Select only the baseline columns
    df_bl = df[baseline_cols]

    return df_bl

def drop_irrelevant_and_high_missingness_columns(df):
    """
    Drops specified Ecog score columns and high-missingness columns from the DataFrame.

    Args:
        df (pandas.DataFrame): The input DataFrame.

    Returns:
        pandas.DataFrame: The DataFrame with specified columns dropped.
    """
    # Define columns to drop: Ecog scores
    ecog_cols = [
        'EcogPtMem_bl', 'EcogPtLang_bl', 'EcogPtVisspat_bl',
        'EcogPtPlan_bl', 'EcogPtOrgan_bl', 'EcogPtDivatt_bl', 'EcogPtTotal_bl',
        'EcogSPMem_bl', 'EcogSPLang_bl', 'EcogSPVisspat_bl', 'EcogSPPlan_bl',
        'EcogSPOrgan_bl', 'EcogSPDivatt_bl', "EcogSPTotal_bl",
    ]
    # Define columns to drop: High missingness features identified previously (>50%)
    high_missingness_cols = ['PIB_bl', 'FBB_bl', 'DIGITSCOR_bl', 'AV45_bl']

    # Combine the lists of columns to drop
    cols_to_drop = ecog_cols + high_missingness_cols

    # Drop the specified columns. errors='ignore' prevents errors if a column is not found.
    df_dropped = df.drop(columns=cols_to_drop, errors='ignore')

    return df_dropped

def deduplicate_by_ptid(df):
    """
    Deduplicates the DataFrame based on the 'PTID' column, keeping the first
    entry for each unique participant.

    Args:
        df (pandas.DataFrame): The input DataFrame.

    Returns:
        pandas.DataFrame: The deduplicated DataFrame with reset index.
    """
    # Remove exact duplicate rows first across all columns
    df_nodup = df.drop_duplicates()

    # Then deduplicate based on PTID, keeping the first entry for each participant
    df_nodup = df_nodup.drop_duplicates(subset='PTID', keep='first')

    # Reset the index of the resulting DataFrame for cleaner indexing
    df_nodup.reset_index(drop=True, inplace=True)

    return df_nodup

def impute_missing_values(df):
    """
    Imputes missing values in the DataFrame, applying mean imputation to
    continuous variables and most frequent imputation to categorical variables.

    Args:
        df (pandas.DataFrame): The input DataFrame with missing values.

    Returns:
        pandas.DataFrame: The DataFrame with missing values imputed.
    """
    # Separate continuous and categorical columns based on previous analysis
    # Note: ABETA_bl, TAU_bl, PTAU_bl are excluded as they have ~50% missingness and might require different handling
    continuous_cols = [
        'FDG_bl', 'MOCA_bl', 'WholeBrain_bl', 'Entorhinal_bl', 'Fusiform_bl',
        'MidTemp_bl', 'ICV_bl', 'AGE','Ventricles_bl', 'Hippocampus_bl',
        'PTEDUCAT', 'CDRSB_bl', 'ADAS11_bl', 'ADAS13_bl', 'ADASQ4_bl',
        'MMSE_bl', 'RAVLT_immediate_bl', 'RAVLT_learning_bl',
        'RAVLT_forgetting_bl', 'RAVLT_perc_forgetting_bl', 'LDELTOTAL_BL',
        'TRABSCOR_bl', 'FAQ_bl',
    ]

    categorical_cols = ['APOE4', 'DX_bl']

    # Mean imputation for continuous variables
    imp_mean = SimpleImputer(strategy='mean')
    # Only impute if columns exist in the dataframe to prevent errors
    existing_continuous_cols = [col for col in continuous_cols if col in df.columns]
    if existing_continuous_cols:
        df[existing_continuous_cols] = imp_mean.fit_transform(df[existing_continuous_cols])

    # Most frequent imputation for categorical variables
    imp_freq = SimpleImputer(strategy='most_frequent')
    # Only impute if columns exist in the dataframe to prevent errors
    existing_categorical_cols = [col for col in categorical_cols if col in df.columns]
    if existing_categorical_cols:
        df[existing_categorical_cols] = imp_freq.fit_transform(df[existing_categorical_cols])

    return df

def main():
    """
    Main function to run the ADNIMERGE baseline data cleaning pipeline.

    This function orchestrates the loading, filtering, dropping columns,
    deduplication, and imputation steps.
    """
    # Define input and output file paths
    file_path = "ADNIMERGE_25Apr2025.csv"
    output_file_path = "df_clean.csv"

    # Step 1: Load data and filter for baseline columns
    print(f"Loading data from {file_path} and filtering for baseline...")
    df_bl = load_and_filter_baseline_data(file_path)
    print(f"Initial baseline data shape: {df_bl.shape}")

    # Step 2: Drop irrelevant and high-missingness columns
    print("Dropping irrelevant and high-missingness columns...")
    df_dropped = drop_irrelevant_and_high_missingness_columns(df_bl)
    print(f"Shape after dropping columns: {df_dropped.shape}")

    # Step 3: Deduplicate data by PTID
    print("Deduplicating data by PTID...")
    df_dedup = deduplicate_by_ptid(df_dropped)
    print(f"Shape after deduplication: {df_dedup.shape}")
    print(f"Unique PTIDs after deduplication: {df_dedup['PTID'].nunique()}")

    # Step 4: Impute missing values
    print("Imputing missing values...")
    # Create a copy before imputation to avoid modifying the intermediate DataFrame in place
    df_clean = impute_missing_values(df_dedup.copy())
    print("Missingness after imputation:")
    missing_percent_after_imputation = df_clean.isnull().mean() * 100
    # Print only columns that still have missing values
    print(missing_percent_after_imputation[missing_percent_after_imputation > 0])

    # Step 5: Save the cleaned data to a CSV file
    print(f"Saving cleaned data to {output_file_path}...")
    df_clean.to_csv(output_file_path, index=False)
    print("Data cleaning pipeline completed.")

# Ensure the main function runs only when the script is executed directly
if __name__ == "__main__":
    main()

Loading data from ADNIMERGE_25Apr2025.csv and filtering for baseline...
Initial baseline data shape: (16421, 49)
Dropping irrelevant and high-missingness columns...
Shape after dropping columns: (16421, 31)
Deduplicating data by PTID...
Shape after deduplication: (2430, 31)
Unique PTIDs after deduplication: 2430
Imputing missing values...
Missingness after imputation:
ABETA_bl    50.0
TAU_bl      50.0
PTAU_bl     50.0
dtype: float64
Saving cleaned data to df_clean.csv...
Data cleaning pipeline completed.
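
The save step uses `index=False`. As a small illustration of why, the sketch below writes a toy DataFrame to in-memory buffers with and without the index and reads both back; the data is invented and nothing is written to disk.

```python
import io
import pandas as pd

# Toy data (illustrative values only)
toy = pd.DataFrame({"PTID": ["A", "B"], "AGE": [70.0, 65.0]})

with_index = io.StringIO()
toy.to_csv(with_index)                    # default index=True writes the row labels
without_index = io.StringIO()
toy.to_csv(without_index, index=False)    # same call as in main()

with_index.seek(0)
without_index.seek(0)
cols_with = pd.read_csv(with_index).columns.tolist()
cols_without = pd.read_csv(without_index).columns.tolist()
print(cols_with)
print(cols_without)
```

With the index written, the reloaded file gains an unnamed leading column that downstream code would have to drop.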


## Summary:

### Data Analysis Key Findings

*   Each step of the cleaning pipeline is implemented as its own Python function: data loading and column selection, column dropping, deduplication, and missing value imputation.
*   The `load_and_filter_baseline_data` function loads the specified CSV and selects a predefined set of 49 baseline columns; the result has 16,421 rows.
*   The `drop_irrelevant_and_high_missingness_columns` function removes the 14 Ecog score columns and the four high-missingness columns (PIB\_bl, FBB\_bl, DIGITSCOR\_bl, AV45\_bl), leaving 31 columns.
*   The `deduplicate_by_ptid` function removes exact duplicate rows and then deduplicates on 'PTID', keeping the first entry. This leaves 2,430 rows, one per participant.
*   The `impute_missing_values` function applies mean imputation to a specified list of continuous columns and most-frequent imputation to the categorical columns 'APOE4' and 'DX\_bl'.
*   A `main` function orchestrates the entire cleaning process by calling these functions in sequence.
*   The final cleaned DataFrame (`df_clean`) is saved to 'df\_clean.csv' with the index excluded.
*   After imputation, the columns 'ABETA\_bl', 'TAU\_bl', and 'PTAU\_bl' still show 50% missingness, because they were intentionally excluded from the imputation lists based on prior analysis.
*   Comments and docstrings were added to the functions and the main script to improve readability and reproducibility.

### Insights or Next Steps

*   The refactored script provides a clean, modular, and reproducible pipeline for cleaning the ADNIMERGE baseline data, suitable for use in a GitHub repository.
*   Further steps should address the handling of the remaining missing values in 'ABETA\_bl', 'TAU\_bl', and 'PTAU\_bl', potentially using more advanced imputation techniques or considering their high missingness in subsequent analyses.
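
One simple way to account for the remaining CSF missingness, sketched below, is to separate participants with observed CSF values from those without. This assumes the three CSF columns are the prediction targets, which the stated objective suggests but the notebook does not confirm; the helper name `split_by_csf_availability` and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd

CSF_COLS = ["ABETA_bl", "TAU_bl", "PTAU_bl"]

def split_by_csf_availability(df, csf_cols=CSF_COLS):
    """Split rows into those with all CSF values observed and those with any missing."""
    has_csf = df[csf_cols].notna().all(axis=1)
    return df[has_csf].reset_index(drop=True), df[~has_csf].reset_index(drop=True)

# Toy data (illustrative values only)
toy = pd.DataFrame({
    "PTID": ["A", "B", "C"],
    "ABETA_bl": [900.0, np.nan, 1100.0],
    "TAU_bl": [250.0, np.nan, np.nan],
    "PTAU_bl": [22.0, np.nan, 30.0],
})
labelled, unlabelled = split_by_csf_availability(toy)
print(labelled["PTID"].tolist())
print(unlabelled["PTID"].tolist())
```

Under that assumption, a model would be trained on the rows with observed values only, which avoids imputing the quantity that is being predicted.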
