## Clinical Data Assessment for Participant and Sample Files

The Participant (Case) and Sample (Biospecimen) data for the phs002790 and phs000720 studies were downloaded from the CCDI Hub Explore page. For each study, the dbGaP Accession filter was set to either "phs002790" (Molecular Characterization Initiative) or "phs000720" (Genomic Sequencing of Pediatric Rhabdomyosarcoma), and the "Participants" and "Samples" tabs of the filtered results table were downloaded.

Data files (CSVs) were located in the following folders within the Seven Bridges Genomics Cancer Genomics Cloud (SBG-CGC) environment:

- For phs002790 (MCI): MCI_Clinical_Participant and MCI_Clinical_Sample folders
- For phs000720 (RMS): RMS_Clinical_Participant and RMS_Clinical_Sample folders

Authors: Lucy Han and James Galbraith, Booz Allen Hamilton

In [34]:
# Checking directory organization in Seven Bridges 

In [35]:
import sys

In [36]:
import os
current_path = os.getcwd()
print(current_path)

/sbgenomics/workspace


In [37]:
#!ls /sbgenomics/project-files

In [38]:
# importing numpy and pandas
import numpy as np
import pandas as pd

pd.set_option('display.max_columns', None)

In [39]:
# directory path to the case and sample (biospecimen) files
MCI_case_file = '/sbgenomics/project-files/MCI/MCI_Clinical_Participant/CCDI Hub_CCDI MCI_RMS Pts_Participants Download 2025-08-25 12-50-30.csv'
MCI_case_file_img = '/sbgenomics/project-files/MCI/MCI_Clinical_Participant/CCDI Hub_MCI_IMAGING Participants Download 2025-12-24 16-40-52.csv'
MCI_biospecimen_file = '/sbgenomics/project-files/MCI/MCI_Clinical_Sample/CCDI Hub_CCDI MCI_RMS_Samples Download 2025-08-25 13-24-08.csv'
RMS_case_file = '/sbgenomics/project-files/RMS-Mutation-Prediction/RMS_Clinical_Participant/CCDI Hub_RMS-Mutation-Prediction_phs000720_Participants_Download 2025-07-18 16-11-54.csv'
RMS_biospecimen_file = '/sbgenomics/project-files/RMS-Mutation-Prediction/RMS_Clinical_Sample/CCDI Hub_RMS-Mutation-Prediction_Samples Download 2025-08-25 16-10-57.csv'

In [40]:
# read data to a dataframe
df_case_MCI = pd.read_csv(MCI_case_file, header = 0)
df_case_MCI_img = pd.read_csv(MCI_case_file_img, header = 0)
df_biospecimen_MCI = pd.read_csv(MCI_biospecimen_file, header = 0)
df_case_RMS = pd.read_csv(RMS_case_file, header = 0)
df_biospecimen_RMS = pd.read_csv(RMS_biospecimen_file, header = 0)

In [41]:
# display the data types for the MCI case data
df_case_MCI.dtypes

Participant ID                object
Study ID                      object
Sex at Birth                  object
Race                          object
Diagnosis                     object
Diagnosis Anatomic Site       object
Age at Diagnosis (days)       object
Treatment Type                object
Last Known Survival Status    object
dtype: object

In [42]:
# display the data types for the MCI biospecimen data
df_biospecimen_MCI.dtypes

Sample ID                          object
Participant ID                     object
Study ID                           object
Sample Anatomic Site               object
Age at Sample Collection (days)    object
Sample Tumor Status                object
Sample Tumor Classification        object
Sample Diagnosis                   object
dtype: object

In [43]:
# display the data types for the RMS case data
df_case_RMS.dtypes

Participant ID                 object
Study ID                       object
Sex at Birth                   object
Race                           object
Diagnosis                      object
Diagnosis Anatomic Site        object
Age at Diagnosis (days)         int64
Treatment Type                float64
Last Known Survival Status     object
dtype: object

In [44]:
# display the data types for the RMS sample data
df_biospecimen_RMS.dtypes

Sample ID                           object
Participant ID                      object
Study ID                            object
Sample Anatomic Site                object
Age at Sample Collection (days)      int64
Sample Tumor Status                 object
Sample Tumor Classification         object
Sample Diagnosis                   float64
dtype: object
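
The dtype listings above show that the same column can load with different types in the two studies (for example, "Age at Diagnosis (days)" is `object` for MCI but `int64` for RMS). A minimal sketch of how such mismatches could be listed automatically is shown here; `compare_dtypes` and the toy frames are hypothetical helpers for illustration and are not part of the original notebook.

```python
import pandas as pd

def compare_dtypes(df_a, df_b, name_a="a", name_b="b"):
    """Return the shared columns whose dtypes differ between two DataFrames."""
    shared = [c for c in df_a.columns if c in df_b.columns]
    table = pd.DataFrame({
        name_a: df_a[shared].dtypes.astype(str),
        name_b: df_b[shared].dtypes.astype(str),
    })
    # Keep only the rows (columns of the inputs) where the dtype names differ
    return table[table[name_a] != table[name_b]]

# Toy frames standing in for the MCI and RMS participant tables (assumed data)
toy_mci = pd.DataFrame({"Participant ID": ["P1", "P2"],
                        "Age at Diagnosis (days)": ["2704", "Not Reported"]})
toy_rms = pd.DataFrame({"Participant ID": ["R1", "R2"],
                        "Age at Diagnosis (days)": [7, 2328]})

mismatch = compare_dtypes(toy_mci, toy_rms, "MCI", "RMS")
print(mismatch)
```

In the notebook this could be called as `compare_dtypes(df_case_MCI, df_case_RMS, "MCI", "RMS")` to flag columns that need type harmonization before the two studies are compared.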

In [45]:
# display descriptive statistics for the MCI case data
df_case_MCI.describe(include = "all")

Unnamed: 0,Participant ID,Study ID,Sex at Birth,Race,Diagnosis,Diagnosis Anatomic Site,Age at Diagnosis (days),Treatment Type,Last Known Survival Status
count,577,577,577,577,577,577,577,142,146
unique,577,1,2,19,7,100,545,6,4
top,PAUTVR,phs002790,Male,White,"8900/3 : Rhabdomyosarcoma, NOS","C62.9 : Testis, NOS",2704,"Chemotherapy;Immunotherapy;Radiation Therapy, NOS",Alive
freq,1,577,349,283,221,66,2,68,124


In [46]:
# display descriptive statistics for the MCI sample data
df_biospecimen_MCI.describe(include = "all")

Unnamed: 0,Sample ID,Participant ID,Study ID,Sample Anatomic Site,Age at Sample Collection (days),Sample Tumor Status,Sample Tumor Classification,Sample Diagnosis
count,1325,1325,1325,1325,1325,1325,1325,1135
unique,1325,467,1,3,1,2,2,1
top,0DXH9I,PBBSYU,phs002790,Invalid value,Not Reported,Tumor,Not Reported,see diagnosis_comment
freq,1,4,1325,884,1325,886,886,1135


In [47]:
# display descriptive statistics for the RMS case data
df_case_RMS.describe(include = "all")

Unnamed: 0,Participant ID,Study ID,Sex at Birth,Race,Diagnosis,Diagnosis Anatomic Site,Age at Diagnosis (days),Treatment Type,Last Known Survival Status
count,403,403,403,403,403,403,403.0,0.0,403
unique,403,1,2,10,5,44,,,2
top,RMS2120,phs000720,Male,White,"8910/3 : Embryonal rhabdomyosarcoma, NOS","C62.9 : Testis, NOS",,,Not Reported
freq,1,403,270,255,259,62,,,394
mean,,,,,,,2962.173697,,
std,,,,,,,2325.612077,,
min,,,,,,,7.0,,
25%,,,,,,,1217.0,,
50%,,,,,,,2328.0,,
75%,,,,,,,4412.5,,


In [48]:
# display descriptive statistics for the RMS biospecimen data
df_biospecimen_RMS.describe(include = "all")

Unnamed: 0,Sample ID,Participant ID,Study ID,Sample Anatomic Site,Age at Sample Collection (days),Sample Tumor Status,Sample Tumor Classification,Sample Diagnosis
count,403,403,403,403,403.0,403,403,0.0
unique,403,403,1,51,,1,3,
top,PAIZZZ,RMS2447,phs000720,"C62.9 : Testis, NOS",,Tumor,Primary,
freq,1,1,403,59,,403,314,
mean,,,,,2962.173697,,,
std,,,,,2325.612077,,,
min,,,,,7.0,,,
25%,,,,,1217.0,,,
50%,,,,,2328.0,,,
75%,,,,,4412.5,,,


### Consistency Assessment

In [49]:
from collections import Counter
import math

In [50]:
# Consistency assessment
def analyze_df_columns(df, max_distinct = 20):
    """Print the distinct-value counts for each column of a DataFrame.

    Columns with more than max_distinct distinct values are summarized
    by their distinct count only.
    """
    # Iterate through each column
    for column in df.columns:
        # Get value counts for the column (excludes NaN by default)
        value_counts = df[column].value_counts()
        distinct_count = len(value_counts)

        # Suppress the listing when there are more than max_distinct distinct values
        if distinct_count > max_distinct:
            print(f"{column} has count of {distinct_count}: Output suppressed")
        else:
            # Convert value counts to dictionary
            value_dict = value_counts.to_dict()
            print(f"{column} has count of {distinct_count}: {value_dict}")

In [51]:
analyze_df_columns(df_case_MCI)

Participant ID has count of 577: Output suppressed
Study ID has count of 1: {'phs002790': 577}
Sex at Birth has count of 2: {'Male': 349, 'Female': 228}
Race has count of 19: {'White': 283, 'Black or African American': 79, 'Hispanic or Latino;White': 66, 'Unknown': 27, 'Not Reported': 22, 'Asian': 21, 'Hispanic or Latino;Not Reported': 19, 'Hispanic or Latino;Unknown': 16, 'Unknown;White': 11, 'Not Reported;White': 9, 'Native Hawaiian or other Pacific Islander': 5, 'Black or African American;Not Reported': 4, 'Black or African American;Hispanic or Latino': 4, 'American Indian or Alaska Native': 3, 'Not Reported;Unknown': 2, 'Black or African American;Unknown': 2, 'Asian;Hispanic or Latino': 2, 'Asian;Unknown': 1, 'Native Hawaiian or other Pacific Islander;Not Reported': 1}
Diagnosis has count of 7: {'8900/3 : Rhabdomyosarcoma, NOS': 221, '8910/3 : Embryonal rhabdomyosarcoma, NOS': 210, '8920/3 : Alveolar rhabdomyosarcoma': 115, '8912/3 : Spindle cell rhabdomyosarcoma': 27, '8902/3 : Mi

In [52]:
analyze_df_columns(df_biospecimen_MCI)

Sample ID has count of 1325: Output suppressed
Participant ID has count of 467: Output suppressed
Study ID has count of 1: {'phs002790': 1325}
Sample Anatomic Site has count of 3: {'Invalid value': 884, 'C42.0 : Blood': 439, 'C72.9 : Central nervous system': 2}
Age at Sample Collection (days) has count of 1: {'Not Reported': 1325}
Sample Tumor Status has count of 2: {'Tumor': 886, 'Normal': 439}
Sample Tumor Classification has count of 2: {'Not Reported': 886, 'Not Applicable': 439}
Sample Diagnosis has count of 1: {'see diagnosis_comment': 1135}


In [53]:
analyze_df_columns(df_case_RMS)

Participant ID has count of 403: Output suppressed
Study ID has count of 1: {'phs000720': 403}
Sex at Birth has count of 2: {'Male': 270, 'Female': 133}
Race has count of 10: {'White': 255, 'Black or African American': 67, 'Hispanic or Latino;White': 29, 'Not Reported': 17, 'Hispanic or Latino': 16, 'Asian': 12, 'Black or African American;Hispanic or Latino': 3, 'American Indian or Alaska Native': 2, 'Hispanic or Latino;Native Hawaiian or other Pacific Islander': 1, 'Native Hawaiian or other Pacific Islander': 1}
Diagnosis has count of 5: {'8910/3 : Embryonal rhabdomyosarcoma, NOS': 259, '8920/3 : Alveolar rhabdomyosarcoma': 67, '8900/3 : Rhabdomyosarcoma, NOS': 42, '8800/3 : Sarcoma, NOS': 30, '8912/3 : Spindle cell rhabdomyosarcoma': 5}
Diagnosis Anatomic Site has count of 44: Output suppressed
Age at Diagnosis (days) has count of 360: Output suppressed
Treatment Type has count of 0: {}
Last Known Survival Status has count of 2: {'Not Reported': 394, 'Dead': 9}


In [54]:
analyze_df_columns(df_biospecimen_RMS)

Sample ID has count of 403: Output suppressed
Participant ID has count of 403: Output suppressed
Study ID has count of 1: {'phs000720': 403}
Sample Anatomic Site has count of 51: Output suppressed
Age at Sample Collection (days) has count of 360: Output suppressed
Sample Tumor Status has count of 1: {'Tumor': 403}
Sample Tumor Classification has count of 3: {'Primary': 314, 'Metastatic': 88, 'Not Reported': 1}
Sample Diagnosis has count of 0: {}
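
The Race counts above treat each semicolon-delimited combination (for example, 'Hispanic or Latino;White') as its own category. As a rough sketch, assuming ';' is the only delimiter used, the individual values can be tallied by splitting each entry first. The helper `count_tokens` and the toy Series are hypothetical and only illustrate the idea.

```python
from collections import Counter
import pandas as pd

def count_tokens(series, sep=";"):
    """Count individual values in a column whose entries may hold several
    values joined by `sep` (NaN entries are skipped)."""
    counts = Counter()
    for value in series.dropna():
        for token in str(value).split(sep):
            counts[token.strip()] += 1
    return dict(counts)

# Toy column mimicking the multi-valued Race entries (assumed data)
toy_race = pd.Series(["White", "Hispanic or Latino;White", "Unknown;White", None])
token_counts = count_tokens(toy_race)
print(token_counts)
```

Applied as `count_tokens(df_case_MCI['Race'])`, this would give one count per individual value rather than per combination, which may make the MCI and RMS Race distributions easier to compare.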


In [55]:
from collections import Counter

def analyze_column_dtypes(df):

# Analyzes data types within each column of a dataframe

    try:
        # Read CSV file into DataFrame
        
        # Loop through each column
        for column in df.columns:
            # Get all non-null values in the column
            column_values = df[column].dropna()
            
            # If column is empty after dropping NaN, handle separately
            if len(column_values) == 0:
                print(f"{column} data types: {{NaN: {len(df[column])} (100.0%)}}")
                continue
            
            # Count data types
            dtype_counts = Counter()
            total_elements = len(df[column])  # Include NaN in total count
            
            # Count NaN values
            nan_count = df[column].isna().sum()
            if nan_count > 0:
                dtype_counts['NaN'] = nan_count
            
            # Analyze each non-null value
            for value in column_values:
                value_type = type(value).__name__
                
                # For more specific type checking
                if isinstance(value, str):
                    # Check if string represents a number
                    try:
                        float(value)
                        if '.' in value:
                            dtype_counts['numeric_string(float)'] += 1
                        else:
                            int(value)
                            dtype_counts['numeric_string(int)'] += 1
                    except ValueError:
                        # Check if it's a boolean-like string
                        if value.lower() in ['true', 'false', 'yes', 'no', '1', '0']:
                            dtype_counts['boolean_string'] += 1
                        else:
                            dtype_counts['str'] += 1
                elif isinstance(value, bool):
                    # bool is a subclass of int, so it must be checked before int/float
                    dtype_counts['bool'] += 1
                elif isinstance(value, (int, float)):
                    dtype_counts[value_type] += 1
                else:
                    dtype_counts[value_type] += 1
            
            # Format and print results
            dtype_dict = {}
            for dtype, count in dtype_counts.items():
                percentage = (count / total_elements) * 100
                dtype_dict[dtype] = f"{count} ({percentage:.1f}%)"
            
            print(f"{column} data types: {dtype_dict}")
    
    except Exception as e:
        # No file is read inside this function, so only a generic handler is needed
        print(f"Error analyzing DataFrame: {e}")

In [56]:
analyze_column_dtypes(df_case_MCI)

Participant ID data types: {'str': '577 (100.0%)'}
Study ID data types: {'str': '577 (100.0%)'}
Sex at Birth data types: {'str': '577 (100.0%)'}
Race data types: {'str': '577 (100.0%)'}
Diagnosis data types: {'str': '577 (100.0%)'}
Diagnosis Anatomic Site data types: {'str': '577 (100.0%)'}
Age at Diagnosis (days) data types: {'numeric_string(int)': '576 (99.8%)', 'str': '1 (0.2%)'}
Treatment Type data types: {'NaN': '435 (75.4%)', 'str': '142 (24.6%)'}
Last Known Survival Status data types: {'NaN': '431 (74.7%)', 'str': '146 (25.3%)'}


In [57]:
# Locate non-numeric entries in 'Age at Diagnosis (days)' column of MCI case file 
mask = pd.to_numeric(df_case_MCI['Age at Diagnosis (days)'], errors='coerce').isna()
non_numeric_values = df_case_MCI.loc[mask, 'Age at Diagnosis (days)']

# Count each unique non-numeric value
non_numeric_counts = non_numeric_values.value_counts()
print("Non-numeric value counts:")
print(non_numeric_counts)

Non-numeric value counts:
Age at Diagnosis (days)
Not Reported    1
Name: count, dtype: int64


In [58]:
# Coerce the column to numeric; the single non-numeric entry ('Not Reported') becomes NaN
df_case_MCI['Age at Diagnosis (days)'] = pd.to_numeric(df_case_MCI['Age at Diagnosis (days)'], errors='coerce')
# Impute the missing entry with the column median (is median the best strategy?)
# Note: after imputation this entry is no longer counted as missing in the completeness assessment
df_case_MCI['Age at Diagnosis (days)'] = df_case_MCI['Age at Diagnosis (days)'].fillna(df_case_MCI['Age at Diagnosis (days)'].median())
# Convert to int64 so the column is recognized as numeric in the outlier assessment
df_case_MCI['Age at Diagnosis (days)'] = df_case_MCI['Age at Diagnosis (days)'].astype('int64')
print(df_case_MCI['Age at Diagnosis (days)'].dtype)

int64
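
Since the cell above questions whether median imputation is the best strategy, one possible alternative is sketched here: keep the unreported age as missing and store the column in pandas' nullable `Int64` dtype. This is an assumption about what might suit the assessment, not the authors' method; `to_nullable_int` and the toy Series are hypothetical.

```python
import pandas as pd

def to_nullable_int(series):
    """Coerce a column to numeric and store it as nullable Int64,
    leaving non-numeric entries (e.g. 'Not Reported') as missing."""
    numeric = pd.to_numeric(series, errors="coerce")
    return numeric.astype("Int64")

# Toy column mimicking 'Age at Diagnosis (days)' with one unreported value
toy_age = pd.Series(["2704", "Not Reported", "15"])
converted = to_nullable_int(toy_age)
print(converted.dtype)
print(converted.isna().sum())
```

With this approach the column is still selected by `select_dtypes(include=[np.number])`, and the unreported entry continues to count as missing in a completeness check, at the cost of having to handle the missing value explicitly when computing quartiles.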


In [59]:
analyze_column_dtypes(df_biospecimen_MCI)

Sample ID data types: {'str': '1325 (100.0%)'}
Participant ID data types: {'str': '1325 (100.0%)'}
Study ID data types: {'str': '1325 (100.0%)'}
Sample Anatomic Site data types: {'str': '1325 (100.0%)'}
Age at Sample Collection (days) data types: {'str': '1325 (100.0%)'}
Sample Tumor Status data types: {'str': '1325 (100.0%)'}
Sample Tumor Classification data types: {'str': '1325 (100.0%)'}
Sample Diagnosis data types: {'NaN': '190 (14.3%)', 'str': '1135 (85.7%)'}


In [60]:
# Coerce to numeric; every entry is 'Not Reported', so the whole column becomes NaN (float64)
df_biospecimen_MCI['Age at Sample Collection (days)'] = pd.to_numeric(df_biospecimen_MCI['Age at Sample Collection (days)'], errors='coerce')
print(df_biospecimen_MCI['Age at Sample Collection (days)'].dtype)

float64


In [61]:
analyze_column_dtypes(df_case_RMS)

Participant ID data types: {'str': '403 (100.0%)'}
Study ID data types: {'str': '403 (100.0%)'}
Sex at Birth data types: {'str': '403 (100.0%)'}
Race data types: {'str': '403 (100.0%)'}
Diagnosis data types: {'str': '403 (100.0%)'}
Diagnosis Anatomic Site data types: {'str': '403 (100.0%)'}
Age at Diagnosis (days) data types: {'int': '403 (100.0%)'}
Treatment Type data types: {NaN: 403 (100.0%)}
Last Known Survival Status data types: {'str': '403 (100.0%)'}


In [62]:
analyze_column_dtypes(df_biospecimen_RMS)

Sample ID data types: {'str': '403 (100.0%)'}
Participant ID data types: {'str': '403 (100.0%)'}
Study ID data types: {'str': '403 (100.0%)'}
Sample Anatomic Site data types: {'str': '403 (100.0%)'}
Age at Sample Collection (days) data types: {'int': '403 (100.0%)'}
Sample Tumor Status data types: {'str': '403 (100.0%)'}
Sample Tumor Classification data types: {'str': '403 (100.0%)'}
Sample Diagnosis data types: {NaN: 403 (100.0%)}


### Completeness Assessment

In [76]:
def calculate_completeness(df, df_name):

    # Function to sum all missing values and calculate overall missingness
    # df: the DataFrame to assess
    # df_name: label used in the printed summary (a DataFrame has no "name" attribute)

    # placeholder strings treated as missing, in addition to NaN
    missing_values = ['-', '', ' ', 'NA', 'Undefined', 'Unknown',
                      '[Not Available]', '[Not Applicable]',
                      'Not Available', 'Not Applicable', 'Not Reported',
                      'Invalid value', 'see diagnosis_comment']

    # initiate missing and length counts for data
    counter_missing = 0
    counter_len = 0

    # get the number of records
    count_rows = df.shape[0]

    # loop through the columns to count the missing values
    for col in df.columns:
        df_col = df[col]
        # NaN and placeholder strings are mutually exclusive, so the two counts can be added
        missing_count = df_col.isnull().sum() + df_col.isin(missing_values).sum()

        counter_missing += missing_count
        counter_len += count_rows

        print(f"{col} has {missing_count} of {count_rows} ({100*missing_count/count_rows}%) missing values")

    # print the overall missing percentage for the DataFrame
    print(f"Overall missing percentage for {df_name}: {counter_missing} of {counter_len} ({100*counter_missing/counter_len}%)")

In [77]:
calculate_completeness(df_case_MCI, "df_case_MCI")

Participant ID has 0 of 577 (0.0%) missing values
Study ID has 0 of 577 (0.0%) missing values
Sex at Birth has 0 of 577 (0.0%) missing values
Race has 49 of 577 (8.492201039861351%) missing values
Diagnosis has 0 of 577 (0.0%) missing values
Diagnosis Anatomic Site has 0 of 577 (0.0%) missing values
Age at Diagnosis (days) has 0 of 577 (0.0%) missing values
Treatment Type has 435 of 577 (75.38994800693241%) missing values
Last Known Survival Status has 431 of 577 (74.69670710571924%) missing values
Overall missing percentage for df_case_MCI: 915 of 5193 (17.61987290583478%)


In [78]:
calculate_completeness(df_biospecimen_MCI, "df_biospecimen_MCI")

Sample ID has 0 of 1325 (0.0%) missing values
Participant ID has 0 of 1325 (0.0%) missing values
Study ID has 0 of 1325 (0.0%) missing values
Sample Anatomic Site has 884 of 1325 (66.71698113207547%) missing values
Age at Sample Collection (days) has 1325 of 1325 (100.0%) missing values
Sample Tumor Status has 0 of 1325 (0.0%) missing values
Sample Tumor Classification has 1325 of 1325 (100.0%) missing values
Sample Diagnosis has 1325 of 1325 (100.0%) missing values
Overall missing percentage for df_biospecimen_MCI: 4859 of 10600 (45.839622641509436%)


In [79]:
calculate_completeness(df_case_RMS, "df_case_RMS")

Participant ID has 0 of 403 (0.0%) missing values
Study ID has 0 of 403 (0.0%) missing values
Sex at Birth has 0 of 403 (0.0%) missing values
Race has 17 of 403 (4.218362282878412%) missing values
Diagnosis has 0 of 403 (0.0%) missing values
Diagnosis Anatomic Site has 0 of 403 (0.0%) missing values
Age at Diagnosis (days) has 0 of 403 (0.0%) missing values
Treatment Type has 403 of 403 (100.0%) missing values
Last Known Survival Status has 394 of 403 (97.76674937965261%) missing values
Overall missing percentage for df_case_RMS: 814 of 3627 (22.44279018472567%)


In [80]:
calculate_completeness(df_biospecimen_RMS, "df_biospecimen_RMS")

Sample ID has 0 of 403 (0.0%) missing values
Participant ID has 0 of 403 (0.0%) missing values
Study ID has 0 of 403 (0.0%) missing values
Sample Anatomic Site has 0 of 403 (0.0%) missing values
Age at Sample Collection (days) has 0 of 403 (0.0%) missing values
Sample Tumor Status has 0 of 403 (0.0%) missing values
Sample Tumor Classification has 1 of 403 (0.24813895781637718%) missing values
Sample Diagnosis has 403 of 403 (100.0%) missing values
Overall missing percentage for df_biospecimen_RMS: 404 of 3224 (12.531017369727047%)
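
The completeness counts above match placeholder strings exactly, so a composite entry such as 'Not Reported;Unknown' in the Race column is not counted as missing even though every part of it is a placeholder. A minimal sketch of a token-level check follows, assuming ';' is the delimiter and using a deliberately small placeholder set; `all_tokens_missing` and `PLACEHOLDERS` are hypothetical names.

```python
import pandas as pd

# Assumed subset of the placeholder strings used in the completeness check
PLACEHOLDERS = {"Unknown", "Not Reported"}

def all_tokens_missing(value, placeholders=PLACEHOLDERS, sep=";"):
    """Return True when a value is NaN or every delimited part is a placeholder."""
    if pd.isna(value):
        return True
    tokens = [t.strip() for t in str(value).split(sep)]
    return all(t in placeholders or t == "" for t in tokens)

# Toy column mimicking the Race values seen in the MCI participant file
toy_race = pd.Series(["White", "Unknown", "Not Reported;Unknown", "Unknown;White", None])
flags = toy_race.map(all_tokens_missing)
print(flags.tolist())
```

Whether a partially reported entry such as 'Unknown;White' should count as missing is a judgment call; this sketch treats it as present.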


### Outliers Assessment

In [43]:
# importing numpy and pandas
import numpy as np
import pandas as pd

In [44]:
def get_outlier_count(df, columns, print_flag = False):
    
    # Function to get outlier count and length counts
    
    # df is the case/sample dataframe
    # columns is the numeric columns to be assessed for outliers
    # print_flag (default False): when True, prints the fences and the outlier values for each column

    # initiate variables to count outliers and length
    count_outliers = 0
    count_samples = 0

    # placeholder strings treated as missing
    missing_values = ["", " ", "-", "NA", "Undefined", "Unknown",
                      "[Not Available]", "[Not Applicable]",
                      "Not Available", "Not Applicable", "Not Reported",
                      "Invalid value", "see diagnosis_comment"]

    # replace missing values with NaN (np.nan; the np.NaN alias was removed in NumPy 2.0)
    # replace() returns a new DataFrame, so the caller's df is not modified
    assessment_df = df.replace(missing_values, np.nan)

    # loop through columns/attributes to calculate outliers per attribute
    for c in columns:
        assessment_df[c] = pd.to_numeric(assessment_df[c]) # change columns to be assessed to numeric
        values_array = assessment_df[c].values # get array of values

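        # note: np.quantile returns NaN if values_array contains any NaN,
        # in which case no value in the column is flagged as an outlier
        Q1 = np.quantile(values_array, 0.25) # first quartile value
        Q3 = np.quantile(values_array, 0.75) # third quartile value
        iqr = Q3 - Q1 # interquartile range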

        # generate list of outliers lower than Q1 - 1.5*IQR
        low_outlier_list = [v for v in values_array.tolist() if v < (Q1 - 1.5*iqr)] 

        # generate list of outliers greater than Q3 + 1.5*IQR
        high_outlier_list = [v for v in values_array.tolist() if v > (Q3 + 1.5*iqr)]

        # add outlier counts to count_outliers 
        count_outliers = count_outliers + len(low_outlier_list) + len(high_outlier_list)

        # add counts to count_samples
        count_samples += len(assessment_df[c])

        if print_flag:
            print(f"{c} \t {len(low_outlier_list)} less than {Q1 - 1.5*iqr} \t {len(high_outlier_list)} greater than {Q3 + 1.5*iqr}")
            print(low_outlier_list)
            print(high_outlier_list)

    return count_outliers, count_samples
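
The function above flags values outside the fences Q1 - 1.5*IQR and Q3 + 1.5*IQR. As a worked illustration, the sketch below computes those fences from the RMS age quartiles reported earlier (Q1 = 1217.0, Q3 = 4412.5) and shows a variant that drops NaN before taking quartiles. `iqr_fences`, `iqr_outliers`, and the toy Series are hypothetical helpers, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def iqr_fences(q1, q3, k=1.5):
    """Return the (lower, upper) fences for the IQR outlier rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def iqr_outliers(series, k=1.5):
    """Return the values of a column that fall outside the IQR fences,
    ignoring missing or non-numeric entries."""
    values = pd.to_numeric(series, errors="coerce").dropna()
    if values.empty:
        return values
    q1, q3 = np.quantile(values, [0.25, 0.75])
    low, high = iqr_fences(q1, q3, k)
    return values[(values < low) | (values > high)]

# Fences from the RMS age quartiles shown in the descriptive statistics
rms_fences = iqr_fences(1217.0, 4412.5)
print(rms_fences)  # (-3576.25, 9205.75)

# Toy column with one extreme value and one missing entry (assumed data)
toy = pd.Series([10, 12, 11, 13, 12, 100, None])
print(iqr_outliers(toy).tolist())  # [100.0]
```

Because ages in days cannot be negative, the lower fence of -3576.25 means only unusually high ages (above 9205.75 days) can be flagged in the RMS files.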


In [45]:
df_list = [{"MCI case": df_case_MCI}, {"MCI biospecimen": df_biospecimen_MCI}, {"RMS case": df_case_RMS}, {"RMS biospecimen": df_biospecimen_RMS}]

In [46]:
# initiate lists
outlier_count_list = []
sample_count_list = []

# loop through list of dataframes (case and sample)
for dfs in df_list: 
    for key in dfs.keys():
        df = dfs[key]
        
        # get list of numeric columns
        numeric_columns = df.select_dtypes(include=[np.number]).columns
        print(numeric_columns)
        print(f"{key} has {len(numeric_columns)} numeric columns")

        # get outlier counts using the get_outlier_count function for all numeric_columns
        outlier_count, sample_count = get_outlier_count(df, numeric_columns, print_flag = False)
        outlier_count_list.append(outlier_count)
        sample_count_list.append(sample_count)

        # calculate the percentage of outliers
        percent_outliers = outlier_count/sample_count
        
        # print the outliers
        print(f"{key} has {outlier_count} of {sample_count} ({100*percent_outliers}%) outliers")


Index(['Age at Diagnosis (days)'], dtype='object')
MCI case has 1 numeric columns
MCI case has 0 of 577 (0.0%) outliers
Index(['Age at Sample Collection (days)'], dtype='object')
MCI biospecimen has 1 numeric columns
MCI biospecimen has 0 of 1325 (0.0%) outliers
Index(['Age at Diagnosis (days)', 'Treatment Type'], dtype='object')
RMS case has 2 numeric columns
RMS case has 4 of 806 (0.49627791563275436%) outliers
Index(['Age at Sample Collection (days)', 'Sample Diagnosis'], dtype='object')
RMS biospecimen has 2 numeric columns
RMS biospecimen has 4 of 806 (0.49627791563275436%) outliers


In [47]:
outlier_count_list

[0, 0, 4, 4]

In [48]:
sample_count_list

[577, 1325, 806, 806]

In [65]:
# Overall fraction of outliers across the four files
sum(outlier_count_list)/sum(sample_count_list)

0.0022766078542970974
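
In the run above, 'Treatment Type' and 'Sample Diagnosis' in the RMS files are selected as numeric only because they are entirely NaN (dtype `float64`), and each adds 403 rows to the sample count without being able to contribute outliers. One possible adjustment, sketched here under the assumption that empty columns should be excluded from the denominator, is to keep only numeric columns that hold at least one value. `numeric_columns_with_data` and the toy frame are hypothetical.

```python
import numpy as np
import pandas as pd

def numeric_columns_with_data(df):
    """Return the numeric columns that contain at least one non-missing value."""
    numeric = df.select_dtypes(include=[np.number])
    return [c for c in numeric.columns if numeric[c].notna().any()]

# Toy frame mimicking the RMS participant table (assumed data)
toy = pd.DataFrame({
    "Age at Diagnosis (days)": [7, 2328, 4412],
    "Treatment Type": [np.nan, np.nan, np.nan],   # all-NaN, loads as float64
    "Sex at Birth": ["Male", "Female", "Male"],
})
kept = numeric_columns_with_data(toy)
print(kept)
```

Passing the result of this helper to `get_outlier_count` instead of the raw `select_dtypes` result would change the reported denominators, so any figures derived this way would need to be recomputed rather than compared directly with the values above.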

In [85]:
# Overall outliers
sum(combined_outlier_count_list)/sum(combined_sample_count_list)

0.0011078286558345643