# Data Validation 3: Verify that data appears as expected within the MTP UI
Started 2022-10-28 ZD  

## Description
### "Does the MTP UI match the MTP database?"

### Purpose
This validation will test “completeness and accuracy of data loaded into the platform” and follow MTP Data Validations 1&2. It will compare the data displayed within the platform GUI (after loading) to the expected values within the data (before loading). Automated scripts will pull test cases from the data that can be fed into platform testing automations to check displays for completeness and accuracy. 

### Scope
DV3 will focus on displays within the platform that relate to new pediatric data, including those that happen to incorporate Open Targets (OT) data. New pediatric data includes the Food and Drug Administration’s Pediatric Molecular Target Lists (FDA PMTL); the much larger collection of evidence data provided by the Children’s Hospital of Philadelphia (CHoP); and derived summary tables, such as the Pediatric Cancer Data Navigation (PCDN) page. It will not validate or test displays that only include OT data without pediatric data.  

The testing within DV3 will use sampling (spot-checking) methods, though scalability to meet automation capacity will be a design goal. Samples tested will include a set of defined high-profile genes and diseases that we expect will garner an abundance of user attention. We will also include a random sampling of genes and diseases (associated with CHoP data) to expand testing scope.
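
The sampling approach described above (a fixed priority list plus a reproducible random draw) can be sketched as follows. This is a minimal illustration rather than the notebook's own sampling code: the helper `pick_test_cases` and the placeholder IDs are hypothetical, and the seed value simply mirrors the `RANDOM_SEED` setting defined in the configuration cell.

```python
import random


def pick_test_cases(priority_ids, candidate_ids, n_random, seed):
    """Return priority IDs followed by a reproducible random sample of the rest.

    Hypothetical helper for illustration only.
    """
    # De-duplicate the priority list while keeping its order
    priority = list(dict.fromkeys(priority_ids))
    # Sort the remaining pool so the draw does not depend on set ordering
    pool = sorted(set(candidate_ids) - set(priority))
    # Use a local generator so the global random state is left untouched
    rng = random.Random(seed)
    sampled = rng.sample(pool, min(n_random, len(pool)))
    return priority + sampled


# Placeholder IDs, not real Ensembl identifiers
priority = ["ENSG_A", "ENSG_B"]
candidates = ["ENSG_A", "ENSG_C", "ENSG_D", "ENSG_E", "ENSG_F"]

first = pick_test_cases(priority, candidates, n_random=2, seed=5555)
second = pick_test_cases(priority, candidates, n_random=2, seed=5555)
```

Because the generator is seeded, `first` and `second` contain the same cases, which is the property that lets a test run be repeated against a new platform release.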

### Test Case Overview

Testable values for each test case will be contained in a tab within the output Excel file.

1. Target Association
    - Count of associated diseases
    - PMTL annotation
2. Target Profile Page
    - Somatic Alterations widget
        - Status
        - Row count of each of 5 tabs
    - Gene Expression widget
        - Status
    - Epigenetic Modification widget
        - Status
        - Row count of each of 2 tabs
    - Differential Expression widget
        - Status
3. Disease Association
    - Count of associated targets
4. Disease Profile Page
    - Differential Expression widget
        - Status
5. Evidence Page
    - Somatic Alterations widget
        - Status
        - Row count of each of 5 tabs
    - Gene Expression widget
        - Status
    - Epigenetic Modification widget
        - Status
        - Row count of each of 2 tabs
6. Pediatric Cancer Data Navigation (PCDN) Page
    - Row count of resulting evidence pages when searching for target or disease
7. Pediatric Molecular Targets List (PMTL) Page
    - Row count of Relevant and Non-Relevant targets  

### Updates

**2022-11-15 (initial changes)**
- Generate and export new PCDN instead of ingesting
- Add random seed for reproducible random results
- Change Excel export to function with dictionary input for more flexibility

**2022-11-16**
- Update PCDN export functions
    - Export PCDN as partitioned JSONL files
    - Compress files into one folder
    - Create md5 checksum
- Edit output data file structure

**2022-11-21**
- Add 'id' field to PCDN with uuids for each row

**2022-11-22**
- Reorganize data folders to allow for separate versioning of CHoP data types
- Update output filenames to identify data type versions instead of only OpenPedCan version
- Add function to build and export PCDN disease list for drop-down menu
- Add function to build and export PCDN target, disease, and evidence counts

**2022-12-05**
- Rebuild Target page Associations test case to use OT AOTF as source file instead of 
precalculated OT Direct Overall Associations
- Known issues remain:
    - Disease page Associations test case should be rebuilt to use AOTF file and impose
    indirect association (ontology) logic (DONE 1/25/2023)
    - Evidence page test case should be rebuilt to include indirect association logic (DONE 1/26/2023)
    - PCDN Gene search test case should be rebuilt to match site logic of prefix search (DONE 1/27/2023)

**2023-01-25**
- Change ot_diseases filter to keep 'descendants' column. Propagates to all downstream joins
- Add function to search for indirect evidence using disease ontology descendants
- Rebuild Disease page Associations test to create indirect associations using AOTF evidence

**2023-01-26**
- Fix typos and adjust list spacing and indentation for readability
- Rebuild Evidence test cases to gather indirect evidence

**2023-01-27**
- Rebuild PCDN target test to match MTP PCDN search behavior, which includes similar 
hyphenated targets in search results
- Add tests to check that targets, diseases, and evidence defined in dv3_priority_tests are both valid and useful
- Increase standard number of randomly sampled test cases (total cases from 20 to 40 for most tests)

**2023-02-17**
- Add option to diseaseAssc (Disease page Associations) function to use all pediatric diseases present in 
CHoP (non-OT) data instead of the few priority diseases listed in csv

**2023-02-22**
- Add Epigenetic Modification (Methylation) data to test cases and PCDN
    - Include Methyl by Gene and Methyl by Isoform into loading/cleaning/combining steps
    - Add targetProfile columns to output xlsx for sums and widget presence
    - Add evidence columns to output xlsx for sums and widget presence
    - Add EM version tags to output filenames

**2023-03-13**
- Add function and paths to export versioned csvs of OpenTargets disease and target lists for ad-hoc analyses

**2023-03-29**
- Change name format of methylation files received from CHoP

**2023-04-03**
- Add function to load jsonl as chunks, aggregate, and then save as smaller grouped files to manage the large methylation files
- Improve chop cleaning functions
    - Fix datasourceId calling to prevent bug if first row is dropped
    - Add option to accurately report evidence counts in aggregated methylation data
    - Add gene symbols and disease names of blank IDs to text output
- Add functionality to build_test_case_df to sum evidenceCounts of methylation data instead of counting rows
- Add tqdm import and code to enable displaying progress bars to improve sanity
- Change loading of GeneExpression files to use chunkloading for better performance

**2023-05-04**
- Add Differential Expression (DE) to path configs and data loading

**2023-05-08**
- Add optional parameter to group by custom fields in jsonl loading steps
- Add DE to data cleaning steps
- Fix randomSeed calling within test case functions

**2023-05-10**
- Add DE to PCDN steps
- Update priority disease and evidence csvs to reflect v12.0 changes

**2023-05-11**
- Add DE to targetProfile test case generation
- Create new diseaseProfile test case and include DE data

## Import modules and define relative paths

In [None]:
import glob
import json
import os
import random
import hashlib
import shutil
import uuid

import ndjson
import numpy as np
import pandas as pd
from tqdm.notebook import tqdm

In [None]:
# CORE VERSIONS
OT_VERSION = '22.11'
OPENPEDCAN_SOMATIC_ALTERATIONS_VERSION = 'v12.0'
OPENPEDCAN_GENE_EXPRESSION_VERSION = 'v12.0'
OPENPEDCAN_EPIGENETIC_MODIFICATION_VERSION = 'v12.0'
OPENPEDCAN_DIFFERENTIAL_EXPRESSION_VERSION = 'v12.0'
PMTL_VERSION = 'v3.1'
RANDOM_SEED = 5555

# --------

# INPUTS

# Data from Open Targets
OT_PATH = 'data/external/opentargets/platform/' + OT_VERSION + '/output/etl/json/'

OT_DISEASES_PATH = OT_PATH + 'diseases/'
OT_TARGETS_PATH = OT_PATH + 'targets/'
OT_AOTF_PATH = OT_PATH + 'AOTFClickhouse/'


# Data from CHoP
CHOP_SA_PATH = 'data/raw/chopOpenPedCan/somaticAlterations/' + OPENPEDCAN_SOMATIC_ALTERATIONS_VERSION + '/'
CHOP_GX_PATH = 'data/raw/chopOpenPedCan/geneExpression/' + OPENPEDCAN_GENE_EXPRESSION_VERSION + '/'
CHOP_EM_PATH = 'data/raw/chopOpenPedCan/epigeneticModification/' + OPENPEDCAN_EPIGENETIC_MODIFICATION_VERSION + '/'
CHOP_DE_PATH = 'data/raw/chopOpenPedCan/differentialExpression/' + OPENPEDCAN_DIFFERENTIAL_EXPRESSION_VERSION + '/'


# CHoP: Somatic Alterations
CNV_PATH = CHOP_SA_PATH + 'gene-level-cnv-consensus-annotated-mut-freq.jsonl.gz'
SNVGENE_PATH = CHOP_SA_PATH + 'gene-level-snv-consensus-annotated-mut-freq.jsonl.gz'
SNV_PATH = CHOP_SA_PATH + 'variant-level-snv-consensus-annotated-mut-freq.jsonl.gz'
FUSIONGENE_PATH = CHOP_SA_PATH + 'putative-oncogene-fused-gene-freq.jsonl.gz'
FUSION_PATH = CHOP_SA_PATH + 'putative-oncogene-fusion-freq.jsonl.gz'


# CHoP: Gene Expression
TPMGENE_PATH = CHOP_GX_PATH + 'long_n_tpm_mean_sd_quantile_gene_wise_zscore.jsonl.gz'
TPMGROUP_PATH = CHOP_GX_PATH + 'long_n_tpm_mean_sd_quantile_group_wise_zscore.jsonl.gz'


# CHoP: Epigenetic Modification (Methylation)
METHYL_FILENAME = 'isoform-methyl-beta-values-summary'
METHYLGENE_FILENAME = 'gene-methyl-beta-values-summary'

# CHoP: Raw Methylation (large files)
METHYL_PATH = CHOP_EM_PATH + METHYL_FILENAME + '.jsonl.gz'
METHYLGENE_PATH = CHOP_EM_PATH + METHYLGENE_FILENAME + '.jsonl.gz'

# CHoP: Grouped Methylation (aggregated small files)
GROUPED_CHOP_EM_PATH = 'data/processed/chopOpenPedCan/epigeneticModification_grouped/' + OPENPEDCAN_EPIGENETIC_MODIFICATION_VERSION + '/'
GROUPED_METHYL_PATH = GROUPED_CHOP_EM_PATH + METHYL_FILENAME + '/'
GROUPED_METHYLGENE_PATH = GROUPED_CHOP_EM_PATH + METHYLGENE_FILENAME + '/'


# CHoP: Differential Expression
DIFFEXPR_FILENAME = 'gene-counts-rsem-expected_count-collapsed-deseq'

# CHoP: Raw Differential Expression (large files)
DIFFEXPR_PATH = CHOP_DE_PATH + DIFFEXPR_FILENAME + '.jsonl.gz'

# CHoP: Grouped Differential Expression (aggregated small files)
GROUPED_CHOP_DE_PATH = 'data/processed/chopOpenPedCan/differentialExpression_grouped/' + OPENPEDCAN_DIFFERENTIAL_EXPRESSION_VERSION + '/'
GROUPED_DIFFEXPR_PATH = GROUPED_CHOP_DE_PATH + DIFFEXPR_FILENAME + '/'



# PMTL data
PMTL_PATH = 'data/processed/pmtl/pmtl_' + PMTL_VERSION + '.json'

# Priority targets and diseases for test cases
PRIORITY_PATH = 'data/processed/dv3_priority_tests/'
PRIORITY_TARGETS_PATH = PRIORITY_PATH + 'targets.csv'
PRIORITY_DISEASES_PATH = PRIORITY_PATH + 'diseases.csv'
PRIORITY_EVIDENCES_PATH = PRIORITY_PATH + 'evidences.csv'

# --------

# OUTPUTS

# Versioned Open Targets Target/Disease csvs
OT_TABLES_EXPORT_PATH = 'data/processed/ot_lists/' + OT_VERSION + '/'
OT_TARGET_EXPORT_PATH = OT_TABLES_EXPORT_PATH + 'targets.csv'
OT_DISEASE_EXPORT_PATH = OT_TABLES_EXPORT_PATH + 'diseases.csv'

# Excel file for DV3 automation input
XLSX_OUTPUT =   ('dv3_test_cases/MTP_DV3_' + OT_VERSION + 
                '_SA' + OPENPEDCAN_SOMATIC_ALTERATIONS_VERSION + 
                '_GX' + OPENPEDCAN_GENE_EXPRESSION_VERSION + 
                '_EM' + OPENPEDCAN_EPIGENETIC_MODIFICATION_VERSION +
                '_DE' + OPENPEDCAN_DIFFERENTIAL_EXPRESSION_VERSION +
                '.xlsx')

# Pediatric Cancer Data Navigation
PCDN_PATH =     ('data/processed/pcdn/' + OT_VERSION + 
                '_SA' + OPENPEDCAN_SOMATIC_ALTERATIONS_VERSION + 
                '_GX' + OPENPEDCAN_GENE_EXPRESSION_VERSION + 
                '_EM' + OPENPEDCAN_EPIGENETIC_MODIFICATION_VERSION +
                '_DE' + OPENPEDCAN_DIFFERENTIAL_EXPRESSION_VERSION +
                '/')
PCDN_JSON = PCDN_PATH + 'chopDataNavigationTable.json'
PCDN_COUNTS = PCDN_PATH + 'pcdnCounts.json'
PCDN_DISEASES = PCDN_PATH + 'diseaseOptions.json'
PCDN_JSONL = PCDN_PATH + 'jsonl_files'
PCDN_JSONL_TEMP = PCDN_JSONL + '/'
PCDN_JSONL_FILENAMES = 'pcdn_part_{0:04d}.json'
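
The versioned output names above follow one pattern: the OT version followed by a `_<TYPE><version>` tag for each CHoP data type. As a minimal sketch of that pattern, the hypothetical helper `version_tag` below rebuilds the same strings from a dictionary; it is not part of the notebook.

```python
def version_tag(ot_version, type_versions):
    """Join the OT version with '_<TYPE><version>' tags, in insertion order.

    Hypothetical helper for illustration only.
    """
    return ot_version + "".join(
        f"_{data_type}{version}" for data_type, version in type_versions.items()
    )


tag = version_tag(
    "22.11",
    {"SA": "v12.0", "GX": "v12.0", "EM": "v12.0", "DE": "v12.0"},
)

# Same layout as XLSX_OUTPUT and PCDN_PATH in the configuration cell
xlsx_output = f"dv3_test_cases/MTP_DV3_{tag}.xlsx"
pcdn_path = f"data/processed/pcdn/{tag}/"

# The partition filename template zero-pads the part number to four digits
part_name = "pcdn_part_{0:04d}.json".format(3)
```

Keeping the tag in one place would mean that a new data type needs only one added dictionary entry, at the cost of one more level of indirection in the configuration cell.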

## Load data for processing

In [None]:
def load_jsonl_files_to_df(path:str, 
                           filetype:str='*.json', 
                           progress:bool=False, 
                           aggData:bool=False, 
                           group_cols:list=['targetFromSourceId', 
                                            'Gene_symbol', 
                                            'diseaseFromSourceMappedId', 
                                            'Disease', 
                                            'datasourceId']):
    """
    Load multiple identically-structured jsonl files within a local folder 
    into a single dataframe. Useful for OpenTargets FTP downloads.

    :param path: Relative filepath to the folder containing the jsonl files.
    :param filetype: Filetype suffix of files to include. default '*.json'
    :param progress: If True, outputs tqdm progress bar while loading
    :param aggData: If True, perform additional groupby function. Use when 
        loading data already aggregated into chunks. 
    :param group_cols: list of columns by which to group agg data
    """
    
    # OT uses 'json' extension for 'jsonl' files
    fullPath = path + filetype

    # Create list of all files within path folder
    files = glob.glob(fullPath)

    # Build df by combining all files in path folder,
    # optionally wrapping the file list in a tqdm progress bar
    fileIter = tqdm(files) if progress else files
    df = pd.concat(
        (pd.read_json(f, orient='records', lines=True) for f in fileIter),
        ignore_index=True
        )

    # If loaded files are aggregated chunks, perform a final aggregation
    # to combine any rows across chunks
    if aggData == True:
        df = df.groupby(group_cols).sum().reset_index()

    return df
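
The final aggregation in `load_jsonl_files_to_df` matters because the same target-disease pair can appear in more than one chunk file. The sketch below illustrates this with two made-up chunks; the IDs and counts are placeholders, and only two grouping columns are used to keep the example short.

```python
import pandas as pd

group_cols = ["targetFromSourceId", "diseaseFromSourceMappedId"]

# Two hypothetical chunk files, each already aggregated on its own
chunk_a = pd.DataFrame({
    "targetFromSourceId": ["ENSG_A", "ENSG_B"],
    "diseaseFromSourceMappedId": ["EFO_1", "EFO_1"],
    "evidenceCount": [3, 5],
})
chunk_b = pd.DataFrame({
    "targetFromSourceId": ["ENSG_B", "ENSG_C"],
    "diseaseFromSourceMappedId": ["EFO_1", "EFO_2"],
    "evidenceCount": [2, 4],
})

# Simple concatenation keeps ENSG_B / EFO_1 as two separate rows
combined = pd.concat([chunk_a, chunk_b], ignore_index=True)

# A second groupby merges rows that were split across chunk boundaries
regrouped = combined.groupby(group_cols, as_index=False)["evidenceCount"].sum()
```

After regrouping, the pair that straddled the two chunks is a single row whose count is the sum of both parts, and the total evidence count is unchanged.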

In [None]:
def load_jsonl_as_chunks(path:str, limit=None, chunksize=2.5e5):
    """
    Load large jsonl file in chunks and then combine to save memory.

    :param path: Relative filepath to jsonl file
    :param limit: max number of chunks to include in concat. Use to
        load a subset of the data
    :param chunksize: number of lines to read per chunk

    """
    # Load jsonl in chunks
    df_chunks = pd.read_json(path, orient='records', lines=True, chunksize=chunksize)

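    # Collect chunks up to limit, then concatenate once. Concatenating
    # inside the loop would re-copy the growing dataframe for every chunk
    chunks = []
    for label, chunk in enumerate(df_chunks):
        if label == limit:
            break
        chunks.append(chunk)
    df = pd.concat(chunks, ignore_index=True) if chunks else pd.DataFrame()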
    
    return df
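
The chunked loading used by `load_jsonl_as_chunks` can be tried on a small file. The sketch below writes a five-line toy JSONL file to a temporary folder and reads it back two lines at a time; the file name and field values are placeholders.

```python
import json
import os
import tempfile

import pandas as pd

# Toy records standing in for a large evidence file
rows = [{"targetFromSourceId": f"ENSG_{i}", "value": i} for i in range(5)]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.jsonl")
    with open(path, "w") as handle:
        for row in rows:
            handle.write(json.dumps(row) + "\n")

    # With chunksize set, read_json returns an iterator of dataframes
    chunk_sizes = []
    parts = []
    with pd.read_json(path, orient="records", lines=True, chunksize=2) as reader:
        for chunk in reader:
            chunk_sizes.append(len(chunk))
            parts.append(chunk)

# Concatenate once, after all chunks have been collected
df = pd.concat(parts, ignore_index=True)
```

Only one chunk is parsed at a time, so peak memory during parsing depends on `chunksize` rather than on the size of the file, although the final concatenated dataframe still has to fit in memory.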

In [None]:
def load_jsonl_chunk_group_save(path:str, 
                                directory:str, 
                                start=0, 
                                stop=None, 
                                chunksize=2.5e5, 
                                group_cols:list=['targetFromSourceId', 
                                                'Gene_symbol', 
                                                'diseaseFromSourceMappedId', 
                                                'Disease', 
                                                'datasourceId']):
    """
    Load large jsonl file in chunks, aggregate, and then save as 
    smaller jsonl files.  JSONL must already include the following 
    fields: 'targetFromSourceId', 'Gene_symbol', 
    'diseaseFromSourceMappedId', 'Disease', 'datasourceId'

    :param path: Relative filepath to jsonl file
    :param directory: Relative filepath to folder for jsonl chunk files
    :param start: index of starting chunk (inclusive)
    :param stop: index of the chunk at which to stop (exclusive). 
        None processes all remaining chunks
    :param chunksize: number of lines to read per chunk
    :param group_cols: list of columns by which to group raw data

    """

    # Make versioned directory if does not exist
    os.makedirs(directory, exist_ok=True)

    # Load jsonl in chunks
    df_chunks = pd.read_json(path, orient='records', lines=True, chunksize=chunksize)

    # Handle each chunk within range limit
    for label, chunk in enumerate(df_chunks):
        if label >= start:
            if label == stop:
                break   
            
            # Group df by only summary columns used for the PCDN and tests
            df_grouped = chunk.groupby(group_cols).size().reset_index().rename(columns={0:'evidenceCount'})

            # Convert the grouped chunk to records and export it as a numbered
            # JSONL part file. The dropna call is intended to leave NaN values blank
            dict_records = df_grouped.apply(lambda x: x.dropna()).to_dict('records')
            jsonlData = ndjson.dumps(dict_records).encode('utf-8')
            with open(f"{directory}part_{label:04}.jsonl", 'wb') as file:
                file.write(jsonlData)

    return None
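
The per-chunk reduction in `load_jsonl_chunk_group_save` replaces every raw row with a count per group. A minimal sketch on a made-up chunk is shown below. It uses pandas' own `to_json(lines=True)` for serialization instead of the `ndjson` package, purely to keep the example free of extra dependencies; the `beta` column and all values are placeholders.

```python
import pandas as pd

# One hypothetical raw chunk: four evidence rows across three target-disease pairs
chunk = pd.DataFrame({
    "targetFromSourceId": ["ENSG_A", "ENSG_A", "ENSG_A", "ENSG_B"],
    "diseaseFromSourceMappedId": ["EFO_1", "EFO_1", "EFO_2", "EFO_1"],
    "beta": [0.1, 0.2, 0.3, 0.4],
})

# size() counts rows per group; the unnamed result column is labelled 0
grouped = (
    chunk.groupby(["targetFromSourceId", "diseaseFromSourceMappedId"])
    .size()
    .reset_index()
    .rename(columns={0: "evidenceCount"})
)

# One JSON object per line, ready to be written to a part file
jsonl_text = grouped.to_json(orient="records", lines=True)
jsonl_lines = jsonl_text.strip().split("\n")
```

The grouped frame keeps only the grouping columns and `evidenceCount`, which is why later steps sum `evidenceCount` for these sources instead of counting rows.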

In [None]:
# Load OT files as dataframes
ot_diseases = load_jsonl_files_to_df(OT_DISEASES_PATH)
ot_targets = load_jsonl_files_to_df(OT_TARGETS_PATH, progress=True)
ot_aotf = load_jsonl_files_to_df(OT_AOTF_PATH, progress=True)

In [None]:
# Load CHoP files as dataframes

# Somatic Alterations 
cnv = pd.read_json(CNV_PATH, orient='records', lines=True)
snvGene = pd.read_json(SNVGENE_PATH, orient='records', lines=True)
snv = pd.read_json(SNV_PATH, orient='records', lines=True)
fusionGene = pd.read_json(FUSIONGENE_PATH, orient='records', lines=True)
fusion = pd.read_json(FUSION_PATH, orient='records', lines=True)

In [None]:
# Gene Expression
tpmGene = load_jsonl_as_chunks(TPMGENE_PATH)
tpmGroup = load_jsonl_as_chunks(TPMGROUP_PATH)

In [None]:
# Epigenetic Modification: Methylation (raw data)

# Load large jsonl, partition and save for faster later uses
# Process takes ~7.5min/GB of raw file on local (1.5-2hr) 
# Only needs to run once for each new data version
load_jsonl_chunk_group_save(METHYL_PATH, GROUPED_METHYL_PATH)
load_jsonl_chunk_group_save(METHYLGENE_PATH, GROUPED_METHYLGENE_PATH)

In [None]:
# Epigenetic Modification: Methylation (grouped data)
methyl = load_jsonl_files_to_df(GROUPED_METHYL_PATH, filetype='*.jsonl', progress=True, aggData=True)
methylGene = load_jsonl_files_to_df(GROUPED_METHYLGENE_PATH, filetype='*.jsonl', progress=True, aggData=True)

In [None]:
# Differential Expression (raw data)

# Load large jsonl, partition and save for faster later uses
# Process takes ~7.5min/GB of raw file on local (~2.5hr) 
# Only needs to run once for each new data version
load_jsonl_chunk_group_save(DIFFEXPR_PATH, 
                            GROUPED_DIFFEXPR_PATH, 
                            group_cols=['targetFromSourceId', 
                                        'Gene_symbol', 
                                        'diseaseFromSourceMappedId', 
                                        'Disease'])

In [None]:
# Differential Expression (grouped data)

# Special handling to standardize datasourceId
diffExpr = load_jsonl_files_to_df(GROUPED_DIFFEXPR_PATH, 
                                  filetype='*.jsonl', 
                                  progress=True, 
                                  aggData=True, 
                                  group_cols=['targetFromSourceId', 
                                                'Gene_symbol', 
                                                'diseaseFromSourceMappedId', 
                                                'Disease'])

In [None]:
# Load PMTL as dataframes
pmtl_df = pd.read_json(PMTL_PATH, orient='records')

In [None]:
# Load list of priorities for testing as dataframes
priority_targets = pd.read_csv(PRIORITY_TARGETS_PATH)
priority_diseases = pd.read_csv(PRIORITY_DISEASES_PATH)
priority_evidences = pd.read_csv(PRIORITY_EVIDENCES_PATH)

## Clean & Transform Data

### Simplify OT Target and Disease datasets

In [None]:
# Drop all columns except for ids, names, and disease descendants
ot_targets = ot_targets.loc[:, ['id', 'approvedSymbol']]
ot_diseases = ot_diseases.loc[:, ['id', 'name', 'descendants']]

In [None]:
def export_OT_table(df:pd.DataFrame, outfile:str, overwrite:bool=False):
    """
    Export CSVs of simplified target and disease tables from OT
    for easier ad-hoc analyses.
    
    :param df: pandas dataframe to export
    :param outfile: filename path for export
    :param overwrite: replace existing files, default False
    """

    # Skip function if files already exist unless overwrite=True
    if os.path.exists(outfile) and not overwrite:
        return None

    # Make output directory if it doesn't exist
    os.makedirs(os.path.dirname(outfile), exist_ok=True)

    # Export df as csv
    df.to_csv(outfile, index=False)

In [None]:
# Export OT target and disease lists
export_OT_table(ot_targets, OT_TARGET_EXPORT_PATH)
export_OT_table(ot_diseases, OT_DISEASE_EXPORT_PATH)

### Reformat CHoP API files for consistency

In [None]:
# Define columns to rename
tpmColRenameDict = {
    'Gene_Ensembl_ID': 'targetFromSourceId',
    'EFO': 'diseaseFromSourceMappedId',
    'cohort': 'Dataset'
}

# Rename columns and add datasourceId column for each file
tpmGene.rename(columns=tpmColRenameDict, inplace=True)
tpmGene['datasourceId'] = 'chop_tpm_genewise_expression'

tpmGroup.rename(columns=tpmColRenameDict, inplace=True)
tpmGroup['datasourceId'] = 'chop_tpm_groupwise_expression'

In [None]:
# Set all Differential Expression datasourceId values to a standard
diffExpr['datasourceId'] = 'chop_differential_expression'

### Clean IDs within CHoP data

In [None]:
def clean_chop_targets(df:pd.DataFrame, ot_targets:pd.DataFrame=ot_targets, output:bool=True):
    """
    Remove rows of evidence that contain blank or incompatible Target IDs.
    These represent values that will not load into the MTP database. 
    
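    :param df: Dataframe of CHoP evidence file. Rows with blank target IDs 
        are dropped from this dataframe in place
    :param ot_targets: Dataframe of Open Targets target database
    :param output: If True, print counts and IDs of the evidence removed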
    """

    # Note any rows with blank target IDs
    if 'evidenceCount' in df.columns:
        blankEvidences = df[df['targetFromSourceId'] == '']['evidenceCount'].sum()
    else:
        blankEvidences = len(df[df['targetFromSourceId'] == ''])
    blankTargets = df[df['targetFromSourceId'] == '']['Gene_symbol'].unique().tolist()

    # Drop any rows with blank target IDs
    df.drop(df[df['targetFromSourceId'] == ''].index, inplace=True)


    # Enrich chop df with OT target ids and symbols
    df1 = pd.merge(
        df, ot_targets, how='left', left_on='targetFromSourceId', right_on='id').rename(
        columns={'id':'otTargetId', 'approvedSymbol':'otSymbol'})

    # Note any rows with target IDs not found within OT database
    if 'evidenceCount' in df.columns:
        invalidEvidences = df1[df1['otTargetId'].isna()]['evidenceCount'].sum()
    else:
        invalidEvidences = len(df1[df1['otTargetId'].isna()])
    invalidTargets = df1[df1['otTargetId'].isna()]['targetFromSourceId'].unique().tolist()


    # Drop any rows with target IDs not found within OT database
    df2 = df1[df1['otTargetId'].notnull()]

    # Printed output
    # Use positional access for datasourceId: index label 0 is missing 
    # from df2 whenever the first merged row was dropped as invalid
    if output == True:
        print(f"    {blankEvidences} evidences across {len(blankTargets)} gene symbols with blank target IDs removed from {df2.datasourceId.iloc[0]}")
        if len(blankTargets) > 0:
            print('    Gene Symbols of blank IDs:', *blankTargets, sep='\n        ')
        print(f"    {invalidEvidences} evidences across {len(invalidTargets)} invalid target IDs removed from {df2.datasourceId.iloc[0]}")
        if len(invalidTargets) > 0:
            print('    Invalid Target IDs:', *invalidTargets, sep='\n        ')

    return df2
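
The cleaning logic (drop blank IDs, left-join against the OT reference, drop rows without a match) can be followed on a toy table. The sketch below uses placeholder IDs and a two-row stand-in for the OT target list. It also reads `datasourceId` by position, since filtering can remove the row with index label 0.

```python
import pandas as pd

# Toy evidence: one unknown ID, one blank ID, and two valid IDs
evidence = pd.DataFrame({
    "targetFromSourceId": ["ENSG_X", "", "ENSG_A", "ENSG_B"],
    "Gene_symbol": ["GENEX", "NOID", "GENEA", "GENEB"],
    "datasourceId": ["toy_source"] * 4,
})

# Stand-in for the simplified OT target table
ot_reference = pd.DataFrame({
    "id": ["ENSG_A", "ENSG_B"],
    "approvedSymbol": ["GENEA", "GENEB"],
})

# Step 1: record and remove rows whose target ID is blank
is_blank = evidence["targetFromSourceId"] == ""
blank_symbols = evidence.loc[is_blank, "Gene_symbol"].unique().tolist()
non_blank = evidence[~is_blank]

# Step 2: a left join leaves 'id' as NaN where the OT table has no match
merged = non_blank.merge(
    ot_reference, how="left", left_on="targetFromSourceId", right_on="id"
)
is_invalid = merged["id"].isna()
invalid_ids = merged.loc[is_invalid, "targetFromSourceId"].unique().tolist()

# Step 3: keep only rows that matched an OT target
clean = merged[~is_invalid]

# Positional access works even though index label 0 was filtered out
source_name = clean["datasourceId"].iloc[0]
```

In this example the first merged row is the invalid one, so `clean` no longer contains index label 0 and a label-based lookup such as `clean.datasourceId[0]` would raise a `KeyError`.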

In [None]:
def clean_chop_diseases(df:pd.DataFrame, ot_diseases:pd.DataFrame=ot_diseases, output:bool=True):
    """
    Remove rows of evidence that contain blank or incompatible Disease IDs.
    These represent values that will not load into the MTP database. 
    
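    :param df: Dataframe of CHoP evidence file. Rows with blank disease IDs 
        are dropped from this dataframe in place
    :param ot_diseases: Dataframe of Open Targets disease database
    :param output: If True, print counts and IDs of the evidence removed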
    """

    # Note any rows with blank disease IDs
    if 'evidenceCount' in df.columns:
        blankEvidences = df[df['diseaseFromSourceMappedId'] == '']['evidenceCount'].sum()
    else:
        blankEvidences = len(df[df['diseaseFromSourceMappedId'] == ''])
    blankDiseases = df[df['diseaseFromSourceMappedId'] == '']['Disease'].unique().tolist()


    # Drop any rows with blank disease IDs
    df.drop(df[df['diseaseFromSourceMappedId'] == ''].index, inplace=True)


    # Enrich chop df with OT disease ids and symbols
    df1 = pd.merge(
        df, ot_diseases, how='left', left_on='diseaseFromSourceMappedId', right_on='id').rename(
        columns={'id':'otDiseaseId', 'name':'otDiseaseName'})

    # Note any rows with disease IDs not found within OT database
    if 'evidenceCount' in df.columns:
        invalidEvidences = df1[df1['otDiseaseId'].isna()]['evidenceCount'].sum()
    else:
        invalidEvidences = len(df1[df1['otDiseaseId'].isna()])
    invalidDiseases = df1[df1['otDiseaseId'].isna()]['diseaseFromSourceMappedId'].unique().tolist()

    # Drop any rows with disease IDs not found within OT database
    df2 = df1[df1['otDiseaseId'].notnull()]

    # Printed output
    # Use positional access for datasourceId: index label 0 is missing 
    # from df2 whenever the first merged row was dropped as invalid
    if output == True:
        print(f"    {blankEvidences} evidences across {len(blankDiseases)} disease names with blank disease IDs removed from {df2.datasourceId.iloc[0]}")
        if len(blankDiseases) > 0:
            print('    Disease Names of blank IDs:', *blankDiseases, sep='\n        ')
        print(f"    {invalidEvidences} evidences across {len(invalidDiseases)} invalid disease IDs removed from {df2.datasourceId.iloc[0]}")
        if len(invalidDiseases) > 0:
            print('    Invalid Disease IDs:', *invalidDiseases, sep='\n        ')

    return df2


In [None]:
def chop_cleaning_functions(df:pd.DataFrame, 
                            ot_targets:pd.DataFrame=ot_targets, 
                            ot_diseases:pd.DataFrame=ot_diseases, 
                            output:bool=False):
    """
    Combines target and disease ID cleaning functions. 
    Note that because this cleans evidence in series (targets first), it's possible that 
    an invalid disease that ONLY appears in evidence with an invalid target will not be 
    reported out. Final data output will still be clean, but report will not catch the 
    disease ID. 

    :param df: pandas dataframe of evidence to clean
    :param ot_targets: pandas DataFrame of Open Targets targets to use for reference
    :param ot_diseases: pandas DataFrame of Open Targets diseases to use for reference
    :param output: Boolean determination of whether details of targets/diseases removed
                will be printed in addition to total start/end record counts. Default False
    """

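    if 'evidenceCount' in df.columns:
        print(df.datasourceId.iloc[0], '\n    Start length:', df['evidenceCount'].sum())
    else:
        print(df.datasourceId.iloc[0], '\n    Start length:', len(df))

    # Pass the reference tables through so that non-default 
    # ot_targets/ot_diseases arguments are actually used
    df1 = clean_chop_targets(df, ot_targets=ot_targets, output=output)
    df2 = clean_chop_diseases(df1, ot_diseases=ot_diseases, output=output)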
    
    if 'evidenceCount' in df.columns:
        print('    End length:', df2['evidenceCount'].sum(), '\n---')
    else:
        print('    End length:', len(df2), '\n---')

    return df2
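
The reporting caveat in the docstring of `chop_cleaning_functions` can be reproduced with plain Python sets. In the sketch below, all IDs are placeholders and the list comprehensions stand in for the two cleaning functions; one evidence row carries both an invalid target and an invalid disease.

```python
# Toy evidence rows as (target ID, disease ID) pairs
rows = [
    ("ENSG_A", "EFO_1"),
    ("ENSG_X", "EFO_BAD"),  # invalid target AND invalid disease
    ("ENSG_B", "EFO_1"),
]
valid_targets = {"ENSG_A", "ENSG_B"}
valid_diseases = {"EFO_1"}

# Cleaning in series: targets first, then diseases on what is left
after_targets = [row for row in rows if row[0] in valid_targets]
reported_in_series = sorted(
    {disease for _, disease in after_targets if disease not in valid_diseases}
)
final_rows = [row for row in after_targets if row[1] in valid_diseases]

# Checking diseases against the untouched input catches the hidden ID
reported_independently = sorted(
    {disease for _, disease in rows if disease not in valid_diseases}
)
```

The final data is clean in both cases, but `EFO_BAD` only shows up when diseases are checked against the original rows. If a complete list of invalid disease IDs is needed, one option is to run the disease check on the uncleaned dataframe as a separate reporting step.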

In [None]:
# Run each CHoP df through the cleaning functions defined above
cnv_clean = chop_cleaning_functions(cnv, output=True)
snvGene_clean = chop_cleaning_functions(snvGene, output=True)
snv_clean = chop_cleaning_functions(snv, output=True)
fusion_clean = chop_cleaning_functions(fusion, output=True)
fusionGene_clean = chop_cleaning_functions(fusionGene, output=True)

tpmGene_clean = chop_cleaning_functions(tpmGene, output=True)
tpmGroup_clean = chop_cleaning_functions(tpmGroup, output=True)

methyl_clean = chop_cleaning_functions(methyl, output=True)
methylGene_clean = chop_cleaning_functions(methylGene, output=True)

diffExpr_clean = chop_cleaning_functions(diffExpr, output=True)

### Build Target-Disease Evidence Dataframe Function
Build a dataframe containing all of the pediatric cancer evidence in the format required for validation. Test cases will be subsets of this df, exported into Excel for automation.

In [None]:
def build_test_case_df(dfList:list, ot_targets:pd.DataFrame=ot_targets, ot_diseases:pd.DataFrame=ot_diseases):
    """
    Build and format a dataframe to use when generating target and evidence page tests.
    Combine and transform a list of cleaned/preprocessed evidence dataframes.
    
    :param dfList: list of pandas DataFrames containing evidence
    :param ot_targets: pandas DataFrame of Open Targets targets to use for reference
    :param ot_diseases: pandas DataFrame of Open Targets diseases to use for reference
    """

    # Create blank output to fill with each evidence df
    dfCombined = pd.DataFrame()

    # Iterate through list of evidence dataframes
    for df in dfList:

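        # Group data by 5 columns and get evidence count
        # All evidence must use identical column names/contents
        if 'evidenceCount' in df.columns:
            # Sum only evidenceCount. Cleaned dataframes also carry non-numeric 
            # OT columns (e.g. 'descendants') that should not be summed
            df1 = df.groupby(
                ['targetFromSourceId', 
                'Gene_symbol', 
                'diseaseFromSourceMappedId', 
                'Disease', 
                'datasourceId']
                )['evidenceCount'].sum().reset_index()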
        else:
            df1 = df.groupby(
                ['targetFromSourceId', 
                'Gene_symbol', 
                'diseaseFromSourceMappedId', 
                'Disease', 
                'datasourceId']
                ).size().reset_index().rename(columns={0:'evidenceCount'})
        
    
        # Add each formatted df into a single dataframe
        dfCombined = pd.concat([dfCombined, df1], ignore_index=True)

    # Pivot to organize datasources as columns showing evidence sums
    df2 = dfCombined.pivot_table(
            values='evidenceCount', 
            index=['targetFromSourceId','diseaseFromSourceMappedId'], 
            columns='datasourceId', 
            aggfunc='sum', fill_value=0
            ).reset_index().rename_axis(None, axis=1)

    # Use target IDs to map canonical OT names for targets
    df3 = pd.merge(df2, ot_targets, how='left', left_on='targetFromSourceId', right_on='id').rename(columns={'approvedSymbol':'targetNameOT'})
    df3.drop(columns='id', inplace=True)

    # Use disease IDs to map canonical OT names for diseases
    df4 = pd.merge(df3, ot_diseases, how='left', left_on='diseaseFromSourceMappedId', right_on='id').rename(columns={'name':'diseaseNameOT'})
    df4.drop(columns='id', inplace=True)

    # Rearrange output columns
    df5 = df4[[
                'targetFromSourceId',
                'diseaseFromSourceMappedId',
                'targetNameOT',
                'diseaseNameOT',
                'descendants',
                'chop_gene_level_snv',
                'chop_variant_level_snv',
                'chop_gene_level_cnv',
                'chop_putative_oncogene_fused_gene',
                'chop_putative_oncogene_fusion',
                'chop_tpm_groupwise_expression',
                'chop_tpm_genewise_expression',
                'chop_gene_level_methylation',
                'chop_isoform_level_methylation',
                'chop_differential_expression',]]

    return df5
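
The pivot step in `build_test_case_df` turns one row per target-disease-source into one row per target-disease pair, with a count column for each data source. The sketch below shows this on placeholder data. The final `reindex` call is an assumption added for illustration, not something the notebook does: it guarantees a column for a source that has no evidence at all, which the fixed column selection at the end of the function would otherwise fail to find.

```python
import pandas as pd

id_cols = ["targetFromSourceId", "diseaseFromSourceMappedId"]

# Long format: one row per target, disease, and data source
long_df = pd.DataFrame({
    "targetFromSourceId": ["ENSG_A", "ENSG_A", "ENSG_B"],
    "diseaseFromSourceMappedId": ["EFO_1", "EFO_1", "EFO_1"],
    "datasourceId": [
        "chop_gene_level_snv",
        "chop_gene_level_cnv",
        "chop_gene_level_snv",
    ],
    "evidenceCount": [4, 2, 7],
})

# Wide format: data sources become columns, missing combinations become 0
wide = (
    long_df.pivot_table(
        values="evidenceCount",
        index=id_cols,
        columns="datasourceId",
        aggfunc="sum",
        fill_value=0,
    )
    .reset_index()
    .rename_axis(None, axis=1)
)

# Assumption for illustration: force a column for every expected source
expected_sources = [
    "chop_gene_level_snv",
    "chop_gene_level_cnv",
    "chop_differential_expression",
]
wide_full = wide.reindex(columns=id_cols + expected_sources, fill_value=0)
```

Here `ENSG_B` has no CNV evidence, so its CNV count is filled with 0, and the differential expression column exists in `wide_full` even though no input row mentions that source.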

### Run function to combine evidence and build test case dataframe

In [None]:
# Create list of clean dataframes for iteration
dfList = [
    cnv_clean,
    snvGene_clean,
    snv_clean,
    fusion_clean,
    fusionGene_clean,
    tpmGene_clean,
    tpmGroup_clean,
    methylGene_clean,
    methyl_clean,
    diffExpr_clean,
    ]

In [None]:
# Build test case
testCase_df = build_test_case_df(dfList, ot_targets, ot_diseases)

## Build and export Pediatric Cancer Data Navigation table using the Test Case Dataframe

In [None]:
def build_pcdn_df(df:pd.DataFrame=testCase_df):
    """
    Build the Pediatric Cancer Data Navigation table
    by reformatting the test case dataframe. 
    
    :param df: pandas dataFrame of combined and processed evidence
    """

    # Define labels and source columns for True/False summary columns
    # This assumes that gene-level evidence presence matches variant-level
    # presence in relevant data sources
    pcdnSummaryDict = {
    'SNV': 'chop_gene_level_snv',
    'CNV': 'chop_gene_level_cnv',
    'Fusion': 'chop_putative_oncogene_fused_gene',
    'GeneExpression': 'chop_tpm_genewise_expression',
    'Methylation': 'chop_gene_level_methylation',
    'DifferentialExpression': 'chop_differential_expression'
    }

    # Rename columns for clarity and to match input format
    df1 = df.rename(columns={'targetNameOT':'Gene_symbol', 'diseaseNameOT':'Disease'})

    # Create summary True/False columns for evidence presence
    for label, col in pcdnSummaryDict.items():
        df1[label] = np.where(df1[col] > 0, True, False)

    # Create id column with uuids
    df1['id'] = df1.apply(lambda x: uuid.uuid4(), axis=1)

    # Keep only columns of interest
    df2 = df1[
        ['targetFromSourceId',
        'diseaseFromSourceMappedId',
        'Gene_symbol',
        'Disease',
        'SNV',
        'CNV',
        'Fusion',
        'GeneExpression',
        'Methylation',
        'DifferentialExpression',
        'id']]
    
    return df2
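
As a minimal sketch of the summary-flag step in `build_pcdn_df`, the toy dataframe below shows how per-datasource evidence counts become True/False columns. The values are invented (not real CHoP data), only two of the six labels are shown, and `toy_df`, `summary_cols`, and `flags` are hypothetical names used for illustration.

```python
import pandas as pd

# Toy evidence counts (invented values for illustration only)
toy_df = pd.DataFrame({
    'targetNameOT': ['GENE_A', 'GENE_B'],
    'diseaseNameOT': ['disease x', 'disease y'],
    'chop_gene_level_snv': [3, 0],
    'chop_gene_level_cnv': [0, 0],
})

# Label -> source column, mirroring two entries of pcdnSummaryDict
summary_cols = {'SNV': 'chop_gene_level_snv', 'CNV': 'chop_gene_level_cnv'}

flags = toy_df.rename(columns={'targetNameOT': 'Gene_symbol',
                               'diseaseNameOT': 'Disease'})
for label, col in summary_cols.items():
    # A count above zero means at least one evidence record exists
    flags[label] = flags[col] > 0

print(flags[['Gene_symbol', 'Disease', 'SNV', 'CNV']])
```

Comparing directly with `> 0` gives the same booleans as the `np.where(..., True, False)` form used in the function.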

In [None]:
def export_df_as_json(df:pd.DataFrame, outfile:str):
    """
    Export pandas dataframe as JSON file. Parsing using
    json.dumps is needed to avoid extra backslashes otherwise
    present if using only built-in pandas to_json().
    
    :param df: pandas dataframe to export
    :param outfile: filename path for export
    """
    
    # Make output directory if it doesn't exist
    # (skipped when outfile has no directory component)
    out_dir = os.path.dirname(outfile)
    if out_dir:
        os.makedirs(out_dir, exist_ok=True)

    # Load df as json string and parse
    # Use default_handler to avoid overflow error with uuids
    json_str = df.to_json(orient='records', default_handler=str)
    parsed = json.loads(json_str)

    # Export to json filepath
    with open(outfile, 'w') as json_file:
        json_file.write(json.dumps(parsed))
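
The docstring's note about extra backslashes can be checked on a toy dataframe. This sketch assumes pandas' usual `to_json` behavior of escaping forward slashes, which may differ between versions. The dataframe and its values are invented.

```python
import json
import pandas as pd

# Toy dataframe with a value containing a forward slash (invented data)
toy_df = pd.DataFrame({'id': ['a/b'], 'value': [1]})

# Raw pandas output; forward slashes are typically escaped here
raw = toy_df.to_json(orient='records')

# Round-trip through the json module to write the string without the escapes
cleaned = json.dumps(json.loads(raw))

print(raw)
print(cleaned)
```

Both strings parse to the same records; only the on-disk representation differs.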

In [None]:
def export_df_as_jsonl_chunks(df:pd.DataFrame, 
                                directory:str, 
                                maxlines:int, 
                                fileformat:str):
    """
    Export pandas dataframe as many JSONL chunks for easier ETL.

    :param df: pandas dataframe to export
    :param directory: filepath of directory to hold exported jsonl
    :param maxlines: int max number of jsonl lines per file
    :param fileformat: name format for export files. 
        Must contain '{}' string for part number formatting
    """

    # Make versioned directory if it does not exist
    os.makedirs(directory, exist_ok=True)

    # Load pandas df as json lines object
    # Use default_handler to avoid overflow error with uuids
    json_str = df.to_json(orient='records', default_handler=str, lines=True)

    # Load json lines as chunks with at most maxlines lines per file
    chunks = pd.read_json(json_str, orient='records', lines=True, chunksize=maxlines)

    # Iterate through JSONL partitions to export as files. Show NaN values as blank
    for label, chunk in enumerate(chunks):
        dict_records = chunk.apply(lambda x: x.dropna()).to_dict('records')
        jsonlData = ndjson.dumps(dict_records).encode('utf-8')
        with open(directory+fileformat.format(label), 'wb') as file:
            file.write(jsonlData)

In [None]:
def output_md5(filename:str, suffix:str='_md5.txt', compression_type:str='zip'):
    """
    Create txt file containing the md5sum of a compressed file. 
    
    :param filename: path of the compressed file, without its extension
    :param suffix: string appended to filename to name the md5 txt file
    :param compression_type: file extension of the compressed file
    """

    compression_str = '.'+compression_type

    with open(filename+compression_str, 'rb') as f:
        contents = f.read() 
        md5_returned = hashlib.md5(contents).hexdigest()

    with open(filename+suffix, 'w') as file:
        file.write(md5_returned)
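
`output_md5` reads the whole archive into memory before hashing. For large archives, a chunked read is a common alternative. The sketch below is not part of the notebook; `md5_of_file` and `chunk_size` are hypothetical names, and the file is a throwaway created in a temporary directory.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the md5 hex digest of a file, reading it in fixed-size blocks."""
    digest = hashlib.md5()
    with open(path, 'rb') as f:
        # iter() with a sentinel stops when read() returns empty bytes
        for block in iter(lambda: f.read(chunk_size), b''):
            digest.update(block)
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'example.bin')
    with open(path, 'wb') as f:
        f.write(b'hello world')
    checksum = md5_of_file(path)

print(checksum)
```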

In [None]:
def clear_temp_folder(directory:str):
    """
    Delete directory and all contents.

    :param directory: path of directory to delete
    """

    try:
        shutil.rmtree(directory)
    except OSError as error:
        print(f'Error: {directory}: {error.strerror}')

In [None]:
def compress_directory(filename, root, compression_type='zip'):
    """
    Compress and zip a file directory.
    
    :param filename: name of the directory to compress. This will
                        also be used as filename for the output
    :param root: parent directory containing the directory to compress
    :param compression_type: compression type string, default 'zip'
    """

    shutil.make_archive(filename, compression_type, root, filename.split('/')[-1])

In [None]:
def package_jsonl(df:pd.DataFrame,
                    directory:str,
                    maxlines:int,
                    fileformat:str,
                    filename:str,
                    root:str,
                    compression_type:str='zip',
                    suffix:str='_md5.txt'):
    """
    Package a pandas dataframe into multiple JSONL files, compress,
    and then get md5 checksum. 
    
    :param df: pandas dataframe to export
    :param directory: filepath of directory to hold exported jsonl
    :param maxlines: int max number of jsonl lines per file
    :param fileformat: name format for export files
                    Must contain '{}' string for part number formatting
    :param filename: name of the directory to compress. This will
                    also be used as filename for the output
    :param root: parent directory containing the directory to compress
    :param compression_type: compression type string, default 'zip'
    :param suffix: string to attach to end of md5.txt file
    """

    export_df_as_jsonl_chunks(df, directory, maxlines, fileformat)
    compress_directory(filename, root, compression_type=compression_type)
    output_md5(filename, suffix=suffix, compression_type=compression_type)
    clear_temp_folder(directory)
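
The chunk-then-archive flow of `package_jsonl` can be sketched with the standard library alone. This is an illustration under assumptions rather than the notebook's implementation: it uses `json` instead of `ndjson`, and the directory name `toy_jsonl`, the `part-{n}.jsonl` filenames, and the records are invented.

```python
import json
import os
import shutil
import tempfile
import zipfile

records = [{'n': i} for i in range(5)]  # invented records
maxlines = 2

with tempfile.TemporaryDirectory() as root:
    data_dir = os.path.join(root, 'toy_jsonl')
    os.makedirs(data_dir, exist_ok=True)

    # Write records in chunks of at most `maxlines` lines per file
    for part, start in enumerate(range(0, len(records), maxlines)):
        chunk = records[start:start + maxlines]
        with open(os.path.join(data_dir, f'part-{part}.jsonl'), 'w') as f:
            f.write('\n'.join(json.dumps(r) for r in chunk) + '\n')

    # Archive the directory; the archive is written next to it as toy_jsonl.zip
    archive = shutil.make_archive(data_dir, 'zip', root, 'toy_jsonl')

    # List the JSONL members of the archive before the temp folder is removed
    with zipfile.ZipFile(archive) as zf:
        archived = sorted(n for n in zf.namelist() if n.endswith('.jsonl'))

print(archived)
```

Five records with `maxlines=2` produce three part files, the last one holding a single line.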

In [None]:
# Build PCDN
pcdn_df = build_pcdn_df(testCase_df)

# Export as single JSON
export_df_as_json(pcdn_df, PCDN_JSON)

# Export as multiple JSONL
package_jsonl(df=pcdn_df, 
            directory=PCDN_JSONL_TEMP,
            maxlines=3000,
            fileformat=PCDN_JSONL_FILENAMES,
            filename=PCDN_JSONL,
            root=PCDN_PATH,
            compression_type='zip',
            suffix='_md5.txt')

In [None]:
def get_pcdn_counts(pcdn_df:pd.DataFrame=pcdn_df, outfile:str=PCDN_COUNTS):
    """
    Export PCDN summary counts for PCDN page header
    
    :param pcdn_df: pandas dataframe of PCDN
    :param outfile: output filename
    """
    # Build dict with summary counts
    counts = {
        'Targets': pcdn_df['Gene_symbol'].nunique(),
        'Diseases': pcdn_df['Disease'].nunique(),
        'Evidences': len(pcdn_df)
        }
    
    # Recast as json object and export
    counts_obj = json.dumps(counts)
    with open(outfile, 'w') as f:
        f.write(counts_obj)

In [None]:
def get_pcdn_diseases(pcdn_df:pd.DataFrame=pcdn_df, outfile:str=PCDN_DISEASES):
    """
    Build and export list of unique (OT) disease names within the PCDN.
    This can be used for the PCDN drop-down menu.
    
    :param pcdn_df: pandas dataframe of the PCDN
    :param outfile: output filename
    """

    # Get unique Diseases and IDs from the PCDN
    df = pcdn_df.groupby(['Disease','diseaseFromSourceMappedId']).size().reset_index()[['Disease', 'diseaseFromSourceMappedId']]

    # Sort by disease name, ignoring case
    df.sort_values(by='Disease', inplace=True, key=lambda col: col.str.lower())

    # Load df as json string and parse
    # Use default_handler to avoid overflow error with uuids
    json_str = df.to_json(orient='records', default_handler=str)
    parsed = json.loads(json_str)

    # Export to json filepath
    with open(outfile, 'w') as json_file:
        json_file.write(json.dumps(parsed))

In [None]:
# Export PCDN summary counts
get_pcdn_counts(pcdn_df, PCDN_COUNTS)

# Export PCDN disease list for drop-down menu
get_pcdn_diseases(pcdn_df, PCDN_DISEASES)

## Validate priority test inputs

In [None]:
def validate_priority_tests(priority_diseases:pd.DataFrame=priority_diseases,
                            priority_targets:pd.DataFrame=priority_targets,
                            priority_evidences:pd.DataFrame=priority_evidences,
                            ot_diseases:pd.DataFrame=ot_diseases,
                            ot_targets:pd.DataFrame=ot_targets):
    """
    Validate that all priority targets and diseases defined in input priority csvs
    are found within OT's target and disease databases before continuing.

    :param priority_diseases: pandas dataframe of curated diseases to include in test cases
    :param priority_targets: pandas dataframe of curated targets to include in test cases
    :param priority_evidences: pandas dataframe of curated target-disease combinations to include in test cases
    :param ot_diseases: pandas dataframe of Open Targets disease database
    :param ot_targets: pandas dataframe of Open Targets target database
    """
    
    invalidDiseases = priority_diseases[~priority_diseases['diseaseId'].isin(ot_diseases['id'])]
    if len(invalidDiseases) > 0: print(f"DISEASE.CSV: \n {invalidDiseases} \n ---")

    invalidTargets = priority_targets[~priority_targets['targetId'].isin(ot_targets['id'])]
    if len(invalidTargets) > 0: print(f"TARGETS.CSV: \n {invalidTargets} \n ---")

    invalidEvidenceDiseases = priority_evidences[~priority_evidences['diseaseId'].isin(ot_diseases['id'])]
    if len(invalidEvidenceDiseases) > 0: print(f"EVIDENCES.CSV: \n {invalidEvidenceDiseases} \n ---")

    invalidEvidenceTargets = priority_evidences[~priority_evidences['targetId'].isin(ot_targets['id'])]
    if len(invalidEvidenceTargets) > 0: print(f"EVIDENCES.CSV: \n {invalidEvidenceTargets} \n ---")

    assert all(
                [len(invalidDiseases) == 0,
                len(invalidTargets) == 0,
                len(invalidEvidenceDiseases) == 0,
                len(invalidEvidenceTargets) == 0
                ]),f"Priority test case csvs have invalid target and/or disease ids."
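
Each check in `validate_priority_tests` is an anti-join: keep the priority rows whose ID is absent from the reference table. A minimal sketch with invented placeholder IDs (`EFO_1`, `EFO_9`, and so on are not real ontology terms):

```python
import pandas as pd

# Reference table standing in for the OT disease database (invented IDs)
reference = pd.DataFrame({'id': ['EFO_1', 'EFO_2', 'EFO_3']})

# Curated priority list with one ID that is missing from the reference
priority = pd.DataFrame({'diseaseId': ['EFO_1', 'EFO_9']})

# Rows whose diseaseId is NOT in the reference are flagged as invalid
missing = priority[~priority['diseaseId'].isin(reference['id'])]

print(missing)
```

An empty `missing` frame corresponds to a passing check.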

In [None]:
def check_priority_test_relevance(priority_diseases:pd.DataFrame=priority_diseases,
                                    priority_targets:pd.DataFrame=priority_targets,
                                    priority_evidences:pd.DataFrame=priority_evidences,
                                    testCase_df:pd.DataFrame=testCase_df):
    """
    Validate that all priority targets and diseases defined in input priority csvs
    are found within at least one pediatric dataset.
    
    :param priority_diseases: pandas dataframe of curated diseases to include in test cases
    :param priority_targets: pandas dataframe of curated targets to include in test cases
    :param priority_evidences: pandas dataframe of curated target-disease combinations to include in test cases
    :param testCase_df: pandas DataFrame of preprocessed evidence data
    """
    
    pedTargets = testCase_df['targetFromSourceId'].unique().tolist()
    pedDiseases = testCase_df['diseaseFromSourceMappedId'].unique().tolist()
    
    invalidDiseases = priority_diseases[~priority_diseases['diseaseId'].isin(pedDiseases)]
    if len(invalidDiseases) > 0: print(f"DISEASE.CSV: \n {invalidDiseases} \n ---")

    invalidTargets = priority_targets[~priority_targets['targetId'].isin(pedTargets)]
    if len(invalidTargets) > 0: print(f"TARGETS.CSV: \n {invalidTargets} \n ---")

    invalidEvidenceDiseases = priority_evidences[~priority_evidences['diseaseId'].isin(pedDiseases)]
    if len(invalidEvidenceDiseases) > 0: print(f"EVIDENCES.CSV: \n {invalidEvidenceDiseases} \n ---")

    invalidEvidenceTargets = priority_evidences[~priority_evidences['targetId'].isin(pedTargets)]
    if len(invalidEvidenceTargets) > 0: print(f"EVIDENCES.CSV: \n {invalidEvidenceTargets} \n ---")

    assert all(
                [len(invalidDiseases) == 0,
                len(invalidTargets) == 0,
                len(invalidEvidenceDiseases) == 0,
                len(invalidEvidenceTargets) == 0
                ]),f"Priority test case csvs have target and/or disease ids not found in pediatric data."

In [None]:
# Run priority test case validations
validate_priority_tests()
check_priority_test_relevance()

## Build Test Case Functions

### Test Case: Target Associations (Direct Overall)

In [None]:
def build_testCase_targetAssc(sampleSize:int,
                            ot_aotf:pd.DataFrame=ot_aotf, 
                            ot_targets:pd.DataFrame=ot_targets, 
                            pmtl_df:pd.DataFrame=pmtl_df, 
                            priority_targets:pd.DataFrame=priority_targets,
                            randomSeed:int=RANDOM_SEED):
    """
    Build test case df for Target Associations page. Not specific to new MTP data. 
    Note that until pediatric data is included in association scoring, MTP associations will
    be identical to OT associations. UPDATED function uses OT Associations on the Fly

    :param sampleSize: int number of random targets to include in test
    :param ot_aotf: pandas DataFrame of OpenTargets Associations on the Fly (AOTF)
    :param ot_targets: pandas DataFrame of OpenTargets target names for reference
    :param pmtl_df: pandas DataFrame of FDA PMTL
    :param priority_targets: pandas DataFrame of targets to always include in test case
    :param randomSeed: int seed for reproducible random results
    """

    # Group associations by target and disease ID, 
    # then again by only target Id to mimic associations heatmap
    df = ot_aotf.groupby(['target_id', 'disease_id']).size().reset_index(
        ).groupby('target_id').size().reset_index()

    # Enrich associations with target names (OT Targets) and PMTL designations
    df1 = pd.merge(df, ot_targets, how='left', left_on='target_id', right_on='id').merge(
                    pmtl_df[['ensemblID', 'designation']], 
                    how='left', left_on='target_id', right_on='ensemblID')

    # Rename columns
    df2 = df1.rename(columns={
        0:'diseaseCount',
        'approvedSymbol':'targetNameOT',
        'designation':'PMTLcode'})
        
    # Recast NaN PMTL as Unspecified (fill limited to the PMTL column)
    df2['PMTLcode'] = df2['PMTLcode'].fillna('Unspecified Target')

    # Add suffix column
    df2['suffixUrl'] = '/target/'+df2['target_id']+'/associations'

    # Reorder columns and omit redundant
    df3 = df2[
                ['suffixUrl', 
                'target_id', 
                'targetNameOT', 
                'PMTLcode',
                'diseaseCount']
                ]

    # Create subset df with priority targets and random sample of other targets
    random.seed(randomSeed)
    df4 = df3[
            (df3['target_id'].isin(priority_targets['targetId'])) |
            (df3['target_id'].isin(random.sample(df3['target_id'].unique(
            ).tolist(), sampleSize)))]

    return df4
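
The two-step `groupby` in `build_testCase_targetAssc` counts distinct diseases per target. The toy example below, using invented IDs, shows that it matches a direct `nunique` count; `toy_aotf`, `two_step`, and `direct` are hypothetical names.

```python
import pandas as pd

# Toy associations table: T1 has two rows for D1 from different datasources
toy_aotf = pd.DataFrame({
    'target_id': ['T1', 'T1', 'T1', 'T2'],
    'disease_id': ['D1', 'D1', 'D2', 'D1'],
    'datasource_id': ['a', 'b', 'a', 'a'],
})

# Step 1 collapses duplicate target-disease pairs; step 2 counts pairs per target
two_step = (toy_aotf.groupby(['target_id', 'disease_id']).size().reset_index()
            .groupby('target_id').size())

# Equivalent direct count of distinct diseases per target
direct = toy_aotf.groupby('target_id')['disease_id'].nunique()

print(two_step)
print(direct)
```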

### Test Case: Target Profile

In [None]:
def build_testCase_targetProfile(sampleSize:int, 
                                testCase_df:pd.DataFrame=testCase_df, 
                                priority_targets:pd.DataFrame=priority_targets,
                                randomSeed:int=RANDOM_SEED):
    """
    Build test case df for Target Profile page. Should only include targets with
    at least some new pediatric data. 

    :param sampleSize: int number of random targets to include in test case
    :param testCase_df: pandas DataFrame of preprocessed evidence data
    :param priority_targets: pandas DataFrame of targets to always include in test case
    :param randomSeed: int seed for reproducible random results
    """

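    # Sum evidence counts per target. numeric_only=True skips text and list
    # columns (e.g. disease IDs, descendants) that cannot be summed meaningfully
    df = testCase_df.groupby(['targetFromSourceId','targetNameOT']).sum(numeric_only=True).reset_index()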

    # Add TRUE/FALSE for presence of Gene Expression Widget (groupwise plot)
    df['opcGeneExp_target'] = np.where(df['chop_tpm_groupwise_expression'] >0, 'TRUE', 'FALSE')

    # Add TRUE/FALSE for presence of Differential Expression Widget (Target vs all diseases plot)
    df['opcDiffExpr_target'] = np.where(df['chop_differential_expression'] >0, 'TRUE', 'FALSE')

    # List all datasourceIds to be grouped and considered Somatic Alterations with tab labels
    somaticAltCols = {
    'chop_gene_level_cnv':'cnvByGene',
    'chop_gene_level_snv':'snvByGene',
    'chop_putative_oncogene_fused_gene':'fusionByGene',
    'chop_putative_oncogene_fusion':'fusion',
    'chop_variant_level_snv':'snvByVariant'}

    # List all datasourceIds to be grouped and considered Epigenetic Modifications
    epigeneticModCols = {
    'chop_gene_level_methylation':'methylByGene',
    'chop_isoform_level_methylation':'methylByIsoform'}

    # Rename columns
    df.rename(columns=somaticAltCols, inplace=True)
    df.rename(columns=epigeneticModCols, inplace=True)
    df.rename(columns={'targetFromSourceId':'targetId'}, inplace=True)

    # Add TRUE/FALSE for presence of Somatic Alterations widget
    df['opcSomaticAlt'] = np.where(df[list(somaticAltCols.values())].sum(axis=1) > 0, 'TRUE', 'FALSE')

    # Add TRUE/FALSE for presence of Epigenetic Modifications widget
    df['opcEpiMod'] = np.where(df[list(epigeneticModCols.values())].sum(axis=1) > 0, 'TRUE', 'FALSE')

    # Add suffix column
    df['suffixUrl'] ='/target/'+df['targetId']

    # Rearrange columns and omit redundant
    df1 = df[
                ['suffixUrl', 
                'targetId', 
                'targetNameOT', 
                'opcGeneExp_target', 
                'opcSomaticAlt',
                'opcEpiMod',
                'opcDiffExpr_target', 
                'snvByGene', 
                'snvByVariant', 
                'cnvByGene', 
                'fusionByGene', 
                'fusion',
                'methylByGene',
                'methylByIsoform']
                ]

    # Create subset df with priority targets and random sample of other targets
    random.seed(randomSeed)
    df2 = df1[
            (df1['targetId'].isin(priority_targets['targetId'])) |
            (df1['targetId'].isin(random.sample(df1['targetId'].unique().tolist(), sampleSize)))]

    return df2
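
The "priority plus random sample" selection used by the test case builders can be sketched on toy data. Seeding `random` before sampling makes the selection reproducible. The IDs, the seed value, and the helper `pick` are invented for illustration.

```python
import random
import pandas as pd

# Toy table of ten targets and a two-item priority list (invented IDs)
toy = pd.DataFrame({'targetId': [f'T{i}' for i in range(10)]})
priority = ['T0', 'T1']


def pick(seed, sample_size=3):
    """Return priority rows plus a seeded random sample of all targets."""
    random.seed(seed)
    sampled = random.sample(toy['targetId'].unique().tolist(), sample_size)
    return toy[toy['targetId'].isin(priority) | toy['targetId'].isin(sampled)]


first = pick(42)
second = pick(42)

print(first['targetId'].tolist())
```

Because the random sample is drawn from all targets, it can overlap the priority list, so the result holds between `sample_size` and `sample_size + len(priority)` rows.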

### Define Indirect Search function

In [None]:
def indirect_search(df:pd.DataFrame, 
                    diseaseId:str, 
                    nonPropDS:list=['expression_atlas'], 
                    ot_diseases:pd.DataFrame=ot_diseases, 
                    targetId:str=None,
                    diseaseCol:str='disease_id',
                    targetCol:str='target_id'):
    """
    Search a dataframe and return all direct and indirect evidence of a disease.
    Indirect evidence includes both the target and disease of interest as well as 
    the target and descendant diseases of interest. 

    :param df: dataframe to search. Must contain descendants and disease_id cols
    :param diseaseId: EFO id of disease of interest. 
    :param nonPropDS: Non-Propagating DataSources. List of datasource IDs to exclude from
                indirect evidence propagation. Direct evidence for the disease of 
                interest will be gathered, but no indirect evidence. 
    :param ot_diseases: dataframe of OpenTargets disease database. Used as a backup to get
                descendants for a disease that does not appear in direct evidence
    :param targetId: ENSG id of target of interest. Optional. Default=None
    :param diseaseCol: Column header for disease IDs. Default 'disease_id' to match AOTF
    :param targetCol: Column header for target IDs. Default 'target_id' to match AOTF
    """


    # Get direct evidence for disease of interest
    # Indirect evidence will be iteratively added to this df later
    df_combined = df[df[diseaseCol] == diseaseId]

    # If no direct evidence found, then use backup disease db
    # to get list of descendants for disease of interest
    if len(df_combined) == 0:
        descList = (ot_diseases[
                        ot_diseases['id'] == diseaseId]
                        ['descendants'].tolist()[0])
    else:
        # Get list of descendants for disease of interest
        descList = df_combined['descendants'].tolist()[0]

    # Loop through descendants and add evidence to the df
    for efo in descList:
        
        # If datasource_id is included, exclude nonProp datasources
        if 'datasource_id' in df.columns:
            df_temp = df[
                        (df[diseaseCol] == efo) & 
                        (~df['datasource_id'].isin(nonPropDS))]
        else: 
            df_temp = df[
                        (df[diseaseCol] == efo)]

        df_combined = pd.concat([df_combined, df_temp])
    
    # If target of interest specified, add final filtering
    if targetId is not None:
        df_combined = df_combined[df_combined[targetCol] == targetId]

    return df_combined
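
A toy version of the indirect search may help clarify the propagation rule: direct evidence is always kept, while evidence from descendant diseases is kept only when its datasource propagates. The sketch below uses a single `isin` filter instead of the per-descendant loop in `indirect_search`. The data, the `descendants` mapping, and `toy_indirect_search` are invented for illustration.

```python
import pandas as pd

# Toy evidence table (invented IDs); D_child is a descendant of D_parent
toy = pd.DataFrame({
    'disease_id': ['D_parent', 'D_child', 'D_child', 'D_other'],
    'target_id': ['T1', 'T2', 'T3', 'T4'],
    'datasource_id': ['src_a', 'src_a', 'expression_atlas', 'src_a'],
})

# Hypothetical ontology lookup: disease -> list of descendant diseases
descendants = {'D_parent': ['D_child']}


def toy_indirect_search(df, disease_id, non_prop=('expression_atlas',)):
    """Return direct evidence plus propagating evidence from descendants."""
    direct = df[df['disease_id'] == disease_id]
    indirect = df[
        df['disease_id'].isin(descendants.get(disease_id, []))
        & ~df['datasource_id'].isin(list(non_prop))]
    return pd.concat([direct, indirect])


result = toy_indirect_search(toy, 'D_parent')
print(result)
```

Here `T3` is excluded because its only evidence comes from a non-propagating datasource on a descendant disease.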

### Test Case: Disease Associations (Indirect Overall)

In [None]:
def build_testCase_diseaseAssc(sampleSize:int,
                            ot_aotf:pd.DataFrame=ot_aotf, 
                            ot_diseases:pd.DataFrame=ot_diseases,
                            priority_diseases:pd.DataFrame=priority_diseases,
                            all_ped_diseases:bool=False,
                            testCase_df:pd.DataFrame=testCase_df,
                            randomSeed:int=RANDOM_SEED):
    """
    Build test case df for Disease Associations page. Not specific to new MTP data. 
    Note that until pediatric data is included in association scoring, MTP associations will
    be identical to OT associations. UPDATED function finds indirect associations using 
    OT's AOTF files and OT's disease ontology descendants.

    :param sampleSize: int number of random diseases to include in test
    :param ot_aotf: pandas DataFrame of OpenTargets Associations on the Fly evidence
    :param ot_diseases: pandas DataFrame of OpenTargets disease names for reference
    :param priority_diseases: pandas DataFrame of diseases to always include in test case
    :param all_ped_diseases: if set to True, use all ped diseases instead of priority diseases
    :param testCase_df: pandas DataFrame of preprocessed evidence data. Required if all_ped_diseases=True
    :param randomSeed: int seed for reproducible random results
    """

    # Get list of all unique diseases in df
    diseaseList = ot_aotf['disease_id'].unique().tolist()

    # Initiate random seed for sampling
    # Note that there's no logic to prevent overlap between priority and random 
    # sample lists, but that won't cause errors
    random.seed(randomSeed)

    # If all_ped_diseases toggled, get list of all pediatric diseases and add to random sample
    if all_ped_diseases:
        diseaseTestList = (testCase_df['diseaseFromSourceMappedId'].unique().tolist()
                           + random.sample(diseaseList, sampleSize))

    # If all_ped_diseases not toggled (default False), use priority diseases and add random sample
    else:
        diseaseTestList = (priority_diseases['diseaseId'].tolist()
                            + random.sample(diseaseList, sampleSize))

    # Enrich evidence df with disease descendants
    df = ot_aotf.merge(ot_diseases, how='left', left_on='disease_id', right_on='id')

    # Build df with test case diseases from OT disease database
    df1 = (ot_diseases
                .loc[ot_diseases['id'].isin(diseaseTestList)]
                .reset_index(drop=True)
                .rename(columns={'id':'diseaseId','name':'diseaseNameOT'}))

    # For each test case disease, run indirect_search function and get count of unique target IDs.
    # indirect_search function gathers all evidence for the disease and all descendants
    # Runtime for this step varies depending on number of descendants, but usually <30 sec/disease
    df1['targetCount'] = df1['diseaseId'].apply(lambda x: indirect_search(df, x)['target_id'].nunique())

    # Add suffix column
    df1['suffixUrl'] = '/disease/'+df1['diseaseId']+'/associations'

    # Reorder columns and omit redundant
    df2 = df1[
                ['suffixUrl', 
                'diseaseId', 
                'diseaseNameOT', 
                'targetCount']
                ]

    return df2

### Test Case: Disease Profile

In [None]:
def build_testCase_diseaseProfile(sampleSize:int, 
                                testCase_df:pd.DataFrame=testCase_df, 
                                priority_diseases:pd.DataFrame=priority_diseases,
                                all_ped_diseases:bool=False,
                                randomSeed:int=RANDOM_SEED):
    """
    Build test case df for Disease Profile page. Should only include diseases with
    at least some new pediatric data. 

    :param sampleSize: int number of random diseases to include in test case
    :param testCase_df: pandas DataFrame of preprocessed evidence data
    :param priority_diseases: pandas DataFrame of diseases to always include in test case
    :param all_ped_diseases: if set to True, use all ped diseases instead of priority diseases
    :param randomSeed: int seed for reproducible random results
    """

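    # Sum evidence counts per disease. numeric_only=True skips text and list
    # columns (e.g. target IDs, descendants) that cannot be summed meaningfully
    df = testCase_df.groupby(['diseaseFromSourceMappedId','diseaseNameOT']).sum(numeric_only=True).reset_index()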

    # Add TRUE/FALSE for presence of Differential Expression Widget (Disease vs top targets plot)
    df['opcDiffExpr_disease'] = np.where(df['chop_differential_expression'] >0, 'TRUE', 'FALSE')

    # Rename columns
    df.rename(columns={'diseaseFromSourceMappedId':'diseaseId'}, inplace=True)

    # Add suffix column
    df['suffixUrl'] ='/disease/'+df['diseaseId']

    # Rearrange columns and omit redundant
    df1 = df[
                ['suffixUrl', 
                'diseaseId', 
                'diseaseNameOT', 
                'opcDiffExpr_disease']
                ]

    # If all_ped_diseases toggled, use entire df as-is
    if all_ped_diseases:
        df2 = df1

    # If all_ped_diseases not toggled (default False), then create subset df
    # with priority diseases and random sample of other ped diseases
    # Note that there's no logic to prevent overlap between priority and random 
    # sample lists, but that won't cause errors
    else:
        random.seed(randomSeed)
        df2 = df1[
                (df1['diseaseId'].isin(priority_diseases['diseaseId'])) |
                (df1['diseaseId'].isin(random.sample(df1['diseaseId'].unique().tolist(), sampleSize)))]

    return df2

### Test Case: Evidence Page

In [None]:
def build_testCase_evidence(sampleSize:int, 
                            testCase_df:pd.DataFrame=testCase_df, 
                            priority_evidences:pd.DataFrame=priority_evidences,
                            randomSeed:int=RANDOM_SEED):
    """
    Build test case df for Evidence page. Should only include target-disease
    evidence with at least some new pediatric data. UPDATED function includes 
    indirect evidence

    :param sampleSize: int number of random evidence combinations to include in test case
    :param testCase_df: pandas DataFrame of preprocessed evidence data
    :param priority_evidences: pandas DataFrame of evidences to always include in test case
    :param randomSeed: int seed for reproducible random results
    """

    # Get list of target-disease evidence combos from priority list
    priorityTuples = (priority_evidences
                        [['diseaseId','targetId']]
                        .apply(tuple, axis=1).tolist())

    # Get random sample of target-disease combos from evidence df
    randomTuples = (testCase_df.sample(n=sampleSize, random_state=randomSeed)
                        [['diseaseFromSourceMappedId', 'targetFromSourceId']]
                        .apply(tuple, axis=1).tolist())

    # Combine target-disease combo lists
    evidenceTuples = priorityTuples + randomTuples


    # List all datasourceIds to be grouped and considered Somatic Alterations with tab labels
    somaticAltCols = {
                    'chop_gene_level_cnv':'cnvByGene',
                    'chop_gene_level_snv':'snvByGene',
                    'chop_putative_oncogene_fused_gene':'fusionByGene',
                    'chop_putative_oncogene_fusion':'fusion',
                    'chop_variant_level_snv':'snvByVariant'
                    }

    # List all datasourceIds to be grouped and considered Epigenetic Modifications
    epigeneticModCols = {
                    'chop_gene_level_methylation':'methylByGene',
                    'chop_isoform_level_methylation':'methylByIsoform'
                    }


    # CHoP datasource columns to sum for total indirect evidence
    sumCols =      [
                    'chop_gene_level_snv', 
                    'chop_variant_level_snv', 
                    'chop_gene_level_cnv', 
                    'chop_putative_oncogene_fused_gene', 
                    'chop_putative_oncogene_fusion',
                    'chop_gene_level_methylation',
                    'chop_isoform_level_methylation'
                    ]

    # CHoP columns to retain as-is for direct evidence
    retainCols =    [
                    'targetFromSourceId', 
                    'diseaseFromSourceMappedId', 
                    'targetNameOT', 
                    'diseaseNameOT', 
                    'chop_tpm_groupwise_expression', 
                    'chop_tpm_genewise_expression'
                    ] 


    # Define empty dataframe to fill
    df = pd.DataFrame()

    # Iterate through evidence target-disease combos to get indirect evidence
    for disease, target in evidenceTuples:

        # Gather all indirect evidence for a single target-disease combo
        df_temp1 = indirect_search(df=testCase_df, 
                                    diseaseId=disease, 
                                    targetId=target, 
                                    diseaseCol='diseaseFromSourceMappedId', 
                                    targetCol='targetFromSourceId')

        # Group by target and sum evidence counts
        df_temp2 = df_temp1.groupby('targetFromSourceId')[sumCols].sum().reset_index()

        # Join target/evidenceCount df with retained columns to get summed row with all info
        df_temp3 = pd.merge(df_temp2, df_temp1[retainCols], how='left', on='targetFromSourceId')

        # Select single row with summed indirect evidence counts for target-disease combo
        df_temp4 = df_temp3[df_temp3['diseaseFromSourceMappedId'] == disease]
        
        # Add to cumulative output df
        df = pd.concat([df, df_temp4])


    # Add TRUE/FALSE for presence of Gene Expression Widget (genewise plot)
    df['opcGeneExp_evidence'] = np.where(df['chop_tpm_genewise_expression'] >0, 'TRUE', 'FALSE')

    # Rename columns
    df.rename(columns=somaticAltCols, inplace=True)
    df.rename(columns=epigeneticModCols, inplace=True)
    df.rename(columns={'targetFromSourceId':'targetId', 'diseaseFromSourceMappedId':'diseaseId'}, inplace=True)

    # Add TRUE/FALSE for presence of Somatic Alterations widget
    df['opcSomaticAlt'] = np.where(df[list(somaticAltCols.values())].sum(axis=1) > 0, 'TRUE', 'FALSE')

    # Add TRUE/FALSE for presence of Epigenetic Modifications widget
    df['opcEpiMod'] = np.where(df[list(epigeneticModCols.values())].sum(axis=1) > 0, 'TRUE', 'FALSE')

    # Add suffix column
    df['suffixUrl'] ='/evidence/'+df['targetId']+'/'+df['diseaseId']

    # Rearrange columns and omit redundant
    df1 = df[
                ['suffixUrl', 
                'targetId', 
                'diseaseId', 
                'targetNameOT', 
                'diseaseNameOT', 
                'opcGeneExp_evidence', 
                'opcSomaticAlt',
                'opcEpiMod', 
                'snvByGene', 
                'snvByVariant', 
                'cnvByGene', 
                'fusionByGene', 
                'fusion',
                'methylByGene',
                'methylByIsoform']
                ]

    return df1
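
The sum-then-rejoin step inside the loop of `build_testCase_evidence` can be traced on a two-row toy example. Summed columns pool counts across the disease and its descendants, while retained columns keep the direct value for the disease of interest. The IDs and counts are invented, and only one summed and one retained evidence column are shown.

```python
import pandas as pd

# Rows already gathered for one target-disease pair: the disease itself
# plus one descendant disease (invented IDs and counts)
evidence = pd.DataFrame({
    'targetFromSourceId': ['T1', 'T1'],
    'diseaseFromSourceMappedId': ['D_parent', 'D_child'],
    'chop_gene_level_snv': [2, 5],
    'chop_tpm_genewise_expression': [1, 0],
})

# Sum the propagating evidence counts across all gathered rows
summed = (evidence.groupby('targetFromSourceId')[['chop_gene_level_snv']]
          .sum().reset_index())

# Columns kept as-is so the direct (non-summed) values survive the join
retained = evidence[['targetFromSourceId',
                     'diseaseFromSourceMappedId',
                     'chop_tpm_genewise_expression']]

# Rejoin, then keep only the row for the disease of interest
merged = summed.merge(retained, how='left', on='targetFromSourceId')
row = merged[merged['diseaseFromSourceMappedId'] == 'D_parent']

print(row)
```

The resulting row reports 7 SNV records (2 direct plus 5 from the descendant) but keeps the direct expression value of 1.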

### Test Case: Pediatric Cancer Data Navigation (PCDN) by Gene

In [None]:
def build_testCase_pcdnGene(sampleSize:int, 
                        pcdn_df:pd.DataFrame=pcdn_df, 
                        priority_targets:pd.DataFrame=priority_targets,
                        randomSeed:int=RANDOM_SEED):
    """
    Build test case df for PCDN page, gene search. Should only include 
    targets contained within the PCDN data (and therefore new pediatric data). 

    :param sampleSize: int number of random targets to include in test case
    :param pcdn_df: pandas DataFrame of preprocessed Pediatric Cancer Data Navigation page
    :param priority_targets: pandas DataFrame of targets to always include in test case
    :param randomSeed: int seed for reproducible random results
    """

    # ----------------------------------------------------------------------------------------
    # The original implementation below is commented out because the PCDN search behaves
    # unexpectedly: selecting a symbol (e.g. EGFR) returns results for that symbol and for
    # any other symbol that starts with it followed by a hyphen (e.g. EGFR-AS1). The test
    # case script has been updated to match this behavior; the commented code is kept in
    # case the PCDN search behavior ever changes.

    # # Build subset of the PCDN df using priority targets and random sample of other targets
    # random.seed(randomSeed)
    # df = pcdn_df[
    #     (pcdn_df['Gene_symbol'].isin(priority_targets['symbol'])) |
    #     (pcdn_df['Gene_symbol'].isin(
    #       random.sample(pcdn_df['Gene_symbol'].unique().tolist(), sampleSize)))]

    # # Group subset by gene symbol to get resulting evidence count
    # df1 = df.groupby('Gene_symbol').size().reset_index()

    # # Rename columns
    # df1.rename(columns={0:'evidenceResults', 'Gene_symbol':'name'}, inplace=True)
    # ----------------------------------------------------------------------------------------

    # Create combined list of test targets from priority targets and random sample
    random.seed(randomSeed)
    targetList = (priority_targets['symbol'].tolist() +
                    random.sample(pcdn_df['Gene_symbol'].unique().tolist(), sampleSize))

    # Drop duplicates (a priority target may also be drawn at random), keeping order
    targetList = list(dict.fromkeys(targetList))

    # Iterate over targets to get the count of results for the target AND for
    # any targets that start with the target name followed by a hyphen (-)
    rows = []
    for gene in targetList:
        count = len(
            pcdn_df[
            (pcdn_df['Gene_symbol'] == gene) |
            (pcdn_df['Gene_symbol'].str.startswith(gene + '-'))])

        # Collect the combined count for each target
        rows.append({'name': gene, 'evidenceResults': count})

    # Build the dataframe once from the collected rows
    df1 = pd.DataFrame(rows, columns=['name', 'evidenceResults'])

    # End updated code section. Keep everything below ----

    # Add suffix and category columns
    df1['suffixUrl'] = '/pediatric-cancer-data-navigation'
    df1['category'] = 'target'

    # Rearrange columns and omit redundant
    df2 = df1[
                ['suffixUrl', 
                'category', 
                'name', 
                'evidenceResults']
                ]

    return df2
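
To make the matching rule concrete, here is a minimal sketch on toy data. It assumes only what the comments above describe: a search for a symbol returns rows for that exact symbol plus rows for any symbol made of it followed by a hyphen. The helper name `count_search_results`, the symbol `EGFRX`, and the disease labels are hypothetical and exist only for illustration.

```python
import pandas as pd

# Toy stand-in for pcdn_df; rows are illustrative, not real PCDN data
toy_pcdn = pd.DataFrame({
    'Gene_symbol': ['EGFR', 'EGFR', 'EGFR-AS1', 'EGFRX', 'TP53'],
    'Disease': ['Disease A', 'Disease B', 'Disease A', 'Disease A', 'Disease A'],
})

def count_search_results(gene: str, pcdn: pd.DataFrame) -> int:
    """Count rows for `gene` plus rows for any symbol of the form `gene-<suffix>`."""
    symbols = pcdn['Gene_symbol']
    mask = (symbols == gene) | symbols.str.startswith(gene + '-')
    return int(mask.sum())

egfr_count = count_search_results('EGFR', toy_pcdn)
tp53_count = count_search_results('TP53', toy_pcdn)
print(egfr_count, tp53_count)  # 3 1
```

`EGFRX` is not counted for an `EGFR` search because the rule requires a hyphen directly after the searched symbol, so a plain prefix match would overcount.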

### Test Case: Pediatric Cancer Data Navigation (PCDN) by Disease

In [None]:
def build_testCase_pcdnDisease(sampleSize:int, 
                        maxResults:int=10000,
                        pcdn_df:pd.DataFrame=pcdn_df, 
                        priority_diseases:pd.DataFrame=priority_diseases,
                        randomSeed:int=RANDOM_SEED):
    """
    Build test case df for PCDN page, disease search. Should only include 
    diseases contained within the PCDN data (and therefore new pediatric data). 

    :param sampleSize: int number of random diseases to include in test case
    :param maxResults: int maximum count of evidence results. Default is 10,000 to match MTP
    :param pcdn_df: pandas DataFrame of preprocessed Pediatric Cancer Data Navigation page
    :param priority_diseases: pandas DataFrame of diseases to always include in test case
    :param randomSeed: int seed for reproducible random results
    """

    # Build subset of the PCDN df using priority diseases and random sample of other diseases
    random.seed(randomSeed)
    df = pcdn_df[
        (pcdn_df['Disease'].isin(priority_diseases['name'])) |
        (pcdn_df['Disease'].isin(random.sample(pcdn_df['Disease'].unique().tolist(), sampleSize)))]

    # Group subset by disease to get resulting evidence count
    df1 = df.groupby('Disease').size().reset_index().rename(columns={0:'fullResults'})

    # Cap number of results at a maximum (default 10,000) to match site functionality
    df1['evidenceResults'] = np.where(df1['fullResults'] > maxResults, maxResults, df1['fullResults'])

    # Rename columns
    df1.rename(columns={'Disease':'name'}, inplace=True)

    # Add suffix and category columns
    df1['suffixUrl'] = '/pediatric-cancer-data-navigation'
    df1['category'] = 'disease'

    # Rearrange columns and omit redundant
    df2 = df1[
                ['suffixUrl', 
                'category', 
                'name', 
                'evidenceResults']]

    return df2
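
The count-then-cap step can be checked in isolation. The sketch below uses toy data and a deliberately small cap so the effect is visible; the notebook's default cap is 10,000. It also shows `Series.clip(upper=...)` as an equivalent alternative to the `np.where` call. This is offered as an option, not as the notebook's method.

```python
import numpy as np
import pandas as pd

# Toy stand-in for pcdn_df: disease A has 5 rows, B has 2, C has 1
toy_pcdn = pd.DataFrame({'Disease': ['A'] * 5 + ['B'] * 2 + ['C']})
max_results = 3  # small cap for illustration only

# Count rows per disease
counts = toy_pcdn.groupby('Disease').size().reset_index(name='fullResults')

# Cap with np.where, as in the function above
counts['evidenceResults'] = np.where(
    counts['fullResults'] > max_results, max_results, counts['fullResults'])

# Equivalent cap using Series.clip
counts['evidenceResultsClip'] = counts['fullResults'].clip(upper=max_results)

print(counts)
```

Both capped columns hold the same values; only disease A, which exceeds the cap, is changed.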

### Test Case: Pediatric Molecular Target Lists (PMTL)

In [None]:
def build_testCase_pmtl(pmtl_df:pd.DataFrame=pmtl_df):
    """ 
    Build test case df for FDA PMTL page.

    :param pmtl_df: pandas DataFrame of the computable FDA PMTL
    """

    # Group PMTL targets by R/NR designation and count the targets in each group
    df = pmtl_df.groupby('designation').size().reset_index()

    # Add suffix and extra column to specify the group type as designation
    df['suffixUrl'] = '/fda-pmtl'
    df['category'] = 'designation'

    # Rename columns
    df.rename(columns={0:'count', 'designation':'categoryValue'}, inplace=True)

    # Rearrange columns and omit redundant
    df1 = df[
                ['suffixUrl', 
                'category', 
                'categoryValue', 
                'count']
                ]

    return df1

## Generate Test Cases

In [None]:
# Run test case functions to build dataframes for excel export
targetAssc = build_testCase_targetAssc(30)
targetProfile = build_testCase_targetProfile(30)
diseaseAssc = build_testCase_diseaseAssc(15, all_ped_diseases=True)
diseaseProfile = build_testCase_diseaseProfile(15, all_ped_diseases=True)
evidence = build_testCase_evidence(35)
pcdnGene = build_testCase_pcdnGene(30)
pcdnDisease = build_testCase_pcdnDisease(30)
pmtl = build_testCase_pmtl()
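
Before export, it may be worth confirming that each generated dataframe has the expected shape. The sketch below is one possible check and is not part of the original notebook: the helper `check_test_case` is hypothetical, and the toy rows (`GENE_A`, `GENE_B`) are placeholders. It assumes `key_cols` is a subset of `required_cols`.

```python
import pandas as pd

def check_test_case(df: pd.DataFrame, required_cols: list, key_cols: list) -> list:
    """Return a list of problems found in a test-case dataframe (empty if none)."""
    missing = [col for col in required_cols if col not in df.columns]
    if missing:
        # Later checks need these columns, so stop here
        return [f"missing columns: {missing}"]

    problems = []
    if df.empty:
        problems.append("no rows")
    if df.duplicated(subset=key_cols).any():
        problems.append(f"duplicate keys in {key_cols}")
    if df[required_cols].isna().any().any():
        problems.append("null values present")
    return problems

# Toy dataframe shaped like the pcdnGene output
cols = ['suffixUrl', 'category', 'name', 'evidenceResults']
good = pd.DataFrame({
    'suffixUrl': ['/pediatric-cancer-data-navigation'] * 2,
    'category': ['target'] * 2,
    'name': ['GENE_A', 'GENE_B'],
    'evidenceResults': [12, 3],
})

# Same dataframe with the first row repeated, to trigger the duplicate check
bad = pd.concat([good, good.iloc[[0]]], ignore_index=True)

good_problems = check_test_case(good, cols, ['name'])
bad_problems = check_test_case(bad, cols, ['name'])
print(good_problems)
print(bad_problems)
```

A check like this could be run over each test-case dataframe after generation, with the key columns chosen per sheet.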

## Export Test Cases to Excel

In [None]:
# Build df with test generation metadata
sourceData = pd.DataFrame({
    'Open Targets Version':OT_VERSION, 
    'CHoP Somatic Alterations Version':OPENPEDCAN_SOMATIC_ALTERATIONS_VERSION,
    'CHoP Gene Expression Version':OPENPEDCAN_GENE_EXPRESSION_VERSION,
    'CHoP Epigenetic Modification Version':OPENPEDCAN_EPIGENETIC_MODIFICATION_VERSION,
    'CHoP Differential Expression Version':OPENPEDCAN_DIFFERENTIAL_EXPRESSION_VERSION,
    'PMTL Version':PMTL_VERSION,
    'Test Case Generation Date':pd.Timestamp.now(),
    'Reproducible Random Seed':RANDOM_SEED
    }.items(),
    columns=['MTP DV3 TEST CASES FOR AUTOMATION', ''])

In [None]:
# Define and name test case dfs for export
outputDict = {
    'sourceData': sourceData,
    'targetAssc': targetAssc,
    'targetProfile': targetProfile,
    'diseaseAssc': diseaseAssc,
    'diseaseProfile': diseaseProfile,
    'evidence': evidence,
    'pcdnGene': pcdnGene,
    'pcdnDisease': pcdnDisease,
    'pmtl': pmtl}

In [None]:
def export_test_cases_as_excel(outputDict:dict, outfile:str=XLSX_OUTPUT):
    """
    Build and format an Excel file containing test cases for automation.
    
    :param outputDict: dict of test cases to export as Excel sheets
    :param outfile: filepath for Excel output file
    """

    # Create writer using XlsxWriter as the engine
    writer = pd.ExcelWriter(outfile, engine='xlsxwriter')

    # Write each test case df defined in outputDict to a sheet
    for sheetname, df in outputDict.items():
        df.to_excel(writer, sheet_name=sheetname, index=False)
    
    # Format width of (descriptive) first sheet columns for readability
    firstSheet = writer.sheets[next(iter(outputDict))]
    firstSheet.set_column(0, 0, 40)
    firstSheet.set_column(1, 1, 20)

    # Save and close Excel writer (ExcelWriter.save() was removed in pandas 2.0)
    writer.close()

In [None]:
# Run export function
export_test_cases_as_excel(outputDict, XLSX_OUTPUT)