# Data Validation 3: Verify that data appears as expected within the MTP UI
2022-10-28 ZD  

## Description
### "Does the MTP UI match the MTP database?"

### Purpose
This validation tests “completeness and accuracy of data loaded into the platform” and follows MTP Data Validations 1 and 2. It compares the data displayed within the platform GUI (after loading) to the expected values within the source data (before loading). Automated scripts pull test cases from the data; these test cases are then fed into platform testing automations that check displays for completeness and accuracy.

### Scope
DV3 will focus on displays within the platform that relate to new pediatric data, including those that happen to incorporate Open Targets (OT) data. New pediatric data includes the Food and Drug Administration’s Pediatric Molecular Target Lists (FDA PMTL); the much larger collection of evidence data provided by the Children’s Hospital of Philadelphia (CHoP); and derived summary tables, such as the Pediatric Cancer Data Navigation (PCDN) page. It will not validate or test displays that only include OT data without pediatric data.  

The testing within DV3 will use sampling (spot-checking) methods, though scalability to meet automation capacity will be a design goal. Samples tested will include a set of defined high-profile genes and diseases that we expect will garner an abundance of user attention. We will also include a random sampling of genes and diseases (associated with CHoP data) to expand testing scope.
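
The sampling approach described above (a fixed priority list plus a random draw) can be sketched as follows. This is a minimal illustration, not the notebook's code: the helper name `select_test_ids` and the `seed` parameter are assumptions added here so that a draw can be repeated, whereas the notebook itself calls `random.sample` without a seed. The identifiers are toy values.

```python
import random

def select_test_ids(all_ids, priority_ids, sample_size, seed=None):
    """Return priority IDs found in the data plus a random sample of all IDs.

    Hypothetical helper for illustration only.
    """
    unique_ids = sorted(set(all_ids))   # sorted so a given seed gives a repeatable draw
    rng = random.Random(seed)           # local generator; global random state is untouched
    k = min(sample_size, len(unique_ids))
    sampled = set(rng.sample(unique_ids, k))
    # Priority IDs that do not occur in the data cannot be tested, so they are skipped
    wanted = sampled | (set(priority_ids) & set(unique_ids))
    return [i for i in unique_ids if i in wanted]

# Toy identifiers, not real Ensembl or EFO IDs
all_ids = [f"GENE{n:03d}" for n in range(50)]
priority = ["GENE007", "GENE042", "NOT_IN_DATA"]

selected = select_test_ids(all_ids, priority, sample_size=5, seed=0)
print(selected)
```

Because the random draw is taken from all IDs, it can overlap with the priority list, so the result may contain fewer than `len(priority) + sample_size` entries.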

### Test Case Overview

Testable values for each test case will be contained in a separate tab within the output Excel file.

1. Target Association
    - Count of associated diseases
    - PMTL annotation
2. Target Profile Page
    - Somatic Alterations widget
        - Status
        - Row count of each of 5 tabs
    - Gene Expression widget
        - Status
3. Disease Association
    - Count of associated targets
4. Evidence Page
    - Somatic Alterations widget
        - Status
        - Row count of each of 5 tabs
    - Gene Expression widget
        - Status   
5. Pediatric Cancer Data Navigation (PCDN) Page
    - Row count of resulting evidence pages when searching for target or disease
6. Pediatric Molecular Target Lists (PMTL) Page
    - Row count of Relevant and Non-Relevant targets

## Import modules and define relative paths

In [32]:
import pandas as pd
import numpy as np
import random
import glob
import json

In [2]:
# CORE VERSIONS
OT_VERSION = '22.04'
OPENPEDCAN_VERSION = 'v10.0'
PMTL_VERSION = 'v3.0'

# --------

# INPUTS

# Data from Open Targets
OT_PATH = 'data/external/opentargets/platform/' + OT_VERSION + '/output/etl/json/'

OT_ASSC_OVERALLDIRECT_PATH = OT_PATH + 'associationByOverallDirect/'
OT_ASSC_OVERALLINDIRECT_PATH = OT_PATH + 'associationByOverallIndirect/'
OT_DISEASES_PATH = OT_PATH + 'diseases/'
OT_TARGETS_PATH = OT_PATH + 'targets/'

# Data from CHoP
CHOP_PATH = 'data/raw/OpenPedCan_' + OPENPEDCAN_VERSION + '/'

# CHoP: Somatic Alterations
CNV_PATH = CHOP_PATH + 'gene-level-cnv-consensus-annotated-mut-freq.jsonl.gz'
SNVGENE_PATH = CHOP_PATH + 'gene-level-snv-consensus-annotated-mut-freq.jsonl.gz'
SNV_PATH = CHOP_PATH + 'variant-level-snv-consensus-annotated-mut-freq.jsonl.gz'
FUSIONGENE_PATH = CHOP_PATH + 'putative-oncogene-fused-gene-freq.jsonl.gz'
FUSION_PATH = CHOP_PATH + 'putative-oncogene-fusion-freq.jsonl.gz'

# CHoP: Gene Expression
TPMGENE_PATH = CHOP_PATH + 'long_n_tpm_mean_sd_quantile_gene_wise_zscore.jsonl.gz'
TPMGROUP_PATH = CHOP_PATH + 'long_n_tpm_mean_sd_quantile_group_wise_zscore.jsonl.gz'

# PMTL and PCDN data
PMTL_PATH = 'data/processed/pmtl_' + PMTL_VERSION + '.json'
PCDN_PATH = 'data/processed/chopDataNavigationTable_' + OT_VERSION + '_' + OPENPEDCAN_VERSION + '.json'

# Priority targets and diseases for test cases
PRIORITY_PATH = 'data/processed/dv3_priority_tests/'
PRIORITY_TARGETS_PATH = PRIORITY_PATH + 'targets.csv'
PRIORITY_DISEASES_PATH = PRIORITY_PATH + 'diseases.csv'
PRIORITY_EVIDENCES_PATH = PRIORITY_PATH + 'evidences.csv'

# --------

# OUTPUTS
XLSX_OUTPUT = 'MTP_DataValidation_InputFile.xlsx'

## Load data for processing

In [3]:
def load_jsonl_files_to_df(path:str, filetype:str='*.json'):
    """
    Load multiple identically-structured jsonl files within a local folder 
    into a single dataframe. Useful for OpenTargets FTP downloads.

    :param path: Relative filepath to the folder containing the jsonl files.
    :param filetype: Filetype suffix of files to include. default '*.json'
    """
    
    # OT uses 'json' extension for 'jsonl' files
    fullPath = path + filetype

    # Create list of all files within path folder
    files = glob.glob(fullPath)

    # Build df by combining all files in path folder
    df = pd.concat(
        (pd.read_json(f, orient='records', lines=True)
        for f in files))

    return df
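
The loader above relies on every file in the folder sharing one structure. As a rough sketch of how it behaves, the variant below runs the same `glob` plus `pd.concat` pattern against two small temporary files. The name `load_jsonl_folder`, the empty-folder check, the sorted file order, and `ignore_index=True` are assumptions added for this illustration and are not part of the notebook.

```python
import glob
import os
import tempfile

import pandas as pd

def load_jsonl_folder(path, pattern='*.json'):
    """Hypothetical variant of the loader with an explicit empty-folder check."""
    files = sorted(glob.glob(os.path.join(path, pattern)))  # stable file order
    if not files:
        # pd.concat raises an unspecific ValueError when given nothing to combine
        raise FileNotFoundError(f"No files matching {pattern} in {path}")
    return pd.concat(
        (pd.read_json(f, orient='records', lines=True) for f in files),
        ignore_index=True)  # one continuous row index across all files

# Write two toy jsonl parts to a temporary folder and load them back
with tempfile.TemporaryDirectory() as tmp:
    parts = [
        '{"id": "T1", "name": "a"}\n{"id": "T2", "name": "b"}\n',
        '{"id": "T3", "name": "c"}\n',
    ]
    for i, rows in enumerate(parts):
        with open(os.path.join(tmp, f'part-{i}.json'), 'w') as fh:
            fh.write(rows)
    toy = load_jsonl_folder(tmp)

print(toy)
```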

In [4]:
# Load OT files as dataframes
ot_asscOverallDirect = load_jsonl_files_to_df(OT_ASSC_OVERALLDIRECT_PATH)
ot_asscOverallIndirect = load_jsonl_files_to_df(OT_ASSC_OVERALLINDIRECT_PATH)
ot_diseases = load_jsonl_files_to_df(OT_DISEASES_PATH)
ot_targets = load_jsonl_files_to_df(OT_TARGETS_PATH)

In [5]:
# Load CHoP files as dataframes
cnv = pd.read_json(CNV_PATH, orient='records', lines=True)
snvGene = pd.read_json(SNVGENE_PATH, orient='records', lines=True)
snv = pd.read_json(SNV_PATH, orient='records', lines=True)
fusionGene = pd.read_json(FUSIONGENE_PATH, orient='records', lines=True)
fusion = pd.read_json(FUSION_PATH, orient='records', lines=True)

tpmGene = pd.read_json(TPMGENE_PATH, orient='records', lines=True)
tpmGroup = pd.read_json(TPMGROUP_PATH, orient='records', lines=True)

In [6]:
# Load PMTL and PCDN as dataframes
pmtl_df = pd.read_json(PMTL_PATH, orient='records')
pcdn_df = pd.read_json(PCDN_PATH, orient='records')

In [8]:
# Load list of priorities for testing as dataframes
priority_targets = pd.read_csv(PRIORITY_TARGETS_PATH)
priority_diseases = pd.read_csv(PRIORITY_DISEASES_PATH)
priority_evidences = pd.read_csv(PRIORITY_EVIDENCES_PATH)

## Clean & Transform Data

### Simplify OT Target and Disease datasets

In [9]:
# Drop all columns except for ids and names
ot_targets = ot_targets.loc[:, ['id', 'approvedSymbol']]
ot_diseases = ot_diseases.loc[:, ['id', 'name']]

### Reformat CHoP TPM files for consistency

In [10]:
# Define columns to rename
tpmColRenameDict = {
    'Gene_Ensembl_ID': 'targetFromSourceId',
    'EFO': 'diseaseFromSourceMappedId'
}

# Rename columns and add datasourceId column for each file
tpmGene.rename(columns=tpmColRenameDict, inplace=True)
tpmGene['datasourceId'] = 'chop_tpm_genewise_expression'

tpmGroup.rename(columns=tpmColRenameDict, inplace=True)
tpmGroup['datasourceId'] = 'chop_tpm_groupwise_expression'

### Clean IDs within CHoP data

In [11]:
def clean_chop_targets(df:pd.DataFrame, ot_targets:pd.DataFrame=ot_targets):
    """ Remove rows of evidence that contain blank or incompatible Target IDs.
    These represent values that will not load into the MTP database. 
    
    :param df: Dataframe of CHoP evidence file
    :param ot_targets: Dataframe of Open Targets target database
    """

    # Read the datasource label by position, since index label 0 may have been filtered out
    source = df['datasourceId'].iloc[0]

    # Note any rows with blank target IDs
    isBlank = df['targetFromSourceId'] == ''
    blankEvidences = isBlank.sum()
    blankTargets = df.loc[isBlank, 'targetFromSourceId'].nunique()
    print(f"    {blankEvidences} evidences across {blankTargets} blank target IDs removed from {source}")

    # Drop any rows with blank target IDs without mutating the caller's dataframe
    df = df[~isBlank]


    # Enrich chop df with OT target ids and symbols
    df1 = pd.merge(
        df, ot_targets, how='left', left_on='targetFromSourceId', right_on='id').rename(
        columns={'id':'otTargetId', 'approvedSymbol':'otSymbol'})

    # Note any rows with target IDs not found within OT database
    invalidEvidences = len(df1[df1['otTargetId'].isna()])
    invalidTargets = df1[df1['otTargetId'].isna()]['targetFromSourceId'].nunique()
    print(f"    {invalidEvidences} evidences across {invalidTargets} invalid target IDs removed from {source}")

    # Drop any rows with target IDs not found within OT database
    df2 = df1[df1['otTargetId'].notnull()]

    return df2


In [12]:
def clean_chop_diseases(df:pd.DataFrame, ot_diseases:pd.DataFrame=ot_diseases):
    """ Remove rows of evidence that contain blank or incompatible Disease IDs.
    These represent values that will not load into the MTP database. 
    
    :param df: Dataframe of CHoP evidence file
    :param ot_diseases: Dataframe of Open Targets disease database
    """

    # Read the datasource label by position, since index label 0 may have been filtered out
    source = df['datasourceId'].iloc[0]

    # Note any rows with blank disease IDs
    isBlank = df['diseaseFromSourceMappedId'] == ''
    blankEvidences = isBlank.sum()
    blankDiseases = df.loc[isBlank, 'diseaseFromSourceMappedId'].nunique()
    print(f"    {blankEvidences} evidences across {blankDiseases} blank disease IDs removed from {source}")

    # Drop any rows with blank disease IDs without mutating the caller's dataframe
    df = df[~isBlank]


    # Enrich chop df with OT disease ids and names
    df1 = pd.merge(
        df, ot_diseases, how='left', left_on='diseaseFromSourceMappedId', right_on='id').rename(
        columns={'id':'otDiseaseId', 'name':'otDiseaseName'})

    # Note any rows with disease IDs not found within OT database
    invalidEvidences = len(df1[df1['otDiseaseId'].isna()])
    invalidDiseases = df1[df1['otDiseaseId'].isna()]['diseaseFromSourceMappedId'].nunique()
    print(f"    {invalidEvidences} evidences across {invalidDiseases} invalid disease IDs removed from {source}")

    # Drop any rows with disease IDs not found within OT database
    df2 = df1[df1['otDiseaseId'].notnull()]

    return df2


In [13]:
def chop_cleaning_functions(df:pd.DataFrame, ot_targets:pd.DataFrame=ot_targets, ot_diseases:pd.DataFrame=ot_diseases):
    """ Combines target and disease ID cleaning functions in series. """

    print(df.datasourceId[0], '\n    Start length:', len(df))

    df1 = clean_chop_targets(df)
    df2 = clean_chop_diseases(df1)
    
    print('    End length:', len(df2), '\n---')

    return df2
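
The cleaning functions above follow one pattern: drop blank IDs, left-join to the Open Targets reference table, then keep only the rows that found a match. The toy run below illustrates that pattern on four made-up rows. The IDs, symbols, and the `toy_source` label are invented for this sketch and do not come from the CHoP or OT data.

```python
import pandas as pd

# Toy stand-ins for an evidence file and the OT target reference
evidence = pd.DataFrame({
    'targetFromSourceId': ['ENSG01', '', 'ENSG99', 'ENSG02'],
    'datasourceId': ['toy_source'] * 4})
reference = pd.DataFrame({
    'id': ['ENSG01', 'ENSG02'],
    'approvedSymbol': ['AAA', 'BBB']})

# Step 1: drop rows with blank IDs (boolean filter, input left unchanged)
non_blank = evidence[evidence['targetFromSourceId'] != '']

# Step 2: left-join to the reference; unmatched rows get NaN in the joined columns
merged = non_blank.merge(
    reference, how='left', left_on='targetFromSourceId', right_on='id'
).rename(columns={'id': 'otTargetId', 'approvedSymbol': 'otSymbol'})

# Step 3: keep only rows whose ID was found in the reference
clean = merged[merged['otTargetId'].notna()].reset_index(drop=True)

n_blank = len(evidence) - len(non_blank)
n_invalid = len(merged) - len(clean)
print(f"{n_blank} blank, {n_invalid} invalid, {len(clean)} kept")
```

Resetting the index after the final filter keeps positional and label-based lookups in agreement for any later step.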

In [14]:
# Run each CHoP df through the cleaning functions defined above
cnv_clean = chop_cleaning_functions(cnv)
snvGene_clean = chop_cleaning_functions(snvGene)
snv_clean = chop_cleaning_functions(snv)
fusion_clean = chop_cleaning_functions(fusion)
fusionGene_clean = chop_cleaning_functions(fusionGene)

tpmGene_clean = chop_cleaning_functions(tpmGene)
tpmGroup_clean = chop_cleaning_functions(tpmGroup)

chop_gene_level_cnv 
    Start length: 1505739
    0 evidences across 0 blank target IDs removed from chop_gene_level_cnv
    943 evidences across 22 invalid target IDs removed from chop_gene_level_cnv
    0 evidences across 0 blank disease IDs removed from chop_gene_level_cnv
    0 evidences across 0 invalid disease IDs removed from chop_gene_level_cnv
    End length: 1504796 
---
chop_gene_level_snv 
    Start length: 102569
    0 evidences across 0 blank target IDs removed from chop_gene_level_snv
    121 evidences across 44 invalid target IDs removed from chop_gene_level_snv
    42 evidences across 1 blank disease IDs removed from chop_gene_level_snv
    0 evidences across 0 invalid disease IDs removed from chop_gene_level_snv
    End length: 102406 
---
chop_variant_level_snv 
    Start length: 535622
    0 evidences across 0 blank target IDs removed from chop_variant_level_snv
    358 evidences across 44 invalid target IDs removed from chop_variant_level_snv
    43 evidences acro

### Build Target-Disease Evidence Dataframe Function
Build a dataframe containing all of the pediatric cancer evidence in the format required for validation. Test cases will be subsets of this df, exported into Excel for automation.

In [33]:
def build_test_case_df(dfList:list, ot_targets:pd.DataFrame=ot_targets, ot_diseases:pd.DataFrame=ot_diseases):
    """ Build and format a dataframe to use when generating target and evidence page tests.
    Combine and transform a list of cleaned/preprocessed evidence dataframes.
    
    :param dfList: list of pandas DataFrames containing evidence
    :param ot_targets: pandas DataFrame of Open Targets targets to use for reference
    :param ot_diseases: pandas DataFrame of Open Targets diseases to use for reference
    """

    # Create blank output to fill with each evidence df
    dfCombined = pd.DataFrame()

    # Iterate through list of evidence dataframes
    for df in dfList:

        # Group data by 5 columns and get evidence count
        # All evidence must use identical column names/contents
        df1 = df.groupby(
            ['targetFromSourceId', 
            'Gene_symbol', 
            'diseaseFromSourceMappedId', 
            'Disease', 
            'datasourceId']
            ).size().reset_index().rename(columns={0:'evidenceCount'})

        # Add each formatted df into a single dataframe
        dfCombined = pd.concat([dfCombined, df1], ignore_index=True)

    # Pivot to organize datasources as columns showing evidence sums
    df2 = dfCombined.pivot_table(
            values='evidenceCount', 
            index=['targetFromSourceId','diseaseFromSourceMappedId'], 
            columns='datasourceId', 
            aggfunc='sum', fill_value=0
            ).reset_index().rename_axis(None, axis=1)

    # Use target IDs to map canonical OT names for targets
    df3 = pd.merge(df2, ot_targets, how='left', left_on='targetFromSourceId', right_on='id').rename(columns={'approvedSymbol':'targetNameOT'})
    df3.drop(columns='id', inplace=True)

    # Use disease IDs to map canonical OT names for diseases
    df4 = pd.merge(df3, ot_diseases, how='left', left_on='diseaseFromSourceMappedId', right_on='id').rename(columns={'name':'diseaseNameOT'})
    df4.drop(columns='id', inplace=True)

    # Rearrange output columns
    df5 = df4[[
                'targetFromSourceId',
                'diseaseFromSourceMappedId',
                'targetNameOT',
                'diseaseNameOT',
                'chop_gene_level_snv',
                'chop_variant_level_snv',
                'chop_gene_level_cnv',
                'chop_putative_oncogene_fused_gene',
                'chop_putative_oncogene_fusion',
                'chop_tpm_groupwise_expression',
                'chop_tpm_genewise_expression']]

    return df5
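
The core of the function above is a count-then-pivot step: evidence rows are counted per target, disease, and datasource, and the datasources are then spread into columns. The sketch below shows that step on toy data. The IDs and the datasource names `src_a` and `src_b` are made up for illustration.

```python
import pandas as pd

# Toy evidence rows: three for (T1, D1) across two sources, one for (T2, D1)
rows = pd.DataFrame({
    'targetFromSourceId': ['T1', 'T1', 'T1', 'T2'],
    'diseaseFromSourceMappedId': ['D1', 'D1', 'D1', 'D1'],
    'datasourceId': ['src_a', 'src_a', 'src_b', 'src_a']})

# Count evidence rows per target, disease, and datasource
counts = (rows
    .groupby(['targetFromSourceId', 'diseaseFromSourceMappedId', 'datasourceId'])
    .size()
    .reset_index(name='evidenceCount'))

# Spread datasources into columns; missing combinations become 0
wide = (counts
    .pivot_table(values='evidenceCount',
                 index=['targetFromSourceId', 'diseaseFromSourceMappedId'],
                 columns='datasourceId', aggfunc='sum', fill_value=0)
    .reset_index()
    .rename_axis(None, axis=1))

print(wide)
```

A datasource only becomes a column if at least one of its rows survives cleaning, so a fixed column selection afterwards assumes every expected datasource is present in the input.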

### Run function to combine evidence and build test case dataframe

In [34]:
# Create list of clean dataframes for iteration
dfList = [
    cnv_clean,
    snvGene_clean,
    snv_clean,
    fusion_clean,
    fusionGene_clean,
    tpmGene_clean,
    tpmGroup_clean,
    ]

In [35]:
testCase_df = build_test_case_df(dfList, ot_targets, ot_diseases)

## Build Test Case Functions

### Test Case: Target Associations (Direct Overall)

In [36]:
def build_testCase_targetAssc(sampleSize:int,
                            ot_asscOverallDirect:pd.DataFrame=ot_asscOverallDirect, 
                            ot_targets:pd.DataFrame=ot_targets, 
                            pmtl_df:pd.DataFrame=pmtl_df, 
                            priority_targets:pd.DataFrame=priority_targets):
    """ Build test case df for Target Associations page. Not specific to new MTP data. 
    Note that until pediatric data is included in association scoring, MTP associations will
    be identical to OT associations.

    :param sampleSize: int number of random targets to include in test
    :param ot_asscOverallDirect: pandas DataFrame of OpenTargets Overall Direct associations
    :param ot_targets: pandas DataFrame of OpenTargets target names for reference
    :param pmtl_df: pandas DataFrame of FDA PMTL
    :param priority_targets: pandas DataFrame of targets to always include in test case"""

    # Group associations by targetId to mimic associations heatmap
    df = ot_asscOverallDirect.groupby('targetId').size().reset_index()
    
    # Enrich associations with target names (OT Targets) and PMTL designations
    df1 = df.merge(ot_targets, how='left', left_on='targetId', right_on='id').merge(
                    pmtl_df[['ensemblID', 'designation']], how='left', left_on='targetId', right_on='ensemblID')
    
    # Rename columns
    df2 = df1.rename(columns={
        0:'diseaseCount',
        'approvedSymbol':'targetNameOT',
        'designation':'PMTLcode'})
        
    # Recast NaN PMTL as Unspecified (only the PMTL column, so other missing values stay visible)
    df2['PMTLcode'] = df2['PMTLcode'].fillna('Unspecified Target')

    # Add suffix column
    df2['suffixUrl'] = '/target/'+df2['targetId']+'/associations'

    # Reorder columns and omit redundant
    df3 = df2[['suffixUrl', 'targetId', 'targetNameOT', 'PMTLcode','diseaseCount']]

    # Create subset df with priority targets and random sample of other targets
    df4 = df3[
            (df3['targetId'].isin(priority_targets['targetId'])) |
            (df3['targetId'].isin(random.sample(df3['targetId'].unique().tolist(), sampleSize)))]

    return df4

### Test Case: Target Profile

In [37]:
def build_testCase_targetProfile(sampleSize:int, 
                                testCase_df:pd.DataFrame=testCase_df, 
                                priority_targets:pd.DataFrame=priority_targets):
    """ Build test case df for Target Profile page. Should only include targets with
    at least some new pediatric data. 

    :param sampleSize: int number of random targets to include in test case
    :param testCase_df: pandas DataFrame of preprocessed evidence data
    :param priority_targets: pandas DataFrame of targets to always include in test case"""

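    # Sum evidence counts per target; numeric_only skips the disease ID and name columns
    df = testCase_df.groupby(['targetFromSourceId','targetNameOT']).sum(numeric_only=True).reset_index()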

    # Add TRUE/FALSE for presence of Gene Expression Widget (groupwise plot)
    df['opcGeneExp_target'] = np.where(df['chop_tpm_groupwise_expression'] >0, 'TRUE', 'FALSE')

    # List all datasourceIds to be grouped and considered Somatic Alterations with tab labels
    somaticAltCols = {
    'chop_gene_level_cnv':'cnvByGene',
    'chop_gene_level_snv':'snvByGene',
    'chop_putative_oncogene_fused_gene':'fusionByGene',
    'chop_putative_oncogene_fusion':'fusion',
    'chop_variant_level_snv':'snvByVariant'}

    # Rename columns
    df.rename(columns=somaticAltCols, inplace=True)
    df.rename(columns={'targetFromSourceId':'targetId'}, inplace=True)

    # Add TRUE/FALSE for presence of Somatic Alterations widget
    df['opcSomaticAlt'] = np.where(df[list(somaticAltCols.values())].sum(axis=1) > 0, 'TRUE', 'FALSE')

    # Add suffix column
    df['suffixUrl'] ='/target/'+df['targetId']

    # Rearrange columns and omit redundant
    df1 = df[['suffixUrl', 'targetId', 'targetNameOT', 'opcGeneExp_target', 'opcSomaticAlt', 'snvByGene', 'snvByVariant', 'cnvByGene','fusionByGene','fusion']]

    # Create subset df with priority targets and random sample of other targets
    df2 = df1[
            (df1['targetId'].isin(priority_targets['targetId'])) |
            (df1['targetId'].isin(random.sample(df1['targetId'].unique().tolist(), sampleSize)))]

    return df2
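
The widget status columns built above are presence flags: a widget is expected whenever the relevant evidence count is greater than zero. The sketch below shows that logic on three toy targets. Only two of the five Somatic Alterations tab columns are included to keep it short, and all values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy per-target evidence counts
counts = pd.DataFrame({
    'targetId': ['T1', 'T2', 'T3'],
    'snvByGene': [3, 0, 0],
    'cnvByGene': [0, 0, 5],
    'chop_tpm_groupwise_expression': [0, 12, 0]})

# Tab columns that together feed the Somatic Alterations widget (subset for brevity)
tab_cols = ['snvByGene', 'cnvByGene']

# Widget is expected if any tab has at least one row
counts['opcSomaticAlt'] = np.where(
    counts[tab_cols].sum(axis=1) > 0, 'TRUE', 'FALSE')

# Gene Expression widget is expected if groupwise TPM rows exist
counts['opcGeneExp_target'] = np.where(
    counts['chop_tpm_groupwise_expression'] > 0, 'TRUE', 'FALSE')

print(counts[['targetId', 'opcSomaticAlt', 'opcGeneExp_target']])
```

The flags are written as the strings `'TRUE'` and `'FALSE'` rather than booleans, matching the notebook's output columns.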

### Test Case: Disease Associations (Indirect Overall)

In [38]:
def build_testCase_diseaseAssc(sampleSize:int,
                            ot_asscOverallIndirect:pd.DataFrame=ot_asscOverallIndirect, 
                            ot_diseases:pd.DataFrame=ot_diseases,
                            priority_diseases:pd.DataFrame=priority_diseases):
    """ Build test case df for Disease Associations page. Not specific to new MTP data. 
    Note that until pediatric data is included in association scoring, MTP associations will
    be identical to OT associations.

    :param sampleSize: int number of random diseases to include in test
    :param ot_asscOverallIndirect: pandas DataFrame of OpenTargets Overall Indirect associations
    :param ot_diseases: pandas DataFrame of OpenTargets disease names for reference
    :param priority_diseases: pandas DataFrame of diseases to always include in test case
    """

    # Group associations by diseaseId to mimic associations heatmap
    df = ot_asscOverallIndirect.groupby('diseaseId').size().reset_index()
    
    # Enrich associations with disease names (OT Diseases)
    df1 = df.merge(ot_diseases, how='left', left_on='diseaseId', right_on='id')
    
    # Rename columns
    df2 = df1.rename(columns={
        0:'targetCount',
        'name':'diseaseNameOT'})

    # Add suffix column
    df2['suffixUrl'] = '/disease/'+df2['diseaseId']+'/associations'

    # Reorder columns and omit redundant
    df3 = df2[['suffixUrl', 'diseaseId', 'diseaseNameOT','targetCount']]

    # Create subset df with priority diseases and random sample of other diseases
    df4 = df3[
            (df3['diseaseId'].isin(priority_diseases['diseaseId'])) |
            (df3['diseaseId'].isin(random.sample(df3['diseaseId'].unique().tolist(), sampleSize)))]

    return df4

### Test Case: Evidence Page

In [39]:
def build_testCase_evidence(sampleSize:int, 
                            testCase_df:pd.DataFrame=testCase_df, 
                            priority_evidences:pd.DataFrame=priority_evidences):
    """ Build test case df for Evidence page. Should only include target-disease
    evidence with at least some new pediatric data. 

    :param sampleSize: int number of random evidence combinations to include in test case
    :param testCase_df: pandas DataFrame of preprocessed evidence data
    :param priority_evidences: pandas DataFrame of evidences to always include in test case"""

    # Build subset of testCase df using only priority evidences
    priorityEvidence = pd.merge(priority_evidences[['targetId','diseaseId']], testCase_df, how='left', 
        left_on=['targetId','diseaseId'], right_on=['targetFromSourceId','diseaseFromSourceMappedId'])
    priorityEvidence.drop(columns=['targetId','diseaseId'], inplace=True)

    # Build subset of testCase df using random sample of evidence combinations
    sampleEvidence = testCase_df.groupby(['targetFromSourceId', 'diseaseFromSourceMappedId']).size().reset_index().sample(sampleSize)
    sampleEvidence.drop(columns=0, inplace=True)
    sampleEvidenceDetails = pd.merge(sampleEvidence, testCase_df, how='left', on=['targetFromSourceId', 'diseaseFromSourceMappedId'])

    # Concat priority and random subset dfs
    df = pd.concat([priorityEvidence, sampleEvidenceDetails], ignore_index=True)

    # Add TRUE/FALSE for presence of Gene Expression Widget (genewise plot)
    df['opcGeneExp_evidence'] = np.where(df['chop_tpm_genewise_expression'] >0, 'TRUE', 'FALSE')

    # List all datasourceIds to be grouped and considered Somatic Alterations with tab labels
    somaticAltCols = {
    'chop_gene_level_cnv':'cnvByGene',
    'chop_gene_level_snv':'snvByGene',
    'chop_putative_oncogene_fused_gene':'fusionByGene',
    'chop_putative_oncogene_fusion':'fusion',
    'chop_variant_level_snv':'snvByVariant'}

    # Rename columns
    df.rename(columns=somaticAltCols, inplace=True)
    df.rename(columns={'targetFromSourceId':'targetId', 'diseaseFromSourceMappedId':'diseaseId'}, inplace=True)

    # Add TRUE/FALSE for presence of Somatic Alterations widget
    df['opcSomaticAlt'] = np.where(df[list(somaticAltCols.values())].sum(axis=1) > 0, 'TRUE', 'FALSE')

    # Add suffix column
    df['suffixUrl'] ='/evidence/'+df['targetId']+'/'+df['diseaseId']

    # Rearrange columns and omit redundant
    df1 = df[['suffixUrl', 'targetId', 'diseaseId', 'targetNameOT', 'diseaseNameOT', 'opcGeneExp_evidence', 'opcSomaticAlt', 'snvByGene', 'snvByVariant', 'cnvByGene','fusionByGene','fusion']]

    return df1


### Test Case: Pediatric Cancer Data Navigation (PCDN) by Gene

In [40]:
def build_testCase_pcdnGene(sampleSize:int, 
                        pcdn_df:pd.DataFrame=pcdn_df, 
                        priority_targets:pd.DataFrame=priority_targets):
    """ Build test case df for PCDN page, gene search. Should only include 
    targets contained within the PCDN data (and therefore new pediatric data). 

    :param sampleSize: int number of random targets to include in test case
    :param pcdn_df: pandas DataFrame of preprocessed Pediatric Cancer Data Navigation page
    :param priority_targets: pandas DataFrame of targets to always include in test case"""

    # Build subset of the PCDN df using priority targets and random sample of other targets
    df = pcdn_df[
        (pcdn_df['Gene_symbol'].isin(priority_targets['symbol'])) |
        (pcdn_df['Gene_symbol'].isin(random.sample(pcdn_df['Gene_symbol'].unique().tolist(), sampleSize)))]

    # Group subset by gene symbol to get resulting evidence count
    df1 = df.groupby('Gene_symbol').size().reset_index()

    # Rename columns
    df1.rename(columns={0:'evidenceResults', 'Gene_symbol':'name'}, inplace=True)

    # Add suffix and category columns
    df1['suffixUrl'] = '/pediatric-cancer-data-navigation'
    df1['category'] = 'target'

    # Rearrange columns and omit redundant
    df2 = df1[['suffixUrl', 'category', 'name', 'evidenceResults']]

    return df2

### Test Case: Pediatric Cancer Data Navigation (PCDN) by Disease

In [41]:
def build_testCase_pcdnDisease(sampleSize:int, 
                        maxResults:int=10000,
                        pcdn_df:pd.DataFrame=pcdn_df, 
                        priority_diseases:pd.DataFrame=priority_diseases):
    """ Build test case df for PCDN page, disease search. Should only include 
    diseases contained within the PCDN data (and therefore new pediatric data). 

    :param sampleSize: int number of random diseases to include in test case
    :param maxResults: int maximum count of evidence results. Default is 10,000 to match MTP
    :param pcdn_df: pandas DataFrame of preprocessed Pediatric Cancer Data Navigation page
    :param priority_diseases: pandas DataFrame of diseases to always include in test case"""

    # Build subset of the PCDN df using priority diseases and random sample of other diseases
    df = pcdn_df[
        (pcdn_df['Disease'].isin(priority_diseases['name'])) |
        (pcdn_df['Disease'].isin(random.sample(pcdn_df['Disease'].unique().tolist(), sampleSize)))]

    # Group subset by disease to get resulting evidence count
    df1 = df.groupby('Disease').size().reset_index().rename(columns={0:'fullResults'})

    # Cap number of results at a maximum (default 10,000) to match site functionality
    df1['evidenceResults'] = np.where(df1['fullResults'] > maxResults, maxResults, df1['fullResults'])

    # Rename columns
    df1.rename(columns={'Disease':'name'}, inplace=True)

    # Add suffix and category columns
    df1['suffixUrl'] = '/pediatric-cancer-data-navigation'
    df1['category'] = 'disease'

    # Rearrange columns and omit redundant
    df2 = df1[['suffixUrl', 'category', 'name', 'evidenceResults']]

    return df2
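
The result cap used in the disease search above can be checked in isolation. The sketch below applies the same `np.where` expression to three toy counts and compares it with `Series.clip`, which is an alternative spelling offered here as a suggestion rather than something the notebook uses. The counts are invented, and 10,000 is the default `maxResults` from the function signature.

```python
import numpy as np
import pandas as pd

max_results = 10000
full = pd.Series([120, 10000, 25000], name='fullResults')  # toy evidence counts

# Cap as written in the notebook: replace anything above the maximum
capped_where = np.where(full > max_results, max_results, full)

# Equivalent spelling with Series.clip
capped_clip = full.clip(upper=max_results)

print(capped_clip.tolist())
```

Values at or below the maximum pass through unchanged in both versions.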

### Test Case: Pediatric Molecular Target Lists (PMTL)

In [42]:
def build_testCase_pmtl(pmtl_df:pd.DataFrame=pmtl_df):
    """ Build test case df for FDA PMTL page.

    :param pmtl_df: pandas DataFrame of the computable FDA PMTL"""

    df = pmtl_df.groupby('designation').size().reset_index()

    df['suffixUrl'] = '/fda-pmtl'
    df['category'] = 'designation'

    df.rename(columns={0:'count', 'designation':'categoryValue'}, inplace=True)

    df1 = df[['suffixUrl', 'category', 'categoryValue', 'count']]

    return df1

## Generate Test Cases

In [43]:
# Run test case functions to build dataframes for excel export
targetAssc = build_testCase_targetAssc(10)
targetProfile = build_testCase_targetProfile(10)
diseaseAssc = build_testCase_diseaseAssc(10)
evidence = build_testCase_evidence(15)
pcdnGene = build_testCase_pcdnGene(10)
pcdnDisease = build_testCase_pcdnDisease(10)
pmtl = build_testCase_pmtl()

## Export Test Cases to Excel

In [56]:
# Build df with test generation metadata
sourceData = pd.DataFrame({
    'Test Case generation date':pd.Timestamp.now(),
    'Open Targets Version':OT_VERSION, 
    'CHoP OpenPedCan Version': OPENPEDCAN_VERSION, 
    'PMTL Version':PMTL_VERSION}.items(),
    columns=['MTP DV3 TEST CASES FOR AUTOMATION', ''])

In [60]:
# Create writer using XlsxWriter as the engine
writer = pd.ExcelWriter(XLSX_OUTPUT, engine='xlsxwriter')

# Write each dataframe as a sheet to the excel file defined above
sourceData.to_excel(writer, sheet_name='sourceData', index=False)
targetAssc.to_excel(writer, sheet_name='targetAssc', index=False)
targetProfile.to_excel(writer, sheet_name='targetProfile', index=False)
diseaseAssc.to_excel(writer, sheet_name='diseaseAssc', index=False)
evidence.to_excel(writer, sheet_name='evidence', index=False)
pcdnGene.to_excel(writer, sheet_name='pcdnGene', index=False)
pcdnDisease.to_excel(writer, sheet_name='pcdnDisease', index=False)
pmtl.to_excel(writer, sheet_name='pmtl', index=False)

# Save and close excel writer
writer.close()