# Verify Migrated API Data Downloads
2023-07-12 ZD  

Verify that the data available as MTP user downloads from the Gene Expression widget are correct. The old downloads pulled a summary file from CHoP's API. After migrating the API server to CBIIT space, we have also changed the data downloads to return the full dataset, rather than summary data, which will allow users to rebuild the plots themselves.  

Because the structure of data is very different between the old and new types of downloads, comparisons require transformations. This notebook will compare a sample of old data downloads to new downloads using those transforms. 

Jira ticket: https://tracker.nci.nih.gov/browse/CCDIMTP-782

## Usage
This notebook is intended to be used to compare new MTP Gene Expression data download files in a directory against old files in a different directory.
Example structure:  
```
.
└── fileSet/
    ├── newData/
    │   ├── Filename1
    │   ├── Filename2
    │   └── Filename3
    └── oldData/
        ├── Filename1
        ├── Filename2
        └── Filename3
```

The notebook compares the new version of each file in a fileset to the matching old file to check for differences (e.g. `newData/Filename1` vs `oldData/Filename1`). Some transformation steps are run on the newData to mimic the summary statistic steps used by CHoP to deliver the oldData. The comparison scales to many files at a time, as long as each file in newData has a companion file in oldData with an identical name. 
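
Because the comparison depends on every file having an identically named companion, it can help to check the pairing before running anything else. The following is a minimal sketch of such a pre-check, not part of the original workflow; the helper name `find_unpaired_files` and the throwaway directory tree are assumptions made for illustration.

```python
import os
import tempfile


def find_unpaired_files(new_dir, old_dir):
    """Return filenames that exist in only one of the two directories.

    Hypothetical helper for illustration; not part of the notebook workflow.
    """
    new_names = {f for f in os.listdir(new_dir)
                 if os.path.isfile(os.path.join(new_dir, f))}
    old_names = {f for f in os.listdir(old_dir)
                 if os.path.isfile(os.path.join(old_dir, f))}
    return {
        'only_in_new': sorted(new_names - old_names),
        'only_in_old': sorted(old_names - new_names),
    }


# Demo on a temporary fileset with placeholder filenames
with tempfile.TemporaryDirectory() as root:
    new_dir = os.path.join(root, 'newData')
    old_dir = os.path.join(root, 'oldData')
    os.makedirs(new_dir)
    os.makedirs(old_dir)
    for name in ('Filename1', 'Filename2'):
        open(os.path.join(new_dir, name), 'w').close()
    open(os.path.join(old_dir, 'Filename1'), 'w').close()

    unpaired = find_unpaired_files(new_dir, old_dir)

print(unpaired)
```

An empty list under both keys would indicate that every file has a companion.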

Specifically, data derived from the following fields are compared within the files:
- **x Axis Labels** - Plot labels that determine data grouping. This is provided in both the new and old files. In old files, there is a single row for each label, but in new files, there are many rows for each label. e.g. `Acute Lymphoblastic Leukemia  (Dataset = TARGET, Specimen = Pediatric Primary Tumors, N = 538)`  
- **Mean TPM** - Average Transcripts Per Million (TPM) value for each x Axis Label group, rounded to 2 decimal places. This is provided as `tpmMean` in old files and is calculated from the `tpm` column in new files for comparison. e.g. `15.95`
- **Sample Count** - Number of samples included in each x Axis Label group. This is provided as `boxSampleCount` in both new and old files, but is verified by counting the number of rows for each x Axis Label in the new files. e.g. `538`

Note that the Mean TPM comparison allows a relative tolerance of 0.01 (`rtol=0.01` in `np.isclose`, roughly 1% of the calculated/new value) between provided/old and calculated/new Mean TPM values. This is to allow for any floating point differences in rounding between the provided R-derived means and the new means calculated here with Python. Read more: https://docs.python.org/3/tutorial/floatingpoint.html#tut-fp-issues 
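
`np.isclose` treats relative and absolute tolerances differently, which matters for small means. The sketch below uses made-up values to show how the `rtol=0.01` setting passed by this notebook's comparison scales with the magnitude of the value, next to an absolute tolerance for contrast. The `atol=0.011` figure is an illustrative assumption, not a setting used by the notebook.

```python
import numpy as np

# Made-up example means, not taken from any download file
old_mean = np.array([15.95, 0.50, 100.00])  # provided (old) means
new_mean = np.array([15.96, 0.51, 100.50])  # recalculated (new) means

# Relative check: |old - new| <= atol + rtol * |new|, with the default atol of 1e-8
relative = np.isclose(old_mean, new_mean, rtol=0.01)

# Absolute check: relative part switched off, fixed allowance just above 0.01
absolute = np.isclose(old_mean, new_mean, rtol=0, atol=0.011)

print(relative)  # [ True False  True]
print(absolute)  # [ True  True False]
```

With a relative tolerance, a 0.01 difference passes for a mean near 16 but fails for a mean near 0.5, while a 0.5 difference passes for a mean near 100.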

#### File Preparation
Download and organization of files into the comparison directories (e.g. `oldData` and `newData`) is done manually. The `fileSet` folder is used for batching and versioning (e.g. `example` or the data download date `20230712`).

For notebook development and initial validation, TSV files were downloaded from either the Production tier (old files) or the QA tier (new files) of MTP. Each API endpoint (download button) delivers data with a different structure. One file was downloaded for each Gene Expression endpoint on each selected page for each tier. The default downloaded filename is `OpenPedCanGeneExpression-{ENSG}.tsv` for target page downloads and `OpenPedCanGeneExpression-{ENSG}-{EFO}.tsv` for evidence page downloads. To distinguish downloads from different endpoints on the same page, an endpoint descriptor was added to each filename after download (e.g. `OpenPedCanGeneExpression-{ENSG}-gene-all-cancer.tsv`).  
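
The renaming step was done by hand, but the naming pattern can be sketched in a few lines. The helper name `add_endpoint_descriptor` is hypothetical, and the sketch assumes the descriptor is always appended immediately before the `.tsv` extension, as in the example filename above.

```python
from pathlib import Path


def add_endpoint_descriptor(filename, endpoint):
    """Insert an endpoint descriptor before the file extension.

    Hypothetical helper; the original files were renamed manually.
    """
    path = Path(filename)
    return f"{path.stem}-{endpoint}{path.suffix}"


renamed = add_endpoint_descriptor(
    'OpenPedCanGeneExpression-ENSG00000171094.tsv', 'gene-all-cancer'
)
print(renamed)  # OpenPedCanGeneExpression-ENSG00000171094-gene-all-cancer.tsv
```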


#### Initial Validation
The sample downloads for initial validation were selected using `dv3_priority_tests.csv` as a guide:  

**Endpoints:**  
- `gene-all-cancer` (Target)
- `gene-all-cancer-tcga` (Target)
- `gene-disease-gtex` (Evidence)
- `gene-disease-tcga` (Evidence)

**Targets:**  
- ALK -     ENSG00000171094
- BRAF -    ENSG00000157764
- CD19 -    ENSG00000177455
- EGFR -    ENSG00000146648
- FLT3 -    ENSG00000122025

**Evidence Combinations:**  
- ALK in neuroblastoma -                    ENSG00000171094-EFO_0000621
- FLT3 in acute myeloid leukemia -          ENSG00000122025-EFO_0000222
- TP53 in osteosarcoma -                    ENSG00000141510-EFO_0000637
- ABL1 in acute lymphoblastic leukemia -    ENSG00000097007-EFO_0000220
- PAX3 in osteosarcoma -                    ENSG00000135903-EFO_0000637

### Import modules

In [1]:
import pandas as pd
import numpy as np
import os

### Build workflow for handling data download sets

In [2]:
def summarize_new_df(df):
    """Transform new data into grouped DataFrame with summary statistics."""
    
    # Build initial summary df with xLabel columns and TPM Mean
    df_summarized = df.groupby('xLabel')['tpm'].mean().round(2).reset_index()
    df_summarized = df_summarized.rename(columns={'tpm':'tpmMean'})

    # Add column for sample/row counts
    df_summarized['boxSampleCount'] = df.groupby('xLabel').size().tolist()

    return df_summarized
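
To make the transformation concrete, here is a small self-contained sketch of the same kind of summary on toy data. The labels and TPM values are made up. It uses pandas named aggregation to compute the mean and the row count in one `groupby` pass, which is an alternative formulation rather than the function defined above.

```python
import pandas as pd

# Toy long-format download: one row per sample (labels and values are made up)
df_new = pd.DataFrame({
    'xLabel': ['Tumor A', 'Tumor A', 'Tumor A', 'Tumor B', 'Tumor B'],
    'tpm': [1.0, 2.0, 4.0, 10.0, 20.0],
})

# Named aggregation: mean TPM and number of rows per x-axis label
summary = (
    df_new.groupby('xLabel')
    .agg(tpmMean=('tpm', 'mean'), boxSampleCount=('tpm', 'size'))
    .round({'tpmMean': 2})
    .reset_index()
)

print(summary)
```

Each label collapses to a single row, matching the one-row-per-label layout of the old files: `Tumor A` has a mean of 2.33 over 3 rows and `Tumor B` a mean of 15.0 over 2 rows.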

In [3]:
def compare_data(df_summarized, df_old):
    """Compare transformed new data against old data. If the data matches, 
    then a boolean True will be returned. If not, then a dictionary with
    additional details will be returned along with the False value.
    
    :param df_summarized: pandas DataFrame result of summary step
    :param df_old: pandas DataFrame of old (summary) data
    """

    # Get only comparison columns from old data
    df_old = df_old[['xLabel', 'tpmMean', 'boxSampleCount']]

    # Sort data by x-axis label for comparison
    df_old = df_old.sort_values(by='xLabel', ascending=True).reset_index(drop=True)
    df_summarized = df_summarized.sort_values(by='xLabel', ascending=True).reset_index(drop=True)

    # Check for exact match
    comparison_result = df_summarized.equals(df_old)

    # Check for columns with mismatched values
    if not comparison_result:
        # Check x-Axis Labels
        match_xLabel = all(df_old['xLabel'] == df_summarized['xLabel'])
        # Compare TPM means using a relative tolerance of 0.01 (rtol)
        match_TPMMean = all(np.isclose(a=df_old['tpmMean'], b=df_summarized['tpmMean'], rtol=0.01))
        # Check sample counts
        match_boxSampleCount = all(df_old['boxSampleCount'] == df_summarized['boxSampleCount'])

        # If values still mismatched after tpmMean tolerance, list issues
        comparison_result = all([match_xLabel, match_TPMMean, match_boxSampleCount])
        if not comparison_result:
            return {
                'comparison_result': comparison_result,
                'match_xLabel': match_xLabel,
                'match_tpmMean': match_TPMMean,
                'match_boxSampleCount': match_boxSampleCount
                }

    return comparison_result
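
`compare_data` reports which columns mismatch, but not which x-axis labels are responsible. One possible follow-up when a file fails is an outer merge on `xLabel`, sketched below on made-up data. The variable names and values are illustrative assumptions and this step is not part of the original workflow.

```python
import numpy as np
import pandas as pd

# Made-up old and summarized-new tables for illustration
old = pd.DataFrame({
    'xLabel': ['Tumor A', 'Tumor B', 'Tumor C'],
    'tpmMean': [2.33, 15.00, 7.10],
    'boxSampleCount': [3, 2, 5],
})
new = pd.DataFrame({
    'xLabel': ['Tumor A', 'Tumor B', 'Tumor D'],
    'tpmMean': [2.33, 15.90, 7.10],
    'boxSampleCount': [3, 2, 5],
})

# Outer merge keeps labels from both sides; `_merge` records where each came from
merged = old.merge(new, on='xLabel', how='outer',
                   suffixes=('_old', '_new'), indicator=True)

# Labels present in only one of the two tables
missing = merged.loc[merged['_merge'] != 'both', ['xLabel', '_merge']]

# For shared labels, apply the same relative tolerance and count check
both = merged[merged['_merge'] == 'both']
mean_ok = np.isclose(both['tpmMean_old'], both['tpmMean_new'], rtol=0.01)
count_ok = (both['boxSampleCount_old'] == both['boxSampleCount_new']).to_numpy()
mismatched = both.loc[~(mean_ok & count_ok), 'xLabel'].tolist()

print(missing)
print(mismatched)  # ['Tumor B']
```

Unlike the positional comparison in `compare_data`, this approach does not require both tables to have the same number of rows.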

In [38]:
def compare_data_download_files(newDataPath, oldDataPath):
    """Main function. Transform new data and compare.

    :param newDataPath: directory path of raw new files
    :param oldDataPath: directory path of old files
    """
    new_files = os.listdir(newDataPath)
    old_files = os.listdir(oldDataPath)

    # Sort the files to ensure consistent order for comparison
    new_files.sort()
    old_files.sort()

    # Ensure that directories have identical file counts
    if len(new_files) != len(old_files):
        raise ValueError(f"Number of files ({len(new_files)},{len(old_files)}) in comparison directories do not match.")

    compared_data = []

    # Iterate through directories to get matching new and old files
    for new_file, old_file in zip(new_files, old_files):
        new_file_path = os.path.join(newDataPath, new_file)
        old_file_path = os.path.join(oldDataPath, old_file)

        # Ensure that each file path points to a file
        if os.path.isfile(new_file_path) and os.path.isfile(old_file_path):
            # Ensure that new_file and old_file have the same filename
            if new_file != old_file:
                raise ValueError(f"The filenames {new_file} and {old_file} do not match.")

            # Read the TSV files into DataFrames
            df_new = pd.read_csv(new_file_path, delimiter='\t')
            df_old = pd.read_csv(old_file_path, delimiter='\t')

            # Summarize the new data to match format of old data
            df_summarized = summarize_new_df(df_new)

            # Compare the transformed DataFrame with the old DataFrame
            comparison_result = compare_data(df_summarized, df_old)

            # Add filename to any mismatched file results
            if isinstance(comparison_result, dict):
                comparison_result.update({"file": new_file})

            # Append the comparison result to the list
            compared_data.append(comparison_result)

    return compared_data

### Run workflow to get results

In [39]:
# Define paths for new and old data directories
newDataPath = 'data/test/geneExpressionDataDownloads/example/newData/'
oldDataPath = 'data/test/geneExpressionDataDownloads/example/oldData/'

# Compare data!
results = compare_data_download_files(newDataPath, oldDataPath)

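# If results are good, say so. If not, list issues.
# Mismatches come back as non-empty dicts, which are truthy, so test for True explicitly.
if all(result is True for result in results):
    print("Success! New data downloads match old data downloads.")
else:
    print("Error: Some new data downloads DO NOT match old data downloads.")
    print(results)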

Success! New data downloads match old data downloads.


In [56]:
# Define file set directory
fileSet = 'example'

# Define paths for new and old data directories
newDataPath = 'data/test/geneExpressionDataDownloads/'+fileSet+'/newData/'
oldDataPath = 'data/test/geneExpressionDataDownloads/'+fileSet+'/oldData/'

# Compare data!
results = compare_data_download_files(newDataPath, oldDataPath)

# If any results are dict details (data mismatch), show details
if any(isinstance(result, dict) for result in results):
    print("Error: Some new data downloads do not match old data downloads.")
    for result in results:
        if isinstance(result, dict):
            print(result)
else:
    print("Success! New data downloads match old data downloads.")


Success! New data downloads match old data downloads.
