# Verify Migrated API Data Downloads
2023-07-12 ZD  

Verify that the data available as MTP user downloads from the Gene Expression widget are correct. The old downloads pulled a summary file from CHoP's API. After the API server was migrated to CBIIT space, the downloads were changed to return the full dataset rather than summary data, which lets users rebuild the plots themselves.  

Because the old and new downloads are structured very differently, comparing them requires transforming the new data into the old summary format. This notebook compares a sample of old data downloads to new downloads using those transforms.

Jira ticket: https://tracker.nci.nih.gov/browse/CCDIMTP-782

### Import modules

In [1]:
import pandas as pd
import os
import numpy as np

### Proof of Concept:

In [2]:
# Set paths and load a matching pair of new and old files
df_new_path = 'data/test/plotlyDownloads/newData/OpenPedCanGeneExpression-ENSG00000097007-EFO_0000220-gene-disease-gtex.tsv'
df_new = pd.read_csv(df_new_path, sep='\t')

df_old_path = 'data/test/plotlyDownloads/oldData/OpenPedCanGeneExpression-ENSG00000097007-EFO_0000220-gene-disease-gtex.tsv'
df_old = pd.read_csv(df_old_path, sep='\t')

In [3]:
# Roll up new data to match old data. Get mean TPM statistic for comparison
df_new_grouped = df_new.groupby('xLabel')['tpm'].mean().round(2).reset_index()

In [4]:
# Rough check to see if new (derived) and old (provided) Mean TPMs match across all rows
# Note that this is dependent upon sorting, so will need improvement
all(df_old['tpmMean'] == df_new_grouped['tpm'])

True

Derived values match provided values for the mean TPM of each x-axis grouping. Next step is to test a larger sample.
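
The check above compares rows by position, so it only holds when both frames happen to be sorted the same way. A minimal sketch of an order-independent alternative, assuming the same `xLabel`, `tpm`, and `tpmMean` column names; the toy frames and their values are made up for illustration and are not the real downloads:

```python
import pandas as pd

# Toy stand-ins for the real downloads (values are made up for illustration).
old = pd.DataFrame({"xLabel": ["B", "A", "C"], "tpmMean": [2.5, 1.0, 4.0]})
new = pd.DataFrame({"xLabel": ["A", "A", "B", "C"], "tpm": [0.5, 1.5, 2.5, 4.0]})

# Roll up the new data, then align on xLabel instead of relying on row order.
new_grouped = new.groupby("xLabel")["tpm"].mean().round(2).reset_index()
merged = old.merge(new_grouped, on="xLabel", how="outer", indicator=True)

# Labels present on only one side show up as left_only / right_only.
labels_match = bool((merged["_merge"] == "both").all())
# A label missing on one side leaves NaN, which never compares equal.
means_match = bool((merged["tpmMean"] == merged["tpm"]).all())
print(labels_match, means_match)
```

The outer merge also surfaces labels that exist in only one file, which a positional comparison would either miss or report as a length error.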

### Build workflow for handling data download sets

In [5]:
def summarize_new_df(df):
    
    # Build initial summary df with xLabel columns and TPM Mean
    df_summarized = df.groupby('xLabel')['tpm'].mean().round(2).reset_index()
    df_summarized = df_summarized.rename(columns={'tpm':'tpmMean'})

    # Add column for sample/row counts
    df_summarized['boxSampleCount'] = df.groupby('xLabel').size().tolist()

    return df_summarized
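
As an illustration of the roll-up, the same two summary columns can be computed in a single `groupby` pass with named aggregation. This is a sketch on a made-up three-row frame, not a replacement for the function above; `toy` and its values are assumptions:

```python
import pandas as pd

# Made-up sample rows shaped like a new-style download (one row per sample).
toy = pd.DataFrame({"xLabel": ["Brain", "Brain", "Liver"], "tpm": [1.0, 2.0, 5.0]})

# "size" counts every row in the group, matching groupby(...).size();
# "count" would skip rows where tpm is NaN.
summary = (
    toy.groupby("xLabel")
       .agg(tpmMean=("tpm", "mean"), boxSampleCount=("tpm", "size"))
       .reset_index()
)
summary["tpmMean"] = summary["tpmMean"].round(2)
print(summary)
```

Computing both columns from one grouped object keeps the mean and the count aligned by construction rather than by the shared sort order of two separate `groupby` calls.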

In [6]:
def compare_data(df_summarized, df_old):

    # Get only comparison columns from old data
    df_old = df_old[['xLabel', 'tpmMean', 'boxSampleCount']]

    # Sort data by x-axis label for comparison
    df_old = df_old.sort_values(by='xLabel', ascending=True).reset_index(drop=True)
    df_summarized = df_summarized.sort_values(by='xLabel', ascending=True).reset_index(drop=True)

    # Check for exact match
    comparison_result = df_summarized.equals(df_old)

    # Check for columns with mismatched values
    if not comparison_result:
        # Check x-Axis Labels
        match_xLabel = all(df_old['xLabel'] == df_summarized['xLabel'])
        # Relative check of TPM mean: rtol=0.01 allows a difference of up to 1% of the summarized value
        match_tpmMean = all(np.isclose(a=df_old['tpmMean'], b=df_summarized['tpmMean'], rtol=0.01))
        # Check sample counts
        match_boxSampleCount = all(df_old['boxSampleCount'] == df_summarized['boxSampleCount'])

        # If values still mismatched after tpmMean tolerance, list issues
        comparison_result = all([match_xLabel, match_tpmMean, match_boxSampleCount])
        if not comparison_result:
            return {
                'comparison_result': comparison_result,
                'match_xLabel': match_xLabel,
                'match_tpmMean': match_tpmMean,
                'match_boxSampleCount': match_boxSampleCount
                }

    return comparison_result
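
The tolerance passed to `np.isclose` here is relative, which behaves differently from a fixed 0.01 difference. `np.isclose(a, b)` tests `|a - b| <= atol + rtol * |b|`, so the allowed gap grows with the magnitude of the values. A small sketch with made-up means shows the contrast:

```python
import numpy as np

# Made-up TPM means: one small value, one large value.
old_mean = np.array([0.05, 100.00])
new_mean = np.array([0.055, 100.50])

# Relative tolerance: allowed gap is about 1% of new_mean (plus the tiny default atol).
relative = np.isclose(old_mean, new_mean, rtol=0.01)
# Absolute tolerance: allowed gap is a fixed 0.01 regardless of magnitude.
absolute = np.isclose(old_mean, new_mean, rtol=0, atol=0.01)

print(relative)
print(absolute)
```

With `rtol=0.01` the small pair fails and the large pair passes; with `atol=0.01` the outcome is reversed. Which behavior is appropriate depends on whether the expected discrepancy is rounding to two decimals (absolute) or proportional error (relative).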

In [7]:
def compare_data_download_files(newDataPath, oldDataPath):
    new_files = os.listdir(newDataPath)
    old_files = os.listdir(oldDataPath)

    # Sort the files to ensure consistent order for comparison
    new_files.sort()
    old_files.sort()

    compared_data = []

    for new_file, old_file in zip(new_files, old_files):
        new_file_path = os.path.join(newDataPath, new_file)
        old_file_path = os.path.join(oldDataPath, old_file)

        if os.path.isfile(new_file_path) and os.path.isfile(old_file_path):
            # Ensure that new_file and old_file have the same filename
            if new_file != old_file:
                raise ValueError(f"The filenames {new_file} and {old_file} do not match.")

            # Read the TSV files into DataFrames
            df_new = pd.read_csv(new_file_path, delimiter='\t')
            df_old = pd.read_csv(old_file_path, delimiter='\t')

            # Summarize the new data to match format of old data
            df_summarized = summarize_new_df(df_new)

            # Compare the transformed DataFrame with the old DataFrame
            comparison_result = compare_data(df_summarized, df_old)

            # Failed comparisons come back as a dict; tag them with the filename
            if isinstance(comparison_result, dict):
                comparison_result.update({"file": new_file})

            # Append the comparison result to the list
            compared_data.append(comparison_result)

    return compared_data
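
Pairing files with `zip` over two sorted listings assumes both directories hold exactly the same names: `zip` stops at the shorter list, and an extra file on one side shifts the pairing until the filename check raises. A hedged sketch of pairing by name instead follows; `pair_files_by_name` is a hypothetical helper, and the temporary directories stand in for the real data paths:

```python
import os
import tempfile

def pair_files_by_name(new_dir, old_dir):
    """Hypothetical helper: names found in both directories, and names found in only one."""
    new_names = {f for f in os.listdir(new_dir) if os.path.isfile(os.path.join(new_dir, f))}
    old_names = {f for f in os.listdir(old_dir) if os.path.isfile(os.path.join(old_dir, f))}
    return sorted(new_names & old_names), sorted(new_names ^ old_names)

# Demonstrate with throwaway directories rather than the real data paths.
with tempfile.TemporaryDirectory() as new_dir, tempfile.TemporaryDirectory() as old_dir:
    for name in ("a.tsv", "b.tsv"):
        open(os.path.join(new_dir, name), "w").close()
    open(os.path.join(old_dir, "a.tsv"), "w").close()
    shared, unmatched = pair_files_by_name(new_dir, old_dir)

print(shared, unmatched)
```

Reporting the unmatched names separately makes it clear which downloads were never compared, rather than leaving them silently out of the results.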

### Run workflow to get results

In [8]:
# Define paths for new and old data directories
newDataPath = 'data/test/plotlyDownloads/newData/'
oldDataPath = 'data/test/plotlyDownloads/oldData/'

# Compare data!
results = compare_data_download_files(newDataPath, oldDataPath)

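# If results are good, say so. If not, list issues.
# Failed comparisons are returned as non-empty dicts, which are truthy,
# so test for True explicitly rather than relying on all(results).
if all(result is True for result in results):
    print("Success! New data downloads match old data downloads.")
else:
    print("Error: Some new data downloads DO NOT match old data downloads.")
    print([result for result in results if result is not True])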

Success! New data downloads match old data downloads.
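
One caveat when reading this result: `compare_data` returns either `True` or a dict of per-column flags, and a non-empty dict is truthy in Python. A minimal sketch of how that affects an `all(...)` check, using a made-up results list rather than the real output:

```python
# Made-up results shaped like the workflow output: one pass, one failure dict.
results_example = [
    True,
    {"comparison_result": False, "match_xLabel": True,
     "match_tpmMean": False, "match_boxSampleCount": True},
]

# A non-empty dict is truthy, so a plain all() does not notice the failure.
naive = all(results_example)

# Testing identity with True treats every dict as a failure.
strict = all(r is True for r in results_example)
failures = [r for r in results_example if r is not True]

print(naive, strict, len(failures))
```

Collecting `failures` this way also gives a direct list of the mismatched files to inspect.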
