# Extract HiFi QC Data<a class="tocSkip">

**This notebook reads the outputs of the NTSM and ReadStats WDLs (stored in data tables) as part of the HiFi QC process.**

**Below are the steps taken in this notebook:**
1. Import Statements & Global Variable Definitions
2. Extract NTSM Data
3. Extract ReadStats Data
4. Upload To Tables

# Import Statements & Global Variable Definitions

## Installs

In [1]:
%%capture
%pip install gcsfs
## %%capture must be the first line of the cell, so comments cannot go above it
## gcsfs is for reading CSVs stored in Google Cloud (without downloading them first)
## May need to restart kernel after install

In [2]:
%%capture
%pip install --upgrade --no-cache-dir --force-reinstall terra-pandas
%pip install --upgrade --no-cache-dir  --force-reinstall git+https://github.com/DataBiosphere/terra-notebook-utils
## For reading/writing data tables into pandas data frames
## May need to restart kernel after install 

## Import Statements

In [1]:
from firecloud import fiss
import pandas as pd 
import numpy as np
import terra_pandas as tp
import os                 
import subprocess       
import re                 
import io
import gcsfs

from typing import Any, Callable, List, Optional
from terra_notebook_utils import table, WORKSPACE_NAME, WORKSPACE_GOOGLE_PROJECT


## Global Variable Declarations

In [2]:
# Get the Google billing project name and workspace name for current workspace
PROJECT = os.environ['WORKSPACE_NAMESPACE']
WORKSPACE = os.path.basename(os.path.dirname(os.getcwd()))
bucket = os.environ['WORKSPACE_BUCKET'] + "/"


# Verify that we've captured the environment variables
print("Billing project: " + PROJECT)
print("Workspace: " + WORKSPACE)
print("Workspace storage bucket: " + bucket)

Billing project: human-pangenome-ucsc
Workspace: HPRC_WRANGLING_UW_HPRC_HiFi_Y3
Workspace storage bucket: gs://fc-c2424fa0-8b70-4a0a-bd92-598d9fa71260/
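
Deriving the workspace name from the working directory depends on where the notebook happens to be running. The sketch below shows a more defensive lookup. It assumes the runtime may also expose a `WORKSPACE_NAME` environment variable, which should be verified on your runtime, and `resolve_workspace` is a hypothetical helper that is not part of this notebook.

```python
import os
from typing import Mapping, Optional


def resolve_workspace(env: Mapping[str, str], cwd: Optional[str] = None) -> dict:
    """Collect billing project, workspace name, and bucket from an environment mapping."""
    required = ("WORKSPACE_NAMESPACE", "WORKSPACE_BUCKET")
    missing = [key for key in required if key not in env]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))

    # Prefer an explicit WORKSPACE_NAME (assumed, not guaranteed, to be set);
    # otherwise fall back to the notebook's approach of taking the parent
    # directory of the current working directory.
    workspace = env.get("WORKSPACE_NAME")
    if not workspace:
        cwd = cwd if cwd is not None else os.getcwd()
        workspace = os.path.basename(os.path.dirname(cwd))

    return {
        "project": env["WORKSPACE_NAMESPACE"],
        "workspace": workspace,
        # Normalise the bucket so it always ends with exactly one slash.
        "bucket": env["WORKSPACE_BUCKET"].rstrip("/") + "/",
    }
```

In the notebook this would be called as `resolve_workspace(os.environ)`; passing a plain dictionary instead makes the lookup easy to check outside Terra.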


# Extract NTSM Data

## Read in NTSM Data Table

In [3]:
ntsm_df = tp.table_to_dataframe("ntsm", workspace=WORKSPACE, workspace_namespace=PROJECT)

ntsm_df.head()

ntsm_id,ntsv_count_2,ntsv_count_1,hifi,1000g_cram,ntsm_eval_out
HG00099,gs://fc-c2424fa0-8b70-4a0a-bd92-598d9fa71260/s...,gs://fc-c2424fa0-8b70-4a0a-bd92-598d9fa71260/s...,[gs://fc-4310e737-a388-4a10-8c9e-babe06aaf0cf/...,gs://fc-56ac46ea-efc4-4683-b6d5-6d95bed41c5e/C...,gs://fc-c2424fa0-8b70-4a0a-bd92-598d9fa71260/s...
HG00280,gs://fc-c2424fa0-8b70-4a0a-bd92-598d9fa71260/s...,gs://fc-c2424fa0-8b70-4a0a-bd92-598d9fa71260/s...,[gs://fc-4310e737-a388-4a10-8c9e-babe06aaf0cf/...,gs://fc-56ac46ea-efc4-4683-b6d5-6d95bed41c5e/C...,gs://fc-c2424fa0-8b70-4a0a-bd92-598d9fa71260/s...
HG00558,gs://fc-c2424fa0-8b70-4a0a-bd92-598d9fa71260/s...,gs://fc-c2424fa0-8b70-4a0a-bd92-598d9fa71260/s...,[gs://fc-4310e737-a388-4a10-8c9e-babe06aaf0cf/...,gs://fc-56ac46ea-efc4-4683-b6d5-6d95bed41c5e/C...,gs://fc-c2424fa0-8b70-4a0a-bd92-598d9fa71260/s...
HG00639,gs://fc-c2424fa0-8b70-4a0a-bd92-598d9fa71260/s...,gs://fc-c2424fa0-8b70-4a0a-bd92-598d9fa71260/s...,[gs://fc-4310e737-a388-4a10-8c9e-babe06aaf0cf/...,gs://fc-56ac46ea-efc4-4683-b6d5-6d95bed41c5e/C...,gs://fc-c2424fa0-8b70-4a0a-bd92-598d9fa71260/s...
HG01074,gs://fc-c2424fa0-8b70-4a0a-bd92-598d9fa71260/s...,gs://fc-c2424fa0-8b70-4a0a-bd92-598d9fa71260/s...,[gs://fc-4310e737-a388-4a10-8c9e-babe06aaf0cf/...,gs://fc-56ac46ea-efc4-4683-b6d5-6d95bed41c5e/C...,gs://fc-c2424fa0-8b70-4a0a-bd92-598d9fa71260/s...


## Read NTSM Output & Write To DataFrame

In [4]:
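## Placeholder columns, filled in per sample below
ntsm_df['ntsm_score'] = np.nan
ntsm_df['result']     = None

for index, row in ntsm_df.iterrows():

        sample_ntsm_fp = row['ntsm_eval_out']
        sample_ntsm_fn = os.path.basename(sample_ntsm_fp)

        ## Copy the NTSM eval output into the working directory
        ! gsutil cp {sample_ntsm_fp} .
        
        sample_ntsm_df = pd.read_csv(sample_ntsm_fn, header=None, sep='\t')

        ## Column 2 holds the score and column 3 the result (e.g. 'Similar').
        ## Use .loc with scalar values to avoid chained-assignment warnings.
        ntsm_df.loc[index, 'ntsm_score'] = float(sample_ntsm_df[2].iloc[0])
        ntsm_df.loc[index, 'result']     = str(sample_ntsm_df[3].iloc[0])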


Copying gs://fc-c2424fa0-8b70-4a0a-bd92-598d9fa71260/submissions/d572bc00-3943-494a-bed8-cfddb30a1848/ntsm_workflow/2a7ef75f-352c-46f8-a78c-c2246a406196/call-ntsm_eval/sample_1000genome_Illumina_vs_hifi.txt...
/ [1 files][  424.0 B/  424.0 B]                                                
Operation completed over 1 objects/424.0 B.                                      




Copying gs://fc-c2424fa0-8b70-4a0a-bd92-598d9fa71260/submissions/d572bc00-3943-494a-bed8-cfddb30a1848/ntsm_workflow/2985c3a1-3e42-40e2-9efa-4458a74503ad/call-ntsm_eval/sample_1000genome_Illumina_vs_hifi.txt...
/ [1 files][  424.0 B/  424.0 B]                                                
Operation completed over 1 objects/424.0 B.                                      
Copying gs://fc-c2424fa0-8b70-4a0a-bd92-598d9fa71260/submissions/83ef6c7c-17f7-4cb1-8709-2809ad2975ea/ntsm_workflow/2ea3f12a-1c72-4dcb-b24d-393ae92866ff/call-ntsm_eval/sample_1000genome_Illumina_vs_hifi.txt...
/ [1 files][  424.0 B/  424.0 B]                                                
Operation completed over 1 objects/424.0 B.                                      
Copying gs://fc-c2424fa0-8b70-4a0a-bd92-598d9fa71260/submissions/83ef6c7c-17f7-4cb1-8709-2809ad2975ea/ntsm_workflow/c84df0bc-edde-44eb-b873-4ce78bf382c5/call-ntsm_eval/sample_1000genome_Illumina_vs_hifi.txt...
/ [1 files][  424.0 B/  424.0 B]            

Copying gs://fc-c2424fa0-8b70-4a0a-bd92-598d9fa71260/submissions/83ef6c7c-17f7-4cb1-8709-2809ad2975ea/ntsm_workflow/cd22c870-eff4-4584-a616-910a73fb4cd3/call-ntsm_eval/sample_1000genome_Illumina_vs_hifi.txt...
/ [1 files][  434.0 B/  434.0 B]                                                
Operation completed over 1 objects/434.0 B.                                      
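
The per-sample parsing step can be isolated into a small function, which makes it testable without touching cloud storage. This is a minimal sketch: `parse_ntsm_eval` and the example line are hypothetical, and only the column positions (index 2 for the score, index 3 for the result) are taken from the cell above.

```python
import io

import pandas as pd


def parse_ntsm_eval(text: str) -> tuple:
    """Return (score, result) from the first row of tab-separated NTSM eval output."""
    df = pd.read_csv(io.StringIO(text), header=None, sep="\t")
    # Column positions follow the notebook: index 2 is the score, index 3 the result.
    return float(df.iloc[0, 2]), str(df.iloc[0, 3])


# Hypothetical one-line input; the first two fields are placeholders and only
# the positions of columns 2 and 3 matter for this sketch.
example = "sample_a\tsample_b\t0.496386\tSimilar\n"
score, result = parse_ntsm_eval(example)
```

Since `gcsfs` is installed, `pd.read_csv` may also be able to read the `gs://` path directly and skip the local copy; that variant is not exercised here.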


In [5]:
ntsm_df

ntsm_id,ntsv_count_2,ntsv_count_1,hifi,1000g_cram,ntsm_eval_out,ntsm_score,result
HG00099,gs://fc-c2424fa0-8b70-4a0a-bd92-598d9fa71260/s...,gs://fc-c2424fa0-8b70-4a0a-bd92-598d9fa71260/s...,[gs://fc-4310e737-a388-4a10-8c9e-babe06aaf0cf/...,gs://fc-56ac46ea-efc4-4683-b6d5-6d95bed41c5e/C...,gs://fc-c2424fa0-8b70-4a0a-bd92-598d9fa71260/s...,0.496386,Similar
HG00280,gs://fc-c2424fa0-8b70-4a0a-bd92-598d9fa71260/s...,gs://fc-c2424fa0-8b70-4a0a-bd92-598d9fa71260/s...,[gs://fc-4310e737-a388-4a10-8c9e-babe06aaf0cf/...,gs://fc-56ac46ea-efc4-4683-b6d5-6d95bed41c5e/C...,gs://fc-c2424fa0-8b70-4a0a-bd92-598d9fa71260/s...,0.52521,Similar
HG00558,gs://fc-c2424fa0-8b70-4a0a-bd92-598d9fa71260/s...,gs://fc-c2424fa0-8b70-4a0a-bd92-598d9fa71260/s...,[gs://fc-4310e737-a388-4a10-8c9e-babe06aaf0cf/...,gs://fc-56ac46ea-efc4-4683-b6d5-6d95bed41c5e/C...,gs://fc-c2424fa0-8b70-4a0a-bd92-598d9fa71260/s...,0.488614,Similar
HG00639,gs://fc-c2424fa0-8b70-4a0a-bd92-598d9fa71260/s...,gs://fc-c2424fa0-8b70-4a0a-bd92-598d9fa71260/s...,[gs://fc-4310e737-a388-4a10-8c9e-babe06aaf0cf/...,gs://fc-56ac46ea-efc4-4683-b6d5-6d95bed41c5e/C...,gs://fc-c2424fa0-8b70-4a0a-bd92-598d9fa71260/s...,0.486938,Similar
HG01074,gs://fc-c2424fa0-8b70-4a0a-bd92-598d9fa71260/s...,gs://fc-c2424fa0-8b70-4a0a-bd92-598d9fa71260/s...,[gs://fc-4310e737-a388-4a10-8c9e-babe06aaf0cf/...,gs://fc-56ac46ea-efc4-4683-b6d5-6d95bed41c5e/C...,gs://fc-c2424fa0-8b70-4a0a-bd92-598d9fa71260/s...,0.508654,Similar
HG01081,gs://fc-c2424fa0-8b70-4a0a-bd92-598d9fa71260/s...,gs://fc-c2424fa0-8b70-4a0a-bd92-598d9fa71260/s...,[gs://fc-4310e737-a388-4a10-8c9e-babe06aaf0cf/...,gs://fc-56ac46ea-efc4-4683-b6d5-6d95bed41c5e/C...,gs://fc-c2424fa0-8b70-4a0a-bd92-598d9fa71260/s...,0.511131,Similar
HG02040,gs://fc-c2424fa0-8b70-4a0a-bd92-598d9fa71260/s...,gs://fc-c2424fa0-8b70-4a0a-bd92-598d9fa71260/s...,[gs://fc-4310e737-a388-4a10-8c9e-babe06aaf0cf/...,gs://fc-56ac46ea-efc4-4683-b6d5-6d95bed41c5e/C...,gs://fc-c2424fa0-8b70-4a0a-bd92-598d9fa71260/s...,0.516192,Similar
HG02165,gs://fc-c2424fa0-8b70-4a0a-bd92-598d9fa71260/s...,gs://fc-c2424fa0-8b70-4a0a-bd92-598d9fa71260/s...,[gs://fc-4310e737-a388-4a10-8c9e-babe06aaf0cf/...,gs://fc-56ac46ea-efc4-4683-b6d5-6d95bed41c5e/C...,gs://fc-c2424fa0-8b70-4a0a-bd92-598d9fa71260/s...,0.521418,Similar
HG02451,gs://fc-c2424fa0-8b70-4a0a-bd92-598d9fa71260/s...,gs://fc-c2424fa0-8b70-4a0a-bd92-598d9fa71260/s...,[gs://fc-4310e737-a388-4a10-8c9e-babe06aaf0cf/...,gs://fc-56ac46ea-efc4-4683-b6d5-6d95bed41c5e/C...,gs://fc-c2424fa0-8b70-4a0a-bd92-598d9fa71260/s...,0.50651,Similar
HG02735,gs://fc-c2424fa0-8b70-4a0a-bd92-598d9fa71260/s...,gs://fc-c2424fa0-8b70-4a0a-bd92-598d9fa71260/s...,[gs://fc-4310e737-a388-4a10-8c9e-babe06aaf0cf/...,gs://fc-56ac46ea-efc4-4683-b6d5-6d95bed41c5e/C...,gs://fc-c2424fa0-8b70-4a0a-bd92-598d9fa71260/s...,0.477988,Similar


In [7]:
## Count samples whose NTSM result is not 'Similar' (should be 0)
(ntsm_df['result'] != 'Similar').sum()

0
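
When the count is not zero, it helps to see which samples failed rather than only how many. The sketch below uses a hypothetical stand-in for `ntsm_df` with made-up sample names and values, including one deliberately mismatched row.

```python
import pandas as pd

# Hypothetical stand-in for ntsm_df; names, scores, and the 'Different'
# label are illustrative only.
demo_df = pd.DataFrame(
    {"ntsm_score": [0.49, 2.31], "result": ["Similar", "Different"]},
    index=pd.Index(["sample_1", "sample_2"], name="ntsm_id"),
)

# Keep only the rows whose result is anything other than 'Similar'.
mismatched = demo_df.loc[demo_df["result"] != "Similar", ["ntsm_score", "result"]]
n_mismatched = len(mismatched)
```

Displaying `mismatched` shows the sample IDs and scores that need follow-up, and `n_mismatched` reproduces the count checked above.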

# Extract ReadStats Data

## Read in ReadStats Data Table

In [8]:
readstats_df = tp.table_to_dataframe("readstats", workspace=WORKSPACE, workspace_namespace=PROJECT)

readstats_df.head()

readstats_id,ReadStatsTarball,hifi,ReadStatsReport
HG00099,gs://fc-c2424fa0-8b70-4a0a-bd92-598d9fa71260/s...,[gs://fc-4310e737-a388-4a10-8c9e-babe06aaf0cf/...,gs://fc-c2424fa0-8b70-4a0a-bd92-598d9fa71260/s...
HG00280,gs://fc-c2424fa0-8b70-4a0a-bd92-598d9fa71260/s...,[gs://fc-4310e737-a388-4a10-8c9e-babe06aaf0cf/...,gs://fc-c2424fa0-8b70-4a0a-bd92-598d9fa71260/s...
HG00558,gs://fc-c2424fa0-8b70-4a0a-bd92-598d9fa71260/s...,[gs://fc-4310e737-a388-4a10-8c9e-babe06aaf0cf/...,gs://fc-c2424fa0-8b70-4a0a-bd92-598d9fa71260/s...
HG00639,gs://fc-c2424fa0-8b70-4a0a-bd92-598d9fa71260/s...,[gs://fc-4310e737-a388-4a10-8c9e-babe06aaf0cf/...,gs://fc-c2424fa0-8b70-4a0a-bd92-598d9fa71260/s...
HG01074,gs://fc-c2424fa0-8b70-4a0a-bd92-598d9fa71260/s...,[gs://fc-4310e737-a388-4a10-8c9e-babe06aaf0cf/...,gs://fc-c2424fa0-8b70-4a0a-bd92-598d9fa71260/s...


## Read ReadStats Output & Write To DataFrame

In [34]:
readstats_df['output']   = np.nan

for index, row in readstats_df.iterrows():

        sample_readstats_fp = row['ReadStatsReport']
        sample_readstats_fn = os.path.basename(sample_readstats_fp)

        ! gsutil cp {sample_readstats_fp} .
        
        sample_readstats_df = pd.read_csv(sample_readstats_fn, header=None, sep='\t')

        ## Just look at sample-level metrics
        sample_readstats_df = sample_readstats_df[sample_readstats_df[0]=='sample.fastq']

        ## Get rid of extra row
        sample_readstats_df = sample_readstats_df.iloc[1: , :]

        ## Total sequenced bases (Gbp) for the sample.
        ## Use .loc to avoid chained-assignment warnings.
        sample_total_gbp = sample_readstats_df[sample_readstats_df[1] == 'total_Gbp'][2]
        readstats_df.loc[index, 'output'] = float(sample_total_gbp.values[0])

        
## Approximate coverage: total Gbp divided by a ~3.1 Gbp human genome
readstats_df['coverage'] = readstats_df['output']/3.1

Copying gs://fc-c2424fa0-8b70-4a0a-bd92-598d9fa71260/submissions/c9a54e77-e953-4fa7-978f-dde608fd1637/runReadStats/680cc0a8-4d66-40c4-a3db-00a5122d29ff/call-consolidateReadStats/glob-44edd1a7587a8a70d756f66d9e5e0ada/sample_all.report.tsv...
/ [1 files][  2.6 KiB/  2.6 KiB]                                                
Operation completed over 1 objects/2.6 KiB.                                      




Copying gs://fc-c2424fa0-8b70-4a0a-bd92-598d9fa71260/submissions/cf90cb25-5cac-43df-9a0a-d14a4fce6cc7/runReadStats/cd30c762-5013-4d96-af90-97985e2abe8e/call-consolidateReadStats/attempt-2/glob-44edd1a7587a8a70d756f66d9e5e0ada/sample_all.report.tsv...
/ [1 files][  2.6 KiB/  2.6 KiB]                                                
Operation completed over 1 objects/2.6 KiB.                                      
Copying gs://fc-c2424fa0-8b70-4a0a-bd92-598d9fa71260/submissions/c9a54e77-e953-4fa7-978f-dde608fd1637/runReadStats/03cf30f3-cca7-44eb-9b39-ed38144682a1/call-consolidateReadStats/glob-44edd1a7587a8a70d756f66d9e5e0ada/sample_all.report.tsv...
/ [1 files][  3.4 KiB/  3.4 KiB]                                                
Operation completed over 1 objects/3.4 KiB.                                      
Copying gs://fc-c2424fa0-8b70-4a0a-bd92-598d9fa71260/submissions/b0a872c7-c9e1-41c8-9fb8-765c44f86128/runReadStats/1be66890-fd13-4f32-bfa3-08b55ac7cde7/call-consolidateReadStats/glob

/ [1 files][  2.6 KiB/  2.6 KiB]                                                
Operation completed over 1 objects/2.6 KiB.                                      
Copying gs://fc-c2424fa0-8b70-4a0a-bd92-598d9fa71260/submissions/b0a872c7-c9e1-41c8-9fb8-765c44f86128/runReadStats/a552eea4-1878-41ed-b27f-3458c812361d/call-consolidateReadStats/glob-44edd1a7587a8a70d756f66d9e5e0ada/sample_all.report.tsv...
/ [1 files][  2.6 KiB/  2.6 KiB]                                                
Operation completed over 1 objects/2.6 KiB.                                      
Copying gs://fc-c2424fa0-8b70-4a0a-bd92-598d9fa71260/submissions/b0a872c7-c9e1-41c8-9fb8-765c44f86128/runReadStats/1196f8ea-24f8-442e-82b3-7b272892f8be/call-consolidateReadStats/glob-44edd1a7587a8a70d756f66d9e5e0ada/sample_all.report.tsv...
/ [1 files][  2.6 KiB/  2.6 KiB]                                                
Operation completed over 1 objects/2.6 KiB.                                      
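
The coverage calculation reduces to one lookup and one division: take the `total_Gbp` value for `sample.fastq` and divide by 3.1. The sketch below shows that on an in-memory report. `coverage_from_report` and the example report are hypothetical; the three-column layout (file, metric, value) is inferred from the cell above, and unlike that cell the sketch selects the row by both conditions instead of dropping the first sample-level row.

```python
import io

import pandas as pd

GENOME_SIZE_GBP = 3.1  # divisor used in the notebook


def coverage_from_report(text: str, genome_size_gbp: float = GENOME_SIZE_GBP) -> float:
    """Return total_Gbp for sample.fastq divided by the genome size."""
    # Read everything as strings so mixed-type value columns parse predictably.
    report = pd.read_csv(io.StringIO(text), header=None, sep="\t", dtype=str)
    is_sample = report[0] == "sample.fastq"
    is_total = report[1] == "total_Gbp"
    total_gbp = float(report.loc[is_sample & is_total, 2].iloc[0])
    return total_gbp / genome_size_gbp


# Hypothetical report: one per-file row and two sample-level rows.
example_report = (
    "file_a.fastq\ttotal_Gbp\t40.00\n"
    "sample.fastq\ttotal_reads\t1000\n"
    "sample.fastq\ttotal_Gbp\t143.97\n"
)
coverage = coverage_from_report(example_report)
```

With 143.97 Gbp of sequence this gives roughly 46x, in line with the `coverage` column computed in the notebook for a sample with the same `output` value.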


In [35]:
readstats_df

readstats_id,ReadStatsTarball,hifi,ReadStatsReport,output,coverage
HG00099,gs://fc-c2424fa0-8b70-4a0a-bd92-598d9fa71260/s...,[gs://fc-4310e737-a388-4a10-8c9e-babe06aaf0cf/...,gs://fc-c2424fa0-8b70-4a0a-bd92-598d9fa71260/s...,143.97,46.441935
HG00280,gs://fc-c2424fa0-8b70-4a0a-bd92-598d9fa71260/s...,[gs://fc-4310e737-a388-4a10-8c9e-babe06aaf0cf/...,gs://fc-c2424fa0-8b70-4a0a-bd92-598d9fa71260/s...,128.42,41.425806
HG00558,gs://fc-c2424fa0-8b70-4a0a-bd92-598d9fa71260/s...,[gs://fc-4310e737-a388-4a10-8c9e-babe06aaf0cf/...,gs://fc-c2424fa0-8b70-4a0a-bd92-598d9fa71260/s...,105.37,33.990323
HG00639,gs://fc-c2424fa0-8b70-4a0a-bd92-598d9fa71260/s...,[gs://fc-4310e737-a388-4a10-8c9e-babe06aaf0cf/...,gs://fc-c2424fa0-8b70-4a0a-bd92-598d9fa71260/s...,106.17,34.248387
HG01074,gs://fc-c2424fa0-8b70-4a0a-bd92-598d9fa71260/s...,[gs://fc-4310e737-a388-4a10-8c9e-babe06aaf0cf/...,gs://fc-c2424fa0-8b70-4a0a-bd92-598d9fa71260/s...,125.21,40.390323
HG01081,gs://fc-c2424fa0-8b70-4a0a-bd92-598d9fa71260/s...,[gs://fc-4310e737-a388-4a10-8c9e-babe06aaf0cf/...,gs://fc-c2424fa0-8b70-4a0a-bd92-598d9fa71260/s...,108.12,34.877419
HG02040,gs://fc-c2424fa0-8b70-4a0a-bd92-598d9fa71260/s...,[gs://fc-4310e737-a388-4a10-8c9e-babe06aaf0cf/...,gs://fc-c2424fa0-8b70-4a0a-bd92-598d9fa71260/s...,129.9,41.903226
HG02165,gs://fc-c2424fa0-8b70-4a0a-bd92-598d9fa71260/s...,[gs://fc-4310e737-a388-4a10-8c9e-babe06aaf0cf/...,gs://fc-c2424fa0-8b70-4a0a-bd92-598d9fa71260/s...,134.71,43.454839
HG02451,gs://fc-c2424fa0-8b70-4a0a-bd92-598d9fa71260/s...,[gs://fc-4310e737-a388-4a10-8c9e-babe06aaf0cf/...,gs://fc-c2424fa0-8b70-4a0a-bd92-598d9fa71260/s...,131.33,42.364516
HG02735,gs://fc-c2424fa0-8b70-4a0a-bd92-598d9fa71260/s...,[gs://fc-4310e737-a388-4a10-8c9e-babe06aaf0cf/...,gs://fc-c2424fa0-8b70-4a0a-bd92-598d9fa71260/s...,112.78,36.380645


# Upload To Tables

In [17]:
## Write the NTSM and ReadStats results back to the workspace data tables
tp.dataframe_to_table("ntsm",      ntsm_df,      WORKSPACE, PROJECT)
tp.dataframe_to_table("readstats", readstats_df, WORKSPACE, PROJECT)