<div style="background-color:lightgreen; padding: 10px; font-size: 24px;">
    
__PRS Calculator:__ Hail VDS   
    
Check Variant Annotation Table
</div>

<div style="background-color:lightgrey; padding: 10px; font-size: 18px;">  
    
__Author:__ Ahmed Khattab  
        __Scripps Research__
    
</div>

<div style="background-color:lightblue; padding: 10px; font-size: 16px;"> 
    
__Introduction__

In this notebook, we check the callset, via the Variant Annotation Table (VAT), to count how many variants in your weight table are present and how many are missing.


__Using VAT__ 

VAT is a structured table containing functional annotations for the variants detected in the srWGS dataset. 
    
__Resources used__   


Cost when running: $0.73 per hour  

Main node: 4 CPUs, 15 GB RAM, 150 GB disk  
Workers (2): 4 CPUs, 15 GB RAM, 150 GB disk   

Time and cost:  __$0.05 / ~4 min__  
    
</div>

In [1]:
import time
import datetime

# Get the current date and time
start_time = datetime.datetime.now()

# Record the start time
current_date = start_time.date()
current_time = start_time.time()

# Format the current date
formatted_start_date = current_date.strftime("%Y-%m-%d")

# Format the current time
formatted_start_time = current_time.strftime("%H:%M:%S")

# Print the formatted date and time separately
print("Start date:", formatted_start_date)
print("Start time:", formatted_start_time)

Start date: 2024-07-01
Start time: 22:09:31


In [2]:
import sys
sys.path.insert(0, '/home/jupyter/.local/lib/python3.10/site-packages')

```bash
# Install pytabix if it is not already installed
pip install --user pytabix
```

In [3]:
import os
import pandas as pd
import numpy as np
from datetime import datetime
import gcsfs
import multiprocessing
import ast
import concurrent.futures
import glob
import tabix

# Define Bucket

In [4]:
bucket = os.getenv("WORKSPACE_BUCKET")

# Check SNP in VAT
- Check which variants in your score exist in the callset before using the VDS to calculate scores.
- Learn more: https://support.researchallofus.org/hc/en-us/articles/22522829338260-Hotfix-released-for-v7-Variant-Annotation-Table

## Download VAT to Workspace

In [5]:
# VAT can be found here
!gsutil -m -u $GOOGLE_PROJECT ls gs://fc-aou-datasets-controlled/v7/wgs/short_read/snpindel/aux/vat

gs://fc-aou-datasets-controlled/v7/wgs/short_read/snpindel/aux/vat/vat_complete.bgz.tsv.gz
gs://fc-aou-datasets-controlled/v7/wgs/short_read/snpindel/aux/vat/vat_complete_v7.1.bgz.tsv.gz


<div class="alert alert-block alert-info" style="background-color: grey;">
    <ul>
        <li>The VAT has many fields, but we only need the chromosome, position, and both alleles to match variants in our weight table.</li>
        <li>The code below shows how to download and index the VAT file. We already completed that step elsewhere, so we use the resulting files directly.</li>
    </ul>
</div>

```bash
# Use a TERMINAL to download and tabix-index the VAT

# Copy the file to the local disk:
gsutil -u $GOOGLE_PROJECT cp gs://fc-aou-datasets-controlled/v7/wgs/short_read/snpindel/aux/vat/vat_complete_v7.1.bgz.tsv.gz vat_complete_v7.1.bgz.tsv.gz

# Process the local file
mkdir -p tmp
# Keep only the 4 needed columns (chr, pos, ref, alt) and skip the header line,
# strip the "chr" prefix, then de-duplicate and sort by chromosome and numeric
# position, because tabix requires position-sorted input
gunzip -c vat_complete_v7.1.bgz.tsv.gz | \
  awk -F'\t' 'NR > 1 {print $3"\t"$4"\t"$5"\t"$6}' | \
  sed 's~chr~~g' | \
  sort -u -k1,1 -k2,2n -k3,3 -k4,4 -S 5G -T tmp/ | \
  bgzip -@ 16 > chr_pos_ref_alt.vat_v7.1.tsv.bgz

# Create the tabix index: in the reduced file, column 1 is the chromosome
# and column 2 is the position (used as both start and end)
tabix -s 1 -b 2 -e 2 chr_pos_ref_alt.vat_v7.1.tsv.bgz

# The local files get deleted when you delete the "Hail Genomic Analysis" environment. Make sure to copy them to the cloud.
```
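
The column reduction performed by the shell pipeline above can also be sketched in Python on a tiny in-memory sample, which makes the intended output easy to inspect. This is a minimal illustration, not a replacement for the pipeline: the helper `reduce_vat`, the sample rows, and the header names are hypothetical, and it assumes (as the `awk` step does) that columns 3 to 6 hold chromosome, position, reference allele, and alternate allele.

```python
import csv
import io

# Hypothetical excerpt shaped like the VAT: one header line plus three rows.
# Column names and values are illustrative assumptions, not real data.
SAMPLE_VAT = (
    "vid\ttranscript\tcontig\tposition\tref_allele\talt_allele\n"
    "1-100-A-G\tT1\tchr1\t100\tA\tG\n"
    "1-100-A-G\tT2\tchr1\t100\tA\tG\n"
    "1-20-C-T\tT3\tchr1\t20\tC\tT\n"
)


def reduce_vat(handle):
    """Return sorted, de-duplicated (chr, pos, ref, alt) tuples."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader)  # skip the header line
    keys = set()
    for fields in reader:
        chrom, pos, ref, alt = fields[2:6]  # columns 3-6 of the table
        keys.add((chrom.replace("chr", ""), int(pos), ref, alt))
    # Tuples sort by chromosome name first, then numerically by position
    return sorted(keys)


reduced = reduce_vat(io.StringIO(SAMPLE_VAT))
for row in reduced:
    print(*row, sep="\t")
```

The two rows that differ only by transcript collapse into one, and position 20 sorts before position 100 because positions are compared as integers rather than as text.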

In [6]:
# Download the files locally (they will be deleted when you delete the current Hail environment)
!gcloud storage cp {bucket}/allofus_phenotypes/people_with_WGS_EHR_020524/cad/chr_pos_ref_alt.vat_v7.1.tsv.bgz /home/jupyter
!gcloud storage cp {bucket}/allofus_phenotypes/people_with_WGS_EHR_020524/cad/chr_pos_ref_alt.vat_v7.1.tsv.bgz.tbi /home/jupyter

Copying gs://fc-secure-e5684327-e720-41ed-979a-b9ae6477b844/allofus_phenotypes/people_with_WGS_EHR_020524/cad/chr_pos_ref_alt.vat_v7.1.tsv.bgz to file:///home/jupyter/chr_pos_ref_alt.vat_v7.1.tsv.bgz
  Completed files 1/1 | 2.8GiB/2.8GiB | 57.3MiB/s                              

Average throughput: 136.0MiB/s
Copying gs://fc-secure-e5684327-e720-41ed-979a-b9ae6477b844/allofus_phenotypes/people_with_WGS_EHR_020524/cad/chr_pos_ref_alt.vat_v7.1.tsv.bgz.tbi to file:///home/jupyter/chr_pos_ref_alt.vat_v7.1.tsv.bgz.tbi
  Completed files 1/1 | 2.3MiB/2.3MiB                                          

Average throughput: 146.7MiB/s


In [5]:
vat_fp = "/home/jupyter/chr_pos_ref_alt.vat_v7.1.tsv.bgz"
vat_tb = tabix.open(vat_fp)

In [9]:
def check_allele_in_tabix(row, vat_tb):
    """Return True if the row's effect allele is the ref or alt of a VAT record at its position."""
    try:
        bp = int(row['bp'])
        # Query the single base at position bp
        records = vat_tb.query(str(row['chr']), bp - 1, bp)
        for record in records:
            # Each record is [chr, bp, ref, alt]
            if row['effect_allele'] in record[2:4]:  # effect_allele matches ref or alt
                return True
    except Exception as e:
        print(f"Error querying tabix: {e}")
    return False
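
The matching rule used by `check_allele_in_tabix` can be tried out without the 2.8 GiB VAT file. The sketch below swaps the real index for `FakeTabix`, a hypothetical in-memory stand-in written only for this illustration. It assumes that `query(chrom, start, end)` treats the interval as 0-based and half-open, which is what the `bp - 1, bp` call suggests. The helper `allele_in_vat` and the sample weights are also illustrative.

```python
class FakeTabix:
    """Hypothetical stand-in for a tabix handle, for illustration only."""

    def __init__(self, rows):
        self.rows = rows  # list of (chr, pos, ref, alt) with 1-based positions

    def query(self, chrom, start, end):
        # Mimic a 0-based, half-open interval [start, end):
        # a 1-based position p overlaps it when start < p <= end
        for c, pos, ref, alt in self.rows:
            if c == chrom and start < pos <= end:
                yield [c, str(pos), ref, alt]


def allele_in_vat(row, tb):
    """True if the effect allele equals the ref or alt of a record at row's position."""
    bp = int(row["bp"])
    records = tb.query(str(row["chr"]), bp - 1, bp)
    return any(row["effect_allele"] in rec[2:4] for rec in records)


vat = FakeTabix([("1", 100, "A", "G"), ("2", 500, "C", "T")])
weights = [
    {"chr": 1, "bp": 100, "effect_allele": "G"},  # alt allele matches
    {"chr": 1, "bp": 100, "effect_allele": "T"},  # position present, allele absent
    {"chr": 2, "bp": 501, "effect_allele": "T"},  # only a neighboring position exists
]
flags = [allele_in_vat(w, vat) for w in weights]
print(flags)  # [True, False, False]
```

Note that the rule accepts a match on either the reference or the alternate allele, so a `True` flag does not say which of the two the effect allele corresponds to.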

In [10]:
input_dir = "/home/jupyter/workspaces/duplicateoftype2diabetesriskprediction/prs_scores/"
query_dir = "/home/jupyter/workspaces/duplicateoftype2diabetesriskprediction/prs_scores/vat_query"
check_dir = "/home/jupyter/workspaces/duplicateoftype2diabetesriskprediction/prs_scores/vat_check"

In [11]:
import pandas as pd
import glob
import time

start_time = time.time()
 
for pgs_fp in glob.glob(f'{input_dir}/PGS000746_Gola_D_PRS_1940_Coronary_artery_disease_Circ_Genom_Precis_Med_2020.GRCh37_to_GRCh38.csv'): 
    pgs_df_name = pgs_fp.split("/")[-1].split(".")[0]  # Specify the name of the DataFrame
    pgs_df = pd.read_csv(pgs_fp, dtype={"weight": "str"})
    print(f"{pgs_df.shape} read: {pgs_fp}")
    
    # Flag each variant as present (True) or absent (False) in the VAT
    pgs_df['AoU'] = pgs_df.apply(check_allele_in_tabix, vat_tb=vat_tb, axis=1)
    print(f"{pgs_df['AoU'].value_counts().get(False, 0)} SNPs not found in VAT")
    
    vat_pgs_fp = f"{query_dir}/{pgs_fp.split('/')[-1]}"
    pgs_df.to_csv(vat_pgs_fp, index=False)
    print(f"{pgs_df.shape} saved: {vat_pgs_fp}")
    
    # Keep only the variants found in the VAT, under the same file name in check_dir
    vat_check_pgs_df = pgs_df.loc[pgs_df['AoU']].drop(columns=["AoU"])
    vat_check_pgs_fp = f"{check_dir}/{pgs_fp.split('/')[-1]}"
    vat_check_pgs_df.to_csv(vat_check_pgs_fp, index=False)
    print(f"{vat_check_pgs_df.shape} saved: {vat_check_pgs_fp}\n")
    
end_time = time.time()
execution_time = end_time - start_time
print(f"Execution time: {execution_time} seconds")

(1938, 9) read: /home/jupyter/workspaces/duplicateoftype2diabetesriskprediction/prs_scores//PGS000746_Gola_D_PRS_1940_Coronary_artery_disease_Circ_Genom_Precis_Med_2020.GRCh37_to_GRCh38.csv
3 SNPs not found in VAT
(1938, 10) saved: /home/jupyter/workspaces/duplicateoftype2diabetesriskprediction/prs_scores/vat_query/PGS000746_Gola_D_PRS_1940_Coronary_artery_disease_Circ_Genom_Precis_Med_2020.GRCh37_to_GRCh38.csv
(1935, 9) saved: /home/jupyter/workspaces/duplicateoftype2diabetesriskprediction/prs_scores/vat_check/PGS000746_Gola_D_PRS_1940_Coronary_artery_disease_Circ_Genom_Precis_Med_2020.GRCh37_to_GRCh38.csv

Execution time: 35.2281973361969 seconds
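
Once the `vat_query` file has been written, it can help to list exactly which variants were not found rather than only counting them. The sketch below does this with the standard library on a small in-memory sample. The rows are hypothetical, the helper `summarize` is illustrative, and it assumes the `AoU` column was written as the text `True` or `False`, which is how pandas serializes booleans to CSV.

```python
import csv
import io

# Hypothetical excerpt of a vat_query CSV; values are illustrative only.
SAMPLE_QUERY = (
    "chr,bp,effect_allele,weight,AoU\n"
    "1,100,G,0.12,True\n"
    "1,250,T,-0.05,False\n"
    "2,500,C,0.30,True\n"
)


def summarize(handle):
    """Count present variants and collect the missing ones."""
    rows = list(csv.DictReader(handle))
    missing = [r for r in rows if r["AoU"] != "True"]
    return {
        "total": len(rows),
        "present": len(rows) - len(missing),
        "missing": [(r["chr"], int(r["bp"]), r["effect_allele"]) for r in missing],
    }


summary = summarize(io.StringIO(SAMPLE_QUERY))
pct = 100 * summary["present"] / summary["total"]
print(f"{summary['present']}/{summary['total']} variants present ({pct:.1f}%)")
print("Missing:", summary["missing"])
```

To run it on a real file, pass an open file handle for the saved CSV instead of the `io.StringIO` sample.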


In [12]:
# Always save files to the cloud as they get deleted with the deletion of the environment.
!gsutil -m cp /home/jupyter/workspaces/duplicateoftype2diabetesriskprediction/prs_scores/PGS000746_Gola_D_PRS_1940_Coronary_artery_disease_Circ_Genom_Precis_Med_2020.GRCh37_to_GRCh38.csv {bucket}/AoUPRS/AoUPRS_hail_vds/
!gsutil -m cp /home/jupyter/workspaces/duplicateoftype2diabetesriskprediction/prs_scores/vat_check/PGS000746_Gola_D_PRS_1940_Coronary_artery_disease_Circ_Genom_Precis_Med_2020.GRCh37_to_GRCh38.csv {bucket}/AoUPRS/AoUPRS_hail_vds/vat_check/
!gsutil -m cp /home/jupyter/workspaces/duplicateoftype2diabetesriskprediction/prs_scores/vat_query/PGS000746_Gola_D_PRS_1940_Coronary_artery_disease_Circ_Genom_Precis_Med_2020.GRCh37_to_GRCh38.csv {bucket}/AoUPRS/AoUPRS_hail_vds/vat_query/

Copying file:///home/jupyter/workspaces/duplicateoftype2diabetesriskprediction/prs_scores/PGS000746_Gola_D_PRS_1940_Coronary_artery_disease_Circ_Genom_Precis_Med_2020.GRCh37_to_GRCh38.csv [Content-Type=text/csv]...
/ [1/1 files][117.5 KiB/117.5 KiB] 100% Done                                    
Operation completed over 1 objects/117.5 KiB.                                    
Copying file:///home/jupyter/workspaces/duplicateoftype2diabetesriskprediction/prs_scores/vat_check/PGS000746_Gola_D_PRS_1940_Coronary_artery_disease_Circ_Genom_Precis_Med_2020.GRCh37_to_GRCh38.csv [Content-Type=text/csv]...
/ [1/1 files][117.3 KiB/117.3 KiB] 100% Done                                    
Operation completed over 1 objects/117.3 KiB.                                    
Copying file:///home/jupyter/workspaces/duplicateoftype2diabetesriskprediction/prs_scores/vat_query/PGS000746_Gola_D_PRS_1940_Coronary_artery_disease_Circ_Genom_Precis_Med_2020.GRCh37_to_GRCh38.csv [Content-Type=text/csv]...
/ [1/1 fi

In [13]:
import time
import datetime

# Get the current date and time again
end_time = datetime.datetime.now()

# Record the end time
current_date = end_time.date()
current_time = end_time.time()

# Format the current date
formatted_end_date = current_date.strftime("%Y-%m-%d")

# Format the current time
formatted_end_time = current_time.strftime("%H:%M:%S")

# Print the formatted end date and time separately
print("End date:", formatted_end_date)
print("End time:", formatted_end_time)

End date: 2024-07-01
End time: 22:12:10
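
The start and end timestamps printed by this notebook can be turned into an elapsed time and a rough cost figure. The sketch below is a minimal illustration: the helpers `elapsed_seconds` and `estimate_cost` are hypothetical, the hourly rate is the $0.73 quoted in the introduction, and the result covers only the window between the two timestamp cells, not environment start-up.

```python
from datetime import datetime


def elapsed_seconds(start, end, fmt="%Y-%m-%d %H:%M:%S"):
    """Seconds between two timestamps given as formatted strings."""
    return (datetime.strptime(end, fmt) - datetime.strptime(start, fmt)).total_seconds()


def estimate_cost(seconds, hourly_rate):
    """Prorate an hourly rate over a duration in seconds."""
    return seconds / 3600 * hourly_rate


# Timestamps copied from the first and last cells of this run
secs = elapsed_seconds("2024-07-01 22:09:31", "2024-07-01 22:12:10")
print(f"Elapsed: {secs:.0f} s")
print(f"Estimated cost: ${estimate_cost(secs, 0.73):.2f}")
```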
