<div style="background-color: #e0f7fa; padding: 20px; font-size: 24px; font-family: Arial, sans-serif; border-radius: 10px; text-align: center; color: #00796b; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.2);">
    <strong>AoUPRS:</strong> Hail VDS
</div>

<div style="background-color: #f9f9f9; padding: 15px; font-size: 18px; font-family: 'Georgia', serif; border-left: 5px solid #4a90e2; color: #333; line-height: 1.5;">
    <strong>Author:</strong> Ahmed Khattab<br>
    <em>Scripps Research</em>
</div>


<div style="background-color: #e3f2fd; padding: 20px; font-size: 16px; font-family: 'Helvetica Neue', sans-serif; border-radius: 8px; color: #0d47a1; line-height: 1.6; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1);">
    <strong>Introduction</strong><br>
    In this notebook, we will demonstrate how to use <strong style="color: #00796b;">AoUPRS</strong> to calculate polygenic risk scores (PRS) from the Hail VDS.<br><br>
    <strong>Hail VDS</strong><br>
    VDS (Variant Dataset) is a sparse Hail format that contains the complete callset.<br><br>
    <strong>Resources used</strong>
    <ul style="list-style-type: none; padding: 0;">
        <li>Cost when running: <strong>&dollar;5.29 per hour</strong></li>
        <li>Main node: 4 CPUs, 26 GB RAM, 150 GB Disk</li>
        <li>Workers (2/50): 4 CPUs, 15 GB RAM, 150 GB Disk</li>
        <li>Time and Cost: <strong>&dollar;0.35 / 4 min</strong></li>
    </ul>
</div>
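The per-run cost quoted above follows from the hourly rate and the approximate runtime. A minimal sketch of that arithmetic, assuming the cluster is billed in proportion to runtime (the billing granularity is an assumption, not something this notebook states):

```python
# Hourly cluster rate and approximate runtime, both taken from the resource summary above.
hourly_rate_usd = 5.29
runtime_minutes = 4

# Assumption: cost scales linearly with runtime.
run_cost_usd = hourly_rate_usd * runtime_minutes / 60
print(f"Estimated cost: ${run_cost_usd:.2f}")
```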


In [1]:
import datetime

# Record the notebook start time
start_time = datetime.datetime.now()

# Print the start date and time separately
print("Start date:", start_time.strftime("%Y-%m-%d"))
print("Start time:", start_time.strftime("%H:%M:%S"))

Start date: 2024-07-01
Start time: 22:21:16



<div style="background-color: #dcedc8; padding: 15px; font-size: 24px; font-family: 'Arial', sans-serif; border-radius: 8px; color: #558b2f; line-height: 1.6; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1); text-align: center;">
    <strong>Define Bucket</strong>
</div>


In [2]:
import os
bucket = os.getenv("WORKSPACE_BUCKET")

<div style="background-color: #dcedc8; padding: 15px; font-size: 24px; font-family: 'Arial', sans-serif; border-radius: 8px; color: #558b2f; line-height: 1.6; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1); text-align: center;">
    <strong>Import Hail</strong>
</div>


In [3]:
import AoUPRS
import pandas as pd
import numpy as np
from datetime import datetime
import gcsfs
import multiprocessing
import ast
import concurrent.futures
import glob
import hail as hl

In [4]:
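# Passing default_reference to hl.init is deprecated (see the warning below);
# Hail recommends calling hl.default_reference('GRCh38') after hl.init instead.
hl.init(tmp_dir = f'{bucket}/hail_temp/', default_reference='GRCh38')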


Using hl.init with a default_reference argument is deprecated. To set a default reference genome after initializing hail, call `hl.default_reference` with an argument to set the default reference genome.


Reading spark-defaults.conf to determine GCS requester pays configuration. This is deprecated. Please use `hailctl config set gcs_requester_pays/project` and `hailctl config set gcs_requester_pays/buckets`.

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Running on Apache Spark version 3.3.0
SparkUI available at http://all-of-us-7093-m.c.terra-vpc-sc-e098d676.internal:40377
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.130-bea04d9c79b5
LOGGING: writing to /home/jupyter/workspaces/duplicateoftype2diabetesriskprediction/hail-20240701-2221-0.2.130-bea04d9c79b5.log



<div style="background-color: #dcedc8; padding: 15px; font-size: 24px; font-family: 'Arial', sans-serif; border-radius: 8px; color: #558b2f; line-height: 1.6; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1); text-align: center;">
    <strong>Read Hail VDS</strong>
</div>


In [5]:
vds_srwgs_path = os.getenv("WGS_VDS_PATH")
vds_srwgs_path

'gs://fc-aou-datasets-controlled/v7/wgs/short_read/snpindel/vds/hail.vds'

In [6]:
vds = hl.vds.read_vds(vds_srwgs_path)

In [7]:
vds.n_samples()

245394


<div style="background-color: #dcedc8; padding: 15px; font-size: 24px; font-family: 'Arial', sans-serif; border-radius: 8px; color: #558b2f; line-height: 1.6; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1); text-align: center;">
    <strong>Drop Flagged srWGS samples</strong>
</div>


<div style="background-color: #e8eaf6; padding: 20px; font-size: 18px; font-family: 'Arial', sans-serif; border-radius: 8px; color: #303f9f; line-height: 1.6; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1);">
    AoU provides a table listing samples that are flagged as part of the sample outlier QC for the srWGS SNP and Indel joint callset. <a href="https://support.researchallofus.org/hc/en-us/articles/4614687617556-How-the-All-of-Us-Genomic-data-are-organized#h_01GY7QZR2QYFDKGK89TCHSJSA7" style="color: #303f9f; text-decoration: none; font-size: 14px;"><strong>Read more</strong></a>
</div>


In [8]:
# Path to the flagged samples table (here, the relatedness-flagged list)
flagged_samples_path = "gs://fc-aou-datasets-controlled/v7/wgs/short_read/snpindel/aux/relatedness/relatedness_flagged_samples.tsv"

In [9]:
!gsutil -u $$GOOGLE_PROJECT cat $flagged_samples_path > flagged_samples.tsv

In [10]:
# Import flagged samples into a hail table
flagged_samples = hl.import_table(flagged_samples_path, key='sample_id')

2024-07-01 22:22:03.936 Hail: INFO: Reading table without type imputation
  Loading field 'sample_id' as type str (not specified)


In [11]:
# Drop flagged samples from the main Hail VDS
vds_no_flag = hl.vds.filter_samples(vds, flagged_samples, keep=False)


In [12]:
vds_no_flag.n_samples()

230019
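The `keep=False` filter above behaves like a set difference on sample IDs. A toy sketch in plain Python, using made-up IDs (not real participants) in place of Hail objects, followed by the same bookkeeping on the counts reported in this notebook:

```python
# Hypothetical sample IDs, for illustration only.
all_samples = {"1001", "1002", "1003", "1004", "1005"}
flagged = {"1002", "1005"}

# keep=False: drop every sample that appears in the flagged table.
unflagged = all_samples - flagged
print(sorted(unflagged))

# Counts reported by vds.n_samples() before and after filtering.
n_before, n_after = 245394, 230019
n_dropped = n_before - n_after
print("Samples dropped:", n_dropped)
```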


<div style="background-color: #dcedc8; padding: 15px; font-size: 24px; font-family: 'Arial', sans-serif; border-radius: 8px; color: #558b2f; line-height: 1.6; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1); text-align: center;">
    <strong>Define The Sample Intended for PRS Calculation</strong>
</div>


<div style="background-color: #f3e5f5; padding: 20px; font-size: 18px; font-family: 'Arial', sans-serif; border-radius: 8px; color: #6a1b9a; line-height: 1.6; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1); text-align: center;">
    This is a pre-selected sample of all participants with both WGS and EHR data available.
</div>

In [13]:
import pandas
import os

# This query represents dataset "participants_with_WGS_EHR_phenotypes_020524" for domain "person" and was generated for All of Us Controlled Tier Dataset v7
dataset_16967016_person_sql = """
    SELECT
        person.person_id,
        person.gender_concept_id,
        p_gender_concept.concept_name as gender,
        person.birth_datetime as date_of_birth,
        person.race_concept_id,
        p_race_concept.concept_name as race,
        person.ethnicity_concept_id,
        p_ethnicity_concept.concept_name as ethnicity,
        person.sex_at_birth_concept_id,
        p_sex_at_birth_concept.concept_name as sex_at_birth 
    FROM
        `""" + os.environ["WORKSPACE_CDR"] + """.person` person 
    LEFT JOIN
        `""" + os.environ["WORKSPACE_CDR"] + """.concept` p_gender_concept 
            ON person.gender_concept_id = p_gender_concept.concept_id 
    LEFT JOIN
        `""" + os.environ["WORKSPACE_CDR"] + """.concept` p_race_concept 
            ON person.race_concept_id = p_race_concept.concept_id 
    LEFT JOIN
        `""" + os.environ["WORKSPACE_CDR"] + """.concept` p_ethnicity_concept 
            ON person.ethnicity_concept_id = p_ethnicity_concept.concept_id 
    LEFT JOIN
        `""" + os.environ["WORKSPACE_CDR"] + """.concept` p_sex_at_birth_concept 
            ON person.sex_at_birth_concept_id = p_sex_at_birth_concept.concept_id  
    WHERE
        person.PERSON_ID IN (
            SELECT
                distinct person_id  
            FROM
                `""" + os.environ["WORKSPACE_CDR"] + """.cb_search_person` cb_search_person  
            WHERE
                cb_search_person.person_id IN (
                    SELECT
                        person_id 
                    FROM
                        `""" + os.environ["WORKSPACE_CDR"] + """.cb_search_person` p 
                    WHERE
                        has_ehr_data = 1 
                ) 
                AND cb_search_person.person_id IN (
                    SELECT
                        person_id 
                    FROM
                        `""" + os.environ["WORKSPACE_CDR"] + """.cb_search_person` p 
                    WHERE
                        has_whole_genome_variant = 1 
                ) 
            )"""

dataset_16967016_person_df = pandas.read_gbq(
    dataset_16967016_person_sql,
    dialect="standard",
    use_bqstorage_api=("BIGQUERY_STORAGE_API_ENABLED" in os.environ),
    progress_bar_type="tqdm_notebook")


In [14]:
dataset_16967016_person_df['person_id'].nunique()

206173

In [15]:
unique_ids = dataset_16967016_person_df['person_id'].unique()
allofus_id = pd.DataFrame(unique_ids, columns=['person_id'])

In [16]:
# save to the bucket
allofus_id.to_csv(f'{bucket}/prs_calculator_tutorial/prs_calculator_hail_vds/people_with_WGS_EHR_ids.csv', index=False)

In [17]:
sample_needed_ht = hl.import_table(f'{bucket}/prs_calculator_tutorial/prs_calculator_hail_vds/people_with_WGS_EHR_ids.csv', delimiter=',', key='person_id')

2024-07-01 22:22:17.621 Hail: INFO: Reading table without type imputation
  Loading field 'person_id' as type str (not specified)


In [18]:
# Filter samples
vds_subset = hl.vds.filter_samples(vds_no_flag, sample_needed_ht, keep=True)


In [19]:
vds_subset.n_samples()

193835
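With `keep=True`, the filter acts like an intersection: only IDs present in both the VDS and the supplied table survive. A toy sketch with made-up IDs, followed by the counts from this notebook. The requested IDs that do not survive were either flagged in the previous step or are not in the VDS; the notebook does not break this down further.

```python
# Hypothetical sample IDs, for illustration only.
unflagged_samples = {"1001", "1003", "1004"}
requested = {"1003", "1004", "1006"}  # "1006" is not in the toy VDS

# keep=True: retain only samples that appear in the requested table.
kept = unflagged_samples & requested
print(sorted(kept))

# Counts from this notebook: IDs requested vs. samples retained.
n_requested, n_kept = 206173, 193835
n_not_kept = n_requested - n_kept
print("Requested IDs not retained:", n_not_kept)
```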


<div style="background-color: #dcedc8; padding: 15px; font-size: 24px; font-family: 'Arial', sans-serif; border-radius: 8px; color: #558b2f; line-height: 1.6; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1); text-align: center;">
    <strong>Prepare PRS Weight Table</strong>
</div>


<div style="background-color: #f0f4c3; padding: 15px; font-size: 18px; font-family: 'Arial', sans-serif; border-radius: 8px; color: #33691e; line-height: 1.6; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1); text-align: center;">
    We are using <strong>PGS000746</strong> from <a href="https://www.pgscatalog.org/score/PGS000746/" style="color: #33691e; text-decoration: none; font-size: 18px;"><strong>PGS Catalog</strong></a>
</div>


In [20]:
# Prepare PRS weight table using function 'prepare_prs_table'
AoUPRS.prepare_prs_table('AoUPRS/AoUPRS_hail_vds/vat_check/PGS000746_Gola_D_PRS_1940_Coronary_artery_disease_Circ_Genom_Precis_Med_2020.GRCh37_to_GRCh38.csv',
                 'AoUPRS/AoUPRS_hail_vds/vat_check/PGS000746_weight_table.csv', bucket=bucket)


********************************************************
*                                                      *
*   Winter is coming... and so is your Weight table!   *
*                                                      *
********************************************************

Number of variants in the modified table: 1935
Modified PRS table saved as: gs://fc-secure-e5684327-e720-41ed-979a-b9ae6477b844/AoUPRS/AoUPRS_hail_vds/vat_check/PGS000746_weight_table.csv

********************************************************
*                                                      *
*       Your quest is nearly complete, brave one!      *
*                         BUT                          *
*        The PRS adventure is about to get epic!       *
*                                                      *
********************************************************



In [21]:
with gcsfs.GCSFileSystem().open(f'{bucket}/AoUPRS/AoUPRS_hail_vds/vat_check/PGS000746_weight_table.csv', 'rb') as gcs_file:
    PGS000746_weights_table = pd.read_csv(gcs_file)

In [22]:
PGS000746_weights_table.shape

(1935, 12)

In [23]:
PGS000746_weights_table.head()

   chr       bp                    rs_number  effect_allele  noneffect_allele    weight  additive  recessive  dominant  contig  position    variant_id
0    1  1006159    1:941539:C:T:1_941539_T_C              C                 T  0.045347         1          0         0    chr1   1006159  chr1:1006159
1    1  2232129  1:2163568:C:T:1_2163568_T_C              C                 T  0.033695         1          0         0    chr1   2232129  chr1:2232129
2    1  2293397  1:2224836:G:A:1_2224836_A_G              G                 A  0.035216         1          0         0    chr1   2293397  chr1:2293397
3    1  2301093  1:2232532:G:A:1_2232532_A_G              G                 A  0.053983         1          0         0    chr1   2301093  chr1:2301093
4    1  2308567  1:2240006:T:C:1_2240006_C_T              T                 C   0.05953         1          0         0    chr1   2308567  chr1:2308567
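In the rows shown, `contig` looks like the `chr` value with a `chr` prefix, and `variant_id` looks like `contig:position`. A minimal sketch of that pattern, inferred only from these five rows; it is an assumption about the layout, not a description of what `prepare_prs_table` does internally:

```python
# First row of the weight table above (chromosome and GRCh38 position).
row = {"chr": "1", "bp": 1006159}

# Assumed pattern: prefix the chromosome with "chr", then join with the position.
contig = f"chr{row['chr']}"
variant_id = f"{contig}:{row['bp']}"
print(contig, variant_id)
```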



<div style="background-color: #dcedc8; padding: 15px; font-size: 24px; font-family: 'Arial', sans-serif; border-radius: 8px; color: #558b2f; line-height: 1.6; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1); text-align: center;">
    <strong>PRS Calculator</strong>
</div>


In [24]:
prs_identifier = 'PGS000746'
pgs_weight_path = 'AoUPRS/AoUPRS_hail_vds/vat_check/PGS000746_weight_table.csv'
output_path = 'AoUPRS/AoUPRS_hail_vds/calculated_scores/PGS000746_aou'

In [25]:
%time AoUPRS.calculate_prs_vds(vds_subset, prs_identifier, pgs_weight_path, output_path, bucket=bucket, save_found_variants=True)


##########################################
##                                      ##
##                AoUPRS                ##
##    A PRS Calculator for All of Us    ##
##         Author: Ahmed Khattab        ##
##           Scripps Research           ##
##                                      ##
##########################################

******************************************
*                                        *
*       Ahoy, PRS treasures ahead!       *
*                                        *
******************************************


<<<<<<<<<>>>>>>>>
   PGS000746   
<<<<<<<<<>>>>>>>>

Reading PRS weight table...
Saving intervals...
Intervals saved as: gs://fc-secure-e5684327-e720-41ed-979a-b9ae6477b844/AoUPRS/AoUPRS_hail_vds/calculated_scores/PGS000746_aou/interval/PGS000746_interval.tsv
Importing locus intervals and filtering variants...


2024-07-01 22:22:32.257 Hail: INFO: Reading table without type imputation
  Loading field 'f0' as type str (user-supplied)
  Loading field 'f1' as type int32 (user-supplied)
  Loading field 'f2' as type int32 (user-supplied)


Filtering intervals in VDS...


2024-07-01 22:22:34.794 Hail: INFO: Coerced sorted dataset


Filtered intervals successfully.
Re-importing PRS table...


2024-07-01 22:22:50.018 Hail: INFO: Reading table without type imputation
  Loading field 'chr' as type str (user-supplied)
  Loading field 'bp' as type str (user-supplied)
  Loading field 'rs_number' as type str (user-supplied)
  Loading field 'effect_allele' as type str (user-supplied)
  Loading field 'noneffect_allele' as type str (user-supplied)
  Loading field 'weight' as type float64 (user-supplied)
  Loading field 'additive' as type str (user-supplied)
  Loading field 'recessive' as type str (user-supplied)
  Loading field 'dominant' as type str (user-supplied)
  Loading field 'contig' as type str (user-supplied)
  Loading field 'position' as type int32 (user-supplied)
  Loading field 'variant_id' as type str (user-supplied)


PRS table re-imported successfully.
Annotating the MatrixTable with PRS information...
Calculating effect allele count...
Summing weighted counts per sample...
Writing the PRS scores to a Hail Table at: gs://fc-secure-e5684327-e720-41ed-979a-b9ae6477b844/AoUPRS/AoUPRS_hail_vds/calculated_scores/PGS000746_aou/hail/


2024-07-01 22:22:58.328 Hail: INFO: Coerced sorted dataset
2024-07-01 22:23:07.103 Hail: INFO: wrote table with 1935 rows in 1 partition to gs://fc-secure-e5684327-e720-41ed-979a-b9ae6477b844/hail_temp//__iruid_8976-AeHBnC3IcJ8atzoDm6goE7
2024-07-01 22:24:10.991 Hail: INFO: wrote table with 193835 rows in 200 partitions to gs://fc-secure-e5684327-e720-41ed-979a-b9ae6477b844/AoUPRS/AoUPRS_hail_vds/calculated_scores/PGS000746_aou/hail/


Exporting PRS scores...


2024-07-01 22:24:17.489 Hail: INFO: merging 201 files totalling 4.4M...
2024-07-01 22:24:32.419 Hail: INFO: while writing:
    gs://fc-secure-e5684327-e720-41ed-979a-b9ae6477b844/AoUPRS/AoUPRS_hail_vds/calculated_scores/PGS000746_aou/score/PGS000746_scores.csv
  merge time: 14.930s


PRS scores saved as: gs://fc-secure-e5684327-e720-41ed-979a-b9ae6477b844/AoUPRS/AoUPRS_hail_vds/calculated_scores/PGS000746_aou/score/PGS000746_scores.csv
Extracting and saving found variants...


2024-07-01 22:24:48.201 Hail: INFO: Coerced sorted dataset
2024-07-01 22:24:55.019 Hail: INFO: wrote table with 1935 rows in 1 partition to gs://fc-secure-e5684327-e720-41ed-979a-b9ae6477b844/hail_temp//__iruid_15870-Av2M2VASnRpHuxrgTmbq93

Number of found variants: 1935
Found variants saved as: gs://fc-secure-e5684327-e720-41ed-979a-b9ae6477b844/AoUPRS/AoUPRS_hail_vds/calculated_scores/PGS000746_aou/score/PGS000746_found_in_aou.csv
CPU times: user 3.25 s, sys: 166 ms, total: 3.42 s
Wall time: 2min 35s
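The log above reports two steps, "Calculating effect allele count" and "Summing weighted counts per sample". For an additive score these amount to a weighted sum. The toy sketch below uses three weights from the table shown earlier and made-up allele counts for one hypothetical sample. It is not the AoUPRS implementation, and treating a variant with no recorded call as a count of 0 is an assumption made for illustration.

```python
# Weights for the first three variants in the weight table shown earlier.
weights = {
    "chr1:1006159": 0.045347,
    "chr1:2232129": 0.033695,
    "chr1:2293397": 0.035216,
}

# Hypothetical effect-allele counts (0, 1 or 2) for one made-up sample.
effect_allele_counts = {
    "chr1:1006159": 2,
    "chr1:2232129": 0,
    "chr1:2293397": 1,
}


def polygenic_score(weights, counts):
    """Sum weight * effect-allele count over all variants in the weight table."""
    # Assumption: a variant with no recorded count contributes 0.
    return sum(w * counts.get(variant_id, 0) for variant_id, w in weights.items())


score = polygenic_score(weights, effect_allele_counts)
print(f"Toy PRS: {score:.6f}")
```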


In [26]:
import datetime

# Record the notebook end time
end_time = datetime.datetime.now()

# Print the end date and time separately
print("End date:", end_time.strftime("%Y-%m-%d"))
print("End time:", end_time.strftime("%H:%M:%S"))

End date: 2024-07-01
End time: 22:25:05
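The total runtime can be read off the two printed timestamps. A minimal sketch that parses the start and end values shown in this notebook and subtracts them:

```python
import datetime

# Start and end timestamps as printed at the top and bottom of this notebook.
fmt = "%Y-%m-%d %H:%M:%S"
start = datetime.datetime.strptime("2024-07-01 22:21:16", fmt)
end = datetime.datetime.strptime("2024-07-01 22:25:05", fmt)

# Subtracting two datetimes yields a timedelta.
elapsed = end - start
print("Elapsed:", elapsed)
```

In a live session the same subtraction can be applied directly to the `start_time` and `end_time` variables recorded by the notebook.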
