<div style="background-color: #e0f7fa; padding: 20px; font-size: 24px; font-family: Arial, sans-serif; border-radius: 10px; text-align: center; color: #00796b; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.2);">
    <strong>AoUPRS:</strong> Hail MT
</div>

<div style="background-color: #f9f9f9; padding: 15px; font-size: 18px; font-family: 'Georgia', serif; border-left: 5px solid #4a90e2; color: #333; line-height: 1.5;">
    <strong>Author:</strong> Ahmed Khattab<br>
    <em>Scripps Research</em>
</div>


<div style="background-color: #e3f2fd; padding: 20px; font-size: 16px; font-family: 'Helvetica Neue', sans-serif; border-radius: 8px; color: #0d47a1; line-height: 1.6; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1);">
    <strong>Introduction</strong><br>
    In this notebook, we demonstrate how to use <strong style="color: #00796b;">AoUPRS</strong> to calculate polygenic risk scores (PRS) from a Hail MatrixTable (MT).<br><br>
    <strong>Hail MT</strong><br>
    The MT is a dense Hail format that contains the variants passing the Allele Count/Allele Frequency (ACAF) threshold (population-specific AF > 1% or AC > 100).<br><br>
    <strong>Resources used</strong>
    <ul style="list-style-type: none; padding: 0;">
        <li>Cost when running: <strong>&dollar;72.96 per hour</strong></li>
        <li>Main node: 4 CPUs, 26 GB RAM, 150 GB Disk</li>
        <li>Workers (300/0): 4 CPUs, 15 GB RAM, 150 GB Disk</li>
        <li>Time and Cost: <strong>&dollar;41.3 / 34 min</strong></li>
    </ul>
</div>
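As a quick sanity check, the quoted run cost can be reproduced from the hourly rate and the runtime listed above. The variable names are illustrative only.

```python
# Back-of-the-envelope check of the quoted run cost (figures copied from the list above).
hourly_rate_usd = 72.96   # cluster cost while running, USD per hour
runtime_min = 34          # reported runtime, minutes

estimated_cost_usd = hourly_rate_usd * runtime_min / 60
print(f"Estimated cost: ${estimated_cost_usd:.2f}")
```

The result agrees with the reported figure of about $41.3 for 34 minutes.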


In [1]:
import datetime

# Record the notebook start time
start_time = datetime.datetime.now()

# Format the date and the time separately
formatted_start_date = start_time.strftime("%Y-%m-%d")
formatted_start_time = start_time.strftime("%H:%M:%S")

# Print the formatted start date and time
print("Start date:", formatted_start_date)
print("Start time:", formatted_start_time)

Start date: 2024-07-01
Start time: 23:51:28



<div style="background-color: #dcedc8; padding: 15px; font-size: 24px; font-family: 'Arial', sans-serif; border-radius: 8px; color: #558b2f; line-height: 1.6; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1); text-align: center;">
    <strong>Define Bucket</strong>
</div>


In [2]:
import os
bucket = os.getenv("WORKSPACE_BUCKET")

<div style="background-color: #dcedc8; padding: 15px; font-size: 24px; font-family: 'Arial', sans-serif; border-radius: 8px; color: #558b2f; line-height: 1.6; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1); text-align: center;">
    <strong>Import Hail</strong>
</div>


In [3]:
import AoUPRS
import pandas as pd
import numpy as np
from datetime import datetime
import gcsfs
import multiprocessing
import ast
import concurrent.futures
import glob
import hail as hl

In [4]:
hl.init(tmp_dir=f'{bucket}/hail_temp/', default_reference='GRCh38')


Using hl.init with a default_reference argument is deprecated. To set a default reference genome after initializing hail, call `hl.default_reference` with an argument to set the default reference genome.


Reading spark-defaults.conf to determine GCS requester pays configuration. This is deprecated. Please use `hailctl config set gcs_requester_pays/project` and `hailctl config set gcs_requester_pays/buckets`.

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Running on Apache Spark version 3.3.0
SparkUI available at http://all-of-us-7093-m.c.terra-vpc-sc-e098d676.internal:43057
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.130-bea04d9c79b5
LOGGING: writing to /home/jupyter/workspaces/duplicateoftype2diabetesriskprediction/hail-20240701-2351-0.2.130-bea04d9c79b5.log



<div style="background-color: #dcedc8; padding: 15px; font-size: 24px; font-family: 'Arial', sans-serif; border-radius: 8px; color: #558b2f; line-height: 1.6; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1); text-align: center;">
    <strong>Read Hail MT</strong>
</div>


In [5]:
# Hail MT
mt_wgs_path = os.getenv("WGS_ACAF_THRESHOLD_MULTI_HAIL_PATH")
mt = hl.read_matrix_table(mt_wgs_path)

In [6]:
# To reduce the MT size, keep only the GT entry field (the only field needed for PRS calculation)
mt = mt.select_entries("GT")

In [7]:
mt.count()

(48314438, 245394)


<div style="background-color: #dcedc8; padding: 15px; font-size: 24px; font-family: 'Arial', sans-serif; border-radius: 8px; color: #558b2f; line-height: 1.6; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1); text-align: center;">
    <strong>Drop Flagged srWGS samples</strong>
</div>


<div style="background-color: #e8eaf6; padding: 20px; font-size: 18px; font-family: 'Arial', sans-serif; border-radius: 8px; color: #303f9f; line-height: 1.6; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1);">
    AoU provides a table listing samples that are flagged as part of the sample outlier QC for the srWGS SNP and Indel joint callset. <a href="https://support.researchallofus.org/hc/en-us/articles/4614687617556-How-the-All-of-Us-Genomic-data-are-organized#h_01GY7QZR2QYFDKGK89TCHSJSA7" style="color: #303f9f; text-decoration: none; font-size: 14px;"><strong>Read more</strong></a>
</div>


In [8]:
# Path to the AoU flagged-samples table (v7 srWGS, relatedness-flagged samples)
flagged_samples_path = "gs://fc-aou-datasets-controlled/v7/wgs/short_read/snpindel/aux/relatedness/relatedness_flagged_samples.tsv"

In [9]:
!gsutil -u $$GOOGLE_PROJECT cat $flagged_samples_path > flagged_samples.cvs

In [10]:
# Import flagged samples into a hail table
flagged_samples = hl.import_table(flagged_samples_path, key='sample_id')

2024-07-01 23:52:20.181 Hail: INFO: Reading table without type imputation
  Loading field 'sample_id' as type str (not specified)


In [11]:
# Drop flagged samples from the main Hail MT (keep only columns whose key is absent from the flagged table)
mt = mt.anti_join_cols(flagged_samples)

In [12]:
mt.count()

(48314438, 230019)
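The anti-join above keeps only the columns whose key does not appear in the flagged table. A minimal pure-Python sketch of the same idea, using made-up sample IDs rather than real participants, followed by the bookkeeping for the two `mt.count()` results in this notebook:

```python
# Toy illustration of an anti-join on sample IDs (hypothetical IDs, not real participants).
all_samples = ["1001", "1002", "1003", "1004", "1005"]
flagged = {"1002", "1005"}

# Keep every sample whose ID is absent from the flagged set, preserving order.
kept = [s for s in all_samples if s not in flagged]
print(kept)

# The same bookkeeping applied to the column counts reported by mt.count() above.
cols_before, cols_after = 245_394, 230_019
print("Samples removed:", cols_before - cols_after)
```

The variant (row) count is unchanged because the anti-join only acts on columns.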


<div style="background-color: #dcedc8; padding: 15px; font-size: 24px; font-family: 'Arial', sans-serif; border-radius: 8px; color: #558b2f; line-height: 1.6; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1); text-align: center;">
    <strong>Define The Sample Intended for PRS Calculation</strong>
</div>


<div style="background-color: #f3e5f5; padding: 20px; font-size: 18px; font-family: 'Arial', sans-serif; border-radius: 8px; color: #6a1b9a; line-height: 1.6; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1); text-align: center;">
    This is a pre-selected cohort of all participants with both WGS and EHR data available.
</div>

In [13]:
import pandas
import os

# This query represents dataset "participants_with_WGS_EHR_phenotypes_020524" for domain "person" and was generated for All of Us Controlled Tier Dataset v7
dataset_16967016_person_sql = """
    SELECT
        person.person_id,
        person.gender_concept_id,
        p_gender_concept.concept_name as gender,
        person.birth_datetime as date_of_birth,
        person.race_concept_id,
        p_race_concept.concept_name as race,
        person.ethnicity_concept_id,
        p_ethnicity_concept.concept_name as ethnicity,
        person.sex_at_birth_concept_id,
        p_sex_at_birth_concept.concept_name as sex_at_birth 
    FROM
        `""" + os.environ["WORKSPACE_CDR"] + """.person` person 
    LEFT JOIN
        `""" + os.environ["WORKSPACE_CDR"] + """.concept` p_gender_concept 
            ON person.gender_concept_id = p_gender_concept.concept_id 
    LEFT JOIN
        `""" + os.environ["WORKSPACE_CDR"] + """.concept` p_race_concept 
            ON person.race_concept_id = p_race_concept.concept_id 
    LEFT JOIN
        `""" + os.environ["WORKSPACE_CDR"] + """.concept` p_ethnicity_concept 
            ON person.ethnicity_concept_id = p_ethnicity_concept.concept_id 
    LEFT JOIN
        `""" + os.environ["WORKSPACE_CDR"] + """.concept` p_sex_at_birth_concept 
            ON person.sex_at_birth_concept_id = p_sex_at_birth_concept.concept_id  
    WHERE
        person.PERSON_ID IN (
            SELECT
                distinct person_id  
            FROM
                `""" + os.environ["WORKSPACE_CDR"] + """.cb_search_person` cb_search_person  
            WHERE
                cb_search_person.person_id IN (
                    SELECT
                        person_id 
                    FROM
                        `""" + os.environ["WORKSPACE_CDR"] + """.cb_search_person` p 
                    WHERE
                        has_ehr_data = 1 
                ) 
                AND cb_search_person.person_id IN (
                    SELECT
                        person_id 
                    FROM
                        `""" + os.environ["WORKSPACE_CDR"] + """.cb_search_person` p 
                    WHERE
                        has_whole_genome_variant = 1 
                ) 
            )"""

dataset_16967016_person_df = pandas.read_gbq(
    dataset_16967016_person_sql,
    dialect="standard",
    use_bqstorage_api=("BIGQUERY_STORAGE_API_ENABLED" in os.environ),
    progress_bar_type="tqdm_notebook")


In [14]:
dataset_16967016_person_df['person_id'].nunique()

206173

In [15]:
# Load the flagged sample IDs downloaded earlier and align the column name with the cohort table
flag_s = pd.read_csv('flagged_samples.cvs')
flag_s = flag_s.rename(columns={'sample_id': 'person_id'})

# Left-merge on person_id; the indicator column records which table each row came from
aou_ids = pd.merge(dataset_16967016_person_df[['person_id']], flag_s, on='person_id', how='left', indicator=True)

# Keep only rows present solely in the cohort table (i.e., not flagged), then drop the indicator column
aou_ids = aou_ids[aou_ids['_merge'] == 'left_only'].drop(columns=['_merge'])

In [16]:
aou_ids.shape

(193835, 1)

In [17]:
# Convert the retained person IDs to a set of strings so they can be matched against the MT column key `s`
subset_sample_ids_set = set(map(str, aou_ids['person_id'].tolist()))

In [18]:
mt = mt.filter_cols(hl.literal(subset_sample_ids_set).contains(mt.s))

In [19]:
mt.count()

(48314438, 193835)


<div style="background-color: #dcedc8; padding: 15px; font-size: 24px; font-family: 'Arial', sans-serif; border-radius: 8px; color: #558b2f; line-height: 1.6; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1); text-align: center;">
    <strong>Prepare PRS Weight Table</strong>
</div>


<div style="background-color: #f0f4c3; padding: 15px; font-size: 18px; font-family: 'Arial', sans-serif; border-radius: 8px; color: #33691e; line-height: 1.6; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1); text-align: center;">
    We are using <strong>PGS000746</strong> from <a href="https://www.pgscatalog.org/score/PGS000746/" style="color: #33691e; text-decoration: none; font-size: 18px;"><strong>PGS Catalog</strong></a>
</div>


In [20]:
# Prepare PRS weight table using function 'prepare_prs_table'
AoUPRS.prepare_prs_table('AoUPRS/AoUPRS_hail_mt/PGS000746_Gola_D_PRS_1940_Coronary_artery_disease_Circ_Genom_Precis_Med_2020.GRCh37_to_GRCh38.csv',
                 'AoUPRS/AoUPRS_hail_mt/PGS000746_weight_table.csv', bucket=bucket)


********************************************************
*                                                      *
*   Winter is coming... and so is your Weight table!   *
*                                                      *
********************************************************

Number of variants in the modified table: 1938
Modified PRS table saved as: gs://fc-secure-e5684327-e720-41ed-979a-b9ae6477b844/AoUPRS/AoUPRS_hail_mt/PGS000746_weight_table.csv

********************************************************
*                                                      *
*       Your quest is nearly complete, brave one!      *
*                         BUT                          *
*        The PRS adventure is about to get epic!       *
*                                                      *
********************************************************



In [21]:
# Read the prepared weight table back from the workspace bucket
with gcsfs.GCSFileSystem().open(f'{bucket}/AoUPRS/AoUPRS_hail_mt/PGS000746_weight_table.csv', 'rb') as gcs_file:
    PGS000746_weights_table = pd.read_csv(gcs_file)

In [22]:
PGS000746_weights_table.shape

(1938, 12)

In [23]:
PGS000746_weights_table.head()

Unnamed: 0,chr,bp,rs_number,effect_allele,noneffect_allele,weight,additive,recessive,dominant,contig,position,variant_id
0,1,1006159,1:941539:C:T:1_941539_T_C,C,T,0.045347,1,0,0,chr1,1006159,chr1:1006159
1,1,2232129,1:2163568:C:T:1_2163568_T_C,C,T,0.033695,1,0,0,chr1,2232129,chr1:2232129
2,1,2293397,1:2224836:G:A:1_2224836_A_G,G,A,0.035216,1,0,0,chr1,2293397,chr1:2293397
3,1,2301093,1:2232532:G:A:1_2232532_A_G,G,A,0.053983,1,0,0,chr1,2301093,chr1:2301093
4,1,2308567,1:2240006:T:C:1_2240006_C_T,T,C,0.05953,1,0,0,chr1,2308567,chr1:2308567
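Judging only from the rows displayed above, `contig`, `position` and `variant_id` appear to be derived from `chr` and `bp`. The sketch below reproduces that relationship on the first two rows. The helper `add_locus_columns` is hypothetical and is an inference from the output, not the AoUPRS implementation of `prepare_prs_table`.

```python
import csv
import io

# First two rows of the weight table shown above (subset of columns).
sample_csv = """chr,bp,effect_allele,noneffect_allele,weight
1,1006159,C,T,0.045347
1,2232129,C,T,0.033695
"""

def add_locus_columns(row):
    """Hypothetical helper: derive contig, position and variant_id from chr and bp."""
    contig = f"chr{row['chr']}"
    position = int(row["bp"])
    return {**row, "contig": contig, "position": position,
            "variant_id": f"{contig}:{position}"}

rows = [add_locus_columns(r) for r in csv.DictReader(io.StringIO(sample_csv))]
for r in rows:
    print(r["variant_id"], r["effect_allele"], float(r["weight"]))
```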



<div style="background-color: #dcedc8; padding: 15px; font-size: 24px; font-family: 'Arial', sans-serif; border-radius: 8px; color: #558b2f; line-height: 1.6; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1); text-align: center;">
    <strong>PRS Calculator</strong>
</div>


In [24]:
prs_identifier = 'PGS000746'
pgs_weight_path = 'AoUPRS/AoUPRS_hail_mt/PGS000746_weight_table.csv'
output_path = 'AoUPRS/AoUPRS_hail_mt/calculated_scores/PGS000746_aou'

In [25]:
%time AoUPRS.calculate_prs_mt(mt, prs_identifier, pgs_weight_path, output_path, bucket=bucket, save_found_variants=True)


##########################################
##                                      ##
##                AoUPRS                ##
##    A PRS Calculator for All of Us    ##
##         Author: Ahmed Khattab        ##
##           Scripps Research           ##
##                                      ##
##########################################

******************************************
*                                        *
*       Ahoy, PRS treasures ahead!       *
*                                        *
******************************************


<<<<<<<<<>>>>>>>>
    PGS000746   
<<<<<<<<<>>>>>>>>

Reading PRS weight table...
Re-importing PRS table with determined column types...


2024-07-01 23:52:44.743 Hail: INFO: Reading table without type imputation
  Loading field 'chr' as type str (user-supplied)
  Loading field 'bp' as type str (user-supplied)
  Loading field 'rs_number' as type str (user-supplied)
  Loading field 'effect_allele' as type str (user-supplied)
  Loading field 'noneffect_allele' as type str (user-supplied)
  Loading field 'weight' as type float64 (user-supplied)
  Loading field 'additive' as type str (user-supplied)
  Loading field 'recessive' as type str (user-supplied)
  Loading field 'dominant' as type str (user-supplied)
  Loading field 'contig' as type str (user-supplied)
  Loading field 'position' as type int32 (user-supplied)
  Loading field 'variant_id' as type str (user-supplied)


Filtering variants...
Annotating the MatrixTable with the PRS information...
Calculating effect allele count and multiplying by variant weight...
Summing the weighted counts per sample and counting the number of variants with weights per sample...
Extracting and saving found variants...


2024-07-01 23:52:54.102 Hail: INFO: Coerced sorted dataset
2024-07-01 23:52:55.484 Hail: INFO: Coerced sorted dataset
2024-07-01 23:53:04.890 Hail: INFO: wrote table with 1938 rows in 1 partition to gs://fc-secure-e5684327-e720-41ed-979a-b9ae6477b844/hail_temp//__iruid_5485-mg1ZlrCvbr1G7L7UyEPS9y
2024-07-01 23:53:11.492 Hail: INFO: wrote table with 1938 rows in 1 partition to gs://fc-secure-e5684327-e720-41ed-979a-b9ae6477b844/hail_temp//__iruid_5901-FhITOlCdPYhw4pdQDNHIZc

Number of found variants: 1935
Found variants saved as: gs://fc-secure-e5684327-e720-41ed-979a-b9ae6477b844/AoUPRS/AoUPRS_hail_mt/calculated_scores/PGS000746_aou/score/PGS000746_found_in_aou.csv
Writing the PRS scores to a Hail Table at: gs://fc-secure-e5684327-e720-41ed-979a-b9ae6477b844/AoUPRS/AoUPRS_hail_mt/calculated_scores/PGS000746_aou/hail/


2024-07-02 00:04:11.142 Hail: INFO: Coerced sorted dataset
2024-07-02 00:04:12.205 Hail: INFO: Coerced sorted dataset
2024-07-02 00:04:21.274 Hail: INFO: wrote table with 1938 rows in 1 partition to gs://fc-secure-e5684327-e720-41ed-979a-b9ae6477b844/hail_temp//__iruid_11738-pGNT9nFFHY2EzZopKJMt0s
2024-07-02 00:04:27.664 Hail: INFO: wrote table with 1938 rows in 1 partition to gs://fc-secure-e5684327-e720-41ed-979a-b9ae6477b844/hail_temp//__iruid_12154-hMBscSbnGknioUGJQomHQD
2024-07-02 00:22:34.485 Hail: INFO: wrote table with 193835 rows in 1192 partitions to gs://fc-secure-e5684327-e720-41ed-979a-b9ae6477b844/AoUPRS/AoUPRS_hail_mt/calculated_scores/PGS000746_aou/hail/


Exporting the Hail Table to a CSV file...


2024-07-02 00:23:03.462 Hail: INFO: merging 1193 files totalling 4.4M...
2024-07-02 00:24:45.624 Hail: INFO: while writing:
    gs://fc-secure-e5684327-e720-41ed-979a-b9ae6477b844/AoUPRS/AoUPRS_hail_mt/calculated_scores/PGS000746_aou/score/PGS000746_scores.csv
  merge time: 1m42.2s


PRS scores saved as: gs://fc-secure-e5684327-e720-41ed-979a-b9ae6477b844/AoUPRS/AoUPRS_hail_mt/calculated_scores/PGS000746_aou/score/PGS000746_scores.csv
CPU times: user 9.06 s, sys: 2.06 s, total: 11.1 s
Wall time: 33min 11s
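The log above describes the calculation as counting effect alleles, multiplying by the variant weight, and summing per sample. The following is a minimal pure-Python sketch of that additive score, not the AoUPRS/Hail implementation. The weights are taken from the first two rows of the weight table, while the sample names, genotypes and the choice to skip missing calls are assumptions made for illustration.

```python
# Minimal, pure-Python sketch of an additive PRS (not the AoUPRS/Hail implementation).
# variant_id -> (effect allele, weight), copied from the first two weight-table rows.
weights = {
    "chr1:1006159": ("C", 0.045347),
    "chr1:2232129": ("C", 0.033695),
}

# Hypothetical genotypes: sample -> variant_id -> pair of called alleles (None = missing call).
genotypes = {
    "sample_A": {"chr1:1006159": ("C", "T"), "chr1:2232129": ("C", "C")},
    "sample_B": {"chr1:1006159": ("T", "T"), "chr1:2232129": None},
}

def additive_prs(sample_calls):
    """Return (sum of weight * effect-allele count, number of variants scored)."""
    score, n_scored = 0.0, 0
    for variant_id, (effect_allele, weight) in weights.items():
        call = sample_calls.get(variant_id)
        if call is None:  # assumption: variants without a genotype call are skipped
            continue
        score += weight * call.count(effect_allele)
        n_scored += 1
    return score, n_scored

for sample, calls in genotypes.items():
    score, n_scored = additive_prs(calls)
    print(sample, round(score, 6), n_scored)
```

In the run above, 1935 of the 1938 weighted variants were found in the MT, so each score can draw on at most 1935 variants.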


In [26]:
# Re-import the module: `from datetime import datetime` in an earlier cell shadowed the name
import datetime

# Record the notebook end time
end_time = datetime.datetime.now()

# Format the date and the time separately
formatted_end_date = end_time.strftime("%Y-%m-%d")
formatted_end_time = end_time.strftime("%H:%M:%S")

# Print the formatted end date and time
print("End date:", formatted_end_date)
print("End time:", formatted_end_time)

End date: 2024-07-02
End time: 00:25:54
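The total runtime follows from the printed start and end timestamps. A short sketch using the values from this run (the variable names are illustrative):

```python
from datetime import datetime

# Timestamps copied from the printed start and end cells of this run.
start = datetime.strptime("2024-07-01 23:51:28", "%Y-%m-%d %H:%M:%S")
end = datetime.strptime("2024-07-02 00:25:54", "%Y-%m-%d %H:%M:%S")

# Subtracting two datetimes yields a timedelta, which handles the midnight rollover.
elapsed = end - start
minutes, seconds = divmod(int(elapsed.total_seconds()), 60)
print(f"Elapsed: {minutes} min {seconds} s")
```

This matches the roughly 34 minutes quoted in the resources summary, most of which is the 33 min 11 s wall time of the PRS calculation cell.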

