This file contains a pipeline for variant selection that is executed on the RAP and is aimed at reducing the number of variants that have to be downloaded from the RAP. We select the variants that are located in our 1929 recessive genes and are covered at least 15x in at least 90% of the samples.



This notebook should be placed on the UKBB Research Analysis Platform (RAP).

The cell that defines the `chromosome` variable should be tagged `parameters`, as explained [here](https://papermill.readthedocs.io/en/latest/usage-parameterize.html#designate-parameters-for-a-cell).
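For illustration, the tag is stored in the cell's metadata inside the `.ipynb` JSON. The sketch below assumes the standard nbformat v4 layout and uses a hypothetical helper, `tag_parameters_cell`, to add the `parameters` tag to the first code cell that assigns `chromosome`. In practice the tag is usually added through the Jupyter interface; this only shows what ends up in the file.

```python
import json

def tag_parameters_cell(notebook: dict, variable: str = "chromosome") -> bool:
    """Add the 'parameters' tag to the first code cell assigning `variable`.

    Returns True if such a cell was found (and tagged), False otherwise.
    """
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", "")
        # nbformat stores the source either as one string or as a list of lines
        if isinstance(source, list):
            source = "".join(source)
        if any(line.strip().startswith(f"{variable} =") for line in source.splitlines()):
            tags = cell.setdefault("metadata", {}).setdefault("tags", [])
            if "parameters" not in tags:
                tags.append("parameters")
            return True
    return False

# Toy notebook with a single code cell, mirroring the parameter cell of this notebook
toy_notebook = {
    "cells": [
        {"cell_type": "code", "metadata": {}, "source": ["# parameters\n", "chromosome = 21\n"]},
    ],
    "nbformat": 4,
    "nbformat_minor": 5,
}

found = tag_parameters_cell(toy_notebook)
print(found, json.dumps(toy_notebook["cells"][0]["metadata"]))
```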

Prior to running the notebook, the following data should be uploaded to the RAP:

 - The GRCh38 coordinates of the targeted regions `xgen_plus_spikein.GRCh38.bed` from https://biobank.ndph.ox.ac.uk/ukb/refer.cgi?id=3803 .

 - The GRCh38 coordinates of the 1929 recessive genes `transcripts_exons_hg38_merged_10bp.bed`

 - The list of related samples that need to be removed, `related_samples_to_remove_final.txt` (generated in the previous step).

 - In this section we also download `pvcf_blocks.txt`, because the data for each chromosome is split across several files, identified by `block_id`.

In [1]:
import datetime
import pyspark
import pandas as pd
sc = pyspark.SparkContext()
spark = pyspark.sql.SparkSession(sc)

In [2]:
import hail as hl
hl.init(sc=sc)

pip-installed Hail requires additional configuration options in Spark referring
  to the path to the Hail Python module directory HAIL_DIR,
  e.g. /path/to/python/site-packages/hail:
    spark.jars=HAIL_DIR/hail-all-spark.jar
    spark.driver.extraClassPath=HAIL_DIR/hail-all-spark.jar
    spark.executor.extraClassPath=./hail-all-spark.jar
Running on Apache Spark version 2.4.4
SparkUI available at http://ip-10-60-94-211.eu-west-2.compute.internal:8081
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.78-b17627756568
LOGGING: writing to /opt/notebooks/hail-20221118-1438-0.2.78-b17627756568.log


In [3]:
# this cell is tagged parameters (will act as a command line argument)
chromosome = 21

In [4]:
# load bed for 1929 genes
bed_path = 'file:///mnt/project/Uploaded_data/transcripts_exons_hg38_merged_10bp.bed'

recode = {f"{i}":f"chr{i}" for i in (list(range(1, 23)) + ['X', 'Y'])}

bed = hl.import_bed(bed_path, reference_genome='GRCh38', contig_recoding=recode)

# load bed for target sequencing region
target_bed_path = 'file:///mnt/project/Uploaded_data/xgen_plus_spikein.GRCh38.bed'

bed_target = hl.import_bed(target_bed_path, reference_genome='GRCh38')

2022-11-18 14:39:03 Hail: INFO: Reading table without type imputation
  Loading field 'f0' as type str (user-supplied)
  Loading field 'f1' as type int32 (user-supplied)
  Loading field 'f2' as type int32 (user-supplied)
2022-11-18 14:39:04 Hail: INFO: Reading table without type imputation
  Loading field 'f0' as type str (user-supplied)
  Loading field 'f1' as type int32 (user-supplied)
  Loading field 'f2' as type int32 (user-supplied)


In [5]:
# read pvcf blocks file
pvcf_blocks = pd.read_csv("https://biobank.ctsu.ox.ac.uk/crystal/ukb/auxdata/pvcf_blocks.txt", 
                          sep='\t', header=None)
pvcf_blocks.columns = ['row_id', 'chrom', 'block_id', 'start', 'end']

chromosome_pvcf_blocks = pvcf_blocks[pvcf_blocks['chrom'].astype(int) == int(chromosome)]['block_id'].tolist()

print (len(chromosome_pvcf_blocks))

11
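To make the block lookup concrete, here is a small sketch that uses only the standard library and made-up rows in the same five-column layout (`row_id`, `chrom`, `block_id`, `start`, `end`). The coordinates are invented for illustration and are not taken from the real `pvcf_blocks.txt`; the helper name `blocks_for_chromosome` is hypothetical.

```python
import csv
import io

# Invented rows in the pvcf_blocks.txt layout: row_id, chrom, block_id, start, end
TOY_BLOCKS = "1\t20\t0\t1\t1000\n2\t21\t0\t1\t500\n3\t21\t1\t501\t900\n4\t22\t0\t1\t700\n"

def blocks_for_chromosome(text: str, chromosome: int) -> list:
    """Return the block_ids listed for `chromosome`, in file order."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    return [
        int(block_id)
        for _, chrom, block_id, _, _ in reader
        if int(chrom) == int(chromosome)
    ]

print(blocks_for_chromosome(TOY_BLOCKS, 21))
```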


# Load and filter data by bed

In this section we:

1. Load all pVCF files associated with a chromosome.

2. Keep only the variants that lie in the target regions and in the 1,929 recessive genes of interest, using the BED-file filter described [here](https://hail.is/docs/0.2/guides/genetics.html#from-a-ucsc-bed-file).
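As a plain-Python illustration of what the two BED filters do (this is not how Hail implements them), a variant is kept only if its position falls inside an interval of both files. BED intervals are 0-based and half-open, while VCF positions are 1-based, so a position `pos` lies in a BED record `(start, end)` when `start < pos <= end`. The interval values and helper names below are invented.

```python
def in_bed(intervals, contig, pos):
    """True if 1-based `pos` on `contig` lies in any 0-based, half-open BED interval."""
    return any(c == contig and start < pos <= end for c, start, end in intervals)

def keep_variant(genes_bed, target_bed, contig, pos):
    """Keep a variant only if it is inside both interval sets."""
    return in_bed(genes_bed, contig, pos) and in_bed(target_bed, contig, pos)

# Invented intervals: (contig, start, end) as they would appear in a BED file
genes_bed = [("chr21", 100, 200)]
target_bed = [("chr21", 150, 300)]

print(keep_variant(genes_bed, target_bed, "chr21", 151))  # inside both files
print(keep_variant(genes_bed, target_bed, "chr21", 150))  # just before the target interval
print(keep_variant(genes_bed, target_bed, "chr21", 250))  # outside the gene interval
```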

In [6]:
import sys
def load_filter_gvcf(gvcf_path):
    """
        Reads pVCF file defined in `gvcf_path` 
        and leaves only target regions from `bed_target` and 
        1929 recessive genes from `bed`.
    """
    print (f"load {gvcf_path}")
    sys.stdout.flush()
    
    gvcf = hl.methods.import_vcf(gvcf_path, force_bgz=True, reference_genome='GRCh38', block_size=32)
    gvcf = gvcf.filter_rows(hl.is_defined(bed[gvcf.locus]))
    gvcf = gvcf.filter_rows(hl.is_defined(bed_target[gvcf.locus]))
    
    return gvcf

def load_concatenate_gvcfs(gvcf_paths):
    """
        Reads all pVCF files listed in `gvcf_paths` and returns them as a list
        of filtered MatrixTables, one per block (the blocks are not merged).
    """
    gvcfs = [load_filter_gvcf(gvcf_path) for gvcf_path in gvcf_paths]
    
    return gvcfs

In [None]:
# generate all gvcf paths for a given chromosome
gvcf_paths = [
    f'file:///mnt/project/Bulk/Exome sequences/'
    f'Population level exome OQFE variants, pVCF format - final release/ukbXXXXX_c{chromosome}_b{idx}_v1.vcf.gz' for idx in chromosome_pvcf_blocks
]

# load all gvcfs for a given chromosome
vcf = load_concatenate_gvcfs(gvcf_paths)

# Create & load a list of unrelated samples 

This code should be executed once, when the list of related samples becomes available. On all other runs `CREATE_SAMPLES_LIST` should be `False`.

It creates `unrelated_samples.txt`, which contains all samples except the withdrawn ones and those listed in `related_samples_to_remove_final.txt`. This file is also used by the LoF variant collection pipeline.
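The filtering logic can be sketched on toy IDs (all invented). Like the notebook code, it treats IDs that start with `W` as withdrawn. The hypothetical helper `select_unrelated` keeps the related IDs in a `set`, so each membership test takes constant time.

```python
def select_unrelated(all_samples, related_samples):
    """Drop withdrawn samples (IDs starting with 'W') and related samples, keeping order."""
    related = set(related_samples)
    return [s for s in all_samples if not s.startswith("W") and s not in related]

# Invented sample IDs for illustration only
all_samples = ["1000001", "W000002", "1000003", "1000004"]
related_samples = ["1000003"]

print(select_unrelated(all_samples, related_samples))
```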

In [8]:
CREATE_SAMPLES_LIST = False

In [9]:
if CREATE_SAMPLES_LIST:
    # load samples that are related
    with open('/mnt/project/Uploaded_data/related_samples_to_remove_final.txt', 'r') as f:
        related_samples_list = f.readlines()

    related_samples_list = [sample_id.strip() for sample_id in related_samples_list]

    print ("Related number of samples:", len(related_samples_list))
    
    # get the list of all samples
    samples = vcf[0].s.collect()

    print ("Original number of samples:", len(samples))

    # filter out withdrawn samples (their IDs start with 'W')
    samples = [sample for sample in samples if not sample.startswith('W')]

    print ("Filtered withdrawn number of samples:", len(samples))

    # filter out related samples (a set makes each membership test O(1))
    related_samples_set = set(related_samples_list)
    samples = [sample for sample in samples if sample not in related_samples_set]

    print ("Filtered related number of samples:", len(samples))

    # save
    with open('unrelated_samples.txt', 'w') as f:
        f.writelines([sample + '\n' for sample in samples])

On the first run, you should now download `unrelated_samples.txt` and manually upload it to the `Uploaded_data` folder.

On later runs this is not needed, and you can proceed directly to the next cell.

In [10]:
# load the samples to be used in the analysis (related samples are already excluded)

with open('/mnt/project/Uploaded_data/unrelated_samples.txt', 'r') as f:
    unrelated_samples = f.readlines()
    
unrelated_samples = [sample_id.strip() for sample_id in unrelated_samples]

print (len(unrelated_samples))

unrelated_samples[:3]

466322


['1322654', '2611975', '2738629']

# Process

Below is the code to select all variants of interest. 

## Filter related & withdrawn samples

First, we keep only the unrelated samples that were not withdrawn.

In [11]:
# keep only samples that are unrelated and not withdrawn
# (a set literal gives constant-time membership tests)
unrelated_samples_set = hl.literal(set(unrelated_samples))

for idx in range(len(vcf)):
    vcf[idx] = vcf[idx].filter_cols(unrelated_samples_set.contains(vcf[idx].s))

## Filter by coverage

We remove all sites that did not pass our quality control, i.e. sites that are not covered at least 15x in >=90% of the cohort.
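As a worked example of this criterion: with the 466322 unrelated samples loaded above, a site passes only if at least `ceil(0.9 * 466322) = 419690` samples have `DP >= 15`. The sketch below applies the same rule to invented depths in plain Python. Counting a missing depth (`None`) as insufficient coverage is an assumption of this sketch, and the helper name `passes_coverage` is hypothetical.

```python
import math

def passes_coverage(depths, min_depth=15, min_fraction=0.9):
    """True if at least `min_fraction` of samples have depth >= `min_depth`."""
    covered = sum(1 for dp in depths if dp is not None and dp >= min_depth)
    return covered >= len(depths) * min_fraction

# Smallest number of covered samples that satisfies the rule for this cohort
print(math.ceil(0.9 * 466322))

# Invented depths for 10 samples: 9 covered samples pass, 8 do not
print(passes_coverage([20, 30, 15, 18, 40, 25, 16, 22, 19, 3]))
print(passes_coverage([20, 30, 15, 18, 40, 25, 16, 22, None, 3]))
```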

In [12]:
# count how many samples have sufficient coverage (DP >= 15) at each site
vcf_annotated = []

for idx in range(len(vcf)):
    vcf_annotated.append(
        vcf[idx].annotate_rows(variant_dp = hl.agg.sum(vcf[idx].DP >= 15))
    )

In [13]:
vcf_filtered = []

for idx in range(len(vcf_annotated)):
    # keep only sites with sufficient coverage in at least 90% of the samples
    vcf_filtered_item = vcf_annotated[idx].filter_rows(vcf_annotated[idx].variant_dp >= len(unrelated_samples)*0.9)

    # drop unused info (keep only the genotype)
    vcf_filtered_item = vcf_filtered_item.select_globals().select_rows().select_entries('GT')

    vcf_filtered.append(vcf_filtered_item)

## Leave only het_ref and hom_var 

We remove all homozygous-reference genotypes, so that only genotypes carrying at least one non-reference allele remain.
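In plain-Python terms, the filter keeps heterozygous and homozygous-variant calls and drops `0/0`. The sketch below uses invented genotypes, where a diploid call is a pair of allele indices and `0` is the reference allele:

```python
def is_hom_ref(genotype):
    """True if every allele index in the call is 0 (the reference allele)."""
    return all(allele == 0 for allele in genotype)

# Invented genotype calls keyed by sample ID
calls = {"s1": (0, 0), "s2": (0, 1), "s3": (1, 1), "s4": (1, 2)}

non_ref_calls = {sample: gt for sample, gt in calls.items() if not is_hom_ref(gt)}
print(non_ref_calls)
```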

In [14]:
for idx in range(len(vcf_filtered)):
    
    # keep only genotypes (entries) that carry a non-ref allele
    vcf_filtered[idx] = vcf_filtered[idx].filter_entries(~vcf_filtered[idx].GT.is_hom_ref())

## Initiate calculations and save data

Finally, we convert the variant and genotype information of each block into a pandas table and save it as a gzipped CSV file for later download.

In [15]:
import datetime
import gc

for idx in range(len(vcf_filtered)):
    print (f'Processing: {idx+1} of {len(vcf_filtered)}')
    sys.stdout.flush()
    
    start = datetime.datetime.now()
    
    # flatten table 
    gt_entries = vcf_filtered[idx].entries()
    
    # convert to pandas
    df = gt_entries.to_pandas()
    df.to_csv(f'output.chr{chromosome}.part{idx}.csv.gz', compression='gzip')
    
    del df
    gc.collect()
    
    # calculate duration
    delta = datetime.datetime.now() - start
    print (f'Elapsed time: {delta.total_seconds()}s')
    print ()

Processing: 1 of 11


2022-11-18 14:45:24 Hail: WARN: entries(): Resulting entries table is sorted by '(row_key, col_key)'.
    To preserve row-major matrix table order, first unkey columns with 'key_cols_by()'
2022-11-18 14:47:51 Hail: INFO: Coerced sorted dataset
2022-11-18 14:47:55 Hail: INFO: Coerced sorted dataset
2022-11-18 14:48:04 Hail: INFO: Coerced sorted dataset


Elapsed time: 486.386498s

Processing: 2 of 11


2022-11-18 14:56:12 Hail: INFO: Coerced sorted dataset
2022-11-18 14:56:15 Hail: INFO: Coerced sorted dataset
2022-11-18 14:56:22 Hail: INFO: Coerced sorted dataset


Elapsed time: 482.970929s

Processing: 3 of 11


2022-11-18 15:04:16 Hail: INFO: Coerced sorted dataset
2022-11-18 15:04:16 Hail: INFO: Coerced sorted dataset
2022-11-18 15:04:18 Hail: INFO: Coerced sorted dataset


Elapsed time: 536.350031s

Processing: 4 of 11


2022-11-18 15:13:45 Hail: INFO: Coerced sorted dataset
2022-11-18 15:13:46 Hail: INFO: Coerced sorted dataset
2022-11-18 15:13:53 Hail: INFO: Coerced sorted dataset


Elapsed time: 604.795341s

Processing: 5 of 11


2022-11-18 15:24:02 Hail: INFO: Coerced sorted dataset
2022-11-18 15:24:03 Hail: INFO: Coerced sorted dataset
2022-11-18 15:24:04 Hail: INFO: Coerced sorted dataset


Elapsed time: 548.042084s

Processing: 6 of 11


2022-11-18 15:32:36 Hail: INFO: Coerced sorted dataset
2022-11-18 15:32:37 Hail: INFO: Coerced sorted dataset
2022-11-18 15:32:45 Hail: INFO: Coerced sorted dataset


Elapsed time: 511.980043s

Processing: 7 of 11


2022-11-18 15:41:11 Hail: INFO: Coerced sorted dataset
2022-11-18 15:41:11 Hail: INFO: Coerced sorted dataset
2022-11-18 15:41:13 Hail: INFO: Coerced sorted dataset


Elapsed time: 530.365661s

Processing: 8 of 11


2022-11-18 15:49:44 Hail: INFO: Coerced sorted dataset
2022-11-18 15:49:44 Hail: INFO: Coerced sorted dataset
2022-11-18 15:49:46 Hail: INFO: Coerced sorted dataset


Elapsed time: 452.416757s

Processing: 9 of 11


2022-11-18 15:57:47 Hail: INFO: Coerced sorted dataset
2022-11-18 15:57:48 Hail: INFO: Coerced sorted dataset
2022-11-18 15:57:49 Hail: INFO: Coerced sorted dataset


Elapsed time: 644.291446s

Processing: 10 of 11


2022-11-18 16:08:27 Hail: INFO: Coerced sorted dataset
2022-11-18 16:08:28 Hail: INFO: Coerced sorted dataset
2022-11-18 16:08:29 Hail: INFO: Coerced sorted dataset


Elapsed time: 595.420629s

Processing: 11 of 11


2022-11-18 16:17:10 Hail: INFO: Coerced sorted dataset
2022-11-18 16:17:12 Hail: INFO: Coerced sorted dataset
2022-11-18 16:17:14 Hail: INFO: Coerced sorted dataset


Elapsed time: 396.263435s
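Once the per-block files have been downloaded, they can be read back in block order. The sketch below uses only the standard library and assumes nothing about the column names beyond what each CSV header provides; the helper names `part_number` and `iter_rows` are hypothetical. Sorting by the numeric part index matters because a plain string sort would place `part10` before `part2`.

```python
import csv
import glob
import gzip
import os
import re
import tempfile

def part_number(path):
    """Extract the integer N from a file name like output.chr21.partN.csv.gz."""
    match = re.search(r"\.part(\d+)\.csv\.gz$", path)
    return int(match.group(1)) if match else -1

def iter_rows(directory, chromosome):
    """Yield rows (as dicts) from all parts of one chromosome, in numeric part order."""
    pattern = os.path.join(directory, f"output.chr{chromosome}.part*.csv.gz")
    for path in sorted(glob.glob(pattern), key=part_number):
        with gzip.open(path, "rt", newline="") as handle:
            yield from csv.DictReader(handle)

# Demonstration on two tiny invented files written to a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    for part, value in [(10, "b"), (2, "a")]:
        name = os.path.join(tmp, f"output.chr21.part{part}.csv.gz")
        with gzip.open(name, "wt", newline="") as handle:
            writer = csv.writer(handle)
            writer.writerow(["toy_column"])
            writer.writerow([value])
    rows = list(iter_rows(tmp, 21))

print([row["toy_column"] for row in rows])
```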



In [16]:
!rm -rf /opt/notebooks/hail*

The following commands were used to launch the notebook (shown here for chr19) and to download the results:

```
dx cd ..


dx mkdir chr19
dx cd chr19

my_cmd="papermill 2_collect_variants_rap.ipynb 2_collect_variants_rap_chr19_output.ipynb -p chromosome 19"

dx run dxjupyterlab_spark_cluster --instance-type=mem2_ssd1_v2_x16 --instance-count=5 --priority=low --name="Run analysis chr19" -icmd="$my_cmd" -iduration=2160 -iin="../../2_collect_variants_rap.ipynb" -ifeature="HAIL-0.2.78-VEP-1.0.3"

dx download chr1 -r
dx download chr2 -r
dx download chr3 -r
dx download chr4 -r
dx download chr5 -r
dx download chr6 -r
dx download chr7 -r
dx download chr8 -r
dx download chr9 -r
dx download chr10 -r
dx download chr11 -r
dx download chr12 -r
dx download chr13 -r
dx download chr14 -r
dx download chr15 -r
dx download chr16 -r
dx download chr17 -r
dx download chr19 -r
```
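Since the same launch is repeated for each chromosome, the commands can be generated rather than typed by hand. The sketch below only builds and prints the command strings, reusing the notebook names and `dx run` options shown above verbatim. The helper names are hypothetical and nothing is submitted to the platform.

```python
import shlex

NOTEBOOK = "2_collect_variants_rap.ipynb"

def papermill_command(chromosome):
    """Build the papermill command that runs the notebook for one chromosome."""
    output = f"2_collect_variants_rap_chr{chromosome}_output.ipynb"
    return f"papermill {NOTEBOOK} {output} -p chromosome {chromosome}"

def dx_run_command(chromosome):
    """Build the `dx run` invocation used above, quoted for a POSIX shell."""
    args = [
        "dx", "run", "dxjupyterlab_spark_cluster",
        "--instance-type=mem2_ssd1_v2_x16",
        "--instance-count=5",
        "--priority=low",
        f"--name=Run analysis chr{chromosome}",
        f"-icmd={papermill_command(chromosome)}",
        "-iduration=2160",
        f"-iin=../../{NOTEBOOK}",
        "-ifeature=HAIL-0.2.78-VEP-1.0.3",
    ]
    return shlex.join(args)

print(papermill_command(19))
print(dx_run_command(19))
```

The printed `dx run` line can be reviewed and pasted into a shell after `dx cd` into the chromosome's folder, as in the commands above.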