This notebook builds a list of genes with `s_het >= 0.15`, together with each gene's start and end positions extended by 10 bp flanking regions.

This code is executed locally (not in the Research Analysis Platform).

In [1]:
import pandas as pd
from pybiomart import Dataset

# path to the working folder
workfolder =  ...

# Get s_het data from Weghorn et al.

1. Download the supplementary data from Weghorn et al., which contains per-gene s-het scores (https://academic.oup.com/mbe/article/36/8/1701/5475505#supplementary-data), and save it as `Supplementary_Table_1_weghorn.txt` in `workfolder`.

2. Select genes with high s-het scores, that is, s-het >= 0.15.

In [2]:
# read s_het file from Weghorn et al.

s_het = pd.read_csv(f"{workfolder}/Supplementary_Table_1_weghorn.txt", sep='\t')

s_het.shape

(16279, 10)

In [3]:
# get high s_het genes 
# use s_het with modeled drift (s_het_drift) 

high_s_het = s_het[s_het['s_het_drift'] >= 0.15]
high_s_het = high_s_het[['Gene', 's_het_drift']]


high_s_het.shape

(1983, 2)

In [4]:
high_s_het.head(3)

Unnamed: 0,Gene,s_het_drift
30,ABCA2,0.258261
47,ABCD1,0.185187
51,ABCE1,0.333416


UKBB data is aligned to GRCh38, whereas the s-het data is for GRCh37. We therefore need to map GRCh37 gene names to GRCh38 positions, which we do via the `HGNC-id`.
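To illustrate why the join goes through the HGNC ID rather than the gene symbol, here is a minimal sketch on made-up data. The symbols (`GENEA`, `OLDSYM`, `NEWSYM`), IDs and coordinates are all hypothetical; only the two-step merge mirrors the approach described above.

```python
import pandas as pd

# Hypothetical s-het table keyed by GRCh37-era symbols
s_het_toy = pd.DataFrame({"HGNC symbol": ["GENEA", "OLDSYM"],
                          "s_het_drift": [0.20, 0.31]})

# Hypothetical GRCh37 mapping: symbol -> HGNC ID
hg37_toy = pd.DataFrame({"HGNC ID": [1.0, 2.0],
                         "HGNC symbol": ["GENEA", "OLDSYM"]})

# Hypothetical GRCh38 table in which gene 2 has been renamed
hg38_toy = pd.DataFrame({"HGNC ID": [1.0, 2.0],
                         "HGNC symbol": ["GENEA", "NEWSYM"],
                         "Gene start (bp)": [100, 5000],
                         "Gene end (bp)": [900, 7500]})

# Joining directly on the symbol loses the renamed gene
by_symbol = s_het_toy.merge(hg38_toy, on="HGNC symbol")

# Joining via the stable HGNC ID keeps it
by_id = (s_het_toy
         .merge(hg37_toy, on="HGNC symbol")
         .rename(columns={"HGNC symbol": "HGNC symbol GRCh37"})
         .merge(hg38_toy, on="HGNC ID"))

print(len(by_symbol), len(by_id))
```

In this toy case the symbol-based join returns one row and the ID-based join returns both genes.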

# Get HGNC ID for GRCh37

1. Download `HGNC-id` -- `chromosome/scaffold` -- `HGNC-symbol` from Ensembl for GRCh37.

2. Remove all scaffolds that are not assigned to chromosomes.

3. Keep only the `HGNC-id` -- `HGNC-symbol` mapping.
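The prefix-based scaffold filter used in the code cells depends on how patches and scaffolds happen to be named; for example, GRCh38 unplaced scaffolds with names such as `KI270728.1` contain neither `CHR_`, an `H`, nor a `GL` prefix. One possible alternative, sketched below and not part of the original workflow, is an allow-list of primary chromosome names. The helper `keep_primary`, the constant `PRIMARY_CHROMOSOMES`, the decision to keep `MT`, and the toy rows are assumptions made for illustration.

```python
import pandas as pd

# Primary assembly names as Ensembl reports them (no "chr" prefix).
# Keeping "MT" is an assumption; drop it if mitochondrial genes are not wanted.
PRIMARY_CHROMOSOMES = {str(i) for i in range(1, 23)} | {"X", "Y", "MT"}

def keep_primary(table, column="Chromosome/scaffold name"):
    """Return only the rows whose chromosome name is on the allow-list."""
    names = table[column].astype(str)  # tolerate integer-typed chromosome names
    return table[names.isin(PRIMARY_CHROMOSOMES)]

# Toy table with illustrative scaffold and patch names
toy = pd.DataFrame({
    "Chromosome/scaffold name": ["1", "X", "GL000192.1", "HG1_PATCH", "KI270728.1", "MT"],
    "HGNC symbol": ["A", "B", "C", "D", "E", "F"],
})

primary = keep_primary(toy)
print(primary["Chromosome/scaffold name"].tolist())
```

An allow-list fails closed: a scaffold with an unexpected name is dropped rather than silently kept.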

In [5]:
# query data from ensembl 
dataset = Dataset(name='hsapiens_gene_ensembl',  host='http://grch37.ensembl.org')

hg37_table = dataset.query(
    attributes=['hgnc_id','chromosome_name', 'hgnc_symbol'])

# drop scaffolds not mapped to chromosomes
hg37_table = hg37_table[~hg37_table['Chromosome/scaffold name'].apply(
    lambda x: x.startswith('CHR_') or 'H' in x or x.startswith('GL'))]

# keep only the mapping information
hg37_table = hg37_table[['HGNC ID', 'HGNC symbol']].dropna().drop_duplicates()

hg37_table.head(3)

Unnamed: 0,HGNC ID,HGNC symbol
2,19121.0,HMGA1P6
3,42488.0,RNY3P4
4,42682.0,LINC00362


# Get ensembl gene locations for GRCh38

1. Download `chromosome/scaffold` -- `start position` -- `end position` -- `HGNC-id` -- `HGNC-symbol` from Ensembl for GRCh38.

2. Remove all scaffolds that are not assigned to chromosomes.

3. Edit `HGNC-id` so that it matches the GRCh37 format (GRCh38 IDs start with "HGNC:", GRCh37 IDs do not).
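As a small illustration of step 3, the prefix can also be stripped with pandas' vectorized string methods. This is a sketch on toy values, assuming missing IDs have already been dropped; the variable names are hypothetical.

```python
import pandas as pd

# Toy GRCh38-style identifiers (prefixed strings)
hgnc_38 = pd.Series(["HGNC:32", "HGNC:61"])

# Strip the literal "HGNC:" prefix, then cast to float so the dtype
# matches the numeric IDs returned for GRCh37
normalized = hgnc_38.str.replace("HGNC:", "", regex=False).astype(float)

print(normalized.tolist())
```

Casting to float matters because a merge on `HGNC ID` only matches when both tables hold the key in a comparable type.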

In [6]:
# load data
dataset = Dataset(name='hsapiens_gene_ensembl',  host='http://www.ensembl.org')

hg38_table = dataset.query(attributes=[
    'chromosome_name','start_position','end_position','hgnc_id','hgnc_symbol'])

# drop scaffolds not mapped to chromosomes
hg38_table = hg38_table[~hg38_table['Chromosome/scaffold name'].apply(
    lambda x: x.startswith('CHR_') or 'H' in x or x.startswith('GL'))]

# drop rows with NA values
hg38_table = hg38_table.dropna()

# edit the HGNC ID field to have the same format as in GRCh37
hg38_table['HGNC ID'] = hg38_table['HGNC ID'].apply(
    lambda x: x.replace('HGNC:', '') if x else x).astype(float)

hg38_table.shape

(40104, 5)

# Map s_het from GRCh37 to GRCh38 gene names using HGNC ID

We add the gene start and end positions to the s-het table by following these steps:


1. Rename the s-het table column `Gene` to `HGNC symbol`

2. Add GRCh37 `HGNC-id` information to the s-het table by merging on `HGNC symbol`

3. Rename the s-het table column `HGNC symbol` to `HGNC symbol GRCh37`

4. Add GRCh38 `chromosome` -- `start position` -- `end position` -- `HGNC symbol` to the s-het table by merging on `HGNC ID`

5. Save the resulting s-het table as the tab-separated file `high_s_het_gene_list.csv`

6. Add 10bp flanking region to gene `start position` -- `end position` and save as `high_s_het_gene_list_10bp.bed`

7. Upload `high_s_het_gene_list_10bp.bed` to UKBB RAP for further usage.
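Each merge in the steps above is an inner join, so genes without a match are dropped silently (the row counts printed in the cells below shrink slightly at each merge). A minimal sketch of how the dropped genes could be listed, using the `indicator` option of `DataFrame.merge` on hypothetical toy data:

```python
import pandas as pd

# Hypothetical gene table and an incomplete symbol -> ID mapping
genes = pd.DataFrame({"HGNC symbol": ["GENEA", "GENEB", "GENEC"],
                      "s_het_drift": [0.20, 0.31, 0.16]})
mapping = pd.DataFrame({"HGNC ID": [1.0, 2.0],
                        "HGNC symbol": ["GENEA", "GENEB"]})

# A left join with indicator=True adds a "_merge" column that records
# whether each row found a partner in the mapping table
checked = genes.merge(mapping, on="HGNC symbol", how="left", indicator=True)
unmatched = checked.loc[checked["_merge"] == "left_only", "HGNC symbol"].tolist()

print(unmatched)
```

Inspecting the unmatched symbols before switching back to an inner join makes it easier to decide whether the losses are acceptable.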

In [7]:
# Rename gene name column

high_s_het = high_s_het.rename(columns={'Gene': 'HGNC symbol'})

print ('Total rows:', high_s_het.shape[0])

high_s_het.head(3)

Total rows: 1983


Unnamed: 0,HGNC symbol,s_het_drift
30,ABCA2,0.258261
47,ABCD1,0.185187
51,ABCE1,0.333416


In [8]:
# add HGNC ID information

high_s_het = high_s_het.merge(hg37_table, on='HGNC symbol')

print ('Total rows:', high_s_het.shape[0])

high_s_het.head(3)

Total rows: 1975


Unnamed: 0,HGNC symbol,s_het_drift,HGNC ID
0,ABCA2,0.258261,32.0
1,ABCD1,0.185187,61.0
2,ABCE1,0.333416,69.0


In [9]:
# Rename gene name column

high_s_het = high_s_het.rename(columns={'HGNC symbol': 'HGNC symbol GRCh37'})

print ('Total rows:', high_s_het.shape[0])

high_s_het.head(3)

Total rows: 1975


Unnamed: 0,HGNC symbol GRCh37,s_het_drift,HGNC ID
0,ABCA2,0.258261,32.0
1,ABCD1,0.185187,61.0
2,ABCE1,0.333416,69.0


In [10]:
# Add gene name and location in GRCh38

high_s_het = high_s_het.merge(hg38_table, on='HGNC ID')

# save as a tab-separated file
high_s_het.to_csv(f'{workfolder}/high_s_het_gene_list.csv', sep='\t', index=False)

print ('Total rows:', high_s_het.shape[0])

high_s_het.head(3)

Total rows: 1974


Unnamed: 0,HGNC symbol GRCh37,s_het_drift,HGNC ID,Chromosome/scaffold name,Gene start (bp),Gene end (bp),HGNC symbol
0,ABCA2,0.258261,32.0,9,137007234,137028915,ABCA2
1,ABCD1,0.185187,61.0,X,153724856,153744755,ABCD1
2,ABCE1,0.333416,69.0,4,145098288,145129524,ABCE1


In [11]:
# Add 10 bp, sort and save as bed file

high_s_het['Gene start (bp)'] = high_s_het['Gene start (bp)'] - 10
high_s_het['Gene end (bp)'] = high_s_het['Gene end (bp)'] + 10


high_s_het = high_s_het.sort_values(
    by=['Chromosome/scaffold name', 'Gene start (bp)'])

high_s_het[['Chromosome/scaffold name', 
            'Gene start (bp)',
            'Gene end (bp)']].to_csv(f'{workfolder}/high_s_het_gene_list_10bp.bed',
                                     header=False, sep='\t', index=False)

Upload the file `high_s_het_gene_list_10bp.bed` to the UKBB RAP for downstream use.
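A note on coordinates: the BED format uses 0-based, half-open intervals, whereas Ensembl reports 1-based, inclusive positions. The cell above writes `start - 10` and `end + 10` directly, so a tool that reads the file strictly as BED would see a 9 bp upstream flank and a 10 bp downstream flank. Whether that one-base difference matters depends on the downstream tool. The sketch below shows one way a strict conversion could look; the helper `to_bed_interval` is hypothetical and is not what the notebook runs.

```python
def to_bed_interval(start_1based, end_1based, flank=10):
    """Convert a 1-based inclusive interval to a flanked 0-based half-open one."""
    # Shift the start to 0-based, extend by the flank, and clip at 0
    bed_start = max(start_1based - 1 - flank, 0)
    # A 1-based inclusive end equals the 0-based exclusive end, so only add the flank
    bed_end = end_1based + flank
    return bed_start, bed_end

# ABCA2 coordinates from the table above
print(to_bed_interval(137007234, 137028915))  # (137007223, 137028925)

# A gene close to the chromosome start is clipped instead of going negative
print(to_bed_interval(5, 100))  # (0, 110)
```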