# Example - Investigate the Ella HBOC_v01 Gene Panel

To understand what happens behind the scenes, let's look at the data for the gene panel **HBOC_v01** and reverse engineer the record for the BRCA2 gene.

In [1]:
import os
import pandas as pd
import requests
from pprint import pprint
import numpy as np
import tempfile

## The Transcripts and Phenotypes Files for the HBOC_v01 GenePanel

For record keeping, these files are included here; they are also in the Ella GitHub repo under `src/vardb/testdata/clinicalGenePanels/HBOC_v01`.

In [2]:
transcript_file="""# Genepanel: HBOC Version: 01 Date: 2017-11-07
#chromosome	txStart	txEnd	refseq	score	strand	geneSymbol	HGNC	Omim gene entry	geneAlias	eGeneID	eTranscriptID	cdsStart	cdsEnd	exonsStarts	exonEnds
17	41196311	41277468	NM_007297.3	0	-	BRCA1	1100	113705	RNF53,BRCC1,PPP1R53,FANCS	ENSG00000012048	ENST00000309486	41197694	41258543	41196311,41199659,41201137,41203079,41209068,41215349,41215890,41219624,41222944,41226347,41228504,41234420,41242960,41243451,41247862,41249260,41251791,41256138,41256884,41258472,41276033,41277293	41197819,41199720,41201211,41203134,41209152,41215390,41215968,41219712,41223255,41226538,41228631,41234592,41243049,41246877,41247939,41249306,41251897,41256278,41256973,41258550,41276132,41277468
13	32889616	32973809	NM_000059.3	0	+	BRCA2	1101	600185	FAD,FAD1,BRCC2,XRCC11	ENSG00000139618	ENST00000544455	32890597	32972907	32889616,32890558,32893213,32899212,32900237,32900378,32900635,32903579,32905055,32906408,32910401,32918694,32920963,32928997,32930564,32931878,32936659,32937315,32944538,32945092,32950806,32953453,32953886,32954143,32968825,32971034,32972298	32889804,32890664,32893462,32899321,32900287,32900419,32900750,32903629,32905167,32907524,32915333,32918790,32921033,32929425,32930746,32932066,32936830,32937670,32944694,32945237,32950928,32953652,32954050,32954282,32969070,32971181,32973809
"""

In [3]:
phenotype_file="""# Genepanel: HBOC Version: 01 Date: 2017-11-07
#gene symbol	HGNC	remove (add x)	phenotype	inheritance	omim_number	pmid	inheritance info	comment
BRCA1	1100		{Breast-ovarian cancer, familial, 1}	AD	604370
BRCA2	1101		{Breast-ovarian cancer, familial, 2}	AD	612555
"""
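
The phenotypes file is tab-separated with two leading `#` lines, so pandas can load it directly. The following is a minimal sketch, assuming an in-memory copy of the file: `sample_phenotypes` is a stand-in for `phenotype_file` above, and reading the column names from the second line is a choice made here, not something Ella prescribes.

```python
import io

import pandas as pd

# Stand-in for the phenotype_file string above (same layout, same two rows).
sample_phenotypes = (
    "# Genepanel: HBOC Version: 01 Date: 2017-11-07\n"
    "#gene symbol\tHGNC\tremove (add x)\tphenotype\tinheritance\t"
    "omim_number\tpmid\tinheritance info\tcomment\n"
    "BRCA1\t1100\t\t{Breast-ovarian cancer, familial, 1}\tAD\t604370\n"
    "BRCA2\t1101\t\t{Breast-ovarian cancer, familial, 2}\tAD\t612555\n"
)

# Skip only the info line; the second line supplies the column names.
# Rows with fewer fields than the header are padded with NaN by pandas.
phenotypes = pd.read_csv(
    io.StringIO(sample_phenotypes), sep="\t", skiprows=1, header=0
)

print(phenotypes[["#gene symbol", "HGNC", "phenotype", "inheritance", "omim_number"]])
```

Reading from `io.StringIO` avoids a temporary file when the content is already held in a string.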

Ella provides several sample gene panels. Let's take the first, `HBOC_v01.transcripts.csv`, and reverse engineer it using MyGene.info.

In [4]:
with tempfile.NamedTemporaryFile(mode='w+t') as temp:
    temp.write(transcript_file)
    temp.flush()
    # Column names match the "#chromosome ..." header line of the file.
    columns = ["#chromosome","txStart", "txEnd","refseq", "score", "strand", "geneSymbol", "HGNC", "Omim gene entry", "geneAlias", "eGeneID", "eTranscriptID", "cdsStart", "cdsEnd", "exonsStarts", "exonEnds"]
    # Skip the two leading '#' lines (info line and header) and supply the names ourselves.
    genepanel = pd.read_csv(temp.name, sep="\t", skiprows=2, header=None, names=columns)

genepanel

| | #chromosome | txStart | txEnd | refseq | score | strand | geneSymbol | HGNC | Omim gene entry | geneAlias | eGeneID | eTranscriptID | cdsStart | cdsEnd | exonsStarts | exonEnds |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17 | 41196311 | 41277468 | NM_007297.3 | 0 | - | BRCA1 | 1100 | 113705 | RNF53,BRCC1,PPP1R53,FANCS | ENSG00000012048 | ENST00000309486 | 41197694 | 41258543 | 41196311,41199659,41201137,41203079,41209068,4... | 41197819,41199720,41201211,41203134,41209152,4... |
| 1 | 13 | 32889616 | 32973809 | NM_000059.3 | 0 | + | BRCA2 | 1101 | 600185 | FAD,FAD1,BRCC2,XRCC11 | ENSG00000139618 | ENST00000544455 | 32890597 | 32972907 | 32889616,32890558,32893213,32899212,32900237,3... | 32889804,32890664,32893462,32899321,32900287,3... |


## Query MyGene.Info

Query MyGene.info by the eGeneID. It is **VERY IMPORTANT** to note that Ella Anno uses hg19 coordinates, so make sure you read the hg19 fields of the response (`exons_hg19`, `genomic_pos_hg19`) rather than the default ones.

In [5]:
gene_data = requests.get("https://mygene.info/v3/gene/ENSG00000139618")
gene_data = gene_data.json()
gene_data

{'HGNC': '1101',
 'MIM': '600185',
 '_id': '675',
 '_version': 7,
 'accession': {'genomic': ['AF288938.2',
   'AF309413.1',
   'AF317283.1',
   'AF348515.1',
   'AF489725.1',
   'AF489726.1',
   'AF489727.1',
   'AF489728.1',
   'AF489729.1',
   'AF489730.1',
   'AF489731.1',
   'AF489732.1',
   'AF489733.1',
   'AF489734.1',
   'AF489735.1',
   'AF489736.1',
   'AF489737.1',
   'AF489738.1',
   'AF507079.1',
   'AF507080.1',
   'AF507081.1',
   'AF507082.1',
   'AF507083.1',
   'AF507084.1',
   'AF507085.1',
   'AF507086.1',
   'AF507087.1',
   'AF507088.1',
   'AF507089.1',
   'AF507090.1',
   'AL137247.14',
   'AL445212.9',
   'AY008850.1',
   'AY008851.1',
   'AY151039.1',
   'AY436640.1',
   'CH471075.1',
   'DQ115319.1',
   'DQ889340.1',
   'EU625579.1',
   'HM763690.1',
   'HQ221557.1',
   'HQ221558.1',
   'KJ625180.1',
   'KJ625181.1',
   'KJ625182.1',
   'KJ625183.1',
   'KJ625184.1',
   'KJ625185.1',
   'KJ625186.1',
   'KJ625187.1',
   'KJ625188.1',
   'KJ625189.1',
   'KJ62
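
The full response is large, and only the hg19 fields are used in the rest of this walkthrough. As a sketch, the `fields` query parameter of the MyGene.info v3 gene endpoint can restrict the response to those fields. `build_gene_url` and `fetch_hg19_fields` are hypothetical helper names introduced here for illustration.

```python
from urllib.parse import urlencode

import requests

MYGENE_GENE_ENDPOINT = "https://mygene.info/v3/gene/"


def build_gene_url(gene_id, fields=("exons_hg19", "genomic_pos_hg19")):
    """Build a gene annotation URL that asks only for the given fields."""
    # Keep the comma unescaped so the URL stays readable.
    query = urlencode({"fields": ",".join(fields)}, safe=",")
    return "{}{}?{}".format(MYGENE_GENE_ENDPOINT, gene_id, query)


def fetch_hg19_fields(gene_id, timeout=30):
    """Fetch the restricted record; raises on HTTP errors instead of
    returning an error payload."""
    response = requests.get(build_gene_url(gene_id), timeout=timeout)
    response.raise_for_status()
    return response.json()


print(build_gene_url("ENSG00000139618"))
```

Calling `fetch_hg19_fields("ENSG00000139618")` performs the network request; it is left uncalled here so the sketch runs offline.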

### Get the Gene Data

In [6]:
gene_data['exons_hg19']

[{'cdsend': 32972907,
  'cdsstart': 32890597,
  'chr': '13',
  'position': [[32889644, 32889804],
   [32890558, 32890664],
   [32893213, 32893462],
   [32899212, 32899321],
   [32900237, 32900287],
   [32900378, 32900419],
   [32900635, 32900750],
   [32903579, 32903629],
   [32905055, 32905167],
   [32906408, 32907524],
   [32910401, 32915333],
   [32918694, 32918790],
   [32920963, 32921033],
   [32928997, 32929425],
   [32930564, 32930746],
   [32931878, 32932066],
   [32936659, 32936830],
   [32937315, 32937670],
   [32944538, 32944694],
   [32945092, 32945237],
   [32950806, 32950928],
   [32953453, 32953652],
   [32953886, 32954050],
   [32954143, 32954282],
   [32968825, 32969070],
   [32971034, 32971181],
   [32972298, 32974405]],
  'strand': 1,
  'transcript': 'NM_000059',
  'txend': 32974405,
  'txstart': 32889644}]

In [7]:
gene_data['genomic_pos_hg19']

{'chr': '13', 'end': 32973805, 'start': 32889611, 'strand': 1}

In [8]:
print("GenePanel TxStart:\t{}".format(genepanel.iloc[1]['txStart']))
print("MyGene.Info TxStart:\t{}".format(gene_data['exons_hg19'][0]['txstart']))

print("GenePanel TxEnd:\t{}".format(genepanel.iloc[1]['txEnd']))
print("MyGene.Info TxEnd:\t{}".format(gene_data['exons_hg19'][0]['txend']))
mygene_txstart = gene_data['exons_hg19'][0]['txstart']
mygene_txend = gene_data['exons_hg19'][0]['txend']

GenePanel TxStart:	32889616
MyGene.Info TxStart:	32889644
GenePanel TxEnd:	32973809
MyGene.Info TxEnd:	32974405


In [9]:
print("GenePanel CDSStart:\t{}".format(genepanel.iloc[1]['cdsStart']))
print("MyGene.Info CDSStart:\t{}".format(gene_data['exons_hg19'][0]['cdsstart']))


print("GenePanel CDSEnd:\t{}".format(genepanel.iloc[1]['cdsEnd']))
print("MyGene.Info CDSEnd:\t{}".format(gene_data['exons_hg19'][0]['cdsend']))

mygene_cdsstart = gene_data['exons_hg19'][0]['cdsstart']
mygene_cdsend = gene_data['exons_hg19'][0]['cdsend']

GenePanel CDSStart:	32890597
MyGene.Info CDSStart:	32890597
GenePanel CDSEnd:	32972907
MyGene.Info CDSEnd:	32972907
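
The printed pairs above can also be reduced to signed differences, which makes the size of each mismatch explicit. This is a small sketch using the BRCA2 values shown above; `coordinate_deltas` is a hypothetical helper name.

```python
def coordinate_deltas(panel_row, mygene_record):
    """Return MyGene.info minus gene panel for each shared coordinate."""
    # Gene panel column name -> MyGene.info key in exons_hg19.
    key_pairs = {
        "txStart": "txstart",
        "txEnd": "txend",
        "cdsStart": "cdsstart",
        "cdsEnd": "cdsend",
    }
    return {
        panel_key: int(mygene_record[mygene_key]) - int(panel_row[panel_key])
        for panel_key, mygene_key in key_pairs.items()
    }


# BRCA2 values copied from the outputs above.
panel_brca2 = {"txStart": 32889616, "txEnd": 32973809,
               "cdsStart": 32890597, "cdsEnd": 32972907}
mygene_brca2 = {"txstart": 32889644, "txend": 32974405,
                "cdsstart": 32890597, "cdsend": 32972907}

deltas = coordinate_deltas(panel_brca2, mygene_brca2)
print(deltas)  # {'txStart': 28, 'txEnd': 596, 'cdsStart': 0, 'cdsEnd': 0}
```

In the notebook the same function would accept `genepanel.iloc[1]` and `gene_data['exons_hg19'][0]` directly.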


In [10]:
exon_start = genepanel.iloc[1]['exonsStarts'].split(',')
exon_start = np.array(exon_start, dtype=int)

exon_end = genepanel.iloc[1]['exonEnds'].split(',')
exon_end = np.array(exon_end, dtype=int)

In [11]:
genepanel_exons = np.stack([exon_start, exon_end], axis=1)
genepanel_exons

array([[32889616, 32889804],
       [32890558, 32890664],
       [32893213, 32893462],
       [32899212, 32899321],
       [32900237, 32900287],
       [32900378, 32900419],
       [32900635, 32900750],
       [32903579, 32903629],
       [32905055, 32905167],
       [32906408, 32907524],
       [32910401, 32915333],
       [32918694, 32918790],
       [32920963, 32921033],
       [32928997, 32929425],
       [32930564, 32930746],
       [32931878, 32932066],
       [32936659, 32936830],
       [32937315, 32937670],
       [32944538, 32944694],
       [32945092, 32945237],
       [32950806, 32950928],
       [32953453, 32953652],
       [32953886, 32954050],
       [32954143, 32954282],
       [32968825, 32969070],
       [32971034, 32971181],
       [32972298, 32973809]])

In [12]:
mygene_exons = np.array(gene_data['exons_hg19'][0]['position'], dtype=int)
mygene_exons

array([[32889644, 32889804],
       [32890558, 32890664],
       [32893213, 32893462],
       [32899212, 32899321],
       [32900237, 32900287],
       [32900378, 32900419],
       [32900635, 32900750],
       [32903579, 32903629],
       [32905055, 32905167],
       [32906408, 32907524],
       [32910401, 32915333],
       [32918694, 32918790],
       [32920963, 32921033],
       [32928997, 32929425],
       [32930564, 32930746],
       [32931878, 32932066],
       [32936659, 32936830],
       [32937315, 32937670],
       [32944538, 32944694],
       [32945092, 32945237],
       [32950806, 32950928],
       [32953453, 32953652],
       [32953886, 32954050],
       [32954143, 32954282],
       [32968825, 32969070],
       [32971034, 32971181],
       [32972298, 32974405]])

In [13]:
# The two arrays are almost equal: only the start of the first exon and the
# end of the last exon differ, so the assertion below would fail if uncommented.
# np.testing.assert_array_equal(mygene_exons, genepanel_exons)
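
Rather than comparing the two arrays by eye, the differing entries can be located programmatically. This is a minimal sketch using only the first two and the last BRCA2 exons from the arrays above; `differing_exon_bounds` is a hypothetical helper name.

```python
import numpy as np


def differing_exon_bounds(panel_exons, mygene_exons):
    """List (exon index, 'start' or 'end', panel value, MyGene value)
    for every boundary where the two arrays disagree."""
    panel_exons = np.asarray(panel_exons)
    mygene_exons = np.asarray(mygene_exons)
    if panel_exons.shape != mygene_exons.shape:
        raise ValueError("exon arrays have different shapes")
    rows, cols = np.nonzero(panel_exons != mygene_exons)
    labels = ("start", "end")
    return [
        (int(r), labels[int(c)], int(panel_exons[r, c]), int(mygene_exons[r, c]))
        for r, c in zip(rows, cols)
    ]


# First two and last BRCA2 exons, copied from the arrays above.
panel_subset = [[32889616, 32889804], [32890558, 32890664], [32972298, 32973809]]
mygene_subset = [[32889644, 32889804], [32890558, 32890664], [32972298, 32974405]]

differences = differing_exon_bounds(panel_subset, mygene_subset)
print(differences)
```

Applied to the full `genepanel_exons` and `mygene_exons` arrays, the same call reports exon indices 0 and 26.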

In [14]:
mygene_exon_starts = mygene_exons[:,0]
mygene_exon_starts

array([32889644, 32890558, 32893213, 32899212, 32900237, 32900378,
       32900635, 32903579, 32905055, 32906408, 32910401, 32918694,
       32920963, 32928997, 32930564, 32931878, 32936659, 32937315,
       32944538, 32945092, 32950806, 32953453, 32953886, 32954143,
       32968825, 32971034, 32972298])

In [15]:
mygene_exon_ends = mygene_exons[:,1]
mygene_exon_ends

array([32889804, 32890664, 32893462, 32899321, 32900287, 32900419,
       32900750, 32903629, 32905167, 32907524, 32915333, 32918790,
       32921033, 32929425, 32930746, 32932066, 32936830, 32937670,
       32944694, 32945237, 32950928, 32953652, 32954050, 32954282,
       32969070, 32971181, 32974405])

### Get the Chromosome and Strand

In [16]:
chr = gene_data['exons_hg19'][0]['chr']
strand = gene_data['exons_hg19'][0]['strand']
if strand == 1:
    strand = '+'
else:
    strand = '-'

In [17]:
chr

'13'

In [18]:
strand

'+'
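
The `if`/`else` above maps every value other than `1` to `'-'`. A slightly stricter sketch is shown below; it assumes MyGene.info encodes the reverse strand as `-1` (only `1` appears in the output above), and `strand_symbol` is a hypothetical helper name.

```python
def strand_symbol(strand):
    """Translate a numeric strand into the '+'/'-' used in the gene panel file.

    Assumption: 1 means forward and -1 means reverse. Anything else raises,
    so a missing or unexpected value is not silently written as '-'.
    """
    if strand == 1:
        return "+"
    if strand == -1:
        return "-"
    raise ValueError("unexpected strand value: {!r}".format(strand))


print(strand_symbol(1))   # +
print(strand_symbol(-1))  # -
```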

## Query MyVariant.Info for BRCA2 - NM_000059.3

This returns variant information that can help in building out the phenotype file.

In [19]:
data = requests.get("https://myvariant.info/v1/query?q=NM_000059.3")
data = data.json()
data

{'took': 5,
 'total': 10504,
 'max_score': 5.6187677,
 'hits': [{'_id': 'chr13:g.32953582_32953647del',
   '_score': 5.6187677,
   'chrom': '13',
   'clinvar': {'_license': 'http://bit.ly/2SQdcI0',
    'allele_id': 568493,
    'alt': 'G',
    'chrom': '13',
    'cytogenic': '13q13.1',
    'gene': {'id': '675', 'symbol': 'BRCA2'},
    'hg19': {'end': 32953647, 'start': 32953582},
    'hg38': {'end': 32379510, 'start': 32379445},
    'hgvs': {'coding': ['LRG_293t1:c.8885_8950del',
      'NM_000059.3:c.8885_8950del',
      'NM_000059.3:c.8885_8950del',
      'NM_000059.3:c.8885_8950del66',
      'NM_000059.3:c.8885_8950del66',
      'NM_000059.3:c.8885_8950delTATCAAGGGATGTCACAACCGTGTGGAAGTTGCGTATTGTAAGCTATTCAAAAAAAGAAAAAGATT'],
     'genomic': ['LRG_293:g.68968_69033del',
      'NC_000013.10:g.32953584_32953649del',
      'NC_000013.11:g.32379447_32379512del',
      'NG_012772.3:g.68968_69033del']},
    'rcv': [{'accession': 'RCV000689905',
      'clinical_significance': 'Uncertain signif
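
The hits are deeply nested, so a flat summary is easier to scan. The sketch below pulls out the variant id, gene symbol, and ClinVar significance values. The sample response is hand-written with placeholder values, modeled on the shape of the output above, and `summarize_hits` is a hypothetical helper name. Handling `rcv` as either a list or a single dict is an assumption about the API, not something the output above confirms.

```python
def summarize_hits(response):
    """Flatten a MyVariant.info query response into one dict per hit."""
    rows = []
    for hit in response.get("hits", []):
        clinvar = hit.get("clinvar", {})
        rcv = clinvar.get("rcv", [])
        if isinstance(rcv, dict):  # assumed: a single record may arrive as a dict
            rcv = [rcv]
        significances = sorted(
            {r["clinical_significance"] for r in rcv if r.get("clinical_significance")}
        )
        rows.append({
            "variant_id": hit.get("_id"),
            "gene": clinvar.get("gene", {}).get("symbol"),
            "significance": significances,
        })
    return rows


# Hand-written sample with placeholder accessions and significance strings.
sample_response = {
    "hits": [
        {"_id": "chr13:g.32953582_32953647del",
         "clinvar": {"gene": {"id": "675", "symbol": "BRCA2"},
                     "rcv": [{"accession": "RCV_PLACEHOLDER_1",
                              "clinical_significance": "placeholder A"}]}},
        {"_id": "hypothetical-variant-2",
         "clinvar": {"gene": {"symbol": "BRCA2"},
                     "rcv": {"accession": "RCV_PLACEHOLDER_2",
                             "clinical_significance": "placeholder B"}}},
        {"_id": "hypothetical-variant-3"},  # hit without ClinVar data
    ]
}

summary = summarize_hits(sample_response)
for row in summary:
    print(row)
```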

## Build out our Custom Gene Panel

This assumes that you have some information about your gene of interest. If you don't, it's best to go to https://gnomad.broadinstitute.org/ and search for your gene under **gnomAD v2.1.1**.

In [20]:
refseq = "NM_000059.3"
geneSymbol = "BRCA2"
HGNC = 1101
Omim_gene_entry = 600185
geneAlias = "FAD,FAD1,BRCC2,XRCC11"
eGeneID = "ENSG00000139618"
eTranscriptID = "ENST00000544455"

In [21]:
# Column order of the transcripts file:
#["#chromosome",
#"txStart", "txEnd",
#"refseq", "score", "strand",
#"geneSymbol", "HGNC",
#"Omim gene entry",
#"geneAlias",
#"eGeneID",
#"eTranscriptID",
#"cdsStart", "cdsEnd",
#"exonsStarts", "exonEnds"]

custom_gene_panel_data = {
    "#chromosome": chr,
    "strand": strand,
    "score": 0,  # meaning unclear; 0 in both rows of the Ella panel above
    "refseq": refseq,
    "HGNC": HGNC,
    "Omim gene entry": Omim_gene_entry,
    "geneSymbol": geneSymbol,
    "geneAlias": geneAlias,
    "eGeneID": eGeneID,
    "eTranscriptID": eTranscriptID,
    "cdsStart": mygene_cdsstart,
    "cdsEnd": mygene_cdsend,
    # Exon boundaries are stored as comma-separated strings.
    "exonsStarts": ','.join(np.array(mygene_exon_starts, dtype=str)),
    "exonEnds": ','.join(np.array(mygene_exon_ends, dtype=str)),
    "txStart" : mygene_txstart,
    "txEnd" : mygene_txend
}
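
Before writing a record like this out, a few internal consistency checks can catch copy-and-paste mistakes. The sketch below encodes patterns that hold for both rows of the Ella panel above (equal exon counts, `txStart` equal to the first exon start, `txEnd` equal to the last exon end). These are observations from this data, not documented Ella requirements, and `check_panel_record` is a hypothetical helper name.

```python
def check_panel_record(record):
    """Return a list of human-readable problems; an empty list means none found."""
    starts = [int(x) for x in record["exonsStarts"].split(",")]
    ends = [int(x) for x in record["exonEnds"].split(",")]
    problems = []
    if len(starts) != len(ends):
        problems.append("exonsStarts and exonEnds have different lengths")
    if any(s >= e for s, e in zip(starts, ends)):
        problems.append("an exon start is not below its end")
    if starts[0] != record["txStart"]:
        problems.append("txStart differs from the first exon start")
    if ends[-1] != record["txEnd"]:
        problems.append("txEnd differs from the last exon end")
    if not (record["txStart"] <= record["cdsStart"]
            <= record["cdsEnd"] <= record["txEnd"]):
        problems.append("CDS is not contained in the transcript")
    return problems


# Shortened illustrative record: first and last BRCA2 exons from MyGene.info.
example_record = {
    "txStart": 32889644, "txEnd": 32974405,
    "cdsStart": 32890597, "cdsEnd": 32972907,
    "exonsStarts": "32889644,32972298",
    "exonEnds": "32889804,32974405",
}
print(check_panel_record(example_record))  # []

# Mixing the Ella txEnd with MyGene.info exons is flagged.
mixed_record = dict(example_record, txEnd=32973809)
print(check_panel_record(mixed_record))
```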

## Compare the Ella Gene Panel with MyGene.Info

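The data from MyGene.info is close to the Ella gene panel but not identical. The CDS coordinates and all interior exon boundaries match, while `txStart` (and the first exon start) is 28 bp higher in MyGene.info and `txEnd` (and the last exon end) is 596 bp higher.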

In [22]:
# Genepanel we create from mygene.info
df = pd.DataFrame(columns = columns, data=[custom_gene_panel_data])
df

| | #chromosome | txStart | txEnd | refseq | score | strand | geneSymbol | HGNC | Omim gene entry | geneAlias | eGeneID | eTranscriptID | cdsStart | cdsEnd | exonsStarts | exonEnds |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 13 | 32889644 | 32974405 | NM_000059.3 | 0 | + | BRCA2 | 1101 | 600185 | FAD,FAD1,BRCC2,XRCC11 | ENSG00000139618 | ENST00000544455 | 32890597 | 32972907 | 32889644,32890558,32893213,32899212,32900237,3... | 32889804,32890664,32893462,32899321,32900287,3... |


In [23]:
# Ella genepanel
genepanel

| | #chromosome | txStart | txEnd | refseq | score | strand | geneSymbol | HGNC | Omim gene entry | geneAlias | eGeneID | eTranscriptID | cdsStart | cdsEnd | exonsStarts | exonEnds |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17 | 41196311 | 41277468 | NM_007297.3 | 0 | - | BRCA1 | 1100 | 113705 | RNF53,BRCC1,PPP1R53,FANCS | ENSG00000012048 | ENST00000309486 | 41197694 | 41258543 | 41196311,41199659,41201137,41203079,41209068,4... | 41197819,41199720,41201211,41203134,41209152,4... |
| 1 | 13 | 32889616 | 32973809 | NM_000059.3 | 0 | + | BRCA2 | 1101 | 600185 | FAD,FAD1,BRCC2,XRCC11 | ENSG00000139618 | ENST00000544455 | 32890597 | 32972907 | 32889616,32890558,32893213,32899212,32900237,3... | 32889804,32890664,32893462,32899321,32900287,3... |


## Write the Gene Panel to an Ella Transcripts File

To write out the Ella transcripts file, first write the info line, then use pandas to write the tabular data with no index.

In [24]:
genepanel_name = 'BRCA2'
genepanel_version = 'v01'

info_line = "# Genepanel: {} Version: {} Date: 2020-10-09\n".format(genepanel_name, genepanel_version)
genepanel_transcripts_file_name = "{name}_{version}.transcripts.csv".format(name=genepanel_name,
                                                                             version=genepanel_version)

with open(genepanel_transcripts_file_name, 'w') as fp:
    fp.write(info_line)
    df.to_csv(fp, index=False, sep="\t")

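The file named by `genepanel_transcripts_file_name` (here `BRCA2_v01.transcripts.csv`) then contains the transcripts in the Ella layout: the info line, the `#chromosome` header line, and one tab-separated row per transcript.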
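
To inspect the layout without touching the file system, the same two writes can target an in-memory buffer. This is a sketch under the assumption that the info line plus a tab-separated table is all the file needs; `render_transcripts` is a hypothetical helper name, and the two-column frame is only for illustration.

```python
import io

import pandas as pd


def render_transcripts(df, name, version, date):
    """Return the transcripts file content as a string: info line, then the
    tab-separated table with its header and no index column."""
    buffer = io.StringIO()
    buffer.write("# Genepanel: {} Version: {} Date: {}\n".format(name, version, date))
    df.to_csv(buffer, index=False, sep="\t")
    return buffer.getvalue()


# Two-column illustration; the real frame has all sixteen columns.
small_df = pd.DataFrame({"#chromosome": ["13"], "txStart": [32889644]})
content = render_transcripts(small_df, "BRCA2", "v01", "2020-10-09")
print(content)
```

Because the first column is literally named `#chromosome`, the header row pandas writes already begins with `#`, matching the second line of the original file.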

## Add the HBOC_v01 Gene Panel to the Ella Database

The HBOC_v01 gene panel ships with the Ella codebase and is available inside the Ella Docker image.

```bash
docker run -it dabbleofdevops/ella:1.11.1 bash -c "ella-cli deposit genepanel \
    --genepanel_name HBOC \
    --genepanel_version v01 \
    --transcripts_path /ella/src/vardb/testdata/clinicalGenePanels/HBOC_v01/HBOC_v01.transcripts.csv \
    --phenotypes_path /ella/src/vardb/testdata/clinicalGenePanels/HBOC_v01/HBOC_v01.phenotypes.csv"
```