# Phenotypes & Genes Map

In this notebook, we load John's curated list of genes (curated manually or with software such as Qiagen), fetch the phenotypes of each gene from the Ensembl database (http://uswest.ensembl.org), and load them into a graph database to visualize their relationships.

In [14]:
import pandas as pd
from itertools import chain

## Import Genes

In [2]:
gene_names_raw = pd.read_csv("notcleanedofcommonvariants-BiologicalContext-5.csv")

In [3]:
gene_names_raw.head()

|   | Gene Symbol |
|---|---|
| 0 | EMG1; PHB2 |
| 1 | GLB1; TMPPE |
| 2 | CEP57L1; SESN1 |
| 3 | UBB |
| 4 | DNAJB1; TECR |


In [12]:
genes_df = gene_names_raw.apply( lambda g: g["Gene Symbol"].split(";"), axis=1 )

In [64]:
genes_raw = list(chain.from_iterable( genes_df.tolist() ))

In [70]:
genes = [ g.strip() for g in genes_raw ]  # a list, so len() and indexing also work on Python 3

In [71]:
len(genes)

432

Save results to disk

In [84]:
pd.DataFrame(genes, columns=["gene"]).to_csv("genes_list.csv", index=False)

## Fetch Phenotypes

1. REST API documentation for fetching phenotypes by gene: http://rest.ensembl.org/documentation/info/phenotype_gene
2. Example gene phenotype page in the Ensembl browser: http://uswest.ensembl.org/Homo_sapiens/Gene/Phenotype?db=core;g=ENSG00000139618;r=13:32315474-32400266
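
For orientation, here is a minimal sketch of how a response from the phenotype endpoint can be reduced to a list of descriptions. It assumes, as the fetch loop in this notebook does, that a known gene yields a JSON list of records with an optional `description` field and that an unknown gene yields a dict with an `error` key. The helper name `extract_descriptions` and the sample payloads are hypothetical, made up for illustration.

```python
def extract_descriptions(payload):
    """Return the non-empty phenotype descriptions, or None for an error payload."""
    # Assumption: error responses are dicts carrying an "error" key.
    if isinstance(payload, dict) and "error" in payload:
        return None
    descriptions = []
    for record in payload:
        # "description" may be missing or empty; treat both as "no phenotype text"
        description = (record.get("description") or "").strip()
        if description:
            descriptions.append(description)
    return descriptions


# Hypothetical payloads, shaped like the responses the notebook expects.
sample_ok = [
    {"description": "Example phenotype A", "source": "ExampleSource"},
    {"source": "ExampleSource"},   # no description field
    {"description": ""},           # empty description
]
sample_error = {"error": "gene not found"}

print(extract_descriptions(sample_ok))
print(extract_descriptions(sample_error))
```

Returning `None` for the error case keeps "unknown gene" distinguishable from "known gene with no phenotypes", which comes back as an empty list.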

In [86]:
import requests
from time import sleep
from json import dump

In [30]:
url = "http://rest.ensembl.org/phenotype/gene/homo_sapiens/{gene}"

In [54]:
phenotypes_per_gene = {}

In [76]:
r = requests.get( url.format(gene=genes[0]), headers={'Content-type': 'application/json'} )

In [78]:
%%time

unknown_genes = []
failed_requests = []
for i, g in enumerate(genes):
    r = requests.get( url.format(gene=g), headers={'Content-type': 'application/json'} )
    try:
        resp = r.json()
        # an unknown gene comes back as a dict with an "error" key
        if isinstance(resp, dict) and "error" in resp:
            unknown_genes.append(g)
            continue  # note: this also skips the pause and the progress message below

        # keep the non-empty descriptions as a plain list so they can be dumped to JSON
        phenotypes_raw = [ phenotype.get("description", "") for phenotype in resp ]
        phenotypes = [ phenotype for phenotype in phenotypes_raw if len(phenotype) > 1 ]
        phenotypes_per_gene[g] = phenotypes
    except ValueError:
        # the response body was not valid JSON
        failed_requests.append(g)
    
    sleep(0.05) # be kind to the API :P
    if i % 100 == 0:
        print( "Fetched {} genes' info".format(i) )

Fetched 0 genes' info
Fetched 100 genes' info
Fetched 300 genes' info
Fetched 400 genes' info
CPU times: user 2.09 s, sys: 272 ms, total: 2.36 s
Wall time: 4min 59s
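
The loop above pauses a fixed 0.05 s between requests. A hedged sketch of an alternative is to wait only when the server asks for it. The sketch assumes that rate limiting is signalled with HTTP status 429 and a `Retry-After` header giving the delay in seconds, which should be checked against the Ensembl REST documentation. `get_with_retry`, `fetch`, and the fake response class are hypothetical names for illustration; in the notebook, `fetch` would wrap the `requests.get` call.

```python
import time


def get_with_retry(fetch, max_tries=3, default_wait=1.0, sleep=time.sleep):
    """Call fetch() until it returns a non-429 response or max_tries is reached.

    fetch is any zero-argument callable returning an object with
    .status_code and .headers, e.g. lambda: requests.get(some_url).
    """
    response = None
    for attempt in range(max_tries):
        response = fetch()
        if response.status_code != 429:
            break
        if attempt < max_tries - 1:
            # Assumption: Retry-After, when present, holds a delay in seconds.
            wait = float(response.headers.get("Retry-After", default_wait))
            sleep(wait)
    return response


class FakeResponse(object):
    """Stand-in for a requests.Response, used only to demonstrate the helper."""
    def __init__(self, status_code, headers=None):
        self.status_code = status_code
        self.headers = headers or {}


# Demo: the first call is rate limited, the second succeeds.
queued = [FakeResponse(429, {"Retry-After": "0.5"}), FakeResponse(200)]
waits = []
result = get_with_retry(lambda: queued.pop(0), sleep=waits.append)
print(result.status_code, waits)
```

Passing `sleep` in as a parameter keeps the demo instantaneous; a real run would leave it at the default `time.sleep`.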


In [81]:
len(unknown_genes)

60

In [82]:
len(failed_requests)

0

Number of genes for which phenotypes are found

In [27]:
sum( 1 for v in phenotypes_per_gene.values() if len(v) > 0 )

92

Save results to disk

In [87]:
with open("gene_phenotypes_map.json", "w") as fp:
    dump(phenotypes_per_gene, fp)

### Push Gene-Phenotypes to Neo4J

In [8]:
! pip install -q py2neo

twisted 18.7.0 requires PyHamcrest>=1.9.0, which is not installed.
grin 1.2.1 requires argparse>=1.1, which is not installed.
ipython 5.8.0 has requirement prompt-toolkit<2.0.0,>=1.0.4, but you'll have prompt-toolkit 2.0.9 which is incompatible.
jupyter-console 5.2.0 has requirement prompt-toolkit<2.0.0,>=1.0.0, but you'll have prompt-toolkit 2.0.9 which is incompatible.
You are using pip version 10.0.1, however version 19.1.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.


In [38]:
from py2neo import Graph
from json import load

Start the Neo4j graph database server locally, then run the cell below to connect to it.

In [39]:
graph = Graph(password = "1234")

In [14]:
# load phenotype & genes map
with open("gene_phenotypes_map.json", "r") as fp:
    phenotypes_per_gene = load(fp)

In [24]:
%%time

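# Values are passed as Cypher parameters so that quotes inside gene or phenotype
# names cannot break the query text (assumes a Neo4j version with $param syntax, 3.0+)
create_query = """
    MERGE (g:Gene {name: $gene_name})
    MERGE (p:Phenotype {name: $phenotype_name})
    MERGE (g)-[r:causes]->(p)
    RETURN g, r, p
"""

for i, gene in enumerate(phenotypes_per_gene):
    phenotypes = phenotypes_per_gene[gene]
    for phenotype in phenotypes:
        graph.run(create_query, gene_name=gene, phenotype_name=phenotype)
        
        # progress is reported inside the inner loop, so a gene without phenotypes prints nothing
        if i % 100 == 0:
            print( "Saved {} genes to Graph DB".format(i) )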

CPU times: user 1.91 s, sys: 157 ms, total: 2.07 s
Wall time: 13.7 s


In [25]:
len(phenotypes_per_gene.keys())

241

## Metabolic Pathways

Link Genes to Metabolic Pathways

In [36]:
from json import dump

In [28]:
pathways_per_gene = {}

In [33]:
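with open("keggGeneMapping.txt", "r") as fp:
    for line in fp:
        # tab-separated fields, with surrounding whitespace and empty fields dropped
        records = [ x.strip() for x in line.strip().split("\t") if len(x.strip()) > 0 ]
        if not records:
            continue  # skip blank lines
        # the first field is the key, the remaining fields are its values
        gene = records[0]
        pathways = records[1:]
        pathways_per_gene[gene] = pathways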

Save the pathways as JSON to disk

In [37]:
with open("pathways_per_gene.json", "w") as fp:
    dump(pathways_per_gene, fp)
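
The parsing cell above names the first field of each line `gene`, whereas the Neo4j cell further down iterates over the keys as pathway IDs. Which orientation `keggGeneMapping.txt` actually has is not stated here, so the following is only a sketch of how a gene-keyed mapping could be turned into a pathway-keyed one if that turns out to be needed. `invert_mapping` and the sample data are hypothetical.

```python
from collections import defaultdict


def invert_mapping(values_per_key):
    """Turn {key: [values]} into {value: [keys]}, without duplicate keys."""
    keys_per_value = defaultdict(list)
    for key, values in values_per_key.items():
        for value in values:
            if key not in keys_per_value[value]:
                keys_per_value[value].append(key)
    return dict(keys_per_value)


# Hypothetical gene -> pathway IDs mapping, for illustration only.
sample_pathways_per_gene = {
    "GENE_A": ["00010", "00020"],
    "GENE_B": ["00010"],
}
genes_per_pathway = invert_mapping(sample_pathways_per_gene)
print(genes_per_pathway)
```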

### Push Pathways to Neo4J

In [42]:
from py2neo import Graph
from json import load
import os

Start the Neo4j graph database server locally, then run the cell below to connect to it.

In [48]:
graph = Graph(password = "1234")

In [40]:
# load the pathways map saved to disk earlier
with open("pathways_per_gene.json", "r") as fp:
    pathways_per_gene = load(fp)

In [41]:
PATHWAY_IMAGES_FOLDER = "./pathview_images_full/"

In [52]:
%%time

missing_pathway_files = []

pathway_png_template = "hsa%(pathway)s.pathview.png"

# Values are passed as Cypher parameters rather than formatted into the query text
# (assumes a Neo4j version with $param syntax, 3.0+)
create_query = """
    MERGE (g:Gene {name: $gene_name})
    MERGE (p:Pathway {name: $pathway, pathway_png: $pathway_png})
    MERGE (g)-[r:is_in]->(p)
    RETURN g, r, p
"""

# each key is treated as a pathway ID and each value as the list of genes in that pathway
for i, pathway in enumerate(pathways_per_gene):
    pathway_png = os.path.join( PATHWAY_IMAGES_FOLDER, pathway_png_template % {"pathway": pathway} )
    if not os.path.exists(pathway_png):
        # placeholder text stored in place of the image path
        pathway_png = "Yet to generate a pathway for me"
        missing_pathway_files.append(pathway)
    
    genes = pathways_per_gene[pathway]
    for gene in genes:
        graph.run(create_query, gene_name=gene, pathway=pathway, pathway_png=pathway_png)
        
    if i % 100 == 0:
        print( "Saved {} pathways to Graph DB".format(i) )

Saved 0 pathways to Graph DB
Saved 100 pathways to Graph DB
Saved 200 pathways to Graph DB
CPU times: user 2.41 s, sys: 323 ms, total: 2.73 s
Wall time: 13.4 s


In [53]:
len(pathways_per_gene.keys())

292

## Variants & Genes

In [1]:
import pandas as pd

In [2]:
v_df = pd.read_csv("variants_file_svai.csv")

In [3]:
v_df.shape

(120430, 6)

In [4]:
v_df.head()

|   | gene | Position | Variation Type | Gene Region | dbSNP ID | 1000 Genomes Frequency |
|---|---|---|---|---|---|---|
| 0 | DDX11L1 | 10108 | Insertion | Promoter | 1322538365 | NaN |
| 1 | DDX11L1 | 10321 | SNV | Promoter | 1002315756 | NaN |
| 2 | WASH7P | 18164 | SNV | Intronic; Promoter | 62636370 | NaN |
| 3 | WASH7P | 20729 | SNV | Intronic | 6661499 | NaN |
| 4 | WASH7P | 28682 | SNV | Promoter; Intronic | 1490102872 | NaN |


In [5]:
df = v_df.fillna({ 
    "1000 Genomes Frequency": 0
})

In [6]:
df.head()

|   | gene | Position | Variation Type | Gene Region | dbSNP ID | 1000 Genomes Frequency |
|---|---|---|---|---|---|---|
| 0 | DDX11L1 | 10108 | Insertion | Promoter | 1322538365 | 0.0 |
| 1 | DDX11L1 | 10321 | SNV | Promoter | 1002315756 | 0.0 |
| 2 | WASH7P | 18164 | SNV | Intronic; Promoter | 62636370 | 0.0 |
| 3 | WASH7P | 20729 | SNV | Intronic | 6661499 | 0.0 |
| 4 | WASH7P | 28682 | SNV | Promoter; Intronic | 1490102872 | 0.0 |


In [7]:
df.describe()

|   | Position | 1000 Genomes Frequency |
|---|---|---|
| count | 120430.0 | 120430.0 |
| mean | 68647220.0 | 0.031209 |
| std | 54734250.0 | 0.115265 |
| min | 4769.0 | 0.0 |
| 25% | 24947890.0 | 0.0 |
| 50% | 57438070.0 | 0.0 |
| 75% | 100303800.0 | 0.0 |
| max | 249069900.0 | 0.998 |


### Push to Neo4J

In [None]:
from py2neo import Graph
import pandas as pd

v_df = pd.read_csv("variants_file_svai.csv")
df = v_df.fillna({
    "1000 Genomes Frequency": 0,
    "Position": 0  # the column name is capitalized in the CSV
})

graph = Graph(password="1234")

# Values are passed as Cypher parameters rather than formatted into the query text
# (assumes a Neo4j version with $param syntax, 3.0+)
create_query = """
    MERGE (g:Gene {name: $gene_name})
    MERGE (p:Variant {name: $variant, gene_region: $gene_region, freq_1000: $freq_1000})
    MERGE (g)-[r:has {variant_type: $variant_type, position: $position}]->(p)
    RETURN g, r, p
"""

for i, row in df.iterrows():
    # convert NumPy scalars to plain Python types before handing them to the driver
    graph.run(
        create_query,
        gene_name=str(row["gene"]),
        variant=str(row["dbSNP ID"]),
        gene_region=str(row["Gene Region"]),
        freq_1000=float(row["1000 Genomes Frequency"]),
        variant_type=str(row["Variation Type"]),
        position=int(row["Position"]),
    )

    if i % 10000 == 0:
        print("Saved {} variants".format(i))
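
Running one query per row means roughly 120,000 round trips for this file. A common alternative is to send rows in batches and let Cypher iterate over them with `UNWIND`. The sketch below is not what the notebook ran: `chunked`, `push_in_batches`, `BATCH_QUERY`, the batch size, and the recording stand-in for the graph object are assumptions for illustration. It relies only on `graph.run` accepting query parameters as keyword arguments, and it assumes no row carries a null value in a merged property.

```python
BATCH_QUERY = """
UNWIND $rows AS row
MERGE (g:Gene {name: row.gene_name})
MERGE (p:Variant {name: row.variant, gene_region: row.gene_region, freq_1000: row.freq_1000})
MERGE (g)-[r:has {variant_type: row.variant_type, position: row.position}]->(p)
"""


def chunked(records, size):
    """Yield consecutive lists of at most `size` records."""
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def push_in_batches(graph, records, size=1000):
    """Send records to graph.run in batches; returns the number of batches sent."""
    sent = 0
    for batch in chunked(records, size):
        graph.run(BATCH_QUERY, rows=batch)
        sent += 1
    return sent


class RecordingGraph(object):
    """Stand-in for py2neo's Graph that only records the calls it receives."""
    def __init__(self):
        self.calls = []

    def run(self, query, **parameters):
        self.calls.append((query, parameters))


# Demo with hypothetical rows; a real run would pass a connected Graph instead.
rows = [{"gene_name": "GENE_A", "variant": str(n), "gene_region": "Intronic",
         "freq_1000": 0.0, "variant_type": "SNV", "position": n} for n in range(5)]
demo_graph = RecordingGraph()
print(push_in_batches(demo_graph, rows, size=2))
```

With the notebook's data, the list of row dicts could be built from `df` using the same column-to-key mapping as the per-row loop.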
