### Script Overview
This script creates a toy dataset from INDRA covid19, hosted on emma.indra.bio 

Emma puts together this graph on daily basis via a cron job that pulls in literature, does NER,  train new ML model..
It incorporates daily updates from CORD-19 and also searches the Internet, and runs about 6 text mining systems on those

The script converts the graph to BEL format via pybel library. 
The pybel library can be used to further process the graph and generate toy dataset outputs. 

In [1]:
import sys
import os

In [2]:

print(sys.path)

['/usr/lib/python36.zip', '/usr/lib/python3.6', '/usr/lib/python3.6/lib-dynload', '', '/home/lani_lichtenstein/.local/lib/python3.6/site-packages', '/usr/local/lib/python3.6/dist-packages', '/usr/lib/python3/dist-packages', '/home/lani_lichtenstein/.local/lib/python3.6/site-packages/IPython/extensions', '/home/lani_lichtenstein/.ipython']


In [4]:
import getpass
import os
import sys
import time

import matplotlib.pyplot as plt
import pandas as pd
import pykeen
import torch
from pykeen.pipeline import pipeline
import pybel
import pybel_tools
import indra


%matplotlib inline

In [None]:
print(sys.version)

In [None]:
print(time.asctime())

In [None]:
print(getpass.getuser())

In [None]:
print(pykeen.get_version(with_git_hash=True))

In [None]:
print(pybel.get_version(with_git_hash=True))

In [None]:
import requests
from indra.statements import stmts_from_json
from indra.tools import assemble_corpus as ac
from indra.assemblers.pybel import PybelAssembler
model_url = 'https://emmaa.s3.amazonaws.com/assembled/covid19/latest_statements_covid19.json'
stmts_json = requests.get(model_url).json()
stmts = stmts_from_json(stmts_json)
filtered_stmts = ac.filter_belief(stmts, 0.9)
pa = PybelAssembler(filtered_stmts)
pybel_graph = pa.make_model()

In [5]:
from pybel.io.emmaa import get_statements_from_emmaa
from indra.tools import assemble_corpus as ac
from indra.assemblers.pybel import PybelAssembler

stmts = get_statements_from_emmaa('covid19')
filtered_stmts = ac.filter_belief(stmts, 0.9)
pa = PybelAssembler(filtered_stmts)
pybel_graph = pa.make_model()

INFO: [2020-07-28 08:53:58] indra.tools.assemble_corpus - Filtering 172446 statements to above 0.900000 belief
INFO: [2020-07-28 08:53:58] indra.tools.assemble_corpus - 24170 statements after filter...
INFO: [2020-07-28 08:54:05] indra.assemblers.pybel.assembler - Skipping modification of type modification on agent ACE2(mods: (modification))
INFO: [2020-07-28 08:54:05] indra.assemblers.pybel.assembler - Skipping modification of type modification on agent S7(mods: (modification))
INFO: [2020-07-28 08:54:05] indra.assemblers.pybel.assembler - Skipping modification of type sumoylation on agent NLRP3(mods: (sumoylation))
INFO: [2020-07-28 08:54:05] indra.assemblers.pybel.assembler - Skipping modification of type sumoylation on agent NLRP3(mods: (sumoylation))
INFO: [2020-07-28 08:54:10] indra.assemblers.pybel.assembler - Skipping modification of type modification on agent TBK1(mods: (modification))
INFO: [2020-07-28 08:54:10] indra.assemblers.pybel.assembler - Skipping modification of type

INFO: [2020-07-28 08:54:10] indra.assemblers.pybel.assembler - Unhandled statement: Translocation(Clathrin(), None, plasma membrane)
INFO: [2020-07-28 08:54:10] indra.assemblers.pybel.assembler - Unhandled statement: Translocation(ERK(mods: (phosphorylation)), None, nucleus)
INFO: [2020-07-28 08:54:10] indra.assemblers.pybel.assembler - Unhandled statement: Translocation(GEF(), None, membrane)
INFO: [2020-07-28 08:54:10] indra.assemblers.pybel.assembler - Unhandled statement: Translocation(Interferon(), None, nucleus)
INFO: [2020-07-28 08:54:10] indra.assemblers.pybel.assembler - Unhandled statement: Translocation(Kinesin(), None, membrane)
INFO: [2020-07-28 08:54:10] indra.assemblers.pybel.assembler - Unhandled statement: Translocation(MAP1LC3(), None, plasma membrane)
INFO: [2020-07-28 08:54:10] indra.assemblers.pybel.assembler - Unhandled statement: Translocation(Protease(), None, nucleus)
INFO: [2020-07-28 08:54:10] indra.assemblers.pybel.assembler - Unhandled statement: Translocat

INFO: [2020-07-28 08:54:10] indra.assemblers.pybel.assembler - Unhandled statement: Translocation(ATF6(), None, Golgi apparatus)
INFO: [2020-07-28 08:54:10] indra.assemblers.pybel.assembler - Unhandled statement: Translocation(ATF6(), None, nucleus)
INFO: [2020-07-28 08:54:10] indra.assemblers.pybel.assembler - Unhandled statement: Translocation(NR3C1(), None, nucleus)
INFO: [2020-07-28 08:54:10] indra.assemblers.pybel.assembler - Unhandled statement: Translocation(NUP214(), None, cell surface)
INFO: [2020-07-28 08:54:10] indra.assemblers.pybel.assembler - Unhandled statement: Translocation(OSBP(), None, Golgi apparatus)
INFO: [2020-07-28 08:54:10] indra.assemblers.pybel.assembler - Unhandled statement: Translocation(OSBP(), None, membrane)
INFO: [2020-07-28 08:54:10] indra.assemblers.pybel.assembler - Unhandled statement: Translocation(PRKN(), cytosol, None)
INFO: [2020-07-28 08:54:10] indra.assemblers.pybel.assembler - Unhandled statement: Translocation(PI4KB(), None, Golgi apparatus

In [None]:
# Convert Indra graph to Pybel
#https://emmaa.indra.bio/dashboard/covid19?tab=model

#pybel_covid_graph=pybel.from_emmaa('covid19', date="2020-04-23-17-44-57") 

In [6]:
pybel_graph.summarize() # summarise 

indra v58c8826b-7b91-4e81-be5b-d0102720ea08
Number of Nodes: 14269
Number of Edges: 104423
Number of Citations: 27630
Number of Authors: 0
Network Density: 5.13E-04
Number of Components: 293


In [21]:
import pickle
pickle.dump(pybel_graph, open( "pybel_graph.p", "wb"))

#### Approach A - Generate Triples

One approach to generating a toy dataset is to generate triples. 
Triples can be used to generate knowledge graph embeddings. 
They also contain grounded source and target identifiers, as well as details relation descriptions. 

This is not obtained using Approach B - Generate Raw Data with Evidence

In [7]:
import pybel.io.tsv.api

triples=pybel.io.tsv.api.get_triples(pybel_graph)













































In [None]:
import numpy as np
triples = np.array(triples)

In [8]:
triples_df=pd.DataFrame(triples)

In [10]:
triples_df.to_csv("indra_covid_toy_dataset_triples.csv",index=False,sep="\t",header=False)

#### Approach B - Generate Toy Dataset with Raw Text and Evidence

In [23]:
# use local repo cloned from github to access to_triple function
# this is not yet in pypi version, so need to access local cloned location
#sys.path.insert(0,"/home/username/pybel/src/") # If you are using a local version of the file

#from pybel.io.triples import api
# not working - IGNORE
#import imp
#imp.find_module("pybel")
#triples_api = imp.load_source('api', "/home/lani_lichtenstein/pybel/src/pybel/io/triples/api.py")
#import importlib
#importlib.reload(pybel)
pybel.__path__

(None, '/home/lani_lichtenstein/pybel/src/pybel', ('', '', 5))

In [28]:
import logging
from pybel.dsl import BaseConcept
from tqdm import tqdm
#from pybel.io.triples import api

column_list=["Source", "Target", "Relation", "Evidence", "Citation"]
indra_df=pd.DataFrame(columns=column_list)

for u,v,data in tqdm(pybel_graph.edges(data=True)):

    source='NaN'
    target='NaN'
    evidence='NaN'
    relation='NaN'
    annotations='NaN'
    
    #h,r,t=to_triple(u,v,data) https://github.com/pybel/pybel/blob/master/src/pybel/io/triples/api.py
    
    if isinstance(u, BaseConcept):
        source=u.name
        #source_obo=u.obo
        
    if isinstance(v, BaseConcept):
        target=v.name
        #for key in v.keys():
        #    print(key)
        #print("next")
        
    if 'evidence' in data.keys():  # look also at pybel.has_edge_evidence() 
        #print("Explore evidence \n")
        #print(data['evidence'])
        evidence=data["evidence"]
    
    if 'relation' in data.keys():
        #print("Explore relation \n")
        #print(data['relation'])
        #print("\n")
        relation=data['relation']
        
    if 'annotations' in data.keys():
        #print("Explore relation \n")
        #print(data['relation'])
        #print("\n")
        annotations=data['relation']
        #print(annotations)
        
    if 'citation' in data.keys():
        #print("Explore relation \n")
        #print(data['relation'])
        #print("\n")
        citation=data['citation']
        
    tmp=pd.Series([source, target, relation, evidence, citation], index=column_list)
    indra_df=indra_df.append(tmp, ignore_index=True)


100%|██████████| 104423/104423 [11:11<00:00, 155.52it/s]


In [29]:
# explore
indra_df.shape
indra_df.head()

Unnamed: 0,Source,Target,Relation,Evidence,Citation
0,MS,Mucins,directlyIncreases,"In the current study, we investigated the NanS...","{'db': 'PubMed', 'db_id': '30340996'}"
1,MS,Mucins,directlyIncreases,"In the current study, we investigated the NanS...","{'db': 'PubMed', 'db_id': '30340996'}"
2,MS,Mucins,directlyIncreases,"In the current study, we investigated the NanS...","{'db': 'Other', 'db_id': 'reach:Unknown'}"
3,MS,,partOf,,"{'db': 'Other', 'db_id': 'reach:Unknown'}"
4,MS,,partOf,,"{'db': 'Other', 'db_id': 'reach:Unknown'}"


In [30]:
indra_df.to_csv("indra_covid_toy_dataset_raw_evidence_high_belief.csv",index=False,sep="\t",header=False)

In [None]:
# filter for high belief score.. 
# ask Ben or John Bachman, Ben Gyori
# filter statements with ontology- e.g. chebi as this is a toy graph


In [None]:
# to add triples - read in api.py module and use to_triple
#https://github.com/pybel/pybel/blob/master/src/pybel/io/triples/api.py
