# Data pre-processing

__Goal:__ Build a dataset for training and evaluating relation extraction models

__Method:__ Extract relation data from the manually curated Covid-19 KG and split it into training and test sets

__Data:__ [Covid-19 KG](https://github.com/covid19kg/covid19kg) from Fraunhofer Institute for Algorithms and Scientific Computing SCAI

__Tools:__ PyBEL

__Result:__ covid-19-kg dataset

In [2]:
import requests
import pybel
import pandas as pd

## Load graph

In [4]:
#load original graph
#url = 'https://github.com/covid19kg/covid19kg/raw/master/covid19kg/_cache.bel.nodelink.json'

#snippet for mapping resource to resource
# https://cthoyt.com/2020/04/19/inspector-javerts-xref-database.html
# https://github.com/pyobo/pyobo

#import pyobo
## Map to the best source possible
#mapt_ncbigene = pyobo.get_priority_curie('hgnc:6893')
#assert mapt_ncbigene == 'ncbigene:4137'
## Sometimes you know you're the best. Own it.
#assert 'ncbigene:4137' == pyobo.get_priority_curie('ncbigene:4137')


#load graph pre-processed by Charlie Hoyt: https://github.com/CoronaWhy/bel4corona/tree/master/data/covid19kg
url = 'https://github.com/CoronaWhy/bel4corona/raw/master/data/covid19kg/covid19-fraunhofer-grounded.bel.nodelink.json'
res = requests.get(url)
res.raise_for_status()  #fail early if the download did not succeed
graph = pybel.from_nodelink(res.json())

In [5]:
#export graph as a tab-separated edge list
pybel.to_csv(graph, 'covid19kg_dataset.csv')

## Graph overview

The Covid-19 graph is encoded in [BEL](https://language.bel.bio/language/reference/2.1.0/functions/abundance/) (Biological Expression Language), a domain-specific language for expressing complex molecular relationships and their context in a machine-readable form.

In [6]:
#nodes (entity types)
pybel.struct.summary.count_functions(graph)

Counter({'Abundance': 506,
         'BiologicalProcess': 640,
         'Complex': 845,
         'Protein': 800,
         'Composite': 29,
         'Gene': 106,
         'Pathology': 266,
         'RNA': 99,
         'Reaction': 3})

In [7]:
#list of entity attributes (namespaces) used in a graph
pybel.struct.summary.count_namespaces(graph)

Counter({'chebi': 219,
         'efo': 9,
         'go': 760,
         'mesh': 337,
         'ncbitaxon': 42,
         'uniprot': 113,
         'hgnc': 745,
         'eccode': 12,
         'hgnc.genefamily': 21,
         'mgi': 56,
         'interpro': 2,
         'dbsnp': 1,
         'pfam': 9,
         'rgd': 2,
         'doid': 85,
         'hp': 66})

In [8]:
#number of edges per relation type
pybel.struct.summary.count_relations(graph)

Counter({'partOf': 1630,
         'isA': 24,
         'decreases': 677,
         'increases': 862,
         'negativeCorrelation': 776,
         'regulates': 240,
         'positiveCorrelation': 2008,
         'association': 644,
         'biomarkerFor': 1,
         'hasVariant': 82,
         'prognosticBiomarkerFor': 3,
         'causesNoChange': 1,
         'hasComponent': 1,
         'hasReactant': 3,
         'hasProduct': 3})
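
The relation counts above are heavily skewed, which matters once the edges are used as labels for relation extraction. The following is a minimal sketch that works on the printed counts, copied into a plain `Counter` so that it runs without PyBEL. The cut-off `min_count = 10` is an arbitrary assumption for illustration, not a value from this notebook.

```python
from collections import Counter

# Counts copied from the count_relations(graph) output above.
relation_counts = Counter({
    'partOf': 1630, 'isA': 24, 'decreases': 677, 'increases': 862,
    'negativeCorrelation': 776, 'regulates': 240, 'positiveCorrelation': 2008,
    'association': 644, 'biomarkerFor': 1, 'hasVariant': 82,
    'prognosticBiomarkerFor': 3, 'causesNoChange': 1, 'hasComponent': 1,
    'hasReactant': 3, 'hasProduct': 3,
})

total_edges = sum(relation_counts.values())

# Share of each relation type, most frequent first.
relation_share = {
    relation: count / total_edges
    for relation, count in relation_counts.most_common()
}

# Hypothetical cut-off: relation types with fewer examples than this are
# likely too rare to train on or to evaluate reliably.
min_count = 10
rare_relations = sorted(r for r, c in relation_counts.items() if c < min_count)

print(total_edges)
print(rare_relations)
```

Whether rare relation types should be dropped, merged into a broader class, or kept is a modelling decision that this notebook does not settle.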

In [10]:
#example: names of entities in the GO (Gene Ontology) namespace
pybel.struct.summary.count_names_by_namespace(graph,'go')

Counter({'chromosome, centromeric region': 1,
         'condensed chromosome, centromeric region': 1,
         'condensed nuclear chromosome, centromeric region': 1,
         'condensed chromosome': 1,
         'condensed nuclear chromosome': 1,
         'spindle pole': 1,
         'cornified envelope': 1,
         'immunological synapse': 1,
         'uropod': 1,
         'nucleus': 6,
         'nuclear envelope': 1,
         'cytoplasm': 13,
         'mitochondrion': 2,
         'mitochondrial outer membrane': 1,
         'mitochondrial inner membrane': 1,
         'mitochondrial matrix': 1,
         'lysosomal membrane': 1,
         'primary lysosome': 1,
         'early endosome': 1,
         'late endosome': 2,
         'vacuolar membrane': 1,
         'vacuolar lumen': 1,
         'endoplasmic reticulum lumen': 1,
         'endoplasmic reticulum-Golgi intermediate compartment': 5,
         'spindle': 1,
         'cytoskeleton': 1,
         'microtubule': 1,
         'spindle micr

## Export CSV

In [11]:
#load csv
dataset = pd.read_csv('covid19kg_dataset.csv', sep='\t')

In [12]:
dataset.head()

|   | a(chebi:10100) | partOf | complex(a(chebi:10100), p(hgnc:19679)) | {"relation": "partOf"} |
|---|---|---|---|---|
| 0 | a(chebi:101278) | isA | a(chebi:38215) | {"annotations": {}, "citation": {"db": "DOI", ... |
| 1 | a(chebi:101278) | isA | a(chebi:35674) | {"annotations": {}, "citation": {"db": "DOI", ... |
| 2 | a(chebi:101278) | decreases | path(mesh:D000787) | {"annotations": {}, "citation": {"db": "DOI", ... |
| 3 | a(chebi:101278) | decreases | path(mesh:D001145) | {"annotations": {}, "citation": {"db": "DOI", ... |
| 4 | a(chebi:101278) | increases | p(hgnc:5434) | {"annotations": {}, "citation": {"db": "DOI", ... |
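
In the output above the first edge shows up as the column header, which suggests that the exported file has no header row. One possible way to read it, as a sketch, is to pass `header=None` and supply column names. The names `source`, `relation`, `target` and `data` are my own labels and are not stored in the file. The two-row sample below is a simplified stand-in for `covid19kg_dataset.csv`, and the JSON in its second row is shortened for illustration.

```python
import io

import pandas as pd

# Simplified stand-in for 'covid19kg_dataset.csv': tab-separated, no header row.
sample = (
    'a(chebi:10100)\tpartOf\tcomplex(a(chebi:10100), p(hgnc:19679))\t{"relation": "partOf"}\n'
    'a(chebi:101278)\tisA\ta(chebi:38215)\t{"relation": "isA"}\n'
)

# Hypothetical column names, chosen for readability.
columns = ['source', 'relation', 'target', 'data']

# header=None keeps the first edge as data instead of using it as column names.
dataset = pd.read_csv(io.StringIO(sample), sep='\t', header=None, names=columns)

print(dataset[['source', 'relation', 'target']])
```

With the real file, `io.StringIO(sample)` would be replaced by the path `'covid19kg_dataset.csv'`.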


## Format description

Let’s walk through one row as an example.

First part of the row: `a(CHEBI:"(+)-Tetrandrine")`. `a` is the short form of the function [abundance](https://language.bel.bio/language/reference/2.1.0/functions/abundance/). It denotes the abundance (amount) of an entity and, as far as I can understand, is used for chemicals and drugs. There are 9 functions used in our dataset: Abundance, Protein, Gene, BiologicalProcess, etc. I suppose we can think of functions as entity types.

`CHEBI` is a namespace, the short name for the dictionary of [Chemical Entities of Biological Interest](https://www.ebi.ac.uk/chebi). In this case it denotes a chemical; we can think of it as an entity attribute. A namespace can denote a subtype (COVID, for example) or a biomedical ontology (such as Gene Ontology) that is used for entity normalisation. `:` separates the namespace from the entity name. `(+)-Tetrandrine` is the name of the chemical itself.

The second part (`negativeCorrelation`) is the relation type. There are several relation types: negativeCorrelation, association, partOf, etc.

The third part (`act(p(MGI:Tpcn2))`) has a structure similar to the first one. `act` is the short form of the function activity, which denotes the activity of a protein, RNA, or complex. `p` is the short form of the [proteinAbundance](https://language.bel.bio/language/reference/2.1.0/functions/proteinabundance/) function. `MGI` is a namespace, the short name for the [Mouse Genome Informatics](http://www.informatics.jax.org/) database. And `Tpcn2` is the protein name.
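
The structure described above (one or more nested functions, a namespace, and a name) can be pulled apart with a small regular expression. This is only an illustrative sketch for single-argument terms like the two in this example. It is not PyBEL's parser, and it deliberately rejects multi-argument terms such as `complex(...)`. The helper name `parse_simple_term` is hypothetical.

```python
import re

# Pattern for single-argument terms: functions, NAMESPACE:name, closing parens.
TERM_RE = re.compile(
    r'^(?P<functions>(?:\w+\()+)'        # nested functions, e.g. "act(p("
    r'(?P<namespace>[\w.]+):'            # namespace, e.g. CHEBI or MGI
    r'(?P<name>"[^"]*"|[^()",\s]+)'      # quoted name or bare identifier
    r'(?P<closing>\)+)$'                 # closing parentheses
)


def parse_simple_term(term):
    """Split a single-argument BEL term into functions, namespace and name."""
    match = TERM_RE.match(term.strip())
    if match is None:
        raise ValueError(f'not a simple BEL term: {term!r}')
    functions = match.group('functions').rstrip('(').split('(')
    if len(match.group('closing')) != len(functions):
        raise ValueError(f'unbalanced parentheses in: {term!r}')
    return {
        'functions': functions,                  # outermost function first
        'namespace': match.group('namespace'),
        'name': match.group('name').strip('"'),  # drop surrounding quotes
    }


print(parse_simple_term('a(CHEBI:"(+)-Tetrandrine")'))
print(parse_simple_term('act(p(MGI:Tpcn2))'))
```

For anything beyond quick inspection, the node objects in the PyBEL graph itself are a more reliable source of this information than string parsing.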

The last column contains information about the paper from which the relation was extracted: authors, title, section name, etc.
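
Judging from the `dataset.head()` output, the last column is a JSON string with keys such as `annotations` and `citation`. A minimal sketch of decoding it with the standard `json` module follows. The sample string is a shortened stand-in that uses only keys visible in that output, because the real values are truncated there.

```python
import json

# Shortened stand-in for one value of the last column; real records are longer.
raw = '{"annotations": {}, "citation": {"db": "DOI"}, "relation": "isA"}'

edge_data = json.loads(raw)

# .get() avoids a KeyError for edges without provenance, such as the
# partOf row whose data is just {"relation": "partOf"}.
citation = edge_data.get('citation', {})

print(edge_data['relation'], citation.get('db'))
```

Applied to the whole column, the same `json.loads` call could be passed to pandas `Series.apply`.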