# GNBR Parsing Script
This notebook gives an in-depth explanation of the code used to parse GNBR from its native format into node and edge files for import into Neo4j using the bulk importer tool.  In the actual build module the code will be put into functions for purposes of style and hygiene.  Use this as a reference or prototyping tool when playing around with changes to the GNBR Neo4j database.

Planned improvements are to implement this using the dask library, as the script as it currently stands is quite memory intensive.  It came close to crashing my personal machine, which is top spec in terms of RAM.

In [1]:
import os
import gzip
import numpy as np
import pandas as pd

### Global Variables
Here we declare:
1. The file header for the GNBR dependency path files (i.e. part 2 files)
2. The source directory where the raw GNBR csv files reside (i.e. download directory)
3. The destination directory and filenames for the parsed GNBR node and edge files

In [2]:
HEADER = [
    "pmid", "loc", 
    "subj_name", "subj_loc", 
    "obj_name", "obj_loc",
    "subj_name_raw", "obj_name_raw", 
    "subj_id", "obj_id", 
    "subj_type", "obj_type", 
    "path", "text"
]

DWNLD_DIR = os.path.expanduser("~/test/gnbr")
NEO_DIR = os.path.expanduser("~/neo4j/import")
ENTITIES = 'entity.csv.gz'
MENTIONS = 'mention.csv.gz'
SENTENCES = 'sentence.csv.gz'
THEMES = 'theme.csv.gz'
STATEMENTS = 'statement.csv.gz'
HAS_MENTION = 'has_mention.csv.gz'
IN_SENTENCE = 'in_sentence.csv.gz'
HAS_THEME = 'has_theme.csv.gz'

### Import into Pandas
Now we get the filenames and import the GNBR files into a Pandas DataFrame.  I originally did everything streaming because I was worried about speed and memory, but it turns out pandas can handle all of GNBR at once.  The ease of data manipulation more than makes up for any loss in performance.

In [3]:
source = DWNLD_DIR
theme_files = sorted( [os.path.join(source, file) for file in os.listdir(source) if '-i-' in file] )
path_files = sorted( [os.path.join(source, file) for file in os.listdir(source) if '-ii-' in file] )

In this block of code we join theme (part 1) and path (part 2) files using dependency paths as the key, and then append each joined Dataframe to the results of the previous iteration.  When we are done, we end up with all the data in one massive data frame.  BIGGGG DATA!!!

In [4]:
gnbr_df = pd.DataFrame()
for theme_file, path_file in zip(theme_files, path_files):
    theme_df = pd.read_csv(theme_file, compression='gzip', header=0, sep='\t')
    path_df = pd.read_csv(path_file, compression='gzip', header=None, names=HEADER, sep='\t')
    # Pass axis by keyword: newer pandas releases no longer accept it positionally
    path_df = path_df.dropna(axis=0).drop_duplicates()
    path_df.path = path_df.path.str.lower()
    merged_df = path_df.merge(theme_df, how="inner", on=['path'])
    gnbr_df = pd.concat([gnbr_df, merged_df], join='outer', ignore_index=True, sort=False)
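
As a small illustration of the join above, here is a sketch that runs the same inner merge on two toy frames.  The column values and the `frames` list are made up for the example.  Collecting per-file results in a list and calling `pd.concat` once is only an assumption about what might be lighter than growing `gnbr_df` inside the loop; it is not what the cell above does.

```python
import pandas as pd

# Toy stand-ins for one part-ii (path) file and one part-i (theme) file.
path_df = pd.DataFrame({
    "pmid": [1, 2, 3],
    "path": ["Treats|nmod|START_ENTITY", "inhibits|nsubj|start_entity", "unseen|path"],
})
theme_df = pd.DataFrame({
    "path": ["treats|nmod|start_entity", "inhibits|nsubj|start_entity"],
    "T": [12.0, 0.0],
    "T.ind": [1, 0],
})

# Lowercase the key so both sides agree, then keep only paths that have themes.
path_df["path"] = path_df["path"].str.lower()
merged_df = path_df.merge(theme_df, how="inner", on=["path"])

# Hypothetical variation: gather per-file results and concatenate once at the end.
frames = [merged_df]
toy_gnbr_df = pd.concat(frames, ignore_index=True, sort=False)
print(toy_gnbr_df)
```

The row with `unseen|path` has no theme scores, so the inner join drops it.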

In [5]:
gnbr_df.head()

Unnamed: 0,pmid,loc,subj_name,subj_loc,obj_name,obj_loc,subj_name_raw,obj_name_raw,subj_id,obj_id,...,V+,V+.ind,I,I.ind,H,H.ind,Rg,Rg.ind,Q,Q.ind
0,18509642,11,10058-F4,15161524,tumor,14921497,10058-F4,tumor,MESH:C524814,MESH:D009369,...,,,,,,,,,,
1,20801893,10,10074-G5,17131721,tumor,16891694,10074-G5,tumor,MESH:C534883,MESH:D009369,...,,,,,,,,,,
2,8027220,3,17-deoxysteroids,517533,hypertension,463475,17-deoxysteroids,hypertension,MESH:C006022,MESH:D006973,...,,,,,,,,,,
3,18936997,15,"2,3,7,8-tetrachlorodibenzo-p-dioxin",23762411,toxicity,23372345,"2,3,7,8-tetrachlorodibenzo-p-dioxin",toxicity,MESH:D013749,MESH:D064420,...,,,,,,,,,,
4,26279999,0,"3,5-Diiodothyronine",85104,Cardiac_Illness,3348,"3,5-Diiodothyronine",Cardiac Illness,MESH:C030102,MESH:D006331,...,,,,,,,,,,


### Clean Up Data
Here we do some basic housekeeping to clean up the data and produce some unique identifiers that will make our lives easier in the future.

##### Generate Sentence IDs
In this next block of code, we hash the text of each sentence to produce a unique identifier.  We do this because it is much easier for Pandas and Neo4j to compare and match hash values than long strings of text.

In [6]:
import hashlib
sentence_ids = gnbr_df['text'].copy()
sentence_ids = sentence_ids.astype(str).str.encode('utf-8')
sentence_ids = sentence_ids.apply(lambda x: hashlib.md5(x).hexdigest())
gnbr_df.loc[:,'sentence_id'] = sentence_ids
gnbr_df.head()

Unnamed: 0,pmid,loc,subj_name,subj_loc,obj_name,obj_loc,subj_name_raw,obj_name_raw,subj_id,obj_id,...,V+.ind,I,I.ind,H,H.ind,Rg,Rg.ind,Q,Q.ind,sentence_id
0,18509642,11,10058-F4,15161524,tumor,14921497,10058-F4,tumor,MESH:C524814,MESH:D009369,...,,,,,,,,,,f407c175ed3c5bed6e49e853bd1521db
1,20801893,10,10074-G5,17131721,tumor,16891694,10074-G5,tumor,MESH:C534883,MESH:D009369,...,,,,,,,,,,432e5c439550f059249b8e05273cd1b4
2,8027220,3,17-deoxysteroids,517533,hypertension,463475,17-deoxysteroids,hypertension,MESH:C006022,MESH:D006973,...,,,,,,,,,,b7e17c4dd71970206b4bfc31b82d4eca
3,18936997,15,"2,3,7,8-tetrachlorodibenzo-p-dioxin",23762411,toxicity,23372345,"2,3,7,8-tetrachlorodibenzo-p-dioxin",toxicity,MESH:D013749,MESH:D064420,...,,,,,,,,,,71865f16aa1424cb09cac2cf609ea7cb
4,26279999,0,"3,5-Diiodothyronine",85104,Cardiac_Illness,3348,"3,5-Diiodothyronine",Cardiac Illness,MESH:C030102,MESH:D006331,...,,,,,,,,,,08731188178997f788b85b96e98b9120
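
To make the hashing step concrete, here is a minimal sketch on three made-up sentences (the text and the `toy_ids` name are just for illustration).  It follows the same encode-then-MD5 steps as the cell above.

```python
import hashlib
import pandas as pd

# Toy sentences; the first and last are identical on purpose.
text = pd.Series([
    "Drug A reduces tumor growth.",
    "Gene B is associated with hypertension.",
    "Drug A reduces tumor growth.",
])

# Same steps as above: encode to bytes, then take the MD5 hex digest.
toy_ids = text.astype(str).str.encode("utf-8").apply(lambda b: hashlib.md5(b).hexdigest())

print(toy_ids.str.len().unique())   # length of each hex digest
print(toy_ids[0] == toy_ids[2])     # identical text gives an identical ID
```

Because the ID depends only on the text, the same sentence always hashes to the same ID, which is what lets us dedupe on it later.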


##### Generate Dependency Path IDs
Here we do something similar to sentence IDs (hashing), with one key difference.  We hash triples of subject type, object type, and dependency path.  We include the subject and object types because the same dependency path can map to different distributions (i.e. theme sets) depending on what types of entities are involved.  This makes paths non-unique identifiers for theme distributions, which crashes the Neo4j import.  Adding subject and object types creates a unique triple that we can use as the ID.

In [7]:
path_ids = gnbr_df['path'].astype(str)  + gnbr_df['subj_type'].astype(str) + gnbr_df['obj_type'].astype(str)
path_ids = path_ids.astype(str).str.encode('utf-8')
path_ids = path_ids.apply(lambda x: hashlib.md5(x).hexdigest())
gnbr_df.loc[:,'path_id'] = path_ids
gnbr_df.head()

Unnamed: 0,pmid,loc,subj_name,subj_loc,obj_name,obj_loc,subj_name_raw,obj_name_raw,subj_id,obj_id,...,I,I.ind,H,H.ind,Rg,Rg.ind,Q,Q.ind,sentence_id,path_id
0,18509642,11,10058-F4,15161524,tumor,14921497,10058-F4,tumor,MESH:C524814,MESH:D009369,...,,,,,,,,,f407c175ed3c5bed6e49e853bd1521db,05d11ea4b1d924abafac60a269c3347c
1,20801893,10,10074-G5,17131721,tumor,16891694,10074-G5,tumor,MESH:C534883,MESH:D009369,...,,,,,,,,,432e5c439550f059249b8e05273cd1b4,05d11ea4b1d924abafac60a269c3347c
2,8027220,3,17-deoxysteroids,517533,hypertension,463475,17-deoxysteroids,hypertension,MESH:C006022,MESH:D006973,...,,,,,,,,,b7e17c4dd71970206b4bfc31b82d4eca,05d11ea4b1d924abafac60a269c3347c
3,18936997,15,"2,3,7,8-tetrachlorodibenzo-p-dioxin",23762411,toxicity,23372345,"2,3,7,8-tetrachlorodibenzo-p-dioxin",toxicity,MESH:D013749,MESH:D064420,...,,,,,,,,,71865f16aa1424cb09cac2cf609ea7cb,05d11ea4b1d924abafac60a269c3347c
4,26279999,0,"3,5-Diiodothyronine",85104,Cardiac_Illness,3348,"3,5-Diiodothyronine",Cardiac Illness,MESH:C030102,MESH:D006331,...,,,,,,,,,08731188178997f788b85b96e98b9120,05d11ea4b1d924abafac60a269c3347c
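
Here is a quick sketch of why the types go into the hash, using two made-up rows that share a dependency path but differ in subject type.  The `md5_hex` helper is hypothetical and only exists to keep the example short.

```python
import hashlib
import pandas as pd

# Two toy rows: same dependency path, different entity types.
toy = pd.DataFrame({
    "path": ["inhibits|nsubj|start_entity"] * 2,
    "subj_type": ["Chemical", "Gene"],
    "obj_type": ["Gene", "Gene"],
})

def md5_hex(value: str) -> str:
    """Hypothetical helper: MD5 hex digest of a UTF-8 encoded string."""
    return hashlib.md5(value.encode("utf-8")).hexdigest()

# Hashing the path alone collapses both rows onto one ID...
path_only = toy["path"].apply(md5_hex)
# ...while hashing path + subject type + object type keeps them apart.
triple = (toy["path"] + toy["subj_type"] + toy["obj_type"]).apply(md5_hex)

print(path_only.nunique(), triple.nunique())
```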


##### Fix Chemical, Gene, and Document IDs
Here we clean up the chemical and gene identifiers.  Some chemical identifiers are missing their MESH prefixes, and some genes are missing their Entrez prefixes.  Also, some genes have a taxon string appended on the end.  The three things going on here are:
1. Split off the Taxon string where it is present
2. Use regexes to find MESH and NCBIGENE IDs missing prefixes, and fix them
3. Add taxonomy prefix and put everything back into the Dataframe

We do this for both the subjects and the objects of each sentence.

##### Fix Subject IDs
First we split the uri field to separate the Taxonomy ids from the gene ids.

In [8]:
uri_split = gnbr_df['subj_id'].str.split('(', expand=True)

Then we look for MESH ids and Entrez gene ids without prefixes, prepend the prefixes, and place the results back in the subject uri field.

In [9]:
uris = uri_split[0]
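# regex=True is required for a callable replacement in recent pandas versions
uris = uris.str.replace(r'^(C\d+)$', lambda m: 'MESH:' + m.group(0), regex=True)
uris = uris.str.replace(r'^(D\d+)$', lambda m: 'MESH:' + m.group(0), regex=True)
uris = uris.str.replace(r'^(\d+)$', lambda m: 'NCBIGENE:' + m.group(0), regex=True)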
gnbr_df['subj_id'] = uris

Finally, we add the prefix for Taxa, assume that it is human where not indicated (this may be a faulty assumption), and add to the dataframe as a new column.

In [10]:
species = uri_split[1].str.strip(')')
species = species.str.replace('Tax', 'Taxonomy')
species = species.fillna(value="Taxonomy:9606")
species[gnbr_df['subj_type'] != 'Gene'] = ''
gnbr_df['subj_species'] = species
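
Here is a small sketch of the three steps on made-up ids.  The `ID(Tax:NNNN)` layout of the raw values is an assumption inferred from the split and replace calls above, and the two MESH patterns are folded into one `[CD]` pattern purely for brevity.

```python
import pandas as pd

# Toy ids: bare MESH ids, a bare gene id, a gene id with a taxon suffix,
# and an id that is already prefixed.
raw_ids = pd.Series(["C524814", "D009369", "7157", "20846(Tax:10090)", "MESH:D006973"])
types = pd.Series(["Chemical", "Disease", "Gene", "Gene", "Disease"])

# 1. Split off the taxon string where it is present.
uri_split = raw_ids.str.split("(", expand=True)

# 2. Add the missing MESH and NCBIGENE prefixes.
uris = uri_split[0]
uris = uris.str.replace(r"^([CD]\d+)$", lambda m: "MESH:" + m.group(0), regex=True)
uris = uris.str.replace(r"^(\d+)$", lambda m: "NCBIGENE:" + m.group(0), regex=True)

# 3. Build the species column: default to human, blank for non-genes.
species = uri_split[1].str.strip(")").str.replace("Tax", "Taxonomy", regex=False)
species = species.fillna("Taxonomy:9606")
species[types != "Gene"] = ""

print(pd.DataFrame({"uri": uris, "species": species}))
```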

##### Fix Object IDs
Object IDs follow the exact same procedure as the subject IDs.  I thought about writing a function, but I think the explicit indication that this needs to be done for both subject and object IDs outweighs any loss in style points from cutting and pasting code.

In [11]:
uri_split = gnbr_df['obj_id'].str.split('(', expand=True)

uris = uri_split[0]
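# regex=True is required for a callable replacement in recent pandas versions
uris = uris.str.replace(r'^(C\d+)$', lambda m: 'MESH:' + m.group(0), regex=True)
uris = uris.str.replace(r'^(D\d+)$', lambda m: 'MESH:' + m.group(0), regex=True)
uris = uris.str.replace(r'^(\d+)$', lambda m: 'NCBIGENE:' + m.group(0), regex=True)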
gnbr_df['obj_id'] = uris

species = uri_split[1].str.strip(')')
species = species.str.replace('Tax', 'Taxonomy')
species = species.fillna(value="Taxonomy:9606")
species[gnbr_df['obj_type'] != 'Gene'] = ''
gnbr_df['obj_species'] = species

##### TODO
There is one minor annoyance left to deal with here.  Some rows have multiple gene ids, which are stored together as a semicolon-separated string.  This happens for sentences where the subject or object is a series of genes (e.g. "MMP 8-12").  This is completely non-trivial to fix and only affects a couple thousand rows.  So for now we will handle it at the REST server, and will only put in effort for an underlying fix if it turns out to be a major issue.
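
For reference, one possible shape for an underlying fix is sketched below.  This is not what the notebook does.  It assumes that giving each gene its own row is acceptable downstream, which would change sentence and statement counts, and the ids are made up.

```python
import pandas as pd

# Toy rows: the first subject is a semicolon-separated series of gene ids.
toy = pd.DataFrame({
    "subj_id": ["4317;4318", "7157"],
    "sentence_id": ["s1", "s2"],
})

# Split on ';' and give every gene id its own row.
exploded = toy.assign(subj_id=toy["subj_id"].str.split(";")).explode(
    "subj_id", ignore_index=True
)

# The prefix regex only matches single ids, so it is applied after exploding.
exploded["subj_id"] = exploded["subj_id"].str.replace(
    r"^(\d+)$", lambda m: "NCBIGENE:" + m.group(0), regex=True
)
print(exploded)
```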

##### Fix Document IDs
We also add the pubmed prefix to the pmids.

In [12]:
pmids = gnbr_df['pmid'].astype(str)
# regex=True is required for a callable replacement in recent pandas versions
pmids = pmids.str.replace(r'^(\d+)$', lambda m: 'PUBMED:' + m.group(0), regex=True)
gnbr_df['pmid'] = pmids

In [13]:
gnbr_df.head()

Unnamed: 0,pmid,loc,subj_name,subj_loc,obj_name,obj_loc,subj_name_raw,obj_name_raw,subj_id,obj_id,...,H,H.ind,Rg,Rg.ind,Q,Q.ind,sentence_id,path_id,subj_species,obj_species
0,PUBMED:18509642,11,10058-F4,15161524,tumor,14921497,10058-F4,tumor,MESH:C524814,MESH:D009369,...,,,,,,,f407c175ed3c5bed6e49e853bd1521db,05d11ea4b1d924abafac60a269c3347c,,
1,PUBMED:20801893,10,10074-G5,17131721,tumor,16891694,10074-G5,tumor,MESH:C534883,MESH:D009369,...,,,,,,,432e5c439550f059249b8e05273cd1b4,05d11ea4b1d924abafac60a269c3347c,,
2,PUBMED:8027220,3,17-deoxysteroids,517533,hypertension,463475,17-deoxysteroids,hypertension,MESH:C006022,MESH:D006973,...,,,,,,,b7e17c4dd71970206b4bfc31b82d4eca,05d11ea4b1d924abafac60a269c3347c,,
3,PUBMED:18936997,15,"2,3,7,8-tetrachlorodibenzo-p-dioxin",23762411,toxicity,23372345,"2,3,7,8-tetrachlorodibenzo-p-dioxin",toxicity,MESH:D013749,MESH:D064420,...,,,,,,,71865f16aa1424cb09cac2cf609ea7cb,05d11ea4b1d924abafac60a269c3347c,,
4,PUBMED:26279999,0,"3,5-Diiodothyronine",85104,Cardiac_Illness,3348,"3,5-Diiodothyronine",Cardiac Illness,MESH:C030102,MESH:D006331,...,,,,,,,08731188178997f788b85b96e98b9120,05d11ea4b1d924abafac60a269c3347c,,


### Entities and Mentions in Sentences
Now that we've taken care of all the organization and housekeeping, it's time to start generating some node and edge files!  

In this section we generate the node files for Entities (Chemical, Gene, Disease) and Mentions, and the edge files linking Entities to Mentions and Mentions to Sentences.  Mention nodes serve no purpose other than to optimize lookup speed for text search.  Imo, they pollute the data model, but are a necessary evil.  If I ever get around to dumping the nodes into an Elasticsearch index, they might go away.

##### Entities

First order of business is pulling subject and object entities out and merging them into a single entities dataframe (concepts).

In [14]:
subj_df = gnbr_df[['subj_name', 'subj_name_raw','subj_id', 'subj_species', 'subj_type', 'sentence_id']]
subj_df.columns = ['name' , 'mention', 'uri', 'species', 'type', 'sentence_id']

In [15]:
obj_df = gnbr_df[['obj_name', 'obj_name_raw','obj_id', 'obj_species', 'obj_type', 'sentence_id']]
obj_df.columns = ['name', 'mention', 'uri', 'species', 'type', 'sentence_id']

In [16]:
concepts = pd.concat([subj_df, obj_df], ignore_index = True)
concepts.head()

Unnamed: 0,name,mention,uri,species,type,sentence_id
0,10058-F4,10058-F4,MESH:C524814,,Chemical,f407c175ed3c5bed6e49e853bd1521db
1,10074-G5,10074-G5,MESH:C534883,,Chemical,432e5c439550f059249b8e05273cd1b4
2,17-deoxysteroids,17-deoxysteroids,MESH:C006022,,Chemical,b7e17c4dd71970206b4bfc31b82d4eca
3,"2,3,7,8-tetrachlorodibenzo-p-dioxin","2,3,7,8-tetrachlorodibenzo-p-dioxin",MESH:D013749,,Chemical,71865f16aa1424cb09cac2cf609ea7cb
4,"3,5-Diiodothyronine","3,5-Diiodothyronine",MESH:C030102,,Chemical,08731188178997f788b85b96e98b9120


Next I am finding the most frequently used term for each entity, setting that as its name, and then writing to a file.  The aggregation step where I collect the mentions of each entity ensures that there are no duplicates.  Neo4j doesn't like duplicate nodes.

In [17]:
from collections import Counter
entities = concepts[['mention', 'uri', 'type', 'species']]
entities = entities.groupby(by=['uri','type', 'species'])['mention'].apply(lambda x: x.values.tolist())
entities = entities.apply(lambda x: Counter(x).most_common(1)[0][0])
entities = pd.DataFrame(entities)
entities.reset_index(inplace=True)
entities = entities.drop_duplicates(subset=['uri'] ,keep ='last')
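
Here is a toy version of the naming step, with made-up mentions.  It collapses the two `apply` calls above into one, which should be equivalent for this purpose.

```python
from collections import Counter
import pandas as pd

# Toy concepts: one disease mentioned three times, one gene mentioned twice.
toy = pd.DataFrame({
    "uri": ["MESH:D009369"] * 3 + ["NCBIGENE:7157"] * 2,
    "type": ["Disease"] * 3 + ["Gene"] * 2,
    "species": [""] * 3 + ["Taxonomy:9606"] * 2,
    "mention": ["tumor", "tumors", "tumor", "p53", "p53"],
})

# One row per (uri, type, species), named after its most common mention.
names = (
    toy.groupby(["uri", "type", "species"])["mention"]
    .apply(lambda x: Counter(x.tolist()).most_common(1)[0][0])
    .reset_index()
)
print(names)
```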

In [18]:
entities_file = os.path.join(NEO_DIR, ENTITIES)
entities.to_csv(entities_file, 
                columns=["uri", "type", "mention", 'species'], 
                header=['uri:ID(Entity-ID)', 'type:LABEL', 'name', 'species'], 
                index=False, compression='gzip')

##### Mentions

Now that we have entities we pull out mentions.  This is much simpler.  We dump the mentions into a dataframe, and then dedupe.  The edge file just needs to link mentions to entity uris, which I am using as unique IDs.

In [19]:
mentions = concepts[['name','mention']]
mentions = mentions.drop_duplicates()
mentions.tail()

Unnamed: 0,name,mention
26506335,rhoGDI-3,rhoGDI-3
26509670,lhx2,lhx2
26636988,retTPC/PTC,retTPC/PTC
26717150,Sting,Sting
26734469,ubiquitin-specific_peptidase_1,ubiquitin-specific peptidase 1


In [20]:
mentions_file = os.path.join(NEO_DIR, MENTIONS)
mentions.to_csv(mentions_file, 
                columns=["name", "mention"], 
                header=['formatted', 'mention:ID(Mention-ID)'], 
                index=False, compression='gzip')

In [21]:
has_mention = concepts[['mention', 'uri']]
has_mention = has_mention.drop_duplicates()
has_mention_file = os.path.join(NEO_DIR, HAS_MENTION)
has_mention.to_csv(has_mention_file, 
                columns=["uri", "mention"], 
                header=[':START_ID(Entity-ID)', ':END_ID(Mention-ID)'], 
                index=False, compression='gzip')

##### In Sentence
Here we are making the edges that connect mentions to sentences.  This is just as simple as the mentions.  Pull out the mentions and sentence ids, dedupe, and dump into a csv.

In [22]:
in_sentence = concepts[['mention', 'sentence_id']]
in_sentence = in_sentence.drop_duplicates()
in_sentence.head()

Unnamed: 0,mention,sentence_id
0,10058-F4,f407c175ed3c5bed6e49e853bd1521db
1,10074-G5,432e5c439550f059249b8e05273cd1b4
2,17-deoxysteroids,b7e17c4dd71970206b4bfc31b82d4eca
3,"2,3,7,8-tetrachlorodibenzo-p-dioxin",71865f16aa1424cb09cac2cf609ea7cb
4,"3,5-Diiodothyronine",08731188178997f788b85b96e98b9120


In [23]:
in_sentence_file = os.path.join(NEO_DIR, IN_SENTENCE)
in_sentence.to_csv(in_sentence_file, 
                   columns=["mention", "sentence_id"], 
                   header=[":START_ID(Mention-ID)", ":END_ID(Sentence-ID)"], 
                   index=False, compression='gzip')

### Sentences, Themes, and Documents
This is the core of the database in terms of the information the API provides.  The sentence annotations are where GNBR shines, so they make up the center of the data model.
##### Sentences
The code for generating the sentence nodes isn't too complex.  Basically just take the sentences and dedupe by sentence id.  Interestingly, if you dedupe using (id, text, pmid) triples you end up with one or two duplicate sentence ids, which crashes the Neo4j import.  This behavior is caused by some sentences mapping to more than one pmid.  When I chased down the pmids, they appeared to be invalid.

In [24]:
sentence_df = gnbr_df[['subj_id', 'obj_id', 'text', 'pmid', 'path', 'sentence_id', 'path_id']]

In [25]:
sentences = sentence_df[['sentence_id', 'text', 'pmid']]
sentences = sentences.drop_duplicates(subset='sentence_id')
sentences.head()

Unnamed: 0,sentence_id,text,pmid
0,f407c175ed3c5bed6e49e853bd1521db,Peak tumor concentrations of 10058-F4 were at ...,PUBMED:18509642
1,432e5c439550f059249b8e05273cd1b4,The lack of antitumor activity probably was ca...,PUBMED:20801893
2,b7e17c4dd71970206b4bfc31b82d4eca,A 15-yr-old patient from Germany was seen for ...,PUBMED:8027220
3,71865f16aa1424cb09cac2cf609ea7cb,Dioxin-like activity of multilayer and carbon ...,PUBMED:18936997
4,08731188178997f788b85b96e98b9120,Nonthyroidal_Illness_Syndrome in Cardiac_Illne...,PUBMED:26279999
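
The difference between the two dedupe strategies is easy to see on a toy frame where one sentence maps to two pmids (all values are made up):

```python
import pandas as pd

# Toy rows: sentence "a1" appears under two different pmids.
toy = pd.DataFrame({
    "sentence_id": ["a1", "a1", "b2"],
    "text": ["Same sentence.", "Same sentence.", "Other sentence."],
    "pmid": ["PUBMED:1", "PUBMED:2", "PUBMED:3"],
})

# Deduping on the full triple keeps both "a1" rows, so the ID is duplicated.
by_triple = toy.drop_duplicates()
# Deduping on sentence_id alone keeps the first row for each ID.
by_id = toy.drop_duplicates(subset="sentence_id")

print(by_triple["sentence_id"].duplicated().any(), by_id["sentence_id"].is_unique)
```

Note that deduping on the ID silently keeps whichever pmid comes first.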


In [26]:
sentences_file = os.path.join(NEO_DIR, SENTENCES)
sentences.to_csv(sentences_file, 
                columns=["sentence_id", "text", "pmid"], 
                header=[":ID(Sentence-ID)", "text", "pmid"], 
                index=False, compression='gzip')

##### Has Theme
Edges between sentences and themes are also quite simple.  Just grab sentence ids and path ids, then dedupe.  I use these edges to store the dependency paths because I don't ever expect them to be the subject of a query.  Protip: model data that will be the subject of queries as nodes.

In [27]:
has_theme = sentence_df[['sentence_id', 'path', 'path_id']]
has_theme = has_theme.drop_duplicates()
has_theme.head()

Unnamed: 0,sentence_id,path,path_id
0,f407c175ed3c5bed6e49e853bd1521db,concentrations|nmod|start_entity concentration...,05d11ea4b1d924abafac60a269c3347c
1,432e5c439550f059249b8e05273cd1b4,concentrations|nmod|start_entity concentration...,05d11ea4b1d924abafac60a269c3347c
2,b7e17c4dd71970206b4bfc31b82d4eca,concentrations|nmod|start_entity concentration...,05d11ea4b1d924abafac60a269c3347c
3,71865f16aa1424cb09cac2cf609ea7cb,concentrations|nmod|start_entity concentration...,05d11ea4b1d924abafac60a269c3347c
4,08731188178997f788b85b96e98b9120,concentrations|nmod|start_entity concentration...,05d11ea4b1d924abafac60a269c3347c


In [28]:
has_theme_file = os.path.join(NEO_DIR, HAS_THEME)
has_theme.to_csv(has_theme_file, 
                 columns=["sentence_id", "path", "path_id"], 
                 header=[":START_ID(Sentence-ID)", "path", ":END_ID(Path-ID)"], 
                 index=False, compression='gzip')

##### Themes
Now we start to get a little fancy and work with numeric data.  Theme distributions indicate how strongly we believe a sentence asserts some relationship between a pair of entities.

Code is simple enough:
1. Select numeric fields in the dataframe (i.e. theme fields)  
2. Get rid of flagship indicators
3. Normalize theme scores
4. Create the file header
5. Dedupe

In [33]:
themes_df = gnbr_df.select_dtypes(include='float64')
themes_df = themes_df[[i for i in themes_df.columns if not i.endswith('.ind')]]

The next line is the secret sauce.  Here we are using the percentile rank function to normalize theme scores.  So for example we map a score for "treats" to its percentile rank among all the "treats" scores.  We do this for each theme.

In [34]:
themes_df = themes_df.rank(numeric_only=True, pct=True, method='dense')
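
To see what the dense percentile rank does, here is a sketch on a single made-up theme column `T` with a tie, an outlier, and a missing value:

```python
import numpy as np
import pandas as pd

# Toy scores for one theme.
toy = pd.DataFrame({"T": [0.0, 5.0, 5.0, 250.0, np.nan]})

# Same call as above: ranks are computed per column and scaled into (0, 1].
ranked = toy.rank(numeric_only=True, pct=True, method="dense")
print(ranked)
```

Tied scores share a rank, the outlier at 250.0 ends up only one rank step above 5.0, and the missing value stays missing until it is filled with 0 in the next cell.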

Below we do all the header creation and deduplication

In [37]:
themes_df = themes_df.fillna(0) 
themes_df.columns = [i + ':float' for i in themes_df.columns]
themes_df[':ID(Path-ID)'] = gnbr_df['path_id']
themes_df = themes_df.drop_duplicates()
themes_df.head()

Unnamed: 0,T:float,C:float,Sa:float,Pr:float,Pa:float,J:float,Mp:float,A+:float,A-:float,B:float,...,Md:float,X:float,L:float,W:float,V+:float,I:float,H:float,Rg:float,Q:float,:ID(Path-ID)
0,0.066845,0.177719,0.12141,0.110619,0.029617,0.161604,0.050955,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,05d11ea4b1d924abafac60a269c3347c
141,0.000668,0.001326,0.002611,0.001475,0.003484,0.00243,0.003185,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2d7980e2baff336b50b709cbf16b52de
147,0.001337,0.001326,0.001305,0.0059,0.001742,0.001215,0.006369,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,a9f615f64661e4d1c328e551e6c4ff0f
156,0.994652,0.997347,0.951697,0.989676,0.945993,0.957473,0.901274,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,ced8dc92de3e40f8cb8c8e90e5cd0d90
8521,0.000668,0.002653,0.001305,0.001475,0.001742,0.001215,0.003185,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,c95cbad08fb46a3e88d51e9222f642a8


In [38]:
themes_file = os.path.join(NEO_DIR, THEMES)
themes_df.to_csv(themes_file, index=False, compression='gzip')

Initially I wanted to drop theme distributions where everything was close to zero, but this causes issues because they are referenced in an edge file (i.e. has_theme), and misaligned references crash the Neo4j import.
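
A cheap way to catch this kind of misalignment before running the import is to check that every edge endpoint exists in the node file.  The sketch below does that on toy frames.  The frame names and ids are made up, and the check itself is a suggestion rather than part of the current build.

```python
import pandas as pd

# Toy theme nodes and has_theme edges, using the same column headers as the export.
toy_themes = pd.DataFrame({":ID(Path-ID)": ["p1", "p2", "p3"]})
toy_has_theme = pd.DataFrame({
    ":START_ID(Sentence-ID)": ["s1", "s2", "s3"],
    ":END_ID(Path-ID)": ["p1", "p2", "p2"],
})

# Pretend we dropped theme "p2" because its scores were all close to zero.
kept = toy_themes[toy_themes[":ID(Path-ID)"] != "p2"]

# Edges whose end node no longer exists would break the import.
dangling = toy_has_theme[~toy_has_theme[":END_ID(Path-ID)"].isin(kept[":ID(Path-ID)"])]
print(dangling)
```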

### Statements
Statements are also pretty key for GNBR as they form the core of its "reasoning" functionality.  Statements are assertions made about relationships between pairs of entities that take into account the sum total literature evidence in pubtator.

Algorithm
1. Aggregate over all sentences and themes connecting each entity pair, using the mean as the aggregation function
2. Normalize using percentile rank as described for themes.  
3. Drop statements (rows) where all theme scores are in the bottom five percent
4. Create file header and dedupe

The next line is where all the magic (aggregation) happens.

In [39]:
statements = gnbr_df.groupby(by=['subj_id','obj_id']).mean(numeric_only=True)

After that it's pretty much the same as the themes.  I do take the step of getting rid of statements edges where all the theme scores are close to zero.  This doesn't cause any failures.  

In [40]:
statements = statements.select_dtypes(include='float64')
statements = statements.drop(['loc'], axis=1)
statements = statements[[i for i in statements.columns if not i.endswith('.ind')]]
statements = statements.rank(numeric_only=True, pct=True, method='dense')
statements = statements[statements > 0.05].dropna(how='all')
statements.columns = [i + ':float' for i in statements.columns]
statements = pd.DataFrame(statements)
statements.reset_index(inplace=True)
statements = statements.fillna(0) 
statements = statements.rename(columns = {'subj_id': ':START_ID(Entity-ID)', 'obj_id': ':END_ID(Entity-ID)'})
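
The whole statements pipeline can be followed on a toy frame with two made-up themes, `T` and `C`.  With so few rows the 5% cutoff never actually bites, so the only pair dropped here is the one with no scores at all.

```python
import numpy as np
import pandas as pd

# Toy sentence-level scores for three entity pairs.
toy = pd.DataFrame({
    "subj_id": ["MESH:C1", "MESH:C1", "MESH:C2", "MESH:C3"],
    "obj_id": ["MESH:D1", "MESH:D1", "MESH:D1", "MESH:D1"],
    "T": [10.0, 30.0, 1.0, np.nan],
    "C": [np.nan, 4.0, np.nan, np.nan],
})

# 1. Mean score per entity pair.
stmts = toy.groupby(["subj_id", "obj_id"]).mean(numeric_only=True)
# 2. Percentile rank per theme.
stmts = stmts.rank(numeric_only=True, pct=True, method="dense")
# 3. Mask low scores, then drop pairs with nothing left.
stmts = stmts[stmts > 0.05].dropna(how="all")
# 4. Fill remaining gaps with 0 and turn the pair index back into columns.
stmts = stmts.fillna(0).reset_index()
print(stmts)
```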

In [41]:
statements_file = os.path.join(NEO_DIR, STATEMENTS)
statements.to_csv(statements_file, index=False, compression='gzip')

### Documents
Nothing here yet.  I'm working to beef up the amount of data we offer about the publications (i.e. publication date, author, etc.), but it's not so germane to the functioning of anything right now, so it is low priority.