# Build database EcoCyc-plus for network traversal and analysis

Create the EcoCyc database using the steps described in "create_ecocyc_mod_database_for_GDS.md", except step 8: do not remove currency metabolites. Then add the STRING data. The database was deployed on the DTU Neo4j server as the 'ecocyc-plus' database.

In [33]:
from py2neo import Graph
from neo4j import GraphDatabase
import pandas as pd
import csv, os

In [34]:
driver = GraphDatabase.driver('bolt://localhost:7687', auth=('neo4j', 'rcai'))

## Load STRING relationships
From the STRING documentation, here are the evidence categories. I think we should definitely include the experimental evidence (purple) for physical interaction, but we may also want to consider the green (gene neighborhood) and black (co-expression) evidence. This would significantly enrich the dataset.
Ewa

- Red line - indicates the presence of fusion evidence
- Green line - neighborhood evidence
- Blue line - cooccurrence evidence
- Purple line - experimental evidence
- Yellow line - textmining evidence
- Light blue line - database evidence
- Black line - coexpression evidence.
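
The line colors above correspond to the score columns of the detailed links file loaded below. As a minimal sketch, the mapping can be written down once and used to pick the channels to load. The names `CHANNEL_COLUMNS`, `SELECTED_COLORS` and `SELECTED_COLUMNS` are illustrative and not part of STRING or of this notebook; the column names are copied from the file header, including STRING's own spelling "cooccurence".

```python
# Map each STRING line color to the matching score column in
# 511145.protein.links.detailed.v11.0.txt.gz (names are illustrative).
CHANNEL_COLUMNS = {
    "red": "fusion",
    "green": "neighborhood",
    "blue": "cooccurence",   # spelled this way in the STRING file header
    "purple": "experimental",
    "yellow": "textmining",
    "light blue": "database",
    "black": "coexpression",
}

# Channels proposed above for loading into the graph
SELECTED_COLORS = ["purple", "green", "black"]
SELECTED_COLUMNS = [CHANNEL_COLUMNS[color] for color in SELECTED_COLORS]
print(SELECTED_COLUMNS)
```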


Download the file 511145.protein.links.detailed.v11.0.txt.gz from https://string-db.org/cgi/download?sessionId=bp2V6CA10a3x

In [28]:
file = "511145.protein.links.detailed.v11.0.txt.gz"  # adjust to the file path on your computer
df = pd.read_csv(file, sep=' ')
print(len(df))
df.head()

1060854


Unnamed: 0,protein1,protein2,neighborhood,fusion,cooccurence,coexpression,experimental,database,textmining,combined_score
0,511145.b0001,511145.b3766,0,0,0,125,0,0,292,354
1,511145.b0001,511145.b2483,0,0,0,0,0,0,430,430
2,511145.b0001,511145.b0075,0,0,0,0,0,0,303,303
3,511145.b0001,511145.b3672,0,0,0,0,0,0,408,408
4,511145.b0001,511145.b0861,0,0,0,0,0,0,306,306


##### STRING IDs contain the tax_id and the b-number. Use the b-number to match the EcoCyc gene accession

In [29]:
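# regex=False treats '511145.' as a literal prefix (otherwise '.' may match any character)
df['protein1'] = df['protein1'].str.replace('511145.', '', regex=False)
df['protein2'] = df['protein2'].str.replace('511145.', '', regex=False)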
df.head()

Unnamed: 0,protein1,protein2,neighborhood,fusion,cooccurence,coexpression,experimental,database,textmining,combined_score
0,b0001,b3766,0,0,0,125,0,0,292,354
1,b0001,b2483,0,0,0,0,0,0,430,430
2,b0001,b0075,0,0,0,0,0,0,303,303
3,b0001,b3672,0,0,0,0,0,0,408,408
4,b0001,b0861,0,0,0,0,0,0,306,306


In [35]:
len(df)

1060854

#### Create gene links for E. coli genes
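Use the relationship type 'LINKED_TO' with the properties 'source', 'type' and 'score'. The insert function below divides the raw STRING score by 1000 before storing it,   
e.g. source='db_STRING', type='neighborhood', score=0.23 for a raw score of 230.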

Create an index on the Gene property 'accession' so that the gene lookups during insertion are fast:
```
create index idx_gene_accession for (n:Gene) on (n.accession);
```

In [36]:
def insert_links(input_df, link_type, db_name):
    """Create LINKED_TO relationships for one STRING evidence channel."""
    # Keep only the protein pairs with a non-zero score for this channel
    df = input_df.loc[input_df[link_type] > 0, ['protein1', 'protein2', link_type]]
    print(link_type, len(df))
    # Rename on a new frame instead of inplace=True on a slice of input_df
    df = df.rename(columns={link_type: 'score'})
    query = """
        WITH $rows AS rows UNWIND rows AS row
        MATCH (n1:Gene {accession: row.protein1}), (n2:Gene {accession: row.protein2})
        MERGE (n1)-[r:LINKED_TO {source: 'db_STRING', type: $linktype, score: toFloat(row.score)/1000}]->(n2)
        """
    rows = df.to_dict('records')
    with driver.session(database=db_name) as session:
        info = session.run(query, rows=rows, linktype=link_type).consume()
        print(info.counters)
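
`insert_links` sends all rows of a link type to the server in a single query. If that turns out to be too heavy for the server, the rows could be split into batches and the same query run once per batch. This is a minimal sketch, not what was run here: the helper name `iter_batches` and the batch size of 10,000 are assumptions.

```python
from typing import Any, Dict, Iterator, List


def iter_batches(rows: List[Dict[str, Any]], batch_size: int = 10_000) -> Iterator[List[Dict[str, Any]]]:
    """Yield consecutive slices of rows, each with at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


# Hypothetical use inside insert_links, replacing the single session.run call:
# for batch in iter_batches(rows):
#     session.run(query, rows=batch, linktype=link_type).consume()
```

Each `session.run` call then commits its own batch, so a failure part-way through leaves the earlier batches in the database; because the query uses MERGE, rerunning it should not duplicate them.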

In [37]:
dbname = 'ecocyc-plus'
insert_links(df, 'neighborhood', dbname)

neighborhood 291100
{'relationships_created': 291096, 'properties_set': 873288}


In [38]:
insert_links(df, 'coexpression', dbname)

coexpression 320706
{'relationships_created': 320706, 'properties_set': 962118}


In [39]:
insert_links(df, 'experimental', dbname)

experimental 132150
{'relationships_created': 132150, 'properties_set': 396450}
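
To check the load, the stored relationships can be counted per link type and compared with the `relationships_created` counters printed above. A minimal sketch: the helper name `count_links_by_type` is hypothetical, and it expects an open session such as the one returned by `driver.session(database='ecocyc-plus')`.

```python
def count_links_by_type(session):
    """Return {link type: count} for the STRING-derived LINKED_TO relationships."""
    query = """
        MATCH (:Gene)-[r:LINKED_TO {source: 'db_STRING'}]->(:Gene)
        RETURN r.type AS type, count(r) AS n
        """
    # Each record exposes the returned columns by name
    return {record["type"]: record["n"] for record in session.run(query)}


# Hypothetical use with the driver created at the top of the notebook:
# with driver.session(database='ecocyc-plus') as session:
#     print(count_links_by_type(session))
```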
