## Example: PCS use case

This notebook walks through every step needed to generate the PCS knowledge graph (KG) and export it for downstream analysis.

## Set up the environment

In [1]:
import os

os.chdir("../../../")
!pwd

/Users/yojana/Documents/GitHub/pyBiodatafuse


In [2]:
# Import modules
import pickle

import matplotlib.pyplot as plt
import networkx as nx
import pandas as pd

from pyBiodatafuse import id_mapper
from pyBiodatafuse.annotators import disgenet, minerva, opentargets, stringdb
from pyBiodatafuse.constants import DISGENET_DISEASE_COL
from pyBiodatafuse.graph import generator
from pyBiodatafuse.utils import (
    combine_sources,
    create_harmonized_input_file,
    create_or_append_to_metadata,
)

## Load input genes

In [3]:
data_input = pd.read_csv(os.path.join(os.getcwd(), r"examples/usecases/PCS/data/PCS_gene_list.csv"))
print("Total number of genes:", len(data_input.drop_duplicates()))
data_input.head()

Total number of genes: 2325


Unnamed: 0,identifier
0,CTLA4
1,PTPN22
2,KIT
3,KRAS
4,NF1


### Entity resolution using BridgeDb

In [4]:
pickle_path = os.path.join(os.getcwd(), "examples/usecases/PCS/data/PCS_gene_list.pkl")
metadata_path = os.path.join(os.getcwd(), "examples/usecases/PCS/data/PCS_gene_list_metadata.pkl")

if not os.path.exists(pickle_path):
    bridgedb_df, bridgedb_metadata = id_mapper.bridgedb_xref(
        identifiers=data_input,
        input_species="Human",
        input_datasource="HGNC",
        output_datasource="All",
    )
    bridgedb_df.to_pickle(pickle_path)
    with open(metadata_path, "wb") as file:
        pickle.dump(bridgedb_metadata, file)
else:
    bridgedb_df = pd.read_pickle(pickle_path)
    with open(metadata_path, "rb") as file:
        bridgedb_metadata = pickle.load(file)
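
The load-from-pickle-or-query pattern in the cell above recurs for every data source in this notebook. As a minimal sketch, it could be factored into one helper. `load_or_compute` is a hypothetical name and not part of pyBiodatafuse. Unlike the cells in this notebook, it stores whatever the query returns (for example a `(df, metadata)` tuple) in a single pickle file.

```python
import pickle
from pathlib import Path
from typing import Any, Callable


def load_or_compute(cache_path, compute: Callable[[], Any]) -> Any:
    """Return the pickled result at cache_path, computing and saving it if absent.

    Hypothetical helper for illustration; not part of pyBiodatafuse.
    """
    path = Path(cache_path)
    if path.exists():
        # Cache hit: skip the (possibly slow) query entirely.
        with path.open("rb") as file:
            return pickle.load(file)
    # Cache miss: run the query, then persist the result for the next run.
    result = compute()
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as file:
        pickle.dump(result, file)
    return result
```

With such a helper, a cell would reduce to something like `bridgedb_df, bridgedb_metadata = load_or_compute(some_path, lambda: id_mapper.bridgedb_xref(...))`, where the arguments are the ones already used above.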

In [5]:
print("Number of genes with mapping in BridgeDb:", len(bridgedb_df["identifier"].unique()))
bridgedb_df.head()

Number of genes with mapping in BridgeDb: 1958


Unnamed: 0,identifier,identifier.source,target,target.source
0,CTLA4,HGNC,HGNC:2505,HGNC Accession Number
1,CTLA4,HGNC,CTLA4,HGNC
2,CTLA4,HGNC,1493,NCBI Gene
3,CTLA4,HGNC,ENSG00000163599,Ensembl
4,CTLA4,HGNC,P16410,Uniprot-TrEMBL
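
The two counts printed so far (2325 input genes, 1958 with a BridgeDb mapping) imply that roughly 367 identifiers were not mapped, assuming the input rows are all distinct symbols. A small sketch for listing them is shown below. It uses plain Python sets, so it accepts any iterable of identifiers, including pandas columns. The function name and the toy data are illustrative only.

```python
def unmapped_identifiers(input_ids, mapped_ids):
    """Return the sorted input identifiers that have no mapping."""
    return sorted(set(input_ids) - set(mapped_ids))


# Toy data; "NOT_A_GENE" is a made-up symbol standing in for an unmapped input.
# In the notebook one would pass data_input["identifier"] and bridgedb_df["identifier"].
toy_input = ["CTLA4", "PTPN22", "KIT", "NOT_A_GENE"]
toy_mapped = ["CTLA4", "CTLA4", "PTPN22", "KIT"]

missing = unmapped_identifiers(toy_input, toy_mapped)
print(missing)  # ['NOT_A_GENE']
```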


## Step-by-step graph generation based on data source of interest


### Gene-Disease edges

Here we use the DisGeNET database. Running the following code requires a DisGeNET API key, which you can get by creating an account [here](https://disgenet.com/Profile-area#apiKey).

In [6]:
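# Read the key from the environment rather than committing it to the notebook.
# The variable name DISGENET_API_KEY is a convention chosen here, not a pyBiodatafuse requirement.
disgenet_api_key = os.getenv("DISGENET_API_KEY", "")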

In [7]:
disgenet_pickle_path = os.path.join(os.getcwd(), "examples/usecases/PCS/data/disgenet_df.pkl")
disgenet_metadata_path = os.path.join(
    os.getcwd(), "examples/usecases/PCS/data/disgenet_metadata.pkl"
)

if not os.path.exists(disgenet_pickle_path):
    disgenet_df, disgenet_metadata = disgenet.get_gene_disease(
        api_key=disgenet_api_key, bridgedb_df=bridgedb_df
    )
    disgenet_df.to_pickle(disgenet_pickle_path)
    with open(disgenet_metadata_path, "wb") as file:
        pickle.dump(disgenet_metadata, file)
else:
    disgenet_df = pd.read_pickle(disgenet_pickle_path)
    with open(disgenet_metadata_path, "rb") as file:
        disgenet_metadata = pickle.load(file)

disgenet_df.head()

Querying DisGeNET:  56%|█████▋    | 1078/1915 [21:47<16:55,  1.21s/it]  


ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
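
The run above stopped partway through because the remote end dropped the connection. One common mitigation is to retry the call with exponential backoff. The sketch below is a generic wrapper, not a pyBiodatafuse feature: `call_with_retries` and its parameters are hypothetical, and wrapping `disgenet.get_gene_disease` this way restarts the whole query on each attempt rather than resuming where it failed.

```python
import time


def call_with_retries(func, max_attempts=3, base_delay=1.0, exceptions=(OSError,), sleep=time.sleep):
    """Call func(), retrying on the given exceptions with exponential backoff.

    OSError is the default because the built-in ConnectionError derives from it;
    pass a narrower tuple if only specific errors should trigger a retry.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except exceptions:
            if attempt == max_attempts:
                raise  # Out of attempts: surface the last error to the caller.
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))
```

A call could then look like `call_with_retries(lambda: disgenet.get_gene_disease(api_key=disgenet_api_key, bridgedb_df=bridgedb_df))`. The `sleep` parameter exists only so the delay can be replaced in tests.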

In [None]:
# Example of metadata extracted from DisGeNET
disgenet_df[DISGENET_DISEASE_COL][0]

### Disease-compound edges

These edges are added by taking the diseases returned by DisGeNET and querying OpenTargets for the compounds linked to them.

In [None]:
disease_mapping_df = create_harmonized_input_file(disgenet_df, DISGENET_DISEASE_COL, "EFO", "UMLS")
disease_mapping_df.head()

In [None]:
opentargets_dc_pickle_path = os.path.join(
    os.getcwd(), "examples/usecases/PCS/data/opentargets_disease_compound_df.pkl"
)
opentargets_dc_metadata_path = os.path.join(
    os.getcwd(), "examples/usecases/PCS/data/opentargets_disease_compound_metadata.pkl"
)

if not os.path.exists(opentargets_dc_pickle_path):
    opentargets_disease_compound_df, opentargets_disease_compound_metadata = (
        opentargets.get_disease_compound_interactions(disease_mapping_df, cache_pubchem_cid=True)
    )

    opentargets_disease_compound_df.to_pickle(opentargets_dc_pickle_path)
    with open(opentargets_dc_metadata_path, "wb") as file:
        pickle.dump(opentargets_disease_compound_metadata, file)
else:
    opentargets_disease_compound_df = pd.read_pickle(opentargets_dc_pickle_path)
    with open(opentargets_dc_metadata_path, "rb") as file:
        opentargets_disease_compound_metadata = pickle.load(file)
opentargets_disease_compound_df.head()

### Gene-Compound edges

These edges are extracted from OpenTargets.

In [None]:
opentargets_gc_pickle_path = os.path.join(
    os.getcwd(), "examples/usecases/PCS/data/opentargets_gene_compound_df.pkl"
)
opentargets_gc_metadata_path = os.path.join(
    os.getcwd(), "examples/usecases/PCS/data/opentargets_gene_compound_metadata.pkl"
)

if not os.path.exists(opentargets_gc_pickle_path):
    opentargets_gene_compound_df, opentargets_gene_compound_metadata = (
        opentargets.get_gene_compound_interactions(bridgedb_df)
    )

    opentargets_gene_compound_df.to_pickle(opentargets_gc_pickle_path)
    with open(opentargets_gc_metadata_path, "wb") as file:
        pickle.dump(opentargets_gene_compound_metadata, file)
else:
    opentargets_gene_compound_df = pd.read_pickle(opentargets_gc_pickle_path)
    with open(opentargets_gc_metadata_path, "rb") as file:
        opentargets_gene_compound_metadata = pickle.load(file)

opentargets_gene_compound_df.head()

### Gene-Pathways edges

These edges are extracted from MINERVA, WikiPathways, and OpenTargets (Reactome).

In [None]:
minerva_pickle_path = os.path.join(os.getcwd(), "examples/usecases/PCS/data/minerva_df.pkl")
minerva_metadata_path = os.path.join(os.getcwd(), "examples/usecases/PCS/data/minerva_metadata.pkl")

if not os.path.exists(minerva_pickle_path):
    minerva_df, minerva_metadata = minerva.get_gene_minerva_pathways(
        bridgedb_df, map_name="COVID19 Disease Map"
    )
    minerva_df.to_pickle(minerva_pickle_path)
    with open(minerva_metadata_path, "wb") as file:
        pickle.dump(minerva_metadata, file)
else:
    minerva_df = pd.read_pickle(minerva_pickle_path)
    with open(minerva_metadata_path, "rb") as file:
        minerva_metadata = pickle.load(file)

minerva_df.head()

In [None]:
opentargets_reactome_path = os.path.join(
    os.getcwd(), "examples/usecases/PCS/data/opentargets_reactome_df.pkl"
)
opentargets_reactome_metadata_path = os.path.join(
    os.getcwd(), "examples/usecases/PCS/data/opentargets_reactome_metadata.pkl"
)

if not os.path.exists(opentargets_reactome_path):
    opentargets_reactome_df, opentargets_reactome_metadata = opentargets.get_gene_reactome_pathways(
        bridgedb_df=bridgedb_df
    )
    opentargets_reactome_df.to_pickle(opentargets_reactome_path)
    with open(opentargets_reactome_metadata_path, "wb") as file:
        pickle.dump(opentargets_reactome_metadata, file)

else:
    opentargets_reactome_df = pd.read_pickle(opentargets_reactome_path)
    with open(opentargets_reactome_metadata_path, "rb") as file:
        opentargets_reactome_metadata = pickle.load(file)

opentargets_reactome_df.head()

### Gene annotation

Gene annotations come from the Gene Ontology (GO), retrieved through OpenTargets.

In [None]:
opentargets_go_pickle_path = os.path.join(
    os.getcwd(), "examples/usecases/PCS/data/opentargets_go_df.pkl"
)
opentargets_go_metadata_path = os.path.join(
    os.getcwd(), "examples/usecases/PCS/data/opentargets_go_metadata.pkl"
)

if not os.path.exists(opentargets_go_pickle_path):
    opentargets_go_df, opentargets_go_metadata = opentargets.get_gene_go_process(
        bridgedb_df=bridgedb_df
    )
    opentargets_go_df.to_pickle(opentargets_go_pickle_path)
    with open(opentargets_go_metadata_path, "wb") as file:
        pickle.dump(opentargets_go_metadata, file)

else:
    opentargets_go_df = pd.read_pickle(opentargets_go_pickle_path)
    with open(opentargets_go_metadata_path, "rb") as file:
        opentargets_go_metadata = pickle.load(file)

opentargets_go_df.head()

### Protein-Protein edges

These edges are extracted from the STRING database (StringDB).

In [None]:
ppi_pickle_path = os.path.join(os.getcwd(), "examples/usecases/PCS/data/string_ppi_df.pkl")
ppi_metadata_path = os.path.join(os.getcwd(), "examples/usecases/PCS/data/string_ppi_metadata.pkl")

if not os.path.exists(ppi_pickle_path):
    string_ppi_df, string_ppi_metadata = stringdb.get_ppi(bridgedb_df=bridgedb_df)
    string_ppi_df.to_pickle(ppi_pickle_path)
    with open(ppi_metadata_path, "wb") as file:
        pickle.dump(string_ppi_metadata, file)

else:
    string_ppi_df = pd.read_pickle(ppi_pickle_path)
    with open(ppi_metadata_path, "rb") as file:
        string_ppi_metadata = pickle.load(file)

string_ppi_df.head()

## Generating the main graph

In [16]:
combined_df = combine_sources(
    bridgedb_df,
    [
        disgenet_df,
        opentargets_gene_compound_df,
        minerva_df,
        opentargets_reactome_df,
        opentargets_go_df,
        string_ppi_df,
    ],
)
combined_metadata = create_or_append_to_metadata(
    bridgedb_metadata,
    [
        disgenet_metadata,
        opentargets_disease_compound_metadata,
        opentargets_gene_compound_metadata,
        minerva_metadata,
        opentargets_reactome_metadata,
        opentargets_go_metadata,
        string_ppi_metadata,
    ],
)

In [None]:
combined_df.head(4)

In [None]:
combined_metadata[0]

In [None]:
combined_df.shape

### Exporting the combined data and network

In [None]:
generator.save_graph(
    combined_df=combined_df,
    combined_metadata=combined_metadata,
    graph_name="PCS",
    graph_dir=os.path.join(os.getcwd(), "examples/usecases"),
    disease_compound=opentargets_disease_compound_df,
)

In [23]:
pygraph = nx.read_gml(os.path.join(os.getcwd(), "examples/usecases/PCS/PCS_graph.gml"))
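
Once the graph is loaded, a quick sanity check is to count nodes by type before exporting anything. The sketch below works on the `(node, attributes)` pairs that networkx yields from `graph.nodes(data=True)`. The attribute key `"labels"` is an assumption about how the graph stores node types, so check the actual keys on a few nodes first. It also assumes the attribute holds a single hashable value such as a string.

```python
from collections import Counter


def count_by_attribute(node_data, key, default="unknown"):
    """Count nodes by the value of one attribute.

    node_data is an iterable of (node, attribute_dict) pairs, which is what
    networkx yields from graph.nodes(data=True).
    """
    return Counter(attrs.get(key, default) for _, attrs in node_data)


# Toy stand-in for pygraph.nodes(data=True); the "labels" key and the
# node names other than the gene symbols are made up for illustration.
toy_nodes = [
    ("CTLA4", {"labels": "Gene"}),
    ("KIT", {"labels": "Gene"}),
    ("disease_1", {"labels": "Disease"}),
    ("orphan", {}),
]
toy_counts = count_by_attribute(toy_nodes, "labels")
print(toy_counts)
```

In the notebook this would be called as `count_by_attribute(pygraph.nodes(data=True), "labels")`, again assuming that key name.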

### Visualize the graph

In [None]:
pos = nx.circular_layout(pygraph)

plt.figure(3, figsize=(30, 30))
nx.draw(pygraph, pos)
plt.show()
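
A circular layout of the full graph is hard to read at this scale (the Neo4j counts later in this notebook report 19860 nodes and 101659 edges). One option is to draw only the best-connected nodes. The sketch below picks them from a plain edge list. `top_degree_nodes` is a hypothetical helper, and the cutoff is arbitrary.

```python
from collections import Counter


def top_degree_nodes(edges, n):
    """Return the n nodes that appear in the most edges, highest degree first."""
    degree = Counter()
    for source, target in edges:
        # Each edge contributes one to the degree of both endpoints.
        degree[source] += 1
        degree[target] += 1
    return [node for node, _ in degree.most_common(n)]


# Toy edge list standing in for pygraph.edges().
toy_edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C")]
print(top_degree_nodes(toy_edges, 1))  # ['A']
```

With the real graph, `hubs = top_degree_nodes(pygraph.edges(), 200)` followed by `nx.draw(pygraph.subgraph(hubs), with_labels=True)` would plot just that subset.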

## Exporting graph to external resources or databases

#### Cytoscape

In [None]:
from pyBiodatafuse.graph import cytoscape

cytoscape.load_graph(pygraph, network_name="PCS network")

#### Neo4j

In [35]:
from pyBiodatafuse.graph import neo4j

neo4j.save_graph_to_graphml(
    pygraph,
    output_path=os.path.join(
        os.getcwd(), "examples", "usecases", "PCS", "pcs_networkx_graph.graphml"
    ),
)

##### Steps to load the graph in Neo4j

- Add the `.graphml` file to the **import** subfolder of the DBMS folder
- Install the APOC plugin
- Create an `apoc.conf` file:
    ```
    apoc.trigger.enabled=true
    apoc.import.file.enabled=true
    apoc.export.file.enabled=true
    apoc.import.file.use_neo4j_config=true
    ```
- Add the `apoc.conf` file to the **conf** subfolder of the DBMS folder
- Open Neo4j Browser
- (Optional; run only if you have imported a graph before) Remove all nodes before importing the `.graphml` file

    ```
    MATCH (n) DETACH DELETE n
    ```

- Import the `.graphml` file

    ```
    call apoc.import.graphml('file:///pcs_networkx_graph.graphml',{readLabels:TRUE})
    ```

- Add indexes after importing the graph to improve query performance

    ```
    create index Gene for (n:Gene) on (n.node_type)
    create index Pathway for (n:Pathway) on (n.node_type)
    create index `Biological Process` for (n:`Biological Process`) on (n.node_type)
    create index `Molecular Function` for (n:`Molecular Function`) on (n.node_type)
    create index `Cellular Component` for (n:`Cellular Component`) on (n.node_type)
    create index Disease for (n:Disease) on (n.node_type)
    create index Compound for (n:Compound) on (n.node_type)
    create index `Side Effect` for (n:`Side Effect`) on (n.node_type)
    ```

- Count the number of each node type
    - total (```MATCH (n) RETURN count(n)```) = 19860
        - Gene (```MATCH (n:Gene) RETURN count(n)```) = 1667
        - Pathway (```MATCH (n:Pathway) RETURN count(n)```) = 1847
            - WikiPathways (```MATCH (n:Pathway {source: "WikiPathways"}) RETURN count(n)```) = 678
            - OpenTargets, Reactome (```MATCH (n:Pathway {source: "OpenTargets"}) RETURN count(n)```) = 1154
            - MINERVA (```MATCH (n:Pathway {source: "MINERVA"}) RETURN count(n)```) = 15
        - Biological Process (```MATCH (n:`Biological Process`) RETURN count(n)```) = 4624
        - Molecular Function (```MATCH (n:`Molecular Function`) RETURN count(n)```) = 1327
        - Cellular Component (```MATCH (n:`Cellular Component`) RETURN count(n)```) = 736
        - Disease (```MATCH (n:Disease) RETURN count(n)```) = 2914
            - DISGENET (```MATCH (n:Disease {source: "DISGENET"}) RETURN count(n)```) = 2913
            - Literature (```MATCH (n:Disease {source: "PMID: 37675861"}) RETURN count(n)```) = 1
        - Compound (```MATCH (n:Compound) RETURN count(n)```) = 2244
        - Side Effect (```MATCH (n:`Side Effect`) RETURN count(n)```) = 4501
- Count the number of each edge type
    - total (```MATCH ()-[r]->() RETURN count(r)```) = 101659
        - interacts_with (```MATCH ()-[r:interacts_with]->() RETURN count(r)```) = 16844
        - part_of (```MATCH ()-[r:part_of]->() RETURN count(r)```) = 30066 
            - WikiPathways (```MATCH ()-[r:part_of {source: "WikiPathways"}]->() RETURN count(r)```) = 3174
            - OpenTargets, Reactome (```MATCH ()-[r:part_of {source: "OpenTargets"}]->() RETURN count(r)```) = 26784
            - MINERVA (```MATCH ()-[r:part_of {source: "MINERVA"}]->() RETURN count(r)```) = 108
        - activates (```MATCH ()-[r:activates]->() RETURN count(r)```) = 499
        - treats (```MATCH ()-[r:treats]->() RETURN count(r)```) = 8215
        - has_side_effect (```MATCH ()-[r:has_side_effect]->() RETURN count(r)```) = 38328
        - inhibits (```MATCH ()-[r:inhibits]->() RETURN count(r)```) = 71
        - associated_with (```MATCH ()-[r:associated_with]->() RETURN count(r)```) = 7636
            - DISGENET (```MATCH ()-[r:associated_with {source: "DISGENET"}]->() RETURN count(r)```) = 7607
            - Literature (```MATCH ()-[r:associated_with {source: "PMID: 37675861"}]->() RETURN count(r)```) = 29

- Export the graph as a `.csv` file

    ```call apoc.export.csv.all("pcs_networkx_graph.csv",{})```
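
The node and edge counts listed above should add up: each breakdown should sum to its parent total. A short check of that arithmetic, using the numbers copied from the lists, might look like this. The dictionary layout and names are illustrative.

```python
# Counts copied from the node and edge lists above.
node_counts = {
    "Gene": 1667, "Pathway": 1847, "Biological Process": 4624,
    "Molecular Function": 1327, "Cellular Component": 736,
    "Disease": 2914, "Compound": 2244, "Side Effect": 4501,
}
edge_counts = {
    "interacts_with": 16844, "part_of": 30066, "activates": 499, "treats": 8215,
    "has_side_effect": 38328, "inhibits": 71, "associated_with": 7636,
}
pathway_sources = {"WikiPathways": 678, "OpenTargets": 1154, "MINERVA": 15}
disease_sources = {"DISGENET": 2913, "PMID: 37675861": 1}
part_of_sources = {"WikiPathways": 3174, "OpenTargets": 26784, "MINERVA": 108}
associated_with_sources = {"DISGENET": 7607, "PMID: 37675861": 29}

# Each entry pairs a breakdown with the total it is expected to reach.
checks = {
    "nodes": (node_counts, 19860),
    "edges": (edge_counts, 101659),
    "pathway nodes": (pathway_sources, node_counts["Pathway"]),
    "disease nodes": (disease_sources, node_counts["Disease"]),
    "part_of edges": (part_of_sources, edge_counts["part_of"]),
    "associated_with edges": (associated_with_sources, edge_counts["associated_with"]),
}
results = {name: sum(parts.values()) == total for name, (parts, total) in checks.items()}
for name, ok in results.items():
    print(f"{name}: {'OK' if ok else 'MISMATCH'}")
```

Rerunning the same check after regenerating the graph is a cheap way to spot a source that silently returned fewer records.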