<a href="https://colab.research.google.com/github/AnacletoLAB/grape/blob/main/tutorials/Loading_a_Graph_in_Ensmallen.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Loading graphs with Ensmallen
The first step to do any work on a graph is to load it. In this notebook we will
explore the different ways a graph can be loaded with Ensmallen.

To install the GraPE library, run the usual pip install:

In [1]:
! pip install grape -q


## Automatic Graph Retrieval
Ensmallen supports the automatic download of many common graphs.
These graphs are organized by repository and follow the format `ensmallen.datasets.<repository>.<graph_name>`; in this notebook they are imported through `grape.datasets`.

The available repositories are:

In [2]:
from grape.datasets import get_available_repositories
get_available_repositories()

['linqs',
 'kgobo',
 'zenodo',
 'string',
 'kghub',
 'freebase',
 'wikidata',
 'networkrepository',
 'pheknowlatorkg',
 'yue',
 'monarchinitiative',
 'jax',
 'wikipedia']
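
The naming convention can be sketched as a small helper that builds the dotted module path and imports the retrieval function by name. The helper names `dataset_module_path` and `get_graph_retriever` are hypothetical and not part of the library; the `grape.datasets` prefix is taken from the imports used in this notebook, and the repository list is copied from the output above.

```python
import importlib

# Repository names copied from the get_available_repositories() output above.
KNOWN_REPOSITORIES = [
    "linqs", "kgobo", "zenodo", "string", "kghub", "freebase", "wikidata",
    "networkrepository", "pheknowlatorkg", "yue", "monarchinitiative",
    "jax", "wikipedia",
]


def dataset_module_path(repository: str) -> str:
    """Return the dotted module path of a repository (hypothetical helper)."""
    if repository not in KNOWN_REPOSITORIES:
        raise ValueError(f"Unknown repository: {repository!r}")
    return f"grape.datasets.{repository}"


def get_graph_retriever(repository: str, graph_name: str):
    """Import the repository module and return the retrieval function.

    Raises ImportError if grape is not installed, and AttributeError if the
    repository module has no attribute called `graph_name`.
    """
    module = importlib.import_module(dataset_module_path(repository))
    return getattr(module, graph_name)


print(dataset_module_path("string"))  # grape.datasets.string
```

With grape installed, `get_graph_retriever("string", "HomoSapiens")` is meant to return the same function as `from grape.datasets.string import HomoSapiens`.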

The list of all the available graphs can be retrieved as:

In [3]:
from grape.datasets import get_all_available_graphs_dataframe
get_all_available_graphs_dataframe()

| | repository | name | version |
|---|---|---|---|
| 0 | linqs | PubMedDiabetes | latest |
| 1 | linqs | Cora | latest |
| 2 | linqs | CiteSeer | latest |
| 3 | kgobo | MOD | 1.031.4 |
| 4 | kgobo | MOD | 10-03-2021-14-36 |
| ... | ... | ... | ... |
| 83585 | wikipedia | WikiSourceHR | 20220401 |
| 83586 | wikipedia | WikiSourceHR | 20220420 |
| 83587 | wikipedia | WikiSourceHR | 20220501 |
| 83588 | wikipedia | WikiSourceHR | 20220520 |
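
Since the same graph can appear once per version, it is often handy to group the versions by graph. The following is a minimal sketch in plain Python that uses only the rows visible in the truncated output above; the variable names are illustrative.

```python
from collections import defaultdict

# The rows visible in the truncated output above: (repository, name, version).
rows = [
    ("linqs", "PubMedDiabetes", "latest"),
    ("linqs", "Cora", "latest"),
    ("linqs", "CiteSeer", "latest"),
    ("kgobo", "MOD", "1.031.4"),
    ("kgobo", "MOD", "10-03-2021-14-36"),
    ("wikipedia", "WikiSourceHR", "20220401"),
    ("wikipedia", "WikiSourceHR", "20220420"),
    ("wikipedia", "WikiSourceHR", "20220501"),
    ("wikipedia", "WikiSourceHR", "20220520"),
]

# Map each (repository, name) pair to the list of its versions.
versions = defaultdict(list)
for repository, name, version in rows:
    versions[(repository, name)].append(version)

print(versions[("kgobo", "MOD")])
```

On the real DataFrame, the usual pandas filtering on the `repository`, `name` and `version` columns should give the same kind of answer.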


As an example, we can import the HomoSapiens graph from the STRING protein-protein interaction (PPI) repository:

In [4]:
from grape.datasets.string import HomoSapiens

The first time a graph is requested, Ensmallen automatically downloads, preprocesses and optimizes the edge and node lists. The cached files can be found in the local folder `./graphs`.

In [5]:
%%time
homosapiens = HomoSapiens()

CPU times: user 21.8 s, sys: 978 ms, total: 22.7 s
Wall time: 11.3 s


On every subsequent call, Ensmallen loads the optimized and cached data, which improves both the loading time and the peak memory usage.

In [6]:
%%time
homosapiens = HomoSapiens()

CPU times: user 18.6 s, sys: 369 ms, total: 19 s
Wall time: 8.84 s


Preprocessing and caching can be disabled through the optional parameters of the graph loader:

In [7]:
help(HomoSapiens)

Help on function HomoSapiens in module grape.datasets.string:

HomoSapiens(directed=False, preprocess='auto', load_nodes=True, load_node_types=True, load_edge_weights=True, auto_enable_tradeoffs=True, sort_tmp_dir=None, verbose=2, cache=True, cache_path=None, cache_sys_var='GRAPH_CACHE_DIR', version='links.v11.5', **kwargs) -> Graph
    Return Homo sapiens graph       
    
    Parameters
    ----------
    directed = False
    preprocess = "auto"
        Preprocess for optimal load time & memory peak.
        Will preprocess in Linux/macOS but not Windows.
    load_nodes = True
        Load node names or use numeric range
    auto_enable_tradeoffs = True
        Enable when graph has < 50M edges
    cache_path = None
        Path to store graphs
        Defaults either to `GRAPH_CACHE_DIR` sys var or `graphs`
    cache_sys_var = "GRAPH_CACHE_DIR"
    version = "links.v11.5"
        Version to retrieve     
                The available versions are:
                        - homology.

The default display of the graph is a human-readable report with its main features:

In [8]:
homosapiens

# Caching folder structure
The graphs are downloaded into the `./graphs` folder. More precisely, each graph gets a folder that follows the naming convention `./graphs/REPOSITORY/GRAPH_NAME/VERSION`. The files for the chosen graph are downloaded into this folder and extracted if needed. Then, unless preprocessing is disabled, the folder `./graphs/REPOSITORY/GRAPH_NAME/VERSION/preprocessed/DIRECTED_OR_UNDIRECTED/` is created to hold the optimized and standardized files. In the listing below, these files sit in a hash-named subfolder.
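
The naming convention can be written down as a few `pathlib` helpers. This is only a sketch: the function names are hypothetical, the `preprocessed` folder name follows the directory listing in this notebook, and the resolution order (explicit path, then the `GRAPH_CACHE_DIR` environment variable, then `graphs`) is an assumption based on the `cache_path` description in the help text.

```python
import os
from pathlib import Path
from typing import Optional


def resolve_cache_root(cache_path: Optional[str] = None,
                       cache_sys_var: str = "GRAPH_CACHE_DIR") -> Path:
    """Pick the cache root: explicit path, else environment variable, else 'graphs'.

    The order is an assumption, not confirmed library behavior.
    """
    if cache_path is not None:
        return Path(cache_path)
    return Path(os.environ.get(cache_sys_var, "graphs"))


def graph_dir(root: Path, repository: str, graph_name: str, version: str) -> Path:
    """Folder holding the downloaded (and extracted) files of one graph version."""
    return root / repository / graph_name / version


def preprocessed_dir(root: Path, repository: str, graph_name: str,
                     version: str, directed: bool = False) -> Path:
    """Folder holding the optimized files; a hash-named subfolder sits below it."""
    kind = "directed" if directed else "undirected"
    return graph_dir(root, repository, graph_name, version) / "preprocessed" / kind


root = resolve_cache_root("graphs")
print(preprocessed_dir(root, "string", "HomoSapiens", "links.v11.5").as_posix())
```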

To see an example of how this folder is structured, we can use the `seedir` library on Colab, or simply a file browser on a local machine.

In [9]:
! pip install seedir --quiet


In [10]:
from seedir import seedir
seedir("./graphs")

graphs/
├─kgobo/
│ └─ZFS/
│   └─2020-03-10/
│     ├─preprocessed/
│     │ └─undirected/
│     │   └─d26389eb68063882c73a6e08922ab136ea5260d09ac14ef436c94b9790181b88/
│     │     ├─edges.tsv
│     │     ├─metadata.json
│     │     ├─edge_types.tsv
│     │     ├─node_types.tsv
│     │     └─nodes.tsv
│     ├─zfs_kgx_tsv.tar.gz
│     └─zfs_kgx_tsv/
│       ├─zfs_kgx_tsv_edges.tsv
│       └─zfs_kgx_tsv_nodes.tsv
├─linqs/
│ └─Cora/
│   └─latest/
│     ├─edges.tsv
│     ├─preprocessed/
│     │ └─undirected/
│     │   └─6520106ecde87b4fc40d3c3c3aeb7c3922d39db88496f84d516173a548ef1457/
│     │     ├─edges.tsv
│     │     ├─metadata.json
│     │     ├─edge_types.tsv
│     │     ├─node_types.tsv
│     │     └─nodes.tsv
│     ├─cora/
│     │ └─cora/
│     │   ├─cora.content
│     │   ├─README
│     │   └─cora.cites
│     ├─cora.tgz
│     └─nodes.tsv
├─kghub/
│ └─KGMicrobe/
│   └─current/
│     ├─preprocessed/
│     │ └─undirected/
│     │   └─1a89d82964ea842fc88b4d564b6a456de30db37136e74a21f5fd7d

In [20]:
from glob import glob

node_path = glob("./graphs/string/HomoSapiens/links.v11.5/**/nodes.tsv", recursive=True)[0]
edge_path = glob("./graphs/string/HomoSapiens/links.v11.5/**/edges.tsv", recursive=True)[0]
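
The two lookups above take the first match and fail with an `IndexError` if the graph has not been downloaded yet. The Cora listing above also shows an `edges.tsv` both at the version level and inside `preprocessed/`, so a recursive search from the version folder can be ambiguous. A slightly more defensive variant, with the hypothetical helper name `find_preprocessed_file`, could look like this:

```python
from pathlib import Path


def find_preprocessed_file(version_dir: str, filename: str) -> Path:
    """Return the first `filename` found below `version_dir`/preprocessed.

    Restricting the search to the preprocessed folder skips raw files with
    the same name that sit directly in the version folder.
    """
    search_root = Path(version_dir) / "preprocessed"
    matches = sorted(search_root.rglob(filename))
    if not matches:
        raise FileNotFoundError(
            f"No {filename!r} found under {str(search_root)!r}; "
            "has the graph been downloaded and preprocessed?"
        )
    return matches[0]


# Usage, assuming the graph has already been retrieved:
# node_path = str(find_preprocessed_file("./graphs/string/HomoSapiens/links.v11.5", "nodes.tsv"))
```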

# Adding a graph to Ensmallen

If a graph is not available, you can open a GitHub issue [here](https://github.com/AnacletoLAB/ensmallen/issues) providing permanent links to the graph data and, if possible, we will add it to the automatic retrieval pipeline in the next release. To speed up the process, you can include in the issue the parameters needed to load your graph using the `from_csv` method, explained in the following section.

# Manually loading a graph
You can also load a graph manually. Ensmallen supports only TSV / CSV edge lists, and you can optionally provide node, node type and edge type lists as well.
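
To make the file layout concrete, here is a minimal sketch that writes a toy node list and edge list with the standard `csv` module. The layout mirrors the parameter values used in the `Graph.from_csv` calls later in this notebook (tab separator, headerless edge list with numeric ids and a weight in the third column, node list with a `node_name` header). The toy data, and the assumption that a numeric node id is the 0-based row position in the node list, are illustrative only; the snippet has not been run against `Graph.from_csv`.

```python
import csv
import tempfile
from pathlib import Path

workdir = Path(tempfile.mkdtemp())
toy_node_path = workdir / "nodes.tsv"
toy_edge_path = workdir / "edges.tsv"

# Node list: one header row, then one node name per row.
with open(toy_node_path, "w", newline="") as handle:
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["node_name"])
    writer.writerows([[name] for name in ("A", "B", "C")])

# Edge list: no header; columns are source id, destination id, weight.
toy_edges = [(0, 1, 0.5), (1, 2, 0.9)]
with open(toy_edge_path, "w", newline="") as handle:
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerows(toy_edges)

# Keyword arguments describing this layout, using the parameter names that
# appear in the Graph.from_csv calls of this notebook.
from_csv_kwargs = dict(
    edge_path=str(toy_edge_path),
    edge_list_separator="\t",
    edge_list_header=False,
    sources_column_number=0,
    destinations_column_number=1,
    edge_list_numeric_node_ids=True,
    weights_column_number=2,
    node_path=str(toy_node_path),
    node_list_separator="\t",
    node_list_header=True,
    nodes_column="node_name",
    directed=False,
    name="Toy",
)
# With grape installed, the intended call would be:
# graph = Graph.from_csv(**from_csv_kwargs)

print(toy_edge_path.read_text())
```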

As an example to explain how this can be done, let's manually load the pre-processed files for the graph previously downloaded.

This can be done using the `from_csv` method, which has **a lot** of optional parameters:

In [21]:
from grape import Graph
help(Graph.from_csv)

Help on built-in function from_csv:

from_csv(node_type_path, node_type_list_separator, node_types_column_number, node_types_column, node_types_ids_column_number, node_types_ids_column, node_types_number, numeric_node_type_ids, minimum_node_type_id, node_type_list_header, node_type_list_support_balanced_quotes, node_type_list_rows_to_skip, node_type_list_is_correct, node_type_list_max_rows_number, node_type_list_comment_symbol, load_node_type_list_in_parallel, node_path, node_list_separator, node_list_header, node_list_support_balanced_quotes, node_list_rows_to_skip, node_list_is_correct, node_list_max_rows_number, node_list_comment_symbol, default_node_type, nodes_column_number, nodes_column, node_types_separator, node_list_node_types_column_number, node_list_node_types_column, node_ids_column, node_ids_column_number, nodes_number, minimum_node_id, numeric_node_ids, node_list_numeric_node_type_ids, skip_node_types_if_unavailable, load_node_list_in_parallel, edge_type_path, edge_types_

We can load the graph as:

In [23]:
%%time
manually_loaded_homosapiens = Graph.from_csv(
    # Edges related parameters

    ## The path to the edges list tsv
    edge_path=edge_path,
    ## Set the tab as the separator between values
    edge_list_separator="\t",
    ## The first row should NOT be used as the column names
    edge_list_header=False,
    ## The source nodes are in the first column
    sources_column_number=0,
    ## The destination nodes are in the second column
    destinations_column_number=1,
    ## Both the source and destination columns use numeric node_ids instead of node names
    edge_list_numeric_node_ids=True,
    ## The weights are in the third column
    weights_column_number=2,

    # Nodes related parameters
    ## The path to the nodes list tsv
    node_path=node_path,
    ## Set the tab as the separator between values
    node_list_separator="\t",
    ## The first row should be used as the column names
    node_list_header=True,
    ## The column with the node names is the one with name "node_name".
    nodes_column="node_name",

    # Graph related parameters
    ## The graph is undirected
    directed=False,
    ## The name of the graph is HomoSapiens
    name="HomoSapiens",
    ## Display a progress bar, (this might be in the terminal and not in the notebook)
    verbose=True,
)

CPU times: user 15.7 s, sys: 655 ms, total: 16.3 s
Wall time: 2.87 s


In the run recorded above, the `Graph.from_csv` call took about 2.9 seconds of wall time. By default, Ensmallen thoroughly checks the files it loads, which costs time. Since we know for sure that these files are correct (because they were preprocessed), we can enable optional flags to skip the checks and speed up the loading.

In [25]:
%%time
manually_loaded_homosapiens = Graph.from_csv(
    # Edges related parameters

    ## The path to the edges list tsv
    edge_path=edge_path,
    ## Set the tab as the separator between values
    edge_list_separator="\t",
    ## The first row should NOT be used as the column names
    edge_list_header=False,
    ## The source nodes are in the first column
    sources_column_number=0,
    ## The destination nodes are in the second column
    destinations_column_number=1,
    ## Both the source and destination columns use numeric node_ids instead of node names
    edge_list_numeric_node_ids=True,
    ## The weights are in the third column
    weights_column_number=2,

    # Edges speed-ups arguments
    ## The edges are sorted
    edge_list_is_sorted=True,
    ## The edge list does not contain duplicates, and all the node_ids or
    ## node_names exist and are reasonable
    edge_list_is_correct=True,
    ## This is mainly meant for undirected graphs: it states that the edge list
    ## contains the edges in both directions, meaning that if src -> dst exists
    ## then dst -> src also exists.
    edge_list_is_complete=True,
    ## The number of edges in the file: this is used to pre-allocate memory
    ## and allows us to process all the data as a stream
    edges_number=11938498,


    # Nodes related parameters
    ## The path to the nodes list tsv
    node_path=node_path,
    ## Set the tab as the separator between values
    node_list_separator="\t",
    ## The first row should be used as the column names
    node_list_header=True,
    ## The column with the node names is the one with name "node_name".
    nodes_column="node_name",

    # Nodes speed-ups arguments
    ## The node list does not contain duplicates, and all the node_ids or
    ## node_names exist and are reasonable
    node_list_is_correct=True,
    ## The number of nodes in the file: this is used to pre-allocate memory
    ## and allows us to process all the data as a stream
    nodes_number=19566,

    # Graph related parameters
    ## The graph is undirected
    directed=False,
    ## The name of the graph is HomoSapiens
    name="HomoSapiens",
    ## Display a progress bar, (this might be in the terminal and not in the notebook)
    verbose=True,
)

CPU times: user 10.8 s, sys: 142 ms, total: 11 s
Wall time: 1.48 s
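
The speed-up flags are promises made to the loader, so it can be worth checking them on files whose origin is less certain. The sketch below computes the corresponding facts in plain Python. The function names are hypothetical, and interpreting "sorted" as lexicographic order on (source, destination) is an assumption. Everything is held in memory, so this is meant for small files rather than for the roughly 12 million edges loaded above.

```python
import csv


def read_edges(path, separator="\t"):
    """Read (source, destination) pairs of numeric node ids from a headerless file."""
    with open(path, newline="") as handle:
        return [(int(row[0]), int(row[1]))
                for row in csv.reader(handle, delimiter=separator)]


def summarize_edges(edges):
    """Compute the facts that the speed-up flags declare about an edge list."""
    edges = list(edges)
    edge_set = set(edges)
    return {
        # Value to pass as edges_number.
        "edges_number": len(edges),
        # Lexicographic order on (source, destination); assumed meaning of "sorted".
        "is_sorted": all(a <= b for a, b in zip(edges, edges[1:])),
        # Duplicates would contradict edge_list_is_correct.
        "has_duplicates": len(edge_set) != len(edges),
        # Every src -> dst has its dst -> src, as edge_list_is_complete states.
        "is_complete": all((dst, src) in edge_set for src, dst in edge_set),
    }


toy_edges = [(0, 1), (1, 0), (1, 2), (2, 1)]
print(summarize_edges(toy_edges))
```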


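We can check whether we loaded exactly the same graph by comparing the hashes of the two objects. In the run recorded below the hashes differ, so the manually loaded graph is not identical to the automatically retrieved one in at least one property: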

In [26]:
print("The original hash is {}".format(hash(homosapiens)))
print("The new hash is      {}".format(hash(manually_loaded_homosapiens)))
print("Are they equal? {}".format(hash(homosapiens) == hash(manually_loaded_homosapiens)))

The original hash is -8463466388411006992
The new hash is      -699056784427808228
Are they equal? False


These hashes are computed taking into account every property of the graph, so they can be used to ensure the reproducibility of experiments. We can show this by computing the hash after modifying the graph, in this case by removing the nodes that are not connected to any other node.

In [28]:
hash(homosapiens.remove_disconnected_nodes())

5245231756560536037
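
The idea of a hash that covers every property of a graph can be illustrated on a toy structure. The sketch below is not Ensmallen's hashing algorithm; it is a hypothetical `toy_graph_fingerprint` built on `hashlib`, and the choice of properties it covers (name, direction, node names, weighted edges) is an assumption made for the example.

```python
import hashlib


def toy_graph_fingerprint(nodes, edges, directed=False, name=""):
    """Stable digest over all properties of a toy graph (not Ensmallen's algorithm)."""
    hasher = hashlib.sha256()
    hasher.update(repr((name, directed)).encode())
    # Node order matters here: node ids are taken to be list positions.
    for node in nodes:
        hasher.update(repr(node).encode())
    # Edges are sorted so that their listing order does not change the digest.
    for edge in sorted(edges):
        hasher.update(repr(edge).encode())
    return hasher.hexdigest()


nodes = ["A", "B", "C", "D"]           # "D" is not connected to any other node
edges = [(0, 1, 0.5), (1, 2, 0.9)]     # (source id, destination id, weight)

full = toy_graph_fingerprint(nodes, edges, name="Toy")
same = toy_graph_fingerprint(nodes, list(reversed(edges)), name="Toy")
without_singleton = toy_graph_fingerprint(nodes[:3], edges, name="Toy")

print(full == same)               # same properties, same fingerprint
print(full == without_singleton)  # dropping the disconnected node changes it
```

Because any change in a covered property changes the digest, recording it next to experimental results gives a cheap check that a later run used the same graph.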