<a href="https://colab.research.google.com/github/AnacletoLAB/grape/blob/main/tutorials/Loading_a_Graph_in_Ensmallen.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Loading graphs with Ensmallen
The first step in working with a graph is loading it. In this notebook we
explore the different ways a graph can be loaded with Ensmallen.

Install Grape if it is not already installed. (On Windows, just remove the `> /dev/null` part.)

In [1]:
! pip install grape > /dev/null

## Automatic Graph Retrieval
Ensmallen supports the automatic download of many common graphs.
These graphs are organized by repository and follow the format: `ensmallen.datasets.<repository>.<graph_name>`. 

The available repositories are:

In [2]:
from ensmallen.datasets import get_available_repositories
get_available_repositories()

['networkrepository',
 'zenodo',
 'jax',
 'pheknowlatorkg',
 'linqs',
 'kghub',
 'monarchinitiative',
 'string',
 'yue']

The list of all available graphs can be retrieved as:

In [3]:
from ensmallen.datasets import get_all_available_graphs_dataframe
get_all_available_graphs_dataframe()

Unnamed: 0,repository,graph_name,version
0,networkrepository,Sw10040d2Trial1,latest
1,networkrepository,Aa4,latest
2,networkrepository,SocfbMississippi66,latest
3,networkrepository,LasagneSpanishbook,latest
4,networkrepository,SocSinaweibo,latest
...,...,...,...
57999,yue,DrugBankDDI,latest
58000,yue,node2vecPPI,latest
58001,yue,MashupPPI,latest
58002,yue,StringPPI,latest
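
Because every graph follows the `ensmallen.datasets.<repository>.<graph_name>` convention, a row of the table above can be turned into a loader programmatically. The following is a minimal sketch based on the standard-library `importlib`. The helper name `resolve_graph_loader` and its `package` parameter are illustrative assumptions, not part of Ensmallen, and the sketch assumes that each repository is importable as a submodule exposing one callable per graph.

```python
import importlib


def resolve_graph_loader(repository, graph_name, package="ensmallen.datasets"):
    """Return the attribute `graph_name` of the module `<package>.<repository>`.

    Hypothetical helper: it only mirrors the dotted naming convention and
    raises ImportError or AttributeError if the names do not exist.
    """
    module = importlib.import_module(f"{package}.{repository}")
    return getattr(module, graph_name)


# With Ensmallen installed, this should be equivalent to
# `from ensmallen.datasets.string import HomoSapiens`:
# HomoSapiens = resolve_graph_loader("string", "HomoSapiens")
```

The `repository` and `graph_name` columns of the dataframe can be passed directly as the two positional arguments.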


As an example, we can import the STRING PPI HomoSapiens graph:

In [4]:
from ensmallen.datasets.string import HomoSapiens

The first time a graph is requested, Ensmallen automatically downloads, preprocesses, and optimizes the edge and node lists. The cached files can be found in the local folder `./graphs`.

In [5]:
%%time
homosapiens = HomoSapiens()

Downloading files:   0%|          | 0/2 [00:00<?, ?it/s]

Downloading to graphs/string/Ho...inks.v11.5.txt.gz:   0%|          | 0.00/72.7M [00:00<?, ?iB/s]

Downloading to graphs/string/Ho...info.v11.5.txt.gz:   0%|          | 0.00/1.90M [00:00<?, ?iB/s]

CPU times: user 30.6 s, sys: 2.01 s, total: 32.6 s
Wall time: 1min 11s


On subsequent calls, Ensmallen loads the optimized and cached data, which improves both the loading time and the peak memory usage.

In [6]:
%%time
homosapiens = HomoSapiens()

CPU times: user 7.13 s, sys: 161 ms, total: 7.29 s
Wall time: 3.83 s


The pre-processing and the caching of the files can be disabled using the optional parameters of the graph loader: 

In [7]:
help(HomoSapiens)

Help on function HomoSapiens in module ensmallen.datasets.string.homosapiens:

HomoSapiens(directed: bool = False, preprocess: bool = True, verbose: int = 2, cache: bool = True, cache_path: str = 'graphs/string', version: str = 'links.v11.5', **additional_graph_kwargs: Dict) -> Graph
    Return new instance of the Homo sapiens graph.
    
    The graph is automatically retrieved from the STRING repository.    
    
    Parameters
    -------------------
    directed: bool = False,
        Wether to load the graph as directed or undirected.
        By default false.
    preprocess: bool = True,
        Whether to preprocess the graph to be loaded in 
        optimal time and memory.
    verbose: int = 2,
        Wether to show loading bars during the retrieval and building
        of the graph.
    cache: bool = True,
        Whether to use cache, i.e. download files only once
        and preprocess them only once.
    cache_path: str = "graphs",
        Where to store the downloaded gr

The default display of the graph is a human-readable report of its main features:

In [8]:
homosapiens

# Caching folder structure
The graphs are downloaded into the `./graphs` folder. More precisely, the folders follow this naming convention: `./graphs/REPOSITORY/GRAPH_NAME/VERSION`. The files for the chosen graph are downloaded, and extracted if needed, inside this folder. Then, if preprocessing is not disabled, the folder `./graphs/REPOSITORY/GRAPH_NAME/VERSION/preprocessed/DIRECTED_OR_UNDIRECTED/` is created, containing the optimized and standardized files.

To see an example of how this folder is structured, we can use the `seedir` library on Colab, or simply a file browser on a local machine.

In [9]:
! pip install seedir --quiet

In [10]:
from seedir import seedir
seedir("./graphs")

graphs/
└─string/
  └─HomoSapiens/
    └─links.v11.5/
      ├─9606.protein.info.v11.5.txt
      ├─9606.protein.info.v11.5.txt.gz
      ├─preprocessed/
      │ └─undirected/
      │   ├─edges.tsv
      │   ├─nodes.tsv
      │   └─metadata.json
      ├─9606.protein.links.v11.5.txt.gz
      └─9606.protein.links.v11.5.txt
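
The layout shown in the tree above can be expressed as a small path-building helper, which is handy when the preprocessed files have to be located from a script. This is only a sketch: `preprocessed_dir` is a hypothetical name, and the `directed` folder name used for directed graphs is an assumption, since only the `undirected` case appears above.

```python
from pathlib import Path


def preprocessed_dir(repository, graph_name, version, directed, root="graphs"):
    """Build the folder expected to hold the preprocessed files.

    Illustrative helper mirroring the tree above; it only builds the path
    and never touches the filesystem. The "directed" folder name is assumed.
    """
    direction = "directed" if directed else "undirected"
    return Path(root) / repository / graph_name / version / "preprocessed" / direction


folder = preprocessed_dir("string", "HomoSapiens", "links.v11.5", directed=False)
print(folder.as_posix())
# graphs/string/HomoSapiens/links.v11.5/preprocessed/undirected
```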


# Adding a graph to Ensmallen

If a graph is not available, you can open a GitHub issue [here](https://github.com/AnacletoLAB/ensmallen/issues) providing permanent links to the graph data and, if possible, we will add it to the automatic retrieval pipeline in the next release. To speed up the process, you can include in the issue the parameters needed to load your graph using the `from_csv` method, explained in the following section.

# Manually loading a graph
You can also load a graph manually. Ensmallen only supports TSV/CSV edge lists; optionally, you can also provide node, node type, and edge type lists.

As an example to explain how this can be done, let's manually load the pre-processed files for the graph previously downloaded.

This can be done using the `from_csv` method, which has **a lot** of optional parameters:

In [11]:
from ensmallen import Graph
help(Graph.from_csv)

Help on built-in function from_csv:

from_csv(node_type_path, node_type_list_separator, node_types_column_number, node_types_column, node_types_ids_column_number, node_types_ids_column, node_types_number, numeric_node_type_ids, minimum_node_type_id, node_type_list_header, node_type_list_rows_to_skip, node_type_list_is_correct, node_type_list_max_rows_number, node_type_list_comment_symbol, load_node_type_list_in_parallel, node_path, node_list_separator, node_list_header, node_list_rows_to_skip, node_list_is_correct, node_list_max_rows_number, node_list_comment_symbol, default_node_type, nodes_column_number, nodes_column, node_types_separator, node_list_node_types_column_number, node_list_node_types_column, node_ids_column, node_ids_column_number, nodes_number, minimum_node_id, numeric_node_ids, node_list_numeric_node_type_ids, skip_node_types_if_unavailable, load_node_list_in_parallel, edge_type_path, edge_types_column_number, edge_types_column, edge_types_ids_column_number, edge_types_

Let's start by inspecting how the files we want to load are formatted:

In [12]:
! head ./graphs/string/HomoSapiens/links.v11.5/preprocessed/undirected/edges.tsv

0	1	157
0	9	200
0	30	170
0	47	164
0	59	175
0	60	550
0	68	240
0	71	161
0	82	162
0	83	273


In [13]:
! head ./graphs/string/HomoSapiens/links.v11.5/preprocessed/undirected/nodes.tsv

node_name	1
9606.ENSP00000000233	
9606.ENSP00000000412	
9606.ENSP00000001008	
9606.ENSP00000001146	
9606.ENSP00000002125	
9606.ENSP00000002165	
9606.ENSP00000002596	
9606.ENSP00000002829	
9606.ENSP00000003084	


We can load the graph as:

In [14]:
%%time
manually_loaded_homosapiens = Graph.from_csv(
    # Edges related parameters

    ## The path to the edges list tsv
    edge_path="./graphs/string/HomoSapiens/links.v11.5/preprocessed/undirected/edges.tsv",
    ## Set the tab as the separator between values
    edge_list_separator="\t",
    ## The first row should NOT be used as the column names
    edge_list_header=False,
    ## The source nodes are in the first column
    sources_column_number=0,
    ## The destination nodes are in the second column
    destinations_column_number=1,
    ## Both source and destinations columns use numeric node_ids instead of node names
    edge_list_numeric_node_ids=True,
    ## The weights are in the third column
    weights_column_number=2,

    # Nodes related parameters
    ## The path to the nodes list tsv
    node_path="./graphs/string/HomoSapiens/links.v11.5/preprocessed/undirected/nodes.tsv",
    ## Set the tab as the separator between values
    node_list_separator="\t",
    ## The first row should be used as the column names
    node_list_header=True,
    ## The column with the node names is the one with name "node_name".
    nodes_column="node_name",

    # Graph related parameters
    ## The graph is undirected
    directed=False,
    ## The name of the graph is HomoSapiens
    name="HomoSapiens",
    ## Display a progress bar (this might be in the terminal and not in the notebook)
    verbose=True,
)

CPU times: user 23.2 s, sys: 1min 26s, total: 1min 49s
Wall time: 1min 37s


It's worth noting that this took 1 minute and 37 seconds, while loading the same data through the automatic retrieval takes around 4 seconds! This is because Ensmallen, by default, thoroughly checks the files it loads. Since we know for sure that these files are correct (because they were preprocessed), we can enable optional flags to skip the checks and speed up the loading.

In [15]:
%%time
manually_loaded_homosapiens = Graph.from_csv(
    # Edges related parameters

    ## The path to the edges list tsv
    edge_path="./graphs/string/HomoSapiens/links.v11.5/preprocessed/undirected/edges.tsv",
    ## Set the tab as the separator between values
    edge_list_separator="\t",
    ## The first row should NOT be used as the column names
    edge_list_header=False,
    ## The source nodes are in the first column
    sources_column_number=0,
    ## The destination nodes are in the second column
    destinations_column_number=1,
    ## Both source and destinations columns use numeric node_ids instead of node names
    edge_list_numeric_node_ids=True,
    ## The weights are in the third column
    weights_column_number=2,

    # Edges speed-ups arguments
    ## The edges are sorted
    edge_list_is_sorted=True,
    ## The edge list does not contain duplicates and all the node_ids or node_names
    ## exist and are reasonable
    edge_list_is_correct=True,
    ## This is mainly meant for undirected graphs: the edge list contains
    ## the edges in both directions, meaning that if src -> dst exists
    ## then dst -> src also exists.
    edge_list_is_complete=True,
    ## The number of edges in the file; this is used to pre-allocate memory
    ## and allows us to process all the data as a stream
    edges_number=11938498,


    # Nodes related parameters
    ## The path to the nodes list tsv
    node_path="./graphs/string/HomoSapiens/links.v11.5/preprocessed/undirected/nodes.tsv",
    ## Set the tab as the separator between values
    node_list_separator="\t",
    ## The first row should be used as the column names
    node_list_header=True,
    ## The column with the node names is the one with name "node_name".
    nodes_column="node_name",

    # Nodes speed-ups arguments
    ## The node list does not contain duplicates and all the node_ids or node_names
    ## exist and are reasonable
    node_list_is_correct=True,
    ## The number of nodes in the file; this is used to pre-allocate memory
    ## and allows us to process all the data as a stream
    nodes_number=19566,

    # Graph related parameters
    ## The graph is undirected
    directed=False,
    ## The name of the graph is HomoSapiens
    name="HomoSapiens",
    ## Display a progress bar (this might be in the terminal and not in the notebook)
    verbose=True,
)

CPU times: user 7.14 s, sys: 156 ms, total: 7.3 s
Wall time: 3.85 s
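
The `edges_number` and `nodes_number` values passed above have to match the files being loaded. One way to obtain them, sketched below with the standard library only, is to count the data rows of each file. The helper `count_data_rows` is a hypothetical name, and the sketch assumes that each value equals the number of non-empty rows in the corresponding file, excluding the header when there is one.

```python
from pathlib import Path


def count_data_rows(path, has_header=False):
    """Count the non-empty lines of a text file, optionally excluding one header line."""
    with Path(path).open("r", encoding="utf-8") as handle:
        rows = sum(1 for line in handle if line.strip())
    if has_header and rows > 0:
        rows -= 1
    return rows


# Folder of the preprocessed files used in this notebook.
base = Path("./graphs/string/HomoSapiens/links.v11.5/preprocessed/undirected")
if base.exists():
    # The edge list has no header, the node list has one.
    print("edges:", count_data_rows(base / "edges.tsv"))
    print("nodes:", count_data_rows(base / "nodes.tsv", has_header=True))
```

Counting rows requires a full pass over the file, so for large graphs it is worth storing the result alongside the data instead of recomputing it.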


We can verify that we are now loading exactly the same graph by comparing their hashes:

In [16]:
print("The original hash is {}".format(hash(homosapiens)))
print("The new hash is      {}".format(hash(manually_loaded_homosapiens)))
print("Are they equal? {}".format(hash(homosapiens) == hash(manually_loaded_homosapiens)))

The original hash is -699056784427808228
The new hash is      -699056784427808228
Are they equal? True


These hashes are computed taking into consideration every property of the graph, so they can be used to ensure the reproducibility of experiments. We can show this by computing the hash after modifying the graph, in this case by removing the nodes that are not connected to any other node.

In [17]:
hash(homosapiens.drop_disconnected_nodes())

-3404126140336579873
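
For reproducibility, the hash recorded with an experiment can be checked before the experiment is run again. The sketch below wraps that check in a hypothetical helper, `check_graph_hash`. It relies only on Python's built-in `hash`, and uses a small stand-in class so that it runs without Ensmallen. It assumes, as the text above implies, that the hash of a graph is stable across sessions.

```python
def check_graph_hash(graph, expected_hash):
    """Raise ValueError if hash(graph) differs from the recorded value."""
    actual_hash = hash(graph)
    if actual_hash != expected_hash:
        raise ValueError(
            f"Graph hash mismatch: expected {expected_hash}, got {actual_hash}"
        )
    return actual_hash


class StandInGraph:
    """Tiny stand-in used only to demonstrate the helper without Ensmallen."""

    def __init__(self, edges):
        self.edges = frozenset(edges)

    def __hash__(self):
        return hash(self.edges)


graph = StandInGraph([(0, 1), (1, 2)])
recorded = hash(graph)                   # value stored with the experiment
same_graph = StandInGraph([(1, 2), (0, 1)])
check_graph_hash(same_graph, recorded)   # same edges, so the check passes
```

With the graphs of this notebook, the call would look like `check_graph_hash(homosapiens, -699056784427808228)`, using the value printed earlier.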

# Graph Versions
When possible, the automatic retrieval pipeline supports multiple versions of the same graph. By default the latest version is downloaded, but a specific version can be selected by passing the `version` argument.

In [18]:
%%time
older_homosapiens = HomoSapiens(version="links.v11.0")
older_homosapiens.set_name("Older HomoSapiens")

Downloading files:   0%|          | 0/2 [00:00<?, ?it/s]

Downloading to graphs/string/Ho...inks.v11.0.txt.gz:   0%|          | 0.00/71.2M [00:00<?, ?iB/s]

Downloading to graphs/string/Ho...info.v11.0.txt.gz:   0%|          | 0.00/1.89M [00:00<?, ?iB/s]

CPU times: user 30.7 s, sys: 2.09 s, total: 32.8 s
Wall time: 1min 13s


We can check that this graph differs from the newer one:

In [19]:
hash(older_homosapiens), hash(homosapiens) == hash(older_homosapiens)

(667882773146874271, False)

These different versions can be used, for example, to do a temporal holdout of links. We can easily find which edges were added in the newer version by doing a set-like subtraction:

In [20]:
homosapiens - older_homosapiens
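
To make the idea of a set-like subtraction concrete, here is a conceptual sketch using plain Python sets of edges. It is not Ensmallen's implementation: the `-` operator above works on `Graph` objects, and this toy version ignores details such as weights and node lists. The node names and the `normalize` helper are illustrative assumptions.

```python
def normalize(edges):
    """Treat edges as undirected by storing each pair in sorted order."""
    return {tuple(sorted(edge)) for edge in edges}


# Toy edge lists standing in for the older and the newer graph versions.
older_edges = normalize([("A", "B"), ("B", "C")])
newer_edges = normalize([("B", "A"), ("B", "C"), ("C", "D")])

# Edges present in the newer version but not in the older one.
added_edges = newer_edges - older_edges
print(sorted(added_edges))  # [('C', 'D')]
```

In a temporal holdout, a model would be trained on the older edges and evaluated on how well it predicts the added ones.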