<a href="https://colab.research.google.com/github/AnacletoLAB/grape/blob/main/tutorials/Graph_holdouts_using_GRAPE.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Graph holdouts using 🍇🍇 GRAPE 🍇🍇

In this tutorial, we will take a deep dive from basic holdout concepts to advanced ones, and explore how to use the [GRAPE library](https://github.com/AnacletoLAB/grape) for many advanced holdout techniques on graph data.

## Do you have holdouts for my task?
While [GRAPE](https://github.com/AnacletoLAB/grape) supports many different holdout methods, we do not claim to provide every possible evaluation method. Your task and dataset may have specific biases that require designing a custom evaluation method; nevertheless, we hope the following tutorial will provide you with detailed information that will help you make an informed choice and, if needed, design an evaluation system suited to your task.

**We stress that it is important to keep in mind that different tasks and datasets may have very different biases and may require different evaluation systems**. It is not uncommon for a model to perform well on a standard dataset but poorly on other tasks. This is because different tasks and datasets have different characteristics and may require different model architectures or training strategies.

Therefore, it is important to carefully consider the biases and requirements of each task and dataset when selecting an evaluation system. Avoid expecting that a model that performs well on one task or dataset will necessarily perform well on others. Instead, the evaluation system should be adaptable to the task at hand, rather than the task being adapted to the evaluation system or a particular dataset. **It is important to choose an evaluation method that is appropriate for the task and dataset, rather than relying on a single method as a one-size-fits-all solution.**

### On ranked model lists

It is important to keep in mind that while there are many top ten lists, rating lists, and other rankings of machine learning models and algorithms, these lists do not necessarily provide a complete picture of which models will perform well on your specific task. These lists are often based on the performance of models on standard datasets or benchmarks, which may not accurately reflect the characteristics and requirements of your task.

![Everybody is a genius. But if you judge a fish by its ability to climb a tree, it will live its whole life believing that it is stupid.](https://windmillsofmymind.weebly.com/uploads/1/6/2/1/16218570/5312347_orig.jpg)

Therefore, it is important to evaluate all models on your task in order to determine which one is most suitable for your specific needs. **While these lists can be a useful starting point, they should not be relied upon as the sole source of information when selecting a model for your task**.

Especially beware of lists **reporting model performance on a single holdout**, as this makes it impossible to run meaningful statistical tests and can be gamed trivially by bad actors.

## What is GRAPE?
🍇🍇 [GRAPE](https://github.com/AnacletoLAB/grape) 🍇🍇 is a graph processing and embedding library that enables users to easily manipulate and analyze graphs. With [GRAPE](https://github.com/AnacletoLAB/grape), users can efficiently load and preprocess graphs, generate random walks, and apply various node and edge embedding models. Additionally, [GRAPE](https://github.com/AnacletoLAB/grape) provides a fair and reproducible evaluation pipeline for comparing different graph embedding and graph-based prediction methods.

[![GRAPE vertical pipeline](https://github.com/AnacletoLAB/grape/raw/main/images/sequence_diagram.png?raw=true)](https://github.com/AnacletoLAB/grape)

In this tutorial, we will demonstrate how to use [GRAPE](https://github.com/AnacletoLAB/grape) to perform advanced holdout techniques, such as connected holdouts and degree-constrained holdouts, on a real-world graph. By the end of this tutorial, users will have a better understanding of how to apply these techniques to evaluate the generalization performance of their graph-based models.

You can learn more about the library [on its GitHub repository](https://github.com/AnacletoLAB/grape). ⭐⭐ Remember to give it a star ⭐⭐!

## What is a holdout?

A holdout is a technique for evaluating the performance of a machine learning model on unseen data. It involves dividing the dataset into a training set, which is used to fit the model, and a test set, which is used to evaluate the performance of the model on unseen data. This latter set is also referred to as the **holdout**, hence the name.

![Holdout](https://github.com/AnacletoLAB/grape/blob/main/images/holdout.png?raw=true)

### Inner holdouts
In machine learning, the training set is the set of data that is used to fit the model, the test set is the set of data that is used to evaluate the performance of the model on unseen data, and the validation set is a set of data that is used to tune the model's hyperparameters. The validation set is often used in conjunction with the training set, while the test set is used to provide an independent evaluation of the model's performance on completely unseen data. One of the many examples of tools you may use for optimizing the hyperparameters of a model is [Bayesian optimization](https://en.wikipedia.org/wiki/Bayesian_optimization).

A library worth mentioning that provides many hyperparameter optimization tools is [Ray](https://github.com/ray-project/ray).
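The three-way split described above can be sketched in a few lines of plain Python. This is only an illustration: the function name `train_validation_test_split` and its parameters are hypothetical and are not part of GRAPE or Ray.

```python
import random


def train_validation_test_split(samples, validation_size=0.1, test_size=0.2, seed=42):
    """Shuffle `samples` and split them into train, validation and test lists."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    n_validation = round(len(shuffled) * validation_size)
    # The test set is carved out first and never touched during tuning.
    test = shuffled[:n_test]
    # The validation set is used to compare hyperparameter choices.
    validation = shuffled[n_test:n_test + n_validation]
    # Everything else is used to fit the model.
    train = shuffled[n_test + n_validation:]
    return train, validation, test


train, validation, test = train_validation_test_split(range(100))
print(len(train), len(validation), len(test))  # → 70 10 20
```

Hyperparameters are compared on the validation list only, so the test list stays an independent estimate of performance on unseen data.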

#### What is Bayesian optimization?
Bayesian optimization is a method for optimizing a black-box objective function that takes in a set of hyperparameters. It is particularly useful when the objective function is expensive to evaluate, because it can be more efficient at finding good hyperparameter values than grid search or random search. The black box may be a big machine learning model, like a neural network, or more generally an expensive experiment.

![Bayesian optimization loop](https://github.com/AnacletoLAB/grape/blob/main/images/bayesian%20optimization.png?raw=true)

##### **A generic application example: marketing**
Bayesian optimization can be used in marketing to optimize various aspects of marketing campaigns, such as the targeting of ads to specific audiences, the design of landing pages, and the selection of content to promote. For example, a marketing team might use Bayesian optimization to find the optimal combination of variables, such as the demographics of the target audience, the placement of ads, and the type of content being promoted, in order to maximize the return on investment (ROI) of a marketing campaign.

In this context, the objective function being optimized could be the ROI of the marketing campaign, and the hyperparameters could be the various variables that the marketing team has control over. By using Bayesian optimization to tune these hyperparameters, the marketing team can more effectively allocate their resources and reach their desired outcomes.

##### **In machine learning models**
In the context of machine learning models, Bayesian optimization can be used to tune the hyperparameters of a model in order to improve its performance. For example, you might use Bayesian optimization to tune the learning rate, regularization coefficient, and number of layers in a neural network.

**One reason to use Bayesian optimization is that it can work with non-differentiable hyperparameters, which cannot be tuned using gradient descent**. For example, the learning rate is typically chosen by hand and is not directly optimized using the gradient of the objective function. Bayesian optimization can find good values for the learning rate and other such hyperparameters by fitting a probabilistic model of the objective over the hyperparameter space and evaluating the objective function at the points that the model suggests.

Another reason to use Bayesian optimization is that it can be more efficient than other methods, such as grid search or random search. This is because it uses a probabilistic model to guide the search process, taking into account the past performance of the objective function and making informed decisions about which points to evaluate next. This can lead to faster convergence and better final performance compared to other methods.

![Bayesian Optimization in action](https://github.com/fmfn/BayesianOptimization/raw/master/examples/bayesian_optimization.gif)

## Some holdouts taxonomy

There are several methods that can be used to evaluate the performance of a machine learning model and estimate its generalization error, which is the error the model is expected to make on unseen data.

[Please do reach out if you think of holdout types that are missing from the list!](https://github.com/AnacletoLAB/grape/issues)

<img src="https://github.com/AnacletoLAB/grape/blob/main/images/taxonomy.png?raw=true" width=400 />

### Monte Carlo holdouts
**Monte Carlo holdouts** involve randomly dividing the dataset into a training set and a test set. The model is trained on the training set and then evaluated on the test set. This process is repeated a number of times, with different random splits of the data each time, and the resulting evaluation scores are averaged to give an estimate of the model's generalization error.
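The repeated random splitting described above can be sketched as follows. The names `monte_carlo_holdouts` and `toy_score` are hypothetical; `toy_score` is only a stand-in for fitting a real model on the training set and evaluating it on the test set.

```python
import random
from statistics import mean, stdev


def monte_carlo_holdouts(samples, number_of_holdouts=10, train_size=0.8, seed=0):
    """Yield `number_of_holdouts` independent random (train, test) splits."""
    rng = random.Random(seed)
    samples = list(samples)
    n_train = round(len(samples) * train_size)
    for _ in range(number_of_holdouts):
        shuffled = rng.sample(samples, len(samples))  # a fresh permutation
        yield shuffled[:n_train], shuffled[n_train:]


def toy_score(train, test):
    """Stand-in for 'fit on train, evaluate on test':
    predict the training mean and report the mean absolute error."""
    prediction = mean(train)
    return mean(abs(value - prediction) for value in test)


data = [float(i % 7) for i in range(100)]
scores = [toy_score(train, test) for train, test in monte_carlo_holdouts(data)]
# Averaging over the holdouts gives both an estimate and its spread.
print(f"{mean(scores):.3f} +/- {stdev(scores):.3f}")
```

Because every split is drawn independently, the same sample may land in several test sets, which is the main difference from K-fold cross-validation.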

### K-fold cross-validation
In **K-fold cross-validation**, the data is split into `K` separated folds, and the model is trained and evaluated `K` times, with a different fold being used as the test set in each iteration. The performance measure is averaged across all `K` iterations.

#### Leave-one-out holdouts
**Leave-one-out holdout** is a specific case of `K`-fold cross-validation where `K` is equal to the size of the dataset. In other words, the data is split into `N` folds, where `N` is the size of the dataset, and the model is trained and evaluated `N` times, with a different data point being left out as the test set in each iteration. This method can be computationally expensive because it requires training and evaluating the model `N` times, but it can be useful when the dataset is small and there is a need to maximize the use of all available data.

<img src="https://github.com/AnacletoLAB/grape/blob/main/images/leave-one-out.jpg?raw=true" width=600 />
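Both K-fold cross-validation and its leave-one-out special case can be sketched with one small generator. The name `k_fold_splits` is hypothetical and the code is a minimal illustration, not a library implementation.

```python
import random


def k_fold_splits(samples, k=5, seed=42):
    """Yield `k` (train, test) pairs; each sample is tested exactly once."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    # Deal the shuffled samples into k folds, like dealing cards.
    folds = [shuffled[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        yield train, test


for train, test in k_fold_splits(range(10), k=5):
    print(len(train), len(test))  # prints "8 2" five times

# Leave-one-out: as many folds as there are samples.
leave_one_out = list(k_fold_splits(range(4), k=4))
print([test for _, test in leave_one_out])
```

Unlike Monte Carlo holdouts, the test folds here never overlap, so every sample contributes to the evaluation exactly once.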

### Stratified holdouts
**Stratified holdout** is a method for splitting a dataset into a training set and a test set such that the proportion of classes in the training set is the same as the proportion of classes in the original dataset. **This is especially useful when the classes in the dataset are imbalanced**, meaning that there are significantly more or fewer examples of one class compared to the others. Without stratification, the training set may not accurately represent the class distribution in the original dataset, which can lead to poor model performance on the test set.

Stratified holdout can be applied to both K-fold cross-validation and Monte Carlo cross-validation.

*Just like in a good cake, each slice has the same proportion of ingredients as the whole cake*.

<img src="https://upload.wikimedia.org/wikipedia/commons/0/04/Pound_layer_cake.jpg" width=400 />
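A stratified holdout can be sketched by splitting each class separately and then merging the pieces. The function name `stratified_holdout` and the toy labels below are illustrative assumptions.

```python
import random
from collections import Counter, defaultdict


def stratified_holdout(samples, labels, train_size=0.8, seed=42):
    """Split (sample, label) pairs so each class keeps its proportion."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)
    train, test = [], []
    for label, members in by_class.items():
        rng.shuffle(members)
        n_train = round(len(members) * train_size)  # split within the class
        train += [(sample, label) for sample in members[:n_train]]
        test += [(sample, label) for sample in members[n_train:]]
    return train, test


# An imbalanced toy dataset: 10 "rare" samples and 90 "common" ones.
labels = ["rare"] * 10 + ["common"] * 90
train, test = stratified_holdout(range(100), labels)
print(Counter(label for _, label in train))  # 72 common, 8 rare
print(Counter(label for _, label in test))   # 18 common, 2 rare
```

Both sets keep the original 1:9 ratio, whereas a plain random split of such a small rare class could easily leave the test set with no rare samples at all.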

#### Covariate shift

**Covariate shift** is a phenomenon that occurs when the distribution of features (also known as covariates) in the training set is different from the distribution of features in the set where you actually intend to run predictions. This can lead to poor model performance because the model has not seen data that is representative of the target distribution.

##### **Why does covariate shift happen?**

Covariate shift can occur due to a variety of factors, including changes in location, gender, and other demographic variables. For example, a machine learning model trained on data from one country might perform poorly when applied to data from a different country, due to differences in the distribution of input features such as cultural norms, economic conditions, and societal trends.

##### **Mitigating covariate shift**

One way to mitigate the effects of covariate shift is to use stratified holdout. By stratifying the data, we ensure that the class distribution is the same in both the training and test sets, which can help to reduce the effects of covariate shift. However, **stratified holdout is not sufficient to completely eliminate the effects of covariate shift**.

It is important to consider these factors when training and evaluating machine learning models, and to ensure that the training data is representative of the data the model will be used on in practice. This can be achieved by using a holdout period that is long enough to capture a wide range of conditions, and by periodically retraining the model on more recent data to ensure that it remains up to date and accurate. *It is also important to carefully monitor the performance of the model and be prepared to adjust it as needed if it begins to show signs of deteriorating performance.*

### Temporal holdouts

**Temporal holdouts** are a type of holdout method that involves splitting the data into a training set and a holdout set based on time. The training set consists of data from earlier time periods, meant to represent the past and known data, while the holdout set consists of data from later time periods, meant to represent the future and unknown data.

<img src="https://github.com/AnacletoLAB/grape/blob/main/images/temporal%20holdout.png?raw=true" width=400 />
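A temporal holdout is simply a split along the time axis. In this minimal sketch the name `temporal_holdout` is hypothetical and each record is assumed to be a `(timestamp, payload)` tuple.

```python
def temporal_holdout(records, train_size=0.8, timestamp=lambda record: record[0]):
    """Train on the earliest records and hold out the most recent ones."""
    ordered = sorted(records, key=timestamp)
    cut = round(len(ordered) * train_size)
    return ordered[:cut], ordered[cut:]


records = [(2021, "d"), (2018, "a"), (2020, "c"), (2019, "b"), (2022, "e")]
train, test = temporal_holdout(records)
print(train)  # → [(2018, 'a'), (2019, 'b'), (2020, 'c'), (2021, 'd')]
print(test)   # → [(2022, 'e')]
```

No shuffling is involved, so for a given dataset and `train_size` there is exactly one possible split.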

#### Limitations of temporal holdouts

One inherent limitation of using a single temporal holdout is that it does not provide any information about the variance or stability of the model's performance, as only one holdout of a given size is possible. With a single score, there is no way to estimate the standard deviation or a confidence interval of the model's performance, so a model that performs well on that one holdout set may still fail to generalize to other datasets or time periods.

This limitation makes it difficult or impossible to run statistical tests such as the *Wilcoxon test*, which compares the performance of two models and determines whether the difference in their performance is statistically significant. In order to run this test, it is necessary to have multiple holdout sets so that the variance of the model's performance can be estimated.

#### Black swans

[**A black swan event**](https://en.wikipedia.org/wiki/Black_swan_theory) is a highly improbable and unpredictable event that has significant consequences or impact. Black swan events are often characterized by their rarity, their severe impact, and the fact that they are often only retrospectively identifiable as significant events.

##### **Why "black swans"?**

The term "black swan" refers to the fact that, prior to the European exploration of Australia, people in the Western world believed that all swans were white, because that was the only type of swan they had ever seen. The discovery of black swans in Australia was therefore a surprise and a metaphor for something that was previously thought to be impossible or unlikely.

<img src="https://upload.wikimedia.org/wikipedia/commons/2/2b/Black_swan_jan09.jpg" width=400 />

##### **ML and black swans**

A black swan event, such as the COVID-19 pandemic, can cause covariate shift by significantly altering the statistical properties of the data. This can severely affect a machine learning model that has been trained and tested using a temporal holdout, because the data from the later time period no longer resembles the data the model was trained on. For example, a machine learning model trained on data from before the COVID-19 pandemic might perform poorly when applied to data from the pandemic period, due to the unprecedented changes in economic and social conditions brought about by the pandemic.

![Economical black swans](https://github.com/AnacletoLAB/grape/blob/main/images/black_swans.png?raw=true)

##### **Dealing with black swans**

To mitigate the impact of black swan events on machine learning model performance, it is important to use a holdout period that is long enough to capture a wide range of conditions, and to periodically retrain the model on more recent data to ensure that it remains up to date and accurate. It is also important to carefully monitor the performance of the model and be prepared to adjust it as needed if it begins to show signs of deteriorating performance.

### Comparing models: Wilcoxon test
[The **Wilcoxon signed-rank test**](https://en.wikipedia.org/wiki/Wilcoxon_signed-rank_test#:~:text=The%20Wilcoxon%20signed%2Drank%20test,%2Dsample%20Student's%20t%2Dtest.) is a nonparametric statistical test used to compare two related samples or repeated measurements on a single sample. The Wilcoxon test is used to determine whether there is a significant difference between the two samples or repeated measurements.

The Wilcoxon test can be used in a variety of contexts, including comparing the efficacy of different treatments in a medical study, comparing the performance of different algorithms in a machine learning experiment, or comparing the results of a survey administered at two different times.
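To make the test concrete, here is a minimal pure-Python sketch that computes the signed-rank statistic and an exact two-sided p-value by enumerating every sign assignment. The function names are hypothetical and the enumeration is only practical for a small number of holdouts; in practice a library routine such as `scipy.stats.wilcoxon` would be the usual choice.

```python
from itertools import product


def average_ranks(values):
    """Rank values from 1, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for position in range(start, end + 1):
            ranks[order[position]] = (start + end) / 2 + 1
        start = end + 1
    return ranks


def wilcoxon_signed_rank(scores_a, scores_b):
    """Return (W+, exact two-sided p-value) for paired scores."""
    # Pairs with identical scores carry no sign and are dropped.
    differences = [a - b for a, b in zip(scores_a, scores_b) if a != b]
    ranks = average_ranks([abs(d) for d in differences])
    w_plus = sum(rank for rank, d in zip(ranks, differences) if d > 0)
    expected = sum(ranks) / 2
    observed_gap = abs(w_plus - expected)
    # Under the null hypothesis every sign pattern is equally likely.
    extreme = total = 0
    for signs in product((0, 1), repeat=len(ranks)):
        w = sum(rank for rank, sign in zip(ranks, signs) if sign)
        total += 1
        extreme += abs(w - expected) >= observed_gap - 1e-12
    return w_plus, extreme / total


# Scores of two hypothetical models on the same ten holdouts, written as
# integer percentages so that tied differences compare exactly.
model_a = [91, 89, 93, 90, 92, 88, 94, 90, 91, 93]
model_b = [88, 90, 89, 85, 90, 87, 88, 86, 85, 90]
statistic, p_value = wilcoxon_signed_rank(model_a, model_b)
print(statistic, p_value)
```

The scores must be paired, that is, both models must be evaluated on the very same holdouts, which is why reproducible splits matter.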

## Graph holdouts

Holdout techniques can be applied to various types of graph-based tasks, such as edge prediction, node-label prediction, and edge-label prediction. When dealing with graphs, it is vital to take into account the effect of the graph topology on the distribution of the data.

The traditional holdout techniques such as simple random sampling may not be sufficient for evaluating the performance of graph-based models: for instance, a stratification of the labels of a node-label or edge-label prediction task may not be sufficient to provide a meaningful evaluation of the labels when the topology includes [stars and high-degree nodes, such as in citation graphs, which may lead to a number of biases](https://www.biorxiv.org/content/10.1101/2022.11.21.517376v1).

### Monte Carlo edge holdouts
**Monte Carlo edge holdouts** is a holdout technique in which the holdout set is chosen randomly from the entire dataset. This technique is useful for evaluating the performance of a model on a task that makes an open world assumption. Do note that this type of holdout may create a training set that has more connected components than the original graph.

<img src="https://github.com/AnacletoLAB/grape/blob/main/images/random_holdout.jpg?raw=true" width=300 />
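The effect on connected components can be seen on a toy graph. The sketch below is plain Python on an edge list, not GRAPE's implementation, and the names `random_edge_holdout` and `count_components` are hypothetical.

```python
import random


def count_components(nodes, edges):
    """Count connected components with a small union-find."""
    parent = {node: node for node in nodes}

    def find(node):
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for src, dst in edges:
        parent[find(src)] = find(dst)
    return len({find(node) for node in nodes})


def random_edge_holdout(edges, train_size=0.8, seed=42):
    """Split the edges uniformly at random into train and test."""
    shuffled = list(edges)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_size)
    return shuffled[:cut], shuffled[cut:]


# A path 0-1-2-...-9: every edge is a bridge.
nodes = list(range(10))
edges = [(i, i + 1) for i in range(9)]
train, test = random_edge_holdout(edges)
print(count_components(nodes, edges))  # → 1
print(count_components(nodes, train))  # → 3
```

Moving two of the nine bridges to the test set necessarily breaks the path into three pieces, whichever edges are drawn.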

### Connected Monte Carlo edge holdouts
**Connected Monte Carlo edge holdouts** is a holdout technique in which the training set is selected so that it has the same connected components as the original graph. It is used to evaluate models on tasks that make a closed world assumption.

<img src="https://github.com/AnacletoLAB/grape/blob/main/images/connected_holdout.jpg?raw=true" width=300 />
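One way to obtain such a split, sketched here under our own assumptions rather than as GRAPE's actual algorithm, is to first set aside a random spanning forest for the training set and then draw the test edges only from the remaining edges. The name `connected_edge_holdout` is hypothetical.

```python
import random


def connected_edge_holdout(nodes, edges, train_size=0.8, seed=42):
    """Split edges so that the training edges keep the graph's components."""
    shuffled = list(edges)
    random.Random(seed).shuffle(shuffled)
    parent = {node: node for node in nodes}

    def find(node):
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    tree_edges, spare_edges = [], []
    for src, dst in shuffled:
        root_src, root_dst = find(src), find(dst)
        if root_src != root_dst:
            # This edge joins two pieces: it must stay in the training set.
            parent[root_src] = root_dst
            tree_edges.append((src, dst))
        else:
            spare_edges.append((src, dst))

    wanted = len(shuffled) - round(len(shuffled) * train_size)
    n_test = min(wanted, len(spare_edges))
    return tree_edges + spare_edges[n_test:], spare_edges[:n_test]


# A ring of six nodes plus three chords: nine edges, one component.
nodes = list(range(6))
ring = [(i, (i + 1) % 6) for i in range(6)]
chords = [(0, 3), (1, 4), (2, 5)]
train, test = connected_edge_holdout(nodes, ring + chords)
print(len(train), len(test))  # → 7 2
```

The price of this guarantee is that bridges can never end up in the test set, and a graph with few spare edges may yield a smaller test set than requested.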

### Edge K-folds
**Edge K-folds** is a holdout technique in which the graph edges are divided into `k` folds, and the model is trained and evaluated `k` times, each time using a different fold as the holdout set and the remaining folds as the training set. This holdout technique may generate disconnected components.

<img src="https://github.com/AnacletoLAB/grape/blob/main/images/kfolds.jpg?raw=true" width=300 />

### Goldilocks Monte Carlo edge holdouts
**The Goldilocks fable** is a story about a young girl named Goldilocks who goes for a walk in the forest and comes across a house. Inside the house, she finds three bowls of porridge and three chairs. She tries each one and finds that the porridge from the first bowl is too hot, the porridge from the second bowl is too cold, and the porridge from the third bowl is just right. The term "Goldilocks zone" is derived from this story and refers to the idea of finding the *perfect balance or optimal range in any given situation*.

In the context of graphs, the concept of the Goldilocks zone can be applied to the evaluation of machine learning models on these data types. When working with graphs, it is important to consider the density of the graph, as this can have a significant impact on the model's ability to make accurate predictions. If areas of the graph are too densely connected, the model may make accurate predictions that are not useful due to the redundant or tautological nature of the edges. On the other hand, if an area of the graph is too sparse, there may be too little topology for the model to make meaningful predictions there. To address these issues, it can be helpful to use **Goldilocks edge holdouts**, which allow the model to be evaluated on a portion of the graph that is neither too dense nor too sparse, as defined by node degrees, but falls within the "Goldilocks zone" of optimal density.

<img alt="Stable diffusion of Goldilocks" src="https://github.com/AnacletoLAB/grape/blob/main/images/goldilocks.png?raw=true" width=300/>
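A degree-bounded holdout of this kind can be sketched as follows. We assume here that an edge is eligible for the test set only when both of its endpoints have a degree within the chosen bounds; the function name and the `min_degree` and `max_degree` parameters are hypothetical, and GRAPE's own criterion may differ.

```python
import random
from collections import Counter


def goldilocks_edge_holdout(edges, min_degree, max_degree, train_size=0.8, seed=42):
    """Hold out only edges whose endpoints are neither too sparse nor too dense."""
    degree = Counter(node for edge in edges for node in edge)

    def in_zone(edge):
        return all(min_degree <= degree[node] <= max_degree for node in edge)

    candidates = [edge for edge in edges if in_zone(edge)]
    others = [edge for edge in edges if not in_zone(edge)]
    random.Random(seed).shuffle(candidates)
    wanted = len(edges) - round(len(edges) * train_size)
    n_test = min(wanted, len(candidates))
    return others + candidates[n_test:], candidates[:n_test]


# A hub with six leaves (too dense, too sparse) and a ring among four of them.
star = [("hub", leaf) for leaf in "abcdef"]
ring = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a")]
train, test = goldilocks_edge_holdout(star + ring, min_degree=2, max_degree=4)
print(test)  # two of the ring edges; hub and leaf-only edges are never held out
```

Edges touching the hub (degree 6) or the dangling leaves `e` and `f` (degree 1) always remain in the training set.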


### Negative edges
In node embedding and classifier models, **negative sampling** is a technique used to train the model for edge prediction. This is typically done by randomly sampling pairs of nodes and labeling them as negative (i.e., no relationship exists between them) in the training set. The model is then trained to predict these negative relationships in addition to positive relationships (i.e., relationships that are known to exist between pairs of nodes).

Negative sampling can also be used to evaluate the performance of a model on edge prediction tasks. The model is evaluated on a test set of node pairs, and the performance is measured by the model's ability to correctly classify the relationships as either positive or negative.

However, edge prediction is not a true binary prediction task. One issue is that negative sampling methods typically assume that the vast majority of unlabeled edges in the graph are negative. Only a partial knowledge about the relationships between pairs of nodes is available, and there may be a large number of unlabeled edges that represent relationships that are unknown or not explicitly defined in the data.

Furthermore, when negative edges are sampled uniformly at random, the resulting sets of positive and negative edges often have very different edge degree distributions, with a higher number of positive edges connecting nodes with high degree. This can artificially inflate the measured classification performance, as the model is more likely to correctly classify negative edges between high-degree nodes.

To address this issue, one can use a [degree-aware node sampling approach](https://www.biorxiv.org/content/10.1101/2022.11.21.517376v1.full.pdf) for negative sampling, which mitigates this bias by sampling negative edges in a way that more accurately reflects the degree distribution of the graph.

<img src="https://github.com/AnacletoLAB/grape/blob/main/images/negative_edges.jpg?raw=true" width=300/>
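A simple way to make negative sampling degree-aware, sketched here as an assumption rather than as the exact procedure of the linked paper, is to draw both endpoints of each negative edge from the list of existing edge endpoints, so that a node is picked with probability proportional to its degree. The function name `sample_negative_edges` is hypothetical and the graph is treated as undirected.

```python
import random


def sample_negative_edges(edges, number_of_negatives, seed=42, max_attempts=100_000):
    """Sample non-existing edges whose endpoints follow the degree distribution."""
    rng = random.Random(seed)
    existing = {frozenset(edge) for edge in edges}
    # Each node appears here once per incident edge, i.e. `degree` times.
    endpoints = [node for edge in edges for node in edge]
    negatives = set()
    attempts = 0
    while len(negatives) < number_of_negatives and attempts < max_attempts:
        attempts += 1
        src, dst = rng.choice(endpoints), rng.choice(endpoints)
        candidate = frozenset((src, dst))
        # Reject self-loops and pairs that are already known edges.
        if src != dst and candidate not in existing:
            negatives.add(candidate)
    return [tuple(sorted(pair)) for pair in negatives]


edges = [(i, i + 1) for i in range(9)]  # a path on ten nodes
negatives = sample_negative_edges(edges, number_of_negatives=5)
print(negatives)
```

The sampled pairs are only unlabeled, not verified negatives: some of them may correspond to true relationships that are simply missing from the graph.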



## Topological oddities
One of the features of the [GRAPE](https://github.com/AnacletoLAB/grape) library is the ability to identify and report on topological oddities in the graph, such as tendril chains and dendritic trees. These features can be useful for understanding the structure of the graph and identifying areas of the graph where it may be more difficult for a machine learning model to make accurate predictions.

In addition to identifying and reporting on these features, the [GRAPE](https://github.com/AnacletoLAB/grape) library also provides tools for easily removing them from the graph, if desired. This can be useful for tasks where the presence of these topological oddities may be confusing or misleading. For example, if the goal is to predict new edges in a social network, it may be helpful to remove tendril chains or dendritic trees from the graph, as these areas may be less representative of the overall network and harder for the model to predict accurately.

### Tendrils
In a graph, a **tendril** chain is a group of nodes that are connected to each other by a series of edges, but do not take part in the main body of the graph. These tendril chains may be isolated from the rest of the graph, or they may be attached to the main body of the graph through a single node.

Tendrils can occur in many different types of graphs, including social networks, information networks, and biological networks. They may represent "outlier" nodes that are not well connected to the rest of the graph, or they may represent specialized subgroups or communities within the graph.

![Tendrils](https://github.com/AnacletoLAB/grape/blob/main/images/tendril.jpg?raw=true)

### Dendritic trees
In a graph, a **dendritic tree** is a tree-like structure that is attached to the main component of the graph. A dendritic tree is characterized by a central node (the "root" of the tree) that is connected to a number of other nodes, which are in turn connected to their own set of nodes, and so on. This creates a tree-like structure that is connected to the main component of the graph through a single point, but has a much lower density than the main component.

Dendritic trees can occur in many different types of graphs, including social networks, information networks, and biological networks. They may represent specialized subgroups or communities within the graph, or they may represent outlier nodes that are not well connected to the rest of the graph.

Dendritic trees are a common feature of knowledge graphs that include taxonomies and ontologies. These types of knowledge graphs are used to represent structured information about a particular domain, and often include a hierarchy of concepts that are organized into a tree-like structure.

In a knowledge graph with a taxonomy, a dendritic tree can represent a hierarchy of concepts, with the root node representing the most general concept and the leaf nodes representing the most specific concepts. For example, a taxonomy of animals might have a root node representing "animals," with child nodes representing different types of animals, such as "mammals," "reptiles," and "birds." Each of these child nodes might have its own set of child nodes representing more specific types of animals, and so on.

In a knowledge graph with an ontology, a dendritic tree can represent a hierarchy of classes, with the root node representing the most general class and the leaf nodes representing the most specific classes. For example, an ontology of medical concepts might have a root node representing "medical concepts," with child nodes representing different types of medical concepts, such as "diseases," "treatments," and "symptoms." Each of these child nodes might have its own set of child nodes representing more specific types of medical concepts, and so on.

![Dendritic trees](https://github.com/AnacletoLAB/grape/blob/main/images/dendritic_tree.jpg?raw=true)
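To give an intuition of how tree-like appendages can be stripped from a graph, the sketch below repeatedly removes nodes of degree one until none are left. This is a simplification under our own assumptions, not GRAPE's detection algorithm: it also strips tendrils that end in a leaf, and it consumes any component that is entirely a tree. The name `prune_tree_like_parts` is hypothetical.

```python
from collections import defaultdict


def prune_tree_like_parts(edges):
    """Repeatedly drop degree-one nodes; return (kept edges, removed nodes)."""
    neighbours = defaultdict(set)
    for src, dst in edges:
        if src != dst:  # self-loops do not affect tree structure here
            neighbours[src].add(dst)
            neighbours[dst].add(src)
    leaves = [node for node, adjacent in neighbours.items() if len(adjacent) == 1]
    removed = set()
    while leaves:
        leaf = leaves.pop()
        if leaf in removed or len(neighbours[leaf]) != 1:
            continue
        (parent,) = neighbours[leaf]
        neighbours[parent].discard(leaf)
        neighbours[leaf].clear()
        removed.add(leaf)
        # Removing a leaf may turn its parent into a new leaf.
        if len(neighbours[parent]) == 1:
            leaves.append(parent)
    kept = [(s, d) for s, d in edges if s not in removed and d not in removed]
    return kept, removed


# A triangle (the "core") with a small tree hanging off node 2.
edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (3, 5)]
kept, removed = prune_tree_like_parts(edges)
print(kept)     # → [(0, 1), (1, 2), (0, 2)]
print(removed)  # → {3, 4, 5}
```

What survives is the part of the graph where every node lies on at least one cycle or on a path between cycles.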

## Installing GRAPE
To install the GRAPE library from [PyPI (the Python Package Index)](https://pypi.org/), you need Python and pip (the Python package manager) installed on your computer; here on Colab, both are already available. You can then install the GRAPE library with the following command:

In [1]:
!pip install grape -q

  Building wheel for grape (setup.py) ... done
  Building wheel for embiggen (setup.py) ... done
  Building wheel for cache-decorator (setup.py) ... done
  Building wheel for compress-json (setup.py) ... done
  Building wheel for ddd-subplots (setup.py) ... done
  Building wheel for dict-hash (setup.py) ... done


## Retrieving the KGCOVID19 graph
In this tutorial, we will be using the KGCOVID19 knowledge graph, which is a structured representation of information about the COVID-19 pandemic. This graph was generated using the KG-COVID-19 framework, which is a tool for extracting and integrating data from various sources and creating a comprehensive knowledge graph that can be used to support decision making and knowledge management during the pandemic. We retrieve the KGCOVID19 graph through the GRAPE library and use it to illustrate a number of different holdout techniques on graph data.

In [2]:
%%time
from grape.datasets.kghub import KGCOVID19

kg = KGCOVID19()

Downloading to graphs/kghub/KGC...g-covid-19.tar.gz:   0%|          | 0.00/825M [00:00<?, ?iB/s]

CPU times: user 1min 39s, sys: 13 s, total: 1min 52s
Wall time: 3min 30s


### Graph report
In the following cell, we compute the **graph report** of KGCOVID19.

The report provides a detailed description of the structure and content of the graph, including the number and types of nodes and edges, the degree centrality of the nodes, and the presence of topological oddities.

When using the KGCOVID19 graph for holdout techniques, it is important to consider the potential impact of these topological oddities on the results. For instance, disconnected nodes and small connected components such as tuples and triples could potentially bias the results of holdout techniques, and it may be preferable to drop them depending on the task.

In [3]:
kg

## Main connected component of the graph
The `remove_components` method of [GRAPE](https://github.com/AnacletoLAB/grape) graphs removes all the components of a graph that are not connected to specific nodes or edges. Its parameters include the names and types of the nodes and edges whose components should be kept, the minimum size of the components to keep, and the number of components to keep when sorted by the number of nodes they contain. Additionally, a boolean value indicates whether to show a loading bar while the method runs. This method is useful for filtering out unimportant components in a graph and keeping only the most relevant ones.

Different filtering approaches may be necessary for different tasks.

In [4]:
%%time
kg_main_component = kg.remove_components(
    top_k_components=1
)

CPU times: user 52 s, sys: 896 ms, total: 52.9 s
Wall time: 28.1 s


## Dropping tendrils and dendritic trees
We will be using the [GRAPE](https://github.com/AnacletoLAB/grape) library to drop both tendrils and dendritic trees from a graph. By removing these tendrils and dendritic trees, we can clean up the graph and focus on only the most important connections between nodes.

In [5]:
%%time
kg_main_component_no_dt = kg_main_component.remove_dendritic_trees()

CPU times: user 1min 20s, sys: 694 ms, total: 1min 21s
Wall time: 42.2 s


Removing singleton node types:

In [6]:
%%time
kg_main_component_no_dt = kg_main_component_no_dt.remove_singleton_node_types()

CPU times: user 23.8 ms, sys: 1.94 ms, total: 25.7 ms
Wall time: 22.7 ms


Enable processing speedups:

In [7]:
kg_main_component_no_dt.enable()

## Recompute the report of the cleaned graph

After cleaning up our graph by removing tendrils and dendritic trees, we will now recompute the report of the cleaned graph. This will allow us to better understand the structure and connections within the graph, as we will have removed any unimportant or extraneous components.

In [8]:
kg_main_component_no_dt

## Random edge holdout method
The `random_holdout` method is a way to evaluate the performance of a machine learning model on a graph by dividing the edges of the graph into two sets: a training set and a validation set. The training set is used to build and train the model, while the validation set is used to evaluate the model's performance.

The proportion of edges in each set is determined by the `train_size` parameter, which is a value between `0` and `1`. For example, if `train_size` is set to `0.8`, then `80%` of the edges in the graph will be used for training, while the remaining `20%` will be used for validation.

The `random_holdout` method accepts a random seed, which makes the split reproducible. This is useful if you want to repeat an experiment or compare the performance of different models on exactly the same holdout.

The method also allows you to specify which edge types should be included in the holdout process. This can be useful if you are only interested in evaluating the performance of the model on certain types of edges.

If the graph is a multigraph, you can also specify a minimum number of overlaps required for an edge to be included in the validation set. This can be useful if you want the validation set to contain only edges between node pairs that are joined by at least that many parallel edges.

The `random_holdout` method returns a tuple of two graphs, the training and validation sets. These graphs can then be used to build and evaluate the performance of a machine learning model. The method can also display a loading bar to show the progress of the holdout process.

### Parameters

The `random_holdout` method takes the following parameters:

* `train_size`: A float value between `0` and `1` that represents the fraction of the graph's edges to be reserved for training.
* `random_state`: An optional integer value that specifies the random seed to use for the holdout.
* `include_all_edge_types`: An optional boolean value that specifies whether to include all the edges between two nodes in the holdout process.
* `edge_types`: An optional list of strings that specifies the edge types to include in the holdout process.
* `min_number_overlaps`: An optional integer value that specifies the minimum number of overlaps required for an edge to be included in the validation set. This parameter is only applicable to multigraphs.
* `verbose`: An optional boolean value that specifies whether to show a loading bar.

In [9]:
%%time
train, test = kg_main_component_no_dt.random_holdout(
    train_size=0.8,
    random_state=56
)

CPU times: user 23.1 s, sys: 163 ms, total: 23.2 s
Wall time: 18.9 s
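
The roles of `train_size` and `random_state` can be illustrated with a minimal plain-Python sketch that shuffles a toy edge list with a seeded generator and cuts it in two. It does not use GRAPE and ignores edge types and multigraph overlaps; the function name `random_edge_holdout` is hypothetical.

```python
import random


def random_edge_holdout(edges, train_size, random_state=None):
    """Shuffle the edges with a seeded generator and split them in two lists."""
    if not 0.0 < train_size < 1.0:
        raise ValueError("train_size must be strictly between 0 and 1")
    rng = random.Random(random_state)
    shuffled = list(edges)
    rng.shuffle(shuffled)
    # The first `cut` shuffled edges form the training set, the rest the test set.
    cut = round(len(shuffled) * train_size)
    return shuffled[:cut], shuffled[cut:]


# Toy graph: a path with ten edges.
toy_edges = [(i, i + 1) for i in range(10)]
toy_train, toy_test = random_edge_holdout(toy_edges, train_size=0.8, random_state=56)
print(len(toy_train), len(toy_test))
```

Calling the function again with the same `random_state` returns the same split, which is what makes comparisons between models fair. A purely random split like this one gives no guarantee that the training graph stays connected.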


## Connected edge holdout method
A connected component of a graph is a maximal group of nodes in which every node is reachable from every other node through a series of edges, and no node is connected to any node outside the group. In other words, each connected component is a self-contained subgraph within the larger graph.

One of the main features of the `connected_holdout` method is that it guarantees that the training set will have the same number of connected components as the original graph.
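
One way to picture how such a guarantee can be obtained is to first reserve a spanning forest of the graph for the training set, and then sample test edges only among the remaining edges. The sketch below does this with a union-find structure on a toy edge list. It is an illustration of the idea, not GRAPE's implementation; the names `find`, `connected_edge_holdout` and `count_components` are hypothetical, and the sketch assumes an undirected graph without edge types.

```python
import random


def find(parent, node):
    """Union-find lookup with path halving."""
    while parent[node] != node:
        parent[node] = parent[parent[node]]
        node = parent[node]
    return node


def connected_edge_holdout(edges, train_size, random_state=None):
    """Split edges so that the training edges keep every component connected."""
    rng = random.Random(random_state)
    shuffled = list(edges)
    rng.shuffle(shuffled)

    parent = {node: node for edge in shuffled for node in edge}
    tree_edges, candidates = [], []
    for src, dst in shuffled:
        root_src, root_dst = find(parent, src), find(parent, dst)
        if root_src != root_dst:
            # This edge joins two groups: it belongs to the spanning forest.
            parent[root_src] = root_dst
            tree_edges.append((src, dst))
        else:
            # Removing this edge cannot disconnect anything.
            candidates.append((src, dst))

    # The test set may be smaller than requested if too few edges are removable.
    requested = len(shuffled) - round(len(shuffled) * train_size)
    test_size = min(len(candidates), requested)
    return tree_edges + candidates[test_size:], candidates[:test_size]


def count_components(edges, nodes):
    """Count connected components among `nodes` using only `edges`."""
    parent = {node: node for node in nodes}
    for src, dst in edges:
        parent[find(parent, src)] = find(parent, dst)
    return len({find(parent, node) for node in nodes})


# Toy graph: a 4-cycle with a chord, plus a separate triangle.
toy_edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2), (10, 11), (11, 12), (12, 10)]
toy_nodes = {node for edge in toy_edges for node in edge}
toy_train, toy_test = connected_edge_holdout(toy_edges, train_size=0.75, random_state=56)
print(count_components(toy_edges, toy_nodes), count_components(toy_train, toy_nodes))
```

Because the edges are shuffled before the forest is built, both the forest and the pool of removable edges change with the seed.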

The `connected_holdout` method has several parameters that customize the holdout it creates:

* `train_size`: The proportion of the graph that should be used for training.
* `random_state`: An optional integer used to seed the random number generator. This makes the holdout reproducible, as the same seed always produces the same holdout.
* `edge_types`: An optional list of strings that specifies the edge types that may be included in the validation set. If this parameter is not provided, all edge types in the graph are considered for inclusion.
* `include_all_edge_types`: An optional boolean that determines whether all the edges between two nodes should be included in the validation set.

### Goldilocks parameters
The `minimum_node_degree` and `maximum_node_degree` parameters define a "Goldilocks zone" in the graph: a region with enough topology for the model to make accurate predictions, but not so much that the edges are redundant or obvious. The minimum node degree is the smallest number of connections a node must have to be included in the sample, and the maximum node degree is the largest number of connections it may have.

Together, these parameters define the range of node degrees that makes up the Goldilocks zone. Nodes with fewer connections than the minimum are considered too sparse to be useful for making predictions, while nodes with more connections than the maximum are considered too dense. Only nodes whose degree falls within this range are included in the sample.

This approach allows the model to make predictions in areas of the graph that have a reasonable amount of topology, while avoiding areas that are either too dense or too sparse. This can help to ensure that the model is able to make accurate predictions and draw meaningful insights from the data.

In [10]:
%%time
train, test = kg_main_component_no_dt.connected_holdout(
    # We use a split of 90/10 because the edges we are
    # requiring for the holdout are rare
    train_size=0.9,
    random_state=56,
    include_all_edge_types=True,
    minimum_node_degree=3,
    maximum_node_degree=500
)

CPU times: user 38.7 s, sys: 985 ms, total: 39.7 s
Wall time: 35.1 s
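
The degree bounds can be illustrated with a minimal plain-Python sketch that keeps, as candidates for the test set, only the edges whose endpoints fall inside the degree range. The sketch assumes that both endpoints must satisfy the bounds; GRAPE's exact rule may differ, and the function name `goldilocks_candidates` is hypothetical.

```python
from collections import Counter


def goldilocks_candidates(edges, minimum_node_degree, maximum_node_degree):
    """Return the edges whose two endpoints have a degree within the bounds."""
    degree = Counter()
    for src, dst in edges:
        degree[src] += 1
        degree[dst] += 1

    def in_zone(node):
        return minimum_node_degree <= degree[node] <= maximum_node_degree

    return [(src, dst) for src, dst in edges if in_zone(src) and in_zone(dst)]


# Toy graph: a hub "h" with four leaves, and a triangle (a, b, c) attached to the hub.
toy_edges = [
    ("h", "l1"), ("h", "l2"), ("h", "l3"), ("h", "l4"),
    ("a", "b"), ("b", "c"), ("c", "a"),
    ("a", "h"),
]
zone_edges = goldilocks_candidates(toy_edges, minimum_node_degree=2, maximum_node_degree=3)
print(zone_edges)
```

Here the leaves (degree 1) are too sparse and the hub (degree 5) is too dense, so only the triangle edges remain as candidates.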


## Node-label holdouts method
The `get_node_label_holdout_graphs` method splits the node labels of a graph into training and test sets for machine learning tasks. If the `use_stratification` argument is enabled, the train and test graphs will have the same ratios of node types. Stratification is not supported for multi-label graphs.

As with edge types, an important aspect of this implementation is that all unchanged pieces of the graph data structure, such as the node names and edges, are shared using a copy-on-write (COW) structure.

An analogous method for k-fold splits exists: `get_node_label_kfold`.

In [11]:
%%time
train, test = kg_main_component_no_dt.get_node_label_holdout_graphs(
    train_size=0.8,
    # It is impossible to create a stratified holdout when the graph has multi-label node types.
    use_stratification=False,
    random_state=4567
)

CPU times: user 59.6 ms, sys: 19.1 ms, total: 78.7 ms
Wall time: 87.6 ms
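
To show what stratification changes, here is a minimal plain-Python sketch that splits a toy mapping from node names to a single label, either globally or label by label. It does not use GRAPE, it assumes exactly one label per node (which is why it could not handle the multi-label case), and the function name `node_label_holdout` is hypothetical.

```python
import random
from collections import Counter, defaultdict


def node_label_holdout(node_labels, train_size, use_stratification=True, random_state=None):
    """Split a {node: label} mapping into train and test mappings."""
    rng = random.Random(random_state)
    if use_stratification:
        # Split each label separately so both sides keep the same label ratios.
        by_label = defaultdict(list)
        for node, label in node_labels.items():
            by_label[label].append(node)
        groups = list(by_label.values())
    else:
        groups = [list(node_labels)]

    train, test = {}, {}
    for nodes in groups:
        rng.shuffle(nodes)
        cut = round(len(nodes) * train_size)
        for node in nodes[:cut]:
            train[node] = node_labels[node]
        for node in nodes[cut:]:
            test[node] = node_labels[node]
    return train, test


# Toy labels: ten "gene" nodes and five "disease" nodes.
toy_labels = {f"gene_{i}": "gene" for i in range(10)}
toy_labels.update({f"disease_{i}": "disease" for i in range(5)})
toy_train, toy_test = node_label_holdout(toy_labels, train_size=0.8, random_state=4567)
print(Counter(toy_train.values()), Counter(toy_test.values()))
```

With stratification enabled, both sides keep the original 2:1 ratio of labels; with `use_stratification=False`, the ratio in each side depends on the shuffle.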
