# Multi Summaries Demo

The Multi-Summaries package offers a tool for computing graph summaries (based on k-forward bisimulation) and combining them into one large multi-summary graph. In this tutorial you will run the pipeline on N-Triples (`.nt`) graphs and analyze the outcomes. Please run the following cell to import dependencies.

In [None]:
from summary_loader.summary_interface import SummaryInterface
import subprocess
from pathlib import Path
from pprint import pp
from IPython.display import SVG, display

The following cell will set up the requirements for this library. The first time you run it, it will install Boost for C++. This can take over 10 minutes, so please be patient when running the following cell.

In [None]:
setup_script_path = Path("./multi-summaries/setup/setup_experiments.sh")

# On the first run this installs Boost, which can take over ten minutes
print("Setting up. If this is the first run, this will typically take over ten minutes.")
setup_result = subprocess.run(
    ["/bin/bash", setup_script_path.name, "-y"],
    cwd=setup_script_path.parent,
    capture_output=True,
    text=True
)

# Report a failure instead of always claiming success
if setup_result.returncode == 0:
    print("Setting up successful.")
else:
    print(f"Setting up failed with exit code {setup_result.returncode}:")
    print(setup_result.stderr)

Run the following cell to set up the Python interface.

In [None]:
git_hash = subprocess.run(
    ["git", "rev-parse", "HEAD"],
    cwd="./multi-summaries/",
    capture_output=True,
    text=True
).stdout.strip()

path_to_experiment_directory = f"./multi-summaries/{git_hash}/"
summarizer_interface = SummaryInterface(path_to_experiment_directory)
print(f"The following directory will store the results of the upcoming experiments: {path_to_experiment_directory}")

You can run the cell below to print the settings for the experiment. Most of the settings are meant for setting up Slurm job scripts. For this demo you do not have to change any of them. If you are interested, the README in the `multi-summaries/` directory gives more details on the settings. Do note the different components:
1. `run_all` is meant to run the full pipeline of components.
2. `preprocessor` takes in an ntriples (`.nt`) file and after preprocessing returns a compact binary file, encoding the pre-processed graph.
3. `bisimulator` reads the output from the preprocessor and computes several more-and-more refined partitions over the vertex set of the graph.
4. `summary_graphs_creator` takes the partitions generated by the bisimulator and combines them in one big joint (quotient-like) graph.
5. `results_plotter` plots some statistics about the generated multi-summary.
6. `serializer` takes the output of the summary graph creator and serializes it into several ntriples files.

In [None]:
experiment_settings = summarizer_interface.experiment_settings
pp(experiment_settings)

The following code sets the dataset that we will run the multi-summarizer on. You can change the path to any ntriples file. We have provided some already in `./data/`.

In [None]:
dataset_path = "./data/fb15k.nt"
summarizer_interface.set_dataset(dataset_path)
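
If you want to try a file of your own, it helps to know what the input looks like: an N-Triples file holds one triple per line, written as subject, predicate and object, followed by a final dot. The snippet below is only a rough sketch for inspecting such lines. The function `parse_simple_triple` and the `example.org` IRIs are made up for illustration and are not part of the Multi-Summaries package; the sketch does no validation or unescaping.

```python
def parse_simple_triple(line):
    """Split one N-Triples line into (subject, predicate, object) strings.

    Hypothetical helper for illustration: assumes a well-formed line and
    performs no validation or unescaping.
    """
    line = line.strip()
    # Skip blank lines and comment lines
    if not line or line.startswith("#"):
        return None
    # Drop the terminating dot of the statement
    if line.endswith("."):
        line = line[:-1].rstrip()
    # Split at most twice so that literals containing spaces stay intact
    subject, predicate, obj = line.split(None, 2)
    return subject, predicate, obj


example_line = "<http://example.org/alice> <http://example.org/knows> <http://example.org/bob> ."
print(parse_simple_triple(example_line))
```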

The following cell will run the full pipeline and merge the files created by the serializer into one big ntriples file. It will likely complain about the `sbatch` command not being found. This is expected, and the code should continue without it.

In [None]:
summarizer_interface.run_experiment()
summarizer_interface.merge_files("cdirs")
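
If you want to confirm that a complaint about `sbatch` simply means Slurm is not installed on your machine, a check along the following lines may help. It only uses the standard library; the helper name `slurm_available` is made up for this sketch and is not part of the package.

```python
import shutil


def slurm_available():
    """Return True if the `sbatch` executable can be found on the PATH."""
    return shutil.which("sbatch") is not None


if slurm_available():
    print("sbatch found: Slurm job submission is available.")
else:
    print("sbatch not found: a 'command not found' message is expected here.")
```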

Now that we have run the pipeline, what has actually happened? The bisimulator has created several partitions of the nodes of the input graph, each more refined than the one before it. Each partition represents a way of clustering nodes together (using forward bisimulation). Each of these partitions could define its own quotient graph (a particular compacted representation of the original graph). Instead of creating a set of independent quotient graphs, however, we combine them into one larger graph.

If that sounds like a lot to take in, then just know that we have now created a new graph in which each node represents a set of nodes in the original graph (and these sets might overlap). This representation has several nice properties, including the potential to compress the original graph if it has enough repeated structures.
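
To make the refinement idea more concrete, here is a minimal sketch of one common way to compute k-forward bisimulation partitions on a toy graph. It is not the package's implementation, and the exact definition used by the bisimulator may differ. The function name, the toy edges and the block numbering are all assumptions made for illustration. At every level, two nodes stay in the same block only if they were together at the previous level and have the same set of (edge label, previous block of the target) pairs.

```python
from collections import defaultdict


def forward_bisimulation_levels(edges, k):
    """Return partitions for levels 0..k; each maps node -> block id.

    Illustrative sketch only, not the Multi-Summaries bisimulator.
    """
    nodes = sorted({s for s, _, _ in edges} | {o for _, _, o in edges})
    out_edges = defaultdict(list)
    for s, p, o in edges:
        out_edges[s].append((p, o))

    # Level 0: every node is in the same block
    levels = [{n: 0 for n in nodes}]
    for _ in range(k):
        prev = levels[-1]
        signature_to_block = {}
        current = {}
        for n in nodes:
            # Signature: own previous block plus the set of
            # (edge label, previous block of the target) pairs
            sig = (prev[n], frozenset((p, prev[o]) for p, o in out_edges[n]))
            current[n] = signature_to_block.setdefault(sig, len(signature_to_block))
        levels.append(current)
    return levels


# Hypothetical toy graph as (subject, predicate, object) triples
toy_edges = [
    ("alice", "knows", "bob"),
    ("carol", "knows", "dave"),
    ("bob", "worksAt", "acme"),
    ("dave", "livesIn", "paris"),
]

toy_levels = forward_bisimulation_levels(toy_edges, k=2)
for level, partition in enumerate(toy_levels):
    print(level, len(set(partition.values())), "blocks")
```

In this toy graph `alice` and `carol` share a block at level 1, because both only have an outgoing `knows` edge. They are separated at level 2, because the nodes they know behave differently. A multi-summary keeps the blocks of all levels, which is why a node of the original graph can be represented by several summary nodes.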

The following cell displays how big the parts (also called blocks) are in each partition. The horizontal axis corresponds to the number of partitions created. The vertical axis represents the size of the parts in the respective partition. Finally, the color represents how many blocks there are of a given size. Due to rounding, this plot may not be very accurate on smaller datasets. The provided `fb15k.nt` dataset, however, should generate a useful plot. Can you see how this plot indeed shows that the partitions become more refined (i.e. more, smaller parts)?

In [None]:
dataset_name = Path(dataset_path).stem
dataset_plots_path = path_to_experiment_directory + f"{dataset_name}/results/"

stats_plot = "per_level_statistics.svg"
block_sizes_plot = "block_sizes_integral_kde_block_based.svg"

display(SVG(filename=dataset_plots_path+block_sizes_plot))
display(SVG(filename=dataset_plots_path+stats_plot))
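
The quantity behind the block-size plot can be reproduced by hand on a toy example. The sketch below counts, for each level, how many blocks there are of each size. The partitions in `toy_partitions` are invented for illustration and do not come from the pipeline output; the helper `block_size_histogram` is likewise hypothetical.

```python
from collections import Counter

# Hypothetical partitions, one per level: node -> block id
toy_partitions = [
    {"a": 0, "b": 0, "c": 0, "d": 0},
    {"a": 0, "b": 0, "c": 1, "d": 1},
    {"a": 0, "b": 1, "c": 2, "d": 2},
]


def block_size_histogram(partition):
    """Map block size -> number of blocks that have this size."""
    sizes = Counter(partition.values())   # block id -> block size
    return dict(Counter(sizes.values()))  # block size -> number of blocks


for level, partition in enumerate(toy_partitions):
    print(level, block_size_histogram(partition))
```

Reading the histograms from level 0 upwards shows the same trend the plot is meant to convey: one block of size 4 becomes two blocks of size 2, and then two blocks of size 1 plus one block of size 2.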

If the input graph is **not too big**, you can run the cells below to compare the original graph to its multi-summary. Note that for small graphs the multi-summary is often bigger than the original.

This cell will print out the raw data of the original graph:

In [None]:
with open(dataset_path, "r", encoding="utf-8") as f:
    for i, line in enumerate(f):
        print(line.strip())
        if i == 999:
            print("Stopped printing after 1000 triples")
            break

This cell will print out the raw data of the respective multi-summary graph:

In [None]:
multi_summary_path = path_to_experiment_directory + f"{dataset_name}/rdf_summary_graph/output_graph.nt"
with open(multi_summary_path, "r", encoding="utf-8") as f:
    for i, line in enumerate(f):
        print(line.strip())
        if i == 999:
            print("Stopped printing after 1000 triples")
            break
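
Printing the raw triples gives an impression of both graphs, but to compare their sizes it is easier to count triples. The following is a small sketch under the assumption that every non-empty, non-comment line of an N-Triples file holds exactly one triple. The helper `count_triples` is made up for illustration, and the demo runs on a throwaway file so that it works on its own.

```python
import os
import tempfile


def count_triples(nt_path):
    """Count non-empty, non-comment lines in an N-Triples file."""
    count = 0
    with open(nt_path, "r", encoding="utf-8") as f:
        for line in f:
            stripped = line.strip()
            if stripped and not stripped.startswith("#"):
                count += 1
    return count


# Self-contained demo on a temporary file with two triples
with tempfile.NamedTemporaryFile("w", suffix=".nt", delete=False, encoding="utf-8") as tmp:
    tmp.write("# a comment line\n")
    tmp.write("<http://example.org/a> <http://example.org/p> <http://example.org/b> .\n")
    tmp.write("\n")
    tmp.write("<http://example.org/b> <http://example.org/p> <http://example.org/c> .\n")

demo_count = count_triples(tmp.name)
os.remove(tmp.name)
print(demo_count)
```

In this notebook you could call `count_triples(dataset_path)` and `count_triples(multi_summary_path)` and compare the two numbers.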

Please go back to the cell where you loaded the dataset and try at least two other datasets (remember that we provided some in `./data/`). Once you have done so: Congratulations! Consider this notebook complete.