# Toy Example 

This notebook is a walkthrough of training and inference on a small toy graph using GiGL.


## Overview Of Components
This notebook passes a simple, human-digestible graph through each of the GiGL pipeline components in preparation for training, to help you understand how each component works.

The pipeline consists of the following components:

- **Config Populator**: Takes a template config and creates a frozen workflow config that dictates all inputs/outputs and business parameters that are read and used by each subsequent component.
    - input: template_config.yaml
    - output: frozen_gbml_config.yaml
&nbsp;

&nbsp;


- **Data Preprocessor**: Transforms node and edge feature assets, a precursor step in most ML tasks, according to the user-provided data preprocessor config class.
    - input: frozen_gbml_config.yaml, which includes the user-defined preprocessor class for custom logic; custom arguments can be passed under dataPreprocessorArgs
    - output: PreprocessedMetadata proto, which includes the inferred GraphMetadata, and the preprocessed graph data TFRecords produced by applying the user-defined preprocessing function
&nbsp;

&nbsp;


- **Subgraph Sampler**: Samples k-hop subgraphs for each node according to user provided arguments
    - input: frozen_gbml_config.yaml, resource_config.yaml
    - output: Subgraph samples (TFRecord format, based on the predefined schema in protos), stored at the URI defined in the flattenedGraphMetadata field.
&nbsp;

&nbsp;


- **Split Generator**: Splits subgraph sampler outputs into train/test/val sets according to user provided split strategy class.
    - input: frozen_gbml_config.yaml which includes instance of SplitStrategy and an instance of Assigner
    - output: TFRecord samples
&nbsp;

&nbsp;


- **Trainer**: The trainer component reads the output of split generator and trains a model on the training set, stops based on validation set, and evaluates on the test set
    - input: frozen_gbml_config.yaml
    - output: state_dict stored in trainedModelUri
&nbsp;

&nbsp;


- **Inferencer**: Runs inference of a trained model on samples generated by Subgraph Sampler. 
    - input: frozen_gbml_config.yaml
    - output: Embeddings and/or prediction assets
&nbsp;

&nbsp;
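
The component order above, together with the `start_at` argument passed to the validation check later in this notebook, can be pictured with a small sketch. This is illustrative only: apart from `config_populator`, the component identifiers below are assumed names, and the helper `components_from` is hypothetical rather than part of GiGL.

```python
from typing import List

# Illustrative only: identifiers other than "config_populator" are assumed names.
PIPELINE = [
    ("config_populator", "template config", "frozen config"),
    ("data_preprocessor", "frozen config", "preprocessed metadata + TFRecords"),
    ("subgraph_sampler", "frozen config, resource config", "subgraph samples"),
    ("split_generator", "frozen config", "train/val/test samples"),
    ("trainer", "frozen config", "trained model state_dict"),
    ("inferencer", "frozen config", "embeddings and/or predictions"),
]


def components_from(start_at: str) -> List[str]:
    """Return the component names that would run when starting at `start_at`."""
    names = [name for name, _, _ in PIPELINE]
    if start_at not in names:
        raise ValueError(f"Unknown component: {start_at!r}")
    return names[names.index(start_at):]


for name, inputs, outputs in PIPELINE:
    print(f"{name}: {inputs} -> {outputs}")
```

Every component after the Config Populator reads the same frozen config, which is why the frozen config acts as the single source of truth for the run.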

## Input Graph

We use the input graph defined in [examples/toy_visual_example/graph_config.yaml](./graph_config.yaml). 
You are welcome to change this file to a custom graph of your own choosing.


### Visualizing the input graph

In [None]:
from torch_geometric.data import HeteroData
from visualize import GraphVisualizer

from gigl.src.mocking.toy_asset_mocker import load_toy_graph

data: HeteroData = load_toy_graph(graph_config="./graph_config.yaml")
# Visualize the graph
GraphVisualizer.visualize_graph(data)

### Setting up Configs

The first thing we will need to do is create the resource and task configs. 

- **Task Config**: Specifies task-related configurations, guiding the behavior of components according to the needs of
  your machine learning task. See [Task Config Guide](../../docs/user_guide/config_guides/task_config_guide.md). For this task, we have already provided a task config: [task_config.yaml](./task_config.yaml)

- **Resource Config**: Details the resource allocation and environmental settings across all GiGL components. This
  encompasses shared resources for all components, as well as component-specific settings. See [Resource Config Guide](../../docs/user_guide/config_guides/resource_config_guide.md). For this task we provide a resource config: [resource_config.yaml](./resource_config.yaml). Note, however, that the default values provided in `shared_resource_config.common_compute_config` point to resources you will not have access to unless you are a core contributor.

  - **Instructions to configure the resource config to work**:
    If you have not already, please follow the [Quick Start Guide](../../docs/user_guide/getting_started/quick_start.md) to set up your cloud environment and a default test resource config. You can then copy the relevant `shared_resource_config.common_compute_config` to [resource_config.yaml](./resource_config.yaml).

In [None]:
import os
import pathlib

os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'

import sys
from gigl.common import Uri, UriFactory
notebook_dir = os.getcwd()
# Add the root gigl dir to the Python path so `example` folder can be imported as a module.
sys.path.append(os.path.join(notebook_dir, "..", ".."))

# You are welcome to customize these to point to your own configuration files.
JOB_NAME = "gigl_test_job"
TEMPLATE_TASK_CONFIG_PATH: Uri = UriFactory.create_uri(f"{notebook_dir}/template_task_config.yaml")
FROZEN_TASK_CONFIG_POINTER_FILE_PATH: Uri = UriFactory.create_uri(f"/tmp/GiGL/{JOB_NAME}/frozen_task_config.yaml")
pathlib.Path(FROZEN_TASK_CONFIG_POINTER_FILE_PATH.uri).parent.mkdir(parents=True, exist_ok=True)
# Ensure you change the resource config path to point to your own resource configuration
# i.e. what was exported to $GIGL_TEST_DEFAULT_RESOURCE_CONFIG as part of the quick start guide.
RESOURCE_CONFIG_PATH: Uri = UriFactory.create_uri(f"{notebook_dir}/resource_config.yaml")

# Export variables so we can reference them in cells that execute bash commands below.
os.environ["JOB_NAME"] = JOB_NAME
os.environ["TEMPLATE_TASK_CONFIG_PATH"] = TEMPLATE_TASK_CONFIG_PATH.uri
os.environ["FROZEN_TASK_CONFIG_POINTER_FILE_PATH"] = FROZEN_TASK_CONFIG_POINTER_FILE_PATH.uri
os.environ["RESOURCE_CONFIG_PATH"] = RESOURCE_CONFIG_PATH.uri

### Note on use of mocked assets

This step is already done for you. We provide instructions below for posterity, in case the mocked data input ["graph_config.yaml"](./graph_config.yaml) is updated.

Note: unless you are a core contributor, you will not have access to write to the public BQ/GCS resources. In this case, you can choose to update `MOCK_DATA_GCS_BUCKET` and `MOCK_DATA_BQ_DATASET_NAME` in `python/gigl/src/mocking/lib/constants.py` to upload to resources you own.

We run the following command to upload the relevant mocks to GCS and BQ:
```bash
python -m gigl.src.mocking.dataset_asset_mocking_suite \
--select mock_toy_graph_homogeneous_node_anchor_based_link_prediction_dataset \
--resource_config_uri=examples/toy_visual_example/resource_config.yaml
```

Subsequently, we can update the paths in [task_config.yaml](./task_config.yaml)

## Validating the configs

We provide the ability to validate your resource and task configs. Although the validation is not exhaustive, it does help assert that the more common issues are not present before expensive compute is scheduled.

In [None]:
from gigl.src.validation_check.config_validator import kfp_validation_checks
validator = kfp_validation_checks(
    job_name=JOB_NAME,
    task_config_uri=TEMPLATE_TASK_CONFIG_PATH,
    resource_config_uri=RESOURCE_CONFIG_PATH,
    start_at="config_populator",
)

### Config Populator

Takes in a template `GbmlConfig` and outputs a frozen `GbmlConfig` by populating all job related metadata paths in
`sharedConfig`. These are mostly GCS paths that the following components read from and write to, and use as an
intermediary data communication medium. For example, the field `sharedConfig.trainedModelMetadata` is populated with a
GCS URI, which indicates to the Trainer to write the trained model to this path, and to the Inferencer to read the model
from this path. See full [Config Populator Guide](../../docs/user_guide/overview/components/config_populator.md)

After running the command below, we will have created a frozen config and uploaded it to the `perm_assets_bucket` provided in the resource config.
The path to that file will be stored in the file at `FROZEN_TASK_CONFIG_POINTER_FILE_PATH`.
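
The pointer-file pattern described above can be sketched in isolation. This is a minimal illustration, not GiGL's implementation: the bucket name and URI below are hypothetical, and a temporary directory stands in for `/tmp/GiGL/<job name>/`.

```python
import pathlib
import tempfile

# Hypothetical value for illustration; the real URI is derived from your resource config.
frozen_config_uri = "gs://my-perm-assets-bucket/gigl_test_job/frozen_gbml_config.yaml"

with tempfile.TemporaryDirectory() as tmp_dir:
    pointer_file = pathlib.Path(tmp_dir) / "frozen_task_config.yaml"

    # Producer side: write only the URI of the frozen config, not the config itself.
    pointer_file.write_text(frozen_config_uri + "\n")

    # Consumer side: read the pointer and strip the trailing newline to recover the URI.
    recovered_uri = pointer_file.read_text().strip()

print(recovered_uri)
```

Keeping only a small local pointer means later cells can locate the frozen config without knowing in advance where in the bucket it was written.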

In [None]:
%%bash

python -m \
    gigl.src.config_populator.config_populator \
    --job_name="$JOB_NAME" \
    --template_uri="$TEMPLATE_TASK_CONFIG_PATH" \
    --resource_config_uri="$RESOURCE_CONFIG_PATH" \
    --output_file_path_frozen_gbml_config_uri="$FROZEN_TASK_CONFIG_POINTER_FILE_PATH"

In [None]:
# Extracting the frozen task config file path
FROZEN_TASK_CONFIG_PATH: Uri
with open(FROZEN_TASK_CONFIG_POINTER_FILE_PATH.uri, 'r') as file:
    FROZEN_TASK_CONFIG_PATH = UriFactory.create_uri(file.read().strip())
os.environ["FROZEN_TASK_CONFIG_PATH"] = FROZEN_TASK_CONFIG_PATH.uri
print(f"FROZEN_TASK_CONFIG_PATH: {FROZEN_TASK_CONFIG_PATH}")

## Visualizing the diff between template and frozen config.

As pointed out above, running the Config Populator adds job-related metadata paths to the template config. Let's see what those paths are below.

In [None]:
import yaml
from difflib import unified_diff
from IPython.display import display, HTML

from gigl.src.common.utils.file_loader import FileLoader

def sort_yaml_dict_recursively(obj: dict) -> dict:
    # We sort the yaml recursively as the GiGL proto serialization code does not guarantee order of original keys.
    # This is important for the diff to be stable and not show errors due to key/list order changes.
    if isinstance(obj, dict):
        return {k: sort_yaml_dict_recursively(obj[k]) for k in sorted(obj)}
    elif isinstance(obj, list):
        return [sort_yaml_dict_recursively(item) for item in obj]
    else:
        return obj

def show_colored_unified_diff(f1_lines, f2_lines, f1_name, f2_name):
    diff_lines = list(unified_diff(f1_lines, f2_lines, fromfile=f1_name, tofile=f2_name))
    html_lines = []
    for line in diff_lines:
        if line.startswith('+') and not line.startswith('+++'):
            color = '#228B22'  # green
        elif line.startswith('-') and not line.startswith('---'):
            color = '#B22222'  # red
        elif line.startswith('@'):
            color = '#1E90FF'  # blue
        else:
            color = "#000000"  # black
        html_lines.append(f'<pre style="margin:0; color:{color}; background-color:white;">{line.rstrip()}</pre>')
    display(HTML(''.join(html_lines)))


file_loader = FileLoader()
frozen_task_config_file_contents: str
template_task_config_file_contents: str

with open(file_loader.load_to_temp_file(file_uri_src=FROZEN_TASK_CONFIG_PATH).name, 'r') as f:
    data = yaml.safe_load(f)
    # yaml.dump sorts mapping keys by default (sort_keys=True).
    frozen_task_config_file_contents = yaml.dump(sort_yaml_dict_recursively(data))

with open(file_loader.load_to_temp_file(file_uri_src=TEMPLATE_TASK_CONFIG_PATH).name, 'r') as f:
    data = yaml.safe_load(f)
    template_task_config_file_contents = yaml.dump(sort_yaml_dict_recursively(data))

# Example usage
show_colored_unified_diff(
    template_task_config_file_contents.splitlines(),
    frozen_task_config_file_contents.splitlines(),
    f1_name='template_task_config.yaml',
    f2_name='frozen_task_config.yaml'
)

In [None]:
from itertools import zip_longest
from IPython.display import display, HTML
from gigl.src.common.utils.file_loader import FileLoader
import yaml


def sort_dict_recursively(obj: dict) -> dict:
    if isinstance(obj, dict):
        return {k: sort_dict_recursively(obj[k]) for k in sorted(obj)}
    elif isinstance(obj, list):
        return [sort_dict_recursively(item) for item in obj]
    else:
        return obj

def show_colored_unified_diff(f1_lines, f2_lines):
    print(f2_lines)
    html_lines = []
    for f2_line in f2_lines:
        if f2_line not in f1_lines:
            color = '#228B22'  # green - addition
            prefix = '+ '
            html_lines.append(f'<pre style="margin:0; color:{color}">{prefix}{f2_line.rstrip()}</pre>')
        else:
            color = "#FF06C9"
            prefix = '  '
            html_lines.append(f'<pre style="margin:0; color:{color}">{prefix}{f2_line.rstrip()}</pre>')

    display(HTML(''.join(html_lines)))

file_loader = FileLoader()
frozen_task_config_f = file_loader.load_to_temp_file(
    file_uri_src=FROZEN_TASK_CONFIG_PATH,
)
template_task_config_f = file_loader.load_to_temp_file(
    file_uri_src=TEMPLATE_TASK_CONFIG_PATH,
)

frozen_task_config_file_contents: str
template_task_config_file_contents: str

with open(frozen_task_config_f.name, 'r') as f:
    # Parse the file contents; passing the file name would only parse the path string.
    data = yaml.safe_load(f)
    frozen_task_config_file_contents = yaml.dump(sort_dict_recursively(data), sort_keys=False)

with open(template_task_config_f.name, 'r') as f:
    data = yaml.safe_load(f)
    template_task_config_file_contents = yaml.dump(sort_dict_recursively(data), sort_keys=False)

print(template_task_config_file_contents)

# Example usage
show_colored_unified_diff(template_task_config_file_contents.splitlines(), frozen_task_config_file_contents.splitlines())


In [None]:
print(TEMPLATE_TASK_CONFIG_PATH)
# template_task_config_file_contents = file_loader.load_to_temp_file(
#     file_uri_src=TEMPLATE_TASK_CONFIG_PATH,
# )
# with open(template_task_config_file_contents.name, 'r') as f:
#     template_task_config_file_contents = f.read()
#     print(template_task_config_file_contents)


frozen_task_config_file_contents = file_loader.load_to_temp_file(
    file_uri_src=FROZEN_TASK_CONFIG_PATH,
)

with open(frozen_task_config_file_contents.name, 'r') as f:
    frozen_task_config_file_contents = f.read()
    print(frozen_task_config_file_contents)

In [None]:
from difflib import unified_diff

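# Compare the template and frozen config contents loaded in the cells above.
diff = unified_diff(
    template_task_config_file_contents.splitlines(),
    frozen_task_config_file_contents.splitlines(),
    lineterm='',
)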
print('\n'.join(list(diff)))


import yaml

yaml_file = "./frozen_gbml_config.yaml"

def visualize_yaml(file_path):
    with open(file_path, 'r') as yaml_file:
        yaml_data = yaml.safe_load(yaml_file)

    print(yaml.dump(yaml_data))

print("Frozen GBML Config Yaml:")
visualize_yaml(yaml_file)

Below is a code snippet that outputs the difference between the original config file and the frozen config, to illustrate what the Config Populator adds.

In [None]:
import yaml

def compare_yaml(file_path1, file_path2):
    with open(file_path1, 'r') as yaml_file1:
        yaml_data1 = yaml.safe_load(yaml_file1)

    with open(file_path2, 'r') as yaml_file2:
        yaml_data2 = yaml.safe_load(yaml_file2)

    diff = compare_dicts(yaml_data1, yaml_data2)
    print(yaml.dump(diff, default_flow_style=False))

def compare_dicts(dict1, dict2):
    diff = {}
    for key in set(dict2.keys()):
        if key not in dict1:
            diff[key] = dict2[key]
    return diff

gbml_config = "toy_graph/configs/gbml_toy_config.yaml"
frozen_config_local = "./frozen_gbml_config.yaml"
compare_yaml(gbml_config, frozen_config_local)
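
Note that `compare_dicts` above only reports keys that are new at the top level of the config. Since the Config Populator fills in paths nested under `sharedConfig`, a recursive variant may be more informative. The following is a sketch under that assumption; the function name `added_entries` and the example configs are hypothetical.

```python
from typing import Any, Dict


def added_entries(template: Dict[str, Any], frozen: Dict[str, Any]) -> Dict[str, Any]:
    """Return entries present in `frozen` but missing from `template`, at any depth."""
    diff: Dict[str, Any] = {}
    for key, frozen_value in frozen.items():
        if key not in template:
            diff[key] = frozen_value
        elif isinstance(frozen_value, dict) and isinstance(template[key], dict):
            # Recurse into nested mappings and keep only non-empty differences.
            nested = added_entries(template[key], frozen_value)
            if nested:
                diff[key] = nested
    return diff


# Hypothetical configs, for illustration only.
template = {"sharedConfig": {"isGraphDirected": True}}
frozen = {
    "sharedConfig": {
        "isGraphDirected": True,
        "trainedModelMetadata": {"trainedModelUri": "gs://bucket/model.pt"},
    },
}
print(added_entries(template, frozen))
```

This variant ignores values that changed in place and list contents; it only surfaces keys that were added.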

### Visualizing Input Graph

We have now configured everything required to run the subsequent steps. Before proceeding we can visualize what the input graph looks like to get a better understanding of what happens in each of the steps.

In [None]:
from visualize import GraphVisualizer

%config InlineBackend.figure_format = 'svg'

graph_visualizer = GraphVisualizer("./graph_config.yaml")
graph_visualizer.visualize_graph()

### Data Preprocessor

The Data Preprocessor uses TensorFlow Transform to perform data transformation in a distributed fashion. Any custom preprocessing is defined in a preprocessor class that inherits from the `DataPreprocessorConfig` class.

Overall, this class houses all logic for

- Preparing datasets for ingestion and transformation (see the [`prepare_for_pipeline`](https://github.com/Snapchat/GiGL/blob/10f1a35196f3946ae14c3e8e57d1cb685f01ffb5/python/gigl/src/data_preprocessor/lib/data_preprocessor_config.py#L40) function). This is where you would house logic to pull data from a custom data source or perform any specific transformations.
- Defining transformation imperatives for different node types (see [`get_nodes_preprocessing_spec`](https://github.com/Snapchat/GiGL/blob/10f1a35196f3946ae14c3e8e57d1cb685f01ffb5/python/gigl/src/data_preprocessor/lib/data_preprocessor_config.py#L54) function)
- Defining transformation imperatives for different edge types (see [`get_edges_preprocessing_spec`](https://github.com/Snapchat/GiGL/blob/10f1a35196f3946ae14c3e8e57d1cb685f01ffb5/python/gigl/src/data_preprocessor/lib/data_preprocessor_config.py#L60))

Upon completion, the Data Preprocessor writes out a PreprocessedMetadata proto as TFRecords to the URI specified by the `preprocessedMetadataUri` field in the `sharedConfig` section of the frozen config, as seen below:

In [None]:
print("Frozen Config Datapreprocessor Information:")

print("Preprocessed Metadata Uri: ", frozen_config.shared_config.preprocessed_metadata_uri)
print("Data Preprocessor Config: ", frozen_config.dataset_config.data_preprocessor_config)
print("Flattened Graph Metadata: ", frozen_config.shared_config.flattened_graph_metadata)



This proto contains a map to all information about the graph. The structure of the graph is unchanged from the input graph; only feature transformations (e.g., normalization) are applied.
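
To illustrate the point that preprocessing changes feature values but not graph structure, here is a minimal sketch using z-score normalization on a toy graph. The data and the choice of z-score normalization are assumptions for illustration; the actual transforms are whatever your preprocessor class defines.

```python
import statistics

# Hypothetical toy data: one scalar feature per node, plus an edge list.
node_features = {0: 2.0, 1: 4.0, 2: 6.0}
edges = [(0, 1), (1, 2)]

values = list(node_features.values())
mean = statistics.fmean(values)
std = statistics.pstdev(values)

# Z-score normalization rescales feature values only.
normalized = {node: (value - mean) / std for node, value in node_features.items()}

print(normalized)
print(edges)  # The edge list is untouched: same nodes, same connections.
```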

To run the preprocessor we can do the following:

In [None]:
%%bash -s "$frozen_config_uri"

python -m \
    gigl.src.data_preprocessor.data_preprocessor \
    --job_name toy_graph \
    --task_config_uri "$1"

Upon completion, you will see three generated artifacts at the specified GCS URI: `edge`, `node`, and `preprocessed_metadata.yaml`. The metadata contains all the inferred GraphMetadata of the graph. One unformatted sample from a TFRecord, along with the data that is actually stored, is visualized below.

In [None]:
from visualize_preprocessor_output import visualize_preprocessed_graph

preprocessed_metadata_uri = frozen_config.shared_config.preprocessed_metadata_uri
node_df, edge_df = visualize_preprocessed_graph(preprocessed_metadata_uri)

The output above visualizes one sample from each of the TFRecords that the Data Preprocessor creates. To visualize this output better, we can iterate through these TFRecords and store them in two dataframes (`node_df` and `edge_df`).

In [None]:
node_df

In [None]:
edge_df

### Subgraph Sampler

The Subgraph Sampler receives node and edge data from the Data Preprocessor and generates k-hop localized subgraphs for each node in the graph. The purpose is to store the neighborhood of each node independently, thereby reducing the memory footprint of downstream components, which only need to load batches of these node neighborhoods rather than the entire graph.
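
As a rough picture of what a k-hop neighborhood is, the following sketch collects every node within `k` hops of a root using breadth-first search. It is a simplified, single-machine illustration on a hypothetical toy graph, not the Subgraph Sampler's actual algorithm.

```python
from collections import deque
from typing import Dict, List, Set


def k_hop_neighborhood(adjacency: Dict[int, List[int]], root: int, k: int) -> Set[int]:
    """Return all nodes reachable from `root` in at most `k` hops (including `root`)."""
    visited = {root}
    frontier = deque([(root, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == k:
            continue  # Do not expand beyond k hops.
        for neighbor in adjacency.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return visited


# Hypothetical toy graph as an undirected adjacency list (a path 0-1-2-3-4).
adjacency = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}

# One neighborhood per root node, each stored independently of the others.
subgraphs = {root: k_hop_neighborhood(adjacency, root, k=2) for root in adjacency}
print(subgraphs[0])
```

Because each root's neighborhood is materialized separately, a downstream component can process a batch of roots without touching the rest of the graph.
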
To run subgraph sampler we use the following command:

In [None]:
%%bash

cd ~/GiGL

make compile_jars

python -m gigl.src.subgraph_sampler.subgraph_sampler \
  --job_name="toy_graph" \
  --task_config_uri="gs://TEMP DEV GBML PLACEHOLDER/toy_graph/config_populator/frozen_gbml_config.yaml" \
  --resource_config_uri="deployment/configs/e2e_cicd_resource_config.yaml" \
  --main_jar_file_uri="$PACKAGE_GCS_PATH"

Upon completion, there will be two different directories of subgraph samples. One contains the main node anchor based link prediction samples, and the other contains random negative rooted neighborhood samples. Both are stored in the locations specified in the frozen_config:

In [None]:
print(frozen_config.shared_config.flattened_graph_metadata)


The main `unsupervised_node_anchor_based_link_prediction_samples` include each root node's k-hop neighborhood, the k-hop neighborhoods of its positive nodes, and the positive edges. These samples will be used for training.
The `random_negative_rooted_neighborhood_samples` (which include each root node's k-hop neighborhood) serve two purposes: they are used by the Inferencer, and as random negative samples for training.

The random negatives let the model learn non-existent (negative) edges. Trained on positive samples alone, the model could overfit and fail to generalize to unseen data. A negative edge is simply an edge chosen at random; at large scale, a randomly chosen edge is very likely to be a true negative.
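
The claim that a randomly chosen edge is very likely a true negative can be checked empirically on a sparse graph. The sketch below uses a hypothetical ring graph and uniform sampling of node pairs, both of which are assumptions for illustration.

```python
import random

# Hypothetical sparse graph: 1,000 nodes on a ring, so 1,000 directed edges.
num_nodes = 1_000
positive_edges = {(i, (i + 1) % num_nodes) for i in range(num_nodes)}

rng = random.Random(0)  # Seeded for reproducibility.
num_samples = 10_000
random_edges = [
    (rng.randrange(num_nodes), rng.randrange(num_nodes)) for _ in range(num_samples)
]

# Count how many randomly drawn pairs happen to be real (positive) edges.
collisions = sum(edge in positive_edges for edge in random_edges)
collision_rate = collisions / num_samples
print(f"Fraction of random pairs that are real edges: {collision_rate:.4f}")
```

With 1,000 real edges among 1,000,000 possible ordered pairs, only about 0.1% of random pairs are expected to collide with a positive edge, and the fraction shrinks further as sparse graphs grow.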

Below we visualize the root node neighborhood of node 5, the root node neighborhood of its positive edge's destination node (1), and the resulting sample for root node 5.

In [None]:
from visualize_sgs_output import SGSVisualizer

sgs_visualizer = SGSVisualizer(frozen_config_uri)
sgs_visualizer.visualize_random_negative_sample(5)
sgs_visualizer.visualize_random_negative_sample(1)
sgs_visualizer.visualize_node_anchor_prediction_sample(5)

### Split Generator

The Split Generator reads the localized subgraph samples produced by the Subgraph Sampler, and executes the user-specified split strategy logic to split the data into training, validation, and test sets. Several standard configurations of SplitStrategy and corresponding Assigner classes are already implemented at the GiGL platform level: transductive node classification, inductive node classification, and transductive link prediction split routines. For more information on split strategies in graph machine learning, check out these resources:

1. http://web.stanford.edu/class/cs224w/slides/07-theory.pdf
2. https://zqfang.github.io/2021-08-12-graph-linkpredict/ (relevant for explaining transductive vs inductive) 

In this example, we are using the transductive strategy as specified in our frozen_config:

In [None]:
print(frozen_config.dataset_config.split_generator_config)

With the transductive strategy, the training message edges are used at training time to predict the training supervision edges. At validation time, the training message edges and training supervision edges are used to predict the validation edges, and at test time all three are used to predict the test edges. Below is the command to run the Split Generator:


In [None]:
%%bash

python -m \
    gigl.src.split_generator.split_generator \
    --job_name toy_graph \
    --task_config_uri gs://TEMP DEV GBML PLACEHOLDER/toy_graph/config_populator/frozen_gbml_config.yaml

Upon completion, there will be three folders: train, test, and val. Each of them contains the protos for the positive and negative samples. The paths for these folders are specified in the following location in the frozen_config:

In [None]:
print(frozen_config.shared_config.dataset_metadata)
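
To make the transductive scheme described earlier concrete, the sketch below splits an edge list so that each phase predicts held-out edges from the edges seen so far. The split ratios, function name, and edge list are assumptions for illustration and do not reflect GiGL's SplitStrategy or Assigner implementations.

```python
import random
from typing import Dict, List, Tuple

Edge = Tuple[int, int]


def transductive_split(edges: List[Edge], seed: int = 0) -> Dict[str, Dict[str, List[Edge]]]:
    """Split edges so each phase predicts held-out edges using the edges seen so far."""
    shuffled = edges[:]
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    # Assumed ratios for illustration: 40% message, 40% supervision, 10% val, 10% test.
    a, b, c = n * 4 // 10, n * 8 // 10, n * 9 // 10
    train_msg, train_sup = shuffled[:a], shuffled[a:b]
    val, test = shuffled[b:c], shuffled[c:]
    return {
        "train": {"message": train_msg, "supervision": train_sup},
        "val": {"message": train_msg + train_sup, "supervision": val},
        "test": {"message": train_msg + train_sup + val, "supervision": test},
    }


# Hypothetical edge list for illustration.
edges = [(i, i + 1) for i in range(20)]
splits = transductive_split(edges)
for phase, parts in splits.items():
    print(phase, len(parts["message"]), len(parts["supervision"]))
```

The message edges grow from phase to phase while the supervision edges stay disjoint, so no edge the model is asked to predict is ever visible as input in the same phase.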


We can visualize the train, test, and val samples for the same root node as above (5) to follow it through the pipeline.

In [None]:
from visualize_sgn_output import SGNVisualizer

sgn_vis = SGNVisualizer(frozen_config_uri)

In [None]:
sgn_vis.visualize_main_data_output(5)

At this point, we have our graph data samples ready to be processed by the trainer and inferencer components. These components will extract representations/embeddings by learning contextual information for the specified task.
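
As noted in the overview, the Trainer stops based on the validation set. One common way to do this is patience-based early stopping, sketched below. The function, the `patience` parameter, and the loss values are hypothetical; GiGL's Trainer may use a different stopping criterion.

```python
from typing import List


def early_stopping_epoch(val_losses: List[float], patience: int) -> int:
    """Return the index of the epoch at which training would stop."""
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    # Validation loss kept improving often enough; training ran to the last epoch.
    return len(val_losses) - 1


# Hypothetical validation losses, one per epoch.
val_losses = [0.90, 0.70, 0.65, 0.66, 0.68, 0.67, 0.64]
print(early_stopping_epoch(val_losses, patience=2))
```

The test set plays no role in this decision; it is held back so that the final evaluation is not influenced by when training stopped.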