[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/B1AAB/GraphStudio/blob/main/quickstart/graph_101.ipynb)

# Get to Know Bitcoin Graph

This notebook serves as a guided tour of the [Bitcoin Graph](https://registry.opendata.aws/eba) dataset. More usage examples, tutorials, and documentation for this dataset and others can be found at the [Registry of Open Data on AWS](https://registry.opendata.aws/).


The Bitcoin Graph dataset provides a complete history of transactions 
from the Bitcoin blockchain, structured for machine learning. 
The [v1 release](https://eba.b1aab.ai/releases/data-release/v1) consists of 
over `2.4` billion nodes and `39.72` billion edges.


Given the size of the graph, 
a common machine learning approach is to train models on sampled communities. 
To facilitate working with the graph, 
we have split the resources into the following steps 
(see [this page](https://eba.b1aab.ai/docs/bitcoin/overview) for details), 
which are ordered according to the [ETL pipeline](https://eba.b1aab.ai/docs/bitcoin/etl/overview).


1.  Bitcoin Graph prepared for [Neo4j import](https://eba.b1aab.ai/docs/bitcoin/etl/import) 
    and its [ready-to-use database snapshot](https://eba.b1aab.ai/docs/bitcoin/etl/import#load) 
    (_hosted on Registry of Open Data on AWS_)

2.  "Hello-World" [Sampled communities](https://www.kaggle.com/datasets/vjalili/bitcoin-graph-sampled-communities)

3.  Develop and train a 
    ["Hello-World" script classification model](https://github.com/B1AAB/GraphStudio/tree/main/quickstart/script_classification) 
    on the sampled communities.

4.  Evaluate the trained model (**the focus of this notebook**)

### Q: How have you organized your dataset? Help us understand the key prefix structure of your S3 bucket.

Importing the Bitcoin Graph into a Neo4j database and sampling communities are 
both long-running and resource-intensive processes. 
For a quick start, we provide "Hello-World" sampled communities. 
However, we highly recommend importing the graph and using our tools to sample your own communities, 
as this allows you to tailor the data to your application's specific requirements. 
See [this page](https://eba.b1aab.ai/docs/bitcoin/overview) for details.


Next, we will briefly review the directory structure of the Bitcoin Graph data 
hosted on the Registry of Open Data on AWS. 
The rest of this notebook will then walk through an evaluation of a 
["Hello-World" model](https://github.com/B1AAB/GraphStudio/tree/main/quickstart/script_classification), 
serving as a 101 introduction to the dataset. 

The Bitcoin Graph is provided in two formats: 

-   `/data_to_import_neo4j`:

    This directory contains the graph formatted for the 
    [Neo4j admin bulk import](https://neo4j.com/docs/operations-manual/current/import) tool. 
    See [this guide](https://eba.b1aab.ai/docs/bitcoin/etl/import) for instructions on how to import the data.

    The data is provided as _headerless_ compressed TSV files for efficient import into Neo4j. 
    The graph is organized as follows:

    *   Edges are grouped into smaller, timestamped batches. 
        This design enables loading the data into applications that rely on in-memory processing.

    *   Nodes (Script and Transaction) are provided in separate, 
        un-batched files that contain only unique nodes (`unique_` prefix). 
        Block nodes are provided in the `0_BitcoinGraph.tsv.gz` file.

    *   Headers for each file type are stored in corresponding `*_header.tsv.gz` files.

-   `/neo4j_db_dump`:

    This directory contains a compressed, multi-part Neo4j database dump. 
    To use it, download all the files, decompress the archive, and follow the instructions on 
    [this page](https://eba.b1aab.ai/docs/bitcoin/etl/import#load).
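
Because the data files are headerless, column names have to be paired with the rows at read time. 
The following is a minimal sketch of that pairing using only the Python standard library. 
It assumes each `*_header.tsv.gz` file holds a single tab-separated row of column names; 
the file names and columns used here are made up so that the example runs without the real dataset.

```python
import csv
import gzip
import os
import tempfile


def read_header(header_path):
    """Return the column names stored in a one-row, gzip-compressed TSV header file."""
    with gzip.open(header_path, "rt", newline="") as handle:
        return next(csv.reader(handle, delimiter="\t"))


def iter_rows(data_path, columns):
    """Yield each row of a headerless, gzip-compressed TSV file as a dict."""
    with gzip.open(data_path, "rt", newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            yield dict(zip(columns, row))


# Build a tiny synthetic example; these names and columns are hypothetical
# and do not reflect the actual schema of the Bitcoin Graph files.
demo_dir = tempfile.mkdtemp()
header_path = os.path.join(demo_dir, "demo_header.tsv.gz")
data_path = os.path.join(demo_dir, "demo_edges.tsv.gz")

with gzip.open(header_path, "wt", newline="") as handle:
    csv.writer(handle, delimiter="\t").writerow(["source", "target", "value"])

with gzip.open(data_path, "wt", newline="") as handle:
    writer = csv.writer(handle, delimiter="\t")
    writer.writerow(["script_a", "script_b", "0.5"])
    writer.writerow(["script_b", "script_c", "1.25"])

columns = read_header(header_path)
rows = list(iter_rows(data_path, columns))
print(columns)
print(rows[0])
```

Streaming rows with a generator keeps memory use flat, which matters for files of this size.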

### Q: What data formats are present in your dataset? What kinds of data are stored using these formats? Can you give any advice for how you work with these data formats?

Data in the `data_to_import_neo4j` directory is stored in a compressed TSV format (`.tsv.gz`).

Due to the heterogeneity of the Bitcoin graph, each edge type is stored in a separate file. 
For easier reference, the filename for each edge type follows a `[Source initial]2[Target initial]` pattern. 
For instance, the file `1737337696_BitcoinS2S.tsv.gz` belongs to batch `1737337696` and contains `script-to-script` edges. 
Alternatively, you can refer to the corresponding `*_header.tsv.gz` file for the headers.
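
The naming pattern above can be parsed programmatically, for example to group edge files by batch before loading them. 
The sketch below is illustrative only: the mapping of initials to node types (`S`, `T`, `B`) is an assumption 
based on the node types listed earlier, and the regular expression is inferred from the single example file name.

```python
import re
from collections import defaultdict

# Assumed mapping from initials to node types (not confirmed by the dataset docs).
NODE_TYPES = {"S": "script", "T": "transaction", "B": "block"}

# Matches names such as "1737337696_BitcoinS2S.tsv.gz".
EDGE_FILE = re.compile(r"^(?P<batch>\d+)_Bitcoin(?P<src>[A-Z])2(?P<dst>[A-Z])\.tsv\.gz$")


def parse_edge_filename(filename):
    """Return (batch, source type, target type), or None if the name does not match."""
    match = EDGE_FILE.match(filename)
    if match is None:
        return None
    src = NODE_TYPES.get(match["src"], match["src"])
    dst = NODE_TYPES.get(match["dst"], match["dst"])
    return match["batch"], src, dst


def group_by_batch(filenames):
    """Group edge files by batch identifier, skipping names that do not match."""
    batches = defaultdict(list)
    for name in filenames:
        parsed = parse_edge_filename(name)
        if parsed is not None:
            batches[parsed[0]].append(name)
    return dict(batches)


print(parse_edge_filename("1737337696_BitcoinS2S.tsv.gz"))
# ('1737337696', 'script', 'script')
```

Files that do not follow the edge pattern, such as `0_BitcoinGraph.tsv.gz`, are skipped by `group_by_batch`.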

### Q: What is one question that you have answered using these data? Can you show us how you came to that answer?

Using the Bitcoin graph, you can model the behavior of entities in the context of their deep neighborhood. 
This enables machine learning models to distinguish entities by their behavioral patterns, 
largely independent of external data and with high certainty.


The deeper the sampled neighborhood, and the more closely the sampling matches your application, 
the more accurate the entity characterization can be. 
One example is assigning a trust score to a given Bitcoin address or wallet based on 
its trading pattern and the parties it has traded with.


This 101 tutorial shows a generic approach for training a model to learn node 
(i.e., Bitcoin script/address) embeddings based on its neighborhood, 
and then clustering the nodes based on their learned embedding vectors. 
We train the model using generically sampled communities as a showcase. 
Both the sampling and the model can then be adapted to the specifics of your application.

### Setup



#### Install Dependencies

In [None]:
!mkdir -p /tmp/b1aab/graphstudio
!git clone https://github.com/B1AAB/GraphStudio /tmp/b1aab/graphstudio

!pip install -r /tmp/b1aab/graphstudio/quickstart/script_classification/requirements.txt
!pip install -e /tmp/b1aab/graphstudio/quickstart/script_classification
!pip install kagglehub
!pip install gdown

In [None]:
import os
import sys
pkg_path = '/tmp/b1aab/graphstudio/quickstart/script_classification'
if pkg_path not in sys.path:
    sys.path.insert(0, pkg_path)

In [None]:
import os
import torch
import kagglehub
from torch_geometric.loader import DataLoader
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import umap
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from matplotlib.ticker import FuncFormatter
import numpy as np
import json
import gdown
import tempfile

from script_classification.data_loader import BitcoinScriptsDataset
from script_classification.models import GraphEncoder
from script_classification.evaluation.embeddings import embed_roots
from script_classification.evaluation.clustering import find_optimal_clusters_root, plot_find_optimal_clusters_root

In [None]:
working_dir = tempfile.gettempdir()

In [None]:
sns.set_theme()

In [None]:
SEED = 64
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

### Q: Can you show us an example of downloading and loading data from your dataset?

As mentioned earlier, 
this tutorial uses pre-sampled communities from the Bitcoin graph, 
[available on Kaggle](https://www.kaggle.com/datasets/vjalili/bitcoin-graph-sampled-communities). 

These communities were sampled for demonstration purposes only, and 
_we suggest that you sample communities with characteristics that match your own application_ 
by downloading the graph hosted on AWS.
Please see [this page](https://eba.b1aab.ai/docs/bitcoin/sampling/overview) for details. 

#### Download Sampled Communities from Kaggle

In [None]:
kaggle_dataset_path = kagglehub.dataset_download("vjalili/bitcoin-graph-sampled-communities/version/1")
print("Path to dataset files:", kaggle_dataset_path)
data_root = os.path.join(kaggle_dataset_path, "script_to_script_200k")

In [None]:
!mkdir -p /content/bitcoin_graph_sampled_communities/
!mv $data_root /content/bitcoin_graph_sampled_communities/

In [None]:
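# Use a context manager so the config file is closed after reading.
with open("/tmp/b1aab/graphstudio/quickstart/script_classification/config.json") as config_file:
    config = json.load(config_file)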
saves_root = config["saves_root"]
out_dim = config["out_dim"]
batch_size = config["batch_size"]
hidden_channels = config["hidden_channels"]  # note that this has to be divisible by HEADS

We download a pre-trained model so that training can be skipped. 
To retrain the model yourself, you may use 
[this notebook](https://github.com/B1AAB/GraphStudio/blob/main/quickstart/script_classification/train.ipynb).

In [None]:
model_save_path = os.path.join(working_dir, "best_encoder.pth")
gdown.download(id="1c2lAdsbtEeuNBSUTUGpkGGSRqU0nBx__", output=model_save_path, quiet=False)

Next, we pre-process the "raw" sampled communities into a 
[PyG](https://github.com/pyg-team/pytorch_geometric) dataset. 
You may refer to the 
[data loader source code](https://github.com/B1AAB/GraphStudio/blob/main/quickstart/script_classification/script_classification/data_loader.py)
for details.


In [None]:
data_root = "/content/bitcoin_graph_sampled_communities/script_to_script_200k"
dataset = BitcoinScriptsDataset(root=data_root)
EDGE_DIM = getattr(dataset, "num_edge_features", dataset[0].edge_attr.size(1))

In [None]:
graph_id_to_idx_map = {data.graph_id: i for i, data in enumerate(dataset)}

In [None]:
n = len(dataset)
n_train = int(0.7 * n)
n_val   = int(0.15 * n)
n_test  = n - n_train - n_val

gen = torch.Generator().manual_seed(SEED)
train_raw, val_dataset, test_dataset = torch.utils.data.random_split(
    dataset, [n_train, n_val, n_test], generator=gen
)

In [None]:
test_loader = DataLoader(
    test_dataset,
    batch_size=batch_size,
    shuffle=False,
    pin_memory=True,
)

In [None]:
encoder = GraphEncoder(
    in_channels=dataset.num_node_features,
    hidden_channels=hidden_channels, 
    out_channels=out_dim,
    edge_dim=EDGE_DIM
).to(device)

In [None]:
ckpt = torch.load(model_save_path, map_location=device)
encoder.load_state_dict(ckpt["model_state"])

embeddings_for_analysis = embed_roots(encoder, test_loader, device)

In [None]:
optimal_k, inertias, silhouettes, k_range, best_silhouettes = find_optimal_clusters_root(embeddings_for_analysis, max_k=15, random_state=33)
print(f"Optimal number of clusters: {optimal_k}")

In [None]:
plot_find_optimal_clusters_root(optimal_k, inertias, silhouettes, k_range)

In [None]:
def visualize_embeddings_umap(graph_embeddings, cluster_labels, title="Graph Embedding Clusters (UMAP)", figsize=(5, 5)):
    reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, n_components=2, random_state=42, n_jobs=1)
    embeddings_2d = reducer.fit_transform(graph_embeddings)

    plt.figure(figsize=figsize)

    scatter = sns.scatterplot(
        x=embeddings_2d[:, 0],
        y=embeddings_2d[:, 1],
        hue=cluster_labels,
        palette="tab20",
        legend="full",
        s=50,
        alpha=0.8
    )

    plt.legend(title="Cluster ID", loc="center left", bbox_to_anchor=(1.02, 0.5), frameon=False)

    scatter.set_title(title)
    plt.xlabel("UMAP Component 1")
    plt.ylabel("UMAP Component 2")
    plt.show()

In [None]:
k_means = KMeans(n_clusters=optimal_k, random_state=11, n_init=10)
final_clusters = k_means.fit_predict(embeddings_for_analysis)
final_score = silhouette_score(embeddings_for_analysis, final_clusters, metric="cosine")

print(f"Final Silhouette (cosine) with K={optimal_k}: {final_score:.4f}")
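
For readers unfamiliar with the silhouette score printed above, the following is a framework-free sketch of how it is defined under cosine distance. 
It is an illustration rather than scikit-learn's implementation, it needs at least two clusters, and the toy points and labels are made up.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def mean_silhouette(points, labels):
    """Average silhouette coefficient over all points (requires >= 2 clusters)."""
    clusters = {}
    for index, label in enumerate(labels):
        clusters.setdefault(label, []).append(index)

    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            scores.append(0.0)  # convention for single-member clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(cosine_distance(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(cosine_distance(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


# Hypothetical 2-D embeddings: two tight groups pointing in different directions.
toy_points = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [0.1, 0.99]]
good_score = mean_silhouette(toy_points, [0, 0, 1, 1])
bad_score = mean_silhouette(toy_points, [0, 1, 0, 1])
print(f"good labels: {good_score:.3f}, bad labels: {bad_score:.3f}")
```

Scores near 1 indicate compact, well-separated clusters, while scores near or below 0 indicate overlapping or misassigned points.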

### Q: A picture is worth a thousand words. Show us a visual (or several!) from your dataset that either illustrates something informative about your dataset, or that you think might excite someone to dig in further.

Next, we visualize the clusters of the embeddings generated for the root node of each neighborhood, projected to two dimensions with UMAP.

In [None]:
visualize_embeddings_umap(embeddings_for_analysis, final_clusters, title=f"Root Clusters (K={optimal_k})")

## Comparing Communities by Graph-Level Features

You may also characterize the sampled communities using their graph-level features. 
These include statistics that summarize the entire sampled subgraph, 
such as the average block height or BTC value calculated per hop from the community's root node.

However, it is crucial to understand a key distinction when using these for evaluation.

The model we trained here performs node-level representation learning. 
It is designed to generate a feature vector (an embedding) 
for the root node of each sampled community, based on its neighborhood. 
In contrast, the statistics described above are graph-level 
features that describe the entire community.

Therefore, clustering the root node embeddings and then evaluating 
those clusters based on graph-level features is a methodological mismatch,
and can lead to misleading conclusions.

For such a comparison to be valid, 
the model must be adapted for graph-level representation learning. 
This typically involves adding a readout or pooling function 
(e.g., global mean pooling), or a comparable aggregation method, after the GNN layers 
to aggregate all node embeddings into a single embedding for the entire graph.
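
To make the readout step concrete, here is a framework-free sketch of mean pooling over a batch of graphs. 
It is not part of the `GraphEncoder` used in this notebook; `mean_pool` is a hypothetical helper, 
and in PyG the equivalent operation is available as `global_mean_pool` in `torch_geometric.nn`.

```python
def mean_pool(node_embeddings, graph_index):
    """Average node embeddings per graph.

    node_embeddings: list of equal-length vectors, one per node.
    graph_index: graph id (0..G-1) of each node; every id is assumed to occur at least once.
    """
    num_graphs = max(graph_index) + 1
    dim = len(node_embeddings[0])
    sums = [[0.0] * dim for _ in range(num_graphs)]
    counts = [0] * num_graphs

    # Accumulate per-graph sums and node counts in a single pass.
    for vector, graph_id in zip(node_embeddings, graph_index):
        counts[graph_id] += 1
        for d, value in enumerate(vector):
            sums[graph_id][d] += value

    return [[total / counts[g] for total in sums[g]] for g in range(num_graphs)]


# Three nodes: the first two belong to graph 0, the last one to graph 1.
pooled = mean_pool([[1.0, 2.0], [3.0, 4.0], [10.0, 20.0]], [0, 0, 1])
print(pooled)  # [[2.0, 3.0], [10.0, 20.0]]
```

Each graph ends up with exactly one vector, which can then be clustered and compared against graph-level statistics.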

In [None]:
def format_block_height_axis(ax):
    formatter = FuncFormatter(lambda x, pos: f'{x / 1000:.0f}k')
    ax.xaxis.set_major_formatter(formatter)

In [None]:
def prepare_stats_for_plotting(stats_list, graph_ids):
    plot_data = []
    for i, stats_dict in enumerate(stats_list):
        graph_id = graph_ids[i]
        for hop_level, hop_stats in stats_dict.items():
            hop_transition_key = f"hop{hop_level}->hop{hop_level+1}"
            
            if "Value_avg" in hop_stats:
                plot_data.append({
                    "graph_id": graph_id,
                    "hop_transition": hop_transition_key,
                    "avg_value": hop_stats["Value_avg"],
                    "metric_type": "Avg Value"
                })
            if "BlockHeight_avg" in hop_stats:
                plot_data.append({
                    "graph_id": graph_id,
                    "hop_transition": hop_transition_key,
                    "avg_block_height": hop_stats["BlockHeight_avg"],
                    "metric_type": "Avg BlockHeight"
                })                
            if "OriginalInDegree_avg" in hop_stats:
                plot_data.append({
                    "graph_id": graph_id,
                    "hop_transition": hop_transition_key,
                    "avg_original_in_degree": hop_stats["OriginalInDegree_avg"],
                    "metric_type": "Avg Original In-Degree"
                })
            if "OriginalOutDegree_avg" in hop_stats:
                plot_data.append({
                    "graph_id": graph_id,
                    "hop_transition": hop_transition_key,
                    "avg_original_out_degree": hop_stats["OriginalOutDegree_avg"],
                    "metric_type": "Avg Original Out-Degree"
                })
                
    return pd.DataFrame(plot_data)

In [None]:
def compare_graph_stats(stats_list, graph_ids):
    df = prepare_stats_for_plotting(stats_list, graph_ids)
    hop_order = sorted(df["hop_transition"].dropna().unique(), key=lambda x: int(x.split("->")[0][3:]))
    color_palette = sns.color_palette("viridis", len(graph_ids))

    fig, axes = plt.subplots(1, 4, figsize=(18, 4), sharey=True)

    ax1 = axes[0]
    ax2 = axes[1]
    ax3 = axes[2]
    ax4 = axes[3]

    sns.barplot(data=df[df["metric_type"] == "Avg Value"], 
                y="hop_transition", x="avg_value", hue="graph_id", 
                ax=ax1, order=hop_order, orient="h", palette=color_palette)
    ax1.set_title("Avg. Tx Value by Hop")
    ax1.set_xlabel("Average Value (Log10 BTC)")
    ax1.set_ylabel("Hop Transition")
    ax1.set_xscale("log")
    ax1.get_legend().remove()

    sns.barplot(data=df[df["metric_type"] == "Avg BlockHeight"], 
                y="hop_transition", x="avg_block_height", hue="graph_id", 
                ax=ax2, order=hop_order, orient="h", palette=color_palette)
    ax2.set_title("Avg. Block Height by Hop")
    ax2.set_xlabel("Average Block Height")
    ax2.set_ylabel("")
    ax2.get_legend().remove()
    format_block_height_axis(ax2)
    
    sns.barplot(data=df[df["metric_type"] == "Avg Original In-Degree"], 
                y="hop_transition", x="avg_original_in_degree", hue="graph_id", 
                ax=ax3, order=hop_order, orient="h", palette=color_palette)
    ax3.set_title("Avg. Original In-Degree by Hop")
    ax3.set_xlabel("Average Original In-Degree")
    ax3.set_ylabel("")
    ax3.get_legend().remove()

    sns.barplot(data=df[df["metric_type"] == "Avg Original Out-Degree"], 
                y="hop_transition", x="avg_original_out_degree", hue="graph_id", 
                ax=ax4, order=hop_order, orient="h", palette=color_palette)
    ax4.set_title("Avg. Original Out-Degree by Hop")
    ax4.set_xlabel("Average Original Out-Degree")
    ax4.set_ylabel("")
    ax4.get_legend().remove()

    plt.tight_layout(rect=[0, 0, 1, 0.95])
    plt.show()

In [None]:
def get_graphs_from_cluster(target_cluster_id, all_cluster_labels, dataset):
    indices = np.where(all_cluster_labels == target_cluster_id)[0]
    
    if len(indices) == 0:
        print(f"No graphs found for cluster ID {target_cluster_id}.")
        return []
        
    graphs = [dataset[i] for i in indices]
    return graphs

In [None]:
# Fetch the members of cluster 1 once; this assumes the cluster has at least two graphs.
cluster_graphs = get_graphs_from_cluster(1, final_clusters, test_dataset)
a_g_id, b_g_id = cluster_graphs[0].graph_id, cluster_graphs[1].graph_id
compare_graph_stats([dataset.per_graph_stats[a_g_id], dataset.per_graph_stats[b_g_id]], [a_g_id, b_g_id])

### Q: What is one unanswered question that you think could be answered using these data? Do you have any recommendations or advice for someone wanting to answer this question?

The model evaluated here is for demonstration purposes only, 
using communities that were also sampled for demonstration. 
This tutorial illustrates what the data looks like and provides an example of how to work with it. 
This is critical given the heterogeneity and scale of the data, 
the complex ETL pipeline involved, 
and the breadth of potential applications, 
each of which demands a well-tailored solution.


The dataset spans more than 16 years of Bitcoin transactions. 
During this time, every aspect of the data has evolved, 
making the temporal dimension essential. 
Additionally, various on-chain interaction patterns are captured in this graph dataset. 
Our manuscript provides a detailed statistical profile of the blockchain and its graph characteristics. 
We suggest using this information to sample communities that specifically match your application and evaluate the results accordingly.


For instance, 
an application studying how funds are transferred between scripts 2 hops away from the Coinbase node 
(involving mostly miners) 
will differ from one studying communities at least 5 hops away 
(representing mostly user trades). 
To investigate such specific questions, 
you will need to sample your own communities by first downloading the graph from AWS 
and then running sampling algorithms with your own configurations.
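
As a rough illustration of hop-based selection, the sketch below computes hop distances from a root node with breadth-first search on a toy graph. 
This is not the sampling algorithm provided with the dataset; the node names and edges are invented, 
and a real workflow would run against the imported Neo4j database instead of an in-memory dictionary.

```python
from collections import deque


def hop_distances(adjacency, root):
    """Breadth-first search returning the hop distance of every node reachable from root."""
    distances = {root: 0}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, []):
            if neighbor not in distances:
                distances[neighbor] = distances[node] + 1
                queue.append(neighbor)
    return distances


def nodes_within(adjacency, root, min_hops, max_hops):
    """Return the sorted nodes whose hop distance from root lies in [min_hops, max_hops]."""
    distances = hop_distances(adjacency, root)
    return sorted(n for n, d in distances.items() if min_hops <= d <= max_hops)


# Hypothetical toy graph of fund flows starting at a coinbase node.
toy_graph = {
    "coinbase": ["miner_1", "miner_2"],
    "miner_1": ["exchange"],
    "miner_2": ["exchange"],
    "exchange": ["user_a"],
    "user_a": ["user_b"],
}

near_root = nodes_within(toy_graph, "coinbase", 0, 2)
far_from_root = nodes_within(toy_graph, "coinbase", 3, 10)
print(near_root)       # ['coinbase', 'exchange', 'miner_1', 'miner_2']
print(far_from_root)   # ['user_a', 'user_b']
```

Changing the hop range is what separates a miner-focused community from a user-focused one in this toy setting.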


On [this page](https://github.com/B1AAB/GraphStudio), 
we provide general recommendations for designing applications with this dataset.