# Overview

RegGAIN is a self-supervised graph contrastive learning framework that infers gene regulatory networks (GRNs) by integrating scRNA-seq data with a species-specific prior gene network (e.g., human or mouse). The model outputs directed, cell-type-resolved GRNs in which each edge carries a predicted transcription factor (TF)-target regulatory score derived from the learned embeddings. These reconstructed GRNs support downstream analyses such as gene module detection, analysis of network rewiring across discrete conditions (e.g., disease and control) or continuous processes (e.g., time-series data), and TF prioritization for biological or clinical interpretation.

Here, we demonstrate the application of RegGAIN using the mouse hematopoietic stem cell lymphoid-lineage (mHSC-L) scRNA-seq dataset.

## Preparations

Before starting the tutorial, install RegGAIN and its required Python packages by following the step-by-step installation guide in the README.


In [1]:
# Import RegGAIN and the essential libraries
import RegGAIN_script as rg
import pandas as pd
import torch

# Check if a GPU is available and set the device accordingly
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")



Using device: cuda


## Inputs and preprocessing

Load the scRNA-seq data and the prior gene interaction network.

In [2]:
# Provide the paths to your data files here.
exp_data_path = "data.csv" 
prior_net_path = "network_XX.csv"

# Preprocess the inputs, then build the PyG data object on the selected device.
adata = rg.data_preparation(exp_data_path, prior_net_path)
pyg_data = rg.get_PYG_data(adata, torch.device(device))


Start preprocessing! 
Total number of prior network edges: 14661
Number of nodes with out-degree > 50: 58
Finish! Data shape: n_genes × n_cells = 692 × 847
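
The preprocessing log reports the total number of prior-network edges and the number of nodes whose out-degree exceeds 50. As a rough illustration of what such a count means, the sketch below computes out-degrees from a toy directed edge list with pandas. The column names `source` and `target`, the helper `count_high_out_degree`, and the threshold of 2 are assumptions made for this toy example; this is not RegGAIN's preprocessing code.

```python
import pandas as pd

# Toy directed edge list. The column names are hypothetical and may not
# match the layout of the prior network file used in this tutorial.
edges = pd.DataFrame({
    "source": ["A", "A", "A", "B", "C"],
    "target": ["B", "C", "D", "C", "A"],
})


def count_high_out_degree(edge_df, threshold):
    """Count source nodes with more than `threshold` distinct targets."""
    out_degree = edge_df.drop_duplicates().groupby("source")["target"].nunique()
    return int((out_degree > threshold).sum())


n_edges = len(edges.drop_duplicates())
n_hubs = count_high_out_degree(edges, threshold=2)
print(f"Total number of prior network edges: {n_edges}")
print(f"Number of nodes with out-degree > 2: {n_hubs}")
```

Running the same kind of check on your own prior network before training can be a quick way to confirm that the file was parsed as a directed edge list.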


## Construct the cell-type-specific gene regulatory network

In [3]:
# Reload the module to pick up local edits (only needed during development)
import importlib
importlib.reload(rg)
# Set hyperparameters (default values)
config = {
    'epochs': 500,  
    'lr': 0.001,
    'device': device,
    'repeat': 10,
    'seed': 42,
    'k': 50,
    'adjacency_powers': [0, 1, 2],
    'first_layer_dims': [80, 80, 10],
    'hidden_layer_dims_list': "40 40 5,16 16 2",
    'pos': 10,
    
    # Data augmentation parameters
    'edge_alpha1': 0.6, 'edge_alpha2': 0.3,
    'edge_beta1': 0.3, 'edge_beta2': 0.3,
    'node_alpha1': 0.5, 'node_alpha2': 0.2,
    'node_beta1': 0.2, 'node_beta2': 0.2,
}


# Run the RegGAIN algorithm (preprocessing is repeated internally, as the log below shows)
results = rg.run_reggain(
    exp_data=exp_data_path,
    prior_net=prior_net_path,
    config=config
)

Using device: cuda
Start preprocessing! 
Total number of prior network edges: 14661
Number of nodes with out-degree > 50: 58
Finish! Data shape: n_genes × n_cells = 692 × 847
Start training!


Run 1/10: 100%|██████████| 500/500 [00:50<00:00,  9.99epoch/s]
Run 2/10: 100%|██████████| 500/500 [00:50<00:00,  9.87epoch/s]
Run 3/10: 100%|██████████| 500/500 [00:49<00:00, 10.17epoch/s]
Run 4/10: 100%|██████████| 500/500 [00:50<00:00,  9.91epoch/s]
Run 5/10: 100%|██████████| 500/500 [00:49<00:00, 10.09epoch/s]
Run 6/10: 100%|██████████| 500/500 [00:52<00:00,  9.60epoch/s]
Run 7/10: 100%|██████████| 500/500 [00:53<00:00,  9.30epoch/s]
Run 8/10: 100%|██████████| 500/500 [00:53<00:00,  9.39epoch/s]
Run 9/10: 100%|██████████| 500/500 [00:55<00:00,  9.01epoch/s]
Run 10/10: 100%|██████████| 500/500 [00:53<00:00,  9.38epoch/s]


Training finished. Processing results...
Result processing complete.


## Inspect and analyze the results 

In [4]:
GRN_df = results['GRN']
embedding_in = results['embedding_in']
embedding_out = results['embedding_out']


print("\nShape of embeddings:")
print(embedding_in.shape)
print(embedding_out.shape)


Shape of embeddings:
(692, 34)
(692, 34)
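
The embedding width of 34 equals 16 + 16 + 2, the sum of the last group in `hidden_layer_dims_list`. One reading, inferred only from these numbers and not checked against the RegGAIN source, is that each group lists one width per entry of `adjacency_powers` and that the per-power outputs are concatenated. The sketch below parses the config string under that assumption; `parse_layer_dims` is a hypothetical helper, not part of the package.

```python
def parse_layer_dims(spec):
    """Parse a string such as "40 40 5,16 16 2" into a list of integer lists.

    Commas separate layers and spaces separate the widths within a layer.
    This format is inferred from the config shown above.
    """
    return [[int(width) for width in layer.split()] for layer in spec.split(",")]


hidden_dims = parse_layer_dims("40 40 5,16 16 2")
# Assumption: the final embedding concatenates the widths of the last layer.
final_width = sum(hidden_dims[-1])
print(hidden_dims)
print(final_width)
```

If this reading holds, changing the last group of `hidden_layer_dims_list` would change the second dimension of `embedding_in` and `embedding_out` accordingly.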


In [5]:
GRN_df.head(10)


|   | TF | Target | value |
|---|----|--------|-------|
| 0 | NFIA | MYC | 246.32043 |
| 1 | IGF1 | MYC | 246.255165 |
| 2 | NFE2 | MYC | 245.602342 |
| 3 | IGF1 | ESR1 | 245.223019 |
| 4 | MEIS1 | MYC | 245.133118 |
| 5 | FOXO1 | ESR1 | 244.866429 |
| 6 | IGF1 | CD44 | 244.708844 |
| 7 | GFI1B | MYC | 244.658742 |
| 8 | FOXO1 | XRCC6 | 244.605643 |
| 9 | FOXO1 | CD44 | 244.456465 |
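
Because the GRN is returned as an edge table with `TF`, `Target`, and `value` columns, ordinary pandas operations are enough for a first look at it. The sketch below rebuilds the ten rows shown above as a small DataFrame (so that it runs on its own) and lists the targets of one TF, the regulators of one target, and the number of distinct targets per TF. The variable names are illustrative; in the notebook you would apply the same operations to `GRN_df` directly.

```python
import pandas as pd

# The ten highest-scoring edges shown above, copied into a standalone DataFrame.
grn = pd.DataFrame({
    "TF": ["NFIA", "IGF1", "NFE2", "IGF1", "MEIS1",
           "FOXO1", "IGF1", "GFI1B", "FOXO1", "FOXO1"],
    "Target": ["MYC", "MYC", "MYC", "ESR1", "MYC",
               "ESR1", "CD44", "MYC", "XRCC6", "CD44"],
    "value": [246.32043, 246.255165, 245.602342, 245.223019, 245.133118,
              244.866429, 244.708844, 244.658742, 244.605643, 244.456465],
})

# Targets of a single TF, ordered by decreasing score.
foxo1_targets = (
    grn.loc[grn["TF"] == "FOXO1"]
    .sort_values("value", ascending=False)["Target"]
    .tolist()
)

# All TFs predicted to regulate a given target, in table order.
myc_regulators = grn.loc[grn["Target"] == "MYC", "TF"].tolist()

# Number of distinct targets per TF within this subset of edges.
targets_per_tf = grn.groupby("TF")["Target"].nunique()

print(foxo1_targets)
print(myc_regulators)
print(targets_per_tf)
```

These counts describe only the ten rows displayed here, not the full network.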


## (Optional) Run evaluation

In [6]:
# Provide paths to the ground truth label files.
label_string_path = "Label_STRING.csv"
label_non_specific_path = "Label_Non-Specific.csv"
label_specific_path = "Label_Specific.csv"

### Evaluate against the STRING network

In [7]:
rg.calculate_epr_aupr(GRN_df, label_string_path, 'Gene1', 'Gene2', 'TF', 'Target', 'value')

Label.csv EPR: 3.727441389779052
Label.csv AUPR ratio: 3.1120957239725637


(3.727441389779052, 3.1120957239725637)

### Evaluate against the Non-Specific network

In [8]:
rg.calculate_epr_aupr(GRN_df, label_non_specific_path, 'Gene1', 'Gene2', 'TF', 'Target', 'value')

Label.csv EPR: 3.4816945138273843
Label.csv AUPR ratio: 3.535993023825075


(3.4816945138273843, 3.535993023825075)

### Evaluate against the cell-type-specific network

In [9]:
rg.calculate_epr_aupr(GRN_df, label_specific_path, 'Gene1', 'Gene2', 'TF', 'Target', 'value')

Label.csv EPR: 1.1789126578315767
Label.csv AUPR ratio: 1.232644665273911


(1.1789126578315767, 1.232644665273911)
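
For readers unfamiliar with the metric, the sketch below shows one common way to compute an early precision ratio (EPR): take the top-k scored edges, where k is the number of ground-truth edges among the scored pairs, measure the fraction that are true, and divide by the precision expected from a random ranking (the density of true edges). This is a generic illustration under that definition. It is not the implementation of `rg.calculate_epr_aupr`, whose exact conventions may differ, and the function name and toy data are hypothetical.

```python
import pandas as pd


def early_precision_ratio(pred, true_edges):
    """EPR of scored edges in `pred` against a set of (TF, Target) tuples.

    `pred` needs the columns TF, Target, and value. Only the pairs scored
    in `pred` are treated as the candidate edge space (an assumption).
    """
    pairs = zip(pred["TF"], pred["Target"])
    is_true = pd.Series([pair in true_edges for pair in pairs], index=pred.index)
    k = int(is_true.sum())
    if k == 0:
        raise ValueError("no ground-truth edges among the scored pairs")
    top_k = pred["value"].nlargest(k).index       # k highest-scoring edges
    early_precision = is_true.loc[top_k].mean()   # fraction of true edges in the top k
    random_precision = k / len(pred)              # density of true edges
    return float(early_precision / random_precision)


# Toy example: six scored edges, two of which are in the ground truth.
toy_pred = pd.DataFrame({
    "TF": ["A", "A", "B", "B", "C", "C"],
    "Target": ["B", "C", "C", "A", "A", "B"],
    "value": [0.9, 0.8, 0.7, 0.4, 0.3, 0.1],
})
toy_truth = {("A", "B"), ("B", "C")}
toy_epr = early_precision_ratio(toy_pred, toy_truth)
print(toy_epr)
```

Under this definition, a value of 1 corresponds to the precision expected from a random ranking, and larger values indicate enrichment of true edges among the top predictions.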