
In this section, we demonstrate the generation of single-cell–level regulatory graphs required for running scReGAT on unmatched multi-omics datasets.

As an illustrative example, we employ human cerebral cortex data, originally reported by Mulqueen et al. (scATAC-seq) and Hodge et al. (scRNA-seq).

In [1]:
import os
import sys
import random
import pickle
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import hypergeom
from scregat.data_process import prepare_model_input, sum_counts, plot_edge, ATACGraphDataset
import scanpy as sc
import itertools
import torch


In [2]:
import scregat
import os

print(scregat.data_process.__file__)    
print(os.path.dirname(scregat.__file__))


/opt/conda/envs/scregat/lib/python3.10/site-packages/scregat/data_process.py
/opt/conda/envs/scregat/lib/python3.10/site-packages/scregat


In [7]:
ATAC_h5ad_file = "../reproduce_data/HumanBrain_ATAC.h5ad"
RNA_h5ad_file = "../reproduce_data/HumanBrain_RNA.h5ad"


- Both **adata_rna** and **adata_atac** must contain a `celltype` column in their `obs`.
- Every cell type present in **adata_atac** must also appear in the `celltype` column of **adata_rna**.
- **adata_rna** may contain additional cell types that are absent from **adata_atac** (in this example, `OPC`).
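
These requirements can be checked before any graphs are built. The block below is a minimal sketch, not part of scReGAT: `check_celltype_labels` is a hypothetical helper that takes two `obs`-style DataFrames and reports ATAC cell types that have no counterpart in the RNA annotation. The toy tables stand in for `adata_atac.obs` and `adata_rna.obs`.

```python
import pandas as pd


def check_celltype_labels(atac_obs, rna_obs, key="celltype"):
    """Return the sorted ATAC cell types that are absent from the RNA labels.

    Hypothetical helper (not a scReGAT function). Raises KeyError if either
    table lacks the `key` column.
    """
    for name, obs in (("ATAC", atac_obs), ("RNA", rna_obs)):
        if key not in obs.columns:
            raise KeyError(f"{name} obs is missing the '{key}' column")
    atac_types = set(atac_obs[key].astype(str))
    rna_types = set(rna_obs[key].astype(str))
    return sorted(atac_types - rna_types)


# Toy stand-ins for adata_atac.obs and adata_rna.obs
atac_obs = pd.DataFrame({"celltype": ["astrocyte", "microglia", "astrocyte"]})
rna_obs = pd.DataFrame({"celltype": ["astrocyte", "microglia", "OPC"]})

missing = check_celltype_labels(atac_obs, rna_obs)
print(missing)  # an empty list means every ATAC cell type is covered
```

With real data, the same call would take `adata_atac.obs` and `adata_rna.obs`; a non-empty result lists the labels that need to be renamed or removed first.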


In [8]:
adata_atac = sc.read_h5ad(ATAC_h5ad_file)
adata_rna = sc.read_h5ad(RNA_h5ad_file)


In [9]:
adata_atac

AnnData object with n_obs × n_vars = 2174 × 292156
    obs: 'celltype', 'celltype_rna'
    var: 'peak'

In [10]:
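# Compare the two label columns element-wise. The result is not displayed,
# because a notebook cell only shows its last expression (the obs table).
adata_atac.obs['celltype_rna'] == adata_atac.obs['celltype']
adata_atac.obs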

Unnamed: 0,celltype,celltype_rna
GAAGCAGCTCTGACGACCGCGGTT,inhibitory_neuron,inhibitory_neuron
TCCATACCAATGATGCGGCATTCT,oligodendrocytes,oligodendrocytes
ATTGAGGAACCGGAAGACACTAAG,oligodendrocytes,oligodendrocytes
ACGCGACGTCTGACGAGCCACAGG,oligodendrocytes,oligodendrocytes
ATTGAGGAAGTTACGCTCCAACGC,oligodendrocytes,oligodendrocytes
...,...,...
GAAGAGTAGCCAAGGCGTGAATAT,oligodendrocytes,oligodendrocytes
GAAGCAGCTTAACTCAGGTACCTT,oligodendrocytes,oligodendrocytes
GGTTAGTTAGTTACGCTTGGACTC,polydendrocytes,polydendrocytes
GAAGAGTAATGGAGCTCTTGGTAT,oligodendrocytes,oligodendrocytes


In [11]:
adata_atac.var

Unnamed: 0,peak
chr10-10113-10613,chr10:10113-10613
chr10-74022-74522,chr10:74022-74522
chr10-76384-76884,chr10:76384-76884
chr10-129575-130075,chr10:129575-130075
chr10-134348-135024,chr10:134348-135024
...,...
chrX-155148867-155149367,chrX:155148867-155149367
chrX-155215853-155217198,chrX:155215853-155217198
chrX-155263318-155263818,chrX:155263318-155263818
chrX-155263973-155264786,chrX:155263973-155264786


In [12]:
adata_rna

AnnData object with n_obs × n_vars = 15603 × 50281
    obs: 'orig.ident', 'nCount_RNA', 'nFeature_RNA', 'cluster', 'class', 'brain_subregion', 'donor', 'sex', 'facs_sort_criteria', 'seq_batch', 'total_reads', 'percent_reads_unique', 'celltype'

In [13]:
adata_rna.obs

Unnamed: 0,orig.ident,nCount_RNA,nFeature_RNA,cluster,class,brain_subregion,donor,sex,facs_sort_criteria,seq_batch,total_reads,percent_reads_unique,celltype
F1S4_160106_001_B01,0,1945567.0,8635,Inh L4-6 SST B3GAT2,GABAergic,L5,H200.1030,M,NeuN-positive,R8S4-160411-H,2572946,85.573696,inhibitory_neuron
F1S4_160106_001_C01,0,2076398.0,11697,Exc L5-6 RORB TTC12,Glutamatergic,L5,H200.1030,M,NeuN-positive,R8S4-160411-H,2755839,87.619777,excitatory_neuron
F1S4_160106_001_E01,0,1984845.0,12138,Exc L5-6 FEZF2 ABO,Glutamatergic,L5,H200.1030,M,NeuN-positive,R8S4-160411-H,2701064,86.448562,excitatory_neuron
F1S4_160106_001_G01,0,1991032.0,12191,Exc L5-6 FEZF2 EFTUD1P1,Glutamatergic,L5,H200.1030,M,NeuN-positive,R8S4-160411-H,2759117,84.497105,excitatory_neuron
F1S4_160106_001_H01,0,2189892.0,10535,Exc L3-5 RORB ESR1,Glutamatergic,L5,H200.1030,M,NeuN-positive,R8S4-160411-H,2930410,87.014752,excitatory_neuron
...,...,...,...,...,...,...,...,...,...,...,...,...,...
F2S4_170405_060_B01,1,2315521.0,10853,Exc L5-6 FEZF2 ABO,Glutamatergic,L5,H16.06.008,F,"SATB2-pos, NeuN-pos",R8S4-170505,3015621,89.829491,excitatory_neuron
F2S4_170405_060_C01,1,2376631.0,10539,Exc L5-6 THEMIS FGF10,Glutamatergic,L5,H16.06.008,F,"SATB2-pos, NeuN-pos",R8S4-170505,3121536,89.871717,excitatory_neuron
F2S4_170405_060_E01,1,1761501.0,9017,Exc L4-6 FEZF2 IL26,Glutamatergic,L5,H16.06.008,F,"SATB2-neg, NeuN-pos",R8S4-170505,2771628,73.988717,excitatory_neuron
F2S4_170405_060_F01,1,1736137.0,4953,Inh L4-6 SST B3GAT2,GABAergic,L5,H16.06.008,F,"SATB2-neg, NeuN-pos",R8S4-170505,2307182,86.352789,inhibitory_neuron


In [14]:
adata_rna.var

3.8-1.2
3.8-1.3
3.8-1.4
3.8-1.5
5-HT3C2
...
ZYX
ZZEF1
ZZZ3
bA255A11.4
bA395L14.12


In [15]:
adata_atac.obs.celltype.unique()

['inhibitory_neuron', 'oligodendrocytes', 'microglia', 'excitatory_neuron', 'astrocyte', 'polydendrocytes']
Categories (6, object): ['astrocyte', 'excitatory_neuron', 'inhibitory_neuron', 'microglia', 'oligodendrocytes', 'polydendrocytes']

In [16]:
adata_rna.obs.celltype.unique()

['inhibitory_neuron', 'excitatory_neuron', 'oligodendrocytes', 'OPC', 'astrocyte', 'microglia', 'polydendrocytes']
Categories (7, object): ['OPC', 'astrocyte', 'excitatory_neuron', 'inhibitory_neuron', 'microglia', 'oligodendrocytes', 'polydendrocytes']

In [17]:
adata_rna.obs['celltype'] = adata_rna.obs['celltype'].astype('object')
df_rna = sum_counts(adata_rna, by='celltype', marker_gene_num=300)

**finished identifying marker genes by COSG**
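
For intuition about the aggregated matrix, the sketch below sums a toy cell-by-gene count table by cell type with pandas. It is an illustration only: the actual `sum_counts` implementation, including any normalization and the COSG marker selection controlled by `marker_gene_num`, is not reproduced here, and all names and values in the block are made up.

```python
import pandas as pd

# Toy cell-by-gene count table (rows: cells, columns: genes)
counts = pd.DataFrame(
    [[5, 0, 1], [3, 1, 0], [0, 7, 2], [1, 9, 4]],
    index=["cell1", "cell2", "cell3", "cell4"],
    columns=["GeneA", "GeneB", "GeneC"],
)

# One cell type label per cell, aligned on the cell names
celltype = pd.Series(
    ["astrocyte", "astrocyte", "microglia", "microglia"],
    index=counts.index,
    name="celltype",
)

# Pseudo-bulk: sum the counts of all cells that share a label
pseudo_bulk = counts.groupby(celltype).sum()
print(pseudo_bulk)
```

The result has one row per cell type and one column per gene, which is the layout that `df_rna` follows.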


In [18]:
df_rna

celltype,RXFP1,ATP13A4,PGR,EMC10,GALNT3,LSP1,LOC101929372,GPNMB,P2RY12,CD302,...,ATP6V0C,ABI2,TRIOBP,ICOSLG,RERG,PREX1,GSN,GAP43,SIRPB2,FAM177B
OPC,1.977282,243.189362,6.567498,49.746312,58.736322,0.230454,0.162515,324.1351,9.514092,1.739794,...,27.589659,126.436216,25.480129,0.318424,7.756298,91.241953,69.767077,30.394268,0.200216,0.057349
astrocyte,5.460383,592.013255,1.122709,27.688592,3.06355,0.187578,30.816097,2.506366,10.546135,2.108429,...,14.400693,186.549481,30.933883,0.143088,309.865948,193.036843,180.726996,10.029325,0.033279,0.158025
excitatory_neuron,9618.6429,525.937022,22.664902,500.02603,613.249599,48.89234,11.371342,33.073244,384.708391,0.642745,...,2896.196396,11327.085658,270.302763,2.743651,991.89653,1119.595128,978.919536,12965.963452,4.599011,33.38185
inhibitory_neuron,293.852883,310.47571,29.598662,306.189812,106.368897,12.411594,5.664125,53.055618,57.614247,3.288799,...,2170.352186,2949.813759,148.803219,1.305105,2853.188535,186.263144,241.081847,1817.570922,4.223736,12.83123
microglia,0.609588,5.631434,1.062473,3.883262,0.279265,0.073228,0.432567,7.109798,137.660817,2.706166,...,10.098409,9.971359,8.799556,0.590807,0.054346,78.407873,45.76537,0.680299,16.59557,49.648214
oligodendrocytes,8.839663,36.150258,3.003144,161.119213,1.386353,1.194011,2.254029,2.276863,1.573197,0.0,...,57.737471,124.412413,23.632928,7.546591,3.22495,303.360678,427.687454,18.605865,0.832426,3.697203
polydendrocytes,0.022247,2.643582,5.32646,0.011465,0.006362,1.85525,0.0,0.031521,0.011087,0.596851,...,1.539264,3.578376,6.383234,0.0,6.436578,8.039381,7.974619,1.846928,0.0,0.0


In [20]:
import scregat
import os
# This step adds tissue-specific Hi-C regulatory relationships.
# The user needs to provide a set of prior-knowledge files in base_dir.
# For example, ../data/ includes a file called PO_brain.txt,
# which contains brain tissue-specific Hi-C links.
base_dir = '../data/'
os.listdir(base_dir)


['PO.txt',
 'all_tissue_SNP_Gene.txt',
 'TF_Gene_tissue_cutoff1.csv',
 'PO_brain.txt',
 'genes.protein.tss.tsv',
 'hg38.chrom.sizes',
 'PP.txt',
 'readme.md',
 'trrust_rawdata.human.tsv',
 'celltype_specific_cRE_interactions',
 'PP_brain.txt',
 'TF_Gene_tissue_Brain.csv',
 'model_init.pth',
 'dataset_atac_kRG_Pancreas.pkl',
 'processed_files',
 'peaks_process']

In [21]:
dataset_obj = prepare_model_input(
    # [Core Data] Single-cell ATAC-seq AnnData object.
    # Requirement: .X must be a Cell-by-Peak matrix (cells in .obs, peaks in .var); .obs must contain cell type annotations.
    adata_atac = adata_atac,
    
    # [Output Path] Root directory for storing intermediate processed files.
    # The script will create a 'processed_files' folder here (e.g., sorted bed files).
    path_data_root = '../',
    
    # [File Reference] String path to the original ATAC file (used for naming/logging).
    file_atac = ATAC_h5ad_file, 
    
    # [Core Data] RNA expression matrix aggregated by cell type (Pseudo-bulk).
    # Format: Pandas DataFrame. 
    # Index (Rows): Cell type names (must match 'celltype_rna' in adata_atac.obs).
    # Columns: Gene Symbols (e.g., 'TP53').
    df_rna_celltype = df_rna,
    
    # [Prior Knowledge] Path to eQTL (expression Quantitative Trait Loci) data.
    # Used to link SNPs/non-coding regions to target genes.
    path_eqtl = '../data/all_tissue_SNP_Gene.txt',
    
    # [Prior Knowledge] Suffix for Hi-C interaction files to specify tissue context.
    # e.g., if set to "_brain", the code looks for 'PP_brain.txt' and 'PO_brain.txt'.
    # PP = Promoter-Promoter, PO = Promoter-Other (Enhancer).
    Hi_C_file_suffix = "_" + "brain",  
    
    # [Preprocessing] Whether to convert genomic coordinates from hg19 to hg38.
    # Set True if input ATAC peaks are hg19 (requires LiftOver tool); False if already hg38.
    hg19tohg38 = False,
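    # [Prior Knowledge] Directory containing the prior-knowledge files (e.g., the Hi-C and TF files listed above).
    data_dir = '../data',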
    # [QC Filter] Peak filtering threshold.
    # Peaks must be accessible in at least 1% (0.01) of cells to be retained.
    min_percent = 0.01,
    
    # [Prior Knowledge] Whether to use an extended TF (Transcription Factor) database.
    # False: Uses TRRUST only (Curated, high-confidence).
    # True: Uses TRRUST + CHEA3/ChIP-seq aggregated data.
    use_additional_tf = True,
    
    # [Prior Knowledge] Reliability threshold for the extended TF database.
    # Only applies if use_additional_tf=True.
    # Integer indicating in how many tissues/datasets the TF-Gene link must appear to be kept.
    # 10 indicates a high-confidence, conserved regulatory relationship.
    tissue_cuttof = 10
)

only dataset_obj ...
processing Hi-C ...


processing TF ...
additional TF...
total candidate tf-gene:  28054
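
The `min_percent` filter can be pictured as follows. This is a sketch of the stated rule (keep peaks accessible in at least 1% of cells) on a toy matrix, under the assumption that "accessible" means a nonzero count. The helper name `filter_peaks_by_fraction` is hypothetical, and how `prepare_model_input` implements the filter internally is not shown in this tutorial.

```python
import numpy as np
from scipy import sparse


def filter_peaks_by_fraction(X, min_percent=0.01):
    """Return a boolean mask over peaks (columns).

    A peak is kept when it has a nonzero count in at least `min_percent`
    of the cells (rows). Assumption: nonzero means accessible.
    """
    X = sparse.csr_matrix(X)
    n_cells = X.shape[0]
    cells_per_peak = np.asarray((X != 0).sum(axis=0)).ravel()
    return cells_per_peak / n_cells >= min_percent


# Toy cell-by-peak matrix: 200 cells, 3 peaks
X = np.zeros((200, 3))
X[:1, 0] = 1    # open in 1 cell   (0.5% of cells)
X[:4, 1] = 1    # open in 4 cells  (2% of cells)
X[:50, 2] = 3   # open in 50 cells (25% of cells)

mask = filter_peaks_by_fraction(X, min_percent=0.01)
print(mask)
```

Only the peaks whose mask entry is `True` would be carried forward as graph nodes under this rule.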




In [22]:
dataset_obj.list_graph[0]

Data(x=[61528, 1], edge_index=[2, 69180], y=[1], edge_tf=[155, 2], y_exp=[1348], cell='TCCATACCAATGATGCGGCATTCT')

In [23]:
file_atac_test = os.path.join('./', 'dataset_atac_kRG_MFG.pkl')
with open(file_atac_test, 'wb') as w_pkl:
    # Serialize the dataset object directly to the file handle
    pickle.dump(dataset_obj, w_pkl)

### Add Tissue-specific TF-gene Links (from CHEA3)

In [24]:
df = pd.read_csv("../data/TF_Gene_tissue_Brain.csv", index_col=0)
df.columns = ['TF', 'TargetGene', 'tissue_count']
df_tf = df

In [25]:
df_tf.head()

Unnamed: 0,TF,TargetGene,tissue_count
0,TFAP2A,KDM4B,1
1,TFAP2A,PLXNA1,1
2,TFAP2A,LYPD6B,1
3,TFAP2A,ESRP1,1
4,TFAP2A,MAB21L1,1
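
According to the parameter comments in the `prepare_model_input` call, a threshold such as `tissue_cuttof` keeps only TF-gene links that appear in enough tissues or datasets, which is the information the `tissue_count` column carries. The sketch below applies that kind of filter to a toy table. The rows and the helper name `filter_tf_links` are invented for illustration, and the exact comparison used inside scReGAT (`>=` versus `>`) is an assumption.

```python
import pandas as pd


def filter_tf_links(df_links, min_tissues):
    """Keep unique (TF, TargetGene) pairs seen in at least `min_tissues` tissues.

    Hypothetical helper; assumes an inclusive (>=) threshold.
    """
    kept = df_links.loc[df_links["tissue_count"] >= min_tissues, ["TF", "TargetGene"]]
    return kept.drop_duplicates().reset_index(drop=True)


# Toy table with the same three columns as df_tf
toy_links = pd.DataFrame(
    {
        "TF": ["TF1", "TF1", "TF2", "TF2"],
        "TargetGene": ["G1", "G2", "G1", "G3"],
        "tissue_count": [1, 12, 10, 3],
    }
)

kept = filter_tf_links(toy_links, min_tissues=10)
print(kept)
```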


In [26]:
# Extract all gene names from the RNA dataframe columns
gene_list = list(dataset_obj.df_rna.columns)

# Convert gene list to a set for faster membership checks
set_gene = set(gene_list)

# Keep only rows where both TF and TargetGene are in the gene set.
# This already restricts the links to gene-gene pairs within the set, so an
# explicit Cartesian product of the gene set with itself is not needed.
tf_base_filtered = df_tf[df_tf['TF'].isin(set_gene) & df_tf['TargetGene'].isin(set_gene)]

# Build a dataframe of the unique (TF, TargetGene) pairs
df_tf_new = (
    tf_base_filtered[['TF', 'TargetGene']]
    .drop_duplicates()
    .reset_index(drop=True)
)

# Concatenate the new TF-target dataframe with the existing one in dataset_obj
df_tf_all = pd.concat([df_tf_new, dataset_obj.df_tf])

# Remove duplicate rows to ensure uniqueness
df_tf_all = df_tf_all.drop_duplicates()

# Update dataset_obj with the merged TF-target dataframe
dataset_obj.df_tf = df_tf_all


In [27]:
# Create a dictionary to store the index of each element in dataset_obj.array_peak
peak_index_dict = {peak: idx for idx, peak in enumerate(dataset_obj.array_peak)}

# Initialize lists to store indices
index_1 = []
index_2 = []

# Iterate over 'TF' and 'TargetGene' columns in dataset_obj.df_tf
for k1, k2 in zip(dataset_obj.df_tf['TF'].values, dataset_obj.df_tf['TargetGene'].values):
    # Use the dictionary to quickly retrieve indices
    index_1.append(peak_index_dict[k1])
    index_2.append(peak_index_dict[k2])

# Stack the two index lists column-wise and convert to a PyTorch tensor
tf_edge_vec = torch.tensor(np.vstack([index_1, index_2]).T)

# Assign the TF edge tensor to the edge_tf attribute of each graph in the list
for t in dataset_obj.list_graph:
    t.edge_tf = tf_edge_vec
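
The dictionary lookup in this cell raises a `KeyError` if a TF or target gene is not a node in `array_peak`. The sketch below shows the same name-to-index mapping on toy data with NumPy, plus an optional step that drops pairs whose endpoints are missing. `build_tf_edges` and the toy node names are hypothetical, and whether missing pairs should be dropped or treated as an error depends on your data.

```python
import numpy as np
import pandas as pd


def build_tf_edges(node_names, df_pairs):
    """Map (TF, TargetGene) names to node indices; returns an (N, 2) int array.

    Hypothetical helper: pairs with a name missing from `node_names` are skipped.
    """
    node_index = {name: idx for idx, name in enumerate(node_names)}
    src = df_pairs["TF"].map(node_index)          # NaN where the TF is unknown
    dst = df_pairs["TargetGene"].map(node_index)  # NaN where the target is unknown
    keep = src.notna() & dst.notna()
    return np.column_stack([src[keep].astype(int), dst[keep].astype(int)])


# Toy node list: peaks first, then genes
node_names = ["chr1:100-600", "chr1:900-1400", "GeneA", "GeneB", "GeneC"]
pairs = pd.DataFrame(
    {
        "TF": ["GeneA", "GeneC", "GeneX"],       # GeneX is not a node
        "TargetGene": ["GeneB", "GeneA", "GeneB"],
    }
)

edges = build_tf_edges(node_names, pairs)
print(edges)
```

Wrapping the result in `torch.tensor(edges)` would give the same two-column layout that the cell stores in `edge_tf`.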


In [28]:
dataset_obj.list_graph[0]

Data(x=[61528, 1], edge_index=[2, 69180], y=[1], edge_tf=[4036, 2], y_exp=[1348], cell='TCCATACCAATGATGCGGCATTCT')

In [29]:
file_atac_test = os.path.join('./', 'dataset_atac_kRG_MFG.pkl')
with open(file_atac_test, 'wb') as w_pkl:
    # Serialize the updated dataset object directly to the file handle
    pickle.dump(dataset_obj, w_pkl)
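
To confirm that a saved dataset can be restored, a round trip through `pickle` is a quick check. The sketch below uses a small dictionary and a temporary file in place of `dataset_obj` and `dataset_atac_kRG_MFG.pkl`. Loading the real file additionally requires `scregat` to be importable, because `pickle` needs the class definition to rebuild the object.

```python
import os
import pickle
import tempfile

# Toy object standing in for dataset_obj
toy_dataset = {"cells": ["c1", "c2"], "n_nodes": 5}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_dataset.pkl")

    # Write the object, then read it back from disk
    with open(path, "wb") as handle:
        pickle.dump(toy_dataset, handle)
    with open(path, "rb") as handle:
        restored = pickle.load(handle)

print(restored == toy_dataset)
```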


## Training the Model and Obtaining Regulatory Scores

You can train the model and generate regulatory scores directly from the command line.
For example, the following command trains on the dataset saved above, writes the regulatory scores to `RS_score.h5ad`, and saves the model weights to `model_init.pth`:

```bash
python ./scReGAT/run_scregat_cli.py \
  --input_file ./dataset_atac_kRG_MFG.pkl \
  --output_file ./RS_score.h5ad \
  --save_model_path ./scReGAT/data/model_init.pth \
  --gpu 2
```

---

## Argument Description

```python
def parse_args():
    parser = argparse.ArgumentParser(description="Run scReGAT Model Training and Inference")

    # --- I/O parameters ---
    parser.add_argument('--input_file', type=str, required=True, 
                        help='Path to the input ATAC pickle file (e.g., dataset_atac_core_MFG.pkl)')
    parser.add_argument('--output_file', type=str, required=True, 
                        help='Path to save the output AnnData file (e.g., result.h5ad)')

    # --- Single-cell expression integration parameters ---
    parser.add_argument('--use_sc_exp', action='store_true',
                        help='Enable integration of single-cell RNA expression data into graph node features')
    parser.add_argument('--rna_file', type=str, default=None,
                        help='Path to the RNA .h5ad file; required if --use_sc_exp is set')

    # --- Model saving and loading ---
    parser.add_argument('--save_model_path', type=str, default=None, 
                        help='Optional path to save trained model parameters')
    parser.add_argument('--load_model_path', type=str, default=None, 
                        help='Optional path to load pre-trained model parameters')

    # --- Training control ---
    parser.add_argument('--skip_train', action='store_true',
                        help='Skip the training phase and run inference directly')
    parser.add_argument('--seed', type=int, default=1233, 
                        help='Random seed (default: 1233)')
    parser.add_argument('--epochs', type=int, default=4, 
                        help='Number of training epochs (default: 4)')
    parser.add_argument('--lr', type=float, default=1e-4, 
                        help='Learning rate (default: 1e-4)')
    parser.add_argument('--batch_size', type=int, default=15, 
                        help='Training batch size (default: 15)')
    parser.add_argument('--sparse_loss_weight', type=float, default=0.1, 
                        help='Weight for sparse loss (default: 0.1)')

    # --- Testing / inference parameters ---
    parser.add_argument('--test_batch_size', type=int, default=20, 
                        help='Batch size for inference (default: 20)')
    parser.add_argument('--test_ratio', type=float, default=0.5, 
                        help='Ratio of cells to use for testing (default: 0.5)')

    # --- Hardware parameters ---
    parser.add_argument('--gpu', type=int, default=1, 
                        help='GPU ID to use; set -1 to use CPU (default: 1)')

    return parser.parse_args()
```

---

## Recommendation

We recommend that users initialize training on new datasets from the provided model weight file:

```
./scReGAT/data/model_init.pth
```

This provides a stable starting point for training and helps keep results consistent across experiments.
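
One way to follow this recommendation is to pass the weight file through `--load_model_path`, which the argument list describes as loading pre-trained model parameters. The sketch below only assembles such a command in Python from flags that appear in `parse_args`. The helper name `build_scregat_command`, the fine-tuned output path, and the GPU ID are illustrative assumptions, and the command is printed rather than executed.

```python
import shlex


def build_scregat_command(input_file, output_file, load_model_path=None,
                          save_model_path=None, gpu=-1):
    """Assemble an argument list for run_scregat_cli.py (hypothetical helper)."""
    cmd = [
        "python", "./scReGAT/run_scregat_cli.py",
        "--input_file", input_file,
        "--output_file", output_file,
        "--gpu", str(gpu),  # -1 selects the CPU according to the help text
    ]
    if load_model_path is not None:
        cmd += ["--load_model_path", load_model_path]
    if save_model_path is not None:
        cmd += ["--save_model_path", save_model_path]
    return cmd


cmd = build_scregat_command(
    input_file="./dataset_atac_kRG_MFG.pkl",
    output_file="./RS_score.h5ad",
    load_model_path="./scReGAT/data/model_init.pth",
    save_model_path="./my_finetuned_model.pth",  # illustrative output path
    gpu=0,
)
print(shlex.join(cmd))
```

The printed line can be pasted into a shell, or the list can be handed to `subprocess.run(cmd, check=True)` from the directory that contains the `scReGAT` checkout.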

---
