# Creating a BINN

This notebook walks through examples of how a BINN can be created and trained.

The method begins by constructing a directed graph that represents biological pathways and mapping the input features (e.g., proteins or genes) to nodes in that graph. The graph is then processed into hierarchical layers and connectivity matrices, which define the structure of the BINN.
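To make the idea of layers and connectivity matrices concrete, here is a minimal sketch on a toy graph. The node names and the layering rule (grouping nodes by their distance to the top of the hierarchy) are assumptions made for illustration; this is not the `binn` package's internal implementation.

```python
# Toy illustration only: hypothetical node names and a simplified layering
# rule, not the binn package's internal implementation.
from collections import defaultdict

# Each edge points from a child (more specific) node to its parent.
edges = [
    ("P1", "pathway_a"),
    ("P2", "pathway_a"),
    ("P3", "pathway_b"),
    ("pathway_a", "top"),
    ("pathway_b", "top"),
]

parents = defaultdict(list)
nodes = set()
for child, parent in edges:
    parents[child].append(parent)
    nodes.update((child, parent))


def depth(node):
    """Number of steps from `node` up to a node that has no parents."""
    node_parents = parents.get(node, [])
    if not node_parents:
        return 0
    return 1 + max(depth(p) for p in node_parents)


# Group nodes into layers, input features first and the top node last.
max_depth = max(depth(n) for n in nodes)
layers = [
    sorted(n for n in nodes if depth(n) == d)
    for d in range(max_depth, -1, -1)
]

edge_set = set(edges)


def connectivity(lower, upper):
    """Binary matrix: rows are `lower` nodes, columns are `upper` nodes."""
    return [[1 if (lo, up) in edge_set else 0 for up in upper] for lo in lower]


matrices = [
    connectivity(layers[i], layers[i + 1]) for i in range(len(layers) - 1)
]

print(layers)       # [['P1', 'P2', 'P3'], ['pathway_a', 'pathway_b'], ['top']]
print(matrices[0])  # [[1, 0], [1, 0], [0, 1]]
```

Each matrix has one row per node in the lower layer and one column per node in the layer above. A zero entry marks a connection that the pathway graph does not support, and this is where the sparsity of the network comes from.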

To create your own BINN from scratch, you need input data (`input_data` below) in the form of a pandas DataFrame.



In [1]:
from binn import BINN
import pandas as pd

input_data = pd.read_csv("../binn/data/sample_datamatrix.csv")

binn = BINN(
    data_matrix=input_data,
    network_source="reactome",
    input_source="uniprot",
    n_layers=4,
    dropout=0.2,
)

binn


[INFO] BINN is on device: cpu


BINN(
  (layers): Sequential(
    (Layer_0): Linear(in_features=448, out_features=471, bias=True)
    (BatchNorm_0): BatchNorm1d(471, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (Dropout_0): Dropout(p=0.2, inplace=False)
    (Tanh_0): Tanh()
    (Layer_1): Linear(in_features=471, out_features=306, bias=True)
    (BatchNorm_1): BatchNorm1d(306, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (Dropout_1): Dropout(p=0.2, inplace=False)
    (Tanh_1): Tanh()
    (Layer_2): Linear(in_features=306, out_features=125, bias=True)
    (BatchNorm_2): BatchNorm1d(125, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (Dropout_2): Dropout(p=0.2, inplace=False)
    (Tanh_2): Tanh()
    (Layer_3): Linear(in_features=125, out_features=28, bias=True)
    (BatchNorm_3): BatchNorm1d(28, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (Dropout_3): Dropout(p=0.2, inplace=False)
    (Tanh_3): Tanh()
    (Output): Linear(in_feat

You can also provide your own pathways and mapping to create a `PathwayNetwork`. The `PathwayNetwork` underlies the pruning that makes the BINN sparse. The pathway file is a standard edge list, and the mapping links each input entity to a node in that edge list.
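As a rough illustration of how a mapping ties input entities to an edge list, the sketch below uses made-up identifiers and plain tuples. The tuple order (`(input, pathway)` for the mapping and `(parent, child)` for the pathways) is an assumption of this toy example; check the package documentation for the order that `PathwayNetwork` expects.

```python
# Hypothetical identifiers for illustration; real files use UniProt and
# Reactome IDs. Tuple order is an assumption of this toy example.
measured = ["P1", "P2", "P9"]  # input entities, e.g. protein IDs

# (input, pathway): which pathway each input entity belongs to.
mapping = [("P1", "path_a"), ("P2", "path_b"), ("P3", "path_c")]

# (parent, child): the pathway hierarchy as an edge list.
pathways = [("root_x", "path_a"), ("root_x", "path_b"), ("root_y", "path_c")]

measured_set = set(measured)
mapped_inputs = {entity for entity, _ in mapping}

# Keep only mapping rows for entities that were actually measured.
mapped = [(entity, pathway) for entity, pathway in mapping if entity in measured_set]

# Measured entities with no pathway annotation cannot be placed in the graph.
unmapped = [entity for entity in measured if entity not in mapped_inputs]

# Index the edge list so we can walk from a pathway up to its ancestors.
parents_of = {}
for parent, child in pathways:
    parents_of.setdefault(child, []).append(parent)

# Collect every pathway reachable from the measured inputs.
reachable = set()
frontier = [pathway for _, pathway in mapped]
while frontier:
    node = frontier.pop()
    if node in reachable:
        continue
    reachable.add(node)
    frontier.extend(parents_of.get(node, []))

# Edges whose child is never reached from an input can be dropped.
kept_edges = [(parent, child) for parent, child in pathways if child in reachable]

print(sorted(reachable))  # ['path_a', 'path_b', 'root_x']
print(kept_edges)         # [('root_x', 'path_a'), ('root_x', 'path_b')]
print(unmapped)           # ['P9']
```

In this toy case `P9` has no annotation and `path_c` is never reached from a measured input, so neither contributes to the resulting graph.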

In [2]:
from binn import PathwayNetwork

mapping = pd.read_csv("../binn/data/downloads/uniprot_2_reactome_2025_01_14.txt", sep="\t")
pathways = pd.read_csv("../binn/data/downloads/reactome_pathways_relation_2025_01_14.txt", sep="\t")
pathways = list(pathways.itertuples(index=False, name=None))
mapping = list(mapping.itertuples(index=False, name=None))
input_entities = input_data["Protein"].tolist()

network = PathwayNetwork(
    input_data=input_entities,
    pathways=pathways,
    mapping=mapping,
)

list(network.pathway_graph.edges())[0]

('R-HSA-109703', 'R-HSA-109704')

In [3]:

# Alternatively, pass custom pathways and mapping directly to BINN
mapping = pd.read_csv(
    "../binn/data/downloads/uniprot_2_reactome_2025_01_14.txt",
    sep="\t",
    header=None,
    names=["input", "translation", "url", "name", "x", "species"],
)
pathways = pd.read_csv(
    "../binn/data/downloads/reactome_pathways_relation_2025_01_14.txt",
    sep="\t",
    header=None,
    names=["target", "source"],
)

binn = BINN(data_matrix=input_data, mapping=mapping, pathways=pathways)

binn.layers


[INFO] BINN is on device: cpu


Sequential(
  (Layer_0): Linear(in_features=448, out_features=471, bias=True)
  (BatchNorm_0): BatchNorm1d(471, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (Dropout_0): Dropout(p=0, inplace=False)
  (Tanh_0): Tanh()
  (Layer_1): Linear(in_features=471, out_features=306, bias=True)
  (BatchNorm_1): BatchNorm1d(306, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (Dropout_1): Dropout(p=0, inplace=False)
  (Tanh_1): Tanh()
  (Layer_2): Linear(in_features=306, out_features=125, bias=True)
  (BatchNorm_2): BatchNorm1d(125, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (Dropout_2): Dropout(p=0, inplace=False)
  (Tanh_2): Tanh()
  (Layer_3): Linear(in_features=125, out_features=28, bias=True)
  (BatchNorm_3): BatchNorm1d(28, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (Dropout_3): Dropout(p=0, inplace=False)
  (Tanh_3): Tanh()
  (Output): Linear(in_features=28, out_features=2, bias=True)
)

We can also build an ensemble of heads, in which the output of each layer in the network is passed through its own linear layer and the results are summed at the end.
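The forward pass of such an ensemble can be sketched in plain Python. The weights, layer sizes, and helper names below are made up for illustration, and the real model uses PyTorch modules rather than nested lists, so treat this as a conceptual sketch only.

```python
# Conceptual sketch with made-up weights; not the binn implementation.
import math


def linear(x, weights, bias):
    """Dense layer: one output per row of `weights`."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def tanh(values):
    return [math.tanh(v) for v in values]


# Two hidden blocks (2 -> 3 -> 1 units), each followed by a tanh activation.
blocks = [
    ([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [0.0, 0.0, 0.0]),
    ([[1.0, 1.0, 1.0]], [0.0]),
]

# One linear head per block, each mapping that block's output to 2 logits.
heads = [
    ([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]], [0.0, 0.0]),
    ([[1.0], [-1.0]], [0.0, 0.0]),
]


def forward(x):
    logits = [0.0, 0.0]
    hidden = x
    for (block_w, block_b), (head_w, head_b) in zip(blocks, heads):
        hidden = tanh(linear(hidden, block_w, block_b))
        head_out = linear(hidden, head_w, head_b)
        # Sum the contribution of every head into the final output.
        logits = [total + out for total, out in zip(logits, head_out)]
    return logits


logits = forward([1.0, -1.0])
print(logits)
```

Because every layer feeds the output through its own head, each level of the hierarchy contributes to the prediction directly instead of only through the layers above it.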

In [4]:
binn = BINN(
    data_matrix=input_data,
    network_source="reactome",
    heads_ensemble=True,
    n_layers=4,
    dropout=0.2,
)

binn.layers


[INFO] BINN is on device: cpu


_EnsembleHeads(
  (blocks): ModuleList(
    (0): Sequential(
      (Linear_0): Linear(in_features=448, out_features=471, bias=True)
      (BatchNorm_0): BatchNorm1d(471, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (Tanh_0): Tanh()
    )
    (1): Sequential(
      (Linear_1): Linear(in_features=471, out_features=306, bias=True)
      (BatchNorm_1): BatchNorm1d(306, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (Tanh_1): Tanh()
    )
    (2): Sequential(
      (Linear_2): Linear(in_features=306, out_features=125, bias=True)
      (BatchNorm_2): BatchNorm1d(125, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (Tanh_2): Tanh()
    )
    (3): Sequential(
      (Linear_3): Linear(in_features=125, out_features=28, bias=True)
      (BatchNorm_3): BatchNorm1d(28, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (Tanh_3): Tanh()
    )
  )
  (heads): ModuleList(
    (0): Sequential(
      (0): Linear(in_f

In [5]:
binn.inputs[0]

np.str_('A0M8Q6')

The layer names correspond to the input and intermediate layers of the model. The first name in the first layer matches the first input shown above.

In [6]:
layers = binn.layer_names
layers[0][0]

np.str_('A0M8Q6')

# Training

In [7]:
from binn import BINN, BINNDataLoader, BINNTrainer
import pandas as pd

# Load your data
data_matrix = pd.read_csv("../binn/data/sample_datamatrix.csv")
design_matrix = pd.read_csv("../binn/data/sample_design_matrix.tsv", sep="\t")

# Initialize BINN
binn = BINN(data_matrix=data_matrix, network_source="reactome", n_layers=4, dropout=0.2)

# Initialize DataLoader
binn_dataloader = BINNDataLoader(binn)

# Create DataLoaders
dataloaders = binn_dataloader.create_dataloaders(
    data_matrix=data_matrix,
    design_matrix=design_matrix,
    feature_column="Protein",
    group_column="group",
    sample_column="sample",
    batch_size=32,
    validation_split=0.2,
)

# Train the model
trainer = BINNTrainer(binn)
trainer.fit(dataloaders=dataloaders, num_epochs=50)


[INFO] BINN is on device: cpu
Mapping group labels: {np.int64(1): 0, np.int64(2): 1}
[Epoch 1/50] Train Loss: 0.6418, Train Accuracy: 0.6054
[Epoch 1/50] Val Loss: 0.6931, Val Accuracy: 0.5312
[Epoch 2/50] Train Loss: 0.6577, Train Accuracy: 0.6573
[Epoch 2/50] Val Loss: 0.6929, Val Accuracy: 0.5312
[Epoch 3/50] Train Loss: 0.6903, Train Accuracy: 0.6360
[Epoch 3/50] Val Loss: 0.6925, Val Accuracy: 0.5312
[Epoch 4/50] Train Loss: 0.6185, Train Accuracy: 0.6705
[Epoch 4/50] Val Loss: 0.6921, Val Accuracy: 0.5312
[Epoch 5/50] Train Loss: 0.6848, Train Accuracy: 0.5985
[Epoch 5/50] Val Loss: 0.6916, Val Accuracy: 0.5312
[Epoch 6/50] Train Loss: 0.6150, Train Accuracy: 0.6623
[Epoch 6/50] Val Loss: 0.6906, Val Accuracy: 0.6719
[Epoch 7/50] Train Loss: 0.6657, Train Accuracy: 0.5985
[Epoch 7/50] Val Loss: 0.6872, Val Accuracy: 0.6875
[Epoch 8/50] Train Loss: 0.6146, Train Accuracy: 0.6616
[Epoch 8/50] Val Loss: 0.6794, Val Accuracy: 0.6562
[Epoch 9/50] Train Loss: 0.5771, Train Accuracy: 0