# BINN - Biologically Informed Neural Network

This notebook demonstrates a few ways to create a BINN.

First, read some test data. This requires an input file and a pathway file, which correspond to the first (input) layer and the intermediary (hidden) layers of the model. Optionally, a translation file maps the inputs to the intermediary layers.

In this example, the input layer consists of proteins with UniProt IDs and the intermediary layers consist of biological pathways with Reactome IDs. The translation file maps the UniProt IDs to the Reactome IDs.
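To make the role of the translation file concrete, here is a minimal sketch in plain Python. All identifiers (`PROT_A`, `PATH_1`, and so on) are hypothetical toy values, and the grouping logic is only an illustration of the idea, not how `binn` processes the files internally.

```python
# Toy, hypothetical identifiers; the real files use UniProt and Reactome IDs.
input_entities = ["PROT_A", "PROT_B", "PROT_C"]

# Translation rows: (input ID, pathway ID).
mapping = [
    ("PROT_A", "PATH_1"),
    ("PROT_B", "PATH_1"),
    ("PROT_B", "PATH_2"),
    ("PROT_X", "PATH_3"),  # PROT_X was not measured, so this row is skipped
]

measured = set(input_entities)

# Group the measured inputs under the pathway each one is annotated to.
first_hidden = {}
for protein, pathway in mapping:
    if protein in measured:
        first_hidden.setdefault(pathway, []).append(protein)

# Inputs without any annotation cannot be connected to a pathway node.
annotated = {p for members in first_hidden.values() for p in members}
unmapped = sorted(measured - annotated)

print(first_hidden)  # {'PATH_1': ['PROT_A', 'PROT_B'], 'PATH_2': ['PROT_B']}
print(unmapped)      # ['PROT_C']
```

In this toy case only `PATH_1` and `PATH_2` receive input, while `PROT_C` has no pathway to connect to.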

In [1]:
import pandas as pd

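# Quantitative matrix with one row per protein
input_data = pd.read_csv("../data/test_qm.csv")
# Tab-separated Reactome files: protein-to-pathway mapping and pathway relations
mapping = pd.read_csv("../data/uniprot_2_reactome_2025_01_14.txt", sep="\t")
pathways = pd.read_csv("../data/reactome_pathways_relation_2025_01_14.txt", sep="\t")

# Protein identifiers that form the input layer
input_entities = input_data["Protein"].tolist()
# Convert each DataFrame to a plain list of row tuples
pathways = list(pathways.itertuples(index=False, name=None))
mapping = list(mapping.itertuples(index=False, name=None))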

The BINN is created using a PathwayNetwork. We can create our own PathwayNetwork like so:

In [2]:
from binn import PathwayNetwork

network = PathwayNetwork(
    input_data=input_entities,
    pathways=pathways,
    mapping=mapping,
)

network.pathway_graph.edges()


OutEdgeView([('R-HSA-109703', 'R-HSA-109704'), ('R-HSA-165160', 'R-HSA-109703'), ('R-HSA-109704', 'R-HSA-112399'), ('R-HSA-165158', 'R-HSA-109704'), ('R-HSA-111885', 'R-HSA-418594'), ('R-HSA-112040', 'R-HSA-111885'), ('R-HSA-180024', 'R-HSA-111885'), ('R-HSA-202040', 'R-HSA-111885'), ('R-HSA-111931', 'R-HSA-111933'), ('R-HSA-163615', 'R-HSA-111931'), ('R-HSA-111933', 'R-HSA-111997'), ('R-HSA-111932', 'R-HSA-111933'), ('R-HSA-111957', 'R-HSA-111933'), ('R-HSA-111996', 'R-HSA-112043'), ('R-HSA-111995', 'R-HSA-111996'), ('R-HSA-111997', 'R-HSA-111996'), ('R-HSA-112043', 'R-HSA-112040'), ('R-HSA-170660', 'R-HSA-112040'), ('R-HSA-170670', 'R-HSA-112040'), ('R-HSA-379401', 'R-HSA-112311'), ('R-HSA-380615', 'R-HSA-112311'), ('R-HSA-112399', 'R-HSA-2428928'), ('R-HSA-112412', 'R-HSA-112399'), ('R-HSA-114508', 'R-HSA-416476'), ('R-HSA-426048', 'R-HSA-114508'), ('R-HSA-1250196', 'R-HSA-1227986'), ('R-HSA-1251932', 'R-HSA-1227986'), ('R-HSA-1306955', 'R-HSA-1227986'), ('R-HSA-1963640', 'R-HSA-122
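The edge list above can be read as a directed graph. As a rough sketch, the snippet below copies six of the printed edges into a plain adjacency map and follows them in the printed direction. The helper `downstream` is a hypothetical name introduced here for illustration; it is not part of `binn`.

```python
from collections import defaultdict, deque

# (source, target) pairs copied from the printed edge view above.
edges = [
    ("R-HSA-109703", "R-HSA-109704"),
    ("R-HSA-165160", "R-HSA-109703"),
    ("R-HSA-109704", "R-HSA-112399"),
    ("R-HSA-165158", "R-HSA-109704"),
    ("R-HSA-112399", "R-HSA-2428928"),
    ("R-HSA-112412", "R-HSA-112399"),
]

# Adjacency map: node -> list of nodes it points to.
successors = defaultdict(list)
for source, target in edges:
    successors[source].append(target)


def downstream(node):
    """Return every node reachable from `node`, in breadth-first order."""
    seen = []
    queue = deque(successors.get(node, []))
    while queue:
        current = queue.popleft()
        if current not in seen:
            seen.append(current)
            queue.extend(successors.get(current, []))
    return seen


chain = downstream("R-HSA-165160")
print(chain)
# ['R-HSA-109703', 'R-HSA-109704', 'R-HSA-112399', 'R-HSA-2428928']
```

Each step along such a chain moves one level further through the pathway hierarchy, which is what lets the graph be cut into successive layers.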

The BINN is implemented in PyTorch Lightning and needs only a data matrix as input. With `use_reactome=True`, the graph is built from the default Reactome database. Alternatively, we can supply our own `pathways` and `mapping`.

In [6]:
from binn import BINN

# Create using reactome (default)
binn = BINN(
    data_matrix=input_data,
    use_reactome=True,
    n_layers=4,
    dropout=0.2,
    validate=False,
)

print(binn.layers)

# or custom pathways and mapping
mapping = pd.read_csv(
    "../data/uniprot_2_reactome_2025_01_14.txt",
    sep="\t",
    header=None,
    names=["input", "translation", "url", "name", "x", "species"],
)
pathways = pd.read_csv(
    "../data/reactome_pathways_relation_2025_01_14.txt",
    sep="\t",
    header=None,
    names=["target", "source"],
)

binn = BINN(data_matrix=input_data, mapping=mapping, pathways=pathways)

binn.layers


BINN is on the device: cpu
Sequential(
  (Layer_0): Linear(in_features=448, out_features=471, bias=True)
  (BatchNorm_0): BatchNorm1d(471, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (Dropout_0): Dropout(p=0.2, inplace=False)
  (Tanh 0): Tanh()
  (Layer_1): Linear(in_features=471, out_features=306, bias=True)
  (BatchNorm_1): BatchNorm1d(306, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (Dropout_1): Dropout(p=0.2, inplace=False)
  (Tanh 1): Tanh()
  (Layer_2): Linear(in_features=306, out_features=125, bias=True)
  (BatchNorm_2): BatchNorm1d(125, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (Dropout_2): Dropout(p=0.2, inplace=False)
  (Tanh 2): Tanh()
  (Layer_3): Linear(in_features=125, out_features=28, bias=True)
  (BatchNorm_3): BatchNorm1d(28, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (Dropout_3): Dropout(p=0.2, inplace=False)
  (Tanh 3): Tanh()
  (Output): Linear(in_features=28, out_features=2,

Sequential(
  (Layer_0): Linear(in_features=448, out_features=471, bias=True)
  (BatchNorm_0): BatchNorm1d(471, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (Dropout_0): Dropout(p=0, inplace=False)
  (Tanh 0): Tanh()
  (Layer_1): Linear(in_features=471, out_features=306, bias=True)
  (BatchNorm_1): BatchNorm1d(306, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (Dropout_1): Dropout(p=0, inplace=False)
  (Tanh 1): Tanh()
  (Layer_2): Linear(in_features=306, out_features=125, bias=True)
  (BatchNorm_2): BatchNorm1d(125, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (Dropout_2): Dropout(p=0, inplace=False)
  (Tanh 2): Tanh()
  (Layer_3): Linear(in_features=125, out_features=28, bias=True)
  (BatchNorm_3): BatchNorm1d(28, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (Dropout_3): Dropout(p=0, inplace=False)
  (Tanh 3): Tanh()
  (Output): Linear(in_features=28, out_features=2, bias=True)
)

We can also build an ensemble of heads, in which the output of each layer in the network is passed through its own linear layer and the resulting per-layer outputs are summed at the end.

In [8]:
binn = BINN(
    data_matrix=input_data,
    use_reactome=True,
    heads_ensemble=True,
    n_layers=4,
    dropout=0.2,
    validate=False,
)

binn.layers


BINN is on the device: cpu


EnsembleHeads(
  (blocks): ModuleList(
    (0): Sequential(
      (Linear_0): Linear(in_features=448, out_features=471, bias=True)
      (BatchNorm_0): BatchNorm1d(471, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (Activation_0): Tanh()
    )
    (1): Sequential(
      (Linear_1): Linear(in_features=471, out_features=306, bias=True)
      (BatchNorm_1): BatchNorm1d(306, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (Activation_1): Tanh()
    )
    (2): Sequential(
      (Linear_2): Linear(in_features=306, out_features=125, bias=True)
      (BatchNorm_2): BatchNorm1d(125, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (Activation_2): Tanh()
    )
    (3): Sequential(
      (Linear_3): Linear(in_features=125, out_features=28, bias=True)
      (BatchNorm_3): BatchNorm1d(28, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (Activation_3): Tanh()
    )
  )
  (heads): ModuleList(
    (0): Sequential(

In [4]:
binn.trainable_params

np.float64(4920.0)
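For context on this number, the arithmetic below counts the parameters a fully dense network with the printed layer widths would have. How `trainable_params` is computed is not shown in this notebook, so the comparison is only indicative.

```python
# Layer widths read from the Sequential printout above:
# input, four hidden layers, output.
widths = [448, 471, 306, 125, 28, 2]

# A dense Linear(n_in, n_out) layer has n_in * n_out weights plus n_out biases.
linear_params = sum(
    n_in * n_out + n_out for n_in, n_out in zip(widths[:-1], widths[1:])
)

# Each BatchNorm1d with affine=True adds a scale and a shift per feature.
batchnorm_params = sum(2 * n for n in widths[1:-1])

dense_total = linear_params + batchnorm_params
print(linear_params, batchnorm_params, dense_total)
# 397872 1860 399732
```

The reported 4920 is far below the dense total of 399732. One plausible reading, which this notebook does not confirm, is that only the connections supported by the pathway annotations are counted as trainable.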

The layer names correspond to the input and intermediary layers of the model. For example, the first name in the first layer is an input protein ID:

In [5]:
layers = binn.layer_names
layers[0][0]

np.str_('A0M8Q6')