# Scikit-learn API

For those unfamiliar with PyTorch, we provide a scikit-learn-style wrapper that exposes the familiar `fit` and `predict` methods.

In [1]:
from binn import BINNClassifier, Network, SuperLogger
import pandas as pd




As with the PyTorch API, we load the data and create a network. This time, however, we create a `BINNClassifier` object, which is the scikit-learn wrapper class.

In [2]:
pathways = pd.read_csv("../data/pathways.tsv", sep="\t")
translation = pd.read_csv("../data/translation.tsv", sep="\t")
input_data = pd.read_csv("../data/test_qm.csv")
design_matrix = pd.read_csv("../data/design_matrix.tsv", sep="\t")

network = Network(
    input_data=input_data,
    pathways=pathways,
    mapping=translation,
    source_column="child",
    target_column="parent"
)

binn = BINNClassifier(
    network=network,
    n_layers=4,
    dropout=0.2,
    epochs=3,
    threads=10,
    logger=SuperLogger("logs/test")
)
binn.clf.features

Missing logger folder: logs/test/lightning_logs



BINN is on the device: cpu


Index(['A0M8Q6', 'O00194', 'O00391', 'O14786', 'O14791', 'O15145', 'O43707',
       'O75369', 'O75594', 'O75636',
       ...
       'Q9UBE0', 'Q9UBQ7', 'Q9UBR2', 'Q9UBX5', 'Q9UGM3', 'Q9UK55', 'Q9UNW1',
       'Q9Y490', 'Q9Y4L1', 'Q9Y6Z7'],
      dtype='object', length=449)

Before training, the data matrix has to be aligned with the input layer of the BINN, that is, with the features listed in `binn.clf.features`. We then fit the BINN.

In [3]:
from util_for_examples import generate_data, fit_data_matrix_to_network_input

X = fit_data_matrix_to_network_input(input_data.reset_index(), features=binn.clf.features)

X, y = generate_data(X, design_matrix)

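# Hold out the first 10 samples for testing and train on the rest.
X_test = X[:10]
X_train = X[10:]
y_test = y[:10]
y_train = y[10:]

# epochs=5 is what the trainer uses here (see max_epochs=5 in the log below),
# rather than the epochs=3 passed to the constructor.
binn.fit(X_train, y_train, epochs=5)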

GPU available: True (mps), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
You defined a `validation_step` but have no `val_dataloader`. Skipping val loop.

  | Name   | Type             | Params
--------------------------------------------
0 | layers | Sequential       | 364 K 
1 | loss   | CrossEntropyLoss | 0     
--------------------------------------------
364 K     Trainable params
0         Non-trainable params
364 K     Total params
1.457     Total estimated model params size (MB)
Consider setting `persistent_workers=True` in 'train_dataloader' to speed up the dataloader worker initialization.
The number of training batches (24) is smaller than the logging interval Trainer(log_every_n_steps=50). Set a lower value for log_every_n_steps if you want to see logs for the training epoch.



`Trainer.fit` stopped: `max_epochs=5` reached.


Epoch 4: 100%|██████████| 24/24 [00:07<00:00,  3.12it/s, v_num=0, train_loss=0.711, train_acc=0.572]
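
The alignment step in the cell above relies on `fit_data_matrix_to_network_input` from the examples' utility module, whose implementation is not shown here. As a rough sketch of the idea only, assuming a table with one row per protein and one column per sample, the rows can be reordered to match the network's feature order with pandas. The function name `align_to_features`, the `Protein` column name, and the zero fill for missing proteins are assumptions made for illustration, not the helper's actual behaviour.

```python
import pandas as pd


def align_to_features(data: pd.DataFrame, features, id_column: str = "Protein") -> pd.DataFrame:
    """Hypothetical helper: reorder rows so they match `features`.

    Rows whose identifier is not in `features` are dropped. Features that
    are missing from `data` become rows of zeros (an assumption).
    """
    indexed = data.set_index(id_column)
    # reindex keeps exactly the requested labels, in the requested order.
    return indexed.reindex(list(features), fill_value=0.0)


# Toy table with made-up values, not the tutorial data.
toy = pd.DataFrame(
    {
        "Protein": ["P2", "P1", "P9"],
        "sample_a": [1.0, 2.0, 3.0],
        "sample_b": [4.0, 5.0, 6.0],
    }
)
aligned = align_to_features(toy, features=["P1", "P2", "P3"])
print(aligned)
```

In this toy case `P9` is dropped because the network has no such input, and `P3` is added as an all-zero row so that the table has exactly one row per input feature.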


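We can now predict on the held-out instances. The output has one row per sample and one column per class; since some values are negative, these appear to be raw class scores (logits) rather than probabilities.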

In [4]:
binn.predict(X_test)

tensor([[ 0.6729, -0.2405],
        [ 0.5627,  0.0713],
        [-0.3760, -0.7794],
        [ 0.0175, -0.2304],
        [-0.4105,  0.2780],
        [-0.5066, -0.5398],
        [ 0.1077,  0.3245],
        [ 0.6781, -0.3180],
        [-0.0867, -0.2553],
        [-1.3702,  1.6543]])
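
The printed values include negative numbers, and the model summary lists a `CrossEntropyLoss`, which suggests they are unnormalized class scores. Under that assumption, a minimal sketch of turning them into probabilities and class labels with NumPy could look as follows. The scores are copied by hand from the output above; with a real PyTorch tensor, `.detach().cpu().numpy()` would convert it first. Which column corresponds to which class is not stated here.

```python
import numpy as np

# Scores copied from the printed prediction output above.
scores = np.array([
    [0.6729, -0.2405],
    [0.5627, 0.0713],
    [-0.3760, -0.7794],
    [0.0175, -0.2304],
    [-0.4105, 0.2780],
    [-0.5066, -0.5398],
    [0.1077, 0.3245],
    [0.6781, -0.3180],
    [-0.0867, -0.2553],
    [-1.3702, 1.6543],
])

# Softmax over the class axis; subtracting the row maximum avoids overflow.
shifted = scores - scores.max(axis=1, keepdims=True)
exp_scores = np.exp(shifted)
probs = exp_scores / exp_scores.sum(axis=1, keepdims=True)

# The predicted class is the column with the highest score.
labels = scores.argmax(axis=1)

print(probs.round(3))
print(labels)
```

Softmax preserves the ordering within each row, so taking the argmax of the raw scores gives the same labels as taking it over the probabilities.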