### Multiple Instance Learning for Molecular Conformers

In molecular modeling, a single molecule can adopt multiple **conformations** due to rotations around single bonds. Each conformation, or **conformer**, may exhibit slightly different physicochemical properties or biological activity.  

Traditional QSAR or ML approaches often select a single “representative” conformer per molecule, which can **miss important structural information**. Multiple Instance Learning (MIL) offers a principled solution.

---

### What is MIL?

**Multiple Instance Learning (MIL)** is a type of machine learning where **labels are associated with sets of instances (bags)** rather than individual instances.  
- **Bag:** A molecule  
- **Instance:** A conformer of that molecule  
- **Label:** Molecular property or activity  

The MIL model learns to predict the **bag-level label** while potentially considering the contributions of individual instances.  

This framework naturally accommodates the **multi-conformer nature of molecules** and allows models to:
1. Aggregate information across all conformers.
2. Optionally identify **key conformers** that drive the molecular property.
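
The two capabilities above can be illustrated with a minimal sketch. The bag, the linear scorer, and its weights are hypothetical stand-ins for real conformer descriptors and a trained network; they are not taken from any library.

```python
import numpy as np

# Hypothetical bag: 3 conformers (instances), each described by 4 descriptor values.
bag = np.array([
    [0.1, 0.4, 0.3, 0.9],
    [0.2, 0.5, 0.1, 0.7],
    [0.9, 0.8, 0.6, 0.2],
])

# Hypothetical per-instance scorer: a fixed linear model standing in for a trained network.
weights = np.array([1.0, 0.5, -0.5, 0.0])
instance_scores = bag @ weights

# 1. Aggregate across conformers: mean pooling yields a single bag-level prediction.
bag_prediction = instance_scores.mean()

# 2. Identify the key conformer: the instance with the highest score.
key_conformer = int(np.argmax(instance_scores))

print(bag_prediction, key_conformer)
```

Mean pooling is only one possible aggregation; attention-based models replace the uniform average with learned weights.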

---

### Goals of this Notebook

1. Generate conformers for a set of molecules.  
2. Compute 3D molecular descriptors for each conformer.  
3. Train MIL models to predict molecular properties using all conformers.  
4. (Optional) Identify key conformers that contribute most strongly to the predictions.  

By the end of this tutorial, you will understand how MIL can **leverage conformational diversity** in molecules for predictive modeling and interpretable insights.

### 1. Load dataset

The example datasets contain molecular structures (SMILES) and measured bioactivities on a negative-log scale (pKi or pIC50), so higher values mean greater potency. Each SMILES string is converted to an RDKit `Mol` object.
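
To make the "higher is better" convention concrete, the sketch below converts an inhibition constant to pKi. It assumes the activity column is on a negative-log molar scale; the helper name `ki_nm_to_pki` is hypothetical and not part of the notebook.

```python
import math

def ki_nm_to_pki(ki_nm: float) -> float:
    """Convert an inhibition constant in nanomolar to pKi = -log10(Ki in mol/L)."""
    if ki_nm <= 0:
        raise ValueError("Ki must be positive")
    # 1 nM = 1e-9 mol/L, so -log10(ki_nm * 1e-9) = 9 - log10(ki_nm).
    return 9.0 - math.log10(ki_nm)

# A 1 micromolar binder (1000 nM) sits at pKi = 6; a tighter 10 nM binder has pKi = 8.
pki_micromolar = ki_nm_to_pki(1000.0)
pki_potent = ki_nm_to_pki(10.0)

print(pki_micromolar, pki_potent)
```

Under this assumption, the threshold of 6 used by `reg_to_clf` in this notebook corresponds to a cutoff of about 1 micromolar.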

In [1]:
import numpy as np
import pandas as pd
from rdkit import Chem

from sklearn.metrics import r2_score, accuracy_score
from sklearn.model_selection import train_test_split

# Data
from huggingface_hub import hf_hub_download

In [2]:
def reg_to_clf(y):
    return np.where(np.array(y) > 6, 1, 0)

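def accuracy_metric(y_true, y_pred, task=None):
    # Accuracy for classification, R^2 for regression
    if task == "classification":
        return accuracy_score(y_true, y_pred)
    elif task == "regression":
        return r2_score(y_true, y_pred)
    # Fail loudly instead of silently returning None
    raise ValueError(f"Unknown task: {task!r}")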

In [3]:
TASK = "regression"
# TASK = "classification"

In [4]:
REPO_ID = "KagakuData/notebooks"

csv_path = hf_hub_download(REPO_ID, filename="chembl/CHEMBL279.csv", repo_type="dataset")
data = pd.read_csv(csv_path, header=None)

data_train, data_test = train_test_split(data, test_size=0.2)

In [5]:
smi_train, prop_train = data_train[0].to_list(), data_train[2].to_list()
smi_test, prop_test = data_test[0].to_list(), data_test[2].to_list()

if TASK == "classification":
    prop_train, prop_test = reg_to_clf(prop_train), reg_to_clf(prop_test)

In [6]:
mols_train, y_train = [], []
for smi, prop in zip(smi_train, prop_train):
    mol = Chem.MolFromSmiles(smi)
    if mol:
        mols_train.append(mol)
        y_train.append(prop)

In [7]:
mols_test, y_test = [], []
for smi, prop in zip(smi_test, prop_test):
    mol = Chem.MolFromSmiles(smi)
    if mol:
        mols_test.append(mol)
        y_test.append(prop)

### 2. Conformer generation

For each molecule, an ensemble of conformers is generated. Molecules for which conformer generation failed are then filtered out of both the training and test sets. The generated conformers can be accessed with `mol.GetConformers()`, or individually by ID with `mol.GetConformer(0)`.
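
The filtering step can be sketched in isolation as follows. The `Failed` class is a hypothetical sentinel standing in for whatever marker the generator returns on failure, and the string "molecules" are placeholders. The sketch also guards against the case where every item failed, in which unpacking `zip(*valid)` into two names would raise.

```python
class Failed:
    """Hypothetical sentinel marking an item whose generation failed."""

def filter_failed(results, labels):
    """Keep (result, label) pairs whose result is not a Failed sentinel."""
    valid = [(r, y) for r, y in zip(results, labels) if not isinstance(r, Failed)]
    if not valid:
        # zip(*[]) yields nothing, so two-name unpacking would fail here.
        return [], []
    kept_results, kept_labels = map(list, zip(*valid))
    return kept_results, kept_labels

# Placeholder inputs: the second item failed and must be dropped with its label.
results = ["mol_a", Failed(), "mol_c"]
labels = [7.1, 5.3, 6.4]
kept_results, kept_labels = filter_failed(results, labels)

print(kept_results, kept_labels)
```

Filtering results and labels together keeps the two lists aligned, which matters because later cells pair them by position.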

In [8]:
from qsarmil.conformer import RDKitConformerGenerator
from qsarmil.utils.logging import FailedConformer, FailedDescriptor

In [9]:
conf_gen = RDKitConformerGenerator(num_conf=10, num_cpu=40)

In [10]:
# Generate conformers for training molecules
confs_train = conf_gen.run(mols_train)

# Filter out molecules where conformer generation failed
valid = [(c, y) for c, y in zip(confs_train, y_train) if not isinstance(c, FailedConformer)]

# Unpack back into separate lists
confs_train, y_train = map(list, zip(*valid))

Generating conformers: 100%|██████████████████████████████████████████████████████████| 537/537 [00:20<00:00, 26.16it/s]


In [11]:
# Generate conformers for test molecules
confs_test = conf_gen.run(mols_test)

# Filter out molecules where conformer generation failed
valid = [(c, y) for c, y in zip(confs_test, y_test) if not isinstance(c, FailedConformer)]

# Unpack back into separate lists
confs_test, y_test = map(list, zip(*valid))


Generating conformers: 100%|██████████████████████████████████████████████████████████| 135/135 [00:06<00:00, 21.77it/s]


### 3. Descriptor calculation for conformers

Once conformers are generated for each molecule, the next step is to compute **3D molecular descriptors** — numerical representations that capture the geometric and physicochemical properties of each conformer.  
Since each molecule can have multiple conformers, the resulting data are structured as **bags of descriptor vectors**, where each **bag** corresponds to a molecule and each **instance** within the bag corresponds to a conformer.

To streamline this process, a **descriptor wrapper** is used.  
This wrapper provides a unified interface for applying multiple descriptor calculators sourced from external packages (e.g., **RDKit** and **MolFeat**). It automatically handles descriptor computation for all conformers in each molecule.

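Several descriptor types are available through the wrapper:
- **RDKit-based 3D descriptors:** `GEOM`, `AUTOCORR`, `RDF`, `MORSE`, `WHIM`, `GETAWAY`
- **MolFeat descriptors:** `Pharmacophore3D`, `USRDescriptors`, `ElectroShapeDescriptors`

The cells below apply only `Pharmacophore3D` (with the `pmapper` factory); the other calculators are imported but not used.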

After computing the raw descriptors, the values are **scaled** using `BagMinMaxScaler` from the `milearn.preprocessing` module.  
This scaling step ensures that descriptor values across conformers and molecules are brought to a consistent range, which helps stabilize and improve model training.
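
As an illustration of the idea, the sketch below applies min-max scaling to bags of different sizes. It assumes the scaler pools all training instances to compute per-feature statistics; this is a guess at the general technique, not the `BagMinMaxScaler` implementation, and the helper names are hypothetical.

```python
import numpy as np

def fit_min_max(bags):
    """Compute per-feature min and max over all instances in all bags."""
    stacked = np.vstack(bags)
    return stacked.min(axis=0), stacked.max(axis=0)

def transform_min_max(bags, feat_min, feat_max):
    """Scale each bag with the fitted statistics; constant features map to 0."""
    span = np.where(feat_max > feat_min, feat_max - feat_min, 1.0)
    return [(bag - feat_min) / span for bag in bags]

# Two training bags with 2 and 1 instances; the second feature is constant.
train_bags = [np.array([[0.0, 10.0], [2.0, 10.0]]), np.array([[4.0, 10.0]])]
test_bags = [np.array([[1.0, 10.0], [6.0, 10.0]])]

# Statistics come from the training bags only and are reused for the test bags.
feat_min, feat_max = fit_min_max(train_bags)
train_scaled = transform_min_max(train_bags, feat_min, feat_max)
test_scaled = transform_min_max(test_bags, feat_min, feat_max)

print(test_scaled[0])
```

Because the statistics are fitted on training data alone, scaled test values can fall outside the [0, 1] range, as the second test instance does here.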


In [12]:
from qsarmil.descriptor.rdkit import (RDKitGEOM, 
                                      RDKitAUTOCORR, 
                                      RDKitRDF, 
                                      RDKitMORSE, 
                                      RDKitWHIM, 
                                      RDKitGETAWAY)

from molfeat.calc import Pharmacophore3D, USRDescriptors, ElectroShapeDescriptors

from qsarmil.descriptor.wrapper import DescriptorWrapper

from milearn.preprocessing import BagMinMaxScaler

In [13]:
desc_calc = DescriptorWrapper(Pharmacophore3D(factory="pmapper"), num_cpu=4, verbose=True)

In [14]:
x_train = desc_calc.run(confs_train)

Calculating descriptors: 100%|████████████████████████████████████████████████████████| 537/537 [06:25<00:00,  1.39it/s]


In [15]:
x_test = desc_calc.run(confs_test)

Calculating descriptors: 100%|████████████████████████████████████████████████████████| 135/135 [01:48<00:00,  1.25it/s]


In [16]:
scaler = BagMinMaxScaler()

scaler.fit(x_train)

x_train_scaled = scaler.transform(x_train)
x_test_scaled = scaler.transform(x_test)

### 4. Mini-Benchmark: Evaluating Multiple MIL Architectures

This section performs a **mini-benchmark** of several **Multiple Instance Learning (MIL)** neural network architectures for both regression and classification tasks.  
The goal is to compare different pooling strategies and network types to evaluate how well they model the relationship between bags (molecules) and their instances (conformers).

---

#### Models Included

The benchmark covers multiple families of MIL models:

- **Wrapper MIL Networks**  
  These models wrap standard MLPs with MIL pooling strategies.  
  - `BagWrapperMLPNetwork` – pooling applied to bag embeddings  
  - `InstanceWrapperMLPNetwork` – pooling applied to instance embeddings  

- **Classic MIL Networks**  
  Standard architectures where pooling is directly integrated into the MIL framework.  
  - `BagNetwork`  
  - `InstanceNetwork`

- **Attention-Based MIL Networks**  
  Models that learn to assign importance weights to individual conformers.  
  - `AdditiveAttentionNetwork`  
  - `SelfAttentionNetwork`  
  - `HopfieldAttentionNetwork`

- **Other MIL Architectures**  
  - `DynamicPoolingNetwork` – adaptive pooling mechanism that adjusts to bag composition

Each architecture is instantiated separately for **regression** and **classification** versions.
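
To show what "assigning importance weights to conformers" means, here is a minimal NumPy sketch of a common additive attention pooling formulation. It is a generic illustration, not the `AdditiveAttentionNetwork` code; the projection `V`, the vector `w`, and the random embeddings are hypothetical.

```python
import numpy as np

def attention_pool(instances, V, w):
    """Additive attention pooling: softmax(w^T tanh(V h_k)) weighted sum of instances."""
    scores = np.tanh(instances @ V.T) @ w      # one scalar score per instance
    scores = scores - scores.max()             # stabilise the softmax numerically
    weights = np.exp(scores) / np.exp(scores).sum()
    bag_embedding = weights @ instances        # weighted average over instances
    return bag_embedding, weights

rng = np.random.default_rng(0)
instances = rng.normal(size=(5, 8))   # 5 conformers, 8-dimensional embeddings
V = rng.normal(size=(4, 8))           # hypothetical attention projection
w = rng.normal(size=4)                # hypothetical attention vector

bag_embedding, weights = attention_pool(instances, V, w)
print(weights)
```

The weights sum to one, so they can be read as each conformer's relative contribution to the bag embedding. With all-zero attention parameters the weights become uniform and the result reduces to mean pooling.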

---

#### Benchmark Procedure

1. The list of models is selected depending on the task type (`TASK`):  
   - `regressor_list` for regression  
   - `classifier_list` for classification  

2. Each model is trained on the **scaled training data** (`x_train_scaled`, `y_train`) using `.fit()`.

3. Predictions are generated on the test set (`x_test_scaled`):  
   - For regression: `model.predict()` returns continuous outputs.  
   - For classification: probabilistic outputs are thresholded at 0.5.

4. The **performance metric** (stored under the column name `"ACC"`) is computed with the helper function `accuracy_metric()`, which returns accuracy for classification and R² for regression, and is written to the result dataframe `res_df`.

This structure allows easy expansion: new architectures or pooling strategies can be added to the benchmark by simply appending them to the corresponding model list.
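
The evaluation logic in steps 3 and 4 can be sketched without any ML library. The `evaluate` and `r2` helpers below are hypothetical and written in plain NumPy for illustration; the notebook itself relies on scikit-learn's metrics.

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def evaluate(y_true, y_out, task):
    """Score raw model outputs: threshold at 0.5 for classification, R^2 for regression."""
    if task == "classification":
        y_pred = np.where(np.asarray(y_out) > 0.5, 1, 0)
        return float(np.mean(y_pred == np.asarray(y_true)))
    if task == "regression":
        return float(r2(y_true, y_out))
    raise ValueError(f"unknown task: {task!r}")

# Toy labels and outputs, chosen so the scores are easy to check by hand.
clf_score = evaluate([1, 0, 1, 1], [0.9, 0.2, 0.4, 0.7], "classification")
reg_score = evaluate([5.0, 6.0, 7.0], [5.0, 6.0, 8.0], "regression")

print(clf_score, reg_score)
```

In the classification case, three of the four thresholded predictions match the labels. In the regression case, the residual sum of squares is half the total sum of squares.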

---

#### Stepwise Hyperparameter Optimization in *milearn*

The *milearn* framework also supports **stepwise hyperparameter optimization** through the `StepwiseHopt` class.  
This optimizer explores a predefined hyperparameter grid in a **parameter-by-parameter** manner, identifying the best value for each while keeping previously optimized parameters fixed.

Key features:
- Parallel evaluation of multiple parameter values via `ThreadPoolExecutor`.  
- Automatic handling of `torch` thread allocation for efficient CPU utilization.  
- Built-in tracking of training loss, epochs, and elapsed time for each candidate configuration.  
- Designed for both **regression** and **classification** MIL models.  

Default search space parameters (defined in `DEFAULT_PARAM_GRID`) include:
- Network depth and activation functions  
- Learning rate, batch size, and weight decay  
- MIL-specific parameters (`tau`, `instance_dropout`)  
- Random seed for reproducibility  
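
The parameter-by-parameter strategy can be sketched in a few lines. This is an illustration of the general technique under stated assumptions, not the `StepwiseHopt` implementation: the grid, the defaults, and the toy loss are hypothetical, and a real run would train a model for each trial.

```python
def stepwise_search(param_grid, defaults, score_fn):
    """Tune one parameter at a time, fixing each at its best value before moving on."""
    best = dict(defaults)
    n_evals = 0
    for name, candidates in param_grid.items():
        scored = []
        for value in candidates:
            trial = {**best, name: value}   # vary one parameter, keep the rest fixed
            scored.append((score_fn(trial), value))
            n_evals += 1
        # Lower score is better (for example, a validation loss).
        best[name] = min(scored, key=lambda pair: pair[0])[1]
    return best, n_evals

# Hypothetical grid and toy loss with a known optimum at lr=0.01, n_layers=3.
param_grid = {"learning_rate": [0.1, 0.01, 0.001], "n_layers": [1, 2, 3, 4]}
defaults = {"learning_rate": 0.1, "n_layers": 1}

def toy_loss(params):
    return abs(params["learning_rate"] - 0.01) + abs(params["n_layers"] - 3)

best_params, n_evals = stepwise_search(param_grid, defaults, toy_loss)
print(best_params, n_evals)
```

The number of evaluations grows with the sum of the option counts (3 + 4 = 7 here) rather than their product (12 for a full grid). The trade-off is that interactions between parameters are not explored.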

To run hyperparameter optimization before training, simply call:
```python
model.hopt(x_train_scaled, y_train, param_grid=DEFAULT_PARAM_GRID, verbose=True)
```


In [17]:
import logging
import warnings
warnings.filterwarnings("ignore")
logging.getLogger("pytorch_lightning").setLevel(logging.ERROR)
logging.getLogger("lightning").setLevel(logging.ERROR)

import time
import torch
import random

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt

# Preprocessing
from milearn.preprocessing import BagMinMaxScaler

# Network hparams
from milearn.network.module.hopt import DEFAULT_PARAM_GRID

# MIL wrappers
from milearn.network.regressor import BagWrapperMLPNetworkRegressor, InstanceWrapperMLPNetworkRegressor
from milearn.network.classifier import BagWrapperMLPNetworkClassifier, InstanceWrapperMLPNetworkClassifier

# MIL networks
from milearn.network.regressor import (InstanceNetworkRegressor,
                                       BagNetworkRegressor,
                                       AdditiveAttentionNetworkRegressor,
                                       SelfAttentionNetworkRegressor,
                                       HopfieldAttentionNetworkRegressor,
                                       DynamicPoolingNetworkRegressor)

from milearn.network.classifier import (InstanceNetworkClassifier,
                                        BagNetworkClassifier,
                                        AdditiveAttentionNetworkClassifier,
                                        SelfAttentionNetworkClassifier,
                                        HopfieldAttentionNetworkClassifier,
                                        DynamicPoolingNetworkClassifier)

# Utils
from sklearn.metrics import r2_score, accuracy_score
from sklearn.model_selection import train_test_split

In [18]:
regressor_list = [

        # wrapper mil networks
        ("MeanBagWrapperMLPNetworkRegressor", BagWrapperMLPNetworkRegressor(pool="mean")),
        ("MeanInstanceWrapperMLPNetworkRegressor", InstanceWrapperMLPNetworkRegressor(pool="mean")),
    
        # classic mil networks
        ("MeanBagNetworkRegressor", BagNetworkRegressor(pool="mean")),
        ("MeanInstanceNetworkRegressor", InstanceNetworkRegressor(pool="mean")),

        # attention mil networks
        ("AdditiveAttentionNetworkRegressor", AdditiveAttentionNetworkRegressor()),
        ("SelfAttentionNetworkRegressor", SelfAttentionNetworkRegressor()),
        ("HopfieldAttentionNetworkRegressor", HopfieldAttentionNetworkRegressor()),

        # other mil networks
        ("DynamicPoolingNetworkRegressor", DynamicPoolingNetworkRegressor()),
    ]

classifier_list = [

        # wrapper mil networks
        ("MeanBagWrapperMLPNetworkClassifier", BagWrapperMLPNetworkClassifier(pool="mean")),
        ("MeanInstanceWrapperMLPNetworkClassifier", InstanceWrapperMLPNetworkClassifier(pool="mean")),
    
        # classic mil networks
        ("MeanBagNetworkClassifier", BagNetworkClassifier(pool="mean")),
        ("MeanInstanceNetworkClassifier", InstanceNetworkClassifier(pool="mean")),

        # attention mil networks
        ("AdditiveAttentionNetworkClassifier", AdditiveAttentionNetworkClassifier()),
        ("SelfAttentionNetworkClassifier", SelfAttentionNetworkClassifier()),
        ("HopfieldAttentionNetworkClassifier", HopfieldAttentionNetworkClassifier()),

        # other mil networks
        ("DynamicPoolingNetworkClassifier", DynamicPoolingNetworkClassifier()),
    ]

In [19]:
if TASK == "regression":
    method_list = regressor_list
elif TASK == "classification":
    method_list = classifier_list

res_df = pd.DataFrame()

total = len(method_list)
for i, (method_name, model) in enumerate(method_list, start=1):
    print(f"\n[{i}/{total}] Training {method_name}...")

    model.hopt(x_train_scaled, y_train, param_grid=DEFAULT_PARAM_GRID, verbose=True)
    model.fit(x_train_scaled, y_train)

    print(f"→ {method_name} training complete. Evaluating...")

    if TASK == "regression":
        y_pred = model.predict(x_test_scaled)
    elif TASK == "classification":
        y_prob = model.predict(x_test_scaled)
        y_pred = np.where(y_prob > 0.5, 1, 0)

    acc = accuracy_metric(y_test, y_pred, task=TASK)
    res_df.loc[method_name, "ACC"] = acc

    print(f"✓ {method_name} done — ACC: {acc:.4f}")



[1/8] Training MeanBagWrapperMLPNetworkRegressor...
Optimizing hyperparameter: hidden_layer_sizes (3 options)
[1/28 |  3.6% |  0.2 min] Value: (2048, 1024, 512, 256, 128, 64), Epochs: 41, Loss: 0.7431
[2/28 |  7.1% |  0.1 min] Value: (256, 128, 64), Epochs: 46, Loss: 0.7956
[3/28 | 10.7% |  0.2 min] Value: (128,), Epochs: 85, Loss: 1.0821
Best hidden_layer_sizes = (2048, 1024, 512, 256, 128, 64), val_loss = 0.7431
Optimizing hyperparameter: activation (5 options)
[4/28 | 14.3% |  2.2 min] Value: relu, Epochs: 49, Loss: 1.1415
[5/28 | 17.9% |  2.1 min] Value: leakyrelu, Epochs: 41, Loss: 1.1453
[6/28 | 21.4% |  2.1 min] Value: gelu, Epochs: 47, Loss: 0.6880
[7/28 | 25.0% |  1.7 min] Value: elu, Epochs: 41, Loss: 0.7689
[8/28 | 28.6% |  2.2 min] Value: silu, Epochs: 54, Loss: 0.6819
Best activation = silu, val_loss = 0.6819
Optimizing hyperparameter: learning_rate (2 options)
[9/28 | 32.1% |  6.7 min] Value: 0.0001, Epochs: 131, Loss: 0.6391
[10/28 | 35.7% |  3.8 min] Value: 0.001, Epoc

In [20]:
res_df.sort_values(by="ACC", ascending=False)

| Method | ACC |
|---|---|
| MeanInstanceWrapperMLPNetworkRegressor | 0.587974 |
| MeanBagNetworkRegressor | 0.577722 |
| DynamicPoolingNetworkRegressor | 0.574897 |
| MeanBagWrapperMLPNetworkRegressor | 0.552846 |
| AdditiveAttentionNetworkRegressor | 0.522261 |
| HopfieldAttentionNetworkRegressor | 0.51295 |
| MeanInstanceNetworkRegressor | 0.51083 |
| SelfAttentionNetworkRegressor | 0.431029 |
