## Model

This section gives a brief overview of Supervised Attention Multiple Instance Learning (SAMIL).

**General MIL Architecture**

* Instance representation layer $f$

  * An instance representation layer $f$ transforms each instance into a feature representation.
  * Produces an instance-specific embedding $h_k = f(x_k)$.
  * Uses a stack of convolutional layers followed by an MLP layer to extract each instance's features and project them to a low-dimensional embedding.

* Pooling layer $\sigma$  
  * A pooling layer $\sigma$ aggregates across instances to form a bag-level representation in a permutation-invariant fashion.
  * Uses the tanh function as its activation function.

* Output layer $g$
  * An output layer $g$ maps the bag-level representation to a prediction.
  * Performs softmax-based probabilistic classification.
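
The three layers above can be sketched as a single forward pass. This is a minimal NumPy illustration, not the authors' implementation: it assumes the common attention-pooling form $a_k = \mathrm{softmax}_k\big(w^T \tanh(U h_k)\big)$ with a weighted sum $z = \sum_k a_k h_k$, replaces the convolutional encoder $f$ with random embeddings, and uses a single linear layer for $g$. The function name, the sizes `K, M, D, C`, and the parameter names `W_out` and `b_out` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max before exponentiating for numerical stability.
    shifted = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / np.sum(e, axis=axis, keepdims=True)

def attention_mil_forward(H, U, w, W_out, b_out):
    """Pool a bag of instance embeddings and classify the bag.

    H:     (K, M) instance embeddings h_k = f(x_k)
    U, w:  (D, M) and (D,) attention parameters (assumed form)
    W_out: (C, M), b_out: (C,) linear output layer standing in for g
    """
    scores = np.tanh(H @ U.T) @ w        # (K,) one score per instance
    a = softmax(scores)                  # (K,) attention weights, sum to 1
    z = a @ H                            # (M,) bag-level representation
    probs = softmax(W_out @ z + b_out)   # (C,) class probabilities
    return a, z, probs

# Illustrative sizes: K instances, M-dim embeddings, D attention units, C classes.
rng = np.random.default_rng(0)
K, M, D, C = 5, 8, 4, 3
H = rng.normal(size=(K, M))              # stand-in for the encoder output
U = rng.normal(size=(D, M))
w = rng.normal(size=D)
W_out = rng.normal(size=(C, M))
b_out = np.zeros(C)

a, z, probs = attention_mil_forward(H, U, w, W_out, b_out)

# Shuffling the instances in the bag should not change the bag-level output.
perm = rng.permutation(K)
a_p, z_p, probs_p = attention_mil_forward(H[perm], U, w, W_out, b_out)
```

Because the weighted sum does not depend on instance order, `z` and `probs` agree with `z_p` and `probs_p` up to floating-point rounding, which is the permutation invariance mentioned above.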

**Training Objective**

SAMIL is trained in two stages: (1) self-supervised pretraining and (2) fine-tuning to diagnose aortic stenosis (AS).

1. **Self-Supervised Pretraining**: The objective is to minimize the InfoNCE loss in order to train an image-level encoder $\phi = \psi \circ f$, which composes a projection head $\psi$ with the feature layer $f$, given a training set of $J$ images.
  $$L_{\text{img-CL}}(\phi_q) = \sum_{j=1}^{J} -\log\frac{\exp(q_j^T k_j^+ / t)}{\exp(q_j^T k_j^+ / t) + \sum_{p=1}^{P} \exp(q_j^T k_{jp}^- / t)},$$ where
  $q_j = \phi_q(x_j'), k_j^+ = \phi_k(x^+_j)$.

  Here, $q_j \in \mathbb{R}^L$ is the embedding of the "query" image, $k^+_j \in \mathbb{R}^L$ is the embedding of the "positive key", $k_{jp}^- \in \mathbb{R}^L$ for $p = 1, \dots, P$ are the $P$ embeddings of "negative keys" retrieved from the queue, and $t$ is a temperature hyperparameter. SGD is used as the optimizer.

2. **Fine-tuning to Diagnose Aortic Stenosis (AS)**: After initializing the instance representation layer $f$ and the pooling layer $\sigma$, we fine-tune $f$, $\sigma$, and the output layer $g$ by minimizing the overall loss:
  $$L = L_{CE} + \lambda_{SA}L_{SA}$$
  * $L_{CE}$ is the cross-entropy loss between each bag's observed AS diagnosis $Y$ and the MIL-predicted probabilities given each bag of images $X$.
  * $L_{SA}$ is a supervised attention loss: the KL divergence between the relevance scores $R = \{r_1, \dots, r_K\}$ from a view-relevance classifier $v$ and the attention weights $A = \{a_1, \dots, a_K\}$:

    $$L_{SA}(w, U) = \mathrm{KL}(R \,\|\, A) = \sum_{k=1}^{K} r_k \log\frac{r_k}{a_k}$$
  * Hyperparameter $\lambda_{SA}>0$ sets the relative weight of the SA loss term.
  * SGD is again used as the optimizer.
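
The two stage objectives above can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions rather than the authors' code: the function names, the temperature `t=0.1`, the weight `lambda_sa=2.0`, the array shapes, and the toy inputs are all hypothetical, the query embeddings are L2-normalized only for the example, and terms with $r_k = 0$ are treated as contributing zero to the KL sum.

```python
import numpy as np

def info_nce(q, k_pos, k_neg, t=0.1):
    """Stage 1: InfoNCE loss summed over J query images.

    q:     (J, L) query embeddings q_j
    k_pos: (J, L) positive-key embeddings k_j^+
    k_neg: (J, P, L) negative-key embeddings k_jp^-
    t:     temperature (0.1 is an illustrative value)
    """
    pos = np.sum(q * k_pos, axis=1) / t              # (J,) positive logits
    neg = np.einsum("jl,jpl->jp", q, k_neg) / t      # (J, P) negative logits
    logits = np.concatenate([pos[:, None], neg], axis=1)
    # Stable log-sum-exp over the positive and the P negatives.
    m = logits.max(axis=1)
    log_denom = m + np.log(np.exp(logits - m[:, None]).sum(axis=1))
    return float(np.sum(log_denom - pos))

def supervised_attention_loss(r, a, eps=1e-12):
    """Stage 2: KL(R || A) between relevance scores r and attention weights a."""
    r = np.asarray(r, dtype=float)
    a = np.clip(np.asarray(a, dtype=float), eps, None)  # avoid division by zero
    mask = r > 0                                         # r_k = 0 terms add 0
    return float(np.sum(r[mask] * np.log(r[mask] / a[mask])))

def fine_tuning_loss(probs, y, r, a, lambda_sa=1.0, eps=1e-12):
    """Stage 2: L = L_CE + lambda_SA * L_SA for a single bag with label y."""
    l_ce = -np.log(max(probs[y], eps))
    return float(l_ce + lambda_sa * supervised_attention_loss(r, a))

# Stage 1 sanity check: if every negative equals the positive key, all logits
# tie, so each image contributes log(1 + P) and the total is J * log(1 + P).
J, P, L = 4, 6, 8
rng = np.random.default_rng(0)
q = rng.normal(size=(J, L))
q /= np.linalg.norm(q, axis=1, keepdims=True)
k_neg_same = np.repeat(q[:, None, :], P, axis=1)
loss_uninformative = info_nce(q, q, k_neg_same)

# Stage 2 toy bag with K = 4 images: relevance concentrated on two images,
# attention spread uniformly, so KL(R || A) = log 2.
r = [0.5, 0.5, 0.0, 0.0]
a = [0.25, 0.25, 0.25, 0.25]
l_sa = supervised_attention_loss(r, a)
probs = np.array([0.2, 0.5, 0.3])        # hypothetical predicted probabilities
l_total = fine_tuning_loss(probs, y=1, r=r, a=a, lambda_sa=2.0)
```

In the toy bag, the cross-entropy term is $-\log 0.5 = \log 2$ and the attention term is $\log 2$, so with `lambda_sa=2.0` the total is $3 \log 2$. The attention term drops to zero only when the attention weights match the relevance scores.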
