<a href="https://colab.research.google.com/github/LouisStefanuto/Detection-of-the-PIK3CA-mutation-in-breast-cancer/blob/MISVM/MISVM_perso.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Notebook using MISVM approach

One idea worth testing would be to reuse this code with another classifier:

1. Initialize each tile with the label of its bag

2. Fit a regression model on the tiles (SVM)

3. Update the labels (weights) of the tiles in positive bags to sgn(prediction), and set the tile with the highest probability to 1

4. If no label changes, STOP
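
The four steps above can be sketched on synthetic bags as follows. This is only an illustration under stated assumptions, not the method used later in this notebook: a least-squares linear scorer stands in for the SVM so that the example depends on NumPy alone, and the names `fit_linear`, `score`, `relabel_bags` and the toy data are hypothetical.

```python
import numpy as np


def fit_linear(X, y):
    """Least-squares linear scorer with a bias term (stand-in for the SVM)."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return w


def score(w, X):
    """Signed score of each tile under the linear scorer."""
    return np.hstack([X, np.ones((len(X), 1))]) @ w


def relabel_bags(bags, bag_labels, max_iter=20):
    # Step 1: every tile inherits the label (+1 / -1) of its bag
    tile_labels = [np.full(len(b), l, dtype=float) for b, l in zip(bags, bag_labels)]
    X = np.vstack(bags)
    for _ in range(max_iter):  # guard against endless oscillation
        # Step 2: fit the model on all tiles with their current labels
        w = fit_linear(X, np.concatenate(tile_labels))
        changes = 0
        for k, (bag, label) in enumerate(zip(bags, bag_labels)):
            if label != 1:
                continue  # tiles of negative bags stay negative
            # Step 3: relabel tiles of positive bags by the sign of the score,
            # and force the highest-scoring tile to be positive
            s = score(w, bag)
            new = np.where(s >= 0, 1.0, -1.0)
            new[np.argmax(s)] = 1.0
            changes += int(np.sum(new != tile_labels[k]))
            tile_labels[k] = new
        # Step 4: stop as soon as no tile label changed
        if changes == 0:
            break
    return w, tile_labels


# toy data: 4 bags of 5 two-dimensional tiles; positive bags hold one shifted tile
rng = np.random.default_rng(0)
bag_labels = [1, -1, 1, -1]
bags = [rng.normal(size=(5, 2)) for _ in bag_labels]
for bag, label in zip(bags, bag_labels):
    if label == 1:
        bag[0] += 3.0

w, tile_labels = relabel_bags(bags, bag_labels)
```

By construction, every positive bag keeps at least one positive tile and every tile of a negative bag stays negative, which is the constraint the multiple-instance setting imposes.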

Before starting, you will need to install some packages to reproduce the baseline.

In [None]:
!pip install tqdm
!pip install scikit-learn

In [None]:
import logging
import os

In [None]:
# import data
PATH_COLAB = '/content/drive/MyDrive/challenge_ens_2023_small/moco_features.zip'
PATH_DEVICE = '..'
try:
    from google.colab import drive
    logging.info('Working on Colab.')
    
    # connect your drive to the session
    drive.mount('/content/drive')

    %cd /content/drive/MyDrive/challenge_data_ens_small/

    # unzip data into the colab session
    ! unzip $PATH_COLAB -d /content
    logging.info('Data unzipped in your Drive.')

    %cd /content

    %cp -R drive/MyDrive/challenge_ens_2023_small/supplementary_data/ .
    %cp drive/MyDrive/challenge_ens_2023_small/train_output.csv .


except ImportError:  # google.colab is only importable on Colab
    logging.info('Working on your device.')
    
    data_exists = os.path.exists(PATH_DEVICE + '/train_input') and os.path.exists(PATH_DEVICE + '/test_input') and os.path.exists(PATH_DEVICE + '/train_output.csv')
    
    if data_exists:
        logging.info(f"Dataset found on device at: '{PATH_DEVICE}'.")
    else:
        raise FileNotFoundError(f"Data folder not found at '{PATH_DEVICE}'")

In [None]:
%ls .

In [None]:
from pathlib import Path
from tqdm import tqdm

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt

from sklearn.metrics import roc_auc_score
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

# Data architecture

After downloading or unzipping the downloaded files, your data tree must have the following architecture in order to properly run the notebook:
```
your_data_dir/
├── train_output.csv
├── train_input/
│   ├── images/
│       ├── ID_001/
│           ├── ID_001_tile_000_17_170_43.jpg
...
│   └── moco_features/
│       ├── ID_001.npy
...
├── test_input/
│   ├── images/
│       ├── ID_003/
│           ├── ID_003_tile_000_16_114_93.jpg
...
│   └── moco_features/
│       ├── ID_003.npy
...
├── supplementary_data/
│   ├── baseline.ipynb
│   ├── test_metadata.csv
│   └── train_metadata.csv
```

For instance, `your_data_dir = /storage/DATA_CHALLENGE_ENS_2022/`
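
Before running the rest of the notebook, the layout above can be checked with a small helper. The sketch below is an assumption-laden convenience, not part of the challenge tooling: `EXPECTED` and `missing_entries` are hypothetical names, and only the entries that the feature-based cells rely on are listed.

```python
from pathlib import Path

# paths (relative to the data root) that the feature-based cells read from
EXPECTED = [
    "train_output.csv",
    "train_input/moco_features",
    "test_input/moco_features",
    "supplementary_data/train_metadata.csv",
    "supplementary_data/test_metadata.csv",
]


def missing_entries(data_dir):
    """Return the expected entries that are absent under `data_dir`."""
    root = Path(data_dir)
    return [rel for rel in EXPECTED if not (root / rel).exists()]
```

Calling `missing_entries("./")` from the data root should return an empty list when the tree matches the architecture shown above.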


This notebook starts from the challenge's baseline method, called `MeanPool`. This method consists of a logistic regression trained on tile-level MoCo V2 features averaged over each slide.

For a given slide $s$ with $N_s=1000$ tiles and corresponding MoCo V2 features $\mathbf{K_s} \in \mathbb{R}^{(1000,\,2048)}$, a slide-level average is performed over the tile axis.

For $j=1,...,2048$:

$$\overline{\mathbf{k_s}}(j) = \frac{1}{N_s} \sum_{i=1}^{N_s} \mathbf{K_s}(i, j) $$

Thus, the training input data for MeanPool consists of $S_{\text{train}}=344$ mean feature vectors $\overline{\mathbf{k_s}}$, $s=1,...,S_{\text{train}}$, where $S_{\text{train}}$ denotes the number of training samples.
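
As a quick illustration of the averaging formula, the sketch below applies it to a random matrix standing in for one slide's features. The array `K_s` here is synthetic (an assumption for the example), not real challenge data.

```python
import numpy as np

rng = np.random.default_rng(0)
# hypothetical stand-in for one slide's MoCo V2 features K_s: (N_s, 2048)
K_s = rng.normal(size=(1000, 2048))

# average over the tile axis (axis 0): one 2048-d vector per slide
k_s_bar = K_s.mean(axis=0)

# the same quantity written out as in the formula: (1 / N_s) * sum_i K_s(i, j)
N_s = K_s.shape[0]
k_s_bar_explicit = K_s.sum(axis=0) / N_s
```

Both expressions give the same 2048-dimensional vector; `np.mean(..., axis=0)` is the form used in the data processing cell.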

## Data loading

In [None]:
# put your own path to the data root directory (see example in `Data architecture` section)
data_dir = Path("./")

# load the training and testing data sets
train_features_dir = data_dir / "train_input" / "moco_features"
test_features_dir = data_dir / "test_input" / "moco_features"
df_train = pd.read_csv(data_dir  / "supplementary_data" / "train_metadata.csv")
df_test = pd.read_csv(data_dir  / "supplementary_data" / "test_metadata.csv")

# concatenate y_train and df_train
y_train = pd.read_csv(data_dir  / "train_output.csv")
df_train = df_train.merge(y_train, on="Sample ID")

print(f"Training data dimensions: {df_train.shape}")  # (344, 4)
df_train.head()

## Data processing

We now load the feature matrices $\mathbf{K_s} \in \mathbb{R}^{(1000,\,2048)}$ for $s=1,...,344$ and perform slide-level averaging. This operation should take at most 5 minutes on a laptop.

In [None]:
size_train = len(df_train)
X_train = np.zeros((size_train, 2048))
y_train = np.zeros((size_train))

centers_train = []
patients_train = []


for i, (sample, label, center, patient) in enumerate(tqdm(
    df_train[["Sample ID", "Target", "Center ID", "Patient ID"]].values
)):
    if i >= size_train:
      break
    # load the coordinates and features (1000, 3+2048)
    _features = np.load(train_features_dir / sample)
    # get coordinates (zoom level, tile x-coord on the slide, tile y-coord on the slide)
    # and the MoCo V2 features
    coordinates, features = _features[:, :3], _features[:, 3:]  # Ks
    # slide-level averaging
    X_train[i] = np.mean(features, axis=0)
    y_train[i] = label
    centers_train.append(center)
    patients_train.append(patient)

centers_train = np.array(centers_train)
patients_train = np.array(patients_train)

## MI-SVM approach

In [None]:
from sklearn.svm import SVC


def train_MISVM(df, features_dir):
    svm = SVC(kernel='rbf', gamma=0.37, C = 0.1, probability=True)

    size = len(df)
    X_S_I = np.zeros((size, 2048))
    Y_I = np.zeros((size))


    for i, (sample, label, _, _) in enumerate(tqdm(
        df[["Sample ID", "Target", "Center ID", "Patient ID"]].values
    )):
        # load the coordinates and features (1000, 3+2048)
        _features = np.load(features_dir / sample)
        # get coordinates (zoom level, tile x-coord on the slide, tile y-coord on the slide)
        # and the MoCo V2 features
        coordinates, features = _features[:, :3], _features[:, 3:]  # Ks
        # slide-level averaging
        X_S_I[i] = np.mean(features, axis=0)
        Y_I[i] = label

    # index of the selected (witness) tile per bag; -1 means "none selected yet"
    S_I = -np.ones(size, dtype=int)

    changes = 1
    while changes > 0:

        svm.fit(X_S_I, Y_I)

        changes = 0

        for bag, (sample, label, _, _) in enumerate(tqdm(
            df[["Sample ID", "Target", "Center ID", "Patient ID"]].values
        )):
            if label:
                # load the coordinates and features (1000, 3+2048)
                _features = np.load(features_dir / sample)
                features = _features[:, 3:]  # Ks

                # compute classification score for each tile
                f_i = svm.predict_proba(features)[:,1]
                new_S = np.argmax(f_i)

                if new_S != S_I[bag]:
                    S_I[bag] = new_S
                    # replace the bag representative used to refit the SVM
                    X_S_I[bag] = features[new_S]
                    changes += 1

        print(f"Changes : {changes}")

    return svm

In [None]:
def eval_MISVM(model, df, features_dir):

    preds = np.zeros(len(df))

    for bag, (sample, _, _) in enumerate(tqdm(
        df[["Sample ID", "Center ID", "Patient ID"]].values
    )):
        # load the coordinates and features (1000, 3+2048)
        _features = np.load(features_dir / sample)
        features = _features[:, 3:]  # Ks

        # compute classification score for each tile
        f_i = model.predict_proba(features)[:, 1]
        # bag-level score: maximum tile score
        preds[bag] = np.max(f_i)
    return preds

In [None]:
#svm = train_MISVM(df_train, train_features_dir)

In [None]:
#preds = eval_MISVM(svm, df_train, train_features_dir)

In [None]:
# /!\ we perform splits at the patient level so that all samples from the same patient
# are found in the same split

patients_unique = np.unique(patients_train)

y_unique = np.array(
    [np.mean(y_train[patients_train == p]) for p in patients_unique]
)
centers_unique = np.array(
    [centers_train[patients_train == p][0] for p in patients_unique]
)

print(
    "Training set specifications\n"
    "---------------------------\n"
    f"{len(X_train)} unique samples\n"
    f"{len(patients_unique)} unique patients\n"
    f"{len(np.unique(centers_unique))} unique centers"
)

In [None]:
aucs = []
models = []
# 5-fold CV is repeated 5 times with different random states
for k in range(5):
    kfold = StratifiedKFold(5, shuffle=True, random_state=k)
    fold = 0
    # split is performed at the patient-level
    for train_idx_, val_idx_ in kfold.split(patients_unique, y_unique):
        # retrieve the indexes of the samples corresponding to the
        # patients in `train_idx_` and `val_idx_`
        train_idx = np.arange(len(X_train))[
            pd.Series(patients_train).isin(patients_unique[train_idx_])
        ]
        val_idx = np.arange(len(X_train))[
            pd.Series(patients_train).isin(patients_unique[val_idx_])
        ]
        # set the training and validation folds
        df = df_train.iloc[train_idx]
        df_val = df_train.iloc[val_idx]

        y_fold_val = df_val["Target"]
        
        # instantiate and fit an SVM via MISVM
        svm = train_MISVM(df, train_features_dir)

        # get the predictions (1-d probability)
        preds_val = eval_MISVM(svm, df_val, train_features_dir)


        # compute the AUC score using scikit-learn
        auc = roc_auc_score(y_fold_val, preds_val)
        print(f"AUC on split {k} fold {fold}: {auc:.3f}")
        aucs.append(auc)
        # add the fitted SVM to the list of classifiers
        models.append(svm)
        fold += 1
    print("----------------------------")
print(
    f"5-fold cross-validated AUC averaged over {k+1} repeats: "
    f"{np.mean(aucs):.3f} ({np.std(aucs):.3f})"
)

# Submission

We now run the models trained during cross-validation on the test set in order to produce a submission file that can be uploaded directly to the data challenge platform.

## Inference

In [None]:
preds_test = 0
# loop over the classifiers
for svm in models:
    preds_test += eval_MISVM(svm, df_test, test_features_dir)

# and take the average (ensembling technique)
preds_test = preds_test / len(models)

## Saving predictions

In [None]:
submission = pd.DataFrame(
    {"Sample ID": df_test["Sample ID"].values, "Target": preds_test}
).sort_values(
    "Sample ID"
)  # extra step to sort the sample IDs

# sanity checks
assert all(submission["Target"].between(0, 1)), "`Target` values must be in [0, 1]"
assert submission.shape == (149, 2), "Your submission file must be of shape (149, 2)"
assert list(submission.columns) == [
    "Sample ID",
    "Target",
], "Your submission file must have columns `Sample ID` and `Target`"


file_name = "benchmark_test_output_MISVM_rbf_kernel.csv"
# save the submission as a csv file
submission.to_csv(data_dir / file_name, index=None)

%cp ./$file_name /content/drive/MyDrive/challenge_ens_2023_small/$file_name
submission.head()

# Dealing with images

The following code loads and manipulates the images provided as part of this challenge.

## Scanning images paths on disk

This operation can take up to 5 minutes.

In [None]:
train_images_dir = data_dir / "train_input" / "images"
train_images_files = list(train_images_dir.rglob("*.jpg"))

test_images_dir = data_dir / "test_input" / "images"
test_images_files = list(test_images_dir.rglob("*.jpg"))

print(
    f"Number of images\n"
    "-----------------\n"
    f"Train: {len(train_images_files)}\n" # 344 x 1000 = 344,000 tiles
    f"Test: {len(test_images_files)}\n"  # 149 x 1000 = 149,000 tiles
    f"Total: {len(train_images_files) + len(test_images_files)}\n"  # 493 x 1000 = 493,000 tiles
)

## Reading

Now we can load some of the `.jpg` images for a given sample, say `ID_001`.

In [None]:
ID_001_tiles = [p for p in train_images_files if 'ID_001' in p.name]

In [None]:
fig, axes = plt.subplots(5, 5)
fig.set_size_inches(12, 12)

for i, img_file in enumerate(ID_001_tiles[:25]):
    # get the metadata from the file path
    _, metadata = str(img_file).split("tile_")
    id_tile, level, x, y = metadata[:-4].split("_")
    img = plt.imread(img_file)
    ax = axes[i//5, i%5]
    ax.imshow(img)
    ax.set_xticks([])
    ax.set_yticks([])
    ax.set_title(f"Tile {id_tile} ({x}, {y})")
plt.show()

## Mapping with features

Note that the rows of the feature matrices are aligned with the tile numbering: the coordinates stored in row $i$ match those in the file name of tile $i$.
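
To make this alignment concrete, a tile file name can be parsed into its index and coordinates. The helper below is a hypothetical sketch (`parse_tile_name` is not part of the challenge code); it assumes file names follow the `ID_xxx_tile_<index>_<level>_<x>_<y>.jpg` pattern shown in the data architecture section.

```python
from pathlib import Path


def parse_tile_name(path):
    """Return (tile_index, zoom_level, x, y) parsed from a tile file name."""
    stem = Path(path).stem  # e.g. "ID_001_tile_000_17_170_43"
    _, metadata = stem.split("tile_")
    tile_index, level, x, y = (int(v) for v in metadata.split("_"))
    return tile_index, level, x, y


tile_index, level, x, y = parse_tile_name("ID_001_tile_000_17_170_43.jpg")
# row `tile_index` of the feature matrix is expected to start with (level, x, y)
```

With the real data, comparing `coordinates[tile_index]` against `(level, x, y)` would be one way to verify the alignment.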

In [None]:
sample = "ID_001.npy"
_features = np.load(train_features_dir / sample)
coordinates, features = _features[:, :3], _features[:, 3:]
print("xy features coordinates")
coordinates[:10, 1:].astype(int)

In [None]:
print(
    "Tiles numbering and features coordinates\n"
)
[tile.name for tile in ID_001_tiles[:10]]