# Cell type classification

In this notebook, we preprocess the generated metadata files to allow for simple leave-one-patient-out and leave-one-timepoint-out evaluation of our models, which are trained to identify CD4+/-, CD8+/- and CD16+/- samples.

---

## 0. Environment setup


In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import StratifiedGroupKFold, GroupShuffleSplit
import os
from imblearn.under_sampling import RandomUnderSampler
from tqdm import tqdm
from collections import Counter

%load_ext nb_black


In [9]:
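def get_data_splits_for_label(
    data,
    label_col,
    n_folds,
    group_col,
    random_state=1234,
    val_size=0.2,
    sample_limit=None,
):
    """Split `data` into `n_folds` grouped, stratified train/val/test folds.

    Samples that share a value in `group_col` never end up in different parts
    of the same fold. `random_state` and `sample_limit` are currently unused:
    both splitters run without shuffling, so the result is deterministic.
    """
    # The splitter only needs the number of samples, so row positions serve as features.
    features = np.arange(len(data)).reshape(-1, 1)
    labels = np.array(data.loc[:, label_col])
    groups = np.array(data.loc[:, group_col])

    fold_data = {"train": [], "val": [], "test": []}
    group_kfold = StratifiedGroupKFold(n_splits=n_folds)
    for train_val_index, test_index in group_kfold.split(
        features, labels, groups=groups
    ):
        train_val_fold_data = data.iloc[train_val_index]
        train_val_fold_labels = labels[train_val_index]
        train_val_fold_groups = groups[train_val_index]

        # Use the first fold of an inner grouped split as the validation set
        # (roughly a fraction `val_size` of the remaining data).
        train_index, val_index = next(
            StratifiedGroupKFold(n_splits=int(1.0 / val_size)).split(
                train_val_fold_data, train_val_fold_labels, groups=train_val_fold_groups
            )
        )
        # The inner indices are positions within train_val_fold_data.
        train_fold_data = train_val_fold_data.iloc[train_index]
        val_fold_data = train_val_fold_data.iloc[val_index]

        test_fold_data = data.iloc[test_index]

        fold_data["train"].append(train_fold_data)
        fold_data["val"].append(val_fold_data)
        fold_data["test"].append(test_fold_data)

    return fold_data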


---

## 1. Read in data

We now read in the two metadata files: one contains the image locations and the other the chrometric features, each alongside the individual labels.

In [7]:
img_md = pd.read_csv(
    "../../../data/meningioma/classification/preprocessed/image_locs_and_labels.csv",
    index_col=0,
)
nmco_md = pd.read_csv(
    "../../../data/meningioma/classification/preprocessed/nmco_feats_and_labels.csv",
    index_col=0,
)

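The splitting cells in this notebook index the columns `patient`, `cd4`, `cd8` and `cd16`, so it can help to confirm that they are present before splitting. The sketch below uses a hypothetical helper, `missing_columns`, and a toy table as a stand-in for the loaded metadata; the real files may contain additional columns.

```python
import pandas as pd


def missing_columns(df, required):
    """Return the names in `required` that are not columns of `df`."""
    return [col for col in required if col not in df.columns]


# Toy stand-in for a metadata table (assumption, not the real data).
toy_md = pd.DataFrame({"patient": ["p1", "p2"], "cd4": [1, 0], "cd8": [0, 1]})
required = ["patient", "cd4", "cd8", "cd16"]

print(missing_columns(toy_md, required))  # → ['cd16']
```

Applied to the loaded tables, the same call would be `missing_columns(img_md, required)`; an empty list means all expected columns were found.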

---

## 2. Stratified Group K-Fold

We will now split the individual metadata files into k folds to assess the generalizability of our cell type classifiers.

### 2.1. Patient-3-Fold

First, we will split the data into 3 folds such that each fold contains the data of exactly two patients. The split is stratified, so that the relative frequencies of the labels for the different cell types are approximately the same across all folds.

#### 2.1.a. CD4-Patient-3-Fold
We start with the CD4 labels.

In [37]:
output_dir = "../../../data/meningioma/classification/preprocessed/kfold/cd4"
os.makedirs(output_dir, exist_ok=True)


In [38]:
label_col = "cd4"
group_col = "patient"
random_state = 1234
n_folds = 3


In [40]:
cd4_img_fold_data = get_data_splits_for_label(
    data=img_md,
    label_col=label_col,
    n_folds=n_folds,
    group_col=group_col,
    random_state=random_state,
)
# Write every fold of every split (train/val/test) to its own CSV file.
for k, v in cd4_img_fold_data.items():
    for i, fold_df in enumerate(v):
        fold_df.to_csv(
            os.path.join(output_dir, "cd4_img_loc_md_{}_fold_{}.csv".format(k, i))
        )


---
#### 2.1.b. CD8-Patient-3-Fold
Next, we turn to the CD8 labels.

In [46]:
output_dir = "../../../data/meningioma/classification/preprocessed/kfold/cd8"
os.makedirs(output_dir, exist_ok=True)


In [47]:
label_col = "cd8"
group_col = "patient"
random_state = 1234
n_folds = 3


In [48]:
cd8_img_fold_data = get_data_splits_for_label(
    data=img_md,
    label_col=label_col,
    n_folds=n_folds,
    group_col=group_col,
    random_state=random_state,
)
# Write every fold of every split (train/val/test) to its own CSV file.
for k, v in cd8_img_fold_data.items():
    for i, fold_df in enumerate(v):
        fold_df.to_csv(
            os.path.join(output_dir, "cd8_img_loc_md_{}_fold_{}.csv".format(k, i))
        )


---
#### 2.1.c. CD16-Patient-3-Fold

Next, we turn to CD16.

In [49]:
output_dir = "../../../data/meningioma/classification/preprocessed/kfold/cd16"
os.makedirs(output_dir, exist_ok=True)


In [50]:
label_col = "cd16"
group_col = "patient"
random_state = 1234
n_folds = 3


In [51]:
cd16_img_fold_data = get_data_splits_for_label(
    data=img_md,
    label_col=label_col,
    n_folds=n_folds,
    group_col=group_col,
    random_state=random_state,
)
# Write every fold of every split (train/val/test) to its own CSV file.
for k, v in cd16_img_fold_data.items():
    for i, fold_df in enumerate(v):
        fold_df.to_csv(
            os.path.join(output_dir, "cd16_img_loc_md_{}_fold_{}.csv".format(k, i))
        )

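The export cells for CD4, CD8 and CD16 differ only in the label name, so the writing step could be factored into one helper. The sketch below is one possible way to do this; `write_fold_data` is a hypothetical name, and the toy fold dictionary and temporary directory merely stand in for the real fold data and output paths.

```python
import os
import tempfile

import pandas as pd


def write_fold_data(fold_data, output_dir, prefix):
    """Write each fold of each split to '<prefix>_<split>_fold_<i>.csv'."""
    os.makedirs(output_dir, exist_ok=True)
    paths = []
    for split_name, folds in fold_data.items():
        for i, fold_df in enumerate(folds):
            path = os.path.join(
                output_dir, "{}_{}_fold_{}.csv".format(prefix, split_name, i)
            )
            fold_df.to_csv(path)
            paths.append(path)
    return paths


# Toy fold dictionary with a single fold per split (assumption, not the real data).
toy_df = pd.DataFrame({"patient": ["p1", "p2"], "cd4": [1, 0]})
toy_fold_data = {"train": [toy_df], "val": [toy_df], "test": [toy_df]}

with tempfile.TemporaryDirectory() as tmp_dir:
    written = write_fold_data(
        toy_fold_data, os.path.join(tmp_dir, "cd4"), "cd4_img_loc_md"
    )
    written_names = sorted(os.path.basename(p) for p in written)
    # Read one file back while the temporary directory still exists.
    reloaded = pd.read_csv(written[0], index_col=0)

print(written_names)
```

With such a helper, the three label-specific cells would reduce to a loop over `["cd4", "cd8", "cd16"]` that calls `get_data_splits_for_label` and then `write_fold_data` for each label.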

---

### 2.2. Timepoint-3-Fold

We will also assess the generalizability across timepoints, again individually for the CD4, CD8 and CD16 cell types.
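If the data contain exactly three timepoints (an assumption), a split in which every test fold holds a single timepoint is the same as leave-one-timepoint-out. The sketch below illustrates this with scikit-learn's `LeaveOneGroupOut` instead of `StratifiedGroupKFold`; the column name `timepoint`, its values and the labels are hypothetical toy data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import LeaveOneGroupOut

# Toy metadata (assumption): 3 hypothetical timepoints, 10 samples each.
toy = pd.DataFrame(
    {
        "timepoint": np.repeat(["tp1", "tp2", "tp3"], 10),
        "cd4": np.tile([0, 1], 15),
    }
)

held_out = []
splitter = LeaveOneGroupOut()
for train_idx, test_idx in splitter.split(
    toy, toy["cd4"], groups=toy["timepoint"]
):
    test_timepoints = sorted(toy["timepoint"].iloc[test_idx].unique())
    train_timepoints = set(toy["timepoint"].iloc[train_idx])
    # The held-out timepoint must not appear in the training data.
    assert train_timepoints.isdisjoint(test_timepoints)
    held_out.append(test_timepoints)

print(held_out)
```

Unlike the stratified splitter, `LeaveOneGroupOut` does not balance label frequencies, so the label distribution per timepoint should be inspected separately.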