# Nuclear chromatin phenotypes of PBMCs vary between cancer types

---
This notebook summarizes the analysis corresponding to the results presented in figure 3 of the paper. It can be used to rerun the analysis and regenerate the corresponding panels.

---

## 0. Environment setup

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import random
import os
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
import matplotlib as mpl

mpl.rcParams["figure.dpi"] = 1200

# SMALL_SIZE = 16
# MEDIUM_SIZE = 18
# BIGGER_SIZE = 20

# mpl.rc("font", size=SMALL_SIZE, weight="normal")  # controls default text sizes
# mpl.rc("axes", titlesize=SMALL_SIZE)  # fontsize of the axes title
# mpl.rc("axes", labelsize=MEDIUM_SIZE)  # fontsize of the x and y labels
# mpl.rc("xtick", labelsize=SMALL_SIZE)  # fontsize of the tick labels
# mpl.rc("ytick", labelsize=SMALL_SIZE)  # fontsize of the tick labels
# mpl.rc("legend", fontsize=SMALL_SIZE)  # legend fontsize
# mpl.rc("figure", titlesize=BIGGER_SIZE)  # fontsize of the figure title

import sys

sys.path.append("../..")
from src.utils.notebooks.eda import *
from src.utils.notebooks.figure3 import *
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score, StratifiedGroupKFold

seed = 1234
random.seed(seed)
np.random.seed(seed)

%reload_ext nb_black

In [None]:
nuc_feature_desc = pd.read_csv(
    "../../data/chrometric_feature_description.csv", index_col=0
)
feature_name_dict = dict(
    zip(
        list(nuc_feature_desc.loc[:, "feature"]),
        list(nuc_feature_desc.loc[:, "long_name"]),
    )
)
feature_color_dict = {
    "morphology": "b",
    "intensity": "g",
    "boundary": "r",
    "texture": "c",
    "chromatin condensation": "m",
    "moments": "y",
    np.nan: "k",
}
feature_color_dict = {
    feature: feature_color_dict[category]
    for (feature, category) in zip(
        list(nuc_feature_desc.loc[:, "long_name"]),
        list(nuc_feature_desc.loc[:, "category"]),
    )
}

In [None]:
color_palette = {
    "Meningioma": "cornflowerblue",
    "Glioma": "orange",
    "Head & Neck": "orchid",
}

---

## 1. Read in data

To assess differences in the cell states of PBMCs in the presence of different cancer types, we obtained PBMCs from 30 patients with one of three cancer types: Meningioma, Glioma, and Head & Neck cancer. These cancer types are the most abundant among the population of patients treated with proton therapy. For each patient, we obtained data by staining the PBMCs for DNA, gH2AX, and Lamin A/C.

First, we read in the data set that describes each PBMC by a number of hand-crafted features extracted from the fluorescent images of the cells.

In [None]:
all_data = pd.read_csv("../../data/treated_population_data.csv", index_col=0)
all_data = preprocess_data(all_data, remove_constant_features=False)
all_data = all_data.loc[all_data.timepoint == "prior"]
all_data = all_data.rename(columns=feature_name_dict)
len(all_data)

In [None]:
fig, ax = plt.subplots(figsize=[12, 4], ncols=2)
cancer_order = ["Meningioma", "Glioma", "Head & Neck"]
ax = ax.flatten()
ax[0] = sns.countplot(
    x="sample",
    data=all_data,
    ax=ax[0],
    order=np.unique(all_data.loc[:, "sample"]),
    hue_order=cancer_order,
    hue="cancer",
    dodge=False,
    palette=color_palette,
)
ax[0].legend([], [], frameon=False)
ax[0].set_xlabel("ID of the biological sample")
ax[0].set_title("Distribution of biological samples")
for tick in ax[0].get_xticklabels():
    tick.set_rotation(90)

ax[1] = sns.countplot(
    x="cancer",
    hue="cancer",
    data=all_data,
    ax=ax[1],
    order=cancer_order,
    dodge=False,
    palette=color_palette,
    hue_order=cancer_order,
)
ax[1].set_xlabel("Cancer type")
ax[1].set_title("Distribution of cancer types")

plt.show()
plt.close()

---

#### Subsampling

Next, we subsample the data set so that each cancer type is represented by the same number of nuclei. Additionally, we ensure that within each cancer type the individual biological (patient) samples are approximately uniformly represented.

In [None]:
deselected_patients = ["p42"]

all_data = all_data.loc[~all_data.loc[:, "sample"].isin(deselected_patients)]

In [None]:
sampled_data = get_stratified_data(
    all_data,
    id_column="id",
    cond_column="cancer",
    seed=1234,
)
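
The balancing idea behind this step can be illustrated with a minimal sketch. The helper `balanced_subsample` and the toy data below are hypothetical and are not the implementation of `get_stratified_data`. The sketch only equalizes the number of rows per cancer type and does not attempt to make the patient samples uniformly represented.

```python
import pandas as pd


def balanced_subsample(df, cond_column, seed=1234):
    """Draw the same number of rows from each condition (hypothetical helper)."""
    # The smallest condition determines how many rows every condition keeps.
    n_per_condition = df[cond_column].value_counts().min()
    parts = [
        group.sample(n=n_per_condition, random_state=seed)
        for _, group in df.groupby(cond_column)
    ]
    return pd.concat(parts)


# Toy data (made up): 6 Glioma, 4 Meningioma and 9 Head & Neck nuclei.
toy = pd.DataFrame(
    {
        "cancer": ["Glioma"] * 6 + ["Meningioma"] * 4 + ["Head & Neck"] * 9,
        "sample": ["p1"] * 3 + ["p2"] * 3 + ["p3"] * 4 + ["p4"] * 5 + ["p5"] * 4,
    }
)
balanced = balanced_subsample(toy, cond_column="cancer")
print(balanced["cancer"].value_counts())
```

In the toy example every cancer type is reduced to the size of the smallest one (four nuclei).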

In [None]:
fig, ax = plt.subplots(figsize=[12, 4], ncols=2)
cancer_order = ["Meningioma", "Glioma", "Head & Neck"]
ax = ax.flatten()
ax[0] = sns.countplot(
    x="sample",
    data=sampled_data,
    ax=ax[0],
    order=np.unique(all_data.loc[:, "sample"]),
    hue_order=cancer_order,
    hue="cancer",
    dodge=False,
    palette=color_palette,
)
ax[0].legend([], [], frameon=False)
ax[0].set_xlabel("ID of the biological sample")
ax[0].set_title("Distribution of biological samples \n (sampled data set)")
for tick in ax[0].get_xticklabels():
    tick.set_rotation(90)

ax[1] = sns.countplot(
    x="cancer",
    hue="cancer",
    data=sampled_data,
    ax=ax[1],
    order=cancer_order,
    dodge=False,
    palette=color_palette,
    hue_order=cancer_order,
)
ax[1].set_xlabel("Cancer type")
ax[1].set_title("Distribution of cancer types \n (sampled data set)")
ax[1].legend(loc="lower right")

plt.show()
plt.close()

---

#### Sample and feature selection

We now filter out constant features and nuclei with missing features.

In [None]:
data = preprocess_data(sampled_data, remove_constant_features=True)

---

#### Data preparation

After sampling the data, we prepare it for the subsequent analysis, i.e. we extract only the chrometric features and the corresponding metadata.

In [None]:
all_chrometric_data = get_chrometric_data(
    data,
    proteins=["gh2ax", "lamin", "cd3"],
    exclude_dna_int=True,
)

sample_labels = data.loc[:, "sample"]
cancer_labels = data.loc[:, "cancer"]

Finally, we remove highly correlated features (Pearson $\rho > 0.8$) from the chrometric features.

In [None]:
chrometric_data = remove_correlated_features(all_chrometric_data, threshold=0.8)
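
A correlation filter of this kind can be sketched as follows. The helper `drop_correlated` is hypothetical and may differ from `remove_correlated_features`. In particular, using the absolute value of the correlation and dropping the later column of each correlated pair are assumptions of this sketch.

```python
import numpy as np
import pandas as pd


def drop_correlated(features, threshold=0.8):
    """Drop every column whose absolute Pearson correlation with any earlier
    column exceeds the threshold (hypothetical helper)."""
    corr = features.corr().abs()
    # Keep only the upper triangle so that each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return features.drop(columns=to_drop)


# Toy data (made up): "a_copy" is a noisy duplicate of "a", "b" is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
toy = pd.DataFrame({"a": a, "a_copy": a + 0.01 * rng.normal(size=200), "b": b})
reduced = drop_correlated(toy, threshold=0.8)
print(list(reduced.columns))
```

Only one feature of each highly correlated pair is kept, so the redundant `a_copy` column is removed while `a` and `b` remain.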

---

## 3. Panels

Now we generate the individual panels for figure 3 of the paper.


### 3a. Visualization of changes of nuclear phenotypes in different cancer types

First, we provide a visual representation of the nuclear phenotypes in the different cancer types. To this end, we randomly sample 16 nuclei from each of the three cancer types and plot a corresponding 4x4 montage of the max-z projected DNA images. To visualize size differences, each nucleus is padded to a size of 150x150 pixels. Note that the nuclei images were obtained from range-normalized DAPI images. The range normalization was used to mitigate batch effects.

In [None]:
image_file_path = "preprocessed/full_pipeline/segmentation/nuclei_images"
sampled_mg_images = get_random_images(
    data.loc[data.cancer == "Meningioma"],
    image_file_path,
    data_dir_col="data_dir",
    n_images=16,
    seed=1234,
    file_ending=".tif",
    file_name_col="file_name",
)

sampled_gl_images = get_random_images(
    data.loc[data.cancer == "Glioma"],
    image_file_path,
    data_dir_col="data_dir",
    n_images=16,
    seed=1234,
    file_ending=".tif",
    file_name_col="file_name",
)

sampled_hn_images = get_random_images(
    data.loc[data.cancer == "Head & Neck"],
    image_file_path,
    data_dir_col="data_dir",
    n_images=16,
    seed=1234,
    file_ending=".tif",
    file_name_col="file_name",
)
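
The padding to a common 150x150 canvas, which keeps size differences between nuclei visible in the montage, can be sketched as below. The helper `pad_to_square` is hypothetical, and the centered zero padding is an assumption about what `plot_montage` does with `pad_size=150`.

```python
import numpy as np


def pad_to_square(image, size=150):
    """Zero-pad a 2D image symmetrically to size x size (hypothetical helper)."""
    height, width = image.shape
    if height > size or width > size:
        raise ValueError("image is larger than the target size")
    top = (size - height) // 2
    left = (size - width) // 2
    # Any odd remainder goes to the bottom and right borders.
    return np.pad(
        image,
        ((top, size - height - top), (left, size - width - left)),
        mode="constant",
        constant_values=0,
    )


# Toy nucleus crop (made up): 40 x 61 pixels of constant intensity.
nucleus = np.ones((40, 61))
padded = pad_to_square(nucleus, size=150)
print(padded.shape, padded.sum())
```

Because the crop is not rescaled, a larger nucleus occupies a larger part of its tile in the montage.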

#### Meningioma population

In [None]:
fig_mg, ax_mg = plot_montage(
    sampled_mg_images,
    pad_size=150,
    mask_nuclei=True,
    cmap="inferno",
    nrows=4,
    ncols=4,
)
fig_mg.set_facecolor(color_palette["Meningioma"])

#### Glioma population

In [None]:
fig_gl, ax_gl = plot_montage(
    sampled_gl_images,
    pad_size=150,
    mask_nuclei=True,
    cmap="inferno",
    ncols=4,
    nrows=4,
)
fig_gl.set_facecolor(color_palette["Glioma"])

#### Head & Neck population

In [None]:
fig_hn, ax_hn = plot_montage(
    sampled_hn_images,
    pad_size=150,
    mask_nuclei=True,
    cmap="inferno",
    ncols=4,
    nrows=4,
)
fig_hn.set_facecolor(color_palette["Head & Neck"])

---

### 3b. Parametric analysis captures differences of PBMCs in the presence of Meningioma, Glioma and Head & Neck cancers

While the montages do not show large-scale differences between the cancer types, we now turn to the parametric descriptions of the nuclear phenotypes of the PBMCs in those three cancer types. To this end, we first visualize the data set using a tSNE plot to assess potential large-scale differences between the cancer types and the individual patient samples.

In [None]:
chrometric_embs = get_tsne_embs(chrometric_data)
chrometric_embs["cancer"] = np.array(cancer_labels)
chrometric_embs["sample"] = np.array(sample_labels)

In [None]:
fig, ax = plt.subplots(figsize=[9, 6])
ax = sns.scatterplot(
    data=chrometric_embs,
    x="tSNE 1",
    y="tSNE 2",
    hue="cancer",
    hue_order=cancer_order,
    ax=ax,
    s=14,
    marker="o",
    palette=color_palette,
    legend=False,
)
ax.set_xlim([-40, 40])
ax.set_ylim([-40, 40])
plt.show()

In [None]:
fig, ax = plt.subplots(figsize=[9, 6])
ax = sns.scatterplot(
    data=chrometric_embs,
    x="tSNE 1",
    y="tSNE 2",
    hue="sample",
    hue_order=np.unique(sample_labels),
    ax=ax,
    s=14,
    marker="o",
    palette="tab20",
)
plt.legend(
    bbox_to_anchor=(1.02, 0.5),
    loc="center left",
    borderaxespad=0,
    title="sample",
    ncol=2,
)
ax.set_xlim([-40, 40])
ax.set_ylim([-40, 40])
plt.show()

In [None]:
fig, ax = plt.subplots(figsize=[9, 6])
ax = sns.scatterplot(
    data=chrometric_embs.loc[chrometric_embs.cancer == "Meningioma"],
    x="tSNE 1",
    y="tSNE 2",
    hue="sample",
    hue_order=np.unique(
        chrometric_embs.loc[chrometric_embs.cancer == "Meningioma", "sample"]
    ),
    ax=ax,
    s=18,
    marker="o",
    palette="tab10",
)
plt.legend(
    bbox_to_anchor=(0.5, 1.05),
    loc="center",
    borderaxespad=0,
    title="",
    ncol=10,
    fancybox=False,
    frameon=False,
    columnspacing=0.4,
)
ax.set_xlim([-40, 40])
ax.set_ylim([-40, 40])
ax.set_xlabel("")
ax.set_ylabel("")
plt.show()

In [None]:
fig, ax = plt.subplots(figsize=[9, 6])
ax = sns.scatterplot(
    data=chrometric_embs.loc[chrometric_embs.cancer == "Glioma"],
    x="tSNE 1",
    y="tSNE 2",
    hue="sample",
    hue_order=np.unique(
        chrometric_embs.loc[chrometric_embs.cancer == "Glioma", "sample"]
    ),
    ax=ax,
    s=18,
    marker="o",
    palette="tab10",
)
plt.legend(
    bbox_to_anchor=(0.5, 1.05),
    loc="center",
    borderaxespad=0,
    title="",
    ncol=10,
    fancybox=False,
    frameon=False,
    columnspacing=0.4,
)
ax.set_xlim([-40, 40])
ax.set_ylim([-40, 40])
ax.set_xlabel("")
ax.set_ylabel("")
plt.show()

In [None]:
fig, ax = plt.subplots(figsize=[9, 6])
ax = sns.scatterplot(
    data=chrometric_embs.loc[chrometric_embs.cancer == "Head & Neck"],
    x="tSNE 1",
    y="tSNE 2",
    hue="sample",
    hue_order=np.unique(
        chrometric_embs.loc[chrometric_embs.cancer == "Head & Neck", "sample"]
    ),
    ax=ax,
    s=18,
    marker="o",
    palette="tab10",
)
plt.legend(
    bbox_to_anchor=(0.5, 1.05),
    loc="center",
    borderaxespad=0,
    title="",
    ncol=10,
    fancybox=False,
    frameon=False,
    columnspacing=0.4,
)
ax.set_xlim([-40, 40])
ax.set_ylim([-40, 40])
ax.set_xlabel("")
ax.set_ylabel("")
plt.show()

---

## Classification of the different cancer types

To quantify the separability of the three cancer types based on the chrometric phenotypes of the patients' PBMCs, we perform a 10-fold stratified cross-validation analysis. Note that the nuclei-level split below uses a linear discriminant analysis (LDA) classifier, whereas the leave-one-patient-out analysis further down uses a RandomForest classifier. The RandomForest classifier provides a simple non-linear classification model that also yields an importance measure for the individual chrometric features, indicating which ones differ most between the three populations.

##### Nuclei split

First, we split the data randomly on a per-nucleus basis, i.e. nuclei of the same patient will likely be included in both the training and the test sets.

In [None]:
lda = LinearDiscriminantAnalysis(n_components=2)
lda_cancer_cv_conf_mtx_nuclei = get_cv_conf_mtx(
    estimator=lda,
    features=chrometric_data,
    labels=cancer_labels,
    scale_features=True,
    n_folds=10,
    order=cancer_order,
)
lda_normalized_cv_conf_mtx_nuclei = lda_cancer_cv_conf_mtx_nuclei.divide(
    lda_cancer_cv_conf_mtx_nuclei.sum(axis=1), axis=0
)
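
The row normalization used above divides each row of the confusion matrix by its row sum, so that every row gives the fraction of nuclei of a true cancer type assigned to each predicted type. A minimal sketch with made-up counts:

```python
import pandas as pd

classes = ["Meningioma", "Glioma", "Head & Neck"]
# Toy counts (made up): rows are true labels, columns are predicted labels.
conf_mtx = pd.DataFrame(
    [[80, 15, 5], [20, 70, 10], [5, 5, 40]], index=classes, columns=classes
)
# axis=0 aligns the row sums with the row index, i.e. each row is divided
# by its own total.
row_normalized = conf_mtx.divide(conf_mtx.sum(axis=1), axis=0)
print(row_normalized.round(2))
```

After normalization the diagonal holds the per-class recall, which keeps classes with different numbers of nuclei comparable.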

In [None]:
fig, ax = plt.subplots(figsize=[5, 4])
ax = sns.heatmap(
    lda_normalized_cv_conf_mtx_nuclei,
    annot=True,
    fmt=".4f",
    cmap="viridis",
    vmin=0,
    vmax=1,
    # cbar=False,
)
ax.set_xlabel("Predicted cancer type")
ax.set_ylabel("True cancer type")
plt.show()

In [None]:
lda_transformed = pd.DataFrame(
    lda.fit(chrometric_data, cancer_labels).transform(chrometric_data),
    columns=["LD 1", "LD 2"],
    index=chrometric_data.index,
)
lda_transformed["cancer"] = np.array(cancer_labels)
lda_transformed["sample"] = np.array(sample_labels)
g = sns.jointplot(
    data=lda_transformed,
    x="LD 1",
    y="LD 2",
    hue="cancer",
    s=5,
    hue_order=cancer_order,
    height=6,
    palette=color_palette,
    xlim=[-5, 4],
    ylim=[-4, 4],
    legend=False,
)
g.ax_joint.set_xlabel("")
g.ax_joint.set_ylabel("")
# g.ax_joint.legend(title="", prop={"size": 16, "weight": "bold"})
# g.ax_joint.legend(title="")
# g.savefig(os.path.join(output_dir, "cancer_type_lda.png"), dpi=1200, transparent=True)

In [None]:
not_plotted = (
    np.sum(lda_transformed.loc[:, "LD 1"] > 4)
    + np.sum(lda_transformed.loc[:, "LD 1"] < -5)
    + np.sum(lda_transformed.loc[:, "LD 2"] > 4)
    + np.sum(lda_transformed.loc[:, "LD 2"] < -4)
)
print(
    "{} nuclei of a total {} are not shown in the LDA plot.".format(
        not_plotted, len(lda_transformed)
    )
)

---

### Leave-one-patient out cross-validation

We run a leave-one-patient-out cross-validation to characterize how the individual patients contribute to the separability of the different cancer types. In particular, we are interested in patients whose PBMC population is classified particularly accurately or inaccurately when the classifier is trained on the data of all other patients. Note that to avoid class imbalance, at each iteration in which we leave out a patient with a specific cancer type, we take a balanced random subsample of the PBMC populations of all other patients for training such that each cancer type is equally represented.

However, we first again look at the average confusion matrix.

In [None]:
rfc = RandomForestClassifier(
    n_estimators=500, n_jobs=10, random_state=seed, class_weight="balanced"
)

In [None]:
lopo_cancer_cv_conf_mtx_patient = get_cv_conf_mtx(
    estimator=rfc,
    features=chrometric_data,
    labels=cancer_labels,
    groups=sample_labels,
    scale_features=False,
    n_folds=len(set(sample_labels)),
    order=cancer_order,
    balance_train=True,
)
normalized_cv_conf_lopo_mtx_patient = lopo_cancer_cv_conf_mtx_patient.divide(
    lopo_cancer_cv_conf_mtx_patient.sum(axis=1), axis=0
)
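
The combination of holding out one patient and balancing the remaining training data can be sketched with NumPy alone. Both helpers below are hypothetical. The sketch uses a plain leave-one-group-out split and simple random undersampling, whereas the notebook relies on `get_cv_conf_mtx` with `balance_train=True`, whose internals are not shown here.

```python
import numpy as np


def leave_one_group_out(groups):
    """Yield (train_idx, test_idx) with one group held out at a time."""
    groups = np.asarray(groups)
    for held_out in np.unique(groups):
        test_mask = groups == held_out
        yield np.flatnonzero(~test_mask), np.flatnonzero(test_mask)


def undersample_balanced(indices, labels, rng):
    """Subsample indices so that every class present is equally represented."""
    subset_labels = np.asarray(labels)[indices]
    classes, counts = np.unique(subset_labels, return_counts=True)
    n_min = counts.min()
    keep = [
        rng.choice(indices[subset_labels == c], size=n_min, replace=False)
        for c in classes
    ]
    return np.sort(np.concatenate(keep))


# Toy data (made up): three classes with two patients each.
labels = np.array(["A"] * 6 + ["B"] * 4 + ["C"] * 5)
groups = np.array(
    ["p1"] * 3 + ["p2"] * 3 + ["p3"] * 2 + ["p4"] * 2 + ["p5"] * 3 + ["p6"] * 2
)
rng = np.random.default_rng(1234)

fold_counts = []
for train_idx, test_idx in leave_one_group_out(groups):
    balanced_idx = undersample_balanced(train_idx, labels, rng)
    _, counts = np.unique(labels[balanced_idx], return_counts=True)
    fold_counts.append(counts.tolist())
    print(groups[test_idx][0], counts.tolist())
```

In every fold the held-out patient never appears in the training indices, and the training classes are cut down to the size of the smallest remaining class.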

In [None]:
fig, ax = plt.subplots(figsize=[5, 4])
ax = sns.heatmap(
    normalized_cv_conf_lopo_mtx_patient,
    annot=True,
    fmt=".4f",
    cmap="viridis",
    vmin=0,
    vmax=1,
    annot_kws={"size": 16, "weight": "bold"},
    # cbar=False,
)
ax.set_xlabel("Predicted cancer type")
ax.set_ylabel("True cancer type")
plt.show()

In [None]:
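# Explicit imports for the names used below; they may already be provided by the
# star imports from src.utils above.
from collections import Counter

from imblearn.under_sampling import RandomUnderSampler
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler
from tqdm import tqdm


def summarize_group_cv_results_by_fold(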
    model,
    features,
    labels,
    groups,
    n_folds=None,
    balance_train=False,
    scoring=accuracy_score,
    random_state=1234,
):
    if n_folds is None:
        n_folds = len(np.unique(groups))
    cv = StratifiedGroupKFold(n_splits=n_folds)
    rus = RandomUnderSampler(random_state=random_state)

    result = {
        "group": [],
        "score": [],
        "avg_max_pred_prob": [],
        "avg_true_class_pred_prob": [],
        "majority_class": [],
        "majority_predicted_class": [],
    }

    features = np.array(features)
    labels = np.array(labels)
    groups = np.array(groups)
    for train_idx, test_idx in cv.split(X=features, y=labels, groups=groups):
        X_train, X_test = features[train_idx], features[test_idx]
        y_train, y_test = labels[train_idx], labels[test_idx]

        if balance_train:
            X_train, y_train = rus.fit_resample(X_train, y_train)

        # print(Counter(y_test))
        model.fit(X_train, y_train)
        classes = model.classes_
        preds = model.predict(X_test)
        pred_probs = model.predict_proba(X_test)
        test_groups = np.unique(groups[test_idx])
        test_group = "_".join(sorted(list(test_groups)))
        score = scoring(y_test, preds)
        for c in classes:
            if "prop_" + str(c) in result:
                result["prop_" + str(c)].append(np.mean(preds == c))
            else:
                result["prop_" + str(c)] = [np.mean(preds == c)]
        avg_max_pred_prob = np.mean(np.max(pred_probs, axis=1))
        true_class_pred_probs = []
        for i in range(len(y_test)):
            true_class_pred_probs.append(pred_probs[i, classes == y_test[i]])
        avg_true_class_pred_prob = np.mean(true_class_pred_probs)

        result["group"].append(test_group)
        result["score"].append(score)
        result["avg_max_pred_prob"].append(avg_max_pred_prob)
        result["avg_true_class_pred_prob"].append(avg_true_class_pred_prob)
        result["majority_class"].append(Counter(y_test).most_common(1)[0][0])
        result["majority_predicted_class"].append(Counter(preds).most_common(1)[0][0])

    return pd.DataFrame(result, index=list(range(n_folds)))

In [None]:
lopo_cv_result = summarize_group_cv_results_by_fold(
    model=rfc,
    features=chrometric_data,
    labels=cancer_labels,
    groups=sample_labels,
    balance_train=True,
)

In [None]:
lopo_cv_result.describe()

In [None]:
tumor_types = ["Meningioma", "Glioma", "Head & Neck"]
lopo_patient_cv_mtx = pd.DataFrame(
    np.zeros((3, 3)), index=tumor_types, columns=tumor_types
)
for c in tumor_types:
    for p in tumor_types:
        lopo_patient_cv_mtx.loc[c, p] = len(
            lopo_cv_result.loc[
                (lopo_cv_result.majority_class == c)
                & (lopo_cv_result.majority_predicted_class == p)
            ]
        )
normalized_lopo_patient_cv_mtx = lopo_patient_cv_mtx.divide(
    lopo_patient_cv_mtx.sum(axis=1), axis=0
)

In [None]:
fig, ax = plt.subplots(figsize=[5, 4])
ax = sns.heatmap(
    normalized_lopo_patient_cv_mtx,
    annot=True,
    fmt=".2f",
    cmap="viridis",
    vmin=0,
    vmax=1,
    annot_kws={"size": 16, "weight": "bold"},
    # cbar=False,
)
ax.set_xlabel("Predicted cancer type")
ax.set_ylabel("True cancer type")
plt.show()

In [None]:
lopo_cv_result_lda = summarize_group_cv_results_by_fold(
    model=lda,
    features=chrometric_data,
    labels=cancer_labels,
    groups=sample_labels,
    balance_train=True,
)
lopo_cv_result_lda.describe()

To assess whether the classification performance is significantly better than random chance, we compare it to a random baseline: we repeat the procedure 10 times after randomly permuting the cancer types of the individual patients beforehand.

In [None]:
np.random.seed(seed + 1111)
bs = range(10)

lopo_perm_cv_results = []

for b in tqdm(bs):
    perm_cancer_labels = get_permute_group_labels(cancer_labels, sample_labels)[0]
    lopo_perm_cv_result = summarize_group_cv_results_by_fold(
        model=rfc,
        features=chrometric_data,
        labels=perm_cancer_labels,
        groups=sample_labels,
        balance_train=True,
    )
    lopo_perm_cv_result["permutation"] = b
    lopo_perm_cv_results.append(lopo_perm_cv_result)
lopo_perm_cv_results = pd.concat(lopo_perm_cv_results)
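
Permuting the labels at the patient level, rather than per nucleus, keeps all nuclei of a patient under one shared label. A minimal sketch of this idea is given below. The helper `permute_labels_by_group` is hypothetical and is not the implementation of `get_permute_group_labels`. It assumes that every patient has exactly one cancer type.

```python
import numpy as np
import pandas as pd


def permute_labels_by_group(labels, groups, rng):
    """Shuffle labels across groups so that each group keeps one shared label."""
    groups = np.asarray(groups)
    per_row = pd.Series(np.asarray(labels), index=groups)
    # One label per group (assumes labels are constant within a group).
    group_labels = per_row.groupby(level=0).first()
    shuffled = pd.Series(
        rng.permutation(group_labels.to_numpy()), index=group_labels.index
    )
    # Broadcast the shuffled group labels back to the individual rows.
    return shuffled.loc[groups].to_numpy()


# Toy data (made up): three patients, one cancer type each.
labels = ["Glioma"] * 3 + ["Meningioma"] * 2 + ["Head & Neck"] * 4
groups = ["p1"] * 3 + ["p2"] * 2 + ["p3"] * 4
permuted = permute_labels_by_group(labels, groups, np.random.default_rng(2345))
print(pd.DataFrame({"group": groups, "permuted": permuted}).drop_duplicates())
```

The number of patients per cancer type is preserved, only the assignment of types to patients changes, which gives a baseline that respects the grouping of nuclei by patient.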

In [None]:
lopo_perm_cv_results["condition"] = "Permuted"
lopo_cv_result["condition"] = "Observed"
# DataFrame.append was removed in pandas 2.0; pd.concat works in all versions.
all_lopo_results = pd.concat([lopo_cv_result, lopo_perm_cv_results])

We now jointly plot the performance, measured by the (balanced) accuracy score for each sample, and distinguish between the scores obtained with and without permuting the cancer labels.

In [None]:
fig, ax = plot_lopo_cv_results_by_class(
    all_lopo_results,
    cancer_order,
    x="majority_class",
    y="score",
    hue="condition",
    figsize=[6, 4],
    test="Mann-Whitney",
    pval_text_format="star",
    alpha=0.5,
)
ax.set_xlabel("Cancer types")
ax.set_ylabel("Classification accuracy by patient")
plt.show()

The above plot validates that the performance in each cancer type is significantly higher than what we expect by random chance. However, we also notice that for two Glioma patients almost none of the members of the respective PBMC population are correctly classified. The plot below visualizes this by showing the accuracy scores for each patient.

In [None]:
fig, ax = plt.subplots(figsize=[6, 4])
sample_colors = [
    color_palette[k] for k in list(lopo_cv_result.loc[:, "majority_class"])
]
sample_palette = dict(zip(list(lopo_cv_result.loc[:, "group"]), sample_colors))
ax = sns.barplot(
    data=lopo_cv_result,
    x="group",
    y="score",
    palette=sample_palette,
    order=list(lopo_cv_result.sort_values("score").loc[:, "group"]),
)
plt.xticks(rotation=90)
ax.set_xlabel("Patient sample")
ax.set_ylabel("Classification accuracy")
plt.show()

The two samples on which the classifier performs worst are patients P22 and P29, both of whom are Astrocytoma patients. P47 and P57 are also Astrocytoma patients, but the tumor sizes of the former two and the latter two differ vastly: 1.8/3.7 ccm vs. 94.6/85.7 ccm, respectively. Importantly, P22 and P57 underwent chemotherapy before the blood draw. The table below shows how the nuclei of P22 and P29 are misclassified.

In [None]:
lopo_cv_result.sort_values("score").head(2)

Finally, we plot the overall performance of the leave-one-patient-out cross-validation against the random background obtained by permuting the cancer type labels. Note that we color the points, each corresponding to an individual sample, by the average predicted probability of the true cancer type.

In [None]:
fig, ax = plot_lopo_cv_results(
    data=all_lopo_results,
    alpha=0.7,
    cbar_label="Prediction probability \n of the true cancer type",
)
ax.set_xlabel("")
ax.set_ylabel("Accuracy by LoPo CV fold")
plt.show()

---

#### Ablation study

In [None]:
nc_abl_results = run_nuclei_ablation_study_cv(
    estimator=rfc,
    features=chrometric_data,
    labels=cancer_labels,
    groups=sample_labels,
    n_repeats=10,
    balance_train=True,
    scale_features=True,
    n_folds=len(set(sample_labels)),
    random_state=1234,
)

In [None]:
nc_abl_results.frac_nuclei = np.round(nc_abl_results.frac_nuclei, 2)
g = sns.catplot(
    data=nc_abl_results,
    x="frac_nuclei",
    y="lopo_accuracy",
    kind="point",
    errorbar="se",
    capsize=0.2,
    height=4,
    aspect=1.5,
)
g.set_xlabels("")
g.set_ylabels("")
g.set(ylim=(0.60, 0.75))
# g.set_xlabels("Fraction of nuclei\n(training set)")
# g.set_ylabels("Average accuracy\n(leave one patient out)")
# g.set_titles("Distinction between tumor types")

In [None]:
def run_patient_ablation_study_cv(estimator, features, labels, groups, n_repeats=5, balance_train=True,
                                  scale_features=True, n_folds=10, random_state=1234):
    np.random.seed(random_state)
    classes = np.unique(labels)

    rus = RandomUnderSampler(random_state=random_state)

    if scale_features:
        sc = StandardScaler()
        features = pd.DataFrame(
            sc.fit_transform(features), index=features.index, columns=features.columns
        )

    features = np.array(features)
    labels = np.array(labels)
    groups = np.array(groups)

    patient_label_dict = {}
    min_n_patients = np.inf  # np.infty was removed in NumPy 2.0
    for c in classes:
        patient_label_dict[c] = np.unique(groups[labels == c])
        min_n_patients = min(min_n_patients, len(patient_label_dict[c]))
    print(patient_label_dict)

    results = {"n_train_patients": [], "sample": [], "lopo_accuracy": []}
    for i in tqdm(range(1, min_n_patients), position=0):
        for j in tqdm(range(n_repeats), position=1):
            skf = StratifiedGroupKFold(n_splits=n_folds)
            accs = []
            for train_index, test_index in skf.split(features, labels, groups):
                X_train, X_test = features[train_index], features[test_index]
                y_train, y_test = labels[train_index], labels[test_index]


                train_groups = groups[train_index]
                selected_train_patients = []
                for c in classes:
                    class_train_patients = np.unique(train_groups[y_train == c])
                    selected_train_patients.extend(
                            list(np.random.choice(class_train_patients, size=i, replace=False)))

                train_mask = []
                for train_group in train_groups:
                    train_mask.append(train_group in selected_train_patients)
                X_train = X_train[train_mask]
                y_train = y_train[train_mask]

                if balance_train:
                    X_train, y_train = rus.fit_resample(X_train, y_train)

                estimator.fit(X_train, y_train)
                acc = accuracy_score(y_test, estimator.predict(X_test))
                accs.append(acc)

            results["n_train_patients"].append(i)
            results["sample"].append(j)
            results["lopo_accuracy"].append(np.mean(accs))

    results = pd.DataFrame(results)
    return results


In [None]:
pt_abl_results = run_patient_ablation_study_cv(
    estimator=rfc,
    features=chrometric_data,
    labels=cancer_labels,
    groups=sample_labels,
    n_repeats=10,
    balance_train=True,
    scale_features=True,
    n_folds=len(set(sample_labels)),
    random_state=1234,
)

In [None]:
pt_abl_results.groupby("n_train_patients").describe()

In [None]:
g = sns.catplot(
    data=pt_abl_results,
    x="n_train_patients",
    y="lopo_accuracy",
    kind="point",
    errorbar="se",
    capsize=0.2,
    height=4,
    aspect=1.5,
)
g.set_xlabels("")
g.set_ylabels("")
g.set(ylim=(0.50, 0.80))
# g.set_xlabels("Number of patients\n(training set)")
# g.set_ylabels("Average accuracy\n(leave one patient out)")
# g.set_titles("Control vs. Cancer")

---

### 3c. Nuclear chromatin biomarkers identifying cancer populations

#### Feature importance

Having validated that there are significant differences between the individual cancer types, in particular between PBMCs of Head & Neck cancer patients and those of Glioma and Meningioma patients, we next assess the implicit feature importance of a RandomForest classifier trained to distinguish between the cancer types. This indicates which features are most informative about the different cancer types.

In [None]:
fig, ax = plot_feature_importance_for_estimator(
    rfc,
    chrometric_data,
    cancer_labels,
    scale_features=False,
    cmap=["gray"],
    figsize=[2, 1],
    feature_color_dict=feature_color_dict,
    n_features=15,
)

The analysis suggests that the heterochromatin content as well as the size of the nucleus are most discriminative between the different cancer populations in addition to a number of features characterizing the shape of the intensity distribution of the DNA inside the nucleus in 2D.

While the feature importance plot above already suggests a number of candidate chrometric biomarkers that capture the differences in the nuclear phenotypes of the PBMCs between the cancer types, we additionally run a marker screen that tests for differential distributions of the individual chrometric features between the cancer type populations. To this end, we apply a t-test to test for differences in the means and adjust for multiple testing using the Benjamini-Hochberg procedure.

In [None]:
marker_screen_results = find_markers(chrometric_data, cancer_labels)
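
The Benjamini-Hochberg adjustment mentioned above can be written in a few lines of NumPy. The function below is an illustrative sketch with made-up p-values. It is not taken from `find_markers`, whose implementation is not shown in this notebook.

```python
import numpy as np


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    order = np.argsort(p)
    # Scale the i-th smallest p-value by m / i.
    scaled = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity starting from the largest p-value and cap at 1.
    adjusted_sorted = np.minimum(np.minimum.accumulate(scaled[::-1])[::-1], 1.0)
    adjusted = np.empty(m)
    adjusted[order] = adjusted_sorted
    return adjusted


# Toy p-values (made up), e.g. one per chrometric feature.
p_raw = [0.01, 0.04, 0.03, 0.20]
p_adjusted = benjamini_hochberg(p_raw)
print(p_adjusted)
```

Features would then be called significant if their adjusted p-value falls below the chosen false discovery rate, for example 0.05.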

#### Meningioma

First, we look at the features whose mean is significantly different in Meningioma patients compared to patients of the other two cancer types.

In [None]:
marker_screen_results.loc[marker_screen_results.label == "Meningioma"].head(10)

We find that the PBMCs of Meningioma patients are on average slightly smaller than those of the other two cancer types and show a slightly higher curvature of the nuclear boundaries.

---

#### Glioma

Next we look at the features whose mean is significantly different in Glioma patients compared to patients of the other two cancer types.

In [None]:
marker_screen_results.loc[marker_screen_results.label == "Glioma"].head(10)

Not surprisingly, the most indicative features are similar to those suggested by the feature importance plot. The PBMCs of Glioma patients have a significantly reduced HC/EC content compared to the other cancer types and are on average smaller, whereas their nuclear boundaries show smoother curvature changes and the intensity distribution of the DNA inside the nucleus is more skewed and features shorter tails on average. Additionally, the projected nuclear shapes are on average more convex compared to the other two cancer types.

---

#### Head & Neck cancers

Finally, we also evaluate the chrometric phenotype of PBMCs of the Head & Neck cancer patients.

In [None]:
marker_screen_results.loc[marker_screen_results.label == "Head & Neck"].head(10)

The PBMCs of the Head & Neck cancer patients have on average an increased heterochromatin content, are enlarged, and show larger changes in the curvature of the nuclear boundaries, while both the kurtosis and the skewness of the DNA intensity distribution are significantly reduced.

---

As proxies for these alterations, we focus on the nuclear volume for changes in size, the concavity of the nucleus for variation in shape, and the relative heterochromatin-to-euchromatin ratio for changes in chromatin compaction. Additionally, we observe differences in the curvature. Finally, the shapes of the DNA intensity distributions of the z-projected nuclei differ significantly. To visualize those differences, we look at the distributions of those markers in the different cancer types.

In [None]:
markers = [
    "volume",
    "hetero_to_euchromatin_volume_ratio",
    "std_curvature",
]

marker_labels = [
    r"Nuclear volume in px$^3$",
    "relative HC/EC ratio",
    "Standard deviation of the curvature",
]
plot_cancer_type_markers_dist(
    data, markers, marker_labels, cut=0, palette=color_palette, figsize=[4, 4]
)

In [None]:
marker_labels = [
    "Nuclear volume \n" + r"(in px$^3$)",
    "HC/EC ratio",
    "Standard deviation \n of the curvature",
]
fig, ax = plot_joint_markers_cancer_types(
    data, markers, marker_labels, label_col="cancer", palette=color_palette
)
ax.set_ylabel("Normalized marker value")
ax.set_xlabel("Chrometric Marker")
sns.move_legend(
    ax,
    "lower center",
    bbox_to_anchor=(0.5, 1),
    ncol=3,
    title=None,
    frameon=False,
)
plt.show()

---

### 3d. Proteomic differences of PBMCs in cancer

Finally, we also assess the proteomic differences between the different cancer populations. To this end, we plot the relative Lamin and gH2AX expression, measured by the sum of the intensities of the corresponding imaging channels normalized by the nuclear volume. Additionally, we plot the number of identified gH2AX foci, which are computed as the local intensity maxima found in the corresponding channel images.

Note that those features are only available for the first data set that was stained for those proteins.

In [None]:
markers = [
    "rel_lamin_3d_int",
    "rel_gh2ax_3d_int",
    "gh2ax_foci_count",
    "gh2ax_sum_foci_area",
    "gh2ax_avg_foci_area",
]
marker_labels = [
    "Volume-normalized nuclear\nLamin A/C intensity",
    "Normalized nuclear\n" r"$\gamma$H2AX intensity",
    r"Number of $\gamma$H2AX foci",
    r"Sum of the $\gamma$H2AX foci area",
    r"Average size of the $\gamma$H2AX foci",
]
plot_cancer_type_markers_dist(
    data,
    markers,
    marker_labels,
    quantiles=None,
    cut=0,
    plot_type="bar",
    palette=color_palette,
    figsize=[3, 4],
)

---

## 4. Supplemental

In [None]:
markers = [
    "volume",
    "hetero_to_euchromatin_volume_ratio",
    "std_curvature",
]

marker_labels = [
    "Nuclear volume \n" + r"(in px$^3$)",
    "HC/EC ratio",
    "Standard deviation \n of the curvature",
]
fig, ax = plot_joint_markers_cancer_types(
    all_data, markers, marker_labels, label_col="cancer", palette=color_palette
)
ax.set_ylabel("Normalized marker value")
ax.set_xlabel("Chrometric Marker")
sns.move_legend(
    ax,
    "lower center",
    bbox_to_anchor=(0.5, 1),
    ncol=3,
    title=None,
    frameon=False,
)
plt.show()

In [None]:
markers = [
    "rel_lamin_3d_int",
    "rel_gh2ax_3d_int",
    "gh2ax_foci_count",
    "gh2ax_sum_foci_area",
    "gh2ax_avg_foci_area",
]
marker_labels = [
    "Volume-normalized nuclear\nLamin A/C intensity",
    "Normalized nuclear\n" r"$\gamma$H2AX intensity",
    r"Number of $\gamma$H2AX foci",
    r"Sum of the $\gamma$H2AX foci area",
    r"Average size of the $\gamma$H2AX foci",
]
plot_cancer_type_markers_dist(
    all_data,
    markers,
    marker_labels,
    quantiles=None,
    cut=0,
    plot_type="bar",
    palette=color_palette,
    figsize=[3, 4],
)