This practical is inspired by the Patho-Bench tutorial, which can be found __[here](https://github.com/mahmoodlab/Patho-Bench/blob/main/tutorial/tutorial.ipynb)__.

⚠️ __Before you start this practical, make sure to load the conda env for this practical.__

Run `source /data/Training-MG/files/data/AI_praticals_2025/MG_AI/bin/activate`

If you have not done so this morning, use ipykernel to make this env available in the notebook:
`python -m ipykernel install --user --name MG_AI --display-name "Python (MG_AI)"`

and select the `Python (MG_AI)` kernel in your kernel list.

# Practical 2: Using the slide embeddings for downstream tasks

## Introduction
__Lung cancer and its major subtypes__

Lung cancer remains the leading cause of cancer-related death worldwide, responsible for approximately 1.8 million deaths each year (GLOBOCAN 2022). It encompasses several biologically distinct diseases, broadly divided into:

- Small-cell lung carcinoma (SCLC): about 15% of cases, usually very aggressive and strongly associated with smoking.
- Non-small-cell lung carcinoma (NSCLC): around 85% of cases, including:
    - Lung adenocarcinoma (LUAD)
    - Lung squamous cell carcinoma (LUSC)
      
Among them, LUAD is the most common subtype (≈40–50% of all lung cancers). It typically develops in the peripheral regions of the lungs and, although often associated with smoking, it also occurs in non-smokers, especially in women and in Asian populations. In contrast, LUSC usually arises in the central bronchi and shows a stronger correlation with heavy tobacco exposure.

Distinguishing LUAD from LUSC is a routine diagnostic task for pathologists using hematoxylin–eosin (H&E) stained slides. Their characteristic morphology (glandular structures in LUAD vs keratin pearls and intercellular bridges in LUSC) makes visual diagnosis relatively straightforward. Consequently, LUAD–LUSC classification is often used as a benchmark for computational pathology models, though it is an “easy task” for trained pathologists.

Now that we have seen how to compute slide embeddings, this practical goes beyond this basic classification and explores more clinically meaningful applications of artificial intelligence in pathology. We will use slide embeddings computed on [CPTAC-LUAD](https://www.cancerimagingarchive.net/collection/cptac-luad/) with two state-of-the-art (SOTA) models: Titan and Feather.

## Data preparation
First, let's load Titan and Feather slide embeddings.

In [1]:
main_folder = "/data/Training-MG/files/data/AI_praticals_2025/AI_pratical_2_patho_bench"

In [None]:
import h5py
import pandas as pd

with h5py.File(f"{main_folder}/slides_titan.h5") as f:
    ids_titan = f['ids'].asstr()[:]
    features_titan = f['features'][:]
df_titan = pd.DataFrame({"slide_id": ids_titan, "titan_embeddings": features_titan.tolist()})

with h5py.File(f"{main_folder}/slides_feather.h5") as f:
    ids_feather = f['ids'].asstr()[:]
    features_feather = f['features'][:]
df_feather = pd.DataFrame({"slide_id": ids_feather, "feather_embeddings": features_feather.tolist()})

Now we will load the datasets that will be used for EGFR mutation and overall survival (OS) prediction.

In [None]:
df_EGFR = pd.read_csv(f"{main_folder}/CPTAC_LUAD_EGFR_mutation.tsv", sep="\t")
EGFR_label_dict = {0: "wildtype", 1: "mutant"}
df_EGFR

In [None]:
df_OS = pd.read_csv(f"{main_folder}/CPTAC_LUAD_OS.tsv", sep="\t")
df_OS

Now, we will create two datasets: one for EGFR, one for OS.  
For each annotated slide id, the EGFR dataset will have the following columns:
- slide_id
- EGFR_mutation
- titan_embeddings
- feather_embeddings
- split
  
For each annotated slide id, the OS dataset will have the following columns:
- slide_id
- OS
- OS_event
- OS_days
- titan_embeddings
- feather_embeddings
- split

💡 Hint: You can use [pandas' merge method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.merge.html) for this.
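As a rough illustration of how chained merges behave, here is a minimal sketch on toy data. The three small dataframes and their values are made up for the example, and it assumes the label table carries `slide_id` and `split` columns as listed above; the real files may differ.

```python
import pandas as pd

# Toy stand-ins for the real tables (all values are made up for illustration).
df_labels = pd.DataFrame({
    "slide_id": ["s1", "s2", "s3"],
    "EGFR_mutation": [0, 1, 0],
    "split": ["train", "train", "test"],
})
df_titan_toy = pd.DataFrame({
    "slide_id": ["s1", "s2", "s3", "s4"],  # s4 has no label
    "titan_embeddings": [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]],
})
df_feather_toy = pd.DataFrame({
    "slide_id": ["s1", "s2", "s3", "s4"],
    "feather_embeddings": [[1.0], [2.0], [3.0], [4.0]],
})

# Inner joins on slide_id keep only slides present in every table,
# so unlabelled slides such as s4 are dropped.
df_merged = (
    df_labels
    .merge(df_titan_toy, on="slide_id", how="inner")
    .merge(df_feather_toy, on="slide_id", how="inner")
)
print(df_merged.columns.tolist())
```

An inner join is only one possible choice: `how="left"` would instead keep every annotated slide and leave missing embeddings as NaN, which can help spot slides without embeddings.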

In [None]:
# Your code here

## Unsupervised exploration
Let us first see whether we can observe clusters for EGFR mutation and OS using Titan and Feather embeddings.
Using [scikit-learn's PCA](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html) and [seaborn's scatterplot](https://seaborn.pydata.org/generated/seaborn.scatterplot.html), compute the first two principal components (PCs) of the Titan and Feather embeddings. Then, for each embedding type, plot one graph highlighting the EGFR-mutated samples and another showing the OS labels.
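The general pattern can be sketched on synthetic data as follows. The dataframe `df_toy`, its `embeddings` column and the random values are hypothetical stand-ins, not the real Titan or Feather features.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Hypothetical stand-in: 20 slides with 8-dimensional embeddings stored as lists.
df_toy = pd.DataFrame({
    "embeddings": rng.normal(size=(20, 8)).tolist(),
    "EGFR_mutation": rng.integers(0, 2, size=20),
})

# Turn the per-row lists into a (n_slides, n_features) matrix.
X = np.array(df_toy["embeddings"].tolist())

# Project onto the first two principal components.
pca = PCA(n_components=2)
pcs = pca.fit_transform(X)
df_toy["PC1"] = pcs[:, 0]
df_toy["PC2"] = pcs[:, 1]

# Fraction of the total variance captured by each of the two PCs.
print(pca.explained_variance_ratio_)
```

The resulting frame can then be passed to seaborn, for example `sns.scatterplot(data=df_toy, x="PC1", y="PC2", hue="EGFR_mutation")`.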

In [None]:
from sklearn.decomposition import PCA
import seaborn as sns
# Your code here

- Do the two models provide similar embeddings? 
- Are there more outliers in one of the two models?
- Do you see some clusters appearing?

Given that the sample size is smaller than the number of features in the slide embeddings, we will reduce the dimensionality of our data before performing any downstream task, to limit overfitting. Add `titan_pcs` and `feather_pcs` columns to your EGFR and OS dataframes, each containing the first 10 PCs of the corresponding slide embedding.
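One possible way to store the PCs as a list-valued column is sketched below. The helper `add_pcs` is a hypothetical name introduced for this example, and the embeddings are random numbers rather than real slide features.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

def add_pcs(df, emb_col, out_col, n_components=10, random_state=42):
    """Return a copy of df with the first n_components PCs of emb_col in out_col."""
    X = np.array(df[emb_col].tolist())
    pcs = PCA(n_components=n_components, random_state=random_state).fit_transform(X)
    out = df.copy()
    # One list of n_components floats per slide, aligned on the original index.
    out[out_col] = pd.Series(pcs.tolist(), index=out.index)
    return out

rng = np.random.default_rng(0)
# Hypothetical stand-in: 30 slides with 64-dimensional embeddings.
df_toy = pd.DataFrame({"titan_embeddings": rng.normal(size=(30, 64)).tolist()})
df_toy = add_pcs(df_toy, "titan_embeddings", "titan_pcs")
print(len(df_toy.loc[0, "titan_pcs"]))
```

This sketch fits the PCA on all rows at once, as the exercise asks. A stricter alternative would be to fit it on the training slides only and apply `transform` to the test slides.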

In [None]:
# Your code here

## The importance of EGFR in lung adenocarcinoma
The EGFR gene (Epidermal Growth Factor Receptor) encodes a transmembrane receptor tyrosine kinase involved in cell proliferation, survival, and differentiation. Activating mutations in EGFR lead to continuous signaling through pathways such as MAPK and PI3K–AKT, promoting uncontrolled tumour growth. EGFR mutations occur in roughly 10–15% of LUAD cases in Western populations and in up to 40–50% of East Asian patients. They are also more common in women and in never-smokers (Ko 2022). There is also growing evidence that ambient air pollution (PM2.5) can promote EGFR-mutant LUAD in people who have never smoked (Hill et al., Nature 2023).

The discovery of EGFR mutations in the early 2000s revolutionized lung cancer therapy by introducing targeted treatments known as EGFR tyrosine kinase inhibitors (TKIs) (Lynch et al., NEJM 2004; Paez et al., Science 2004). These drugs, such as gefitinib, erlotinib, or osimertinib, block aberrant EGFR signaling and can lead to dramatic tumour shrinkage and prolonged survival compared with standard chemotherapy. First-line osimertinib (a 3rd-generation EGFR-TKI) has dramatically improved survival outcomes in EGFR-mutant lung adenocarcinoma. The FLAURA2 trial combining osimertinib + platinum-based chemotherapy reported a median OS of 47.5 months, nearly double the survival observed before the TKI era, when platinum chemotherapy alone yielded a median OS of 23.6 months (Maemondo et al., NEJM 2010; FLAURA2 WCLC 2025 abstract; AstraZeneca Press Release, 2025).

In clinical practice, EGFR status is typically determined through molecular testing (DNA sequencing, PCR, or next-generation sequencing). However, these tests require specialized equipment, time, and money, and may not be available in all hospitals, particularly in low-resource settings.

Routine histology is available for every case. If AI could screen for EGFR mutation status directly from H&E:
- results could be fast and low-cost,
- triage could prioritize confirmatory molecular testing,
- and settings with limited molecular infrastructure could benefit sooner.

#### Logistic regression task
Using the EGFR dataset you just created, train one [logistic regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegressionCV.html) model (more info [here](https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression)) per slide embedding type using the slides' PCs and EGFR labels. Then, evaluate the results using [scikit-learn's classification report](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html) and [confusion matrix](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html).

💡 Hint: Use the `split` column to select the train and test sets.
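The workflow can be sketched on synthetic data as below. The dataframe, the `pcs` column name, the `"train"`/`"test"` values of `split`, and the artificial signal injected into the first component are all assumptions made for the example; check the actual values in your own dataframe.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import classification_report, confusion_matrix

rng = np.random.default_rng(42)
n = 120
y = rng.integers(0, 2, size=n)
# Synthetic 10-dimensional "PCs": the first component is shifted by the label.
X = rng.normal(size=(n, 10))
X[:, 0] += 2.0 * y
df_toy = pd.DataFrame({
    "pcs": X.tolist(),
    "EGFR_mutation": y,
    "split": np.where(np.arange(n) < 90, "train", "test"),
})

# Select train and test rows with the split column.
train = df_toy[df_toy["split"] == "train"]
test = df_toy[df_toy["split"] == "test"]
X_train, y_train = np.array(train["pcs"].tolist()), train["EGFR_mutation"].to_numpy()
X_test, y_test = np.array(test["pcs"].tolist()), test["EGFR_mutation"].to_numpy()

# Cross-validated choice of the regularization strength on the training set only.
clf = LogisticRegressionCV(cv=5, max_iter=1000, random_state=42)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print(classification_report(y_test, y_pred, target_names=["wildtype", "mutant"]))
cm = confusion_matrix(y_test, y_pred)  # rows: true class, columns: predicted class
print(cm)
```

On the real data the classes are likely to be imbalanced, so the per-class recall in the report is usually more informative than the overall accuracy.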

In [None]:
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
random_state = 42
# Perform a logistic regression with feather, evaluate the results and plot the confusion matrix

In [None]:
# Do the same for titan and compare the results

- Do both slide embeddings perform the same? Is one better than the other?
- Highlight the incorrectly classified slides in a scatterplot using the first two PCs as the x and y axes
- Compare the slides incorrectly classified by each model. Are they the same?
- You can look at the thumbnails of the slides in the folder below to display incorrectly classified slides

In [None]:
!ls /data/Training-MG/files/data/AI_praticals_2025/AI_pratical_2_patho_bench/CPTAC_LUAD_thumbnails/

## Predicting survival from histology
Beyond molecular alterations, a key clinical question is: how aggressive is this tumour?
Predicting patient survival or prognosis is central to treatment decisions. Current prognostic factors (tumour stage, grade, etc.) explain only part of the variability in outcomes.

WSIs encode tumour morphology, microenvironment, and spatial organization that correlate with prognosis. Recent advances in deep learning have demonstrated that AI models can extract subtle patterns from histology that correlate with survival, sometimes capturing signals beyond human perception (e.g., pan-cancer TCGA analysis: Wulczyn et al., PLOS ONE 2020).

#### Survival prediction task
Using the OS dataframe, specifically the OS_days, OS_event, and PCs columns from the slide embeddings, train a [Cox proportional hazards model](https://scikit-survival.readthedocs.io/en/stable/api/generated/sksurv.linear_model.CoxPHSurvivalAnalysis.html) per slide embedding type, using [scikit-learn's train_test_split function](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) to create a train and a test dataset. Compute the concordance index of each model and plot the cumulative hazard and survival functions.

💡 Hint: you can find some examples [here](https://scikit-survival.readthedocs.io/en/stable/api/generated/sksurv.linear_model.CoxPHSurvivalAnalysis.html#sksurv.linear_model.CoxPHSurvivalAnalysis.score)
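To make the concordance index more concrete, here is a minimal NumPy-only sketch of Harrell's C-index on toy data. It is a simplified illustration (pairs with tied times are skipped), not scikit-survival's implementation, and the times, events and risk scores are made up. It also shows the structured-array layout that scikit-survival expects for the outcome: event indicator first, then time.

```python
import numpy as np

def concordance_index(time, event, risk):
    """Harrell's C-index; pairs with tied times are skipped for simplicity."""
    concordant, comparable = 0.0, 0
    n = len(time)
    for i in range(n):
        for j in range(n):
            # A pair is comparable if i had the event strictly before j's time.
            if event[i] and time[i] < time[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0   # higher risk died earlier: concordant
                elif risk[i] == risk[j]:
                    concordant += 0.5   # tied risk scores count as half
    return concordant / comparable

# Toy data (made up): survival time in days, event flag, predicted risk score.
time = np.array([100, 200, 300, 400, 500])
event = np.array([True, True, False, True, False])
risk = np.array([2.0, 0.5, 0.2, 1.0, 0.1])

# Outcome as a structured array: boolean event first, then time.
y = np.array(list(zip(event, time)), dtype=[("event", bool), ("time", float)])

print(concordance_index(time, event, risk))  # 7 of 8 comparable pairs -> 0.875
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking. In scikit-survival, `Surv.from_arrays(event, time)` from `sksurv.util` builds the same kind of structured array, and `CoxPHSurvivalAnalysis.score` returns the concordance index directly.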

In [None]:
from sksurv.linear_model import CoxPHSurvivalAnalysis
# perform a Cox proportional hazard analysis. The documentation for this method can be found here: https://scikit-survival.readthedocs.io/en/stable/api/generated/sksurv.linear_model.CoxPHSurvivalAnalysis.html
# plot the survival curves

- How good are the results?
- Is one model performing better than the other?