# Usage Tutorial (EBV vs Others):

This tutorial uses the EBV vs. Others task on TCGA data as a worked example. It is a binary classification task with two classes, EBV and Others; the same workflow can be extended to multi-class tasks.

## Install Environment:

Create the environment with conda commands:
```
conda create -n conceppath python=3.8.18
conda activate conceppath
```

Install the dependencies:
```
git clone https://github.com/HKU-MedAI/ConcepPath.git
cd ConcepPath
pip install -r requirements.txt
```

The [TCGA data](https://portal.gdc.cancer.gov/) can be downloaded from the GDC Data Portal. The preprocessing code is largely based on [CLAM](https://github.com/mahmoodlab/CLAM).

The folder structure used in this tutorial is shown below (`{xxx}` denotes a variable):

```
└── {experiment_root_path}
    ├── data
        ├── datasets
            └── {dataset_name}
                └── {task_name}_label.csv
        └── raw_data
            └── {dataset_name}
                ├── csv
                    ├── bm.csv
                    └── pl_mag{target_mag}x_patch{base_patch_size}_{target_patch_size}.csv
                └── segmented_patch
                    └── pl_mag{target_mag}x_patch{base_patch_size}_{target_patch_size}
                        ├── masks
                            └── {wsi_name}.jpg
                        ├── patches
                            └── {wsi_name}.h5
                        ├── stitches
                            └── {wsi_name}.jpg
                        └── process_list_autogen.csv
                        
    └── {experiment_name}
        ├── input
            ├── csv
                ├── split
                    └── {fold_i}.csv
            └── prompt
                ├── patch_prompts.json
                └── region_prompts.json
        └── output
            ├── attn_score
                └── {vlm_name_i}
                    └── {wsi_name}_{vlm_name}.pkl
            ├── heatmap
                └── attn_map
                    └── {vlm_name}
                        └── {wsi_name}
                            ├── aa_thumbnail.png
                            └── {patch_prompt_i}.png
            ├── metrics
                └── metrics.csv
            └── model
                └── {model_name}

```
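The layout above can be made concrete with a small path helper. This is a minimal sketch and not code from the repository: the function name `expected_paths` and the example values passed to it are assumptions chosen for illustration, while the directory and file names follow the tree above.

```python
from pathlib import PurePosixPath

def expected_paths(experiment_root_path, dataset_name, task_name,
                   experiment_name, target_mag, base_patch_size,
                   target_patch_size):
    """Return key locations from the folder tree as a dict (hypothetical helper)."""
    root = PurePosixPath(experiment_root_path)
    # Name shared by the process list CSV and the segmented patch directory.
    pl_name = f"pl_mag{target_mag}x_patch{base_patch_size}_{target_patch_size}"
    raw = root / "data" / "raw_data" / dataset_name
    exp = root / experiment_name
    return {
        "label_csv": root / "data" / "datasets" / dataset_name / f"{task_name}_label.csv",
        "process_list": raw / "csv" / f"{pl_name}.csv",
        "patches_dir": raw / "segmented_patch" / pl_name / "patches",
        "split_dir": exp / "input" / "csv" / "split",
        "patch_prompts": exp / "input" / "prompt" / "patch_prompts.json",
        "metrics_csv": exp / "output" / "metrics" / "metrics.csv",
    }

# Placeholder values; substitute your own root path and names.
paths = expected_paths("/path/to/root", "STAD", "molecular_ebv_others",
                       "my_experiment", 20, 448, 448)
for name, path in paths.items():
    print(f"{name}: {path}")
```

Printing the paths before running the pipeline is a cheap way to confirm that every step reads from and writes to the location the next step expects.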

The workflow consists of the following steps:
1. Descriptive Prompt Generation;
2. Image Data Preprocessing;
3. Training;
4. Testing and saving attention scores;
5. Postprocessing;
6. Others

# 1. Descriptive Prompt Generation:

Generate descriptive prompts at two levels: slide level and patch level. The question templates used to query the LLM (GPT-4) are as follows.

Patch-level + Positive(EBV):
```txt
Q1: Please provide a summary of the factors found in primary tumor whole slide images that may indicate Epstein-Barr Virus (EBV)-positive subtype of Gastric cancer, along with a description of their image features in short terms, separated by semicolons. Please avoid using subtype names in your response.
Q*: Please avoid using the word "EBV" in your response.
Q*: Please give more.
```
Patch-level + Negative(others):
```
Q2: Suppose we let Gastric cancer subtypes other than Epstein-Barr Virus (EBV)-positive subtype into one group. Please provide a summary of the factors found in primary tumor whole slide images that may indicate this group of subtypes, along with a description of their image features in short terms, separated by semicolons.
Q*: Please avoid using the word "EBV" in your response.
Q*: Please give more.
```
Slide-level + Positive(EBV):
```txt
Q3: Summary the appearance of whole slide images of Epstein-Barr Virus (EBV)-positive subtype of Gastric cancer in short terms, separated by semicolons.
Q*: Please avoid using the word "EBV" in your response.
Q*: Please give more.
```
Slide-level + Negative(others):
```
Q4: Suppose we let Gastric cancer subtypes other than Epstein-Barr Virus (EBV)-positive subtype into one group. Summary the appearance of whole slide images of this group in short terms, separated by semicolons. 
Q*: Please avoid using the word "EBV" in your response.
Q*: Please give more.
```

Record the answers in the following format (columns: `prompt_level,label,descriptive_prompt`):
```csv
patch,ebv,Lymphoid stroma: Dense lymphoid stroma with infiltrating lymphocytes; lymphocytes surrounding tumor cells.
patch,ebv,Lymphoepithelioma-like appearance: Undifferentiated tumor cells with abundant lymphocytic infiltrate; tumor cells interspersed with lymphocytes.
patch,ebv,Epstein-Barr Virus (EBV)-encoded RNA (EBER) expression: Strong and diffuse nuclear staining for EBER; intense staining in the nuclei of tumor cells.
...
slide,ebv,Lymphocyte-rich infiltrate; Well-defined glandular structures; Epithelial cell apoptosis; Nuclear atypia; Syncytial growth pattern; Presence of viral-associated features; Intense lymphoid reaction; Absence of signet ring cells; Expression of viral RNA.
```
Next, run the helper script to parse these prompts into JSON format. The helper script:
 - splits the prompts into two files by level (slide vs. patch);
 - filters out uses of the class label (or associated keywords);
 - keeps the top-N concepts among the patch-level prompts.

Note: `EXPERIMENT_NAME` must match the `--experiment_name` passed to `main.py`.
```bash
EXPERIMENT_NAME=molecular_ebv_others_full_train python tools/parse_prompt.py
```
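The three functions listed above can be sketched in a few lines. This is an illustrative sketch only and not the code of `tools/parse_prompt.py`: the function name `parse_prompts`, its parameters, the output structure (a dict from label to a list of prompts), and the reading of "top-N" as "the first N prompts per label" are all assumptions. The example rows are shortened toy data.

```python
import csv
import io
import json
from collections import defaultdict

def parse_prompts(csv_text, banned_keywords, top_n):
    """Split prompt rows by level, drop rows naming the class, keep top-N patch prompts."""
    prompts = {"patch": defaultdict(list), "slide": defaultdict(list)}
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) < 3 or row[0].strip() not in prompts:
            continue  # skip blank lines, "..." placeholders, and unknown levels
        level, label = row[0].strip(), row[1].strip()
        # The prompt text itself may contain commas, so re-join the remaining fields.
        text = ",".join(row[2:]).strip()
        if any(k.lower() in text.lower() for k in banned_keywords):
            continue  # the prompt mentions the class label or an associated keyword
        prompts[level][label].append(text)
    # Assumption: "top-N" means the first N prompts per label, in file order.
    patch = {label: items[:top_n] for label, items in prompts["patch"].items()}
    slide = dict(prompts["slide"])
    return patch, slide

# Toy rows for illustration only.
example = """patch,ebv,Lymphoid stroma: Dense lymphoid stroma with infiltrating lymphocytes.
patch,ebv,EBV-encoded RNA expression: Strong nuclear staining.
patch,others,Glandular differentiation: Well-formed glands, tubular structures.
slide,ebv,Lymphocyte-rich infiltrate; Syncytial growth pattern.
"""
patch_prompts, slide_prompts = parse_prompts(
    example, banned_keywords=["EBV", "Epstein-Barr"], top_n=1)
print(json.dumps(patch_prompts, indent=2))
print(json.dumps(slide_prompts, indent=2))
```

Dropping whole prompts that mention a banned keyword is the simplest policy; the actual script may instead rewrite or mask the keyword, so check its source before relying on this behavior.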

The references used in this section are listed in 6.1 Prompt References.

# 2. Image Data Preprocessing:

## 2.1 Generate processing list:

### Parameter Description:
- **`--exp_root_path`**: root path of the experiment;
- **`--WSI_dir`**: directory containing the WSIs;
- **`--save_dir`**: directory where the results are saved;
- **`--base_patch_size`**: base patch size;
- **`--target_mag`**: target magnification;
- **`--number`**: number of files to process;
- **`--WSI_name`**: name of the WSI dataset (/{save_dir}/{WSI_name})

In [None]:
from utils.processing_utils import *
import os

exp_root_path = "/home/r10user13/ConcepPath"

generate_pl_bm(
        WSI_dir="/data1/r10user13/TCGA-WSI/STAD/STAD",
        # The second component must be relative: os.path.join discards
        # exp_root_path if the next component starts with "/".
        save_dir=os.path.join(exp_root_path, "data/raw_data/"),
        base_patch_size=448,
        target_mag=20,
        number="all",
        WSI_name="STAD"
)

## 2.2 Segment patches and save coords information:

### Parameter Description:
- **`--source`**: directory containing the WSIs;
- **`--save_dir`**: output directory;
- **`--patch_size`**: patch size;
- **`--step_size`**: step size (a value smaller than the patch size produces overlapping patches; a larger value leaves gaps between patches);
- **`--seg`**: whether to generate masks;
- **`--patch`**: whether to generate patches;
- **`--stitch`**: whether to generate stitches;
- **`--process_list`**: path of the process list;

In [None]:
!python create_patches_fp.py \
    --source /data1/r10user3/TCGA-WSI/STAD/STAD \
    --save_dir /home/r10user13/ConcepPath/data/raw_data/STAD/segmented_patch/pl_mag20x_patch448_448 \
    --patch_size 448 \
    --step_size 448 \
    --seg \
    --patch \
    --stitch \
    --process_list /home/r10user13/ConcepPath/data/raw_data/STAD/csv/pl_mag20x_patch448_448.csv
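
The effect of `--step_size` relative to `--patch_size` described in the parameter list can be illustrated along a single axis. This is a geometric sketch only: `create_patches_fp.py` also handles tissue contours and pyramid levels, and the helper name `patch_origins` and the region length of 2000 pixels are assumptions made for the example.

```python
def patch_origins(length, patch_size, step_size):
    """Top-left coordinates of patches along one axis of a region `length` pixels long."""
    return list(range(0, length - patch_size + 1, step_size))

for step in (224, 448, 672):
    xs = patch_origins(length=2000, patch_size=448, step_size=step)
    overlap = 448 - step  # positive: patches overlap; negative: gap between patches
    print(f"step_size={step}: {len(xs)} patches, origins={xs}, overlap={overlap}px")
```

With `step_size` equal to `patch_size`, as in the command above, the patches tile the region without overlap or gaps.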

## 2.3 Integrate the dataset and generate label files:

### Parameter Description:
- **`--raw_label`**: path of the raw label file;
- **`--dataset_path`**: directory where the result is saved;
- **`--file_name`**: name of the output file;
- **`--seg_patch_dir`**: directory of the segmented patches;
- **`--WSI_dir`**: directory containing the WSIs;

In [None]:
import pandas as pd
import glob, os
from utils.processing_utils import *

raw_label = '/home/r10user13/ConcepPath/data/molecular_label_raw.csv'
dataset_path = "/home/r10user13/ConcepPath/data/datasets/STAD"
file_name = "molecular_ebv_others.csv"
seg_patch_dir = '/home/r10user13/ConcepPath/data/raw_data/STAD/segmented_patch'
WSI_dir = "/data1/r10user3/TCGA-WSI/STAD/STAD"

file_list = glob.glob(os.path.join(WSI_dir, "*"))
raw_label_df = pd.read_csv(raw_label)
label_dict = {
    "slide_fp": [],
    "label": []
}

def to_binary_label(label):
    # Collapse every non-EBV molecular subtype into a single "others" class.
    return "EBV" if label == "EBV" else "others"

# Use only rows without a note, and match slides whose path contains the TCGA barcode.
for i, row in raw_label_df[raw_label_df["Note"].isna()].iterrows():
    result = [s for s in file_list if row['TCGA barcode'] in s]
    label = to_binary_label(row['Molecular Subtype'])
    label_dict["slide_fp"] += result
    label_dict["label"] += [label]*len(result)

label_map = dict(zip(label_dict["slide_fp"], label_dict["label"]))

directory_paths = {
    seg_patch_dir: WSI_dir
}
    
generate_label_file(
    directory_paths, 
    dataset_path, 
    file_name, 
    label_map)

## 2.4 Feature Extraction:

### Parameter Description:
- **`--vlm_model`**: name of the vision-language model;
- **`--label_fp`**: path of the label file;
- **`--batch_size`**: batch size;
- **`--save_rp`**: directory where the extracted features are saved;
- **`--num_workers`**: number of workers;
- **`--base_mag`**: base magnification;
- **`--base_patch_size`**: base patch size;

In [None]:
!python create_feature_extraction.py  \
    --vlm_model quilt1m \
    --label_fp /home/r10user13/ConcepPath/data/datasets/STAD/molecular_ebv_others.csv  \
    --batch_size 128 \
    --save_rp /data2/r10user13/ConcepPath \
    --num_workers 32 \
    --base_mag 20 \
    --base_patch_size 448

## 2.5 Dataset split:

### Parameter Description:
- **`--shot_num`**: number of shots: either a list of values such as [1,2,4,8,16,32,64, ...] or `"full_train"`;
- **`--exp_rp`**: root path of the experiments;
- **`--exp_name`**: name of the experiment;
- **`--fold_num`**: number of folds;
- **`--labe_fp`**: path of the label file (generated in step 2.3);
- **`--base_mag`**: base magnification;
- **`--base_patch_size`**: base patch size;

In [None]:
# split
from sklearn.model_selection import train_test_split, StratifiedKFold
import os, shutil, random
import pandas as pd
import numpy as np

# shot_num = [1,2,4,8,16,32,64]
shot_num = "full_train"
exp_rp = "/home/r10user13/ConcepPath/experiment"
exp_name = "stad_molecular_ebv_others_ft"
fold_num = 5
labe_fp = "/home/r10user13/ConcepPath/data/datasets/STAD/molecular_ebv_others.csv"


label_df = pd.read_csv(labe_fp)
data_fp_label_map = dict(zip(
    list(label_df["slide_fp"]), list(label_df["label"])
))

exp_rp_ = os.path.join(exp_rp, exp_name, 'input/csv/split')
if not os.path.exists(exp_rp_):
    os.makedirs(exp_rp_)
    

if shot_num == "full_train":
    slide_fp = label_df["slide_fp"]
    seg_fp = label_df["seg_fp"]
    label = label_df["label"]
    
    X = np.array(slide_fp)
    y = np.array(label)
    skf = StratifiedKFold(n_splits=fold_num, shuffle=True, random_state=42)
    i = 0 
    for train_index, test_index in skf.split(X, y):
        X_train, X_test = list(X[train_index]), list(X[test_index])
        y_train, y_test = list(y[train_index]), list(y[test_index])
        X_val, y_val = X_test, y_test
        out = pd.DataFrame({
            "data_path": X_train+X_test+X_val, 
            "label": y_train+y_test+y_val, 
            "type": ["train"]*len(X_train)+["test"]*len(X_test)+["val"]*len(X_val)
        })
        
        out.to_csv(os.path.join(exp_rp_, f"fold{i}.csv"))
        
        i+=1
        
else:
    # Few-shot setting: shot_num is a list such as [1,2,4,8,16,32,64].
    data_label = {}
    class_names = label_df["label"].unique()
    
    for label_i in class_names:
        slide_fp = label_df[label_df["label"]==label_i]["slide_fp"]
        
        # Hold out 20% of each class as a fixed test set (plain lists, so they can be shuffled and sliced).
        train_data, test_data = train_test_split(
            list(slide_fp), test_size=0.2, random_state=2023)
        
        random.shuffle(train_data)
        data_label[label_i] = {"train_data": train_data, "test_data": test_data}
    
    for shot in shot_num:
        for i in range(fold_num):
            
            cur_i = i*shot
            data_path_list = []
            label_list = []
            type_list = []
            
            for label_i in class_names:
                
                # Each fold takes a different window of `shot` training slides per class.
                train_data_i = data_label[label_i]["train_data"][cur_i:cur_i+shot]
                
                test_data_i = data_label[label_i]["test_data"]
                
                # Validation slides are sampled from the training slides not used in this fold.
                val_data_i = random.sample(list(set(data_label[label_i]["train_data"])-set(train_data_i)), len(test_data_i))
                
                data_path_list_i = train_data_i+test_data_i+val_data_i
                label_list_i = [data_fp_label_map[data_path] for data_path in data_path_list_i]
                type_list_i = ["train"]*len(train_data_i)+["test"]*len(test_data_i)+["val"]*len(val_data_i)
                
                data_path_list += data_path_list_i
                label_list += label_list_i
                type_list += type_list_i
                
                
            out = pd.DataFrame({
                "data_path": data_path_list, 
                "label": label_list, 
                "type": type_list
            })
            # Note: every shot value writes to the same fold{i}.csv names, so later shots overwrite earlier ones.
            out.to_csv(os.path.join(exp_rp_, f"fold{i}.csv"))
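
Before training, it can help to sanity-check a generated split file. The sketch below is an optional check and not part of the repository: the helper names `summarize_split` and `load_split` are hypothetical, the column names `data_path`, `label`, and `type` are those written by the split code above, and the slide names in the example are made up.

```python
import csv
from collections import Counter

def summarize_split(rows):
    """Count slides per (type, label) and list slides shared by train and test."""
    counts = Counter((r["type"], r["label"]) for r in rows)
    train = {r["data_path"] for r in rows if r["type"] == "train"}
    test = {r["data_path"] for r in rows if r["type"] == "test"}
    return counts, sorted(train & test)

def load_split(csv_path):
    """Read a fold CSV into a list of dicts, one per slide."""
    with open(csv_path, newline="") as f:
        return list(csv.DictReader(f))

# Made-up rows standing in for load_split(".../input/csv/split/fold0.csv").
rows = [
    {"data_path": "slide_a.svs", "label": "EBV", "type": "train"},
    {"data_path": "slide_b.svs", "label": "others", "type": "train"},
    {"data_path": "slide_c.svs", "label": "others", "type": "test"},
    {"data_path": "slide_c.svs", "label": "others", "type": "val"},
]
counts, leaked = summarize_split(rows)
print(dict(counts))
print("train/test overlap:", leaked)
```

The check compares only train against test, because the `full_train` branch above deliberately reuses the test slides as the validation set.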

# 3. Training:

For a detailed description of the parameters, refer to the *get_params()* function in main.py.

In [None]:
# fold0 train
!python main.py \
    --is_adapted \
    --weighted_type p2c \
    --orth_ratio 2 \
    --n_flp 8 \
    --n_ctx 16 \
    --fold_name fold0 \
    --vlm_name quilt1m \
    --feature_rp /data2/r10user13/ConcepPath/stad_quilt1m_20x_448/ \
    --num_epochs 200 \
    --learning_rate 0.0001 \
    --task_type train \
    --n_classes 2 \
    --experiment_rp /home/r10user13/ConcepPath/experiment \
    --experiment_name stad_molecular_ebv_others_ft
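
The command above trains a single fold. To train all folds, the same command can be generated once per fold. This is a minimal sketch under stated assumptions: the helper `fold_commands` is hypothetical, the paths are placeholders, and only flags that appear in the command above are used. Any remaining flags are passed through `extra_args` unchanged.

```python
import shlex

def fold_commands(experiment_rp, experiment_name, feature_rp, extra_args, fold_num=5):
    """Build one `python main.py` training command per fold (nothing is executed)."""
    commands = []
    for i in range(fold_num):
        cmd = ["python", "main.py",
               "--fold_name", f"fold{i}",
               "--task_type", "train",
               "--feature_rp", feature_rp,
               "--experiment_rp", experiment_rp,
               "--experiment_name", experiment_name] + list(extra_args)
        commands.append(cmd)
    return commands

# Placeholder paths and names; copy the remaining flags from the command above.
cmds = fold_commands("/path/to/experiment", "my_experiment", "/path/to/features/",
                     extra_args=["--vlm_name", "quilt1m", "--n_classes", "2"])
for cmd in cmds:
    print(shlex.join(cmd))
```

Each entry is an argument list, so it can be run with `subprocess.run(cmd, check=True)` once the printed commands look right.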

# 4. Testing and saving attention scores:

In [None]:
# fold0 test
!python main.py \
    --n_flp 8 \
    --n_ctx 16 \
    --model_fp /home/r10user13/ConcepPath/experiment/stad_molecular_ebv_others_ft/output/model/quilt1m_fold0_lr:0.0001_ctx:16_flp:8_specific__test_best_auc_model.pt \
    --fold_name fold0 \
    --vlm_name quilt1m \
    --feature_rp /data2/r10user13/ConcepPath/stad_quilt1m_20x_448/ \
    --task_type test \
    --n_classes 2 \
    --experiment_rp /home/r10user13/ConcepPath/experiment \
    --experiment_name stad_molecular_ebv_others_ft 


# 5. Postprocessing:
## 5.1 Analyze training metrics

Compute the mean and standard deviation of each metric across the 5 folds.

In [None]:
experiment_rp = "/home/r10user13/ConcepPath/experiment"
experiment_name = "stad_molecular_ebv_others_ft"

import os
metric_fp = os.path.join(experiment_rp, experiment_name, "output/metrics/metrics.csv")

from utils.processing_utils import * 

metrics_analysis(metric_fp)
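
The aggregation itself is simple enough to sketch with the standard library. The following is an illustration and not the implementation of `metrics_analysis`: the layout of `metrics.csv` (one row per fold, one column per metric), the metric name `auc`, and the values are all assumptions, and the real function may use the population rather than the sample standard deviation.

```python
import statistics

def summarize_metrics(rows, metric_names):
    """Return {metric: (mean, sample standard deviation)} across fold rows."""
    summary = {}
    for name in metric_names:
        values = [float(r[name]) for r in rows]
        summary[name] = (statistics.mean(values), statistics.stdev(values))
    return summary

# Made-up values for illustration; real rows would come from metrics.csv.
rows = [
    {"fold": "fold0", "auc": "0.80"},
    {"fold": "fold1", "auc": "0.85"},
    {"fold": "fold2", "auc": "0.90"},
    {"fold": "fold3", "auc": "0.85"},
    {"fold": "fold4", "auc": "0.85"},
]
summary = summarize_metrics(rows, ["auc"])
for name, (mean, std) in summary.items():
    print(f"{name}: {mean:.3f} +/- {std:.3f}")
```

`statistics.stdev` divides by n - 1; use `statistics.pstdev` if the population standard deviation is wanted instead.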

## 5.2 Create heatmap of attention score:

Run Step 4, "Testing and saving attention scores", first: the heatmaps are created from the saved attention score files.

In [None]:
# quilt1m
!python create_attn_map.py \
    --seg \
    --vlm_model quilt1m \
    --experiment_rp /home/r10user13/ConcepPath/experiment \
    --experiment_name stad_molecular_ebv_others_ft \
    --process_list /home/r10user13/ConcepPath/data/molecular_ebv_others_pl.csv \
    --label_fp /home/r10user13/ConcepPath/data/datasets/STAD/molecular_ebv_others.csv \
    --n_flp 8 \
    --n_classes 2 \
    --n_patch_prompt 26
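
Conceptually, a heatmap places one attention score per patch at that patch's position in the slide. The sketch below illustrates only that idea and is not the logic of `create_attn_map.py`: the helper `attention_grid` is hypothetical, the layout of the saved `.pkl` files is not assumed, patch coordinates are assumed to be top-left pixel positions, and the scores are toy values.

```python
def attention_grid(coords, scores, patch_size):
    """Place min-max normalized scores on a 2D grid indexed by patch row and column."""
    lo, hi = min(scores), max(scores)
    span = (hi - lo) or 1.0  # avoid division by zero when all scores are equal
    cols = max(x for x, _ in coords) // patch_size + 1
    rows = max(y for _, y in coords) // patch_size + 1
    grid = [[None] * cols for _ in range(rows)]  # None marks background (no patch)
    for (x, y), s in zip(coords, scores):
        grid[y // patch_size][x // patch_size] = (s - lo) / span
    return grid

# Three patches of size 448: two in the first row, one in the second.
coords = [(0, 0), (448, 0), (0, 448)]
scores = [1.0, 2.0, 3.0]
grid = attention_grid(coords, scores, patch_size=448)
for row in grid:
    print(row)
```

Normalizing per slide makes the color scale comparable across concepts within one slide, but not across slides.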

# 6. Others
## 6.1 Prompt References:

 - HER2 Summary Prompts

Shang, Jiuyan et al. “Evolution and clinical significance of HER2-low status after neoadjuvant therapy for breast cancer.” Frontiers in oncology vol. 13 1086480. 22 Feb. 2023, doi:10.3389/fonc.2023.1086480

Venetis, Konstantinos et al. “HER2 Low, Ultra-low, and Novel Complementary Biomarkers: Expanding the Spectrum of HER2 Positivity in Breast Cancer.” Frontiers in molecular biosciences vol. 9 834651. 15 Mar. 2022, doi:10.3389/fmolb.2022.834651

Zhang, Huina et al. “HER2-low breast cancers: Current insights and future directions.” Seminars in diagnostic pathology vol. 39,5 (2022): 305-312. doi:10.1053/j.semdp.2022.07.003

An, Junsha et al. “New Advances in Targeted Therapy of HER2-Negative Breast Cancer.” Frontiers in oncology vol. 12 828438. 4 Mar. 2022, doi:10.3389/fonc.2022.828438

Lee, Hyo-Jae et al. “HER2-Positive Breast Cancer: Association of MRI and Clinicopathologic Features With Tumor-Infiltrating Lymphocytes.” AJR. American journal of roentgenology vol. 218,2 (2022): 258-269. doi:10.2214/AJR.21.26400

den Hollander, Petra et al. “Targeted therapy for breast cancer prevention.” Frontiers in oncology vol. 3 250. 23 Sep. 2013, doi:10.3389/fonc.2013.00250

Patrizio, Armando et al. “Thyroid Metastasis from Primary Breast Cancer.” Journal of clinical medicine vol. 12,7 2709. 4 Apr. 2023, doi:10.3390/jcm12072709


 - Lung (LUAD vs. LUSC) Summary Prompts

Song, Xiaojie et al. “Construction of a Novel Ferroptosis-Related Gene Signature for Predicting Survival of Patients With Lung Adenocarcinoma.” Frontiers in oncology vol. 12 810526. 3 Mar. 2022, doi:10.3389/fonc.2022.810526

Qiu, Wang-Ren et al. “Predicting the Lung Adenocarcinoma and Its Biomarkers by Integrating Gene Expression and DNA Methylation Data.” Frontiers in genetics vol. 13 926927. 30 Jun. 2022, doi:10.3389/fgene.2022.926927

Wang, Wen et al. “What's the difference between lung adenocarcinoma and lung squamous cell carcinoma? Evidence from a retrospective analysis in a cohort of Chinese patients.” Frontiers in endocrinology vol. 13 947443. 29 Aug. 2022, doi:10.3389/fendo.2022.947443

Hu, Xiaoshan et al. “Novel cellular senescence-related risk model identified as the prognostic biomarkers for lung squamous cell carcinoma.” Frontiers in oncology vol. 12 997702. 17 Nov. 2022, doi:10.3389/fonc.2022.997702


 - Molecular (EBV,MSI,CIN,GS) Summary Prompts

Zhu, Chunrong et al. “Genomic Profiling Reveals the Molecular Landscape of Gastrointestinal Tract Cancers in Chinese Patients.” Frontiers in genetics vol. 12 608742. 14 Sep. 2021, doi:10.3389/fgene.2021.608742

Dedieu, Stéphane, and Olivier Bouché. “Clinical, Pathological, and Molecular Characteristics in Colorectal Cancer.” Cancers vol. 14,23 5958. 2 Dec. 2022, doi:10.3390/cancers14235958

Hinata, Munetoshi, and Tetsuo Ushiku. “Detecting immunotherapy-sensitive subtype in gastric cancer using histologic image-based deep learning.” Scientific reports vol. 11,1 22636. 22 Nov. 2021, doi:10.1038/s41598-021-02168-4

Han, Shuting et al. “Epstein-Barr Virus Epithelial Cancers-A Comprehensive Understanding to Drive Novel Therapies.” Frontiers in immunology vol. 12 734293. 10 Dec. 2021, doi:10.3389/fimmu.2021.734293

Sun, Keran et al. “EBV-Positive Gastric Cancer: Current Knowledge and Future Perspectives.” Frontiers in oncology vol. 10 583463. 14 Dec. 2020, doi:10.3389/fonc.2020.583463

Genitsch, Vera et al. “Epstein-barr virus in gastro-esophageal adenocarcinomas - single center experiences in the context of current literature.” Frontiers in oncology vol. 5 73. 26 Mar. 2015, doi:10.3389/fonc.2015.00073

Saito, Motonobu, and Koji Kono. “Landscape of EBV-positive gastric cancer.” Gastric cancer : official journal of the International Gastric Cancer Association and the Japanese Gastric Cancer Association vol. 24,5 (2021): 983-989. doi:10.1007/s10120-021-01215-3

Joshi, Smita S, and Brian D Badgwell. “Current treatment and recent progress in gastric cancer.” CA: a cancer journal for clinicians vol. 71,3 (2021): 264-279. doi:10.3322/caac.21657

Amato, Martina et al. “Microsatellite Instability: From the Implementation of the Detection to a Prognostic and Predictive Role in Cancers.” International journal of molecular sciences vol. 23,15 8726. 5 Aug. 2022, doi:10.3390/ijms23158726

Ratti, Margherita et al. “Microsatellite instability in gastric cancer: molecular bases, clinical perspectives, and new treatment approaches.” Cellular and molecular life sciences : CMLS vol. 75,22 (2018): 4151-4162. doi:10.1007/s00018-018-2906-9

Salnikov, Mikhail et al. “Tumor-Infiltrating T Cells in EBV-Associated Gastric Carcinomas Exhibit High Levels of Multiple Markers of Activation, Effector Gene Expression, and Exhaustion.” Viruses vol. 15,1 176. 7 Jan. 2023, doi:10.3390/v15010176
