# MulGT Usage Tutorial:

**MulGT** takes as input 512 $\times$ 512 patches segmented from the WSIs at 20x magnification. The complete usage tutorial below walks through each step so that readers can understand and reproduce the workflow.

You can download the [TCGA data](https://portal.gdc.cancer.gov/) from the GDC Data Portal. This tutorial is divided into three main parts: Preprocessing, GPU Training, and Testing and Evaluation. It is mainly built on [CLAM](https://github.com/mahmoodlab/CLAM), with some code changed to suit the specific needs of MulGT. The details are as follows:

## 1. Preprocessing:

This section is divided into five parts: Basic Information Statistics, Patch Segmentation, Feature Extraction, Graph Generation, and Dataset Split. The details of each part are as follows:

### 1.1 Basic Information Statistics:

Use the `generate_pl_bm` function to collect basic information about the WSIs. It does the following three things:

1. **Objective Lens Magnification**: 
Read and record the default objective lens magnification of each WSI file. The records are saved in a file named `bm.csv`, whose structure is shown in the table below:

<center>

| Column Name | Description | Data Type |
|-------------|-------------|-----------|
| slide_path | Saving path of WSI files | String |
| base_mag | Default magnification of WSI files | Integer |

</center>

2. **Process List Generation**: 
Group the WSIs by `target_patch_size`, the patch size in default-magnification pixels that corresponds to `base_patch_size` at the target magnification, and generate one `pl_mag{target_magnification}x_patch{base_patch_size}_{target_patch_size}.csv` per group. The structure of each file is shown in the table below:

<center>

| Column Name | Description | Data Type |
|-------------|-------------|-----------|
| slide_id | File name of the WSI files | String |

</center>

3. **Data Cleaning**: 
WSIs without a recorded default objective magnification are skipped.

##### (1) Parameter Description:

- **`--WSI_dir`**: Saving directory of the WSI files.
- **`--save_dir`**: Saving directory of the generated CSV files.
- **`--base_patch_size`**: Patch size at the target magnification.
- **`--target_mag`**: Target magnification.

##### (2) Usage Example:

In the following examples, we use a dataset with a default 40x magnification, called `WSI_bm40`, whose WSI files are saved in the `/path/to/exp/WSI_bm40/WSI_bm40` directory. The code for basic statistics can be structured as follows:

```python
generate_pl_bm(
        WSI_dir="/path/to/exp/WSI_bm40/WSI_bm40", 
        save_dir="/path/to/exp/WSI_bm40/csv", 
        base_patch_size=512, 
        target_mag=20
)
```

**Note**: Please replace the paths and values with those relevant to your setup.

```python
import glob
import os

import openslide
import pandas as pd


def generate_pl_bm(WSI_dir, save_dir, base_patch_size, target_mag):
    # Maps each patch size (in default-magnification pixels) to the WSIs that need it
    process_list = {}
    base_mag_csv = {
        "slide_path": [],
        "base_mag": []
    }

    for WSI in glob.glob(os.path.join(WSI_dir, "*")):
        slide = openslide.open_slide(WSI)
        wsi_name = os.path.basename(WSI)
        objective_power = slide.properties.get(openslide.PROPERTY_NAME_OBJECTIVE_POWER)
        # Data cleaning: skip WSIs without a recorded objective magnification
        if objective_power is None:
            continue

        base_mag = int(objective_power)
        target_min_patch_size = int(base_patch_size * (base_mag / target_mag))

        # Update process_list
        process_list.setdefault(target_min_patch_size, []).append(wsi_name)

        # Update base_mag_csv
        base_mag_csv["slide_path"].append(WSI)
        base_mag_csv["base_mag"].append(base_mag)

    os.makedirs(save_dir, exist_ok=True)

    # Save bm.csv once, after all WSIs have been read
    pd.DataFrame(base_mag_csv).to_csv(os.path.join(save_dir, "bm.csv"))

    # Save one process list per patch size
    for k, slide_ids in process_list.items():
        df = pd.DataFrame({"slide_id": slide_ids})
        df.to_csv(os.path.join(save_dir, f"pl_mag{target_mag}x_patch{base_patch_size}_{k}.csv"))
```
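The patch-size arithmetic used by `generate_pl_bm` can be checked in isolation. The sketch below reproduces only that calculation and the resulting process list file name; the helper names `level0_patch_size` and `process_list_name` are hypothetical and not part of the repository.

```python
def level0_patch_size(base_patch_size: int, base_mag: int, target_mag: int) -> int:
    """Patch size in default-magnification pixels that covers the same
    region as a base_patch_size patch at target_mag."""
    return int(base_patch_size * (base_mag / target_mag))


def process_list_name(base_patch_size: int, base_mag: int, target_mag: int) -> str:
    """File name of the process list that a WSI with this base_mag falls into."""
    size = level0_patch_size(base_patch_size, base_mag, target_mag)
    return f"pl_mag{target_mag}x_patch{base_patch_size}_{size}.csv"


print(level0_patch_size(512, 40, 20))  # 1024
print(process_list_name(512, 40, 20))  # pl_mag20x_patch512_1024.csv
print(process_list_name(512, 20, 20))  # pl_mag20x_patch512_512.csv
```

A dataset that mixes 20x and 40x slides therefore produces two process lists, and each one should be passed to the patch segmentation step with its own patch size.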

### 1.2 Patch Segmentation

Patch segmentation is performed at the default magnification of each WSI file, using `target_patch_size`: the patch size in default-magnification pixels that corresponds to the desired patch size at the target magnification (for example, 1024 for a 40x WSI when 512-pixel patches at 20x are wanted). The output is the x and y coordinates of the upper-left corner of each patch.

For the **MulGT**, the segmentation results at 20x magnification with a patch size of 512 are required, and this segmentation process is performed based on CLAM's `create_patches_fp.py`.

##### (1) Parameter Description:

- **`--source`**: Saving directory of the WSI files.
- **`--save_dir`**: Saving directory of the patch segmentation results.
- **`--patch_size`**: Patch size in pixels at the WSI's default magnification, i.e. `target_patch_size` (1024 in the example below).
- **`--step_size`**: Step size for segmentation. If no overlap is required, this should be equal to the `patch_size`.
- **`--seg`**: Flag to indicate whether to generate the mask.
- **`--patch`**: Flag to indicate whether to generate the patch.
- **`--stitch`**: Flag to indicate whether to generate the stitch.
- **`--process_list`**: Path of the process list file, mainly utilizing the `slide_id` column, and other columns can be omitted.

##### (2) Usage Example:
The command line for patch segmentation can be structured as follows:

```bash
python create_patches_fp.py --source /path/to/exp/WSI_bm40/WSI_bm40 --save_dir /path/to/exp/WSI_bm40/segmented_patch/mag20x_patch512_1024 --patch_size 1024 --step_size 1024 --seg --patch --stitch --process_list /path/to/exp/WSI_bm40/csv/pl_mag20x_patch512_1024.csv
```

**Note**: Please replace the paths and values with those relevant to your setup.
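To make the roles of `--patch_size` and `--step_size` concrete, here is a minimal sketch of how upper-left patch coordinates can be laid out on a regular grid. It is an illustration only, not CLAM's implementation: CLAM additionally restricts patches to the segmented tissue regions, which this sketch ignores. The function name `grid_coords` and the region size are assumptions.

```python
def grid_coords(width, height, patch_size, step_size):
    """Upper-left (x, y) of every patch that fits fully inside a
    width x height region, scanning row by row."""
    xs = range(0, width - patch_size + 1, step_size)
    ys = range(0, height - patch_size + 1, step_size)
    return [(x, y) for y in ys for x in xs]


# Non-overlapping 1024-pixel patches: step_size equals patch_size
coords = grid_coords(width=4096, height=2048, patch_size=1024, step_size=1024)
print(len(coords))  # 8
print(coords[:3])   # [(0, 0), (1024, 0), (2048, 0)]
```

With `step_size` smaller than `patch_size`, neighbouring patches overlap and the number of coordinates grows accordingly.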

### 1.3 Feature Extraction:

Feature extraction is based on CLAM's `extract_features_fp.py`. We modified it so that the shape of the feature extracted from each patch image can be selected, and saved the result as `extract_features_fp_re.py`.

##### (1) Parameter Description:

- **`--csv_path`**: Path of the process list file, mainly utilizing the `slide_id` column, and other columns can be omitted.
- **`--data_h5_dir`**: Saving directory of the patch segmentation results.
- **`--data_slide_dir`**: Saving directory of the WSI files.
- **`--feat_dir`**: Saving directory of the extracted features.
- **`--batch_size`**: Batch size (e.g., `32`, `64`, etc.).
- **`--target_patch_size`**: Size of the input image to the encoder. There is a built-in downsampling function.
- **`--slide_ext`**: Suffix for WSI files.
- **`--out_shape`**: Shape of the extracted feature.

##### (2) Usage Example:

The command line for feature extraction can be structured as follows:

```bash
python extract_features_fp_re.py --csv_path /path/to/exp/WSI_bm40/csv/pl_mag20x_patch512_1024.csv --data_h5_dir /path/to/exp/WSI_bm40/segmented_patch/mag20x_patch512_1024 --data_slide_dir /path/to/exp/WSI_bm40/WSI_bm40 --feat_dir /path/to/exp/WSI_bm40/extracted_feature/mag20x_patch512_1024 --batch_size 256 --target_patch_size 1024 --slide_ext .svs --out_shape 512
```

**Note**: Please replace the paths and values with those relevant to your setup.

### 1.4 Graph Generation:
Based on the features extracted above at 20x magnification, use the `generate_graph()` function to construct the input graph data. Each graph includes five parts: features, coord_idx, adj_matrix, typing_label, and stage_label.

##### (1) Parameter Description:
- **`--target_mag`**: Target magnification.
- **`--bm_path`**: Path of the base magnification information file.
- **`--feature_dir`**: Saving directory of the extracted features.
- **`--base_patch_size`**: Patch size at the target magnification.
- **`--save_path`**: Path of the generated graphs.
- **`--typing_label_path`**: Path of the label file for the typing task.
- **`--stage_label_path`**: Path of the label file for the stage task.

##### (2) Usage Example:

The code for graph generation can be structured as follows:

```python
generate_graph(
     target_mag = 20,
     bm_path = "/path/to/exp/WSI_bm40/csv/bm.csv",
     feature_dir = "/path/to/exp/WSI_bm40/extracted_feature/",
     base_patch_size = 512,
     save_path = "/path/to/exp/WSI_bm40/mulgt_graph/",
     typing_label_path = "/path/to/exp/WSI_bm40/csv/typing_label.csv",
     stage_label_path = "/path/to/exp/WSI_bm40/csv/stage_label.csv"
)
```

**Note**: Please replace the paths and values with those relevant to your setup.
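`generate_graph` reads the `slide_id` and `label` columns from each label file and converts the label names to integer class indices. The sketch below shows one way to do that conversion with the standard library. The function name `load_label_map` and the label values are hypothetical, and sorting the class names is a choice made here so that the indices are the same on every run.

```python
import csv
import io


def load_label_map(csv_text):
    """Return ({slide_id: class_index}, {class_name: class_index})."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    # Sort the class names so the index assignment is deterministic
    classes = sorted({row["label"] for row in rows})
    class_to_idx = {name: i for i, name in enumerate(classes)}
    slide_to_idx = {row["slide_id"]: class_to_idx[row["label"]] for row in rows}
    return slide_to_idx, class_to_idx


# Hypothetical label file content with the three expected columns
example = (
    "case_id,slide_id,label\n"
    "case_1,slide_1,subtype_a\n"
    "case_2,slide_2,subtype_b\n"
    "case_3,slide_3,subtype_a\n"
)
slide_to_idx, class_to_idx = load_label_map(example)
print(class_to_idx)  # {'subtype_a': 0, 'subtype_b': 1}
print(slide_to_idx)  # {'slide_1': 0, 'slide_2': 1, 'slide_3': 0}
```

The `slide_id` values must match the WSI file names without the `.svs` suffix, because that is the key `generate_graph` uses to look up each label.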

```python
# For each WSI, one .pt file is saved containing:
# features: node features, shape [num, feature_len]
# coord_idx: patch coordinates, in the same order as the node features
# adj_matrix: adjacency matrix, shape [num, num]
# stage_label / typing_label: integer class indices

import os

import h5py
import numpy as np
import pandas as pd
import torch


def generate_graph(
        target_mag, bm_path, feature_dir, base_patch_size, save_path, typing_label_path, stage_label_path
    ):
    bm = pd.read_csv(bm_path)

    def load_labels(label_path):
        # Map each slide_id to an integer class index. The class names are
        # sorted so that the mapping is identical on every run.
        labels = pd.read_csv(label_path)
        class_to_idx = {c: i for i, c in enumerate(sorted(set(labels["label"])))}
        return {s: class_to_idx[l] for s, l in zip(labels["slide_id"], labels["label"])}

    stage_labels_dict = load_labels(stage_label_path)
    typing_labels_dict = load_labels(typing_label_path)

    def get_adj_matrix(h5_coords, target_patch_size):
        # Connect each patch to itself and to its 8 spatial neighbours
        adj_matrix = np.zeros((len(h5_coords), len(h5_coords)))
        for patch_index in range(len(h5_coords)):
            x, y = h5_coords[patch_index]
            nb_coord_list = [
                [x-target_patch_size, y-target_patch_size], [x, y-target_patch_size], [x+target_patch_size, y-target_patch_size],
                [x-target_patch_size, y], [x, y], [x+target_patch_size, y],
                [x-target_patch_size, y+target_patch_size], [x, y+target_patch_size], [x+target_patch_size, y+target_patch_size]
            ]
            for x_, y_ in nb_coord_list:
                matches = np.where((h5_coords == (x_, y_)).all(axis=1))[0]
                if len(matches) > 0:
                    to_patch_index = matches[0]
                    adj_matrix[patch_index, to_patch_index] = 1
                    adj_matrix[to_patch_index, patch_index] = 1
        return adj_matrix

    # Create the output directory before saving any graph
    out_dir = os.path.join(save_path, "pt_files")
    os.makedirs(out_dir, exist_ok=True)

    for slide_path, base_mag in zip(bm["slide_path"], bm["base_mag"]):
        wsi_name = os.path.basename(slide_path).replace(".svs", "")
        target_patch_size = int(base_patch_size * (int(base_mag) / target_mag))

        # Features are read from the directory written by the feature extraction step
        h5_path = f"{feature_dir}/mag{target_mag}x_patch{base_patch_size}_{target_patch_size}/h5_files/{wsi_name}.h5"
        with h5py.File(h5_path, "r") as h5_content:
            h5_features = h5_content["/features"][:]
            h5_coords = h5_content["/coords"][:]

        output = {
            "features": torch.tensor(h5_features),
            "adj_matrix": torch.tensor(get_adj_matrix(h5_coords, target_patch_size), dtype=torch.float64),
            "coord_idx": torch.tensor(h5_coords),
            "stage_label": stage_labels_dict[wsi_name],
            "typing_label": typing_labels_dict[wsi_name]
        }
        torch.save(output, os.path.join(out_dir, wsi_name + ".pt"))
```
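The neighbour search in `get_adj_matrix` scans the whole coordinate array for every candidate neighbour, which becomes slow for slides with many patches. As a possible alternative, the sketch below builds the same kind of 8-neighbour adjacency (including the self-connection) with a dictionary lookup. It assumes that patch coordinates are unique and returns a plain list of lists; the function name `grid_adjacency` is hypothetical.

```python
def grid_adjacency(coords, patch_size):
    """Adjacency matrix linking each patch to itself and to its 8 grid neighbours."""
    # Map every coordinate to its node index for constant-time lookup
    index_of = {(int(x), int(y)): i for i, (x, y) in enumerate(coords)}
    n = len(coords)
    adj = [[0] * n for _ in range(n)]
    for (x, y), i in index_of.items():
        for dx in (-patch_size, 0, patch_size):
            for dy in (-patch_size, 0, patch_size):
                j = index_of.get((x + dx, y + dy))
                if j is not None:
                    adj[i][j] = 1
                    adj[j][i] = 1
    return adj


# Three patches in a row plus one patch below the middle one
coords = [(0, 0), (1024, 0), (2048, 0), (1024, 1024)]
for row in grid_adjacency(coords, 1024):
    print(row)
# [1, 1, 0, 1]
# [1, 1, 1, 1]
# [0, 1, 1, 1]
# [1, 1, 1, 1]
```

The result can be wrapped with `torch.tensor(adj, dtype=torch.float64)` to match the `adj_matrix` entry stored in the graph file.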

### 1.5 Dataset Split

The dataset split is based on CLAM's `create_splits_seq.py`. We added the label file as an input and generalized the task to any number of classes, and saved the result as `create_splits_seq_re.py`. The `label.csv` needs to include three columns: case_id, slide_id, and label.

Because the typing and stage labels have already been written into the graph data, the main purpose of this step is to determine how the data are divided into training, validation, and test sets. The division does not depend on which label file is supplied, so either the typing or the stage label file can be used.

##### (1) Parameter Description:
- **`--label_path`**: Path of the label file for typing/stage classification task.
- **`--seed`**: Random seed.
- **`--task`**: Task name.
- **`--k`**: Number of splits.
- **`--label_frac`**: Fraction of labels.
- **`--val_frac`**: Fraction of labels for validation.
- **`--test_frac`**: Fraction of labels for test.
- **`--save_dir`**: Saving directory for the split result.

##### (2) Usage Examples:
The command line for dataset split can be structured as follows:

```bash
python create_splits_seq_re.py --label_path /path/to/exp/WSI_bm40/csv/typing_label.csv --seed 1 --task WSI_bm40 --label_frac 1.0 --k 10 --val_frac 0.2 --test_frac 0.2 --save_dir /path/to/exp/WSI_bm40/
```

**Note**: Please replace the paths and values with those relevant to your setup.
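As a rough illustration of what `--k`, `--val_frac`, `--test_frac`, and `--seed` control, the sketch below draws `k` random train/validation/test partitions at the case level. It is a simplified stand-in, not the logic of `create_splits_seq_re.py`: in particular it does not stratify by class. The function name `make_splits` and the case IDs are hypothetical.

```python
import random


def make_splits(case_ids, k, val_frac, test_frac, seed):
    """Draw k random case-level partitions into train / val / test."""
    rng = random.Random(seed)  # fixed seed gives reproducible splits
    cases = sorted(set(case_ids))
    n_val = int(round(len(cases) * val_frac))
    n_test = int(round(len(cases) * test_frac))
    splits = []
    for _ in range(k):
        shuffled = cases[:]
        rng.shuffle(shuffled)
        splits.append({
            "val": shuffled[:n_val],
            "test": shuffled[n_val:n_val + n_test],
            "train": shuffled[n_val + n_test:],
        })
    return splits


cases = [f"case_{i}" for i in range(10)]
splits = make_splits(cases, k=10, val_frac=0.2, test_frac=0.2, seed=1)
first = splits[0]
print(len(splits), len(first["train"]), len(first["val"]), len(first["test"]))  # 10 6 2 2
```

Splitting on `case_id` rather than `slide_id` keeps all slides from one case in the same subset.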

## 2. GPU Training:

Some minor additions and changes were made to CLAM's `main.py` and `utils/core_utils.py` to enable training of the **MulGT** model, and the results were saved as `main_re.py` and `utils/core_utils_re.py`.

##### (1) Parameter Description:
- **`--max_epochs`**: Number of training epochs.
- **`--label_path`**: Path of the label file for the typing/stage classification task.
- **`--drop_out`**: Enable dropout (p=0.25).
- **`--early_stopping`**: Enable early stopping.
- **`--lr`**: Learning rate.
- **`--k`**: Number of folds.
- **`--label_frac`**: Fraction of training labels.
- **`--exp_code`**: Experiment code for saving results.
- **`--weighted_sample`**: Enable weighted sampling.
- **`--bag_loss`**: Loss function for slide-level classification.
- **`--task`**: Task name.
- **`--split_dir`**: Path of split result to use.
- **`--model_type`**: Model type.
- **`--log_data`**: Enable log data using tensorboard.
- **`--data_root_dir`**: Saving directory of input data.
- **`--results_dir`**: Saving directory of the results.

MulGT Specific Parameter:
- **`--typing_n_classes`**: Number of categories for subtype classification tasks
- **`--stage_n_classes`**: Number of categories for stage classification tasks
- **`--mulgt_task_type`**: Classification task type: [multi, subtype, stage]
- **`--grad_norm`**: Enable to normalize the gradient
- **`--mulgt_pool_method`**: Pooling method: [dense_mincut_pool, dense_diff_pool]

##### (2) Usage Examples:

The command line for GPU training can be structured as follows:

```bash
CUDA_VISIBLE_DEVICES=0 python main_re.py --max_epochs 100 --label_path /path/to/exp/WSI_bm40/csv/typing_label.csv --drop_out --early_stopping --lr 2e-4 --k 10 --label_frac 1.0 --exp_code WSI_bm40 --weighted_sample --bag_loss ce --task WSI_bm40 --split_dir /path/to/exp/WSI_bm40/splits/WSI_bm40_{label_frac*100} --model_type MulGT --log_data --data_root_dir /path/to/exp/WSI_bm40/mulgt_graph --results_dir /path/to/exp/WSI_bm40/mulgt_results --typing_n_classes 2 --stage_n_classes 2 --mulgt_task_type multi --grad_norm --mulgt_pool_method dense_diff_pool
```

**Note**: Please replace the paths and values with those relevant to your setup.
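The `{label_frac*100}` part of `--split_dir` is a placeholder that has to be replaced by an actual number. The sketch below assumes the split folder follows CLAM's usual naming, the task name followed by the integer percentage of `label_frac`; please check the folder actually written by the dataset split step, since the modified script may differ. The function name `split_dir_name` is hypothetical.

```python
def split_dir_name(task, label_frac):
    """Folder name in the assumed '<task>_<percentage>' form."""
    return f"{task}_{int(label_frac * 100)}"


splits_root = "/path/to/exp/WSI_bm40/splits"
print(f"{splits_root}/{split_dir_name('WSI_bm40', 1.0)}")
# /path/to/exp/WSI_bm40/splits/WSI_bm40_100
```

Under this assumption, the example command with `--label_frac 1.0` would use `--split_dir /path/to/exp/WSI_bm40/splits/WSI_bm40_100`.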

## 3. Testing and Evaluation Script:


Some minor additions and changes were made to CLAM's `eval.py` to enable testing and evaluation of the **MulGT** model, and the result was saved as `eval_re.py`.

##### (1) Parameter Description:
- **`--drop_out`**: Enable dropout (p=0.25).
- **`--k`**: Number of folds.
- **`--models_exp_code`**: Experiment code of the trained models to load.
- **`--save_exp_code`**: Experiment code for saving results.
- **`--task`**: Task name.
- **`--model_type`**: Model type used for training.
- **`--results_dir`**: Results directory.
- **`--split_dir`**: Manually specify the set of splits to use.
- **`--data_root_dir`**: Data directory.

##### (2) Usage Examples:

The command line for testing and evaluation can be structured as follows:

```bash
CUDA_VISIBLE_DEVICES=0 python eval_re.py --drop_out --k 10 --models_exp_code WSI_bm40 --save_exp_code WSI_bm40 --task WSI_bm40 --model_type MulGT --results_dir /path/to/exp/WSI_bm40/mulgt_results --split_dir /path/to/exp/WSI_bm40/splits/WSI_bm40_{label_frac*100} --data_root_dir /path/to/exp/WSI_bm40/mulgt_graph
```

**Note**: Please replace the paths and values with those relevant to your setup.

## 4. Folder Structure Overview:

The final folder structure and contents will look like this:

```
/path/to/exp/
    ├── WSI_bm40
        ├── WSI_bm40
            ├── slide_1.svs
            ├── slide_2.svs
            └── ...
        ├── csv
            ├── bm.csv
            ├── typing/stage_label.csv
            └── pl_mag20x_patch512_1024.csv
        ├── segmented_patch
            ├── mag20x_patch512_1024
                ├── masks
                    ├── slide_1_masks.jpg
                    ├── slide_2_masks.jpg
                    └── ...
                ├── patches
                    ├── slide_1_patches.h5
                    ├── slide_2_patches.h5
                    └── ...
                ├── stitches
                    ├── slide_1_stitches.jpg
                    ├── slide_2_stitches.jpg
                    └── ...
                └── process_list_autogen.csv
        ├── extracted_feature
            ├── mag20x_patch512_1024
                ├── h5_files
                    ├── slide_1_feats.h5
                    ├── slide_2_feats.h5
                    └── ...
                ├── pt_files
                    ├── slide_1_feats.pt
                    ├── slide_2_feats.pt
                    └── ...      
        ├── mulgt_graph
            ├── mag20x_patch512_1024
                ├── pt_files
                    ├── slide_1_graph.pt
                    ├── slide_2_graph.pt
                    └── ...
        ├── splits
            ├── WSI_bm40_{label_frac*100}
                ├── split_0_bool.csv
                ├── split_0_descriptor.csv
                ├── split_0.csv
                └── ... 
        ├── mulgt_results
            ├── split_0.csv
            ├── s_0_checkpoint.pt
            ├── split_0_results.pkl
            └── ...
```

**Note**: Please replace the paths and values with those relevant to your setup.
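Before training, it can help to confirm that every preprocessing step has produced its output folder. The sketch below is a small convenience check, not part of the repository. The function name `missing_subdirs` is hypothetical, and the subdirectory names are taken from the commands in this tutorial, so adjust them if your layout differs.

```python
import tempfile
from pathlib import Path

# Top-level folders expected under the dataset root (e.g. /path/to/exp/WSI_bm40)
EXPECTED_SUBDIRS = [
    "WSI_bm40",
    "csv",
    "segmented_patch",
    "extracted_feature",
    "mulgt_graph",
    "splits",
    "mulgt_results",
]


def missing_subdirs(dataset_root):
    """Return the expected subdirectories that do not exist yet."""
    root = Path(dataset_root)
    return [name for name in EXPECTED_SUBDIRS if not (root / name).is_dir()]


# Demonstration on a temporary directory that only contains `csv`
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "csv").mkdir()
    print(missing_subdirs(tmp))
# ['WSI_bm40', 'segmented_patch', 'extracted_feature', 'mulgt_graph', 'splits', 'mulgt_results']
```

An empty list means that all expected folders are present; `mulgt_results` is only expected after training has been run.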