# Tutorial
This notebook walks through every stage in detail. Use it to reproduce the results or to apply the pipeline to other data. It requires a few third-party libraries, which are checked for below.

In [None]:
import os
from pathlib import Path

# Get the current notebook's directory
current_dir = Path().resolve()

# Move to the parent directory
parent_dir = current_dir.parent
os.chdir(parent_dir)

# Verify the change
print("Current working directory:", os.getcwd())

In [None]:
import importlib

def check_libraries():
    libraries = ['pandas', 'numpy', 'torch', 'sklearn', 'torch_geometric', 'tqdm']
    for lib in libraries:
        try:
            importlib.import_module(lib)
            print(f"{lib} is installed.")
        except ImportError:
            print(f"{lib} is NOT installed.")

# Run the function to check libraries
check_libraries()

## Stage 1: Partition each slide/sample into tiles using three partition methods (square, vertical, horizontal)
---
- **File**: `Algo/stage1_WSI_partitioning.py`
    - *data_root*: Root directory containing sample data files.
    - *output_folder*: Output folder to save the partitioned data.
    - *min_size*: Minimum number of cells allowed in a tile. 
    - *max_size*: Maximum number of cells allowed in a tile. 
---
- **Pre-condition**:
    - The `Processed_data` folder (specified as *data_root*) contains:
        - `sample_name_list.txt`: Contains the names of the slides/samples.
        - `xxx_cell_type.txt`: Contains the cell types of the cells, with one file per sample. The prefix `xxx` is the name of the sample.
        - `xxx_coordinate.txt`: Contains the coordinates of the cells, with one file per sample. The prefix `xxx` is the name of the sample.
---
- **Output**:
  - A `Folder` named after each sample will be created in the *output folder* (one for each sample), which contains three folders: `Horizontal`, `Vertical`, and `Square`.
  - In each of the partition folders, the following files are created:
    - `cell_type_encoding.txt` which contains the mapping from the unique cell type to an integer.
    - `tile_name_list.txt` which contains the name of the tiles created.
    - `tile_size_info.txt` which contains the size, i.e., number of cells in tiles.
    - `xxx_cell_type.txt` which contains the cell types of the cells. One for each tile and the prefix is the name of the tile.
    - `xxx_coordinate.txt` which contains the coordinates of the cells. One for each tile and the prefix is the name of the tile.
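
Before running the command below, the pre-condition can be checked with a short helper. This is a minimal sketch, not part of the pipeline: the function name `find_missing_inputs` is hypothetical, and it assumes that `sample_name_list.txt` holds one sample name per line.

```python
from pathlib import Path


def find_missing_inputs(data_root):
    """Return the input files Stage 1 expects but that are absent from data_root."""
    root = Path(data_root)
    name_list = root / "sample_name_list.txt"
    if not name_list.is_file():
        return [name_list]

    # Assumption: one sample name per line; blank lines are ignored.
    lines = name_list.read_text().splitlines()
    samples = [line.strip() for line in lines if line.strip()]

    missing = []
    for sample in samples:
        # Each sample needs a cell-type file and a coordinate file.
        for suffix in ("_cell_type.txt", "_coordinate.txt"):
            candidate = root / f"{sample}{suffix}"
            if not candidate.is_file():
                missing.append(candidate)
    return missing
```

Calling `find_missing_inputs("./Processed_data/")` returns an empty list when every expected file is present.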

In [None]:
!python Algo/stage1_WSI_partitioning.py \
    --data_root "./Processed_data/" \
    --min_size 1000 \
    --max_size 10000 \
    --output_folder "./Results/Intermediate_ouputs/stage1_output/"

## Stage 2: Construct cellular spatial graphs for every tile
---
- **File**: `Algo/stage2_sub_graph_construction.py`
    - *data_source*: Root directory containing sample data files.
    - *input_folder*: Input folder containing partitioned data from the previous stage.
    - *knn_k*: Number of neighbors to consider in KNN graph construction.
    - *output_folder*: Output folder to save the generated topology structures and node attribute matrices.
---
- **Pre-condition**:
    - The `Processed_data` folder (specified as *data_source*) contains:
        - `sample_name_list.txt`: Contains the names of the slides/samples.
        - `xxx_cell_type.txt`: Contains the cell types of the cells, with one file per sample. The prefix `xxx` is the name of the sample.
        - `xxx_coordinate.txt`: Contains the coordinates of the cells, with one file per sample. The prefix `xxx` is the name of the sample.
    - The `stage1_output` folder (specified as *input_folder*) contains:
        - Partitioned data organized into folders named after each sample, each containing subfolders for `Horizontal`, `Vertical`, and `Square` partitions.
        - Within each partition folder, the following files:
            - `tile_name_list.txt`: Contains the names of the tiles created in that partition.
            - `xxx_coordinate.txt`: Contains the coordinates of the cells in each tile.
            - `xxx_cell_type.txt`: Contains the cell types of the cells in each tile.

---
- **Output**:
  - A `Folder` named after the chosen value of *knn_k* will be created in the *output folder*. It contains one `Folder` per sample, named after the sample.
  - Partition folders (`Horizontal`, `Vertical`, and `Square`) will be created inside every sample folder.
  - A `Folder` named after each tile will be created inside each partition folder (one for each tile in each partition).
  - In each of these tile folders, the following files are created:
    - `edge_index.txt`: Contains the edge indices of the KNN graph, representing connections between cells.
    - `node_attribute.txt`: Contains the one-hot encoded node attribute matrix for each cell type.
    - `skipped_points.txt` (if applicable): Contains the number of points skipped in the last tile due to insufficient cells for KNN construction.
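
To make the two main outputs concrete, here is a small illustration of a KNN edge list and a one-hot node attribute matrix. It is a sketch in plain Python under stated assumptions, not the code in `Algo/stage2_sub_graph_construction.py`: the helper names are hypothetical, edges are directed from each cell to its neighbors, self-loops are excluded, and the search is brute force. The script may differ on any of these points.

```python
import math


def knn_edge_index(coords, k):
    """Return [sources, targets]: each cell points to its k nearest other cells."""
    if len(coords) <= k:
        # Assumption: a tile with too few cells cannot form a KNN graph.
        raise ValueError("need more than k cells to build a k-nearest-neighbor graph")
    sources, targets = [], []
    for i, point in enumerate(coords):
        # Brute-force search: rank every other cell by Euclidean distance to cell i.
        others = [j for j in range(len(coords)) if j != i]
        others.sort(key=lambda j: math.dist(point, coords[j]))
        for j in others[:k]:
            sources.append(i)
            targets.append(j)
    return [sources, targets]


def one_hot(cell_types, encoding):
    """Turn cell-type labels into one-hot rows using a label -> integer mapping."""
    rows = []
    for label in cell_types:
        row = [0] * len(encoding)
        row[encoding[label]] = 1
        rows.append(row)
    return rows


# Toy tile with four cells and three (made-up) cell types.
coords = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (5.0, 5.0)]
edge_index = knn_edge_index(coords, k=2)
node_attribute = one_hot(["T", "B", "T", "Tumor"], {"B": 0, "T": 1, "Tumor": 2})
```

Here `edge_index` has two rows of equal length (source cells and target cells), and `node_attribute` has one row per cell and one column per cell type.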


In [None]:
!python Algo/stage2_sub_graph_construction.py \
    --data_source "./Processed_data/" \
    --input_folder "./Results/Intermediate_ouputs/stage1_output/" \
    --knn_k 69 \
    --output_folder "./Results/Intermediate_ouputs/stage2_output/"

## Stage 3: Train the TCN model on all subgraphs
---
- **File**: `Algo/stage3_collective_unsupervised_training.py`
    - *stage2_output_folder*: Output folder from the previous stage containing the constructed cellular spatial graphs.
    - *stage1_output_folder*: Output folder from Stage 1 containing the partitioned data.
    - *output_folder*: Output folder to save the trained models and results.
    - *image_name*: Name of the image/slide to be processed.
    - *num_tcn*: Maximum number of TCNs to discover.
    - *num_epoch*: Number of epochs for training.
    - *embedding_dimension*: Dimension of the embedding layer in the model.
    - *learning_rate*: Learning rate for the optimizer.
    - *improvement_threshold*: Minimum improvement needed in six epochs to continue training.
    - *loss_cutoff*: Empirical cutoff of the final loss to avoid underfitting.
    - *load_weights*: Whether to load previous weights if available, otherwise start training from scratch.
    - *knn_k*: Number of neighbors considered in the KNN graph used during training.
    - *save_epochs*: List of specific epochs where the model should be saved and not deleted.
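
One plausible reading of *improvement_threshold* and *loss_cutoff* is sketched below. This is an interpretation of the parameter descriptions above, not the logic of the training script: the function names are hypothetical, and the six-epoch window is taken from the description of *improvement_threshold*.

```python
def should_stop(losses, improvement_threshold, window=6):
    """True when the loss fell by less than the threshold over the last `window` epochs."""
    if len(losses) <= window:
        return False  # not enough history yet
    # Improvement is how far the loss dropped compared with `window` epochs ago.
    improvement = losses[-1 - window] - losses[-1]
    return improvement < improvement_threshold


def is_underfit(final_loss, loss_cutoff):
    """True when training ended with a loss still above the empirical cutoff."""
    return final_loss > loss_cutoff


# Values mirror the command-line arguments used later in this section.
plateau = [-0.70] * 7
stop_now = should_stop(plateau, improvement_threshold=0.001)
underfit = is_underfit(plateau[-1], loss_cutoff=-0.6)
```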
---
- **Pre-condition**:
    - The `stage2_output` folder (specified as *stage2_output_folder*) contains:
        - Folders named after each sample, each containing subfolders for `Horizontal`, `Vertical`, and `Square` partitions.
        - Inside each tile folder within a partition folder:
            - `edge_index.txt`: Contains the edge indices of the KNN graph.
            - `node_attribute.txt`: Contains the one-hot encoded node attribute matrix for each cell type.
            - `skipped_points.txt` (if applicable): Contains the number of points skipped in the last tile due to insufficient cells for KNN construction.
    - The `stage1_output` folder (specified as *stage1_output_folder*) contains:
        - Partitioned data organized into folders named after each sample, each containing subfolders for `Horizontal`, `Vertical`, and `Square` partitions.
        - Inside each partition folder:
            - `tile_name_list.txt`: Contains the names of the tiles created in that partition.
            - `xxx_coordinate.txt`: Contains the coordinates of the cells in each tile.
            - `xxx_cell_type.txt`: Contains the cell types of the cells in each tile.
---
- **Output**:
  - A `Folder` named after the selected *knn_k* and *TCN configuration* will be created in the *output folder*. This will contain:
    - A `Folder` named after each sample (one for each sample).
    - Within each sample folder:
        - `unsupervised_epochs.csv`: Logs the unsupervised loss for each epoch.
        - `TCN_model_epoch<epoch>.pt`: Saved model checkpoints at specified epochs or every 20 epochs.
        - Model checkpoints for epochs specified in `save_epochs` are saved and not deleted during the cleanup process.
        - Additional checkpoints may be saved during training and managed by deleting the oldest one when more than three models are saved.
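
The checkpoint cleanup rule described above can be sketched as follows. This is a hedged illustration, not the script's code: the function name `epochs_to_delete` and the `max_rolling` parameter are hypothetical, and the limit of three comes from the description above.

```python
def epochs_to_delete(saved_epochs, protected_epochs, max_rolling=3):
    """Return the unprotected checkpoint epochs to remove, oldest first."""
    protected = set(protected_epochs)
    # Only checkpoints outside save_epochs take part in the rolling cleanup.
    rolling = sorted(epoch for epoch in saved_epochs if epoch not in protected)
    excess = len(rolling) - max_rolling
    return rolling[:excess] if excess > 0 else []


# Epoch 100 is protected via save_epochs, so only the oldest rolling ones go.
to_delete = epochs_to_delete([20, 40, 60, 80, 100, 120], protected_epochs=[100])
filenames = [f"TCN_model_epoch{epoch}.pt" for epoch in to_delete]
```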


In [None]:
! python Algo/stage3_collective_unsupervised_training.py \
    --stage2_output_folder "./Results/Intermediate_ouputs/stage2_output/" \
    --stage1_output_folder "./Results/Intermediate_ouputs/stage1_output/" \
    --output_folder "./Results/Intermediate_ouputs/stage3_output/" \
    --image_name "combined_all_PCF" \
    --num_tcn 10 \
    --num_epoch 1000 \
    --embedding_dimension 128 \
    --learning_rate 0.003 \
    --improvement_threshold 0.001 \
    --loss_cutoff -0.6 \
    --knn_k 69 \
    --save_epochs 100 200 300 400 500 600 700 800 900

## Stage 4-1: Predict TCNs with the trained model for each partition method
---
- **File**: `Algo/stage4-1_TCN_assignment_by_partition.py`
    - *stage1_output_folder*: Output folder from Stage 1 containing partitioned data.
    - *stage2_output_folder*: Output folder from Stage 2 containing the constructed cellular spatial graphs.
    - *stage3_output_folder*: Output folder from Stage 3 containing trained TCN models.
    - *output_folder*: Output folder to save the results of this processing stage.
    - *load_config*: Configuration of the model to load, which encodes the KNN setting, the TCN setting, and the learning rate (e.g., `k-69_TCN-20_lr-0.003/`).
    - *image_name*: Name of the image dataset to be processed.
    - *num_tcn*: Number of clusters in the TCN.
    - *embedding_dimension*: Dimension of the embedding layer in the model.
    - *knn_k*: Number of neighbors considered in the KNN graph used during training.
    - *load_epoch*: Specific epoch to load the model from. If not provided, the latest checkpoint is used.
---
- **Pre-condition**:
    - The `stage1_output` folder (specified as *stage1_output_folder*) contains:
        - Partitioned data organized into folders named after each sample, each containing subfolders for `Horizontal`, `Vertical`, and `Square` partitions.
        - Inside each partition folder:
            - `tile_name_list.txt`: Contains the names of the tiles created in that partition.
            - `xxx_coordinate.txt`: Contains the coordinates of the cells in each tile.
            - `xxx_cell_type.txt`: Contains the cell types of the cells in each tile.
    - The `stage2_output` folder (specified as *stage2_output_folder*) contains:
        - Folders named after each sample, each containing subfolders for `Horizontal`, `Vertical`, and `Square` partitions.
        - Inside each tile folder within a partition folder:
            - `edge_index.txt`: Contains the edge indices of the KNN graph.
            - `node_attribute.txt`: Contains the one-hot encoded node attribute matrix for each cell type.
    - The `stage3_output` folder (specified as *stage3_output_folder*) contains:
        - Trained TCN models organized into folders named after the configuration (e.g., `k-69_TCN-10_lr-0.003/`) and the image name.
        - Each configuration folder contains:
            - Model checkpoints saved at different epochs (`TCN_model_epoch<epoch>.pt`).
---
- **Output**:
  - A `Folder` named `checkpoint_epoch_<epoch>` (after the selected epoch) will be created in the *output folder*. This will contain:
    - Processed results for each partition (`Square`, `Vertical`, `Horizontal`):
        - `Square_TCN_assign_matrix.csv`, `Vertical_TCN_assign_matrix.csv`, `Horizontal_TCN_assign_matrix.csv`: Cluster assignment matrices for each partition.
        - `Square_TCN_adjacent_matrix.csv`, `Vertical_TCN_adjacent_matrix.csv`, `Horizontal_TCN_adjacent_matrix.csv`: Cluster adjacency matrices for each partition.
        - `Square_node_mask.csv`, `Vertical_node_mask.csv`, `Horizontal_node_mask.csv`: Node masks for each partition.
---
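
The *load_epoch* behavior (use the given epoch, otherwise fall back to the latest checkpoint) can be illustrated with a small lookup helper. This is a sketch under assumptions: `resolve_checkpoint` is a hypothetical name, "latest" is taken to mean the highest epoch number, and file names follow the `TCN_model_epoch<epoch>.pt` pattern listed above.

```python
import re
from pathlib import Path

CHECKPOINT_PATTERN = re.compile(r"TCN_model_epoch(\d+)\.pt$")


def resolve_checkpoint(model_dir, load_epoch=None):
    """Return the checkpoint for load_epoch, or the highest-epoch one if it is None."""
    model_dir = Path(model_dir)
    if load_epoch is not None:
        path = model_dir / f"TCN_model_epoch{load_epoch}.pt"
        if not path.is_file():
            raise FileNotFoundError(path)
        return path

    # Compare epochs numerically so that epoch 1000 ranks above epoch 900.
    found = {}
    for path in model_dir.glob("TCN_model_epoch*.pt"):
        match = CHECKPOINT_PATTERN.match(path.name)
        if match:
            found[int(match.group(1))] = path
    if not found:
        raise FileNotFoundError(f"no checkpoints in {model_dir}")
    return found[max(found)]
```

The directory passed as `model_dir` would be the folder that holds the `.pt` files for the chosen configuration and image name.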


In [None]:
! python Algo/stage4-1_TCN_assignment_by_partition.py \
    --stage1_output_folder "./Results/Intermediate_ouputs/stage1_output/" \
    --stage2_output_folder "./Results/Intermediate_ouputs/stage2_output/" \
    --stage3_output_folder "./Results/Intermediate_ouputs/stage3_output/" \
    --output_folder "./Results/Intermediate_ouputs/stage4-1_output/" \
    --load_config "k-69_TCN-20_lr-0.003/" \
    --image_name "combined_all_PCF" \
    --num_tcn 20 \
    --embedding_dimension 128 \
    --knn_k 69 \
    --load_epoch 480

## Stage 4-2: Process and Generate Final Result Table
---
- **File**: `Algo/stage4-2_consensus_based_TCN_assignment.py`
    - *image_name*: Name of the image dataset to be processed.
    - *stage1_root*: Root directory for Stage 1 outputs, containing initial processed data such as coordinates and cell types.
    - *last_stage_root*: Root directory for the previous stage (Stage 4-1) outputs, containing the clustering results and TCN assignments.
    - *this_stage_root*: Root directory for this stage's outputs, where the final result table will be saved.
    - *load_config*: Configuration of the model to load (e.g., KNN and TCN settings).
    - *epoch_to_use*: Epoch number of the model to be used for generating the results.
---
- **Pre-condition**:
    - The `stage1_output` folder (specified as *stage1_root*) contains:
        - Partitioned data organized into folders named after each sample, each containing subfolders for `Horizontal`, `Vertical`, and `Square` partitions.
        - Inside each partition folder:
            - `tile_name_list.txt`: Contains the names of the tiles created in that partition.
            - `xxx_coordinate.txt`: Contains the coordinates of the cells in each tile.
            - `xxx_cell_type.txt`: Contains the cell types of the cells in each tile.
    - The `stage4-1_output` folder (specified as *last_stage_root*) contains:
        - Folders named after each sample and a configuration-specific subfolder (e.g., `k-69_TCN-20_lr-0.003/checkpoint_epoch_<epoch>`).
        - Inside each sample's subfolder:
            - `partition_type_TCN_assign_matrix.csv` (e.g., `Square_TCN_assign_matrix.csv`): The soft clustering assignments for each partition type.
            - `partition_type_node_mask.csv` (e.g., `Square_node_mask.csv`): The node mask indicating valid cells for clustering.
---
- **Output**:
  - A `Folder` named after the selected *configuration* and *epoch* will be created in the *this_stage_root* directory. This will contain:
    - A `Folder` named after each sample (one for each sample).
    - Inside each sample folder:
        - `Square_result_table.csv`, `Vertical_result_table.csv`, `Horizontal_result_table.csv`: Result tables for each partition type.
        - `final_result_table.csv`: The final result table after merging results from all partitions, applying the majority vote on cluster assignments, and resolving cell type encoding.
---
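
The majority vote behind `final_result_table.csv` can be illustrated on a single cell. This is a minimal sketch rather than the script's implementation: the helper names are hypothetical, the soft assignment rows are made-up numbers, and the tie-breaking rule (first label listed wins) is an assumption.

```python
from collections import Counter


def hard_label(soft_row):
    """Index of the largest soft-assignment score, i.e. the most likely TCN."""
    return max(range(len(soft_row)), key=soft_row.__getitem__)


def consensus_label(labels):
    """Majority vote over per-partition labels; ties go to the first label listed."""
    counts = Counter(labels)
    best = max(counts.values())
    for label in labels:
        if counts[label] == best:
            return label


# One cell's soft assignments under the three partitions (illustrative values).
square = [0.1, 0.7, 0.2]
vertical = [0.2, 0.5, 0.3]
horizontal = [0.6, 0.3, 0.1]
votes = [hard_label(row) for row in (square, vertical, horizontal)]
final_tcn = consensus_label(votes)
```

In this example two of the three partitions assign the cell to TCN 1, so that label wins. A cell that the node mask marks as invalid in one partition would presumably contribute no vote from that partition.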


In [17]:
! python Algo/stage4-2_consensus_based_TCN_assignment.py \
    --image_name "combined_all_PCF" \
    --stage1_root "./Results/Intermediate_ouputs/stage1_output/" \
    --last_stage_root "./Results/Intermediate_ouputs/stage4-1_output/" \
    --this_stage_root "./Results/Intermediate_ouputs/stage4-2_output/" \
    --load_config "k-69_TCN-20_lr-0.003/" \
    --epoch_to_use 1000