# **Kmeans-Dbscan Segmentation Notebook**

---
# **How to Run Notebook**
---


1. Set up the conda virtual environment if you have not already done so. Uncomment the line below to run it.

In [None]:
# !conda env create -f ../environments/environment.yml

In the code editor running this Jupyter Notebook, change the kernel to the new `TILSEG_PROJECT2024` conda environment. This will allow you to use the needed imports.

2. Update the `repository_path` variable to point to the cloned `TILSEG_PROJECT2024` GitHub folder.
This path is needed to access the example files used in the notebook.

In [1]:
import os
directory_path = os.getcwd()
repository_path = os.path.dirname(directory_path)
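
If the notebook is launched from a different working directory, the parent of `os.getcwd()` may not be the repository root. The following is a minimal sketch of a more defensive lookup, assuming the cloned repository has the `tilseg` package folder at its top level. The helper name `find_repo_root` and its `marker` argument are hypothetical and not part of the package.

```python
from pathlib import Path


def find_repo_root(start, marker="tilseg"):
    """Walk upward from `start` until a folder containing `marker` is found."""
    start = Path(start).resolve()
    for candidate in [start, *start.parents]:
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"No parent of {start} contains a '{marker}' folder")


# Usage inside the notebook (uncomment to try):
# repository_path = str(find_repo_root(Path.cwd()))
```

Raising an error when the marker folder is missing makes a wrong path visible immediately, rather than at the first failed import.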

3. Run the `Initialization Block`. This adds the repository folder to the list of locations Python searches when importing modules, so that the local `tilseg` package can be found.

In [2]:
import sys
sys.path.append(repository_path)
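
Re-running the cell above appends the same path to `sys.path` each time. That is harmless, but a small guard keeps the list clean. This is a sketch only; the helper name `add_to_sys_path` is hypothetical.

```python
import sys


def add_to_sys_path(path):
    """Append `path` to sys.path only if it is not already present.

    Returns True if the path was added, False if it was already there.
    """
    if path not in sys.path:
        sys.path.append(path)
        return True
    return False


# Usage inside the notebook (uncomment to try):
# add_to_sys_path(repository_path)
```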

4. Import the needed modules in the `Import Block`

In [3]:
# External library imports
import matplotlib.pyplot as plt
import numpy as np
from PIL import Image

#Local Library Imports
from tilseg.preprocessing import preprocess
from tilseg.seg import segment_TILs
from tilseg.model_selection import opt_kmeans
from tilseg.refine_kmeans import KMeans_superpatch_fit

5. Data download block. Installs the `gdown` module, which is used to access data from Google Drive. Uncomment to run; you only need to run this once per machine.

In [1]:
# !pip install gdown

Collecting gdown
  Downloading gdown-5.1.0-py3-none-any.whl.metadata (5.7 kB)
Collecting beautifulsoup4 (from gdown)
  Downloading beautifulsoup4-4.12.3-py3-none-any.whl.metadata (3.8 kB)
Collecting filelock (from gdown)
  Downloading filelock-3.13.1-py3-none-any.whl.metadata (2.8 kB)
Collecting soupsieve>1.2 (from beautifulsoup4->gdown)
  Downloading soupsieve-2.5-py3-none-any.whl.metadata (4.7 kB)
Downloading gdown-5.1.0-py3-none-any.whl (17 kB)
Downloading beautifulsoup4-4.12.3-py3-none-any.whl (147 kB)
Downloading filelock-3.13.1-py3-none-any.whl (11 kB)
Downloading soupsieve-2.5-py3-none-any.whl (36 kB)
Installing collected packages: soupsieve, filelock, beautifulsoup4, gdown
Successfully installed beautifulsoup4-4.12.3 filelock-3.13.1 gdown-5.1.0 soupsieve-2.5


### **The current repository contains the following file structure in the `Example` Folder:**
#### These files will be used to walk through an example of using TILSEG_PROJECT2024 in analysis.<br>
<img src= "Notebook_Images/Image_9.png" style="width: 600px;"><br>
#### Don't worry: these folders should be empty. They will be filled via the exercises in this notebook.

---
# **Core Features Overview**
---

### The TILSEG_PROJECT2024 software package is intended for use in breast cancer slide segmentation analysis, aimed at accelerating breast cancer detection. This package consists of 4 main components:

## From 2023 Capstone (OLD COMPONENTS):
### 1. <u>Preprocessing (preprocessing.py):</u>
<img src= "Notebook_Images/image_7.png" style="width: 600px;">
<img src= "Notebook_Images/image_8.png" style="width: 597px;">

#### Creates a superpatch .tif file from cropped 3000 by 4000 pixel patches of a stained breast cancer slide. The original image is divided into all possible patches, from which a select number (default: 6) are chosen to represent different sections of the grayscale values, drawn from a Gaussian distribution.

#### Sub-Components:
* #### test
<br>
<span style="background-color: rgba(255, 255, 0, 0.5); font-size: 20px;">UPDATES/BUG FIXES FROM 2024 PROJECT: </span>

* #### Changed the OS handling to read in the full file path of each .svs image, since the original code used only the filename (this led to file path exception errors)
* #### def get_superpatch_patches (def preprocess << def get_superpatch_patches) now has a random state argument so that the superpatch is made from the same patches each time a notebook is run
* #### def sort_patches (def preprocess << def main_preprocessing << def sort_patches) was updated to use a Gaussian Mixture distribution to identify the peaks associated with the pink tissue and the white background, reducing the background in the returned superpatches. The original method was poorly documented and did not accurately remove white background patches, as shown below.

    | Superpatch - Old Sort Patches Function   | Superpatch - New Sort Patches Function  |
    |--------------|--------------|
    | <img src= "Notebook_Images/super_before.png" style="width: 570px;"> | <img src= "Notebook_Images/image_8.png" style="width: 570px;">|
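
To illustrate the idea behind the updated `sort_patches`, here is a minimal sketch that fits a two-component Gaussian mixture to per-patch mean grayscale values and keeps the patches assigned to the darker (tissue) peak. It is not the package's implementation. The function name `fit_two_gaussians`, the synthetic intensities, the use of per-patch means, and the hand-written EM loop are all assumptions made for illustration.

```python
import numpy as np


def fit_two_gaussians(x, n_iter=100):
    """Fit a 1D two-component Gaussian mixture with plain EM.

    Returns (means, stds, weights, responsibilities).
    """
    x = np.asarray(x, dtype=float)
    # Start the two means at the lower and upper quartiles of the data.
    means = np.percentile(x, [25, 75])
    stds = np.full(2, x.std() + 1e-6)
    weights = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: responsibility of each component for each value.
        dens = (weights / (stds * np.sqrt(2 * np.pi))) * np.exp(
            -0.5 * ((x[:, None] - means) / stds) ** 2
        )
        resp = dens / (dens.sum(axis=1, keepdims=True) + 1e-300)
        # M-step: update weights, means, and standard deviations.
        totals = resp.sum(axis=0) + 1e-12
        weights = totals / len(x)
        means = (resp * x[:, None]).sum(axis=0) / totals
        stds = np.sqrt((resp * (x[:, None] - means) ** 2).sum(axis=0) / totals) + 1e-6
    return means, stds, weights, resp


# Synthetic mean grayscale values (assumed scale 0-1, white close to 1):
# tissue-like patches around 0.60 and background-like patches around 0.95.
rng = np.random.default_rng(13)
gray = np.concatenate([rng.normal(0.60, 0.03, 200), rng.normal(0.95, 0.02, 300)])

means, stds, weights, resp = fit_two_gaussians(gray)
tissue_component = int(np.argmin(means))           # darker peak is treated as tissue
is_tissue = resp.argmax(axis=1) == tissue_component
print(f"tissue patches kept: {is_tissue.sum()} of {len(gray)}")
```

In practice a library routine such as scikit-learn's `GaussianMixture` would likely be preferable to a hand-written loop; the loop is shown only to make the two peaks and the assignment step explicit.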



### 2. <u>Image Segmentation (seg.py >> def segment_TILs) </u>
#### Fits a clustering model (e.g. KMeans) on a superpatch and applies the model to a folder of patches to generate the following files: TILs overlaid on the original H&E patch, binary segmentation masks of each cluster, individual clusters overlaid on the original patch, an image of all the clusters, and a CSV file containing contour information for each TIL segmented from the patch. Currently accepts fitted and non-fitted 'KMeans', 'DBSCAN', 'OPTICS', and 'BIRCH' algorithms.

#### Sub-Components:
* #### def image_postprocessing
<br>
<span style="background-color: rgba(255, 255, 0, 0.5); font-size: 20px;">UPDATES/BUG FIXES FROM 2024 PROJECT: </span>

* #### def segment_TILs was updated to take a `multiple_images` flag so that a KMeans model can be fit to a single patch rather than only a superpatch; the predicted clusters on that patch can then be used in downstream scoring

* #### def immune_cluster_analyzer (def segment_TILs << def image_postprocessing << def immune_cluster_analyzer) was updated to return the `cluster mask` of the cluster with the highest TIL contour count, so that further segmentation can be done using DBSCAN (explained in the next section)

* #### def draw_til_images (def segment_TILs << def image_postprocessing << def draw_til_images) had a bug where the wrong array type was passed to `.drawContours`; this was fixed

* #### def segment_TILs was fixed to only check for .tif images in a patches folder (this avoids errors from hidden .ipynb files or other non-image files)
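
The .tif-only check described in the last bullet can be sketched as below. This is an illustration under assumptions, not the code inside `segment_TILs`; the helper name `list_tif_patches` is hypothetical.

```python
from pathlib import Path


def list_tif_patches(folder):
    """Return sorted paths of visible .tif files in `folder`.

    Hidden entries (names starting with '.') and sub-folders such as
    '.ipynb_checkpoints' are skipped.
    """
    return sorted(
        p for p in Path(folder).iterdir()
        if p.is_file()
        and not p.name.startswith(".")
        and p.suffix.lower() == ".tif"
    )


# Usage inside the notebook (uncomment to try):
# list_tif_patches(repository_path + '/Example/Image_Files/Three_Patches_Example')
```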

## From 2024 Software Project (NEW COMPONENTS):

### 3. <u>Spatial Modeling (refine_kmeans.py >> def kmean_to_spatial_model wrappers) </u>
#### Created wrappers to run def segment_TILs on a folder of patches and use the output KMeans labels of the cluster with the most contours for further clustering with DBSCAN. Similarly, a wrapper was created to run segment_TILs on a single patch, used as both the superpatch and the patch, to run DBSCAN on the patch itself and generate a ground truth DBSCAN classification of the cluster for scoring.

#### Sub-Components:
* #### mask_to_features
* #### km_dbscan_wrapper
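
As a rough illustration of the step between the KMeans cluster mask and the spatial model, the sketch below turns a binary mask into an array of pixel coordinates. The real `mask_to_features` may compute different or additional features; the name `mask_to_coordinates` and the toy mask are assumptions.

```python
import numpy as np


def mask_to_coordinates(mask):
    """Return an (n_points, 2) array of (row, col) positions where mask is nonzero."""
    return np.argwhere(np.asarray(mask) > 0)


# Toy 5 x 5 binary mask with three foreground pixels.
mask = np.zeros((5, 5), dtype=np.uint8)
mask[1, 1] = 1
mask[1, 2] = 1
mask[4, 0] = 1

coords = mask_to_coordinates(mask)
print(coords.shape)  # (3, 2)
```

An array shaped like `coords` is the kind of input a density-based clustering model expects, for example scikit-learn's `DBSCAN(eps=..., min_samples=...).fit(coords)`.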


### 4. <u>Scoring / Preprocessing Updates (functions HERE) </u>
#### Hanson and Stanley add information about what you did

### 5. <u>Bug Fixes from Original Code</u>

---
# **Example Walkthrough**
---

## A) Pre-Preprocessing Step on a Slide Image
### This section will show you how to utilize the preprocessing functions to construct a superpatch and a folder of associated slide patches.

#### 1. To begin, download a sample raw slide image (.svs) by running the block below. This file is stored in a public Google Drive, as the file size is too large to upload to GitHub.

In [4]:
!gdown 'https://drive.google.com/uc?id=1_aR-Vwd0B3suQW214zfkLudl6HK3w4q3' -O "Image_Files/TCGA-A2-A0CW-01Z-00-DX1.svs"

Downloading...
From (original): https://drive.google.com/uc?id=1_aR-Vwd0B3suQW214zfkLudl6HK3w4q3
From (redirected): https://drive.usercontent.google.com/download?id=1_aR-Vwd0B3suQW214zfkLudl6HK3w4q3&confirm=t&uuid=6bb0bda8-cbd0-4d6f-b609-2fea9a607881
To: /Users/laurenfrank/TilsegV2/Example/Image_Files/TCGA-A2-A0CW-01Z-00-DX1.svs
100%|████████████████████████████████████████| 667M/667M [00:24<00:00, 27.7MB/s]


##### A Slide Image (.svs) should have been saved to the `Image_Files` Folder. This file will be used in the next step.
<img src= "Notebook_Images/Image_4.png" style="width: 165px;">, <span style="font-size: 6em;">&rarr;</span> <img src= "Notebook_Images/image_5.png" style="width: 170px;">

#### 2. Create the Superpatch and Patch Images Using the `preprocess` Function in the `tilseg.preprocessing` Module
- #### Using the .svs image, the preprocess function will create a superpatch using 6 of the total patches. Feel free to experiment with a different number of patches (e.g. 3, 9, 12, ...) to see how this affects the superpatch.
- #### A random state of 13 was specified to make the notebook consistent between runs
- #### The file path of the .svs image will be printed along with the percentage of pixels lost during the patch-making phase
- #### NOTE: `preprocess` can be used on a folder of .svs images rather than just the one slide image (as was done in this example), but this will significantly increase the run time. When using multiple images, still only one superpatch is made, but it is made of (num_patches * num_svs_images) patches (e.g. 6 patches * 2 .svs images = 12 patches in the superpatch)
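
The patch arithmetic in the bullets above can be sketched as follows. The patch orientation (4000 wide by 3000 tall), the whole-patches-only tiling, and the slide dimensions are assumptions chosen for illustration; the function names are hypothetical.

```python
def patch_grid(slide_width, slide_height, patch_width=4000, patch_height=3000):
    """Return (columns, rows, total) of whole patches that fit inside the slide."""
    cols = slide_width // patch_width
    rows = slide_height // patch_height
    return cols, rows, cols * rows


def superpatch_patch_count(patches_per_slide, n_slides):
    """Number of patches combined into the single superpatch."""
    return patches_per_slide * n_slides


# Hypothetical slide of 100,000 x 80,000 pixels.
print(patch_grid(100_000, 80_000))   # (25, 26, 650)
print(superpatch_patch_count(6, 2))  # 12
```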

In [4]:
path = repository_path + '/Example/Image_Files'
superpatch = preprocess(path, patches=6, training=True, save_im=True,random_state = 13)

/Users/laurenfrank/TilsegV2/Example/Image_Files/TCGA-A2-A0CW-01Z-00-DX1.svs
Percent of pixels lost in pre-processing for TCGA-A2-A0CW-01Z-00-DX1.svs:                       1.7593642775049286e-06 %


| Before     | After    |
|--------------|--------------|
| <img src= "Notebook_Images/Image_4.png" style="width: 165px;">, <span style="font-size: 6em;">&rarr;</span> <img src= "Notebook_Images/image_5.png" style="width: 170px;"> | <img src= "Notebook_Images/Image_4.png" style="width: 170px;">, <span style="font-size: 6em;">&rarr;</span> <img src= "Notebook_Images/image_6.png" style="width: 500px;"> |


#### 3. Create the Three_Patches_Example and Single_Patch_Example Folders and Move Patches into Them

#### - For sake of time, only three images from the created folder "TCGA-A2-..." will be used in model construction. The 3 patches chosen had a good ratio of pink (breast tissue) to slide background (white), which will be useful in downstream analysis:
* #### position_7_8tissue.tif
* #### position_14_20tissue.tif
* #### position_6_16tissue.tif

#### - Run the below block to construct these two folders in addition to Sub Result Folders

In [13]:

!mkdir Image_Files/Three_Patches_Example
!mv Image_Files/TCGA-A2-A0CW-01Z-00-DX1/position_7_8tissue.tif Image_Files/TCGA-A2-A0CW-01Z-00-DX1/position_14_20tissue.tif Image_Files/TCGA-A2-A0CW-01Z-00-DX1/position_6_16tissue.tif Image_Files/Three_Patches_Example

!mkdir Image_Files/Single_Patch_Example
!cp Image_Files/Three_Patches_Example/position_7_8tissue.tif Image_Files/Single_Patch_Example/position_7_8tissue.tif

!mkdir Results/Image_Seg_Case_i
!mkdir Results/Image_Seg_Case_ii
!mkdir Results/Dbscan_Case_i
!mkdir Results/Dbscan_Case_ii
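
The `!mkdir` commands report an error if a folder already exists, which happens when the block is run a second time. A Python equivalent that is safe to re-run might look like the sketch below. It only creates folders and does not move or copy the patch files; the helper name `make_example_folders` is hypothetical.

```python
from pathlib import Path


def make_example_folders(example_dir):
    """Create the example image and result folders; safe to call more than once."""
    example_dir = Path(example_dir)
    folders = [
        example_dir / "Image_Files" / "Three_Patches_Example",
        example_dir / "Image_Files" / "Single_Patch_Example",
        example_dir / "Results" / "Image_Seg_Case_i",
        example_dir / "Results" / "Image_Seg_Case_ii",
        example_dir / "Results" / "Dbscan_Case_i",
        example_dir / "Results" / "Dbscan_Case_ii",
    ]
    for folder in folders:
        # parents=True builds missing parents; exist_ok=True ignores existing folders.
        folder.mkdir(parents=True, exist_ok=True)
    return folders


# Usage inside the notebook (uncomment to try):
# make_example_folders(repository_path + '/Example')
```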

Inside the `Image_Files` Folder you should see:

<img title="a title" alt="Alt text" src="Notebook_Images/image_10.png" width="180">  
<span style="font-size: 6em;">&rarr;</span>
<img title="a title" alt="Alt text" src="Notebook_Images/image_11.png" width="440"><br>
<img title="a title" alt="Alt text" src="Notebook_Images/image_13.png" width="180">
<span style="font-size: 6em;">&rarr;</span>
<img title="a title" alt="Alt text" src="Notebook_Images/image_14.png" width="180"><br>

Inside the `Results` Folder you should see:<br>

<img title="a title" alt="Alt text" src="Notebook_Images/image_21.png" width="175">
<span style="font-size: 6em;">&rarr;</span>
<img title="a title" alt="Alt text" src="Notebook_Images/image_20.png" width="610">  


## B) Image Segmentation
### This section will walk through how to use the segmentation functions to train a kmeans model on a superpatch or test the ground-truth prediction of a single patch.

### Case i: Running Segment_TILS on a Single Patch

#### 1. In Section A, a single patch was saved to the `Single_Patch_Example` Folder. This file will be used in the demonstration of the Single Patch Functions.

#### `position_7_8tissue.tif`<br>
<img src= "Notebook_Images/image_12.png" style="width: 600px;">

#### 2. Open the Patch Image and Normalize the Pixels

In [4]:
patch_path = repository_path + '/Example/Image_Files/Single_Patch_Example/position_7_8tissue.tif'
img = Image.open(patch_path)
numpy_img = np.array(img)
numpy_img_reshape = np.float32(numpy_img.reshape((-1, 3))/255.)
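
To make the effect of the reshape and scaling concrete, here is a small sketch on a synthetic array standing in for a patch. It assumes a 3-channel RGB image; an image with an alpha channel would need its extra channel dropped before `reshape((-1, 3))`.

```python
import numpy as np

# Synthetic 2 x 3 RGB image standing in for a patch (values 0-255).
fake_patch = np.array(
    [[[255, 255, 255], [128, 0, 64], [0, 0, 0]],
     [[10, 20, 30], [200, 100, 50], [255, 0, 255]]],
    dtype=np.uint8,
)

# One row per pixel, one column per color channel, scaled to the 0-1 range.
pixels = np.float32(fake_patch.reshape((-1, 3)) / 255.0)
print(pixels.shape)  # (6, 3)
print(pixels.dtype)  # float32
print(float(pixels.min()), float(pixels.max()))  # 0.0 1.0
```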

#### 3. Optimize the Kmeans Model on Patch (Almost always 4 clusters)

In [5]:
hyperparameter_dict = opt_kmeans(numpy_img_reshape,n_clusters = [2,3,4,6,7,8,9,10])
kmeans_fit = KMeans_superpatch_fit(patch_path,hyperparameter_dict, random_state = 13)

#### 4. Run segment_TILS
* ##### Since we will be running the segmentation on a single patch, we will set the multiple_images flag = False. This also means that the in_dir_path should only be the path to the patch rather than to a folder of patches, as we will show in the next case.
* ##### It should be noted that segment_TILs can also create and fit a model when passed a hyperparameter dict and an algorithm type; however, since `KMeans_superpatch_fit` already returns a fitted model, this feature was bypassed via the `model` argument. To use it, pass None as the `model` argument and supply the parameters through the `hyperparameter_dict` argument.
* ##### Lastly, this notebook was intended to showcase the implementation of KMeans-Dbscan, so only KMeans was used. However, this function can be fed KMeans, DBSCAN, BIRCH, or OPTICS models, not just KMeans.

In [6]:
TIL_count_dict, kmean_labels_dict, cluster_mask_dict, cluster_index = segment_TILs(in_dir_path = patch_path,
                                                        out_dir_path = repository_path + '/Example/Results/Image_Seg_Case_i',
                                                        hyperparameter_dict = None,
                                                        algorithm = 'KMeans',
                                                        model = kmeans_fit,
                                                        save_TILs_overlay = True,
                                                        save_cluster_masks = True,
                                                        save_cluster_overlays = True,
                                                        save_all_clusters_img = True,
                                                        save_csv = True,
                                                        multiple_images = False)

#### In the above block, a kmeans model was fitted and predicted on the single patch. This type of function can be used to validate the results from a superpatch fitted model onto this patch - this will be done in case ii.

#### 5. Exploring the Output

In [7]:
for i,key in enumerate(kmean_labels_dict):
    print(f'File {i+1}')
    print(f'Filepath: {key}') 
    print(f'Cluster Labels: {kmean_labels_dict[key][:30]}')
    print(f'Unique Labels: {set(kmean_labels_dict[key])}')
    print(f'Cluster label number that had the most contours: {cluster_index}')
    print(f'TIL_count of cluster {cluster_index}: {TIL_count_dict[key]}')

File 1
Filepath: position_7_8tissue
Cluster Labels: [0 0 0 0 2 2 2 2 2 2 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 3 3 3 3]
Unique Labels: {0, 1, 2, 3}
Cluster label number that had the most contours: 1
TIL_count of cluster 1: 2742


#### As you can see above, this image was found to have 4 optimized clusters (0, 1, 2, 3), and cluster 1 has the most TIL contours, with 2742. Moreover, the clustering and contour map plots can be found in the `Clustering` Folder in `Results/Image_Seg_Case_i`

<img title="a title" alt="Alt text" src="Notebook_Images/Image_1.png" width="200">  
<span style="font-size: 6em;">&rarr;</span>
<img title="a title" alt="Alt text" src="Notebook_Images/image_2.png" width="200">
<span style="font-size: 6em;">&rarr;</span>
<img title="a title" alt="Alt text" src="Notebook_Images/image_16.png" width="600">

When zoomed in on the `AllClusters.jpg`, `ContourMask.jpg`, and `ContourOverlap.jpg`:

<img title="a title" alt="Alt text" src="Notebook_Images/image_17.png" width="400">
<img title="a title" alt="Alt text" src="Notebook_Images/image_18.png" width="400">  
<img title="a title" alt="Alt text" src="Notebook_Images/image_19.png" width="400">  

#### The KMeans clusters are shown in the first image, while the binary mask/contour map of cluster 1 (the cluster with the most small, round contours) can be seen in the second image. This cluster was then overlaid on the original raw image in the third figure.
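
A binary mask like the one in the second image can be derived from the flat per-pixel labels. The sketch below assumes the labels are in the same row-major pixel order produced by `reshape((-1, 3))`; the helper name `labels_to_mask` and the toy labels are hypothetical.

```python
import numpy as np


def labels_to_mask(flat_labels, image_shape, cluster_index):
    """Reshape flat per-pixel labels to (height, width) and mark one cluster as 1."""
    height, width = image_shape[:2]
    label_image = np.asarray(flat_labels).reshape(height, width)
    return (label_image == cluster_index).astype(np.uint8)


# Toy example: a 2 x 4 image whose pixels fall into clusters 0-3.
flat = np.array([0, 0, 1, 1, 2, 1, 3, 3])
mask = labels_to_mask(flat, (2, 4, 3), cluster_index=1)
print(mask)
# [[0 0 1 1]
#  [0 1 0 0]]
```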

### Case ii: Running Segment_TILS on a Folder of Patches

#### 1. As in Case i, the images placed into the `Three_Patches_Example` Folder will now be used to demonstrate the Multiple Image Functions.

#### `position_7_8tissue.tif`, `position_14_20tissue.tif`, `position_6_16tissue.tif` <br>
<img src= "Notebook_Images/image_15.png" style="width: 700px;">

#### 2. Open the Superpatch Image and Normalize the Pixels

In [8]:
superpatch_path = repository_path + '/Example/Image_Files/superpatch_training.tif'
img = Image.open(superpatch_path)
numpy_img = np.array(img)
numpy_img_reshape = np.float32(numpy_img.reshape((-1, 3))/255.)

#### 3. Optimize the Kmeans Model on the Superpatch (Almost always 4 clusters)

In [10]:
hyperparameter_dict = opt_kmeans(numpy_img_reshape,n_clusters = [2,3,4,6,7,8,9,10])
kmeans_fit = KMeans_superpatch_fit(superpatch_path,hyperparameter_dict,random_state = 13)

#### 4. Run segment_TILS
* ##### Since we will be running the segmentation on a folder of patches, we will set the multiple_images flag = True. This also means that the in_dir_path should be the path to the directory of patches (unlike before, where it was the path to the single image)
* #### In case ii, the superpatch is used to fit a KMeans model that is then fed into the segment_TILs function and used to predict clusters on each of the separate patches in the folder
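
Predicting with a fitted KMeans model amounts to assigning each pixel to its nearest learned cluster center. The sketch below shows that assignment step with hand-picked centroids, to illustrate how a model fitted on one image can label pixels from another. The centroid values and the helper name `predict_with_centroids` are assumptions, not output from this notebook.

```python
import numpy as np


def predict_with_centroids(pixels, centroids):
    """Assign each pixel (row) to the index of its nearest centroid."""
    # Squared Euclidean distance from every pixel to every centroid.
    distances = ((pixels[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return distances.argmin(axis=1)


# Hypothetical centroids, as if learned from a superpatch (normalized RGB).
centroids = np.array([[0.95, 0.95, 0.95],   # background-like
                      [0.80, 0.50, 0.70],   # pink tissue-like
                      [0.30, 0.20, 0.50]])  # dark nuclei-like

# Pixels from a separate patch are labeled using the same centroids.
patch_pixels = np.array([[0.90, 0.92, 0.93],
                         [0.28, 0.22, 0.45],
                         [0.75, 0.55, 0.68]])
print(predict_with_centroids(patch_pixels, centroids))  # [0 2 1]
```

The intermediate distance array has one entry per pixel per centroid per channel, so for full-size patches a library `predict` call is the more memory-friendly choice.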

In [None]:
# Path to the folder of patches created in Section A (not to a single patch)
patches_dir_path = repository_path + '/Example/Image_Files/Three_Patches_Example'
TIL_count_dict, kmean_labels_dict, cluster_mask_dict, cluster_index = segment_TILs(in_dir_path = patches_dir_path,
                                                        out_dir_path = repository_path + '/Example/Results/Image_Seg_Case_ii',
                                                        hyperparameter_dict = None,
                                                        algorithm = 'KMeans',
                                                        model = kmeans_fit,
                                                        save_TILs_overlay = True,
                                                        save_cluster_masks = True,
                                                        save_cluster_overlays = True,
                                                        save_all_clusters_img = True,
                                                        save_csv = True,
                                                        multiple_images = True)

#### In the above block, the KMeans model fitted on the superpatch was used to predict clusters on each patch in the `Three_Patches_Example` Folder.

---

### Running Kmeans-Dbscan Model on Same Patch - Kmeans fed into Dbscan

## From 2024 Software Project (NEW COMPONENTS):

## 1) Multiple Images (Predicting Superpatch Model on Superpatches)

### Running Segment_TILS on Folder of Patches from Slide - KMeans Only

### Running Kmeans-Dbscan Model on Superpatch and Folder of Patches - KMeans fed into Dbscan

In [5]:
from tilseg.refine_kmeans import kmean_to_spatial_model_superpatch_wrapper
im_labels, dbscan_model, cluster_mask_dict = kmean_to_spatial_model_superpatch_wrapper(superpatch_path = repository_path + '/Example/Image_Files/superpatch_training.tif',
                                            in_dir_path = repository_path + '/Example/Image_Files/TCGA-A2-A0CW-01Z-00-DX1',
                                            spatial_hyperparameters= {'eps': 15,'min_samples': 100},
                                            n_clusters = [1,2,4,5,6,7,8,9],
                                            out_dir_path = repository_path + '/Example/Results',
                                            save_TILs_overlay = True,
                                            save_cluster_masks = True,
                                            save_cluster_overlays =  True,
                                            save_all_clusters_img = True,
                                            save_csv = True)

Found hyperparameters. Time took: 4.0150078694025675 minutes.


KeyboardInterrupt: 

Clustering results should have been saved to the `Results` Folder

<img title="a title" alt="Alt text" src="Notebook_Images/Image_1.png" width="200">  
<span style="font-size: 6em;">&rarr;</span>
<img title="a title" alt="Alt text" src="Notebook_Images/image_2.png" width="200">
<span style="font-size: 6em;">&rarr;</span>
<img title="a title" alt="Alt text" src="Notebook_Images/image_3.png" width="600">

### BREAK