# Polygon merger

**Overview:**

This module merges polygons from adjacent tiles (regions of interest, ROIs) that form a tiled array. This is particularly useful when working with algorithmic segmentation output, which typically comes as masks that are much smaller (say, 512x512 pixels) than the tissue area. As a result, segmentation contours are cut off sharply at the edge of each tile. This is a problem if you need to analyze large histological structures, whose perimeters often span thousands of pixels.

This extends some of the workflows described in Amgad et al., 2019:

__Mohamed Amgad, Habiba Elfandy, Hagar Hussein, ..., Jonathan Beezley, Deepak R Chittajallu, David Manthey, David A Gutman, Lee A D Cooper, Structured crowdsourcing enables convolutional segmentation of histology images, Bioinformatics, 2019, btz083__


This slide is used as a test example:

[TCGA-A2-A0YE-01Z-00-DX1](http://candygram.neurology.emory.edu:8080/histomicstk#?image=5d5d6910bd4404c6b1f3d893&bounds=41996%2C43277%2C49947%2C46942%2C0 )

__Original__ :
![original](img/polygon_merger_unmerged.jpg)

__Merged__ :
![merged](img/polygon_merger_merged.jpg)


**Implementation summary**

The key requirement is that the masks (ROIs) are rectangular and unrotated; everything else is taken care of. This algorithm fuses polygon clusters in coordinate (not mask) space, which means it can merge almost arbitrarily large structures without memory issues. The algorithm, in brief, works as follows:

- Extract contours from the given masks using functionality from the ``masks_to_annotations_handler.py``, making sure to correctly account for contour offset so that all coordinates are relative to whole-slide image.

- Identify contours that touch the edge of each ROI, and which edges they touch.

- Identify shared edges between ROIs. 

- For each shared edge, find contours that are in the vicinity of each other (using bounding box location). For those that are, convert them to shapely polygons and check whether they are actually within a threshold distance of each other. If so, consider them a "pair" for merger.

- Hierarchically cluster pairs of polygons such that all contiguous polygons (using 4-connectivity) are to be merged.

- Get the union of the elements of each polygon "cluster". The polygons are first dilated slightly to make sure any small gaps are covered, then merged, then eroded back.

This initial set of "vetting" steps ensures that the number of comparisons is ``<< n^2``. This matters because whole-slide images may contain tens of thousands of objects, so a naive all-against-all comparison would quickly become too slow.
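
The vetting and clustering steps above can be sketched roughly as follows. This is a minimal illustration in plain Python, not the code in ``polygon_merger.py``: the function names (``bboxes_are_near``, ``cluster_pairs``), the ``(xmin, ymin, xmax, ymax)`` tuple layout, and the toy contours are all assumptions made for the example, and a simple union-find stands in for whatever clustering the library actually uses.

```python
def bboxes_are_near(a, b, thresh):
    """Return True if two (xmin, ymin, xmax, ymax) boxes are within thresh pixels."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    # Gap along each axis; zero when the boxes overlap on that axis.
    gap_x = max(bx0 - ax1, ax0 - bx1, 0)
    gap_y = max(by0 - ay1, ay0 - by1, 0)
    return gap_x <= thresh and gap_y <= thresh


def cluster_pairs(pairs):
    """Group items linked by pairwise relations into clusters (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for item in list(parent):
        clusters.setdefault(find(item), set()).add(item)
    return sorted(sorted(c) for c in clusters.values())


# Hypothetical contours touching a shared vertical edge at x = 512.
left_side = {"L0": (400, 10, 512, 90), "L1": (450, 300, 512, 380)}
right_side = {"R0": (512, 40, 600, 120), "R1": (512, 700, 640, 780)}

# Cheap bounding-box test first; only surviving pairs would go on to
# the more expensive polygon distance check.
pairs = [
    (lname, rname)
    for lname, lbox in left_side.items()
    for rname, rbox in right_side.items()
    if bboxes_are_near(lbox, rbox, thresh=3)
]
merge_clusters = cluster_pairs(pairs)
print(pairs)
print(merge_clusters)
```

Only contours on either side of a shared edge are compared, and only pairs whose bounding boxes are close would be passed on to the exact distance check, which is what keeps the number of comparisons small.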

**Where to look?**

```
|_ histomicstk/
|   |_annotations_and_masks/
|      |_polygon_merger.py 
|      |_tests/
|         |_ polygon_merger_test.py
|         |_ annotations_to_masks_handler_test.py
|         |_test_files/
|            |_polygon_merger_roi_masks/
|_ docs/
    |_examples/
       |_polygon_merger.ipynb
```

In [1]:
from __future__ import print_function

import os
import sys
CWD = os.getcwd()
sys.path.append(os.path.join(CWD, '..', '..', 'histomicstk', 'annotations_and_masks'))

import girder_client
from pandas import read_csv

from polygon_merger import Polygon_merger
from masks_to_annotations_handler import (
    get_annotation_documents_from_contours, )

import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['figure.figsize'] = 7, 7

## 1. Constants & prep work

In [2]:
APIURL = 'http://candygram.neurology.emory.edu:8080/api/v1/'
SAMPLE_SLIDE_ID = '5d586d76bd4404c6b1f286ae'

gc = girder_client.GirderClient(apiUrl=APIURL)
# gc.authenticate(interactive=True)
gc.authenticate(apiKey='kri19nTIGOkWH01TbzRqfohaaDWb6kPecRqGmemb')

# read GTCodes dataframe
PTESTS_PATH = os.path.join(
    CWD, '..', '..', 'histomicstk', 'annotations_and_masks', 'tests')
GTCODE_PATH = os.path.join(PTESTS_PATH, 'test_files', 'sample_GTcodes.csv')
GTCodes_df = read_csv(GTCODE_PATH)
GTCodes_df.index = GTCodes_df.loc[:, 'group']

# This is where masks for adjacent rois are saved
MASK_LOADPATH = os.path.join(
    PTESTS_PATH, 'test_files', 'polygon_merger_roi_masks')
maskpaths = [
    os.path.join(MASK_LOADPATH, j) for j in os.listdir(MASK_LOADPATH)
    if j.endswith('.png')]

## 2. Polygon merger

### This is the class object you will be using

In [3]:
print(Polygon_merger.__doc__)

Methods to merge polygons in tiled masks.


In [4]:
print(Polygon_merger.__init__.__doc__)

Init Polygon_merger object.

        Arguments:
        -----------
        maskpaths : list
            list of strings representing pathos to masks
        GTCodes_df : pandas DataFrame
            the ground truth codes and information dataframe.
            This is a dataframe that is indexed by the annotation group name
            and has the following columns.

            group: str
                group name of annotation, eg. mostly_tumor.
            GT_code: int
                desired ground truth code (in the mask). Pixels of this value
                belong to corresponding group (class).
            coords_x : str
                vertix x coordinates comma-separated values
            coords_y
                vertix y coordinated comma-separated values
            color: str
                rgb format. eg. rgb(255,0,0).
        merge_thresh : int
            how close do the polygons need to be (in pixels) to be merged
        contkwargs : dict
            dictionary o

In [5]:
print(Polygon_merger.run.__doc__)

Run full pipeline to get merged contours.

        Returns:
        - pandas DataFrame: has the same structure as output from
        get_contours_from_mask().

        


### This is the core merging method if you're interested.

### Required arguments for init

#### Ground truth codes file

This contains the ground truth codes and information dataframe. This is a dataframe that is indexed by the annotation group name and has the following columns:

- ``group``: str, group name of the annotation, e.g. "mostly_tumor".
- ``GT_code``: int, desired ground truth code (in the mask). Pixels of this value belong to the corresponding group (class).
- ``color``: str, rgb format, e.g. rgb(255,0,0).

IMPORTANT NOTE:

Zero pixels have special meaning and do NOT encode specific ground truth class. Instead, they simply mean 'Outside ROI' and should be IGNORED during model training or evaluation.
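
As an illustration of what ignoring zero pixels can look like at evaluation time, here is a minimal sketch using numpy. The function name ``masked_pixel_accuracy`` and the toy arrays are hypothetical and not part of HistomicsTK; the only thing taken from the text above is that code 0 means 'Outside ROI'.

```python
import numpy as np


def masked_pixel_accuracy(y_true, y_pred, ignore_code=0):
    """Pixel accuracy computed only where y_true != ignore_code."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    valid = y_true != ignore_code  # True for pixels inside the ROI
    if not valid.any():
        return float("nan")  # nothing to evaluate
    return float((y_true[valid] == y_pred[valid]).mean())


# Toy 3x3 mask: 0 = outside ROI, 1 = mostly_tumor, 2 = mostly_stroma.
truth = np.array([[0, 1, 1],
                  [0, 2, 2],
                  [0, 0, 1]])
pred = np.array([[1, 1, 1],
                 [2, 2, 1],
                 [1, 2, 1]])
print(masked_pixel_accuracy(truth, pred))
```

Predictions at zero-valued positions have no effect on the score, whatever class the model assigns there.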

In [6]:
GTCodes_df.head()

| group (index) | group | overlay_order | GT_code | is_roi | is_background_class | color | comments |
|---|---|---|---|---|---|---|---|
| roi | roi | 0 | 255 | 1 | 0 | rgb(200,0,150) | |
| evaluation_roi | evaluation_roi | 0 | 254 | 1 | 0 | rgb(255,0,0) | |
| mostly_tumor | mostly_tumor | 1 | 1 | 0 | 0 | rgb(255,0,0) | core class |
| mostly_stroma | mostly_stroma | 2 | 2 | 0 | 1 | rgb(255,125,0) | core class |
| mostly_lymphocytic_infiltrate | mostly_lymphocytic_infiltrate | 1 | 3 | 0 | 0 | rgb(0,0,255) | core class |


#### maskpaths

These are absolute paths for the masks/tiles/ROIs to be used.

In [7]:
[os.path.split(j)[1] for j in maskpaths[:5]]

['TCGA-A2-A0YE-01Z-00-DX1.8A2E3094-5755-42BC-969D-7F0A2ECA0F39_left-44350_top-43750_mag-BASE.png',
 'TCGA-A2-A0YE-01Z-00-DX1.8A2E3094-5755-42BC-969D-7F0A2ECA0F39_left-44350_top-44262_mag-BASE.png',
 'TCGA-A2-A0YE-01Z-00-DX1.8A2E3094-5755-42BC-969D-7F0A2ECA0F39_left-44350_top-44774_mag-BASE.png',
 'TCGA-A2-A0YE-01Z-00-DX1.8A2E3094-5755-42BC-969D-7F0A2ECA0F39_left-44350_top-45286_mag-BASE.png',
 'TCGA-A2-A0YE-01Z-00-DX1.8A2E3094-5755-42BC-969D-7F0A2ECA0F39_left-44350_top-45798_mag-BASE.png']

Note that the patterns ``_left-123_`` and ``_top-123_`` are assumed to encode the x and y offsets
of the mask at base magnification. If you prefer some other convention, you will need to manually provide the
parameter ``roi_offsets`` to the method ``Polygon_merger.set_roi_bboxes``.
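
If your files follow a different naming scheme, you could build the offsets yourself. The sketch below parses the default pattern with a regular expression, just to show the shape of the result; ``offsets_from_maskname`` is a hypothetical helper, not a HistomicsTK function. Whether the dictionary should be keyed by the bare file name or the full path is an assumption here, so check the source before passing it in.

```python
import os
import re


def offsets_from_maskname(maskpath):
    """Parse the left/top offsets encoded in a mask file name."""
    name = os.path.basename(maskpath)
    left = re.search(r"_left-(\d+)_", name)
    top = re.search(r"_top-(\d+)_", name)
    if left is None or top is None:
        raise ValueError("No _left-/_top- pattern in %r" % name)
    return {"left": int(left.group(1)), "top": int(top.group(1))}


example = ("TCGA-A2-A0YE-01Z-00-DX1.8A2E3094-5755-42BC-969D-7F0A2ECA0F39"
           "_left-44350_top-43750_mag-BASE.png")

# Assumed layout: one entry per mask, each with integer "top" and "left".
roi_offsets = {example: offsets_from_maskname(example)}
print(roi_offsets[example])
```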

In [8]:
print(Polygon_merger.set_roi_bboxes.__doc__)

Get dictionary of roi bounding boxes.

        Arguments:
        - roi_offsets: dict (default, None): dict indexed by maskname,
        each entry is a dict with keys top and left each is an integer.
        If None, then the x and y offset is inferred from mask name.

        Sets:
        - self.roiinfos: dict: dict indexed by maskname, each entry is a
        dict with keys top, left, bottom, right, all of which are integers.
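
Once every ROI has a bounding box, finding which ROIs share an edge is a matter of comparing box sides. The following is a rough sketch of that idea, not the library's implementation: ``find_shared_edges``, the ROI names, and the convention that a tile's ``bottom`` equals the ``top`` of the tile below it (likewise ``right`` and ``left``) are all assumptions for illustration.

```python
def find_shared_edges(roiinfos, tol=0):
    """Return (roi_a, edge_a, roi_b, edge_b) tuples for ROIs that share an edge."""
    shared = []
    names = sorted(roiinfos)
    for a in names:
        for b in names:
            if a == b:
                continue
            ra, rb = roiinfos[a], roiinfos[b]
            # The two boxes must overlap along the direction of the edge.
            overlap_y = min(ra["bottom"], rb["bottom"]) > max(ra["top"], rb["top"])
            overlap_x = min(ra["right"], rb["right"]) > max(ra["left"], rb["left"])
            if overlap_y and abs(ra["right"] - rb["left"]) <= tol:
                shared.append((a, "right", b, "left"))
            if overlap_x and abs(ra["bottom"] - rb["top"]) <= tol:
                shared.append((a, "bottom", b, "top"))
    return shared


def make_roi(left, top, size=512):
    """Build a bounding box dict for a hypothetical square tile."""
    return {"left": left, "top": top, "right": left + size, "bottom": top + size}


# Three hypothetical 512x512 tiles: B sits below A, C sits to the right of A.
roiinfos = {
    "A": make_roi(44350, 43750),
    "B": make_roi(44350, 44262),
    "C": make_roi(44862, 43750),
}
shared_edges = find_shared_edges(roiinfos)
print(shared_edges)
```

B and C only touch at a corner, so they are not reported as sharing an edge.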

        


## 3. Let's init and run the merger

To keep things "pretty", we discard background contours (in this case, stroma) that
are not enclosed within another contour. See the docs for ``masks_to_annotations_handler.py``
if this is confusing. It is purely aesthetic.

In [9]:
pm = Polygon_merger(
    maskpaths=maskpaths, GTCodes_df=GTCodes_df,
    discard_nonenclosed_background=True, verbose=1,
    monitorPrefix='test')
contours_df = pm.run()


test: Set contours from all masks

test: Set ROI bounding boxes

test: Set shard ROI edges

test: Set merged contours

test: Get concatenated contours
test: _discard_nonenclosed_background_group: discarded 4 contours


### This is the result

In [10]:
contours_df.head()

| | group | color | ymin | ymax | xmin | xmax | has_holes | touches_edge-top | touches_edge-left | touches_edge-bottom | touches_edge-right | coords_x | coords_y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | mostly_tumor | rgb(255,0,0) | | | | | 0 | | | | | 44350,44350,44384,44384,44385,44385,44386,4438... | 44445,44446,44446,44445,44445,44445,44445,4444... |
| 1 | mostly_tumor | rgb(255,0,0) | | | | | 0 | | | | | 44350,44350,44350,44350,44350,44350,44350,4435... | 44615,44615,44615,44771,44771,44772,44772,4477... |
| 2 | mostly_tumor | rgb(255,0,0) | | | | | 0 | | | | | 44350,44350,44350,44350,44350,44350,44350,4435... | 45129,45129,45283,45283,45284,45284,45285,4528... |
| 3 | mostly_tumor | rgb(255,0,0) | | | | | 0 | | | | | 45822,45822,45822,45823,45823,45823,45823,4582... | 43915,43916,43916,43916,43917,43917,43917,4391... |
| 4 | mostly_tumor | rgb(255,0,0) | | | | | 0 | | | | | 46312,46312,46315,46316,46316,46316,46317,4631... | 44252,44253,44256,44256,44257,44257,44257,4425... |
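
The ``coords_x`` and ``coords_y`` columns hold comma-separated strings, and the bounding-box columns appear empty in the output above. If you need vertices or a bounding box for a merged contour, something like the following sketch could work. The helper names are hypothetical, the row is a shortened toy example rather than a real contour, and integer coordinates are assumed.

```python
def parse_coords(coords_x, coords_y):
    """Turn comma-separated coordinate strings into a list of (x, y) vertices."""
    xs = [int(v) for v in coords_x.split(",")]
    ys = [int(v) for v in coords_y.split(",")]
    if len(xs) != len(ys):
        raise ValueError("coords_x and coords_y differ in length")
    return list(zip(xs, ys))


def bounding_box(vertices):
    """Compute the bounding box of a list of (x, y) vertices."""
    xs, ys = zip(*vertices)
    return {"xmin": min(xs), "xmax": max(xs), "ymin": min(ys), "ymax": max(ys)}


# Toy row with only four vertices (real contours are much longer).
row = {
    "coords_x": "44350,44350,44384,44384",
    "coords_y": "44445,44446,44446,44445",
}
vertices = parse_coords(row["coords_x"], row["coords_y"])
print(vertices)
print(bounding_box(vertices))
```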


## 4. (Optional) - Visualize results on HistomicsTK

In [11]:
# deleting existing annotations in target slide (if any)
existing_annotations = gc.get('/annotation/item/' + SAMPLE_SLIDE_ID)
for ann in existing_annotations:
    gc.delete('/annotation/%s' % ann['_id'])

# get list of annotation documents
annotation_docs = get_annotation_documents_from_contours(
    contours_df.copy(), separate_docs_by_group=True,
    docnamePrefix='test',
    verbose=False, monitorPrefix=SAMPLE_SLIDE_ID + ": annotation docs")

# post annotations to slide -- make sure it posts without errors
for annotation_doc in annotation_docs:
    resp = gc.post(
        "/annotation?itemId=" + SAMPLE_SLIDE_ID, json=annotation_doc)

Now you can go to:

[TCGA-A2-A0YE-01Z-00-DX1](http://candygram.neurology.emory.edu:8080/histomicstk#?image=5d586d76bd4404c6b1f286ae&bounds=41996%2C43277%2C49947%2C46942%2C0 )

and confirm that the posted annotations make sense and correspond to tissue boundaries and expected labels.

## 5. (EXTRA) - Explore some of the inner workings

### Core method being called to merge polygons

This relies on ``shapely``. 

In [12]:
print(Polygon_merger._get_merged_polygon.__doc__)

Merge polygons using shapely (Internal).

        Given a single cluster from _get_merge_clusters_from_df(), This creates
        and merges polygons into a single cascaded union. It first dilates the
        polygons by buffer_size pixels to make them overlap, merges them,
        then erodes back by buffer_size to get the merged polygon.

        


You may also want to check out ``Polygon_merger._add_merged_edge_contours`` for details on how the various
``shapely`` geometries (polygon, multipolygon, etc.) are handled.
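
To make the dilate, merge, erode idea concrete, here is a minimal sketch using ``shapely``. It is not the body of ``_get_merged_polygon``: the function name ``merge_with_buffer`` and the toy squares are made up for the example, and ``unary_union`` is used here in place of the cascaded union mentioned in the docstring. It also shows one simple way to handle the result being either a polygon or a multipolygon.

```python
from shapely.geometry import Polygon
from shapely.ops import unary_union


def merge_with_buffer(polygons, buffer_size=3):
    """Dilate polygons, take their union, then erode back by the same amount."""
    dilated = [p.buffer(buffer_size) for p in polygons]
    merged = unary_union(dilated)
    eroded = merged.buffer(-buffer_size)
    # The result can be empty, a single Polygon, or a MultiPolygon.
    if eroded.is_empty:
        return []
    if eroded.geom_type == "MultiPolygon":
        return list(eroded.geoms)
    return [eroded]


# Two 10x10 squares separated by a 1-pixel gap, as if cut by a tile edge.
square_a = Polygon([(0, 0), (10, 0), (10, 10), (0, 10)])
square_b = Polygon([(11, 0), (21, 0), (21, 10), (11, 10)])
close_result = merge_with_buffer([square_a, square_b], buffer_size=3)

# The same squares far apart stay separate after the round trip.
square_c = Polygon([(40, 0), (50, 0), (50, 10), (40, 10)])
far_result = merge_with_buffer([square_a, square_c], buffer_size=3)

print(len(close_result), len(far_result))
```

The buffer only needs to be large enough to bridge the gap between contours on either side of a tile edge; a larger value risks fusing structures that are genuinely separate.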