# Segmenting and describing regions

This part of the course is about automatically separating specific regions of interest in an image from the background, and extracting descriptors from them for downstream analysis.

Run the code, making sure you understand the syntax. Complete the parts marked **TODO**, either in the text or in the code.

## Setup

First, we need to import some packages so that their functions are available to us.

In [None]:
import os                       # operating system operations like file paths etc
import numpy as np              # multidimensional arrays, linear algebra
from skimage import morphology  # morphological operations
from skimage import io          # to load and save data
from skimage import color       # color conversion utilities
from skimage.util import invert   # invert an image (if binary, black->white, white->black)
from skimage import img_as_ubyte # Convert an image to 8-bits
from skimage.filters import threshold_otsu # Otsu's thresholding method
from skimage.measure import regionprops, regionprops_table # region properties

import matplotlib.pyplot as plt # plotting
from matplotlib.ticker import MaxNLocator
from mpl_toolkits.mplot3d import Axes3D # 3D plotting

import pandas as pd


path_to_images= './data'    # Local: where the images are relative to this notebook
#path_to_images= os.path.join('Module2','data')    # Nuvolos: where the images are relative to this notebook

## Segmentation

### Baseline - Otsu

In [None]:
from skimage.filters import threshold_otsu

# Load and normalize the neuroblastoma image
image_file = os.path.join(path_to_images, 'neuroblastoma_5_orig_small.jpg')
img = io.imread(image_file)
img = img/np.max(img) # normalize the image to [0,1]
print(f'Image has shape {img.shape}')

# threshold the image with Otsu's method
thresh = threshold_otsu(img)
binary_img = img > thresh

# look for connected components to count cells
labeled_img, num_labels = morphology.label(binary_img, background=0, return_num=True, connectivity=1)
print(f'Found {num_labels} connected components')


# Display the original image, the Otsu-thresholded image and the connected-component labels side by side
fig, axes = plt.subplots(1, 3, figsize=(14, 7))
axes[0].imshow(img, cmap='gray')
axes[0].set_title('Original Image')
axes[0].axis('off')
axes[1].imshow(binary_img, cmap='gray')
axes[1].set_title('Otsu-thresholded image')
axes[1].axis('off')
axes[2].imshow(labeled_img, cmap='tab20c')
axes[2].set_title('CC-labelled image')
axes[2].axis('off')

plt.show()

### Watershed

#### Principles - intensity is height

In [None]:
# Load and normalize the neuroblastoma image
image_file = os.path.join(path_to_images, 'neuroblastoma_5_orig_small.jpg')
img = io.imread(image_file)
img = img/np.max(img) # normalize the image to [0,1]

plt.imshow(img, cmap='gray', origin='lower')
plt.title('Original Image')
plt.xlabel('X')
plt.ylabel('Y')


# Create a grid spanning x, y coordinates
x = range(img.shape[1])
y = range(img.shape[0])
X, Y = np.meshgrid(x, y)

# Create z values from the gray-level values of each pixel
Z = img

# Create a 3D plot
fig = plt.figure(figsize=(10, 10)) 
ax = fig.add_subplot(111, projection='3d')
ax.view_init(elev=120, azim=90, roll=180)
ax.plot_surface(X, Y, Z, cmap='gray')

# Set labels and title
ax.set_xlabel('X')
ax.set_ylabel('Y')
ax.set_zlabel('Gray-level value')
ax.set_title('Topographic view')

# Show the plot
plt.show()

#### Full Watershed algorithm

Let's implement all the steps needed.

In [None]:
# WATERSHED SEGMENTATION
from scipy import ndimage as ndi
from skimage.feature import peak_local_max
from skimage.segmentation import watershed

# Load and normalize the neuroblastoma image
image_file = os.path.join(path_to_images, 'neuroblastoma_5_orig_small.jpg')
img = io.imread(image_file)
img = img/np.max(img) # normalize the image to [0,1]

# Binarize by thresholding, then invert
img_bin=img < 0.2
img_bin_inv = invert(img_bin)

# compute distance transform - shortest distance from this pixel to the background
img_dist = ndi.distance_transform_edt(img_bin_inv)


# find coordinates of local maxima in distance transform image
coords_max = peak_local_max(img_dist, footprint=np.ones((25,25)), labels=img_bin_inv)
print(f'Found {coords_max.shape[0]} local maxima')

# create an image with these coordinates marked as True
local_maxima = np.zeros(img_dist.shape, dtype=bool)
local_maxima[tuple(coords_max.T)] = True
print(f'Local maxima image has {np.sum(local_maxima.ravel())} marked points')

# now create markers for the watershed algorithm, use one-connected neighbourhoods to define whether
# two local maxima should be merged
markers, _ = ndi.label(local_maxima)
print(f'Markers image has {len(np.unique(markers))-1} markers') # -1 because 0 is background
print(f'Cell labels: {np.unique(markers)}')

# Run the watershed algorithm on the negated distance transform, so that each local maximum
# becomes the bottom of a basin; every foreground pixel gets the label of the marker that floods it first
labels = watershed(-img_dist, markers, mask=img_bin_inv, compactness=1)
print(f'Watershed algorithm found {len(np.unique(labels))-1} unique labels') # -1 because 0 is background

# Show all processing steps in Watershed segmentation
fig, axes = plt.subplots(3, 3, figsize=(32, 30))
axes[0,0].imshow(img, cmap='gray')
axes[0,0].set_title('1. Image')
axes[0,0].axis('off')
axes[0,1].imshow(img_bin, cmap='gray')
axes[0,1].set_title('2. Thresholded Image')
axes[0,1].axis('off')
axes[0,2].imshow(img_bin_inv, cmap='gray')
axes[0,2].set_title('3. Inverted thresholded')
axes[0,2].axis('off')

axes[1,0].imshow(img_dist, cmap='gray')
axes[1,0].set_title('4. Distance Transform')
axes[1,0].axis('off')
axes[1,1].imshow(local_maxima, cmap='gray')
axes[1,1].set_title('5. Local maxima in Distance Transform')
axes[1,1].axis('off')
axes[1,1].scatter(coords_max[:,1], coords_max[:,0], c='y', s=20, marker="+", alpha=0.5)
axes[1,2].imshow(markers, cmap='gray')
axes[1,2].set_title('6. Maximum markers')
axes[1,2].axis('off')

# Write the label of each marker next to its position in the markers image
axes[2,0].imshow(markers, cmap='gray')
for row, col in zip(*np.nonzero(markers)):
    axes[2,0].text(col, row, str(markers[row, col]), color='red', fontsize=12)
axes[2,0].set_title('7. Max markers w/ labels')
axes[2,0].axis('off')

axes[2,1].imshow(labels, cmap='tab20c')
axes[2,1].set_title('8. Watershed Labels')
axes[2,1].axis('off')

# count the number of pixels in each labelled cell
unique_labels, label_counts = np.unique(labels[labels>0].ravel(), return_counts=True)
axes[2,2].bar(unique_labels, label_counts, align='center')
axes[2,2].set_xticks(unique_labels)
axes[2,2].set_title('9. Cell sizes (in pixels)')
axes[2,2].set_xlabel('Cell ID')
axes[2,2].set_ylabel('Number of pixels')


plt.tight_layout(pad=0.3)
plt.show()

As you can see, this algorithm has many steps, in addition to the usual preprocessing steps such as normalizing and thresholding.

Let's examine the effect of each step in turn.

**TODO** What is the effect of the size of the structuring element ("footprint" in skimage lingo) used to detect local maxima of the distance function? Try modifying it. Do you get fewer, more, or the same number of local maxima? Why does this happen?
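If you want to experiment away from the real data first, here is a minimal sketch on a synthetic image. The two overlapping disks, their positions and the footprint sizes are assumptions made up for illustration; they are not taken from the neuroblastoma image. The loop only counts the local maxima found for each footprint size, so you still have to work out why the counts change.

```python
import numpy as np
from scipy import ndimage as ndi
from skimage.feature import peak_local_max

# Synthetic binary image (hypothetical test data): two overlapping disks
rows, cols = np.indices((80, 80))
disk_a = (rows - 28) ** 2 + (cols - 28) ** 2 < 16 ** 2
disk_b = (rows - 44) ** 2 + (cols - 52) ** 2 < 20 ** 2
blobs = disk_a | disk_b

# Distance from each foreground pixel to the nearest background pixel
dist = ndi.distance_transform_edt(blobs)

# Count the local maxima detected with square footprints of increasing size
n_peaks = {}
for size in (3, 15, 51):
    peaks = peak_local_max(dist, footprint=np.ones((size, size)),
                           labels=blobs.astype(int))
    n_peaks[size] = len(peaks)
    print(f'footprint {size}x{size}: {n_peaks[size]} local maxima')
```

Try a few more sizes between the ones listed and relate the result to the distance between the two disk centres.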

**TODO** What is the effect of the compactness parameter in the Watershed algorithm?
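The following sketch may help you explore this question on a toy example. It assumes a synthetic image of two overlapping disks with one hand-placed marker per disk (both are illustrative choices, not part of the course data), and runs `watershed` with several `compactness` values. According to the scikit-image documentation, higher values give more regularly shaped basins; the printed region sizes let you see how the split between the two disks shifts.

```python
import numpy as np
from scipy import ndimage as ndi
from skimage.segmentation import watershed

# Synthetic binary image (hypothetical test data): two overlapping disks
rows, cols = np.indices((80, 80))
blobs = (((rows - 28) ** 2 + (cols - 28) ** 2 < 16 ** 2)
         | ((rows - 44) ** 2 + (cols - 52) ** 2 < 20 ** 2))
dist = ndi.distance_transform_edt(blobs)

# One marker per disk, placed by hand at the disk centres (an assumption)
seeds = np.zeros(blobs.shape, dtype=int)
seeds[28, 28] = 1
seeds[44, 52] = 2

# Run the watershed with increasing compactness and record the region sizes
region_sizes = {}
for c in (0, 0.01, 1, 100):
    seg = watershed(-dist, seeds, mask=blobs, compactness=c)
    region_sizes[c] = np.bincount(seg.ravel(), minlength=3)[1:]
    print(f'compactness={c}: pixels in regions 1 and 2 = {region_sizes[c]}')
```

Displaying `seg` with `plt.imshow` for the smallest and largest value makes the change in the boundary between the two regions easier to see.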

## Description

We already have our very first region descriptor just from counting pixels: region size (area)! In many applications, such as flow cytometry, cell size is an important parameter. But we can do much more.

The function [`regionprops`](https://scikit-image.org/docs/stable/api/skimage.measure.html#skimage.measure.regionprops) in `skimage.measure` can extract basic geometric descriptors, moments, and intensity-based properties.

You can also use [`regionprops_table`](https://scikit-image.org/docs/stable/api/skimage.measure.html#skimage.measure.regionprops_table) (recommended), which returns a Python dictionary that can easily be converted into a pandas DataFrame for visualization or downstream analysis.
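As a quick illustration of the dictionary-to-DataFrame step, here is a minimal sketch on a hand-made label image. The array `toy_labels` and its two rectangles are hypothetical; they are chosen only so that the areas can be checked by hand.

```python
import numpy as np
import pandas as pd
from skimage.measure import regionprops_table

# Hand-made label image (hypothetical): 0 is background
toy_labels = np.zeros((6, 8), dtype=int)
toy_labels[1:3, 1:4] = 1   # region 1: 2 x 3 rectangle, 6 pixels
toy_labels[2:6, 5:8] = 2   # region 2: 4 x 3 rectangle, 12 pixels

# regionprops_table returns a dict mapping property names to arrays
toy_table = regionprops_table(toy_labels, properties=['label', 'area'])
toy_df = pd.DataFrame(toy_table)
print(toy_df)
```

Each row of `toy_df` corresponds to one labelled region, and each requested property becomes a column.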

### Basic geometric descriptors

To extract the descriptors you want, just pass them as a list of strings to the `properties` argument of the `regionprops_table` function. See below for an example.

Here we will look at:

- Area: 'area'
- Perimeter: 'perimeter'



Let's start by computing the area of each region. Note that the unit of measurement is pixels because we don't know the resolution of the image.

In [None]:
region_properties=regionprops_table(labels,intensity_image=img, properties=['label','area'])
df_rp=pd.DataFrame(region_properties)
print(df_rp)


**TODO** Check that you obtain the same numbers as with the previous 'manual' computation using `np.unique`.

Now, let's look at other properties. Our aim is to see whether we can distinguish cells that have similar sizes but different shapes. For example, let's try to find a way to distinguish cells 11 and 4, and cells 15 and 16.

In [None]:

region_properties=regionprops_table(labels,intensity_image=img, properties=['label','area','perimeter'])
df_rp=pd.DataFrame(region_properties)

# we can use matplotlib.pyplot commands directly from pandas, which makes it easier to reference columns
df_rp.plot.scatter(x='area',y='perimeter',c='label', cmap='tab20c')



As you can see, adding perimeter as a second dimension makes the difference between cells clearer: cells with the same area (x-axis) can have very different perimeters (y-axis). These descriptors form a *feature space* that we can use to quantify cells and their differences.

For example, we could objectively measure the separation between cells 11 and 4, and between cells 15 and 16, by computing their Euclidean distance in this feature space. Since area and perimeter have different scales, we would first need to normalise each to the range 0 to 1 (or standardise by subtracting the mean and dividing by the standard deviation), and then compute the distance.

**BONUS** Implement this and measure the distances between these cells!
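As a starting point, here is a minimal sketch of min-max normalisation followed by a Euclidean distance. The table `toy` contains made-up descriptor values, not measurements from the course image, and `feature_distance` is a hypothetical helper name; you would replace `toy` with your own `df_rp` indexed by `label`.

```python
import numpy as np
import pandas as pd

# Made-up descriptor values (NOT measured from the course image)
toy = pd.DataFrame({'label': [1, 2, 3],
                    'area': [100.0, 100.0, 300.0],
                    'perimeter': [40.0, 60.0, 80.0]}).set_index('label')

# Min-max normalisation: rescale every column to the range [0, 1]
toy_norm = (toy - toy.min()) / (toy.max() - toy.min())

def feature_distance(features, label_a, label_b):
    """Euclidean distance between two rows of a feature table."""
    diff = features.loc[label_a] - features.loc[label_b]
    return float(np.linalg.norm(diff))

d12 = feature_distance(toy_norm, 1, 2)
d13 = feature_distance(toy_norm, 1, 3)
print(f'distance 1-2: {d12:.3f}, distance 1-3: {d13:.3f}')
```

For standardisation instead of min-max scaling, one option is `(toy - toy.mean()) / toy.std()`; the distance function stays the same.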

Now, let's see how to implement our own descriptors.
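One descriptor we can build from area $A$ and perimeter $P$ is compactness, $4\pi A / P^2$. For an ideal circle, $A = \pi r^2$ and $P = 2\pi r$, so the value is exactly 1, and it decreases for elongated or irregular shapes. The sketch below checks this on three synthetic shapes (a disk, a square and a thin rectangle, all hypothetical test data). Because the perimeter of a digital shape is only an estimate, the disk value will be close to, but not exactly, 1.

```python
import numpy as np
import pandas as pd
from skimage.measure import regionprops_table

# Synthetic label image (hypothetical test data) with three shapes
shapes = np.zeros((120, 260), dtype=int)
rows, cols = np.indices(shapes.shape)
shapes[(rows - 60) ** 2 + (cols - 50) ** 2 < 30 ** 2] = 1   # disk, radius 30
shapes[40:80, 110:150] = 2                                   # 40 x 40 square
shapes[55:65, 160:250] = 3                                   # 10 x 90 rectangle

# Area and perimeter of each shape, then compactness = 4*pi*A / P^2
df_shapes = pd.DataFrame(
    regionprops_table(shapes, properties=['label', 'area', 'perimeter'])
).set_index('label')
df_shapes['compactness'] = 4 * np.pi * df_shapes['area'] / df_shapes['perimeter'] ** 2
print(df_shapes)
```

The ordering of the three compactness values matches the intuition that the disk is the most compact shape and the thin rectangle the least.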

In [None]:
# compute compactness ourselves from area and perimeter
# (note: regionprops' 'equivalent_diameter_area' is a different quantity,
#  the diameter of a circle with the same area as the region)
df_rp['compactness']=4*np.pi*df_rp['area']/df_rp['perimeter']**2

df_rp.plot.scatter(x='area',y='compactness',c='label', cmap='tab20c')

print(df_rp)


Now, looking at area, perimeter and compactness, what is the best way to distinguish cell 11 from cell 4? And cell 15 from cell 16? Is there a good way in both cases?

**TODO** Copy-paste descriptors for 11 vs 4 and 15 vs 16 to support your argument numerically. Check the original image to see if the descriptors correspond to your intuitive understanding of the differences between the cells.

### Moment descriptors

As you can see, simple geometric descriptors are not necessarily sufficient to distinguish cells that look different to the human eye.

We can go further by using image moments, which provide global descriptors of a shape. The zeroth-order moment gives the area of the region, the first-order moments give its centroid, and the second-order moments describe how the shape is spread around the centroid (its elongation and orientation). We can look at:

- Orientation: 'orientation' - orientation of the major axis of the ellipse having the same second moments as the region
- Moments: 'moments' - these are not invariant to translation, scale, or rotation
- Normalized central moments: 'moments_normalized' - invariant to translation and scale but not rotation
- Hu moments (Hu 1962): 'moments_hu' - invariant to translation, scaling, rotation
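The invariance properties listed above can be checked on a toy shape. The sketch below assumes a hand-made L-shaped region (hypothetical test data), rotates it by 90 degrees with `np.rot90` and shifts it with `np.pad`, then compares the descriptors returned by `regionprops`. A 90-degree rotation is used because it maps the pixel grid onto itself, so no interpolation error is introduced.

```python
import numpy as np
from skimage.measure import regionprops

# Hand-made L-shaped region (hypothetical test data)
shape = np.zeros((40, 40), dtype=int)
shape[5:35, 5:15] = 1    # vertical bar
shape[25:35, 5:30] = 1   # horizontal foot

rotated = np.rot90(shape)                 # exact 90-degree rotation
shifted = np.pad(shape, ((3, 0), (7, 0))) # same shape, translated

props = regionprops(shape)[0]
props_rot = regionprops(rotated)[0]
props_shift = regionprops(shifted)[0]

# Hu moments should agree for all three versions of the shape
print('Hu, original:', props.moments_hu)
print('Hu, rotated: ', props_rot.moments_hu)
print('Hu, shifted: ', props_shift.moments_hu)

# A normalized central moment survives translation but not rotation
nu20 = props.moments_normalized[2, 0]
nu20_rot = props_rot.moments_normalized[2, 0]
nu20_shift = props_shift.moments_normalized[2, 0]
print('nu20 original / rotated / shifted:', nu20, nu20_rot, nu20_shift)
```

With arbitrary rotation angles, or with a rescaled shape, you should expect small numerical differences caused by resampling on the pixel grid.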

In [None]:
region_properties_moments=regionprops_table(labels,intensity_image=img, properties=['label','orientation','moments_hu'])
df_rp_moments=pd.DataFrame(region_properties_moments)

df_rp_full=pd.merge(df_rp,df_rp_moments,on='label')

df_rp_full.plot.scatter(x='area',y='moments_hu-2',c='label', cmap='tab20c',logy=True)

df_rp_full


Now the difference between cells should be more obvious. As you introduce more and higher-order descriptors, you have a higher chance of separating cells that may look similar with simpler descriptors. At the same time, this may highlight irrelevant differences.

**TODO** Decide which combination of basic descriptors and moment descriptors gives you the best separation between cells 4 vs 11, 15 vs 16.

**BONUS** To get a broader understanding of how sensitive these methods are to changes in the data, replicate the watershed segmentation pipeline and the descriptor extraction below using the bacilli image. What are the most important changes you have to implement?