This notebook contains some basic EDA and notes about segmentation as it pertains to the HPA Single Cell Classification competition. I am a beginner in this field, and explaining my work as I go is as much for my benefit as anyone else's, so I appreciate any advice/corrections etc. 

In [None]:
!pip install selectivesearch
!pip install "../input/pycocotools/pycocotools-2.0-cp37-cp37m-linux_x86_64.whl"
!pip install "../input/hpapytorchzoozip/pytorch_zoo-master"
!pip install "../input/hpacellsegmentatorraman/HPA-Cell-Segmentation/"

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt 
import os
import cv2
import skimage.io as io
import skimage.segmentation
import selectivesearch
import matplotlib.patches as mpatches
import scipy.ndimage as ndi
from skimage import filters, measure, transform, util
from skimage.morphology import (binary_erosion, closing, disk,
                                remove_small_holes, remove_small_objects)


**Defining the problem**  

This is a weakly supervised classification problem. The training labels are given at the image level, and we are tasked with predicting labels at the cell level. We will therefore have to identify each cell in each image, segment them, and classify them individually. 

The labels themselves refer to the 'subcellular protein localization patterns of single cells'. There are 18 classes of these patterns plus a 'Negative' class, giving 19 labels in total (0 to 18).

The challenge here will be to segment the images correctly, and then find a way to use the imprecise image-level labels to classify our segmented cells accurately. The problem can therefore be framed as an instance segmentation problem. 

For each image in the test set, you must predict a list of instance segmentation masks and their associated detection score (Confidence). The submission csv file uses the following format:

ImageID,ImageWidth,ImageHeight,PredictionString  
ImageAID,ImageAWidth,ImageAHeight,LabelA1 ConfidenceA1 EncodedMaskA1 LabelA2 ConfidenceA2 EncodedMaskA2 ...  
ImageBID,ImageBWidth,ImageBHeight,LabelB1 ConfidenceB1 EncodedMaskB1 LabelB2 ConfidenceB2 EncodedMaskB2 …


Note that a mask MAY have more than one class. If that is the case, predict separate detections for each class using the same mask.



So we are submitting a list of masks, their respective labels, and the respective confidences with which these label predictions were made. 
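
As a minimal sketch of how the PredictionString for one image could be assembled, assume each detection is already available as a (label, confidence, encoded_mask) tuple. The helper name `build_prediction_string` and the placeholder mask strings are hypothetical, and the mask encoding required by the competition is not shown here.

```python
def build_prediction_string(detections):
    """Join (label, confidence, encoded_mask) tuples into one space-separated string."""
    parts = []
    for label, confidence, encoded_mask in detections:
        parts.append(f"{label} {confidence:.4f} {encoded_mask}")
    return " ".join(parts)


# Hypothetical detections: MASK_A carries two classes, so it appears twice.
detections = [
    (0, 0.91, "MASK_A"),
    (16, 0.40, "MASK_A"),
    (7, 0.75, "MASK_B"),
]
prediction_string = build_prediction_string(detections)
print(prediction_string)
```

Note how the same mask is repeated once per predicted class, as the competition rules describe.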

The submissions are evaluated using Mean Average Precision, or mAP. See the video below.   
https://www.youtube.com/watch?v=FppOzcDvaDI
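
To make the metric a bit more concrete, here is a simplified sketch of average precision for a single class. It assumes each detection has already been marked as a true or false positive (for example by matching it to a ground-truth mask with an IoU threshold), and it uses the plain non-interpolated form of AP. The exact variant used by the competition may differ, and the function name is my own.

```python
def average_precision(confidences, is_true_positive, n_ground_truth):
    """Non-interpolated AP for one class: mean precision at each true positive."""
    if n_ground_truth == 0:
        return 0.0
    # Rank detections from most to least confident.
    order = sorted(range(len(confidences)), key=lambda i: confidences[i], reverse=True)
    true_positives = 0
    precision_sum = 0.0
    for rank, i in enumerate(order, start=1):
        if is_true_positive[i]:
            true_positives += 1
            precision_sum += true_positives / rank  # precision at this rank
    return precision_sum / n_ground_truth


# Three detections, two ground-truth cells: hits at rank 1 and rank 3.
ap = average_precision([0.9, 0.8, 0.7], [True, False, True], n_ground_truth=2)
print(round(ap, 4))
```

mAP is then the mean of these per-class AP values.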

In [None]:
train = pd.read_csv('../input/hpa-single-cell-image-classification/train.csv')
sub = pd.read_csv('../input/hpa-single-cell-image-classification/sample_submission.csv')

In [None]:
sub.loc[0]

In [None]:
train.info()

In [None]:
train.head()

The labels are stored as '|'-separated strings, as shown above. I will convert each one to a list of ints.

In [None]:
train.Label = train.Label.apply(lambda x: list(map(int,x.split('|'))))
train.head()

In [None]:
a = sns.countplot(x=[len(x) for x in train.Label])
a.set_title('number of labels per image')

The vast majority of images have 1, 2, or 3 labels, and at least one image has 5 labels. 

In [None]:
# collect every distinct label that appears in the training set
set(label for labels in train.Label for label in labels)

There are 19 different training labels. Next I want to look at their distributions in the training set. 

Which are more common, and do they tend to occur together?

In [None]:
label_mat = np.zeros((len(train),19), int)
for image in range(len(train)):
    for label in train.Label[image]:
        label_mat[image, label] = 1

In [None]:
fig = plt.figure(figsize = (12,6))
a = sns.barplot(x=list(range(19)), y=label_mat.sum(axis = 0))
a.set_title('Distribution of Label Occurrences')

The labels are quite imbalanced: label 0 is by far the most common, while labels 11 and 18 have counts close to zero. Oversampling may be necessary. 
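
One possible way to oversample rare labels is to draw images with probability proportional to the rarity of their rarest label. The sketch below uses a small made-up label matrix (`toy_label_mat`) rather than the real data, and the weighting scheme is just one assumption among many. It also assumes every label occurs at least once, otherwise the division would fail.

```python
import numpy as np

# Toy multi-hot label matrix: 4 images, 3 labels (label 0 is the common one).
toy_label_mat = np.array([
    [1, 0, 0],
    [1, 0, 0],
    [1, 1, 0],
    [0, 0, 1],
])

label_counts = toy_label_mat.sum(axis=0)   # occurrences of each label
inverse_freq = 1.0 / label_counts          # rare labels get large weights

# Weight each image by its rarest label, then normalise to probabilities.
image_weights = (toy_label_mat * inverse_freq).max(axis=1)
sampling_probs = image_weights / image_weights.sum()
print(sampling_probs)
```

These probabilities could then be passed to something like `np.random.choice(len(sampling_probs), p=sampling_probs)` when drawing training images.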

In [None]:
plt.figure(figsize = (12,8))
a = sns.heatmap(pd.DataFrame(label_mat).corr())
a.set_title('Do labels Occur Together?')

In [None]:
pd.DataFrame(label_mat).corr()

Most labels are only weakly correlated with each other. Some pairs, for example 4 and 0, show a strong negative correlation. 

Not many strong positive correlations. 
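
Correlation can be hard to read for rare labels, so raw co-occurrence counts are a useful complement. For a multi-hot matrix, the product of its transpose with itself gives exactly that: entry (i, j) is the number of images carrying both label i and label j, and the diagonal holds each label's total count. A small sketch on made-up data (`toy_label_mat` is not the real label matrix):

```python
import numpy as np

# Toy multi-hot label matrix: 4 images, 3 labels.
toy_label_mat = np.array([
    [1, 0, 1],
    [1, 1, 0],
    [0, 0, 1],
    [1, 0, 1],
])

# co_occurrence[i, j] = number of images that have both label i and label j.
co_occurrence = toy_label_mat.T @ toy_label_mat
print(co_occurrence)
```

Here labels 0 and 2 appear together in two images, while labels 1 and 2 never appear together.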

Each entry in train.csv has four associated images, one per channel:

Blue - Nucleus  
Red - Microtubules  
Yellow - Endoplasmic reticulum  
Green - Protein of interest

First, let's visualise a few of these separately, then find a way to combine them. 

In [None]:
colours = ['_red.png', '_blue.png', '_yellow.png', '_green.png']
TRAIN_PATHS = '../input/hpa-single-cell-image-classification/train'
train_paths = [[os.path.join(TRAIN_PATHS, train.iloc[idx,0])+ colour for colour in colours] for idx in range(len(train))]

In [None]:
titles = ['microtubules','nucleus', 'endoplasmic reticulum', 'protein of interest']
fig, axs = plt.subplots(3, 4, figsize =(16,8))
for entry in range(3):
    for channel in range(4):
        img = plt.imread(train_paths[entry][channel])
        axs[entry, channel].imshow(img)        
        if entry == 0:
            axs[0, channel].set_title(titles[channel])

In [None]:
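def to_rgb(idx, paths = train_paths, gbr = False, blue_only = False):
    # Read each channel as greyscale (flag 0). The gbr argument is currently unused.
    red = cv2.imread(paths[idx][0], 0)     # microtubules
    yellow = cv2.imread(paths[idx][2], 0)  # endoplasmic reticulum
    blue = cv2.imread(paths[idx][1], 0)    # nucleus

    if blue_only:
        return cv2.resize(blue, (512, 512))
    # Stack as (red, yellow, blue), so yellow is displayed in the green channel.
    return np.dstack((cv2.resize(red, (512, 512)),
                      cv2.resize(yellow, (512, 512)),
                      cv2.resize(blue, (512, 512))))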


In [None]:
#to visualise classes, will take images with single class labels. 

single_label = []
for label in range(19):
    for idx in range(len(train)):
        if train.loc[idx, 'Label'] == [label]:
            single_label.append(idx)
            break
titles_2 = [
'0-Nucleoplasm',
'1-Nuclear membrane',
'2-Nucleoli',
'3-Nucleoli fibrillar center',
'4-Nuclear speckles',
'5-Nuclear bodies',
'6-Endoplasmic reticulum',
'7-Golgi apparatus',
'8-Intermediate filaments',
'9-Actin filaments',
'10-Microtubules',
'11-Mitotic spindle',
'12-Centrosome',
'13-Plasma membrane',
'14-Mitochondria',
'15-Aggresome',
'16-Cytosol',
'17-Vesicles and punctate cytosolic patterns',
'18-Negative'
]


In [None]:
fig, axs = plt.subplots(19,1, figsize = (10, 100))

for label, idx in enumerate(single_label):
    axs[label].imshow(to_rgb(idx))
    axs[label].set_title(titles_2[label])

A note about 'Negative' labels from the organiser:

> 1) Image-level label uncertainty The image-level labels are what we refer to as weak or noisy. During annotation, the image-level labels are set per sample (i.e per a group of up to 6 images from the same sample). This means that the labels present in the majority of the images will be annotated. For negative samples it is not uncommon that lets say 4 images show no staining, while the remaining 2 show some unspecific staining or some granular pattern. If you compare the image-level label with the precise pattern observed in any given cell from this group of images, the label will be correct for the vast majority of cells, but perhaps not for all of them (as in your example 1 and 3 above).

> 2) Single cell label accuracy in test set The test set consists of images where each single cell has been annotated independently. Hence the accuracy of these labels is much better, and will be correct for each cell in every image. The statement below made by Trang Le, @lnhtrang, in our notebook explaining the patterns is correct for how the test set was annotated.


So in other words, in some of the 'Negative' samples, one or two of the cells in the image may in fact have a protein pattern, but most will not. 

How to best carry out segmentation on this dataset? 

Popular models being used for this competition include

https://github.com/CellProfiling/HPA-Cell-Segmentation  

https://www.cellpose.org/

These are both pre-trained models based on variations of convolutional neural networks called U-Nets. Before going near either of those, as a newcomer to the field of computer vision and object identification, I want to explore some more basic segmentation techniques in order to appreciate the need for the tools listed above. 


The task we have is to perform single cell classification, therefore we are going to have to produce masks over each cell we identify and classify them individually. How best to produce these masks? Why do we need a pre-trained CNN at all?


Thresholding pretty clearly isn't going to be the answer given that the task is to segment the individual cells in the image, and the background is already completely distinct.

The first segmentation technique that I will implement is a graph-based technique first proposed by Felzenszwalb. It involves representing our image as an undirected graph G = (V,E) of vertices and edges, with each vertex being a single pixel and each edge connecting a pair of vertices. An edge has a 'weight' which is given by the distance between the two vertices (pixels) it connects. The distance is measured as the difference between the two pixels in colour, intensity, and location. 

A segmentation solution is a partition of V into multiple connected components. Starting with each pixel in its own component, the segmentation is arrived at through a 'bottom-up' technique. 

The basic idea is that if the difference between two components is small compared to the internal difference within each of them, the two components are merged; otherwise they are left separate. 

The higher the scale factor 'k', the larger the difference between two components must be before they are kept separate. A higher k therefore gives us larger components/segments.

For more information on how the algorithm works see http://cs.brown.edu/people/pfelzens/papers/seg-ijcv.pdf
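
The merge rule from the paper can be sketched in a few lines. Two components are merged when the weight of the lightest edge between them is no larger than the smaller of the two "internal difference plus k / size" thresholds. This is only the decision rule, not the full algorithm (which processes edges in sorted order using a union-find structure), and the function and argument names are my own.

```python
def should_merge(edge_weight, internal_diff_a, size_a, internal_diff_b, size_b, k):
    """Felzenszwalb-style merge test for two components A and B."""
    # Each component tolerates its own internal difference plus a size-dependent slack.
    threshold_a = internal_diff_a + k / size_a
    threshold_b = internal_diff_b + k / size_b
    return edge_weight <= min(threshold_a, threshold_b)


# Same pair of components, two different scale factors.
small_k = should_merge(30, internal_diff_a=10, size_a=50, internal_diff_b=12, size_b=40, k=500)
large_k = should_merge(30, internal_diff_a=10, size_a=50, internal_diff_b=12, size_b=40, k=2000)
print(small_k, large_k)
```

With k=500 the threshold is 20, so an edge of weight 30 keeps the components apart. With k=2000 the threshold rises to 50 and they merge, which is why larger k gives larger segments.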


In [None]:
fig, axs = plt.subplots(4,4, figsize = (30, 30))
ims = [to_rgb(i) for i in range(4)]
k = [0,500, 1000, 2000]
for i, image in enumerate(ims):
    for j in range(4):
        if j == 0:
            axs[i,0].imshow(image)
        else:
            axs[i,j].imshow(skimage.segmentation.felzenszwalb(image, scale = k[j]))
            axs[i,j].set_title('Image {} k='.format(i) + str(k[j]))

As we can see, this approach runs into problems where cells overlap, as is the case with much of our dataset. 

Selective Search is an algorithm used to propose regions which contain objects. It is built on top of a segmentation algorithm such as Felzenszwalb. 

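It is a hierarchical grouping algorithm: starting from the regions obtained by Felzenszwalb, it iteratively groups together the most similar neighbouring regions until the whole image is a single region, keeping every intermediate region as a candidate. The min_size parameter is passed to the underlying Felzenszwalb step, where it sets the minimum size of the initial regions. 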

sigma is a Gaussian blur parameter to smooth the image.  


Note that this algorithm returns bounding boxes, not the image masks we will need for submission. Even so, it will be interesting to see if it can reliably identify individual cells, especially when their boundaries overlap. 

In [None]:

def selective_search_regions(idx, scale = 500, sigma = 0.8, min_size = 100, min_region = 1000):
    img = to_rgb(idx)
    img_lbl,regions = selectivesearch.selective_search(img, scale=scale, sigma=sigma, min_size=min_size)
    candidates = set()
    for r in regions:
        # excluding same rectangle (with different segments)
        if r['rect'] in candidates:
            continue
        # excluding regions smaller than min_region pixels
        if r['size'] < min_region:
            continue
        # excluding distorted (far from square) or degenerate rects
        x, y, w, h = r['rect']
        if w == 0 or h == 0 or w / h > 1.2 or h / w > 1.2:
            continue
        candidates.add(r['rect'])
    return img_lbl, candidates


In [None]:
fig, axs = plt.subplots(4,1,figsize = (10,40))
for idx in range(4):
    img = to_rgb(idx)
    img_lbl, candidates = selective_search_regions(idx)
    # draw rectangles on the original image
    axs[idx].imshow(img)
    for x, y, w, h in candidates:
        rect = mpatches.Rectangle(
            (x, y), w, h, fill=False, edgecolor='red', linewidth=1)
        axs[idx].add_patch(rect)

    
    
    

As you can see, the results are pretty unsatisfactory. Even after playing around with the parameters, a set of parameters that correctly identifies the different cells in one image will produce far too many regions (or too few) in the next. We need a more robust method for our dataset of thousands of images of cells. We also need a model that predicts masks, rather than one that only locates the objects. 

A U-Net is a form of CNN designed for semantic segmentation. Because our task is closer to instance segmentation, there is quite a lot of post-processing to be done on the output of a U-Net in order to produce masks for the individual cells in our images. Luckily, a lot of that work has been done for us by the creators of https://github.com/CellProfiling/HPA-Cell-Segmentation. 
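
To see why post-processing is needed, here is a toy sketch of the simplest possible step from a semantic mask to instances: connected-component labelling with `scipy.ndimage.label`. This is not the post-processing used by HPA-Cell-Segmentation, which is more involved. It only illustrates that separated blobs become separate instances, while touching cells collapse into one.

```python
import numpy as np
import scipy.ndimage as ndi

# Toy semantic mask: two "cells" with a gap between them.
semantic = np.zeros((6, 10), dtype=bool)
semantic[1:4, 1:4] = True   # cell A
semantic[1:4, 6:9] = True   # cell B

labelled, n_cells = ndi.label(semantic)     # each blob gets its own integer id
print(n_cells)

# Now let the two cells touch through a thin bridge of foreground pixels.
touching = semantic.copy()
touching[2, 4:6] = True
_, n_touching = ndi.label(touching)         # the blobs are now one component
print(n_touching)
```

The second case is exactly the overlapping-cell situation in our images, which is why extra information (such as the nuclei) is needed to split the cells apart.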

Another approach would be to use a purpose-built instance segmentation model such as Mask R-CNN, which outputs an individual mask, bounding box, and class prediction for each object.  

In [None]:
def load_images(df = train, paths = train_paths, n = 4):
    # Load the first n images as scaled RGB stacks and nucleus-only (blue) images.
    blue = []
    rgb = []
    for i in range(n):
        rgb_img = to_rgb(i, paths = paths)/255 
        blue_img = to_rgb(i, paths = paths, blue_only = True)/255
        rgb.append(rgb_img)
        blue.append(blue_img)
    return rgb, blue
    

In [None]:
import hpacellseg.cellsegmentator as cellsegmentator
from hpacellseg.utils import label_cell, label_nuclei
NUC_MODEL = "../input/hpacellsegmentatormodelweights/dpn_unet_nuclei_v1.pth"
CELL_MODEL = "../input/hpacellsegmentatormodelweights/dpn_unet_cell_3ch_v1.pth"
segmentator = cellsegmentator.CellSegmentator(
    NUC_MODEL,
    CELL_MODEL,
    device="cuda",
    multi_channel_model=True,
)

N = 4
rgb, blue = load_images(n = N)
masks = []


nuc_segmentations = segmentator.pred_nuclei(blue)
cell_segmentations = segmentator.pred_cells(rgb, precombined = True)

for nuc, cell in zip(nuc_segmentations, cell_segmentations):
    masks.append(label_cell(nuc, cell))



In [None]:
plt.imshow(cell_segmentations[0])

The above shows the output from the semantic segmentation performed by the U-Net. Below are the masks once post-processing is complete. 

In [None]:
fig, axs = plt.subplots(4,3, figsize = (20,30))

for i in range(N):
    axs[i,0].imshow(to_rgb(i))
    axs[i,1].imshow(masks[i][1])
    axs[i,2].imshow(to_rgb(i))
    axs[i,2].imshow(masks[i][1], alpha = 0.5)
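
For submission we will eventually need one binary mask per cell. Assuming the post-processed cell mask is a labelled image in which 0 is background and every cell has its own positive integer id, it can be split as sketched below. The helper name `split_instances` and the toy mask are hypothetical.

```python
import numpy as np


def split_instances(labelled_mask):
    """Return one boolean mask per non-zero label in a labelled image."""
    cell_ids = [cell_id for cell_id in np.unique(labelled_mask) if cell_id != 0]
    return [labelled_mask == cell_id for cell_id in cell_ids]


# Toy labelled mask with three "cells" (ids 1, 2, 3) on a background of 0.
toy_mask = np.array([
    [0, 1, 1, 0],
    [0, 1, 0, 2],
    [3, 0, 2, 2],
])
cell_masks = split_instances(toy_mask)
print(len(cell_masks), [int(m.sum()) for m in cell_masks])
```

Each boolean mask could then be cropped, classified, and encoded separately.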

TBC when I have the time. Appreciate any comments + advice