# Image Analysis with Python - <font color='teal'>Tutorial Pipeline Section 2</font>

*originally created in 2016*<br>
*updated and converted to a Jupyter notebook in 2017*<br>
*updated and converted to python 3 in 2018*<br>
*by Jonas Hartmann (Gilmour group, EMBL Heidelberg)*<br>
*updated in 2022 by Cheng-Yu Huang*<br>

##  Table of Contents

1. [About this Tutorial](#about)
2. [Initialization](#initialize)
11. [Postprocessing: Removing ROIs at the Image Border](#postpro)
12. [Identifying Cell Edges](#edges)
13. [Extracting Quantitative Measurements](#measure)
14. [Simple Analysis & Visualization](#analysis)
15. [Writing Output to Files](#write)
16. [BONUS - Batch Processing](#batch)

##  About this Tutorial <a id=about></a>

*This notebook covers part 2 of the image analysis tutorial.*


#### Instructions

- In section 1 of the Codelab, you performed adaptive thresholding and connected-component analysis on the raw image.

- Here we continue where we left off. Starting with the segmentation result, we will first remove all cells touching the border of the image and then detect the edge of each cell. Finally, we will perform a statistical analysis of the results.

## Initialization <a id=initialize></a>

In this section we will load the raw image data and our segmentation results for further processing.

In [None]:
# (i) Importing necessary modules and packages

# The numerical arrays manipulation module numpy as np
import numpy as np

# The plotting module matplotlib.pyplot as plt
import matplotlib.pyplot as plt

# The image processing module scipy.ndimage as ndi
import scipy.ndimage as ndi

# Import imread function from skimage.io
from skimage.io import imread

In [None]:
# (ii) Specify the directory path and file name

# Create a string variable with the relative (or absolute) path to your raw image
# and segmentation results. 
img_filepath = 'example_data/example_cells_1.tif'  # forward slashes work on all platforms
seg_filepath = r'example_cells_1_seg.tif'

In [None]:
# (iii) Load the raw image and the segmentation results

# Read images
img = imread(img_filepath)
seg = imread(seg_filepath)

In [None]:
# (iv) Look at the images to confirm that everything worked as intended

# Imshow the raw image
plt.imshow(img, interpolation='none', cmap='gray')
# Overlay the segmentation result, with an alpha value of 0.4
plt.imshow(seg, interpolation='none', cmap='prism', alpha=0.4)

## Postprocessing: Removing ROIs at the Image Border <a id=postpro></a>

#### <font color='teal'> Exercise </font>

Iterate through all the ROIs in your segmentation and remove those touching the image border.

Follow the instructions in the comments below. Note that the instructions will get a little less specific from here on, so you need to figure out how to approach a problem yourself.
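
One possible way to approach this exercise is sketched below on a tiny synthetic label image rather than on the tutorial data. The names `toy_seg`, `border_mask`, `border_ids`, `relabeled` and `overlay` are illustrative choices made for this sketch, and the border mask is built by array indexing, which is only one of several valid options.

```python
import numpy as np

# Toy label image (an assumption for illustration): 0 is background,
# labels 1 and 3 touch the image border, label 2 does not.
toy_seg = np.array([[1, 1, 0, 0, 0, 0],
                    [1, 1, 0, 2, 2, 0],
                    [0, 0, 0, 2, 2, 0],
                    [0, 0, 0, 0, 0, 0],
                    [0, 0, 0, 0, 3, 3]])

# Border mask via array indexing: True on the outermost rows and columns.
border_mask = np.zeros(toy_seg.shape, dtype=bool)
border_mask[0, :] = border_mask[-1, :] = True
border_mask[:, 0] = border_mask[:, -1] = True

# IDs of the ROIs that touch the border (background 0 is dropped).
border_ids = np.unique(toy_seg[border_mask])
border_ids = border_ids[border_ids != 0]

# Set the border-touching ROIs to background in a copy of the segmentation.
clean_seg = np.copy(toy_seg)
for roi_id in border_ids:
    clean_seg[clean_seg == roi_id] = 0

# Optional: relabel the remaining ROIs consecutively from 1 to N.
remaining_ids = np.unique(clean_seg)
remaining_ids = remaining_ids[remaining_ids != 0]
relabeled = np.zeros_like(clean_seg)
for new_id, old_id in enumerate(remaining_ids, start=1):
    relabeled[clean_seg == old_id] = new_id

# Masked array that hides the background, e.g. for a transparent overlay.
overlay = np.ma.array(relabeled, mask=relabeled == 0)
```

Indexing the segmentation with the Boolean mask is used here instead of multiplying the two arrays; both routes should yield the same set of border IDs.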

In [None]:
# (i) Create an image border mask

# We need some way to check if a cell is at the border. For this, we generate a 'mask' of the image border,
# i.e. a Boolean array of the same size as the image where only the border pixels are set to `1` and all 
# others to `0`, like this:
#   1 1 1 1 1
#   1 0 0 0 1
#   1 0 0 0 1
#   1 0 0 0 1
#   1 1 1 1 1
# There are multiple ways of generating this mask, for example by erosion or by array indexing.
# It is up to you to find a way to do it. (Hint: one of the easiest ways to do this is via scipy.ndimage.binary_dilation;
# check the parameter "border_value".)

### YOUR CODE HERE!

In [None]:
# (ii) 'Delete' the cells at the border

# 1) Find the cell ROIs that are crossing the border of the image

# Find the border ROI IDs by first multiplying the border_mask by the segmentation mask
### YOUR CODE HERE!

# Then get an array of ROI IDs by finding the unique elements in the array
### YOUR CODE HERE!

In [None]:
# 2) 'Delete' ROIs by their IDs

# Create a copy of the segmentation with np.copy()
### YOUR CODE HERE!

# Iterate over the ROI IDs on the border and set those ROIs to background (0)
### YOUR CODE HERE!
    
    # Create a mask that contains only the 'current' ROI of the iteration
    ### YOUR CODE HERE!
    
    # Set the position of that roi_mask to background (zero) in the clean_seg
    ### YOUR CODE HERE!

In [None]:
# OPTIONAL: re-label the remaining cells to keep the numbering consistent from 1 to N (with 0 as background).
# Hint: Use the Python built-in function 'enumerate'

# Use enumerate
### YOUR CODE HERE!

In [None]:
# (iii) Visualize the result

# Show the result as transparent overlay over the raw or smoothed image. 
# Here you have to combine alpha (to make cells transparent) and 'np.ma.array'
# (to hide empty space where the border cells were deleted).

# Create mask by 'np.ma.array'
### YOUR CODE HERE!

# Show image
### YOUR CODE HERE!

## Identifying Cell Edges <a id=edges></a>

#### <font color='teal'> Exercise </font>

Create a labeled mask of cell edges by following these steps:


- Create an array of the same size and data type as the segmentation but filled with only zeros
    - This will be your final cell edge mask; you gradually add cell edges as you iterate over cells
    

- *For each cell...*
    - Erode the cell's mask by 1 pixel
    - Using the eroded mask and the original mask, create a new mask of only the cell's edge pixels
    - Add the cell's edge pixels into the empty image generated above, labeling them with the cell's original ID number


Follow the instructions in the comments below.
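
The steps above can be sketched as follows on a small synthetic label image. This is a minimal illustration, not the tutorial's reference solution; `toy_seg` and `edges` are names made up for this sketch, and the erosion uses the default cross-shaped structuring element of `ndi.binary_erosion`.

```python
import numpy as np
import scipy.ndimage as ndi

# Toy label image (an assumption for illustration) with two square cells.
toy_seg = np.zeros((7, 12), dtype=np.uint8)
toy_seg[1:6, 1:6] = 1   # 5x5 cell with label 1
toy_seg[2:6, 7:11] = 2  # 4x4 cell with label 2

# Empty array of the same shape and dtype that will collect the labeled edges.
edges = np.zeros_like(toy_seg)

# Loop over all cell IDs, skipping the background label 0.
for cell_id in np.unique(toy_seg)[1:]:
    cell_mask = toy_seg == cell_id
    # Erode the cell's mask by one pixel.
    eroded = ndi.binary_erosion(cell_mask, iterations=1)
    # Pixels in the original mask but not in the eroded mask form the edge.
    edge_mask = np.logical_xor(cell_mask, eroded)
    # Write the edge pixels into the output, labeled with the cell's ID.
    edges[edge_mask] = cell_id
```

Raising `iterations` would make the edges thicker, at the cost of changing any measurement that is later taken on the edge pixels.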

In [None]:
# (i) Create an array of the same size and data type as the segmentation but filled with only zeros

# Hint: use np.zeros_like()
### YOUR CODE HERE!

In [None]:
# (ii) Iterate over the ROI IDs
### YOUR CODE HERE!
    
    # (iii) Erode the ROI's mask by 1 pixel
    # Hint: 'ndi.binary_erosion'
    ### YOUR CODE HERE!
    
    # (iv) Create the cell edge mask
    # Hint: 'np.logical_xor'
    ### YOUR CODE HERE!
    
    # (v) Add the cell edge mask to the empty array generated above, labeling it with the cell's ID
    ### YOUR CODE HERE!

In [None]:
# (vi) Visualize the result

# Note: Because the lines are so thin (1pxl wide), they may not be displayed correctly in small figures.
#       You can 'zoom in' by showing a sub-region of the image which is then rendered bigger. You can
#       also go back to the edge identification code and make the edges multiple pixels wide (but keep 
#       in mind that this will have an effect on your quantification results!).

### YOUR CODE HERE!

## Extracting Quantitative Measurements <a id=measure></a>

#### <font color='teal'>Exercise</font>

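Extract the measurements named by the dictionary keys below (cell ID, mean cell intensity, mean membrane intensity, cell area, and cell edge length) for each cell and collect them in a dictionary.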

Note: The ideal data structure for data like this is the `DataFrame` offered by the module `Pandas`. However, for the sake of simplicity, we will here stick with a dictionary of lists.

Follow the instructions in the comments below.
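
As a rough sketch of the pattern, the loop below fills a dictionary of lists from synthetic data. The arrays `toy_seg` and `toy_img` are made up for illustration, and two choices are assumptions of this sketch rather than requirements of the exercise: the membrane intensity is taken as the mean over the cell's edge pixels, and the edge length is taken as the number of edge pixels.

```python
import numpy as np
import scipy.ndimage as ndi

# Toy data (assumptions for illustration): a label image with two cells
# and a matching random intensity image with a fixed seed.
rng = np.random.default_rng(0)
toy_seg = np.zeros((8, 12), dtype=np.uint8)
toy_seg[1:6, 1:6] = 1
toy_seg[2:6, 7:11] = 2
toy_img = rng.integers(0, 255, size=toy_seg.shape)

results = {"cell_id": [], "int_mean": [], "int_mem_mean": [],
           "cell_area": [], "cell_edge": []}

for cell_id in np.unique(toy_seg)[1:]:  # skip background (0)
    cell_mask = toy_seg == cell_id
    # Edge pixels: the mask minus its one-pixel erosion.
    edge_mask = cell_mask & ~ndi.binary_erosion(cell_mask)

    results["cell_id"].append(int(cell_id))
    results["int_mean"].append(float(np.mean(toy_img[cell_mask])))
    results["int_mem_mean"].append(float(np.mean(toy_img[edge_mask])))
    results["cell_area"].append(int(np.sum(cell_mask)))  # area in pixels
    results["cell_edge"].append(int(np.sum(edge_mask)))  # edge length in pixels
```

Because every list receives exactly one value per cell, the dictionary can later be passed to `pd.DataFrame` without further reshaping.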

In [None]:
# (i) Create a dictionary that contains a key-value pairing for each measurement

# The keys should be strings describing the type of measurement (e.g. 'intensity_mean') and 
# the values should be empty lists. These empty lists will be filled with the results of the
# measurements.

results = {"cell_id"      : [],
           "int_mean"     : [],
           "int_mem_mean" : [],
           "cell_area"    : [],
           "cell_edge"    : []}

# Solution note: the spacing between the strings and colons doesn't matter for the code's
# execution. It is used solely to make the code more readable!

In [None]:
# (ii) Record the measurements for each cell

# Iterate over the segmented cells ('np.unique').
# Inside the loop, create a mask for the current cell and use it to extract the measurements listed above. 
# Add them to the appropriate list in the dictionary using the 'append' method.
# Hint: Remember that you can get out all the values within a masked area by indexing the image 
#       with the mask. For example, 'np.mean(image[cell_mask])' will return the mean of all the 
#       intensity values of 'image' that are masked by 'cell_mask'!

# Get cell ids with np.unique
### YOUR CODE HERE!

# Iterate over cell IDs
### YOUR CODE HERE!

    # Mask the current cell and cell edge
    ### YOUR CODE HERE!
    
    # Get the measurements
    ### YOUR CODE HERE!

In [None]:
# (iii) Import Pandas as pd, and make the dictionary a pandas object

# Import pandas as pd
import pandas as pd

# Convert results to a pandas DataFrame ('pd.DataFrame()')
df = pd.DataFrame(results)

# Show the pandas dataframe
df

In [None]:
# (iv) You can write the pandas dataframe to a csv file for data analysis in other software
df.to_csv("measurement.csv")

# To Read:
# df1 = pd.read_csv('measurement.csv')

In [None]:
df.describe()

## Simple Analysis & Visualization <a id=analysis></a>

#### Background

By extracting quantitative measurements from an image we cross over from 'image analysis' to 'data analysis'. 

This section briefly explains how to do basic data analysis and plotting, including boxplots, scatterplots and linear fits. It also showcases how to map data back onto the image, creating an "image-based heatmap".
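
Before fitting lines to the measurements, it may help to see on synthetic data how a linear fit can be highly significant and still explain almost none of the variance. The sketch below is not part of the original exercise; the data, the seed, and the names `x`, `y` and `fit` are made up for illustration.

```python
import numpy as np
from scipy.stats import linregress

# Synthetic data (an assumption for illustration): y depends only very
# weakly on x, and there are many data points.
rng = np.random.default_rng(42)
n_points = 100_000
x = rng.normal(size=n_points)
y = 0.05 * x + rng.normal(size=n_points)

# Ordinary least-squares fit of y against x.
fit = linregress(x, y)
r_squared = fit.rvalue ** 2

print(f"slope     = {fit.slope:.3f}")
print(f"r-squared = {r_squared:.4f}")
print(f"p-value   = {fit.pvalue:.2e}")
```

With this many points the p-value is tiny although r-squared stays well below 0.01, which is one reason to report effect sizes alongside p-values.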

#### <font color='teal'>Exercise</font>

Analyze and plot the extracted data in a variety of ways.

Follow the instructions in the comments below.
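
For the histogram and box plot, a minimal sketch on synthetic numbers might look like the following. The dictionary `toy_results` and its values are invented for illustration and do not come from the tutorial images.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-in for the measurements (an assumption for illustration).
rng = np.random.default_rng(1)
toy_results = {"cell_area": rng.normal(500, 100, size=200),
               "int_mean": rng.normal(60, 10, size=200),
               "int_mem_mean": rng.normal(80, 15, size=200)}

# Histogram of cell areas; 'bins' controls how finely the data are grouped.
plt.figure(figsize=(6, 4))
counts, bin_edges, _ = plt.hist(toy_results["cell_area"], bins=20)
plt.xlabel("cell area [pxl^2]")
plt.ylabel("number of cells")

# Box plot comparing the two intensity measurements side by side.
plt.figure(figsize=(4, 4))
box = plt.boxplot([toy_results["int_mean"], toy_results["int_mem_mean"]])
# Tick labels are set separately here because the name of the labeling
# keyword of 'plt.boxplot' differs between Matplotlib versions.
plt.xticks([1, 2], ["int_mean", "int_mem_mean"])
plt.ylabel("mean intensity [a.u.]")
```

Trying a few different values for `bins` is a quick way to check whether an apparent feature of the distribution is real or just an artifact of the binning.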

In [None]:
# (i) Familiarize yourself with the data structure of the results dict and summarize the results

# 1) Try to print the mean of mean intensity of all cells
mean_int_mean = np.mean(df['int_mean'])
print(f'Mean of int_mean is {mean_int_mean}')

In [None]:
# 2) Try df.describe() to get summary statistics
# Bonus: can you display all of the numbers rounded to 2 decimal places? (Try Google)
pd.options.display.float_format = "{:.2f}".format
df.describe()


In [None]:
# (ii)-1 Create a histogram showing the distribution of cell surface area in pixels 

# Use the function 'plt.hist'. Change the "bins" parameter of the function to see the more detailed 
# trend of the data. What do you observe?

### YOUR CODE HERE!

In [None]:
# (ii)-2 Create a box plot showing the mean cell and mean membrane intensities. 

# Use the function 'plt.boxplot'. Use its 'labels' keyword ('tick_labels' from Matplotlib 3.9 on) to label the x axis with 
# the corresponding key names. Feel free to play around with the various options of the boxplot 
# function to make your plot look nicer. Remember that you can first call 'plt.figure' to adjust 
# settings such as the size of the plot.

### YOUR CODE HERE!

In [None]:
# (iii) Create a scatter plot of cell outline length over cell area

# Use the function 'plt.scatter' for this. Be sure to properly label the 
# plot using 'plt.xlabel' and 'plt.ylabel'.
# Note: it is a good idea to make the markers (the data points) semi-transparent, so that
# regions where the plot looks more opaque reveal overlapping data points.

plt.figure(figsize=(8,5))
# Labels are attached to each plot element so the legend entries cannot get swapped.
plt.scatter(results["cell_area"], results["cell_edge"],
            edgecolor='k', s=30, alpha=0.5, label='data')
plt.xlabel('cell area [pxl^2]')
plt.ylabel('cell edge length [pxl]')

# BONUS: Do you understand why you are seeing the pattern this produces? 
###
# ->> The curve reflects how circumference scales with area!

# Can you generate a 'null model' curve that assumes all cells to be circular?
cell_area_range = np.linspace(min(results["cell_area"]), max(results["cell_area"]), num=100)
circle_circumference = 2 * np.pi * np.sqrt(cell_area_range / np.pi)
plt.plot(cell_area_range, circle_circumference, color='r', alpha=0.8, label='circles')
plt.legend(loc=2, fontsize=10)

# What is the result? Do you notice something odd about it? What could be the reason for
# this and how could it be fixed?
###
# ->> In general, the cells don't deviate all that much from the circular case.
# ->> Strangely, some cells have a smaller outline than the circumference of a circle
#     of equivalent area. This is mathematically impossible.
# ->> A possible reason could be that the measures are taken in pixels, which leads
#     to a so-called discretization error. It could be fixed by "meshing" the cell
#     outline and interpolating a more accurate measurement of circumference.

In [None]:
# (iv) Perform a linear fit of membrane intensity over cell area

# Use the function 'linregress' from the module 'scipy.stats'. Be sure to read the docs to
# understand the output of this function. Print the output.

# Compute linear fit
from scipy.stats import linregress
linfit = linregress(df["cell_area"], df["int_mem_mean"])

# Print all the results
linprops = ['slope', 'intercept','rvalue','pvalue', 'stderr'] #linfit properties
for index,prop in enumerate(linprops):
    print( prop, '\t', '{:4.2e}'.format(linfit[index]) )

In [None]:
# (v) Think about the result

# Note that the fit seems to return a highly significant p-value but a very low correlation 
# coefficient (r-value). Based on prior knowledge, we would not expect a linear correlation of 
# this sort to be present in our data. 
#
# This should prompt several questions:
#   1) What does this p-value actually mean? Check the docs of 'linregress'!
###
#       ->> This p-value only means that, given a linear fit through this data, the slope of the
#           fit is very unlikely to be zero. However, it does not make a statement on whether or
#           not it makes sense to use a linear fit in the first place. Looking at the scatterplot
#           below or at the correlation coefficient r, it is clear that a linear fit on this data
#           is not meaningful.
#       ->> Note also: With single-cell approaches, we quickly get to a large number of data points. 
#           This makes hypothesis testing in general less useful, as p-values tend to become very
#           small even for effects that are too small to matter. It makes sense to instead report
#           effect sizes. This is a tricky topic but well worth reading up on.
#
#   2) Could there be artifacts in our segmentation that bias this analysis?
###
#       ->> Oversegmentation is an important source of bias here. If a cell is oversegmented,
#           it will be considered as two or three cells. These will naturally have a lower
#           cell area and will naturally have a lower membrane intensity because some of their
#           edges are actually not on membranes. In other words, they will fall into the bottom
#           left of the plot, distorting the data.
#
# In general, it's always good to be very careful when doing any kind of data analysis. Make sure you 
# understand the functions you are using and always check for possible errors or sources of bias!

In [None]:
# (vi) Overlay the linear fit onto a scatter plot

# Recall that a linear function is defined by `y = slope * x + intercept`.

# To define the line you'd like to plot, you need two values of x (the starting point
# and the end point of the line). What values of x make sense? Can you get them automatically?
#   ->> The max and min values in the data are a good choice.
x_vals = [min(df["cell_area"]), max(df["cell_area"])]

# When you have the x-values for the starting point and end point, get the corresponding y 
# values from the fit through the equation above.
y_vals = [linfit[0] * x_vals[0] + linfit[1], linfit[0] * x_vals[1] + linfit[1]]

# Plot the line with 'plt.plot'. Adjust the line's properties so it is well visible.
# Note: Remember that you have to create the scatterplot before plotting the line so that
#       the line will be placed on top of the scatterplot.
plt.figure(figsize=(8,5))
plt.scatter(df["cell_area"], df["int_mem_mean"], 
            edgecolor='k', s=30, alpha=0.5)
plt.plot(x_vals, y_vals, color='red', lw=2, alpha=0.8)

# Use 'plt.legend' to add information about the line to the plot.
plt.legend(["linear fit, Rsq={:4.2e}".format(linfit[2]**2.0)], frameon=False, loc=4)

# Label the plot and finally show it with 'plt.show'.
plt.xlabel("cell area [pxl]")
plt.ylabel("Mean membrane intensity [a.u.]")
plt.title("Scatterplot with linear fit")
plt.show()

In [None]:
# (vii) Map the cell area back onto the image as a 'heatmap'

# Scale the cell area data to 8bit so that it can be used as pixel intensity values.
areas_8bit = np.array(df["cell_area"]) / max(df["cell_area"]) * 255

# Initialize a new image array, with dtype as uint8
area_map = np.zeros_like(clean_seg, dtype = np.uint8)

# Iterate over the segmented cells
for index, cell_id in enumerate(df["cell_id"]):
    
    # Extract cell mask
    cell_mask = clean_seg == cell_id
    
    # Add cells to the area map
    area_map[cell_mask] = areas_8bit[index]
    
# BONUS: See if you can exclude outliers to make the color mapping more informative!
    
# Visualize the results:
# Create the mask array
area_map_mask = np.ma.array(area_map, mask = area_map == 0)
plt.imshow(img, interpolation='none', cmap='gray')
plt.imshow(area_map_mask, interpolation='none', cmap='viridis', alpha=0.6)

In [None]:
# (viii) Write a figure to a png or pdf

# Recreate the scatter plot from above (with or without the regression line), then save the figure
# as a png using 'plt.savefig'. Alternatively, you can also save it to a pdf, which will create a
# vector graphic that can be imported into programs like Adobe Illustrator.

plt.scatter(df["cell_area"], df["int_mem_mean"], 
            edgecolor='k', s=30, alpha=0.5)
plt.plot(x_vals, y_vals, color='red', lw=2, alpha=0.8)
plt.legend(["linear fit, Rsq={:4.2e}".format(linfit[2]**2.0)], frameon=False, loc=4)
plt.xlabel("cell area [pxl]")
plt.ylabel("Mean membrane intensity [a.u.]")
plt.title("Scatterplot with linear fit")

# Save as png and pdf
plt.savefig('example_cells_1_scatterFit.png')
plt.savefig('example_cells_1_scatterFit.pdf')
plt.clf()  # Clear the figure buffer

## \**BONUS\** >> Batch Processing: See tutorial section 3 <a id=batch></a>

## <font color='teal'>*Congratulations! You have completed the tutorial!*</font>

**We hope you enjoyed the ride and learned a lot!**

### Concluding Remarks

It's important to remember that the phrase ***"Use it or lose it!"*** fully applies to the skills taught in this tutorial.

If you now just go back to the lab and don't touch python or image analysis for the next half year, most of the things you have learned here will be lost.

So, what can you do?


- If possible, start applying what you have learned to your own work right away


- Even if your current work doesn't absolutely *need* coding / image analysis (which to be honest is hard to believe! ;p), you can still use it at least to make some nice plots!


- Another very good approach is to find yourself an interesting little side project you can play around with

***We wish you the best of luck for all your coding endeavors!***