# CHEM 60 - March 4th, 2024 (Image Processing)

Today, we get to play with a new kind of data! Images!

To get started, click on '**File**' in the left menu, then '**Save a copy in Drive**' to ensure you are editing *your* version of this assignment (if you don't, your changes won't be saved!). After you click '**Save a copy in Drive**' a popup that says **Notebook copy complete** should appear, and it may ask you to <font color='blue'>**Open in a new tab**</font>. When open, your new file will be named `Copy of CHEM60_Class_14_....ipynb` (you may want to rename it before/after you move it to your chosen directory).

# Imports

Here are the Python imports that we will need today plus the usual formatting things.

Run the below code block to get started.

In [None]:
# Standard library imports
import copy
import math as m

# Third party imports
import cv2
import matplotlib.patches as patches
import matplotlib.gridspec as gridspec
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from PIL import Image
from PIL import ImageOps
from PIL.PngImagePlugin import PngImageFile, PngInfo


# This part of the code block is telling matplotlib to make certain font sizes extra, extra large by default
# Here is where I list what parameters I want to set new defaults for
params = {'legend.fontsize': 'xx-large',
         'axes.labelsize': 'xx-large',
         'axes.titlesize':'xx-large',
         'xtick.labelsize':'xx-large',
         'ytick.labelsize':'xx-large'}
# This line updates the default parameters of pyplot (to use our larger fonts)
plt.rcParams.update(params)

First, mount the Drive. You've done this every week now! If the details of things like imports or data access need to be clarified, go back and check out the [class 0](https://colab.research.google.com/drive/1q96pdc5CBfjhqkALe-ohqPJwNMcXzwqS?usp=share_link) notebook on this.

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

Assuming you ended up with `Mounted at /content/gdrive`, you're good to move on!



---



# The Motivation

Today's class is another special one - a data story from our chemistry department. Prof. Haushalter shared a lovely result from his research group and details on their main experimental technique.

>  [T]he majority of my research lab's raw data comes from gel electrophoresis, which separates proteins or nucleic acids into distinct bands which can be visualized by staining and then imaged with a specialized digital camera.  The camera's software (e.g BioRad ImageLab) quantifies the intensity of each band and allows us to compare the relative amounts of material (e.g. the fraction of DNA that has been cut at a specific location by an enzyme).  I've attached a sample.  

Behold the sample!

![Example image from Prof. Haushalter showing a gel with rectangles of various shades of grey. The figure is titled S326C hOGG1 Thermolability Assay. There is a bar chart at the bottom with time on the x-axis and Excision percent on the y-axis](https://kavassalis.space/s/Haushalter_Sample_gel_data.png)

This is very exciting because the chemical 'signal' here is an image. While techniques for dealing with 2-dimensional data (like images!) look different in some respects from what we have been doing with 1-dimensional signals so far (spectra!), there are all sorts of things in common here.

One thing critical to all kinds of data analysis is, first, understanding what the data represents. While a nuanced appreciation for this experiment would likely require you to take CHEM182 and 184, we'll do a speed read of a paper together and share out key takeaways before moving on with the notebook.

# A very quick primer on gel electrophoresis

Gel electrophoresis is a way to separate mixtures of macromolecules (like DNA) based on their size (and other properties). The technique applies an electric field to the gel to cause the molecules to move through it. Smaller or more charged molecules move through the gel faster than larger or less charged molecules (and end up further down the gel).

Here's my hand-wavy explanation as to how it's done (Prof. Haushalter can tell me how wrong I am!):

1. Preparation of the Gel: The first step in gel electrophoresis is to prepare the gel, which is the medium through which the molecules will move. Depending on the type of macromolecule being separated, different types of gels may be used. Sometimes you'll see things like "Polyacrylamide electrophoresis", which tells you the gel used was Polyacrylamide.

2. Loading of the Sample: The sample containing the mixture of molecules is loaded into wells in the gel.

3. Application of an Electric Field: An electric field is applied once the sample is loaded. The macromolecules will start moving based on their size and charge. DNA and RNA carry a negative charge (I think?) and thus move toward the positive end of the field.

4. Separation of the Molecules: As the molecules move through the gel, they will separate based on size and shape. Smaller molecules will move faster and travel further through the gel, while larger molecules will move slower and not travel as far.

5. Visualization: After a certain period, the electric field is removed, and the separation of the molecules is visualized. This is often done using stains or dyes that bind to the macromolecules and can fluoresce under UV light (and be seen by a special camera!).



# Paper reading time!

Open up the link to the class paper notes [Google Doc](https://docs.google.com/document/d/1BY2KcwaBKU78y2et_xvpmZCOYYMQ_a7avm7PFVJ_EDE/edit?usp=sharing).

To ground yourself in what the paper is about, everyone should read the **Abstract** (it has two sections!), the **Introduction**, and the **Conclusions**. Use the shared doc to take notes (you can always add notes here too).

# Let's get into the computing

Let's appreciate the elements of the above figure. The hefty piece of code below recreates it (some style adjustments could still be done). While only one "plot" is present (the bar graph at the bottom), `matplotlib` is happy to treat the three separate pieces of information (the annotations with the experimental details, the photo of the gel, and the bar plot) as three subplots. This style of multiplot can convey a lot of information, and all of the pieces are clearly related (which is different from some of the multiplots we looked at in the first class).

Today, you don't need to spend much time going through the code for this figure. It is mostly here as an example for those who may pick final projects that require figures like these. After the break, you'll start deciding what you want to recreate!

In [None]:
#Set up the main title and subplots
fig = plt.figure(figsize=(8, 8))  # adjust as needed
fig.suptitle('S326C hOGG1 Thermolability Assay', fontsize=36)

# Create 3 subplots with different sizes
gs = gridspec.GridSpec(3, 1, height_ratios=[.25, 1, 3])  # adjust ratios to get the aesthetics right

ax1 = plt.subplot(gs[0])   # subplot for annotations
ax2 = plt.subplot(gs[1])   # subplot for image
ax3 = plt.subplot(gs[2])   # subplot for bar graph

# Keep your annotations and table in ax1
ax1.annotate('       37°C pre-incubation', (0.175, 0.85), xycoords='figure fraction', fontsize='16')
ax1.annotate('     _______________________', (0.175, 0.84), xycoords='figure fraction', fontsize='16')
ax1.annotate('        4°C pre-incubation', (0.58, 0.85), xycoords='figure fraction', fontsize='16')
ax1.annotate('     _______________________', (0.58, 0.84), xycoords='figure fraction', fontsize='16')

ax1.annotate('Time (min):', xy=(0, -.2), xycoords='axes fraction', fontsize=14, ha='right')

# You can then add the text above the first subplot in a table format
columns = [' ']*1 + [' ']*5 + [' ']*1 + [' ']*5
cell_text = [['0.5', '1', '5', '10', '30', '0',  '0.5', '1', '5', '10', '30']]
rows = [' ']

# We don't actually want axis labels on this one because it's not a plot
ax1.axis('tight')
ax1.axis('off')

table = ax1.table(cellText=cell_text, rowLabels=rows, colLabels=columns, cellLoc = 'center', rowLoc = 'center', loc='center')
table.auto_set_font_size(False)
table.set_fontsize(18)
table.scale(.95, 2.95)
for key, cell in table.get_celld().items():
    cell.set_linewidth(0)

# Then you can add your gel electrophoresis example image
rect = patches.Rectangle((0,0),1,1,linewidth=2,edgecolor='black',facecolor='none', transform=ax2.transAxes)
ax2.add_patch(rect)

# This loads in the image file
im = plt.imread('/content/gdrive/Shared drives/Chem_60_Spring_2024/In_Class_Notebooks/data/class14_test-electrophoresis.png')
ax2.imshow(im)
ax2.axis('off') # again, we don't want this to have axes

# adding the annotation (this syntax makes text with an arrow pointing to what you want)
ax2.annotate('9-mer', xy=(0, .4), xytext=(-.05, .4),
            arrowprops=dict(facecolor='black', arrowstyle='->'), fontsize = 12, xycoords='axes fraction', va='center', ha='right', textcoords='axes fraction')

ax2.annotate('20-mer', xy=(0, .75), xytext=(-.05, .75),
            arrowprops=dict(facecolor='black', arrowstyle='->'), fontsize = 12, xycoords='axes fraction', va='center', ha='right', textcoords='axes fraction')

# Plot for the bar plots in ax3
barWidth = 0.3
time = ['0.5', '1', '5', '10', '30']

# from july 25 S326C thermo gel2 (where did this come from??)
bars1 = [0.05415693647*100, 0.03990872511*100, 0.1166135814*100, 0.1713672166*100, 0.3250584173*100]
bars2 = [0.2042089156*100, 0.1737426582*100, 0.3496066294*100, 0.3483463648*100, 0.4675014029*100]

# The formatting of the bars can be changed for visual preference
r1 = np.arange(len(bars1)) + .1 # I am manually creating the positions of the bars
r2 = [x + barWidth + .1 for x in r1] # They should be a little bit apart

# Now we add them to the plot
ax3.bar(r1, bars1, color='k', edgecolor='k', width=barWidth, label='37°C pre-incubation')
ax3.bar(r2, bars2, color='lightgray', edgecolor='k', width=barWidth, label=' 4°C pre-incubation')

ax3.set_xlabel('Time (min)')
ax3.set_xticks([r + barWidth for r in range(len(bars1))], time)
ax3.set_ylabel('Excision (%)')

# Setting the y-axis limits and ticks
ax3.set_ylim([0, 50])
ax3.set_yticks(range(0, 51, 10))

# Remove frames
ax3.spines['right'].set_visible(False)
ax3.spines['top'].set_visible(False)

plt.show()


The data above is summarized with a bar chart. But the *data* actually comes from that gel image.

**So how do the gels give us data?**

The fluorescing molecules bound to the DNA (or other macromolecule), when exposed to a particular wavelength of light, will absorb that light and then re-emit it (fluoresce) at a longer wavelength. A special camera (known as a charge-coupled device) can take a picture of that emitted light. The brighter the light seen in a particular location, the more DNA is present at that location in the gel. These cameras can be set up to have very long exposure times (and need to be fully shielded from any other light sources) to capture very small signals. It's very cool tech. The raw images go through some pretty sophisticated software to lead us to a dataset to tell our story.

# A bit about working with images

I plotted a .png of the gel above, but let's look at the *raw* data here. Delightfully, these images are saved under a file type known as... GEL. `.gel` files are an image standard created to store gel electrophoresis data. They are an extension of a standard TIFF using private tags.

That last part probably sounded like jargon (it is!). You likely have seen .tiff images before, but perhaps less often than .pngs. TIFF (Tagged Image File Format), like the PNG (Portable Network Graphics), is just an image file. There are differences in how the two file types store data (pixel values) for compression purposes, but for our needs, and why .gel files are based on .tiff files, the thing that makes the two special is what they store in *addition* to the pixel values (ie. they both come with bonus information, not just an image).

The name "Tagged Image File Format" is alluding to this - "tagged" refers to the way information is stored and organized within the file. Each "tag" specifies the attributes of the image file, such as its dimensions, colour format, whether it is compressed or not, etc. You can also add custom tags to store whatever you want. This is how scientific metadata gets added to tiffs to create gels. Let's look at what I mean.

First, we'll load an image.

In [None]:
img = Image.open('/content/gdrive/Shared drives/Chem_60_Spring_2024/In_Class_Notebooks/data/class14_july_25_S326C_thermo_gel_2.gel')
img

Oh no, an unprocessed image! We'll worry about what this is supposed to look like in a minute. Let's look at the metadata (the tags!)

In [None]:
metadata = img.tag_v2
print(metadata)

We've got some numbers! The metadata comes to us as a Python dictionary. The "keys" of the dictionary (256, 257, etc.) are standardized tags that we can actually look up.

- 256: ImageWidth. This indicates the width of the image, which is 886 pixels in this case.
- 257: ImageLength. This indicates the height (or length) of the image, which is 400 pixels in this case.
- 258: BitsPerSample. This indicates the bit depth of the image. (16,) means it is a 16-bit image.
- 259: Compression. The value of 1 typically stands for no compression. We don't typically want our raw data to be compressed.
- 262: PhotometricInterpretation. A value of 0 usually means WhiteIsZero.
- 269: DocumentName. This tag records the name of the document from which this image was created.
- 270: ImageDescription (this will be gel stuff!)

Because this is a *standard*, you can look up any tag you want at the Library of Congress: https://www.loc.gov/preservation/digital/formats/content/tiff_tags.shtml (you learned about other standards in CS5, like ASCII and UNICODE).
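
If you'd rather not look tags up by hand, Pillow ships its own table of the standard names. Here is a minimal sketch, assuming a tiny in-memory TIFF as a stand-in for the real `.gel` file (the names `buffer`, `tiny_tiff`, and `tag_names` are made up for this example):

```python
import io

from PIL import Image, TiffTags

# Build a tiny 16-bit grayscale TIFF in memory (a stand-in for a real .gel file)
buffer = io.BytesIO()
Image.new("I;16", (4, 3)).save(buffer, format="TIFF")
buffer.seek(0)
tiny_tiff = Image.open(buffer)

# Translate each numeric tag into its standard name
tag_names = {}
for tag_id, value in tiny_tiff.tag_v2.items():
    tag_names[tag_id] = TiffTags.lookup(tag_id).name
    print(tag_id, tag_names[tag_id], value)
```

The same loop should work on `img.tag_v2` from the gel. Private tags that Pillow doesn't recognize will most likely just show up with a placeholder name.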

We can double check these properties if we want, like calling for the image width and height using `.size`.

In [None]:
img.size

Important information is held here, not only for being able to interpret the gel, but for scientific reproducibility! We can know when this file was made, where the data was originally stored, pretty much anything you would want to know.

Because this is a Python dictionary, we can use the keys to grab details when we need them:

In [None]:
metadata[270]

These were the settings the image was taken under.

In [None]:
metadata[305]

This is the name and version of the camera software that was apparently running at the time (I can find very little about this particular version on the internet!). Things like this can be especially useful if it turns out there was some issue with a particular version of the software, and you need to track what samples were analyzed with said version.


## PRACTICE QUESTION

Try checking some of the other tags!

In [None]:
# see what else is stored.


To contrast this to a png, we can look at the already processed image that I used for the sample plot above:

In [None]:
img_png = Image.open('/content/gdrive/Shared drives/Chem_60_Spring_2024/In_Class_Notebooks/data/class14_test-electrophoresis.png')
img_png

## PRACTICE QUESTION

Try grabbing metadata for this. What happens?



---



In [None]:
# try the syntax that worked on the .gel!



---

The tags central to tiff files (or .gel files) give them a lot of flexibility in the kinds of information that they can pair with images. Because of this flexibility (storing all this extra information takes up space) they are often bulky files and aren't compatible with all devices (they're not great for the web or to use to share pictures if your goal is everyone opening and seeing the same picture).

PNGs don't use tags; they use... 'chunks' (that's the official term for it). In a PNG, the data is broken into chunks which store all kinds of information, including what is functionally our metadata. Because PNGs are designed to be compatible with all manner of image rendering devices, you have much less freedom to embed custom information than in a TIFF. The kinds of things stored are... harder to parse...

Example:

In [None]:
img_png.info.items()

The first "chunk" displayed in the above dictionary is the 'icc_profile'. This contains information on the International Color Consortium (ICC) profile used in the image. While it might be unreadable to me, it contains information to help preserve colour accuracy across different devices. Every device has its own way of rendering colours and the same image might look a bit different on two different screens (or printers). ICC profiles help to maintain colour consistency by transforming the colour data of the image using a "profile connection space". That is very useful for lots of applications, but not strictly the thing we need for scientific reproducibility.
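
For comparison with the TIFF tags, PNGs can carry short text entries in their text chunks, and Pillow exposes them through the same `.info` dictionary. This is a minimal sketch on a blank in-memory image; the key `ExampleNote` and its value are invented for illustration and are not something stored in our gel PNG:

```python
import io

from PIL import Image
from PIL.PngImagePlugin import PngInfo

# Attach a custom text entry (the key and value here are made up for illustration)
extra = PngInfo()
extra.add_text("ExampleNote", "gel imaged for class demo")

# Save a blank image with that entry to memory, then read it back
buffer = io.BytesIO()
Image.new("L", (4, 3)).save(buffer, format="PNG", pnginfo=extra)
buffer.seek(0)

reopened = Image.open(buffer)
note = reopened.info.get("ExampleNote")
print(note)
```

This is plain text only, so it is a much more limited tool than the typed, numbered tags in a TIFF.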

Okay. That's enough about images. Let's make the gel we loaded look like it was supposed to!

# Why did our image look funny?

Let's look at it again. Why isn't this what we expected (knowing what the png of this gel looks like)? We learned something from its metadata.

## PRACTICE QUESTION

Resist scrolling for the answer and discuss with your neighbour why the image looks funny! What did its tags teach us? What do we remember about the colours associated with pixel values?


---




**notes maybe?**



---

Let's look at this again

In [None]:
img

The PhotometricInterpretation tag with a value of 0 means that the image uses a "WhiteIsZero" colour scheme. In this scheme, a value of 0 (the minimum value for most image formats) represents white, while the maximum value (255 in the case of an 8-bit grayscale image which you probably saw in CS5 or 65535 for a 16-bit image!) represents black.
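
To make the flip concrete, here is a minimal sketch on three made-up 16-bit values (the array names are illustrative and the numbers are not taken from the gel):

```python
import numpy as np

# Three made-up 16-bit pixel values: minimum, something in between, maximum
white_is_zero = np.array([0, 1000, 65535], dtype=np.uint16)

# Flip the scale so that 0 means black instead of white
black_is_zero = 65535 - white_is_zero
print(black_is_zero)  # → [65535 64535     0]
```

Subtracting from the maximum value swaps the two ends of the scale and leaves the spacing between shades unchanged.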

The more common approach is "BlackIsZero" - that's certainly what Matplotlib is expecting. Here is a quick line plot showing what the colour '0' is normally:

In [None]:
plt.plot([0,1], color='0')

This should have made a black line.



---



Okay, so we first need to put things into the grayscale colour space our interpreter is expecting. Let's make the image a `numpy` array so we can work with it.

In [None]:
# Convert Image to Numpy array
im_array = np.array(img)
im_array

16-bit grayscale stores a lot more information and nuance between shades than 8-bit can. It makes my brain a bit unhappy though because these numbers are big... Because we are not going for maximum accuracy today, we will convert this into an 8-bit image (and then we can use a few more pre-built functions that happen to expect 0-255 to be our colour span).

*If you want to know how to do this with a 16-bit image, this syntax will invert it: `img_inverted = img.point(lambda i: 65535 - i)`.*

First, we're going to normalize our image.

$$ x = \frac{x - \text{min}(x)}{\text{max}(x) - \text{min}(x)} $$

This is technically not needed, but if you notice:

In [None]:
im_array.min(), im_array.max()

The minimum and maximum values of our gel aren't 0 and 65535, so we won't end up with true white or true black in the final image. We are essentially increasing the contrast of the gel to make it a little easier to both look at and work with (the difference between signal and background will be more apparent).
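
Here is the normalization formula worked through on three made-up values, as a minimal sketch. The helper `normalize_to_8bit` is a hypothetical name for this example, and it assumes the image is not a single flat value (otherwise the span is zero and the division fails):

```python
import numpy as np

def normalize_to_8bit(pixels):
    """Min-max normalize an array and rescale it to the 0-255 range."""
    pixels = pixels.astype(float)
    span = pixels.max() - pixels.min()
    scaled = (pixels - pixels.min()) / span  # values now run from 0 to 1
    return (scaled * 255).astype("uint8")    # astype drops the decimals

example = np.array([1000, 2000, 3000], dtype=np.uint16)
normalized = normalize_to_8bit(example)
print(normalized)  # → [  0 127 255]
```

The smallest value lands on 0, the largest on 255, and everything else is spread proportionally in between.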

In [None]:
# Normalize to 0-255
im_array_normalized = ((im_array - np.min(im_array)) * (1/(np.max(im_array) - np.min(im_array)) * 255)).astype('uint8')

Let's check that that worked:

In [None]:
im_array_normalized.min(), im_array_normalized.max()

## PRACTICE QUESTION
Did it work?



---



Let's look at the new image:

In [None]:
# Convert numpy array back to an image.
img_normalized = Image.fromarray(im_array_normalized)

# Invert colours (ImageOps.invert expects an 8-bit image)
img_inverted = ImageOps.invert(img_normalized)
img_inverted # this will just display the image so we can look at it

Okay. This at least looks right, though it comes with some extra stuff we don't want. We can crop the image to get rid of the extra stuff (what is the blob on the side outside the gel window? I do not know!)



In [None]:
#             left,upper,right,lower
crop_region = (240, 160, 750, 260)
cropped_img =  img_inverted.crop(crop_region)
cropped_img

The variables within the `crop_region` tuple define a box by its left, upper, right, and lower pixel coordinates. The origin (0, 0) is in the upper-left corner. x-coordinates increase when you go right, and y-coordinates increase when you go down.

So in the example:

- `crop_region = (240, 160, 750, 260)`

The specific values are as follows:

- `240` is the **left** pixel coordinate. It is the x-coordinate of the left boundary of the box.
- `160` is the **upper** pixel coordinate. It is the y-coordinate of the top boundary of the box.
- `750` is the **right** pixel coordinate. It is the x-coordinate of the right boundary of the box.
- `260` is the **lower** pixel coordinate. It is the y-coordinate of the bottom boundary of the box.

So, when `img_inverted.crop(crop_region)` was run, the image that should have been returned was a rectangular portion of the original image (with the colours inverted).
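
One way to get comfortable with the coordinate order is to compare Pillow's crop box with the equivalent `numpy` slice. This is a minimal sketch on a blank stand-in image with the same dimensions as the gel (the names `blank`, `cropped`, and `as_array` are just for this example):

```python
import numpy as np
from PIL import Image

# A blank stand-in image with the same size as the gel (886 wide, 400 tall)
blank = Image.new("L", (886, 400))

left, upper, right, lower = (240, 160, 750, 260)
cropped = blank.crop((left, upper, right, lower))

# The same region as a numpy slice: rows (y) come first, then columns (x)
as_array = np.array(blank)[upper:lower, left:right]

print(cropped.size)    # Pillow reports (width, height) → (510, 100)
print(as_array.shape)  # numpy reports (rows, columns)  → (100, 510)
```

Pillow thinks in (x, y) order while `numpy` indexes rows first, which is why the two results look swapped.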

Why did we have to do this manually? Surely an algorithm could have known what region we wanted?

Honestly, no. There were several features in the regions we didn't want that looked just as 'salient' (visually important) as the features we did want. Image properties alone can't tell you where you actually would expect to see signal and where you wouldn't. Ideally, the camera software knows where you have centred the sample though, and can crop with that knowledge.

## PRACTICE QUESTION

You can try recropping the image to get comfortable with the coordinates.

I recommend returning to the example bounds (or close to them) before moving on though (if your crop includes weird stuff, the next stage of image processing is hard!)

Now one final piece. Because that blob on the side was very dark, our gel bands are kind of faint. I am going to renormalize the data (tweak the contrast again) so it looks a bit better.

In [None]:
im_array = np.array(cropped_img)
im_array_normalized = ((im_array - np.min(im_array)) * (1/(np.max(im_array) - np.min(im_array)) * 255)).astype('uint8')
img_normalized_cropped = Image.fromarray(im_array_normalized)
img_normalized_cropped

Notice the object names I am using there. I am attempting to keep track of all the modifications I have made to the raw image. This is good practice (and easy to forget to do).

# Let's extract information from the image

Here is a biiig function. This is going to make a complex looking multiplot (that I am honestly delighted by). I recommend looking at how the function is called below and then coming up to read it. This isn't extracting information yet, just letting us look close up at what we have.

In [None]:
def a_big_function_to_plot_the_gel_and_highlight_regions(gel_image, roi_top_left, width, height):
  # Compute the bottom-right coordinate
  roi_bottom_right = (roi_top_left[0] + width, roi_top_left[1] + height)

  # 1. Create the grid for the plots
  fig = plt.figure(figsize=(15, 15))
  gs = gridspec.GridSpec(ncols=3, nrows=2, height_ratios=[3, 1])

  # Define subplot axes
  ax0 = plt.subplot(gs[0, :2])  # Full gel image; spans first two columns
  ax1 = plt.subplot(gs[0, 2])  # ROI on top right
  ax2 = plt.subplot(gs[1, 2], projection='3d')  # 3D bar plot on bottom right
  ax3 = plt.subplot(gs[1, 0])  # X-Z view on bottom left
  ax4 = plt.subplot(gs[1, 1])  # Y-Z view on bottom middle

  # 2. Highlight Region of Interest ROI in original image and plot
  drawn_image = gel_image.copy()
  cv2.rectangle(drawn_image, roi_top_left, roi_bottom_right, 0, 2)
  # ax0.imshow(drawn_image, cmap='gray')
  ax0.imshow(drawn_image, cmap='gray', vmin=0, vmax=255)
  ax0.set_title("Gel")
  ax0.axis('off')

  # 3. Extract pixels in the ROI and plot
  roi = gel_image[roi_top_left[1]:roi_bottom_right[1], roi_top_left[0]:roi_bottom_right[0]]
  # ax1.imshow(roi, cmap='gray')
  ax1.imshow(roi, cmap='gray', vmin=0, vmax=255)
  ax1.set_title("Region of Interest")
  ax1.axis('off')

  # 4. Create the X-Z view
  pixel_values = roi.flatten()
  pixel_values_reshaped = np.reshape(pixel_values, (height, width))
  xz_view = np.mean(pixel_values_reshaped, axis=0) # (set axis=0 for mean along Y-axis)
  darkness_values = 255 - np.reshape(pixel_values, (height, width))
  dz = darkness_values.ravel()  # Flatten 'darkness' levels
  cmap = plt.cm.gray_r
  darkness_xz = 255 - xz_view
  norm = plt.Normalize(vmin=np.min(gel_image), vmax=np.max(gel_image))
  colour_xz = cmap(norm(darkness_xz))
  x_coords = np.arange(roi_top_left[0], roi_bottom_right[0])
  ax3.bar(x_coords, darkness_xz, color=colour_xz)
  ax3.set_xlabel('X')
  ax3.set_ylabel('Pixel Intensity')
  ax3.set_title('X-Z View')


  # 5. Create the Y-Z view
  yz_view = np.mean(pixel_values_reshaped, axis=1) # (set axis=1 for mean along X-axis)
  darkness_yz = 255 - yz_view
  norm = plt.Normalize(vmin=np.min(gel_image), vmax=np.max(gel_image))
  colour_yz = cmap(norm(darkness_yz))
  y_coords = np.arange(roi_top_left[1], roi_bottom_right[1])
  ax4.bar(y_coords, darkness_yz, color=colour_yz)
  ax4.set_xlabel('Y')
  ax4.set_ylabel('Pixel Intensity')
  ax4.set_title('Y-Z View')

  # 6. Plot 3D view of pixel intensities in the ROI
  x_grid, y_grid = np.meshgrid(x_coords, y_coords)
  x = x_grid.ravel()
  y = y_grid.ravel()
  # Bars start at z = 0; their heights are the 'darkness' levels (dz)
  z = np.zeros_like(dz)
  cmap = plt.cm.gray_r
  norm = plt.Normalize(vmin=np.min(gel_image), vmax=np.max(gel_image))
  colours = cmap(norm(dz))
  bars = ax2.bar3d(x, y, z, 1, 1, dz, color=colours, shade=True)
  # ax2.set_zlabel('Pixel Intensity', fontsize=14)
  ax2.set_zlim([0,150])
  ax2.set_xlabel('x')
  ax2.set_ylabel('y')
  ax2.set_title('3D ROI')

  # Adjust vertical space between plots
  plt.subplots_adjust(hspace=-0.5)
  # Adjust layout for labels visibility
  plt.tight_layout(pad=1.0)
  plt.show()

Let's see what this all does.

We are creating another numpy array from the image so we can more easily do the needed operations in the figure function.

Then, I am selecting one well (by visual inspection) that I want to look at.

In [None]:
gel_image = np.array(img_normalized_cropped)

# Specify the initial top-left coordinates (x, y)
roi_top_left = (15, 5)
# Specify the width and height of rectangle
width, height = (50, 40)


a_big_function_to_plot_the_gel_and_highlight_regions(gel_image, roi_top_left, width, height)
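
The two bottom-left panels come from averaging the region of interest along one axis at a time. A minimal sketch on a made-up 3-row by 4-column array (not real gel data) shows what `axis=0` and `axis=1` do:

```python
import numpy as np

# A made-up 3-row by 4-column region of interest
toy_roi = np.array([[10, 20, 30, 40],
                    [10, 20, 30, 40],
                    [40, 50, 60, 70]])

x_profile = toy_roi.mean(axis=0)  # averages down each column: one value per x
y_profile = toy_roi.mean(axis=1)  # averages across each row: one value per y

print(x_profile)  # → [20. 30. 40. 50.]
print(y_profile)  # → [25. 25. 55.]
```

In the big function, these averages are then subtracted from 255 so that darker pixels give taller bars.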

## PRACTICE QUESTION

Talk with your neighbour - what is this showing? What kinds of information can we visually extract about the DNA in this particular well? What experimental conditions does it correspond to (look at the motivation figure for this)?



---



**notes**



---


## PRACTICE QUESTION

Try to zoom in on another region in the figure and see what you see. What things are the same? What are different?



---



In [None]:
# some code! What parts of the above do you need to write down here? What from the above don't you need to write down here?

**some notes**



---



# Background subtraction

You likely noticed a range of gray around your well of choice. Despite our attempts to normalize things to give us a more clear background vs sample, we have a lot of gray. We need to perform some kind of background correction (remove the gray).

While this will look quite different to the baseline correction we did for the IR spectra, the basic idea is the same. When you know what the background value should be, you can nudge it there (so long as you can identify a region you think should definitely be in the background).

Methods for background correction usually involve subtracting an estimate of the background signal from the overall signal. The background signal can be estimated in several ways, such as by measuring the signal in an adjacent, empty area of the gel, or by assuming that the background signal is uniform across the gel. We'll do it in the simplest way possible here - find a region we label background, and subtract its average value everywhere.
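
As a toy illustration of the general idea, here is a minimal sketch on a synthetic array. It assumes the values are 'darkness' levels where bigger numbers mean more stain, which is a simplification I am making for this example and not how the notebook stores its image:

```python
import numpy as np

# A made-up 4x6 'darkness' map: a uniform haze of 20 everywhere
darkness = np.full((4, 6), 20.0)
darkness[1:3, 1:3] = 120.0  # a fake band sitting on top of the haze

# Pick a region we believe is empty gel and average it
background_roi = darkness[:, 4:6]
background_level = background_roi.mean()

# Subtract that estimate everywhere, never going below zero
corrected = np.clip(darkness - background_level, 0, None)
print(background_level)  # → 20.0
```

After the subtraction the empty gel sits at 0 and the fake band sits at 100, so what remains is signal above background.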

Let's look for one.

In [None]:
# Specify the initial top-left coordinates (x, y)
roi_top_left_bg = (230, 40)
# Specify the width and height of rectangle
width_bg, height_bg = (40, 40)

a_big_function_to_plot_the_gel_and_highlight_regions(gel_image, roi_top_left_bg, width_bg, height_bg)

## PRACTICE QUESTION

What properties would be required for a region to be considered an appropriate background? What does the region above represent?

**notes**

In [None]:
# try finding a better region?

If things below break, come back to these background conditions:


```
# Specify the initial top-left coordinates (x, y)
roi_top_left_bg = (230, 40)
# Specify the width and height of rectangle
width_bg, height_bg = (40, 40)

```



Let's confirm that we picked something for a background that looks distinct from a well we know we have a real signal in.

A simple way to do that is with a histogram of the pixel values in both regions.

Read through the below code to make sure what it is doing makes sense to you.

In [None]:
# roi for sample
roi_bottom_right = (roi_top_left[0] + width, roi_top_left[1] + height)
roi_sample = gel_image[roi_top_left[1]:roi_bottom_right[1], roi_top_left[0]:roi_bottom_right[0]]
# roi for background
roi_bottom_right_bg = (roi_top_left_bg[0] + width_bg, roi_top_left_bg[1] + height_bg)
roi_background = gel_image[roi_top_left_bg[1]:roi_bottom_right_bg[1], roi_top_left_bg[0]:roi_bottom_right_bg[0]]

# flatten the ROIs (ie. make a 1d array not a 2d array)
pixels_sample = roi_sample.flatten()
pixels_background = roi_background.flatten()

# create subplots
fig, axs = plt.subplots(1, 2, figsize=(10, 5), sharey=True, tight_layout=True)

# plot histograms
axs[0].hist(pixels_sample, bins=256, color='blue', alpha=0.7)
axs[0].set_title('Sample')

axs[1].hist(pixels_background, bins=256, color='red', alpha=0.7)
axs[1].set_title('Background')

plt.show()

## PRACTICE QUESTION

How are these distributions the same? How are they different? Does looking at the distributions this way help you figure out what qualities you might want in a background region?


---



**notes**



---

Now, let's come up with a background value to subtract. The simplest method is to just find the average background pixel value.


In [None]:
# Compute the mean background intensity
mean_bg = 255 - np.mean(roi_background)
mean_bg

## PRACTICE QUESTION

Why would I have taken 255 - the average? Is this a typo??



---



**notes for yourself**



---

Here I am background 'subtracting' by ...


```
gel_image_background_corrected = gel_image+mean_bg
```




In [None]:
gel_image_background_corrected = gel_image+mean_bg
a_big_function_to_plot_the_gel_and_highlight_regions(gel_image_background_corrected, roi_top_left, width, height)

## PRACTICE QUESTION

What did the above do? How does it look different from the previous plot?

Could we have done this with different logic?


---



If you are going to alter this process, make sure you look at the background_corrected background window too:

In [None]:
a_big_function_to_plot_the_gel_and_highlight_regions(gel_image_background_corrected, roi_top_left_bg, width_bg, height_bg)

One thing to note, when we do background correction in this way (by floating point arithmetic), we end up creating an object that isn't an 8-bit image anymore.

In [None]:
gel_image_background_corrected

See? Instead of integer values between 0 and 255, we have floating point numbers (some greater than 255). While this will plot okay, we need to turn it back into an 8-bit image to do the next stage of image processing.

In [None]:
background_corrected = ((gel_image_background_corrected - np.min(gel_image_background_corrected)) * (1/(np.max(gel_image_background_corrected) - np.min(gel_image_background_corrected)) * 255)).astype('uint8')

An alternative approach (that will actually keep the background looking more crisp) is to just clip the too-high values: `background_corrected = np.clip(gel_image_background_corrected, 0, 255).astype('uint8')`. Important note in case you come back here to try this out on other images: just calling `gel_image_background_corrected.astype('uint8')` won't work as you might expect, because values greater than 255 don't fit in 8 bits and typically wrap around (256 becomes 0, 257 becomes 1, and so on), so white pixels can become black.
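
Here is a minimal sketch of the difference between clipping and a bare cast. It uses made-up integer values, because the wrap-around is predictable for integers (for floating point values beyond the 8-bit range the result of a bare cast can depend on your platform):

```python
import numpy as np

# Made-up integer values, two of them beyond the 8-bit range
too_big = np.array([100, 255, 256, 300])

wrapped = too_big.astype("uint8")                   # wraps around past 255
clipped = np.clip(too_big, 0, 255).astype("uint8")  # caps values at 255

print(wrapped)  # → [100 255   0  44]
print(clipped)  # → [100 255 255 255]
```

With the bare cast, the brightest pixels end up among the darkest, which is exactly the surprise to avoid.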



# Automate Finding the Bands

So, we can locate the bands with our eyes, but surely we can just ask the computer to tell us where they are.

We can, but it's imperfect. First, let's look at the pixel values in this image again.

In [None]:
# create plot
fig, ax = plt.subplots(1, 1, figsize=(5, 5))

# this needs to be 1D so we can make a histogram
pixels_sample = background_corrected.flatten()
# plot histograms
ax.hist(pixels_sample, bins=256, color='k', alpha=0.7)
# ax.set_ylim([0,600]) # it might help to change axis bounds to see features more clearly!
ax.set_title('Sample')
plt.show()

What do we see here? Shades of gray.

If we want to have some very clear rule to tell the difference between sample and background, we'd like to find some clear cut-off between the white space around the samples and the samples themselves. Unfortunately, this isn't easy because we see a continuum of values present. Excellent background subtraction can help, but the limitations of this kind of data mean that we have to accept the cut-off won't always be clear.
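
If you would rather look at the numbers behind a histogram than squint at the plot, `np.histogram` returns the counts directly. The sketch below uses a handful of made-up pixel values (`toy_pixels` and `cutoff` are invented for this illustration, not taken from the gel) just to show the mechanics.

```python
import numpy as np

# Made-up pixel values (not real gel data)
toy_pixels = np.array([0, 0, 128, 255, 255, 255], dtype=np.uint8)

# One bin per possible 8-bit value: counts[i] is the number of pixels equal to i
counts, bin_edges = np.histogram(toy_pixels, bins=256, range=(0, 256))
print(counts[0], counts[128], counts[255])

# Fraction of pixels at or below a candidate cut-off (150 is just an example value)
cutoff = 150
fraction_dark = counts[:cutoff + 1].sum() / counts.sum()
print(fraction_dark)
```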


## PRACTICE QUESTION

Recreate the histogram, but zoom in on the y-axis (remember how `ax.set_ylim([a,b])` works) to see if you can identify something to differentiate sample from background in a way we might not have noticed yet.


---



In [None]:
# your new histogram



---

Perhaps you did see a bit of a bi-modal feature?


Visual inspection of a histogram like this is a common way of 'thresholding'. Let's see what it means to divide the image into 'data' vs 'background' using the [`numpy.where()`](https://numpy.org/doc/stable/reference/generated/numpy.where.html) function.

You've seen this function before, but let's go over the syntax because it's very powerful:

`np.where(background_corrected <= 150, 255, 0)`

This will return an object of the same size and shape as `background_corrected`. For every pixel in `background_corrected` whose value is <= 150 (the threshold I chose for this example), a value of 255 will be stored (white), and for every pixel with a value > 150, a value of 0 will be stored (black).
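
Here is the same `np.where` call on a tiny made-up array (the name `toy` and its values are invented for illustration), so you can check the rule pixel by pixel:

```python
import numpy as np

# A made-up 2x2 "image" with values on both sides of the threshold
toy = np.array([[100, 150], [151, 240]], dtype=np.uint8)

# Pixels <= 150 become 255 (white); everything else becomes 0 (black)
toy_mask = np.where(toy <= 150, 255, 0).astype('uint8')
print(toy_mask)
```

Note that 150 itself lands on the 'white' side because the comparison is `<=`.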

Let's see it work:

In [None]:
threshold = np.where(background_corrected <= 150, 255, 0).astype('uint8')
threshold

It found some things! Not all the things though!



## PRACTICE QUESTION

Change the threshold value above and see what it does. You may want to return it to 150 if the below breaks (but other values will work too and some will actually be better).


---



In [None]:
# code!



---


At the end of the day, we want to know the pixel intensity (how much light was emitted) of each well. That means we need to know the location of each well. We have good guesses for the 20-mer thanks to the thresholding, but they're all different sizes and the edges aren't proper edges. There are several algorithms to solve this problem (but digging into them would take... too long for today; I recommend CS153 if this sparks joy). We are going to borrow a tool from the OpenCV library to do this task: [`cv2.findContours`](https://docs.opencv.org/4.x/d4/d73/tutorial_py_contours_begin.html) traces the outline of each region in our binary image, and `cv2.boundingRect` then gives us the rectangle that encloses each outline. The implementation is below:
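
To get a feel for what a 'bounding rectangle' is before handing the job to OpenCV, here is a NumPy-only sketch for a single white blob in a made-up binary image. It is not how `cv2.findContours` works internally, and it cannot separate several blobs the way OpenCV does. The helper name `bounding_rect_of_mask` is invented for this illustration.

```python
import numpy as np

def bounding_rect_of_mask(mask):
    """Return (x, y, w, h) of the smallest rectangle containing all nonzero pixels."""
    rows, cols = np.nonzero(mask)
    x, y = int(cols.min()), int(rows.min())
    w = int(cols.max()) - x + 1
    h = int(rows.max()) - y + 1
    return (x, y, w, h)

# A made-up binary image with one white block
toy_binary = np.zeros((100, 200), dtype=np.uint8)
toy_binary[20:50, 30:60] = 255  # rows 20-49, columns 30-59

rect = bounding_rect_of_mask(toy_binary)
print(rect)
```

The `(x, y, w, h)` ordering matches what `cv2.boundingRect` returns: `x` counts columns from the left and `y` counts rows from the top.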


In [None]:
# Find contours in the binary image
contours, hierarchy = cv2.findContours(threshold, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE)

# open cv will actually add these lines to the image, so we don't want it modifying our data
img_copy = copy.deepcopy(background_corrected)

# Filter contours based on geometric properties
for contour in contours:
    # Get rectangle bounding contour
    [x,y,w,h] = cv2.boundingRect(contour)

    # Discard small pieces that are less than 5% of the image height (it will accidentally find tiny things we don't care about)
    if h < 0.05*background_corrected.shape[0]:
        continue

    # Draw rectangle around contour and show them on the image
    cv2.rectangle(img_copy,(x,y),(x+w,y+h),(0,0,255),2)

# Display marked image
plt.figure(figsize=(10,10))
plt.imshow(cv2.cvtColor(img_copy, cv2.COLOR_BGR2RGB))
plt.show()

If this looks very wrong, take a look at the code below and see whether running all or some of it before your contour finder makes a difference:

```
# Specify the initial top-left coordinates (x, y)
roi_top_left_bg = (230, 40)
# Specify the width and height of rectangle
width_bg, height_bg = (40, 40)

# identify the background region for later use
roi_bottom_right_bg = (roi_top_left_bg[0] + width_bg, roi_top_left_bg[1] + height_bg)
roi_background = gel_image[roi_top_left_bg[1]:roi_bottom_right_bg[1], roi_top_left_bg[0]:roi_bottom_right_bg[0]]

mean_bg = 255 - np.mean(roi_background)
gel_image_background_corrected = gel_image+mean_bg
background_corrected = ((gel_image_background_corrected - np.min(gel_image_background_corrected)) * (1/(np.max(gel_image_background_corrected) - np.min(gel_image_background_corrected)) * 255)).astype('uint8')

threshold = np.where(background_corrected <= 150, 255, 0).astype('uint8')

```


## SHARE with the class

If you changed any parameters above from the defaults I set (and I hope you did!), paste your image into the [same Google Doc](https://docs.google.com/document/d/1BY2KcwaBKU78y2et_xvpmZCOYYMQ_a7avm7PFVJ_EDE/edit?usp=sharing) we used at the start of class.

Double check yours looks close enough to the others before going on.



---

Now, let's split the 20-mer signals from the 9-mer. The 9-mer bands are fainter and were harder for our algorithm to find.

The code looks at the relative position of the found contours (rectangles) and assigns them to the top or bottom row depending on where they appear in the image.



```
    if y < background_corrected.shape[0] // 3:
        top_rectangles.append((x,y,w,h))
        cv2.rectangle(img_copy,(x,y),(x+w,y+h),(255,0,255),2) # highlight the top ones
    else: # the rectangle starts below the top third of the image
        bottom_rectangles.append((x,y,w,h))
        cv2.rectangle(img_copy,(x,y),(x+w,y+h),(0,0,255),2)
```



In [None]:
top_rectangles = []
bottom_rectangles = []

# do this on a fresh copy of the image
img_copy = copy.deepcopy(background_corrected)

# Filter contours based on geometric properties
for contour in contours:
    # Get rectangle bounding contour
    [x,y,w,h] = cv2.boundingRect(contour)

    # Discard small pieces that are less than 5% of the image height (it will accidentally find tiny things we don't care about)
    if h < 0.05*background_corrected.shape[0]:
        continue

    if y < background_corrected.shape[0] // 3:
        top_rectangles.append((x,y,w,h))
        cv2.rectangle(img_copy,(x,y),(x+w,y+h),(255,0,255),2) # highlight the top ones
    else: # the rectangle starts below the top third of the image
        bottom_rectangles.append((x,y,w,h))
        cv2.rectangle(img_copy,(x,y),(x+w,y+h),(0,0,255),2)

# Display marked image
plt.figure(figsize=(10,10))
plt.imshow(cv2.cvtColor(img_copy, cv2.COLOR_BGR2RGB))
plt.show()



This should make the top ones all white and the bottom ones black (however many bottom ones were found is okay, so long as there is at least one).

Now, let's assume the rectangles should all be the same size. We want to compare equal areas, so we'll pick the largest and make them all the same.

That is what is happening here:



```
# Determine the size of the largest rectangle in top row
max_area_top = max(top_rectangles, key=lambda rect: rect[2]*rect[3])
    
# Adjust all the top rectangles to be centred around their original centre and have the size of the largest one
equalized_top_rectangles = [(x + w//2 - max_area_top[2]//2, y + h//2 - max_area_top[3]//2, max_area_top[2], max_area_top[3]) for x, y, w, h in top_rectangles]

```
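
Here is the same recipe on two made-up rectangles (the list `toy_rectangles` is invented for illustration), so you can follow the arithmetic by hand:

```python
# Two made-up rectangles in (x, y, w, h) form
toy_rectangles = [(10, 10, 20, 30), (50, 12, 10, 10)]

# The rectangle with the largest area (w * h)
largest = max(toy_rectangles, key=lambda rect: rect[2] * rect[3])
print(largest)

# Re-centre every rectangle on its own centre, but give it the size of the largest one
equalized = [(x + w//2 - largest[2]//2, y + h//2 - largest[3]//2, largest[2], largest[3])
             for x, y, w, h in toy_rectangles]
print(equalized)
```

The small rectangle keeps its centre at (55, 17), but its top-left corner moves up and to the left as it grows. For a band very close to the edge of an image, that corner could even become negative, so it is worth checking the result.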



In [None]:
# Determine the size of the largest rectangle in top row
max_area_top = max(top_rectangles, key=lambda rect: rect[2]*rect[3])

# Adjust all the top rectangles to be centred around their original centre and have the size of the largest one
equalized_top_rectangles = [(x + w//2 - max_area_top[2]//2, y + h//2 - max_area_top[3]//2, max_area_top[2], max_area_top[3]) for x, y, w, h in top_rectangles]

# Plot rectangles
img_copy = copy.deepcopy(background_corrected)  # Create a copy of the image
for x,y,w,h in equalized_top_rectangles:
    cv2.rectangle(img_copy, (x, y), (x+w, y+h), (0, 255, 0), 2)

plt.figure(figsize = (10, 10))
plt.imshow(cv2.cvtColor(img_copy, cv2.COLOR_BGR2RGB))
plt.show()

Now let's look at the bottom ones. So long as we found one, we can use that to find out the distance we expect between the 20-mer and 9-mer positions:


```
# Determine the smallest y distance from the top rectangle to a bottom rectangle
min_y_distance = min([b[1] - t[1] for t in equalized_top_rectangles for b in bottom_rectangles if b[0] - t[0] < 5])

```

And then fill in the rest assuming the relative position is the same for the entire row:



```
# Estimate the bottom rectangles
estimated_bottom_rectangles = [(x, y + min_y_distance, max_area_top[2], max_area_top[3]) for x,y,w,h in equalized_top_rectangles]

```
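
A toy version of these two steps, with made-up rectangles (all the `toy_` names are invented for illustration). One assumption to flag: this sketch uses `abs()` so that only a bottom rectangle in roughly the same column as a top rectangle counts as a match, whereas the notebook's cell uses the signed difference.

```python
# Made-up rectangles in (x, y, w, h) form
toy_top = [(10, 10, 20, 30), (45, 10, 20, 30)]
toy_bottom = [(12, 80, 15, 20)]  # pretend only one bottom band was detected

# Vertical offsets between top and bottom rectangles in (roughly) the same column.
# abs() is an assumption of this sketch, not what the notebook cell does.
offsets = [b[1] - t[1] for t in toy_top for b in toy_bottom if abs(b[0] - t[0]) < 5]
toy_min_y_distance = min(offsets)
print(toy_min_y_distance)

# Shift every top rectangle down by that offset to guess where its partner sits
toy_estimated_bottom = [(x, y + toy_min_y_distance, w, h) for x, y, w, h in toy_top]
print(toy_estimated_bottom)
```

If no bottom rectangle passes the column test, `min()` is handed an empty list and raises a `ValueError`, which is why at least one bottom band has to be found.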





In [None]:
# Determine the smallest y distance from the top rectangle to a bottom rectangle
min_y_distance = min([b[1] - t[1] for t in equalized_top_rectangles for b in bottom_rectangles if b[0] - t[0] < 5])

# Estimate the bottom rectangles
estimated_bottom_rectangles = [(x, y + min_y_distance, max_area_top[2], max_area_top[3]) for x,y,w,h in equalized_top_rectangles]

# Plot rectangles
img_copy = copy.deepcopy(background_corrected)  # Create a copy of the image
for x,y,w,h in estimated_bottom_rectangles:
    cv2.rectangle(img_copy, (x, y), (x+w, y+h), (0, 255, 0), 2)

plt.figure(figsize = (10, 10))
plt.imshow(cv2.cvtColor(img_copy, cv2.COLOR_BGR2RGB))
plt.show()

Each of these rectangles is a region of interest that we want to compare!

Let's go back and look at our function to plot the individual bands!

Now we don't have to manually enter the position of the samples; we can grab it from the lists of rectangles we just made.
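
As a reminder of how a rectangle turns into a slice of the image, here is a sketch on a made-up 10x10 array (`toy_image` and the rectangle values are invented for illustration). The thing to watch is the order: NumPy indexes rows (`y`) first and columns (`x`) second.

```python
import numpy as np

# Made-up image where the pixel at (row r, column c) has the value 10*r + c
toy_image = np.arange(100).reshape(10, 10)

x, y, w, h = (2, 3, 4, 2)              # a made-up rectangle in (x, y, w, h) form
toy_roi = toy_image[y:y + h, x:x + w]  # rows (y) first, then columns (x)

print(toy_roi.shape)
print(toy_roi)
```

The ROI is 2 rows tall and 4 columns wide, and its top-left value is 32 (row 3, column 2).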

Here is a 20-mer example.

In [None]:
roi_top_left = (equalized_top_rectangles[0][0], equalized_top_rectangles[0][1])
width = equalized_top_rectangles[0][2]
height = equalized_top_rectangles[0][3]
a_big_function_to_plot_the_gel_and_highlight_regions(gel_image_background_corrected, roi_top_left, width, height)

And here is the corresponding 9-mer:

In [None]:
roi_top_left = (estimated_bottom_rectangles[0][0], estimated_bottom_rectangles[0][1])
width = estimated_bottom_rectangles[0][2]
height = estimated_bottom_rectangles[0][3]
a_big_function_to_plot_the_gel_and_highlight_regions(gel_image_background_corrected, roi_top_left, width, height)

## PRACTICE QUESTION

What experimental conditions do these correspond to?


---



**notes**



---
Okay, if you made it here, that's great! You can submit your notebook now, or... continue below...


# Submit your notebook

It's time to download your notebook and submit it on Canvas. Go to the File menu and click **Download** -> **Download .ipynb**

Then, go to **Canvas** and **submit your assignment** on the assignment page. Once it is submitted, swing over to the homework now and start working through the paper.

# Integration

How do we find the signal strength corresponding to each well? We need to numerically integrate. This one is just adding up the pixel values (in a way that makes sense for what 0 and 255 represent).

In [None]:
# reminder of how to grab the coordinates for a region of interest
roi_top_left = (estimated_bottom_rectangles[0][0], estimated_bottom_rectangles[0][1])
width = estimated_bottom_rectangles[0][2]
height = estimated_bottom_rectangles[0][3]
roi_bottom_right = (roi_top_left[0] + width, roi_top_left[1] + height)

# and define the ROI itself
roi = gel_image_background_corrected[roi_top_left[1]:roi_bottom_right[1], roi_top_left[0]:roi_bottom_right[0]]


## PRACTICE QUESTION

Write a function to calculate the total pixel intensity (integrate) for a given ROI. The function should return a large value when the ROI emitted a lot of light (i.e., it looked dark!).

```
def integrate_roi(roi):
  # some things!
  return integrand
```



In [None]:
def integrate_roi(roi):

  return integrand

Test it out!

In [None]:
roi_top_left = (equalized_top_rectangles[0][0], equalized_top_rectangles[0][1])
width = equalized_top_rectangles[0][2]
height = equalized_top_rectangles[0][3]
roi_bottom_right = (roi_top_left[0] + width, roi_top_left[1] + height)
roi = gel_image_background_corrected[roi_top_left[1]:roi_bottom_right[1], roi_top_left[0]:roi_bottom_right[0]]

integrate_roi(roi)

vs

In [None]:
roi_top_left = (estimated_bottom_rectangles[0][0], estimated_bottom_rectangles[0][1])
width = estimated_bottom_rectangles[0][2]
height = estimated_bottom_rectangles[0][3]
roi_bottom_right = (roi_top_left[0] + width, roi_top_left[1] + height)
roi = gel_image_background_corrected[roi_top_left[1]:roi_bottom_right[1], roi_top_left[0]:roi_bottom_right[0]]

integrate_roi(roi)

It's the ratio between these values that generates the numbers in the bar plot!

If your function to find the integral works, run the below code and see what you get...

In [None]:
list_to_store_results=[]

for experiment in range(len(equalized_top_rectangles)):
  roi_top_left = (equalized_top_rectangles[experiment][0], equalized_top_rectangles[experiment][1])
  width = equalized_top_rectangles[experiment][2]
  height = equalized_top_rectangles[experiment][3]
  roi_bottom_right = (roi_top_left[0] + width, roi_top_left[1] + height)
  roi = gel_image_background_corrected[roi_top_left[1]:roi_bottom_right[1], roi_top_left[0]:roi_bottom_right[0]]

  top_integral = integrate_roi(roi)

  roi_top_left = (estimated_bottom_rectangles[experiment][0], estimated_bottom_rectangles[experiment][1])
  width = estimated_bottom_rectangles[experiment][2]
  height = estimated_bottom_rectangles[experiment][3]
  roi_bottom_right = (roi_top_left[0] + width, roi_top_left[1] + height)
  roi = gel_image_background_corrected[roi_top_left[1]:roi_bottom_right[1], roi_top_left[0]:roi_bottom_right[0]]

  bottom_integral = integrate_roi(roi)

  list_to_store_results.append([equalized_top_rectangles[experiment][0],bottom_integral/top_integral])
list_to_store_results.sort()

And the plot

In [None]:
fig, ax3 = plt.subplots(1,1)
# Plot for the bar plots in ax3
barWidth = 0.3
time = ['0.5', '1', '5', '10', '30']

# from july 25 S326C thermo gel2 (Here they are)
bars1 = [results[1]*100 for results in list_to_store_results[:5]]
bars2 = [results[1]*100 for results in list_to_store_results[6:]]


# The formatting of the bars can be changed for visual preference
r1 = np.arange(len(bars1)) + .1 # I am manually creating the positions of the bars
r2 = [x + barWidth + .1 for x in r1] # They should be a little bit apart

# Now we add them to the plot
ax3.bar(r1, bars1, color='k', edgecolor='k', width=barWidth, label='37°C pre-incubation')
ax3.bar(r2, bars2, color='lightgray', edgecolor='k', width=barWidth, label=' 4°C pre-incubation')

ax3.set_xlabel('Time (min)')
ax3.set_xticks([r + barWidth for r in range(len(bars1))], time)
ax3.set_ylabel('Excision (%)')

# Setting the y-axis limits and ticks
# ax3.set_ylim([0, 50])
# ax3.set_yticks(range(0, 51, 10))

# Remove frames
ax3.spines['right'].set_visible(False)
ax3.spines['top'].set_visible(False)

## PRACTICE QUESTION

[Share](https://docs.google.com/document/d/1BY2KcwaBKU78y2et_xvpmZCOYYMQ_a7avm7PFVJ_EDE/edit?usp=sharing) this plot too!



---



# Unneeded extra fun with K-means

Instead of the thresholding we did above, we could have actually done it with K-means clustering!

This function is almost the same as the last time we wrote it, but it is applied to an entirely different problem and data type.

In [None]:
def kmeans(image, K, max_iters=100):
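    # Flatten the image to a 1D array of floats (subtracting uint8 values directly would wrap around)
    pixels = image.flatten().astype(float)

    # Step 1: Initialize random centroids (drawn from distinct pixel values so no two start out identical)
    centroids = np.random.choice(np.unique(pixels), K, replace=False)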

    for _ in range(max_iters):
        # Step 2: Form clusters
        clusters = np.argmin(np.abs(pixels[:, None] - centroids), axis=1)

        new_centroids = np.array([pixels[clusters==k].mean() for k in range(K)])

        # If centroids don't change significantly, break
        if np.allclose(centroids, new_centroids):
            break

        centroids = new_centroids

    # Replace each pixel value with the corresponding centroid
    segmented_image = centroids[clusters].reshape(image.shape).astype(np.uint8)

    return segmented_image

Let's see

In [None]:
segmented_image = kmeans(gel_image, K=2)
segmented_image

How awesome is that.

This is just to show the amazing generality of some of the techniques you are learning!
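
One way to see the link between K-means and the thresholding we did earlier: with K=2, every pixel is assigned to the nearer of two centroids, which is the same as thresholding at the midpoint between them. Here is a minimal sketch on made-up values. It uses a fixed starting guess instead of random centroids (my assumption, to make the result repeatable), and the names `toy_pixels` and `implied_threshold` are invented for illustration.

```python
import numpy as np

# Made-up pixel values: three dark, three bright
toy_pixels = np.array([10, 12, 14, 200, 210, 220], dtype=float)

# Deterministic start (an assumption of this sketch): the darkest and brightest values
centroids = np.array([toy_pixels.min(), toy_pixels.max()])

for _ in range(100):
    # Assign each pixel to its nearest centroid
    labels = np.argmin(np.abs(toy_pixels[:, None] - centroids), axis=1)
    # Move each centroid to the mean of its assigned pixels
    new_centroids = np.array([toy_pixels[labels == k].mean() for k in range(2)])
    if np.allclose(centroids, new_centroids):
        break
    centroids = new_centroids

# With two clusters, the decision boundary sits halfway between the centroids
implied_threshold = centroids.mean()
print(centroids)
print(implied_threshold)
```

Here the centroids settle at 12 and 210, so the implied threshold is 111, and K-means has picked the cut-off for us instead of us reading it off a histogram.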