# Lesson 4 Batch Processing and Tracking

Congratulations, you've just completed an awesome image processing pipeline that takes an image and does something rather useful. You've tested it a few times and you're happy with the results. 

Now maybe you need to apply the same operation to 1000 images. Maybe you need to apply it to 20 datasets with slightly different parameters. Maybe you need to aggregate information across the results of each processed image to draw your conclusions. These needs can be addressed by applying your pipeline with batch processing. In this lesson, you will learn:

1. Two batch processing strategies and their advantages & disadvantages
2. Cell tracking approach and measurements


## 4.1 Batch processing strategies (vertical and horizontal integration)

### 4.1.1 Project aim: batch processing of multiple images

In [None]:
# read certain file names in your target folder
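
One possible way to fill in the cell above, as a minimal sketch. The helper name `list_image_files`, the `*.tif` pattern, and the folder name in the usage comment are assumptions for illustration, not part of the lesson's own code.

```python
from pathlib import Path

def list_image_files(folder, pattern="*.tif"):
    """Return the sorted paths in `folder` whose names match `pattern`."""
    # sorted() gives a reproducible processing order across runs
    return sorted(Path(folder).glob(pattern))

# Usage (hypothetical folder name):
# files = list_image_files("my_images", "*.tif")
```

Sorting matters because the order in which the file system returns entries is not guaranteed, and a stable order makes results easier to match back to files.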


**Exercise 1:** You're at the point where you have the functions below. 

In [112]:
# load image data

# preprocess the loaded image data

# extract information from the processed images


How might you apply the pipeline you developed to every image in a folder, saving the results in a list?

### 4.1.2 Strategy one: the vertical approach

#### A straightforward procedure that applies all processing steps to each image one at a time, from start to end.

In [14]:
# vertical batch processing framework
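
A minimal sketch of what such a framework could look like. The three `toy_*` functions are hypothetical stand-ins for the load, preprocess, and extract functions from Exercise 1, so the example runs without any image files.

```python
def process_vertically(files, load, preprocess, extract):
    """Run every step on one file before moving on to the next file."""
    results = []
    for f in files:
        image = load(f)                    # step 1: load
        processed = preprocess(image)      # step 2: preprocess
        results.append(extract(processed)) # step 3: extract and store
        # `image` and `processed` are overwritten on the next iteration,
        # so only one image's intermediates are in memory at a time
    return results

# Toy stand-ins (assumptions for illustration only)
def toy_load(name):
    return [len(name), 2, 3]

def toy_preprocess(image):
    return [v * 2 for v in image]

def toy_extract(processed):
    return sum(processed)

vertical_results = process_vertically(
    ["a.tif", "bb.tif"], toy_load, toy_preprocess, toy_extract
)
print(vertical_results)  # [20, 22]
```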


#### Example: counting cell numbers from 5 images

In [106]:
# predefined functions:

# func 1: load the red channel of a target image

# func 2: preprocess the loaded image data with filtering, thresholding, and morphological operation

# func 3: extract the cell numbers from the preprocessed images and also visualize the labeled cells


In [16]:
# batch processing


In this vertical pipeline, we turned each step into its own function. **The advantages of using predefined functions:**

1. Concise flow: the main flow stays short and is easy to write, read, and maintain.
2. No side effects: variables assigned inside a function are local, so they do not overwrite variables in the main flow.
3. Time saving: functions are easily reused in other projects.
4. Reproducible: fewer input errors and no inconsistent operations between images.

These points matter most when processing large datasets through multiple steps, especially when the code runs to many lines.

**Exercise 2:** Understand the importance of predefined functions.

["imread" function: source code](https://github.com/luispedro/imread/blob/master/imread/imread.py)

### 4.1.3 Strategy two: the horizontal approach
When you hit a problem and want to debug a suspicious function, the vertical approach may not be the best choice.
Instead of applying all processing steps to each image one at a time, you can use the horizontal approach, which applies one processing step to all images and then moves on to the next step. In this way, each processing step serves as a separate unit for testing and debugging.

In [2]:
# horizontal batch processing
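
A minimal sketch of the horizontal approach. The file names and the `toy_*` functions are hypothetical placeholders for the real load, preprocess, and extract functions.

```python
files = ["a.tif", "bb.tif", "ccc.tif"]  # hypothetical file names

# Toy stand-ins (assumptions for illustration only)
def toy_load(name):
    return [len(name)] * 4

def toy_preprocess(image):
    return [v - 1 for v in image]

def toy_extract(processed):
    return sum(processed)

# Each stage is applied to ALL images before the next stage starts
images = [toy_load(f) for f in files]
processed = [toy_preprocess(im) for im in images]
results = [toy_extract(p) for p in processed]

print(results)  # [16, 20, 24]
```

Because every stage's output is kept as a full list, you can inspect `images` or `processed` after the run to see where a suspicious result came from.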


**Exercise 3:** Adjust the "preprocessing" function to get better cell masks.

In [109]:
# adjust "preprocessing" function


In [3]:
# rerun the adjusted function


So now, rather than completing your pipeline "vertically", from start to finish, top to bottom, you are completing your pipeline "horizontally", applying each stage across the board in sequence.

In fact, once you've made your code modular with small functions, Python supports this approach with _functional programming tools_. We'll talk about one that is especially useful:

- `map(function, iterable)`. Applies `function` to each element in `iterable` (i.e. anything you can loop through). In Python 3 it returns a lazy iterator over the results; wrap it in `list(...)` to get a new list.

In [4]:
# horizontal batch processing with "map"
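
A minimal sketch of the same horizontal pipeline written with `map`. The file names and `toy_*` functions are hypothetical stand-ins. Each `map` call is wrapped in `list(...)` here so that every stage is stored in full; without `list(...)`, Python 3 would compute results only when they are requested.

```python
files = ["a.tif", "bb.tif", "ccc.tif"]  # hypothetical file names

# Toy stand-ins (assumptions for illustration only)
def toy_load(name):
    return [len(name)] * 4

def toy_preprocess(image):
    return [v - 1 for v in image]

def toy_extract(processed):
    return sum(processed)

# list(...) forces each stage to be computed and kept in memory
images = list(map(toy_load, files))
processed = list(map(toy_preprocess, images))
results = list(map(toy_extract, processed))

print(results)  # [16, 20, 24]
```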


Now this looks like a pretty concise way to process a pipeline! But is it the best way?

### 4.1.4 Memory cost for batch processing


One thing to consider is that at the end of the program, `files`, `images`, `processed`, and `results` are all fully available to you in *program memory*. This can be useful for debugging (e.g., checking the intermediate results after preprocessing). It may also be useful when you need to aggregate across intermediate results.

But the downside of keeping all the intermediate information is that it may use a lot of memory, and your computer may even run out of memory.

**Exercise 4:** In our lesson on loading an image, we discussed how much memory an image can use up. If you are processing 1000 16-bit, 1024x1024 pixel images, each with 3 channels, how much memory does the first line above use?

In [7]:
# memory calculation
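
A worked version of this arithmetic, assuming all 1000 raw images are held in memory at once as uncompressed 16-bit arrays (2 bytes per value) with no extra overhead:

```python
n_images = 1000
height, width = 1024, 1024
n_channels = 3
bytes_per_value = 16 // 8  # 16-bit data uses 2 bytes per value

total_bytes = n_images * height * width * n_channels * bytes_per_value

print(f"{total_bytes:,} bytes")          # 6,291,456,000 bytes
print(f"{total_bytes / 1e9:.2f} GB")     # 6.29 GB
print(f"{total_bytes / 2**30:.2f} GiB")  # 5.86 GiB
```

Any step that converts the data to a wider type (for example 64-bit floats) would multiply this figure accordingly.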


**Exercise 5:** The approaches above are not equally memory intensive. Which one uses the least memory?

Horizontal processing is unlikely to go smoothly if you have a multi-stage pipeline and many images. You could delete each stage as you go, keeping in memory only the previous and current stage using `del` statements, e.g. `del images`, but this can introduce bugs when a later step still needs something you deleted. For very large numbers of images, this strategy still suffers, because a full stage must fit in memory at once.

In [13]:
# "del" intermediate results
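
A minimal sketch of deleting intermediates between stages. The file names and `toy_*` functions are hypothetical stand-ins for the real pipeline functions.

```python
files = ["a.tif", "bb.tif", "ccc.tif"]  # hypothetical file names

# Toy stand-ins (assumptions for illustration only)
def toy_load(name):
    return [len(name)] * 4

def toy_preprocess(image):
    return [v - 1 for v in image]

def toy_extract(processed):
    return sum(processed)

images = [toy_load(f) for f in files]
processed = [toy_preprocess(im) for im in images]
del images      # raw images are no longer needed after preprocessing

results = [toy_extract(p) for p in processed]
del processed   # keep only the final results

print(results)  # [16, 20, 24]
```

After `del images`, any later line that still refers to `images` raises a `NameError`, which is the kind of bug the text warns about.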


#### A good way to design a pipeline is to draw out the pipes. 
Map out the dependencies between pieces of information so you can see where to break things into functions and figure out how much data you actually need to keep in memory. The example below is a workflow for processing a green structural channel and a red calcium fluorescence channel in a 60-frame time series of images. The goal is to find cells, make masks, and track the fluorescence intensities over time.

![Diagram of a pipeline](pipeline_diagrams.png "A typical pipeline diagram")

We can see that loading raw images and preprocessing them can be done with one function, applied image-by-image, without saving the intermediate. However, we need to stop after this step and compute a maximum projection. We can then remove all of the green channel data from memory and work on the red channel, keeping only the cell location masks. We then apply the cell location masks to each image, one at a time, extract the resulting total cell intensities, and save them in an array.

Without this plan, we might have loaded both channels at the beginning, which would double our memory consumption.
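
The workflow described above can be sketched with synthetic data. This is only an illustration under stated assumptions: `load_frame` is a hypothetical loader that fabricates frames, a single threshold stands in for real segmentation, and one mask covers all cells rather than one mask per cell.

```python
import numpy as np

rng = np.random.default_rng(0)
n_frames, height, width = 60, 32, 32

def load_frame(channel, t):
    """Hypothetical loader: returns one synthetic 2D frame."""
    frame = rng.random((height, width))  # background noise in [0, 1)
    # a bright 5x5 "cell"; the red signal grows over time
    frame[10:15, 10:15] += 2.0 if channel == "green" else 1.0 + 0.1 * t
    return frame

# Stage 1: running maximum projection of the green channel,
# holding only one frame in memory at a time
max_proj = np.zeros((height, width))
for t in range(n_frames):
    max_proj = np.maximum(max_proj, load_frame("green", t))

# Stage 2: make the mask, then drop the green data
mask = max_proj > 1.5  # threshold chosen for this synthetic data only
del max_proj

# Stage 3: apply the mask to each red frame, one at a time
intensities = np.empty(n_frames)
for t in range(n_frames):
    intensities[t] = load_frame("red", t)[mask].sum()
```

At no point does this sketch hold more than one frame, one projection, and one mask in memory, which is the saving the diagram is meant to reveal.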

---------------------------------------------------------------------------------------------------------------------

## 4.2 Tracking objects / pipeline design

It is important to be able to chain the various image processing tools you have learnt into an automated pipeline, so you don't have to apply the same N transformations by hand over and over again to your dataset of M images. Today you will learn how to build your very own image processing pipeline, along with some nifty video processing tools.

We will do this in the context of an image processing method known as "tracking". You already know how to identify separate objects in your images and measure their properties (position, size, intensity etc.). Now imagine you have a time-series of images (or a video...). Tracking is essentially the ability to identify the same object in your entire time-series consistently. We will also demonstrate how this process can fail, and what to watch out for when you are analyzing your own data.


In [9]:
# video loading and display


In [None]:
# predefined functions for extracting cell information


In [None]:
# define a function to find the nearest cell centroid
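
A minimal sketch of such a function, assuming centroids are given as `(row, col)` tuples. The name `find_nearest_centroid` and its return format are assumptions for illustration.

```python
import math

def find_nearest_centroid(previous, candidates):
    """Return (index, distance) of the candidate closest to `previous`."""
    if not candidates:
        raise ValueError("no candidate centroids in this frame")
    # Euclidean distance from the previous position to every candidate
    distances = [math.dist(previous, c) for c in candidates]
    index = min(range(len(candidates)), key=distances.__getitem__)
    return index, distances[index]

index, distance = find_nearest_centroid(
    (10.0, 10.0), [(50.0, 50.0), (12.0, 11.0), (0.0, 0.0)]
)
print(index)  # 1
```

Returning the distance as well as the index makes it possible to reject a match that is implausibly far away.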


In [10]:
# main flow to track the cell centroid of the video
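
A minimal sketch of a nearest-neighbour tracking loop. The per-frame centroid lists are made-up values standing in for the output of segmentation on each video frame, and `track_cell` is a hypothetical name.

```python
import math

# Hypothetical (row, col) centroids detected in three consecutive frames
frames = [
    [(10.0, 10.0), (40.0, 40.0)],
    [(41.0, 39.0), (11.0, 12.0)],  # detection order can change between frames
    [(13.0, 13.0), (42.0, 38.0)],
]

def track_cell(frames, start):
    """Follow one cell by linking it to the nearest centroid in each next frame."""
    path = [start]
    for centroids in frames[1:]:
        nearest = min(centroids, key=lambda c: math.dist(path[-1], c))
        path.append(nearest)
    return path

path = track_cell(frames, frames[0][0])
print(path)  # [(10.0, 10.0), (11.0, 12.0), (13.0, 13.0)]
```

This simple linking rule is one way tracking can fail: if two cells pass close to each other, or a cell moves further between frames than the distance to its neighbour, the track can jump to the wrong cell.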


In [11]:
# plot cell centroid changes


In [12]:
# calculate cell motion distances and visualize them
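
A minimal sketch of the distance calculation, using a made-up track of `(row, col)` centroids. Plotting is left out so the example has no dependencies beyond the standard library.

```python
import math

# Hypothetical track: one centroid per frame
path = [(10.0, 10.0), (13.0, 14.0), (13.0, 14.0), (19.0, 6.0)]

# Distance moved between each pair of consecutive frames
step_distances = [math.dist(a, b) for a, b in zip(path, path[1:])]

total_distance = sum(step_distances)            # length of the whole path
net_displacement = math.dist(path[0], path[-1]) # straight line, start to end

print(step_distances)  # [5.0, 0.0, 10.0]
print(total_distance)  # 15.0
```

Comparing `total_distance` with `net_displacement` gives a rough sense of how directed the motion is: the two are equal only for a perfectly straight path.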
