# Lesson 4 Batch Processing and Tracking

Congratulations, you've just completed an awesome image processing pipeline that takes an image and does something rather useful. You've tested it a few times and you're happy with the results. 

Now maybe you need to apply the same operation to 1000 images. Maybe you need to apply it to 20 datasets with slightly different parameters. Maybe you need to aggregate information across the results of each processed image to draw your conclusions. These needs can be addressed by applying your pipeline with batch processing. In this lesson, you will learn:

1. Two batch processing strategies and their advantages & disadvantages
    - EPISODE 1: Vertical strategy
    - EPISODE 2: Horizontal strategy
    - EPISODE 3: Memory management
2. Cell tracking approach and measurements
    - EPISODE 4: Cell tracking

## 4.1 Batch processing strategies (vertical and horizontal integration)

### 4.1.1 Project aim: batch processing of multiple images


In [1]:
import os
# change the current path to the data path
data_path = "/Users/guolanlu/Desktop/Lesson4/Data/L4Data/ImagesAll"
os.chdir(data_path)
print('The number of images = ',len(os.listdir()))
print(os.listdir()[1:10])


The number of images =  50
['HAC-Cit-KRABdox_w2YFPled_s41_t10.TIF', 'HAC-Cit-KRABdox_w2YFPled_s41_t11.TIF', 'HAC-Cit-KRABdox_w2YFPled_s41_t12.TIF', 'HAC-Cit-KRABdox_w2YFPled_s41_t13.TIF', 'HAC-Cit-KRABdox_w2YFPled_s41_t14.TIF', 'HAC-Cit-KRABdox_w2YFPled_s41_t15.TIF', 'HAC-Cit-KRABdox_w2YFPled_s41_t16.TIF', 'HAC-Cit-KRABdox_w2YFPled_s41_t17.TIF', 'HAC-Cit-KRABdox_w2YFPled_s41_t18.TIF']


#### Step 1: define a function to load data

#### Step 2: define a function to pre-process the data
 - filter: smooth image/remove noise
 - thresholding: generate mask
 - morphological: refine mask

#### Step 3: extract information from data
 - count cell number
 - measure protein expression
 - ......

#### Step 4: batch processing: count cell numbers of all images

### 4.1.2 Strategy one: the vertical approach

 A straightforward procedure that applies all processing steps to each image one at a time, from start to end.
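The vertical approach described above can be sketched as a single loop. This is only a minimal illustration under stated assumptions: `load`, `preprocess`, and `count_cells` are hypothetical stand-ins for the functions from Steps 1 to 3, the images are synthetic rather than read from disk, and a fixed threshold replaces whatever thresholding the lesson's real pipeline uses.

```python
import numpy as np
from scipy import ndimage as ndi

def load(index):
    """Hypothetical loader: builds a synthetic image with (index + 1) bright squares."""
    image = np.zeros((32, 32), dtype=np.uint16)
    for k in range(index + 1):
        row = 2 + 6 * k
        image[row:row + 3, 4:7] = 1000
    return image

def preprocess(image):
    """Stand-in pre-processing: median filter, then a fixed threshold to get a mask."""
    filtered = ndi.median_filter(image, size=2)
    return filtered > 500

def count_cells(mask):
    """Count the connected regions in a binary mask."""
    _, num_regions = ndi.label(mask)
    return num_regions

# Vertical approach: each image goes through every step before the next image is touched.
cell_counts = []
for index in range(4):
    image = load(index)
    mask = preprocess(image)
    cell_counts.append(count_cells(mask))

print(cell_counts)
```

Only one image and one mask exist at any moment inside the loop; the list `cell_counts` is the only thing that grows with the number of images.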

Plot the number of cells over all the images

**The advantages of using predefined functions for each processing step:** 
 - Concise flow: clear main flow for reading and processing; 
 - Time saving: easily reused for other projects; 
 - Reproducible: minimal input error and avoiding inconsistent operations. 

These are very important when processing large datasets with multiple steps, especially when the code runs to many lines.

### 4.1.3 Strategy two: the horizontal approach
When you run into a problem and want to debug your functions, the vertical approach may not be the best choice.
Instead of applying all processing steps to each image one at a time, you can use the horizontal approach, which applies one processing step to all images and then moves on to the next step. In this way, each processing step serves as a separate unit for testing and debugging.

So now rather than completing your pipeline "vertically", from start to finish, top to bottom, you are completing your pipeline "horizontally", applying each processing step across all the images.

**Exercise 1:** Implement the horizontal approach: count the cells in all the images

In [2]:
# variable initialization
images = []
processed = []
results = []

# 1. load all the images


# 2. pre-process all the loaded images

    
# 3. extract information from all the pre-processed images


# 4. print your results


In fact, when you've made your code modular with little functions, Python tries to support this approach with _functional programming tools_. We'll talk about one of the especially useful ones:

- `map(function, iterable)`. Applies `function` to each element in `iterable` (i.e. anything you can loop through). In Python 3 it returns a lazy iterator over the results; wrap it in `list(...)` to get a list.
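As a small illustration of how `map` behaves in Python 3, here is a sketch on toy data. The function `square` is a hypothetical stand-in for a processing step such as a pre-processing function, and the numbers stand in for images.

```python
def square(x):
    """Toy stand-in for a processing function applied to one item."""
    return x * x

values = [1, 2, 3, 4]

mapped = map(square, values)   # lazy: no element has been processed yet
results = list(mapped)         # consuming the iterator runs square on each element
print(results)

leftover = list(mapped)        # a map object can only be consumed once
print(leftover)
```

Because the `map` object is consumed on first use, store the output of `list(...)` if you need to look at the results more than once.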

In [3]:
images = []
processed = []
results = []



Now this looks like a pretty concise way to process a pipeline! But is it the best way?

### 4.1.4 Memory cost for batch processing


One thing to consider is that at the end of the program, `files`, `images`, `processed`, and `results` are all complete lists held in program memory. This could be useful for debugging (e.g., to check the intermediate results after pre-processing). It may also be useful when you need to aggregate across intermediate results.

But the downside of keeping all the intermediate information is that it may use a lot of memory, possibly more than your computer has available.

**Exercise 3:** Not all approaches above are as memory intensive as the others. Which approach is the least offensive?

In our lesson on loading an image we discussed how much memory an image can use up. If you are processing 1000 16-bit, 1024x1024 pixel images, each with 3 channels, how much memory does the first step above (loading all the images) use?

In [4]:
num_bytes = 1000 * 2 * 1024**2 * 3
print("{} Bytes".format(num_bytes))
print("{} GB".format(num_bytes / 1024.**3))

6291456000 Bytes
5.859375 GB


That's probably not going to go smoothly if you have a multi-stage pipeline. You could delete each stage as you go, keeping in memory only the previous and current stages using `del` statements, e.g. `del images`, but this can introduce bugs when you need a deleted variable later. For very large numbers of images, even this strategy breaks down, because a single stage may already be too large to hold in memory.
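The `del` strategy and its pitfall can be sketched with toy data. The lists below are hypothetical stand-ins for loaded images and pre-processed results, not the lesson's actual data.

```python
# Stand-ins for two pipeline stages (toy data, not real images).
images = [[1, 2, 3, 4] for _ in range(3)]
processed = [sum(img) for img in images]

# Free the previous stage once the current stage has been computed.
del images

# Any later code that still refers to the deleted stage now fails.
try:
    images
    still_available = True
except NameError:
    still_available = False

print(processed)
print(still_available)
```

The name `images` no longer exists after `del`, so any later step that needs it has to reload or recompute the data.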

In [5]:
#processed = map(preprocessing,images)
#del images

#list(images)

#### A good way to design a pipeline is to draw out the pipes. 
Map out the dependencies between pieces of information so you can see where to break things into functions and figure out how much data you actually need to keep in memory. 

## 4.2 Tracking objects / pipeline design

Now let's import a video

**Step 1**: Pre-process and segment cells

In [6]:
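# assumed imports for the names used below (aliases inferred from usage)
from scipy.ndimage import median_filter
from skimage import filters
import skimage.morphology as sm

def preprocessing(image):
    # filtering: median filter to suppress noise
    filtered = median_filter(image, size=2)
    # thresholding: pixels above the Otsu threshold form the mask
    otsu_thresh = filters.threshold_otsu(filtered)
    masked = filtered > otsu_thresh
    # morphology: erode the mask with a disk of radius 1 to refine it
    morph = sm.binary_erosion(masked, sm.disk(1))
    return morph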

**Step 2:** Quantify cell properties
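One possible way to quantify cell properties from a binary mask is sketched below. It is an assumption-laden illustration, not necessarily the lesson's method: it uses `scipy.ndimage` to label connected regions and measures only area and centroid, and the mask is synthetic.

```python
import numpy as np
from scipy import ndimage as ndi

# Synthetic mask with two "cells" (hypothetical data).
mask = np.zeros((10, 10), dtype=bool)
mask[1:4, 1:4] = True    # 3x3 region
mask[6:8, 5:9] = True    # 2x4 region

# Give every connected region its own integer label (0 is background).
labels, num_cells = ndi.label(mask)
cell_ids = list(range(1, num_cells + 1))

# Area: number of pixels carrying each label (drop the background count).
areas = np.bincount(labels.ravel())[1:]

# Centroid: mean (row, column) position of the pixels in each region.
centroids = ndi.center_of_mass(mask.astype(float), labels, index=cell_ids)

for cell_id, area, centroid in zip(cell_ids, areas, centroids):
    print(cell_id, area, centroid)
```

The centroids are what a tracking step would typically consume as cell positions.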

**Step 3:** Define a function to track cells: The easiest way to accomplish this task is to connect every segmented cell in a frame to the nearest cell in the subsequent frame. The assumption is that the cells move slowly relative to the chosen frame rate and are not densely packed.
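The nearest-neighbour linking described above can be sketched as follows. The function name `link_nearest` and the centroid coordinates are hypothetical; the sketch assumes each frame has already been reduced to a list of (row, column) cell positions.

```python
import numpy as np

def link_nearest(prev_positions, next_positions):
    """For each cell in the previous frame, return the index of the nearest cell in the next frame."""
    prev_positions = np.asarray(prev_positions, dtype=float)
    next_positions = np.asarray(next_positions, dtype=float)
    # Pairwise Euclidean distances, shape (n_prev, n_next).
    diffs = prev_positions[:, None, :] - next_positions[None, :, :]
    distances = np.sqrt((diffs ** 2).sum(axis=2))
    # Nearest next-frame cell for every previous-frame cell.
    return distances.argmin(axis=1)

# Hypothetical centroids in two consecutive frames; the cells moved slightly
# and the segmentation returned them in a different order.
frame0 = [(10.0, 10.0), (40.0, 25.0), (70.0, 60.0)]
frame1 = [(41.0, 27.0), (69.0, 58.0), (11.0, 9.0)]

links = link_nearest(frame0, frame1)
print(links)
```

Note that this simple rule does not prevent two cells from being linked to the same target, which is one reason it relies on the cells being slow-moving and not densely packed.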

**Step 4:** Run batch processing to track cell positions in time-series images

**Exercise 2:** Plot the trajectory of another cell, cell #5