# Object tracking in Python
**Stephen Cross (Wolfson Bioimaging Facility, University of Bristol, 2019).  Email: stephen.cross@bristol.ac.uk**

In this session we will build up a workflow to track pre-detected objects through multiple frames of a timeseries.  

The images we will be using are of fluorescently-labelled nuclei and are provided by the [Cell Tracking Challenge](http://celltrackingchallenge.net/).  For more information on the images, see the paper "Recruitment of Oct4 Protein to UV-Damaged Chromatin in Embryonic Stem Cells", Eva Bártová, Gabriela Šustáčková, Lenka Stixová, Stanislav Kozubek, Soňa Legartová, Veronika Foltánková (2011) PLOS ONE.

The session is broken down into eight exercises, each of which will introduce a new step to the workflow.  Typically an exercise will require creation of a new Python function to achieve a specific task (e.g. calculating the cost of linking two points).  The format of each function is laid out in the comments above that function.  It'll be up to you to replace the comment "### CODE GOES HERE ###" with the relevant code.

At the end of each exercise is a pre-prepared block of code ("### TEST SPACE ###") that can be used to test the function is working correctly.  The code in this box doesn't need to be altered to work.

## Exercise 0 - Getting started
Before we start, we need to make sure this Jupyter Notebook has access to all the functions we're going to use later on.  

For the sake of keeping the output on the notebook neat, we'll also set Numpy and Pandas to only display numbers to 2 decimal places (they're still working with higher precision values).  

Finally, we'll also create a variable to represent "infinity".  This will be used later on during the object linking steps.

### Aim
- Import any functions we will use later on
- Set Numpy and Pandas to display numbers to 2 decimal places
- Create an "infinity" variable, which we will use during tracking
- If not already completed, download and extract the images (this will all be done automatically)

### Notes
- This exercise doesn't require addition of any code; all you need do is run the code blocks to check all the libraries import correctly. 
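
As a small illustration of the point about display precision, the sketch below shows that limiting the printed decimal places changes only how an array is shown, not the values it stores.  The `demo_values` array is made up for this example, and `np.printoptions` is used here (rather than `np.set_printoptions`) purely so the setting is temporary.

```python
import numpy as np

# Hypothetical values, made up for illustration
demo_values = np.array([1.23456])

# Temporarily limit the displayed precision to 2 decimal places
with np.printoptions(precision=2):
    shown = str(demo_values)

print(shown)                      # the rounded, displayed form
print(demo_values[0] == 1.23456)  # the stored value is unchanged
```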

In [None]:
# Importing the libraries into this Notebook
import math
import sys
import util

import numpy as np
import pandas as pd

from scipy.optimize import linear_sum_assignment

In [None]:
# Setting parameters for Jupyter output
np.set_printoptions(precision=2,threshold=sys.maxsize)
pd.options.display.float_format = "{:.2f}".format

In [None]:
# Creating a global variable for "infinity".  This will be needed later on when calculating assignments.
inff = 1000000000

In [None]:
# Getting the data (courtesy of Chas Nelson!)
import urllib.request
import io
import os
import zipfile

from tqdm import tqdm_notebook, tnrange

url = 'http://data.celltrackingchallenge.net/training-datasets/Fluo-N2DH-GOWT1.zip'

if os.path.isfile ("./assets/images/Fluo-N2DH-GOWT1/01/t001.tif"):
    print("Images already downloaded")
else:
    with urllib.request.urlopen(url) as response:
        print("Downloading...")
        length = int(response.getheader('content-length'))
        chunk = max(4096, length//9999)
    
        buffer = io.BytesIO()
        size = 0
        for b in tnrange(length//chunk + 1):
            block = response.read(chunk)
            if not block:
                # Stop reading once the response is exhausted, reporting progress as a percentage
                print("Finished reading after {0:.0f}% of file.".format(100*size/length))
                break
            buffer.write(block)
            size = size + len(block)
        print("Finished reading file.")
    
        print("Unzipping... ",end="")
        zf = zipfile.ZipFile(buffer)
        os.makedirs('./assets/images/',exist_ok=True)
        zf.extractall(path='./assets/images')
        print("Complete")

## Exercise 1 - Loading coordinates and visualising
Our key aim this afternoon is to learn about tracking, so we don't want to spend half our time detecting objects to be tracked.  As such, you'll find a pre-prepared CSV file with object coordinates at *./assets/UntrackedCoordinates.csv*.  This file has 6 columns: Point ID, x-centroid, y-centroid, timepoint, 2D area and track ID (currently set to 0).

We could also sink a couple of hours into creating a script to draw our object coordinates and tracks, but let's not.  For now, I've created a separate script which will load these files for us - this is in the "util.py" file we just imported.  This script also has functions to display the spots on top of the images, so we can check how our tracking is doing.

### Aims
- Load object coordinates from *./assets/UntrackedCoordinates.csv* into a Pandas dataframe
- Load the timeseries images corresponding to the coordinates into a 3D Numpy array
- Display the first few lines of the coordinate dataframe
- Visualise the coordinates as an overlay on the timeseries images

### Notes
- This exercise also doesn't require any code to be written.  It's more about checking the relevant example files can be loaded and the utility functions run as expected.

In [None]:
%%html
<style>
.output_wrapper button.btn.btn-default,
.output_wrapper .ui-dialog-titlebar {
  display: none;
}
</style>

In [None]:
%matplotlib widget

# Loading image stack
path = "./assets/images/Fluo-N2DH-GOWT1/01/"
images = util.load_images(path);

# Loading coordinates
path = "./assets/UntrackedCoordinates.csv"
coords = util.load_coordinates(path);

# Displaying the shape and the first 10 lines of the dataframe
print(coords.shape)
print(coords.head(10))
print("")

# Adding track renders
util.show_overlay(images,coords,False)

## Exercise 2 - Extracting coordinates for a specific timepoint
We now have the full list of coordinates, but for tracking we're often going to want to access just those from a specific frame.  The first function we'll create will take the full list of coordinates and return a list of rows corresponding to only those points in the specified frame.

### Aims
- Create a function to get a list of row indices corresponding to coordinates in a specific frame
- Arguments: The coordinates dataframe and the frame number
- Return: A list of row indices locating the relevant points in the coordinates dataframe

### Notes
- This is the first exercise where we need to add in some code.  This exercise also follows the standard format that all remaining exercises will use.  In the next box you will need to replace the comment "### CODE GOES HERE ###" with the code for that function.  Above the function you'll find a description of the function format (i.e. the inputs and outputs).  Once the code is ready, you can test it with the next box ("### TEST SPACE ###").  The test space should display the row indices for the first couple of frames.  
- If you wish to further verify these indices are correct, you could open the coordinates csv file in a spreadsheet viewer (e.g. Excel).
- You'll likely want to use the "index" attribute of the Pandas dataframe.  Indexing it with a condition returns the indices of the rows matching that condition.  For example, to get the row indices for all coordinates in frames after frame 7, we could use the following:
```Python
    coord_rows_over_f7 = coords.index[coords.FRAME > 7]
```
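
To see what this kind of condition-based indexing returns, here is a minimal sketch on a small made-up dataframe.  The `demo` dataframe and its values are purely illustrative; only the `FRAME` column name is taken from the example above.

```python
import pandas as pd

# Hypothetical toy dataframe; the values are made up for illustration
demo = pd.DataFrame({"FRAME": [0, 0, 1, 2, 2, 2],
                     "X": [1.0, 5.0, 1.2, 1.4, 5.1, 9.0]})

# Boolean condition: True for every row with FRAME greater than 1
mask = demo.FRAME > 1

# Index labels of the rows where the condition holds
rows_after_f1 = demo.index[mask]

# Converting to a plain list makes the result easy to read
print(list(rows_after_f1))
```

The result is a Pandas index object rather than a plain list, but it can be looped over and passed to `coords.loc[...]` in the same way.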

In [None]:
# Function to get the row indices for coordinates present in a specific frame.
#
# Args:
#     coords: Pandas dataframe containing all coordinates from all frames.
#     frame: Frame number for which we're getting coordinates.
#
# Returns:
#     A list of indices corresponding to the coordinates for the specified frame
#

def get_current_coords(coords, frame):
    ### CODE GOES HERE ###

In [None]:
### TEST SPACE ### 

# Loading coordinates
path = "./assets/UntrackedCoordinates.csv"
coords = util.load_coordinates(path);

# Testing on coordinates from frame 0
frame = 0
rows_0 = get_current_coords(coords,frame)
print("Row indices for points in frame 0:\n%a\n" % rows_0)

# Testing on coordinates from frame 1
frame = 1
rows_1 = get_current_coords(coords,frame)
print("Row indices for points in frame 1:\n%a\n" % rows_1)

## Exercise 3 - Assign track IDs to coordinates in first frame
Eventually, all points will be assigned track IDs.  As we work through all the frames, points will be linked to those in a previous frame and inherit their track IDs.  This way, each track ID gets propagated through the timeseries.  To kick things off, we need to assign all points in the first frame unique IDs.  The function we create here will identify any points in a specific frame (for now, frame 0) that don't have assigned track IDs and assign them the smallest, currently unused track ID.  By keeping the function general like this we can reuse it later on when we find points in any frame that didn't get linked back to an existing track.

### Aims
- Assign unique track ID numbers to the rows in the coordinates dataframe ("coords") that match the provided rows
- Arguments: The coordinates dataframe and the row indices to assign track IDs for
- Return: Nothing!  This function will update the input dataframe

### Notes
- This function will update the coordinates dataframe that is passed to the function, so there are no returned values.
- With Pandas you can get the maximum value in a column using the function "max()".  For example, the following will give the maximum x-value in the whole dataframe:

```Python
    max_x_value = coords.X.max()
```

- One way to do this is to loop over all specified rows, checking if the current track ID is "0".  If it is, give it the next available track ID.
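
The sketch below illustrates the two mechanics this exercise relies on: writing to a single cell with `.loc`, and the fact that a dataframe modified inside a function is also changed for the caller.  The `flag_rows` helper, the `FLAG` column and all values are hypothetical and deliberately unrelated to track IDs.

```python
import pandas as pd

# Hypothetical toy dataframe; the values are made up for illustration
demo = pd.DataFrame({"X": [4.0, 7.5, 2.1], "FLAG": [0, 0, 0]})

def flag_rows(df, rows):
    # Illustrative helper: writes into the dataframe it is given and returns nothing
    for row in rows:
        df.loc[row, "FLAG"] = 1

flag_rows(demo, [0, 2])

# The caller's dataframe has been updated in place
print(demo.FLAG.tolist())

# max() gives the largest value in a column
print(demo.X.max())
```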

In [None]:
# Function to assign unique track ID numbers to any unassigned coordinates at the specified row
# indices.  Unassigned coordinates are identified by a track ID of 0.
#
# Args:
#     coords: Pandas dataframe containing all coordinates from all frames.
#     point_rows: list of row indices for points to have track IDs assigned.  All, some or 
#                 none of these may need track IDs assigning.
#
# Returns:
#     This function updates the provided coordinate dataframe, so does not return anything.
#

def assign_new_IDs(coords, point_rows):
    ### CODE GOES HERE ###        

In [None]:
### TEST SPACE ###

# Loading coordinates
path = "./assets/UntrackedCoordinates.csv"
coords = util.load_coordinates(path);

# Getting all points in the first frame
frame = 0
rows_0 = get_current_coords(coords,frame)

# Assigning unique track IDs to all these points
assign_new_IDs(coords,rows_0)

# Displaying the first 30 rows of our coordinate dataframe
print(coords.head(30))

## Exercise 4a - Getting all tracks
When assigning track links for each frame, we need to know what tracks are available to link to.  Here, we want to identify the rows corresponding to the most recent instance of each track.

To start with, we're going to consider any track that has existed as being "available".  As such, we just want to get the most recent row of the coordinates dataframe for each track.  In the second half of this exercise (4b) we will introduce a limit to how far back we will search for the most recent instance of each track.

### Aims
- Get a list containing the row indices of the most recent coordinate for each track
- Arguments: The coordinates dataframe
- Return: A list of row indices for the most recent instance of each track

### Notes
- To check our code works it's useful to have some coordinates that have already been tracked.  Therefore, for this exercise, the "coords" dataframe we load has been tracked up to frame 49
- We will need to find the unique track IDs in our coordinates dataframe.  It's possible to get all the unique values in a list using Numpy's "unique" function.  For example, to get the unique values in the list "my_list" we would call:
```Python
    unique_values = np.unique(my_list)
```
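- You can get the last element of a list or array using a negative index.  Using *-1:* returns a one-element slice containing the final value, whereas *-1* returns the value itself.  For example, the following will create an array containing three numbers, then print the final element as a one-element array (i.e. "[21]"):

```Python
    demo_array = np.array([42,63,21])
    print("Final element = ",demo_array[-1:])
```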
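
As a rough sketch of how these two hints behave on a dataframe, the following uses a small made-up table.  The `demo` dataframe and its values are illustrative only, and it assumes rows are stored in frame order.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataframe, assumed to be sorted by frame
demo = pd.DataFrame({"TRACK_ID": [1, 2, 1, 3, 2],
                     "FRAME":    [0, 0, 1, 1, 2]})

# Sorted array of the distinct track IDs
unique_ids = np.unique(demo.TRACK_ID)
print(unique_ids)

# Row indices for a single track, in the order they appear
rows_for_2 = demo.index[demo.TRACK_ID == 2]

# -1: keeps the result as a one-element index; -1 gives the bare value
print(list(rows_for_2[-1:]), rows_for_2[-1])
```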

In [None]:
# Function to find the row indices corresponding to the most recent instance of each track.
#
# Args:
#     coords: Pandas dataframe containing all coordinates from all frames.
#
# Returns:
#     A list containing the row indices for the most recent instance of each track.
#

def get_all_tracks(coords):    
    ### CODE GOES HERE ###

In [None]:
### TEST SPACE ### 

# Loading coordinates
path = "./assets/PartiallyTrackedCoordinates.csv"
coords = util.load_coordinates(path);

# Getting row indices for all available tracks
track_rows = get_all_tracks(coords)

# Displaying the points we have identified
print("Identified %i available tracks" % len(track_rows))
print("")
print(coords.loc[track_rows])

## Exercise 4b - Getting all RECENT tracks
In the first half of this exercise we identified the most recent instance of each track.  However, we may also choose to only allow links to tracks identified within a specific number of frames.  For example, we may not want to allow a point in frame 42 to link back to a track last seen in frame 9.

### Aims
- Repeat the code from Exercise 4a, but only permit links to tracks between the provided values of *start_frame* and *end_frame*
- Arguments: The coordinates dataframe, the start and end frames, between which we will get available tracks
- Return: A list of row indices for the most recent instance of each track

### Notes
- As before, we'll load the partially-tracked coordinates.
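
One possible way to restrict rows to a frame range is to combine two conditions with `&`.  The sketch below uses a made-up dataframe and assumes both ends of the range are inclusive; note that each condition needs its own parentheses.

```python
import pandas as pd

# Hypothetical toy dataframe; the values are made up for illustration
demo = pd.DataFrame({"FRAME": [0, 1, 2, 3, 4, 5]})

start_frame = 2
end_frame = 4

# Each comparison is wrapped in parentheses before combining with &
in_range = demo.index[(demo.FRAME >= start_frame) & (demo.FRAME <= end_frame)]

print(list(in_range))
```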

In [None]:
# Function to find the row indices corresponding to the most recent instance of each track.
# This will only return row indices for tracks present within a specified frame range.
#
# Args:
#     coords: Pandas dataframe containing all coordinates from all frames
#     start_frame: First frame of range for which tracks are considered "available" for linking to
#     end_frame: Final frame of range for which tracks are considered "available" for linking to.  This
#                will typically be the frame immediately prior to the "current" frame.
#
# Returns:
#     A list containing the row indices for the most recent instance of each available track.
#

def get_available_tracks(coords, start_frame, end_frame):
    ### CODE GOES HERE ###

In [None]:
### TEST SPACE ### 

# Loading coordinates
path = "./assets/PartiallyTrackedCoordinates.csv"
coords = util.load_coordinates(path);

# Getting row indices for all available tracks
start_frame = 44
end_frame = 49
track_rows = get_available_tracks(coords,start_frame,end_frame)

# Displaying the points we have identified
print("Identified %i available tracks" % len(track_rows))
print("")
print(coords.loc[track_rows])

## Exercise 5 - Calculating cost for linking two points
In a couple of steps we will end up with two lists: one for points in the current frame; the other for the most recent point in all available tracks.  Links between the points in the two lists will be assigned based on the cost associated with making that link.  Here, we want to create a function that will calculate the cost associated with two given points.  To start with, we will just calculate the cost as the distance between the two points, but we could also add costs associated with other metrics.  For example, we may want to penalise links which would see the size or intensity of the object change too much.

### Aims
- Calculate the cost for linking two provided coordinates
- Arguments: The coordinates dataframe, the row index for last instance of the track, row index for current point, a distance threshold
- Return: A single, floating point value corresponding to the cost for linking these two points

### Notes
- For this we'll load in the full set of coordinates, but only calculate the cost for the first points in frame 0 and frame 1.
- In this case, the "cost" is simply the straight-line distance between the two points.
- To get the distance between two points we can use the following equation, where x1,y1 and x2,y2 are the xy coordinates of the two points:

```Python
    dx = x1-x2
    dy = y1-y2
    dist = math.sqrt(dx*dx + dy*dy)
```
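
To make the role of the threshold concrete, here is a minimal sketch working on bare numbers rather than on the dataframe.  The `toy_cost` function and the `BIG` constant are hypothetical stand-ins (the latter for the "infinity" variable created in Exercise 0), and `math.hypot` is simply a shorthand for the square-root expression above.

```python
import math

# Hypothetical stand-in for the "infinity" value
BIG = 1000000000

def toy_cost(x1, y1, x2, y2, thresh):
    # Straight-line distance between (x1, y1) and (x2, y2)
    dist = math.hypot(x1 - x2, y1 - y2)
    # Separations above the threshold get the very large cost instead
    return dist if dist <= thresh else BIG

print(toy_cost(0, 0, 3, 4, 10))   # within the threshold
print(toy_cost(0, 0, 3, 4, 0.5))  # beyond the threshold
```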

In [None]:
# Function to calculate the cost of linking two points.  This function simply calculates the cost as 
# the distance between the two points.
#
# Args:
#     coords: Pandas dataframe containing all coordinates from all frames
#     track_row: Row index for the most recent coordinate in a track
#     point_row: Row index for a coordinate in the current frame
#     thresh: Maximum permitted distance between two points.  If points are separated by more than this
#             the cost will be assigned a very large value to ensure a very low probability of the link
#             being assigned by the Munkres algorithm.
#
# Returns:
#     A floating point (decimal) value corresponding to the cost of linking the two specified points.
#

def calculate_cost(coords,track_row,point_row,thresh):
    ### CODE GOES HERE ###

In [None]:
### TEST SPACE ### 

# Loading coordinates
path = "./assets/UntrackedCoordinates.csv"
coords = util.load_coordinates(path);

# Let's get the indices for the coordinates in the first frame
rows_0 = get_current_coords(coords,0)

# For this cost calculation we'll just use the first coordinate in this frame
row_0 = rows_0[0]

# Let's take a look at that coordinate
print("Coordinate 1:\n%a\n" % coords.loc[row_0])

# Now, let's do the same for the second frame
rows_1 = get_current_coords(coords,1)
row_1 = rows_1[0]
print("Coordinate 2:\n%a\n" % coords.loc[row_1])

# Calculating the cost of the two points with the threshold set high
thresh = 10;
cost = calculate_cost(coords,row_0,row_1,thresh)
print("Cost = %f (thresh = %f)" % (cost,thresh))

# Now, we'll do the same, but with a lower threshold
thresh = 0.5;
cost = calculate_cost(coords,row_0,row_1,thresh)
print("Cost = %f (thresh = %f)" % (cost,thresh))

## Exercise 6 - Calculate cost matrix
Now that we can calculate the cost for a single pair of points, we need to do it for all point pairs.  We will create a function which takes the points in the current frame and the available track points, then generates a 2D cost matrix.  The cost matrix will have a column for each point in the current frame and a row for each available track.  The value of each element will therefore be the cost of linking the point and track corresponding to that column and row.  We also want to limit the distance that links can be made over; therefore, the cost for any point-track pair separated by more than a specific distance will be set to infinity.  This doesn't prevent them being linked (if there are no better options, the assignment algorithm will still suggest that as a link), but it makes it less likely.

Once we have the cost matrix, we can use SciPy's linear_sum_assignment to calculate the optimal assignments.  This step has been included as the final step in the test space.  We will look into the format of the assignment output in more detail as part of exercise 7.

### Aims
- Given two lists containing row indices, the first for the most recent instance of all available tracks and the second for all points in the current frame, calculate a cost matrix.
- Arguments: The coordinates dataframe, the row indices for tracks, row indices for current points, a distance threshold
- Return: An array of floating point values corresponding to the costs for linking all points

### Notes
- The cost matrix will need to be a 2D Numpy array with a row for every available track and a column for every point in the current frame.
- One way to do this is to use nested loops.  "Nesting loops" simply means to put one loop inside another.  The following example has two nested loops, which print their iteration numbers:

```Python
    for i in range(0,3):
        for j in range(7,9):
            print(i,",",j)
``` 
        This will print the following: "0 , 7", "0 , 8", "1 , 7", "1 , 8", "2 , 7", "2 , 8"
    
- When looping over a structure you can use the *enumerate* function to get the current iteration index as well as the current value:

```Python
    demo_array = np.array([42,63,21])
    for i,val in enumerate(demo_array):
        print("Iter=%i,value=%i" % (i,val))
```
        This will print "Iter=0,value=42", "Iter=1,value=63", "Iter=2,value=21"

- So we can easily see how this is working, we'll load a special set of coordinates.  This set only has 3 points in the first frame and 4 points in the second frame.
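
The following sketch shows the general shape of filling a 2D array with nested loops and passing it to `linear_sum_assignment`.  It uses made-up 1D positions and the absolute difference as the cost, so it is an illustration of the layout (rows for tracks, columns for points) rather than of the function you need to write.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Hypothetical 1D positions, made up for illustration
track_pos = [0.0, 10.0, 20.0]         # one row per track
point_pos = [19.0, 1.0, 11.0, 50.0]   # one column per point

# Pre-allocate the matrix, then fill one element per track-point pair
costs = np.zeros((len(track_pos), len(point_pos)))
for i, t in enumerate(track_pos):
    for j, p in enumerate(point_pos):
        costs[i, j] = abs(t - p)

print(costs)

# The solver returns matching row (track) and column (point) indices
track_idx, point_idx = linear_sum_assignment(costs)
print(track_idx, point_idx)
```

With more points than tracks, one point (here the one at 50.0) is left without an assignment.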

In [None]:
# Function to calculate the cost of linking all points specified by two lists of row indices.  The
# first list of indices corresponds to the most recent coordinates in each available track.  The 
# second list of indices corresponds to the coordinates in the current frame.  Costs are added to a 
# 2D Numpy array.
#
# Args:
#     coords: Pandas dataframe containing all coordinates from all frames
#     track_rows: Row indices for the most recent coordinate in all available tracks
#     point_rows: Row indices for all coordinates in the current frame
#     thresh: Maximum permitted distance between two points.  If points are separated by more than this
#             the cost will be assigned a very large value to ensure a very low probability of the link
#             being assigned by the Munkres algorithm.
#
# Returns:
#     A 2D Numpy array containing all costs.  Each row of the array corresponds to a track and each column
#     to a current coordinate.  The intersection of each row and column is the associated cost.
#

def calculate_cost_matrix(coords,track_rows,point_rows,thresh):
    ### CODE GOES HERE ###

In [None]:
### TEST SPACE ### 

# Loading coordinates
path = "./assets/TestCoordinatesForCostMatrix.csv"
coords = util.load_coordinates(path);

# Let's get the indices for the coordinates in the first (available tracks) and second (current points) frames
track_rows = get_current_coords(coords,0)
point_rows = get_current_coords(coords,1)

# Calculating the cost matrix
thresh = 20
costs = calculate_cost_matrix(coords,track_rows,point_rows,thresh)

# Displaying our costs
print("Cost matrix:\n%a\n" % costs)

# Using SciPy's linear_sum_assignment solver to get the optimal assignments
assignments = linear_sum_assignment(costs)
print("Assignments:\n%a\n%a\n" % (assignments[0],assignments[1]))

## Exercise 7 - Assigning tracks
As we saw at the end of the last exercise (in the test space), the job of calculating the assignments is done using the Munkres (aka Kuhn-Munkres or Hungarian) algorithm.  Rather than create our own implementation of this, we can use SciPy's linear_sum_assignment function.  This function simply takes the cost matrix we just created and returns two arrays of length N, where N is the number of assignments.  The first array holds the row (track) index of each assignment and the second holds the corresponding column (point) index.  

As mentioned previously, the Munkres algorithm will still assign links that we set to "infinity", so before we copy over any track IDs we want to double check the cost is not infinity.

### Aims
- Using the calculated track-point assignments, apply the relevant track IDs to each point.
- Arguments: The coordinates dataframe, the row indices for the tracks and points to be (potentially) linked, the cost matrix for these tracks and points.
- Return: Nothing!  This function will update the input dataframe

### Notes
- If we were to iterate over *assignments* directly we would receive the array of track indices, then the array of point indices.  Instead, we want to get one track-point pair at a time.  We can do this using the *zip* function.
- For this example we'll go back to using the partially-tracked example coordinates.  This dataset has tracks assigned up to frame 49.
- The test space for this exercise uses many of the functions we've created so far: 
    - Getting available tracks
    - Getting coordinates from the current frame
    - Calculating the cost matrix
    - Calculating assignments
    - Inheriting track IDs from assigned tracks
    - Creating new track IDs for unlinked points
- We'll only permit linking to tracks last seen within 3 frames of the current frame.
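
As a minimal sketch of stepping through assignments pair by pair and rejecting "infinite" links, consider the tiny made-up cost matrix below.  The `BIG` constant is a hypothetical stand-in for the "infinity" variable from Exercise 0, and the `accepted` list is only there to show which pairs survive the check.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Hypothetical stand-in for the "infinity" value
BIG = 1000000000

# Two tracks (rows) and two points (columns); only one link is plausible
costs = np.array([[2.0, BIG],
                  [BIG, BIG]])

track_idx, point_idx = linear_sum_assignment(costs)

accepted = []
# zip pairs up the i-th track index with the i-th point index
for t, p in zip(track_idx, point_idx):
    # The solver still returns "infinite" links, so filter them out here
    if costs[t, p] < BIG:
        accepted.append((int(t), int(p)))

print(accepted)
```

Note that `t` and `p` are positions within the cost matrix, not row indices of the coordinates dataframe.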

In [None]:
# Function to assign links between two sets of coordinates based on provided costs.
#
# Args:
#     coords: Pandas dataframe containing all coordinates from all frames
#     track_rows: Row indices for the most recent coordinate in all available tracks
#     point_rows: Row indices for all coordinates in the current frame
#     costs: 2D Numpy array containing all costs.  Each row of the array corresponds to a track and each 
#            column to a current coordinate.  The intersection of each row and column is the associated cost.
#
# Returns:
#     This function updates the provided coordinate dataframe, so does not return anything.
#

def assign_IDs(coords, track_rows, point_rows, costs):
    ### CODE GOES HERE ###

In [None]:
### TEST SPACE ### 

# Loading coordinates
path = "./assets/PartiallyTrackedCoordinates.csv"
coords = util.load_coordinates(path);

# Setting some parameters
frame = 50
frame_thresh = 3
linking_thresh = 5

# Getting available tracks for the current frame.  We'll only allow links back "frame_thresh" frames
start_frame = frame - frame_thresh
end_frame = frame - 1
track_rows = get_available_tracks(coords,start_frame,end_frame)

# Getting row indices for points in the current frame
point_rows = get_current_coords(coords,frame)

# Calculating cost matrix
costs = calculate_cost_matrix(coords,track_rows,point_rows,linking_thresh)

# Calculating assignments using SciPy's linear_sum_assignment and applying the linked track IDs
assign_IDs(coords, track_rows, point_rows, costs)

# Displaying the coordinates for the current frame (with new assigned track IDs)
print("Coordinates after linking:\n%a\n" % coords.loc[point_rows])

# Assigning new track IDs to points that weren't linked
assign_new_IDs(coords,point_rows)
print("Coordinates after new track creation:\n%a\n" % coords.loc[point_rows])

## Exercise 8 - Putting it all together
We now have all the components necessary to construct a full tracking workflow.  For this, we put most of the components in a single for-loop, which will iterate over each frame.  At the end of this we will re-render the overlay to see if our tracking has worked correctly.  Unlike previous exercises, you don't need to create new functions here - you have everything you need.  The aim is to load the full coordinate set, then iterate over each frame, tracking the coordinates.

### Aims
- Take the various functions you've created over the past few exercises and put them all together, so we can track over all frames.
- At the end of this section you want to end up with the *TRACK_ID* column of the *coords* dataframe completed (i.e. all rows have a relevant ID).

### Notes
- Unlike previous exercises, there are no extra functions to create here.  As such, you'll be adding code directly into the test space.
- You may want to look at the test space for exercise 7 as an example of tracking between two frames.
- There's a final box of code after the test space which will run the pre-prepared plotting tool.  If the tracking has worked, you should see tracked lines over the images.
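
When looping over frames, the range of frames to search for available tracks moves along with the current frame.  The sketch below only prints that range for a few frames; the frame count is made up, and clipping the start of the range at 0 with `max` is an assumption made here for tidiness rather than a requirement.

```python
frame_thresh = 3

# Hypothetical number of frames, made up for illustration
n_frames = 6

windows = []
# Frame 0 has nothing to link back to, so the loop starts at frame 1
for frame in range(1, n_frames):
    start_frame = max(0, frame - frame_thresh)
    end_frame = frame - 1
    windows.append((frame, start_frame, end_frame))
    print("Frame %i links to tracks last seen in frames %i to %i" % (frame, start_frame, end_frame))
```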

In [None]:
### TEST SPACE ### 

# First, a bit of housekeeping, so we can display the overlaid tracks later on
%matplotlib widget

# Setting parameters
np.set_printoptions(precision=3,threshold=sys.maxsize)

linking_thresh = 5
frame_thresh = 3

# Loading image stack
path = "./assets/images/Fluo-N2DH-GOWT1/01/"
images = util.load_images(path);

# Loading coordinates
path = "./assets/UntrackedCoordinates.csv"
coords = util.load_coordinates(path);

### CODE GOES HERE ###

In [None]:
# Adding track renders
util.show_overlay(images,coords,True)