This notebook will demonstrate how to use Cloudknot to parallelize a tracking method.  Example Cloudknot functions are provided in the knotlet module, but the user must build his/her own functions for this step to work properly.  

In [1]:
import os
import diff_classifier.imagej as ij
import boto3
import os.path as op
import diff_classifier.aws as aws
import cloudknot as ck
import diff_classifier.knotlets as kn
import numpy as np

import numpy.ma as ma
import pandas as pd

import diff_classifier.utils as ut
import diff_classifier.msd as msd
import diff_classifier.features as ft

# Experiment Initialization

First, I define the nomenclature I use to name my files, as well as specify exceptions (files that weren't generated or are missing and will be skipped in the analysis).  In this case, I was analyzing data collected in tissue slices.  Videos are named according to the pup number, the slice number, the hemisphere, and the video number.

In [2]:
to_track = []
result_futures = {}
start_knot = 12 #Must be unique number for every run on Cloudknot.

#slices = ["1", "2", "3", "4", "5"] #Number of slices per pup
folder = '07_12_18_varying_PEG_excess' #Folder in AWS S3 containing files to be analyzed
vids = 20

for num in slices:
    to_track.append('100_1xs_XY{}'.format(num))

In [None]:
to_track

# Define Cloudknot Function

The function defined below is sent to each individual machine the user calls upon.  A single video is sent to each machine for analysis, and the resulting outputs are uploaded to S3.  This case uses files that are only temporarily stored in a private bucket.  

The following function is broken down into four separate sections performing different tasks of the analysis:

* **parameter prediction**: A regression tool is used to predict the quality tracking parameter used by Trackmate based off a training dataset of images whose qualities were assessed manually beforehand.  If analyzing a large number of samples, the user should build a similar training dataset.

* **splitting section**: Splits videos to be analyzed into smaller chunks to make analysis feasible.

* **tracking section**: Tracks the videos using a Trackmate script.

* **MSDs and features calculations**: Calculates MSDs and relevant features and outputs associated files and images.

In [3]:
def split(prefix, folder):

    #Splitting section
    ###############################################################################################
    remote_folder = folder
    local_folder = os.getcwd()
    ires = 512
    frames = 651
    filename = '{}.tif'.format(prefix)
    remote_name = remote_folder+'/'+filename
    local_name = local_folder+'/'+filename

    msd_file = 'msd_{}.csv'.format(prefix)
    ft_file = 'features_{}.csv'.format(prefix)

    s3 = boto3.client('s3')

    names = []
    for i in range(0, 4):
        for j in range(0, 4):
            names.append('{}_{}_{}.tif'.format(prefix, i, j))

    try:
        for name in names:
            aws.download_s3(remote_folder+'/'+name, name, bucket_name='hpontes.data')
    except:
        aws.download_s3(remote_name, local_name, bucket_name='hpontes.data')
        names = ij.partition_im(local_name)

        names = []
        for i in range(0, 4):
            for j in range(0, 4):
                names.append('{}_{}_{}.tif'.format(prefix, i, j))

        for name in names:
            aws.upload_s3(name, remote_folder+'/'+name, bucket_name='hpontes.data')
            os.remove(name)
            print("Done with splitting.  Should output file of name {}".format(remote_folder+'/'+name))

        os.remove(filename)

In [11]:
def assemble_msds(prefix, folder):
    
    remote_folder = folder
    local_folder = os.getcwd()
    ires = 512
    frames = 651
    filename = '{}.tif'.format(prefix)
    remote_name = remote_folder+'/'+filename
    local_name = local_folder+'/'+filename

    msd_file = 'msd_{}.csv'.format(prefix)
    ft_file = 'features_{}.csv'.format(prefix)

    s3 = boto3.client('s3')

    names = []
    for i in range(0, 4):
        for j in range(0, 4):
            names.append('{}_{}_{}.tif'.format(prefix, i, j))
    #MSD and features section
    #################################################################################################
    files_to_big = False
    size_limit = 10

    counter = 0
    for name in names:
        row = int(name.split('.')[0].split('_')[3])
        col = int(name.split('.')[0].split('_')[4])

        filename = "Traj_{}_{}_{}.csv".format(prefix, row, col)
        local_name = local_folder+'/'+filename

        if counter == 0:
            to_add = ut.csv_to_pd(local_name)
            to_add['X'] = to_add['X'] + ires*col
            to_add['Y'] = ires - to_add['Y'] + ires*(3-row)
            merged = msd.all_msds2(to_add, frames=frames)
        else:

            if merged.shape[0] > 0:
                to_add = ut.csv_to_pd(local_name)
                to_add['X'] = to_add['X'] + ires*col
                to_add['Y'] = ires - to_add['Y'] + ires*(3-row)
                to_add['Track_ID'] = to_add['Track_ID'] + max(merged['Track_ID']) + 1
            else:
                to_add = ut.csv_to_pd(local_name)
                to_add['X'] = to_add['X'] + ires*col
                to_add['Y'] = ires - to_add['Y'] + ires*(3-row)
                to_add['Track_ID'] = to_add['Track_ID']

            merged = merged.append(msd.all_msds2(to_add, frames=frames))
            print('Done calculating MSDs for row {} and col {}'.format(row, col))
        counter = counter + 1


    for name in names:
        outfile = 'Traj_' + name.split('.')[0] + '.csv'
        os.remove(outfile)

    merged.to_csv(msd_file)
    aws.upload_s3(msd_file, remote_folder+'/'+msd_file, bucket_name='hpontes.data')
    merged_ft = ft.calculate_features(merged)
    merged_ft.to_csv(ft_file)

    aws.upload_s3(ft_file, remote_folder+'/'+ft_file, bucket_name='hpontes.data')

    os.remove(ft_file)
    os.remove(msd_file)

In [5]:
def tracking(subprefix):
    
    folder = '06_15_18_gel_validation'
    
    import os
    import os.path as op
    import numpy as np
    import numpy.ma as ma
    import pandas as pd
    import boto3
    
    import diff_classifier.aws as aws
    import diff_classifier.utils as ut
    import diff_classifier.msd as msd
    import diff_classifier.features as ft
    import diff_classifier.imagej as ij
    
    remote_folder = folder
    local_folder = os.getcwd()
    ires = 512
    frames = 651
    filename = '{}.tif'.format(subprefix)
    remote_name = remote_folder+'/'+filename
    local_name = local_folder+'/'+filename

    s3 = boto3.client('s3')

    #Tracking section
    ################################################################################################

    outfile = 'Traj_' + subprefix + '.csv'
    local_im = op.join(local_folder, '{}.tif'.format(subprefix))

    row = int(subprefix.split('_')[3])
    col = int(subprefix.split('_')[4])

    try:
        aws.download_s3(remote_folder+'/'+outfile, outfile, bucket_name='hpontes.data')
    except:
        aws.download_s3('{}/{}'.format(remote_folder, '{}.tif'.format(subprefix)), local_im, bucket_name='hpontes.data')
        test_intensity = ij.mean_intensity(local_im)
        if test_intensity > 500:
            quality = 450
        else:
            quality = 200

        if row==3:
            y = 485
        else:
            y = 511

        ij.track(local_im, outfile, template=None, fiji_bin=None, radius=3.5, threshold=0.5,
                 do_median_filtering=False, quality=quality, x=511, y=y, ylo=1, median_intensity=300.0, snr=0.0,
                 linking_max_distance=4.0, gap_closing_max_distance=7.0, max_frame_gap=2,
                 track_displacement=20.0)

        aws.upload_s3(outfile, remote_folder+'/'+outfile, bucket_name='hpontes.data')
    print("Done with tracking.  Should output file of name {}".format(remote_folder+'/'+outfile))

In [6]:
def download_split_track_msds(prefix):
    """
    1. Checks to see if features file exists.
    2. If not, checks to see if image partitioning has occured.
    3. If yes, checks to see if tracking has occured.
    4. Regardless, tracks, calculates MSDs and features.
    """
    
    import os
    import os.path as op
    import numpy as np
    import numpy.ma as ma
    import pandas as pd
    import boto3
    
    import diff_classifier.aws as aws
    import diff_classifier.utils as ut
    import diff_classifier.msd as msd
    import diff_classifier.features as ft
    import diff_classifier.imagej as ij
    
    folder = '06_15_18_gel_validation'

    #Splitting section
    ###############################################################################################
    remote_folder = '06_15_18_gel_validation'
    local_folder = os.getcwd()
    ires = 512
    frames = 651
    filename = '{}.tif'.format(prefix)
    remote_name = remote_folder+'/'+filename
    local_name = local_folder+'/'+filename

    msd_file = 'msd_{}.csv'.format(prefix)
    ft_file = 'features_{}.csv'.format(prefix)

    s3 = boto3.client('s3')

    names = []
    for i in range(0, 4):
        for j in range(0, 4):
            names.append('{}_{}_{}.tif'.format(prefix, i, j))

    try:
        obj = s3.head_object(Bucket='hpontes.data', Key=remote_folder+'/'+ft_file)
    except:

        try:
            for name in names:
                aws.download_s3(remote_folder+'/'+name, name, bucket_name='hpontes.data')
        except:
            aws.download_s3(remote_name, local_name, bucket_name='hpontes.data')
            names = ij.partition_im(local_name)
            
            names = []
            for i in range(0, 4):
                for j in range(0, 4):
                    names.append('{}_{}_{}.tif'.format(prefix, i, j))

            for name in names:
                aws.upload_s3(name, remote_folder+'/'+name, bucket_name='hpontes.data')
                print("Done with splitting.  Should output file of name {}".format(remote_folder+'/'+name))

            os.remove(filename)
        #Tracking section
        ################################################################################################
        for name in names:
            outfile = 'Traj_' + name.split('.')[0] + '.csv'
            local_im = op.join(local_folder, name)

            row = int(name.split('.')[0].split('_')[3])
            col = int(name.split('.')[0].split('_')[4])

            try:
                aws.download_s3(remote_folder+'/'+outfile, outfile, bucket_name='hpontes.data')
            except:
                test_intensity = ij.mean_intensity(local_im)
                if test_intensity > 500:
                    quality = 450
                else:
                    quality = 200
                
                if row==3:
                    y = 485
                else:
                    y = 511

                ij.track(local_im, outfile, template=None, fiji_bin=None, radius=3.5, threshold=0.5,
                         do_median_filtering=False, quality=quality, x=511, y=y, ylo=1, median_intensity=300.0, snr=0.0,
                         linking_max_distance=4.0, gap_closing_max_distance=7.0, max_frame_gap=2,
                         track_displacement=20.0)

                aws.upload_s3(outfile, remote_folder+'/'+outfile, bucket_name='hpontes.data')
            print("Done with tracking.  Should output file of name {}".format(remote_folder+'/'+outfile))


        #MSD and features section
        #################################################################################################
        files_to_big = False
        size_limit = 10

        for name in names:
            outfile = 'Traj_' + name.split('.')[0] + '.csv'
            local_im = name
            file_size_MB = op.getsize(local_im)/1000000
            if file_size_MB > size_limit:
                file_to_big = True

        if files_to_big:
            print('One or more of the {} trajectory files exceeds {}MB in size.  Will not continue with MSD calculations.'.format(
                  prefix, size_limit))
        else:
            counter = 0
            for name in names:
                row = int(name.split('.')[0].split('_')[3])
                col = int(name.split('.')[0].split('_')[4])

                filename = "Traj_{}_{}_{}.csv".format(prefix, row, col)
                local_name = local_folder+'/'+filename

                if counter == 0:
                    to_add = ut.csv_to_pd(local_name)
                    to_add['X'] = to_add['X'] + ires*col
                    to_add['Y'] = ires - to_add['Y'] + ires*(3-row)
                    merged = msd.all_msds2(to_add, frames=frames)
                else:

                    if merged.shape[0] > 0:
                        to_add = ut.csv_to_pd(local_name)
                        to_add['X'] = to_add['X'] + ires*col
                        to_add['Y'] = ires - to_add['Y'] + ires*(3-row)
                        to_add['Track_ID'] = to_add['Track_ID'] + max(merged['Track_ID']) + 1
                    else:
                        to_add = ut.csv_to_pd(local_name)
                        to_add['X'] = to_add['X'] + ires*col
                        to_add['Y'] = ires - to_add['Y'] + ires*(3-row)
                        to_add['Track_ID'] = to_add['Track_ID']

                    merged = merged.append(msd.all_msds2(to_add, frames=frames))
                    print('Done calculating MSDs for row {} and col {}'.format(row, col))
                counter = counter + 1

            merged.to_csv(msd_file)
            aws.upload_s3(msd_file, remote_folder+'/'+msd_file, bucket_name='hpontes.data')
            merged_ft = ft.calculate_features(merged)
            merged_ft.to_csv(ft_file)

            aws.upload_s3(ft_file, remote_folder+'/'+ft_file, bucket_name='hpontes.data')

        

In [None]:
for prefix in to_track:
    split(prefix, folder)

# Build Docker Image

Cloudknot requires a Docker image to load on each machine that is used.  This image has all the required dependencies for the code to run.  The Docker image created is available as 'arokem/python3-fiji:0.3'.  It essentially just includes a Fiji install in the correct location, and points to the correct Github installs.

Note: Use "sudo docker system prune -a" to clear existing Dockers before creating a new Docker image.

In [7]:
github_installs=('https://github.com/ccurtis7/diff_classifier.git')
my_image = ck.DockerImage(func=tracking, base_image='arokem/python3-fiji:0.3', github_installs=github_installs)

docker_file = open(my_image.docker_path)
docker_string = docker_file.read()
docker_file.close()

req = open(op.join(op.split(my_image.docker_path)[0], 'requirements.txt'))
req_string = req.read()
req.close()

new_req = req_string[0:req_string.find('\n')-4]+'5.28'+ req_string[req_string.find('\n'):]
req_overwrite = open(op.join(op.split(my_image.docker_path)[0], 'requirements.txt'), 'w')
req_overwrite.write(new_req)
req_overwrite.close()

In [None]:
#Test Docker Image
my_image.build("0.1", image_name="test_image")

# Starting analysis with Cloudknot

This is the actual location where the commands are sent to AWS to start machines and begin the analysis.  The meat of is in the function "Knot."  The user specifies a few essentials:

* **name**: The user-defined name of the knot of machines to be started. Used to identify jobs in AWS.
* **docker_image**: The Docker image used to initialize each machine.
* **memory**: desired memory of each machine to be used.
* **resource_type**: in order to get the cheapest machines, I set this to SPOT so we can bid on machines.
* **bid_percentage**: in order to ensure I get a machine in each case, I set to 100%.  You can lower this.
* **image_id**:
* **pars_policies**: I give each machine access to the required S3 bucket here.

In [8]:
knot = {}
results_futures = {}

for prefix in to_track:
    
    names = []
    for i in range(0, 4):
        for j in range(0, 4):
            names.append('{}_{}_{}'.format(prefix, i, j))

    knot[prefix] = ck.Knot(name='download_and_track_{}_b{}'.format(prefix, start_knot),
           docker_image = my_image,
           memory = 16000,
           resource_type = "SPOT",
           bid_percentage = 100,
           image_id = 'ami-068722f4f6d903247',
           pars_policies=('AmazonS3FullAccess',))
    result_futures[prefix] = knot[prefix].map(names)
    print('Successfully started knot for {}'.format(prefix))

Successfully started knot for COOH_t1_XY1
Successfully started knot for COOH_t1_XY2
Successfully started knot for COOH_t1_XY3
Successfully started knot for COOH_t1_XY4
Successfully started knot for COOH_t1_XY5


To completely shut down all resources started after the analysis, it is good practice to clobber them using the clobber function.  The user can do this manually in the AWS Batch interface as well.

In [None]:
for prefix in to_track:
    knot[prefix].clobber()
    print('Successfully clobbered resources for {}'.format(prefix))

Successfully clobbered resources for COOH_t1_XY1
Successfully clobbered resources for COOH_t1_XY2
Successfully clobbered resources for COOH_t1_XY3
Successfully clobbered resources for COOH_t1_XY4


In [10]:
for prefix in to_track:
    assemble_msds(prefix, folder)
    print('Successfully output msds for {}'.format(prefix))

Done calculating MSDs for row 0 and col 1
Done calculating MSDs for row 0 and col 2
Done calculating MSDs for row 0 and col 3
Done calculating MSDs for row 1 and col 0
Done calculating MSDs for row 1 and col 1
Done calculating MSDs for row 1 and col 2
Done calculating MSDs for row 1 and col 3
Done calculating MSDs for row 2 and col 0
Done calculating MSDs for row 2 and col 1
Done calculating MSDs for row 2 and col 2
Done calculating MSDs for row 2 and col 3
Done calculating MSDs for row 3 and col 0
Done calculating MSDs for row 3 and col 1
Done calculating MSDs for row 3 and col 2
Done calculating MSDs for row 3 and col 3
Optimal parameters not found. Print NaN instead.
Optimal parameters not found. Print NaN instead.


  ar = width/height
  ratio = (df['MSDs'][n1]/df['MSDs'][n2]) - (df['Frame'][n1]/df['Frame'][n2])


Optimal parameters not found. Print NaN instead.
Optimal parameters not found. Print NaN instead.
Optimal parameters not found. Print NaN instead.
Optimal parameters not found. Print NaN instead.




Optimal parameters not found. Print NaN instead.
Optimal parameters not found. Print NaN instead.
Optimal parameters not found. Print NaN instead.
Optimal parameters not found. Print NaN instead.
Optimal parameters not found. Print NaN instead.
Optimal parameters not found. Print NaN instead.
Optimal parameters not found. Print NaN instead.
Optimal parameters not found. Print NaN instead.
Optimal parameters not found. Print NaN instead.
Optimal parameters not found. Print NaN instead.
Optimal parameters not found. Print NaN instead.
Optimal parameters not found. Print NaN instead.
Optimal parameters not found. Print NaN instead.
Optimal parameters not found. Print NaN instead.
Optimal parameters not found. Print NaN instead.
Optimal parameters not found. Print NaN instead.
Optimal parameters not found. Print NaN instead.
Optimal parameters not found. Print NaN instead.
Optimal parameters not found. Print NaN instead.
Optimal parameters not found. Print NaN instead.
Optimal parameters n

OSError: [Errno 2] No such file or directory: 'COOH_t1_XY3_0_0.tif'