This notebook demonstrates how to use Cloudknot to parallelize a tracking method.  Example Cloudknot functions are provided in the knotlets module, but users must build their own functions for this step to work properly.

In [1]:
import os
import diff_classifier.imagej as ij
import boto3
import os.path as op
import diff_classifier.aws as aws
import cloudknot as ck
import diff_classifier.knotlets as kn
import numpy as np

import numpy.ma as ma
import pandas as pd

import diff_classifier.utils as ut
import diff_classifier.msd as msd
import diff_classifier.features as ft
import diff_classifier.heatmaps as hm

Before running the notebook, run "sudo docker system prune -a" to clear existing Docker images.
Also set the variable start_knot to a number that has not been used in a previous run.
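
One way to avoid reusing a run number by accident is to derive it from the clock. The sketch below is only an illustration: the helper name unique_knot_name is hypothetical, and it simply reproduces the name pattern used later in this notebook when the knot is created.

```python
import time


def unique_knot_name(user, run_id=None):
    """Build a knot name; run_id defaults to the current Unix time in seconds."""
    if run_id is None:
        run_id = int(time.time())
    return 'download_and_track_{}_e{}'.format(user, run_id)


# With an explicit run number, the result matches the pattern used in this notebook.
print(unique_knot_name('Evan', 2))
```

Passing the number explicitly keeps the current behavior; omitting it gives a different name for each run started more than a second apart.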

# Experiment Initialization

First, I define the nomenclature I use to name my files, as well as specify exceptions (files that weren't generated or are missing and will be skipped in the analysis).  In this case, I was analyzing data collected in tissue slices.  Videos are named according to the pup number, the slice number, the hemisphere, and the video number.

In [6]:
to_track = []
result_futures = {}
start_knot = 2 #Must be unique number for every run on Cloudknot.

#slices = ["1", "2", "3", "4", "5"] #Number of slices per pup
folder = '08_03_18_varying_PEG_excess' #Folder in AWS S3 containing files to be analyzed
bucket = 'evanepst.data'
vids = 20
#xs = ['pt25', 'pt50', 'pt75', '1', '2', '3', '4']
xs = ['PSCOOH', 'pt25xs']
for x in xs:
    for num in range(1, vids+1):
        to_track.append('100_{}_XY{:02d}'.format(x, num))

In [7]:
to_track

['100_PSCOOH_XY01',
 '100_PSCOOH_XY02',
 '100_PSCOOH_XY03',
 '100_PSCOOH_XY04',
 '100_PSCOOH_XY05',
 '100_PSCOOH_XY06',
 '100_PSCOOH_XY07',
 '100_PSCOOH_XY08',
 '100_PSCOOH_XY09',
 '100_PSCOOH_XY10',
 '100_PSCOOH_XY11',
 '100_PSCOOH_XY12',
 '100_PSCOOH_XY13',
 '100_PSCOOH_XY14',
 '100_PSCOOH_XY15',
 '100_PSCOOH_XY16',
 '100_PSCOOH_XY17',
 '100_PSCOOH_XY18',
 '100_PSCOOH_XY19',
 '100_PSCOOH_XY20',
 '100_pt25xs_XY01',
 '100_pt25xs_XY02',
 '100_pt25xs_XY03',
 '100_pt25xs_XY04',
 '100_pt25xs_XY05',
 '100_pt25xs_XY06',
 '100_pt25xs_XY07',
 '100_pt25xs_XY08',
 '100_pt25xs_XY09',
 '100_pt25xs_XY10',
 '100_pt25xs_XY11',
 '100_pt25xs_XY12',
 '100_pt25xs_XY13',
 '100_pt25xs_XY14',
 '100_pt25xs_XY15',
 '100_pt25xs_XY16',
 '100_pt25xs_XY17',
 '100_pt25xs_XY18',
 '100_pt25xs_XY19',
 '100_pt25xs_XY20']
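
The list above follows the pattern 100_<condition>_XY<video number>. The cell that builds it does not yet skip any missing videos, so here is a minimal sketch of how exceptions could be handled. The helper build_prefixes and the skipped name are hypothetical and only serve as an illustration.

```python
def build_prefixes(conditions, n_videos, skip=()):
    """Return video prefixes for every condition, leaving out names listed in skip."""
    skip = set(skip)
    prefixes = []
    for cond in conditions:
        for num in range(1, n_videos + 1):
            name = '100_{}_XY{:02d}'.format(cond, num)
            if name not in skip:
                prefixes.append(name)
    return prefixes


# Hypothetical exception: pretend video 7 of the second condition is missing.
example = build_prefixes(['PSCOOH', 'pt25xs'], 20, skip=['100_pt25xs_XY07'])
print(len(example))
```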

# Predictor

Make sure to run this in a Python 3 notebook, and switch back to Python 2 when submitting final job to Cloudknot.

In [None]:
import os
import diff_classifier.imagej as ij
import boto3
import os.path as op
import diff_classifier.aws as aws
import diff_classifier.knotlets as kn
import numpy as np
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23

remote_folder = folder
tnum=30 #number of training datasets

pref = []
for num in to_track:                    
    for row in range(0, 4):
        for col in range(0, 4):
            pref.append("{}_{}_{}".format(num, row, col))

y = np.array([38.3, 44.6, 39.2, 28.9, 27.8, 46, 51.4, 34.94, 50, 49.56, 45, 46.26, 36, 46, 45, 48, 45.6, 
              41.5, 48.3, 50, 39, 57.5, 36, 48, 57, 47, 42, 44, 39.5, 40.8])

# Creates a regression object based on a training dataset composed of input images and
# manually calculated quality cutoffs from tracking with the GUI.
regress = ij.regress_sys(folder, pref, y, tnum, have_output=True, bucket_name = bucket)
#Read up on how regress_sys works before running.

In [None]:
#Serialize the regression object with joblib
filename = 'regress4.obj'

with open(filename,'wb') as fp:
    joblib.dump(regress,fp)

import boto3
s3 = boto3.client('s3')

aws.upload_s3(filename, folder+'/'+filename, bucket_name = bucket)

# Define Cloudknot Function

The function defined below is sent to each individual machine the user calls upon.  A single video is sent to each machine for analysis, and the resulting outputs are uploaded to S3.  This case uses files that are only temporarily stored in a private bucket.  

The following function is broken down into four separate sections performing different tasks of the analysis:

* **parameter prediction**: A regression tool is used to predict the quality tracking parameter used by Trackmate based off a training dataset of images whose qualities were assessed manually beforehand.  If analyzing a large number of samples, the user should build a similar training dataset.

* **splitting section**: Splits videos to be analyzed into smaller chunks to make analysis feasible.

* **tracking section**: Tracks the videos using a Trackmate script.

* **MSDs and features calculations**: Calculates MSDs and relevant features and outputs associated files and images.

**Updates to make:**

* Update the bucket name in all functions.
* Update the row and column locations in the string in assemble_msds.
* Update the remote folder in relevant sections.
* Update the tracking parameters.
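
The functions below all rely on the same tile naming: each video prefix is followed by a row and a column index on a 4 x 4 grid. As a rough sketch (the helper names tile_names and parse_tile are hypothetical, not part of diff_classifier), the names can be generated and parsed from the right-hand side, so the row and column positions do not depend on how many underscores the prefix contains.

```python
def tile_names(prefix, rows=4, cols=4):
    """List the tile names for one video, row by row."""
    return ['{}_{}_{}'.format(prefix, r, c)
            for r in range(rows) for c in range(cols)]


def parse_tile(name):
    """Split a tile name (with or without extension) into prefix, row, col."""
    stem = name.rsplit('.', 1)[0] if name.endswith(('.tif', '.csv')) else name
    prefix, row, col = stem.rsplit('_', 2)
    return prefix, int(row), int(col)


print(tile_names('100_PSCOOH_XY01')[:3])
print(parse_tile('100_PSCOOH_XY01_3_2.tif'))
```

Parsing from the right means the same code works for prefixes such as 100nm_5k_PEG_XY01, which contain one more underscore than the prefixes used in this run.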

In [8]:
def split(prefix, folder):

    #Splitting section
    ###############################################################################################
    bucket='evanepst.data'
    remote_folder = folder
    local_folder = os.getcwd()
    ires = 512
    frames = 651
    filename = '{}.tif'.format(prefix)
    remote_name = remote_folder+'/'+filename
    local_name = local_folder+'/'+filename

    msd_file = 'msd_{}.csv'.format(prefix)
    ft_file = 'features_{}.csv'.format(prefix)

    s3 = boto3.client('s3')

    # Download the full video and partition it into a 4x4 grid of tiles.
    aws.download_s3(remote_name, local_name, bucket_name=bucket)
    ij.partition_im(local_name)

    # Names of the tiles to upload.
    names = []
    for i in range(0, 4):
        for j in range(0, 4):
            names.append('{}_{}_{}.tif'.format(prefix, i, j))

    for name in names:
        aws.upload_s3(name, remote_folder+'/'+name, bucket_name=bucket)
        os.remove(name)
        print("Done with splitting.  Should output file of name {}".format(remote_folder+'/'+name))

    os.remove(filename)

In [9]:
def assemble_msds(prefix, folder):
    
    bucket = 'evanepst.data'
    remote_folder = folder
    #local_folder = os.getcwd()
    ires = 512
    frames = 651
    filename = '{}.tif'.format(prefix)
    remote_name = remote_folder+'/'+filename
    local_name = filename

    msd_file = 'msd_{}.csv'.format(prefix)
    ft_file = 'features_{}.csv'.format(prefix)

    s3 = boto3.client('s3')

    names = []
    for i in range(0, 4):
        for j in range(0, 4):
            names.append('{}_{}_{}.tif'.format(prefix, i, j))
    #MSD and features section
    #################################################################################################
    files_too_big = False
    size_limit = 10

    counter = 0
    for name in names:
        # Row and column are the last two fields, regardless of underscores in the prefix.
        row = int(name.split('.')[0].split('_')[-2])
        col = int(name.split('.')[0].split('_')[-1])

        filename = "Traj_{}_{}_{}.csv".format(prefix, row, col)
        aws.download_s3(remote_folder+'/'+filename, filename, bucket_name=bucket)
        local_name = filename

        # Shift tile-local coordinates into the frame of the full image.
        to_add = ut.csv_to_pd(local_name)
        to_add['X'] = to_add['X'] + ires*col
        to_add['Y'] = ires - to_add['Y'] + ires*(3-row)

        if counter == 0:
            merged = msd.all_msds2(to_add, frames=frames)
        else:
            # Offset track IDs so they stay unique across tiles.
            if merged.shape[0] > 0:
                to_add['Track_ID'] = to_add['Track_ID'] + max(merged['Track_ID']) + 1

            # DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent.
            merged = pd.concat([merged, msd.all_msds2(to_add, frames=frames)])
            print('Done calculating MSDs for row {} and col {}'.format(row, col))
        counter = counter + 1


    for name in names:
        outfile = 'Traj_' + name.split('.')[0] + '.csv'
        os.remove(outfile)

    merged.to_csv(msd_file)
    aws.upload_s3(msd_file, remote_folder+'/'+msd_file, bucket_name=bucket)
    merged_ft = ft.calculate_features(merged)
    merged_ft.to_csv(ft_file)

    aws.upload_s3(ft_file, remote_folder+'/'+ft_file, bucket_name=bucket)

    os.remove(ft_file)
    os.remove(msd_file)
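
The coordinate shift in assemble_msds can be checked in isolation. The sketch below restates the same arithmetic for a single point, assuming 512-pixel tiles on a 4 x 4 grid as in the function above. The helper name to_global is hypothetical.

```python
def to_global(x, y, row, col, ires=512, n_rows=4):
    """Map a tile-local (x, y) position to full-image coordinates.

    X is shifted right by one tile width per column. Y is flipped within
    the tile and shifted so that row 0 ends up at the top of the image.
    """
    gx = x + ires * col
    gy = ires - y + ires * (n_rows - 1 - row)
    return gx, gy


print(to_global(10, 20, row=0, col=1))
print(to_global(10, 20, row=3, col=0))
```

With n_rows=4 the second expression reduces to ires - y + ires*(3-row), which is the form used in the function.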

In [10]:
def tracking(subprefix):
    
    folder = '08_03_18_varying_PEG_excess'
    bucket = 'evanepst.data'
    
    import os
    import os.path as op
    import numpy as np
    import numpy.ma as ma
    import pandas as pd
    import boto3
    
    import diff_classifier.aws as aws
    import diff_classifier.utils as ut
    import diff_classifier.msd as msd
    import diff_classifier.features as ft
    import diff_classifier.imagej as ij
    import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23
    
    remote_folder = folder
    local_folder = os.getcwd()
    ires = 512
    frames = 651
    filename = '{}.tif'.format(subprefix)
    remote_name = remote_folder+'/'+filename
    local_name = local_folder+'/'+filename
    
    filename = 'regress4.obj'
    aws.download_s3(remote_folder+'/'+filename, filename, bucket_name=bucket)
    with open(filename, 'rb') as fp:
        regress = joblib.load(fp)

    s3 = boto3.client('s3')

    #Tracking section
    ################################################################################################

    outfile = 'Traj_' + subprefix + '.csv'
    local_im = op.join(local_folder, '{}.tif'.format(subprefix))

    # Row and column are the last two fields, regardless of underscores in the prefix.
    row = int(subprefix.split('_')[-2])
    col = int(subprefix.split('_')[-1])

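    try:
        # Reuse an existing trajectory file if this tile was already tracked.
        aws.download_s3(remote_folder+'/'+outfile, outfile, bucket_name=bucket)
    except Exception:
        # No trajectory file yet: download the tile and track it.
        aws.download_s3('{}/{}'.format(remote_folder, '{}.tif'.format(subprefix)), local_im, bucket_name=bucket)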
        quality = ij.regress_tracking_params(regress, subprefix, regmethod='PassiveAggressiveRegressor')

        if row==3:
            y = 485
        else:
            y = 511

        ij.track(local_im, outfile, template=None, fiji_bin=None, radius=3.0, threshold=0.0,
                 do_median_filtering=False, quality=quality, x=511, y=y, ylo=1, median_intensity=300.0, snr=0.0,
                 linking_max_distance=6.0, gap_closing_max_distance=10.0, max_frame_gap=3,
                 track_displacement=20.0)

        aws.upload_s3(outfile, remote_folder+'/'+outfile, bucket_name=bucket)
    print("Done with tracking.  Should output file of name {}".format(remote_folder+'/'+outfile))
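
The tracking function only does the expensive work when no trajectory file exists yet, which makes it safe to resubmit a job. The sketch below shows the same idea on the local file system rather than S3, so it is a stand-in for the pattern and not the notebook's actual logic. The names compute_if_missing and fake_track are hypothetical.

```python
import os
import tempfile


def compute_if_missing(path, compute):
    """Run compute(path) only if path does not exist. Return True if it ran."""
    if os.path.exists(path):
        return False
    compute(path)
    return True


def fake_track(path):
    """Stand-in for tracking: write an empty trajectory table."""
    with open(path, 'w') as fp:
        fp.write('Track_ID,X,Y\n')


workdir = tempfile.mkdtemp()
out = os.path.join(workdir, 'Traj_100_PSCOOH_XY01_0_0.csv')
first = compute_if_missing(out, fake_track)   # file is created
second = compute_if_missing(out, fake_track)  # file is reused
print(first, second)
```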

In [None]:
for prefix in to_track[19:]:
    split(prefix, folder)

In [None]:
split('100nm_10k_PEG_XY20',folder)

In [None]:
to_track[19:]

# Build Docker Image

Cloudknot requires a Docker image to load on each machine that is used.  This image has all the required dependencies for the code to run.  The base Docker image is available as 'arokem/python3-fiji:0.3'.  It essentially just includes a Fiji install in the correct location, and points to the correct GitHub installs.

Note: Use "sudo docker system prune -a" to clear existing Docker images before creating a new one.

In [11]:
github_installs = 'https://github.com/ccurtis7/diff_classifier.git'
my_image = ck.DockerImage(func=tracking, base_image='arokem/python3-fiji:0.3', github_installs=github_installs)

with open(my_image.docker_path) as docker_file:
    docker_string = docker_file.read()

req_path = op.join(op.split(my_image.docker_path)[0], 'requirements.txt')
with open(req_path) as req:
    req_string = req.read()

# Overwrite the last four characters of the first requirement line with '5.28'
# to pin that package's version.
first_line_end = req_string.find('\n')
new_req = req_string[0:first_line_end-4] + '5.28' + req_string[first_line_end:]
with open(req_path, 'w') as req_overwrite:
    req_overwrite.write(new_req)
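
The cell above edits requirements.txt by character position, which depends on the order of the lines and the length of the version string. A less fragile alternative might look like the sketch below, which pins a package by name. The helper pin_requirement is hypothetical, and the package name and versions in the example are placeholders, not the ones used by this notebook.

```python
import re


def pin_requirement(req_text, package, version):
    """Return req_text with the line for `package` replaced by an exact pin."""
    pinned = []
    for line in req_text.splitlines():
        # The package name is everything before the first version specifier.
        name = re.split(r'[=<>!~ ]', line.strip(), maxsplit=1)[0]
        if name.lower() == package.lower():
            line = '{}=={}'.format(name, version)
        pinned.append(line)
    return '\n'.join(pinned) + '\n'


# Placeholder requirements text for illustration only.
example_req = 'examplepkg==1.7.0\nnumpy==1.14.0\n'
print(pin_requirement(example_req, 'examplepkg', '1.5.28'))
```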

# Starting analysis with Cloudknot

This is where the commands are sent to AWS to start machines and begin the analysis.  The core of this step is the "Knot" class.  The user specifies a few essentials:

* **name**: The user-defined name of the knot of machines to be started. Used to identify jobs in AWS.
* **docker_image**: The Docker image used to initialize each machine.
* **memory**: the desired memory of each machine, in MiB.
* **resource_type**: in order to get the cheapest machines, I set this to SPOT so we can bid on machines.
* **bid_percentage**: in order to ensure I get a machine in each case, I set to 100%.  You can lower this.
* **image_id**: the ID of the Amazon Machine Image (AMI) used for the machines.  This may need to be changed.
* **pars_policies**: I give each machine access to the required S3 bucket here.

Make sure that the image_id has enough space for the job. Currently, this notebook is optimized to require no extra memory,
so the default should work.

In [12]:
names = []
for prefix in to_track:    
    for i in range(0, 4):
        for j in range(0, 4):
            names.append('{}_{}_{}'.format(prefix, i, j))

knot = ck.Knot(name='download_and_track_{}_e{}'.format('Evan', start_knot),
               docker_image = my_image,
               memory = 16000,
               resource_type = "SPOT",
               bid_percentage = 100,
               image_id = 'ami-07afca4c05ab97109', #May need to change this line
               pars_policies=('AmazonS3FullAccess',))
result_futures = knot.map(names)
#print('Successfully started knot for {}'.format(prefix))

To completely shut down all resources started after the analysis, it is good practice to clobber them using the clobber function.  The user can do this manually in the AWS Batch interface as well.

In [None]:
knot.clobber()

In [None]:
to_track

In [None]:
for prefix in to_track:
    assemble_msds(prefix, folder)
    print('Successfully output msds for {}'.format(prefix))

Done calculating MSDs for row 0 and col 1
Done calculating MSDs for row 0 and col 2
Done calculating MSDs for row 0 and col 3
Done calculating MSDs for row 1 and col 0
Done calculating MSDs for row 1 and col 1
Done calculating MSDs for row 1 and col 2
Done calculating MSDs for row 1 and col 3
Done calculating MSDs for row 2 and col 0
Done calculating MSDs for row 2 and col 1
Done calculating MSDs for row 2 and col 2
Done calculating MSDs for row 2 and col 3
Done calculating MSDs for row 3 and col 0
Done calculating MSDs for row 3 and col 1
Done calculating MSDs for row 3 and col 2
Done calculating MSDs for row 3 and col 3


  ar = width/height
  ratio = (df['MSDs'][n1]/df['MSDs'][n2]) - (df['Frame'][n1]/df['Frame'][n2])


Optimal parameters not found. Print NaN instead.


# Visualization tools

In [None]:
prefix = '100nm_5k_PEG_XY01'
msd_file = 'msd_{}.csv'.format(prefix)  # separate name so the imported msd module is not shadowed
aws.download_s3('{}/{}'.format(folder, msd_file), msd_file, bucket_name=bucket)
geomean, geoSEM = hm.plot_individual_msds(prefix, x_range=10, y_range=10, remote_folder=folder, bucket=bucket)
os.remove(msd_file)

In [None]:
np.exp(geomean)
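
Taking the exponential here suggests that the returned mean is in log space, although that is an assumption about plot_individual_msds rather than something this notebook confirms. Under that assumption, the relationship between the two quantities can be sketched with made-up numbers as follows. The helper geometric_mean is hypothetical.

```python
import math


def geometric_mean(values):
    """Return (mean of the logs, geometric mean) for positive values."""
    log_mean = sum(math.log(v) for v in values) / len(values)
    return log_mean, math.exp(log_mean)


# Made-up values for illustration only.
log_mean, geo = geometric_mean([1.0, 10.0, 100.0])
print(log_mean, geo)
```

Averaging in log space and exponentiating afterwards is what turns an arithmetic mean of logs into a geometric mean of the original values.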

In [None]:
#prefix = '100nm_5k_PEG_XY01'
for prefix in to_track:
    msd_file = 'msd_{}.csv'.format(prefix)  # separate name so the imported msd module is not shadowed
    aws.download_s3('{}/{}'.format(folder, msd_file), msd_file, bucket_name=bucket)
    geomean, geoSEM = hm.plot_individual_msds(prefix, x_range=10, y_range=10, remote_folder=folder, bucket=bucket)
    os.remove(msd_file)