Extracting BBBC021 data. <br>
See also: 
- Adrian's analysis using handcrafted features: https://github.com/microscopium/microscopium-scripts/blob/master/bbbc021_analysis.ipynb<br>
- Original description of the BBBC021 data: https://www.broadinstitute.org/bbbc/BBBC021/

In [18]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import csv
import collections as coll
import re
import os
import sys
import time
import math
import pandas as pd

#import toolz as tz
#from microscopium.screens import image_xpress
from skimage import io, img_as_float
from sklearn.utils.extmath import cartesian as skcartesian

os.chdir("/Users/don/Documents/hcs")
sys.path.append("/Users/don/Documents/PyModules")

import skynet.bbbc021io as xio
import skynet.patch_extraction as pex
import skynet.utils

In [2]:
# Load a database mapping each plate-well to its compound, concentration and moa.
# Note that this database includes NaNs.

labels_db = pd.read_csv("/Users/don/Documents/hcs/label_db2.csv",
                        usecols=[1, 2, 3, 4])
# Skip the first column; it only holds row indices.
# A plate-well whose compound-concentration has no moa
# is absent from this database.

# Sample search
result = xio.search_labels('BBBC021-40111-B03', labels_db)
result

<font size = 5>Extracting Image data</font><br><br>
The main folder, 'BBBC021', contains a number of subfolders, e.g. 'Week3_xxxx'.<br>
The plate number is the 'xxxx' part of the subfolder name.<br>
Each subfolder contains a number of images; the well number is part of each image name. <br><br>
Desired output:
 - A list in which each row holds one patch and its labels: [array:(20 x 20 x 3), plate-num, cc-label (compound-concentration), moa-label]
 - Let's not unravel the patch yet.
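
The folder layout described above can be turned into a plate-well key with two regular expressions. This is only a sketch: the exact subfolder and filename patterns are assumptions (they are not shown in this notebook), the example names are made up, and `plate_well_key` is a hypothetical helper rather than part of `skynet.bbbc021io`. The key format follows the 'BBBC021-40111-B03' string used in the sample search.

```python
import re

# Assumed naming patterns (hypothetical, not confirmed by this notebook):
#   subfolder:  "Week3_40111"                 -> plate number "40111"
#   image name: "Week3_290607_B03_s1_w1.tif"  -> well "B03"
PLATE_RE = re.compile(r"^Week\d+_(\d+)$")
WELL_RE = re.compile(r"_([A-P]\d{2})_")


def plate_well_key(subfolder, image_name, prefix="BBBC021"):
    """Build a 'BBBC021-<plate>-<well>' key, or return None if a name does not match."""
    plate = PLATE_RE.match(subfolder)
    well = WELL_RE.search(image_name)
    if plate is None or well is None:
        return None
    return "%s-%s-%s" % (prefix, plate.group(1), well.group(1))


key = plate_well_key("Week3_40111", "Week3_290607_B03_s1_w1.tif")
print(key)
```

A key built this way could then be passed to a lookup such as `xio.search_labels`, provided the real names match the assumed patterns.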

In [11]:
#Declare some params:
csv_path = '/Users/don/Documents/BBBC021/BBBC021_parsed_metadata.csv'
#treatments = xio.get_labels_from_csv(csv_path, verbose=True)

path = '/Users/don/Documents/BBBC021'

# This is a list of all subfolder names
main_list = xio.get_main_list(path)

# Patch-sampling parameters
n_patches = 40  # patches extracted per image
patch_len = 20  # patch side length in pixels
# Proportion of each image's patches used for training.
# Choose p so that p * n_patches is a whole number,
# e.g. 0.6 * 40 = 24.
p = 0.6
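
To make the roles of `n_patches` and `patch_len` concrete, here is a minimal sketch of patch sampling. It assumes patches are drawn uniformly at random from each image; the actual strategy inside `skynet.patch_extraction` is not shown in this notebook, so `sample_patches` is a hypothetical stand-in and the image is synthetic.

```python
import numpy as np


def sample_patches(image, n_patches, patch_len, rng=None):
    """Draw n_patches square patches at uniformly random positions (assumed strategy)."""
    rng = np.random.default_rng() if rng is None else rng
    height, width = image.shape[:2]
    # Top-left corners chosen so that every patch lies fully inside the image.
    rows = rng.integers(0, height - patch_len + 1, size=n_patches)
    cols = rng.integers(0, width - patch_len + 1, size=n_patches)
    return np.stack([image[r:r + patch_len, c:c + patch_len]
                     for r, c in zip(rows, cols)])


# Synthetic 3-channel image standing in for a real micrograph.
image = np.random.default_rng(0).random((1024, 1280, 3))
patches = sample_patches(image, n_patches=40, patch_len=20,
                         rng=np.random.default_rng(1))
print(patches.shape)  # (40, 20, 20, 3)
```

Passing an explicit `rng` keeps the sampled positions reproducible between runs.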

In [5]:
# Takes about 17 mins
t0 = time.time()
main_data_list = []
img_counter = 0

for idx, sf in enumerate(main_list, start=1):
    t_i = time.time()
    subpath = os.path.join(path, sf)
    print("%s of %s subfolders..." % (idx, len(main_list)), end="")
    sf_data_list_temp = xio.get_subfolder_patch_data(subpath,
                                                     n_patches,
                                                     patch_len,
                                                     labels_db,
                                                     verbose=1)
    print("%s images retrieved. " % len(sf_data_list_temp), end="")
    img_counter += len(sf_data_list_temp)
    main_data_list.extend(sf_data_list_temp)
    print("Runtime = %.2fs" % (time.time() - t_i))

dt = time.time() - t0
m, s = divmod(dt, 60)
h, m = divmod(m, 60)
print("All done in %d:%02d:%02d" % (h, m, s))
print("%s images used" % img_counter)
# TODO: consider removing the subfolder-level messages in get_subfolder_patch_data.

1 of 55 subfolders...60 images found in subfolder
20 images retrieved. Runtime = 18.70s

2 of 55 subfolders...60 images found in subfolder
20 images retrieved. Runtime = 17.97s

3 of 55 subfolders...60 images found in subfolder
20 images retrieved. Runtime = 17.51s

4 of 55 subfolders...60 images found in subfolder
25 images retrieved. Runtime = 17.66s

5 of 55 subfolders...60 images found in subfolder
25 images retrieved. Runtime = 17.97s

6 of 55 subfolders...60 images found in subfolder
25 images retrieved. Runtime = 17.90s

7 of 55 subfolders...60 images found in subfolder
18 images retrieved. Runtime = 18.25s

8 of 55 subfolders...60 images found in subfolder
18 images retrieved. Runtime = 18.23s

9 of 55 subfolders...60 images found in subfolder
18 images retrieved. Runtime = 18.09s

10 of 55 subfolders...60 images found in subfolder
16 images retrieved. Runtime = 16.52s

11 of 55 subfolders...60 images found in subfolder
16 images retrieved. Runtime = 16.60s

12 of 55 subfolders

In [6]:
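# Each row mixes a patch array with scalar labels, so build an object array
# (recent NumPy versions refuse to create ragged arrays implicitly).
data_list = np.array(main_data_list, dtype=object)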

In [2]:
#data_list = np.load('bbbc021_data_6July.npy')

Our data is now a NumPy array of shape (n_images, 5). Each row data_list[i] describes one image:
- data_list[i][0]: all n_patches patches extracted from that image. Shape (n_patches, patch_len, patch_len, 3)
- data_list[i][1]: plate-well coords
- data_list[i][2]: compound
- data_list[i][3]: concentration
- data_list[i][4]: moa
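
The layout above can be reproduced on synthetic data to check the indexing. This is a sketch only: the plate-well, compound, concentration and moa values are placeholders, not entries from the real label database.

```python
import numpy as np

n_patches, patch_len = 40, 20
rng = np.random.default_rng(0)

# Two synthetic "images"; every label value here is a placeholder.
rows = [
    [rng.random((n_patches, patch_len, patch_len, 3)),
     "plate-well-1", "compound_a", 1.0, "moa_x"],
    [rng.random((n_patches, patch_len, patch_len, 3)),
     "plate-well-2", "compound_b", 0.3, "moa_y"],
]

# Fill an object array cell by cell so each cell keeps its own type.
toy = np.empty((len(rows), 5), dtype=object)
for i, row in enumerate(rows):
    for j, value in enumerate(row):
        toy[i, j] = value

print(toy.shape)        # (2, 5)
print(toy[0][0].shape)  # (40, 20, 20, 3)
print(toy[1][4])        # moa_y
```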

Now let's partition into training and testing data, according to the ratio p specified above. For every image, the first p * n_patches patches go to training and the rest go to testing. Desired output: four arrays:
- x_train: training patches
- label_train: moa label of each training patch
- x_test: testing patches
- label_test: moa label of each testing patch
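
The same per-image split can also be written with `np.concatenate` and `np.repeat`. The sketch below runs on synthetic rows in the five-column layout described earlier; `split_patches` is a hypothetical helper and the labels are placeholders.

```python
import numpy as np


def split_patches(data, n_train):
    """Send the first n_train patches of each image to training, the rest to testing."""
    x_train = np.concatenate([row[0][:n_train] for row in data])
    x_test = np.concatenate([row[0][n_train:] for row in data])
    moa = [row[4] for row in data]
    n_test = [len(row[0]) - n_train for row in data]
    # Repeat each image's moa label once per patch, in the same order as the patches.
    label_train = np.repeat(moa, n_train)
    label_test = np.repeat(moa, n_test)
    return x_train, label_train, x_test, label_test


# Synthetic rows: (patches, plate-well, compound, concentration, moa).
rng = np.random.default_rng(0)
toy = [
    (rng.random((40, 20, 20, 3)), "plate-well-1", "compound_a", 1.0, "moa_x"),
    (rng.random((40, 20, 20, 3)), "plate-well-2", "compound_b", 0.3, "moa_y"),
]
x_tr, y_tr, x_te, y_te = split_patches(toy, n_train=24)
print(x_tr.shape, y_tr.shape)  # (48, 20, 20, 3) (48,)
print(x_te.shape, y_te.shape)  # (32, 20, 20, 3) (32,)
```

With 2 images and 40 patches each, a 24/16 split gives 48 training and 32 testing patches, mirroring the per-image arithmetic used on the full dataset.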

In [17]:
x_train = []
label_train = []
x_test = []
label_test = []

# round() guards against floating-point error in p * n_patches
n_trg = int(round(p * n_patches))
n_test = n_patches - n_trg

for i in range(len(data_list)):  # for each image
    patches_trg = list(data_list[i][0][:n_trg])
    labels1 = [data_list[i][4]] * n_trg
    patches_test = list(data_list[i][0][n_trg:])
    labels2 = [data_list[i][4]] * n_test

    # extend() appends in place instead of rebuilding the list every iteration
    x_train.extend(patches_trg)
    x_test.extend(patches_test)
    label_train.extend(labels1)
    label_test.extend(labels2)
    
x_train = np.array(x_train)
x_test = np.array(x_test)
label_train = np.array(label_train)
label_test = np.array(label_test)

print("Summary Stats")
print("-"*13)
print("%s patches * %s images = %s patches altogether" % 
      (n_patches, len(data_list), n_patches*len(data_list)))
print("%s patches (%s patches per image) allocated to training data" % 
     (n_trg*len(data_list), n_trg))
print("%s patches (%s patches per image) allocated to testing data" % 
     (n_test*len(data_list), n_test))
print(x_train.shape, label_train.shape)
print(x_test.shape, label_test.shape)

Summary Stats
-------------
40 patches * 962 images = 38480 patches altogether
23088 patches (24 patches per image) allocated to training data
15392 patches (16 patches per image) allocated to testing data
(23088, 20, 20, 3) (23088,)
(15392, 20, 20, 3) (15392,)


In [22]:
np.save("x_train", x_train)
np.save("label_train", label_train)
np.save("x_test", x_test)
np.save("label_test", label_test)
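
`np.save` appends `.npy` when the given name lacks that extension, so the calls above write `x_train.npy`, `label_train.npy`, and so on. A quick round-trip check might look like the sketch below, which uses small synthetic arrays and a temporary directory rather than the real files.

```python
import os
import tempfile

import numpy as np

# Synthetic stand-ins for the real patch and label arrays.
x_demo = np.zeros((48, 20, 20, 3))
label_demo = np.array(["moa_x"] * 24 + ["moa_y"] * 24)

with tempfile.TemporaryDirectory() as tmp:
    np.save(os.path.join(tmp, "x_demo"), x_demo)          # written as x_demo.npy
    np.save(os.path.join(tmp, "label_demo"), label_demo)  # written as label_demo.npy
    saved_files = sorted(os.listdir(tmp))
    x_back = np.load(os.path.join(tmp, "x_demo.npy"))
    label_back = np.load(os.path.join(tmp, "label_demo.npy"))

print(saved_files)
print(x_back.shape, label_back.shape)
```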