# Image preprocessing

---

This notebook preprocesses the illumination-corrected raw images obtained from [IDR0033](https://idr.openmicroscopy.org/webclient/?show=screen-1751) for further analysis.

We will use the available metadata to filter out images that the authors of the [publication](https://elifesciences.org/articles/24060#s4) accompanying the imaging data either manually identified as outliers or flagged during their quality control. For details on the workflows used to identify these images, please refer to the publication.

---

## 0. Environment setup

In [69]:
import os
import sys
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import tifffile

from collections import Counter
# tqdm.tqdm_notebook is deprecated; tqdm.notebook.tqdm is its replacement
from tqdm.notebook import tqdm as tqdm_notebook
from shutil import copyfile


sys.path.append("..")

from src.utils.basic.io import get_file_list

# Load automatic code formatter (automatically executed when running a chunk)
%load_ext nb_black


In [30]:
def rename_image_filenames(
    metadata,
    orig_col="Image_FileName_OrigHoechst",
    illum_col="Image_FileName_IllumHoechst",
    posfix="illum_corrected",
):
    orig_image_file_names = list(metadata[orig_col])
    illum_corrected_image_file_names = []
    for orig_image_file_name in orig_image_file_names:
        idx = orig_image_file_name.index(".")
        illum_corrected_image_file_name = (
            orig_image_file_name[:idx] + posfix + orig_image_file_name[idx:]
        )
        illum_corrected_image_file_names.append(illum_corrected_image_file_name)
    metadata[illum_col] = illum_corrected_image_file_names
    return metadata


In [31]:
def filter_out_qc_flagged_items(
    metadata,
    blurry_col="Image_Metadata_QCFlag_isBlurry",
    saturated_col="Image_Metadata_QCFlag_isSaturated",
):
    filtered_metadata = metadata.loc[
        (metadata[blurry_col] == 0) & (metadata[saturated_col] == 0)
    ]
    return filtered_metadata


In [32]:
def remove_outlier_items(
    metadata,
    outlier_plates=None,
    outlier_plate_wells=None,
    outlier_wells=None,
    plate_col="Image_Metadata_Plate",
    well_col="Image_Metadata_Well",
):
    metadata_orm = metadata.copy()
    if outlier_plates is not None:
        for outlier_plate in outlier_plates:
            metadata_orm = metadata_orm.loc[metadata_orm[plate_col] != outlier_plate]
    if outlier_plate_wells is not None:
        for outlier_plate_well in outlier_plate_wells:
            metadata_orm = metadata_orm.loc[
                (metadata_orm[plate_col] != outlier_plate_well[0])
                | (metadata_orm[well_col] != outlier_plate_well[1])
            ]
    if outlier_wells is not None:
        for outlier_well in outlier_wells:
            metadata_orm = metadata_orm.loc[metadata_orm[well_col] != outlier_well]
    return metadata_orm


In [54]:
def save_filtered_images(
    metadata,
    input_dir,
    output_dir,
    plate_col="Image_Metadata_Plate",
    illum_file_col="Image_FileName_IllumHoechst",
):
    if not os.path.exists(output_dir):
        os.makedirs(output_dir)
    n = len(metadata)
    for i in tqdm_notebook(range(n), desc="Copying images"):
        plate = metadata.iloc[i, list(metadata.columns).index(plate_col)]
        filename = metadata.iloc[i, list(metadata.columns).index(illum_file_col)]

        plate_output_dir = os.path.join(output_dir, str(plate))
        plate_input_dir = os.path.join(input_dir, str(plate))
        if not os.path.exists(plate_output_dir):
            os.makedirs(plate_output_dir)

        plate_output_file = os.path.join(plate_output_dir, filename)
        plate_input_file = os.path.join(plate_input_dir, filename)

        copyfile(plate_input_file, plate_output_file)
    print("Images (n={}) copied to {}.".format(len(metadata), output_dir))


In [75]:
def create_sample_datasets(input_dir, n=100, random_seed=1234):
    np.random.seed(random_seed)
    plate_input_dirs = [f.path for f in os.scandir(input_dir) if f.is_dir()]
    sample_datasets = {}
    for plate_input_dir in plate_input_dirs:
        plate = os.path.split(plate_input_dir)[1]
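        # Note: np.random.choice samples with replacement by default,
        # so the same file can be drawn more than once.
        plate_files = np.random.choice(np.array(get_file_list(plate_input_dir)), n)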
        plate_images = []
        for plate_file in plate_files:
            plate_images.append(tifffile.imread(plate_file))
        sample_datasets[plate] = plate_images
    return sample_datasets


---

## 1. Read in data

First, we will read in the metadata, which tell us, for example, which plate-well combination each image belongs to and which gene was targeted. The file was exported from a database for which the corresponding SQL script was published alongside the imaging data. Please refer to the GitHub repository of the original publication for more information on how to set up the database. To derive the metadata file loaded in the following cell, we provide a custom SQL file (`extract_metadata.sql`).

In [44]:
metadata = pd.read_csv("../data/images/metadata/metadata_image_data.csv")
metadata.head()

Unnamed: 0,Image_Metadata_Plate,Image_Metadata_Well,Image_FileName_OrigHoechst,Image_Count_Nuclei,Image_Metadata_GeneID,Image_Metadata_GeneSymbol,Image_Metadata_IsLandmark,Image_Metadata_AlleleDesc,Image_Metadata_ExpressionVector,Image_Metadata_FlaggedForToxicity,...,Image_Metadata_IntendedOrfMismatch,Image_Metadata_OpenOrClosed,Image_Metadata_RNAiVirusPlateName,Image_Metadata_Site,Image_Metadata_TimePoint_Hours,Image_Metadata_Type,Image_Metadata_Virus_Vol_ul,Image_Metadata_ASSAY_WELL_ROLE,Image_Metadata_QCFlag_isBlurry,Image_Metadata_QCFlag_isSaturated
0,41744,k21,taoe005-u2os-72h-cp-a-au00044859_k21_s7_w10efe...,60,1977.0,EIF4E,0.0,WT.2,pLX304,,...,,open,ORA11.12.13.18A,7,72H,ORF OE,1,Treated,0,0
1,41744,i13,taoe005-u2os-72h-cp-a-au00044859_i13_s4_w13be2...,69,22943.0,DKK1,0.0,WT,pLX304,,...,,open,ORA11.12.13.18A,4,72H,ORF OE,1,Treated,0,0
2,41744,j16,taoe005-u2os-72h-cp-a-au00044859_j16_s9_w1b03e...,48,22926.0,ATF6,1.0,WT.1,pLX304,,...,,open,ORA11.12.13.18A,9,72H,ORF OE,1,Treated,0,0
3,41744,m07,taoe005-u2os-72h-cp-a-au00044859_m07_s5_w1226d...,57,5045.0,FURIN,0.0,WT.2,pLX304,,...,,open,ORA11.12.13.18A,5,72H,ORF OE,1,Treated,0,0
4,41744,i04,taoe005-u2os-72h-cp-a-au00044859_i04_s5_w1f731...,51,5599.0,MAPK8,0.0,WT.2,pLX304,,...,,open,ORA11.12.13.18A,5,72H,ORF OE,1,Treated,0,0


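The `Unnamed: 0` column in the preview above suggests that the CSV was written together with its index. This is an assumption, since the file itself is not shown here. If it holds, a minimal sketch of how to read that column as the index is given below; the toy CSV content is purely illustrative.

```python
import io

import pandas as pd

# Toy CSV standing in for metadata_image_data.csv (values are illustrative only).
toy_csv = io.StringIO(
    ",Image_Metadata_Plate,Image_Metadata_Well\n"
    "0,41744,k21\n"
    "1,41744,i13\n"
)

# index_col=0 uses the unnamed first column as the index
# instead of keeping it as a data column called "Unnamed: 0".
toy_metadata = pd.read_csv(toy_csv, index_col=0)
print(list(toy_metadata.columns))
```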

As suggested in the original publication, we will work with the illumination-corrected images, which are also available.

We will now derive a new column, `Image_FileName_IllumHoechst`, by inserting the suffix `_illum_corrected` (called `posfix` in the code) before the file extension of each `Image_FileName_OrigHoechst` entry. This ensures that the new column holds the actual filenames of the `.tif` images that we will be working with.

In [45]:
posfix = "_illum_corrected"
orig_col = "Image_FileName_OrigHoechst"
illum_col = "Image_FileName_IllumHoechst"

metadata = rename_image_filenames(
    metadata, orig_col=orig_col, illum_col=illum_col, posfix=posfix
)
metadata.head()

Unnamed: 0,Image_Metadata_Plate,Image_Metadata_Well,Image_FileName_OrigHoechst,Image_Count_Nuclei,Image_Metadata_GeneID,Image_Metadata_GeneSymbol,Image_Metadata_IsLandmark,Image_Metadata_AlleleDesc,Image_Metadata_ExpressionVector,Image_Metadata_FlaggedForToxicity,...,Image_Metadata_OpenOrClosed,Image_Metadata_RNAiVirusPlateName,Image_Metadata_Site,Image_Metadata_TimePoint_Hours,Image_Metadata_Type,Image_Metadata_Virus_Vol_ul,Image_Metadata_ASSAY_WELL_ROLE,Image_Metadata_QCFlag_isBlurry,Image_Metadata_QCFlag_isSaturated,Image_FileName_IllumHoechst
0,41744,k21,taoe005-u2os-72h-cp-a-au00044859_k21_s7_w10efe...,60,1977.0,EIF4E,0.0,WT.2,pLX304,,...,open,ORA11.12.13.18A,7,72H,ORF OE,1,Treated,0,0,taoe005-u2os-72h-cp-a-au00044859_k21_s7_w10efe...
1,41744,i13,taoe005-u2os-72h-cp-a-au00044859_i13_s4_w13be2...,69,22943.0,DKK1,0.0,WT,pLX304,,...,open,ORA11.12.13.18A,4,72H,ORF OE,1,Treated,0,0,taoe005-u2os-72h-cp-a-au00044859_i13_s4_w13be2...
2,41744,j16,taoe005-u2os-72h-cp-a-au00044859_j16_s9_w1b03e...,48,22926.0,ATF6,1.0,WT.1,pLX304,,...,open,ORA11.12.13.18A,9,72H,ORF OE,1,Treated,0,0,taoe005-u2os-72h-cp-a-au00044859_j16_s9_w1b03e...
3,41744,m07,taoe005-u2os-72h-cp-a-au00044859_m07_s5_w1226d...,57,5045.0,FURIN,0.0,WT.2,pLX304,,...,open,ORA11.12.13.18A,5,72H,ORF OE,1,Treated,0,0,taoe005-u2os-72h-cp-a-au00044859_m07_s5_w1226d...
4,41744,i04,taoe005-u2os-72h-cp-a-au00044859_i04_s5_w1f731...,51,5599.0,MAPK8,0.0,WT.2,pLX304,,...,open,ORA11.12.13.18A,5,72H,ORF OE,1,Treated,0,0,taoe005-u2os-72h-cp-a-au00044859_i04_s5_w1f731...


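The renaming step can also be sketched with `os.path.splitext`. The helper name `add_suffix_before_extension` and the example filename are hypothetical. Unlike `rename_image_filenames`, which splits at the first dot, this variant splits at the last dot, so the two only differ for filenames that contain more than one dot.

```python
import os


def add_suffix_before_extension(file_name, suffix="_illum_corrected"):
    """Insert `suffix` between the file stem and its extension (hypothetical helper)."""
    stem, ext = os.path.splitext(file_name)  # splits at the last dot
    return stem + suffix + ext


example = add_suffix_before_extension("plate_k21_s7_w1.tif")
print(example)
```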

---

## 2. Data filtering

### 2a. Filter out blurry or saturated images

Next, we will filter out images that were flagged as blurry or saturated and therefore do not meet the image quality standards that we adopt from the authors of the original publication. These flags are also available in the metadata.

In [46]:
blurry_col = "Image_Metadata_QCFlag_isBlurry"
saturated_col = "Image_Metadata_QCFlag_isSaturated"

filtered_metadata = filter_out_qc_flagged_items(
    metadata, blurry_col=blurry_col, saturated_col=saturated_col
)
filtered_metadata.head()

Unnamed: 0,Image_Metadata_Plate,Image_Metadata_Well,Image_FileName_OrigHoechst,Image_Count_Nuclei,Image_Metadata_GeneID,Image_Metadata_GeneSymbol,Image_Metadata_IsLandmark,Image_Metadata_AlleleDesc,Image_Metadata_ExpressionVector,Image_Metadata_FlaggedForToxicity,...,Image_Metadata_OpenOrClosed,Image_Metadata_RNAiVirusPlateName,Image_Metadata_Site,Image_Metadata_TimePoint_Hours,Image_Metadata_Type,Image_Metadata_Virus_Vol_ul,Image_Metadata_ASSAY_WELL_ROLE,Image_Metadata_QCFlag_isBlurry,Image_Metadata_QCFlag_isSaturated,Image_FileName_IllumHoechst
0,41744,k21,taoe005-u2os-72h-cp-a-au00044859_k21_s7_w10efe...,60,1977.0,EIF4E,0.0,WT.2,pLX304,,...,open,ORA11.12.13.18A,7,72H,ORF OE,1,Treated,0,0,taoe005-u2os-72h-cp-a-au00044859_k21_s7_w10efe...
1,41744,i13,taoe005-u2os-72h-cp-a-au00044859_i13_s4_w13be2...,69,22943.0,DKK1,0.0,WT,pLX304,,...,open,ORA11.12.13.18A,4,72H,ORF OE,1,Treated,0,0,taoe005-u2os-72h-cp-a-au00044859_i13_s4_w13be2...
2,41744,j16,taoe005-u2os-72h-cp-a-au00044859_j16_s9_w1b03e...,48,22926.0,ATF6,1.0,WT.1,pLX304,,...,open,ORA11.12.13.18A,9,72H,ORF OE,1,Treated,0,0,taoe005-u2os-72h-cp-a-au00044859_j16_s9_w1b03e...
3,41744,m07,taoe005-u2os-72h-cp-a-au00044859_m07_s5_w1226d...,57,5045.0,FURIN,0.0,WT.2,pLX304,,...,open,ORA11.12.13.18A,5,72H,ORF OE,1,Treated,0,0,taoe005-u2os-72h-cp-a-au00044859_m07_s5_w1226d...
4,41744,i04,taoe005-u2os-72h-cp-a-au00044859_i04_s5_w1f731...,51,5599.0,MAPK8,0.0,WT.2,pLX304,,...,open,ORA11.12.13.18A,5,72H,ORF OE,1,Treated,0,0,taoe005-u2os-72h-cp-a-au00044859_i04_s5_w1f731...



In [47]:
print(
    "Images filtered out for not passing the quality standards: {}.".format(
        len(metadata) - len(filtered_metadata)
    )
)

Images filtered out for not passing the quality standards: 251.



As seen above, 251 images were flagged as either blurry or saturated and should thus be excluded from the downstream analysis.
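To see how such a total splits across the two flags, the counts can be broken down per flag. The sketch below uses a small made-up table with the same flag columns; the values are illustrative and not taken from the dataset.

```python
import pandas as pd

# Toy table with the two QC flag columns (values are illustrative only).
toy = pd.DataFrame(
    {
        "Image_Metadata_QCFlag_isBlurry": [0, 1, 0, 1, 0],
        "Image_Metadata_QCFlag_isSaturated": [0, 0, 1, 1, 0],
    }
)

blurry = toy["Image_Metadata_QCFlag_isBlurry"] == 1
saturated = toy["Image_Metadata_QCFlag_isSaturated"] == 1

# Count each mutually exclusive combination of the two flags.
qc_summary = {
    "blurry_only": int((blurry & ~saturated).sum()),
    "saturated_only": int((~blurry & saturated).sum()),
    "both": int((blurry & saturated).sum()),
    "passed": int((~blurry & ~saturated).sum()),
}
print(qc_summary)
```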

---

### 2b. Filter out outlier images (manual selection)

In addition to the images flagged during quality control, the authors excluded two plate-well combinations and one complete plate from their analyses because they identified them as outliers by visual inspection.

Those are the following:
* Plate 41749 (all wells)
* Plate 41754 (well B01)
* Plate 41757 (well E17)

Unfortunately, no description is given of which criteria were used to identify these outliers. When briefly looking at the data in the [IDR webclient](https://idr.openmicroscopy.org/webclient/?show=screen-1751), we do not see any remarkable abnormalities.

Nonetheless, to follow the preprocessing steps of the original publication, we derive the subset of the dataset that excludes the items of these plate-well combinations. Whether we will use this subset or the larger set that includes those combinations in our final study is yet to be determined.

In [48]:
outlier_plates = [41749]
outlier_plate_wells = [[41754, "b01"], [41757, "e17"]]
outlier_wells = []
plate_col = "Image_Metadata_Plate"
well_col = "Image_Metadata_Well"


filtered_metadata_orm = remove_outlier_items(
    metadata,
    outlier_plates=outlier_plates,
    outlier_plate_wells=outlier_plate_wells,
    outlier_wells=outlier_wells,
    plate_col=plate_col,
    well_col=well_col,
)
filtered_metadata_orm.head()

Unnamed: 0,Image_Metadata_Plate,Image_Metadata_Well,Image_FileName_OrigHoechst,Image_Count_Nuclei,Image_Metadata_GeneID,Image_Metadata_GeneSymbol,Image_Metadata_IsLandmark,Image_Metadata_AlleleDesc,Image_Metadata_ExpressionVector,Image_Metadata_FlaggedForToxicity,...,Image_Metadata_OpenOrClosed,Image_Metadata_RNAiVirusPlateName,Image_Metadata_Site,Image_Metadata_TimePoint_Hours,Image_Metadata_Type,Image_Metadata_Virus_Vol_ul,Image_Metadata_ASSAY_WELL_ROLE,Image_Metadata_QCFlag_isBlurry,Image_Metadata_QCFlag_isSaturated,Image_FileName_IllumHoechst
0,41744,k21,taoe005-u2os-72h-cp-a-au00044859_k21_s7_w10efe...,60,1977.0,EIF4E,0.0,WT.2,pLX304,,...,open,ORA11.12.13.18A,7,72H,ORF OE,1,Treated,0,0,taoe005-u2os-72h-cp-a-au00044859_k21_s7_w10efe...
1,41744,i13,taoe005-u2os-72h-cp-a-au00044859_i13_s4_w13be2...,69,22943.0,DKK1,0.0,WT,pLX304,,...,open,ORA11.12.13.18A,4,72H,ORF OE,1,Treated,0,0,taoe005-u2os-72h-cp-a-au00044859_i13_s4_w13be2...
2,41744,j16,taoe005-u2os-72h-cp-a-au00044859_j16_s9_w1b03e...,48,22926.0,ATF6,1.0,WT.1,pLX304,,...,open,ORA11.12.13.18A,9,72H,ORF OE,1,Treated,0,0,taoe005-u2os-72h-cp-a-au00044859_j16_s9_w1b03e...
3,41744,m07,taoe005-u2os-72h-cp-a-au00044859_m07_s5_w1226d...,57,5045.0,FURIN,0.0,WT.2,pLX304,,...,open,ORA11.12.13.18A,5,72H,ORF OE,1,Treated,0,0,taoe005-u2os-72h-cp-a-au00044859_m07_s5_w1226d...
4,41744,i04,taoe005-u2os-72h-cp-a-au00044859_i04_s5_w1f731...,51,5599.0,MAPK8,0.0,WT.2,pLX304,,...,open,ORA11.12.13.18A,5,72H,ORF OE,1,Treated,0,0,taoe005-u2os-72h-cp-a-au00044859_i04_s5_w1f731...


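The exclusion logic of `remove_outlier_items` can be checked on a toy table. The following sketch expresses the same rule with a single boolean mask instead of repeated filtering; the toy rows are made up for illustration.

```python
import pandas as pd

plate_col = "Image_Metadata_Plate"
well_col = "Image_Metadata_Well"

# Toy metadata (illustrative rows only).
toy = pd.DataFrame(
    {
        plate_col: [41744, 41749, 41754, 41754, 41757],
        well_col: ["k21", "a01", "b01", "b02", "e17"],
    }
)

outlier_plates = [41749]
outlier_plate_wells = {(41754, "b01"), (41757, "e17")}

# Rows on an excluded plate.
is_outlier_plate = toy[plate_col].isin(outlier_plates)

# Rows whose (plate, well) pair is excluded.
pairs = zip(toy[plate_col], toy[well_col])
is_outlier_well = pd.Series(
    [pair in outlier_plate_wells for pair in pairs], index=toy.index
)

kept = toy.loc[~(is_outlier_plate | is_outlier_well)]
print(kept)
```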

After this filtering we are left with 1,918 unique plate-well combinations, each with 9 fields of view, for a total of 17,262 images. Note that the cell above applies `remove_outlier_items` to `metadata` rather than to `filtered_metadata`, so the 251 QC-flagged images from step 2a are still included in this count. Surprisingly, this is 36 images (4 plate-well combinations) more than the publication reports as the final result of its image preprocessing.
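Counts like these can be derived from the metadata with a `groupby` over plate and well. The sketch below shows the idea on a made-up table with 3 instead of 9 fields of view per well; `expected_sites` is a hypothetical name introduced for illustration.

```python
import pandas as pd

# Toy metadata: one row per image (illustrative values only).
toy = pd.DataFrame(
    {
        "Image_Metadata_Plate": [41744, 41744, 41744, 41754, 41754, 41754],
        "Image_Metadata_Well": ["k21", "k21", "i13", "b02", "b02", "b02"],
        "Image_Metadata_Site": [1, 2, 1, 1, 2, 3],
    }
)

expected_sites = 3  # the real dataset has 9 fields of view per well

# Number of images available for each plate-well combination.
images_per_well = toy.groupby(
    ["Image_Metadata_Plate", "Image_Metadata_Well"]
).size()

n_plate_wells = len(images_per_well)
n_images = int(images_per_well.sum())

# Plate-well combinations with fewer images than expected.
incomplete = images_per_well[images_per_well != expected_sites]

print(n_plate_wells, n_images, len(incomplete))

# Sanity check of the arithmetic quoted in the text.
assert 1918 * 9 == 17262
```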

While our segmentation pipeline will differ from the one the authors used to segment the nuclei, the available metadata already give a first impression of the size of the single-nucleus imaging dataset that we will be working with.

In [49]:
np.sum(list(filtered_metadata_orm["Image_Count_Nuclei"])), len(
    np.unique(filtered_metadata_orm["Image_Metadata_GeneSymbol"])
)

(1278881, 194)


The authors obtained roughly 1.28 million nuclei, corresponding to the ORF overexpression of 193 genes and the control condition. The total number of nuclei differs considerably between conditions, partly because more plate-well combinations are available for some genes than for others.

In [50]:
Counter(filtered_metadata_orm["Image_Metadata_GeneSymbol"])

Counter({'EIF4E': 90,
         'DKK1': 45,
         'ATF6': 135,
         'FURIN': 90,
         'MAPK8': 90,
         'CARD11': 135,
         'ATG16L1': 45,
         'TSC1': 90,
         'PAK1': 90,
         'XBP1': 180,
         'PRKAA1': 90,
         'MAP3K9': 45,
         'IKBKE': 90,
         'TGFBR1': 180,
         'RIPK1': 45,
         'EMPTY': 1557,
         'PSENEN': 45,
         'BRAF': 135,
         'DVL1': 90,
         'PER1': 90,
         'EGLN1': 135,
         'MOS': 90,
         'BMPR1B': 180,
         'Luciferase': 360,
         'CCND1': 90,
         'DVL3': 45,
         'TBK1': 90,
         'PIK3R1': 90,
         'PRKACA': 90,
         'NOTCH1': 135,
         'DEPTOR': 90,
         'TRAF6': 90,
         'PRKCZ': 180,
         'PRKACB': 135,
         'PRKACG': 135,
         'DIABLO': 45,
         'CHUK': 90,
         'LRPPRC': 45,
         'SLIRP': 90,
         'MKNK1': 45,
         'RAF1': 135,
         'GLI1': 45,
         'CYLD': 90,
         'JAK2': 90,
         'PKI

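The `Counter` above counts images per gene symbol. To compare the number of nuclei per condition directly, the nuclei counts can be summed per gene symbol. A minimal sketch on made-up values:

```python
import pandas as pd

# Toy metadata: one row per image (illustrative values only).
toy = pd.DataFrame(
    {
        "Image_Metadata_GeneSymbol": ["EMPTY", "EMPTY", "ATF6", "DKK1", "ATF6"],
        "Image_Count_Nuclei": [60, 69, 48, 57, 51],
    }
)

# Total nuclei ("sum") and number of images ("count") per gene symbol,
# sorted so that the conditions with the most nuclei come first.
nuclei_per_gene = (
    toy.groupby("Image_Metadata_GeneSymbol")["Image_Count_Nuclei"]
    .agg(["sum", "count"])
    .sort_values("sum", ascending=False)
)
print(nuclei_per_gene)
```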

---

## 3. Save filtered image data

Previously we filtered the metadata following the exclusion criteria of Rohban et al. (2017). We will now use the outlier-filtered subset (`filtered_metadata_orm`) and store the Hoechst (DNA) stained images separately. We do this because a) it drastically reduces the storage requirements of the dataset and b) going forward we are only interested in this staining.

In [55]:
output_dir = "../data/images/filtered"
input_dir = "../data/images/data"
plate_col = "Image_Metadata_Plate"
illum_file_col = "Image_FileName_IllumHoechst"

# save_filtered_images(
#     filtered_metadata_orm,
#     input_dir=input_dir,
#     output_dir=output_dir,
#     plate_col=plate_col,
#     illum_file_col=illum_file_col,
# )

Please use `tqdm.notebook.tqdm` instead of `tqdm.tqdm_notebook`
  for i in tqdm_notebook(range(n), desc := "Copying images"):


Copying images:   0%|          | 0/17262 [00:00<?, ?it/s]

Images (n=17262) copied to ../data/images/filtered.


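The copy step assumes that every file listed in the metadata exists under `input_dir/<plate>/`. The sketch below shows one way to copy per-plate files while collecting missing ones instead of failing on the first. `copy_plate_files` is a hypothetical helper, and the example runs on a temporary directory with made-up files.

```python
import shutil
import tempfile
from pathlib import Path


def copy_plate_files(records, input_dir, output_dir):
    """Copy (plate, file name) pairs from input_dir/<plate>/ to output_dir/<plate>/.

    Hypothetical helper for illustration; returns the source paths that were missing.
    """
    missing = []
    for plate, file_name in records:
        src = Path(input_dir) / str(plate) / file_name
        dst = Path(output_dir) / str(plate) / file_name
        if not src.exists():
            missing.append(str(src))
            continue
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(src, dst)
    return missing


with tempfile.TemporaryDirectory() as tmp:
    # Build a toy input tree containing one of the two listed files.
    src_root = Path(tmp) / "data"
    (src_root / "41744").mkdir(parents=True)
    (src_root / "41744" / "a_illum_corrected.tif").write_bytes(b"toy")

    records = [
        (41744, "a_illum_corrected.tif"),
        (41744, "b_illum_corrected.tif"),
    ]
    missing = copy_plate_files(records, src_root, Path(tmp) / "filtered")
    copied = sorted(p.name for p in (Path(tmp) / "filtered").rglob("*.tif"))

print(copied, len(missing))
```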

---

## 4. Nuclei segmentation

After the filtering steps described above, we now turn to the segmentation of the individual nuclei, as we will be working with single-nucleus images. This has two advantages: a) it removes the toxicity signal, which would otherwise confound our downstream analyses, and b) it further reduces the dimensionality of the dataset, lowering the computational cost of our analyses.

### 4.1. Load filtered imaging data

As a first step we will load n=100 randomly sampled images from each plate of our filtered image data set.
We will use these images to tune the segmentation pipeline.

In [76]:
n = 100
input_dir = "../data/images/filtered"

image_sample_datasets = create_sample_datasets(
    input_dir=input_dir, n=n, random_seed=1234
)

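`create_sample_datasets` draws files with `np.random.choice`, which samples with replacement by default, so a plate sample may contain duplicates. If distinct images are preferred, one option is to sample without replacement, as in the following sketch. `sample_without_replacement` is a hypothetical helper and the file names are made up.

```python
import numpy as np


def sample_without_replacement(items, n, seed=1234):
    """Return up to n distinct items, drawn reproducibly (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    k = min(n, len(items))  # cannot draw more distinct items than are available
    idx = rng.choice(len(items), size=k, replace=False)
    return [items[i] for i in idx]


files = ["image_{:03d}.tif".format(i) for i in range(10)]
sample = sample_without_replacement(files, n=4)
print(sample)
```

Capping the sample size at the number of available files avoids the error that `replace=False` would otherwise raise for plates with fewer than `n` images.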

In [77]:
i = 0

{'41756': [array([[ 560,  537,  576, ...,  551,  533,  511],
         [ 522,  539,  516, ...,  580,  602,  509],
         [ 508,  528,  586, ...,  554,  570,  537],
         ...,
         [ 633,  556,  601, ..., 2521, 2761, 2929],
         [ 617,  572,  600, ..., 2653, 2718, 2829],
         [ 594,  586,  622, ..., 2540, 2609, 2674]], dtype=uint16),
  array([[ 508,  532,  523, ..., 2682, 2666, 2830],
         [ 499,  551,  514, ..., 2469, 2635, 2439],
         [ 517,  560,  519, ..., 2294, 2223, 1776],
         ...,
         [ 760,  973, 1135, ...,  534,  492,  528],
         [ 912, 1115, 1303, ...,  516,  580,  562],
         [1038, 1268, 1519, ...,  508,  537,  596]], dtype=uint16),
  array([[501, 493, 511, ..., 553, 515, 495],
         [520, 535, 510, ..., 527, 529, 545],
         [501, 505, 526, ..., 515, 505, 499],
         ...,
         [537, 514, 521, ..., 657, 700, 859],
         [507, 558, 495, ..., 609, 646, 756],
         [544, 542, 530, ..., 591, 648, 695]], dtype=uint16),
 
