# 1 - The QCB project

The goal of the quantitative cell biology (QCB) project is to find patterns among the physical properties of our cells. Further analysis of these patterns, which may be conducted in collaboration with the modeling team, may lead to new rules that govern cell shape and potentially cell function. Two distinct approaches can be employed in this project: _feature search_ and _image inspection_.

## 1.2 - Feature search

We use tools from data mining to rank pairs of highly correlated, non-obvious features that may lead to the discovery of new scaling laws between cell components, or to the clustering of cells. This approach should be conducted in collaboration with the modeling team, as tools and results from both teams can be complementary.
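
A minimal sketch of the ranking step, assuming the features are held as equal-length columns of numbers. The feature names are borrowed from the lists in section 2, but the values are made up for illustration, and `pearson` and `rank_feature_pairs` are hypothetical helpers that are not part of `aicsfeature`. Each pair of features is scored by the absolute Pearson correlation, and pairs are sorted from strongest to weakest.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no correlation signal
    return cov / sqrt(var_x * var_y)


def rank_feature_pairs(columns):
    """Return (abs_corr, name_a, name_b) tuples sorted by decreasing |r|."""
    scored = [
        (abs(pearson(columns[a], columns[b])), a, b)
        for a, b in combinations(sorted(columns), 2)
    ]
    return sorted(scored, key=lambda item: item[0], reverse=True)


# Made-up values for illustration only; these are not real QCB measurements.
columns = {
    "mem_volume": [1.0, 2.0, 3.0, 4.0, 5.0],
    "dna_volume": [2.1, 3.9, 6.2, 8.1, 9.8],
    "mem_sphericity": [0.9, 0.4, 0.7, 0.5, 0.8],
}

ranking = rank_feature_pairs(columns)
for score, name_a, name_b in ranking:
    print(f"{name_a} vs {name_b}: |r| = {score:.3f}")
```

If the features are already in a pandas DataFrame, `DataFrame.corr()` computes the same pairwise coefficients. Pairs that are expected to correlate for trivial reasons would still need to be filtered out by hand to keep only the non-obvious ones.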

In this approach we use the pre-extracted features available in the database. Please refer to this [repository](http://stash.corp.alleninstitute.org/projects/AICSAD/repos/aics-feature/browse?at=refs%2Fheads%2FfeatureCalc) for details of feature extraction.

## 1.3 - Image inspection

We look at our data, describe what we see, and consider which features may capture it.

In this approach it is valid to perform simple measurements, using ImageJ for example, to test whether a particular measurement captures an aspect observed by eye. For instance, looking at control vs. treated images of Golgi, one may observe that treated cells display fragmented Golgi. In that case, one could use `Image → Adjust → Auto Threshold` and `Analyze → Analyze Particles` to count the average number of fragments in each condition. If the difference in the number of fragments is confirmed, this valuable information will guide the team responsible for feature extraction in developing Golgi-specific features that will later be made available in the database.
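
The same kind of check can be sketched outside ImageJ. The example below is a simplified stand-in, not a reproduction of ImageJ's Auto Threshold or Analyze Particles: it applies a fixed threshold (the value 5 is an arbitrary assumption) to tiny made-up 2D images and counts connected bright regions, treating pixels that touch by an edge or a corner as one fragment. `count_fragments` is a hypothetical helper.

```python
def count_fragments(image, threshold):
    """Count connected regions of pixels brighter than `threshold`.

    `image` is a list of rows of numbers. Pixels touching by an edge or a
    corner (8-connectivity) belong to the same fragment.
    """
    rows, cols = len(image), len(image[0])
    mask = [[value > threshold for value in row] for row in image]
    seen = [[False] * cols for _ in range(rows)]
    fragments = 0
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or seen[r][c]:
                continue
            fragments += 1
            # Flood fill from this pixel to mark the whole fragment.
            stack = [(r, c)]
            seen[r][c] = True
            while stack:
                y, x = stack.pop()
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and mask[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            stack.append((ny, nx))
    return fragments


# Made-up intensity grids: one compact object vs. three separate pieces.
control = [
    [0, 0, 0, 0, 0, 0],
    [0, 9, 9, 9, 0, 0],
    [0, 9, 9, 9, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
treated = [
    [9, 0, 0, 0, 0, 9],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 9, 9, 0, 0],
    [0, 0, 0, 0, 0, 0],
]

control_count = count_fragments(control, threshold=5)
treated_count = count_fragments(treated, threshold=5)
print("control fragments:", control_count)
print("treated fragments:", treated_count)
```

Averaging such counts over many cells per condition gives the comparison described above.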


# 2 - Feature extraction

The feature extraction module is called `aicsfeature` and can be found in this [repository](http://stash.corp.alleninstitute.org/projects/AICSAD/repos/aics-feature/browse?at=refs%2Fheads%2FfeatureCalc). The module divides feature extraction into four submodules depending on the component to be analyzed, as explained in the sections below.

Note that users working on the QCB project do not need to worry about the feature extraction pipeline. However, it is good to keep the pipeline in mind when presenting results and suggesting new features to be included in the database.

To access the submodules for feature extraction in Python, use:

In [1]:
from aicsfeature.extractor import mem, dna, fov, structure

## 2.1 - MEM - Cell membrane submodule


This submodule is used to extract features of cell membrane images. The features currently available are listed below.

In [2]:
mem.GetListOfFeatures().tolist()

['mem_volume',
 'mem_surface_area',
 'mem_1st_axis_x',
 'mem_1st_axis_y',
 'mem_1st_axis_z',
 'mem_2nd_axis_x',
 'mem_2nd_axis_y',
 'mem_2nd_axis_z',
 'mem_3rd_axis_x',
 'mem_3rd_axis_y',
 'mem_3rd_axis_z',
 'mem_1st_axis_length',
 'mem_2nd_axis_length',
 'mem_3rd_axis_length',
 'mem_1st_eigenvalue',
 'mem_2nd_eigenvalue',
 'mem_3rd_eigenvalue',
 'mem_meridional_eccentricity',
 'mem_equator_eccentricity',
 'mem_sphericity',
 'mem_lowest_z',
 'mem_highest_z',
 'mem_x_centroid',
 'mem_y_centroid',
 'mem_z_centroid',
 'mem_intensity_mean',
 'mem_intensity_median',
 'mem_intensity_sum',
 'mem_intensity_mode',
 'mem_intensity_max',
 'mem_intensity_std',
 'mem_intensity_entropy',
 'mem_haralick_ang2nd_moment',
 'mem_haralick_contrast',
 'mem_haralick_corr',
 'mem_haralick_variance',
 'mem_haralick_inv_diff_moment',
 'mem_haralick_sum_avg',
 'mem_haralick_sum_var',
 'mem_haralick_sum_entropy',
 'mem_haralick_entropy',
 'mem_haralick_diff_var',
 'mem_haralick_diff_entropy',
 'mem_haralick_info

## 2.2 - DNA - Nucleus submodule


This submodule is used to extract features of nucleus images. The features currently available are listed below.

In [3]:
dna.GetListOfFeatures().tolist()

['dna_volume',
 'dna_surface_area',
 'dna_1st_axis_x',
 'dna_1st_axis_y',
 'dna_1st_axis_z',
 'dna_2nd_axis_x',
 'dna_2nd_axis_y',
 'dna_2nd_axis_z',
 'dna_3rd_axis_x',
 'dna_3rd_axis_y',
 'dna_3rd_axis_z',
 'dna_1st_axis_length',
 'dna_2nd_axis_length',
 'dna_3rd_axis_length',
 'dna_1st_eigenvalue',
 'dna_2nd_eigenvalue',
 'dna_3rd_eigenvalue',
 'dna_meridional_eccentricity',
 'dna_equator_eccentricity',
 'dna_sphericity',
 'dna_lowest_z',
 'dna_highest_z',
 'dna_x_centroid',
 'dna_y_centroid',
 'dna_z_centroid',
 'dna_intensity_mean',
 'dna_intensity_median',
 'dna_intensity_sum',
 'dna_intensity_mode',
 'dna_intensity_max',
 'dna_intensity_std',
 'dna_intensity_entropy',
 'dna_haralick_ang2nd_moment',
 'dna_haralick_contrast',
 'dna_haralick_corr',
 'dna_haralick_variance',
 'dna_haralick_inv_diff_moment',
 'dna_haralick_sum_avg',
 'dna_haralick_sum_var',
 'dna_haralick_sum_entropy',
 'dna_haralick_entropy',
 'dna_haralick_diff_var',
 'dna_haralick_diff_entropy',
 'dna_haralick_info

## 2.3 - FOV - Field of view submodule


This submodule is used to extract features of field of view images. The features currently available are listed below.

In [4]:
fov.GetListOfFeatures().tolist()

['fov_lowest_z',
 'fov_highest_z',
 'fov_x_centroid',
 'fov_y_centroid',
 'fov_z_centroid']

## 2.4 - Structure - Structure submodule


This submodule is used to extract features of structure images. The features currently available are listed below.

In [5]:
structure.GetListOfFeatures().tolist()


['str_1st_axis_length_mean',
 'str_1st_axis_x_mean',
 'str_1st_axis_y_mean',
 'str_1st_axis_z_mean',
 'str_1st_eigenvalue_mean',
 'str_2nd_axis_length_mean',
 'str_2nd_axis_x_mean',
 'str_2nd_axis_y_mean',
 'str_2nd_axis_z_mean',
 'str_2nd_eigenvalue_mean',
 'str_3rd_axis_length_mean',
 'str_3rd_axis_x_mean',
 'str_3rd_axis_y_mean',
 'str_3rd_axis_z_mean',
 'str_3rd_eigenvalue_mean',
 'str_equator_eccentricity_mean',
 'str_highest_z_mean',
 'str_lowest_z_mean',
 'str_meridional_eccentricity_mean',
 'str_skeleton_edge_vol_mean',
 'str_skeleton_prop_deg0_mean',
 'str_skeleton_prop_deg1_mean',
 'str_skeleton_prop_deg3_mean',
 'str_skeleton_prop_deg4p_mean',
 'str_skeleton_vol_mean',
 'str_sphericity_mean',
 'str_surface_area_mean',
 'str_volume_mean',
 'str_x_centroid_mean',
 'str_y_centroid_mean',
 'str_z_centroid_mean',
 'str_1st_axis_length_std',
 'str_1st_axis_x_std',
 'str_1st_axis_y_std',
 'str_1st_axis_z_std',
 'str_1st_eigenvalue_std',
 'str_2nd_axis_length_std',
 'str_2nd_axis_x_

# 3 - Pipeline for feature extraction

The feature extraction pipeline, which includes the implementation and extraction of new features for all the cells in the database, is handled by Matheus and Jianxu.

## Pipeline

1 - The file `gen_meta_ds.py` is first used to save a local version of the current meta dataset (this step is optional).

2 - The files `gen_dna_ds.py`, `gen_fov_ds.py`, `gen_mem_ds.py` and `gen_str_ds.py` (yet to be implemented) are used to load the meta dataset from the database and apply the corresponding submodule for feature extraction.

3 - The Jupyter Notebook `upload_ds.ipynb` is used to upload the new datasets of features to the database.
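
The shape of step 2 can be sketched as follows. This is an illustration only, not the contents of the `gen_*_ds.py` scripts: `build_feature_table` and `toy_extractor` are hypothetical names, the meta dataset is represented as a plain list of dictionaries, and the `voxels` field and `toy_` feature names are made up. The idea is simply that each row of the meta dataset is passed to an extractor and the resulting features are stored alongside the row identifier.

```python
def build_feature_table(meta_rows, extract, id_key="cell_id"):
    """Apply `extract` to every meta dataset row and keep the row id.

    `extract` takes one row and returns a dict mapping feature names to values.
    """
    table = []
    for row in meta_rows:
        record = {id_key: row[id_key]}
        record.update(extract(row))
        table.append(record)
    return table


def toy_extractor(row):
    """Stand-in for a submodule call; a real extractor would read image data."""
    voxels = row["voxels"]
    return {
        "toy_intensity_sum": sum(voxels),
        "toy_intensity_max": max(voxels),
    }


# Made-up meta dataset rows for illustration only.
meta_rows = [
    {"cell_id": 1, "voxels": [3, 5, 2]},
    {"cell_id": 2, "voxels": [7, 1]},
]

feature_table = build_feature_table(meta_rows, toy_extractor)
for record in feature_table:
    print(record)
```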


# 4 - Currently available datasets

To load the datasets of features from the database, first connect to the prod database with:

````
import datasetdatabase as dsdb
prod = dsdb.DatasetDatabase(config='/allen/aics/assay-dev/Analysis/QCB_database/prod_config.json')
````

Next, query for the dataset you want.

## 4.1 QCB_MEM_feature

To load this dataset use

`prod.get_dataset(name='QCB_MEM_feature')`

## 4.2 QCB_DNA_feature

To load this dataset use

`prod.get_dataset(name='QCB_DNA_feature')`

## 4.3 QCB_FOV_feature

To load this dataset use

`prod.get_dataset(name='QCB_FOV_feature')`

## 4.4 QCB_ST6GAL_feature

To load this dataset use

`prod.get_dataset(name='QCB_ST6GAL_feature')`

## 4.5 QCB_TOM20_feature

To load this dataset use

`prod.get_dataset(name='QCB_TOM20_feature')`

## 4.6 QCB_DRUG_MEM_feature

To load this dataset use

`prod.get_dataset(name='QCB_DRUG_MEM_feature')`

## 4.7 QCB_DRUG_DNA_feature

To load this dataset use

`prod.get_dataset(name='QCB_DRUG_DNA_feature')`

## 4.8 QCB_DRUG_ST6GAL_feature

To load this dataset use

`prod.get_dataset(name='QCB_DRUG_ST6GAL_feature')`

## 4.9 QCB_DRUG_TUBA1B_feature

To load this dataset use

`prod.get_dataset(name='QCB_DRUG_TUBA1B_feature')`

## 4.10 QCB_DRUG_SEC61B_feature

To load this dataset use

`prod.get_dataset(name='QCB_DRUG_SEC61B_feature')`
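
Since every dataset in this section is loaded with the same call, several can be fetched in one loop. The sketch below assumes only the `get_dataset(name=...)` call shown above; `load_datasets` is a hypothetical helper, and `FakeDatabase` is a stand-in so the example runs without access to the prod database. In practice one would pass the `prod` object instead.

```python
def load_datasets(db, names):
    """Fetch each named dataset via db.get_dataset(name=...) into a dict."""
    return {name: db.get_dataset(name=name) for name in names}


class FakeDatabase:
    """Stand-in for the real connection; returns a placeholder string."""

    def get_dataset(self, name):
        return f"<dataset {name}>"


names = ["QCB_MEM_feature", "QCB_DNA_feature", "QCB_FOV_feature"]
datasets = load_datasets(FakeDatabase(), names)
for name in names:
    print(name, "->", datasets[name])
```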



# 5 - Checking datasets

In [2]:
import datasetdatabase as dsdb
prod = dsdb.DatasetDatabase(config='/allen/aics/assay-dev/Analysis/QCB_database/prod_config.json')

In [3]:
prod.preview(name='QCB_MEM_feature')

{'info': {'id': 41, 'name': 'QCB_MEM_feature', 'description': 'Cell membrane features extracted from QCB data', 'introspector': 'datasetdatabase.introspect.dataframe.DataFrameIntrospector', 'created': datetime.datetime(2018, 10, 3, 18, 22, 53, 389741)},
 'shape': (10116, 46),
 'keys': ['mem_volume',
  'mem_surface_area',
  'mem_1st_axis_x',
  'mem_1st_axis_y',
  'mem_1st_axis_z',
  'mem_2nd_axis_x',
  'mem_2nd_axis_y',
  'mem_2nd_axis_z',
  'mem_3rd_axis_x',
  'mem_3rd_axis_y',
  'mem_3rd_axis_z',
  'mem_1st_axis_length',
  'mem_2nd_axis_length',
  'mem_3rd_axis_length',
  'mem_1st_eigenvalue',
  'mem_2nd_eigenvalue',
  'mem_3rd_eigenvalue',
  'mem_meridional_eccentricity',
  'mem_equator_eccentricity',
  'mem_sphericity',
  'mem_lowest_z',
  'mem_highest_z',
  'mem_x_centroid',
  'mem_y_centroid',
  'mem_z_centroid',
  'mem_intensity_mean',
  'mem_intensity_median',
  'mem_intensity_sum',
  'mem_intensity_mode',
  'mem_intensity_max',
  'mem_intensity_std',
  'mem_intensity_entropy',


In [4]:
prod.preview(name='QCB_DNA_feature')

{'info': {'id': 42, 'name': 'QCB_DNA_feature', 'description': 'Nucleus features extracted from QCB data', 'introspector': 'datasetdatabase.introspect.dataframe.DataFrameIntrospector', 'created': datetime.datetime(2018, 10, 3, 18, 37, 32, 838891)},
 'shape': (10116, 46),
 'keys': ['dna_volume',
  'dna_surface_area',
  'dna_1st_axis_x',
  'dna_1st_axis_y',
  'dna_1st_axis_z',
  'dna_2nd_axis_x',
  'dna_2nd_axis_y',
  'dna_2nd_axis_z',
  'dna_3rd_axis_x',
  'dna_3rd_axis_y',
  'dna_3rd_axis_z',
  'dna_1st_axis_length',
  'dna_2nd_axis_length',
  'dna_3rd_axis_length',
  'dna_1st_eigenvalue',
  'dna_2nd_eigenvalue',
  'dna_3rd_eigenvalue',
  'dna_meridional_eccentricity',
  'dna_equator_eccentricity',
  'dna_sphericity',
  'dna_lowest_z',
  'dna_highest_z',
  'dna_x_centroid',
  'dna_y_centroid',
  'dna_z_centroid',
  'dna_intensity_mean',
  'dna_intensity_median',
  'dna_intensity_sum',
  'dna_intensity_mode',
  'dna_intensity_max',
  'dna_intensity_std',
  'dna_intensity_entropy',
  'dna

In [2]:
prod.preview(name='QCB_FOV_feature')

{'info': {'id': 43, 'name': 'QCB_FOV_feature', 'description': 'Field of view features extracted from QCB data', 'introspector': 'datasetdatabase.introspect.dataframe.DataFrameIntrospector', 'created': datetime.datetime(2018, 10, 3, 18, 52, 35, 873941)},
 'shape': (3107, 6),
 'keys': ['fov_lowest_z',
  'fov_highest_z',
  'fov_x_centroid',
  'fov_y_centroid',
  'fov_z_centroid',
  'czi_filename'],
 'annotations': []}
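
A preview like the one above can be condensed into a one-line summary and checked for internal consistency. This is a minimal sketch that assumes `prod.preview` returns a dictionary with the `info`, `shape` and `keys` entries seen in the output; `summarize_preview` and `column_count_matches` are hypothetical helpers, and the dictionary below copies only a subset of the printed values for `QCB_FOV_feature`.

```python
def summarize_preview(preview):
    """Return a short description built from the 'info' and 'shape' entries."""
    rows, cols = preview["shape"]
    info = preview["info"]
    return f"{info['name']} (id {info['id']}): {rows} rows x {cols} columns"


def column_count_matches(preview):
    """Check that the number of listed keys equals the column count."""
    return len(preview["keys"]) == preview["shape"][1]


# Values copied from the QCB_FOV_feature preview output above.
fov_preview = {
    "info": {"id": 43, "name": "QCB_FOV_feature"},
    "shape": (3107, 6),
    "keys": [
        "fov_lowest_z",
        "fov_highest_z",
        "fov_x_centroid",
        "fov_y_centroid",
        "fov_z_centroid",
        "czi_filename",
    ],
    "annotations": [],
}

summary = summarize_preview(fov_preview)
print(summary)
print("column count matches keys:", column_count_matches(fov_preview))
```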

In [5]:
prod.preview(name='QCB_ST6GAL_feature')

{'info': {'id': 48, 'name': 'QCB_ST6GAL_feature', 'description': 'Features extracted from QCB images of ST6GAL-tagged Golgi.', 'introspector': 'datasetdatabase.introspect.dataframe.DataFrameIntrospector', 'created': datetime.datetime(2018, 10, 4, 17, 15, 4, 554944)},
 'shape': (694, 102),
 'keys': ['cell_id',
  'str_1st_axis_length_mean',
  'str_1st_axis_x_mean',
  'str_1st_axis_y_mean',
  'str_1st_axis_z_mean',
  'str_1st_eigenvalue_mean',
  'str_2nd_axis_length_mean',
  'str_2nd_axis_x_mean',
  'str_2nd_axis_y_mean',
  'str_2nd_axis_z_mean',
  'str_2nd_eigenvalue_mean',
  'str_3rd_axis_length_mean',
  'str_3rd_axis_x_mean',
  'str_3rd_axis_y_mean',
  'str_3rd_axis_z_mean',
  'str_3rd_eigenvalue_mean',
  'str_equator_eccentricity_mean',
  'str_highest_z_mean',
  'str_lowest_z_mean',
  'str_meridional_eccentricity_mean',
  'str_sphericity_mean',
  'str_surface_area_mean',
  'str_volume_mean',
  'str_x_centroid_mean',
  'str_y_centroid_mean',
  'str_z_centroid_mean',
  'str_1st_axis_len

In [7]:
prod.preview(name='QCB_DRUG_ST6GAL_feature')

{'info': {'id': 55, 'name': 'QCB_DRUG_ST6GAL_feature', 'description': 'Features extracted from QCB images of ST6GAL-tagged Golgi.', 'introspector': 'datasetdatabase.introspect.dataframe.DataFrameIntrospector', 'created': datetime.datetime(2018, 10, 5, 2, 20, 59, 303831)},
 'shape': (231, 102),
 'keys': ['cell_id',
  'str_1st_axis_length_mean',
  'str_1st_axis_x_mean',
  'str_1st_axis_y_mean',
  'str_1st_axis_z_mean',
  'str_1st_eigenvalue_mean',
  'str_2nd_axis_length_mean',
  'str_2nd_axis_x_mean',
  'str_2nd_axis_y_mean',
  'str_2nd_axis_z_mean',
  'str_2nd_eigenvalue_mean',
  'str_3rd_axis_length_mean',
  'str_3rd_axis_x_mean',
  'str_3rd_axis_y_mean',
  'str_3rd_axis_z_mean',
  'str_3rd_eigenvalue_mean',
  'str_equator_eccentricity_mean',
  'str_highest_z_mean',
  'str_lowest_z_mean',
  'str_meridional_eccentricity_mean',
  'str_sphericity_mean',
  'str_surface_area_mean',
  'str_volume_mean',
  'str_x_centroid_mean',
  'str_y_centroid_mean',
  'str_z_centroid_mean',
  'str_1st_axi