Chapter 5: Training ML models with Serotiny
Suggested: Alex, Gui
- Briefly explain serotiny's YAML-based task formulation
- Show how to start a simple training run on 2D images to classify, e.g., edge vs. non-edge cells
- Show how to load and apply the trained model
- Show how to bring in a pretrained model (2D ResNet)
- Show how to use the latent space from 3D images (precomputed and stored)



In [1]:
import warnings
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import nbvv
from upath import UPath as Path
from aicsimageio import AICSImage
from ome_zarr.reader import Reader
from ome_zarr.io import parse_url
import logging
logging.getLogger("bfio").setLevel(logging.ERROR)
logging.getLogger("aicsimageio").setLevel(logging.ERROR)


# Should these functions not be 

def read_ome_zarr(path, level=0, image_name="default"):
    path = str(path if image_name is None else Path(path) / image_name)
    reader = Reader(parse_url(path))

    node = next(iter(reader()))
    pps = node.metadata["coordinateTransformations"][0][0]["scale"][-3:]
   
    return AICSImage(
        node.data[level].compute(),
        channel_names=node.metadata["name"],
        physical_pixel_sizes=pps
    )

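def rescale_image(img_data, channels):
    # Drop singleton axes and work in float32
    img_data = img_data.squeeze().astype(np.float32)

    for ix, channel in enumerate(channels):
        # Segmentation channels are left untouched
        if "_seg" not in channel:
            # Shift by 1 so that zero-valued pixels become -1
            img_data[ix] -= 1

            # Scale non-negative values by the maximum of the whole array;
            # pixels that were zero stay at -1
            img_data[ix] = np.where(
                img_data[ix] >= 0,
                img_data[ix] / img_data.max(),
                -1
            )
    return img_data.astype(np.float16)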
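
To see what `rescale_image` does, here is a minimal sketch that applies a copy of the function to a toy array. The array values and the channel names `membrane` and `membrane_seg` are made up for illustration; only the function body comes from the notebook.

```python
import numpy as np


def rescale_image(img_data, channels):
    # Copy of the helper above, repeated so this example runs on its own
    img_data = img_data.squeeze().astype(np.float32)
    for ix, channel in enumerate(channels):
        if "_seg" not in channel:
            img_data[ix] -= 1
            img_data[ix] = np.where(
                img_data[ix] >= 0, img_data[ix] / img_data.max(), -1
            )
    return img_data.astype(np.float16)


# Toy image with a singleton leading axis: (1, channels, Y, X)
toy = np.array([[[[0, 1], [3, 5]], [[0, 1], [1, 0]]]])
out = rescale_image(toy, ["membrane", "membrane_seg"])

print(out.dtype, out.shape)
print(out[0])  # intensity channel: zeros become -1, the rest is scaled
print(out[1])  # segmentation channel: left unchanged
```

Note that the division uses the maximum over the whole array rather than the maximum of each channel, so channels are not normalized independently of one another.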

## Load the manifest and explore dimensions

In [2]:
cells_df = pd.read_parquet("s3://variance-dataset/processed/manifest.parquet")
print(f'Number of cells: {len(cells_df)}')
print(f'Number of columns: {len(cells_df.columns)}')

Number of cells: 215081
Number of columns: 1242


## Make a simple dataset of edge vs. non-edge cells

In [3]:
from serotiny.transforms.dataframe.transforms import split_dataframe
Path('./serotiny_data/').mkdir(parents=True, exist_ok=True)

n = 1000  # number of cells per class
# Sample n cells for each value of edge_flag
edge_labels = cells_df["edge_flag"].unique()
index = pd.concat(
    [
        cells_df[cells_df["edge_flag"] == label]
        .sample(n=n)
        .index.to_series()
        for label in edge_labels
    ]
)
cells_edgeVSnoedge = cells_df.loc[index]
# Add the train, validation and test split
cells_edgeVSnoedge = split_dataframe(
    dataframe=cells_edgeVSnoedge,
    train_frac=0.7,
    val_frac=0.2,
    return_splits=False,
)
# Save the sampled manifest for serotiny
cells_edgeVSnoedge.to_csv('./serotiny_data/cells_edgeVSnoedge.csv')
print(f'Number of cells: {len(cells_edgeVSnoedge)}')
print(f'Number of columns: {len(cells_edgeVSnoedge.columns)}')



Number of cells: 2000
Number of columns: 1243
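
The sampling-and-splitting step can be tried on a small synthetic table without access to the S3 manifest. The sketch below is a stand-in built on assumptions: the toy `edge_flag` column, the `random_state` seeds, and the hand-rolled `split` assignment (with the labels `train`, `valid`, `test`) are illustrative and are not serotiny's `split_dataframe` implementation.

```python
import numpy as np
import pandas as pd

# Synthetic manifest: 90 non-edge cells (0) and 30 edge cells (1)
toy_df = pd.DataFrame({"edge_flag": [0] * 90 + [1] * 30})

# Draw the same number of cells from every class
n = 20
balanced = toy_df.groupby("edge_flag").sample(n=n, random_state=0)

# Hand-rolled 70/20/10 split; the label names are assumptions
rng = np.random.default_rng(0)
order = rng.permutation(len(balanced))
n_train = round(0.7 * len(balanced))
n_val = round(0.2 * len(balanced))
split = np.full(len(balanced), "test", dtype=object)
split[order[:n_train]] = "train"
split[order[n_train:n_train + n_val]] = "valid"
balanced = balanced.assign(split=split)

print(balanced["edge_flag"].value_counts())
print(balanced["split"].value_counts())
```

Sampling per group keeps the two classes the same size even though the source table is imbalanced, which is the point of the `n` cells-per-class setting above.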


## Use the cookiecutter to create a serotiny project

See the getting-started guide: https://allencell.github.io/serotiny/getting_started.html

In [11]:
!pip install cookiecutter | grep -v 'already satisfied' #avoid warnings

In [None]:
# This command was run in a terminal rather than in the notebook:
# cookiecutter https://github.com/AllenCellModeling/serotiny-project-cookiecutter

In [12]:
!pip install -e ch5_attempt1/ | grep -v 'already satisfied' #avoid warnings

Obtaining file:///home/aicsuser/cytodata-hackathon-base/notebooks/ch5_attempt1
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Checking if build backend supports build_editable: started
  Checking if build backend supports build_editable: finished with status 'done'
  Getting requirements to build editable: started
  Getting requirements to build editable: finished with status 'done'
  Preparing editable metadata (pyproject.toml): started
  Preparing editable metadata (pyproject.toml): finished with status 'done'
Building wheels for collected packages: ch5-attempt1
  Building editable for ch5-attempt1 (pyproject.toml): started
  Building editable for ch5-attempt1 (pyproject.toml): finished with status 'done'
  Created wheel for ch5-attempt1: filename=ch5_attempt1-0.0.0-py3-none-any.whl size=1405 sha256=4b7629a0c33122ea34b31f5b4d531bbde871f5fc00946050363d2bbf8634b237
  Stored in directory: /tmp/pip-ephem-wheel-cache-0wa_uwbc/wheels

### Show image info using serotiny CLI

In [4]:
!serotiny image info s3://variance-dataset/max_projection_z/408295.ome.tiff

Attempted file (variance-dataset/max_projection_z/408295.ome.tiff) load with reader: aicsimageio.readers.bfio_reader.OmeTiledTiffReader failed with error: No module named 'bfio'
  d = to_dict(os.fspath(xml), parser=parser, validate=validate)
Image shape:  (7, 245, 381)
Channel names:  ['bf', 'dna', 'membrane', 'structure', 'dna_segmentation', 'membrane_segmentation', 'struct_segmentation_roof']
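
The channel names printed above are what a loader setting such as `select_channels: ['membrane']` refers to. As a rough illustration (not serotiny's loader code), picking channels by name from a `(C, Y, X)` array amounts to an index lookup. The random array below is a stand-in for the real image, and the helper `select_channels` is a hypothetical name.

```python
import numpy as np

# Channel names and shape as reported by `serotiny image info`
channel_names = [
    "bf", "dna", "membrane", "structure",
    "dna_segmentation", "membrane_segmentation", "struct_segmentation_roof",
]
# Random stand-in for the real (C, Y, X) max projection
image = np.random.default_rng(0).random((7, 245, 381))


def select_channels(image, channel_names, wanted):
    """Return the requested channels, in the requested order."""
    indices = [channel_names.index(name) for name in wanted]
    return image[indices]


membrane = select_channels(image, channel_names, ["membrane"])
print(membrane.shape)  # (1, 245, 381)
```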


### Train the 2D classifier using serotiny CLI

In [50]:
!cd ch5_attempt1/ && serotiny train model=class2d_model data=class2d_data



[2022-09-03 18:06:22,519][torch.distributed.nn.jit.instantiator][INFO] - Created a temporary directory at /tmp/tmpotupp3bp
[2022-09-03 18:06:22,519][torch.distributed.nn.jit.instantiator][INFO] - Writing /tmp/tmpotupp3bp/_remote_module_non_scriptable.py
[2022-09-03 18:06:22,650][pytorch_lightning.utilities.seed][INFO] - Global seed set to 42
[2022-09-03 18:06:22,651][serotiny.ml_ops.ml_ops][INFO] - Instantiating datamodule
[2022-09-03 18:06:23,845][serotiny.ml_ops.ml_ops][INFO] - Instantiating trainer
[2022-09-03 18:06:24,102][pytorch_lightning.utilities.rank_zero][INFO] - GPU available: True (cuda), used: False
[2022-09-03 18:06:24,102][pytorch_lightning.utilities.rank_zero][INFO] - TPU available: False, using: 0 TPU cores
[2022-09-03 18:06:24,102][pytorch_lightning.utilities.rank_zero][INFO] - IPU available: False, using: 0 IPUs
[2022-09-03 18:06:24,102][pytorch_lightning.utilities.rank_zero][INFO] - HPU available: False, using: 0 HPUs
  rank_zero_warn(
[2022-09-03 18:06:24,458][sero

In [11]:
import yaml

In [7]:
from hydra.utils import instantiate

In [13]:
data = instantiate(yaml.full_load('''_target_: serotiny.datamodules.ManifestDatamodule

path: serotiny_data/cells_edgeVSnoedge.csv

batch_size: 64
num_workers: 1
loaders:
  id:
    _target_: serotiny.io.dataframe.loaders.LoadColumn
    column: CellId
    dtype: int
  image:
    _target_: serotiny.io.dataframe.loaders.LoadImage
    column: max_projection_z
    select_channels: ['membrane']
  class:
    _target_: serotiny.io.dataframe.loaders.LoadColumn
    column: edge_flag
    dtype: int

split_column: "split"'''))
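
In the config above, each `_target_` key names a class by its dotted import path, and the sibling keys become its keyword arguments; nested dictionaries with their own `_target_` are built first. The sketch below imitates that idea with the standard library only. It is a simplified illustration, not Hydra's actual `instantiate` (which supports more features), and `instantiate_sketch` is a hypothetical name.

```python
import importlib


def instantiate_sketch(config):
    """Recursively build objects from dicts that carry a `_target_` key."""
    if isinstance(config, dict):
        # Build nested values first so they arrive as ready-made objects
        built = {
            key: instantiate_sketch(value)
            for key, value in config.items()
            if key != "_target_"
        }
        if "_target_" not in config:
            return built
        module_name, _, attr = config["_target_"].rpartition(".")
        target = getattr(importlib.import_module(module_name), attr)
        return target(**built)
    if isinstance(config, list):
        return [instantiate_sketch(item) for item in config]
    return config


# Standard-library targets stand in for serotiny classes here
config = {
    "_target_": "types.SimpleNamespace",
    "batch_size": 64,
    "timeout": {"_target_": "datetime.timedelta", "minutes": 90},
}
obj = instantiate_sketch(config)
print(obj.batch_size, obj.timeout)
```

This is why the datamodule, its loaders, and later the model can all be described in YAML: swapping a component means changing a `_target_` path and its arguments rather than editing Python code.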

In [17]:
model = instantiate(yaml.full_load(''' 
_target_: ch5_attempt1.ch5_attempt1.model.Classifier
x_label: image
y_label: class
network:
  _target_: torch.nn.Sequential
loss:
  _target_: torch.nn.CrossEntropyLoss'''))
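
The `network` entry above names `torch.nn.Sequential` without listing any layers, and an empty `Sequential` passes its input through unchanged. As a hedged illustration of what could go there, the sketch below builds a small 2D CNN directly in PyTorch. The layer sizes, the batch and image size, and the assumption of one input channel (`membrane`) and two classes (edge vs. non-edge) are illustrative choices, not the configuration used in this chapter.

```python
import torch
from torch import nn

# Hypothetical stand-in for the empty `network:` entry: a tiny 2D CNN
network = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(8, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),  # collapses any (Y, X) size to 1x1
    nn.Flatten(),
    nn.Linear(16, 2),  # one logit per class
)
loss_fn = nn.CrossEntropyLoss()

# Random batch standing in for single-channel max projections
images = torch.randn(4, 1, 64, 96)
labels = torch.tensor([0, 1, 1, 0])

logits = network(images)
loss = loss_fn(logits, labels)
print(logits.shape, loss.item())
```

The adaptive pooling layer is there because cell crops can differ in size; it lets the same network accept any image height and width before the final linear layer.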

In [16]:
from ch5_attempt1.ch5_attempt1.model import Classifier