<a href="https://colab.research.google.com/github/Martin09/DeepSEM/blob/master/segmentation-NMs/2_nm_seg_training.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 2 - Model definition, training and saving
In this notebook we will:
1. Load and prepare our dataset.
2. Create and configure a pre-trained model from the detectron2 model zoo. 
3. Train our custom model on our dataset of labelled SEM images. 
4. Perform inference to test our model.
5. Save the model to a file for later use.

First, let's check that we are running a GPU instance of Colab:

In [None]:
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Select the Runtime -> "Change runtime type" menu to enable GPU accelerator, ')
  print('and then re-execute this cell.')
else:
  print(gpu_info)

If you see a prompt above to "Change runtime type" then you are not running a GPU instance. Follow the instructions above to enable the GPU.

## Install detectron2
We will be using Facebook's [detectron2](https://github.com/facebookresearch/detectron2) library to train our model. First we need to install it and its dependencies.

In [None]:
# install dependencies: (use cu101 because colab has CUDA 10.1)
!pip install -U torch==1.5 torchvision==0.6 -f https://download.pytorch.org/whl/cu101/torch_stable.html 
!pip install cython pyyaml==5.1
!pip install -U 'git+https://github.com/cocodataset/cocoapi.git#subdirectory=PythonAPI'
import torch, torchvision
print(torch.__version__, torch.cuda.is_available())
!gcc --version

In [None]:
# install detectron2:
!pip install detectron2==0.1.2 -f https://dl.fbaipublicfiles.com/detectron2/wheels/cu101/index.html

In [None]:
# Setup detectron2 logger
import detectron2
from detectron2.utils.logger import setup_logger
setup_logger()

# Import some common libraries
import numpy as np
import os, cv2, random, json, datetime, time
import urllib.request, pycocotools.mask  # Import the submodules used below explicitly
from glob import glob
from google.colab.patches import cv2_imshow
from PIL import Image
from pathlib import Path
from tqdm import tqdm

# Import some common detectron2 utilities
from detectron2 import model_zoo
from detectron2.engine import DefaultTrainer, DefaultPredictor
from detectron2.config import get_cfg
from detectron2.utils.visualizer import Visualizer
from detectron2.data import MetadataCatalog
from detectron2.structures import BoxMode

## Introduction
In this section, we will train an existing detectron2 model on our labelled SEM image dataset that we prepared in [Notebook 1](https://colab.research.google.com/github/Martin09/DeepSEM/blob/master/segmentation-NWs/1_nw_seg_image_prep.ipynb) and labelled with [Labelbox](https://labelbox.com). We will use a custom labelled nanowire detection dataset which can be downloaded [here](https://github.com/Martin09/DeepSEM/raw/master/segmentation-NWs/datasets/WJ_NWs_D1-17-02-17-C_processed.zip). This dataset has been labelled with three classes: 
*   Droplet
*   Nanowire
*   Parasitic

From this, we will train a custom image segmentation model from an existing model pre-trained on the COCO dataset, available in detectron2's model zoo.

Note: the COCO dataset does not have any of these categories by default so we will have to perform transfer learning to get the model to detect them.

## 2.1 - Preparing our dataset

In [None]:
# Define the classes in our dataset
class_dict = {'slit': '1',
              'nanomembrane': '2',
              'parasitic': '3',
              'bottom_nucleus': '4',
              'side_nucleus': '5',
              'nanowire': '6',
              'overgrowth': '7'}

# Define some paths/constants that will be useful later
root = Path('./DeepSEM/segmentation-NMs/')
dataset_dir = root.joinpath('datasets')
output_dir = root.joinpath('output')
models_dir = root.joinpath('trained_models')

github_url = 'https://github.com/Martin09/' + str(root).replace('/','/trunk/')

imgs_zip = dataset_dir.joinpath('Nick_NMs_50kmag_png.zip')
imgs_dir = dataset_dir.joinpath(imgs_zip.stem)

labelbox_export = dataset_dir.joinpath('export-2020-06-16T09_29_47.338Z.json')

test_dir = imgs_dir.joinpath('test')
train_dir = imgs_dir.joinpath('train')

dataset_root_name = 'nm_masks'
train_name = dataset_root_name + '_train'
test_name = dataset_root_name + '_test'

In [None]:
# # Optional: Save everything to your own GoogleDrive
# from google.colab import drive
# drive.mount('/content/gdrive/')
# %cd "/content/gdrive/My Drive/path/to/save/location"

# Clone just the relevant folder from the DeepSEM repo
!rm -rf $root
!apt install subversion
!svn checkout $github_url $root

# # Alternative: Clone whole DeepSEM repository
# !rm -rf DeepSEM  # Remove folder if it already exists
# !git clone https://github.com/Martin09/DeepSEM

To make things easier, I have already prepared a labelled dataset and put it in a handy .zip file. Next, let's unzip this dataset to use it for training.

In [None]:
# Remove existing images directory and unzip dataset
!rm -rf $imgs_dir
!unzip -o $imgs_zip -d $imgs_dir
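
After unzipping, the rest of this notebook expects one folder per image, each holding the image itself and one mask folder per class ID. The sketch below only illustrates that layout with `pathlib`; the image name `example_01.png` and mask name `mask_a.png` are hypothetical placeholders, while the glob patterns are the ones used by the cells further down.

```python
from pathlib import PurePosixPath

# Hypothetical file names, for illustration only
imgs_dir = PurePosixPath("DeepSEM/segmentation-NMs/datasets/Nick_NMs_50kmag_png")
image_name = "example_01.png"
class_id = "2"            # folder name = class ID from class_dict (here: nanomembrane)
mask_name = "mask_a.png"

# One folder per image, named after the image file without its extension
example_dir = imgs_dir / "train" / PurePosixPath(image_name).stem
image_path = example_dir / "images" / image_name
mask_path = example_dir / "masks" / class_id / mask_name

print(image_path)
print(mask_path)

# These are the relative patterns that the later cells glob for
print(image_path.relative_to(imgs_dir / "train").match("*/images/*.png"))
print(mask_path.relative_to(example_dir).match("masks/*/*.png"))
```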

Alternatively, you can download and build up the dataset from a LabelBox JSON export file from scratch. The script below will build up the same file structure using a LabelBox JSON export file. Downloading takes about 30s/image, depending on size and number of segmentation masks.

Note: Not sure why but the JSON export from Labelbox seems to have some duplicates (20 images online, 23 images in JSON file) so I had to devise a crude check according to the "External ID" (aka PNG file name) to get rid of them. Might be worth investigating this later.
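
As a compact illustration of that duplicate check, here is a minimal sketch using a `set` of already-seen "External ID" values. The records are toy stand-ins for the Labelbox export, and the file names are made up.

```python
# Toy records standing in for the Labelbox JSON export (hypothetical file names)
data = [
    {"External ID": "img_001.png"},
    {"External ID": "img_002.png"},
    {"External ID": "img_001.png"},  # duplicate entry
]

seen = set()        # External IDs encountered so far
unique_data = []    # first occurrence of each image, in original order
for record in data:
    if record["External ID"] in seen:
        continue    # skip repeated entries
    seen.add(record["External ID"])
    unique_data.append(record)

print(len(data), "->", len(unique_data))
```

Keeping the first occurrence preserves the original ordering, so the later train/test split by index is unaffected by where the duplicates sit in the file.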


In [None]:
# from urllib.error import HTTPError

# # Remove existing images directory
# !rm -rf $imgs_dir

# # Load json data
# with open(labelbox_export) as f:
#     data = json.load(f)

# # Empty list to store image names, keep only unique data
# image_names = []
# unique_data = []

# # Loop over all the images, keep only unique image names
# for img in data:
#     # Check if unique
#     if img['External ID'] not in image_names:
#       unique_data.append(img)
#       image_names.append(img['External ID'])
#     else:
#       print(f"'{img['External ID']}' is a duplicate")

# data = unique_data

# # Perform test/train split by random choice
# n_images = len(data)
# i_test = np.random.choice(list(range(n_images)), int(n_images*0.1)+1, replace=False)

# # Define a small helper function to download images from Labelbox
# def get_image(img_url):
#   counter = 0
#   while(True):
#     try:
#       counter += 1
#       time.sleep(0.2)  # To not overload the LabelBox servers with requests
#       image = Image.open(urllib.request.urlopen(img_url))
#       return image
#     except HTTPError:
#       if counter > 10:
#         print('Image download failed 10 times in a row!')
#         raise

# # Loop over all the images (one image per row)
# for i, img in enumerate(tqdm(data, unit='image')):
#     image_name = img['External ID']

#     # Output image into test and train directories randomly
#     if i in i_test:
#         example_dest_dir = imgs_dir.joinpath('test/' + image_name.split(".")[0])
#     else:
#         example_dest_dir = imgs_dir.joinpath('train/' + image_name.split(".")[0])

#     # If there are labels for this image, create a folder for it and save the image
#     if img['Label']['objects'] and len(img['Label']['objects']) > 0:
#         example_dest_dir.joinpath('images').mkdir(parents=True, exist_ok=True)
#         image_url = img['Labeled Data']
#         image = get_image(image_url)
#         image.save(example_dest_dir.joinpath('images/' + image_name))
    
#     # For each label, download the mask PNG file from labelbox and save it to relevant path
#     for label in img['Label']['objects']:
#         mask_class = class_dict[label['value']]
#         mask_name = label['featureId']
#         mask_url = label['instanceURI']

#         mask_dest_dir = example_dest_dir.joinpath('masks/' + mask_class)
#         mask_dest_dir.mkdir(parents=True, exist_ok=True)
#         mask = get_image(mask_url)
#         mask.save(mask_dest_dir.joinpath(mask_name +'.png'), bit=1)

In [None]:
# # Optional: download the dataset
# from google.colab import files
# %cd $imgs_dir
# !rm -rf ../dataset.zip
# !zip -r ../dataset.zip *
# time.sleep(10) # Give it a bit of time to save the file
# files.download("../dataset.zip")
# %cd /content

### Calculate mean pixel intensity of dataset



In [None]:
# Calculate dataset mean pixel intensity
images = list(train_dir.glob('*/images/*.png'))

min_pixel_intensity = np.inf
max_pixel_intensity = -np.inf
mean_pixel_intensity = 0
for img in images:
    im = cv2.imread(str(img),cv2.IMREAD_GRAYSCALE)
    min_pixel_intensity = min([np.min(im), min_pixel_intensity])
    max_pixel_intensity = max([np.max(im), max_pixel_intensity])
    mean_pixel_intensity += np.mean(im)
mean_pixel_intensity /= len(images)

print('The min pixel intensity is: {:.2f}'.format(min_pixel_intensity))
print('The max pixel intensity is: {:.2f}'.format(max_pixel_intensity))
print('The mean pixel intensity is: {:.2f}'.format(mean_pixel_intensity))
mpi = float(mean_pixel_intensity)
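
One caveat: the cell above averages the per-image means, which equals the true dataset mean only when every image has the same number of pixels. The toy sketch below uses plain Python lists as stand-ins for images (the pixel values are made up) to show the difference. If your images vary in size, a pixel-weighted mean may be the better choice.

```python
# Two toy "images" of different sizes (hypothetical pixel values)
images = [
    [[0, 0], [0, 0]],   # 2x2 image, all black
    [[100]],            # 1x1 image
]

def pixels(img):
    """Flatten a nested-list image into a flat list of pixel values."""
    return [p for row in img for p in row]

# Mean of per-image means (the approach used in the cell above)
mean_of_means = sum(sum(pixels(im)) / len(pixels(im)) for im in images) / len(images)

# Pixel-weighted mean over the whole dataset
total = sum(sum(pixels(im)) for im in images)
count = sum(len(pixels(im)) for im in images)
weighted_mean = total / count

print(mean_of_means, weighted_mean)
```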

Register the dataset with detectron2, following the [detectron2 custom dataset tutorial](https://detectron2.readthedocs.io/tutorials/datasets.html).
Here, the dataset is in a custom format, so we write a function to parse it and convert it into detectron2's standard format. See the tutorial for more details.
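
As a rough illustration of the target format, the sketch below builds a single record by hand. The field names follow the detectron2 dataset tutorial; the file name, image size and box values are made up, and the string `"XYXY_ABS"` stands in for `BoxMode.XYXY_ABS` so that the snippet runs without detectron2 installed.

```python
class_dict = {'slit': '1', 'nanomembrane': '2', 'parasitic': '3',
              'bottom_nucleus': '4', 'side_nucleus': '5',
              'nanowire': '6', 'overgrowth': '7'}

# Mask folders are named '1'..'7'; category IDs must be zero-based indices
# into thing_classes, hence the "- 1".
thing_classes = list(class_dict)
folder_to_category = {folder: int(folder) - 1 for folder in class_dict.values()}

record = {
    "file_name": "train/example_01/images/example_01.png",  # hypothetical path
    "image_id": 0,
    "height": 768,    # made-up image size
    "width": 1024,
    "annotations": [
        {
            "bbox": [10, 20, 110, 220],    # xmin, ymin, xmax, ymax
            "bbox_mode": "XYXY_ABS",       # placeholder for BoxMode.XYXY_ABS
            "segmentation": None,          # RLE-encoded mask in the real loader
            "category_id": folder_to_category['2'],
        }
    ],
}

cat = record["annotations"][0]["category_id"]
print(cat, thing_classes[cat])
```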


In [None]:
# if your dataset is in COCO format, this cell can be replaced by the following three lines:
# from detectron2.data.datasets import register_coco_instances
# register_coco_instances("my_dataset_train", {}, "json_annotation_train.json", "path/to/image/dir")
# register_coco_instances("my_dataset_test", {}, "json_annotation_test.json", "path/to/image/dir")

def get_bbox(msk):
    rows = np.any(msk, axis=1)
    cols = np.any(msk, axis=0)
    ymin, ymax = np.where(rows)[0][[0, -1]]
    xmin, xmax = np.where(cols)[0][[0, -1]]
    return xmin, ymin, xmax, ymax

def get_nw_mask_dicts_from_labelbox(folder):
    dataset_dicts = []
    for idx, img_file in enumerate(folder.glob('*/images/*.png')):    
        record = {}
        record["image_id"] = idx
        record["file_name"] = str(img_file)
        
        height, width = cv2.imread(str(img_file)).shape[:2]
        record["height"] = height
        record["width"] = width    

        objs = []

        for mask_file in img_file.parents[1].glob('masks/*/*.png'):
            mask = cv2.imread(str(mask_file), cv2.IMREAD_GRAYSCALE)
            mask = (mask/mask.max()).astype(np.uint8)  # Convert to integer mask
            try:
                obj = {
                    "bbox": list(get_bbox(mask)),
                    "bbox_mode": BoxMode.XYXY_ABS,
                    "segmentation": pycocotools.mask.encode(np.asarray(mask, order="F")),
                    "category_id": int(mask_file.parent.stem)-1,
                }
            except IndexError: # Can happen if we have an empty mask and try to find its bbox, for example
                continue
            objs.append(obj)
        record["annotations"] = objs
        dataset_dicts.append(record)
    return dataset_dicts        

Now we register the train and test datasets in detectron2:

In [None]:
from detectron2.data import DatasetCatalog, MetadataCatalog
DatasetCatalog.clear()

for d in [train_dir, test_dir]:
    DatasetCatalog.register(dataset_root_name + '_' + d.stem, lambda d=d: get_nw_mask_dicts_from_labelbox(d))
    MetadataCatalog.get(dataset_root_name + '_' + d.stem).set(thing_classes=list(class_dict))
nanowire_metadata = MetadataCatalog.get(train_name)
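
A note on the `lambda d=d:` in the loop above: Python closures look up loop variables when they are called, not when they are defined, so without the default argument both registered loaders would end up reading the last directory. A minimal sketch, with placeholder strings instead of real directories:

```python
dirs = ["train", "test"]  # placeholder names standing in for train_dir / test_dir

# Late binding: every lambda sees the final value of d
late = [lambda: d for d in dirs]

# Default argument: each lambda captures the value of d at definition time
bound = [lambda d=d: d for d in dirs]

print([f() for f in late])
print([f() for f in bound])
```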

To verify the data loading is correct, let's visualize the annotations of randomly selected samples in the training set:



In [None]:
dataset_dicts = get_nw_mask_dicts_from_labelbox(train_dir)
for d in random.sample(dataset_dicts, 3):
    img = cv2.imread(d["file_name"])

    visualizer = Visualizer(img[:, :, ::-1], metadata=nanowire_metadata, scale=1.0)
    vis = visualizer.draw_dataset_dict(d)
    cv2_imshow(vis.get_image()[:, :, ::-1])

## 2.2 - Model definition

Now, we will load a COCO-pretrained R50-FPN Mask R-CNN model from the [detectron2 model zoo](https://github.com/facebookresearch/detectron2/blob/master/MODEL_ZOO.md). We will then modify its configuration settings in order to adapt it for transfer learning on our SEM dataset.

In [None]:
cfg = get_cfg()
cfg.OUTPUT_DIR = str(output_dir.joinpath(datetime.datetime.now().strftime('%Y-%m-%d_%H-%M-%S')))  # Timestamp output folder
cfg.merge_from_file(model_zoo.get_config_file("COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"))
cfg.INPUT.MASK_FORMAT='bitmask'
cfg.DATASETS.TRAIN = (train_name,)
cfg.DATASETS.TEST = ()
cfg.DATALOADER.NUM_WORKERS = 8
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml")  # Let training initialize from model zoo
cfg.SOLVER.IMS_PER_BATCH = 8
cfg.SOLVER.BASE_LR = 0.05  # learning rate
cfg.SOLVER.MAX_ITER = 100  # 100 here is for proof of concept, will want to increase this to 10-20k for final model
cfg.SOLVER.CHECKPOINT_PERIOD = 5000  # Save a checkpoint after every this number of iterations

# Learning rate warmup and decay
cfg.SOLVER.WARMUP_FACTOR = 1/1000.  # Learning starts at BASE_LR * WU_FACTOR
cfg.SOLVER.WARMUP_ITERS = 1000  # Number of iterations for warm-up phase
cfg.SOLVER.WARMUP_METHOD = "linear"
cfg.SOLVER.GAMMA = 0.5
# cfg.SOLVER.STEPS = (100,200,300,)  # List of iteration numbers at which to decrease learning rate by factor GAMMA.
cfg.SOLVER.STEPS = tuple(range(0,cfg.SOLVER.MAX_ITER,2000)) # Decrease LR by factor GAMMA every 2000 iterations

# Don't scale the input images
cfg.INPUT.MIN_SIZE_TRAIN = (0,)  # Keep these data types or might run into issues during inference when loading config file
cfg.INPUT.MAX_SIZE_TRAIN = 99999  # Keep these data types or might run into issues during inference when loading config file

cfg.MODEL.PIXEL_MEAN = [mpi, mpi, mpi] 
cfg.MODEL.PIXEL_STD = [1.0, 1.0, 1.0]
cfg.MODEL.ROI_HEADS.IOU_THRESHOLDS = [0.5] # Intersection over union threshold
cfg.MODEL.ROI_HEADS.NUM_CLASSES = len(class_dict) # Number of classes defined in class_dict (seven here)

We can start up TensorBoard to monitor training progress in real time.

In [None]:
# Look at training curves in tensorboard:
%load_ext tensorboard
# %reload_ext tensorboard
%tensorboard --logdir $output_dir

## 2.3 - Training
It takes ~2 minutes to train 100 iterations on Colab Pro's Tesla P100 GPUs and a bit longer on Colab's free Tesla K80 GPUs.

In [None]:
# Start training
os.makedirs(cfg.OUTPUT_DIR, exist_ok=True)
trainer = DefaultTrainer(cfg) 
trainer.resume_or_load(resume=False)
trainer.train()

During training, you should see the `total_loss` decrease over time as the model learns. With the settings above, the learning rate (`lr`) increases linearly during the 1000-iteration warm-up stage and is multiplied by `GAMMA = 0.5` at each iteration listed in `cfg.SOLVER.STEPS` (every 2000 iterations as configured). These settings can all be changed in the config (`cfg.XX = X`) definitions above. As a proof of concept, I have set `cfg.SOLVER.MAX_ITER=100`, so this short run ends while the warm-up is still in progress; you should increase this to 1000 or even >10000 to achieve the best performance.
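
To get a feel for the resulting learning-rate curve, here is a simplified re-implementation of a linear warm-up followed by step decay. It is a sketch, not detectron2's own scheduler code: the function `lr_at` is hypothetical, and the milestones `(2000, 4000, 6000, 8000)` are an assumed example for a 10000-iteration run rather than the values produced by the config cell above.

```python
from bisect import bisect_right

def lr_at(it, base_lr=0.05, warmup_factor=1 / 1000, warmup_iters=1000,
          gamma=0.5, steps=(2000, 4000, 6000, 8000)):
    """Approximate learning rate at iteration `it` (linear warm-up, step decay)."""
    if it < warmup_iters:
        alpha = it / warmup_iters
        warmup = warmup_factor * (1 - alpha) + alpha  # ramps from warmup_factor to 1
    else:
        warmup = 1.0
    decay = gamma ** bisect_right(steps, it)  # one factor of gamma per milestone reached
    return base_lr * warmup * decay

for it in (0, 100, 1000, 2000, 4000):
    print(it, lr_at(it))
```

Under these assumptions, iteration 100 is still early in the warm-up, at roughly a tenth of `BASE_LR`, which is one reason why a 100-iteration run is only a proof of concept.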

## 2.4 - Inference with a trained model
Now, let's run inference with the trained model on the test dataset. First, let's create a predictor using the model we just trained:



In [None]:
cfg.MODEL.WEIGHTS = os.path.join(cfg.OUTPUT_DIR, "model_final.pth")
cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.5   # Set the testing threshold for this model
cfg.MODEL.ROI_HEADS.NMS_THRESH_TEST = 0.2     # Non-max suppression threshold
cfg.DATASETS.TEST = (test_name, )
cfg.TEST.DETECTIONS_PER_IMAGE = 200
cfg.INPUT.MIN_SIZE_TEST = 0
cfg.INPUT.MAX_SIZE_TEST = 99999
predictor = DefaultPredictor(cfg)
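
To make the two thresholds above more concrete, the sketch below applies a score filter followed by greedy non-max suppression to a few made-up boxes. It is a class-agnostic simplification written in plain Python; detectron2's real implementation differs in detail (for example, it applies NMS per class), and the helper names `iou` and `filter_detections` are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (xmin, ymin, xmax, ymax)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def filter_detections(boxes, scores, score_thresh=0.5, nms_thresh=0.2):
    """Return indices of kept boxes: score filter first, then greedy NMS."""
    # Keep only confident detections, sorted from highest to lowest score
    order = sorted((i for i, s in enumerate(scores) if s > score_thresh),
                   key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Drop a box if it overlaps too much with an already-kept, higher-scoring box
        if all(iou(boxes[i], boxes[k]) <= nms_thresh for k in keep):
            keep.append(i)
    return keep

# Made-up detections: boxes 0 and 1 overlap heavily, box 3 has a low score
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30), (40, 40, 50, 50)]
scores = [0.9, 0.8, 0.7, 0.3]
print(filter_detections(boxes, scores))
```

A low `NMS_THRESH_TEST` such as 0.2 removes even moderately overlapping detections, which suits images where objects rarely overlap; raise it if touching or overlapping objects are being dropped.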

Then, we randomly select a few test samples to visualize the prediction results.

In [None]:
from detectron2.utils.visualizer import ColorMode
dataset_dicts = get_nw_mask_dicts_from_labelbox(test_dir)
for d in random.sample(dataset_dicts, 2):    
    im = cv2.imread(d["file_name"])
    outputs = predictor(im)
    v = Visualizer(im[:, :, ::-1],
                   metadata=nanowire_metadata, 
                   scale=1.5, 
                  #  instance_mode=ColorMode.IMAGE_BW   # remove the colors of unsegmented pixels
    )
    v = v.draw_instance_predictions(outputs["instances"].to("cpu"))
    cv2_imshow(v.get_image()[:, :, ::-1])

We can see that, even with relatively few training iterations, the model already achieves quite good performance thanks to transfer learning from a pre-trained network! As you train for longer, the predictions should become more accurate. At >10k training iterations the model should approach human-level performance.

In [None]:
# # Check the outputs of the neural network manually:
# outputs["instances"].pred_boxes
# outputs["instances"].scores
# outputs["instances"].pred_classes

We can also evaluate its performance using the AP metric implemented in the COCO API.

In [None]:
# ## Not working yet
# from detectron2.evaluation import COCOEvaluator, inference_on_dataset
# from detectron2.data import build_detection_test_loader
# evaluator = COCOEvaluator(train_name, cfg, False)#, output_dir=output_dir)
# test_loader = build_detection_test_loader(cfg, train_name)
# inference_on_dataset(trainer.model, test_loader, evaluator)
# # another equivalent way is to use trainer.test

## 2.5 - Saving the trained model
Now let's save our final model for safekeeping so that we can use it in the next notebook.

In [None]:
weights_source = cfg.OUTPUT_DIR + '/model_final.pth'
model_name = 'nm_seg_it'+str(trainer.iter)+'_lossX.XXX.yaml'
model_dest = models_dir.joinpath(model_name)

# Move weights file to "trained_models" folder and update the config file accordingly
weights_dest = model_dest.parent.joinpath(model_dest.stem + '.pth')
!cp $weights_source $weights_dest
cfg.MODEL.WEIGHTS = str(weights_dest)

# Save the config file alongside the weights file
config_dest = model_dest
with open(config_dest, "w") as text_file:
    text_file.write(cfg.dump())

## A final note on overfitting
In this example we are only monitoring the training loss as a function of iterations. This has the disadvantage that we do not know if our model is overfitting to our training set. In this simple application we don't worry about overfitting too much, especially as the model seems to perform well on the test set after training. However, to achieve the best possible performance, we should create a small third dataset, called a validation set, with which we periodically estimate the performance of the model during training. Training should then be stopped when the validation loss stops decreasing, to prevent overfitting on the training set.
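
One common way to turn "stop when the validation loss stops decreasing" into a rule is patience-based early stopping. The sketch below is not part of this notebook's training loop: the function `early_stop_index`, the `patience` parameter and the loss values are all hypothetical and only illustrate the idea.

```python
def early_stop_index(val_losses, patience=3):
    """Return the index of the evaluation at which training would stop,
    or None if the patience is never exhausted."""
    best = float("inf")
    since_best = 0  # evaluations since the last improvement
    for i, loss in enumerate(val_losses):
        if loss < best:
            best, since_best = loss, 0
        else:
            since_best += 1
            if since_best >= patience:
                return i
    return None

# Made-up validation losses, one per periodic evaluation
val_losses = [1.00, 0.80, 0.65, 0.60, 0.62, 0.61, 0.63, 0.70]
stop = early_stop_index(val_losses, patience=3)
print(stop, min(val_losses[:stop + 1]))
```

In practice you would keep the checkpoint from the evaluation with the lowest validation loss rather than the one at which training stopped.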

Next, in [Notebook 3](https://colab.research.google.com/github/Martin09/DeepSEM/blob/master/segmentation-NMs/3_nm_seg_inference.ipynb), let's see how we can apply our trained model on raw SEM images and extract some important growth information from them!