[Open in Google Colab](https://colab.research.google.com/github/MagicShoebox/vt-cs4664-tiny-towns-scorer/blob/main/training.ipynb)

# Model Training

Tiny Towns Scorer\
CS 4664: Data-Centric Computing Capstone

### Authors
Alex Owens, Daniel Schoenbach, Payton Klemens

### Acknowledgements

Portions of this project were adapted from tutorials and examples available
on the [OpenCV](https://opencv.org/) and [Keras](https://keras.io/) websites.

In particular, significant portions of code from the tutorials [Feature Detection and Description](https://docs.opencv.org/4.6.0/db/d27/tutorial_py_table_of_contents_feature2d.html) and [Train an Object Detection Model on Pascal VOC 2007 using KerasCV](https://keras.io/guides/keras_cv/retina_net_overview/) were copied entirely or used as the basis for several cells in this notebook.

# Dependencies

This notebook uses some additional libraries not installed on Colab by default:
- `tensorflow-io` extends TensorFlow with additional features.\
We use it to handle EXIF metadata in images.

- `keras-cv` extends Keras with computer vision-focused tools.\
We use the RetinaNet model, bounding box utilities, and more.

- `luketils` is a utility library written by the author of the KerasCV tutorial referenced above, which also uses it.\
We use some of its visualization functions.

In [None]:
!pip install tensorflow-io==0.26.0 keras-cv luketils

Import the major packages we'll be using. Other imports will be done as needed.

In [None]:
import tensorflow as tf
import tensorflow_io as tfio
from tensorflow import keras
import keras_cv
import matplotlib.pyplot as plt

# Data

## Setup

Global parameters used in pipeline construction and training.

`IMAGE_SIZE` - The side length, in pixels, of the square input to the neural network portion of the model. Images are resized to `IMAGE_SIZE`x`IMAGE_SIZE` without preserving their original aspect ratio. Note that once the network has made predictions, other parts of the model will use the original image.

`BATCH_SIZE` - Number of images to process at once.

`EPOCHS` - Maximum number of epochs to train.

In [None]:
IMAGE_SIZE = 512
BATCH_SIZE = 16
EPOCHS = 200
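
As a rough illustration of what resizing without preserving the aspect ratio means, the sketch below computes the per-axis scale factors for a hypothetical 4032x3024 photo. The photo dimensions and the helper name `resize_scales` are assumptions made for this example, not values taken from the dataset. Because the two factors differ, objects are stretched, but a box coordinate expressed relative to the image size is unaffected.

```python
IMAGE_SIZE = 512  # same value as in the cell above

def resize_scales(width, height, image_size=IMAGE_SIZE):
    """Return the (x, y) scale factors applied when an image of the given
    size is squashed to image_size x image_size."""
    return image_size / width, image_size / height

# Hypothetical landscape smartphone photo (assumed dimensions)
scale_x, scale_y = resize_scales(4032, 3024)
print(f"x scale: {scale_x:.4f}, y scale: {scale_y:.4f}")

# A pixel-space box edge scales with its axis...
xmin_px = 1008  # a quarter of the way across the original image
xmin_resized = xmin_px * scale_x
# ...so its position relative to the image width is unchanged.
assert abs(xmin_resized / IMAGE_SIZE - xmin_px / 4032) < 1e-9
```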

The collected data has been made available on [Zenodo](https://zenodo.org/record/7429657#.Y5d_np7MKUk).

The dataset is over 1 GB. To avoid re-downloading it each time, the notebook saves it to Google Drive.

In [None]:
# If you do not want to connect the notebook to your Google Drive,
# uncomment the next line and comment out the three lines below
# that import, mount, and point PROJECT_FOLDER at Google Drive.
# PROJECT_FOLDER = '/content'

# Connect to Google Drive (comment out these three lines to skip)
from google.colab import drive
drive.mount('/content/drive')
PROJECT_FOLDER = '/content/drive/My Drive/tiny-towns-scorer'

from os import path
IMAGES_FOLDER = path.join(PROJECT_FOLDER, 'images')
ANNOTATIONS_FOLDER = path.join(PROJECT_FOLDER, 'annotations')
MODEL_FOLDER = path.join(PROJECT_FOLDER, 'model', 'model')
CHECKPOINT_PATH = "checkpoint/" # Note: local to runtime environment

In [None]:
!mkdir -p "{PROJECT_FOLDER}"
!test ! -d "{IMAGES_FOLDER}" && wget -O "images.tar.gz" "https://zenodo.org/record/7429657/files/images.tar.gz?download=1" && tar -xzvf images.tar.gz -C "{PROJECT_FOLDER}"

We'll also want the list of classes, simply hardcoded here:

In [None]:
class_ids = [
  'brick',
  'chapel',
  'cottage',
  'farm',
  'tavern',
  'theater',
  'wheat',
  'wood',
  'board',
  'factory',
  'stone',
  'well',
  'glass',
]
class_mapping = dict(zip(range(len(class_ids)), class_ids))
print(class_mapping)
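
For reference, here is a minimal sketch of the same index-to-name mapping together with its inverse, which can be handy when looking up a class index by name. The name `class_index` is hypothetical and is not used elsewhere in this notebook.

```python
class_ids = [
    'brick', 'chapel', 'cottage', 'farm', 'tavern', 'theater', 'wheat',
    'wood', 'board', 'factory', 'stone', 'well', 'glass',
]

# Index -> name; equivalent to dict(zip(range(len(class_ids)), class_ids))
class_mapping = dict(enumerate(class_ids))

# Name -> index; the inverse lookup (hypothetical helper for illustration)
class_index = {name: idx for idx, name in class_mapping.items()}

print(class_mapping[0], class_index['glass'])  # brick 12
```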

## Parsing

Images taken with smartphones often carry EXIF orientation metadata. CVAT (the Computer Vision Annotation Tool used to label the images) reads this metadata, so the annotations were made on correctly oriented images. TensorFlow, however, ignores it by default, so we need a function that corrects the orientation when loading images.

In [None]:
# Apply EXIF Orientation to image tensor
# Adapted from:
# https://medium.com/@ageitgey/the-dumb-reason-your-fancy-computer-vision-app-isnt-working-exif-orientation-73166c7d39da
def fix_orientation(img, orientation):
  if orientation == 1:
    # Normal image - nothing to do!
    pass
  elif orientation == 2:
    # Mirrored left to right
    img = tf.image.flip_left_right(img)
  elif orientation == 3:
    # Rotated 180 degrees
    img = tf.image.rot90(img, 2)
  elif orientation == 4:
    # Mirrored top to bottom
    img = tf.image.flip_up_down(img)
  elif orientation == 5:
    # Mirrored along top-left diagonal
    img = tf.image.rot90(img, -1)
    img = tf.image.flip_left_right(img)
  elif orientation == 6:
    # Rotated 90 degrees
    img = tf.image.rot90(img, -1)
  elif orientation == 7:
    # Mirrored along top-right diagonal
    img = tf.image.rot90(img, 1)
    img = tf.image.flip_left_right(img)
  elif orientation == 8:
    # Rotated 270 degrees
    img = tf.image.rot90(img, 1)
  return img
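
To see why the two-step cases work, the following plain-Python sketch applies the same operations to a tiny nested-list "image". It assumes that `tf.image.rot90(img, -1)` rotates clockwise (positive `k` being counter-clockwise); the helper functions here are illustrative stand-ins, not TensorFlow calls. It shows that rotating clockwise and then mirroring left to right, as done for orientation 5, is the same as mirroring along the top-left diagonal.

```python
def rot90_clockwise(img):
    """Rotate a 2-D grid 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]

def flip_left_right(img):
    """Mirror a 2-D grid left to right."""
    return [row[::-1] for row in img]

def transpose(img):
    """Mirror a 2-D grid along its top-left diagonal."""
    return [list(row) for row in zip(*img)]

# A 2x3 "image" whose values make every pixel position distinguishable
img = [[1, 2, 3],
       [4, 5, 6]]

rotated = rot90_clockwise(img)   # the correction used for orientation 6
print(rotated)                   # [[4, 1], [5, 2], [6, 3]]

# Orientation 5: rotate clockwise, then mirror left to right
fixed = flip_left_right(rotated)
print(fixed)                     # [[1, 4], [2, 5], [3, 6]]
assert fixed == transpose(img)
```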

The dataset is stored in two parts: raw image files and annotations. The annotations are stored in TensorFlow's TFRecord format. When loading the dataset, we parse each record, load its associated image, and reformat its annotations for use with KerasCV. While loading the image, we also resize it and correct its orientation using the function above.

In [None]:
from keras_cv import bounding_box

# This was inspired by:
# https://github.com/keras-team/keras-cv/blob/v0.3.4/keras_cv/datasets/pascal_voc/load.py
# This function returns a function that takes a dataset.
# The intended usage is dataset.apply( parse_cvat_tfrecords(...) )
def parse_cvat_tfrecords(bounding_box_format, img_size=None):

  # https://opencv.github.io/cvat/docs/manual/advanced/formats/format-tfrecord/
  # Switched VarLenFeature to RaggedFeature
  image_feature_description = {
      'image/filename': tf.io.FixedLenFeature([], tf.string),
      'image/source_id': tf.io.FixedLenFeature([], tf.string),
      'image/height': tf.io.FixedLenFeature([], tf.int64),
      'image/width': tf.io.FixedLenFeature([], tf.int64),
      # Object boxes and classes.
      'image/object/bbox/xmin': tf.io.RaggedFeature(tf.float32),
      'image/object/bbox/xmax': tf.io.RaggedFeature(tf.float32),
      'image/object/bbox/ymin': tf.io.RaggedFeature(tf.float32),
      'image/object/bbox/ymax': tf.io.RaggedFeature(tf.float32),
      'image/object/class/label': tf.io.RaggedFeature(tf.int64),
      'image/object/class/text': tf.io.RaggedFeature(tf.string),
  }

  # TODO: Use keras-cv resizing layer that respects bounding boxes
  # See: https://github.com/keras-team/keras-cv/blob/master/keras_cv/datasets/pascal_voc/load.py
  if img_size is not None:
    resizing = keras.layers.Resizing(
        height=img_size[0], width=img_size[1], crop_to_aspect_ratio=False
    )

  # Construct function to parse individual record
  def parse_record(example_proto):
    features = tf.io.parse_example(example_proto, image_feature_description)
    filename = tf.strings.join([IMAGES_FOLDER, path.sep, features['image/filename']])
    image_raw = tf.io.read_file(filename)
    # Not normalizing here due to bug in luketils plot_bounding_box_gallery
    # https://github.com/LukeWood/luketils/issues/13
    image = tf.io.decode_image(image_raw, channels=3) # / 255
    image = tf.ensure_shape(image, [None,None,3])
    image = tf.cond(tf.image.is_jpeg(image_raw),
                   lambda: fix_orientation(image, tfio.experimental.image.decode_jpeg_exif(image_raw)),
                   lambda: image)
    if img_size is not None:
      image = resizing(image)
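    # Stack the coordinates and the class label into one row per box.
    # The stored label values are shifted by -1 so they line up with the
    # 0-based keys of class_mapping.
    bounding_boxes = tf.ragged.stack(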
        [features['image/object/bbox/ymin'],
        features['image/object/bbox/xmin'],
        features['image/object/bbox/ymax'],
        features['image/object/bbox/xmax'],
        tf.cast(features['image/object/class/label'] - 1, tf.float32)],
        axis=1
        )
    bounding_boxes = bounding_box.convert_format(
        bounding_boxes,
        images=image,
        source='rel_yxyx',
        target=bounding_box_format
    )
    return {'images': image, 'bounding_boxes': bounding_boxes}
  
  # Construct function that applies parse_record to every record in dataset
  def apply(dataset):
    return dataset.map(parse_record)

  # Return that function
  return apply
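
The format conversion at the end of `parse_record` can be pictured with a small plain-Python sketch. It assumes that `rel_yxyx` means `(ymin, xmin, ymax, xmax)` as fractions of the image size and that `xywh` means the top-left corner plus width and height in pixels; the function name `rel_yxyx_to_xywh` is hypothetical, and the class label that the notebook carries as a fifth value is omitted here.

```python
def rel_yxyx_to_xywh(box, image_height, image_width):
    """Convert one relative (ymin, xmin, ymax, xmax) box to pixel
    (x, y, width, height)."""
    ymin, xmin, ymax, xmax = box
    x = xmin * image_width
    y = ymin * image_height
    width = (xmax - xmin) * image_width
    height = (ymax - ymin) * image_height
    return (x, y, width, height)

# A box covering the right half of the middle band of a 512x512 image
box = (0.25, 0.5, 0.75, 1.0)
print(rel_yxyx_to_xywh(box, 512, 512))  # (256.0, 128.0, 256.0, 256.0)
```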

## Pipelines

We are now ready to construct the data pipelines. We begin by creating TensorFlow datasets for the two categories of annotation records: images from the played game and images of individual pieces taken afterward. 

In [None]:
# Prepare the game state annotation records
game_state_records = tf.data.TFRecordDataset([
          path.join(ANNOTATIONS_FOLDER, 'top_down_alex.tfrecord'),
          path.join(ANNOTATIONS_FOLDER, 'top_down_daniel.tfrecord'),
          path.join(ANNOTATIONS_FOLDER, 'top_down_payton.tfrecord'),
          path.join(ANNOTATIONS_FOLDER, 'frontal_alex.tfrecord'),
          path.join(ANNOTATIONS_FOLDER, 'frontal_daniel.tfrecord'),
          path.join(ANNOTATIONS_FOLDER, 'frontal_payton.tfrecord'),
          path.join(ANNOTATIONS_FOLDER, 'side_angle_alex.tfrecord'),
          path.join(ANNOTATIONS_FOLDER, 'side_angle_daniel.tfrecord'),
          path.join(ANNOTATIONS_FOLDER, 'side_angle_payton.tfrecord'),
])

# Prepare the individual piece annotation records
piece_records = tf.data.TFRecordDataset([
          path.join(ANNOTATIONS_FOLDER, 'pieces_buildings.tfrecord'),
          path.join(ANNOTATIONS_FOLDER, 'pieces_resources.tfrecord'),
])

The training dataset is constructed by combining the images of individual pieces and 80% of the game state photos. The other 20% of the game state images are used for validation.

In [None]:
import random

# Load and shuffle all the game state records so we can make train/val split
game_state_records = list(game_state_records)
random.shuffle(game_state_records)

# Split game state records 80% Train / 20% Validation
# (the split index is 80% of the record count, rounded up)
split_index = (len(game_state_records) * 4 + 4) // 5
all_records = tf.data.Dataset.from_tensor_slices(game_state_records)
train_records = tf.data.Dataset.from_tensor_slices(game_state_records[:split_index])
val_records = tf.data.Dataset.from_tensor_slices(game_state_records[split_index:])

# Add piece records to game state records allocated for training
train_records = train_records.concatenate(piece_records)

# Shuffle the complete training split
train_records = list(train_records)
random.shuffle(train_records)
train_records = tf.data.Dataset.from_tensor_slices(train_records)
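
The cell above shuffles with Python's global random state, so the train/validation split changes from run to run. If a reproducible split is wanted, one option is a seeded generator, sketched below on plain Python lists. The function name `split_80_20` and the `seed` parameter are assumptions for illustration and are not part of this notebook; the split index uses the same rounding-up formula as the cell above.

```python
import random

def split_80_20(records, seed=0):
    """Shuffle records with a private, seeded generator and split them
    80% / 20%, rounding the training share up."""
    rng = random.Random(seed)  # leaves the global random state untouched
    shuffled = list(records)
    rng.shuffle(shuffled)
    split_index = (len(shuffled) * 4 + 4) // 5
    return shuffled[:split_index], shuffled[split_index:]

train, val = split_80_20(range(11))
print(len(train), len(val))  # 9 2
```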

Lastly, use the functions defined in [Parsing](#parsing) to convert the datasets into a more useful format. Since the dataset is relatively small, we can cache it in memory for performance. We also shuffle and batch the datasets.

In [None]:
# Convert annotation records to batches of {'images': img, 'bounding_boxes': boxes}
def parse_cache_shuffle_batch(ds):
  ds = ds.apply(parse_cvat_tfrecords('xywh', (IMAGE_SIZE, IMAGE_SIZE)))
  ds = ds.cache()
  ds = ds.shuffle(8 * BATCH_SIZE, reshuffle_each_iteration=True)
  ds = ds.apply(tf.data.experimental.dense_to_ragged_batch(batch_size=BATCH_SIZE))
  return ds

# Train and validation datasets ready to go
all_ds = parse_cache_shuffle_batch(all_records)
train_ds = parse_cache_shuffle_batch(train_records)
val_ds = parse_cache_shuffle_batch(val_records)
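
The `shuffle` call above uses a buffer of `8 * BATCH_SIZE` elements rather than shuffling the whole dataset at once. The sketch below is a simplified plain-Python model of how such a buffered shuffle behaves; it is an illustration under that assumption, not the actual `tf.data` implementation, and `buffered_shuffle` is a hypothetical name. The key consequence is that an element can only move forward by a limited distance, so a small buffer gives only a local shuffle.

```python
import random

def buffered_shuffle(items, buffer_size, seed=None):
    """Yield items in a locally shuffled order using a fixed-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)  # still filling the buffer
            continue
        # Buffer is full: emit a random element and replace it
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # Input exhausted: drain what is left in random order
    rng.shuffle(buffer)
    yield from buffer

out = list(buffered_shuffle(range(100), buffer_size=8, seed=0))
# Every element appears exactly once...
assert sorted(out) == list(range(100))
# ...and the j-th output always comes from the first j + 8 inputs.
assert all(value < j + 8 for j, value in enumerate(out))
```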

## Augmentation

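Use data augmentation to artificially create more training data. To keep the annotation bounding boxes valid, we limit augmentation to horizontal flips, which KerasCV applies to the boxes as well as the images, and brightness- or color-based distortions, which leave the boxes unchanged.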

In [None]:
random_flip = keras_cv.layers.RandomFlip(mode="horizontal", bounding_box_format="xywh")
rand_augment = keras_cv.layers.RandAugment(
    value_range=(0, 255),
    augmentations_per_image=2,
    # we disable geometric augmentations for object detection tasks
    geometric=False,
)

def augment(inputs):
    # In future KerasCV releases, RandAugment will support
    # bounding box detection
    inputs["images"] = rand_augment(inputs["images"])
    inputs = random_flip(inputs)
    return inputs

augmented_ds = train_ds.map(augment, num_parallel_calls=tf.data.AUTOTUNE)
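
A horizontal flip is a simple augmentation to combine with bounding boxes because the box update is a one-line formula. The sketch below shows that update for a single `xywh` box in plain Python; it is meant as an illustration of the idea, not as the KerasCV `RandomFlip` implementation, and `flip_box_horizontal` is a hypothetical name.

```python
def flip_box_horizontal(box, image_width):
    """Mirror one (x, y, width, height) box about the vertical center line
    of an image that is image_width pixels wide."""
    x, y, width, height = box
    # The box's right edge becomes its new left edge, measured from the
    # opposite side of the image.
    return (image_width - x - width, y, width, height)

box = (100.0, 50.0, 30.0, 40.0)
flipped = flip_box_horizontal(box, 512)
print(flipped)  # (382.0, 50.0, 30.0, 40.0)

# Flipping twice restores the original box
assert flip_box_horizontal(flipped, 512) == box
```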

Lastly, we'll reformat the pipelines from dictionaries to tuples in preparation for model training.

In [None]:
def dict_to_tuple(inputs):
    return inputs["images"], inputs["bounding_boxes"]

all_ds = all_ds.map(dict_to_tuple, num_parallel_calls=tf.data.AUTOTUNE)
train_ds = train_ds.map(dict_to_tuple, num_parallel_calls=tf.data.AUTOTUNE)
augmented_ds = augmented_ds.map(dict_to_tuple, num_parallel_calls=tf.data.AUTOTUNE)
val_ds = val_ds.map(dict_to_tuple, num_parallel_calls=tf.data.AUTOTUNE)

all_ds = all_ds.prefetch(tf.data.AUTOTUNE)
train_ds = train_ds.prefetch(tf.data.AUTOTUNE)
augmented_ds = augmented_ds.prefetch(tf.data.AUTOTUNE)
val_ds = val_ds.prefetch(tf.data.AUTOTUNE)

# Visualization

Define a function to help us visualize datasets.

In [None]:
from luketils import visualization

def visualize_dataset(dataset, bounding_box_format):
    images, boxes = next(iter(dataset))
    visualization.plot_bounding_box_gallery(
        images,
        value_range=(0, 255),
        bounding_box_format=bounding_box_format,
        y_true=boxes,
        scale=4,
        rows=3,
        cols=3,
        show=True,
        thickness=4,
        font_scale=1,
        class_mapping=class_mapping,
    )

Visualize a sample of the original training dataset.

In [None]:
visualize_dataset(train_ds, bounding_box_format="xywh")

And a sample of the augmented dataset we'll use for training.

In [None]:
visualize_dataset(augmented_ds, bounding_box_format="xywh")

# Model Creation

Construct a RetinaNet object detection model with a ResNet50 backbone initialized with weights pretrained on ImageNet. The backbone weights are frozen, so only the remaining layers are trained.

In [None]:
model = keras_cv.models.RetinaNet(
    # number of classes to be used in box classification
    classes=len(class_ids),
    # For more info on supported bounding box formats, visit
    # https://keras.io/api/keras_cv/bounding_box/
    bounding_box_format="xywh",
    # KerasCV offers a set of pre-configured backbones
    backbone="resnet50",
    # Each backbone comes with multiple pre-trained weights
    # These weights match the weights available in the `keras_cv.model` class.
    backbone_weights="imagenet",
    # include_rescaling tells the model whether your input images are in the default
    # pixel range (0, 255) or if you have already rescaled your inputs to the range
    # (0, 1).  In our case, we feed our model images with inputs in the range (0, 255).
    include_rescaling=True,
    # Typically, you'll want to set this to False when training a real model.
    # evaluate_train_time_metrics=True makes `train_step()` incompatible with TPU,
    # and also causes a massive performance hit.  It can, however be useful to produce
    # train time metrics when debugging your model training pipeline.
    evaluate_train_time_metrics=False,
)
# Fine-tuning a RetinaNet is as simple as setting backbone.trainable = False
model.backbone.trainable = False

Before we compile the model, we define evaluation metrics. We use the COCO metrics provided by KerasCV. Since we want these metrics both across all classes and for each individual class, we first define a couple of helper functions.

In [None]:
def coco_map_metric(name, class_ids):
  return keras_cv.metrics.COCOMeanAveragePrecision(
            class_ids=class_ids,
            bounding_box_format="xywh",
            name=name)
def coco_recall_metric(name, class_ids):
  return keras_cv.metrics.COCORecall(
            class_ids=class_ids,
            bounding_box_format="xywh",
            max_detections=100,
            name=name)
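
COCO-style precision and recall depend on deciding whether a predicted box matches a ground-truth box, which is done by comparing their intersection over union (IoU) against a threshold. The sketch below computes IoU for two `xywh` boxes in plain Python. It is only meant to illustrate the quantity behind the metrics; `iou_xywh` is a hypothetical helper, not a KerasCV function.

```python
def iou_xywh(box_a, box_b):
    """Intersection over union of two (x, y, width, height) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Overlap along each axis, clamped at zero when the boxes are disjoint
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    intersection = inter_w * inter_h
    union = aw * ah + bw * bh - intersection
    return intersection / union if union > 0 else 0.0

print(iou_xywh((0, 0, 10, 10), (0, 0, 10, 10)))    # identical boxes: 1.0
print(iou_xywh((0, 0, 10, 10), (5, 0, 10, 10)))    # half overlap: 1/3
print(iou_xywh((0, 0, 10, 10), (20, 20, 5, 5)))    # disjoint boxes: 0.0
```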

Compile the model with RetinaNet's "focal loss" for classification, a smooth L1 loss for box regression, an SGD optimizer with gradient clipping, and the metrics defined above.

In [None]:
model.compile(
    classification_loss=keras_cv.losses.FocalLoss(from_logits=True, reduction="none"),
    box_loss=keras_cv.losses.SmoothL1Loss(l1_cutoff=1.0, reduction="none"),
    optimizer=tf.optimizers.SGD(global_clipnorm=10.0),
    metrics=[
        coco_map_metric("Total Mean Average Precision", range(len(class_ids))),
        coco_recall_metric("Total Recall", range(len(class_ids))),
        *(coco_map_metric(f'{cid} Mean Average Precision', [idx]) for idx, cid in enumerate(class_ids)),
        *(coco_recall_metric(f'{cid} Recall', [idx]) for idx, cid in enumerate(class_ids))
    ]
)
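
To give some intuition for focal loss, the sketch below implements the binary form from the RetinaNet paper, FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t), in plain Python. The values `alpha=0.25` and `gamma=2.0` are the ones reported in that paper and are assumed here; they were not read from the KerasCV source, and the function names are hypothetical. The point is that confidently correct ("easy") predictions are down-weighted far more strongly than confidently wrong ("hard") ones.

```python
import math

def sigmoid(logit):
    return 1.0 / (1.0 + math.exp(-logit))

def binary_cross_entropy(label, logit):
    """Plain cross-entropy for one binary label and one logit."""
    p = sigmoid(logit)
    p_t = p if label == 1 else 1.0 - p
    return -math.log(p_t)

def binary_focal_loss(label, logit, alpha=0.25, gamma=2.0):
    """Focal loss for one binary label and one logit (assumed defaults)."""
    p = sigmoid(logit)
    p_t = p if label == 1 else 1.0 - p            # probability of the true class
    alpha_t = alpha if label == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

easy = binary_focal_loss(1, 4.0)    # confident and correct
hard = binary_focal_loss(1, -4.0)   # confident and wrong
print(f"easy example: {easy:.6f}, hard example: {hard:.6f}")

# Relative to cross-entropy, the easy example is scaled down much more
easy_ratio = easy / binary_cross_entropy(1, 4.0)
hard_ratio = hard / binary_cross_entropy(1, -4.0)
assert easy_ratio < hard_ratio
```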

# Model Training

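Train the model with callbacks that log to TensorBoard, reduce the learning rate when the validation loss plateaus, stop training early if it stops improving, and checkpoint the weights. At this stage the weights are saved only **locally** to the runtime (at `CHECKPOINT_PATH`); they are copied to `MODEL_FOLDER` in a later section.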

**We recommend using a GPU hardware accelerator.**

**Running this cell may take an hour or more.**

In [None]:
model.fit(
  augmented_ds,
  validation_data=val_ds.take(20),
  epochs=EPOCHS,
  callbacks=[
    keras.callbacks.TensorBoard(log_dir="logs"),
    keras.callbacks.ReduceLROnPlateau(patience=5),
    keras.callbacks.EarlyStopping(patience=10),
    keras.callbacks.ModelCheckpoint(CHECKPOINT_PATH, save_weights_only=True),
  ],
)
model.save_weights(CHECKPOINT_PATH)
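
The two patience values above interact: with a patience of 5 for the learning rate and 10 for early stopping, the learning rate gets a chance to drop before training is abandoned. The plain-Python sketch below is a simplified model of that interaction, not the Keras implementation. It ignores `min_delta` and cooldown, and it assumes the Keras defaults of a 0.01 starting learning rate for SGD and a reduction factor of 0.1; `simulate_patience` is a hypothetical name.

```python
def simulate_patience(val_losses, lr=0.01, lr_patience=5,
                      stop_patience=10, factor=0.1):
    """Return a list of (epoch, learning_rate) pairs up to the epoch at
    which early stopping would trigger."""
    best = float("inf")
    lr_wait = 0     # epochs without improvement since the last LR change
    stop_wait = 0   # epochs without improvement overall
    history = []
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            lr_wait = 0
            stop_wait = 0
        else:
            lr_wait += 1
            stop_wait += 1
        if lr_wait >= lr_patience:
            lr *= factor  # plateau detected: reduce the learning rate
            lr_wait = 0
        history.append((epoch, lr))
        if stop_wait >= stop_patience:
            break         # no improvement for too long: stop training
    return history

# Made-up validation losses: three improving epochs, then a long plateau
losses = [1.0, 0.9, 0.8] + [0.8] * 20
history = simulate_patience(losses)
last_epoch, last_lr = history[-1]
print(f"stopped after epoch {last_epoch} with learning rate {last_lr:.0e}")
```

In this made-up run the learning rate is reduced once partway through the plateau and training stops ten stagnant epochs after the last improvement.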

# Model Evaluation

Now we can evaluate our COCO metrics on the trained model. We evaluate against both the validation dataset and the dataset of all game state photos; note that the latter includes photos the model was trained on. (Individual piece photos are only used for training.)

In [None]:
model.load_weights(CHECKPOINT_PATH)
for name, ds in [('All dataset', all_ds.take(100)),('Validation dataset', val_ds.take(100))]:
  metrics = model.evaluate(ds, return_dict=True)
  print(f'{name} metrics:')
  print(metrics)

# Saving the Model

If the `MODEL_FOLDER` set earlier is on Google Drive, save the weights there for use in the other notebooks.

In [None]:
model.save_weights(MODEL_FOLDER)

If you want to do "fire and forget" training, uncomment the last line of the cell below and use Run All; the runtime will then disconnect once the notebook finishes.

In [None]:
from google.colab import runtime

# Make sure we finished saving
drive.flush_and_unmount()

# Uncomment to have Run All terminate the session when done
# runtime.unassign()