Computer Vision Fine Tuning

Fine tune a computer vision model to solve your task locally, on HPC, in a container, or in the cloud!

Core Concepts

Computer Vision Tasks

Classification: Classify what the image is. The simplest task.
Object Detection: Detect and classify objects in images, using bounding boxes.
Segmentation: Detect and classify objects in images, using segmentation masks.
Panoptic Segmentation: Detect and classify everything in an image, using segmentation masks for the whole image. The hardest task.
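To make the differences between the tasks concrete, here is a rough sketch of what the label for one tiny 4x4 image could look like under each task. The class names, the layouts, and the two helper functions are all illustrative assumptions; real annotation formats (COCO, YOLO, etc.) differ in the details.

```python
# Illustrative label shapes for one tiny 4x4 image (all names are hypothetical).

# Classification: a single label for the whole image.
classification_label = "kangaroo"

# Object detection: one (class, box) pair per object.
# Boxes here are (x_min, y_min, x_max, y_max) in pixels.
detection_label = [("kangaroo", (0, 0, 2, 2))]

# Segmentation: one (class, mask) pair per object; 1 marks object pixels.
segmentation_label = [
    ("kangaroo", [
        [1, 1, 0, 0],
        [1, 1, 0, 0],
        [0, 0, 0, 0],
        [0, 0, 0, 0],
    ])
]

# Panoptic segmentation: every pixel gets a class, including the background.
panoptic_label = [
    ["kangaroo", "kangaroo", "sky", "sky"],
    ["kangaroo", "kangaroo", "sky", "sky"],
    ["grass", "grass", "grass", "grass"],
    ["grass", "grass", "grass", "grass"],
]


def covers_whole_image(pixel_classes):
    """Return True if every pixel has a class assigned (the panoptic requirement)."""
    return all(cell is not None for row in pixel_classes for cell in row)


def mask_area(mask):
    """Count the object pixels in a binary mask."""
    return sum(sum(row) for row in mask)
```

The amount of annotation work grows down the list: one label per image, then one box per object, then one mask per object, then a class for every pixel.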

Picking Your Model

There are three leading model families for computer vision at the moment:

Residual Neural Networks (ResNet)
You Only Look Once (YOLO, v5)
Vision Transformers (ViT)

ResNets and YOLOs are both based on convolutional neural networks. ViT models apply the transformer architecture to images.

Currently, ResNets and YOLOs are the easiest to work with and the most broadly supported. They can also be fine tuned, or even trained from scratch, on surprisingly small datasets. ViT models typically require very large amounts of data for fine tuning, which makes them less attractive in research use cases where datasets are often on the smaller side.

In this repository, I will first focus on demonstrating workflows for using ResNet and YOLO architectures. YOLO models often excel in terms of inference / prediction speed, which is very attractive when considering possible production or deployment scenarios where computational resources may be at a premium or speed is a requirement.

Data Preparation

Before you get started, you need to have images and annotations.

Some general advice:

When collecting a dataset of images, work to make sure that your training data contains the same sort of variability that you would see in a deployment scenario. You want to make sure that your model is exposed to a diverse set of images that show your objects in a variety of different positions and sizes in the frame, on a wide variety of backgrounds.
Try to have balanced classes. 1000 images per class is a great starting point.
Include blank or "null" images, so that your model can get better at understanding what the background is.
Objects that take up a larger share of the image are easier to detect than objects that are smaller.
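Several of the points above can be checked before training. The sketch below is one possible audit, assuming YOLO-style labels where each line is "class x_center y_center width height" with values normalised to 0-1, and where an image with no label lines is a null image. The `audit_labels` function and the 1% "small object" threshold are assumptions made for illustration, not part of this repository.

```python
from collections import Counter


def audit_labels(labels_by_image, small_area=0.01):
    """Summarise class balance, null images, and small objects.

    labels_by_image maps an image name to its YOLO-style label lines,
    each "class x_center y_center width height" with normalised values.
    An empty list marks a null (background-only) image.
    """
    images_per_class = Counter()
    null_images = 0
    small_objects = 0
    total_objects = 0
    for lines in labels_by_image.values():
        if not lines:
            null_images += 1
            continue
        classes_here = set()
        for line in lines:
            cls, _xc, _yc, w, h = line.split()
            classes_here.add(int(cls))
            total_objects += 1
            # Normalised width * height is the fraction of the image covered.
            if float(w) * float(h) < small_area:
                small_objects += 1
        # Count each class once per image, not once per object.
        images_per_class.update(classes_here)
    return {
        "images_per_class": dict(images_per_class),
        "null_images": null_images,
        "small_object_fraction": small_objects / total_objects if total_objects else 0.0,
    }


example = {
    "a.jpg": ["0 0.5 0.5 0.4 0.4"],
    "b.jpg": ["0 0.2 0.2 0.05 0.05", "1 0.7 0.7 0.3 0.3"],
    "c.jpg": [],  # null image
}
report = audit_labels(example)
```

A strongly uneven `images_per_class`, zero null images, or a high `small_object_fraction` are all hints that the dataset may need more collection work before fine tuning.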

Tools for making annotations:

makesense: Pros: free, can keep your data private, built-in COCO object detection. Cons: less robust ecosystem, limited scalability.
roboflow: Pros: industry standard, well integrated with other platforms, scalable. Cons: you have to upload your data and it won't be private.
CVAT
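Annotation tools often export boxes in COCO format, while YOLO models expect their own label format, so a conversion step is common. A minimal sketch, assuming COCO boxes are [x_min, y_min, width, height] in pixels and YOLO lines are "class x_center y_center width height" normalised by the image size. The function names are hypothetical, and COCO category ids may need remapping to zero-based class indices first.

```python
def coco_to_yolo(bbox, image_width, image_height):
    """Convert a COCO box [x_min, y_min, width, height] in pixels
    to YOLO (x_center, y_center, width, height) normalised to 0-1."""
    x_min, y_min, box_w, box_h = bbox
    x_center = (x_min + box_w / 2) / image_width
    y_center = (y_min + box_h / 2) / image_height
    return (x_center, y_center, box_w / image_width, box_h / image_height)


def yolo_label_line(class_index, bbox, image_width, image_height):
    """Format one line of a YOLO label file."""
    values = coco_to_yolo(bbox, image_width, image_height)
    return f"{class_index} " + " ".join(f"{v:.6f}" for v in values)


# One 50x80 pixel box in a 640x480 image, class index 0.
line = yolo_label_line(0, [100, 200, 50, 80], 640, 480)
```

Each image then gets one text file with one such line per object. Many tools can export YOLO format directly, in which case this step is unnecessary.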

Model Parameters

Epochs: Anywhere from 100 to 300 epochs is a good starting point. Generally it takes a decent number of epochs for a model to start to perform anywhere near well.
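Rather than committing to a fixed epoch count, a common option is to set a generous upper limit and stop once the validation score stops improving. The sketch below shows that idea (patience-based early stopping) in plain Python. It is not taken from this repository; the function name and the patience value are assumptions for illustration.

```python
def best_epoch_with_patience(val_scores, patience=20):
    """Return (stop_epoch, best_epoch) for a list of per-epoch validation scores.

    Higher scores are better (e.g. mAP). Training stops once `patience`
    epochs pass without a new best score.
    """
    best_score = float("-inf")
    best_epoch = -1
    for epoch, score in enumerate(val_scores):
        if score > best_score:
            best_score, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    # Ran out of epochs before patience was exhausted.
    return len(val_scores) - 1, best_epoch


scores = [0.10, 0.25, 0.40, 0.41, 0.39, 0.40, 0.38]
stop_epoch, best_epoch = best_epoch_with_patience(scores, patience=3)
```

With a small patience the run stops soon after the score plateaus. Because models can take many epochs to start performing well, a patience that is too small risks stopping before that happens.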

Environments

Technical Requirements

While you can use CPUs for deep learning, you really should only use devices with CUDA capable GPUs.

For a local device, a good build would be:

OS: Windows 11 + WSL2 or Ubuntu (e.g. Lambda Stack)
GPU: RTX 3080 or better
RAM: 64 GB or more
SSD: largest NVMe drive you can afford
CPU: fast Intel i9 or AMD Threadripper

The main priority is to have a very powerful GPU with a lot of CUDA cores and a large amount of fast vRAM. Your GPU's speed and size will determine the speed and complexity of the models that you can train, and will also control the batch size that you can work with.

You also will want to have a decent amount of fast RAM. RAM is very important for high throughput I/O operations, such as when your model is referencing numerous images and annotations. Many models also allow you to cache your images and annotations on RAM to accelerate training/fine tuning.
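To get a feel for whether a dataset can be cached in RAM, a back-of-the-envelope estimate is often enough. The sketch below assumes images are held decoded as 8-bit RGB at the training resolution, which is only an approximation of what any given framework does. The function names and the 50% headroom figure are assumptions.

```python
def cache_size_gb(num_images, width, height, channels=3, bytes_per_value=1):
    """Estimate RAM needed to hold decoded images, in gigabytes (10**9 bytes)."""
    return num_images * width * height * channels * bytes_per_value / 1e9


def fits_in_ram(num_images, width, height, ram_gb, headroom=0.5):
    """True if the cache uses at most `headroom` of total RAM."""
    return cache_size_gb(num_images, width, height) <= ram_gb * headroom


# 10,000 images at 640x640 RGB: 10,000 * 640 * 640 * 3 bytes = 12.288 GB
needed = cache_size_gb(10_000, 640, 640)
```

Under these assumptions, 10,000 images at 640x640 need roughly 12.3 GB, which is a tight fit on a 16 GB machine but comfortable with 64 GB.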

A large and fast SSD (e.g. NVME) can be a huge benefit, allowing faster I/O processes and supporting a high speed paging file if you do not have sufficient RAM. Caching on a fast SSD can also be a decent alternative if your RAM is not sufficient for caching your data.

For CPUs, something with 4+ cores is usually good, as long as it has enough PCIe lanes to support both a PCIe GPU and a PCIe NVMe drive.

Locally

You will need a device with a CUDA GPU.
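One quick way to confirm that an NVIDIA GPU and driver are visible is to query `nvidia-smi`. The sketch below wraps that in Python using only the standard library. The function names are hypothetical, and it assumes the query output has lines shaped like `<name>, <N> MiB`.

```python
import shutil
import subprocess


def parse_gpu_csv(text):
    """Parse `name, memory.total` CSV lines into (name, MiB) tuples."""
    gpus = []
    for line in text.strip().splitlines():
        # Split on the last comma so GPU names containing commas still work.
        name, memory = (part.strip() for part in line.rsplit(",", 1))
        gpus.append((name, int(memory.split()[0])))
    return gpus


def list_gpus():
    """Return detected NVIDIA GPUs, or an empty list if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
        capture_output=True, text=True, check=False,
    )
    return parse_gpu_csv(result.stdout) if result.returncode == 0 else []
```

An empty list from `list_gpus()` usually means the driver is missing or the GPU is not an NVIDIA device. It does not by itself confirm that your deep learning framework was installed with CUDA support.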

Current testing has been on a desktop with:

OS: Windows 10
GPU: RTX 2070 Super, 8 GB VRAM
RAM: 16 GB DDR4
CPU: Ryzen 7 5800X, 3.8 GHz, 8-core
SSD: 1 TB NVMe

This desktop is approximately 2x faster than a GCP Tesla K80 VM.

On macOS/Linux

conda create -n finetune python=3.9

conda activate finetune

pip install -r requirements.txt

On Windows

conda create -n finetune python=3.9

conda activate finetune

pip install -r requirements.txt

Docker

Recommended OS: Ubuntu/Debian (native or via WSL2). This assumes you are using a device with CUDA.

Documentation: Nvidia Docker and YOLOv5.

Update your NVIDIA drivers.
Set up Docker

curl https://get.docker.com | sh \
  && sudo systemctl --now enable docker

Set up Nvidia Container Toolkit

distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
      && curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
      && curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
            sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
            sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

Install nvidia-docker2

sudo apt-get update

sudo apt-get install -y nvidia-docker2

Restart docker

sudo systemctl restart docker

Test if the installation worked

sudo docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi

Set up YOLO

sudo docker pull ultralytics/yolov5:latest

Run the container, giving it access to your GPU(s).

You can also mount local files with -v "$(pwd)"/datasets:/usr/src/datasets.

sudo docker run --ipc=host -it --gpus all ultralytics/yolov5:latest

Once the container is up and running and you have your data mounted, you can then use all the YOLOv5 functions as normal.
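The mount flag and the run command above can be combined into a single invocation. The sketch below assembles that command in Python so the host path is resolved to an absolute path first. The `yolov5_docker_command` helper is hypothetical, while the flags and the container path are the ones shown above.

```python
import shlex
from pathlib import Path


def yolov5_docker_command(dataset_dir=None, image="ultralytics/yolov5:latest"):
    """Build the `docker run` argument list, optionally mounting a dataset folder."""
    command = ["sudo", "docker", "run", "--ipc=host", "-it", "--gpus", "all"]
    if dataset_dir is not None:
        host_path = Path(dataset_dir).resolve()
        # host_path:container_path, matching the mount shown above
        command += ["-v", f"{host_path}:/usr/src/datasets"]
    command.append(image)
    return command


# Print a copy-pasteable command line with a local "datasets" folder mounted.
print(shlex.join(yolov5_docker_command("datasets")))
```

Printing the command instead of running it keeps the sketch safe to try; paste the output into a terminal, or pass the list to `subprocess.run` if you prefer to launch the container from Python.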

HPC

TODO

Google Colab

Google Colab provides up to 12 hours per day of GPU compute on the free tier, which makes it one of the easiest ways to get access to low-cost compute.

There are two major drawbacks of Google Colab:

You have a maximum of 12 hours on the free tier.
You must have your computer open and running.

Data Logging

Weights and Biases Logging

Use Weights & Biases (W&B) for data logging and model pipeline version control.

Grid

TODO: Show how to set up Grid project for this.

Examples

Camera Traps

Object Detection using YOLO

Context

Camera traps are used to capture images of wildlife, with many applications in conservation biology and ecology. The New South Wales National Parks Service has been running a long term wildlife monitoring program called Wildcount. For this program they collected millions of images over several years from sites all over New South Wales, and recorded the species of animals found in these images.

The original data format is a series of directories structured like:

year/set_of_10_sites/site/subsite

Each of the final subsite folders contains all of the images captured for that location.
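A first step with this layout is usually to build an index of which images belong to which subsite. A minimal sketch, assuming the four directory levels described above and that images sit directly inside each subsite folder. The `index_images` helper and the list of image extensions are assumptions, not code from this repository.

```python
from collections import defaultdict
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}  # assumed image types


def index_images(root):
    """Group image paths by (year, site_set, site, subsite)."""
    root = Path(root)
    index = defaultdict(list)
    # Four directory levels, then the image files themselves.
    for path in sorted(root.glob("*/*/*/*/*")):
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES:
            year, site_set, site, subsite = path.relative_to(root).parts[:4]
            index[(year, site_set, site, subsite)].append(path)
    return dict(index)
```

The resulting dictionary makes it straightforward to split data by site or by year, for example to hold out whole sites for validation.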

Google Colab

Notebook on Colab.

Weed-AI Integration

TODO: Google Colab showing how to pull a dataset from weed-ai and fine tune segmentation.


Sydney-Informatics-Hub/computer-vision-fine-tuning
