# Training & Validation notebook

This notebook walks through our training and validation scripts.

### Setting up the environment
If you have not already created a virtual environment, create one with `venv`; otherwise move on to the next part. Skip this step if you are running on Google Colab.

In [None]:
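# !python3 -m venv venv
# !source venv/bin/activate
# Note: each `!` command runs in its own subshell, so activating the environment
# here does not change the notebook kernel; activate it in a terminal before launching Jupyter.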
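
To confirm which environment the notebook kernel is actually using, a small check like the following may help. It is a minimal sketch: the helper names `in_virtual_env` and `in_colab` are hypothetical and not part of the repository.

```python
import sys


def in_virtual_env() -> bool:
    """Return True when the interpreter runs inside a venv/virtualenv."""
    # Inside a venv, sys.prefix points at the environment while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


def in_colab() -> bool:
    """Return True when the google.colab module is already loaded (as on Colab)."""
    return "google.colab" in sys.modules


print("virtual environment:", in_virtual_env())
print("Google Colab:", in_colab())
```

If both values are `False`, packages installed in the next step go into the base Python installation.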

### Cloning the repo
Clone the repo and install the required packages defined in `requirements.txt`.

In [1]:
%%capture
!git clone https://github.com/arnavrneo/torchFlow.git
%cd torchFlow
!pip install -r requirements.txt

Run the following shell script (skip this step if the data is already arranged). It will:
- download the `yolo` and `onnx` model checkpoints to the `models` directory;
- download the tiled datasets into their respective directories.
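
One way to decide whether the download step can be skipped is to check which directories already exist. The sketch below is an assumption, not the repository's own logic: the directory names are taken from paths that appear in this notebook's logs, `dataset-256` is assumed by analogy with `dataset-512`, and `missing_dirs` is a hypothetical helper.

```python
from pathlib import Path

# Names taken from paths seen in this notebook's logs;
# `dataset-256` is an assumption by analogy with `dataset-512`.
EXPECTED_DIRS = ["models", "dataset-256", "dataset-512", "dataset-1280", "dataset"]


def missing_dirs(root, expected=EXPECTED_DIRS):
    """Return the expected directory names that do not exist under root."""
    root = Path(root)
    return [name for name in expected if not (root / name).is_dir()]


print(missing_dirs("."))
```

An empty list suggests the data is already arranged; anything else lists what the script still needs to fetch.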

In [2]:
%%capture
!./get-data.sh

In [3]:
!python config/config.py # for setting up the dataset directories

## Training

We trained the model on progressively larger tiles of the dataset:
- 256 x 256
- 512 x 512
- 1280 x 1280

Each stage starts from the checkpoint produced by the previous stage. The final stage trains on the original, untiled dataset at 3200 x 2600.
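
The tiling itself is done ahead of time (the tiled datasets are downloaded above), and the repository's tiling code is not shown in this notebook. As an illustration only, the sketch below computes tile boxes under two assumptions: tiles do not overlap, and tiles on the right and bottom edges are clipped to the image boundary. The function name `tile_boxes` is hypothetical.

```python
def tile_boxes(width, height, tile):
    """Return (left, top, right, bottom) boxes that cover a width x height image.

    Assumes non-overlapping tiles; edge tiles are clipped to the image boundary.
    """
    if tile <= 0:
        raise ValueError("tile must be positive")
    boxes = []
    for top in range(0, height, tile):
        for left in range(0, width, tile):
            right = min(left + tile, width)
            bottom = min(top + tile, height)
            boxes.append((left, top, right, bottom))
    return boxes


# A 512 x 512 image splits into four 256 x 256 tiles.
print(len(tile_boxes(512, 512, 256)))
```

Smaller tiles yield more training images per original image, which is consistent with the larger image counts reported for the smaller tile sizes in the logs below.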

The base model used: `yolov8l.pt`.

To replicate the training process, change the following between stages:
- the image size in the `train-config.yaml` file to match the tile size (leave the other parameters as they are);
![epochs_sizes.png](assets/epochs_sizes.png)


- the dataset directory path in the `dataset.yaml` file, i.e.

![dataset-yaml.png](assets/dataset-yaml.png)

- the model checkpoint passed to `train.py`.
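
The checkpoint chaining between stages can be sketched as follows. This is an illustration under assumptions: the run directory names (`train`, `train2`, ...) are inferred from the log paths in this notebook, and `run_dir` and `stage_commands` are hypothetical helpers. The config files still have to be edited by hand between stages.

```python
def run_dir(index):
    """Name of the run directory for the index-th training run (1-based)."""
    # Inferred from the logs: the first run is `train`, later runs are `train2`, `train3`, ...
    return "train" if index == 1 else f"train{index}"


def stage_commands(base_model, n_stages, runs_root="runs/detect"):
    """Build one train command per stage, feeding each stage the previous best.pt."""
    commands = []
    model = base_model
    for i in range(1, n_stages + 1):
        commands.append(f"python train.py -m {model}")
        model = f"{runs_root}/{run_dir(i)}/weights/best.pt"
    return commands


for cmd in stage_commands("yolov8l.pt", 3):
    print(cmd)
```

The paths here are relative to the repository root; the cells below use the equivalent absolute paths.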

The arguments:
- `-m`: model path
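
For readers unfamiliar with command-line flags, a script could accept `-m` roughly as shown below. This is a sketch, not the actual contents of `train.py`; the long option `--model` and the function name `parse_args` are assumptions.

```python
import argparse


def parse_args(argv=None):
    """Parse the model path from the command line (hypothetical sketch)."""
    parser = argparse.ArgumentParser(description="Train from a model checkpoint")
    # `-m` is the flag documented above; `--model` is an assumed long form.
    parser.add_argument("-m", "--model", required=True,
                        help="path to the model checkpoint")
    return parser.parse_args(argv)


args = parse_args(["-m", "yolov8l.pt"])
print(args.model)
```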

### 256 x 256 Training

In [6]:
!python train.py -m yolov8l.pt

       1/40      1.71G   0.008658      1.198      1.698          1        256: 100% 365/365 [00:34<00:00, 10.63it/s]

                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100% 87/87 [00:05<00:00, 15.75it/s]

                   all        347        392     0.0804      0.378     0.0517      0.021



      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size

       2/40      1.63G    0.01017      1.143      1.984          1        256: 100% 365/365 [00:30<00:00, 11.89it/s]

                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100% 87/87 [00:04<00:00, 20.25it/s]

                   all        347        392     0.0027     0.0893    0.00151   0.000357



      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size

       3/40      1.61G   0.009945      1.092      1.955          1        256: 100% 365/365 [00:30<00:00, 12.10it/s]

                 Class     Images  Instances      

### 512 x 512 Training

In [7]:
!python train.py -m /content/torchFlow/runs/detect/train/weights/best.pt

[34m[1mval: [0mScanning /content/torchFlow/dataset-512/labels/test... 224 images, 0 backgrounds, 0 corrupt: 100% 224/224 [00:00<00:00, 402.08it/s]

[34m[1mval: [0mNew cache created: /content/torchFlow/dataset-512/labels/test.cache

Plotting labels to /content/torchFlow/runs/detect/train2/labels.jpg... 

[34m[1moptimizer:[0m AdamW(lr=0.002, momentum=0.9) with parameter groups 97 weight(decay=0.0), 104 weight(decay=0.0005), 103 bias(decay=0.0)

Image sizes 512 train, 512 val

Using 2 dataloader workers

Logging results to [1m/content/torchFlow/runs/detect/train2[0m

Starting training for 30 epochs...



      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size

       1/30      2.12G   0.007438     0.8013      1.538          2        512: 100% 230/230 [00:27<00:00,  8.33it/s]

                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100% 56/56 [00:04<00:00, 13.31it/s]

                   all        224        294      0.681 

### 1280 x 1280 Training

In [10]:
!python train.py -m /content/torchFlow/runs/detect/train2/weights/best.pt

Transferred 595/595 items from pretrained weights

[34m[1mTensorBoard: [0mStart with 'tensorboard --logdir /content/torchFlow/runs/detect/train3', view at http://localhost:6006/

[34m[1mAMP: [0mrunning Automatic Mixed Precision (AMP) checks with YOLOv8n...

[34m[1mAMP: [0mchecks passed ✅


[34m[1mtrain: [0mScanning /content/torchFlow/dataset-1280/labels/train.cache... 239 images, 0 backgrounds, 0 corrupt: 100% 239/239 [00:00<?, ?it/s]

[34m[1malbumentations: [0mBlur(p=0.01, blur_limit=(3, 7)), MedianBlur(p=0.01, blur_limit=(3, 7)), ToGray(p=0.01), CLAHE(p=0.01, clip_limit=(1, 4.0), tile_grid_size=(8, 8)), RandomRotate90(p=0.75)

[34m[1mval: [0mScanning /content/torchFlow/dataset-1280/labels/test.cache... 125 images, 0 backgrounds, 0 corrupt: 100% 125/125 [00:00<?, ?it/s]

Plotting labels to /content/torchFlow/runs/detect/train3/labels.jpg... 

[34m[1moptimizer:[0m AdamW(lr=0.002, momentum=0.9) with parameter groups 97 weight(decay=0.0), 104 weight(decay=0.0005), 10

### 3232 x 2432 Training

In [5]:
import os
os.environ['WANDB_DISABLED'] = 'true'

In [27]:
!python train.py -m /kaggle/working/torchFlow/models/torchFlow-ckpt.pt

Running on: cuda
New https://pypi.org/project/ultralytics/8.0.128 available 😃 Update with 'pip install -U ultralytics'
Ultralytics YOLOv8.0.121 🚀 Python-3.10.10 torch-2.0.0 CUDA:0 (Tesla P100-PCIE-16GB, 16281MiB)
[34m[1myolo/engine/trainer: [0mtask=detect, mode=train, model=None, data=config/dataset.yaml, epochs=60, patience=50, batch=1, imgsz=3200, save=True, save_period=-1, cache=False, device=0, workers=4, project=None, name=None, exist_ok=False, pretrained=True, optimizer=AdamW, verbose=True, seed=0, deterministic=True, single_cls=False, rect=False, cos_lr=False, close_mosaic=0, resume=False, amp=True, fraction=1.0, profile=False, overlap_mask=True, mask_ratio=4, dropout=0.0, val=True, split=val, save_json=True, save_hybrid=False, conf=0.15, iou=0.3, max_det=300, half=False, dnn=False, plots=True, source=None, show=False, save_txt=False, save_conf=False, save_crop=False, show_labels=True, show_conf=True, vid_stride=1, line_width=None, visualize=False, augment=False, agnostic_nms

## Validation

Validation runs automatically during training. For manual validation, we provide the `val.py` script.
- We run validation at an image size of `3232`. This value can be changed in `val-config.yaml`.

The arguments:
- `-m`: model path
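
When changing the validation size, it can help to pick a value that is a multiple of the model's maximum stride. The sketch below assumes a stride of 32, which is typical for YOLOv8 detection models but is not stated in this notebook; `round_up_to_stride` is a hypothetical helper.

```python
import math


def round_up_to_stride(size, stride=32):
    """Round size up to the nearest multiple of stride (stride=32 is an assumption)."""
    return math.ceil(size / stride) * stride


# 3232 is already a multiple of 32 (32 * 101), so it is returned unchanged.
print(round_up_to_stride(3232))
```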

In [24]:
!python val.py -m /kaggle/working/torchFlow/models/torchFlow-ckpt.pt

Running on: cuda
caused by: ['/opt/conda/lib/python3.10/site-packages/tensorflow_io/python/ops/libtensorflow_io_plugins.so: undefined symbol: _ZN3tsl6StatusC1EN10tensorflow5error4CodeESt17basic_string_viewIcSt11char_traitsIcEENS_14SourceLocationE']
caused by: ['/opt/conda/lib/python3.10/site-packages/tensorflow_io/python/ops/libtensorflow_io.so: undefined symbol: _ZTVN10tensorflow13GcsFileSystemE']
Ultralytics YOLOv8.0.121 🚀 Python-3.10.10 torch-2.0.0 CUDA:0 (Tesla P100-PCIE-16GB, 16281MiB)
Model summary (fused): 268 layers, 43607379 parameters, 0 gradients
[34m[1mval: [0mScanning /kaggle/working/torchFlow/dataset/labels/test.cache... 30 images, [0m
                 Class     Images  Instances      Box(P          R      mAP50  m
                   all         30        292      0.703      0.696      0.737      0.438
Speed: 3.6ms preprocess, 451.6ms inference, 0.0ms loss, 2.3ms postprocess per image
Saving /kaggle/working/torchFlow/runs/detect/val/predictions.json...
Results saved t
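
To compare validation runs programmatically, the `all` summary row printed above can be parsed into named fields. This is a minimal sketch: the field names follow the column header in the log, the parser assumes the class name is a single token, and `parse_summary_row` is a hypothetical helper.

```python
def parse_summary_row(line):
    """Parse a validation summary row into a dict of named metrics.

    Assumes seven whitespace-separated fields and a single-token class name.
    """
    cls, images, instances, precision, recall, map50, map50_95 = line.split()
    return {
        "class": cls,
        "images": int(images),
        "instances": int(instances),
        "precision": float(precision),
        "recall": float(recall),
        "mAP50": float(map50),
        "mAP50-95": float(map50_95),
    }


row = "                   all         30        292      0.703      0.696      0.737      0.438"
print(parse_summary_row(row))
```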