# RetinaNet for Object Detection

This notebook documents the training done on `RetinaNet` for datasets sourced by cerv.AI team. The said datasets are as follows:
1. https://universe.roboflow.com/cervitester-colposcopy-yihp4/colposcopy
2. https://universe.roboflow.com/cervitester-colposcopy-yihp4/acetic_acid
3. https://universe.roboflow.com/madhura/merged-acetic-acid/dataset/3

## Additional Notes
1. Due to the balanced nature of the datasets, this training provides **no data augmentations**. It is only when the training with the other datasets that augmentations will be applied.
    - However, RetinaNet is excellent for imbalanced datasets due to the **Focal Loss** function.

2. The datasets are all from **IARC Cervical Image Cancer Bank**

## References 
1. [Fine-Tuning an Object Detection Model](https://pytorch.org/tutorials/intermediate/torchvision_tutorial.html)
2. [Focal Loss](https://paperswithcode.com/method/focal-loss)
3. [RetinaNet (Theory)](https://paperswithcode.com/method/retinanet)
4. [RetinaNet (Pytorch implementation)](https://pytorch.org/vision/main/models/generated/torchvision.models.detection.retinanet_resnet50_fpn_v2.html)
5. [Paper on Focal Loss](https://arxiv.org/abs/1708.02002)
6. [Blog 1 on RetinaNet](https://medium.com/@14prakash/the-intuition-behind-retinanet-eb636755607d)
7. [Blog 2 on RetinaNet](https://blog.zenggyu.com/en/post/2018-12-05/retinanet-explained-and-demystified/)
8. [Blog 3 on RetinaNet](https://towardsdatascience.com/review-retinanet-focal-loss-object-detection-38fba6afabe4)
9. [Blog 4 on RetinaNet](https://analyticsindiamag.com/what-is-retinanet-ssd-focal-loss/)
10. [Blog 5 on RetinaNet](https://towardsdatascience.com/object-detection-on-aerial-imagery-using-retinanet-626130ba2203)


## Dataset Class and DataLoading

The dataset class is based on the typical COCO format. Instead of storing the image on dedicated arrays, which results in large space complexity, the dataset instead accesses the image on the `load_image` method through the root folder, set name (either as train or test), and the image IDs.

The dataloader stores the methods for each images. For each iteration of the pipline, the dataloader appleis the necessary methods to each dataset itself. You can refer to the training pipeline to see where this is done.

In [None]:
import os
import math
import json
import collections
from PIL import Image
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms
import matplotlib.pyplot as plt
import matplotlib.patches as patches

import torch
import torch.optim as optim
from torch.utils.tensorboard import SummaryWriter
from torch.optim.lr_scheduler import LambdaLR

import numpy as np
from tqdm import tqdm
from pycocotools.coco import COCO
import skimage

#* Ignore warnings
import warnings
warnings.filterwarnings('ignore')

#* Dataset class
class CocoDataset(Dataset):
    """Coco dataset."""

    def __init__(self, root_dir, set_name='train', transform=None):
        """
        Args:
            root_dir (string): COCO directory.
            transform (callable, optional): Optional transform to be applied
                on a sample.
        """
        self.root_dir = root_dir
        self.set_name = set_name
        self.transform = transform

        self.coco      = COCO(os.path.join(self.root_dir, self.set_name,'_annotations.coco.json'))
        self.image_ids = self.coco.getImgIds()

        self.load_classes()

    def load_classes(self):
        # load class names (name -> label)
        categories = self.coco.loadCats(self.coco.getCatIds())
        categories.sort(key=lambda x: x['id'])

        self.classes             = {}
        self.coco_labels         = {}
        self.coco_labels_inverse = {}
        for c in categories:
            self.coco_labels[len(self.classes)] = c['id']
            self.coco_labels_inverse[c['id']] = len(self.classes)
            self.classes[c['name']] = len(self.classes)

        # also load the reverse (label -> name)
        self.labels = {}
        for key, value in self.classes.items():
            self.labels[value] = key

    def __len__(self):
        return len(self.image_ids)

    def __getitem__(self, idx):

        img = self.load_image(idx)
        annot = self.load_annotations(idx)
        sample = {'img': img, 'annot': annot}
        if self.transform:
            sample = self.transform(sample)

        return sample

    def load_image(self, image_index):
        image_info = self.coco.loadImgs(self.image_ids[image_index])[0]
        path       = os.path.join(self.root_dir,self.set_name, image_info['file_name'])
        img = skimage.io.imread(path)

        if len(img.shape) == 2:
            img = skimage.color.gray2rgb(img)

        return img.astype(np.float32)/255.0

    def load_annotations(self, image_index):
        # get ground truth annotations
        annotations_ids = self.coco.getAnnIds(imgIds=self.image_ids[image_index], iscrowd=False)
        annotations     = np.zeros((0, 5))

        # Catch function in case an image has no annotations
        if len(annotations_ids) == 0:
            return annotations

        # parse annotations
        coco_annotations = self.coco.loadAnns(annotations_ids)
        for idx, a in enumerate(coco_annotations):

            # some annotations have basically no width / height, skip them
            if a['bbox'][2] < 1 or a['bbox'][3] < 1:
                continue

            annotation        = np.zeros((1, 5))
            annotation[0, :4] = a['bbox']
            annotation[0, 4]  = self.coco_label_to_label(a['category_id'])
            annotations       = np.append(annotations, annotation, axis=0)

        # transform from [x, y, w, h] to [x1, y1, x2, y2]
        annotations[:, 2] = annotations[:, 0] + annotations[:, 2]
        annotations[:, 3] = annotations[:, 1] + annotations[:, 3]

        return annotations

    def coco_label_to_label(self, coco_label):
        return self.coco_labels_inverse[coco_label]


    def label_to_coco_label(self, label):
        return self.coco_labels[label]

    def image_aspect_ratio(self, image_index):
        image = self.coco.loadImgs(self.image_ids[image_index])[0]
        return float(image['width']) / float(image['height'])



## Loading of Data to DataLoader

The `DataLoader` contains a `collater` and `sampler`. 

- The `collater` merges a list of samples to form a mini-batch of Tensor(s), which is useful for batched loading.
    - In simpler (but not exact) terms, it returns the images, annotations, and additional padding if defined into tensors
    - These can be therefore, easily learned by the model and parallelized using batched nodes
- The `sampler` defines the strategy to draw samples from the dataset. If specified, shuffle must not be specified.
    - In some cases, ussing `shuffle` is simpler, but a dedicated `AspectRatioBasedSampler` function is contained
    - It simply uses the `random` library's shuffle function. In reality, the simple `shuffle=True` also works.

- Use the `transforms` library of torch to apply your own transformations in `data_transform`

In [None]:
import random
from torch.utils.data.sampler import Sampler

device = "cuda"
root_dirs = ['datasets/merged']

#* Custom collater for the dataloader
def collater(data):

    imgs = [s['img'] for s in data]
    annots = [s['annot'] for s in data]
    scales = [s['scale'] for s in data]
        
    widths = [int(s.shape[0]) for s in imgs]
    heights = [int(s.shape[1]) for s in imgs]
    batch_size = len(imgs)

    max_width = np.array(widths).max()
    max_height = np.array(heights).max()

    padded_imgs = torch.zeros(batch_size, max_width, max_height, 3)

    for i in range(batch_size):
        img = imgs[i]
        padded_imgs[i, :int(img.shape[0]), :int(img.shape[1]), :] = img

    max_num_annots = max(annot.shape[0] for annot in annots)
    
    if max_num_annots > 0:

        annot_padded = torch.ones((len(annots), max_num_annots, 5)) * -1

        if max_num_annots > 0:
            for idx, annot in enumerate(annots):
                #print(annot.shape)
                if annot.shape[0] > 0:
                    annot_padded[idx, :annot.shape[0], :] = annot
    else:
        annot_padded = torch.ones((len(annots), 1, 5)) * -1


    padded_imgs = padded_imgs.permute(0, 3, 1, 2)

    return {'img': padded_imgs, 'annot': annot_padded, 'scale': scales}

#* sampling method
class AspectRatioBasedSampler(Sampler):

    def __init__(self, data_source, batch_size, drop_last):
        self.data_source = data_source
        self.batch_size = batch_size
        self.drop_last = drop_last
        self.groups = self.group_images()

    def __iter__(self):
        random.shuffle(self.groups)
        for group in self.groups:
            yield group

    def __len__(self):
        if self.drop_last:
            return len(self.data_source) // self.batch_size
        else:
            return (len(self.data_source) + self.batch_size - 1) // self.batch_size

    def group_images(self):
        # determine the order of the images
        order = list(range(len(self.data_source)))
        order.sort(key=lambda x: self.data_source.image_aspect_ratio(x))

        # divide into groups, one group = one batch
        return [[order[x % len(order)] for x in range(i, i + self.batch_size)] for i in range(0, len(order), self.batch_size)]

#* Data Transformation (You can define your own here)
data_transform = {
    "train": transforms.Compose([transforms.RandomResizedCrop(224),
                                    transforms.RandomHorizontalFlip(),
                                    transforms.ToTensor(),
                                    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])]),
    "test": transforms.Compose([transforms.Resize(256),
                                transforms.CenterCrop(224),
                                transforms.ToTensor(),
                                transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])])}

dataset_train = CocoDataset(root_dir="/datasets",set_name="train",transform=data_transform['train']) 
sampler = AspectRatioBasedSampler(dataset_train, batch_size=2, drop_last=False)
dataloader_train = DataLoader(dataset_train, num_workers=3, collate_fn=collater, batch_sampler=sampler)

Using 8 dataloader workers every process


## Loss Function and Evalutation Metrics

## Focal Loss
Formally, it is given by the following:
$$FL_{p_t} = -(1-p)^{\gamma}\log(p_t)$$
where $\gamma$ is a tunable parameter. It basically adds a loss factor $-(1-p)^{\gamma}$ to the standard cross entropy criterion. 
- Setting $\gamma > 0$ reduces the relative loss for well-classified examples ($p_t > 5), emphasizing the hard, misclassified examples.

## Evaluation Metrics
### Intersection over Union (IoU)
The evaluation metric defined in the [docs](https://pytorch.org/ignite/generated/ignite.metrics.IoU.html). Here however, it is hardcoded so as to make it more appropriate to the task at hand.

## Some Classes and Functions
- `gather` : Gathers values along an axis specified by *dim*.


In [36]:
import torch.nn as nn
import torch.nn.functional as F
from torch.autograd import Variable
# from torchvision.models.detection import retinanet_resnet50_fpn_v2, RetinaNet_ResNet50_FPN_V2_Weights
# from torchvision.models import ResNet50_Weights
from torchinfo import summary
import torch.utils.model_zoo as model_zoo
from pycocotools.cocoeval import COCOeval

#* Intersection over Union
def calc_iou(a, b):
    """
    Defines the Intersection over Union (IoU) metric which utilizes the confusion matrix.
    """

    area = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])

    iw = torch.min(torch.unsqueeze(a[:, 2], dim=1), b[:, 2]) - torch.max(torch.unsqueeze(a[:, 0], 1), b[:, 0])
    ih = torch.min(torch.unsqueeze(a[:, 3], dim=1), b[:, 3]) - torch.max(torch.unsqueeze(a[:, 1], 1), b[:, 1])

    iw = torch.clamp(iw, min=0)
    ih = torch.clamp(ih, min=0)

    ua = torch.unsqueeze((a[:, 2] - a[:, 0]) * (a[:, 3] - a[:, 1]), dim=1) + area - iw * ih

    ua = torch.clamp(ua, min=1e-8)

    intersection = iw * ih

    IoU = intersection / ua

    return IoU


#* Define the Focal Loss function
class FocalLoss(nn.Module):
    """
    Defines the Focal Loss criterion. See the documentation in the notebook.
    """
    def __init__(self, gamma=2, alpha = 0.25, size_average = True): #For default values: https://pytorch.org/vision/main/generated/torchvision.ops.sigmoid_focal_loss.html
        super(FocalLoss, self).__init__()
        self.gamma = gamma
        self.alpha = alpha
        if isinstance(alpha,(float,int)): self.alpha = torch.Tensor([alpha,1-alpha]) # store the p and 1-p
        if isinstance(alpha,list): self.alpha = torch.Tensor(alpha)
        self.size_average = size_average
    
    def forward(self,input,target):
        if input.dim() > 2:
            # Applying transmutations to data
            input = input.view(input.size(0), input.size(1),-1) # N,C,H,W => N,C,H*W
            input = input.transpose(1,2) #N,H*W,C
            input = input.contiguous().view(-1,input.size(2)) # N,H*W, C => N*H*W,C
        target = target.view(-1,1)
        
        # Logits calculation for probability
        logpt = F.log_softmax(input)
        logpt = logpt.gather(1,target)
        logpt = logpt.view(-1) # Convert to row vec
        pt = logpt.data.exp()

        # if alpha is present
        if self.alpha is not None:
            if self.alpha.type() != input.data.type():
                self.alpha = self.alpha.type_as(input.data) #ensures the alpha is the same data type as input (Tensor)
            at = self.alpha.gather()
            logpt = logpt * at

        loss = -1 * (1-pt)**self.gamma * logpt
        if self.size_average: return loss.mean()
        else: return loss.sum()
        
def evaluate_coco(dataset, model, threshold=0.05):
    
    model.eval()
    
    with torch.no_grad():

        # start collecting results
        results = []
        image_ids = []

        for index in range(len(dataset)):
            image, target, scale = dataset[index]

            # run network
            if torch.cuda.is_available():
                scores, labels, boxes = model(image.cuda().float().unsqueeze(0))
            else:
                scores, labels, boxes = model(image.float().unsqueeze(0))
            scores = scores.cpu()
            labels = labels.cpu()
            boxes  = boxes.cpu()

            # correct boxes for image scale
            boxes /= scale

            if boxes.shape[0] > 0:
                # change to (x, y, w, h) (MS COCO standard)
                boxes[:, 2] -= boxes[:, 0]
                boxes[:, 3] -= boxes[:, 1]

                # compute predicted labels and scores
                #for box, score, label in zip(boxes[0], scores[0], labels[0]):
                for box_id in range(boxes.shape[0]):
                    score = float(scores[box_id])
                    label = int(labels[box_id])
                    box = boxes[box_id, :]

                    # scores are sorted, so we can break
                    if score < threshold:
                        break

                    # append detection for each positively labeled class
                    image_result = {
                        'image_id'    : dataset.image_ids[index],
                        'category_id' : dataset.label_to_coco_label(label),
                        'score'       : float(score),
                        'bbox'        : box.tolist(),
                    }

                    # append detection to results
                    results.append(image_result)

            # append image to list of processed images
            image_ids.append(dataset.image_ids[index])

            # print progress
            print('{}/{}'.format(index, len(dataset)), end='\r')

        if not len(results):
            return

        # write output
        json.dump(results, open('{}_bbox_results.json'.format(dataset.set_name), 'w'), indent=4)

        # load results in COCO evaluation tool
        coco_true = dataset.coco
        coco_pred = coco_true.loadRes('{}_bbox_results.json'.format(dataset.set_name))

        # run COCO evaluation
        coco_eval = COCOeval(coco_true, coco_pred, 'bbox')
        coco_eval.params.imgIds = image_ids
        coco_eval.evaluate()
        coco_eval.accumulate()
        coco_eval.summarize()

        model.train()
        # Optional: extract mAP@0.5 (index 1) and mAP@0.5:0.95 (index 0)
        map_50 = coco_eval.stats[1]
        map_5095 = coco_eval.stats[0]
        print(f"mAP@0.5: {map_50:.4f}, mAP@0.5:0.95: {map_5095:.4f}")
        return

## RetinaNet-RS

See the documentation [here](https://paperswithcode.com/method/retinanet-rs). It also has a medium article [here](https://freedium.cfd/https://medium.com/@evertongomede/retinanet-advancing-object-detection-in-computer-vision-719ceb744308).

- Class imbalance has impede training due to overfitting on dominant classes. This is due to traditional loss functions treating all classes equally.

- **Focal loss** addresses this issue by dynamically down-weighing the contribution of well-classified examples while emphasizing the importance of hard-to-classify examples. 
- This is achieved by introducing a **modulating factor** that reduces the loss for well-classified examples and increases the loss for misclassified examples
- For the classification head, a separate model is used as a backbone. To use more efficient resources, we will be utilizing the smaller model **ResNet50** with 50 neural layers.

## Some Important Model Notes (see article [here](https://medium.com/@14prakash/the-intuition-behind-retinanet-eb636755607d))
- **Anchor boxes** are used to generate the region proposals. Previously, selective search and edge boxes weere used. However, this was impossible to generate with standard convolutions that are utilized by most CNNs.

- Each representative region (in the article, it is a 50x50 pixel) is fed to a **regression head** and a **classification head**. Usually used is the ResNet models.
- However, the feature map created after multiple subsampling loses a lot of semantic information at low level, thus unable to detet small objects in images. To solve this, the model uses **Feature Pyramid Networks**

### [Feature Pyramid Networks](https://arxiv.org/abs/1612.03144)
- Though convnets are robust to variance in scale, all the top entries in ImageNet or COCO have used multi-scale testing on featurized image pyramids.

- We have to take images at different sizes say 256 x 256, 300 x 300, 500 x 500 and 800 x 800 etc, calculate feature maps for each of this image and then apply non-maxima supression over all these detected positive anchor boxes. This is a very costly operation and inference times gets high.

- The authors of this paper observed that deep convnet computes a feature hierarchy layer by layer, and with subsampling layers the feature hierarchy has an inherent multi-scale, pyramidal shape. For example, take a Resnet architecture and instead of just using the final feature map as shown in RPN network, take feature maps before every pooling (subsampling) layer. 

- Perform the same operations as for **Region Proposal Network** RPN on each of these feature maps and finally combine them using non-maxima supression. This is the crude way of building the feature pyramid networks. 

- But there are large semantic gaps caused by different depths. The high resolution maps (earlier layers) have low-level features that harm their representational capacity for object detection. To achieve this goal, the authors relayed on a architecture that combines low-resolution, semantically strong features with high-resolution, semantically strong features via top-down pathway and lateral connection.

In [37]:
from torchvision.ops import nms
from retinanet.utils import BasicBlock, Bottleneck, BBoxTransform, ClipBoxes
from retinanet.anchors import Anchors

#* Store the urls of all ResNet models to easily swap to other models
model_urls = {
    'resnet18': 'https://download.pytorch.org/models/resnet18-5c106cde.pth',
    'resnet34': 'https://download.pytorch.org/models/resnet34-333f7ec4.pth',
    'resnet50': 'https://download.pytorch.org/models/resnet50-19c8e357.pth',
    'resnet101': 'https://download.pytorch.org/models/resnet101-5d3b4d8f.pth',
    'resnet152': 'https://download.pytorch.org/models/resnet152-b121ed2d.pth',
}

#* Define the RetinaNet Model architecture: Regression Head, Pyramid Networks, Classification Backbone, etc.
class PyramidFeatures(nn.Module):
    def __init__(self, C3_size, C4_size, C5_size, feature_size=256):
        super(PyramidFeatures, self).__init__()

        # upsample C5 to get P5 from the FPN paper
        self.P5_1 = nn.Conv2d(C5_size, feature_size, kernel_size=1, stride=1, padding=0)
        self.P5_upsampled = nn.Upsample(scale_factor=2, mode='nearest')
        self.P5_2 = nn.Conv2d(feature_size, feature_size, kernel_size=3, stride=1, padding=1)

        # add P5 elementwise to C4
        self.P4_1 = nn.Conv2d(C4_size, feature_size, kernel_size=1, stride=1, padding=0)
        self.P4_upsampled = nn.Upsample(scale_factor=2, mode='nearest')
        self.P4_2 = nn.Conv2d(feature_size, feature_size, kernel_size=3, stride=1, padding=1)

        # add P4 elementwise to C3
        self.P3_1 = nn.Conv2d(C3_size, feature_size, kernel_size=1, stride=1, padding=0)
        self.P3_2 = nn.Conv2d(feature_size, feature_size, kernel_size=3, stride=1, padding=1)

        # "P6 is obtained via a 3x3 stride-2 conv on C5"
        self.P6 = nn.Conv2d(C5_size, feature_size, kernel_size=3, stride=2, padding=1)

        # "P7 is computed by applying ReLU followed by a 3x3 stride-2 conv on P6"
        self.P7_1 = nn.ReLU()
        self.P7_2 = nn.Conv2d(feature_size, feature_size, kernel_size=3, stride=2, padding=1)

    def forward(self, inputs):
        C3, C4, C5 = inputs

        P5_x = self.P5_1(C5)
        P5_upsampled_x = self.P5_upsampled(P5_x)
        P5_x = self.P5_2(P5_x)

        P4_x = self.P4_1(C4)
        P4_x = P5_upsampled_x + P4_x
        P4_upsampled_x = self.P4_upsampled(P4_x)
        P4_x = self.P4_2(P4_x)

        P3_x = self.P3_1(C3)
        P3_x = P3_x + P4_upsampled_x
        P3_x = self.P3_2(P3_x)

        P6_x = self.P6(C5)

        P7_x = self.P7_1(P6_x)
        P7_x = self.P7_2(P7_x)

        return [P3_x, P4_x, P5_x, P6_x, P7_x]


class RegressionModel(nn.Module):
    def __init__(self, num_features_in, num_anchors=9, feature_size=256):
        super(RegressionModel, self).__init__()

        self.conv1 = nn.Conv2d(num_features_in, feature_size, kernel_size=3, padding=1)
        self.act1 = nn.ReLU()

        self.conv2 = nn.Conv2d(feature_size, feature_size, kernel_size=3, padding=1)
        self.act2 = nn.ReLU()

        self.conv3 = nn.Conv2d(feature_size, feature_size, kernel_size=3, padding=1)
        self.act3 = nn.ReLU()

        self.conv4 = nn.Conv2d(feature_size, feature_size, kernel_size=3, padding=1)
        self.act4 = nn.ReLU()

        self.output = nn.Conv2d(feature_size, num_anchors * 4, kernel_size=3, padding=1)

    def forward(self, x):
        out = self.conv1(x)
        out = self.act1(out)

        out = self.conv2(out)
        out = self.act2(out)

        out = self.conv3(out)
        out = self.act3(out)

        out = self.conv4(out)
        out = self.act4(out)

        out = self.output(out)

        # out is Batch_size x Channels x Width x Height, with C = 4*num_anchors
        out = out.permute(0, 2, 3, 1)
        # After self.output, the shape is [B, C, W, H].
        # The model permutes it to [B, W, H, C] so that:
        # For every pixel location on the feature map (W x H), it has all anchor predictions grouped together.
        # Easier to reshape into a flat list of bounding box predictions.

        return out.contiguous().view(out.shape[0], -1, 4)
        
        #After permute: shape = [B, W, H, num_anchors * 4]
        #.view(B, -1, 4) collapses spatial dimensions and anchors into a single list of box predictions:
        #Output shape: [B,W×H×num_anchors,4]
        #Output shape: [B,W×H×num_anchors,4]

        #So you're getting a 3D tensor where:
        #- B is batch size.
        #- Second dim is the total number of anchor boxes.
        #- Last dim is the 4 regression values per anchor.

class ClassificationModel(nn.Module):
    def __init__(self, num_features_in, num_anchors=9, num_classes=80, prior=0.01, feature_size=256):
        super(ClassificationModel, self).__init__()

        self.num_classes = num_classes
        self.num_anchors = num_anchors

        self.conv1 = nn.Conv2d(num_features_in, feature_size, kernel_size=3, padding=1)
        self.act1 = nn.ReLU()

        self.conv2 = nn.Conv2d(feature_size, feature_size, kernel_size=3, padding=1)
        self.act2 = nn.ReLU()

        self.conv3 = nn.Conv2d(feature_size, feature_size, kernel_size=3, padding=1)
        self.act3 = nn.ReLU()

        self.conv4 = nn.Conv2d(feature_size, feature_size, kernel_size=3, padding=1)
        self.act4 = nn.ReLU()

        self.output = nn.Conv2d(feature_size, num_anchors * num_classes, kernel_size=3, padding=1)
        self.output_act = nn.Sigmoid()

    def forward(self, x):
        out = self.conv1(x)
        out = self.act1(out)

        out = self.conv2(out)
        out = self.act2(out)

        out = self.conv3(out)
        out = self.act3(out)

        out = self.conv4(out)
        out = self.act4(out)

        out = self.output(out)
        out = self.output_act(out)

        # out is B x C x W x H, with C = n_classes + n_anchors
        out1 = out.permute(0, 2, 3, 1)

        batch_size, width, height, channels = out1.shape

        out2 = out1.view(batch_size, width, height, self.num_anchors, self.num_classes)

        return out2.contiguous().view(x.shape[0], -1, self.num_classes)


class ResNet(nn.Module):

    def __init__(self, num_classes, block, layers):
        self.inplanes = 64
        super(ResNet, self).__init__()
        self.conv1 = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False)
        self.bn1 = nn.BatchNorm2d(64)
        self.relu = nn.ReLU(inplace=True)
        self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
        self.layer1 = self._make_layer(block, 64, layers[0])
        self.layer2 = self._make_layer(block, 128, layers[1], stride=2)
        self.layer3 = self._make_layer(block, 256, layers[2], stride=2)
        self.layer4 = self._make_layer(block, 512, layers[3], stride=2)

        if block == BasicBlock:
            fpn_sizes = [self.layer2[layers[1] - 1].conv2.out_channels, self.layer3[layers[2] - 1].conv2.out_channels,
                         self.layer4[layers[3] - 1].conv2.out_channels]
        elif block == Bottleneck:
            fpn_sizes = [self.layer2[layers[1] - 1].conv3.out_channels, self.layer3[layers[2] - 1].conv3.out_channels,
                         self.layer4[layers[3] - 1].conv3.out_channels]
        else:
            raise ValueError(f"Block type {block} not understood")

        self.fpn = PyramidFeatures(fpn_sizes[0], fpn_sizes[1], fpn_sizes[2])

        self.regressionModel = RegressionModel(256)
        self.classificationModel = ClassificationModel(256, num_classes=num_classes)

        self.anchors = Anchors()

        self.regressBoxes = BBoxTransform()

        self.clipBoxes = ClipBoxes()

        self.focalLoss = FocalLoss()

        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                n = m.kernel_size[0] * m.kernel_size[1] * m.out_channels
                m.weight.data.normal_(0, math.sqrt(2. / n))
            elif isinstance(m, nn.BatchNorm2d):
                m.weight.data.fill_(1)
                m.bias.data.zero_()

        prior = 0.01

        self.classificationModel.output.weight.data.fill_(0)
        self.classificationModel.output.bias.data.fill_(-math.log((1.0 - prior) / prior))

        self.regressionModel.output.weight.data.fill_(0)
        self.regressionModel.output.bias.data.fill_(0)

        self.freeze_bn()

    def _make_layer(self, block, planes, blocks, stride=1):
        downsample = None
        if stride != 1 or self.inplanes != planes * block.expansion:
            downsample = nn.Sequential(
                nn.Conv2d(self.inplanes, planes * block.expansion,
                          kernel_size=1, stride=stride, bias=False),
                nn.BatchNorm2d(planes * block.expansion),
            )

        layers = [block(self.inplanes, planes, stride, downsample)]
        self.inplanes = planes * block.expansion
        for i in range(1, blocks):
            layers.append(block(self.inplanes, planes))

        return nn.Sequential(*layers)

    def freeze_bn(self):
        '''Freeze BatchNorm layers.'''
        for layer in self.modules():
            if isinstance(layer, nn.BatchNorm2d):
                layer.eval()

    def forward(self, inputs):

        if self.training:
            img_batch, annotations = inputs
        else:
            img_batch = inputs

        x = self.conv1(img_batch)
        x = self.bn1(x)
        x = self.relu(x)
        x = self.maxpool(x)

        x1 = self.layer1(x)
        x2 = self.layer2(x1)
        x3 = self.layer3(x2)
        x4 = self.layer4(x3)

        features = self.fpn([x2, x3, x4])

        regression = torch.cat([self.regressionModel(feature) for feature in features], dim=1)

        classification = torch.cat([self.classificationModel(feature) for feature in features], dim=1)

        anchors = self.anchors(img_batch)

        if self.training:
            return self.focalLoss(classification, regression, anchors, annotations)
        else:
            transformed_anchors = self.regressBoxes(anchors, regression)
            transformed_anchors = self.clipBoxes(transformed_anchors, img_batch)

            finalResult = [[], [], []]

            finalScores = torch.Tensor([])
            finalAnchorBoxesIndexes = torch.Tensor([]).long()
            finalAnchorBoxesCoordinates = torch.Tensor([])

            if torch.cuda.is_available():
                finalScores = finalScores.cuda()
                finalAnchorBoxesIndexes = finalAnchorBoxesIndexes.cuda()
                finalAnchorBoxesCoordinates = finalAnchorBoxesCoordinates.cuda()

            for i in range(classification.shape[2]):
                scores = torch.squeeze(classification[:, :, i])
                scores_over_thresh = (scores > 0.05)
                if scores_over_thresh.sum() == 0:
                    # no boxes to NMS, just continue
                    continue

                scores = scores[scores_over_thresh]
                anchorBoxes = torch.squeeze(transformed_anchors)
                anchorBoxes = anchorBoxes[scores_over_thresh]
                anchors_nms_idx = nms(anchorBoxes, scores, 0.5)

                finalResult[0].extend(scores[anchors_nms_idx])
                finalResult[1].extend(torch.tensor([i] * anchors_nms_idx.shape[0]))
                finalResult[2].extend(anchorBoxes[anchors_nms_idx])

                finalScores = torch.cat((finalScores, scores[anchors_nms_idx]))
                finalAnchorBoxesIndexesValue = torch.tensor([i] * anchors_nms_idx.shape[0])
                if torch.cuda.is_available():
                    finalAnchorBoxesIndexesValue = finalAnchorBoxesIndexesValue.cuda()

                finalAnchorBoxesIndexes = torch.cat((finalAnchorBoxesIndexes, finalAnchorBoxesIndexesValue))
                finalAnchorBoxesCoordinates = torch.cat((finalAnchorBoxesCoordinates, anchorBoxes[anchors_nms_idx]))

            return [finalScores, finalAnchorBoxesIndexes, finalAnchorBoxesCoordinates]


def resnet18(num_classes, pretrained=False, **kwargs):
    """Constructs a ResNet-18 model.
    Args:
        pretrained (bool): If True, returns a model pre-trained on ImageNet
    """
    model = ResNet(num_classes, BasicBlock, [2, 2, 2, 2], **kwargs)
    if pretrained:
        model.load_state_dict(model_zoo.load_url(model_urls['resnet18'], model_dir='.'), strict=False)
    return model


def resnet34(num_classes, pretrained=False, **kwargs):
    """Constructs a ResNet-34 model.
    Args:
        pretrained (bool): If True, returns a model pre-trained on ImageNet
    """
    model = ResNet(num_classes, BasicBlock, [3, 4, 6, 3], **kwargs)
    if pretrained:
        model.load_state_dict(model_zoo.load_url(model_urls['resnet34'], model_dir='.'), strict=False)
    return model


def resnet50(num_classes, pretrained=False, **kwargs):
    """Constructs a ResNet-50 model.
    Args:
        pretrained (bool): If True, returns a model pre-trained on ImageNet
    """
    model = ResNet(num_classes, Bottleneck, [3, 4, 6, 3], **kwargs)
    if pretrained:
        model.load_state_dict(model_zoo.load_url(model_urls['resnet50'], model_dir='.'), strict=False)
    return model


def resnet101(num_classes, pretrained=False, **kwargs):
    """Constructs a ResNet-101 model.
    Args:
        pretrained (bool): If True, returns a model pre-trained on ImageNet
    """
    model = ResNet(num_classes, Bottleneck, [3, 4, 23, 3], **kwargs)
    if pretrained:
        model.load_state_dict(model_zoo.load_url(model_urls['resnet101'], model_dir='.'), strict=False)
    return model


def resnet152(num_classes, pretrained=False, **kwargs):
    """Constructs a ResNet-152 model.
    Args:
        pretrained (bool): If True, returns a model pre-trained on ImageNet
    """
    model = ResNet(num_classes, Bottleneck, [3, 8, 36, 3], **kwargs)
    if pretrained:
        model.load_state_dict(model_zoo.load_url(model_urls['resnet152'], model_dir='.'), strict=False)
    return model

#* Map the model depths to their respective model variants
resnet_fn_map = {
    18: resnet18,
    34: resnet34,
    50: resnet50,
    101: resnet101,
    152: resnet152
}

## Training Pipeline

In [None]:
depth = 18
if depth not in resnet_fn_map:
        raise ValueError(f"Unsupported ResNet depth: {depth}. Choose from {list(resnet_fn_map.keys())}.")
retinanet = resnet_fn_map[depth](num_classes=2,pretrained=True)

#* Optimizers and schedulers
optimizer = torch.optim.Adam(retinanet.parameters(),lr=1e-5)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, patience=3, verbose=True)
loss = FocalLoss()
loss_hist = collections.deque(maxlen=500)

#TODO: Create pipeline for hyperparametertuning of model and optimizers

# Use CUDA GPU
use_gpu = True
if use_gpu:
    if torch.cuda.is_available():
        retinanet = retinanet.cuda()
if torch.cuda.is_available():
    retinanet = torch.nn.DataParallel(retinanet).cuda()
else:
    retinanet = torch.nn.DataParallel(retinanet)

#* Training loop
num_epochs = 1
for epoch in range(num_epochs):
    retinanet.train()
    retinanet.module.freeze_bn()

    epoch_loss = []

    for iter_num, data in enumerate(train_loader):
        try:
            optimizer.zero_grad()
            if torch.cuda.is_available():
                    classification_loss, regression_loss = retinanet([data['img'].cuda().float(), data['annot']])
            else:
                classification_loss, regression_loss = retinanet([data['img'].float(), data['annot']])
                
            classification_loss = classification_loss.mean()
            regression_loss = regression_loss.mean()

            loss = classification_loss + regression_loss

            if bool(loss == 0):
                continue

            loss.backward()

            torch.nn.utils.clip_grad_norm_(retinanet.parameters(), 0.1)

            optimizer.step()

            loss_hist.append(float(loss))

            epoch_loss.append(float(loss))

            print(
                'Epoch: {} | Iteration: {} | Classification loss: {:1.5f} | Regression loss: {:1.5f} | Running loss: {:1.5f}'.format(
                    epoch, iter_num, float(classification_loss), float(regression_loss), np.mean(loss_hist)))

            del classification_loss
            del regression_loss
        except Exception as e:
                print(e)
                continue
    print('Evaluating dataset')
    evaluate_coco(test_dataset, retinanet)
    scheduler.step(np.mean(epoch_loss))
    torch.save(retinanet.module, 'weights/{}_retinanet_{}.pt'.format(retinanet, num_epochs))
retinanet.eval()
torch.save(retinanet,f'model_resnet{depth}.pt')

list indices must be integers or slices, not str
list indices must be integers or slices, not str
list indices must be integers or slices, not str
list indices must be integers or slices, not str
list indices must be integers or slices, not str
list indices must be integers or slices, not str
list indices must be integers or slices, not str
list indices must be integers or slices, not str
list indices must be integers or slices, not str
list indices must be integers or slices, not str
list indices must be integers or slices, not str
list indices must be integers or slices, not str
list indices must be integers or slices, not str
list indices must be integers or slices, not str
list indices must be integers or slices, not str
list indices must be integers or slices, not str
list indices must be integers or slices, not str
list indices must be integers or slices, not str
list indices must be integers or slices, not str
list indices must be integers or slices, not str
list indices must be

AttributeError: 'COCODataset' object has no attribute 'image_ids'

In [None]:
def test_coco(dataset, model, threshold=0.05):
    
    model.eval()
    
    with torch.no_grad():
        for index in range(len(dataset)):
            data = dataset[index]
            print(type(data), len(data))
            
test_coco(test_dataset,retinanet)

<class 'tuple'> 2
<class 'tuple'> 2
<class 'tuple'> 2
<class 'tuple'> 2
<class 'tuple'> 2
<class 'tuple'> 2
<class 'tuple'> 2
<class 'tuple'> 2
<class 'tuple'> 2
<class 'tuple'> 2
<class 'tuple'> 2
<class 'tuple'> 2
<class 'tuple'> 2
<class 'tuple'> 2
<class 'tuple'> 2
<class 'tuple'> 2
<class 'tuple'> 2
<class 'tuple'> 2
<class 'tuple'> 2
<class 'tuple'> 2
<class 'tuple'> 2
<class 'tuple'> 2
<class 'tuple'> 2
<class 'tuple'> 2
<class 'tuple'> 2
<class 'tuple'> 2
<class 'tuple'> 2
<class 'tuple'> 2
<class 'tuple'> 2
<class 'tuple'> 2
<class 'tuple'> 2
<class 'tuple'> 2
<class 'tuple'> 2
<class 'tuple'> 2
<class 'tuple'> 2
<class 'tuple'> 2
<class 'tuple'> 2
<class 'tuple'> 2
<class 'tuple'> 2
<class 'tuple'> 2
<class 'tuple'> 2
<class 'tuple'> 2
<class 'tuple'> 2
<class 'tuple'> 2
<class 'tuple'> 2
<class 'tuple'> 2
<class 'tuple'> 2
<class 'tuple'> 2
<class 'tuple'> 2
<class 'tuple'> 2
<class 'tuple'> 2
<class 'tuple'> 2
<class 'tuple'> 2
<class 'tuple'> 2
<class 'tuple'> 2
<class 'tu