# 🐠 Reef - DETR - Detection Transformer - Train

## DETR Baseline model for the [Great Barrier Reef Competition](https://www.kaggle.com/c/tensorflow-great-barrier-reef)

![](https://storage.googleapis.com/kaggle-competitions/kaggle/31703/logos/header.png)

## An adaptation of [End to End Object Detection with Transformers:DETR](https://www.kaggle.com/tanulsingh077/end-to-end-object-detection-with-transformers-detr) to the [Great Barrier Reef Competition](https://www.kaggle.com/c/tensorflow-great-barrier-reef)

I made various adaptations to make it work, based on the following code and documentation:
* This awesome fork [End to End Object Detection with Transformers:DETR](https://www.kaggle.com/prokaj/end-to-end-object-detection-with-transformers-detr) by [prvi](https://www.kaggle.com/prokaj), which correctly formats the input (neither coco nor pascal_voc, but something else).
* Albumentations code for the bbox normalize and denormalize functions: [here](https://github.com/albumentations-team/albumentations/blob/master/albumentations/augmentations/bbox_utils.py#L88)
* [DETR's hands on Colab Notebook](https://colab.research.google.com/github/facebookresearch/detr/blob/colab/notebooks/detr_attention.ipynb): Shows how to load a model from hub, generate predictions, then visualize the attention of the model (similar to the figures of the paper)
* [Standalone Colab Notebook](https://colab.research.google.com/github/facebookresearch/detr/blob/colab/notebooks/detr_demo.ipynb): Demonstrates how to implement a simplified version of DETR from the ground up in 50 lines of Python, then visualize the predictions. It is a good starting point if you want to gain a better understanding of the architecture and poke around before diving into the codebase.
* [Panoptic Colab Notebook](https://colab.research.google.com/github/facebookresearch/detr/blob/colab/notebooks/DETR_panoptic.ipynb): Demonstrates how to use DETR for panoptic segmentation and plot the predictions.
* [Hugging Face DETR Documentation](https://huggingface.co/docs/transformers/model_doc/detr)

The main changes to the original notebook I forked are:
* Data format changed from `[x_min, y_min, w, h]` to `[x_center, y_center, w, h]`
* Resnet-like normalization instead of `[0...1]`
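
A minimal sketch of what "Resnet-like normalization" means here, assuming an RGB image with pixel values in `0..255`: scale to `[0, 1]`, subtract the ImageNet channel means and divide by the channel standard deviations. The helper name `imagenet_normalize` is made up for illustration; the notebook itself delegates this step to `A.Normalize`.

```python
import numpy as np

# ImageNet channel statistics, the same values passed to A.Normalize in this notebook
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def imagenet_normalize(image, max_pixel_value=255.0):
    """Hypothetical helper: (image / max_pixel_value - mean) / std, per channel."""
    image = image.astype(np.float32) / max_pixel_value
    return (image - IMAGENET_MEAN) / IMAGENET_STD

# A tiny 2x2 RGB "image" filled with mid-grey pixels
dummy = np.full((2, 2, 3), 127.5, dtype=np.float32)
normalized = imagenet_normalize(dummy)
print(normalized[0, 0])  # one value per channel, close to zero for mid-grey
```

With plain `[0...1]` scaling the same pixel would simply become `0.5` in every channel.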


## This is the training notebook. You can find the inference one here: [🐠 Reef - DETR - Detection Transformer - Infer](https://www.kaggle.com/julian3833/reef-detr-detection-transformer-infer).



# Please, _DO_ upvote if you find this useful!!


&nbsp;
&nbsp;
&nbsp;

---


# About DETR (Detection Transformer)

The Transformer paper, "Attention Is All You Need", changed the state of NLP and has reached great heights. Though mainly developed for NLP, the latest research around it focuses on how to leverage it across different verticals of deep learning. The Transformer architecture is very powerful and is something very close to my heart, which is why I am motivated to explore anything that uses transformers, be it Google's recently released TabNet or OpenAI's ImageGPT.

Detection Transformer leverages the transformer network (both the encoder and the decoder) for detecting objects in images. Facebook's researchers argue that, for object detection, every part of the image should be able to attend to every other part for better results, especially with occluded and partially visible objects, and what better tool for that than a transformer?

**The main motivation behind DETR is to remove the need for many hand-designed components, such as a non-maximum suppression procedure or anchor generation, that explicitly encode prior knowledge about the task and make the process complex and computationally expensive.**

The main ingredients of the new framework, called DEtection TRansformer or DETR, <font color='green'>are a set-based global loss that forces unique predictions via bipartite matching, and a transformer encoder-decoder architecture.</font>

![](https://cdn.analyticsvidhya.com/wp-content/uploads/2020/05/Screenshot-from-2020-05-27-17-48-38.png)
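
To make the bipartite matching idea concrete, here is a toy sketch: given a cost for every (prediction, ground-truth) pair, pick the one-to-one assignment with the lowest total cost. The DETR repo does this with `scipy.optimize.linear_sum_assignment` inside `HungarianMatcher`; the brute-force search below is only an illustration on made-up costs, and the function name `match_predictions` is hypothetical.

```python
from itertools import permutations

def match_predictions(cost):
    """Brute-force one-to-one matching: cost[i][j] is the cost of assigning
    prediction i to ground-truth box j. Returns (best_assignment, best_cost),
    where best_assignment[j] is the prediction index matched to target j."""
    n_preds, n_targets = len(cost), len(cost[0])
    best, best_cost = None, float("inf")
    # Try every way of picking a distinct prediction for each target
    for perm in permutations(range(n_preds), n_targets):
        total = sum(cost[pred][tgt] for tgt, pred in enumerate(perm))
        if total < best_cost:
            best, best_cost = perm, total
    return best, best_cost

# 3 predictions, 2 ground-truth boxes (toy costs, lower is better)
cost = [
    [0.9, 0.2],
    [0.1, 0.8],
    [0.5, 0.6],
]
assignment, total = match_predictions(cost)
print(assignment, total)
```

Here target 0 is matched to prediction 1 and target 1 to prediction 0; the leftover prediction 2 is the one that gets pushed towards the "no object" class.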

---


In [None]:
from IPython.display import IFrame, YouTubeVideo
YouTubeVideo('T35ba_VXkMY',width=600, height=400)

# References:
* The video [above](https://www.youtube.com/watch?v=T35ba_VXkMY) on YouTube
* [Other Video](https://www.youtube.com/watch?v=LfUsGv-ESbc)
* The original notebook: [End to End Object Detection with Transformers:DETR](https://www.kaggle.com/tanulsingh077/end-to-end-object-detection-with-transformers-detr)
* [Paper](https://scontent.flko3-1.fna.fbcdn.net/v/t39.8562-6/101177000_245125840263462_1160672288488554496_n.pdf?_nc_cat=104&_nc_sid=ae5e01&_nc_ohc=KwU3i7_izOgAX9bxMVv&_nc_ht=scontent.flko3-1.fna&oh=64dad6ce7a7b4807bb3941690beaee69&oe=5F1E8347)
* [Github repo](https://github.com/facebookresearch/detr)
* [Blogpost](https://ai.facebook.com/blog/end-to-end-object-detection-with-transformers/)

Ok, enough chit chat, show me the code!!

&nbsp;
&nbsp;
&nbsp;

# Clone github repo of detr

We will not use it in the inference notebook, but we need it for training. The model is a PyTorch model that relies on torch hub for the weights.

In [None]:
!git clone https://github.com/facebookresearch/detr.git   

* If you have seen the video, you know that DETR uses a special loss called bipartite matching loss, where a matcher assigns each ground-truth bbox to one predicted box. When fine-tuning we therefore need the matcher (the Hungarian matcher used in the paper) and also the class `SetCriterion`, which computes the bipartite matching loss for backpropagation. This is the reason for cloning the GitHub repo.

* I did not know that we can add a path to the module search path using `sys`, so I used to change directories. Now I have made changes so that I can import detr easily without changing directories. A big thanks to @prvi for his help.

In [None]:
import os
import time
import random
import numpy as np 
import pandas as pd 

from tqdm import tqdm


#Torch
import torch
import torch.nn as nn
from torch.utils.data import Dataset,DataLoader


#CV
import cv2

################# DETR FUNCTIONS FOR LOSS #######################
import sys
sys.path.append('./detr/')

from detr.models.matcher import HungarianMatcher
from detr.models.detr import SetCriterion
#################################################################

#Albumentations
import albumentations as A
import matplotlib.pyplot as plt
from albumentations.pytorch.transforms import ToTensorV2


# Fix randomness

def seed_everything(seed):
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.backends.cudnn.deterministic = True
    # benchmark mode auto-tunes cuDNN kernels, which can make runs non-reproducible
    torch.backends.cudnn.benchmark = False
    
seed_everything(seed=42)

# Configuration

Basic configuration for this model

In [None]:
NUM_CLASSES = 2
NUM_QUERIES = 18
NULL_CLASS_COEF = 0.1
BATCH_SIZE = 8
LR = 1e-5
EPOCHS = 20

WIDTH = 1280
HEIGHT = 720

BASE_DIR = "../input/tensorflow-great-barrier-reef/train_images/"

# Train-validation split

We split by subsequences. I have tried other strategies and this is the one that works best for now. The dataset has just 3 videos, each of them split into sequences, for only 20 sequences in total. **Subsequences**, as we define them, are parts of a sequence where objects are continually present or continually absent.

&nbsp;

Let's see an **example**. Consider the sequence `A` with the following frames:
* `1-20` - No annotations present
* `21-30` - Annotations present
* `31-60` - No annotations
* `61-80` - Annotations present

In this case, we say that the sequence `A` has `4` subsequences (`1-20`, `21-30`, `31-60`, `61-80`).
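
The example above can be reproduced with a few lines of Python. This is only a sketch of the idea, assuming one boolean flag per frame ("does this frame have annotations?"); the function name `split_into_subsequences` is made up here, and the notebook linked in this section holds the actual split logic.

```python
from itertools import groupby

def split_into_subsequences(has_annotations):
    """Group consecutive frames by whether they contain annotations.
    Returns a list of (first_frame, last_frame, annotated) tuples, 1-indexed."""
    subsequences = []
    frame = 1
    for annotated, run in groupby(has_annotations):
        length = len(list(run))
        subsequences.append((frame, frame + length - 1, annotated))
        frame += length
    return subsequences

# Sequence A from the example: 20 empty, 10 annotated, 30 empty, 20 annotated
flags = [False] * 20 + [True] * 10 + [False] * 30 + [True] * 20
subsequences = split_into_subsequences(flags)
for first, last, annotated in subsequences:
    print(f"{first}-{last}: {'annotations' if annotated else 'no annotations'}")
```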


See: [🐠 Reef - CV strategy: subsequences!](https://www.kaggle.com/julian3833/reef-cv-strategy-subsequences) for more details about this

In [None]:
import ast

df = pd.read_csv("../input/reef-cv-strategy-subsequences-dataframes/train-validation-split/train-0.1.csv")

# Turn annotations from strings into lists of dictionaries
# (ast.literal_eval only parses Python literals, unlike eval)
df['annotations'] = df['annotations'].apply(ast.literal_eval)

# Create the image path for the row
df['image_path'] = "video_" + df['video_id'].astype(str) + "/" + df['video_frame'].astype(str) + ".jpg"

df.head()

In [None]:
def clean_annotations(annotations):
    new_annotations = []
    for a in annotations:
        x_max = a['x'] + a['width']
        y_max = a['y'] + a['height']
        if x_max > WIDTH or y_max > HEIGHT:
            print("Found broken annotation:", a, x_max, y_max)
        else:
            new_annotations.append(a)
    return new_annotations

In [None]:
# Drop annotation exceeding the image frame
# We could clip them also. Smarter and possibly better?
df['annotations'] = df['annotations'].apply(clean_annotations)

In [None]:
# Drop images with no annotations. The background works as negative samples anyway
df = df[df.annotations.str.len() > 0 ].reset_index(drop=True)

In [None]:
# Train-validation split
df_train, df_val = df[df['is_train']], df[~df['is_train']]

# Augmentations


In [None]:
import torchvision.transforms as T

In [None]:
def get_train_transform():
    return A.Compose([
        #A.Flip(0.5),
        A.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
        ToTensorV2(p=1.0),
        
    ])#, bbox_params={'format': 'coco', 'label_fields': ['labels']})


def get_valid_transform():
    return A.Compose([
        A.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
        ToTensorV2(p=1.0),
        
    ])#, bbox_params={'format': 'coco', 'label_fields': ['labels']})

# Creating Dataset

* I hope you have watched the video by now. Two box formats are widely used: coco `(x_min, y_min, w, h)` and pascal `(x_min, y_min, x_max, y_max)`. The competition annotations come as `(x_min, y_min, w, h)` in pixels, while DETR's loss expects `(x_center, y_center, w, h)` normalized to `[0, 1]`, so we need to prepare the data in that format
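
A dependency-free sketch of the box conversion, in both directions, assuming boxes given as `[x_min, y_min, w, h]` in pixels and the 1280x720 frame size used in this notebook. The helper names `coco_to_detr` and `detr_to_coco` and the sample box are made up for illustration; the dataset class below does the same job with pandas and the albumentations helpers.

```python
import numpy as np

WIDTH, HEIGHT = 1280, 720  # frame size used in this notebook

def coco_to_detr(boxes, width=WIDTH, height=HEIGHT):
    """[x_min, y_min, w, h] in pixels -> [x_center, y_center, w, h] in 0..1."""
    boxes = np.asarray(boxes, dtype=np.float32).reshape(-1, 4).copy()
    boxes[:, 0] += boxes[:, 2] / 2  # x_min -> x_center
    boxes[:, 1] += boxes[:, 3] / 2  # y_min -> y_center
    return boxes / np.array([width, height, width, height], dtype=np.float32)

def detr_to_coco(boxes, width=WIDTH, height=HEIGHT):
    """Inverse of coco_to_detr: normalized centers back to pixel corners."""
    boxes = np.asarray(boxes, dtype=np.float32).reshape(-1, 4)
    boxes = boxes * np.array([width, height, width, height], dtype=np.float32)
    boxes[:, 0] -= boxes[:, 2] / 2  # x_center -> x_min
    boxes[:, 1] -= boxes[:, 3] / 2  # y_center -> y_min
    return boxes

pixel_boxes = [[559, 213, 50, 32]]  # illustrative values
normalized = coco_to_detr(pixel_boxes)
print(normalized)
print(detr_to_coco(normalized))  # round trip back to pixels
```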

In [None]:
class ReefDataset(Dataset):

    def __init__(self, df, transforms):
        self.df = df
        self.transforms = transforms

    def get_boxes(self, row):
        """Returns the bboxes for a given row as a 2D array of shape (n_boxes, 4) with format [x_center, y_center, w, h]"""
        boxes = pd.DataFrame(row['annotations'], columns=['x', 'y', 'width', 'height'])
        boxes['x'] = boxes['x'] + (boxes['width'] / 2)
        boxes['y'] = boxes['y'] + (boxes['height'] / 2)
        
        return boxes.astype(float).values
    
    def get_image(self, row):
        """Gets the image for a given row"""
        
        image = cv2.imread(f'{BASE_DIR}/{row["image_path"]}', cv2.IMREAD_COLOR)
        image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB).astype(np.float32)
        #image /= 255.0
        return image
    
    
    
    def plot_img(self, i):
        img, _, _ = self[i]
        plt.imshow(img.permute(1, 2, 0))
        plt.show()
    
    def __getitem__(self, i):

        row = self.df.iloc[i]
        image = self.get_image(row)
        boxes = self.get_boxes(row)
        
        n_boxes = boxes.shape[0]
        
        # Calculate the area
        #Area of bb
        #area = boxes[:,2] * boxes[:,3]
        
        
        target = {
            'boxes': torch.as_tensor(boxes, dtype=torch.float32),
            #'area': torch.as_tensor(area, dtype=torch.float32),
            
             'image_id': torch.tensor([i]),
            
            # There is only one class
            'labels': torch.zeros((n_boxes,), dtype=torch.int64),
        }

        image_id = self.df.iloc[i]['image_path']
        
        
        sample = {
            'image': image,
            'bboxes': target['boxes'],
            'labels': target['labels']
        }
        image = self.transforms(image=sample['image'])['image']
        #image = sample['image']

        #import pdb; pdb.set_trace()
        #target['boxes'] = torch.stack(tuple(map(torch.tensor, zip(*sample['bboxes'])))).permute(1, 0)
        
        target['boxes'] = A.augmentations.bbox_utils.normalize_bboxes(target['boxes'],rows=HEIGHT,cols=WIDTH)
        target['boxes'] = torch.as_tensor(target['boxes'], dtype=torch.float32)
        
        return image, target, image_id

    def __len__(self):
        return len(self.df)


# Create Datasets and DataLoaders

In [None]:
def collate_fn(batch):
    return tuple(zip(*batch))

ds_train = ReefDataset(df_train, get_train_transform())
ds_val = ReefDataset(df_val, get_valid_transform())

dl_train = DataLoader(ds_train, batch_size=BATCH_SIZE,
                      shuffle=True, num_workers=4, collate_fn=collate_fn)


dl_val = DataLoader(ds_val, batch_size=BATCH_SIZE, shuffle=False,
                    num_workers=4, collate_fn=collate_fn)

In [None]:
ds_train.plot_img(3)

# Model

* The initial DETR model is trained on the coco dataset, which has 91 classes + 1 background class, so we need to modify it to handle our own number of classes
* The pretrained DETR model also uses 100 object queries, i.e. it outputs a total of 100 bboxes for every image. The wrapper below stores `NUM_QUERIES` on the model, but note that the pretrained query embedding itself is left untouched

In [None]:
class DETRModel(nn.Module):
    def __init__(self, num_classes, num_queries):
        super(DETRModel,self).__init__()
        self.num_classes = num_classes
        self.num_queries = num_queries
        
        self.model = torch.hub.load('facebookresearch/detr', 'detr_resnet50', pretrained=True)
        self.in_features = self.model.class_embed.in_features
        
        self.model.class_embed = nn.Linear(in_features=self.in_features,out_features=self.num_classes)
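        # Note: this only sets an attribute. The pretrained query embedding
        # (self.model.query_embed) keeps its 100 entries, so the model still
        # outputs 100 boxes per image.
        self.model.num_queries = self.num_queries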
        
    def forward(self,images):
        return self.model(images)

# Utils
* AverageMeter - class for averaging loss, metric, etc. over epochs

In [None]:
class AverageMeter(object):
    """Computes and stores the average and current value"""
    def __init__(self):
        self.reset()

    def reset(self):
        self.val = 0
        self.avg = 0
        self.sum = 0
        self.count = 0

    def update(self, val, n=1):
        self.val = val
        self.sum += val * n
        self.count += n
        self.avg = self.sum / self.count

# Matcher and Bipartite Matching Loss

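Now we make use of the set-based loss that the model uses, and for that we need to define the matcher. With the weights defined below, DETR combines three individual losses:
* Classification loss for the labels (its weight can be set by `loss_ce`). The background ("no object") class is down-weighted inside this loss through `eos_coef`
* Bbox L1 loss (its weight can be set by `loss_bbox`)
* Generalized IoU loss on the boxes (its weight can be set by `loss_giou`)

The `cardinality` entry in `losses` has no weight, so it is only logged and does not contribute to the gradient.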
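
The `weight_dict` in the next cell also carries a `loss_giou` term, which is based on the generalized IoU between matched boxes (the loss is `1 - GIoU`). Below is a small pure-Python sketch of GIoU for two boxes in `(x_min, y_min, x_max, y_max)` format, just to show what the quantity measures; the repo computes it on batched tensors, and the function name here is my own.

```python
def generalized_iou(box_a, box_b):
    """GIoU of two boxes given as (x_min, y_min, x_max, y_max)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)

    # Intersection (zero when the boxes do not overlap)
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = area_a + area_b - inter
    iou = inter / union

    # Smallest box enclosing both inputs
    hull = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    return iou - (hull - union) / hull

same = generalized_iou((0, 0, 2, 2), (0, 0, 2, 2))   # identical boxes
apart = generalized_iou((0, 0, 1, 1), (2, 0, 3, 1))  # disjoint boxes
print(same, apart)
```

Unlike plain IoU, which is zero for every pair of disjoint boxes, GIoU goes negative as the boxes move apart, so it still gives a useful signal when a prediction does not overlap its target.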

In [None]:
'''
code taken from github repo detr , 'code present in engine.py'
'''

matcher = HungarianMatcher()

weight_dict = {'loss_ce': 1, 'loss_bbox': 1, 'loss_giou': 1}

losses = ['labels', 'boxes', 'cardinality']

# Training Function

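Training DETR differs from training Faster R-CNN or EfficientDet: the loss is computed by a separate criterion module (the matcher plus `SetCriterion`), which is switched between train and eval mode together with the model. The reference training function can be viewed here: https://github.com/facebookresearch/detr/blob/master/engine.py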
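
The key line in the training function is the weighted sum over the criterion's output. A sketch with plain floats standing in for tensors (the values are made up): only keys that appear in `weight_dict` contribute, so logging-only entries such as `class_error` and `cardinality_error` are skipped. The function name `total_loss` is hypothetical.

```python
def total_loss(loss_dict, weight_dict):
    """Weighted sum over the loss terms that have a weight; the rest are only logged."""
    return sum(loss_dict[k] * weight_dict[k] for k in loss_dict if k in weight_dict)

# Made-up values standing in for the tensors returned by the criterion
loss_dict = {
    "loss_ce": 0.40,
    "loss_bbox": 0.10,
    "loss_giou": 0.25,
    "class_error": 12.5,        # logging only
    "cardinality_error": 3.0,   # logging only
}
weight_dict = {"loss_ce": 1, "loss_bbox": 1, "loss_giou": 1}

combined = total_loss(loss_dict, weight_dict)
print(combined)
```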

In [None]:
def train_fn(data_loader, model, criterion, optimizer, device, scheduler, epoch):
    model.train()
    criterion.train()
    
    summary_loss = AverageMeter()
    
    tk0 = tqdm(data_loader, total=len(data_loader))
    
    for step, (images, targets, image_ids) in enumerate(tk0):
        
        images = list(image.to(device) for image in images)
        targets = [{k: v.to(device) for k, v in t.items()} for t in targets]
        

        output = model(images)
        loss_dict = criterion(output, targets)
        weight_dict = criterion.weight_dict
        
        losses = sum(loss_dict[k] * weight_dict[k] for k in loss_dict.keys() if k in weight_dict)
        
        optimizer.zero_grad()

        losses.backward()
        optimizer.step()
        if scheduler is not None:
            scheduler.step()
        
        # Weight by the actual batch size: the last batch may be smaller than BATCH_SIZE
        summary_loss.update(losses.item(), len(images))
        tk0.set_postfix(loss=summary_loss.avg)
        
    return summary_loss

# Eval Function

In [None]:
def eval_fn(data_loader, model,criterion, device):
    model.eval()
    criterion.eval()
    summary_loss = AverageMeter()
    
    with torch.no_grad():
        
        tk0 = tqdm(data_loader, total=len(data_loader))
        for step, (images, targets, image_ids) in enumerate(tk0):
            
            images = list(image.to(device) for image in images)
            targets = [{k: v.to(device) for k, v in t.items()} for t in targets]

            output = model(images)

            loss_dict = criterion(output, targets)
            weight_dict = criterion.weight_dict
        
            losses = sum(loss_dict[k] * weight_dict[k] for k in loss_dict.keys() if k in weight_dict)
            
            # Weight by the actual batch size: the last batch may be smaller than BATCH_SIZE
            summary_loss.update(losses.item(), len(images))
            tk0.set_postfix(loss=summary_loss.avg)
    
    return summary_loss

# Training/validation loop

In [None]:
device = torch.device('cuda')
model = DETRModel(num_classes=NUM_CLASSES,num_queries=NUM_QUERIES)
model = model.to(device)
criterion = SetCriterion(NUM_CLASSES-1, matcher, weight_dict, eos_coef = NULL_CLASS_COEF, losses=losses)
criterion = criterion.to(device)

optimizer = torch.optim.AdamW(model.parameters(), lr=LR)

best_loss = 10**5
for epoch in range(EPOCHS):
    time_start = time.time()
    train_loss = train_fn(dl_train, model,criterion, optimizer,device,scheduler=None,epoch=epoch)
    valid_loss = eval_fn(dl_val, model,criterion, device)

    elapsed = time.time() - time_start
    chk_name = f'pytorch_model_e{epoch}.bin'
    torch.save(model.state_dict(), chk_name)
    print(f"[Epoch {epoch+1:2d} / {EPOCHS:2d}] Train loss: {train_loss.avg:.3f}. Val loss: {valid_loss.avg:.3f} --> {chk_name}  [{elapsed/60:.0f} mins]")   

    if valid_loss.avg < best_loss:
        best_loss = valid_loss.avg
        print(f'Best model found in epoch {epoch+1}........Saving Model')
        torch.save(model.state_dict(), 'pytorch_model.bin')

# Save the torch hub cache path 

Save the torch hub path to the working directory so it's stored as the output of the notebook and we can use everything without Internet in the submission notebook.

In [None]:
!cp -R /root/.cache/torch/hub/ torch_hub/

# Sample

* It may be naive to visualize the model output after so little training, but let's do that and see what the results look like
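
Before reading the plotting function, it may help to see the post-processing on its own: softmax over the class logits, keep the queries whose object probability exceeds a threshold, and convert the normalized `[x_center, y_center, w, h]` boxes back to pixel `[x_min, y_min, w, h]`. This is a NumPy sketch on made-up outputs for 3 queries; the helper names are my own, and the 0.5 threshold mirrors the one used in `view_sample`.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax, shifted by the row max for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def select_boxes(pred_logits, pred_boxes, width, height, threshold=0.5):
    """Keep queries whose class-0 probability exceeds the threshold and return
    their boxes as [x_min, y_min, w, h] in pixels, together with the scores."""
    scores = softmax(pred_logits)[:, 0]
    keep = scores > threshold
    boxes = pred_boxes[keep] * np.array([width, height, width, height])
    boxes[:, 0] -= boxes[:, 2] / 2  # x_center -> x_min
    boxes[:, 1] -= boxes[:, 3] / 2  # y_center -> y_min
    return boxes, scores[keep]

# Made-up outputs for 3 queries: columns are [object, no-object] logits
logits = np.array([[2.0, -1.0], [-3.0, 1.0], [0.2, 0.0]])
boxes = np.array([[0.5, 0.5, 0.1, 0.2], [0.2, 0.2, 0.1, 0.1], [0.8, 0.3, 0.05, 0.1]])
kept_boxes, kept_scores = select_boxes(logits, boxes, width=1280, height=720)
print(kept_boxes)
print(kept_scores)
```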

In [None]:
def view_sample(df_sample, model, device):
    '''
    Code taken from Peter's Kernel 
    https://www.kaggle.com/pestipeti/pytorch-starter-fasterrcnn-train
    '''
    ds_val = ReefDataset(df_sample, get_valid_transform())
    dl_val = DataLoader(ds_val, batch_size=BATCH_SIZE, shuffle=False,
                        num_workers=4, collate_fn=collate_fn)
    
    images, targets, image_ids = next(iter(dl_val))
    
    images = list(img.to(device) for img in images)
    targets = [{k: v.to(device) for k, v in t.items()} for t in targets]
    
    # [x_center, y_center, width, height] # All scaled
    boxes = targets[0]['boxes'].cpu().numpy()
    # De-scaled
    boxes = np.array([np.array(box).astype(np.int32) for box in A.augmentations.bbox_utils.denormalize_bboxes(boxes, HEIGHT, WIDTH)])
    #boxes['x'] = boxes['x'] + (boxes['width'] / 2)
    #boxes['y'] = boxes['y'] + (boxes['height'] / 2)

    
    #[x_min, y_min, width, height]
    boxes[:, 0] = boxes[:, 0] - (boxes[:, 2] / 2) # x_center --> x_min
    boxes[:, 1] = boxes[:, 1] - (boxes[:, 3] / 2) # y_center --> y_min
    
    sample = images[0].permute(1,2,0).cpu().numpy()
    
    model.eval()
    model.to(device)
    cpu_device = torch.device("cpu")
    
    with torch.no_grad():
        outputs = model(images)
        
    outputs = [{k: v.to(cpu_device) for k, v in outputs.items()}]
    
    fig, ax = plt.subplots(1, 1, figsize=(16, 8))
    for box in boxes:
        x, y, w, h = box
        cv2.rectangle(sample, (x, y), (x+w, y+h), (0, 220, 0), 3)
        
        

    oboxes = outputs[0]['pred_boxes'][0].detach().cpu().numpy()
    oboxes = np.array([np.array(box).astype(np.int32) for box in A.augmentations.bbox_utils.denormalize_bboxes(oboxes, HEIGHT, WIDTH)])
    
    # [x_min, y_min, width, height]
    oboxes[:, 0] = oboxes[:, 0] - (oboxes[:, 2] / 2) # x_center --> x_min
    oboxes[:, 1] = oboxes[:, 1] - (oboxes[:, 3] / 2) # y_center --> y_min
    
    prob   = outputs[0]['pred_logits'][0].softmax(1).detach().cpu().numpy()[:,0]
    #print(f"Probabilities: {prob}")
    scored_boxes = list(zip(oboxes, prob))
    sorted_boxes = list(sorted(scored_boxes, key=lambda y: -y[1]))
    print([score for _, score in sorted_boxes][:10])
    for i, (box, p) in enumerate(sorted_boxes):
        x, y, w, h = box
        
        if p > 0.5:
            cv2.rectangle(sample, (x, y), (x+w, y+h), (220, 0, 0), 3)
        
        if i > 18:
            break

            
    ax.set_axis_off()
    ax.imshow(sample)
    
    
model = DETRModel(num_classes=NUM_CLASSES, num_queries=NUM_QUERIES)
model.load_state_dict(torch.load("./pytorch_model.bin"))
view_sample(df_sample=df_val[df_val['n_annotations'] > 3].iloc[[30]], model=model,device=torch.device('cuda'))

# Please, _DO_ upvote if you find it useful or interesting!!