# **Receipt Text Detection Baseline Code Tutorial**


> Welcome to the receipt text detection competition! 🎉
>
> This tutorial walks through how the baseline code is organized.

# Contents

1.   Config structure
2.   Model
3.   Lightning Module
4.   Train
5.   Test, Predict
6.   Tips

---
⚠️ Caution
```
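This tutorial is an example meant to help you understand the baseline code,
and it is written for the Google Colab environment.
If you run it locally (for example, in VSCode), you may need to adjust paths and similar settings.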
```
---


# 0. Downloading the baseline code and installing dependencies

This section sets up the Colab environment so that the baseline code can run.

In [None]:
#@markdown ### Install Colab notebook dependencies
#@markdown - Installs the packages needed to run the Colab notebook.
#@markdown ---

from IPython.display import clear_output, display
import ipywidgets as widgets

def inf(msg, style, wdth):
    # Show a disabled button as a small status badge.
    button = widgets.Button(description=msg, disabled=True, button_style=style,
                            layout=widgets.Layout(min_width=wdth))
    display(button)
C_default = "\033[0;39m"
C_yellow = "\033[1;93m"

print(C_yellow + "Install..." + C_default)
# START

!pip install gdown==v4.6.3 pathlib==1.0.1

# END CODE
clear_output()
inf('\u2714 Done', 'success', '50px')


In [None]:
#@markdown ### Download the baseline code (Google Drive)

#@markdown Downloads the following prepared files to the configured path.
#@markdown - Baseline code
#@markdown - Test dataset
#@markdown - Model checkpoint
#@markdown ---

print(C_yellow + "Download..." + C_default)
# START

from pathlib import Path

Baseline_URL = "https://aistages-api-public-prod.s3.amazonaws.com/app/Competitions/000293/data/code.tar.gz" #@param {type:"string"}
Library_URL = "https://drive.google.com/drive/folders/1TYvjiTivRJcIrLytshcEaaooLie9s4pU?usp=sharing" #@param {type:"string"}
Dataset_URL = "https://drive.google.com/drive/folders/1FBEafD3Zua86kb5TodVUgAk5SgRuIUJ4?usp=sharing" #@param {type:"string"}
Download_Path = Path('/content')

Baseline_Path = Download_Path / "baseline_code"
Library_Path = Baseline_Path / "lib"
Dataset_Path = Download_Path / "dataset"

!mkdir -p {Download_Path}
!curl -o {Download_Path}/code.tar.gz {Baseline_URL}
!gdown {Library_URL} -O {Library_Path} --folder
!gdown {Dataset_URL} -O {Dataset_Path} --folder

!mkdir -p /data
!ln -s /content/dataset /data/datasets

print(C_yellow + "Install..." + C_default)
!tar xvfz {Download_Path}/code.tar.gz
!pip install -r {Library_Path}/requirements.txt
!cp -rf {Library_Path}/vgg16.py {Baseline_Path}/ocr/models/encoder/

%cd {Baseline_Path}

# END CODE
clear_output()
inf('\u2714 Done', 'success', '50px')


In [None]:
#@markdown ### GPU setup

#@markdown - Checks which device is available in the Colab environment.
#@markdown ---

import torch

def to_device(data, device):
    if isinstance(data, (list, tuple)):
        return [to_device(x, device) for x in data]
    elif isinstance(data, dict):
        return {k: to_device(v, device) for k, v in data.items()}
    elif isinstance(data, torch.Tensor):
        return data.to(device)
    else:
        return data

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

print(f'Use [{device}]')

Use [cuda]


# 1. Config structure

The configs are built on Hydra and follow the folder structure below.

```
└─── configs
    ├── preset
    │   ├── example.yaml
    │   ├── base.yaml
    │   ├── datasets
    │   │   └── db.yaml
    │   ├── lightning_modules
    │   │   └── base.yaml
    │   ├── metrics
    │   │   └── cleval.yaml
    │   └── models
    │       ├── decoder
    │       │   └── unet.yaml
    │       ├── encoder
    │       │   └── timm_backbone.yaml
    │       ├── head
    │       │   └── db_head.yaml
    │       ├── loss
    │       │   └── db_loss.yaml
    │       ├── postprocess
    │       │   └── base.yaml
    │       └── model_example.yaml
    ├── train.yaml
    ├── test.yaml
    └── predict.yaml
 ```

## 1.1 Dataset config

- configs/preset/datasets/db.yaml

> Let's look at the dataset config.

```yaml
# @package _global_

dataset_base_path: "/data/datasets/"   # Change your path

datasets:
  train_dataset:
    _target_: ${dataset_path}.OCRDataset
    image_path: ${dataset_base_path}images/train
    annotation_path: ${dataset_base_path}jsons/train.json
    transform: ${transforms.train_transform}
  val_dataset:
    _target_: ${dataset_path}.OCRDataset
    image_path: ${dataset_base_path}images/val
    annotation_path: ${dataset_base_path}jsons/val.json
    transform: ${transforms.val_transform}
  test_dataset:
    _target_: ${dataset_path}.OCRDataset
    image_path: ${dataset_base_path}images/val
    annotation_path: ${dataset_base_path}jsons/val.json
    transform: ${transforms.test_transform}
  predict_dataset:
    _target_: ${dataset_path}.OCRDataset
    image_path: ${dataset_base_path}images/test
    annotation_path: null
    transform: ${transforms.test_transform}

transforms:
  train_transform:
    _target_: ${dataset_path}.DBTransforms
    transforms:
      - _target_: albumentations.LongestMaxSize
        max_size: 640
        p: 1.0
      - _target_: albumentations.PadIfNeeded
        min_width: 640
        min_height: 640
        border_mode: 0
        p: 1.0
      - _target_: albumentations.HorizontalFlip
        p: 0.5
      - _target_: albumentations.Normalize
        mean: [0.485, 0.456, 0.406]
        std: [0.229, 0.224, 0.225]
    keypoint_params:
      _target_: albumentations.KeypointParams
      format: 'xy'
      remove_invisible: True
```
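
The resize and pad steps in `train_transform` can be traced by hand. The sketch below is plain Python rather than albumentations itself: `resize_and_pad` is a hypothetical helper, the 960x1280 input size is made up for illustration, and it assumes the padding is split evenly between both sides.

```python
def resize_and_pad(width, height, max_size=640):
    """Mimic LongestMaxSize followed by PadIfNeeded (hypothetical helper)."""
    # LongestMaxSize: scale so that the longer side becomes max_size.
    scale = max_size / max(width, height)
    new_w, new_h = round(width * scale), round(height * scale)
    # PadIfNeeded: pad the shorter side up to max_size.
    pad_w, pad_h = max_size - new_w, max_size - new_h
    # Assumption: padding is split evenly between the two sides.
    left, top = pad_w // 2, pad_h // 2
    return {
        "scale": scale,
        "resized": (new_w, new_h),
        "pad": (left, top, pad_w - left, pad_h - top),  # left, top, right, bottom
    }


info = resize_and_pad(960, 1280)
print(info)
```

Because every image ends up at 640x640 regardless of its original size, predicted coordinates have to be mapped back to the original image afterwards, which is presumably what the `inverse_matrix` entry returned by the dataset is for.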

In [None]:
import omegaconf
from pathlib import Path
from hydra.utils import instantiate
import sys
from hydra import compose, initialize

sys.path.append(str(Baseline_Path))

config_path = 'configs/'

with initialize(version_base=None, config_path=str(config_path), job_name="test_app"):
    cfg = compose(config_name="train", overrides=["preset=example"])

train_dataset = instantiate(cfg.datasets.train_dataset)

data = train_dataset[0]
data.keys()

odict_keys(['image', 'image_filename', 'shape', 'polygons', 'inverse_matrix'])

```yaml
# @package _global_

dataloaders:
  train_dataloader:
    batch_size: 4
    shuffle: True
    num_workers: 2
  val_dataloader:
    batch_size: 4
    shuffle: False
    num_workers: 2
  test_dataloader:
    batch_size: 4
    shuffle: False
    num_workers: 2
  predict_dataloader:
    batch_size: 1
    shuffle: False
    num_workers: 2

collate_fn:
  _target_: ${dataset_path}.DBCollateFN
  shrink_ratio: 0.4
  thresh_min: 0.3
  thresh_max: 0.7
```

In [None]:
from torch.utils.data import DataLoader

collate_fn = instantiate(cfg.collate_fn)
collate_fn.inference_mode = False
data_loader = DataLoader(train_dataset, collate_fn=collate_fn, **cfg.dataloaders.train_dataloader)
for i, data_loaded in enumerate(data_loader):
    if i == 1:
        break
    print(data_loaded.keys())
    print(f"image size: {data_loaded['images'].shape}")

data_loaded = to_device(data_loaded, device)



odict_keys(['images', 'image_filename', 'inverse_matrix', 'polygons', 'prob_maps', 'thresh_maps'])
image size: torch.Size([4, 3, 640, 640])
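
The `shrink_ratio` passed to `DBCollateFN` most likely corresponds to the ratio r in the DB paper, where each text polygon is shrunk by an offset D = A(1 - r²) / L (A is the polygon area, L its perimeter). The sketch below only evaluates that formula in plain Python; it assumes the collate function follows the paper, and the helper names and the sample box are hypothetical.

```python
def polygon_area_and_perimeter(points):
    """Shoelace area and perimeter of a simple polygon given as (x, y) pairs."""
    twice_area, perimeter = 0.0, 0.0
    for (x1, y1), (x2, y2) in zip(points, points[1:] + points[:1]):
        twice_area += x1 * y2 - x2 * y1
        perimeter += ((x2 - x1) ** 2 + (y2 - y1) ** 2) ** 0.5
    return abs(twice_area) / 2.0, perimeter


def shrink_offset(points, shrink_ratio=0.4):
    """Offset D = A * (1 - r^2) / L from the DB paper (assumed to match DBCollateFN)."""
    area, perimeter = polygon_area_and_perimeter(points)
    return area * (1.0 - shrink_ratio ** 2) / perimeter


# Hypothetical 100 x 20 pixel word box.
box = [(0, 0), (100, 0), (100, 20), (0, 20)]
offset = shrink_offset(box)
print(f"{offset:.2f}")
```

For this box the offset comes out at about 7 pixels, so a larger `shrink_ratio` gives a smaller offset and therefore a larger text kernel in `prob_maps`.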


## 1.2 Encoder config

- configs/preset/models/encoder/timm_backbone.yaml

Of the Encoder - Decoder - Head structure, let's look at the encoder config first.

```yaml
# @package _global_

models:
  encoder:
    _target_: ${encoder_path}.TimmBackbone
    model_name: 'resnet18'
    select_features: [1, 2, 3, 4]            # Output layer
    pretrained: true
```

In [None]:
encoder = instantiate(cfg.models.encoder).to(device)
encoder_features = encoder(data_loaded["images"])
for encoder_feature in encoder_features:
    print(encoder_feature.shape)



torch.Size([4, 64, 160, 160])
torch.Size([4, 128, 80, 80])
torch.Size([4, 256, 40, 40])
torch.Size([4, 512, 20, 20])
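
The shapes printed above follow directly from the stride of each selected feature map. The sketch below reproduces them without loading a model. It assumes that `resnet18` with `features_only=True` exposes five maps with the channels and strides listed in the code; `expected_feature_shapes` is a hypothetical helper.

```python
def expected_feature_shapes(batch, height, width, channels, strides):
    """Each feature map is the input resolution divided by its stride."""
    return [(batch, c, height // s, width // s) for c, s in zip(channels, strides)]


# Assumption: channels and strides of the five resnet18 feature maps.
resnet18_channels = [64, 64, 128, 256, 512]
resnet18_strides = [2, 4, 8, 16, 32]

select_features = [1, 2, 3, 4]  # Same indices as in timm_backbone.yaml
channels = [resnet18_channels[i] for i in select_features]
strides = [resnet18_strides[i] for i in select_features]

shapes = expected_feature_shapes(4, 640, 640, channels, strides)
for shape in shapes:
    print(shape)
```

The resulting `channels` and `strides` lists are the values the decoder config expects as `in_channels` and `strides`.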


## 1.3 Decoder config

- configs/preset/models/decoder/unet.yaml

Of the Encoder - Decoder - Head structure, let's look at the decoder config.

```yaml
# @package _global_

models:
  decoder:
    _target_: ${decoder_path}.UNet
    in_channels: [64, 128, 256, 512]  # Input layer channel
    strides: [4, 8, 16, 32]           # Input layer scale
    inner_channels: 256               # Hidden layer channel
    output_channels: 64               # output layer channel
    bias: False
```

In [None]:
decoder = instantiate(cfg.models.decoder).to(device)
decoder_features = decoder(encoder_features)
for decoder_feature in decoder_features:
    print(decoder_feature.shape)

torch.Size([4, 64, 160, 160])
torch.Size([4, 64, 160, 160])
torch.Size([4, 64, 160, 160])
torch.Size([4, 64, 160, 160])
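
All four decoder outputs share the 160x160 resolution because the `UNet` class (section 2.2) derives its upsampling factors from `strides`. The sketch below repeats only that arithmetic in plain Python, using the same expressions as the class; the 640 input size comes from the transform config, and the variable names after `outscale_factors` are illustrative.

```python
from itertools import accumulate

strides = [4, 8, 16, 32]

# Same expressions as in UNet.__init__.
upscale_factors = [strides[i] // strides[i - 1] for i in range(1, len(strides))]
outscale_factors = list(accumulate(upscale_factors, lambda x, y: x * y))

# Spatial size of each encoder feature for a 640 x 640 input.
sizes = [640 // s for s in strides]

# forward() processes the coarsest feature first; the last outer layer has no Upsample.
out_scales = list(reversed(outscale_factors)) + [1]
in_sizes = list(reversed(sizes))
out_sizes = [size * scale for size, scale in zip(in_sizes, out_scales)]

print(upscale_factors, outscale_factors, out_sizes)
```

Four maps with 64 channels each add up to 256 channels, which matches the `in_channels: 256` value in the head config if the head concatenates them.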


## 1.4 Head config

- configs/preset/models/head/db_head.yaml

Of the Encoder - Decoder - Head structure, let's look at the head config.

```yaml
# @package _global_

# See https://arxiv.org/pdf/1911.08947.pdf

models:
  head:
    _target_: ${head_path}.DBHead
    in_channels: 256                 # Input layer channel
    upscale: 4                       # Output layer scale factor
    k: 50                            # The amplifying factor
    bias: False                      # Use bias or not in LayerNorm
    smooth: False                    # Use smooth or not in Upsample
    postprocess:
      thresh: 0.3                    # Binarization threshold
      box_thresh: 0.7                # Detection Box threshold
      max_candidates: 300            # Limit the number of detection boxes
      use_polygon: False             # Detection Box Type (QUAD or POLY)
```

In [None]:
head = instantiate(cfg.models.head).to(device)
with torch.no_grad():
  head_output = head(decoder_features)
  for k, v in head_output.items():
      print(f'{k}: {v.shape}')

prob_maps: torch.Size([4, 1, 640, 640])
thresh_maps: torch.Size([4, 1, 640, 640])
binary_maps: torch.Size([4, 1, 640, 640])


## 1.5 Loss config

- configs/preset/models/loss/db_loss.yaml

Let's look at the loss config.

```yaml
# @package _global_

# See https://arxiv.org/pdf/1911.08947.pdf

models:
  loss:
    _target_: ${loss_path}.DBLoss
    negative_ratio: 3.0
    eps: 1e-6
    prob_map_loss_weight: 5.0
    thresh_map_loss_weight: 10.0
    binary_map_loss_weight: 1.0
```

In [None]:
loss_fn = instantiate(cfg.models.loss).to(device)
loss, loss_dict = loss_fn(head_output, **data_loaded)
print(loss)
for k, v in loss_dict.items():
    print(f'{k}: {v}')

tensor(30.8433, device='cuda:0')
loss_prob: 5.339261531829834
loss_thresh: 0.322316437959671
loss_binary: 0.9238760471343994
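
The printed total is consistent with a weighted sum of the three components using the weights from `db_loss.yaml`. The check below is plain Python and only re-adds the numbers shown above. The mapping from each weight to each entry of `loss_dict` is an assumption inferred from the names.

```python
# Weights from db_loss.yaml, keyed by the loss_dict entry they are assumed to scale.
weights = {"loss_prob": 5.0, "loss_thresh": 10.0, "loss_binary": 1.0}

# Component values copied from the output above.
values = {
    "loss_prob": 5.339261531829834,
    "loss_thresh": 0.322316437959671,
    "loss_binary": 0.9238760471343994,
}

total = sum(weights[name] * values[name] for name in weights)
print(round(total, 4))
```

This also shows which term dominates: at this untrained stage the probability map loss contributes most of the total, so its weight is the first thing to revisit when re-balancing the loss.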


## 1.6 Post Process

Let's see what the model output looks like after post-processing.

In [None]:
boxes_batch, boxes_score = head.get_polygons_from_maps(data_loaded, head_output)

In [None]:
boxes_batch

[[[[840, 1264], [866, 1264], [866, 1280], [840, 1280]],
  [[1102, 1264], [1114, 1264], [1114, 1274], [1102, 1274]],
  [[978, 1260], [988, 1270], [979, 1279], [969, 1269]],
  [[969, 1261], [979, 1271], [970, 1280], [960, 1270]],
  [[952, 1254], [972, 1264], [964, 1280], [944, 1270]],
  [[946, 1258], [956, 1268], [947, 1277], [937, 1267]],
  [[920, 1260], [931, 1271], [922, 1280], [911, 1269]],
  [[904, 1270], [914, 1260], [922, 1268], [912, 1278]],
  [[904, 1260], [915, 1271], [906, 1280], [895, 1269]],
  [[895, 1260], [907, 1268], [897, 1282], [885, 1274]],
  [[888, 1260], [898, 1270], [889, 1279], [879, 1269]],
  [[880, 1260], [890, 1270], [881, 1279], [871, 1269]],
  [[858, 1262], [882, 1262], [882, 1278], [858, 1278]],
  [[794, 1262], [818, 1262], [818, 1280], [794, 1280]],
  [[792, 1260], [803, 1271], [794, 1280], [783, 1269]],
  [[600, 1260], [611, 1271], [602, 1280], [591, 1269]],
  [[592, 1260], [603, 1271], [594, 1280], [583, 1269]],
  [[356, 1264], [366, 1264], [366, 1276], [3

In [None]:
boxes_score

[[0.45524825815111397,
  0.6496775825893565,
  0.5841146866839967,
  0.4793402031799288,
  0.5766175248995856,
  0.6665217394107266,
  0.4715702997577451,
  0.5163349043448558,
  0.4439139825259417,
  0.47516152241761345,
  0.5143326961386169,
  0.5864490897302845,
  0.5763864548336844,
  0.49151017881787856,
  0.5251596476092263,
  0.45600231226953014,
  0.46975141108288715,
  0.5906513050547801,
  0.523324312770927],
 [],
 [],
 [0.7143274578265846,
  0.6290705460545724,
  0.47780686336541706,
  0.41646984848656843,
  0.40017052990447993,
  0.5309884444706969,
  0.5799044613329236,
  0.5308339078910649,
  0.4260425598886286,
  0.4338377337550628,
  0.4682258974734892,
  0.45017589119242984,
  0.48015375061968374,
  0.4544194877299015,
  0.4410321581177414,
  0.4702468290552497,
  0.4168512431298785,
  0.46145283357291195,
  0.48142035766039043,
  0.46864132285350935,
  0.46602760212146677,
  0.4363943788292818,
  0.41702349438673997,
  0.44712691594112486,
  0.513532528722135,
  0.449
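
`boxes_batch` holds one list of boxes per image, each box being a list of `[x, y]` points, and `boxes_score` holds the matching confidences. The sketch below shows two things you might do with them: flattening boxes into 8-value quads, the same way the Lightning module does before evaluation, and dropping low-confidence boxes. The two boxes and scores are copied from the output above. The `min_score` value is a hypothetical cut-off, not a baseline setting.

```python
# First two boxes and scores of the first image, copied from the output above.
boxes = [
    [[840, 1264], [866, 1264], [866, 1280], [840, 1280]],
    [[1102, 1264], [1114, 1264], [1114, 1274], [1102, 1274]],
]
scores = [0.45524825815111397, 0.6496775825893565]

# Flatten [[x1, y1], ..., [x4, y4]] into [x1, y1, ..., x4, y4].
quads = [[value for point in box for value in point] for box in boxes]

# Hypothetical extra filtering step on top of the built-in post-processing.
min_score = 0.5
kept = [quad for quad, score in zip(quads, scores) if score >= min_score]

print(quads[0])
print(len(kept))
```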

## 1.7 Model config

- configs/preset/models/model_example.yaml

Let's look at the model config, which combines the encoder, decoder, head, and loss.

```yaml
# @package _global_

defaults:
  - /preset/models/decoder/unet
  - /preset/models/encoder/timm_backbone
  - /preset/models/head/db_head
  - /preset/models/loss/db_loss
  - _self_

models:
  optimizer:
    _target_: torch.optim.Adam
    lr: 0.001
    weight_decay: 0.0001
  scheduler:
    _target_: torch.optim.lr_scheduler.StepLR
    step_size: 100
    gamma: 0.1
```
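
It is worth checking how this scheduler interacts with the training length. `StepLR` multiplies the learning rate by `gamma` once every `step_size` scheduler steps. The sketch below evaluates that closed form in plain Python, assuming the scheduler is stepped once per epoch (Lightning's default interval); `step_lr` is a hypothetical helper.

```python
def step_lr(base_lr, epoch, step_size=100, gamma=0.1):
    """Closed form of StepLR: base_lr * gamma ** (epoch // step_size)."""
    return base_lr * gamma ** (epoch // step_size)


# Learning rate at a few epochs, using the values from model_example.yaml.
for epoch in (0, 9, 99, 100):
    print(epoch, step_lr(0.001, epoch))
```

With `max_epochs: 10` in `train.yaml`, the learning rate stays at 0.001 for the whole run under this assumption, so `step_size` needs to be lowered if you want the decay to take effect.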

## 1.8 Example config

- configs/preset/example.yaml

The example config combines the dataset, model, and base setting configs.

```yaml
# @package _global_

defaults:
  - base
  - /preset/datasets/db
  - /preset/models/model_example
  - /preset/lightning_modules/base
  - _self_
```

## 1.9 Train config

- configs/train.yaml

Let's look at the config used for training.

```yaml
defaults:
  - _self_
  - preset:
  - override hydra/hydra_logging: disabled
  - override hydra/job_logging: disabled

seed: 42
exp_name: "ocr_training"
project_name: "OCRProject"

wandb: False
exp_version: "v1.0"

resume: null  # "checkpoints/sehwan_20240118_144938/epoch=5-step=4908.ckpt"

trainer:
  max_epochs: 10
  num_sanity_val_steps: 1
  log_every_n_steps: 50
  check_val_every_n_epoch: 1
  deterministic: True
```

## 1.10 Test config

- configs/test.yaml

Let's look at the config used for testing.

```yaml
defaults:
  - _self_
  - preset:
  - override hydra/hydra_logging: disabled
  - override hydra/job_logging: disabled

seed: 42
exp_name: "ocr_training"
project_name: "OCRProject"

wandb: False
exp_version: "v1.0"

checkpoint_path: "checkpoints/sehwan_20240118_144938/epoch=5-step=4908.ckpt"
```

## 1.11 Predict config

- configs/predict.yaml

Let's look at the config used for prediction.

```yaml
defaults:
  - _self_
  - preset:
  - override hydra/hydra_logging: disabled
  - override hydra/job_logging: disabled

seed: 42
exp_name: "ocr_training"

checkpoint_path: "checkpoints/sehwan_20240118_144938/epoch=5-step=4908.ckpt"
minified_json: False
```

# 2. Model

Let's look at how the model is implemented.

1.   Encoder
2.   Decoder
3.   Architecture



## 2.1 Encoder

In [None]:
import torch.nn as nn
import timm


class TimmBackbone(nn.Module):
    def __init__(self, model_name='resnet18', select_features=[1, 2, 3, 4], pretrained=True):
        super(TimmBackbone, self).__init__()
        # Any timm backbone can be plugged in here
        self.model = timm.create_model(model_name, pretrained=pretrained, features_only=True)
        # Select which feature maps are passed to the decoder
        self.select_features = select_features

    def forward(self, x):
        features = self.model(x)
        return [features[i] for i in self.select_features]

## 2.2 Decoder

In [None]:
from itertools import accumulate
import torch.nn as nn


class UNet(nn.Module):
    def __init__(self,
                 in_channels=[64, 128, 256, 512],
                 strides=[4, 8, 16, 32],
                 inner_channels=256,
                 output_channels=64,
                 bias=False):
        super(UNet, self).__init__()

        assert len(strides) == len(in_channels), "Mismatch in 'strides' and 'in_channels' lengths."

        # Build the UNet structure dynamically from the parameters
        # Compute the decoder scale factors
        upscale_factors = [strides[idx] // strides[idx - 1] for idx in range(1, len(strides))]
        outscale_factors = list(accumulate(upscale_factors, lambda x, y: x * y))

        self.upsamples = nn.ModuleList()
        for upscale in upscale_factors:
            self.upsamples.append(nn.Upsample(scale_factor=upscale, mode='nearest'))

        self.inners = nn.ModuleList()
        for in_channel in in_channels:
            self.inners.append(nn.Conv2d(in_channel, inner_channels, kernel_size=1, bias=bias))

        self.outers = nn.ModuleList()
        for outscale in reversed(outscale_factors):
            outer = nn.Sequential(nn.Conv2d(inner_channels, output_channels,
                                            kernel_size=3, padding=1, bias=bias),
                                  nn.Upsample(scale_factor=outscale, mode='nearest'))
            self.outers.append(outer)
        self.outers.append(nn.Conv2d(inner_channels, output_channels, kernel_size=3,
                                     padding=1, bias=bias))

        self.upsamples.apply(self.weights_init)
        self.inners.apply(self.weights_init)
        self.outers.apply(self.weights_init)

    def weights_init(self, m):
        classname = m.__class__.__name__
        if classname.find('Conv') != -1:
            nn.init.kaiming_normal_(m.weight.data)
        elif classname.find('BatchNorm') != -1:
            m.weight.data.fill_(1.)
            m.bias.data.fill_(1e-4)

    def forward(self, features):
        in_features = [inner(feat) for feat, inner in zip(features, self.inners)]

        up_features = []
        up = in_features[-1]
        for i in range(len(in_features) - 1, 0, -1):
            up = self.upsamples[i - 1](up) + in_features[i - 1]
            up_features.append(up)

        out_features = [self.outers[0](in_features[-1])]
        out_features += [outer(feat) for feat, outer in zip(up_features, self.outers[1:])]

        return out_features

## 2.3 Architecture

In [None]:
import torch.nn as nn
from hydra.utils import instantiate
from ocr.models.encoder import get_encoder_by_cfg
from ocr.models.decoder import get_decoder_by_cfg
from ocr.models.head import get_head_by_cfg
from ocr.models.loss import get_loss_by_cfg


class OCRModel(nn.Module):
    def __init__(self, cfg):
        super(OCRModel, self).__init__()
        self.cfg = cfg

        # Instantiate each module
        self.encoder = get_encoder_by_cfg(cfg.encoder)
        self.decoder = get_decoder_by_cfg(cfg.decoder)
        self.head = get_head_by_cfg(cfg.head)
        self.loss = get_loss_by_cfg(cfg.loss)

    def forward(self, images, return_loss=True, **kwargs):
        encoded_features = self.encoder(images)
        decoded_features = self.decoder(encoded_features)
        pred = self.head(decoded_features, return_loss)

        # Compute the loss
        if return_loss:
            loss, loss_dict = self.loss(pred, **kwargs)
            pred.update(loss=loss, loss_dict=loss_dict)

        return pred

    def get_optimizers(self):
        optimizer_config = self.cfg.optimizer
        optimizer = instantiate(optimizer_config, params=self.parameters())

        if 'scheduler' in self.cfg:
            scheduler_config = self.cfg.scheduler
            scheduler = instantiate(scheduler_config, optimizer=optimizer)
            return [optimizer], [scheduler]
        return optimizer

    def get_polygons_from_maps(self, gt, pred):
        return self.head.get_polygons_from_maps(gt, pred)

# 3. Lightning Module

Let's look at the Lightning modules.

## 3.1 Lightning Trainer

In [None]:
import numpy as np
import json
from datetime import datetime
from pathlib import Path
import lightning.pytorch as pl
from tqdm import tqdm
from collections import defaultdict
from collections import OrderedDict
from torch.utils.data import DataLoader
from hydra.utils import instantiate
from ocr.metrics import CLEvalMetric


class OCRPLModule(pl.LightningModule):
    def __init__(self, model, dataset, config):
        super(OCRPLModule, self).__init__()
        self.model = model
        self.dataset = dataset
        self.metric = CLEvalMetric()
        self.config = config

        self.validation_step_outputs = OrderedDict()
        self.test_step_outputs = OrderedDict()
        self.predict_step_outputs = OrderedDict()

    def forward(self, x):
        return self.model(return_loss=False, **x)

    def training_step(self, batch, batch_idx):
        pred = self.model(**batch)
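        # len(batch) would count the dict keys, so read the size from the image tensor
        batch_size = len(batch['images'])
        self.log('train/loss', pred['loss'], batch_size=batch_size)
        for key, value in pred['loss_dict'].items():
            self.log(f'train/{key}', value, batch_size=batch_size)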
        return pred

    def validation_step(self, batch, batch_idx):
        pred = self.model(**batch)
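        # len(batch) would count the dict keys, so read the size from the image tensor
        batch_size = len(batch['images'])
        self.log('val/loss', pred['loss'], batch_size=batch_size)
        for key, value in pred['loss_dict'].items():
            self.log(f'val/{key}', value, batch_size=batch_size)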

        boxes_batch, _ = self.model.get_polygons_from_maps(batch, pred)
        for idx, boxes in enumerate(boxes_batch):
            self.validation_step_outputs[batch['image_filename'][idx]] = boxes
        return pred

    def on_validation_epoch_end(self):
        cleval_metrics = defaultdict(list)

        for gt_filename, gt_words in tqdm(self.dataset['val'].anns.items(), desc="Evaluation"):
            if gt_filename not in self.validation_step_outputs:
                # TODO: Check if this is on_sanity?
                cleval_metrics['recall'].append(np.array(0., dtype=np.float32))
                cleval_metrics['precision'].append(np.array(0., dtype=np.float32))
                cleval_metrics['hmean'].append(np.array(0., dtype=np.float32))
                continue

            pred = self.validation_step_outputs[gt_filename]
            det_quads = [[point for coord in polygons for point in coord]
                         for polygons in pred]
            gt_quads = [item.squeeze().reshape(-1) for item in gt_words]

            self.metric(det_quads, gt_quads)
            cleval = self.metric.compute()
            cleval_metrics['recall'].append(cleval['det_r'].cpu().numpy())
            cleval_metrics['precision'].append(cleval['det_p'].cpu().numpy())
            cleval_metrics['hmean'].append(cleval['det_h'].cpu().numpy())
            self.metric.reset()

        recall = np.mean(cleval_metrics['recall'])
        precision = np.mean(cleval_metrics['precision'])
        hmean = np.mean(cleval_metrics['hmean'])

        self.log('val/recall', recall, on_epoch=True, prog_bar=True)
        self.log('val/precision', precision, on_epoch=True, prog_bar=True)
        self.log('val/hmean', hmean, on_epoch=True, prog_bar=True)

        self.validation_step_outputs.clear()

    def test_step(self, batch):
        pred = self.model(return_loss=False, **batch)

        boxes_batch, _ = self.model.get_polygons_from_maps(batch, pred)
        for idx, boxes in enumerate(boxes_batch):
            self.test_step_outputs[batch['image_filename'][idx]] = boxes
        return pred

    def on_test_epoch_end(self):
        cleval_metrics = defaultdict(list)

        for gt_filename, gt_words in tqdm(self.dataset['test'].anns.items(), desc="Evaluation"):
            pred = self.test_step_outputs[gt_filename]
            det_quads = [[point for coord in polygons for point in coord]
                         for polygons in pred]
            gt_quads = [item.squeeze().reshape(-1) for item in gt_words]

            self.metric(det_quads, gt_quads)
            cleval = self.metric.compute()
            cleval_metrics['recall'].append(cleval['det_r'].cpu().numpy())
            cleval_metrics['precision'].append(cleval['det_p'].cpu().numpy())
            cleval_metrics['hmean'].append(cleval['det_h'].cpu().numpy())
            self.metric.reset()

        recall = np.mean(cleval_metrics['recall'])
        precision = np.mean(cleval_metrics['precision'])
        hmean = np.mean(cleval_metrics['hmean'])

        self.log('test/recall', recall, on_epoch=True, prog_bar=True)
        self.log('test/precision', precision, on_epoch=True, prog_bar=True)
        self.log('test/hmean', hmean, on_epoch=True, prog_bar=True)

        self.test_step_outputs.clear()

    def predict_step(self, batch):
        pred = self.model(return_loss=False, **batch)
        boxes_batch, _ = self.model.get_polygons_from_maps(batch, pred)

        for idx, boxes in enumerate(boxes_batch):
            self.predict_step_outputs[batch['image_filename'][idx]] = boxes
        return pred

    def on_predict_epoch_end(self):
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        submission_file = Path(f"{self.config.submission_dir}") / f"{timestamp}.json"
        submission_file.parent.mkdir(parents=True, exist_ok=True)

        submission = OrderedDict(images=OrderedDict())
        for filename, pred_boxes in self.predict_step_outputs.items():
            # Separate box
            boxes = OrderedDict()
            for idx, box in enumerate(pred_boxes):
                boxes[f'{idx + 1:04}'] = OrderedDict(points=box)

            # Append box
            submission['images'][filename] = OrderedDict(words=boxes)

        # Export submission
        with submission_file.open("w") as fp:
            if self.config.minified_json:
                json.dump(submission, fp, indent=None, separators=(',', ':'))
            else:
                json.dump(submission, fp, indent=4)

        self.predict_step_outputs.clear()

    def configure_optimizers(self):
        return self.model.get_optimizers()
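
The submission file written by `on_predict_epoch_end` has a simple nested layout. The sketch below rebuilds it with the same key format for a single prediction so the structure is easy to see. The filename is hypothetical, and the box is taken from the post-processing output in section 1.6.

```python
import json
from collections import OrderedDict

# Hypothetical filename; the box is one detection from section 1.6.
predictions = {
    "receipt_0001.jpg": [[[840, 1264], [866, 1264], [866, 1280], [840, 1280]]],
}

submission = OrderedDict(images=OrderedDict())
for filename, pred_boxes in predictions.items():
    words = OrderedDict()
    for idx, box in enumerate(pred_boxes):
        # Word keys are 1-based and zero-padded to four digits.
        words[f"{idx + 1:04}"] = OrderedDict(points=box)
    submission["images"][filename] = OrderedDict(words=words)

# minified_json: True drops all whitespace; False writes an indented file.
minified = json.dumps(submission, indent=None, separators=(",", ":"))
pretty = json.dumps(submission, indent=4)

print(minified)
```

Both variants decode to the same data; the minified form only reduces the file size, which can matter when an image contains hundreds of boxes.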

## 3.2 Lightning Data Module

In [None]:
class OCRDataPLModule(pl.LightningDataModule):
    def __init__(self, dataset, config):
        super(OCRDataPLModule, self).__init__()
        self.dataset = dataset
        self.config = config
        self.collate_fn = instantiate(self.config.collate_fn)

    def train_dataloader(self):
        train_loader_config = self.config.dataloaders.train_dataloader
        self.collate_fn.inference_mode = False
        return DataLoader(self.dataset['train'], collate_fn=self.collate_fn, **train_loader_config)

    def val_dataloader(self):
        val_loader_config = self.config.dataloaders.val_dataloader
        self.collate_fn.inference_mode = False
        return DataLoader(self.dataset['val'], collate_fn=self.collate_fn, **val_loader_config)

    def test_dataloader(self):
        test_loader_config = self.config.dataloaders.test_dataloader
        self.collate_fn.inference_mode = False
        return DataLoader(self.dataset['test'], collate_fn=self.collate_fn, **test_loader_config)

    def predict_dataloader(self):
        predict_loader_config = self.config.dataloaders.predict_dataloader
        self.collate_fn.inference_mode = True
        return DataLoader(self.dataset['predict'], collate_fn=self.collate_fn,
                          **predict_loader_config)

# 4. Train

Let's look at how to start training from the command line and how the training script is structured.

In [None]:
!python runners/train.py preset=example

/usr/local/lib/python3.10/dist-packages/lightning/pytorch/callbacks/model_checkpoint.py:639: Checkpoint directory /content/baseline_code/outputs/ocr_training/checkpoints exists and is not empty.
Sanity Checking DataLoader 0: 100% 1/1 [00:01<00:00,  1.38s/it]
Evaluation:   0% 0/4 [00:00<?, ?it/s]
Evaluation:  25% 1/4 [00:00<00:01,  2.01it/s]
Evaluation: 100%

```yaml
defaults:
  - _self_
  - preset: example
  - override hydra/hydra_logging: disabled
  - override hydra/job_logging: disabled

seed: 42
exp_name: "ocr_training"
project_name: "OCRProject"

wandb: False
exp_version: "v1.0"

resume: null  # "checkpoints/sehwan_20240118_144938/epoch=5-step=4908.ckpt"

trainer:
  max_epochs: 10
  num_sanity_val_steps: 1
  log_every_n_steps: 50
  check_val_every_n_epoch: 1
  deterministic: True
```

```python
@hydra.main(config_path=CONFIG_DIR, config_name='train', version_base='1.2')
def train(config):
    model_module, data_module = get_pl_modules_by_cfg(config)

    trainer = pl.Trainer(
        **config.trainer,
        logger=logger,
        callbacks=callbacks
    )

    trainer.fit(
        model_module,
        data_module,
        ckpt_path=config.get("resume", None),
    )
    trainer.test(
        model_module,
        data_module,
    )


if __name__ == "__main__":
    train()
```

# 5. Test, Predict

Let's look at how to run Test and Predict from the command line and how their scripts are structured.

In [None]:
!python runners/test.py preset=example "checkpoint_path='lib/dbnet-baseline.ckpt'"

Testing DataLoader 0: 100% 1/1 [00:00<00:00,  2.16it/s]
Evaluation:   0% 0/4 [00:00<?, ?it/s]
Evaluation:  25% 1/4 [00:01<00:03,  1.14s/it]
Evaluation:  50% 2/4 [00:01<00:01,  1.47it/s]
Evaluation:  75% 3/4 [00:02<00:00,  1.62it/s]
Evaluation: 100% 4/4 [00:02<00:00,  1.65it/s]
Testing DataLoader 0: 100% 1/1 [00:02<00:00,  2.91s/it]
┏━━━━━━━━━━━━━━━━━━

```python
@hydra.main(config_path=CONFIG_DIR, config_name='test', version_base='1.2')
def test(config):
    model_module, data_module = get_pl_modules_by_cfg(config)

    trainer = pl.Trainer(
        logger=logger,
    )

    trainer.test(
        model_module,
        data_module,
        ckpt_path=config.get("checkpoint_path", None),
    )


if __name__ == "__main__":
    test()
```

In [None]:
!python runners/predict.py preset=example "checkpoint_path='lib/dbnet-baseline.ckpt'"

Predicting DataLoader 0: 100% 4/4 [00:00<00:00,  6.38it/s]


```python
@hydra.main(config_path=CONFIG_DIR, config_name='predict', version_base='1.2')
def predict(config):
    model_module, data_module = get_pl_modules_by_cfg(config)

    trainer = pl.Trainer()

    pred = trainer.predict(model_module,
                           data_module,
                           ckpt_path=config.get("checkpoint_path"),
                           )


if __name__ == "__main__":
    predict()
```

In [None]:
#@markdown ### Free memory

#@markdown - Releases unused resources to free up memory.
#@markdown ---

import gc

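# Drop the references first so that cached GPU memory can actually be released.
for name in ("encoder", "decoder", "head", "head_output", "loss_fn"):
  globals().pop(name, None)  # Ignore names that were never defined

gc.collect()

if torch.cuda.is_available():
  torch.cuda.empty_cache()

print("Clear!")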

Clear!


# 6. Tips

1.   Logging
2.   Swapping the encoder
3.   Swapping the decoder
4.   Swapping the head
5.   Dataset augmentation
6.   Improving post-processing
7.   Customize model



## 6.1 Logging

```bash
!python runners/train.py preset=example exp_name=sehwan wandb=True

# https://wandb.ai/sehwan-joo_up/OCRProject?workspace=user-sehwan-joo_up
```

```bash
!python runners/train.py preset=example exp_name=sehwan wandb=False
!tensorboard --logdir={logging_path} --port={port_number}
```

## 6.2 Swapping the encoder

In [None]:
import torch.nn as nn
import timm


class TimmBackbone(nn.Module):
    def __init__(self, model_name='resnet18', select_features=(1, 2, 3, 4), pretrained=True):
        super().__init__()
        # Any timm backbone can be used; features_only=True returns the intermediate feature maps
        self.model = timm.create_model(model_name, pretrained=pretrained, features_only=True)
        # Indices of the feature maps to pass on to the decoder
        self.select_features = list(select_features)

    def forward(self, x):
        features = self.model(x)
        return [features[i] for i in self.select_features]

```yaml
# @package _global_

models:
  encoder:
    _target_: ${encoder_path}.TimmBackbone
    model_name: 'convnext_pico.d1_in1k'      # baseline: resnet18
    select_features: [1, 2, 3, 4]            # Feature maps passed to the decoder (convnext_pico needs [0, 1, 2, 3])
    pretrained: true
```

In [None]:
cfg.models.encoder.model_name = 'convnext_pico.d1_in1k'

encoder = instantiate(cfg.models.encoder).to(device)
encoder_features = encoder(data_loaded["images"])
for encoder_feature in encoder_features:
    print(encoder_feature.shape)

IndexError: list index out of range

The default `select_features` of `[1, 2, 3, 4]` was chosen for `resnet18`. `convnext_pico` returns only four feature maps (indices 0 to 3), so index 4 is out of range. Printing every feature map of the backbone confirms this:

In [None]:
for feature in encoder.model(data_loaded["images"]):
    print(feature.shape)

torch.Size([4, 64, 160, 160])
torch.Size([4, 128, 80, 80])
torch.Size([4, 256, 40, 40])
torch.Size([4, 512, 20, 20])


In [None]:
cfg.models.encoder.model_name = 'convnext_pico.d1_in1k'
cfg.models.encoder.select_features = [0, 1, 2, 3]

encoder = instantiate(cfg.models.encoder).to(device)
encoder_features = encoder(data_loaded["images"])
for encoder_feature in encoder_features:
    print(encoder_feature.shape)

torch.Size([4, 64, 160, 160])
torch.Size([4, 128, 80, 80])
torch.Size([4, 256, 40, 40])
torch.Size([4, 512, 20, 20])
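
The `IndexError` above arises because the number of feature maps differs between backbones, so hard-coded indices do not carry over. Below is a minimal sketch that picks indices by downsampling factor instead. It assumes the per-map factors are already known (with timm they can typically be read from `model.feature_info.reduction()`). The helper `select_by_stride` is hypothetical and not part of the Baseline code.

```python
def select_by_stride(reductions, wanted=(4, 8, 16, 32)):
    """Return the indices of the feature maps whose downsampling factor is in `wanted`.

    `reductions` lists the downsampling factor of each feature map, in output order.
    """
    index_of = {reduction: i for i, reduction in enumerate(reductions)}
    missing = [stride for stride in wanted if stride not in index_of]
    if missing:
        raise ValueError(f"backbone has no feature map at stride(s) {missing}")
    return [index_of[stride] for stride in wanted]


# A five-map backbone that also exposes a stride-2 stem (ResNet-like).
resnet_like = select_by_stride([2, 4, 8, 16, 32])
# Factors matching the four ConvNeXt shapes printed above (640 -> 160, 80, 40, 20).
convnext_like = select_by_stride([4, 8, 16, 32])

print(resnet_like)    # [1, 2, 3, 4]
print(convnext_like)  # [0, 1, 2, 3]
```

The two results correspond to the `select_features` values used for `resnet18` and `convnext_pico` in this section.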


## 6.3 Replacing the Decoder

```yaml
# @package _global_

models:
  decoder:
    _target_: ${decoder_path}.UNet
    in_channels: [64, 128, 256, 512]  # [64, 128, 256, 512]
    strides: [4, 8, 16, 32]           # [4, 8, 16, 32]
    inner_channels: 256               # Hidden layer channel
    output_channels: 64               # output layer channel
    bias: False
```

In [None]:
cfg.models.decoder.in_channels = [64, 128, 256, 512]
cfg.models.decoder.strides = [4, 8, 16, 32]
decoder = instantiate(cfg.models.decoder).to(device)
decoder_features = decoder(encoder_features)
for decoder_feature in decoder_features:
    print(decoder_feature.shape)

torch.Size([4, 64, 160, 160])
torch.Size([4, 64, 160, 160])
torch.Size([4, 64, 160, 160])
torch.Size([4, 64, 160, 160])


In [None]:
cfg.models.decoder.output_channels = 256
decoder = instantiate(cfg.models.decoder).to(device)
decoder_features = decoder(encoder_features)
for decoder_feature in decoder_features:
    print(decoder_feature.shape)

torch.Size([4, 256, 160, 160])
torch.Size([4, 256, 160, 160])
torch.Size([4, 256, 160, 160])
torch.Size([4, 256, 160, 160])


## 6.4 Replacing the Head

```yaml
# @package _global_

# See https://arxiv.org/pdf/1911.08947.pdf

models:
  head:
    _target_: ${head_path}.DBHead
    in_channels: 1024                # 256
    upscale: 4                       # 4
    k: 50                            # The amplifying factor
    bias: False                      # Use bias or not in LayerNorm
    smooth: False                    # Use smooth or not in Upsample
    postprocess:
      thresh: 0.3                    # Binarization threshold
      box_thresh: 0.7                # Detection Box threshold
      max_candidates: 300            # Limit the number of detection boxes
      use_polygon: False             # Detection Box Type (QUAD or POLY)
```

In [None]:
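# 4 decoder feature maps x 256 channels each (see the decoder output above)
cfg.models.head.in_channels = 256 * 4
# 160 x 160 feature maps are upscaled by 4 to the 640 x 640 input size
cfg.models.head.upscale = 4

head = instantiate(cfg.models.head).to(device)
with torch.no_grad():
    head_output = head(decoder_features)
    for k, v in head_output.items():
        print(f'{k}: {v.shape}')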

prob_maps: torch.Size([4, 1, 640, 640])
thresh_maps: torch.Size([4, 1, 640, 640])
binary_maps: torch.Size([4, 1, 640, 640])
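
The values `in_channels = 256 * 4` and `upscale = 4` follow from the decoder output shapes printed in section 6.3. Here is a small sketch of that arithmetic, assuming the head fuses the decoder feature maps by channel-wise concatenation. The printed shapes are consistent with this, but the `DBHead` source in the Baseline code is the authority. `head_config_for` is a hypothetical helper.

```python
def head_config_for(decoder_shapes, image_size):
    """Derive head settings from decoder output shapes given as (N, C, H, W) tuples."""
    heights = {shape[2] for shape in decoder_shapes}
    if len(heights) != 1:
        raise ValueError("decoder feature maps must share one spatial size to be concatenated")
    in_channels = sum(shape[1] for shape in decoder_shapes)  # channels add up under concatenation
    upscale = image_size // heights.pop()                    # factor needed to reach the input size
    return {"in_channels": in_channels, "upscale": upscale}


wide = head_config_for([(4, 256, 160, 160)] * 4, image_size=640)
narrow = head_config_for([(4, 64, 160, 160)] * 4, image_size=640)

print(wide)    # {'in_channels': 1024, 'upscale': 4}
print(narrow)  # {'in_channels': 256, 'upscale': 4}
```

The second call uses the 64-channel decoder output from the first cell of section 6.3 and yields `in_channels: 256`, the value noted in the YAML comment above.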


In [None]:
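#@markdown ### Free memory

#@markdown - Releases unused objects and frees up memory.
#@markdown ---

import gc

# Drop the references first; a missing name is skipped instead of aborting the cleanup.
for name in ("encoder", "encoder_features", "decoder", "decoder_features",
             "head", "head_output"):
    globals().pop(name, None)

gc.collect()

# Empty the CUDA cache only after the objects are gone, so their memory can be returned.
if torch.cuda.is_available():
    torch.cuda.empty_cache()

print("Clear!")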

Clear!


## 6.5 Dataset augmentation

```yaml
transforms:
  train_transform:
    _target_: ${dataset_path}.DBTransforms
    transforms:
      - _target_: albumentations.LongestMaxSize
        max_size: 640
        p: 1.0
      - _target_: albumentations.PadIfNeeded
        min_width: 640
        min_height: 640
        border_mode: 0
        p: 1.0
      - _target_: albumentations.HorizontalFlip
        p: 0.5
      - _target_: albumentations.Normalize
        mean: [0.485, 0.456, 0.406]
        std: [0.229, 0.224, 0.225]
      - _target_: albumentations.ShiftScaleRotate    # added augmentation
        shift_limit: 0.5
        scale_limit: [0.5, 1.0]                      # YAML list; "(0.5, 1.0)" would be parsed as a string
    keypoint_params:
      _target_: albumentations.KeypointParams
      format: 'xy'
      remove_invisible: True
```

In [None]:
cfg.transforms.train_transform.transforms.append(
    omegaconf.OmegaConf.create(
        dict(
            _target_='albumentations.ShiftScaleRotate',
            shift_limit=0.5,
            scale_limit=(0.5, 1.0)
        )
    )
)
train_dataset = instantiate(cfg.datasets.train_dataset)

data = train_dataset[0]
data.keys()

odict_keys(['image', 'image_filename', 'shape', 'polygons', 'inverse_matrix'])

## 6.6 Improving post-processing

```yaml
# @package _global_

# See https://arxiv.org/pdf/1911.08947.pdf

models:
  head:
    _target_: ${head_path}.DBHead
    in_channels: 256                 # Input layer channel
    upscale: 4                       # Output layer scale factor
    k: 50                            # The amplifying factor
    bias: False                      # Use bias or not in LayerNorm
    smooth: False                    # Use smooth or not in Upsample
    postprocess:
      thresh: 0.2                    # 0.3
      box_thresh: 0.3                # 0.4
      max_candidates: 1000           # 300
      use_polygon: True              # False
```

In [None]:
cfg.models.head.postprocess.thresh = 0.2
cfg.models.head.postprocess.box_thresh = 0.3
cfg.models.head.postprocess.max_candidates = 1000
cfg.models.head.postprocess.use_polygon = True
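
To illustrate what `thresh`, `box_thresh` and `max_candidates` control, here is a simplified, pure-Python sketch of DB-style post-processing. It is not the Baseline implementation: the real post-processing extracts candidate regions from the binary map (typically via contours), whereas here the candidate boxes are given by hand. All function names are hypothetical.

```python
def binarize(prob_map, thresh):
    """Mark pixels whose text probability exceeds `thresh`."""
    return [[1 if p > thresh else 0 for p in row] for row in prob_map]


def box_score(prob_map, box):
    """Mean probability inside an axis-aligned box (x0, y0, x1, y1), bounds inclusive."""
    x0, y0, x1, y1 = box
    values = [prob_map[y][x] for y in range(y0, y1 + 1) for x in range(x0, x1 + 1)]
    return sum(values) / len(values)


def filter_boxes(prob_map, boxes, box_thresh, max_candidates):
    """Keep at most `max_candidates` candidates, dropping those scoring below `box_thresh`."""
    return [box for box in boxes[:max_candidates] if box_score(prob_map, box) >= box_thresh]


prob_map = [
    [0.9, 0.8, 0.1, 0.0],
    [0.9, 0.8, 0.1, 0.0],
    [0.0, 0.0, 0.4, 0.3],
    [0.0, 0.0, 0.4, 0.3],
]
candidates = [(0, 0, 1, 1), (2, 2, 3, 3)]  # a confident region and a faint one

strict = filter_boxes(prob_map, candidates, box_thresh=0.7, max_candidates=300)
loose = filter_boxes(prob_map, candidates, box_thresh=0.3, max_candidates=1000)
print(strict)  # [(0, 0, 1, 1)]
print(loose)   # [(0, 0, 1, 1), (2, 2, 3, 3)]
```

Lowering `thresh` enlarges the binarized regions and lowering `box_thresh` keeps fainter boxes, which tends to trade precision for recall.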

## 6.7 Customize model

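This example customizes the Encoder.

Create the example class below under the `ocr/models/encoder/` folder (as `vgg16.py`, to match the `_target_` used later in this section).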

In [None]:
import torch

def conv_layer(chann_in, chann_out, k_size, p_size):
    # Conv -> BatchNorm -> ReLU
    layer = torch.nn.Sequential(
        torch.nn.Conv2d(chann_in, chann_out, kernel_size=k_size, padding=p_size),
        torch.nn.BatchNorm2d(chann_out),
        torch.nn.ReLU()
    )
    return layer

def vgg_conv_block(in_list, out_list, k_list, p_list, pooling_k, pooling_s):
    # Stack of conv layers followed by one max-pooling layer
    layers = [conv_layer(in_list[i], out_list[i], k_list[i], p_list[i]) for i in range(len(in_list))]
    layers += [torch.nn.MaxPool2d(kernel_size=pooling_k, stride=pooling_s)]
    return torch.nn.Sequential(*layers)

def vgg_fc_layer(size_in, size_out):
    # Linear -> BatchNorm -> ReLU (not used by the encoder below)
    layer = torch.nn.Sequential(
        torch.nn.Linear(size_in, size_out),
        torch.nn.BatchNorm1d(size_out),
        torch.nn.ReLU()
    )
    return layer

class VGG16(torch.nn.Module):
    def __init__(self):
        super(VGG16, self).__init__()

        # Conv blocks (BatchNorm + ReLU activation added in each block)
        self.layer1 = vgg_conv_block([3,64], [64,64], [3,3], [1,1], 2, 2)
        self.layer2 = vgg_conv_block([64,128], [128,128], [3,3], [1,1], 2, 2)
        self.layer3 = vgg_conv_block([128,256,256], [256,256,256], [3,3,3], [1,1,1], 2, 2)
        self.layer4 = vgg_conv_block([256,512,512], [512,512,512], [3,3,3], [1,1,1], 2, 2)
        self.layer5 = vgg_conv_block([512,512,512], [512,512,512], [3,3,3], [1,1,1], 2, 2)

    def forward(self, x):
        features = [self.layer1(x)]
        features.append(self.layer2(features[-1]))
        features.append(self.layer3(features[-1]))
        features.append(self.layer4(features[-1]))
        features.append(self.layer5(features[-1]))

        return features

In [None]:
vgg_encoder = VGG16().to(device)
vgg_output = vgg_encoder(data_loaded["images"])
for v in vgg_output:
    print(v.shape)

torch.Size([4, 64, 320, 320])
torch.Size([4, 128, 160, 160])
torch.Size([4, 256, 80, 80])
torch.Size([4, 512, 40, 40])
torch.Size([4, 512, 20, 20])


In [None]:
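#@markdown ### Free memory

#@markdown - Releases unused objects and frees up memory.
#@markdown ---

import gc

# Drop the references first; a missing name is skipped instead of aborting the cleanup.
for name in ("vgg_encoder", "vgg_output"):
    globals().pop(name, None)

gc.collect()

# Empty the CUDA cache only after the objects are gone, so their memory can be returned.
if torch.cuda.is_available():
    torch.cuda.empty_cache()

print("Clear!")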

Clear!


In [None]:
# Point the encoder config at the custom class and drop the timm-only arguments
cfg.models.encoder._target_ = cfg.encoder_path + '.vgg16.VGG16'
del cfg.models.encoder.model_name
del cfg.models.encoder.select_features
del cfg.models.encoder.pretrained

# VGG16 returns five feature maps, so the decoder needs five channel/stride entries
cfg.models.decoder.in_channels = [64, 128, 256, 512, 512]
cfg.models.decoder.strides = [2, 4, 8, 16, 32]

with torch.no_grad():
    encoder = instantiate(cfg.models.encoder).to(device)
    encoder_features = encoder(data_loaded["images"])
    for encoder_feature in encoder_features:
        print(f'encoder feature shape: {encoder_feature.shape}')

    decoder = instantiate(cfg.models.decoder).to(device)
    decoder_features = decoder(encoder_features)
    for decoder_feature in decoder_features:
        print(f'decoder feature shape: {decoder_feature.shape}')

encoder feature shape: torch.Size([4, 64, 320, 320])
encoder feature shape: torch.Size([4, 128, 160, 160])
encoder feature shape: torch.Size([4, 256, 80, 80])
encoder feature shape: torch.Size([4, 512, 40, 40])
encoder feature shape: torch.Size([4, 512, 20, 20])
decoder feature shape: torch.Size([4, 256, 320, 320])
decoder feature shape: torch.Size([4, 256, 320, 320])
decoder feature shape: torch.Size([4, 256, 320, 320])
decoder feature shape: torch.Size([4, 256, 320, 320])
decoder feature shape: torch.Size([4, 256, 320, 320])
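
The decoder `strides` of `[2, 4, 8, 16, 32]` used for VGG16 follow from each conv block ending in a 2 x 2 max-pooling with stride 2. Here is a minimal sketch of that arithmetic, assuming a 640 x 640 input as in the shapes above. `pooled_feature_sizes` is a hypothetical helper, not part of the Baseline code.

```python
def pooled_feature_sizes(image_size, num_blocks, pool_stride=2):
    """Return (cumulative stride, spatial size) after each block that ends in pooling."""
    results = []
    stride = 1
    for _ in range(num_blocks):
        stride *= pool_stride  # every block halves the resolution once
        results.append((stride, image_size // stride))
    return results


vgg_sizes = pooled_feature_sizes(640, num_blocks=5)
for stride, size in vgg_sizes:
    print(f"stride {stride}: {size} x {size}")

# The cumulative strides are what the decoder config expects in `strides`.
decoder_strides = [stride for stride, _ in vgg_sizes]
print(decoder_strides)  # [2, 4, 8, 16, 32]
```

The same calculation can be used to derive `strides` for any custom encoder before wiring it to the decoder.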
