
![](https://www.googleapis.com/download/storage/v1/b/kaggle-forum-message-attachments/o/inbox%2F518134%2F68421364ae2731375c0f59fd1749c845%2Fpexels-ivan-samkov-4989186.jpg?generation=1611197793386796&alt=media)
<div style="text-align:center;"><cite>Image from <a href="https://www.pexels.com/ja-jp/photo/4989186/">https://www.pexels.com/ja-jp/photo/4989186/</a></cite></div>

<br/>

# VinBigData 2-class classifier complete pipeline

This competition is an object detection task: find the class and location of thoracic abnormalities in chest X-ray images (radiographs).

However, as the kernel and discussion below point out, training a 2-class classifier that identifies normal images is important for achieving a high score.
 - Kernel: [VinBigData 🌟2 Class Filter🌟](https://www.kaggle.com/awsaf49/vinbigdata-2-class-filter)
 - Discussion: [[LB0.155] baseline solution](https://www.kaggle.com/c/vinbigdata-chest-xray-abnormalities-detection/discussion/208837)

Here, I will introduce a complete **EDA, training (with 5-fold cross validation) and prediction pipeline** for a 2-class classifier.

You can learn how to use the following tools to accelerate deep learning tasks in computer vision!
 - [pytorch](https://github.com/pytorch/pytorch): Deep learning framework, popular among researchers for its flexibility. No need to explain in detail!
 - [albumentations](https://github.com/albumentations-team/albumentations): Image augmentation library, developed by well-known Kagglers!
 - [timm](https://github.com/rwightman/pytorch-image-models): pytorch-image-models, which provides many popular SoTA CNN models with pretrained weights.
 - [pytorch ignite](https://github.com/pytorch/ignite): Training/evaluation abstraction framework on top of pytorch.
 - [pytorch pfn extras](https://github.com/pfnet/pytorch-pfn-extras): Adds feature-rich functionality on top of Ignite.

# Table of Contents

**[Dataset preparation](#dataset)** <br/>
**[Installation](#installation)** <br/>
**[EDA: distribution between normal & abnormal class](#eda)** <br/>
**[Image visualization & augmentation with albumentations](#aug)** <br/>
**[Defining CNN models](#model)** <br/>
**[Training utils](#trainutil)** <br/>
**[Training scripts](#trainscript)** <br/>
**[Prediction on validation & test dataset](#prediction)** <br/>
**[Next step](#nextstep)** <br/>

<a id="dataset"></a>
# Dataset preparation

Preprocessing of the x-ray images from DICOM format into regular PNG format has already been done by @xhlulu in the discussion below:
 - [Multiple preprocessed datasets: 256/512/1024px, PNG and JPG, modified and original ratio](https://www.kaggle.com/c/vinbigdata-chest-xray-abnormalities-detection/discussion/207955).

Here I will just use the dataset [VinBigData Chest X-ray Resized PNG (256x256)](https://www.kaggle.com/xhlulu/vinbigdata-chest-xray-resized-png-256x256) to skip the preprocessing and focus on the modeling part. Please upvote the dataset as well!

In [None]:
import gc
import os
from pathlib import Path
import random
import sys

from tqdm.notebook import tqdm
import numpy as np
import pandas as pd
import scipy as sp


import matplotlib.pyplot as plt
import seaborn as sns

from IPython.display import display, HTML

# --- plotly ---
from plotly import tools, subplots
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.express as px
import plotly.figure_factory as ff
import plotly.io as pio
pio.templates.default = "plotly_dark"

# --- models ---
from sklearn import preprocessing
from sklearn.model_selection import KFold
import lightgbm as lgb
import xgboost as xgb
import catboost as cb
import torch

# --- setup ---
pd.set_option('display.max_columns', 50)


<a id="installation"></a>
# Installation

detectron2 is not pre-installed in this Kaggle docker image, so let's install it.
Following the [installation instructions](https://github.com/facebookresearch/detectron2/blob/master/INSTALL.md), we need to know the CUDA and pytorch versions to install the correct `detectron2` build.

In [None]:
!pip install detectron2 -f \
  https://dl.fbaipublicfiles.com/detectron2/wheels/cu102/torch1.7/index.html
!pip install pytorch-pfn-extras timm

This `Flags` class summarizes all the configuration available during training.

As I will show later, you can change various hyperparameters to experiment with improving your models!

In [None]:
from typing import Any
import yaml

def save_yaml(filepath: str, content: Any, width: int = 120):
    with open(filepath, "w") as f:
        yaml.dump(content, f, width=width)

In [None]:
from dataclasses import dataclass, field
from typing import Dict, Any, Tuple, Union, List


@dataclass
class Flags:
    # General
    debug: bool = True
    outdir: str = "results/det"
    device: str = "cuda:0"

    # Data config
    imgdir_name: str = "vinbigdata-chest-xray-resized-png-256x256"
    seed: int = 111
    target_fold: int = 0  # 0~4
    # Model config
    model_name: str = "resnet18"
    # Training config
    epoch: int = 20
    batchsize: int = 8
    valid_batchsize: int = 16
    num_workers: int = 4
    snapshot_freq: int = 5
    ema_decay: float = 0.999  # EMA is used only when this is positive; set <= 0 to disable it.
    scheduler_type: str = ""
    scheduler_kwargs: Dict[str, Any] = field(default_factory=lambda: {})
    scheduler_trigger: List[Union[int, str]] = field(default_factory=lambda: [1, "iteration"])

    def update(self, param_dict: Dict) -> "Flags":
        # Overwrite by `param_dict`
        for key, value in param_dict.items():
            if not hasattr(self, key):
                raise ValueError(f"[ERROR] Unexpected key for flag = {key}")
            setattr(self, key, value)
        return self
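
As a rough illustration of how `update` behaves, here is a minimal, self-contained sketch using a trimmed-down stand-in class. `MiniFlags` is hypothetical and keeps only three fields; it is not the `Flags` class above. Known keys are overwritten, while an unknown key raises `ValueError`, so a typo in a flag name cannot silently slip into a run.

```python
from dataclasses import dataclass, asdict
from typing import Any, Dict


@dataclass
class MiniFlags:
    # Hypothetical, reduced copy of `Flags` used only for this illustration.
    debug: bool = True
    model_name: str = "resnet18"
    epoch: int = 20

    def update(self, param_dict: Dict[str, Any]) -> "MiniFlags":
        # Same logic as `Flags.update`: reject keys that are not declared fields.
        for key, value in param_dict.items():
            if not hasattr(self, key):
                raise ValueError(f"[ERROR] Unexpected key for flag = {key}")
            setattr(self, key, value)
        return self


mini_flags = MiniFlags().update({"debug": False, "epoch": 15})
mini_flags_dict = asdict(mini_flags)

try:
    MiniFlags().update({"epochs": 15})  # typo: "epochs" is not a field
    typo_rejected = False
except ValueError:
    typo_rejected = True
```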


In [None]:
flags_dict = {
    "debug": False,  # Change to True for fast debug run!
    "outdir": "results/tmp_debug",
    # Data
    "imgdir_name": "vinbigdata-chest-xray-resized-png-256x256",
    # Model
    "model_name": "resnet18",
    # Training
    "num_workers": 4,
    "epoch": 15,
    "batchsize": 8,
    "scheduler_type": "CosineAnnealingWarmRestarts",
    "scheduler_kwargs": {"T_0": 28125},  # 15000 * 15 epoch // (batchsize=8)
    "scheduler_trigger": [1, "iteration"]
}
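
The `T_0` value comes from simple arithmetic, sketched below. The 15000-image count is taken from the comment in the cell above. The 4/5 factor is an assumption: it supposes that one fold of a 5-fold split is held out for validation, as done later in this notebook. Under that assumption the training loop runs fewer iterations than `T_0`, so the cosine curve would stop before reaching its minimum and no restart would occur.

```python
# Arithmetic behind the `T_0` setting (numbers from the comment above).
n_images = 15000   # image count quoted in the flags_dict comment
n_epochs = 15
batchsize = 8

# One cosine cycle spanning every iteration if all images were used for training.
t_0_all_images = n_images * n_epochs // batchsize

# Assumption: with 5-fold cross validation, 4/5 of the images are trained on.
n_train_images = n_images * 4 // 5
total_train_iterations = (n_train_images // batchsize) * n_epochs

print(t_0_all_images, total_train_iterations)
```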

In [None]:
import dataclasses

# args = parse()
print("torch", torch.__version__)
flags = Flags().update(flags_dict)
print("flags", flags)
debug = flags.debug
outdir = Path(flags.outdir)
os.makedirs(str(outdir), exist_ok=True)
flags_dict = dataclasses.asdict(flags)
save_yaml(str(outdir / "flags.yaml"), flags_dict)

# --- Read data ---
inputdir = Path("/kaggle/input")
datadir = inputdir / "vinbigdata-chest-xray-abnormalities-detection"
imgdir = inputdir / flags.imgdir_name

# Read in the data CSV files
train = pd.read_csv(datadir / "train.csv")
# sample_submission = pd.read_csv(datadir / 'sample_submission.csv')

<a id="eda"></a>
# EDA: distribution between normal & abnormal class

First, let's check how many normal images exist in the training data.
A normal image is labeled with "class_name = No finding" and "class_id = 14".

However, be aware that 3 radiologists annotated each image, so you will find 3 annotations per image, as you can see below.

In [None]:
train.query("image_id == '50a418190bc3fb1ef1633bf9678929b3'")

So the question arises: are there images on which the 3 radiologists' opinions differ?

Let's check the number of "No finding" annotations for each image. If the opinions are in complete agreement, the number of "No finding" annotations should be either **0 -> Abnormal (no radiologist thinks this is normal)** or **3 -> Normal (all radiologists think this is normal)**.

In [None]:
is_normal_df = train.groupby("image_id")["class_id"].agg(lambda s: (s == 14).sum()).reset_index().rename({"class_id": "num_normal_annotations"}, axis=1)
is_normal_df.head()
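
To make the aggregation concrete, here is a plain-Python sketch of the same count on a few toy rows. The image ids `img_a` and `img_b` and their annotations are made up for illustration; only the rule (count the rows whose `class_id` is 14) comes from the cell above.

```python
from collections import Counter

NO_FINDING_CLASS_ID = 14

# Toy (image_id, class_id) rows, one per radiologist annotation (made-up data).
toy_rows = [
    ("img_a", 14), ("img_a", 14), ("img_a", 14),  # all three say "No finding"
    ("img_b", 0), ("img_b", 3), ("img_b", 3),     # abnormal findings only
]

# Start every image at 0 so abnormal images are kept in the result.
num_normal_annotations = Counter({image_id: 0 for image_id, _ in toy_rows})
for image_id, class_id in toy_rows:
    if class_id == NO_FINDING_CLASS_ID:
        num_normal_annotations[image_id] += 1
```

With full agreement between annotators, every image ends up with a count of either 0 or 3.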

We can confirm that **the 3 radiologists' opinions always match** on the normal vs. abnormal diagnosis.

[Note] I noticed that this does not hold for the other classes, i.e., the 3 radiologists sometimes disagree on which thoracic abnormality is present and where it is located.

In [None]:
# Number of "No finding" annotations in each image
num_normal_anno_counts = is_normal_df["num_normal_annotations"].value_counts()
num_normal_anno_counts.plot(kind="bar")
plt.title("The number of 'No finding' annotations in each image")

In [None]:
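# Name the columns explicitly so the result is the same across pandas versions.
num_normal_anno_counts_df = num_normal_anno_counts.rename_axis("index").reset_index(name="num_normal_annotations")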
num_normal_anno_counts_df["name"] = num_normal_anno_counts_df["index"].map({0: "Abnormal", 3: "Normal"})
num_normal_anno_counts_df

So almost 70% of the data are actually "Normal" X-ray images.

Only about 30% of the images need thoracic abnormality localization.

In [None]:
px.pie(num_normal_anno_counts_df, values="num_normal_annotations", names="name", title="Normal/Abnormal ratio")

<a id="aug"></a>
# Image visualization & augmentation with albumentations

When you train CNN models, image augmentation is important to keep the model from overfitting.<br/>
I'll show examples of how to use Albumentations to run image augmentation very easily.<br/>
First, I will define a pytorch Dataset class for this competition, which will also be used later in training.

In [None]:
import pickle
from pathlib import Path
from typing import Optional

import cv2
import numpy as np
import pandas as pd
from detectron2.structures import BoxMode
from tqdm import tqdm


def get_vinbigdata_dicts(
    imgdir: Path,
    train_df: pd.DataFrame,
    train_data_type: str = "original",
    use_cache: bool = True,
    debug: bool = True,
    target_indices: Optional[np.ndarray] = None,
):
    debug_str = f"_debug{int(debug)}"
    train_data_type_str = f"_{train_data_type}"
    cache_path = Path(".") / f"dataset_dicts_cache{train_data_type_str}{debug_str}.pkl"
    if not use_cache or not cache_path.exists():
        print("Creating data...")
        train_meta = pd.read_csv(imgdir / "train_meta.csv")
        if debug:
            train_meta = train_meta.iloc[:500]  # For debug....

        # Load 1 image to get image size.
        image_id = train_meta.loc[0, "image_id"]
        image_path = str(imgdir / "train" / f"{image_id}.png")
        image = cv2.imread(image_path)
        resized_height, resized_width, ch = image.shape
        print(f"image shape: {image.shape}")

        dataset_dicts = []
        for index, train_meta_row in tqdm(train_meta.iterrows(), total=len(train_meta)):
            record = {}

            image_id, height, width = train_meta_row.values
            filename = str(imgdir / "train" / f"{image_id}.png")
            record["file_name"] = filename
            record["image_id"] = image_id
            record["height"] = resized_height
            record["width"] = resized_width
            objs = []
            for index2, row in train_df.query("image_id == @image_id").iterrows():
                # print(row)
                # print(row["class_name"])
                # class_name = row["class_name"]
                class_id = row["class_id"]
                if class_id == 14:
                    # It is "No finding"
                    # This annotator does not find anything, skip.
                    pass
                else:
                    # bbox_original = [int(row["x_min"]), int(row["y_min"]), int(row["x_max"]), int(row["y_max"])]
                    h_ratio = resized_height / height
                    w_ratio = resized_width / width
                    bbox_resized = [
                        int(row["x_min"]) * w_ratio,
                        int(row["y_min"]) * h_ratio,
                        int(row["x_max"]) * w_ratio,
                        int(row["y_max"]) * h_ratio,
                    ]
                    obj = {
                        "bbox": bbox_resized,
                        "bbox_mode": BoxMode.XYXY_ABS,
                        "category_id": class_id,
                    }
                    objs.append(obj)
            record["annotations"] = objs
            dataset_dicts.append(record)
        with open(cache_path, mode="wb") as f:
            pickle.dump(dataset_dicts, f)

    print(f"Load from cache {cache_path}")
    with open(cache_path, mode="rb") as f:
        dataset_dicts = pickle.load(f)
    if target_indices is not None:
        dataset_dicts = [dataset_dicts[i] for i in target_indices]
    return dataset_dicts


def get_vinbigdata_dicts_test(
    imgdir: Path, test_meta: pd.DataFrame, use_cache: bool = True, debug: bool = True,
):
    debug_str = f"_debug{int(debug)}"
    cache_path = Path(".") / f"dataset_dicts_cache_test{debug_str}.pkl"
    if not use_cache or not cache_path.exists():
        print("Creating data...")
        # test_meta = pd.read_csv(imgdir / "test_meta.csv")
        if debug:
            test_meta = test_meta.iloc[:500]  # For debug....

        # Load 1 image to get image size.
        image_id = test_meta.loc[0, "image_id"]
        image_path = str(imgdir / "test" / f"{image_id}.png")
        image = cv2.imread(image_path)
        resized_height, resized_width, ch = image.shape
        print(f"image shape: {image.shape}")

        dataset_dicts = []
        for index, test_meta_row in tqdm(test_meta.iterrows(), total=len(test_meta)):
            record = {}

            image_id, height, width = test_meta_row.values
            filename = str(imgdir / "test" / f"{image_id}.png")
            record["file_name"] = filename
            # record["image_id"] = index
            record["image_id"] = image_id
            record["height"] = resized_height
            record["width"] = resized_width
            # objs = []
            # record["annotations"] = objs
            dataset_dicts.append(record)
        with open(cache_path, mode="wb") as f:
            pickle.dump(dataset_dicts, f)

    print(f"Load from cache {cache_path}")
    with open(cache_path, mode="rb") as f:
        dataset_dicts = pickle.load(f)
    return dataset_dicts
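
The box rescaling inside `get_vinbigdata_dicts` can be isolated as a small helper, sketched below. `rescale_bbox` is a hypothetical name, not a function from this notebook or from detectron2, and the example sizes are made up.

```python
from typing import List, Sequence


def rescale_bbox(
    bbox_xyxy: Sequence[float],
    orig_height: int,
    orig_width: int,
    resized_height: int,
    resized_width: int,
) -> List[float]:
    """Map an (x_min, y_min, x_max, y_max) box from original to resized pixels."""
    h_ratio = resized_height / orig_height
    w_ratio = resized_width / orig_width
    x_min, y_min, x_max, y_max = bbox_xyxy
    # x coordinates scale with width, y coordinates with height.
    return [x_min * w_ratio, y_min * h_ratio, x_max * w_ratio, y_max * h_ratio]


# Made-up example: a 2048 x 1024 (height x width) scan resized to 256 x 256.
resized_box = rescale_bbox([512, 512, 768, 1024], 2048, 1024, 256, 256)
```

Because the resized images do not keep the original aspect ratio, the two axes use different ratios.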


In [None]:
"""
Referenced `chainer.dataset.DatasetMixin` to work with pytorch Dataset.
"""
import numpy
import torch
from torch.utils.data.dataset import Dataset


class DatasetMixin(Dataset):

    def __init__(self, transform=None):
        self.transform = transform

    def __getitem__(self, index):
        """Returns an example or a sequence of examples."""
        if torch.is_tensor(index):
            index = index.tolist()
        if isinstance(index, slice):
            current, stop, step = index.indices(len(self))
            # Built-in `range` replaces `six.moves.range` (identical on Python 3).
            return [self.get_example_wrapper(i) for i in
                    range(current, stop, step)]
        elif isinstance(index, list) or isinstance(index, numpy.ndarray):
            return [self.get_example_wrapper(i) for i in index]
        else:
            return self.get_example_wrapper(index)

    def __len__(self):
        """Returns the number of data points."""
        raise NotImplementedError

    def get_example_wrapper(self, i):
        """Wrapper of `get_example`, to apply `transform` if necessary"""
        example = self.get_example(i)
        if self.transform:
            example = self.transform(example)
        return example

    def get_example(self, i):
        """Returns the i-th example.

        Implementations should override it. It should raise :class:`IndexError`
        if the index is invalid.

        Args:
            i (int): The index of the example.

        Returns:
            The i-th example.

        """
        raise NotImplementedError


In [None]:
import cv2
import numpy as np


class VinbigdataTwoClassDataset(DatasetMixin):
    def __init__(self, dataset_dicts, image_transform=None, transform=None, train: bool = True):
        super(VinbigdataTwoClassDataset, self).__init__(transform=transform)
        self.dataset_dicts = dataset_dicts
        self.image_transform = image_transform
        self.train = train

    def get_example(self, i):
        d = self.dataset_dicts[i]
        filename = d["file_name"]

        img = cv2.imread(filename)
        if self.image_transform:
            img = self.image_transform(img)
        img = np.transpose(img, (2, 0, 1)).astype(np.float32)
        if self.train:
            label = int(len(d["annotations"]) > 0)  # 0 normal, 1 abnormal
            return img, label
        else:
            # Only return img
            return img,

    def __len__(self):
        return len(self.dataset_dicts)


Now creating the dataset is as easy as the following:

In [None]:
dataset_dicts = get_vinbigdata_dicts(imgdir, train, debug=debug)
dataset = VinbigdataTwoClassDataset(dataset_dicts)

You can access each image and its label (0=Normal, 1=Abnormal) simply by indexing `dataset`.

In [None]:
index = 0
img, label = dataset[index]
plt.imshow(img.transpose((1, 2, 0)) / 255.)
plt.title(f"{index}-th image: label {label}")

To run augmentation on this image, I will define a `Transform` class which is applied each time the data is accessed.

You can refer to the [albumentations](https://github.com/albumentations-team/albumentations) page: many kinds of augmentation are already implemented and can be used very easily!

In [None]:
import albumentations as A


class Transform:
    def __init__(
        self, hflip_prob: float = 0.5, ssr_prob: float = 0.5, random_bc_prob: float = 0.5
    ):
        self.transform = A.Compose(
            [
                A.HorizontalFlip(p=hflip_prob),
                A.ShiftScaleRotate(
                    shift_limit=0.0625, scale_limit=0.1, rotate_limit=10, p=ssr_prob
                ),
                A.RandomBrightnessContrast(p=random_bc_prob),
            ]
        )

    def __call__(self, image):
        image = self.transform(image=image)["image"]
        return image


To use augmentation, you can just define the dataset with a `Transform` instance passed as `image_transform`.

In [None]:
aug_dataset = VinbigdataTwoClassDataset(dataset_dicts, image_transform=Transform())

Let's visualize the results; they look good.<br/>
You can see that each image looks different (rotation, brightness, etc.) even though all of them are generated from the same image :)

In [None]:
index = 0

n_images = 4

fig, axes = plt.subplots(1, n_images, figsize=(16, 5))
for i in range(n_images):
    # Each time the data is accessed, the result is different due to random augmentation!
    img, label = aug_dataset[index]
    ax = axes[i]
    ax.imshow(img.transpose((1, 2, 0)) / 255.)
    ax.set_title(f"{index}-th image: label {label}")
plt.show()

<a id="model"></a>
# Defining CNN models

Recently, several libraries that collect CNN models have become publicly available.

I will use `timm` this time. You don't need to implement deep CNN models by yourself; you can simply reuse the latest research results without hassle.<br/>
This lets you focus more on looking at the data and running experiments.

In [None]:
import timm


def build_predictor(model_name: str):
    return timm.create_model(model_name, pretrained=True, num_classes=2, in_chans=3)

In [None]:
import torch


def accuracy(y: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    """Computes multi-class classification accuracy"""
    assert y.shape[:-1] == t.shape, f"y {y.shape}, t {t.shape} is inconsistent."
    pred_label = torch.max(y.detach(), dim=-1)[1]
    count = t.nelement()
    correct = (pred_label == t).sum().float()
    acc = correct / count
    return acc
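
As a worked example of what `accuracy` computes, the following plain-Python sketch takes the argmax over each row of toy logits and compares it with the target. The values are invented, and lists stand in for tensors.

```python
# Toy logits for 4 images, columns = [normal, abnormal] (made-up values).
toy_logits = [[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5], [1.2, 0.4]]
toy_targets = [0, 1, 0, 0]

# Argmax over the last axis, like `torch.max(y, dim=-1)[1]`.
pred_labels = [max(range(len(row)), key=row.__getitem__) for row in toy_logits]

# Fraction of predictions that match the targets.
n_correct = sum(int(p == t) for p, t in zip(pred_labels, toy_targets))
toy_accuracy = n_correct / len(toy_targets)
```

Here the third image is predicted as abnormal although its target is normal, so 3 of the 4 predictions are correct.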

In [None]:
import torch
import torch.nn.functional as F
from torch import nn
import pytorch_pfn_extras as ppe


class Classifier(nn.Module):
    """Two-class classification"""

    def __init__(self, predictor, lossfun=F.cross_entropy):
        super().__init__()
        self.predictor = predictor
        self.lossfun = lossfun
        self.prefix = ""

    def forward(self, image, targets):
        outputs = self.predictor(image)
        loss = self.lossfun(outputs, targets)
        metrics = {
            f"{self.prefix}loss": loss.item(),
            f"{self.prefix}acc": accuracy(outputs, targets).item()
        }
        ppe.reporting.report(metrics, self)
        return loss, metrics

    def predict(self, data_loader):
        pred = self.predict_proba(data_loader)
        label = torch.argmax(pred, dim=1)
        return label

    def predict_proba(self, data_loader):
        device: torch.device = next(self.parameters()).device
        y_list = []
        self.eval()
        with torch.no_grad():
            for batch in data_loader:
                if isinstance(batch, (tuple, list)):
                    # Assumes first argument is "image"
                    batch = batch[0].to(device)
                else:
                    batch = batch.to(device)
                y = self.predictor(batch)
                y = torch.softmax(y, dim=-1)
                y_list.append(y)
        pred = torch.cat(y_list)
        return pred
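
`predict_proba` turns the two logits of each image into probabilities with a softmax, and `predict` then takes the argmax. A minimal plain-Python sketch of those two steps on a single made-up logit pair follows; `softmax` here is a local helper written for this illustration, not the torch function.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    # Subtract the max before exponentiating for numerical stability.
    largest = max(logits)
    exps = [math.exp(v - largest) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits in the order [normal, abnormal].
probs = softmax([1.0, 1.0 + math.log(3.0)])
predicted_label = probs.index(max(probs))  # argmax, as in `predict`
```

The second entry of `probs` can be read as the predicted probability that the image is abnormal.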

What kind of models are supported in the `timm` library?

In [None]:
supported_models = timm.list_models()
print(f"{len(supported_models)} models are supported in timm.")
print(supported_models)

Wow, more than 300 models are supported!<br/>
Of course, these include **resnet**-related models, **efficientnet**, etc.<br/>
You may wonder which model should be used.<br/>
I will go with `resnet18` as a baseline first, and then try deeper/more recent models in later experiments.

<a id="trainutil"></a>
# Training utils

Here are the training utility classes and functions. You can just copy these to use in other projects.

In [None]:
"""
From https://github.com/pfnet-research/kaggle-lyft-motion-prediction-4th-place-solution
"""
from logging import getLogger

from torch import nn


class EMA(object):
    """Exponential moving average of model parameters.

    Ref
     - https://github.com/tensorflow/addons/blob/v0.10.0/tensorflow_addons/optimizers/moving_average.py#L26-L103
     - https://anmoljoshi.com/Pytorch-Dicussions/

    Args:
        model (nn.Module): Model with parameters whose EMA will be kept.
        decay (float): Decay rate for exponential moving average.
        strict (bool): Apply strict check for `assign` & `resume`.
        use_dynamic_decay (bool): Dynamically change decay rate. If `True`, small decay rate is
            used at the beginning of training to move moving average faster.
    """  # NOQA

    def __init__(
        self,
        model: nn.Module,
        decay: float,
        strict: bool = True,
        use_dynamic_decay: bool = True,
    ):
        self.decay = decay
        self.model = model
        self.strict = strict
        self.use_dynamic_decay = use_dynamic_decay
        self.logger = getLogger(__name__)
        self.n_step = 0

        self.shadow = {}
        self.original = {}

        # Flag to manage which parameter is assigned.
        # When `False`, original model's parameter is used.
        # When `True` (`assign` method is called), `shadow` parameter (ema param) is used.
        self._assigned = False

        # Register model parameters
        for name, param in model.named_parameters():
            if param.requires_grad:
                self.shadow[name] = param.data.clone()

    def step(self):
        self.n_step += 1
        if self.use_dynamic_decay:
            _n_step = float(self.n_step)
            decay = min(self.decay, (1.0 + _n_step) / (10.0 + _n_step))
        else:
            decay = self.decay

        for name, param in self.model.named_parameters():
            if param.requires_grad:
                assert name in self.shadow
                new_average = (1.0 - decay) * param.data + decay * self.shadow[name]
                self.shadow[name] = new_average.clone()

    # alias
    __call__ = step

    def assign(self):
        """Assign exponential moving average of parameter values to the respective parameters."""
        if self._assigned:
            if self.strict:
                raise ValueError("[ERROR] `assign` is called again before `resume`.")
            else:
                self.logger.warning(
                    "`assign` is called again before `resume`. "
                    "shadow parameter is already assigned, skip."
                )
                return

        for name, param in self.model.named_parameters():
            if param.requires_grad:
                assert name in self.shadow
                self.original[name] = param.data.clone()
                param.data = self.shadow[name]
        self._assigned = True

    def resume(self):
        """Restore original parameters to a model.

        That is, put back the values that were in each parameter at the last call to `assign`.
        """
        if not self._assigned:
            if self.strict:
                raise ValueError("[ERROR] `resume` is called before `assign`.")
            else:
                self.logger.warning("`resume` is called before `assign`, skip.")
                return

        for name, param in self.model.named_parameters():
            if param.requires_grad:
                assert name in self.shadow
                param.data = self.original[name]
        self._assigned = False
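
The update rule in `step` is easier to follow on a single scalar. The sketch below reproduces the dynamic decay schedule and the moving-average update from the class above with plain floats; `dynamic_decay` and `ema_update` are hypothetical helper names introduced only for this illustration.

```python
def dynamic_decay(max_decay: float, n_step: int) -> float:
    # Early steps use a small decay so the average catches up quickly.
    return min(max_decay, (1.0 + n_step) / (10.0 + n_step))


def ema_update(shadow: float, param: float, decay: float) -> float:
    # New average = (1 - decay) * current value + decay * previous average.
    return (1.0 - decay) * param + decay * shadow


shadow = 0.0   # the average starts at the initial parameter value
param = 1.0    # pretend the parameter jumped to 1.0 and stays there
history = []
for n_step in range(1, 4):
    decay = dynamic_decay(0.999, n_step)
    shadow = ema_update(shadow, param, decay)
    history.append(shadow)
```

With `decay=0.999`, the cap takes over once `(1 + n) / (10 + n)` reaches 0.999, which happens at around step 8990; before that the average follows the parameters much more closely.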


In [None]:
"""
From https://github.com/pfnet-research/kaggle-lyft-motion-prediction-4th-place-solution
"""
from typing import Mapping, Any

from torch import optim

from pytorch_pfn_extras.training.extension import Extension, PRIORITY_READER
from pytorch_pfn_extras.training.manager import ExtensionsManager


class LRScheduler(Extension):
    """A thin wrapper to resume the lr_scheduler"""

    trigger = 1, 'iteration'
    priority = PRIORITY_READER
    name = None

    def __init__(self, optimizer: optim.Optimizer, scheduler_type: str, scheduler_kwargs: Mapping[str, Any]) -> None:
        super().__init__()
        self.scheduler = getattr(optim.lr_scheduler, scheduler_type)(optimizer, **scheduler_kwargs)

    def __call__(self, manager: ExtensionsManager) -> None:
        self.scheduler.step()

    def state_dict(self) -> dict:
        return self.scheduler.state_dict()

    def load_state_dict(self, to_load) -> None:
        self.scheduler.load_state_dict(to_load)
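
For reference, the schedule selected in `flags_dict` (`CosineAnnealingWarmRestarts`) follows a cosine curve within each cycle. The sketch below evaluates that curve in plain Python. It assumes `eta_min=0` and a constant cycle length (`T_mult=1`), which are the PyTorch defaults, and uses the base learning rate `1e-3` that the optimizer is created with in this notebook. `cosine_lr` is a hypothetical helper, not part of PyTorch.

```python
import math


def cosine_lr(step: int, t_0: int, base_lr: float, eta_min: float = 0.0) -> float:
    # Position inside the current cycle; the schedule restarts every t_0 steps.
    t_cur = step % t_0
    return eta_min + 0.5 * (base_lr - eta_min) * (1.0 + math.cos(math.pi * t_cur / t_0))


t_0 = 28125
lr_start = cosine_lr(0, t_0, base_lr=1e-3)        # start of the cycle
lr_half = cosine_lr(t_0 // 2, t_0, base_lr=1e-3)  # roughly the midpoint
lr_restart = cosine_lr(t_0, t_0, base_lr=1e-3)    # first step of the next cycle
```

Since the extension is triggered every iteration, `step` here corresponds to the iteration count.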


In [None]:
from ignite.engine import Engine


def create_trainer(model, optimizer, device) -> Engine:
    model.to(device)

    def update_fn(engine, batch):
        model.train()
        optimizer.zero_grad()
        loss, metrics = model(*[elem.to(device) for elem in batch])
        loss.backward()
        optimizer.step()
        return metrics
    trainer = Engine(update_fn)
    return trainer


<a id="trainscript"></a>
# Training scripts

In [None]:
import dataclasses
import os
import sys
from pathlib import Path

import numpy as np
import pandas as pd
import pytorch_pfn_extras.training.extensions as E
import torch
from ignite.engine import Events
from pytorch_pfn_extras.training import IgniteExtensionsManager
from sklearn.model_selection import StratifiedKFold
from torch import nn, optim
from torch.utils.data.dataloader import DataLoader

## Preparing data by 5-fold cross validation

When data is scarce, stable evaluation is very important.
We can use cross validation to reduce the standard deviation of the validation error.

Here, I will use **[`StratifiedKFold`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html)** to keep the normal/abnormal ratio the same in the train & validation datasets.

According to [this discussion](https://www.kaggle.com/c/vinbigdata-chest-xray-abnormalities-detection/discussion/208837#1139712), using multi-label stratified k-fold ([iterative-stratification](https://github.com/trent-b/iterative-stratification)) may be more stable.

In [None]:
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=flags.seed)
# skf.get_n_splits(None, None)
y = np.array([int(len(d["annotations"]) > 0) for d in dataset_dicts])
split_inds = list(skf.split(dataset_dicts, y))
train_inds, valid_inds = split_inds[flags.target_fold]  # Choose which fold to train, 0th fold selected this time.
train_dataset = VinbigdataTwoClassDataset(
    [dataset_dicts[i] for i in train_inds], image_transform=Transform()
)
valid_dataset = VinbigdataTwoClassDataset([dataset_dicts[i] for i in valid_inds])
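
The idea behind stratified splitting can be sketched in a few lines of plain Python: group the indices by label, shuffle within each group, then deal them out to the folds in turn. This is an illustration of the concept only, not scikit-learn's actual algorithm, and `stratified_folds` is a hypothetical helper. The 70/30 toy labels echo the approximate normal/abnormal ratio found in the EDA.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence


def stratified_folds(labels: Sequence[int], n_splits: int, seed: int) -> List[List[int]]:
    """Return `n_splits` validation index lists with near-equal class ratios."""
    rng = random.Random(seed)
    by_label: Dict[int, List[int]] = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    folds: List[List[int]] = [[] for _ in range(n_splits)]
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        # Deal the shuffled indices of this class round-robin across folds.
        for position, index in enumerate(indices):
            folds[position % n_splits].append(index)
    return folds


toy_labels = [0] * 70 + [1] * 30  # 70 normal, 30 abnormal (toy data)
folds = stratified_folds(toy_labels, n_splits=5, seed=111)
abnormal_per_fold = [sum(toy_labels[i] for i in fold) for fold in folds]
```

Each validation fold then holds the same share of abnormal images as the full dataset, which is the property `StratifiedKFold` is used for above.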

## Write training code

pytorch-ignite & pytorch-pfn-extras are used here.

 - [pytorch/ignite](https://github.com/pytorch/ignite): It provides an abstraction for writing the training loop.
 - [pfnet/pytorch-pfn-extras](https://github.com/pfnet/pytorch-pfn-extras): It provides several "extensions" that are useful during training, for **logging, printing, evaluating, saving the model, and scheduling the learning rate**.

**[Note] Why is a training abstraction library used?**

You may feel that the training abstraction code below is less intuitive than a "raw" training loop.<br/>
The advantage of abstracting the code is that we can reuse the implemented handler classes for other training runs and other competitions.<br/>
You don't need to write code for saving models, logging training loss/metrics, showing a progress bar, etc.
These are handled by utility classes provided in the `pytorch-pfn-extras` library!

You may also refer to my kernels from previous competitions:
 - [Bengali: SEResNeXt training with pytorch](https://www.kaggle.com/corochann/bengali-seresnext-training-with-pytorch)
 - [Lyft: Training with multi-mode confidence](https://www.kaggle.com/corochann/lyft-training-with-multi-mode-confidence)

In [None]:
# Training data loader
train_loader = DataLoader(
    train_dataset,
    batch_size=flags.batchsize,
    num_workers=flags.num_workers,
    shuffle=True,
    pin_memory=True,
)
# Validation data loader
valid_loader = DataLoader(
    valid_dataset,
    batch_size=flags.valid_batchsize,
    num_workers=flags.num_workers,
    shuffle=False,
    pin_memory=True,
)

device = torch.device(flags.device)

predictor = build_predictor(model_name=flags.model_name)
classifier = Classifier(predictor)
model = classifier
optimizer = optim.Adam(model.parameters(), lr=1e-3)

# Train setup
trainer = create_trainer(model, optimizer, device)

ema = EMA(predictor, decay=flags.ema_decay)

def eval_func(*batch):
    loss, metrics = model(*[elem.to(device) for elem in batch])
    # HACKING: report ema value with prefix.
    if flags.ema_decay > 0:
        classifier.prefix = "ema_"
        ema.assign()
        loss, metrics = model(*[elem.to(device) for elem in batch])
        ema.resume()
        classifier.prefix = ""

valid_evaluator = E.Evaluator(
    valid_loader, model, progress_bar=False, eval_func=eval_func, device=device
)

# log_trigger = (10 if debug else 1000, "iteration")
log_trigger = (1, "epoch")
log_report = E.LogReport(trigger=log_trigger)
extensions = [
    log_report,
    E.ProgressBarNotebook(update_interval=10 if debug else 100),  # Show progress bar during training
    E.PrintReportNotebook(),  # Show "log" on jupyter notebook  
    # E.ProgressBar(update_interval=10 if debug else 100),  # Show progress bar during training
    # E.PrintReport(),  # Print "log" to terminal
    E.FailOnNonNumber(),  # Stop training when nan is detected.
]
epoch = flags.epoch
models = {"main": model}
optimizers = {"main": optimizer}
manager = IgniteExtensionsManager(
    trainer, models, optimizers, epoch, extensions=extensions, out_dir=str(outdir),
)
# Run evaluation for valid dataset in each epoch.
manager.extend(valid_evaluator)

# Save predictor.pt every `flags.snapshot_freq` epochs
manager.extend(
    E.snapshot_object(predictor, "predictor.pt"), trigger=(flags.snapshot_freq, "epoch")
)
# Check & Save best validation predictor.pt every epoch
# manager.extend(E.snapshot_object(predictor, "best_predictor.pt"),
#                trigger=MinValueTrigger("validation/module/nll",
#                trigger=(flags.snapshot_freq, "iteration")))

# --- lr scheduler ---
if flags.scheduler_type != "":
    scheduler_type = flags.scheduler_type
    print(f"using {scheduler_type} scheduler with kwargs {flags.scheduler_kwargs}")
    manager.extend(
        LRScheduler(optimizer, scheduler_type, flags.scheduler_kwargs),
        trigger=flags.scheduler_trigger,
    )

manager.extend(E.observe_lr(optimizer=optimizer), trigger=log_trigger)

if flags.ema_decay > 0:
    # Exponential moving average
    manager.extend(lambda manager: ema(), trigger=(1, "iteration"))

    def save_ema_model(manager):
        ema.assign()
        torch.save(predictor.state_dict(), outdir / "predictor_ema.pt")
        ema.resume()

    manager.extend(save_ema_model, trigger=(flags.snapshot_freq, "epoch"))

_ = trainer.run(train_loader, max_epochs=epoch)

So what is happening in the training abstraction above? Let's look at what each extension does.

**Extensions** - the role of each:
 - **`ProgressBar` (`ProgressBarNotebook`)**: Shows the training progress in a formatted style.
 - **`LogReport`**: Logs the metrics reported via `ppe.reporting.report` (see the `Classifier` class for where they are reported) and saves them to the **log** file. It automatically collects the values reported in each iteration and saves their mean at a regular interval (here every epoch, as set by `log_trigger`).
 - **`PrintReport` (`PrintReportNotebook`)**: Prints the values collected by `LogReport` in a formatted style.
 - **`Evaluator`**: Evaluates the model on the validation dataset.
 - **`snapshot_object`**: Saves an object. Here `predictor` is saved every `flags.snapshot_freq` epochs. Even if you stop training with Ctrl+C before all epochs have finished, the most recent snapshot is already on disk and you can use it for inference.
 - **`LRScheduler`**: Adds learning rate scheduling, called at the regular interval specified by `trigger`. The scheduler type, its arguments and the trigger all come from `Flags` (`scheduler_type`, `scheduler_kwargs`, `scheduler_trigger`); with the configuration used here, cosine annealing is applied by calling `scheduler.step()` every iteration.
 - **`observe_lr`**: Reports the optimizer's learning rate so that `LogReport` records it. This lets you follow how the learning rate changes during training.


All of this functionality can be "added" easily using extensions!
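
As a small illustration of what a cosine annealing schedule does to the learning rate, here is a sketch of its closed form (the formula given in the PyTorch documentation for `CosineAnnealingLR`). The function name `cosine_annealing_lr` and the value of `t_max` are my own assumptions for the example, not values taken from this kernel's `Flags`.

```python
import math


def cosine_annealing_lr(step, base_lr, t_max, eta_min=0.0):
    """Closed-form cosine annealing for step in [0, t_max]."""
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * step / t_max))


base_lr = 1e-3  # same initial learning rate as the Adam optimizer in the training cell
t_max = 1000    # hypothetical total number of scheduler steps
schedule = [cosine_annealing_lr(s, base_lr, t_max) for s in (0, 250, 500, 750, 1000)]
for step, lr in zip((0, 250, 500, 750, 1000), schedule):
    print(step, lr)
```

The learning rate starts at `base_lr`, reaches half of it at the midpoint, and decays smoothly to `eta_min` at `t_max`.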

The **exponential moving average (EMA) of the model weights** is also calculated by the `EMA` class during training, and its validation loss is reported alongside the regular one (with the `ema_` prefix). We can usually obtain more stable models with EMA.
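
To show the update rule and the `assign()` / `resume()` pattern used in the training cell, here is a framework-free sketch over a dict of scalar "weights". `ToyEMA` is a hypothetical class written for illustration only; the real `EMA` class in this kernel works on the predictor's tensors and may differ in detail.

```python
class ToyEMA:
    """EMA over a dict of scalar "weights" (hypothetical, for illustration)."""

    def __init__(self, params, decay):
        self.params = params        # live weights, updated by the optimizer
        self.decay = decay
        self.shadow = dict(params)  # running average of the weights
        self.backup = {}

    def __call__(self):
        # shadow <- decay * shadow + (1 - decay) * current weight
        for name, value in self.params.items():
            self.shadow[name] = self.decay * self.shadow[name] + (1 - self.decay) * value

    def assign(self):
        # Temporarily swap the averaged weights into the model
        self.backup = dict(self.params)
        self.params.update(self.shadow)

    def resume(self):
        # Put the live weights back so training continues unaffected
        self.params.update(self.backup)
        self.backup = {}


weights = {"w": 0.0}
toy_ema = ToyEMA(weights, decay=0.5)
for new_value in (1.0, 1.0, 1.0):  # pretend the optimizer moved w to 1.0
    weights["w"] = new_value
    toy_ema()                      # called once per iteration

toy_ema.assign()
w_during_eval = weights["w"]       # averaged weight, used for evaluation or saving
toy_ema.resume()
w_after_eval = weights["w"]        # live weight restored
print(w_during_eval, w_after_eval)
```

The averaged weight lags behind the live weight, which is what smooths out iteration-to-iteration noise.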

You can obtain the training history very easily just by accessing the `LogReport` instance, which is useful for managing a lot of experiments during Kaggle competitions.

In [None]:
torch.save(predictor.state_dict(), outdir / "predictor_last.pt")
df = log_report.to_dataframe()
df.to_csv(outdir / "log.csv", index=False)
df

<a id="prediction"></a>
# Prediction on validation & test dataset

In [None]:
# --- Prediction ---
print("Training done! Start prediction...")
# Predict on the validation data
valid_pred = classifier.predict_proba(valid_loader).cpu().numpy()
valid_pred_df = pd.DataFrame({
    "image_id": [dataset_dicts[i]["image_id"] for i in valid_inds],
    "class0": valid_pred[:, 0],
    "class1": valid_pred[:, 1]
})
valid_pred_df.to_csv(outdir/"valid_pred.csv", index=False)

# Load the test data
test_meta = pd.read_csv(inputdir / "vinbigdata-testmeta" / "test_meta.csv")
dataset_dicts_test = get_vinbigdata_dicts_test(imgdir, test_meta, debug=debug)
test_dataset = VinbigdataTwoClassDataset(dataset_dicts_test, train=False)
test_loader = DataLoader(
    test_dataset,
    batch_size=flags.valid_batchsize,
    num_workers=flags.num_workers,
    shuffle=False,
    pin_memory=True,
)

# Predict on the test data
test_pred = classifier.predict_proba(test_loader).cpu().numpy()
test_pred_df = pd.DataFrame({
    "image_id": [d["image_id"] for d in dataset_dicts_test],
    "class0": test_pred[:, 0],
    "class1": test_pred[:, 1]
})
test_pred_df.to_csv(outdir/"test_pred.csv", index=False)
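
The two columns `class0` and `class1` are class probabilities. As a rough sketch of how such probabilities are typically obtained from a 2-output network, here is a plain-Python softmax. That `predict_proba` applies a softmax to the logits is my assumption here, and the logits below are made-up numbers.

```python
import math


def softmax(logits):
    # Subtract the max for numerical stability before exponentiating
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 0.0])  # hypothetical logits for (class0, class1)
print(probs)
```

Because the two probabilities sum to 1, keeping only `class0` loses no information.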

In [None]:
valid_loader.sampler

In [None]:
is_normal_df

In [None]:
# --- Test dataset prediction result ---
test_pred_df

In [None]:
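# Keep only image_id and the predicted class0 probability, renamed to "target".
# rename() returns a new DataFrame, so test_pred_df itself is left unchanged.
binary_clf = test_pred_df[["image_id", "class0"]].rename(columns={"class0": "target"})
binary_clf.to_csv(outdir / "2-cls test pred.csv", index=False)
binary_clf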
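
One possible way to use this file together with an object detection submission is sketched below, in the spirit of the "2 Class Filter" kernel linked at the top. The function name, the thresholds `low` and `high`, and the example prediction string are hypothetical, and I am assuming the "No finding" prediction string has the form `14 1 0 0 1 1`.

```python
NO_FINDING = "14 1 0 0 1 1"  # assumed "No finding" prediction string


def apply_two_class_filter(pred_string, p_normal, low=0.1, high=0.9):
    """Combine a detector's prediction string with the 2-class probability.

    `low` and `high` are hypothetical thresholds to be tuned on validation data.
    """
    if p_normal >= high:
        # Confidently normal: drop all boxes
        return NO_FINDING
    if p_normal > low:
        # Uncertain: keep the boxes and add "No finding" with its probability
        return f"{pred_string} 14 {p_normal} 0 0 1 1"
    # Confidently abnormal: keep the detector output as-is
    return pred_string


detector_output = "0 0.8 10 10 50 50"  # made-up detection: class, score, box
print(apply_two_class_filter(detector_output, 0.95))
print(apply_two_class_filter(detector_output, 0.5))
print(apply_two_class_filter(detector_output, 0.05))
```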

In [None]:
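# Plot the distributions of the predicted class0 probability on the validation and test sets
# (note: `sns.distplot` is deprecated since seaborn 0.11; `sns.histplot(..., kde=True, stat="density")` is its replacement)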
sns.distplot(valid_pred_df["class0"].values, color='green', label='valid pred')
sns.distplot(test_pred_df["class0"].values, color='orange', label='test pred')
plt.title("Prediction results histogram")
plt.xlim([0., 1.])
plt.legend()

In [None]:
from sklearn.metrics import roc_curve, auc  # ROC curve and AUC

# Join the ground truth (number of "normal" annotations per image) with the validation predictions
eqw = pd.merge(is_normal_df, valid_pred_df, on='image_id')

# Positive label (1): the image has at least one "normal" annotation
true_class = (eqw['num_normal_annotations'].values != 0).astype(int)
# Score: predicted class0 probability
pred_class = eqw['class0'].values

# Compute the ROC curve and the area under it
fpr, tpr, threshold = roc_curve(true_class, pred_class)  # false / true positive rates
roc_auc = auc(fpr, tpr)  # area under the ROC curve

lw = 2
plt.figure(figsize=(10, 10))
# False positive rate on the x-axis, true positive rate on the y-axis
plt.plot(fpr, tpr, color='darkorange',
         lw=lw, label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic example')
plt.legend(loc="lower right")
plt.show()
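
For intuition about the number reported in the legend: the area under the ROC curve equals the probability that a randomly chosen positive sample gets a higher score than a randomly chosen negative one (ties count as one half). Here is a small plain-Python sketch of that pairwise definition; `pairwise_auc` is a hypothetical helper and the labels and scores are toy values, not results from this kernel.

```python
def pairwise_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(pos) * len(neg))


toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = pairwise_auc(toy_labels, toy_scores)
print(toy_auc)
```

This double loop is quadratic in the number of samples, so it is only for understanding. For real data, `roc_curve` and `auc` as used above are the practical choice.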

That's all!

<h3 style="color:red">If this kernel helps you, please upvote to keep me motivated 😁<br>Thanks!</h3>

<a id="nextstep"></a>
# Next step

I explained the EDA - Training - Prediction pipeline for 2-class image classification in this kernel.<br/>
You can try different training configurations just by changing the `Flags` (`flags_dict`) configuration.

For example, you can change these parameters:

 - **Data**
   - `imgdir_name`: You can use the different preprocessed images introduced in [Multiple preprocessed datasets: 256/512/1024px, PNG and JPG, modified and original ratio](https://www.kaggle.com/c/vinbigdata-chest-xray-abnormalities-detection/discussion/207955) by @xhlulu.
 - **Model**
   - `model_name`: You can try the various models that the `timm` library supports, just by changing `model_name`.
 - **Training**
   - `epoch`, `batchsize`, `scheduler_type` etc.: Try changing these hyperparameters to see the difference!
   - Augmentation: Modify the `Transform` class to add your own augmentations; it's easy to support more of them with the `albumentations` library.
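
As a sketch of how such a dict-based override can work, here is a small dataclass with an `update` method. `ToyFlags` is a hypothetical stand-in: the field names follow flags used in this kernel, but the default values are placeholders I made up, and the real `Flags` class may differ.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class ToyFlags:
    # Field names follow this kernel's flags; defaults are placeholders only.
    imgdir_name: str = "placeholder-image-dataset"
    model_name: str = "resnet18"
    epoch: int = 20
    batchsize: int = 8
    scheduler_type: str = ""
    scheduler_kwargs: Dict[str, Any] = field(default_factory=dict)

    def update(self, param_dict):
        # Overwrite defaults with user values, rejecting misspelled keys
        for key, value in param_dict.items():
            if not hasattr(self, key):
                raise ValueError(f"Unknown flag: {key}")
            setattr(self, key, value)
        return self


toy_flags_dict = {
    "model_name": "resnet50",
    "epoch": 5,
    "scheduler_type": "CosineAnnealingLR",
    "scheduler_kwargs": {"T_max": 1000},
}
toy_flags = ToyFlags().update(toy_flags_dict)
print(toy_flags)
```

Rejecting unknown keys is a deliberate choice in this sketch: a typo such as `batch_size` instead of `batchsize` fails loudly instead of being silently ignored.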


My basic strategy is as follows:
 - Check the training loss/accuracy: If it is almost the same as the validation loss/accuracy and neither is good enough, the model's representation power may be insufficient, or the data augmentation may be too strong. You can try deeper models, weaker data augmentation, or richer data (higher-resolution images).
 - Check the gap between training loss and validation loss: If the validation loss is much higher than the training loss, it is a sign of overfitting. Try smaller models, stronger data augmentation, or regularization (dropout etc.).
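
The two checks above can be written down as a tiny rule of thumb. This is only a sketch: the function `diagnose`, the `gap_ratio` of 1.5, and the loss values are hypothetical, and sensible thresholds depend on the task.

```python
def diagnose(train_loss, valid_loss, target_loss, gap_ratio=1.5):
    """Rough rule of thumb; `target_loss` and `gap_ratio` are hypothetical knobs."""
    if valid_loss > gap_ratio * train_loss:
        # Large train/valid gap: smaller model, more augmentation, regularization
        return "overfitting"
    if train_loss > target_loss:
        # Train and valid are close but both too high: bigger model,
        # weaker augmentation, or higher-resolution images
        return "underfitting"
    return "ok"


print(diagnose(train_loss=0.10, valid_loss=0.40, target_loss=0.20))
print(diagnose(train_loss=0.35, valid_loss=0.36, target_loss=0.20))
print(diagnose(train_loss=0.15, valid_loss=0.17, target_loss=0.20))
```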
 

# Next to read

The [📸VinBigData detectron2 train](https://www.kaggle.com/corochann/vinbigdata-detectron2-train) kernel explains how to run object detection training using the `detectron2` library.

The [📸VinBigData detectron2 prediction](https://www.kaggle.com/corochann/vinbigdata-detectron2-prediction) kernel explains how to use the trained model for prediction and submission in this competition.
