# Changelog

**Saturday, May 15, 2021, 10:00**

`MyTrainer.model` is an `nn.DataParallel`, but the post-processing code accessed it as if it were a plain `nn.Module`. This has been fixed.  
The hook has to be registered through `model.module.pre_classifier.register_forward_hook`, but the code was calling `model.pre_classifier.register_forward_hook`.  

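# Preface

※ This notebook was assembled by merging several `*.py` and `*.ipynb` files, so there may be unexpected errors along the way. If you find one, please let me know in a comment or by email.  
Likewise, running the whole pipeline in one go takes a great deal of memory and compute. The source is divided by horizontal rules at the points where intermediate data is saved; I cleared memory at each of these points before continuing.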

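# Summary

## 1. Preprocessing

**Removing duplicate data**

* Many samples share exactly the same text. Removing them leaves 421,079 of 472,972 training rows and 1,095,951 of 1,418,916 test rows, and cut training and inference time to roughly two-thirds.
* Some samples have identical text but different levels. For these, only the most frequent level is kept and the rest are dropped.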
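
The rule above can be sketched in a few lines. This is a minimal illustration rather than the notebook's own cell: `dedupe_by_majority` is a hypothetical helper, the sample rows are made up, and ties are broken toward the lowest level, following the comment in the deduplication cell further down.

```python
from collections import Counter, defaultdict


def dedupe_by_majority(levels, texts):
    """Keep one row per unique text, labeled with its most frequent level."""
    counts = defaultdict(Counter)  # text -> Counter({level: occurrences})
    for level, text in zip(levels, texts):
        counts[text][level] += 1

    result = {}
    for text, counter in counts.items():
        best = max(counter.values())
        # On a tie, fall back to the lowest level among the most frequent ones
        result[text] = min(lv for lv, n in counter.items() if n == best)
    return result


rows = [(0, "a"), (1, "a"), (1, "a"), (0, "b"), (5, "b"), (3, "c")]
deduped = dedupe_by_majority(*zip(*rows))
print(deduped)  # {'a': 1, 'b': 0, 'c': 3}
```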

**Removing unnecessary text**

* Parts judged to be uninformative, such as dates, times, PIDs, and timestamps, are removed.
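
As an illustration, the sketch below strips three such fields with regular expressions. The patterns are the ones used in `refine_data` later in this notebook; `strip_volatile_fields` is a hypothetical name, and the sample log line is made up.

```python
import re

PATTERNS = [
    r'"@timestamp"\s?:\s?"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z?",?',  # "@timestamp": "...Z"
    r'"pid"\s?:\s?\d+,?',  # "pid": 4567
    r"\[\d+\]",  # [4567]
]


def strip_volatile_fields(log):
    """Remove fields that change from run to run, then collapse whitespace."""
    for pattern in PATTERNS:
        log = re.sub(pattern, "", log)
    return re.sub(r"\s+", " ", log).strip()


sample = 'sshd[4567]: {"@timestamp": "2020-09-18T10:00:00Z", "pid": 4567, "msg": "login failed"}'
print(strip_volatile_fields(sample))  # sshd: { "msg": "login failed"}
```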

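**Oversampling**

* Levels 2, 4, and 6 have very few samples (12, 10, and 8 after deduplication). After the train-validation split, some patterns can be missing from a fold entirely and never be learned, so 10 extra copies of every level 2, 4, and 6 sample are appended (11× the original count). No augmentation is applied; the samples are simply duplicated.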
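
The arithmetic can be checked with a small pure-Python sketch. `oversample` is a hypothetical helper that mirrors what the `oversampling` branch of `DatasetGeneratorVer7.train_valid` does with `oversampling_scale: 10`; the toy data is made up.

```python
def oversample(levels, texts, minority=(2, 4, 6), scale=10):
    """Append `scale` extra copies of every sample whose level is in `minority`."""
    extra = [(lv, tx) for lv, tx in zip(levels, texts) if lv in minority] * scale
    out_levels = list(levels) + [lv for lv, _ in extra]
    out_texts = list(texts) + [tx for _, tx in extra]
    return out_levels, out_texts


levels = [0] * 5 + [2] * 12
texts = [f"log-{i}" for i in range(len(levels))]
new_levels, new_texts = oversample(levels, texts)
print(new_levels.count(2))  # 132 = 12 originals + 12 * 10 copies
```

The figure of 132 agrees with the level 2 support in the training log further down (106 training rows plus 26 validation rows).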

**Miscellaneous**

* When a tokenized sequence is longer than 512 tokens, only the first 512 tokens are used.
* The original data files are stored in `./data/ori`.
* The preprocessed data is saved as `pkl` files in `./data/ver6`.

## 2. Training

* DistilBERT was fine-tuned with focal loss. No additional data was used.
* 5-fold cross-validation was used. The ensemble made little difference, though: the public score was 0.9207 before combining the folds and 0.9208 after.
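
For reference, focal loss down-weights samples that the model already classifies confidently. Writing $p_t$ for the predicted probability of the true class and $\gamma$ for the focusing parameter (`gamma: 2.0` in the configuration shown later), the per-sample loss computed by the `FocalLoss` class in this notebook is:

```latex
\mathrm{FL}(p_t) = -(1 - p_t)^{\gamma} \, \log p_t
```

With $\gamma = 0$ this reduces to ordinary cross-entropy. The class averages this value over the batch and applies no per-class weighting.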

## 3. Post-processing

* The features of all training samples are stored, and the Euclidean distance between them and each test feature is computed. These distances are used when inferring the level.
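
A minimal pure-Python sketch of this lookup follows. It uses toy 2-D features and a hypothetical `nearest_neighbors` helper; the notebook itself works on tensors extracted from the model.

```python
import math


def nearest_neighbors(train_feats, train_levels, query, k=4):
    """Return the k (distance, index, level) triples closest to `query`."""
    scored = [
        (math.dist(feat, query), idx, train_levels[idx])
        for idx, feat in enumerate(train_feats)
    ]
    return sorted(scored)[:k]


train_feats = [(0.0, 0.0), (3.0, 4.0), (1.0, 1.0), (10.0, 0.0)]
train_levels = [0, 1, 0, 5]
neighbors = nearest_neighbors(train_feats, train_levels, query=(0.0, 1.0), k=2)
for dist, idx, level in neighbors:
    print(f"{dist:.3f} {idx} {level}")
```

Each stored feature yields one distance, so the returned indices point at training samples and can be used to look up their levels.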

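## 4. Inference

* The working assumption was: "Wouldn't data that looks different from the logs seen so far be level 7?"
* A separate threshold is set for each level. If a sample exceeds the threshold or meets certain other conditions, it is labeled level 7; otherwise the output combines the fully connected layer's output with the distance.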
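
The decision rule might look roughly like the sketch below. Everything in it is an assumption made for illustration: the threshold values, the `infer_level` name, and the simplified way of combining the classifier output with the nearest neighbor are not taken from the author's code.

```python
LEVEL7 = 7
# Hypothetical per-level distance thresholds; the real values are not given in this section
THRESHOLDS = {0: 30.0, 1: 30.0, 2: 20.0, 3: 25.0, 4: 20.0, 5: 25.0, 6: 20.0}


def infer_level(fc_level, nn_level, nn_dist, thresholds=THRESHOLDS):
    """Combine the classifier's prediction with the nearest training neighbor."""
    if nn_dist > thresholds[fc_level]:
        return LEVEL7  # too far from anything seen during training
    # Simplified combination (assumption): trust the neighbor only when it is nearly identical
    return nn_level if nn_dist < 1.0 else fc_level


print(infer_level(fc_level=0, nn_level=0, nn_dist=45.0))  # 7
print(infer_level(fc_level=1, nn_level=3, nn_dist=0.5))   # 3
print(infer_level(fc_level=1, nn_level=3, nn_dist=10.0))  # 1
```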

## Environments

* Ubuntu 18.04 LTS
* RTX3090, cuda-toolkit 11.2
* I kept moving between machines via checkpoints while working, so exact reproduction may be difficult because of random seeds and similar issues.

## Requirements

* pytorch==1.7.1 # For reasons I don't understand, transformer-family models gave completely different outputs on 1.7.1 and 1.8.0, even with the same weights.
* numpy
* pandas
* matplotlib
* pyyaml
* easydict
* pytorch_transformers
* scikit-learn
* tqdm

In [None]:
import argparse
import logging
import math
import multiprocessing
import os
import pickle
import random
import re
import sys
from collections import defaultdict
from datetime import datetime
from pathlib import Path
from pprint import pformat

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.nn.functional as F
import yaml
from easydict import EasyDict
from pytorch_transformers import DistilBertForSequenceClassification, DistilBertTokenizer
from sklearn.metrics import classification_report, f1_score
from sklearn.model_selection import StratifiedKFold
from torch.optim import Adam, AdamW
from torch.optim.lr_scheduler import ReduceLROnPlateau
from torch.utils.data import DataLoader, Dataset
from tqdm import tqdm

# Removing duplicate data

## Removing duplicate data - train data

In [None]:
df = pd.read_csv("data/ori/train.csv")

In [None]:
db = {}  # {text: {level: {"cnt": count, "list": row indices}}}
with tqdm(total=len(df), ncols=100, file=sys.stdout) as t:
    for i, (level, full_log) in enumerate(zip(df.level, df.full_log)):
        text = full_log
        if text not in db:
            db[text] = {}
        if level not in db[text]:
            db[text][level] = {"cnt": 0, "list": []}
        db[text][level]["cnt"] += 1
        db[text][level]["list"].append(i)

        t.update()

In [7]:
keys = list(db.keys())

In [8]:
len(keys), len(df)

(421079, 472972)

In [9]:
duples = {}
for i, key in enumerate(keys):
    if len(db[key]) != 1:
        K = tuple(sorted(list(db[key].keys())))
        if K not in duples:
            duples[K] = []
        duples[K].append(i)

In [10]:
duples.keys()

dict_keys([(0, 1), (0, 5), (3, 5), (0, 1, 5), (1, 5), (0, 3)])

In [13]:
outdb = {"level": [], "text": []}
with tqdm(total=len(keys), ncols=100, file=sys.stdout) as t:
    for key in keys:
        d = db[key]
        if len(list(d.keys())) > 1:
            # Pick the level that occurs most often
            cnts = sorted([(d[k]["cnt"], k) for k in d.keys()], reverse=True)

            # If several levels share the highest count, pick the lowest level
            max_cnt = cnts[0][0]
            same_ks = list(filter(lambda x: x[0] == max_cnt, cnts))
            level = same_ks[-1][1]
        else:
            level = list(d.keys())[0]

        outdb["level"].append(level)
        outdb["text"].append(key)
        t.update()

100%|███████████████████████████████████████████████████| 421079/421079 [00:00<00:00, 954321.56it/s]


In [14]:
temp = defaultdict(int)
for level in outdb["level"]:
    temp[level] += 1

In [15]:
temp  # number of samples per level

defaultdict(int, {0: 280793, 1: 134195, 3: 4219, 5: 1842, 2: 12, 4: 10, 6: 8})

In [17]:
!mkdir -p data/ver6

In [19]:
# Save as a pickle file
with open("data/ver6/train.pkl", "wb") as f:
    pickle.dump(outdb, f)

## Removing duplicate data - test data

In [20]:
df = pd.read_csv("data/ori/test.csv")

In [25]:
full_logs = [full_log for full_log in df.full_log]

In [28]:
db = {}
with tqdm(total=len(df), ncols=100, file=sys.stdout) as t:
    for id, full_log in zip(df.id, df.full_log):
        # text = MyDatasetVer5.refine_data(full_log)
        text = full_log
        if text not in db:
            db[text] = []
        db[text].append(id)

        t.update()

100%|█████████████████████████████████████████████████| 1418916/1418916 [00:02<00:00, 609184.17it/s]


In [31]:
len(db.keys()), len(df)  # db: {full_log: "list of ids"}

(1095951, 1418916)

In [32]:
with open("data/ver6/test.pkl", "wb") as f:
    pickle.dump(db, f)
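
Because `db` maps each unique `full_log` to the list of ids that share it, the model only needs to run once per unique text, and each prediction can then be copied back to all of its ids. A minimal sketch of that expansion, assuming a hypothetical `level_by_text` dictionary of predictions (the ids and texts below are made up):

```python
def expand_predictions(db, level_by_text):
    """Map one prediction per unique text back onto every id that shares that text."""
    submission = {}
    for text, ids in db.items():
        for id_ in ids:
            submission[id_] = level_by_text[text]
    return dict(sorted(submission.items()))


db = {"log A": [1000000, 1000003], "log B": [1000001], "log C": [1000002]}
level_by_text = {"log A": 0, "log B": 1, "log C": 7}
expanded = expand_predictions(db, level_by_text)
print(expanded)  # {1000000: 0, 1000001: 1, 1000002: 7, 1000003: 0}
```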

## Removing duplicate data - level 7 validation data

In [33]:
df = pd.read_csv("data/ori/validation_sample.csv")

In [34]:
full_logs = [full_log for full_log in df.full_log]

In [35]:
full_logs

['type=ANOM_PROMISCUOUS msg=audit(1600402733.466:4503): dev=enp2s0 prom=256 old_prom=0 auid=4294967295 uid=0 gid=0 ses=4294967295 type=SYSCALL msg=audit(1600402733.466:4503): arch=c000003e syscall=54 success=yes exit=0 a0=c a1=107 a2=1 a3=7f856aed1140 items=0 ppid=1 pid=12152 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="W#01-enp2s0" exe="/usr/sbin/suricata" subj=system_u:system_r:unconfined_service_t:s0 key=(null) type=PROCTITLE msg=audit(1600402733.466:4503): proctitle=2F7362696E2F7375726963617461002D63002F6574632F73757269636174612F73757269636174612E79616D6C002D2D70696466696C65002F7661722F72756E2F73757269636174612E706964002D6900656E70327330',
 'oscap: msg: "xccdf-result", scan-id: "0001600739632", content: "ssg-centos-7-ds.xml", title: "Prevent Log In to Accounts With Empty Password", id: "xccdf_org.ssgproject.content_rule_no_empty_passwords", result: "fail", severity: "high", description: "If an account is configured for pass

In [36]:
with open("data/ver6/valid-level7.pkl", "wb") as f:
    pickle.dump(full_logs, f)

---

# Training

## Utility Functions

In [None]:
class AverageMeter(object):
    """
    AverageMeter, referenced to https://dacon.io/competitions/official/235626/codeshare/1684
    """

    def __init__(self):
        self.sum = 0
        self.cnt = 0
        self.avg = 0

    def update(self, val, n=1):
        if n > 0:
            self.sum += val * n
            self.cnt += n
            self.avg = self.sum / self.cnt

    def get(self):
        return self.avg

    def __call__(self):
        return self.avg

    
class CustomLogger:
    def __init__(self, filename=None, filemode="a", use_color=True):
        if filename is not None:
            self.empty = False
            filename = Path(filename)
            if filename.is_dir():
                timestr = self._get_timestr().replace(" ", "_").replace(":", "-")
                filename = filename / f"log_{timestr}.log"
            self.file = open(filename, filemode)
        else:
            self.empty = True

        self.use_color = use_color

    def _get_timestr(self):
        n = datetime.now()
        return f"{n.year:04d}-{n.month:02d}-{n.day:02d} {n.hour:02d}:{n.minute:02d}:{n.second:02d}"

    def _write(self, msg, level):
        timestr = self._get_timestr()
        out = f"[{timestr} {level}] {msg}"

        if self.use_color:
            if level == " INFO":
                print("\033[34m" + out + "\033[0m")
            elif level == " WARN":
                print("\033[35m" + out + "\033[0m")
            elif level == "ERROR":
                print("\033[31m" + out + "\033[0m")
            elif level == "FATAL":
                print("\033[43m\033[1m" + out + "\033[0m")
            else:
                print(out)
        else:
            print(out)

        if not self.empty:
            self.file.write(out + "\r\n")

    def debug(self, *msg):
        msg = " ".join(map(str, msg))
        self._write(msg, "DEBUG")

    def info(self, *msg):
        msg = " ".join(map(str, msg))
        self._write(msg, " INFO")

    def warn(self, *msg):
        msg = " ".join(map(str, msg))
        self._write(msg, " WARN")

    def error(self, *msg):
        msg = " ".join(map(str, msg))
        self._write(msg, "ERROR")

    def fatal(self, *msg):
        msg = " ".join(map(str, msg))
        self._write(msg, "FATAL")

    def flush(self):
        if not self.empty:
            self.file.flush()
    

def seed_everything(seed, deterministic=False):
    random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed(seed)
        torch.backends.cudnn.deterministic = deterministic
        torch.backends.cudnn.benchmark = not deterministic

## Dataset Functions

In [None]:
def first_word(text, deli=" "):
    # Return the text before the first delimiter (the whole text if there is none)
    return text.split(deli, 1)[0]


def remove_pattern(pattern, full_log):
    # Remove all matches in one pass; deleting spans one by one shifts the offsets of later matches
    stripped, n = re.subn(pattern, "", full_log)
    return stripped.strip() if n else full_log


def filttt(x):
    if len(x) == 1:
        return False

    if re.fullmatch(r"[\.\d,\s-]+([ABTZ]|ms)?", x):
        return False

    if x.lower() in ("x64", "win32", "x86", "ko", "en", "kr", "us", "ko-kr", "en-us"):
        return False

    return True


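# Assumption: `seasons` is not defined anywhere in this notebook. Three-letter month
# abbreviations, as found in syslog-style timestamps, are used here as a stand-in.
seasons = ("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")


def refine_data(full_log):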
    t = first_word(full_log)
    if len(t) == 4 and t.isdigit() and t[:2] in ("19", "20", "21"):
        full_log = full_log[5:].strip()

    t = first_word(full_log)
    if len(t) == 3 and t in seasons:
        full_log = full_log[4:].strip()

        t = first_word(full_log)
        if t.isdigit():
            full_log = full_log[len(t) + 1 :].strip()

    # Leading time in HH:MM:SS format
    if re.match(r"\d{2}:\d{2}:\d{2}", full_log):
        full_log = full_log[9:].strip()

    if full_log.startswith("localhost"):
        full_log = full_log[10:].strip()

    # "@timestamp": "YYYY-MM-DDTHH:MM:SSZ"
    full_log = remove_pattern(r'"@timestamp"\s?:\s?"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z?",?', full_log)
    # "pid": 4567
    full_log = remove_pattern(r'"pid"\s?:\s?\d+,?', full_log)
    # [pid]
    full_log = remove_pattern(r"\[\d+\]", full_log)

    full_log = re.sub(r"\s+", " ", full_log)

    return full_log


class MyDatasetVer7(Dataset):
    """
    Follows ver1, which gave the best results.
    """

    def __init__(self, tokenizer, texts, levels=None) -> None:
        super().__init__()
        self.tokenizer = tokenizer
        self.texts = texts
        self.levels = levels
        self.train = levels is not None

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = self.texts[idx]
        text = refine_data(str(text))
        otext = text  # backup original text before being tokenized

        text = self.tokenizer.encode(text, add_special_tokens=True)
        if len(text) < 512:  # padding
            text = text + [0] * (512 - len(text))
        elif len(text) > 512:  # cropping
            text = text[:512]
        text = torch.tensor(text, dtype=torch.long)

        if self.train:  # train
            level = self.levels[idx]
            level = torch.tensor(level, dtype=torch.long)
            return text, level, otext
        else:  # test
            return text, otext


class MyDatasetVer7Test(Dataset):
    def __init__(self, tokenizer, data) -> None:
        super().__init__()
        self.tokenizer = tokenizer
        self.data = data
        self.keys = list(self.data.keys())

    def __len__(self):
        return len(self.keys)

    def __getitem__(self, idx):
        text = self.keys[idx]
        ids = self.data[text]
        text = refine_data(text)
        otext = text  # backup original text before being tokenized

        text = self.tokenizer.encode(text, add_special_tokens=True)
        if len(text) < 512:  # padding
            text = text + [0] * (512 - len(text))
        elif len(text) > 512:  # cropping
            text = text[:512]
        text = torch.tensor(text, dtype=torch.long)

        return text, otext, ids


class DatasetGeneratorVer7:
    def __init__(
        self,
        data_dir,
        seed,
        fold,
        tokenizer,
        batch_size,
        num_workers,
        train_shuffle=True,
        oversampling=False,
        oversampling_scale=50,
    ):
        self.data_dir = Path(data_dir)
        self.seed = seed
        self.fold = fold
        self.tokenizer = tokenizer
        self.batch_size = batch_size
        self.num_workers = num_workers
        self.train_shuffle = train_shuffle
        self.oversampling = oversampling
        self.oversampling_scale = oversampling_scale

        self.dl_kwargs = dict(batch_size=self.batch_size, num_workers=self.num_workers, pin_memory=True)

    def train_valid(self):
        # train dataset
        with open(self.data_dir / "train.pkl", "rb") as f:
            data = pickle.load(f)
        levels = np.array(data["level"], dtype=np.int64)  # np.long is not available in recent NumPy 1.x releases
        texts = np.array(data["text"], dtype=object)  # np.object was removed from NumPy

        if self.oversampling:
            # Append `oversampling_scale` extra copies of every level 2, 4, and 6 sample
            extra_levels, extra_texts = [], []
            for lv in (2, 4, 6):
                mask = levels == lv
                extra_levels.append(np.tile(levels[mask], self.oversampling_scale))
                extra_texts.append(np.tile(texts[mask], self.oversampling_scale))

            levels = np.concatenate([levels, *extra_levels])
            texts = np.concatenate([texts, *extra_texts])

        # k-fold
        skf = StratifiedKFold(n_splits=5, shuffle=self.train_shuffle, random_state=self.seed)
        indices = list(skf.split(texts, levels))
        tidx, vidx = indices[self.fold - 1]
        tds = MyDatasetVer7(self.tokenizer, texts[tidx], levels[tidx])
        vds = MyDatasetVer7(self.tokenizer, texts[vidx], levels[vidx])

        tdl = DataLoader(tds, shuffle=self.train_shuffle, **self.dl_kwargs)
        vdl = DataLoader(vds, shuffle=False, **self.dl_kwargs)
        return tdl, vdl

    def train_only(self):
        with open(self.data_dir / "train.pkl", "rb") as f:
            data = pickle.load(f)
        levels = np.array(data["level"], dtype=np.int64)  # np.long is not available in recent NumPy 1.x releases
        texts = np.array(data["text"], dtype=object)  # np.object was removed from NumPy

        ds = MyDatasetVer7(self.tokenizer, texts, levels)
        dl = DataLoader(ds, shuffle=False, **self.dl_kwargs)
        return dl

    def valid_lv7(self):
        # validation level 7 dataset
        with open(self.data_dir / "valid-level7.pkl", "rb") as f:
            texts = pickle.load(f)

        ds = MyDatasetVer7(self.tokenizer, texts)
        dl = DataLoader(ds, shuffle=False, **self.dl_kwargs)
        return dl

    def test(self):
        # test dataset
        with open(self.data_dir / "test.pkl", "rb") as f:
            data = pickle.load(f)

        ds = MyDatasetVer7Test(self.tokenizer, data)
        return ds

## Trainer Classes

In [None]:
# loss function
class FocalLoss(nn.Module):
    """
    https://dacon.io/competitions/official/235585/codeshare/1796
    """

    def __init__(self, gamma=0, eps=1e-7):
        super(FocalLoss, self).__init__()
        self.gamma = gamma
        # print(self.gamma)
        self.eps = eps
        self.ce = torch.nn.CrossEntropyLoss(reduction="none")

    def forward(self, input, target):
        logp = self.ce(input, target)
        p = torch.exp(-logp)
        loss = (1 - p) ** self.gamma * logp
        return loss.mean()
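
`MyOutput`, which `MyTrainer` below relies on, is not defined anywhere in this notebook. Judging only from how it is used (`O.loss.update(...)`, `O.loss()`, `O.plevels.append(...)`, `O.freeze()`, and later `to.loss:.6f`), something along the following lines would be enough to run the trainer. This is a guess at the interface, not the original class; `RunningMean` is a hypothetical stand-in for `AverageMeter` so that the block runs on its own.

```python
class RunningMean:
    """Weighted running average, callable like AverageMeter."""

    def __init__(self):
        self.total, self.count = 0.0, 0

    def update(self, val, n=1):
        self.total += val * n
        self.count += n

    def __call__(self):
        return self.total / self.count if self.count else 0.0


class MyOutput:
    """Hypothetical per-epoch accumulator; a stand-in, not the author's class."""

    def __init__(self):
        self.loss = RunningMean()
        self.acc = RunningMean()
        self.plevels = []  # one batch of predicted levels per step
        self.tlevels = []  # one batch of true levels per step

    def freeze(self):
        # Collapse the meters to floats and flatten the batches into plain lists
        self.loss, self.acc = self.loss(), self.acc()
        self.plevels = [int(v) for batch in self.plevels for v in batch]
        self.tlevels = [int(v) for batch in self.tlevels for v in batch]
        return self


out = MyOutput()
out.loss.update(0.5, 2)
out.loss.update(0.2, 1)
out.plevels.append([0, 1])
out.tlevels.append([0, 0])
out = out.freeze()
print(f"{out.loss:.2f}", out.plevels, out.tlevels)  # 0.40 [0, 1] [0, 0]
```

The flattening in `freeze` iterates element by element, so it accepts plain lists as well as 1-D tensors, and the resulting lists can be passed directly to `f1_score` and `classification_report`.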

In [None]:
class MyTrainer:
    _tqdm_ = dict(ncols=100, leave=False, file=sys.stdout)

    def __init__(self, config, fold, checkpoint=None) -> None:
        self.C = config
        self.fold = fold

        # model
        self.tokenizer = DistilBertTokenizer.from_pretrained(self.C.model.name)
        self.model = DistilBertForSequenceClassification.from_pretrained(self.C.model.name, num_labels=7).cuda()
        self.model = nn.DataParallel(self.model)
        # loss
        self.criterion = FocalLoss(self.C.train.loss.params.gamma).cuda()
        # optimizer
        self.optimizer = AdamW(self.model.parameters(), lr=self.C.train.lr)
        # scheduler
        self.scheduler = ReduceLROnPlateau(self.optimizer, **self.C.train.scheduler.params)

        self.epoch = 1
        self.best_loss = math.inf
        self.best_acc = 0.0
        self.earlystop_cnt = 0
        self._freeze_step = 3

        if checkpoint is not None:
            if Path(checkpoint).exists():
                self.load(checkpoint)
            else:
                self.C.log.info("No checkpoint file", checkpoint)

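        # dataset (DatasetGeneratorVer7 is the only generator defined in this notebook)
        self.dsgen = DatasetGeneratorVer7(
            data_dir=self.C.dataset.dir,
            seed=self.C.seed,
            fold=self.fold,
            tokenizer=self.tokenizer,
            batch_size=self.C.dataset.batch_size,
            num_workers=self.C.dataset.num_workers,
            train_shuffle=True,
            oversampling=self.C.dataset.oversampling,
            oversampling_scale=self.C.dataset.oversampling_scale,
        )
        self.tdl, self.vdl = self.dsgen.train_valid()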

    def _freeze_step1(self):
        self._freeze_step = 1
        self.model.module.requires_grad_(False)
        self.model.module.classifier.requires_grad_(True)

    def _freeze_step2(self):
        self._freeze_step = 2
        self.model.module.requires_grad_(True)
        self.model.module.classifier.requires_grad_(False)

    def _freeze_step3(self):
        self._freeze_step = 3
        self.model.module.requires_grad_(True)

    def save(self, path):
        torch.save(
            {
                "model": self.model.module.state_dict(),
                "optimizer": self.optimizer.state_dict(),
                "epoch": self.epoch,
                "best_loss": self.best_loss,
                "best_acc": self.best_acc,
                "earlystop_cnt": self.earlystop_cnt,
            },
            path,
        )

    def load(self, path):
        print("Load pretrained", path)
        ckpt = torch.load(path)
        self.model.module.load_state_dict(ckpt["model"])
        self.optimizer.load_state_dict(ckpt["optimizer"])
        self.epoch = ckpt["epoch"] + 1
        self.best_loss = ckpt["best_loss"]
        self.best_acc = ckpt["best_acc"]
        self.earlystop_cnt = ckpt["earlystop_cnt"]

    def train_loop(self):
        self.model.train()

        O = MyOutput()
        with tqdm(total=len(self.tdl.dataset), desc=f"Train {self.epoch:03d}", **self._tqdm_) as t:
            for text, tlevel, otext in self.tdl:
                text_ = text.cuda()
                tlevel_ = tlevel.cuda()
                plevel_ = self.model(text_)[0]
                loss = self.criterion(plevel_, tlevel_)

                self.optimizer.zero_grad()
                loss.backward()
                self.optimizer.step()

                pvlevel_ = plevel_.detach().argmax(dim=1)
                acc = (pvlevel_ == tlevel_).sum() / len(text) * 100
                O.loss.update(loss.item(), len(text))
                O.acc.update(acc.item(), len(text))
                O.plevels.append(pvlevel_.cpu())
                O.tlevels.append(tlevel)
                t.set_postfix_str(f"loss: {O.loss():.6f}, acc: {O.acc():.2f}", refresh=False)
                t.update(len(text))
        return O.freeze()

    @torch.no_grad()
    def valid_loop(self):
        self.model.eval()

        O = MyOutput()
        with tqdm(total=len(self.vdl.dataset), desc=f"Valid {self.epoch:03d}", **self._tqdm_) as t:
            for text, tlevel, otext in self.vdl:
                text_ = text.cuda()
                tlevel_ = tlevel.cuda()
                plevel_ = self.model(text_)[0]
                loss = self.criterion(plevel_, tlevel_)

                pvlevel_ = plevel_.detach().argmax(dim=1)
                acc = (pvlevel_ == tlevel_).sum() / len(text) * 100
                O.loss.update(loss.item(), len(text))
                O.acc.update(acc.item(), len(text))
                O.plevels.append(pvlevel_.cpu())
                O.tlevels.append(tlevel)
                t.set_postfix_str(f"loss: {O.loss():.6f}, acc: {O.acc():.2f}", refresh=False)
                t.update(len(text))
        return O.freeze()

    @torch.no_grad()
    def callback(self, to: MyOutput, vo: MyOutput):
        # f1 score
        tf1 = f1_score(to.tlevels, to.plevels, zero_division=1, average="macro")
        vf1 = f1_score(vo.tlevels, vo.plevels, zero_division=1, average="macro")
        trep = str(classification_report(to.tlevels, to.plevels, labels=[0, 1, 2, 3, 4, 5, 6], zero_division=1))
        vrep = str(classification_report(vo.tlevels, vo.plevels, labels=[0, 1, 2, 3, 4, 5, 6], zero_division=1))

        self.C.log.info(
            f"Epoch: {self.epoch:03d}/{self.C.train.max_epochs},",
            f"loss: {to.loss:.6f};{vo.loss:.6f},",
            f"acc {to.acc:.2f};{vo.acc:.2f}",
            f"f1 {tf1:.2f}:{vf1:.2f}",
        )
        self.C.log.info("Train Report\r\n" + trep)
        self.C.log.info("Validation Report\r\n" + vrep)
        self.C.log.flush()

        if isinstance(self.scheduler, ReduceLROnPlateau):
            self.scheduler.step(vo.loss)

        if self.best_loss - vo.loss > 1e-6 or vf1 - self.best_acc > 1e-6:
            if self.best_loss > vo.loss:
                self.best_loss = vo.loss
            else:
                self.best_acc = vf1

            self.earlystop_cnt = 0
            self.save(self.C.result_dir / f"{self.C.uid}_{self.fold}.pth")

            # TODO: output a summary image of the results
        else:
            self.earlystop_cnt += 1

    def fit(self):
        for self.epoch in range(self.epoch, self.C.train.max_epochs + 1):
            if self.C.train.finetune.do:
                if self.epoch <= self.C.train.finetune.step1_epochs:
                    if self._freeze_step != 1:
                        self.C.log.info("Finetune Step 1")
                        self._freeze_step1()
                elif self.epoch <= self.C.train.finetune.step2_epochs:
                    if self._freeze_step != 2:
                        self.C.log.info("Finetune Step 2")
                        self._freeze_step2()
                elif self.epoch > self.C.train.finetune.step2_epochs:
                    if self._freeze_step != 3:
                        self.C.log.info("Finetune Step 3")
                        self._freeze_step3()

            to = self.train_loop()
            vo = self.valid_loop()
            self.callback(to, vo)

## Load Configuration

Contents of `./config/distilbert-base-uncased-ver7.yaml`:

```yaml
model:
  name: distilbert-base-uncased
comment: null # optional extra note
result_dir: results/distilbert-base-uncased-ver7 # output directory (*.log log files and *.pth checkpoints are saved here)

debug: false
seed: 20210425
ver: 7

train:
  SAM: false
  folds: 
    - 1
    - 2
    - 3  # which folds to train; values from 1 to 5
    - 4  # each fold takes about 15-25 hours, depending on the GPU
    - 5
  checkpoints: 
    - null
    - null
    - null  # set these to resume training from a checkpoint
    - null
    - null
  loss: 
    name: focal # ce, focal, arcface
    params:
      gamma: 2.0
      s: 45.0
      m: 0.1
      crit: focal
    
  optimizer:
    name: AdamW # Adam, AdamW
  
  finetune:
    do: true  # head only for 2 epochs, body only for 2 epochs, then the whole model for 8 epochs
    step1_epochs: 2
    step2_epochs: 4
  max_epochs: 12
    
  lr: 0.00001
  scheduler:
    name: ReduceLROnPlateau
    params:
      factor: 0.5
      patience: 3
      verbose: true
  
dataset:
  dir: data/ver6  # dataset path
  batch_size: 35  # batch size 35 needs about 22GB of GPU memory (measured at finetune step 3)
  num_workers: 8
  oversampling: true  # append oversampling_scale extra copies of every level 2, 4, and 6 sample
  oversampling_scale: 10
```
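
Note that `step1_epochs` and `step2_epochs` are cumulative epoch boundaries, not durations. The sketch below restates the branching in `MyTrainer.fit` as a standalone function (`freeze_step` is a hypothetical name) to show which step each of the 12 epochs falls into.

```python
def freeze_step(epoch, step1_epochs=2, step2_epochs=4):
    """Return which fine-tuning step a 1-based epoch falls into."""
    if epoch <= step1_epochs:
        return 1  # train the classifier head only
    if epoch <= step2_epochs:
        return 2  # train everything except the classifier head
    return 3  # train the whole model


schedule = [freeze_step(epoch) for epoch in range(1, 13)]  # max_epochs: 12
print(schedule)  # [1, 1, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3]
```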

In [None]:
def main():
    CONFIG_PATH = "config/distilbert-base-uncased-ver7.yaml"
    
    with open(CONFIG_PATH, "r") as f:
        C = EasyDict(yaml.load(f, yaml.FullLoader))

    for fold, checkpoint in zip(C.train.folds, C.train.checkpoints):
        with open(CONFIG_PATH, "r") as f:
            C = EasyDict(yaml.load(f, yaml.FullLoader))
            Path(C.result_dir).mkdir(parents=True, exist_ok=True)

            if C.dataset.num_workers < 0:
                C.dataset.num_workers = multiprocessing.cpu_count()
            C.uid = f"{C.model.name.split('/')[-1]}-{C.train.loss.name}"
            C.uid += f"-{C.train.optimizer.name}"
            C.uid += f"-lr{C.train.lr}"
            C.uid += f"-ver{C.ver}" if C.ver > 1 else ""
            C.uid += f"-os{C.dataset.oversampling_scale}" if C.dataset.oversampling else ""
            C.uid += "-sam" if C.train.SAM else ""
            C.uid += f"-{C.comment}" if C.comment is not None else ""
            print(C.uid)

            log = CustomLogger(Path(C.result_dir) / f"{C.uid}_{fold}.log", "a")
            log.info("\r\n" + pformat(C))
            log.flush()

            C.log = log
            C.result_dir = Path(C.result_dir)
            C.dataset.dir = Path(C.dataset.dir)
            seed_everything(C.seed, deterministic=False)

        C.log.info("Fold", fold, ", checkpoint", checkpoint)
        trainer = MyTrainer(C, fold, checkpoint)
        trainer.fit()

In [None]:
main()

**Training log**

```log
[2021-05-11 19:04:32  INFO] Epoch: 006/12, loss: 0.001347;0.001057, acc 99.90;99.91 f1 0.93:0.99
[2021-05-11 19:04:32  INFO] Train Report
              precision    recall  f1-score   support

           0       1.00      1.00      1.00    224634
           1       1.00      1.00      1.00    107356
           2       0.94      0.87      0.90       106
           3       1.00      1.00      1.00      3376
           4       0.82      0.98      0.89        88
           5       1.00      0.98      0.99      1473
           6       0.90      0.64      0.75        70

    accuracy                           1.00    337103
   macro avg       0.95      0.92      0.93    337103
weighted avg       1.00      1.00      1.00    337103

[2021-05-11 19:04:32  INFO] Validation Report
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     56159
           1       1.00      1.00      1.00     26839
           2       1.00      0.96      0.98        26
           3       1.00      0.99      1.00       843
           4       1.00      1.00      1.00        22
           5       1.00      0.99      0.99       369
           6       0.95      1.00      0.97        18

    accuracy                           1.00     84276
   macro avg       0.99      0.99      0.99     84276
weighted avg       1.00      1.00      1.00     84276

<remaining output omitted>

[2021-05-12 01:29:26  INFO] Epoch: 012/12, loss: 0.000648;0.001045, acc 99.92;99.90 f1 0.99:1.00
[2021-05-12 01:29:26  INFO] Train Report
              precision    recall  f1-score   support

           0       1.00      1.00      1.00    224634
           1       1.00      1.00      1.00    107356
           2       0.96      0.98      0.97       106
           3       1.00      1.00      1.00      3376
           4       1.00      1.00      1.00        88
           5       1.00      0.99      0.99      1473
           6       1.00      1.00      1.00        70

    accuracy                           1.00    337103
   macro avg       0.99      1.00      0.99    337103
weighted avg       1.00      1.00      1.00    337103

[2021-05-12 01:29:26  INFO] Validation Report
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     56159
           1       1.00      1.00      1.00     26839
           2       1.00      1.00      1.00        26
           3       1.00      1.00      1.00       843
           4       1.00      1.00      1.00        22
           5       0.99      0.99      0.99       369
           6       1.00      1.00      1.00        18

    accuracy                           1.00     84276
   macro avg       1.00      1.00      1.00     84276
weighted avg       1.00      1.00      1.00     84276
```

---

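# Post-processing

Feature maps are computed for every training sample and saved. This collection of feature maps is called the "deck" from here on.

This part was run separately from training, so it starts again from cleared memory, beginning with reading the config and creating the trainer.  
It is written for fold 3 only. For cross-validation, the same procedure has to be repeated for each of the 5 folds, 5 times in total.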

In [3]:
postfix = "distilbert-base-uncased-focal-AdamW-lr1e-05-ver7-os10_3"
outdir = Path("results/distilbert-base-uncased-ver7")
fold = 3

In [4]:
with open("config/distilbert-base-uncased-ver7.yaml", "r") as f:
    C = EasyDict(yaml.load(f, yaml.FullLoader))
    C.result_dir = Path(C.result_dir)
    C.dataset.dir = Path(C.dataset.dir)
    seed_everything(C.seed, deterministic=False)

In [6]:
trainer = MyTrainer(C, fold, outdir / f"{postfix}.pth")

Load pretrained results/distilbert-base-uncased-ver7/distilbert-base-uncased-focal-AdamW-lr1e-05-ver7-os10_3.pth


In [7]:
model = trainer.model
model.eval()
torch.set_grad_enabled(False)

<torch.autograd.grad_mode.set_grad_enabled at 0x7fdcd9cafe10>

In [8]:
activation = []


def hook(model, input, output):
    activation.append(output.detach().cpu())

In [9]:
model.module.pre_classifier.register_forward_hook(hook)  # set feature hook function

<torch.utils.hooks.RemovableHandle at 0x7fdb7a876c50>

In [10]:
train_dl = trainer.dsgen.train_only()  # the full training set, without shuffling

### Make Train Deck

In [None]:
activation = []
deck = {"feat": [], "otext": [], "tlevel": [], "fclevel": []}
with tqdm(total=len(train_dl.dataset), ncols=100, file=sys.stdout) as t:
    for text, tlevel, otext in train_dl:
        fclevel = model(text.cuda(non_blocking=True))[0].argmax(dim=1).cpu()
        deck["fclevel"].append(fclevel)
        deck["tlevel"].append(tlevel)
        deck["otext"].extend(otext)
        t.update(text.size(0))

 76%|████████████████████████████████████████▉             | 319130/421079 [15:50<05:03, 336.34it/s]

In [None]:
deck["feat"] = torch.cat(activation)
deck["tlevel"] = torch.cat(deck["tlevel"])
deck["fclevel"] = torch.cat(deck["fclevel"])

In [None]:
deck["tlevel"].shape, deck["fclevel"].shape

In [None]:
deck["feat"].shape

In [None]:
torch.save(deck, outdir / f"{postfix}-deck1.pth")

### Analysis of the validation level 7 samples

Print the distances, indices, and levels of the 4 nearest deck entries.

With less than 20GB of memory, an out-of-memory error may occur.  
If memory is short, splitting the test data into chunks may help.
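
One way to bound memory is to scan the data chunk by chunk and keep only a running top-k. The sketch below is a pure-Python illustration with toy 2-D features; it chunks over the stored features, and the same pattern applies when chunking over test samples. `topk_nearest_chunked` and `chunk_size` are hypothetical names.

```python
import heapq
import math


def topk_nearest_chunked(deck_feats, query, k=4, chunk_size=2):
    """Find the k nearest deck rows while scanning the deck one chunk at a time."""
    best = []  # running (distance, index) candidates, at most k long
    for start in range(0, len(deck_feats), chunk_size):
        chunk = deck_feats[start : start + chunk_size]
        scored = [(math.dist(feat, query), start + i) for i, feat in enumerate(chunk)]
        best = heapq.nsmallest(k, best + scored)
    return best


deck_feats = [(0.0, 0.0), (5.0, 5.0), (1.0, 0.0), (0.0, 2.0), (9.0, 9.0)]
top3 = topk_nearest_chunked(deck_feats, query=(0.0, 0.0), k=3)
print(top3)  # [(0.0, 0), (1.0, 2), (2.0, 3)]
```

Only one chunk of distances is held at a time, so peak memory depends on `chunk_size` rather than on the size of the deck.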

In [15]:
ds_lv7 = trainer.dsgen.valid_lv7().dataset

In [16]:
text, otext = ds_lv7[0]
activation = []
print(model(text[None].cuda())[0].cpu())

tensor([[-4.8830,  4.1110, -5.7687, -4.4678, -7.3731, -4.0645, -6.7607]])


In [17]:
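# NOTE: activation[0] is (1, 768), so activation[0][None] is (1, 1, 768) and dim=1 reduces over the N deck rows;
# the indices below are feature dimensions (< 768). Per-row distances need torch.norm(deck["feat"] - activation[0], dim=1).
dists, indices = torch.norm(deck["feat"] - activation[0][None], p=None, dim=1).topk(4, largest=False)
dists, indices, deck["tlevel"][indices]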

(tensor([[24.6197, 32.4965, 36.9424, 40.4677]]),
 tensor([[144, 177, 112, 195]]),
 tensor([[0, 0, 0, 0]]))

In [18]:
text, otext = ds_lv7[1]
activation = []
print(model(text[None].cuda())[0].cpu())

tensor([[-3.2189, -2.4237, -7.7600,  1.8829, -8.6021, -5.1994, -7.5476]])


In [19]:
dists, indices = torch.norm(deck["feat"] - activation[0][None], p=None, dim=1).topk(4, largest=False)
dists, indices, deck["tlevel"][indices]

(tensor([[38.2679, 40.8461, 46.9950, 47.6562]]),
 tensor([[536, 236, 272, 377]]),
 tensor([[0, 0, 0, 0]]))

In [20]:
text, otext = ds_lv7[2]
activation = []
print(model(text[None].cuda())[0].cpu())

tensor([[-0.8619, -2.2384, -1.8922, -2.0378, -3.1984, -2.5976, -2.0431]])


In [21]:
dists, indices = torch.norm(deck["feat"] - activation[0][None], p=None, dim=1).topk(4, largest=False)
dists, indices, deck["tlevel"][indices]

(tensor([[29.5083, 43.1404, 52.9118, 53.9867]]),
 tensor([[144, 103,  82, 757]]),
 tensor([[0, 0, 0, 0]]))

---

### Make Test Deck

In [3]:
postfix = "distilbert-base-uncased-focal-AdamW-lr1e-05-ver7-os10_3"
outdir = Path("results/distilbert-base-uncased-ver7")
fold = 3

In [4]:
with open("config/distilbert-base-uncased-ver7.yaml", "r") as f:
    C = EasyDict(yaml.load(f, yaml.FullLoader))
    C.result_dir = Path(C.result_dir)
    C.dataset.dir = Path(C.dataset.dir)
    seed_everything(C.seed, deterministic=False)

In [6]:
trainer = MyTrainer(C, fold, outdir / f"{postfix}.pth")

Load pretrained results/distilbert-base-uncased-ver7/distilbert-base-uncased-focal-AdamW-lr1e-05-ver7-os10_3.pth


In [7]:
model = trainer.model
model.eval()
torch.set_grad_enabled(False)

<torch.autograd.grad_mode.set_grad_enabled at 0x7fdcd9cafe10>

In [8]:
activation = []


def hook(model, input, output):
    activation.append(output.detach().cpu())

In [9]:
model.module.pre_classifier.register_forward_hook(hook)  # set feature hook function

<torch.utils.hooks.RemovableHandle at 0x7fdb7a876c50>

In [22]:
ds_test = trainer.dsgen.test()

In [23]:
activation = []
deck = {"feat": [], "otext": [], "fclevel": [], "ids": []}
with tqdm(total=len(ds_test), ncols=100, file=sys.stdout) as t:
    for i in range(len(ds_test)):
        text, otext, ids = ds_test[i]
        fclevel = model(text[None].cuda(non_blocking=True))[0].argmax(dim=1).cpu()
        deck["fclevel"].append(fclevel)
        deck["otext"].append(otext)
        deck["ids"].append(ids)
        t.update()

100%|██████████████████████████████████████████████████| 1095951/1095951 [1:42:37<00:00, 177.99it/s]


In [None]:
deck["feat"] = torch.stack(activation)
deck["fclevel"] = torch.stack(deck["fclevel"])

In [38]:
deck["feat"].shape, deck["fclevel"].shape

(torch.Size([1095951, 1, 768]), torch.Size([1095951]))

In [39]:
deck["feat"] = deck["feat"][:, 0, :]
# deck["fclevel"] = deck["fclevel"][:, 0]

In [40]:
deck["feat"].shape, deck["fclevel"].shape

(torch.Size([1095951, 768]), torch.Size([1095951]))

In [41]:
torch.save(deck, outdir / f"{postfix}-deck2.pth")

---

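## Computing the distance from each test feature to all train features

This part was run separately from the steps above, so the saved decks are loaded again. (Running everything in one session runs out of memory.)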

In [3]:
deck1 = torch.load("results/distilbert-base-uncased-ver7/distilbert-base-uncased-focal-AdamW-lr1e-05-ver7-os10_3-deck1.pth")
deck2 = torch.load("results/distilbert-base-uncased-ver7/distilbert-base-uncased-focal-AdamW-lr1e-05-ver7-os10_3-deck2.pth")

In [4]:
deck1["feat"].shape, deck2["feat"].shape

(torch.Size([421079, 768]), torch.Size([1095951, 768]))

In [5]:
deck1["feat_"] = deck1["feat"].cuda()
deck2["feat_"] = deck2["feat"].cuda()

In [6]:
N = deck2["feat"].size(0)
distdeck = {"dist": [], "level": []}
with tqdm(total=N, ncols=100, file=sys.stdout) as t:
    for i in range(N):
        dist_ = torch.norm(deck1["feat_"] - deck2["feat_"][i, None], p=None, dim=1)
        dist_, indices_ = dist_.topk(4, largest=False)
        tlevels = deck1["tlevel"][indices_]

        distdeck["dist"].append(dist_.cpu())
        distdeck["level"].append(tlevels)

        t.update()

100%|██████████████████████████████████████████████████| 1095951/1095951 [2:05:02<00:00, 146.07it/s]


In [7]:
distdeck["dist"] = torch.stack(distdeck["dist"])
distdeck["level"] = torch.stack(distdeck["level"])

In [11]:
torch.save(
    distdeck, "results/distilbert-base-uncased-ver7/distilbert-base-uncased-focal-AdamW-lr1e-05-ver7-os10_3-distdeck.pth"
)

---

# Inference

## Building the submission file

This part was run separately from the steps above, so the saved decks are loaded again. (Running everything in one session runs out of memory.)

In [3]:
distdeck = torch.load(
    "results/distilbert-base-uncased-ver7/distilbert-base-uncased-focal-AdamW-lr1e-05-ver7-os10_3-distdeck.pth"
)

In [4]:
deck2 = torch.load("results/distilbert-base-uncased-ver7/distilbert-base-uncased-focal-AdamW-lr1e-05-ver7-os10_3-deck2.pth")

In [5]:
df = pd.read_csv("data/ori/test.csv")

In [6]:
distdeck["dist"].shape, distdeck["level"].shape

(torch.Size([1095951, 4]), torch.Size([1095951, 4]))

In [7]:
distdeck.keys(), deck2.keys()

(dict_keys(['dist', 'level']), dict_keys(['feat', 'otext', 'fclevel', 'ids']))

In [8]:
total_len = 1418916

In [9]:
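def policy(dists, tlevels, fclevel):
    # dists / tlevels: distances and levels of the 4 nearest train features (nearest first)
    # fclevel: level predicted by the fully connected layer
    if fclevel in [6, 4, 2]:  # rare levels: trust the classifier as-is
        return fclevel.item()
    if (tlevels == 5).all():  # all neighbours are level 5: looser threshold
        return 5 if dists[0] < 1.5 else 7
    if (tlevels == 3).all():  # all neighbours are level 3: looser threshold
        return 3 if dists[0] < 1.5 else 7
    if dists[0] < 0.7:  # close enough to known data: keep the classifier output
        return fclevel.item()
    return 7  # far from every known log: treat as level 7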
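
To make the decision rule concrete, here is a plain-Python restatement of `policy` with a few worked cases. `policy_plain` and the example inputs are hypothetical. The branches and thresholds (0.7 and 1.5) are copied from the function above, and `dists` is assumed to be sorted nearest-first, as `topk(..., largest=False)` returns it.

```python
def policy_plain(dists, tlevels, fclevel):
    """Same branches as `policy`, but on plain Python lists and ints."""
    if fclevel in (2, 4, 6):
        return fclevel
    if all(lv == 5 for lv in tlevels):
        return 5 if dists[0] < 1.5 else 7
    if all(lv == 3 for lv in tlevels):
        return 3 if dists[0] < 1.5 else 7
    if dists[0] < 0.7:
        return fclevel
    return 7


# (dists, neighbour levels, classifier level) -> expected output
cases = [
    (([0.1, 0.2, 0.3, 0.4], [0, 0, 0, 0], 0), 0),  # close to known data
    (([0.9, 1.0, 1.1, 1.2], [0, 0, 1, 0], 0), 7),  # nearest neighbour too far
    (([1.2, 1.3, 1.4, 1.6], [5, 5, 5, 5], 1), 5),  # unanimous level-5 neighbours
    (([1.6, 1.7, 1.8, 1.9], [5, 5, 5, 5], 5), 7),  # beyond the looser threshold
    (([3.0, 3.1, 3.2, 3.3], [0, 0, 0, 0], 2), 2),  # rare level from the classifier
]

for args, expected in cases:
    got = policy_plain(*args)
    print(args, "->", got)
    assert got == expected
```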

In [10]:
out_dists = [None for _ in range(total_len)]
out_levels = [None for _ in range(total_len)]
out_fclevels = [None for _ in range(total_len)]
N = distdeck["dist"].size(0)
with tqdm(total=N, ncols=100, file=sys.stdout) as t:
    for i in range(N):
        dists = distdeck["dist"][i]
        levels = distdeck["level"][i]
        fclevel = deck2["fclevel"][i]
        out_level = policy(dists, levels, fclevel)
        ids = deck2["ids"][i]
        for j in ids:
            out_levels[j - 1000000] = out_level
            out_dists[j - 1000000] = dists
            out_fclevels[j - 1000000] = fclevel
        t.update()

100%|██████████████████████████████████████████████████| 1095951/1095951 [01:20<00:00, 13610.90it/s]
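
The loop above writes one prediction per deduplicated text back to every original row. The sketch below shows that scatter step on toy data. It assumes that each `ids` entry lists all original test ids sharing one deduplicated text; `dedup_predictions` and the toy size are hypothetical.

```python
ID_OFFSET = 1_000_000  # first id in the submission, as in the notebook
total_len = 5          # toy size; the notebook uses 1418916

# Each deduplicated entry: (predicted level, ids of all original rows with that text).
dedup_predictions = [
    (0, [1_000_000, 1_000_003]),
    (1, [1_000_001]),
    (7, [1_000_002, 1_000_004]),
]

out_levels = [None] * total_len
for level, ids in dedup_predictions:
    for j in ids:
        out_levels[j - ID_OFFSET] = level  # id -> zero-based row position

# Every original row should have received a prediction.
assert None not in out_levels
print(out_levels)  # [0, 1, 7, 0, 7]
```

Checking for leftover `None` entries is a cheap way to confirm that the id lists cover the whole test set.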


In [11]:
out_levels = np.array(out_levels)

In [12]:
# count per level
for i in range(8):
    cnt = (out_levels == i).sum()
    print(i, ":", cnt, f"{cnt / len(out_levels)*100:.2f}%")

0 : 1003955 70.76%
1 : 395007 27.84%
2 : 42 0.00%
3 : 12950 0.91%
4 : 34 0.00%
5 : 6334 0.45%
6 : 31 0.00%
7 : 563 0.04%


In [13]:
out_ids = list(range(1000000, 1000000 + len(out_levels)))

In [14]:
out_df = {"id": out_ids, "level": out_levels}

In [15]:
out_df = pd.DataFrame(out_df)

In [16]:
out_df

| | id | level |
|---|---|---|
| 0 | 1000000 | 0 |
| 1 | 1000001 | 0 |
| 2 | 1000002 | 1 |
| 3 | 1000003 | 0 |
| 4 | 1000004 | 1 |
| ... | ... | ... |
| 1418911 | 2418911 | 0 |
| 1418912 | 2418912 | 0 |
| 1418913 | 2418913 | 1 |
| 1418914 | 2418914 | 0 |


In [17]:
# final output
out_df.to_csv(
    "results/distilbert-base-uncased-ver7/distilbert-base-uncased-focal-AdamW-lr1e-05-ver7-os10_3-out_ver2.csv", index=False
)

---

# Cross Validation

In [3]:
csvs = [
    # fold 3 comes first: csvs[0] is the fallback when all folds disagree
    pd.read_csv("results/distilbert-base-uncased-ver7/distilbert-base-uncased-focal-AdamW-lr1e-05-ver7-os10_3-out_ver2.csv"),
    pd.read_csv("results/distilbert-base-uncased-ver7/distilbert-base-uncased-focal-AdamW-lr1e-05-ver7-os10_1-out_ver2.csv"),
    pd.read_csv("results/distilbert-base-uncased-ver7/distilbert-base-uncased-focal-AdamW-lr1e-05-ver7-os10_2-out_ver2.csv"),
    pd.read_csv("results/distilbert-base-uncased-ver7/distilbert-base-uncased-focal-AdamW-lr1e-05-ver7-os10_4-out_ver2.csv"),
    pd.read_csv("results/distilbert-base-uncased-ver7/distilbert-base-uncased-focal-AdamW-lr1e-05-ver7-os10_5-out_ver2.csv"),
]

In [4]:
levels = np.stack([csv.level.to_numpy() for csv in csvs])

In [5]:
N = len(csvs[0])

In [6]:
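out_levels = []
for i in range(N):
    kt = False  # True once a rare level (2, 4, 6) has been chosen for row i
    ll = [0 for _ in range(8)]  # votes per level
    for j in range(len(csvs)):
        if levels[j, i] in [2, 4, 6]:
            # any fold predicting a rare level wins immediately
            out_levels.append(levels[j, i])
            kt = True
            break
        else:
            ll[levels[j, i]] += 1

    if not kt:
        if max(ll) == 1:
            # all folds disagree: fall back to the first CSV
            out_levels.append(levels[0, i])
        elif ll[7] >= 2:
            # two votes are enough for level 7
            out_levels.append(7)
        else:
            # otherwise take the majority vote (lowest level wins ties)
            out_levels.append(max(range(8), key=lambda lv: ll[lv]))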
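
The voting rule in the loop above can be restated as a small function and checked on a few hand-made rows. This is an illustrative sketch: `vote` and the example predictions are hypothetical, and the rule itself is copied from the loop (rare levels win immediately, two votes suffice for level 7, otherwise majority, with the first fold as fallback).

```python
def vote(preds):
    """Combine one prediction per fold; preds[0] is the fallback fold."""
    counts = [0] * 8
    for p in preds:
        if p in (2, 4, 6):
            return p  # a rare level from any fold wins immediately
        counts[p] += 1
    if max(counts) == 1:
        return preds[0]  # every fold disagrees: use the first fold
    if counts[7] >= 2:
        return 7  # two votes are enough for level 7
    return counts.index(max(counts))  # majority; lowest level wins ties


examples = [
    ([0, 0, 1, 0, 1], 0),  # plain majority
    ([0, 0, 0, 7, 7], 7),  # level 7 needs only two votes
    ([1, 0, 3, 5, 7], 1),  # all different: first fold decides
    ([0, 0, 4, 0, 0], 4),  # a single rare-level vote wins
    ([1, 1, 0, 0, 3], 0),  # 2-2 tie: lowest level
]

for preds, expected in examples:
    got = vote(preds)
    print(preds, "->", got)
    assert got == expected
```

Because the first fold acts as the fallback, the order of the input files affects the result whenever all folds disagree.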

In [7]:
df = pd.DataFrame({"id": csvs[0].id.to_list(), "level": out_levels})

In [8]:
df

| | id | level |
|---|---|---|
| 0 | 1000000 | 0 |
| 1 | 1000001 | 0 |
| 2 | 1000002 | 1 |
| 3 | 1000003 | 0 |
| 4 | 1000004 | 1 |
| ... | ... | ... |
| 1418911 | 2418911 | 0 |
| 1418912 | 2418912 | 0 |
| 1418913 | 2418913 | 1 |
| 1418914 | 2418914 | 0 |


In [25]:
df.to_csv("results/ens-ver2.csv", index=False)