Most of the code for this model is copied from https://www.kaggle.com/code/yasufuminakama/fb3-deberta-v3-base-baseline-train and <br/>
https://www.kaggle.com/code/rhtsingh/utilizing-transformer-representations-efficiently <br/>

40 new essays were added from All Things Considered by G.K. Chesterton https://www.gutenberg.org/ebooks/11505, <br/>
and each of them was assigned a score of 5 on all target columns. <br/>

Adding essays with low scores on all columns is future work.

The procedure for adding essays is in the read_essays.ipynb file.

The "Model" section contains the crux of the project.

**Directory Settings**

In [None]:
import os
OUTPUT_DIR = '/content/drive/MyDrive/Colab Notebooks/Erdos Fall 2022/' # change directory when running on local machine

In [None]:
!pip install transformers==4.21.2
!pip install tokenizers==0.12.1
!pip install sentencepiece

**Custom Configuration class** <br/>
https://huggingface.co/docs/transformers/main_classes/configuration

In [None]:
# the variables used here will be used in the model later
class CFG:
    debug=False
    apex=True # GradScaler enabled. https://pytorch.org/docs/stable/notes/amp_examples.html
    print_freq=20
    num_workers= 4 #for multiprocessing data loading, change according to memory capacity
    model="microsoft/deberta-v3-base" #https://huggingface.co/docs/transformers/model_doc/deberta
    gradient_checkpointing=True  #https://pytorch.org/docs/stable/checkpoint.html
    scheduler='cosine' # ['linear', 'cosine']
    batch_scheduler=True
    num_cycles=0.5 # number of waves in the cosine schedule (the default of 0.5 simply decreases from the max value to 0,
                    # following a half-cosine)
    num_warmup_steps=0 #https://huggingface.co/docs/transformers/main_classes/optimizer_schedules
    epochs=3 # increase this for better results
    encoder_lr=2e-5
    decoder_lr=2e-5
    min_lr=1e-6
    eps=1e-6
    betas=(0.9, 0.999) #coefficients used for computing running averages of gradient and its square
    batch_size=2
    max_len=512
    weight_decay=0.01
    gradient_accumulation_steps=1
    max_grad_norm=1000
    target_cols=['cohesion', 'syntax', 'vocabulary', 'phraseology', 'grammar', 'conventions']
    seed=42
    n_fold=3
    trn_fold= range(n_fold)
    train=True
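
To make the `scheduler`, `num_cycles`, and `num_warmup_steps` settings concrete, here is a minimal pure-Python sketch of the cosine multiplier as we understand it from `get_cosine_schedule_with_warmup` in transformers. The helper name `cosine_lr` is hypothetical, and the total of 3156 steps (1052 steps per epoch for 3 epochs, matching the step counts printed in the training log) is an assumption used only for illustration.

```python
import math

def cosine_lr(step, total_steps, base_lr=2e-5, num_warmup_steps=0, num_cycles=0.5):
    """Learning rate after `step` scheduler steps (hypothetical helper, not part of the notebook)."""
    if step < num_warmup_steps:
        # linear warmup from 0 to base_lr
        return base_lr * step / max(1, num_warmup_steps)
    progress = (step - num_warmup_steps) / max(1, total_steps - num_warmup_steps)
    # with num_cycles=0.5 the cosine argument runs from 0 to pi: a single half-wave
    return base_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * 2.0 * num_cycles * progress)))

total_steps = 1052 * 3  # assumed: 1052 optimizer steps per epoch, 3 epochs
for step in (0, 1052, 2104, 3156):
    print(step, f"{cosine_lr(step, total_steps):.8f}")
```

With these assumptions the rate starts at `encoder_lr`, is about three quarters of it after one epoch and one quarter after two, and reaches zero at the last step.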

**Libraries**

In [None]:
import os
import gc
import re
import ast
import sys
import copy
import json
import time
import math
import string
import pickle
import random
import joblib
import itertools
import warnings
warnings.filterwarnings("ignore")

import scipy as sp
import numpy as np
import pandas as pd
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
from tqdm.auto import tqdm
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import StratifiedKFold, GroupKFold, KFold, train_test_split

os.system('pip install iterative-stratification==0.1.7')
from iterstrat.ml_stratifiers import MultilabelStratifiedKFold

import torch
import torch.nn as nn
from torch.nn import Parameter
import torch.nn.functional as F
from torch.optim import Adam, SGD, AdamW
from torch.utils.data import DataLoader, Dataset

import tokenizers
import transformers
from transformers import AutoTokenizer, AutoModel, AutoConfig
from transformers import get_linear_schedule_with_warmup, get_cosine_schedule_with_warmup
%env TOKENIZERS_PARALLELISM=true

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

env: TOKENIZERS_PARALLELISM=true


**Utils**

In [None]:
def MCRMSE(y_trues, y_preds):
    scores = []
    idxes = y_trues.shape[1]
    for i in range(idxes):
        y_true = y_trues[:,i]
        y_pred = y_preds[:,i]
        score = np.sqrt(mean_squared_error(y_true, y_pred)) # RMSE; does not rely on the squared=False argument, which newer sklearn versions deprecate
        scores.append(score)
    mcrmse_score = np.mean(scores)
    return mcrmse_score, scores


def get_score(y_trues, y_preds):
    mcrmse_score, scores = MCRMSE(y_trues, y_preds)
    return mcrmse_score, scores


def get_logger(filename=OUTPUT_DIR+'train'):
    from logging import getLogger, INFO, StreamHandler, FileHandler, Formatter
    logger = getLogger(__name__)
    logger.setLevel(INFO)
    handler1 = StreamHandler()
    handler1.setFormatter(Formatter("%(message)s"))
    handler2 = FileHandler(filename=f"{filename}.log")
    handler2.setFormatter(Formatter("%(message)s"))
    logger.addHandler(handler1)
    logger.addHandler(handler2)
    return logger

LOGGER = get_logger()


def seed_everything(seed=42):
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.backends.cudnn.deterministic = True
    
seed_everything(seed=42)
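
As a quick check of what `MCRMSE` computes, the following NumPy sketch reproduces the metric on a tiny made-up example with two target columns. The function name `mcrmse` and the numbers are illustrative assumptions, not data from the project.

```python
import numpy as np

def mcrmse(y_true, y_pred):
    """Mean of the per-column RMSE values (illustrative re-implementation)."""
    per_column = np.sqrt(np.mean((y_true - y_pred) ** 2, axis=0))
    return per_column.mean(), per_column

# two essays, two target columns (made-up scores)
y_true = np.array([[3.0, 4.0], [2.0, 4.0]])
y_pred = np.array([[3.0, 3.0], [4.0, 4.0]])

score, per_column = mcrmse(y_true, y_pred)
print(score, per_column)
```

Column 0 has squared errors 0 and 4 (RMSE is the square root of 2), column 1 has squared errors 1 and 0 (RMSE is the square root of 0.5), and the score is the average of the two.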

**Data Loading**

Because high scores are underrepresented in the data, we added 55 essays from the above-mentioned book.


In [None]:
df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/train_new.csv')
train, test = train_test_split(df, test_size=0.2, shuffle = True, random_state = 440)
train=train.reset_index(drop=True)
test= test.reset_index(drop=True)

print(f"df.shape: {df.shape}")
display(df.head())

print(f"train.shape: {train.shape}") # this will be used for training 
display(train.head())

print(f"test.shape: {test.shape}") # this will be used for testing
display(test.head())


df.shape: (3945, 8)


Unnamed: 0,text_id,full_text,cohesion,syntax,vocabulary,phraseology,grammar,conventions
0,0016926B079C,I think that students would benefit from learn...,3.5,3.5,3.0,3.0,4.0,3.0
1,0022683E9EA5,When a problem is a change you have to let it ...,2.5,2.5,3.0,2.0,2.0,2.5
2,00299B378633,"Dear, Principal\n\nIf u change the school poli...",3.0,3.5,3.0,3.0,3.0,2.5
3,003885A45F42,The best time in life is when you become yours...,4.5,4.5,4.5,4.5,4.0,5.0
4,0049B1DF5CCC,Small act of kindness can impact in other peop...,2.5,3.0,3.0,3.0,2.5,2.5


train.shape: (3156, 8)


Unnamed: 0,text_id,full_text,cohesion,syntax,vocabulary,phraseology,grammar,conventions
0,4D23A7FA9408,Although some say students wouldn't benefit at...,2.5,2.5,3.0,2.5,2.5,3.5
1,57DFE525E50A,Ever since I was in elementary school I have h...,4.0,4.5,5.0,4.5,4.5,4.0
2,EC07B11B2CF3,a positive attitude is the key to success in l...,2.0,2.0,3.0,2.5,2.0,2.0
3,B5FFED1536FE,I disagree with Emerson because if they have a...,2.5,2.0,3.0,2.0,2.5,3.5
4,845693DC199E,"Dear principal, ''This is what i think about c...",4.0,3.5,4.0,3.0,4.0,4.0


test.shape: (789, 8)


Unnamed: 0,text_id,full_text,cohesion,syntax,vocabulary,phraseology,grammar,conventions
0,0C28481DFBB8,"In my opinion, I know why Thomas Jefferson dec...",4.0,3.0,3.0,3.0,2.5,3.5
1,05E5653B781C,its always good to ask people how it feels or ...,3.0,3.5,3.0,3.0,3.5,3.0
2,F075F9AD2AA4,Do you think it is a good idea for the student...,2.5,2.5,2.5,2.0,2.0,2.0
3,ACA1A45EE438,I strongly disagree with extending the school ...,4.0,3.0,3.0,4.0,4.0,4.0
4,A0D47C0DD67F,Individuality is the idea of freedom of though...,3.0,3.0,3.0,2.0,2.0,2.0


**CV Split**

In [None]:
# ====================================================
# CV split
# ====================================================
Fold = MultilabelStratifiedKFold(n_splits=CFG.n_fold, shuffle=True, random_state=404)
for n, (train_index, val_index) in enumerate(Fold.split(train, train[CFG.target_cols])):
    train.loc[val_index, 'fold'] = int(n)
train['fold'] = train['fold'].astype(int)
display(train.groupby('fold').size())

fold
0    1052
1    1052
2    1052
dtype: int64

**Tokenizer**

In [None]:
tokenizer = AutoTokenizer.from_pretrained(CFG.model) # we used deberta tokenizer
tokenizer.save_pretrained(OUTPUT_DIR+'tokenizer/')
CFG.tokenizer = tokenizer # saved to configuration class

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


**Max length**

In [None]:
lengths = []
for text in tqdm(df['full_text'].values, total=len(df)):
    length = len(tokenizer(text, add_special_tokens=True)['input_ids'])
    lengths.append(length)
CFG.max_len = max(lengths)
LOGGER.info(f"max_len: {CFG.max_len}")

max_len: 2713


In [None]:
# example
tokenizer("I love NLP. I hate NLP", add_special_tokens=True) # there is no padding to max_len here, so attention mask does not have zeros.

{'input_ids': [1, 273, 472, 40903, 260, 273, 3254, 40903, 2], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}

**Dataset**

In [None]:
def prepare_input(cfg, text): # see example below
    inputs = cfg.tokenizer.encode_plus(
        text, 
        return_tensors=None, 
        add_special_tokens=True, # the special tokens are CLS at the beginning of sentence and SEP at the end of sentence
        max_length=cfg.max_len, # use the cfg argument rather than the global CFG
        padding='max_length', # replaces the deprecated pad_to_max_length=True
        truncation=True
    )
    for k, v in inputs.items():
        inputs[k] = torch.tensor(v, dtype=torch.long)
    return inputs


class TrainDataset(Dataset): #dataset class in pytorch https://pytorch.org/tutorials/beginner/basics/data_tutorial.html
    def __init__(self, cfg, df):
        self.cfg = cfg
        self.texts = df['full_text'].values
        self.labels = df[cfg.target_cols].values

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, item):
        inputs = prepare_input(self.cfg, self.texts[item])
        label = torch.tensor(self.labels[item], dtype=torch.float)
        return inputs, label
    
# collate finds the longest unpadded sequence in the batch and truncates every row to that length, which speeds up training
def collate(inputs):
    mask_len = int(inputs["attention_mask"].sum(axis=1).max())
    for k, v in inputs.items():
        inputs[k] = inputs[k][:,:mask_len]
    return inputs

In [None]:
#example
prepare_input(CFG, "I love NLP. I hate NLP.") #converted to tensor and padded zeros to max length

{'input_ids': tensor([  1, 273, 472,  ...,   0,   0,   0]), 'token_type_ids': tensor([0, 0, 0,  ..., 0, 0, 0]), 'attention_mask': tensor([1, 1, 1,  ..., 0, 0, 0])}
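
The effect of `collate` on a padded batch can be seen in a small sketch. It applies the same logic to NumPy arrays instead of tensors so that it runs without PyTorch; the token ids and the padded width of 6 are made-up values for illustration.

```python
import numpy as np

def collate(inputs):
    # same logic as the notebook's collate, applied to NumPy arrays
    mask_len = int(inputs["attention_mask"].sum(axis=1).max())
    for k in inputs:
        inputs[k] = inputs[k][:, :mask_len]
    return inputs

# two toy sequences padded to width 6; the longer one has 4 real tokens
batch = {
    "input_ids": np.array([[1, 273, 2, 0, 0, 0],
                           [1, 273, 472, 2, 0, 0]]),
    "attention_mask": np.array([[1, 1, 1, 0, 0, 0],
                                [1, 1, 1, 1, 0, 0]]),
}
batch = collate(batch)
print(batch["input_ids"].shape)  # width shrinks from 6 to 4
```

Only the columns that are padding in every row are dropped, so the shorter sequence keeps one padding position.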

**Model**

In [None]:
# see example below
class MeanPooling(nn.Module): #https://machinelearningmastery.com/pooling-layers-for-convolutional-neural-networks/
    def __init__(self):
        super(MeanPooling, self).__init__()
        
    def forward(self, last_hidden_state, attention_mask):
        input_mask_expanded = attention_mask.unsqueeze(-1).expand(last_hidden_state.size()).float()
        sum_embeddings = torch.sum(last_hidden_state * input_mask_expanded, 1)
        sum_mask = input_mask_expanded.sum(1)
        sum_mask = torch.clamp(sum_mask, min=1e-9)
        mean_embeddings = sum_embeddings / sum_mask
        return mean_embeddings
    

class CustomModel(nn.Module):
    def __init__(self, cfg, config_path=None, pretrained=False):
        super().__init__()
        self.cfg = cfg
        if config_path is None:
            self.config = AutoConfig.from_pretrained(cfg.model, output_hidden_states=True)
            self.config.hidden_dropout = 0.
            self.config.hidden_dropout_prob = 0.
            self.config.attention_dropout = 0.
            self.config.attention_probs_dropout_prob = 0.
            LOGGER.info(self.config)
        else:
            self.config = torch.load(config_path)
        if pretrained:
            self.model = AutoModel.from_pretrained(cfg.model, config=self.config)
        else:
            self.model = AutoModel.from_config(self.config) # AutoModel cannot be instantiated directly
        if self.cfg.gradient_checkpointing:
            self.model.gradient_checkpointing_enable()
        self.pool = MeanPooling()
        self.fc = nn.Linear(self.config.hidden_size, 6) # this is the final layer
        self._init_weights(self.fc)
        
    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
            if module.bias is not None:
                module.bias.data.zero_()
        elif isinstance(module, nn.Embedding):
            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
            if module.padding_idx is not None:
                module.weight.data[module.padding_idx].zero_()
        elif isinstance(module, nn.LayerNorm):
            module.bias.data.zero_()
            module.weight.data.fill_(1.0)
        
    def feature(self, inputs):
        outputs = self.model(**inputs)
        last_hidden_states = outputs[0]
        feature = self.pool(last_hidden_states, inputs['attention_mask']) # this is where the pooling happens
        return feature

    def forward(self, inputs):
        feature = self.feature(inputs)
        output = self.fc(feature) #final layer output
        return output

*Examples*

In [None]:
#sentence= "I love NLP." #we didn't add cls and sep here.
attention_mask1= torch.tensor([1,1,1,0,0]) # 1's for texts and special tokens. 0's for padding.
input_mask_expanded1 = attention_mask1.unsqueeze(-1).expand([5,4]).float() # matrix the size of hidden state
last_hidden_state1= torch.tensor([[2,4.5,5,1],[4,1,2,.2],[1,1,2,1],[0.1,2,6,1],[1,1,1,1]]) #sequence of hidden states at the output of the last layer
#https://huggingface.co/docs/transformers/main_classes/output
sum_embeddings1 = torch.sum(last_hidden_state1 * input_mask_expanded1, 1)
sum_mask1 = input_mask_expanded1.sum(1)
sum_mask2 = torch.clamp(sum_mask1, min=1e-9)
mean_embeddings1 = sum_embeddings1 / sum_mask2

display("attention_mask:",attention_mask1)
display("input_mask_expanded:", input_mask_expanded1 )
display("last_hidden_state:", last_hidden_state1)
display("sum_embeddings:", sum_embeddings1)
display("sum_mask:", sum_mask1)
display("sum_mask2:", sum_mask2)
display("mean_embeddings:", mean_embeddings1)

'attention_mask:'

tensor([1, 1, 1, 0, 0])

'input_mask_expanded:'

tensor([[1., 1., 1., 1.],
        [1., 1., 1., 1.],
        [1., 1., 1., 1.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.]])

'last_hidden_state:'

tensor([[2.0000, 4.5000, 5.0000, 1.0000],
        [4.0000, 1.0000, 2.0000, 0.2000],
        [1.0000, 1.0000, 2.0000, 1.0000],
        [0.1000, 2.0000, 6.0000, 1.0000],
        [1.0000, 1.0000, 1.0000, 1.0000]])

'sum_embeddings:'

tensor([12.5000,  7.2000,  5.0000,  0.0000,  0.0000])

'sum_mask:'

tensor([4., 4., 4., 0., 0.])

'sum_mask2:'

tensor([4.0000e+00, 4.0000e+00, 4.0000e+00, 1.0000e-09, 1.0000e-09])

'mean_embeddings:'

tensor([3.1250, 1.8000, 1.2500, 0.0000, 0.0000])
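
Note that the tensors in the example above have no batch dimension, so dimension 1 is the hidden dimension and each sum runs over the features of a single token. Inside `MeanPooling.forward` the inputs have shape (batch, sequence, hidden), so dimension 1 is the token dimension and the result is one averaged vector per essay. A minimal NumPy sketch of that batched case, reusing the same toy numbers with a batch of one (NumPy is used here only so the sketch runs without PyTorch):

```python
import numpy as np

last_hidden_state = np.array([[[2, 4.5, 5, 1],
                               [4, 1, 2, 0.2],
                               [1, 1, 2, 1],
                               [0.1, 2, 6, 1],
                               [1, 1, 1, 1]]])           # shape (1, 5, 4)
attention_mask = np.array([[1, 1, 1, 0, 0]])              # shape (1, 5)

# expand the mask to the shape of the hidden states
mask = np.broadcast_to(attention_mask[..., None], last_hidden_state.shape).astype(float)
sum_embeddings = (last_hidden_state * mask).sum(axis=1)   # sum over the 3 real tokens
sum_mask = np.clip(mask.sum(axis=1), 1e-9, None)          # number of real tokens per feature
mean_embeddings = sum_embeddings / sum_mask               # shape (1, 4)
print(mean_embeddings)
```

The two padded positions are ignored, and the output is the average of the first three token vectors.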

**Loss**

In [None]:
class RMSELoss(nn.Module):
    def __init__(self, reduction='mean', eps=1e-9):
        super().__init__()
        self.mse = nn.MSELoss(reduction='none')
        self.reduction = reduction
        self.eps = eps

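    def forward(self, y_pred, y_true):
        # Note: the square root is applied element-wise before the reduction, so with
        # reduction='mean' this is close to the mean absolute error rather than
        # sqrt(mean squared error). The class is not used in train_loop, which uses SmoothL1Loss.
        loss = torch.sqrt(self.mse(y_pred, y_true) + self.eps)
        if self.reduction == 'sum':
            loss = loss.sum()
        elif self.reduction == 'mean':
            loss = loss.mean()
        return loss # reduction == 'none' returns the element-wise values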
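
The order of the square root and the mean matters in `RMSELoss`. The NumPy sketch below compares the two orders on made-up errors of 3 and 4; it leaves out the `eps` term for simplicity, which is an assumption of the sketch.

```python
import numpy as np

y_true = np.array([0.0, 0.0])
y_pred = np.array([3.0, 4.0])
squared_error = (y_pred - y_true) ** 2          # [9, 16]

# square root first, then mean: the order used in RMSELoss.forward
elementwise_sqrt_then_mean = np.sqrt(squared_error).mean()

# mean first, then square root: the usual definition of RMSE
mean_then_sqrt = np.sqrt(squared_error.mean())

print(elementwise_sqrt_then_mean, mean_then_sqrt)
```

The first value is the mean absolute error (3.5), while the second is the square root of 12.5, so the two quantities differ whenever the errors are not all equal in magnitude.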

**Helper Function**

In [None]:
class AverageMeter(object):
    #Computes and stores the average and current value
    def __init__(self):
        self.reset()

    def reset(self):
        self.val = 0
        self.avg = 0
        self.sum = 0
        self.count = 0

    def update(self, val, n=1):
        self.val = val
        self.sum += val * n
        self.count += n
        self.avg = self.sum / self.count


def asMinutes(s):
    m = math.floor(s / 60)
    s -= m * 60
    return '%dm %ds' % (m, s)


def timeSince(since, percent):
    now = time.time()
    s = now - since
    es = s / (percent)
    rs = es - s
    return '%s (remain %s)' % (asMinutes(s), asMinutes(rs))


def train_fn(fold, train_loader, model, criterion, optimizer, epoch, scheduler, device):
    model.train()
    scaler = torch.cuda.amp.GradScaler(enabled=CFG.apex)  #https://pytorch.org/docs/stable/amp.html
    losses = AverageMeter()
    start = end = time.time()
    global_step = 0
    for step, (inputs, labels) in enumerate(train_loader):
        inputs = collate(inputs)
        for k, v in inputs.items():
            inputs[k] = v.to(device)
        labels = labels.to(device)
        batch_size = labels.size(0)
        with torch.cuda.amp.autocast(enabled=CFG.apex):
            y_preds = model(inputs)
            loss = criterion(y_preds, labels)
        if CFG.gradient_accumulation_steps > 1:
            loss = loss / CFG.gradient_accumulation_steps
        losses.update(loss.item(), batch_size)
        scaler.scale(loss).backward()
        grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), CFG.max_grad_norm)
        if (step + 1) % CFG.gradient_accumulation_steps == 0:
            scaler.step(optimizer)
            scaler.update()
            optimizer.zero_grad()
            global_step += 1
            if CFG.batch_scheduler:
                scheduler.step()
        end = time.time()
        if step % CFG.print_freq == 0 or step == (len(train_loader)-1):
            print('Epoch: [{0}][{1}/{2}] '
                  'Elapsed {remain:s} '
                  'Loss: {loss.val:.4f}({loss.avg:.4f}) '
                  'Grad: {grad_norm:.4f}  '
                  'LR: {lr:.8f}  '
                  .format(epoch+1, step, len(train_loader), 
                          remain=timeSince(start, float(step+1)/len(train_loader)),
                          loss=losses,
                          grad_norm=grad_norm,
                          lr=scheduler.get_last_lr()[0]))
      
    return losses.avg


def valid_fn(valid_loader, model, criterion, device):
    losses = AverageMeter()
    model.eval()
    preds = []
    start = end = time.time()
    for step, (inputs, labels) in enumerate(valid_loader):
        inputs = collate(inputs)
        for k, v in inputs.items():
            inputs[k] = v.to(device)
        labels = labels.to(device)
        batch_size = labels.size(0)
        with torch.no_grad():
            y_preds = model(inputs)
            loss = criterion(y_preds, labels)
        if CFG.gradient_accumulation_steps > 1:
            loss = loss / CFG.gradient_accumulation_steps
        losses.update(loss.item(), batch_size)
        preds.append(y_preds.to('cpu').numpy())
        end = time.time()
        if step % CFG.print_freq == 0 or step == (len(valid_loader)-1):
            print('EVAL: [{0}/{1}] '
                  'Elapsed {remain:s} '
                  'Loss: {loss.val:.4f}({loss.avg:.4f}) '
                  .format(step, len(valid_loader),
                          loss=losses,
                          remain=timeSince(start, float(step+1)/len(valid_loader))))
    predictions = np.concatenate(preds)
    return losses.avg, predictions
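
In the training log, several epochs begin with `Grad: inf` and later steps report norms in the hundreds of thousands. One plausible reason is that `clip_grad_norm_` in `train_fn` runs on gradients that are still multiplied by the `GradScaler` loss scale. The PyTorch AMP notes describe calling `scaler.unscale_(optimizer)` before clipping. The sketch below shows that ordering for a single step on a toy linear model; `toy_model`, the random data, and `use_amp` are assumptions for illustration, and this is not the code that produced the log.

```python
import torch
import torch.nn as nn

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
use_amp = device.type == 'cuda'  # mixed precision only when a GPU is present

toy_model = nn.Linear(4, 6).to(device)  # stand-in for CustomModel
optimizer = torch.optim.AdamW(toy_model.parameters(), lr=2e-5)
criterion = nn.SmoothL1Loss(reduction='mean')
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

inputs = torch.randn(2, 4, device=device)
labels = torch.randn(2, 6, device=device)

with torch.cuda.amp.autocast(enabled=use_amp):
    loss = criterion(toy_model(inputs), labels)

scaler.scale(loss).backward()   # gradients are multiplied by the loss scale here
scaler.unscale_(optimizer)      # divide them back before measuring or clipping the norm
grad_norm = torch.nn.utils.clip_grad_norm_(toy_model.parameters(), max_norm=1000)
scaler.step(optimizer)          # skips the update if the gradients contain inf or nan
scaler.update()
optimizer.zero_grad()
print(float(grad_norm))
```

With gradient accumulation, `unscale_` should be called only once per optimizer step, so the unscale and clip calls would move inside the `if (step + 1) % CFG.gradient_accumulation_steps == 0:` branch.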

**Train Loop**

In [None]:
def train_loop(folds, fold):
    
    LOGGER.info(f"========== fold: {fold} training ==========")

    # loader

    train_folds = folds[folds['fold'] != fold].reset_index(drop=True)
    valid_folds = folds[folds['fold'] == fold].reset_index(drop=True)
    valid_labels = valid_folds[CFG.target_cols].values
    
    train_dataset = TrainDataset(CFG, train_folds)
    valid_dataset = TrainDataset(CFG, valid_folds)

    train_loader = DataLoader(train_dataset,
                              batch_size=CFG.batch_size,
                              shuffle=True,
                              num_workers=CFG.num_workers, pin_memory=True, drop_last=True)
    valid_loader = DataLoader(valid_dataset,
                              batch_size=CFG.batch_size * 2,
                              shuffle=False,
                              num_workers=CFG.num_workers, pin_memory=True, drop_last=False)

    # model & optimizer
 
    model = CustomModel(CFG, config_path=None, pretrained=True)
    torch.save(model.config, OUTPUT_DIR+'config.pth')
    model.to(device)
    
    def get_optimizer_params(model, encoder_lr, decoder_lr, weight_decay=0.0):
        param_optimizer = list(model.named_parameters())
        no_decay = ["bias", "LayerNorm.bias", "LayerNorm.weight"]
        optimizer_parameters = [
            {'params': [p for n, p in model.model.named_parameters() if not any(nd in n for nd in no_decay)],
             'lr': encoder_lr, 'weight_decay': weight_decay},
            {'params': [p for n, p in model.model.named_parameters() if any(nd in n for nd in no_decay)],
             'lr': encoder_lr, 'weight_decay': 0.0},
            {'params': [p for n, p in model.named_parameters() if "model" not in n],
             'lr': decoder_lr, 'weight_decay': 0.0}
        ]
        return optimizer_parameters

    optimizer_parameters = get_optimizer_params(model,
                                                encoder_lr=CFG.encoder_lr, 
                                                decoder_lr=CFG.decoder_lr,
                                                weight_decay=CFG.weight_decay)
    optimizer = AdamW(optimizer_parameters, lr=CFG.encoder_lr, eps=CFG.eps, betas=CFG.betas)
    
    # scheduler: adjusts the learning rate during training
    def get_scheduler(cfg, optimizer, num_train_steps): #decays the learning rate
        if cfg.scheduler == 'linear':
            scheduler = get_linear_schedule_with_warmup(
                optimizer, num_warmup_steps=cfg.num_warmup_steps, num_training_steps=num_train_steps
            )
        elif cfg.scheduler == 'cosine':
            scheduler = get_cosine_schedule_with_warmup(
                optimizer, num_warmup_steps=cfg.num_warmup_steps, num_training_steps=num_train_steps, num_cycles=cfg.num_cycles
            )
        return scheduler
    
    num_train_steps = int(len(train_folds) / CFG.batch_size * CFG.epochs)
    scheduler = get_scheduler(CFG, optimizer, num_train_steps)

    # loop
    criterion = nn.SmoothL1Loss(reduction='mean') # RMSELoss(reduction="mean")
    
    best_score = np.inf

    for epoch in range(CFG.epochs):

        start_time = time.time()

        # train
        avg_loss = train_fn(fold, train_loader, model, criterion, optimizer, epoch, scheduler, device)

        # eval
        avg_val_loss, predictions = valid_fn(valid_loader, model, criterion, device)
        
        # scoring
        score, scores = get_score(valid_labels, predictions)

        elapsed = time.time() - start_time

        LOGGER.info(f'Epoch {epoch+1} - avg_train_loss: {avg_loss:.4f}  avg_val_loss: {avg_val_loss:.4f}  time: {elapsed:.0f}s')
        LOGGER.info(f'Epoch {epoch+1} - Score: {score:.4f}  Scores: {scores}')
        
        if best_score > score:
            best_score = score
            LOGGER.info(f'Epoch {epoch+1} - Save Best Score: {best_score:.4f} Model')
            torch.save({'model': model.state_dict(),
                        'predictions': predictions},
                        OUTPUT_DIR+f"{CFG.model.replace('/', '-')}_fold{fold}_best.pth")

    predictions = torch.load(OUTPUT_DIR+f"{CFG.model.replace('/', '-')}_fold{fold}_best.pth", 
                             map_location=torch.device('cpu'))['predictions']
    valid_folds[[f"pred_{c}" for c in CFG.target_cols]] = predictions

    torch.cuda.empty_cache()
    gc.collect()
    
    return valid_folds
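
The three parameter groups built in `get_optimizer_params` depend only on substring matches against parameter names. The pure-Python sketch below applies the same rules to a few example names; the names are illustrative assumptions rather than a listing read from the real model, and `group_of` is a hypothetical helper.

```python
no_decay = ["bias", "LayerNorm.bias", "LayerNorm.weight"]

# Illustrative parameter names as model.named_parameters() might report them (assumed)
names = [
    "model.encoder.layer.0.attention.self.query_proj.weight",
    "model.encoder.layer.0.attention.self.query_proj.bias",
    "model.encoder.layer.0.attention.output.LayerNorm.weight",
    "fc.weight",
    "fc.bias",
]

def group_of(name):
    """Return the optimizer group a parameter name would fall into (hypothetical helper)."""
    if "model" not in name:
        return "head (decoder_lr, weight_decay=0.0)"
    if any(nd in name for nd in no_decay):
        return "encoder, no decay (encoder_lr, weight_decay=0.0)"
    return "encoder, decay (encoder_lr, weight_decay)"

for name in names:
    print(f"{name:60s} -> {group_of(name)}")
```

Under these rules only the encoder's non-bias, non-LayerNorm weights receive weight decay, and the final `fc` layer is the only part trained with `decoder_lr`.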

In [None]:
if __name__ == '__main__':
    
    def get_result(oof_df):
        labels = oof_df[CFG.target_cols].values
        preds = oof_df[[f"pred_{c}" for c in CFG.target_cols]].values
        score, scores = get_score(labels, preds)
        LOGGER.info(f'Score: {score:<.4f}  Scores: {scores}')
    
    if CFG.train:
        oof_df = pd.DataFrame()
        for fold in range(CFG.n_fold):
            if fold in CFG.trn_fold:
                _oof_df = train_loop(train, fold)
                oof_df = pd.concat([oof_df, _oof_df])
                LOGGER.info(f"========== fold: {fold} result ==========")
                get_result(_oof_df)
        oof_df = oof_df.reset_index(drop=True)
        LOGGER.info(f"========== CV ==========")
        get_result(oof_df)
        oof_df.to_pickle(OUTPUT_DIR+'oof_df.pkl')
    

DebertaV2Config {
  "_name_or_path": "microsoft/deberta-v3-base",
  "attention_dropout": 0.0,
  "attention_probs_dropout_prob": 0.0,
  "hidden_act": "gelu",
  "hidden_dropout": 0.0,
  "hidden_dropout_prob": 0.0,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-07,
  "max_position_embeddings": 512,
  "max_relative_positions": -1,
  "model_type": "deberta-v2",
  "norm_rel_ebd": "layer_norm",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "output_hidden_states": true,
  "pad_token_id": 0,
  "pooler_dropout": 0,
  "pooler_hidden_act": "gelu",
  "pooler_hidden_size": 768,
  "pos_att_type": [
    "p2c",
    "c2p"
  ],
  "position_biased_input": false,
  "position_buckets": 256,
  "relative_attention": true,
  "share_att_key": true,
  "transformers_version": "4.21.2",
  "type_vocab_size": 0,
  "vocab_size": 128100
}

Some weights of the model checkpoint at microsoft/deberta-v3-base were not used when initializing DebertaV2Model: ['mask_predictions.dense.weight', 'lm_predictions.lm_head.dense.weight', 'mask_predictions.dense.bias', 'lm_predictions.lm_head.LayerNorm.weight', 'mask_predictions.LayerNorm.bias', 'mask_predictions.classifier.bias', 'lm_predictions.lm_head.dense.bias', 'lm_predictions.lm_head.LayerNorm.bias', 'mask_predictions.classifier.weight', 'lm_predictions.lm_head.bias', 'mask_predictions.LayerNorm.weight']
- This IS expected if you are initializing DebertaV2Model from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DebertaV2Model from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Epoch: [1][0/1052] Elapsed 0m 4s (remain 83m 12s) Loss: 2.5355(2.5355) Grad: inf  LR: 0.00002000  
Epoch: [1][20/1052] Elapsed 0m 11s (remain 9m 46s) Loss: 0.4508(1.3764) Grad: 202974.1406  LR: 0.00002000  
Epoch: [1][40/1052] Elapsed 0m 21s (remain 8m 42s) Loss: 0.2382(0.8386) Grad: 163518.1406  LR: 0.00001999  
Epoch: [1][60/1052] Elapsed 0m 30s (remain 8m 14s) Loss: 0.5052(0.6358) Grad: 229715.0781  LR: 0.00001998  
Epoch: [1][80/1052] Elapsed 0m 39s (remain 7m 55s) Loss: 0.1636(0.5323) Grad: 147584.4062  LR: 0.00001997  
Epoch: [1][100/1052] Elapsed 0m 47s (remain 7m 27s) Loss: 0.3427(0.4630) Grad: 473317.8750  LR: 0.00001995  
Epoch: [1][120/1052] Elapsed 0m 56s (remain 7m 13s) Loss: 0.4227(0.4200) Grad: 255472.9531  LR: 0.00001993  
Epoch: [1][140/1052] Elapsed 1m 5s (remain 7m 0s) Loss: 0.1096(0.3914) Grad: 116498.6172  LR: 0.00001990  
Epoch: [1][160/1052] Elapsed 1m 13s (remain 6m 45s) Loss: 0.1392(0.3598) Grad: 101454.1328  LR: 0.00001987  
Epoch: [1][180/1052] Elapsed 1m 21s

Epoch 1 - avg_train_loss: 0.1625  avg_val_loss: 0.1290  time: 599s
Epoch 1 - Score: 0.5098  Scores: [0.551033963062818, 0.47666119255540784, 0.47541273242466026, 0.5125428013884025, 0.5579914733889101, 0.48495076610909604]
Epoch 1 - Save Best Score: 0.5098 Model


EVAL: [262/263] Elapsed 1m 56s (remain 0m 0s) Loss: 0.1084(0.1290) 
Epoch: [2][0/1052] Elapsed 0m 0s (remain 12m 11s) Loss: 0.1684(0.1684) Grad: inf  LR: 0.00001499  
Epoch: [2][20/1052] Elapsed 0m 8s (remain 6m 36s) Loss: 0.1327(0.1153) Grad: 215042.9531  LR: 0.00001482  
Epoch: [2][40/1052] Elapsed 0m 15s (remain 6m 33s) Loss: 0.0769(0.1121) Grad: 182222.3750  LR: 0.00001464  
Epoch: [2][60/1052] Elapsed 0m 27s (remain 7m 32s) Loss: 0.0767(0.1153) Grad: 164745.4844  LR: 0.00001447  
Epoch: [2][80/1052] Elapsed 0m 38s (remain 7m 35s) Loss: 0.1127(0.1115) Grad: 152330.4219  LR: 0.00001429  
Epoch: [2][100/1052] Elapsed 0m 50s (remain 7m 59s) Loss: 0.1652(0.1101) Grad: 265016.7500  LR: 0.00001411  
Epoch: [2][120/1052] Elapsed 1m 0s (remain 7m 42s) Loss: 0.0523(0.1106) Grad: 64670.0703  LR: 0.00001392  
Epoch: [2][140/1052] Elapsed 1m 14s (remain 8m 2s) Loss: 0.0927(0.1092) Grad: 97403.8906  LR: 0.00001374  
Epoch: [2][160/1052] Elapsed 1m 25s (remain 7m 53s) Loss: 0.1251(0.1132) Grad: 

Epoch 2 - avg_train_loss: 0.1007  avg_val_loss: 0.1157  time: 602s
Epoch 2 - Score: 0.4825  Scores: [0.5068980636354617, 0.48788098042868516, 0.44531621112154063, 0.4918276490887023, 0.4954846555199497, 0.46777500515029796]
Epoch 2 - Save Best Score: 0.4825 Model


EVAL: [262/263] Elapsed 1m 59s (remain 0m 0s) Loss: 0.1066(0.1157) 
Epoch: [3][0/1052] Elapsed 0m 0s (remain 10m 50s) Loss: 0.1876(0.1876) Grad: inf  LR: 0.00000499  
Epoch: [3][20/1052] Elapsed 0m 9s (remain 7m 30s) Loss: 0.1199(0.0898) Grad: 199672.7656  LR: 0.00000482  
Epoch: [3][40/1052] Elapsed 0m 16s (remain 6m 51s) Loss: 0.0691(0.0867) Grad: 139371.3438  LR: 0.00000465  
Epoch: [3][60/1052] Elapsed 0m 26s (remain 7m 12s) Loss: 0.0824(0.0818) Grad: 64728.9297  LR: 0.00000448  
Epoch: [3][80/1052] Elapsed 0m 34s (remain 6m 54s) Loss: 0.1518(0.0848) Grad: 96048.1250  LR: 0.00000432  
Epoch: [3][100/1052] Elapsed 0m 42s (remain 6m 42s) Loss: 0.0635(0.0876) Grad: 78310.2578  LR: 0.00000416  
Epoch: [3][120/1052] Elapsed 0m 51s (remain 6m 34s) Loss: 0.0877(0.0877) Grad: 61997.4453  LR: 0.00000400  
Epoch: [3][140/1052] Elapsed 0m 59s (remain 6m 22s) Loss: 0.0719(0.0899) Grad: 80708.4219  LR: 0.00000384  
Epoch: [3][160/1052] Elapsed 1m 8s (remain 6m 19s) Loss: 0.1405(0.0881) Grad: 10

Epoch 3 - avg_train_loss: 0.0823  avg_val_loss: 0.1060  time: 605s
Epoch 3 - Score: 0.4612  Scores: [0.49056680872260333, 0.447729041080586, 0.42959229971806323, 0.4724321975319452, 0.4827396224306702, 0.44427748172645665]
Epoch 3 - Save Best Score: 0.4612 Model


EVAL: [262/263] Elapsed 1m 57s (remain 0m 0s) Loss: 0.1035(0.1060) 


Score: 0.4612  Scores: [0.49056680872260333, 0.447729041080586, 0.42959229971806323, 0.4724321975319452, 0.4827396224306702, 0.44427748172645665]

Epoch: [1][0/1052] Elapsed 0m 0s (remain 12m 45s) Loss: 2.4504(2.4504) Grad: inf  LR: 0.00002000  
Epoch: [1][20/1052] Elapsed 0m 9s (remain 7m 33s) Loss: 0.2966(1.4700) Grad: 224533.9062  LR: 0.00002000  
Epoch: [1][40/1052] Elapsed 0m 17s (remain 7m 5s) Loss: 0.1474(0.8669) Grad: 80267.3750  LR: 0.00001999  
Epoch: [1][60/1052] Elapsed 0m 26s (remain 7m 9s) Loss: 0.1495(0.6390) Grad: 209285.7344  LR: 0.00001998  
Epoch: [1][80/1052] Elapsed 0m 36s (remain 7m 22s) Loss: 0.1890(0.5294) Grad: 122558.4531  LR: 0.00001997  
Epoch: [1][100/1052] Elapsed 0m 47s (remain 7m 25s) Loss: 0.1331(0.4726) Grad: 149387.2344  LR: 0.00001995  
Epoch: [1][120/1052] Elapsed 0m 55s (remain 7m 5s) Loss: 0.0416(0.4160) Grad: 55840.4570  LR: 0.00001993  
Epoch: [1][140/1052] Elapsed 1m 6s (remain 7m 11s) Loss: 0.0938(0.3820) Grad: 111438.6719  LR: 0.00001990  
Epoch: [1][160/1052] Elapsed 1m 15s (remain 7m 0s) Loss: 0.2446(0.3584) Grad: 113020.2344  LR: 0.00001987  
Epoch: [1][180/1052] Elapsed 1m 28s (rema

Epoch 1 - avg_train_loss: 0.1713  avg_val_loss: 0.1589  time: 605s
Epoch 1 - Score: 0.5644  Scores: [0.670097504878268, 0.5108709405739653, 0.5143467705219942, 0.6394683534834433, 0.5761442229908109, 0.47577100744073436]
Epoch 1 - Save Best Score: 0.5644 Model


EVAL: [262/263] Elapsed 1m 58s (remain 0m 0s) Loss: 0.1728(0.1589) 
Epoch: [2][0/1052] Elapsed 0m 0s (remain 14m 22s) Loss: 0.0394(0.0394) Grad: 193672.5938  LR: 0.00001499  
Epoch: [2][20/1052] Elapsed 0m 7s (remain 6m 28s) Loss: 0.1911(0.0949) Grad: 132728.9219  LR: 0.00001482  
Epoch: [2][40/1052] Elapsed 0m 17s (remain 7m 1s) Loss: 0.0767(0.0943) Grad: 96859.9531  LR: 0.00001464  
Epoch: [2][60/1052] Elapsed 0m 28s (remain 7m 39s) Loss: 0.0913(0.0918) Grad: 67054.4453  LR: 0.00001447  
Epoch: [2][80/1052] Elapsed 0m 36s (remain 7m 23s) Loss: 0.0578(0.0950) Grad: 63493.9727  LR: 0.00001429  
Epoch: [2][100/1052] Elapsed 0m 46s (remain 7m 13s) Loss: 0.1563(0.1008) Grad: 76458.5547  LR: 0.00001411  
Epoch: [2][120/1052] Elapsed 0m 56s (remain 7m 13s) Loss: 0.1072(0.0997) Grad: 87829.3828  LR: 0.00001392  
Epoch: [2][140/1052] Elapsed 1m 6s (remain 7m 9s) Loss: 0.1059(0.0989) Grad: 70536.5703  LR: 0.00001374  
Epoch: [2][160/1052] Elapsed 1m 13s (remain 6m 47s) Loss: 0.1153(0.0988) Gra

Epoch 2 - avg_train_loss: 0.1052  avg_val_loss: 0.1087  time: 603s
Epoch 2 - Score: 0.4669  Scores: [0.5125808588125716, 0.45487298181818153, 0.42694903224779523, 0.4706615208548398, 0.4858903887372528, 0.4502592682919424]
Epoch 2 - Save Best Score: 0.4669 Model


EVAL: [262/263] Elapsed 1m 58s (remain 0m 0s) Loss: 0.1343(0.1087) 
Epoch: [3][0/1052] Elapsed 0m 1s (remain 26m 50s) Loss: 0.0892(0.0892) Grad: 548744.2500  LR: 0.00000499  
Epoch: [3][20/1052] Elapsed 0m 10s (remain 8m 19s) Loss: 0.0798(0.0849) Grad: 89456.8281  LR: 0.00000482  
Epoch: [3][40/1052] Elapsed 0m 17s (remain 7m 15s) Loss: 0.0512(0.0906) Grad: 52381.7305  LR: 0.00000465  
Epoch: [3][60/1052] Elapsed 0m 25s (remain 7m 1s) Loss: 0.0845(0.0905) Grad: 48465.3633  LR: 0.00000448  
Epoch: [3][80/1052] Elapsed 0m 36s (remain 7m 19s) Loss: 0.0597(0.0846) Grad: 61240.2695  LR: 0.00000432  
Epoch: [3][100/1052] Elapsed 0m 44s (remain 6m 54s) Loss: 0.0494(0.0839) Grad: 55160.6328  LR: 0.00000416  
Epoch: [3][120/1052] Elapsed 0m 52s (remain 6m 43s) Loss: 0.0806(0.0853) Grad: 79866.1562  LR: 0.00000400  
Epoch: [3][140/1052] Elapsed 0m 59s (remain 6m 25s) Loss: 0.0588(0.0853) Grad: 21427.0820  LR: 0.00000384  
Epoch: [3][160/1052] Elapsed 1m 8s (remain 6m 18s) Loss: 0.0571(0.0875) Gr

Epoch 3 - avg_train_loss: 0.0870  avg_val_loss: 0.1066  time: 612s
Epoch 3 - Score: 0.4621  Scores: [0.5096619275243122, 0.4544673583968285, 0.4116964752938493, 0.46844880133404737, 0.4819587630491237, 0.44665423196295173]
Epoch 3 - Save Best Score: 0.4621 Model


EVAL: [262/263] Elapsed 1m 58s (remain 0m 0s) Loss: 0.1285(0.1066) 


Score: 0.4621  Scores: [0.5096619275243122, 0.4544673583968285, 0.4116964752938493, 0.46844880133404737, 0.4819587630491237, 0.44665423196295173]

Epoch: [1][0/1052] Elapsed 0m 0s (remain 10m 38s) Loss: 2.7257(2.7257) Grad: inf  LR: 0.00002000  
Epoch: [1][20/1052] Elapsed 0m 8s (remain 6m 59s) Loss: 0.3066(1.5049) Grad: 125232.4922  LR: 0.00002000  
Epoch: [1][40/1052] Elapsed 0m 18s (remain 7m 42s) Loss: 0.1088(0.9517) Grad: 150214.5312  LR: 0.00001999  
Epoch: [1][60/1052] Elapsed 0m 26s (remain 7m 11s) Loss: 0.2278(0.7118) Grad: 182192.2188  LR: 0.00001998  
Epoch: [1][80/1052] Elapsed 0m 34s (remain 6m 56s) Loss: 0.0757(0.5867) Grad: 81475.7422  LR: 0.00001997  
Epoch: [1][100/1052] Elapsed 0m 43s (remain 6m 48s) Loss: 0.0616(0.5044) Grad: 96916.7578  LR: 0.00001995  
Epoch: [1][120/1052] Elapsed 0m 52s (remain 6m 41s) Loss: 0.1035(0.4513) Grad: 79166.2344  LR: 0.00001993  
Epoch: [1][140/1052] Elapsed 1m 1s (remain 6m 37s) Loss: 0.6205(0.4072) Grad: 268806.2500  LR: 0.00001990  
Epoch: [1][160/1052] Elapsed 1m 11s (remain 6m 37s) Loss: 0.0581(0.3803) Grad: 80556.3125  LR: 0.00001987  
Epoch: [1][180/1052] Elapsed 1m 19s (re

Epoch 1 - avg_train_loss: 0.1725  avg_val_loss: 0.1117  time: 605s
Epoch 1 - Score: 0.4736  Scores: [0.5017294172489042, 0.4590596348112012, 0.4405884342881681, 0.448730609585417, 0.5103142299371903, 0.4813988670355545]
Epoch 1 - Save Best Score: 0.4736 Model


EVAL: [262/263] Elapsed 2m 6s (remain 0m 0s) Loss: 0.1540(0.1117) 
Epoch: [2][0/1052] Elapsed 0m 0s (remain 12m 55s) Loss: 0.0766(0.0766) Grad: 268765.3438  LR: 0.00001499  
Epoch: [2][20/1052] Elapsed 0m 9s (remain 7m 40s) Loss: 0.0788(0.1086) Grad: 168026.7344  LR: 0.00001482  
Epoch: [2][40/1052] Elapsed 0m 19s (remain 7m 54s) Loss: 0.0722(0.1088) Grad: 148009.5469  LR: 0.00001464  
Epoch: [2][60/1052] Elapsed 0m 32s (remain 8m 48s) Loss: 0.0762(0.1054) Grad: 198252.2031  LR: 0.00001447  
Epoch: [2][80/1052] Elapsed 0m 42s (remain 8m 29s) Loss: 0.0412(0.1072) Grad: 121308.7578  LR: 0.00001429  
Epoch: [2][100/1052] Elapsed 0m 51s (remain 8m 4s) Loss: 0.1284(0.1063) Grad: 228429.1406  LR: 0.00001411  
Epoch: [2][120/1052] Elapsed 1m 2s (remain 7m 57s) Loss: 0.0670(0.1051) Grad: 112664.3672  LR: 0.00001392  
Epoch: [2][140/1052] Elapsed 1m 9s (remain 7m 25s) Loss: 0.1486(0.1050) Grad: 188861.2812  LR: 0.00001374  
Epoch: [2][160/1052] Elapsed 1m 17s (remain 7m 9s) Loss: 0.1293(0.1056)

Epoch 2 - avg_train_loss: 0.1049  avg_val_loss: 0.1024  time: 600s
Epoch 2 - Score: 0.4536  Scores: [0.4817983901929516, 0.44935037193829785, 0.42436419424090066, 0.4490897082197307, 0.4672463680741657, 0.449507796993598]
Epoch 2 - Save Best Score: 0.4536 Model


EVAL: [262/263] Elapsed 2m 6s (remain 0m 0s) Loss: 0.1933(0.1024) 
Epoch: [3][0/1052] Elapsed 0m 0s (remain 10m 44s) Loss: 0.1024(0.1024) Grad: 260828.5000  LR: 0.00000499  
Epoch: [3][20/1052] Elapsed 0m 10s (remain 8m 28s) Loss: 0.0939(0.0886) Grad: 206294.0781  LR: 0.00000482  
Epoch: [3][40/1052] Elapsed 0m 18s (remain 7m 32s) Loss: 0.0539(0.0925) Grad: 133855.6875  LR: 0.00000465  
Epoch: [3][60/1052] Elapsed 0m 28s (remain 7m 43s) Loss: 0.0985(0.0885) Grad: 164048.5156  LR: 0.00000448  
Epoch: [3][80/1052] Elapsed 0m 37s (remain 7m 27s) Loss: 0.0799(0.0840) Grad: 112171.6641  LR: 0.00000432  
Epoch: [3][100/1052] Elapsed 0m 47s (remain 7m 31s) Loss: 0.0913(0.0837) Grad: 380353.3125  LR: 0.00000416  
Epoch: [3][120/1052] Elapsed 0m 59s (remain 7m 35s) Loss: 0.0398(0.0831) Grad: 117656.4297  LR: 0.00000400  
Epoch: [3][140/1052] Elapsed 1m 9s (remain 7m 27s) Loss: 0.0677(0.0814) Grad: 116686.0781  LR: 0.00000384  
Epoch: [3][160/1052] Elapsed 1m 16s (remain 7m 3s) Loss: 0.0455(0.08

Epoch 3 - avg_train_loss: 0.0851  avg_val_loss: 0.0994  time: 598s
Epoch 3 - Score: 0.4464  Scores: [0.48423661012320096, 0.44233129396785537, 0.4012893533636031, 0.44106160648641274, 0.46446813771688905, 0.4451265953939622]
Epoch 3 - Save Best Score: 0.4464 Model


EVAL: [262/263] Elapsed 2m 6s (remain 0m 0s) Loss: 0.1742(0.0994) 


Score: 0.4464  Scores: [0.48423661012320096, 0.44233129396785537, 0.4012893533636031, 0.44106160648641274, 0.46446813771688905, 0.4451265953939622]
Score: 0.4567  Scores: [0.49493978380259734, 0.4482033941930121, 0.41435760596752336, 0.460858547153225, 0.4764635149720428, 0.44535385542533723]
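
In every `Score:` line of the log, the leading number appears to be the plain mean of the six per-column values listed after it. The sketch below assumes those per-column values are column-wise RMSEs (i.e. the metric is MCRMSE, as in the baseline notebook this work is derived from); `mcrmse` is a hypothetical helper written for illustration, not the notebook's own function. The last lines check the mean against the final `Score` line.

```python
import math

def mcrmse(y_true, y_pred):
    """Hypothetical helper: mean column-wise RMSE.

    y_true, y_pred: lists of rows, one value per target column.
    Returns (mean of the per-column RMSEs, list of per-column RMSEs).
    """
    n_cols = len(y_true[0])
    col_rmse = []
    for j in range(n_cols):
        sq_err = [(t[j] - p[j]) ** 2 for t, p in zip(y_true, y_pred)]
        col_rmse.append(math.sqrt(sum(sq_err) / len(sq_err)))
    return sum(col_rmse) / n_cols, col_rmse

# Per-column values copied from the final "Score:" line of the log.
final_columns = [
    0.49493978380259734, 0.4482033941930121, 0.41435760596752336,
    0.460858547153225, 0.4764635149720428, 0.44535385542533723,
]
overall = sum(final_columns) / len(final_columns)
print(f"{overall:.4f}")  # 0.4567, matching the logged overall Score
```

Lower is better for this metric, which is consistent with the log saving a new "best" model each time the score decreases.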

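
The `LR` column in the training log is consistent with the half-cosine schedule configured in `CFG` (`scheduler='cosine'`, `num_cycles=0.5`, `num_warmup_steps=0`, `encoder_lr=2e-5`) stepped once per batch over 3 epochs of 1052 batches. The sketch below re-implements the standard cosine-with-warmup formula in plain Python rather than calling the library. It assumes the logged value is read after the scheduler has already stepped for that batch, hence the `+ 1`; the helper names `cosine_lr` and `logged_lr` are made up for this illustration.

```python
import math

BASE_LR = 2e-5          # CFG.encoder_lr
NUM_CYCLES = 0.5        # CFG.num_cycles
WARMUP_STEPS = 0        # CFG.num_warmup_steps
STEPS_PER_EPOCH = 1052  # taken from the "[step/1052]" counters in the log
EPOCHS = 3              # CFG.epochs
TOTAL_STEPS = STEPS_PER_EPOCH * EPOCHS

def cosine_lr(step):
    """Learning rate after `step` scheduler steps (linear warmup, then cosine decay)."""
    if step < WARMUP_STEPS:
        return BASE_LR * step / max(1, WARMUP_STEPS)
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    factor = 0.5 * (1.0 + math.cos(math.pi * NUM_CYCLES * 2.0 * progress))
    return BASE_LR * max(0.0, factor)

def logged_lr(epoch, batch):
    """LR expected in the log line for a 1-based epoch and 0-based batch index."""
    # Assumption: the log line is printed after scheduler.step() for this batch.
    global_step = (epoch - 1) * STEPS_PER_EPOCH + batch + 1
    return cosine_lr(global_step)

for epoch, batch in [(1, 0), (1, 100), (2, 0), (3, 0), (3, 140)]:
    print(f"Epoch: [{epoch}][{batch}/{STEPS_PER_EPOCH}] LR: {logged_lr(epoch, batch):.8f}")
```

Under these assumptions the printed values can be compared directly with the `LR:` fields of the corresponding log lines, for example the first batch of epochs 2 and 3.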