# Table of Contents

* [1. Code explanation](#section1)
* [2. Module installation, import](#section2)
* [3. Identify environment](#section3)
* [4. Configurations](#section4)
* [5. Dataset](#section5)
* [6. Model](#section6)
* [7. Function to train and get accuracy](#section7)
* [8. Hyperparameter tuning with Optuna library](#section8)

# 1. Code explanation<a class="anchor" id="section1"></a>
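1. Written for **Feedback Prize - English Language Learning**, a competition hosted on **Kaggle**, a platform for machine learning code competitions.  
   Goal of the competition: train on essays by US grade-8 students that were scored on 6 criteria, then predict the score for each criterion as accurately as possible for a new essay.  
   --> The aim is to support learning through accurate feedback.
2. Fine-tunes Deberta-v3, a Transformer language model, on the competition data.  
   --> Uses the **OPTUNA** library for hyperparameter optimization, and the **Weights & Biases (WandB)** library and site for logging (comparing and analyzing results).
3. The code runs both on Kaggle and on a local machine.  
4. Built to support both **TPU** (Tensor Processing Unit, available on Kaggle, generally several times faster than a GPU (NVIDIA P100))  
   and **GPU** (Graphics Processing Unit; multi-GPU training through TensorFlow's mirrored strategy),  
   but **the TPU is currently slower because of the issue described in the link below**.  
    https://github.com/huggingface/transformers/issues/18239
5. The model was taken from the link below; OPTUNA and WANDB were added, the code was converted to a TPU-compatible format, and the hyperparameters were then tuned.  
   Reference: https://www.kaggle.com/code/electro/deberta-layerwiselr-lastlayerreinit-tensorflow  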
   

# 2. Module installation, import<a class="anchor" id="section2"></a>

The default TensorFlow version on Kaggle TPUs cannot run the deberta-v3 model, so TensorFlow is updated to version 2.7.4.  
Reference: https://www.kaggle.com/getting-started/210020  

In [1]:
try:
    from cloud_tpu_client import Client
    c = Client()
    c.configure_tpu_version('tensorflow==2.7.4', restart_type='ifNeeded')
    !pip uninstall -y transformers
    !pip install -qU git+https://github.com/huggingface/transformers

    !pip uninstall keras -y
    !pip uninstall keras-nightly -y
    !pip uninstall keras-Preprocessing -y
    !pip uninstall keras-vis -y
    !pip uninstall tensorflow -y

    !pip install -qU tensorflow==2.7.4
    !pip install -qU keras==2.7.0
    is_tpu = True
except:
    print('This is not TPU notebook')
    is_tpu = False

This is not TPU notebook


In [2]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import matplotlib.pyplot as plt
import numpy as np
from transformers import  AutoTokenizer,  TFAutoModel, AutoConfig
from transformers import RobertaConfig
from transformers import TFDebertaV2Model
from sklearn.model_selection import train_test_split
import pandas as pd
import os
from tqdm import tqdm
import random
import optuna

from tensorflow.keras import Model
import pickle
import gc
from sklearn.model_selection import StratifiedKFold, GroupKFold, KFold
import tensorflow_addons as tfa


# 3. Identify environment<a class="anchor" id="section3"></a>

In [3]:
# Detect the compute environment --> local or Kaggle
try:
    from kaggle_datasets import KaggleDatasets
    from kaggle_secrets import UserSecretsClient
    is_local = False
except:
    print('Running in my computer!')
    is_local = True
    
try:
    user_secrets = UserSecretsClient()
    secret_value_0 = user_secrets.get_secret("WANDB")
except:
    print('Your WANDB key must be attached to Add-ons --> secrets')


In [4]:
try:
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
    print('Device:', tpu.master())
    tf.config.experimental_connect_to_cluster(tpu)
    tf.tpu.experimental.initialize_tpu_system(tpu)
    strategy = tf.distribute.experimental.TPUStrategy(tpu)
except:
    try:
        strategy = tf.distribute.MirroredStrategy() # for GPU or multi-GPU machines
        if(strategy.num_replicas_in_sync==1):
            strategy = tf.distribute.experimental.CentralStorageStrategy()
    
    except: strategy = tf.distribute.experimental.CentralStorageStrategy()

        
print('Number of replicas:', strategy.num_replicas_in_sync)
AUTO = tf.data.experimental.AUTOTUNE

Number of replicas: 2


# 4. Configurations<a class="anchor" id="section4"></a>
Sets run options such as whether to log to WandB, the number of folds, batch_size, and debug mode (a short run for catching errors quickly).

In [5]:
try:
    gcs_path = KaggleDatasets().get_gcs_path('feedback-deberta-v3-tfrecord')
except:
    gcs_path = '../input/feedback-deberta-v3-tfrecord'

config={
    'is_debug': False,
    'is_wandb': False,
    'save_model': True,
    'project': 'Feedback prize outputwise error',
    'name'   : 'deberta-v3-base_rmse_error',
    'group'  : 'run_test',    
    'fold_select': 0,
    'seed': 22,
    'is_base_model_fixed': False,
    'batch_size':2*strategy.num_replicas_in_sync,
    'buffer_size': 3200,
    'seq_len': 512,
    'checkpoint_path': './mymodel',
    'epochs': 30,
    'auto': tf.data.experimental.AUTOTUNE,
    'model': '/kaggle/input/debertav3base',
    'model_path': '/kaggle/input/debertav3base',
    'n_fold': 5,
    'target_cols': ['cohesion', 'syntax', 'vocabulary', 'phraseology', 'grammar', 'conventions'],
    'gcs_path': gcs_path,
    'data_path': '/feedback_deberta-v3-base_train',
    'is_tpu': is_tpu,
    'is_local': is_local    
}

del gcs_path
gc.collect()

if(config['is_debug']):
    config['epochs'] = 2    

# ⭐ WandB
<img src="https://camo.githubusercontent.com/dd842f7b0be57140e68b2ab9cb007992acd131c48284eaf6b1aca758bfea358b/68747470733a2f2f692e696d6775722e636f6d2f52557469567a482e706e67" width=600>

Weights & Biases (W&B) is an MLOps platform for tracking experiments. It helps build better models faster through experiment tracking, dataset versioning, and model management. Some of the useful features of W&B:

* Track, compare, and visualize ML experiments
* Get live metrics, terminal logs, and system stats streamed to the centralized dashboard.
* Explain how your model works, show graphs of how model versions improved, discuss bugs, and demonstrate progress towards milestones.
* https://wandb.ai

In [6]:
if(not is_local):
    if(config['is_wandb']):
        !pip install -qU wandb --upgrade
        import wandb
        #from wandb.keras import WandbCallback
        os.environ["WANDB_SILENT"] = "true"
        wandb.login(key = secret_value_0)
else:
    import wandb
    os.environ["WANDB_SILENT"] = "true"
    wandb.login()  
    

In [7]:
def seed_everything(seed):
    os.environ['PYTHONHASHSEED'] = str(seed)
    random.seed(seed)
    np.random.seed(seed)
    tf.random.set_seed(seed)
    
seed_everything(config['seed'])

# 5. Dataset<a class="anchor" id="section5"></a>

TFRecord format --> bundles each input with its target and serializes them to binary, which reduces storage size and speeds up processing.
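As a rough illustration of the format, the sketch below serializes one record the way the reader functions in this section expect it: each tensor is stored as a byte string under the keys `ids`, `mask`, and `targets`. How the original TFRecord files were produced is not shown in this notebook, so the helper names (`_bytes_feature`, `serialize_example`) and the dummy values are assumptions.

```python
import numpy as np
import tensorflow as tf

SEQ_LEN = 512  # matches the reshape size used by the reader


def _bytes_feature(tensor):
    # Serialize a tensor to bytes and wrap it as a tf.train.Feature
    raw = tf.io.serialize_tensor(tensor).numpy()
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[raw]))


def serialize_example(ids, mask, targets):
    # Hypothetical writer-side helper that mirrors the reader's feature keys
    feature = {
        "ids": _bytes_feature(tf.constant(ids, dtype=tf.int32)),
        "mask": _bytes_feature(tf.constant(mask, dtype=tf.int32)),
        "targets": _bytes_feature(tf.constant(targets, dtype=tf.float32)),
    }
    example = tf.train.Example(features=tf.train.Features(feature=feature))
    return example.SerializeToString()


# Dummy record: 10 real tokens followed by padding
ids = np.zeros(SEQ_LEN, dtype=np.int32)
ids[:10] = np.arange(1, 11)
mask = (ids > 0).astype(np.int32)
targets = np.array([3.0, 3.5, 3.0, 3.0, 4.0, 3.5], dtype=np.float32)
record = serialize_example(ids, mask, targets)

# Round trip using the same feature spec as the reader
spec = {k: tf.io.FixedLenFeature([], tf.string) for k in ("ids", "mask", "targets")}
parsed = tf.io.parse_single_example(record, spec)
ids_back = tf.io.parse_tensor(parsed["ids"], out_type=tf.int32)
targets_back = tf.io.parse_tensor(parsed["targets"], out_type=tf.float32)
```

To produce a file, each serialized `record` could be written with `tf.io.TFRecordWriter(path).write(record)`.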

In [8]:
n_fold_samples = [782, 783, 782, 782, 782]
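The fold sizes above determine how many steps one training or validation pass takes. The following sketch works through the arithmetic for one held-out fold, assuming a batch size of 4 (2 per replica on 2 replicas) and assuming that training uses every sample outside the held-out fold.

```python
n_fold_samples = [782, 783, 782, 782, 782]  # from the cell above
batch_size = 4   # assumed: 2 per replica * 2 replicas
fold = 0         # held-out validation fold

n_valid = n_fold_samples[fold]
n_train = sum(n_fold_samples) - n_valid

# "Floor division + 1" rule for converting sample counts to step counts
steps_per_validation = n_valid // batch_size + 1
steps_per_epoch = n_train // batch_size + 1

print(n_train, n_valid, steps_per_epoch, steps_per_validation)
# prints: 3129 782 783 196
```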

In [9]:
def read_labeled_tfrecord(example):
    LABELED_TFREC_FORMAT = {
        "ids": tf.io.FixedLenFeature([], tf.string), # tf.string means bytestring
        "mask": tf.io.FixedLenFeature([], tf.string),  # shape [] means single element
        "targets": tf.io.FixedLenFeature([], tf.string)        
    }
    example = tf.io.parse_single_example(example, LABELED_TFREC_FORMAT)
    ids = tf.io.parse_tensor(example['ids'], out_type=tf.int32)
    mask = tf.io.parse_tensor(example['mask'], out_type=tf.int32)
    targets = tf.io.parse_tensor(example['targets'], out_type = tf.float32)
    
    #out = tf.cast(example['output'], tf.
    return (tf.reshape(ids, [512]), tf.reshape(mask, [512])), tf.reshape(targets, [6])


def load_dataset(filenames, labeled = True, ordered = True):
    # Read from TFRecords. For best performance, read from multiple files at once;
    # with ordered = False the data order is disregarded (fine when the data is shuffled anyway).
    
    ignore_order = tf.data.Options()
    if not ordered:
        ignore_order.experimental_deterministic = False # disable order, increase speed
        
    dataset = tf.data.TFRecordDataset(filenames, num_parallel_reads = AUTO) # automatically interleaves reads from multiple files  
    dataset = dataset.with_options(ignore_order) # with ordered = False, use data as soon as it streams in rather than in its original order
    # Yields ((ids, mask), targets) tuples when labeled = True.
    # NOTE: read_unlabeled_tfrecord is not defined in this notebook, so labeled = False would raise a NameError.
    dataset = dataset.map(read_labeled_tfrecord if labeled else read_unlabeled_tfrecord, num_parallel_calls = AUTO)
    return dataset


def load_outputwise_dataset(filenames, select_output = 0, labeled = True, ordered = True):
    def read_labeled_tfrecord(example):
        LABELED_TFREC_FORMAT = {
            "ids": tf.io.FixedLenFeature([], tf.string), # tf.string means bytestring
            "mask": tf.io.FixedLenFeature([], tf.string),  # shape [] means single element
            "targets": tf.io.FixedLenFeature([], tf.string)        
        }
        example = tf.io.parse_single_example(example, LABELED_TFREC_FORMAT)
        ids = tf.io.parse_tensor(example['ids'], out_type=tf.int32)
        mask = tf.io.parse_tensor(example['mask'], out_type=tf.int32)
        targets = tf.io.parse_tensor(example['targets'], out_type = tf.float32)
    
        #out = tf.cast(example['output'], tf.
        return (tf.reshape(ids, [512]), tf.reshape(mask, [512])), tf.reshape(targets[select_output], [1])

    
    # Read from TFRecords. For best performance, read from multiple files at once;
    # with ordered = False the data order is disregarded (fine when the data is shuffled anyway).
    
    ignore_order = tf.data.Options()
    ignore_order.experimental_deterministic = True
    if not ordered:
        ignore_order.experimental_deterministic = False # disable order, increase speed
        
    dataset = tf.data.TFRecordDataset(filenames, num_parallel_reads = AUTO) # automatically interleaves reads from multiple files  
    dataset = dataset.with_options(ignore_order) # with ordered = False, use data as soon as it streams in rather than in its original order
    # Yields ((ids, mask), target) tuples for the selected output when labeled = True.
    # NOTE: read_unlabeled_tfrecord is not defined in this notebook, so labeled = False would raise a NameError.
    dataset = dataset.map(read_labeled_tfrecord if labeled else read_unlabeled_tfrecord, num_parallel_calls = AUTO)
    return dataset

def get_training_dataset(dataset, batch_size = config['batch_size']):
    #if do_aug: dataset = dataset.map(transform, num_parallel_calls=AUTO) # note we put AFTER batching    
    dataset = dataset.shuffle(1024) # shuffle() returns a new dataset, so the result must be reassigned
    if(config['is_tpu']):
        dataset = dataset.repeat()
        dataset = dataset.batch(batch_size, drop_remainder = True)
    else:
        dataset = dataset.batch(batch_size)
    dataset = dataset.prefetch(AUTO) # prefetch next batch while training (autotune prefetch buffer size)
    return dataset

def get_validation_dataset(dataset, batch_size = config['batch_size']):
    #if do_aug: dataset = dataset.map(transform, num_parallel_calls=AUTO) # note we put AFTER batching
    #!!!!!!!!!!Must be considered (duplication of score of elements repeated)    
    #dataset.shuffle(1024)
    if(config['is_tpu']):  
        dataset = dataset.repeat()
        dataset = dataset.batch(batch_size, drop_remainder = True)
    else:
        dataset = dataset.batch(batch_size)
    dataset = dataset.prefetch(AUTO) # prefetch next batch while training (autotune prefetch buffer size)
    return dataset

# 6. Model<a class="anchor" id="section6"></a>
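A Dense layer is added on top of the Deberta-v3-base model with an output size of 6 floats, matching the target format of the Kaggle competition.
1. N_REINIT_LAYERS: number of layers at the output end of the model whose weights are re-initialized
2. N_HIDDEN_POOL: number of layers at the output end that are pooled together for the final prediction
3. INIT_LR: initial learning rate for the deberta-v3-base part of the model
4. LLRDR: layer-wise learning rate decay rate for the deberta-v3-base model
5. DECAY_RATE: learning rate decay rate for the deberta-v3-base model
6. HEAD_INIT_LR: learning rate for the output head of the model
7. LOSS: loss function to use
8. FINAL_ACTIVATION: activation function used at the output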
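To make the interaction of INIT_LR, LLRDR, and DECAY_RATE concrete, here is a small plain-Python sketch of the two formulas involved: a geometric decay of the initial learning rate across layer groups, and a continuous exponential decay over training steps. The function names are hypothetical, and the numbers (13 layer groups, 782 decay steps) are assumptions chosen to resemble the initial parameters used later in this notebook.

```python
def layerwise_initial_lrs(init_lr, llrdr, n_groups):
    # Group 0 is the encoder block closest to the output; later groups sit
    # deeper in the network and get geometrically smaller learning rates.
    return [init_lr * llrdr ** i for i in range(n_groups)]


def exponential_decay(initial_lr, decay_rate, decay_steps, step):
    # Continuous exponential decay: lr = lr0 * decay_rate ** (step / decay_steps)
    return initial_lr * decay_rate ** (step / decay_steps)


# Assumed: 12 encoder blocks + the embedding layer = 13 groups
lrs = layerwise_initial_lrs(init_lr=1e-5, llrdr=0.9, n_groups=13)
print(f"top block: {lrs[0]:.2e}, embeddings: {lrs[-1]:.2e}")

# After exactly one decay period the rate has shrunk by a factor of decay_rate
lr_after = exponential_decay(lrs[0], decay_rate=0.3, decay_steps=782, step=782)
print(f"top block after 782 steps: {lr_after:.2e}")
```

With LLRDR close to 1 the layers train at similar rates; a small LLRDR leaves the lower layers nearly frozen.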

**Mean Pool** is very useful. I tend to add GlobalAveragePooling1D at the output of the BERT model and then connect a Dense layer. But simply averaging the embeddings across the whole sequence length may not be a good idea, because many sequences contain padding. Put simply, mean pooling averages only the non-padding embeddings in the sequence.

In my experiments, adding this layer alone improved the LB score from 0.5 to 0.48. For more information, see [this great notebook](https://www.kaggle.com/code/rhtsingh/utilizing-transformer-representations-efficiently). 

My implementation of mean pool is based on [this notebook](https://www.kaggle.com/code/electro/fp3-roberta-meanpool-kfold-tensorflow).
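The difference between a plain average and a masked average can be seen in a few lines of NumPy. This is only an illustrative sketch with made-up values (the function name `masked_mean_pool` is hypothetical), not the Keras layer used in this notebook.

```python
import numpy as np


def masked_mean_pool(embeddings, attention_mask):
    # embeddings: (batch, seq_len, hidden); attention_mask: (batch, seq_len) of 0/1
    mask = attention_mask[..., None].astype(np.float32)   # (batch, seq_len, 1)
    summed = (embeddings * mask).sum(axis=1)              # (batch, hidden)
    counts = np.maximum(mask.sum(axis=1), 1e-9)           # avoid division by zero
    return summed / counts


# One sequence: two real tokens and one padding position with large values
emb = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]], dtype=np.float32)
mask = np.array([[1, 1, 0]])

pooled = masked_mean_pool(emb, mask)
print(pooled)              # average over the two real tokens only
print(emb.mean(axis=1))    # plain average is skewed by the padding row
```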

**No dropout**. It turns out that setting dropout = 0.0 in the BERT model makes the validation score smoother and yields a much lower validation loss (see [this great discussion](https://www.kaggle.com/competitions/commonlitreadabilityprize/discussion/260729)).

In [10]:
model_config = AutoConfig.from_pretrained(config["model"], output_hidden_states = True)
model_config.attention_probs_dropout_prob = 0.0
model_config.hidden_dropout_prob = 0.0
model_config

DebertaV2Config {
  "_name_or_path": "/kaggle/input/debertav3base",
  "attention_probs_dropout_prob": 0.0,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.0,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-07,
  "max_position_embeddings": 512,
  "max_relative_positions": -1,
  "model_type": "deberta-v2",
  "norm_rel_ebd": "layer_norm",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "output_hidden_states": true,
  "pad_token_id": 0,
  "pooler_dropout": 0,
  "pooler_hidden_act": "gelu",
  "pooler_hidden_size": 768,
  "pos_att_type": [
    "p2c",
    "c2p"
  ],
  "position_biased_input": false,
  "position_buckets": 256,
  "relative_attention": true,
  "share_att_key": true,
  "transformers_version": "4.20.1",
  "type_vocab_size": 0,
  "vocab_size": 128100
}

In [11]:
class MeanPool(tf.keras.layers.Layer):
    def call(self, inputs, mask=None):
        # inputs: (None, 512, 768)
        
        # (None, 512, 1)
        broadcast_mask = tf.expand_dims(tf.cast(mask, "float32"), -1)
        
        # (None, 768)
        embedding_sum = tf.reduce_sum(inputs * broadcast_mask, axis=1)
        
        # (None, 1)
        mask_sum = tf.reduce_sum(broadcast_mask, axis=1)
        
        mask_sum = tf.math.maximum(mask_sum, tf.constant([1e-9]))
        return embedding_sum / mask_sum
    
class WeightsSumOne(tf.keras.constraints.Constraint):
    def __call__(self, w):
        return tf.nn.softmax(w, axis=0)    
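The `WeightsSumOne` constraint above turns the kernel of a bias-free `Dense(1)` layer into a set of positive weights that sum to one, so the layer computes a convex combination of the per-layer pooled embeddings. The NumPy sketch below applies the softmax once to random stand-in values to show the effect; in Keras, a constraint is applied to the kernel after each optimizer update. All shapes and values here are assumptions for illustration.

```python
import numpy as np


def softmax(w, axis=0):
    # Numerically stable softmax along the given axis
    e = np.exp(w - w.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


rng = np.random.default_rng(0)
n_layers, hidden = 4, 8
pooled = rng.normal(size=(2, hidden, n_layers))  # (batch, hidden, n_layers)
raw_w = rng.normal(size=(n_layers, 1))           # stand-in for the Dense(1) kernel

w = softmax(raw_w, axis=0)                       # positive weights that sum to 1
combined = (pooled @ w).squeeze(-1)              # (batch, hidden)

print(w.ravel(), w.sum())
print(combined.shape)
```

Because the weights form a convex combination, each combined value stays between the smallest and largest value across the pooled layers.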

In [12]:
def build_model(params):
    N_REINIT_LAYERS = params['n_reinit_layers']
    N_HIDDEN_POOL    = params['n_hidden_pool']
    INIT_LR          = params['init_lr']
    LLRDR            = params['llrdr']
    DECAY_RATE       = params['decay_rate']
    HEAD_INIT_LR     = params['head_init_lr']
    LOSS             = params['loss']
    FINAL_ACTIVATION = params['final_activation']
    # Multi inputs
    if(config['is_tpu']):
        input_ids = keras.Input(shape=(512,), batch_size=config['batch_size'], dtype= "int32", name="input_ids")
        attention_masks = keras.Input(shape=(512,), batch_size=config['batch_size'], dtype= "int32", name="attention_mask")
    else:
        input_ids = keras.Input(shape=(512,), dtype= "int32", name="input_ids")
        attention_masks = keras.Input(shape=(512,),  dtype= "int32", name="attention_mask")        
        
    base_model  = TFAutoModel.from_pretrained(config["model_path"],                                                 
                                                 config = model_config)



    REINIT_LAYERS = 1
    normal_initializer = tf.keras.initializers.GlorotUniform( seed= config['seed'])
    zeros_initializer = tf.keras.initializers.Zeros()
    ones_initializer = tf.keras.initializers.Ones()

    if(config['is_base_model_fixed']): 
        #Fix layers' parameters except output hidden and last encoder layer
        for encoder_block in base_model.deberta.encoder.layer[:-REINIT_LAYERS]:
            for layer in encoder_block.submodules:
                layer.trainable = False

        base_model.deberta.embeddings.trainable = False


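    # Guard against N_REINIT_LAYERS == 0: layer[-0:] would select every encoder block
    reinit_blocks = base_model.deberta.encoder.layer[-N_REINIT_LAYERS:] if N_REINIT_LAYERS > 0 else []
    for encoder_block in reinit_blocks: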
        for layer in encoder_block.submodules:
            if isinstance(layer, tf.keras.layers.Dense):
                layer.kernel.assign(normal_initializer(shape=layer.kernel.shape,
                                                       dtype=layer.kernel.dtype))
                if layer.bias is not None:
                    layer.bias.assign(zeros_initializer(shape=layer.bias.shape,
                                                        dtype=layer.bias.dtype))

            elif isinstance(layer, tf.keras.layers.LayerNormalization):
                layer.beta.assign(zeros_initializer(shape=layer.beta.shape,
                                                    dtype=layer.beta.dtype))
                layer.gamma.assign(ones_initializer(shape=layer.gamma.shape,
                                                    dtype=layer.gamma.dtype))

    base_model_output = base_model(input_ids, attention_mask=attention_masks)
    hidden_states = base_model_output.hidden_states # tuple of (None, 512, 768) tensors

    # WeightedLayerPool + MeanPool of the last N_HIDDEN_POOL hidden states
    stack_meanpool = tf.stack([MeanPool()(hidden_s, mask=attention_masks)
                               for hidden_s in hidden_states[-N_HIDDEN_POOL:]],
                              axis=2) # (None, 768, N_HIDDEN_POOL)

    weighted_layer_pool = layers.Dense(1, use_bias=False,
                                      kernel_constraint=WeightsSumOne())(stack_meanpool)

    weighted_layer_pool = tf.squeeze(weighted_layer_pool, axis=-1)

    x = layers.Dense(6, activation=FINAL_ACTIVATION)(weighted_layer_pool)
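    # Map a [0, 1] activation (e.g. sigmoid) onto the 1.0-5.0 score range;
    # the scale is the score range width and should not depend on N_HIDDEN_POOL
    output = layers.Rescaling(scale=4.0, offset=1.0)(x)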
    model = tf.keras.Model(inputs=[input_ids, attention_masks], outputs=output)

    # Compile model with Layer-wise Learning Rate Decay
    layer_list = [base_model.deberta.embeddings] + list(base_model.deberta.encoder.layer)
    layer_list.reverse()

    LR_SCH_DECAY_STEPS = 3128 //config['batch_size'] # roughly one epoch: 3128 training samples (4 folds of 782) // batch size
    lr_schedules = [tf.keras.optimizers.schedules.ExponentialDecay(
        initial_learning_rate=INIT_LR * LLRDR ** i,
        decay_steps=LR_SCH_DECAY_STEPS,
        decay_rate=DECAY_RATE) for i in range(len(layer_list))]

    lr_schedule_head = tf.keras.optimizers.schedules.ExponentialDecay(
        initial_learning_rate=HEAD_INIT_LR,
        decay_steps=LR_SCH_DECAY_STEPS,
        decay_rate=DECAY_RATE
    )

    optimizers = [tf.keras.optimizers.Adam(learning_rate=lr_sch) for lr_sch in lr_schedules]

    optimizers_and_layers = [(tf.keras.optimizers.Adam(learning_rate=lr_schedule_head),
                              model.layers[-4:])] + list(zip(optimizers, layer_list))

    optimizer = tfa.optimizers.MultiOptimizer(optimizers_and_layers)


    model.compile(optimizer=optimizer,
                  loss=LOSS,
                  metrics=[tf.keras.metrics.RootMeanSquaredError()])    

    return model

# 7. Function to train and get accuracy<a class="anchor" id="section7"></a>

In [13]:
def get_accuracy(params):
    score_list=[]
    for fold in range(0, config['n_fold']):
        if(config['fold_select'] is not None):
            if(fold != config['fold_select']):
                continue

        valid_folds = [fold]
        train_folds = np.arange(0, config['n_fold'])
        train_folds = np.delete(train_folds, fold)

        steps_per_validation = (n_fold_samples[fold])//config['batch_size']+1
        # Training samples = all samples minus the samples of the held-out fold
        steps_per_epoch     = (np.sum(n_fold_samples) - n_fold_samples[fold])//config['batch_size']+1

        if(config['is_debug']):
            steps_per_validation = 1
            steps_per_epoch      = 1


        train_dataset = load_dataset(tf.io.gfile.glob([config['gcs_path'] + config['data_path'] +
                                                           '%d.tfrec'%x for x in train_folds]))

        valid_dataset = load_dataset(tf.io.gfile.glob([config['gcs_path'] + config['data_path'] +
                                                           '%d.tfrec'%x for x in valid_folds]))


        train_dataset = get_training_dataset(train_dataset, config['batch_size'])
        valid_dataset = get_validation_dataset(valid_dataset, config['batch_size'])

        targets = []
        
        for example in valid_dataset:
            targets.append(example[1].numpy())
        targets=  np.vstack(targets)    

        # No trailing comma here: a trailing comma would turn the callback into a one-element tuple
        model_checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(f"best_model_fold{fold}.h5",
                                                                       monitor="val_root_mean_squared_error",
                                                                       mode="min", save_best_only=True,
                                                                       verbose=1, save_weights_only=True)

        earlystop_callback = keras.callbacks.EarlyStopping(monitor='val_root_mean_squared_error',
                                                           patience=3,
                                                           verbose=1)

        reducelr_callback = keras.callbacks.ReduceLROnPlateau(monitor='val_loss',
                                                              factor=0.5,
                                                              patience=5,
                                                              verbose=1)

        callbacks = [
            earlystop_callback,
        ]
        
        if(config['save_model']):
            callbacks.append(model_checkpoint_callback)

        if(not config['is_local']):
            with strategy.scope():
                model = build_model(params)
        else:
            model = build_model(params)

        if(config['is_tpu'] or config['is_debug']):
            freeze_history = model.fit(train_dataset, 
                                   validation_data  = valid_dataset, 
                                   steps_per_epoch  = steps_per_epoch,
                                   validation_steps = steps_per_validation,
                                   callbacks        = callbacks, 
                                   shuffle          = True,
                                   epochs           = config['epochs'],
                                   verbose          = 1
                                   )
        else:
            freeze_history = model.fit(train_dataset, 
                                   validation_data  = valid_dataset, 
                                   callbacks        = callbacks, 
                                   shuffle          = True,
                                   epochs           = config['epochs'],
                                   batch_size       = config['batch_size'],
                                   verbose          = 1
                                   )        
        
        if(not config['is_debug']):
            pred = model.predict(valid_dataset)
        else:
            pred = np.ones(np.shape(targets))
        error =pred - targets
        error **=2
        rmse_error = np.zeros(6)
        
        for i in range(0, 6):
            rmse_error[i] = np.sqrt(np.average(error[:,i]))    

        def merge_two_dicts(x, y):
            z = x.copy()   # start with keys and values of x
            z.update(y)    # modifies z with keys and values of y
            return z
        run_parameter_info = merge_two_dicts(config, params)

        
        # Send logs to WandB
        if(config['is_wandb']):
            wandb.init(project = config['project'], name = f"{config['name']}_{fold}",
               group = config['group'], config = run_parameter_info)
            wandb.log({"Best val rmse": np.min(freeze_history.history['val_root_mean_squared_error'])})
            wandb.log({
                         'cohesion'    : rmse_error[0],
                         'syntax'      : rmse_error[1],
                         'vocabulary'  : rmse_error[2],
                         'phraseology' : rmse_error[3],
                         'grammar'     : rmse_error[4],
                         'conventions' : rmse_error[5]
                      })
            for i in range(0, len(freeze_history.history['loss'])):
                wandb.log({
                            'loss': freeze_history.history['loss'][i], 
                            'root_mean_squared_error': freeze_history.history['root_mean_squared_error'][i],
                            'val_loss': freeze_history.history['val_loss'][i],
                            'val_root_mean_squared_error': freeze_history.history['val_root_mean_squared_error'][i],
                          })

            wandb.finish()
        score_list.append(np.min(freeze_history.history['val_root_mean_squared_error']))
        
        tf.keras.backend.clear_session()
        del model, train_dataset, valid_dataset, pred, error, targets, rmse_error, callbacks, freeze_history
        gc.collect() 
    return np.average(score_list)
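The per-target error computed inside `get_accuracy` is a column-wise RMSE. A compact NumPy sketch with made-up predictions for two targets is shown below; the function name `columnwise_rmse` and the numbers are illustrative assumptions.

```python
import numpy as np


def columnwise_rmse(pred, targets):
    # RMSE computed separately for each target column
    return np.sqrt(np.mean((pred - targets) ** 2, axis=0))


targets = np.array([[3.0, 3.5], [4.0, 2.5], [2.0, 3.0]])
pred = np.array([[3.5, 3.5], [3.5, 3.0], [2.5, 3.0]])

per_target = columnwise_rmse(pred, targets)
mean_columnwise = per_target.mean()            # mean of the per-column RMSEs
overall = np.sqrt(np.mean((pred - targets) ** 2))  # single RMSE over all elements

print(per_target, mean_columnwise, overall)
```

Note that the mean of per-column RMSEs generally differs from a single RMSE taken over all elements, so the two numbers should not be compared directly.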



# 8. Hyperparameter tuning with Optuna library<a class="anchor" id="section8"></a>

In [14]:
def objective(trial):
    params={
        'n_reinit_layers': trial.suggest_int('n_reinit_layers', 0, 12, 1),
        'n_hidden_pool': trial.suggest_int('n_hidden_pool', 1, 12, 1),
        'init_lr': trial.suggest_float('init_lr', 1e-7, 1e-3),
        'llrdr'  : trial.suggest_float('llrdr', 0.01, 0.99),
        'decay_rate': trial.suggest_float('decay_rate', 0.01, 0.99),
        'head_init_lr': trial.suggest_float('head_init_lr', 1e-7, 1e-3),
        'loss': trial.suggest_categorical('loss', ['categorical_crossentropy', 'categorical_hinge', 
                                                    'mean_absolute_error', 'mean_squared_error' , 'huber_loss']),        
        "final_activation": trial.suggest_categorical("final_activation",["relu", "sigmoid", "softmax", "softplus", 
                                                                           "softsign", "tanh", "selu", "elu", "exponential", "linear"])
    }   
    
    accuracy = get_accuracy(params)
    return accuracy

In [15]:
init_params={
            'n_reinit_layers' : 1,
            'n_hidden_pool'   : 4,
            'init_lr'         : 1e-5,
            'llrdr'           : 0.9,
            'decay_rate'      : 0.3,
            'head_init_lr'    : 1e-4,
            'loss'            : 'mean_squared_error',
            'final_activation': 'sigmoid'
}

In [16]:
study = optuna.create_study(direction="minimize")

[I 2023-01-02 20:52:20,448] A new study created in memory with name: no-name-96d3cf9b-0ded-4d1b-94d4-ba0696614854


In [17]:
arr_init_params=[]
for i in range(0, 3):
    cpy_init_params= init_params.copy()
    cpy_init_params['n_reinit_layers'] = i
    arr_init_params.append(cpy_init_params)
    

for i in range(3, 6):
    cpy_init_params = init_params.copy()
    cpy_init_params['n_hidden_pool']= i
    arr_init_params.append(cpy_init_params)

lr_list = np.geomspace(1e-5, 1e-3, 3)
for i in range(0, 3):
    cpy_init_params = init_params.copy()
    cpy_init_params['init_lr'] = lr_list[i]
    arr_init_params.append(cpy_init_params)
    
    cpy_init_params = init_params.copy()
    cpy_init_params['head_init_lr'] = lr_list[i]
    arr_init_params.append(cpy_init_params)
    

for i in range(7, 10):
    cpy_init_params = init_params.copy()
    cpy_init_params['llrdr'] = i*0.1
    arr_init_params.append(cpy_init_params)

for i in range(2, 5):
    cpy_init_params = init_params.copy()
    cpy_init_params['decay_rate'] = i*0.1
    arr_init_params.append(cpy_init_params)
        
losses = ['categorical_crossentropy', 'categorical_hinge', 
         'mean_absolute_error', 'mean_squared_error' , 'huber_loss']

final_activations = ["relu", "sigmoid", "softmax", "softplus", 
                    "softsign", "tanh", "selu", "elu", "exponential", "linear"]
for loss in losses:
    cpy_init_params = init_params.copy()
    cpy_init_params['loss'] = loss
    arr_init_params.append(cpy_init_params)
    
for final_activation in final_activations:
    cpy_init_params = init_params.copy()
    cpy_init_params['final_activation'] = final_activation
    arr_init_params.append(cpy_init_params) 

In [18]:
len(arr_init_params)

33
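The list built above varies one hyperparameter at a time around `init_params`. The same idea can be written more compactly with a dictionary of candidate values, as in the sketch below. The helper name `one_at_a_time` is hypothetical, and the resulting order differs slightly from the loops above (here all `init_lr` variants come before the `head_init_lr` variants).

```python
import numpy as np

base = {
    'n_reinit_layers': 1, 'n_hidden_pool': 4, 'init_lr': 1e-5, 'llrdr': 0.9,
    'decay_rate': 0.3, 'head_init_lr': 1e-4,
    'loss': 'mean_squared_error', 'final_activation': 'sigmoid',
}

lr_values = list(np.geomspace(1e-5, 1e-3, 3))
variations = {
    'n_reinit_layers': [0, 1, 2],
    'n_hidden_pool': [3, 4, 5],
    'init_lr': lr_values,
    'head_init_lr': lr_values,
    'llrdr': [i * 0.1 for i in range(7, 10)],
    'decay_rate': [i * 0.1 for i in range(2, 5)],
    'loss': ['categorical_crossentropy', 'categorical_hinge',
             'mean_absolute_error', 'mean_squared_error', 'huber_loss'],
    'final_activation': ['relu', 'sigmoid', 'softmax', 'softplus', 'softsign',
                         'tanh', 'selu', 'elu', 'exponential', 'linear'],
}


def one_at_a_time(base, variations):
    # Vary a single key per trial, keeping every other key at its base value
    return [{**base, key: value}
            for key, values in variations.items()
            for value in values]


trials = one_at_a_time(base, variations)
print(len(trials))
# prints: 33
```

Each dictionary in `trials` could then be queued with `study.enqueue_trial`, in the same way as the commented-out loop in the next cell.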

In [None]:
#for init_params in arr_init_params:
#    study.enqueue_trial(init_params)
study.enqueue_trial(init_params)

study.optimize(objective, n_trials=300, show_progress_bar=True, gc_after_trial = True)

print("Number of finished trials: ", len(study.trials))
print("Best trial:")



  0%|          | 0/300 [00:00<?, ?it/s]

All model checkpoint layers were used when initializing TFDebertaV2Model.

All the layers of TFDebertaV2Model were initialized from the model checkpoint at /kaggle/input/debertav3base.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDebertaV2Model for predictions without further training.

Epoch 1/30
     28/Unknown - 107s 429ms/step - loss: 0.5150 - root_mean_squared_error: 0.7177

In [None]:
try:
    import gc    
    tf.keras.backend.clear_session()  
    del model
    gc.collect()
except:
    gc.collect()