This is an account of my tryst with the HAN model.

I experimented with a few variants of this model, but in most cases I could not get below a score of 0.48 on the public LB. Let us take a quick, intuitive look at the HAN model before jumping into the code - https://www.cs.cmu.edu/~./hovy/papers/16HLT-hierarchical-attention-networks.pdf. This is a pretty old paper, relatively speaking. It came out in 2016... well before BERT and a full year before the landmark transformer paper - Attention Is All You Need (2017).

But this paper in itself was (and is) considered to be a landmark, the reason being the structured way in which a document is broken down into paragraphs, sentences and words for analysis... which is much the way a human reader would work through a long document. What Yang et al. did was to bring in 2 innovations:
- They made use of attention (which had by then become famous after Bahdanau's important paper in 2014) and gave it a slight twist by creating an explicit context-object which was to be 'learnt' by the model, and then used this context-object to determine the attention weights.
- More importantly, they 'applied' this form of attention NOT to the entire document but at various sub-levels. So they broke down a long document into paragraphs and sentences. They would read each sentence in a para and let the model 'learn' a context-object with the dimensions of a single word. This context-object would then be multiplied with all the word embeddings in the sentence to determine the attention weights, and the weighted sum of all the words would give the best representation of the sentence. So now we have each sentence compressed to the dimensions of a 'single special word' - a classic reduction.
- Next they did the same thing for paragraphs. Each paragraph is a collection of sentences, and since they now have one 'special word' representing each sentence, a single para boils down to a collection of 'special words'. They used the same attention technique to learn a new context vector, derived the attention weights and summed the weighted 'special word' embeddings to get a single representation for the paragraph, which again had the dimensions of a single word.
- Finally, a simple linear layer could be applied on top of this representation to classify the document.
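
The steps above can be sketched as a small PyTorch module. This is only a minimal illustration, not the paper's reference code: the class name `AttentionPool`, the hidden size and the toy tensors are made up for the example, and the bias-free final `nn.Linear` plays the role of the learnt context-object.

```python
import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    """Collapse <batch, seq_len, dim> into <batch, dim> using learnt attention weights."""
    def __init__(self, dim, hidden=64):
        super().__init__()
        self.proj = nn.Linear(dim, hidden)               # hidden representation of each item
        self.context = nn.Linear(hidden, 1, bias=False)  # dot product with the learnt context vector

    def forward(self, x, mask):
        scores = self.context(torch.tanh(self.proj(x))).squeeze(-1)  # <batch, seq_len>
        scores = scores.masked_fill(mask == 0, float("-inf"))        # padding gets zero weight
        weights = torch.softmax(scores, dim=1)
        pooled = torch.sum(weights.unsqueeze(-1) * x, dim=1)         # weighted sum of the items
        return pooled, weights

# Words -> one vector per sentence, then the same trick again: sentences -> one vector.
torch.manual_seed(0)
words = torch.randn(3, 5, 8)                      # 3 sentences, 5 word slots, dim 8 (toy sizes)
word_mask = torch.tensor([[1, 1, 1, 1, 1],
                          [1, 1, 1, 0, 0],
                          [1, 1, 0, 0, 0]])
word_pool, sent_pool = AttentionPool(8), AttentionPool(8)
sent_vecs, word_weights = word_pool(words, word_mask)              # <3, 8>
doc_vec, _ = sent_pool(sent_vecs.unsqueeze(0), torch.ones(1, 3))   # <1, 8>
```

Each level has its own `AttentionPool`, so the word-level and sentence-level context vectors are learnt separately, as described in the bullets.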

This process is hierarchical, hence the name hierarchical attention networks. It was used for document classification and, even after 5 years, is still used for tasks dealing with the understanding of long texts. The reason why this works so well is that it mimics the way a human would read a text - processing words, then sentences and then paragraphs. In fact, anyone would assume that it would be the de facto standard for the readability domain.

And it looks like this indeed was the case. ReadNet - https://arxiv.org/pdf/2103.04083.pdf - was published in 2021 and seemed to break most records in the readability space. It is based on the original HAN model proposed by Yang but makes 2 changes:
- They use self-attention instead of plain attention, hence the name - Hierarchical self-attention network. Somewhere along the way the "self-attention" seems to have gotten replaced by "transformer", which is a pity since this model does not have anything to do with the traditional transformers as we know them. They just leverage the 'self-attention' component of the transformer.
- The other major change was an interesting twist by which they combined the self-attention outputs with the "explicit" readability features. The "interesting" part is the way they do it: sentence-level explicit features like characters per word, number of words, number of long words etc. are combined with the sentence 'special word', and paragraph-level explicit features like Flesch Reading Ease, Dale-Chall Index etc. are combined with the paragraph 'special word'.
- A couple of other minor changes w.r.t. a transfer learning layer (if applicable) and a neat loss strategy, but I doubt whether these affected the scores much.
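
To make the sentence-level explicit features concrete, here is a small pure-Python sketch. It is an illustration under assumptions, not ReadNet's code: the function names, the regex-based word splitting, the `LONG_WORD_CHARS` threshold and the use of plain concatenation as the way of "combining" are all choices made for this example.

```python
import re

LONG_WORD_CHARS = 7  # hypothetical threshold for a "long" word; not taken from the paper

def sentence_features(sentence):
    """Return [characters per word, number of words, number of long words]."""
    words = re.findall(r"[A-Za-z']+", sentence)
    num_words = len(words)
    chars_per_word = sum(len(w) for w in words) / num_words if num_words else 0.0
    num_long = sum(len(w) >= LONG_WORD_CHARS for w in words)
    return [chars_per_word, float(num_words), float(num_long)]

def append_features(sentence_vector, sentence):
    """Assumed combination step: concatenate the learnt vector with the explicit features."""
    return list(sentence_vector) + sentence_features(sentence)

feats = sentence_features("The cat sat on the extraordinary mat.")
combined = append_features([0.1, 0.2], "The cat sat on the extraordinary mat.")
```

Here `feats` holds three numbers for the sentence, and `combined` is the toy 2-dimensional sentence vector extended by those three features.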

As can be seen, the changes are pretty minor w.r.t. the original HAN (of course, the devil is in the details), but one might safely assume (as I did) that implementing an original HAN and then changing the 'plain attention' to 'self-attention' would be the key ingredient to a high score. The remaining bits, like the choice of loss or the addition of the explicit features, didn't produce much of an improvement as per the paper, though they certainly would have mattered at the top of the charts where even these minor differences count.
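
For reference, the two flavours of attention being contrasted here can be written side by side. The first line is the context-vector attention from the HAN paper; the second is the standard scaled dot-product self-attention from Attention Is All You Need. The symbols ($h_i$ for the encoded inputs, $H$ for the matrix that stacks them) are notation chosen for this note, and ReadNet's exact configuration should be checked against its paper.

$$u_i = \tanh(W h_i + b), \qquad \alpha_i = \frac{\exp(u_i^\top u_w)}{\sum_j \exp(u_j^\top u_w)}, \qquad s = \sum_i \alpha_i h_i$$

$$\mathrm{SelfAttn}(H) = \mathrm{softmax}\left(\frac{(H W_Q)(H W_K)^\top}{\sqrt{d_k}}\right) H W_V$$

In the first form a single learnt query $u_w$ scores every position and the output is one vector $s$. In the second, every position queries every other position and the output still has one vector per position, so a pooling step is needed afterwards to get a single representation.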

The HAN model implementation is simple, but there is one challenge which I faced in the implementation. I have outlined it here - https://www.kaggle.com/c/commonlitreadabilityprize/discussion/252715 and will not go into details. But the 1-line summary is that most public models have a MAX_LEN of 250-300, which is just about the average length of an excerpt. So word padding is almost never required and the tokens are rich with data. With HAN we have to bring in sentence padding. Once we do that, we can't keep 250 words per sentence. We have to bring it down to 40-50, assuming 10 sentences. A lot of data is lost and the resulting tokens are sparse. With good batching strategies or with larger GPUs maybe this problem could be overcome to some extent, but nowhere could I fit something like 300 words * 50 sentences = 15000 tokens for one excerpt. The memory starts failing for anything above 500. So I created a small hack to overcome this. I feel that this hack should not impact the embeddings, but please feel free to correct me if I am wrong.
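
To put rough numbers on the sparseness, here is a back-of-the-envelope calculation. The per-sentence token counts are invented for illustration; the 50 x 300 grid and the flat length of 258 are the figures used in this notebook.

```python
# Hypothetical token counts for the 9 sentences of one excerpt
lengths = [22, 31, 18, 40, 27, 35, 25, 30, 20]
real_tokens = sum(lengths)

grid_slots = 50 * 300   # sentence-padded layout: 50 sentences x 300 words
flat_slots = 258        # sentences tokenized separately, then concatenated into one sequence

grid_fill = real_tokens / grid_slots
flat_fill = real_tokens / flat_slots

print(f"grid layout: {grid_fill:.1%} of the slots hold real tokens")
print(f"flat layout: {flat_fill:.1%} of the slots hold real tokens")
```

With these made-up lengths, under 2% of the grid carries real tokens, while the flat sequence is almost entirely filled. That gap is what the concatenation hack is meant to close.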

Unfortunately, I just couldn't get this model to work with the CommonLit dataset. Here are the key variants I tried:
- A simple HAN model with the base embeddings being Glove 300 dim
- A HAN model using fine-tuned RoBERTa-base embeddings
- A self-attentive HAN model using fine-tuned RoBERTa-base embeddings
- A HAN model leveraging mean embeddings. Why did I do this? Several excellent kernels published during the competition seemed to indicate that mean embeddings perform better than attention heads. So if we can have a hierarchical-attention network, why not a Hierarchical-Mean Embedding Network (why not H-MEN - I like the name :)). I broke down this approach into 2 key variants:

1. Break a para into sentences and take a mean of means. Forget about all architectures, just raw common sense dictates that this simple technique should give a bump in score over an overall mean (or so I thought). The logic is sound, but for some reason the overall paragraph mean gives better scores than the mean-of-means approach.
1. Since I used Roberta-base, the word embeddings already had an element of sequence built into them. So the 'mean' embedding for a sentence probably could not be improved upon, but what about the mean-of-means? I.e., when we place all the sentence means back-to-back in a sequence, instead of just taking their mean (and losing the temporal construct), wouldn't it be better to feed this sequence to a small GRU or LSTM and then take the mean of all the hidden states? Again this sounded quite logical, and in fact I even tried self-attention instead of the LSTM so as to better establish the relationships between sentences, but unfortunately each approach seemed to worsen the score further! And so that was the end of my tryst with HAN :)
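
The difference between the two kinds of mean in the first variant is easy to see on toy numbers. The sketch below uses invented 1-dimensional "embeddings"; it only shows that the two quantities differ when sentence lengths differ, and is not a confirmed explanation for the score gap.

```python
def mean(values):
    return sum(values) / len(values)

# Toy 1-dimensional token "embeddings": one long sentence and one short sentence
sentences = [[1.0, 1.0, 1.0, 1.0], [5.0]]

# Overall mean: every token counts equally
overall_mean = mean([value for sentence in sentences for value in sentence])

# Mean of means: every sentence counts equally, whatever its length
mean_of_means = mean([mean(sentence) for sentence in sentences])

print(overall_mean, mean_of_means)
```

The single token of the short sentence carries half of the total weight in the mean of means but only a fifth of it in the overall mean, so short sentences get up-weighted.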

I am sure many of the gold medalists will share their versions of HAN which actually get high scores, but I am sharing this version of mine as a baseline and also to welcome comments on where I could possibly have gone wrong.

Source code on top of which I did my changes: https://www.kaggle.com/andretugan/pre-trained-roberta-solution-in-pytorch - an excellent reproducible, simple yet HIGHLY efficient model

I have shared the code for the last variant above, but with a couple of line changes it can be modified to suit all the other variants.

In [None]:
import os
import math
import random
import time

import numpy as np
import pandas as pd

import torch
import torch.nn as nn
from torch.utils.data import Dataset
from torch.utils.data import DataLoader

from nltk import tokenize
from nltk.tokenize import sent_tokenize

from transformers import AdamW
from transformers import AutoTokenizer
from transformers import AutoModel
from transformers import AutoConfig
from transformers import get_cosine_schedule_with_warmup

from sklearn.model_selection import KFold

import gc
##Enable automatic garbage collection
gc.enable()

In [None]:
NUM_FOLDS = 5
NUM_EPOCHS = 3
BATCH_SIZE = 16
MAX_LEN = 258
MAX_SENTS = 50
EVAL_SCHEDULE = [(0.50, 16), (0.49, 8), (0.48, 4), (0.47, 2), (-1., 1)]
ROBERTA_PATH = "../input/clrp-roberta-base/clrp_roberta_base"
TOKENIZER_PATH = "../input/clrp-roberta-base/clrp_roberta_base"
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

In [None]:
def set_random_seed(random_seed):
    random.seed(random_seed)
    np.random.seed(random_seed)
    os.environ["PYTHONHASHSEED"] = str(random_seed)
    torch.manual_seed(random_seed)
    torch.cuda.manual_seed(random_seed)
    torch.cuda.manual_seed_all(random_seed)
    torch.backends.cudnn.deterministic = True

In [None]:
train_df = pd.read_csv("/kaggle/input/commonlitreadabilityprize/train.csv")
train_df.drop(train_df[(train_df.target == 0) & (train_df.standard_error == 0)].index,inplace=True)
train_df.reset_index(drop=True, inplace=True)
test_df = pd.read_csv("/kaggle/input/commonlitreadabilityprize/test.csv")
submission_df = pd.read_csv("/kaggle/input/commonlitreadabilityprize/sample_submission.csv")

In [None]:
tokenizer = AutoTokenizer.from_pretrained(TOKENIZER_PATH)

Let us carry out a simple experiment first!

This is how we normally get the embeddings

In [None]:
excerpt = "how are you? how are you? how am I? not good."
x=tokenizer.batch_encode_plus([excerpt])

config = AutoConfig.from_pretrained(ROBERTA_PATH)
config.update({"output_hidden_states":True, "hidden_dropout_prob": 0.0,"layer_norm_eps": 1e-7})                       
roberta = AutoModel.from_pretrained(ROBERTA_PATH, config=config) 
roberta.eval()
with torch.no_grad():
    roberta_output = roberta(input_ids=torch.tensor(x['input_ids']),attention_mask=torch.tensor(x['attention_mask']))

last_layer_singleshot = roberta_output.hidden_states[-1].detach().squeeze()
print("This is the input:",tokenizer.tokenize(excerpt), "with length", len(tokenizer.tokenize(excerpt)), "excluding CLS and SEP", "\n")
print("This is the tokenized version", x, "\n")
print("This is the output:", last_layer_singleshot.shape, "one 768 embedding for each input token")

Now let us see what happens if we break the excerpt into sentences before feeding it to Roberta AND MANUALLY CONCATENATE the tokens later

In [None]:
excerpt = "how are you? how are you? how am I? not good."
sentences_list = tokenize.sent_tokenize(excerpt)  ##NLTK - Aint it ironical
print(sentences_list, "\n")
x=tokenizer.batch_encode_plus(sentences_list)
print("Tokens before concatenation:\n",x, "\n")
input_ids = list(np.concatenate(x['input_ids']))
attention_mask = list(np.concatenate(x['attention_mask']))
x['input_ids'] = [input_ids]
x['attention_mask'] = [attention_mask]
print("Tokens after concatenation::\n",x, "\n")
print("Conventional Single-shot tokenization:\n",tokenizer.batch_encode_plus([excerpt]))

Let us analyze the key differences before proceeding to the Roberta output. Firstly, we see start/stop markers around every sentence. That difference was expected; we will see whether it harms the internal workings of Roberta. But there is one more, unexpected change: certain tokens seem to have gotten corrupted. "how" is consistently represented by 9178 in the sentence-wise tokenization, but in the single-shot tokenization of the whole excerpt we get 141 instead of 9178 for the second "how". However, this is not really an issue; it is just a matter of representation. Roberta's tokenizer encodes the preceding space as part of the token. If we split the excerpt into sentences, the first word of each sentence has no space in front of it. In the single-shot version, the first word of every sentence after the first does have a space in front of it, so its encoding is the "special character" + the regular encoding. This should not cause any issue. See below how the space is represented by the special character - Ġ

In [None]:
print(excerpt)
print(tokenizer.tokenize(excerpt))

A more relevant question is - Are the embeddings generated after the concatenation comparable to the embeddings gotten with a single shot sentence?

In [None]:
##x has the tokens after concatenation
##feed it to Roberta in eval mode; the forward pass itself must sit inside no_grad
roberta.eval()
with torch.no_grad():
    roberta_output = roberta(input_ids=torch.tensor(x['input_ids']),attention_mask=torch.tensor(x['attention_mask']))

last_layer_conc = roberta_output.hidden_states[-1].detach().squeeze()

print("Singleshot:", last_layer_singleshot.shape)
print("Concatenated:", last_layer_conc.shape, "\n")
print("Apart from the extra CLS, SEP tokens, are these embeddings nearly the same?")

In [None]:
print("SINGLE SHOT VERSION", "\n")
print("Original sentence:")
print(excerpt, "\n")
print("CLS: First 5 dim", last_layer_singleshot[0,:5].numpy())
print("SEP: First 5 dim", last_layer_singleshot[last_layer_singleshot.size(0)-1,:5].numpy(), "\n")
print("First 5 dimensions for first 7 words:\n")
for i in range(7):
    print(tokenizer.tokenize(excerpt)[i],"\n" ,last_layer_singleshot[i+1,:5].numpy())

In [None]:
print("REMASTERED SENTENCE AFTER CONCATENATION OF SENTENCE-WISE TOKENS:\n")
print(excerpt)
print(x['input_ids'], "\n")
print("FIRST CLS: First 5 dim", last_layer_conc[0,:5].numpy())
print("LAST SEP: First 5 dim", last_layer_conc[last_layer_conc.size(0)-1,:5].numpy(), "\n")
print("First 5 dimensions for first 7 words including the interim CLS and SEP:\n")
for i in range(7):
    print(last_layer_conc[i+1,:5].numpy())

So, how do we test whether the embeddings generated in this manner are corrupted or not? A manual look at the first few values seems to indicate that they are more or less close. Let us take 3 cases:
- Embedding of "you" from "how are you?"
- Embedding of "you" from "how are you? I am fine." using the single-shot approach
- Embedding of "you" from "how are you? I am fine." using the concatenated-token approach

If these embeddings are more or less equidistant from each other, we are good, and it is safe to conclude that the concatenated approach does not break Roberta's behaviour in any way, despite all the additional CLS and SEP tokens in between.

In [None]:
x=tokenizer.batch_encode_plus(["how are you?"])
out = roberta(input_ids=torch.tensor(x['input_ids']),attention_mask=torch.tensor(x['attention_mask']))
you1 = out.hidden_states[-1].detach().squeeze()[3,:]

x=tokenizer.batch_encode_plus(["how are you? I am fine."])
out = roberta(input_ids=torch.tensor(x['input_ids']),attention_mask=torch.tensor(x['attention_mask']))
you2 = out.hidden_states[-1].detach().squeeze()[3,:]

sentences_list = tokenize.sent_tokenize("how are you? I am fine.") 
x=tokenizer.batch_encode_plus(sentences_list)
x['input_ids'] = [list(np.concatenate(x['input_ids']))]
x['attention_mask'] = [list(np.concatenate(x['attention_mask']))]
out = roberta(input_ids=torch.tensor(x['input_ids']),attention_mask=torch.tensor(x['attention_mask']))
you3 = out.hidden_states[-1].detach().squeeze()[3,:]

##Let us take one more reference word say - fine
fine = out.hidden_states[-1].detach().squeeze()[-3,:]

print("\n", you1[:5], "\n", you2[:5], "\n", you3[:5], "\n", fine[:5])

del roberta
gc.collect()

Note that our main comparison should be between you2 and you3, which are the single-shot and concatenated versions respectively. A quick glance reveals that they match decently. you1 is for reference. you2 and you3 should each be about as close to you1 as they are to each other, and all of them should be way different from 'fine'.

In [None]:
def cosine_similarity(arr1, arr2):
    ##Dot product divided by the product of the two vector norms
    return torch.dot(arr1, arr2) / (torch.norm(arr1) * torch.norm(arr2))

print(cosine_similarity(you2, you3))
print(cosine_similarity(you2, you1))
print(cosine_similarity(you3, you1), "\n")

print(cosine_similarity(you2, fine))
print(cosine_similarity(you3, fine))

I rest my case! One could argue that embeddings are sparse in nature, that the relevant info sits in only a few dimensions, and that cosine is not the best way to measure differences. Then what is the best way? One way is to use the concatenated-token approach and then take the mean embedding (NOT mean of means) across the entire paragraph. The resulting CommonLit score is nearly the same as with a single-shot mean embedding.

Ok, now for the actual model. Since we now have a definite plan to tackle token-sparseness we can use any of the simple public models available and not worry about space or time. The model takes about an hour to run at most!

We will do some extra things in the dataset definition, like passing along a list of BOS (beginning-of-sentence) indices within each excerpt, so that we don't waste time on this in the model code. We need to take care of padding etc. manually.

The only other changes are in the model definition, where we retrieve these BOS indices to calculate the individual sentence means and then combine them to get the excerpt representation!
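
The idea of turning BOS positions into per-sentence means can be sketched in isolation. This is a vectorized alternative written for illustration, not the loop-based code used in the model cell: the function name `sentence_means` and the toy tensors are assumptions, and the ids 0 and 1 are taken to be RoBERTa's `<s>` and `<pad>` tokens, as hard-coded in the dataset cell.

```python
import torch

BOS_ID = 0  # assumed id of RoBERTa's <s> token

def sentence_means(hidden, input_ids, attention_mask):
    """hidden: <seq_len, dim>. Returns <num_sentences, dim> masked sentence means."""
    # Every <s> opens a new sentence, so a running count of <s> labels each position
    sent_ids = torch.cumsum((input_ids == BOS_ID).long(), dim=0) - 1
    num_sents = int(sent_ids.max().item()) + 1
    mask = attention_mask.to(hidden.dtype).unsqueeze(-1)             # <seq_len, 1>
    sums = torch.zeros(num_sents, hidden.size(1), dtype=hidden.dtype, device=hidden.device)
    sums.index_add_(0, sent_ids, hidden * mask)                      # per-sentence sum, padding zeroed
    counts = torch.zeros(num_sents, 1, dtype=hidden.dtype, device=hidden.device)
    counts.index_add_(0, sent_ids, mask)                             # per-sentence token count
    return sums / counts.clamp(min=1)

# Toy excerpt: two sentences (<s> a b </s>, <s> c </s>) followed by two pad tokens
input_ids = torch.tensor([0, 5, 6, 2, 0, 7, 2, 1, 1])
attention_mask = torch.tensor([1, 1, 1, 1, 1, 1, 1, 0, 0])
hidden = torch.arange(9, dtype=torch.float32).unsqueeze(-1) * torch.tensor([1.0, 2.0])
means = sentence_means(hidden, input_ids, attention_mask)
```

As in the model cell, the `<s>` and `</s>` positions are included in each sentence mean, and padding is excluded through the attention mask.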

In [None]:
class LitDataset(Dataset):
    def __init__(self, df, inference_only=False):
        super().__init__()
        self.df = df        
        self.inference_only = inference_only
        excerpts = df.excerpt.tolist()
        if not self.inference_only:
            self.target = torch.tensor(df.target.values, dtype=torch.float32)
        ##self.encoded = tokenizer.batch_encode_plus(self.text,padding = 'max_length',max_length = MAX_LEN,truncation = True,return_attention_mask=True)
        self.encoded, self.start_pos = [], []
        for excerpt in excerpts:
            sentences = tokenize.sent_tokenize(excerpt) 
            split_tok=tokenizer.batch_encode_plus(sentences)
            input_ids = list(np.concatenate(split_tok['input_ids']))
            attention_mask = list(np.concatenate(split_tok['attention_mask']))
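            ##Hard-coded ids are RoBERTa's special tokens: 0 = <s>, 1 = <pad>, 2 = </s>
            if len(input_ids) > MAX_LEN:
                ##Too long: truncate and close the sequence with a final </s>
                input_ids,attention_mask = input_ids[:MAX_LEN-1], attention_mask[:MAX_LEN-1]
                input_ids.extend([2])
                attention_mask.extend([1])
                split_tok['input_ids'],split_tok['attention_mask'] = [input_ids],[attention_mask]
            else:
                ##Too short: pad manually up to MAX_LEN
                pad_cnt = MAX_LEN - len(input_ids)
                input_ids.extend([1] * pad_cnt)
                attention_mask.extend([0] * pad_cnt)
                split_tok['input_ids'], split_tok['attention_mask'] = [input_ids],[attention_mask]
            self.encoded.append(split_tok)
            ##Every <s> (id 0) marks the beginning of a sentence
            start_pos = list(np.nonzero(np.array(split_tok['input_ids'][0])==0)[0])
            ##Repeat the last index so that every start_pos list has length MAX_LEN and can be batched
            start_pos.extend([start_pos[-1]]*(MAX_LEN-len(start_pos)))
            self.start_pos.append(start_pos)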
    
    def __len__(self):
        return len(self.df)

    def __getitem__(self, index):        
        input_ids = torch.tensor(self.encoded[index]['input_ids'][0])
        attention_mask = torch.tensor(self.encoded[index]['attention_mask'][0])
        start_pos = torch.tensor(self.start_pos[index])
        if self.inference_only:
            return (input_ids, attention_mask, start_pos)            
        else:
            target = self.target[index]
            return (input_ids, attention_mask, start_pos, target)

In [None]:
"""temp = train_df['excerpt'].tolist()[:5]
for excerpt in temp:
    print(excerpt, "\n")
    sentences = tokenize.sent_tokenize(excerpt) 
    [print("-",sentence) for sentence in sentences]
    print("\n")"""

In [None]:
"""train_dataset = LitDataset(train_df.head(2))    
train_loader= DataLoader(train_dataset, batch_size=BATCH_SIZE, drop_last=False)
for batch_num, (input_ids, attention_mask, start_pos, target) in enumerate(train_loader):
    print(len(start_pos[0]), len(input_ids[0][0]))"""

In [None]:
class LitModel(nn.Module):
    def __init__(self):
        super().__init__()
        config = AutoConfig.from_pretrained(ROBERTA_PATH)
        config.update({"output_hidden_states":True, "hidden_dropout_prob": 0.0,"layer_norm_eps": 1e-7})                       
        self.roberta = AutoModel.from_pretrained(ROBERTA_PATH, config=config)  
        ##Neither norm nor adding an LSTM layer nor a self-att layer helped
        ##self.layer_norm = nn.LayerNorm(768)
        ##self.lstm = nn.LSTM(768, 512, bidirectional=True, dropout=0.1, batch_first=True)
        self.attention_layer = nn.Sequential(nn.Linear(768, 512),nn.Tanh(),nn.Linear(512, 1),nn.Softmax(dim=1))   
        ##Note that above is NOT self-attention, I just retained the naming of the orig author
        self.regressor = nn.Sequential(nn.Linear(768, 1))

    def forward(self, input_ids, attention_mask, start_pos):
        ##The commented lines are for the different variants of HAN that I talked of earlier
        roberta_output = self.roberta(input_ids=input_ids,attention_mask=attention_mask)
        last_layer = roberta_output[0]
        ##I apologize for the terseness and will cleanup later
        
        for i, excerpt in enumerate(last_layer):
            ##<num_words, dim>
            ##Below innocent statement led to HOURS of debugging
            ##avg_embeddings = torch.empty(768).to(DEVICE)
            start_pos_excerptwise = start_pos[i,:].squeeze()
            for cnt, pos in enumerate(start_pos_excerptwise):
                if (cnt!=len(start_pos_excerptwise)-1) and (pos!=start_pos_excerptwise[cnt+1]):
                    ##Above, we take adv of the fact that Python will not evaluate
                    ##(False 'and' anything), so no run-time error in second expression
                    emb = excerpt[pos:start_pos_excerptwise[cnt+1],:]
                    ##avg_embeddings = torch.sum(emb,0)/emb.size(0) if cnt==0 else avg_embeddings+torch.sum(emb,0)/emb.size(0)
                    avg_embeddings = (torch.sum(emb,0)/emb.size(0)).unsqueeze(-1).permute(1,0) if cnt==0 else torch.cat((avg_embeddings,(torch.sum(emb,0)/emb.size(0)).unsqueeze(-1).permute(1,0)),dim=0)
                else:
                    mask = attention_mask[i,:].squeeze()[pos:]
                    nonmask = torch.count_nonzero(mask)
                    emb = excerpt[pos:,:]
                    ##avg_embeddings = torch.sum(emb*mask.unsqueeze(-1),0)/nonmask if cnt==0 else avg_embeddings+torch.sum(emb*mask.unsqueeze(-1),0)/nonmask
                    avg_embeddings = (torch.sum(emb*mask.unsqueeze(-1),0)/nonmask).unsqueeze(-1).permute(1,0) if cnt==0 else torch.cat((avg_embeddings,(torch.sum(emb*mask.unsqueeze(-1),0)/nonmask).unsqueeze(-1).permute(1,0)),dim=0)
                    break
            ##Let us pad the individual sentence embedding tensor. Let us say there can be max of 50 sentences.
            ##can explore https://medium.com/huggingface/understanding-emotions-from-keras-to-pytorch-3ccb61d5a983
            ##for an alternate way to do this
            pad=MAX_SENTS-avg_embeddings.size(0)
            if pad>0:
                pad_emb = torch.zeros(pad, 768).to(DEVICE)
                avg_embeddings = torch.cat((pad_emb, avg_embeddings), dim=0)
            else:
                avg_embeddings = avg_embeddings[:MAX_SENTS,:]
                
            if i==0:
                ##mean_emb_weighted = avg_embeddings.unsqueeze(-1).permute(1,0)/ (cnt+1)
                mean_emb_weighted = avg_embeddings.unsqueeze(-1).permute(2,0,1)
            else:
                ##mean_emb_weighted = torch.cat((mean_emb_weighted, (avg_embeddings.unsqueeze(-1).permute(1,0)/(cnt+1))), dim=0)
                mean_emb_weighted = torch.cat((mean_emb_weighted, (avg_embeddings.unsqueeze(-1).permute(2,0,1))), dim=0)
        ##<bs, dim> for mean of means and <bs, MAX_SENTS, dim> for LSTM scenario
        
        ##norm_mean_emb = self.layer_norm(mean_emb_weighted)
        ##lstm, (h1, c) = self.lstm(mean_emb_weighted)
        weights = self.attention_layer(mean_emb_weighted)
        context_vector = torch.sum(weights * mean_emb_weighted, dim=1)        
        out = self.regressor(context_vector)
        ##out = self.regressor(mean_emb_weighted)
        return out

In [None]:
def eval_mse(model, data_loader):
    model.eval()            
    mse_sum = 0

    with torch.no_grad():
        for batch_num, (input_ids, attention_mask, start_pos, target) in enumerate(data_loader):
            input_ids = input_ids.to(DEVICE)
            attention_mask = attention_mask.to(DEVICE)                        
            start_pos = start_pos.to(DEVICE)                        
            target = target.to(DEVICE)           
            pred = model(input_ids, attention_mask, start_pos)                       
            mse_sum += nn.MSELoss(reduction="sum")(pred.flatten(), target).item()
                
    return mse_sum / len(data_loader.dataset)

In [None]:
def predict(model, data_loader):
    model.eval()

    result = np.zeros(len(data_loader.dataset))    
    index = 0
    
    with torch.no_grad():
        for batch_num, (input_ids, attention_mask, start_pos) in enumerate(data_loader):
            input_ids = input_ids.to(DEVICE)
            attention_mask = attention_mask.to(DEVICE)
            start_pos = start_pos.to(DEVICE)
            pred = model(input_ids, attention_mask, start_pos)                        
            result[index : index + pred.shape[0]] = pred.flatten().to("cpu")
            index += pred.shape[0]

    return result

In [None]:
def train(model, model_path, train_loader, val_loader,optimizer, scheduler=None, num_epochs=NUM_EPOCHS):    
    best_val_rmse = None
    best_epoch = 0
    step = 0
    last_eval_step = 0
    eval_period = EVAL_SCHEDULE[0][1]    
    start = time.time()

    for epoch in range(num_epochs):                           
        val_rmse = None         
        for batch_num, (input_ids, attention_mask, start_pos, target) in enumerate(train_loader):
            input_ids = input_ids.to(DEVICE)
            attention_mask = attention_mask.to(DEVICE)            
            start_pos = start_pos.to(DEVICE)            
            target = target.to(DEVICE)                        
            optimizer.zero_grad()
            model.train()
            pred = model(input_ids, attention_mask, start_pos)
            mse = nn.MSELoss(reduction="mean")(pred.flatten(), target)
            mse.backward()
            optimizer.step()
            if scheduler:
                scheduler.step()
                
            if step >= last_eval_step + eval_period:
                elapsed_seconds = time.time() - start
                num_steps = step - last_eval_step
                ##print(f"\n{num_steps} steps took {elapsed_seconds:0.3} seconds")
                last_eval_step = step
                val_rmse = math.sqrt(eval_mse(model, val_loader))                            
                print(f"Epoch: {epoch} batch_num: {batch_num}", f"val_rmse: {val_rmse:0.4}")

                for rmse, period in EVAL_SCHEDULE:
                    if val_rmse >= rmse:
                        eval_period = period
                        break                               
                
                if not best_val_rmse or val_rmse < best_val_rmse:                    
                    best_val_rmse = val_rmse
                    best_epoch = epoch
                    torch.save(model.state_dict(), model_path)
                    print(f"New best_val_rmse: {best_val_rmse:0.4}")
                else:       
                    print(f"Still best_val_rmse: {best_val_rmse:0.4}", f"(from epoch {best_epoch})")                                    
                    
                start = time.time()
            step += 1
    return best_val_rmse

In [None]:
def create_optimizer(model):
    named_parameters = list(model.named_parameters())    
    roberta_parameters = named_parameters[:197]    
    attention_parameters = named_parameters[199:203]
    regressor_parameters = named_parameters[203:]
    attention_group = [params for (name, params) in attention_parameters]
    regressor_group = [params for (name, params) in regressor_parameters]
    parameters = []
    parameters.append({"params": attention_group})
    parameters.append({"params": regressor_group})

    for layer_num, (name, params) in enumerate(roberta_parameters):
        weight_decay = 0.0 if "bias" in name else 0.01
        lr = 2e-5
        if layer_num >= 69:        
            lr = 5e-5
        if layer_num >= 133:
            lr = 1e-4
        parameters.append({"params": params,"weight_decay": weight_decay,"lr": lr})
        
    return AdamW(parameters)

In [None]:
gc.collect()
SEED = 1000
list_val_rmse = []
kfold = KFold(n_splits=NUM_FOLDS, random_state=SEED, shuffle=True)

for fold, (train_indices, val_indices) in enumerate(kfold.split(train_df)):    
    print(f"\nFold {fold + 1}/{NUM_FOLDS}")
    model_path = f"model_{fold + 1}.pth"
    set_random_seed(SEED + fold)
    train_dataset = LitDataset(train_df.loc[train_indices])    
    val_dataset = LitDataset(train_df.loc[val_indices])    
    train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, drop_last=True, shuffle=True, num_workers=2)    
    val_loader = DataLoader(val_dataset, batch_size=BATCH_SIZE, drop_last=False, shuffle=False, num_workers=2)    
    set_random_seed(SEED + fold)    
    model = LitModel().to(DEVICE)
    optimizer = create_optimizer(model)                        
    scheduler = get_cosine_schedule_with_warmup(optimizer,num_training_steps=NUM_EPOCHS * len(train_loader),num_warmup_steps=50)    
    list_val_rmse.append(train(model, model_path, train_loader,val_loader, optimizer, scheduler=scheduler))

    del model
    gc.collect()
    print("\nPerformance estimates:")
    print(list_val_rmse)
    print("Mean:", np.array(list_val_rmse).mean())

In [None]:
test_dataset = LitDataset(test_df, inference_only=True)

In [None]:
all_predictions = np.zeros((len(list_val_rmse), len(test_df)))
test_dataset = LitDataset(test_df, inference_only=True)
test_loader = DataLoader(test_dataset, batch_size=BATCH_SIZE,drop_last=False, shuffle=False, num_workers=2)

for index in range(len(list_val_rmse)):            
    model_path = f"model_{index + 1}.pth"
    print(f"\nUsing {model_path}")
    model = LitModel()
    model.load_state_dict(torch.load(model_path, map_location=DEVICE))    
    model.to(DEVICE)
    all_predictions[index] = predict(model, test_loader)
    del model
    gc.collect()

In [None]:
predictions = all_predictions.mean(axis=0)
submission_df.target = predictions
print(submission_df)
submission_df.to_csv("submission.csv", index=False)