# Overview
This kernel is almost the same as [Lightweight Roberta solution in PyTorch](https://www.kaggle.com/andretugan/lightweight-roberta-solution-in-pytorch), but instead of starting from "roberta-base", it starts from [Maunish's pre-trained model](https://www.kaggle.com/maunish/clrp-roberta-base).

Acknowledgments: some ideas were taken from kernels by [Torch](https://www.kaggle.com/rhtsingh) and [Maunish](https://www.kaggle.com/maunish).

In [1]:
!git init
!curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | bash
!apt-get install git-lfs
!git lfs install
!git clone https://huggingface.co/roberta-base

# !git clone https://huggingface.co/roberta-large
# !git clone https://huggingface.co/facebook/bart-base
# !git clone https://huggingface.co/bert-base-uncased
# !git clone https://huggingface.co/microsoft/deberta-base
# !git clone https://huggingface.co/distilroberta-base

Initialized empty Git repository in /kaggle/working/.git/
Detected operating system as Ubuntu/bionic.
Checking for curl...
Detected curl...
Checking for gpg...
Detected gpg...
Running apt-get update... done.
Installing apt-transport-https... done.
Installing /etc/apt/sources.list.d/github_git-lfs.list...done.
Importing packagecloud gpg key... done.
Running apt-get update... done.

The repository is setup! You can now install packages.



The following NEW packages will be installed:
  git-lfs
0 upgraded, 1 newly installed, 0 to remove and 52 not upgraded.
Need to get 2129 kB of archives.
After this operation, 7662 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu bionic/universe amd64 git-lfs amd64 2.3.4-1 [2129 kB]
Fetched 2129 kB in 1s (1584 kB/s)
debconf: delaying package configuration, since apt-utils is not installed
Selecting previously unselected package git-lfs.
(Reading database ... 100757 files and directories currently 

In [2]:
import os
import math
import random
import time

import numpy as np
import pandas as pd

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset
from torch.utils.data import DataLoader

from transformers import AdamW
from transformers import AutoTokenizer
from transformers import AutoModel
from transformers import AutoConfig
from transformers import get_cosine_schedule_with_warmup

from sklearn.model_selection import KFold

import gc
gc.enable()

In [3]:
NUM_FOLDS = 7
NUM_EPOCHS = 3
BATCH_SIZE = 16
MAX_LEN = 300
EVAL_SCHEDULE = [(0.50, 16), (0.49, 8), (0.48, 4), (0.47, 2), (-1., 1)]
ROBERTA_PATH = "../input/clrp-roberta-base/clrp_roberta_base"
TOKENIZER_PATH = "./roberta-large"  # alternative: "../input/clrp-roberta-base/clrp_roberta_base"
CONFIG_PATH = "../input/clrp-roberta-base/clrp_roberta_base"
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

In [4]:
def set_random_seed(random_seed):
    random.seed(random_seed)
    np.random.seed(random_seed)
    os.environ["PYTHONHASHSEED"] = str(random_seed)

    torch.manual_seed(random_seed)
    torch.cuda.manual_seed(random_seed)
    torch.cuda.manual_seed_all(random_seed)

    torch.backends.cudnn.deterministic = True

In [5]:
train_df = pd.read_csv("/kaggle/input/commonlitreadabilityprize/train.csv")

# Remove incomplete entries if any.
train_df.drop(train_df[(train_df.target == 0) & (train_df.standard_error == 0)].index,
              inplace=True)
train_df.reset_index(drop=True, inplace=True)

test_df = pd.read_csv("/kaggle/input/commonlitreadabilityprize/test.csv")
submission_df = pd.read_csv("/kaggle/input/commonlitreadabilityprize/sample_submission.csv")

In [6]:
tokenizer = AutoTokenizer.from_pretrained(TOKENIZER_PATH)

# Dataset

In [7]:
class LitDataset(Dataset):
    def __init__(self, df, inference_only=False):
        super().__init__()

        self.df = df        
        self.inference_only = inference_only
        self.text = df.excerpt.tolist()
        #self.text = [text.replace("\n", " ") for text in self.text]
        
        if not self.inference_only:
            self.target = torch.tensor(df.target.values, dtype=torch.float32)        
    
        self.encoded = tokenizer.batch_encode_plus(
            self.text,
            padding = 'max_length',            
            max_length = MAX_LEN,
            truncation = True,
            return_attention_mask=True
        )        
 

    def __len__(self):
        return len(self.df)

    
    def __getitem__(self, index):        
        input_ids = torch.tensor(self.encoded['input_ids'][index])
        attention_mask = torch.tensor(self.encoded['attention_mask'][index])
        
        if self.inference_only:
            return (input_ids, attention_mask)            
        else:
            target = self.target[index]
            return (input_ids, attention_mask, target)
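
To make the effect of `padding='max_length'` and `truncation=True` concrete, here is a minimal, tokenizer-free sketch. The helper `pad_or_truncate` and the token IDs are hypothetical. The sketch assumes RoBERTa's usual special-token IDs (`<s>` = 0, `<pad>` = 1, `</s>` = 2) and only imitates the shape of what the real tokenizer returns.

```python
# Hypothetical helper that imitates fixed-length encoding; not the tokenizer's code.
BOS_ID, PAD_ID, EOS_ID = 0, 1, 2  # assumed RoBERTa special-token IDs


def pad_or_truncate(token_ids, max_len):
    """Return (input_ids, attention_mask), both exactly max_len long."""
    # Reserve two positions for <s> and </s>, dropping tokens that do not fit.
    body = token_ids[: max_len - 2]
    input_ids = [BOS_ID] + body + [EOS_ID]
    attention_mask = [1] * len(input_ids)

    # Right-pad so that every example has the same length.
    num_pad = max_len - len(input_ids)
    input_ids += [PAD_ID] * num_pad
    attention_mask += [0] * num_pad
    return input_ids, attention_mask


short_ids, short_mask = pad_or_truncate([100, 101, 102], max_len=8)
long_ids, long_mask = pad_or_truncate(list(range(100, 120)), max_len=8)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

Because every example ends up with the same length, `__getitem__` can turn each row into a tensor and the default `DataLoader` collation can stack them without a custom collate function.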

# Model
The model is inspired by the one from [Maunish](https://www.kaggle.com/maunish/clrp-roberta-svm).

In [8]:
# class LitModel(nn.Module):
#     def __init__(self):
#         super().__init__()

#         config = AutoConfig.from_pretrained(CONFIG_PATH)
#         config.update({"output_hidden_states":True, 
#                        "hidden_dropout_prob": 0.0,
# #                        "attention_probs_dropout_prob":0.0,
#                        "layer_norm_eps": 1e-7})                       
        
#         self.roberta = AutoModel.from_pretrained(ROBERTA_PATH, config=config)  
            
#         self.attention = nn.Sequential(            
#             nn.Linear(768, 512),            
#             nn.Tanh(),  
#             nn.Linear(512, 1),
#             nn.Softmax(dim=1)
#         )        

#         self.regressor = nn.Sequential(      
# #             nn.LayerNorm(768),
#             nn.Linear(768, 1),          
#         )
        

#     def forward(self, input_ids, attention_mask):
#         roberta_output = self.roberta(input_ids=input_ids,
#                                       attention_mask=attention_mask)        

#         # There are a total of 13 layers of hidden states.
#         # 1 for the embedding layer, and 12 for the 12 Roberta layers.
#         # We take the hidden states from the last Roberta layer.
#         last_layer_hidden_states = roberta_output.hidden_states[-1]

#         # The number of cells is MAX_LEN.
#         # The size of the hidden state of each cell is 768 (for roberta-base).
#         # In order to condense hidden states of all cells to a context vector,
#         # we compute a weighted average of the hidden states of all cells.
#         # We compute the weight of each cell, using the attention neural network.
#         weights = self.attention(last_layer_hidden_states)
                
#         # weights.shape is BATCH_SIZE x MAX_LEN x 1
#         # last_layer_hidden_states.shape is BATCH_SIZE x MAX_LEN x 768        
#         # Now we compute context_vector as the weighted average.
#         # context_vector.shape is BATCH_SIZE x 768
#         context_vector = torch.sum(weights * last_layer_hidden_states, dim=1)        
        
#         # Now we reduce the context vector to the prediction score.
#         return self.regressor(context_vector)

In [9]:
# class LitModel(nn.Module):
#     def __init__(self):
#         super().__init__()

#         config = AutoConfig.from_pretrained(CONFIG_PATH)
#         config.update({"output_hidden_states":True, 
#                        "hidden_dropout_prob": 0.0,
# #                        "attention_probs_dropout_prob":0.0,
#                        "layer_norm_eps": 1e-7})                       
        
#         self.roberta = AutoModel.from_pretrained(ROBERTA_PATH, config=config)  
#         self.cnn1 = nn.Conv1d(768, MAX_LEN, kernel_size=2, padding=1)
#         self.cnn2 = nn.Conv1d(MAX_LEN, 1, kernel_size=2, padding=1)

        

#     def forward(self, input_ids, attention_mask):
#         roberta_output = self.roberta(input_ids=input_ids,
#                                       attention_mask=attention_mask)        
#         last_hidden_state = roberta_output[0]
#         last_hidden_state = last_hidden_state.permute(0, 2, 1)
#         cnn_embeddings = F.relu(self.cnn1(last_hidden_state))
#         cnn_embeddings = self.cnn2(cnn_embeddings)
#         logits, _ = torch.max(cnn_embeddings, 2)
#         return logits

In [10]:
# https://arxiv.org/pdf/2103.04083v1.pdf
class LitModel(nn.Module):  
    def __init__(self):
        super().__init__()

        config = AutoConfig.from_pretrained(CONFIG_PATH)
        config.update({"output_hidden_states":True, 
                       "hidden_dropout_prob": 0.0,
#                        "attention_probs_dropout_prob":0.0,
                       "layer_norm_eps": 1e-7})                       
        
        self.roberta = AutoModel.from_pretrained(ROBERTA_PATH, config=config)  
#         self.cnn1 = nn.Conv1d(768, MAX_LEN, kernel_size=1)
#         self.cnn2 = nn.Conv1d(MAX_LEN, 1, kernel_size=1)
        self.cnn1 = nn.Conv1d(768, 512, kernel_size=1)
        self.cnn2 = nn.Conv1d(512, MAX_LEN, kernel_size=1)
         
#         self.layernorm = nn.LayerNorm(MAX_LEN,MAX_LEN)    
        self.layernorm = nn.LayerNorm(MAX_LEN)
            
        self.attention = nn.Sequential(            
            nn.Linear(MAX_LEN, MAX_LEN),            
            nn.Tanh(),  
            nn.Linear(MAX_LEN, 1),
            nn.Softmax(dim=1)
        )        

        self.regressor = nn.Sequential(      
#             nn.LayerNorm(768),
            nn.Linear(MAX_LEN, 1),          
        )
        

    def forward(self, input_ids, attention_mask):
        roberta_output = self.roberta(input_ids=input_ids,
                                      attention_mask=attention_mask)

        # There are a total of 13 layers of hidden states:
        # 1 for the embedding layer, and 12 for the 12 Roberta layers.
        # We take the hidden states from the last Roberta layer.
        # Shape: BATCH_SIZE x MAX_LEN x 768 (for roberta-base).
        last_hidden_state = roberta_output.hidden_states[-1]

        # Conv1d expects channels first: BATCH_SIZE x 768 x MAX_LEN.
        last_hidden_state = last_hidden_state.permute(0, 2, 1)
        # BATCH_SIZE x 512 x MAX_LEN
        cnn_embeddings = F.relu(self.cnn1(last_hidden_state))
        # BATCH_SIZE x MAX_LEN (channels) x MAX_LEN (tokens)
        cnn_embeddings = self.cnn2(cnn_embeddings)
        # Back to tokens first: BATCH_SIZE x MAX_LEN (tokens) x MAX_LEN (channels).
        cnn_embeddings = cnn_embeddings.permute(0, 2, 1)
#         cnn_embeddings = self.layernorm(cnn_embeddings)

        # The number of cells (tokens) is MAX_LEN, and after the two
        # convolutions each cell is described by MAX_LEN features.
        # In order to condense the features of all cells to a context vector,
        # we compute a weighted average of the features of all cells.
        # We compute the weight of each cell, using the attention neural network.
        weights = self.attention(cnn_embeddings)

        # weights.shape is BATCH_SIZE x MAX_LEN x 1
        # cnn_embeddings.shape is BATCH_SIZE x MAX_LEN x MAX_LEN
        # Now we compute context_vector as the weighted average.
        # context_vector.shape is BATCH_SIZE x MAX_LEN
        context_vector = torch.sum(weights * cnn_embeddings, dim=1)

        # Now we reduce the context vector to the prediction score.
        # The result has shape BATCH_SIZE x 1.
        return self.regressor(context_vector)
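
The shape bookkeeping in `forward` is easier to follow on tiny tensors. The sketch below is a stand-alone illustration, not the kernel's model. The sizes (`batch`, `seq_len`, `hidden`, `channels`) are arbitrary stand-ins, and a random tensor replaces the RoBERTa output. It also checks that a `Conv1d` with `kernel_size=1` is the same linear map applied to every token independently.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Small stand-in sizes (assumptions for illustration; the kernel uses 16, 300, 768).
batch, seq_len, hidden, channels = 2, 5, 8, 4

# Stand-in for roberta_output.hidden_states[-1]: batch x tokens x hidden.
hidden_states = torch.randn(batch, seq_len, hidden)

# Conv1d expects batch x channels x length, hence the permute before and after.
conv = nn.Conv1d(hidden, channels, kernel_size=1)
features = conv(hidden_states.permute(0, 2, 1)).permute(0, 2, 1)  # batch x tokens x channels

# With kernel_size=1 the convolution is the same linear map applied to each token.
per_token = hidden_states @ conv.weight.squeeze(-1).T + conv.bias
same_as_linear = torch.allclose(features, per_token, atol=1e-5)

# Attention pooling: one score per token, softmax over the token axis (dim=1).
attention = nn.Sequential(
    nn.Linear(channels, channels),
    nn.Tanh(),
    nn.Linear(channels, 1),
    nn.Softmax(dim=1),
)
weights = attention(features)                           # batch x tokens x 1
context_vector = torch.sum(weights * features, dim=1)   # batch x channels

print(features.shape, weights.shape, context_vector.shape, same_as_linear)
```

As in the kernel's `forward`, the pooling step here does not consult `attention_mask`, so padded positions also receive a share of the softmax weight.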

In [11]:
def eval_mse(model, data_loader):
    """Evaluates the mean squared error of the |model| on |data_loader|"""
    model.eval()            
    mse_sum = 0

    with torch.no_grad():
        for batch_num, (input_ids, attention_mask, target) in enumerate(data_loader):
            input_ids = input_ids.to(DEVICE)
            attention_mask = attention_mask.to(DEVICE)                        
            target = target.to(DEVICE)           
            
            pred = model(input_ids, attention_mask)                       

            mse_sum += nn.MSELoss(reduction="sum")(pred.flatten(), target).item()
                

    return mse_sum / len(data_loader.dataset)
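
`eval_mse` sums squared errors over all samples and divides once by the dataset size. This matters because the validation loader uses `drop_last=False`, so the last batch can be shorter than the others. The following sketch uses made-up numbers to show how that differs from averaging per-batch means.

```python
import math

# Hypothetical per-batch (predictions, targets) pairs; the last batch is shorter.
batches = [
    ([0.0, 1.0, 2.0, 3.0], [0.0, 0.0, 0.0, 0.0]),
    ([4.0], [0.0]),
]

# What eval_mse does: sum squared errors over all samples, then divide once.
squared_error_sum = sum(
    (p - t) ** 2 for preds, targets in batches for p, t in zip(preds, targets)
)
num_samples = sum(len(preds) for preds, _ in batches)
mse = squared_error_sum / num_samples
rmse = math.sqrt(mse)

# Averaging per-batch means instead over-weights the short last batch.
batch_means = [
    sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds)
    for preds, targets in batches
]
mean_of_batch_means = sum(batch_means) / len(batch_means)

print(mse, rmse, mean_of_batch_means)
```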

In [12]:
def predict(model, data_loader):
    """Returns an np.array with predictions of the |model| on |data_loader|"""
    model.eval()

    result = np.zeros(len(data_loader.dataset))    
    index = 0
    
    with torch.no_grad():
        for batch_num, (input_ids, attention_mask) in enumerate(data_loader):
            input_ids = input_ids.to(DEVICE)
            attention_mask = attention_mask.to(DEVICE)
                        
            pred = model(input_ids, attention_mask)                        

            result[index : index + pred.shape[0]] = pred.flatten().to("cpu")
            index += pred.shape[0]

    return result

In [13]:
def train(model, model_path, train_loader, val_loader,
          optimizer, scheduler=None, num_epochs=NUM_EPOCHS):    
    best_val_rmse = None
    best_epoch = 0
    step = 0
    last_eval_step = 0
    eval_period = EVAL_SCHEDULE[0][1]    

    start = time.time()

    for epoch in range(num_epochs):                           
        val_rmse = None         

        for batch_num, (input_ids, attention_mask, target) in enumerate(train_loader):
            input_ids = input_ids.to(DEVICE)
            attention_mask = attention_mask.to(DEVICE)            
            target = target.to(DEVICE)                        

            optimizer.zero_grad()
            
            model.train()

            pred = model(input_ids, attention_mask)
                                                        
            mse = nn.MSELoss(reduction="mean")(pred.flatten(), target)
                        
            mse.backward()

            optimizer.step()
            if scheduler:
                scheduler.step()
            
            if step >= last_eval_step + eval_period:
                # Evaluate the model on val_loader.
                elapsed_seconds = time.time() - start
                num_steps = step - last_eval_step
                print(f"\n{num_steps} steps took {elapsed_seconds:0.3} seconds")
                last_eval_step = step
                
                val_rmse = math.sqrt(eval_mse(model, val_loader))                            

                print(f"Epoch: {epoch} batch_num: {batch_num}", 
                      f"val_rmse: {val_rmse:0.4}")

                for rmse, period in EVAL_SCHEDULE:
                    if val_rmse >= rmse:
                        eval_period = period
                        break                               
                
                if best_val_rmse is None or val_rmse < best_val_rmse:                    
                    best_val_rmse = val_rmse
                    best_epoch = epoch
                    torch.save(model.state_dict(), model_path)
                    print(f"New best_val_rmse: {best_val_rmse:0.4}")
                else:       
                    print(f"Still best_val_rmse: {best_val_rmse:0.4}",
                          f"(from epoch {best_epoch})")                                    
                    
                start = time.time()
                                            
            step += 1
                        
    
    return best_val_rmse
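
The adaptive evaluation frequency inside `train` can be studied in isolation. The helper `pick_eval_period` below is a hypothetical extraction of the `for`/`break` loop over `EVAL_SCHEDULE`: the better the validation RMSE, the more often the model is evaluated.

```python
EVAL_SCHEDULE = [(0.50, 16), (0.49, 8), (0.48, 4), (0.47, 2), (-1., 1)]


def pick_eval_period(val_rmse, schedule=EVAL_SCHEDULE):
    """Hypothetical helper mirroring the for/break loop inside train()."""
    for rmse_threshold, period in schedule:
        if val_rmse >= rmse_threshold:
            return period
    # train() would silently keep the previous period here; the sketch raises instead.
    raise ValueError("schedule should end with a catch-all threshold")


for val_rmse in (0.9438, 0.4975, 0.485, 0.47, 0.46):
    print(val_rmse, "->", pick_eval_period(val_rmse))
```

The final `(-1., 1)` entry acts as a catch-all, since an RMSE is never negative. A validation RMSE of 0.4975 falls into the `(0.49, 8)` bucket, which agrees with the training log switching from 16-step to 8-step intervals after that value.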

In [14]:
def create_optimizer(model):
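    named_parameters = list(model.named_parameters())    
    
    # Indices 0-196 are the RoBERTa embeddings and the 12 encoder layers.
    # Indices 197-198 are the unused pooler, which is left out of the optimizer.
    roberta_parameters = named_parameters[:197]    
    # The two names below come from the original attention-only head.
    # With the current LitModel, 199:203 holds cnn1 and cnn2, and 203: holds
    # layernorm, attention and regressor. All of them get the same settings.
    attention_parameters = named_parameters[199:203]
    regressor_parameters = named_parameters[203:]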
        
    attention_group = [params for (name, params) in attention_parameters]
    regressor_group = [params for (name, params) in regressor_parameters]

    parameters = []
    parameters.append({"params": attention_group,
                       "weight_decay": 0.001,
                      "lr": 1e-3})
    parameters.append({"params": regressor_group,
                       "weight_decay": 0.001,
                      "lr": 1e-3})

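    for layer_num, (name, params) in enumerate(roberta_parameters):
        weight_decay = 0.0 if "bias" in name else 0.01

        # layer_num indexes parameter tensors, not encoder layers:
        # 5 embedding tensors come first, then 16 tensors per encoder layer.
        lr = 2e-5  # embeddings and encoder layers 0-3

        if layer_num >= 69:  # 5 + 4 * 16: encoder layers 4-7
            lr = 5e-5

        if layer_num >= 133:  # 5 + 8 * 16: encoder layers 8-11
            lr = 1e-4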

        parameters.append({"params": params,
                           "weight_decay": weight_decay,
                           "lr": lr})

    return AdamW(parameters)
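
The thresholds 69 and 133 in `create_optimizer` are indices into the list of parameter tensors. The sketch below shows where they come from, assuming the usual `roberta-base` layout of 5 embedding tensors followed by 16 tensors per encoder layer. Both helper functions are hypothetical and exist only for this illustration.

```python
# Assumed roberta-base layout: 5 embedding tensors, then 16 tensors per encoder layer.
NUM_EMBEDDING_TENSORS = 5
TENSORS_PER_LAYER = 16


def first_index_of_layer(layer):
    """Index of the first parameter tensor of encoder layer `layer` (0-based)."""
    return NUM_EMBEDDING_TENSORS + TENSORS_PER_LAYER * layer


def lr_for_index(index):
    """Hypothetical helper mirroring the thresholds in create_optimizer()."""
    if index >= 133:
        return 1e-4
    if index >= 69:
        return 5e-5
    return 2e-5


print(first_index_of_layer(4), first_index_of_layer(8), first_index_of_layer(12))
print(lr_for_index(0), lr_for_index(69), lr_for_index(196))
```

Under this assumption the encoder is split into three groups of four layers, with the learning rate growing toward the output. Index 197 is where the encoder ends, which matches the `named_parameters[:197]` slice.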

In [15]:
gc.collect()

SEED = 1000
list_val_rmse = []

kfold = KFold(n_splits=NUM_FOLDS, random_state=SEED, shuffle=True)

for fold, (train_indices, val_indices) in enumerate(kfold.split(train_df)):    
    print(f"\nFold {fold + 1}/{NUM_FOLDS}")
    model_path = f"model_{fold + 1}.pth"
        
    set_random_seed(SEED + fold)
    
    train_dataset = LitDataset(train_df.loc[train_indices])    
    val_dataset = LitDataset(train_df.loc[val_indices])    
        
    train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE,
                              drop_last=True, shuffle=True, num_workers=2)    
    val_loader = DataLoader(val_dataset, batch_size=BATCH_SIZE,
                            drop_last=False, shuffle=False, num_workers=2)    
        
    set_random_seed(SEED + fold)    
    
    model = LitModel().to(DEVICE)
    
    optimizer = create_optimizer(model)                        
    scheduler = get_cosine_schedule_with_warmup(
        optimizer,
        num_training_steps=NUM_EPOCHS * len(train_loader),
        num_warmup_steps=50)    
    
    list_val_rmse.append(train(model, model_path, train_loader,
                               val_loader, optimizer, scheduler=scheduler))

    del model
    gc.collect()
    
    print("\nPerformance estimates:")
    print(list_val_rmse)
    print("Mean:", np.array(list_val_rmse).mean())
    


Fold 1/7


Some weights of RobertaModel were not initialized from the model checkpoint at ../input/clrp-roberta-base/clrp_roberta_base and are newly initialized: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



16 steps took 9.86 seconds
Epoch: 0 batch_num: 16 val_rmse: 0.9438
New best_val_rmse: 0.9438

16 steps took 8.21 seconds
Epoch: 0 batch_num: 32 val_rmse: 0.839
New best_val_rmse: 0.839

16 steps took 8.28 seconds
Epoch: 0 batch_num: 48 val_rmse: 0.9068
Still best_val_rmse: 0.839 (from epoch 0)

16 steps took 8.24 seconds
Epoch: 0 batch_num: 64 val_rmse: 0.8852
Still best_val_rmse: 0.839 (from epoch 0)

16 steps took 8.25 seconds
Epoch: 0 batch_num: 80 val_rmse: 0.7456
New best_val_rmse: 0.7456

16 steps took 8.19 seconds
Epoch: 0 batch_num: 96 val_rmse: 0.5917
New best_val_rmse: 0.5917

16 steps took 8.16 seconds
Epoch: 0 batch_num: 112 val_rmse: 0.6405
Still best_val_rmse: 0.5917 (from epoch 0)

16 steps took 8.28 seconds
Epoch: 0 batch_num: 128 val_rmse: 0.5252
New best_val_rmse: 0.5252

16 steps took 8.23 seconds
Epoch: 0 batch_num: 144 val_rmse: 0.518
New best_val_rmse: 0.518

16 steps took 8.34 seconds
Epoch: 1 batch_num: 9 val_rmse: 0.5495
Still best_val_rmse: 0.518 (from epoch 

Some weights of RobertaModel were not initialized from the model checkpoint at ../input/clrp-roberta-base/clrp_roberta_base and are newly initialized: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



16 steps took 8.87 seconds
Epoch: 0 batch_num: 16 val_rmse: 0.886
New best_val_rmse: 0.886

16 steps took 8.16 seconds
Epoch: 0 batch_num: 32 val_rmse: 0.7454
New best_val_rmse: 0.7454

16 steps took 8.27 seconds
Epoch: 0 batch_num: 48 val_rmse: 0.6897
New best_val_rmse: 0.6897

16 steps took 8.2 seconds
Epoch: 0 batch_num: 64 val_rmse: 0.7156
Still best_val_rmse: 0.6897 (from epoch 0)

16 steps took 8.22 seconds
Epoch: 0 batch_num: 80 val_rmse: 0.5564
New best_val_rmse: 0.5564

16 steps took 8.31 seconds
Epoch: 0 batch_num: 96 val_rmse: 0.5798
Still best_val_rmse: 0.5564 (from epoch 0)

16 steps took 8.18 seconds
Epoch: 0 batch_num: 112 val_rmse: 0.5705
Still best_val_rmse: 0.5564 (from epoch 0)

16 steps took 8.19 seconds
Epoch: 0 batch_num: 128 val_rmse: 0.5303
New best_val_rmse: 0.5303

16 steps took 8.22 seconds
Epoch: 0 batch_num: 144 val_rmse: 0.6015
Still best_val_rmse: 0.5303 (from epoch 0)

16 steps took 8.4 seconds
Epoch: 1 batch_num: 9 val_rmse: 0.5145
New best_val_rmse: 0

Some weights of RobertaModel were not initialized from the model checkpoint at ../input/clrp-roberta-base/clrp_roberta_base and are newly initialized: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



16 steps took 8.86 seconds
Epoch: 0 batch_num: 16 val_rmse: 0.95
New best_val_rmse: 0.95

16 steps took 8.25 seconds
Epoch: 0 batch_num: 32 val_rmse: 0.7005
New best_val_rmse: 0.7005

16 steps took 8.2 seconds
Epoch: 0 batch_num: 48 val_rmse: 0.7381
Still best_val_rmse: 0.7005 (from epoch 0)

16 steps took 8.21 seconds
Epoch: 0 batch_num: 64 val_rmse: 0.5689
New best_val_rmse: 0.5689

16 steps took 8.22 seconds
Epoch: 0 batch_num: 80 val_rmse: 0.6624
Still best_val_rmse: 0.5689 (from epoch 0)

16 steps took 8.19 seconds
Epoch: 0 batch_num: 96 val_rmse: 0.5728
Still best_val_rmse: 0.5689 (from epoch 0)

16 steps took 8.17 seconds
Epoch: 0 batch_num: 112 val_rmse: 0.5344
New best_val_rmse: 0.5344

16 steps took 8.2 seconds
Epoch: 0 batch_num: 128 val_rmse: 0.5406
Still best_val_rmse: 0.5344 (from epoch 0)

16 steps took 8.24 seconds
Epoch: 0 batch_num: 144 val_rmse: 0.5903
Still best_val_rmse: 0.5344 (from epoch 0)

16 steps took 8.37 seconds
Epoch: 1 batch_num: 9 val_rmse: 0.5018
New b

Some weights of RobertaModel were not initialized from the model checkpoint at ../input/clrp-roberta-base/clrp_roberta_base and are newly initialized: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



16 steps took 8.86 seconds
Epoch: 0 batch_num: 16 val_rmse: 0.9628
New best_val_rmse: 0.9628

16 steps took 8.28 seconds
Epoch: 0 batch_num: 32 val_rmse: 0.8627
New best_val_rmse: 0.8627

16 steps took 8.23 seconds
Epoch: 0 batch_num: 48 val_rmse: 0.6337
New best_val_rmse: 0.6337

16 steps took 8.22 seconds
Epoch: 0 batch_num: 64 val_rmse: 0.6111
New best_val_rmse: 0.6111

16 steps took 8.23 seconds
Epoch: 0 batch_num: 80 val_rmse: 0.5481
New best_val_rmse: 0.5481

16 steps took 8.24 seconds
Epoch: 0 batch_num: 96 val_rmse: 0.5415
New best_val_rmse: 0.5415

16 steps took 8.27 seconds
Epoch: 0 batch_num: 112 val_rmse: 0.6224
Still best_val_rmse: 0.5415 (from epoch 0)

16 steps took 8.23 seconds
Epoch: 0 batch_num: 128 val_rmse: 0.5888
Still best_val_rmse: 0.5415 (from epoch 0)

16 steps took 8.17 seconds
Epoch: 0 batch_num: 144 val_rmse: 0.5293
New best_val_rmse: 0.5293

16 steps took 8.38 seconds
Epoch: 1 batch_num: 9 val_rmse: 0.5366
Still best_val_rmse: 0.5293 (from epoch 0)

16 ste

Some weights of RobertaModel were not initialized from the model checkpoint at ../input/clrp-roberta-base/clrp_roberta_base and are newly initialized: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



16 steps took 8.84 seconds
Epoch: 0 batch_num: 16 val_rmse: 0.9256
New best_val_rmse: 0.9256

16 steps took 8.19 seconds
Epoch: 0 batch_num: 32 val_rmse: 0.7616
New best_val_rmse: 0.7616

16 steps took 8.25 seconds
Epoch: 0 batch_num: 48 val_rmse: 0.7078
New best_val_rmse: 0.7078

16 steps took 8.25 seconds
Epoch: 0 batch_num: 64 val_rmse: 0.6713
New best_val_rmse: 0.6713

16 steps took 8.2 seconds
Epoch: 0 batch_num: 80 val_rmse: 0.7208
Still best_val_rmse: 0.6713 (from epoch 0)

16 steps took 8.18 seconds
Epoch: 0 batch_num: 96 val_rmse: 0.6294
New best_val_rmse: 0.6294

16 steps took 8.28 seconds
Epoch: 0 batch_num: 112 val_rmse: 0.5973
New best_val_rmse: 0.5973

16 steps took 8.22 seconds
Epoch: 0 batch_num: 128 val_rmse: 0.6289
Still best_val_rmse: 0.5973 (from epoch 0)

16 steps took 8.17 seconds
Epoch: 0 batch_num: 144 val_rmse: 0.5471
New best_val_rmse: 0.5471

16 steps took 8.35 seconds
Epoch: 1 batch_num: 9 val_rmse: 0.6255
Still best_val_rmse: 0.5471 (from epoch 0)

16 step

Some weights of RobertaModel were not initialized from the model checkpoint at ../input/clrp-roberta-base/clrp_roberta_base and are newly initialized: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



16 steps took 8.9 seconds
Epoch: 0 batch_num: 16 val_rmse: 0.9189
New best_val_rmse: 0.9189

16 steps took 8.2 seconds
Epoch: 0 batch_num: 32 val_rmse: 1.017
Still best_val_rmse: 0.9189 (from epoch 0)

16 steps took 8.17 seconds
Epoch: 0 batch_num: 48 val_rmse: 0.6595
New best_val_rmse: 0.6595

16 steps took 8.21 seconds
Epoch: 0 batch_num: 64 val_rmse: 0.6287
New best_val_rmse: 0.6287

16 steps took 8.27 seconds
Epoch: 0 batch_num: 80 val_rmse: 0.5726
New best_val_rmse: 0.5726

16 steps took 8.23 seconds
Epoch: 0 batch_num: 96 val_rmse: 0.6137
Still best_val_rmse: 0.5726 (from epoch 0)

16 steps took 8.18 seconds
Epoch: 0 batch_num: 112 val_rmse: 0.5699
New best_val_rmse: 0.5699

16 steps took 8.2 seconds
Epoch: 0 batch_num: 128 val_rmse: 0.5979
Still best_val_rmse: 0.5699 (from epoch 0)

16 steps took 8.19 seconds
Epoch: 0 batch_num: 144 val_rmse: 0.5506
New best_val_rmse: 0.5506

16 steps took 8.42 seconds
Epoch: 1 batch_num: 9 val_rmse: 0.4975
New best_val_rmse: 0.4975

8 steps to

Some weights of RobertaModel were not initialized from the model checkpoint at ../input/clrp-roberta-base/clrp_roberta_base and are newly initialized: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



16 steps took 8.87 seconds
Epoch: 0 batch_num: 16 val_rmse: 0.9503
New best_val_rmse: 0.9503

16 steps took 8.23 seconds
Epoch: 0 batch_num: 32 val_rmse: 0.7445
New best_val_rmse: 0.7445

16 steps took 8.17 seconds
Epoch: 0 batch_num: 48 val_rmse: 0.7168
New best_val_rmse: 0.7168

16 steps took 8.2 seconds
Epoch: 0 batch_num: 64 val_rmse: 0.6742
New best_val_rmse: 0.6742

16 steps took 8.27 seconds
Epoch: 0 batch_num: 80 val_rmse: 0.6242
New best_val_rmse: 0.6242

16 steps took 8.23 seconds
Epoch: 0 batch_num: 96 val_rmse: 0.5893
New best_val_rmse: 0.5893

16 steps took 8.24 seconds
Epoch: 0 batch_num: 112 val_rmse: 0.6304
Still best_val_rmse: 0.5893 (from epoch 0)

16 steps took 8.19 seconds
Epoch: 0 batch_num: 128 val_rmse: 0.6059
Still best_val_rmse: 0.5893 (from epoch 0)

16 steps took 8.19 seconds
Epoch: 0 batch_num: 144 val_rmse: 0.5472
New best_val_rmse: 0.5472

16 steps took 8.45 seconds
Epoch: 1 batch_num: 9 val_rmse: 0.6582
Still best_val_rmse: 0.5472 (from epoch 0)

16 step
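
The learning-rate schedule used during training combines a linear warmup with a cosine decay. The function below is a plain-Python sketch of the multiplier that, to my understanding, `get_cosine_schedule_with_warmup` applies to each group's base learning rate with its default settings. It is a re-implementation for illustration, not the library code. The figure of 151 batches per epoch is inferred from the log above, where step 160 is reported as batch 9 of epoch 1.

```python
import math


def cosine_warmup_multiplier(step, num_warmup_steps, num_training_steps):
    """Sketch of the LR multiplier: linear warmup, then half a cosine wave down to 0."""
    if step < num_warmup_steps:
        return step / max(1, num_warmup_steps)
    progress = (step - num_warmup_steps) / max(1, num_training_steps - num_warmup_steps)
    return max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


# 151 batches per epoch is inferred from the log, not stated in the kernel.
num_training_steps = 3 * 151
for step in (0, 25, 50, 453):
    print(step, round(cosine_warmup_multiplier(step, 50, num_training_steps), 4))
```

With 50 warmup steps out of roughly 453, the peak learning rate is reached about a third of the way through the first epoch.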

# Inference

In [16]:
test_dataset = LitDataset(test_df, inference_only=True)

In [17]:
all_predictions = np.zeros((len(list_val_rmse), len(test_df)))

test_dataset = LitDataset(test_df, inference_only=True)
test_loader = DataLoader(test_dataset, batch_size=BATCH_SIZE,
                         drop_last=False, shuffle=False, num_workers=2)

for index in range(len(list_val_rmse)):            
    model_path = f"model_{index + 1}.pth"
    print(f"\nUsing {model_path}")
                        
    model = LitModel()
    model.load_state_dict(torch.load(model_path, map_location=DEVICE))    
    model.to(DEVICE)
    
    all_predictions[index] = predict(model, test_loader)
    
    del model
    gc.collect()


Using model_1.pth


Some weights of RobertaModel were not initialized from the model checkpoint at ../input/clrp-roberta-base/clrp_roberta_base and are newly initialized: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



Using model_2.pth

Using model_3.pth

Using model_4.pth

Using model_5.pth

Using model_6.pth

Using model_7.pth

In [18]:
predictions = all_predictions.mean(axis=0)
submission_df.target = predictions
print(submission_df)
submission_df.to_csv("submission.csv", index=False)

          id    target
0  c0f722661 -0.399351
1  f0953f0a5 -0.645916
2  0df072751 -0.399908
3  04caf4e0c -2.534776
4  0e63f8bea -1.638422
5  12537fe78 -1.449965
6  965e592c0  0.187443
