![](https://image4.owler.com/logo/commonlit-org_owler_20170203_094053_original.png)
<center><img src="https://i.imgur.com/iywFvlD.png" width="2000" alt="Weights & Biases" /></center><br>

<h1 style = "font-family: garamond; font-size: 40px; font-style: normal; letter-spacing: 1px; background-color: #f6f5f5; color :#6666ff; border-radius: 200px 200px; text-align:center">Squeeze BERT</h1>

<p style = "font-family: garamond; font-size: 20px; font-style: normal; border-radius: 10px 10px; text-align:center">Humans read and write hundreds of billions of messages every day. Further, due to the availability of large datasets, large computing systems, and better neural network models, natural language processing (NLP) technology has made significant strides in understanding, proofreading, and organizing these messages. Thus, there is a significant opportunity to deploy NLP in myriad applications to help web users, social networks, and businesses. <br><br>
In particular, we consider smartphones and other mobile devices as crucial platforms for deploying NLP models at scale. However, today’s highly-accurate NLP neural network models such as BERT and RoBERTa are extremely computationally expensive, with BERT-base taking 1.7 seconds to classify a text snippet on a Pixel 3 smartphone. In this work, we observe that methods such as grouped convolutions have yielded significant speedups for computer vision networks, but many of these techniques have not been adopted by NLP neural network designers.<br><br>
SqueezeBERT demonstrates how to replace several operations in self-attention layers with grouped convolutions. The resulting network architecture runs 4.3x faster than BERT-base on the Pixel 3 while achieving competitive accuracy on the GLUE test set. Pretrained SqueezeBERT weights are available through HuggingFace Transformers, which is how this notebook loads the model.</p>
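
<p style = "font-family: garamond; font-size: 20px; font-style: normal; border-radius: 10px 10px; text-align:center">As a rough illustration of why grouped convolutions are cheaper, the sketch below counts the weights of a position-wise layer written as a 1D convolution with kernel size 1. The helper name <code>conv1d_weight_count</code> and the choice of 4 groups are assumptions made for this example only; the hidden size of 768 matches the model used later in this notebook.</p>

```python
def conv1d_weight_count(in_channels, out_channels, kernel_size=1, groups=1):
    """Number of weights (no bias) in a 1D convolution.

    With groups > 1 each output channel only sees in_channels / groups
    input channels, so the weight count shrinks by a factor of `groups`.
    """
    if in_channels % groups != 0 or out_channels % groups != 0:
        raise ValueError("channels must be divisible by groups")
    return (in_channels // groups) * out_channels * kernel_size


hidden = 768  # hidden size of the encoder used in this notebook

# A position-wise fully connected layer is a convolution with kernel_size=1, groups=1
dense_weights = conv1d_weight_count(hidden, hidden, kernel_size=1, groups=1)

# Hypothetical grouped variant of the same layer (groups=4 is illustrative)
grouped_weights = conv1d_weight_count(hidden, hidden, kernel_size=1, groups=4)

print(dense_weights, grouped_weights, dense_weights / grouped_weights)
```

<p style = "font-family: garamond; font-size: 20px; font-style: normal; border-radius: 10px 10px; text-align:center">Fewer weights per layer means fewer multiply-accumulate operations per token, which is where the speedup on mobile hardware comes from.</p>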

<p p style = "font-family: garamond; font-size:40px; font-style: normal;background-color: #f6f5f5; color :#6666ff; border-radius: 10px 10px; text-align:center">What are we discussing today? </p>
 <p p style = "font-family: garamond; font-size:30px; font-style: normal;background-color: #f6f5f5; color :#006699; border-radius: 10px 10px; text-align:center">Squeeze BERT <br>
 MadGrad optimizer <br>
 Gradient Accumulation <br>
 HuggingFace Accelerate <br>
 Weights and Biases for Experiment Tracking</p>

<p p style = "font-family: garamond; font-size:35px; font-style: normal;background-color: #f6f5f5; color :#ff0066; border-radius: 10px 10px; text-align:center">Upvote the kernel if you find it insightful!</p>

# <center><img src="https://i.imgur.com/gb6B4ig.png" width="400" alt="Weights & Biases" /></center><br>
<p style = "font-family: garamond; font-size: 20px; font-style: normal; border-radius: 10px 10px; text-align:center">Weights & Biases (wandb) is a developer tool that helps teams turn deep learning research projects into deployed software by tracking models, visualizing model performance, and automating training and model improvement.
We will use it to log hyperparameters and output metrics from our runs, then visualize and compare results and share findings with colleagues.<br><br>Here it tracks every fold of our K-fold cross-validation training, giving us better insight into how training progresses. <br><br></p>

![img](https://i.imgur.com/BGgfZj3.png)

<p p style = "font-family: garamond; font-size:30px; font-style: normal;background-color: #f6f5f5; color :#6666ff; border-radius: 10px 10px; text-align:center">Accelerate by HuggingFace 🤗</p>

<p style = "font-family: garamond; font-size: 20px; font-style: normal; border-radius: 10px 10px; text-align:center">Accelerate provides an easy API to make your scripts run with mixed precision and on any kind of distributed setting (multi-GPUs, TPUs etc.) while still letting you write your own training loop. The same code then runs seamlessly on your local machine for debugging or in your training environment.
With about five lines of code, our script can run on any distributed setting!</p>

In [None]:
!pip install -q wandb --upgrade
!pip install -q transformers
!pip install -q accelerate

<p p style = "font-family: garamond; font-size:30px; font-style: normal;background-color: #f6f5f5; color :#6666ff; border-radius: 10px 10px; text-align:center">Import Libraries</p>

In [None]:
# Hide warnings
import warnings
warnings.filterwarnings('ignore', category=DeprecationWarning)
warnings.filterwarnings('ignore', category=FutureWarning)

# Python
import os
import random
from collections import defaultdict
import numpy as np
import pandas as pd
from tqdm.notebook import tqdm

# Utilities
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import StratifiedKFold

# Pytorch for Deep Learning
import torch
import transformers
import torch.nn as nn
import torch.nn.functional as F
from torch.cuda import amp
from torch.utils.data import Dataset, DataLoader
import torch.optim as optim
from torch.optim.optimizer import Optimizer
from torch.optim import lr_scheduler

# HuggingFace Libraries (transformers itself is already imported above)
from transformers import SqueezeBertTokenizer, SqueezeBertModel
from accelerate import Accelerator

accelerator = Accelerator()

# Weights and Biases Tool
import wandb

<p p style = "font-family: garamond; font-size:30px; font-style: normal;background-color: #f6f5f5; color :#6666ff; border-radius: 10px 10px; text-align:center">Define Configurations/Parameters</p>

In [None]:
params = {
    'seed': 42,
    'model' : 'squeezebert/squeezebert-uncased',
    'name': 'squeezebert-uncased',
    'tokenizer' : SqueezeBertTokenizer.from_pretrained('squeezebert/squeezebert-uncased'),
    'device': accelerator.device,
    'lr': 1e-4,
    'weight_decay': 1e-6,
    'batch_size': 32,
    'num_workers' : 8,
    'epochs': 3,
    'max_len': 205,
    'nfolds': 5,
    'gradient_accumulation_steps': 2
}

<p p style = "font-family: garamond; font-size:30px; font-style: normal;background-color: #f6f5f5; color :#6666ff; border-radius: 10px 10px; text-align:center">Define Seed for Reproducibility</p>

In [None]:
def seed_everything(seed=42):
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
    random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.backends.cudnn.deterministic = True
    # benchmark mode picks kernels non-deterministically, so keep it off for reproducibility
    torch.backends.cudnn.benchmark = False
    
seed_everything(params['seed'])

<p p style = "font-family: garamond; font-size:30px; font-style: normal;background-color: #f6f5f5; color :#6666ff; border-radius: 10px 10px; text-align:center">Define Train and Test</p>

In [None]:
train_df = pd.read_csv("../input/commonlitreadabilityprize/train.csv")
train_df.head()

In [None]:
test_df = pd.read_csv("../input/commonlitreadabilityprize/test.csv")
test_df.head()

<p p style = "font-family: garamond; font-size:30px; font-style: normal;background-color: #f6f5f5; color :#6666ff; border-radius: 10px 10px; text-align:center">Custom Dataset</p>

In [None]:
class BERTDataset(Dataset):
    def __init__(self, review, target=None, is_test = False):
        self.review = review
        self.target = target
        self.is_test = is_test
        self.tokenizer = params['tokenizer']
        self.max_len = params['max_len']
    
    def __len__(self):
        return len(self.review)
    
    def __getitem__(self, idx):
        review = str(self.review[idx])
        review = ' '.join(review.split())
        
        inputs = self.tokenizer.encode_plus(
            review,
            None,
            truncation=True,
            add_special_tokens=True,
            max_length=self.max_len,
            padding='max_length'  # replaces the deprecated pad_to_max_length=True
        )

        ids = torch.tensor(inputs['input_ids'], dtype=torch.long)
        mask = torch.tensor(inputs['attention_mask'], dtype=torch.long)
        token_type_ids = torch.tensor(inputs['token_type_ids'], dtype=torch.long)

        item = {
            'ids': ids,
            'mask': mask,
            'token_type_ids': token_type_ids
        }
        # Test data has no labels, so only attach targets when they exist
        if not self.is_test:
            item['targets'] = torch.tensor(self.target[idx], dtype=torch.float)
        return item

<p p style = "font-family: garamond; font-size:30px; font-style: normal;background-color: #f6f5f5; color :#6666ff; border-radius: 10px 10px; text-align:center">Custom Class for Monitoring RMSE</p>

In [None]:
class MetricMonitor:
    def __init__(self, float_precision=3):
        self.float_precision = float_precision
        self.reset()

    def reset(self):
        self.metrics = defaultdict(lambda: {"val": 0, "count": 0, "avg": 0})

    def update(self, metric_name, val):
        metric = self.metrics[metric_name]

        metric["val"] += val
        metric["count"] += 1
        metric["avg"] = metric["val"] / metric["count"]

    def __str__(self):
        return " | ".join(
            [
                "{metric_name}: {avg:.{float_precision}f}".format(
                    metric_name=metric_name, avg=metric["avg"],
                    float_precision=self.float_precision
                )
                for (metric_name, metric) in self.metrics.items()
            ]
        )
    
def use_rmse_score(output, target):
    return np.sqrt(mean_squared_error(target, output))

<p p style = "font-family: garamond; font-size:30px; font-style: normal;background-color: #f6f5f5; color :#6666ff; border-radius: 10px 10px; text-align:center">Train and Valid Loader</p>


In [None]:
def get_loader(train_data, valid_data):
    
    train_set = BERTDataset(
        review = train_data['excerpt'].values,
        target = train_data['target'].values
    )

    valid_set = BERTDataset(
        review = valid_data['excerpt'].values,
        target = valid_data['target'].values
    )

    train_loader = DataLoader(
        train_set,
        batch_size = params['batch_size'],
        shuffle = True,
        num_workers=params['num_workers'],
        pin_memory = True
    )

    valid_loader = DataLoader(
        valid_set,
        batch_size = params['batch_size'],
        shuffle = False,
        num_workers=params['num_workers'],
        pin_memory = True
    )
    
    return train_loader, valid_loader

<p p style = "font-family: garamond; font-size:30px; font-style: normal;background-color: #f6f5f5; color :#6666ff; border-radius: 10px 10px; text-align:center">Squeeze BERT Tokenizer and Model</p>

<center><img src="https://i.imgur.com/T6C6HG1.png" width="1500" alt="tokenizer" /></center><br><br>

<center><img src="https://i.imgur.com/sRXgv3p.png" width="1500" alt="tokenizer" /></center>

In [None]:
class SqueezeBert(nn.Module):
    def __init__(self):
        super(SqueezeBert, self).__init__()
        self.bert = SqueezeBertModel.from_pretrained(params['model'])
        self.drop = nn.Dropout(0.3)
        self.fc = nn.Linear(768, 1)
    
    def forward(self, ids, mask, token_type_ids):
        _, output = self.bert(ids, attention_mask=mask, token_type_ids=token_type_ids, return_dict=False)
        output = self.drop(output)
        output = self.fc(output)
        return output

<p p style = "font-family: garamond; font-size:30px; font-style: normal;background-color: #f6f5f5; color :#6666ff; border-radius: 10px 10px; text-align:center">MadGrad Optimizer</p>
<p style = "font-family: garamond; font-size: 20px; font-style: normal; border-radius: 10px 10px; text-align:center">Introducing MADGRAD, a novel optimization method in the family of AdaGrad adaptive gradient methods. MADGRAD shows excellent performance on deep learning optimization problems from multiple fields, including classification and image-to-image tasks in vision, and recurrent and bidirectionally-masked models in natural language processing. For each of these tasks, MADGRAD matches or outperforms both SGD and ADAM in test set performance, even on problems for which adaptive methods normally perform poorly.<br><br>
MADGRAD is a general-purpose optimizer that can be used in place of SGD or Adam and may converge faster and generalize better. Currently GPU-only. Typically, the same learning rate schedule that is used for SGD or Adam may be used. The overall learning rate is not comparable to either method and should be determined by a hyper-parameter sweep.</p>
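
<p style = "font-family: garamond; font-size: 20px; font-style: normal; border-radius: 10px 10px; text-align:center">Reading the dense-gradient branch of the implementation used in this notebook, one optimizer step can be summarised roughly as follows. The symbols are chosen here for illustration and are not taken from the text above: γ is the learning rate (the code adds ε to it), g<sub>k</sub> is the gradient after weight decay, s<sub>k</sub> and ν<sub>k</sub> are the accumulated gradients and squared gradients, x<sub>0</sub> is the initial parameter value, and c = 1 - momentum.</p>

```latex
\begin{aligned}
\lambda_k   &= \gamma \sqrt{k + 1} \\
s_{k+1}     &= s_k + \lambda_k \, g_k \\
\nu_{k+1}   &= \nu_k + \lambda_k \, (g_k \odot g_k) \\
z_{k+1}     &= x_0 - \frac{s_{k+1}}{\sqrt[3]{\nu_{k+1}} + \epsilon} \\
x_{k+1}     &= (1 - c) \, x_k + c \, z_{k+1}
\end{aligned}
```

<p style = "font-family: garamond; font-size: 20px; font-style: normal; border-radius: 10px 10px; text-align:center">The cube root in the denominator, rather than the square root used by AdaGrad-style methods, is what the <code>pow(1 / 3)</code> calls in the code correspond to.</p>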

![](https://warehouse-camo.ingress.cmh1.psfhosted.org/9f23964e71fdb90e6464c7f643b6dff2a3ededb6/68747470733a2f2f6769746875622e636f6d2f66616365626f6f6b72657365617263682f6d6164677261642f626c6f622f6d61737465722f666967757265732f6e6c702e706e673f7261773d74727565)

In [None]:
import math
from typing import TYPE_CHECKING, Any, Callable, Optional

if TYPE_CHECKING:
    from torch.optim.optimizer import _params_t
else:
    _params_t = Any

class MADGRAD(Optimizer):

    def __init__(
        self, params: _params_t, lr: float = 1e-2, momentum: float = 0.9, weight_decay: float = 0, eps: float = 1e-6,
    ):
        if momentum < 0 or momentum >= 1:
            raise ValueError(f"Momentum {momentum} must be in the range [0, 1)")
        if lr <= 0:
            raise ValueError(f"Learning rate {lr} must be positive")
        if weight_decay < 0:
            raise ValueError(f"Weight decay {weight_decay} must be non-negative")
        if eps < 0:
            raise ValueError(f"Eps {eps} must be non-negative")

        defaults = dict(lr=lr, eps=eps, momentum=momentum, weight_decay=weight_decay)
        super().__init__(params, defaults)

    @property
    def supports_memory_efficient_fp16(self) -> bool:
        return False

    @property
    def supports_flat_params(self) -> bool:
        return True

    def step(self, closure: Optional[Callable[[], float]] = None) -> Optional[float]:

        loss = None
        if closure is not None:
            loss = closure()

        if 'k' not in self.state:
            self.state['k'] = torch.tensor([0], dtype=torch.long)
        k = self.state['k'].item()

        for group in self.param_groups:
            eps = group["eps"]
            lr = group["lr"] + eps
            decay = group["weight_decay"]
            momentum = group["momentum"]

            ck = 1 - momentum
            lamb = lr * math.pow(k + 1, 0.5)

            for p in group["params"]:
                if p.grad is None:
                    continue
                grad = p.grad.data
                state = self.state[p]

                if "grad_sum_sq" not in state:
                    state["grad_sum_sq"] = torch.zeros_like(p.data).detach()
                    state["s"] = torch.zeros_like(p.data).detach()
                    if momentum != 0:
                        state["x0"] = torch.clone(p.data).detach()

                if momentum != 0.0 and grad.is_sparse:
                    raise RuntimeError("momentum != 0 is not compatible with sparse gradients")

                grad_sum_sq = state["grad_sum_sq"]
                s = state["s"]

                # Apply weight decay
                if decay != 0:
                    if grad.is_sparse:
                        raise RuntimeError("weight_decay option is not compatible with sparse gradients")

                    grad.add_(p.data, alpha=decay)

                if grad.is_sparse:
                    grad = grad.coalesce()
                    grad_val = grad._values()

                    p_masked = p.sparse_mask(grad)
                    grad_sum_sq_masked = grad_sum_sq.sparse_mask(grad)
                    s_masked = s.sparse_mask(grad)

                    # Compute x_0 from other known quantities
                    rms_masked_vals = grad_sum_sq_masked._values().pow(1 / 3).add_(eps)
                    x0_masked_vals = p_masked._values().addcdiv(s_masked._values(), rms_masked_vals, value=1)

                    # Dense + sparse op
                    grad_sq = grad * grad
                    grad_sum_sq.add_(grad_sq, alpha=lamb)
                    grad_sum_sq_masked.add_(grad_sq, alpha=lamb)

                    rms_masked_vals = grad_sum_sq_masked._values().pow_(1 / 3).add_(eps)

                    s.add_(grad, alpha=lamb)
                    s_masked._values().add_(grad_val, alpha=lamb)

                    # update masked copy of p
                    p_kp1_masked_vals = x0_masked_vals.addcdiv(s_masked._values(), rms_masked_vals, value=-1)
                    # Copy updated masked p to dense p using an add operation
                    p_masked._values().add_(p_kp1_masked_vals, alpha=-1)
                    p.data.add_(p_masked, alpha=-1)
                else:
                    if momentum == 0:
                        # Compute x_0 from other known quantities
                        rms = grad_sum_sq.pow(1 / 3).add_(eps)
                        x0 = p.data.addcdiv(s, rms, value=1)
                    else:
                        x0 = state["x0"]

                    # Accumulate second moments
                    grad_sum_sq.addcmul_(grad, grad, value=lamb)
                    rms = grad_sum_sq.pow(1 / 3).add_(eps)

                    # Update s
                    s.data.add_(grad, alpha=lamb)

                    # Step
                    if momentum == 0:
                        p.data.copy_(x0.addcdiv(s, rms, value=-1))
                    else:
                        z = x0.addcdiv(s, rms, value=-1)

                        # p is a moving average of z
                        p.data.mul_(1 - ck).add_(z, alpha=ck)


        self.state['k'] += 1
        return loss

<p p style = "font-family: garamond; font-size:30px; font-style: normal;background-color: #f6f5f5; color :#6666ff; border-radius: 10px 10px; text-align:center">Define Loss Function, Optimizer and Scheduler</p>

In [None]:
def get_criterion(outputs, targets):
    return torch.sqrt(nn.MSELoss()(outputs, targets))

In [None]:
def get_scheduler(optimizer, nb_train_steps):
    return transformers.get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=0,
        num_training_steps=nb_train_steps
    )
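
<p style = "font-family: garamond; font-size: 20px; font-style: normal; border-radius: 10px 10px; text-align:center">To make the shape of this schedule concrete, here is a small pure-Python sketch of the multiplier a linear schedule with warmup applies to the base learning rate. The function name <code>linear_lr_multiplier</code> is hypothetical and the formula is the usual linear warmup followed by linear decay to zero; treat it as an approximation of the library behaviour rather than its source code.</p>

```python
def linear_lr_multiplier(step, num_training_steps, num_warmup_steps=0):
    """Factor applied to the base learning rate at a given scheduler step."""
    if step < num_warmup_steps:
        # Linear ramp from 0 to 1 during warmup
        return step / max(1, num_warmup_steps)
    # Linear decay from 1 to 0 over the remaining steps
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


base_lr = 1e-4   # same value as params['lr'] in this notebook
total_steps = 100  # illustrative number of scheduler steps

for step in (0, 50, 100):
    print(step, base_lr * linear_lr_multiplier(step, total_steps))
```

<p style = "font-family: garamond; font-size: 20px; font-style: normal; border-radius: 10px 10px; text-align:center">Because the scheduler advances once per optimizer update, the total step count passed to it should count updates. If it counts batches while gradients are accumulated over several batches, the learning rate never decays all the way to zero.</p>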


In [None]:
model = SqueezeBert()
model = model.to(params['device'])
optimizer = MADGRAD(model.parameters(), lr=params['lr'], weight_decay=params['weight_decay'])

<p p style = "font-family: garamond; font-size:30px; font-style: normal;background-color: #f6f5f5; color :#6666ff; border-radius: 10px 10px; text-align:center">Training with Gradient Accumulation</p>
<p style = "font-family: garamond; font-size: 20px; font-style: normal; border-radius: 10px 10px; text-align:center">Gradient accumulation is simple: the loss and gradients are computed after each mini-batch, but instead of updating the model parameters immediately, the gradients are accumulated over consecutive batches. The parameters are then updated from the accumulated gradient after a specified number of batches. This has the same effect as training with a larger mini-batch (more excerpts per update) without the extra memory cost.</p>
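
<p style = "font-family: garamond; font-size: 20px; font-style: normal; border-radius: 10px 10px; text-align:center">The toy example below checks that claim numerically for a one-parameter linear model with a mean squared error loss. Everything in it (the data, the helper <code>mse_grad</code>, the scalar weight) is made up for illustration. Note that the equivalence is exact for losses that average over examples, such as MSE with equal-sized micro-batches; for the RMSE loss used in this notebook it holds only approximately, because the square root is taken per batch.</p>

```python
def mse_grad(w, xs, ys):
    """Gradient of mean((w * x - y) ** 2) with respect to the scalar weight w."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


# Toy data and weight, purely illustrative
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 3.0, 5.0, 8.0]
w = 0.5

accumulation_steps = 2
micro_batch_size = len(xs) // accumulation_steps

# Gradient from one large batch containing every example
full_batch_grad = mse_grad(w, xs, ys)

# Gradient accumulated over micro-batches; each micro-batch loss is divided
# by accumulation_steps, exactly like `loss / gradient_accumulation_steps`
accumulated_grad = 0.0
for step in range(accumulation_steps):
    start = step * micro_batch_size
    end = start + micro_batch_size
    accumulated_grad += mse_grad(w, xs[start:end], ys[start:end]) / accumulation_steps

print(full_batch_grad, accumulated_grad)
```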

In [None]:
def train(train_loader, model, optimizer, epoch, params, scheduler):
    metric_monitor = MetricMonitor()
    model.train()
    stream = tqdm(enumerate(train_loader), total=len(train_loader))
    scaler = amp.GradScaler()   
    for idx, inputs in stream:

        ids = inputs['ids'].to(params['device'], dtype=torch.long)
        mask = inputs['mask'].to(params['device'], dtype=torch.long)
        ttis = inputs['token_type_ids'].to(params['device'], dtype=torch.long)
        target = inputs['targets'].to(params['device'], dtype=torch.float)

        # AMP with Gradient Scaling
        with amp.autocast(enabled=True):
            output = model(ids=ids, mask=mask, token_type_ids=ttis).view(-1)
            loss = get_criterion(output, target)
            loss = loss / params['gradient_accumulation_steps']

        accelerator.backward(scaler.scale(loss))

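        # Gradient Accumulation: update once every `gradient_accumulation_steps` batches
        # (and on the final batch, so leftover gradients are not discarded)
        if (idx + 1) % params['gradient_accumulation_steps'] == 0 or (idx + 1) == len(train_loader):
            scaler.step(optimizer)
            scaler.update()
            optimizer.zero_grad()
            scheduler.step()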

            
                
        rmse_score = use_rmse_score(output.detach().cpu().numpy(), target.detach().cpu().numpy())
        metric_monitor.update('RMSE', rmse_score)
        wandb.log({"Train RMSE":rmse_score})
        

        stream.set_description(
            "Epoch: {epoch}. Train.      {metric_monitor}".format(
                epoch=epoch,
                metric_monitor=metric_monitor)
        )

<p p style = "font-family: garamond; font-size:30px; font-style: normal;background-color: #f6f5f5; color :#6666ff; border-radius: 10px 10px; text-align:center">Validation Loop</p>

In [None]:
def validate(val_loader, model,epoch, params):
    metric_monitor = MetricMonitor()
    model.eval()
    stream = tqdm(enumerate(val_loader), total=len(val_loader))
    final_targets = []
    final_outputs = []
    with torch.no_grad():
        for idx, inputs in stream:
            ids = inputs['ids'].to(params['device'], dtype=torch.long)
            mask = inputs['mask'].to(params['device'], dtype=torch.long)
            ttis = inputs['token_type_ids'].to(params['device'], dtype=torch.long)
            target = inputs['targets'].to(params['device'], dtype=torch.float)
            
            output = model(ids=ids, mask=mask, token_type_ids=ttis).view(-1)
            loss = get_criterion(output, target)
            
            rmse_score = use_rmse_score(output.detach().cpu().numpy(), target.detach().cpu().numpy())
            
            metric_monitor.update('RMSE', rmse_score)
            wandb.log({"Valid RMSE":rmse_score})
            stream.set_description(
                "Epoch: {epoch}. Validation. {metric_monitor}".format(
                    epoch=epoch,
                    metric_monitor=metric_monitor)
            )
            
            targets = target.detach().cpu().numpy().tolist()
            outputs = output.detach().cpu().numpy().tolist()
            
            final_targets.extend(targets)
            final_outputs.extend(outputs)
    return final_outputs, final_targets

<p p style = "font-family: garamond; font-size:30px; font-style: normal;background-color: #f6f5f5; color :#6666ff; border-radius: 10px 10px; text-align:center">W&B Initialization for K-FOLD CV</p>

<p style = "font-family: garamond; font-size: 20px; font-style: normal; border-radius: 10px 10px; text-align:center">K-fold cross-validation gives a less biased estimate of model performance than a single train/validation split. The parameter k decides how many folds the dataset is divided into. Each fold serves as the validation set exactly once and is part of the training set in the other k-1 iterations, so every observation is used for both training and validation.<br><br>Stratified K-fold CV shuffles the dataset once and then splits it so that the ratio of observations in each class stays the same in every fold. The validation sets of different iterations never overlap. This approach is useful for imbalanced datasets. Because the target here is continuous, it is first cut into bins and the bin labels are used for stratification.</p>
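
<p style = "font-family: garamond; font-size: 20px; font-style: normal; border-radius: 10px 10px; text-align:center">The sketch below illustrates the idea with the standard library only. It is a simplified stand-in for <code>pd.cut</code> plus <code>StratifiedKFold</code>, not a reimplementation of either: the helper names, the random toy targets and the round-robin fold assignment are all assumptions made for this example. The number of bins follows the same rule as the notebook, floor(1 + log2(n)).</p>

```python
import math
import random
from collections import Counter


def sturges_bins(n):
    """Number of bins from the rule floor(1 + log2(n))."""
    return int(math.floor(1 + math.log2(n)))


def bin_targets(targets, n_bins):
    """Assign each continuous target to one of n_bins equal-width bins."""
    lo, hi = min(targets), max(targets)
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins, so clamp it into the last bin
    return [min(int((t - lo) / width), n_bins - 1) for t in targets]


def stratified_folds(bins, k):
    """Deal samples to folds round-robin after sorting by bin.

    Every fold then receives a near-equal share of each bin.
    """
    order = sorted(range(len(bins)), key=lambda i: bins[i])
    fold_of = [0] * len(bins)
    for position, sample_index in enumerate(order):
        fold_of[sample_index] = position % k
    return fold_of


random.seed(0)
targets = [random.gauss(-1.0, 1.0) for _ in range(200)]  # toy readability scores

n_bins = sturges_bins(len(targets))
bins = bin_targets(targets, n_bins)
fold_of = stratified_folds(bins, k=5)

# Each sample is in the validation set of exactly one fold
print(n_bins, Counter(fold_of))
```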


In [None]:
def create_folds(train_df):
    
    data = train_df.sample(frac=1).reset_index(drop=True)
    data = data[['excerpt', 'target']]
    kf = StratifiedKFold(n_splits=params['nfolds'])
    nb_bins = int(np.floor(1 + np.log2(len(data))))
    data.loc[:, 'bins'] = pd.cut(data['target'], bins=nb_bins, labels=False)
    
    return kf, data

In [None]:
best_rmse = np.inf
best_fold = None
best_epoch = None
best_model_name = None

kf, data = create_folds(train_df)

for fold, (train_idx, valid_idx) in enumerate(kf.split(X=data, y=data['bins'].values)):
    
    run = wandb.init(project='CommonLit', 
             config= {'competition': 'CommonLit-Readability', '_wandb_kernel':'tang'}, 
             group = 'SqueezeBert',
             job_type='train',
             name = f'Fold{fold}')
    
    print(f"{'='*36} Fold: {fold} {'='*36}")

    train_data = data.loc[train_idx]
    valid_data = data.loc[valid_idx]

    # Start each fold from the pretrained weights; reusing the model from the
    # previous fold would let it see this fold's validation excerpts during training
    model = SqueezeBert().to(params['device'])
    optimizer = MADGRAD(model.parameters(), lr=params['lr'], weight_decay=params['weight_decay'])

    train_loader, valid_loader = get_loader(train_data, valid_data)
    # The scheduler steps once per optimizer update, not once per batch
    updates_per_epoch = int(np.ceil(len(train_loader) / params['gradient_accumulation_steps']))
    nb_train_steps = updates_per_epoch * params['epochs']
    scheduler = get_scheduler(optimizer, nb_train_steps)
    model, optimizer, train_loader, valid_loader = accelerator.prepare(model, optimizer, train_loader, valid_loader)

    fold_best_rmse = np.inf
    for epoch in range(1, params['epochs'] + 1):

        train(train_loader, model, optimizer, epoch, params, scheduler)
        predictions, valid_targets = validate(valid_loader, model, epoch, params)
        rmse = round(use_rmse_score(valid_targets, predictions), 3)

        # Include the fold in the file name so folds do not overwrite each other
        model_name = f"{params['name']}_fold{fold}_{epoch}_epoch_{rmse}_rmse.pth"
        torch.save(model.state_dict(), model_name)

        fold_best_rmse = min(fold_best_rmse, rmse)
        if rmse < best_rmse:
            best_rmse = rmse
            best_fold = fold
            best_epoch = epoch
            best_model_name = model_name

    wandb.log({"Best RMSE": fold_best_rmse})
    print(f"Best RMSE in fold: {fold} was: {fold_best_rmse:.4f}")
    print(f"Final RMSE in fold: {fold} was: {rmse:.4f}")
    # Close this fold's run so the next wandb.init starts a clean one
    run.finish()

print(f"Best RMSE of {best_rmse:.4f} was found in fold: {best_fold} (epoch {best_epoch})")

<p p style = "font-family: garamond; font-size:30px; font-style: normal;background-color: #f6f5f5; color :#6666ff; border-radius: 10px 10px; text-align:center">Cross Validation Results</p>

<p style = "font-family: garamond; font-size: 25px; font-style: normal; border-radius: 10px 10px; text-align:center">We were able to achieve an RMSE of 0.21 on the 5th fold!<br><br> Weights & Biases provides an easy-to-use interface and tools to keep track of evaluation metrics such as training and validation RMSE, along with other information such as the best RMSE per fold and GPU usage.<br><br> Let's take a look at some of our training graphs.</p>

<p p style = "font-family: garamond; font-size:30px; font-style: normal;background-color: #f6f5f5; color :#6666ff; border-radius: 10px 10px; text-align:center"><a href = 'https://wandb.ai/tanishqgautam/CommonLit?workspace=user-tanishqgautam'>Check out the Weights and Biases Dashboard here $\rightarrow$ </a></p>

<p p style = "font-family: garamond; font-size:30px; font-style: normal;background-color: #f6f5f5; color :#6666ff; border-radius: 10px 10px; text-align:center">KFold Metrics Visualization</p>
<center><img src="https://i.imgur.com/zIrMSbF.png" width="2000" alt="Weights & Biases" /></center><br>

<p p style = "font-family: garamond; font-size:30px; font-style: normal;background-color: #f6f5f5; color :#6666ff; border-radius: 10px 10px; text-align:center">Best Fold RMSE</p>

<center><img src="https://i.imgur.com/zyoth0U.png" width="2000" alt="Weights & Biases" /></center><br>

<p p style = "font-family: garamond; font-size:30px; font-style: normal;background-color: #f6f5f5; color :#6666ff; border-radius: 10px 10px; text-align:center">GPU Utilization</p>
<center><img src="https://i.imgur.com/ZoudAs7.png" width="2000" alt="Weights & Biases" /></center><br>