# BERT Training script

In order to train a BERT model, we first need to generate positive and negative samples. To make things more realistic, we will first retrieve a top-10 results for each trainint query using Anserini, and then, randomly pick a few that are not relevant (2) as "negative sampling".

In [2]:
import os
from os.path import expanduser
home = expanduser("~")
os.environ["JAVA_HOME"] = f"{home}/.sdkman/candidates/java/11.0.7.hs-adpt"  #Set right JAVA version
data_home = "/ssd2/arthur/MsMarcoTREC/"
def path(x):
    return os.path.join(data_home, x)

try:
    import pyserini
except:
    !pip install pyserini==0.9.2.0 # install pyserini
try:
    import tqdm
except:
    !pip install tqdm # Good for progress bars!

In [2]:
import jnius_config
jnius_config.add_options('-Xmx16G') # Adjust to your machine. Probably less than 16G.
from pyserini.search import pysearch
import subprocess
from tqdm.auto import tqdm
import random
import pickle
import sys
import unicodedata
import string
import re
import os
from collections import defaultdict
import math

In [3]:
# Anserini uses this "SimpleSearcher" object for interfacing with the index.
index_path = path("lucene-index.msmarco-doc.pos+docvectors+rawdocs")
searcher = pysearch.SimpleSearcher(index_path)

## Extracting Anserini top-10
We will use pyserini to retrieve the top-10 results using BM25. It doesn't need to be perfect, so, we won't care about fine-tunning it. Default settings should be enough

The way this works is by:
1. submiting each query as a new search on Anserini, with the `SimpleSearcher.search()` method
2. For each query, find $neg_samples$ negative samples from the top-$k$ results from BM25.
3. Store these and the positive samples in a list

obs.: Potentinaly, it would be faster to use Anserini's `batch_search()` method, since it works in multiple threads. However, the lack of feedback (i.e. How may queries have been processed already) and higher memory footprint could cause issues.

### Loading all relevant docs from the qrels file

In [4]:
# Load the relevant query-document pairs
relevant_docs = defaultdict(lambda:[])
for file in [path("qrels/msmarco-doctrain-qrels.tsv"), path("qrels/msmarco-docdev-qrels.tsv")]:
    for line in open(file):
        query_id, _, doc_id, rel = line.split()
        assert rel == "1"
        relevant_docs[query_id].append(doc_id)                            

### Get the top-10 using BM25 and create a training set based on this
Some notes:

- If it finds the `.pkl` file created in the end of the loop, it won't re-compute everything.
- Each query is sanitized before being submitted to Anserini. (lines 22-32)
- The code will "batch" a number of queries to be submitted at once to Anserini, and will run these in parallel. This is much faster than one at a time, and more efficient than all of the queries at once.
- We store the end results in a pickle file, that is a list with the triples `query_id, doc_id, label`. 
- Should take about 1.5h to finish.
- Each element in the output list is: `[query_id, document_id, label]` where `label` is `1` for relevant and `0` for non-relevant
 
### **PAY ATTENTION TO YOUR MACHINE**
this notebook was ran at DeepIR, with 56 threads and 128GB of memory. Make sure to pick a fair number of threads, and a batchsize that fits confortably on memory. BE MINDFULL OF FAIR USAGE OF THE MACHINE. Check if someone else is using the machine, and chose a fair number of threads/batch size. In this configuration, 42 threads and batch_size 10000, this took about 6 minutes to finish. YMMV.

In [5]:
pattern = re.compile('([^\s\w]|_)+')

anserini_top_10 = defaultdict(lambda:[])
searcher.set_bm25_similarity(0.9, 0.4)
pairs_per_split = defaultdict(lambda: [])
threads = 42 # Number of Threads to use when retrieving
k = 10       # Number of documents to retrieve 
neg_samples = 2 # Number of negatives samples to use
batch_size = 10000 # Batch size for each retrieval step on Anserini

query_texts = dict()
for split in ["train", "dev"]:
    file_path = path(f"queries/msmarco-doc{split}-queries.tsv")
    run_search=True
    if os.path.isfile(f"{split}_triples.pkl"):
        print(f"Already found file {file_path}. Cowardly refusing to run this again. Will only load querytexts.")
        pairs_per_split[split] = pickle.load(open(path(f"{split}_triples.pkl"), 'rb'))
        run_search = False
    number_of_queries = int(subprocess.run(f"wc -l {file_path}".split(), capture_output=True).stdout.split()[0])
    number_of_batches = math.ceil(number_of_queries/batch_size)
    pbar = tqdm(total=number_of_batches, desc="Retrieval batches")
    queries = []
    query_ids = []
    for idx, line in enumerate(open(file_path, encoding="utf-8")):
        query_id, query = line.strip().split("\t")
        query_ids.append(query_id)
        query = unicodedata.normalize("NFKD", query) # Force queries into UTF-8
        query = pattern.sub(' ',query) # Remove non-ascii characters. It clears up most of the issues we may find on the query datasets
        query_texts[query_id] = query
        if run_search is False:
            continue
        queries.append(query)
        if len(queries) == batch_size or idx == number_of_queries-1:
            results = searcher.batch_search(queries, query_ids, k=k, threads=threads)
            pbar.update()
            for query, query_id in zip(queries, query_ids):
                retrieved_docs_ids = [hit.docid for hit in results[query_id]]
                relevant_docs_for_query = relevant_docs[query_id]
                retrieved_non_relevant_documents = set(retrieved_docs_ids).difference(set(relevant_docs_for_query))
                  
                if len(retrieved_non_relevant_documents) < 2:
                    print(f"query {query} has less than 2 retrieved docs.")
                    continue
                random_negative_samples = random.sample(retrieved_non_relevant_documents, neg_samples)
                pairs_per_split[split] += [(query_id, doc_id, 1) for doc_id in relevant_docs_for_query]
                pairs_per_split[split] += [(query_id, doc_id, 0) for doc_id in random_negative_samples]
            queries = []
            query_ids = []
    pickle.dump(pairs_per_split[split], open(path(f"{split}_triples.pkl"), 'wb'))
    pbar.close()



Already found file /ssd2/arthur/MsMarcoTREC/queries/msmarco-doctrain-queries.tsv. Cowardly refusing to run this again. Will only load querytexts.


HBox(children=(FloatProgress(value=0.0, description='Retrieval batches', max=37.0, style=ProgressStyle(descrip…


Already found file /ssd2/arthur/MsMarcoTREC/queries/msmarco-docdev-queries.tsv. Cowardly refusing to run this again. Will only load querytexts.


HBox(children=(FloatProgress(value=0.0, description='Retrieval batches', max=1.0, style=ProgressStyle(descript…




## Dataset Creation

This dataset is too big to fit in memory. Therefore, it's a good idea to leave it in disk, and retrieve as needed.

To do so, we will create three files: 
- `msmarco_samples.txt`: file with every sample already tokenized and in the right format to be used as input to BERT.
- `msmarco_offset.pkl`: pickle file with a dictionary with the file address for each of these samples. Will make it WAY faster to retrieve data from disk. 
- `msmarco_index.pkl`: A pickle file with a dictionary mapping each sample id (`queryid_docid`) to it's numbered position on the previous file. This will enable us to find a sample by index, and not only ID.

In [8]:
from torch.utils.data import Dataset
import torch

# This is our main Dataset class.
class MsMarcoDataset(Dataset):
    def __init__(self,
                 samples,
                 tokenizer,
                 searcher,
                 split,
                 tokenizer_batch=8000):
        '''Initialize a Dataset object. 
        Arguments:
            samples: A list of samples. Each sample should be a tuple with (query_id, doc_id, <label>), where label is optional
            tokenizer: A tokenizer object from Hugging Face's Tokenizer lib. (need to implement encode_batch())
            searcher: A PySerini Simple Searcher object. Should implement the .doc() method
            split: A strong indicating if we are in a train, dev or test dataset.
            tokenizer_batch: How many samples to be tokenized at once by the tokenizer object.
            The biggest bottleneck is the searcher, not the tokenizer.
        '''
        self.searcher = searcher
        self.split = split
        # If we already have the data pre-computed, we shouldn't need to re-compute it.
        self.split = split
        if (os.path.isfile(path(f"{split}_msmarco_samples.tsv"))
                and os.path.isfile(path(f"{split}_msmarco_offset.pkl"))
                and os.path.isfile(path(f"{split}_msmarco_index.pkl"))):
            print("Already found every meaningful file. Cowardly refusing to re-compute.")
            self.samples_offset_dict = pickle.load(open(path(f"{split}_msmarco_offset.pkl"), 'rb'))
            self.index_dict = pickle.load(open(path(f"{split}_msmarco_index.pkl"), 'rb'))
            return
        self.tokenizer = tokenizer
        print("Loading and tokenizing dataset...")
        self.samples_offset_dict = dict()
        self.index_dict = dict()

        self.samples_file = open(path(f"{split}_msmarco_samples.tsv"),'w',encoding="utf-8")
        self.processed_samples = 0
        query_batch = []
        doc_batch = []
        sample_ids_batch = []
        labels_batch = []
        number_of_batches = math.ceil(len(samples) // tokenizer_batch)
        # A progress bar to display how far we are.
        batch_pbar = tqdm(total=number_of_batches, desc="Tokenizer batches")
        for i, sample in enumerate(samples):
            if split=="train" or split == "dev":
                label = sample[2]
                labels_batch.append(label)
            query_batch.append(query_texts[sample[0]])
            doc_batch.append(self._get_document_content_from_id(sample[1]))
            sample_ids_batch.append(f"{sample[0]}_{sample[1]}")
            #If we hit the number of samples for this batch OR this is the last sample
            if len(query_batch) == tokenizer_batch or i == len(samples) - 1:
                self._tokenize_and_dump_batch(doc_batch, query_batch, labels_batch, sample_ids_batch)
                batch_pbar.update()
                query_batch = []
                doc_batch = []
                sample_ids_batch = []
                if split == "train" or split == "dev":
                    labels_batch = []
        batch_pbar.close()
        # Dump files in disk, so we don't need to go over it again.
        self.samples_file.close()
        pickle.dump(self.index_dict, open(path(f"{self.split}_msmarco_index.pkl"), 'wb'))
        pickle.dump(self.samples_offset_dict, open(path(f"{self.split}_msmarco_offset.pkl"), 'wb'))

    def _tokenize_and_dump_batch(self, doc_batch, query_batch, labels_batch,
                                 sample_ids_batch):
        '''tokenizes and dumps the samples in the current batch
        It also store the positions from the current file into the samples_offset_dict.
        '''
        # Use the tokenizer object
        tokens = self.tokenizer.encode_batch(list(zip(query_batch, doc_batch)))
        for idx, (sample_id, token) in enumerate(zip(sample_ids_batch, tokens)):
            #BERT supports up to 512 tokens. If we have more than that, we need to remove some tokens from the document
            if len(token.ids) >= 512:
                token_ids = token.ids[:511]
                token_ids.append(tokenizer.token_to_id("[SEP]"))
                segment_ids = token.type_ids[:512]
            # With less tokens, we need to "pad" the vectors up to 512.
            else:
                padding = [0] * (512 - len(token.ids))
                token_ids = token.ids + padding
                segment_ids = token.type_ids + padding
            # How far in the file are we? This is where we need to go to find the documents later.
            file_location = self.samples_file.tell()
            # If we have labels
            if self.split=="train" or split == "dev":
                self.samples_file.write(f"{sample_id}\t{token_ids}\t{segment_ids}\t{labels_batch[idx]}\n")
            else:
                self.samples_file.write(f"{sample_id}\t{token_ids}\t{segment_ids}\n")
            self.samples_offset_dict[sample_id] = file_location
            self.index_dict[self.processed_samples] = sample_id
            self.processed_samples += 1

    def _get_document_content_from_id(self, doc_id):
        '''Get the raw text value from the doc_id
        There is probably an easier way to do that, but this works.
        '''
        doc_text = self.searcher.doc(doc_id).lucene_document().getField("raw").stringValue()
        return doc_text[7:-8]

    def __getitem__(self, idx):
        '''Returns a sample with index idx
        DistilBERT does not take into account segment_ids. (indicator if the token comes from the query or the document) 
        However, for the sake of completness, we are including it here, together with the attention mask
        position_ids, with the positional encoder, is not needed. It's created for you inside the model.
        '''
        if isinstance(idx, int):
            idx = self.index_dict[idx]
        with open(path(f"{self.split}_msmarco_samples.tsv"), 'r', encoding="utf-8") as inf:
            inf.seek(self.samples_offset_dict[idx])
            line = inf.readline().split("\t")
            try:
                sample_id = line[0]
                input_ids = eval(line[1])
                token_type_ids = eval(line[2])
                input_mask = [1] * 512
            except:
                print(line, idx)
                raise IndexError
            # If it's a training dataset, we also have a label tag.
            if split=="train" or split == "dev":
                label = int(line[3])
                return (torch.tensor(input_ids, dtype=torch.long),
                        torch.tensor(input_mask, dtype=torch.long),
                        torch.tensor(token_type_ids, dtype=torch.long),
                        torch.tensor([label], dtype=torch.long))
            return (torch.tensor(input_ids, dtype=torch.long),
                    torch.tensor(input_mask, dtype=torch.long),
                    torch.tensor(token_type_ids, dtype=torch.long))
    def __len__(self):
        return len(self.samples_offset_dict)

## Training script
For actually training our model, we need to do the following:
1. Create a DataLoader object for train and one for dev. This will help with batching and such.
2. Load a BERT pre-trained model. For this example, we are using DistilBert. Because it's smaller and faster.
    - For ease of use, we will use the `DistilBertForSequenceClassification` model. It's ready for computing whether two senteces are related.
    - Also note that, for this model, weirdly enough, $1$ is NOT RELEVANT and  $0$ is RELEVANT
    - Alternativelly, we can use the default `DistilBert` and extract the `[CLS]` token embedding and feed it to a shallow NN using PyTorch or even Sklearn and a linear regression.
    - There is an extra class here that does that, if you want to follow that path.
3. Create a training loop that for every $X$ samples will check the results on the dev dataset.
4. Store breakpoints every $N$ steps

In [9]:
from transformers import DistilBertForSequenceClassification
from torch.utils.data import DataLoader
from tokenizers import BertWordPieceTokenizer

tokenizer = BertWordPieceTokenizer("/ssd2/arthur/bert-axioms/tokenizer/bert-base-uncased-vocab.txt", lowercase=True)

In [10]:
train_dataset = MsMarcoDataset(pairs_per_split["train"], tokenizer, searcher, split = "train")
dev_dataset = MsMarcoDataset(pairs_per_split["dev"], tokenizer, searcher, split = "dev")

Already found every meaningful file. Cowardly refusing to re-compute.
Already found every meaningful file. Cowardly refusing to re-compute.


We NEED to use GPUs for this. If you don't have access to some GPUs you can try Google Colab OR if you are a MSc from WIS, get in touch.

In [35]:
import torch
from torch import nn
from transformers import DistilBertModel, BertModel

class BertRelevanceRanker(nn.Module):
    def __init__(self, model="distilbert-base-uncased"):
        """Creates an instance of Bert Relevance Ranker. 
        It feeds two senteces into a pre-trained BERT model, extracts the [CLS] token and feeds it into a one-layer FFNN"""
        super().__init__()
        self.distil = False
        self.loss_fct = nn.CrossEntropyLoss()
        if "distil" in model:
            self.distil = True
            self.bert = DistilBertModel.from_pretrained(model)
        else:
            self.bert = BertModel.from_pretrained(model)
        self.config = self.bert.config
        self.linear1 = nn.Linear(self.bert.config.dim, self.bert.config.dim)
        self.linear2 = nn.Linear(self.bert.config.dim, 2)
        self.dropout = nn.Dropout(0.1)
    
    def forward(self, input_ids, attention_mask, token_type_ids=None, labels=None):
        if not self.distil and token_type_ids is None:
            raise ValueError("Model is not distilBERT and it did not received token_type_ids!")
        if not self.distil:
            outputs = self.bert(input_ids, attention_mask, token_type_ids)
        else:
            pooled_output = self.bert(input_ids, attention_mask)[0][:, 0]
        pooled_output = self.linear1(pooled_output)
        pooled_output = nn.ReLU()(pooled_output)
        pooled_output = self.dropout(pooled_output)
        logits = self.linear2(pooled_output)
        outputs = (logits,)
        if labels is not None:
            loss = self.loss_fct(logits.view(-1, 2), labels.view(-1))
            outputs = (loss, ) + outputs
        return outputs
    

In [42]:
from transformers import AdamW, get_linear_schedule_with_warmup

# With these configurations, on DeepIR, it takes ~3h/batch to train, with ~2batches/s
GPUS_TO_USE = [2,4,5,6,7] # If you have multiple GPUs, pick the ones you want to use.
number_of_cpus = 24 # Number of CPUS to use when loading your dataset.
n_epochs = 2 # How may passes over the whole dataset to complete
weight_decay = 0.0 # Some papers define a weight decay, meaning, the weights on some layers will decay slower overtime. By default, we don't do this.
lr = 0.00005 # Learning rate for the fine-tunning.
warmup_proportion = 0.1 # Percentage of training steps to perform before we start to decrease the learning rate.
steps_to_print = 1000 # How many steps to wait before printing loss
steps_to_eval = 2000 # How many steps to wait before running an eval step

# This is our base model
try:
    del model
    torch.cuda.empty_cache() # Make sure we have a clean slate. Usefull in a Notebook.
except:
    pass

# model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased")
model = BertRelevanceRanker()

if torch.cuda.is_available():
    # Asssign the model to GPUs, specifying to use Data parallelism.
    model = torch.nn.DataParallel(model, device_ids=GPUS_TO_USE)
    # The main model should be on the first GPU
    device = torch.device(f"cuda:{GPUS_TO_USE[0]}") 
    model.to(device)
    
    # For a 1080Ti, 16 samples fit on a GPU confortably for a DistilBert model. A bert-base, not more than 8. So, the train batch size will be 16*the number of GPUS
    train_batch_size = len(GPUS_TO_USE) * 16
    print(f"running on {len(GPUS_TO_USE)} GPUS, on {train_batch_size}-sized batches")
else:
    print("Are you sure about it? We will try to run this in CPU, but it's a BAD idea...")
    device = torch.device("cpu")
    train_batch_size = 16
    model.to(device)

# A DataLoader is a nice device for generating batches for you easily.
# It receives any object that implementes __getitem__(self, idx) and __len__(self)

train_data_loader = DataLoader(train_dataset, batch_size=train_batch_size, num_workers=number_of_cpus,shuffle=True)
dev_data_loader = DataLoader(dev_dataset, batch_size=32, num_workers=number_of_cpus,shuffle=True)

#how many optimization steps to run, given the NUMBER OF BATCHES. (The len of the dataloader is the number of batches).
num_train_optimization_steps = len(train_data_loader) * n_epochs

#which layers will not have a linear weigth decay when training
no_decay = ['bias', 'LayerNorm.weight']

#all parameters to be optimized by our fine tunning.
optimizer_grouped_parameters = [
    {'params': [p for n, p in model.named_parameters() if not any( nd in n for nd in no_decay)], 'weight_decay': weight_decay},
    {'params': [p for n, p in model.named_parameters() if any( nd in n for nd in no_decay)], 'weight_decay': 0.0}
]

#We use the AdamW optmizer here.
optimizer = AdamW(optimizer_grouped_parameters, lr=lr, eps=1e-8) 

# How many steps to wait before we start to decrease the learning rate
warmup_steps = num_train_optimization_steps * warmup_proportion 
# A scheduler to take care of the above.
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=warmup_steps, num_training_steps=num_train_optimization_steps)
print(f"*********Total optmization steps: {num_train_optimization_steps}*********")

running on 5 GPUS, on 80-sized batches
*********Total optmization steps: 42042*********


In [43]:
import warnings
import numpy as np
import datetime

try:
    from sklearn.metrics import f1_score, average_precision_score, accuracy_score, roc_auc_score
except:
    !pip install sklearn
    from sklearn.metrics import f1_score, average_precision_score, accuracy_score, roc_auc_score

global_step = 0 # Number of steps performed so far
tr_loss = 0.0 # Training loss
model.zero_grad() # Initialize gradients to 0

for _ in tqdm(range(n_epochs), desc="Epochs"):
    for step, batch in tqdm(enumerate(train_data_loader), desc="Batches", total=len(train_data_loader)):
        model.train()
        # get the batch inpute
        inputs = {
            'input_ids': batch[0].to(device),
            'attention_mask': batch[1].to(device),
            'labels': batch[3].to(device)
        }
        # Run through the network.
        
        with warnings.catch_warnings():
            # There is a very annoying warning here when we are using multiple GPUS,
            # As described here: https://github.com/huggingface/transformers/issues/852.
            # We can safely ignore this.
            warnings.simplefilter("ignore")
            outputs = model(**inputs)
        loss = outputs[0]

        loss = loss.sum()/len(model.device_ids) # Average over all GPUS.
        # Clipping gradients. Avoid gradient explosion, if the gradient is too large.
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        # Backward pass on the network
        loss.backward()
        tr_loss += loss.item()
        # Run the optimizer with the gradients
        optimizer.step()
        scheduler.step()
        model.zero_grad()
        if step % steps_to_print == 0:
            # Logits is the actual output from the network. 
            # This is the probability of being relevant or not.
            # You can check its shape (Should be a vector sized 2) with logits.shape()
            logits = outputs[1]
            # Send the logits to the CPU and in numpy form. Easier to check what is going on.
            preds = logits.detach().cpu().numpy()
            
            # Bring the labels to CPU too.
            tqdm.write(f"Training loss: {loss.item()} Learning Rate: {scheduler.get_last_lr()[0]}")
        global_step += 1
        
        # Run an evluation step over the eval dataset. Let's see how we are going.
        if global_step%steps_to_eval == 0:
            eval_loss = 0.0
            nb_eval_steps = 0
            preds = None
            out_label_ids = None
            for batch in tqdm(dev_data_loader, desc="Dev batch"):
                model.eval()
                with torch.no_grad(): # Avoid upgrading gradients here
                    inputs = {'input_ids': batch[0].to(device),
                      'attention_mask': batch[1].to(device),
                      'labels': batch[3].to(device)}
                    with warnings.catch_warnings():
                        warnings.simplefilter("ignore")
                        outputs = model(**inputs)
                    tmp_eval_loss, logits = outputs[:2] # Logits is the actual output. Probabilities between 0 and 1.
                    eval_loss += tmp_eval_loss.mean().item()
                    nb_eval_steps += 1
                    # Concatenate all outputs to evaluate in the end.
                    if preds is None:
                        preds = logits.detach().cpu().numpy() # PRedictions into numpy mode
                        out_label_ids = inputs['labels'].detach().cpu().numpy().flatten() # Labels assigned by model
                    else:
                        batch_predictions = logits.detach().cpu().numpy()
                        preds = np.append(preds, batch_predictions, axis=0)
                        out_label_ids = np.append(out_label_ids, inputs['labels'].detach().cpu().numpy().flatten(), axis=0)
                eval_loss = eval_loss / nb_eval_steps
            results = {}
            results["ROC Dev"] = roc_auc_score(out_label_ids, preds[:, 1])
            preds = np.argmax(preds, axis=1)
            results["Acuracy Dev"] = accuracy_score(out_label_ids, preds)
            results["F1 Dev"] = f1_score(out_label_ids, preds)
            results["AP Dev"] = average_precision_score(out_label_ids, preds)
            tqdm.write("***** Eval results *****")
            for key in sorted(results.keys()):
                tqdm.write(f"  {key} = {str(results[key])}")
            output_dir = path(f"checkpoints/checkpoint-{global_step}")
            if not os.path.isdir(output_dir):
                os.makedirs(path(output_dir))
#             print(f"Saving model checkpoint to {output_dir}")
            model_to_save = model.module if hasattr(model, 'module') else model
            model_to_save.config.save_pretrained(output_dir)
            torch.save(model_to_save.state_dict(), output_dir+"/pytorch_model.bin")
            


# Save final model 
output_dir = path(f"models/distilBERT-{str(datetime.date.today())}")
if not os.path.isdir(output_dir):
    os.makedirs(path(output_dir))
model_to_save = model.module if hasattr(model, 'module') else model
model_to_save.config.save_pretrained(output_dir)
torch.save(model_to_save.state_dict(), output_dir+"/pytorch_model.bin")

HBox(children=(FloatProgress(value=0.0, description='Epochs', max=2.0, style=ProgressStyle(description_width='…

HBox(children=(FloatProgress(value=0.0, description='Batches', max=21021.0, style=ProgressStyle(description_wi…

Training loss: 0.7429962754249573 Learning Rate: 1.189286903572618e-08
Training loss: 0.6137410998344421 Learning Rate: 1.1904761904761907e-05


HBox(children=(FloatProgress(value=0.0, description='Dev batch', max=745.0, style=ProgressStyle(description_wi…


***** Eval results *****
  AP Dev = 0.5293512055623834
  Acuracy Dev = 0.7571878279118573
  F1 Dev = 0.6457657216337028
  ROC Dev = 0.8253530962920068
Training loss: 0.5174795389175415 Learning Rate: 2.3797630940488087e-05
Training loss: 0.3816486895084381 Learning Rate: 3.5690499976214264e-05


HBox(children=(FloatProgress(value=0.0, description='Dev batch', max=745.0, style=ProgressStyle(description_wi…


***** Eval results *****
  AP Dev = 0.5495730855000485
  Acuracy Dev = 0.778174186778594
  F1 Dev = 0.6443711728685821
  ROC Dev = 0.8438980489934637
Training loss: 0.4404618442058563 Learning Rate: 4.7583369011940445e-05
Training loss: 0.43097397685050964 Learning Rate: 4.894708466137037e-05


HBox(children=(FloatProgress(value=0.0, description='Dev batch', max=745.0, style=ProgressStyle(description_wi…


***** Eval results *****
  AP Dev = 0.5561074157707073
  Acuracy Dev = 0.784218258132214
  F1 Dev = 0.6384415219073072
  ROC Dev = 0.8563399891928818
Training loss: 0.369981974363327 Learning Rate: 4.762565476851191e-05
Training loss: 0.42483997344970703 Learning Rate: 4.630422487565345e-05


HBox(children=(FloatProgress(value=0.0, description='Dev batch', max=745.0, style=ProgressStyle(description_wi…


***** Eval results *****
  AP Dev = 0.57188751809242
  Acuracy Dev = 0.7889611752360965
  F1 Dev = 0.6810454199441766
  ROC Dev = 0.8639991786767734
Training loss: 0.3996042013168335 Learning Rate: 4.498279498279498e-05
Training loss: 0.41216525435447693 Learning Rate: 4.366136508993652e-05


HBox(children=(FloatProgress(value=0.0, description='Dev batch', max=745.0, style=ProgressStyle(description_wi…


***** Eval results *****
  AP Dev = 0.5806967078935027
  Acuracy Dev = 0.794501573976915
  F1 Dev = 0.6897731592953998
  ROC Dev = 0.8672993079445701
Training loss: 0.36422398686408997 Learning Rate: 4.233993519707805e-05
Training loss: 0.3106476366519928 Learning Rate: 4.101850530421959e-05


HBox(children=(FloatProgress(value=0.0, description='Dev batch', max=745.0, style=ProgressStyle(description_wi…


***** Eval results *****
  AP Dev = 0.5771497711912323
  Acuracy Dev = 0.791311647429171
  F1 Dev = 0.6897154268597104
  ROC Dev = 0.8676886167711565
Training loss: 0.3256611227989197 Learning Rate: 3.9697075411361125e-05
Training loss: 0.3615630567073822 Learning Rate: 3.837564551850266e-05


HBox(children=(FloatProgress(value=0.0, description='Dev batch', max=745.0, style=ProgressStyle(description_wi…


***** Eval results *****
  AP Dev = 0.5832131863608856
  Acuracy Dev = 0.7948793284365163
  F1 Dev = 0.6965915440491711
  ROC Dev = 0.8700944340962947
Training loss: 0.322876513004303 Learning Rate: 3.705421562564419e-05
Training loss: 0.4105748236179352 Learning Rate: 3.5732785732785736e-05


HBox(children=(FloatProgress(value=0.0, description='Dev batch', max=745.0, style=ProgressStyle(description_wi…


***** Eval results *****
  AP Dev = 0.5883950206287615
  Acuracy Dev = 0.7997061909758657
  F1 Dev = 0.6954301761552208
  ROC Dev = 0.8730471911409979
Training loss: 0.31920647621154785 Learning Rate: 3.4411355839927267e-05
Training loss: 0.3777822256088257 Learning Rate: 3.3089925947068804e-05


HBox(children=(FloatProgress(value=0.0, description='Dev batch', max=745.0, style=ProgressStyle(description_wi…


***** Eval results *****
  AP Dev = 0.5856603889092339
  Acuracy Dev = 0.8007555089192026
  F1 Dev = 0.675551910327387
  ROC Dev = 0.8757322753280555
Training loss: 0.49660611152648926 Learning Rate: 3.176849605421034e-05
Training loss: 0.4242243766784668 Learning Rate: 3.0447066161351874e-05


HBox(children=(FloatProgress(value=0.0, description='Dev batch', max=745.0, style=ProgressStyle(description_wi…


***** Eval results *****
  AP Dev = 0.597058873022513
  Acuracy Dev = 0.8061699895068206
  F1 Dev = 0.6968622817382172
  ROC Dev = 0.8781652721702214
Training loss: 0.4124586284160614 Learning Rate: 2.912563626849341e-05
Training loss: 0.27975550293922424 Learning Rate: 2.7804206375634945e-05



HBox(children=(FloatProgress(value=0.0, description='Batches', max=21021.0, style=ProgressStyle(description_wi…

Training loss: 0.26966866850852966 Learning Rate: 2.7776456347884916e-05


HBox(children=(FloatProgress(value=0.0, description='Dev batch', max=745.0, style=ProgressStyle(description_wi…


***** Eval results *****
  AP Dev = 0.5963447782572543
  Acuracy Dev = 0.8050786988457502
  F1 Dev = 0.7003097573567372
  ROC Dev = 0.879107783460562
Training loss: 0.2813105285167694 Learning Rate: 2.6455026455026456e-05
Training loss: 0.3830776512622833 Learning Rate: 2.513359656216799e-05


HBox(children=(FloatProgress(value=0.0, description='Dev batch', max=745.0, style=ProgressStyle(description_wi…


***** Eval results *****
  AP Dev = 0.5958801274717004
  Acuracy Dev = 0.8047429171038825
  F1 Dev = 0.7001804588811549
  ROC Dev = 0.8808074605233469
Training loss: 0.3111289143562317 Learning Rate: 2.3812166669309527e-05
Training loss: 0.38978663086891174 Learning Rate: 2.249073677645106e-05


HBox(children=(FloatProgress(value=0.0, description='Dev batch', max=745.0, style=ProgressStyle(description_wi…


***** Eval results *****
  AP Dev = 0.5990996255714716
  Acuracy Dev = 0.80700944386149
  F1 Dev = 0.7012345679012346
  ROC Dev = 0.8803928430754654
Training loss: 0.3295662999153137 Learning Rate: 2.1169306883592597e-05
Training loss: 0.26243922114372253 Learning Rate: 1.984787699073413e-05


HBox(children=(FloatProgress(value=0.0, description='Dev batch', max=745.0, style=ProgressStyle(description_wi…


***** Eval results *****
  AP Dev = 0.5991149488934595
  Acuracy Dev = 0.8084784889821616
  F1 Dev = 0.6884686283880658
  ROC Dev = 0.8823239998837508
Training loss: 0.29984936118125916 Learning Rate: 1.8526447097875668e-05
Training loss: 0.38034573197364807 Learning Rate: 1.7205017205017205e-05


HBox(children=(FloatProgress(value=0.0, description='Dev batch', max=745.0, style=ProgressStyle(description_wi…


***** Eval results *****
  AP Dev = 0.6085644484307055
  Acuracy Dev = 0.8130535152151102
  F1 Dev = 0.7064329027155286
  ROC Dev = 0.8832567761742633
Training loss: 0.3424564301967621 Learning Rate: 1.588358731215874e-05
Training loss: 0.33130455017089844 Learning Rate: 1.4562157419300276e-05


HBox(children=(FloatProgress(value=0.0, description='Dev batch', max=745.0, style=ProgressStyle(description_wi…


***** Eval results *****
  AP Dev = 0.6064592677699424
  Acuracy Dev = 0.8112906610703043
  F1 Dev = 0.7086195722618276
  ROC Dev = 0.8849555376079928
Training loss: 0.3085682988166809 Learning Rate: 1.3240727526441813e-05
Training loss: 0.45851755142211914 Learning Rate: 1.1919297633583346e-05


HBox(children=(FloatProgress(value=0.0, description='Dev batch', max=745.0, style=ProgressStyle(description_wi…


***** Eval results *****
  AP Dev = 0.6096129126893163
  Acuracy Dev = 0.8131794333683106
  F1 Dev = 0.7112927288058637
  ROC Dev = 0.8856695697207766
Training loss: 0.38201770186424255 Learning Rate: 1.0597867740724883e-05
Training loss: 0.29469871520996094 Learning Rate: 9.276437847866419e-06


HBox(children=(FloatProgress(value=0.0, description='Dev batch', max=745.0, style=ProgressStyle(description_wi…


***** Eval results *****
  AP Dev = 0.6080821099123832
  Acuracy Dev = 0.8105771248688353
  F1 Dev = 0.7191836226743824
  ROC Dev = 0.8863449947902688
Training loss: 0.28664761781692505 Learning Rate: 7.955007955007956e-06
Training loss: 0.25723132491111755 Learning Rate: 6.633578062149491e-06


HBox(children=(FloatProgress(value=0.0, description='Dev batch', max=745.0, style=ProgressStyle(description_wi…


***** Eval results *****
  AP Dev = 0.6087906985691338
  Acuracy Dev = 0.8128436516264428
  F1 Dev = 0.709454616537434
  ROC Dev = 0.8863522920763777
Training loss: 0.22093525528907776 Learning Rate: 5.312148169291026e-06
Training loss: 0.24985454976558685 Learning Rate: 3.9907182764325615e-06


HBox(children=(FloatProgress(value=0.0, description='Dev batch', max=745.0, style=ProgressStyle(description_wi…


***** Eval results *****
  AP Dev = 0.6079977362388271
  Acuracy Dev = 0.8115844700944386
  F1 Dev = 0.7138759640512462
  ROC Dev = 0.8870989261328032
Training loss: 0.24013452231884003 Learning Rate: 2.6692883835740977e-06
Training loss: 0.4052871763706207 Learning Rate: 1.3478584907156336e-06


HBox(children=(FloatProgress(value=0.0, description='Dev batch', max=745.0, style=ProgressStyle(description_wi…


***** Eval results *****
  AP Dev = 0.6083502009552072
  Acuracy Dev = 0.8121301154249738
  F1 Dev = 0.7121913580246914
  ROC Dev = 0.8872791108323864
Training loss: 0.2829817235469818 Learning Rate: 2.6428597857169283e-08


