<a href="https://colab.research.google.com/github/Amsterdam-Internships/Readability-Lexical-Simplification/blob/master/Combined_Loss.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install transformers


In [None]:
import pandas as pd
import random
import torch

In [None]:
normal = pd.read_csv("/content/drive/MyDrive/Thesis/notebooks/data/normal.aligned", sep = "\t", names=["subject","nr", "sentence"])
simple = pd.read_csv("/content/drive/MyDrive/Thesis/notebooks/data/simple.aligned", sep = "\t", names=["subject","nr", "sentence"])

In [None]:
normal[:100]

     subject             nr   sentence
0    Cherokee, Oklahoma  0    It is the county seat of Alfalfa County .
1    Cherokee, Oklahoma  0    Cherokee is a city in Alfalfa County , Oklahom...
2    Skateboard          5    Skateboard decks are usually between 28 and 33...
3    Skateboard          5    The underside of the deck can be printed with ...
4    Skateboard          6    This was created by two surfers ; Ben Whatson ...
...  ...                 ...  ...
95   Invisible ink       2    The ink is later made visible by different met...
96   Invisible ink       3    One can obtain toy invisible ink pens which ha...
97   Lizard              3    Some species of lizard also utilize bright col...
98   Lizard              4    The particular innovation in this respect is t...


In [None]:
normal_sents = normal['sentence'].tolist()
simple_sents = simple['sentence'].tolist()

In [None]:
normal_sents = normal_sents[:1000]
simple_sents = simple_sents[:1000]

In [None]:
normal_sents[:5]

['It is the county seat of Alfalfa County .',
 'Cherokee is a city in Alfalfa County , Oklahoma , United States .',
 'Skateboard decks are usually between 28 and 33 inches long .',
 'The underside of the deck can be printed with a design by the manufacturer , blank , or decorated by any other means .',
 'This was created by two surfers ; Ben Whatson and Jonny Drapper .']

In [None]:
simple_sents[:5]

['It is the county seat of Alfalfa County .',
 'Cherokee is a city of Oklahoma in the United States .',
 'Skateboard decks are normally between 28 and 33 inches long .',
 'The bottom of the deck can be printed with a design by the maker . Or it can be blank .',
 'The longboard was made by two surfers ; Ben Whatson and Jonny Drapper .']

In [None]:
from transformers import BertTokenizerFast

# Task goal: direction of simplification.
# Each aligned (normal, simple) pair is placed in a random order, and the label
# records which order was used. The label is later fed to BERT's
# next-sentence-prediction (NSP) head:
#   1 -> sentence_a is the normal sentence, sentence_b is its simplification
#   0 -> sentence_a is the simple sentence, sentence_b is the normal one
sentence_a = []
sentence_b = []
label = []

for normal_sent, simple_sent in zip(normal_sents, simple_sents):
  if random.random() > 0.5:
    sentence_a.append(normal_sent)
    sentence_b.append(simple_sent)
    label.append(1)
  else:
    sentence_a.append(simple_sent)
    sentence_b.append(normal_sent)
    label.append(0)

# Tokenize each pair jointly ([CLS] a [SEP] b [SEP]), padded or truncated to 250 tokens
tokenizer = BertTokenizerFast.from_pretrained('bert-large-uncased-whole-word-masking')
inputs = tokenizer(sentence_a, sentence_b, return_tensors='pt',
                   max_length=250, truncation=True, padding='max_length')
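
The pair order above is drawn from the global, unseeded `random` module, so the labels change from run to run. A minimal sketch of a reproducible variant follows. The helper name `make_direction_pairs` and its `seed` parameter are assumptions made for illustration and are not part of the notebook.

```python
import random


def make_direction_pairs(normal_sents, simple_sents, seed=0):
    """Return (sentence_a, sentence_b, label) lists.

    label 1 means sentence_a is the normal sentence and sentence_b the simple one;
    label 0 means the order is reversed.
    """
    rng = random.Random(seed)  # private generator, leaves the global random state alone
    sentence_a, sentence_b, label = [], [], []
    for normal_sent, simple_sent in zip(normal_sents, simple_sents):
        if rng.random() > 0.5:
            sentence_a.append(normal_sent)
            sentence_b.append(simple_sent)
            label.append(1)
        else:
            sentence_a.append(simple_sent)
            sentence_b.append(normal_sent)
            label.append(0)
    return sentence_a, sentence_b, label


# One aligned pair taken from the data preview earlier in the notebook
normal_demo = ["Skateboard decks are usually between 28 and 33 inches long ."]
simple_demo = ["Skateboard decks are normally between 28 and 33 inches long ."]
a, b, y = make_direction_pairs(normal_demo, simple_demo, seed=0)
print(y, a, b)
```

Calling the helper twice with the same seed returns the same ordering, which makes a training run easier to compare against a later one.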




In [None]:
# Pair-order labels for the NSP head, one per example: shape (num_pairs, 1)
inputs['next_sentence_label'] = torch.LongTensor([label]).T

In [None]:
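# MLM labels: a copy of the original token ids, taken before any masking
inputs['labels'] = inputs.input_ids.detach().clone()
# Pick roughly 15% of positions at random, never [CLS] (101), [SEP] (102) or [PAD] (0)
rand = torch.rand(inputs.input_ids.shape)
mask_arr = (rand < 0.15) * (inputs.input_ids != 101) * (inputs.input_ids != 102) * (inputs.input_ids != 0)

# Collect the selected positions for every sequence
selection = []
for i in range(inputs.input_ids.shape[0]):
  selection.append(torch.flatten(mask_arr[i].nonzero()).tolist())

# Overwrite the selected positions with [MASK] (103)
for i in range(inputs.input_ids.shape[0]):
  inputs.input_ids[i, selection[i]] = 103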
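
Because `labels` is a full copy of `input_ids`, the masked-LM loss in this notebook is computed over every position, padding included. A common alternative is to set the label to `-100` wherever no mask was applied, since that is the index PyTorch's cross-entropy loss ignores by default. The sketch below shows that selection logic in plain Python so it runs without any dependencies. The function name `mask_tokens` and the example ids are assumptions for illustration, not the notebook's code.

```python
import random

# Special-token ids of the uncased BERT vocabulary, as used in the cell above
CLS_ID, SEP_ID, PAD_ID, MASK_ID = 101, 102, 0, 103
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss


def mask_tokens(input_ids, mask_prob=0.15, seed=0):
    """Return (masked_ids, labels) for one sequence of token ids.

    Selected positions become [MASK] and keep their original id as the label;
    every other position keeps its id and gets IGNORE_INDEX as the label.
    """
    rng = random.Random(seed)
    masked_ids, labels = [], []
    for token_id in input_ids:
        is_special = token_id in (CLS_ID, SEP_ID, PAD_ID)
        if not is_special and rng.random() < mask_prob:
            masked_ids.append(MASK_ID)
            labels.append(token_id)
        else:
            masked_ids.append(token_id)
            labels.append(IGNORE_INDEX)
    return masked_ids, labels


# Arbitrary example ids: [CLS] x x x [SEP] x x [SEP] [PAD] [PAD]
example_ids = [101, 2009, 2003, 1996, 102, 2009, 2003, 102, 0, 0]
masked_ids, labels = mask_tokens(example_ids, mask_prob=0.5, seed=0)
print(masked_ids)
print(labels)
```

With this labelling the loss would only reflect the tokens the model has to recover, which generally gives lower and more interpretable loss values than scoring every position.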


In [None]:
from transformers import BertForPreTraining

In [None]:
class OurDataset(torch.utils.data.Dataset):
    """Wraps the tokenizer output so a DataLoader can index single examples."""
    def __init__(self, encodings):
        self.encodings = encodings
    def __getitem__(self, idx):
        # the values are already tensors, so copy them instead of re-wrapping with torch.tensor()
        return {key: val[idx].clone().detach() for key, val in self.encodings.items()}
    def __len__(self):
        return len(self.encodings.input_ids)

dataset = OurDataset(inputs)
loader = torch.utils.data.DataLoader(dataset, batch_size=8, shuffle=False)
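
With 1,000 sentence pairs and `batch_size=8`, the loader yields 125 batches per epoch, which is the total shown by the progress bar in the training log. The short calculation below makes that explicit and adds a rough epoch-time estimate. The per-step time of 58 seconds is an assumed round figure in line with the CPU timings in that log, not a value measured by this snippet.

```python
import math

num_pairs = 1000   # sentence pairs kept by the [:1000] slices
batch_size = 8     # batch_size passed to the DataLoader

# The last batch may be smaller, so round up
steps_per_epoch = math.ceil(num_pairs / batch_size)
print(steps_per_epoch)  # 125

seconds_per_step = 58  # assumed rough CPU time per step
est_hours = round(steps_per_epoch * seconds_per_step / 3600, 1)
print(est_hours)  # 2.0
```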

In [None]:
import torch.optim as optim

In [None]:
model = BertForPreTraining.from_pretrained('bert-large-uncased-whole-word-masking')


Some weights of BertForPreTraining were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['cls.predictions.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
device = torch.device('cpu')
model.to(device)

BertForPreTraining(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine

In [None]:
model.train()


In [None]:
from transformers import AdamW

optim = AdamW(model.parameters(), lr=5e-5)



In [None]:
from tqdm import tqdm  # for our progress bar

epochs = 2

for epoch in range(epochs):
    # setup loop with TQDM and dataloader
    loop = tqdm(loader, leave=False)
    for batch in loop:
        print("batch")
        # reset the gradients accumulated in the previous step
        optim.zero_grad()
        # pull all tensor batches required for training
        input_ids = batch['input_ids'].to(device)
        print("after input to device")
        token_type_ids = batch['token_type_ids'].to(device)
        print("after token_type_ids to device")
        attention_mask = batch['attention_mask'].to(device)
        print("after attention_mask to device")
        
        next_sentence_label = batch['next_sentence_label'].to(device)
        print("after NSL to device")
        
        labels = batch['labels'].to(device)
        print("after labels to device")
        
        # forward pass; with both label tensors supplied, the model returns a combined MLM + NSP loss
        outputs = model(input_ids, attention_mask=attention_mask,
                        token_type_ids=token_type_ids,
                        next_sentence_label=next_sentence_label,
                        labels=labels)
        # extract loss
        loss = outputs.loss
        # backpropagate: compute the gradient of the loss for every trainable parameter
        loss.backward()
        # update parameters
        optim.step()
        # print relevant info to progress bar
        loop.set_description(f'Epoch {epoch}')
        loop.set_postfix(loss=loss.item())
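
The `outputs.loss` value used in the loop above is a single scalar. When both `labels` and `next_sentence_label` are passed, the Hugging Face documentation for `BertForPreTraining` describes it as the sum of the masked-LM loss and the next-sentence classification loss. The sketch below reproduces that arithmetic on made-up logits in plain Python. The numbers, the four-word vocabulary, and the helper `cross_entropy` are assumptions for illustration only.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy of one example: log-sum-exp of the logits minus the target logit."""
    top = max(logits)  # subtract the maximum for numerical stability
    log_sum_exp = top + math.log(sum(math.exp(x - top) for x in logits))
    return log_sum_exp - logits[target]


# Toy MLM head: two masked positions scored over a 4-word vocabulary
mlm_logits = [[2.0, 0.5, 0.1, -1.0], [0.0, 0.0, 3.0, 0.0]]
mlm_targets = [0, 2]
mlm_loss = sum(cross_entropy(l, t) for l, t in zip(mlm_logits, mlm_targets)) / len(mlm_targets)

# Toy NSP head: one pair scored over the two order classes
nsp_logits = [0.3, 1.2]
nsp_target = 1
nsp_loss = cross_entropy(nsp_logits, nsp_target)

# Combined loss: the two terms are simply added
total_loss = mlm_loss + nsp_loss
print(mlm_loss, nsp_loss, total_loss)
```

Since the two terms are added without weights, the masked-LM term tends to dominate early in training, when its value is much larger than the two-class NSP term.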


batch
after input to device
after token_type_ids to device
after attention_mask to device
after NSL to device
after labels to device


Epoch 0:   1%|          | 1/125 [01:08<2:21:51, 68.64s/it, loss=29.7]

batch
after input to device
after token_type_ids to device
after attention_mask to device
after NSL to device
after labels to device


Epoch 0:   2%|▏         | 2/125 [02:11<2:13:22, 65.06s/it, loss=22.3]

batch
after input to device
after token_type_ids to device
after attention_mask to device
after NSL to device
after labels to device


Epoch 0:   2%|▏         | 3/125 [03:10<2:06:53, 62.41s/it, loss=19.6]

batch
after input to device
after token_type_ids to device
after attention_mask to device
after NSL to device
after labels to device


Epoch 0:   3%|▎         | 4/125 [04:08<2:02:35, 60.79s/it, loss=15.8]

batch
after input to device
after token_type_ids to device
after attention_mask to device
after NSL to device
after labels to device


Epoch 0:   4%|▍         | 5/125 [05:05<1:58:57, 59.48s/it, loss=11.5]

batch
after input to device
after token_type_ids to device
after attention_mask to device
after NSL to device
after labels to device


Epoch 0:   5%|▍         | 6/125 [06:03<1:56:28, 58.72s/it, loss=10.6]

batch
after input to device
after token_type_ids to device
after attention_mask to device
after NSL to device
after labels to device


Epoch 0:   6%|▌         | 7/125 [07:00<1:54:31, 58.23s/it, loss=10.1]

batch
after input to device
after token_type_ids to device
after attention_mask to device
after NSL to device
after labels to device


Epoch 0:   6%|▋         | 8/125 [07:57<1:53:00, 57.95s/it, loss=8.72]

batch
after input to device
after token_type_ids to device
after attention_mask to device
after NSL to device
after labels to device


Epoch 0:   7%|▋         | 9/125 [08:54<1:51:31, 57.68s/it, loss=8.61]

batch
after input to device
after token_type_ids to device
after attention_mask to device
after NSL to device
after labels to device


Epoch 0:   8%|▊         | 10/125 [09:51<1:50:08, 57.46s/it, loss=9.22]

batch
after input to device
after token_type_ids to device
after attention_mask to device
after NSL to device
after labels to device


Epoch 0:   9%|▉         | 11/125 [10:48<1:48:51, 57.29s/it, loss=8.02]

batch
after input to device
after token_type_ids to device
after attention_mask to device
after NSL to device
after labels to device


Epoch 0:  10%|▉         | 12/125 [11:51<1:50:47, 58.83s/it, loss=7.05]

batch
after input to device
after token_type_ids to device
after attention_mask to device
after NSL to device
after labels to device


Epoch 0:  10%|█         | 13/125 [12:52<1:51:21, 59.65s/it, loss=6.94]

batch
after input to device
after token_type_ids to device
after attention_mask to device
after NSL to device
after labels to device


Epoch 0:  11%|█         | 14/125 [13:50<1:49:37, 59.25s/it, loss=6.63]

batch
after input to device
after token_type_ids to device
after attention_mask to device
after NSL to device
after labels to device


Epoch 0:  12%|█▏        | 15/125 [14:48<1:47:39, 58.72s/it, loss=6.41]

batch
after input to device
after token_type_ids to device
after attention_mask to device
after NSL to device
after labels to device


Epoch 0:  13%|█▎        | 16/125 [15:45<1:45:54, 58.29s/it, loss=5.91]

batch
after input to device
after token_type_ids to device
after attention_mask to device
after NSL to device
after labels to device


Epoch 0:  14%|█▎        | 17/125 [16:43<1:44:36, 58.12s/it, loss=6.19]

batch
after input to device
after token_type_ids to device
after attention_mask to device
after NSL to device
after labels to device


Epoch 0:  14%|█▍        | 18/125 [17:43<1:44:29, 58.59s/it, loss=5.37]

batch
after input to device
after token_type_ids to device
after attention_mask to device
after NSL to device
after labels to device


Epoch 0:  15%|█▌        | 19/125 [18:41<1:43:27, 58.56s/it, loss=5]

batch
after input to device
after token_type_ids to device
after attention_mask to device
after NSL to device
after labels to device


Epoch 0:  16%|█▌        | 20/125 [19:40<1:42:28, 58.55s/it, loss=4.62]

batch
after input to device
after token_type_ids to device
after attention_mask to device
after NSL to device
after labels to device


Epoch 0:  17%|█▋        | 21/125 [20:40<1:42:37, 59.20s/it, loss=4.29]

batch
after input to device
after token_type_ids to device
after attention_mask to device
after NSL to device
after labels to device


Epoch 0:  18%|█▊        | 22/125 [21:38<1:40:44, 58.68s/it, loss=3.6]

batch
after input to device
after token_type_ids to device
after attention_mask to device
after NSL to device
after labels to device


Epoch 0:  18%|█▊        | 23/125 [22:45<1:44:10, 61.28s/it, loss=3.21]

batch
after input to device
after token_type_ids to device
after attention_mask to device
after NSL to device
after labels to device


Epoch 0:  19%|█▉        | 24/125 [23:43<1:41:16, 60.17s/it, loss=2.76]

batch
after input to device
after token_type_ids to device
after attention_mask to device
after NSL to device
after labels to device


Epoch 0:  20%|██        | 25/125 [24:43<1:40:30, 60.30s/it, loss=2.5]

batch
after input to device
after token_type_ids to device
after attention_mask to device
after NSL to device
after labels to device


Epoch 0:  21%|██        | 26/125 [25:42<1:38:52, 59.92s/it, loss=2.19]

batch
after input to device
after token_type_ids to device
after attention_mask to device
after NSL to device
after labels to device


Epoch 0:  22%|██▏       | 27/125 [26:39<1:36:07, 58.85s/it, loss=2.27]

batch
after input to device
after token_type_ids to device
after attention_mask to device
after NSL to device
after labels to device


Epoch 0:  22%|██▏       | 28/125 [27:36<1:34:10, 58.25s/it, loss=1.85]

batch
after input to device
after token_type_ids to device
after attention_mask to device
after NSL to device
after labels to device


Epoch 0:  23%|██▎       | 29/125 [28:33<1:32:49, 58.02s/it, loss=1.77]

batch
after input to device
after token_type_ids to device
after attention_mask to device
after NSL to device
after labels to device


Epoch 0:  24%|██▍       | 30/125 [29:32<1:32:12, 58.24s/it, loss=1.61]

batch
after input to device
after token_type_ids to device
after attention_mask to device
after NSL to device
after labels to device


Epoch 0:  25%|██▍       | 31/125 [30:29<1:30:49, 57.97s/it, loss=1.76]

batch
after input to device
after token_type_ids to device
after attention_mask to device
after NSL to device
after labels to device


Epoch 0:  26%|██▌       | 32/125 [31:26<1:29:27, 57.71s/it, loss=1.76]

batch
after input to device
after token_type_ids to device
after attention_mask to device
after NSL to device
after labels to device


Epoch 0:  26%|██▋       | 33/125 [32:23<1:27:55, 57.35s/it, loss=1.4]

batch
after input to device
after token_type_ids to device
after attention_mask to device
after NSL to device
after labels to device


Epoch 0:  27%|██▋       | 34/125 [33:20<1:26:46, 57.22s/it, loss=1.35]

batch
after input to device
after token_type_ids to device
after attention_mask to device
after NSL to device
after labels to device


Epoch 0:  28%|██▊       | 35/125 [34:16<1:25:30, 57.00s/it, loss=1.25]

batch
after input to device
after token_type_ids to device
after attention_mask to device
after NSL to device
after labels to device


Epoch 0:  29%|██▉       | 36/125 [35:22<1:28:16, 59.51s/it, loss=1.23]

batch
after input to device
after token_type_ids to device
after attention_mask to device
after NSL to device
after labels to device


Epoch 0:  30%|██▉       | 37/125 [36:18<1:25:55, 58.59s/it, loss=1.07]

batch
after input to device
after token_type_ids to device
after attention_mask to device
after NSL to device
after labels to device


Epoch 0:  30%|███       | 38/125 [37:15<1:24:11, 58.06s/it, loss=1.13]

batch
after input to device
after token_type_ids to device
after attention_mask to device
after NSL to device
after labels to device


Epoch 0:  31%|███       | 39/125 [38:14<1:23:45, 58.43s/it, loss=1.06]

batch
after input to device
after token_type_ids to device
after attention_mask to device
after NSL to device
after labels to device


Epoch 0:  32%|███▏      | 40/125 [39:14<1:23:16, 58.78s/it, loss=1.04]

batch
after input to device
after token_type_ids to device
after attention_mask to device
after NSL to device
after labels to device


Epoch 0:  33%|███▎      | 41/125 [40:11<1:21:39, 58.33s/it, loss=0.911]

batch
after input to device
after token_type_ids to device
after attention_mask to device
after NSL to device
after labels to device


Epoch 0:  34%|███▎      | 42/125 [41:11<1:21:32, 58.95s/it, loss=0.86]

batch
after input to device
after token_type_ids to device
after attention_mask to device
after NSL to device
after labels to device


Epoch 0:  34%|███▍      | 43/125 [42:10<1:20:23, 58.82s/it, loss=1.23]

batch
after input to device
after token_type_ids to device
after attention_mask to device
after NSL to device
after labels to device


Epoch 0:  35%|███▌      | 44/125 [43:08<1:18:58, 58.49s/it, loss=0.888]

batch
after input to device
after token_type_ids to device
after attention_mask to device
after NSL to device
after labels to device


Epoch 0:  36%|███▌      | 45/125 [44:05<1:17:24, 58.06s/it, loss=1.36]

batch
after input to device
after token_type_ids to device
after attention_mask to device
after NSL to device
after labels to device


Epoch 0:  37%|███▋      | 46/125 [45:09<1:18:55, 59.94s/it, loss=1]

batch
after input to device
after token_type_ids to device
after attention_mask to device
after NSL to device
after labels to device


Epoch 0:  38%|███▊      | 47/125 [46:10<1:18:09, 60.13s/it, loss=1.1]

batch
after input to device
after token_type_ids to device
after attention_mask to device
after NSL to device
after labels to device


Epoch 0:  38%|███▊      | 48/125 [47:06<1:15:47, 59.05s/it, loss=0.898]

batch
after input to device
after token_type_ids to device
after attention_mask to device
after NSL to device
after labels to device


Epoch 0:  39%|███▉      | 49/125 [48:04<1:14:29, 58.81s/it, loss=0.83]

batch
after input to device
after token_type_ids to device
after attention_mask to device
after NSL to device
after labels to device


Epoch 0:  40%|████      | 50/125 [49:04<1:13:39, 58.92s/it, loss=0.819]

batch
after input to device
after token_type_ids to device
after attention_mask to device
after NSL to device
after labels to device


Epoch 0:  41%|████      | 51/125 [50:02<1:12:23, 58.70s/it, loss=0.865]

batch
after input to device
after token_type_ids to device
after attention_mask to device
after NSL to device
after labels to device


Epoch 0:  42%|████▏     | 52/125 [51:00<1:11:06, 58.44s/it, loss=1.11]

batch
after input to device
after token_type_ids to device
after attention_mask to device
after NSL to device
after labels to device


Epoch 0:  42%|████▏     | 53/125 [51:57<1:09:55, 58.27s/it, loss=0.92]

batch
after input to device
after token_type_ids to device
after attention_mask to device
after NSL to device
after labels to device


Epoch 0:  43%|████▎     | 54/125 [52:55<1:08:47, 58.13s/it, loss=0.889]

batch
after input to device
after token_type_ids to device
after attention_mask to device
after NSL to device
after labels to device


Epoch 0:  44%|████▍     | 55/125 [53:52<1:07:26, 57.81s/it, loss=0.96]

batch
after input to device
after token_type_ids to device
after attention_mask to device
after NSL to device
after labels to device


Epoch 0:  45%|████▍     | 56/125 [54:54<1:07:43, 58.89s/it, loss=0.824]

batch
after input to device
after token_type_ids to device
after attention_mask to device
after NSL to device
after labels to device


Epoch 0:  46%|████▌     | 57/125 [55:54<1:07:18, 59.39s/it, loss=0.793]

batch
after input to device
after token_type_ids to device
after attention_mask to device
after NSL to device
after labels to device


Epoch 0:  46%|████▋     | 58/125 [56:51<1:05:21, 58.53s/it, loss=0.769]

batch
after input to device
after token_type_ids to device
after attention_mask to device
after NSL to device
after labels to device


Epoch 0:  47%|████▋     | 59/125 [57:48<1:03:53, 58.08s/it, loss=0.89]

batch
after input to device
after token_type_ids to device
after attention_mask to device
after NSL to device
after labels to device


Epoch 0:  48%|████▊     | 60/125 [58:49<1:03:59, 59.07s/it, loss=1.1]

batch
after input to device
after token_type_ids to device
after attention_mask to device
after NSL to device
after labels to device


Epoch 0:  49%|████▉     | 61/125 [59:48<1:03:04, 59.13s/it, loss=0.722]

batch
after input to device
after token_type_ids to device
after attention_mask to device
after NSL to device
after labels to device


Epoch 0:  50%|████▉     | 62/125 [1:00:46<1:01:34, 58.64s/it, loss=0.87]

batch
after input to device
after token_type_ids to device
after attention_mask to device
after NSL to device
after labels to device


Epoch 0:  50%|█████     | 63/125 [1:01:45<1:00:48, 58.84s/it, loss=0.816]

batch
after input to device
after token_type_ids to device
after attention_mask to device
after NSL to device
after labels to device


Epoch 0:  51%|█████     | 64/125 [1:02:44<59:45, 58.78s/it, loss=0.818]  

batch
after input to device
after token_type_ids to device
after attention_mask to device
after NSL to device
after labels to device


Epoch 0:  52%|█████▏    | 65/125 [1:03:41<58:17, 58.29s/it, loss=0.806]

batch
after input to device
after token_type_ids to device
after attention_mask to device
after NSL to device
after labels to device


Epoch 0:  53%|█████▎    | 66/125 [1:04:40<57:31, 58.49s/it, loss=0.792]

batch
after input to device
after token_type_ids to device
after attention_mask to device
after NSL to device
after labels to device


Epoch 0:  54%|█████▎    | 67/125 [1:05:39<56:35, 58.54s/it, loss=0.69]

batch
after input to device
after token_type_ids to device
after attention_mask to device
after NSL to device
after labels to device


Epoch 0:  54%|█████▍    | 68/125 [1:06:37<55:30, 58.43s/it, loss=0.88]

batch
after input to device
after token_type_ids to device
after attention_mask to device
after NSL to device
after labels to device


Epoch 0:  55%|█████▌    | 69/125 [1:07:34<54:12, 58.08s/it, loss=0.944]

batch
after input to device
after token_type_ids to device
after attention_mask to device
after NSL to device
after labels to device


Epoch 0:  56%|█████▌    | 70/125 [1:08:38<54:42, 59.69s/it, loss=0.831]

batch
after input to device
after token_type_ids to device
after attention_mask to device
after NSL to device
after labels to device


Epoch 0:  57%|█████▋    | 71/125 [1:09:39<54:06, 60.11s/it, loss=0.673]

batch
after input to device
after token_type_ids to device
after attention_mask to device
after NSL to device
after labels to device


Epoch 0:  58%|█████▊    | 72/125 [1:10:35<52:13, 59.13s/it, loss=0.9]

batch
after input to device
after token_type_ids to device
after attention_mask to device
after NSL to device
after labels to device


Epoch 0:  58%|█████▊    | 73/125 [1:11:35<51:21, 59.27s/it, loss=0.948]

batch
after input to device
after token_type_ids to device
after attention_mask to device
after NSL to device
after labels to device


Epoch 0:  59%|█████▉    | 74/125 [1:12:34<50:12, 59.07s/it, loss=0.736]

batch
after input to device
after token_type_ids to device
after attention_mask to device
after NSL to device
after labels to device


Epoch 0:  60%|██████    | 75/125 [1:13:33<49:12, 59.06s/it, loss=0.817]

batch
after input to device
after token_type_ids to device
after attention_mask to device
after NSL to device
after labels to device


Epoch 0:  61%|██████    | 76/125 [1:14:31<47:58, 58.74s/it, loss=1.03]

batch
after input to device
after token_type_ids to device
after attention_mask to device
after NSL to device
after labels to device


Epoch 0:  62%|██████▏   | 77/125 [1:15:28<46:35, 58.25s/it, loss=0.772]

batch
after input to device
after token_type_ids to device
after attention_mask to device
after NSL to device
after labels to device


Epoch 0:  62%|██████▏   | 78/125 [1:16:24<45:13, 57.74s/it, loss=0.737]

batch
after input to device
after token_type_ids to device
after attention_mask to device
after NSL to device
after labels to device


Epoch 0:  63%|██████▎   | 79/125 [1:17:26<45:09, 58.90s/it, loss=0.698]

batch
after input to device
after token_type_ids to device
after attention_mask to device
after NSL to device
after labels to device


Epoch 0:  64%|██████▍   | 80/125 [1:18:23<43:50, 58.46s/it, loss=0.921]

batch
after input to device
after token_type_ids to device
after attention_mask to device
after NSL to device
after labels to device


Epoch 0:  65%|██████▍   | 81/125 [1:19:32<45:07, 61.54s/it, loss=0.818]

batch
after input to device
after token_type_ids to device
after attention_mask to device
after NSL to device
after labels to device


Epoch 0:  66%|██████▌   | 82/125 [1:20:32<43:38, 60.91s/it, loss=0.583]

batch
after input to device
after token_type_ids to device
after attention_mask to device
after NSL to device
after labels to device


Epoch 0:  66%|██████▋   | 83/125 [1:21:31<42:16, 60.40s/it, loss=0.877]

batch
after input to device
after token_type_ids to device
after attention_mask to device
after NSL to device
after labels to device


Epoch 0:  67%|██████▋   | 84/125 [1:22:30<40:57, 59.95s/it, loss=0.794]

batch
after input to device
after token_type_ids to device
after attention_mask to device
after NSL to device
after labels to device


Epoch 0:  68%|██████▊   | 85/125 [1:23:28<39:33, 59.34s/it, loss=0.88]

batch
after input to device
after token_type_ids to device
after attention_mask to device
after NSL to device
after labels to device


Epoch 0:  69%|██████▉   | 86/125 [1:24:24<38:05, 58.60s/it, loss=0.911]

batch
after input to device
after token_type_ids to device
after attention_mask to device
after NSL to device
after labels to device


Epoch 0:  70%|██████▉   | 87/125 [1:25:23<37:03, 58.52s/it, loss=0.983]

batch
after input to device
after token_type_ids to device
after attention_mask to device
after NSL to device
after labels to device


Epoch 0:  70%|███████   | 88/125 [1:26:21<36:00, 58.39s/it, loss=0.848]

batch
after input to device
after token_type_ids to device
after attention_mask to device
after NSL to device
after labels to device


Epoch 0:  71%|███████   | 89/125 [1:27:20<35:13, 58.71s/it, loss=0.996]

batch
after input to device
after token_type_ids to device
after attention_mask to device
after NSL to device
after labels to device


Epoch 0:  72%|███████▏  | 90/125 [1:28:20<34:28, 59.09s/it, loss=1.04]

batch
after input to device
after token_type_ids to device
after attention_mask to device
after NSL to device
after labels to device


Epoch 0:  73%|███████▎  | 91/125 [1:29:23<34:06, 60.19s/it, loss=0.766]

batch
after input to device
after token_type_ids to device
after attention_mask to device
after NSL to device
after labels to device


Epoch 0:  74%|███████▎  | 92/125 [1:30:24<33:10, 60.31s/it, loss=0.677]

batch
after input to device
after token_type_ids to device
after attention_mask to device
after NSL to device
after labels to device


Epoch 0:  74%|███████▍  | 93/125 [1:31:22<31:47, 59.62s/it, loss=0.791]

batch
after input to device
after token_type_ids to device
after attention_mask to device
after NSL to device
after labels to device


Epoch 0:  75%|███████▌  | 94/125 [1:32:19<30:30, 59.05s/it, loss=0.716]

batch
after input to device
after token_type_ids to device
after attention_mask to device
after NSL to device
after labels to device


Epoch 0:  76%|███████▌  | 95/125 [1:33:16<29:13, 58.46s/it, loss=0.781]

batch
after input to device
after token_type_ids to device
after attention_mask to device
after NSL to device
after labels to device


Epoch 0:  77%|███████▋  | 96/125 [1:34:14<28:06, 58.15s/it, loss=0.933]

batch
after input to device
after token_type_ids to device
after attention_mask to device
after NSL to device
after labels to device


Epoch 0:  78%|███████▊  | 97/125 [1:35:12<27:05, 58.04s/it, loss=0.848]

batch
after input to device
after token_type_ids to device
after attention_mask to device
after NSL to device
after labels to device


Epoch 0:  78%|███████▊  | 98/125 [1:36:09<26:03, 57.91s/it, loss=0.725]

batch
after input to device
after token_type_ids to device
after attention_mask to device
after NSL to device
after labels to device


Epoch 0:  79%|███████▉  | 99/125 [1:37:07<25:04, 57.87s/it, loss=0.705]

batch
after input to device
after token_type_ids to device
after attention_mask to device
after NSL to device
after labels to device


Epoch 0:  80%|████████  | 100/125 [1:38:05<24:03, 57.74s/it, loss=0.767]

batch
after input to device
after token_type_ids to device
after attention_mask to device
after NSL to device
after labels to device


Epoch 0:  81%|████████  | 101/125 [1:39:06<23:30, 58.78s/it, loss=0.818]

batch
after input to device
after token_type_ids to device
after attention_mask to device
after NSL to device
after labels to device


Epoch 0:  82%|████████▏ | 102/125 [1:40:05<22:36, 58.97s/it, loss=0.975]

batch
after input to device
after token_type_ids to device
after attention_mask to device
after NSL to device
after labels to device


Epoch 0:  82%|████████▏ | 103/125 [1:41:02<21:23, 58.34s/it, loss=0.888]

batch
after input to device
after token_type_ids to device
after attention_mask to device
after NSL to device
after labels to device


Epoch 0:  83%|████████▎ | 104/125 [1:41:59<20:16, 57.94s/it, loss=0.81]

batch
after input to device
after token_type_ids to device
after attention_mask to device
after NSL to device
after labels to device


Epoch 0:  84%|████████▍ | 105/125 [1:42:56<19:14, 57.72s/it, loss=0.706]

batch
after input to device
after token_type_ids to device
after attention_mask to device
after NSL to device
after labels to device


Epoch 0:  85%|████████▍ | 106/125 [1:43:54<18:15, 57.65s/it, loss=0.806]

batch
after input to device
after token_type_ids to device
after attention_mask to device
after NSL to device
after labels to device


Epoch 0:  86%|████████▌ | 107/125 [1:44:54<17:30, 58.34s/it, loss=0.973]

Epoch 0:  86%|████████▋ | 108/125 [1:45:53<16:36, 58.62s/it, loss=0.97]

Epoch 0:  87%|████████▋ | 109/125 [1:46:51<15:34, 58.38s/it, loss=0.852]

Epoch 0:  88%|████████▊ | 110/125 [1:47:49<14:37, 58.49s/it, loss=0.805]

Epoch 0:  89%|████████▉ | 111/125 [1:48:53<13:59, 59.94s/it, loss=0.897]

Epoch 0:  90%|████████▉ | 112/125 [1:49:54<13:03, 60.28s/it, loss=0.872]

Epoch 0:  90%|█████████ | 113/125 [1:50:53<11:58, 59.91s/it, loss=0.829]

Epoch 0:  91%|█████████ | 114/125 [1:51:52<10:55, 59.56s/it, loss=0.787]

Epoch 0:  92%|█████████▏| 115/125 [1:52:50<09:52, 59.26s/it, loss=0.591]

Epoch 0:  93%|█████████▎| 116/125 [1:53:49<08:50, 59.00s/it, loss=0.784]

Epoch 0:  94%|█████████▎| 117/125 [1:54:46<07:47, 58.45s/it, loss=0.808]

Epoch 0:  94%|█████████▍| 118/125 [1:55:43<06:47, 58.19s/it, loss=0.776]

Epoch 0:  95%|█████████▌| 119/125 [1:56:41<05:48, 58.09s/it, loss=0.827]

Epoch 0:  96%|█████████▌| 120/125 [1:57:39<04:49, 57.86s/it, loss=0.842]

Epoch 0:  97%|█████████▋| 121/125 [1:58:36<03:50, 57.75s/it, loss=0.768]

Epoch 0:  98%|█████████▊| 122/125 [1:59:45<03:02, 60.99s/it, loss=0.877]

Epoch 0:  98%|█████████▊| 123/125 [2:00:45<02:01, 60.74s/it, loss=0.713]

Epoch 0:  99%|█████████▉| 124/125 [2:01:44<01:00, 60.36s/it, loss=0.813]

  0%|          | 0/125 [00:00<?, ?it/s]

Epoch 1:   1%|          | 1/125 [00:59<2:02:44, 59.39s/it, loss=0.902]

Epoch 1:   2%|▏         | 2/125 [01:57<2:00:23, 58.73s/it, loss=0.761]

Epoch 1:   2%|▏         | 3/125 [02:56<1:59:48, 58.92s/it, loss=0.639]

Epoch 1:   3%|▎         | 4/125 [03:55<1:58:17, 58.65s/it, loss=0.795]

Epoch 1:   4%|▍         | 5/125 [04:52<1:56:33, 58.28s/it, loss=0.752]

Epoch 1:   5%|▍         | 6/125 [05:50<1:55:17, 58.13s/it, loss=0.864]

Epoch 1:   6%|▌         | 7/125 [06:54<1:58:21, 60.18s/it, loss=0.672]

Epoch 1:   6%|▋         | 8/125 [07:54<1:56:57, 59.98s/it, loss=0.707]

Epoch 1:   7%|▋         | 9/125 [08:51<1:54:22, 59.16s/it, loss=0.717]

Epoch 1:   8%|▊         | 10/125 [09:51<1:53:30, 59.22s/it, loss=0.897]

Epoch 1:   9%|▉         | 11/125 [10:51<1:53:13, 59.59s/it, loss=0.647]

Epoch 1:  10%|▉         | 12/125 [11:48<1:50:58, 58.92s/it, loss=0.733]

Epoch 1:  10%|█         | 13/125 [12:46<1:49:19, 58.57s/it, loss=0.671]

Epoch 1:  11%|█         | 14/125 [13:43<1:47:33, 58.14s/it, loss=0.626]

Epoch 1:  12%|█▏        | 15/125 [14:41<1:46:07, 57.88s/it, loss=0.852]

Epoch 1:  13%|█▎        | 16/125 [15:39<1:45:24, 58.03s/it, loss=0.561]

Epoch 1:  14%|█▎        | 17/125 [16:41<1:46:20, 59.08s/it, loss=0.962]

Epoch 1:  14%|█▍        | 18/125 [17:40<1:45:37, 59.23s/it, loss=0.793]

Epoch 1:  15%|█▌        | 19/125 [18:37<1:43:31, 58.60s/it, loss=0.845]

Epoch 1:  16%|█▌        | 20/125 [19:34<1:41:35, 58.05s/it, loss=0.909]

Epoch 1:  17%|█▋        | 21/125 [20:31<1:40:06, 57.75s/it, loss=0.848]

Epoch 1:  18%|█▊        | 22/125 [21:28<1:38:41, 57.49s/it, loss=0.587]

Epoch 1:  18%|█▊        | 23/125 [22:25<1:37:24, 57.30s/it, loss=0.65]

Epoch 1:  19%|█▉        | 24/125 [23:22<1:36:32, 57.35s/it, loss=0.747]

Epoch 1:  20%|██        | 25/125 [24:21<1:36:03, 57.64s/it, loss=0.781]

Epoch 1:  21%|██        | 26/125 [25:19<1:35:23, 57.81s/it, loss=0.725]

Epoch 1:  22%|██▏       | 27/125 [26:17<1:34:50, 58.06s/it, loss=0.729]

Epoch 1:  22%|██▏       | 28/125 [27:23<1:37:41, 60.43s/it, loss=0.546]

Epoch 1:  23%|██▎       | 29/125 [28:23<1:36:24, 60.25s/it, loss=0.756]

Epoch 1:  24%|██▍       | 30/125 [29:23<1:35:04, 60.05s/it, loss=0.655]

Epoch 1:  25%|██▍       | 31/125 [30:21<1:33:17, 59.55s/it, loss=1.12]

Epoch 1:  26%|██▌       | 32/125 [31:19<1:31:39, 59.13s/it, loss=0.82]

Epoch 1:  26%|██▋       | 33/125 [32:18<1:30:20, 58.92s/it, loss=0.765]

Epoch 1:  27%|██▋       | 34/125 [33:15<1:28:30, 58.35s/it, loss=0.671]

Epoch 1:  28%|██▊       | 35/125 [34:13<1:27:17, 58.19s/it, loss=0.559]

Epoch 1:  29%|██▉       | 36/125 [35:11<1:26:19, 58.20s/it, loss=0.737]

Epoch 1:  30%|██▉       | 37/125 [36:09<1:25:20, 58.19s/it, loss=0.769]

Epoch 1:  30%|███       | 38/125 [37:10<1:25:43, 59.13s/it, loss=0.626]

Epoch 1:  31%|███       | 39/125 [38:10<1:24:56, 59.26s/it, loss=0.589]

Epoch 1:  32%|███▏      | 40/125 [39:11<1:24:34, 59.70s/it, loss=0.464]

Epoch 1:  33%|███▎      | 41/125 [40:10<1:23:32, 59.68s/it, loss=0.568]

Epoch 1:  34%|███▎      | 42/125 [41:07<1:21:23, 58.84s/it, loss=0.386]

Epoch 1:  34%|███▍      | 43/125 [42:06<1:20:31, 58.92s/it, loss=0.703]

Epoch 1:  35%|███▌      | 44/125 [43:05<1:19:18, 58.75s/it, loss=0.385]

Epoch 1:  36%|███▌      | 45/125 [44:02<1:17:56, 58.45s/it, loss=0.738]

Epoch 1:  37%|███▋      | 46/125 [45:00<1:16:44, 58.29s/it, loss=0.48]

Epoch 1:  38%|███▊      | 47/125 [45:59<1:16:04, 58.52s/it, loss=0.64]

Epoch 1:  38%|███▊      | 48/125 [46:58<1:15:00, 58.45s/it, loss=0.549]

Epoch 1:  39%|███▉      | 49/125 [47:56<1:14:10, 58.56s/it, loss=0.464]

Epoch 1:  40%|████      | 50/125 [48:59<1:14:40, 59.74s/it, loss=0.521]

Epoch 1:  41%|████      | 51/125 [50:01<1:14:30, 60.41s/it, loss=0.283]

Epoch 1:  42%|████▏     | 52/125 [50:59<1:12:28, 59.57s/it, loss=0.941]

Epoch 1:  42%|████▏     | 53/125 [52:00<1:12:18, 60.26s/it, loss=0.496]

Epoch 1:  43%|████▎     | 54/125 [53:00<1:11:07, 60.10s/it, loss=0.383]

Epoch 1:  44%|████▍     | 55/125 [54:00<1:10:04, 60.06s/it, loss=0.765]

Epoch 1:  45%|████▍     | 56/125 [54:59<1:08:35, 59.64s/it, loss=0.484]

Epoch 1:  46%|████▌     | 57/125 [55:58<1:07:23, 59.46s/it, loss=0.466]

Epoch 1:  46%|████▋     | 58/125 [56:56<1:05:51, 58.98s/it, loss=0.444]

Epoch 1:  47%|████▋     | 59/125 [57:54<1:04:41, 58.81s/it, loss=0.688]

Epoch 1:  48%|████▊     | 60/125 [58:52<1:03:22, 58.50s/it, loss=1.34]

Epoch 1:  49%|████▉     | 61/125 [59:50<1:02:22, 58.48s/it, loss=0.362]

Epoch 1:  50%|████▉     | 62/125 [1:00:49<1:01:26, 58.52s/it, loss=0.667]

Epoch 1:  50%|█████     | 63/125 [1:01:52<1:02:02, 60.04s/it, loss=0.413]

Epoch 1:  51%|█████     | 64/125 [1:02:54<1:01:32, 60.54s/it, loss=0.434]

Epoch 1:  52%|█████▏    | 65/125 [1:03:51<59:28, 59.48s/it, loss=0.483]  

Epoch 1:  53%|█████▎    | 66/125 [1:04:49<57:57, 58.94s/it, loss=0.582]

Epoch 1:  54%|█████▎    | 67/125 [1:05:48<56:53, 58.86s/it, loss=0.599]

Epoch 1:  54%|█████▍    | 68/125 [1:06:44<55:20, 58.26s/it, loss=0.248]

Epoch 1:  55%|█████▌    | 69/125 [1:07:42<54:08, 58.01s/it, loss=0.229]

Epoch 1:  56%|█████▌    | 70/125 [1:08:39<52:50, 57.65s/it, loss=0.263]

Epoch 1:  57%|█████▋    | 71/125 [1:09:36<51:44, 57.50s/it, loss=0.271]

Epoch 1:  58%|█████▊    | 72/125 [1:10:34<51:05, 57.83s/it, loss=0.664]

Epoch 1:  58%|█████▊    | 73/125 [1:11:33<50:15, 57.98s/it, loss=0.792]

Epoch 1:  59%|█████▉    | 74/125 [1:12:42<52:11, 61.40s/it, loss=0.705]

Epoch 1:  60%|██████    | 75/125 [1:13:40<50:22, 60.45s/it, loss=0.561]

Epoch 1:  61%|██████    | 76/125 [1:14:40<49:09, 60.19s/it, loss=0.566]

Epoch 1:  62%|██████▏   | 77/125 [1:15:38<47:44, 59.68s/it, loss=0.737]

Epoch 1:  62%|██████▏   | 78/125 [1:16:38<46:43, 59.65s/it, loss=0.409]

Epoch 1:  63%|██████▎   | 79/125 [1:17:36<45:26, 59.27s/it, loss=0.35]

Epoch 1:  64%|██████▍   | 80/125 [1:18:37<44:47, 59.73s/it, loss=0.309]

Epoch 1:  65%|██████▍   | 81/125 [1:19:37<43:53, 59.85s/it, loss=0.416]

Epoch 1:  66%|██████▌   | 82/125 [1:20:35<42:23, 59.14s/it, loss=0.175]

Epoch 1:  66%|██████▋   | 83/125 [1:21:32<40:55, 58.47s/it, loss=0.308]

Epoch 1:  67%|██████▋   | 84/125 [1:22:29<39:43, 58.14s/it, loss=0.511]

Epoch 1:  68%|██████▊   | 85/125 [1:23:28<38:59, 58.49s/it, loss=0.693]

Epoch 1:  69%|██████▉   | 86/125 [1:24:27<37:58, 58.41s/it, loss=0.609]

Epoch 1:  70%|██████▉   | 87/125 [1:25:25<37:02, 58.50s/it, loss=0.52]

Epoch 1:  70%|███████   | 88/125 [1:26:24<36:01, 58.43s/it, loss=0.307]

Epoch 1:  71%|███████   | 89/125 [1:27:21<34:49, 58.03s/it, loss=0.359]

Epoch 1:  72%|███████▏  | 90/125 [1:28:22<34:30, 59.15s/it, loss=0.483]

Epoch 1:  73%|███████▎  | 91/125 [1:29:23<33:40, 59.44s/it, loss=0.534]

Epoch 1:  74%|███████▎  | 92/125 [1:30:20<32:25, 58.95s/it, loss=0.233]

Epoch 1:  74%|███████▍  | 93/125 [1:31:20<31:36, 59.25s/it, loss=0.524]

Epoch 1:  75%|███████▌  | 94/125 [1:32:19<30:31, 59.09s/it, loss=0.577]

Epoch 1:  76%|███████▌  | 95/125 [1:33:18<29:28, 58.96s/it, loss=0.511]

Epoch 1:  77%|███████▋  | 96/125 [1:34:17<28:34, 59.13s/it, loss=1]

Epoch 1:  78%|███████▊  | 97/125 [1:35:16<27:30, 58.95s/it, loss=0.682]

Epoch 1:  78%|███████▊  | 98/125 [1:36:14<26:29, 58.87s/it, loss=0.447]

Epoch 1:  79%|███████▉  | 99/125 [1:37:13<25:28, 58.77s/it, loss=0.154]

Epoch 1:  80%|████████  | 100/125 [1:38:14<24:48, 59.55s/it, loss=0.27]

Epoch 1:  81%|████████  | 101/125 [1:39:14<23:53, 59.74s/it, loss=0.493]

Epoch 1:  82%|████████▏ | 102/125 [1:40:12<22:40, 59.14s/it, loss=0.879]

Epoch 1:  82%|████████▏ | 103/125 [1:41:10<21:31, 58.71s/it, loss=0.621]

Epoch 1:  83%|████████▎ | 104/125 [1:42:10<20:41, 59.11s/it, loss=0.624]

Epoch 1:  84%|████████▍ | 105/125 [1:43:09<19:40, 59.03s/it, loss=0.457]

Epoch 1:  85%|████████▍ | 106/125 [1:44:07<18:35, 58.69s/it, loss=0.744]

Epoch 1:  86%|████████▌ | 107/125 [1:45:05<17:31, 58.43s/it, loss=0.549]

Epoch 1:  86%|████████▋ | 108/125 [1:46:03<16:31, 58.35s/it, loss=0.556]

Epoch 1:  87%|████████▋ | 109/125 [1:47:01<15:34, 58.43s/it, loss=0.312]

Epoch 1:  88%|████████▊ | 110/125 [1:48:04<14:56, 59.77s/it, loss=0.687]

Epoch 1:  89%|████████▉ | 111/125 [1:49:06<14:04, 60.36s/it, loss=0.905]

Epoch 1:  90%|████████▉ | 112/125 [1:50:06<13:03, 60.23s/it, loss=0.593]

Epoch 1:  90%|█████████ | 113/125 [1:51:05<11:58, 59.87s/it, loss=0.516]

Epoch 1:  91%|█████████ | 114/125 [1:52:04<10:57, 59.75s/it, loss=0.385]

Epoch 1:  92%|█████████▏| 115/125 [1:53:03<09:55, 59.52s/it, loss=0.426]

Epoch 1:  93%|█████████▎| 116/125 [1:54:02<08:52, 59.12s/it, loss=0.379]

Epoch 1:  94%|█████████▎| 117/125 [1:55:00<07:51, 58.92s/it, loss=0.692]

Epoch 1:  94%|█████████▍| 118/125 [1:55:58<06:50, 58.63s/it, loss=0.538]

Epoch 1:  95%|█████████▌| 119/125 [1:56:56<05:50, 58.35s/it, loss=0.726]

Epoch 1:  96%|█████████▌| 120/125 [1:57:54<04:52, 58.44s/it, loss=0.503]

Epoch 1:  97%|█████████▋| 121/125 [1:59:01<04:03, 60.78s/it, loss=0.581]

Epoch 1:  98%|█████████▊| 122/125 [1:59:58<02:59, 59.78s/it, loss=0.495]

Epoch 1:  98%|█████████▊| 123/125 [2:00:56<01:58, 59.23s/it, loss=0.356]

Epoch 1:  99%|█████████▉| 124/125 [2:01:53<00:58, 58.70s/it, loss=0.51]



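The per-step loss in the progress log above is noisy, so a per-epoch average can make the trend easier to read. The following is a minimal sketch, assuming the log has been captured as plain text; `LOSS_LINE`, `mean_loss_per_epoch`, and `sample_log` are hypothetical names, and the sample contains only a few lines copied from the log, not the full run.

```python
import re
from collections import defaultdict

# Matches progress lines such as "Epoch 1:  99%|...| 124/125 [..., loss=0.51]".
LOSS_LINE = re.compile(r"Epoch (\d+):.*?loss=([0-9.]+)")

def mean_loss_per_epoch(log_text):
    """Average the reported loss values for each epoch found in the log text."""
    losses = defaultdict(list)
    for line in log_text.splitlines():
        match = LOSS_LINE.search(line)
        if match:  # lines without an epoch label and a loss value are skipped
            losses[int(match.group(1))].append(float(match.group(2)))
    return {epoch: sum(vals) / len(vals) for epoch, vals in sorted(losses.items())}

# A few lines copied from the log above (illustrative subset only).
sample_log = """\
Epoch 0:  99%|█████████▉| 124/125 [2:01:44<01:00, 60.36s/it, loss=0.813]
  0%|          | 0/125 [00:00<?, ?it/s]
Epoch 1:  98%|█████████▊| 123/125 [2:00:56<01:58, 59.23s/it, loss=0.356]
Epoch 1:  99%|█████████▉| 124/125 [2:01:53<00:58, 58.70s/it, loss=0.51]
"""

epoch_means = mean_loss_per_epoch(sample_log)
for epoch, mean_loss in epoch_means.items():
    print("Epoch %d: mean loss %.3f" % (epoch, mean_loss))
```

Applied to the full log, this gives one number per epoch, which is easier to compare than the individual batch losses.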
In [None]:
def predict_masked_sent(text, top_k, model, tokenizer):
    """Print the top_k most probable tokens for the first [MASK] in `text`."""
    # `device` is assumed to be defined earlier in the notebook
    model.to(device)

    model.eval()

    # Tokenize input and locate the masked position
    tokenized_text = tokenizer.tokenize(text)
    masked_index = tokenized_text.index("[MASK]")
    indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
    # Keep the input on the same device as the model
    tokens_tensor = torch.tensor([indexed_tokens]).to(device)

    # Predict all tokens
    with torch.no_grad():
        outputs = model(tokens_tensor)
        predictions = outputs[0]  # token-level prediction logits

    # Softmax over the vocabulary at the masked position, then keep the top_k entries
    probs = torch.nn.functional.softmax(predictions[0, masked_index], dim=-1)
    top_k_weights, top_k_indices = torch.topk(probs, top_k, sorted=True)

    for i, pred_idx in enumerate(top_k_indices):
        # Convert the 0-dim tensor to a plain int before the vocabulary lookup
        predicted_token = tokenizer.convert_ids_to_tokens([pred_idx.item()])[0]
        token_weight = top_k_weights[i]
        print("[MASK]: '%s'"%predicted_token, " | weights:", float(token_weight))
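The "weights" printed by `predict_masked_sent` are softmax probabilities over the vocabulary at the masked position, sorted in descending order. The following is a minimal pure-Python sketch of that softmax and top-k step on a toy example; `softmax`, `top_k`, `toy_vocab`, and `toy_logits` are hypothetical names with made-up values, not output from the model.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    largest = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(probs, k):
    """Return (index, probability) pairs for the k largest probabilities, highest first."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Hypothetical logits for a five-token vocabulary (illustrative values only).
toy_vocab = ["cluster", "sort", "have", "add", "the"]
toy_logits = [2.0, 1.0, 0.5, 0.1, -1.0]

toy_probs = softmax(toy_logits)
for idx, prob in top_k(toy_probs, 3):
    print("[MASK]: '%s'" % toy_vocab[idx], " | weights:", prob)
```

In the notebook function, `torch.nn.functional.softmax` and `torch.topk` do the same two steps on the full vocabulary.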

In [None]:
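# `sentence` pairs the masked sentence with the same sentence containing the target word;
# `sentence2` is the masked sentence on its own.
# Note: both strings end with [CLS], whereas BERT's usual input format closes each segment with [SEP].
sentence = "[CLS] we [MASK] documents based on subject. [SEP] we cluster documents based on subject [CLS]"
sentence2 = "[CLS] we [MASK] documents based on subject. [CLS]"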

predict_masked_sent(sentence, 5, model, tokenizer)
print()
predict_masked_sent(sentence2, 5, model, tokenizer)


[MASK]: 'cluster'  | weights: 0.43790990114212036
[MASK]: '[PAD]'  | weights: 0.1074443832039833
[MASK]: 'add'  | weights: 0.011403925716876984
[MASK]: 'clustered'  | weights: 0.010327151045203209
[MASK]: 'compact'  | weights: 0.009970532730221748

[MASK]: '##ve'  | weights: 0.17298102378845215
[MASK]: 'have'  | weights: 0.04960228130221367
[MASK]: 'to'  | weights: 0.04892673343420029
[MASK]: 'was'  | weights: 0.044326528906822205
[MASK]: 'had'  | weights: 0.03716942295432091


In [None]:
from transformers import AutoTokenizer, AutoModelForMaskedLM

In [None]:
base_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
base_model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")



Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [None]:
predict_masked_sent(sentence, 5, base_model, base_tokenizer)
print()
predict_masked_sent(sentence2, 5, base_model, base_tokenizer)

[MASK]: 'cluster'  | weights: 0.33416369557380676
[MASK]: 'compare'  | weights: 0.16992069780826569
[MASK]: 'label'  | weights: 0.03357220068573952
[MASK]: 'create'  | weights: 0.023759303614497185
[MASK]: 'construct'  | weights: 0.021330634132027626

[MASK]: 'have'  | weights: 0.05249989032745361
[MASK]: 'are'  | weights: 0.037207696586847305
[MASK]: 'do'  | weights: 0.02007931098341942
[MASK]: '.'  | weights: 0.015799084678292274
[MASK]: 'were'  | weights: 0.014494902454316616
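
One way to read the two runs side by side is to compare the top-5 token sets for `sentence`. The sketch below copies the printed weights for the fine-tuned and the base model (rounded to four decimals) into dictionaries. The variable names are assumptions made for illustration, and nothing here queries either model.

```python
# Top-5 predictions for `sentence`, copied from the printed outputs (weights rounded).
fine_tuned = {"cluster": 0.4379, "[PAD]": 0.1074, "add": 0.0114,
              "clustered": 0.0103, "compact": 0.0100}
base = {"cluster": 0.3342, "compare": 0.1699, "label": 0.0336,
        "create": 0.0238, "construct": 0.0213}

# Tokens that both models rank in their top 5.
shared = sorted(set(fine_tuned) & set(base))
# Tokens that only the fine-tuned model ranks in its top 5.
only_fine_tuned = sorted(set(fine_tuned) - set(base))
# Special tokens (written in square brackets) among the fine-tuned predictions.
special = [tok for tok in fine_tuned if tok.startswith("[") and tok.endswith("]")]

print("shared:", shared)
print("only fine-tuned:", only_fine_tuned)
print("special tokens predicted:", special)
```

On these numbers the only shared candidate is 'cluster', and the fine-tuned model gives its second-highest weight to the special token '[PAD]', which the base model does not rank in its top 5.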


In [None]:
import os

In [None]:
output_dir = '/content/drive/MyDrive/Thesis/notebooks/23march'

# Create the output directory if it does not exist yet
os.makedirs(output_dir, exist_ok=True)

print("Saving model to %s" % output_dir)

# Save a trained model, configuration and tokenizer using `save_pretrained()`.
# They can then be reloaded using `from_pretrained()`
model_to_save = model.module if hasattr(model, 'module') else model  # Take care of distributed/parallel training
model_to_save.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)

Saving model to /content/drive/MyDrive/Thesis/notebooks/23march


('/content/drive/MyDrive/Thesis/notebooks/23march/tokenizer_config.json',
 '/content/drive/MyDrive/Thesis/notebooks/23march/special_tokens_map.json',
 '/content/drive/MyDrive/Thesis/notebooks/23march/vocab.txt',
 '/content/drive/MyDrive/Thesis/notebooks/23march/added_tokens.json',
 '/content/drive/MyDrive/Thesis/notebooks/23march/tokenizer.json')
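
Before reloading, it can be worth confirming which of the listed files are actually on disk. The sketch below is a minimal check that uses only the standard library. The file names are taken from the tuple printed above, while the helper name `missing_files` and the throwaway demo directory are assumptions made for illustration. Whether every listed file is really written may depend on the tokenizer and the library version, so a reported gap is a prompt to look rather than proof of an error.

```python
import os
import tempfile

# File names taken from the tuple returned by tokenizer.save_pretrained() above.
TOKENIZER_FILES = [
    "tokenizer_config.json",
    "special_tokens_map.json",
    "vocab.txt",
    "added_tokens.json",
    "tokenizer.json",
]

def missing_files(directory, expected=TOKENIZER_FILES):
    """Return the expected file names that are absent from `directory`."""
    present = set(os.listdir(directory))
    return [name for name in expected if name not in present]

# Demonstration on a temporary directory that holds only the first three files.
with tempfile.TemporaryDirectory() as demo_dir:
    for name in TOKENIZER_FILES[:3]:
        open(os.path.join(demo_dir, name), "w").close()
    demo_missing = missing_files(demo_dir)

print(demo_missing)
```

In the notebook, the same helper could be pointed at `output_dir` as `missing_files(output_dir)`.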

In [None]:
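# BertForPreTraining and BertTokenizerFast are the classes used in the next cell.
from transformers import (AutoModel, AutoTokenizer, BertForPreTraining,
                          BertModel, BertTokenizer, BertTokenizerFast)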

In [None]:
loaded_model = BertForPreTraining.from_pretrained("/content/drive/MyDrive/Thesis/notebooks/23march")
loaded_tokenizer = BertTokenizerFast.from_pretrained("/content/drive/MyDrive/Thesis/notebooks/23march")

In [None]:
predict_masked_sent(sentence, 5, loaded_model, loaded_tokenizer)
predict_masked_sent(sentence2, 5, loaded_model, loaded_tokenizer)

[MASK]: 'cluster'  | weights: 0.43790990114212036
[MASK]: '[PAD]'  | weights: 0.1074443832039833
[MASK]: 'add'  | weights: 0.011403925716876984
[MASK]: 'clustered'  | weights: 0.010327151045203209
[MASK]: 'compact'  | weights: 0.009970532730221748
[MASK]: '##ve'  | weights: 0.17298102378845215
[MASK]: 'have'  | weights: 0.04960228130221367
[MASK]: 'to'  | weights: 0.04892673343420029
[MASK]: 'was'  | weights: 0.044326528906822205
[MASK]: 'had'  | weights: 0.03716942295432091
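
The reloaded model prints the same tokens and weights as the in-memory model did before saving. That comparison can be automated by parsing the printed lines, as in the sketch below. It assumes the line format produced by `predict_masked_sent`. The names `parse_predictions`, `before_save` and `after_reload` are hypothetical, and only the first two lines of each output are copied in to keep the example short.

```python
import re

# Matches lines such as: [MASK]: 'cluster'  | weights: 0.43790990114212036
LINE = re.compile(r"\[MASK\]: '(?P<token>.*)'\s+\| weights: (?P<weight>\S+)")

def parse_predictions(text):
    """Turn printed prediction lines into a list of (token, weight) pairs."""
    pairs = []
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if match:
            pairs.append((match.group("token"), float(match.group("weight"))))
    return pairs

# First two lines of the output printed before saving and after reloading.
before_save = """\
[MASK]: 'cluster'  | weights: 0.43790990114212036
[MASK]: '[PAD]'  | weights: 0.1074443832039833
"""
after_reload = """\
[MASK]: 'cluster'  | weights: 0.43790990114212036
[MASK]: '[PAD]'  | weights: 0.1074443832039833
"""

round_trip_ok = parse_predictions(before_save) == parse_predictions(after_reload)
print("identical after reload:", round_trip_ok)
```

Exact equality is a strict test. If the reload happens on different hardware, comparing weights within a small tolerance may be the more forgiving choice.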
