<a href="https://colab.research.google.com/github/Azizkhaled/NLP_with_Aziz/blob/main/Projects/TrainingPretrainedBert/MLM_Training.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [3]:
pip install transformers accelerate -U


Installing collected packages: tokenizers, safetensors, huggingface-hub, transformers, accelerate
Successfully installed accelerate-0.21.0 huggingface-hub-0.16.4 safetensors-0.3.2 tokenizers-0.13.3 transformers-4.32.0


### Initialize tokenizer and model

In [4]:
from transformers import  BertTokenizer, BertForMaskedLM
import torch

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased')



Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


### Load data, clean, tokenize

We will utilize the UN report titled "The Question of Palestine" for this project. The data can be accessed using the following link:

https://raw.githubusercontent.com/Azizkhaled/NLP/main/Data/UN_text.txt

We'll start by loading the data from the provided URL and removing any duplicate entries while preserving the order of appearance.

In [5]:
import requests

# function to remove duplicates while keeping the same order
def remove_duplicates_keep_order(input_list):
    seen = set()
    result = []

    for item in input_list:
        if item.strip() != '' and item not in seen:  # Check if the line is not empty and not seen before
            result.append(item)
            seen.add(item)

    return result
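As a quick illustration of the helper above, here is a minimal sketch on a toy list. The sample strings are made up for illustration and are not taken from the UN text. It also shows a more compact idiom based on `dict.fromkeys`, which relies on dictionaries preserving insertion order (Python 3.7+) and should give the same result here.

```python
def remove_duplicates_keep_order(input_list):
    seen = set()
    result = []
    for item in input_list:
        # skip empty / whitespace-only lines and anything already seen
        if item.strip() != '' and item not in seen:
            result.append(item)
            seen.add(item)
    return result

# hypothetical sample lines, for illustration only
lines = ['first line', '', 'second line', 'first line', '   ', 'third line']

deduped = remove_duplicates_keep_order(lines)

# compact alternative: dict keys are unique and keep insertion order
compact = list(dict.fromkeys(line for line in lines if line.strip() != ''))

print(deduped)   # ['first line', 'second line', 'third line']
print(compact)   # ['first line', 'second line', 'third line']
```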


In [6]:
data = requests.get('https://raw.githubusercontent.com/Azizkhaled/NLP/main/Data/UN_text.txt')
text = data.text.split('\n')

text = remove_duplicates_keep_order(text)
print( len(text),': ', text[0:3])

886 :  ['The question of Palestine was brought before the United Nations shortly after the end of the Second World War.\r', 'The origins of the Palestine problem as an international issue, however, lie in events occurring towards the end of the First World War. These events led to a League of Nations decision to place Palestine under the administration of Great Britain as the Mandatory Power under the Mandates System adopted by the League. In principle, the Mandate was meant to be in the nature of a transitory phase until Palestine attained the status of a fully independent nation, a status provisionally recognized in the League’s Covenant, but in fact the Mandate’s historical evolution did not result in the emergence of Palestine as an independent nation.\r', 'The decision on the Mandate did not take into account the wishes of the people of Palestine, despite the Covenant’s requirements that “the wishes of these communities must be a principal consideration in the selection of the Man

In [7]:
inputs = tokenizer(text, return_tensors='pt', max_length=512, truncation=True, padding='max_length')

### Create label tensor

The labels are the original, unmasked token IDs, so we create the labels tensor by cloning the input_ids tensor before any tokens are masked.

In [8]:
inputs['labels'] = inputs.input_ids.detach().clone()
inputs.keys()

dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'labels'])

### Randomly mask tokens

In [9]:
# create random array of floats with equal dimensions to input_ids tensor
rand = torch.rand(inputs.input_ids.shape)
# make sure we don't mask the CLS (101) and SEP tokens (102) and paddings (0)
mask_arr = (rand < 0.15) * (inputs.input_ids != 101) * (inputs.input_ids != 102) * (inputs.input_ids != 0)
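To see what the mask looks like, the same logic can be sketched on a tiny hand-made batch. The token IDs other than 101, 102 and 0 are arbitrary values chosen for illustration, and the seed is fixed only so the sketch is reproducible. With so few tokens, a 15% rate may select very few positions or none at all.

```python
import torch

torch.manual_seed(0)  # only to make this sketch reproducible

# toy batch: [CLS]=101, [SEP]=102, [PAD]=0; the other IDs are arbitrary
input_ids = torch.tensor([
    [101, 2023, 2003, 1037, 3231, 102, 0, 0],
    [101, 2178, 2460, 6251, 102, 0, 0, 0],
])

# one random float in [0, 1) per token position
rand = torch.rand(input_ids.shape)

# positions that must never be masked
special = (input_ids == 101) | (input_ids == 102) | (input_ids == 0)

# select roughly 15% of the remaining positions
mask_arr = (rand < 0.15) & ~special

print(mask_arr)
```

Using `&` and `~` on boolean tensors is equivalent to multiplying the boolean conditions as in the cell above; it just makes the intent a little more explicit.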

In [10]:
## indices to be masked in each paragraph
selection = []

for i in range(inputs.input_ids.shape[0]):
    selection.append(
        torch.flatten(mask_arr[i].nonzero()).tolist()
    )


In [11]:
# replace the selected positions with the [MASK] token ID (103)
for i in range(inputs.input_ids.shape[0]):
    inputs.input_ids[i, selection[i]] = 103


In [12]:
inputs.input_ids

tensor([[  101,  1996,  3160,  ...,     0,     0,     0],
        [  101,  1996,  7321,  ...,     0,     0,     0],
        [  101,  1996,  3247,  ...,     0,     0,     0],
        ...,
        [  101,  1006,  3818,  ...,     0,     0,     0],
        [  101, 17827, 11814,  ...,     0,     0,     0],
        [  101,  9686,  4747,  ...,     0,     0,     0]])
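In the notebook as written, `labels` is a full copy of the original `input_ids`, so the loss is computed over every position, including unmasked tokens and padding. A common variation, which is not what the cells above do, is to set the label to `-100` everywhere except at the masked positions, since the loss ignores label `-100`. The sketch below illustrates this on a toy tensor; the token IDs and the hand-picked mask are assumptions for illustration only.

```python
import torch

# toy sequence: [CLS]=101, [SEP]=102, [PAD]=0; the other IDs are arbitrary
input_ids = torch.tensor([[101, 2023, 2003, 1037, 3231, 102, 0, 0]])

# labels start as a copy of the original, unmasked IDs
labels = input_ids.clone()

# hand-picked mask for illustration (normally drawn at random)
mask_arr = torch.tensor([[False, False, True, False, True, False, False, False]])

# replace the selected inputs with [MASK] (103)
input_ids[mask_arr] = 103

# ignore every position that was NOT masked when computing the loss
labels[~mask_arr] = -100

print(input_ids)  # tensor([[ 101, 2023,  103, 1037,  103,  102,    0,    0]])
print(labels)     # tensor([[-100, -100, 2003, -100, 3231, -100, -100, -100]])
```

With this variation the reported loss reflects only how well the model recovers the masked tokens, so it will typically be higher than a loss averaged over all positions.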

### Create a PyTorch dataset from our data

In [13]:
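class UnDataset(torch.utils.data.Dataset):
    def __init__(self, encodings):
        # encodings: tokenizer output holding input_ids, attention_mask, labels, ...
        self.encodings = encodings
    def __getitem__(self, idx):
        # values are already tensors, so copy with clone().detach() rather than
        # torch.tensor(), which emits a UserWarning when given a tensor
        return {key: val[idx].clone().detach() for key, val in self.encodings.items()}
    def __len__(self):
        return len(self.encodings.input_ids)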

In [16]:
dataset = UnDataset(inputs)

## Training Method 1: PyTorch Training

### Initialize loader

In [13]:
loader = torch.utils.data.DataLoader(dataset, batch_size=8, shuffle=True)


In [None]:
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
# and move our model over to the selected device
model.to(device)

Activate the training mode of our model, and initialize our optimizer (AdamW, i.e. Adam with decoupled weight decay, which can reduce the chance of overfitting).

In [15]:
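# transformers.AdamW is deprecated; use the PyTorch implementation instead
from torch.optim import AdamW

# activate training mode
model.train()
# initialize optimizer (torch's AdamW applies weight_decay=0.01 by default)
optim = AdamW(model.parameters(), lr=5e-5)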



### Start the training loop

In [16]:
from tqdm import tqdm  # for our progress bar

epochs = 3

for epoch in range(epochs):
    # setup loop with TQDM and dataloader
    loop = tqdm(loader, leave=True)
    for batch in loop:
        # reset gradients accumulated in the previous step
        optim.zero_grad()
        # pull all tensor batches required for training
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)
        # forward pass
        outputs = model(input_ids, attention_mask=attention_mask,
                        labels=labels)
        # extract loss
        loss = outputs.loss
        # backpropagate: compute gradients of the loss for every trainable parameter
        loss.backward()
        # update parameters
        optim.step()
        # print relevant info to progress bar
        loop.set_description(f'Epoch {epoch}')
        loop.set_postfix(loss=loss.item())

Epoch 0: 100%|██████████| 111/111 [01:39<00:00,  1.11it/s, loss=0.0865]
Epoch 1: 100%|██████████| 111/111 [01:41<00:00,  1.09it/s, loss=0.0366]
Epoch 2: 100%|██████████| 111/111 [01:41<00:00,  1.09it/s, loss=0.0221]


## Training Method 2: Hugging Face Trainer

In [19]:
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir='out',
    per_device_train_batch_size=8,
    num_train_epochs=3
)

In [20]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset
)

In [22]:
torch.cuda.empty_cache()

In [23]:
trainer.train()



TrainOutput(global_step=333, training_loss=0.3669326369826858, metrics={'train_runtime': 327.3756, 'train_samples_per_second': 8.119, 'train_steps_per_second': 1.017, 'total_flos': 699598392422400.0, 'train_loss': 0.3669326369826858, 'epoch': 3.0})
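The step count and throughput figures in this output can be reproduced from the dataset size and the training settings. The sketch below is a small worked calculation, assuming a single device and the default behavior of keeping the last, smaller batch; the variable names are chosen for illustration.

```python
import math

num_samples = 886          # paragraphs left after de-duplication
batch_size = 8             # per_device_train_batch_size
epochs = 3                 # num_train_epochs
train_runtime = 327.3756   # seconds, taken from the TrainOutput above

# the last batch is smaller but still counts as a step
steps_per_epoch = math.ceil(num_samples / batch_size)
total_steps = steps_per_epoch * epochs

samples_per_second = round(num_samples * epochs / train_runtime, 3)
steps_per_second = round(total_steps / train_runtime, 3)

print(steps_per_epoch, total_steps)   # 111 333
print(samples_per_second)             # 8.119
print(steps_per_second)               # 1.017
```

The 111 steps per epoch also match the `111/111` shown in the progress bars of the manual PyTorch loop.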

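Same setup and the same number of steps (3 epochs × 111 batches = 333) as the manual loop. Perfect! Note that the `training_loss` reported here is averaged over all steps, so it is not directly comparable to the per-batch loss shown in the progress bar of Method 1.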