## Install Dependencies and Load the Data

In [1]:
%pip install datasets


In [2]:
import datasets

In [3]:
all_ds = datasets.list_datasets()
len(all_ds)

8830

In [4]:
dataset = datasets.load_dataset('oscar', 'unshuffled_deduplicated_az')

Downloading and preparing dataset oscar/unshuffled_deduplicated_az (download: 497.57 MiB, generated: 1.42 GiB, post-processed: Unknown size, total: 1.91 GiB) to /root/.cache/huggingface/datasets/oscar/unshuffled_deduplicated_az/1.0.0/84838bd49d2295f62008383b05620571535451d84545037bb94d6f3501651df2...


Dataset oscar downloaded and prepared to /root/.cache/huggingface/datasets/oscar/unshuffled_deduplicated_az/1.0.0/84838bd49d2295f62008383b05620571535451d84545037bb94d6f3501651df2. Subsequent calls will reuse this data.



In [16]:
dataset['train']

Dataset({
    features: ['id', 'text'],
    num_rows: 626796
})

In [27]:
data = dataset['train'].train_test_split(test_size=0.04)
data

DatasetDict({
    train: Dataset({
        features: ['id', 'text'],
        num_rows: 601724
    })
    test: Dataset({
        features: ['id', 'text'],
        num_rows: 25072
    })
})

In [36]:
from tqdm.auto import tqdm  # for our loading bar

text_data = []
file_count = 0

for sample in tqdm(data['test']):
    # strip newlines inside each sample, since newlines are used only as separators between samples
    sample = sample['text'].replace('\n', '')
    text_data.append(sample)
    if len(text_data) == 5_000:
        # once we hit the 5K mark, save to file
        with open(f'text_{file_count}.txt', 'w', encoding='utf-8') as fp:
            fp.write('\n'.join(text_data))
        text_data = []
        file_count += 1
# after saving five 5K chunks, 72 samples are left over (25,072 = 5 * 5,000 + 72); save those too
with open(f'text_{file_count}.txt', 'w', encoding='utf-8') as fp:
    fp.write('\n'.join(text_data))
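
The chunking logic above can be factored into a small helper. The following is a minimal sketch rather than the notebook's own code: `write_chunks` and its parameters are hypothetical names, and unlike the cell above it skips the final write when nothing is left over.

```python
from pathlib import Path
import tempfile


def write_chunks(samples, chunk_size, out_dir, prefix='text'):
    """Write samples to numbered files, one sample per line (hypothetical helper)."""
    out_dir = Path(out_dir)
    paths, chunk = [], []

    def flush():
        # file index is the number of files written so far
        path = out_dir / f'{prefix}_{len(paths)}.txt'
        path.write_text('\n'.join(chunk), encoding='utf-8')
        paths.append(path)
        chunk.clear()

    for sample in samples:
        # newlines are reserved as separators between samples
        chunk.append(sample.replace('\n', ''))
        if len(chunk) == chunk_size:
            flush()
    if chunk:  # leftover samples that did not fill a whole chunk
        flush()
    return paths


# 12 samples in chunks of 5: three files are written
with tempfile.TemporaryDirectory() as tmp:
    written = write_chunks([f'sample {i}' for i in range(12)], 5, tmp)
    line_counts = [len(p.read_text(encoding='utf-8').split('\n')) for p in written]
```

With the 25,072 test samples and a chunk size of 5,000, the same logic yields five full files and one file of 72 lines.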



## Build a Custom Transformer Tokenizer

In [37]:
from pathlib import Path
import os

In [38]:
paths = [str(x) for x in Path('').glob('**/*.txt')]
paths

['text_2.txt',
 'text_4.txt',
 'text_1.txt',
 'text_5.txt',
 'text_3.txt',
 'text_0.txt']

In [39]:
%pip install tokenizers



In [40]:
from tokenizers import ByteLevelBPETokenizer

In [41]:
tokenizer = ByteLevelBPETokenizer()

In [42]:
tokenizer.train(files=paths, vocab_size=30_522, min_frequency=2,
                special_tokens=['<s>', '<pad>', '</s>', '<unk>', '<mask>'])

In [43]:
os.makedirs('bert_dayi', exist_ok=True)  # do not fail if the directory already exists

tokenizer.save_model('bert_dayi')

['bert_dayi/vocab.json', 'bert_dayi/merges.txt']
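
To give a feel for what the `merges.txt` file records, here is a toy sketch of the BPE merge loop. It is an illustration under simplifying assumptions, not the `tokenizers` implementation: it works on characters instead of bytes, ignores word frequencies and pre-tokenization, and all function names (`count_pairs`, `merge_pair`, `train_toy_bpe`) are made up for this example. The `min_frequency` argument plays the same role as above: a pair seen fewer times than that is not merged.

```python
from collections import Counter


def count_pairs(words):
    """Count adjacent symbol pairs over a corpus of symbol sequences."""
    pairs = Counter()
    for symbols in words:
        pairs.update(zip(symbols, symbols[1:]))
    return pairs


def merge_pair(symbols, pair):
    """Replace every occurrence of `pair` in one sequence with the merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


def train_toy_bpe(corpus, num_merges, min_frequency=2):
    """Greedily merge the most frequent pair until no pair is frequent enough."""
    words = [list(word) for word in corpus]
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(words)
        if not pairs:
            break
        pair, freq = pairs.most_common(1)[0]
        if freq < min_frequency:
            break  # remaining pairs are too rare to merge
        merges.append(pair)
        words = [merge_pair(w, pair) for w in words]
    return merges, words


merges, words = train_toy_bpe(['abab', 'abc', 'ab'], num_merges=10)
```

In this toy corpus only the pair `('a', 'b')` occurs at least twice, so training stops after a single merge.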

### Using the Tokenizer

In [44]:
%pip install transformers



In [45]:
from transformers import RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained('bert_dayi') # load the tokenizer

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'BertTokenizer'. 
The class this function is called from is 'RobertaTokenizer'.
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'BertTokenizer'. 
The class this function is called from is 'RobertaTokenizerFast'.


In [46]:
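# sample Azerbaijani sentence, roughly: "AZTV - For 7 years now, Absheron district has
# financed all of its expenses from local revenues without receiving subsidies"
sample_text = (
    "AZTV-Artıq 7 ildir ki, Abşeron rayonu dotasiya almadan bütün"
    " xərclərini yerli daxilolmalar hesabına maliyyələşdirir"
)

In [47]:
tokenizer(sample_text, max_length=512, padding='max_length', truncation=True)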

{'input_ids': [0, 2429, 9256, 17, 8973, 1086, 5423, 359, 16, 4596, 2490, 808, 88, 846, 25484, 788, 16496, 2224, 21211, 3541, 5636, 272, 2, 1, 1, 1, ...

(output truncated: the rest of `input_ids` is padding, token ID 1, up to length 512)
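
The long run of `1`s in the output is padding. As a rough sketch of what `padding='max_length'` and `truncation=True` do to a single sequence, the hypothetical helper below pads or cuts a list of IDs to a fixed length and builds the matching attention mask. It assumes the IDs already include `<s>` (0) and `</s>` (2) and that `<pad>` is 1, which matches the order of `special_tokens` used during training; the real tokenizer's truncation logic is more involved.

```python
def pad_or_truncate(ids, max_length, pad_id=1, eos_id=2):
    """Pad or truncate one ID sequence to max_length (illustrative helper)."""
    ids = list(ids)
    if len(ids) > max_length:
        # keep the closing </s> token after cutting the sequence
        ids = ids[:max_length - 1] + [eos_id]
    n_pad = max_length - len(ids)
    return {
        'input_ids': ids + [pad_id] * n_pad,
        # 1 marks real tokens, 0 marks padding
        'attention_mask': [1] * len(ids) + [0] * n_pad,
    }


short = pad_or_truncate([0, 2429, 9256, 2], max_length=8)
long = pad_or_truncate([0, 5, 6, 7, 8, 2], max_length=4)
```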

## Building MLM Training Input Pipeline

In [48]:
with open('text_0.txt', 'r', encoding='utf-8') as fp:
    lines = fp.read().split('\n')

In [49]:
batch = tokenizer(lines, max_length=512, padding='max_length', truncation=True)
len(batch)

2

In [50]:
import torch

def mlm(tensor):
    """Replace ~15% of non-special token IDs with the <mask> ID (4), in place."""
    rand = torch.rand(tensor.shape)
    # IDs 0-2 are <s>, <pad>, </s>; never mask those
    mask_arr = (rand < 0.15) & (tensor > 2)
    tensor[mask_arr] = 4
    return tensor
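
The same selection rule can be written without tensors, which makes it easy to check on a toy sequence. This is a pure-Python sketch for illustration only; `mask_ids`, `MASK_ID` and `SPECIAL_MAX` are names introduced here, and the seed is just there to make the run repeatable.

```python
import random

MASK_ID = 4       # '<mask>' in the special_tokens list used for training
SPECIAL_MAX = 2   # IDs 0, 1, 2 are '<s>', '<pad>', '</s>'


def mask_ids(ids, prob=0.15, seed=None):
    """Return a copy of ids with each non-special token masked with probability prob."""
    rng = random.Random(seed)
    return [
        MASK_ID if tok > SPECIAL_MAX and rng.random() < prob else tok
        for tok in ids
    ]


# <s>, 1000 ordinary tokens, </s>, then 20 padding tokens
ids = [0] + list(range(10, 1010)) + [2] + [1] * 20
masked = mask_ids(ids, seed=0)
n_masked = sum(m == MASK_ID for m in masked)
```

Roughly 15% of the 1,000 ordinary tokens end up masked, while `<s>`, `</s>` and the padding are left untouched.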

In [51]:
from pathlib import Path

paths = [str(x) for x in Path('').glob('*.txt')]
paths[:5]

['text_2.txt', 'text_4.txt', 'text_1.txt', 'text_5.txt', 'text_3.txt']

In [52]:
from tqdm.auto import tqdm

input_ids = []
mask = []
labels = []

for path in tqdm(paths):
    with open(path, 'r', encoding='utf-8') as fp:
        lines = fp.read().split('\n')
    sample = tokenizer(lines, max_length=512, padding='max_length', truncation=True, return_tensors='pt')
    labels.append(sample.input_ids)  # unmasked token IDs are the prediction targets
    mask.append(sample.attention_mask)
    # mask a copy so the labels keep the original tokens
    input_ids.append(mlm(sample.input_ids.detach().clone()))


In [53]:
input_ids = torch.cat(input_ids)
mask = torch.cat(mask)
labels = torch.cat(labels)

In [54]:
input_ids[0][:10]

tensor([    0, 18557, 18658,  5017, 18658,   413, 11345,   713,  5017,     4])

In [55]:
encodings = {
    'input_ids': input_ids,
    'attention_mask': mask,
    'labels': labels
}
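
The encodings above use the full, unmasked token IDs as labels, so the loss is computed at every position, including padding. A common alternative, not used in this notebook, is to score only the masked positions by setting every other label to `-100`, the default `ignore_index` of PyTorch's cross-entropy loss. The sketch below shows the idea on plain lists; `build_labels` is a hypothetical helper.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def build_labels(original_ids, masked_ids, mask_id=4, ignore_index=IGNORE_INDEX):
    """Keep the original token only where the input was masked; ignore the rest."""
    return [
        orig if masked == mask_id else ignore_index
        for orig, masked in zip(original_ids, masked_ids)
    ]


original = [0, 18557, 18658, 5017, 2, 1, 1]
masked_input = [0, 18557, 4, 5017, 2, 1, 1]
example_labels = build_labels(original, masked_input)
```

Only the third position, where the input holds the `<mask>` ID, keeps its original token as a target.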

In [56]:
class Dataset(torch.utils.data.Dataset):
    def __init__(self, encoding):
        self.encoding = encoding
    def __len__(self):
        return self.encoding['input_ids'].shape[0]
    def __getitem__(self, idx):
        return {key: tensor[idx] for key, tensor in self.encoding.items()}

In [57]:
dataset = Dataset(encodings)

In [59]:
dataloader = torch.utils.data.DataLoader(dataset, batch_size=10, shuffle=True)
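
As a quick sanity check on the loader, the number of steps per epoch follows from the dataset size and the batch size. The snippet below is a small arithmetic sketch; `num_batches` is a hypothetical helper that mirrors the effect of the `DataLoader` `drop_last` option.

```python
import math


def num_batches(num_samples, batch_size, drop_last=False):
    """Number of batches per epoch for a given dataset size."""
    if drop_last:
        # an incomplete final batch is discarded
        return num_samples // batch_size
    return math.ceil(num_samples / batch_size)


# 25,072 samples (the test split written to the text files) with batch_size=10
steps_per_epoch = num_batches(25_072, 10)
steps_if_dropping_last = num_batches(25_072, 10, drop_last=True)
```

With the default `drop_last=False`, the final batch holds the 2 remaining samples.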

## Training and Testing Azerbaijani BERT

In [60]:
from transformers import RobertaConfig
config = RobertaConfig(
    vocab_size=tokenizer.vocab_size,
    max_position_embeddings=514,
    hidden_size=768,
    num_attention_heads=12,
    num_hidden_layers=6,
    type_vocab_size=1,
)
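
The value `max_position_embeddings=514` is two more than the tokenizer's `max_length=512`. This comes from how RoBERTa numbers positions: padding tokens get the position `padding_idx` (1 by default), and real tokens are numbered starting from `padding_idx + 1`. The function below is a pure-Python sketch of that numbering, written for illustration and not taken from the library.

```python
def roberta_position_ids(input_ids, padding_idx=1):
    """Position IDs in the RoBERTa style: pads stay at padding_idx, tokens count up from padding_idx + 1."""
    position_ids, count = [], 0
    for tok in input_ids:
        if tok == padding_idx:
            position_ids.append(padding_idx)
        else:
            count += 1
            position_ids.append(count + padding_idx)
    return position_ids


# a full-length sequence: <s>, 510 ordinary tokens, </s>
full = [0] + [10] * 510 + [2]
max_pos = max(roberta_position_ids(full))

padded = roberta_position_ids([0, 10, 11, 2, 1, 1])
```

A 512-token sequence reaches position 513, so the embedding table needs 514 rows (indices 0 to 513).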

In [61]:
from transformers import RobertaForMaskedLM

In [62]:
model = RobertaForMaskedLM(config)

In [None]:
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
model.to(device)

In [None]:
model

In [70]:
# transformers.AdamW is deprecated; torch's AdamW has different defaults (e.g. weight_decay=0.01)
from torch.optim import AdamW

# activate training mode
model.train()
# initialize optimizer
optim = AdamW(model.parameters(), lr=1e-4)

In [71]:
epochs = 2

for epoch in range(epochs):
    # setup loop with TQDM and dataloader
    loop = tqdm(dataloader, leave=True)
    for batch in loop:
        # reset the gradients accumulated in the previous step
        optim.zero_grad()
        # pull all tensor batches required for training
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)
        # forward pass
        outputs = model(input_ids, attention_mask=attention_mask,
                        labels=labels)
        # extract loss
        loss = outputs.loss
        # backpropagate: compute gradients for every parameter that requires them
        loss.backward()
        # update parameters
        optim.step()
        # print relevant info to progress bar
        loop.set_description(f'Epoch {epoch}')
        loop.set_postfix(loss=loss.item())


In [72]:
model.save_pretrained('./bert_dayi')

## Testing

In [73]:
from transformers import pipeline

In [74]:
fill = pipeline('fill-mask', model='bert_dayi', tokenizer='bert_dayi')

In [79]:
fill(f'{fill.tokenizer.mask_token} Resbublikası Dövlət Sərhəd Xidməti')

[{'score': 0.17659823596477509,
  'token': 741,
  'token_str': 'Azərbaycan',
  'sequence': 'Azərbaycan Resbublikası Dövlət Sərhəd Xidməti'},
 {'score': 0.025710102170705795,
  'token': 895,
  'token_str': '“',
  'sequence': '“ Resbublikası Dövlət Sərhəd Xidməti'},
 {'score': 0.020212380215525627,
  'token': 2158,
  'token_str': 'Bakı',
  'sequence': 'Bakı Resbublikası Dövlət Sərhəd Xidməti'},
 {'score': 0.010301139205694199,
  'token': 37,
  'token_str': 'A',
  'sequence': 'A Resbublikası Dövlət Sərhəd Xidməti'},
 {'score': 0.009763382375240326,
  'token': 49,
  'token_str': 'M',
  'sequence': 'M Resbublikası Dövlət Sərhəd Xidməti'}]
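
The `score` values are probabilities: the pipeline applies a softmax to the model's logits at the masked position and returns the highest-scoring tokens (five by default). The sketch below reproduces that step on a made-up list of logits; `top_k_softmax` is a hypothetical helper and the numbers are not model output.

```python
import math


def top_k_softmax(logits, k=5):
    """Return the k most probable (index, probability) pairs under a softmax."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(i, probs[i]) for i in ranked[:k]]


toy_logits = [2.0, 0.5, 1.0, -1.0]
top = top_k_softmax(toy_logits, k=2)
all_probs = top_k_softmax(toy_logits, k=len(toy_logits))
```

Because the softmax runs over the whole vocabulary, the five scores shown for a prompt usually sum to well below 1.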

In [80]:
fill(f'Azərbaycan {fill.tokenizer.mask_token} Bankı')

[{'score': 0.04589273780584335,
  'token': 4919,
  'token_str': ' Qəzeti',
  'sequence': 'Azərbaycan Qəzeti Bankı'},
 {'score': 0.041458260267972946,
  'token': 1094,
  'token_str': ' Respublikası',
  'sequence': 'Azərbaycan Respublikası Bankı'},
 {'score': 0.029113128781318665,
  'token': 18,
  'token_str': '.',
  'sequence': 'Azərbaycan. Bankı'},
 {'score': 0.026965584605932236,
  'token': 890,
  'token_str': ' Respublikasının',
  'sequence': 'Azərbaycan Respublikasının Bankı'},
 {'score': 0.01985001191496849,
  'token': 741,
  'token_str': 'Azərbaycan',
  'sequence': 'AzərbaycanAzərbaycan Bankı'}]

In [84]:
fill(f'“Azərbaycan {fill.tokenizer.mask_token} Ali Məhkəməsinin ')

[{'score': 0.025280063971877098,
  'token': 18,
  'token_str': '.',
  'sequence': '“Azərbaycan. Ali Məhkəməsinin '},
 {'score': 0.012677758000791073,
  'token': 301,
  'token_str': ' və',
  'sequence': '“Azərbaycan və Ali Məhkəməsinin '},
 {'score': 0.011768593452870846,
  'token': 17,
  'token_str': '-',
  'sequence': '“Azərbaycan- Ali Məhkəməsinin '},
 {'score': 0.010865180753171444,
  'token': 408,
  'token_str': ' Azərbaycan',
  'sequence': '“Azərbaycan Azərbaycan Ali Məhkəməsinin '},
 {'score': 0.01046405453234911,
  'token': 879,
  'token_str': ' Dövlət',
  'sequence': '“Azərbaycan Dövlət Ali Məhkəməsinin '}]

In [88]:
fill(f'Qusardakı {fill.tokenizer.mask_token} ')

[{'score': 0.017240870743989944,
  'token': 18,
  'token_str': '.',
  'sequence': 'Qusardakı. '},
 {'score': 0.006505417171865702,
  'token': 1513,
  'token_str': 'е',
  'sequence': 'Qusardakıе '},
 {'score': 0.005885988939553499,
  'token': 16,
  'token_str': ',',
  'sequence': 'Qusardakı, '},
 {'score': 0.005774783901870251,
  'token': 2174,
  'token_str': ' #',
  'sequence': 'Qusardakı # '},
 {'score': 0.005217336118221283,
  'token': 301,
  'token_str': ' və',
  'sequence': 'Qusardakı və '}]

In [89]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [92]:
!cp -r "bert_dayi" "/content/gdrive/MyDrive/bert_aze"