<a href="https://colab.research.google.com/github/IgnatiusEzeani/IGBONLP/blob/master/training_igbo_bert.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Training Igbo BERT Language Model from Scratch**

1. Find a dataset

In [1]:
!wget -c https://github.com/IgnatiusEzeani/IGBONLP/raw/master/ig_monoling/text.zip

--2021-11-25 18:46:42--  https://github.com/IgnatiusEzeani/IGBONLP/raw/master/ig_monoling/text.zip
Resolving github.com (github.com)... 52.69.186.44
Connecting to github.com (github.com)|52.69.186.44|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/IgnatiusEzeani/IGBONLP/master/ig_monoling/text.zip [following]
--2021-11-25 18:46:42--  https://raw.githubusercontent.com/IgnatiusEzeani/IGBONLP/master/ig_monoling/text.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7604282 (7.3M) [application/zip]
Saving to: ‘text.zip’


2021-11-25 18:46:44 (70.9 MB/s) - ‘text.zip’ saved [7604282/7604282]



In [13]:
import zipfile
import os


def unzip(zipfilename):
  try:
    with zipfile.ZipFile(zipfilename, 'r') as zip_ref:
      zip_ref.extractall(zipfilename[:-4])
      return f"'{zipfilename}' unzipped!"
  except FileNotFoundError:
    print(f"Cannot find '{zipfilename}' file")

unzip("text.zip")
!rm text.zip

In [42]:
# import os
import shutil
dir_name = "/content/text/"
text=""
for fname in os.listdir(dir_name):
  fname = os.path.join(dir_name, fname)
  with open(fname, "r", encoding="utf8") as datafile:
    text = text+"\n"+datafile.read()

with open("data.txt", "w", encoding="utf8") as datafile:
  datafile.write(text)

shutil.rmtree("text")

2. Train a tokenizer

We choose to train a byte-level Byte-pair encoding tokenizer (the same as GPT-2), with the same special tokens as RoBERTa. Let’s arbitrarily pick its size to be 52,000.

We recommend training a byte-level BPE (rather than let’s say, a WordPiece tokenizer like BERT) because it will start building its vocabulary from an alphabet of single bytes, so all words will be decomposable into tokens (no more <unk> tokens!).

In [8]:
# We won't need TensorFlow here
!pip uninstall -y tensorflow
# Install `transformers` from master
!pip install git+https://github.com/huggingface/transformers
!pip list | grep -E 'transformers|tokenizers'
# transformers version at notebook update --- 2.11.0
# tokenizers version at notebook update --- 0.8.0rc1

Found existing installation: tensorflow 2.7.0
Uninstalling tensorflow-2.7.0:
  Successfully uninstalled tensorflow-2.7.0
Collecting git+https://github.com/huggingface/transformers
  Cloning https://github.com/huggingface/transformers to /tmp/pip-req-build-funr693d
  Running command git clone -q https://github.com/huggingface/transformers /tmp/pip-req-build-funr693d
  Installing build dependencies ... [?25l[?25hdone
  Getting requirements to build wheel ... [?25l[?25hdone
    Preparing wheel metadata ... [?25l[?25hdone
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
[K     |████████████████████████████████| 596 kB 4.1 MB/s 
[?25hCollecting sacremoses
  Downloading sacremoses-0.0.46-py3-none-any.whl (895 kB)
[K     |████████████████████████████████| 895 kB 72.0 MB/s 
[?25hCollecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.1.2-py3-none-any.whl (59 kB)
[K

In [9]:
%%time 
from pathlib import Path
from tokenizers import ByteLevelBPETokenizer

paths = [str(x) for x in Path(".").glob("**/*.txt")]

# Initialize a tokenizer
tokenizer = ByteLevelBPETokenizer()

# Customize training
tokenizer.train(files=paths, vocab_size=52_000, min_frequency=2, special_tokens=[
    "<s>",
    "<pad>",
    "</s>",
    "<unk>",
    "<mask>",
])

CPU times: user 17.9 s, sys: 1.22 s, total: 19.1 s
Wall time: 5.27 s


In [None]:
Path(".").glob("**/*.txt") ##remove later...

Now let's save files to disk

In [12]:
!mkdir igbo_bert
tokenizer.save_model("igbo_bert")

['igbo_bert/vocab.json', 'igbo_bert/merges.txt']

What is great is that our tokenizer is optimized for Igbo. Compared to a generic tokenizer trained for English, more native words are represented by a single, unsplit token. Diacritics are encoded natively. We also represent sequences in a more efficient manner.

Here’s how you can use it in tokenizers, including handling the RoBERTa special tokens – of course, you’ll also be able to use it directly from transformers.


In [18]:
from tokenizers.implementations import ByteLevelBPETokenizer
from tokenizers.processors import BertProcessing

tokenizer = ByteLevelBPETokenizer(
    "./igbo_bert/vocab.json",
    "./igbo_bert/merges.txt",
)

In [19]:
tokenizer._tokenizer.post_processor = BertProcessing(
    ("</s>", tokenizer.token_to_id("</s>")),
    ("<s>", tokenizer.token_to_id("<s>")),
)
tokenizer.enable_truncation(max_length=512)

In [20]:
tokenizer.encode("Simone kọrọ ya akụkọ nke ya ahu.").tokens

['<s>',
 'Simone',
 'Ġká»įrá»į',
 'Ġya',
 'Ġaká»¥ká»į',
 'Ġnke',
 'Ġya',
 'Ġahu',
 '.',
 '</s>']

3. Train a language model from scratch
Update: This section follows along the run_language_modeling.py script, using our new Trainer directly. Feel free to pick the approach you like best.

We’ll train a RoBERTa-like model, which is a BERT-like with a couple of changes (check the documentation for more details).

As the model is BERT-like, we’ll train it on a task of Masked language modeling, i.e. the predict how to fill arbitrary tokens that we randomly mask in the dataset. This is taken care of by the example script.

In [21]:
# Check that we have a GPU
!nvidia-smi

Thu Nov 25 19:05:50 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.44       Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   32C    P0    26W / 250W |      0MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [22]:
# Check that PyTorch sees it
import torch
torch.cuda.is_available()

True

We'll define the following config for the model

In [23]:
from transformers import RobertaConfig

config = RobertaConfig(
    vocab_size=52_000,
    max_position_embeddings=514,
    num_attention_heads=12,
    num_hidden_layers=6,
    type_vocab_size=1,
)

Now let's re-create our tokenizer in transformers

In [26]:
from transformers import RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained("./igbo_bert", max_len=512)

Finally let's initialize our model.

Important:

As we are training from scratch, we only initialize from a config, not from an existing pretrained model or checkpoint.

In [27]:
from transformers import RobertaForMaskedLM

model = RobertaForMaskedLM(config=config)

In [28]:
model.num_parameters()
# => 83 million parameters

83504416

Now let's build our training Dataset
We'll build our dataset by applying our tokenizer to our text file.

Here, as we only have one text file, we don't even need to customize our Dataset. We'll just use the LineByLineDataset out-of-the-box.

In [43]:
%%time
from transformers import LineByLineTextDataset

dataset = LineByLineTextDataset(
    tokenizer = tokenizer,
    file_path = "/content/data.txt",
    block_size = 128
)



CPU times: user 1min 1s, sys: 1.71 s, total: 1min 3s
Wall time: 28.4 s


Like in the run_language_modeling.py script, we need to define a data_collator.

This is just a small helper that will help us batch different samples of the dataset together into an object that PyTorch knows how to perform backprop on.

In [44]:
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

Finally, we are all set to initialize our Trainer

In [45]:
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./igbo_bert",
    overwrite_output_dir=True,
    num_train_epochs=2,
    per_gpu_train_batch_size=64,
    save_steps=10_000,
    save_total_limit=2,
    prediction_loss_only=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
)

Start training

In [46]:
%%time
trainer.train()

Using deprecated `--per_gpu_train_batch_size` argument which will be removed in a future version. Using `--per_device_train_batch_size` is preferred.
Using deprecated `--per_gpu_train_batch_size` argument which will be removed in a future version. Using `--per_device_train_batch_size` is preferred.
***** Running training *****
  Num examples = 766896
  Num Epochs = 2
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 64
  Gradient Accumulation steps = 1
  Total optimization steps = 23966
Using deprecated `--per_gpu_train_batch_size` argument which will be removed in a future version. Using `--per_device_train_batch_size` is preferred.


Step,Training Loss
500,6.3151
1000,5.1957
1500,4.7537
2000,4.402
2500,4.1739
3000,4.0149
3500,3.895
4000,3.7691
4500,3.6636
5000,3.6076


Saving model checkpoint to ./igbo_bert/checkpoint-10000
Configuration saved in ./igbo_bert/checkpoint-10000/config.json
Model weights saved in ./igbo_bert/checkpoint-10000/pytorch_model.bin
Saving model checkpoint to ./igbo_bert/checkpoint-20000
Configuration saved in ./igbo_bert/checkpoint-20000/config.json
Model weights saved in ./igbo_bert/checkpoint-20000/pytorch_model.bin


Training completed. Do not forget to share your model on huggingface.co/models =)




CPU times: user 2h 20min 2s, sys: 11min 27s, total: 2h 31min 29s
Wall time: 2h 30min 47s


TrainOutput(global_step=23966, training_loss=3.2467709280041257, metrics={'train_runtime': 9047.7711, 'train_samples_per_second': 169.522, 'train_steps_per_second': 2.649, 'total_flos': 2.546346938655437e+16, 'train_loss': 3.2467709280041257, 'epoch': 2.0})

 Save final model (+ tokenizer + config) to disk

In [47]:
trainer.save_model("./igbo_bert")

Saving model checkpoint to ./igbo_bert
Configuration saved in ./igbo_bert/config.json
Model weights saved in ./igbo_bert/pytorch_model.bin


# 4. Check that the LM actually trained

Aside from looking at the training and eval losses going down, the easiest way to check whether our language model is learning anything interesting is via the FillMaskPipeline.

Pipelines are simple wrappers around tokenizers and models, and the 'fill-mask' one will let you input a sequence containing a masked token (here, <mask>) and return a list of the most probable filled sequences, with their probabilities.

In [48]:
from transformers import pipeline

fill_mask = pipeline(
    "fill-mask",
    model="./igbo_bert",
    tokenizer="./igbo_bert"
)

loading configuration file ./igbo_bert/config.json
Model config RobertaConfig {
  "_name_or_path": "./igbo_bert",
  "architectures": [
    "RobertaForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 6,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "torch_dtype": "float32",
  "transformers_version": "4.13.0.dev0",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 52000
}

loading configuration file ./igbo_bert/config.json
Model config RobertaConfig {
  "_name_or_path": "./igbo_bert",
  "architectures": [
    "RobertaForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout

In [59]:
# The sun <mask>.
# =>

fill_mask("Uwa anyị niile <mask>.") #=eri
# fill_mask("Nwaanyị na <mask> ji na akara.") #=eri

[{'score': 0.3766884505748749,
  'sequence': 'Uwa anyị niile A.',
  'token': 351,
  'token_str': ' A'},
 {'score': 0.19403190910816193,
  'sequence': 'Uwa anyị niile O.',
  'token': 393,
  'token_str': ' O'},
 {'score': 0.025298675522208214,
  'sequence': 'Uwa anyị niile..',
  'token': 18,
  'token_str': '.'},
 {'score': 0.02208337001502514,
  'sequence': 'Uwa anyị niile T.',
  'token': 606,
  'token_str': ' T'},
 {'score': 0.01943317987024784,
  'sequence': 'Uwa anyị niile M.',
  'token': 445,
  'token_str': ' M'}]

Ok, simple syntax/grammar works. Let’s try a slightly more interesting prompt:

In [60]:
fill_mask("O riri <mask>.")

# This is the beginning of a beautiful <mask>.
# =>

[{'score': 0.09363380074501038,
  'sequence': 'O riri ya.',
  'token': 290,
  'token_str': ' ya'},
 {'score': 0.054225388914346695,
  'sequence': 'O riri m.',
  'token': 270,
  'token_str': ' m'},
 {'score': 0.02468796633183956,
  'sequence': 'O riri Nwanna.',
  'token': 1397,
  'token_str': ' Nwanna'},
 {'score': 0.017323195934295654,
  'sequence': 'O riri M.',
  'token': 445,
  'token_str': ' M'},
 {'score': 0.01361154019832611,
  'sequence': 'O riri afọ.',
  'token': 480,
  'token_str': ' afọ'}]

Zip and download the model file

In [61]:
shutil.make_archive("/content/igbo_bert", 'zip', "igbo_bert")

'/content/igbo_bert.zip'

In [1]:
from google.colab import files
files.download("/content/igbo_bert.zip")

FileNotFoundError: ignored

In [None]:
model_save_name = 'IGBOchy'
path = F"/content/drive/Shareddrives/{model_save_name}" 
torch.save(model.state_dict(), /content/drive/Shareddrives)

SyntaxError: ignored

In [None]:
!cp "/content/IGBOchy" "/content/drive/Shareddrives"

cp: -r not specified; omitting directory '/content/IGBOchy'


In [None]:
!ls /content/drive

MyDrive  Shareddrives


In [None]:
torch.save("IGBOchy", /content/drive/Shareddrives)

SyntaxError: ignored

In [None]:
#reading a text file saved in github using requests package
import requests
url = 'https://github.com/chiamaka249/Igbo-merged-corpus-/blob/main/IGBO_CORPUS_MERGE.zip'
page = requests.get(url)
print (page.text)







<!DOCTYPE html>
<html lang="en" data-color-mode="auto" data-light-theme="light" data-dark-theme="dark">
  <head>
    <meta charset="utf-8">
  <link rel="dns-prefetch" href="https://github.githubassets.com">
  <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">
  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">
  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">
  <link rel="preconnect" href="https://github.githubassets.com" crossorigin>
  <link rel="preconnect" href="https://avatars.githubusercontent.com">



  <link crossorigin="anonymous" media="all" integrity="sha512-1G4rYJktwRTQKn7fVfJUxH8RRZFUJlGo77xMZfBfIhZPx4BHVrzPE1VgnafttXI8G3y/PywH3uXyhNkSLp3+oA==" rel="stylesheet" href="https://github.githubassets.com/assets/light-d46e2b60992dc114d02a7edf55f254c4.css" /><link crossorigin="anonymous" media="all" integrity="sha512-hI5b2oqTE9njfjYrfuzXqA4bSGSNrE5OMc9IiFhZy+RDGg9Qn4Si1A97o0MlinlwFt3xAifvoLX0s7jHmHSvVw==" re