<a href="https://colab.research.google.com/github/Chiamakac/TRAININGS/blob/main/IgboBERT%202.0/Training_IboBERT_2_0_from_scratch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Training an Igbo language model from scratch using Transformers and Tokenizers**

**1. Getting the data.**

In [None]:
!wget -c https://github.com/IgnatiusEzeani/IGBONLP/raw/master/ig_monoling/text.zip
!wget -c https://raw.githubusercontent.com/chiamaka249/lacuna_pos_ner/main/language_corpus/ibo/ibo.txt
!wget -c https://raw.githubusercontent.com/Chiamakac/IboBETA/main/config.json?token=GHSAT0AAAAAAB5DFIJTG6K26ACHVLWSCFIAY6JXY4Q
!wget -c https://raw.githubusercontent.com/Chiamakac/TRAININGS/main/Alignment/Projection_Work/TAG_ENG/Igbo%20Sents_all.txt

--2023-01-19 22:04:47--  https://github.com/IgnatiusEzeani/IGBONLP/raw/master/ig_monoling/text.zip
Resolving github.com (github.com)... 140.82.121.4
Connecting to github.com (github.com)|140.82.121.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/IgnatiusEzeani/IGBONLP/master/ig_monoling/text.zip [following]
--2023-01-19 22:04:47--  https://raw.githubusercontent.com/IgnatiusEzeani/IGBONLP/master/ig_monoling/text.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7604282 (7.3M) [application/zip]
Saving to: ‘text.zip’


2023-01-19 22:04:48 (173 MB/s) - ‘text.zip’ saved [7604282/7604282]

--2023-01-19 22:04:48--  https://raw.githubusercontent.com/chiamaka249/lacuna_pos_ner/main/language_

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
#Unzip the zipped file and remove the zipped file after unzipping
import zipfile
import os


def unzip(zipfilename):
  try:
    with zipfile.ZipFile(zipfilename, 'r') as zip_ref:
      zip_ref.extractall(zipfilename[:-4])
      return f"'{zipfilename}' unzipped!"
  except FileNotFoundError:
    print(f"Cannot find '{zipfilename}' file")

unzip("text.zip")
!rm text.zip

In [None]:
# Move the file "ibo.txt" into the folder "text"
import shutil
shutil.move('/content/ibo.txt', '/content/text')



'/content/text/ibo.txt'

In [None]:
import shutil
shutil.move('/content/Igbo Sents_all.txt', '/content/text')

'/content/text/Igbo Sents_all.txt'

In [None]:

# import os
#import shutil
dir_name = "/content/text"
text=""
for fname in os.listdir(dir_name):
  fname = os.path.join(dir_name, fname)
  with open(fname, "r", encoding="utf8") as datafile:
    text = text+"\n"+datafile.read()

with open("data.txt", "w", encoding="utf8") as datafile:
  datafile.write(text)

shutil.rmtree("text")

**2.  Import Transformers, Tokenizer and Train the tokenizer**

In [None]:
# We won't need TensorFlow here
!pip uninstall -y tensorflow

# Install `transformers` from master stating the version
!pip install git+https://github.com/huggingface/transformers
!pip list | grep -E 'transformers|tokenizers'

# Note: the versions recorded at an earlier notebook update (transformers 2.11.0, tokenizers 0.8.0rc1) are stale;
# this run used transformers 4.26.0.dev0 (see the config printed in section 4).

Found existing installation: tensorflow 2.9.2
Uninstalling tensorflow-2.9.2:
  Successfully uninstalled tensorflow-2.9.2
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting git+https://github.com/huggingface/transformers
  Cloning https://github.com/huggingface/transformers to /tmp/pip-req-build-v_5j3rey
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/transformers /tmp/pip-req-build-v_5j3rey
  Resolved https://github.com/huggingface/transformers to commit 862888a35834527fed61beaf42373423ffdbd216
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting huggingface-hub<1.0,>=0.11.0
  Downloading huggingface_hub-0.11.1-py3-none-any.whl (182 kB)

In [None]:

%%time 
from pathlib import Path
from tokenizers import ByteLevelBPETokenizer

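# Collect the paths of all .txt files under the working directory.
# Caution: with Google Drive mounted, this recursive glob also matches unrelated files
# under drive/ (e.g. vocab.txt and merges.txt, as the printed list shows);
# use paths = ["data.txt"] to train on the corpus file only.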
paths = [str(x) for x in Path(".").glob("**/*.txt")]
print(paths)

# Initialize a tokenizer
tokenizer = ByteLevelBPETokenizer()

# Customize training
tokenizer.train(files=paths, vocab_size=52_000, min_frequency=2, special_tokens=[
    "<s>",     # Beginning of sequence (BOS) or classifier (CLS) token
    "<pad>",   # Padding token
    "</s>",    # End of sequence (EOS) or separator (SEP) token
    "<unk>",   # Unknown token
    "<mask>",  # Masking token
])

['data.txt', 'drive/MyDrive/TEXT.txt', 'drive/MyDrive/IBO_BETA/LREC FINAL TRAINING/bert-base-multilingual-cased checkpoint-5600/vocab.txt', 'drive/MyDrive/IBO_BETA/LREC FINAL TRAINING/distilbert-base-uncased checkpoint-5600/vocab.txt', 'drive/MyDrive/IBO_BETA/LREC FINAL TRAINING/IgboBert-finetuned-ner checkpoint-5600/merges.txt', 'drive/MyDrive/IBO_BETA/LREC FINAL TRAINING/IgboBert_1e-4-finetuned-ner checkpoint-5600/merges.txt', 'drive/MyDrive/IBO_BETA/Results for IgboBert/results/checkpoint-500/merges.txt', 'drive/MyDrive/IBO_BETA/IgboBert/merges.txt', 'drive/MyDrive/IboBERT_2.0/IgboBert_2.0/merges.txt']
CPU times: user 28.2 s, sys: 9.15 s, total: 37.4 s
Wall time: 16.4 s


In [None]:
# The tokenizer is now trained. Saving it writes the two files that define the new IgboBert tokenizer:
# vocab.json (a list of the most frequent tokens, ranked by frequency) and merges.txt (the list of merges).
# We save both for later use.

!mkdir IgboBert_2.0
tokenizer.save_model("IgboBert_2.0")

['IgboBert_2.0/vocab.json', 'IgboBert_2.0/merges.txt']
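
To make the role of `merges.txt` more concrete, here is a toy sketch of how byte-pair encoding learns its merges. It is only an illustration: it works on characters of a tiny made-up word list rather than on bytes, and `learn_bpe` is a hypothetical helper, not the `tokenizers` implementation. As with the `min_frequency=2` setting used above, a pair is merged only if it occurs at least twice.

```python
from collections import Counter

def learn_bpe(words, num_merges, min_frequency=2):
    """Toy BPE: repeatedly merge the most frequent adjacent symbol pair."""
    # Each word starts as a tuple of single characters, stored with its count.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent pair of symbols, weighted by word frequency.
        pair_counts = Counter()
        for symbols, count in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break
        best, freq = max(pair_counts.items(), key=lambda kv: (kv[1], kv[0]))
        if freq < min_frequency:
            break  # no remaining pair is frequent enough to merge
        merges.append(best)
        # Rewrite every word with the chosen pair fused into one symbol.
        new_corpus = Counter()
        for symbols, count in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += count
        corpus = new_corpus
    return merges, corpus

# Made-up word list, for illustration only.
merges, corpus = learn_bpe(["nne", "nna", "nne", "nwa"], num_merges=3)
print(merges)        # ordered merge rules, the kind of content merges.txt holds
print(dict(corpus))  # words segmented with the learned merges
```

The ordered list of merges is what lets the tokenizer split unseen text in the same way at inference time.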

In [None]:
shutil.move('/content/config.json', '/content/IgboBert_2.0')

'/content/IgboBert_2.0/config.json'

**3. Initializing the Tokenizer**

Let's reload the tokenizer from the saved files, so that we can use it like any other `from_pretrained` tokenizer.

In [None]:
from tokenizers.implementations import ByteLevelBPETokenizer
from tokenizers.processors import BertProcessing

tokenizer = ByteLevelBPETokenizer(
    "./IgboBert_2.0/vocab.json",
    "./IgboBert_2.0/merges.txt",
)

In [None]:
tokenizer._tokenizer.post_processor = BertProcessing(
    ("</s>", tokenizer.token_to_id("</s>")),
    ("<s>", tokenizer.token_to_id("<s>")),
)
tokenizer.enable_truncation(max_length=512)

In [None]:
tokenizer.encode("Simone gara ụka ụnyahụ guọ egwu ma ga-kwa taa.", "Aha ya bụ ifeoma.").tokens
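
The `BertProcessing` post-processor set above wraps encoded text in the special tokens: `<s>` at the start, and `</s>` after each sentence. The sketch below imitates that template on plain Python lists so the expected shape of `.tokens` is easier to see. `bert_style_post_process` is a hypothetical helper for illustration, the token lists are made up (real byte-level BPE tokens look different), and truncation is not modelled.

```python
def bert_style_post_process(tokens_a, tokens_b=None, cls="<s>", sep="</s>"):
    """Wrap one or two token lists in a BERT-style template."""
    out = [cls] + list(tokens_a) + [sep]
    if tokens_b is not None:
        # A second sentence is appended and closed with its own separator.
        out += list(tokens_b) + [sep]
    return out

# Made-up token lists standing in for real tokenizer output.
single = bert_style_post_process(["Aha", "ya"])
pair = bert_style_post_process(["Aha", "ya"], ["ifeoma"])
print(single)
print(pair)
```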

In [None]:
# Check that we have a GPU
!nvidia-smi

Thu Jan 19 22:11:44 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   38C    P0    25W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [None]:
# Check that PyTorch sees it
import torch
torch.cuda.is_available()

True

In [None]:
# For training, we need a raw (not pre-trained) RobertaForMaskedLM.
# To create it, we first build a RoBERTa config object describing the parameters we'd like to initialize IgboBERT with.

from transformers import RobertaConfig

config = RobertaConfig(
    vocab_size=52_000,
    max_position_embeddings=514,
    num_attention_heads=12,
    num_hidden_layers=6,
    type_vocab_size=1,
)


In [None]:
from transformers import RobertaTokenizerFast

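# `model_max_length` is the current name of the deprecated `max_len`; a tokenizer does not take a model config.
tokenizer = RobertaTokenizerFast.from_pretrained("./IgboBert_2.0", model_max_length=512)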

In [None]:
#We import and initialize our RoBERTa model with a language modeling (LM) head.

from transformers import RobertaForMaskedLM
model = RobertaForMaskedLM(config=config)

In [None]:
model.num_parameters()
# => 83 million parameters

83504416
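
The parameter count can be checked by hand from the config. The sketch below is a back-of-the-envelope calculation, not library code: `roberta_mlm_param_count` is a hypothetical helper, and it assumes the hidden size (768) and intermediate size (3072) shown in the config printed in section 4, an output projection tied to the word embeddings, and no pooling layer.

```python
def roberta_mlm_param_count(vocab_size=52_000, hidden=768, layers=6,
                            intermediate=3072, max_positions=514, type_vocab=1):
    """Count trainable parameters of a RoBERTa masked-LM under the stated assumptions."""
    layer_norm = 2 * hidden  # weight + bias

    def linear(n_in, n_out):
        return n_in * n_out + n_out  # weight matrix + bias

    # Word, position and token-type embeddings, followed by a LayerNorm.
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
    # Self-attention: query, key, value and output projections, plus LayerNorm.
    attention = 4 * linear(hidden, hidden) + layer_norm
    # Feed-forward block: two linear layers plus LayerNorm.
    feed_forward = linear(hidden, intermediate) + linear(intermediate, hidden) + layer_norm
    encoder = layers * (attention + feed_forward)
    # LM head: dense layer, LayerNorm and an output bias per vocabulary entry;
    # the output weight matrix is shared with the word embeddings, so it adds nothing.
    lm_head = linear(hidden, hidden) + layer_norm + vocab_size
    return embeddings + encoder + lm_head

total = roberta_mlm_param_count()
print(total)
```

Roughly half of the parameters sit in the 52,000-entry embedding table, which is why the vocabulary size matters so much for a 6-layer model.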

In [None]:
%%time
from transformers import LineByLineTextDataset

dataset = LineByLineTextDataset(
    tokenizer = tokenizer,
    file_path = "/content/data.txt",
    block_size = 128
)



CPU times: user 31.8 s, sys: 1.25 s, total: 33 s
Wall time: 15.9 s


In [None]:
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)
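
With `mlm=True` and `mlm_probability=0.15`, the collator picks about 15% of the non-special tokens as prediction targets. By default, 80% of those are replaced with `<mask>`, 10% with a random token, and 10% are left unchanged; every other position gets the label `-100` so the loss ignores it. The sketch below reproduces that rule on a plain list of ids. `mask_tokens` is a hypothetical helper written for illustration, not the library's code, and the special-token ids 0 to 4 are assumed from the order of the special tokens given to the tokenizer trainer.

```python
import random

def mask_tokens(token_ids, special_ids, mask_id, vocab_size,
                mlm_probability=0.15, rng=None):
    """Return (inputs, labels) for masked language modelling on one sequence."""
    rng = rng or random.Random()
    inputs, labels = list(token_ids), []
    for i, tok in enumerate(token_ids):
        # Special tokens are never selected; other tokens are selected with p = 0.15.
        if tok in special_ids or rng.random() >= mlm_probability:
            labels.append(-100)  # position ignored by the loss
            continue
        labels.append(tok)  # the model must predict the original token here
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id                    # 80%: replace with <mask>
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # 10%: replace with a random token
        # remaining 10%: keep the original token
    return inputs, labels

# Made-up sequence: <s> (0), twenty ordinary ids, </s> (2).
ids = [0] + list(range(100, 120)) + [2]
inputs, labels = mask_tokens(ids, special_ids={0, 1, 2, 3, 4}, mask_id=4,
                             vocab_size=52_000, rng=random.Random(0))
print(inputs)
print(labels)
```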

In [None]:
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./IgboBert_2.0",
    overwrite_output_dir=True,
    num_train_epochs=5,
    # Deprecated argument (see the warning below); `per_device_train_batch_size` is the current name.
    per_gpu_train_batch_size=64,
    save_steps=10_000,
    save_total_limit=2,
    prediction_loss_only=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
)

Using deprecated `--per_gpu_train_batch_size` argument which will be removed in a future version. Using `--per_device_train_batch_size` is preferred.


In [None]:
%%time
trainer.train()

***** Running training *****
  Num examples = 402581
  Num Epochs = 5
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 64
  Gradient Accumulation steps = 1
  Total optimization steps = 31455
  Number of trainable parameters = 83504416
You're using a RobertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Step | Training Loss |
|------|---------------|
| 500  | 6.4481 |
| 1000 | 5.3484 |
| 1500 | 4.9074 |
| 2000 | 4.6166 |
| 2500 | 4.3902 |
| 3000 | 4.2323 |
| 3500 | 4.071  |
| 4000 | 3.9431 |
| 4500 | 3.8655 |
| 5000 | 3.7913 |


Saving model checkpoint to ./IgboBert_2.0/checkpoint-10000
Configuration saved in ./IgboBert_2.0/checkpoint-10000/config.json
Model weights saved in ./IgboBert_2.0/checkpoint-10000/pytorch_model.bin
Saving model checkpoint to ./IgboBert_2.0/checkpoint-20000
Configuration saved in ./IgboBert_2.0/checkpoint-20000/config.json
Model weights saved in ./IgboBert_2.0/checkpoint-20000/pytorch_model.bin
Saving model checkpoint to ./IgboBert_2.0/checkpoint-30000
Configuration saved in ./IgboBert_2.0/checkpoint-30000/config.json
Model weights saved in ./IgboBert_2.0/checkpoint-30000/pytorch_model.bin
Deleting older checkpoint [IgboBert_2.0/checkpoint-10000] due to args.save_total_limit


Training completed. Do not forget to share your model on huggingface.co/models =)




CPU times: user 5h 37min 21s, sys: 30.3 s, total: 5h 37min 52s
Wall time: 5h 36min 37s


TrainOutput(global_step=31455, training_loss=3.216083864846606, metrics={'train_runtime': 20197.041, 'train_samples_per_second': 99.663, 'train_steps_per_second': 1.557, 'total_flos': 3.5219237384375616e+16, 'train_loss': 3.216083864846606, 'epoch': 5.0})
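
The step count and throughput in this log follow directly from the dataset size and batch size. The arithmetic below is a quick consistency check, assuming a single device and no gradient accumulation, as the log reports; all numbers are copied from the output above.

```python
import math

num_examples = 402_581       # "Num examples" in the training log
batch_size = 64              # total train batch size
epochs = 5
train_runtime_s = 20197.041  # "train_runtime" in TrainOutput

# The last, partially filled batch of each epoch still counts as one step.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * epochs
samples_per_second = num_examples * epochs / train_runtime_s
steps_per_second = total_steps / train_runtime_s

print(steps_per_epoch, total_steps)
print(round(samples_per_second, 3), round(steps_per_second, 3))
```

With `save_steps=10_000`, this also explains why checkpoints appear at steps 10000, 20000 and 30000 only.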

In [None]:
trainer.save_model("./IgboBert_2.0")

Saving model checkpoint to ./IgboBert_2.0
Configuration saved in ./IgboBert_2.0/config.json
Model weights saved in ./IgboBert_2.0/pytorch_model.bin


# **4. Test the Model**

We first create a `fill-mask` pipeline from the saved model and tokenizer, then test the model on sentences that each contain one `<mask>` token:



In [None]:
from transformers import pipeline

fill_mask = pipeline(
    "fill-mask",
    model="./IgboBert_2.0",
    tokenizer="./IgboBert_2.0"
)

loading configuration file ./IgboBert_2.0/config.json
Model config RobertaConfig {
  "_name_or_path": "./IgboBert_2.0",
  "architectures": [
    "RobertaForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 6,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "torch_dtype": "float32",
  "transformers_version": "4.26.0.dev0",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 52000
}


In [None]:
fill_mask("Abụ m Maazị <mask>.") # expected: okafor/Ọkafọ

[{'score': 0.02693532407283783,
  'token': 355,
  'token_str': ' A',
  'sequence': 'Abụ m Maazị A.'},
 {'score': 0.011763395741581917,
  'token': 455,
  'token_str': ' M',
  'sequence': 'Abụ m Maazị M.'},
 {'score': 0.009665265679359436,
  'token': 610,
  'token_str': ' T',
  'sequence': 'Abụ m Maazị T.'},
 {'score': 0.008098028600215912,
  'token': 3380,
  'token_str': ' Nwaanyị',
  'sequence': 'Abụ m Maazị Nwaanyị.'},
 {'score': 0.007046069949865341,
  'token': 6993,
  'token_str': ' Ọkafọ',
  'sequence': 'Abụ m Maazị Ọkafọ.'}]
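
The `score` values in this output are probabilities: the pipeline applies a softmax over the vocabulary logits at the `<mask>` position and reports the highest-scoring tokens (five by default), which is why the listed scores do not sum to 1. A minimal sketch of that step in plain Python, with made-up logits for a six-token vocabulary; `top_k_softmax` is a hypothetical helper, not the pipeline's code.

```python
import math

def top_k_softmax(logits, k=5):
    """Convert logits to probabilities and return the k most likely (index, prob) pairs."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(i, probs[i]) for i in ranked[:k]]

# Made-up logits, for illustration only.
toy_logits = [2.0, 1.0, 0.5, 0.0, -1.0, -2.0]
top = top_k_softmax(toy_logits, k=3)
for token_id, prob in top:
    print(token_id, round(prob, 4))
```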

In [None]:
fill_mask("Nwaanyị na <mask> ji na akara.") # expected: eri

[{'score': 0.18506872653961182,
  'token': 292,
  'token_str': ' ya',
  'sequence': 'Nwaanyị na ya ji na akara.'},
 {'score': 0.037972841411828995,
  'token': 300,
  'token_str': ' ha',
  'sequence': 'Nwaanyị na ha ji na akara.'},
 {'score': 0.0356743261218071,
  'token': 911,
  'token_str': ' Naịjirịa',
  'sequence': 'Nwaanyị na Naịjirịa ji na akara.'},
 {'score': 0.0318823978304863,
  'token': 671,
  'token_str': ' nwaanyị',
  'sequence': 'Nwaanyị na nwaanyị ji na akara.'},
 {'score': 0.021502790972590446,
  'token': 774,
  'token_str': ' nwunye',
  'sequence': 'Nwaanyị na nwunye ji na akara.'}]

In [None]:
fill_mask("Chineke ga- ebibikwa ndị niile na- eme ihe <mask>.") # expected: ọjọọ

[{'score': 0.1908637136220932,
  'token': 760,
  'token_str': ' ọjọọ',
  'sequence': 'Chineke ga- ebibikwa ndị niile na- eme ihe ọjọọ.'},
 {'score': 0.144007608294487,
  'token': 266,
  'token_str': ' a',
  'sequence': 'Chineke ga- ebibikwa ndị niile na- eme ihe a.'},
 {'score': 0.13014595210552216,
  'token': 445,
  'token_str': ' niile',
  'sequence': 'Chineke ga- ebibikwa ndị niile na- eme ihe niile.'},
 {'score': 0.10273794829845428,
  'token': 518,
  'token_str': ' ọma',
  'sequence': 'Chineke ga- ebibikwa ndị niile na- eme ihe ọma.'},
 {'score': 0.02662363275885582,
  'token': 412,
  'token_str': ' anya',
  'sequence': 'Chineke ga- ebibikwa ndị niile na- eme ihe anya.'}]

In [None]:
fill_mask("ọba akwụkwọ Ọkammụta Kenneth Dike dị <mask>.") # expected: n'Awka

In [None]:
fill_mask("Nwaanyị na eri <mask> na akara.") # expected: ji

In [None]:
fill_mask("Gaanụ mee ndị <mask> niile ka ha bụrụ ndị na- eso ụzọ m  .") # expected: mba


In [None]:
fill_mask("Jehova họpụtara Mozis ka ọ bụrụ onye ndú ụmụ <mask>.") # expected: Izrel


In [None]:
fill_mask("Ụmụakwụkwọ Chibok anọọla ụbọchị 2000 n’ aka <mask> Haram.") # expected: Boko


In [None]:
fill_mask("Nwunye Gọvanọ Ekiti steeti bụ Bisi Fayemi so na ndị na- akwado <mask> ọhụrụ a.") # expected: iwu


In [None]:
fill_mask(" <mask> sị ka ehiwe ụlọikpe pụrụiche maka mpụ.") # expected: Buhari


In [None]:
fill_mask("Ala <mask>  ga- eweta ezi ọnọdụ nchekwa maka ndị chọrọ ịwebata ego n’ ọrụ ugbo.") # expected: Naịjirịa


In [None]:
fill_mask("Ọ bụ <mask>a ka a na- arịa .") # expected: mmadụ

In [None]:
# Mount Google Drive (it was already mounted at /content/drive earlier, which is the path the next cell uses)
from google.colab import drive
drive.mount('/content/gdrive')

MessageError: ignored

In [None]:
# Move the trained model folder to Google Drive
shutil.move('/content/IgboBert_2.0','/content/drive/MyDrive/IboBERT_2.0')

'/content/drive/MyDrive/IboBERT_2.0/IgboBert_2.0'