# Help BOBAI: Classify an unknown language

<img src="https://drive.google.com/uc?id=1Hvgrrah-T7yFTzDP002XuRodhyfY1Hju" width="750">

## Background
Bob's AI start-up, Bobai, builds AI solutions for other companies which have to process large volumes of text in their daily tasks. Bobai serve companies from all over the world, and they pride themselves on their ability to handle a variety of languages, from English through Arabic to Mandarin. The secret to Bobai's success is that all of their products are based on a strong multilingual language encoder, mBERT. Bobai's infrastructure is highly optimized for this specific language encoder, which makes their products super fast and efficient, i.e. very attractive to clients.

## Task

But mBERT is trained on just 101 languages. So what happens when one of Bobai's biggest clients, Amoira, requests support for a new language X that is not among those 101 languages? Bob and his team have to find a way to meet this request, as they cannot risk losing the client.

The data Amoira has provided consists of a small labeled dataset for text classification and a larger corpus of raw text in the language.

To make things even more complicated, Amoira has encrypted the data, as they don't want to risk competitors finding out which new market they are targeting.

Bob has found out that at this time his team has no bandwidth to develop this product, so he is asking for your help. He has shared the baseline solution he uses for languages that mBERT already supports, so you can start by checking how well this solution does and then modify it to obtain better results. Do not waste any effort on trying to decrypt the data: this will not help you build a better classifier, and it will get you in trouble with Bob!

Your task is to build the best text classifier for language X that you can, while operating within the constraints of Bobai:

*   The classifier has to be based on mBERT (and cannot use any additional pre-trained language encoder).
*   The classifier has to train in under 8 hours on an L4 GPU, as the company's compute resources are limited.
*   The classifier has to perform inference on any random 500 data samples in under 5 minutes (Bobai will then apply their optimization tricks to bring this time even further down).
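
The inference budget in the last constraint can be checked early with a small timing harness. The sketch below is only an illustration: `time_inference`, `dummy_predict`, and the 300-second default are names and values chosen here, not part of Bobai's tooling, and `predict_fn` stands for whatever prediction function you end up with.

```python
import time

def time_inference(predict_fn, texts, budget_seconds=300.0):
    """Run predict_fn on texts and compare the elapsed time with a budget."""
    start = time.perf_counter()
    predictions = predict_fn(texts)
    elapsed = time.perf_counter() - start
    return predictions, elapsed, elapsed <= budget_seconds

# Hypothetical stand-in for a real model: always predicts class 0.
def dummy_predict(texts):
    return [0 for _ in texts]

sample = ["placeholder text"] * 500  # 500 samples, as in the constraint
preds, elapsed, within_budget = time_inference(dummy_predict, sample)
print(f"{len(preds)} predictions in {elapsed:.3f}s, within budget: {within_budget}")
```

Timing on the same hardware class as the target (an L4 GPU) gives the most meaningful number; a CPU-only run will usually overestimate the cost.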

## Deliverables

You need to submit:


*   Your model predictions on the test inputs that we will provide 48 hours before the deadline.
  * saved as a text file in the format shown at the bottom of the notebook
*   Your best trained model.
  * as a link to the Huggingface Hub (read up on `push_to_hub` [here](push_to_hub)).
*   Working code that can be used to reproduce your best trained model.
  * In this Colab notebook.
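
Judging by the final cell of the notebook, the predictions file holds one integer label per line. A minimal sketch for writing such a file and reading it back as a sanity check; the helper names, the made-up labels, and the temporary path are assumptions for illustration only.

```python
import os
import tempfile

def write_predictions(path, predictions):
    # One integer label per line, joined with newlines
    with open(path, "w") as outfile:
        outfile.write("\n".join(str(p) for p in predictions))

def read_predictions(path):
    # Parse the file back into a list of integers, skipping empty lines
    with open(path) as infile:
        return [int(line) for line in infile.read().splitlines() if line.strip()]

example_predictions = [0, 3, 1, 4, 2]  # made-up labels
path = os.path.join(tempfile.gettempdir(), "example_predictions.txt")
write_predictions(path, example_predictions)
round_trip = read_predictions(path)
print(round_trip == example_predictions)
```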


## Prerequisites


### HuggingFace configuration

The steps below need to be completed by the team leader:

1. Create a team account on [HuggingFace](https://huggingface.co/) using the Gmail account provided by the IOAI organizers.

2. Go to the [IOAI HuggingFace repo](https://huggingface.co/InternationalOlympiadAI) and request access to all datasets.

3. In settings, create two Access Tokens, one with read rights, one with write rights, and store those in [Colab Secrets](https://www.youtube.com/watch?v=q87i2LZbbPc) as `hf_read` and `hf_write`, respectively.

In [None]:
from google.colab import userdata

read_access_token = userdata.get('hf_read')
write_access_token = userdata.get('hf_write')

### Dependencies

In [None]:
import importlib
import torch, transformers

if '2.3.0' not in torch.__version__:
  !pip install torch==2.3.0
if transformers.__version__!='4.41.2':
  !pip install transformers==4.41.2

if importlib.util.find_spec('datasets') is None:
  !pip install datasets==2.18.0
  !pip install evaluate==0.4.2
  !pip install accelerate -U


If you've just installed `accelerate`, execute `Runtime > Restart session and run all` in the Colab UI menu above.

# Data

In [None]:
# load the data

from datasets import load_dataset, Dataset, DatasetDict

classification_dataset = load_dataset('InternationalOlympiadAI/NLP_problem', token=read_access_token)
raw_text = load_dataset('InternationalOlympiadAI/NLP_problem_raw', token=read_access_token)


Generating train split:   0%|          | 0/1524 [00:00<?, ? examples/s]

Generating dev split:   0%|          | 0/218 [00:00<?, ? examples/s]


Generating train split:   0%|          | 0/611245 [00:00<?, ? examples/s]
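
With only 1524 labeled training examples, it is worth looking at how the classes are distributed before training. The sketch below uses made-up labels; `label_distribution` is a hypothetical helper, and the `label` column name is assumed from the baseline's preprocessing code.

```python
from collections import Counter

def label_distribution(labels):
    """Return {label: (count, share)} sorted by label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: (count, count / total) for label, count in sorted(counts.items())}

# Made-up labels for illustration; in the notebook you would pass
# classification_dataset["train"]["label"] instead (assumed column name).
toy_labels = [0, 1, 1, 2, 2, 2, 3, 4, 4, 4]
for label, (count, share) in label_distribution(toy_labels).items():
    print(f"label {label}: {count} examples ({share:.0%})")
```

A strongly skewed distribution would be one reason to prefer macro-averaged F1 over accuracy when comparing models.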

# Baseline

In [None]:
import re
import unicodedata
from datasets import load_dataset
from transformers import PreTrainedTokenizerFast, DataCollatorWithPadding, DataCollatorForLanguageModeling
from transformers import AutoModelForMaskedLM, AutoModelForSequenceClassification, Trainer, TrainingArguments
from transformers import EarlyStoppingCallback
from tokenizers import Tokenizer, models, pre_tokenizers, decoders, processors, trainers
import evaluate
import numpy as np

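def clean_text(text):
    # Normalize Unicode compatibility forms (e.g. full-width characters)
    text = unicodedata.normalize('NFKC', text)
    # Trim the text and collapse whitespace runs into single spaces
    text = re.sub(r'\s+', ' ', text.strip())
    # Replace every run of digits with the <NUM> placeholder token
    text = re.sub(r'\d+', '<NUM>', text)
    # Collapse runs of two or more of the listed characters into one
    text = re.sub(r'([𑀯।॥]){2,}', r'\1', text)
    return text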

def create_custom_tokenizer(raw_texts, vocab_size=32000):
    tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))
    tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
    trainer = trainers.WordPieceTrainer(
        vocab_size=vocab_size,
        special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]", "<NUM>"],
        min_frequency=2,
        limit_alphabet=1000
    )

    cleaned_texts = [clean_text(text) for text in raw_texts]
    tokenizer.train_from_iterator(cleaned_texts, trainer=trainer)

    tokenizer.post_processor = processors.TemplateProcessing(
        single="[CLS] $A [SEP]",
        pair="[CLS] $A [SEP] $B:1 [SEP]:1",
        special_tokens=[
            ("[CLS]", tokenizer.token_to_id("[CLS]")),
            ("[SEP]", tokenizer.token_to_id("[SEP]")),
        ],
    )

    return PreTrainedTokenizerFast(
        tokenizer_object=tokenizer,
        unk_token="[UNK]",
        pad_token="[PAD]",
        cls_token="[CLS]",
        sep_token="[SEP]",
        mask_token="[MASK]",
    )

print("Creating custom tokenizer...")
custom_tokenizer = create_custom_tokenizer(raw_text['train']['text'])
print("Custom tokenizer created.")

def preprocess_function(examples):
    cleaned_texts = [clean_text(text) for text in examples["text"]]
    tokenized = custom_tokenizer(
        cleaned_texts,
        truncation=True,
        padding='max_length',
        max_length=256,
        return_tensors="pt"
    )
    if 'label' in examples:
        tokenized['labels'] = examples['label']
    return tokenized

print("Preprocessing datasets...")
tokenized_raw_text = raw_text.map(preprocess_function, batched=True, remove_columns=raw_text["train"].column_names)
train_val_split = tokenized_raw_text["train"].train_test_split(test_size=0.1, seed=42)
tokenized_raw_text["train"] = train_val_split["train"]
tokenized_raw_text["validation"] = train_val_split["test"]
tokenized_classification = classification_dataset.map(preprocess_function, batched=True, remove_columns=classification_dataset["train"].column_names)
print("Datasets preprocessed.")

Creating custom tokenizer...
Custom tokenizer created.
Preprocessing datasets...


Map:   0%|          | 0/611245 [00:00<?, ? examples/s]

Map:   0%|          | 0/1524 [00:00<?, ? examples/s]

Map:   0%|          | 0/218 [00:00<?, ? examples/s]

Datasets preprocessed.
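
The preprocessing above pads every example to 256 tokens (`padding='max_length'`), even though the data collators used later can pad each batch on the fly when sequences are left unpadded at tokenization time. The sketch below estimates how many token positions each strategy processes; the token counts are made up and both function names are hypothetical.

```python
def padded_tokens_fixed(lengths, max_length):
    # Every sequence is truncated or padded to exactly max_length
    return len(lengths) * max_length

def padded_tokens_dynamic(lengths, max_length, batch_size):
    # Each batch is padded only to its own longest (truncated) sequence
    clipped = [min(n, max_length) for n in lengths]
    total = 0
    for i in range(0, len(clipped), batch_size):
        batch = clipped[i:i + batch_size]
        total += len(batch) * max(batch)
    return total

toy_lengths = [12, 40, 33, 300, 18, 25, 60, 9]  # made-up token counts
fixed = padded_tokens_fixed(toy_lengths, max_length=256)
dynamic = padded_tokens_dynamic(toy_lengths, max_length=256, batch_size=4)
print(f"fixed padding: {fixed} positions, dynamic padding: {dynamic} positions")
```

Whether the saving matters in practice depends on the real length distribution of language X under the custom tokenizer, which this toy example does not model.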


In [None]:
mlm_data_collator = DataCollatorForLanguageModeling(tokenizer=custom_tokenizer, mlm=True, mlm_probability=0.15)

mlm_model = AutoModelForMaskedLM.from_pretrained("google-bert/bert-base-multilingual-uncased")
mlm_model.resize_token_embeddings(len(custom_tokenizer))

mlm_training_args = TrainingArguments(
    output_dir="./mlm_mbert",
    overwrite_output_dir=True,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    gradient_accumulation_steps=4,
    evaluation_strategy="steps",
    eval_steps=500,
    save_steps=500,
    logging_steps=100,
    learning_rate=1e-4,
    warmup_steps=1000,
    weight_decay=0.01,
    fp16=True,
    save_total_limit=2,
    load_best_model_at_end=True,
    metric_for_best_model="loss",
    greater_is_better=False,
    max_steps=20000,
    dataloader_num_workers=4,
    group_by_length=True,
    lr_scheduler_type="cosine_with_restarts",
    remove_unused_columns=True,
)

mlm_trainer = Trainer(
    model=mlm_model,
    args=mlm_training_args,
    data_collator=mlm_data_collator,
    train_dataset=tokenized_raw_text["train"],
    eval_dataset=tokenized_raw_text["validation"],
    callbacks=[EarlyStoppingCallback(early_stopping_patience=5)]
)

print("Starting MLM pre-training...")
mlm_trainer.train()
mlm_trainer.save_model("./mlm_mbert")
print("MLM pre-training completed and model saved.")

data_collator = DataCollatorWithPadding(tokenizer=custom_tokenizer)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/625 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/672M [00:00<?, ?B/s]

Some weights of the model checkpoint at google-bert/bert-base-multilingual-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
max_steps is given, it will override any value given in num_train_epochs


Starting MLM pre-training...


  self.pid = os.fork()


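| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 500  | 5.6862 | 5.449687 |
| 1000 | 4.8246 | 4.719574 |
| 1500 | 4.4106 | 4.270133 |
| 2000 | 4.1308 | 4.034754 |
| 2500 | 3.9446 | 3.837208 |
| 3000 | 3.7723 | 3.677743 |
| 3500 | 3.6673 | 3.58382 |
| 4000 | 3.5675 | 3.488495 |
| 4500 | 3.4963 | 3.400489 |
| 5000 | 3.4204 | 3.333559 |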


There were missing keys in the checkpoint model loaded: ['cls.predictions.decoder.weight', 'cls.predictions.decoder.bias'].


MLM pre-training completed and model saved.
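
To put the MLM log in context, the arithmetic below estimates how much of the raw corpus the logged steps covered and converts the validation loss into perplexity. It is a rough sketch: it assumes a single GPU, assumes the 10% validation hold-out is rounded up, and all function names are hypothetical.

```python
import math

def effective_batch_size(per_device_batch, grad_accum_steps, num_devices=1):
    # Number of examples that contribute to one optimizer step
    return per_device_batch * grad_accum_steps * num_devices

def epochs_covered(steps, batch_size, num_train_examples):
    # Fraction of the training set seen after a number of optimizer steps
    return steps * batch_size / num_train_examples

def perplexity(cross_entropy_loss):
    # The loss is a mean negative log-likelihood in nats
    return math.exp(cross_entropy_loss)

batch = effective_batch_size(16, 4)  # values from mlm_training_args
train_examples = 611245 - math.ceil(0.1 * 611245)  # assumed rounding of the split
print(f"effective batch size: {batch}, training examples: {train_examples}")
print(f"epochs after 5000 steps: {epochs_covered(5000, batch, train_examples):.2f}")
print(f"perplexity at step 500:  {perplexity(5.449687):.1f}")
print(f"perplexity at step 5000: {perplexity(3.333559):.1f}")
```

Under these assumptions the logged 5000 steps cover well under one pass over the raw text, and the validation loss is still decreasing at the last logged step.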


In [None]:
# define the evaluation metric

import evaluate
import numpy as np

f1 = evaluate.load("f1")
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return f1.compute(predictions=predictions, references=labels, average='macro')

Downloading builder script:   0%|          | 0.00/6.77k [00:00<?, ?B/s]
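
Macro-averaged F1 computes one F1 score per class and then takes their unweighted mean, so a rare class counts as much as a frequent one. The plain-Python version below is a re-implementation for illustration only, not the code behind `evaluate.load("f1")`; the labels are made up.

```python
def macro_f1(predictions, references):
    """Unweighted mean of per-class F1 over all labels that occur."""
    labels = sorted(set(references) | set(predictions))
    scores = []
    for label in labels:
        pairs = list(zip(predictions, references))
        tp = sum(p == label and r == label for p, r in pairs)
        fp = sum(p == label and r != label for p, r in pairs)
        fn = sum(p != label and r == label for p, r in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

refs = [0, 0, 0, 0, 1, 2]   # made-up gold labels
preds = [0, 0, 0, 0, 2, 2]  # made-up predictions; class 1 is never predicted
accuracy = sum(p == r for p, r in zip(preds, refs)) / len(refs)
print(f"accuracy: {accuracy:.3f}, macro F1: {macro_f1(preds, refs):.3f}")
```

In this toy case five of six predictions are correct, yet the macro F1 is much lower because class 1 is missed entirely.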

In [None]:
# define the model and the training configuration

from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

# Load the pre-trained model for classification
model = AutoModelForSequenceClassification.from_pretrained(
    "./mlm_mbert",
    num_labels=5,
    output_attentions=False,
    output_hidden_states=False,
)

# Set up classification training arguments
training_args = TrainingArguments(
    output_dir="./mlm_bobai",
    learning_rate=3e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=64,
    num_train_epochs=20,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    save_total_limit=2,
    metric_for_best_model='f1',
    load_best_model_at_end=True,
    push_to_hub=True,
    hub_strategy="checkpoint",
    hub_token=write_access_token,
    hub_private_repo=True,
    lr_scheduler_type="cosine_with_restarts",
    hub_model_id='bobai',
    fp16=True,
)

# Create classification trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_classification["train"],
    eval_dataset=tokenized_classification["dev"],
    tokenizer=custom_tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at ./mlm_mbert and are newly initialized: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
# execute the model training
trainer.train()

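| Epoch | Training Loss | Validation Loss | F1 |
|-------|---------------|-----------------|----|
| 1  | No log | 0.653808 | 0.66967 |
| 2  | No log | 0.505207 | 0.78334 |
| 3  | No log | 0.481955 | 0.800763 |
| 4  | No log | 0.763111 | 0.742865 |
| 5  | No log | 0.537884 | 0.84372 |
| 6  | No log | 0.747539 | 0.822097 |
| 7  | No log | 0.731021 | 0.834184 |
| 8  | No log | 0.778603 | 0.82867 |
| 9  | No log | 0.768963 | 0.843127 |
| 10 | No log | 0.791831 | 0.852189 |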


TrainOutput(global_step=960, training_loss=0.11180834714323282, metrics={'train_runtime': 137.4791, 'train_samples_per_second': 221.706, 'train_steps_per_second': 6.983, 'total_flos': 4009920491151360.0, 'train_loss': 0.11180834714323282, 'epoch': 20.0})
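
The reported `global_step=960` can be reproduced from the training configuration. The sketch below assumes a single GPU and no gradient accumulation, as in `training_args`; the function name is hypothetical.

```python
import math

def total_optimizer_steps(num_examples, batch_size, num_epochs):
    # The last batch of each epoch may be smaller, hence the ceiling
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch, steps_per_epoch * num_epochs

steps_per_epoch, total_steps = total_optimizer_steps(
    num_examples=1524, batch_size=32, num_epochs=20
)
print(f"{steps_per_epoch} steps per epoch, {total_steps} steps in total")
```

With so few steps per epoch, the first ten epochs finish before step 500, which would also explain the `No log` entries in the training-loss column, assuming the default `logging_steps` of 500.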

# Inference

In [None]:
# run the trained model on a dev/test split
data_split = "dev"
eval_out = trainer.predict(tokenized_classification[data_split])
predictions = eval_out.predictions.argmax(1)
labels = eval_out.label_ids
dev_f1 = f1.compute(predictions=predictions, references=labels, average='macro')
print(f"F1 score: {dev_f1['f1']:.4f}")

F1 score: 0.8522


In [None]:
# UPDATE THIS CELL ACCORDINGLY

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
# define a function to load your tokenizer and model from a HF path
# the path variables can be strings or lists of strings (for ensemble solutions)
def load_model(path_to_tokenizer, path_to_model, token):
    tokenizer = AutoTokenizer.from_pretrained(path_to_tokenizer, token=token)
    model = AutoModelForSequenceClassification.from_pretrained(path_to_model, token=token)
    model.eval()
    return tokenizer, model

# define a "predict" function that takes the model and a list of input strings
# and returns the outputs as a list of integer classes
def predict(tokenizer, model, input_texts):
    predictions = []
    for input_text in input_texts:
        input_text = clean_text(input_text)
        inputs = tokenizer(input_text, truncation=True, padding='max_length', max_length=256, return_tensors="pt")

        with torch.no_grad():
            outputs = model(**inputs)

        predictions.append(outputs.logits.argmax().item())

    return predictions

# set variables
path_to_model = "Romania1/bobai"  # Path to your saved model
path_to_tokenizer = "Romania1/bobai"  # Path to your saved tokenizer
model_access_token = write_access_token  # Use the same token as in your training code
data_split = "test"
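
The `predict` function above runs one text at a time and pads each to 256 tokens. One possible variant, sketched below, processes texts in batches with per-batch padding and moves the model to a GPU when one is available. `chunks`, `predict_batched`, and the `batch_size`, `max_length`, and `preprocess` parameters are assumptions introduced here; in the notebook, `preprocess` would be `clean_text`.

```python
def chunks(items, batch_size):
    """Yield consecutive slices of items with at most batch_size elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def predict_batched(tokenizer, model, input_texts, batch_size=32,
                    max_length=256, preprocess=None):
    import torch  # imported lazily so the chunking helper works without torch

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)
    model.eval()
    predictions = []
    for batch in chunks(list(input_texts), batch_size):
        if preprocess is not None:
            # Apply the same cleaning that was used at training time
            batch = [preprocess(text) for text in batch]
        # padding=True pads only to the longest sequence in this batch
        inputs = tokenizer(batch, truncation=True, padding=True,
                           max_length=max_length, return_tensors="pt")
        inputs = {name: tensor.to(device) for name, tensor in inputs.items()}
        with torch.no_grad():
            logits = model(**inputs).logits
        # One predicted class index per row of the batch
        predictions.extend(logits.argmax(dim=-1).tolist())
    return predictions

# Only the chunking helper is exercised here; no model is loaded.
batch_sizes = [len(b) for b in chunks(list(range(70)), 32)]
print(batch_sizes)
```

Because the extra parameters all have defaults, the function can still be called as `predict_batched(tokenizer, model, input_texts)`, matching the call signature used for `predict`.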

In [None]:
# DO NOT CHANGE THIS CELL!!!
from datasets import load_dataset, Dataset, DatasetDict

tokenizer, model = load_model(path_to_model, path_to_tokenizer, token=model_access_token)

test_data = load_dataset("InternationalOlympiadAI/NLP_problem_test")['test']['text']

predictions = predict(tokenizer, model, test_data)

with open('{}_predictions.txt'.format(data_split), 'w') as outfile:
  outfile.write('\n'.join([str(p) for p in predictions]))


Generating test split:   0%|          | 0/438 [00:00<?, ? examples/s]