
# Help BOBAI: Classify an unknown language

<img src="https://drive.google.com/uc?id=1Hvgrrah-T7yFTzDP002XuRodhyfY1Hju" width="750">

## Background
Bob's AI start-up, Bobai, builds AI solutions for other companies which have to process large volumes of text in their daily tasks. Bobai serve companies from all over the world, and they pride themselves on their ability to handle a variety of languages, from English, through Arabic to Mandarin. The secret to Bobai's success is that all of their products are based on a strong multilingual language encoder, mBERT. Bobai's infrastructure is actually highly optimized for this specific language encoder, which makes their products super fast and efficient, i.e. very attractive to clients.

## Task

But mBERT is trained on just 101 languages. So what happens when one of Bobai's biggest clients, Amoira, requests support for a new language X that is not among those 101 languages? Bob and his team have to find a way to meet this request, as they cannot risk losing the client.

The data Amoira has provided consists of a small labeled dataset for text classification and a larger corpus of raw text in the language.

To make things even more complicated, Amoira has encrypted the data, as they don't want to risk competitors finding out which new market they are targeting.

Bob has found out that his team currently has no bandwidth to develop this product, so he is asking for your help. He has shared the baseline solution he uses for languages that mBERT already supports, so you can start by checking how well this solution does and then modify it to obtain better results. You should not waste any effort trying to decrypt the data: this will not help you build a better classifier, and it will get you in trouble with Bob!

Your task is to build the best text classifier for language X that you can, while operating within the constraints of Bobai:

*   The classifier has to be based on mBERT (and cannot use any additional pre-trained language encoder).
*   The classifier has to train in under 8 hours on an L4 GPU, as the compute resources of the company are limited.
*   The classifier has to perform inference on any random 500 data samples in under 5 minutes (Bobai will then apply their optimization tricks to bring this time even further down).
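
The inference limit above can be checked with a simple wall-clock timer. The following is a minimal sketch and not part of the original baseline: `time_inference` and `dummy_predict` are hypothetical names, and the dummy function stands in for a real prediction call such as `trainer.predict`.

```python
import time

INFERENCE_BUDGET_SECONDS = 5 * 60  # 5-minute limit stated in the constraints
NUM_SAMPLES = 500                  # sample count stated in the constraints


def time_inference(predict_fn, samples):
    """Run predict_fn on samples and return (predictions, elapsed seconds)."""
    start = time.perf_counter()
    predictions = predict_fn(samples)
    elapsed = time.perf_counter() - start
    return predictions, elapsed


def dummy_predict(samples):
    # Hypothetical stand-in for the real model: always predicts label 0.
    return [0 for _ in samples]


samples = ["placeholder text"] * NUM_SAMPLES
predictions, elapsed = time_inference(dummy_predict, samples)
within_budget = elapsed < INFERENCE_BUDGET_SECONDS
print(f"{len(predictions)} predictions in {elapsed:.4f}s, within budget: {within_budget}")
```

To time the real model, replace `dummy_predict` with a function that tokenizes the 500 samples and runs the classifier on them, so that preprocessing is included in the measurement.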

## Deliverables

You need to submit:


*   Your model predictions on the test inputs that we will provide 48 hours before the deadline.
  * saved as a text file in the format shown at the bottom of the notebook
*   Your best trained model.
  * as a link to the Huggingface Hub (read up on `push_to_hub` [here](push_to_hub)).
*   Working code that can be used to reproduce your best trained model.
  * In this Colab notebook.


## Prerequisites


### HuggingFace configuration

The steps below need to be completed by the team leader:

1. Create a team account on [HuggingFace](https://huggingface.co/) using the Gmail account provided by the IOAI organizers.

2. Go to the [IOAI HuggingFace repo](https://huggingface.co/InternationalOlympiadAI) and request access to all datasets.

3. In settings, create two Access Tokens, one with read rights, one with write rights, and store those in [Colab Secrets](https://www.youtube.com/watch?v=q87i2LZbbPc) as `hf_read` and `hf_write`, respectively.

In [None]:
from google.colab import userdata

read_access_token = userdata.get('hf_read')
write_access_token = userdata.get('hf_write')

### Dependencies

In [None]:
import importlib.util
import torch, transformers

# install the expected library versions if a different one is present
if '2.3.0' not in torch.__version__:
  !pip install torch==2.3.0
if transformers.__version__ != '4.41.2':
  !pip install transformers==4.41.2

# install the remaining dependencies only if `datasets` is missing
if importlib.util.find_spec('datasets') is None:
  !pip install datasets==2.18.0
  !pip install evaluate==0.4.2
  !pip install accelerate -U


If you've just installed `accelerate`, execute `Runtime > Restart session and run all` in the Colab UI menu above.

# Data

In [None]:
# load the data

from datasets import load_dataset, Dataset, DatasetDict

classification_dataset = load_dataset('InternationalOlympiadAI/NLP_problem', token=read_access_token)
raw_text = load_dataset('InternationalOlympiadAI/NLP_problem_raw', token=read_access_token)

# Baseline

In [None]:
# collect every distinct whitespace-separated word in the training texts,
# preserving the order of first appearance
unique_list = list(dict.fromkeys(
    word
    for line in classification_dataset['train']['text']
    for word in line.split()
))

print(unique_list)

['झ𑁣झच𑀪𑀢𑀟', '𑀣च', '𑀠न𑀞𑁦', 'ण𑀢', '𑀟च', '𑀫चझ𑁣', '𑀠च𑀟', '𑀲𑁦पन𑀪', 'च', 'ब𑁣𑀠ढ𑁦', 'ढचनत𑀫𑀢', 'ष', 'ब𑀱च𑀠𑀟च', '𑀢𑀟न𑀱च', 'णच𑀫चणच', 'णच𑀣𑀣च', '𑀞णच𑀟𑀣च', '𑀞𑁦', 'ढच𑀪च𑀤च𑀟च', '𑀣न𑀟𑀢णच', 'ढचढच𑀟ब𑀢𑀣च', 'चल𑀢णन𑀕', '𑀙णच𑀟', '𑀟च𑀘𑁦𑀪𑀢णच', '𑀞𑁦𑀱च𑀪', '𑀳𑀫𑁦𑀞च𑀪न', '𑀭𑁢', '𑀠नल𑀞𑀢𑀟', 'ध𑀣ध', 'त𑁣𑀪𑁣𑀟चख𑀢𑀪न𑀳𑀕', 'तनपच𑀪', 'पच', '𑀫च𑀟च', '𑀠च𑀪च𑀳च', 'लच𑀲𑀢णच', '𑀠𑀢ल𑀢णच𑀟', 'ठ', '𑀤न𑀱च', 'च𑀳𑀢ढ𑀢प𑀢', '𑀟𑀢ब𑁦𑀪𑀢च', '𑀠𑁦त𑁦', 'त𑁦', '𑀞णच𑀟𑀣च𑀪', 'ढ𑀢𑀪𑀢𑀦', '𑀞न𑀠च', '𑀠𑁦', '𑀘च𑀱𑁣', 'पच𑀡', 'च𑀠न𑀪𑀞च', '𑀤च', '𑀘च𑀟ण𑁦', '𑀳𑁣𑀘च', '𑀭𑀦𑀧𑀧𑀧', '𑀣चबच', '𑀳ण𑀪𑀢च', '𑀳चढ𑁣𑀣च', 'पन𑀪𑀞𑀢णणच', 'चढ𑀢𑀟', 'णच', '𑀞च𑀠चपच', '𑀞न', '𑀳च𑀟𑀢', '𑀞च𑀟', '𑀱च𑀳च𑀟', 'त𑀪𑁣चप𑀢च', 'ढ𑀪च𑀤𑀢ल', '𑀳च', 'च𑀲𑀢𑀪𑀞च', 'णच𑀠𑀠च', 'ढन𑀞चपच𑀪', '𑀞न𑀣𑀢𑀟', 'ढच𑀢', '𑀣चणच', 'तच𑀪ल𑁣𑀳', 'प𑁦ख𑁦𑀤', '𑀳च𑀟च𑀪', 'ण𑀢𑀟', '𑀪𑀢पचणच', 'पच𑀞च', 'ल𑁦𑀣च', '𑀳𑀫च𑀪𑀢𑀙च𑀪', 'ढन𑀫च𑀪𑀢', 'चप𑀢𑀞न𑀕', 'च𑀟', '𑀞च𑀢', '𑀱च', '𑀳𑀫च𑀢𑀣न𑀟', 'चप𑀢𑀞न', '𑀙𑀫च𑀪𑀢𑀙', 'चढन', 'ढ𑀢णच𑀪', '𑀳चढ𑁣𑀟', '𑀪च𑀫𑁣प𑁣𑀟', '𑀫चन𑀫च𑀱च𑀪', '𑀲च𑀪च𑀳𑀫𑀢', '𑀞न𑀟𑀳च', 'ढ𑀢𑀟𑀣𑀢बच', '𑀫चलच𑀞च', '𑀳𑀢णच𑀳च', 'णच𑀟𑀞𑀢𑀟', 'झचढ𑀢लच𑀪', '𑀢बढ𑁣', '𑀢𑀲𑁦च𑀟ण𑀢', '𑁣𑀞𑁣𑀱च𑀕', '𑀤चढ𑀢', 'ब𑀱च𑀠𑀟च𑀟', '𑀣𑁦लपच', '𑀠चप𑀳चण𑀢𑀟', '𑀠चपच𑀢𑀠च𑀞𑀢𑀟', 'ञच𑀟', 'पच𑀞च𑀪च𑀪', 
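
The list above contains every distinct word, however rare. As a rough sketch of how one might inspect such a vocabulary, the snippet below counts word frequencies on a toy corpus and lists the words that occur only once. The toy texts are made up for illustration and are not taken from the dataset; `toy_texts`, `unique_words`, and `singletons` are hypothetical names.

```python
from collections import Counter

# Toy corpus (made up for illustration; not taken from the dataset).
toy_texts = ["aa bb aa", "bb cc", "dd aa"]

# Count how often each whitespace-separated word occurs.
word_counts = Counter(word for line in toy_texts for word in line.split())

# Words in first-seen order, mirroring the unique_list built above.
unique_words = list(word_counts)

# Words that occur exactly once in the corpus.
singletons = [word for word, count in word_counts.items() if count == 1]

print(unique_words)
print(singletons)
```

Running the same count on the real training texts would show how many of the added tokens are backed by only a single occurrence.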

In [None]:
# load the pre-trained tokenizer and use it to process the data

from transformers import AutoTokenizer
from transformers import DataCollatorWithPadding

tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-multilingual-uncased")
tokenizer.add_tokens(unique_list)

def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True)

tokenized_data = classification_dataset.map(preprocess_function, batched=True)

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

Map: 1524 examples

Map: 218 examples

In [None]:
# define the evaluation metric

import evaluate
import numpy as np

f1 = evaluate.load("f1")
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return f1.compute(predictions=predictions, references=labels, average='macro')

In [None]:
# redefine the evaluation metrics to report accuracy alongside macro-F1

import evaluate
import numpy as np

f1 = evaluate.load("f1")
accuracy = evaluate.load('accuracy', trust_remote_code=True)

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    accuracy_score = accuracy.compute(predictions=predictions, references=labels)
    f1_score = f1.compute(predictions=predictions, references=labels, average='macro')
    return {"accuracy": accuracy_score['accuracy'], "f1": f1_score['f1']}
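
To make the `average='macro'` setting concrete, here is a small hand-rolled sketch of macro-F1 on made-up labels. It is not the `evaluate` implementation: `macro_f1` is a hypothetical helper, and it assumes the average is taken over a fixed number of classes, with a class that has no true or predicted instances scoring 0.

```python
def macro_f1(predictions, references, num_labels):
    """Unweighted mean of the per-class F1 scores."""
    per_class = []
    for label in range(num_labels):
        pairs = list(zip(predictions, references))
        tp = sum(p == label and r == label for p, r in pairs)
        fp = sum(p == label and r != label for p, r in pairs)
        fn = sum(p != label and r == label for p, r in pairs)
        denom = 2 * tp + fp + fn
        # Assumption of this sketch: an empty class contributes an F1 of 0.
        per_class.append(2 * tp / denom if denom else 0.0)
    return sum(per_class) / num_labels


# Made-up example with three classes.
refs = [0, 0, 1, 1, 2, 2]
preds = [0, 0, 1, 0, 2, 1]

score = macro_f1(preds, refs, num_labels=3)
accuracy_value = sum(p == r for p, r in zip(preds, refs)) / len(refs)
print(f"macro-F1: {score:.4f}, accuracy: {accuracy_value:.4f}")
```

Here the per-class F1 scores are 0.8, 0.5, and 2/3, so macro-F1 is their plain average. Because every class counts equally, a class the model rarely gets right pulls the score down even when overall accuracy looks acceptable.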

In [None]:
# define the model and the training configuration

from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

model = AutoModelForSequenceClassification.from_pretrained(
    "google-bert/bert-base-multilingual-uncased", num_labels=5
)
model.resize_token_embeddings(len(tokenizer))

training_args = TrainingArguments(
    output_dir="baseline_bobai",
    learning_rate=1e-5,
    per_device_train_batch_size=64,
    per_device_eval_batch_size=64,
    num_train_epochs=20,
    weight_decay=0.01,
    eval_strategy="epoch",
    save_strategy="epoch",
    save_total_limit=5,
    metric_for_best_model='f1',
    load_best_model_at_end=True,
    push_to_hub=True,
    hub_strategy="checkpoint",
    hub_token=write_access_token,
    hub_private_repo=True,
    hub_model_id='baseline_bobai',
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_data["train"],
    eval_dataset=tokenized_data["dev"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google-bert/bert-base-multilingual-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
# execute the model training
trainer.train()

| Epoch | Training Loss | Validation Loss | Accuracy | F1 |
|---|---|---|---|---|
| 1 | No log | 1.586494 | 0.238532 | 0.101375 |
| 2 | No log | 1.57386 | 0.348624 | 0.252223 |
| 3 | No log | 1.543689 | 0.330275 | 0.227792 |
| 4 | No log | 1.467001 | 0.380734 | 0.312445 |
| 5 | No log | 1.371762 | 0.444954 | 0.345731 |
| 6 | No log | 1.329822 | 0.449541 | 0.364761 |
| 7 | No log | 1.222147 | 0.509174 | 0.469311 |
| 8 | No log | 1.161659 | 0.555046 | 0.510818 |
| 9 | No log | 1.07451 | 0.591743 | 0.546226 |
| 10 | No log | 1.066668 | 0.600917 | 0.559937 |


TrainOutput(global_step=480, training_loss=0.9687404632568359, metrics={'train_runtime': 198.2851, 'train_samples_per_second': 153.718, 'train_steps_per_second': 2.421, 'total_flos': 323279694616296.0, 'train_loss': 0.9687404632568359, 'epoch': 20.0})
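
The reported `global_step=480` can be reproduced from the training configuration. The arithmetic below is a sketch that assumes a single GPU, no gradient accumulation, and a partial last batch that is kept; the training-set size of 1524 is the first count in the `Map` output and is consistent with the reported throughput multiplied by the runtime.

```python
import math

train_examples = 1524   # first count shown in the Map output
batch_size = 64         # per_device_train_batch_size
epochs = 20             # num_train_epochs

# The last batch of each epoch is partial, so round up.
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs

# Cross-check: samples per second * runtime should be close to examples * epochs.
samples_processed = 153.718 * 198.2851

print(steps_per_epoch, total_steps, round(samples_processed))
```

With roughly 200 seconds for 20 epochs, the baseline uses only a small fraction of the 8-hour training budget.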

In [None]:
print(tokenized_data["train"][0])
print(tokenizer.decode(tokenized_data["train"][0]['input_ids']))

{'text': 'झ𑁣झच𑀪𑀢𑀟 𑀣च 𑀠न𑀞𑁦 ण𑀢 𑀟च 𑀫चझ𑁣 𑀠च𑀟 𑀲𑁦पन𑀪 च ब𑁣𑀠ढ𑁦 𑀣च ढचनत𑀫𑀢 ष ब𑀱च𑀠𑀟च 𑀢𑀟न𑀱च णच𑀫चणच', 'label': 0, 'input_ids': [101, 105879, 105880, 105881, 105882, 105883, 105884, 105885, 105886, 552, 105887, 105880, 105888, 578, 105889, 105890, 105891, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
[CLS] झ𑁣झच𑀪𑀢𑀟 𑀣च 𑀠न𑀞𑁦 ण𑀢 𑀟च 𑀫चझ𑁣 𑀠च𑀟 𑀲𑁦पन𑀪 च ब𑁣𑀠ढ𑁦 𑀣च ढचनत𑀫𑀢 ष ब𑀱च𑀠𑀟च 𑀢𑀟न𑀱च णच𑀫चणच [SEP]
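
In the output above, most IDs are 105879 or higher, which suggests that almost every word is represented by a newly added token rather than by a pre-trained one. The sketch below counts both kinds for this example. It assumes the original vocabulary has 105879 entries, which is consistent with the first added word receiving exactly that ID; `BASE_VOCAB_SIZE` and `SPECIAL_IDS` are names introduced here for illustration.

```python
BASE_VOCAB_SIZE = 105879  # assumed size of the original vocabulary
SPECIAL_IDS = {101, 102}  # [CLS] and [SEP], as decoded above

# input_ids copied from the printed example.
input_ids = [101, 105879, 105880, 105881, 105882, 105883, 105884, 105885,
             105886, 552, 105887, 105880, 105888, 578, 105889, 105890,
             105891, 102]

content_ids = [i for i in input_ids if i not in SPECIAL_IDS]
added = [i for i in content_ids if i >= BASE_VOCAB_SIZE]
original = [i for i in content_ids if i < BASE_VOCAB_SIZE]

print(f"added tokens: {len(added)}, original-vocabulary tokens: {len(original)}")
```

Only the two single-character words in this sentence map to entries of the original vocabulary; the other 14 positions use added tokens.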


In [None]:
text = "झ𑁣झच𑀪𑀢𑀟 𑀣च 𑀠न𑀞𑁦"
tokens = list(text)  # Character-level tokenization
token_ids = tokenizer.convert_tokens_to_ids(tokens)

print(token_ids)
print(tokenizer.decode(token_ids))

[555, 100, 555, 552, 100, 100, 100, 100, 100, 552, 100, 100, 566, 100, 100]
झ [UNK] झ च [UNK] [UNK] [UNK] [UNK] [UNK] च [UNK] [UNK] न [UNK] [UNK]
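
The share of unknown characters in the output above can be computed directly from the printed IDs. This is a small sketch using only the values shown; `UNK_ID` is a name introduced here for the ID that decodes to `[UNK]`.

```python
UNK_ID = 100  # ID that decodes to [UNK] in the output above

# token_ids copied from the printed output.
token_ids = [555, 100, 555, 552, 100, 100, 100, 100, 100, 552, 100, 100, 566, 100, 100]

unk_count = token_ids.count(UNK_ID)
unk_rate = unk_count / len(token_ids)

print(f"{unk_count}/{len(token_ids)} characters map to [UNK] ({unk_rate:.0%})")
```

Ten of the fifteen characters, including the two spaces, are unknown to the tokenizer, so a purely character-level lookup loses most of the input without further vocabulary changes.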


# Inference

In [None]:
# run the trained model on a dev/test split
data_split = "dev"
eval_out = trainer.predict(tokenized_data[data_split])
predictions = eval_out.predictions.argmax(1)
labels = eval_out.label_ids
# macro-F1 can only be computed when the split has gold labels (e.g. dev)
if labels is not None:
    dev_f1 = f1.compute(predictions=predictions, references=labels, average='macro')
    print(dev_f1)

In [None]:
# write the predictions to a file
with open('{}_predictions.txt'.format(data_split), 'w') as outfile:
  outfile.write('\n'.join([str(p) for p in predictions.tolist()]))
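
Before submitting, it may help to read the file back and confirm that it has one valid label per line. The following is a minimal sketch under the assumption that labels are integers from 0 to 4, matching `num_labels=5` in the model definition; `read_predictions` and the example file name are hypothetical.

```python
import os
import tempfile

NUM_LABELS = 5  # matches num_labels in the model definition


def read_predictions(path):
    """Read whitespace-separated integer labels and check their range."""
    with open(path) as infile:
        labels = [int(token) for token in infile.read().split()]
    if not all(0 <= label < NUM_LABELS for label in labels):
        raise ValueError("label outside the expected range")
    return labels


# Write a small file in the same format as above, then read it back.
example_predictions = [0, 3, 4, 1]
path = os.path.join(tempfile.mkdtemp(), 'example_predictions.txt')
with open(path, 'w') as outfile:
    outfile.write('\n'.join(str(p) for p in example_predictions))

loaded = read_predictions(path)
print(loaded)
```

For the real submission, the number of loaded labels should also equal the number of test inputs.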