# Help BOBAI: Classify an unknown language

<img src="https://drive.google.com/uc?id=1Hvgrrah-T7yFTzDP002XuRodhyfY1Hju" width="750">

## Background
Bob's AI start-up, Bobai, builds AI solutions for other companies which have to process large volumes of text in their daily tasks. Bobai serve companies from all over the world, and they pride themselves on their ability to handle a variety of languages, from English, through Arabic to Mandarin. The secret to Bobai's success is that all of their products are based on a strong multilingual language encoder, mBERT. Bobai's infrastructure is actually highly optimized for this specific language encoder, which makes their products super fast and efficient, i.e. very attractive to clients.

## Task

But mBERT is trained on just 101 languages. So what happens when one of Bobai's biggest clients, Amoira, requests support for a new language X that is not among those 101 languages? Bob and his team have to find a way to meet this request, as they cannot risk losing the client.

The data Amoira has provided consists of a small labeled dataset for text classification and a larger corpus of raw text in the language.

To make things even more complicated, Amoira has encrypted the data, as they don't want to risk competitors finding out which new market they are targeting.

Bob has found out that his team currently has no bandwidth to develop this product, so he is asking for your help. He has shared the baseline solution he uses for languages that mBERT already supports, so you can start by checking how well this solution does and then modify it to obtain better results. You should not waste any effort on trying to decrypt the data: this will not help you build a better classifier and it will get you in trouble with Bob!

Your task is to build the best text classifier for language X that you can, while operating within the constraints of Bobai:

*   The classifier has to be based on mBERT (and cannot use any additional pre-trained language encoder).
*   The classifier has to train in under 8 hours using an L4 GPU as the compute resources of the company are limited.
*   The classifier has to perform inference on any random 500 data samples in under 5 minutes (Bobai will then apply their optimization tricks to bring this time even further down).
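
The inference budget above can be checked locally before submitting. The following is a minimal sketch and not part of Bob's baseline: `check_inference_budget` and its parameters are hypothetical names, and `predict_fn` stands for any callable that maps a list of strings to a list of integer classes.

```python
import random
import time


def check_inference_budget(predict_fn, texts, n_samples=500, budget_seconds=300.0, seed=0):
    """Time predict_fn on a random sample and compare the result with the budget.

    Hypothetical helper: predict_fn takes a list of strings and returns a list of ints.
    """
    rng = random.Random(seed)
    # Sample without replacement, mirroring "any random 500 data samples".
    k = min(n_samples, len(texts))
    sample = rng.sample(list(texts), k)

    start = time.perf_counter()
    predictions = predict_fn(sample)
    elapsed = time.perf_counter() - start

    if len(predictions) != len(sample):
        raise ValueError("predict_fn must return one prediction per input text")
    # Second value is True when the run fits within the budget.
    return elapsed, elapsed <= budget_seconds
```

With a real model, `predict_fn` could be something like `lambda xs: predict(tokenizer, model, xs)`; the timing then depends on the hardware the cell runs on, so it is best measured on the same GPU type as the constraint names.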

## Deliverables

You need to submit:


*   Your model predictions on the test inputs that we will provide 48 hours before the deadline.
  * saved as a text file in the format shown at the bottom of the notebook
*   Your best trained model.
  * as a link to the Hugging Face Hub (read up on `push_to_hub` in the Hugging Face documentation).
*   Working code that can be used to reproduce your best trained model.
  * In this Colab notebook.
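
Before uploading, the predictions file can be sanity-checked. The sketch below assumes the format written by the last cell of this notebook (one integer label per line) and the five classes used by the baseline (`num_labels=5`); `validate_predictions_file` is a hypothetical helper name, not something the organizers provide.

```python
from pathlib import Path


def validate_predictions_file(path, expected_count, num_labels=5):
    """Return the parsed predictions, raising ValueError on a malformed file."""
    lines = Path(path).read_text().splitlines()
    if len(lines) != expected_count:
        raise ValueError(f"expected {expected_count} lines, found {len(lines)}")

    predictions = []
    for line_number, line in enumerate(lines, start=1):
        try:
            label = int(line)
        except ValueError:
            raise ValueError(f"line {line_number} is not an integer: {line!r}") from None
        # Labels are assumed to be class indices in [0, num_labels).
        if not 0 <= label < num_labels:
            raise ValueError(f"line {line_number} has out-of-range label {label}")
        predictions.append(label)
    return predictions
```

For example, `validate_predictions_file('test_predictions.txt', expected_count=len(test_data))` would confirm that the file has one valid label per test input.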


## Prerequisites


### HuggingFace configuration

The steps below need to be completed by the team leader:

1. Create a team account on [HuggingFace](https://huggingface.co/) using the Gmail account provided by the IOAI organizers.

2. Go to the [IOAI HuggingFace repo](https://huggingface.co/InternationalOlympiadAI) and request access to all datasets.

3. In settings, create two Access Tokens, one with read rights, one with write rights, and store those in [Colab Secrets](https://www.youtube.com/watch?v=q87i2LZbbPc) as `hf_read` and `hf_write`, respectively.
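
Step 3 stores the tokens as Colab Secrets, while a notebook run outside Colab has to read them from somewhere else. The following is a minimal sketch of a lookup that tries both, assuming the secret names `hf_read` / `hf_write` from the step above and environment variable names of your choosing; `get_hf_token` is a hypothetical helper.

```python
import os


def get_hf_token(secret_name, env_name):
    """Return a token from Colab Secrets if available, else from an environment variable."""
    try:
        # google.colab is only importable inside a Colab runtime.
        from google.colab import userdata
    except ImportError:
        return os.getenv(env_name)
    try:
        return userdata.get(secret_name)
    except Exception:
        # Secret missing or notebook access not granted: fall back to the environment.
        return os.getenv(env_name)
```

Usage would look like `read_access_token = get_hf_token('hf_read', 'HF_READ_TOKEN')`. The function returns `None` when neither source has the token, so it is worth checking the result before calling `load_dataset`.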

In [1]:
import os

# Get tokens from environment variables (when running outside Colab);
# in Colab, read the `hf_read` and `hf_write` secrets instead
read_access_token = os.getenv('HF_READ_TOKEN')
write_access_token = os.getenv('HF_WRITE_TOKEN')

### Dependencies

If you've just installed `accelerate`, execute `Runtime > Restart session and run all` in the Colab UI menu above.

In [2]:
import importlib.util
import torch, transformers

if '2.3.0' not in torch.__version__:
  %pip install torch==2.3.0
if transformers.__version__!='4.41.2':
  %pip install transformers==4.41.2

if importlib.util.find_spec('datasets') is None:
  %pip install datasets==2.18.0
  %pip install evaluate==0.4.2
  %pip install accelerate -U




# Data

In [3]:
# load the data

from datasets import load_dataset, Dataset, DatasetDict

classification_dataset = load_dataset('InternationalOlympiadAI/NLP_problem', token=read_access_token)
raw_text = load_dataset('InternationalOlympiadAI/NLP_problem_raw', token=read_access_token)

In [4]:
type(raw_text)

datasets.dataset_dict.DatasetDict

In [5]:
import plotly.express as px
from collections import Counter
import pandas as pd
import plotly.graph_objects as go

# Function to get string length
def get_string_length(text):
    return len(text)

# Function to get character frequency
def get_char_frequency(text):
    return Counter(text)

def get_dataset_statistics(dataset: Dataset):
    # Get string lengths
    string_lengths = [get_string_length(text) for text in dataset['text']]

    # Get character frequencies
    char_frequencies = Counter()
    for text in dataset['text']:
        char_frequencies.update(get_char_frequency(text))

    # Plot histogram of string lengths
    fig_lengths = px.histogram(x=string_lengths, nbins=50, log_y=True)
    fig_lengths.update_layout(
        title='Histogram of String Lengths',
        xaxis_title='String Length',
        yaxis_title='Frequency (log scale)'
    )
    fig_lengths.show()

    # Plot histogram of character frequencies
    fig_chars = px.bar(x=list(char_frequencies.keys()), y=list(char_frequencies.values()), log_y=True)
    fig_chars.update_layout(
        title='Histogram of Character Frequencies',
        xaxis_title='Characters',
        yaxis_title='Frequency (log scale)'
    )
    fig_chars.update_xaxes(tickangle=90)
    fig_chars.show()

    # Print some summary statistics
    print(f"Total number of texts: {len(dataset['text'])}")
    print(f"Average string length: {sum(string_lengths) / len(string_lengths):.2f}")
    print(f"Minimum string length: {min(string_lengths)}")
    print(f"Maximum string length: {max(string_lengths)}")
    print(f"Number of unique characters: {len(char_frequencies)}")
    print(f"Most common character: '{char_frequencies.most_common(1)[0][0]}' (occurs {char_frequencies.most_common(1)[0][1]} times)")


def compare_datasets(
    dataset1: Dataset,
    dataset2: Dataset,
    name1: str = 'Dataset 1',
    name2: str = 'Dataset 2'
):
    # Get string lengths for both datasets
    lengths1 = [get_string_length(text) for text in dataset1['text']]
    lengths2 = [get_string_length(text) for text in dataset2['text']]


    # Create separate DataFrames for string lengths
    df_lengths1 = pd.DataFrame({name1: lengths1})
    df_lengths2 = pd.DataFrame({name2: lengths2})

    # Plot histogram of string lengths
    fig_lengths = go.Figure()
    fig_lengths.add_trace(go.Histogram(x=df_lengths1[name1], name=name1, opacity=0.7, nbinsx=50, histnorm='probability'))
    fig_lengths.add_trace(go.Histogram(x=df_lengths2[name2], name=name2, opacity=0.7, nbinsx=50, histnorm='probability'))

    fig_lengths.update_layout(
        title=f'Histogram of String Lengths: {name1} vs {name2}',
        xaxis_title='String Length',
        yaxis_title='Frequency',
        yaxis_type='log',
        barmode='overlay'
    )
    fig_lengths.update_traces(marker_line_width=1, marker_line_color="white")
    fig_lengths.show()

    # Get character frequencies for both datasets
    char_freq1 = Counter()
    char_freq2 = Counter()
    for text in dataset1['text']:
        char_freq1.update(get_char_frequency(text))
    for text in dataset2['text']:
        char_freq2.update(get_char_frequency(text))

    # Prepare data for scatter plot
    all_chars = set(char_freq1.keys()) | set(char_freq2.keys())
    df_chars = pd.DataFrame({
        'char': list(all_chars),
        name1: [char_freq1.get(char, 0) for char in all_chars],
        name2: [char_freq2.get(char, 0) for char in all_chars]
    })

    # Plot scatter plot of character frequencies
    fig_chars = px.scatter(
        df_chars,
        x=name1,
        y=name2,
        text='char',
        log_x=True,
        log_y=True
    )
    fig_chars.update_traces(textposition='top center')
    fig_chars.update_layout(
        title=f'Character Frequencies: {name1} vs {name2}',
        xaxis_title=f'{name1} Frequency',
        yaxis_title=f'{name2} Frequency'
    )
    fig_chars.show()

    # Print some comparison statistics
    print(f"Total number of texts in {name1}: {len(dataset1['text'])}")
    print(f"Total number of texts in {name2}: {len(dataset2['text'])}")
    print(f"Average string length in {name1}: {sum(lengths1) / len(lengths1):.2f}")
    print(f"Average string length in {name2}: {sum(lengths2) / len(lengths2):.2f}")
    print(f"Number of unique characters in {name1}: {len(char_freq1)}")
    print(f"Number of unique characters in {name2}: {len(char_freq2)}")




In [6]:
# get_dataset_statistics(classification_dataset['dev'])
compare_datasets(classification_dataset['train'], raw_text['train'], 'Train', 'Raw Text')


Total number of texts in Train: 1524
Total number of texts in Raw Text: 611245
Average string length in Train: 62.99
Average string length in Raw Text: 119.84
Number of unique characters in Train: 61
Number of unique characters in Raw Text: 69
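
The counts above show that the raw text contains a few characters that never occur in the training split (69 versus 61 unique characters). To see which ones, the two character sets can be compared directly. This is a minimal sketch; `char_coverage` is a hypothetical helper that works on any two lists of strings.

```python
def char_coverage(texts_a, texts_b):
    """Return characters seen only in texts_a, only in texts_b, and in both."""
    chars_a, chars_b = set(), set()
    for text in texts_a:
        chars_a.update(text)
    for text in texts_b:
        chars_b.update(text)

    only_a = sorted(chars_a - chars_b)
    only_b = sorted(chars_b - chars_a)
    shared = sorted(chars_a & chars_b)
    return only_a, only_b, shared
```

In this notebook it could be called as `char_coverage(classification_dataset['train']['text'], raw_text['train']['text'])`. Characters that appear only in the raw text are worth keeping in mind when choosing a vocabulary for a new tokenizer.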


## Train a new tokenizer

In [7]:
raw_text['train']['text'][0: 10]

['  𑀫च𑀞च𑀟च𑀟 𑀤च𑀙च 𑀢णच च𑀠𑀲च𑀟𑀢 𑀣च ब𑁦𑀟𑁣पणध𑁦 𑀣𑁣𑀟 𑀞𑁣𑀠च𑀱च 𑀤न𑀱च चलल𑁦ल𑁦𑀳 𑀞𑁣 ढच𑀠ढच𑀟त𑁦षढच𑀠ढच𑀟त𑁦𑀟 𑀣च 𑀠नपन𑀠 𑀞𑁦 ञचन𑀞च च त𑀢𑀞𑀢𑀟 𑀱च𑀟𑀢 पच𑀞च𑀠च𑀢𑀠च𑀟 𑀤नढ𑀢𑀟 𑀫चल𑀢पपच 𑀞𑁣 𑀱न𑀪𑀢𑀟 𑀞𑀱चण𑁣ण𑀢𑀟 𑀫चल𑀢पपच𑀯',
 ' च बच𑀳च𑀪 𑀱च𑀳च𑀟𑀟𑀢𑀟 𑁣लण𑀠ध𑀢त𑀳 𑀙णच𑀟 𑀱च𑀳च𑀟𑀳न 𑀳न𑀟 𑀳च𑀠न लच𑀠ढ𑁣ढ𑀢𑀟 णचढ𑁣 𑀭ठ𑀦 बच𑀠𑀢 𑀣च 𑀤𑀢𑀟च𑀪𑁦 𑀖𑀯',
 '𑀪चण𑀠𑁣𑀟𑀣 च𑀟प𑀫𑁣𑀟ण चल𑁦𑁣ब𑀫𑁣 𑀣𑁣𑀞ध𑁦𑀳𑀢 𑀝च𑀟 𑀫च𑀢𑀲𑁦 𑀳𑀫𑀢 च 𑀪च𑀟च𑀪 ठ𑀖 बच 𑀱चपच𑀟 𑁣𑀞प𑁣ढच 𑀟च 𑀳𑀫𑁦𑀞च𑀪च𑀪 𑀭थ𑀖𑀭 ष ठथ 𑀠चणन ठ𑀧ठ𑀰𑀮 च बच𑀪𑀢𑀟 𑀢ढच𑀣च𑀟 𑀣च𑀟 𑀞च𑀳न𑀱च 𑀟𑁦 𑀟च 𑀞च𑀲च𑀲𑁦𑀟 णच𑀣च लचढच𑀪च𑀢 𑀟च 𑀟च𑀘𑁦𑀪𑀢णच𑀯',
 'णच𑀣𑀣च च𑀞𑁦 𑀞𑀢𑀱𑁣𑀟 पच𑀟पचढच𑀪च𑀣न𑀞 𑀠नपन𑀠𑀢𑀟 𑀣च णच पच𑀳𑁣 त𑀢𑀞𑀢𑀟 𑀞च𑀪𑀞च𑀪च 𑀱चप𑁣 झचनण𑁦 णच𑀟च 𑀣च 𑀢𑀪𑀪𑀢𑀟 𑀞𑀢𑀱𑁣𑀟 𑀣च णच 𑀞च𑀪ढ𑁦 𑀳𑀫𑀢𑀯',
 '𑀪च𑀳𑀫𑀢𑀟 𑀣च𑀢𑀣च𑀢प𑁣 𑀟च 𑀳च𑀠न𑀟 𑀞न𑀣𑀢𑀟 𑀳𑀫𑀢बच 𑀞च𑀠च𑀪 णच𑀣𑀣च च 𑀳𑀫𑁦𑀞च𑀪च पच ठ𑀧𑀭𑀧𑀦 ब𑀢𑀟𑀢 त𑁣𑁦𑀲𑀲𑀢त𑀢𑁦𑀟प 𑀟च 𑀟च𑀘𑁦𑀪𑀢णच च𑀟 𑀞𑀢णच𑀳पच 𑀠चप𑀳च𑀞च𑀢त𑀢𑀦 च 𑀧𑀯',
 ' च त𑀢𑀞𑀢𑀟 𑀱चपच𑀟 𑀠च𑀪𑀢𑀳 𑀭𑀖𑀧𑀖𑀦 णच𑀟च 𑀣च 𑀳𑀫𑁦𑀞च𑀪न ठ𑀖𑀦 𑀠चब𑁦ललच𑀟 णच 𑀳𑀫𑀢बच त𑀢𑀞𑀢𑀟 𑀪न𑀟𑀣न𑀟च𑀪 𑀘𑀢𑀪चब𑁦𑀟 𑀪न𑀱च ठठ 𑀣च च𑀞च च𑀢𑀞च 𑀣𑁣𑀟 𑀞च𑀪ढच𑀪 ढच𑀞न𑀟त𑀢𑀟 𑀲𑀪च𑀟त𑀢𑀳त𑁣 𑀣𑁦 चल𑀠𑁦𑀢𑀣च च 𑀠चप𑀳चण𑀢𑀟 𑀠चपच𑀢𑀠च𑀞𑀢𑀟 𑀲च𑀪𑀞𑁣 𑀟च ध𑁣𑀪पनबन𑁦𑀳𑁦 𑀢𑀟𑀣𑀢णच𑀯',
 'ध𑀫𑁣प𑁣 चलढन𑀠च𑀟च ढच𑀢 𑀱च 𑀠च𑀳न च𑀠𑀲च𑀟𑀢 𑀣च𑀠च𑀪 𑀳च𑀟णच 𑀫𑁣पन𑀟च𑀟 𑀞च𑀟𑀳न 𑀞𑁣 𑀱च𑀳न 𑀫𑁣पन𑀟च 𑀣च 𑀳न𑀞च 𑀣चन𑀞च 𑀤न𑀱च ढचणच𑀟च𑀟 𑀠च𑀢 च𑀠𑀲च𑀟𑀢 𑀣च 

In [14]:
from tqdm import tqdm
from transformers import BertTokenizerFast

# repository id for saving the tokenizer
tokenizer_id="bert-base-uncased-ioi-2024"

# create a python generator to dynamically load the data
def batch_iterator(batch_size=100):
    # use only the first 1000 texts; load the column once instead of on every step
    texts = raw_text['train']['text'][:1000]
    for i in tqdm(range(0, len(texts), batch_size)):
        yield texts[i : i + batch_size]

# create a tokenizer from existing one to re-use special tokens
new_tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")


In [26]:
# new_tokenizer = new_tokenizer.train_new_from_iterator(text_iterator=batch_iterator(), vocab_size=32_000)
# new_tokenizer.save_pretrained("tokenizer")

# you need to be logged in to push the tokenizer
new_tokenizer.push_to_hub(tokenizer_id)


HfHubHTTPError: 401 Client Error: Unauthorized for url: https://huggingface.co/api/repos/create (Request ID: Root=1-671d2627-07edb7c81a18f3d17ba96cb4;65739d30-6626-44d6-b6ea-7bc619276513)

Invalid username or password.

In [25]:
new_tokenizer.model_input_names

['input_ids', 'token_type_ids', 'attention_mask']

In [None]:
from transformers import AutoTokenizer
from huggingface_hub import HfApi

# requires a valid token; only needed when loading the tokenizer from the Hub
user_id = HfApi(token=write_access_token).whoami()["name"]

# load tokenizer
# new_tokenizer = AutoTokenizer.from_pretrained(f"{user_id}/{tokenizer_id}")
new_tokenizer = AutoTokenizer.from_pretrained("tokenizer")

def group_texts(examples):
    # split long texts into overlapping windows of at most 200 tokens
    tokenized_inputs = new_tokenizer(
        examples["text"],
        return_special_tokens_mask=True,
        truncation=True,  # needed for max_length and the overflow windows to take effect
        padding=False,
        max_length=200,
        return_overflowing_tokens=True,
        stride=100
    )
    return tokenized_inputs

# preprocess dataset
num_proc = 1  # number of worker processes; adjust to the available CPU cores
tokenized_datasets = raw_text.map(group_texts, batched=True, remove_columns=["text"], num_proc=num_proc)
tokenized_datasets["train"].features
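
The `max_length` and `stride` arguments in the cell above describe overlapping windows over long texts: each window holds at most `max_length` tokens, and consecutive windows share `stride` tokens. The sketch below illustrates the idea on a plain list of token ids. It is a simplification under stated assumptions: `sliding_windows` is a hypothetical helper, and it ignores the special tokens that the real tokenizer adds to every window.

```python
def sliding_windows(token_ids, max_length=200, stride=100):
    """Split token_ids into windows of at most max_length that overlap by stride tokens."""
    if not 0 <= stride < max_length:
        raise ValueError("stride must satisfy 0 <= stride < max_length")

    step = max_length - stride  # how far each new window advances
    windows = []
    start = 0
    while True:
        windows.append(token_ids[start:start + max_length])
        # Stop once the current window reaches the end of the sequence.
        if start + max_length >= len(token_ids):
            break
        start += step
    return windows
```

A text shorter than `max_length` yields a single window, so short examples are unaffected; only long texts are split.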


# Baseline

In [35]:
# load the pre-trained tokenizer and use it to process the data

from transformers import AutoTokenizer
from transformers import DataCollatorWithPadding

new_tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-multilingual-uncased")

def preprocess_function(examples):
    return new_tokenizer(examples["text"], truncation=True)

tokenized_data = classification_dataset.map(preprocess_function, batched=True)

data_collator = DataCollatorWithPadding(tokenizer=new_tokenizer)

Map: 100%|██████████| 218/218 [00:00<00:00, 15259.40 examples/s]


In [37]:
# define the evaluation metric

import evaluate
import numpy as np

f1 = evaluate.load("f1")
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return f1.compute(predictions=predictions, references=labels, average='macro')
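
To make the metric concrete: macro F1 computes an F1 score per class and then takes the unweighted mean, so rare classes count as much as frequent ones. The following is a plain-Python sketch of that computation for illustration only; `macro_f1` is a hypothetical name, and the `evaluate` metric above remains the one used for scoring.

```python
def macro_f1(predictions, references):
    """Unweighted mean of per-class F1 over every label seen in either list."""
    labels = sorted(set(predictions) | set(references))
    pairs = list(zip(predictions, references))

    f1_scores = []
    for label in labels:
        tp = sum(1 for p, r in pairs if p == label and r == label)
        fp = sum(1 for p, r in pairs if p == label and r != label)
        fn = sum(1 for p, r in pairs if p != label and r == label)
        denominator = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class is never hit.
        f1_scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(f1_scores) / len(f1_scores)
```

Because every class has equal weight, a model that always predicts the majority class scores poorly on this metric even if its accuracy looks acceptable.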

In [38]:
# define the model and the training configuration

from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

model = AutoModelForSequenceClassification.from_pretrained(
    "google-bert/bert-base-multilingual-uncased", num_labels=5
)

training_args = TrainingArguments(
    output_dir="baseline_bobai",
    learning_rate=1e-5,
    per_device_train_batch_size=64,
    per_device_eval_batch_size=64,
    num_train_epochs=20,
    weight_decay=0.01,
    eval_strategy="epoch",
    save_strategy="epoch",
    save_total_limit=5,
    metric_for_best_model='f1',
    load_best_model_at_end=True,
    push_to_hub=True,
    hub_strategy="checkpoint",
    hub_token=write_access_token,
    hub_private_repo=True,
    hub_model_id='baseline_bobai'

)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_data["train"],
    eval_dataset=tokenized_data["dev"],
    tokenizer=new_tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google-bert/bert-base-multilingual-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [39]:
# execute the model training
trainer.train()

| Epoch | Training Loss | Validation Loss | F1 |
|---|---|---|---|
| 1 | No log | 1.566817 | 0.204884 |
| 2 | No log | 1.555382 | 0.225585 |
| 3 | No log | 1.521984 | 0.262378 |
| 4 | No log | 1.49666 | 0.270352 |
| 5 | No log | 1.485887 | 0.257852 |
| 6 | No log | 1.494712 | 0.26093 |
| 7 | No log | 1.49271 | 0.263981 |
| 8 | No log | 1.476671 | 0.287499 |
| 9 | No log | 1.452095 | 0.309658 |
| 10 | No log | 1.453665 | 0.308777 |


TrainOutput(global_step=480, training_loss=1.4480936686197916, metrics={'train_runtime': 212.9048, 'train_samples_per_second': 143.163, 'train_steps_per_second': 2.255, 'total_flos': 372809629324656.0, 'train_loss': 1.4480936686197916, 'epoch': 20.0})
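
The reported `global_step=480` follows from the numbers already shown in this notebook: 1524 training examples with a batch size of 64 give 24 optimizer steps per epoch, and 20 epochs give 480 steps. The sketch below reproduces that arithmetic, assuming a single device and no gradient accumulation; `total_optimizer_steps` is a hypothetical helper.

```python
import math


def total_optimizer_steps(num_examples, batch_size, num_epochs):
    """Number of optimizer steps for one device without gradient accumulation."""
    # The last batch of each epoch may be smaller, hence the ceiling.
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_epochs


steps = total_optimizer_steps(num_examples=1524, batch_size=64, num_epochs=20)
print(steps)
```

The same helper is handy when changing the batch size or epoch count, for example to estimate how many steps a longer run would take relative to the roughly 213 seconds reported above.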

# Inference

In [None]:
# run the trained model on a dev/test split
data_split = "dev"
eval_out = trainer.predict(tokenized_data[data_split])
predictions = eval_out.predictions.argmax(1)
labels = eval_out.label_ids
dev_f1 = f1.compute(predictions=predictions, references=labels, average='macro')

# Testing

In [None]:
# UPDATE THIS CELL ACCORDINGLY

# define a function to load your tokenizer and model from a HF path
# the path variables can be strings or lists of strings (for ensemble solutions)
def load_model(path_to_tokenizer, path_to_model, token):
  # Example:
  tokenizer = AutoTokenizer.from_pretrained(path_to_tokenizer, token=token)
  model = AutoModelForSequenceClassification.from_pretrained(path_to_model, token=token)
  model.eval()

  return tokenizer, model

# define a "predict" function that takes the model and a list of input strings
# and returns the outputs as a list of integer classes
def predict(tokenizer, model, input_texts):
  #Example:
  predictions = []
  for input_text in input_texts:

    # truncate inputs that exceed the model's maximum sequence length
    input_ids = tokenizer(input_text, return_tensors="pt", truncation=True)

    with torch.no_grad():
      logits = model(**input_ids).logits

    predictions.append(logits.argmax().item())

  return predictions

# set variables
path_to_model = "path/to/your/best/model/on/hf" # can be a list instead
path_to_tokenizer = "path/to/your/best/tokenizer/on/hf" # can be a list instead
model_access_token = "access token" # a fine-grained token with read rights for your model repository
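
The comments in the cell above allow lists of paths for ensemble solutions. One simple way to combine several models is a per-sample majority vote over their predicted classes. This is a sketch of that idea only, not a prescribed method: `majority_vote` is a hypothetical helper, and it expects one list of integer predictions per model, all of the same length.

```python
from collections import Counter


def majority_vote(per_model_predictions):
    """Combine predictions from several models by taking the most common class per sample."""
    if not per_model_predictions:
        raise ValueError("at least one list of predictions is required")
    length = len(per_model_predictions[0])
    if any(len(preds) != length for preds in per_model_predictions):
        raise ValueError("all models must predict the same number of samples")

    combined = []
    for votes in zip(*per_model_predictions):
        # On a tie, most_common keeps the class that was counted first,
        # i.e. the vote of the earliest model in the list.
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined
```

Listing the strongest model first makes it the tie-breaker. Keep in mind that every additional model adds to the inference time, which counts against the 5-minute budget.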


In [None]:
# DO NOT CHANGE THIS CELL!!!

new_tokenizer, model = load_model(path_to_tokenizer, path_to_model, token=model_access_token)

test_data = load_dataset("InternationalOlympiadAI/NLP_problem_test")['test']['text']

predictions = predict(new_tokenizer, model, test_data)

with open('test_predictions.txt', 'w') as outfile:
  outfile.write('\n'.join([str(p) for p in predictions]))