<img src="./figs/IOAI-Logo.png" alt="IOAI Logo" width="200" height="auto">

[IOAI 2024 (Burgas, Bulgaria), On-Site Round](https://ioai-official.org/bulgaria-2024)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/IOAI-official/IOAI-2024/blob/main/On-Site-Round/Help_BOBAI/Help_BOBAI.ipynb)

# Help BOBAI: More classification in an unknown language

<img src="./figs/Help BOBAI Fig 1.png" width="700">

## Background
Last time you heard from Bob, he asked you to build a classifier for a new, unknown language. The client, Amoira, was happy with your solution, so Bob instructed his team to deploy the new model. After some heavy optimization and careful unit testing, the service went live and has been running smoothly since.

## Task

This very morning, Amoira returned with a request to extend the number of classes which the classifier can handle from 5 to 7. And this has to be done *today*!

Amoira has provided labeled data for the new classes. With more time, Bob could just use your earlier solution to train a new model on the union of the old and new data, right? The trouble is that the deployment of a new model is a complex process and cannot be done in a day, so the solution has to be built entirely around the model already deployed. Bob has once more come to you for help, as you know the task best.

What's more, Amoira's security concerns have grown even further with the addition of the new data, so they have requested that Bob not release the text in any form. What if someone managed to decrypt it? So Bob has provided you with a precomputed and cached encoding of all available data: the train and dev sets previously used for the 5-way classification, and the new data Amoira provided for the 2 additional classes. The encoding is the output of the pooling layer in mBERT, so it fits right into the classifier previously trained.

Your task is to build a solution for 7-way classification, while operating within the following constraints:

*   The solution can use the 5-way classifier, but cannot change the parameters of the classifier or add any new learned parameters.

*   You are allowed to compute averages and distances between the data encodings.

*   The solution should be reproducible in under 1 hour on an L4 GPU card.

*   The classifier has to perform inference on any random 500 data samples in under 2 minutes on an L4 GPU card.
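
The second constraint permits averages and distances over the encodings. As a rough illustration of what those two operations look like, the sketch below computes a per-class mean vector (a centroid) and picks the label with the closest centroid. It is written in plain Python on toy 2-dimensional vectors so that it runs without the data files. The helper names `centroid`, `euclidean` and `nearest_centroid` are hypothetical, and this is not presented as the intended solution.

```python
import math

def centroid(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_centroid(vector, centroids):
    """Return the label whose centroid lies closest to `vector`."""
    return min(centroids, key=lambda label: euclidean(vector, centroids[label]))

# Toy 2-dimensional stand-ins for the 768-dimensional encodings (assumed data).
toy_examples = {
    5: [[0.0, 0.0], [0.0, 2.0]],
    6: [[4.0, 4.0], [6.0, 4.0]],
}
toy_centroids = {label: centroid(vs) for label, vs in toy_examples.items()}

print(toy_centroids[5])                             # [0.0, 1.0]
print(nearest_centroid([5.0, 5.0], toy_centroids))  # 6
```

On the real tensors the same two steps would normally be done with tensor operations such as `torch.mean` and `torch.cdist` rather than Python lists.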

## Deliverables

You need to submit:

*   Working code that can be used to reproduce and test your best model.
  * In this Colab notebook.
  * Reproducing your best model means that starting from the baseline classifier, we should be able to arrive at your final best model by executing the cells of the notebook.
*   The predictions on the test data (released two hours before the end of the competition).

**You absolutely need to ensure that:**

(1) your notebook is executable from top to bottom

(2) the notebook contains the full code needed to reproduce your model

(3) the notebook can run on an L4 GPU



## Training Dataset

In [None]:
import torch

dataset = torch.load('./training_set/train-dev_dataset_with_labels.pt')

inputs = dataset[:,:,:-1]
labels = dataset[:, :, -1]


## Baseline Solution

Below you will find a very naive baseline solution: given an input vector, we draw a label uniformly at random from all 7 classes. If the draw is one of the new labels (5 or 6), we assign it directly; otherwise, we use the base classifier to make a prediction.

You can replace the code below with your solution.

In [None]:
import torch
import random

class SevenWayClassifier():
  def __init__(self):
    # Frozen 5-way linear head on top of the 768-dimensional mBERT pooled encoding.
    base_clf = torch.nn.Linear(in_features=768, out_features=5, bias=True)
    base_clf.load_state_dict(torch.load("training_set/base_classifier.pth"))
    self.base_clf = base_clf

  def base_classification(self, input_vector):
    # input_vector is expected to have shape (1, 768); gradients are not needed.
    with torch.no_grad():
      logits = self.base_clf(input_vector)
      preds = torch.softmax(logits, 1)
      predicted_class = preds.argmax(dim=1).numpy()[0]

    return predicted_class

  def __call__(self, input_vector):
    # Draw a label uniformly from all 7 classes; keep it only if it is a new class.
    random_class = random.choice([0,1,2,3,4,5,6])
    if random_class in [5,6]:
      predicted_class = random_class
    else:
      # Otherwise fall back to the deployed 5-way classifier.
      predicted_class = self.base_classification(input_vector)

    return predicted_class

clf = SevenWayClassifier()
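
To see how often the baseline ignores its input, the following sketch reproduces only the routing rule from `__call__` (the uniform draw over 7 labels) and measures how often a new label is assigned. The helper `baseline_route` and the trial count are assumptions made for illustration; no model or data is loaded.

```python
import random

def baseline_route(rng):
    """Mimic the baseline's routing: return 5 or 6 when drawn, else None
    (None means the input would be passed on to the base classifier)."""
    drawn = rng.choice([0, 1, 2, 3, 4, 5, 6])
    return drawn if drawn in (5, 6) else None

rng = random.Random(0)  # seeded so the experiment is repeatable
trials = 70_000
routed_to_new = sum(baseline_route(rng) is not None for _ in range(trials))
fraction_new = routed_to_new / trials

# Expected to be near 2/7 (about 0.286), independent of the input vectors.
print(fraction_new)
```

Roughly two in seven inputs therefore receive a random new label regardless of their encoding, which is what any replacement solution needs to improve on.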

## Inference and Evaluation

In [None]:
from sklearn.metrics import f1_score

def compute_f1(labels, predictions):
  return f1_score(labels, predictions, average='macro')
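
For readers unfamiliar with `average='macro'`: the score is the unweighted mean of the per-class F1 values, so each class counts equally however many samples it has. The sketch below re-implements that definition in plain Python for illustration. The function name `macro_f1` is hypothetical, and the notebook itself keeps using scikit-learn.

```python
def macro_f1(labels, predictions):
    """Unweighted mean of per-class F1 over all classes seen in either list."""
    classes = sorted(set(labels) | set(predictions))
    scores = []
    for c in classes:
        tp = sum(1 for y, p in zip(labels, predictions) if y == c and p == c)
        fp = sum(1 for y, p in zip(labels, predictions) if y != c and p == c)
        fn = sum(1 for y, p in zip(labels, predictions) if y == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); scored as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

toy_labels      = [0, 0, 1, 1, 2, 2]
toy_predictions = [0, 0, 1, 1, 2, 0]

# Per-class F1 is 0.8, 1.0 and 2/3, so the macro average is about 0.8222.
print(round(macro_f1(toy_labels, toy_predictions), 4))  # 0.8222
```

One consequence for this task is that a class the classifier never predicts correctly contributes an F1 of 0, pulling the macro score down by a full 1/7 share.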

In [None]:
from tqdm import tqdm

def inference(clf, input_vectors):
  predictions = []
  for sample in tqdm(input_vectors):
    predictions.append(clf(sample))
  return predictions
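
Because inference on 500 samples must finish in under 2 minutes, it can help to time the loop. The sketch below wraps the same per-sample loop with `time.perf_counter`. `timed_inference`, `dummy_clf` and `BUDGET_SECONDS` are hypothetical names, and the dummy classifier merely stands in for `clf` so that the example runs without the model file.

```python
import time

BUDGET_SECONDS = 120  # the 2-minute limit stated in the task constraints

def timed_inference(clf, input_vectors):
    """Run clf on each sample and return (predictions, elapsed seconds)."""
    start = time.perf_counter()
    predictions = [clf(sample) for sample in input_vectors]
    elapsed = time.perf_counter() - start
    return predictions, elapsed

def dummy_clf(sample):
    """Placeholder classifier (assumption): always predicts class 0."""
    return 0

# 500 fake samples with 768 features each, mirroring the stated batch size.
fake_inputs = [[0.0] * 768 for _ in range(500)]
preds, elapsed = timed_inference(dummy_clf, fake_inputs)
print(len(preds), elapsed < BUDGET_SECONDS)  # 500 True
```

Timings measured on hardware other than an L4 GPU are only indicative of whether the constraint will hold.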

In [None]:

predictions = inference(clf, inputs)

f1 = compute_f1(labels, predictions)
print('\nNaive solution F1', f1)

## Validation Dataset

In [None]:
# The leaderboard may or may not work. If it doesn't, forgive us; we will try to get it running.

import pandas as pd
import numpy as np

def submission_to_csv(pred: np.ndarray, output_fpath: str = "submission.csv"):
    pred = np.array(pred).flatten()
    data_size = pred.size
    df = pd.DataFrame({
        "ID": np.arange(data_size),
        "class": pred
    })

    df.to_csv(output_fpath, index=False)

eval_inputs = torch.load('./Solution/validation_set/eval_dataset.pt')

eval_predictions = inference(clf, eval_inputs)

submission_to_csv(eval_predictions)
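
Before uploading, it may be worth checking that the file has the layout `submission_to_csv` writes: an `ID,class` header followed by one row per sample. The sketch below performs that check on CSV text using only the standard library. `check_submission` is a hypothetical helper, and the rules it enforces (consecutive IDs, classes 0 to 6) are assumptions inferred from the function above, not a statement of what the leaderboard validates.

```python
import csv
import io

def check_submission(csv_text, expected_rows, valid_classes=range(7)):
    """Return True if csv_text has columns ID,class, IDs 0..n-1 and valid classes."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if len(rows) != expected_rows:
        return False
    for index, row in enumerate(rows):
        if set(row) != {"ID", "class"}:
            return False
        try:
            row_id, label = int(row["ID"]), int(row["class"])
        except (TypeError, ValueError):
            return False  # missing or non-integer field
        if row_id != index or label not in valid_classes:
            return False
    return True

good = "ID,class\n0,3\n1,6\n2,0\n"
bad = "ID,class\n0,3\n1,7\n2,0\n"  # class 7 does not exist in a 7-way task

print(check_submission(good, expected_rows=3))  # True
print(check_submission(bad, expected_rows=3))   # False
```

To check a file on disk, read it first and pass the text in, for example `check_submission(open("submission.csv").read(), expected_rows=len(eval_predictions))`.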

## Test Dataset

In [None]:
# DO NOT CHANGE THIS CELL

# this download link will not work until two hours before the end of the competition
test_inputs = torch.load('Solution/test_set/test_dataset.pt')

split='test'

test_predictions = inference(clf, test_inputs)

with open('{}_predictions.txt'.format("Team Name"), 'w') as outfile:
  outfile.write('\n'.join([str(p) for p in test_predictions]))