# **Harmony**: Questionnaire Parsing Algorithm Improvement Challenge

**NLP challenge** | [Visit the challenge page](https://doxaai.com/competition/harmony-parsing)

Your challenge is to develop an improved algorithm for identifying mental health survey questions and selectable answers in plain text that can be integrated into the [Harmony tool](https://harmonydata.ac.uk/developer-guide/).

This Jupyter notebook will introduce you to the challenge and guide you through the process of making your first submission to the [DOXA AI platform](https://doxaai.com/competition/harmony-parsing).

**Before you get started, make sure to [sign up for an account](https://doxaai.com/sign-up) if you do not already have one and [enrol to take part](https://doxaai.com/competition/harmony-parsing) in the challenge.**

**If you have any questions, feel free to ask them in the [Harmony community Discord server](https://discord.com/invite/harmonydata).**


## Installing and importing useful packages

Before you get started, please make sure you have [PyTorch](https://pytorch.org/get-started/locally/) installed in your Python environment. If you do not have `pandas`, `transformers`, `datasets`, `accelerate` or `intervaltree`, uncomment the code in the following cell to install them (`datasets` is imported later in this notebook, and the `Trainer` relies on `accelerate`).


In [None]:
# %pip install "pandas>=2.2.2" "transformers>=4.43.1" "intervaltree>=3.1.0" datasets accelerate

In [None]:
# Install the latest version of the DOXA CLI
%pip install -U doxa-cli

In [None]:
import os
import json

import pandas as pd

pd.set_option("display.max_colwidth", None)

## Loading the data

In [None]:
# Download the dataset if we do not already have it
if not os.path.exists("data"):
    !curl https://raw.githubusercontent.com/DoxaAI/harmony-parsing-getting-started/main/data/train_raw.txt --create-dirs --output data/train_raw.txt
    !curl https://raw.githubusercontent.com/DoxaAI/harmony-parsing-getting-started/main/data/train_clean.txt --output data/train_clean.txt
    !curl https://raw.githubusercontent.com/DoxaAI/harmony-parsing-getting-started/main/data/train_labels.json --output data/train_labels.json

if not os.path.exists("submission"):
    !curl https://raw.githubusercontent.com/DoxaAI/harmony-parsing-getting-started/main/submission/competition.py --create-dirs --output submission/competition.py
    !curl https://raw.githubusercontent.com/DoxaAI/harmony-parsing-getting-started/main/submission/doxa.yaml --output submission/doxa.yaml
    !curl https://raw.githubusercontent.com/DoxaAI/harmony-parsing-getting-started/main/submission/run.py --output submission/run.py

with open("data/train_raw.txt") as f:
    raw_train = f.read()

with open("data/train_clean.txt") as g:
    clean_train = g.read()

with open("data/train_labels.json") as h:
    labels_train = json.load(h)

## Exploring the data

Let's start by taking a look at the data, which comes in two forms:
- **The raw plain text** where questions and answers have been manually tagged with `<q>`/`</q>` and `<a>`/`</a>` by the Harmony team
- **A clean version** where the tags have been removed (with the question and answer ranges provided separately)

In [None]:
print(raw_train[:515])

In [None]:
print(clean_train[:451])

The `labels_train` dictionary holds, under the keys `"q"` and `"a"`, the starting indexes (inclusive) and ending indexes (exclusive) in the clean text of the questions and answers tagged in the raw text. For example, to pick out the first question, `"I'm afraid that I might injury myself if I exercise"`, you can do the following:

In [None]:
start, end = labels_train["q"][0]

clean_train[start:end]
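
Because the ranges are half-open, `end - start` is the length of a span and a plain slice recovers its text. The sketch below shows this on a made-up miniature document; the toy text, the offsets and the `extract_spans` helper are assumptions for illustration and are not taken from the dataset.

```python
# Hypothetical miniature of the clean text and its labels (not real data)
toy_text = "Q1 I feel tired. Never Sometimes Often"
toy_labels = {"q": [[3, 16]], "a": [[17, 22], [23, 32], [33, 38]]}


def extract_spans(text, ranges):
    """Slice out each [start, end) range; the character at `end` is excluded."""
    return [text[start:end] for start, end in ranges]


questions = extract_spans(toy_text, toy_labels["q"])
answers = extract_spans(toy_text, toy_labels["a"])

print(questions)  # ['I feel tired.']
print(answers)    # ['Never', 'Sometimes', 'Often']
```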

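To make it significantly faster to check whether a given character range falls within a question or an answer, we will build two interval trees. Empty ranges (where `start == end`) are skipped, since `IntervalTree` does not accept zero-length intervals: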

In [None]:
from intervaltree import Interval, IntervalTree

tree_q = IntervalTree(
    Interval(start, end) for start, end in labels_train["q"] if start != end
)

tree_a = IntervalTree(
    Interval(start, end) for start, end in labels_train["a"] if start != end
)
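
For intuition, the sketch below spells out the half-open overlap test that a query such as `tree_q.overlap(start, end)` answers, using a plain linear scan instead of a tree. The function names and toy ranges are illustrative assumptions; the interval trees answer the same kind of question without scanning every range.

```python
def ranges_overlap(a_start, a_end, b_start, b_end):
    """Half-open ranges [a_start, a_end) and [b_start, b_end) overlap
    when each one starts before the other one ends."""
    return a_start < b_end and b_start < a_end


def label_for_span(start, end, question_ranges, answer_ranges):
    """Linear-scan stand-in for the two tree lookups (questions are checked first)."""
    if any(ranges_overlap(start, end, s, e) for s, e in question_ranges):
        return "question"
    if any(ranges_overlap(start, end, s, e) for s, e in answer_ranges):
        return "answer"
    return "other"


# Hypothetical ranges, for illustration only
toy_q = [(3, 16)]
toy_a = [(17, 22), (23, 32)]

print(label_for_span(5, 9, toy_q, toy_a))    # question: lies inside (3, 16)
print(label_for_span(16, 17, toy_q, toy_a))  # other: 16 is excluded from (3, 16)
print(label_for_span(17, 22, toy_q, toy_a))  # answer: matches (17, 22) exactly
```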

## Tokenising the text

Now, we'll tokenise the clean text and match up the question and answer ranges so that we can fine-tune a pre-trained DistilBERT model for our task. DistilBERT accepts at most 512 tokens per input, so we also split the training text into smaller, overlapping chunks as we tokenise it.

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased")

tokenizer

In [None]:
label_list = ["other", "question", "answer"]

id2label = {k: v for k, v in enumerate(label_list)}
label2id = {v: k for k, v in enumerate(label_list)}

In [None]:
MAX_LENGTH = 512
STRIDE = 32


def tokenize(text, tokenizer, tree_q, tree_a):
    encodings = tokenizer(
        text,
        return_offsets_mapping=True,
        return_overflowing_tokens=True,
        truncation=True,
        max_length=MAX_LENGTH,
        stride=STRIDE,
        add_special_tokens=True,  # Includes the [CLS] and [SEP] tokens
    )

    all_token_labels = []
    for batch_index, (input_ids, offsets) in enumerate(
        zip(encodings["input_ids"], encodings["offset_mapping"])
    ):
        word_ids = encodings.word_ids(batch_index=batch_index)

        token_labels = []
        current_word_idx = None

        for word_id, (start, end) in zip(word_ids, offsets):
            if word_id is None:  # Special tokens like [CLS] or [SEP]
                token_labels.append(-100)
            elif word_id != current_word_idx:  # New word
                if len(tree_q.overlap(start, end)) > 0:
                    label = "question"
                elif len(tree_a.overlap(start, end)) > 0:
                    label = "answer"
                else:
                    label = "other"

                token_labels.append(label2id[label])
                current_word_idx = word_id
            else:  # Subword token
                token_labels.append(-100)

        all_token_labels.append(token_labels)

    encodings["labels"] = all_token_labels

    return encodings
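
As a rough mental model of how `max_length` and `stride` interact, the sketch below slides a window over a plain list of tokens. It assumes that each chunk reserves two positions for `[CLS]` and `[SEP]` and that `stride` is the number of tokens shared by consecutive chunks; the function name is hypothetical, and the real tokenizer may differ in edge cases.

```python
def chunk_with_overlap(tokens, max_length, stride, num_special_tokens=2):
    """Split `tokens` into windows that overlap by `stride` tokens.

    Simplified model only: each window holds `max_length - num_special_tokens`
    content tokens, and the next window starts `stride` tokens before the
    previous one ended.
    """
    window = max_length - num_special_tokens
    if window <= stride:
        raise ValueError("stride must be smaller than the content window")

    step = window - stride
    chunks = []
    start = 0
    while True:
        chunks.append(tokens[start : start + window])
        if start + window >= len(tokens):
            break  # this window reached the end of the text
        start += step
    return chunks


# Ten toy "tokens", windows of 4 content tokens, overlapping by 1
chunks = chunk_with_overlap(list(range(10)), max_length=6, stride=1)
print(chunks)  # [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```

The overlap means that a question cut off at the end of one chunk has a chance of appearing whole near the start of the next.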

In [None]:
from datasets import Dataset

tokenized_dataset = tokenize(clean_train, tokenizer, tree_q, tree_a)

training_dataset = Dataset.from_dict(
    {
        "input_ids": tokenized_dataset["input_ids"],
        "attention_mask": tokenized_dataset["attention_mask"],
        "labels": tokenized_dataset["labels"],
    }
)

Great! Now that our data has been prepared, we can inspect the tokens that have been produced and labelled:

In [None]:
for i, (input_ids, labels) in enumerate(  # type: ignore
    zip(tokenized_dataset["input_ids"], tokenized_dataset["labels"])  # type: ignore
):
    tokens = tokenizer.convert_ids_to_tokens(input_ids)
    for token, label in zip(tokens, labels):
        print(f"Token: {token:<20} Label: {id2label.get(label)}")

    if i > 32:
        break
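
The `None` entries printed by the cell above are positions labelled `-100`, that is, special tokens and non-initial subword pieces. PyTorch's cross-entropy loss skips targets equal to `-100` by default (`ignore_index=-100`), so these positions do not contribute to training. The pure-Python sketch below mimics that masking on toy numbers; it is an illustration under those assumptions, not the code the `Trainer` actually runs.

```python
import math

IGNORE_INDEX = -100  # PyTorch's default ignore_index for cross-entropy


def masked_cross_entropy(logits, labels, ignore_index=IGNORE_INDEX):
    """Mean negative log-likelihood over positions whose label is not ignored."""
    total, count = 0.0, 0
    for scores, label in zip(logits, labels):
        if label == ignore_index:
            continue  # special token or non-initial subword: no contribution
        top = max(scores)  # subtract the max for numerical stability
        log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
        total += log_norm - scores[label]
        count += 1
    # This sketch returns 0.0 when every position is ignored
    return total / count if count else 0.0


# Toy logits over ["other", "question", "answer"] for four token positions
toy_logits = [[9.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 9.0, 0.0]]
toy_labels = [-100, 1, 2, -100]

loss = masked_cross_entropy(toy_logits, toy_labels)
print(round(loss, 4))  # 1.0986, i.e. ln(3): only the two uniform rows count
```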

## Fine-tuning a token classification model

We are now ready to fine-tune a pre-trained DistilBERT model to perform this token classification task!

First, we need to load the model:

In [None]:
from transformers import (
    DataCollatorForTokenClassification,
    AutoModelForTokenClassification,
)

data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)

model = AutoModelForTokenClassification.from_pretrained(
    "distilbert/distilbert-base-uncased",
    num_labels=len(label_list),
    id2label=id2label,
    label2id=label2id,
)

model

Now that the model has been loaded, we are ready to start fine-tuning it! You may want to experiment with the training arguments (just take care not to save models you do not want to submit into the `submission/` directory).

In [None]:
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="checkpoints",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=1,
    weight_decay=0.01,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=training_dataset,
    eval_dataset=training_dataset,  # You may wish to make your own validation set!
    data_collator=data_collator,
)

trainer.train()
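
The cell above evaluates on the training data itself, so the reported evaluation loss cannot reveal overfitting. One option is to hold out a contiguous tail of the chunks: neighbouring chunks overlap by `STRIDE` tokens, so a contiguous split shares text between the two sets only at the single boundary, whereas a random split would share it at many. The helper name and the 10% fraction below are assumptions for illustration.

```python
def split_chunks(num_chunks, validation_fraction=0.1):
    """Return (train_indices, validation_indices), holding out the last chunks."""
    if not 0 < validation_fraction < 1:
        raise ValueError("validation_fraction must be between 0 and 1")

    num_validation = max(1, round(num_chunks * validation_fraction))
    if num_validation >= num_chunks:
        raise ValueError("not enough chunks to hold any out")

    cut = num_chunks - num_validation
    return list(range(cut)), list(range(cut, num_chunks))


train_idx, val_idx = split_chunks(20, validation_fraction=0.1)
print(len(train_idx), val_idx)  # 18 [18, 19]
```

With the `datasets` library, `training_dataset.select(train_idx)` and `training_dataset.select(val_idx)` would then give the two subsets to pass as `train_dataset` and `eval_dataset`.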

Now that our model has finished training, we can use it to make some predictions for the text we have:

In [None]:
import torch

inputs = tokenizer(
    clean_train,
    return_offsets_mapping=True,
    return_overflowing_tokens=True,
    truncation=True,
    padding=True,
    max_length=MAX_LENGTH,
    stride=STRIDE,
    add_special_tokens=True,
    return_tensors="pt",
).to(model.device)

with torch.inference_mode():
    for i, (input_ids, attention_mask, offsets) in enumerate(
        zip(inputs["input_ids"], inputs["attention_mask"], inputs["offset_mapping"])  # type: ignore
    ):
        # Iterating yields one chunk at a time, so add back the batch dimension
        logits = model(
            input_ids=input_ids.unsqueeze(0),
            attention_mask=attention_mask.unsqueeze(0),
        ).logits
        predictions = torch.argmax(logits, dim=2)
        predicted_token_class = [
            model.config.id2label[t.item()] for t in predictions[0]
        ]

        for cls, offset in zip(predicted_token_class, offsets):
            word = clean_train[offset[0] : offset[1]]
            print(f"{word:<20}", cls)

        if i > 8:
            break

## Producing a submission package

**Now, we will move on to creating your first submission!**

When you upload your work to the DOXA AI platform, your code will be run in an environment with no internet access. As such, your submission needs to contain any models you want to use as part of the submission, as well as any code necessary to use those models (including tokenisers).

Currently, the `submission/` folder contains three files:

- `submission/competition.py`: this contains competition-specific code used to interface with the platform
- `submission/doxa.yaml`: this is a configuration file used by the DOXA CLI when you make a submission
- `submission/run.py`: this is the Python script that gets run when your work gets evaluated (**you will need to edit this to implement your solution!**)

First, we will save the model and tokeniser into our `submission/` directory:

In [None]:
tokenizer.save_pretrained("submission/tokenizer")
trainer.save_model("submission/model")

When you upload your submission to the platform, based on the current configuration in `doxa.yaml`, the `run.py` entrypoint file will be run. If you take a look at `run.py`, you will see the following:

```py
class Evaluator(BaseEvaluator):
    def predict(
        self, text: str
    ) -> Generator[Tuple[int, int, Literal["Q", "A"]], Any, None]:
        # Load the saved tokeniser and model 
        tokenizer = AutoTokenizer.from_pretrained(directory / "tokenizer")
        model = AutoModelForTokenClassification.from_pretrained(directory / "model")

        # Tokenise the input text
        inputs = tokenizer(
            text,
            return_offsets_mapping=True,
            return_overflowing_tokens=True,
            truncation=True,
            padding=True,
            max_length=512,
            stride=16,
            add_special_tokens=True,
            return_tensors="pt",
        ).to(model.device)

        # Chunks overlap, so we want to keep track of predictions we have already made
        done = set() 

        # Produce predictions for each example (in inference mode)
        with torch.inference_mode():
            for input_ids, attention_mask, offsets in zip(inputs["input_ids"], inputs["attention_mask"], inputs["offset_mapping"]):  # type: ignore
                predictions = torch.argmax(
                    model(input_ids=input_ids, attention_mask=attention_mask).logits,
                    dim=2,
                )

                for t, (start, end) in zip(predictions[0], offsets):
                    if (start, end) in done or (start == 0 and end == 0):
                        continue

                    done.add((start, end))

                    predicted_token_class = model.config.id2label[t.item()]
                    if predicted_token_class == "question":
                        yield (start, end, "Q")
                    elif predicted_token_class == "answer":
                        yield (start, end, "A")
```

In the `predict()` method, we load the tokeniser and the model we have just fine-tuned and use them to produce predictions for the test set. You only need to output where you believe the questions and answers are. Each predicted range may cover more than a single token: you could produce one prediction for a whole question or separate predictions for each individual word, and the platform will match them up.
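
If you would rather emit one range per question or answer than one per token, a post-processing step along the following lines could join neighbouring token predictions. This is a minimal sketch: the `merge_token_spans` helper and its `max_gap` parameter are hypothetical, it assumes the offsets are plain integers, and whether merging helps depends on how the platform matches ranges, which is not specified here.

```python
def merge_token_spans(predictions, max_gap=1):
    """Merge (start, end, label) token predictions into longer spans.

    Neighbouring predictions with the same label are joined when at most
    `max_gap` characters (e.g. a single space) separate them.
    """
    merged = []
    for start, end, label in sorted(predictions):
        if merged and merged[-1][2] == label and start - merged[-1][1] <= max_gap:
            prev_start, prev_end, _ = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end), label)
        else:
            merged.append((start, end, label))
    return merged


# Toy token-level predictions: one question followed by two separate answers
token_predictions = [
    (3, 4, "Q"), (5, 9, "Q"), (10, 15, "Q"), (15, 16, "Q"),
    (17, 22, "A"), (25, 34, "A"),
]

spans = merge_token_spans(token_predictions)
print(spans)  # [(3, 16, 'Q'), (17, 22, 'A'), (25, 34, 'A')]
```

Note that with `max_gap=1`, two distinct answers separated by only a single space would also be merged into one range, so the threshold is worth tuning against the training labels.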

**When you come to implement your own solution, you will likely need to edit `predict()` in `run.py` to work with your model. Also, make sure you include the right model in your submission!**

You can edit `predict()` however you wish, as long as the question and answer ranges it produces lie within the document! If your submission is slow to evaluate on the platform, you may wish to edit `predict()` to perform inference in batches rather than chunk by chunk, although this will use more RAM. Note that in addition to the RAM limit, there is a submission size limit, so make sure you only upload models that are relevant to your current submission.
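
One way to approach batching is to slice the chunks in groups and call the model once per group. The helper below only does the slicing; its name and the batch size are assumptions, and it is shown on a plain list so that it runs without a model.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items` holding at most `batch_size` entries."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start : start + batch_size]


# Seven toy "chunks" processed three at a time
batches = list(iter_batches(list(range(7)), batch_size=3))
print(batches)  # [[0, 1, 2], [3, 4, 5], [6]]
```

Slicing a 2-D tensor such as `inputs["input_ids"]` in the same way keeps the batch dimension, so each slice (together with the matching slice of `inputs["attention_mask"]`) could be passed to the model in a single call. A smaller batch size trades speed for lower memory use.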

## Uploading your submission to the platform

You are now ready to make your first submission to the platform! 👀

**Make sure to [enrol to take part](https://doxaai.com/competition/harmony-parsing) in the challenge if you have not already done so.**

First, we need to make sure we are logged in:


In [None]:
!doxa login

And then, we can submit our work for evaluation:


In [None]:
!doxa upload submission

**Congratulations!** 🥳

You have now made your first submission for this challenge on the DOXA AI platform!

If everything went well, your submission will now be queued up for evaluation. It will first be run on a small validation set to make sure that your submission does not crash on the full test set. If your submission runs into an issue at this point, you will be able to see the error logs from this phase. Otherwise, if your submission passes this stage, it will be evaluated on the full test set, and you will soon appear on the [competition scoreboard](https://doxaai.com/competition/harmony-parsing/scoreboard)!


## Next steps

**Now, it is up to you as to where you go from here to solve this challenge!**

Here are some ideas you might want to test out:

- How could you improve the training process to boost performance?
- What other [pre-trained models](https://huggingface.co/models?pipeline_tag=token-classification&sort=trending) in HuggingFace transformers could you use?
- How could you provide a `compute_metrics` function to the `Trainer` to produce additional metrics? (e.g. accuracy)
- How could you make better use of the training data provided?
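
As a starting point for the `compute_metrics` idea, here is a minimal sketch of a token-level accuracy that skips positions labelled `-100`. It assumes the `Trainer` passes an object that unpacks into `(logits, labels)` and is written in plain Python so that it also runs on nested lists; a NumPy version would be faster on real evaluation data.

```python
IGNORE_INDEX = -100  # label used for special tokens and non-initial subwords


def compute_metrics(eval_pred):
    """Token-level accuracy over positions whose label is not IGNORE_INDEX."""
    logits, labels = eval_pred
    correct = total = 0
    for chunk_logits, chunk_labels in zip(logits, labels):
        for token_logits, label in zip(chunk_logits, chunk_labels):
            if label == IGNORE_INDEX:
                continue
            scores = list(token_logits)
            predicted = scores.index(max(scores))  # argmax over the classes
            correct += int(predicted == label)
            total += 1
    return {"accuracy": correct / total if total else 0.0}


# Toy example: one chunk, four tokens, three classes
toy_logits = [[[0.1, 2.0, 0.3], [1.0, 0.0, 0.0], [0.0, 0.0, 5.0], [9.0, 0.0, 0.0]]]
toy_labels = [[-100, 0, 1, 0]]

metrics = compute_metrics((toy_logits, toy_labels))
print(metrics)  # two of the three scored tokens are correct
```

The function could then be supplied as `Trainer(..., compute_metrics=compute_metrics)`. Since most tokens are likely to be labelled `other`, per-class precision and recall may be more informative than accuracy alone.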

If you are new to fine-tuning language models, take a look at the excellent [HuggingFace `transformers` documentation](https://huggingface.co/docs/transformers/en/training)!

**We look forward to seeing what you build!** We would love to hear about what you are working on for this challenge, so do let us know how you are finding the challenge on the [Harmony community Discord server](https://discord.com/invite/harmonydata) or the [DOXA AI community Discord server](https://discord.gg/MUvbQ3UYcf). 😎
