# **Harmony**: Question Matching Algorithm Improvement Challenge

**NLP challenge** | [Visit the challenge page](https://doxaai.com/competition/harmony-matching)

Your challenge is to develop an improved algorithm for matching psychology survey questions that produces similarity ratings more closely aligned with those given by human psychologists working in the field and that can be integrated into the [Harmony tool](https://harmonydata.ac.uk/developer-guide/).

This notebook will expand upon the getting-started resources in `getting-started.ipynb`, showing you how to fine-tune a pre-trained model!

**Before you get started, make sure to [sign up for an account](https://doxaai.com/sign-up) if you do not already have one and [enrol to take part](https://doxaai.com/competition/harmony-matching) in the challenge.**

**If you have any questions, feel free to ask them in the [Harmony community Discord server](https://discord.com/invite/harmonydata).**


## Installing and importing useful packages

Before you get started, please make sure you have [PyTorch](https://pytorch.org/get-started/locally/) installed in your Python environment. If you do not have `pandas`, `seaborn`, `transformers` or `sentence-transformers`, the code in the following cell will install them.


In [None]:
%pip install "pandas>=2.2.2" "seaborn>=0.13.2" "transformers>=4.43.1" "sentence-transformers[train]>=3.0.1"

In [None]:
# Install the latest version of the DOXA CLI
%pip install -U doxa-cli

In [9]:
import os

import pandas as pd

pd.set_option("display.max_colwidth", None)

## Loading the data


In [10]:
# Download the dataset if we do not already have it
if not os.path.exists("train.csv"):
    !curl https://raw.githubusercontent.com/DoxaAI/harmony-matching-getting-started/main/train.csv --output train.csv

# Download the submission template if we do not already have it
if not os.path.exists("submission_finetuning"):
    !curl https://raw.githubusercontent.com/DoxaAI/harmony-matching-getting-started/main/submission/competition.py --create-dirs --output submission_finetuning/competition.py
    !curl https://raw.githubusercontent.com/DoxaAI/harmony-matching-getting-started/main/submission/doxa.yaml --output submission_finetuning/doxa.yaml
    !curl https://raw.githubusercontent.com/DoxaAI/harmony-matching-getting-started/main/submission/model.py --output submission_finetuning/model.py
    !curl https://raw.githubusercontent.com/DoxaAI/harmony-matching-getting-started/main/submission/run.py --output submission_finetuning/run.py

# Load the data
df = pd.read_csv("train.csv")

In order to fine-tune a pre-trained model with `SentenceTransformers`, we need to transform our data to be in a slightly different format:


In [None]:
from datasets import Dataset

df = df[["sentence_1", "sentence_2", "human_similarity"]].rename(
    columns={
        "sentence_1": "sentence1",
        "sentence_2": "sentence2",
        "human_similarity": "score",
    }
)

# Rescale the scores to be in the range [0.0, 1.0]
df["score"] /= 100.0

dataset = Dataset.from_pandas(
    df,
)

dataset
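
The cell above uses every row of `train.csv` for training. If you would like an honest estimate of how well a fine-tuned model generalises, one option is to hold out some rows before building the `Dataset`. The following is a minimal sketch on a toy frame, not part of the original workflow: the names `toy_df`, `train_df` and `valid_df` and the 80/20 ratio are illustrative assumptions.

```python
import pandas as pd

# Toy stand-in for the renamed training frame (the real one comes from train.csv)
toy_df = pd.DataFrame(
    {
        "sentence1": [f"question {i}" for i in range(10)],
        "sentence2": [f"item {i}" for i in range(10)],
        "score": [i / 10 for i in range(10)],
    }
)

# Shuffle reproducibly, then keep 80% for training and 20% for validation
shuffled = toy_df.sample(frac=1.0, random_state=42).reset_index(drop=True)
n_train = int(len(shuffled) * 0.8)
train_df = shuffled.iloc[:n_train]
valid_df = shuffled.iloc[n_train:].reset_index(drop=True)

print(len(train_df), len(valid_df))
```

With a split like this, `train_df` would be passed to `Dataset.from_pandas` in place of `df`, and `valid_df` would be kept aside for evaluation.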

## Fine-tuning a SentenceTransformers model

In this notebook, we will walk you through how to fine-tune a pre-trained [SentenceTransformers](https://sbert.net/) model for our task.

There are multiple fine-tuning approaches that you can take, but in this example, we are going to fine-tune a pre-trained `all-mpnet-base-v2` model using the `CosineSimilarityLoss` in order to make our model produce cosine similarity-based scores that align more closely with the human-provided similarity scores.
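
To make the objective concrete, here is a small pure-Python sketch of the quantity that a cosine similarity loss minimises: the mean squared error between the cosine similarity of each embedding pair and its target score. The two-dimensional vectors are made up for illustration, and the helper names are hypothetical; the library's own implementation operates on batched tensors produced by the model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_similarity_loss(pairs, scores):
    """Mean squared error between cosine similarities and target scores."""
    errors = [
        (cosine_similarity(u, v) - s) ** 2 for (u, v), s in zip(pairs, scores)
    ]
    return sum(errors) / len(errors)


# Made-up 2D "embeddings"; real sentence embeddings have hundreds of dimensions
pairs = [([1.0, 0.0], [1.0, 0.0]), ([1.0, 0.0], [0.0, 1.0])]
scores = [1.0, 0.5]  # human similarity, already rescaled to [0.0, 1.0]

loss = cosine_similarity_loss(pairs, scores)
print(loss)  # 0.125
```

The first pair is identical and has a target of 1.0, so it contributes no error; the second pair is orthogonal (cosine 0.0) against a target of 0.5, which contributes 0.25 before averaging.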

First, we load a pre-trained model once again:


In [None]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-mpnet-base-v2")

model

Next, we can use the useful functionality built into `SentenceTransformers` to start fine-tuning the model:


In [None]:
from sentence_transformers import (
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import CosineSimilarityLoss


loss = CosineSimilarityLoss(model)

trainer = SentenceTransformerTrainer(
    model=model,
    args=SentenceTransformerTrainingArguments(
        output_dir="checkpoints",
        num_train_epochs=3,
        per_device_train_batch_size=16,
        report_to="none",
    ),
    train_dataset=dataset,
    loss=loss,
)

trainer.train()

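Let's now compute some evaluation metrics using this fine-tuned model. Note that this evaluates on the same data the model was just trained on, so the resulting scores are likely to overestimate performance on unseen question pairs: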


In [None]:
from sentence_transformers.evaluation import (
    EmbeddingSimilarityEvaluator,
    SimilarityFunction,
)

dev_evaluator = EmbeddingSimilarityEvaluator(
    # Pass plain Python lists rather than pandas Series
    sentences1=df["sentence1"].tolist(),
    sentences2=df["sentence2"].tolist(),
    scores=df["score"].tolist(),
    main_similarity=SimilarityFunction.COSINE,
)

dev_evaluator(model)
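
As described in the SentenceTransformers documentation, this evaluator reports Pearson and Spearman correlations between the model's similarity scores and the reference scores. The sketch below illustrates those two statistics in plain Python on made-up numbers. The function names and the sample values are assumptions for illustration, and the simple ranking helper assumes there are no tied values.

```python
import math


def pearson(xs, ys):
    """Pearson correlation: linear agreement between two sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """Rank positions starting at 0; this simple version assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order):
        result[index] = rank
    return result


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Made-up predicted cosine similarities and human scores in [0.0, 1.0]
predicted = [0.91, 0.15, 0.62, 0.40]
human = [0.95, 0.10, 0.70, 0.30]

print(pearson(predicted, human), spearman(predicted, human))
```

Spearman only cares about ordering, so it reaches 1.0 here because both lists rank the four pairs identically, even though the raw values differ.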

## Producing a submission

**Now, we will move onto creating your submission!**

Just as before in the `getting-started.ipynb` notebook, we need to prepare a submission folder containing our fine-tuned model, as well as the code necessary to use it (which is identical to the code we used in the `getting-started.ipynb` notebook).

Currently, the `submission_finetuning/` folder contains the files downloaded earlier, including:

- `submission_finetuning/competition.py`: this contains competition-specific code used to interface with the platform
- `submission_finetuning/doxa.yaml`: this is a configuration file used by the DOXA CLI when you make a submission
- `submission_finetuning/run.py`: this is the Python script that gets run when your work gets evaluated

All that is left to do is to save the SentenceTransformer model we have just fine-tuned into our `submission_finetuning/` directory:


In [16]:
model.save("submission_finetuning/model")
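
Before uploading, it can be worth confirming that the folder contains everything it should. The following is a minimal sketch of such a check. The helper `missing_entries` and the `EXPECTED` list are hypothetical, based only on the entries named in this notebook; the platform may require other files that are not covered here.

```python
from pathlib import Path


def missing_entries(folder, expected):
    """Return the expected files or directories that are absent from folder."""
    root = Path(folder)
    return [name for name in expected if not (root / name).exists()]


# Entries named in this notebook; "model" is the directory written by model.save
EXPECTED = ["competition.py", "doxa.yaml", "run.py", "model"]

missing = missing_entries("submission_finetuning", EXPECTED)
if missing:
    print("Missing from submission_finetuning/:", ", ".join(missing))
else:
    print("All expected entries are present.")
```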

## Uploading your submission to the platform

You are now ready to make your first submission to the platform! 👀

**Make sure to [enrol to take part](https://doxaai.com/competition/harmony-matching) in the challenge if you have not already done so.**

First, we need to make sure we are logged in:


In [None]:
!doxa login

And then, we can submit our work for evaluation:


In [None]:
!doxa upload submission_finetuning

**Congratulations!** 🥳

You have now made a submission for this challenge on the DOXA AI platform!

If everything went well, your submission will now be queued up for evaluation. It will first be run on a small validation set to make sure that your submission does not crash on the full test set. If your submission runs into an issue at this point, you will be able to see the error logs from this phase. Otherwise, if your submission passes this stage, it will be evaluated on the full test set, and you will soon appear on the [competition scoreboard](https://doxaai.com/competition/harmony-matching/scoreboard)!


## Next steps

**Where you go from here to solve this challenge is now up to you!**

We would highly recommend taking a look at the [SentenceTransformers documentation](https://sbert.net/index.html) and the [HuggingFace `transformers` documentation](https://huggingface.co/docs/transformers/en/training) for inspiration as to what to do next!

**We look forward to seeing what you build!** We would love to hear about what you are working on for this challenge, so do let us know how you are finding the challenge on the [Harmony community Discord server](https://discord.com/invite/harmonydata) or the [DOXA AI community Discord server](https://discord.gg/MUvbQ3UYcf). 😎
