# **Harmony**: Question Matching Algorithm Improvement Challenge

**NLP challenge** | [Visit the challenge page](https://doxaai.com/competition/harmony-matching)

Your challenge is to develop an improved algorithm for matching psychology survey questions. It should produce similarity ratings that align more closely with those given by human psychologists working in the field, and it should be possible to integrate it into the [Harmony tool](https://harmonydata.ac.uk/developer-guide/).

This Jupyter notebook will introduce you to the challenge and guide you through the process of making your first submission to the [DOXA AI platform](https://doxaai.com/competition/harmony-matching).

**Before you get started, make sure to [sign up for an account](https://doxaai.com/sign-up) if you do not already have one and [enrol to take part](https://doxaai.com/competition/harmony-matching) in the challenge.**

**If you have any questions, feel free to ask them in the [Harmony community Discord server](https://discord.com/invite/harmonydata).**


## Installing and importing useful packages


In [1]:
%pip install pandas seaborn transformers sentence-transformers
%pip install -U doxa-cli

In [2]:
import os

import pandas as pd
import seaborn as sns

pd.set_option("display.max_colwidth", None)

## Loading the data


In [3]:
# Download the dataset if we do not already have it
if not os.path.exists("train.csv"):
    !curl https://raw.githubusercontent.com/DoxaAI/harmony-matching-getting-started/main/train.csv --output train.csv

# Load the data
df = pd.read_csv("train.csv")

## Exploring the data

Let's get started by taking a look at the training dataset, which contains the following columns:

- `sentence_1` and `sentence_2`: a pair of English-language sentences drawn from psychology surveys
- `human_similarity`: the human-judged similarity of the two sentences (integers in the range `[0, 100]`)
- `cosine_from_harmony`: cosine similarity values currently generated by the Harmony tool, which are provided purely for reference and do not form part of the challenge


In [None]:
df

In [None]:
df.info()

In [None]:
df.hist(figsize=(12, 5))

In [None]:
sns.displot(df, x="human_similarity", y="cosine_from_harmony", bins=25)

In [None]:
df[["human_similarity", "cosine_from_harmony"]].corr()

As you can see from the visualisations and the correlation matrix, the cosine similarity scores currently being used within the Harmony tool do not correlate particularly well with the human-sourced similarity ratings. Your challenge is to develop a matching algorithm that aligns more closely with the human-provided scores!
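For reference, `DataFrame.corr()` computes the Pearson correlation coefficient by default. The sketch below spells out that calculation in plain Python. The `pearson` helper and the toy values are made up for illustration and are not taken from `train.csv`.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance and variances (the common 1/n factor cancels out)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values for illustration only
human = [0, 25, 50, 75, 100]
cosine = [0.30, 0.45, 0.50, 0.70, 0.80]

print(pearson(human, cosine))
```

Note that Pearson correlation is unchanged by a positive linear rescaling of either variable, so multiplying cosine similarities by 100 does not affect it.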


## Generating embeddings

As an example to get you started, this notebook implements a relatively simple strategy: we use a pre-trained model to compute an embedding for each sentence in the training dataset and then take the cosine similarity between the two embeddings in each pair as the basis for our similarity score predictions.

First, we will load a pre-trained [SentenceTransformers](https://sbert.net/) model:


In [None]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-mpnet-base-v2")

model

Next, we will use this model to generate embeddings for both sentences in every pair in the training dataset and compute the cosine similarity for each pair. Then, to produce integer scores in the range `[0, 100]` that match the scale of the human-provided scores, we will multiply the cosine similarities by 100, truncate them to integers and clip them to that range.
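To make the arithmetic concrete, here is a minimal sketch of the same mapping on toy vectors in plain Python. The helper names `cosine_similarity` and `to_score` are hypothetical and are not part of the SentenceTransformers API.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def to_score(cosine):
    """Scale to a percentage, truncate toward zero and clip to [0, 100]."""
    return max(0, min(100, int(100 * cosine)))


print(to_score(cosine_similarity([1.0, 0.0], [1.0, 1.0])))   # cosine is about 0.707, so 70
print(to_score(cosine_similarity([1.0, 0.0], [-1.0, 0.0])))  # negative cosine is clipped to 0
```

Because `int()` truncates rather than rounds, a cosine similarity of 0.707 becomes 70, not 71. When the embeddings are already normalised to unit length, the cosine similarity reduces to a plain dot product.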


In [None]:
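# Encode each column of sentences into unit-length embedding vectors
embeddings_1 = model.encode(df["sentence_1"], normalize_embeddings=True)
embeddings_2 = model.encode(df["sentence_2"], normalize_embeddings=True)

# Compute the similarity between the i-th embedding in each set
df["prediction"] = model.similarity_pairwise(embeddings_1, embeddings_2)
# Scale to a percentage, truncate to an integer and clip to [0, 100]
df["prediction"] = (100 * df["prediction"]).apply(int).clip(0, 100)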

df

In [None]:
df.hist(["human_similarity", "prediction"])

In [None]:
sns.displot(df, x="human_similarity", y="prediction", bins=25)

In [None]:
df[["human_similarity", "cosine_from_harmony", "prediction"]].corr()

While these predictions represent a slight improvement, it is now your challenge to see how much better you can make things! 👀


## Producing a submission package

**Now, we will move on to creating your first submission!**

When you upload your work to the DOXA AI platform, your code will be run in an environment with no internet access. As such, your submission needs to contain any models you want to use as part of the submission, as well as any code necessary to use those models.

Currently, the `submission/` folder contains three files:

- `submission/competition.py`: this contains competition-specific code used to interface with the platform
- `submission/doxa.yaml`: this is a configuration file used by the DOXA CLI when you make a submission
- `submission/run.py`: this is the Python script that gets run when your work gets evaluated (**you will need to edit this to implement your solution!**)

First, we will save the SentenceTransformer model we have just loaded into our `submission/` directory:


In [14]:
model.save("submission/model")

Next, if you take a look at `run.py`, you will see the following:

```py
class Evaluator(BaseEvaluator):
    def predict(self, df: pd.DataFrame) -> Generator[int, Any, None]:
        model = SentenceTransformer(str(directory / "model"))

        embeddings_1 = model.encode(df["sentence_1"], normalize_embeddings=True)
        embeddings_2 = model.encode(df["sentence_2"], normalize_embeddings=True)

        df["prediction"] = model.similarity_pairwise(embeddings_1, embeddings_2)
        df["prediction"] = (100 * df["prediction"]).apply(int).clip(0, 100)

        for _, row in df.iterrows():
            yield row["prediction"]
```

In the `predict()` method, we load the `SentenceTransformer` model we just saved at `submission/model`, produce embeddings for the dataframe provided (in this instance containing the test set sentence pairs), compute the cosine similarities and then convert them into integer scores in the range `[0, 100]`. There are multiple ways to produce these similarity scores, and it is up to you to experiment with different techniques! For example, instead of computing the cosine similarity here, you may want to feed the embeddings you generate into another neural network you have trained for this task.

**When you come to implement your own solution, you will need to edit `predict()` in `run.py` and make sure you include the right model in your submission!**

You can modify `predict()` however you wish: it just has to yield your similarity score predictions in the same order as the rows appear in the dataframe. If your submission requires a lot of RAM, you may wish to modify `predict()` to process the test set in batches instead of all at once. Note that in addition to the RAM limit, there is a submission size limit, so make sure you are only uploading models that are relevant to your current submission.
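One way to batch the work is sketched below as a plain-Python generator. The names `predict_in_batches`, `score_batch` and `dummy_scorer` are hypothetical and not part of the competition code, and the batch size of 256 is an arbitrary assumption. In `run.py`, you would likely slice the dataframe instead (for example with `df.iloc[start:start + batch_size]`) and call your model inside the loop.

```python
from typing import Callable, Iterator, List, Sequence, Tuple

Pair = Tuple[str, str]


def predict_in_batches(
    pairs: Sequence[Pair],
    score_batch: Callable[[Sequence[Pair]], List[int]],
    batch_size: int = 256,
) -> Iterator[int]:
    """Yield one score per pair, in input order, scoring one batch at a time."""
    for start in range(0, len(pairs), batch_size):
        batch = pairs[start:start + batch_size]
        # Only one batch of intermediate results is held in memory at once
        yield from score_batch(batch)


def dummy_scorer(batch):
    """Stand-in for a real model: 100 for identical sentences, otherwise 0."""
    return [100 if a == b else 0 for a, b in batch]


pairs = [("a", "a"), ("a", "b"), ("c", "c")]
print(list(predict_in_batches(pairs, dummy_scorer, batch_size=2)))  # [100, 0, 100]
```

Yielding from each batch in turn keeps the predictions in the same order as the input, which is what the platform expects.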


## Uploading your submission to the platform

You are now ready to make your first submission to the platform! 👀

**Make sure to [enrol to take part](https://doxaai.com/competition/harmony-matching) in the challenge if you have not already done so.**

First, we need to make sure we are logged in:


In [None]:
!doxa login

And then, we can submit our work for evaluation:


In [None]:
!doxa upload submission

**Congratulations!** 🥳

You have just made your first submission for this challenge on the DOXA AI platform!

If everything went well, your submission will now be queued up for evaluation. It will first be run on a small validation set to catch any crashes before it is run on the full test set. If your submission runs into an issue at this point, you will be able to see the error logs from this phase. Otherwise, if your submission passes this stage, it will be evaluated on the full test set, and you will soon appear on the [competition scoreboard](https://doxaai.com/competition/harmony-matching/scoreboard)!


## Next steps

**Now, it is up to you where you go from here to solve this challenge!**

Here are some ideas you might want to test out:

- Using other [SentenceTransformers](https://sbert.net/) models that may perform better at this task than `all-mpnet-base-v2`
- Training an additional model to predict `human_similarity` from embeddings computed using the [SentenceTransformers](https://sbert.net/) library
- Fine-tuning a language model for this task
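As a deliberately simple illustration of the second idea, the sketch below fits a one-feature linear calibration from cosine similarity to human score using ordinary least squares. This is an assumption-laden toy: it uses only the cosine similarity rather than the full embeddings, the function names are hypothetical, and the data is made up. A real attempt would train on `train.csv` and hold out some rows for validation.

```python
def fit_linear_calibration(cosines, human_scores):
    """Least-squares fit of human_score ~ slope * cosine + intercept."""
    n = len(cosines)
    mean_x = sum(cosines) / n
    mean_y = sum(human_scores) / n
    sxx = sum((x - mean_x) ** 2 for x in cosines)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(cosines, human_scores))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def calibrated_score(cosine, slope, intercept):
    """Apply the fitted line, round to an integer and clip to [0, 100]."""
    return max(0, min(100, round(slope * cosine + intercept)))


# Hypothetical training values for illustration only
cosines = [0.2, 0.4, 0.6, 0.8]
human = [0, 20, 60, 100]

slope, intercept = fit_linear_calibration(cosines, human)
print(calibrated_score(0.5, slope, intercept))
```

The same fitted `slope` and `intercept` would need to be stored alongside the model in `submission/` so that `predict()` can apply them at evaluation time.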

**We look forward to seeing what you build!** We would love to hear about what you are working on for this challenge, so do let us know how you are finding the challenge on the [Harmony community Discord server](https://discord.com/invite/harmonydata) or the [DOXA AI community Discord server](https://discord.gg/MUvbQ3UYcf). 😎
