# **Harmony**: Question Matching Algorithm Improvement Challenge

**NLP challenge** | [Visit the challenge page](https://doxaai.com/competition/harmony-matching)

Your challenge is to develop an improved algorithm for matching psychology survey questions. It should produce similarity ratings that align more closely with those given by human psychologists working in the field, and it should be possible to integrate it into the [Harmony tool](https://harmonydata.ac.uk/developer-guide/).

This Jupyter notebook will introduce you to the challenge and guide you through the process of making your first submission to the [DOXA AI platform](https://doxaai.com/competition/harmony-matching).

**Before you get started, make sure to [sign up for an account](https://doxaai.com/sign-up) if you do not already have one and [enrol to take part](https://doxaai.com/competition/harmony-matching) in the challenge.**

**If you have any questions, feel free to ask them in the [Harmony community Discord server](https://discord.com/invite/harmonydata).**


## Installing and importing useful packages

Before you get started, please make sure you have [PyTorch](https://pytorch.org/get-started/locally/) installed in your Python environment. If you do not have `pandas`, `seaborn`, `transformers` or `sentence-transformers`, the code in the following cell will install them.


In [1]:
%pip install "pandas>=2.2.2" "seaborn>=0.13.2" "transformers==4.43.1" "sentence-transformers==3.0.1"

Collecting transformers==4.43.1
  Downloading transformers-4.43.1-py3-none-any.whl.metadata (43 kB)
Collecting sentence-transformers==3.0.1
  Downloading sentence_transformers-3.0.1-py3-none-any.whl.metadata (10 kB)
Collecting tokenizers<0.20,>=0.19 (from transformers==4.43.1)
  Downloading tokenizers-0.19.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.7 kB)
Downloading transformers-4.43.1-py3-none-any.whl (9.4 MB)
Downloading sentence_transformers-3.0.1-py3-none-any.whl (227 kB)
Downloading tokenizers-0.19.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.6 MB)

In [2]:
# Install the latest version of the DOXA CLI
%pip install -U doxa-cli

Collecting doxa-cli
  Downloading doxa_cli-0.1.8-py3-none-any.whl.metadata (4.5 kB)
Collecting halo>=0.0.31,~=0.0.31 (from doxa-cli)
  Downloading halo-0.0.31.tar.gz (11 kB)
  Preparing metadata (setup.py) ... done
Collecting requests-toolbelt~=0.10.1 (from doxa-cli)
  Downloading requests_toolbelt-0.10.1-py2.py3-none-any.whl.metadata (14 kB)
Collecting requests~=2.26.0 (from doxa-cli)
  Downloading requests-2.26.0-py2.py3-none-any.whl.metadata (4.8 kB)
Collecting log_symbols>=0.0.14 (from halo>=0.0.31,~=0.0.31->doxa-cli)
  Downloading log_symbols-0.0.14-py3-none-any.whl.metadata (523 bytes)
Collecting spinners>=0.0.24 (from halo>=0.0.31,~=0.0.31->doxa-cli)
  Downloading spinners-0.0.24-py3-none-any.whl.metadata (576 bytes)
Collecting colorama>=0.3.9 (from halo>=0.0.31,~=0.0.31->doxa-cli)
  Downloading colorama-0.4.6-py2.py3-none-any.whl.metadata (17 kB)
Collecting urllib3<1.27,>=1.21.1 (from requests~=2.26.0->doxa-cli)
  Downloading urllib3-1.26.20-py2.py3-none-any.whl.met

In [3]:
# Install the Hugging Face `datasets` library, used below to split the data
%pip install datasets

Collecting datasets
  Downloading datasets-3.1.0-py3-none-any.whl.metadata (20 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting requests>=2.32.2 (from datasets)
  Downloading requests-2.32.3-py3-none-any.whl.metadata (4.6 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.9.0,>=2023.1.0 (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.1.0-py3-none-any.whl (480 kB)
Downloading dill-0.3.8-py3-none-any.whl (116 kB)

In [4]:
import os

import pandas as pd
import seaborn as sns
from datasets import Dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, InputExample
from sentence_transformers.losses import CosineSimilarityLoss
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator
from torch.utils.data import DataLoader

pd.set_option("display.max_colwidth", None)

## Loading the data


In [5]:
# Download the dataset if we do not already have it
if not os.path.exists("train.csv"):
    !curl https://raw.githubusercontent.com/DoxaAI/harmony-matching-getting-started/main/train.csv --output train.csv

if not os.path.exists("submission"):
    !curl https://raw.githubusercontent.com/DoxaAI/harmony-matching-getting-started/main/submission/competition.py --create-dirs --output submission/competition.py
    !curl https://raw.githubusercontent.com/DoxaAI/harmony-matching-getting-started/main/submission/doxa.yaml --output submission/doxa.yaml
    !curl https://raw.githubusercontent.com/DoxaAI/harmony-matching-getting-started/main/submission/run.py --output submission/run.py

# Load the data
df = pd.read_csv("train.csv")

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  361k  100  361k    0     0   701k      0 --:--:-- --:--:-- --:--:--  702k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   786  100   786    0     0   2081      0 --:--:-- --:--:-- --:--:--  2079
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    87  100    87    0     0    218      0 --:--:-- --:--:-- --:--:--   219
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  2044  100  2044    0     0   2713      0 --:--:-- --:--:-- --:--:--  2710


## Exploring the data

Let's get started by taking a look at the training dataset, which contains the following data variables:

- `sentence_1` and `sentence_2`: a pair of English-language sentences drawn from psychology surveys
- `human_similarity`: the human-judged similarity of the two sentences (integers in the range `[0, 100]`)
- `cosine_from_harmony`: cosine similarity values currently generated by the Harmony tool, which are provided purely for reference and do not form part of the challenge


In [9]:
df

Unnamed: 0,sentence_1,sentence_2,human_similarity,cosine_from_harmony
0,Do you believe in telepathy (mind-reading)?,I believe that there are secret signs in the world if you just know how to look for them.,15,0.242434
1,"Irritable behavior, angry outbursts, or acting aggressively?",Felt “on edge”?,62,-0.325047
2,I have some eccentric (odd) habits.,I often have difficulty following what someone is saying to me.,0,0.441590
3,Do you often feel nervous when you are in a group of unfamiliar people?,Been easily annoyed by different things?,0,0.407776
4,Do you believe in telepathy (mind-reading)?,Most of the time I find it is very difficult to get my thoughts in order.,26,0.335685
...,...,...,...,...
2346,Little interest or pleasure in doing things,At times I have wondered if my body was really my own,0,0.233570
2347,"Feeling down, depressed, or hopeless?",I find that I am very often confused about what is going on around me.,0,0.377662
2348,Not being able to stop or control worrying?,"If given the choice, I would much rather be with another person than alone.",16,-0.170234
2349,"Feeling nervous, anxious or on edge?",Have had changes in appetite or sleep?,16,0.357956


In [6]:
data=df.copy()

In [21]:
data.head(10)

Unnamed: 0,sentence_1,sentence_2,human_similarity,cosine_from_harmony
0,Do you believe in telepathy (mind-reading)?,I believe that there are secret signs in the world if you just know how to look for them.,15,0.242434
1,"Irritable behavior, angry outbursts, or acting aggressively?",Felt “on edge”?,62,-0.325047
2,I have some eccentric (odd) habits.,I often have difficulty following what someone is saying to me.,0,0.44159
3,Do you often feel nervous when you are in a group of unfamiliar people?,Been easily annoyed by different things?,0,0.407776
4,Do you believe in telepathy (mind-reading)?,Most of the time I find it is very difficult to get my thoughts in order.,26,0.335685
5,Taking too many risks or doing things that could cause you harm?,"Avoiding external reminders of the experience (for example, people, places, conversations, objects, activities, or situations)?",21,-0.324714
6,I sometimes forget what I am trying to say.,"I have had experiences with seeing the future, ESP or a sixth sense.",0,0.214685
7,Blaming yourself or someone else for the stressful experience or what happened after it?,Experienced sleep disturbances?,0,0.236051
8,Feeling afraid as if something awful might happen?,My thoughts and behaviors are almost always disorganized.,76,0.257179
9,I sometimes avoid going to places where there will be many people because I will get anxious.,"I have had experiences with seeing the future, ESP or a sixth sense.",0,0.272508


In [7]:
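# Rescale the human ratings from [0, 100] to [0, 1] so they can serve as targets for a cosine similarity loss
data["normalize_score"] = data["human_similarity"] / 100
# Drop the columns that are not needed for training (`columns=` already implies axis=1)
data = data.drop(columns=["cosine_from_harmony", "human_similarity"])
data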

Unnamed: 0,sentence_1,sentence_2,normalize_score
0,Do you believe in telepathy (mind-reading)?,I believe that there are secret signs in the world if you just know how to look for them.,0.15
1,"Irritable behavior, angry outbursts, or acting aggressively?",Felt “on edge”?,0.62
2,I have some eccentric (odd) habits.,I often have difficulty following what someone is saying to me.,0.00
3,Do you often feel nervous when you are in a group of unfamiliar people?,Been easily annoyed by different things?,0.00
4,Do you believe in telepathy (mind-reading)?,Most of the time I find it is very difficult to get my thoughts in order.,0.26
...,...,...,...
2346,Little interest or pleasure in doing things,At times I have wondered if my body was really my own,0.00
2347,"Feeling down, depressed, or hopeless?",I find that I am very often confused about what is going on around me.,0.00
2348,Not being able to stop or control worrying?,"If given the choice, I would much rather be with another person than alone.",0.16
2349,"Feeling nervous, anxious or on edge?",Have had changes in appetite or sleep?,0.16


In [23]:
data.head(10)

Unnamed: 0,sentence_1,sentence_2,normalize_score
0,Do you believe in telepathy (mind-reading)?,I believe that there are secret signs in the world if you just know how to look for them.,0.15
1,"Irritable behavior, angry outbursts, or acting aggressively?",Felt “on edge”?,0.62
2,I have some eccentric (odd) habits.,I often have difficulty following what someone is saying to me.,0.0
3,Do you often feel nervous when you are in a group of unfamiliar people?,Been easily annoyed by different things?,0.0
4,Do you believe in telepathy (mind-reading)?,Most of the time I find it is very difficult to get my thoughts in order.,0.26
5,Taking too many risks or doing things that could cause you harm?,"Avoiding external reminders of the experience (for example, people, places, conversations, objects, activities, or situations)?",0.21
6,I sometimes forget what I am trying to say.,"I have had experiences with seeing the future, ESP or a sixth sense.",0.0
7,Blaming yourself or someone else for the stressful experience or what happened after it?,Experienced sleep disturbances?,0.0
8,Feeling afraid as if something awful might happen?,My thoughts and behaviors are almost always disorganized.,0.76
9,I sometimes avoid going to places where there will be many people because I will get anxious.,"I have had experiences with seeing the future, ESP or a sixth sense.",0.0


In [27]:
data['sentence_1']=data['sentence_1'].astype(str)
data['sentence_2']=data['sentence_2'].astype(str)

In [None]:
df.hist(figsize=(12, 5))

In [None]:
sns.displot(df, x="human_similarity", y="cosine_from_harmony", bins=25)

In [None]:
df[["human_similarity", "cosine_from_harmony"]].corr()

As you can see from the visualisations and the correlation matrix, the cosine similarity scores currently being used within the Harmony tool do not correlate particularly well with the human-sourced similarity ratings. Your challenge is to develop a matching algorithm that aligns more closely with the human-provided scores!
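
`DataFrame.corr()` reports the Pearson correlation coefficient by default. As a minimal, dependency-free sketch of what that number measures, the hypothetical helper `pearson` below computes it by hand; the toy values are made up for illustration and are not taken from the dataset.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2 values)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Covariance and variances (the common 1/n factor cancels out)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Toy values for illustration only (not from train.csv)
human_scores = [0, 15, 26, 62, 76]
model_scores = [0.10, 0.05, 0.40, 0.35, 0.80]
print(f"Pearson r = {pearson(human_scores, model_scores):.3f}")
```

A value near 1 means the two sets of scores rise and fall together, while a value near 0 means there is little linear relationship between them.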


## Generating embeddings

In this notebook, as an example to get you started, we are going to implement the relatively simple strategy of using a pre-trained model to compute sentence embeddings for each sentence in the training dataset and using the cosine similarity between the sentence pairs in the dataset as the basis for our similarity score predictions.
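
To make the similarity measure concrete, here is a small sketch of cosine similarity written in plain Python. The function name `cosine_similarity` and the toy vectors are illustrative assumptions; real sentence embeddings have hundreds of dimensions and are computed by the model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Toy 3-dimensional "embeddings" for illustration only
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
print(same_direction, orthogonal)
```

Cosine similarity lies in `[-1, 1]` and depends only on the direction of the vectors, not their length. When embeddings are normalised to unit length, it reduces to a plain dot product.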

First, we will split the data into training and held-out pairs and load a pre-trained [SentenceTransformers](https://sbert.net/) model:


In [8]:
# Hold out 20% of the pairs and wrap the remaining rows as InputExample objects for training
full_data = Dataset.from_pandas(data)
train = full_data.train_test_split(test_size=0.2)
train_data = [
    InputExample(texts=[row["sentence_1"], row["sentence_2"]], label=row["normalize_score"])
    for row in train["train"]
]

In [47]:
train_data1

<torch.utils.data.dataloader.DataLoader at 0x7cf396f1b5e0>

In [9]:
# Batch the training examples, reshuffling them at the start of every epoch
train_data1 = DataLoader(train_data, shuffle=True, batch_size=16)

In [40]:
train['train'][10]['sentence_1']

'People sometimes comment on my unusual mannerisms and habits.'

In [11]:
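# An InputExample holds one sentence pair (`texts`) and its target score (`label`)
row = train["train"][10]
a = InputExample(texts=[row["sentence_1"], row["sentence_2"]], label=row["normalize_score"])
a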

<sentence_transformers.readers.InputExample.InputExample at 0x7ddb6ff0f850>

In [10]:
model = SentenceTransformer("all-mpnet-base-v2")

model

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


SentenceTransformer(
  (0): Transformer({'max_seq_length': 384, 'do_lower_case': False}) with Transformer model: MPNetModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)

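Next, we will fine-tune the model on the training split using `CosineSimilarityLoss`, which pushes the cosine similarity of each sentence pair towards its normalised human score. We will then use the fine-tuned model to generate embeddings for all the sentences in the training dataset and compute the cosine similarity for each pair. Finally, to produce integer similarity scores in the range `[0, 100]` to match the human-provided scores, we will multiply the cosine similarities by 100, truncate them to integers and clip them to that range.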
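
The rescale-and-clip step can be sketched for a single value as follows. The helper name `to_score` is hypothetical; it mirrors the `(100 * prediction).apply(int).clip(0, 100)` expression used in this notebook.

```python
def to_score(cosine):
    """Map a cosine similarity in [-1, 1] to an integer score in [0, 100]."""
    scaled = int(100 * cosine)  # int() truncates towards zero rather than rounding
    return max(0, min(100, scaled))  # clip to the range of the human ratings


for value in (0.4567, -0.325, 1.0):
    print(value, "->", to_score(value))
```

Note that every negative cosine similarity collapses to a score of 0, and that truncation always drops the fractional part. Using `round()` instead of `int()` would be one alternative design choice.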


In [12]:
loss=CosineSimilarityLoss(model=model)

In [13]:
model.fit(train_objectives=[(train_data1, loss)], epochs=5)


Step,Training Loss
500,0.0677


In [48]:

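# Alternative to `model.fit` using the v3 trainer API. SentenceTransformerTrainer takes a
# datasets.Dataset, not a DataLoader; passing `train_data1` failed with:
# AttributeError: 'list' object has no attribute 'column_names'.
# The label column is renamed to "score"; the other two columns are the text inputs.
trainer_dataset = train["train"].select_columns(
    ["sentence_1", "sentence_2", "normalize_score"]
).rename_column("normalize_score", "score")
loss = CosineSimilarityLoss(model=model)
trainer = SentenceTransformerTrainer(model=model, train_dataset=trainer_dataset, loss=loss)
trainer.train()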

In [14]:
sentences = list(set(df["sentence_1"]) | set(df["sentence_2"]))
embeddings = {
    sentence: embedding
    for sentence, embedding in zip(
        sentences, model.encode(sentences, batch_size=16, show_progress_bar=True)
    )
}

df["prediction"] = model.similarity_pairwise(
    df["sentence_1"].map(embeddings),
    df["sentence_2"].map(embeddings),
)
df["prediction"] = (100 * df["prediction"]).apply(int).clip(0, 100)

df


Unnamed: 0,sentence_1,sentence_2,human_similarity,cosine_from_harmony,prediction
0,Do you believe in telepathy (mind-reading)?,I believe that there are secret signs in the world if you just know how to look for them.,15,0.242434,45
1,"Irritable behavior, angry outbursts, or acting aggressively?",Felt “on edge”?,62,-0.325047,33
2,I have some eccentric (odd) habits.,I often have difficulty following what someone is saying to me.,0,0.441590,17
3,Do you often feel nervous when you are in a group of unfamiliar people?,Been easily annoyed by different things?,0,0.407776,29
4,Do you believe in telepathy (mind-reading)?,Most of the time I find it is very difficult to get my thoughts in order.,26,0.335685,23
...,...,...,...,...,...
2346,Little interest or pleasure in doing things,At times I have wondered if my body was really my own,0,0.233570,6
2347,"Feeling down, depressed, or hopeless?",I find that I am very often confused about what is going on around me.,0,0.377662,37
2348,Not being able to stop or control worrying?,"If given the choice, I would much rather be with another person than alone.",16,-0.170234,2
2349,"Feeling nervous, anxious or on edge?",Have had changes in appetite or sleep?,16,0.357956,29


In [None]:
df.hist(["human_similarity", "prediction"])

In [None]:
sns.displot(df, x="human_similarity", y="prediction", bins=25)

In [15]:
df[["human_similarity", "cosine_from_harmony", "prediction"]].corr()

Unnamed: 0,human_similarity,cosine_from_harmony,prediction
human_similarity,1.0,0.114113,0.449124
cosine_from_harmony,0.114113,1.0,0.336404
prediction,0.449124,0.336404,1.0


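These predictions correlate noticeably better with the human ratings than the existing Harmony scores do (0.449 versus 0.114). However, most of these rows were also used to fine-tune the model, so this figure is likely to overstate performance on unseen sentence pairs. It is now your challenge to see how much better you can make things! 👀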


## Producing a submission package

**Now, we will move on to creating your first submission!**

When you upload your work to the DOXA AI platform, your code will be run in an environment with no internet access. As such, your submission needs to contain any models you want to use as part of the submission, as well as any code necessary to use those models.

Currently, the `submission/` folder contains three files:

- `submission/competition.py`: this contains competition-specific code used to interface with the platform
- `submission/doxa.yaml`: this is a configuration file used by the DOXA CLI when you make a submission
- `submission/run.py`: this is the Python script that gets run when your work gets evaluated (**you will need to edit this to implement your solution!**)
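
Because the evaluation environment has no internet access, it can be worth checking before you upload that the folder contains everything `run.py` needs. The sketch below is one way to do that; the helper `check_submission` is hypothetical, and the default model directory name `model` is an assumption based on the path that `run.py` loads from.

```python
from pathlib import Path

# The three files listed above
REQUIRED_FILES = ("competition.py", "doxa.yaml", "run.py")


def check_submission(folder, model_dir="model"):
    """Return (missing entries, total size in bytes) for a submission folder."""
    root = Path(folder)
    missing = [name for name in REQUIRED_FILES if not (root / name).is_file()]
    # Assumption: the saved model lives in a sub-directory of the submission folder
    if not (root / model_dir).is_dir():
        missing.append(f"{model_dir}/")
    total_bytes = 0
    if root.is_dir():
        total_bytes = sum(p.stat().st_size for p in root.rglob("*") if p.is_file())
    return missing, total_bytes


missing, total_bytes = check_submission("submission")
print("Missing:", missing or "nothing")
print(f"Total size: {total_bytes / 1e6:.1f} MB")
```

Running this after saving your model gives a quick view of what would be uploaded and how large the package is.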

First, we will save the SentenceTransformer model we have just fine-tuned into our `submission/` directory:


In [16]:
# run.py loads the model from `submission/model`, so save it under that name
model.save("submission/model")

Computing widget examples:   0%|          | 0/1 [00:00<?, ?example/s]

Next, if you take a look at `run.py`, you will see the following:

```py
class Evaluator(BaseEvaluator):
    def predict(self, df: pd.DataFrame) -> Generator[int, Any, None]:
        model = SentenceTransformer(str(directory / "model"))

        sentences = list(set(df["sentence_1"]) | set(df["sentence_2"]))
        embeddings = {
            sentence: embedding
            for sentence, embedding in zip(
                sentences, model.encode(sentences, batch_size=16)
            )
        }

        df["prediction"] = model.similarity_pairwise(
            df["sentence_1"].map(embeddings),
            df["sentence_2"].map(embeddings),
        )
        df["prediction"] = (100 * df["prediction"]).apply(int).clip(0, 100)

        for _, row in df.iterrows():
            yield row["prediction"]
```

In the `predict()` method, we load the `SentenceTransformer` model saved at `submission/model`, produce embeddings for the sentences in the dataframe provided (in this instance, the test set sentence pairs), compute the cosine similarities and then transform them into integer scores in the range `[0, 100]`. There are multiple ways to produce these similarity scores, and it is up to you to experiment with different techniques! For example, instead of computing the cosine similarity here, you may want to feed the embeddings you generate into another neural network you have trained for this task.

**When you come to implement your own solution, you will need to edit `predict()` in `run.py` and make sure you include the right model in your submission!**

You can modify `predict()` however you wish: it just has to yield your similarity score predictions in the same order as they appear in the dataframe. If your submission requires a lot of RAM, you may wish to modify `predict()` to process the test set in batches instead of all at once. Note that in addition to the RAM limit, there is a submission size limit, so make sure you are only uploading models that are relevant to your current submission.
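
As a rough sketch of the batching idea, the generator below scores sentence pairs one batch at a time while preserving their order. Both `predict_in_batches` and `toy_scorer` are hypothetical names: in a real submission, the scoring function would encode each batch with your model.

```python
from typing import Callable, Iterable, Iterator, Sequence, Tuple

Pair = Tuple[str, str]


def predict_in_batches(
    pairs: Sequence[Pair],
    score_batch: Callable[[Sequence[Pair]], Iterable[int]],
    batch_size: int = 256,
) -> Iterator[int]:
    """Yield one score per pair, in input order, scoring `batch_size` pairs at a time."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(pairs), batch_size):
        batch = pairs[start : start + batch_size]
        # Only one batch of inputs and scores is held in memory at any time
        yield from score_batch(batch)


def toy_scorer(batch: Sequence[Pair]) -> Iterable[int]:
    """Stand-in scorer for illustration: 100 if the two sentences are identical, else 0."""
    return [100 if first == second else 0 for first, second in batch]


pairs = [("Felt on edge?", "Felt on edge?"), ("Felt on edge?", "Slept badly?")]
print(list(predict_in_batches(pairs, toy_scorer, batch_size=1)))
```

With a dataframe, the same pattern applies by slicing rows with `df.iloc[start : start + batch_size]` inside the loop.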


## Uploading your submission to the platform

You are now ready to make your first submission to the platform! 👀

**Make sure to [enrol to take part](https://doxaai.com/competition/harmony-matching) in the challenge if you have not already done so.**

First, we need to make sure we are logged in:


In [None]:
!doxa login

And then, we can submit our work for evaluation:


In [None]:
!doxa upload submission

**Congratulations!** 🥳

By this point, you will have made your first submission for this challenge on the DOXA AI platform!

If everything went well, your submission will now be queued up for evaluation. It will first be run on a small validation set to make sure that your submission does not crash on the full test set. If your submission runs into an issue at this point, you will be able to see the error logs from this phase. Otherwise, if your submission passes this stage, it will be evaluated on the full test set, and you will soon appear on the [competition scoreboard](https://doxaai.com/competition/harmony-matching/scoreboard)!


## Next steps

**Now, it is up to you where you go from here to solve this challenge!**

Here are some ideas you might want to test out:

- Using other [SentenceTransformers](https://sbert.net/) models that may perform better at this task than `all-mpnet-base-v2`
- Training an additional model to predict `human_similarity` from embeddings computed using the [SentenceTransformers](https://sbert.net/) library
- Fine-tuning a language model for this task
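
As a deliberately simple stand-in for the second idea, the sketch below fits a straight line that maps a single cosine similarity to a human score using ordinary least squares. This is far simpler than a model trained on full embeddings, and the names `fit_line` and `calibrate`, as well as the toy values, are assumptions made for illustration.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y ≈ slope * x + intercept."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2 values)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical, so no line can be fitted")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def calibrate(cosine, slope, intercept):
    """Apply the fitted line and clip the result to an integer in [0, 100]."""
    return max(0, min(100, round(slope * cosine + intercept)))


# Toy values for illustration only (not from train.csv)
cosines = [0.0, 0.5, 1.0]
human = [10.0, 50.0, 90.0]
slope, intercept = fit_line(cosines, human)
print(calibrate(0.25, slope, intercept))
```

Whether this helps depends on how submissions are scored: a rescaling with positive slope leaves the Pearson correlation unchanged, although it can reduce the absolute error of the predictions.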

If you are new to fine-tuning language models, take a look at the excellent [HuggingFace `transformers` documentation](https://huggingface.co/docs/transformers/en/training)!

**We look forward to seeing what you build!** We would love to hear about what you are working on for this challenge, so do let us know how you are finding the challenge on the [Harmony community Discord server](https://discord.com/invite/harmonydata) or the [DOXA AI community Discord server](https://discord.gg/MUvbQ3UYcf). 😎
