# TalentCLEF 2025 - Task B: Baseline for Test set

In this notebook, we provide the test-set baseline for TalentCLEF Task B. It covers downloading the [Task B dataset](https://doi.org/10.5281/zenodo.14002665), applying an embedding model as a baseline to generate the submission `.trec` file, and compressing that file for upload to the Codabench platform.


-----------------------------
TalentCLEF is an initiative to advance Natural Language Processing (NLP) in Human Capital Management (HCM). It aims to create a public benchmark for model evaluation and promote collaboration to develop fair, multilingual, and flexible systems that improve Human Resources (HR) practices across different industries.

The inaugural edition of this shared task is part of the [Conference and Labs of the Evaluation Forum (CLEF)](https://clef2025.clef-initiative.eu/index.php?page=Pages/labs.html), scheduled to be held in Madrid in 2025. If you are interested in registering, you can find the registration form [here](https://clef2025-labs-registration.dei.unipd.it/).

<img src="https://github.com/TalentCLEF/talentclef/blob/main/logo_talentclef.png?raw=true" alt="TalentCLEF logo" width="200"/>
<img src="https://talentclef.github.io/talentclef/docs/talentclef-2025/workshop/logo_clef_madrid.png" alt="CLEF 2025 Madrid logo" width="150"/>


## Imports

In [None]:
!pip install codecarbon

In [2]:
import pandas as pd
import numpy as np
from sentence_transformers import SentenceTransformer, util
import subprocess
from codecarbon import EmissionsTracker

## Download Task B files

First, let's download the Task B zip file directly from Zenodo.



In [None]:
# Download
!wget https://zenodo.org/records/15292308/files/TaskB.zip
!unzip TaskB.zip -d taskB

## Generate submission file using the baseline model

Load the English queries and corpus elements from the test folder:

In [4]:
queries = "/content/taskB/test/queries"
corpus_elements = "/content/taskB/test/corpus_elements"

In [5]:
queries = pd.read_csv(queries,sep="\t")
corpus_elements = pd.read_csv(corpus_elements, sep="\t")


Transform `skill_aliases` column to a list of strings:

In [6]:
import ast
# Each cell holds a list serialized as text; parse it back into a Python list.
corpus_elements["skill_aliases"] = corpus_elements["skill_aliases"].apply(ast.literal_eval)
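
To illustrate the conversion above, here is a minimal sketch on a made-up value. The alias strings are hypothetical and only mimic how a list is assumed to be serialized in the TSV file.

```python
import ast

# Hypothetical cell value: a Python list serialized as text (assumed format).
raw_cell = "['project management', 'manage projects']"

# literal_eval parses Python literals only; it does not execute arbitrary code.
aliases = ast.literal_eval(raw_cell)

print(type(aliases).__name__, len(aliases))
```

After parsing, each cell is a real list, which is what `explode` needs in order to produce one row per alias.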

Build a dictionary that maps each query ID to its job title:

In [7]:
queries_ids = queries.q_id.to_list()
queries_texts = queries.jobtitle.to_list()
map_queries = dict(zip(queries_ids,queries_texts))

Before creating a dictionary that maps corpus texts to corpus IDs, explode the `skill_aliases` column so that each alias gets its own row:

In [8]:
list_aliases_df = corpus_elements.explode("skill_aliases")

In [9]:
corpus_ids = list_aliases_df.c_id.to_list()
corpus_texts = list_aliases_df.skill_aliases.to_list()
map_corpus = dict(zip(corpus_texts,corpus_ids))
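
As a rough illustration of what the explode step produces, the sketch below runs it on a toy frame. The IDs and aliases are invented for the example and are not taken from the dataset.

```python
import pandas as pd

# Toy corpus with hypothetical IDs and aliases (not taken from the dataset).
toy_corpus = pd.DataFrame({
    "c_id": ["c1", "c2"],
    "skill_aliases": [["python", "python programming"], ["sql"]],
})

# explode() emits one row per alias and repeats the c_id of the source row.
toy_exploded = toy_corpus.explode("skill_aliases")

toy_ids = toy_exploded.c_id.to_list()
toy_texts = toy_exploded.skill_aliases.to_list()

# Position i in toy_texts belongs to the skill at position i in toy_ids.
print(list(zip(toy_ids, toy_texts)))
```

The two parallel lists are what the ranking step indexes later. A dictionary keyed by alias text would keep only one ID if the same alias appeared under two different skills, so the lists are the safer lookup.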

Load a simple embedding model:

In [None]:
model = SentenceTransformer("all-MiniLM-L6-v2")

In [11]:
tracker = EmissionsTracker()
tracker.start_task("all-MiniLM-L6-v2")

Encode queries and corpus elements:

In [12]:
query_embeddings = model.encode(queries_texts, convert_to_tensor=True)
corpus_embeddings = model.encode(corpus_texts, convert_to_tensor=True)

Compute the cosine similarity between every query and every corpus text, then stop the emissions tracker:

In [13]:
similarities = util.cos_sim(query_embeddings, corpus_embeddings).cpu().numpy()
emissions = tracker.stop_task("all-MiniLM-L6-v2")
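
For readers who want to see what a cosine-similarity matrix contains, here is a small NumPy sketch on toy vectors. The helper name `cosine_similarity_matrix` is hypothetical, the sketch assumes no row is the zero vector, and it is an illustration of the computation rather than the library's implementation.

```python
import numpy as np

def cosine_similarity_matrix(a, b):
    """Return the cosine similarity between every row of a and every row of b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Scale each row to unit length (assumes no all-zero rows).
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    # The dot product of unit vectors is the cosine of the angle between them.
    return a_unit @ b_unit.T

toy_query_vecs = np.array([[1.0, 0.0], [1.0, 1.0]])
toy_doc_vecs = np.array([[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]])
toy_sims = cosine_similarity_matrix(toy_query_vecs, toy_doc_vecs)
print(toy_sims.shape)  # (2, 3): one row per query, one column per corpus text
```

Row `i` of the result holds the scores of query `i` against every corpus text, which is the layout the ranking loop relies on.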

## Prepare submission file

Submissions must follow the TREC run file format. Each line of the file has 6 space-separated columns with the following information:

- q_id: Query ID.
- Q0: A constant identifier, usually "Q0".
- doc_id: ID of the retrieved document.
- rank: Position of the document in the ranking.
- score: Relevance score assigned by the model.
- tag: Experiment name (`baseline_model` in this notebook).
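
The six columns listed above can be sketched as a pair of small helpers. The function names and the example IDs are hypothetical, and the parser assumes that no field contains whitespace.

```python
def format_trec_line(q_id, doc_id, rank, score, tag):
    """Build one run-file line: q_id Q0 doc_id rank score tag."""
    return f"{q_id} Q0 {doc_id} {rank} {score:.4f} {tag}"

def parse_trec_line(line):
    """Split one run-file line back into typed fields (assumes no spaces inside fields)."""
    q_id, q0, doc_id, rank, score, tag = line.split()
    return {"q_id": q_id, "Q0": q0, "doc_id": doc_id,
            "rank": int(rank), "score": float(score), "tag": tag}

example_line = format_trec_line("q1", "c42", 1, 0.87654, "baseline_model")
print(example_line)  # q1 Q0 c42 1 0.8765 baseline_model
```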

In [15]:
# results holds the submission lines (IDs); results_name holds the same
# rows with readable texts instead of IDs.
results = []
results_name = []

for q_idx, q_id in enumerate(queries_ids):
    sorted_indices = np.argsort(-similarities[q_idx])
    used_doc_ids = set()
    rank_counter = 0
    for c_idx in sorted_indices:  # Consider the full list.
        doc_id = corpus_ids[c_idx]
        # If doc_id was already processed, go to the next one.
        if doc_id in used_doc_ids:
            continue
        used_doc_ids.add(doc_id)
        rank_counter += 1

        query_name = map_queries[q_id]
        doc_name = corpus_texts[c_idx]
        score = similarities[q_idx, c_idx]

        results.append(f"{q_id} Q0 {doc_id} {rank_counter} {score:.4f} baseline_model")
        results_name.append(f"{query_name} Q0 {doc_name} {rank_counter} {score:.4f} baseline_model")
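
The deduplication logic in the loop above can be tried on a toy example. In this sketch the helper `rank_unique_docs` and the scores are made up; it keeps only the best-scoring alias of each document and assigns consecutive ranks.

```python
import numpy as np

def rank_unique_docs(scores, doc_ids):
    """Rank documents by descending score, keeping only each doc_id's best hit."""
    ranked = []
    seen = set()
    for idx in np.argsort(-np.asarray(scores)):
        doc_id = doc_ids[idx]
        if doc_id in seen:
            continue  # a higher-scoring alias of this document is already ranked
        seen.add(doc_id)
        ranked.append((doc_id, len(ranked) + 1, float(scores[idx])))
    return ranked

toy_scores = [0.2, 0.9, 0.5, 0.7]
toy_doc_ids = ["c1", "c1", "c2", "c3"]  # "c1" has two aliases
print(rank_unique_docs(toy_scores, toy_doc_ids))
# [('c1', 1, 0.9), ('c3', 2, 0.7), ('c2', 3, 0.5)]
```

Because aliases are visited in descending score order, the first alias seen for a document is its best one, so each document's score is the maximum over its aliases.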

Let's save the list as a file:

In [21]:
with open("run_testset_baseline_taskB.trec", "w", encoding="utf-8") as f:
    f.write("\n".join(results))
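
Before zipping, it can be worth sanity-checking the run lines. The sketch below is one possible check, not an official validator: the function name is hypothetical, and the rules it applies (6 columns, numeric rank and score, consecutive ranks per query, no repeated `doc_id` per query) are assumptions that mirror what the baseline loop produces.

```python
from collections import defaultdict

def check_run_lines(lines):
    """Return a list of problems found in TREC run lines (empty if none)."""
    problems = []
    last_rank = defaultdict(int)   # last rank seen per query
    seen_docs = defaultdict(set)   # doc_ids already listed per query
    for line_no, line in enumerate(lines, start=1):
        fields = line.split()
        if len(fields) != 6:
            problems.append(f"line {line_no}: expected 6 columns, got {len(fields)}")
            continue
        q_id, _q0, doc_id, rank, score, _tag = fields
        try:
            rank_value = int(rank)
            float(score)
        except ValueError:
            problems.append(f"line {line_no}: rank or score is not numeric")
            continue
        if rank_value != last_rank[q_id] + 1:
            problems.append(f"line {line_no}: rank {rank} does not follow {last_rank[q_id]}")
        last_rank[q_id] = rank_value
        if doc_id in seen_docs[q_id]:
            problems.append(f"line {line_no}: duplicate doc_id {doc_id} for query {q_id}")
        seen_docs[q_id].add(doc_id)
    return problems

good_lines = ["q1 Q0 c1 1 0.9000 baseline_model", "q1 Q0 c2 2 0.5000 baseline_model"]
bad_lines = ["q1 Q0 c1 1 0.9000 baseline_model", "q1 Q0 c1 3 0.5000 baseline_model"]
print(check_run_lines(good_lines))
print(check_run_lines(bad_lines))
```

In the notebook, the same function could be applied to the in-memory list with `check_run_lines(results)`.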

In [17]:
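import json
# Use a context manager so the file is flushed and closed before zipping.
with open("./emissions.json", "w", encoding="utf-8") as f:
    json.dump(dict(emissions.values), f, ensure_ascii=False, indent=4)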

Zip the results for upload to the [Task B Codabench](https://www.codabench.org/competitions/7059) competition:

In [None]:
!zip taskB_testset_baseline.zip run_*
!zip taskB_testset_baseline_emissions.zip emissions.json
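
Where the `zip` command-line tool is not available, the same archives could be built with Python's standard library. This is a minimal sketch: `zip_files` is a hypothetical helper, and storing each file at the archive root is an assumption based on how the shell commands above are run from the working directory.

```python
import glob
import os
import zipfile

def zip_files(archive_path, pattern):
    """Write every file matching pattern into a new zip archive; return their names."""
    matches = sorted(glob.glob(pattern))
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for path in matches:
            # Store only the base name so the archive has no directory prefix.
            archive.write(path, arcname=os.path.basename(path))
    return [os.path.basename(path) for path in matches]

# Example calls mirroring the two shell commands above:
# zip_files("taskB_testset_baseline.zip", "run_*")
# zip_files("taskB_testset_baseline_emissions.zip", "emissions.json")
```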