# TalentCLEF 2025 - Task A: Baseline for Test set

In this notebook, we provide the test-set baseline for TalentCLEF Task A. The notebook downloads the [Task A dataset](https://doi.org/10.5281/zenodo.14002665), applies a multilingual embedding model as a baseline to generate `.trec` run files, and compresses those files for upload to the Codabench platform.


-----------------------------
TalentCLEF is an initiative to advance Natural Language Processing (NLP) in Human Capital Management (HCM). It aims to create a public benchmark for model evaluation and promote collaboration to develop fair, multilingual, and flexible systems that improve Human Resources (HR) practices across different industries.

The inaugural edition of this shared task is part of the [Conference and Labs of the Evaluation Forum (CLEF)](https://clef2025.clef-initiative.eu/index.php?page=Pages/labs.html), scheduled to be held in Madrid in 2025. If you are interested in registering, you can find the registration form [here](https://clef2025-labs-registration.dei.unipd.it/).

<img src="https://github.com/TalentCLEF/talentclef/blob/main/logo_talentclef.png?raw=true" alt="TalentCLEF logo" width="200"/>
<img src="https://talentclef.github.io/talentclef/docs/talentclef-2025/workshop/logo_clef_madrid.png" alt="CLEF 2025 Madrid logo" width="150"/>


## Imports

In [None]:
!pip install codecarbon

In [4]:
import json

import pandas as pd
import numpy as np
from sentence_transformers import SentenceTransformer, util
from codecarbon import EmissionsTracker

## Download Task A files

First, let's download the Task A data from Zenodo.


In [None]:
# Download
!wget https://zenodo.org/records/15292308/files/TaskA.zip
!unzip TaskA.zip -d taskA
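
After extraction, a quick check that the expected test files are in place can save a failed run later. The sketch below is only an illustration: `missing_test_files` is a hypothetical helper, and the `taskA/test/<language>/queries` and `corpus_elements` layout is assumed from the paths the baseline loop reads, not from separate dataset documentation.

```python
from pathlib import Path

# Folder and file names assumed from the paths used by the baseline loop.
LANGUAGES = ["german", "spanish", "english", "chinese"]
FILES = ["queries", "corpus_elements"]


def missing_test_files(root="taskA/test"):
    """Return the expected test-set files that are not present under root."""
    expected = [Path(root) / lang / name for lang in LANGUAGES for name in FILES]
    return [path for path in expected if not path.is_file()]


missing = missing_test_files()
if missing:
    print("Missing files:")
    for path in missing:
        print(f"  {path}")
else:
    print("All expected test files found.")
```

If the archive was extracted somewhere else, pass that location as `root`.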

### Baseline

Define the language pairs to evaluate, each written as `<query language>-<corpus language>`:

In [13]:
language_pairs = ["de-de","es-es","en-en","zh-zh","en-es","en-de"]
map_lang = {"de":"german","es":"spanish","en":"english","zh":"chinese"}
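
To make the pair notation concrete, the following sketch resolves each pair to the query and corpus files it refers to. `resolve_pair` is a hypothetical helper introduced only for illustration, and the relative `taskA/test` root is an assumption based on the unzip target used earlier.

```python
from pathlib import Path

language_pairs = ["de-de", "es-es", "en-en", "zh-zh", "en-es", "en-de"]
map_lang = {"de": "german", "es": "spanish", "en": "english", "zh": "chinese"}


def resolve_pair(pair, root="taskA/test"):
    """Hypothetical helper: map a pair such as 'en-es' to (queries path, corpus path)."""
    query_code, corpus_code = pair.split("-")
    queries_path = Path(root) / map_lang[query_code] / "queries"
    corpus_path = Path(root) / map_lang[corpus_code] / "corpus_elements"
    return queries_path, corpus_path


for pair in language_pairs:
    queries_path, corpus_path = resolve_pair(pair)
    print(f"{pair}: queries={queries_path}  corpus={corpus_path}")
```

For a cross-lingual pair such as `en-es`, the queries come from the English folder and the corpus elements from the Spanish folder.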

The baseline model is [`sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2`](https://huggingface.co/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2).

In [7]:
models = ["sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"]

Apply the model and save results:

In [None]:
emissions = {}

for model_name in models:
  tracker = EmissionsTracker()
  tracker.start_task(model_name)
  # Download and load embedding model
  model = SentenceTransformer(model_name)
  for language_pair in language_pairs:
    source_lang, target_lang = language_pair.split("-")
    source_lang = map_lang[source_lang]
    target_lang = map_lang[target_lang]
    print(f"Language of queries: {source_lang}")
    print(f"Language of corpus elements: {target_lang}")
    # Read queries and corpus elements for the specific language
    # (paths assume the notebook runs from /content, as on Colab)
    queries_path = f"/content/taskA/test/{source_lang}/queries"
    corpus_path = f"/content/taskA/test/{target_lang}/corpus_elements"
    queries = pd.read_csv(queries_path, sep="\t")
    corpus_elements = pd.read_csv(corpus_path, sep="\t")
    # Get ids, strings and generate a mapping dictionary for queries
    queries_ids = queries.q_id.to_list()
    queries_texts = queries.jobtitle.to_list()
    map_queries = dict(zip(queries_ids,queries_texts))
    # Get ids, strings and generate a mapping dictionary for corpus elements
    corpus_ids = corpus_elements.c_id.to_list()
    corpus_texts = corpus_elements.jobtitle.to_list()
    map_corpus = dict(zip(corpus_ids, corpus_texts))
    # Encode queries and corpus elements with the baseline model.
    query_embeddings = model.encode(queries_texts, convert_to_tensor=True)
    corpus_embeddings = model.encode(corpus_texts, convert_to_tensor=True)

    # Compute similarities between query and corpus element embeddings
    similarities = util.cos_sim(query_embeddings, corpus_embeddings).cpu().numpy()

    # Process results and prepare output file
    results = []
    for q_idx, q_id in enumerate(queries_ids):
        sorted_indices = np.argsort(-similarities[q_idx])  # Descending order of similarity
        for rank, c_idx in enumerate(sorted_indices):
            doc_id = corpus_ids[c_idx]
            score = similarities[q_idx, c_idx]
            results.append(f"{str(q_id)} Q0 {str(doc_id)} {rank+1} {score:.4f} baseline_model")

    # Save the predictions in a .trec file, following the naming guidelines
    with open(f"run_{language_pair}_testbaseline-{model_name.split('/')[1]}.trec", "w", encoding="utf-8") as f:
        f.write("\n".join(results))
  # Stop the task and store its emissions data as a plain dict
  emissions[model_name] = dict(tracker.stop_task(model_name).values)

# Write the emissions report; the context manager closes the file
with open("./emissions.json", "w", encoding="utf-8") as f:
    json.dump(emissions, f, ensure_ascii=False, indent=4)
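
The ranking and formatting step can be hard to follow inside the full loop, so here is a dependency-free sketch of the same idea on toy data. It is not the baseline itself: `cosine` and `rank_to_trec` are hypothetical helpers written in plain Python instead of `util.cos_sim` and `np.argsort`, and the 2-d vectors are made up rather than produced by the model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_to_trec(query_vecs, corpus_vecs, run_tag="baseline_model"):
    """Rank every corpus element for every query and build TREC run lines."""
    lines = []
    for q_id, q_vec in query_vecs.items():
        scored = [(cosine(q_vec, c_vec), c_id) for c_id, c_vec in corpus_vecs.items()]
        # Highest similarity first, so rank 1 is the best match.
        scored.sort(key=lambda pair: pair[0], reverse=True)
        for rank, (score, c_id) in enumerate(scored, start=1):
            lines.append(f"{q_id} Q0 {c_id} {rank} {score:.4f} {run_tag}")
    return lines


# Toy 2-d "embeddings" (made up for illustration, not model output).
toy_queries = {"q1": [1.0, 0.0]}
toy_corpus = {"c1": [0.0, 1.0], "c2": [1.0, 0.0], "c3": [1.0, 1.0]}

for line in rank_to_trec(toy_queries, toy_corpus):
    print(line)
```

Each output line has six space-separated fields: query id, the literal `Q0`, corpus element id, rank, score, and run tag. For the toy data, `c2` ranks first with score `1.0000`, followed by `c3` (`0.7071`) and `c1` (`0.0000`).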

Zip the results to be uploaded to the [Task A Codabench](https://www.codabench.org/competitions/5842):

In [None]:
!zip taskA_testset_baseline.zip run_*
!zip taskA_testset_baseline_emissions.zip emissions.json
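
Before uploading, it may be worth checking that each run file follows the six-column layout written by the baseline loop. The sketch below is one possible sanity check. `check_trec_lines` is a hypothetical helper, and its rules (six fields, a literal `Q0`, numeric rank and score, ranks counting up from 1 per query) are assumptions derived from how this notebook writes its files, not the platform's official validation.

```python
from collections import defaultdict


def check_trec_lines(lines):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    ranks_per_query = defaultdict(list)
    for number, line in enumerate(lines, start=1):
        fields = line.split()
        if len(fields) != 6:
            problems.append(f"line {number}: expected 6 fields, got {len(fields)}")
            continue
        q_id, q0, doc_id, rank, score, tag = fields
        if q0 != "Q0":
            problems.append(f"line {number}: second field should be 'Q0'")
        try:
            ranks_per_query[q_id].append(int(rank))
            float(score)
        except ValueError:
            problems.append(f"line {number}: rank or score is not numeric")
    # Ranks for each query should read 1, 2, 3, ... in file order.
    for q_id, ranks in ranks_per_query.items():
        if ranks != list(range(1, len(ranks) + 1)):
            problems.append(f"query {q_id}: ranks are not 1..{len(ranks)} in order")
    return problems


good = ["1 Q0 10 1 0.9000 baseline_model", "1 Q0 11 2 0.5000 baseline_model"]
bad = ["1 Q0 10 2 0.9000 baseline_model", "1 Q0 11"]
print(check_trec_lines(good))
print(check_trec_lines(bad))
```

To check a real run file, read it with `open(path, encoding="utf-8").read().splitlines()` and pass the resulting list to the function.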