# Tutorial: Downloading and Loading Data from Zenodo for TalentCLEF 2026

This notebook is a step-by-step guide to downloading and loading the data for the TalentCLEF 2026 shared task, hosted on [Zenodo](https://doi.org/10.5281/zenodo.17625261).

TalentCLEF is an initiative to advance Natural Language Processing (NLP) in Human Capital Management (HCM). It aims to create a public benchmark for model evaluation and promote collaboration to develop fair, multilingual, and flexible systems that improve Human Resources (HR) practices across different industries.

The second edition of the TalentCLEF shared task will be part of the [Conference and Labs of the Evaluation Forum (CLEF)](https://clef2026.clef-initiative.eu/), scheduled to be held in Jena, Germany, in 2026. If you are interested in participating, you can find the registration form [here](https://clef-labs-registration.dipintra.it/).


<img src="https://github.com/TalentCLEF/talentclef/blob/main/logo_talentclef.png?raw=true" alt="TalentCLEF logo" width="200"/>
<img src="https://talentclef.github.io/talentclef/docs/talentclef-2026/workshop/logo_clef_jena.svg" alt="CLEF2026 logo" width="150"/>

In this notebook you will learn how to download and load the data of the task.



## Download files

First, let's download the Task A and Task B zip files directly from Zenodo.



In [None]:
# Download
!wget https://zenodo.org/records/18449283/files/TaskA.zip
!wget https://zenodo.org/records/18449283/files/TaskB.zip
# Unzip
!unzip TaskA.zip -d taskA
!unzip TaskB.zip -d taskB
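
Where the `wget` and `unzip` shell commands are not available, the same step can be approximated with the Python standard library. The following is a minimal sketch rather than part of the official materials: `download_and_extract` is a hypothetical helper name, and the commented usage simply reuses the URLs and target folders from the cell above.

```python
import urllib.request
import zipfile
from pathlib import Path


def download_and_extract(url, dest_dir, zip_name=None):
    """Download a zip archive (if not already present) and extract it."""
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)

    # By default, store the archive under the last component of the URL.
    zip_path = Path(zip_name) if zip_name else Path(url.rsplit("/", 1)[-1])

    # Skip the download when the archive is already on disk.
    if not zip_path.exists():
        urllib.request.urlretrieve(url, str(zip_path))

    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest_dir)
    return dest_dir


# Usage (requires network access), mirroring the shell commands above:
# download_and_extract("https://zenodo.org/records/18449283/files/TaskA.zip", "taskA")
# download_and_extract("https://zenodo.org/records/18449283/files/TaskB.zip", "taskB")
```

Because the download is skipped when the archive already exists, re-running the cell does not fetch the files again.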

## Task A files

### Training data



No training set is provided for this task. However, participants are encouraged to use external resources or data from previous TalentCLEF editions for their problem modeling and solution development.

### Development data

The development data for Task A is split into one folder per language inside `development`:

In [2]:
spanish_dev_path = "/content/taskA/TaskA/development/es"
english_dev_path = "/content/taskA/TaskA/development/en"

As described on [the data description page on the TalentCLEF website](https://talentclef.github.io/talentclef/docs/talentclef-2026/data/description_corpus), the development set for each language is organized into two folders: **`queries`**, which contains the job descriptions, and **`corpus`**, which contains the résumés. In addition, a **`qrels.tsv`** file is provided, specifying the relevance judgments of the résumés for each job description.

Both the **`queries`** and **`corpus`** folders contain the documents used in the task, with each file name serving as the unique identifier for the corresponding document.

To read these files, we define a function called `load_text_corpus`:


In [4]:
import os
import pandas as pd

def load_text_corpus(path, id_col="c_id", encoding="utf-8"):
    """
    Load text files from a directory into a pandas DataFrame.

    Each file in the directory is treated as a document. The file name is used
    as the document identifier, and the file content is stored as text.

    Parameters
    ----------
    path : str
        Path to the directory containing the text files.
    id_col : str, optional
        Name of the column used as the document identifier
        (e.g., 'c_id', 'q_id'). Default is 'c_id'.
    encoding : str, optional
        Text encoding used to read the files. Default is 'utf-8'.

    Returns
    -------
    pd.DataFrame
        A DataFrame with two columns:
        - id_col: document identifier (file name)
        - text: document content
    """
    records = []

    for filename in os.listdir(path):
        file_path = os.path.join(path, filename)

        if os.path.isfile(file_path):
            with open(file_path, "r", encoding=encoding, errors="ignore") as f:
                text = f.read()

            records.append({
                id_col: filename,
                "text": text
            })

    return pd.DataFrame(records)

In [5]:
en_queries = load_text_corpus(os.path.join(english_dev_path,"queries"), id_col="q_id")
en_corpus = load_text_corpus(os.path.join(english_dev_path,"corpus"), id_col="c_id")
en_qrels = pd.read_csv(os.path.join(english_dev_path,"qrels.tsv"),sep="\t", names=["query_id","iteration","corpus_id","relevancy"])

es_queries = load_text_corpus(os.path.join(spanish_dev_path,"queries"), id_col="q_id")
es_corpus = load_text_corpus(os.path.join(spanish_dev_path,"corpus"), id_col="c_id")
es_qrels = pd.read_csv(os.path.join(spanish_dev_path,"qrels.tsv"),sep="\t", names=["query_id","iteration","corpus_id","relevancy"])
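
Note that `load_text_corpus` stores identifiers as strings (they come from file names), whereas `pd.read_csv` will typically infer an integer type for purely numeric ID columns. A minimal sketch of a consistency check follows, assuming the file names are the bare identifiers as the outputs in this notebook suggest; `check_qrels_coverage` is a hypothetical helper and the toy DataFrames only stand in for the real data.

```python
import pandas as pd


def check_qrels_coverage(queries, corpus, qrels):
    """Count qrels rows whose IDs are missing from the loaded documents."""
    # Cast everything to str so that IDs inferred as integers by read_csv
    # compare equal to the string IDs taken from file names.
    query_ids = set(queries["q_id"].astype(str))
    corpus_ids = set(corpus["c_id"].astype(str))

    missing_queries = ~qrels["query_id"].astype(str).isin(query_ids)
    missing_docs = ~qrels["corpus_id"].astype(str).isin(corpus_ids)
    return {
        "rows": len(qrels),
        "missing_queries": int(missing_queries.sum()),
        "missing_docs": int(missing_docs.sum()),
    }


# Toy data standing in for en_queries, en_corpus and en_qrels.
toy_queries = pd.DataFrame({"q_id": ["1", "2"], "text": ["job a", "job b"]})
toy_corpus = pd.DataFrame({"c_id": ["10", "11"], "text": ["cv a", "cv b"]})
toy_qrels = pd.DataFrame({
    "query_id": [1, 2, 3],
    "iteration": [0, 0, 0],
    "corpus_id": [10, 12, 11],
    "relevancy": [1, 1, 1],
})
coverage = check_qrels_coverage(toy_queries, toy_corpus, toy_qrels)
```

On the real data, the same call would be `check_qrels_coverage(en_queries, en_corpus, en_qrels)`; non-zero counts would point to an ID format mismatch or to missing files.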


Some examples of the English data:

In [6]:
en_queries.head(2)

Unnamed: 0,q_id,text
0,44719,Cashier\n\nIntroduction\nWe are seeking a Cash...
1,35129,IELTS Teacher\n\nWe are seeking a dedicated IE...


In [7]:
en_corpus.head(2)

Unnamed: 0,c_id,text
0,12980,"Mariana Rodríguez\nCalle Lavalleja 2847, Apart..."
1,16049,"Dmitri Volkov\nLeninsky Prospekt 42, Apartment..."


The qrels list, for each query, the corpus elements that are relevant to it.

In [8]:
en_qrels.head(2)

Unnamed: 0,query_id,iteration,corpus_id,relevancy
0,36044,0,13884,1
1,39060,0,9516,1


For example, in the previous output, the corpus element 13884 is relevant to the query 36044.

In [18]:
from IPython.display import Markdown, display
display(Markdown("**The résumé with id 13884:**"))
print(en_corpus[en_corpus.c_id == "13884"]["text"].iloc[0][:300])

display(Markdown("**IS RELEVANT TO**"))

display(Markdown("**The job description with id 36044:**"))
print(en_queries[en_queries.q_id == "36044"]["text"].iloc[0][:300])



**The résumé with id 13884:**

Siti Wijaya

Jalan Merdeka 42, Bandung 40173
+62-274-8156-3429 | siti.wijaya@email.com
https://www.linkedin.com/in/sitiwijaya


PROFESSIONAL SUMMARY

Results-driven Sr. Principal Product Development Engineer with 15+ years of experience leading complex semiconductor product development initiatives f


**IS RELEVANT TO**

**The job description with id 36044:**

Failure Analysis Engineer

Introduction
We are seeking a Failure Analysis Engineer to join our team. The role focuses on investigating product and system failures, identifying root causes, and implementing corrective actions to improve reliability and quality across a range of products and processes
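
For evaluation or error analysis it is often convenient to turn the qrels into a lookup from each query to its relevant résumés. The sketch below is one possible way to do this: `relevant_by_query` is a hypothetical helper, the column names are the ones assigned when reading `qrels.tsv` above, and treating `relevancy >= 1` as "relevant" is an assumption based on the values shown in this notebook.

```python
from collections import defaultdict

import pandas as pd


def relevant_by_query(qrels, min_relevancy=1):
    """Map each query ID to the set of corpus IDs judged relevant to it."""
    relevant = defaultdict(set)
    for row in qrels.itertuples(index=False):
        # Assumption: a relevancy of at least `min_relevancy` means relevant.
        if row.relevancy >= min_relevancy:
            # IDs are cast to str to match the file-name based identifiers.
            relevant[str(row.query_id)].add(str(row.corpus_id))
    return dict(relevant)


# Toy qrels using the same column names as en_qrels.
toy_qrels = pd.DataFrame({
    "query_id": [36044, 36044, 39060],
    "iteration": [0, 0, 0],
    "corpus_id": [13884, 9516, 9516],
    "relevancy": [1, 0, 1],
})
toy_relevant = relevant_by_query(toy_qrels)
```

With the real data, `relevant_by_query(en_qrels)["36044"]` would return the identifiers of all résumés judged relevant to that job description.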


Some examples of the Spanish data:

In [19]:
es_queries.head(2)

Unnamed: 0,q_id,text
0,86302,Profesor/a de IELTS\n\nBuscamos un/a Profesor/...
1,96027,Ingeniero/a de Análisis de Fallas\n\nIntroducc...


In [20]:
es_corpus.head(2)

Unnamed: 0,c_id,text
0,57152,Siriporn Jantharapruk\n487 Sukhumvit Soi 12\nB...
1,63844,"Abdelrahman Osman\nKhartoum, Sudan\n+249 912 4..."


In [21]:
es_qrels.head(2)

Unnamed: 0,query_id,iteration,corpus_id,relevancy
0,96027,0,70456,1
1,88540,0,72948,1


## Task B files

### Training

The training data for Task B is shared in three files:
- `job2skill.tsv`: This file has been curated to include the most representative skills for each job title in ESCO. It contains three columns:
  - `job_id`: ESCO identifier for the job position.
  - `skill_id`: ESCO identifier for the skill.
  - `rel_type`: Indicator specifying whether the `skill_id` is essential or optional for a specific `job_id`. Its value is either `essential` or `optional`.

- `jobid2terms.json`: This JSON file contains job_id identifiers used in the training set for Task A as keys, and a list of valid lexical variants for each identifier as values.

- `skillid2terms.json`: This JSON file contains skill_id identifiers as keys, and a list of valid lexical variants for each identifier as values.



In [22]:
taskB_path = "/content/taskB/TaskB/training"

Let's read the data:

In [23]:
# Read job2skill file
import os
job2skill = pd.read_csv(os.path.join(taskB_path, 'job2skill.tsv'),
                        sep="\t",
                        names=["job_id","skill_id","rel_type"])
job2skill.head()

Unnamed: 0,job_id,skill_id,rel_type
0,http://data.europa.eu/esco/occupation/00030d09...,http://data.europa.eu/esco/skill/93a68dcb-3dc6...,essential
1,http://data.europa.eu/esco/occupation/00030d09...,http://data.europa.eu/esco/skill/05bc7677-5a64...,essential
2,http://data.europa.eu/esco/occupation/00030d09...,http://data.europa.eu/esco/skill/860be36a-d19b...,essential
3,http://data.europa.eu/esco/occupation/00030d09...,http://data.europa.eu/esco/skill/fed5b267-73fa...,essential
4,http://data.europa.eu/esco/occupation/00030d09...,http://data.europa.eu/esco/skill/f64fe2c2-d090...,essential


In [24]:
# Read json files
import json

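# Open with an explicit encoding so the result does not depend on the platform default
with open(os.path.join(taskB_path, "jobid2terms.json"), "r", encoding="utf-8") as file:
    jobid2terms = json.load(file)

with open(os.path.join(taskB_path, "skillid2terms.json"), "r", encoding="utf-8") as file:
    skillid2terms = json.load(file)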

We can map the values to the relation dataframe:

In [25]:
job2skill["job_terms"] = job2skill["job_id"].map(jobid2terms)
job2skill["skill_terms"] = job2skill["skill_id"].map(skillid2terms)
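
`Series.map` with a dictionary yields `NaN` for any identifier that is not a key of the dictionary, so it can be worth checking how many identifiers were left unmapped. A minimal sketch follows; `count_unmapped` is a hypothetical helper and the toy values are placeholders, not real ESCO identifiers.

```python
import pandas as pd


def count_unmapped(ids, mapping):
    """Return how many values in `ids` have no entry in `mapping`."""
    known = set(mapping)  # the dictionary keys
    return int((~ids.isin(known)).sum())


# Toy data standing in for job2skill["job_id"] and jobid2terms.
toy_ids = pd.Series(["job-1", "job-2", "job-3"])
toy_terms = {"job-1": ["title a"], "job-2": ["title b", "title c"]}
n_unmapped = count_unmapped(toy_ids, toy_terms)
```

On the real data, the equivalent calls would be `count_unmapped(job2skill["job_id"], jobid2terms)` and `count_unmapped(job2skill["skill_id"], skillid2terms)`.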

Let's see the skills for a specific job_id.



In [26]:
job_id_to_search = job2skill.job_id.to_list()[8]
job_data = job2skill[job2skill.job_id == job_id_to_search]

The job titles of the selected job_id are:

In [27]:
job_data.job_terms.iloc[0]

['technical director',
 'technical and operations director',
 'head of technical',
 'director of technical arts',
 'head of technical department',
 'technical supervisor',
 'technical manager']

For that job_id we find 9 essential skills and 1 optional skill:

In [28]:
job_data.rel_type.value_counts()

rel_type,count
essential,9
optional,1
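
Beyond counting, the skills of a job can be split by relation type in a single step. This is a small sketch under the column names used above; `skills_by_relation` is a hypothetical helper and the toy identifiers are placeholders rather than real ESCO URIs.

```python
import pandas as pd


def skills_by_relation(job2skill, job_id):
    """Return {rel_type: [skill_id, ...]} for a single job_id."""
    rows = job2skill[job2skill["job_id"] == job_id]
    # Group the matching rows by relation type and collect the skill IDs.
    return rows.groupby("rel_type")["skill_id"].apply(list).to_dict()


# Toy data with the same columns as job2skill.
toy_job2skill = pd.DataFrame({
    "job_id": ["job-1", "job-1", "job-1", "job-2"],
    "skill_id": ["skill-a", "skill-b", "skill-c", "skill-a"],
    "rel_type": ["essential", "essential", "optional", "essential"],
})
toy_split = skills_by_relation(toy_job2skill, "job-1")
```

On the real data, `skills_by_relation(job2skill, job_id_to_search)` would give the essential and optional skill identifiers for the job selected above.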


The skills related to the job_id are:

In [29]:
job_data.skill_terms.to_list()

[['promote health and safety',
  'promote importance of health and safety',
  'promoting health and safety',
  'advertise health and safety'],
 ['organise rehearsals',
  'organise rehearsal',
  'organize rehearsals',
  'plan rehearsals',
  'arrange rehearsals',
  'organising rehearsals',
  'schedule rehearsals'],
 ['negotiate health and safety issues with third parties',
  'agree with third parties on health and safety',
  'negotiate issues on health and safety with third parties',
  'negotiate with third parties on health and safety issues',
  'negotiate health and safety matters with third parties'],
 ['theatre techniques',
  'theatre technique',
  'theatre approaches',
  'theatre methods'],
 ['coordinate technical teams in artistic productions',
  'supervise technical teams during a production',
  'coordinate technical teams during artistic production',
  'coordinate technical teams',
  'coordinate technical teams for artistic production'],
 ['write risk assessment on performing art

In [30]:
job2skill.job_id.to_list()[8]

'http://data.europa.eu/esco/occupation/00030d09-2b3a-4efd-87cc-c4ea39d27c34'