# Tutorial: Downloading and Loading Training Data from Zenodo for TalentCLEF

This notebook provides a step-by-step guide on downloading and loading training data for the TalentCLEF shared-task, hosted on [Zenodo](https://doi.org/10.5281/zenodo.14002665).

TalentCLEF is an initiative to advance Natural Language Processing (NLP) in Human Capital Management (HCM). It aims to create a public benchmark for model evaluation and promote collaboration to develop fair, multilingual, and flexible systems that improve Human Resources (HR) practices across different industries.

The inaugural edition of this shared task is part of the [Conference and Labs of the Evaluation Forum (CLEF)](https://clef2025.clef-initiative.eu/index.php?page=Pages/labs.html), scheduled to be held in Madrid in 2025. If you are interested in registering, you can find the registration form [here](https://clef2025-labs-registration.dei.unipd.it/).

<img src="https://github.com/TalentCLEF/talentclef/blob/main/logo_talentclef.png?raw=true" alt="TalentCLEF logo" width="200"/>
<img src="https://talentclef.github.io/talentclef/docs/talentclef-2025/workshop/logo_clef_madrid.png" alt="TalentCLEF logo" width="150"/>

In this notebook you will learn how to download and load the training data of the task.



## Download files

First, let's download the Task A and Task B zip files directly from Zenodo.



In [10]:
# Download the Task A and Task B archives from Zenodo
!wget https://zenodo.org/records/14693201/files/TaskA.zip
!wget https://zenodo.org/records/14693201/files/TaskB.zip
# Unzip each archive into its own folder (taskA/ and taskB/)
!unzip TaskA.zip -d taskA
!unzip TaskB.zip -d taskB

--2025-01-20 10:41:14--  https://zenodo.org/records/14693201/files/TaskA.zip
Resolving zenodo.org (zenodo.org)... 188.185.45.92, 188.185.48.194, 188.185.43.25, ...
Connecting to zenodo.org (zenodo.org)|188.185.45.92|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 863884 (844K) [application/octet-stream]
Saving to: ‘TaskA.zip.2’


2025-01-20 10:41:15 (1.14 MB/s) - ‘TaskA.zip.2’ saved [863884/863884]

--2025-01-20 10:41:15--  https://zenodo.org/records/14693201/files/TaskB.zip
Resolving zenodo.org (zenodo.org)... 188.185.45.92, 188.185.48.194, 188.185.43.25, ...
Connecting to zenodo.org (zenodo.org)|188.185.45.92|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4192240 (4.0M) [application/octet-stream]
Saving to: ‘TaskB.zip.2’


2025-01-20 10:41:17 (4.14 MB/s) - ‘TaskB.zip.2’ saved [4192240/4192240]

Archive:  TaskA.zip
replace taskA/training/english/taskA_training_en.tsv? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: taskA/traini
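
The log above comes from re-running the cell: `wget` saves a second copy as `TaskA.zip.2`, and `unzip` asks before replacing files that already exist. A minimal Python sketch that avoids both is shown below. The helper name `fetch_and_extract` is hypothetical (it is not part of the TalentCLEF materials), and it assumes the archive on disk is complete whenever the file exists.

```python
import urllib.request
import zipfile
from pathlib import Path


def fetch_and_extract(url, zip_path, out_dir):
    """Download `url` to `zip_path` unless it is already there, then extract it.

    Hypothetical helper for illustration; returns the sorted archive member names.
    """
    zip_path = Path(zip_path)
    out_dir = Path(out_dir)
    if not zip_path.exists():
        # Only hit the network when the archive is not already on disk.
        urllib.request.urlretrieve(url, str(zip_path))
    out_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        # extractall overwrites existing files without prompting.
        archive.extractall(out_dir)
        return sorted(archive.namelist())
```

For example, `fetch_and_extract("https://zenodo.org/records/14693201/files/TaskA.zip", "TaskA.zip", "taskA")` would mirror the first `wget`/`unzip` pair, and calling it a second time reuses the archive already on disk.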

## Task A files

### Training data

The training data for Task A is split into one folder per language inside `training`:

In [16]:
spanish_training = "/content/taskA/training/spanish/taskA_training_es.tsv"
english_training = "/content/taskA/training/english/taskA_training_en.tsv"
german_training = "/content/taskA/training/german/taskA_training_de.tsv"


As explained on [the data description page of the TalentCLEF website](https://talentclef.github.io/talentclef/docs/talentclef-2025/data/description_corpus/), the training set is provided in a tabular format. Each row contains two related job titles (`jobtitle_1`, `jobtitle_2`), the ESCO identifier they originate from (`id`), and the ISCO family to which that ESCO identifier belongs (`family_id`). The two identifier columns may be useful when processing the data for model training.

In [22]:
import pandas as pd

# Column names passed to read_csv for every language file
task_a_columns = ["family_id", "id", "jobtitle_1", "jobtitle_2"]

spanish_df = pd.read_csv(spanish_training, sep="\t", names=task_a_columns)
english_df = pd.read_csv(english_training, sep="\t", names=task_a_columns)
german_df = pd.read_csv(german_training, sep="\t", names=task_a_columns)

In [21]:
spanish_df.head(2)

|   | family_id | id | jobtitle_1 | jobtitle_2 |
|---|---|---|---|---|
| 0 | http://data.europa.eu/esco/isco/C0110 | http://data.europa.eu/esco/occupation/f2cc5978... | jefe de escuadrón | instructor |
| 1 | http://data.europa.eu/esco/isco/C0110 | http://data.europa.eu/esco/occupation/f2cc5978... | comandante de aeronave | instructor de simulador |


In [23]:
english_df.head(2)

|   | family_id | id | jobtitle_1 | jobtitle_2 |
|---|---|---|---|---|
| 0 | http://data.europa.eu/esco/isco/C0110 | http://data.europa.eu/esco/occupation/f2cc5978... | air commodore | flight lieutenant |
| 1 | http://data.europa.eu/esco/isco/C0110 | http://data.europa.eu/esco/occupation/f2cc5978... | command and control officer | flight officer |


In [24]:
german_df.head(2)

|   | family_id | id | jobtitle_1 | jobtitle_2 |
|---|---|---|---|---|
| 0 | http://data.europa.eu/esco/isco/C0110 | http://data.europa.eu/esco/occupation/f2cc5978... | Staffelkommandantin | Kommodore |
| 1 | http://data.europa.eu/esco/isco/C0110 | http://data.europa.eu/esco/occupation/f2cc5978... | Luftwaffenoffizierin | Luftwaffenoffizier/Luftwaffenoffizierin |
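
One way the `id` column might help when preparing training data is to collect every title that shares an ESCO identifier into a single group. The sketch below does this on a toy dataframe in the Task A layout. The identifiers (`family_X`, `occupation_A`, ...) are placeholders rather than real ESCO URIs, the grouping of the toy rows is made up, and the function name `titles_per_occupation` is hypothetical.

```python
import pandas as pd

# Toy rows in the Task A layout; identifiers are placeholders, not real ESCO URIs.
toy_df = pd.DataFrame(
    [
        ("family_X", "occupation_A", "air commodore", "flight lieutenant"),
        ("family_X", "occupation_A", "command and control officer", "flight officer"),
        ("family_Y", "occupation_B", "title one", "title two"),
    ],
    columns=["family_id", "id", "jobtitle_1", "jobtitle_2"],
)


def titles_per_occupation(df):
    """Return {id: sorted list of every distinct title seen in either title column}."""
    # Stack both title columns into one long (id, title) table.
    long_form = pd.concat(
        [
            df[["id", "jobtitle_1"]].rename(columns={"jobtitle_1": "title"}),
            df[["id", "jobtitle_2"]].rename(columns={"jobtitle_2": "title"}),
        ],
        ignore_index=True,
    )
    return long_form.groupby("id")["title"].apply(lambda s: sorted(set(s))).to_dict()


titles_by_id = titles_per_occupation(toy_df)
```

The same function should work unchanged on `spanish_df`, `english_df`, or `german_df`, since it only relies on the four column names used above.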


## Task B files

### Training

The training data for Task B is shared in three files:
- `job2skill.tsv`: This file has been curated to include the most representative skills for each job title in ESCO. It contains three columns:
  - `job_id`: ESCO identifier for the job position.
  - `skill_id`: ESCO identifier for the skill.
  - `rel_type`: Indicator specifying whether the `skill_id` is essential or optional for a given `job_id`. Its value is either `essential` or `optional`.

- `jobid2terms.json`: This JSON file contains job_id identifiers used in the training set for Task A as keys, and a list of valid lexical variants for each identifier as values.

- `skillid2terms.json`: This JSON file contains skill_id identifiers as keys, and a list of valid lexical variants for each identifier as values.
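
Given the layout described above, a quick consistency check across the three files can be sketched as follows. This is an illustration on toy data with placeholder identifiers, not a statement about the real files, and the helper name `check_task_b` is hypothetical.

```python
import pandas as pd

# The two values that rel_type can take, per the file description.
ALLOWED_REL_TYPES = {"essential", "optional"}


def check_task_b(job2skill, jobid2terms, skillid2terms):
    """Return a dict of problems; every list is empty when the inputs are consistent."""
    return {
        "bad_rel_types": sorted(set(job2skill["rel_type"]) - ALLOWED_REL_TYPES),
        "jobs_without_terms": sorted(set(job2skill["job_id"]) - set(jobid2terms)),
        "skills_without_terms": sorted(set(job2skill["skill_id"]) - set(skillid2terms)),
    }


# Toy stand-ins for job2skill.tsv, jobid2terms.json and skillid2terms.json.
toy_job2skill = pd.DataFrame(
    {
        "job_id": ["job_A", "job_A", "job_B"],
        "skill_id": ["skill_1", "skill_2", "skill_3"],
        "rel_type": ["essential", "optional", "essential"],
    }
)
toy_jobid2terms = {"job_A": ["technical director", "technical manager"]}
toy_skillid2terms = {
    "skill_1": ["organise rehearsals"],
    "skill_2": ["theatre techniques"],
    "skill_3": ["promote health and safety"],
}

report = check_task_b(toy_job2skill, toy_jobid2terms, toy_skillid2terms)
```

In the toy data `job_B` deliberately has no entry in the job dictionary, so it is the only item the report flags.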



In [26]:
taskB_path = "/content/taskB/training"

Let's read the data:

In [30]:
# Read job2skill file
import os
job2skill = pd.read_csv(os.path.join(taskB_path, 'job2skill.tsv'),
                        sep="\t",
                        names=["job_id","skill_id","rel_type"])
job2skill.head()

|   | job_id | skill_id | rel_type |
|---|---|---|---|
| 0 | http://data.europa.eu/esco/occupation/00030d09... | http://data.europa.eu/esco/skill/93a68dcb-3dc6... | essential |
| 1 | http://data.europa.eu/esco/occupation/00030d09... | http://data.europa.eu/esco/skill/05bc7677-5a64... | essential |
| 2 | http://data.europa.eu/esco/occupation/00030d09... | http://data.europa.eu/esco/skill/860be36a-d19b... | essential |
| 3 | http://data.europa.eu/esco/occupation/00030d09... | http://data.europa.eu/esco/skill/fed5b267-73fa... | essential |
| 4 | http://data.europa.eu/esco/occupation/00030d09... | http://data.europa.eu/esco/skill/f64fe2c2-d090... | essential |


In [32]:
# Read json files
import json

with open(os.path.join(taskB_path,"jobid2terms.json"), 'r') as file:
    jobid2terms = json.load(file)

with open(os.path.join(taskB_path,"skillid2terms.json"), 'r') as file:
    skillid2terms = json.load(file)

We can map the values to the relation dataframe:

In [33]:
job2skill["job_terms"] = job2skill["job_id"].map(jobid2terms)
job2skill["skill_terms"] = job2skill["skill_id"].map(skillid2terms)
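
After this mapping, each row holds a list of job terms and a list of skill terms. If term-level pairs are needed, one possible approach is to expand those lists with `DataFrame.explode`, as in the sketch below. It runs on toy data, the helper name `to_term_pairs` is hypothetical, and it assumes that rows whose identifier had no dictionary entry (which `map` leaves as missing values) should simply be dropped.

```python
import pandas as pd

# Toy frame shaped like job2skill after the two .map calls; the second row
# simulates a job_id that had no entry in the job dictionary.
toy = pd.DataFrame(
    {
        "job_terms": [["technical director", "technical manager"], None],
        "skill_terms": [["organise rehearsals", "plan rehearsals"], ["theatre techniques"]],
        "rel_type": ["essential", "optional"],
    }
)


def to_term_pairs(df):
    """Return one row per (job term, skill term) combination."""
    # Drop rows where either mapping produced a missing value.
    complete = df.dropna(subset=["job_terms", "skill_terms"])
    # Explode one list column at a time; resetting the index keeps it unique.
    exploded = complete.explode("job_terms").reset_index(drop=True)
    exploded = exploded.explode("skill_terms").reset_index(drop=True)
    return exploded.rename(columns={"job_terms": "job_term", "skill_terms": "skill_term"})


pairs = to_term_pairs(toy)
```

Exploding both columns yields the full cross product of terms per row, so the row count can grow quickly on the real data; sampling a limited number of terms per identifier before exploding may be preferable.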

Let's see the skills for a specific job_id.



In [37]:
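# Pick the job_id found in the ninth row (position 8) of the relation table
job_id_to_search = job2skill.job_id.iloc[8]
# Keep only the rows that belong to that job_id
job_data = job2skill[job2skill.job_id == job_id_to_search]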

The job titles of the selected job_id are:

In [44]:
job_data.job_terms.iloc[0]

['technical director',
 'technical and operations director',
 'head of technical',
 'director of technical arts',
 'head of technical department',
 'technical supervisor',
 'technical manager']

For that job_id we find nine essential skills and one optional skill:

In [43]:
job_data.rel_type.value_counts()

| rel_type | count |
|---|---|
| essential | 9 |
| optional | 1 |
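
The same count can be computed for every job at once. The sketch below uses `pd.crosstab` on a toy relation table with placeholder identifiers; on the real data the `job2skill` dataframe loaded earlier would take the place of `toy_job2skill`.

```python
import pandas as pd

# Toy relation table with placeholder identifiers.
toy_job2skill = pd.DataFrame(
    {
        "job_id": ["job_A", "job_A", "job_A", "job_B"],
        "skill_id": ["skill_1", "skill_2", "skill_3", "skill_1"],
        "rel_type": ["essential", "essential", "optional", "essential"],
    }
)

# One row per job_id, one column per rel_type value, cells hold the counts.
rel_counts = pd.crosstab(toy_job2skill["job_id"], toy_job2skill["rel_type"])

# Jobs that have no optional skill at all.
jobs_without_optional = rel_counts.index[rel_counts["optional"] == 0].tolist()
```

Combinations that never occur are filled with zero, which makes it easy to spot jobs that only have essential skills.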


The skills related to the job_id are:

In [42]:
job_data.skill_terms.to_list()

[['promote health and safety',
  'promote importance of health and safety',
  'promoting health and safety',
  'advertise health and safety'],
 ['organise rehearsals',
  'organise rehearsal',
  'organize rehearsals',
  'plan rehearsals',
  'arrange rehearsals',
  'organising rehearsals',
  'schedule rehearsals'],
 ['negotiate health and safety issues with third parties',
  'agree with third parties on health and safety',
  'negotiate issues on health and safety with third parties',
  'negotiate with third parties on health and safety issues',
  'negotiate health and safety matters with third parties'],
 ['theatre techniques',
  'theatre technique',
  'theatre approaches',
  'theatre methods'],
 ['coordinate technical teams in artistic productions',
  'supervise technical teams during a production',
  'coordinate technical teams during artistic production',
  'coordinate technical teams',
  'coordinate technical teams for artistic production'],
 ['write risk assessment on performing art

In [35]:
job2skill.job_id.to_list()[8]

'http://data.europa.eu/esco/occupation/00030d09-2b3a-4efd-87cc-c4ea39d27c34'