<a href="https://colab.research.google.com/github/Bhanu-py/Bhanu-py/blob/main/talentclef2025/TalentCLEF_data_tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tutorial: Downloading and Loading Training Data from Zenodo for TalentCLEF

This notebook provides a step-by-step guide on downloading and loading training data for the TalentCLEF shared-task, hosted on [Zenodo](https://doi.org/10.5281/zenodo.14002665).

TalentCLEF is an initiative to advance Natural Language Processing (NLP) in Human Capital Management (HCM). It aims to create a public benchmark for model evaluation and promote collaboration to develop fair, multilingual, and flexible systems that improve Human Resources (HR) practices across different industries.

The inaugural edition of this shared task is part of the [Conference and Labs of the Evaluation Forum (CLEF)](https://clef2025.clef-initiative.eu/index.php?page=Pages/labs.html), scheduled to be held in Madrid in 2025. If you are interested in registering, you can find the registration form [here](https://clef2025-labs-registration.dei.unipd.it/).

<img src="https://github.com/TalentCLEF/talentclef/blob/main/logo_talentclef.png?raw=true" alt="TalentCLEF logo" width="200"/>
<img src="https://talentclef.github.io/talentclef/docs/talentclef-2025/workshop/logo_clef_madrid.png" alt="CLEF 2025 Madrid logo" width="150"/>

In this notebook you will learn how to download and load the training data of the task.



## Download files

First, let's download the Task A and Task B zip files directly from Zenodo.



In [1]:
# Download
!wget https://zenodo.org/records/14693201/files/TaskA.zip
!wget https://zenodo.org/records/14693201/files/TaskB.zip
# Unzip
!unzip TaskA.zip -d taskA
!unzip TaskB.zip -d taskB

--2025-01-28 09:42:33--  https://zenodo.org/records/14693201/files/TaskA.zip
Resolving zenodo.org (zenodo.org)... 188.185.43.25, 188.185.48.194, 188.185.45.92, ...
Connecting to zenodo.org (zenodo.org)|188.185.43.25|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 863884 (844K) [application/octet-stream]
Saving to: ‘TaskA.zip’


2025-01-28 09:42:34 (1.65 MB/s) - ‘TaskA.zip’ saved [863884/863884]

--2025-01-28 09:42:34--  https://zenodo.org/records/14693201/files/TaskB.zip
Resolving zenodo.org (zenodo.org)... 188.185.43.25, 188.185.48.194, 188.185.45.92, ...
Connecting to zenodo.org (zenodo.org)|188.185.43.25|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4192240 (4.0M) [application/octet-stream]
Saving to: ‘TaskB.zip’


2025-01-28 09:42:35 (4.97 MB/s) - ‘TaskB.zip’ saved [4192240/4192240]

Archive:  TaskA.zip
   creating: taskA/test/
   creating: taskA/training/
   creating: taskA/training/english/
  inflating: taskA/training/english/
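
If `wget` and `unzip` are not available (for example, when running outside Colab), the same step can be sketched with the Python standard library. The helper name `download_and_extract` is hypothetical, introduced only for this example; the URLs in the commented usage are the same ones used in the cell above.

```python
import urllib.request
import zipfile
from pathlib import Path


def download_and_extract(url: str, zip_path: str, extract_dir: str) -> list[str]:
    """Download a zip archive (skipped if the file already exists) and
    extract it, returning the names of the archive members."""
    zip_file = Path(zip_path)
    if not zip_file.exists():
        urllib.request.urlretrieve(url, str(zip_file))
    with zipfile.ZipFile(zip_file) as archive:
        archive.extractall(extract_dir)
        return archive.namelist()


# Usage with the URLs from the wget cell (uncomment to run):
# for task, folder in (("TaskA", "taskA"), ("TaskB", "taskB")):
#     url = f"https://zenodo.org/records/14693201/files/{task}.zip"
#     download_and_extract(url, f"{task}.zip", folder)
```

Skipping the download when the zip file is already present makes the cell safe to re-run without fetching the archives again.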

## Task A files

### Training data

The training data for Task A is located in different folders within 'training', based on the language of the data:

In [2]:
spanish_training = "/content/taskA/training/spanish/taskA_training_es.tsv"
english_training = "/content/taskA/training/english/taskA_training_en.tsv"
german_training = "/content/taskA/training/german/taskA_training_de.tsv"


As explained on [the data description page on the TalentCLEF website](https://talentclef.github.io/talentclef/docs/talentclef-2025/data/description_corpus/), the training set is provided in a tabular format where each row contains two related job titles (`jobtitle_1`, `jobtitle_2`), the ESCO identifier they originate from (`id`), and the ISCO family to which that ESCO identifier belongs (`family_id`). These identifiers can be useful when processing the data for model training.

In [3]:
import pandas as pd

# Column names for the tab-separated training files
columns = ["family_id", "id", "jobtitle_1", "jobtitle_2"]
spanish_df = pd.read_csv(spanish_training, sep="\t", names=columns)
english_df = pd.read_csv(english_training, sep="\t", names=columns)
german_df = pd.read_csv(german_training, sep="\t", names=columns)
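
If you prefer a single dataframe for all languages, the three reads can be folded into a loop. This is only a sketch: the function name `load_task_a_training` and the extra `lang` column are assumptions made for this example, and the folder layout is the one used in the paths above.

```python
from pathlib import Path

import pandas as pd

COLUMNS = ["family_id", "id", "jobtitle_1", "jobtitle_2"]
# Language folder name -> file suffix, as used in the paths above.
LANGUAGES = {"spanish": "es", "english": "en", "german": "de"}


def load_task_a_training(training_dir: str) -> pd.DataFrame:
    """Read every available language file and stack them, tagging each row
    with its language code. Missing language files are skipped."""
    frames = []
    for folder, code in LANGUAGES.items():
        path = Path(training_dir) / folder / f"taskA_training_{code}.tsv"
        if not path.exists():
            continue
        # names= mirrors the call above, so the first line is read as data.
        frame = pd.read_csv(path, sep="\t", names=COLUMNS)
        frame["lang"] = code
        frames.append(frame)
    # Note: pd.concat raises ValueError if no file was found at all.
    return pd.concat(frames, ignore_index=True)


# Usage in Colab (uncomment to run):
# all_df = load_task_a_training("/content/taskA/training")
```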

In [4]:
spanish_df

| | family_id | id | jobtitle_1 | jobtitle_2 |
|---|---|---|---|---|
| 0 | http://data.europa.eu/esco/isco/C0110 | http://data.europa.eu/esco/occupation/f2cc5978... | jefe de escuadrón | instructor |
| 1 | http://data.europa.eu/esco/isco/C0110 | http://data.europa.eu/esco/occupation/f2cc5978... | comandante de aeronave | instructor de simulador |
| 2 | http://data.europa.eu/esco/isco/C0110 | http://data.europa.eu/esco/occupation/f2cc5978... | instructor | oficial del Ejército del Aire |
| 3 | http://data.europa.eu/esco/isco/C0110 | http://data.europa.eu/esco/occupation/f2cc5978... | comandante de aeronave | instructor |
| 4 | http://data.europa.eu/esco/isco/C0110 | http://data.europa.eu/esco/occupation/f2cc5978... | oficial de operaciones | instructora |
| ... | ... | ... | ... | ... |
| 20719 | http://data.europa.eu/esco/isco/C9629 | http://data.europa.eu/esco/occupation/1f1a45a8... | encargado de vestuarios | encargada de vestidores |
| 20720 | http://data.europa.eu/esco/isco/C9629 | http://data.europa.eu/esco/occupation/1f1a45a8... | encargado de vestuarios/encargada de vestuarios | encargada de vestuarios |
| 20721 | http://data.europa.eu/esco/isco/C9629 | http://data.europa.eu/esco/occupation/7f346896... | acomodadora | acomodador/acomodadora |
| 20722 | http://data.europa.eu/esco/isco/C9629 | http://data.europa.eu/esco/occupation/7f346896... | acomodador | acomodador/acomodadora |


In [17]:
english_df[english_df['jobtitle_1'].str.contains('consumer')]

| | family_id | id | jobtitle_1 | jobtitle_2 |
|---|---|---|---|---|
| 2969 | http://data.europa.eu/esco/isco/C1420 | http://data.europa.eu/esco/occupation/d8e5410b... | consumer electronics shop manager | ICT shop manager |
| 2971 | http://data.europa.eu/esco/isco/C1420 | http://data.europa.eu/esco/occupation/d8e5410b... | consumer electronics store manager | IT shop manager |
| 3723 | http://data.europa.eu/esco/isco/C2113 | http://data.europa.eu/esco/occupation/8606c0fd... | sensory and consumer research scientist | sensory research scientist |
| 3724 | http://data.europa.eu/esco/isco/C2113 | http://data.europa.eu/esco/occupation/8606c0fd... | consumer and sensory scientist | sensory science advisor |
| 3725 | http://data.europa.eu/esco/isco/C2113 | http://data.europa.eu/esco/occupation/8606c0fd... | sensory and consumer research scientist | bakery scientist |
| 8627 | http://data.europa.eu/esco/isco/C2422 | http://data.europa.eu/esco/occupation/3826cb92... | consumer protection officer | policy manager for competition |
| 13277 | http://data.europa.eu/esco/isco/C3122 | http://data.europa.eu/esco/occupation/40517ed4... | consumer goods production forewoman | production machine supervisor |
| 13280 | http://data.europa.eu/esco/isco/C3122 | http://data.europa.eu/esco/occupation/40517ed4... | consumer goods production foreman | production overseer |
| 13282 | http://data.europa.eu/esco/isco/C3122 | http://data.europa.eu/esco/occupation/40517ed4... | consumer goods quality control operator | consumer goods production overseer |
| 13283 | http://data.europa.eu/esco/isco/C3122 | http://data.europa.eu/esco/occupation/40517ed4... | consumer goods production quality supervisor | manufacturing supervisor |
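
The filter above is case-sensitive and only looks at `jobtitle_1`. A slightly broader search, sketched below, checks both title columns and ignores case. The helper name `search_titles` and the toy rows are made up for illustration; `case`, `regex` and `na` are standard parameters of `Series.str.contains`.

```python
import pandas as pd


def search_titles(df: pd.DataFrame, term: str) -> pd.DataFrame:
    """Return rows where either job title contains `term` (case-insensitive,
    literal match; missing titles never match)."""
    in_first = df["jobtitle_1"].str.contains(term, case=False, regex=False, na=False)
    in_second = df["jobtitle_2"].str.contains(term, case=False, regex=False, na=False)
    return df[in_first | in_second]


# Toy rows (made up for illustration, not taken from the dataset).
toy_df = pd.DataFrame({
    "jobtitle_1": ["Consumer advisor", "baker", None],
    "jobtitle_2": ["advisor", "consumer goods baker", "clerk"],
})
matches = search_titles(toy_df, "consumer")
```

Passing `regex=False` treats the search term as plain text, which avoids surprises when a title contains characters such as `/` or `(`.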


In [None]:
german_df.head(2)

| | family_id | id | jobtitle_1 | jobtitle_2 |
|---|---|---|---|---|
| 0 | http://data.europa.eu/esco/isco/C0110 | http://data.europa.eu/esco/occupation/f2cc5978... | Staffelkommandantin | Kommodore |
| 1 | http://data.europa.eu/esco/isco/C0110 | http://data.europa.eu/esco/occupation/f2cc5978... | Luftwaffenoffizierin | Luftwaffenoffizier/Luftwaffenoffizierin |
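
Since both titles in a row come from the same ESCO occupation, one possible way to use the `id` column is to collect every distinct title seen for each occupation. The sketch below does this on toy rows with made-up identifiers; the function name `titles_per_occupation` is hypothetical, and this is just one preprocessing option, not a step prescribed by the task.

```python
import pandas as pd


def titles_per_occupation(df: pd.DataFrame) -> pd.Series:
    """Map each ESCO `id` to the sorted list of distinct job titles that
    appear for it in either title column."""
    # Stack both title columns into one long column, keeping the id.
    long_df = df.melt(id_vars=["id"], value_vars=["jobtitle_1", "jobtitle_2"],
                      value_name="title")
    long_df = long_df.dropna(subset=["title"])
    return long_df.groupby("id")["title"].agg(lambda s: sorted(set(s)))


# Toy rows with made-up identifiers, for illustration only.
toy_df = pd.DataFrame({
    "family_id": ["F1", "F1", "F2"],
    "id": ["occ-a", "occ-a", "occ-b"],
    "jobtitle_1": ["instructor", "flight instructor", "usher"],
    "jobtitle_2": ["flight instructor", "trainer", "usher"],
})
titles = titles_per_occupation(toy_df)
```

Grouping by `family_id` instead of `id` would give coarser groups at the ISCO family level.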


## Task B files

### Training

The training data for Task B is shared in 3 files:
- `job2skill.tsv`: This file has been curated to include the most representative skills for each job title in ESCO. This file contains three columns:
  - `job_id`: ESCO identifier for the job position.
  - `skill_id`: ESCO identifier for the skill.
  - `rel_type`: Indicates whether the skill (`skill_id`) is essential or optional for the job (`job_id`). Its value is either `essential` or `optional`.

- `jobid2terms.json`: This JSON file contains job_id identifiers used in the training set for Task A as keys, and a list of valid lexical variants for each identifier as values.

- `skillid2terms.json`: This JSON file contains skill_id identifiers as keys, and a list of valid lexical variants for each identifier as values.



In [None]:
taskB_path = "/content/taskB/training"

Let's read the data:

In [None]:
# Read job2skill file
import os
job2skill = pd.read_csv(os.path.join(taskB_path, 'job2skill.tsv'),
                        sep="\t",
                        names=["job_id","skill_id","rel_type"])
job2skill.head()

| | job_id | skill_id | rel_type |
|---|---|---|---|
| 0 | http://data.europa.eu/esco/occupation/00030d09... | http://data.europa.eu/esco/skill/93a68dcb-3dc6... | essential |
| 1 | http://data.europa.eu/esco/occupation/00030d09... | http://data.europa.eu/esco/skill/05bc7677-5a64... | essential |
| 2 | http://data.europa.eu/esco/occupation/00030d09... | http://data.europa.eu/esco/skill/860be36a-d19b... | essential |
| 3 | http://data.europa.eu/esco/occupation/00030d09... | http://data.europa.eu/esco/skill/fed5b267-73fa... | essential |
| 4 | http://data.europa.eu/esco/occupation/00030d09... | http://data.europa.eu/esco/skill/f64fe2c2-d090... | essential |


In [None]:
# Read json files
import json

with open(os.path.join(taskB_path,"jobid2terms.json"), 'r') as file:
    jobid2terms = json.load(file)

with open(os.path.join(taskB_path,"skillid2terms.json"), 'r') as file:
    skillid2terms = json.load(file)
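
Before combining the three files, it can be worth checking that every identifier in the relation table has an entry in the JSON dictionaries, because `Series.map` with a dictionary yields `NaN` for keys it cannot find. The sketch below runs that check on toy stand-ins with made-up identifiers; the helper name `missing_ids` is an assumption of this example.

```python
import pandas as pd


def missing_ids(relations: pd.DataFrame, column: str, id2terms: dict) -> set:
    """Return the identifiers in `relations[column]` that have no entry in
    the `id2terms` dictionary."""
    return set(relations[column].dropna()) - set(id2terms)


# Toy stand-ins with made-up identifiers, mirroring the structure of the three files.
toy_job2skill = pd.DataFrame({
    "job_id": ["job-1", "job-1", "job-2"],
    "skill_id": ["skill-a", "skill-b", "skill-c"],
    "rel_type": ["essential", "optional", "essential"],
})
toy_jobid2terms = {"job-1": ["technical director"], "job-2": ["usher"]}
toy_skillid2terms = {"skill-a": ["plan rehearsals"], "skill-b": ["theatre techniques"]}

jobs_without_terms = missing_ids(toy_job2skill, "job_id", toy_jobid2terms)
skills_without_terms = missing_ids(toy_job2skill, "skill_id", toy_skillid2terms)
# rel_type is documented to take only these two values.
unexpected_rel_types = set(toy_job2skill["rel_type"]) - {"essential", "optional"}
```

On the real data you would call `missing_ids(job2skill, "job_id", jobid2terms)` and `missing_ids(job2skill, "skill_id", skillid2terms)`; empty sets mean every identifier can be resolved to terms.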

We can map the values to the relation dataframe:

In [None]:
job2skill["job_terms"] = job2skill["job_id"].map(jobid2terms)
job2skill["skill_terms"] = job2skill["skill_id"].map(skillid2terms)
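
After the mapping, `job_terms` and `skill_terms` hold lists. If you need flat (job term, skill term) pairs instead, for example as text pairs for training, `DataFrame.explode` can unpack them. The sketch below uses a toy stand-in with made-up identifiers; the output column names `job_term` and `skill_term` are choices made for this example.

```python
import pandas as pd

# Toy stand-in for job2skill after the mapping step (made-up identifiers).
toy = pd.DataFrame({
    "job_id": ["job-1", "job-1"],
    "skill_id": ["skill-a", "skill-b"],
    "rel_type": ["essential", "optional"],
    "job_terms": [["technical director", "technical manager"]] * 2,
    "skill_terms": [["plan rehearsals", "organise rehearsals"], ["theatre techniques"]],
})

# Explode one list column at a time so every job term is paired with every
# skill term of the same row (a per-row Cartesian product).
pairs = (
    toy.explode("job_terms", ignore_index=True)
       .explode("skill_terms", ignore_index=True)
       .rename(columns={"job_terms": "job_term", "skill_terms": "skill_term"})
)
```

Keep in mind that the number of rows grows multiplicatively: a job with 7 variants linked to a skill with 5 variants produces 35 pairs for that single relation.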

Let's see the skills for a specific job_id.



In [None]:
job_id_to_search = job2skill.job_id.to_list()[8]
job_data = job2skill[job2skill.job_id == job_id_to_search]

The job titles of the selected job_id are:

In [None]:
job_data.job_terms.iloc[0]

['technical director',
 'technical and operations director',
 'head of technical',
 'director of technical arts',
 'head of technical department',
 'technical supervisor',
 'technical manager']

For that job_id we find 9 essential skills and 1 optional skill:

In [None]:
job_data.rel_type.value_counts()

| rel_type | count |
|---|---|
| essential | 9 |
| optional | 1 |
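
The same count can be computed for every job at once rather than for a single job_id. A minimal sketch with `pd.crosstab` on a toy relation table (made-up identifiers) could look like this:

```python
import pandas as pd

# Toy relation table with made-up identifiers.
toy_job2skill = pd.DataFrame({
    "job_id": ["job-1", "job-1", "job-1", "job-2"],
    "skill_id": ["skill-a", "skill-b", "skill-c", "skill-a"],
    "rel_type": ["essential", "essential", "optional", "essential"],
})

# One row per job_id, one column per rel_type; cells hold the number of skills.
counts = pd.crosstab(toy_job2skill["job_id"], toy_job2skill["rel_type"])
```

Jobs with no skill of a given type get a 0 in that column, which makes it easy to spot jobs that only have essential skills.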


The skills related to the job_id are:

In [None]:
job_data.skill_terms.to_list()

[['promote health and safety',
  'promote importance of health and safety',
  'promoting health and safety',
  'advertise health and safety'],
 ['organise rehearsals',
  'organise rehearsal',
  'organize rehearsals',
  'plan rehearsals',
  'arrange rehearsals',
  'organising rehearsals',
  'schedule rehearsals'],
 ['negotiate health and safety issues with third parties',
  'agree with third parties on health and safety',
  'negotiate issues on health and safety with third parties',
  'negotiate with third parties on health and safety issues',
  'negotiate health and safety matters with third parties'],
 ['theatre techniques',
  'theatre technique',
  'theatre approaches',
  'theatre methods'],
 ['coordinate technical teams in artistic productions',
  'supervise technical teams during a production',
  'coordinate technical teams during artistic production',
  'coordinate technical teams',
  'coordinate technical teams for artistic production'],
 ['write risk assessment on performing art

In [None]:
job2skill.job_id.to_list()[8]

'http://data.europa.eu/esco/occupation/00030d09-2b3a-4efd-87cc-c4ea39d27c34'
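
The lookup steps above can be wrapped into a small helper that returns one readable label per skill for a given job. This is a sketch under assumptions: the function name `skills_for_job` is hypothetical, the toy data uses made-up identifiers, and picking the first lexical variant as the label is an arbitrary choice.

```python
import pandas as pd


def skills_for_job(relations: pd.DataFrame, skillid2terms: dict,
                   job_id: str, rel_type: str = "essential") -> list[str]:
    """Return one term per skill linked to `job_id` with the given `rel_type`.
    The first lexical variant is used as the label (an arbitrary choice)."""
    mask = (relations["job_id"] == job_id) & (relations["rel_type"] == rel_type)
    labels = []
    for skill_id in relations.loc[mask, "skill_id"]:
        terms = skillid2terms.get(skill_id)
        if terms:  # skip skills with no known terms
            labels.append(terms[0])
    return labels


# Toy data with made-up identifiers, for illustration only.
toy_job2skill = pd.DataFrame({
    "job_id": ["job-1", "job-1", "job-1"],
    "skill_id": ["skill-a", "skill-b", "skill-c"],
    "rel_type": ["essential", "optional", "essential"],
})
toy_skillid2terms = {
    "skill-a": ["organise rehearsals", "plan rehearsals"],
    "skill-b": ["theatre techniques"],
}
essential = skills_for_job(toy_job2skill, toy_skillid2terms, "job-1")
optional = skills_for_job(toy_job2skill, toy_skillid2terms, "job-1", rel_type="optional")
```

With the real data, `skills_for_job(job2skill, skillid2terms, job_id_to_search)` would list one label for each essential skill of the job selected above.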