# Classification Challenge

## 1. Introduction

This is our solution for the Web Intelligence - Classification of Occupations for Online Job Advertisements Challenge, organized by the European Statistics Awards Programme. The primary goal of this challenge is to develop robust and efficient methodologies for automatically classifying job advertisements into predefined occupational categories using the International Standard Classification of Occupations (ISCO) taxonomy.

Online job advertisements (OJAs) are a rich source of data for statistical and analytical purposes. They typically include a job title, a description, company details, and requirements. Because the dataset is multilingual and large, a solution has to work across languages and scale to the full volume of advertisements.

This notebook presents our approach and walks through each step of the pipeline, from pre-processing to the final submission file.

![image](assets/classification_logo.png)

## 2. Setup Instructions

### 2.1 Setup data directories

Before we begin, it's important to set up the directory structure and place the necessary input data files correctly. Please follow these steps to ensure everything is in place:

1. Create a folder named `data` in the root directory of this project.
2. Inside the `data` folder, create the following subfolders:
   - `raw`
   - `interim`
   - `embeddings`
   - `submission`

3. Place the raw data files in the `raw` subfolder. The required raw data files are:
   - `ISCO-08 EN Structure and definitions.xlsx`: This file contains the complete ISCO-08 taxonomy and can be downloaded from the [International Labour Organization's website](https://ilostat.ilo.org/methods/concepts-and-definitions/classification-occupation/).
   - `wi_dataset.csv`: This file contains the job advertisements dataset provided for the competition.
   - `wi_labels.csv`: This file contains the ISCO taxonomy labels provided for the competition.
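
The folder layout above can also be created programmatically. The following is a minimal sketch; `create_data_dirs` is a hypothetical helper (it is not part of this repository), and it assumes you pass the project root explicitly.

```python
from pathlib import Path

# Subfolder names taken from the list above.
SUBFOLDERS = ("raw", "interim", "embeddings", "submission")


def create_data_dirs(project_root):
    """Create data/ and its subfolders under project_root; return the subfolder paths."""
    data_dir = Path(project_root) / "data"
    created = []
    for name in SUBFOLDERS:
        sub = data_dir / name
        # parents=True also creates data/ itself; exist_ok=True makes reruns harmless
        sub.mkdir(parents=True, exist_ok=True)
        created.append(sub)
    return created
```

Calling `create_data_dirs(Path("."))` from the project root would create the four folders; the raw files still have to be copied into `data/raw` by hand.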

### 2.2 Install packages

In [38]:
%pip install numpy pandas tqdm torch FlagEmbedding peft openpyxl loguru accelerate optimum transformers huggingface_hub

Note: you may need to restart the kernel to use updated packages.


## 3. Imports

### 3.1 Import external modules

In [7]:
import os
import pickle
from pathlib import Path

import accelerate
import numpy as np
import pandas as pd
import torch
from FlagEmbedding import BGEM3FlagModel
from huggingface_hub import login
from tqdm import tqdm
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, pipeline

### 3.2 Import internal modules

In [68]:
from src.utils.constants import Paths
from src.data_utils.file_reader import FileReader
from src.data_utils.data_preprocessing import clean_job_advertisements
from src.data_utils.labels_preprocessing import clean_isco_labels, merge_taxonomies
from src.data_utils.embeddings_manager import embed_and_save, compute_cosine_similarity_matrix, get_knn
from src.data_utils.llm_manager import generate_prompts
from src.utils.helpers import convert_knn_indices_to_codes, extract_first_4_digit_string

## 4. Pre-processing

### 4.1 Pre-process job advertisements

In [9]:
# Read in raw job advertisements
jobs = FileReader.read_job_advertisements()

# Load the cleaned job advertisements if they already exist; otherwise clean and save them
try:
    clean_jobs = FileReader.read_clean_job_advertisements()
except Exception:  # e.g. the cleaned file does not exist yet
    clean_jobs = clean_job_advertisements(df=jobs)
    clean_jobs.to_csv(Paths.CLEAN_DATA_PATH, index=False)

Reading raw job advertisements from: /mnt/batch/tasks/shared/LS_root/mounts/clusters/a100-c48-dimitris/code/Users/dimitrios.petridis/eurostat-data-challenge-2024/data/raw/wi_dataset.csv
Raw job advertisements read successfully. Shape: (25665, 3)
Reading clean job advertisements from: /mnt/batch/tasks/shared/LS_root/mounts/clusters/a100-c48-dimitris/code/Users/dimitrios.petridis/eurostat-data-challenge-2024/data/interim/wi_dataset_clean.csv
Clean job advertisements read successfully. Shape: (25665, 7)


In [10]:
# Inspect job advertisements
clean_jobs.head()

| | id | title | description | description_clean | description_clean_nn | title_clean | title_clean_nn |
|---|---|---|---|---|---|---|---|
| 0 | 872828466 | Panel & Paint Technician | Panel & Paint Technician required in Colcheste... | panel & paint technician required in colcheste... | panel & paint technician required in colcheste... | panel & paint technician | panel & paint technician |
| 1 | 839465958 | Lärare i slöjd och teknik för årkurs 7-9, Ljun... | Sista ansökningsdatum: 1 juni 2021 Referensnum... | sista ansökningsdatum: 1 juni 2021 referensnum... | sista ansökningsdatum: juni referensnummer: ... | lärare i slöjd och teknik för årkurs 7-9, ljun... | lärare i slöjd och teknik för årkurs -, ljungs... |
| 2 | 857077872 | Consultants in Emergency Medicine - Doughiska | The Galway Clinic is a leading 146 bed, state ... | the galway clinic is a leading 146 bed, state ... | the galway clinic is a leading bed, state of ... | consultants in emergency medicine - doughiska | consultants in emergency medicine - doughiska |
| 3 | 801801567 | Senior IT Support Engineers | My Client, who has been continually growing th... | my client, who has been continually growing th... | my client, who has been continually growing th... | senior it support engineers | senior it support engineers |
| 4 | 855162927 | Commercial Sales Representatives | Jobbtitel: "Commercial Sales Representatives" ... | jobbtitel: "commercial sales representatives" ... | jobbtitel: "commercial sales representatives" ... | commercial sales representatives | commercial sales representatives |
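
Judging from the output above, the `_clean` columns are lower-cased copies of the raw text and the `_clean_nn` columns additionally have digits removed. The sketch below reproduces that behaviour on plain strings. It is an assumption about what `clean_job_advertisements` does, not its actual implementation; `clean_text` and `remove_numbers` are hypothetical names, and the whitespace collapsing is our own choice.

```python
import re


def clean_text(text):
    """Collapse whitespace, trim, and lower-case a raw title or description."""
    return re.sub(r"\s+", " ", text).strip().lower()


def remove_numbers(text):
    """Drop every run of digits, then collapse the whitespace left behind."""
    no_digits = re.sub(r"\d+", "", text)
    return re.sub(r"\s+", " ", no_digits).strip()


title = "Lärare i slöjd och teknik för årkurs 7-9"
title_clean = clean_text(title)              # lower-cased, digits kept
title_clean_nn = remove_numbers(title_clean)  # digits stripped
```

Lower-casing relies on Python's Unicode-aware `str.lower`, so non-ASCII characters such as `ä` and `ö` are preserved.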


### 4.2 Pre-process taxonomy

In [11]:
# Read in eurostat taxonomy
labels = FileReader.read_isco_labels()

# Read in ISCO-08 taxonomy
taxonomy = FileReader.read_external_labels()

Reading raw ISCO taxonomy from: /mnt/batch/tasks/shared/LS_root/mounts/clusters/a100-c48-dimitris/code/Users/dimitrios.petridis/eurostat-data-challenge-2024/data/raw/wi_labels.csv
Raw eurostat taxonomy read successfully. Shape: (436, 3)
Reading reference taxonomy from: /mnt/batch/tasks/shared/LS_root/mounts/clusters/a100-c48-dimitris/code/Users/dimitrios.petridis/eurostat-data-challenge-2024/data/raw/ISCO-08 EN Structure and definitions.xlsx
Reference taxonomy read and processed successfully. Shape: (619, 8)


In [12]:
# Merge taxonomies
enhanced_labels = merge_taxonomies(labels=labels, taxonomy=taxonomy)

Merging reference taxonomy and eurostat taxonomy...
Taxonomies merged successfully. Shape: (436, 27)
Combined taxonomy (boosted) read and enhanced successfully. Shape: (436, 27)


In [13]:
# Clean labels
clean_labels = clean_isco_labels(df=enhanced_labels)

Starting cleaning isco labels...
Number of missing values in code: 0
Replaced 0 missing values in code with '-'.
Number of missing values in title_ext_level_4: 0
Replaced 0 missing values in title_ext_level_4 with '-'.
Number of missing values in description_ext_level_4: 0
Replaced 0 missing values in description_ext_level_4 with '-'.
Number of missing values in tasks_include_level_4: 9
Replaced 9 missing values in tasks_include_level_4 with '-'.
Number of missing values in included_occupations_level_4: 0
Replaced 0 missing values in included_occupations_level_4 with '-'.
Number of missing values in excluded_occupations_level_4: 132
Replaced 132 missing values in excluded_occupations_level_4 with '-'.
Number of missing values in notes_level_4: 368
Replaced 368 missing values in notes_level_4 with '-'.
Number of missing values in title: 0
Replaced 0 missing values in title with '-'.
Number of missing values in description: 0
Replaced 0 missing values in description with '-'.
Number of m

In [14]:
# Inspect labels
clean_labels.head()

Unnamed: 0,code,title_ext_level_4,description_ext_level_4,tasks_include_level_4,included_occupations_level_4,excluded_occupations_level_4,notes_level_4,title,description,title_ext_level_1,...,included_occupations_level_2_clean,excluded_occupations_level_2_clean,notes_level_2_clean,title_ext_level_3_clean,description_ext_level_3_clean,tasks_include_level_3_clean,included_occupations_level_3_clean,excluded_occupations_level_3_clean,notes_level_3_clean,id
0,1111,Legislators,"Legislators determine, formulate, and direct p...",Tasks include - (a) presiding over or partici...,Examples of the occupations classified here: -...,-,-,Legislators,"Legislators determine, formulate and direct po...",Managers,...,occupations in this sub-major group are classi...,-,-,legislators and senior officials,"legislators and senior officials determine, fo...",tasks performed usually include: presiding ove...,occupations in this minor group are classified...,-,-,1
1,1112,Senior Government Officials,Senior government officials advise governments...,"Tasks include - (a) advising national, state,...",Examples of the occupations classified here: -...,-,Chief executives of Government-owned enterpris...,Senior government officials,Senior government officials advise governments...,Managers,...,occupations in this sub-major group are classi...,-,-,legislators and senior officials,"legislators and senior officials determine, fo...",tasks performed usually include: presiding ove...,occupations in this minor group are classified...,-,-,2
2,1113,Traditional Chiefs and Heads of Villages,Traditional chiefs and heads of villages perfo...,Tasks include - (a) allocating the use of com...,Examples of the occupations classified here: -...,-,-,Traditional chiefs and heads of village,Traditional chiefs and heads of villages perfo...,Managers,...,occupations in this sub-major group are classi...,-,-,legislators and senior officials,"legislators and senior officials determine, fo...",tasks performed usually include: presiding ove...,occupations in this minor group are classified...,-,-,3
3,1114,Senior Officials of Special-interest Organizat...,Senior officials of special-interest organizat...,Tasks include - (a) determining and formulati...,Examples of the occupations classified here: -...,-,-,Senior officials of special-interest organisat...,Senior officials of special-interest organizat...,Managers,...,occupations in this sub-major group are classi...,-,-,legislators and senior officials,"legislators and senior officials determine, fo...",tasks performed usually include: presiding ove...,occupations in this minor group are classified...,-,-,4
4,1120,Managing Directors and Chief Executives,Managing directors and chief executives formul...,"Tasks include - (a) planning, directing and c...",Examples of the occupations classified here: -...,-,Regional managers and other senior managers wh...,Managing directors and chief executives,Managing directors and chief executives formul...,Managers,...,occupations in this sub-major group are classi...,-,-,managing directors and chief executives,managing directors and chief executives formul...,"tasks performed usually include: planning, dir...",occupations in this minor group are classified...,-,-,5


## 5. Embeddings

In [15]:
# Instantiate embedding model
model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=False)

Fetching 30 files:   0%|          | 0/30 [00:00<?, ?it/s]

----------using 2*GPUs----------


### 5.1 Embed and save job advertisements

In [None]:
for col in ['description_clean', 'title_clean']:
    embed_and_save(model=model, col=clean_jobs[col].to_list(), file_name=f'jobs_{col}.pickle', batch_size=12)
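
The helper `embed_and_save` lives in `src.data_utils.embeddings_manager` and is not shown in this notebook. As a rough illustration of what such a helper could look like, the sketch below encodes a list of texts and pickles the dense vectors. The function name `embed_and_save_sketch`, the `out_path` argument, and the `float32` cast are assumptions; the only library behaviour relied on is that `BGEM3FlagModel.encode` returns a dictionary whose `"dense_vecs"` entry holds the dense embeddings.

```python
import pickle
from pathlib import Path

import numpy as np


def embed_and_save_sketch(model, texts, out_path, batch_size=12):
    """Encode texts to dense vectors and pickle the resulting array."""
    # For a BGEM3FlagModel, encode() returns a dict; dense vectors are under "dense_vecs"
    dense = model.encode(texts, batch_size=batch_size)["dense_vecs"]
    array = np.asarray(dense, dtype=np.float32)

    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with open(out_path, "wb") as file:
        pickle.dump(array, file)
    return array
```

Any object with a compatible `encode` method can be passed as `model`, which makes the function easy to try out with a small stub before loading the real embedding model.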

### 5.2 Embed and save taxonomy

In [10]:
# Taxonomy fields that are embedded at every ISCO level
LABEL_FIELDS = ['title_ext', 'description_ext', 'tasks_include',
                'included_occupations', 'excluded_occupations', 'notes']

# Embed every field at every taxonomy level (same order as before: 4, 1, 2, 3)
for level in (4, 1, 2, 3):
    for field in LABEL_FIELDS:
        col = f'{field}_level_{level}_clean'
        embed_and_save(model=model, col=clean_labels[col].to_list(), file_name=f'labels_{col}.pickle', batch_size=12)

Inference Embeddings: 100%|██████████| 19/19 [00:01<00:00, 16.40it/s]
Inference Embeddings: 100%|██████████| 19/19 [00:01<00:00, 13.53it/s]
Inference Embeddings: 100%|██████████| 19/19 [00:03<00:00,  5.16it/s]
Inference Embeddings: 100%|██████████| 19/19 [00:01<00:00, 15.33it/s]
Inference Embeddings: 100%|██████████| 19/19 [00:01<00:00, 15.46it/s]
Inference Embeddings: 100%|██████████| 19/19 [00:02<00:00,  8.93it/s]
Inference Embeddings: 100%|██████████| 19/19 [00:00<00:00, 19.72it/s]
Inference Embeddings: 100%|██████████| 19/19 [00:01<00:00, 12.61it/s]
Inference Embeddings: 100%|██████████| 19/19 [00:02<00:00,  8.71it/s]
Inference Embeddings: 100%|██████████| 19/19 [00:01<00:00, 15.05it/s]
Inference Embeddings: 100%|██████████| 19/19 [00:00<00:00, 20.02it/s]
Inference Embeddings: 100%|██████████| 19/19 [00:01<00:00, 15.79it/s]
Inference Embeddings: 100%|██████████| 19/19 [00:01<00:00, 16.50it/s]
Inference Embeddings: 100%|██████████| 19/19 [00:01<00:00, 10.80it/s]
Inference Embeddings

Embeddings saved successfully to labels_title_ext_level_4_clean.pickle
Embeddings saved successfully to labels_description_ext_level_4_clean.pickle
Embeddings saved successfully to labels_tasks_include_level_4_clean.pickle
Embeddings saved successfully to labels_included_occupations_level_4_clean.pickle
Embeddings saved successfully to labels_excluded_occupations_level_4_clean.pickle
Embeddings saved successfully to labels_notes_level_4_clean.pickle
Embeddings saved successfully to labels_title_ext_level_1_clean.pickle
Embeddings saved successfully to labels_description_ext_level_1_clean.pickle
Embeddings saved successfully to labels_tasks_include_level_1_clean.pickle
Embeddings saved successfully to labels_included_occupations_level_1_clean.pickle
Embeddings saved successfully to labels_excluded_occupations_level_1_clean.pickle
Embeddings saved successfully to labels_notes_level_1_clean.pickle
Embeddings saved successfully to labels_title_ext_level_2_clean.pickle
Embeddings saved succ

### 5.3 Read in job advertisements embeddings

In [17]:
with open(Paths.EMBEDDINGS_DATA_PATH/'jobs_title_clean.pickle', 'rb') as file:
    jobs_title_loaded_array = pickle.load(file)
with open(Paths.EMBEDDINGS_DATA_PATH/'jobs_description_clean.pickle', 'rb') as file:
    jobs_description_loaded_array = pickle.load(file)

print("Loaded arrays successfully.")

Loaded arrays successfully.


### 5.4 Read in taxonomy embeddings

In [18]:
def load_label_embeddings(field, level):
    """Load the pickled embeddings of one taxonomy field at one ISCO level."""
    # File names follow the pattern written in section 5.2: labels_<field>_level_<level>_clean.pickle
    path = Paths.EMBEDDINGS_DATA_PATH / f'labels_{field}_level_{level}_clean.pickle'
    with open(path, 'rb') as file:
        return pickle.load(file)

labels_title_ext_level_1_loaded_array = load_label_embeddings('title_ext', 1)
labels_description_ext_level_1_loaded_array = load_label_embeddings('description_ext', 1)
labels_tasks_include_level_1_loaded_array = load_label_embeddings('tasks_include', 1)
labels_included_occupations_level_1_loaded_array = load_label_embeddings('included_occupations', 1)
labels_excluded_occupations_level_1_loaded_array = load_label_embeddings('excluded_occupations', 1)
labels_notes_level_1_loaded_array = load_label_embeddings('notes', 1)

labels_title_ext_level_2_loaded_array = load_label_embeddings('title_ext', 2)
labels_description_ext_level_2_loaded_array = load_label_embeddings('description_ext', 2)
labels_tasks_include_level_2_loaded_array = load_label_embeddings('tasks_include', 2)
labels_included_occupations_level_2_loaded_array = load_label_embeddings('included_occupations', 2)
labels_excluded_occupations_level_2_loaded_array = load_label_embeddings('excluded_occupations', 2)
labels_notes_level_2_loaded_array = load_label_embeddings('notes', 2)

labels_title_ext_level_3_loaded_array = load_label_embeddings('title_ext', 3)
labels_description_ext_level_3_loaded_array = load_label_embeddings('description_ext', 3)
labels_tasks_include_level_3_loaded_array = load_label_embeddings('tasks_include', 3)
labels_included_occupations_level_3_loaded_array = load_label_embeddings('included_occupations', 3)
labels_excluded_occupations_level_3_loaded_array = load_label_embeddings('excluded_occupations', 3)
labels_notes_level_3_loaded_array = load_label_embeddings('notes', 3)

labels_title_ext_level_4_loaded_array = load_label_embeddings('title_ext', 4)
labels_description_ext_level_4_loaded_array = load_label_embeddings('description_ext', 4)
labels_tasks_include_level_4_loaded_array = load_label_embeddings('tasks_include', 4)
labels_included_occupations_level_4_loaded_array = load_label_embeddings('included_occupations', 4)
labels_excluded_occupations_level_4_loaded_array = load_label_embeddings('excluded_occupations', 4)
labels_notes_level_4_loaded_array = load_label_embeddings('notes', 4)

print("Loaded arrays successfully.")

Loaded arrays successfully.


### 5.5 Combine job advertisements embeddings (weighted sum)

In [19]:
# Define weights
W_JOBS_TITLE = 0.3
W_JOBS_DESCRIPTION = 0.7

In [20]:
# Combine job advertisements embeddings
jobs_combined_embeddings = jobs_title_loaded_array * W_JOBS_TITLE + jobs_description_loaded_array * W_JOBS_DESCRIPTION

### 5.6 Combine taxonomy embeddings (weighted sum)

In [21]:
# Define weights
W_LABELS_TITLE = 0.3
W_LABELS_DESCRIPTION = 0.7
W_TASKS_INCLUDE = 0.5
W_INCLUDED_OCCUPATIONS = 0.6

In [22]:
# Combine taxonomy embeddings for each level
labels_level_1_combined = (labels_title_ext_level_1_loaded_array * W_LABELS_TITLE + 
                           labels_description_ext_level_1_loaded_array * W_LABELS_DESCRIPTION + 
                           labels_tasks_include_level_1_loaded_array * W_TASKS_INCLUDE + 
                           labels_included_occupations_level_1_loaded_array * W_INCLUDED_OCCUPATIONS)

labels_level_2_combined = (labels_title_ext_level_2_loaded_array * W_LABELS_TITLE + 
                           labels_description_ext_level_2_loaded_array * W_LABELS_DESCRIPTION + 
                           labels_tasks_include_level_2_loaded_array * W_TASKS_INCLUDE + 
                           labels_included_occupations_level_2_loaded_array * W_INCLUDED_OCCUPATIONS)

labels_level_3_combined = (labels_title_ext_level_3_loaded_array * W_LABELS_TITLE + 
                           labels_description_ext_level_3_loaded_array * W_LABELS_DESCRIPTION + 
                           labels_tasks_include_level_3_loaded_array * W_TASKS_INCLUDE + 
                           labels_included_occupations_level_3_loaded_array * W_INCLUDED_OCCUPATIONS)

labels_level_4_combined = (labels_title_ext_level_4_loaded_array * W_LABELS_TITLE + 
                           labels_description_ext_level_4_loaded_array * W_LABELS_DESCRIPTION + 
                           labels_tasks_include_level_4_loaded_array * W_TASKS_INCLUDE + 
                           labels_included_occupations_level_4_loaded_array * W_INCLUDED_OCCUPATIONS)

In [23]:
# Define weights
W_LEVEL_1 = 0.2
W_LEVEL_2 = 0.4
W_LEVEL_3 = 0.5
W_LEVEL_4 = 0.7

In [24]:
# Combine taxonomy embeddings from all levels
labels_combined_embeddings = (labels_level_1_combined * W_LEVEL_1 + 
                              labels_level_2_combined * W_LEVEL_2 + 
                              labels_level_3_combined * W_LEVEL_3 + 
                              labels_level_4_combined * W_LEVEL_4)
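
The weights in this section do not sum to 1 (the field weights add up to 2.1 and the level weights to 1.8). Since the combined vectors are only compared by cosine similarity, the overall scale has no effect; only the ratios between the weights matter. The sketch below checks this on random vectors. `weighted_sum` and `cosine` are illustrative helpers, not functions from this repository.

```python
import numpy as np


def weighted_sum(arrays, weights):
    """Element-wise weighted sum of equally shaped embedding arrays."""
    return sum(w * a for a, w in zip(arrays, weights))


def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


rng = np.random.default_rng(0)
title, description, query = rng.normal(size=(3, 8))

raw = weighted_sum([title, description], [0.3, 0.7])
rescaled = weighted_sum([title, description], [3.0, 7.0])  # same ratio, 10x larger

# Multiplying all weights by a common factor leaves the cosine similarity unchanged
assert np.isclose(cosine(raw, query), cosine(rescaled, query))
```

Changing the ratio between the weights, for example from 0.3/0.7 to 0.5/0.5, does change the direction of the combined vector and therefore the similarities.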

### 5.7 Compute similarity matrix

In [25]:
# Compute similarity matrix
similarity_matrix = compute_cosine_similarity_matrix(jobs_combined_embeddings, labels_combined_embeddings)

100%|██████████| 25665/25665 [00:17<00:00, 1477.11it/s]
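
`compute_cosine_similarity_matrix` is defined in `src.data_utils.embeddings_manager` and, judging by the progress bar, iterates over the jobs one at a time. An equivalent vectorised computation is sketched below as an illustration; the function name and the `eps` guard against zero-length vectors are our own assumptions.

```python
import numpy as np


def cosine_similarity_matrix(a, b, eps=1e-12):
    """Return an (n_a, n_b) matrix of cosine similarities between rows of a and rows of b."""
    # Normalise each row to unit length; eps avoids division by zero for all-zero rows
    a_norm = a / np.maximum(np.linalg.norm(a, axis=1, keepdims=True), eps)
    b_norm = b / np.maximum(np.linalg.norm(b, axis=1, keepdims=True), eps)
    # Dot products of unit vectors are cosine similarities
    return a_norm @ b_norm.T
```

With job embeddings of shape `(25665, d)` and label embeddings of shape `(436, d)`, the result would have shape `(25665, 436)`, matching the matrix used in the rest of this section.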


In [26]:
# Save/load similarity matrix
try:
    with open(Paths.EMBEDDINGS_DATA_PATH / "20240908_cosine_similarity_matrix.pickle", 'rb') as file:
        similarity_matrix = pickle.load(file)
except FileNotFoundError:
    with open(Paths.EMBEDDINGS_DATA_PATH / "20240908_cosine_similarity_matrix.pickle", 'wb') as file:
        pickle.dump(similarity_matrix, file)   

similarity_matrix.shape   

(25665, 436)

In [27]:
# Display similarity matrix
print('Cosine Similarity Matrix:')
print(similarity_matrix)

Cosine Similarity Matrix:
[[0.40700588 0.40720373 0.4257651  ... 0.43917805 0.42881268 0.44897842]
 [0.45322189 0.45646986 0.46308506 ... 0.46275198 0.46732673 0.46225119]
 [0.51667559 0.53172606 0.5114944  ... 0.5165031  0.51709956 0.50938016]
 ...
 [0.4698028  0.47480735 0.46264511 ... 0.47781923 0.46807685 0.47094226]
 [0.51592249 0.51686627 0.517003   ... 0.54002804 0.5154258  0.52677864]
 [0.52311349 0.53645653 0.52705765 ... 0.48384291 0.44300619 0.44147113]]


### 5.8 Get k nearest neighbours (knn)

In [28]:
# Define k
K=5

In [29]:
# Retrieve k nearest neighbours for each job
knn_5_indices = get_knn(similarity_matrix=similarity_matrix, k=K)

100%|██████████| 25665/25665 [00:00<00:00, 49617.09it/s]


In [30]:
# Display k elements
print("Indices of the k largest elements in each row:")
print(knn_5_indices)

Indices of the k largest elements in each row:
[[308 307 301 302 130]
 [ 72  71  73  76  75]
 [164  56  55 162  60]
 ...
 [ 95  54  94  96  51]
 [298 300 305 304 301]
 [ 11   9  17  10  14]]
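
For reference, selecting the `k` most similar labels per job can be done with a single sort. This is a sketch of one possible implementation, not the code of `get_knn`; the name `top_k_indices` is hypothetical.

```python
import numpy as np


def top_k_indices(similarity, k):
    """Return, for each row, the column indices of the k largest values, best first."""
    # Sorting the negated scores in ascending order gives descending similarity
    order = np.argsort(-similarity, axis=1, kind="stable")
    return order[:, :k]
```

The returned indices are positions in the label table, so they still have to be mapped to ISCO codes before they can be shown to the LLM.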


## 6. Perform RAG using text-generation model

### 6.1 Build prompts

In [31]:
# Define system template
SYSTEM_TEMPLATE = ("You will receive a job title and a job description (which may be in any language). "
                   "Your task is to classify the job advertisement into the appropriate four-digit code from the given CONTEXT. "
                   "Use the following CONTEXT to determine the correct classification:\n{}\n\n"
                   "Your output must be only the four-digit classification ID, and no additional text or explanations.\n"
                   "Available classes are: {}")

# Define user template
USER_TEMPLATE = ("Below is a job advertisement. "
                 "Your task is to classify it using one of the provided four-digit classification IDs. "
                 "Provide only the correct four-digit ID in response.\n\n"
                 "JOB TITLE: {}\n"
                 "JOB DESCRIPTION: {}\n")

In [32]:
# Convert the knn indices to codes
knn_5_codes = convert_knn_indices_to_codes(clean_jobs=clean_jobs, 
                                           clean_labels=clean_labels,
                                           knn_indices=knn_5_indices)
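
Later cells index the result as `knn_5_codes[job_id][0]`, which suggests a mapping from job id to an ordered list of candidate codes. The sketch below shows one way to build such a mapping. It is an assumption about the behaviour of `convert_knn_indices_to_codes`, and `indices_to_codes` is a hypothetical name.

```python
def indices_to_codes(job_ids, label_codes, knn_indices):
    """Map each job id to the ISCO codes of its nearest labels, best match first."""
    return {
        job_id: [label_codes[i] for i in row]
        for job_id, row in zip(job_ids, knn_indices)
    }


# Toy example with two jobs and five labels
codes = indices_to_codes(
    job_ids=[872828466, 839465958],
    label_codes=["1111", "1112", "1113", "1114", "1120"],
    knn_indices=[[4, 0], [2, 1]],
)
```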

In [35]:
# Generate all prompts
prompts = generate_prompts(clean_jobs, clean_labels, knn_5_codes, SYSTEM_TEMPLATE, USER_TEMPLATE)

Generating prompts: 100%|██████████| 25665/25665 [03:08<00:00, 136.19it/s]
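
The inference loop in section 6.2 reads `prompt[0]` as the system message and `prompt[1]` as the user message, so each entry of `prompts` is a pair of strings. A minimal sketch of how one such pair could be assembled from templates with two `{}` placeholders each is shown below. The helper `build_prompt`, the shortened demo templates, and the `code: text` layout of the context are assumptions; the real `generate_prompts` may format the context differently.

```python
# Shortened stand-ins for SYSTEM_TEMPLATE and USER_TEMPLATE (same placeholder count)
SYSTEM_TEMPLATE_DEMO = "CONTEXT:\n{}\n\nAvailable classes are: {}"
USER_TEMPLATE_DEMO = "JOB TITLE: {}\nJOB DESCRIPTION: {}\n"


def build_prompt(system_template, user_template, title, description, candidates):
    """Fill both templates for one job; candidates is a list of (code, label_text) pairs."""
    context = "\n".join(f"{code}: {text}" for code, text in candidates)
    classes = ", ".join(code for code, _ in candidates)
    system = system_template.format(context, classes)
    user = user_template.format(title, description)
    return system, user


system, user = build_prompt(
    SYSTEM_TEMPLATE_DEMO,
    USER_TEMPLATE_DEMO,
    title="Senior IT Support Engineers",
    description="My Client, who has been continually growing ...",
    candidates=[("1111", "Legislators"), ("1112", "Senior government officials")],
)
```

Restricting the context and the list of available classes to the retrieved candidates keeps the prompt short and limits the LLM to codes that the embedding step already considers plausible.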


In [37]:
# Store/load prompts
try:
    with open(Path(Paths.INTERIM_DATA_PATH) / 'prompts.pkl', 'rb') as file:
        prompts = pickle.load(file)
    print("Prompts loaded from file.")
except FileNotFoundError:
    with open(Path(Paths.INTERIM_DATA_PATH) / 'prompts.pkl', 'wb') as file:
        pickle.dump(prompts, file)
    print("Prompts saved to file.")

Prompts loaded from file.


### 6.2 Run inference with LLM

In [38]:
# Log in to Hugging Face (expects the HF_TOKEN environment variable to be set)
login(os.environ["HF_TOKEN"])

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: fineGrained).
Your token has been saved to /home/azureuser/.cache/huggingface/token
Login successful


In [39]:
# Define LLM for text-generation
model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"

In [4]:
# Define pipeline
pipe = pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device="cuda",
)

In [41]:
# Define terminators
terminators = [
    pipe.tokenizer.eos_token_id,
    pipe.tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

In [59]:
# Define sampling parameters
TEMPERATURE = 0.6
TOP_P = 0.9  # nucleus sampling threshold; must lie in (0, 1]

In [60]:
# Create empty dict to store outputs
outputs = {}

In [66]:
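# Run inference with LLM for each job advertisement
for job_id, prompt in prompts.items():
    try:
        # prompt[0] is the system message, prompt[1] the user message
        messages = [
            {"role": "system", "content": prompt[0]},
            {"role": "user", "content": prompt[1]},
        ]

        output = pipe(
            text_inputs=messages,
            max_new_tokens=256,
            eos_token_id=terminators,
            do_sample=True,
            temperature=TEMPERATURE,
            top_p=TOP_P,
        )
        # The pipeline returns the whole chat; the last message is the assistant reply
        assistant_response = output[0]["generated_text"][-1]["content"]
        outputs[job_id] = assistant_response
    except Exception as e:
        # Record the failure so that post-processing can fall back to the nearest neighbour
        print(e)
        outputs[job_id] = "error"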

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


## 7. Post-processing

### 7.1 Post-process LLM outputs

In [75]:
# Extract 4 digit string from the outputs
clean_outputs = {k: extract_first_4_digit_string(v) for k, v in outputs.items()}
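
The LLM is asked to answer with a bare four-digit code, but it may still wrap the code in extra text. A regular expression is one way to pull out the first standalone four-digit run. The sketch below is an assumption about what `extract_first_4_digit_string` does; in particular, returning `None` when nothing matches is our choice and may differ from the real helper.

```python
import re

# Exactly four digits that are not part of a longer digit run
_FOUR_DIGITS = re.compile(r"(?<!\d)\d{4}(?!\d)")


def first_four_digit_code(text):
    """Return the first standalone four-digit string in text, or None if there is none."""
    match = _FOUR_DIGITS.search(text)
    return match.group(0) if match else None
```

The lookbehind and lookahead prevent the pattern from matching inside longer numbers such as phone numbers or years glued to other digits.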

In [87]:
# Get a list with all the valid taxonomy codes
valid_codes = taxonomy[taxonomy.level==4].code.to_list()

In [89]:
# Replace errors and invalid codes with the top-1 nearest-neighbour code
for job_id, code in clean_outputs.items():
    if code == 'error' or code not in valid_codes:
        clean_outputs[job_id] = knn_5_codes[job_id][0]
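
The same fallback can be written as a pure function, which makes it easier to test. The sketch below compares codes as strings, because the extracted LLM outputs are strings while codes read from a spreadsheet may be integers; whether that applies to `taxonomy.code` here is an assumption. `apply_fallback` is a hypothetical helper, and a set is used for the validity check so that each lookup is constant time.

```python
def apply_fallback(predicted, knn_codes, valid_codes):
    """Keep valid predictions; otherwise use the best nearest-neighbour code for that job."""
    valid = {str(code) for code in valid_codes}
    cleaned = {}
    for job_id, code in predicted.items():
        if code is not None and str(code) in valid:
            cleaned[job_id] = str(code)
        else:
            # Missing, failed, or unknown code: fall back to the top-1 retrieved label
            cleaned[job_id] = knn_codes[job_id][0]
    return cleaned
```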

### 7.2 Prepare submission

In [97]:
# Convert dict to pandas DataFrame
final_outputs = pd.DataFrame(clean_outputs.items(), columns=['id', 'code'])

In [100]:
# Export submission
final_outputs.to_csv(Paths.SUBMISSION_DATA_PATH / "classification.csv", index=False, header=False)