**Loading the embedding model used to convert the remaining numeric data to semantic data**

In [None]:
from langchain_community.embeddings import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(
    model_name="Alibaba-NLP/gte-base-en-v1.5",
    model_kwargs={"device": "cpu", "trust_remote_code": True},
    encode_kwargs={"normalize_embeddings": True},
)
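
Because `normalize_embeddings` is set to `True`, each returned vector should have unit L2 norm, so the cosine similarity between two embeddings reduces to a plain dot product. The following dependency-free sketch illustrates this with two made-up 3-dimensional vectors (the real model returns much longer vectors); `l2_normalize` and `dot` are hypothetical helper names, not part of LangChain.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit L2 norm (what normalize_embeddings=True is assumed to do)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

# Toy vectors standing in for model output (illustrative values only)
u = l2_normalize([3.0, 4.0, 0.0])
v = l2_normalize([4.0, 3.0, 0.0])

cosine = dot(u, v)  # for unit vectors, cosine similarity equals the dot product
print(round(cosine, 4))
```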

**Loading the batch of data processed with main.py**   
The processed data was dumped to a .bson file, which we read here    
so that we make fewer requests to the database.

In [None]:
import bson

with open("embedded.bson", "rb") as f:
    data = bson.decode_all(f.read())

print(len(data))

For the CNN model, we compute the minimum and maximum of each numeric field so that the values can be min-max scaled later at runtime.

In [None]:
import numpy as np

# Collect each numeric field as a column vector of shape (n_samples, 1)
expYears = np.array([d["expYears"] for d in data]).reshape(-1, 1)
expYearsManagement = np.array([d["expYearsManagement"] for d in data]).reshape(-1, 1)
avgTimeInJob = np.array([d["avgTimeInJob"] for d in data]).reshape(-1, 1)

# Compute the minimum and maximum of the data
min_val = np.min(expYears)
max_val = np.max(expYears)

min_val_management = np.min(expYearsManagement)
max_val_management = np.max(expYearsManagement)

min_val_avgTimeInJob = np.min(avgTimeInJob)
max_val_avgTimeInJob = np.max(avgTimeInJob)

# Define a function to scale data
def min_max_scale(data, min_val, max_val):
    return (data - min_val) / (max_val - min_val)
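
As a quick check of `min_max_scale`, the sketch below applies the same formula to made-up values. The guard for `max_val == min_val` is an added assumption (the version above would divide by zero if every sample had the same value), and `safe_min_max_scale` is a hypothetical name.

```python
def safe_min_max_scale(value, min_val, max_val):
    """Map value from [min_val, max_val] to [0, 1]."""
    span = max_val - min_val
    if span == 0:
        # Assumption: a constant feature carries no information, so map it to 0.0
        return 0.0
    return (value - min_val) / span

years = [0, 5, 10, 20]  # illustrative values only
lo, hi = min(years), max(years)
scaled = [safe_min_max_scale(y, lo, hi) for y in years]
print(scaled)  # [0.0, 0.25, 0.5, 1.0]
```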


**Creating a function to convert the remaining numerical data to semantic data**     
Some of the fields we originally received are numeric scalars, such as years of experience; these are converted to semantic data. We found that the best way to preserve the numeric information in embedding form was to first convert the years to months and then to a text string.  
We do this with a Python library (num2words) that converts an integer to its English text form.

In [None]:
from num2words import num2words
def monthsofexperience(expYears):
    months = int(expYears * 12)
    return f"The candidate has {num2words(months)} months of labor experience"

# monthsofexperience(2) -> 'The candidate has twenty-four months of labor experience'

The code block above shows the first function for converting numerical data to semantic data.   
For example, if we pass 2 as the argument, the function returns the string *"The candidate has twenty-four months of labor experience"*.   
The remaining functions work on the same principle, but for different numeric fields.

In [None]:
def expYearsEmbedding(expYears):
    if expYears >= 5:
        temp = monthsofexperience(expYears) + ", so he has enough experience to be a manager"
    else:
        temp = monthsofexperience(expYears) + ", so he is not qualified to be a manager"
    return np.array(embeddings.embed_documents([temp])[0])
    
def expYearsEmbedding2(expYearsManagement):
    if expYearsManagement >= 5:
        temp = monthsofexperience(expYearsManagement) + ", so he has enough experience to have an executive position"
    else:
        temp = monthsofexperience(expYearsManagement) + ", so he is not qualified to have an executive position"
    return np.array(embeddings.embed_documents([temp])[0])

def avgTimeInJobEmb(avgTimeInJob):
    temp = f"The candidate has an average of {num2words(avgTimeInJob)} months in each job"
    return np.array(embeddings.embed_documents([temp])[0])

def management_position(management_position):
    if management_position:
        temp = "One of the recent jobs of the candidate was in a management position"
    elif management_position == False:
        temp = "One of the recent jobs of the candidate did not involve management activities"
    else:
        temp = "The candidate didn't provide information about his recent jobs"
        
    return np.array(embeddings.embed_documents([temp])[0])

def education_level(education_level):
    if education_level < 0:
        temp = "The candidate didn't provide information about his education level"
    elif education_level == 0:
        temp = "The candidate has a high school education level"
    elif education_level == 1:
        temp = "The candidate has a college education level"
    elif education_level == 2:
        temp = "The candidate has a postgraduate education level"
    else:
        temp = "The candidate has a doctorate education level"
    return np.array(embeddings.embed_documents([temp])[0])

**Constant definitions for missing data**.    
We always use the last 3 jobs of a candidate. If a candidate has fewer than 3 jobs, the empty slots are filled with the constants defined below.    
The same applies to the candidate's education: we always try to obtain information about their university degree and about their highest qualification above the university degree. If no information is available, those fields are also filled with the constants defined below.


In [None]:
NA_JOB_CONST = np.array(embeddings.embed_documents(["No more jobs were found for this candidate"])[0])
NA_CONST = np.array(embeddings.embed_documents(["Not available information"])[0])
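
The padding rule described above (always three job slots, most recent first, with a constant filling any empty slots) can be sketched independently of the embeddings. Here strings stand in for embedding vectors, and `last_n_padded` is a hypothetical helper, not a function from this notebook.

```python
def last_n_padded(items, n, pad):
    """Return the last n items, most recent first, padded with `pad` up to length n."""
    recent = list(reversed(items[-n:])) if items else []
    return recent + [pad] * (n - len(recent))

NA_JOB = "<no more jobs>"  # placeholder standing in for NA_JOB_CONST

print(last_n_padded(["analyst", "manager"], 3, NA_JOB))
# ['manager', 'analyst', '<no more jobs>']
```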

**Using the folder information to assign a label for supervised learning**.    
When the CVs were processed, the path of the folder where each CV was stored was saved. This information is used to label the data.

In [None]:
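def labeler(label):
    # Map the role keyword found in the CV's folder path to a class index
    if "Especialista" in label:
        return 0
    elif "Gerente" in label:
        return 1
    elif "Director" in label:
        return 2
    # Any other path falls through and returns None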
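
To illustrate the labelling rule, the sketch below applies the same substring checks to made-up folder paths and raises an error for unknown folders. Raising is an assumption (the `labeler` function above returns `None` in that case), and the paths and the name `label_from_path` are hypothetical.

```python
# Same mapping as labeler above: folder keyword -> class index
LABELS = {"Especialista": 0, "Gerente": 1, "Director": 2}

def label_from_path(path):
    """Return the class index for the first role keyword found in the path."""
    for keyword, index in LABELS.items():
        if keyword in path:
            return index
    raise ValueError(f"No known role found in path: {path!r}")

print(label_from_path("CVs/Gerente/cv_001.pdf"))  # 1
```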

**Create a function to obtain the matrices that feed the models**.    
With the previously defined functions, we can build the data matrices that feed the models.    
We defined 15 features per candidate that we consider relevant for classifying them into one of the 3 categories. The features were selected to reduce bias in the model: they are written in neutral language and have been rewritten to omit personal information and identifying details.    
The selected features are:
1. Years of work experience
2. Years in leadership positions
3. Average time in each job
4. Information indicating highest level of education (High School, College, Postgraduate, Doctorate, NA)
5. Highest educational qualification (Above college degree, not including name of college or university)
6. Undergraduate degree (Not including the name of the college or university)
7. Last job title (The name of the company is deliberately not included)
8. Summary of responsibilities of last job
9. Categorisation of whether the last job involved leadership
10. Title of penultimate job (Company name is deliberately not included)
11. Summary of responsibilities of the penultimate job
12. Categorisation of whether the penultimate job involved leadership
13. Title of the antepenultimate job (Company name is deliberately not included)
14. Summary of responsibilities of the antepenultimate job
15. Categorisation of whether the antepenultimate job involved leadership


In [None]:
def to_matrix(bson):
    totalWorks = len(bson["work"])
    work_title = []
    work_brief = []
    work_management = []
    for i in range(1,4):
        if totalWorks-i < 0:
            work_title.append(NA_JOB_CONST)
            work_brief.append(NA_JOB_CONST)
            work_management.append(NA_JOB_CONST)
        else:
            work_title.append(bson["work"][totalWorks-i]["title"])
            work_brief.append(bson["work"][totalWorks-i]["brief"])
            if bson["work"][totalWorks-i]["management"] == 0:
                work_management.append(management_position(False))
            else:
                work_management.append(management_position(True))
    if bson["bachelor"] is not None:
        bachelor_title = bson["bachelor"]["title"]
    else:
        bachelor_title = NA_CONST

    if bson["maxEducation"] is not None:
        maxEducation_title = bson["maxEducation"]["title"]
    else:
        maxEducation_title = NA_CONST
        
    expYears = expYearsEmbedding(bson["expYears"])
    #expYearsNumeric = np.ones(expYears.shape) * min_max_scale(bson["expYears"], min_val, max_val)
    expYearsManagement = expYearsEmbedding2(bson["expYearsManagement"])
    #expYearsManagementNumeric = np.ones(expYearsManagement.shape) * min_max_scale(bson["expYearsManagement"], min_val_management, max_val_management)
    avgTimeInJob = avgTimeInJobEmb(bson["avgTimeInJob"])
    #avgTimeInJobNumeric = np.ones(avgTimeInJob.shape) * min_max_scale(bson["avgTimeInJob"], min_val_avgTimeInJob, max_val_avgTimeInJob)
    highestEducation = education_level(bson["highestEducation"])

    # Stack all the features into one np.array where each variable is a row
    data = np.vstack([expYears,
                      #expYearsNumeric, 
                    expYearsManagement, 
                    #expYearsManagementNumeric,
                    avgTimeInJob, 
                    #avgTimeInJobNumeric,
                    highestEducation,
                    bachelor_title, 
                    maxEducation_title, 
                    work_title[0], 
                    work_brief[0],
                    work_management[0], 
                    work_title[1],
                    work_brief[1],
                    work_management[1], 
                    work_title[2], 
                    work_brief[2],
                    work_management[2]])
    return (data, labeler(bson["label"]), bson["file"])
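
Since `np.vstack` places one feature per row, it can help to keep a name for each row index when inspecting the resulting matrix. The list below follows the stacking order used in `to_matrix`; the names themselves are hypothetical labels chosen for illustration.

```python
# Row order mirrors the np.vstack call in to_matrix (names are illustrative)
FEATURE_ROWS = [
    "expYears",
    "expYearsManagement",
    "avgTimeInJob",
    "highestEducation",
    "bachelor_title",
    "maxEducation_title",
]
# Three job slots, most recent first, each contributing title, brief, management
for slot in range(3):
    FEATURE_ROWS += [f"work_title_{slot}", f"work_brief_{slot}", f"work_management_{slot}"]

ROW_INDEX = {name: i for i, name in enumerate(FEATURE_ROWS)}

print(len(FEATURE_ROWS))          # 15
print(ROW_INDEX["work_brief_1"])  # 10
```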

The last step is to run the function in a loop over all the processed data and save the result in an .npy file with the following dimensions:    
(Number of samples, 15, 768)

In [None]:
from random import randint
# A random suffix is appended to reduce collisions between file names.
# The matrices are stored in the Data_matrix folder in case they are needed separately.
# Note: open() fails if the Data_matrix/<label>/ folder does not already exist.
def flujo(bson):
    try:
        a = to_matrix(bson)  # (matrix, label, file path)
        filename = a[2].split("/")[0]
        with open(f"Data_matrix/{a[1]}/{filename}{randint(1,100)}.csv", "wb") as f:
            np.savetxt(f, a[0], delimiter=",")
    except Exception as e:
        # Report the failing record and the reason instead of silently swallowing it
        print("Error with ", bson["file"], e)
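
A random suffix between 1 and 100 still leaves a real chance that two files with the same base name collide, in which case the later file overwrites the earlier one. One alternative, shown here as a sketch rather than what this notebook does, is a UUID-based suffix from the standard library; `unique_csv_name` is a hypothetical helper.

```python
import uuid

def unique_csv_name(base):
    """Build a CSV filename whose random suffix makes collisions extremely unlikely."""
    return f"{base}_{uuid.uuid4().hex}.csv"

a = unique_csv_name("candidate")
b = unique_csv_name("candidate")
print(a)
print(a != b)
```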

In [None]:
import os
from joblib import Parallel, delayed
Parallel(n_jobs=-1, prefer="threads")(delayed(flujo)(bson) for bson in data)

def load_data():
    X = []
    y = []
    for label in os.listdir("Data_matrix"):
        for file in os.listdir(f"Data_matrix/{label}"):
            data = np.loadtxt(f"Data_matrix/{label}/{file}", delimiter=",")
            X.append(data)
            y.append(int(label))
    return np.array(X), np.array(y)

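X, Y = load_data()

# Save the stacked arrays so that later runs can load them directly
np.save("X.npy", X)
np.save("Y.npy", Y)

X = np.load("X.npy")
Y = np.load("Y.npy")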
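
As a sanity check on the expected dimensions, the sketch below stacks a few dummy per-candidate matrices, saves them with `np.save`, and reloads them. The sample count, the labels, and the temporary file path are made up for illustration; only the (15, 768) per-candidate shape comes from the text above.

```python
import os
import tempfile

import numpy as np

n_samples = 4  # illustrative; the real value is the number of processed CVs
matrices = [np.zeros((15, 768)) for _ in range(n_samples)]
labels = [0, 1, 2, 1]

X_demo = np.array(matrices)  # shape (n_samples, 15, 768)
Y_demo = np.array(labels)    # shape (n_samples,)

# Round-trip through a temporary .npy file
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "X_demo.npy")
    np.save(path, X_demo)
    reloaded = np.load(path)

print(reloaded.shape)  # (4, 15, 768)
```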