## Using Large Language Models to match CV documents to job postings

This notebook uses large language models (LLMs) from OpenAI, hosted on Azure, to find the CV documents that best match a job posting. The process is outlined below:
1. Get CV documents from CV database
2. Summarize the CV documents to make them shorter and more information dense
3. Transform the CV documents to numerical embeddings in order to compare them with job postings in an easy way
4. Read a job posting and transform it in the same way as with the CV documents
5. Compare the job posting with all the CV documents and find the best matches
6. Present the names and rankings of the found matches

See `README.md` for more background and information.

### Setting up the environment
I assume you have `conda` installed (but any virtual environment with `pip` installed will do). For conda, open a terminal and type the following commands:
```bash
conda create -n job-cv-matching python=3.9
conda activate job-cv-matching
```

Then install the required packages from the `requirements` file. In the terminal, type:
```bash
pip install -r requirements
```

In [1]:
# Then we import the packages and set some parameters
# When running this cell, make sure to select the job-cv-matching kernel from the virtual environment that we just created. If it does not show up, restart the jupyter server and try again.

from tqdm.notebook import tqdm
import json
import pandas as pd
import os
import requests
import numpy as np
import time
import pickle

base_path = 'data/'

### Get the CV documents
In this example, we will use a public dataset available on [Kaggle](https://www.kaggle.com). You will need an account and an API key to download the data with the method in this notebook. Setup instructions are available at [this link](https://github.com/Kaggle/kaggle-api).
If you implement this on your own data, you just have to replace the call to the Kaggle-API with a call to your own source of CV documents, and then process the dataset accordingly to fit the format used below.
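
If you bring your own CVs, the rest of the notebook only needs a DataFrame with a `Category` column and a `cv` column. A minimal sketch, assuming your CVs are plain-text files in one folder; the folder layout and the helper name `load_cvs_from_folder` are hypothetical and not part of this project:

```python
from pathlib import Path

import pandas as pd


def load_cvs_from_folder(folder, category="Unknown"):
    """Read every .txt file in `folder` into the two columns used below."""
    rows = []
    for path in sorted(Path(folder).glob("*.txt")):
        text = path.read_text(encoding="utf-8", errors="replace")
        rows.append({"Category": category, "cv": text})
    # Always return both columns, even when the folder is empty
    return pd.DataFrame(rows, columns=["Category", "cv"])
```

The result can be assigned to `raw` in place of the CSV loaded from Kaggle.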

In [2]:
import kaggle

# Download the dataset of CV documents
!mkdir data
!kaggle datasets download leenardeshmukh/curriculum-vitae -p ./data --unzip


Downloading curriculum-vitae.zip to ./data
 72%|███████████████████████████▎          | 3.00M/4.18M [00:00<00:00, 4.68MB/s]
100%|██████████████████████████████████████| 4.18M/4.18M [00:00<00:00, 5.21MB/s]


In [2]:
# Load the data and print a few lines
base_path = 'data/'
raw = pd.read_csv(base_path+'Curriculum Vitae.csv')
raw.rename(columns={'Resume': 'cv'}, inplace=True)
raw

Unnamed: 0,Category,cv
0,Data Science,Skills * Programming Languages: Python (pandas...
1,Data Science,Education Details \r\nMay 2013 to May 2017 B.E...
2,Data Science,"Areas of Interest Deep Learning, Control Syste..."
3,Data Science,Skills â¢ R â¢ Python â¢ SAP HANA â¢ Table...
4,Data Science,"Education Details \r\n MCA YMCAUST, Faridab..."
...,...,...
11019,DotNet Developer,"Technical Skills â¢ Languages: C#, ASP .NET M..."
11020,DotNet Developer,Education Details \r\nJanuary 2014 Education ...
11021,DotNet Developer,"Technologies ASP.NET, MVC 3.0/4.0/5.0, Unit Te..."
11022,DotNet Developer,"Technical Skills CATEGORY SKILLS Language C, C..."


## Summarize CV documents
First we need to set up a connection to the models we will be using.

Any suitable LLM will do here, but I have chosen GPT-based models from [OpenAI](https://openai.com), hosted on Microsoft [Azure](https://portal.azure.com). See the pre-requisites section of `README.md` for more information on how to set this up.

In [2]:
import openai

# Set parameters for Azure openai
openai_rg_name = 'openai-lab'
openai_svc_name = 'openai-lab-rm'
openai.api_type = "azure"
openai.api_version = "2023-03-15-preview"

# Choose your openai endpoint and key that you acquire when setting up Azure openai. I have set them as environment variables by using a .env-file.
openai.api_base = os.getenv("OPENAI_API_BASE")
openai.api_key = os.getenv("OPENAI_API_KEY")

# Define the deployment used for summarizing CV documents. The deployment has an old name, but the model behind it is in fact gpt-35-turbo.
text_summarization_model = "text-davinci-003"
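
Since the endpoint and key come from environment variables, a missing variable would otherwise only surface later, when the first API call fails. A small sketch that fails early instead; `require_env` is a hypothetical helper, and the variable names are the ones used in the cell above:

```python
import os


def require_env(name):
    """Return the value of environment variable `name`, or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Usage (commented out so the sketch runs without credentials):
# openai.api_base = require_env("OPENAI_API_BASE")
# openai.api_key = require_env("OPENAI_API_KEY")
```

If the values live in a `.env` file, they have to be loaded into the environment before this check runs, for example with the `python-dotenv` package.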

CV documents are rather lengthy, often contain repeated information, and are written as sales text. Some of that text can act as noise for the LLMs that interpret and transform the data. Since these models don't care how nicely formatted a document is, we can summarize the documents to make them as information dense and to the point as possible. Doing so also shortens them, which makes them easier and cheaper for the GPT-based models to process.

For the model to produce (hopefully) good results, we set the context by informing it of the current situation and task at hand. We also set some tuneable parameters in our call to the model.

In [4]:
# Create a function that takes a CV, summarizes it, and returns the full API response
def get_summary(document):
    response = openai.ChatCompletion.create(
        engine=text_summarization_model,
        
        # Here we set the context for the model to prepare it for the task
        messages=[
            {"role": "system", "content": "You are a large language model specialized in summarizing CV documents. You do this by extracting all the information in a document that is relevant from a career and job application perspective."},
            {"role": "user", "content": f"Write a detailed summary of the CV document below: \n\n{document}"}
        ],
        
        # Here we set the model parameters which are used to tweak how the response turns out
        temperature=0.2,
        top_p=1,
        n=1,
        )
    
    return response


# Test the model
doc = "Richard Martin is a consultant that specializes in building AI-driven analytics solutions. He works at Sopra Steria Sweden together with a diverse team of consultants covering the whole field of data and analytics. His job title is typically Data Scientist. He also works with statistics, business intelligence, and classical machine learning. He has several years of experience from common tools like Python, SQL and Power BI. He likes to work in the Azure cloud ecosystem."
print(get_summary(doc))

{
  "id": "chatcmpl-81Vt1tGVY2WCVg0aBTL5GjqrB8ICn",
  "object": "chat.completion",
  "created": 1695371591,
  "model": "gpt-35-turbo",
  "choices": [
    {
      "index": 0,
      "finish_reason": "stop",
      "message": {
        "role": "assistant",
        "content": "Richard Martin is a Data Scientist and consultant at Sopra Steria Sweden, where he specializes in building AI-driven analytics solutions. He works with a diverse team of consultants covering the entire field of data and analytics. Richard has several years of experience working with common tools such as Python, SQL, and Power BI. He is also proficient in statistics, business intelligence, and classical machine learning. Richard prefers to work in the Azure cloud ecosystem."
      }
    }
  ],
  "usage": {
    "completion_tokens": 88,
    "prompt_tokens": 147,
    "total_tokens": 235
  }
}
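
The response carries more than the summary text. A small sketch, based only on the fields visible in the output above, that pulls out the summary and the token count; `extract_summary` is a hypothetical helper:

```python
def extract_summary(response):
    """Return (summary_text, total_tokens) from a chat completion response."""
    choice = response["choices"][0]
    # A finish_reason other than "stop" may indicate a cut-off summary
    if choice.get("finish_reason") != "stop":
        print(f"Warning: finish_reason={choice.get('finish_reason')!r}")
    return choice["message"]["content"], response["usage"]["total_tokens"]
```

Tracking `total_tokens` per document gives a rough picture of what summarizing the full dataset would cost.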


In [7]:
# Work with a small sample when developing. The copy lets us add columns without affecting raw.
data = raw.sample(5).copy()

In [8]:
# Summarize all documents
data['summarization'] = ''
for idx, cv in tqdm(data['cv'].items(), total=len(data)):
    response = get_summary(cv)
    # Use .loc on the DataFrame itself; chained indexing may write to a temporary copy
    data.loc[idx, 'summarization'] = response['choices'][0]['message']['content']

# Newlines can cause problems when we create embeddings of the summarized documents, so we replace them with blank spaces.
data['summarization'] = data['summarization'].replace(r'\n', ' ', regex=True)

In [9]:
data.head(5)

Unnamed: 0,Category,cv,summarization
10289,Arts,â¢ Operating Systems: Windows XP / Vista / 07...,The candidate has experience in operating syst...
9160,ETL Developer,Education Details \r\nJanuary 2015 Bachelor of...,The candidate holds a Bachelor of Engineering ...
10514,Java Developer,"Computer Skills: Languages And Script: JSP, Se...","The candidate has computer skills in JSP, Serv..."
7224,Advocate,Education Details \r\n LLB. Dibrugarh Univer...,The candidate's CV states that they have compl...
5842,DevOps Engineer,CORE COMPETENCIES ~ Ant ~ Maven ~ GIT ~ Bitbuc...,The CV belongs to a DevOps Engineer with exper...


## Transform into embeddings

An embedding translates written text into structured numerical data. It maps a string to an array, like so:
`'hi there' -> [0.1244, 0.1984, 0.1851]`

The point of embedding the texts is that numerical arrays (vectors) are easier to compare for similarity than written text. Exactly how the transformation is done is hard to inspect, since the model is a black box by nature. Microsoft explains it like this:
> The embedding is an information dense representation of the semantic meaning of a piece of text

![Picture of the embedding process](embed.png)
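
To make the idea of comparing vectors concrete, here is a toy sketch with hand-made three-dimensional vectors. The numbers are invented for illustration and do not come from any model; the real embeddings in this notebook have 1536 dimensions.

```python
import numpy as np

# Invented toy "embeddings"; a real model would produce these
toy = {
    "python developer": np.array([0.9, 0.1, 0.0]),
    "software engineer": np.array([0.8, 0.3, 0.1]),
    "pastry chef": np.array([0.0, 0.2, 0.9]),
}


def cosine(a, b):
    """Cosine of the angle between a and b (1.0 means same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


near = cosine(toy["python developer"], toy["software engineer"])
far = cosine(toy["python developer"], toy["pastry chef"])
print(f"developer vs engineer: {near:.3f}")
print(f"developer vs chef:     {far:.3f}")
```

Vectors that point in similar directions score close to 1, while unrelated ones score close to 0.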

We select a different GPT model, specialized for text embedding, and create a function that embeds a text into a numerical vector. Then we process our dataset.


In [8]:
# Select our deployed model specialized for embedding
embedding_model = 'text-similarity-davinci-001'

# Function for creating an embedding from a text
def get_embedding(text, deployment_id):

    result = openai.Embedding.create(
      deployment_id=deployment_id,
      input=text
    )
    result = np.array(result["data"][0]["embedding"])
    return result

# Try it out
embedding = get_embedding("What does the embedding of this sentence look like?", embedding_model)

# Check the results
print('Embedding:', embedding)
print('Datatype of embedding:', type(embedding))
print('Length of embedding vector:', len(embedding))

Embedding: [-0.01314186  0.00399883  0.01619395 ...  0.00137543 -0.01616747
 -0.01071873]
Datatype of embedding: <class 'numpy.ndarray'>
Length of embedding vector: 1536


In [12]:
# Create embeddings for each CV in our dataset
data['embedding'] = None  # object column, so each cell can hold a whole array
for i in tqdm(data.index.values):
    try:
        # .at writes a single cell on the DataFrame itself, avoiding chained assignment
        data.at[i, 'embedding'] = get_embedding(data.at[i, 'summarization'], embedding_model)
    except Exception as err:
        print(f"Row {i}: unexpected {err=}, {type(err)=}")

    # Wait between calls because of restrictions in model API
    time.sleep(7)

In [13]:
data.sample(5)

Unnamed: 0,Category,cv,summarization,embedding
5842,DevOps Engineer,CORE COMPETENCIES ~ Ant ~ Maven ~ GIT ~ Bitbuc...,The CV belongs to a DevOps Engineer with exper...,"[-0.002695744391530752, -0.022219469770789146,..."
9160,ETL Developer,Education Details \r\nJanuary 2015 Bachelor of...,The candidate holds a Bachelor of Engineering ...,"[0.000728837912902236, -0.015829650685191154, ..."
10289,Arts,â¢ Operating Systems: Windows XP / Vista / 07...,The candidate has experience in operating syst...,"[0.005620994139462709, -0.004112922586500645, ..."
10514,Java Developer,"Computer Skills: Languages And Script: JSP, Se...","The candidate has computer skills in JSP, Serv...","[0.006072777323424816, -0.006378600373864174, ..."
7224,Advocate,Education Details \r\n LLB. Dibrugarh Univer...,The candidate's CV states that they have compl...,"[-0.01249951682984829, -0.0009597481694072485,..."
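
The fixed pause in the embedding loop above is a simple way to stay under the API's rate limit. A sketch of an alternative that retries a failed call with exponentially growing waits; `call_with_retries` and its parameters are hypothetical, and it catches any exception rather than a specific rate-limit error:

```python
import time


def call_with_retries(fn, *args, max_attempts=5, base_delay=1.0, sleep=time.sleep, **kwargs):
    """Call fn(*args, **kwargs), retrying on failure with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn(*args, **kwargs)
        except Exception:
            if attempt == max_attempts:
                raise  # give up and surface the last error
            # Wait base_delay, 2*base_delay, 4*base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
```

It could wrap the existing call as `call_with_retries(get_embedding, text, embedding_model)`. The `sleep` parameter exists only so the waiting can be swapped out when testing.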


Now we have our processed dataset ready for comparisons whenever we want to find candidates for a new job ad.
This is a good time to store our data so that we don't have to re-process it each time we use our matching tool. In this example we simply save the results to a CSV file, but in a production environment they would typically be stored in a database.
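
One way to keep vectors in a CSV file without pickling is to store each vector as a JSON list. A minimal sketch on a toy DataFrame, separate from the approach used in the next cell; the helper names are hypothetical:

```python
import io
import json

import numpy as np
import pandas as pd


def embedding_to_json(arr):
    """Serialize a numpy vector as a JSON list string."""
    return json.dumps(np.asarray(arr).tolist())


def embedding_from_json(text):
    """Parse a JSON list string back into a numpy vector."""
    return np.array(json.loads(text))


# Round trip through CSV using an in-memory buffer instead of a file
toy = pd.DataFrame({"summarization": ["a", "b"],
                    "embedding": [np.array([0.1, 0.2]), np.array([0.3, 0.4])]})
toy["embedding"] = toy["embedding"].apply(embedding_to_json)
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)
restored["embedding"] = restored["embedding"].apply(embedding_from_json)
```

The stored text stays human-readable, and loading it does not require evaluating or unpickling anything.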

In [14]:
# Saving numerical arrays inside a csv-file can be done by pickling the arrays.
data['embedding'] = data['embedding'].apply(lambda arr: pickle.dumps(arr))

# Save the results to our data folder
data.to_csv(base_path+'embeddings.csv', index=False)

## Build pipeline for transforming and comparing a job ad with the processed CVs
We now define some functions that automate the process: they read a job ad, process it in the same way as our CV documents, and compare it with each CV to find the best matches.

In [3]:
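import ast

# Load the embeddings from our storage
data = pd.read_csv(base_path+'embeddings.csv')

# De-pickle the embeddings back into numpy arrays. The CSV holds each pickle as the text of a
# bytes literal; ast.literal_eval parses that text without executing code, unlike eval.
data['embedding'] = data['embedding'].apply(lambda pickled_arr: pickle.loads(ast.literal_eval(pickled_arr)))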

In [9]:
# We measure the similarity between two texts by the cosine of the angle between their embedding vectors. Since our embedding model produces vectors of unit length, this equals the dot product of the vectors.
def vector_similarity(x, y):
    """
    Returns the similarity between two vectors.    
    Because OpenAI Embeddings are normalized to length 1, the cosine similarity is the same as the dot product.
    """
    similarity = np.dot(x, y)
    return similarity 
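
The shortcut in `vector_similarity` holds only for unit-length vectors. A sketch that checks this numerically on random vectors and shows how to normalize vectors from a source that does not guarantee unit length; `normalize` and `cosine_similarity` are illustrative helpers, not part of the notebook:

```python
import numpy as np


def normalize(v):
    """Scale v to unit length."""
    return v / np.linalg.norm(v)


def cosine_similarity(x, y):
    """Cosine similarity that works for vectors of any length."""
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))


rng = np.random.default_rng(0)
x, y = rng.normal(size=8), rng.normal(size=8)

# For unit vectors the dot product and the cosine similarity coincide
unit_dot = np.dot(normalize(x), normalize(y))
print(np.isclose(unit_dot, cosine_similarity(x, y)))
```

If you switch to an embedding model whose vectors are not normalized, `cosine_similarity` is the safer choice.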

In [10]:
# We also want to sort the results in order of similarity score
def order_document_sections_by_query_similarity(query, contexts):
    """
    Find the query embedding for the supplied query, and compare it against all of the pre-calculated document embeddings
    to find the most relevant candidates. 
    Return a list of (similarity, document index) tuples, sorted by similarity in descending order.
    """
    query_embedding = get_embedding(query, embedding_model)

    document_similarities = sorted(
        [(vector_similarity(query_embedding, doc_embedding), doc_index) for doc_index, doc_embedding in contexts.items()], 
        reverse=True)
    
    return document_similarities
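
The list comprehension above computes one dot product per CV. With many CVs, a sketch like the following stacks the embeddings into a matrix and ranks them in one step; `rank_by_similarity` is a hypothetical helper and assumes `embeddings` is a pandas Series of equal-length numpy vectors, like `data['embedding']`:

```python
import numpy as np
import pandas as pd


def rank_by_similarity(query_embedding, embeddings, top_n=5):
    """Return [(score, index), ...] for the top_n most similar embeddings."""
    matrix = np.stack(embeddings.to_list())      # shape: (n_docs, dim)
    scores = matrix @ query_embedding            # one dot product per row
    order = np.argsort(scores)[::-1][:top_n]     # highest score first
    return [(float(scores[i]), embeddings.index[i]) for i in order]


# Toy data: three 2-dimensional unit vectors
toy = pd.Series([np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([0.6, 0.8])],
                index=[10, 11, 12])
ranking = rank_by_similarity(np.array([0.0, 1.0]), toy, top_n=2)
```

The return value has the same `(score, index)` shape as the function above, so it could be swapped in without changing the code that consumes the ranking.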

In [11]:
# Finally we need a function that processes our dataset by using the above functions and return our final result
def retrieve_relevant_documents(description, data, top_n=5):
    
    # Find the CVs most similar to the job description and keep the top n
    answers = order_document_sections_by_query_similarity(query=description, contexts=data['embedding'])[:top_n]
    results = []

    # Collect and print the top n matches
    for score, idx in answers:
        # The sample dataset has no candidate names; with your own data, look the name up here
        summarization = data['summarization'].loc[idx]
        results.append({'id': idx, 'score': score, 'summarization': summarization})

        print(f'id:   {idx},   similarity score:   {score}')
        print(summarization, '\n')

    return results, answers

Finally, we feed our model a sample job ad and get back the summarized CVs along with a similarity score for each of the matches. Here we limit the response to the top _n_ matches.

In [13]:
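# The sample job ad is in Swedish. In English: "We are looking for a colleague for our bid department. The person should be able to work with bid and tender processes and handle interviews with customers and our own consultants. Good communication skills are of the utmost importance. CSS, JavaScript, jQuery, Ajax."
query = 'Vi söker en medarbetare till vår bid avdelning. Personen ska kunna arbeta med bid- och anbudsprocesser och sköta intervjuer med kunder och egna konsulter. God kommunikationsförmåga är av yttersta vikt. CSS, JavaScript, jQuery, Ajax.'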
results, answers = retrieve_relevant_documents(description=query, data=data, top_n=3)

id:   2,   similarity score:   0.787698408453656
The candidate has computer skills in JSP, Servlet, HTML, CSS, JavaScript, jQuery, Ajax, Spring, Hibernate, MySQL, Eclipse, and NetBeans IDE. They have education in H.S.C from VidyaBharati college in Amravati, Maharashtra in January 2007 and S.S.C from Holy Cross English School in Amravati, Maharashtra in January 2005. They have worked as a Java Developer for 14 months and have experience in Eclipse, Hibernate, Spring, and jQuery. They are currently working as a Java Developer in Winsol Solution Pvt Ltd since July 2017 and have a total of 2 years of experience as a Java Developer in Kunal IT Services Pvt Ltd. 

id:   4,   similarity score:   0.7628761191639152
The CV belongs to a DevOps Engineer with experience in deployment, documentation, change management, and configuration management. The candidate has hands-on experience in DevOps, automation, build engineering, and configuration management. They have worked on multiple projects invo

## Test the model performance