# **Resume base compliance with job description**

This notebook automatically screens resumes (in txt format): it checks how closely each resume in a resume database matches a published job posting.

**To start the script:**

**1. Add the job description in txt format as "job_description.txt".**

**2. Create a folder CV_base and add the resume files (in txt format) to it.**

Method used: TF-IDF (term frequency-inverse document frequency), with cosine similarity as the matching score.
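
To make the method concrete, here is a minimal pure-Python sketch of TF-IDF weighting followed by cosine similarity. It assumes plain whitespace tokenisation, the smoothed IDF formula ln((1 + n) / (1 + df)) + 1, and L2 normalisation, which is how scikit-learn documents its defaults. The names `tfidf_vectors` and `cosine` are illustrative and not part of this notebook.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocabulary, L2-normalised TF-IDF vectors) for whitespace-tokenised docs."""
    tokenised = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenised for tok in toks})
    n_docs = len(docs)
    # Document frequency: the number of documents each term occurs in
    df = Counter(tok for toks in tokenised for tok in set(toks))
    # Smoothed IDF (assumed formula): ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    vectors = []
    for toks in tokenised:
        tf = Counter(toks)
        vec = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        vectors.append([v / norm for v in vec])
    return vocab, vectors

def cosine(a, b):
    # The vectors are already unit length, so the dot product is the cosine
    return sum(x * y for x, y in zip(a, b))

docs = ["python data analyst", "python web developer", "pastry chef"]
vocab, vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]))  # positive: both mention "python"
print(cosine(vecs[0], vecs[2]))  # 0.0: no shared terms
```

A term that appears in many documents gets a lower IDF, so shared rare words raise the similarity more than shared common ones.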

### **Creating the environment**

In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
import pandas as pd
import warnings
import os
import chardet

### **File-reading function**
Filenames are sorted alphabetically, so the documents are always read in the same order.

In [None]:
def read_documents(directory, code_scheme):
    # Read every file in the directory, in alphabetical order of filename
    documents = []
    filenames = os.listdir(directory)
    filenames.sort()
    for filename in filenames:
        with open(os.path.join(directory, filename), 'r', encoding=code_scheme) as f:
            documents.append(f.read())
    return documents

Function to get the filenames (used at the output stage)

In [None]:
def filenames_list(directory):
    # Return the filenames in the same alphabetical order used by read_documents
    filenames = os.listdir(directory)
    filenames.sort()
    return filenames
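
Both helpers above list every entry in the folder, so a stray non-txt file or a subfolder would be treated as a resume. A possible safeguard, sketched here with a hypothetical helper `list_txt_files` that is not part of the notebook, is to keep only regular `.txt` files:

```python
import os

def list_txt_files(directory):
    """Return the sorted names of regular .txt files in the directory (hypothetical helper)."""
    return sorted(
        name for name in os.listdir(directory)
        if name.lower().endswith(".txt")
        and os.path.isfile(os.path.join(directory, name))
    )
```

Using one such function for both reading and reporting would keep the document list and the filename list aligned index by index.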

In [9]:
vectorizer = TfidfVectorizer()

Choose the encoding of the txt files (UTF-8 for English, ISO-8859-2 for Hungarian)

In [None]:
#For Hungarian & English language
#encoding='ISO-8859-2'

#For English language
encoding='UTF-8'
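
If the base mixes files in both encodings, one option is to try them in order instead of fixing a single value. The sketch below uses only the standard library; `read_text_with_fallback` is a hypothetical helper, not part of the notebook. UTF-8 is tried first on the assumption that ISO-8859-2 will decode almost any byte sequence without raising an error, so it only makes sense as the fallback.

```python
def read_text_with_fallback(path, encodings=("utf-8", "iso-8859-2")):
    """Return the file's text, decoded with the first encoding that succeeds."""
    with open(path, "rb") as f:
        raw = f.read()
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue  # try the next candidate encoding
    raise ValueError(f"Could not decode {path!r} with any of {encodings}")
```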

You may change the sensitivity: the share of resumes to show after the check. For example, 0.3 shows the top 30% of the base.

In [None]:
sensitivity = 0.3
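
The share is turned into a whole number of resumes by truncation, so a small base can yield zero results. A small sketch of that arithmetic follows; the helper `top_count` and its "at least one result" rule are assumptions, not part of the notebook.

```python
def top_count(n_documents, sensitivity):
    """Number of resumes to report for a given share of the base (hypothetical helper)."""
    if n_documents == 0:
        return 0
    # int() truncates toward zero; keep at least one result for a non-empty base
    return min(n_documents, max(1, int(sensitivity * n_documents)))

print(top_count(10, 0.3))  # 3
print(top_count(2, 0.3))   # 1
```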

### **Create a base of txt files for further processing**


In [None]:
vacancy = 'job_description.txt'
folder = 'CV_base/'

txt_documents = read_documents(folder,encoding)

We use the same vectorizer as in task 1: TfidfVectorizer.

We create a TF-IDF table that shows the weight of each word in each CV document.

In [None]:
vectorizer_t2 = TfidfVectorizer()
tfidf_matrix_t2 = vectorizer_t2.fit_transform(txt_documents)
tfidf_df_t2 = pd.DataFrame(tfidf_matrix_t2.toarray(), columns=vectorizer_t2.get_feature_names_out())
print(tfidf_df_t2)

Calculate the number of elements in the txt_documents list

In [None]:
num_elements = len(txt_documents)
last_document_index = num_elements - 1

Calculate the similarity between the vacancy and each element in the txt_documents list.


In [None]:
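# Read the job description and project it into the TF-IDF space fitted on the CVs
with open(vacancy, 'r', encoding=encoding) as f:
    vacancy_vector = vectorizer_t2.transform([f.read()])
similarities_t2 = cosine_similarity(vacancy_vector, tfidf_matrix_t2)
print('similarities_t2:', similarities_t2)
similarities_t2.shape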
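
Cosine similarity needs numeric vectors on both sides, so the job description text has to be transformed with the vectorizer that was fitted on the CVs. The toy example below illustrates this with made-up texts; the variable names and the sample data are assumptions for illustration only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Made-up sample data, not taken from the notebook
cvs = [
    "python developer with machine learning experience",
    "chef with restaurant kitchen experience",
    "accountant skilled in bookkeeping and tax",
]
job = "machine learning python developer"

toy_vectorizer = TfidfVectorizer()
cv_matrix = toy_vectorizer.fit_transform(cvs)   # learn vocabulary and IDF from the CVs
job_vector = toy_vectorizer.transform([job])    # reuse them for the job description

scores = cosine_similarity(job_vector, cv_matrix)  # one row, one column per CV
print(scores.shape)     # (1, 3)
print(scores.argmax())  # 0
```

Words in the job description that never occur in any CV are ignored by `transform`, so they do not affect the score.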

Get the top share of CVs (30% with the default sensitivity) based on similarity to the vacancy description

In [None]:
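# Keep at least one result: a slice of [-0:] would otherwise return every CV
num_top_elements = max(1, int(sensitivity * similarities_t2.size))
# Sort by descending similarity so the best match is listed first
top_indices = np.argsort(similarities_t2.flatten())[::-1][:num_top_elements]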

filenames = filenames_list(folder)
top_filenames = [filenames[i] for i in top_indices]
top_similarities = similarities_t2.flatten()[top_indices]

for filename, similarity in zip(top_filenames, top_similarities):
    print(f"Filename: {filename}")
    print(f"Similarity: {similarity}")
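
The selection step can also be packaged as a small function that returns the best matches first. This is a sketch under the assumption that the filename list and the similarity scores are in the same order; `rank_top` and the sample values are hypothetical.

```python
import numpy as np

def rank_top(filenames, similarities, k):
    """Return (filename, score) pairs for the k highest scores, best first."""
    scores = np.asarray(similarities).ravel()
    order = np.argsort(scores)[::-1][:k]  # indices sorted by descending score
    return [(filenames[i], float(scores[i])) for i in order]

names = ["cv_a.txt", "cv_b.txt", "cv_c.txt", "cv_d.txt"]
sims = np.array([[0.10, 0.75, 0.40, 0.05]])
print(rank_top(names, sims, 2))  # [('cv_b.txt', 0.75), ('cv_c.txt', 0.4)]
```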
