# Resume screening
This notebook walks through screening a batch of resumes against a job specification using NLP tools. The first step uses named-entity recognition (NER) to extract key information from each resume. The second computes the cosine similarity of each resume against the job spec, with the aim of filtering out the less relevant resumes. The third groups the resumes using latent Dirichlet allocation (LDA), and the last uses TF-IDF to attempt to rank the resumes for the given role.

The resumes will then be uploaded to a vector database to simplify searching for appropriate candidates in the future.

A future development is a recommendation/feedback system. The recruiter would rate a candidate for a given role (say, out of 10), and a random forest model trained on those ratings would predict ratings for other resumes. The predicted rating would then adjust the distance between the resume vector and the job spec vector, so that well-rated resumes move closer and are recommended more often. This would need a lot of human feedback up front to train the model.

In [30]:
import pandas as pd
import spacy
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline
import torch
import re

import gensim
import gensim.corpora as corpora
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from gensim.models import CoherenceModel
import nltk

nltk.download('stopwords')
nltk.download('wordnet')

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

[nltk_data] Downloading package stopwords to /Users/riz/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/riz/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


### 1. Load resumes
The functions below load .pdf or .docx resumes. Any file that is in neither format is recorded in an error log file.

In [17]:
import PyPDF2
from docx import Document
import os
import uuid
import re

def extract_text_from_pdf(pdf_path):
    with open(pdf_path, 'rb') as file:
        # PdfReader, .pages and .extract_text() replace the PdfFileReader,
        # numPages/getPage and extractText names that PyPDF2 3.x removed
        reader = PyPDF2.PdfReader(file)
        text = ''
        for page in reader.pages:
            text += page.extract_text()
        return text

def extract_text_from_docx(docx_path):
    doc = Document(docx_path)
    return "\n".join([para.text for para in doc.paragraphs])

def extract_text(file_path):
    if file_path.endswith('.pdf'):
        return extract_text_from_pdf(file_path)
    elif file_path.endswith('.docx'):
        return extract_text_from_docx(file_path)
    else:
        with open('../output/incorrect_format.txt', 'a') as log_file:
            log_file.write(f'{file_path}\n')
        return None


I've taken a resume dataset from [Kaggle](https://www.kaggle.com/datasets/gauravduttakiit/resume-dataset) rather than loading a set of resumes individually. Each resume is also given a UUID when it is loaded, to make it easier to track later on.

In [78]:
'''
# load resumes
resume_text = pd.read_csv('../data/UpdatedResumeDataSet.csv')
# assign uuid
resume_text['resume_id'] = [str(uuid.uuid4()) for _ in range(len(resume_text))]
# rename dataset columns
resume_text.rename(columns={'Category': 'category', 'Resume': 'text'}, inplace=True)
resume_text = resume_text[['resume_id', 'category', 'text']]
'''

resume_directory = "../data/resume-dataset/data/data/INFORMATION-TECHNOLOGY"  # set resume directory 
resume_text_dict = {}
for filename in os.listdir(resume_directory):
    file_path = os.path.join(resume_directory, filename)
    resume_id = str(uuid.uuid4())  # Generate a unique ID
    resume_text_dict[resume_id] = {
        'filename': filename,
        'text': extract_text(file_path)
    }

data = []
for resume_id, info in resume_text_dict.items():
    row = {'id': resume_id, 'filename': info['filename'], 'text': info['text']}
    data.append(row)

# set as dataframe
resume_text = pd.DataFrame(data)


In [79]:
resume_text.head()

| | id | filename | text |
|---|---|---|---|
| 0 | 398b70dd-2d80-4863-ada5-92de39fedc39 | 18176523.pdf | SENIOR INFORMATION TECHNOLOGY MANAGER\nExecuti... |
| 1 | 2c614501-e010-44f4-8758-78e8e1db7104 | 25857360.pdf | STAFF ASSISTANT\nProfessional Summary\nHighly ... |
| 2 | 7d81c49f-759a-42db-9e66-ecd635cd27a1 | 39718499.pdf | ASSISTANT FOOTBALL COACH\nSummary\nEnthusiasti... |
| 3 | 29abdfe8-6538-49f5-82dd-f51507d85b53 | 40018190.pdf | IT SUPPORT TECHNICIAN\nEducation\nBachelor of ... |
| 4 | edb32c28-7229-4c45-87d7-727bfdb3c74f | 31243710.pdf | IT MANAGER\nSummary\nTen years of management e... |


We also define a function for preprocessing the resumes; regex is used later to attempt to extract email addresses and phone numbers.

In [None]:
def preprocess_text(text):
    stop_words = set(stopwords.words('english'))
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(word) for word in text.lower().split() if word not in stop_words]

#resume_text['email'] = resume_text['text'].apply(extract_email)
#resume_text['phone_number'] = resume_text['text'].apply(extract_phone_number)

### 2. Feature extraction/NER

We'll attempt to extract some key features from each resume and save them in the dataframe. Email and phone number should be reasonably straightforward using regex; for the person's name, we'll attempt to use BERT for NER to detect the 'person (PER)' entity.

In [87]:
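def extract_email(text):
    # No ^/$ anchors: the address can sit anywhere inside the resume text
    pattern = r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+'
    match = re.search(pattern, text)
    return match.group(0) if match else None  # returns the first found email or None

def extract_phone_number(text):
    # regex to match UK phone number but can be changed
    # Unanchored for the same reason; the lookarounds stop a match from
    # starting or ending in the middle of a longer run of digits.
    pattern = r'(?<!\d)(((\+44\s?\d{4}|\(?0\d{4}\)?)\s?\d{3}\s?\d{3})|((\+44\s?\d{3}|\(?0\d{3}\)?)\s?\d{3}\s?\d{4})|((\+44\s?\d{2}|\(?0\d{2}\)?)\s?\d{4}\s?\d{4}))(\s?\#(\d{4}|\d{3}))?(?!\d)'
    # re.search + group(0) returns the whole match; re.findall would
    # return tuples of the capture groups instead
    match = re.search(pattern, text)
    return match.group(0) if match else None  # returns the first found phone number or None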


In [88]:
resume_text['email'] = resume_text['text'].apply(extract_email)
resume_text['phone_number'] = resume_text['text'].apply(extract_phone_number)
resume_text.head()

| | id | filename | text | email | phone_number | clean_text | padding_words |
|---|---|---|---|---|---|---|---|
| 0 | 398b70dd-2d80-4863-ada5-92de39fedc39 | 18176523.pdf | SENIOR INFORMATION TECHNOLOGY MANAGER\nExecuti... | | | [senior, information, technology, manager, exe... | [python] |
| 1 | 2c614501-e010-44f4-8758-78e8e1db7104 | 25857360.pdf | STAFF ASSISTANT\nProfessional Summary\nHighly ... | | | [staff, assistant, professional, summary, high... | [] |
| 2 | 7d81c49f-759a-42db-9e66-ecd635cd27a1 | 39718499.pdf | ASSISTANT FOOTBALL COACH\nSummary\nEnthusiasti... | | | [assistant, football, coach, summary, enthusia... | [] |
| 3 | 29abdfe8-6538-49f5-82dd-f51507d85b53 | 40018190.pdf | IT SUPPORT TECHNICIAN\nEducation\nBachelor of ... | | | [support, technician, education, bachelor, sci... | [] |
| 4 | edb32c28-7229-4c45-87d7-727bfdb3c74f | 31243710.pdf | IT MANAGER\nSummary\nTen years of management e... | | | [manager, summary, ten, year, management, expe... | [] |


Attempt to extract details using both spaCy and BERT to perform named-entity recognition (NER). BERT is likely to be less useful, given that this model has only been trained to recognise locations (LOC), organisations (ORG), persons (PER) and miscellaneous entities (MISC).

The spaCy pipeline will include a list of skills taken from the [jobzilla skills dataset](https://github.com/kingabzpro/jobzilla_ai/blob/main/jz_skill_patterns.jsonl).

In [82]:
resume_text.to_csv('test.csv', index = False)

In [83]:
resume_text = pd.read_csv('test.csv')

In [3]:
# bert model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-large-cased-finetuned-conll03-english")
model = AutoModelForTokenClassification.from_pretrained("dbmdz/bert-large-cased-finetuned-conll03-english")

# spaCy model
nlp = spacy.load("en_core_web_sm")

# NER pipeline
ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer)

In [4]:
""" skill_pattern_path = "../data/jz_skill_patterns.jsonl"
ruler = nlp.add_pipe("entity_ruler")
ruler.from_disk(skill_pattern_path)
nlp.pipe_names

def get_skills(text):
    doc = nlp(text)
    myset = []
    subset = []
    for ent in doc.ents:
        if ent.label_ == "SKILL":
            subset.append(ent.text)
    myset.append(subset)
    return subset


def unique_skills(x):
    return list(set(x)) """

['tok2vec',
 'tagger',
 'parser',
 'attribute_ruler',
 'lemmatizer',
 'ner',
 'entity_ruler']

In [4]:
# Function to extract named entities using bert ner
def extract_entities_bert(resume_text):
    entities = ner_pipeline(resume_text)
    # Further processing to extract specific entities like name, education, etc.
    # This might require custom logic based on the structure of your resumes
    return entities

# Function to extract named entities using spaCy
def extract_entities_spacy(resume_text):
    doc = nlp(resume_text)
    entities = [(ent.text, ent.label_) for ent in doc.ents]
    return entities
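
The "further processing" mentioned in `extract_entities_bert` could start by joining the token-level predictions back into whole names. The sketch below is one possible approach, not the notebook's method: `merge_person_tokens` is a hypothetical helper, the sample list is made up, and it assumes each prediction carries the `entity`, `index` and `word` keys that the token-classification pipeline returns. Recent versions of transformers may also be able to do this grouping themselves through the pipeline's `aggregation_strategy` argument.

```python
def merge_person_tokens(entities):
    """Group consecutive PER tokens from a token-level NER result into names."""
    names, current, last_index = [], "", None
    for ent in entities:
        is_person = ent["entity"].endswith("PER")
        adjacent = last_index is not None and ent["index"] == last_index + 1
        if is_person and current and adjacent:
            word = ent["word"]
            # '##' marks a word-piece that continues the previous token
            current += word[2:] if word.startswith("##") else " " + word
        else:
            # Close the name in progress (if any) and maybe start a new one
            if current:
                names.append(current)
            current = ent["word"] if is_person else ""
        last_index = ent["index"] if is_person else None
    if current:
        names.append(current)
    return names


# Made-up predictions in the same shape as the pipeline output (illustration only)
sample = [
    {"entity": "I-PER", "score": 0.99, "index": 1, "word": "Jane"},
    {"entity": "I-PER", "score": 0.98, "index": 2, "word": "Do"},
    {"entity": "I-PER", "score": 0.97, "index": 3, "word": "##e"},
    {"entity": "I-ORG", "score": 0.90, "index": 7, "word": "Acme"},
    {"entity": "I-PER", "score": 0.95, "index": 12, "word": "John"},
]
print(merge_person_tokens(sample))  # ['Jane Doe', 'John']
```

On a real resume the first merged name is only a candidate for the applicant's name; referees and managers are tagged as PER too.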

In [5]:
resume_text['Entities_spacy'] = resume_text['Resume'].apply(extract_entities_spacy)
resume_text['Entities_bert'] = resume_text['Resume'].apply(extract_entities_bert)

# Display the new DataFrame with extracted entities
print(resume_text[['Category', 'Entities_spacy', 'Entities_bert']])


         Category                                     Entities_spacy  \
0    Data Science  [(Sql, PERSON), (Java, PERSON), (JavaScript/JQ...   
1    Data Science  [(May 2013 to, DATE), (May 2017, DATE), (Expri...   
2    Data Science  [(Control System Design, ORG), (Web Developmen...   
3    Data Science  [(Tableau, GPE), (SAP HANA SQL, ORG), (SAP HAN...   
4    Data Science  [(MCA, ORG), (YMCAUST, ORG), (Faridabad, GPE),...   
..            ...                                                ...   
957       Testing  [(MS, GPE), (Basic Excel, PRODUCT), (Loyalty &...   
958       Testing  [(Team Player, ORG), (DECLARATION, PERSON), (J...   
959       Testing  [(Eagerness, NORP), (Competitive, ORG), (Janua...   
960       Testing  [(SKILLS & SOFTWARE, ORG), (MS-Power Point, OR...   
961       Testing  [(Skill Set OS, PERSON), (Windows XP/7/8/8.1/1...   

                                         Entities_bert  
0    [{'entity': 'I-MISC', 'score': 0.57808983, 'in...  
1    [{'entity': 'I-O

In [24]:
selected_rows = resume_text.iloc[:5]

# Initialize an empty list to store the transformed data
transformed_data = []

# Iterate over each row
for index, row in selected_rows.iterrows():
    # Iterate over each entity in the row
    for entity in row['Entities_bert']:
        # Append the entity data to the transformed_data list
        transformed_data.append({
            'Row': index,
            'Entity': entity['entity'],
            'Score': entity['score'],
            'Index': entity['index']
            # Add more fields here as needed
        })

# Convert the transformed data into a DataFrame
transformed_df = pd.DataFrame(transformed_data)

In [25]:
print(transformed_df)

     Row  Entity     Score  Index
0      0  I-MISC  0.578090      6
1      0   I-ORG  0.379557     33
2      0  I-MISC  0.896526     37
3      0  I-MISC  0.978586     39
4      0  I-MISC  0.774978     40
..   ...     ...       ...    ...
290    4   I-ORG  0.427012     16
291    4  I-MISC  0.964054     60
292    4  I-MISC  0.971787     73
293    4   I-ORG  0.769654    106
294    4   I-ORG  0.793049    108

[295 rows x 4 columns]


### 3. Cosine similarity
Look at the cosine similarity between each resume and the job spec. First, though, try to flag any resume that uses word padding (repeating a keyword many times) to manipulate a word-based filtering system.

In [85]:
# preprocess text for both specs and resumes
def preprocess_text(text):
    stop_words = set(stopwords.words('english'))
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(word) for word in text.lower().split() if word not in stop_words]

resume_text['clean_text'] = resume_text['text'].apply(preprocess_text)


In [86]:
from collections import Counter

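def detect_word_padding(text, threshold=0.1):
    # `text` is the token list produced by preprocess_text
    word_counts = Counter(text)
    total_words = len(text)
    if total_words == 0:
        return []  # nothing to flag in an empty resume

    # flag any word that makes up more than `threshold` of all tokens
    padding_words = []
    for word, count in word_counts.items():
        if count / total_words > threshold:
            padding_words.append(word)

    return padding_words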

resume_text['padding_words'] = resume_text['clean_text'].apply(detect_word_padding)
padded_resumes = resume_text[resume_text['padding_words'].apply(lambda x: len(x) > 1)]

print(padded_resumes)

Empty DataFrame
Columns: [id, filename, text, email, phone_number, clean_text, padding_words]
Index: []
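
The cosine-similarity comparison itself is not shown in the cells above. As a minimal, dependency-free sketch of the idea, the snippet below scores token lists (in the form `preprocess_text` produces) against a job spec using raw term counts. The job spec, the two resumes, the helper name `cosine_similarity_counts` and the `min_score` cut-off are all assumptions for illustration; on the full data set the `TfidfVectorizer` and `cosine_similarity` imports at the top of the notebook would be the more natural tools.

```python
import math
from collections import Counter


def cosine_similarity_counts(tokens_a, tokens_b):
    """Cosine similarity between two token lists, using raw term counts."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    # Only terms present in both documents contribute to the dot product
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical job spec and resumes, already tokenised
job_spec = ["python", "sql", "data", "analysis"]
resumes = {
    "resume_a": ["python", "sql", "data", "engineer"],
    "resume_b": ["football", "coach", "team", "training"],
}

scores = {rid: cosine_similarity_counts(tokens, job_spec)
          for rid, tokens in resumes.items()}
min_score = 0.2  # assumed cut-off, to be tuned on real data
shortlist = [rid for rid, score in scores.items() if score >= min_score]
print(scores)     # resume_a scores 0.75, resume_b scores 0.0
print(shortlist)  # ['resume_a']
```

Raw counts keep the arithmetic easy to follow; weighting the counts with TF-IDF before taking the cosine would reduce the influence of words that appear in almost every resume.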


### 4. LDA Grouping

In [11]:
# Create Dictionary and Corpus for LDA
id2word = corpora.Dictionary(processed_resumes)
corpus = [id2word.doc2bow(text) for text in processed_resumes]

# Build LDA model
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                            id2word=id2word,
                                            num_topics=10, # Adjust the number of topics
                                            random_state=100,
                                            update_every=1,
                                            chunksize=100,
                                            passes=10,
                                            alpha='auto')


In [12]:
print(processed_resumes)

0      [skill, *, programming, languages:, python, (p...
1      [education, detail, may, 2013, may, 2017, b.e,...
2      [area, interest, deep, learning,, control, sys...
3      [skill, â¢, r, â¢, python, â¢, sap, hana, â...
4      [education, detail, mca, ymcaust,, faridabad,,...
                             ...                        
957    [computer, skills:, â¢, proficient, m, office...
958    [â, willingness, accept, challenges., â, p...
959    [personal, skill, â¢, quick, learner,, â¢, e...
960    [computer, skill, &, software, knowledge, ms-p...
961    [skill, set, o, window, xp/7/8/8.1/10, databas...
Name: Resume, Length: 962, dtype: object
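
Once the model is trained, one way to turn topics into groups is to assign each resume to its most probable topic. The sketch below assumes that approach; `dominant_topic` and `group_by_topic` are hypothetical helpers, and the sample distributions are made up. With the model above, each list of `(topic_id, probability)` pairs would come from `lda_model.get_document_topics(bow)` for a bag-of-words entry in `corpus`.

```python
from collections import defaultdict


def dominant_topic(doc_topics):
    """Return the topic id with the highest probability, or None if empty."""
    if not doc_topics:
        return None
    return max(doc_topics, key=lambda pair: pair[1])[0]


def group_by_topic(topics_per_doc):
    """Map each dominant topic id to the list of document ids assigned to it."""
    groups = defaultdict(list)
    for doc_id, doc_topics in topics_per_doc.items():
        groups[dominant_topic(doc_topics)].append(doc_id)
    return dict(groups)


# Made-up topic distributions keyed by resume row index (illustration only)
sample_topics = {
    0: [(2, 0.71), (5, 0.20)],
    1: [(5, 0.64), (2, 0.30)],
    2: [(2, 0.55), (7, 0.41)],
}
print(group_by_topic(sample_topics))  # {2: [0, 2], 5: [1]}
```

A resume whose top two topics are close in probability, like row 2 here, arguably belongs to both groups, so a hard assignment is a simplification.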


### 5. TF-IDF for keywords
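This step is not implemented yet. As a rough sketch of how TF-IDF could rank resumes for a role, the snippet below weights each term by its frequency in a resume and its rarity across all resumes, then sums the weights of the role's keywords. The helper names, the toy documents and the keyword list are assumptions, and the plain `log(N / df)` weighting used here differs from the smoothed variant in scikit-learn's `TfidfVectorizer`, so the numbers would not match that class.

```python
import math
from collections import Counter


def tfidf_weights(docs):
    """Return one {term: tf-idf} dict per tokenised document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


def rank_for_role(docs, role_terms):
    """Sort document indices by the summed tf-idf weight of the role's terms."""
    weights = tfidf_weights(docs)
    scores = [sum(w.get(term, 0.0) for term in role_terms) for w in weights]
    return sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)


# Toy tokenised resumes and role keywords (illustration only)
docs = [
    ["manager", "budget", "team", "team"],
    ["python", "sql", "python", "team"],
    ["python", "support", "helpdesk", "team"],
]
print(rank_for_role(docs, ["python", "sql"]))  # [1, 2, 0]
```

A term such as "team" that appears in every document gets a weight of zero, which is the behaviour that makes TF-IDF useful for picking out distinguishing keywords.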

### 6. Import to vector database