<a href="https://colab.research.google.com/github/RobelD420/Machine-Learning/blob/main/ATS_final.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install pymupdf

Collecting pymupdf
  Downloading pymupdf-1.26.3-cp39-abi3-manylinux_2_28_x86_64.whl.metadata (3.4 kB)
Downloading pymupdf-1.26.3-cp39-abi3-manylinux_2_28_x86_64.whl (24.1 MB)
Installing collected packages: pymupdf
Successfully installed pymupdf-1.26.3


In [16]:
#Helper that reports whether fitz (PyMuPDF) opened the file successfully
def file_opened(file):
    if not file.is_closed:
        print(f"{file} has been opened successfully")
    else:
        print(f"{file} has not been opened successfully")

In [11]:
import pymupdf as fitz #PDF reading
import spacy #Structural parsing
import regex #Regular Expression
import json #For jsonifying
import os #Directory Navigation

In [3]:
#The spacy model that extracts attributes
nlp = spacy.load("en_core_web_sm")

In [5]:
#Skills Lists
coding_skills = ["Python", "SQL", "Django", "React", "Machine Learning", "Java", "C++", "Node.js", "Next.js"]
business_skills = ["Commerce", "Business", "Trading", "Marketing", "Online Marketing", "Digital Marketing", "Real Estate", "Forex"]
language_skills = ["English", "Spanish", "French", "Arabic", "Russian", "German", "Portuguese", "Amharic"]

In [6]:
resume_folder = "data/"
output_folder = "contents/info/"

os.makedirs(output_folder, exist_ok=True)  # Create output folder if missing

In [18]:
#Parsing
resumes = [] #The list will hold all the parsed resumes

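#Collect the PDF resumes, skipping non-PDF files
pdf_files = [f for f in os.listdir(resume_folder) if f.lower().endswith(".pdf")]

#The cells below process one resume at a time; here, the last PDF listed
file = pdf_files[-1]
print(f"Processing {file}...")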

Processing skills-based-cv.pdf...


In [19]:
#Use fitz to open pdf
file = fitz.open(os.path.join(resume_folder, file))

file_opened(file)

Document('data/skills-based-cv.pdf') has been opened successfully


In [20]:
#Collect the text of every page so spaCy can handle attribute extraction
text = ""
for page in file:
    text += page.get_text()

In [21]:
#Specify doc object for spacy model
doc = nlp(text)

In [22]:
# === Extract Candidate Name ===
candidate_name = None
for ent in doc.ents:
    #Use the entities in the doc and keep the first one labelled "PERSON"
    if ent.label_ == "PERSON" and "customer" not in ent.text.lower():
        candidate_name = ent.text.title() #Convert to title case
        break

In [23]:
# === Extract Candidate Email ===
email_pattern = r"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+"
email_match = regex.search(email_pattern, text)
email = email_match.group() if email_match else None

In [24]:
# === Extract Candidate Phone ===
phone_pattern = r"(?:Mobile[:\s]*)?(\+?\d{2,4}[\s-]?\d{3,5}[\s-]?\d{3,5})"
phone_match = regex.findall(phone_pattern, text)
phone = phone_match[0] if phone_match else None

In [25]:
# === Extract Candidate Skills ===
#Combine the 3 predefined skillsets once, for fast membership checks
all_skills = set(coding_skills + business_skills + language_skills)
skills_found = set()
for token in doc:
    word = token.text.strip()
    #Check whether the word is within the predefined skillsets
    if word in all_skills:
        skills_found.add(word)
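
The token loop above compares one spaCy token at a time, so multi-word entries such as "Machine Learning" or "Real Estate" can never match, and symbol-heavy names such as "C++" depend on how the tokenizer splits them. A minimal sketch of one alternative, assuming a plain phrase search over the raw text is acceptable. `find_skills` and the sample sentence are hypothetical and not part of the original notebook.

```python
import re

# Hypothetical helper (not in the original notebook): search the raw text
# for whole skill phrases instead of comparing single spaCy tokens.
def find_skills(text, skills):
    found = set()
    for skill in skills:
        # Lookarounds replace \b, which does not work next to symbols like "+"
        pattern = r"(?<!\w)" + re.escape(skill) + r"(?!\w)"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.add(skill)
    return found

# Made-up sample text for illustration only
sample = "Built machine learning pipelines in Python and C++; fluent in English."
matched = find_skills(sample, ["Python", "C++", "Machine Learning", "Java", "English"])
print(sorted(matched))
```

The search is case-insensitive but returns each skill with the spelling used in the predefined list, so the saved JSON stays consistent across resumes.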

In [26]:
# === Total Years of Experience ===
#exp_years stays None when the resume does not state a total
years_pattern = r"Total Years of Experience[:\s]*([0-9]{1,2})"
years_match = regex.search(years_pattern, text)
exp_years = int(years_match.group(1)) if years_match else None

In [30]:
#Print Candidate Info
print(f"✅ Found:\n Name = {candidate_name}, \n Email={email}, \n Years = {exp_years}")

✅ Found:
 Name = Ashley Gill, 
 Email=ashleygill2023@gotmail.com, 
 Years = None


In [53]:
#Defining the JSON format
resume_data = {
    "name": candidate_name,
    "email": email,
    "phone": phone,
    "years": exp_years,
    "skills": list(skills_found)
}


In [33]:
#Saving the JSON file
with open(os.path.join(output_folder, f"{candidate_name}.json"), "w") as f:
    json.dump(resume_data, f, indent=4)

resumes.append(resume_data)
#Close the PDF file, since only the JSON is needed from here on.
file.close()
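
The save step above builds the filename directly from `candidate_name`, so a resume with no detected name would be written as `None.json`, and a name containing a slash would break the path. A small sketch of a safer filename builder, assuming underscores are an acceptable replacement for other characters. `json_filename` and its `fallback` value are hypothetical names, not part of the original notebook.

```python
import re

# Hypothetical helper (not in the original notebook): build a
# filesystem-safe JSON filename from the extracted candidate name.
def json_filename(candidate_name, fallback="unknown_candidate"):
    name = candidate_name or fallback
    # Replace every run of characters other than ASCII letters and digits
    safe = re.sub(r"[^A-Za-z0-9]+", "_", name).strip("_")
    return f"{safe or fallback}.json"

print(json_filename("Ashley Gill"))
print(json_filename(None))
```

Note that this keeps only ASCII letters and digits, so names written in other scripts would fall back to the default. Widening the character class is one way to handle them.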

In [54]:
#TENTATIVE CODE FOR TESTING
folder = "contents/info"
resumes = []

for f in os.listdir(folder):
    if f.endswith(".json"):
        with open(os.path.join(folder, f)) as jf:
            resumes.append(json.load(jf))

In [55]:
resumes

[{'name': 'Robson Taye',
  'email': 'RobaTaye28@gotmail.com',
  'phone': '01002 92134',
  'years': 3,
  'skills': ['Python', 'SQL', 'French', 'English']},
 {'name': 'Ashley Gill',
  'email': 'ashleygill2023@gotmail.com',
  'phone': '01882 65234',
  'years': None,
  'skills': ['Spanish', 'French', 'Business', 'Marketing', 'English']}]

In [56]:
#Printing JSON status
print("\n✅ ✅ ✅\nDone parsing all resumes!\n")


✅ ✅ ✅
Done parsing all resumes!



In [50]:
#COSINE SIMILARITY
from sentence_transformers import SentenceTransformer, util

#This will be the Applicant-Ranking model
model = SentenceTransformer('all-MiniLM-L6-v2')

#Assume the following is the job the applicants are seeking
Job_requirement = "We want someone strong in Python. English and French are preferred. More total years of experience is better."


In [57]:
#Prepare the text for embedding: Skill + exp_years
texts = []

#Iterate over every JSON resume
for resume in resumes:
    skills = " ".join(resume["skills"])
    #Compare against None so that 0 years is not reported as unknown
    years = f"{resume['years']} years experience" if resume["years"] is not None else "unknown experience"
    combined = f"{skills} {years}"
    texts.append(combined)

In [58]:
print("🔍 Texts for similarity:\n", texts)

🔍 Texts for similarity:
 ['Python SQL French English 3 years experience', 'Spanish French Business Marketing English unknown experience']


Cosine similarity measures how similar two pieces of text are. Each text is first embedded as a vector, and the score is the cosine of the angle between the two vectors: a value near 1 means the vectors point in almost the same direction (similar meaning), while a value near 0 means the texts are largely unrelated.

In this case, we compare the embedded text describing each applicant's skills and experience with the embedded job requirement, and use the score to judge how well the applicant fits the specified job.
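
To make the definition concrete, the sketch below computes the cosine similarity of two vectors by hand: the dot product divided by the product of the vector lengths. The three-dimensional vectors are made-up stand-ins for embeddings, which have many more dimensions in practice, and `cosine_similarity` is an illustrative helper rather than the library function used in this notebook.

```python
import math

# Illustrative helper: cosine similarity = dot(a, b) / (|a| * |b|)
def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up toy vectors standing in for text embeddings
job = [1.0, 2.0, 0.0]
close_resume = [2.0, 4.0, 0.0]  # same direction as job, different length
far_resume = [0.0, 0.0, 3.0]    # perpendicular to job

print(cosine_similarity(job, close_resume))  # close to 1
print(cosine_similarity(job, far_resume))    # 0.0
```

Because the score depends only on direction, scaling a vector does not change it, which is why `close_resume` scores as a near-perfect match despite being twice as long.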

In [59]:
#Encoding for Similarity
job_embeddings = model.encode(Job_requirement, convert_to_tensor=True)
resume_embeddings = model.encode(texts, convert_to_tensor=True)

#Cosine Score Rating
cos_scores = util.cos_sim(job_embeddings, resume_embeddings)[0]

In [60]:
#FINAL RANKINGS (cosine similarity ranges from -1 to 1; higher is a closer match)
print("\n=== FINAL RANKINGS ===")
for i, res in enumerate(resumes):
    print(f"{res['name']}: {cos_scores[i].item():.4f}")


=== FINAL RANKINGS ===
Robson Taye: 0.6888
Ashley Gill: 0.3927
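
The loop above prints candidates in the order the JSON files were loaded, which only coincidentally matches the score order here. A minimal sketch of an explicit ranking, sorting (name, score) pairs from highest to lowest. The pairs reuse the two scores printed above, and the `scored` and `ranked` names are hypothetical.

```python
# (name, score) pairs taken from the output above, deliberately out of order
scored = [("Ashley Gill", 0.3927), ("Robson Taye", 0.6888)]

# Sort by score, highest first, so position 1 is the best match
ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)

for rank, (name, score) in enumerate(ranked, start=1):
    print(f"{rank}. {name}: {score:.4f}")
```

With more than two applicants, the same sort gives a full ordering rather than only the single best candidate.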


In [62]:
#So, who is better for the job?
# Find best score and candidate
best_idx = cos_scores.argmax().item()   # index of highest score
best_candidate = resumes[best_idx]['name']
best_score = cos_scores[best_idx].item()

print("\n===============================")
print(f"🏆 {best_candidate} has the highest match with a score of {best_score:.4f}")
print("===============================")


===============================
🏆 Robson Taye has the highest match with a score of 0.6888
===============================
