# Applicant Tracking System: Resume and Job Description Matches

This is an applicant tracking system, often referred to as an ATS. ATSs are commonly used by companies and businesses of all sizes to sort through applicant resumes and return a subset as the best possible candidates for the job.

1. What factors are the strongest indicator of a match score?\
**ANS:** The factor that most strongly indicates a match score is the degree a candidate holds. The second most defining factor is the skills a candidate has, and the least defining is the responsibilities a candidate had in a previous position.

2. Does direct word matching between a resume and description indicate a better match, or do synonyms perform just as well?\
**ANS:** Direct word matching between a resume and job description does indicate a better match. Synonyms don't seem to have any particular ordering, in terms of the best or worst synonym to use.

3. How do different matching methods impact scoring, and what is the best method to get the most accurate score?\
**ANS:** The best method was to use spaCy's similarity function to directly compare the text listed under each category, rather than using keyword extraction to compute a percentage.



## Challenge Goal
- **New library:** Natural Language Processing: Resumes and job descriptions, by nature of their use, must be human focused and readable. But with the precedent set by companies like LinkedIn, Indeed, and Glassdoor, matches between job descriptions and resumes are needed on a much faster and larger scale, and natural language processing is especially helpful for this. Parsing out the meaning and synonyms of words is especially important for ensuring an accurate match.


## Collaboration and Conduct

Students are expected to follow Washington state law on the [Student Conduct Code for the University of Washington](https://www.washington.edu/admin/rules/policies/WAC/478-121TOC.html). In this course, students must:

- Indicate on your submission any assistance received, including materials distributed in this course.
- Not receive, generate, or otherwise acquire any substantial portion or walkthrough to an assessment.
- Not aid, assist, attempt, or tolerate prohibited academic conduct in others.

Update the following code cell to include your name and list your sources. If you used any kind of computer technology to help prepare your assessment submission, include the queries and/or prompts. Submitted work that is not consistent with sources may be subject to the student conduct process.

In [1]:
your_name = "Ilse Schmitz"
sources = [
    'https://blog.martianlogic.com/5-facts-you-must-know-about-the-applicant-tracking-system-ats' +
    ' - This was used to uncover how ATS systems work',
    'https://kristenfife.medium.com/understanding-how-the-ats-reads-and-interacts-with-your-resume-401bd00b66db' +
    ' - This was used to uncover how ATS systems work',

    'https://course.spacy.io/en - The spaCy interactive learning course, I used it to understand how spaCy works, ' +
    'and what the spaCy library can do for me.',
    'https://spacy.io/usage/processing-pipelines' +
    ' - Found by google searching "python spacy pipeline API documentation", I used this resource to better ' +
    'understand how to scale my code to work faster for large datasets',

    'https://deepnote.com/app/abid/spaCy-Resume-Analysis-81ba1e4b-7fa8-45fe-ac7a-0b7bf3da7826' +
    ' - an incredible example of a working ATS system. No code has been directly copied from the example, but it is ' +
    'a fantastic tool to see how ATS systems work, and some of the ideas about how to identify keywords and ' +
    'compute percentages have been borrowed from this example.',
    
    'search.ipynb - Sourced the method "clean()" that was provided alongside this assignment', 
    'dataframes.ipynb'
]

assert your_name != "", "your_name cannot be empty"
assert ... not in sources, "sources should not include the placeholder ellipsis"
assert len(sources) >= 6, "must include at least 6 sources, inclusive of lectures and sections"

## Data Setting and Methods


### Setting Up SpaCy

In [2]:
!pip install spacy

Collecting spacy
  Using cached spacy-3.8.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (27 kB)
Collecting spacy-legacy<3.1.0,>=3.0.11 (from spacy)
  Using cached spacy_legacy-3.0.12-py2.py3-none-any.whl.metadata (2.8 kB)
Collecting spacy-loggers<2.0.0,>=1.0.0 (from spacy)
  Using cached spacy_loggers-1.0.5-py3-none-any.whl.metadata (23 kB)
Collecting murmurhash<1.1.0,>=0.28.0 (from spacy)
  Using cached murmurhash-1.0.12-cp311-cp311-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (2.1 kB)
Collecting cymem<2.1.0,>=2.0.2 (from spacy)
  Using cached cymem-2.0.11-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (8.5 kB)
Collecting preshed<3.1.0,>=3.0.2 (from spacy)
  Using cached preshed-3.0.9-cp311-cp311-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (2.2 kB)
Collecting thinc<8.4.0,>=8.3.4 (from spacy)
  Using cached thinc-8.3.4-cp311-cp311-manylinux_2

In [3]:
!python -m spacy download en_core_web_md

Collecting en-core-web-md==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.8.0/en_core_web_md-3.8.0-py3-none-any.whl (33.5 MB)
     33.5/33.5 MB 23.0 MB/s eta 0:00:00
Installing collected packages: en-core-web-md
Successfully installed en-core-web-md-3.8.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_md')


In [4]:
import json
import re

import numpy as np
import pandas as pd
import spacy
from spacy.lang.en import English
from spacy.language import Language
# from spacy.pipeline import EntityRuler
from spacy.tokens import Doc, Span

In [5]:
nlp = spacy.load('en_core_web_md')

### Getting CSV files

This is a dataset containing resume and job data, as well as a match score indicating how well suited a resume is for each job. The 'job_position_name' column has been renamed because its original name begins with an invisible character, which shows up as a red dot in the code editor but not in the displayed markdown text. This is most likely a byte-order mark (U+FEFF) left over from how the file was assembled, though I have not confirmed this. The column has been renamed to avoid potential confusion or corruption.
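
As a more general alternative to renaming one column by hand, a minimal sketch is shown here, assuming the stray character really is U+FEFF. The helper name `strip_bom` and the sample column list are hypothetical and are not part of the notebook.

```python
# Hypothetical helper: remove byte-order-mark characters from column names.
def strip_bom(names):
    """Return the names with every U+FEFF character removed."""
    return [name.replace("\ufeff", "") for name in names]


# Made-up column names; the second one carries a leading U+FEFF.
columns = ["address", "\ufeffjob_position_name", "matched_score"]
cleaned = strip_bom(columns)
print(cleaned)

# With a pandas DataFrame, the same idea could be applied as:
#   df.columns = strip_bom(df.columns)
```

This leaves every other column name untouched, so it can be run safely even when no such character is present.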

In [6]:
# Resume Job Match Dataset
resume_job_match = pd.read_csv('resume_data.csv') 
resume_job_match.rename(columns={'﻿job_position_name': 'job_position_name'}, inplace=True)
resume_job_match.fillna('', inplace=True)
# Flatten embedded newlines so each field is a single line of text
cols = ['skills', 'skills_required', 'related_skils_in_job', 'responsibilities', 'responsibilities.1']
for col in cols:
    resume_job_match[col] = resume_job_match[col].str.replace('\n', ' ')
resume_job_match

Unnamed: 0,address,career_objective,skills,educational_institution_name,degree_names,passing_years,educational_results,result_types,major_field_of_studies,professional_company_names,...,online_links,issue_dates,expiry_dates,job_position_name,educationaL_requirements,experiencere_requirement,age_requirement,responsibilities.1,skills_required,matched_score
0,,Big data analytics working and database wareho...,"['Big Data', 'Hadoop', 'Hive', 'Python', 'Mapr...",['The Amity School of Engineering & Technology...,['B.Tech'],['2019'],['N/A'],[None],['Electronics'],['Coca-COla'],...,,,,Senior Software Engineer,B.Sc in Computer Science & Engineering from a ...,At least 1 year,,Technical Support Troubleshooting Collaboratio...,,0.850000
1,,Fresher looking to join as a data analyst and ...,"['Data Analysis', 'Data Analytics', 'Business ...","['Delhi University - Hansraj College', 'Delhi ...","['B.Sc (Maths)', 'M.Sc (Science) (Statistics)']","['2015', '2018']","['N/A', 'N/A']","['N/A', 'N/A']","['Mathematics', 'Statistics']",['BIB Consultancy'],...,,,,Machine Learning (ML) Engineer,M.Sc in Computer Science & Engineering or in a...,At least 5 year(s),,Machine Learning Leadership Cross-Functional C...,,0.750000
2,,,"['Software Development', 'Machine Learning', '...","['Birla Institute of Technology (BIT), Ranchi']",['B.Tech'],['2018'],['N/A'],['N/A'],['Electronics/Telecommunication'],['Axis Bank Limited'],...,,,,"Executive/ Senior Executive- Trade Marketing, ...",Master of Business Administration (MBA),At least 3 years,,"Trade Marketing Executive Brand Visibility, Sa...",Brand Promotion Campaign Management Field Supe...,0.416667
3,,To obtain a position in a fast-paced business ...,"['accounts payables', 'accounts receivables', ...","['Martinez Adult Education, Business Training ...",['Computer Applications Specialist Certificate...,['2008'],[None],[None],['Computer Applications'],"['Company Name ï¼ City , State', 'Company Name...",...,,,,Business Development Executive,Bachelor/Honors,1 to 3 years,Age 22 to 30 years,Apparel Sourcing Quality Garment Sourcing Reli...,Fast typing skill IELTSInternet browsing & onl...,0.760000
4,,Professional accountant with an outstanding wo...,"['Analytical reasoning', 'Compliance testing k...",['Kent State University'],['Bachelor of Business Administration'],[None],['3.84'],[None],['Accounting'],"['Company Name', 'Company Name', 'Company Name...",...,[None],[None],"['February 15, 2021']",Senior iOS Engineer,Bachelor of Science (BSc) in Computer Science,At least 4 years,,iOS Lifecycle Requirement Analysis Native Fram...,iOS iOS App Developer iOS Application Developm...,0.650000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9539,,,"['Mathematical modelling', 'Machine Learning',...",['Sanghvi College of Engineering'],['B.Tech'],['2019'],['N/A'],['N/A'],['N/A'],['BPM Foundation'],...,,,,Data Engineer,Bachelor of Science (BSc),5 to 8 years,,Data Platform Design Data Pipeline Development...,Azure Big Data Data Analytics ETL Tools Power ...,0.683333
9540,,Expertise EDA modeler. I like to learn what my...,"['Data Analysis', 'Business Analysis', 'Machin...","['KVoCT, Pune', 'KVoCT, Pune']","['B.CA', 'M.CA']","['2018', '2020']","[None, None]","[None, None]","[None, None]",['Passionate Solution'],...,,,,Executive/ Sr. Executive -IT,Bachelor of Science (BSc) in Computer Science ...,3 to 5 years,Age at most 40 years,Hardware & Software Installation System Monito...,,0.650000
9541,,Looking for roles related to application devel...,"['Business Analyst', 'Data Analytics', 'Data C...",['PGG College Mysore'],['B.BA'],['2019'],['N/A'],['N/A'],['N/A'],['ZigSAW'],...,,,,Executive - VAT,BBA in Accounting and Finance,1 to 3 years,,Mushak Forms Maintenance VAT Software & MS Off...,VAT and Tax,0.650000
9542,,,"['Machine Learning', 'Natural Language Process...","['Rajiv Gandhi Memorial University, Delhi']",['B.TECH'],['2020'],['N/A'],['N/A'],['Electrical'],['Zynta Labs'],...,[None],[None],[None],Asst. Manager/ Manger (Administrative),Bachelor/Honors,At least 5 years,Age at least 28 years,Administrative Support Scheduling Filing & Doc...,•Administration •Health Safety and Environment...,0.650000


### Build Matcher

## Results

In [7]:
def create_docs(dataframe, disabled, col_A, col_B=None):
    """
    Takes in a dataframe, a list of strings containing the pipes to disable when creating nlp objects, and one or two
    strings containing the names of columns in the dataframe.
    If two column names are given, concatenate the elements of col_A and col_B element-wise (with no separator
    between them) into a list of the same length as the dataframe.
    Otherwise, transform the column col_A from the dataframe into a list.
    Pass the list through nlp.pipe and return the resulting generator of Doc objects.
    """
    if col_B is None:
        col_data = dataframe[col_A].tolist()
    else:
        col_data = (dataframe[col_A] + dataframe[col_B]).tolist()
    docs = nlp.pipe(col_data, disable=disabled)
    return docs
    

In [8]:
def check_similarity(data, job_col_name, res_col_name, res2_col_name=None):
    """
    Takes in a dataframe containing resume and job description data, and two or three strings containing the names of
    columns found within the dataframe.
    If only two column names are given, the third column name defaults to None.
    ...
    Find the similarity between the resume data and the job description data.
    Return a list with the similarity score between the resume data and the job data for each row, using None for
    rows where either text is empty.
    """
    skill_match = []
    disabled = ['tokenizer', 'tagger', 'parser', 'lemmatizer', 'textcat', 'custom']

    job_docs = create_docs(data, disabled, job_col_name)
    res_docs = create_docs(data, disabled, res_col_name, res2_col_name)

    for res_doc, job_doc in zip(res_docs, job_docs):
        if (job_doc.text == '') or (res_doc.text == ''):
            score = None
        else:
            score = res_doc.similarity(job_doc)
        skill_match.append(score)

    return skill_match

In [9]:
def mean_sq_err(compute, true):
    """
    Takes in two lists containing computed values and true values respectively, and computes the mean squared error.
    If there is a missing value in either list, that pair is left out of the sum.
    Note that the sum is divided by the total number of pairs, including any skipped ones.
    """
    MSE = 0
    for i in range(min(len(compute), len(true))):
        if (compute[i] is not None) and (true[i] is not None):
            MSE += np.square(np.subtract(compute[i], true[i]))
    
    return (MSE / (i+1))
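
The function above divides by the total number of pairs, so skipped pairs still count toward the denominator. A minimal sketch of a variant that averages only over the pairs actually compared is given here for comparison. The name `mean_sq_err_valid` is hypothetical, and this variant was not used to produce any of the numbers reported in this notebook.

```python
# Hypothetical variant: average only over pairs where both values are present.
def mean_sq_err_valid(computed, true):
    """Mean squared error over complete pairs; None if no pair can be compared."""
    squared_errors = [
        (c - t) ** 2
        for c, t in zip(computed, true)
        if c is not None and t is not None
    ]
    if not squared_errors:
        return None
    return sum(squared_errors) / len(squared_errors)


# Made-up values: the middle pair is skipped because the computed score is missing.
computed = [0.5, None, 1.0]
true = [1.0, 0.5, 1.0]
result = mean_sq_err_valid(computed, true)
print(result)  # 0.125, i.e. (0.25 + 0.0) / 2
```

On the same made-up input, dividing by all three pairs would give roughly 0.083 instead, so the two conventions diverge more as the share of missing values grows.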

### Research Question 1

In [10]:
columns = ['skills_required', 'skills', 'related_skils_in_job', 
          'educationaL_requirements', 'degree_names', 'major_field_of_studies',
         'responsibilities.1', 'responsibilities', None,
         'job_position_name', 'positions', None]
title = ['skills', 'degree', 'responsibilities', 'job position']

match_score = resume_job_match['matched_score'].tolist()

similarity = dict()
for i in range(int(len(columns) / 3)):
    sim = check_similarity(resume_job_match, columns[3*i], columns[3*i + 1], columns[3*i + 2])
    MSE = mean_sq_err(sim, match_score)
    similarity[title[i]] = MSE
    

To explain how these numbers are calculated, it's important to explain where this information comes from. Using skills as an example, I began with the data from *'resume_data.csv'*. Three columns are involved: the skills the applicant listed, the skills relevant to the job position the applicant worked previously, and the skills required by the job description. I combined the two applicant columns into a single string, and checked the similarity between that string and the skills listed in the job description. The result is a score between 0 and 1, with 1 being a perfect match and 0 being completely unrelated. This is the computed match.\
For some categories, like the previous job positions held by the applicant, only one string is provided. As such, the computed value can take one or two strings of applicant information, depending on whether applicant information is stored in one or two places.
From there, I computed the mean squared error.

The mean squared error is computed between the match score provided by the dataset, and the computed values for each category of information. In this case, the mean squared error is used to indicate how heavily an ATS may weigh each section when calculating the match score. The larger the mean squared error, the less weight that section holds, and thus the less important it is to the ATS. Conversely, the smaller the mean squared error, the more important that section is. 

This is not a perfect estimate, by any means. In practice, an ATS would take the score computed for each category and combine them into the match score using a different weight for each category, for example if the degree was less important than a candidate's previous job roles. However, we do not have the weights used to compute the provided match score. Further, the match score may weight one category heavily yet still sit numerically closer to several lightly weighted categories. In that case, the mean squared error would be high for the most heavily weighted category and low for the lightly weighted ones. As such, discerning which category carries the most weight from this information is tenuous at best, and outright wrong at worst.

However, this information can tell us the best indicator for a match score. If the mean squared error between the skills scores is small, this can indicate that the similarity between an applicant's skills and the required skills is the best indicator for whether the match score will be high or low. This doesn't tell us the weights, but it does tell us the most likely outcome for a match. 


In [11]:
for i in range(len(similarity)):
    print(f"Mean Squared error of {title[i]}: ", similarity[title[i]])

Mean Squared error of skills:  0.07175357068233537
Mean Squared error of degree:  0.052926185909737425
Mean Squared error of responsibilities:  0.14293504587083708
Mean Squared error of job position:  0.10896527127001596


The category with the smallest mean squared error is degree, at approximately $0.0529$. This is possibly due to brevity: a list of an applicant's degrees is short, and with fewer words to compare it would naturally match more closely. Even so, this is surprising to see. Generally, it would be expected that prior positions and work experience would overrule this, but it's certainly not implausible that an applicant who had the degree required for the position would be prioritized over an applicant lacking the necessary credentials.
This means that an applicant's degree and level of education are the best indicator for whether or not that person is a good match for the job.

Shortly after the degree category comes the skills category with approximately $0.0718$, and after that, the job position category with $0.109$.

In contrast, the mean squared error of the responsibilities category, $0.143$, indicates that responsibilities are the least important indicator of a good resume-job-description match. Similar to the degree category, this is surprising. It would stand to reason that a candidate with previous experience carrying out the same or similar tasks would be a better fit for a job, but this could very well be an incorrect assumption. Further, it's important to remember that the computer is simply comparing words, and as such, lacks the ability to pick up on transferable skills, and cannot ask the candidate to expand upon topics the way a job interview would.

### Research Question 2

In [12]:
# This retains the mean squared error of the original skills score
MSE_skills = similarity[title[0]]

In [13]:
synonym = ['machine learning', 'AI', 'neural network']
synonym_df = []

# Build one copy of the skills columns per replacement term
for term in synonym:
    df = resume_job_match[['skills', 'skills_required', 'related_skils_in_job']].copy()
    # Plain substring replacement, applied only to the two resume-side columns;
    # note that 'ML' is also replaced wherever it appears inside a longer word
    for col in ['skills', 'related_skils_in_job']:
        df[col] = df[col].str.replace('Machine Learning', term)
        df[col] = df[col].str.replace('ML', term)
    synonym_df.append(df)


In [14]:
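synonym_skills_compare = dict()
for i in range(len(synonym_df)):
    # Note: check_similarity expects the job column first. Here 'skills' is passed in that slot,
    # so the column order differs from the call used for Research Question 1.
    synonym_scores = check_similarity(synonym_df[i], 'skills', 'skills_required', 'related_skils_in_job')
    MSE = mean_sq_err(synonym_scores, match_score)
    synonym_skills_compare[synonym[i]] = MSE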

In [15]:
print("Mean Squared Error of \'Machine Learning\' or  \'ML\': ", MSE_skills)

for i in range(len(synonym_skills_compare)):
    print(f"Mean Squared Error of \'{synonym[i]}\': ", synonym_skills_compare[synonym[i]])

Mean Squared Error of 'Machine Learning' or  'ML':  0.07175357068233537
Mean Squared Error of 'machine learning':  0.09826729516000227
Mean Squared Error of 'AI':  0.09805075568555112
Mean Squared Error of 'neural network':  0.09826118651859835


This is the mean squared error for four variations of the term "Machine Learning". Every instance of "Machine Learning" or "ML" in the resume skills was found and replaced with a different term, which helps to show whether synonyms have the same impact on the match as exact wording. This tells us whether direct word matching works better, or whether synonyms work just as well.

In this case, starting with the original dataset, we have a mean squared error of approximately $0.0718$. For each copy of the skills data, each instance of "Machine Learning" and "ML" has been replaced with "machine learning" in all lowercase, "AI", or "neural network". These are not the same thing, but they are closely related terms, and thus comparing how the mean squared error changes can inform whether direct word matching is better than synonym matching for resume and job description checkers.
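
The replacement itself is a plain substring match, so "ML" would also be replaced inside a longer word such as "HTML" if one appeared in the skills text. A minimal sketch of a whole-word alternative follows. The function name `replace_term` and the sample string are hypothetical, and this was not used to produce the results in this notebook.

```python
import re


# Hypothetical helper: replace only whole-word occurrences of the given terms.
def replace_term(text, replacement, terms=("Machine Learning", "ML")):
    """Swap each whole-word term for the replacement, leaving longer words intact."""
    pattern = re.compile(r"\b(?:" + "|".join(re.escape(t) for t in terms) + r")\b")
    # A function is used as the replacement so backslashes in it are taken literally.
    return pattern.sub(lambda match: replacement, text)


# Made-up skills string in the same list-like format as the dataset.
skills = "['Machine Learning', 'HTML', 'ML', 'XML']"
replaced = replace_term(skills, "AI")
print(replaced)  # ['AI', 'HTML', 'AI', 'XML']
```

By contrast, `"HTML".replace("ML", "AI")` produces `"HTAI"`, which is the kind of side effect the word boundaries are meant to avoid.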

In this case, all three replacement terms had an increased mean squared error close to $0.098$, an increase of just over $0.026$. This provides evidence that direct word matching would be the best way to go, rather than relying on synonyms or indirect word matches, as only the resumes were modified during this search.
There is always the possibility that changing to a synonym improves the match for some resumes, but given the regularity of the increase in error, even when the replacement is simply "machine learning" in lowercase, any incidental improvement appears to be heavily outweighed by the effects of indirect matching. One caveat is that the synonym runs pass the columns to `check_similarity` in a different order than the baseline run did, so part of the near-identical increase may come from that difference rather than from the replaced terms.

Thus, it can be concluded that direct word matching when writing a resume is likely to be more effective at producing a higher match score than using synonyms.

### Research Question 3

In [16]:
ruler = nlp.add_pipe("entity_ruler", after="ner")

patterns = []
with open('jz_skill_patterns.jsonl', 'r') as f:
    for line in f:
        data = json.loads(line)
        patterns.append(data)

ruler.add_patterns(patterns)

In [17]:
def clean(token: str, pattern: re.Pattern[str] = re.compile(r"\W+")) -> str:
    """
    Returns all the characters in the token lowercased and without matches to the
    given pattern.

    >>> clean("Hello!")
    'hello'
    """
    return pattern.sub("", token.lower())

In [18]:
def keyword_compare(data, job_col_name, res_col_name, res2_col_name=None):
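    """
    Takes in a dataframe and two or three column names, in the same order as check_similarity.
    Extracts the named entities from the job text and the resume text, cleans them, and stores them in two sets.
    For each row, appends the fraction of job entities collected so far that also appear among the resume
    entities collected so far, or None when no job entities have been found.
    Note that both sets are created once, before the loop, so they accumulate across rows.
    Returns the list of fractions.
    """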
    job_docs = create_docs(data, [], job_col_name)
    res_docs = create_docs(data, [], res_col_name, res2_col_name)

    job_skills = set()
    res_skills = set()
    match_per = []
    
    for res_doc, job_doc in zip(res_docs, job_docs):
        total_matches = 0
    
        # If there are any skills listed in the job description, add them to a set 
        for ent in job_doc.ents:
            job_skills.add(clean(ent.text))
    
        # If there are required skills, check the skills in the resume
        # Otherwise, the number of matching skills is zero
        if len(job_skills) > 0:
            for ent in res_doc.ents:
                res_skills.add(clean(ent.text))
    
        # If a skill in the job description is found in the resume, add one to total_matches 
        for skill in job_skills:
            if skill in res_skills:
                total_matches += 1
                
        if (len(job_skills) == 0):
            # No skills required for job
            match_per.append(None)
        else:
            match_per.append(total_matches / len(job_skills))

    return match_per
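
For reference, a minimal sketch of the same percentage computed independently for each row is shown here, with the keyword sets rebuilt per row rather than shared across rows. It works on plain lists of already-extracted keywords. The function name `keyword_match_fraction` and the sample rows are hypothetical, and this variant was not used to produce the results reported below.

```python
# Hypothetical per-row version: no state is shared between rows.
def keyword_match_fraction(job_keywords, resume_keywords):
    """Fraction of job keywords found in the resume; None if the job lists none."""
    job = set(job_keywords)
    if not job:
        return None
    resume = set(resume_keywords)
    return len(job & resume) / len(job)


# Made-up rows of (job keywords, resume keywords).
rows = [
    (["python", "sql"], ["python", "excel"]),
    (["java"], ["python", "sql"]),
    ([], ["python"]),
]
fractions = [keyword_match_fraction(job, res) for job, res in rows]
print(fractions)  # [0.5, 0.0, None]
```

Here the second row scores 0.0 even though the first row's resume mentioned other skills, because each row is judged only on its own keywords.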

In [19]:
columns = ['skills_required', 'skills', 'related_skils_in_job', 
          'educationaL_requirements', 'degree_names', 'major_field_of_studies',
         'responsibilities.1', 'responsibilities', None,
         'job_position_name', 'positions', None]

percentage = []

for i in range(int(len(columns) / 3)):
    percentage.append(keyword_compare(resume_job_match, columns[3*i], columns[3*i + 1], columns[3*i + 2]))

Keyword selection, in this case, uses named entity recognition to pick out skills, organizations, and people. This may seem like a strange choice, and it certainly is an imperfect one, but diving further into the datasets explains why this choice was made. In particular, the jobzilla dataset used to identify skills is focused on skills common in or adjacent to the tech industry, and thus skills like "customer service" are more difficult to identify.
However, as we are measuring the percentage of job description keywords that are present in the resume, and all keywords are being identified in the same manner, part of this discrepancy can be accounted for. If "customer service" cannot be identified by the keyword finder, it doesn't matter if it's present in the job description or the resume. Its presence or absence will not contribute to the match percentage.

In [20]:
title = ['skills', 'degree', 'responsibilities', 'job position']
for i in range(len(percentage)):
    MSE_per = mean_sq_err(percentage[i], match_score)
    print(f"Mean Squared error of {title[i]} percentage: ", MSE_per)
    print(f"Mean Squared error of {title[i]} similarity: ", similarity[title[i]])
    print()

Mean Squared error of skills percentage:  0.10272318941134886
Mean Squared error of skills similarity:  0.07175357068233537

Mean Squared error of degree percentage:  0.1171220451174688
Mean Squared error of degree similarity:  0.052926185909737425

Mean Squared error of responsibilities percentage:  0.14293504587083708
Mean Squared error of responsibilities similarity:  0.14293504587083708

Mean Squared error of job position percentage:  0.1745173458103681
Mean Squared error of job position similarity:  0.10896527127001596



From the numbers computed, we can compare the two ways of reading resumes. Generally speaking, the mean squared error of keyword selection is larger than the mean squared error using spaCy's similarity function. This is likely because spaCy's similarity is softer on indirect matches, such as the difference between the phrases "Neural Network" and "Machine Learning". For spaCy's similarity function, this would still produce a number indicating the closeness of the two phrases, whereas keyword matching would produce zero no matter what, since those words are not the same.

In [21]:
nlp1 = nlp("Neural Network")
nlp2 = nlp("Machine Learning")

print(nlp1.similarity(nlp1))
print(nlp1.similarity(nlp2))

1.0
0.39600063577922107
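
To give a sense of where a score like this comes from, spaCy's documentation describes the default `Doc.similarity` as a cosine similarity over averaged word vectors. The sketch below reproduces that idea with tiny made-up two-dimensional vectors. The helper names and the vectors are hypothetical and are not spaCy's actual implementation or data.

```python
import math


# Hypothetical helpers illustrating cosine similarity over averaged vectors.
def average_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    count = len(vectors)
    return [sum(component) / count for component in zip(*vectors)]


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Made-up "token vectors" for three short documents.
doc_a = average_vector([[1.0, 0.0], [0.0, 1.0]])  # two tokens
doc_b = average_vector([[1.0, 1.0]])              # one token, same direction as doc_a
doc_c = average_vector([[1.0, -1.0]])             # one token, perpendicular to doc_a

same_direction = cosine_similarity(doc_a, doc_b)
perpendicular = cosine_similarity(doc_a, doc_c)
print(same_direction, perpendicular)
```

Documents whose averaged vectors point the same way score close to 1 even when they share no words, which is why this measure is more forgiving of indirect matches than an exact keyword comparison.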


Ultimately, the most accurate score comes from spaCy's similarity function. The smaller mean squared error is a clear indicator that it's providing a better read across three of the four categories. The only exception was the responsibilities category, where keyword matching and similarity produced the same result, approximately $0.1429$. This is bizarre, more so than if the keyword percentage had been smaller than the similarity number. It's unlikely to come from a shared mechanism, since spaCy's similarity is a cosine similarity between word vectors rather than a percentage of matching keywords. It's also not an artifact of truncation, given that both numbers agree to seventeen decimal places. A match that exact more plausibly means the two methods returned the same score for every compared row in this category, which would be worth checking before drawing conclusions from it.

This also provides the answer to the final part of research question 3. Of the methods presented here, the most accurate scoring would be to use spaCy's built-in similarity function, though there is certainly room for improvement.

Lastly, this helps support the second research question, albeit indirectly. When taking a holistic approach to ATSs in daily life, it's generally unknown what methods an ATS could be using, whether it be an in-house similarity computation, an open-source option like spaCy, or a more complicated AI reader. In any case, it stands to reason that all three options would find the best matches with direct word matching, rather than synonyms, even when synonyms are accounted for.

In [22]:
del resume_job_match

In [23]:
resume_data = pd.read_csv('test.csv') 
height, width = resume_data.shape

In [24]:
job_data = pd.read_csv('jts.csv') 
job_short = job_data.head(height).copy()

In [25]:
match2 = pd.concat([resume_data, job_short], axis=1)
del job_data
del job_short
del resume_data
match2

Unnamed: 0,Resume_str,Category,id,Job Title,Job Description
0,SOCIAL MEDIA COMMUNICATIONS MANAGER Education ...,PUBLIC-RELATIONS,0,Flutter Developer,We are looking for hire experts flutter develo...
1,TEACHER Willing relocate: Anywhere Professiona...,TEACHER,1,Django Developer,PYTHON/DJANGO (Developer/Lead) - Job Code(PDJ ...
2,PRINCIPAL CONSULTANT Executive Profile A dynam...,BANKING,2,Machine Learning,"Data Scientist (Contractor)\n\nBangalore, IN\n..."
3,SENIOR ASSOCIATE Executive Profile Seasoned Fi...,BANKING,3,iOS Developer,JOB DESCRIPTION:\n\nStrong framework outside o...
4,ACCOUNTANT Professional Summary I enthusiastic...,ACCOUNTANT,4,Full Stack Developer,job responsibility full stack engineer – react...
...,...,...,...,...,...
493,ENGINEERING OFFICER Objective Looking opportun...,ENGINEERING,507,Wordpress Developer,*Sr/Jr Wordpress Developer*\n*Required skills ...
494,BUSINESS DEVELOPMENT MANAGER Summary Understan...,BUSINESS-DEVELOPMENT,508,Machine Learning,Business Title\nMachine Learning Intern (Techn...
495,HEAD GIRLS BASKETBALL COACH Summary Former col...,FITNESS,509,Network Administrator,Company description\nARMSOFTECH.AIR provides s...
496,PRESIDENT Executive Profile Media relations pr...,PUBLIC-RELATIONS,510,Machine Learning,About Us\n\nMorgan Stanley is a leading global...


## Implications and Limitations

A wide variety of people would benefit from utilizing this analysis, including job applicants, recruiters, HR personnel, or anyone wishing to build their own ATS. Ultimately, it's an introductory step into the variety of applicant tracking systems. It provides some information on how they can be best utilized, and given the growing number of companies using them, it's especially unlikely that they're going away. Indeed, for many reading this, it's especially likely that their resumes will be graded by an ATS. Knowing how they work and how to structure a resume so that it's both human and machine readable is an important skill to have. 

However, it would be remiss to talk about how ATSs work, and not talk about how they, like all projects that rely on data, are only as good as the data they're given. When talking about employment especially, there is plenty of discrimination that shows up in hiring records, and even attempts to mitigate bias can still yield pitiful results. One especially notable example of this was Amazon's AI recruiting tool, which was found to be biased against hiring women and to systematically discriminate against people using the word "women" within their resume. Even when the word "women" was explicitly removed from consideration, the tool still suffered from bias problems, and was ultimately scrapped. In fact, this plays into one of the limitations of this tool. Names, genders, and ages are not included within the datasets, but things like extracurricular activities, what college someone attended, where they worked previously, or even how they phrase skills, responsibilities, or accomplishments can be used to extrapolate identifying information. Given that this dataset does not include this information, it's difficult to tell where bias is being reinforced.

Another limitation of this project was that it was built using imperfect data. Despite having a large number of records, the skills dataset favored terms common in the tech industry rather than a balanced variety of terms. The fact that changing certain terms still leads to significant changes in the results does indicate the relevance of those terms, but their noticeable prevalence indicates that the data is not necessarily the most reliable.

Lastly, while this covered two potential ways an ATS may work, ATSs vary widely depending on who's programming them and how they're programmed to run. One of the most common approaches is to use machine learning to measure the similarity between a resume and a job description, which is an aspect left uncovered by this project. This was decided due to debugging constraints. Learning to integrate natural language processing into neural networks is a topic deserving of its own deep dive, so for ease, it was scrapped from this project. However, this leaves one of the most common elements of an ATS out of a project specifically about exploring them.

Ultimately, this is one source of information on a varied and complex topic, and should be treated as such. Like any good researcher, to find out more, getting information from multiple sources is important to reaching a well-rounded understanding. 