# Capstone: Profile Based Job Matching Recommendation System

### Overall Contents:

- Background
- Data cleaning
- Exploratory Data Analysis
- Word Vectorization using CountVectorizer
- Word Vectorization using TFIDF-Vectorizer
- [Final Recommender Model](#6.-Final-Recommender-Model) **(In this notebook)**
- [Cost Benefit Analysis](#7.-Cost-Benefit-Analysis) **(In this notebook)**
- [Summary](#8.-Summary) **(In this notebook)**
- [Conclusion](#9.-Conclusion) **(In this notebook)**

### Import Libraries and datasets

In [1]:
import numpy as np
import pandas as pd
import re
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.feature_extraction import text 
np.random.seed(142)
pd.set_option('display.max_colwidth', None)

In [2]:
job_post = pd.read_csv("../datasets/final_job_df.csv")
resume = pd.read_csv("../datasets/clean_resume_df.csv")

In [3]:
my_stop_words = ['Abstracted',
'Accentuated',
'Accomplished',
'accuracy',
'Achieved',
'Acted',
'Adapted',
'Addressed',
'Adjusted',
'Administered',
'Adopted',
'Advanced',
'Advertised',
'Advised',
'Advocated',
'Aided',
'Allocated',
'Analysed',
'Analyzed',
'Answered',
'Applied',
'Appointed',
'Appraised',
'Approved',
'Arbitrated',
'Arranged',
'Articulated',
'Assembled',
'Assessed',
'Assigned',
'Assisted',
'Attained',
'Attended',
'Audited',
'Authored',
'Automated',
'Balanced',
'Began',
'Benchmarked',
'Bent',
'Bound',
'Branded',
'Briefed',
'Budgeted',
'Built',
'Calculated',
'Cared',
'Catalogued',
'Chaired',
'Charted',
'Clarified',
'Classified',
'Coached',
'Coded',
'Collaborated',
'Collected',
'Combined',
'Communicated',
'Compared',
'Compiled',
'Completed',
'Composed',
'Computed',
'Conceptualized',
'Condensed',
'Conducted',
'Conferred',
'Configured',
'Conserved',
'Considered',
'Constructed',
'Consulted',
'Contacted',
'Contributed',
'Controlled',
'Converted',
'Conveyed',
'Convinced',
'Cooperated',
'Coordinated',
'Corrected',
'Corresponded',
'Counseled',
'Created',
'Critiqued',
'Customized',
'Cut',
'deadlines',
'Debated',
'Debugged',
'Decided',
'Decreased',
'Defined',
'Delegated',
'Demonstrated',
'Designed',
'Detailed',
'Detected',
'Determined',
'Developed',
'Devised',
'Diagnosed',
'Differentiated',
'Directed',
'Discriminated',
'Discussed',
'Dispatched',
'Displayed',
'Distinguished',
'Distributed',
'Documented',
'Doubled',
'Drew',
'Drilled',
'Drove',
'Edited',
'Educated',
'Elicited',
'Eliminated',
'Empowered',
'Enabled',
'Encouraged',
'Engineered',
'Enlightened',
'Enlisted',
'Ensured',
'Entertained',
'Established',
'Evaluated',
'Examined',
'Executed',
'Expanded',
'Expedited',
'Experimented',
'Explained',
'Explored',
'Expressed',
'Extracted',
'Extrapolated',
'Fabricated',
'Facilitated',
'Familiarized',
'Fashioned',
'Fed',
'Filed',
'Fine-Tuned',
'Focused',
'Followed',
'Forecasted',
'Formulated',
'Fortified',
'Founded',
'Furnished',
'Furthered',
'Gathered',
'Generated',
'Guided',
'Handled',
'Headed',
'Helped',
'Hired',
'Hosted',
'Identified',
'Illustrated',
'Imagined',
'Implemented',
'Imported',
'Incorporated',
'Increased',
'Individualized',
'Indoctrinated',
'Influenced',
'Informed',
'Initiated',
'Innovated',
'Inspected',
'Installed',
'Instilled',
'Instituted',
'Instructed',
'Insured',
'Integrated',
'Interacted',
'Interpreted',
'Intervened',
'Interviewed',
'Introduced',
'Invented',
'Investigated',
'Involved',
'Joined',
'Judged',
'Launched',
'Lectured',
'Led',
'Linked',
'Listened',
'Logged',
'Maintained',
'Managed',
'Manipulated',
'Marketed',
'Measured',
'Mediated',
'Memorized',
'Mentored',
'Merged',
'Met',
'Modelled',
'Moderated',
'Modified',
'Monitored',
'Motivated',
'Moved',
'Navigated',
'Netted',
'Observed',
'Obtained',
'Operated',
'Ordered',
'Organized',
'Originated',
'Outlined',
'Overhauled',
'Oversaw',
'Painted',
'Participated',
'Perceived',
'Performed',
'Persuaded',
'Photographed',
'Planned',
'Prepared',
'Presented',
'Presided',
'Prevented',
'Printed',
'Prioritized',
'problems',
'Produced',
'Programmed',
'Promoted',
'Proposed',
'Provided',
'Publicized',
'Published',
'Pulled',
'Punched',
'Purchased',
'Quadrupled',
'Read',
'Reasoned',
'Rebuilt',
'Recognized',
'Recommended',
'Reconciled',
'Recorded',
'Recovered',
'Recruited',
'Rectified',
'Re-designed',
'Reduced',
'Re-engineered',
'Referred',
'Registered',
'Regulated',
'Rehabilitated',
'Reinforced',
'Related',
'Remodelled',
'Rendered',
'Repaired',
'Reported',
'Represented',
'Researched',
'Reserved',
'Resolved',
'Responded',
'Restored',
'Restructured',
'Retained',
'Retooled',
'Retrieved',
'Reviewed',
'Revised',
'Revitalized',
'Routed',
'Safeguarded',
'Salvaged',
'Saved',
'Scanned',
'Scheduled',
'Schooled',
'Screened',
'Secured',
'Selected',
'sensitivity',
'Serviced',
'Set',
'Shaped',
'Shared',
'Simplified',
'Simulated',
'Skilled',
'Sold',
'Solicited',
'Solidified',
'Solved',
'Specialized',
'Specified',
'speed',
'Spoke',
'Standardized',
'Stimulated',
'Streamlined',
'Strengthened',
'Studied',
'Submitted',
'Suggested',
'Summarized',
'Supervised',
'Supplied',
'Supported',
'Surveyed',
'Synthesized',
'Systematized',
'Tabulated',
'Taught',
'teamwork',
'Tended',
'Tested',
'through',
'Trained',
'Translated',
'Transmitted',
'Tutored',
'Upgraded',
'Used',
'Visualized',
'Worked',
'Wrote'
]

In [4]:
my_stop_words = [word.lower() for word in my_stop_words]

In [5]:
stop_words = text.ENGLISH_STOP_WORDS.union(my_stop_words)
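
The combined set built above can be handed to a vectorizer through its `stop_words` argument. The following is a minimal sketch rather than the notebook's actual pipeline: the three verbs and the two sentences are made-up examples, and the set is converted to a list because recent scikit-learn releases expect a list for this argument. Note that `final_recommender` in section 6 passes `stop_words='english'`, so the custom verbs are not applied there.

```python
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical subset of the resume action verbs, lowercased like my_stop_words.
custom_verbs = ["Managed", "Developed", "Led"]
combined = text.ENGLISH_STOP_WORDS.union(w.lower() for w in custom_verbs)

# Pass the combined stop words as a list.
vectorizer = TfidfVectorizer(stop_words=list(combined))
vectorizer.fit([
    "Managed and developed Python pipelines",
    "Led SQL reporting",
])

# Only the skill-like tokens remain in the vocabulary.
vocab = sorted(vectorizer.vocabulary_)
print(vocab)
```

Dropping the action verbs this way keeps the vocabulary focused on skill terms rather than on how the applicant phrases their achievements.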

## Data Dictionary

### <center>Job Post Dataset</center>

|Column|Dtypes|Data Description|
|---|---|---|
|JobPost_Job_id| object| Identifier for individual job post|
|JobPost_Job_Title| object| Job title (Consisting of Data Scientist, Data Analyst, Data Engineer)|
|Skill| object| Skills mentioned/requirements for the job|

### <center>Resume Dataset</center>

|Column|Dtypes|Data Description|
|---|---|---|
|Resume_Job_id| object| Identifier for individual resume|
|Resume_Job_Title| object| Job title that the applicant applied for|
|clean_resume| object| Clean version of the applicant's resume|


## 6. Final Recommender Model

In [6]:
def final_recommender(job_post_id, no_of_skills=0, no_of_candidate=10, req_skills=[]):
    '''
    Recommend the best-matching resumes for a job post.

    Inputs:
    job_post_id: id of the job post to match against.
    no_of_skills: defaults to 0. If 0, the skills mentioned in the job post are used; if above 0, the top no_of_skills skills identified during EDA for that role are used instead.
    no_of_candidate: number of candidates to return, defaults to 10.
    req_skills: defaults to an empty list. Required (must-have) skills for the job, given as a list of strings.

    Outputs a dataframe of the top recommended resumes with their total score and the individual score for each skill.
    '''
    import time
    start_time = time.time()
    
    # Creates the resume job id using the index+1
    resume_df = resume.copy() # Creates a copy of the df so that each run will be the fresh copy of the df
    resume_df['Unnamed: 0'] += 1
    resume_df["Unnamed: 0"] = resume_df["Unnamed: 0"].astype(str)
    resume_df["Resume_Job_id"] = resume_df.Resume_Job_Title.str.cat(resume_df["Unnamed: 0"],sep="_")
    resume_df.drop(columns=["Unnamed: 0"],inplace=True)
    
    # Instantiate the TF-IDF vectorizer with its params (raw string so that \w is not treated as an escape).
    tfidf = TfidfVectorizer(ngram_range=(1,3),min_df=10,sublinear_tf=True,token_pattern=r'\w+',stop_words='english')
    # Fit and transform the resume text, converting it to float32 to reduce memory usage
    resume_tfidf = tfidf.fit_transform(resume_df.clean_resume).astype(np.float32)
    # Turn the vectorized skills into a dataframe
    # (get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2)
    tfidf_skills = pd.DataFrame(resume_tfidf.toarray(), columns = tfidf.get_feature_names_out())
    #Concatenating the skills with the job id and title, 
    resume_df_tfidf = pd.concat([resume_df[['Resume_Job_id','Resume_Job_Title']],tfidf_skills],axis=1)
    
    # Replacing space with underscore for the columns
    resume_df_tfidf.columns = [col.replace(" ","_") for col in resume_df_tfidf.columns]
    # no_of_skills above 0 means the top skills identified during EDA are used
    if no_of_skills > 0:
        
        pd.set_option('display.max_columns', None)
        searcher = re.search(r'(^data_[a-z]{7,9})',job_post_id)
        role = searcher.group()
        try:
            if role.lower()  == 'data_scientist':
                skillset = job_post[job_post["JobPost_Job_Title"]==job_post['JobPost_Job_Title'].unique()[0]].iloc[:,2:].sum().sort_values(ascending=False).head(no_of_skills).index.tolist()

            elif role.lower() == 'data_analyst':
                skillset = job_post[job_post["JobPost_Job_Title"]==job_post['JobPost_Job_Title'].unique()[1]].iloc[:,2:].sum().sort_values(ascending=False).head(no_of_skills).index.tolist()

            elif role.lower() == 'data_engineer':
                skillset = job_post[job_post["JobPost_Job_Title"]==job_post['JobPost_Job_Title'].unique()[2]].iloc[:,2:].sum().sort_values(ascending=False).head(no_of_skills).index.tolist()
            
            job_w_skills = job_post[[job_post.columns[0]]+skillset]
             # Filtered resume dataset with the skillset selected
            resume_w_skills = resume_df_tfidf[[resume_df_tfidf.columns[0]]+skillset]
            
        except Exception as e:
            error = re.findall(r"'(\w+[\s\w+]+)'",str(e))
            print(f'The following skill(s) have been removed as they are unavailable in our database: {error} ')
                        
            if role.lower()  == 'data_scientist':
                skillset = job_post[job_post["JobPost_Job_Title"]==job_post['JobPost_Job_Title'].unique()[0]].drop(columns=error,axis=1).iloc[:,2:].sum().sort_values(ascending=False).head(no_of_skills).index.tolist()

            elif role.lower() == 'data_analyst':
                skillset = job_post[job_post["JobPost_Job_Title"]==job_post['JobPost_Job_Title'].unique()[1]].drop(columns=error,axis=1).iloc[:,2:].sum().sort_values(ascending=False).head(no_of_skills).index.tolist()

            elif role.lower() == 'data_engineer':
                skillset = job_post[job_post["JobPost_Job_Title"]==job_post['JobPost_Job_Title'].unique()[2]].drop(columns=error,axis=1).iloc[:,2:].sum().sort_values(ascending=False).head(no_of_skills).index.tolist()
            job_w_skills = job_post[[job_post.columns[0]]+skillset]
             # Filtered resume dataset with the skillset selected
            resume_w_skills = resume_df_tfidf[[resume_df_tfidf.columns[0]]+skillset]
    elif no_of_skills == 0:
        try:
            skillset = job_post[job_post["JobPost_Job_id"]==job_post_id].iloc[:,2:][job_post[job_post["JobPost_Job_id"]==job_post_id].iloc[:,2:]!=0].dropna(axis=1).columns.tolist()
 
            # Filtered job_post dataset with the skillset selected
            job_w_skills = job_post[[job_post.columns[0]]+skillset]
            # Filtered resume dataset with the skillset selected
            resume_w_skills = resume_df_tfidf[[resume_df_tfidf.columns[0]]+skillset]
        except Exception as e:
           
            error = re.findall(r"'(\w+[\s\w+]+)'",str(e))
            print(f'The following skill(s) have been removed as they are unavailable in our database: {error} ')
            skillset = job_post[job_post["JobPost_Job_id"]==job_post_id].iloc[:,2:][job_post[job_post["JobPost_Job_id"]==job_post_id].drop(columns=error,axis=1).iloc[:,2:]!=0].dropna(axis=1).columns.tolist()
            job_w_skills = job_post[[job_post.columns[0]]+skillset]
            resume_w_skills = resume_df_tfidf[[resume_df_tfidf.columns[0]]+skillset]
        
    # Keep only resumes with a non-zero score for every required skill.
    # Each required skill must be one of the selected skill columns, otherwise a KeyError is raised.
    if len(req_skills) != 0:
        for skill in req_skills:
            resume_w_skills = resume_w_skills[resume_w_skills[skill]!=0]
        print(f"Mandatory skills: {req_skills} ")

    recommendation = pd.DataFrame(np.round(np.matmul(np.asarray(job_w_skills.iloc[:,1:]),np.asarray(resume_w_skills.iloc[:,1:]).T),decimals=4),index = job_w_skills.JobPost_Job_id,columns = resume_w_skills.Resume_Job_id)


    recommendation_list = recommendation.T[[job_post_id]].sort_values(by=job_post_id,ascending=False).head(no_of_candidate)
    final_list= pd.merge(recommendation_list,round(resume_w_skills.loc[resume_w_skills['Resume_Job_id'].isin(recommendation_list.index.tolist())],4).set_index('Resume_Job_id'),left_index=True, right_index=True)
    final_list.rename(columns ={final_list.columns[0] : "Total Score"},inplace=True)
    #     print(skillset)
    print("Time taken to run: %s seconds " % round((time.time() - start_time),2))
    print(f"Time saved: {round((((6*no_of_candidate)-(time.time() - start_time))/(6*no_of_candidate) * 100),2)}%")
    # \033[1m switches to bold; \033[0m resets it so later output is not affected
    print(f"Top {no_of_candidate} applicants recommended for \033[1m{job_post_id}\033[0m")
    return final_list


In [None]:
random_job = job_post.JobPost_Job_id.sample(1).values[0]
random_job

### Using Top Skills Identified during EDA

In [None]:
final_recommender(job_post_id=random_job,no_of_skills=10,no_of_candidate=10)

### Using Top Skills Identified during EDA and established required skills from the list

In [None]:
final_recommender(job_post_id=random_job,no_of_skills=10,no_of_candidate=10,req_skills=['machine_learning','natural_language_processing'])

### Using Skills mentioned in the job requirements

In [None]:
final_recommender(job_post_id=random_job,no_of_candidate=10)

### Using Skills mentioned in the job requirements and established required skills from the list

In [None]:
final_recommender(job_post_id=random_job,no_of_candidate=10,req_skills=['natural_language_processing'])

In [None]:
# Testing the time taken for 100 candidates
final_recommender(job_post_id=random_job,no_of_candidate=100,req_skills=['machine_learning'])

## 7. Cost Benefit Analysis

Although we are unable to measure the benefit in monetary value, it can be expressed in terms of time. The recommendation system reduces the time to hire, which in turn allows the company to allocate resources elsewhere. Hiring the right person for the job also improves the quality of hire, which can be defined as the total value that new employees bring to the company through their performance. Variables considered when calculating the quality of hire include ramp-up time, turnover rates, and performance reviews. Research suggests that high-quality, talented candidates stay on the market for no more than 8-12 days. It is therefore vital to speed up the hiring process so that the company reaches the top candidates before other companies do.

## 8. Summary

This is the finalized version of the recommender. Here is a walkthrough of how the function works. <br>Users can choose the skill set based on either:

1. Job requirements (set no_of_skills to 0)
2. Top skills identified during EDA (set no_of_skills to any value above 0; use this only if the skills listed in the job post are insufficient)
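
The two options can be illustrated on a toy skill matrix. This is a simplified sketch, not the function's exact code: the dataframe `job_post_demo`, its values, and the helper names `skills_from_post` and `top_skills_for_role` are hypothetical, and the layout (id, title, then one binary column per skill) is assumed from the data dictionary.

```python
import pandas as pd

# Hypothetical binary skill matrix: 1 means the posting mentions the skill.
job_post_demo = pd.DataFrame({
    "JobPost_Job_id": ["data_scientist_1", "data_scientist_2", "data_analyst_1"],
    "JobPost_Job_Title": ["data_scientist", "data_scientist", "data_analyst"],
    "python": [1, 1, 0],
    "sql": [0, 1, 1],
    "excel": [0, 0, 1],
})

def skills_from_post(df, job_post_id):
    """Option 1: skills with a non-zero value in the posting itself."""
    row = df.loc[df["JobPost_Job_id"] == job_post_id].iloc[0, 2:]
    return row[row != 0].index.tolist()

def top_skills_for_role(df, role, n):
    """Option 2: the n most frequently requested skills across the role's postings."""
    counts = df.loc[df["JobPost_Job_Title"] == role].iloc[:, 2:].sum()
    return counts.sort_values(ascending=False).head(n).index.tolist()

posted = skills_from_post(job_post_demo, "data_scientist_1")
top = top_skills_for_role(job_post_demo, "data_scientist", 2)
print(posted, top)
```

Option 2 can return skills that the individual posting never mentions, which is why it is best reserved for postings that list too few skills.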

On top of that, there is a mandatory input, "job_post_id", which is the id of the job post for which the recommender returns recommended applicants. Users can also specify the number of candidates with "no_of_candidate", which defaults to 10, and the required skills as a list with "req_skills".

In summary, there are 4 inputs to the function:
1. job_post_id
2. no_of_skills
3. no_of_candidate
4. req_skills

Based on the chosen method of identifying skills, both the job post and resume datasets are reduced to the selected skill columns. The function then checks whether any required skills were specified; if so, applicants with a score of 0 for any of those skills are filtered out. A matrix multiplication between the filtered job post and resume datasets then returns a matrix containing the scores between every job and every remaining resume.
From there, the remaining applicants are sorted in descending order of their score for the requested job post and cut down to the number of candidates requested by the user. Finally, the function merges the total score with each applicant's individual skill scores in a dataframe. However, the individual skill scores only add up to the total score when the skills are taken from the job post itself. With the skill set identified from EDA, the output does not show the job post's binary value for each skill, so users cannot tell which of the listed skills actually contributed to the total.
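
The scoring step can be sketched with NumPy on made-up numbers. The skill names, resume ids, and weights below are illustrative assumptions; the real function works on dataframes and scores every job post at once rather than a single vector.

```python
import numpy as np

skills = ["python", "sql", "machine_learning"]

# Hypothetical job post row: 1 if the posting lists the skill, else 0.
job_vec = np.array([1, 0, 1], dtype=np.float32)

# Hypothetical TF-IDF weights, one row per resume, one column per skill.
resume_ids = ["resume_a", "resume_b", "resume_c"]
resume_mat = np.array([
    [0.5, 0.25, 0.0],
    [0.25, 0.0, 0.5],
    [0.0, 0.75, 0.25],
], dtype=np.float32)

# Drop resumes that score 0 on any required skill.
required = ["machine_learning"]
keep = np.ones(len(resume_ids), dtype=bool)
for skill in required:
    keep &= resume_mat[:, skills.index(skill)] != 0

# Total score = dot product of each remaining resume with the job vector.
scores = resume_mat[keep] @ job_vec
kept_ids = [rid for rid, k in zip(resume_ids, keep) if k]

# Sort candidates from highest to lowest total score.
ranking = sorted(zip(kept_ids, scores.tolist()), key=lambda p: p[1], reverse=True)
print(ranking)
```

In this toy case `resume_c` has a weight of 0.75 for `sql`, but it adds nothing to the total because the posting's value for `sql` is 0. That is the situation described above in which the displayed skill scores do not add up to the total score.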

There is also a timer, started at the beginning of the function, that measures the time taken to run. This is compared with the time a user would need to vet the same resumes manually, on the assumption that the user takes an average of 6 seconds to screen a resume.

$$\text{time saved} = \frac{(6\ \text{seconds} \times \text{NumOfCandidates}) - \text{RuntimeOfFunction}}{6\ \text{seconds} \times \text{NumOfCandidates}} \times 100\%$$
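
As a worked example of the formula, here is a small helper written for illustration. The name `time_saved_pct` and its parameters are hypothetical; the 6-second default is the screening assumption stated above.

```python
def time_saved_pct(runtime_s, n_candidates, secs_per_resume=6.0):
    """Percentage of manual screening time saved by the recommender."""
    manual_s = secs_per_resume * n_candidates  # assumed manual screening time
    return round((manual_s - runtime_s) / manual_s * 100, 2)

# 10 candidates take 60 seconds to screen manually at 6 seconds each.
fast_run = time_saved_pct(runtime_s=12, n_candidates=10)
slow_run = time_saved_pct(runtime_s=15, n_candidates=10)
print(fast_run, slow_run)
```

The result turns negative when the function takes longer to run than the manual screening would, which can happen when only a few candidates are requested.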



## 9. Conclusion

In conclusion, note that this is not a foolproof method, because there is no clear-cut measure of whether a candidate is right or wrong for the job. The recommender recommends applicants who fulfil the job requirements based on their skill set and gives them a chance at an interview, but it does not guarantee that they land the job, because it does not capture soft skills. To get the best results from the recommender, users have to understand what the job requires.

The recommender achieves what we originally set out to do, which is to help the user save time: instead of screening hundreds of resumes to choose suitable candidates for interviews, the user receives an overview of the recommended applicants. It takes an average of 10-15 seconds to return an output, a saving of approximately 80% for 10 candidates when compared with the 60 seconds of manual screening assumed at 6 seconds per resume. It also gives users the flexibility to customize and experiment with different requirements.

The recommender can be further improved with user feedback and with A/B testing on the skill set options or on different industries. It would also help to capture more information on both job posts and resumes, so that recommendations can be extended to account for job level (entry or senior role), location and so on. Collecting this information through something like an app would keep the retrieved data consistent.
