# Capstone: Profile Based Job Matching Recommendation System

### Overall Contents:

- Background
- Data cleaning
- Exploratory Data Analysis
- Word Vectorization using CountVectorizer
- [Word Vectorization using TFIDF-Vectorizer](#5.-Word-Vectorization-using-TFIDF-Vectorizer) **(In this notebook)**
- Final Recommender Model
- Cost Benefit Analysis
- Conclusion

### Import Libraries and datasets

In [1]:
import numpy as np
import pandas as pd
import re
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.feature_extraction import text 
np.random.seed(142)

In [2]:
job_post = pd.read_csv("../datasets/final_job_df.csv")
job_post.head(1)

Unnamed: 0,JobPost_Job_id,JobPost_Job_Title,10,4g,9,_program_management,_project_,accounting,ach,active_directory,...,web_development,websphere,weka,windows,wordpress,workday,xml,xquery,xslt,zookeeper
0,data_scientist_1,data_scientist,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [3]:
resume_df = pd.read_csv("../datasets/clean_resume_df.csv")
resume_df.head(1)

Unnamed: 0.1,Unnamed: 0,Resume_Job_Title,Resume,clean_resume
0,0,Data Scientist,Skills * Programming Languages: Python (pandas...,skills programming languages python pandas nu...


In [4]:
resume_df['Unnamed: 0'] += 1
resume_df["Unnamed: 0"] = resume_df["Unnamed: 0"].astype(str)
resume_df["Resume_Job_id"] = resume_df.Resume_Job_Title.str.cat(resume_df["Unnamed: 0"],sep="_")
resume_df.drop(columns=["Unnamed: 0"],inplace=True)
resume_df.head()

Unnamed: 0,Resume_Job_Title,Resume,clean_resume,Resume_Job_id
0,Data Scientist,Skills * Programming Languages: Python (pandas...,skills programming languages python pandas nu...,Data Scientist_1
1,Data Scientist,Education Details \r\nMay 2013 to May 2017 B.E...,education details mayto maybe uitrgpv data ...,Data Scientist_2
2,Data Scientist,"Areas of Interest Deep Learning, Control Syste...",areas of interest deep learning control system...,Data Scientist_3
3,Data Scientist,Skills â¢ R â¢ Python â¢ SAP HANA â¢ Table...,skills r python sap hana tableau sap hana...,Data Scientist_4
4,Data Scientist,"Education Details \r\n MCA YMCAUST, Faridab...",education details mca ymcaust faridabad ...,Data Scientist_5


In [5]:
my_stop_words = ['Abstracted',
'Accentuated',
'Accomplished',
'accuracy',
'Achieved',
'Acted',
'Adapted',
'Addressed',
'Adjusted',
'Administered',
'Adopted',
'Advanced',
'Advertised',
'Advised',
'Advocated',
'Aided',
'Allocated',
'Analysed',
'Analyzed',
'Answered',
'Applied',
'Appointed',
'Appraised',
'Approved',
'Arbitrated',
'Arranged',
'Articulated',
'Assembled',
'Assessed',
'Assigned',
'Assisted',
'Attained',
'Attended',
'Audited',
'Authored',
'Automated',
'Balanced',
'Began',
'Benchmarked',
'Bent',
'Bound',
'Branded',
'Briefed',
'Budgeted',
'Built',
'Calculated',
'Cared',
'Catalogued',
'Chaired',
'Charted',
'Clarified',
'Classified',
'Coached',
'Coded',
'Collaborated',
'Collected',
'Combined',
'Communicated',
'Compared',
'Compiled',
'Completed',
'Composed',
'Computed',
'Conceptualized',
'Condensed',
'Conducted',
'Conferred',
'Configured',
'Conserved',
'Considered',
'Constructed',
'Consulted',
'Contacted',
'Contributed',
'Controlled',
'Converted',
'Conveyed',
'Convinced',
'Cooperated',
'Coordinated',
'Corrected',
'Corresponded',
'Counseled',
'Created',
'Critiqued',
'Customized',
'Cut',
'deadlines',
'Debated',
'Debugged',
'Decided',
'Decreased',
'Defined',
'Delegated',
'Demonstrated',
'Designed',
'Detailed',
'Detected',
'Determined',
'Developed',
'Devised',
'Diagnosed',
'Differentiated',
'Directed',
'Discriminated',
'Discussed',
'Dispatched',
'Displayed',
'Distinguished',
'Distributed',
'Documented',
'Doubled',
'Drew',
'Drilled',
'Drove',
'Edited',
'Educated',
'Elicited',
'Eliminated',
'Empowered',
'Enabled',
'Encouraged',
'Engineered',
'Enlightened',
'Enlisted',
'Ensured',
'Entertained',
'Established',
'Evaluated',
'Examined',
'Executed',
'Expanded',
'Expedited',
'Experimented',
'Explained',
'Explored',
'Expressed',
'Extracted',
'Extrapolated',
'Fabricated',
'Facilitated',
'Familiarized',
'Fashioned',
'Fed',
'Filed',
'Fine-Tuned',
'Focused',
'Followed',
'Forecasted',
'Formulated',
'Fortified',
'Founded',
'Furnished',
'Furthered',
'Gathered',
'Generated',
'Guided',
'Handled',
'Headed',
'Helped',
'Hired',
'Hosted',
'Identified',
'Illustrated',
'Imagined',
'Implemented',
'Imported',
'Incorporated',
'Increased',
'Individualized',
'Indoctrinated',
'Influenced',
'Informed',
'Initiated',
'Innovated',
'Inspected',
'Installed',
'Instilled',
'Instituted',
'Instructed',
'Insured',
'Integrated',
'Interacted',
'Interpreted',
'Intervened',
'Interviewed',
'Introduced',
'Invented',
'Investigated',
'Involved',
'Joined',
'Judged',
'Launched',
'Lectured',
'Led',
'Linked',
'Listened',
'Logged',
'Maintained',
'Managed',
'Manipulated',
'Marketed',
'Measured',
'Mediated',
'Memorized',
'Mentored',
'Merged',
'Met',
'Modelled',
'Moderated',
'Modified',
'Monitored',
'Motivated',
'Moved',
'Navigated',
'Netted',
'Observed',
'Obtained',
'Operated',
'Ordered',
'Organized',
'Originated',
'Outlined',
'Overhauled',
'Oversaw',
'Painted',
'Participated',
'Perceived',
'Performed',
'Persuaded',
'Photographed',
'Planned',
'Prepared',
'Presented',
'Presided',
'Prevented',
'Printed',
'Prioritized',
'problems',
'Produced',
'Programmed',
'Promoted',
'Proposed',
'Provided',
'Publicized',
'Published',
'Pulled',
'Punched',
'Purchased',
'Quadrupled',
'Read',
'Reasoned',
'Rebuilt',
'Recognized',
'Recommended',
'Reconciled',
'Recorded',
'Recovered',
'Recruited',
'Rectified',
'Re-designed',
'Reduced',
'Re-engineered',
'Referred',
'Registered',
'Regulated',
'Rehabilitated',
'Reinforced',
'Related',
'Remodelled',
'Rendered',
'Repaired',
'Reported',
'Represented',
'Researched',
'Reserved',
'Resolved',
'Responded',
'Restored',
'Restructured',
'Retained',
'Retooled',
'Retrieved',
'Reviewed',
'Revised',
'Revitalized',
'Routed',
'Safeguarded',
'Salvaged',
'Saved',
'Scanned',
'Scheduled',
'Schooled',
'Screened',
'Secured',
'Selected',
'sensitivity',
'Serviced',
'Set',
'Shaped',
'Shared',
'Simplified',
'Simulated',
'Skilled',
'Sold',
'Solicited',
'Solidified',
'Solved',
'Specialized',
'Specified',
'speed',
'Spoke',
'Standardized',
'Stimulated',
'Streamlined',
'Strengthened',
'Studied',
'Submitted',
'Suggested',
'Summarized',
'Supervised',
'Supplied',
'Supported',
'Surveyed',
'Synthesized',
'Systematized',
'Tabulated',
'Taught',
'teamwork',
'Tended',
'Tested',
'through',
'Trained',
'Translated',
'Transmitted',
'Tutored',
'Upgraded',
'Used',
'Visualized',
'Worked',
'Wrote'
]

In [6]:
my_stop_words = [word.lower() for word in my_stop_words]

In [7]:
stop_words = text.ENGLISH_STOP_WORDS.union(my_stop_words)

## 5. Word Vectorization using TFIDF-Vectorizer

### Tfidf-Vectorizer: n_gram = (1,3) combination of grams with highest frequency

In [8]:
# Instantiate TfidfVectorizer with params.
# token_pattern is a raw string to avoid an invalid-escape warning; stop_words is passed as a list
# because newer scikit-learn versions reject a frozenset for this parameter.
tfidf = TfidfVectorizer(ngram_range=(1,3),min_df=2,max_df=0.95,sublinear_tf = True,token_pattern=r'\w+',stop_words=list(stop_words))
# Fit and transform the cleaned resumes, then convert to float32 to reduce memory usage
resume_tfidf = tfidf.fit_transform(resume_df.clean_resume).astype(np.float32)



###### Comments:

The parameter 'sublinear_tf' was set to True to combat keyword stuffing.<br>
Briefly, sublinear tf scaling assumes that twenty occurrences of a term in a document are unlikely to carry twenty times the significance of a single occurrence. It modifies the TF-IDF formula by replacing the raw term frequency $\mbox{tf}$ with a dampened weight $\mbox{wf}$; in scikit-learn, $\mbox{wf} = 1 + \log(\mbox{tf})$.
<br>
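
To make the dampening concrete, here is a minimal plain-Python sketch of the rule scikit-learn documents for sublinear tf scaling, which replaces tf with 1 + log(tf). It is an illustration only, not the library's internal code, and the helper name `sublinear_tf` is hypothetical.

```python
import math

def sublinear_tf(tf):
    """Dampened term frequency: 1 + ln(tf) for tf > 0, else 0 (hypothetical helper)."""
    return 1.0 + math.log(tf) if tf > 0 else 0.0

# Compare a single mention with a term repeated 2 and 20 times.
for count in (1, 2, 20):
    print(count, round(sublinear_tf(count), 3))
```

Under this rule, a term repeated twenty times is weighted roughly four times as much as a single mention rather than twenty times, which is why repeating a keyword in a resume yields quickly diminishing returns.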

In [9]:
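# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() (available from 1.0) replaces it
tfidf_skills = pd.DataFrame(resume_tfidf.todense(), columns = tfidf.get_feature_names_out())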

In [10]:
# Concatenating the skills with the resume df 
resume_df_tfidf = pd.concat([resume_df[['Resume_Job_id','Resume_Job_Title']],tfidf_skills],axis=1)

In [11]:
resume_df_tfidf.head(1)

Unnamed: 0,Resume_Job_id,Resume_Job_Title,04th,04th martojannature,04th martojannature work,05,05 months,05 months sr,06th,06th juneto,...,zuul hysrtrix,zuul hysrtrix ribbon,zuul hystrix,zuul hystrix pivotal,zuul proxy,zuul proxy api,zz,zz server,zz server data,zz server zz
0,Data Scientist_1,Data Scientist,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [12]:
# Replacing space with underscore for the columns
resume_df_tfidf.columns = [col.replace(" ","_") for col in resume_df_tfidf.columns]

In [13]:
# Checking that the change is in place
resume_df_tfidf.head(1)

Unnamed: 0,Resume_Job_id,Resume_Job_Title,04th,04th_martojannature,04th_martojannature_work,05,05_months,05_months_sr,06th,06th_juneto,...,zuul_hysrtrix,zuul_hysrtrix_ribbon,zuul_hystrix,zuul_hystrix_pivotal,zuul_proxy,zuul_proxy_api,zz,zz_server,zz_server_data,zz_server_zz
0,Data Scientist_1,Data Scientist,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### 5.1 Second version of the recommendation system

In [14]:
def recommender_ver_2(job_post_id, no_of_skills=15,no_of_candidate=10):
    '''
    Recommendation system with inputs:
    job_post_id
    no_of_skills, defaults to 15
    no_of_candidate, defaults to 10.
    
    Outputs a dataframe of the top recommended resumes with their total scores and the individual skill scores
    
    '''
    pd.set_option('display.max_columns', None)
    searcher = re.search(r'(^data_[a-z]{7,9})',job_post_id) # Regex to capture the job role
    role = searcher.group()
    # Identifying skillset based on the job role
    if role.lower()  == 'data_scientist':
        skillset = job_post[job_post["JobPost_Job_Title"]==job_post['JobPost_Job_Title'].unique()[0]].iloc[:,2:].sum().sort_values(ascending=False).head(no_of_skills).index.tolist()
        
    elif role.lower() == 'data_analyst':
        skillset = job_post[job_post["JobPost_Job_Title"]==job_post['JobPost_Job_Title'].unique()[1]].iloc[:,2:].sum().sort_values(ascending=False).head(no_of_skills).index.tolist()
        
    elif role.lower() == 'data_engineer':
        skillset = job_post[job_post["JobPost_Job_Title"]==job_post['JobPost_Job_Title'].unique()[2]].iloc[:,2:].sum().sort_values(ascending=False).head(no_of_skills).index.tolist()
    
    # Filtered job_post dataset with the skillset selected
    job_w_skills = job_post[[job_post.columns[0]]+skillset]
    # Filtered resume dataset with the skillset selected
    resume_w_skills = resume_df_tfidf[[resume_df_tfidf.columns[0]]+skillset] 
#     resume_w_skills = resume_w_skills[resume_w_skills!=0].dropna()
    recommendation = pd.DataFrame(np.matmul(np.asarray(job_w_skills.iloc[:,1:]),np.asarray(resume_w_skills.iloc[:,1:]).T),index = job_w_skills.JobPost_Job_id,columns = resume_w_skills.Resume_Job_id)
    recommandation_list = recommendation.T[[job_post_id]].sort_values(by=job_post_id,ascending=False).head(no_of_candidate)
    final_list= pd.merge(recommandation_list,resume_w_skills.loc[resume_w_skills['Resume_Job_id'].isin(recommandation_list.index.tolist())].set_index('Resume_Job_id'),left_index=True, right_index=True)
#     print(skillset)
    return final_list

In [15]:
random_job = job_post.JobPost_Job_id.sample(1).values[0]
random_job

'data_scientist_2409'

In [16]:
recommender_ver_2(job_post_id= random_job, no_of_skills=10, no_of_candidate=30)

Unnamed: 0_level_0,data_scientist_2409,python,machine_learning,r,sql,hadoop,spark,data_mining,java,sas,natural_language_processing
Resume_Job_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
Python Developer_1886,0.214463,0.055919,0.064655,0.041675,0.027902,0.065987,0.092954,0.0,0.043301,0.0,0.0
Python Developer_1929,0.164799,0.034684,0.067389,0.0,0.0,0.0,0.0,0.039126,0.0,0.0,0.062727
Data Scientist_2933,0.160016,0.031479,0.05755,0.030276,0.023049,0.047938,0.0,0.03551,0.028767,0.055528,0.0
Data Scientist_2844,0.159906,0.026081,0.030418,0.033043,0.017849,0.039716,0.037784,0.025147,0.0,0.0,0.045843
Python Developer_1800,0.158338,0.158338,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Data Scientist_2793,0.156527,0.036867,0.050304,0.048045,0.023275,0.025635,0.026092,0.0,0.0,0.036805,0.020446
Data Scientist_2962,0.156481,0.026242,0.032992,0.029031,0.023379,0.029826,0.012722,0.0,0.026738,0.0,0.044043
Data Scientist_2834,0.153981,0.026555,0.034898,0.034669,0.014512,0.037259,0.021797,0.03151,0.018112,0.046841,0.040757
Python Developer_1920,0.151298,0.042273,0.057681,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.051343
Python Developer_1847,0.147639,0.078396,0.028848,0.0,0.040394,0.0,0.055485,0.0,0.0,0.0,0.0


###### Comments: 

With the parameter sublinear_tf set to True, the previous top candidate, "Python Developer_1952", no longer appears among the top 30 candidates returned, which suggests that keyword stuffing has been handled. However, we are still receiving candidates with a score of 0 for the majority of the skills.
<br>
In the next model, we will add a line to filter out candidates that have a score of 0 in any of the skills.
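
The ranking itself comes from the `np.matmul` call in the function, which takes the dot product of each job post's skill row with each resume's TF-IDF weights. The sketch below reproduces that scoring in plain Python on a toy example; the skill names, counts and weights are hypothetical values chosen for illustration, not rows from the datasets.

```python
# Hypothetical toy data: one job post and three resumes over three skills.
skills = ["python", "sql", "spark"]
job_row = [1, 0, 1]  # skill counts for the job post (illustrative values)
resumes = {
    "resume_a": [0.05, 0.03, 0.09],
    "resume_b": [0.16, 0.00, 0.00],
    "resume_c": [0.00, 0.04, 0.00],
}

def score(job, resume):
    """Dot product of the job's skill row with a resume's TF-IDF weights."""
    return sum(j * r for j, r in zip(job, resume))

# Rank resumes from the highest score to the lowest.
ranked = sorted(resumes, key=lambda name: score(job_row, resumes[name]), reverse=True)
print(ranked)
```

In this toy case `resume_b` ranks first on a single skill alone, which mirrors rows such as "Python Developer_1800" in the output above, where the whole score comes from the python column.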

### 5.2 Third version of the recommendation system

In [17]:
def recommender_ver_3(job_post_id, no_of_skills=15,no_of_candidate=10):
    '''
    Recommendation system with inputs:
    job_post_id
    no_of_skills, defaults to 15
    no_of_candidate, defaults to 10.
    
    Outputs a dataframe of the top recommended resumes with their total scores and the individual skill scores
    
    '''
    pd.set_option('display.max_columns', None)
    searcher = re.search(r'(^data_[a-z]{7,9})',job_post_id) # Regex to capture the job role
    role = searcher.group()
    # Identifying skillset based on the job role
    if role.lower()  == 'data_scientist':
        skillset = job_post[job_post["JobPost_Job_Title"]==job_post['JobPost_Job_Title'].unique()[0]].iloc[:,2:].sum().sort_values(ascending=False).head(no_of_skills).index.tolist()
        
    elif role.lower() == 'data_analyst':
        skillset = job_post[job_post["JobPost_Job_Title"]==job_post['JobPost_Job_Title'].unique()[1]].iloc[:,2:].sum().sort_values(ascending=False).head(no_of_skills).index.tolist()
        
    elif role.lower() == 'data_engineer':
        skillset = job_post[job_post["JobPost_Job_Title"]==job_post['JobPost_Job_Title'].unique()[2]].iloc[:,2:].sum().sort_values(ascending=False).head(no_of_skills).index.tolist()
    
    # Filtered job_post dataset with the skillset selected
    job_w_skills = job_post[[job_post.columns[0]]+skillset]
    # Filtered resume dataset with the skillset selected
    resume_w_skills = resume_df_tfidf[[resume_df_tfidf.columns[0]]+skillset]
    # Added this line to filter out applicants that have a score of 0 in any of the skills
    resume_w_skills = resume_w_skills[resume_w_skills!=0].dropna()
    recommendation = pd.DataFrame(np.matmul(np.asarray(job_w_skills.iloc[:,1:]),np.asarray(resume_w_skills.iloc[:,1:]).T),index = job_w_skills.JobPost_Job_id,columns = resume_w_skills.Resume_Job_id)
    recommandation_list = recommendation.T[[job_post_id]].sort_values(by=job_post_id,ascending=False).head(no_of_candidate)
    final_list= pd.merge(recommandation_list,resume_w_skills.loc[resume_w_skills['Resume_Job_id'].isin(recommandation_list.index.tolist())].set_index('Resume_Job_id'),left_index=True, right_index=True)
#     print(skillset)
    return final_list

In [18]:
recommender_ver_3(job_post_id= random_job, no_of_skills=10, no_of_candidate=10)

Unnamed: 0_level_0,data_scientist_2409,python,machine_learning,r,sql,hadoop,spark,data_mining,java,sas,natural_language_processing
Resume_Job_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
Data Scientist_2834,0.153981,0.026555,0.034898,0.034669,0.014512,0.037259,0.021797,0.03151,0.018112,0.046841,0.040757
Python Developer_1934,0.121555,0.023832,0.02886,0.026272,0.016474,0.027772,0.033354,0.023664,0.012439,0.02401,0.024617
Python Developer_1951,0.11727,0.025345,0.027067,0.023184,0.016024,0.030208,0.026511,0.022194,0.010259,0.032885,0.018626
Python Developer_1898,0.104602,0.030538,0.021822,0.007765,0.017525,0.014771,0.018634,0.015421,0.005235,0.021207,0.019946
Python Developer_1943,0.096515,0.021448,0.025261,0.019937,0.017141,0.022398,0.022797,0.018942,0.010887,0.024587,0.010267
Python Developer_1941,0.09305,0.020471,0.02455,0.019376,0.016284,0.021767,0.022156,0.018408,0.010581,0.023895,0.009978
Python Developer_1882,0.092591,0.020584,0.022621,0.017258,0.016472,0.02288,0.018046,0.018511,0.009357,0.022458,0.010033
Python Developer_1880,0.092517,0.02059,0.022627,0.017262,0.016379,0.022886,0.018051,0.018515,0.009359,0.022464,0.010036
Python Developer_1881,0.092452,0.020575,0.022611,0.01725,0.016367,0.02287,0.018038,0.018502,0.009353,0.022448,0.010029


###### Comments: 

This version of the model returns only applicants that have a non-zero score for every skill in the skillset. However, out of 10 candidates, only 1 is applying for data scientist and 9 are Python developers. 
This approach is not ideal for candidates who are fresh out of education, as they may be missing several industry skills.
In the next version, I will give the user the choice of using the skills mentioned in the job posting.
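
The effect of the strict filter can be seen on a toy example. The plain-Python sketch below mimics what `resume_w_skills[resume_w_skills!=0].dropna()` does, keeping only resumes with a non-zero weight for every skill; the resume names and weights are hypothetical.

```python
# Hypothetical TF-IDF weights per resume for three skills (illustrative values).
resumes = {
    "graduate": {"python": 0.06, "sql": 0.04, "hadoop": 0.00},
    "senior":   {"python": 0.03, "sql": 0.02, "hadoop": 0.05},
    "stuffer":  {"python": 0.16, "sql": 0.00, "hadoop": 0.00},
}

def has_all_skills(weights):
    """True only when every skill has a non-zero weight."""
    return all(w != 0 for w in weights.values())

# Keep only the resumes that mention every skill at least once.
kept = [name for name, weights in resumes.items() if has_all_skills(weights)]
print(kept)
```

The single-skill resume is removed as intended, but so is the graduate who is missing only one skill, which is the drawback described above.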

### 5.3 Fourth version of the recommendation system

In [19]:
def recommender_ver_4(job_post_id, no_of_skills=0,no_of_candidate=10,req_skills=[]):
    '''
    Recommendation system with inputs:
    job_post_id
    no_of_skills, defaults to 0; if 0, the skills mentioned in the job post are used instead of the overall top skills identified for the job role.
    no_of_candidate, defaults to 10.
    req_skills, defaults to an empty list; these are the required (must-have) skills for the job, passed as a list of strings.
    Outputs a dataframe of the top recommended resumes with their total scores and the individual skill scores
    
    '''

    if no_of_skills > 0:
        
        pd.set_option('display.max_columns', None)
        searcher = re.search(r'(^data_[a-z]{7,9})',job_post_id)
        role = searcher.group()
        try:
            if role.lower()  == 'data_scientist':
                skillset = job_post[job_post["JobPost_Job_Title"]==job_post['JobPost_Job_Title'].unique()[0]].iloc[:,2:].sum().sort_values(ascending=False).head(no_of_skills).index.tolist()

            elif role.lower() == 'data_analyst':
                skillset = job_post[job_post["JobPost_Job_Title"]==job_post['JobPost_Job_Title'].unique()[1]].iloc[:,2:].sum().sort_values(ascending=False).head(no_of_skills).index.tolist()

            elif role.lower() == 'data_engineer':
                skillset = job_post[job_post["JobPost_Job_Title"]==job_post['JobPost_Job_Title'].unique()[2]].iloc[:,2:].sum().sort_values(ascending=False).head(no_of_skills).index.tolist()
            
            job_w_skills = job_post[[job_post.columns[0]]+skillset]
             # Filtered resume dataset with the skillset selected
            resume_w_skills = resume_df_tfidf[[resume_df_tfidf.columns[0]]+skillset]
            
        except Exception as e:
            error = re.findall(r"'(\w+[\s\w+]+)'",str(e))
            print(f'The following skill(s) have been removed as they are unavailable in our database: {error} ')
                        
            if role.lower()  == 'data_scientist':
                skillset = job_post[job_post["JobPost_Job_Title"]==job_post['JobPost_Job_Title'].unique()[0]].drop(columns=error,axis=1).iloc[:,2:].sum().sort_values(ascending=False).head(no_of_skills).index.tolist()

            elif role.lower() == 'data_analyst':
                skillset = job_post[job_post["JobPost_Job_Title"]==job_post['JobPost_Job_Title'].unique()[1]].drop(columns=error,axis=1).iloc[:,2:].sum().sort_values(ascending=False).head(no_of_skills).index.tolist()

            elif role.lower() == 'data_engineer':
                skillset = job_post[job_post["JobPost_Job_Title"]==job_post['JobPost_Job_Title'].unique()[2]].drop(columns=error,axis=1).iloc[:,2:].sum().sort_values(ascending=False).head(no_of_skills).index.tolist()
            job_w_skills = job_post[[job_post.columns[0]]+skillset]
             # Filtered resume dataset with the skillset selected
            resume_w_skills = resume_df_tfidf[[resume_df_tfidf.columns[0]]+skillset]
    elif no_of_skills == 0:
        try:
            skillset = job_post[job_post["JobPost_Job_id"]==job_post_id].iloc[:,2:][job_post[job_post["JobPost_Job_id"]==job_post_id].iloc[:,2:]!=0].dropna(axis=1).columns.tolist()
 
            # Filtered job_post dataset with the skillset selected
            job_w_skills = job_post[[job_post.columns[0]]+skillset]
            # Filtered resume dataset with the skillset selected
            resume_w_skills = resume_df_tfidf[[resume_df_tfidf.columns[0]]+skillset]
        except Exception as e:
           
            error = re.findall(r"'(\w+[\s\w+]+)'",str(e))
            print(f'The following skill(s) have been removed as they are unavailable in our database: {error} ')
            skillset = job_post[job_post["JobPost_Job_id"]==job_post_id].iloc[:,2:][job_post[job_post["JobPost_Job_id"]==job_post_id].drop(columns=error,axis=1).iloc[:,2:]!=0].dropna(axis=1).columns.tolist()
            job_w_skills = job_post[[job_post.columns[0]]+skillset]
            resume_w_skills = resume_df_tfidf[[resume_df_tfidf.columns[0]]+skillset]
        
    # Filter out resumes that have a score of 0 in any of the required skills
    if len(req_skills) != 0:
        for skill in req_skills:
            resume_w_skills = resume_w_skills[resume_w_skills[skill]!=0]
        print(f"Required skills: {req_skills} ")
#     resume_w_skills = resume_w_skills[resume_w_skills!=0].dropna()
    recommendation = pd.DataFrame(np.matmul(np.asarray(job_w_skills.iloc[:,1:]),np.asarray(resume_w_skills.iloc[:,1:]).T),index = job_w_skills.JobPost_Job_id,columns = resume_w_skills.Resume_Job_id)


    recommandation_list = recommendation.T[[job_post_id]].sort_values(by=job_post_id,ascending=False).head(no_of_candidate)
    final_list= pd.merge(recommandation_list,resume_w_skills.loc[resume_w_skills['Resume_Job_id'].isin(recommandation_list.index.tolist())].set_index('Resume_Job_id'),left_index=True, right_index=True)
    #     print(skillset)
    
    return final_list

In [20]:
recommender_ver_4(job_post_id= 'data_scientist_100', no_of_candidate=10)

The following skill(s) have been removed as they are unavailable in our database: ['six_sigma'] 


Unnamed: 0_level_0,data_scientist_100,ach,ai,data_mining,db2,machine_learning,microsoft_sql_server,python,r,sas,sql,tableau
Resume_Job_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
Data Scientist_2856,0.30818,0.0,0.019049,0.044463,0.0,0.050962,0.045812,0.029921,0.034667,0.058217,0.025089,0.0
Data Scientist_2909,0.30179,0.0,0.0,0.05313,0.0,0.045598,0.0,0.037464,0.040124,0.055863,0.021674,0.047937
Data Scientist_2813,0.291629,0.0,0.033144,0.029647,0.0,0.04427,0.0,0.030747,0.041677,0.046359,0.019243,0.046541
Data Scientist_2939,0.288414,0.0,0.027756,0.034992,0.0,0.030031,0.039425,0.022009,0.029834,0.047928,0.020796,0.035643
Data Scientist_2837,0.281161,0.0,0.038211,0.0,0.0,0.055392,0.0,0.040596,0.055027,0.06598,0.025954,0.0
Data Scientist_2821,0.280007,0.0,0.037029,0.029339,0.0,0.030852,0.030229,0.02403,0.034174,0.039884,0.019201,0.035268
Data Scientist_2893,0.265417,0.0,0.055977,0.042326,0.0,0.041602,0.0,0.028093,0.036087,0.0,0.01822,0.043114
Data Scientist_2870,0.25943,0.0,0.0,0.038852,0.0,0.033345,0.0,0.032742,0.044382,0.053216,0.022089,0.034804
Python Developer_1937,0.25741,0.0,0.023972,0.036305,0.0,0.031159,0.0,0.051699,0.043625,0.049927,0.020724,0.0
Python Developer_1936,0.25706,0.0,0.0,0.0,0.0,0.021813,0.0,0.059278,0.021669,0.067291,0.032678,0.054331


###### Comments:
In addition to giving the user the choice of using the skills mentioned in the job post, this version prints a line informing the user of any skills that are not found in any of the resumes.
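
`recommender_ver_4` identifies the unavailable skills by parsing the message of the exception raised by the column lookup. As an alternative, shown here only as a sketch, the skills can be checked against the resume feature names before the lookup. The helper name `split_available` and the sample column names are hypothetical.

```python
def split_available(skillset, resume_columns):
    """Split requested skills into those present in the resume features and those missing."""
    available = [s for s in skillset if s in resume_columns]
    missing = [s for s in skillset if s not in resume_columns]
    return available, missing

resume_columns = {"python", "sql", "tableau"}  # hypothetical feature names
skillset = ["python", "six_sigma", "sql"]

available, missing = split_available(skillset, resume_columns)
if missing:
    print(f"The following skill(s) have been removed as they are unavailable in our database: {missing}")
```

Checking membership up front does not depend on the wording of the error message, so it would keep working if that wording changed between pandas versions.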

In [21]:
recommender_ver_4(job_post_id= 'data_scientist_222', no_of_candidate=10)

Unnamed: 0_level_0,data_scientist_222,c,image_processing,java,linux,machine_learning,matlab,medical_imaging,microsoft_office
Resume_Job_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
Python Developer_1916,0.212926,0.037001,0.0,0.036251,0.0,0.0,0.08625,0.0,0.053423
Data Scientist_2908,0.211929,0.03356,0.049181,0.0,0.016605,0.042564,0.0,0.070019,0.0
Data Scientist_36,0.205581,0.04733,0.0,0.0,0.0,0.069239,0.089012,0.0,0.0
Data Scientist_16,0.205581,0.04733,0.0,0.0,0.0,0.069239,0.089012,0.0,0.0
Data Scientist_26,0.205581,0.04733,0.0,0.0,0.0,0.069239,0.089012,0.0,0.0
Data Scientist_6,0.205581,0.04733,0.0,0.0,0.0,0.069239,0.089012,0.0,0.0
Java Developer_2417,0.199776,0.087952,0.0,0.080542,0.031283,0.0,0.0,0.0,0.0
Data Scientist_2845,0.195835,0.039301,0.0,0.029838,0.030241,0.039181,0.057275,0.0,0.0
Project Manager_2034,0.185987,0.037301,0.0,0.036545,0.029883,0.0,0.056596,0.0,0.025663
Data Scientist_2798,0.182223,0.022385,0.053114,0.0,0.0,0.054378,0.052346,0.0,0.0


In [22]:
# Example of listing medical_imaging as a required skill, as it may belong to a specialized industry.
recommender_ver_4(job_post_id= 'data_scientist_222', no_of_candidate=10, req_skills=['medical_imaging'])

Required skills: ['medical_imaging'] 


Unnamed: 0_level_0,data_scientist_222,c,image_processing,java,linux,machine_learning,matlab,medical_imaging,microsoft_office
Resume_Job_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
Data Scientist_2908,0.211929,0.03356,0.049181,0.0,0.016605,0.042564,0.0,0.070019,0.0
Data Scientist_2891,0.07635,0.0,0.0,0.0,0.0,0.045458,0.0,0.030893,0.0
Network Administrator_2346,0.076218,0.0,0.0,0.012365,0.021219,0.0,0.0,0.042634,0.0
Web Developer_1032,0.050016,0.0,0.0,0.0,0.0,0.0,0.0,0.050016,0.0


###### Comments:

In addition, instead of removing every candidate that scores 0 in any of the listed skills, a new input, 'req_skills', lets the user specify skills that are mandatory for the job, such as skills used in a specialized industry like 'medical imaging' in the example above. As mentioned earlier, an entry-level applicant is unlikely to have all the skills, especially industry-specific ones, so it is better to give the user a choice: when hiring for a more senior role, they can add more skills to the required list.
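
The required-skills filter can be sketched in plain Python as follows. It keeps a resume only when every required skill has a non-zero weight and leaves the other skills unconstrained. The helper name `filter_required` and the weights are hypothetical; the function in the notebook applies the same idea to the pandas dataframe.

```python
# Hypothetical TF-IDF weights per resume (illustrative values).
resumes = {
    "resume_a": {"machine_learning": 0.04, "matlab": 0.00, "medical_imaging": 0.07},
    "resume_b": {"machine_learning": 0.07, "matlab": 0.09, "medical_imaging": 0.00},
    "resume_c": {"machine_learning": 0.00, "matlab": 0.00, "medical_imaging": 0.05},
}

def filter_required(resumes, req_skills):
    """Keep resumes with a non-zero weight for every required skill."""
    return {
        name: weights
        for name, weights in resumes.items()
        if all(weights.get(skill, 0) != 0 for skill in req_skills)
    }

kept = filter_required(resumes, ["medical_imaging"])
print(sorted(kept))
```

With an empty list of required skills nothing is filtered out, which matches the default behaviour of the function.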