# Capstone: Profile Based Job Matching Recommendation System

### Overall Contents:

- Background
- Data cleaning
- Exploratory Data Analysis
- [Word Vectorization using CountVectorizer](#4.-Word-Vectorization-using-CountVectorizer) **(In this notebook)**
- Word Vectorization using TFIDF-Vectorizer
- Final Recommender Model
- Cost Benefit Analysis
- Conclusion

### Import Libraries and datasets

In [1]:
import numpy as np
import pandas as pd
import re
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.feature_extraction import text 
np.random.seed(142)

In [2]:
job_post = pd.read_csv("../datasets/final_job_df.csv")
job_post.head(1)

Unnamed: 0,JobPost_Job_id,JobPost_Job_Title,10,4g,9,_program_management,_project_,accounting,ach,active_directory,...,web_development,websphere,weka,windows,wordpress,workday,xml,xquery,xslt,zookeeper
0,data_scientist_1,data_scientist,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [3]:
resume_df = pd.read_csv("../datasets/clean_resume_df.csv")
resume_df.head(1)

Unnamed: 0.1,Unnamed: 0,Resume_Job_Title,Resume,clean_resume
0,0,Data Scientist,Skills * Programming Languages: Python (pandas...,skills programming languages python pandas nu...


In [4]:
# Creating a resume ID from the column 'Unnamed: 0' (shifted to start at 1)
resume_df['Unnamed: 0'] += 1
resume_df["Unnamed: 0"] = resume_df["Unnamed: 0"].astype(str)
resume_df["Resume_Job_id"] = resume_df.Resume_Job_Title.str.cat(resume_df["Unnamed: 0"],sep="_")
resume_df.drop(columns=["Unnamed: 0"],inplace=True)
resume_df.head()

Unnamed: 0,Resume_Job_Title,Resume,clean_resume,Resume_Job_id
0,Data Scientist,Skills * Programming Languages: Python (pandas...,skills programming languages python pandas nu...,Data Scientist_1
1,Data Scientist,Education Details \r\nMay 2013 to May 2017 B.E...,education details mayto maybe uitrgpv data ...,Data Scientist_2
2,Data Scientist,"Areas of Interest Deep Learning, Control Syste...",areas of interest deep learning control system...,Data Scientist_3
3,Data Scientist,Skills â¢ R â¢ Python â¢ SAP HANA â¢ Table...,skills r python sap hana tableau sap hana...,Data Scientist_4
4,Data Scientist,"Education Details \r\n MCA YMCAUST, Faridab...",education details mca ymcaust faridabad ...,Data Scientist_5


In [5]:
# Adding action verbs into the stop list
# https://www.hendrix.edu/uploadedFiles/Student_Life/Career_Services/Strong_Action_Verbs.pdf
    
my_stop_words = ['Abstracted',
'Accentuated',
'Accomplished',
'accuracy',
'Achieved',
'Acted',
'Adapted',
'Addressed',
'Adjusted',
'Administered',
'Adopted',
'Advanced',
'Advertised',
'Advised',
'Advocated',
'Aided',
'Allocated',
'Analysed',
'Analyzed',
'Answered',
'Applied',
'Appointed',
'Appraised',
'Approved',
'Arbitrated',
'Arranged',
'Articulated',
'Assembled',
'Assessed',
'Assigned',
'Assisted',
'Attained',
'Attended',
'Audited',
'Authored',
'Automated',
'Balanced',
'Began',
'Benchmarked',
'Bent',
'Bound',
'Branded',
'Briefed',
'Budgeted',
'Built',
'Calculated',
'Cared',
'Catalogued',
'Chaired',
'Charted',
'Clarified',
'Classified',
'Coached',
'Coded',
'Collaborated',
'Collected',
'Combined',
'Communicated',
'Compared',
'Compiled',
'Completed',
'Composed',
'Computed',
'Conceptualized',
'Condensed',
'Conducted',
'Conferred',
'Configured',
'Conserved',
'Considered',
'Constructed',
'Consulted',
'Contacted',
'Contributed',
'Controlled',
'Converted',
'Conveyed',
'Convinced',
'Cooperated',
'Coordinated',
'Corrected',
'Corresponded',
'Counseled',
'Created',
'Critiqued',
'Customized',
'Cut',
'deadlines',
'Debated',
'Debugged',
'Decided',
'Decreased',
'Defined',
'Delegated',
'Demonstrated',
'Designed',
'Detailed',
'Detected',
'Determined',
'Developed',
'Devised',
'Diagnosed',
'Differentiated',
'Directed',
'Discriminated',
'Discussed',
'Dispatched',
'Displayed',
'Distinguished',
'Distributed',
'Documented',
'Doubled',
'Drew',
'Drilled',
'Drove',
'Edited',
'Educated',
'Elicited',
'Eliminated',
'Empowered',
'Enabled',
'Encouraged',
'Engineered',
'Enlightened',
'Enlisted',
'Ensured',
'Entertained',
'Established',
'Evaluated',
'Examined',
'Executed',
'Expanded',
'Expedited',
'Experimented',
'Explained',
'Explored',
'Expressed',
'Extracted',
'Extrapolated',
'Fabricated',
'Facilitated',
'Familiarized',
'Fashioned',
'Fed',
'Filed',
'Fine-Tuned',
'Focused',
'Followed',
'Forecasted',
'Formulated',
'Fortified',
'Founded',
'Furnished',
'Furthered',
'Gathered',
'Generated',
'Guided',
'Handled',
'Headed',
'Helped',
'Hired',
'Hosted',
'Identified',
'Illustrated',
'Imagined',
'Implemented',
'Imported',
'Incorporated',
'Increased',
'Individualized',
'Indoctrinated',
'Influenced',
'Informed',
'Initiated',
'Innovated',
'Inspected',
'Installed',
'Instilled',
'Instituted',
'Instructed',
'Insured',
'Integrated',
'Interacted',
'Interpreted',
'Intervened',
'Interviewed',
'Introduced',
'Invented',
'Investigated',
'Involved',
'Joined',
'Judged',
'Launched',
'Lectured',
'Led',
'Linked',
'Listened',
'Logged',
'Maintained',
'Managed',
'Manipulated',
'Marketed',
'Measured',
'Mediated',
'Memorized',
'Mentored',
'Merged',
'Met',
'Modelled',
'Moderated',
'Modified',
'Monitored',
'Motivated',
'Moved',
'Navigated',
'Netted',
'Observed',
'Obtained',
'Operated',
'Ordered',
'Organized',
'Originated',
'Outlined',
'Overhauled',
'Oversaw',
'Painted',
'Participated',
'Perceived',
'Performed',
'Persuaded',
'Photographed',
'Planned',
'Prepared',
'Presented',
'Presided',
'Prevented',
'Printed',
'Prioritized',
'problems',
'Produced',
'Programmed',
'Promoted',
'Proposed',
'Provided',
'Publicized',
'Published',
'Pulled',
'Punched',
'Purchased',
'Quadrupled',
'Read',
'Reasoned',
'Rebuilt',
'Recognized',
'Recommended',
'Reconciled',
'Recorded',
'Recovered',
'Recruited',
'Rectified',
'Re-designed',
'Reduced',
'Re-engineered',
'Referred',
'Registered',
'Regulated',
'Rehabilitated',
'Reinforced',
'Related',
'Remodelled',
'Rendered',
'Repaired',
'Reported',
'Represented',
'Researched',
'Reserved',
'Resolved',
'Responded',
'Restored',
'Restructured',
'Retained',
'Retooled',
'Retrieved',
'Reviewed',
'Revised',
'Revitalized',
'Routed',
'Safeguarded',
'Salvaged',
'Saved',
'Scanned',
'Scheduled',
'Schooled',
'Screened',
'Secured',
'Selected',
'sensitivity',
'Serviced',
'Set',
'Shaped',
'Shared',
'Simplified',
'Simulated',
'Skilled',
'Sold',
'Solicited',
'Solidified',
'Solved',
'Specialized',
'Specified',
'speed',
'Spoke',
'Standardized',
'Stimulated',
'Streamlined',
'Strengthened',
'Studied',
'Submitted',
'Suggested',
'Summarized',
'Supervised',
'Supplied',
'Supported',
'Surveyed',
'Synthesized',
'Systematized',
'Tabulated',
'Taught',
'teamwork',
'Tended',
'Tested',
'through',
'Trained',
'Translated',
'Transmitted',
'Tutored',
'Upgraded',
'Used',
'Visualized',
'Worked',
'Wrote'
]

In [6]:
my_stop_words = [word.lower() for word in my_stop_words]

In [7]:
stop_words = text.ENGLISH_STOP_WORDS.union(my_stop_words)
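
One thing to watch with a custom stop list: the vectorizer configured in section 4 tokenizes with `\w+`, so a hyphenated entry such as 'fine-tuned' or 're-designed' is split into separate tokens and can never match the stop list. Below is a minimal sketch using only the standard library. The sentence and the short `demo_stop_words` set are made up for illustration and stand in for `my_stop_words`.

```python
import re

# Hypothetical mini stop list standing in for my_stop_words (already lower-cased)
demo_stop_words = {"fine-tuned", "re-designed", "developed"}

def tokenize(doc):
    # Same idea as token_pattern=r'\w+': every run of word characters is a token
    return re.findall(r"\w+", doc.lower())

tokens = tokenize("Fine-tuned models and developed pipelines")
kept = [t for t in tokens if t not in demo_stop_words]

print(tokens)  # the hyphen splits 'fine-tuned' into 'fine' and 'tuned'
print(kept)    # 'developed' is removed, while 'fine' and 'tuned' survive
```

If such words should be filtered as well, adding their individual parts (for example 'tuned') to the list would be one option.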

## 4. Word Vectorization using CountVectorizer

### CountVectorizer: ngram_range = (1,3), counting unigrams, bigrams and trigrams

In [8]:
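# Instantiate CountVectorizer with parameters.
# token_pattern is a raw string to avoid an invalid-escape warning for \w;
# stop_words is passed as a list because newer scikit-learn versions validate this parameter and expect a list.
cvec = CountVectorizer(ngram_range=(1,3),min_df=2,max_df=0.95,token_pattern=r'\w+',stop_words=list(stop_words))
# Fit and transform the cleaned resume text, then cast to int16 to reduce memory usage
resume_cvec = cvec.fit_transform(resume_df.clean_resume).astype(np.int16)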
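
To make these parameters concrete, here is a small sketch on a made-up three-document corpus (not the resume data). It assumes the same `ngram_range`, `min_df` and `token_pattern` as the cell above. `max_df` and the stop list are left out to keep it short, and the names `docs`, `toy_cvec` and `toy_counts` are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, invented for illustration only
docs = [
    "python machine learning",
    "python machine learning sql",
    "sql server",
]

# min_df=2 keeps only the n-grams that occur in at least two documents
toy_cvec = CountVectorizer(ngram_range=(1, 3), min_df=2, token_pattern=r"\w+")
toy_counts = toy_cvec.fit_transform(docs)

# vocabulary_ maps each kept n-gram to its column index
kept_ngrams = sorted(toy_cvec.vocabulary_)
print(kept_ngrams)
print(toy_counts.toarray())
```

N-grams seen in only one document ('server', 'sql server', 'learning sql', 'machine learning sql') are dropped. Note also that `token_pattern=r"\w+"` keeps one-character tokens such as `r`, which scikit-learn's default pattern would discard.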



In [9]:
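# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() (available since 1.0) replaces it
cvec_skills = pd.DataFrame(resume_cvec.todense(), columns = cvec.get_feature_names_out())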

In [10]:
# Concatenating the skills with the resume df 
resume_df_cvec = pd.concat([resume_df[['Resume_Job_id','Resume_Job_Title']],cvec_skills],axis=1)

In [11]:
resume_df_cvec.shape

(2962, 372048)
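
With 2,962 rows and 372,046 n-gram columns, the dense frame holds roughly 1.1 billion `int16` cells, or about 2.2 GB. One possible way to keep memory down, sketched here on a toy corpus, is to pull only the needed skill columns out of the sparse matrix via `vocabulary_` before building a DataFrame. The documents, the `skills` list and all variable names are made up for illustration. This is not what the notebook does.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus and skill list, invented for illustration only
docs = ["python machine learning", "python sql python", "sql server"]
skills = ["python", "machine learning", "sql"]

sparse_cvec = CountVectorizer(ngram_range=(1, 3), token_pattern=r"\w+")
sparse_counts = sparse_cvec.fit_transform(docs)  # stays a scipy sparse matrix

# Look up the column positions of the wanted n-grams
col_idx = [sparse_cvec.vocabulary_[s] for s in skills]

# Only the selected columns are converted to a dense array
skill_counts = pd.DataFrame(
    sparse_counts[:, col_idx].toarray(),
    columns=[s.replace(" ", "_") for s in skills],
)
print(skill_counts)
```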

In [12]:
resume_df_cvec.head(1)

Unnamed: 0,Resume_Job_id,Resume_Job_Title,04th,04th martojannature,04th martojannature work,05,05 months,05 months sr,06th,06th juneto,...,zuul hysrtrix,zuul hysrtrix ribbon,zuul hystrix,zuul hystrix pivotal,zuul proxy,zuul proxy api,zz,zz server,zz server data,zz server zz
0,Data Scientist_1,Data Scientist,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [13]:
# Replacing space with underscore for the columns
resume_df_cvec.columns = [col.replace(" ","_") for col in resume_df_cvec.columns]

In [14]:
# Checking that the change is in place
resume_df_cvec.head(1)

Unnamed: 0,Resume_Job_id,Resume_Job_Title,04th,04th_martojannature,04th_martojannature_work,05,05_months,05_months_sr,06th,06th_juneto,...,zuul_hysrtrix,zuul_hysrtrix_ribbon,zuul_hystrix,zuul_hystrix_pivotal,zuul_proxy,zuul_proxy_api,zz,zz_server,zz_server_data,zz_server_zz
0,Data Scientist_1,Data Scientist,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### 4.1 First version of the recommendation system

In [15]:
def recommender_ver_1(job_post_id, no_of_skills=15, no_of_candidate=10):
    '''
    Recommendation system with input of:
    job_post_id
    no_of_skills, defaults to 15
    no_of_candidate, defaults to 10.
    
    Outputs a dataframe of the top recommended resumes with their total scores and the individual skill counts.
    '''
    pd.set_option('display.max_columns', None)
    # Regex to capture the job role at the start of the ID, e.g. 'data_scientist'
    searcher = re.search(r'^data_[a-z]{7,9}', job_post_id)
    if searcher is None:
        raise ValueError(f"Cannot parse a job role from job_post_id: {job_post_id!r}")
    role = searcher.group().lower()
    # Position of each role in job_post['JobPost_Job_Title'].unique().
    # This relies on the order in which the titles appear in the dataset.
    role_position = {'data_scientist': 0, 'data_analyst': 1, 'data_engineer': 2}
    if role not in role_position:
        raise ValueError(f"Unsupported job role: {role!r}")
    job_title = job_post['JobPost_Job_Title'].unique()[role_position[role]]
    # Identifying the skillset: the skills requested most often across all posts for this role
    skillset = (job_post[job_post['JobPost_Job_Title'] == job_title]
                .iloc[:, 2:].sum()
                .sort_values(ascending=False)
                .head(no_of_skills).index.tolist())

    # Filtered job_post dataset with the skillset selected
    job_w_skills = job_post[[job_post.columns[0]] + skillset]
    # Filtered resume dataset with the skillset selected
    resume_w_skills = resume_df_cvec[[resume_df_cvec.columns[0]] + skillset]
    # Score of every (job post, resume) pair: dot product of their skill vectors
    recommendation = pd.DataFrame(
        np.matmul(np.asarray(job_w_skills.iloc[:, 1:]), np.asarray(resume_w_skills.iloc[:, 1:]).T),
        index=job_w_skills.JobPost_Job_id,
        columns=resume_w_skills.Resume_Job_id)
    # Keep the highest-scoring resumes for the requested job post
    recommendation_list = (recommendation.T[[job_post_id]]
                           .sort_values(by=job_post_id, ascending=False)
                           .head(no_of_candidate))
    # Attach the individual skill counts of the recommended resumes
    final_list = pd.merge(
        recommendation_list,
        resume_w_skills.loc[resume_w_skills['Resume_Job_id'].isin(recommendation_list.index.tolist())].set_index('Resume_Job_id'),
        left_index=True, right_index=True)
    return final_list
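
The scoring step in `recommender_ver_1` is a matrix product between the job posts' skill vectors and the resumes' skill counts. The sketch below reproduces that step on invented numbers. The IDs, the three skills and the assumption that job posts are encoded as 0/1 flags are all hypothetical.

```python
import numpy as np
import pandas as pd

skills = ["python", "sql", "spark"]

# Hypothetical job posts: 1 if the post asks for the skill, else 0
jobs = pd.DataFrame([[1, 1, 0], [0, 1, 1]],
                    index=["job_a", "job_b"], columns=skills)
# Hypothetical resumes: how often each skill is mentioned
resumes = pd.DataFrame([[5, 0, 0], [1, 2, 1], [0, 0, 4]],
                       index=["resume_1", "resume_2", "resume_3"], columns=skills)

# One score per (job, resume) pair: dot product of the two skill vectors
scores = pd.DataFrame(np.matmul(jobs.to_numpy(), resumes.to_numpy().T),
                      index=jobs.index, columns=resumes.index)

# Rank the resumes for a single job post
ranking = scores.loc["job_a"].sort_values(ascending=False)
print(scores)
print(ranking)
```

In this toy case `resume_1` ranks first for `job_a` on the strength of one repeated keyword, even though `resume_2` covers both requested skills.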

In [16]:
random_job = job_post.JobPost_Job_id.sample(1).values[0]
random_job

'data_scientist_2409'

In [17]:
recommender_ver_1(job_post_id= random_job, no_of_skills=10, no_of_candidate=10)

Resume_Job_id,data_scientist_2409,python,machine_learning,r,sql,hadoop,spark,data_mining,java,sas,natural_language_processing
Python Developer_1952,151,125,0,1,23,3,11,1,4,2,0
Database Administrator_2602,123,0,0,1,123,0,0,0,0,0,0
Python Developer_1931,120,106,3,0,9,2,7,0,10,0,0
Python Developer_1852,115,57,18,10,29,8,15,1,5,0,3
Python Developer_1898,110,76,6,1,24,2,3,2,1,3,2
Python Developer_1864,107,81,0,0,26,0,0,0,8,1,0
Python Developer_1818,107,82,3,0,18,3,7,1,17,0,1
Python Developer_1783,107,90,0,0,15,2,8,0,18,0,0
Python Developer_1836,106,79,0,0,22,4,1,0,11,0,1
Python Developer_1823,102,84,0,0,18,0,0,0,1,0,0


###### Comments: 

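For the job post sampled above, the first version of the model does not recommend any Data Scientist resumes: nine of the ten results are Python Developers and one is a Database Administrator. The ranking is driven by raw keyword counts rather than by breadth of skills. The top candidate, Python Developer_1952, scores 151 largely from 125 mentions of 'python' with no mention of 'machine_learning', and the second candidate, Database Administrator_2602, owes its score of 123 to 123 mentions of 'sql'. <br>
This means a resume can reach the top of the list simply through keyword stuffing. To combat this issue, TF-IDF will be used to vectorize the resumes.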

**Note**: Recommender_ver_2 (TF-IDF Vectorizer) is built in the next notebook because of memory constraints.
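
As a rough preview of why TF-IDF might help, the sketch below compares raw counts with scikit-learn's default `TfidfVectorizer` (which L2-normalises each document) on two invented resumes. It is only an illustration under these assumptions. The actual Recommender_ver_2 lives in the next notebook and may be set up differently.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Two invented resumes: one repeats a single keyword, one lists several skills
toy_resumes = [
    "python " * 20,             # keyword-stuffed
    "python sql spark hadoop",  # balanced
]

def total_score(vectorizer):
    # Sum each resume's weights over all columns (every token here is a skill)
    weights = vectorizer.fit_transform(toy_resumes)
    return np.asarray(weights.sum(axis=1)).ravel()

count_scores = total_score(CountVectorizer(token_pattern=r"\w+"))
tfidf_scores = total_score(TfidfVectorizer(token_pattern=r"\w+"))

print(count_scores)  # raw counts favour the stuffed resume
print(tfidf_scores)  # L2-normalised TF-IDF favours the balanced one
```

Because each TF-IDF row has unit length, repeating one word cannot push a resume's total above 1.0, whereas a resume covering several skills spreads its weight over more columns and sums to more than 1.0.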