<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Capstone - Resumes and Job Ads Recommender

# Problem Statement

HR practitioners and hiring managers can spend too much time sifting through large numbers of resumes to shortlist suitable candidates to contact for an interview.
As job seekers, we may also find ourselves spending a lot of time looking through numerous job advertisements that may not be relevant to us.
Wouldn't it be nice if a pre-selection step could save time for all of us?

We will use Natural Language Processing and a recommender system to group similar job seekers and job advertisements.
Success will be evaluated by a metric that is still to be decided (TBD on model), measuring how well the grouped job seekers are matched to the most suitable job advertisements and vice versa.
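Since the matching model is still TBD, the following is only a minimal sketch of one possible approach, not the method chosen for this capstone: ranking job ads against a resume by cosine similarity over raw word counts. The function names `cosine_similarity` and `rank_job_ads` and the toy token lists are hypothetical and exist purely for illustration.

```python
import math
from collections import Counter


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between two token lists using raw term counts."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[t] * counts_b[t] for t in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


def rank_job_ads(resume_tokens, job_ads):
    """Return (ad_id, score) pairs sorted from most to least similar."""
    scores = [(ad_id, cosine_similarity(resume_tokens, tokens))
              for ad_id, tokens in job_ads.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy data, already tokenised
resume = ['accountant', 'bookkeeping', 'reconcile', 'report']
ads = {
    'ad_accounting': ['accountant', 'report', 'audit'],
    'ad_it': ['developer', 'python', 'network'],
}
ranking = rank_job_ads(resume, ads)
print(ranking)
```

The same function could rank resumes for a given job ad by swapping the arguments, which covers the "vice versa" direction.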

# Executive Summary

We scraped the website jobspider.com for resumes. Since no API key is available, we used BeautifulSoup and regex to extract the desired information. Because we ran into connection timeouts even after setting a bot user agent, we limited the job categories to Accounting and Information Technology for this capstone.
For the job ads, we used an existing Kaggle dataset that was originally used for predicting fake job postings, since the features in this dataset have 80% similarity to those in the resumes dataset.

While cleaning the resumes dataset, we also decided which features to keep and which to drop. Since job title, objective, experience and skills are free-text fields that hold meaningful words for our analysis, we combine them into a new feature, split the text into words, reduce the words to their root form and remove the stop words. The same steps are performed for the job ads dataset.
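The text preparation described above can be sketched roughly as follows. This is an illustration under stated assumptions rather than the notebook's actual cleaning code: the stop-word set is a tiny hypothetical sample, `crude_stem` is a naive suffix stripper standing in for a real stemmer or lemmatizer, and the mapping of "job title" to the `job_func_sought` column is assumed.

```python
import re

# Hypothetical, deliberately tiny stop-word list; a real run would use a fuller one
STOP_WORDS = {'a', 'an', 'and', 'in', 'of', 'the', 'to', 'with', 'for'}


def crude_stem(word):
    """Naive suffix stripping, standing in for a real stemmer or lemmatizer."""
    for suffix in ('ing', 'ed', 'es', 's'):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


def preprocess(record, fields=('job_func_sought', 'objective', 'exp', 'skills')):
    """Combine the free-text fields, tokenise, drop stop words and stem."""
    # Missing fields (None) are treated as empty strings
    combined = ' '.join(str(record.get(field) or '') for field in fields)
    tokens = re.findall(r'[a-z]+', combined.lower())
    return [crude_stem(token) for token in tokens if token not in STOP_WORDS]


# Hypothetical record shaped like one row of the resumes dataset
record = {
    'job_func_sought': 'accountant',
    'objective': 'Seeking a challenging position',
    'exp': None,
    'skills': 'Reconciling accounts and reports',
}
tokens = preprocess(record)
print(tokens)
```

Applying the same function to the job ads (with their own field names passed as `fields`) would keep both datasets in a comparable token form.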

TBC.....


### Contents:
- [Scraping of resumes - Accounting](#Scraping-of-resumes---Accounting)

## Scraping of resumes - Accounting

In [1]:
import requests
import pandas as pd
import regex as re
import numpy as np
import random
import time
from bs4 import BeautifulSoup

In [2]:
# Check that the link is working well for scraping
url = 'https://www.jobspider.com/job/resume-search-results.asp/category_1'
res = requests.get(url)
res

<Response [200]>

In [3]:
%%time

# Scrape the paginated resume listing
# Create an empty list to store the scraped items
table = []

# Iterate through pages 1 to 30
for num in range(1, 31):
    link = url + '/page_' + str(num)
    res = requests.get(link)
    soup = BeautifulSoup(res.content, 'lxml')
    tbody = soup.find_all('td', align='center')
    
    # Each listing spans 6 consecutive ('td', align='center') cells, so step through the cells in groups of 6
    for i in range(len(tbody) // 6):
        items = {}
        items['sn'] = tbody[6*i].text
        items['date_posted'] = tbody[6*i + 1].text
        items['job_func_sought'] = tbody[6*i + 2].text.lower()
        items['category'] = tbody[6*i + 3].text
        items['location'] = tbody[6*i + 4].text
        items['resume_href'] = tbody[6*i + 5].find('a')['href']
        
        table.append(items)
table

Wall time: 42.5 s


[{'sn': '1',
  'date_posted': '3/30/2020',
  'job_func_sought': 'bookkeeping/accounting',
  'category': 'Accounting/Bookkeeping',
  'location': 'Dartmouth, NS',
  'resume_href': '/job/view-resume-82533.html'},
 {'sn': '2',
  'date_posted': '3/16/2020',
  'job_func_sought': 'stainless steel decorative sheet',
  'category': 'Accounting/Bookkeeping',
  'location': 'Xian, AL',
  'resume_href': '/job/view-resume-82510.html'},
 {'sn': '3',
  'date_posted': '3/4/2020',
  'job_func_sought': 'accountant',
  'category': 'Accounting/Bookkeeping',
  'location': 'Orlando, FL',
  'resume_href': '/job/view-resume-82487.html'},
 {'sn': '4',
  'date_posted': '2/27/2020',
  'job_func_sought': 'admin assistant',
  'category': 'Accounting/Bookkeeping',
  'location': 'Brampton, ON',
  'resume_href': '/job/view-resume-82476.html'},
 {'sn': '5',
  'date_posted': '2/27/2020',
  'job_func_sought': 'admin assistant',
  'category': 'Accounting/Bookkeeping',
  'location': 'Brampton, ON',
  'resume_href': '/job/vi
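The group-of-6 indexing used in the listing scrape can be illustrated on plain strings. This is a small sketch, assuming the cells have already been reduced to text; `chunk_rows` and `FIELDS` are hypothetical names, and the sample values are taken from the first two records above.

```python
FIELDS = ('sn', 'date_posted', 'job_func_sought', 'category', 'location', 'resume_href')


def chunk_rows(cells, fields=FIELDS):
    """Group a flat list of cell values into one dict per listing row."""
    width = len(fields)
    n_rows = len(cells) // width  # an incomplete trailing group is dropped
    return [dict(zip(fields, cells[i * width:(i + 1) * width]))
            for i in range(n_rows)]


cells = [
    '1', '3/30/2020', 'bookkeeping/accounting', 'Accounting/Bookkeeping',
    'Dartmouth, NS', '/job/view-resume-82533.html',
    '2', '3/16/2020', 'stainless steel decorative sheet', 'Accounting/Bookkeeping',
    'Xian, AL', '/job/view-resume-82510.html',
    'stray cell',  # does not fill a full group of 6, so it is ignored
]
rows = chunk_rows(cells)
print(len(rows), rows[1]['location'])
```

Integer division means a partially filled final group is silently skipped, which is worth keeping in mind if the page layout ever changes.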

In [8]:
# Convert scraped data into DataFrame
table1 = pd.DataFrame(table)

In [9]:
# Check out the shape and first 5 items
print(table1.shape)
table1.head()

(1500, 6)


Unnamed: 0,sn,date_posted,job_func_sought,category,location,resume_href
0,1,3/30/2020,bookkeeping/accounting,Accounting/Bookkeeping,"Dartmouth, NS",/job/view-resume-82533.html
1,2,3/16/2020,stainless steel decorative sheet,Accounting/Bookkeeping,"Xian, AL",/job/view-resume-82510.html
2,3,3/4/2020,accountant,Accounting/Bookkeeping,"Orlando, FL",/job/view-resume-82487.html
3,4,2/27/2020,admin assistant,Accounting/Bookkeeping,"Brampton, ON",/job/view-resume-82476.html
4,5,2/27/2020,admin assistant,Accounting/Bookkeeping,"Brampton, ON",/job/view-resume-82475.html


In [7]:
%%time

# Scrape each individual resume page
headers = {'User-agent': 'SL Bot 1.0'}
base_url = 'https://www.jobspider.com'
# After multiple tries, 800 was set manually as a range that scrapes without a timeout error
href_list = [table[n]['resume_href'] for n in range(800)]
cv_dict = []

# Output column name -> label text that marks the matching line on the resume page
labels = {
    'id': 'SpiderID:',
    'emp_type': 'Type of Position:',
    'availability': 'Availability Date:',
    'desired_wage': 'Desired Wage:',
    'work_auth': 'U.S. Work Authorization:',
    'job_level': 'Job Level:',
    'will_travel': 'Willing to Travel:',
    'edu_level': 'Highest Degree Attained:',
    'will_reloc': 'Willing to Relocate:',
    'objective': 'Objective:',
    'exp': 'Experience:',
    'edu': 'Education:',
    'skills': 'Skills:',
    'add_info': 'Additional Information:',
    'contact_info': 'Contact Information:',
}

# Combine each resume_href with the base url and scrape the resume behind it
for href in href_list:
    cv_res = requests.get(base_url + href, headers=headers)
    cv_soup = BeautifulSoup(cv_res.content, 'lxml')
    cv_body = cv_soup.find_all('table', align='center')
    cv_list = cv_body[1].text.splitlines()
    
    # For each column, keep every line of the resume that contains its label
    cv_items = {col: [text for text in cv_list if label in text]
                for col, label in labels.items()}
    cv_dict.append(cv_items)

cv_dict

Wall time: 15min 3s


[{'id': ['SpiderID: 82533'],
  'emp_type': ['Type of Position: Full-Time Permanent'],
  'availability': ['Availability Date: '],
  'desired_wage': ['Desired Wage: '],
  'work_auth': ['U.S. Work Authorization: '],
  'job_level': ['Job Level: Experienced with over 2 years experience'],
  'will_travel': ['Willing to Travel: '],
  'edu_level': ['Highest Degree Attained: Bachelors'],
  'will_reloc': ['Willing to Relocate: '],
  'objective': ["Objective:Bookkeeping/Finance professional with experience in preparing, maintaining, analyzing, verifying, and reconciling financial transactions, statements, records, and reports. Able to explain complicated financial principles and processes to a variety of professional and non-professional audience. Member of the Association of Chartered Certified Accountants (ACCA), holder of BSc in Applied Accounting from Oxford Brookes University in London with over 8 years' experience gained in Cameroon. Advanced communication (fluent in English and good in Fre
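Because timeouts were the main obstacle in this scrape, one option is to wrap each request in a retry with exponential backoff. The helper below is only a sketch and not what the notebook ran: `fetch_with_retry` is a hypothetical name, it takes the fetching function as a parameter so it can be exercised without network access, and the simulated failure is made up for the demonstration.

```python
import time


def fetch_with_retry(fetch, url, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url), retrying with exponential backoff when it raises."""
    for attempt in range(retries):
        try:
            return fetch(url)
        except Exception:
            if attempt == retries - 1:
                raise  # out of attempts, surface the last error
            sleep(base_delay * 2 ** attempt)  # wait 1s, 2s, 4s, ...


# Simulated fetcher that times out twice before succeeding
calls = []


def flaky_fetch(url):
    calls.append(url)
    if len(calls) < 3:
        raise TimeoutError('simulated timeout')
    return 'ok'


delays = []  # record the waits instead of actually sleeping
result = fetch_with_retry(flaky_fetch, 'https://example.invalid/resume', sleep=delays.append)
print(result, delays)
```

With `requests`, the `fetch` argument could be something like `lambda u: requests.get(u, headers=headers, timeout=10)`, so that a stalled connection raises instead of hanging. Catching every `Exception` is a simplification; narrowing it to the request errors actually seen would be safer.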

In [10]:
# Convert scraped data into DataFrame
table2 = pd.DataFrame(cv_dict)

In [11]:
# Checking the tail of table2 to identify the id of the last resume
table2.tail()

Unnamed: 0,id,emp_type,availability,desired_wage,work_auth,job_level,will_travel,edu_level,will_reloc,objective,exp,edu,skills,add_info,contact_info
795,[SpiderID: 55525],[Type of Position: Other],[Availability Date: ASAP],[Desired Wage: ],[U.S. Work Authorization: Yes],[Job Level: New Grad/Entry Level],[Willing to Travel: ],[Highest Degree Attained: Other],[Willing to Relocate: Yes],[Objective:Bookkeeper/Accountant/Office ],[Experience:Please read next section],[Education:College],[Skills:Soft and hard skills],[Additional Information:Bookkeeper/accounting ...,[Candidate Contact Information:]
796,[SpiderID: 55501],[Type of Position: Full-Time Permanent],[Availability Date: 11-02-11],"[Desired Wage: 24,000]",[U.S. Work Authorization: ],[Job Level: Experienced with over 2 years expe...,"[Willing to Travel: Yes, Less Than 25%]",[Highest Degree Attained: High School/Equivalent],[Willing to Relocate: Yes],[Objective:To find a job that matches my exper...,[Experience:I have over 25 years experience wo...,[],"[Skills:I am a quick learner, efficient, will ...",[],[Candidate Contact Information:]
797,[SpiderID: 55479],[Type of Position: Full-Time Permanent],[Availability Date: 11/1/2011],"[Desired Wage: 42,000.00]",[U.S. Work Authorization: Yes],[Job Level: Experienced with over 2 years expe...,"[Willing to Travel: Yes, Less Than 25%]",[Highest Degree Attained: Other],[Willing to Relocate: Undecided],[Objective:Seeking a challenging position in a...,[Experience:The past 12 years with the Hartfor...,[],[Skills:Customer Service\t\t\t\t\t Unclaime...,[],[Candidate Contact Information:]
798,[SpiderID: 55478],[Type of Position: Full-Time Permanent],[Availability Date: Immediately],[Desired Wage: 45000],[U.S. Work Authorization: Yes],[Job Level: Experienced with over 2 years expe...,"[Willing to Travel: Yes, 25-50%]",[Highest Degree Attained: Bachelors],[Willing to Relocate: No],"[Objective:Self-motivated, quality-focused, an...","[Experience:ROOHIN NABIZADA INC. - TASHKENT, U...",[Education:Bachelor of Science in Business Adm...,[Skills:Strategic and Tactical Planning\tTime ...,[Additional Information:AWARDS AND HONORSAward...,[Candidate Contact Information:]
799,[SpiderID: 55472],[Type of Position: Full-Time Permanent],[Availability Date: 11/2/11],[Desired Wage: 75000],[U.S. Work Authorization: Yes],[Job Level: Experienced with over 2 years expe...,"[Willing to Travel: Yes, 25-50%]",[Highest Degree Attained: MBA],[Willing to Relocate: No],[],[],[],[],[],[Candidate Contact Information:]


In [16]:
# table1 has 1500 rows but table2 has only 800, so keep the first 800 rows of table1 before concatenating
# Double check where to cut table1: the last row kept should match the last id in table2, i.e. 55472
table1 = table1.iloc[0:800]
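The double check on the cut-off could also be done for every row rather than only the last one, by comparing the number embedded in each `resume_href` with the scraped `SpiderID`. This is a hedged sketch: the helper names are hypothetical, and it assumes the href pattern `view-resume-<id>.html` seen in the output above and that each id is passed as a plain string (the first element of the list stored in `table2`).

```python
import re


def href_id(href):
    """Extract the numeric id from a link such as /job/view-resume-82533.html."""
    match = re.search(r'view-resume-(\d+)\.html', href)
    return match.group(1) if match else None


def spider_id(id_field):
    """Extract the numeric id from a string such as 'SpiderID: 82533'."""
    match = re.search(r'SpiderID:\s*(\d+)', id_field)
    return match.group(1) if match else None


def misaligned_rows(hrefs, id_fields):
    """Return the positions where the listing link and the scraped id disagree."""
    return [i for i, (href, field) in enumerate(zip(hrefs, id_fields))
            if href_id(href) != spider_id(field)]


hrefs = ['/job/view-resume-82533.html', '/job/view-resume-82510.html']
ids = ['SpiderID: 82533', 'SpiderID: 82510']
print(misaligned_rows(hrefs, ids))
```

An empty list means the two tables line up row by row, which is what the positional `pd.concat(..., axis=1)` relies on.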

In [17]:
# Concatenate table1 and table2
df_accounting = pd.concat([table1, table2], axis=1)

In [19]:
# Checking out the shape and first 5 items
print(df_accounting.shape)
df_accounting.head()

(800, 21)


Unnamed: 0,sn,date_posted,job_func_sought,category,location,resume_href,id,emp_type,availability,desired_wage,...,job_level,will_travel,edu_level,will_reloc,objective,exp,edu,skills,add_info,contact_info
0,1,3/30/2020,bookkeeping/accounting,Accounting/Bookkeeping,"Dartmouth, NS",/job/view-resume-82533.html,[SpiderID: 82533],[Type of Position: Full-Time Permanent],[Availability Date: ],[Desired Wage: ],...,[Job Level: Experienced with over 2 years expe...,[Willing to Travel: ],[Highest Degree Attained: Bachelors],[Willing to Relocate: ],[Objective:Bookkeeping/Finance professional wi...,[Experience:Finance and Administration Directo...,[Education:EducationBSc. in Applied Accounting...,[],[Additional Information:Certifications• QuickB...,[Candidate Contact Information:]
1,2,3/16/2020,stainless steel decorative sheet,Accounting/Bookkeeping,"Xian, AL",/job/view-resume-82510.html,[SpiderID: 82510],[Type of Position: Full-Time Permanent],[Availability Date: ],[Desired Wage: ],...,[Job Level: New Grad/Entry Level],[Willing to Travel: ],[Highest Degree Attained: ],[Willing to Relocate: ],"[Objective:Shaanxi Tonghui Steel Co., Ltd. is ...",[],[],[],[],[Candidate Contact Information:]
2,3,3/4/2020,accountant,Accounting/Bookkeeping,"Orlando, FL",/job/view-resume-82487.html,[SpiderID: 82487],[Type of Position: Contractor],[Availability Date: ],[Desired Wage: ],...,"[Job Level: Management (Manager, Director)]",[Willing to Travel: No],[Highest Degree Attained: ],[Willing to Relocate: No],[],[],[],[],[],[Candidate Contact Information:]
3,4,2/27/2020,admin assistant,Accounting/Bookkeeping,"Brampton, ON",/job/view-resume-82476.html,[SpiderID: 82476],[Type of Position: Full-Time Permanent],[Availability Date: ASAP],[Desired Wage: $20-$25],...,[Job Level: New Grad/Entry Level],[Willing to Travel: No],[Highest Degree Attained: Masters],[Willing to Relocate: Undecided],[Objective:A positive and motivated IT Post-gr...,"[Experience:BS Computers, Punjab, India•\tDiag...",[Education:Master of Information technology\t\...,[Skills:Computer software: Microsoft world (Wo...,[Additional Information:•\tKeen Learner•\tSoun...,[Candidate Contact Information:]
4,5,2/27/2020,admin assistant,Accounting/Bookkeeping,"Brampton, ON",/job/view-resume-82475.html,[SpiderID: 82475],[Type of Position: Full-Time Permanent],[Availability Date: ],[Desired Wage: ],...,[Job Level: New Grad/Entry Level],[Willing to Travel: ],[Highest Degree Attained: ],[Willing to Relocate: ],[],[],[],[],[],[Candidate Contact Information:]


In [21]:
# Remove the square brackets: keep the first element of each list (empty lists become NaN)
list_cols = ['id', 'emp_type', 'availability', 'desired_wage', 'work_auth',
             'job_level', 'will_travel', 'edu_level', 'will_reloc', 'objective',
             'exp', 'edu', 'skills', 'add_info', 'contact_info']

for col in list_cols:
    df_accounting[col] = df_accounting[col].str.get(0)

In [22]:
# Final check of the dataframe
df_accounting.head()

Unnamed: 0,sn,date_posted,job_func_sought,category,location,resume_href,id,emp_type,availability,desired_wage,...,job_level,will_travel,edu_level,will_reloc,objective,exp,edu,skills,add_info,contact_info
0,1,3/30/2020,bookkeeping/accounting,Accounting/Bookkeeping,"Dartmouth, NS",/job/view-resume-82533.html,SpiderID: 82533,Type of Position: Full-Time Permanent,Availability Date:,Desired Wage:,...,Job Level: Experienced with over 2 years exper...,Willing to Travel:,Highest Degree Attained: Bachelors,Willing to Relocate:,Objective:Bookkeeping/Finance professional wit...,Experience:Finance and Administration Director...,"Education:EducationBSc. in Applied Accounting,...",,Additional Information:Certifications• QuickBo...,Candidate Contact Information:
1,2,3/16/2020,stainless steel decorative sheet,Accounting/Bookkeeping,"Xian, AL",/job/view-resume-82510.html,SpiderID: 82510,Type of Position: Full-Time Permanent,Availability Date:,Desired Wage:,...,Job Level: New Grad/Entry Level,Willing to Travel:,Highest Degree Attained:,Willing to Relocate:,"Objective:Shaanxi Tonghui Steel Co., Ltd. is a...",,,,,Candidate Contact Information:
2,3,3/4/2020,accountant,Accounting/Bookkeeping,"Orlando, FL",/job/view-resume-82487.html,SpiderID: 82487,Type of Position: Contractor,Availability Date:,Desired Wage:,...,"Job Level: Management (Manager, Director)",Willing to Travel: No,Highest Degree Attained:,Willing to Relocate: No,,,,,,Candidate Contact Information:
3,4,2/27/2020,admin assistant,Accounting/Bookkeeping,"Brampton, ON",/job/view-resume-82476.html,SpiderID: 82476,Type of Position: Full-Time Permanent,Availability Date: ASAP,Desired Wage: $20-$25,...,Job Level: New Grad/Entry Level,Willing to Travel: No,Highest Degree Attained: Masters,Willing to Relocate: Undecided,Objective:A positive and motivated IT Post-gra...,"Experience:BS Computers, Punjab, India•\tDiagn...",Education:Master of Information technology\t\t...,Skills:Computer software: Microsoft world (Wor...,Additional Information:•\tKeen Learner•\tSound...,Candidate Contact Information:
4,5,2/27/2020,admin assistant,Accounting/Bookkeeping,"Brampton, ON",/job/view-resume-82475.html,SpiderID: 82475,Type of Position: Full-Time Permanent,Availability Date:,Desired Wage:,...,Job Level: New Grad/Entry Level,Willing to Travel:,Highest Degree Attained:,Willing to Relocate:,,,,,,Candidate Contact Information:
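The values above still carry their page labels (for example `SpiderID: 82533`). A possible follow-up step, which this section does not perform, is to strip everything up to the first colon. The sketch below is an assumption about how that might look; `strip_label` is a hypothetical helper, and non-string values such as NaN are passed through unchanged.

```python
def strip_label(value):
    """Drop the leading 'Label:' part of a scraped value, if there is one."""
    if not isinstance(value, str):
        return value  # e.g. NaN from an empty list stays as it is
    _, sep, rest = value.partition(':')  # split at the first colon only
    return rest.strip() if sep else value.strip()


samples = [
    'SpiderID: 82533',
    'Availability Date: ',
    'Skills:Computer software: Microsoft',
    'U.S. Work Authorization: Yes',
    None,
]
cleaned = [strip_label(sample) for sample in samples]
print(cleaned)
```

On the DataFrame this could be applied per column, for instance with `df_accounting['id'].map(strip_label)`. Splitting only at the first colon keeps any colons inside the value itself.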


In [24]:
# Save a copy to csv
#df_accounting.to_csv('./datasets/Accounting.csv', index=False)