<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Capstone - Resumes and Job Ads Recommender

# Problem Statement

HR practitioners and hiring managers can spend too much time sifting through resumes to shortlist suitable candidates to contact for interviews.
As job seekers, we may also find ourselves spending a lot of time looking through numerous job advertisements that may not be relevant to us.
Wouldn't it be nice if a pre-selection step could save time for all of us?

We will use Natural Language Processing and a recommender system to group similar job seekers and job advertisements.
Success will be evaluated by the (TBD on model) to match the grouped job seekers to the most suitable job advertisements and vice versa.
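
As a rough illustration of the matching idea only, and not the model this project settles on (which is still to be decided), the sketch below ranks job ads against a resume by the cosine similarity of their word counts. The function names and sample texts are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine similarity between the word-count vectors of two texts."""
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())
    # Dot product over the words the two texts share
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_ads(resume, ads):
    """Return (ad, score) pairs sorted from most to least similar."""
    scored = [(ad, cosine_similarity(resume, ad)) for ad in ads]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical sample texts, for illustration only
resume = "python web developer with sql experience"
ads = ["senior accountant for audit and tax", "web developer python sql"]
ranked = rank_ads(resume, ads)
```

Ads that share more words with the resume rank higher; an ad with no words in common scores 0.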

# Executive Summary

We scraped the website jobspider.com for resumes. Since an API key is not available, we used BeautifulSoup and regex to extract the desired information. Because we kept running into connection timeouts despite setting a bot user agent, we limited the job categories to Accounting and Information Technology for this capstone.
For the job ads, we use an existing dataset available on Kaggle, which was originally used for predicting fake job postings, since its features have 80% similarity to those of the resumes dataset.

While cleaning the resumes dataset, we also decided which features were important to keep and which to drop. Since job title, objective, experience and skills are free-text fields that hold meaningful words for our analysis, we combine them into a new feature, then split the text into words, reduce the words to their root form and remove the stop words. The same steps are performed on the job ads dataset.
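
The text-preparation steps described above can be sketched roughly as follows. This is a minimal, standard-library-only illustration: the stop-word list is deliberately tiny, and `crude_stem` is a hypothetical stand-in for a proper stemmer or lemmatizer, not the one used in this project.

```python
import re

# Tiny illustrative stop-word list; a real run would use a fuller list
STOP_WORDS = {"a", "an", "and", "in", "of", "the", "to", "with"}
SUFFIXES = ("ing", "ed", "s")


def crude_stem(word):
    """Strip one common suffix; a stand-in for a proper stemmer or lemmatizer."""
    for suffix in SUFFIXES:
        # Only strip when at least 3 characters of the word remain
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(*fields):
    """Combine free-text fields, tokenize, drop stop words, and stem."""
    combined = " ".join(f for f in fields if f)  # skip missing fields
    tokens = re.findall(r"[a-z]+", combined.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]


# Hypothetical job title, objective, experience and skills fields
tokens = preprocess("Web Developer", "Building websites", None, "Python and SQL skills")
```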

TBC.....


### Contents:
- [Scraping of resumes - Information Technology](#Scraping-of-resumes---Information-Technology)

## Scraping of resumes - Information Technology

In [1]:
import requests
import pandas as pd
import regex as re
import numpy as np
import random
import time
from bs4 import BeautifulSoup

In [2]:
# Check that the link works for scraping
url = 'https://www.jobspider.com/job/resume-search-results.asp/category_121'
res = requests.get(url)
res

<Response [200]>

In [3]:
%%time

# Scrape the listing table page by page
# Create an empty list for storing the scraped items
table = []

# Iterate through the paginated listing
for num in range(1, 31):
    link = url + '/page_' + str(num)
    res = requests.get(link)
    soup = BeautifulSoup(res.content, 'lxml')
    tbody = soup.find_all('td', align='center')
    
    # Each listing row has 6 ('td', align='center') cells, so step through the cells in groups of 6
    for i in range(len(tbody) // 6):
        items = {}
        items['sn'] = tbody[6*i].text
        items['date_posted'] = tbody[6*i + 1].text
        items['job_func_sought'] = tbody[6*i + 2].text.lower()
        items['category'] = tbody[6*i + 3].text
        items['location'] = tbody[6*i + 4].text
        items['resume_href'] = tbody[6*i + 5].find('a')['href']
        
        table.append(items)
table

Wall time: 43.2 s


[{'sn': '1',
  'date_posted': '3/26/2020',
  'job_func_sought': 'it helpdesk support technician',
  'category': 'Information Technology',
  'location': 'Las Vegas, NV',
  'resume_href': '/job/view-resume-82527.html'},
 {'sn': '2',
  'date_posted': '3/25/2020',
  'job_func_sought': 'web developer',
  'category': 'Information Technology',
  'location': 'surrey, BC',
  'resume_href': '/job/view-resume-82525.html'},
 {'sn': '3',
  'date_posted': '3/23/2020',
  'job_func_sought': 'it helpdesk support technician',
  'category': 'Information Technology',
  'location': 'Las Vegas, NV',
  'resume_href': '/job/view-resume-82521.html'},
 {'sn': '4',
  'date_posted': '3/2/2020',
  'job_func_sought': 'it education professional',
  'category': 'Information Technology',
  'location': 'Lutz, FL',
  'resume_href': '/job/view-resume-82483.html'},
 {'sn': '5',
  'date_posted': '2/25/2020',
  'job_func_sought': 'senior systems administrator supervisor',
  'category': 'Information Technology',
  'location'
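
The group-of-6 logic in the loop above can be tried offline on plain strings. The sketch below is a hypothetical helper (`cells_to_records` is not part of the notebook) that assumes the cell text and links have already been extracted into a flat list.

```python
FIELDS = ["sn", "date_posted", "job_func_sought", "category", "location", "resume_href"]


def cells_to_records(cells, fields=FIELDS):
    """Group a flat list of cell values into one dict per listing row."""
    width = len(fields)
    records = []
    # Step through the flat list one row (width cells) at a time;
    # a trailing partial row is skipped, as in the loop above
    for start in range(0, len(cells) - width + 1, width):
        row = cells[start:start + width]
        records.append(dict(zip(fields, row)))
    return records


# Values copied from the first two rows of the scraped output
cells = [
    "1", "3/26/2020", "it helpdesk support technician",
    "Information Technology", "Las Vegas, NV", "/job/view-resume-82527.html",
    "2", "3/25/2020", "web developer",
    "Information Technology", "surrey, BC", "/job/view-resume-82525.html",
]
records = cells_to_records(cells)
```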

In [9]:
# Convert scraped data into DataFrame
table1 = pd.DataFrame(table)

In [10]:
# Check out the shape and first 5 items
print(table1.shape)
table1.head()

(1500, 6)


Unnamed: 0,sn,date_posted,job_func_sought,category,location,resume_href
0,1,3/26/2020,it helpdesk support technician,Information Technology,"Las Vegas, NV",/job/view-resume-82527.html
1,2,3/25/2020,web developer,Information Technology,"surrey, BC",/job/view-resume-82525.html
2,3,3/23/2020,it helpdesk support technician,Information Technology,"Las Vegas, NV",/job/view-resume-82521.html
3,4,3/2/2020,it education professional,Information Technology,"Lutz, FL",/job/view-resume-82483.html
4,5,2/25/2020,senior systems administrator supervisor,Information Technology,"Ventura, CA",/job/view-resume-82473.html


In [6]:
%%time

# Scrape each resume page
# (%%time has to be the first line of the cell to work as a cell magic)

headers = {'User-agent': 'SL Bot 2.0'}
base_url = 'https://www.jobspider.com'
# After multiple tries, a manually chosen range of 1000 scraped successfully (without timeout error)
href_list = [table[n]['resume_href'] for n in range(1000)]
cv_dict = []

# Map each output column to the label that introduces it on the resume page
labels = {
    'id': 'SpiderID:',
    'emp_type': 'Type of Position:',
    'availability': 'Availability Date:',
    'desired_wage': 'Desired Wage:',
    'work_auth': 'U.S. Work Authorization:',
    'job_level': 'Job Level:',
    'will_travel': 'Willing to Travel:',
    'edu_level': 'Highest Degree Attained:',
    'will_reloc': 'Willing to Relocate:',
    'objective': 'Objective:',
    'exp': 'Experience:',
    'edu': 'Education:',
    'skills': 'Skills:',
    'add_info': 'Additional Information:',
    'contact_info': 'Contact Information:',
}

# Combine each resume_href from the listing with the base url and visit each link
for href in href_list:
    cv_res = requests.get(base_url + href, headers=headers)
    cv_soup = BeautifulSoup(cv_res.content, 'lxml')
    cv_body = cv_soup.find_all('table', align='center')
    cv_list = cv_body[1].text.splitlines()
    
    # For each column, keep the lines of the resume that contain its label
    cv_items = {col: [text for text in cv_list if label in text] for col, label in labels.items()}
    cv_dict.append(cv_items)
        
cv_dict

Wall time: 19min 37s


[{'id': ['SpiderID: 82527'],
  'emp_type': ['Type of Position: Full-Time Permanent'],
  'availability': ['Availability Date: 3/26/2020'],
  'desired_wage': ['Desired Wage: '],
  'work_auth': ['U.S. Work Authorization: Yes'],
  'job_level': ['Job Level: Experienced with over 2 years experience'],
  'will_travel': ['Willing to Travel: Yes, Less Than 25%'],
  'edu_level': ['Highest Degree Attained: Bachelors'],
  'will_reloc': ['Willing to Relocate: Yes'],
  'objective': ['Objective:Dedicated IT expert with 10+ years’ experience in providing quality technical support to users across various companies. A certified technician with a bachelor’s degree in Information Systems from the University of Alabama. Solution oriented worker who adopts a customer centric approach in all support tasks, and communicates effectively with audiences in and outside the IT profession. Delivers exceptional services in mobile device and computer systems maintenance, troubleshooting and repair. Looking to obtain 
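
Since connection timeouts were the main roadblock during scraping, one possible mitigation is to retry failed requests with a growing, randomized pause. The sketch below is a hypothetical helper, not what was run for this notebook; it takes the fetching function as an argument so it can be tried without network access.

```python
import random
import time


def fetch_with_retry(fetch, url, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url), retrying with a growing, jittered delay on failure."""
    for attempt in range(retries):
        try:
            return fetch(url)
        except Exception:
            if attempt == retries - 1:
                raise  # out of attempts: surface the last error
            # Wait longer after each failure, plus random jitter
            sleep(base_delay * (2 ** attempt) + random.uniform(0, 1))
```

In the scraping loop this could be called as `fetch_with_retry(lambda u: requests.get(u, headers=headers, timeout=10), base_url + href)`. For real use, the broad `except Exception` would be better narrowed to `requests.exceptions.RequestException`.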

In [11]:
# Convert scraped data into DataFrame
table2 = pd.DataFrame(cv_dict)

In [12]:
# Checking the tail of table2 to identify the id of the last resume
table2.tail()

Unnamed: 0,id,emp_type,availability,desired_wage,work_auth,job_level,will_travel,edu_level,will_reloc,objective,exp,edu,skills,add_info,contact_info
995,[SpiderID: 72735],[Type of Position: Contractor],[Availability Date: ],[Desired Wage: ],[U.S. Work Authorization: ],[Job Level: Experienced with over 2 years expe...,[Willing to Travel: ],[Highest Degree Attained: ],[Willing to Relocate: No],[Objective: I can work on Corp to...,"[Experience: Cleveland Clinic, Cleveland, OH O...",[],"[Skills:RDBMS\tPROGRESS 4GL, Oracle 10g/11g, P...",[Additional Information:Certification: Oracle ...,[Candidate Contact Information:]
996,[SpiderID: 72731],[Type of Position: Full-Time Permanent],[Availability Date: 2 weeks],[Desired Wage: 150000],[U.S. Work Authorization: Yes],"[Job Level: Management (Manager, Director)]","[Willing to Travel: Yes, More Than 75%]",[Highest Degree Attained: Bachelors],[Willing to Relocate: No],[Objective:To obtain a technical management po...,[Experience:WORK EXPERIENCEIT Director Applied...,"[Education:University of Wisconsin - Madison, ...",[Skills:CORE COMPETENCIESTechnical knowledge &...,[Additional Information:Project Management Pro...,[Candidate Contact Information:]
997,[SpiderID: 72730],[Type of Position: Contractor],[Availability Date: ],[Desired Wage: ],[U.S. Work Authorization: Yes],[Job Level: Experienced with over 2 years expe...,"[Willing to Travel: Yes, Less Than 25%]",[Highest Degree Attained: Bachelors],[Willing to Relocate: Yes],[Objective:SUMMARY:•\tOverall 6+ years of IT e...,[Experience:Client: EFH (Energy Future Holding...,[Education:EDUCATION:•\tBachelor in Computer S...,[Skills:TECHNICAL SKILLS:Storage/SAN\tHigh End...,[],[Candidate Contact Information:]
998,[SpiderID: 72727],[Type of Position: Contractor],[Availability Date: ],[Desired Wage: ],[U.S. Work Authorization: Yes],"[Job Level: Management (Manager, Director)]","[Willing to Travel: Yes, More Than 75%]",[Highest Degree Attained: Bachelors],[Willing to Relocate: Yes],[],[],[],[],[],[Candidate Contact Information:]
999,[SpiderID: 72726],[Type of Position: Contractor],[Availability Date: ],[Desired Wage: ],[U.S. Work Authorization: Yes],"[Job Level: Management (Manager, Director)]","[Willing to Travel: Yes, More Than 75%]",[Highest Degree Attained: Bachelors],[Willing to Relocate: Yes],[],[],[],[],[],[Candidate Contact Information:]


In [14]:
# table1 has 1500 rows, so drop the last 500 rows so that it can be concatenated with table2, which has 1000 rows
# Double-check where to cut table1: based on the last id of table2, i.e. 72726
table1 = table1.iloc[0:1000]

In [15]:
# Concatenate table1 and table2
df_it = pd.concat([table1, table2], axis=1)

In [16]:
# Checking out the shape and first 5 items
print(df_it.shape)
df_it.head()

(1000, 21)


Unnamed: 0,sn,date_posted,job_func_sought,category,location,resume_href,id,emp_type,availability,desired_wage,...,job_level,will_travel,edu_level,will_reloc,objective,exp,edu,skills,add_info,contact_info
0,1,3/26/2020,it helpdesk support technician,Information Technology,"Las Vegas, NV",/job/view-resume-82527.html,[SpiderID: 82527],[Type of Position: Full-Time Permanent],[Availability Date: 3/26/2020],[Desired Wage: ],...,[Job Level: Experienced with over 2 years expe...,"[Willing to Travel: Yes, Less Than 25%]",[Highest Degree Attained: Bachelors],[Willing to Relocate: Yes],[Objective:Dedicated IT expert with 10+ years’...,"[Experience:THE COSMOPOLITAN, LAS VEGAS, NV\t\...","[Education:UNIVERSITY OF ALABAMA, BIRMINGHAM\t...","[Skills:Skype, Creston, Ivanti App Sense, Alti...",[],[Candidate Contact Information:]
1,2,3/25/2020,web developer,Information Technology,"surrey, BC",/job/view-resume-82525.html,[SpiderID: 82525],[Type of Position: Full-Time Permanent],[Availability Date: ],[Desired Wage: ],...,[Job Level: New Grad/Entry Level],"[Willing to Travel: Yes, Less Than 25%]",[Highest Degree Attained: Bachelors],[Willing to Relocate: No],[],[],[],[],[],[Candidate Contact Information:]
2,3,3/23/2020,it helpdesk support technician,Information Technology,"Las Vegas, NV",/job/view-resume-82521.html,[SpiderID: 82521],[Type of Position: Full-Time Permanent],[Availability Date: ],[Desired Wage: ],...,"[Job Level: Management (Manager, Director)]","[Willing to Travel: Yes, Less Than 25%]",[Highest Degree Attained: Bachelors],[Willing to Relocate: Yes],[Objective:Dedicated IT expert with 10+ years’...,"[Experience:THE COSMOPOLITAN, LAS VEGAS, NV\t\...",[Education:CODE ACADEMY | Python\t\t\t\t\t\t\t...,"[Skills:Skype, Creston, Ivanti App Sense, Alti...",[],[Candidate Contact Information:]
3,4,3/2/2020,it education professional,Information Technology,"Lutz, FL",/job/view-resume-82483.html,[SpiderID: 82483],[Type of Position: Full-Time Permanent],[Availability Date: ],[Desired Wage: 72000],...,"[Job Level: Management (Manager, Director)]",[Willing to Travel: No],[Highest Degree Attained: Bachelors],[Willing to Relocate: No],"[Objective:Accomplished, performance-focused, ...",[Experience:Professional ExperienceBK & TJ Ent...,[Education:EducationBachelor of Arts in Busine...,[Skills:Technical SkillsOperating Systems:IBM ...,[Additional Information:CertificationCompTIA N...,[Candidate Contact Information:]
4,5,2/25/2020,senior systems administrator supervisor,Information Technology,"Ventura, CA",/job/view-resume-82473.html,[SpiderID: 82473],[Type of Position: Full-Time Permanent],[Availability Date: Immediately],[Desired Wage: 99000],...,"[Job Level: Management (Manager, Director)]",[Willing to Travel: No],[Highest Degree Attained: High School/Equivalent],[Willing to Relocate: No],"[Objective:Innovative, results-driven, and ana...",[Experience:RELEVANT EXPERIENCEMARKET SCAN INF...,"[Education:EDUCATIONCivil Engineering, Communi...",[],[Additional Information:PROFESSIONAL DEVELOPME...,[Candidate Contact Information:]
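
Because the two tables are joined purely by row position, it may be worth confirming that each listing link and the resume scraped from it refer to the same ID. The helpers below are a hypothetical sketch of such a check using only the standard library; the patterns are assumed from the sample values shown above.

```python
import re


def href_id(href):
    """Extract the numeric resume id from a link like '/job/view-resume-82527.html'."""
    match = re.search(r"view-resume-(\d+)\.html", href)
    return match.group(1) if match else None


def spider_id(entries):
    """Extract the numeric id from a scraped list like ['SpiderID: 82527']."""
    if not entries:
        return None
    match = re.search(r"SpiderID:\s*(\d+)", entries[0])
    return match.group(1) if match else None


def misaligned_rows(hrefs, id_lists):
    """Return the positions where the listing link and the resume page disagree."""
    bad = []
    for i, (href, ids) in enumerate(zip(hrefs, id_lists)):
        expected = href_id(href)
        # A link without a recognisable id also counts as a mismatch
        if expected is None or expected != spider_id(ids):
            bad.append(i)
    return bad
```

Calling `misaligned_rows(df_it['resume_href'], df_it['id'])` while the `id` column still holds lists should return an empty list if the rows line up.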


In [17]:
# Remove the square brackets in each entry by taking the first element of each list

list_cols = ['id', 'emp_type', 'availability', 'desired_wage', 'work_auth',
             'job_level', 'will_travel', 'edu_level', 'will_reloc', 'objective',
             'exp', 'edu', 'skills', 'add_info', 'contact_info']

for col in list_cols:
    df_it[col] = df_it[col].str.get(0)

In [18]:
# Final check of the dataframe
df_it.head()

Unnamed: 0,sn,date_posted,job_func_sought,category,location,resume_href,id,emp_type,availability,desired_wage,...,job_level,will_travel,edu_level,will_reloc,objective,exp,edu,skills,add_info,contact_info
0,1,3/26/2020,it helpdesk support technician,Information Technology,"Las Vegas, NV",/job/view-resume-82527.html,SpiderID: 82527,Type of Position: Full-Time Permanent,Availability Date: 3/26/2020,Desired Wage:,...,Job Level: Experienced with over 2 years exper...,"Willing to Travel: Yes, Less Than 25%",Highest Degree Attained: Bachelors,Willing to Relocate: Yes,Objective:Dedicated IT expert with 10+ years’ ...,"Experience:THE COSMOPOLITAN, LAS VEGAS, NV\t\t...","Education:UNIVERSITY OF ALABAMA, BIRMINGHAM\t\...","Skills:Skype, Creston, Ivanti App Sense, Altir...",,Candidate Contact Information:
1,2,3/25/2020,web developer,Information Technology,"surrey, BC",/job/view-resume-82525.html,SpiderID: 82525,Type of Position: Full-Time Permanent,Availability Date:,Desired Wage:,...,Job Level: New Grad/Entry Level,"Willing to Travel: Yes, Less Than 25%",Highest Degree Attained: Bachelors,Willing to Relocate: No,,,,,,Candidate Contact Information:
2,3,3/23/2020,it helpdesk support technician,Information Technology,"Las Vegas, NV",/job/view-resume-82521.html,SpiderID: 82521,Type of Position: Full-Time Permanent,Availability Date:,Desired Wage:,...,"Job Level: Management (Manager, Director)","Willing to Travel: Yes, Less Than 25%",Highest Degree Attained: Bachelors,Willing to Relocate: Yes,Objective:Dedicated IT expert with 10+ years’ ...,"Experience:THE COSMOPOLITAN, LAS VEGAS, NV\t\t...",Education:CODE ACADEMY | Python\t\t\t\t\t\t\t\...,"Skills:Skype, Creston, Ivanti App Sense, Altir...",,Candidate Contact Information:
3,4,3/2/2020,it education professional,Information Technology,"Lutz, FL",/job/view-resume-82483.html,SpiderID: 82483,Type of Position: Full-Time Permanent,Availability Date:,Desired Wage: 72000,...,"Job Level: Management (Manager, Director)",Willing to Travel: No,Highest Degree Attained: Bachelors,Willing to Relocate: No,"Objective:Accomplished, performance-focused, a...",Experience:Professional ExperienceBK & TJ Ente...,Education:EducationBachelor of Arts in Busines...,Skills:Technical SkillsOperating Systems:IBM M...,Additional Information:CertificationCompTIA Ne...,Candidate Contact Information:
4,5,2/25/2020,senior systems administrator supervisor,Information Technology,"Ventura, CA",/job/view-resume-82473.html,SpiderID: 82473,Type of Position: Full-Time Permanent,Availability Date: Immediately,Desired Wage: 99000,...,"Job Level: Management (Manager, Director)",Willing to Travel: No,Highest Degree Attained: High School/Equivalent,Willing to Relocate: No,"Objective:Innovative, results-driven, and anal...",Experience:RELEVANT EXPERIENCEMARKET SCAN INFO...,"Education:EDUCATIONCivil Engineering, Communit...",,Additional Information:PROFESSIONAL DEVELOPMEN...,Candidate Contact Information:
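
The values above still carry their field labels (for example `SpiderID: 82527`). As a possible follow-up cleaning step, and only as a sketch, a small hypothetical helper such as `strip_label` could remove the label and turn empty values into missing ones.

```python
def strip_label(value, label):
    """Remove a leading 'Label:' from a scraped value; return None if nothing is left."""
    if not isinstance(value, str):
        return None  # missing entries (e.g. NaN) stay missing
    if value.startswith(label):
        value = value[len(label):]
    value = value.strip()
    return value or None


# Sample values copied from the output above
spider = strip_label("SpiderID: 82527", "SpiderID:")
wage = strip_label("Desired Wage: ", "Desired Wage:")
```

On the DataFrame this could be applied per column, for example `df_it['id'].apply(lambda v: strip_label(v, 'SpiderID:'))`.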


In [19]:
# Save a copy to csv
#df_it.to_csv('./datasets/IT.csv', index=False)