# Web Scraping 

In this notebook, I scrape job listings using Selenium and Beautiful Soup.

The website that I will be scraping is https://www.mycareersfuture.sg/, a Singapore government initiative that aims to give Singaporeans a smarter way to find jobs. I want to retrieve job listings for data-related positions in order to study factors (e.g. salaries) that characterise these positions in Singapore.

The steps for web scraping are recorded below in detail:

<font color='blue'>

### Import the required packages

In [None]:
# Import all the packages required

import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline
plt.style.use('fivethirtyeight')
%config InlineBackend.figure_format = 'retina'

In [3]:
# Import Scrapy

from scrapy.selector import Selector
from scrapy.http import HtmlResponse

<font color='blue'>

### Study the structure of the web pages at MyCareersFuture

The pages are structured as follows:

__Main home page__, where we can enter the search term (i.e. job titles we are interested in)

                THEN

This returns the __search results pages__, where it shows the job listings for the search term

                THEN 
                
Clicking a job listing on the search results page opens the __detailed information for that job listing__, e.g. salary, employment type, job description and requirements.

Hence, to scrape the detailed job listings, we follow these steps:
1. Set up the URLs for the search results pages of each search term
2. Use the Selenium webdriver to go to each search results page and retrieve its HTML
3. From the HTML retrieved, use Beautiful Soup to extract the URLs of the individual job listings
4. Use the Selenium webdriver to go to each job listing page and retrieve its HTML
5. From the HTML retrieved, use Beautiful Soup to extract the details of the job listing

Once we have extracted the details of each job listing:

1. Load all the details into a Pandas DataFrame
2. Export the data as a CSV file so that it can be loaded in another notebook
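
The data flow of the steps above can be sketched as a single function. This is only an illustration of how the steps chain together, not the code used in this notebook: `run_pipeline` and its three callables (`fetch`, `extract_links`, `parse_details`) are hypothetical names standing in for the Selenium and Beautiful Soup cells further down.

```python
def run_pipeline(search_urls, fetch, extract_links, parse_details):
    """Chain the scraping steps: results pages -> listing URLs -> listing details.

    fetch(url)            -> HTML string for a page (Selenium in this notebook)
    extract_links(html)   -> list of job listing URLs found on a results page
    parse_details(html)   -> dict of fields for one job listing
    All three callables are placeholders supplied by the caller.
    """
    listing_urls = []
    for url in search_urls:
        # Steps 2 and 3: fetch each results page and collect its listing URLs
        listing_urls.extend(extract_links(fetch(url)))

    # Steps 4 and 5: fetch each listing page and extract its details
    return [parse_details(fetch(url)) for url in listing_urls]
```

The list of dictionaries returned here is the shape that `pd.DataFrame` accepts directly, which covers the final loading step.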

<font color='blue'>

### Determine the job titles that I am interested in for my study

These are the data-related job titles that I plan to scrape job listings for:
- Data Analyst
- Data Scientist
- Data Engineer
- Data Architect
- Data Manager
- Data Developer
- Business Intelligence Analyst
- Business Analyst

<font color='blue'>

### Consolidate URLs for the search results of the different job titles

</font>

I will consolidate the URLs for the search results for different job titles on MyCareersFuture. This way, I can run these URLs using the Selenium webdriver later, to retrieve the HTML for all the pages.

In [27]:
# Extract URLs for search results of each job title
# Data Analyst - Search results

data_analyst = []

for i in range(0,4):
    data_analyst.append('https://www.mycareersfuture.sg/search?search=data%20analyst&sortBy=new_posting_date&page='+str(i))

In [28]:
# Extract URLs for search results of each job title
# Data Scientist - Search results

data_scientist = []

for i in range(0,5):
    data_scientist.append('https://www.mycareersfuture.sg/search?search=data%20scientist&sortBy=new_posting_date&page='+str(i))

In [29]:
# Extract URLs for search results of each job title
# Data Engineer - Search results

data_engineer = []

for i in range(0,7):
    data_engineer.append('https://www.mycareersfuture.sg/search?search=Data%20engineer&sortBy=new_posting_date&page='+str(i))

In [30]:
# Extract URLs for search results of each job title
# Data Architect - Search results

data_architect = []

for i in range(0,2):
    data_architect.append('https://www.mycareersfuture.sg/search?search=Data%20Architect&sortBy=new_posting_date&page='+str(i))

In [31]:
# Extract URLs for search results of each job title
# Data Manager - Search results

data_manager = []

for i in range(0,3):
    data_manager.append('https://www.mycareersfuture.sg/search?search=Data%20manager&sortBy=new_posting_date&page='+str(i))

In [41]:
# Extract URLs for search results of each job title
# Data Developer - Search results

data_developer = []

for i in range(0,1):
    data_developer.append('https://www.mycareersfuture.sg/search?search=Data%20developer&sortBy=new_posting_date&page='+str(i))

In [32]:
# Extract URLs for search results of each job title
# Business Intelligence Analyst - Search results

business_intelligence_analyst = []

for i in range(0,1):
    business_intelligence_analyst.append('https://www.mycareersfuture.sg/search?search=business%20intelligence%20analyst&sortBy=new_posting_date&page='+str(i))

In [35]:
# Extract URLs for search results of each job title
# Business Analyst - Search results

business_analyst = []

for i in range(0,18):
    business_analyst.append('https://www.mycareersfuture.sg/search?search=business%20analyst&sortBy=new_posting_date&page='+str(i))

In [42]:
# Consolidate all the urls into one single list for iteration using selenium below

consol_list = data_analyst + data_scientist + data_engineer + data_architect + data_manager + data_developer \
                + business_intelligence_analyst + business_analyst
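
The eight cells above differ only in the search term and the number of pages, so the same list could be built with one helper. The sketch below is an alternative, not the code that was run: `build_search_urls`, `pages_per_title` and `consol_list_sketch` are names I made up, and the page counts are copied from the `range()` calls above.

```python
from urllib.parse import urlencode, quote

BASE_SEARCH_URL = "https://www.mycareersfuture.sg/search"


def build_search_urls(term, n_pages):
    """Return one search-results URL per page for a search term."""
    urls = []
    for page in range(n_pages):
        query = urlencode(
            {"search": term, "sortBy": "new_posting_date", "page": page},
            quote_via=quote,  # encode spaces as %20, as in the URLs above
        )
        urls.append(f"{BASE_SEARCH_URL}?{query}")
    return urls


# Search terms (with their original casing) and page counts from the cells above
pages_per_title = {
    "data analyst": 4,
    "data scientist": 5,
    "Data engineer": 7,
    "Data Architect": 2,
    "Data manager": 3,
    "Data developer": 1,
    "business intelligence analyst": 1,
    "business analyst": 18,
}

consol_list_sketch = [
    url
    for term, n_pages in pages_per_title.items()
    for url in build_search_urls(term, n_pages)
]
```

With this layout, adding a job title or changing a page count means editing one dictionary entry instead of copying a cell.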

<font color='blue'>

### Open Selenium Webdriver

In [None]:
import os
from selenium import webdriver

chromedriver = "/Users/JooYeng/Downloads/chromedriver 4"
os.environ["webdriver.chrome.driver"] = chromedriver

# Create a driver called "driver."
driver = webdriver.Chrome(executable_path="/Users/JooYeng/Downloads/chromedriver 4")

<font color = 'blue'>

### Use Selenium to extract html for the search result pages

In [44]:
# Use Selenium to retrieve the HTML for every URL in consol_list

from time import sleep  # sleep() is used below to give each page time to load

html = []

for page in consol_list:
    driver.get(page)
    sleep(4)
    html.append(driver.page_source)
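
A long scraping run can be interrupted by a single page that fails to load. The sketch below shows one way to keep going and record the failures. It is an assumption-laden variant of the loop above: `fetch_pages` is a hypothetical helper, and it only relies on the driver having a `get()` method and a `page_source` attribute, as the Selenium driver above does.

```python
from time import sleep


def fetch_pages(driver, urls, delay=4):
    """Load each URL with the driver and return (pages, failed).

    pages  : dict mapping each URL to the HTML that was retrieved
    failed : list of URLs that raised an error while loading
    """
    pages = {}
    failed = []
    for url in urls:
        try:
            driver.get(url)
            sleep(delay)  # fixed pause so the page has time to render
            pages[url] = driver.page_source
        except Exception as exc:  # broad on purpose: one bad page should not stop the run
            print(f"Failed to load {url}: {exc}")
            failed.append(url)
    return pages, failed
```

Keeping the HTML keyed by URL means a failed page cannot shift the remaining pages out of line with their URLs, and the `failed` list can simply be passed to `fetch_pages` again for a retry.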

<font color = 'blue'>

### Use Beautiful Soup to extract url of each job listing page

In [103]:
# Check how to use Beautiful Soup to extract the url

pg = BeautifulSoup(html[0]).findAll("a",{"class":"bg-white mb3 w-100 dib v-top pa3 no-underline flex-ns flex-wrap JobCard__card___22xP3"})
for job in pg:
    print(job.attrs['href'])

/job/mid-level-data-analyst-traveloka-services-0b8be8fb990807c9e3e2f035cd12504a
/job/platform-engineer-gumi-asia-54f313f4b63eb6c0fcb838a51cd8b318
/job/data-analytics-lead-traveloka-services-8712611c3fb3ab76d7cadf931f3983fa
/job/senior-data-analyst-propertyguru-707f35425322b89a7451c4f0316250d4
/job/data-engineer-2fab625fb13fe16207621f894b9ebdb9
/job/crm-data-analyst-aspire-global-network-72550b8ea34bcc4bb37f4efba636afd6
/job/crm-data-analyst-aspire-global-network-dccc8a3b7b4f2f1946c7e0d24ea01269
/job/technical-data-analyst-eclerx-7cc809458abebca3475715b2f588438f
/job/data-analyst-emerio-globesoft-e0e7e3098911377e04e4dfc531402529
/job/data-management-analyst-operations-unilite-recruitment-services-565d5ed6978c091752d10f8391774e28
/job/data-scientist-merck-282468cd9f95fb3a2711b22f64ac37a4
/job/supply-chain-planning-anaylst-konica-minolta-business-solutions-asia-893c163f3c5cec8bf6224390b8c06ac4
/job/data-analyst-ntuc-enterprise-co-operative-81a4fdffd3d1823c4701e53227d7de71
/job/business-an

In [105]:
# Extract the href for each individual job 
# Consolidate them into a list

url_part = []

for page in html:
    pg = BeautifulSoup(page).findAll("a",{"class":"bg-white mb3 w-100 dib v-top pa3 no-underline flex-ns flex-wrap JobCard__card___22xP3"})
    for job in pg:
        url_part.append(job.attrs['href'])

In [110]:
# The href extracted is only the path part of the URL, so I prepend the site's
# base URL to get the full URL for each individual job

url_full = []

for url in url_part:
    url_full.append('https://www.mycareersfuture.sg'+url)
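
The two cells above match the full class string, including the `___22xP3` suffix, which appears to be generated by the site's build and may change. A less brittle variant, sketched below, matches only the `JobCard__card` part and builds the full URL with `urljoin`. The helper name `extract_job_urls` is hypothetical, and `"html.parser"` is my choice of parser rather than something the notebook specifies.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://www.mycareersfuture.sg"


def extract_job_urls(page_html):
    """Return the absolute URL of every job card link on one results page."""
    soup = BeautifulSoup(page_html, "html.parser")
    # CSS attribute selector: any <a> whose class attribute contains "JobCard__card"
    cards = soup.select('a[class*="JobCard__card"]')
    return [urljoin(BASE_URL, card["href"]) for card in cards if card.has_attr("href")]
```

Applied to the notebook's data, this would be called once per entry in `html`, extending a single list of URLs.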

<font color = 'blue'>

### Use Selenium to extract html for each job listing page

In [146]:
# Open Selenium webdriver

chromedriver = "/Users/JooYeng/Downloads/chromedriver 4"
os.environ["webdriver.chrome.driver"] = chromedriver

# Create a driver called "driver."
driver = webdriver.Chrome(executable_path="/Users/JooYeng/Downloads/chromedriver 4")

In [147]:
# Iterate through the URL list to retrieve the HTML of each job listing page

job_html = []

for job in url_full:
    driver.get(job)
    sleep(5)
    job_html.append(driver.page_source)

<font color = 'blue'>

### Use Beautiful Soup to extract job details from each job listing page html

In this case, the fields that I am interested in are:
    - Company Name
    - Job ID
    - Full Job Title
    - Employment Type (Permanent, Contract etc.)
    - Seniority
    - Job Category/Industry
    - Salary Amount
    - Job Description
    - Job Requirements
    - Number of applicants
    - Posted Date of listing
    - Expiry Date of listing

In [161]:
# Append job details for each job

comp = []
job_id = []
title = []
emp_type = []
seniority = []
job_cat = []

for job in job_html:
    try:
        comp.append(BeautifulSoup(job).find("p",{"name":"company"}).text)
    except AttributeError:
        comp.append(np.nan)
    try:    
        job_id.append(BeautifulSoup(job).find("span",{"class":"black-60 db f6 fw4 mv1"}).text)
    except AttributeError:
        job_id.append(np.nan)
    try:
        title.append(BeautifulSoup(job).find("h1",{"id":"job_title"}).text)
    except AttributeError:
        title.append(np.nan)
    try:
        emp_type.append(BeautifulSoup(job).find("p",{"id":"employment_type"}).text)
    except AttributeError:
        emp_type.append(np.nan)
    try:        
        seniority.append(BeautifulSoup(job).find("p",{"id":"seniority"}).text)
    except AttributeError:
        seniority.append(np.nan)
    try:
        job_cat.append(BeautifulSoup(job).find("p",{"id":"job-categories"}).text)
    except AttributeError:
        job_cat.append(np.nan)

In [165]:
# Append job details for each job

job_desc = []
job_req = []

for job in job_html:
    try:
        job_desc.append(BeautifulSoup(job).find("div",{"id":"description-content"}).text)
    except AttributeError:
        job_desc.append(np.nan)
    try:
        job_req.append(BeautifulSoup(job).find("div",{"id":"requirements-content"}).text)
    except AttributeError:
        job_req.append(np.nan)

In [166]:
# Append job details for each job

num_app = []
posted_date = []
exp_date = []

for job in job_html:
    try:
        num_app.append(BeautifulSoup(job).find("span",{"id":"num_of_applications"}).text)
    except AttributeError:
        num_app.append(np.nan)
    try:
        posted_date.append(BeautifulSoup(job).find("span",{"id":"last_posted_date"}).text)
    except AttributeError:
        posted_date.append(np.nan)
    try:
        exp_date.append(BeautifulSoup(job).find("span",{"id":"expiry_date"}).text)
    except AttributeError:
        exp_date.append(np.nan)

In [167]:
# Append job details for each job

sal_amt = []
sal_term = []

for job in job_html:
    try:
        sal_amt.append(BeautifulSoup(job).find("div",{"class":"lh-solid"}).text)
    except AttributeError:
        sal_amt.append(np.nan)
    try:
        sal_term.append(BeautifulSoup(job).find("span",{"class":"salary_type dib f5 fw4 black-60 pr1 i pb"}).text)
    except AttributeError:
        sal_term.append(np.nan)
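
The four extraction cells above repeat one pattern: find a tag, take its text, and fall back to a missing value. They also re-parse the same HTML for every field. The sketch below parses each page once and drives the lookups from a table. It is an alternative layout, not the code that produced the results in this notebook: `FIELD_SELECTORS`, `text_or_nan` and `parse_job_page` are hypothetical names, the tag and attribute pairs are copied from the cells above, and `math.nan` with `"html.parser"` are my substitutions.

```python
import math

from bs4 import BeautifulSoup

# (tag, attributes) pairs copied from the extraction cells above
FIELD_SELECTORS = {
    "comp": ("p", {"name": "company"}),
    "job_id": ("span", {"class": "black-60 db f6 fw4 mv1"}),
    "title": ("h1", {"id": "job_title"}),
    "emp_type": ("p", {"id": "employment_type"}),
    "seniority": ("p", {"id": "seniority"}),
    "job_cat": ("p", {"id": "job-categories"}),
    "job_desc": ("div", {"id": "description-content"}),
    "job_req": ("div", {"id": "requirements-content"}),
    "num_app": ("span", {"id": "num_of_applications"}),
    "posted_date": ("span", {"id": "last_posted_date"}),
    "exp_date": ("span", {"id": "expiry_date"}),
    "sal_amt": ("div", {"class": "lh-solid"}),
    "sal_term": ("span", {"class": "salary_type dib f5 fw4 black-60 pr1 i pb"}),
}


def text_or_nan(soup, tag, attrs):
    """Return the text of the first matching tag, or NaN if there is none."""
    node = soup.find(tag, attrs)
    return node.text if node is not None else math.nan


def parse_job_page(page_html):
    """Parse one job listing page once and return a dict of its fields."""
    soup = BeautifulSoup(page_html, "html.parser")
    return {
        field: text_or_nan(soup, tag, attrs)
        for field, (tag, attrs) in FIELD_SELECTORS.items()
    }
```

Calling `parse_job_page` on every entry of `job_html` would give a list of dictionaries with one entry per listing, so every field stays aligned with its page.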

<font color='blue'>

### Create a DataFrame with all the information extracted

In [None]:
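# Reset cell: running it empties the lists below. Skip it when going straight on
# to build the DataFrame, otherwise these lists no longer match the others in
# length and pd.DataFrame raises a ValueError.
comp = []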
emp_type = []
num_app = []
posted_date = []
exp_date = []
sal_amt = []
sal_term = []

In [168]:
# Create a DataFrame with all the information

    # Create lists with the keys and values to zip the 2 lists together
dict_key_list = ['job_id','job_title','url','job_desc','job_req','yrs_exp','indus',
                 'comp','sal_amt','sal_term','emp_type','num_app','posted_date','exp_date']
dict_val_list = [job_id,title,url_full,job_desc,job_req,seniority,job_cat,
                comp,sal_amt,sal_term,emp_type,num_app,posted_date,exp_date]

    # Zip the 2 lists together and create a dictionary
jobs_dict = dict(zip(dict_key_list,dict_val_list))

    # Create a dataframe from the dictionary
jobs_df = pd.DataFrame(jobs_dict) 
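
`pd.DataFrame` raises a `ValueError` when the lists in the dictionary differ in length, but the message does not say which list is the odd one out. A small check before building the DataFrame makes that visible. `check_equal_lengths` below is a hypothetical helper, shown only as a sketch.

```python
def check_equal_lengths(columns):
    """Return the common length of all lists in `columns`.

    Raises ValueError listing every column's length if they are not all equal.
    """
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Column lengths differ: {lengths}")
    # All lengths are equal here; an empty dict counts as length 0
    return next(iter(lengths.values()), 0)
```

In this notebook it would be called as `check_equal_lengths(jobs_dict)` right after the dictionary is created.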

In [169]:
# Check the dataframe
jobs_df.head(2)

Unnamed: 0,job_id,job_title,url,job_desc,job_req,yrs_exp,indus,comp,sal_amt,sal_term,emp_type,num_app,posted_date,exp_date
0,JOB-2019-0089696,Mid Level Data Analyst,https://www.mycareersfuture.sg/job/mid-level-d...,Data Analyst at Traveloka is at the forefront ...,We are looking for someone with: Passion in ...,Junior Executive,Information Technology,TRAVELOKA SERVICES PTE. LTD.,"$4,300to$7,600",Monthly,Permanent,2 applications,Posted 26 Apr 2019,Closing on 26 May 2019
1,JOB-2019-0089321,Platform Engineer,https://www.mycareersfuture.sg/job/platform-en...,Responsible for programming and maintaining mo...,Coding/Scripting experience in languages such...,Executive,Information Technology,GUMI ASIA PTE. LTD.,"$5,000to$6,500",Monthly,Full Time,0 application,Posted 26 Apr 2019,Closing on 26 May 2019


In [170]:
# Check if all the jobs extracted are unique listings

len(jobs_df['job_id'].unique())

665
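
The number of unique job IDs only answers the question when it is compared with the number of rows, since the same listing may be returned by more than one search term. The sketch below makes that comparison and drops repeated IDs. `summarize_duplicates` is a hypothetical helper, and the job IDs in the usage lines are made up for illustration.

```python
import pandas as pd


def summarize_duplicates(df, key="job_id"):
    """Return (number of rows, number of unique keys, de-duplicated DataFrame)."""
    n_rows = len(df)
    # dropna=False counts a missing ID as a value, like len(df[key].unique())
    n_unique = df[key].nunique(dropna=False)
    deduped = df.drop_duplicates(subset=key, keep="first")
    return n_rows, n_unique, deduped


# Illustrative data only, not taken from the scrape
example = pd.DataFrame(
    {"job_id": ["JOB-1", "JOB-2", "JOB-1"], "job_title": ["a", "b", "a"]}
)
n_rows, n_unique, deduped = summarize_duplicates(example)
print(n_rows, n_unique, len(deduped))
```

If the first two numbers differ for `jobs_df`, the de-duplicated frame is probably the one to export.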

<font color='blue'>

### Export DataFrame to a CSV File

</font>

This can then be stored and loaded in a separate notebook for my study.

In [171]:
# Export to csv

jobs_df.to_csv('./MyCareerSG.csv')
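
By default `to_csv` also writes the DataFrame index, which comes back as an extra `Unnamed: 0` column when the file is read with `pd.read_csv`. The sketch below shows the difference on a tiny made-up DataFrame, using in-memory buffers instead of files.

```python
import io

import pandas as pd

# Illustrative data only
df = pd.DataFrame({"job_id": ["JOB-A", "JOB-B"], "comp": ["X", "Y"]})

# Default: the index is written as an unnamed first column
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False: only the data columns are written
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)     # ['Unnamed: 0', 'job_id', 'comp']
print(cols_without_index)  # ['job_id', 'comp']
```

For this notebook, passing `index=False` in the export cell would keep the extra column out of the saved file. Alternatively, the notebook that loads the file can pass `index_col=0` to `pd.read_csv`.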