# Job Post Scraping

In this notebook we scrape the job listings for 'Data Analyst' from WUZZUF, an Egyptian job website, extract the information from each listing, and store the results in a CSV file.

## Website

<img src="./img/website.png" alt="WUZZUF website" width="800"/>

## Data Objectives

Each listing result must contain:
 - Date
 - Job Title
 - Company Name
 - Company Address
 - Full-time/Part-time
 - Career Level
 - URL

Additionally, we will attempt to extract the following information:
 - Job Description
 - Job Requirements
 - Salary
 - Required Experience 
 - Required Education
 - Skills/Tools


## Job Description

This job description was listed on a freelancing website:

> The project goal is to write Python Script to Scrape All Data Analyst Jobs data Posted on Wuzzuf Website Using BeautifulSoup
>
> - I was able to Extract the Job Title, Company Name, Company Address, Job Time, Job Level, Job Link and The period in which the job was posted which is converted to the Post Date and Put all of them in a Spreadsheet
>
> Website - https://wuzzuf.net/search/jobs/?a=navbl%7Cspbl&q=data%20analyst&start=0

## Program Parameters

Here are some parameters that can change the output of the notebook:

 - `NUM_PAGES` -> This sets the number of pages to scrape from the search results (15 job posts per page, default = 3)
 - `SEARCH_TERM` -> This sets the search term (default = 'data analyst')

The following parameter shouldn't be changed, as this notebook is set up specifically to parse WUZZUF search results:
 - `BASE_URL` -> This is the base URL, used to follow relative URL links during the scraping process

In [1]:
NUM_PAGES = 3 # 15 jobs per page
SEARCH_TERM = 'data analyst' 
BASE_URL = 'https://wuzzuf.net'

## Imports

We use `requests` and `BeautifulSoup` to request and parse the HTML, respectively, and `datetime` to parse the posting dates. `urllib` is used to safely process URLs. `re` and `json` are used to parse and navigate the JSON data embedded in the webpage: the page is dynamically loaded, so the static HTML doesn't yield the information we are looking to extract. Finally, `pandas` is used to export the results.

In [2]:
from bs4 import BeautifulSoup
import requests
from datetime import datetime
import urllib.parse
import re
import json
import pandas as pd

## Helper Functions

In [3]:
def build_page_urls(pages, search_term):
    '''Builds the URLs to retrieve our job posts based on the number of pages requested'''
    url = (f'https://wuzzuf.net/search/jobs/?a=navbl%7Cspbl&q={urllib.parse.quote(search_term)}&start=')
    urls = list()
    for i in range(pages):
        urls.append(url + str(i))
    return urls
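
To make the URL pattern concrete, here is a minimal sketch that rebuilds it for two pages. The function name `example_page_urls` is hypothetical and only stands in for `build_page_urls`; it relies on nothing beyond `urllib.parse.quote`.

```python
import urllib.parse

def example_page_urls(pages, search_term):
    # Percent-encode the search term so that spaces become %20
    query = urllib.parse.quote(search_term)
    prefix = f'https://wuzzuf.net/search/jobs/?a=navbl%7Cspbl&q={query}&start='
    # One URL per results page; the notebook passes the page index as `start`
    return [prefix + str(page) for page in range(pages)]

urls = example_page_urls(2, 'data analyst')
for url in urls:
    print(url)
```

The first URL produced this way matches the search URL quoted in the job description above.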

In [4]:
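def nth_occurrence(n, c, s):
    '''Find the index of the nth occurrence (n counts from 1) of character c in string s, or -1 if there are fewer than n'''
    nth = -1
    for _ in range(n):
        nth = s.find(c, nth + 1)
        if nth == -1:
            # Fewer than n occurrences: stop here instead of restarting the search from index 0
            break
    return nth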

### Function Testing ###
assert nth_occurrence(1, 'F', 'Fail') == 0, f'Should equal 0, got {nth_occurrence(1, "F", "Fail")}'
assert nth_occurrence(2, 't', 'test') == 3, f'Should equal 3, got {nth_occurrence(2, "t", "test")}'
assert nth_occurrence(5, '.', '.....') == 4, f'Should equal 4, got {nth_occurrence(5, ".", ".....")}'

## Scraping Functions

In [5]:
def get_post_urls(page_url):
    '''Extracts the individual job listings from a search result page'''
    post_urls = list()
    html = requests.get(page_url)
    soup = BeautifulSoup(html.text, 'lxml')

    # Debug HTML
    # print(soup.prettify())

    potential_links = soup.find_all('a')
    for link in potential_links:
        url = link.get('href')
        if url is not None:
            # Keep only relative links that point to a job post
            if '/jobs/' in url and 'http' not in url:
                post_urls.append(BASE_URL + urllib.parse.quote(url))
    return post_urls
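
One detail worth seeing in isolation is what `urllib.parse.quote` does to a relative link that carries a query string. With its default `safe='/'`, the characters `?`, `=` and `&` are percent-encoded too, which is why the URLs printed during scraping contain `%3F`, `%3D` and `%26`. The sketch below uses a made-up `href` value; the `safe='/?=&'` variant is only a possible alternative, not something this notebook does.

```python
import urllib.parse

# Hypothetical relative link, shaped like the ones found on a results page
href = '/jobs/p/abc123-Data-Analyst?o=1&l=sp'

# Default behaviour: only '/' is left untouched, so the query separators are encoded
encoded = urllib.parse.quote(href)

# Alternative (an assumption): also keep the query separators as they are
preserved = urllib.parse.quote(href, safe='/?=&')

print(encoded)
print(preserved)
```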

In [6]:
def scrape_post(post_url):
    '''Extracts the Job Details from a Job Post URL'''
    post = dict()
    html = requests.get(post_url)
    soup = BeautifulSoup(html.text, 'lxml')

    # Debug HTML
    # print(soup.prettify())

    # Find script tag
    script_tag = soup.find("script")

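    # Find the JSON data: it begins at the third opening curly brace
    third_bracket = nth_occurrence(3, '{', script_tag.text)

    # Find the fourth-last semicolon, which marks the end of the JSON data
    # (named reversed_text to avoid shadowing the built-in reversed)
    reversed_text = script_tag.text[::-1]
    sc = nth_occurrence(4, ';', reversed_text)
    sc = len(reversed_text) - sc - 1

    # Trim the script text down to the JSON data
    trimmed_string = script_tag.text[third_bracket:sc]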

    # Debug JSON
    # print(trimmed_string)

    data = json.loads(trimmed_string)

    # Find job ID, company ID and Job details
    # The job ID is the first key of the applicationStatistics collection
    job_id = next(iter(data["entities"]["applicationStatistics"]["collection"]))
    job_details = data["entities"]["job"]["collection"][job_id]["attributes"]
    if data["entities"]["job"]["collection"][job_id]["relationships"]["company"]["data"] is not None:
        company_id = data["entities"]["job"]["collection"][job_id]["relationships"]["company"]["data"]["id"]
    else:
        company_id = None

    # Date
    post['Date'] = datetime.strptime(
        job_details['postedAt'], '%m/%d/%Y %H:%M:%S')

    # Job Title
    post['Job Title'] = job_details['title']

    # Company Name
    if company_id is not None:
        post['Company Name'] = data["entities"]["company"]["collection"][company_id]["attributes"]["name"]
    else:
        post['Company Name'] = 'Confidential'

    # Company Address
    address = ""
    if job_details["location"]["area"] is not None:
        address = job_details["location"]["area"]["name"]
    if job_details["location"]["city"] is not None:
        if address != "":
            address += ', ' + job_details["location"]["city"]["name"]
        else:
            address = job_details["location"]["city"]["name"]
    if job_details["location"]["country"] is not None:
        if address != "":
            address += ', ' + job_details["location"]["country"]["name"]
        else:
            address = job_details["location"]["country"]["name"]
    post['Company Address'] = address

    # Work Type (an empty list yields an empty string instead of an unassigned variable)
    post['Work Type'] = ', '.join(
        work_type["displayedName"] for work_type in job_details["workTypes"])

    # Career Level
    career_level = ""
    if job_details["careerLevel"]["name"] is not None:
        career_level = job_details["careerLevel"]["name"]
    if job_details["careerLevel"]["hint"] is not None:
        career_level += ' (' + job_details["careerLevel"]["hint"] + ')'

    post['Career Level'] = career_level

    # URL
    post['URL'] = post_url

    # Job Description
    post['Description'] = job_details['description'].replace('\n', "")

    # Job Requirements
    post['Requirements'] = job_details['requirements'].replace('\n', "")

    # Salary
    if job_details['salary']['isPaid'] is True:
        if job_details['salary']['min'] is None and job_details['salary']['max'] is None:
            salary = "Confidential"
        elif job_details['salary']['min'] is not None and job_details['salary']['max'] is not None:
            salary = str(job_details['salary']['min']) + \
                ' - ' + str(job_details['salary']['max'])
        else:
            # Only one bound is set; convert it to a string so the currency and period can be appended
            salary = str(job_details['salary']['min'] if job_details['salary']['min'] is not None
                         else job_details['salary']['max'])
        if job_details['salary']['currency'] is not None:
            salary += " " + job_details['salary']['currency']['code']
        if job_details['salary']['period'] is not None:
            salary += " (" + job_details['salary']['period']['name'] + ')'
    else:
        salary = "Unpaid"

    if job_details['salary']['additionalDetails'] is not None:
        salary += ', ' + job_details['salary']['additionalDetails']
    post['Salary'] = salary

    # Required Experience
    experience = ""
    if job_details['workExperienceYears']['min'] is None and job_details['workExperienceYears']['max'] is None:
        experience = "Unspecified"
    elif job_details['workExperienceYears']['min'] is not None and job_details['workExperienceYears']['max'] is not None:
        experience = str(job_details['workExperienceYears']['min']) + \
            ' - ' + str(job_details['workExperienceYears']['max']) + ' years'
    elif job_details['workExperienceYears']['min'] is not None and job_details['workExperienceYears']['max'] is None:
        experience = str(job_details['workExperienceYears']['min']) + '+ years'
    else:
        experience = '0 - ' + \
            str(job_details['workExperienceYears']['max']) + ' years'

    post['Experience'] = experience

    # Required Education
    post['Education'] = job_details['candidatePreferences']['educationLevel']['name']

    # Skills/Tools
    post['Skills & Tools'] = ", ".join(
        entry['name'] for entry in job_details['keywords'])

    return post
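
The JSON trimming step at the top of `scrape_post` can be tried on its own without any network access. The sketch below runs the same "third opening brace to fourth-last semicolon" slice on a made-up script string; `script_text` and its contents are assumptions for illustration, and the real page's script is much larger.

```python
import json

def nth_occurrence(n, c, s):
    # Index of the n-th occurrence (counting from 1) of c in s
    pos = -1
    for _ in range(n):
        pos = s.find(c, pos + 1)
    return pos

# Made-up script body: two small objects, the JSON payload, then three trailing statements
script_text = (
    'var a = {}; var b = {}; '
    'Store = {"entities": {"job": {"title": "Data Analyst"}}};'
    'init();render();done();'
)

# Start of the payload: the third opening curly brace
start = nth_occurrence(3, '{', script_text)

# End of the payload: the fourth-last semicolon, found by searching the reversed text
reversed_text = script_text[::-1]
end = len(script_text) - nth_occurrence(4, ';', reversed_text) - 1

data = json.loads(script_text[start:end])
print(data['entities']['job']['title'])
```

Because the slice depends on fixed counts of braces and semicolons, any change to the page's script layout would likely require adjusting the two numbers.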


In [7]:
def scrape_website(pages, search_term):
    '''Scrapes all the search results based on the search term and number of pages'''
    page_urls = build_page_urls(pages, search_term)
    post_urls = list()
    posts = list()
    for page in page_urls:
        post_urls.append(get_post_urls(page))
    for page in post_urls:
        for url in page:
            print(f'Scraping URL => {url}')
            result = scrape_post(url)
            posts.append(result)
    return posts
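
As written, one post that fails to parse stops the whole run. A possible variant, sketched below under stated assumptions, collects failures and carries on. The name `scrape_all` and the two stand-in functions are hypothetical; in the notebook the callables would be `get_post_urls` and `scrape_post`.

```python
def scrape_all(page_urls, get_urls, scrape_one):
    '''Hypothetical variant of scrape_website that skips posts that fail to parse'''
    posts, failed = [], []
    for page_url in page_urls:
        for url in get_urls(page_url):
            try:
                posts.append(scrape_one(url))
            except (KeyError, ValueError) as err:
                # KeyError: an expected field is missing; ValueError also covers JSON decoding errors
                failed.append((url, repr(err)))
    return posts, failed

# Stand-ins so the sketch runs without network access
def fake_get_urls(page_url):
    return [page_url + '/a', page_url + '/b']

def fake_scrape(url):
    if url.endswith('/b'):
        raise KeyError('postedAt')
    return {'URL': url}

posts, failed = scrape_all(['p0', 'p1'], fake_get_urls, fake_scrape)
print(len(posts), 'scraped,', len(failed), 'failed')
```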


## Scrape Website

To alter the output, change the parameters at the top of the notebook (in the first code cell):

 - `NUM_PAGES` -> This sets the number of pages to scrape from the search results (15 job posts per page, default = 3)
 - `SEARCH_TERM` -> This sets the search term for the scraping process (default = 'data analyst')

In [8]:
results = scrape_website(NUM_PAGES, SEARCH_TERM)

Scraping URL => https://wuzzuf.net/jobs/p/9rPTWJnYPFuq-Professional-Data-Analyst-Cairo-Egypt%3Fo%3D1%26l%3Dsp%26t%3Dsj%26a%3Ddata%20analyst%7Csearch-v3%7Cnavbl%7Cspbl
Scraping URL => https://wuzzuf.net/jobs/p/lGTaAJcRIwbm-Data-Analyst-Gomla-Market-Alexandria-Egypt%3Fo%3D2%26l%3Dsp%26t%3Dsj%26a%3Ddata%20analyst%7Csearch-v3%7Cnavbl%7Cspbl
Scraping URL => https://wuzzuf.net/jobs/p/R7KdIv7dCm0z-Senior-Data-Analyst---Cairo-Cairo-Egypt%3Fo%3D3%26l%3Dsp%26t%3Dsj%26a%3Ddata%20analyst%7Csearch-v3%7Cnavbl%7Cspbl
Scraping URL => https://wuzzuf.net/jobs/p/TWguALrIS6vT-Senior-Data-Analyst-Alarabia-Group-Cairo-Egypt%3Fo%3D4%26l%3Dsp%26t%3Dsj%26a%3Ddata%20analyst%7Csearch-v3%7Cnavbl%7Cspbl
Scraping URL => https://wuzzuf.net/jobs/p/9AAaTU2msaG5-Data-Analyst-Hands-of-Hope-Physical-Therapy-Wellness-Cairo-Egypt%3Fo%3D5%26l%3Dsp%26t%3Dsj%26a%3Ddata%20analyst%7Csearch-v3%7Cnavbl%7Cspbl
Scraping URL => https://wuzzuf.net/jobs/p/adeG14H7rqQA-Data-Analyst-Gila-Electric-Cairo-Egypt%3Fo%3D6%26l%3Dsp%26t%3Dsj%26

In [9]:
df = pd.DataFrame.from_dict(results)
df.head(5)

Unnamed: 0,Date,Job Title,Company Name,Company Address,Work Type,Career Level,URL,Description,Requirements,Salary,Experience,Education,Skills & Tools
0,2022-10-04 15:19:10,Professional Data Analyst,Confidential,"Cairo, Egypt",Full Time,Experienced (Non-Manager),https://wuzzuf.net/jobs/p/9rPTWJnYPFuq-Profess...,"<p><strong>Senior Data Analytics, Business Int...",<ul><li>Visualization.</li><li>Master Data Man...,Confidential,7+ years,Bachelor's Degree,"Analytics, BI, Data Analysis, Data Governance,..."
1,2022-09-25 09:36:20,Data Analyst,Gomla Market,"Ameria, Alexandria, Egypt",Full Time,Entry Level (Junior Level / Fresh Grad),https://wuzzuf.net/jobs/p/lGTaAJcRIwbm-Data-An...,"<p>At <strong>Gomla Market</strong>, we deal w...",<ul><li>Bachelor's or Master's degree in Stati...,Confidential,1 - 2 years,Bachelor's Degree,"Analysis, SAS, SPSS, SQL, Statistics, Data Ana..."
2,2022-09-18 16:24:40,Senior Data Analyst - Cairo,Confidential,"Cairo, Egypt",Full Time,Experienced (Non-Manager),https://wuzzuf.net/jobs/p/R7KdIv7dCm0z-Senior-...,<p>Egybell is hiring RSC Analytic Manager for ...,<p>Additional Requirements:</p><ul><li>1. To h...,Confidential,5 - 9 years,Bachelor's Degree,"Data, Google Analytics, Marketing, Management,..."
3,2022-09-18 11:32:23,Senior Data Analyst,Alarabia Group,"10th of Ramadan City, Cairo, Egypt",Full Time,Experienced (Non-Manager),https://wuzzuf.net/jobs/p/TWguALrIS6vT-Senior-...,<ul><li>To understand business requirements in...,<ul><li>BSc/BA in Computer Science or relevant...,Confidential,2 - 5 years,Bachelor's Degree,"BI, BI Developer, Computer Science, developer,..."
4,2022-09-07 21:44:59,Data Analyst,Hands of Hope Physical Therapy & Wellness,"Maadi, Cairo, Egypt",Full Time,Experienced (Non-Manager),https://wuzzuf.net/jobs/p/9AAaTU2msaG5-Data-An...,"<ul><li>Track, collect, and interpret data, th...",<ul><li>Essential experience in one or more of...,Confidential,3 - 18 years,Bachelor's Degree,"Analysis, Analyst, Data, Data Analysis, Data A..."


## Output Results

In [10]:
timestamp_string = str(int(datetime.timestamp(datetime.now())))
df.to_csv(path_or_buf=f'outputs/WUZZUF_{SEARCH_TERM}_{timestamp_string}.csv', index=False)
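
Since `SEARCH_TERM` is placed directly in the file name, a term with spaces or other special characters ends up in the path. The sketch below shows one way to build a safer name; `output_filename` is a hypothetical helper, not part of the notebook. Note also that `to_csv` expects the `outputs` directory to exist already.

```python
import re
from datetime import datetime

def output_filename(search_term, now=None):
    # Hypothetical helper: replace every run of non-alphanumeric characters with '_'
    now = now or datetime.now()
    slug = re.sub(r'[^A-Za-z0-9]+', '_', search_term).strip('_')
    return f'WUZZUF_{slug}_{int(now.timestamp())}.csv'

name = output_filename('data analyst')
print(name)
```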