# How to Web Scrape [DataJobs.com](https://datajobs.com/)

If you're like me, endlessly scrolling through job postings to figure out what you should learn quickly becomes an exercise in futility. I need a way to digest the information from thousands of job postings all at once. That is what this project aims to do: scrape information from online job boards to gain valuable job-hunting insights!

<img src="IMG/DataJobs_Header.png">

This notebook will demonstrate how to scrape job entries from [DataJobs.com](https://datajobs.com/) using [Selenium](https://selenium-python.readthedocs.io/)! This serves as a guide informing the larger scraper that incorporates jobs from [Indeed.com](https://www.indeed.com/). Please see the [Webscrape_Datajobs](https://github.com/ColinB19/datajobswebscraper/blob/master/Webscrape_DataJobs.ipynb) notebook for the full end-product. 

In [None]:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver import Chrome
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.service import Service

import pandas as pd
import numpy as np
import regex as re

In [7]:
# enables automatic installation of chrome drivers
service = Service(ChromeDriverManager().install())
# set up chrome driver
driver = Chrome(service=service)

# navigate to DataJobs.com
site_url = "https://datajobs.com/"
driver.get(site_url)

Now that we have navigated to the site, let's click on the *Data Science Jobs / Analytics* link to get a list of all data science roles available on the platform.

<img src="IMG/DataJobs_link1-edit.png">


In [10]:
wait_time = 3
dsa_jobs_list = WebDriverWait(driver, wait_time).until(
    EC.element_to_be_clickable(
        (By.XPATH, "//a[contains(text(), 'Data Science Jobs / Analytics')]")
    )
)
dsa_jobs_list.click()

Now we can scroll through the jobs and pull information. The scraper should grab all of the entries on a page, then go to the next page and repeat the process. What will we scrape?

1. The link to the job posting - so we can scrape job descriptions later!
2. The Job Title
3. The Company
4. Pay information (if available)
5. The Location

<img src="IMG/scroll-edit.gif" height=400 width=600>
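
To make these fields concrete, here is a minimal sketch that runs the notebook's regular expression on a made-up HTML fragment. The markup, class names, and job details are invented for illustration and only approximate what DataJobs.com serves, and the field names in `fields` are assumptions. The standard-library `re` module is used so the snippet runs without a browser.

```python
import re

# Hypothetical listing markup, invented for illustration only.
sample_html = """<div class="title"><a href="/Acme-Data-Scientist-Job~1234"><strong>Data Scientist</strong> – <span class="co">Acme Corp</span></a>
</div>
<div class="meta">
<em>
<span class="loc">New York, NY</span> &nbsp;•&nbsp; $100,000 – $150,000
</em>
</div>"""

# Same pattern as the scraper: url, title, company, location, then the pay range.
pattern = r"<a href=\"(.*)\"><strong>(.*)</strong> – <span [^\>]*>(.*)</span></a>[\n\s]*</div>[\n\s]*<div[^\>]*>[\n\s]*<em>[\n\s]*<span[^\>]*>(.*)</span>[\n\s]*[\&nbsp;\•]*[\n\s]*\$*([\d,]*)[–\s]*\$*([\d,]*)[\n\s]*</em>"

# Assumed field names, one per capture group.
fields = ("url", "title", "company", "location", "pay_min", "pay_max")

# findall returns one tuple of captured groups per matching job entry
jobs = [dict(zip(fields, match)) for match in re.findall(pattern, sample_html)]
print(jobs)
```

When a posting lists no pay, the two pay groups simply come back as empty strings, so every entry still yields six values.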


In [None]:
# this is the regex we will use to parse the HTML
dj_pattern = r"<a href=\"(.*)\"><strong>(.*)</strong> – <span [^\>]*>(.*)</span></a>[\n\s]*</div>[\n\s]*<div[^\>]*>[\n\s]*<em>[\n\s]*<span[^\>]*>(.*)</span>[\n\s]*[\&nbsp;\•]*[\n\s]*\$*([\d,]*)[–\s]*\$*([\d,]*)[\n\s]*</em>"

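# job_meta collects the scraped rows. The column names are assumed here (the full
# notebook defines its own): the first six follow the regex capture groups in order,
# and the last one holds the board category.
job_meta = pd.DataFrame(
    columns=["url", "title", "company", "location", "pay_min", "pay_max", "category"]
)

board_paths = ["/Data-Science-Jobs", "/Data-Engineering-Jobs"]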
# loop through the boards available
for bp in board_paths:
    if bp == "/Data-Science-Jobs":
        cat = "Data Science & Analytics"
    else:
        cat = "Data Engineering"
    # load into the webpage
    # site_url already ends with '/', so strip it to avoid a double slash
    driver.get(site_url.rstrip("/") + bp)
    more_pages = True  # will kill the loop when there are no more pages
    i = 0  # just a counter to kill the loop just in case
    while more_pages:
        # grab page source html
        page_html = driver.page_source

        # grab job info
        fall = re.findall(dj_pattern, page_html)
        
        # zip the info into a dict for easy DataFrame-ability
        fall_cols = [
            dict(
                zip(
                    job_meta.columns,
                    # build a tuple first: a generator cannot be concatenated with (cat,)
                    tuple(
                        (
                            # decode common HTML entities so they don't confuse the CSV format
                            y.replace("&amp;", "&")
                            .replace("&amp,", "&")
                            .replace("&nbsp;", " ")
                            .replace("&nbsp,", " ")
                            if isinstance(y, str)
                            else y
                        )
                        for y in x
                    )
                    + (cat,),
                )
            )
            for x in fall
        ]
        # add to dataframe
        job_meta = pd.concat(
            [job_meta, pd.DataFrame(fall_cols)], ignore_index=True
        )

        if i == 300:
            # stop after 300 pages
            more_pages = False

        # try to go to next page
        try:
            next_page = WebDriverWait(driver, wait_time).until(
                EC.element_to_be_clickable(
                    (By.XPATH, "//a[contains(text(), 'NEXT PAGE')]")
                )
            )
            next_page.click()
            i += 1
        except Exception:  # typically a timeout when there is no NEXT PAGE link
            print(f"END OF SEARCH RESULTS: {site_url} || {bp}")
            more_pages = False
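
The chained `.replace` calls in the loop above handle two specific entities. As a possible alternative (not what the original scraper does), the standard library's `html.unescape` decodes all named and numeric entities in one call. The helper name `clean_field` is hypothetical.

```python
import html


def clean_field(value):
    """Decode HTML entities in a scraped string; leave non-strings untouched."""
    if not isinstance(value, str):
        return value
    # html.unescape turns &amp; into & and &nbsp; into a non-breaking space (\xa0),
    # so the non-breaking space is swapped for a regular one afterwards.
    return html.unescape(value).replace("\xa0", " ").strip()


print(clean_field("AT&amp;T&nbsp;Labs"))  # AT&T Labs
```

The trade-off is that this decodes every entity, including ones such as `&lt;` that the original replacements leave alone.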

Lastly, let's scrape job descriptions. This is so we can find some popular skills and terms for data jobs.

<img src="IMG/JobPost_HTML_example.png">

First, we can utilize the links we scraped earlier to navigate to the job posting webpage:

```python
# slice off the leading '/' so the joined link has the proper number of '/' characters
job_url = site_url + job["url"][1:]
```
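
As a sketch of an alternative to slicing off the leading slash by hand, `urllib.parse.urljoin` from the standard library resolves a relative link against the base URL and handles the slashes either way. The relative link below is a made-up example, not a real posting.

```python
from urllib.parse import urljoin

site_url = "https://datajobs.com/"
relative_link = "/Acme-Data-Scientist-Job~1234"  # hypothetical scraped href

# urljoin copes with a leading '/' on the link and a trailing '/' on the base
job_url = urljoin(site_url, relative_link)
print(job_url)  # https://datajobs.com/Acme-Data-Scientist-Job~1234
```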

In [None]:
def cleanhtml(html_string: str) -> str:
    """Regex pattern to match and remove all html tags and comments, leaving plain text."""
    html_string2 = re.sub("(<!--.*?-->)", "", html_string, flags=re.DOTALL)
    cleaned_html = re.sub(
        "<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});", " ", html_string2
    )
    return cleaned_html

# list to hold description text
job_desc_list = []
for _, job in job_meta.iterrows():
    # set up the URL so the driver can navigate there
    job_url = site_url + job["url"][1:]
    # navigate to the job posting
    driver.get(job_url)
    # grab job desc element
    try:
        job_descr = WebDriverWait(driver, wait_time).until(
            EC.element_to_be_clickable(
                (
                    By.XPATH,
                    "//div[@id='job_description']//*[@class='jobpost-table-cell-2']",
                )
            )
        )
    except Exception:  # typically a timeout when the description never loads
        print(f"I can't find this job: {job['title']} || {job_url}")
        continue

    # get html
    job_desc_clean = (
        cleanhtml(job_descr.get_attribute("innerHTML"))
        .replace("&amp;", "&") # this removes some HTML stuff to not confuse the CSV format
        .replace("&amp,", "&")
        .replace("&nbsp;", " ")
        .replace("&nbsp,", " ")
    )
    job_desc_list.append(
        {
            "job_id": job.get("job_id", job.name),  # falls back to the row index if there is no job_id column
            "title": job["title"],
            "company": job["company"],
            "desc": job_desc_clean,
        }
    )
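
To see what `cleanhtml` does to a description, here is a small self-contained sketch that reproduces the function and applies it to a made-up HTML string. The sample text is invented, and the whitespace collapsing at the end is an optional extra step, not part of the original scraper.

```python
import re


def cleanhtml(html_string: str) -> str:
    """Same logic as the notebook's cleanhtml, repeated so the snippet stands alone."""
    no_comments = re.sub(r"(<!--.*?-->)", "", html_string, flags=re.DOTALL)
    return re.sub(
        r"<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});", " ", no_comments
    )


sample = "<p>Skills: <b>Python</b> &amp; SQL<!-- hidden note --></p>"  # made-up description
# tags and entities each become a space, so collapse the leftover runs of whitespace
text = " ".join(cleanhtml(sample).split())
print(text)  # Skills: Python SQL
```

Note that entities such as `&amp;` are replaced by a space rather than decoded, so the ampersand does not survive the cleaning step.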
