# 🪚 How to Web Scrape [DataJobs.com](https://datajobs.com/)

*Source: 🤖 [Data Jobs Webscraper](https://github.com/ColinB19/datajobswebscraper) Repo*

If you're like me, endlessly scrolling through job postings to gain insights on what you should learn becomes an exercise in futility. After scrolling through hundreds of jobs for hours on end, it is just too difficult to know which skills are the most important. I need a way to digest all of the information from thousands of job postings simultaneously. That is what this project aims to do: pull information from online job boards to gain valuable job-hunting insights!

**Web Scraping** is the programmatic retrieval of data from websites. In general, the web scraping bot navigates to a desired webpage and extracts or parses the source HTML for the desired data. 

[**Selenium**](https://selenium-python.readthedocs.io/) is a browser automation framework with Python bindings. It drives a real browser instance to navigate through webpages, interact with page elements, and extract information. Selenium works with most major browsers; this notebook uses Google Chrome.

<img src="IMG/DataJobs_Header.png">

This notebook demonstrates how to scrape job entries from [**DataJobs.com**](https://datajobs.com/) using **Selenium**. It serves as a guide to the larger scraper, which also incorporates jobs from [**Indeed.com**](https://www.indeed.com/). Please see the [**Full Webscraping**](https://github.com/ColinB19/datajobswebscraper/blob/master/Webscrape_DataJobs.ipynb) notebook for the full end product.

> **Note**: You should always check a website's *robots.txt* file to ensure that scraping is allowed. DataJobs has no restrictions on web scraping, so we are good to go. If you want to check for yourself, just navigate to the website's base URL + `"/robots.txt"`.
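
The same check can be done programmatically. Below is a minimal sketch using the standard library's `urllib.robotparser`. The helper name `is_allowed` and the `sample_rules` text are made up for illustration; the rules shown are not DataJobs' actual *robots.txt*.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if `user_agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    # parse() takes the file content as a list of lines, so no network access is needed
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt content, for illustration only
sample_rules = """\
User-agent: *
Disallow: /private/
"""

print(is_allowed(sample_rules, "https://example.com/jobs"))
print(is_allowed(sample_rules, "https://example.com/private/page"))
```

To check a live site instead, `RobotFileParser` can download the file itself: call `set_url()` with the site's base URL + `"/robots.txt"`, then `read()`, and query `can_fetch()` as above.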

# Import Dependencies
- [**Selenium**](https://selenium-python.readthedocs.io/): web scraping library allowing us to extract information from the internet
- [**Pandas**](https://pandas.pydata.org/docs/): data management library that we will use to conveniently store and manipulate the data
- [**regex**](https://github.com/mrabarnett/mrab-regex): a text parsing library that will allow us to search through the HTML from the web pages we scrape to find key information.
- [**webdriver_manager**](https://github.com/SergeyPirogov/webdriver_manager): a package that enables automatic download of the proper chromium drivers.

> **Regex** is a third-party package that is backwards-compatible with the built-in [**re**](https://docs.python.org/3/library/re.html) module and adds some extended functionality. I tend to use it by default.

In [1]:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver import Chrome
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.service import Service

import pandas as pd
import regex as re

# Setup Driver
Now we can set up the Chrome driver. I use the ChromeDriverManager class from the **webdriver_manager** package. This enables automatic download of the Chromium drivers so you do not need to manually install and point **Selenium** to them. 

In [2]:
# enables automatic installation of chrome drivers
service = Service(ChromeDriverManager().install())
# set up chrome driver
driver = Chrome(service=service)

# navigate to DataJobs.com
site_url = "https://datajobs.com/"
driver.get(site_url)

# Page Navigation

Now that we have navigated to the site's home page, let's click on the `Data Science Jobs / Analytics` link to get a list of all data science roles available on the platform.

> **Note**: in the [**Full Webscraping**](https://github.com/ColinB19/datajobswebscraper/blob/master/Webscrape_DataJobs.ipynb) notebook, we scrape both the Data Science and Data Engineering positions. 

<img src="IMG/DataJobs_link1-edit.png">


In [3]:
wait_time = 3
dsa_jobs_list = WebDriverWait(driver, wait_time).until(
    EC.element_to_be_clickable(
        (By.XPATH, "//a[contains(text(), 'Data Science Jobs / Analytics')]")
    )
)
dsa_jobs_list.click()

# Scrape Job Information

Now we can scroll through the jobs and pull information. The scraper should grab all of the entries on a page, then go to the next page and repeat the process. What will we scrape?

1. The link to the job posting - so we can scrape job descriptions later!
2. The Job Title
3. The Company
4. Pay information (if available)
5. The Location

<img src="IMG/scroll-edit.gif" height=400 width=600>


In [4]:
# this is the regex we will use to parse the HTML
dj_pattern = r"<a href=\"(.*)\"><strong>(.*)</strong> – <span [^\>]*>(.*)</span></a>[\n\s]*</div>[\n\s]*<div[^\>]*>[\n\s]*<em>[\n\s]*<span[^\>]*>(.*)</span>[\n\s]*[\&nbsp;\•]*[\n\s]*\$*([\d,]*)[–\s]*\$*([\d,]*)[\n\s]*</em>"

# these are URL paths to the specific job lists for each job type on DataJobs
board_paths = ["/Data-Science-Jobs", "/Data-Engineering-Jobs"]

job_meta = pd.DataFrame(
            columns=[
                "url",
                "title",
                "company",
                "location",
                "salary_lower",
                "salary_upper"
            ]
        )
# loop through the boards available
for bp in board_paths:
    if bp == "/Data-Science-Jobs":
        cat = "Data Science & Analytics"
    else:
        cat = "Data Engineering"
    # navigate to the job board (drop the leading "/" because site_url already ends with one)
    driver.get(site_url + bp[1:])
    more_pages = True  # will kill the loop when there are no more pages
    i = 0  # just a counter to kill the loop just in case
    while more_pages:
        # grab page source html
        page_html = driver.page_source

        # grab job info
        fall = re.findall(dj_pattern, page_html)
        
        # zip the info into a dict for easy DataFrame-ability
        fall_cols = [
            dict(
                zip(
                    job_meta.columns,
                    (
                        (
                            y.replace("&amp;", "&") # decode common HTML entities so they don't clutter the saved data
                            .replace("&amp,", "&")
                            .replace("&nbsp;", " ")
                            .replace("&nbsp,", " ")
                            if isinstance(y, str)
                            else y
                        )
                        for y in x
                    )
                )
            )
            for x in fall
        ]
        # add to dataframe
        job_meta = pd.concat(
            [job_meta, pd.DataFrame(fall_cols)], ignore_index=True
        )

        i += 1
        if i >= 10:
            # safety cap: stop after scraping 10 pages of this board
            break

        # try to go to next page
        try:
            next_page = WebDriverWait(driver, wait_time).until(
                EC.element_to_be_clickable(
                    (By.XPATH, "//a[contains(text(), 'NEXT PAGE')]")
                )
            )
            next_page.click()
        except Exception:
            # no clickable NEXT PAGE link appeared within wait_time, so this was the last page
            print(f"END OF SEARCH RESULTS: {site_url} || {bp}")
            more_pages = False

In [5]:
# let's check out our data!
job_meta.head()

| | url | title | company | location | salary_lower | salary_upper |
|---|---|---|---|---|---|---|
| 0 | /Columbia-University-Data-Science-Institute/Po... | Postdoctoral Research Scientist | Columbia University, Data Science Institute | New York City | | |
| 1 | /the-D-E-Shaw-Group/People-Analytics-Data-Scie... | People Analytics - Data Science & Reporting An... | The D. E. Shaw Group | New York City | | |
| 2 | /the-Ohio-State-University/Lead-Data-Scientist... | Lead Data Scientist | The Ohio State University | Columbus, OH | | |
| 3 | /DraftKings/Lead-Data-Science-Engineer-Job~100873 | Lead Data Science Engineer | DraftKings | Remote | | |
| 4 | /Shopify/Staff-Data-Scientist-Americas-Remote-... | Staff Data Scientist (Americas - Remote) | Shopify | Remote | | |
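
To make it easier to see how `dj_pattern` maps onto these six columns, here is a small offline sketch that runs the same pattern over a hand-written HTML snippet. The snippet, its class names, and the helpers `parse_jobs` and `salary_to_int` are assumptions made up for illustration; the real DataJobs markup may differ. The built-in `re` module is used here so the example has no third-party dependencies.

```python
import re

# same pattern as in the scraping cell above
dj_pattern = r"<a href=\"(.*)\"><strong>(.*)</strong> – <span [^\>]*>(.*)</span></a>[\n\s]*</div>[\n\s]*<div[^\>]*>[\n\s]*<em>[\n\s]*<span[^\>]*>(.*)</span>[\n\s]*[\&nbsp;\•]*[\n\s]*\$*([\d,]*)[–\s]*\$*([\d,]*)[\n\s]*</em>"

columns = ["url", "title", "company", "location", "salary_lower", "salary_upper"]

# hypothetical listing markup, written to match the pattern (not copied from the site)
sample_html = """
<div class="listing-title">
<a href="/Acme-Corp/Data-Scientist-Job~1"><strong>Data Scientist</strong> – <span class="co">Acme Corp</span></a>
</div>
<div class="listing-meta">
<em>
<span class="loc">Remote</span>
&nbsp;•&nbsp;
$100,000 – $150,000
</em>
</div>
"""


def parse_jobs(page_html):
    """Return one dict per job entry, keyed by column name."""
    return [dict(zip(columns, match)) for match in re.findall(dj_pattern, page_html)]


def salary_to_int(value):
    """Convert a captured salary such as '100,000' to an int; empty captures become None."""
    return int(value.replace(",", "")) if value else None


jobs = parse_jobs(sample_html)
print(jobs)
print(salary_to_int(jobs[0]["salary_lower"]), salary_to_int(jobs[0]["salary_upper"]))
```

Each capture group in the pattern corresponds to one column, in order. When a posting lists no pay, the two salary groups match empty strings, which is why the salary columns above are blank.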


# Scrape Job Descriptions

Lastly, let's scrape the job descriptions so we can find popular skills and terms for data jobs.

<img src="IMG/JobPost_HTML_example.png">



In [8]:
def cleanhtml(html_string: str) -> str:
    """Replace HTML comments, tags, and character entities with spaces, leaving plain text."""
    html_string2 = re.sub("(<!--.*?-->)", "", html_string, flags=re.DOTALL)
    cleaned_html = re.sub(
        "<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});", " ", html_string2
    )
    return cleaned_html

# list to hold description text
job_desc_list = []
for _, job in job_meta.iterrows():
    # set up the URL so the driver can navigate there
    job_url = site_url + job["url"][1:]
    # navigate to the job posting
    driver.get(job_url)
    # grab job desc element
    try:
        job_descr = WebDriverWait(driver, wait_time).until(
            EC.visibility_of_element_located(
                (
                    By.XPATH,
                    "//div[@id='job_description']//*[@class='jobpost-table-cell-2']",
                )
            )
        )
    except Exception:
        print(f"I can't find this job: {job['title']}")
        continue

    # get the inner HTML, decode common entities, then strip the remaining markup
    # (decoding must come first because cleanhtml() also removes entities)
    job_desc_clean = cleanhtml(
        job_descr.get_attribute("innerHTML")
        .replace("&amp;", "&")
        .replace("&amp,", "&")
        .replace("&nbsp;", " ")
        .replace("&nbsp,", " ")
    )
    job_desc_list.append(
        {
            "title": job["title"],
            "company": job["company"],
            "desc": job_desc_clean,
        }
    )
# make a pandas dataframe
job_descs = pd.DataFrame(job_desc_list)


I can't find this job: Data Science Senior Engineer


In [9]:
# let's check out our data!
job_descs.head()

| | title | company | desc |
|---|---|---|---|
| 0 | Postdoctoral Research Scientist | Columbia University, Data Science Institute | \n The Data Science In... |
| 1 | People Analytics - Data Science & Reporting An... | The D. E. Shaw Group | \n OVERVIEW: The D.... |
| 2 | Lead Data Scientist | The Ohio State University | \n Department: Publ... |
| 3 | Lead Data Science Engineer | DraftKings | \n BE THE STRATEGY BEH... |
| 4 | Staff Data Scientist (Americas - Remote) | Shopify | \n Staff Data Scienti... |
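
With the description text in hand, one simple way to look for popular skills is to count how many postings mention each term. The sketch below is only an illustration of that idea, not part of the original scraper: the `skills` list, the sample descriptions, and the helper `count_skill_mentions` are all hypothetical.

```python
import re
from collections import Counter

# hypothetical list of skills to look for
skills = ["python", "sql", "spark", "tableau"]


def count_skill_mentions(descriptions, skills):
    """Count how many descriptions mention each skill at least once (case-insensitive, whole word)."""
    counts = Counter()
    for text in descriptions:
        for skill in skills:
            if re.search(rf"\b{re.escape(skill)}\b", text, flags=re.IGNORECASE):
                counts[skill] += 1
    return counts


# made-up descriptions standing in for the scraped "desc" column
sample_descs = [
    "We use Python and SQL daily.",
    "Experience with SQL, Spark and Tableau required.",
    "Strong python skills; SQL a plus.",
]

counts = count_skill_mentions(sample_descs, skills)
print(counts.most_common())
```

On the scraped data, the call would look like `count_skill_mentions(job_descs["desc"], skills)`. Note that the `\b` word boundary does not work for terms ending in punctuation such as `C++`, so those would need a different pattern.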


In [10]:
# end the browser session and shut down the driver process
driver.quit()