# Web Scraping Glassdoor

**Introduction**

In this project, we'll build a web scraper to extract job listings from Glassdoor, a popular job search platform. We'll use Python with requests to download the search results page and BeautifulSoup, a Python library for parsing HTML, to extract job titles, companies, locations, job descriptions, and other relevant information.

Here are the main steps we'll follow in this project:

- Set up our development environment

- Understand the basics of web scraping

- Analyze the website structure of Glassdoor

- Write the Python code to extract job data from Glassdoor

- Save the data to a CSV file

- Test our web scraper and refine our code as needed

**Prerequisites**

Before starting this project, you should have some basic knowledge of Python programming and HTML structure. The script uses the following libraries:

- requests
- BeautifulSoup (installed as the beautifulsoup4 package)
- csv
- datetime

The csv and datetime modules are part of Python's standard library, so only requests and BeautifulSoup need to be installed. You can install them using the following commands in your command prompt/terminal:

**pip install requests**

**pip install beautifulsoup4**
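
To confirm that the installation worked, a quick check along the following lines may help. This is only a sketch: the helper name `missing_packages` is hypothetical, and `bs4` is the import name of the beautifulsoup4 package.

```python
import importlib.util


def missing_packages(names):
    """Return the import names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# bs4 is the import name of the beautifulsoup4 package;
# csv and datetime come with Python itself.
REQUIRED = ["requests", "bs4", "csv", "datetime"]

missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are available")
```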


**Project Scope**

We will be building a script that can scrape job postings from Glassdoor.com based on a specific job position and location. We will extract the following information from each job posting:

- Job title
- Company name
- Job location
- Posting date
- Summary of the job
- Salary (if available)
- Job URL

We will then store this information in a CSV file for further analysis.




**Step 1: Importing Required Libraries**

Here, we are importing the required libraries: csv for writing data to a CSV file, datetime for getting the current date, requests for sending HTTP requests to the website, BeautifulSoup for parsing the HTML source code of the webpage, and time for introducing a delay in our program.

In [129]:
import csv
from datetime import datetime
import requests
from bs4 import BeautifulSoup
import time

**Step 2: Generating the URL**

Here, we define a function get_url that takes two parameters, position and location, and returns the URL of the search results page we want to scrape. The template URL contains a single placeholder, which is filled with the position as the search keyword. The location is not substituted into the URL: it is fixed by the hard-coded parameters locT=C and locId=1139970, so the location argument currently has no effect on the result. To search a different location, change these parameters to match it. You can customize the other parameters, such as JobType=all and fromAge=1, based on your needs.

In [130]:

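def get_url(position, location):
    """Generate url from position and location"""
    # The template has one placeholder (the search keyword); the location is
    # fixed by the hard-coded locT and locId query parameters.
    template = "https://www.glassdoor.com/Job/jobs.htm?sc.keyword={}&locT=C&locId=1139970&JobType=all&fromAge=1"
    position = position.replace(" ", "+")
    location = location.replace(" ", "+")
    # str.format ignores the extra positional argument, so location is unused
    url = template.format(position, location)
    return url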
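
Replacing spaces with `+` by hand leaves other special characters unescaped, such as the `+` signs in "C++". As a minimal sketch of an alternative, assuming the same query parameters as the template above, `urllib.parse.urlencode` from the standard library can build the query string and do the escaping. The names `build_search_url` and `loc_id` are hypothetical and not part of the original script.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.glassdoor.com/Job/jobs.htm"


def build_search_url(position, loc_id="1139970"):
    """Build the search URL, letting urlencode escape the keyword."""
    params = {
        "sc.keyword": position,
        "locT": "C",
        "locId": loc_id,  # default copied from the template; other values are up to the reader
        "JobType": "all",
        "fromAge": 1,
    }
    return f"{BASE_URL}?{urlencode(params)}"


print(build_search_url("data engineer"))
```

With this approach a keyword such as "C++ developer" is encoded as `C%2B%2B+developer` instead of being passed through unchanged.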

**Step 3: Extracting Job Data**

The next step is to define a function that will take a single job posting record as input, and extract the relevant data from it. This function will be called from within the main() function, which we will define in the next step.

To do this, we'll use the BeautifulSoup library to parse the HTML of the job posting card, and extract the desired data using a series of try/except blocks.


In [131]:
def get_record(card):
    """Extract job data from a single record"""
    atags = card.find_all("a")
    try:
        job_title = atags[0].text.strip()
    except IndexError:
        job_title = ""
    try:
        company = atags[1].text.strip()
    except IndexError:
        company = ""
    try:
        job_location = card.find("span", {"class": "jobLocation"}).text.strip()
    except AttributeError:
        job_location = ""
    try:
        post_date = card.find("span", {"class": "jobAge"}).text.strip()
    except AttributeError:
        post_date = ""
    try:
        summary = card.find("div", {"class": "jobDescriptionContent"}).text.strip()
    except AttributeError:
        summary = ""
    try:
        salary = card.find("span", {"class": "salaryText"}).text.strip()
    except AttributeError:
        salary = ""
    try:
        # A link without an href attribute raises KeyError
        job_url = "https://www.glassdoor.com" + atags[0]["href"]
    except (IndexError, KeyError, TypeError):
        job_url = ""

    today = datetime.today().strftime("%Y-%m-%d")
    record = (job_title, company, job_location, post_date, today, summary, salary, job_url)
    return record
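
Several of the try/except blocks in get_record follow the same pattern: look up an element, then fall back to an empty string if it is missing. One possible way to shorten them is a small helper like the one sketched below. The name `text_or_blank` is hypothetical; the sketch relies only on the fact that BeautifulSoup's `find` returns `None` when nothing matches.

```python
def text_or_blank(node):
    """Return the stripped text of a parsed element, or "" when it is missing."""
    # find() returns None when no element matches the search
    if node is None:
        return ""
    return node.text.strip()


# Intended use inside get_record, where card is a BeautifulSoup element:
# job_location = text_or_blank(card.find("span", {"class": "jobLocation"}))
# salary = text_or_blank(card.find("span", {"class": "salaryText"}))
```

This keeps the behaviour of the original blocks (missing fields become empty strings) while removing the repeated exception handling.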

**Step 4: Define the main function**

Define the main function that takes two parameters: job position and location. This function performs the following steps:

- Set the headers for the HTTP request. Glassdoor may block requests from bots, so it's a good idea to set a user agent string.
- Construct the URL for the job search based on the job position and location.
- Send an HTTP request to the URL and retrieve the HTML code of the search results page.
- Parse the HTML code using BeautifulSoup and select the HTML elements that contain the job postings.
- Extract the job posting information using the helper functions and store it in a list.
- Write the job posting information to a CSV file.
- Print a success message.

In [132]:

def main(position, location):
    """Run the main program routine"""
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3"
    }
    url = get_url(position, location)
    records = []
    try:
        # Fail fast on timeouts and on HTTP error statuses such as 403
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, "html.parser")
        cards = soup.find_all("li", {"class": "react-job-listing"})
        for card in cards:
            record = get_record(card)
            records.append(record)
        with open("jobs.csv", "a", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            # The file is opened in append mode, so write the header only
            # when the file is still empty
            if f.tell() == 0:
                writer.writerow(
                    [
                        "JobTitle",
                        "Company",
                        "JobLocation",
                        "PostDate",
                        "ExtractDate",
                        "Summary",
                        "Salary",
                        "JobUrl",
                    ]
                )
            writer.writerows(records)
    except Exception as e:
        print(e)
        print("Error scraping job postings")
        return None
    print(f"Successfully scraped {len(records)} job postings")
    return
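
Step 1 imports time for introducing a delay, but main sends a single request and never pauses. If the scraper is extended to retry failed requests or to fetch several pages, a delay between attempts is a reasonable courtesy to the server. The following is a sketch under that assumption, not part of the original script: `fetch_with_retries` is a hypothetical helper that accepts any callable, so it can wrap `requests.get` without depending on it.

```python
import time


def fetch_with_retries(fetch, url, attempts=3, delay=2.0):
    """Call fetch(url) up to `attempts` times, sleeping `delay` seconds between tries."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return fetch(url)
        except Exception as error:  # broad on purpose: this is only a sketch
            last_error = error
            if attempt < attempts:
                time.sleep(delay)  # pause before the next attempt
    raise last_error


# Possible use inside main (headers and url as defined there):
# response = fetch_with_retries(
#     lambda u: requests.get(u, headers=headers, timeout=10), url
# )
```

Passing the request function in as an argument keeps the helper easy to test with a fake fetcher and leaves the choice of HTTP library to the caller.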


**Step 5: Run the Main Function**

Call the main function with the job position and location parameters. Check the CSV file to verify that the job posting information has been extracted correctly.

In [133]:
main('developer', 'texas')

Successfully scraped 30 job postings
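
To check the CSV file without opening it by hand, it can be read back with the standard csv module. The sketch below assumes the column names written by main; the function name `summarize_csv` is hypothetical.

```python
import csv


def summarize_csv(path):
    """Return (row_count, rows_missing_title) for a CSV with a JobTitle column."""
    with open(path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    # Count rows whose JobTitle is absent or blank
    missing_title = sum(
        1 for row in rows if not (row.get("JobTitle") or "").strip()
    )
    return len(rows), missing_title
```

For example, calling `summarize_csv("jobs.csv")` after running main returns the number of data rows and how many of them have an empty JobTitle, which is a quick signal that the selectors in get_record still match the page.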


**Conclusion**

Web scraping is a powerful tool for extracting data from websites. In this project, we have used web scraping to extract job postings from Glassdoor. You can use similar techniques to extract data from other websites as well. However, be aware of the legal and ethical implications of web scraping and make sure to comply with the website's terms of service.



