This project automates the process of scraping job listings from Glassdoor, storing the scraped data in a PostgreSQL database, and sending an email report containing job listings with specific filters. The entire workflow is orchestrated using Apache Airflow.
- Web Scraping: Scrapes job listings from Glassdoor using Selenium.
- Data Storage: Stores the scraped data in a PostgreSQL database for future queries.
- Email Reporting: Filters jobs based on specific criteria and sends an automated email report with a CSV attachment.
- Task Orchestration: Managed through Apache Airflow with tasks scheduled to run daily.
- Scrape Glassdoor Jobs: Uses Selenium to scrape job titles, locations, and company names from Glassdoor's job listings for Data Analyst positions.
- Store Data in PostgreSQL: Saves the scraped data in a PostgreSQL table called
internships. - Send Email Report: Queries the PostgreSQL database for jobs related to "Data" and sends a CSV report via email.
- Apache Airflow: Orchestration and scheduling of tasks.
- Selenium: Web scraping tool for automating the browser to extract job listings.
- Pandas: Data manipulation and transformation.
- PostgreSQL: Relational database for storing job listings.
- psycopg2: PostgreSQL adapter for Python.
- smtplib: Python library for sending emails via SMTP.
- ChromeDriver: WebDriver to interact with Chrome for Selenium.
The Selenium-based scraper navigates to the Glassdoor website, collects job listings for Data Analyst positions, and stores the data as a CSV file.
Example Data Columns:
- Job Title
- Location
- Company Name
driver.get("https://www.glassdoor.com/Job/united-states-data-analyst-jobs")
# Extract and save the data to CSVThe CSV data from the scraper is read into a pandas DataFrame and then inserted into the internships table of a PostgreSQL database.
cur.execute("INSERT INTO internships (title, company, location) VALUES (%s, %s, %s)", ...)A filtered report is generated by querying PostgreSQL for jobs containing "Data" in the title, and the results are attached as a CSV file in an email.
query = "SELECT title, company, location FROM internships WHERE title ILIKE '%%Data%%'"The workflow is defined as an Airflow DAG with three Python tasks:
Scrape Glassdoor Jobs Store data in PostgreSQL Send email report
scrape_task >> store_task >> email_task- Environment Setup
- Python 3.x installed
- PostgreSQL database
- Airflow installed and running
- Selenium and ChromeDriver for web scraping
- Gmail account for sending emails (with app password enabled)
Install the required libraries:
pip install airflow selenium pandas psycopg2 smtplib webdriver-managerCreate a table in PostgreSQL:
CREATE TABLE internships (
id SERIAL PRIMARY KEY,
title VARCHAR(255),
company VARCHAR(255),
location VARCHAR(255)
);- Update the file paths in the scraping and email tasks for saving and reading CSV files.
- Set your database credentials in the PostgreSQL task.
- onfigure your email settings (sender email, app password, recipient) in the email task.
The email report will contain a CSV file (filtered_internships_report.csv) with job listings matching the filter "Data" in the title. The email will look like:
Subject: Daily Internship Report
Body:
Please find attached the report.
Add more advanced filters to the SQL query for email reporting. Scrape additional job platforms. Enhance error handling and task retry logic in Airflow.
This project automates the process of scraping job listings, storing the data, and sending reports using Apache Airflow, making it a powerful tool for job seekers or recruiters. The modular structure allows for easy expansion and customization.