Skip to content

SoujxD/Python-Automation

Repository files navigation

📊 Airflow Pipeline for Internship Scraping and Email Reporting 🚀

Project Overview

This project automates the process of scraping job listings from Glassdoor, storing the scraped data in a PostgreSQL database, and sending an email report containing job listings with specific filters. The entire workflow is orchestrated using Apache Airflow.

Key Features

  • Web Scraping: Scrapes job listings from Glassdoor using Selenium.
  • Data Storage: Stores the scraped data in a PostgreSQL database for future queries.
  • Email Reporting: Filters jobs based on specific criteria and sends an automated email report with a CSV attachment.
  • Task Orchestration: Managed through Apache Airflow with tasks scheduled to run daily.

🛠️ Project Structure

Tasks in the Workflow

  1. Scrape Glassdoor Jobs: Uses Selenium to scrape job titles, locations, and company names from Glassdoor's job listings for Data Analyst positions.
  2. Store Data in PostgreSQL: Saves the scraped data in a PostgreSQL table called internships.
  3. Send Email Report: Queries the PostgreSQL database for jobs related to "Data" and sends a CSV report via email.

💻 Technologies Used

  • Apache Airflow: Orchestration and scheduling of tasks.
  • Selenium: Web scraping tool for automating the browser to extract job listings.
  • Pandas: Data manipulation and transformation.
  • PostgreSQL: Relational database for storing job listings.
  • psycopg2: PostgreSQL adapter for Python.
  • smtplib: Python library for sending emails via SMTP.
  • ChromeDriver: WebDriver to interact with Chrome for Selenium.

🚀 Workflow Steps

1. Scrape Job Listings from Glassdoor

The Selenium-based scraper navigates to the Glassdoor website, collects job listings for Data Analyst positions, and stores the data as a CSV file.

Example Data Columns:

  • Job Title
  • Location
  • Company Name
driver.get("https://www.glassdoor.com/Job/united-states-data-analyst-jobs")
# Extract and save the data to CSV

2. Store Data in PostgreSQL

The CSV data from the scraper is read into a pandas DataFrame and then inserted into the internships table of a PostgreSQL database.

cur.execute("INSERT INTO internships (title, company, location) VALUES (%s, %s, %s)", ...)

3. Query Data and Send Email

A filtered report is generated by querying PostgreSQL for jobs containing "Data" in the title, and the results are attached as a CSV file in an email.

query = "SELECT title, company, location FROM internships WHERE title ILIKE '%%Data%%'"

📅 Airflow DAG

The workflow is defined as an Airflow DAG with three Python tasks:

Scrape Glassdoor Jobs Store data in PostgreSQL Send email report

scrape_task >> store_task >> email_task

⚙️ Prerequisites

  • Environment Setup
  • Python 3.x installed
  • PostgreSQL database
  • Airflow installed and running
  • Selenium and ChromeDriver for web scraping
  • Gmail account for sending emails (with app password enabled)

Python Libraries

Install the required libraries:

pip install airflow selenium pandas psycopg2 smtplib webdriver-manager

Database Setup

Create a table in PostgreSQL:

CREATE TABLE internships (
    id SERIAL PRIMARY KEY,
    title VARCHAR(255),
    company VARCHAR(255),
    location VARCHAR(255)
);

🔧 Configuration

  1. Update the file paths in the scraping and email tasks for saving and reading CSV files.
  2. Set your database credentials in the PostgreSQL task.
  3. onfigure your email settings (sender email, app password, recipient) in the email task.

📬 Email Example

The email report will contain a CSV file (filtered_internships_report.csv) with job listings matching the filter "Data" in the title. The email will look like:

Subject: Daily Internship Report

Body:

Please find attached the report.

🎯 Future Improvements

Add more advanced filters to the SQL query for email reporting. Scrape additional job platforms. Enhance error handling and task retry logic in Airflow.

🚀 Conclusion

This project automates the process of scraping job listings, storing the data, and sending reports using Apache Airflow, making it a powerful tool for job seekers or recruiters. The modular structure allows for easy expansion and customization.

About

This project automates the process of scraping job listings from Glassdoor, storing the scraped data in a PostgreSQL database, and sending an email report containing job listings with specific filters. The entire workflow is orchestrated using Apache Airflow.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors