πŸ‘¨β€πŸ’» Data Engineer Jobs Exploration and Salary Prediction Project based on Glassdoor 2023 USA Job Listings


Banner

    Glassdoor Data Engineer Jobs

Gain insights into the job market for data engineers in the USA

Live Preview πŸ›Έ Data on Kaggle πŸŒͺ️ Request Feature

πŸ“ Table of Contents

  1. Project Overview
  2. Project Architecture
  3. Web Scraping
  4. Data Cleaning, EDA and Model Building
  5. Installation
  6. References
  7. Contact

πŸ•΅οΈ Data Exploration Page

image

πŸ’Έ Salary Prediction Page

image

πŸ”¬ Project Overview

🎯 Goal

The goal of this data science project is to gain insights into the job market for data engineers in the USA. By analyzing job postings and related data from Glassdoor, the project aims to identify the most in-demand tools, education degrees, and other qualifications required by companies hiring for this role. Additionally, the project seeks to create a model to predict salaries for data engineers based on a variety of factors including location, company industry and rating, education level, and seniority.

🧭 Steps

The project begins with a weekly web-scraping job that collects data engineering postings from the previous week on Glassdoor US. The collected data includes job titles, company names, job locations, job descriptions, salaries, education requirements, and required skills. Each file is named like "glassdoor-data-engineer-15-2023.csv", where 15 is the week number in which the data was scraped and 2023 is the year. The file is stored locally in the data/raw/ folder and then uploaded to an AWS S3 bucket that holds only the raw, uncleaned data. The data is then cleaned and preprocessed to remove irrelevant information and ensure consistency; duplicates are dropped, and the result is joined with the previously cleaned data in a second S3 bucket containing a single CSV file with all job postings. All of this is automated in a data pipeline using MageAI.
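For illustration, here is a minimal sketch of the naming convention and the raw upload step, assuming boto3; the bucket name is a placeholder, and the real logic runs inside the Mage pipeline blocks:

```python
from datetime import date
import boto3

def raw_filename(today: date) -> str:
    # e.g. "glassdoor-data-engineer-15-2023.csv" for week 15 of 2023
    return f"glassdoor-data-engineer-{today.isocalendar()[1]}-{today.year}.csv"

def upload_raw(bucket: str = "glassdoor-raw-data") -> None:
    # bucket name is hypothetical; replace with the project's raw S3 bucket
    name = raw_filename(date.today())
    boto3.client("s3").upload_file(f"data/raw/{name}", bucket, name)
```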

Exploratory data analysis (EDA) is performed on the cleaned data to gain insights into trends and patterns. This includes identifying the most common job titles, the industries and locations with the highest demand, and the most commonly required skills and education degrees. EDA also involves creating visualizations to aid in understanding the data.
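As an illustrative example of this kind of EDA (the path and column names are assumptions about the cleaned dataset's schema):

```python
import pandas as pd

df = pd.read_csv("data/cleaned/glassdoor_jobs.csv")  # hypothetical path

# Most common job titles and the locations with the highest demand
print(df["Job Title"].value_counts().head(10))
print(df["Job Location"].value_counts().head(10))
```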

After EDA, feature engineering is performed to create new features that may improve the accuracy of the salary prediction model. This includes creating dummy variables for categorical features such as location, education level, and seniority.
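A minimal sketch of that dummy-variable step with pandas, assuming the categorical column names described above:

```python
import pandas as pd

df = pd.read_csv("data/cleaned/glassdoor_jobs.csv")  # hypothetical path

# One-hot encode the categorical features named in the text above
features = pd.get_dummies(
    df,
    columns=["Job Location", "Education Level", "Seniority"],
    drop_first=True,  # drop one level per category to avoid redundancy
)
```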

The salary prediction model is built using a random forest regressor. Finally, the model is deployed in a web application using Streamlit, allowing users to input their own data and receive a salary prediction based on the model.
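A minimal sketch of the modelling step with scikit-learn; the target column name and hyperparameters here are illustrative assumptions, not the project's exact configuration:

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

features = pd.read_csv("data/processed/features.csv")  # hypothetical path
X = features.drop(columns=["Avg Salary"])  # "Avg Salary" target is assumed
y = features["Avg Salary"]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
model = RandomForestRegressor(n_estimators=200, random_state=42)
model.fit(X_train, y_train)
print(f"R^2 on held-out data: {model.score(X_test, y_test):.2f}")
```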

πŸ“ Project Architecture

Project Arch

βš™οΈ Mage ETL :

ETL

πŸ› οΈ Technologies Used

Jupyter Β· Python Β· Selenium Β· AWS Β· Mage AI Β· Pandas Β· scikit-learn Β· Matplotlib

πŸ•ΈοΈ Web Scraping

I adjusted the web scraper using Selenium to scrape data engineering jobs posted in the last week on Glassdoor US. The output file is stored in the data/raw/ folder under a name like "glassdoor-data-engineer-15-2023.csv", where "15" is the week number in which the jobs were posted and "2023" is the year. See code here.

For each job, I obtained the following: Company Name, Job Title, Salary Estimate, Job Description, Rating, Job Location, Company Size, Company Founded Date, Type of Ownership, Industry and Sector. The main challenge in this scraping task was duplicated job postings: after the 6th page or so, the Glassdoor website keeps re-rendering the first job listings, so every job scraped from that point on is a duplicate. That's why I implemented a scheduler to run the script once a week to fetch the latest listings, then used a data pipeline to clean and transform the data before joining it with the cleaned dataset stored in an AWS S3 bucket, which holds all deduplicated, cleaned job listings from previous weeks (see the sketch below).
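A hedged sketch of that weekly dedup-and-merge step with pandas; the function and the dedup key are illustrative, as the real logic lives in the Mage pipeline:

```python
import pandas as pd

def merge_week(new_week_csv: str, master_csv: str) -> pd.DataFrame:
    """Append a week's scrape to the master file, dropping re-rendered duplicates."""
    combined = pd.concat(
        [pd.read_csv(master_csv), pd.read_csv(new_week_csv)],
        ignore_index=True,
    )
    # A listing re-rendered on later pages shares these fields exactly
    return combined.drop_duplicates(
        subset=["Company Name", "Job Title", "Job Description"]
    )
```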

🧹 Data Cleaning, EDA and Model Building

Please refer to the respective notebooks (data cleaning, data EDA, model building).

πŸ–₯️ Installation

  1. Clone the repository:

```bash
git clone https://github.com/Hamagistral/DataEngineers-Glassdoor.git
```

  2. Install the required packages:

```bash
pip install -r requirements.txt
```

Run Mage

  1. Change directory to mage-etl:

```bash
cd mage-etl
```

  2. Launch the project:

```bash
mage start glassdoor_dataengjobs
```

  3. Run the pipeline:

```bash
mage run glassdoor_dataengjobs glassdoor_dataeng_pipeline
```

Usage

  1. Change directory to streamlit:

```bash
cd streamlit
```

  2. Run the app:

```bash
streamlit run 01_πŸ•΅οΈ_Explore_Data.py
```

πŸ“‹ References

- Project inspired by: https://github.com/PlayingNumbers/ds_salary_proj
- Scraper GitHub repo: https://github.com/arapfaik/scraping-glassdoor-selenium
- Scraper article: https://towardsdatascience.com/selenium-tutorial-scraping-glassdoor-com-in-10-minutes-3d0915c6d905
- Mage ETL inspired by: https://youtu.be/WpQECq5Hx9g
- Streamlit app inspired by: https://youtu.be/xl0N7tHiwlw

πŸ“¨ Contact Me

LinkedIn β€’ Website β€’ Gmail
