πŸ‘¨β€πŸ’» Data Engineer Jobs Exploration and Salary Prediction Project based on Glassdoor 2023 USA Job Listings


Banner

    Glassdoor Data Engineer Jobs

Gain insights into the job market for data engineers in the USA

Live Preview πŸ›Έ Data on Kaggle πŸŒͺ️ Request Feature

πŸ“ Table of Contents

  1. Project Overview
  2. Project Architecture
  3. Web Scraping
  4. Data Cleaning, EDA and Model Building
  5. Installation
  6. References
  7. Contact

πŸ•΅οΈ Data Exploration Page

image

πŸ’Έ Salary Prediction Page

image

πŸ”¬ Project Overview

🎯 Goal

The goal of this data science project is to gain insights into the job market for data engineers in the USA. By analyzing job postings and related data from Glassdoor, the project aims to identify the most in-demand tools, education degrees, and other qualifications required by companies hiring for this role. Additionally, the project seeks to create a model to predict salaries for data engineers based on a variety of factors including location, company industry and rating, education level, and seniority.

🧭 Steps

The project begins with a weekly web-scraping job that collects data engineering postings from the previous week on Glassdoor US. The collected data includes job titles, company names, job locations, job descriptions, salaries, education requirements, and required skills. Each file is named like "glassdoor-data-engineer-15-2023.csv", where 15 is the week number in which the data was scraped and 2023 is the year. The file is stored locally in the data/raw/ folder and then uploaded to an AWS S3 bucket that holds only the raw, uncleaned data. The data is then cleaned and preprocessed to remove irrelevant information and ensure consistency; duplicates are dropped, and the result is joined with the previously cleaned data in a second S3 bucket containing a single CSV file with all job postings. All of this is automated in a data pipeline using MageAI.
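For illustration, here is a minimal sketch of the naming convention and the raw upload step, assuming boto3; the bucket name is a placeholder, and the real logic runs inside the Mage pipeline blocks:

```python
from datetime import date
import boto3

def raw_filename(today: date) -> str:
    # e.g. "glassdoor-data-engineer-15-2023.csv" for week 15 of 2023
    return f"glassdoor-data-engineer-{today.isocalendar()[1]}-{today.year}.csv"

def upload_raw(bucket: str = "glassdoor-raw-data") -> None:
    # bucket name is hypothetical; replace with the project's raw S3 bucket
    name = raw_filename(date.today())
    boto3.client("s3").upload_file(f"data/raw/{name}", bucket, name)
```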

Exploratory data analysis (EDA) is performed on the cleaned data to gain insights into trends and patterns. This includes identifying the most common job titles, the industries and locations with the highest demand, and the most commonly required skills and education degrees. EDA also involves creating visualizations to aid in understanding the data.
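As an illustrative example of this kind of EDA (the path and column names are assumptions about the cleaned dataset's schema):

```python
import pandas as pd

df = pd.read_csv("data/cleaned/glassdoor_jobs.csv")  # hypothetical path

# Most common job titles and the locations with the highest demand
print(df["Job Title"].value_counts().head(10))
print(df["Job Location"].value_counts().head(10))
```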

After EDA, feature engineering is performed to create new features that may improve the accuracy of the salary prediction model. This includes creating dummy variables for categorical features such as location, education level, and seniority.
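A minimal sketch of that dummy-variable step with pandas, assuming the categorical column names described above:

```python
import pandas as pd

df = pd.read_csv("data/cleaned/glassdoor_jobs.csv")  # hypothetical path

# One-hot encode the categorical features named in the text above
features = pd.get_dummies(
    df,
    columns=["Job Location", "Education Level", "Seniority"],
    drop_first=True,  # drop one level per category to avoid redundancy
)
```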

The salary prediction model is built using a random forest regressor. Finally, the model is deployed in a web application using Streamlit, allowing users to input their own data and receive a salary prediction based on the model.
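A minimal sketch of the modelling step with scikit-learn; the target column name and hyperparameters here are illustrative assumptions, not the project's exact configuration:

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

features = pd.read_csv("data/processed/features.csv")  # hypothetical path
X = features.drop(columns=["Avg Salary"])  # "Avg Salary" target is assumed
y = features["Avg Salary"]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
model = RandomForestRegressor(n_estimators=200, random_state=42)
model.fit(X_train, y_train)
print(f"R^2 on held-out data: {model.score(X_test, y_test):.2f}")
```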

πŸ“ Project Architecture

Project Arch

βš™οΈ Mage ETL :

ETL

πŸ› οΈ Technologies Used

Jupyter Β· Python Β· Selenium Β· AWS Β· Mage AI Β· Pandas Β· scikit-learn Β· Matplotlib

πŸ•ΈοΈ Web Scraping

I adjusted the web scraper using Selenium to scrape data engineering jobs posted in the last week on Glassdoor US. The output file is stored in the data/raw/ folder under a name like "glassdoor-data-engineer-15-2023.csv", where "15" is the week number in which the jobs were posted and "2023" is the year. See code here.

For each job, I obtained the following: Company Name, Job Title, Salary Estimate, Job Description, Rating, Job Location, Company Size, Company Founded Date, Type of Ownership, Industry and Sector. The main challenge in this scraping task was duplicated job postings: after the 6th page or so, the Glassdoor website keeps re-rendering the first job listings, so every job scraped from that point on is a duplicate. That's why I implemented a scheduler to run the script once a week to fetch the latest listings, then used a data pipeline to clean and transform the data before joining it with the cleaned dataset stored in an AWS S3 bucket, which holds all deduplicated, cleaned job listings from previous weeks (see the sketch below).
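A hedged sketch of that weekly dedup-and-merge step with pandas; the function and the dedup key are illustrative, as the real logic lives in the Mage pipeline:

```python
import pandas as pd

def merge_week(new_week_csv: str, master_csv: str) -> pd.DataFrame:
    """Append a week's scrape to the master file, dropping re-rendered duplicates."""
    combined = pd.concat(
        [pd.read_csv(master_csv), pd.read_csv(new_week_csv)],
        ignore_index=True,
    )
    # A listing re-rendered on later pages shares these fields exactly
    return combined.drop_duplicates(
        subset=["Company Name", "Job Title", "Job Description"]
    )
```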

🧹 Data Cleaning, EDA and Model Building

Please refer to the respective notebooks (data cleaning, data EDA, model building).

πŸ–₯️ Installation

  1. Clone the repository:

```bash
git clone https://github.com/Hamagistral/DataEngineers-Glassdoor.git
```

  2. Install the required packages:

```bash
pip install -r requirements.txt
```

Run Mage

  1. Change directory to mage-etl:

```bash
cd mage-etl
```

  2. Launch the project:

```bash
mage start glassdoor_dataengjobs
```

  3. Run the pipeline:

```bash
mage run glassdoor_dataengjobs glassdoor_dataeng_pipeline
```

Usage

  1. Change directory to streamlit:

```bash
cd streamlit
```

  2. Run the app:

```bash
streamlit run 01_πŸ•΅οΈ_Explore_Data.py
```

πŸ“‹ References

- Project inspired by: https://github.com/PlayingNumbers/ds_salary_proj
- Scraper GitHub repo: https://github.com/arapfaik/scraping-glassdoor-selenium
- Scraper article: https://towardsdatascience.com/selenium-tutorial-scraping-glassdoor-com-in-10-minutes-3d0915c6d905
- Mage ETL inspired by: https://youtu.be/WpQECq5Hx9g
- Streamlit app inspired by: https://youtu.be/xl0N7tHiwlw

πŸ“¨ Contact Me

LinkedIn β€’ Website β€’ Gmail
