
Data Engineer Salary Estimator: Project Overview



  • Created a tool that estimates data engineer salaries (MAE ~ $18.6K) to help data engineers negotiate their income when they get a job.
  • Scraped roughly 3,000 job descriptions from Glassdoor across 32 countries using Python and Selenium.
  • Engineered features from the text of each job description to quantify the value companies put on Python, SQL, Snowflake, AWS, GCP, Apache Spark, Apache Kafka, and BI tools (Looker, Tableau, etc.).
  • Optimized Linear, Lasso, and Random Forest regressors using GridSearchCV to reach the best model.
  • Built a client-facing API using Flask.

Code and Resources Used 📦

Python Version: 3.11

Packages:
Pandas, NumPy, scikit-learn, NLTK, WordCloud, Matplotlib, Plotly, Seaborn, Selenium, Flask, json, pickle...
For the whole project:
pip install -r requirements.txt
For the web framework:
cd FlaskAPI && pip install -r requirements.txt

YouTube Project Walk-Through 📺

The Video Walk-Through: Ken Jee - Data Science Project from Scratch

The project is based on Ken Jee's ds_salary_proj

The video and the project are several years old, so keep in mind that some details may be outdated.

Web Scraping 🌐

Tweaked the web scraper GitHub repo to scrape job postings from glassdoor.com.
For each job, we collected the following fields (a sketch of the scraping loop follows the list):

  • Company_name
  • Rating
  • Location
  • Job_title
  • Description
  • Job_age
  • Easy_apply
  • Salary
  • Employees
  • Type_of_ownership
  • Sector
  • Founded
  • Industry
  • Revenue_USD
  • Friend_recommend
  • CEO_approval
  • Career_opportunities
  • Comp_&_benefits
  • Culture_&_values
  • Senior_management
  • Work/Life_balance
  • Pros
  • Cons
  • Benefits_rating
  • Benefits_reviews
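
Below is a minimal sketch of the scraping loop. The CSS selectors are illustrative placeholders (Glassdoor's markup changes frequently), so treat it as the shape of the scraper rather than its exact implementation:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://www.glassdoor.com/Job/data-engineer-jobs-SRCH_KO0,13.htm")

jobs = []
for card in driver.find_elements(By.CSS_SELECTOR, "li.react-job-listing"):  # placeholder selector
    card.click()  # open the details pane for this posting
    jobs.append({
        "Company_name": driver.find_element(By.CSS_SELECTOR, ".employer-name").text,
        "Job_title": driver.find_element(By.CSS_SELECTOR, ".job-title").text,
        "Salary": driver.find_element(By.CSS_SELECTOR, ".salary-estimate").text,
        "Description": driver.find_element(By.CSS_SELECTOR, ".job-description").text,
    })

driver.quit()
```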

Data Cleaning 🧹

After scraping the data, I needed to clean it up so that it was usable for the model. I made the following changes and created the following variables (a salary-parsing sketch follows the first list):

The data extracted directly from the postings:

  • Company Name
  • Ratings
  • Job Location
  • Job Title
  • Job Description
  • Job Posting Age
  • Easy Apply Option
  • Salary Ranges (Min, Max)
  • Number of Employees in the Company
  • Type of Ownership
  • Company Sector
  • Company Industry
  • Yearly Revenue in USD
  • Employee Reviews
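
The salary cleaning step turns Glassdoor's salary strings into numeric bounds. A minimal sketch, assuming the common "$85K - $120K (Glassdoor est.)" format (the exact format varies by country and listing):

```python
import re

import pandas as pd

def parse_salary_range(raw):
    """Pull (min, max) in $K out of a Glassdoor salary string,
    e.g. '$85K - $120K (Glassdoor est.)' -> (85.0, 120.0)."""
    if not isinstance(raw, str):
        return pd.Series([None, None])
    numbers = re.findall(r"\$(\d+(?:\.\d+)?)K", raw)
    if len(numbers) >= 2:
        return pd.Series([float(numbers[0]), float(numbers[1])])
    return pd.Series([None, None])

# `df` is the scraped frame; the new columns hold the salary bounds.
df[["Salary_min", "Salary_max"]] = df["Salary"].apply(parse_salary_range)
```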

The data was enriched with additional information derived from the Job Description and the Job Title (see the keyword-flag sketch after this list):

  • Contract Type (fixed-term contract: Y/N)
  • Seniority (jr, mid, senior, management)
  • Education (BA, MS, PhD, Certificate)
  • Version Control (Git, SVN, GitLab, and other platforms)
  • Cloud Platform (AWS, GCP, Azure...)
  • RDBMS (MySQL, PostgreSQL...)
  • Search & Analytics (Snowflake, Google BigQuery...)
  • Data Integration and Processing (Databricks, Informatica PowerCenter...)
  • Stream Processing Tools (Apache Kafka, Apache Flink...)
  • Workflow Orchestration Tools (Apache Airflow, SSIS...)
  • Big Data Processing (Apache Spark, Apache Hadoop...)
  • Operating System (Windows, Linux...)
  • Programming Languages (Python, SQL, Java, Scala...)
  • Business Intelligence Tools (Power BI, Tableau...)
  • Machine Learning Frameworks (PyTorch, TensorFlow...)
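
Each of these enrichment columns boils down to keyword flags over the posting text. A minimal sketch, assuming the cleaned frame `df` has a string `Description` column (the real project checks far more tools per category):

```python
# Keyword lists per feature column; intentionally abbreviated here.
SKILL_KEYWORDS = {
    "python": ["python"],
    "aws": ["aws", "amazon web services"],
    "spark": ["apache spark", "pyspark"],
    "kafka": ["apache kafka", "kafka"],
}

for column, keywords in SKILL_KEYWORDS.items():
    df[column] = df["Description"].str.lower().apply(
        lambda text: int(any(keyword in text for keyword in keywords))
    )
```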

EDA 📊

👉 100+ insights - Data Engineer 🧭🗺️

I looked at the distributions of the data and the value counts for the various categorical variables; the insights linked above collect the highlights.
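
A minimal sketch of the kind of exploration involved, using column names from the cleaned dataset described above:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Frequency of each seniority level among the postings.
print(df["Seniority"].value_counts())

# Distribution of the minimum advertised salary (in $K).
sns.histplot(df["Salary_min"], kde=True)
plt.title("Minimum advertised salary ($K)")
plt.show()
```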

Model Building 🧠

The Model

First, I removed all data without salary information. Secondly, I transformed the categorical variables into dummy variables. I also split the data into train and test sets with a test size of 20%.

I evaluated the models using Mean Absolute Error (MAE). I chose MAE because it is relatively easy to interpret, and outliers aren’t particularly bad for this type of model.

I tried three different models (a tuning sketch follows the list):

  • Multiple Linear Regression – Baseline for the model
  • Lasso Regression – Because of the sparse data from the many categorical variables, I thought a regularized regression like lasso would be effective.
  • Random Forest – Again, with the sparsity associated with the data, I thought that this would be a good fit.
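
A minimal sketch of the split and the Random Forest tuning; the target column name `avg_salary` and the parameter grid are assumptions for illustration:

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV, train_test_split

# Dummy-encode the categoricals and hold out 20% for testing.
X = pd.get_dummies(df.drop(columns="avg_salary"))
y = df["avg_salary"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Tune the forest on (negative) MAE with a small illustrative grid.
grid = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid={"n_estimators": [100, 200, 300], "max_features": ["sqrt", "log2"]},
    scoring="neg_mean_absolute_error",
    cv=3,
)
grid.fit(X_train, y_train)

print(mean_absolute_error(y_test, grid.predict(X_test)))  # test MAE in $K
```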

Model performance 📈

The Random Forest model far outperformed the other approaches on the test and validation sets (MAE is in thousands of USD):

  • Random Forest: MAE = 18.67
  • Linear Regression: MAE = 58,539,069,871.22 (Yikes!)
  • Lasso Regression: MAE = 19.99

Productionization 💻

In this step, I built a Flask API endpoint hosted on a local web server, following Ken Jee's steps (I had to change a few of them because not everything was up to date). The API endpoint takes in a GET request with the values from a job listing in its body and returns an estimated salary.
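
A minimal sketch of such an endpoint, assuming a pickled model at models/model.pkl and a JSON body carrying an already-encoded feature vector (both the path and the payload shape are illustrative):

```python
import pickle

import numpy as np
from flask import Flask, jsonify, request

app = Flask(__name__)

# Load the trained model once at startup; the path is an assumption.
with open("models/model.pkl", "rb") as file_in:
    model = pickle.load(file_in)

@app.route("/predict", methods=["GET"])
def predict():
    # The request body carries the encoded feature vector of one job listing.
    features = np.array(request.get_json()["input"]).reshape(1, -1)
    estimate = float(model.predict(features)[0])
    return jsonify({"estimated_salary": estimate})

if __name__ == "__main__":
    app.run(debug=True)
```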

Acknowledgments 👍

This project was inspired by Ken Jee's work, and I would like to extend special thanks to him.
