Data Science Salary Estimator:

Project Overview

Created a tool that estimates data science salaries (MAE ~$9K) to help data scientists negotiate their pay when they get a job offer.
Scraped over 1000 job descriptions from Glassdoor using Python and Selenium.
Engineered features from the text of each job description to quantify the value companies put on Python, Excel, AWS, and Spark.
Optimized Linear, Lasso, Random Forest, and XGBoost regressors using GridSearchCV to reach the best model.

Code and Resources Used

Python Version: 3.10.6
Packages: pandas, numpy, sklearn, matplotlib, seaborn, selenium, pickle
Scraper Github: https://github.com/arapfaik/scraping-glassdoor-selenium
Scraper Article: https://towardsdatascience.com/selenium-tutorial-scraping-glassdoor-com-in-10-minutes-3d0915c6d905

Web Scraping

Tweaked the web scraper GitHub repo (above) to scrape 1000 job postings from glassdoor.com. For each job, I collected the following:

Job title
Salary Estimate
Job Description
Rating
Company
Location
Company Headquarters
Company Size
Company Founded Date
Type of Ownership
Industry
Sector
Revenue
Competitors

Data Cleaning

After scraping the data, I needed to clean it up so that it was usable for the model. I made the following changes and created the following variables:

Parsed numeric data out of salary
Made columns for employer provided salary and hourly wages
Removed rows without salary
Parsed rating out of company text
Made a new column for company state
Added a column for whether the job was at the company’s headquarters
Transformed founded date into age of company
Made columns for whether different skills were listed in the job description:
- Python
- Excel
- AWS
- Spark
Columns for simplified job title and seniority
Column for description length
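
A few of the cleaning steps above can be sketched as follows. This is a minimal illustration, not the notebook's actual code: the function names, the example salary strings, the `-1` placeholder for missing values, and the 2,000 hours per year conversion are all assumptions.

```python
import re

def parse_salary(text):
    """Parse a salary string into min/max/average in thousands of dollars.

    Returns None when fewer than two numbers are found (for example a
    "-1" placeholder for a missing salary, which is an assumed convention).
    """
    lowered = text.lower()
    hourly = "per hour" in lowered
    employer_provided = "employer provided salary" in lowered
    numbers = [int(n) for n in re.findall(r"\d+", lowered)]
    if len(numbers) < 2:
        return None
    low, high = numbers[0], numbers[1]
    if hourly:
        # Assumption: 2,000 working hours per year, so $1/hour is $2K/year.
        low, high = low * 2, high * 2
    return {
        "min_k": low,
        "max_k": high,
        "avg_k": (low + high) / 2,
        "hourly": hourly,
        "employer_provided": employer_provided,
    }

# Word boundaries keep "excel" from matching inside "excellent".
SKILL_PATTERNS = {
    "python": r"\bpython\b",
    "excel": r"\bexcel\b",
    "aws": r"\baws\b",
    "spark": r"\bspark\b",
}

def skill_flags(description):
    """Return a 0/1 flag per skill depending on whether the text mentions it."""
    lowered = description.lower()
    return {
        skill: int(re.search(pattern, lowered) is not None)
        for skill, pattern in SKILL_PATTERNS.items()
    }

def company_age(founded_year, current_year):
    """Convert a founding year into an age; None when the year is missing."""
    return current_year - founded_year if founded_year > 0 else None
```

Rows where `parse_salary` returns `None` would be the ones dropped in the "Removed rows without salary" step.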

EDA

I looked at the distributions of the data and the value counts for the various categorical variables. Below are a few highlights from the pivot tables.

Model Building

First, I transformed the categorical variables into dummy variables. I also split the data into train and test sets with a test size of 20%.
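
As a rough sketch of this step, the snippet below one-hot encodes the categorical columns and holds out 20% of the rows. The column names and values are invented toy data, and `random_state=42` is an arbitrary choice; neither is taken from the project's dataset.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data for illustration only; the real cleaned dataset has more columns.
df = pd.DataFrame({
    "avg_salary": [72.0, 87.5, 85.0, 76.5, 114.5, 95.0, 73.5, 114.0, 61.0, 140.0],
    "rating": [3.8, 3.4, 4.8, 3.8, 2.9, 3.4, 4.1, 3.8, 3.3, 4.6],
    "job_state": ["NM", "MD", "FL", "WA", "NY", "TX", "CA", "CA", "NY", "CA"],
    "job_simp": ["data scientist", "data scientist", "analyst", "data scientist",
                 "data scientist", "data engineer", "analyst", "data scientist",
                 "analyst", "data engineer"],
})

# get_dummies turns each categorical column into one 0/1 column per category.
features = pd.get_dummies(df.drop(columns="avg_salary"))
target = df["avg_salary"]

# Hold out 20% of the rows as the test set.
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.2, random_state=42
)
```

Each category becomes its own column (for example `job_state_CA`), which is where the sparsity mentioned in the model notes comes from.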

I evaluated the models using Mean Absolute Error (MAE). I chose MAE because it is relatively easy to interpret and outliers aren’t particularly bad for this type of model.
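
To make the metric concrete, here is a small pure-Python sketch that computes MAE next to RMSE on made-up numbers (salaries in thousands of dollars). The values are illustrative assumptions, not project results; the point is that a single large miss inflates RMSE much more than MAE.

```python
import math

def mean_absolute_error(actual, predicted):
    """Average of the absolute differences between actual and predicted values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def root_mean_squared_error(actual, predicted):
    """Square root of the average squared difference; penalizes large misses more."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))

# Made-up salaries in $K; the last prediction is a deliberate outlier miss.
actual = [70, 80, 90, 100]
predicted = [75, 75, 95, 140]

mae = mean_absolute_error(actual, predicted)       # (5 + 5 + 5 + 40) / 4 = 13.75
rmse = root_mean_squared_error(actual, predicted)  # larger than the MAE here
```

An MAE of 13.75 reads directly as "off by about $13.75K on average", which is the interpretability argument made above.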

I tried four different models:

Multiple Linear Regression – Baseline for the model
Lasso Regression – Because of the sparse data from the many categorical variables, I thought a regularized regression like lasso would be effective.
Random Forest Regressor – Again, with the sparsity associated with the data, I thought that this would be a good fit.
XGBoost Regressor – Again, with the sparsity associated with the data, I thought that this would be a good fit.
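
The tuning loop can be sketched roughly as below, using GridSearchCV scored on negative MAE. The data is synthetic and the parameter grids are placeholder assumptions, not the grids used in the project; XGBoost is left out so the sketch depends only on scikit-learn.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the engineered feature matrix and salary target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 100 + 20 * X[:, 0] - 10 * X[:, 1] + rng.normal(scale=5, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Hypothetical grids, kept small so the search runs quickly.
candidates = {
    "lasso": (Lasso(max_iter=10_000), {"alpha": [0.01, 0.1, 1.0]}),
    "random_forest": (
        RandomForestRegressor(random_state=0),
        {"n_estimators": [50, 100], "max_depth": [None, 5]},
    ),
}

results = {}
for name, (estimator, grid) in candidates.items():
    # scikit-learn maximizes scores, so MAE is exposed as its negative.
    search = GridSearchCV(estimator, grid, scoring="neg_mean_absolute_error", cv=3)
    search.fit(X_train, y_train)
    results[name] = {
        "best_params": search.best_params_,
        "cv_mae": -search.best_score_,
        "test_mae": mean_absolute_error(y_test, search.predict(X_test)),
    }
```

If the `xgboost` package is installed, its `XGBRegressor` follows the scikit-learn estimator interface and could be added to `candidates` in the same way.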

Model performance

The XGBoost Regressor model far outperformed the other approaches on the test and validation sets (MAE in thousands of dollars).

XGBoost: MAE = 9.01
Random Forest: MAE = 11.36
Linear Regression: MAE = 18.86
Lasso Regression: MAE = 19.98

Tayyab885/data_sciecne_salary_project
