This project aims to create a salary prediction model, which predict salaries based on given job descriptions.
Considering the uncertainties and competitiveness in the job market, as job seekers, knowing how much employees are willing to pay for certain jobs would prove to be very beneficial to the job seekers. On the other hand, employers could benefit for this by having reference for the salaries in their company. It is beneficial for them both to know how factors such as educational qualifications, years of experience, job industry, etc affect their salary.
Thus, can we develop a model where salaries can be predicted based on the available features? If so, how accurate can the model get?Are the features available good enough for a predictive model?
The dataset is split into training and test data. It contains 'features' and 'target' for the training dataset.
Description of Features:
- jobId: The ID of the job.
- companyId: The ID of the company posting the job.
- jobType: The description of the position of the job within the company (Junior, Senior, CEO, etc).
- degree: The obtained degree of the employee.
- major: The formal education field of specialization of the employee.
- industry: The industry to which the company belongs.
- yearsExperience: The employee's years of experience.
- milesFromMetropolis: The distance in miles, from the employee's house to the metropolis.
Target Variable:
- Salary: Salary for the job.
There are categorical and numerical features in the data. A check for duplicates and missing values revealed that there are no missing or duplicated data.
Categorical:
- jobId
- companyId
- jobType: CFO, CEO, VICE_PRESIDENT, MANAGER, JUNIOR, JANITOR, CTO, SENIOR
- degree: MASTERS, HIGH_SCHOOL, DOCTORAL, BACHELORS, NONE
- major: MATH, NONE, PHYSICS, CHEMISTRY, COMPSCI, BIOLOGY, LITERATURE, BUSINESS, ENGINEERING
- industry: HEALTH, WEB, AUTO, FINANCE, EDUCATION, OIL, SERVICE
Numerical:
- yearsExperience
- milesFromMetropolis
Here is the summary of the numerical features in the data.
The target variable is a numerical variable representing annual salary (in USD).
From the distribution and boxplot above, we can see that salary is normally distributed. The boxplot also reveals some outliers.
As revealed in the boxplot above, there are outliers in both sides.
Further checks reveal that the outliers have a salary of 0, but the positions are not voluntary. Therefore it can be determined that they are outliers.
These 16 juniors are from high value industries but with 20+ years of experience. Perhaps they are transitioning into new careers. We can determine that these values are also legitimate and not remove them from the training set.
Thus, the outliers to be removed are the 7 employees with 0 salaries.
Here is a summary of the numeric variables in the cleaned dataset.
It seems that as the job position increases, salary increases as well. (CEOs have the highest salaries, while janitors have the lowest).
More advanced degrees are associated with higher salaries.
Employees with a major in Math, Business, and Engineering generally have higher salaries compared to the rest. Employees with no major have the lowest salaries.
The Web, Finance, and Oil Industries are the top 3 industries with the highest salaries.
There is a clear association between years of experience and salary. The more years of experience, the higher the salary.
Salaries get smaller the further the employee is to the metropolis.
Here is the correlation between features and the target variable:
We can observe that there's a small positive correlation between years of experience and salary. There is no correlation between numerical features.
The baseline results for this problem is a basic linear regression model with years of experience as a predictor of salary. This model has an MSE of 1288.
The following models may be a good fit for the salary prediction problem:
- Model pipeline using StandardScaler()-scaling to unit variance, PCA()-Projecting higher dimensional to lower dimensional space, LinearRegression()
- RandomForestRegressor()- A random forest is a meta estimator that is good for predictive accuracy and could control over-fitting very well.
- GradientBoostingRegressor()- This can be used for both regression and classification problems, which is an ensemble of weak prediction models, typically decision trees
The categorical features were encoded using one-hot-encoding. The Data was also normalized.
Three Models were trained on the data. One linear and two ensemble models:
- Linear Regression
- Random Forest Regression
- Gradient Boosting Regression
Average MSE: 384.4389794016031 Standard deviation during CV: 0.7765280635963165
Average MSE: 367.2183936856392 Standard deviation during CV: 0.6235201312820777
Average MSE: 357.12352000277804 Standard deviation during CV: 0.8043960464617612
The best performing model is the Gradient Boosting Regressor with an average MSE of 357.
We can see that for predicting salaries:
- Whether someone is a Janitor ranks as the most important feature
- Years of experience and Miles from the Metropolis comes in second and third
- Having a Major is ranked fourth
- Developed a pretty accurate model with an average MSE of 357.
- Future solutions to the problem could employ more feature engineering of the features
- Feature importance ranks job position janitor, years of experience, and miles from metropolis as the three most important features