Since the advent of ChatGPT, AI applications have flourished, driving a surge in demand for data-related positions. This phenomenon sparked our interest in the salary structures within the data industry. On Kaggle, we discovered a dataset on Data Engineer salaries, which led us to undertake a project aimed at predicting salary ranges in the data field based on features such as region, job title, years of experience, and company size.
Our research findings indicate that job title is the most critical factor in determining salaries. This insight holds significant value for both job seekers and employers in the data field, aiding in career planning and recruitment strategies.
This project not only reveals the factors influencing salaries in the data industry but also provides valuable information for those looking to enter the field, helping them make more informed career decisions.
This dataset provides insights into data engineer salaries and employment attributes for the year 2024. It includes information such as salary, job title, experience level, employment type, employee residence, remote work ratio, company location, and company size.
The dataset allows for analysis of salary trends, employment patterns, and geographic variations in data engineering roles. It can be used by researchers, analysts, and organizations to understand the evolving landscape of data engineering employment and compensation.
# | Feature Name | Type | Description
---|---|---|---
1 | work_year | [❌deprecated] | The year in which the data was collected (2024).
2 | experience_level | ordinal | The experience level of the employee: EN (Entry-level), MI (Mid-level), SE (Senior), or EX (Executive).
3 | employment_type | ordinal | The type of employment, such as full-time (FT), part-time (PT), contract (C), or freelance (F).
4 | job_title | nominal | The title or role of the employee within the company, for example, AI Engineer.
5 | salary | [❌deprecated] | The salary of the employee in the local currency (e.g., 202,730 USD).
6 | salary_currency | [❌deprecated] | The currency in which the salary is denominated (e.g., USD).
7 | salary_in_usd | numerical (🔍predicted) | The salary converted to US dollars for standardization purposes.
8 | employee_residence | nominal | The country of residence of the employee.
9 | remote_ratio | ordinal | The ratio indicating the extent of remote work allowed in the position (0 for no remote work, 1 for fully remote).
10 | company_location | nominal | The location of the company where the employee is employed.
11 | company_size | ordinal | The size of the company, categorized by the number of employees: S (small), M (medium), or L (large).
We use these features of the dataset to predict the salaries of data-related jobs.
Graph Type | Horizontal Axis (x) | Vertical Axis (y) | Note
---|---|---|---
Boxplot | categorical feature | salary_in_usd | ordered by median, descending
Barplot | categorical feature | mean of salary_in_usd, or counts | ordered by value, descending
Histogram | bins of salary_in_usd | frequency |
Histogram | bins of signedlog10(salary_in_usd) | frequency |
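- As an illustration of the first row above, a minimal ggplot2 sketch (assuming the data frame is named data, as in the code below), with categories ordered by descending median salary:
library(ggplot2)
# boxplot of salary_in_usd by experience_level, ordered by descending median salary
ggplot(data, aes(x = reorder(experience_level, -salary_in_usd, FUN = median),
                 y = salary_in_usd)) +
  geom_boxplot() +
  labs(x = "experience_level", y = "salary_in_usd")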
- Checking the type of data
summary(data)
str(data)
- Checking data distribution
class_distribution <- table(data$company_size)
- Checking for missing or null values
missing_counts <- colSums(is.na(data))
- Checking Outliers
boxplot(data$salary_in_usd, main = "Boxplot")
process_outliers <- function(data, columns) {
  for (col in columns) {
    # IQR rule: values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are treated as outliers
    Q1 <- quantile(data[[col]], 0.25, na.rm = TRUE)
    Q3 <- quantile(data[[col]], 0.75, na.rm = TRUE)
    IQR <- Q3 - Q1
    lower_bound <- Q1 - 1.5 * IQR
    upper_bound <- Q3 + 1.5 * IQR
    median_val <- median(data[[col]], na.rm = TRUE)
    print(paste("---", col))
    print(paste("lower bound:", lower_bound))
    print(paste("upper bound:", upper_bound))
    print(paste("median:", median_val))
    outliers <- data[[col]][data[[col]] < lower_bound | data[[col]] > upper_bound]
    print(paste("outliers count:", length(outliers)))
    # replace outliers with the column median
    data[[col]][data[[col]] < lower_bound] <- median_val
    data[[col]][data[[col]] > upper_bound] <- median_val
    outliers_after <- data[[col]][data[[col]] < lower_bound | data[[col]] > upper_bound]
    print(paste("outliers_after count:", length(outliers_after)))
  }
  return(data)
}
data <- process_outliers(data, c("salary_in_usd"))
- Ordinal Encoding
# ordinal: experience_level, company_size
ordinal_encoding <- function(data, old_col, new_col, order) {
  data[[new_col]] <- factor(data[[old_col]], levels = order, ordered = TRUE)
  data[[new_col]] <- as.numeric(data[[new_col]])
  return(data)
}
experience_levels <- c("EN", "MI", "SE", "EX")
data <- ordinal_encoding(data, "experience_level", "experience_level_encoded", experience_levels)
data$experience_level_encoded
company_size_levels <- c("S", "M", "L")
data <- ordinal_encoding(data, "company_size", "company_size_encoded", company_size_levels)
data$company_size_encoded
- Target Encoding
# build_target_encoding() and target_encode() are assumed to come from the dataPreparation package
library(dataPreparation)
target_encoding <- function(data, col, target) {
  target_encode_tmp <- build_target_encoding(data, cols_to_encode = col,
                                             target_col = target, functions = c("mean"))
  encode_result <- target_encode(data, target_encoding = target_encode_tmp)
  # the encoded column is appended last, so return it
  return(encode_result[[length(encode_result)]])
}
# target_encode_col_name is assumed to hold the name of the target column
# (the signedlog10-transformed salary; see the transformation below)
# employment_type: Target encoding
data$employment_type_encoded <- target_encoding(data, "employment_type", target_encode_col_name)
# job_title: Target encoding
data$job_title_encoded <- target_encoding(data, "job_title", target_encode_col_name)
# employee_residence: Target encoding
data$employee_residence_encoded <- target_encoding(data, "employee_residence", target_encode_col_name)
# company_location: Target encoding
data$company_location_encoded <- target_encoding(data, "company_location", target_encode_col_name)
- One-hot Encoding
- During testing, we found that one-hot encoding produced an excessive number of features, which reduced accuracy and increased training time, so it was ultimately not adopted.
cols_encoding <- c("experience_level","employment_type","job_title","employee_residence",
"company_location","company_size")
do_one_hot_encoding <- function(data, col_name) {
  category_name <- col_name
  formula_str <- paste("~", category_name, "- 1")
  formula_obj <- as.formula(formula_str)
  encoded_df <- model.matrix(formula_obj, data = data)
  colnames(encoded_df) <- make.names(colnames(encoded_df))
  combined_data <- cbind(data, encoded_df)
  return(combined_data)
}
for (col in cols_encoding) {
  data <- do_one_hot_encoding(data, col)
}
- Splitting the data into training and testing sets
- The original plan was to use K-means to create new features. However, since all the features within the dataset are categorical in nature, the K-means method is not feasible.
do_kmeans <- function(data, cluster_count) {
  k <- cluster_count
  kmeans_result <- kmeans(data$salary_in_usd, centers = k)
  return(as.factor(kmeans_result$cluster))
}
data$salary_in_usd_cluster <- do_kmeans(data, 10)
colnames(data)
- Therefore, in the end, 80% of the data was selected for training using an index-based approach, while 20% was reserved for testing.
data$i <- runif(nrow(data))
train_data <- subset(data, i >= 0.2)
test_data <- subset(data, i < 0.2)
train_data$i <- NULL
test_data$i <- NULL
- Transforming the target variable
# signed log10 function
signedlog10 <- function(x) {
  ifelse(abs(x) <= 1, 0, sign(x) * log10(abs(x)))
}
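- The evaluation code below back-transforms predictions with arcsignedlog10(), which is not defined in the snippets; a minimal sketch of the inverse, assuming salaries are well above 1 USD (values that signedlog10 maps to 0 cannot be recovered exactly):
# assumed helper, not part of the original code: inverse of signedlog10 for |x| > 1
arcsignedlog10 <- function(x) {
  sign(x) * 10^abs(x)
}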
We use target encoding to calculate the mean of signedlog10(salary_in_usd) for each feature.
After target encoding, we use ggcorrplot to generate the correlation matrix for the features.
library(ggcorrplot)
library(dplyr)
# cor() only accepts numeric columns, so keep those and drop the raw target
correlation_matrix <- cor(data %>% select(where(is.numeric)) %>% select(-salary_in_usd))
ggcorrplot(correlation_matrix, lab = TRUE) +
  labs(title = "correlation matrix", x = "", y = "") +
  theme_bw() +
  theme(axis.text.x = element_text(family = "BL", face = "bold", colour = "black",
                                   size = 10, angle = 45, hjust = 1),
        axis.text.y = element_text(family = "BL", face = "bold", colour = "black",
                                   size = 10, hjust = 1),
        title = element_text(family = "BL", face = "bold", size = 10))
- We use randomForest and gradient boosting (gbm) to build an ensemble learning model to predict signedlog10(salary_in_usd).
- Ensemble learning: we use caret and caretEnsemble.
- Functions for creating ensembles of caret models: caretList() and caretStack(). caretList() is a convenience function for fitting multiple caret::train() models to the same dataset. caretStack() will make linear or non-linear combinations of these models, using a caret::train() model as a meta-model, and caretEnsemble() will make a robust linear combination of models using a GLM.
- We also use cross-validation (5-fold) and grid search to find the best hyperparameters.
- We use the RMSE of salary_in_usd to evaluate the performance of the model.
- null model: linear regression (intercept only)
null_model <- lm(salary_in_usd ~ 1, data = train_data)
summary(null_model)
# RMSE() comes from caret (loaded below); compare back-transformed (USD) values
null_rmse <- RMSE(train_data$salary_in_usd %>% arcsignedlog10,
                  null_model$fitted.values %>% arcsignedlog10)
- ensemble model = random forest + gradient boosting
library(randomForest)
library(gbm)
library(caret)
library(caretEnsemble)
# cross validation
ctrl <- trainControl(
  method = "cv",
  number = 5,
  savePredictions = "final",
  returnData = TRUE,
  returnResamp = "final",
  verboseIter = TRUE
)
# grid search
rf_grid <- expand.grid(
  mtry = c(2)
)
gbm_grid <- expand.grid(
  n.trees = 150,
  interaction.depth = 3,
  shrinkage = 0.1,
  n.minobsinnode = 10
)
# model_list
model_list <- caretList(
  salary_in_usd ~ .,
  data = train_data,
  metric = "RMSE",
  verbose = TRUE,
  trControl = ctrl,
  tuneList = list(
    rf = caretModelSpec("rf", tuneGrid = rf_grid),
    gbm = caretModelSpec("gbm", tuneGrid = gbm_grid)
  )
)
# ensemble model
ens_model <- caretEnsemble(model_list)
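- The variable-importance and RMSE figures below were presumably obtained along these lines; a sketch assuming caretEnsemble's varImp()/predict() methods and the arcsignedlog10() helper above (exact calls may vary between caretEnsemble versions):
# relative importance of each feature in the ensemble and in the base learners
varImp(ens_model)
# back-transform predictions to USD before computing RMSE on the test set
test_pred <- predict(ens_model, newdata = test_data)
test_rmse <- RMSE(arcsignedlog10(test_pred), arcsignedlog10(test_data$salary_in_usd))
test_rmse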
feature_name | overall | rf | gbm
---|---|---|---
job_title👑 | 41.94337184 | 41.65463988 | 44.24573465
employee_residence | 29.53998588 | 29.47609937 | 30.04942006
experience_level | 25.94155302 | 26.09243964 | 24.73837574
company_size | 1.397332208 | 1.513988644 | 0.467108035
remote_ratio | 1.177757053 | 1.262832468 | 0.499361506
employment_type | 0 | 0 | 0
- Our Performance👑

Train / Test | RMSE
---|---
Train | 46728.5916975186
Test | 47087.3351748348
- For reference, other results on Kaggle report RMSEs of roughly 51.4 kUSD/year on the training set and 52.0 kUSD/year on the test set, and one reports a test RMSE of 57857.07162184822.
- Our model's RMSE is noticeably better than these other Kaggle results.
However, this is only a passable result; we would need more data or additional features to train a model whose predictions are accurate enough for practical use.
Nonetheless, during the training process, we found that job_title is the most important feature, indicating that job titles play a crucial role in predicting salaries.