
Rita94105/Salary_predict


Since the advent of ChatGPT, AI applications have flourished, driving a surge in demand for data-related positions. This phenomenon sparked our interest in the salary structures within the data industry. On Kaggle, we discovered a dataset on Data Engineer salaries, which led us to undertake a project aimed at predicting salary ranges in the data field based on features such as region, job title, years of experience, and company size.

Our research findings indicate that job title is the most critical factor in determining salaries. This insight holds significant value for both job seekers and employers in the data field, aiding in career planning and recruitment strategies.

This project not only reveals the factors influencing salaries in the data industry but also provides valuable information for those looking to enter the field, helping them make more informed career decisions.

About Dataset

This dataset provides insights into data engineer salaries and employment attributes for the year 2024. It includes information such as salary, job title, experience level, employment type, employee residence, remote work ratio, company location, and company size.

The dataset allows for analysis of salary trends, employment patterns, and geographic variations in data engineering roles. It can be used by researchers, analysts, and organizations to understand the evolving landscape of data engineering employment and compensation.
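
Everything below assumes the Kaggle CSV has already been read into a data frame named data. A minimal loading sketch (the filename is hypothetical; substitute the name of your local copy):

data <- read.csv("data_engineer_salary_in_2024.csv", stringsAsFactors = FALSE)  # hypothetical filename
head(data)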

Feature Description

| # | Feature Name | Type | Description |
| --- | --- | --- | --- |
| 1 | work_year | ❌ deprecated | The year in which the data was collected (2024). |
| 2 | experience_level | ordinal | The employee's experience level: EN (Entry-Level), MI (Mid-Level), SE (Senior), or EX (Executive). |
| 3 | employment_type | ordinal | The type of employment: full-time (FT), part-time (PT), contract (CT), or freelance (FL). |
| 4 | job_title | nominal | The employee's title or role within the company, e.g., AI Engineer. |
| 5 | salary | ❌ deprecated | The salary in the local currency (e.g., 202,730). |
| 6 | salary_currency | ❌ deprecated | The currency in which the salary is denominated (e.g., USD). |
| 7 | salary_in_usd | numerical (🔍 prediction target) | The salary converted to US dollars for standardization. |
| 8 | employee_residence | nominal | The employee's country of residence. |
| 9 | remote_ratio | ordinal | The extent of remote work allowed (0 = none, 50 = hybrid, 100 = fully remote). |
| 10 | company_location | nominal | The location of the employing company. |
| 11 | company_size | ordinal | The company size by employee count: S (small), M (medium), L (large). |

Our Target

We use these features of the dataset to predict the salaries of data science-related jobs.

Exploratory Data Analysis (EDA)

| Graph Type | Horizontal Axis (x) | Vertical Axis (y) | Note |
| --- | --- | --- | --- |
| Boxplot | categorical feature | salary_in_usd | ordered by median, descending |
| Barplot | categorical feature | avg. salary_in_usd or counts | ordered by value, descending |
| Histogram | bins of salary_in_usd | frequency | |
| Histogram | bins of signedlog10(salary_in_usd) | frequency | |
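
As an illustration of the first row of the table, the sketch below draws the boxplot of salary_in_usd per category, ordered by descending median; job_title stands in for any of the categorical features (a reconstruction, not the repository's plotting code):

library(ggplot2)

# Boxplot per job_title; negating salary inside reorder() sorts the
# categories by descending median salary
ggplot(data, aes(x = reorder(job_title, -salary_in_usd, FUN = median),
                 y = salary_in_usd)) +
  geom_boxplot() +
  labs(x = "job_title", y = "salary_in_usd") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))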

Data Preprocessing

Library

  1. smotefamily
  2. ggplot2
  3. dataPreparation
  4. data.table

Process

  1. Checking the type of data
summary(data)
str(data)
  2. Checking the data distribution
class_distribution <- table(data$company_size)
  3. Checking for missing or null values
missing_counts <- colSums(is.na(data))
  4. Checking for outliers
boxplot(data$salary_in_usd, main = "Boxplot")

# Replace values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] with the column median
process_outliers <- function(data, columns) {
  for (col in columns) {
    Q1 <- quantile(data[[col]], 0.25, na.rm = TRUE)
    Q3 <- quantile(data[[col]], 0.75, na.rm = TRUE)
    IQR <- Q3 - Q1
    lower_bound <- Q1 - 1.5 * IQR
    upper_bound <- Q3 + 1.5 * IQR
    
    median_val <- median(data[[col]], na.rm = TRUE)
    
    print(paste("---", col))
    print(paste("lower bound:",lower_bound))
    print(paste("upper bound:", upper_bound))
    print(paste("median", median_val))
    outliers <- data[[col]][data[[col]] < lower_bound | data[[col]] > upper_bound]
    print(paste("outliers count:", length(outliers)))
    
    data[[col]][data[[col]] < lower_bound] <- median_val
    data[[col]][data[[col]] > upper_bound] <- median_val
    
    outliers_after <- data[[col]][data[[col]] < lower_bound | data[[col]] > upper_bound]
    print(paste("outliers_after count:", length(outliers_after)))
    
  }
  return(data)
}

data <- process_outliers(data, c("salary_in_usd"))
  5. Ordinal Encoding
# ordinal: experience_level, company_size

ordinal_encoding <- function(data, old_col, new_col, order) {
  data[[new_col]] <- factor(data[[old_col]], levels = order, ordered = TRUE)
  data[[new_col]] <- as.numeric(data[[new_col]])
  return(data)
}

experience_levels <- c("EN", "MI", "SE", "EX")
data <- ordinal_encoding(data, "experience_level", "experience_level_encoded", experience_levels)
data$experience_level_encoded

company_size_levels <- c("S", "M", "L")
data <- ordinal_encoding(data, "company_size", "company_size_encoded", company_size_levels)
data$company_size_encoded
  6. Target Encoding
# Encode a categorical column as the per-level mean of the target column
# (dataPreparation::build_target_encoding); returns just the new column
target_encoding <- function(data, col, target) {
  target_encode_tmp <- build_target_encoding(data, cols_to_encode = col,
                                           target_col = target, functions = c("mean"))
  encode_result <- target_encode(data, target_encoding = target_encode_tmp)
  return(encode_result[[length(encode_result)]])
}

# target_encode_col_name is not defined in this excerpt; it presumably
# names the target column, e.g. (assumed):
target_encode_col_name <- "salary_in_usd"

# employment_type: Target encoding
data$employment_type_encoded <- target_encoding(data, "employment_type", target_encode_col_name)

# job_title: Target encoding
data$job_title_encoded <- target_encoding(data, "job_title", target_encode_col_name)

# employee_residence: Target encoding
data$employee_residence_encoded <- target_encoding(data, "employee_residence", target_encode_col_name)

# company_location: Target encoding
data$company_location_encoded <- target_encoding(data, "company_location", target_encode_col_name)
  7. One-hot Encoding
  • During testing, it was found that One-hot Encoding led to an excessive number of features, resulting in decreased accuracy and increased training time. Therefore, it was ultimately not adopted.
cols_encoding <- c("experience_level","employment_type","job_title","employee_residence",
                   "company_location","company_size")
                   
do_one_hot_encoding <- function(data, col_name) {
  category_name <- col_name
  formula_str <- paste("~", category_name, "- 1")
  formula_obj <- as.formula(formula_str)
  encoded_df <- model.matrix(formula_obj, data = data)
  colnames(encoded_df) <- make.names(colnames(encoded_df))
  
  combined_data <- cbind(data, encoded_df)
  return(combined_data)
}

for (col in cols_encoding) {
  data <- do_one_hot_encoding(data, col)
}
  8. Splitting the data into training and testing sets
  • The original plan was to use K-means to create new features. However, since all the features within the dataset are categorical in nature, the K-means method is not feasible.
do_kmeans <- function(data, cluster_count) {
  k <- cluster_count
  kmeans_result <- kmeans(data$salary_in_usd, centers = k)
  return(as.factor(kmeans_result$cluster))
}

data$salary_in_usd_cluster <- do_kmeans(data, 10)
colnames(data)
  • Therefore, in the end, 80% of the data was selected for training using an index-based approach, while 20% was reserved for testing.
set.seed(1)  # assumed seed (the excerpt does not fix one), so the split is reproducible
data$i <- runif(nrow(data))
train_data <- subset(data, i >= 0.2)
test_data <- subset(data, i < 0.2)
train_data$i <- NULL
test_data$i <- NULL
  9. Transforming the target variable
# signed log10 function: compresses the long right tail of salaries
signedlog10 <- function(x) {
  ifelse(abs(x) <= 1, 0, sign(x) * log10(abs(x)))
}
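
The excerpt defines the forward transform but shows neither its application nor its inverse, arcsignedlog10, which the evaluation code below relies on. A minimal sketch, assuming the transform is applied to salary_in_usd in place after the split (inferred from the evaluation code, which calls arcsignedlog10 directly on that column):

# Assumed in-place application of the transform (hypothetical)
train_data$salary_in_usd <- signedlog10(train_data$salary_in_usd)
test_data$salary_in_usd <- signedlog10(test_data$salary_in_usd)

# Inverse of signedlog10, reconstructed because it is not defined in the
# excerpt. Inputs in (-1, 1) collapse to 0 under the forward transform,
# so 0 is mapped back to 0 here.
arcsignedlog10 <- function(x) {
  ifelse(x == 0, 0, sign(x) * 10^(abs(x)))
}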

Model Training and Results

Target Encoding

We use target encoding to calculate the mean of signedlog10(salary_in_usd) for each feature.

After target encoding, we use ggcorrplot to generate the correlation matrix for the features.

library(ggcorrplot)
library(dplyr)  # for %>% and select()

# cor() needs numeric input, so keep only the numeric/encoded columns
correlation_matrix <- cor(data %>% select(where(is.numeric)) %>% select(-salary_in_usd))
ggcorrplot(correlation_matrix, lab = TRUE) +
  labs(title = "correlation matrix",
       x = "",
       y = "") +
  theme_bw() +
  theme(axis.text.x = element_text(face = "bold", colour = "black", size = 10, angle = 45, hjust = 1),
        axis.text.y = element_text(face = "bold", colour = "black", size = 10, hjust = 1),
        title = element_text(face = "bold", size = 10))

Training

  1. We use randomForest and gradient boosting (gbm) to build an ensemble learning model to predict signedlog10(salary_in_usd).

  2. Ensemble Learning: Use caret and caretEnsemble

  • Functions for creating ensembles of caret models: caretList() and caretStack(). caretList() is a convenience function for fitting multiple caret::train() models to the same dataset. caretStack() will make linear or non-linear combinations of these models, using a caret::train() model as a meta-model, and caretEnsemble() will make a robust linear combination of models using a GLM.
  3. We also use 5-fold cross-validation and grid search to find the best hyperparameters.
  4. We use the RMSE of salary_in_usd, converted back to the USD scale via arcsignedlog10, to evaluate model performance.
  • null model: a linear regression with only an intercept
null_model <- lm(salary_in_usd~1, data = train_data)
summary(null_model)

null_rmse <- RMSE(train_data$salary_in_usd %>% arcsignedlog10,
                  null_model$fitted.values %>% arcsignedlog10)
  • ensemble model = random forest + gradient boosting
library(randomForest)
library(gbm)

library(caret)
library(caretEnsemble)

# cross validation
ctrl <- trainControl(
  method = "cv",                
  number = 5,                   
  savePredictions = "final",    
  returnData = TRUE,            
  returnResamp = "final",       
  verboseIter = TRUE            
)

# grid search
rf_grid <- expand.grid(
  mtry = c(2)
)

gbm_grid <- expand.grid(
  n.trees = 150,
  interaction.depth = 3,
  shrinkage = 0.1,
  n.minobsinnode = 10
)

# model_list
model_list <- caretList(
  salary_in_usd~.,
  data = train_data, 
  metric = "RMSE",
  verbose = T,
  trControl = ctrl,
  tuneList = list(
    rf = caretModelSpec("rf", tuneGrid = rf_grid),
    gbm = caretModelSpec("gbm", tuneGrid = gbm_grid)
  )
)

# ensemble model
ens_model <- caretEnsemble(model_list)
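
The RMSE figures reported below appear without the code that produced them. A minimal sketch of one way to compute them, assuming predict() on the caretEnsemble object returns a numeric vector of signedlog10-scale predictions:

# Hypothetical evaluation: predict in signedlog10 space, invert back to
# USD with arcsignedlog10, then compute RMSE (mirrors the null-model code)
pred_train <- predict(ens_model, newdata = train_data)
pred_test <- predict(ens_model, newdata = test_data)

train_rmse <- RMSE(arcsignedlog10(pred_train), arcsignedlog10(train_data$salary_in_usd))
test_rmse <- RMSE(arcsignedlog10(pred_test), arcsignedlog10(test_data$salary_in_usd))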

Feature Importance

| feature_name | overall | rf | gbm |
| --- | --- | --- | --- |
| job_title 👑 | 41.94337184 | 41.65463988 | 44.24573465 |
| employee_residence | 29.53998588 | 29.47609937 | 30.04942006 |
| experience_level | 25.94155302 | 26.09243964 | 24.73837574 |
| company_size | 1.397332208 | 1.513988644 | 0.467108035 |
| remote_ratio | 1.177757053 | 1.262832468 | 0.499361506 |
| employment_type | 0 | 0 | 0 |
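
The per-model columns presumably come from caret's variable importance; a sketch for the two base learners (how the overall column weights them is not shown in the excerpt):

# Variable importance of each base learner via caret::varImp()
varImp(model_list$rf)
varImp(model_list$gbm)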

Measurement

| Train / Test | RMSE |
| --- | --- |
| Train | 46728.5916975186 |

(figure: scatter plot of true vs. predicted salary, training set)

Prediction

| Train / Test | RMSE |
| --- | --- |
| Test | 47087.3351748348 |

(figure: scatter plot of true vs. predicted salary, test set)

Compare with other results on Kaggle

  1. Our performance 👑

| Train / Test | RMSE |
| --- | --- |
| Train | 46728.5916975186 |
| Test | 47087.3351748348 |

  2. AIML salaries 2022-2024 AutoViz+CatBoost+SHAP
  • train RMSE: 51.4 kUSD/year; test RMSE: 52.0 kUSD/year
  3. Neural Network Regression Models
  • test RMSE: 57857.07162184822

Conclusion

Our RMSE is noticeably better than the other results we compared against on Kaggle.

However, this is only a passable result in absolute terms: an error of roughly 47,000 USD is large relative to typical salaries. We would need more data or richer features to train a model whose predictions most people would find acceptable.

Nonetheless, during the training process, we found that job_title is the most important feature, indicating that job titles play a crucial role in predicting salaries.
