I've chosen to use Support Vector Regression for a few main reasons:

1. Memory efficiency

2. Ability to work well in high-dimensional spaces (due to the high number of unique values in the categorical columns)

3. Ability to handle non-linear regression using the *kernel trick*

4. Robustness in preventing overfitting and resistance against outliers
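To make points 2 and 3 a little more concrete, here is a minimal sketch of how an RBF kernel scores the similarity of two one-hot encoded rows. The rows and column names are hypothetical and chosen only for illustration; `gamma = 0.1` mirrors the value passed to `SVR` later in this notebook. For one-hot data, each categorical column that differs between two rows adds 2 to the squared distance, so similarity decays with the number of mismatched categories.

```python
import math

def rbf_kernel(a, b, gamma):
    """RBF kernel: exp(-gamma * squared Euclidean distance between a and b)."""
    sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-gamma * sq_dist)

# Hypothetical one-hot columns: [size_S, size_M, size_L, level_Junior, level_Senior]
row_a = [0, 1, 0, 0, 1]  # medium company, senior
row_b = [0, 1, 0, 1, 0]  # medium company, junior (one category differs from row_a)
row_c = [1, 0, 0, 1, 0]  # small company, junior (two categories differ from row_a)

gamma = 0.1  # same gamma as the SVR model fitted later in the notebook

same = rbf_kernel(row_a, row_a, gamma)      # identical rows
one_diff = rbf_kernel(row_a, row_b, gamma)  # squared distance 2
two_diff = rbf_kernel(row_a, row_c, gamma)  # squared distance 4

print(f"identical rows:        {same:.4f}")
print(f"one category differs:  {one_diff:.4f}")
print(f"two categories differ: {two_diff:.4f}")
```

The model never builds the implicit feature space explicitly; it only needs these pairwise kernel values, which is why adding many one-hot columns does not change how the kernel is evaluated.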

I'm choosing to use only 2023 data because the job market has been volatile and the recent hype around Data Science has increased salary variability. To reduce this noise, I examine only the most recent year available in the dataset.

In [None]:
#Import libraries for analysis
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR
#from sklearn.metrics import r2_score, root_mean_squared_error #Kaggle does not support r2 or RMSE
#(root_mean_squared_error requires scikit-learn 1.4 or newer)
#Download the code and run it in a local IDE for live metrics

In [None]:
#Read CSV and print first 5 rows
df = pd.read_csv('/kaggle/input/jobs-in-data/jobs_in_data.csv')
print(df.head())

In [None]:
#Drop unnecessary columns (salary_in_usd is kept as the target)
df.drop(['salary_currency','salary'], axis=1, inplace=True)

In [None]:
#Examine dataframe
print(df.describe())

In [None]:
#Filter for jobs in 2023 based on row value
df = df[df['work_year'] == 2023]

#Drop work year to remove redundancy
df.drop(['work_year'], axis=1, inplace=True)

In [None]:
#One-Hot encode categorical values; dtype=int yields 0/1 columns instead of booleans
df = pd.get_dummies(df, columns=['job_category','work_setting','company_location','employee_residence',
                                 'company_size','employment_type','experience_level','job_title'],
                    dtype=int)

In [None]:
#Check if OHE worked and data types are all numbers
print(df.info())
print(df.head())

In [None]:
#Visualize target variable to see distribution
plt.hist(df.salary_in_usd)
plt.show()
plt.close()

In [None]:
#Transform target salary data to normalize right skewness and visualize
df.salary_in_usd = np.log(df.salary_in_usd)
plt.hist(df.salary_in_usd)
plt.show()
plt.close()

In [None]:
#Get x and y variables
x = df.drop(['salary_in_usd'], axis=1)
y = df.salary_in_usd

#Split dataset
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.9, random_state=1)

In [None]:
'''
#NOTE: LONG RUNTIME
#Find optimal parameters using GridSearch
# defining parameter range
param_grid = {'C': [0.1, 1, 10, 100, 1000],
              'gamma': [1, 0.1, 0.01, 0.001, 0.0001],
              'kernel': ['rbf']}

grid = GridSearchCV(SVR(), param_grid, refit=True, verbose=3)

# fitting the model for grid search
grid.fit(x_train, y_train)

# print best parameter after tuning
print(grid.best_params_)
'''

In [None]:
#Create model using parameters from GridSearch
model = SVR(C=1, gamma=.1, kernel='rbf')

#Fit model
model.fit(x_train,y_train)

#Create y_pred to score model
y_pred = model.predict(x_test)

In [None]:
'''
#Find and print the r2 score and normalized root mean squared error (NRMSE)
r2 = r2_score(y_test, y_pred)
rmse = root_mean_squared_error(y_test, y_pred)
print(f"R2 score: {r2}")
print(f"NRMSE score: {rmse / (y.max() - y.min())}")
'''
#Code for R2 and Normalized Root Mean Squared Error (requires the sklearn.metrics import above)

R2 score: 0.5102

NRMSE score: 0.0983

The root mean squared error is 9.83% of the range of the log-transformed salaries.

The model explains 51.02% of the variance in log salary.
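Because the `sklearn.metrics` import may not be available in every environment, both metrics can also be computed directly with NumPy. The sketch below is one way to do that; the helper names `r2_score_np` and `nrmse_np` and the toy arrays are assumptions made for illustration, not part of the original notebook. The optional `value_range` argument allows normalizing by the range of the full target `y`, as the commented-out code does, rather than by the range of the test set.

```python
import numpy as np

def r2_score_np(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def nrmse_np(y_true, y_pred, value_range=None):
    """RMSE divided by a range (defaults to the range of y_true)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    if value_range is None:
        value_range = y_true.max() - y_true.min()
    return rmse / value_range

# Toy values for illustration only (not the salary data)
toy_true = np.array([1.0, 2.0, 3.0, 4.0])
toy_pred = np.array([1.5, 2.0, 2.5, 4.0])

toy_r2 = r2_score_np(toy_true, toy_pred)
toy_nrmse = nrmse_np(toy_true, toy_pred)
print(f"R2 score: {toy_r2:.4f}")
print(f"NRMSE score: {toy_nrmse:.4f}")
```

In the notebook, the equivalent calls would be `r2_score_np(y_test, y_pred)` and `nrmse_np(y_test, y_pred, value_range=y.max() - y.min())`.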

In [None]:
#Plot predictions against actual values
plt.scatter(np.exp(y_test),np.exp(y_pred), alpha=.2)
plt.xlabel('Actual Value')
plt.ylabel('Predicted Value')

In [None]:
#Max and min over/under estimations, show a sample of the data vs. predictions
print("Max Overestimation")
print((np.exp(y_pred) / np.exp(y_test)).max())
print()
print("Max Underestimation")
print((np.exp(y_pred) / np.exp(y_test)).min()) 
print()
print("Sample of Predictions:")
print((np.exp(y_pred) / np.exp(y_test)))

In [None]:
#Compare the standard deviation of actual and predicted salaries (back-transformed to USD)
print("Standard deviation of actual data:", np.std(np.exp(y_test)))
print("Standard deviation of predicted data:", np.std(np.exp(y_pred)))

The predicted values have a lower standard deviation than the actual salary values, indicating that the algorithm struggles with highly variable data and that its predictions are pulled toward the mean.
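One way to quantify this pull toward the mean is to compare the spread of predictions with the spread of actual values, and to fit a straight line of predicted against actual salary. The sketch below uses made-up salary figures purely for illustration; the helper name `compression_summary` is hypothetical. A spread ratio below 1 and a slope below 1 both suggest that low salaries tend to be overestimated and high salaries underestimated.

```python
import numpy as np

def compression_summary(actual, predicted):
    """Return (std ratio, slope of predicted vs. actual) as simple shrinkage diagnostics."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    spread_ratio = predicted.std() / actual.std()
    # Least-squares line predicted = slope * actual + intercept
    slope, _intercept = np.polyfit(actual, predicted, deg=1)
    return spread_ratio, slope

# Made-up salaries in USD, for illustration only
toy_actual = np.array([50_000, 80_000, 120_000, 200_000, 300_000])
toy_predicted = np.array([90_000, 100_000, 125_000, 160_000, 190_000])

spread_ratio, slope = compression_summary(toy_actual, toy_predicted)
print(f"std(predicted) / std(actual): {spread_ratio:.3f}")
print(f"slope of predicted vs. actual: {slope:.3f}")
```

With the notebook's variables, the call would be `compression_summary(np.exp(y_test), np.exp(y_pred))`.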

Future data collection should include more continuous features, such as years of experience, or additional categorical features (Degree Level, Skills Desired) to give the model more signal and improve its accuracy.