## Predict AI & ML Salaries

Dataset Source: https://www.kaggle.com/datasets/bryanb/aiml-salaries

##### Import Necessary Libraries

In [0]:
import pyspark
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler
from pyspark.ml.regression import GBTRegressor
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Used later to choose the one-hot encoder class for the installed Spark version.
# Note: distutils is deprecated and was removed in Python 3.12.
from distutils.version import LooseVersion

##### Ingest Data

In [0]:
# File location and type
file_location = "/FileStore/tables/salaries.csv"
file_type = "csv"

# CSV options
infer_schema = "false"
first_row_is_header = "true"
delimiter = ","

schema = StructType([StructField('Work_Year', StringType(), True),
                     StructField("experience_level", StringType(), True),
                     StructField("employment_type", StringType(), True),
                     StructField("job_title", StringType(), True),
                     StructField("salary", StringType(), True),
                     StructField("salary_currency", StringType(), True),
                     StructField("salary_in_usd", IntegerType(), True),
                     StructField("employee_residence", StringType(), True),
                     StructField("remote_ratio", StringType(), True),
                     StructField("company_location", StringType(), True),
                     StructField("company_size", StringType(), True)])

# The applied options are for CSV files. For other file types, these will be ignored.
df = spark.read.format(file_type) \
  .option("inferSchema", infer_schema) \
  .option("header", first_row_is_header) \
  .option("sep", delimiter) \
  .schema(schema)\
  .load(file_location)

df = df.drop("salary", "salary_currency")

df = df.na.drop()

df = df.withColumnRenamed("salary_in_usd", 'labels')

display(df)

Work_Year,experience_level,employment_type,job_title,labels,employee_residence,remote_ratio,company_location,company_size
2022,SE,FT,Marketing Data Analyst,200000,GB,100,GB,S
2022,EN,FT,Data Scientist,74378,CA,100,CA,L
2022,SE,FT,Data Science Lead,165000,US,50,US,S
2022,EN,FT,Data Scientist,33599,GB,50,GB,L
2022,SE,FT,Data Engineer,185900,US,0,US,M
2022,SE,FT,Data Engineer,129300,US,0,US,M
2022,SE,FT,Data Analyst,169000,US,0,US,M
2022,SE,FT,Data Analyst,110600,US,0,US,M
2021,EN,FT,Power BI Developer,5409,IN,50,IN,L
2021,MI,FT,Data Engineer,75050,AU,50,AU,L


##### Summarize the Dataset

In [0]:
print("The output of the summary function:\n")
df.summary().show()
print(f"\nThere are {df.count()} samples in this dataset.")

The output of the summary function:

+-------+------------------+----------------+---------------+--------------------+------------------+------------------+-----------------+----------------+------------+
|summary|         Work_Year|experience_level|employment_type|           job_title|            labels|employee_residence|     remote_ratio|company_location|company_size|
+-------+------------------+----------------+---------------+--------------------+------------------+------------------+-----------------+----------------+------------+
|  count|              1195|            1195|           1195|                1195|              1195|              1195|             1195|            1195|        1195|
|   mean| 2021.684518828452|            null|           null|                null|122041.14225941422|              null|66.73640167364017|            null|        null|
| stddev|0.5846039750490938|            null|           null|                null| 66487.80083298149|              nul

##### Count Unique Locations

In [0]:
print("Unique Company Locations:")
print(df.select("company_location").distinct().count())
# show() prints the table itself and returns None, so it is not wrapped in print()
df.select("company_location").distinct().show()

print("Unique Employee Locations:")
print(df.select("employee_residence").distinct().count())
df.select("employee_residence").distinct().show()

Unique Company Locations:
59
+----------------+
|company_location|
+----------------+
|              FI|
|              NL|
|              CZ|
|              PT|
|              AU|
|              CA|
|              GB|
|              BR|
|              DE|
|              ES|
|              TR|
|              US|
|              IN|
|              FR|
|              GR|
|              SG|
|              TH|
|              AS|
|              NG|
|              PR|
+----------------+
only showing top 20 rows

Unique Employee Locations:
64
+------------------+
|employee_residence|
+------------------+
|                FI|
|                NL|
|                CZ|
|                PT|
|                CL|
|                AU|
|                CA|
|                GB|
|                BR|
|                DE|
|                ES|
|                TR|
|                CR|
|                US|
|                IN|
|                FR|
|                GR|
|                TH|
|            

##### Index & One-Hot Encode Categorical Features

In [0]:
# Treat every string-typed column as a categorical feature
categorical_columns = [item[0] for item in df.dtypes if item[1].startswith("string")]

stages = []

for cat in categorical_columns:
    # Map each category string to a numeric index
    stringIndexer = StringIndexer(inputCol=cat, outputCol=cat + "_index")
    # One-hot encode the index; OneHotEncoderEstimator was renamed to OneHotEncoder in Spark 3.0
    if LooseVersion(pyspark.__version__) < LooseVersion("3.0"):
        from pyspark.ml.feature import OneHotEncoderEstimator
        encoder = OneHotEncoderEstimator(inputCols=[stringIndexer.getOutputCol()], outputCols=[cat + "classVector"])
    else:
        from pyspark.ml.feature import OneHotEncoder
        encoder = OneHotEncoder(inputCols=[stringIndexer.getOutputCol()], outputCols=[cat + "classVector"])
    stages += [stringIndexer, encoder]

##### Assemble Feature Vector

In [0]:
# vector_assembler
vector_inputs = [cat + 'classVector' for cat in categorical_columns]

assembler = VectorAssembler(inputCols=vector_inputs, outputCol="features")

stages += [assembler]

##### Fit Data Preparation Pipeline

In [0]:
data_prep_pipe = Pipeline().setStages(stages)
data_prep_model = data_prep_pipe.fit(df)
prepped_ds = data_prep_model.transform(df)

##### Train/Test Split

In [0]:
#train/test split

train_ds, test_ds = prepped_ds.randomSplit(weights=[0.80, 0.20], seed=42)

print(f"Sample Count for Training Dataset {train_ds.count()}")
print(f"Sample Count for Testing Dataset {test_ds.count()}")

Sample Count for Training Dataset 987
Sample Count for Testing Dataset 208


##### Train Gradient-Boosted Tree Regressor

In [0]:
gbt = GBTRegressor(featuresCol='features', labelCol='labels', maxIter=25)

gbt_model = gbt.fit(train_ds)

##### Describe Training Dataset

In [0]:
display(train_ds.describe())

summary,Work_Year,experience_level,employment_type,job_title,labels,employee_residence,remote_ratio,company_location,company_size,Work_Year_index,experience_level_index,employment_type_index,job_title_index,employee_residence_index,remote_ratio_index,company_location_index,company_size_index
count,987.0,987,987,987,987.0,987,987.0,987,987,987.0,987.0,987.0,987.0,987.0,987.0,987.0,987.0
mean,2021.681864235056,,,,120912.304964539,,65.95744680851064,,,0.3181357649442756,0.6149949341438703,0.0405268490374873,6.162107396149949,3.4954407294832825,0.5197568389057751,2.917933130699088,0.4518743667679838
stddev,0.5803626985544023,,,,65884.10835476987,,44.14123346315827,,,0.5803626985547309,0.8327098867867521,0.2890477582163879,10.93499429445172,8.973294245589734,0.6994381558909721,7.962536047510055,0.6592023211614352
min,2020.0,EN,CT,3D Computer Vision Researcher,2324.0,AE,0.0,AE,L,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,2022.0,SE,PT,Staff Data Scientist,600000.0,VN,50.0,VN,S,2.0,3.0,3.0,62.0,62.0,2.0,58.0,2.0


##### Generate Predictions

In [0]:
gbt_preds = gbt_model.transform(test_ds)

##### Evaluate Model

In [0]:
gbt_evaluator = RegressionEvaluator(labelCol="labels", predictionCol="prediction", metricName="rmse")

rmse = gbt_evaluator.evaluate(gbt_preds)

print(f"Root Mean Squared Error (RMSE) on test data is ${rmse}")

Root Mean Squared Error (RMSE) on test data is $58932.70610251257
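
To put this RMSE in context, it can be compared with the spread of the label itself. A model that always predicted the mean salary would have an RMSE close to the label's standard deviation, so the ratio of the two gives a rough sense of how much the features help. The sketch below uses the full-dataset mean and standard deviation from the `summary()` output as stand-ins for the test-set values, which is an assumption; the variable names are illustrative.

```python
# Values copied from the notebook output above.
rmse_test = 58932.70610251257      # GBT RMSE on the test split
label_stddev = 66487.80083298149   # stddev of `labels` over all 1195 rows
label_mean = 122041.14225941422    # mean of `labels` over all 1195 rows

# Assumption: a mean-only baseline evaluated on the test split would have an
# RMSE roughly equal to the full-dataset standard deviation.
rmse_ratio = rmse_test / label_stddev

# RMSE expressed as a fraction of the average salary.
relative_error = rmse_test / label_mean

print(f"RMSE / label stddev: {rmse_ratio:.2f}")
print(f"RMSE / label mean:   {relative_error:.2f}")
```

A ratio near 1 would mean the model is barely better than predicting the mean. Under the assumption above, the ratio here is about 0.89, and the RMSE is close to half of the average salary.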


##### Save Model

In [0]:
gbt_model.write().overwrite().save("Data Salary Predictor - Regression Model")

## Notes & Other Takeaways From This Project
****
- The test RMSE of roughly $58,900 is large relative to a mean salary of roughly $122,000, so the results were not what I was hoping for. Here are the reasons why this project did not work out well:

  - In compensation analysis, the relevant factors include experience, education, responsibility, job complexity, supervision received, supervision exercised, consequences of error, and working conditions. The goal is to use objective factors that evaluate the value of the job. The features provided here are helpful, but they most likely do not have the largest impact on salary determination.
  - The closest this analysis gets to those factors is a categorical feature representing experience. It breaks a full career into four (4) groups, which removes information that could be used to determine salary more accurately.
  - As alluded to in the previous point, the inputs are all categorical/discrete (with the potential exception of the `remote_ratio` feature, which this notebook also reads as a string and encodes as a category). For a regression analysis like this, multiple continuous/numerical features would likely have improved the results considerably, even if some features remained categorical/discrete.
  - The dataset is too small, especially for this type of data. Combining all (or nearly all) discrete features with only about twelve hundred samples is likely to cause issues.
  - With nearly twelve hundred samples, fifty-nine distinct values for the company location, and sixty-four distinct values for the employee residence, there are not enough samples of each value to learn a truly meaningful weight for those two features. If I were to condense those two features from country to continent, I would remove some potentially valuable information from this analysis.
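
The sparsity described in the last point can be made concrete with some quick arithmetic, along with a small sketch of one possible mitigation: bucketing rare categories into a single catch-all level instead of collapsing countries into continents. The helper `bucket_rare_levels`, the `OTHER` label, and the threshold of 10 are all illustrative assumptions, and the sample list is made-up toy data rather than the real dataset.

```python
from collections import Counter

n_samples = 1195            # from df.count() in the notebook
n_company_locations = 59    # distinct company_location values
n_employee_residences = 64  # distinct employee_residence values

# Average rows per level if the rows were spread evenly across levels.
# Real data is skewed, so many levels will have far fewer rows than this.
print(round(n_samples / n_company_locations, 1))
print(round(n_samples / n_employee_residences, 1))


def bucket_rare_levels(values, min_count=10, other_label="OTHER"):
    """Replace levels seen fewer than min_count times with other_label."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other_label for v in values]


# Toy data (not the real dataset): two common levels and two rare ones.
toy_locations = ["US"] * 12 + ["GB"] * 10 + ["FI"] * 2 + ["CZ"]
bucketed = bucket_rare_levels(toy_locations)
print(Counter(bucketed))
```

Even in the best case of an even spread, each location would have only about twenty rows. Bucketing keeps the well-represented countries intact while pooling the rest, though whether it would actually help this model is untested here.
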
****
- The way to improve the poor results is to collect more data that includes:
  
  - More samples
  - More continuous values
  - Features that better evaluate the value of different jobs
****
- The next improvement I am looking to incorporate into Apache Spark-related projects is wrapping my work in functions.
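
As a rough idea of what that could look like, the feature-preparation cells in this notebook could be folded into two small functions. This is only a sketch under stated assumptions: the function names `encoded_column_names` and `build_prep_pipeline` are hypothetical, the column naming follows the `_index` / `classVector` suffixes used in this notebook, and Spark 3.x is assumed so the version check is dropped.

```python
def encoded_column_names(categorical_columns):
    """Return (input, index, vector) column names for each categorical column."""
    return [(c, c + "_index", c + "classVector") for c in categorical_columns]


def build_prep_pipeline(categorical_columns, output_col="features"):
    """Build an unfitted Pipeline that indexes, one-hot encodes, and assembles.

    Assumes Spark 3.x, where OneHotEncoder accepts inputCols/outputCols.
    """
    # Imported here so the naming helper above can be used without Spark.
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler

    stages = []
    names = encoded_column_names(categorical_columns)
    for input_col, index_col, vector_col in names:
        # Map category strings to numeric indices, then one-hot encode them.
        stages.append(StringIndexer(inputCol=input_col, outputCol=index_col))
        stages.append(OneHotEncoder(inputCols=[index_col], outputCols=[vector_col]))

    # Combine all one-hot vectors into a single feature vector column.
    vector_cols = [vector_col for _, _, vector_col in names]
    stages.append(VectorAssembler(inputCols=vector_cols, outputCol=output_col))
    return Pipeline(stages=stages)


print(encoded_column_names(["job_title", "company_size"]))
```

In the notebook, the three preparation cells would then reduce to something like `prepped_ds = build_prep_pipeline(categorical_columns).fit(df).transform(df)`.
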
****