# Predicting Employee Satisfaction with Apache PySpark and Machine Learning

## 1. Topic Proposal:

### 1.1 Introduction 

Our project is a predictive machine learning project on a dataset containing information about employee satisfaction, considering multiple factors such as salary, years of experience, education level, and many other factors. The dataset focuses mainly on specific types of employees such as data scientists, data engineers, database administrators, and data/business analysts. However, this analysis can be applied to any type of employee, considering many factors to measure their satisfaction.
In addition, we found this dataset interesting as data-related jobs have recently become the most important, and many companies seek data specialists and would like to measure their satisfaction considering the current international market.

During our analysis, we used Apache PySpark and the Spark ML library (data frame/dataset-based) to build predictive machine learning models. Apache PySpark and Spark ML library are used in conjunction with Hadoop distributed systems. This means that if we had a larger dataset of the same data, our project would still be applicable as we could store this large dataset on Hadoop and then use the Apache Spark data processing engine and run the same Spark Machine Learning models saved with these large datasets without any modification in the code. In other words, Spark ML models can be trained, saved, and reloaded in different Spark instances for use in predictions. As is standard in the industry, the model will eventually need to be retrained with new data to maintain its predictive power depending on the application. Some models must be retrained daily, others every month, and others only every few months. In our report, we have used Random Forest, Naive Bayes, and Logistic Regression models. This report could help companies avoid the financial and operational costs associated with high employee turnover by using our predictive models so that they can take steps to enhance employee satisfaction and retention.


On the other side, if the models are used for risk management, the predictive model's accuracy may decrease over time due to the changes in the workplace caused by proactive measures taken to increase employee satisfaction. For instance, when the company addresses the dissatisfaction factors and starts to improve these factors, it will modify the data patterns on which the model relies, leading to less accurate predictions. Therefore, to mitigate this risk, this company should retrain its models regularly based on the evolving data which leads to new actions to be considered to increase employee satisfaction. As a result, the Spark ML models remain effective in enhancing employee satisfaction and retention.

### 1.2 Dataset


The original dataset can be found at: https://www.kaggle.com/datasets/phuchuynguyen/datarelated-developers-survey-by-stack-overflow 


Our dataset consists of 13 features, a mix of numerical and categorical features, and a binary target variable indicating whether the employee is satisfied, leading to a binary classification task. This dataset is processed from the Stack Overflow Annual Survey results from 2017 to 2020, however, after the EDA and data preprocessing phases we ended up analyzing 2019 and 2020 when the COVID-19 virus appeared. We will find in the provided URL above that there are two files: survey_final.csv and processed_data_toDummies.csv. The first file "survey_final.csv" contains the original dataset that comes from the Stack overflow Annual Developers Survey, only considering the respondents who considered themselves already in a data-related job (Data Scientist, Machine Learning Specialist, Database Administrator, Data Analyst, Business Analyst, and Data Engineer). The second file "processed_data_toDummies.csv,” which we have proceeded with in our project, contains the data pre-processed from the original dataset with all the developer types converted into dummy variables:


* Data Scientist or Machine Learning Specialist
* Database Administrator
* Data Analyst
* Business Analyst and Data Engineer


Before describing our dataset features we added the "Region" feature to our dataset because we had 180 countries so we have decided to categorize these countries into 5 regions only: Africa, Europe, America, Asia, and Oceania. Encoding 180 countries is redundant and may lead to a reduction in the accuracy of our models. Each row in the dataset consists of 15 features.


* *Year* (integer): the year in which data or survey responses were collected
* *Hobbyist* (string): contains binary data (e.g., "Yes" or "No") indicating whether the individual considers programming or coding as a hobby
* *ConvertedComp* (double): the annual compensation (salary) of the respondents, in the local currency of their respective countries
* *Country* (string): the location information of the survey respondents, indicating where they are located or working
* *EdLevel* (string): the highest level of education attained by the respondents, such as "Bachelor's degree" or "Master's degree"
* *Employment* (string): describes the employment status or type of job arrangement the respondents have, which can include options like "Full-time," "Part-time," "Freelance," or "Self-employed."
* *JobSat* (integer): describes the job satisfaction level of the respondents
* *OrgSize* (string): the size of the organization or company where the respondents work, which could be categorized by the number of employees
* *UndergradMajor* (string): the respondents' undergraduate majors or fields of study
* *YearsCodePro* (integer): the number of years the respondents have been coding or working in a professional coding/programming capacity
* *Data scientist or machine learning specialist*, *Database administrator*, *Data or business analyst*, *Engineer, data* (integer): these columns used to describe the *Developer type* column in our original dataset "survey_final.csv" and then converted to dummy variables. For instance, when *Database administrator* contains 1 as a value, the rest will be 0 which indicates this respondent works as a Database administrator
* *Region* (string): the geographic region or area where the respondents are located, which is broader than a specific country






This survey is conducted every year, leading to larger datasets that are suitable for big data analysis. However, we may need to change the EDA and data pre-processing phases as new features may be added. In other words, our project is applicable on the other collected surveys starting from 2021 till 2023 and these surveys can be found at: https://insights.stackoverflow.com/survey.


We started by transferring the downloaded dataset to HDFS as shown below:


In [1211]:
# # Copy the data from my local disk to my cluster
# scp processed_data_toDummies.csv mzaka001@lena.doc.gold.ac.uk:/home/mzaka001/CW2/CW2_dataset
# scp survey_final.csv  mzaka001@lena.doc.gold.ac.uk:/home/mzaka001/CW2/CW2_dataset

# # Copy the dataset from local onto HDFS
# hadoop fs -copyFromLocal CW2_dataset/

In [1]:
# Import the required libraries 
import pyspark
from pyspark import SparkContext, SparkConf 
from pyspark.ml.feature import VectorAssembler, OneHotEncoder, MinMaxScaler
from pyspark.ml.feature import VectorAssembler, MinMaxScaler
from pyspark.sql import SQLContext
import pandas as pd
from pyspark.ml import Pipeline
from pyspark.ml.feature import IndexToString, StringIndexer, VectorIndexer
from pyspark.mllib.stat import Statistics
from pyspark.mllib.tree import RandomForest, GradientBoostedTrees
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.linalg import Vector
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder, CrossValidatorModel
import random
import time
from pyspark.sql.functions import col
from pyspark.ml.classification import NaiveBayes
from pyspark.sql.functions import when, lit
from pyspark.ml.feature import ChiSqSelector
from pyspark.ml.feature import StringIndexer
from pyspark.ml.stat import ChiSquareTest
from pyspark.ml.classification import RandomForestClassifier, LogisticRegression
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

In [2]:
# create the Spark SQL Context
sc = SparkContext(appName="bigdata_cw2")
sq = SQLContext(sc)

# set spark.driver.maxResultsize to 0 to prevent memory issues
conf = SparkConf().set("spark.driver.maxResultSize",  "0")

In [4]:
# to stop the spark context at the end of the session
sc.stop()

In [3]:
dataframe = sq.read.csv("hdfs:///user/mzaka001/CW2_dataset/processed_data_toDummies.csv", sep=',', header='true', 
                       inferSchema='true')

As shown below, the columns and their types that have been described above:

In [4]:
dataframe.printSchema()

root
 |-- Year: integer (nullable = true)
 |-- Hobbyist: string (nullable = true)
 |-- ConvertedComp: double (nullable = true)
 |-- Country: string (nullable = true)
 |-- EdLevel: string (nullable = true)
 |-- Employment: string (nullable = true)
 |-- JobSat: integer (nullable = true)
 |-- OrgSize: string (nullable = true)
 |-- UndergradMajor: string (nullable = true)
 |-- YearsCodePro: integer (nullable = true)
 |-- Data scientist or machine learning specialist: integer (nullable = true)
 |-- Database administrator: integer (nullable = true)
 |-- Data or business analyst: integer (nullable = true)
 |-- Engineer, data: integer (nullable = true)
 |-- Region: string (nullable = true)



We can notice from the above schema that the categorical variables are: "Hobbyist,” "Country,” "EdLevel,” "Employment,” "OrgSize,” "UndergradMajor,” and "Region.” 


Below are some basic descriptive statistics for our dataset: count, mean, standard deviation, minimum, and maximum. 


In [5]:
dataframe.describe().select("summary",
 'Year',
 'Hobbyist',
 'ConvertedComp',
 'Country',
 'EdLevel',
 'Employment',
 'JobSat',
 'OrgSize',
 'UndergradMajor',
 'YearsCodePro',
 'Data scientist or machine learning specialist',
 'Database administrator',
 'Data or business analyst',
 'Engineer, data',
 'Region').show(vertical=True)

-RECORD 0-------------------------------------------------------------
 summary                                       | count                
 Year                                          | 33601                
 Hobbyist                                      | 33601                
 ConvertedComp                                 | 33601                
 Country                                       | 33601                
 EdLevel                                       | 33138                
 Employment                                    | 33561                
 JobSat                                        | 33526                
 OrgSize                                       | 31904                
 UndergradMajor                                | 30515                
 YearsCodePro                                  | 33518                
 Data scientist or machine learning specialist | 33601                
 Database administrator                        | 33601                
 Data 

The following is a sample of records:

In [6]:
dataframe.show(vertical=True)

-RECORD 0-------------------------------------------------------------
 Year                                          | 2017                 
 Hobbyist                                      | Yes, both            
 ConvertedComp                                 | 43750.0              
 Country                                       | United Kingdom       
 EdLevel                                       | Bachelor's degree    
 Employment                                    | Employed full-time   
 JobSat                                        | 4                    
 OrgSize                                       | 2 to 9 employees     
 UndergradMajor                                | Computer science     
 YearsCodePro                                  | 2                    
 Data scientist or machine learning specialist | 1                    
 Database administrator                        | 1                    
 Data or business analyst                      | null                 
 Engin

### 1.3 EDA


In this section, we explore our dataset. First, we check if the dataset contains null values; then, for the numerical features, we replace the null values with the mean corresponding to each column, and for the categorical features we will replace the null values with the mode corresponding to each column. As we can see below some columns do not have null values such as "Year,” "Hobbyist,” "ConvertedComp,” "Country,” and "Region" so we will focus on the other features.

In [7]:
null_counts = {c:dataframe.filter(dataframe[c].isNull()).count() for c in dataframe.columns}
null_counts 

{'Year': 0,
 'Hobbyist': 0,
 'ConvertedComp': 0,
 'Country': 0,
 'EdLevel': 463,
 'Employment': 40,
 'JobSat': 75,
 'OrgSize': 1697,
 'UndergradMajor': 3086,
 'YearsCodePro': 83,
 'Data scientist or machine learning specialist': 0,
 'Database administrator': 0,
 'Data or business analyst': 2564,
 'Engineer, data': 13355,
 'Region': 0}

In [8]:
print("Total number of records in our dataset is:", dataframe.count())

Total number of records in our dataset is: 33601


Below we will start with the categorical features, we got the mode of "EdLevel", "Employment", "OrgSize", and "UndergradMajor" and then replaced the null values of each column with the corresponding mode values we have extracted.

In [9]:
mode_EdLevel = dataframe.groupBy('EdLevel').count().orderBy(col('count').desc()).first()
mode_Employment = dataframe.groupBy('Employment').count().orderBy(col('count').desc()).first()
mode_OrgSize = dataframe.groupBy('OrgSize').count().orderBy(col('count').desc()).first()
mode_UndergradMajor = dataframe.groupBy('UndergradMajor').count().orderBy(col('count').desc()).first()


# The mode of each variable now contains the mode value and its count, however, we are interested only in the actual value
mode_value_EdLevel = mode_EdLevel['EdLevel']
mode_value_Employment = mode_Employment['Employment']
mode_value_OrgSize = mode_OrgSize['OrgSize']
mode_value_UndergradMajor = mode_UndergradMajor['UndergradMajor']
print("mode_value_EdLevel: ",mode_value_EdLevel)
print("mode_value_Employment: ",mode_value_Employment)
print("mode_value_OrgSize: ",mode_value_OrgSize)
print("mode_value_UndergradMajor: ",mode_value_UndergradMajor)

mode_value_EdLevel:  Bachelor's degree
mode_value_Employment:  Employed full-time
mode_value_OrgSize:  20 to 99 employees
mode_value_UndergradMajor:  Computer science


In [10]:
#Replace the null values of each categorical column with the corresponding mode values we have extracted above
dataframe = dataframe.na.fill(value="Bachelor's degree",subset=["EdLevel"])
dataframe = dataframe.na.fill(value="Employed full-time",subset=["Employment"])
dataframe = dataframe.na.fill(value="20 to 99 employees",subset=["OrgSize"])
dataframe = dataframe.na.fill(value="Computer science",subset=["UndergradMajor"])

In [11]:
#Replace the null values of "YearsCodePro" feature with the mean value we got from the summary statistics above
dataframe = dataframe.na.fill(value=8.755713348051794,subset=["YearsCodePro"])

We didn't change the "JobSat" feature as this is our target label. Regarding the dummy variables, we will drop the rows with null values.

In [12]:
#checking the null values per column
null_counts = {c:dataframe.filter(dataframe[c].isNull()).count() for c in dataframe.columns}
null_counts 

{'Year': 0,
 'Hobbyist': 0,
 'ConvertedComp': 0,
 'Country': 0,
 'EdLevel': 0,
 'Employment': 0,
 'JobSat': 75,
 'OrgSize': 0,
 'UndergradMajor': 0,
 'YearsCodePro': 0,
 'Data scientist or machine learning specialist': 0,
 'Database administrator': 0,
 'Data or business analyst': 2564,
 'Engineer, data': 13355,
 'Region': 0}

In [13]:
dataframe_cleaned=dataframe.dropna()
print("Total number of records in our dataset after drop:")
print(dataframe_cleaned.count())

Total number of records in our dataset after drop:
20230


As noted above, the total number of records we had before dropping the rows with null values was 33601 and after dropping the rows that contained null values and replacing the null values of numerical and categorical features with mean and mode respectively, the total number of records became 20230.

We have stated above that we have added the "Region" column to ease our task, instead of proceeding with 180 countries to gain more significant results so no need to keep the "Country" column as we will not use it during our process.

In [14]:
# drop 'Country' column from the predictors since we have categorized the countries by their region
dataframe_cleaned = dataframe_cleaned.drop(dataframe_cleaned.Country)
dataframe_cleaned.columns

['Year',
 'Hobbyist',
 'ConvertedComp',
 'EdLevel',
 'Employment',
 'JobSat',
 'OrgSize',
 'UndergradMajor',
 'YearsCodePro',
 'Data scientist or machine learning specialist',
 'Database administrator',
 'Data or business analyst',
 'Engineer, data',
 'Region']

Below we have created a new column called "Target" whose value is calculated based on the values in the "JobSat" column. If the "JobSat" column value is greater than 5,  the "Target" column is assigned to 1, which means "Satisfied." Otherwise, the "Target" column is assigned to 0, which means "Not Satisfied." Then we dropped the "JobSat" feature.

In [15]:
dataframe_cleaned = dataframe_cleaned.withColumn(
    "Target",
    when(dataframe_cleaned["JobSat"] > 5, 1)  # Satisfied
    .otherwise(0)  # Not Satisfied
)
# drop 'JosSat' column from the predictors
dataframe_cleaned = dataframe_cleaned.drop(dataframe_cleaned.JobSat)
dataframe_cleaned.columns

['Year',
 'Hobbyist',
 'ConvertedComp',
 'EdLevel',
 'Employment',
 'OrgSize',
 'UndergradMajor',
 'YearsCodePro',
 'Data scientist or machine learning specialist',
 'Database administrator',
 'Data or business analyst',
 'Engineer, data',
 'Region',
 'Target']

In [16]:
dataframe_cleaned.select("EdLevel").distinct().show()

+--------------------+
|             EdLevel|
+--------------------+
|   Bachelor's degree|
|Primary/elementar...|
|Some college/univ...|
|I never completed...|
| Professional degree|
|     Master's degree|
|    Associate degree|
|    Secondary school|
|     Doctoral degree|
+--------------------+



The "EdLevel" column has "I prefer not to answer" as a value which is an insignificant answer so we will drop the rows containing this value. It has other values such as: "I never completed any formal education" and "Some college/university study without earning a bachelor's degree" which we can consider these answers as "Undergraduate". Therefore we created a new column called "EducationLevel" whose value is calculated based on the "EdLevel" column. If the "EdLevel" column value is "Some college/university study without earning a bachelor's degree" or "I never completed any formal education", the "EducationLevel" column is assigned to "Undergraduate"." Otherwise, the "EducationLevel" column keeps the same value of "EdLevel" column.

In [17]:
dataframe_cleaned = dataframe_cleaned.filter(dataframe_cleaned.EdLevel != "I prefer not to answer")
dataframe_cleaned.select("EdLevel").distinct().show()

+--------------------+
|             EdLevel|
+--------------------+
|   Bachelor's degree|
|Primary/elementar...|
|Some college/univ...|
|I never completed...|
| Professional degree|
|     Master's degree|
|    Associate degree|
|    Secondary school|
|     Doctoral degree|
+--------------------+



In [18]:
dataframe_cleaned = dataframe_cleaned.withColumn(
    "EducationLevel",
    when(dataframe_cleaned["EdLevel"].isin("Some college/university study without earning a bachelor's degree", "I never completed any formal education"), "Undergraduate")
    .otherwise(dataframe_cleaned["EdLevel"]) 
)
# drop 'EdLevel' column from the predictors
dataframe_cleaned = dataframe_cleaned.drop(dataframe_cleaned.EdLevel)
dataframe_cleaned.columns

['Year',
 'Hobbyist',
 'ConvertedComp',
 'Employment',
 'OrgSize',
 'UndergradMajor',
 'YearsCodePro',
 'Data scientist or machine learning specialist',
 'Database administrator',
 'Data or business analyst',
 'Engineer, data',
 'Region',
 'Target',
 'EducationLevel']

In [19]:
# let's see the actual values in the EducationLevel column
dataframe_cleaned.select("EducationLevel").distinct().show()

+--------------------+
|      EducationLevel|
+--------------------+
|   Bachelor's degree|
|Primary/elementar...|
|       Undergraduate|
| Professional degree|
|     Master's degree|
|    Associate degree|
|    Secondary school|
|     Doctoral degree|
+--------------------+



Regarding the "UndergradMajor" feature, we have categorized the fields that are similar to each other into one category. These categories are added in a new column called "Major" and dropped "UndergradMajor" column.

In [20]:
dataframe_cleaned.select("UndergradMajor").distinct().collect()
dataframe_cleaned = dataframe_cleaned.withColumn(
    "Major",
    when(dataframe_cleaned["UndergradMajor"].isin("Computer science", "Another engineering discipline",
        "Mathematics or statistics", "Information systems", "Web development or web design"), "STEM")
    .when(dataframe_cleaned.UndergradMajor == "Fine arts or performing arts", "Arts")
    .when(dataframe_cleaned.UndergradMajor.isin("Social science", "Natural science", "Health science"), "Science")
    .when(dataframe_cleaned.UndergradMajor == "Humanities", "Humanities")
    .when(dataframe_cleaned.UndergradMajor == "Business", "Business")
    .otherwise("Other")
)
# drop 'UndergradMajor' column from the predictors
dataframe_cleaned = dataframe_cleaned.drop(dataframe_cleaned.UndergradMajor)
dataframe_cleaned.columns

['Year',
 'Hobbyist',
 'ConvertedComp',
 'Employment',
 'OrgSize',
 'YearsCodePro',
 'Data scientist or machine learning specialist',
 'Database administrator',
 'Data or business analyst',
 'Engineer, data',
 'Region',
 'Target',
 'EducationLevel',
 'Major']

In [21]:
# let's see the actual values in the Major column
dataframe_cleaned.select("Major").distinct().show()

+----------+
|     Major|
+----------+
|   Science|
|Humanities|
|     Other|
|      STEM|
|      Arts|
|  Business|
+----------+



In the below part, we created a new column named "OrganizationSize" in the dataframe based on the values in the original "OrgSize" column using specific mapping rules, it's like a sequential order that we have created in our "OrganizationSize" column. Then we removed the "OrgSize" column from the set of predictors. 

In [22]:
dataframe_cleaned.select("OrgSize").distinct().collect()
dataframe_cleaned = dataframe_cleaned.withColumn(
    "OrganizationSize",
    when(dataframe_cleaned["OrgSize"].isin("2 to 9 employees", "10 to 19 employees"), "11 - 50")
    .when(dataframe_cleaned.OrgSize == "20 to 99 employees", "51 - 200")
    .when(dataframe_cleaned.OrgSize == "100 to 499 employees", "201 - 500")
    .when(dataframe_cleaned.OrgSize == "500 to 999 employees", "500 - 1000")
    .when(dataframe_cleaned.OrgSize.isin("1,000 to 4,999 employees", "5,000 to 9,999 employees"), "1001 - 10.000")
    .when(dataframe_cleaned.OrgSize == "10,000 or more employees", "10.000+")
    .otherwise("Other")
)
# drop 'OrgSize' column from the predictors
dataframe_cleaned = dataframe_cleaned.drop(dataframe_cleaned.OrgSize)
dataframe_cleaned.columns

['Year',
 'Hobbyist',
 'ConvertedComp',
 'Employment',
 'YearsCodePro',
 'Data scientist or machine learning specialist',
 'Database administrator',
 'Data or business analyst',
 'Engineer, data',
 'Region',
 'Target',
 'EducationLevel',
 'Major',
 'OrganizationSize']

In [23]:
print(dataframe_cleaned.count())
print("Region ", dataframe_cleaned.select("Region").distinct().collect())
print("Major ", dataframe_cleaned.select("Major").distinct().collect())
print("OrganizationSize ", dataframe_cleaned.select("OrganizationSize").distinct().collect())

20230
Region  [Row(Region='Europe'), Row(Region='Africa'), Row(Region='Oceania'), Row(Region='not found'), Row(Region='Asia'), Row(Region='America')]
Major  [Row(Major='Science'), Row(Major='Humanities'), Row(Major='Other'), Row(Major='STEM'), Row(Major='Arts'), Row(Major='Business')]
OrganizationSize  [Row(OrganizationSize='1001 - 10.000'), Row(OrganizationSize='201 - 500'), Row(OrganizationSize='Other'), Row(OrganizationSize='11 - 50'), Row(OrganizationSize='51 - 200'), Row(OrganizationSize='10.000+'), Row(OrganizationSize='500 - 1000')]


As demonstrated above, the "Region" column has "not found" as a value, "Major" and "OrganizationSize" columns have "Other" as values. These values mean in an indirect way "null" and will not add value to our analysis journey so we will drop them.

In [24]:
dataframe_cleaned = dataframe_cleaned.filter(dataframe_cleaned.Region != "not found")
dataframe_cleaned = dataframe_cleaned.filter(dataframe_cleaned.Major != "Other")
dataframe_cleaned = dataframe_cleaned.filter(dataframe_cleaned.OrganizationSize != "Other")
dataframe_cleaned.count()

18978

In [25]:
# Check the "Employment" column for distinct values
dataframe_cleaned.select("Employment").distinct().collect()

[Row(Employment='Employed part-time'),
 Row(Employment='Employed full-time'),
 Row(Employment='Independent contractor, freelancer, or self-employed')]

In [26]:
# Check the "Hobbyist" column for distinct values
dataframe_cleaned.select("Hobbyist").distinct().collect()

[Row(Hobbyist='No'), Row(Hobbyist='Yes')]

In [27]:
# Check how many distinct values each variable has - particularly important for non-numeric variables:
unique_counts = {c:dataframe_cleaned.select(c).distinct().count() for c in dataframe_cleaned.columns}
unique_counts

{'Year': 2,
 'Hobbyist': 2,
 'ConvertedComp': 5551,
 'Employment': 3,
 'YearsCodePro': 30,
 'Data scientist or machine learning specialist': 2,
 'Database administrator': 2,
 'Data or business analyst': 2,
 'Engineer, data': 2,
 'Region': 5,
 'Target': 2,
 'EducationLevel': 8,
 'Major': 5,
 'OrganizationSize': 6}

From looking at the different variables of our dataset and the first several records, we can identify that "Hobbyist", "Employment", "Region", "EducationLevel", "Major" and "OrganizationSize" are categorical variables. The rest are numerical or binary.

Now let's look at how many samples fall into each target class:

In [28]:
dataframe_cleaned.groupBy('Target').count().show()
dataframe_cleaned.count()

+------+-----+
|Target|count|
+------+-----+
|     1|12515|
|     0| 6463|
+------+-----+



18978

This means that approximately 65% of employees are satisfied with their jobs and 35% of employees are not satisfied, meaning that our dataset is class imbalanced.

### 1.4 Feature Selection

#### Hypothesis Testing using Chi-square:

As we have illustrated in the EDA phase our dataset consists of numerical and categorical variables, specifically six categorical variables. Usually, machine learning models perform better with numerical variables rather than categorical variables and some cannot deal with categorical variables directly unless preprocessing steps have been applied such as converting them into dummy variables or assigning a probability to each category for each feature. However, before applying any preprocessing steps, we would like to know which categorical features contribute to determining the target label and the Chi-Square test would help us to do this. 

The chi-square test is a statistical test used to determine which categorical features are the most important for predicting the target label by assuming two hypotheses: Null hypothesis and Alternative hypothesis. In our case, the null hypothesis in the chi-square test assumes that there is no association or independence between the categorical predictors and the target label, whereas the alternative hypothesis assumes that there is an association between the categorical variables and the target label. Chi-Square test provides many selection methods to decide which features to choose, We have chosen in our project the *P-value* selection method. The "fpr" parameter passed below to our chi-square test represents the p-value. The *"fpr"* chooses all features whose p-values are below a threshold (0.05), thus controlling the false positive rate of selection. In other words, if the p-value is below the predefined significance level (0.05), then we reject the null hypothesis and accept the alternative hypothesis. This indicates that the results are statistically significant.

Below we will start by encoding our categorical features using StringIndexer() and then combine all these features into a single dense vector column called features using VectorAssembler(). Then we pass this vector column to ChiSqSelector() with fpr value equal to 0.05.

In [29]:
# Define your categorical features
categorical_features = ['OrganizationSize','Region', 'Hobbyist', 'EducationLevel', 'Employment', 'Major']

# Initialize a StringIndexer
stringIndexer = StringIndexer(inputCols=categorical_features, outputCols=["encoded_" + col for col in categorical_features])

# Fit the StringIndexer
model = stringIndexer.fit(dataframe_cleaned)

# Transform the data
dataframe_with_encoded_features = model.transform(dataframe_cleaned)

# Create a VectorAssembler to assemble the features into a DenseVector
encoded_features = ["encoded_OrganizationSize", "encoded_Region", "encoded_Hobbyist", "encoded_EducationLevel", "encoded_Employment", "encoded_Major"]

# the result is in "features" columns
assembler = VectorAssembler(inputCols=encoded_features, outputCol="features")

selector = ChiSqSelector(featuresCol="features", outputCol="selectedFeatures", labelCol="Target", fpr= 0.05)

# Create a pipeline consists of assembler followed by the ChiSqSelector
pipeline = Pipeline(stages=[assembler, selector])

# Fit the pipeline
model = pipeline.fit(dataframe_with_encoded_features)

# Transform the data
result = model.transform(dataframe_with_encoded_features)

dataframe_with_encoded_features[encoded_features].show(n=3, truncate=False)

# The selectedFeatures column now contains the most important features
selected_features = result.select("selectedFeatures")
selected_features.show(n=3, truncate=False)

+------------------------+--------------+----------------+----------------------+------------------+-------------+
|encoded_OrganizationSize|encoded_Region|encoded_Hobbyist|encoded_EducationLevel|encoded_Employment|encoded_Major|
+------------------------+--------------+----------------+----------------------+------------------+-------------+
|0.0                     |3.0           |0.0             |2.0                   |0.0               |0.0          |
|4.0                     |2.0           |0.0             |1.0                   |0.0               |0.0          |
|0.0                     |1.0           |0.0             |1.0                   |0.0               |0.0          |
+------------------------+--------------+----------------+----------------------+------------------+-------------+
only showing top 3 rows

+-------------------------+
|selectedFeatures         |
+-------------------------+
|(6,[1,3],[3.0,2.0])      |
|[4.0,2.0,0.0,1.0,0.0,0.0]|
|(6,[1,3],[1.0,1.0])      |
+-

The "selectedFeatures" column now contains the most important features. Below we will manually map the selected features to the original encoded variables to know which top 3 features we will proceed with.

In [30]:
# Define a dictionary to map selected feature indices to variable names
feature_mapping = {
    0: "encoded_OrganizationSize",
    1: "encoded_Region",
    2: "encoded_Hobbyist",
    3: "encoded_EducationLevel",
    4: "encoded_Employment",
    5: "encoded_Major"
}

# Extract the selected feature indices from the DataFrame
selected_feature_indices = result.select("selectedFeatures").collect()[0][0].toArray()

# Sort the indices based on their values (Chi-Square statistics)
sorted_indices = sorted(range(len(selected_feature_indices)), key=lambda i: selected_feature_indices[i], reverse=True)

# Select the top 3 feature indices
top_3_indices = sorted_indices[:3]

# Map selected feature indices to variable names
selected_features = [feature_mapping[i] for i in top_3_indices]

# Print the selected feature names
print(selected_features)


['encoded_Region', 'encoded_EducationLevel', 'encoded_OrganizationSize']


Based on the Chi-Square test results, the categorical features that have been determined to be the most predictive of the target label are: 'encoded_Region', 'encoded_EducationLevel'and 'encoded_OrganizationSize'. They have the highest Chi-Square statistics among the other encoded categorical features. We will proceed with them in our analysis and drop the rest.

In [31]:
dataframe_with_encoded_features = dataframe_with_encoded_features.drop(dataframe_with_encoded_features.encoded_Hobbyist)
dataframe_with_encoded_features = dataframe_with_encoded_features.drop(dataframe_with_encoded_features.encoded_Employment)
dataframe_with_encoded_features = dataframe_with_encoded_features.drop(dataframe_with_encoded_features.encoded_Major)

In [32]:
# Check the columns and their corresponding values
dataframe_with_encoded_features.show(n=1, vertical=True)

-RECORD 0-----------------------------------------------------------
 Year                                          | 2019               
 Hobbyist                                      | Yes                
 ConvertedComp                                 | 95179.0            
 Employment                                    | Employed full-time 
 YearsCodePro                                  | 4                  
 Data scientist or machine learning specialist | 0                  
 Database administrator                        | 1                  
 Data or business analyst                      | 0                  
 Engineer, data                                | 0                  
 Region                                        | Oceania            
 Target                                        | 1                  
 EducationLevel                                | Undergraduate      
 Major                                         | STEM               
 OrganizationSize                 

### 1.5 Planned Analysis


#### 1.5.1 Pre-processing

We have demonstrated in the EDA section when we used the "group by" query to group our dataset by target, we found that 12515 samples belong to class "1" and 6463 samples belong to class "0" thus we have an imbalanced target variable dataset. Since our class label distribution is skewed, leading to biased results, we balanced our data to have more accurate and precise results from our models. We have many techniques to balance our dataset but the most popular one is *Data Sampling*. Data sampling can be done using oversampling techniques or undersampling techniques or using both together, we have chosen to do oversampling to make use of HDFS and Apache spark characteristics in our big data analysis journey. Oversampling is done by duplicating the number of data points that belong to the minority class, which is in our case class 0, to match the the number of the majority class samples (class 1). 

The second pre-processing step is One-hot encoding, which is a technique used to convert categorical variables to a vector form that our machine-learning model can understand. In other words, each category will be a binary feature indicating whether or not the sample belongs to that category.


The third and the last step in the pre-processing phase is Scaling numerical variables. Now our data is partially ready to be passed to the Spark ML models, as they are presented in numerical values but in different ranges. Some have large values and some have small values so we can say that our data is *heterogeneous* which means one feature can be between 0-1 range and another one can be between 50-100 range. Feeding our Spark ML models with a wide range of values per feature will affect some models' performance so we will scale each feature by letting each one vary roughly within the same range, thus our data will be *homogenous*. Scaling is the process of transforming the data within a specific range, this range is usually between 0 and 1. We will apply the Min-Max Scaling technique, which is most of the time the go-to technique to scale the data. It scales the data within a specified range where the minimum value is 0 and the maximum value is 1, [0,1] is the default range of the MinMaxScaler, these values could be changed but we will leave them as they are.


#### 1.5.2 Machine learning models

We chose to use the Spark ML library (dataframe/dataset-based) rather than the MLlib (RDD-based) library due to its ease of use by providing high levels of APIs, seamlessly integrating with dataframe and SQL API, adding the models to a pipeline - MLlib models cannot be added to a pipeline and many more rich features. 

Spark ML library contains many models for classification, we have used three of them as it is unknown ahead of time which model type may be most suitable ahead of time. We started with Random Forest followed by Naive Bayes and Logistic Regression. We have used L1 regularization in the Logistic Regression model to reduce the risk of overfitting and this is due to its ability to perform in-model feature selection. In other words, L1 regularization excludes the features that contribute little to the model's predictive power by setting the weights of less important features to zero. In addition, it becomes more beneficial after applying one hot encoded, as many binary columns are the result, so it excludes unnecessary binary columns that might not significantly contribute to the model's performance.

We utilized a random search approach for tuning the three models and a hold-out validation technique to validate our models which is the process of splitting the dataset into three subsets: the training set, the validation set, and the testing set. The training set is used to train our machine learning model, the validation set is used for tuning our model's hyperparameters and the test set is to evaluate our model's performance. We set aside 20% of training and validation data for evaluating models' hyperparameters tuning. The holdout validation technique is the best way to consider when you have a large dataset so you can split it into three sets. 

Finally, we trained the best-performing model on the training data using the best set of parameters we reached and then calculated the accuracy on the test dataset.

### 2. Implementation

#### 2.1 Pre-processing:

##### 2.1.1 Balancing Dataset using Oversampling

Below we are oversampling the minority class by calculating an oversampling fraction for the "*0*" target label. This fraction calculates the count of instances for target label '1' divided by an arbitrary factor (2.12), we reached this factor after many trials, and then divided by the count of instances for target label '0'. In the end, an oversampled dataframe is created with additional instances with a target label of '0'. Thus the class label distribution is balanced by increasing the minority class.

In [33]:
counts = dataframe_with_encoded_features.groupBy('Target').count()


# Get the count for the 0 and 1 target labels
count_0 = counts.filter(counts['Target'] == 0).select('count').collect()[0][0]
count_1 = counts.filter(counts['Target'] == 1).select('count').collect()[0][0]
print("count_0 ", count_0)
print("count_1 ", count_1)

# Calculate the oversampling fraction for the 0 target label
oversampling_fraction_0 = (count_1/2.12) / count_0

# Create a DataFrame with additional rows for the 0 target label
oversampled_df = dataframe_with_encoded_features.unionAll(
    dataframe_with_encoded_features.filter(dataframe_with_encoded_features['Target'] == 0)
    .sample(True, oversampling_fraction_0 , seed=200)
    .withColumn('Target', lit(0))
)

# We can now see our classes are approximately balanced:
oversampled_df.groupBy('Target').count().show()
# new dataset size:
print("new dataset size: ",oversampled_df.count())
# Show the oversampled DataFrame
oversampled_df.show(n=1,vertical=True)

count_0  6463
count_1  12515
+------+-----+
|Target|count|
+------+-----+
|     1|12515|
|     0|12500|
+------+-----+

new dataset size:  25015
-RECORD 0-----------------------------------------------------------
 Year                                          | 2019               
 Hobbyist                                      | Yes                
 ConvertedComp                                 | 95179.0            
 Employment                                    | Employed full-time 
 YearsCodePro                                  | 4                  
 Data scientist or machine learning specialist | 0                  
 Database administrator                        | 1                  
 Data or business analyst                      | 0                  
 Engineer, data                                | 0                  
 Region                                        | Oceania            
 Target                                        | 1                  
 EducationLevel            

#### 2.1.2 Feature Engineering:

In this section, we will apply one-hot encoding on categorical features and scale the numerical features by iterating through each column that requires processing within a for loop and running a pipeline that consists of the required stages.


#### 2.1.2.1 One-hot Encoding for Categorical Variables

The categorical features resulting from Chi-Square statistical test were: 'encoded_Region', 'encoded_EducationLevel', and 'encoded_OrganizationSize'. We can consider 'encoded_EducationLevel' as an ordinal variable since no one, for instance, can have a Bachelor's degree without completing high school. Also, we can consider the 'encoded_OrganizationSize' an ordinal variable since an organization with size '11 - 50' is smaller than an organization with size '51 - 200', so applying one hot encoding on ordinal variables would not make sense. Therefore, we will apply one-hot encoding on the 'encoded_Region' variable only.

In [34]:
# Define the list of columns to be one-hot encoded
columns_to_encode = ["Region"]   

# Create empty lists to hold the StringIndexers and Encoders
string_indexers = []
encoders = []

# Loop through each column and create a StringIndexer and an Encoder for it
for col in columns_to_encode:
    indexer = StringIndexer(inputCol=col, outputCol=col + "_index")
    encoder = OneHotEncoder(inputCol=col + "_index", outputCol=col + "_vector")
    string_indexers.append(indexer)
    encoders.append(encoder)

# Create a pipeline with StringIndexers and Encoders
pipeline = Pipeline(stages=string_indexers + encoders)

# Fit and transform the data using the pipeline
model = pipeline.fit(oversampled_df)
encoded_df = model.transform(oversampled_df)

# Show the resulting DataFrame
encoded_df.show(n=2,vertical=True)


-RECORD 0-----------------------------------------------------------
 Year                                          | 2019               
 Hobbyist                                      | Yes                
 ConvertedComp                                 | 95179.0            
 Employment                                    | Employed full-time 
 YearsCodePro                                  | 4                  
 Data scientist or machine learning specialist | 0                  
 Database administrator                        | 1                  
 Data or business analyst                      | 0                  
 Engineer, data                                | 0                  
 Region                                        | Oceania            
 Target                                        | 1                  
 EducationLevel                                | Undergraduate      
 Major                                         | STEM               
 OrganizationSize                 

Now we have in our dataset 'encoded_Region' and 'Region_index' where both of them represent the index of the 'Region' category so we will drop one of them.

In [35]:
columns_to_drop = ["encoded_Region"]
encoded_df=encoded_df.drop(*columns_to_drop)
encoded_df.printSchema()

root
 |-- Year: integer (nullable = true)
 |-- Hobbyist: string (nullable = true)
 |-- ConvertedComp: double (nullable = true)
 |-- Employment: string (nullable = false)
 |-- YearsCodePro: integer (nullable = true)
 |-- Data scientist or machine learning specialist: integer (nullable = true)
 |-- Database administrator: integer (nullable = true)
 |-- Data or business analyst: integer (nullable = true)
 |-- Engineer, data: integer (nullable = true)
 |-- Region: string (nullable = true)
 |-- Target: integer (nullable = false)
 |-- EducationLevel: string (nullable = false)
 |-- Major: string (nullable = false)
 |-- OrganizationSize: string (nullable = false)
 |-- encoded_EducationLevel: double (nullable = false)
 |-- encoded_OrganizationSize: double (nullable = false)
 |-- Region_index: double (nullable = false)
 |-- Region_vector: vector (nullable = true)



#### 2.1.2.2 Scaling Numerical Variables

There are only four numerical variables that have a different range of values: "ConvertedComp", "YearsCodePro", "encoded_EducationLevel", "encoded_OrganizationSize".

In [36]:
# List of columns to scale
columns_to_scale = ["ConvertedComp", "YearsCodePro", "encoded_EducationLevel", "encoded_OrganizationSize"]

assemblers = []
scalers = []

for col in columns_to_scale:
    # Assemble the features into a vector
    assembler = VectorAssembler(inputCols=[col], outputCol=col + "_vector")
    scaler = MinMaxScaler(inputCol=col + "_vector", outputCol=col + "_scaled")
    assemblers.append(assembler)
    scalers.append(scaler)

# Create pipeline
scaling_pipeline = Pipeline(stages=assemblers + scalers)

# Fit and transform the data using the pipeline
model = scaling_pipeline.fit(encoded_df)
scaled_df = model.transform(encoded_df)

# Show the resulting DataFrame
scaled_df.show(n=1, vertical=True, truncate=False)

-RECORD 0--------------------------------------------------------------
 Year                                          | 2019                  
 Hobbyist                                      | Yes                   
 ConvertedComp                                 | 95179.0               
 Employment                                    | Employed full-time    
 YearsCodePro                                  | 4                     
 Data scientist or machine learning specialist | 0                     
 Database administrator                        | 1                     
 Data or business analyst                      | 0                     
 Engineer, data                                | 0                     
 Region                                        | Oceania               
 Target                                        | 1                     
 EducationLevel                                | Undergraduate         
 Major                                         | STEM           

Now we are going to proceed only with the following variables: one-hot encoded, scaled, our target label, and the binary variables.

In [37]:
columns_to_drop = ["Year", "Hobbyist", "ConvertedComp", "Employment", "YearsCodePro", "Region", "EducationLevel", 
                   "Major", "OrganizationSize", "encoded_EducationLevel", "encoded_EducationLevel_vector", "encoded_Region",
                   "encoded_OrganizationSize_vector", "encoded_OrganizationSize", "Region_index","ConvertedComp_vector", 
                   "YearsCodePro_vector"]

spark_df = scaled_df.drop(*columns_to_drop)
spark_df.show(n=1, vertical=True, truncate=False)

-RECORD 0--------------------------------------------------------------
 Data scientist or machine learning specialist | 0                     
 Database administrator                        | 1                     
 Data or business analyst                      | 0                     
 Engineer, data                                | 0                     
 Target                                        | 1                     
 Region_vector                                 | (4,[3],[1.0])         
 ConvertedComp_scaled                          | [0.31900923396624825] 
 YearsCodePro_scaled                           | [0.10344827586206896] 
 encoded_EducationLevel_scaled                 | [0.2857142857142857]  
 encoded_OrganizationSize_scaled               | [0.0]                 
only showing top 1 row



At this stage our dataset is ready to split into train, validation, and test sets, however, we still need to modify our features to make more accurate predictions and this is done by assembling individual variables into a single feature vector, which is done by using Spark’s VectorAssembler. The VectorAssembler takes a list of columns as an input and combines them into a single vector column.

In [38]:
# Create a VectorAssembler to be passed to our models
feature_cols = [c for c in spark_df.columns if c != 'Target']
assembler = VectorAssembler(inputCols=feature_cols , outputCol="features")

# Use transform to assemble the columns
assembled_df = assembler.transform(spark_df)  

selected_columns = ["features", "Target"]
assembled_df = assembled_df.select(selected_columns)

# Show the resulting DataFrame
assembled_df.show(n=3, vertical=True, truncate=False)

-RECORD 0-----------------------------------------------------------------------------------------------------
 features | (12,[1,7,8,9,10],[1.0,1.0,0.31900923396624825,0.10344827586206896,0.2857142857142857])            
 Target   | 1                                                                                                 
-RECORD 1-----------------------------------------------------------------------------------------------------
 features | [1.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.044550954399959784,0.3103448275862069,0.14285714285714285,0.8] 
 Target   | 0                                                                                                 
-RECORD 2-----------------------------------------------------------------------------------------------------
 features | [0.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,0.30165071810427174,0.24137931034482757,0.14285714285714285,0.0] 
 Target   | 1                                                                                                 
o

### 2.2 Predictive Models:
 
#### 2.2.1 Holdout Validation:

In this section, we will start by splitting the above results into three sets: training set, validation set, and testing set. The holdout validation technique is the best way to consider when you have a large dataset so you can split it into three sets: the training set represents 80% of the whole dataset, the validation set represents 20% of the training set, and the testing set represents 20% of the dataset. 

In [39]:
# holdout validation
(train, test) = assembled_df.randomSplit([0.8, 0.2], seed = 5)
(train_df, val_df) = train.randomSplit([0.8, 0.2], seed = 5)

# Show a sample of the training set
train.show(n=1, vertical=True, truncate=False)

# Show the total number of samples for each class label per set
train_df.groupBy('Target').count().show()
val_df.groupBy('Target').count().show()
test.groupBy('Target').count().show()

# Show total number of samples per set
print(train_df.count())
print(val_df.count())
print(test.count())

-RECORD 0-----------------------------------------------------------------
 features | (12,[0,1,2,3,5,8],[1.0,1.0,1.0,1.0,1.0,0.006700072061805567]) 
 Target   | 1                                                             
only showing top 1 row

+------+-----+
|Target|count|
+------+-----+
|     1| 7988|
|     0| 8140|
+------+-----+

+------+-----+
|Target|count|
+------+-----+
|     1| 2006|
|     0| 1929|
+------+-----+

+------+-----+
|Target|count|
+------+-----+
|     1| 2521|
|     0| 2431|
+------+-----+

16128
3935
4952


#### 2.2.2 Random Forest:


We chose Random Forests because it does internal feature selection so that would emphasize to us that the features we have selected are the most contributing features to the class label. Random Forests is considered a supervised learning technique. One of its biggest advantages is that it can be used for regression and classification as well.

Random Forests works by comprising the maximum number of decision trees so it ends with a strong and accurate forest. Random Forests are considered an ensemble method that is based on bagging or bootstrapping + aggregating. How it internally works is that it builds multiple decision trees on bootstrapped subsamples of the data. Afterward, a subset of the columns also is chosen in each tree. The tree is built and the response variable with the highest decision with the most votes is chosen. By aggregating or choosing the number of columns arbitrary feature selection comes in included implicitly. What makes the random forest classifier our first chosen algorithm, regardless of the pre-processing steps made on the dataset, to detect whether this employee is satisfied or not is the following:

* Random forest produces the output with high accuracy because it does not depend on the output of one decision tree only but uses the voting technique, meaning most class labels dominant.

* Random forest avoids overfitting, if one decision tree is biased because of missing values or outliers, the other decision tree will balance this by ensuring the diversity within the samples so that each tree is different from the other in terms of features and data due to the bagging method thus the trees are not dependent on each other hence less prone to overfitting

* Random forest can handle missing values by substituting these null values with the mean or the median value per feature so this corrects the bias problem

* Random Forest provides feature selection which shows the features that mostly contribute to the classification/prediction by calculating the importance score of the feature and the features with low importance could be removed to improve the overall model performance. And this step can be made in the future to emphasize our chi-square test results. 

* Moreover, random forests can handle huge amounts of attributes with high accuracy and with minimum bias.

In [57]:
start1 = time.time()

rf_tuning1 = pd.DataFrame(columns=['numTrees', 'maxDepth', 'maxBins', 'TP', 'TN', 'FP', 'FN'])

for i in range(0, 4):

    # Randomly select values for hyperparameters from a range
    trees1 = random.randrange(5, 201, 5)
    depth1 = random.randrange(2, 21, 2)
    bins1 = random.randrange(10, 61, 5)

    # Create a RandomForestClassifier instance
    randomForestClassifier1 = RandomForestClassifier(labelCol="Target", featuresCol="features",
                                                     numTrees=trees1, maxDepth=depth1, maxBins=bins1)

    randomForestModel1 = randomForestClassifier1.fit(train_df)

    # // run model with validation data set to get predictions
    predictions1 = randomForestModel1.transform(val_df)
    predictions1.select("Target", "prediction").show()
    
end1 = time.time()

+------+----------+
|Target|prediction|
+------+----------+
|     1|       0.0|
|     1|       1.0|
|     0|       0.0|
|     1|       0.0|
|     1|       1.0|
|     1|       0.0|
|     1|       0.0|
|     0|       0.0|
|     0|       0.0|
|     0|       0.0|
|     1|       0.0|
|     1|       1.0|
|     0|       1.0|
|     0|       0.0|
|     0|       0.0|
|     0|       1.0|
|     1|       1.0|
|     1|       1.0|
|     1|       0.0|
|     1|       1.0|
+------+----------+
only showing top 20 rows

+------+----------+
|Target|prediction|
+------+----------+
|     1|       0.0|
|     1|       1.0|
|     0|       0.0|
|     1|       1.0|
|     1|       1.0|
|     1|       1.0|
|     1|       0.0|
|     0|       0.0|
|     0|       1.0|
|     0|       0.0|
|     1|       1.0|
|     1|       1.0|
|     0|       0.0|
|     0|       0.0|
|     0|       0.0|
|     0|       1.0|
|     1|       1.0|
|     1|       1.0|
|     1|       1.0|
|     1|       0.0|
+------+----------+
only showing t

In [58]:
evaluator1 = MulticlassClassificationEvaluator(labelCol="Target", predictionCol="prediction")
accuracy1 = evaluator1.evaluate(predictions1)
print("Accuracy = %s" % (accuracy1))
print("Validation Error = %s" % (1.0 - accuracy1))

Accuracy = 0.6348490117984062
Validation Error = 0.36515098820159375


In [59]:
print('Time taken (in minutes) for random forest classifier tuning:', (end1-start1)/60)

Time taken (in minutes) for random forest classifier tuning: 7.8935590545336405


Trying different values for our hyperparameters while doing random search to tune our model.

In [63]:
start2 = time.time()

rf_tuning2 = pd.DataFrame(columns=['numTrees', 'maxDepth', 'maxBins', 'TP', 'TN', 'FP', 'FN'])

for i in range(0, 5):

    # Randomly select values for hyperparameters from a range
    trees2 = random.randrange(5, 101, 5)
    depth2 = random.randrange(2, 21, 2)
    bins2 = random.randrange(10, 61, 5)

    # Create a RandomForestClassifier instance
    randomForestClassifier2 = RandomForestClassifier(labelCol="Target",
    featuresCol="features", numTrees=trees2, maxDepth=depth2, maxBins=bins2)

    randomForestModel2 = randomForestClassifier2.fit(train_df)

    # // run model with test data set to get predictions
    predictions2 = randomForestModel2.transform(val_df)
    predictions2.select("Target", "prediction").show()

end2 = time.time()

+------+----------+
|Target|prediction|
+------+----------+
|     1|       0.0|
|     1|       1.0|
|     0|       0.0|
|     1|       1.0|
|     1|       1.0|
|     1|       0.0|
|     1|       0.0|
|     0|       0.0|
|     0|       0.0|
|     0|       0.0|
|     1|       1.0|
|     1|       1.0|
|     0|       1.0|
|     0|       1.0|
|     0|       1.0|
|     0|       0.0|
|     1|       0.0|
|     1|       1.0|
|     1|       1.0|
|     1|       1.0|
+------+----------+
only showing top 20 rows

+------+----------+
|Target|prediction|
+------+----------+
|     1|       0.0|
|     1|       1.0|
|     0|       0.0|
|     1|       1.0|
|     1|       1.0|
|     1|       1.0|
|     1|       0.0|
|     0|       0.0|
|     0|       0.0|
|     0|       0.0|
|     1|       1.0|
|     1|       1.0|
|     0|       1.0|
|     0|       0.0|
|     0|       0.0|
|     0|       0.0|
|     1|       1.0|
|     1|       1.0|
|     1|       1.0|
|     1|       1.0|
+------+----------+
only showing t

In [64]:
evaluator2 = MulticlassClassificationEvaluator(labelCol="Target", predictionCol="prediction")
accuracy2 = evaluator2.evaluate(predictions2)
print("Accuracy = %s" % (accuracy2))
print("Validation Error = %s" % (1.0 - accuracy2))

Accuracy = 0.6823247426479375
Validation Error = 0.3176752573520625


In [65]:
print('Time taken (in minutes) for random forest classifier tuning:', (end2-start2)/60)

Time taken (in minutes) for random forest classifier tuning: 1.3447409868240356


In [66]:
# Another trial
start3 = time.time()

rf_tuning3 = pd.DataFrame(columns=['numTrees', 'maxDepth', 'maxBins', 'TP', 'TN', 'FP', 'FN'])

for i in range(0, 1):

    # Randomly select values for hyperparameters from a range
    trees3 = random.randrange(5, 61, 5)
    depth3 = random.randrange(2, 21, 2)
    bins3 = random.randrange(10, 61, 5)

    # Create a RandomForestClassifier instance
    randomForestClassifier3 = RandomForestClassifier(labelCol="Target", featuresCol="features", numTrees=trees3,
                                                     maxDepth=depth3, maxBins=bins3)

    randomForestModel3 = randomForestClassifier3.fit(train_df)

    # // run model with test data set to get predictions
    predictions3 = randomForestModel3.transform(val_df)
    predictions3.select("Target", "prediction").show()

end3 = time.time()

+------+----------+
|Target|prediction|
+------+----------+
|     1|       1.0|
|     1|       1.0|
|     0|       0.0|
|     1|       1.0|
|     1|       1.0|
|     1|       1.0|
|     1|       1.0|
|     0|       0.0|
|     0|       1.0|
|     0|       0.0|
|     1|       1.0|
|     1|       1.0|
|     0|       0.0|
|     0|       1.0|
|     0|       0.0|
|     0|       0.0|
|     1|       1.0|
|     1|       1.0|
|     1|       1.0|
|     1|       0.0|
+------+----------+
only showing top 20 rows



In [67]:
print('Time taken (in minutes) for random forest classifier tuning:', (end3-start3)/60)

Time taken (in minutes) for random forest classifier tuning: 0.707856031258901


In [68]:
evaluator3 = MulticlassClassificationEvaluator(labelCol="Target", predictionCol="prediction")
accuracy3 = evaluator3.evaluate(predictions3)
print("Accuracy = %s" % (accuracy3))
print("Validation Error = %s" % (1.0 - accuracy3))

Accuracy = 0.6999355515792745
Validation Error = 0.30006444842072555


We can consider the last trial as the best resulting model since it has the highest accuracy and the lowest time taken to run.

#### 2.2.3 Naive Bayes:

The Naïve Bayes classifier is a conditional probabilistic classifier that assumes that all our dataset features are independent of each other and it's called *naïve* because this is a naïve assumption and not applied to real-life scenarios, nevertheless surprisingly it performs well and leads to convincing results. It's a probabilistic classifier because it's based on the Bayes theorem that calculates the posterior probability and prior probability while building the frequency and likelihood tables.

In Naïves Bayes algorithm, we compute the frequency and the probability tables for each attribute per class label then substitute these probability values in the Naïve Bayesian equation to calculate the posterior probability per class label ( **P(y|X)** ) and the class label with the highest posterior value is considered as the predicted value.

In [132]:
nb_start = time.time()

# Create a NaiveBayes classifier
nb = NaiveBayes(labelCol="Target", featuresCol="features")

# Fit the model to the training data
nb_model = nb.fit(train_df)

# Make predictions on the test data
nb_predictions = nb_model.transform(val_df)
nb_predictions.select("Target", "prediction").show()

nb_end = time.time()

+------+----------+
|Target|prediction|
+------+----------+
|     1|       1.0|
|     0|       1.0|
|     1|       1.0|
|     1|       1.0|
|     1|       1.0|
|     1|       1.0|
|     1|       1.0|
|     0|       1.0|
|     1|       1.0|
|     0|       1.0|
|     1|       1.0|
|     0|       1.0|
|     1|       1.0|
|     1|       1.0|
|     1|       1.0|
|     1|       1.0|
|     0|       1.0|
|     1|       1.0|
|     1|       1.0|
|     1|       1.0|
+------+----------+
only showing top 20 rows



In [133]:
print('Time taken (in minutes) to fit Naive Bayes model and make predictions:', (nb_end-nb_start)/60)

Time taken (in minutes) to fit Naive Bayes model and make predictions: 0.04904579718907674


In [71]:
nb_evaluator = MulticlassClassificationEvaluator(labelCol="Target", predictionCol="prediction")
nb_accuracy = nb_evaluator.evaluate(nb_predictions)
print("Accuracy = %s" % (nb_accuracy))
print("Validation Error = %s" % (1.0 - nb_accuracy))


Accuracy = 0.5532100114427616
Validation Error = 0.4467899885572384


#### 2.2.4 Logistic Regression:

Since our main target is to predict whether the employee is satisfied or not thus we used the Logistic regression model to help us in this prediction.

Logistic Regression is the modeling approach that can be used to describe the relationship between several input variables/predictors and a class label which is either 0 or 1. In more detail, logistic regression predicts the probability of an outcome that can only have two values and then assigns to it a class label. If the probability is less than 0.5 then class 0 is assigned else class 1 is assigned.

In [75]:
lr_start = time.time()

lr_tuning = pd.DataFrame(columns=['regParam', 'iterations', 'TP', 'TN', 'FP', 'FN'])

for i in range(0, 5):
    
    # Create a LogisticRegression classifier
    reg = random.uniform(0, 0.05)
    iters = random.randrange(10, 101, 5)

    # Create a LogisticRegression instance with the selected hyperparameters
    lr = LogisticRegression(labelCol="Target", featuresCol="features", regParam=reg, maxIter=iters, elasticNetParam=1.0,
                            family="auto")
    
    # Fit the model to the training data
    lr_model = lr.fit(train_df)
    
    # Make predictions on the validation data
    lr_predictions = lr_model.transform(val_df)
    lr_predictions.select("Target", "prediction").show()
    
lr_end = time.time()

+------+----------+
|Target|prediction|
+------+----------+
|     1|       0.0|
|     1|       1.0|
|     0|       0.0|
|     1|       0.0|
|     1|       0.0|
|     1|       0.0|
|     1|       0.0|
|     0|       0.0|
|     0|       0.0|
|     0|       0.0|
|     1|       0.0|
|     1|       0.0|
|     0|       0.0|
|     0|       0.0|
|     0|       0.0|
|     0|       0.0|
|     1|       0.0|
|     1|       0.0|
|     1|       0.0|
|     1|       0.0|
+------+----------+
only showing top 20 rows



  lr_tuning = lr_tuning.append({'regParam': reg, 'iterations': iters,


+------+----------+
|Target|prediction|
+------+----------+
|     1|       0.0|
|     1|       1.0|
|     0|       0.0|
|     1|       0.0|
|     1|       1.0|
|     1|       0.0|
|     1|       0.0|
|     0|       0.0|
|     0|       0.0|
|     0|       0.0|
|     1|       0.0|
|     1|       0.0|
|     0|       0.0|
|     0|       0.0|
|     0|       0.0|
|     0|       0.0|
|     1|       0.0|
|     1|       1.0|
|     1|       0.0|
|     1|       1.0|
+------+----------+
only showing top 20 rows



  lr_tuning = lr_tuning.append({'regParam': reg, 'iterations': iters,


+------+----------+
|Target|prediction|
+------+----------+
|     1|       0.0|
|     1|       1.0|
|     0|       0.0|
|     1|       0.0|
|     1|       1.0|
|     1|       0.0|
|     1|       0.0|
|     0|       0.0|
|     0|       0.0|
|     0|       0.0|
|     1|       0.0|
|     1|       0.0|
|     0|       0.0|
|     0|       0.0|
|     0|       0.0|
|     0|       0.0|
|     1|       0.0|
|     1|       1.0|
|     1|       0.0|
|     1|       1.0|
+------+----------+
only showing top 20 rows



  lr_tuning = lr_tuning.append({'regParam': reg, 'iterations': iters,


+------+----------+
|Target|prediction|
+------+----------+
|     1|       0.0|
|     1|       1.0|
|     0|       0.0|
|     1|       1.0|
|     1|       1.0|
|     1|       0.0|
|     1|       0.0|
|     0|       0.0|
|     0|       0.0|
|     0|       1.0|
|     1|       1.0|
|     1|       1.0|
|     0|       1.0|
|     0|       0.0|
|     0|       0.0|
|     0|       0.0|
|     1|       1.0|
|     1|       1.0|
|     1|       0.0|
|     1|       1.0|
+------+----------+
only showing top 20 rows



  lr_tuning = lr_tuning.append({'regParam': reg, 'iterations': iters,


+------+----------+
|Target|prediction|
+------+----------+
|     1|       0.0|
|     1|       1.0|
|     0|       0.0|
|     1|       1.0|
|     1|       1.0|
|     1|       0.0|
|     1|       0.0|
|     0|       0.0|
|     0|       0.0|
|     0|       1.0|
|     1|       1.0|
|     1|       1.0|
|     0|       1.0|
|     0|       0.0|
|     0|       0.0|
|     0|       0.0|
|     1|       1.0|
|     1|       1.0|
|     1|       0.0|
|     1|       1.0|
+------+----------+
only showing top 20 rows



  lr_tuning = lr_tuning.append({'regParam': reg, 'iterations': iters,


In [76]:
print('Time taken (in minutes) to fit Logistic Regression model and make predictions:', (lr_end-lr_start)/60)

Time taken (in minutes) to fit Logistic Regression model and make predictions: 0.7262279152870178


In [77]:
# Define the MulticlassClassificationEvaluator
lr_evaluator = MulticlassClassificationEvaluator(labelCol="Target", predictionCol="prediction")

# Calculate the accuracy
lr_accuracy = lr_evaluator.evaluate(lr_predictions)

print("Accuracy = %s" % (lr_accuracy))
print("Validation Error = %s" % (1.0 - lr_accuracy))


Accuracy = 0.5632500926110942
Validation Error = 0.43674990738890584



#### 2.2.5 Considering All The Categorical Features:

As the accuracy of models so far has been limited, we decided to see if the accuracy of models could be improved by considering all the categorical variables during our analysis, not only the features resulting from the chi-square test.

In [78]:
# Define your categorical features
categorical_features = ['OrganizationSize','Region', 'Hobbyist', 'EducationLevel', 'Employment', 'Major']

# Initialize a StringIndexer
stringIndexer = StringIndexer(inputCols=categorical_features, outputCols=["encoded_" + col for col in categorical_features])

# Fit the StringIndexer
model = stringIndexer.fit(dataframe_cleaned)

# Transform the data
dataframe_with_encoded_features = model.transform(dataframe_cleaned)

# Create a VectorAssembler to assemble the features into a DenseVector
encoded_features = ["encoded_OrganizationSize", "encoded_Region", "encoded_Hobbyist", "encoded_EducationLevel", "encoded_Employment", "encoded_Major"]

# the result is in "features" columns
assembler = VectorAssembler(inputCols=encoded_features, outputCol="features")

selector = ChiSqSelector(featuresCol="features", outputCol="selectedFeatures", labelCol="Target", fpr= 0.05)

# Create a pipeline consists of assembler followed by the ChiSqSelector
pipeline = Pipeline(stages=[assembler, selector])

# Fit the pipeline
model = pipeline.fit(dataframe_with_encoded_features)

# Transform the data
result = model.transform(dataframe_with_encoded_features)

dataframe_with_encoded_features[encoded_features].show(n=3, truncate=False)

# The selectedFeatures column now contains the most important features
selected_features = result.select("selectedFeatures")
selected_features.show(n=3, truncate=False)

+------------------------+--------------+----------------+----------------------+------------------+-------------+
|encoded_OrganizationSize|encoded_Region|encoded_Hobbyist|encoded_EducationLevel|encoded_Employment|encoded_Major|
+------------------------+--------------+----------------+----------------------+------------------+-------------+
|0.0                     |3.0           |0.0             |2.0                   |0.0               |0.0          |
|4.0                     |2.0           |0.0             |1.0                   |0.0               |0.0          |
|0.0                     |1.0           |0.0             |1.0                   |0.0               |0.0          |
+------------------------+--------------+----------------+----------------------+------------------+-------------+
only showing top 3 rows

+-------------------------+
|selectedFeatures         |
+-------------------------+
|(6,[1,3],[3.0,2.0])      |
|[4.0,2.0,0.0,1.0,0.0,0.0]|
|(6,[1,3],[1.0,1.0])      |
+-

In [79]:
# Define a dictionary to map selected feature indices to variable names
feature_mapping = {
    0: "encoded_OrganizationSize",
    1: "encoded_Region",
    2: "encoded_Hobbyist",
    3: "encoded_EducationLevel",
    4: "encoded_Employment",
    5: "encoded_Major"
}

# Extract the selected feature indices from the DataFrame
selected_feature_indices = result.select("selectedFeatures").collect()[0][0].toArray()

# Sort the indices based on their values (Chi-Square statistics)
sorted_indices = sorted(range(len(selected_feature_indices)), key=lambda i: selected_feature_indices[i], reverse=True)

# Select the top 3 feature indices
top_3_indices = sorted_indices[:]

# Map selected feature indices to variable names
selected_features = [feature_mapping[i] for i in top_3_indices]

# Print the selected feature names
print(selected_features)

['encoded_Region', 'encoded_EducationLevel', 'encoded_OrganizationSize', 'encoded_Hobbyist', 'encoded_Employment', 'encoded_Major']


In [80]:
# Check the columns and their corresponding values
dataframe_with_encoded_features.show(n=1, vertical=True)

-RECORD 0-----------------------------------------------------------
 Year                                          | 2019               
 Hobbyist                                      | Yes                
 ConvertedComp                                 | 95179.0            
 Employment                                    | Employed full-time 
 YearsCodePro                                  | 4                  
 Data scientist or machine learning specialist | 0                  
 Database administrator                        | 1                  
 Data or business analyst                      | 0                  
 Engineer, data                                | 0                  
 Region                                        | Oceania            
 Target                                        | 1                  
 EducationLevel                                | Undergraduate      
 Major                                         | STEM               
 OrganizationSize                 

In [81]:
counts = dataframe_with_encoded_features.groupBy('Target').count()


# Get the count for the 0 and 1 target labels
count_0 = counts.filter(counts['Target'] == 0).select('count').collect()[0][0]
count_1 = counts.filter(counts['Target'] == 1).select('count').collect()[0][0]
print("count_0 ", count_0)
print("count_1 ", count_1)

# Calculate the oversampling fraction for the 0 target label
oversampling_fraction_0 = (count_1/2.12) / count_0

# Create a DataFrame with additional rows for the 0 target label
oversampled_df = dataframe_with_encoded_features.unionAll(
    dataframe_with_encoded_features.filter(dataframe_with_encoded_features['Target'] == 0)
    .sample(True, oversampling_fraction_0 , seed=200)
    .withColumn('Target', lit(0))
)

# We can now see our classes are approximately balanced:
oversampled_df.groupBy('Target').count().show()
# new dataset size:
print("new dataset size: ",oversampled_df.count())
# Show the oversampled DataFrame
oversampled_df.show(n=1,vertical=True)

count_0  6463
count_1  12515
+------+-----+
|Target|count|
+------+-----+
|     1|12515|
|     0|12500|
+------+-----+

new dataset size:  25015
-RECORD 0-----------------------------------------------------------
 Year                                          | 2019               
 Hobbyist                                      | Yes                
 ConvertedComp                                 | 95179.0            
 Employment                                    | Employed full-time 
 YearsCodePro                                  | 4                  
 Data scientist or machine learning specialist | 0                  
 Database administrator                        | 1                  
 Data or business analyst                      | 0                  
 Engineer, data                                | 0                  
 Region                                        | Oceania            
 Target                                        | 1                  
 EducationLevel            

No need to encode the "Hobbyist" feature as it's a binary variable. Moreover, as we mentioned above, we can consider the "EducationLevel" and "OrganizationSize" ordinal variables so encoding them does not make sense. Therefore, we will encode only the features that have multiple categories which are: "Region", "Major", and "Employment". 

In [82]:
# Define the list of columns to be one-hot encoded
columns_to_encode = ["Region", "Major", "Employment"]   

# Create empty lists to hold the StringIndexers and Encoders
string_indexers = []
encoders = []

# Loop through each column and create a StringIndexer and an Encoder for it
for col in columns_to_encode:
    indexer = StringIndexer(inputCol=col, outputCol=col + "_index")
    encoder = OneHotEncoder(inputCol=col + "_index", outputCol=col + "_vector")
    string_indexers.append(indexer)
    encoders.append(encoder)

# Create a pipeline with StringIndexers and Encoders
pipeline = Pipeline(stages=string_indexers + encoders)

# Fit and transform the data using the pipeline
model = pipeline.fit(oversampled_df)
encoded_df = model.transform(oversampled_df)

# Show the resulting DataFrame
encoded_df.show(n=2,vertical=True)

-RECORD 0-----------------------------------------------------------
 Year                                          | 2019               
 Hobbyist                                      | Yes                
 ConvertedComp                                 | 95179.0            
 Employment                                    | Employed full-time 
 YearsCodePro                                  | 4                  
 Data scientist or machine learning specialist | 0                  
 Database administrator                        | 1                  
 Data or business analyst                      | 0                  
 Engineer, data                                | 0                  
 Region                                        | Oceania            
 Target                                        | 1                  
 EducationLevel                                | Undergraduate      
 Major                                         | STEM               
 OrganizationSize                 

In [83]:
# List of columns to scale
columns_to_scale = ["ConvertedComp", "YearsCodePro", "encoded_EducationLevel", "encoded_OrganizationSize"]

assemblers = []
scalers = []

for col in columns_to_scale:
    # Assemble the features into a vector
    assembler = VectorAssembler(inputCols=[col], outputCol=col + "_vector")
    scaler = MinMaxScaler(inputCol=col + "_vector", outputCol=col + "_scaled")
    assemblers.append(assembler)
    scalers.append(scaler)

# Create pipeline
scaling_pipeline = Pipeline(stages=assemblers + scalers)

# Fit and transform the data using the pipeline
model = scaling_pipeline.fit(encoded_df)
scaled_df = model.transform(encoded_df)

# Show the resulting DataFrame
scaled_df.show(n=1, vertical=True, truncate=False)

-RECORD 0--------------------------------------------------------------
 Year                                          | 2019                  
 Hobbyist                                      | Yes                   
 ConvertedComp                                 | 95179.0               
 Employment                                    | Employed full-time    
 YearsCodePro                                  | 4                     
 Data scientist or machine learning specialist | 0                     
 Database administrator                        | 1                     
 Data or business analyst                      | 0                     
 Engineer, data                                | 0                     
 Region                                        | Oceania               
 Target                                        | 1                     
 EducationLevel                                | Undergraduate         
 Major                                         | STEM           

In [84]:
columns_to_drop = ["Year", "Hobbyist", "ConvertedComp", "Employment", "YearsCodePro", "Region", "EducationLevel", 
                   "Major", "OrganizationSize", "encoded_EducationLevel", "encoded_EducationLevel_vector", "encoded_Region",
                   "encoded_OrganizationSize_vector", "encoded_OrganizationSize", "Region_index","ConvertedComp_vector", 
                   "YearsCodePro_vector", "Employment_index", "Major_index", "encoded_Employment", "encoded_Major"]

spark_df = scaled_df.drop(*columns_to_drop)
spark_df.show(n=1, vertical=True, truncate=False)

-RECORD 0--------------------------------------------------------------
 Data scientist or machine learning specialist | 0                     
 Database administrator                        | 1                     
 Data or business analyst                      | 0                     
 Engineer, data                                | 0                     
 Target                                        | 1                     
 encoded_Hobbyist                              | 0.0                   
 Region_vector                                 | (4,[3],[1.0])         
 Major_vector                                  | (4,[0],[1.0])         
 Employment_vector                             | (2,[0],[1.0])         
 ConvertedComp_scaled                          | [0.31900923396624825] 
 YearsCodePro_scaled                           | [0.10344827586206896] 
 encoded_EducationLevel_scaled                 | [0.2857142857142857]  
 encoded_OrganizationSize_scaled               | [0.0]          

In [85]:
# Create a VectorAssembler to be passed to our models
feature_cols = [c for c in spark_df.columns if c != 'Target']
assembler = VectorAssembler(inputCols=feature_cols , outputCol="features")

# Use transform to assemble the columns
assembled_df = assembler.transform(spark_df)  

selected_columns = ["features", "Target"]
assembled_df = assembled_df.select(selected_columns)

# Show the resulting DataFrame
assembled_df.show(n=3, vertical=True, truncate=False)

-RECORD 0-----------------------------------------------------------------------------------------------------------------------------
 features | (19,[1,8,9,13,15,16,17],[1.0,1.0,1.0,1.0,0.31900923396624825,0.10344827586206896,0.2857142857142857])                     
 Target   | 1                                                                                                                         
-RECORD 1-----------------------------------------------------------------------------------------------------------------------------
 features | (19,[0,1,2,7,9,13,15,16,17,18],[1.0,1.0,1.0,1.0,1.0,1.0,0.044550954399959784,0.3103448275862069,0.14285714285714285,0.8]) 
 Target   | 0                                                                                                                         
-RECORD 2-----------------------------------------------------------------------------------------------------------------------------
 features | (19,[1,2,3,6,9,13,15,16,17],[1.0,1.0,1.0,1.

In [86]:
# holdout validation
(train, test) = assembled_df.randomSplit([0.8, 0.2], seed = 5)
(train_df, val_df) = train.randomSplit([0.8, 0.2], seed = 5)

# Show a sample of the training set
train.show(n=1, vertical=True, truncate=False)

# Show the total number of samples for each class label per set
train_df.groupBy('Target').count().show()
val_df.groupBy('Target').count().show()
test.groupBy('Target').count().show()

# Show total number of samples per set
print(train_df.count())
print(val_df.count())
print(test.count())

-RECORD 0-------------------------------------------------------------------------------------------------------------------------------
 features | (19,[0,1,2,3,4,5,9,13,15,16,17],[1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.2880126024366946,0.4482758620689655,0.2857142857142857]) 
 Target   | 1                                                                                                                           
only showing top 1 row

+------+-----+
|Target|count|
+------+-----+
|     1| 8072|
|     0| 8056|
+------+-----+

+------+-----+
|Target|count|
+------+-----+
|     1| 1992|
|     0| 1943|
+------+-----+

+------+-----+
|Target|count|
+------+-----+
|     1| 2451|
|     0| 2501|
+------+-----+

16128
3935
4952


#### 2.2.5.1 Random Forest Using All Categorical Features:

We will use the same hyperparameters values that we employed when we trained Random Forest using only three categorical features in the last trial.

In [219]:
start4 = time.time()

rf_tuning4 = pd.DataFrame(columns=['numTrees', 'maxDepth', 'maxBins', 'TP', 'TN', 'FP', 'FN'])

for i in range(0, 1):

    # Randomly select values for hyperparameters from a range
    trees4 = random.randrange(5, 61, 5)
    depth4 = random.randrange(2, 21, 2)
    bins4 = random.randrange(10, 61, 5)

    # Create a RandomForestClassifier instance
    randomForestClassifier4 = RandomForestClassifier(labelCol="Target", featuresCol="features", numTrees=trees4,
                                                     maxDepth=depth4, maxBins=bins4)

    randomForestModel4 = randomForestClassifier4.fit(train_df)

    # // run model with test data set to get predictions
    predictions4 = randomForestModel4.transform(val_df)
    predictions4.select("Target", "prediction").show()
    
end4 = time.time()

+------+----------+
|Target|prediction|
+------+----------+
|     1|       0.0|
|     0|       1.0|
|     1|       1.0|
|     1|       1.0|
|     1|       1.0|
|     1|       1.0|
|     1|       1.0|
|     0|       1.0|
|     1|       1.0|
|     0|       0.0|
|     1|       1.0|
|     0|       1.0|
|     1|       1.0|
|     1|       1.0|
|     1|       1.0|
|     1|       0.0|
|     0|       1.0|
|     1|       1.0|
|     1|       0.0|
|     1|       0.0|
+------+----------+
only showing top 20 rows



In [220]:
print('Time taken (in minutes) for random forest classifier tuning:', (end4-start4)/60)

Time taken (in minutes) for random forest classifier tuning: 1.0066583116849264


In [221]:
evaluator4 = MulticlassClassificationEvaluator(labelCol="Target", predictionCol="prediction")
accuracy4 = evaluator4.evaluate(predictions4)
print("Accuracy = %s" % (accuracy4))
print("Validation Error = %s" % (1.0 - accuracy4))

Accuracy = 0.6977848255582049
Validation Error = 0.30221517444179513


#### 2.2.5.2 Naive Bayes Using All Categorical Features:

In [281]:
nb_start4 = time.time()

# Create a NaiveBayes classifier
nb4 = NaiveBayes(labelCol="Target", featuresCol="features")

# Fit the model to the training data
nb_model4 = nb4.fit(train_df)

# Make predictions on the test data
nb_predictions4 = nb_model4.transform(val_df)
nb_predictions4.select("Target", "prediction").show()

nb_end4 = time.time()

+------+----------+
|Target|prediction|
+------+----------+
|     1|       1.0|
|     0|       1.0|
|     1|       1.0|
|     1|       1.0|
|     1|       1.0|
|     1|       1.0|
|     1|       1.0|
|     0|       1.0|
|     1|       1.0|
|     0|       1.0|
|     1|       1.0|
|     0|       1.0|
|     1|       1.0|
|     1|       1.0|
|     1|       1.0|
|     1|       1.0|
|     0|       1.0|
|     1|       1.0|
|     1|       1.0|
|     1|       1.0|
+------+----------+
only showing top 20 rows



In [282]:
print('Time taken (in minutes) to fit Naive Bayes model and make predictions:', (nb_end4-nb_start4)/60)

Time taken (in minutes) to fit Naive Bayes model and make predictions: 0.0601337989171346


In [283]:
nb_evaluator4 = MulticlassClassificationEvaluator(labelCol="Target", predictionCol="prediction")
nb_accuracy4 = nb_evaluator4.evaluate(nb_predictions4)
print("Accuracy = %s" % (nb_accuracy4))
print("Validation Error = %s" % (1.0 - nb_accuracy4))


Accuracy = 0.5395214456411287
Validation Error = 0.4604785543588713


#### 2.2.5.3 Logistic Regression Using All Categorical Features:

In [134]:
lr_start4 = time.time()

lr_tuning4 = pd.DataFrame(columns=['regParam', 'iterations', 'TP', 'TN', 'FP', 'FN'])

for i in range(0, 5):
    
    # Create a LogisticRegression classifier
    reg = random.uniform(0, 0.05)
    iters = random.randrange(10, 101, 5)

    # Create a LogisticRegression instance with the selected hyperparameters
    lr4 = LogisticRegression(labelCol="Target", featuresCol="features", regParam=reg, maxIter=iters, elasticNetParam=1.0,
                            family="auto")
    
    # Fit the model to the training data
    lr_model4 = lr4.fit(train_df)
    # Make predictions on the validation data
    lr_predictions4 = lr_model4.transform(val_df)
    lr_predictions4.select("Target", "prediction").show()
    
lr_end4 = time.time()

+------+----------+
|Target|prediction|
+------+----------+
|     1|       0.0|
|     0|       0.0|
|     1|       1.0|
|     1|       0.0|
|     1|       1.0|
|     1|       0.0|
|     1|       0.0|
|     0|       0.0|
|     1|       0.0|
|     0|       0.0|
|     1|       1.0|
|     0|       0.0|
|     1|       0.0|
|     1|       1.0|
|     1|       1.0|
|     1|       0.0|
|     0|       0.0|
|     1|       0.0|
|     1|       0.0|
|     1|       0.0|
+------+----------+
only showing top 20 rows



  lr_tuning4 = lr_tuning4.append({'regParam': reg, 'iterations': iters,


+------+----------+
|Target|prediction|
+------+----------+
|     1|       0.0|
|     0|       0.0|
|     1|       1.0|
|     1|       0.0|
|     1|       1.0|
|     1|       0.0|
|     1|       0.0|
|     0|       0.0|
|     1|       0.0|
|     0|       0.0|
|     1|       1.0|
|     0|       0.0|
|     1|       0.0|
|     1|       1.0|
|     1|       1.0|
|     1|       0.0|
|     0|       0.0|
|     1|       0.0|
|     1|       0.0|
|     1|       0.0|
+------+----------+
only showing top 20 rows



  lr_tuning4 = lr_tuning4.append({'regParam': reg, 'iterations': iters,


+------+----------+
|Target|prediction|
+------+----------+
|     1|       0.0|
|     0|       0.0|
|     1|       1.0|
|     1|       1.0|
|     1|       1.0|
|     1|       1.0|
|     1|       1.0|
|     0|       1.0|
|     1|       0.0|
|     0|       1.0|
|     1|       1.0|
|     0|       0.0|
|     1|       0.0|
|     1|       1.0|
|     1|       1.0|
|     1|       0.0|
|     0|       0.0|
|     1|       1.0|
|     1|       0.0|
|     1|       0.0|
+------+----------+
only showing top 20 rows



  lr_tuning4 = lr_tuning4.append({'regParam': reg, 'iterations': iters,


+------+----------+
|Target|prediction|
+------+----------+
|     1|       0.0|
|     0|       0.0|
|     1|       1.0|
|     1|       1.0|
|     1|       1.0|
|     1|       0.0|
|     1|       0.0|
|     0|       1.0|
|     1|       0.0|
|     0|       0.0|
|     1|       1.0|
|     0|       0.0|
|     1|       0.0|
|     1|       1.0|
|     1|       1.0|
|     1|       0.0|
|     0|       0.0|
|     1|       0.0|
|     1|       0.0|
|     1|       0.0|
+------+----------+
only showing top 20 rows



  lr_tuning4 = lr_tuning4.append({'regParam': reg, 'iterations': iters,


+------+----------+
|Target|prediction|
+------+----------+
|     1|       0.0|
|     0|       0.0|
|     1|       1.0|
|     1|       1.0|
|     1|       1.0|
|     1|       0.0|
|     1|       0.0|
|     0|       1.0|
|     1|       0.0|
|     0|       0.0|
|     1|       1.0|
|     0|       0.0|
|     1|       0.0|
|     1|       1.0|
|     1|       1.0|
|     1|       0.0|
|     0|       0.0|
|     1|       1.0|
|     1|       0.0|
|     1|       0.0|
+------+----------+
only showing top 20 rows



  lr_tuning4 = lr_tuning4.append({'regParam': reg, 'iterations': iters,


In [135]:
print('Time taken (in minutes) to fit Logistic Regression model and make predictions:', (lr_end4-lr_start4)/60)

Time taken (in minutes) to fit Logistic Regression model and make predictions: 0.9122393290201823


In [136]:
# Define the MulticlassClassificationEvaluator
lr_evaluator4 = MulticlassClassificationEvaluator(labelCol="Target", predictionCol="prediction")

# Calculate the accuracy
lr_accuracy4 = lr_evaluator4.evaluate(lr_predictions4)

print("Accuracy = %s" % (lr_accuracy4))
print("Validation Error = %s" % (1.0 - lr_accuracy4))


Accuracy = 0.5513726584504396
Validation Error = 0.4486273415495604


We can notice from the above section, after we trained our three models using all our categorical features, that their performance has been decreased in terms of accuracy and time efficiency. The accuracy for all the models has decreased and the time taken to complete the training process increased as well. Although the performance has slightly worsened, it's important to note that the future surveys spanning from 2021 and beyond, are likely to have more variables leading to a remarkable risk if we didn't employ feature selection. 


Therefore, as it's always recommended to opt for simpler models rather than complex ones to guarantee optimal performance and to increase the interpretability of the models, we will proceed in the next section using only the highly significant categorical variables resulting from the chi-square test. 

In [284]:
# Define your categorical features
categorical_features = ['OrganizationSize','Region', 'Hobbyist', 'EducationLevel', 'Employment', 'Major']

# Initialize a StringIndexer
stringIndexer = StringIndexer(inputCols=categorical_features, outputCols=["encoded_" + col for col in categorical_features])

# Fit the StringIndexer
model = stringIndexer.fit(dataframe_cleaned)

# Transform the data
dataframe_with_encoded_features = model.transform(dataframe_cleaned)

# Create a VectorAssembler to assemble the features into a DenseVector
encoded_features = ["encoded_OrganizationSize", "encoded_Region", "encoded_Hobbyist", "encoded_EducationLevel", "encoded_Employment", "encoded_Major"]

# the result is in "features" columns
assembler = VectorAssembler(inputCols=encoded_features, outputCol="features")

selector = ChiSqSelector(featuresCol="features", outputCol="selectedFeatures", labelCol="Target", fpr= 0.05)

# Create a pipeline consists of assembler followed by the ChiSqSelector
pipeline = Pipeline(stages=[assembler, selector])

# Fit the pipeline
model = pipeline.fit(dataframe_with_encoded_features)

# Transform the data
result = model.transform(dataframe_with_encoded_features)

# The selectedFeatures column now contains the most important features
selected_features = result.select("selectedFeatures")

# Define a dictionary to map selected feature indices to variable names
feature_mapping = {
    0: "encoded_OrganizationSize",
    1: "encoded_Region",
    2: "encoded_Hobbyist",
    3: "encoded_EducationLevel",
    4: "encoded_Employment",
    5: "encoded_Major"
}

# Extract the selected feature indices from the DataFrame
selected_feature_indices = result.select("selectedFeatures").collect()[0][0].toArray()

# Sort the indices based on their values (Chi-Square statistics)
sorted_indices = sorted(range(len(selected_feature_indices)), key=lambda i: selected_feature_indices[i], reverse=True)

# Select the top 3 feature indices
top_3_indices = sorted_indices[:3]

# Map selected feature indices to variable names
selected_features = [feature_mapping[i] for i in top_3_indices]


dataframe_with_encoded_features = dataframe_with_encoded_features.drop(dataframe_with_encoded_features.encoded_Hobbyist)
dataframe_with_encoded_features = dataframe_with_encoded_features.drop(dataframe_with_encoded_features.encoded_Employment)
dataframe_with_encoded_features = dataframe_with_encoded_features.drop(dataframe_with_encoded_features.encoded_Major)

counts = dataframe_with_encoded_features.groupBy('Target').count()


# Get the count for the 0 and 1 target labels
count_0 = counts.filter(counts['Target'] == 0).select('count').collect()[0][0]
count_1 = counts.filter(counts['Target'] == 1).select('count').collect()[0][0]

# Calculate the oversampling fraction for the 0 target label
oversampling_fraction_0 = (count_1/2.12) / count_0

# Create a DataFrame with additional rows for the 0 target label
oversampled_df = dataframe_with_encoded_features.unionAll(
    dataframe_with_encoded_features.filter(dataframe_with_encoded_features['Target'] == 0)
    .sample(True, oversampling_fraction_0 , seed=200)
    .withColumn('Target', lit(0))
)

# Define the list of columns to be one-hot encoded
columns_to_encode = ["Region"]   

# Create empty lists to hold the StringIndexers and Encoders
string_indexers = []
encoders = []

# Loop through each column and create a StringIndexer and an Encoder for it
for col in columns_to_encode:
    indexer = StringIndexer(inputCol=col, outputCol=col + "_index")
    encoder = OneHotEncoder(inputCol=col + "_index", outputCol=col + "_vector")
    string_indexers.append(indexer)
    encoders.append(encoder)

# Create a pipeline with StringIndexers and Encoders
pipeline = Pipeline(stages=string_indexers + encoders)

# Fit and transform the data using the pipeline
model = pipeline.fit(oversampled_df)
encoded_df = model.transform(oversampled_df)


# List of columns to scale
columns_to_scale = ["ConvertedComp", "YearsCodePro", "encoded_EducationLevel", "encoded_OrganizationSize"]

assemblers = []
scalers = []

for col in columns_to_scale:
    # Assemble the features into a vector
    assembler = VectorAssembler(inputCols=[col], outputCol=col + "_vector")
    scaler = MinMaxScaler(inputCol=col + "_vector", outputCol=col + "_scaled")
    assemblers.append(assembler)
    scalers.append(scaler)

# Create pipeline
scaling_pipeline = Pipeline(stages=assemblers + scalers)

# Fit and transform the data using the pipeline
model = scaling_pipeline.fit(encoded_df)
scaled_df = model.transform(encoded_df)

columns_to_drop = ["Year", "Hobbyist", "ConvertedComp", "Employment", "YearsCodePro", "Region", "EducationLevel", 
                   "Major", "OrganizationSize", "encoded_EducationLevel", "encoded_EducationLevel_vector", "encoded_Region",
                   "encoded_OrganizationSize_vector", "encoded_OrganizationSize", "Region_index","ConvertedComp_vector", 
                   "YearsCodePro_vector"]

spark_df = scaled_df.drop(*columns_to_drop)
print("The features we are going to proceed with are: ")
spark_df.show(n=1, vertical=True, truncate=False)

# Create a VectorAssembler to be passed to our models
feature_cols = [c for c in spark_df.columns if c != 'Target']
assembler = VectorAssembler(inputCols=feature_cols , outputCol="features")

# Use transform to assemble the columns
assembled_df = assembler.transform(spark_df)  

selected_columns = ["features", "Target"]
assembled_df = assembled_df.select(selected_columns)


# holdout validation
(train, test) = assembled_df.randomSplit([0.8, 0.2], seed = 5)
(train_df, val_df) = train.randomSplit([0.8, 0.2], seed = 5)


# Show the total number of samples for each class label per set
train_df.groupBy('Target').count().show()
val_df.groupBy('Target').count().show()
test.groupBy('Target').count().show()

# Show total number of samples per set
print(train_df.count())
print(val_df.count())
print(test.count())



The features we are going to proceed with are: 
-RECORD 0--------------------------------------------------------------
 Data scientist or machine learning specialist | 0                     
 Database administrator                        | 1                     
 Data or business analyst                      | 0                     
 Engineer, data                                | 0                     
 Target                                        | 1                     
 Region_vector                                 | (4,[3],[1.0])         
 ConvertedComp_scaled                          | [0.31900923396624825] 
 YearsCodePro_scaled                           | [0.10344827586206896] 
 encoded_EducationLevel_scaled                 | [0.2857142857142857]  
 encoded_OrganizationSize_scaled               | [0.0]                 
only showing top 1 row

+------+-----+
|Target|count|
+------+-----+
|     1| 7988|
|     0| 8140|
+------+-----+

+------+-----+
|Target|count|
+------+-----+


#### 2.2.9 Predictions on Test Set Using Random Forest

The random forest model is the best-performing model as it achieved the highest accuracy, which is 69%, on our balanced dataset during the training and validation phases with trees of 61, a max depth of 20, and a max bin of 60 using the resulting categorical features from chi-square test. Therefore, we will train this model on our fully balanced training dataset and then compute the accuracy using the test dataset.

In [320]:
randomForestClassifier_test = RandomForestClassifier(labelCol="Target", featuresCol="features",
                                                     numTrees=60, maxDepth=20, maxBins=60)

randomForestModel_test = randomForestClassifier_test.fit(train)

predictions_test = randomForestModel_test.transform(test)
    
predictions_test.select("Target", "prediction").show()

prediction_targets_test = pd.DataFrame({'targets': predictions_test.select("Target").collect(), 'predictions': predictions_test.select("prediction").collect()})

evaluator_test = MulticlassClassificationEvaluator(labelCol="Target", predictionCol="prediction")

accuracy_test = evaluator_test.evaluate(predictions_test)

print("Accuracy = %s" % (accuracy_test))
print("Test Error = %s" % (1.0 - accuracy_test))

+------+----------+
|Target|prediction|
+------+----------+
|     1|       1.0|
|     0|       1.0|
|     1|       1.0|
|     0|       0.0|
|     1|       1.0|
|     1|       1.0|
|     1|       1.0|
|     1|       1.0|
|     0|       0.0|
|     1|       1.0|
|     0|       1.0|
|     1|       0.0|
|     1|       1.0|
|     1|       1.0|
|     0|       1.0|
|     1|       1.0|
|     1|       1.0|
|     1|       1.0|
|     1|       0.0|
|     1|       1.0|
+------+----------+
only showing top 20 rows

Accuracy = 0.7211649005537517
Test Error = 0.2788350994462483


### 3. Summary and Conclusion

#### 3.1 Summary:

All in all, we have processed the results from the "Stack Overflow Annual Developers Survey" from 2017 to 2020 to predict whether the employee is satisfied with his job. We started by exploring the dataset in the EDA section by getting familiar with our numerical and categorical variables, followed by the hypothesis testing section. The target of the hypothesis testing is to identify which categorical factors affect employee satisfaction, and it was illustrated that the *Region*, *Education Level*, and *Organization Size* contribute to the prediction of employee satisfaction. Furthermore, we pre-processed our dataset by balancing it using an oversampling technique due to having an imbalanced target variable, applying one-hot encoding to the categorical features, and then scaling the numerical features. The last step in the pre-processing phase was dropping the irrelevant features from our analysis and creating a vector assembler to be passed to our models. 

At this point, we are ready to build our predictive models, but first, we folded the dataset into three folds: training, validation, and testing sets. Next, we started running our random forest model, followed by naive bayes and logistic regression models using the training and validation sets. However, the results were not satisfying, so we considered all the categorical variables instead of the only three resulting from the chi-square test to compare the models' performance in terms of accuracy and time efficiency. Nevertheless, the performance degradation was relatively modest. Moreover, the models' complexity has increased, affecting their performance and interpretability. Thus, we continued our analysis but only used the resulting features from the chi-square test. 

#### 3.2 Goals:

Our target is to build a suitable Spark ML model to predict whether the employee is satisfied or not with his job by making the best use of the provided capabilities: HDFS and Apache Spark. 

#### 3.3 Challenges And Results:

The process of identifying which model to proceed with was not easy, it can be considered one of the challenges that we have encountered during our big data analysis journey. We can see that when using the accuracy metric on the Naive Bayes algorithm, considering the three categorical features resulting from the chi-square test yields a result of 55%, while using all the categorical features yields 53%. We believe that it is the worst-performing model due to its naive assumptions. Regarding the Logistic Regression model, it yielded a very low accuracy, which is 56% using the categorical features resulting from the chi-square test and 55% using all the categorical features. 

We also see that the Random Forest yielded the highest accuracy, with a result of 69% using both datasets, resulting from the chi-square test and the whole categorical features. Interestingly, both datasets performed equally well in terms of accuracy with the random forest, however, the time taken to train our model increased when we considered all our categorical features, and this is due to the increased model's complexity. This means that the algorithm was able to classify regardless of the extra features added. This is what makes the random forest classifier the best-suited algorithm since it performs with high accuracy, avoids bias and overfitting, and can handle large amounts of features and data so it will predict/classify the output accurately. Lastly, we applied random forest to our test set, and it achieved an accuracy of 72%, which is high compared to the previously mentioned models. We expected to get a higher value, but this could be due to our second challenge, hyperparameter tuning.


It's worth mentioning that the performance of our models could be better, however, this could be due to our hyperparameter tuning process needing to be more robust, such as the grid search method. Unfortunately, this was limited by how many random searches we could perform. The number of random searches was limited, specifically while running the random forest algorithm, as an error occurred when we increased the number of trees and the maximum tree depth during a large number of random searches, leading to a "Java heap space" error, indicating an out-of-memory issue that required session restarts.

During the first trial to tune our random forest model, using a tree number of 200 and a maximum depth of 20, the time consumed was approximately 8 minutes, which is relatively high compared to the last attempt. In the last attempt, while tuning the random forest algorithm, the time consumed was 0.70 minutes with a tree number of 60 and a maximum depth of 20. Moving to the time consumed by the Logistic Regression model, while applying hyperparameter tuning, it was 0.72 minutes, which is relatively close to the time taken during the last attempt of the random forest model but with a lower accuracy result.

The mentioned parameters of the random forest model play a huge role in incrementing the training and testing time of the model, so for example, the more trees the slower our learner is, which will increase the accuracy. The same applies to the maximum depth, if the number of levels increases per tree, this will increase the training time for each tree, hence the whole forest.

#### 3.4 Conclusion:

To conclude, our project is transferable to different domain-specific areas other than merely predicting employee satisfaction in the field of data analytics, as it could be utilized in a wide array of different aspects of different fields. Our model could be replicated using different programming languages, such as Scala which could be a task for the future. Also, more complex algorithms, such as deep learning techniques, could capture the high dimensionality of the data and provide better metrics. The downside of these algorithms is their black box techniques, which could be difficult to grasp and explain to the business side. Also, deep learning requires high computational power, making it an optimal choice to take advantage of the available capabilities within HDFS and Apache Spark. 


Thus, we propose for future reports and analyses to try to incorporate other hyperparameter tuning techniques and maybe explore which ensemble methods such as Xg-Boost and AdaBoost may perform better than random forests for such datasets.

### 4. References:


1. Allibhai, E. (2018). Holdout vs. cross-validation in machine learning. [online] Medium. URL: https://medium.com/@eijaz/holdout-vs-cross-validation-in-machine-learning-7637112d3f8f.


2. Avinash Navlani (2018). Random Forests Classifiers in Python. [online] DataCamp Community. URL: https://www.datacamp.com/community/tutorials/random-forests-classifier-python.


3. Biswal, A. (2023). What is a Chi-Square Test? Formula, Examples & Uses | Simplilearn. [online] Simplilearn.com. URL: https://www.simplilearn.com/tutorials/statistics-tutorial/chi-square-test.


4. Brownlee, J. (2019). How to Choose a Feature Selection Method For Machine Learning. [online] Machine Learning Mastery. URL: https://machinelearningmastery.com/feature-selection-with-real-and-categorical-data/.


5. Brownlee, J. (2019). How to Perform Feature Selection with Categorical Data. [online] MachineLearningMastery.com. URL: https://machinelearningmastery.com/feature-selection-with-categorical-data/.


6. Chauhan, N., 2022. Naïve Bayes Algorithm: Everything You Need to Know - KDnuggets. [Blog] KDnuggets. URL: <https://www.kdnuggets.com/2020/06/naive-bayes-algorithm-everything.html#:~:text=The%20Zero%2DFrequency%20Problem,all%20the%20probabilities%20are%20multiplied.>


7. Eijaz Allibhai (2018). Holdout vs. Cross-validation in Machine Learning. [online] Medium. URL: https://medium.com/@eijaz/holdout-vs-cross-validation-in-machine-learning-7637112d3f8f.


8. Goswami, D.S. (2020). Using the Chi-Squared test for feature selection with implementation. [online] Medium. URL: https://towardsdatascience.com/using-the-chi-squared-test-for-feature-selection-with-implementation-b15a4dad93f1.


9. Irfan, S., 2021. Hyperparameter Tuning in Random Forest. [online] Medium. URL: <https://medium.com/geekculture/random-forest-algorithm-has-proven-to-be-one-of-the-most-sought-after-algorithm-in-the-field-of-295da606bf9>


10. Koehrsen, W., 2018. Hyperparameter Tuning the Random Forest in Python. [online] Medium. URL: <https://towardsdatascience.com/hyperparameter-tuning-the-random-forest-in-python-using-scikit-learn-28d2aa77dd74>


11. Lindgren, I. (2019). Transformations, Scaling and Normalization. [online] Medium. URL: https://medium.com/@isalindgren313/transformations-scaling-and-normalization-420b2be12300.


12. Logunova, I., 2022. Guide to Random Forest Classification and Regression Algorithms. [online] Serokell Software Development Company. URL: <https://serokell.io/blog/random-forest-classification>


13. MALATO, G., 2021. Feature selection with Random Forest | Your Data Teacher. [online] Your Data Teacher. URL: <https://www.yourdatateacher.com/2021/10/11/feature-selection-with-random-forest/>


14. Mazumder, S., 2021. What is Imbalanced Data | Techniques to Handle Imbalanced Data. [online] Analytics Vidhya. URL: <https://www.analyticsvidhya.com/blog/2021/06/5-techniques-to-handle-imbalanced-data-for-a-classification-problem/>


15. Navlani, A., 2018. Naive Bayes Classification Tutorial using Scikit-learn. [Blog] datacamp, URL: <https://www.datacamp.com/tutorial/naive-bayes-scikit-learn>


16. Priyanjalee, M. (2021). A Guide to exploit Random Forest Classifier in PySpark. [online] Medium. URL: https://towardsdatascience.com/a-guide-to-exploit-random-forest-classifier-in-pyspark-46d6999cb5db.


17. R, S., 2021. Random Forest | Introduction to Random Forest Algorithm. [online] Analytics Vidhya. URL: <https://www.analyticsvidhya.com/blog/2021/06/understanding-random-forest/>


18. Shrestha, S. (2021). Preparing Data for Apache Spark ML Model. [online] Medium. URL: https://medium.com/@statistics.sudip/preparing-data-for-apache-spark-ml-model-4fedcc31a0f4.


19. spark.apache.org. (n.d.). Extracting, transforming and selecting features - Spark 2.2.0 Documentation. [online] URL: https://spark.apache.org/docs/2.2.0/ml-features.html#chisqselector.
