
# Welcome to the challenge notebook
---


In this challenge, you will work with a dataset provided by an HR manager who wants to predict which employees are at risk of leaving the company. The dataset contains four key performance indicators (KPIs) related to each employee. Your task is to use PySpark to build a machine learning model that can predict employee attrition and to identify which KPI is most strongly associated with attrition in this company.

- Please note that the dataset is already clean and ready to be modeled.
- The dataset only contains numerical features.

Installing pyspark

In [1]:
!pip install pyspark

Collecting pyspark
  Downloading pyspark-3.5.3.tar.gz (317.3 MB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.5.3-py2.py3-none-any.whl size=317840625 sha256=7cfeadb8899ef8b09114ead51bfb40d18f2fdaaeba047b2d41040e72c2e3e645
  Stored in directory: /root/.cache/pip/wheels/1b/3a/92/28b93e2fbfdbb07509ca4d6f50c5e407f48dce4ddbda69a4ab
Successfully built pyspark
Installing collected packages: pyspark
Successfully installed pyspark-3.5.3


Importing the needed modules and creating the spark session

In [2]:
!pip install plotly



In [3]:
# importing spark session
from pyspark.sql import SparkSession

# data visualization modules
import matplotlib.pyplot as plt
import plotly.express as px

# pandas module
import pandas as pd

# pyspark data preprocessing modules
from pyspark.ml.feature import VectorAssembler, StandardScaler, StringIndexer

# pyspark data modeling and model evaluation modules
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# creating the spark session
spark = SparkSession.builder.appName("Challenge").getOrCreate()
spark

Loading the `Challenge_dataset.csv` file

In [5]:
data = spark.read.csv("/content/Challenge_dataset.csv", header = True, inferSchema = True)
data.show(5)

+----------+------------------+-------------------+-------------------+-------------------+---------------+
|EmployeeID|              KPI1|               KPI2|               KPI3|               KPI4|CurrentEmployee|
+----------+------------------+-------------------+-------------------+-------------------+---------------+
|         0|1.4347155493478079| 0.8445778971189396| 1.2907117554310856|-1.4201273531837943|              1|
|         1|0.8916245735832885| 0.8308158727699302| 1.0779750584283363|-1.0598957663940176|              1|
|         2|-0.891158353098296|-0.9469681237741348|-1.1825287909456643| 1.1269205082112577|              0|
|         3|1.2797294893867808| 1.6690888870054317| 1.9769417044649022| -1.797525912345404|              1|
|         4|0.2576789316661615|0.34201906896710577|0.40342208520171396|-0.3653830886145554|              1|
+----------+------------------+-------------------+-------------------+-------------------+---------------+
only showing top 5 rows



Create the numerical feature vector using `VectorAssembler`.

Hint: The numerical input features are the KPIs.

In [7]:
# List the numerical features
numerical_cols = [name for name, dtype in data.dtypes if dtype == 'double']
numerical_vector = VectorAssembler(inputCols = numerical_cols, outputCol = 'input_features')
data = numerical_vector.transform(data)
data.show(5)


+----------+------------------+-------------------+-------------------+-------------------+---------------+--------------------+
|EmployeeID|              KPI1|               KPI2|               KPI3|               KPI4|CurrentEmployee|      input_features|
+----------+------------------+-------------------+-------------------+-------------------+---------------+--------------------+
|         0|1.4347155493478079| 0.8445778971189396| 1.2907117554310856|-1.4201273531837943|              1|[1.43471554934780...|
|         1|0.8916245735832885| 0.8308158727699302| 1.0779750584283363|-1.0598957663940176|              1|[0.89162457358328...|
|         2|-0.891158353098296|-0.9469681237741348|-1.1825287909456643| 1.1269205082112577|              0|[-0.8911583530982...|
|         3|1.2797294893867808| 1.6690888870054317| 1.9769417044649022| -1.797525912345404|              1|[1.27972948938678...|
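|         4|0.2576789316661615|0.34201906896710577|0.40342208520171396|-0.3653830886145554|              1|[0.25767893166616...|
+----------+------------------+-------------------+-------------------+-------------------+---------------+--------------------+
only showing top 5 rows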
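
The list comprehension in the cell above relies on `data.dtypes`, which returns `(column name, type string)` pairs. A minimal plain-Python sketch of that filter follows. The type strings for `EmployeeID` and `CurrentEmployee` are assumed to be `int` (a typical result of `inferSchema` for whole numbers); they are not read from the actual dataset.

```python
# Assumed shape of data.dtypes; the "int" entries are an assumption for illustration.
dtypes = [
    ("EmployeeID", "int"),
    ("KPI1", "double"),
    ("KPI2", "double"),
    ("KPI3", "double"),
    ("KPI4", "double"),
    ("CurrentEmployee", "int"),
]

# Same filter as in the notebook cell: keep only the double-typed columns.
numerical_cols = [name for name, dtype in dtypes if dtype == "double"]
print(numerical_cols)
```

Under that assumption only the four KPI columns are selected, so the identifier and the label stay out of the feature vector. If either had been inferred as `double`, it would end up in the features, so listing the KPI columns explicitly is a safer alternative.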

Apply `StandardScaler` to the numerical feature vector

In [8]:
# Standard Scaler for input_features
scaler = StandardScaler(inputCol = 'input_features', outputCol = 'scaled_features')
scaler_model = scaler.fit(data)
data = scaler_model.transform(data)
data.show(5)

+----------+------------------+-------------------+-------------------+-------------------+---------------+--------------------+--------------------+
|EmployeeID|              KPI1|               KPI2|               KPI3|               KPI4|CurrentEmployee|      input_features|     scaled_features|
+----------+------------------+-------------------+-------------------+-------------------+---------------+--------------------+--------------------+
|         0|1.4347155493478079| 0.8445778971189396| 1.2907117554310856|-1.4201273531837943|              1|[1.43471554934780...|[1.08046317968569...|
|         1|0.8916245735832885| 0.8308158727699302| 1.0779750584283363|-1.0598957663940176|              1|[0.89162457358328...|[0.67146935313946...|
|         2|-0.891158353098296|-0.9469681237741348|-1.1825287909456643| 1.1269205082112577|              0|[-0.8911583530982...|[-0.6711182493489...|
|         3|1.2797294893867808| 1.6690888870054317| 1.9769417044649022| -1.797525912345404|         
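
`StandardScaler` is used here with its defaults (`withStd=True`, `withMean=False`), so each feature is divided by its sample standard deviation without being centered. The visible output is consistent with that: for KPI1, the scaled value divided by the raw value gives the same ratio in every row shown. A small plain-Python check using only the numbers printed in the two tables (the implied standard deviation is an estimate derived from them, not a value reported by the notebook):

```python
# (raw KPI1, scaled KPI1) pairs; the scaled values are the truncated prefixes shown above.
pairs = [
    (1.4347155493478079, 1.08046317968569),
    (0.8916245735832885, 0.67146935313946),
    (-0.891158353098296, -0.6711182493489),
]

# If scaling is a pure division by a constant, scaled / raw is the same for every row.
ratios = [scaled / raw for raw, scaled in pairs]
implied_std = 1.0 / ratios[0]

print(ratios)
print(implied_std)
```

The ratios agree at about 0.753, which is what division by a single constant would produce. To also center the data, `withMean=True` can be passed to `StandardScaler`.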

In [22]:
data = data.select(['scaled_features','CurrentEmployee'])

Split the data into train and test sets

In [24]:
# Split data in train and test
train, test = data.randomSplit([0.7, 0.3], seed = 42)
print("Training Dataset Count: " + str(train.count()))
print("Test Dataset Count: " + str(test.count()))

Training Dataset Count: 2866
Test Dataset Count: 1134
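
The counts are not exactly 70/30 (2866 of 4000 rows is about 71.7%) because `randomSplit` assigns rows at random, treating the weights as probabilities rather than exact quotas. A rough plain-Python sketch of that idea follows. It does not reproduce Spark's sampler or its seed behavior, so the counts it prints will differ from the ones above.

```python
import random

def random_split(rows, train_weight=0.7, seed=42):
    """Assign each row to train with probability train_weight, otherwise to test."""
    rng = random.Random(seed)
    train_part, test_part = [], []
    for row in rows:
        if rng.random() < train_weight:
            train_part.append(row)
        else:
            test_part.append(row)
    return train_part, test_part

rows = list(range(4000))  # stand-in for the 4000 rows of the dataset
train_rows, test_rows = random_split(rows)
print(len(train_rows), len(test_rows))
```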


Train your Decision Tree model. Use `maxDepth = 3`

In [26]:
# Create Decision Tree Model, maxDepth=3
decision_tree = DecisionTreeClassifier(featuresCol = 'scaled_features', labelCol = 'CurrentEmployee', maxDepth = 3)
dt_model = decision_tree.fit(train)
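
`maxDepth` caps how many successive splits a single path through the tree can contain, which bounds the size of the tree. A quick sketch of the upper bound on leaves for a binary tree (actual trees are usually smaller, since splitting stops early when a node cannot be improved):

```python
def max_leaves(max_depth):
    """Upper bound on the number of leaves in a binary tree of the given depth."""
    return 2 ** max_depth

for depth in (3, 8, 20):
    print(depth, max_leaves(depth))
```

With 2866 training rows, a depth of 20 allows far more leaves than there are rows, which is why very deep trees tend to memorize the training set.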

Perform the prediction on the test set and evaluate it using `BinaryClassificationEvaluator`. The evaluator reports the area under the ROC curve, which serves as the accuracy measure in this notebook.

In [27]:
# Perform predictions
predictions = dt_model.transform(test)
evaluator = BinaryClassificationEvaluator(labelCol='CurrentEmployee')
auc = evaluator.evaluate(predictions, {evaluator.metricName: "areaUnderROC"})
print("Test Area Under ROC: " + str(auc))

Test Area Under ROC: 0.8802175456515781
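
One way to read the area under the ROC curve: it is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. A small plain-Python sketch of that pairwise definition, with made-up labels and scores (not taken from the dataset):

```python
def auc_pairwise(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

labels = [1, 1, 0, 0]           # made-up labels
scores = [0.9, 0.4, 0.6, 0.1]   # made-up model scores
print(auc_pairwise(labels, scores))  # 0.75: three of the four pairs are ordered correctly
```

If plain classification accuracy is wanted instead, `MulticlassClassificationEvaluator` with `metricName="accuracy"` is one option.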


Apply hyperparameter tuning to find the proper `maxDepth` for your decision tree from the `candidates` list.

In [28]:
def evaluate_dt(max_depths):
    """Train one decision tree per maxDepth; return (test, train) lists of area under ROC."""
    test_accuracies = []
    train_accuracies = []

    # A single evaluator can be reused for every model
    evaluator = BinaryClassificationEvaluator(labelCol='CurrentEmployee', metricName='areaUnderROC')

    for max_depth in max_depths:
        # train the model with the current maxDepth
        decision_tree = DecisionTreeClassifier(featuresCol='scaled_features', labelCol='CurrentEmployee', maxDepth=max_depth)
        dt_model = decision_tree.fit(train)

        # area under ROC on the test set
        test_accuracies.append(evaluator.evaluate(dt_model.transform(test)))

        # area under ROC on the training set
        train_accuracies.append(evaluator.evaluate(dt_model.transform(train)))

    return test_accuracies, train_accuracies


# maxDepth values to try: 1 through 20
candidates = list(range(1, 21))


Use a line chart to visualize the training and testing accuracy (area under ROC) for each `maxDepth`.

Hint: To visualize your data, convert the PySpark dataframe to pandas dataframe.

In [29]:
test_accuracies, train_accuracies = evaluate_dt(candidates)

In [30]:
df = pd.DataFrame({'maxDepth': candidates, 'test_accuracies': test_accuracies, 'train_accuracies': train_accuracies})
px.line(df, x='maxDepth', y=['test_accuracies', 'train_accuracies'], title='Hyperparameter Tuning')
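
When reading the chart, a common heuristic is to pick the shallowest depth whose test score is close to the best one, and to treat a widening gap between training and test scores as a sign of overfitting. A minimal sketch of that rule follows. The helper name `pick_depth`, the tolerance, and all scores are hypothetical illustrations, not results from this notebook.

```python
def pick_depth(depths, test_scores, tolerance=0.005):
    """Return the smallest depth whose test score is within tolerance of the best.

    Assumes depths are listed in ascending order.
    """
    best = max(test_scores)
    for depth, score in zip(depths, test_scores):
        if score >= best - tolerance:
            return depth

depths = [1, 2, 3, 4, 5]
test_scores = [0.70, 0.80, 0.88, 0.882, 0.86]   # made-up values
train_scores = [0.71, 0.82, 0.90, 0.93, 0.97]   # made-up values

print(pick_depth(depths, test_scores))

# Gap between training and test score per depth; a growing gap suggests overfitting.
gaps = [round(tr - te, 3) for tr, te in zip(train_scores, test_scores)]
print(gaps)
```

Choosing the depth on the test set makes the final test score somewhat optimistic; a separate validation split or cross-validation avoids that.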

Train the decision tree using the `maxDepth` selected from the chart (8 in the cell below).

In [31]:
decision_tree = DecisionTreeClassifier(featuresCol = 'scaled_features', labelCol = 'CurrentEmployee', maxDepth = 8)
dt_model = decision_tree.fit(train)

Use the model's `Feature Importance` scores to find the KPI most strongly associated with employee attrition, and show them in a bar chart.

In [32]:
feature_importance = dt_model.featureImportances
# Convert the importance vector to a plain list, one score per KPI column
scores = feature_importance.toArray().tolist()
df = pd.DataFrame(scores, columns=['score'], index=numerical_cols)
df

         score
KPI1  0.070852
KPI2  0.044674
KPI3  0.013977
KPI4  0.870497


In [33]:
px.bar(df, y='score', title='Feature Importance')
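
The scores in the table sum to 1, so each can be read as a share of the tree's total importance. A short plain-Python sketch that ranks them, using the values printed above:

```python
# Feature importance scores copied from the table above
importance = {"KPI1": 0.070852, "KPI2": 0.044674, "KPI3": 0.013977, "KPI4": 0.870497}

# Sort from most to least important
ranked = sorted(importance.items(), key=lambda item: item[1], reverse=True)
top_feature, top_score = ranked[0]

print(top_feature, top_score)
print(round(sum(importance.values()), 6))
```

KPI4 accounts for about 87% of the importance, making it the KPI most strongly associated with attrition in this model. Importance reflects how much the tree relies on a feature; it does not indicate the direction of the effect or causation.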

In [35]:
spark.stop()