# [PySpark] Predicting Employee Turnover with Decision Trees and Random Forests

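# 1. Data Overview

## Field descriptions

* satisfaction_level: the employee's satisfaction with the company  
* last_evaluation: the employee's most recent KPI evaluation score  
* number_project: number of projects handled at the same time  
* average_montly_hours: average monthly working hours  
* time_spend_company: years spent at the company  
* Work_accident: whether the employee has had a workplace accident  
* left: whether the employee has left the company  
* promotion_last_5years: whether the employee was promoted in the last 5 years  
* sales: the employee's department  
* salary: salary level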

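# 2. Overview of Findings

* About 76.2% of employees in the data stayed, and 23.8% left  
* Employees who stayed are clearly more satisfied with the company than those who left  
* Employees who left handled slightly more concurrent projects than those who stayed  
* A t-test is used to check whether the mean satisfaction of employees who left equals that of employees who stayed

# 3. Data Preprocessing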

In [1]:
import findspark
findspark.init()

In [2]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, mean, when
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.stat import Correlation
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.sql import Row
from pyspark.ml.linalg import DenseMatrix
from pyspark.sql import types as T
from pyspark.sql.types import StructType, StructField, StringType, ArrayType, DoubleType
import scipy.stats as stats
from pyspark.ml.stat import Summarizer
from pyspark.ml.feature import StringIndexer

In [3]:
# Create the SparkSession
spark = SparkSession.builder.appName("Tree").master("local[*]").getOrCreate()

In [5]:
# Read the employee data
data = spark.read.csv("HR_comma_sep.csv", header=True, inferSchema=True)

In [6]:
data.show()

+------------------+---------------+--------------+--------------------+------------------+-------------+----+---------------------+-----+------+
|satisfaction_level|last_evaluation|number_project|average_montly_hours|time_spend_company|Work_accident|left|promotion_last_5years|sales|salary|
+------------------+---------------+--------------+--------------------+------------------+-------------+----+---------------------+-----+------+
|              0.38|           0.53|             2|                 157|                 3|            0|   1|                    0|sales|   low|
|               0.8|           0.86|             5|                 262|                 6|            0|   1|                    0|sales|medium|
|              0.11|           0.88|             7|                 272|                 4|            0|   1|                    0|sales|medium|
|              0.72|           0.87|             5|                 223|                 5|            0|   1|              

In [7]:
data.count()

14999

In [8]:
data.dtypes

[('satisfaction_level', 'double'),
 ('last_evaluation', 'double'),
 ('number_project', 'int'),
 ('average_montly_hours', 'int'),
 ('time_spend_company', 'int'),
 ('Work_accident', 'int'),
 ('left', 'int'),
 ('promotion_last_5years', 'int'),
 ('sales', 'string'),
 ('salary', 'string')]

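## Missing Values

In [8]:
# Compute the missing rate of every column in a DataFrame
def calculate_missing_rates(df):
    """
    Compute the missing rate of each column, in percent.
    A value counts as missing if it is null or, for string columns,
    an empty string or a single space.

    Parameters:
    - df: PySpark DataFrame

    Returns:
    - missing_rates_df: PySpark DataFrame with the missing rate (%) of each column
    """
    total_rows = df.count()
    missing_rates = []

    for column, dtype in df.dtypes:
        is_missing = col(column).isNull()
        # Compare only string columns with blanks, so numeric columns need no implicit cast
        if dtype == "string":
            is_missing = is_missing | (col(column) == "") | (col(column) == " ")
        missing_count = df.filter(is_missing).count()
        missing_rate = (missing_count / total_rows) * 100
        missing_rates.append((column, missing_rate))

    # Build a DataFrame to display the result
    missing_rates_df = spark.createDataFrame(missing_rates, ["Column", "MissingRate"])
    return missing_rates_df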

In [11]:
missing_rates = calculate_missing_rates(data)

In [12]:
missing_rates.show()

+--------------------+-----------+
|              Column|MissingRate|
+--------------------+-----------+
|  satisfaction_level|        0.0|
|     last_evaluation|        0.0|
|      number_project|        0.0|
|average_montly_hours|        0.0|
|  time_spend_company|        0.0|
|       Work_accident|        0.0|
|                left|        0.0|
|promotion_last_5y...|        0.0|
|               sales|        0.0|
|              salary|        0.0|
+--------------------+-----------+



* The dataset contains no missing values

## Renaming Features

In [17]:
# Map the original column names to the new names
column_mapping = {'satisfaction_level': 'satisfaction', 
                  'last_evaluation': 'evaluation',
                  'number_project': 'projectCount',
                  'average_montly_hours': 'averageMonthlyHours',
                  'time_spend_company': 'yearsAtCompany',
                  'Work_accident': 'workAccident',
                  'promotion_last_5years': 'promotion',
                  'sales' : 'department',
                  'left' : 'turnover',
                  'salary': 'salary'
                 }

In [124]:
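# Rename the feature columns (backticks guard against names such as `left` being parsed as SQL keywords)
data = data.selectExpr(*[f"`{name}` as {column_mapping[name]}" for name in data.columns])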

In [19]:
data.show()

+------------+----------+------------+-------------------+--------------+------------+--------+---------+----------+------+
|satisfaction|evaluation|projectCount|averageMonthlyHours|yearsAtCompany|workAccident|turnover|promotion|department|salary|
+------------+----------+------------+-------------------+--------------+------------+--------+---------+----------+------+
|        0.38|      0.53|           2|                157|             3|           0|       1|        0|     sales|   low|
|         0.8|      0.86|           5|                262|             6|           0|       1|        0|     sales|medium|
|        0.11|      0.88|           7|                272|             4|           0|       1|        0|     sales|medium|
|        0.72|      0.87|           5|                223|             5|           0|       1|        0|     sales|   low|
|        0.37|      0.52|           2|                159|             3|           0|       1|        0|     sales|   low|
|       

# 4. Analysis

## 1. Exploratory Analysis

In [21]:
# Compute the share of employees who left and who stayed
turnover_rate = data.groupBy("turnover").count().withColumn(
    "turnover_rate", col("count") / data.count()
).select("turnover", "turnover_rate")

In [22]:
turnover_rate.show()

+--------+------------------+
|turnover|     turnover_rate|
+--------+------------------+
|       1|0.2380825388359224|
|       0|0.7619174611640777|
+--------+------------------+



* About 76.2% of employees in the data stayed, and 23.8% left

In [23]:
# Compare feature means by turnover group
turnover_Summary = data.groupby('turnover').mean()
turnover_Summary.show()

+--------+-------------------+------------------+------------------+------------------------+-------------------+--------------------+-------------+--------------------+
|turnover|  avg(satisfaction)|   avg(evaluation)| avg(projectCount)|avg(averageMonthlyHours)|avg(yearsAtCompany)|   avg(workAccident)|avg(turnover)|      avg(promotion)|
+--------+-------------------+------------------+------------------+------------------------+-------------------+--------------------+-------------+--------------------+
|       1|0.44009801176140917|0.7181125735088183|3.8555026603192384|      207.41921030523662|  3.876505180621675|0.047325679081489776|          1.0|0.005320638476617194|
|       0|  0.666809590479516|0.7154733986699274| 3.786664333216661|       199.0602030101505| 3.3800315015750786| 0.17500875043752187|          0.0|0.026251312565628283|
+--------+-------------------+------------------+------------------+------------------------+-------------------+--------------------+-------------+--

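* Employees who stayed are clearly more satisfied with the company than those who left (about 0.67 vs. 0.44 on average)  
* Employees who left handled slightly more concurrent projects than those who stayed (about 3.86 vs. 3.79)

## 2. Correlation Analysis

In [27]:
# Assemble all numeric features into a single vector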
assembler = VectorAssembler(inputCols=[col_name for col_name, col_type in data.dtypes if col_type == "double" or col_type == "int"], outputCol="features")

In [28]:
data_vector = assembler.transform(data).select("features")

In [40]:
# Compute the correlation matrix
matrix = Correlation.corr(data_vector, "features").head()[0]

In [75]:
# Schema for a DataFrame with one row per feature, holding that feature's row of the correlation matrix
schema = StructType([
    StructField("feature", StringType(), True),
    StructField("correlations", ArrayType(DoubleType()), True)
])

In [84]:
columns = [col_name for col_name, col_type in data.dtypes if col_type == "double" or col_type == "int"]
corr_data = [
    (column, [float(x) for x in matrix.toArray()[i]]) for i, column in enumerate(columns)
]

In [86]:
corr_data = spark.createDataFrame(corr_data, schema=schema) 

In [87]:
corr_data.show()

+-------------------+--------------------+
|            feature|        correlations|
+-------------------+--------------------+
|       satisfaction|[1.0, 0.105021213...|
|         evaluation|[0.10502121397148...|
|       projectCount|[-0.1429695860368...|
|averageMonthlyHours|[-0.0200481132194...|
|     yearsAtCompany|[-0.1008660725779...|
|       workAccident|[0.05869724105198...|
|           turnover|[-0.3883749834240...|
|          promotion|[0.02560518570904...|
+-------------------+--------------------+



In [91]:
column_indices = range(len(columns))

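# Loop over the column indices and extract the value at each position into its own column
for i in column_indices:
    # Name the new column after the feature at this position
    new_column_name = columns[i]
    # Add the new column with withColumn
    corr_data = corr_data.withColumn(new_column_name, col("correlations")[i])

# Drop the original correlations column
corr_data = corr_data.drop("correlations")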

In [92]:
corr_data.show()

+-------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|            feature|        satisfaction|          evaluation|        projectCount| averageMonthlyHours|      yearsAtCompany|        workAccident|            turnover|           promotion|
+-------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|       satisfaction|                 1.0|  0.1050212139714834|-0.14296958603689447|-0.02004811321947...|-0.10086607257797037|0.058697241051982506| -0.3883749834240641|0.025605185709041058|
|         evaluation|  0.1050212139714834|                 1.0|  0.3493325885162647|  0.3397417998383528| 0.13159072244765946|-0.00710428851960...|0.006567120447532696|-0.00868376790479...|
|       projectCount|-0.14296958603689447|  0.3493

* Positively correlated features:  
    * projectCount vs. evaluation: 0.349333  
    * projectCount vs. averageMonthlyHours: 0.417211  
    * averageMonthlyHours vs. evaluation: 0.339742

* Negatively correlated features:  
    * satisfaction vs. turnover: -0.388375
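
The values above are Pearson coefficients, which is the default method of `Correlation.corr`. As a minimal sketch of what a single entry of the matrix measures, the plain-Python function below computes the coefficient for two short lists. The toy values are made up for illustration and are not taken from the dataset.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (the unnormalized covariance)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (the unnormalized variances)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical toy data: project counts and monthly hours for five employees
projects = [2, 3, 4, 5, 7]
hours = [150, 160, 200, 230, 270]
r = pearson(projects, hours)
print(f"r = {r:.4f}")
```

A coefficient near 1 or -1 indicates a strong linear relationship. The coefficients reported above (all below 0.5 in absolute value) describe moderate relationships at most.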

## 3. t-Test

* A t-test is used to check whether the mean satisfaction of employees who left equals that of employees who stayed

In [104]:
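# Mean satisfaction of employees who stayed
emp_population = data.filter(data['turnover'] == 0).agg(mean('satisfaction')).collect()[0][0]

# Mean satisfaction of employees who left
emp_turnover_satisfaction = data.filter(data['turnover'] == 1).agg(mean('satisfaction')).collect()[0][0]

In [99]:
# Print the results
print('Mean satisfaction of employees who stayed: ' + str(emp_population))
print('Mean satisfaction of employees who left: ' + str(emp_turnover_satisfaction))

Mean satisfaction of employees who stayed: 0.666809590479516
Mean satisfaction of employees who left: 0.44009801176140917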


In [106]:
# One-sample t-test
t_statistic, p_value = stats.ttest_1samp(a=data.filter(data['turnover'] == 1).select('satisfaction').rdd.flatMap(lambda x: x).collect(), # sample: satisfaction of employees who left
                  popmean=emp_population ) # reference value: mean satisfaction of employees who stayed

In [107]:
# Print the results
print('t statistic:', t_statistic)
print('p-value:', p_value)

t statistic: -51.3303486754725
p-value: 0.0


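* The p-value is far below 0.05 (it is small enough to print as 0.0), so the difference is statistically significant  
* This is a one-sample test: it compares the satisfaction of employees who left against the mean satisfaction of those who stayed, treated as a fixed value  
* Statistically, employees who left and employees who stayed differ significantly in their satisfaction with the company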
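
The test above treats the stayers' mean as a known constant. Both groups are samples, so a two-sample test is a common alternative. The sketch below computes Welch's t statistic in plain Python; it is an alternative technique, not the method used in this notebook. The toy scores are hypothetical and not taken from the dataset.

```python
import math
import statistics

def welch_t(a, b):
    """Welch's two-sample t statistic and approximate degrees of freedom."""
    mean_a, mean_b = statistics.mean(a), statistics.mean(b)
    # Sample variance of each group (n - 1 denominator) divided by its group size
    var_mean_a = statistics.variance(a) / len(a)
    var_mean_b = statistics.variance(b) / len(b)
    t = (mean_a - mean_b) / math.sqrt(var_mean_a + var_mean_b)
    # Welch-Satterthwaite approximation of the degrees of freedom
    df = (var_mean_a + var_mean_b) ** 2 / (
        var_mean_a ** 2 / (len(a) - 1) + var_mean_b ** 2 / (len(b) - 1)
    )
    return t, df

# Hypothetical toy satisfaction scores for the two groups
left = [0.38, 0.11, 0.41, 0.45, 0.10, 0.37]
stayed = [0.72, 0.66, 0.58, 0.81, 0.69, 0.90, 0.63]
t_stat, dof = welch_t(left, stayed)
print("t statistic:", t_stat, "degrees of freedom:", dof)
```

With the real data, SciPy's `stats.ttest_ind(a, b, equal_var=False)` performs the same test and also returns a p-value.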

## 4. Decision Tree and Random Forest

In [108]:
data.dtypes

[('satisfaction', 'double'),
 ('evaluation', 'double'),
 ('projectCount', 'int'),
 ('averageMonthlyHours', 'int'),
 ('yearsAtCompany', 'int'),
 ('workAccident', 'int'),
 ('turnover', 'int'),
 ('promotion', 'int'),
 ('department', 'string'),
 ('salary', 'string')]

In [125]:
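# data contains string-typed features, which the decision tree and random forest models cannot use directly
# Use StringIndexer to convert the 'department' column to a numeric index
department_indexer = StringIndexer(inputCol="department", outputCol="department_encoded")
data = department_indexer.fit(data).transform(data)

In [126]:
# Use StringIndexer to convert the 'salary' column to a numeric index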
salary_indexer = StringIndexer(inputCol="salary", outputCol="salary_encoded")
data = salary_indexer.fit(data).transform(data)
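
By default, `StringIndexer` orders labels by descending frequency, so the most frequent label receives index 0.0. The indices reflect how common a label is, not any natural order such as low < medium < high. The plain-Python sketch below imitates that ordering on a hypothetical toy column. The helper name `frequency_desc_index` is made up for illustration, and breaking ties alphabetically is an assumption of this sketch.

```python
from collections import Counter

def frequency_desc_index(values):
    """Map each label to an index, most frequent label first."""
    counts = Counter(values)
    # Sort by descending count; ties are broken alphabetically (an assumption of this sketch)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

# Hypothetical toy column, not the real salary distribution
salary = ["low", "medium", "low", "high", "medium", "low"]
mapping = frequency_desc_index(salary)
encoded = [mapping[s] for s in salary]
print(mapping)
print(encoded)
```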

In [115]:
data.show(5)

+------------+----------+------------+-------------------+--------------+------------+--------+---------+----------+------+------------------+--------------+
|satisfaction|evaluation|projectCount|averageMonthlyHours|yearsAtCompany|workAccident|turnover|promotion|department|salary|department_encoded|salary_encoded|
+------------+----------+------------+-------------------+--------------+------------+--------+---------+----------+------+------------------+--------------+
|        0.38|      0.53|           2|                157|             3|           0|       1|        0|     sales|   low|               0.0|           0.0|
|         0.8|      0.86|           5|                262|             6|           0|       1|        0|     sales|medium|               0.0|           1.0|
|        0.11|      0.88|           7|                272|             4|           0|       1|        0|     sales|medium|               0.0|           1.0|
|        0.72|      0.87|           5|              

In [127]:
# Assemble the feature columns into a single vector
assembler = VectorAssembler(inputCols=['satisfaction','evaluation','projectCount','averageMonthlyHours','yearsAtCompany','workAccident',
                                       'promotion','department_encoded','salary_encoded'], outputCol="features")
data = assembler.transform(data)

In [128]:
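# Split into training and test sets
train_data, test_data = data.randomSplit([0.8, 0.2], seed=123)

### Decision Tree Model

In [129]:
# Initialize the decision tree model
dt = DecisionTreeClassifier(labelCol="turnover", featuresCol="features")

In [131]:
# Set the decision tree parameters
dt.setMaxDepth(5)  # maximum depth
dt.setMaxBins(32)  # maximum number of bins
dt.setImpurity("entropy")  # use entropy as the impurity measure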

DecisionTreeClassifier_321d9f997731
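
With the impurity set to entropy, the tree picks the split that yields the largest information gain. As a rough illustration, the sketch below computes the entropy of a node and the gain of one candidate split in plain Python. The class counts are hypothetical and the base-2 logarithm is a choice of this sketch.

```python
import math

def entropy(counts):
    """Shannon entropy (in bits) of a node, given its class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of its children."""
    total = sum(parent)
    weighted = sum(sum(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted

# Hypothetical node with 762 stayers and 238 leavers, split into two children
parent = [762, 238]
children = [[700, 50], [62, 188]]
gain = information_gain(parent, children)
print("parent entropy:", entropy(parent))
print("information gain:", gain)
```

A split that separates the two classes well leaves children with low entropy, which makes the gain large.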

In [132]:
# Train the decision tree model
model = dt.fit(train_data)

In [133]:
# Predict on the test set
predictions = model.transform(test_data)

In [139]:
# Evaluate the model: accuracy
evaluator = MulticlassClassificationEvaluator(labelCol="turnover", metricName="accuracy")
accuracy = evaluator.evaluate(predictions)

print("Accuracy:", accuracy)

Accuracy: 0.9674449194343966


In [141]:
# Evaluate the model: area under the ROC curve
evaluator = BinaryClassificationEvaluator(labelCol="turnover", metricName='areaUnderROC')
auc = evaluator.evaluate(predictions)

print("AUC:", auc)

AUC: 0.9250518797782842
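
Accuracy depends on a fixed decision threshold, whereas AUC measures how well the model's scores rank employees who left above those who stayed. One way to read it is as the share of (left, stayed) pairs that are ordered correctly. The sketch below computes that share directly in plain Python. The labels and scores are hypothetical, and the quadratic loop is only meant for small examples.

```python
def auc_by_pairs(labels, scores):
    """AUC as the share of (positive, negative) pairs where the positive scores higher."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a correctly ordered pair
    return wins / (len(positives) * len(negatives))

# Hypothetical labels (1 = left) and predicted scores for leaving
labels = [1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.3, 0.4, 0.2, 0.1, 0.3]
auc_toy = auc_by_pairs(labels, scores)
print("toy AUC:", auc_toy)
```

One caveat, stated as my understanding of Spark rather than something shown in this notebook: for a single decision tree the `rawPrediction` column holds per-leaf class counts rather than normalized probabilities. It may therefore be worth also evaluating with `rawPredictionCol="probability"` and comparing the two AUC values.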


### Random Forest Model

In [143]:
# Initialize the random forest model
rf = RandomForestClassifier(labelCol="turnover", featuresCol="features")

In [145]:
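# Set the random forest parameters
rf.setNumTrees(20)  # number of trees
rf.setMaxDepth(5)  # maximum depth
rf.setMaxBins(32)  # maximum number of bins
rf.setMinInstancesPerNode(1)  # minimum number of instances per node
rf.setMinInfoGain(0.0)  # minimum information gain
rf.setFeatureSubsetStrategy("auto")  # feature subset strategy
rf.setImpurity("entropy")  # use entropy as the impurity measure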

RandomForestClassifier_a0be7f314079

In [146]:
# Train the random forest model
model_rf = rf.fit(train_data)

In [149]:
# Predict on the test set
predictions_rf = model_rf.transform(test_data)

In [150]:
# Evaluate the model: accuracy
evaluator = MulticlassClassificationEvaluator(labelCol="turnover", metricName="accuracy")
accuracy = evaluator.evaluate(predictions_rf)

print("Accuracy:", accuracy)

Accuracy: 0.9697467938178231


In [151]:
# Evaluate the model: area under the ROC curve
evaluator = BinaryClassificationEvaluator(labelCol="turnover", metricName='areaUnderROC')
auc = evaluator.evaluate(predictions_rf)

print("AUC:", auc)

AUC: 0.9790629514013165


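* The random forest outperforms the decision tree on the test set, most clearly in AUC (about 0.979 vs. 0.925); accuracy is only marginally higher (about 0.970 vs. 0.967)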
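
Because roughly 76% of employees stayed, accuracy is easier to interpret against a majority-class baseline, meaning a model that always predicts "stayed". The sketch below derives that baseline from the row count and turnover rate printed earlier. It uses the full dataset rather than the test split, so treat the baseline as an approximation.

```python
# Figures printed earlier in the notebook
total_rows = 14999
turnover_rate = 0.2380825388359224

left_count = round(total_rows * turnover_rate)
stayed_count = total_rows - left_count

# Always predicting "stayed" is right for every stayer and wrong for every leaver
baseline_accuracy = stayed_count / total_rows

# Test-set accuracies reported for the two models
dt_accuracy = 0.9674449194343966
rf_accuracy = 0.9697467938178231

print(f"baseline accuracy: {baseline_accuracy:.4f}")
print(f"decision tree gain over baseline: {dt_accuracy - baseline_accuracy:.4f}")
print(f"random forest gain over baseline: {rf_accuracy - baseline_accuracy:.4f}")
```

Both models improve on the baseline by about 20 percentage points, while the gap between the two models in accuracy is well under one point.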