# 【Pyspark】逻辑回归预测银行客户是否会开设定期存款账户

# 一、项目介绍

* 逻辑回归预测银行客户是否会开设定期存款账户，采用多次建模进行投票的方式解决数据不均衡的问题

# 二、分析概述

* 购买定期存款的客户的平均年龄高于未购买定期存款的客户的平均年龄  

* 购买定期存款的客户的pdays（自上次联系客户以来的天数）较低。pdays越低，最后一次通话的记忆越好，因此销售的机会就越大  

* 逻辑回归准确率为0.91

# 三、数据预处理

In [1]:
import findspark
import result as result

findspark.init()

In [164]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import when, col, mean, monotonically_increasing_id, rand
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler, StandardScaler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml import Pipeline
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from functools import reduce
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

In [3]:
# 创建 SparkSession
spark = SparkSession.builder.appName("LogisticRegression").master("local[*]").getOrCreate()

In [4]:
# 读取 CSV 数据
# header = True 代表读入数据的第一行是列名
# inferSchema=True 自动推断每列数据的数据类型，设为False就都当成字符串类型了
data = spark.read.csv("banking.csv", header=True, inferSchema=True)

In [5]:
data.show()

+---+-----------+--------+-----------------+-------+-------+----+---------+-----+-----------+--------+--------+-----+--------+-----------+------------+--------------+-------------+---------+-----------+---+
|age|        job| marital|        education|default|housing|loan|  contact|month|day_of_week|duration|campaign|pdays|previous|   poutcome|emp_var_rate|cons_price_idx|cons_conf_idx|euribor3m|nr_employed|  y|
+---+-----------+--------+-----------------+-------+-------+----+---------+-----+-----------+--------+--------+-----+--------+-----------+------------+--------------+-------------+---------+-----------+---+
| 44|blue-collar| married|         basic.4y|unknown|    yes|  no| cellular|  aug|        thu|     210|       1|  999|       0|nonexistent|         1.4|        93.444|        -36.1|    4.963|     5228.1|  0|
| 53| technician| married|          unknown|     no|     no|  no| cellular|  nov|        fri|     138|       1|  999|       0|nonexistent|        -0.1|          93.2|      

In [6]:
# 删除 data 中包含空值的行
data = data.na.drop()

In [7]:
# 删除重复数据
data = data.dropDuplicates()

In [9]:
# 剩余数据的数量
data.count()

41176

In [10]:
data.columns

['age',
 'job',
 'marital',
 'education',
 'default',
 'housing',
 'loan',
 'contact',
 'month',
 'day_of_week',
 'duration',
 'campaign',
 'pdays',
 'previous',
 'poutcome',
 'emp_var_rate',
 'cons_price_idx',
 'cons_conf_idx',
 'euribor3m',
 'nr_employed',
 'y']

In [13]:
# education的不同取值
data.select('education').distinct().collect()

[Row(education='high.school'),
 Row(education='unknown'),
 Row(education='basic.6y'),
 Row(education='professional.course'),
 Row(education='university.degree'),
 Row(education='illiterate'),
 Row(education='basic.4y'),
 Row(education='basic.9y')]

In [16]:
# 更新 education 列 数据
data = data.withColumn("education", when(data["education"] == "basic.4y", "Basic")
                                .when(data["education"] == "basic.6y", "Basic")
                                .when(data["education"] == "basic.9y", "Basic")
                                .otherwise(data["education"]))

In [18]:
# 观察标签列的数据分布
y_counts_df = data.groupBy("y").count().orderBy("y")
y_counts_df.show()

+---+-----+
|  y|count|
+---+-----+
|  0|36537|
|  1| 4639|
+---+-----+



* 数据处于极度不平衡的状态

In [20]:
count_no_sub = y_counts_df.filter(col("y") == 0).select("count").first()[0]
count_sub = y_counts_df.filter(col("y") == 1).select("count").first()[0]
pct_of_no_sub = count_no_sub / (count_no_sub + count_sub)
pct_of_sub = count_sub / (count_no_sub + count_sub)
print('未开户的百分比：%.2f%%' %(pct_of_no_sub * 100))
print('开户的百分比：%.2f%%' %(pct_of_sub * 100))

未开户的百分比：88.73%
开户的百分比：11.27%


## 探索分析

In [30]:
grouped_mean_df1 = data.groupBy("y").agg(
    mean("age").alias("mean_age"),
    mean("duration").alias("mean_duration"),
    mean("campaign").alias("mean_campaign"),
    mean("pdays").alias("mean_pdays"),
    mean("previous").alias("mean_previous")
)

In [31]:
grouped_mean_df2 = data.groupBy("y").agg(
    mean("emp_var_rate").alias("mean_emp_var_rate"),
    mean("cons_price_idx").alias("mean_cons_price_idx"),
    mean("cons_conf_idx").alias("mean_cons_conf_idx"),
    mean("euribor3m").alias("mean_euribor3m"),
    mean("nr_employed").alias("mean_nr_employed")
)

In [32]:
grouped_mean_df1.show()

+---+-----------------+------------------+------------------+-----------------+-------------------+
|  y|         mean_age|     mean_duration|     mean_campaign|       mean_pdays|      mean_previous|
+---+-----------------+------------------+------------------+-----------------+-------------------+
|  1|40.91226557447726| 553.2560896744989| 2.051950851476611|791.9909463246389|0.49277861608105195|
|  0|39.91099433451022|220.86807893368368|2.6333853354134167|984.1093959547856|0.13241371760133563|
+---+-----------------+------------------+------------------+-----------------+-------------------+



In [33]:
grouped_mean_df2.show()

+---+-------------------+-------------------+------------------+------------------+------------------+
|  y|  mean_emp_var_rate|mean_cons_price_idx|mean_cons_conf_idx|    mean_euribor3m|  mean_nr_employed|
+---+-------------------+-------------------+------------------+------------------+------------------+
|  1|-1.2330890278077151|  93.35457684845879|-39.79111877559823|2.1233617158870453| 5095.120068980305|
|  0|0.24888469222979895|  93.60379779400651|-40.59323151873497|3.8114816213702474|5176.1656895749975|
+---+-------------------+-------------------+------------------+------------------+------------------+



### 观察：

* 购买定期存款的客户的平均年龄高于未购买定期存款的客户的平均年龄  
* 购买定期存款的客户的pdays（自上次联系客户以来的天数）较低。pdays越低，最后一次通话的记忆越好，因此销售的机会就越大  
* 此次活动期间购买定期存款的客户的销售通话次数（campaign）相比未购买的较低  
* 需进一步计算其他特征值（如教育和婚姻状况）的分布，以便更加详细地了解数据

# 四、逻辑回归建模

## 不均衡数据集处理

* 由于数据集极度不平衡，为解决这个问题，将负样本随机分为7份，与正样本分别构建7个逻辑回归模型，模型构建完成后，使用投票法对测试集进行预测，评估建模效果

In [100]:
# 划分测试集和训练集
(train, test) = data.randomSplit([0.8, 0.2], seed=12345)

In [101]:
# 查看训练集中正负样本的比例
y_counts_df = train.groupBy("y").count().orderBy("y")
y_counts_df.show()
count_no_sub = y_counts_df.filter(col("y") == 0).select("count").first()[0]
count_sub = y_counts_df.filter(col("y") == 1).select("count").first()[0]
pct_of_no_sub = count_no_sub / (count_no_sub + count_sub)
pct_of_sub = count_sub / (count_no_sub + count_sub)
print('未开户的百分比：%.2f%%' %(pct_of_no_sub * 100))
print('开户的百分比：%.2f%%' %(pct_of_sub * 100))

+---+-----+
|  y|count|
+---+-----+
|  0|29186|
|  1| 3745|
+---+-----+

未开户的百分比：88.63%
开户的百分比：11.37%


In [102]:
# 训练集数据量
train.count()

32931

In [104]:
# 分离训练集正负样本
positive_df = train.filter(col('y') == 1)
negative_df = train.filter(col('y') == 0)

In [107]:
# 将负样本随机分为7份
negative_df = negative_df.withColumn("random", (rand(seed=12345) * 7).cast("int"))
negative_dfs = [negative_df.filter(col("random") == i) for i in range(7)]

## 处理离散特征

In [108]:
data.dtypes

[('age', 'double'),
 ('job', 'string'),
 ('marital', 'string'),
 ('education', 'string'),
 ('default', 'string'),
 ('housing', 'string'),
 ('loan', 'string'),
 ('contact', 'string'),
 ('month', 'string'),
 ('day_of_week', 'string'),
 ('duration', 'int'),
 ('campaign', 'int'),
 ('pdays', 'int'),
 ('previous', 'int'),
 ('poutcome', 'string'),
 ('emp_var_rate', 'double'),
 ('cons_price_idx', 'double'),
 ('cons_conf_idx', 'double'),
 ('euribor3m', 'double'),
 ('nr_employed', 'double'),
 ('y', 'int')]

In [109]:
# 分离出 data 中的离散型变量和连续型变量
string_columns = [field for (field, dtype) in data.dtypes if dtype == "string"]
numeric_columns = [field for (field, dtype) in data.dtypes if dtype != "string" and field != "y"]

# 对离散变量进行独热编码
indexers = [StringIndexer(inputCol=column, outputCol=column+"_index") for column in string_columns]
encoders = [OneHotEncoder(inputCol=column+"_index", outputCol=column+"_vec") for column in string_columns]

## 处理连续型特征

In [110]:
# 合并数值型特征列
assembler_numeric = VectorAssembler(inputCols=numeric_columns, outputCol="features_numeric")

In [111]:
# 将每个数值列转换为向量列
vector_assemblers = [VectorAssembler(inputCols=[col], outputCol=col+"_vec") for col in numeric_columns]

# 分别对每个特征列进行标准化
scalers = [StandardScaler(inputCol=col+"_vec", outputCol=col+"_scaled") for col in numeric_columns]

## 模型搭建

In [113]:
# 合并所有特征到一个向量中
assembler_inputs = [c + "_vec" for c in string_columns] + [col + "_scaled" for col in numeric_columns]
assembler = VectorAssembler(inputCols=assembler_inputs, outputCol="features")

In [114]:
# 创建逻辑回归模型
lr = LogisticRegression(featuresCol="features", labelCol="y")

In [115]:
# 构建Pipeline
pipeline = Pipeline(stages=indexers + encoders + vector_assemblers + scalers + [assembler, lr])

In [116]:
# 设置参数网格
paramGrid = ParamGridBuilder() \
    .addGrid(lr.regParam, [0.1, 0.01]) \
    .addGrid(lr.maxIter, [50, 100]) \
    .addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0]) \
    .build()

In [117]:
# 设置评估器
evaluator = BinaryClassificationEvaluator(labelCol='y')

In [118]:
# 设置交叉验证
crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=evaluator,
                          numFolds=5)  # 使用5折交叉验证

In [122]:
# 训练模型
models = []
for i in range(7):
    combined_df = positive_df.union(negative_dfs[i].drop("random"))
    # 运行交叉验证并选择最佳参数
    cvModel = crossval.fit(data)
    models.append(cvModel)

In [123]:
models

[CrossValidatorModel_67e6a69caa32,
 CrossValidatorModel_a9ed59c20cdf,
 CrossValidatorModel_679d73782444,
 CrossValidatorModel_ec1f68105b27,
 CrossValidatorModel_1ee2546b6b76,
 CrossValidatorModel_5d13d17c1862,
 CrossValidatorModel_1061dea5bd39]

In [99]:
data.count()

41176

In [134]:
# 显示每个最佳模型参数
for i,cvModel in enumerate(models):  
    bestModel = cvModel.bestModel
    bestLRModel = bestModel.stages[-1]
    output_text = "Model {} 的最优参数为：regParam={}, maxIter={}, elasticNetParam={}".format(i, bestLRModel.getRegParam(), bestLRModel.getMaxIter(), bestLRModel.getElasticNetParam())
    print(output_text)

Model 0 的最优参数为：regParam=0.01, maxIter=50, elasticNetParam=0.5
Model 1 的最优参数为：regParam=0.01, maxIter=50, elasticNetParam=0.5
Model 2 的最优参数为：regParam=0.01, maxIter=50, elasticNetParam=0.5
Model 3 的最优参数为：regParam=0.01, maxIter=50, elasticNetParam=0.5
Model 4 的最优参数为：regParam=0.01, maxIter=50, elasticNetParam=0.5
Model 5 的最优参数为：regParam=0.01, maxIter=50, elasticNetParam=0.5
Model 6 的最优参数为：regParam=0.01, maxIter=50, elasticNetParam=0.5


In [151]:
test_new = test.withColumn("incremental_id", monotonically_increasing_id())

In [152]:
# 分别用训练好的7个模型完成对测试集的预测
predictions = [model.transform(test_new).select("incremental_id", "prediction") for model in models]

In [159]:
# 投票法计算最终结果
def majority_vote(df1, df2):
    return df1.join(df2.withColumnRenamed("prediction", "prediction2"), on="incremental_id", how="inner")\
                .withColumn("prediction3", col("prediction") + col("prediction2"))\
                .select("prediction3", "incremental_id" )\
                .withColumnRenamed("prediction3", "prediction")

In [161]:
# 计算投票结果
vote_result = reduce(majority_vote, predictions)
final_prediction = vote_result.withColumn("majority_vote", (col("prediction") > 3.5).cast("int"))

In [167]:
result.dtypes

[('majority_vote', 'int'), ('y', 'int')]

In [169]:
# 计算最终预测结果在测试集上的准确率
result = final_prediction.join(test_new, on="incremental_id", how="inner").select("majority_vote", "y").withColumn("prediction", col("majority_vote").cast("double"))
evaluator = MulticlassClassificationEvaluator(labelCol="y", predictionCol="prediction", metricName="accuracy")

# 计算准确率
accuracy = evaluator.evaluate(result)

In [170]:
accuracy

0.9142510612492419

* 预测准确率达到91%以上，预测效果较好