# <center> [Kaggle] Telco Customer Churn: Telecom Customer Churn Prediction Case Study

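## <font face="仿宋">Introduction to Part 3

&emsp;&emsp;<font face="仿宋">In Part 2 of this case study we covered the common feature transformation methods in detail. Some of them are required for model training, such as ordinal encoding and one-hot encoding; others aim primarily at improving data quality and usually serve as optional steps during model optimization, such as binning continuous variables and standardization. We then built two fairly interpretable models for churn prediction, logistic regression and a decision tree, walked through practical tuning techniques for both, and, once the final models were trained, discussed how to interpret their results.

&emsp;&emsp;<font face="仿宋">In theory, tree models have stronger discriminative power than logistic regression, yet the results at the end of the previous section showed no significant difference between the two: both reached an accuracy of roughly 79%-80%. This suggests that many of the samples logistic regression misclassifies are also misclassified by the decision tree, and we therefore suspect this is a dataset that is "easy to get started with but hard to master". Moreover, "stronger" ensemble learners such as random forest, XGB, and CatBoost also do not differ much from logistic regression on this dataset, so we need feature engineering to improve the quality of the dataset and, in turn, the final model.

&emsp;&emsp;<font face="仿宋">Of course, even if complex models did perform better on this dataset, improving data quality through feature engineering would still be an indispensable part of optimizing the results. As the popular saying goes, "data quality determines the upper bound of a model, and modeling merely approaches that bound." In both industry practice and top-level competitions, the feature engineering methods that improve data quality have become the most important means of improving model performance.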

<center><img src="https://tva1.sinaimg.cn/large/008i3skNly1gwllgk4wgqj31hr0u0wh4.jpg" alt="image-20211112170651500" style="zoom:15%;" />

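&emsp;&emsp;<font face="仿宋">That said, improving data quality through feature engineering sounds simple but is not easy in practice. The difficulty does not lie in understanding the individual techniques; compared with the theory behind machine learning algorithms, many feature engineering methods are not complicated. The real difficulty lies in choosing methods that suit the model and the data, and in implementing and deploying them in an engineering setting. On the one hand, there are many feature engineering methods and they must be adapted to the situation at hand, yet data varies endlessly, so an effective strategy often has to draw on the conclusions of data exploration, the modeler's own experience, and familiarity with the candidate methods. On the other hand, unlike machine learning algorithms, many feature engineering methods have no ready-made library to call; many of them, especially those customized for the current dataset, must be implemented by hand, which places higher demands on the modeler's coding and engineering skills. In short, feature engineering is a highly practice-driven skill, which is why this course introduces the common feature engineering methods alongside the case study.

&emsp;&emsp;<font face="仿宋">Broadly speaking, all adjustments made to a dataset can be regarded as part of feature engineering, including the missing-value imputation, data encoding, and feature transformations introduced earlier; all of these improve data quality to some extent. Starting with this section, we will spend an entire section on another family of feature engineering methods: feature derivation and feature selection. Feature derivation creates more features to provide more dimensions along which patterns in the data can be captured, thereby improving model performance. It is also widely regarded as one of the most effective ways to significantly improve dataset quality.

# <center>Part 3. Feature Derivation and Feature Selection

&emsp;&emsp;Starting from this section, we focus on feature derivation and feature selection and use them to further improve model performance. First, import in one place the third-party libraries used in the earlier steps: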

In [1]:
# Core data science libraries
import numpy as np
import pandas as pd

# Visualization libraries
import seaborn as sns
import matplotlib.pyplot as plt

# Time module
import time

# scikit-learn
# Data preprocessing
from sklearn import preprocessing
from sklearn.compose import ColumnTransformer

# Utility functions
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Common estimators
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier

# Grid search
from sklearn.model_selection import GridSearchCV

# Base classes for custom estimators
from sklearn.base import BaseEstimator, TransformerMixin

# Custom module
from telcoFunc import *

# inspect and re modules
import inspect, re

from tqdm import tqdm
import gc

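Here telcoFunc is a custom module that stores the functions and classes defined earlier; new functions and classes will be added to it as we go. The telcoFunc.py file is provided with the course materials and must be placed in the same folder as the current notebook file in order to be imported.

&emsp;&emsp;Next, load the data and apply the data cleaning steps from Part 1.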

In [2]:
# Load the data
tcc = pd.read_csv('WA_Fn-UseC_-Telco-Customer-Churn.csv')

# Mark continuous and categorical fields
# Categorical fields
category_cols = ['gender', 'SeniorCitizen', 'Partner', 'Dependents',
                'PhoneService', 'MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup', 
                'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling',
                'PaymentMethod']

# Continuous fields
numeric_cols = ['tenure', 'MonthlyCharges', 'TotalCharges']

# Label
target = 'Churn'

# ID column
ID_col = 'customerID'

# Verify that every column has been assigned (features + ID + label)
assert len(category_cols) + len(numeric_cols) + 2 == tcc.shape[1]

# Convert the continuous fields (blank strings in TotalCharges become NaN)
tcc['TotalCharges'] = tcc['TotalCharges'].apply(lambda x: x if x != ' ' else np.nan).astype(float)
tcc['MonthlyCharges'] = tcc['MonthlyCharges'].astype(float)

# Fill missing values
tcc['TotalCharges'] = tcc['TotalCharges'].fillna(0)

# Manually convert the label to 0/1
# (assignment instead of a chained inplace replace, which newer pandas versions warn about)
tcc['Churn'] = tcc['Churn'].map({'Yes': 1, 'No': 0})

In [3]:
features = tcc.drop(columns=[ID_col, target]).copy()
labels = tcc['Churn'].copy()
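
As an aside, the blank-string handling in the `TotalCharges` conversion above can also be written with `pd.to_numeric`. The following is a minimal sketch on a toy Series (the values are made up for illustration); it is an alternative spelling, not a change to the course's pipeline:

```python
import pandas as pd

# Toy stand-in for the raw TotalCharges column, which is read as strings
# and uses a single space for missing values (illustrative values only).
raw = pd.Series(['29.85', ' ', '1889.5'])

# errors='coerce' turns anything that cannot be parsed into NaN,
# so the blank entry becomes NaN and can then be filled with 0.
total_charges = pd.to_numeric(raw, errors='coerce').fillna(0)

print(total_charges.tolist())
```

Filling with 0 mirrors the choice made in the cell above; whether 0 is the right fill value depends on what a blank `TotalCharges` means in the data.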

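The data can now be used directly for feature derivation.

&emsp;&emsp;Next, we summarize all the feature derivation methods introduced in this part. The summary has two parts. The first collects the code of all the functions defined so far and wraps them at a higher level, adding some practical features and simplifying how each function is called. The second summarizes the scenarios in which each feature derivation method applies, lays out the strategies worth considering in each scenario, and finally completes feature derivation on the Telco dataset.

## 6. Feature Derivation in Practice

&emsp;&emsp;With the feature derivation functions wrapped up, we now call them to derive features for the case dataset and test how the derived features actually perform in modeling.

#### - The core tension in feature derivation

&emsp;&emsp;Although we know that feature derivation works, we often do not know which method will produce useful features. Moreover, because there are so many derivation methods, and in theory even a modest combination of them can create a practically "unlimited" number of features, we face a classic problem: how to trade off the effectiveness of feature derivation against its efficiency. In theory, the more features we derive, the more useful derived features there may be and the better the final result; but feature derivation (together with the subsequent feature selection and model validation) consumes a great deal of compute, and we cannot keep trying indefinitely. So before deriving any features, the first question to consider is how to balance effectiveness and efficiency.

> Of course, if compute were unlimited, we could recombine feature information over as wide a range as possible and extend toward arbitrarily high orders. But first, we never have unlimited compute; and second, even if we did, the efficiency of searching high-order derived features drops rapidly as the order rises (the higher the order, the less useful information a derived feature carries). The trade-off between effectiveness and efficiency is therefore unavoidable in feature derivation.

&emsp;&emsp;How do we strike this balance? It takes some "prior" judgment: with the right approach and the right methods, we can find the derived features that most improve the model in the shortest possible time; otherwise we get half the result for twice the effort. As with any question of approach, the next step usually has to be decided from the actual state of the data, and how well the derivation works and how efficiently it runs ultimately tests the practical modeling experience of the algorithm engineer.

&emsp;&emsp;For this reason, in the rest of the case study we will try as many methods as possible to help you build practical experience quickly, and we will also summarize approaches and frameworks for feature derivation to help you find better derivation methods more efficiently.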
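
To make the "unlimited features" point concrete, here is a small back-of-the-envelope sketch. It only counts how many column combinations exist for the 16 categorical columns listed in `category_cols`; the number of one-hot columns actually produced is larger, because each combination expands into one column per observed level pair.

```python
from math import comb

n_categorical = 16  # number of columns in category_cols

pairs = comb(n_categorical, 2)    # two-way column combinations
triples = comb(n_categorical, 3)  # three-way column combinations
quads = comb(n_categorical, 4)    # four-way column combinations

print(pairs, triples, quads)
```

The count grows from 120 pairs to 560 triples and 1820 four-way combinations, which is why the order of combination has to be chosen deliberately rather than exhaustively.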

In [4]:
# Import the feature derivation module
import features_creation as fc
from features_creation import *

In [5]:
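fc?

Type:        module
String form: <module 'features_creation' from 'D:\\Work\\jupyter\\telco\\正式课程\\features_creation.py'>
File:        d:\work\jupyter\telco\正式课程\features_creation.py
Docstring:  
Automated batch feature derivation module

It consists of four parts:
Part 1. Third-party libraries
Part 2. Functions implementing the basic methods
Part 3. Helper functions for the high-level functions
Part 4. High-level functions

In practice, the high-level functions are the ones most often called for automated batch feature derivation.
They handle feature derivation for the training set and the test set separately,
and support extras such as automatic completion of test-set features and target encoding.
They include:

Cross-combination feature derivation function:
Cross_Combination

Polynomial feature derivation function:
Polynomial_Features

Group-statistics feature derivation function:
Group_Statistics

Target encoding function:
Target_Encode

Time-series feature derivation function:
timeSeries_Creation

NLP feature derivation function:
NLP_Group_Stat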


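#### - The general order of feature derivation

&emsp;&emsp;First, based on the feature derivation methods discussed earlier and the relationships among them, we can lay out a basic order in which to apply them:

- Stage 1. Time-series feature derivation

&emsp;&emsp;If the dataset contains time-series data, we can derive time-series features first. This step does not depend on any other feature, and the derived features can serve as candidates for the later cross-combination or group-statistics steps.

- Stage 2. Polynomial feature derivation

&emsp;&emsp;Next comes polynomial feature derivation. It is usually suitable only for continuous variables or ordinal variables with many levels, and in practice the magnitude of the derived features needs attention: if their absolute values are too large, they should be standardized. Note that the standardization only needs to be applied to the derived features, not to the original ones. As for the degree, 2 or 3 is usually worth considering; if there are many continuous features and many of them are informative, higher-degree polynomial derivation can be considered.

- Stage 3. Cross-combination feature derivation

&emsp;&emsp;After time-series and polynomial derivation, we can move on to cross-combination feature derivation. The sparsity of the derived features grows with the number of features combined, that is, the more features take part in a combination, the smaller the share of useful derived features. We therefore start with pairwise combinations and only then consider three-way or even four-way combinations. Also note that because cross combination is a very general and computationally cheap derivation method, we normally apply pairwise combination to all original categorical variables and to some time-series-derived fields (those with few levels); whether three-way combination is needed depends on how the pairwise results turn out.

- Stage 4. Group-statistics feature derivation

&emsp;&emsp;Because group statistics often group by the results of cross combination, this step usually comes after cross combination. Polynomial-derived features can also serve as the features being aggregated, so group statistics are usually computed only after both cross combination and polynomial derivation.

&emsp;&emsp;Group-statistics derivation is both the most important derivation step (it may yield a large number of useful features) and the most complex one. Whether for ordinary group statistics or for target encoding, the most important thing is to identify the right key variable (keyCol). The key variable can be a single original variable, a derived time-series field, or a cross-combination of two (or more) variables. Choosing keyCol is not simple, and we discuss it in detail later in practice.

&emsp;&emsp;In addition, group statistics and target encoding should be handled separately: we generally validate suitable keyCol candidates during the group-statistics stage and then, in the target-encoding stage, compute grouped statistics of the label directly with the keyCol already chosen.

- Stage 5. NLP feature derivation

&emsp;&emsp;NLP feature derivation can be seen as an extension of group-statistics derivation, though not every dataset is suited to it. It is also largely independent of the other methods: when a suitable situation arises, run the NLP method on its own, without much interplay with the preceding methods. The key is to judge whether the current dataset is suitable for NLP feature derivation.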
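
The role of keyCol in Stage 4 can be sketched on toy data as follows. This is a simplified illustration with plain pandas and made-up values; it is not the `Group_Statistics` or `Target_Encode` implementation from `features_creation`, and the derived column names are chosen here only for readability.

```python
import pandas as pd

# Toy training and test data (values are illustrative only)
train = pd.DataFrame({
    'Contract': ['Month-to-month', 'Month-to-month', 'Two year', 'Two year'],
    'MonthlyCharges': [80.0, 60.0, 20.0, 30.0],
    'Churn': [1, 0, 0, 0],
})
test = pd.DataFrame({'Contract': ['Two year', 'Month-to-month']})

key_col = 'Contract'  # the grouping variable (keyCol)

# Group statistics: aggregate a continuous feature within each group,
# using the training set only.
stats = (train.groupby(key_col)['MonthlyCharges']
              .agg(['mean', 'max'])
              .add_prefix(f'MonthlyCharges_{key_col}_'))

# Target encoding: the same idea applied to the label.
stats[f'Churn_{key_col}_mean'] = train.groupby(key_col)['Churn'].mean()

# Map the training-set statistics onto both sets through the key column.
train_new = train[[key_col]].join(stats, on=key_col)
test_new = test[[key_col]].join(stats, on=key_col)

print(test_new)
```

Computing the statistics on the training set and only mapping them onto the test set keeps test information out of the features. For target encoding in particular, practitioners often go further and compute the training-set values out of fold, since the label itself is being aggregated.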

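&emsp;&emsp;Next we put this workflow into practice on the Telco dataset. Since we have not yet introduced other feature selection techniques or the more industrial modeling workflows built on ensemble learning, we continue for now to assess feature usefulness with correlation coefficients and to test the actual effect of new features with the new-feature evaluation function defined earlier.

### 1. Feature analysis

&emsp;&emsp;Before applying any derivation method, we briefly review the features of the dataset, that is, we analyze the original dataset to sort out the basic approach and settle on an overall derivation strategy.

<center><img src="https://tva1.sinaimg.cn/large/008i3skNly1gvyrhwvl01j30x40u0769.jpg" alt="1" style="zoom:70%;" />

&emsp;&emsp;The dataset requires no multi-table joins, so we only need to analyze the original table. The usual procedure is to start from the label and any special features. The label here is a binary categorical variable, so target encoding for a categorical label is an option. The dataset also contains a time-series feature: although its granularity is very coarse (accurate only to the month, by assumption), it can still be split into year, month, and quarter periods. As for key time points, the original time scale is already a difference from the start/end date, so we only need to consider adding some peak- and off-season key time points. Furthermore, from the earlier analysis it is easy to see that the original dataset contains many fields that are related to one another, describe similar things, and complement one another, such as the four demographic fields (gender, senior-citizen status, partner, dependents) and the fields on subscribed services (online backup, online security, device protection, and so on). These categorical fields are well suited to NLP-style feature derivation, TF-IDF in particular.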
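
To illustrate why related service flags lend themselves to TF-IDF, the sketch below treats each customer as a "document" and each subscribed service as a "term". It uses scikit-learn's `TfidfTransformer` on a made-up 0/1 matrix; it is not the `NLP_Group_Stat` function from `features_creation`, whose internals are not shown in this section.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Toy 0/1 matrix: rows are customers, columns are three related service flags
# (for example OnlineSecurity, OnlineBackup, DeviceProtection); values are made up.
services = np.array([
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [0, 0, 0],
])

# Services that few customers subscribe to receive a larger weight than
# services that almost everyone has.
tfidf = TfidfTransformer()
weights = tfidf.fit_transform(services).toarray()

print(np.round(weights, 3))
```

In the third row all three services are present, yet the rarest one (last column) gets the largest weight, which is the extra information TF-IDF adds over the raw flags.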

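&emsp;&emsp;Now for the remaining features. Overall, the dataset has many categorical fields, so it is suitable for large-scale cross combination; these categorical fields, together with the fields produced by first-order cross combination, can in turn serve as grouping keys for group-statistics derivation. Given that the size of the dataset may not suit large-scale high-order derivation, we can start with "extension" feature derivation on the continuous variables only (our initial judgment is that the amount spent may largely reflect customer loyalty); whether to go on to high-order derivation around other fields can be decided from the results of further analysis.

&emsp;&emsp;Finally, let us revisit the goal of feature derivation. At this stage we hope to "recombine" more features in order to strengthen the overall feature set and improve the model. Although more derived features are generally welcome, too many can cause problems: high-order derived features tend to lose information quickly (that is, they may help the model little), and an excess of features complicates the later feature selection stage, brings the curse of dimensionality, and increases computation. How to balance the number of features, their quality, and computational efficiency therefore has to be learned through practice.

### 2. Time-series feature derivation

&emsp;&emsp;Now we actually carry out feature derivation, starting with time-series features. These derived features can be cross-combined with categorical variables (even at second order) and can also serve as grouping variables in later group statistics. For most time-series fields we can call the time-series derivation function (timeSeries_Creation in the module above) directly, but the time-series feature in the Telco dataset is a special case, and following the method introduced earlier we create the year, month, and quarter features by hand.

- Time-series feature derivation

&emsp;&emsp;First, for a more rigorous modeling workflow, we split the data into a training set and a test set: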

In [6]:
train, test = train_test_split(tcc, random_state=22)

In [7]:
X_train = train.drop(columns=[ID_col, target]).copy()
X_test = test.drop(columns=[ID_col, target]).copy()

y_train = train['Churn'].copy()
y_test = test['Churn'].copy()

In [8]:
X_train.head(5)

Unnamed: 0,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges
4067,Female,0,Yes,No,68,Yes,No,DSL,Yes,Yes,No,Yes,Yes,Yes,Two year,Yes,Bank transfer (automatic),79.6,5515.8
3306,Female,0,Yes,Yes,3,Yes,No,Fiber optic,No,Yes,No,Yes,No,No,Month-to-month,Yes,Electronic check,80.0,241.3
3391,Male,0,No,No,4,Yes,No,No,No internet service,No internet service,No internet service,No internet service,No internet service,No internet service,Month-to-month,No,Mailed check,19.0,73.45
3249,Female,0,Yes,Yes,10,Yes,No,DSL,No,Yes,Yes,No,No,No,Month-to-month,Yes,Mailed check,55.55,551.3
2674,Female,1,No,No,4,Yes,No,No,No internet service,No internet service,No internet service,No internet service,No internet service,No internet service,Month-to-month,No,Mailed check,20.05,91.45


In [9]:
X_train.shape

(5282, 19)
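
The 5282 training rows follow from the default split ratio of `train_test_split` (a 25% test set when neither `test_size` nor `train_size` is given). The sketch below reproduces the split sizes on a placeholder array, assuming the full file has 7043 rows, which is the usual size of this Kaggle dataset but is not printed above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder row indices standing in for the full dataset
# (7043 rows is an assumption about the CSV, see the note above).
rows = np.arange(7043)

train_rows, test_rows = train_test_split(rows, random_state=22)

print(len(train_rows), len(test_rows))
```

Since churn labels are imbalanced, passing `stratify=` with the label column is a common variation; it is not used in the cell above, so the results that follow correspond to the unstratified split.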

In [10]:
y_train

4067    0
3306    0
3391    0
3249    0
2674    0
       ..
5478    0
356     0
4908    1
6276    0
2933    0
Name: Churn, Length: 5282, dtype: int64

Then derive time-series features from the tenure column, separately for the training and test sets:

In [11]:
X_train_seq = pd.DataFrame()
X_test_seq = pd.DataFrame()

In [12]:
# Derive the year
X_train_seq['tenure_year'] = ((72 - X_train['tenure']) // 12) + 2014
X_test_seq['tenure_year'] = ((72 - X_test['tenure']) // 12) + 2014

In [13]:
X_train_seq.head()

| | tenure_year |
| ------ | ------ |
| 4067 | 2014 |
| 3306 | 2019 |
| 3391 | 2019 |
| 3249 | 2019 |
| 2674 | 2019 |


In [14]:
# Derive the month
X_train_seq['tenure_month'] = (72 - X_train['tenure']) % 12 + 1
X_test_seq['tenure_month'] = (72 - X_test['tenure']) % 12 + 1

In [15]:
X_train_seq.head()

| | tenure_year | tenure_month |
| ------ | ------ | ------ |
| 4067 | 2014 | 5 |
| 3306 | 2019 | 10 |
| 3391 | 2019 | 9 |
| 3249 | 2019 | 3 |
| 2674 | 2019 | 9 |


In [16]:
# Derive the quarter
X_train_seq['tenure_quarter'] = ((X_train_seq['tenure_month']-1) // 3) + 1
X_test_seq['tenure_quarter'] = ((X_test_seq['tenure_month']-1) // 3) + 1

In [17]:
X_train_seq.head()

| | tenure_year | tenure_month | tenure_quarter |
| ------ | ------ | ------ | ------ |
| 4067 | 2014 | 5 | 2 |
| 3306 | 2019 | 10 | 4 |
| 3391 | 2019 | 9 | 3 |
| 3249 | 2019 | 3 | 1 |
| 2674 | 2019 | 9 | 3 |
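
The three formulas above can be checked by hand. The sketch below wraps them in a small helper (the function name `tenure_to_period` is hypothetical and not part of the course modules) and reproduces two rows of the table. The window of 72 months starting in 2014 is the assumption already built into the cells above.

```python
def tenure_to_period(tenure, window_months=72, start_year=2014):
    """Map months of tenure to the assumed sign-up year, month and quarter."""
    months_from_start = window_months - tenure   # months elapsed since the window opened
    year = months_from_start // 12 + start_year
    month = months_from_start % 12 + 1
    quarter = (month - 1) // 3 + 1
    return year, month, quarter

print(tenure_to_period(68))  # row 4067: tenure of 68 months
print(tenure_to_period(3))   # row 3306: tenure of 3 months
```

For a tenure of 68 months, 72 - 68 = 4 months have elapsed, giving year 2014, month 5, quarter 2; for a tenure of 3 months, 69 months have elapsed, giving year 2019, month 10, quarter 4. Both agree with the table.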


Then apply one-hot encoding:

In [18]:
enc = preprocessing.OneHotEncoder()
enc.fit(X_train_seq)

OneHotEncoder()

In [19]:
seq_new = list(X_train_seq.columns)

In [20]:
# Build DataFrames with column names from the one-hot encoded arrays
X_train_seq = pd.DataFrame(enc.transform(X_train_seq).toarray(), 
                           columns = cate_colName(enc, seq_new, drop=None))

X_test_seq = pd.DataFrame(enc.transform(X_test_seq).toarray(), 
                          columns = cate_colName(enc, seq_new, drop=None))


X_train_seq.head(5)

Unnamed: 0,tenure_year_2014,tenure_year_2015,tenure_year_2016,tenure_year_2017,tenure_year_2018,tenure_year_2019,tenure_year_2020,tenure_month_1,tenure_month_2,tenure_month_3,...,tenure_month_7,tenure_month_8,tenure_month_9,tenure_month_10,tenure_month_11,tenure_month_12,tenure_quarter_1,tenure_quarter_2,tenure_quarter_3,tenure_quarter_4
0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0
2,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
3,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0


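The practical effect of these time-series-derived features was tested earlier; here we again compute their correlation with the label:

In [21]:
# First align the index with the original data
X_train_seq.index = X_train.index
X_test_seq.index = X_test.index

# Then concatenate the derived features with the label
df_temp = pd.concat([X_train_seq, y_train], axis=1)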

In [22]:
df_temp.head()

Unnamed: 0,tenure_year_2014,tenure_year_2015,tenure_year_2016,tenure_year_2017,tenure_year_2018,tenure_year_2019,tenure_year_2020,tenure_month_1,tenure_month_2,tenure_month_3,...,tenure_month_8,tenure_month_9,tenure_month_10,tenure_month_11,tenure_month_12,tenure_quarter_1,tenure_quarter_2,tenure_quarter_3,tenure_quarter_4,Churn
4067,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0
3306,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0
3391,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0
3249,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0
2674,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0


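- Validating the time-series features

&emsp;&emsp;As before, we can first use correlation coefficients for a quick check of the relationship between the derived time-series features and the label: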

In [23]:
df_corr = df_temp.corr()['Churn'].sort_values(ascending = False)

In [24]:
df_corr

Churn               1.000000
tenure_year_2019    0.320558
tenure_quarter_4    0.201279
tenure_month_12     0.196918
tenure_month_11     0.048933
tenure_month_10     0.036697
tenure_year_2018    0.025471
tenure_month_9      0.017106
tenure_month_8      0.002575
tenure_quarter_3    0.000518
tenure_month_6     -0.018544
tenure_month_3     -0.019293
tenure_month_7     -0.019341
tenure_year_2020   -0.025913
tenure_month_4     -0.035117
tenure_year_2017   -0.038261
tenure_month_5     -0.047308
tenure_quarter_2   -0.062907
tenure_year_2016   -0.065928
tenure_month_2     -0.082142
tenure_year_2015   -0.100355
tenure_month_1     -0.131205
tenure_quarter_1   -0.155000
tenure_year_2014   -0.227663
Name: Churn, dtype: float64
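
The same ranking can be obtained without concatenating the label and then skipping its own row. The sketch below uses `DataFrame.corrwith` on toy data (feature names and values are made up) and ranks by absolute correlation, so that strongly negative features are not overlooked.

```python
import pandas as pd

# Toy derived features and label (illustrative values only)
X = pd.DataFrame({
    'feat_a': [1, 1, 0, 0, 1, 0],
    'feat_b': [0, 1, 0, 1, 0, 1],
})
y = pd.Series([1, 1, 0, 0, 1, 0], name='Churn')

# Pearson correlation of every column with the label
corr = X.corrwith(y)

# Rank by absolute value and keep the strongest feature
top = corr.abs().sort_values(ascending=False).head(1).index.tolist()

print(corr)
print(top)
```

With a 0/1 feature and a 0/1 label, the Pearson coefficient is a convenient first screen, but it only measures linear association between each single column and the label.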

We pick the two features with the largest absolute correlation for testing. We again use the features_test function; the only thing to note is that the data is now split into training and test sets, so the original features_test function needs a small modification to handle the training and test data separately:

In [25]:
def features_test(features_train_new,
                  features_test_new,
                  X_train,
                  X_test,
                  y_train,
                  y_test,
                  category_cols,
                  numeric_cols):
    """
    Test function for new (derived) features
    
    :param features_train_new: derived features for the training set
    :param features_test_new: derived features for the test set
    :param X_train: training set features
    :param X_test: test set features
    :param y_train: training set labels
    :param y_test: test set labels
    :param category_cols: names of the categorical columns
    :param numeric_cols: names of the continuous columns
    :return: best cross-validated accuracy and best hyperparameters of the grid search
    """
    
    # Data preparation
    # If a single derived feature is passed as an array, convert it to a Series.
    # The training and test Series share one name so that their columns match.
    name = 'features_new'
    if isinstance(features_train_new, np.ndarray):
        features_train_new = pd.Series(features_train_new, name=name)
        
    if isinstance(features_test_new, np.ndarray):
        features_test_new = pd.Series(features_test_new, name=name)
    
    # Copy the lists of categorical and continuous column names
    category_cols = category_cols.copy()
    numeric_cols = numeric_cols.copy()

    # Align the index of the derived features with the original features
    features_train_new.index = X_train.index
    features_test_new.index = X_test.index
    
    # Concatenate the derived features with the original features
    X_train = pd.concat([X_train, features_train_new], axis=1)
    X_test = pd.concat([X_test, features_test_new], axis=1)
    
    # Decide whether each derived feature is continuous or categorical
    if isinstance(features_train_new, pd.DataFrame):
        for col in features_train_new:
            if features_train_new[col].nunique() >= 15:
                numeric_cols.append(col)
            else:
                category_cols.append(col)
    
    else:
        # A single Series: use its own name (also works when a Series is passed in directly)
        if features_train_new.nunique() >= 15:
            numeric_cols.append(features_train_new.name)
        else:
            category_cols.append(features_train_new.name)

        
    # print(category_cols)
    # Verify that every column has been assigned
    assert len(category_cols) + len(numeric_cols) == X_train.shape[1]

    # Training
    # Set up the column transformer
    logistic_pre = ColumnTransformer([
        ('cat', preprocessing.OneHotEncoder(drop='if_binary'), category_cols), 
        ('num', 'passthrough', numeric_cols)
    ])

    num_pre = ['passthrough', preprocessing.StandardScaler(), preprocessing.KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='kmeans')]

    # Instantiate the logistic regression estimator (custom class from telcoFunc)
    logistic_model = logit_threshold(max_iter=int(1e8))

    # Set up the machine learning pipeline
    logistic_pipe = make_pipeline(logistic_pre, logistic_model)

    # Set up the hyperparameter space
    logistic_param = [
        {'columntransformer__num':num_pre, 'logit_threshold__penalty': ['l1'], 'logit_threshold__C': np.arange(0.1, 1.1, 0.1).tolist(), 'logit_threshold__solver': ['saga']}, 
        {'columntransformer__num':num_pre, 'logit_threshold__penalty': ['l2'], 'logit_threshold__C': np.arange(0.1, 1.1, 0.1).tolist(), 'logit_threshold__solver': ['lbfgs', 'newton-cg', 'sag', 'saga']}, 
    ]

    # Instantiate the grid search estimator
    logistic_search = GridSearchCV(estimator = logistic_pipe,
                                   param_grid = logistic_param,
                                   scoring='accuracy',
                                   n_jobs = 12)

    # Fit and print the elapsed time
    s = time.time()
    logistic_search.fit(X_train, y_train)
    print(time.time()-s, "s")

    # Return the best cross-validated accuracy and the best hyperparameters
    return(logistic_search.best_score_, logistic_search.best_params_)

Then test the two derived features with the largest absolute correlation:

In [26]:
new_col = list(np.abs(df_corr).sort_values(ascending = False)[1: 3].index)
print(new_col)

train_new_temp = X_train_seq[new_col]
test_new_temp = X_test_seq[new_col]

features_test(train_new_temp, 
              test_new_temp, 
              X_train, 
              X_test, 
              y_train, 
              y_test, 
              category_cols, 
              numeric_cols)

['tenure_year_2019', 'tenure_year_2014']
46.666375398635864 s


(0.8097323757919785,
 {'columntransformer__num': StandardScaler(),
  'logit_threshold__C': 0.2,
  'logit_threshold__penalty': 'l1',
  'logit_threshold__solver': 'saga'})

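The time-series-derived features alone already improve the result: the mean cross-validated accuracy is about 0.8097, compared with 0.8042 on the original dataset.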
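
Note that the score returned by features_test is the mean cross-validated accuracy on the training set; the test-set arguments are passed in but are not used for scoring. If a held-out estimate is also wanted, a fitted `GridSearchCV` can score the test set directly. The sketch below shows the idea on synthetic data with a plain `LogisticRegression`; it does not use the course's `logit_threshold` class or the Telco data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (not the Telco dataset)
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=22)

search = GridSearchCV(LogisticRegression(max_iter=1000),
                      param_grid={'C': [0.1, 1.0]},
                      scoring='accuracy')
search.fit(X_tr, y_tr)

cv_score = search.best_score_          # mean cross-validated accuracy on the training set
test_score = search.score(X_te, y_te)  # accuracy of the refitted best model on held-out data

print(cv_score, test_score)
```

Reporting both numbers makes it easier to tell a genuine improvement from one that only shows up in cross-validation.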

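### 3. Polynomial feature derivation

&emsp;&emsp;Next we try polynomial feature derivation. The original dataset is fairly simple: apart from tenure, which was handled above as a time-series field, it has only two continuous variables, so we only need to consider polynomial terms of these two: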

In [27]:
numeric_cols

['tenure', 'MonthlyCharges', 'TotalCharges']

In [28]:
colNames = ['MonthlyCharges', 'TotalCharges']
colNames

['MonthlyCharges', 'TotalCharges']

In [29]:
Polynomial_Features?

Signature: Polynomial_Features(colNames, degree, X_train, X_test, multi=False)
Docstring:
Polynomial feature derivation function

:param colNames: names of the columns taking part in the derivation
:param degree: highest polynomial degree
:param X_train: training set features
:param X_test: test set features
:param multi: whether to derive multi-variable polynomial terms

:return: the new polynomial-derived features and the new column names
File:      d:\work\jupyter\telco\正式课程\features_creation.py
Type:      function


In [30]:
X_train_ply, X_test_ply, colNames_train_new, colNames_test_new = Polynomial_Features(colNames=colNames, 
                                                                                     degree=3,
                                                                                     X_train=X_train, 
                                                                                     X_test=X_test)

In [31]:
X_train_ply

Unnamed: 0,MonthlyCharges**2*TotalCharges**0,MonthlyCharges**1*TotalCharges**1,MonthlyCharges**0*TotalCharges**2,MonthlyCharges**3*TotalCharges**0,MonthlyCharges**2*TotalCharges**1,MonthlyCharges**1*TotalCharges**2,MonthlyCharges**0*TotalCharges**3
0,6336.1600,439057.6800,3.042405e+07,5.043583e+05,3.494899e+07,2.421754e+09,1.678130e+11
1,6400.0000,19304.0000,5.822569e+04,5.120000e+05,1.544320e+06,4.658055e+06,1.404986e+07
2,361.0000,1395.5500,5.394903e+03,6.859000e+03,2.651545e+04,1.025031e+05,3.962556e+05
3,3085.8025,30624.7150,3.039317e+05,1.714163e+05,1.701203e+06,1.688341e+07,1.675575e+08
4,402.0025,1833.5725,8.363103e+03,8.060150e+03,3.676313e+04,1.676802e+05,7.648057e+05
...,...,...,...,...,...,...,...
5277,11299.6900,583268.1000,3.010717e+07,1.201157e+06,6.200140e+07,3.200392e+09,1.651980e+11
5278,2926.8100,48094.9000,7.903210e+05,1.583404e+05,2.601934e+06,4.275637e+07,7.025954e+08
5279,11267.8225,334637.8750,9.938256e+06,1.196079e+06,3.552181e+07,1.054946e+09,3.133035e+10
5280,414.1225,6836.5825,1.128624e+05,8.427393e+03,1.391245e+05,2.296750e+06,3.791612e+07


Then standardize the derived features:

In [32]:
from sklearn.preprocessing import StandardScaler

In [33]:
scaler = StandardScaler()
scaler.fit(X_train_ply)

StandardScaler()

In [34]:
scaler.transform(X_train_ply)

array([[ 0.33464874,  1.03533026,  1.24455267, ...,  0.68739317,
         0.85860063,  0.94153185],
       [ 0.35164345, -0.72233392, -0.63923349, ..., -0.6317181 ,
        -0.56737633, -0.50565539],
       [-1.25598529, -0.79732324, -0.64251092, ..., -0.69165442,
        -0.5700639 , -0.50577314],
       ...,
       [ 1.64749563,  0.59808585, -0.02631204, ...,  0.71001313,
         0.05224576, -0.23556692],
       [-1.24184366, -0.77453962, -0.63584403, ..., -0.68720762,
        -0.5687694 , -0.50544955],
       [-1.19875062, -0.63590995, -0.47101134, ..., -0.65484604,
        -0.53090569, -0.46601802]])

In [35]:
X_train_ply = pd.DataFrame(scaler.transform(X_train_ply), columns=colNames_train_new)
X_train_ply.index = X_train.index
X_train_ply.head()

Unnamed: 0,MonthlyCharges**2*TotalCharges**0,MonthlyCharges**1*TotalCharges**1,MonthlyCharges**0*TotalCharges**2,MonthlyCharges**3*TotalCharges**0,MonthlyCharges**2*TotalCharges**1,MonthlyCharges**1*TotalCharges**2,MonthlyCharges**0*TotalCharges**3
4067,0.334649,1.03533,1.244553,0.15919,0.687393,0.858601,0.941532
3306,0.351643,-0.722334,-0.639233,0.177696,-0.631718,-0.567376,-0.505655
3391,-1.255985,-0.797323,-0.642511,-1.045667,-0.691654,-0.570064,-0.505773
3249,-0.530622,-0.67493,-0.623991,-0.647138,-0.625523,-0.560164,-0.504331
2674,-1.24507,-0.795489,-0.642327,-1.042758,-0.69125,-0.570025,-0.50577


In [36]:
X_test_ply = pd.DataFrame(scaler.transform(X_test_ply), columns=colNames_test_new)
X_test_ply.index = X_test.index
X_test_ply.head()

Unnamed: 0,MonthlyCharges**2*TotalCharges**0,MonthlyCharges**1*TotalCharges**1,MonthlyCharges**0*TotalCharges**2,MonthlyCharges**3*TotalCharges**0,MonthlyCharges**2*TotalCharges**1,MonthlyCharges**1*TotalCharges**2,MonthlyCharges**0*TotalCharges**3
6187,-1.244,-0.789203,-0.641146,-1.042464,-0.690048,-0.569799,-0.505737
6448,1.563321,1.922314,1.756934,1.713345,1.997069,1.818142,1.569246
5492,-1.181689,-0.617925,-0.453174,-1.023058,-0.648504,-0.52449,-0.459669
2028,-1.24982,-0.775952,-0.636024,-1.044043,-0.687671,-0.568853,-0.505462
5376,-1.201928,-0.76095,-0.631667,-1.029834,-0.683246,-0.5676,-0.505117


Next, check the correlation between the derived features and the label:

In [37]:
# Concatenate the derived features with the label
df_temp = pd.concat([X_train_ply, y_train], axis=1)

df_temp.head()

Unnamed: 0,MonthlyCharges**2*TotalCharges**0,MonthlyCharges**1*TotalCharges**1,MonthlyCharges**0*TotalCharges**2,MonthlyCharges**3*TotalCharges**0,MonthlyCharges**2*TotalCharges**1,MonthlyCharges**1*TotalCharges**2,MonthlyCharges**0*TotalCharges**3,Churn
4067,0.334649,1.03533,1.244553,0.15919,0.687393,0.858601,0.941532,0
3306,0.351643,-0.722334,-0.639233,0.177696,-0.631718,-0.567376,-0.505655,0
3391,-1.255985,-0.797323,-0.642511,-1.045667,-0.691654,-0.570064,-0.505773,0
3249,-0.530622,-0.67493,-0.623991,-0.647138,-0.625523,-0.560164,-0.504331,0
2674,-1.24507,-0.795489,-0.642327,-1.042758,-0.69125,-0.570025,-0.50577,0


In [38]:
df_corr = df_temp.corr()['Churn'].sort_values(ascending = False)

In [39]:
df_corr

Churn                                1.000000
MonthlyCharges**2*TotalCharges**0    0.151934
MonthlyCharges**3*TotalCharges**0    0.120895
MonthlyCharges**2*TotalCharges**1   -0.121555
MonthlyCharges**1*TotalCharges**1   -0.148175
MonthlyCharges**1*TotalCharges**2   -0.152267
MonthlyCharges**0*TotalCharges**3   -0.158389
MonthlyCharges**0*TotalCharges**2   -0.178061
Name: Churn, dtype: float64

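| Original field | Correlation coefficient |
| ------ | ------ |
| MonthlyCharges | 0.193356 |
| TotalCharges | -0.199484 |

Compared with the original fields, the derived fields still show some correlation with the label, but none of them is stronger than the original fields, so there is no need to consider higher-degree polynomial derivation. Next, we test them in the model: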

In [40]:
new_col = list(np.abs(df_corr).sort_values(ascending = False)[1: 3].index)
print(new_col)

train_new_temp = X_train_ply[new_col]
test_new_temp = X_test_ply[new_col]

features_test(train_new_temp, 
              test_new_temp, 
              X_train, 
              X_test, 
              y_train, 
              y_test, 
              category_cols, 
              numeric_cols)

['MonthlyCharges**0*TotalCharges**2', 'MonthlyCharges**0*TotalCharges**3']
46.75545644760132 s


(0.810678628766377,
 {'columntransformer__num': 'passthrough',
  'logit_threshold__C': 0.1,
  'logit_threshold__penalty': 'l2',
  'logit_threshold__solver': 'newton-cg'})

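### 4. Cross-combination feature derivation

&emsp;&emsp;Next we consider cross-combination feature derivation. Note that in real modeling, feature derivation and the later steps (such as feature selection and model validation) are not strictly separated but interleaved: we can derive some features, validate them, and then derive further. Either way, at the start of feature derivation we can run a "full derivation" in one pass, that is, feed in all categorical variables for pairwise cross combination.

> In this stage we run a large number of experiments to accumulate results that will inform the later derivation workflow; in subsequent case studies we will prioritize efficiency and use a more streamlined workflow directly.

#### 4.1 Pairwise cross combination of the original features

- Feature derivation

In [41]:
# View the categorical variables
category_cols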

['gender',
 'SeniorCitizen',
 'Partner',
 'Dependents',
 'PhoneService',
 'MultipleLines',
 'InternetService',
 'OnlineSecurity',
 'OnlineBackup',
 'DeviceProtection',
 'TechSupport',
 'StreamingTV',
 'StreamingMovies',
 'Contract',
 'PaperlessBilling',
 'PaymentMethod']

In [42]:
# View the levels of each categorical variable
for feature in tcc[category_cols]:
    print(f'{feature}: {tcc[feature].unique()}')

gender: ['Female' 'Male']
SeniorCitizen: [0 1]
Partner: ['Yes' 'No']
Dependents: ['No' 'Yes']
PhoneService: ['No' 'Yes']
MultipleLines: ['No phone service' 'No' 'Yes']
InternetService: ['DSL' 'Fiber optic' 'No']
OnlineSecurity: ['No' 'Yes' 'No internet service']
OnlineBackup: ['Yes' 'No' 'No internet service']
DeviceProtection: ['No' 'Yes' 'No internet service']
TechSupport: ['No' 'Yes' 'No internet service']
StreamingTV: ['No' 'Yes' 'No internet service']
StreamingMovies: ['No' 'Yes' 'No internet service']
Contract: ['Month-to-month' 'One year' 'Two year']
PaperlessBilling: ['Yes' 'No']
PaymentMethod: ['Electronic check' 'Mailed check' 'Bank transfer (automatic)'
 'Credit card (automatic)']


In [43]:
Cross_Combination?

Signature: Cross_Combination(colNames, X_train, X_test, multi=False, OneHot=True)
Docstring:
Cross-combination feature derivation function

:param colNames: names of the columns taking part in the cross derivation
:param X_train: training set features
:param X_test: test set features
:param multi: whether to perform multi-variable cross combination
:param OneHot: whether to apply one-hot encoding

:return: the new cross-derived features and the feature names
File:      d:\work\jupyter\telco\正式课程\features_creation.py
Type:      function


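Then, first try pairwise cross-combination derivation on the original categorical variables:

In [44]:
features_train_new, features_test_new, colNames_train_new, colNames_test_new = Cross_Combination(category_cols, X_train, X_test)

> Note that for cross-combination derivation there is no need to ordinal-encode columns of object dtype; we need the original strings in those columns to help build the names of the derived columns.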

In [45]:
features_train_new.head()

Unnamed: 0,gender&SeniorCitizen_Female&0,gender&SeniorCitizen_Female&1,gender&SeniorCitizen_Male&0,gender&SeniorCitizen_Male&1,gender&Partner_Female&No,gender&Partner_Female&Yes,gender&Partner_Male&No,gender&Partner_Male&Yes,gender&Dependents_Female&No,gender&Dependents_Female&Yes,...,Contract&PaymentMethod_Two year&Electronic check,Contract&PaymentMethod_Two year&Mailed check,PaperlessBilling&PaymentMethod_No&Bank transfer (automatic),PaperlessBilling&PaymentMethod_No&Credit card (automatic),PaperlessBilling&PaymentMethod_No&Electronic check,PaperlessBilling&PaymentMethod_No&Mailed check,PaperlessBilling&PaymentMethod_Yes&Bank transfer (automatic),PaperlessBilling&PaymentMethod_Yes&Credit card (automatic),PaperlessBilling&PaymentMethod_Yes&Electronic check,PaperlessBilling&PaymentMethod_Yes&Mailed check
0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
1,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
2,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
3,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
4,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0


As we can see, pairwise cross combination alone creates 761 new features.
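
The mechanics behind these columns can be sketched in a few lines of pandas. This is an illustrative approximation on made-up rows, not the `Cross_Combination` implementation; it follows the naming pattern visible in the output above, `<colA>&<colB>_<valA>&<valB>`.

```python
from itertools import combinations

import pandas as pd

# Toy categorical data (illustrative values only)
df = pd.DataFrame({
    'Contract': ['Month-to-month', 'Two year', 'Month-to-month'],
    'OnlineSecurity': ['No', 'Yes', 'No'],
    'TechSupport': ['No', 'No', 'Yes'],
})

crossed = pd.DataFrame(index=df.index)
for a, b in combinations(df.columns, 2):
    # Join the two string values with '&' so each new level records its origin
    crossed[f'{a}&{b}'] = df[a].astype(str) + '&' + df[b].astype(str)

# One-hot encode; get_dummies names each column '<colA>&<colB>_<valA>&<valB>'
crossed_onehot = pd.get_dummies(crossed, dtype=float)

print(crossed_onehot.columns.tolist())
```

Only level pairs that actually occur in the data produce a column, which is why the number of derived features depends on the data and not just on the number of levels. When a test set is involved, its columns would additionally need to be aligned with those of the training set.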

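- Evaluating the features

&emsp;&emsp;Although we have created many features, whether they are useful still needs to be checked. As before, we first use correlation coefficients for an initial assessment of the relationship between the derived features and the label. We first need to set the index of the derived features to match that of the training set so that the label column can be concatenated: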

In [46]:
features_train_new.index

RangeIndex(start=0, stop=5282, step=1)

In [47]:
X_train.index

Int64Index([4067, 3306, 3391, 3249, 2674, 2757, 5237, 5856, 3069, 6631,
            ...
            6628, 2527, 2952, 4587, 6646, 5478,  356, 4908, 6276, 2933],
           dtype='int64', length=5282)

In [48]:
features_train_new.index = X_train.index

In [49]:
df_temp = pd.concat([features_train_new, y_train], axis=1)

In [50]:
df_temp.head()

Unnamed: 0,gender&SeniorCitizen_Female&0,gender&SeniorCitizen_Female&1,gender&SeniorCitizen_Male&0,gender&SeniorCitizen_Male&1,gender&Partner_Female&No,gender&Partner_Female&Yes,gender&Partner_Male&No,gender&Partner_Male&Yes,gender&Dependents_Female&No,gender&Dependents_Female&Yes,...,Contract&PaymentMethod_Two year&Mailed check,PaperlessBilling&PaymentMethod_No&Bank transfer (automatic),PaperlessBilling&PaymentMethod_No&Credit card (automatic),PaperlessBilling&PaymentMethod_No&Electronic check,PaperlessBilling&PaymentMethod_No&Mailed check,PaperlessBilling&PaymentMethod_Yes&Bank transfer (automatic),PaperlessBilling&PaymentMethod_Yes&Credit card (automatic),PaperlessBilling&PaymentMethod_Yes&Electronic check,PaperlessBilling&PaymentMethod_Yes&Mailed check,Churn
4067,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0
3306,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0
3391,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0
3249,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0
2674,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0


Then compute the correlation between the derived features and the label:

In [51]:
df_corr = df_temp.corr()['Churn'].sort_values(ascending = False)

In [52]:
df_corr

Churn                                                                 1.000000
OnlineSecurity&Contract_No&Month-to-month                             0.441598
TechSupport&Contract_No&Month-to-month                                0.430260
InternetService&Contract_Fiber optic&Month-to-month                   0.415189
OnlineSecurity&TechSupport_No&No                                      0.398994
                                                                        ...   
PhoneService&InternetService_Yes&No                                  -0.230200
OnlineSecurity&TechSupport_No internet service&No internet service   -0.230200
Partner&Contract_Yes&Two year                                        -0.248640
PhoneService&Contract_Yes&Two year                                   -0.283557
SeniorCitizen&Contract_0&Two year                                    -0.289069
Name: Churn, Length: 762, dtype: float64

In [53]:
np.abs(df_corr).sort_values(ascending = False)[1: 11]

OnlineSecurity&Contract_No&Month-to-month                 0.441598
TechSupport&Contract_No&Month-to-month                    0.430260
InternetService&Contract_Fiber optic&Month-to-month       0.415189
OnlineSecurity&TechSupport_No&No                          0.398994
OnlineBackup&Contract_No&Month-to-month                   0.381821
Contract&PaperlessBilling_Month-to-month&Yes              0.373529
Contract&PaymentMethod_Month-to-month&Electronic check    0.373363
Dependents&Contract_No&Month-to-month                     0.364952
InternetService&OnlineSecurity_Fiber optic&No             0.363039
OnlineSecurity&PaymentMethod_No&Electronic check          0.359524
Name: Churn, dtype: float64

能够发现，在衍生特征中，出现了非常多个和标签相关性较强的特征，例如'OnlineSecurity&Contract_No&Month-to-month'（没有买在线安全服务并且按月付费的用户）、OnlineSecurity&TechSupport_No&No（没有购买在线安全服务并且没有买技术支持的用户）等，某些衍生特征和标签的相关性甚至要强于原始特征的相关系数：

| 原始字段 | 相关系数 |
| ------ | ------ |
| Contract_Month-to-month | 0.405103 |
| OnlineSecurity_No | 0.342637 |
| TechSupport_No  | 0.337281 |

这其实非常有利的说明了交叉组合能够显著增强模型表现。根据目前的情况我们不难发现，很多相关系数很大的衍生字段（如合约期限字段），其原始字段与标签的相关性本来就很强，交叉组合只是进一步增强了这些特点。当然，也有一些原本“默默无闻”的字段，通过交叉组合后相关性明显增加，此外，还有一些字段如SeniorCitizen，原始状态下呈现出和标签的正相关，经过交叉组合后，衍生字段却大多与标签呈现出负相关。这也一定程度上也说明交叉组合会一定程度丰富特征表现。

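> This is also why we spent so much time on how derived columns are named. Without well-defined column names, there is no way to trace an effective derived column back to the original columns it was built from.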
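
To make the tracing idea concrete, here is a minimal sketch of how a derived column name could be split back into its source columns and levels. The helper `parse_cross_name` is hypothetical (it is not part of the course code), and it assumes the naming pattern `colA&colB_valA&valB` seen in the outputs above, with level values that contain neither `_` nor `&`.

```python
def parse_cross_name(name):
    """Split a derived column name into (source columns, level values).

    Assumes the pattern 'colA&colB_valA&valB' and that level values
    contain neither '_' nor '&' (an assumption, not a guarantee).
    """
    # Split on the LAST underscore, because source columns such as
    # 'tenure_year_2014' contain underscores themselves
    cols_part, vals_part = name.rsplit('_', 1)
    return cols_part.split('&'), vals_part.split('&')


cols, vals = parse_cross_name('Contract&tenure_year_2014_Month-to-month&0.0')
print(cols)
print(vals)
```

If a dataset has level values that contain underscores, this parsing rule breaks, and a different separator would be needed when the names are generated.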

- Model validation

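&emsp;&emsp;After this quick correlation check, we next feed the three features with the highest correlation into a logistic regression model for validation. Because the feature derivation step now also distinguishes between the training and test sets, the previously defined features_test function needs a slight modification so that it can handle training and test data separately.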

Then select the three derived features with the largest absolute correlation coefficients:

In [54]:
new_col = list(np.abs(df_corr).sort_values(ascending = False)[1: 4].index)
new_col

['OnlineSecurity&Contract_No&Month-to-month',
 'TechSupport&Contract_No&Month-to-month',
 'InternetService&Contract_Fiber optic&Month-to-month']
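
The slice `[1: 4]` above works because the label `Churn` correlates perfectly with itself and therefore sorts first. A slightly more defensive variant, sketched here on a toy Series (`df_corr_demo` is a hypothetical name; its values are copied from the output above), drops the label by name before ranking:

```python
import pandas as pd

# Toy correlation Series including the label itself
df_corr_demo = pd.Series({
    'Churn': 1.000000,
    'OnlineSecurity&Contract_No&Month-to-month': 0.441598,
    'TechSupport&Contract_No&Month-to-month': 0.430260,
    'SeniorCitizen&Contract_0&Two year': -0.289069,
})

# Remove the label by name, then keep the k largest absolute values
top_cols = list(df_corr_demo.drop('Churn').abs().nlargest(2).index)
print(top_cols)
```

This gives the same selection as the positional slice, but does not depend on where the label lands after sorting.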

Feed them into the model:

In [55]:
train_new_temp = features_train_new[new_col]
test_new_temp = features_test_new[new_col]

In [56]:
train_new_temp.head()

Unnamed: 0,OnlineSecurity&Contract_No&Month-to-month,TechSupport&Contract_No&Month-to-month,InternetService&Contract_Fiber optic&Month-to-month
4067,0.0,0.0,0.0
3306,1.0,0.0,1.0
3391,0.0,0.0,0.0
3249,1.0,1.0,0.0
2674,0.0,0.0,0.0


In [57]:
features_test(train_new_temp, 
              test_new_temp, 
              X_train, 
              X_test, 
              y_train, 
              y_test, 
              category_cols, 
              numeric_cols)

44.58748745918274 s


(0.8097323757919785,
 {'columntransformer__num': 'passthrough',
  'logit_threshold__C': 0.1,
  'logit_threshold__penalty': 'l2',
  'logit_threshold__solver': 'lbfgs'})

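The score of about 0.8097 is slightly higher than the mean cross-validation accuracy of 0.8042 obtained on the original dataset, which further supports the effectiveness of feature derivation.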

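&emsp;&emsp;Until feature selection methods are introduced, this simple check is all we can do to validate model performance. Note that logistic regression does not tolerate large numbers of features well; feeding in too many features tends to degrade the model. This does not conflict with the current goal of "creating as many useful features as possible": as long as the derivation process is confirmed to be effective, choosing among the useful features is left to the later feature selection stage.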

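&emsp;&emsp;The process above shows that batch feature derivation is highly efficient: it can quickly assemble a huge number of features, from which high-quality ones can then be mined. Unlike manual feature derivation, however, features created automatically in bulk have no obvious business interpretation. For example, it is hard to find a precise business rationale for why users who have no online security service and pay month-to-month are more likely to churn. Such rules are still valuable, though, and can be passed to the business side for evaluation.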
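
As a reminder of what a two-variable cross-combination produces, here is a minimal sketch on toy data. It is not the `Cross_Combination` function used in this section: `cross_pair_sketch` is a hypothetical helper that only encodes the level pairs observed in a single DataFrame and does not align training and test columns.

```python
import pandas as pd


def cross_pair_sketch(df, col_a, col_b):
    """One-hot encode the joint levels of two discrete columns (illustrative only)."""
    # Join the two levels row by row, e.g. 'No&Month-to-month'
    joint = df[col_a].astype(str) + '&' + df[col_b].astype(str)
    dummies = pd.get_dummies(joint, dtype=float)
    # Prefix each joint level with the source column names
    dummies.columns = [f'{col_a}&{col_b}_{level}' for level in dummies.columns]
    return dummies


toy = pd.DataFrame({
    'OnlineSecurity': ['No', 'Yes', 'No', 'No'],
    'Contract': ['Month-to-month', 'Two year', 'Month-to-month', 'One year'],
})
crossed = cross_pair_sketch(toy, 'OnlineSecurity', 'Contract')
print(crossed.columns.tolist())
print(crossed)
```

Each row has exactly one non-zero entry, which is why the matrix becomes sparser as the number of joint levels grows.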

#### 3.2 Multi-variable cross-combination of original features

- Multi-variable cross derivation

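&emsp;&emsp;The first-order two-variable cross derivation showed that some already strong features become even more strongly correlated after crossing, which naturally prompts us to try multi-variable cross-combination around these strong features. The three features that stand out most, and that combine best, are 'OnlineSecurity', 'Contract', and 'TechSupport':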

In [58]:
colNames = ['OnlineSecurity', 'Contract', 'TechSupport']
colNames

['OnlineSecurity', 'Contract', 'TechSupport']

Apply cross-combination feature derivation to them:

In [59]:
features_train_new, features_test_new, colNames_train_new, colNames_test_new = Cross_Combination(colNames, 
                                                                                                 X_train, 
                                                                                                 X_test, 
                                                                                                 multi=True)

In [60]:
features_train_new.head()

Unnamed: 0,OnlineSecurity&Contract&TechSupport_No internet service&Month-to-month&No internet service,OnlineSecurity&Contract&TechSupport_No internet service&One year&No internet service,OnlineSecurity&Contract&TechSupport_No internet service&Two year&No internet service,OnlineSecurity&Contract&TechSupport_No&Month-to-month&No,OnlineSecurity&Contract&TechSupport_No&Month-to-month&Yes,OnlineSecurity&Contract&TechSupport_No&One year&No,OnlineSecurity&Contract&TechSupport_No&One year&Yes,OnlineSecurity&Contract&TechSupport_No&Two year&No,OnlineSecurity&Contract&TechSupport_No&Two year&Yes,OnlineSecurity&Contract&TechSupport_Yes&Month-to-month&No,OnlineSecurity&Contract&TechSupport_Yes&Month-to-month&Yes,OnlineSecurity&Contract&TechSupport_Yes&One year&No,OnlineSecurity&Contract&TechSupport_Yes&One year&Yes,OnlineSecurity&Contract&TechSupport_Yes&Two year&No,OnlineSecurity&Contract&TechSupport_Yes&Two year&Yes
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
1,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


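As noted earlier, the more features take part in the cross-combination and the more levels they have, the higher the share of zeros in the derived features. We can quickly compute the share of zeros in each column of the derived feature matrix: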

In [61]:
features_train_new.shape

(5282, 15)

In [62]:
(features_train_new == 0).sum() / 5282

OnlineSecurity&Contract&TechSupport_No internet service&Month-to-month&No internet service    0.926543
OnlineSecurity&Contract&TechSupport_No internet service&One year&No internet service          0.948315
OnlineSecurity&Contract&TechSupport_No internet service&Two year&No internet service          0.912533
OnlineSecurity&Contract&TechSupport_No&Month-to-month&No                                      0.689133
OnlineSecurity&Contract&TechSupport_No&Month-to-month&Yes                                     0.931655
OnlineSecurity&Contract&TechSupport_No&One year&No                                            0.958728
OnlineSecurity&Contract&TechSupport_No&One year&Yes                                           0.965165
OnlineSecurity&Contract&TechSupport_No&Two year&No                                            0.987694
OnlineSecurity&Contract&TechSupport_No&Two year&Yes                                           0.967815
OnlineSecurity&Contract&TechSupport_Yes&Month-to-month&No                

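In most cases, columns with a high share of zeros carry little useful information and tend to be dropped during feature selection. Here, almost every derived column has a zero share above 90% (the exception being OnlineSecurity&Contract&TechSupport_No&Month-to-month&No), and some exceed 95%. This confirms the earlier point that useful information decays rapidly as more features take part in the cross-combination.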
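
The zero share can also be computed without hard-coding the row count, since the column-wise mean of a boolean mask is the proportion of `True` values. A minimal sketch on toy data (the column names and the 0.7 threshold are made up for illustration):

```python
import pandas as pd

demo = pd.DataFrame({
    'A&B_x&y': [1.0, 0.0, 0.0, 0.0],
    'A&B_x&z': [0.0, 0.0, 0.0, 0.0],
    'A&B_w&y': [0.0, 1.0, 1.0, 0.0],
})

# Share of zeros per column: mean of the boolean mask
zero_share = (demo == 0).mean()
print(zero_share)

# Columns whose zero share exceeds an illustrative threshold
sparse_cols = zero_share[zero_share > 0.7].index.tolist()
print(sparse_cols)
```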

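> Note that there is a way to reduce the sparsity of the cross-combination feature matrix: reduce the number of levels of the original discrete variables. For example, the online service fields here could be changed from the original three levels 'Yes', 'No', 'No internet service' to binary 'Yes'/'No' variables. Students can try this on their own after class.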
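
A minimal sketch of this level-reduction idea on toy data follows. Mapping 'No internet service' to 'No' is one possible choice, not necessarily the one the course would make; it merges customers without internet service into the "did not purchase" group, a distinction that in this dataset would then have to come from the InternetService column.

```python
import pandas as pd

toy = pd.DataFrame({
    'OnlineSecurity': ['No', 'Yes', 'No internet service', 'No'],
    'TechSupport': ['No', 'Yes', 'No internet service', 'Yes'],
})

# Collapse the third level into 'No' for every column
collapsed = toy.replace('No internet service', 'No')

# Upper bound on the number of crossed columns: product of the level counts
levels_before = int(toy.nunique().prod())
levels_after = int(collapsed.nunique().prod())
print(levels_before, levels_after)
```

With three service fields crossed instead of two, the same change would cut the upper bound from 27 joint levels to 8.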

- Effect validation

&emsp;&emsp;Matrix sparsity will be discussed further later on. Here we quickly check the correlation between these derived features and the label:

In [63]:
features_train_new.index = X_train.index

In [64]:
df_temp = pd.concat([features_train_new, y_train], axis=1)

In [65]:
df_temp.head()

Unnamed: 0,OnlineSecurity&Contract&TechSupport_No internet service&Month-to-month&No internet service,OnlineSecurity&Contract&TechSupport_No internet service&One year&No internet service,OnlineSecurity&Contract&TechSupport_No internet service&Two year&No internet service,OnlineSecurity&Contract&TechSupport_No&Month-to-month&No,OnlineSecurity&Contract&TechSupport_No&Month-to-month&Yes,OnlineSecurity&Contract&TechSupport_No&One year&No,OnlineSecurity&Contract&TechSupport_No&One year&Yes,OnlineSecurity&Contract&TechSupport_No&Two year&No,OnlineSecurity&Contract&TechSupport_No&Two year&Yes,OnlineSecurity&Contract&TechSupport_Yes&Month-to-month&No,OnlineSecurity&Contract&TechSupport_Yes&Month-to-month&Yes,OnlineSecurity&Contract&TechSupport_Yes&One year&No,OnlineSecurity&Contract&TechSupport_Yes&One year&Yes,OnlineSecurity&Contract&TechSupport_Yes&Two year&No,OnlineSecurity&Contract&TechSupport_Yes&Two year&Yes,Churn
4067,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0
3306,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
3391,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
3249,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
2674,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0


Then compute the correlation coefficients between the derived features and the label:

In [66]:
df_corr = df_temp.corr()['Churn'].sort_values(ascending = False)

In [67]:
df_corr

Churn                                                                                         1.000000
OnlineSecurity&Contract&TechSupport_No&Month-to-month&No                                      0.440002
OnlineSecurity&Contract&TechSupport_No&Month-to-month&Yes                                     0.042028
OnlineSecurity&Contract&TechSupport_Yes&Month-to-month&No                                     0.021370
OnlineSecurity&Contract&TechSupport_Yes&Month-to-month&Yes                                   -0.029948
OnlineSecurity&Contract&TechSupport_No&One year&No                                           -0.034648
OnlineSecurity&Contract&TechSupport_No&Two year&No                                           -0.042966
OnlineSecurity&Contract&TechSupport_No&One year&Yes                                          -0.044896
OnlineSecurity&Contract&TechSupport_No internet service&Month-to-month&No internet service   -0.060186
OnlineSecurity&Contract&TechSupport_Yes&One year&Yes                     

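As can be seen, the usefulness of the combined features also drops off quickly: among the derived features above, only one performs well, namely OnlineSecurity&Contract&TechSupport_No&Month-to-month&No. On closer inspection, this is also the feature with the lowest zero share noted earlier. In other words, it carries a relatively large amount of information, much of it useful, so we can go on to test it in the model: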

In [68]:
new_col = list(np.abs(df_corr).sort_values(ascending = False)[1: 2].index)
new_col

['OnlineSecurity&Contract&TechSupport_No&Month-to-month&No']

In [69]:
train_new_temp = features_train_new[new_col]
test_new_temp = features_test_new[new_col]

In [70]:
train_new_temp.head()

Unnamed: 0,OnlineSecurity&Contract&TechSupport_No&Month-to-month&No
4067,0.0
3306,0.0
3391,0.0
3249,1.0
2674,0.0


In [71]:
features_test(train_new_temp, 
              test_new_temp, 
              X_train, 
              X_test, 
              y_train, 
              y_test, 
              category_cols, 
              numeric_cols)

42.55664372444153 s


(0.8082181201800406,
 {'columntransformer__num': 'passthrough',
  'logit_threshold__C': 0.6,
  'logit_threshold__penalty': 'l2',
  'logit_threshold__solver': 'lbfgs'})

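If we go one step further and bring in the two features with the largest absolute correlation coefficients, there is essentially no further improvement either: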

In [72]:
new_col = list(np.abs(df_corr).sort_values(ascending = False)[1: 3].index)
print(new_col)

train_new_temp = features_train_new[new_col]
test_new_temp = features_test_new[new_col]

features_test(train_new_temp, 
              test_new_temp, 
              X_train, 
              X_test, 
              y_train, 
              y_test, 
              category_cols, 
              numeric_cols)

['OnlineSecurity&Contract&TechSupport_No&Month-to-month&No', 'OnlineSecurity&Contract&TechSupport_No internet service&Two year&No internet service']
43.986942291259766 s


(0.8097325549726213,
 {'columntransformer__num': 'passthrough',
  'logit_threshold__C': 0.4,
  'logit_threshold__penalty': 'l2',
  'logit_threshold__solver': 'lbfgs'})

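Overall, the more features take part in the cross-combination, the more severely useful information decays. Therefore, in most cases, when performing multi-variable cross-combination we give priority to combinations of the strongest features.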

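> Of course, if compute is plentiful and the feature selection process is efficient enough, this step can be skipped entirely: derive features from all three-way and four-way combinations of the full feature set, then run feature selection. Even so, in practice we still find that three-way combinations yield fewer useful features than two-way combinations, four-way combinations fewer than three-way ones, and so on.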

#### 3.3 Cross-combination derivation with time-series derived features

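&emsp;&emsp;Next we test cross-combination feature derivation after bringing in the time-series derived data. Note first that cross-combination only suits discrete variables with few levels (otherwise the derived feature matrix becomes too sparse). For time-series derived features, we therefore generally only bring in year (if it has few distinct values), quarter, and day of week, and at most month (a categorical variable with 12 levels).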
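
For reference, the low-cardinality time fields mentioned here can be read off a datetime column with the pandas `.dt` accessor. The sketch below uses made-up dates, not the dates behind `X_train_seq`, and simply counts how many levels each field has:

```python
import pandas as pd

# Hypothetical dates, for illustration only
dates = pd.Series(pd.to_datetime(['2014-05-01', '2019-10-01', '2019-03-01']))

time_feats = pd.DataFrame({
    'year': dates.dt.year,
    'quarter': dates.dt.quarter,
    'month': dates.dt.month,
    'dayofweek': dates.dt.dayofweek,  # Monday=0 ... Sunday=6
})
print(time_feats)

# Number of distinct levels per field: a quick check before crossing
print(time_feats.nunique())
```

Fields with many distinct levels (for example day of month or a raw timestamp) would be poor candidates for crossing for the sparsity reason given above.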

In [73]:
X_train_seq.head()

Unnamed: 0,tenure_year_2014,tenure_year_2015,tenure_year_2016,tenure_year_2017,tenure_year_2018,tenure_year_2019,tenure_year_2020,tenure_month_1,tenure_month_2,tenure_month_3,...,tenure_month_7,tenure_month_8,tenure_month_9,tenure_month_10,tenure_month_11,tenure_month_12,tenure_quarter_1,tenure_quarter_2,tenure_quarter_3,tenure_quarter_4
4067,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
3306,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0
3391,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
3249,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
2674,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0


For the current time-series derived features, all of them can be brought into the cross-combination.

- Two-variable cross-combination feature derivation

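&emsp;&emsp;We start with pairwise cross-combinations of time-series features and original features. First, concatenate the discrete variables of the original dataset with the time-series derived features: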

In [74]:
# Align the index
X_train_seq.index = X_train.index
X_test_seq.index = X_test.index

# Concatenate the datasets
train_temp = pd.concat([X_train[category_cols], X_train_seq], axis=1)
test_temp = pd.concat([X_test[category_cols], X_test_seq], axis=1)
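
The index is aligned first because `pd.concat` with `axis=1` joins on index labels, not on row position. A minimal sketch on toy frames (`left` and `right` are hypothetical names) shows what happens otherwise; it assumes, as in the cell above, that both frames hold the same rows in the same order:

```python
import pandas as pd

left = pd.DataFrame({'Contract': ['Two year', 'One year']}, index=[4067, 3306])
right = pd.DataFrame({'tenure_year_2014': [1.0, 0.0]})  # default index 0, 1

# Labels do not match: the result is the union of both indexes, padded with NaN
misaligned = pd.concat([left, right], axis=1)
print(misaligned)

# After copying the index over, rows line up one-to-one
right.index = left.index
aligned = pd.concat([left, right], axis=1)
print(aligned)
```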

In [75]:
train_temp.head()

Unnamed: 0,gender,SeniorCitizen,Partner,Dependents,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,...,tenure_month_7,tenure_month_8,tenure_month_9,tenure_month_10,tenure_month_11,tenure_month_12,tenure_quarter_1,tenure_quarter_2,tenure_quarter_3,tenure_quarter_4
4067,Female,0,Yes,No,Yes,No,DSL,Yes,Yes,No,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
3306,Female,0,Yes,Yes,Yes,No,Fiber optic,No,Yes,No,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0
3391,Male,0,No,No,Yes,No,No,No internet service,No internet service,No internet service,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
3249,Female,0,Yes,Yes,Yes,No,DSL,No,Yes,Yes,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
2674,Female,1,No,No,Yes,No,No,No internet service,No internet service,No internet service,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0


In [76]:
train_temp.shape, test_temp.shape

((5282, 39), (1761, 39))

Then perform cross-combination feature derivation:

In [77]:
list(train_temp.columns)

['gender',
 'SeniorCitizen',
 'Partner',
 'Dependents',
 'PhoneService',
 'MultipleLines',
 'InternetService',
 'OnlineSecurity',
 'OnlineBackup',
 'DeviceProtection',
 'TechSupport',
 'StreamingTV',
 'StreamingMovies',
 'Contract',
 'PaperlessBilling',
 'PaymentMethod',
 'tenure_year_2014',
 'tenure_year_2015',
 'tenure_year_2016',
 'tenure_year_2017',
 'tenure_year_2018',
 'tenure_year_2019',
 'tenure_year_2020',
 'tenure_month_1',
 'tenure_month_2',
 'tenure_month_3',
 'tenure_month_4',
 'tenure_month_5',
 'tenure_month_6',
 'tenure_month_7',
 'tenure_month_8',
 'tenure_month_9',
 'tenure_month_10',
 'tenure_month_11',
 'tenure_month_12',
 'tenure_quarter_1',
 'tenure_quarter_2',
 'tenure_quarter_3',
 'tenure_quarter_4']

In [78]:
len(list(train_temp.columns))

39

In [79]:
features_train_new, features_test_new, colNames_train_new, colNames_test_new = Cross_Combination(list(train_temp.columns), 
                                                                                                 train_temp, 
                                                                                                 test_temp)

In [80]:
features_train_new.head()

Unnamed: 0,gender&SeniorCitizen_Female&0,gender&SeniorCitizen_Female&1,gender&SeniorCitizen_Male&0,gender&SeniorCitizen_Male&1,gender&Partner_Female&No,gender&Partner_Female&Yes,gender&Partner_Male&No,gender&Partner_Male&Yes,gender&Dependents_Female&No,gender&Dependents_Female&Yes,...,tenure_quarter_1&tenure_quarter_4_1.0&0.0,tenure_quarter_2&tenure_quarter_3_0.0&0.0,tenure_quarter_2&tenure_quarter_3_0.0&1.0,tenure_quarter_2&tenure_quarter_3_1.0&0.0,tenure_quarter_2&tenure_quarter_4_0.0&0.0,tenure_quarter_2&tenure_quarter_4_0.0&1.0,tenure_quarter_2&tenure_quarter_4_1.0&0.0,tenure_quarter_3&tenure_quarter_4_0.0&0.0,tenure_quarter_3&tenure_quarter_4_0.0&1.0,tenure_quarter_3&tenure_quarter_4_1.0&0.0
0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0
1,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,...,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
2,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0
3,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,...,1.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0
4,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0


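As the number of input features grows, the number of derived features rises sharply (the number of column pairs grows quadratically with the number of columns); at this point there are already 3,589 derived features.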
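
A quick way to anticipate the size of the derived matrix is to count the column combinations and, for each one, multiply the level counts. The sketch below is only an upper bound under that assumption; in the output above, level pairs that never occur (such as two mutually exclusive quarter dummies both equal to 1) produce no column. The three-column `levels` dictionary is a small illustrative subset.

```python
from itertools import combinations
from math import comb

n_cols = 39
n_pairs = comb(n_cols, 2)    # number of column pairs
n_triples = comb(n_cols, 3)  # number of column triples
print(n_pairs, n_triples)

# Upper bound on derived columns for a small subset of fields:
# sum over pairs of the product of their level counts
levels = {'gender': 2, 'Contract': 3, 'PaymentMethod': 4}
upper_bound = sum(levels[a] * levels[b] for a, b in combinations(levels, 2))
print(upper_bound)
```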

&emsp;&emsp;As before, we can use correlation coefficients to screen for useful features:

In [81]:
# Reset the index
features_train_new.index = X_train.index

# Concatenate the derived features with the label
df_temp = pd.concat([features_train_new, y_train], axis=1)

# Inspect the concatenated df
df_temp.head()

Unnamed: 0,gender&SeniorCitizen_Female&0,gender&SeniorCitizen_Female&1,gender&SeniorCitizen_Male&0,gender&SeniorCitizen_Male&1,gender&Partner_Female&No,gender&Partner_Female&Yes,gender&Partner_Male&No,gender&Partner_Male&Yes,gender&Dependents_Female&No,gender&Dependents_Female&Yes,...,tenure_quarter_2&tenure_quarter_3_0.0&0.0,tenure_quarter_2&tenure_quarter_3_0.0&1.0,tenure_quarter_2&tenure_quarter_3_1.0&0.0,tenure_quarter_2&tenure_quarter_4_0.0&0.0,tenure_quarter_2&tenure_quarter_4_0.0&1.0,tenure_quarter_2&tenure_quarter_4_1.0&0.0,tenure_quarter_3&tenure_quarter_4_0.0&0.0,tenure_quarter_3&tenure_quarter_4_0.0&1.0,tenure_quarter_3&tenure_quarter_4_1.0&0.0,Churn
4067,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,...,0.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0
3306,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,...,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0
3391,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0
3249,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,...,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0
2674,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,...,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0


In [82]:
# Compute the correlation coefficients
df_corr = df_temp.corr()['Churn'].sort_values(ascending = False)
df_corr

Churn                                                  1.000000
OnlineSecurity&Contract_No&Month-to-month              0.441598
TechSupport&Contract_No&Month-to-month                 0.430260
InternetService&Contract_Fiber optic&Month-to-month    0.415189
Contract&tenure_year_2014_Month-to-month&0.0           0.402879
                                                         ...   
Contract&tenure_year_2020_Two year&0.0                -0.301604
tenure_year_2019&tenure_month_12_0.0&0.0              -0.303862
tenure_year_2018&tenure_year_2019_0.0&0.0             -0.315952
tenure_year_2019&tenure_year_2020_0.0&0.0             -0.317597
SeniorCitizen&tenure_year_2019_0&0.0                  -0.319902
Name: Churn, Length: 3590, dtype: float64

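As can be seen, strong original features and strong time-series derived features, such as Contract, OnlineSecurity, and tenure_year_2014, still show strong correlations after cross-combination. Here we look at the 20 entries with the largest absolute correlation coefficients (the label itself included):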

In [83]:
np.abs(df_corr).sort_values(ascending = False)[: 20]

Churn                                                  1.000000
OnlineSecurity&Contract_No&Month-to-month              0.441598
TechSupport&Contract_No&Month-to-month                 0.430260
InternetService&Contract_Fiber optic&Month-to-month    0.415189
Contract&tenure_year_2014_Month-to-month&0.0           0.402879
Contract&tenure_year_2020_Month-to-month&0.0           0.400606
OnlineSecurity&tenure_year_2019_No&1.0                 0.399682
OnlineSecurity&TechSupport_No&No                       0.398994
Contract&tenure_year_2015_Month-to-month&0.0           0.398467
Contract&tenure_month_2_Month-to-month&0.0             0.395384
Contract&tenure_month_1_Month-to-month&0.0             0.392526
OnlineSecurity&tenure_year_2014_No&0.0                 0.391337
Contract&tenure_month_5_Month-to-month&0.0             0.391318
InternetService&tenure_year_2019_Fiber optic&1.0       0.391223
TechSupport&tenure_year_2019_No&1.0                    0.390156
Contract&tenure_month_4_Month-to-month&0

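Next we test them in the model. Because logistic regression tolerates only a limited number of features, we bring in only the three most strongly correlated features: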

In [84]:
new_col = list(np.abs(df_corr).sort_values(ascending = False)[1: 4].index)
new_col

['OnlineSecurity&Contract_No&Month-to-month',
 'TechSupport&Contract_No&Month-to-month',
 'InternetService&Contract_Fiber optic&Month-to-month']

In [85]:
train_new_temp = features_train_new[new_col]
test_new_temp = features_test_new[new_col]

In [86]:
features_test(train_new_temp, 
              test_new_temp, 
              X_train, 
              X_test, 
              y_train, 
              y_test, 
              category_cols, 
              numeric_cols)

44.43435311317444 s


(0.8097323757919785,
 {'columntransformer__num': 'passthrough',
  'logit_threshold__C': 0.1,
  'logit_threshold__penalty': 'l2',
  'logit_threshold__solver': 'lbfgs'})

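The correlation results suggest that crossing time-series derived features with original features is also a very valuable direction for feature derivation. The modeling result, however, is the same as before, because the three selected features are the same crosses of original features chosen in the earlier two-variable test; given logistic regression's limited tolerance for feature count, no clearer improvement is visible yet. Later we will try more models with more features to test the effect of these derived features.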

- Multi-variable cross-combination feature derivation

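&emsp;&emsp;Next we go a step further with multi-variable cross-combination. Following the principle that "the strong stay strong", we take three original fields, 'OnlineSecurity', 'Contract', and 'TechSupport', together with three time-series derived fields, tenure_year_2014, tenure_year_2019, and tenure_year_2020, perform three-way cross-combination feature derivation, and test the final effect.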

In [87]:
X_train_seq.head()

Unnamed: 0,tenure_year_2014,tenure_year_2015,tenure_year_2016,tenure_year_2017,tenure_year_2018,tenure_year_2019,tenure_year_2020,tenure_month_1,tenure_month_2,tenure_month_3,...,tenure_month_7,tenure_month_8,tenure_month_9,tenure_month_10,tenure_month_11,tenure_month_12,tenure_quarter_1,tenure_quarter_2,tenure_quarter_3,tenure_quarter_4
4067,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
3306,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0
3391,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
3249,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
2674,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0


In [88]:
col1 = ['OnlineSecurity', 'Contract', 'TechSupport']
col2 = ['tenure_year_2014', 'tenure_year_2019', 'tenure_year_2020']

In [89]:
# Concatenate the datasets
train_temp = pd.concat([X_train[col1], X_train_seq[col2]], axis=1)
test_temp = pd.concat([X_test[col1], X_test_seq[col2]], axis=1)

In [90]:
train_temp.head()

Unnamed: 0,OnlineSecurity,Contract,TechSupport,tenure_year_2014,tenure_year_2019,tenure_year_2020
4067,Yes,Two year,Yes,1.0,0.0,0.0
3306,No,Month-to-month,Yes,0.0,1.0,0.0
3391,No internet service,Month-to-month,No internet service,0.0,1.0,0.0
3249,No,Month-to-month,No,0.0,1.0,0.0
2674,No internet service,Month-to-month,No internet service,0.0,1.0,0.0


In [91]:
train_temp.shape, test_temp.shape

((5282, 6), (1761, 6))

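Then perform cross-combination feature derivation. Note that here we need three-way combinations of the six features above, which can be implemented as follows: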

In [92]:
cl = list(train_temp.columns)
cl

['OnlineSecurity',
 'Contract',
 'TechSupport',
 'tenure_year_2014',
 'tenure_year_2019',
 'tenure_year_2020']

In [93]:
n = len(list(train_temp.columns))
n

6

In [94]:
for i in range(n):
    for j in range(i+1, n):
        for k in range(j+1, n):
            print(i, j, k)

0 1 2
0 1 3
0 1 4
0 1 5
0 2 3
0 2 4
0 2 5
0 3 4
0 3 5
0 4 5
1 2 3
1 2 4
1 2 5
1 3 4
1 3 5
1 4 5
2 3 4
2 3 5
2 4 5
3 4 5


In [95]:
col_temp = []

for i in range(n):
    for j in range(i+1, n):
        for k in range(j+1, n):
            col_temp.append(cl[i])
            col_temp.append(cl[j])
            col_temp.append(cl[k])
            print(col_temp)
            col_temp = []

['OnlineSecurity', 'Contract', 'TechSupport']
['OnlineSecurity', 'Contract', 'tenure_year_2014']
['OnlineSecurity', 'Contract', 'tenure_year_2019']
['OnlineSecurity', 'Contract', 'tenure_year_2020']
['OnlineSecurity', 'TechSupport', 'tenure_year_2014']
['OnlineSecurity', 'TechSupport', 'tenure_year_2019']
['OnlineSecurity', 'TechSupport', 'tenure_year_2020']
['OnlineSecurity', 'tenure_year_2014', 'tenure_year_2019']
['OnlineSecurity', 'tenure_year_2014', 'tenure_year_2020']
['OnlineSecurity', 'tenure_year_2019', 'tenure_year_2020']
['Contract', 'TechSupport', 'tenure_year_2014']
['Contract', 'TechSupport', 'tenure_year_2019']
['Contract', 'TechSupport', 'tenure_year_2020']
['Contract', 'tenure_year_2014', 'tenure_year_2019']
['Contract', 'tenure_year_2014', 'tenure_year_2020']
['Contract', 'tenure_year_2019', 'tenure_year_2020']
['TechSupport', 'tenure_year_2014', 'tenure_year_2019']
['TechSupport', 'tenure_year_2014', 'tenure_year_2020']
['TechSupport', 'tenure_year_2019', 'tenure_yea

In [96]:
# Create containers
col_temp = []
colNames_train_new = []
colNames_test_new = []
features_train_new = []
features_test_new = []

# Nested loops over all three-way combinations
for i in range(n):
    for j in range(i+1, n):
        for k in range(j+1, n):
            col_temp.append(cl[i])
            col_temp.append(cl[j])
            col_temp.append(cl[k])
            features_train1, features_test1, colNames_train, colNames_test = Cross_Combination(col_temp, 
                                                                                               train_temp, 
                                                                                               test_temp, 
                                                                                               multi=True)
            
            colNames_train_new.extend(colNames_train)
            colNames_test_new.extend(colNames_test)
            features_train_new.append(features_train1)
            features_test_new.append(features_test1)
            
            col_temp = []
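
The triple nested loop above enumerates every three-element subset of the six columns. As an alternative sketch (not the notebook's own code), the standard library's `itertools.combinations` yields the same subsets in the same order:

```python
from itertools import combinations

cl = ['OnlineSecurity', 'Contract', 'TechSupport',
      'tenure_year_2014', 'tenure_year_2019', 'tenure_year_2020']

# Every 3-column subset, in the same order as the nested i < j < k loops
triples = [list(t) for t in combinations(cl, 3)]

print(len(triples))
print(triples[0])
print(triples[-1])
```

Each element of `triples` could then be passed where `col_temp` is used, and the same call generalizes to four-way combinations by changing the second argument.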

In [97]:
features_train_new[0].head()

Unnamed: 0,OnlineSecurity&Contract&TechSupport_No internet service&Month-to-month&No internet service,OnlineSecurity&Contract&TechSupport_No internet service&One year&No internet service,OnlineSecurity&Contract&TechSupport_No internet service&Two year&No internet service,OnlineSecurity&Contract&TechSupport_No&Month-to-month&No,OnlineSecurity&Contract&TechSupport_No&Month-to-month&Yes,OnlineSecurity&Contract&TechSupport_No&One year&No,OnlineSecurity&Contract&TechSupport_No&One year&Yes,OnlineSecurity&Contract&TechSupport_No&Two year&No,OnlineSecurity&Contract&TechSupport_No&Two year&Yes,OnlineSecurity&Contract&TechSupport_Yes&Month-to-month&No,OnlineSecurity&Contract&TechSupport_Yes&Month-to-month&Yes,OnlineSecurity&Contract&TechSupport_Yes&One year&No,OnlineSecurity&Contract&TechSupport_Yes&One year&Yes,OnlineSecurity&Contract&TechSupport_Yes&Two year&No,OnlineSecurity&Contract&TechSupport_Yes&Two year&Yes
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
1,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [98]:
features_train_new[1].head()

Unnamed: 0,OnlineSecurity&Contract&tenure_year_2014_No internet service&Month-to-month&0.0,OnlineSecurity&Contract&tenure_year_2014_No internet service&Month-to-month&1.0,OnlineSecurity&Contract&tenure_year_2014_No internet service&One year&0.0,OnlineSecurity&Contract&tenure_year_2014_No internet service&One year&1.0,OnlineSecurity&Contract&tenure_year_2014_No internet service&Two year&0.0,OnlineSecurity&Contract&tenure_year_2014_No internet service&Two year&1.0,OnlineSecurity&Contract&tenure_year_2014_No&Month-to-month&0.0,OnlineSecurity&Contract&tenure_year_2014_No&Month-to-month&1.0,OnlineSecurity&Contract&tenure_year_2014_No&One year&0.0,OnlineSecurity&Contract&tenure_year_2014_No&One year&1.0,OnlineSecurity&Contract&tenure_year_2014_No&Two year&0.0,OnlineSecurity&Contract&tenure_year_2014_No&Two year&1.0,OnlineSecurity&Contract&tenure_year_2014_Yes&Month-to-month&0.0,OnlineSecurity&Contract&tenure_year_2014_Yes&Month-to-month&1.0,OnlineSecurity&Contract&tenure_year_2014_Yes&One year&0.0,OnlineSecurity&Contract&tenure_year_2014_Yes&One year&1.0,OnlineSecurity&Contract&tenure_year_2014_Yes&Two year&0.0,OnlineSecurity&Contract&tenure_year_2014_Yes&Two year&1.0
0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
1,0.0,0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,1.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,1.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [99]:
features_train_new = pd.concat(features_train_new, axis=1)
features_test_new = pd.concat(features_test_new, axis=1)

In [100]:
features_train_new.head()

Unnamed: 0,OnlineSecurity&Contract&TechSupport_No internet service&Month-to-month&No internet service,OnlineSecurity&Contract&TechSupport_No internet service&One year&No internet service,OnlineSecurity&Contract&TechSupport_No internet service&Two year&No internet service,OnlineSecurity&Contract&TechSupport_No&Month-to-month&No,OnlineSecurity&Contract&TechSupport_No&Month-to-month&Yes,OnlineSecurity&Contract&TechSupport_No&One year&No,OnlineSecurity&Contract&TechSupport_No&One year&Yes,OnlineSecurity&Contract&TechSupport_No&Two year&No,OnlineSecurity&Contract&TechSupport_No&Two year&Yes,OnlineSecurity&Contract&TechSupport_Yes&Month-to-month&No,...,TechSupport&tenure_year_2019&tenure_year_2020_No&0.0&0.0,TechSupport&tenure_year_2019&tenure_year_2020_No&0.0&1.0,TechSupport&tenure_year_2019&tenure_year_2020_No&1.0&0.0,TechSupport&tenure_year_2019&tenure_year_2020_Yes&0.0&0.0,TechSupport&tenure_year_2019&tenure_year_2020_Yes&0.0&1.0,TechSupport&tenure_year_2019&tenure_year_2020_Yes&1.0&0.0,tenure_year_2014&tenure_year_2019&tenure_year_2020_0.0&0.0&0.0,tenure_year_2014&tenure_year_2019&tenure_year_2020_0.0&0.0&1.0,tenure_year_2014&tenure_year_2019&tenure_year_2020_0.0&1.0&0.0,tenure_year_2014&tenure_year_2019&tenure_year_2020_1.0&0.0&0.0
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0
1,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
2,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
3,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
4,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0


As before, we can briefly check the sparsity of the derived feature matrix:

In [101]:
features_train_new.shape

(5282, 225)

In [102]:
(features_train_new == 0).sum() / 5282

OnlineSecurity&Contract&TechSupport_No internet service&Month-to-month&No internet service    0.926543
OnlineSecurity&Contract&TechSupport_No internet service&One year&No internet service          0.948315
OnlineSecurity&Contract&TechSupport_No internet service&Two year&No internet service          0.912533
OnlineSecurity&Contract&TechSupport_No&Month-to-month&No                                      0.689133
OnlineSecurity&Contract&TechSupport_No&Month-to-month&Yes                                     0.931655
                                                                                                ...   
TechSupport&tenure_year_2019&tenure_year_2020_Yes&1.0&0.0                                     0.960242
tenure_year_2014&tenure_year_2019&tenure_year_2020_0.0&0.0&0.0                                0.515146
tenure_year_2014&tenure_year_2019&tenure_year_2020_0.0&0.0&1.0                                0.998107
tenure_year_2014&tenure_year_2019&tenure_year_2020_0.0&1.0&0.0           

In [103]:
((features_train_new == 0).sum() / 5282).sort_values(ascending = False)

OnlineSecurity&Contract&tenure_year_2019_No&Two year&1.0                           1.000000
Contract&TechSupport&tenure_year_2014_Month-to-month&No internet service&1.0       1.000000
OnlineSecurity&Contract&tenure_year_2014_No internet service&Month-to-month&1.0    1.000000
TechSupport&tenure_year_2019&tenure_year_2020_No&0.0&1.0                           0.999811
OnlineSecurity&TechSupport&tenure_year_2020_Yes&No&1.0                             0.999811
                                                                                     ...   
Contract&TechSupport&tenure_year_2020_Month-to-month&No&0.0                        0.616622
TechSupport&tenure_year_2014&tenure_year_2020_No&0.0&0.0                           0.560394
OnlineSecurity&tenure_year_2014&tenure_year_2020_No&0.0&0.0                        0.557175
tenure_year_2014&tenure_year_2019&tenure_year_2020_0.0&0.0&0.0                     0.515146
Contract&tenure_year_2014&tenure_year_2020_Month-to-month&0.0&0.0               

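As can be seen, several all-zero columns have already appeared, which is a consequence of multi-variable cross-combination. We could simply drop these columns; however, all-zero columns produce NaN in the subsequent correlation calculation and end up at the bottom of the sorted result, so they can also be left untouched for now.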
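
If the all-zero columns are to be removed, one option is to decide on the training set and apply the same column list to the test set. A minimal sketch on toy data (`train_demo`, `test_demo`, and `label` are hypothetical names):

```python
import pandas as pd

train_demo = pd.DataFrame({'f1': [1.0, 0.0, 1.0], 'f2': [0.0, 0.0, 0.0]})
test_demo = pd.DataFrame({'f1': [0.0, 1.0], 'f2': [1.0, 0.0]})
label = pd.Series([1, 0, 1], name='Churn')

# A constant column has zero variance, so its correlation with the label is NaN
corr_demo = pd.concat([train_demo, label], axis=1).corr()['Churn']
print(corr_demo)

# Keep only columns with at least one non-zero value in the training set
keep = train_demo.columns[(train_demo != 0).any(axis=0)]
train_kept, test_kept = train_demo[keep], test_demo[keep]
print(list(keep))
```

Selecting the columns on the training data only keeps the test set out of the decision, in line with the train/test separation used throughout this section.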

Next, run the correlation check:

In [104]:
# Reset the index
features_train_new.index = X_train.index

# Concatenate the derived features with the label
df_temp = pd.concat([features_train_new, y_train], axis=1)

# Inspect the concatenated df
df_temp.head()

Unnamed: 0,OnlineSecurity&Contract&TechSupport_No internet service&Month-to-month&No internet service,OnlineSecurity&Contract&TechSupport_No internet service&One year&No internet service,OnlineSecurity&Contract&TechSupport_No internet service&Two year&No internet service,OnlineSecurity&Contract&TechSupport_No&Month-to-month&No,OnlineSecurity&Contract&TechSupport_No&Month-to-month&Yes,OnlineSecurity&Contract&TechSupport_No&One year&No,OnlineSecurity&Contract&TechSupport_No&One year&Yes,OnlineSecurity&Contract&TechSupport_No&Two year&No,OnlineSecurity&Contract&TechSupport_No&Two year&Yes,OnlineSecurity&Contract&TechSupport_Yes&Month-to-month&No,...,TechSupport&tenure_year_2019&tenure_year_2020_No&0.0&1.0,TechSupport&tenure_year_2019&tenure_year_2020_No&1.0&0.0,TechSupport&tenure_year_2019&tenure_year_2020_Yes&0.0&0.0,TechSupport&tenure_year_2019&tenure_year_2020_Yes&0.0&1.0,TechSupport&tenure_year_2019&tenure_year_2020_Yes&1.0&0.0,tenure_year_2014&tenure_year_2019&tenure_year_2020_0.0&0.0&0.0,tenure_year_2014&tenure_year_2019&tenure_year_2020_0.0&0.0&1.0,tenure_year_2014&tenure_year_2019&tenure_year_2020_0.0&1.0&0.0,tenure_year_2014&tenure_year_2019&tenure_year_2020_1.0&0.0&0.0,Churn
4067,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0
3306,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0
3391,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0
3249,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0
2674,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0


In [105]:
# Compute the correlation coefficients
df_corr = df_temp.corr()['Churn'].sort_values(ascending = False)
df_corr

Churn                                                                                      1.000000
OnlineSecurity&Contract&tenure_year_2014_No&Month-to-month&0.0                             0.443017
OnlineSecurity&Contract&tenure_year_2020_No&Month-to-month&0.0                             0.441598
OnlineSecurity&Contract&TechSupport_No&Month-to-month&No                                   0.440002
Contract&TechSupport&tenure_year_2014_Month-to-month&No&0.0                                0.433924
                                                                                             ...   
OnlineSecurity&TechSupport&tenure_year_2020_No internet service&No internet service&0.0   -0.229197
Contract&tenure_year_2019&tenure_year_2020_Two year&0.0&0.0                               -0.294312
OnlineSecurity&Contract&tenure_year_2014_No internet service&Month-to-month&1.0                 NaN
OnlineSecurity&Contract&tenure_year_2019_No&Two year&1.0                                        NaN


In [106]:
# Inspect the 20 entries with the largest absolute correlation
np.abs(df_corr).sort_values(ascending = False)[: 20]

Churn                                                                1.000000
OnlineSecurity&Contract&tenure_year_2014_No&Month-to-month&0.0       0.443017
OnlineSecurity&Contract&tenure_year_2020_No&Month-to-month&0.0       0.441598
OnlineSecurity&Contract&TechSupport_No&Month-to-month&No             0.440002
Contract&TechSupport&tenure_year_2014_Month-to-month&No&0.0          0.433924
Contract&TechSupport&tenure_year_2020_Month-to-month&No&0.0          0.430260
OnlineSecurity&TechSupport&tenure_year_2014_No&No&0.0                0.411071
Contract&tenure_year_2014&tenure_year_2020_Month-to-month&0.0&0.0    0.402879
OnlineSecurity&Contract&tenure_year_2019_No&Month-to-month&1.0       0.402734
OnlineSecurity&TechSupport&tenure_year_2019_No&No&1.0                0.401730
OnlineSecurity&tenure_year_2014&tenure_year_2019_No&0.0&1.0          0.399682
OnlineSecurity&tenure_year_2019&tenure_year_2020_No&1.0&0.0          0.399682
OnlineSecurity&TechSupport&tenure_year_2020_No&No&0.0           

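We can see that, thanks to the time-series derived features, even third-order derivation still yields many high-quality features, which is why we recommend deriving time-series features first and cross-combining afterwards. Next, we again add the three features with the largest absolute correlation to the model for validation: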

In [107]:
new_col = list(np.abs(df_corr).sort_values(ascending = False)[1: 4].index)
new_col

['OnlineSecurity&Contract&tenure_year_2014_No&Month-to-month&0.0',
 'OnlineSecurity&Contract&tenure_year_2020_No&Month-to-month&0.0',
 'OnlineSecurity&Contract&TechSupport_No&Month-to-month&No']

In [108]:
train_new_temp = features_train_new[new_col]
test_new_temp = features_test_new[new_col]

In [109]:
features_test(train_new_temp, 
              test_new_temp, 
              X_train, 
              X_test, 
              y_train, 
              y_test, 
              category_cols, 
              numeric_cols)

44.57247447967529 s


(0.8085967288781859,
 {'columntransformer__num': 'passthrough',
  'logit_threshold__C': 0.5,
  'logit_threshold__penalty': 'l2',
  'logit_threshold__solver': 'lbfgs'})

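The result again shows that the combined features perform well. The process above also suggests that the better a feature performs, the more it is worth combining at a deeper level; if resources allow, we could even try four-way combinations. From a model-training perspective, deeper cross combination of features that already perform well further reinforces the signal they carry.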
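As an illustration only, a three-way cross combination can be sketched with plain pandas. This is not the notebook's own derivation function; the toy values are made up, and the column naming merely imitates the names seen in the outputs above.

```python
import pandas as pd

# Toy frame with three categorical columns (values are illustrative only).
df = pd.DataFrame({
    "OnlineSecurity": ["No", "Yes", "No", "No"],
    "Contract": ["Month-to-month", "Two year", "Month-to-month", "One year"],
    "TechSupport": ["No", "Yes", "No", "Yes"],
})
cols = ["OnlineSecurity", "Contract", "TechSupport"]

# Join the three values of each row into one combined category ...
combined = df[cols[0]] + "&" + df[cols[1]] + "&" + df[cols[2]]

# ... and one-hot encode it; the prefix mimics the naming used in the outputs above.
cross = pd.get_dummies(combined, prefix="&".join(cols), dtype=float)
```

Note that this sketch only creates columns for combinations that actually occur in the data, so it does not produce the all-zero columns discussed earlier.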

### 5. Group-statistics features

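&emsp;&emsp;Next we turn to group-statistics feature derivation. Before doing so, we briefly summarize the derivation process and results so far, which will help clarify the key points of the group-statistics workflow that follows.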

- Important features and ordinary features

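&emsp;&emsp;In the derivation work so far we found that features are "born different". Some show a strong correlation with the label in their original form, and we call these important features (for example 'OnlineSecurity', 'Contract', 'TechSupport', 'tenure_year_2014' and 'tenure_year_2019'). The others correlate only weakly with the label, and we call them ordinary features.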

> For now, feature importance is measured by the magnitude of the correlation coefficient; other measures will be introduced later.

- A pattern in feature derivation: "the strong stay strong"

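&emsp;&emsp;In practice we found that, in most cases, features obtained by pairwise cross combination of important features with each other, or of important features with ordinary features, also correlate well with the label. In other words, features derived from important features are very likely to be important too. Ordinary features, by contrast, only occasionally yield well-behaved features, and only when they are cross-combined with important ones. Either way, the observation can be summarized as: "features derived from important features tend to be important as well."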

- The essence of feature derivation: strengthening feature performance

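&emsp;&emsp;The "strong stay strong" pattern also hints at the essence of feature derivation: its fundamental role is to strengthen feature performance (to some degree and with some probability). Cross-combination derivation, for example, can be understood as a strengthening process: after important features are cross-combined, the importance of the derived features is (to some degree, with some probability) enhanced, which is why the "strong stay strong" pattern appears.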

- The "directionality" of group-statistics derivation

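&emsp;&emsp;For cross-combination derivation (and polynomial derivation as well), this strengthening has no direction: the importance (correlation) of a child feature is inherited from all of its parent features, so it is hard to say which original feature the combination of 'OnlineSecurity' and 'Contract' has strengthened. For cross combination, the practical lesson of "strong stay strong" is to bring in as many important features as possible for large-scale combination and, when there are many important features, to go further with three-way or even higher-order combinations. Not every derivation method behaves this way, however; group-statistics derivation is one example.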

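&emsp;&emsp;Group-statistics derivation splits the variables into two roles. The first is the key variable, the so-called KeyCol; the second is the grouped variables, that is, the variables whose statistics are computed after grouping by the key variable. Does the "strong stay strong" pattern still hold in this setting? If it does, are the results better when an important feature serves as KeyCol, or when important features serve as the grouped variables? Whichever it is, does this mean that group-statistics derivation is a "targeted" strengthening of either KeyCol or the grouped variables? Put differently, do group-statistics features inherit their importance from KeyCol or from the grouped variables?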
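To make the two roles concrete, here is a minimal sketch with plain pandas. It is not the notebook's `Group_Statistics` function: the toy values are invented, and computing the statistics on the training set only and joining them onto the test set is an assumption of this sketch.

```python
import pandas as pd

# Toy data (hypothetical values): `Contract` plays the role of KeyCol,
# `MonthlyCharges` is the grouped variable being aggregated.
train = pd.DataFrame({
    "Contract": [0, 0, 1, 1, 2],
    "MonthlyCharges": [80.0, 60.0, 50.0, 70.0, 20.0],
})
test = pd.DataFrame({"Contract": [2, 0], "MonthlyCharges": [25.0, 90.0]})

key_col, stat_col = "Contract", "MonthlyCharges"

# Statistics are computed per KeyCol group on the training set only ...
stats = train.groupby(key_col)[stat_col].agg(["mean", "max"])
stats.columns = [f"{stat_col}_{key_col}_{s}" for s in stats.columns]
stats = stats.reset_index()

# ... and attached to both sets by joining on the key column.
train_new = train[[key_col]].merge(stats, on=key_col, how="left")
test_new = test[[key_col]].merge(stats, on=key_col, how="left")
```

Every row in the same `Contract` group receives the same derived value, which is the property examined in the experiments.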

&emsp;&emsp;Next, we examine these questions through concrete experiments.

#### 5.1 Single-variable group statistics

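&emsp;&emsp;Without any prior knowledge, we can iterate over all discrete variables, use each one in turn as the key variable, derive group statistics over all other variables, and use the results to explore how the key and grouped roles should be assigned. Note that iterating over every discrete variable produces a huge number of features. This is done here only for demonstration and discussion and is not a general derivation strategy; on large datasets in particular, it would be extremely expensive to compute.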

- Exploring group-statistics derivation

In [110]:
Group_Statistics?

Signature:
Group_Statistics(
    keyCol,
    X_train,
    X_test,
    col_num=None,
    col_cat=None,
    num_stat=['mean', 'var', 'max', 'min', 'skew', 'median'],
    cat_stat=['mean', 'var', 'max', 'min', 'median', 'count', 'nunique'],
    quant=True,
    multi=False,

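&emsp;&emsp;First, the iteration over discrete variables can be implemented as follows. Note that KeyCol must not also appear among the grouped variables. Because we want to bring in as many grouped variables as possible in order to derive more useful features, we pick KeyCol from the discrete variables one at a time and treat every remaining discrete column as a grouped variable: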

In [111]:
col_temp = category_cols.copy()
col_temp

['gender',
 'SeniorCitizen',
 'Partner',
 'Dependents',
 'PhoneService',
 'MultipleLines',
 'InternetService',
 'OnlineSecurity',
 'OnlineBackup',
 'DeviceProtection',
 'TechSupport',
 'StreamingTV',
 'StreamingMovies',
 'Contract',
 'PaperlessBilling',
 'PaymentMethod']

In [112]:
keyCol = col_temp.pop(2)
keyCol

'Partner'

In [113]:
col_temp

['gender',
 'SeniorCitizen',
 'Dependents',
 'PhoneService',
 'MultipleLines',
 'InternetService',
 'OnlineSecurity',
 'OnlineBackup',
 'DeviceProtection',
 'TechSupport',
 'StreamingTV',
 'StreamingMovies',
 'Contract',
 'PaperlessBilling',
 'PaymentMethod']

In [114]:
col_temp = category_cols.copy()
col_temp

['gender',
 'SeniorCitizen',
 'Partner',
 'Dependents',
 'PhoneService',
 'MultipleLines',
 'InternetService',
 'OnlineSecurity',
 'OnlineBackup',
 'DeviceProtection',
 'TechSupport',
 'StreamingTV',
 'StreamingMovies',
 'Contract',
 'PaperlessBilling',
 'PaymentMethod']

In [115]:
for i in range(len(col_temp)):
    keyCol = col_temp.pop(i)
    col_temp = category_cols.copy()
    print(keyCol)

gender
SeniorCitizen
Partner
Dependents
PhoneService
MultipleLines
InternetService
OnlineSecurity
OnlineBackup
DeviceProtection
TechSupport
StreamingTV
StreamingMovies
Contract
PaperlessBilling
PaymentMethod
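The same iteration can also be written without mutating and re-copying the list. The sketch below uses a shortened, illustrative column list and hypothetical variable names (`key_col`, `grouped_cols`, `pairs`); it is an alternative formulation, not the code used in the rest of the notebook.

```python
# Shortened, illustrative column list; the notebook uses all 16 discrete columns.
category_cols = ["gender", "Partner", "Contract", "PaymentMethod"]

pairs = []
for key_col in category_cols:
    # Every discrete column except the current key becomes a grouped variable.
    grouped_cols = [c for c in category_cols if c != key_col]
    pairs.append((key_col, grouped_cols))
```

Because `category_cols` is never modified, there is no need to restore it with `.copy()` at the end of each pass.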


Before running the group statistics, we also need to apply ordinal encoding to the object-typed discrete variables to convert them to numeric values:

In [116]:
from sklearn.preprocessing import OrdinalEncoder

In [117]:
ord_enc = OrdinalEncoder()
ord_enc.fit(X_train[category_cols])

OrdinalEncoder()

In [118]:
ord_enc.transform(X_train[category_cols])

array([[0., 0., 1., ..., 2., 1., 0.],
       [0., 0., 1., ..., 0., 1., 2.],
       [1., 0., 0., ..., 0., 0., 3.],
       ...,
       [0., 1., 0., ..., 0., 1., 2.],
       [0., 0., 1., ..., 1., 0., 1.],
       [1., 0., 1., ..., 2., 1., 0.]])

In [119]:
X_train_OE = pd.DataFrame(ord_enc.transform(X_train[category_cols]), columns=category_cols)
X_train_OE.index = X_train.index
X_train_OE = pd.concat([X_train_OE, X_train[numeric_cols]], axis=1)

X_test_OE = pd.DataFrame(ord_enc.transform(X_test[category_cols]), columns=category_cols)
X_test_OE.index = X_test.index
X_test_OE = pd.concat([X_test_OE, X_test[numeric_cols]], axis=1)

In [120]:
X_train_OE.head()

Unnamed: 0,gender,SeniorCitizen,Partner,Dependents,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,tenure,MonthlyCharges,TotalCharges
4067,0.0,0.0,1.0,0.0,1.0,0.0,0.0,2.0,2.0,0.0,2.0,2.0,2.0,2.0,1.0,0.0,68,79.6,5515.8
3306,0.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0,2.0,0.0,2.0,0.0,0.0,0.0,1.0,2.0,3,80.0,241.3
3391,1.0,0.0,0.0,0.0,1.0,0.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,3.0,4,19.0,73.45
3249,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,2.0,2.0,0.0,0.0,0.0,0.0,1.0,3.0,10,55.55,551.3
2674,0.0,1.0,0.0,0.0,1.0,0.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,3.0,4,20.05,91.45


In [121]:
X_train_OE.shape, X_train.shape

((5282, 19), (5282, 19))

In [122]:
X_test_OE.shape, X_test.shape

((1761, 19), (1761, 19))

The numeric code assigned to each category of the discrete variables can be inspected as follows:

In [123]:
ord_enc.categories_

[array(['Female', 'Male'], dtype=object),
 array([0, 1], dtype=int64),
 array(['No', 'Yes'], dtype=object),
 array(['No', 'Yes'], dtype=object),
 array(['No', 'Yes'], dtype=object),
 array(['No', 'No phone service', 'Yes'], dtype=object),
 array(['DSL', 'Fiber optic', 'No'], dtype=object),
 array(['No', 'No internet service', 'Yes'], dtype=object),
 array(['No', 'No internet service', 'Yes'], dtype=object),
 array(['No', 'No internet service', 'Yes'], dtype=object),
 array(['No', 'No internet service', 'Yes'], dtype=object),
 array(['No', 'No internet service', 'Yes'], dtype=object),
 array(['No', 'No internet service', 'Yes'], dtype=object),
 array(['Month-to-month', 'One year', 'Two year'], dtype=object),
 array(['No', 'Yes'], dtype=object),
 array(['Bank transfer (automatic)', 'Credit card (automatic)',
        'Electronic check', 'Mailed check'], dtype=object)]
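Reading this output: the code of a category is its position in the corresponding array. The sketch below rebuilds that mapping for two of the columns with plain dictionaries; the category lists are copied from the output above, and the name `code_maps` is hypothetical.

```python
# Categories copied from the `ord_enc.categories_` output above (two columns only).
categories = {
    "Contract": ["Month-to-month", "One year", "Two year"],
    "InternetService": ["DSL", "Fiber optic", "No"],
}

# The encoded value of a category is its index in the corresponding array.
code_maps = {
    col: {cat: float(i) for i, cat in enumerate(cats)}
    for col, cats in categories.items()
}
```

For instance, the first row of `X_train_OE` has `Contract = 2.0`, which corresponds to 'Two year'.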

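Next, we feed the ordinal-encoded dataset into the group statistics, iterating over the discrete variables and enabling the extension features (`extension=True`) at the same time: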

In [124]:
# Create containers
col_temp = category_cols.copy()
colNames_train_new = []
colNames_test_new = []
features_train_new = []
features_test_new = []

for i in range(len(col_temp)):
    keyCol = col_temp.pop(i)
    features_train1, features_test1, colNames_train, colNames_test = Group_Statistics(keyCol,
                                                                                      X_train_OE,
                                                                                      X_test_OE,
                                                                                      col_num=numeric_cols,
                                                                                      col_cat=col_temp, 
                                                                                      extension=True)
    
    colNames_train_new.extend(colNames_train)
    colNames_test_new.extend(colNames_test)
    features_train_new.append(features_train1)
    features_test_new.append(features_test1)
    
    col_temp = category_cols.copy()

In [125]:
features_train_new[0]

Unnamed: 0,tenure_gender_mean,tenure_gender_var,tenure_gender_max,tenure_gender_min,tenure_gender_skew,tenure_gender_median,MonthlyCharges_gender_mean,MonthlyCharges_gender_var,MonthlyCharges_gender_max,MonthlyCharges_gender_min,...,InternetService_cv_gender,OnlineSecurity_cv_gender,OnlineBackup_cv_gender,DeviceProtection_cv_gender,TechSupport_cv_gender,StreamingTV_cv_gender,StreamingMovies_cv_gender,Contract_cv_gender,PaperlessBilling_cv_gender,PaymentMethod_cv_gender
0,31.947588,601.095276,72,0,0.257136,28,65.345765,891.427517,118.75,18.40,...,0.852147,1.093349,0.990747,0.978976,1.088451,0.891776,0.899249,1.228772,0.818448,0.682600
1,31.947588,601.095276,72,0,0.257136,28,65.345765,891.427517,118.75,18.40,...,0.852147,1.093349,0.990747,0.978976,1.088451,0.891776,0.899249,1.228772,0.818448,0.682600
2,32.597584,614.757184,72,0,0.234222,29,64.016969,899.633055,118.35,18.25,...,0.857846,1.094405,0.989697,0.978766,1.083917,0.916737,0.905134,1.199651,0.843495,0.669323
3,31.947588,601.095276,72,0,0.257136,28,65.345765,891.427517,118.75,18.40,...,0.852147,1.093349,0.990747,0.978976,1.088451,0.891776,0.899249,1.228772,0.818448,0.682600
4,31.947588,601.095276,72,0,0.257136,28,65.345765,891.427517,118.75,18.40,...,0.852147,1.093349,0.990747,0.978976,1.088451,0.891776,0.899249,1.228772,0.818448,0.682600
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5277,32.597584,614.757184,72,0,0.234222,29,64.016969,899.633055,118.35,18.25,...,0.857846,1.094405,0.989697,0.978766,1.083917,0.916737,0.905134,1.199651,0.843495,0.669323
5278,31.947588,601.095276,72,0,0.257136,28,65.345765,891.427517,118.75,18.40,...,0.852147,1.093349,0.990747,0.978976,1.088451,0.891776,0.899249,1.228772,0.818448,0.682600
5279,31.947588,601.095276,72,0,0.257136,28,65.345765,891.427517,118.75,18.40,...,0.852147,1.093349,0.990747,0.978976,1.088451,0.891776,0.899249,1.228772,0.818448,0.682600
5280,31.947588,601.095276,72,0,0.257136,28,65.345765,891.427517,118.75,18.40,...,0.852147,1.093349,0.990747,0.978976,1.088451,0.891776,0.899249,1.228772,0.818448,0.682600


In [126]:
features_train_new = pd.concat(features_train_new, axis=1)
features_test_new = pd.concat(features_test_new, axis=1)

In [127]:
features_train_new.head()

Unnamed: 0,tenure_gender_mean,tenure_gender_var,tenure_gender_max,tenure_gender_min,tenure_gender_skew,tenure_gender_median,MonthlyCharges_gender_mean,MonthlyCharges_gender_var,MonthlyCharges_gender_max,MonthlyCharges_gender_min,...,MultipleLines_cv_PaymentMethod,InternetService_cv_PaymentMethod,OnlineSecurity_cv_PaymentMethod,OnlineBackup_cv_PaymentMethod,DeviceProtection_cv_PaymentMethod,TechSupport_cv_PaymentMethod,StreamingTV_cv_PaymentMethod,StreamingMovies_cv_PaymentMethod,Contract_cv_PaymentMethod,PaperlessBilling_cv_PaymentMethod
0,31.947588,601.095276,72,0,0.257136,28,65.345765,891.427517,118.75,18.4,...,0.869666,0.891326,0.919172,0.826531,0.836828,0.889565,0.827997,0.801837,0.852233,0.864201
1,31.947588,601.095276,72,0,0.257136,28,65.345765,891.427517,118.75,18.4,...,0.909129,0.692528,1.702213,1.278833,1.266881,1.704683,0.956058,0.963521,2.042807,0.608718
2,32.597584,614.757184,72,0,0.234222,29,64.016969,899.633055,118.35,18.25,...,1.576416,0.865125,0.828221,0.893478,0.893478,0.834575,0.911518,0.923124,1.238971,1.190282
3,31.947588,601.095276,72,0,0.257136,28,65.345765,891.427517,118.75,18.4,...,1.576416,0.865125,0.828221,0.893478,0.893478,0.834575,0.911518,0.923124,1.238971,1.190282
4,31.947588,601.095276,72,0,0.257136,28,65.345765,891.427517,118.75,18.4,...,1.576416,0.865125,0.828221,0.893478,0.893478,0.834575,0.911518,0.923124,1.238971,1.190282


In [128]:
features_train_new.shape, features_test_new.shape

((5282, 5136), (1761, 5136))

As before, we next check the correlation between the derived features and the label:

In [129]:
# Align the index with X_train
features_train_new.index = X_train.index

# Concatenate the derived features with the label
df_temp = pd.concat([features_train_new, y_train], axis=1)

# Inspect the concatenated DataFrame
df_temp.head()

Unnamed: 0,tenure_gender_mean,tenure_gender_var,tenure_gender_max,tenure_gender_min,tenure_gender_skew,tenure_gender_median,MonthlyCharges_gender_mean,MonthlyCharges_gender_var,MonthlyCharges_gender_max,MonthlyCharges_gender_min,...,InternetService_cv_PaymentMethod,OnlineSecurity_cv_PaymentMethod,OnlineBackup_cv_PaymentMethod,DeviceProtection_cv_PaymentMethod,TechSupport_cv_PaymentMethod,StreamingTV_cv_PaymentMethod,StreamingMovies_cv_PaymentMethod,Contract_cv_PaymentMethod,PaperlessBilling_cv_PaymentMethod,Churn
4067,31.947588,601.095276,72,0,0.257136,28,65.345765,891.427517,118.75,18.4,...,0.891326,0.919172,0.826531,0.836828,0.889565,0.827997,0.801837,0.852233,0.864201,0
3306,31.947588,601.095276,72,0,0.257136,28,65.345765,891.427517,118.75,18.4,...,0.692528,1.702213,1.278833,1.266881,1.704683,0.956058,0.963521,2.042807,0.608718,0
3391,32.597584,614.757184,72,0,0.234222,29,64.016969,899.633055,118.35,18.25,...,0.865125,0.828221,0.893478,0.893478,0.834575,0.911518,0.923124,1.238971,1.190282,0
3249,31.947588,601.095276,72,0,0.257136,28,65.345765,891.427517,118.75,18.4,...,0.865125,0.828221,0.893478,0.893478,0.834575,0.911518,0.923124,1.238971,1.190282,0
2674,31.947588,601.095276,72,0,0.257136,28,65.345765,891.427517,118.75,18.4,...,0.865125,0.828221,0.893478,0.893478,0.834575,0.911518,0.923124,1.238971,1.190282,0


In [130]:
# Compute the correlation coefficients
df_corr = df_temp.corr()['Churn'].sort_values(ascending = False)
df_corr

Churn                                 1.000000
tenure_cv_Contract                    0.406388
OnlineSecurity_cv_Contract            0.406261
Dependents_cv_Contract                0.406230
TotalCharges_norm_Contract            0.406185
                                        ...   
Partner_gap_PaymentMethod                  NaN
PhoneService_gap_PaymentMethod             NaN
PaperlessBilling_gap_PaymentMethod         NaN
SeniorCitizen_mag2_PaymentMethod           NaN
Dependents_mag2_PaymentMethod              NaN
Name: Churn, Length: 5137, dtype: float64

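A first look shows that the derived features still include many high-quality ones. Next we inspect the 20 derived features with the largest absolute correlation: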

In [131]:
# Inspect the 20 entries with the largest absolute correlation
np.abs(df_corr).sort_values(ascending = False)[: 20]

Churn                                              1.000000
TotalCharges_mag2_Contract                         0.406394
tenure_cv_Contract                                 0.406388
StreamingTV_Contract_mean                          0.406276
OnlineSecurity_cv_Contract                         0.406261
tenure_dive2_tenure_Contract_median                0.406239
Dependents_cv_Contract                             0.406230
Dependents_Contract_var                            0.406203
TotalCharges_norm_Contract                         0.406185
DeviceProtection_cv_Contract                       0.406177
OnlineBackup_cv_Contract                           0.406168
OnlineBackup_Contract_mean                         0.406082
tenure_mag2_Contract                               0.406042
TotalCharges_Contract_skew                         0.406039
StreamingTV_cv_Contract                            0.406021
DeviceProtection_Contract_mean                     0.405749
PaperlessBilling_Contract_mean          

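&emsp;&emsp;From these results it is easy to see that the most strongly correlated derived features are all group statistics keyed on the Contract field (mag2 is a data-skew feature and cv is the coefficient of variation). This indicates that group-statistics derivation strengthens features around KeyCol: if KeyCol is itself an important feature, group statistics on it also yield many important features. In terms of quantity, a single key feature is enough to produce a large number of them: the Contract field alone produced at least 100 derived features with a correlation above 0.4, far more than cross-combination derivation. Measured by the number of important features produced per input feature, group-statistics derivation is arguably the most efficient of the derivation methods discussed here.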
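For orientation, the two extension statistics named above can be sketched as follows. The exact definitions inside `Group_Statistics` are not shown in this section, so the formulas here (cv as group standard deviation over group mean, mag2 as group median over group mean) are assumptions made for illustration, and the data is invented.

```python
import pandas as pd

train = pd.DataFrame({
    "Contract": [0, 0, 0, 1, 1, 1],
    "tenure": [1.0, 2.0, 6.0, 40.0, 50.0, 60.0],
})

g = train.groupby("Contract")["tenure"]
stats = pd.DataFrame({
    "mean": g.mean(),
    "median": g.median(),
    "std": g.std(),  # sample standard deviation (ddof=1)
})

# Assumed definitions, for illustration only:
stats["tenure_cv_Contract"] = stats["std"] / stats["mean"]       # coefficient of variation
stats["tenure_mag2_Contract"] = stats["median"] / stats["mean"]  # skew indicator

# Broadcast the per-group values back to the rows.
derived = train[["Contract"]].join(
    stats[["tenure_cv_Contract", "tenure_mag2_Contract"]], on="Contract"
)
```

Both derived columns are constant within each `Contract` group, so they can only separate rows as well as `Contract` itself does.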

In [132]:
np.abs(df_corr).sort_values(ascending = False)[20: 40]

TotalCharges_Contract_q2                          0.405671
tenure_Contract_q2                                0.405599
TotalCharges_Contract_q1                          0.405563
StreamingMovies_cv_Contract                       0.405519
TotalCharges_minus2_TotalCharges_Contract_mean    0.405406
TotalCharges_minus1_TotalCharges_Contract_mean    0.405406
TotalCharges_Contract_mean                        0.405402
tenure_dive1_tenure_Contract_mean                 0.405313
TechSupport_dive1_TechSupport_Contract_mean       0.405298
OnlineSecurity_Contract_mean                      0.405243
TechSupport_cv_Contract                           0.405048
PaperlessBilling_cv_Contract                      0.404911
TotalCharges_cv_Contract                          0.404896
PaymentMethod_Contract_var                        0.404894
Partner_cv_Contract                               0.404877
MonthlyCharges_Contract_q1                        0.404671
TotalCharges_Contract_median                      0.4041

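Deriving so many important features also owes much to the large set of statistics built into the group-statistics function. Either way, we now have a new way to understand group-statistics derivation: it reinforces the importance of KeyCol and strengthens its performance, which is again a "strong stay strong" process. We can also look further down the ranking to see which KeyCol the second tier of derived features (by absolute correlation) belongs to: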

In [133]:
np.abs(df_corr).sort_values(ascending = False)[200:220]

MonthlyCharges_minus2_MonthlyCharges_Contract_mean       0.371829
InternetService_mag2_Contract                            0.371437
PaymentMethod_mag2_Contract                              0.369155
MultipleLines_Contract_mean                              0.368319
InternetService_Contract_mean                            0.363831
InternetService_mag1_Contract                            0.363831
PaymentMethod_mag1_Contract                              0.362901
Dependents_OnlineSecurity_var                            0.362594
MultipleLines_dive1_MultipleLines_OnlineSecurity_mean    0.362330
Dependents_cv_OnlineSecurity                             0.361842
SeniorCitizen_OnlineSecurity_count                       0.361838
Partner_OnlineSecurity_count                             0.361838
Dependents_OnlineSecurity_count                          0.361838
PhoneService_OnlineSecurity_count                        0.361838
MultipleLines_OnlineSecurity_count                       0.361838
OnlineBack

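We can see that the KeyCol of the second-tier derived features (OnlineSecurity) is also an important feature, and that the correlation of the derived features is close to that of the original feature.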

| Original field | Correlation coefficient |
| ------ | ------ |
| Contract_Month-to-month | 0.405103 |
| OnlineSecurity_No | 0.342637 |
| TechSupport_No  | 0.337281 |

&emsp;&emsp;We can also validate these derived fields with a model, adding the three most strongly correlated features to the modeling:

In [134]:
new_col = list(np.abs(df_corr).sort_values(ascending = False)[1: 4].index)
new_col

['TotalCharges_mag2_Contract',
 'tenure_cv_Contract',
 'StreamingTV_Contract_mean']

In [135]:
train_new_temp = features_train_new[new_col]
test_new_temp = features_test_new[new_col]

In [136]:
train_new_temp.head()

Unnamed: 0,TotalCharges_mag2_Contract,tenure_cv_Contract,StreamingTV_Contract_mean
4067,0.986651,0.312622,1.268409
3306,0.495887,0.986208,0.794933
3391,0.495887,0.986208,0.794933
3249,0.495887,0.986208,0.794933
2674,0.495887,0.986208,0.794933


In [137]:
features_test(train_new_temp, 
              test_new_temp, 
              X_train, 
              X_test, 
              y_train, 
              y_test, 
              category_cols, 
              numeric_cols)

50.71905517578125 s


(0.810489414007626,
 {'columntransformer__num': StandardScaler(),
  'logit_threshold__C': 0.30000000000000004,
  'logit_threshold__penalty': 'l1',
  'logit_threshold__solver': 'saga'})

The model result is in line with expectations.

- Why group-statistics derivation strengthens KeyCol

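&emsp;&emsp;We can go further and discuss why group statistics strengthen the KeyCol feature, which helps in judging how useful these features really are. In group-statistics derivation, a derived feature has different values from the original feature but the same distribution: rows that belong to the same group in the original data still share one value after the group statistic is computed (same distribution), and only the value attached to each group changes (different values). For example: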

<center><img src="https://s2.loli.net/2022/03/16/1UWuFzTK3dLHQMR.png" alt="image-20220316171913807" style="zoom:50%;" />

As a result, the derived features are highly similar to one another. We can check this on features derived earlier:

In [138]:
pd.concat([X_train_OE['Contract'], 
           features_train_new['StreamingTV_Contract_mean'], 
           features_train_new['TotalCharges_mag2_Contract']], axis=1)[:10]

Unnamed: 0,Contract,StreamingTV_Contract_mean,TotalCharges_mag2_Contract
4067,2.0,1.268409,0.986651
3306,0.0,0.794933,0.495887
3391,0.0,0.794933,0.495887
3249,0.0,0.794933,0.495887
2674,0.0,0.794933,0.495887
2757,1.0,1.14663,0.877504
5237,0.0,0.794933,0.495887
5856,0.0,0.794933,0.495887
3069,0.0,0.794933,0.495887
6631,0.0,0.794933,0.495887


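&emsp;&emsp;A large number of identically distributed derived features affects some models, typically those sensitive to collinearity (such as logistic regression without a regularization term). For the ensemble methods that are more mainstream today, identically distributed features do not harm training much, and they can even give base learners more room for randomness (for example in a random forest, where different base learners can be trained on different features). Even so, as noted earlier, feature derivation does not create information; it reorganizes it. The derived features remain heavily dependent on the original features: they are "shadows" of the originals rather than independent of them.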
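The "shadow" effect is exact when KeyCol is binary: any group statistic then takes only two values, so it is an affine function of KeyCol and has the same absolute correlation with the label. The sketch below checks this on random, purely illustrative data; all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
key = rng.integers(0, 2, size=200)    # binary KeyCol
label = rng.integers(0, 2, size=200)  # binary label (random, illustrative)
value = rng.normal(size=200)          # any variable to aggregate

# Per-group mean and variance, broadcast back to the rows.
group_mean = np.where(key == 1, value[key == 1].mean(), value[key == 0].mean())
group_var = np.where(key == 1, value[key == 1].var(), value[key == 0].var())

def abs_corr(a, b):
    """Absolute Pearson correlation between two 1-D arrays."""
    return abs(np.corrcoef(a, b)[0, 1])

r_key = abs_corr(key, label)                 # KeyCol vs label
r_mean = abs_corr(group_mean, label)         # derived mean vs label
r_var = abs_corr(group_var, label)           # derived variance vs label
r_between = abs_corr(group_mean, group_var)  # derived features vs each other
```

Here `r_mean` and `r_var` match `r_key`, and `r_between` equals 1, which is the perfect collinearity that unregularized linear models struggle with. With a three-level KeyCol such as Contract the relationship is no longer exactly affine, so the correlations are close but not identical.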

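&emsp;&emsp;Since group-statistics derivation strengthens KeyCol in a targeted way, the time-series derived features created earlier, as well as the important features produced by two-variable or even multi-variable cross combination, should in theory also yield more important features through group statistics. We verify this next.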

#### 5.2 Group statistics on time-series derived fields

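&emsp;&emsp;Next we test how well derivation works when time-series fields serve as KeyCol. First we concatenate the time-series derived fields with the original dataset: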

In [139]:
X_train_seq.head()

Unnamed: 0,tenure_year_2014,tenure_year_2015,tenure_year_2016,tenure_year_2017,tenure_year_2018,tenure_year_2019,tenure_year_2020,tenure_month_1,tenure_month_2,tenure_month_3,...,tenure_month_7,tenure_month_8,tenure_month_9,tenure_month_10,tenure_month_11,tenure_month_12,tenure_quarter_1,tenure_quarter_2,tenure_quarter_3,tenure_quarter_4
4067,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
3306,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0
3391,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
3249,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
2674,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0


In [140]:
# Align the indexes
X_train_seq.index = X_train.index
X_test_seq.index = X_test.index

# Concatenate the datasets
train_temp = pd.concat([X_train_OE, X_train_seq], axis=1)
test_temp = pd.concat([X_test_OE, X_test_seq], axis=1)

In [141]:
train_temp.shape

(5282, 42)

In [142]:
# Names of all discrete variables, including the time-series derived ones
cat_temp = (category_cols + list(X_train_seq.columns)).copy()
cat_temp

['gender',
 'SeniorCitizen',
 'Partner',
 'Dependents',
 'PhoneService',
 'MultipleLines',
 'InternetService',
 'OnlineSecurity',
 'OnlineBackup',
 'DeviceProtection',
 'TechSupport',
 'StreamingTV',
 'StreamingMovies',
 'Contract',
 'PaperlessBilling',
 'PaymentMethod',
 'tenure_year_2014',
 'tenure_year_2015',
 'tenure_year_2016',
 'tenure_year_2017',
 'tenure_year_2018',
 'tenure_year_2019',
 'tenure_year_2020',
 'tenure_month_1',
 'tenure_month_2',
 'tenure_month_3',
 'tenure_month_4',
 'tenure_month_5',
 'tenure_month_6',
 'tenure_month_7',
 'tenure_month_8',
 'tenure_month_9',
 'tenure_month_10',
 'tenure_month_11',
 'tenure_month_12',
 'tenure_quarter_1',
 'tenure_quarter_2',
 'tenure_quarter_3',
 'tenure_quarter_4']

In [143]:
cat_temp = list(X_train_seq.columns).copy()
cat_temp

['tenure_year_2014',
 'tenure_year_2015',
 'tenure_year_2016',
 'tenure_year_2017',
 'tenure_year_2018',
 'tenure_year_2019',
 'tenure_year_2020',
 'tenure_month_1',
 'tenure_month_2',
 'tenure_month_3',
 'tenure_month_4',
 'tenure_month_5',
 'tenure_month_6',
 'tenure_month_7',
 'tenure_month_8',
 'tenure_month_9',
 'tenure_month_10',
 'tenure_month_11',
 'tenure_month_12',
 'tenure_quarter_1',
 'tenure_quarter_2',
 'tenure_quarter_3',
 'tenure_quarter_4']

In [144]:
# Create containers
col_temp = cat_temp.copy()
colNames_train_new = []
colNames_test_new = []
features_train_new = []
features_test_new = []

for i in range(len(col_temp)):
    keyCol = col_temp.pop(i)
    features_train1, features_test1, colNames_train, colNames_test = Group_Statistics(keyCol,
                                                                                      train_temp,
                                                                                      test_temp,
                                                                                      col_num=numeric_cols,
                                                                                      col_cat=col_temp+category_cols, 
                                                                                      extension=True)
    
    colNames_train_new.extend(colNames_train)
    colNames_test_new.extend(colNames_test)
    features_train_new.append(features_train1)
    features_test_new.append(features_test1)
    
    col_temp = cat_temp.copy()

In [145]:
features_train_new[0]

Unnamed: 0,tenure_tenure_year_2014_mean,tenure_tenure_year_2014_var,tenure_tenure_year_2014_max,tenure_tenure_year_2014_min,tenure_tenure_year_2014_skew,tenure_tenure_year_2014_median,MonthlyCharges_tenure_year_2014_mean,MonthlyCharges_tenure_year_2014_var,MonthlyCharges_tenure_year_2014_max,MonthlyCharges_tenure_year_2014_min,...,InternetService_cv_tenure_year_2014,OnlineSecurity_cv_tenure_year_2014,OnlineBackup_cv_tenure_year_2014,DeviceProtection_cv_tenure_year_2014,TechSupport_cv_tenure_year_2014,StreamingTV_cv_tenure_year_2014,StreamingMovies_cv_tenure_year_2014,Contract_cv_tenure_year_2014,PaperlessBilling_cv_tenure_year_2014,PaymentMethod_cv_tenure_year_2014
0,68.127341,13.127162,72,61,-0.549207,69,76.213530,1022.759716,118.75,19.10,...,0.848957,0.694922,0.572021,0.566283,0.691981,0.620829,0.594970,0.380092,0.821334,0.974531
1,23.186758,350.262051,60,0,0.439575,19,61.756122,821.604310,116.60,18.25,...,0.856458,1.222631,1.121632,1.107576,1.212794,0.985965,0.991487,1.565422,0.833398,0.601744
2,23.186758,350.262051,60,0,0.439575,19,61.756122,821.604310,116.60,18.25,...,0.856458,1.222631,1.121632,1.107576,1.212794,0.985965,0.991487,1.565422,0.833398,0.601744
3,23.186758,350.262051,60,0,0.439575,19,61.756122,821.604310,116.60,18.25,...,0.856458,1.222631,1.121632,1.107576,1.212794,0.985965,0.991487,1.565422,0.833398,0.601744
4,23.186758,350.262051,60,0,0.439575,19,61.756122,821.604310,116.60,18.25,...,0.856458,1.222631,1.121632,1.107576,1.212794,0.985965,0.991487,1.565422,0.833398,0.601744
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5277,23.186758,350.262051,60,0,0.439575,19,61.756122,821.604310,116.60,18.25,...,0.856458,1.222631,1.121632,1.107576,1.212794,0.985965,0.991487,1.565422,0.833398,0.601744
5278,23.186758,350.262051,60,0,0.439575,19,61.756122,821.604310,116.60,18.25,...,0.856458,1.222631,1.121632,1.107576,1.212794,0.985965,0.991487,1.565422,0.833398,0.601744
5279,23.186758,350.262051,60,0,0.439575,19,61.756122,821.604310,116.60,18.25,...,0.856458,1.222631,1.121632,1.107576,1.212794,0.985965,0.991487,1.565422,0.833398,0.601744
5280,23.186758,350.262051,60,0,0.439575,19,61.756122,821.604310,116.60,18.25,...,0.856458,1.222631,1.121632,1.107576,1.212794,0.985965,0.991487,1.565422,0.833398,0.601744


In [146]:
features_train_new = pd.concat(features_train_new, axis=1)
features_test_new = pd.concat(features_test_new, axis=1)

In [147]:
features_train_new.head()

Unnamed: 0,tenure_tenure_year_2014_mean,tenure_tenure_year_2014_var,tenure_tenure_year_2014_max,tenure_tenure_year_2014_min,tenure_tenure_year_2014_skew,tenure_tenure_year_2014_median,MonthlyCharges_tenure_year_2014_mean,MonthlyCharges_tenure_year_2014_var,MonthlyCharges_tenure_year_2014_max,MonthlyCharges_tenure_year_2014_min,...,InternetService_cv_tenure_quarter_4,OnlineSecurity_cv_tenure_quarter_4,OnlineBackup_cv_tenure_quarter_4,DeviceProtection_cv_tenure_quarter_4,TechSupport_cv_tenure_quarter_4,StreamingTV_cv_tenure_quarter_4,StreamingMovies_cv_tenure_quarter_4,Contract_cv_tenure_quarter_4,PaperlessBilling_cv_tenure_quarter_4,PaymentMethod_cv_tenure_quarter_4
0,68.127341,13.127162,72,61,-0.549207,69,76.21353,1022.759716,118.75,19.1,...,0.858849,1.015211,0.911021,0.892671,1.001012,0.84299,0.836854,1.041919,0.819885,0.736798
1,23.186758,350.262051,60,0,0.439575,19,61.756122,821.60431,116.6,18.25,...,0.846682,1.290253,1.186932,1.195498,1.301364,1.050394,1.059107,1.773409,0.855493,0.544229
2,23.186758,350.262051,60,0,0.439575,19,61.756122,821.60431,116.6,18.25,...,0.858849,1.015211,0.911021,0.892671,1.001012,0.84299,0.836854,1.041919,0.819885,0.736798
3,23.186758,350.262051,60,0,0.439575,19,61.756122,821.60431,116.6,18.25,...,0.858849,1.015211,0.911021,0.892671,1.001012,0.84299,0.836854,1.041919,0.819885,0.736798
4,23.186758,350.262051,60,0,0.439575,19,61.756122,821.60431,116.6,18.25,...,0.858849,1.015211,0.911021,0.892671,1.001012,0.84299,0.836854,1.041919,0.819885,0.736798


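We can see that as more features are brought in, the number of derived features grows sharply, roughly in proportion to the number of key columns multiplied by the number of grouped columns. Next we check the correlation between the derived features and the label: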

In [148]:
# Align the index with X_train
features_train_new.index = X_train.index

# Concatenate the derived features with the label
df_temp = pd.concat([features_train_new, y_train], axis=1)

# Inspect the concatenated DataFrame
df_temp.head()

Unnamed: 0,tenure_tenure_year_2014_mean,tenure_tenure_year_2014_var,tenure_tenure_year_2014_max,tenure_tenure_year_2014_min,tenure_tenure_year_2014_skew,tenure_tenure_year_2014_median,MonthlyCharges_tenure_year_2014_mean,MonthlyCharges_tenure_year_2014_var,MonthlyCharges_tenure_year_2014_max,MonthlyCharges_tenure_year_2014_min,...,OnlineSecurity_cv_tenure_quarter_4,OnlineBackup_cv_tenure_quarter_4,DeviceProtection_cv_tenure_quarter_4,TechSupport_cv_tenure_quarter_4,StreamingTV_cv_tenure_quarter_4,StreamingMovies_cv_tenure_quarter_4,Contract_cv_tenure_quarter_4,PaperlessBilling_cv_tenure_quarter_4,PaymentMethod_cv_tenure_quarter_4,Churn
4067,68.127341,13.127162,72,61,-0.549207,69,76.21353,1022.759716,118.75,19.1,...,1.015211,0.911021,0.892671,1.001012,0.84299,0.836854,1.041919,0.819885,0.736798,0
3306,23.186758,350.262051,60,0,0.439575,19,61.756122,821.60431,116.6,18.25,...,1.290253,1.186932,1.195498,1.301364,1.050394,1.059107,1.773409,0.855493,0.544229,0
3391,23.186758,350.262051,60,0,0.439575,19,61.756122,821.60431,116.6,18.25,...,1.015211,0.911021,0.892671,1.001012,0.84299,0.836854,1.041919,0.819885,0.736798,0
3249,23.186758,350.262051,60,0,0.439575,19,61.756122,821.60431,116.6,18.25,...,1.015211,0.911021,0.892671,1.001012,0.84299,0.836854,1.041919,0.819885,0.736798,0
2674,23.186758,350.262051,60,0,0.439575,19,61.756122,821.60431,116.6,18.25,...,1.015211,0.911021,0.892671,1.001012,0.84299,0.836854,1.041919,0.819885,0.736798,0


With 23 additional binary features included here, the number of derived features grows by more than eleven thousand.

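&emsp;&emsp;Next we compute the correlation coefficients. Because many more features are now involved, calling `.corr` to compute every pairwise coefficient would take a long time. Since we only need each feature's correlation with the label, we can compute it as follows: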

In [149]:
# Collect each column's correlation with the label in a dict
# (Series.append was removed in pandas 2.0)
corr_dict = {}

for col in df_temp:
    corr_dict[col] = np.corrcoef(df_temp[col], df_temp['Churn'])[0, 1]

df_corr = pd.Series(corr_dict, dtype=np.float64)


In [150]:
df_corr

tenure_tenure_year_2014_mean           -0.227663
tenure_tenure_year_2014_var             0.227663
tenure_tenure_year_2014_max            -0.227663
tenure_tenure_year_2014_min            -0.227663
tenure_tenure_year_2014_skew            0.227663
                                          ...   
StreamingMovies_cv_tenure_quarter_4     0.201279
Contract_cv_tenure_quarter_4            0.201279
PaperlessBilling_cv_tenure_quarter_4    0.201279
PaymentMethod_cv_tenure_quarter_4      -0.201279
Churn                                   1.000000
Length: 16906, dtype: float64
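
As an optional alternative to the loop above, pandas' `DataFrame.corrwith` computes each column's Pearson correlation with a single Series in one call, without building the full pairwise matrix. The sketch below uses a small made-up frame (the names `toy`, `feat_a` and `feat_b` are illustrative, not from the course data) and checks that both approaches agree.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_temp: two features plus the label (hypothetical data)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "feat_a": rng.normal(size=200),
    "feat_b": rng.normal(size=200),
})
toy["Churn"] = (toy["feat_a"] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# Correlation of every column with the label only
corr_fast = toy.corrwith(toy["Churn"])

# Same quantity computed column by column with np.corrcoef
corr_loop = pd.Series(
    {c: np.corrcoef(toy[c], toy["Churn"])[0, 1] for c in toy.columns}
)
```

Both Series are indexed by column name, so they can be compared or sorted directly.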

In [151]:
# Inspect the 20 features with the largest absolute correlation
np.abs(df_corr).sort_values(ascending = False)[: 20]

Churn                                                            1.000000
tenure_month_6_norm_tenure_year_2019                             0.320558
tenure_tenure_year_2019_q2                                       0.320558
StreamingMovies_tenure_year_2019_var                             0.320558
MonthlyCharges_tenure_year_2019_mean                             0.320558
tenure_year_2018_tenure_year_2019_var                            0.320558
PaperlessBilling_cv_tenure_year_2019                             0.320558
TotalCharges_tenure_year_2019_max                                0.320558
TotalCharges_tenure_year_2019_var                                0.320558
tenure_month_2_norm_tenure_year_2019                             0.320558
tenure_month_8_norm_tenure_year_2019                             0.320558
DeviceProtection_cv_tenure_year_2019                             0.320558
tenure_month_9_tenure_year_2019_var                              0.320558
MonthlyCharges_tenure_year_2019_min   

In [152]:
np.abs(df_corr).sort_values(ascending = False)[500: 520]

TechSupport_tenure_year_2014_var                                  0.227663
StreamingMovies_dive1_StreamingMovies_tenure_year_2014_mean       0.227663
TotalCharges_tenure_year_2014_min                                 0.227663
PhoneService_mag2_tenure_year_2014                                0.227663
InternetService_tenure_year_2014_mean                             0.227663
tenure_quarter_3_minus1_tenure_quarter_3_tenure_year_2014_mean    0.227663
tenure_month_3_cv_tenure_year_2014                                0.227663
tenure_quarter_3_minus2_tenure_quarter_3_tenure_year_2014_mean    0.227663
PhoneService_tenure_year_2014_var                                 0.227663
PhoneService_norm_tenure_year_2014                                0.227663
PaperlessBilling_mag2_tenure_year_2014                            0.227663
PaymentMethod_norm_tenure_year_2014                               0.227663
tenure_month_9_tenure_year_2014_var                               0.227663
tenure_year_2019_cv_tenur

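We can see that the time-series derived features, which were already important, still yield many important features after group-statistics derivation, and that the derived features' correlation coefficients match those of the original fields in absolute value:

| Original field | Correlation coefficient |
| ------ | ------ |
| tenure_year_2019 | 0.320558 |
| tenure_year_2014 | -0.227663 |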

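#### 5.3 Multi-variable group statistics

&emsp;&emsp;When deriving group-statistics features, the KeyCol does not have to be an original variable; it can also be a cross-combination of variables. Since group-statistics derivation essentially reinforces the original feature, we only need to pick the important two-variable, or even three-variable, cross-combination features as KeyCols.

- Two-variable cross-combinations as KeyCol

&emsp;&emsp;First, we select the five two-variable cross-combination features most strongly correlated with the label, with the time-series derived features included among the candidates: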

In [153]:
# Align the indexes
X_train_seq.index = X_train_OE.index
X_test_seq.index = X_test_OE.index

# Concatenate the datasets
train_temp = pd.concat([X_train_OE[category_cols], X_train_seq], axis=1)
test_temp = pd.concat([X_test_OE[category_cols], X_test_seq], axis=1)

# Two-variable cross-combination feature derivation
features_train_new, features_test_new, colNames_train_new, colNames_test_new = Cross_Combination(list(train_temp.columns), 
                                                                                                 train_temp, 
                                                                                                 test_temp)

# Align the indexes and attach the label
features_train_new.index = X_train.index
features_test_new.index = X_test.index
df_temp = pd.concat([features_train_new, y_train], axis=1)

# Select the 5 most important derived features
df_corr = df_temp.corr()['Churn'].sort_values(ascending = False)
new_col = list(np.abs(df_corr).sort_values(ascending = False)[1: 6].index)
print(new_col)

# Build the corresponding DataFrames
train_new_BC = features_train_new[new_col]
test_new_BC = features_test_new[new_col]

['OnlineSecurity&Contract_0.0&0.0', 'TechSupport&Contract_0.0&0.0', 'InternetService&Contract_1.0&0.0', 'Contract&tenure_year_2014_0.0&0.0', 'Contract&tenure_year_2020_0.0&0.0']


In [154]:
train_new_BC.head()

Unnamed: 0,OnlineSecurity&Contract_0.0&0.0,TechSupport&Contract_0.0&0.0,InternetService&Contract_1.0&0.0,Contract&tenure_year_2014_0.0&0.0,Contract&tenure_year_2020_0.0&0.0
4067,0.0,0.0,0.0,0.0,0.0
3306,1.0,0.0,1.0,1.0,1.0
3391,0.0,0.0,0.0,1.0,1.0
3249,1.0,1.0,0.0,1.0,1.0
2674,0.0,0.0,0.0,1.0,1.0


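We then feed these into group-statistics derivation. To reduce computation, only the original features are summarized, and the non-extension form of the group statistics is used: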

In [155]:
# Merge the datasets
train_temp = pd.concat([X_train_OE, train_new_BC], axis=1)
test_temp = pd.concat([X_test_OE, test_new_BC], axis=1)

# Names of the selected cross-combination features, used as KeyCols
cat_temp = list(train_new_BC.columns).copy()
print(cat_temp)

# Containers for the results
col_temp = cat_temp.copy()
colNames_train_new = []
colNames_test_new = []
features_train_new = []
features_test_new = []

for i in range(len(col_temp)):
    keyCol = col_temp.pop(i)
    features_train1, features_test1, colNames_train, colNames_test = Group_Statistics(keyCol,
                                                                                      train_temp,
                                                                                      test_temp,
                                                                                      col_num=numeric_cols,
                                                                                      col_cat=category_cols)
    
    colNames_train_new.extend(colNames_train)
    colNames_test_new.extend(colNames_test)
    features_train_new.append(features_train1)
    features_test_new.append(features_test1)
    
    col_temp = cat_temp.copy()
    
# Merge the derived features
features_train_new = pd.concat(features_train_new, axis=1)
features_test_new = pd.concat(features_test_new, axis=1)

# Check the size of the derived feature set
print(features_train_new.shape)
print(features_test_new.shape)

['OnlineSecurity&Contract_0.0&0.0', 'TechSupport&Contract_0.0&0.0', 'InternetService&Contract_1.0&0.0', 'Contract&tenure_year_2014_0.0&0.0', 'Contract&tenure_year_2020_0.0&0.0']
(5282, 840)
(1761, 840)
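
`Cross_Combination` and `Group_Statistics` are helper functions defined earlier in the course. As a rough illustration of what the cell above does conceptually (not a reproduction of those helpers), the sketch below builds one binary cross-combination indicator and attaches group statistics keyed on it, using plain pandas on a made-up frame. The column naming pattern `{column}_{key}_{stat}` follows the outputs shown in this section; the data values are hypothetical.

```python
import pandas as pd

# Hypothetical ordinal-encoded toy data
df = pd.DataFrame({
    "OnlineSecurity": [0, 0, 2, 0, 1, 2],
    "Contract":       [0, 1, 0, 0, 2, 0],
    "MonthlyCharges": [70.0, 55.0, 90.0, 80.0, 20.0, 95.0],
})

# Binary indicator: 1 when OnlineSecurity == 0 and Contract == 0
key = "OnlineSecurity&Contract_0&0"
df[key] = ((df["OnlineSecurity"] == 0) & (df["Contract"] == 0)).astype(int)

# Group statistics of a numeric column within each level of the indicator
stats = df.groupby(key)["MonthlyCharges"].agg(["mean", "min", "max"])
stats.columns = [f"MonthlyCharges_{key}_{s}" for s in stats.columns]

# Attach the statistics back to every row of its group
derived = df.join(stats, on=key)
```

Every derived column takes only two distinct values, one per level of the indicator, which is why such features track the KeyCol so closely.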


Next we compute the correlation coefficients:

In [156]:
# Align the index with X_train
features_train_new.index = X_train.index

# Concatenate the derived features with the label
df_temp = pd.concat([features_train_new, y_train], axis=1)

# Compute each column's correlation with the label
# (Series.append was removed in pandas 2.0, so collect into a dict first)
corr_dict = {}

for col in df_temp:
    corr_dict[col] = np.corrcoef(df_temp[col], df_temp['Churn'])[0, 1]

df_corr = pd.Series(corr_dict, dtype=np.float64)


In [157]:
# Inspect the 20 features with the largest absolute correlation
np.abs(df_corr).sort_values(ascending = False)[: 20]

Churn                                                    1.000000
TotalCharges_OnlineSecurity&Contract_0.0&0.0_min         0.441598
StreamingTV_OnlineSecurity&Contract_0.0&0.0_var          0.441598
MonthlyCharges_OnlineSecurity&Contract_0.0&0.0_q1        0.441598
TechSupport_OnlineSecurity&Contract_0.0&0.0_mean         0.441598
InternetService_OnlineSecurity&Contract_0.0&0.0_mean     0.441598
OnlineSecurity_OnlineSecurity&Contract_0.0&0.0_var       0.441598
MonthlyCharges_OnlineSecurity&Contract_0.0&0.0_median    0.441598
MonthlyCharges_OnlineSecurity&Contract_0.0&0.0_var       0.441598
MonthlyCharges_OnlineSecurity&Contract_0.0&0.0_skew      0.441598
TotalCharges_OnlineSecurity&Contract_0.0&0.0_q2          0.441598
StreamingTV_OnlineSecurity&Contract_0.0&0.0_mean         0.441598
Contract_OnlineSecurity&Contract_0.0&0.0_var             0.441598
StreamingMovies_OnlineSecurity&Contract_0.0&0.0_mean     0.441598
OnlineBackup_OnlineSecurity&Contract_0.0&0.0_mean        0.441598
gender_Onl

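We can see that the derived features with the highest correlation coefficients all come from KeyCols that are themselves the most strongly correlated two-variable cross-combinations, and that their correlations are close to that of the KeyCol itself. Note also that the different derived features show identical correlation coefficients. The reason is the same as before: each KeyCol here is binary, so every derived feature takes one value per group, and its values are distributed across the rows exactly as the KeyCol's are.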
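
The identical coefficients have a simple explanation: with a binary KeyCol, every non-extension group statistic takes exactly two values, so it is an affine function of the KeyCol, and the absolute Pearson correlation is unchanged by affine maps. The following check on synthetic data (all names and values are made up for illustration) shows three different statistics producing the same absolute correlation as the key itself.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 500
df = pd.DataFrame({
    "key": rng.integers(0, 2, size=n),   # binary KeyCol
    "x": rng.normal(size=n),             # variable being summarized
})
# Label loosely tied to the key
df["y"] = (rng.random(n) < np.where(df["key"] == 1, 0.6, 0.2)).astype(int)

# Three different group statistics, each broadcast back to the rows
g = df.groupby("key")["x"]
derived = pd.DataFrame({
    "x_key_mean": g.transform("mean"),
    "x_key_var": g.transform("var"),
    "x_key_min": g.transform("min"),
})

key_corr = abs(np.corrcoef(df["key"], df["y"])[0, 1])
derived_corr = derived.apply(lambda s: abs(np.corrcoef(s, df["y"])[0, 1]))
```

Every entry of `derived_corr` equals `key_corr` up to floating-point error.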

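&emsp;&emsp;Also note that the derivation above was carried out on ordinal-encoded features, so to work out what a particular derived feature means, we can inspect the encoding as follows: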

In [158]:
X_train_OE.columns

Index(['gender', 'SeniorCitizen', 'Partner', 'Dependents', 'PhoneService',
       'MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup',
       'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies',
       'Contract', 'PaperlessBilling', 'PaymentMethod', 'tenure',
       'MonthlyCharges', 'TotalCharges'],
      dtype='object')

In [159]:
ord_enc.categories_

[array(['Female', 'Male'], dtype=object),
 array([0, 1], dtype=int64),
 array(['No', 'Yes'], dtype=object),
 array(['No', 'Yes'], dtype=object),
 array(['No', 'Yes'], dtype=object),
 array(['No', 'No phone service', 'Yes'], dtype=object),
 array(['DSL', 'Fiber optic', 'No'], dtype=object),
 array(['No', 'No internet service', 'Yes'], dtype=object),
 array(['No', 'No internet service', 'Yes'], dtype=object),
 array(['No', 'No internet service', 'Yes'], dtype=object),
 array(['No', 'No internet service', 'Yes'], dtype=object),
 array(['No', 'No internet service', 'Yes'], dtype=object),
 array(['No', 'No internet service', 'Yes'], dtype=object),
 array(['Month-to-month', 'One year', 'Two year'], dtype=object),
 array(['No', 'Yes'], dtype=object),
 array(['Bank transfer (automatic)', 'Credit card (automatic)',
        'Electronic check', 'Mailed check'], dtype=object)]

In [160]:
X_train_OE.columns[7]

'OnlineSecurity'

In [161]:
ord_enc.categories_[7]

array(['No', 'No internet service', 'Yes'], dtype=object)

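So the first feature, TotalCharges_OnlineSecurity&Contract_0.0&0.0_min, is the minimum of the total charges computed within the groups defined by whether OnlineSecurity is No and Contract is Month-to-month. In most cases, of course, we do not need to dig into the exact meaning of every feature.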
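
The manual lookup above can be wrapped in a small helper. This is only a sketch: `decode_key` is a hypothetical function, and the `categories` dict is a hand-copied excerpt of the `ord_enc.categories_` output shown above.

```python
# Hand-copied excerpt of the encoder categories, keyed by column name
categories = {
    "OnlineSecurity": ["No", "No internet service", "Yes"],
    "Contract": ["Month-to-month", "One year", "Two year"],
}

def decode_key(key_name, categories):
    """Map a key such as 'A&B_0.0&1.0' to {'A': <label>, 'B': <label>}."""
    # The codes follow the last underscore; column names may contain underscores
    cols_part, codes_part = key_name.rsplit("_", 1)
    cols = cols_part.split("&")
    codes = [int(float(c)) for c in codes_part.split("&")]
    return {col: categories[col][code] for col, code in zip(cols, codes)}

decoded = decode_key("OnlineSecurity&Contract_0.0&0.0", categories)
```

Assuming `ord_enc` was fitted on `category_cols` in the same column order, as the two index lookups above suggest, the full mapping could be built with `dict(zip(category_cols, ord_enc.categories_))`.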

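- Multi-variable cross-combinations as KeyCol

&emsp;&emsp;We can likewise use group-statistics derivation to reinforce the important features produced by multi-variable cross-combination. As before, we take the three-way combinations of 'OnlineSecurity', 'Contract', 'TechSupport', 'tenure_year_2014', 'tenure_year_2019' and 'tenure_year_2020' as KeyCols and derive features in the non-extension form.

&emsp;&emsp;This process is more involved. We first need to select the KeyCols: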

In [162]:
col1 = ['OnlineSecurity', 'Contract', 'TechSupport']
col2 = ['tenure_year_2014', 'tenure_year_2019', 'tenure_year_2020']
cl = col1 + col2

In [163]:
# Concatenate the datasets
train_temp = pd.concat([X_train_OE, X_train_seq[col2]], axis=1)
test_temp = pd.concat([X_test_OE, X_test_seq[col2]], axis=1)
print(train_temp.shape)
print(test_temp.shape)

# Containers for the results
col_temp = []
colNames_train_new = []
colNames_test_new = []
features_train_new = []
features_test_new = []

# Nested loops over every three-way combination of the candidate columns
n = len(cl)
for i in range(n):
    for j in range(i+1, n):
        for k in range(j+1, n):
            col_temp.append(cl[i])
            col_temp.append(cl[j])
            col_temp.append(cl[k])
            features_train1, features_test1, colNames_train, colNames_test = Cross_Combination(col_temp, 
                                                                                               train_temp, 
                                                                                               test_temp, 
                                                                                               multi=True)
            
            colNames_train_new.extend(colNames_train)
            colNames_test_new.extend(colNames_test)
            features_train_new.append(features_train1)
            features_test_new.append(features_test1)
            
            col_temp = []

# Build the three-variable cross-combination derived datasets
features_train_new = pd.concat(features_train_new, axis=1)
features_test_new = pd.concat(features_test_new, axis=1)

# Check the size of the derived datasets
print(features_train_new.shape)
print(features_test_new.shape)

# Align the indexes and attach the label
features_train_new.index = X_train.index
features_test_new.index = X_test.index
df_temp = pd.concat([features_train_new, y_train], axis=1)

# Select the 5 most important derived features
df_corr = df_temp.corr()['Churn'].sort_values(ascending = False)
new_col = list(np.abs(df_corr).sort_values(ascending = False)[1: 6].index)
print(new_col)

# Build the corresponding DataFrames
train_new_MC = features_train_new[new_col]
test_new_MC = features_test_new[new_col]

(5282, 22)
(1761, 22)
(5282, 225)
(1761, 225)
['OnlineSecurity&Contract&tenure_year_2014_0.0&0.0&0.0', 'OnlineSecurity&Contract&tenure_year_2020_0.0&0.0&0.0', 'OnlineSecurity&Contract&TechSupport_0.0&0.0&0.0', 'Contract&TechSupport&tenure_year_2014_0.0&0.0&0.0', 'Contract&TechSupport&tenure_year_2020_0.0&0.0&0.0']
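
The triple nested loop in the cell above enumerates every unordered three-way combination of the six candidate columns. As an optional, more compact way to generate the same triples, the standard library's `itertools.combinations` can be used; the sketch below only builds the column triples and does not call `Cross_Combination`.

```python
from itertools import combinations

col1 = ["OnlineSecurity", "Contract", "TechSupport"]
col2 = ["tenure_year_2014", "tenure_year_2019", "tenure_year_2020"]
cl = col1 + col2

# Every unordered triple, in the same order as the nested i < j < k loops
triples = [list(t) for t in combinations(cl, 3)]
```

Each element of `triples` could then be passed as the column list in place of `col_temp`.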


Next we compute the group statistics:

In [164]:
# Merge the datasets
train_temp = pd.concat([X_train_OE, train_new_MC], axis=1)
test_temp = pd.concat([X_test_OE, test_new_MC], axis=1)

# Names of the selected cross-combination features, used as KeyCols
cat_temp = list(train_new_MC.columns).copy()
print(cat_temp)

# Containers for the results
col_temp = cat_temp.copy()
colNames_train_new = []
colNames_test_new = []
features_train_new = []
features_test_new = []

for i in range(len(col_temp)):
    keyCol = col_temp.pop(i)
    features_train1, features_test1, colNames_train, colNames_test = Group_Statistics(keyCol,
                                                                                      train_temp,
                                                                                      test_temp,
                                                                                      col_num=numeric_cols,
                                                                                      col_cat=category_cols)
    
    colNames_train_new.extend(colNames_train)
    colNames_test_new.extend(colNames_test)
    features_train_new.append(features_train1)
    features_test_new.append(features_test1)
    
    col_temp = cat_temp.copy()
    
# Merge the derived features
features_train_new = pd.concat(features_train_new, axis=1)
features_test_new = pd.concat(features_test_new, axis=1)

# Check the size of the derived feature set
print(features_train_new.shape)
print(features_test_new.shape)

['OnlineSecurity&Contract&tenure_year_2014_0.0&0.0&0.0', 'OnlineSecurity&Contract&tenure_year_2020_0.0&0.0&0.0', 'OnlineSecurity&Contract&TechSupport_0.0&0.0&0.0', 'Contract&TechSupport&tenure_year_2014_0.0&0.0&0.0', 'Contract&TechSupport&tenure_year_2020_0.0&0.0&0.0']
(5282, 840)
(1761, 840)


Finally, we check the correlation between the derived features and the label:

In [165]:
# Align the index with X_train
features_train_new.index = X_train.index

# Concatenate the derived features with the label
df_temp = pd.concat([features_train_new, y_train], axis=1)

# Compute each column's correlation with the label
# (Series.append was removed in pandas 2.0, so collect into a dict first)
corr_dict = {}

for col in df_temp:
    corr_dict[col] = np.corrcoef(df_temp[col], df_temp['Churn'])[0, 1]

df_corr = pd.Series(corr_dict, dtype=np.float64)


In [166]:
# Inspect the 20 features with the largest absolute correlation
np.abs(df_corr).sort_values(ascending = False)[: 20]

Churn                                                                           1.000000
MonthlyCharges_OnlineSecurity&Contract&tenure_year_2014_0.0&0.0&0.0_q2          0.443017
gender_OnlineSecurity&Contract&tenure_year_2014_0.0&0.0&0.0_var                 0.443017
DeviceProtection_OnlineSecurity&Contract&tenure_year_2014_0.0&0.0&0.0_var       0.443017
TotalCharges_OnlineSecurity&Contract&tenure_year_2014_0.0&0.0&0.0_min           0.443017
TotalCharges_OnlineSecurity&Contract&tenure_year_2014_0.0&0.0&0.0_var           0.443017
DeviceProtection_OnlineSecurity&Contract&tenure_year_2014_0.0&0.0&0.0_mean      0.443017
PaymentMethod_OnlineSecurity&Contract&tenure_year_2014_0.0&0.0&0.0_mean         0.443017
Contract_OnlineSecurity&Contract&tenure_year_2014_0.0&0.0&0.0_mean              0.443017
PaperlessBilling_OnlineSecurity&Contract&tenure_year_2014_0.0&0.0&0.0_mean      0.443017
PaperlessBilling_OnlineSecurity&Contract&tenure_year_2014_0.0&0.0&0.0_var       0.443017
tenure_OnlineSecurity

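We can see that group-statistics derivation again produces a large number of derived features whose correlations are close to those of the underlying three-variable cross-combination features, in line with our expectation.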

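#### 5.4 Extended uses of group-statistics feature derivation

- Using one-hot encoded discrete features as the key variable

&emsp;&emsp;In the discussion above, we showed from both the principle and the experimental results that group statistics essentially reinforce the KeyCol, so whether the derived features are useful depends mainly on whether the KeyCol is important. When choosing KeyCols we therefore consider not only the original discrete variables but also the time-series derived features and the two-variable or even three-variable cross-combination features, which lets us create many useful features quickly. This may seem to exhaust the methods introduced so far, but one more case deserves attention: one-hot encoding a multi-class discrete variable first and then computing group statistics. For discrete variables with four or more levels, we can even interleave cross-combination in this process.

&emsp;&emsp;First, note that for a three-class discrete variable, whether or not one-hot encoding is applied strongly affects the derived features. For example, in the original form, the group statistics of MonthlyCharges over the levels of Contract look like this: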

<center><img src="https://s2.loli.net/2022/03/17/o1Vi2h6RDeMEmkb.png" alt="image-20220317153432864" style="zoom:40%;" />

If we instead one-hot encode the Contract field first and then compute the group statistics, we obtain the following derived features:

<center><img src="https://s2.loli.net/2022/03/17/U8ZbmpgyFSucCDz.png" alt="image-20220317153443674" style="zoom:50%;" />

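You can even keep the group-statistics features from both before and after one-hot encoding and test them together in the model, which further enlarges the pool of derived features.

&emsp;&emsp;If we regard one-hot encoding itself as a (single-variable) feature derivation method, the process above can also be understood as single-variable feature derivation followed by group statistics.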

> Note that for cross-combination feature derivation, one-hot encoding beforehand does not change the final result.
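
The difference between grouping by the raw three-level variable and grouping by its one-hot columns can be reproduced with plain pandas. The sketch below uses a small made-up frame and only the group mean; it is an illustration of the idea, not the course's `Group_Statistics` helper.

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({
    "Contract": ["Month-to-month", "One year", "Two year",
                 "Month-to-month", "Two year", "One year"],
    "MonthlyCharges": [80.0, 60.0, 30.0, 90.0, 50.0, 40.0],
})

# (a) Group by the three-level variable: one derived column, three distinct values
raw_mean = df.groupby("Contract")["MonthlyCharges"].transform("mean")

# (b) One-hot encode first, then group by each binary indicator separately:
#     three derived columns, each with two distinct values
onehot = pd.get_dummies(df["Contract"], prefix="Contract").astype(int)
onehot_means = pd.DataFrame({
    f"MonthlyCharges_{col}_mean": df["MonthlyCharges"].groupby(onehot[col]).transform("mean")
    for col in onehot.columns
})
```

In variant (b) each derived column contrasts one contract type against all the others, which is information that variant (a) does not expose directly.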

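- Using polynomial derived features as the grouped variables

&emsp;&emsp;As for the grouped variables (the variables being summarized within each group), we can often increase the number of useful derived features by feeding polynomial derived features into the group-statistics process. Because the usefulness of group-statistics features depends mostly on the KeyCol, we can bring in as many polynomial derived features as we like as grouped variables to further increase the number of useful derived features, if needed.

### 6. Target encoding

&emsp;&emsp;Next, we try target encoding as a feature derivation method on the dataset. Target encoding is in fact a variant of group-statistics feature derivation, but because it involves cross-validation, even the derived features on the training set are computed "indirectly", so we cannot be sure that the KeyCol still plays the dominant role.

#### 6.1 Single-variable target encoding on the original dataset

- Feature derivation

&emsp;&emsp;To verify what target encoding actually does, we first take the original dataset, use every discrete variable in turn as the KeyCol for target encoding, and test the resulting derived features: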

In [167]:
# Define the label column
col_cat = [target]
print(col_cat)

# Containers for the results
col_temp = category_cols.copy()
colNames_train_new = []
colNames_test_new = []
features_train_new = []
features_test_new = []

for keyCol in col_temp:
    # Per-key column names are kept separate from the accumulator lists
    features_train1, features_test1, colNames_train, colNames_test = Target_Encode(
        keyCol, X_train_OE, y_train, X_test_OE, col_cat=col_cat, extension=True)
    
    colNames_train_new.extend(colNames_train)
    colNames_test_new.extend(colNames_test)
    features_train_new.append(features_train1)
    features_test_new.append(features_test1)

['Churn']


In [168]:
features_train_new[0]

Unnamed: 0,Churn_gender_mean_kfold,Churn_gender_var_kfold,Churn_gender_max_kfold,Churn_gender_min_kfold,Churn_gender_median_kfold,Churn_gender_count_kfold,Churn_gender_nunique_kfold,Churn_gender_q1_kfold,Churn_gender_q2_kfold,Churn_dive1_Churn_gender_mean_kfold,Churn_dive2_Churn_gender_median_kfold,Churn_minus1_Churn_gender_mean_kfold,Churn_minus2_Churn_gender_mean_kfold,Churn_norm_gender_kfold,Churn_gap_gender_kfold,Churn_mag1_gender_kfold,Churn_mag2_gender_kfold,Churn_cv_gender_kfold
0,0.272425,0.198304,1,0,0,2107,2,0,1.00,0.000000,0.0,-0.272425,-0.272425,-0.611747,1.00,-0.272425,0.0,1.634565
1,0.258141,0.191594,1,0,0,2119,2,0,1.00,3.873707,100000.0,0.741859,0.741859,1.694807,1.00,-0.258141,0.0,1.695580
3,0.248839,0.187005,1,0,0,2154,2,0,0.00,4.018495,100000.0,0.751161,0.751161,1.736985,0.00,-0.248839,0.0,1.737762
4,0.272425,0.198304,1,0,0,2107,2,0,1.00,0.000000,0.0,-0.272425,-0.272425,-0.611747,1.00,-0.272425,0.0,1.634565
6,0.260540,0.192750,1,0,0,2111,2,0,1.00,3.838035,100000.0,0.739460,0.739460,1.684253,1.00,-0.260540,0.0,1.685025
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7038,0.258141,0.191594,1,0,0,2119,2,0,1.00,3.873707,100000.0,0.741859,0.741859,1.694807,1.00,-0.258141,0.0,1.695580
7039,0.264775,0.194761,1,0,0,2115,2,0,1.00,0.000000,0.0,-0.264775,-0.264775,-0.599952,1.00,-0.264775,0.0,1.666700
7040,0.264775,0.194761,1,0,0,2115,2,0,1.00,0.000000,0.0,-0.264775,-0.264775,-0.599952,1.00,-0.264775,0.0,1.666700
7041,0.250000,0.187589,1,0,0,2096,2,0,0.25,3.999840,100000.0,0.750000,0.750000,1.731598,0.25,-0.250000,0.0,1.732395


In [169]:
train_new_TE = pd.concat(features_train_new, axis=1)
test_new_TE = pd.concat(features_test_new, axis=1)

In [170]:
train_new_TE.head()

Unnamed: 0,Churn_gender_mean_kfold,Churn_gender_var_kfold,Churn_gender_max_kfold,Churn_gender_min_kfold,Churn_gender_median_kfold,Churn_gender_count_kfold,Churn_gender_nunique_kfold,Churn_gender_q1_kfold,Churn_gender_q2_kfold,Churn_dive1_Churn_gender_mean_kfold,...,Churn_PaymentMethod_q2_kfold,Churn_dive1_Churn_PaymentMethod_mean_kfold,Churn_dive2_Churn_PaymentMethod_median_kfold,Churn_minus1_Churn_PaymentMethod_mean_kfold,Churn_minus2_Churn_PaymentMethod_mean_kfold,Churn_norm_PaymentMethod_kfold,Churn_gap_PaymentMethod_kfold,Churn_mag1_PaymentMethod_kfold,Churn_mag2_PaymentMethod_kfold,Churn_cv_PaymentMethod_kfold
0,0.272425,0.198304,1,0,0,2107,2,0,1.0,0.0,...,1,4.377378,200000.0,1.543115,1.543115,3.09663,1,-0.456885,0.0,1.090647
1,0.258141,0.191594,1,0,0,2119,2,0,1.0,3.873707,...,0,14.87491,300000.0,2.798328,2.798328,6.970235,0,-0.201672,0.0,1.990552
3,0.248839,0.187005,1,0,0,2154,2,0,0.0,4.018495,...,0,0.0,0.0,-0.148724,-0.148724,-0.417735,0,-0.148724,0.0,2.393631
4,0.272425,0.198304,1,0,0,2107,2,0,1.0,0.0,...,1,4.377378,200000.0,1.543115,1.543115,3.09663,1,-0.456885,0.0,1.090647
6,0.26054,0.19275,1,0,0,2111,2,0,1.0,3.838035,...,0,6.663735,100000.0,0.849944,0.849944,2.378554,0,-0.150056,0.0,2.381129


In [171]:
train_new_TE.shape, test_new_TE.shape

((5282, 288), (1761, 288))

As before, we next check the correlation between the derived features and the label:

In [172]:
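# Align the index with X_train
train_new_TE.index = X_train.index

# Concatenate the derived features with the label
df_temp = pd.concat([train_new_TE, y_train], axis=1)

# Compute the correlation coefficients
df_corr = df_temp.corr()['Churn'].sort_values(ascending = False)

# Inspect the 20 features with the largest absolute correlation
np.abs(df_corr).sort_values(ascending = False)[: 20]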

Churn                                           1.000000
Churn_dive1_Churn_PaymentMethod_mean_kfold      0.028967
Churn_PaymentMethod_q2_kfold                    0.028299
Churn_gap_PaymentMethod_kfold                   0.028299
Churn_cv_OnlineBackup_kfold                     0.027046
Churn_StreamingTV_q2_kfold                      0.026733
Churn_gap_StreamingMovies_kfold                 0.026733
Churn_StreamingMovies_q2_kfold                  0.026733
Churn_gap_StreamingTV_kfold                     0.026733
Churn_dive1_Churn_InternetService_mean_kfold    0.026715
Churn_cv_StreamingTV_kfold                      0.026328
Churn_norm_InternetService_kfold                0.026254
Churn_StreamingTV_var_kfold                     0.026090
Churn_OnlineBackup_var_kfold                    0.026039
Churn_OnlineBackup_count_kfold                  0.025715
Churn_PaymentMethod_count_kfold                 0.025656
Churn_mag1_PaymentMethod_kfold                  0.025561
Churn_PaymentMethod_mean_kfold 

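We find that the features derived by target encoding are generally only weakly correlated with the label, and that the KeyCol behind the top-ranked derived features is not one of the important features but PaymentMethod, which has received very little attention so far; in its original form its correlation coefficient with the label is around 0.25. We will look into the cause of this shortly. For now, we use a model to check whether the derived features are useful. Target encoding is among the most effective feature derivation methods in practice, and although the correlation coefficients look disappointing, whether a derived feature is useful must ultimately be judged by the model's results.

- Model validation

&emsp;&emsp;Next, we feed the derived feature with the highest correlation into the model for testing: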

In [173]:
new_col = list(np.abs(df_corr).sort_values(ascending = False)[1: 2].index)
new_col

['Churn_dive1_Churn_PaymentMethod_mean_kfold']

In [174]:
train_new_temp = train_new_TE[new_col]
test_new_temp = test_new_TE[new_col]

# Train the model with the derived feature added
features_test(train_new_temp, 
              test_new_temp, 
              X_train_OE, 
              X_test_OE, 
              y_train, 
              y_test, 
              category_cols, 
              numeric_cols)

43.17520570755005 s


(0.81067880794702,
 {'columntransformer__num': StandardScaler(),
  'logit_threshold__C': 0.2,
  'logit_threshold__penalty': 'l1',
  'logit_threshold__solver': 'saga'})

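The result shows that the feature produced by target encoding still scores well in model validation. This suggests that target-encoded features can indeed be high-quality features, although they behave differently from features produced by ordinary group statistics. Next we look at why target-encoded features behave this way.

- The multiple effects of cross-validation

&emsp;&emsp;The root cause is that target encoding is computed with cross-validation. The most visible consequence is that, on the training set, rows in the same group can receive different derived values. For example, rows 1 and 3 in the dataset below fall into the same group when grouped by tenure, but because each row's derived value is computed indirectly through cross-fold statistics, the two rows end up with different values: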

<center><img src="https://s2.loli.net/2022/02/27/lKo98urN4pji2Ad.png" alt="image-20220227120356377" style="zoom:50%;" />

With ordinary group-statistics derivation, by contrast, rows in the same group always receive the same value:

<center><img src="https://s2.loli.net/2022/03/16/1UWuFzTK3dLHQMR.png" alt="image-20220316171913807" style="zoom:50%;" />

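As a result, the values of target-encoded features are distributed differently from the KeyCol. The derived features partly escape the KeyCol's influence, their correlations differ from the KeyCol's to some degree, and the KeyCol behind the most important derived features on the training set is not necessarily the most strongly correlated one.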
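
A minimal sketch of this mechanism, assuming plain out-of-fold mean encoding with sklearn's `KFold`. It is not the course's `Target_Encode` helper, and the toy data is made up. Training rows that share a key receive different encoded values, while test rows that share a key always receive the same value.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

# Hypothetical toy training data: one categorical key and a binary label
train = pd.DataFrame({
    "key": [0, 0, 0, 0, 1, 1, 1, 1],
    "y":   [1, 0, 1, 1, 0, 0, 1, 0],
})
test = pd.DataFrame({"key": [0, 1, 0]})

# Out-of-fold mean encoding: each row is encoded with statistics
# computed on the folds that do not contain it
oof = pd.Series(np.nan, index=train.index)
kf = KFold(n_splits=4, shuffle=True, random_state=0)
for fit_idx, enc_idx in kf.split(train):
    fold_means = train.iloc[fit_idx].groupby("key")["y"].mean()
    oof.iloc[enc_idx] = train["key"].iloc[enc_idx].map(fold_means).to_numpy()

# Test rows are encoded with statistics from the full training set
test_enc = test["key"].map(train.groupby("key")["y"].mean())
```

Comparing `oof` with `train["key"]` shows several distinct values inside each group, whereas `test_enc` is a pure function of the key.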

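&emsp;&emsp;We can also verify this from the other direction. For target encoding, the derived features on the test set are distributed exactly like the KeyCol. If the relationship between the label (were it known) and the original features on the test set matches that on the training set, then examining the relationship between the target-encoded features and the label on the test set should show that the important derived features tend to come from important KeyCols. Computing the correlation coefficients of the derived features on the test set of this dataset gives the following: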

In [175]:
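# Align the index with X_test
test_new_TE.index = X_test.index

# Concatenate the derived features with the label
df_temp = pd.concat([test_new_TE, y_test], axis=1)

# Compute the correlation coefficients
df_corr = df_temp.corr()['Churn'].sort_values(ascending = False)

# Inspect the 20 features with the largest absolute correlation
np.abs(df_corr).sort_values(ascending = False)[: 20]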

Churn                                           1.000000
Churn_Contract_count_kfold                      0.420910
Churn_gap_Contract_kfold                        0.420833
Churn_Contract_q2_kfold                         0.420833
Churn_mag1_Contract_kfold                       0.419774
Churn_Contract_mean_kfold                       0.419774
Churn_Contract_var_kfold                        0.416007
Churn_minus2_Churn_Contract_mean_kfold          0.412676
Churn_minus1_Churn_Contract_mean_kfold          0.412676
Churn_dive2_Churn_Contract_median_kfold         0.410428
Churn_cv_Contract_kfold                         0.403028
Churn_norm_Contract_kfold                       0.398627
Churn_dive1_Churn_Contract_mean_kfold           0.384253
Churn_cv_InternetService_kfold                  0.341227
Churn_InternetService_count_kfold               0.340154
Churn_InternetService_var_kfold                 0.338333
Churn_mag1_InternetService_kfold                0.334336
Churn_InternetService_mean_kfol

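We find that on the test set, because the derived features share the KeyCol's value distribution, they follow the same pattern as ordinary group statistics: the most strongly correlated KeyCol yields the most strongly correlated derived features, and their correlation coefficients are close to the KeyCol's.

&emsp;&emsp;Of course, we must never select features based on test-set data. Moreover, even if we can reason that derived features keyed on Contract (for example Churn_Contract_count_kfold) correlate more strongly with the test-set label, their value distribution on the training set differs from that on the test set, so a model trained on the Contract-derived features will not necessarily perform better. This is the fundamental reason for using cross-fold statistics on the training set, and it addresses the overfitting problem of target encoding at its root. In other words, we simply select features according to how the target-encoded features behave on the training set; the features selected this way both support model performance and curb overfitting to some degree.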

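> If you want to go a step further and verify the tendency of target encoding to overfit, you can design an experiment: feed the label into group-statistics derivation as if it were an ordinary feature, then use a model that does not inherently curb overfitting (such as the random forest in sklearn) without grid search. With a large number of Contract-derived features included, the model shows very severe overfitting.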
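
A much smaller stand-in for the experiment described above can show the leakage mechanism without training any model. The sketch below is an assumption-laden illustration: the data is synthetic, the key is deliberately uninformative, and leave-one-out encoding is used as a simple limiting case of the K-fold scheme rather than the course's `Target_Encode`. Naive in-sample group means of the label correlate strongly with the label even though the key carries no signal, while excluding each row's own label removes that spurious correlation.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 400
train = pd.DataFrame({
    "key": rng.integers(0, 200, size=n),   # many levels, few rows per level
    "y": rng.integers(0, 2, size=n),       # label independent of the key
})

grp = train.groupby("key")["y"]

# Naive encoding: each row's own label contributes to its group mean
naive = grp.transform("mean")

# Leave-one-out encoding: exclude the row's own label;
# singleton groups fall back to the global mean
cnt = grp.transform("count")
loo = (grp.transform("sum") - train["y"]) / (cnt - 1).where(cnt > 1)
loo = loo.fillna(train["y"].mean())

naive_corr = np.corrcoef(naive, train["y"])[0, 1]
loo_corr = np.corrcoef(loo, train["y"])[0, 1]
```

A flexible model given the naive feature would fit this artificial signal on the training set and fail to reproduce it on unseen data, which is the overfitting pattern the note describes.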

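#### 6.2 Target encoding based on time-series derived features

&emsp;&emsp;Next, we apply target encoding with the time-series derived data. Time-series derived features are also very common KeyCols for target encoding and generally work well: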

In [176]:
# Define the label column
col_cat = [target]
print(col_cat)

# Merge the datasets
X_train_temp = pd.concat([X_train_OE, X_train_seq], axis=1)
X_test_temp = pd.concat([X_test_OE, X_test_seq], axis=1)

# Containers for the results
col_temp = list(X_train_seq.columns).copy()
colNames_train_new = []
colNames_test_new = []
features_train_new = []
features_test_new = []

for keyCol in col_temp:
    # Per-key column names are kept separate from the accumulator lists
    features_train1, features_test1, colNames_train, colNames_test = Target_Encode(
        keyCol, X_train_temp, y_train, X_test_temp, col_cat=col_cat, extension=True)
    
    colNames_train_new.extend(colNames_train)
    colNames_test_new.extend(colNames_test)
    features_train_new.append(features_train1)
    features_test_new.append(features_test1)
    
# Merge the derived features
train_new_TE_seq = pd.concat(features_train_new, axis=1)
test_new_TE_seq = pd.concat(features_test_new, axis=1)

# Check the correlation coefficients
# Align the index with X_train
train_new_TE_seq.index = X_train.index

# Concatenate the derived features with the label
df_temp = pd.concat([train_new_TE_seq, y_train], axis=1)

# Compute the correlation coefficients
df_corr = df_temp.corr()['Churn'].sort_values(ascending = False)

# Inspect the 20 features with the largest absolute correlation
np.abs(df_corr).sort_values(ascending = False)[: 20]

['Churn']


Churn                                              1.000000
Churn_gap_tenure_year_2015_kfold                   0.031359
Churn_tenure_year_2015_q2_kfold                    0.031359
Churn_dive2_Churn_tenure_year_2015_median_kfold    0.031359
Churn_tenure_year_2015_count_kfold                 0.031286
Churn_minus1_Churn_tenure_year_2015_mean_kfold     0.031271
Churn_minus2_Churn_tenure_year_2015_mean_kfold     0.031271
Churn_norm_tenure_year_2015_kfold                  0.030900
Churn_tenure_year_2015_mean_kfold                  0.030564
Churn_mag1_tenure_year_2015_kfold                  0.030564
Churn_dive1_Churn_tenure_year_2015_mean_kfold      0.030321
Churn_tenure_year_2015_var_kfold                   0.030229
Churn_cv_tenure_year_2015_kfold                    0.029787
Churn_mag1_tenure_year_2019_kfold                  0.028483
Churn_tenure_year_2019_mean_kfold                  0.028483
Churn_cv_tenure_year_2019_kfold                    0.028470
Churn_tenure_year_2019_var_kfold        

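Again we find that the KeyCol behind the better-correlated derived features is not among the top-tier original features. Next we validate the model performance: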

In [177]:
new_col = list(np.abs(df_corr).sort_values(ascending = False)[1: 2].index)
new_col

['Churn_gap_tenure_year_2015_kfold']

In [178]:
train_new_temp = train_new_TE_seq[new_col]
test_new_temp = test_new_TE_seq[new_col]

# Train the model with the derived feature added
features_test(train_new_temp, 
              test_new_temp, 
              X_train_OE, 
              X_test_OE, 
              y_train, 
              y_test, 
              category_cols, 
              numeric_cols)

43.03908157348633 s


(0.8104897723689115,
 {'columntransformer__num': StandardScaler(),
  'logit_threshold__C': 0.2,
  'logit_threshold__penalty': 'l1',
  'logit_threshold__solver': 'saga'})

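The derived feature again performs quite well.

#### 6.3 Target encoding with multi-variable cross-combinations

- Target encoding with two-variable cross-combinations

&emsp;&emsp;Next we try target encoding with cross-combination features as the KeyCol. Since the importance of a KeyCol does not tell us in advance how important its target-encoded features will be, we cannot use the correlations of the original features to pre-select promising two-variable cross-combination KeyCols. We therefore start with a broad search to find suitable derived features.

&emsp;&emsp;First, we create the cross-combination KeyCols: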

In [179]:
# Merge the datasets
train_temp = pd.concat([X_train_OE[category_cols], X_train_seq], axis=1)
test_temp = pd.concat([X_test_OE[category_cols], X_test_seq], axis=1)

# List of all discrete variable names
col_temp = (category_cols + list(X_train_seq.columns)).copy()

# Two-variable cross-combination feature derivation
features_train_new, features_test_new, colNames_train_new, colNames_test_new = Cross_Combination(col_temp, 
                                                                                                 train_temp, 
                                                                                                 test_temp)

# Align the indexes
features_train_new.index = X_train.index
features_test_new.index = X_test.index

# Merge the datasets
X_train_temp = pd.concat([X_train_OE, X_train_seq, features_train_new], axis=1)
X_test_temp = pd.concat([X_test_OE, X_test_seq, features_test_new], axis=1)

In [180]:
features_train_new.shape

(5282, 3589)

In [181]:
X_train_temp.shape

(5282, 3631)

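We then perform target encoding. Here we import the gc library so that temporary variables can be cleaned up promptly and unnecessary memory released, and the tqdm library so that we can monitor the progress of the loop.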

In [182]:
from tqdm import tqdm
import gc

# Define the label column
col_cat = [target]
print(col_cat)

# Containers for the results
key_temp = list((features_train_new.columns)).copy()
n = len(key_temp)
print(n)

colNames_train_new = []
colNames_test_new = []
features_train_new = []
features_test_new = []

for keyCol in tqdm(key_temp):
    # Per-key column names are kept separate from the accumulator lists
    features_train1, features_test1, colNames_train, colNames_test = Target_Encode(
        keyCol, X_train_temp, y_train, X_test_temp, col_cat=col_cat)
    
    colNames_train_new.extend(colNames_train)
    colNames_test_new.extend(colNames_test)
    features_train_new.append(features_train1)
    features_test_new.append(features_test1)
    
# 合并数据集
train_new_TE_BC = pd.concat(features_train_new, axis=1)
test_new_TE_BC = pd.concat(features_test_new, axis=1)
print(train_new_TE_BC.shape)
print(test_new_TE_BC.shape)

# 内存清理
gc.collect()

  0%|          | 0/3589 [00:00<?, ?it/s]

['Churn']
3589


100%|██████████| 3589/3589 [28:18<00:00,  2.11it/s]


(5282, 32301)
(1761, 32301)


0

In [183]:
train_new_TE_BC.shape

(5282, 32301)
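
The `_kfold` suffix in the derived feature names suggests that the statistics are computed out of fold. The internals of `Target_Encode` are not shown in this section, so the following is only a minimal sketch of that idea under stated assumptions: the helper `kfold_target_mean`, its parameters and the toy data are hypothetical, and only the mean statistic is shown.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def kfold_target_mean(key, y, n_splits=5, seed=0):
    """Out-of-fold mean of y for each level of key (illustrative helper)."""
    key = pd.Series(key).reset_index(drop=True)
    y = pd.Series(y).reset_index(drop=True)
    encoded = pd.Series(np.nan, index=key.index, dtype=float)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in kf.split(key):
        # Statistics come only from the other folds, so a row never sees its own label.
        level_means = y.iloc[fit_idx].groupby(key.iloc[fit_idx]).mean()
        encoded.iloc[enc_idx] = key.iloc[enc_idx].map(level_means).to_numpy()
    # Levels that never appear in the fitting folds fall back to the global mean.
    return encoded.fillna(y.mean())

key = pd.Series(["a", "a", "b", "b", "a", "b", "a", "b", "a", "b"])
y = pd.Series([1, 1, 0, 0, 1, 0, 1, 0, 1, 0])
enc = kfold_target_mean(key, y, n_splits=5)
print(enc.tolist())
```

Computing the statistic out of fold is what keeps the encoded feature from leaking each row's own label. Test rows would typically be encoded with statistics from the full training set.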

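Because we used the two-variable cross combinations of all discrete variables as keyCol, a very large number of features were derived. Next, we again loop over the features and compute each one's correlation coefficient with the label: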

In [184]:
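# Check the correlation coefficients
# Align the index
train_new_TE_BC.index = X_train.index

# Concatenate the derived features with the label
df_temp = pd.concat([train_new_TE_BC, y_train], axis=1)

# Compute each column's correlation with the label
# (Series.append was removed in pandas 2.0, so the values are collected in a dict)
corr_dict = {}

for col in tqdm(df_temp.columns):
    corr_dict[col] = np.corrcoef(df_temp[col], df_temp['Churn'])[0, 1]

df_corr = pd.Series(corr_dict, dtype=np.float64)

# Free memory
gc.collect()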

  c /= stddev[:, None]
  c /= stddev[None, :]
100%|██████████| 32302/32302 [01:38<00:00, 329.55it/s]


26
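
With tens of thousands of columns, the per-column loop can be replaced by a single matrix computation. The sketch below is an alternative, not what the notebook ran: the helper name `corr_with_label` and the random toy data are assumptions. It computes the Pearson correlation of every column with the label at once and returns NaN for constant columns, which are the likely source of the `stddev` warnings printed above.

```python
import numpy as np
import pandas as pd

def corr_with_label(X, y):
    """Pearson correlation of every column of X with y, computed in one pass."""
    Xc = X - X.mean()
    yc = y - y.mean()
    num = Xc.mul(yc, axis=0).sum()
    den = np.sqrt((Xc ** 2).sum() * (yc ** 2).sum())
    # Constant columns have zero variance; report NaN instead of dividing by zero.
    return num / den.where(den > 0)

rng = np.random.default_rng(0)
features = pd.DataFrame(rng.normal(size=(200, 3)), columns=["f1", "f2", "f3"])
features["const"] = 1.0  # a zero-variance column
noise = rng.normal(scale=0.5, size=200)
label = pd.Series((features["f1"] + noise > 0).astype(int), name="Churn")

corr = corr_with_label(features, label)
ranked = corr.abs().dropna().sort_values(ascending=False)
print(ranked)
```

Dropping the NaN entries before sorting keeps zero-variance features out of the ranking.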

Next, let us see which KeyCols produced the derived features with the strongest correlations:

In [185]:
np.abs(df_corr).sort_values(ascending = False)[: 20]

Churn                                                       1.000000
Churn_Contract&PaymentMethod_0.0&3.0_mean_kfold             0.048124
Churn_Contract&PaymentMethod_0.0&3.0_var_kfold              0.047889
Churn_Contract&PaymentMethod_0.0&3.0_count_kfold            0.047669
Churn_PaymentMethod&tenure_year_2015_3.0&0.0_count_kfold    0.045870
Churn_PaymentMethod&tenure_year_2015_3.0&0.0_q2_kfold       0.045795
Churn_PaymentMethod&tenure_year_2015_3.0&0.0_mean_kfold     0.045372
Churn_PaymentMethod&tenure_year_2015_3.0&0.0_var_kfold      0.045235
Churn_Partner&PaymentMethod_0.0&3.0_q2_kfold                0.042659
Churn_Partner&PaymentMethod_0.0&3.0_count_kfold             0.042645
Churn_PaymentMethod&tenure_month_5_3.0&0.0_count_kfold      0.041902
Churn_PaymentMethod&tenure_month_5_3.0&0.0_q2_kfold         0.041848
Churn_PaymentMethod&tenure_year_2014_3.0&0.0_count_kfold    0.041755
Churn_PaymentMethod&tenure_year_2014_3.0&0.0_q2_kfold       0.041701
Churn_PaymentMethod&tenure_month_5

In [186]:
np.abs(df_corr).sort_values(ascending = False)[60: 80]

Churn_PaymentMethod&tenure_month_1_3.0&0.0_q2_kfold          0.038604
Churn_StreamingMovies&PaperlessBilling_2.0&1.0_mean_kfold    0.038552
Churn_PaymentMethod&tenure_year_2017_3.0&0.0_count_kfold     0.038336
Churn_StreamingMovies&PaperlessBilling_2.0&1.0_var_kfold     0.038299
Churn_PaymentMethod&tenure_year_2017_3.0&0.0_q2_kfold        0.038265
Churn_PaymentMethod&tenure_year_2017_3.0&0.0_mean_kfold      0.038190
Churn_PaymentMethod&tenure_year_2017_3.0&0.0_var_kfold       0.038023
Churn_PaymentMethod&tenure_month_12_3.0&0.0_mean_kfold       0.037821
Churn_PaymentMethod&tenure_month_7_3.0&0.0_count_kfold       0.037772
Churn_PaymentMethod&tenure_month_7_3.0&0.0_q2_kfold          0.037741
Churn_PaymentMethod&tenure_month_12_3.0&0.0_count_kfold      0.037736
Churn_PaymentMethod&tenure_month_12_3.0&0.0_q2_kfold         0.037713
Churn_PaymentMethod&tenure_month_12_3.0&0.0_var_kfold        0.037682
Churn_PaymentMethod&tenure_month_10_3.0&0.0_mean_kfold       0.037666
Churn_PaymentMethod&

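- Identifying effective KeyCols

&emsp;&emsp;Looking at the results above, it is easy to see that the original features behind the best-performing two-variable cross-combination KeyCols are exactly the features that performed well in single-variable target encoding, for example PaymentMethod, tenure_year_2015, StreamingMovies and OnlineBackup. In other words, although the correlation coefficients of the original features cannot tell us how a cross combination will perform in target encoding, the performance of the original features in target encoding can tell us whether they are likely to do even better in two-variable, or even three-variable, cross combinations. Based on this, in later rounds of two-variable cross-combination target encoding we can plan ahead and spend our limited compute and time first on pairwise combinations of features whose target encodings already perform well in their original form, and use those as KeyCols. Following the same reasoning, we can go further and try target-encoding feature derivation on three-variable or even four-variable combinations of the important features.

&emsp;&emsp;Note, however, that the same pattern also appears in cross-combination feature derivation and group-statistics feature derivation: it is the "strong stay strong" rule described earlier. The difference is that here "strong" is no longer judged by the correlation coefficient of the original variable, but by the effectiveness of the features derived from it.

#### 6.4 The Target Encoding Workflow

&emsp;&emsp;Based on this pattern, we can lay out a general workflow for target-encoding feature derivation. Target encoding still proceeds from shallow to deep. First, we try a wide range of single variables as KeyCol; these may be original features, time-series derived features, or features obtained by one-hot encoding the original features. Then, based on the target encoding results, we pick the important single variables, combine them in pairs or even in triples, and use the resulting cross-combination features as KeyCols for a deeper round of target encoding.

&emsp;&emsp;Also note that in actual modeling, cross-combination feature derivation is always performed before target encoding. At the cross-combination stage we therefore recommend combining over as wide a range as possible, including three-way combinations. The cross-combination step itself is not particularly expensive, and the combined features can be used directly as candidate KeyCols for target encoding.

> Although target encoding with three-way combinations as KeyCol may indeed improve the model, in most practical settings compute and time constraints make pairwise combinations the upper limit.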
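
The shallow-to-deep workflow above can be sketched as follows. The scores in `single_scores` are made-up placeholders rather than results from this notebook, and the helper `top_keycol_combos` is hypothetical. The figure of 39 discrete columns matches the column count used later in this section.

```python
from itertools import combinations
from math import comb

# Hypothetical single-variable scores, e.g. the best |corr| reached by any
# target-encoded feature built from each KeyCol (values are made up).
single_scores = {
    "Contract": 0.40,
    "PaymentMethod": 0.30,
    "OnlineSecurity": 0.29,
    "PhoneService": 0.02,
    "gender": 0.01,
}

def top_keycol_combos(scores, top_k, order):
    """Keep the top_k strongest KeyCols and enumerate their combinations of a given order."""
    strongest = sorted(scores, key=scores.get, reverse=True)[:top_k]
    return list(combinations(strongest, order))

pairs = top_keycol_combos(single_scores, top_k=3, order=2)
print(pairs)

# Size of the unrestricted search over 39 discrete columns
print(comb(39, 2), comb(39, 3))
```

Restricting the search to the strongest KeyCols matters because the number of candidate combinations grows quickly with the order: 741 pairs versus 9139 triples for 39 columns.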

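### 7. NLP Feature Derivation

&emsp;&emsp;In the final stage of feature derivation we consider NLP feature derivation. Based on our understanding of the method, there are two options for the current dataset. The first is to feed the discrete variables in directly, in which case only TF-IDF is computed. The second follows the idea of group-statistics feature derivation: first choose a KeyCol, then perform NLP feature derivation within groups, in which case both CountVec and TF-IDF can be computed.

#### 7.1 NLP Feature Derivation Based on Discrete Variables

&emsp;&emsp;We first feed in all discrete variables. Without grouping, NLP feature derivation simply amounts to computing TF-IDF for each column.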
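
As a minimal sketch of what an ungrouped TF-IDF computation can look like, the example below treats each row as a "document", each discrete column as a "term", and the encoded value as the term count. This is an assumption about the general approach, not a reproduction of `NLP_Group_Stat`, and the toy values are made up. It uses scikit-learn's `TfidfTransformer` with its default settings.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfTransformer

toy = pd.DataFrame({
    "Contract": [0, 1, 2, 0],
    "OnlineSecurity": [1, 0, 2, 1],
    "PaperlessBilling": [1, 1, 0, 0],
})

# Defaults: smooth_idf=True and L2 row normalisation
tfidf = TfidfTransformer()
values = tfidf.fit_transform(toy.to_numpy()).toarray()

toy_tfidf = pd.DataFrame(values,
                         columns=[c + "_tfidf" for c in toy.columns],
                         index=toy.index)
print(toy_tfidf.round(3))
```

For a train/test split, the transformer would be fitted on the training rows only and then applied to the test rows with `tfidf.transform`.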

In [187]:
NLP_Group_Stat?

Signature:
NLP_Group_Stat(
    X_train,
    X_test,
    col_cat,
    keyCol=None,
    tfidf=True,
    countVec=True,
)
Docstring:
NLP feature derivation function

:param X_train: training-set features
:param X_test: test-set features
:param col_cat: discrete variables used for derivation; multiple columns must be passed
:param keyCol: key variable(s) to group by; a string groups by a single column, a list groups by multiple columns
:param tfidf: whether to compute TF-IDF
:param countVec: whether to compute CountVectorizer counts

:return: the new NLP-derived features and their names
File:      d:\work\jupyter\telco\正式课程\features_creation.py
Type:      function


In [188]:
# Concatenate the datasets
X_train_temp = pd.concat([X_train_OE, X_train_seq], axis=1)
X_test_temp = pd.concat([X_test_OE, X_test_seq], axis=1)

# Names of all discrete variables
col_cat = category_cols + list(X_train_seq.columns)

# Run NLP feature derivation
features_train_new, features_test_new, colNames_train_new, colNames_test_new = NLP_Group_Stat(X_train_temp, 
                                                                                              X_test_temp, 
                                                                                              col_cat)


# Check the size of the derived dataset
print(features_train_new.shape)

# Check the correlation coefficients
# Align the index
features_train_new.index = X_train.index

# Concatenate the derived features with the label
df_temp = pd.concat([features_train_new, y_train], axis=1)

# Compute the correlation coefficients
df_corr = df_temp.corr()['Churn'].sort_values(ascending = False)

# Inspect the 20 features with the largest absolute correlation
np.abs(df_corr).sort_values(ascending = False)[: 20]

(5282, 39)


Churn                     1.000000
Contract_tfidf            0.398629
tenure_year_2019_tfidf    0.333148
OnlineSecurity_tfidf      0.296923
TechSupport_tfidf         0.278950
PaperlessBilling_tfidf    0.246287
tenure_quarter_4_tfidf    0.233339
tenure_year_2014_tfidf    0.226298
tenure_month_12_tfidf     0.214692
PaymentMethod_tfidf       0.186292
OnlineBackup_tfidf        0.177502
DeviceProtection_tfidf    0.171308
SeniorCitizen_tfidf       0.161434
tenure_quarter_1_tfidf    0.136922
Dependents_tfidf          0.128100
Partner_tfidf             0.127240
PhoneService_tfidf        0.126132
tenure_month_1_tfidf      0.120929
tenure_year_2015_tfidf    0.098806
MultipleLines_tfidf       0.077045
Name: Churn, dtype: float64

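We can see that the "strong stay strong" pattern shows up again in TF-IDF feature derivation: variables such as Contract, tenure_year_2019 and OnlineSecurity, which were already strongly correlated with the label in their original form, yield derived features that are also strongly correlated. Next we feed the top features into the model to check: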

In [189]:
new_col = list(np.abs(df_corr).sort_values(ascending = False)[1: 4].index)
new_col

['Contract_tfidf', 'tenure_year_2019_tfidf', 'OnlineSecurity_tfidf']

In [190]:
train_new_temp = features_train_new[new_col]
test_new_temp = features_test_new[new_col]

# Train the model with the new features
features_test(train_new_temp, 
              test_new_temp, 
              X_train_OE, 
              X_test_OE, 
              y_train, 
              y_test, 
              category_cols, 
              numeric_cols)

47.58120656013489 s


(0.8108676643444855,
 {'columntransformer__num': 'passthrough',
  'logit_threshold__C': 0.1,
  'logit_threshold__penalty': 'l2',
  'logit_threshold__solver': 'lbfgs'})

The derived features perform well.

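#### 7.2 NLP Feature Derivation Based on Group Statistics

&emsp;&emsp;Next we try NLP feature derivation based on group statistics. Since NLP feature derivation does not involve cross statistics, we have reason to expect that KeyCol selection in grouped NLP feature derivation will behave much as it does in group-statistics feature derivation: important KeyCols derive important features, and this holds even for KeyCols formed by pairwise or three-way combinations.

&emsp;&emsp;Let us verify this.

- Single-variable grouped NLP feature derivation

&emsp;&emsp;We start with NLP feature derivation grouped by a single variable, which can be done as follows: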

In [191]:
# Containers for the results
col_temp = col_cat.copy()
colNames_train_new = []
colNames_test_new = []
features_train_new = []
features_test_new = []

for i in range(len(col_temp)):
    # Take one column as the KeyCol; the remaining columns are the ones to derive from
    keyCol = col_temp.pop(i)
    features_train1, features_test1, colNames_train, colNames_test = NLP_Group_Stat(X_train_temp, 
                                                                                    X_test_temp, 
                                                                                    col_temp,
                                                                                    keyCol)

    colNames_train_new.extend(colNames_train)
    colNames_test_new.extend(colNames_test)
    features_train_new.append(features_train1)
    features_test_new.append(features_test1)
    
    # Restore the full column list for the next iteration
    col_temp = col_cat.copy()


# Merge the derived features
features_train_new = pd.concat(features_train_new, axis=1)
features_test_new = pd.concat(features_test_new, axis=1)
print(features_train_new.shape)
print(features_test_new.shape)

# Free memory
gc.collect()

# Check the correlation coefficients
# Align the index
features_train_new.index = X_train.index

# Concatenate the derived features with the label
df_temp = pd.concat([features_train_new, y_train], axis=1)

# Compute the correlation coefficients
df_corr = df_temp.corr()['Churn'].sort_values(ascending = False)

# Inspect the 20 features with the largest absolute correlation
np.abs(df_corr).sort_values(ascending = False)[: 20]

(5282, 2964)
(1761, 2964)


Churn                              1.000000
tenure_month_8_Contract_cntv       0.406389
Dependents_Contract_tfidf          0.406367
tenure_month_12_Contract_tfidf     0.406191
OnlineSecurity_Contract_tfidf      0.406165
tenure_year_2018_Contract_cntv     0.405956
tenure_month_11_Contract_cntv      0.405771
Partner_Contract_tfidf             0.405769
PaperlessBilling_Contract_tfidf    0.405416
DeviceProtection_Contract_tfidf    0.405269
SeniorCitizen_Contract_tfidf       0.405048
tenure_quarter_3_Contract_cntv     0.404980
TechSupport_Contract_tfidf         0.404805
tenure_month_10_Contract_cntv      0.404802
tenure_year_2019_Contract_tfidf    0.404660
tenure_quarter_4_Contract_tfidf    0.404501
PaymentMethod_Contract_tfidf       0.404190
tenure_quarter_4_Contract_cntv     0.404121
tenure_month_7_Contract_cntv       0.404077
tenure_month_5_Contract_cntv       0.403653
Name: Churn, dtype: float64

The top tier of derived features still consists entirely of features that use Contract as the KeyCol. Let us look further down the ranking:

In [192]:
np.abs(df_corr).sort_values(ascending = False)[50: 70]

StreamingMovies_Contract_cntv            0.366081
tenure_month_8_OnlineSecurity_cntv       0.362731
Dependents_OnlineSecurity_tfidf          0.362719
tenure_year_2017_OnlineSecurity_cntv     0.362678
tenure_month_8_OnlineSecurity_tfidf      0.362612
PhoneService_OnlineSecurity_cntv         0.362591
gender_OnlineSecurity_cntv               0.362476
tenure_quarter_3_OnlineSecurity_cntv     0.362327
tenure_year_2020_OnlineSecurity_cntv     0.362078
tenure_month_7_OnlineSecurity_cntv       0.361873
tenure_month_9_OnlineSecurity_cntv       0.361821
tenure_year_2017_OnlineSecurity_tfidf    0.361776
tenure_month_10_OnlineSecurity_cntv      0.361756
tenure_month_11_OnlineSecurity_cntv      0.361098
tenure_quarter_3_Contract_tfidf          0.360663
SeniorCitizen_OnlineSecurity_cntv        0.360471
StreamingTV_OnlineSecurity_cntv          0.359959
gender_OnlineSecurity_tfidf              0.359015
tenure_month_6_OnlineSecurity_cntv       0.358508
PaymentMethod_OnlineSecurity_cntv        0.358384


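The KeyCol of the second tier of derived features is OnlineSecurity, which is also one of the important features. These results confirm the general pattern of NLP feature derivation: as with group-statistics derivation, the outcome of grouped NLP feature derivation depends heavily on the choice of KeyCol.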
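
To illustrate why the KeyCol dominates the result, here is a minimal sketch of grouped NLP statistics. It assumes that each group is treated as a "document" whose count vector is the column-wise sum within the group; the helper `grouped_nlp_features` and the toy data are hypothetical and do not reproduce `NLP_Group_Stat`. Column names follow the `<column>_<KeyCol>_cntv` and `<column>_<KeyCol>_tfidf` pattern seen in the output above.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfTransformer

def grouped_nlp_features(df, key_col, value_cols):
    """Per-group count vectors and TF-IDF weights, broadcast back to rows (illustrative)."""
    # One "document" per group: sum the encoded values within each level of key_col
    counts = df.groupby(key_col)[value_cols].sum()
    weights = TfidfTransformer().fit_transform(counts.to_numpy()).toarray()
    cntv = counts.add_suffix(f"_{key_col}_cntv")
    tfidf = pd.DataFrame(weights, index=counts.index,
                         columns=[f"{c}_{key_col}_tfidf" for c in value_cols])
    group_table = pd.concat([cntv, tfidf], axis=1)
    # Every row receives the statistics of the group it belongs to
    return df[[key_col]].join(group_table, on=key_col).drop(columns=key_col)

toy = pd.DataFrame({
    "Contract": [0, 0, 1, 1, 2],
    "OnlineSecurity": [1, 0, 1, 1, 0],
    "PaperlessBilling": [1, 1, 0, 1, 1],
})
out = grouped_nlp_features(toy, "Contract", ["OnlineSecurity", "PaperlessBilling"])
print(out)
```

Because all rows in a group share the same values, each derived column is a function of the KeyCol level alone, which is consistent with features built on the same KeyCol having nearly identical correlations in the rankings above.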

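- NLP feature derivation with two-variable cross combinations

&emsp;&emsp;Next we test NLP feature derivation with two-variable cross combinations as the KeyCol. Given the results above, it is not hard to predict that the best features here will again come from KeyCols formed by combining important features. We can verify this as follows. Note that for NLP feature derivation the function we defined performs the cross combination automatically, so no separate cross-combination step is needed: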

In [193]:
NLP_Group_Stat?

Signature:
NLP_Group_Stat(
    X_train,
    X_test,
    col_cat,
    keyCol=None,
    tfidf=True,
    countVec=True,
)
Docstring:
NLP feature derivation function

:param X_train: training-set features
:param X_test: test-set features
:param col_cat: discrete variables used for derivation; multiple columns must be passed
:param keyCol: key variable(s) to group by; a string groups by a single column, a list groups by multiple columns
:param tfidf: whether to compute TF-IDF
:param countVec: whether to compute CountVectorizer counts

:return: the new NLP-derived features and their names
File:      d:\work\jupyter\telco\正式课程\features_creation.py
Type:      function


In [194]:
# Merge the discrete columns with the time-series derived columns
train_temp = pd.concat([X_train_OE[category_cols], X_train_seq], axis=1)
test_temp = pd.concat([X_test_OE[category_cols], X_test_seq], axis=1)

# Names of all discrete variables
col_cat = (category_cols + list(X_train_seq.columns)).copy()
n = len(col_cat)

# Containers for the results
col_temp = col_cat.copy()
keyCol = []
colNames_train_new = []
colNames_test_new = []
features_train_new = []
features_test_new = []

for i in tqdm(range(n)):
    for j in range(i+1, n):
        # Take columns i and j as the KeyCol pair (index j shifts by one after the first pop)
        keyCol.append(col_temp.pop(i))
        keyCol.append(col_temp.pop(j-1))
        features_train1, features_test1, colNames_train, colNames_test = NLP_Group_Stat(train_temp, 
                                                                                        test_temp, 
                                                                                        col_temp,
                                                                                        keyCol)
        
        colNames_train_new.extend(colNames_train)
        colNames_test_new.extend(colNames_test)
        features_train_new.append(features_train1)
        features_test_new.append(features_test1)
        
        # Reset the KeyCol pair and the column list for the next iteration
        keyCol = []
        col_temp = col_cat.copy()
    
# Merge the derived features
features_train_new = pd.concat(features_train_new, axis=1)
features_test_new = pd.concat(features_test_new, axis=1)
print(features_train_new.shape)
print(features_test_new.shape)

# Free memory
gc.collect()

100%|██████████| 39/39 [00:16<00:00,  2.38it/s]


(5282, 54834)
(1761, 54834)


26
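
The pop-based bookkeeping in the loop above can also be written with `itertools.combinations`. The sketch below is an equivalent way to enumerate the same (KeyCol pair, remaining columns) splits; the helper name `key_pair_splits` and the four-column list are illustrative assumptions.

```python
from itertools import combinations

col_cat = ["Contract", "OnlineSecurity", "PaymentMethod", "gender"]

def key_pair_splits(columns):
    """Yield (key_cols, value_cols) for every unordered pair of columns."""
    for pair in combinations(columns, 2):
        # The columns not used as keys are the ones to derive features from
        rest = [c for c in columns if c not in pair]
        yield list(pair), rest

splits = list(key_pair_splits(col_cat))
for key_cols, value_cols in splits:
    print(key_cols, value_cols)
```

With the 39 discrete columns used here, this yields 741 pairs, each leaving 37 columns to derive from. At two statistics per column, that gives 741 × 37 × 2 = 54834 features, which agrees with the shape printed above.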

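Because only sums (the count vectors) and TF-IDF values are computed here, the overall computation is much faster than group-statistics feature derivation. Next we check the correlation coefficients: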

In [195]:
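# Check the correlation coefficients
# Align the index
features_train_new.index = X_train.index

# Concatenate the derived features with the label
df_temp = pd.concat([features_train_new, y_train], axis=1)

# Compute each column's correlation with the label
# (Series.append was removed in pandas 2.0, so the values are collected in a dict)
corr_dict = {}

for col in tqdm(df_temp.columns):
    corr_dict[col] = np.corrcoef(df_temp[col], df_temp['Churn'])[0, 1]

df_corr = pd.Series(corr_dict, dtype=np.float64)

# Free memory
gc.collect()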

100%|██████████| 54835/54835 [00:39<00:00, 1392.86it/s]


26

In [196]:
np.abs(df_corr).sort_values(ascending = False)[: 20]

Churn                                             1.000000
tenure_month_11_InternetService&Contract_cntv     0.467464
tenure_month_9_InternetService&Contract_cntv      0.466496
tenure_month_10_InternetService&Contract_cntv     0.465473
tenure_quarter_4_InternetService&Contract_cntv    0.464573
tenure_quarter_3_InternetService&Contract_cntv    0.463899
PaymentMethod_InternetService&Contract_cntv       0.460555
tenure_month_8_InternetService&Contract_cntv      0.459681
PaperlessBilling_InternetService&Contract_cntv    0.458440
tenure_month_7_InternetService&Contract_cntv      0.458075
gender_InternetService&Contract_cntv              0.457934
tenure_year_2018_InternetService&Contract_cntv    0.457297
tenure_year_2018_OnlineSecurity&Contract_cntv     0.455537
tenure_month_8_OnlineSecurity&Contract_cntv       0.454826
TechSupport_InternetService&Contract_tfidf        0.454414
tenure_month_12_InternetService&Contract_cntv     0.454358
tenure_year_2017_OnlineSecurity&Contract_cntv     0.4541

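Once again, the variables that were strong in their original form produce the better features after cross combination, in line with our earlier expectation. We can also feed them into the model for verification: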

In [197]:
new_col = list(np.abs(df_corr).sort_values(ascending = False)[1: 4].index)
new_col

['tenure_month_11_InternetService&Contract_cntv',
 'tenure_month_9_InternetService&Contract_cntv',
 'tenure_month_10_InternetService&Contract_cntv']

In [198]:
train_new_temp = features_train_new[new_col]
test_new_temp = features_test_new[new_col]

# Train the model with the new features
features_test(train_new_temp, 
              test_new_temp, 
              X_train_OE, 
              X_test_OE, 
              y_train, 
              y_test, 
              category_cols, 
              numeric_cols)

56.599395513534546 s


(0.8101111636707662,
 {'columntransformer__num': StandardScaler(),
  'logit_threshold__C': 0.30000000000000004,
  'logit_threshold__penalty': 'l1',
  'logit_threshold__solver': 'saga'})

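Since the KeyCol is also the decisive factor in grouped NLP feature derivation, after class you can try combining the most important original discrete variables in triples or even quadruples and using the resulting cross combinations as KeyCols for further NLP feature derivation, which is likely to give good results as well. Overall, NLP feature derivation behaves in practice just like group-statistics aggregation.

&emsp;&emsp;This completes our brief hands-on tour of all the batch feature derivation methods. Next we summarize the practice above and form a general feature derivation workflow that balances efficiency and effectiveness.

### 8. A General Practical Workflow for Batch Feature Derivation

&emsp;&emsp;Through the practice above we have covered how to use each method and the techniques that go with it. We now bring them together into a complete, general workflow for batch feature derivation, with the goal of preserving model performance while improving efficiency.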

<center><img src="https://s2.loli.net/2022/03/20/gkWACzGaDQc1fmy.png" alt="3" style="zoom:50%;" />

&emsp;&emsp;This completes our introduction to all of the feature derivation methods.