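# Introduction to Pipeline  
I.  
The word "pipeline" literally means a pipe; here it is closer to an assembly line: it wraps data preprocessing and modeling into a single workflow.
Many steps in a data project are repeated or similar, such as data cleaning, feature selection, standardization, and classification. Wrapping them in a pipeline brings the following benefits:
1. Simpler code: the steps are packaged into one complete workflow, which avoids duplicated code.
2. Fewer bugs: a standardized flow makes it harder to skip a step during modeling or deployment.
3. Easier to productionize and reproduce: a single call to fit or predict trains or runs every estimator in the pipeline.
4. Simpler model validation: grid search (Grid Search) can iterate over the parameters of every estimator in the pipeline.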

A Pipeline is built from a list of (key, value) pairs, where key is a string naming the step and value is an estimator object.
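As a minimal sketch of this (key, value) structure, the following builds a two-step pipeline on synthetic data. The step names `scale` and `clf` and the toy arrays are illustrative assumptions, not part of the insurance example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data (assumption): one feature, two well-separated classes
X = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# Each step is a (name, estimator) pair; steps run in list order
pipe = Pipeline(steps=[("scale", StandardScaler()), ("clf", LogisticRegression())])
pipe.fit(X, y)

# Steps can be looked up by the name given in the pair
step_names = [name for name, _ in pipe.steps]
scaler = pipe.named_steps["scale"]
predictions = pipe.predict(X)
print(step_names)
```

The names chosen here matter later: they are the prefixes used to address each step's parameters during a parameter search.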

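II.  
scikit-learn provides two pipeline-style classes:  
1. ColumnTransformer (in sklearn.compose): mainly used for data and feature processing; it applies different transformers to different columns side by side and can run them in parallel.
2. Pipeline (in sklearn.pipeline): usable for any workflow; its steps run strictly in sequence.

**Important: every estimator in a pipeline except the last must be a transformer (that is, it must have a transform method). The last estimator can be of any type (transformer, classifier, and so on).**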
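The rule about intermediate steps can be checked with a small sketch. It assumes toy data and places a classifier, which has no transform method, in a non-final position; scikit-learn is expected to reject this with a TypeError.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data (assumption)
X = np.array([[0.0], [1.0], [10.0], [11.0]])
y = np.array([0, 0, 1, 1])

# Valid: transformer first, classifier last
valid = Pipeline(steps=[("scale", StandardScaler()), ("clf", LogisticRegression())])
valid.fit(X, y)

# Invalid: a classifier (no transform method) in a non-final position.
# Depending on the scikit-learn version the check runs at construction
# or at fit time, so both are wrapped in the same try block.
error = None
try:
    invalid = Pipeline(steps=[("clf", LogisticRegression()), ("scale", StandardScaler())])
    invalid.fit(X, y)
except TypeError as exc:
    error = exc

print(type(error).__name__)
```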

In [1]:
# Import libraries
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

import numpy as np
import pandas as pd

import warnings
# warnings.filterwarnings("ignore")

# Data loading and preprocessing (ColumnTransformer)

In [3]:
# Load the data and separate the label
df= pd.read_csv('./data/保险数据_第一期.csv')
labels= df.pop('resp_flag')   # pop the target column y out of df

In [4]:
# Split columns by data type
cat_cols= df.select_dtypes(include=["object"])   # categorical variables
num_cols= df.select_dtypes(exclude=["object"])   # numeric variables

In [5]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer

In [6]:
cat_imp = SimpleImputer(strategy='most_frequent')
cat_encode = OrdinalEncoder()
cat_pipeline =Pipeline(steps=[('cat_imp', cat_imp), ('cat_encode', cat_encode)])

num_imp =SimpleImputer(strategy='mean')
num_std = StandardScaler()
num_pipeline =Pipeline(steps=[('num_imp', num_imp), ('num_std', num_std)])

col_trans = ColumnTransformer(transformers=[('cat_pipeline', cat_pipeline, cat_cols.columns),
                                           ('num_pipeline', num_pipeline, num_cols.columns)])

In [7]:
# Fit on the full data here only to preview the transformed output
transformed = col_trans.fit_transform(df)
pd.DataFrame(transformed)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,39,40,41,42,43,44,45,46,47,48
0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.541625,0.166376,0.068909,-0.951312,1.137397,-0.160673,0.842515,-1.042277,-0.022663,1.142626
1,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,-0.042015,0.814953,1.001337,0.313886,0.220109,-0.051676,0.651137,-0.738785,-0.590107,-0.474091
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,-0.479746,-0.684880,-0.230800,0.904312,-0.959262,0.111821,0.651137,-0.738785,-0.306385,-0.474091
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.030940,0.409593,1.001337,0.482579,0.613232,-0.760161,0.714929,-1.421642,-0.022663,1.465970
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,-0.917477,-1.941498,-1.096626,-1.541737,0.220109,-0.487667,0.000000,-1.421642,-1.157552,-0.474091
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
39168,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,...,-1.938848,-0.238984,1.001337,0.060847,-1.286865,0.057322,0.906308,-0.283548,-0.306385,-0.797434
39169,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,-2.084758,-2.184714,-2.128957,-1.035658,-2.662797,-0.215172,0.523551,-0.511167,-0.022663,-1.444121
39170,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.541625,-1.495601,-1.396335,-1.120005,-0.173015,1.038302,-1.581616,0.702800,-0.590107,-1.444121
39171,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,-0.406791,-0.157912,-0.064295,-0.276539,-1.090303,-0.923658,0.523551,-0.511167,-1.724997,0.172596
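The transformed output loses the original column names and puts the categorical block first. The sketch below reproduces the same structure on a tiny synthetic frame to make that ordering visible. The frame `toy`, its columns `age` and `city`, and the short step names are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# Synthetic frame (assumption) with one missing value per column
toy = pd.DataFrame({
    "age": [20.0, np.nan, 40.0, 30.0],
    "city": pd.Series(["A", np.nan, "B", "A"], dtype=object),
})
cat_features = ["city"]
num_features = ["age"]

cat_pipe = Pipeline(steps=[("imp", SimpleImputer(strategy="most_frequent")),
                           ("enc", OrdinalEncoder())])
num_pipe = Pipeline(steps=[("imp", SimpleImputer(strategy="mean")),
                           ("std", StandardScaler())])
ct = ColumnTransformer(transformers=[("cat", cat_pipe, cat_features),
                                     ("num", num_pipe, num_features)])

# Output columns follow the order of the transformers list
# (categorical block first, then numeric), not the order in `toy`.
result = pd.DataFrame(ct.fit_transform(toy), columns=cat_features + num_features)
print(result)
```

Reattaching names as `cat_features + num_features` works here because each transformer keeps one output column per input column; an encoder that expands columns, such as one-hot encoding, would need a different approach.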


In [None]:
# ColumnTransformer takes a list of (name, transformer, columns) tuples


# Instantiating the model and building the Pipeline

In [8]:
# Pipeline takes a list of (name, transformer) tuples; the last step may be any estimator
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()

lr_pipeline = Pipeline(steps=[('col_trans', col_trans), ('lr', lr)])

# Training the model and tuning parameters

In [9]:
# Split into training and test sets
from sklearn.model_selection  import train_test_split
Xtrain, Xtest, Ytrain, Ytest = train_test_split(df, labels, test_size=0.3 , stratify= labels, random_state= 42)
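The `stratify` argument keeps the class proportions the same in both splits, which matters when the classes are imbalanced. A small sketch on synthetic labels (an assumption, not the insurance data) shows the effect:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels (assumption): 60% class 0, 40% class 1
y = np.array([0] * 60 + [1] * 40)
X = np.arange(100).reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

train_ratio = y_train.mean()  # share of class 1 in the training set
test_ratio = y_test.mean()    # share of class 1 in the test set
print(len(y_train), len(y_test), train_ratio, test_ratio)
```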

In [10]:
# Grid search: use <estimator>__<parameter> to reach the parameters inside each step
from sklearn.model_selection import GridSearchCV

# Parameter grid
param= dict(lr__C= np.arange(1, 2, 0.1),    # start, stop (exclusive), step
           lr__class_weight= [None, "balanced"],
           col_trans__num_pipeline__num_imp__strategy= ["mean", "median"])  # chain step names from the outermost level inward

grid_lr_pipeline= GridSearchCV(lr_pipeline, param, cv= 3, n_jobs= -1)
grid_lr_pipeline.fit(Xtrain, Ytrain)
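Long nested names such as `col_trans__num_pipeline__num_imp__strategy` are easy to mistype. One way to discover the valid names, sketched here on an unfitted pipeline with the same step names, is to list the keys of `get_params()`. The column indices `[0, 1]` are placeholders assumed for illustration.

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

num_pipeline = Pipeline(steps=[("num_imp", SimpleImputer(strategy="mean")),
                               ("num_std", StandardScaler())])
# Column indices [0, 1] are placeholders (assumption); no data is needed to list parameters
col_trans = ColumnTransformer(transformers=[("num_pipeline", num_pipeline, [0, 1])])
pipe = Pipeline(steps=[("col_trans", col_trans), ("lr", LogisticRegression())])

# get_params() lists every tunable name, including nested ones
param_names = sorted(pipe.get_params().keys())
nested = [name for name in param_names if name.endswith("num_imp__strategy")]
print(nested)

# The same names work with set_params
pipe.set_params(col_trans__num_pipeline__num_imp__strategy="median", lr__C=1.5)
```

Each double underscore moves one level inward: pipeline step, then transformer name inside the ColumnTransformer, then step inside the inner pipeline, then the parameter itself.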

In [11]:
from sklearn.metrics import classification_report
print(classification_report(Ytest, grid_lr_pipeline.predict(Xtest)))
grid_lr_pipeline.best_params_   # best parameters

              precision    recall  f1-score   support

           0       0.67      0.84      0.74      7054
           1       0.60      0.37      0.45      4698

    accuracy                           0.65     11752
   macro avg       0.63      0.60      0.60     11752
weighted avg       0.64      0.65      0.63     11752



{'col_trans__num_pipeline__num_imp__strategy': 'mean',
 'lr__C': 1.8000000000000007,
 'lr__class_weight': None}
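The value `1.8000000000000007` reported for `lr__C` is a floating-point artifact of `np.arange` with a non-integer step, not a meaningful difference from 1.8. One possible way to get cleaner grid values, shown as a sketch (the names `raw_grid` and `clean_grid` are illustrative), is to round the grid before passing it to the search:

```python
import numpy as np

raw_grid = np.arange(1, 2, 0.1)      # values carry floating-point rounding error
clean_grid = np.round(raw_grid, 1)   # snap each value to one decimal place

print(raw_grid.tolist())
print(clean_grid.tolist())
```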

# Saving the model

In [12]:
# Save the model
import joblib
joblib.dump(grid_lr_pipeline, "LGmodel_20240121.joblib")

['LGmodel_20240121.joblib']
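Saving the whole GridSearchCV object also stores the search results. If only predictions are needed later, a lighter alternative is to save `best_estimator_`, the pipeline refitted with the best parameters. The sketch below assumes synthetic data, a temporary directory, and a hypothetical file name, and checks that the restored pipeline predicts the same as the search object.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption), only to have something to fit
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
y = (X[:, 0] > 0).astype(int)

pipe = Pipeline(steps=[("std", StandardScaler()), ("lr", LogisticRegression())])
search = GridSearchCV(pipe, {"lr__C": [0.5, 1.0]}, cv=3)
search.fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "best_pipeline.joblib")  # hypothetical file name
    # Save only the refitted best pipeline instead of the whole search object
    joblib.dump(search.best_estimator_, path)
    restored = joblib.load(path)

same_predictions = bool(np.array_equal(restored.predict(X), search.predict(X)))
print(same_predictions)
```

A joblib file is generally only safe to load with the same scikit-learn version that wrote it, and only from a trusted source.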

# Prediction

In [13]:
# Predict for new users
df_new= pd.read_csv("./data/保险案例_新进用户.csv")

grid_lr_pipeline.predict(df_new)

array([1, 0, 0], dtype=int64)

In [14]:
# Deployment: load the saved model and predict
import joblib
import pandas as pd
model_20240121=joblib.load('LGmodel_20240121.joblib')
df_new= pd.read_csv("./data/保险案例_新进用户.csv")
model_20240121.predict(df_new)

array([1, 0, 0], dtype=int64)