    数据预处理
        缺失值的填充
        时间格式处理
        对象类型特征转换到数值
    异常值处理
        基于3segama原则
        基于箱型图
    数据分箱
        固定宽度分箱
        分位数分箱
        离散数值型数据分箱
        连续数值型数据分箱
        卡方分箱（选做作业）
    特征交互
        特征和特征之间组合
        特征和特征之间衍生
        其他特征衍生的尝试（选做作业）
    特征编码
        one-hot编码
        label-encode编码
    特征选择
        1 Filter
        2 Wrapper （RFE）
        3 Embedded

In [261]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import datetime
from tqdm import tqdm
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import StratifiedGroupKFold, KFold
from sklearn.metrics import accuracy_score, f1_score, log_loss
import xgboost as xgb
import lightgbm as lgb
from catboost import CatBoostRegressor

import warnings
warnings.filterwarnings('ignore')

In [262]:
data_train = pd.read_csv('./data/train.csv')
data_test_a = pd.read_csv('./data/testA.csv')

In [263]:
data_train['subGrade']

0         E2
1         D2
2         D3
3         A4
4         C2
          ..
799995    C4
799996    A4
799997    C3
799998    A4
799999    B3
Name: subGrade, Length: 800000, dtype: object

In [264]:
data_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 800000 entries, 0 to 799999
Data columns (total 47 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   id                  800000 non-null  int64  
 1   loanAmnt            800000 non-null  float64
 2   term                800000 non-null  int64  
 3   interestRate        800000 non-null  float64
 4   installment         800000 non-null  float64
 5   grade               800000 non-null  object 
 6   subGrade            800000 non-null  object 
 7   employmentTitle     799999 non-null  float64
 8   employmentLength    753201 non-null  object 
 9   homeOwnership       800000 non-null  int64  
 10  annualIncome        800000 non-null  float64
 11  verificationStatus  800000 non-null  int64  
 12  issueDate           800000 non-null  object 
 13  isDefault           800000 non-null  int64  
 14  purpose             800000 non-null  int64  
 15  postCode            799999 non-nul

### 特征预处理

1. 查找对象特征和数值特征

In [265]:
numerical_fea = list(data_train.select_dtypes(exclude='object').columns)
category_fea = list(filter(lambda x: x not in numerical_fea, list(data_train.columns)))
label = 'isDefault'
numerical_fea.remove(label)
numerical_fea


['id',
 'loanAmnt',
 'term',
 'interestRate',
 'installment',
 'employmentTitle',
 'homeOwnership',
 'annualIncome',
 'verificationStatus',
 'purpose',
 'postCode',
 'regionCode',
 'dti',
 'delinquency_2years',
 'ficoRangeLow',
 'ficoRangeHigh',
 'openAcc',
 'pubRec',
 'pubRecBankruptcies',
 'revolBal',
 'revolUtil',
 'totalAcc',
 'initialListStatus',
 'applicationType',
 'title',
 'policyCode',
 'n0',
 'n1',
 'n2',
 'n3',
 'n4',
 'n5',
 'n6',
 'n7',
 'n8',
 'n9',
 'n10',
 'n11',
 'n12',
 'n13',
 'n14']

In [266]:
category_fea

['grade', 'subGrade', 'employmentLength', 'issueDate', 'earliesCreditLine']

2. 缺失值填充

* 0 填
  * data_train = data_trian.fillno(0)
+ 上填
  * data_trian = data_trian.fillno(axis=0, method='ffill)
- 下填且设置最多只填充两个连续的值
  * data_train = data_train.fillna(axis=0,method='bfill',limit=2)

In [267]:
data_train.isnull().sum()

id                        0
loanAmnt                  0
term                      0
interestRate              0
installment               0
grade                     0
subGrade                  0
employmentTitle           1
employmentLength      46799
homeOwnership             0
annualIncome              0
verificationStatus        0
issueDate                 0
isDefault                 0
purpose                   0
postCode                  1
regionCode                0
dti                     239
delinquency_2years        0
ficoRangeLow              0
ficoRangeHigh             0
openAcc                   0
pubRec                    0
pubRecBankruptcies      405
revolBal                  0
revolUtil               531
totalAcc                  0
initialListStatus         0
applicationType           0
earliesCreditLine         0
title                     1
policyCode                0
n0                    40270
n1                    40270
n2                    40270
n3                  

In [268]:
# 按照平均值填充
data_train[numerical_fea] = data_train[numerical_fea].fillna(data_train[numerical_fea].mean())
data_test_a[numerical_fea] = data_test_a[numerical_fea].fillna(data_test_a[numerical_fea].mean())

In [269]:
data_train.isnull().sum()

id                        0
loanAmnt                  0
term                      0
interestRate              0
installment               0
grade                     0
subGrade                  0
employmentTitle           0
employmentLength      46799
homeOwnership             0
annualIncome              0
verificationStatus        0
issueDate                 0
isDefault                 0
purpose                   0
postCode                  0
regionCode                0
dti                       0
delinquency_2years        0
ficoRangeLow              0
ficoRangeHigh             0
openAcc                   0
pubRec                    0
pubRecBankruptcies        0
revolBal                  0
revolUtil                 0
totalAcc                  0
initialListStatus         0
applicationType           0
earliesCreditLine         0
title                     0
policyCode                0
n0                        0
n1                        0
n2                        0
n3                  

In [270]:
data_train['employmentLength'].describe()

count        753201
unique           11
top       10+ years
freq         262753
Name: employmentLength, dtype: object

In [271]:
data_train['employmentLength'].isnull().sum()

46799

## 处理类别特征

In [272]:
category_fea

['grade', 'subGrade', 'employmentLength', 'issueDate', 'earliesCreditLine']

#### 时间格式的处理

In [273]:
startdate = pd.to_datetime(data_train['issueDate'].min(), format='%Y-%m-%d')

In [274]:
for data in [data_train, data_test_a]:
    data['issueDate'] = pd.to_datetime(data['issueDate'], format='%Y-%m-%d')
    #构造时间特征
    data['issueDateDT'] = data['issueDate'].apply(lambda x: x-startdate).dt.days
    data['issueDate_year'] = data['issueDate'].dt.year
    data['issueDate_month'] = data['issueDate'].dt.month
    # data_train['issueDate_day'].value_counts() # 唯一值，所以可以删去
    data.drop('issueDate', axis=1, inplace=True)

In [275]:
data_train['employmentLength'].value_counts(dropna=False).sort_index()

1 year        52489
10+ years    262753
2 years       72358
3 years       64152
4 years       47985
5 years       50102
6 years       37254
7 years       35407
8 years       36192
9 years       30272
< 1 year      64237
NaN           46799
Name: employmentLength, dtype: int64

##### 对象类型转换为数值

In [276]:
def employmenLength_to_int(s):
    if pd.isnull(s):
        return s
    else: 
        return np.int8(s.split()[0])

for data in [data_train, data_test_a]:
    data['employmentLength'].replace('10+ years', "10 years", inplace=True)
    data['employmentLength'].replace('< 1 year', "1 years", inplace=True)
    data['employmentLength'] = data['employmentLength'].apply(lambda x: employmenLength_to_int(x))


In [277]:
data_train['employmentLength'].value_counts(dropna=False).sort_index()

1.0     116726
2.0      72358
3.0      64152
4.0      47985
5.0      50102
6.0      37254
7.0      35407
8.0      36192
9.0      30272
10.0    262753
NaN      46799
Name: employmentLength, dtype: int64

In [278]:
data_train['employmentLength'] = data_train['employmentLength'].fillna(data_train['employmentLength'].std())

In [279]:
data_train['earliesCreditLine'].sample(5)

137803    Nov-2005
481944    Sep-2011
704131    Apr-2002
258700    May-2007
535839    Jun-2007
Name: earliesCreditLine, dtype: object

In [280]:
for data in [data_train, data_test_a]:
    data['earliesCreditLine_year'] = data['earliesCreditLine'].apply(lambda x: int(x[-4:]))
    # data['earliesCreditLine_month'] = data['earliesCreditLine'].apply(lambda x: str(x[:3])) 缺失值太多，没什么意义
    data.drop('earliesCreditLine', axis=1, inplace=True)

In [281]:
# for data in [data_train, data_test_a]:
#     data['earliesCreditLine_month'] = data['earliesCreditLine_month'].map({'Jun': 1, 'Feb' : 2, 'Mar': 3, 'April': 4,
#                                                                       'May': 5, 'Jun': 6, 'Jul': 7, 'Aug': 8, 
#                                                                       'Sep': 9, 'Oct': 10, 'Nov': 11, 'Dec': 12})
# data['earliesCreditLine_month'].value_counts()

## 类别特征的处理

    首先查看类别特征的类别数

In [283]:
cate_features = ['grade', 'subGrade', 'employmentTitle', 'homeOwnership', 'verificationStatus', 'purpose', 'postCode', 'regionCode', \
                 'applicationType', 'initialListStatus', 'title', 'policyCode']
for col in cate_features:
    print('{}: 类型数 ：{}'.format(col, data[col].nunique()))

grade: 类型数 ：7
subGrade: 类型数 ：35
employmentTitle: 类型数 ：79282
homeOwnership: 类型数 ：6
verificationStatus: 类型数 ：3
purpose: 类型数 ：14
postCode: 类型数 ：889
regionCode: 类型数 ：51
applicationType: 类型数 ：2
initialListStatus: 类型数 ：2
title: 类型数 ：12058
policyCode: 类型数 ：1


* 类型数目为 1 的，说明该特征对于训练集无任何效果，可以删去。
* 低维的且有*优先级的*的首先考虑自映射和labelEncode编码
* 类型数在2之上，又不是高维稀疏的,且纯分类特征(注意：多少类型就会产生多少新的列)

In [284]:
for data in [data_train, data_test_a]:
    data.drop(['policyCode'], axis=1, inplace=True)
    data['grade'] = data['grade'].map({'A':1,'B':2,'C':3,'D':4,'E':5,'F':6,'G':7})
    # data = pd.get_dummies(data, columns=['subGrade', 'homeOwnership', 'verificationStatus', 'purpose', 'regionCode'], drop_first=True)

In [285]:
data_train['subGrade'].value_counts().sort_index()

A1    25909
A2    22124
A3    22655
A4    30928
A5    38045
B1    42382
B2    44227
B3    48600
B4    49516
B5    48965
C1    50763
C2    47068
C3    44751
C4    44272
C5    40264
D1    30538
D2    26528
D3    23410
D4    21139
D5    17838
E1    14064
E2    12746
E3    10925
E4     9273
E5     8653
F1     5925
F2     4340
F3     3577
F4     2859
F5     2352
G1     1759
G2     1231
G3      978
G4      751
G5      645
Name: subGrade, dtype: int64

In [286]:
# 高维类别特征需要进行转换
for col in tqdm(['employmentTitle', 'postCode', 'title','subGrade']):
    le = LabelEncoder()
    le.fit(list(data_train[col].astype(str).values) + list(data_test_a[col].astype(str).values))
    data_train[col] = le.transform(list(data_train[col].astype(str).values))
    data_test_a[col] = le.transform(list(data_test_a[col].astype(str).values))
print('Label Encoding 完成')

100%|██████████| 4/4 [00:07<00:00,  1.88s/it]

Label Encoding 完成





In [287]:
data_test_a['subGrade'].value_counts()

10    12857
8     12423
9     12400
7     12100
11    11791
13    11110
12    11018
6     10898
5     10544
14     9925
4      9629
3      7753
15     7667
16     6713
0      6398
17     5821
2      5644
1      5503
18     5236
19     4487
20     3527
21     3175
22     2780
23     2414
24     2114
25     1462
26     1073
27      906
28      714
29      543
30      488
31      325
32      232
33      166
34      164
Name: subGrade, dtype: int64

In [288]:
data_train.select_dtypes(include='object')

0
1
2
3
4
...
799995
799996
799997
799998
799999


In [289]:
data_test_a.select_dtypes(include='object')

0
1
2
3
4
...
199995
199996
199997
199998
199999


## 异常值的处理（只针对训练集）

* 当你发现异常值后，一定要先分清是什么原因导致的异常值，然后再考虑如何处理。首先，如果这一异常值并不代表一种规律性的，而是极其偶然的现象，或者说你并不想研究这种偶然的现象，这时可以将其删除。其次，如果异常值存在且代表了一种真实存在的现象，那就不能随便删除。**在现有的欺诈场景中很多时候欺诈数据本身相对于正常数据勒说就是异常的，我们要把这些异常点纳入，重新拟合模型，研究其规律**。能用监督的用监督模型，不能用的还可以考虑用异常检测的算法来做
* 注意test的数据不能删除

### 检测异常的方法一：均方差

* 在统计学中，如果一个数据分布近似正态，那么大约 68% 的数据值会在均值的一个标准差范围内，大约 95% 会在两个标准差范围内，大约 99.7% 会在三个标准差范围内

In [290]:
def find_outliers_by_3segama(data,fea):
    data_std = np.std(data[fea])
    data_mean = np.std(data[fea])
    outliers_cut_off = data_std * 3
    lower_rul = data_mean - outliers_cut_off
    upper_rul = data_mean + outliers_cut_off
    data[fea+'_outliers'] = data[fea].apply(lambda x:str("异常值") if x > upper_rul or x < lower_rul else str('正常值'))
    return data

In [291]:
data_train.sample(5)

Unnamed: 0,id,loanAmnt,term,interestRate,installment,grade,subGrade,employmentTitle,employmentLength,homeOwnership,...,n9,n10,n11,n12,n13,n14,issueDateDT,issueDate_year,issueDate_month,earliesCreditLine_year
683611,683611,40000.0,5,18.06,1017.05,4,16,263632,1.0,0,...,9.0,17.0,0.0,0.0,0.0,4.0,3806,2017,11,2001
462199,462199,5750.0,3,14.65,198.35,3,14,256267,3.560799,0,...,8.0,16.0,0.000815,0.0,1.0,4.0,3014,2015,9,1990
625505,625505,15500.0,5,14.31,363.16,3,13,251637,3.0,2,...,3.0,13.0,0.0,0.0,0.0,2.0,2771,2015,1,1995
521692,521692,11000.0,3,15.65,384.85,4,18,126455,10.0,1,...,5.592345,11.643896,0.000815,0.003384,0.089366,2.178606,884,2009,11,2001
111769,111769,5525.0,3,12.99,186.14,3,11,256267,3.560799,1,...,5.0,7.0,0.0,0.0,0.0,0.0,3227,2016,4,1997


In [292]:
data_train_copy = data_train.copy()

for fea in data_train_copy.columns:
    data_train_copy = find_outliers_by_3segama(data_train_copy,fea)
    print("----"*10)
    print(data_train_copy[fea+'_outliers'].value_counts())
    print(data_train_copy.groupby(fea+'_outliers')['isDefault'].sum())
    print("----"*10)

----------------------------------------
正常值    800000
Name: id_outliers, dtype: int64
id_outliers
正常值    159610
Name: isDefault, dtype: int64
----------------------------------------
----------------------------------------
正常值    764021
异常值     35979
Name: loanAmnt_outliers, dtype: int64
loanAmnt_outliers
异常值      8564
正常值    151046
Name: isDefault, dtype: int64
----------------------------------------
----------------------------------------
正常值    606902
异常值    193098
Name: term_outliers, dtype: int64
term_outliers
异常值    62484
正常值    97126
Name: isDefault, dtype: int64
----------------------------------------
----------------------------------------
正常值    714521
异常值     85479
Name: interestRate_outliers, dtype: int64
interestRate_outliers
异常值     33801
正常值    125809
Name: isDefault, dtype: int64
----------------------------------------
----------------------------------------
正常值    771811
异常值     28189
Name: installment_outliers, dtype: int64
installment_outliers
异常值      6104
正

In [293]:
data_train_copy.sample(5)

Unnamed: 0,id,loanAmnt,term,interestRate,installment,grade,subGrade,employmentTitle,employmentLength,homeOwnership,...,n9_outliers,n10_outliers,n11_outliers,n12_outliers,n13_outliers,n14_outliers,issueDateDT_outliers,issueDate_year_outliers,issueDate_month_outliers,earliesCreditLine_year_outliers
414825,414825,6500.0,3,11.49,214.32,2,9,243717,7.0,1,...,正常值,正常值,正常值,正常值,正常值,正常值,异常值,异常值,正常值,异常值
450196,450196,11200.0,5,15.59,269.93,3,14,181086,10.0,1,...,正常值,正常值,正常值,正常值,正常值,正常值,异常值,异常值,正常值,异常值
5471,5471,6000.0,3,22.45,230.55,5,24,99647,8.0,0,...,正常值,正常值,正常值,正常值,正常值,正常值,异常值,异常值,正常值,异常值
592288,592288,8000.0,3,10.99,261.88,2,8,265001,3.0,1,...,正常值,正常值,正常值,正常值,正常值,正常值,异常值,异常值,正常值,异常值
127114,127114,15000.0,3,11.99,498.15,2,7,268425,1.0,1,...,正常值,正常值,正常值,正常值,正常值,正常值,异常值,异常值,正常值,异常值


In [294]:
data_train_copy.shape

(800000, 96)

In [295]:
numerical_fea = ['id',
 'loanAmnt',
 'term',
 'interestRate',
 'installment',
 'employmentTitle',
 'homeOwnership',
 'annualIncome',
 'verificationStatus',
 'purpose',
 'postCode',
 'regionCode',
 'dti',
 'delinquency_2years',
 'ficoRangeLow',
 'ficoRangeHigh',
 'openAcc',
 'pubRec',
 'pubRecBankruptcies',
 'revolBal',
 'revolUtil',
 'totalAcc',
 'initialListStatus',
 'applicationType',
 'title',
 'n0',
 'n1',
 'n2',
 'n3',
 'n4',
 'n5',
 'n6',
 'n7',
 'n8',
 'n9',
 'n10',
 'n11',
 'n12',
 'n13',
 'n14']
 
for fea in numerical_fea:
    data_train_nomal = data_train_copy[data_train_copy[fea+'_outliers']=='正常值']
    data_train_nomal = data_train_nomal.reset_index(drop=True) 

In [296]:
data_train_nomal.shape ## 80万条数据中删除了2万条，说明这个程度也算可以接受

(788884, 96)

In [299]:
data_train.shape

(800000, 48)

In [306]:
data_train = data_train_nomal.iloc[: , :48]

In [307]:
data_train.shape

(788884, 48)

#### 逻辑回归等模型要单独增加的特征工程
    * 对特征做归一化，去除相关性高的特征
    * 归一化目的是让训练过程更好更快的收敛，避免特征大吃小的问题
    * 去除相关性是增加模型的可解释性，加快预测过程。
    ``` 
    for fea in [要归一化的特征列表]：
     data[fea] = ((data[fea] - np.min(data[fea])) / (np.max(data[fea]) - np.min(data[fea])))
     
    ```






In [308]:
data_train.isnull().sum()

id                        0
loanAmnt                  0
term                      0
interestRate              0
installment               0
grade                     0
subGrade                  0
employmentTitle           0
employmentLength          0
homeOwnership             0
annualIncome              0
verificationStatus        0
isDefault                 0
purpose                   0
postCode                  0
regionCode                0
dti                       0
delinquency_2years        0
ficoRangeLow              0
ficoRangeHigh             0
openAcc                   0
pubRec                    0
pubRecBankruptcies        0
revolBal                  0
revolUtil                 0
totalAcc                  0
initialListStatus         0
applicationType           0
title                     0
n0                        0
n1                        0
n2                        0
n3                        0
n4                        0
n5                        0
n6                  

In [309]:
data_test_a.isnull().sum()

id                        0
loanAmnt                  0
term                      0
interestRate              0
installment               0
grade                     0
subGrade                  0
employmentTitle           0
employmentLength          0
homeOwnership             0
annualIncome              0
verificationStatus        0
purpose                   0
postCode                  0
regionCode                0
dti                       0
delinquency_2years        0
ficoRangeLow              0
ficoRangeHigh             0
openAcc                   0
pubRec                    0
pubRecBankruptcies        0
revolBal                  0
revolUtil                 0
totalAcc                  0
initialListStatus         0
applicationType           0
title                     0
n0                        0
n1                        0
n2                        0
n3                        0
n4                        0
n5                        0
n6                        0
n7                  

#### 检测异常的方法二：箱型图
* 总结一句话：四分位数会将数据分为三个点和四个区间，IQR = Q3 -Q1，下触须=Q1 − 1.5x IQR，上触须=Q3 + 1.5x IQR；

### 数据分桶
* 特征分箱的目的：

    * 从模型效果上来看，特征分箱主要是为了降低变量的复杂性，减少变量噪音对模型的影响，提高自变量和因变量的相关度。从而使模型更加稳定。
数据分桶的对象：

        * 将连续变量离散化
        * 将多状态的离散变量合并成少状态
* 分箱的原因：

    * 数据的特征内的值跨度可能比较大，对有监督和无监督中如k-均值聚类它使用欧氏距离作为相似度函数来测量数据点之间的相似度。都会造成大吃小的影响，其中一种解决方法是对计数值进行区间量化即数据分桶也叫做数据分箱，然后使用量化后的结果。
* 分箱的优点：

    * 处理缺失值：当数据源可能存在缺失值，此时可以把null单独作为一个分箱。
    * 处理异常值：当数据中存在离群点时，可以把其通过分箱离散化处理，从而提高变量的鲁棒性（抗干扰能力）。例如，age若出现200这种异常值，可分入“age > 60”这个分箱里，排除影响。
    * 业务解释性：我们习惯于线性判断变量的作用，当x越来越大，y就越来越大。但实际x与y之间经常存在着非线性关系，此时可经过WOE变换。
* 特别要注意一下分箱的基本原则：

    * 最小分箱占比不低于5%
    * 箱内不能全部是好客户
    * 连续箱单调

### 特征交互（代价不菲，但往往可能创造出不错的选择）