    数据预处理
        缺失值的填充
        时间格式处理
        对象类型特征转换到数值
    异常值处理
        基于3segama原则
        基于箱型图
    数据分箱
        固定宽度分箱
        分位数分箱
        离散数值型数据分箱
        连续数值型数据分箱
        卡方分箱（选做作业）
    特征交互
        特征和特征之间组合
        特征和特征之间衍生
        其他特征衍生的尝试（选做作业）
    特征编码
        one-hot编码
        label-encode编码
    特征选择
        1 Filter
        2 Wrapper （RFE）
        3 Embedded

In [54]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import datetime
from tqdm import tqdm
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import StratifiedGroupKFold, KFold
from sklearn.metrics import accuracy_score, f1_score, log_loss, roc_auc_score
import xgboost as xgb
import lightgbm as lgb
from catboost import CatBoostClassifier

import warnings
warnings.filterwarnings('ignore')

In [656]:
data_train = pd.read_csv('./data/train.csv')
data_test_a = pd.read_csv('./data/testA.csv')

In [657]:
data_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 800000 entries, 0 to 799999
Data columns (total 47 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   id                  800000 non-null  int64  
 1   loanAmnt            800000 non-null  float64
 2   term                800000 non-null  int64  
 3   interestRate        800000 non-null  float64
 4   installment         800000 non-null  float64
 5   grade               800000 non-null  object 
 6   subGrade            800000 non-null  object 
 7   employmentTitle     799999 non-null  float64
 8   employmentLength    753201 non-null  object 
 9   homeOwnership       800000 non-null  int64  
 10  annualIncome        800000 non-null  float64
 11  verificationStatus  800000 non-null  int64  
 12  issueDate           800000 non-null  object 
 13  isDefault           800000 non-null  int64  
 14  purpose             800000 non-null  int64  
 15  postCode            799999 non-nul

### 特征预处理

1. 查找对象特征和数值特征

In [658]:
numerical_fea = list(data_train.select_dtypes(exclude='object').columns)
category_fea = list(filter(lambda x: x not in numerical_fea, list(data_train.columns)))
label = 'isDefault'
numerical_fea.remove(label)
numerical_fea


['id',
 'loanAmnt',
 'term',
 'interestRate',
 'installment',
 'employmentTitle',
 'homeOwnership',
 'annualIncome',
 'verificationStatus',
 'purpose',
 'postCode',
 'regionCode',
 'dti',
 'delinquency_2years',
 'ficoRangeLow',
 'ficoRangeHigh',
 'openAcc',
 'pubRec',
 'pubRecBankruptcies',
 'revolBal',
 'revolUtil',
 'totalAcc',
 'initialListStatus',
 'applicationType',
 'title',
 'policyCode',
 'n0',
 'n1',
 'n2',
 'n3',
 'n4',
 'n5',
 'n6',
 'n7',
 'n8',
 'n9',
 'n10',
 'n11',
 'n12',
 'n13',
 'n14']

In [659]:
category_fea

['grade', 'subGrade', 'employmentLength', 'issueDate', 'earliesCreditLine']

2. 缺失值填充

* 0 填
  * data_train = data_trian.fillno(0)
+ 上填
  * data_trian = data_trian.fillno(axis=0, method='ffill)
- 下填且设置最多只填充两个连续的值
  * data_train = data_train.fillna(axis=0,method='bfill',limit=2)

In [660]:
data_train.isnull().sum()

id                        0
loanAmnt                  0
term                      0
interestRate              0
installment               0
grade                     0
subGrade                  0
employmentTitle           1
employmentLength      46799
homeOwnership             0
annualIncome              0
verificationStatus        0
issueDate                 0
isDefault                 0
purpose                   0
postCode                  1
regionCode                0
dti                     239
delinquency_2years        0
ficoRangeLow              0
ficoRangeHigh             0
openAcc                   0
pubRec                    0
pubRecBankruptcies      405
revolBal                  0
revolUtil               531
totalAcc                  0
initialListStatus         0
applicationType           0
earliesCreditLine         0
title                     1
policyCode                0
n0                    40270
n1                    40270
n2                    40270
n3                  

In [661]:
# 按照平均值填充
data_train[numerical_fea] = data_train[numerical_fea].fillna(data_train[numerical_fea].mean())
data_test_a[numerical_fea] = data_test_a[numerical_fea].fillna(data_test_a[numerical_fea].mean())

In [662]:
data_train.isnull().sum()

id                        0
loanAmnt                  0
term                      0
interestRate              0
installment               0
grade                     0
subGrade                  0
employmentTitle           0
employmentLength      46799
homeOwnership             0
annualIncome              0
verificationStatus        0
issueDate                 0
isDefault                 0
purpose                   0
postCode                  0
regionCode                0
dti                       0
delinquency_2years        0
ficoRangeLow              0
ficoRangeHigh             0
openAcc                   0
pubRec                    0
pubRecBankruptcies        0
revolBal                  0
revolUtil                 0
totalAcc                  0
initialListStatus         0
applicationType           0
earliesCreditLine         0
title                     0
policyCode                0
n0                        0
n1                        0
n2                        0
n3                  

In [663]:
data_train['employmentLength'].describe()

count        753201
unique           11
top       10+ years
freq         262753
Name: employmentLength, dtype: object

In [664]:
data_train['employmentLength'].isnull().sum()

46799

## 处理类别特征

In [665]:
category_fea

['grade', 'subGrade', 'employmentLength', 'issueDate', 'earliesCreditLine']

#### 时间格式的处理

In [666]:
startdate = pd.to_datetime(data_train['issueDate'].min(), format='%Y-%m-%d')

In [667]:
for data in [data_train, data_test_a]:
    data['issueDate'] = pd.to_datetime(data['issueDate'], format='%Y-%m-%d')
    #构造时间特征
    data['issueDateDT'] = data['issueDate'].apply(lambda x: x-startdate).dt.days
    data['issueDate_year'] = data['issueDate'].dt.year
    data['issueDate_month'] = data['issueDate'].dt.month
    # data_train['issueDate_day'].value_counts() # 唯一值，所以可以删去
    data.drop('issueDate', axis=1, inplace=True)

In [668]:
data_train['employmentLength'].value_counts(dropna=False).sort_index()

1 year        52489
10+ years    262753
2 years       72358
3 years       64152
4 years       47985
5 years       50102
6 years       37254
7 years       35407
8 years       36192
9 years       30272
< 1 year      64237
NaN           46799
Name: employmentLength, dtype: int64

##### 对象类型转换为数值

In [669]:
def employmenLength_to_int(s):
    if pd.isnull(s):
        return s
    else: 
        return np.int8(s.split()[0])

for data in [data_train, data_test_a]:
    data['employmentLength'].replace('10+ years', "10 years", inplace=True)
    data['employmentLength'].replace('< 1 year', "1 years", inplace=True)
    data['employmentLength'] = data['employmentLength'].apply(lambda x: employmenLength_to_int(x))


In [670]:
data_train['employmentLength'].value_counts(dropna=False).sort_index()
data_test_a['employmentLength'].value_counts(dropna=False).sort_index()

1.0     29171
2.0     18207
3.0     16011
4.0     11833
5.0     12543
6.0      9328
7.0      8823
8.0      8976
9.0      7594
10.0    65772
NaN     11742
Name: employmentLength, dtype: int64

In [671]:
data_train['employmentLength'] = data_train['employmentLength'].fillna(data_train['employmentLength'].std())
data_test_a['employmentLength'] = data_test_a['employmentLength'].fillna(data_test_a['employmentLength'].std())

In [672]:
data_train['earliesCreditLine'].sample(5)

618903    Aug-2006
138898    Feb-1999
791143    Apr-1986
426921    Jan-2007
133318    Jan-1994
Name: earliesCreditLine, dtype: object

In [673]:
for data in [data_train, data_test_a]:
    data['earliesCreditLine_year'] = data['earliesCreditLine'].apply(lambda x: int(x[-4:]))
    # data['earliesCreditLine_month'] = data['earliesCreditLine'].apply(lambda x: str(x[:3])) 缺失值太多，没什么意义
    data.drop('earliesCreditLine', axis=1, inplace=True)

In [674]:
# for data in [data_train, data_test_a]:
#     data['earliesCreditLine_month'] = data['earliesCreditLine_month'].map({'Jun': 1, 'Feb' : 2, 'Mar': 3, 'April': 4,
#                                                                       'May': 5, 'Jun': 6, 'Jul': 7, 'Aug': 8, 
#                                                                       'Sep': 9, 'Oct': 10, 'Nov': 11, 'Dec': 12})
# data['earliesCreditLine_month'].value_counts()

## 类别特征的处理

    首先查看类别特征的类别数

In [675]:
cate_features = ['grade', 'subGrade', 'employmentTitle', 'homeOwnership', 'verificationStatus', 'purpose', 'postCode', 'regionCode', \
                 'applicationType', 'initialListStatus', 'title', 'policyCode']
for col in cate_features:
    print('{}: 类型数 ：{}'.format(col, data[col].nunique()))

grade: 类型数 ：7
subGrade: 类型数 ：35
employmentTitle: 类型数 ：79282
homeOwnership: 类型数 ：6
verificationStatus: 类型数 ：3
purpose: 类型数 ：14
postCode: 类型数 ：889
regionCode: 类型数 ：51
applicationType: 类型数 ：2
initialListStatus: 类型数 ：2
title: 类型数 ：12058
policyCode: 类型数 ：1


* 类型数目为 1 的，说明该特征对于训练集无任何效果，可以删去。
* 低维的且有*优先级的*的首先考虑自映射和labelEncode编码
* 类型数在2之上，又不是高维稀疏的,且纯分类特征(注意：多少类型就会产生多少新的列)

In [676]:
for data in [data_train, data_test_a]:
    data.drop(['policyCode'], axis=1, inplace=True)
    data['grade'] = data['grade'].map({'A':1,'B':2,'C':3,'D':4,'E':5,'F':6,'G':7})
    # data = pd.get_dummies(data, columns=['subGrade', 'homeOwnership', 'verificationStatus', 'purpose', 'regionCode'], drop_first=True)

In [677]:
data_train['subGrade'].value_counts().sort_index()

A1    25909
A2    22124
A3    22655
A4    30928
A5    38045
B1    42382
B2    44227
B3    48600
B4    49516
B5    48965
C1    50763
C2    47068
C3    44751
C4    44272
C5    40264
D1    30538
D2    26528
D3    23410
D4    21139
D5    17838
E1    14064
E2    12746
E3    10925
E4     9273
E5     8653
F1     5925
F2     4340
F3     3577
F4     2859
F5     2352
G1     1759
G2     1231
G3      978
G4      751
G5      645
Name: subGrade, dtype: int64

In [678]:
# 高维类别特征需要进行转换
for col in tqdm(['employmentTitle', 'postCode', 'title','subGrade']):
    le = LabelEncoder()
    le.fit(list(data_train[col].astype(str).values) + list(data_test_a[col].astype(str).values))
    data_train[col] = le.transform(list(data_train[col].astype(str).values))
    data_test_a[col] = le.transform(list(data_test_a[col].astype(str).values))
print('Label Encoding 完成')

100%|██████████| 4/4 [00:07<00:00,  1.93s/it]

Label Encoding 完成





In [679]:
data_test_a['subGrade'].value_counts()

10    12857
8     12423
9     12400
7     12100
11    11791
13    11110
12    11018
6     10898
5     10544
14     9925
4      9629
3      7753
15     7667
16     6713
0      6398
17     5821
2      5644
1      5503
18     5236
19     4487
20     3527
21     3175
22     2780
23     2414
24     2114
25     1462
26     1073
27      906
28      714
29      543
30      488
31      325
32      232
33      166
34      164
Name: subGrade, dtype: int64

In [680]:
data_train.select_dtypes(include='object')

0
1
2
3
4
...
799995
799996
799997
799998
799999


In [681]:
data_test_a.select_dtypes(include='object')

0
1
2
3
4
...
199995
199996
199997
199998
199999


## 异常值的处理（只针对训练集）

* 当你发现异常值后，一定要先分清是什么原因导致的异常值，然后再考虑如何处理。首先，如果这一异常值并不代表一种规律性的，而是极其偶然的现象，或者说你并不想研究这种偶然的现象，这时可以将其删除。其次，如果异常值存在且代表了一种真实存在的现象，那就不能随便删除。**在现有的欺诈场景中很多时候欺诈数据本身相对于正常数据勒说就是异常的，我们要把这些异常点纳入，重新拟合模型，研究其规律**。能用监督的用监督模型，不能用的还可以考虑用异常检测的算法来做
* 注意test的数据不能删除

### 检测异常的方法一：均方差

* 在统计学中，如果一个数据分布近似正态，那么大约 68% 的数据值会在均值的一个标准差范围内，大约 95% 会在两个标准差范围内，大约 99.7% 会在三个标准差范围内

In [682]:
def find_outliers_by_3segama(data,fea):
    data_std = np.std(data[fea])
    data_mean = np.std(data[fea])
    outliers_cut_off = data_std * 3
    lower_rul = data_mean - outliers_cut_off
    upper_rul = data_mean + outliers_cut_off
    data[fea+'_outliers'] = data[fea].apply(lambda x:str("异常值") if x > upper_rul or x < lower_rul else str('正常值'))
    return data

In [683]:
data_train.sample(5)

Unnamed: 0,id,loanAmnt,term,interestRate,installment,grade,subGrade,employmentTitle,employmentLength,homeOwnership,...,n9,n10,n11,n12,n13,n14,issueDateDT,issueDate_year,issueDate_month,earliesCreditLine_year
142383,142383,24000.0,3,18.06,868.39,4,16,27055,10.0,1,...,9.0,15.0,0.0,0.0,0.0,2.0,3806,2017,11,2005
481407,481407,14000.0,3,5.32,421.61,1,0,213981,1.0,0,...,6.0,17.0,0.0,0.0,0.0,0.0,2952,2015,7,1998
547671,547671,11700.0,3,8.39,368.75,1,4,258587,4.0,0,...,7.0,34.0,0.0,0.0,0.0,3.0,2679,2014,10,2004
759089,759089,15000.0,3,11.99,498.15,2,7,143765,7.0,1,...,5.0,6.0,0.0,0.0,2.0,1.0,2314,2013,10,1990
274089,274089,33000.0,3,7.97,1033.65,1,4,13890,9.0,0,...,4.0,8.0,0.0,0.0,0.0,0.0,3745,2017,9,2000


In [684]:
data_train_copy = data_train.copy()

for fea in data_train_copy.columns:
    data_train_copy = find_outliers_by_3segama(data_train_copy,fea)
    print("----"*10)
    print(data_train_copy[fea+'_outliers'].value_counts())
    print(data_train_copy.groupby(fea+'_outliers')['isDefault'].sum())
    print("----"*10)

----------------------------------------
正常值    800000
Name: id_outliers, dtype: int64
id_outliers
正常值    159610
Name: isDefault, dtype: int64
----------------------------------------
----------------------------------------
正常值    764021
异常值     35979
Name: loanAmnt_outliers, dtype: int64
loanAmnt_outliers
异常值      8564
正常值    151046
Name: isDefault, dtype: int64
----------------------------------------
----------------------------------------
正常值    606902
异常值    193098
Name: term_outliers, dtype: int64
term_outliers
异常值    62484
正常值    97126
Name: isDefault, dtype: int64
----------------------------------------
----------------------------------------
正常值    714521
异常值     85479
Name: interestRate_outliers, dtype: int64
interestRate_outliers
异常值     33801
正常值    125809
Name: isDefault, dtype: int64
----------------------------------------
----------------------------------------
正常值    771811
异常值     28189
Name: installment_outliers, dtype: int64
installment_outliers
异常值      6104
正

In [685]:
data_train_copy.sample(5)

Unnamed: 0,id,loanAmnt,term,interestRate,installment,grade,subGrade,employmentTitle,employmentLength,homeOwnership,...,n9_outliers,n10_outliers,n11_outliers,n12_outliers,n13_outliers,n14_outliers,issueDateDT_outliers,issueDate_year_outliers,issueDate_month_outliers,earliesCreditLine_year_outliers
104347,104347,5000.0,3,9.16,159.38,2,6,258397,6.0,1,...,正常值,正常值,正常值,正常值,正常值,正常值,异常值,异常值,正常值,异常值
4824,4824,6000.0,3,6.99,185.24,1,2,112431,7.0,0,...,正常值,正常值,正常值,正常值,正常值,正常值,正常值,异常值,正常值,异常值
761306,761306,10000.0,3,7.35,310.38,1,3,249810,3.0,1,...,正常值,正常值,正常值,正常值,正常值,正常值,异常值,异常值,正常值,异常值
125452,125452,7500.0,3,11.55,247.5,2,7,164544,10.0,1,...,正常值,正常值,正常值,正常值,正常值,正常值,正常值,异常值,正常值,异常值
29320,29320,10000.0,3,11.99,332.1,2,9,7738,3.0,0,...,正常值,正常值,正常值,正常值,正常值,正常值,异常值,异常值,正常值,异常值


In [686]:
data_train_copy.shape

(800000, 96)

In [687]:
numerical_fea = ['id',
 'loanAmnt',
 'term',
 'interestRate',
 'installment',
 'employmentTitle',
 'homeOwnership',
 'annualIncome',
 'verificationStatus',
 'purpose',
 'postCode',
 'regionCode',
 'dti',
 'delinquency_2years',
 'ficoRangeLow',
 'ficoRangeHigh',
 'openAcc',
 'pubRec',
 'pubRecBankruptcies',
 'revolBal',
 'revolUtil',
 'totalAcc',
 'initialListStatus',
 'applicationType',
 'title',
 'n0',
 'n1',
 'n2',
 'n3',
 'n4',
 'n5',
 'n6',
 'n7',
 'n8',
 'n9',
 'n10',
 'n11',
 'n12',
 'n13',
 'n14']
 
for fea in numerical_fea:
    data_train_nomal = data_train_copy[data_train_copy[fea+'_outliers']=='正常值']
    data_train_nomal = data_train_nomal.reset_index(drop=True) 

In [688]:
data_train_nomal.shape ## 80万条数据中删除了2万条，说明这个程度也算可以接受

(788884, 96)

In [689]:
data_train.shape

(800000, 48)

In [690]:
data_train = data_train_nomal.iloc[: , :48]

In [691]:
data_train.shape

(788884, 48)

#### 逻辑回归等模型要单独增加的特征工程
    * 对特征做归一化，去除相关性高的特征
    * 归一化目的是让训练过程更好更快的收敛，避免特征大吃小的问题
    * 去除相关性是增加模型的可解释性，加快预测过程。
    ``` 
    for fea in [要归一化的特征列表]：
     data[fea] = ((data[fea] - np.min(data[fea])) / (np.max(data[fea]) - np.min(data[fea])))
     
    ```






#### 检测异常的方法二：箱型图
* 总结一句话：四分位数会将数据分为三个点和四个区间，IQR = Q3 -Q1，下触须=Q1 − 1.5x IQR，上触须=Q3 + 1.5x IQR；

### 数据分桶
* 特征分箱的目的：

    * 从模型效果上来看，特征分箱主要是为了降低变量的复杂性，减少变量噪音对模型的影响，提高自变量和因变量的相关度。从而使模型更加稳定。
数据分桶的对象：

        * 将连续变量离散化
        * 将多状态的离散变量合并成少状态
* 分箱的原因：

    * 数据的特征内的值跨度可能比较大，对有监督和无监督中如k-均值聚类它使用欧氏距离作为相似度函数来测量数据点之间的相似度。都会造成大吃小的影响，其中一种解决方法是对计数值进行区间量化即数据分桶也叫做数据分箱，然后使用量化后的结果。
* 分箱的优点：

    * 处理缺失值：当数据源可能存在缺失值，此时可以把null单独作为一个分箱。
    * 处理异常值：当数据中存在离群点时，可以把其通过分箱离散化处理，从而提高变量的鲁棒性（抗干扰能力）。例如，age若出现200这种异常值，可分入“age > 60”这个分箱里，排除影响。
    * 业务解释性：我们习惯于线性判断变量的作用，当x越来越大，y就越来越大。但实际x与y之间经常存在着非线性关系，此时可经过WOE变换。
* 特别要注意一下分箱的基本原则：

    * 最小分箱占比不低于5%
    * 箱内不能全部是好客户
    * 连续箱单调

### 特征交互（代价不菲，但往往可能创造出不错的特征选择）

In [692]:
for col in ['grade', 'subGrade']:
    temp_dict = data_train.groupby([col])['isDefault'].agg(['mean']).reset_index().rename(columns={'mean': col + '_target_mean'})
    temp_dict.index = temp_dict[col].values
    temp_dict = temp_dict[col + '_target_mean'].to_dict()

    data_train[col + '_target_mean'] = data_train[col].map(temp_dict)
    data_test_a[col + '_target_mean'] = data_test_a[col].map(temp_dict)
    

In [693]:
data_train.groupby('grade')['isDefault']

<pandas.core.groupby.generic.SeriesGroupBy object at 0x7fed8f70b3d0>

In [694]:
data_train['grade_target_mean']

0         0.382934
1         0.302915
2         0.302915
3         0.060293
4         0.224651
            ...   
788879    0.224651
788880    0.060293
788881    0.224651
788882    0.060293
788883    0.132813
Name: grade_target_mean, Length: 788884, dtype: float64

In [695]:
# 其他衍生变量 mean 和 std
for df in [data_train, data_test_a]:
    for item in ['n0','n1','n2','n3','n4','n5','n6','n7','n8','n9','n10','n11','n12','n13','n14']:
        df['grade_to_mean_' + item] = np.float32(df['grade'] / df.groupby([item])['grade'].transform('mean'))
        # df['grade_to_std_' + item] = np.float64(df['grade'] / df.groupby([item])['grade'].transform('std'))

        ## 在后续的探索中发现 df['grade_to_std_' + item] 会转成 object

### 业务逻辑的探索

In [None]:
for data in [data_train, data_test_a]:
    data['total_open_acc'] = data['totalAcc'] - data['openAcc']
    data['revolUtil_loanAmnt'] = data['revolUtil'] * data['loanAmnt'] /1000
    data['loan_Income'] = data['loanAmnt'] / data['annualIncome']
    data['dti_Income'] = data['annualIncome'] * data['dti'] /100

In [27]:
data_train[['revolBal', 'revolUtil', 'loanAmnt', 'totalAcc', 'openAcc','annualIncome','dti',\
             ]]

Unnamed: 0,revolBal,revolUtil,loanAmnt,totalAcc,openAcc,annualIncome,dti
0,24178.0,48.9,35000.0,27.0,7.0,110000.0,17.05
1,15096.0,38.9,18000.0,18.0,13.0,46000.0,27.83
2,4606.0,51.8,12000.0,27.0,11.0,74000.0,22.77
3,9948.0,52.6,11000.0,28.0,9.0,118000.0,17.21
4,2942.0,32.0,3000.0,27.0,12.0,29000.0,32.16
...,...,...,...,...,...,...,...
788879,9933.0,46.4,25000.0,15.0,14.0,72000.0,19.03
788880,20472.0,98.4,17000.0,42.0,7.0,99000.0,15.72
788881,6381.0,51.9,6000.0,36.0,5.0,65000.0,12.11
788882,69702.0,61.3,19200.0,37.0,16.0,96000.0,29.25


In [None]:
data_train.info()

In [28]:
for col in data_test_a.columns:
    if data_test_a[col].isnull().sum() > 0:
        print(col)
        data_test_a[col] = data_test_a[col].fillna(data_test_a[col].std)

In [29]:
for col in data_train.columns:
    if data_train[col].isnull().sum() > 0:
        print(col)
        data_train[col] = data_train[col].fillna(data_train[col].std)

In [30]:
data_train.replace(np.inf, 0, inplace=True)
data_test_a.replace(np.inf, 0, inplace=True)

In [31]:
import pickle
data_train.to_pickle('./data/train_features.pkl')
data_test_a.to_pickle('./data/test_features.pkl')

In [21]:
import pickle

with open('./data/train_features.pkl', 'rb') as f:
    data_train = pickle.load(f)
with open('./data/test_features.pkl', 'rb') as f:
    data_test_a = pickle.load(f)

#### 特征筛选

    * 作为对比，我们先不进行特征筛选

#### 设计模型

In [32]:
features = [f for f in data_train.columns if f not in ['id','issueDate','isDefault'] and '_outliers' not in f]
x_train = data_train[features]
x_test = data_test_a[features]
y_train = data_train['isDefault']

In [None]:
x_train.info()

In [33]:
for col in x_train.columns:
    if data_train[col].isnull().sum() > 0:
        print(col)

In [34]:
for col in x_test.columns:
    if data_train[col].isnull().sum() > 0:
        print(col)

In [36]:
x_train.replace(np.inf, 0, inplace=True)
x_test.replace(np.inf, 0, inplace=True)

In [7]:
print(x_train.shape, x_test.shape, y_train.shape)

(788884, 63) (200000, 63) (788884,)


In [8]:
y_train.value_counts()

0    632638
1    156246
Name: isDefault, dtype: int64

In [55]:
def cv_model(clf, train_x, train_y, test_x, clf_name):
    folds = 2
    seed = 2022
    kf = KFold(n_splits=folds, shuffle=True, random_state=seed)

    train = np.zeros(train_x.shape[0])
    test = np.zeros(test_x.shape[0])

    train_prob = np.zeros(train_x.shape[0])
    test_prob = np.zeros(test_x.shape[0])

    cv_scores = []
    cv_scores_prob = []

    for i, (train_index, valid_index) in enumerate(kf.split(train_x, train_y)):

        print('*' * 10 + ' 第{}折交叉验证 '.format(i+1) + '*' * 10)
        trn_x, trn_y, val_x, val_y = train_x.iloc[train_index], train_y.iloc[train_index], train_x.iloc[valid_index], train_y.iloc[valid_index]
  
        if clf_name == 'lgb':
            train_matrix = clf.Dataset(trn_x, label=trn_y)
            valid_matrix = clf.Dataset(val_x, label=val_y)

            params = {
                    'boosting_type': 'gbdt',
                    'objective': 'binary',
                    'metric': 'auc',
                    'min_child_weight': 5,
                    'num_leaves': 2 ** 5,
                    'lambda_l2': 10,
                    'feature_fraction': 0.8,
                    'bagging_fraction': 0.8,
                    'bagging_freq': 4,
                    'learning_rate': 0.1,
                    'seed': 2020,
                    'nthread': 28,
                    'n_jobs':24,
                    'silent': True,
                    'verbose': -1,
                }  

            model = clf.train(params, train_matrix, 50000, valid_sets=[train_matrix, valid_matrix], verbose_eval=200,early_stopping_rounds=200)
            
            file_path = './model/model_lgb_{}.pkl'.format(i+1)
            model.save_model(file_path)
            
            val_pred = model.predict(val_x, num_iteration=model.best_iteration)
            test_pred = model.predict(test_x, num_iteration=model.best_iteration)

            val_pred_prob = model.predict_prob(val_x, num_iteration=model.best_iteration)
            test_pred_prob = model.predict_prob(test_x, num_iteration=model.best_iteration)

        
        if clf_name == "xgb":
            train_matrix = clf.DMatrix(trn_x , label=trn_y)
            valid_matrix = clf.DMatrix(val_x , label=val_y)
            test_matrix = clf.DMatrix(test_x)

            params = {'booster': 'gbtree',
                      'objective': 'binary:logistic',
                      'eval_metric': 'auc',
                      'gamma': 1,
                      'min_child_weight': 1.5,
                      'max_depth': 5,
                      'lambda': 10,
                      'subsample': 0.7,
                      'colsample_bytree': 0.7,
                      'colsample_bylevel': 0.7,
                      'eta': 0.04,
                      'tree_method': 'exact',
                      'seed': 2020,
                      'nthread': 36,
                      "silent": True,
                      }
            
            watchlist = [(train_matrix, 'train'),(valid_matrix, 'eval')]
            
            model = clf.train(params, train_matrix, num_boost_round=50000, evals=watchlist, verbose_eval=200, early_stopping_rounds=200)
            file_path = './model/model_xgb_{}.pkl'.format(i+1)
            model.save_model(file_path)

            val_pred  = model.predict(valid_matrix, ntree_limit=model.best_ntree_limit)
            test_pred = model.predict(test_matrix , ntree_limit=model.best_ntree_limit)

            val_pred_prob  = model.predict_prob(valid_matrix, ntree_limit=model.best_ntree_limit)
            test_pred_prob = model.predict_prob(test_matrix , ntree_limit=model.best_ntree_limit)
                 
        if clf_name == "cat":

            params = {'learning_rate': 0.05,
                      'depth': 5, 
                      'l2_leaf_reg': 10, 
                      'bootstrap_type': 'Bernoulli',
                      'od_type': 'Iter', 
                      'od_wait': 50, 
                      'random_seed': 11, 
                      'allow_writing_files': False}
            
            model = clf(iterations=20000, **params)
            model.fit(trn_x, trn_y, eval_set=(val_x, val_y),
                       use_best_model=True, verbose=500)
                      
            file_path = './model/model_cat_{}.pkl'.format(i+1)
            model.save_model(file_path)
            
            val_pred  = model.predict(val_x)
            test_pred = model.predict(test_x)

            val_pred_prob  = model.predict_prob(val_x)
            test_pred_prob = model.predict_prob(test_x)
            
        train[valid_index] = val_pred
        test = test_pred / kf.n_splits
        cv_scores.append(roc_auc_score(val_y, val_pred))


        train_prob[valid_index] = val_pred_prob
        test_prob = test_pred_prob / kf.n_splits
        cv_scores_prob.append(roc_auc_score(val_y, val_pred_prob))
        
        print(cv_scores)
        print(cv_scores_prob)
        
    print("%s_scotrainre_list:" % clf_name, cv_scores)
    print("%s_score_mean:" % clf_name, np.mean(cv_scores))
    print("%s_score_std:" % clf_name, np.std(cv_scores))
    return train, test, train_prob, test_prob





In [24]:
def lgb_model(x_train, y_train, x_test):
    lgb_train, lgb_test = cv_model(lgb, x_train, y_train, x_test, "lgb")
    return lgb_train, lgb_test
    
def xgb_model(x_train, y_train, x_test):
    xgb_train, xgb_test = cv_model(xgb, x_train, y_train, x_test, "xgb")
    return xgb_train, xgb_test

def cat_model(x_train, y_train, x_test):
    cat_train, cat_test = cv_model(CatBoostClassifier, x_train, y_train, x_test, "cat")
    return cat_train, cat_test


In [35]:
lgb_train, lgb_test = lgb_model(x_train, y_train, x_test)

********** 第1折交叉验证 **********
Training until validation scores don't improve for 200 rounds
[200]	training's auc: 0.749109	valid_1's auc: 0.733937
[400]	training's auc: 0.762206	valid_1's auc: 0.735335
[600]	training's auc: 0.773571	valid_1's auc: 0.735141
Early stopping, best iteration is:
[461]	training's auc: 0.766	valid_1's auc: 0.735416
[0.7354164705572636]
********** 第2折交叉验证 **********
Training until validation scores don't improve for 200 rounds
[200]	training's auc: 0.748863	valid_1's auc: 0.734995
[400]	training's auc: 0.762328	valid_1's auc: 0.736323
[600]	training's auc: 0.773666	valid_1's auc: 0.736412
Early stopping, best iteration is:
[567]	training's auc: 0.771767	valid_1's auc: 0.736479
[0.7354164705572636, 0.7364788978014749]
********** 第3折交叉验证 **********
Training until validation scores don't improve for 200 rounds
[200]	training's auc: 0.749317	valid_1's auc: 0.732915
[400]	training's auc: 0.762242	valid_1's auc: 0.733498
[600]	training's auc: 0.773305	valid_1's auc:

In [44]:
lgb_train = pd.DataFrame(lgb_train)
lgb_train.to_csv('./result/lgbm_trian_k5.csv', index=False)

In [45]:
lgb_test = pd.DataFrame(lgb_test)
lgb_test.to_csv('./result/lgbm_test_k5.csv', index=False)

In [41]:
lgb_train

Unnamed: 0,0
0,0.351030
1,0.294844
2,0.490297
3,0.047851
4,0.441416
...,...
788879,0.428179
788880,0.028843
788881,0.188408
788882,0.050611


In [37]:
xgb_train, xgb_test = xgb_model(x_train, y_train, x_test)

********** 第1折交叉验证 **********
Parameters: { "silent" } are not used.

[0]	train-auc:0.69602	eval-auc:0.69353
[200]	train-auc:0.73269	eval-auc:0.72711
[400]	train-auc:0.74098	eval-auc:0.73169
[600]	train-auc:0.74651	eval-auc:0.73360
[800]	train-auc:0.75083	eval-auc:0.73458
[1000]	train-auc:0.75479	eval-auc:0.73530
[1200]	train-auc:0.75831	eval-auc:0.73572
[1400]	train-auc:0.76167	eval-auc:0.73594
[1600]	train-auc:0.76471	eval-auc:0.73606
[1800]	train-auc:0.76778	eval-auc:0.73615
[2000]	train-auc:0.77066	eval-auc:0.73624
[2200]	train-auc:0.77349	eval-auc:0.73631
[2400]	train-auc:0.77629	eval-auc:0.73629
[2600]	train-auc:0.77905	eval-auc:0.73632
[2800]	train-auc:0.78171	eval-auc:0.73628
[2876]	train-auc:0.78267	eval-auc:0.73626
[0.7363632658627725]
********** 第2折交叉验证 **********
Parameters: { "silent" } are not used.

[0]	train-auc:0.69648	eval-auc:0.69507
[200]	train-auc:0.73277	eval-auc:0.72844
[400]	train-auc:0.74093	eval-auc:0.73264
[600]	train-auc:0.74630	eval-auc:0.73453
[800]	train-

In [47]:
xgb_train = pd.DataFrame(xgb_train)
xgb_train.to_csv('./result/xgb_train_k5.csv', index=False)
xgb_test = pd.DataFrame(xgb_test)
xgb_test.to_csv('./result/xgb_test_k5.csv', index=False)

In [None]:
cat_train, cat_test = cat_model(x_train, y_train, x_test)

In [None]:
cat_train = pd.DataFrame(cat_train)
cat_train.to_csv('./result/cat_train_k5.csv', index=False)
cat_test = pd.DataFrame(cat_test)
cat_test.to_csv('./result/cat_test_k5.csv', index=False)

In [74]:
sample = pd.read_csv('./data/sample_submit.csv')
sample['isDefault'] = lgb_test['isDefault']

In [76]:
sample.to_csv('./result/result_lgb_k5.csv', index=False)

In [77]:
xgb_test.columns = ['isDefault']
sample = pd.read_csv('./data/sample_submit.csv')
sample['isDefault'] = xgb_test['isDefault']
sample.to_csv('./result/result_xgb_k5.csv', index=False)
sample

Unnamed: 0,id,isDefault
0,800000,0.012863
1,800001,0.071228
2,800002,0.126335
3,800003,0.062480
4,800004,0.074169
...,...,...
199995,999995,0.032861
199996,999996,0.006873
199997,999997,0.031750
199998,999998,0.053614
