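# Table of Contents
[1 Data Exploration](#1)

[1.1 Dropping Columns Irrelevant to Prediction](#1.1)

[1.2 Inspecting Missing-Value Ratios by Row and Column](#1.2)

[1.3 Handling Categorical Features](#1.3)

[1.4 Exploring Feature Standard Deviations](#1.4)

[1.5 Missing-Value Imputation and Feature Encoding](#1.5)

[1.6 Building the Feature Matrix](#1.6)

[1.7 Splitting into Training and Test Sets](#1.7)

[2 Model Selection](#2)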

[2.1 LR](#2.1)

[2.2 SVM](#2.2)

[2.3 DT](#2.3)

[2.4 XGBoost](#2.4)

[2.5 LightGBM](#2.5)

<a id='1'></a>
# 1 Data Exploration

In [43]:
#encoding: utf-8

import pandas as pd
data = pd.read_csv('data/data.csv', encoding='GB18030')
print(data.shape)

(4754, 90)


In [44]:
data.columns

Index([                                u'Unnamed: 0',
                                           u'custid',
                                         u'trade_no',
                                     u'bank_card_no',
                               u'low_volume_percent',
                            u'middle_volume_percent',
            u'take_amount_in_later_12_month_highest',
                u'trans_amount_increase_rate_lately',
                             u'trans_activity_month',
                               u'trans_activity_day',
                                       u'transd_mcc',
                       u'trans_days_interval_filter',
                              u'trans_days_interval',
                                u'regional_mobility',
                                  u'student_feature',
                             u'repayment_capability',
                                     u'is_high_user',
                        u'number_of_trans_from_2011',
                           u

<a id='1.1'></a>
## 1.1 Dropping Columns Irrelevant to Prediction

In [45]:
# 'source' and 'bank_card_no' take values that do not distinguish between samples
# 'custid', 'trade_no' and 'id_name' are irrelevant to the prediction
data = data.drop(['custid', 'trade_no', 'bank_card_no', 'id_name', 'source'], axis = 1)

In [46]:
# 'Unnamed: 0' is irrelevant to the prediction
data = data.drop(['Unnamed: 0'], axis = 1)

In [47]:
# Drop the date/time columns for now
data = data.drop(['first_transaction_time', 'latest_query_time', 'loans_latest_time'], axis = 1)
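A column whose values do not distinguish between samples, as the comments above describe for 'source' and 'bank_card_no', can be spotted by counting distinct values per column. The sketch below is only an illustration on a made-up frame: the name `toy` and all of its values are assumptions, not rows from data.csv.

```python
import pandas as pd

# Toy frame standing in for the real data; the values are illustrative only
toy = pd.DataFrame({
    'source': ['xs', 'xs', 'xs', 'xs'],
    'bank_card_no': ['card', 'card', 'card', 'card'],
    'is_high_user': [0, 1, 0, 0],
})

# nunique() counts the distinct non-null values in each column
distinct_counts = toy.nunique()

# Columns with at most one distinct value cannot help separate the classes
constant_columns = distinct_counts[distinct_counts <= 1].index.tolist()
print(constant_columns)
```

Identifier-like columns such as 'custid' have the opposite profile (almost every value is distinct), so a check of this kind would not flag them.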

<a id='1.2'></a>
## 1.2 Inspecting Missing-Value Ratios by Row and Column

In [48]:
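# Compute the percentage of missing values in each column
for i in data.columns:
    d = len(data) - data[i].count()
    r = (float(d) / len(data)) * 100
    print('%.2f%% %s' % (r, i))

# The output below shows that 'student_feature' is missing in more than half of the rows (63.06%).
# Because it is a categorical column, its missing values can be filled with -1, which treats "missing" as a category of its own.
# The other columns have few missing values and can be filled with the median.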

0.04% low_volume_percent
0.04% middle_volume_percent
0.00% take_amount_in_later_12_month_highest
0.06% trans_amount_increase_rate_lately
0.04% trans_activity_month
0.04% trans_activity_day
0.04% transd_mcc
0.17% trans_days_interval_filter
0.04% trans_days_interval
0.04% regional_mobility
63.06% student_feature
0.00% repayment_capability
0.00% is_high_user
0.04% number_of_trans_from_2011
0.00% historical_trans_amount
0.04% historical_trans_day
0.04% rank_trad_1_month
0.00% trans_amount_3_month
0.04% avg_consume_less_12_valid_month
0.00% abs
0.04% top_trans_count_last_1_month
0.00% avg_price_last_12_month
2.19% avg_price_top_last_12_valid_month
0.04% reg_preference_for_trad
0.17% trans_top_time_last_1_month
0.17% trans_top_time_last_6_month
0.17% consume_top_time_last_1_month
0.17% consume_top_time_last_6_month
8.96% cross_consume_count_last_1_month
0.34% trans_fail_top_count_enum_last_1_month
0.34% trans_fail_top_count_enum_last_6_month
0.34% trans_fail_top_count_enum_last_12_month
0.55
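The per-column loop above can also be expressed without an explicit loop. This is a minimal sketch on a made-up two-column frame (the name `toy` and its values are assumptions); it relies on the fact that the mean of a boolean column is the fraction of `True` values.

```python
import numpy as np
import pandas as pd

# Toy frame; the values are illustrative only
toy = pd.DataFrame({
    'student_feature': [1.0, np.nan, np.nan, 2.0],
    'repayment_capability': [100.0, 250.0, 80.0, 40.0],
})

# isnull() marks missing cells; the column mean of the mask is the missing fraction
missing_pct = toy.isnull().mean() * 100

for name, pct in missing_pct.items():
    print('%.2f%% %s' % (pct, name))
```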

In [49]:
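# Use per-row completeness as a feature that measures how complete a user's information is.
# Despite its name, 'miss_rate' holds the fraction of non-missing columns in each row
# (1.0 means that no value is missing).
miss_rate = []
for i in range(len(data)):
    temp = float((data[i:i+1]).count().sum()) / len(data.columns)
    miss_rate.append(temp)

print(data.shape)
data['miss_rate'] = miss_rate
print(data.shape)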

(4754, 81)
(4754, 82)
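The row loop above computes, for each row, the number of non-missing cells divided by the number of columns. A vectorized sketch of the same quantity, on a made-up frame (`toy` and its values are assumptions), could look like this:

```python
import numpy as np
import pandas as pd

# Toy frame with four columns; the values are illustrative only
toy = pd.DataFrame({
    'a': [1.0, np.nan, 3.0],
    'b': [np.nan, np.nan, 6.0],
    'c': [7.0, 8.0, 9.0],
    'd': [1.0, 2.0, 3.0],
})

# count(axis=1) returns the number of non-missing cells in each row
completeness = toy.count(axis=1) / float(toy.shape[1])

# The share of missing cells is the complement
missing_share = 1.0 - completeness
print(completeness.tolist())
```

Either quantity carries the same information for a model; which one to store is mostly a matter of naming.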


In [50]:
print(data['miss_rate'])

0       0.987654
1       1.000000
2       0.987654
3       0.987654
4       0.987654
5       1.000000
6       0.962963
7       0.506173
8       0.987654
9       1.000000
10      0.987654
11      1.000000
12      1.000000
13      0.975309
14      0.987654
15      0.987654
16      0.987654
17      1.000000
18      1.000000
19      0.987654
20      1.000000
21      0.987654
22      0.987654
23      0.987654
24      1.000000
25      0.975309
26      1.000000
27      1.000000
28      1.000000
29      0.876543
          ...   
4724    1.000000
4725    0.975309
4726    0.987654
4727    0.987654
4728    0.987654
4729    0.975309
4730    1.000000
4731    1.000000
4732    0.975309
4733    0.987654
4734    0.987654
4735    1.000000
4736    1.000000
4737    1.000000
4738    0.987654
4739    0.987654
4740    0.987654
4741    1.000000
4742    1.000000
4743    0.987654
4744    0.987654
4745    0.987654
4746    0.518519
4747    0.987654
4748    0.975309
4749    1.000000
4750    1.000000
4751    0.9876

<a id='1.3'></a>
## 1.3 Handling Categorical Features

In [51]:
# Value counts of 'regional_mobility'; treated as a categorical feature
data['regional_mobility'].value_counts()

3.0    1950
2.0    1515
4.0     802
1.0     446
5.0      39
Name: regional_mobility, dtype: int64

In [52]:
# Value counts of 'reg_preference_for_trad'; treated as a categorical feature
data['reg_preference_for_trad'].value_counts()

一线城市    3403
三线城市    1064
境外       150
二线城市     131
其他城市       4
Name: reg_preference_for_trad, dtype: int64

In [53]:
# Value counts of 'student_feature'; treated as a categorical feature
data['student_feature'].value_counts()

1.0    1754
2.0       2
Name: student_feature, dtype: int64

In [54]:
# Value counts of 'is_high_user'; treated as a categorical feature
data['is_high_user'].value_counts()

0    4701
1      53
Name: is_high_user, dtype: int64

In [55]:
# Value counts of 'status', the target variable; positives to negatives are close to 1:3, so no resampling is applied.
data['status'].value_counts()

0    3561
1    1193
Name: status, dtype: int64
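The "close to 1:3" class ratio can be checked directly from the two counts printed above. The snippet below is only a worked calculation; the variable names are assumptions introduced for illustration.

```python
# Class counts copied from the value_counts() output above
negatives = 3561
positives = 1193

total = negatives + positives
positive_share = positives / float(total)
negatives_per_positive = negatives / float(positives)

print('positive share: %.3f' % positive_share)
print('negatives per positive: %.2f' % negatives_per_positive)
```

A positive share of roughly a quarter also means that a model predicting class 0 for everyone would already reach about 75% accuracy, which is worth keeping in mind when reading the accuracy figures in section 2.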

In [56]:
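# Drop the columns just identified as categorical, together with the target 'status' and 'miss_rate', to get data_temp.
# 76 numeric features remain; the 4 categorical features are encoded separately.
data_temp = data.drop(['regional_mobility', 'reg_preference_for_trad', 'student_feature', 'is_high_user', 'status', 'miss_rate'], axis = 1)
print(data_temp.shape)
print(data.shape)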

(4754, 76)
(4754, 82)


<a id='1.4'></a>
## 1.4 Exploring Feature Standard Deviations

In [57]:
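# Compute each column's standard deviation and drop the features whose standard deviation is below 0.1.
# This leaves 75 numeric features (see the shape printed in 1.6).
print(len(data_temp.columns))
for i in data_temp.columns:
    r = data_temp[i].std()
    print('%.2f %s' % (r, i))

    if r < 0.1:
        data_temp = data_temp.drop([i], axis = 1)
print(len(data_temp.columns))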

76
0.04 low_volume_percent
0.14 middle_volume_percent
3923.97 take_amount_in_later_12_month_highest
694.18 trans_amount_increase_rate_lately
0.20 trans_activity_month
0.17 trans_activity_day
4.48 transd_mcc
22.72 trans_days_interval_filter
16.47 trans_days_interval
52217.83 repayment_capability
10.06 number_of_trans_from_2011
320493.12 historical_trans_amount
99.69 historical_trans_day
0.26 rank_trad_1_month
101746.13 trans_amount_3_month
1.39 avg_consume_less_12_valid_month
27007.60 abs
0.35 top_trans_count_last_1_month
765.87 avg_price_last_12_month
0.10 avg_price_top_last_12_valid_month
5.32 trans_top_time_last_1_month
12.96 trans_top_time_last_6_month
5.46 consume_top_time_last_1_month
13.13 consume_top_time_last_6_month
2.34 cross_consume_count_last_1_month
1.91 trans_fail_top_count_enum_last_1_month
4.46 trans_fail_top_count_enum_last_6_month
4.76 trans_fail_top_count_enum_last_12_month
374267.23 consume_mini_time_last_1_month
10813.45 max_cumulative_consume_later_1_month
5.68 ma
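The filter above drops columns one at a time inside the loop. As a hedged alternative, the same selection can be written with a boolean mask over the column standard deviations. The frame `toy`, its column names and its values are assumptions made up for this sketch.

```python
import pandas as pd

# Toy frame; the values are illustrative only
toy = pd.DataFrame({
    'nearly_constant': [0.50, 0.51, 0.50, 0.49],
    'spread_out': [1.0, 5.0, 9.0, 13.0],
})

# Sample standard deviation of each column (pandas uses ddof=1 by default)
stds = toy.std()

# Keep only the columns whose standard deviation reaches the threshold
threshold = 0.1
kept = toy.loc[:, stds >= threshold]
print(kept.columns.tolist())
```

Because the standard deviation depends on a column's units, a fixed threshold such as 0.1 treats a ratio column and an amount column very differently; that is a property of this kind of filter, not of the sketch.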

<a id='1.5'></a>
## 1.5 Missing-Value Imputation and Feature Encoding

In [58]:
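# Next, fill the missing values.
# The numeric features in data_temp are filled with the median.
# 'student_feature', which has a very high missing ratio, is filled with -1 in the encoding step below.
for i in data_temp.columns:
    temp = data_temp[i].isnull().sum()
    if temp:
        print(i)
        # Assign the result back instead of filling in place on a column selection
        data_temp[i] = data_temp[i].fillna(data_temp[i].median())

# Standardize the numeric features
from sklearn.preprocessing import StandardScaler
# Standardization gives every feature zero mean and unit variance, so that features with large values do not dominate the prediction
ss = StandardScaler()
# fit_transform() fits the scaler to the data and then standardizes it
data_temp = ss.fit_transform(data_temp)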

middle_volume_percent
trans_amount_increase_rate_lately
trans_activity_month
trans_activity_day
transd_mcc
trans_days_interval_filter
trans_days_interval
number_of_trans_from_2011
historical_trans_day
rank_trad_1_month
avg_consume_less_12_valid_month
top_trans_count_last_1_month
avg_price_top_last_12_valid_month
trans_top_time_last_1_month
trans_top_time_last_6_month
consume_top_time_last_1_month
consume_top_time_last_6_month
cross_consume_count_last_1_month
trans_fail_top_count_enum_last_1_month
trans_fail_top_count_enum_last_6_month
trans_fail_top_count_enum_last_12_month
consume_mini_time_last_1_month
max_consume_count_later_6_month
railway_consume_count_last_12_month
jewelry_consume_count_last_6_month
first_transaction_day
trans_day_last_12_month
apply_score
apply_credibility
query_org_count
query_finance_count
query_cash_count
query_sum_count
latest_one_month_apply
latest_three_month_apply
latest_six_month_apply
loans_score
loans_credibility_behavior
loans_count
loans_settle_count
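The two steps above (median fill, then standardization) can be checked on a small example. This is a minimal sketch, assuming a made-up frame `toy` with hypothetical columns 'amount' and 'days'; it only illustrates what the two operations do to the values.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy frame; the values are illustrative only
toy = pd.DataFrame({
    'amount': [10.0, np.nan, 30.0, 50.0],
    'days': [1.0, 2.0, np.nan, 9.0],
})

# fillna with a Series of medians fills each column with its own median
filled = toy.fillna(toy.median())

# StandardScaler subtracts the column mean and divides by the population standard deviation
scaled = StandardScaler().fit_transform(filled)

print(filled['amount'].tolist())
print(scaled.mean(axis=0), scaled.std(axis=0))
```

After scaling, every column has a mean of approximately 0 and a standard deviation of approximately 1, up to floating-point error.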

In [59]:
a5 = data['miss_rate']
# .values returns the underlying NumPy array (as_matrix() is no longer available in recent pandas)
b5 = a5.values
print(b5)
print(b5.shape)
b5 = b5.reshape(len(b5), 1)
print(b5.shape)

[ 0.98765432  1.          0.98765432 ...,  0.98765432  0.98765432  1.        ]
(4754,)
(4754, 1)


In [60]:
# One-hot encode the categorical features
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder


def one_hot(values):
    # Map the category values to integer codes, then expand the codes into one-hot columns
    integer_encoded = LabelEncoder().fit_transform(values)
    integer_encoded = integer_encoded.reshape(len(integer_encoded), 1)
    # sparse=False returns a dense array (scikit-learn 1.2 and later name this parameter sparse_output)
    return OneHotEncoder(sparse=False).fit_transform(integer_encoded)


# 'student_feature': missing values become a category of their own, -1
onehot_encoded1 = one_hot(data['student_feature'].fillna(-1).values)
print(onehot_encoded1)

# 'regional_mobility': missing values are filled with the median
onehot_encoded2 = one_hot(data['regional_mobility'].fillna(data['regional_mobility'].median()).values)
print(onehot_encoded2)

# 'reg_preference_for_trad': missing values are filled with the column maximum
onehot_encoded3 = one_hot(data['reg_preference_for_trad'].fillna(data['reg_preference_for_trad'].max()).values)
print(onehot_encoded3)

# 'is_high_user': missing values are filled with the column maximum
onehot_encoded4 = one_hot(data['is_high_user'].fillna(data['is_high_user'].max()).values)
print(onehot_encoded4)

print('{} {} {} {}'.format(onehot_encoded1.shape, onehot_encoded2.shape, onehot_encoded3.shape, onehot_encoded4.shape))

[[ 1.  0.  0.]
 [ 0.  1.  0.]
 [ 1.  0.  0.]
 ..., 
 [ 1.  0.  0.]
 [ 1.  0.  0.]
 [ 0.  1.  0.]]
[[ 0.  0.  1.  0.  0.]
 [ 0.  0.  0.  1.  0.]
 [ 1.  0.  0.  0.  0.]
 ..., 
 [ 0.  0.  1.  0.  0.]
 [ 0.  0.  0.  1.  0.]
 [ 0.  1.  0.  0.  0.]]
[[ 1.  0.  0.  0.  0.]
 [ 1.  0.  0.  0.  0.]
 [ 1.  0.  0.  0.  0.]
 ..., 
 [ 1.  0.  0.  0.  0.]
 [ 1.  0.  0.  0.  0.]
 [ 1.  0.  0.  0.  0.]]
[[ 1.  0.]
 [ 1.  0.]
 [ 1.  0.]
 ..., 
 [ 1.  0.]
 [ 1.  0.]
 [ 1.  0.]]
(4754, 3) (4754, 5) (4754, 5) (4754, 2)
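The idea of treating "missing" as its own category and then one-hot encoding can also be sketched with pandas alone. This swaps in `pd.get_dummies` for the scikit-learn encoders used in the notebook; the Series below is made up for illustration and is not taken from data.csv.

```python
import numpy as np
import pandas as pd

# Toy column with two observed categories and two missing entries
student_feature = pd.Series([1.0, np.nan, 2.0, np.nan, 1.0], name='student_feature')

# Replace missing entries by the sentinel category -1
filled = student_feature.fillna(-1)

# One indicator column per distinct value: -1, 1 and 2
onehot = pd.get_dummies(filled, prefix='student_feature', dtype=float)
print(onehot.shape)
```

Three distinct values (-1, 1, 2) give three indicator columns, which matches the width of the first one-hot block printed above.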


<a id='1.6'></a>
## 1.6 Building the Feature Matrix

In [61]:
# Feature matrix X
print(data_temp.shape)
import numpy as np
X = np.hstack([data_temp, onehot_encoded1, onehot_encoded2, onehot_encoded3, onehot_encoded4, b5])
print(X.shape)
# Target variable y
y = data['status']
print(y.shape)

(4754, 75)
(4754, 91)
(4754,)
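The 91 columns of X follow from the widths of the stacked blocks: 75 + 3 + 5 + 5 + 2 + 1. The sketch below reproduces only that bookkeeping with zero-filled placeholder arrays; `n_rows`, `widths` and `X_toy` are names introduced for illustration.

```python
import numpy as np

n_rows = 4
# Block widths taken from the shapes printed above: 75 scaled numeric columns,
# one-hot blocks of width 3, 5, 5 and 2, and the single miss_rate column
widths = [75, 3, 5, 5, 2, 1]
blocks = [np.zeros((n_rows, w)) for w in widths]

# hstack concatenates along the column axis, so the row counts must match
X_toy = np.hstack(blocks)
print(X_toy.shape)
```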


<a id='1.7'></a>
## 1.7 Splitting into Training and Test Sets

In [62]:
# Split into training and test sets
from sklearn.model_selection import train_test_split

# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=23)

import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit
sss = StratifiedShuffleSplit(n_splits=1,test_size=0.3,random_state=1)
sss.get_n_splits(X, y)
print(sss)

for train_index,test_index in sss.split(X,y):
    print("Train Index:",train_index,",Test Index:",test_index)
    X_train, X_test=X[train_index],X[test_index]
    # train_index and test_index are positional, so index the Series y by position
    y_train, y_test=y.iloc[train_index],y.iloc[test_index]
    # print(X_train,X_test,y_train,y_test)

StratifiedShuffleSplit(n_splits=1, random_state=1, test_size=0.3,
            train_size=None)
('Train Index:', array([4151,  381,  104, ..., 3500,  278,  961]), ',Test Index:', array([2593, 2388, 3542, ..., 3250,  377, 2418]))
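A stratified split keeps the class proportions of y roughly equal in both parts. The sketch below shows this on synthetic labels using `train_test_split` with `stratify`, which is a different entry point from the `StratifiedShuffleSplit` object used above; all names and data here are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the same rough 3:1 imbalance as 'status'
y_toy = np.array([0] * 300 + [1] * 100)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

# stratify=y_toy asks for the class proportions to be preserved in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=1, stratify=y_toy)

# The mean of 0/1 labels is the share of positives
print(y_tr.mean(), y_te.mean())
```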


<a id='2'></a>
# 2 Model Selection

<a id='2.1'></a>
## 2.1 LR Model

In [65]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
lr = LogisticRegression()
lr.fit(X_train, y_train)

from sklearn.metrics import accuracy_score, f1_score

# Accuracy
y_train_pred = lr.predict(X_train)
y_test_pred = lr.predict(X_test)

print('[Accuracy]')
print('Training set: {}'.format(accuracy_score(y_train, y_train_pred)))
print('Test set: {}'.format(accuracy_score(y_test, y_test_pred)))

[Accuracy]
Training set: 0.802524797115
Test set: 0.803784162579


In [67]:
from sklearn import metrics

# Predicted class probabilities
y_train_pred = lr.predict_proba(X_train)
y_test_pred = lr.predict_proba(X_test)
print(y_test_pred[:,[0]])

# AUC on the test set: roc_auc_score takes the true labels first and the scores second,
# so pass y_test and the predicted probability of the positive class (column 1)
test_auc = metrics.roc_auc_score(y_test, y_test_pred[:, 1])
print(test_auc)

[[ 0.78692431]
 [ 0.91232501]
 [ 0.85748486]
 ..., 
 [ 0.90865681]
 [ 0.80359218]
 [ 0.39740008]]
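`roc_auc_score` expects the true class labels as its first argument and continuous scores as its second; passing probabilities in the label position is what produces `ValueError: continuous format is not supported`. The following is a minimal sketch of the intended call on synthetic data. `make_classification` is a stand-in for the notebook's feature matrix, and every variable name here is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification data standing in for X and y
X_toy, y_toy = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=0, stratify=y_toy)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Column 1 of predict_proba is the probability of the positive class (label 1)
positive_scores = model.predict_proba(X_te)[:, 1]

# True labels first, continuous scores second
auc = roc_auc_score(y_te, positive_scores)
print('test AUC: %.4f' % auc)
```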

<a id='2.2'></a>
## 2.2 SVM Model

In [69]:
from sklearn.svm import SVC
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score

svc = SVC(C=1.0, kernel='rbf', gamma=0.1)
svc.fit(X_train, y_train)

# LinearSVC model
Lin_SVC = LinearSVC()
Lin_SVC.fit(X_train,y_train)

y_train_pred = svc.predict(X_train)
y_test_pred = svc.predict(X_test)

# print(y_train[0:100])
print(y_test_pred[0:100])
print(y_test_pred[100:200])

# The RBF SVM predicts almost the same class for every test sample. Possible causes:
# 1. Unscaled features: in a high-dimensional space the distances between points become very large without scaling,
#    so the model assigns every point to the same class. The numeric features were standardized in 1.5, which makes this less likely here.
# 2. The hyperparameters: a training accuracy of 0.994 against a test accuracy of 0.752 suggests that
#    C=1.0 and gamma=0.1 overfit the training set.

print('[Accuracy]')
print('Training set: {}'.format(accuracy_score(y_train, y_train_pred)))
print('Test set: {}'.format(accuracy_score(y_test, y_test_pred)))
print('F1 score:')
print('Training set: {:.4f}'.format(f1_score(y_train, y_train_pred)))
print('Test set: {:.4f}'.format(f1_score(y_test, y_test_pred)))
print('ROC AUC:')
print('Training set: {:.4f}'.format(roc_auc_score(y_train, y_train_pred)))
print('Test set: {:.4f}'.format(roc_auc_score(y_test, y_test_pred)))

y_train_pred = Lin_SVC.predict(X_train)
y_test_pred = Lin_SVC.predict(X_test)

print('[Accuracy]')
print('Training set: {}'.format(accuracy_score(y_train, y_train_pred)))
print('Test set: {}'.format(accuracy_score(y_test, y_test_pred)))
print('F1 score:')
print('Training set: {:.4f}'.format(f1_score(y_train, y_train_pred)))
print('Test set: {:.4f}'.format(f1_score(y_test, y_test_pred)))
print('ROC AUC:')
print('Training set: {:.4f}'.format(roc_auc_score(y_train, y_train_pred)))
print('Test set: {:.4f}'.format(roc_auc_score(y_test, y_test_pred)))

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[Accuracy]
Training set: 0.993988578299
Test set: 0.751927119832
F1 score:
Training set: 0.9879
Test set: 0.0380
ROC AUC:
Training set: 0.9880
Test set: 0.5084
[Accuracy]
Training set: 0.794409377818
Test set: 0.798878766643
F1 score:
Training set: 0.4384
Test set: 0.4636
ROC AUC:
Training set: 0.6366
Test set: 0.6484
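The RBF SVM's test figures above (accuracy about 0.752, F1 about 0.038, ROC AUC about 0.508) are close to what a constant "always class 0" prediction yields on data with a 3:1 class ratio. The sketch below makes that baseline explicit on synthetic labels; `y_true` and `y_pred` are made-up arrays, not the notebook's test set.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

# Synthetic test labels with a 3:1 class ratio, and a "model" that always predicts class 0
y_true = np.array([0] * 75 + [1] * 25)
y_pred = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_pred)
# With no positive predictions, precision is undefined; zero_division=0 reports F1 as 0
f1 = f1_score(y_true, y_pred, zero_division=0)
# A constant score cannot rank positives above negatives
auc = roc_auc_score(y_true, y_pred)

print(acc, f1, auc)
```

Accuracy alone would make this constant predictor look acceptable, which is why the F1 and AUC lines are the more informative ones for this data.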


<a id='2.3'></a>
## 2.3 DT Model

In [70]:
# A shallow tree (max_depth=4) with balanced class weights to offset the roughly 1:3 class ratio;
# all other parameters keep their default values
clf = DecisionTreeClassifier(class_weight='balanced', max_depth=4)

clf.fit(X_train, y_train)

y_train_pred = clf.predict(X_train)
y_test_pred = clf.predict(X_test)

print(y_test_pred[0:100])

print('[Accuracy]')
print('Training set: {}'.format(accuracy_score(y_train, y_train_pred)))
print('Test set: {}'.format(accuracy_score(y_test, y_test_pred)))
print('F1 score:')
print('Training set: {:.4f}'.format(f1_score(y_train, y_train_pred)))
print('Test set: {:.4f}'.format(f1_score(y_test, y_test_pred)))
print('ROC AUC:')
print('Training set: {:.4f}'.format(roc_auc_score(y_train, y_train_pred)))
print('Test set: {:.4f}'.format(roc_auc_score(y_test, y_test_pred)))

[0 0 0 0 0 0 0 1 0 1 1 0 0 1 0 0 0 1 0 0 0 1 0 1 1 1 0 1 0 1 0 1 0 1 0 0 0
 0 0 0 1 1 0 1 0 0 1 0 1 1 1 1 0 0 0 0 0 0 0 1 1 1 0 0 0 1 1 1 0 1 0 0 0 1
 1 0 0 1 0 0 0 1 0 0 1 0 1 1 1 1 0 0 0 0 0 0 1 1 1 1]
[Accuracy]
Training set: 0.702735196874
Test set: 0.668535388928
F1 score:
Training set: 0.5547
Test set: 0.5188
ROC AUC:
Training set: 0.7144
Test set: 0.6831
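With `class_weight='balanced'`, scikit-learn weights each class by n_samples / (n_classes * class_count), so the rarer positive class counts more during training. The sketch below computes those weights from the full-data counts in 1.3. This is an assumption for illustration: the tree above was fitted on the training split, so its actual weights differ slightly.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Labels rebuilt from the full-data counts in 1.3 (3561 zeros, 1193 ones)
y_all = np.array([0] * 3561 + [1] * 1193)

# 'balanced' weight for class c is n_samples / (n_classes * count(c))
weights = compute_class_weight(class_weight='balanced', classes=np.array([0, 1]), y=y_all)
print(dict(zip([0, 1], weights.round(4))))
```

The positive class ends up weighted about three times as heavily as the negative class, which helps explain why the tree predicts many more 1s than the SVMs do, at the cost of lower accuracy.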


<a id='2.4'></a>
## 2.4 XGBoost Model

In [73]:
import xgboost as xgb

from xgboost.sklearn import XGBClassifier

# Use a separate name for the classifier so that it does not shadow the xgboost module alias
xgb_clf = XGBClassifier()
xgb_clf.fit(X_train, y_train)

y_test_pred = xgb_clf.predict(X_test)

print(y_test_pred)

print('[Accuracy]')
print('Test set: {}'.format(xgb_clf.score(X_test, y_test)))
print('F1 score:')
print('Test set: {:.4f}'.format(f1_score(y_test, y_test_pred)))
print('ROC AUC:')
print('Test set: {:.4f}'.format(roc_auc_score(y_test, y_test_pred)))

[0 0 0 ..., 0 0 0]
[Accuracy]
Test set: 0.78976874562
F1 score:
Test set: 0.4774
ROC AUC:
Test set: 0.6544
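The ROC AUC figures in sections 2.2 to 2.4 are computed from the hard 0/1 output of `predict()`. For hard labels, ROC AUC reduces to the average of the true-positive and true-negative rates (balanced accuracy), whereas AUC computed from continuous scores measures how well the model ranks positives above negatives. The arrays below are made up to illustrate the difference and are not model output from this notebook.

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, roc_auc_score

y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
# Hypothetical positive-class scores, e.g. from predict_proba(...)[:, 1]
scores = np.array([0.1, 0.2, 0.2, 0.3, 0.3, 0.6, 0.9, 0.45, 0.4, 0.35])
# Hard labels obtained by thresholding the scores at 0.5, as predict() would return
hard_labels = (scores >= 0.5).astype(int)

auc_from_labels = roc_auc_score(y_true, hard_labels)
auc_from_scores = roc_auc_score(y_true, scores)
balanced_acc = balanced_accuracy_score(y_true, hard_labels)

print(auc_from_labels, balanced_acc, auc_from_scores)
```

In this toy case three of the four positives fall just below the 0.5 threshold, so the label-based AUC is low even though the scores rank most positives above most negatives.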


<a id='2.5'></a>
## 2.5 LightGBM Model

In [74]:
from lightgbm import LGBMClassifier

lgb = LGBMClassifier()
lgb.fit(X_train, y_train)

y_test_pred = lgb.predict(X_test)

print(y_test_pred)

print('[Accuracy]')
print('Test set: {}'.format(lgb.score(X_test, y_test)))
print('F1 score:')
print('Test set: {:.4f}'.format(f1_score(y_test, y_test_pred)))
print('ROC AUC:')
print('Test set: {:.4f}'.format(roc_auc_score(y_test, y_test_pred)))