## Data Description and Split Requirements


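**Dataset description**: This is a financial dataset (already preprocessed, not the raw data). The goal is to predict whether a loan customer will become overdue. The "status" column is the label: 0 means not overdue, 1 means overdue.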

**Split requirement**: split the data 70/30, with 70% for the training set and 30% for the test set, and set the random seed to 2018.
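
The split requirement could be sketched as follows. This is a minimal illustration, assuming scikit-learn is available; it uses a small synthetic frame in place of the real data, and the column name `feature` is hypothetical (only `status` comes from the dataset description).

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real dataset (made-up values).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "feature": rng.normal(size=100),
    "status": rng.integers(0, 2, size=100),  # 0 = not overdue, 1 = overdue
})

X = toy.drop(columns="status")
y = toy["status"]

# 70% training / 30% test, with the random seed fixed at 2018.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=2018
)
print(X_train.shape, X_test.shape)
```

Fixing `random_state` makes the partition reproducible, so results from different runs can be compared on the same test rows.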

## Task 1: Data Analysis

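Task 1: explore and analyze the data, including:

* Analyzing the data types
* Removing irrelevant features
* Converting data types
* Handling missing values

...plus any other analysis and preprocessing steps you can think of or borrow from elsewhere.

Deadline: 2019-08-06, 10:00 pm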



In [2]:
import numpy as np
import pandas as pd
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

In [3]:
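# Read data.csv
# df = pd.read_csv('./data.csv')  # fails with an encoding error
# df = pd.read_csv('./data.csv', encoding='utf-8')  # fails with an encoding error
# df = pd.read_csv('./data.csv', encoding='unicode_escape')  # loads, but the text is garbled, probably because the file contains Chinese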
df = pd.read_csv('./data.csv', encoding='gbk')
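
The trial-and-error above can also be automated. The sketch below is one possible approach, not part of the original notebook: `read_csv_with_fallback` is a hypothetical helper, and the demo file and its contents are made up for illustration.

```python
import os
import tempfile

import pandas as pd


def read_csv_with_fallback(path, encodings=("utf-8", "gbk")):
    """Hypothetical helper: try each encoding in turn, return (frame, encoding)."""
    last_error = None
    for enc in encodings:
        try:
            return pd.read_csv(path, encoding=enc), enc
        except UnicodeDecodeError as err:
            last_error = err  # wrong encoding; try the next candidate
    raise last_error


# Demo on a throwaway GBK-encoded file (made-up contents).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.csv")
    with open(demo_path, "w", encoding="gbk") as fh:
        fh.write("bank_card_no,status\n卡号1,0\n")
    demo_df, used_encoding = read_csv_with_fallback(demo_path)

print(used_encoding)
```

Permissive encodings such as `unicode_escape` are left out of the candidate list on purpose: they may decode without raising an error and silently return garbled text, as the third attempt above shows.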

### Initial Exploration

In [4]:
df.head(10)

| | Unnamed: 0 | custid | trade_no | bank_card_no | low_volume_percent | middle_volume_percent | take_amount_in_later_12_month_highest | trans_amount_increase_rate_lately | trans_activity_month | trans_activity_day | ... | loans_max_limit | loans_avg_limit | consfin_credit_limit | consfin_credibility | consfin_org_count_current | consfin_product_count | consfin_max_limit | consfin_avg_limit | latest_query_day | loans_latest_day |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5 | 2791858 | 20180507115231274000000023057383 | 卡号1 | 0.01 | 0.99 | 0 | 0.9 | 0.55 | 0.313 | ... | 2900.0 | 1688.0 | 1200.0 | 75.0 | 1.0 | 2.0 | 1200.0 | 1200.0 | 12.0 | 18.0 |
| 1 | 10 | 534047 | 20180507121002192000000023073000 | 卡号1 | 0.02 | 0.94 | 2000 | 1.28 | 1.0 | 0.458 | ... | 3500.0 | 1758.0 | 15100.0 | 80.0 | 5.0 | 6.0 | 22800.0 | 9360.0 | 4.0 | 2.0 |
| 2 | 12 | 2849787 | 20180507125159718000000023114911 | 卡号1 | 0.04 | 0.96 | 0 | 1.0 | 1.0 | 0.114 | ... | 1600.0 | 1250.0 | 4200.0 | 87.0 | 1.0 | 1.0 | 4200.0 | 4200.0 | 2.0 | 6.0 |
| 3 | 13 | 1809708 | 20180507121358683000000388283484 | 卡号1 | 0.0 | 0.96 | 2000 | 0.13 | 0.57 | 0.777 | ... | 3200.0 | 1541.0 | 16300.0 | 80.0 | 5.0 | 5.0 | 30000.0 | 12180.0 | 2.0 | 4.0 |
| 4 | 14 | 2499829 | 20180507115448545000000388205844 | 卡号1 | 0.01 | 0.99 | 0 | 0.46 | 1.0 | 0.175 | ... | 2300.0 | 1630.0 | 8300.0 | 79.0 | 2.0 | 2.0 | 8400.0 | 8250.0 | 22.0 | 120.0 |
| 5 | 15 | 518072 | 20180507121233054000000388275132 | 卡号1 | 0.02 | 0.98 | 2000 | 7.59 | 1.0 | 0.733 | ... | 5300.0 | 1941.0 | 11200.0 | 80.0 | 10.0 | 12.0 | 20400.0 | 8130.0 | 3.0 | 4.0 |
| 6 | 16 | 1205125 | 20180507121931540000000388298915 | 卡号1 | 0.02 | 0.98 | 0 | 23.67 | 0.94 | 0.087 | ... | 2200.0 | 2200.0 | 7600.0 | 73.0 | 2.0 | 2.0 | 16800.0 | 8900.0 | 1.0 | 3.0 |
| 7 | 18 | 1129897 | 20180507124659235000000023105807 | 卡号1 | 0.02 | 0.98 | 0 | 0.25 | 0.88 | 0.302 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 8 | 20 | 2599411 | 20180507115855621000000388224458 | 卡号1 | 0.03 | 0.65 | 0 | 0.31 | 0.76 | 0.472 | ... | 5300.0 | 4750.0 | 5500.0 | 79.0 | 8.0 | 11.0 | 19200.0 | 7987.0 | 24.0 | 7.0 |
| 9 | 26 | 1413051 | 20180504155156296000000021138084 | 卡号1 | 0.01 | 0.99 | 500 | 0.8 | 1.0 | 0.088 | ... | 2800.0 | 1520.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 18.0 | 142.0 |


In [5]:
# Check the shape of the data
df.shape

(4754, 90)

In [73]:
# List the feature (column) names
df.columns

Index(['Unnamed: 0', 'custid', 'trade_no', 'bank_card_no',
       'low_volume_percent', 'middle_volume_percent',
       'take_amount_in_later_12_month_highest',
       'trans_amount_increase_rate_lately', 'trans_activity_month',
       'trans_activity_day', 'transd_mcc', 'trans_days_interval_filter',
       'trans_days_interval', 'regional_mobility', 'student_feature',
       'repayment_capability', 'is_high_user', 'number_of_trans_from_2011',
       'first_transaction_time', 'historical_trans_amount',
       'historical_trans_day', 'rank_trad_1_month', 'trans_amount_3_month',
       'avg_consume_less_12_valid_month', 'abs',
       'top_trans_count_last_1_month', 'avg_price_last_12_month',
       'avg_price_top_last_12_valid_month', 'reg_preference_for_trad',
       'trans_top_time_last_1_month', 'trans_top_time_last_6_month',
       'consume_top_time_last_1_month', 'consume_top_time_last_6_month',
       'cross_consume_count_last_1_month',
       'trans_fail_top_count_enum_last_1_mont

In [119]:
# Count the columns by dtype
df.dtypes.value_counts()

float64    70
int64      13
object      7
dtype: int64

In [122]:
# Show summary statistics
df.describe()

| | Unnamed: 0 | custid | low_volume_percent | middle_volume_percent | take_amount_in_later_12_month_highest | trans_amount_increase_rate_lately | trans_activity_month | trans_activity_day | transd_mcc | trans_days_interval_filter | ... | loans_max_limit | loans_avg_limit | consfin_credit_limit | consfin_credibility | consfin_org_count_current | consfin_product_count | consfin_max_limit | consfin_avg_limit | latest_query_day | loans_latest_day |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 4754.0 | 4754.0 | 4752.0 | 4752.0 | 4754.0 | 4751.0 | 4752.0 | 4752.0 | 4752.0 | 4746.0 | ... | 4457.0 | 4457.0 | 4457.0 | 4457.0 | 4457.0 | 4457.0 | 4457.0 | 4457.0 | 4450.0 | 4457.0 |
| mean | 6008.414178 | 1690993.0 | 0.021806 | 0.901294 | 1940.197728 | 14.160674 | 0.804411 | 0.365425 | 17.502946 | 29.02992 | ... | 3390.038142 | 1820.357864 | 9187.009199 | 76.04263 | 4.732331 | 5.227507 | 16153.690823 | 8007.696881 | 24.112809 | 55.181512 |
| std | 3452.071428 | 1034235.0 | 0.041527 | 0.144856 | 3923.971494 | 694.180473 | 0.19692 | 0.170196 | 4.475616 | 22.722432 | ... | 1474.206546 | 583.418291 | 7371.257043 | 14.536819 | 2.974596 | 3.409292 | 14301.037628 | 5679.418585 | 37.725724 | 53.486408 |
| min | 5.0 | 114.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.12 | 0.033 | 2.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | -2.0 | -2.0 |
| 25% | 3106.0 | 759335.8 | 0.01 | 0.88 | 0.0 | 0.615 | 0.67 | 0.233 | 15.0 | 16.0 | ... | 2300.0 | 1535.0 | 4800.0 | 77.0 | 2.0 | 3.0 | 7800.0 | 4737.0 | 5.0 | 10.0 |
| 50% | 6006.5 | 1634942.0 | 0.01 | 0.96 | 500.0 | 0.97 | 0.86 | 0.35 | 17.0 | 23.0 | ... | 3100.0 | 1810.0 | 7700.0 | 79.0 | 4.0 | 5.0 | 13800.0 | 7050.0 | 14.0 | 36.0 |
| 75% | 8999.0 | 2597905.0 | 0.02 | 0.99 | 2000.0 | 1.6 | 1.0 | 0.48 | 20.0 | 32.0 | ... | 4300.0 | 2100.0 | 11700.0 | 80.0 | 7.0 | 7.0 | 20400.0 | 10000.0 | 24.0 | 91.0 |
| max | 11992.0 | 4004694.0 | 1.0 | 1.0 | 68000.0 | 47596.74 | 1.0 | 0.941 | 42.0 | 285.0 | ... | 10000.0 | 6900.0 | 87100.0 | 87.0 | 18.0 | 20.0 | 266400.0 | 82800.0 | 360.0 | 323.0 |


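**Summary**:

1. The original dataset has 4754 rows and 90 columns (the `status` label included).

2. By dtype, 70 columns are float64, 13 are int64, and 7 are object.

Because 7 columns are stored as object, the dataset needs **data type conversion** in the steps that follow.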
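
As a rough illustration of what such a conversion can look like, the sketch below lists the non-numeric columns of a toy frame and converts them. The column names and values are hypothetical, and the two conversions shown (parsing dates, encoding categories as integers) are common options rather than the notebook's own choices.

```python
import pandas as pd

# Toy frame; column names and values are hypothetical.
toy = pd.DataFrame({
    "amount": [100.0, 250.5, 80.0],
    "city_tier": ["tier1", "tier2", "tier1"],
    "last_loan_date": ["2018-04-19", "2018-05-05", "2018-05-06"],
})

# List the columns that pandas did not parse as numbers.
non_numeric_cols = toy.select_dtypes(exclude="number").columns.tolist()
print(non_numeric_cols)

# Date-like strings -> datetime64; low-cardinality strings -> integer codes.
toy["last_loan_date"] = pd.to_datetime(toy["last_loan_date"])
toy["city_tier"] = toy["city_tier"].astype("category").cat.codes
print(toy.dtypes)
```

Integer codes impose an arbitrary order on the categories, so one-hot encoding may be the safer choice for models that are sensitive to ordering.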

### Removing Irrelevant Features

In [109]:
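"""

Columns judged irrelevant at first glance:

Unnamed: 0
custid: customer ID
trade_no: transaction number
bank_card_no: bank card number
first_transaction_time: time of the first transaction
id_name: customer name
latest_query_time: time of the latest query

"""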
df1 = df.drop(['Unnamed: 0','custid','trade_no','bank_card_no','first_transaction_time','id_name','latest_query_time'], axis=1)

In [110]:
df1.shape

(4754, 83)

In [111]:
# Drop features whose values are all identical, since they carry no information
constant_cols = [col for col in df1.columns if df1[col].nunique(dropna=False) == 1]
df1 = df1.drop(columns=constant_cols)

In [112]:
df1.shape

(4754, 82)

**Summary**:

After removing the irrelevant features, the dataset has **82 columns** left (the `status` label included).

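### Handling Missing Values

Missing values can be handled in one of three ways:

* delete them, or
* fill them in, or
* leave them as they are

Deleting or filling is the usual choice. Prefer a fill value grounded in business knowledge; failing that, choose one based on univariate analysis.

Rule of thumb: when more than 80% of the values in a row or column are missing, consider dropping that row or column.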
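
The delete and fill options can be sketched on a toy frame as follows. The column names and values are made up, and median imputation is only one simple fill strategy, used here as a stand-in for a business-driven rule.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; column names are hypothetical.
toy = pd.DataFrame({
    "mostly_missing": [np.nan, np.nan, np.nan, np.nan, np.nan, 1.0],
    "some_missing": [1.0, np.nan, 3.0, 4.0, 5.0, 5.0],
    "complete": [1, 2, 3, 4, 5, 6],
})

# Share of missing values per column.
missing_ratio = toy.isnull().mean()

# Delete: drop columns where more than 80% of the values are missing.
sparse_cols = missing_ratio[missing_ratio > 0.8].index
reduced = toy.drop(columns=sparse_cols)

# Fill: impute the remaining gaps with the column median.
filled = reduced.fillna(reduced.median())
print(filled)
```

The median is less sensitive to extreme values than the mean, which matters for heavily skewed amount-type columns.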


In [113]:
# Keep rows with at least 65 non-missing values (about 80% of the 82 columns); drop the rest
df1 = df1.dropna(thresh=65,axis=0)

In [114]:
df1.shape

(4455, 82)
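
The `thresh` argument of `dropna` counts non-missing values, not missing ones, which is easy to misread. A small sketch on made-up data, assuming an 80% presence requirement:

```python
import numpy as np
import pandas as pd

# Toy frame with 5 columns and made-up values.
toy = pd.DataFrame(
    [
        [1, 2, 3, 4, 5],                 # 5 non-missing values
        [1, 2, 3, 4, np.nan],            # 4 non-missing values
        [1, np.nan, np.nan, np.nan, 5],  # 2 non-missing values
    ],
    columns=list("abcde"),
)

# Require at least 80% of the columns to be present in each row.
min_present = int(0.8 * toy.shape[1])  # 4 of 5 columns
kept = toy.dropna(thresh=min_present, axis=0)
print(kept.index.tolist())
```

With 82 columns, `int(0.8 * 82)` evaluates to 65, which coincides with the threshold used in the cell above.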

In [115]:
# Share of missing values for each feature in df1
missing_value_fri = (df1.isnull().sum()/df1.isnull().count()).sort_values(ascending=True)
missing_value_fri

low_volume_percent                            0.000000
status                                        0.000000
trans_day_last_12_month                       0.000000
loans_score                                   0.000000
loans_credibility_behavior                    0.000000
loans_count                                   0.000000
loans_settle_count                            0.000000
loans_overdue_count                           0.000000
loans_org_count_behavior                      0.000000
consfin_org_count_behavior                    0.000000
loans_cash_count                              0.000000
latest_one_month_loan                         0.000000
latest_three_month_loan                       0.000000
latest_six_month_loan                         0.000000
history_suc_fee                               0.000000
history_fail_fee                              0.000000
latest_one_month_suc                          0.000000
latest_one_month_fail                         0.000000
consfin_av

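The missing-value analysis of df1 shows that student_feature has the most missing values, at 62.7%. cross_consume_count_last_1_month comes next at 8.7%, followed by avg_price_top_last_12_valid_month at about 2.2%. Every other feature has less than 1% missing values or none at all.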

In [107]:
# Count the occurrences of each value in the student_feature column
df1['student_feature'].value_counts()

1.0    1754
2.0       2
Name: student_feature, dtype: int64

In [108]:
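# student_feature takes only the values 1 and 2, and about 63% of its entries are missing
# In practice, a bank learns the customer's occupation before granting a loan.
# So a missing student_feature presumably means the account is not a student account (not applicable), and the value 2 is treated as bad data
# Fill NA with 0 (non-student), then replace 2 with 0, which is the mode once the NAs are filled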
df1['student_feature'] = df1['student_feature'].fillna(0)
df1['student_feature'] = df1['student_feature'].replace([2],[0])

In [117]:
# Count the columns by dtype
df1.dtypes.value_counts()

float64    69
int64      11
object      2
dtype: int64

In [106]:
df1.dtypes

low_volume_percent                        float64
middle_volume_percent                     float64
take_amount_in_later_12_month_highest       int64
trans_amount_increase_rate_lately         float64
trans_activity_month                      float64
trans_activity_day                        float64
transd_mcc                                float64
trans_days_interval_filter                float64
trans_days_interval                       float64
regional_mobility                         float64
student_feature                           float64
repayment_capability                        int64
is_high_user                                int64
number_of_trans_from_2011                 float64
historical_trans_amount                     int64
historical_trans_day                      float64
rank_trad_1_month                         float64
trans_amount_3_month                        int64
avg_consume_less_12_valid_month           float64
abs                                         int64
