# Data Cleaning

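The cleaning strategy is split into the following sub-items:
- Applicant's personal / family information
- Applicant's contact information
- Applicant's vehicle ownership
- Applicant's employment
- Applicant's real estate
- Loan application documents
- Applicant's social circumstances
- Risk assessment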

Each sub-item is cleaned from two angles:
- Check for anomalous values against business logic
- Check for anomalous values in the statistical sense
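
The statistical check can take many forms. As a minimal sketch, assuming the common interquartile-range (IQR) rule is an acceptable criterion, the helper below returns a boolean mask for values outside the IQR fences. The function name `iqr_outlier_mask`, the multiplier `k=1.5`, and the toy values are assumptions made for illustration; they are not part of the original analysis.

```python
import pandas as pd

def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True where a value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # Comparisons against NaN evaluate to False, so missing values are not flagged.
    return (values < lower) | (values > upper)

# Toy data (hypothetical): child counts with one implausibly large value.
toy = pd.Series([0, 0, 1, 1, 2, 2, 3, 19], name='CNT_CHILDREN')
mask = iqr_outlier_mask(toy)
print(toy[mask])
```

The mask only marks candidates. Whether a flagged value is truly wrong still has to be judged against business logic.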

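After cleaning:
- **Flag** anomalous records; do not delete them for now
- Convert text features to numeric values
- Apply a suitable numeric transformation to features that are already numeric
- Fill missing values with `np.nan`, or treat them as a separate category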
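
The encoding and missing-value steps might look like the sketch below. It assumes a text feature is mapped to integer category codes with "missing" as its own class, and that numeric gaps stay as `np.nan`. The toy rows, the `'MISSING'` label, and the `NAME_EDUCATION_TYPE_CODE` column name are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy rows; the values are illustrative, not taken from the dataset.
toy = pd.DataFrame({
    'NAME_EDUCATION_TYPE': ['Higher education', None,
                            'Secondary / secondary special', 'Higher education'],
    'CNT_FAM_MEMBERS': [2.0, None, 1.0, 3.0],
})

# Text feature: make "missing" its own category, then map categories to integer codes.
education = toy['NAME_EDUCATION_TYPE'].fillna('MISSING')
toy['NAME_EDUCATION_TYPE_CODE'] = education.astype('category').cat.codes

# Numeric feature: coerce to float so that gaps are represented as np.nan.
toy['CNT_FAM_MEMBERS'] = pd.to_numeric(toy['CNT_FAM_MEMBERS'], errors='coerce')

print(toy)
```

Integer codes impose an arbitrary order on the categories, so one-hot encoding may be the better choice for models that are sensitive to ordering.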

In [1]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
DATA_DIR = '../data'
APP_TRAIN_FILENAME = 'application_train.csv'
APP_TEST_FILENAME = 'application_test.csv'
BUREAU_FILENAME = 'bureau.csv'
BUREAU_BALANCE_FILENAME = 'bureau_balance.csv'
CREDIT_CARD_BALANCE_FILENAME = 'credit_card_balance.csv'
INSTALLMENTS_PAYMENTS = 'installments_payments.csv'
POS_CACHE_BALANCE_FILENAME = 'POS_CASH_balance.csv'
PREVIOUS_APP_FILENAME = 'previous_application.csv'

In [3]:
user_df = pd.read_csv(os.path.join(DATA_DIR, APP_TRAIN_FILENAME))

## Applicant's Personal / Family Information

This sub-item covers the following fields:
- SK_ID_CURR
- TARGET
- CODE_GENDER
- DAYS_BIRTH
- DAYS_REGISTRATION
- DAYS_ID_PUBLISH
- NAME_EDUCATION_TYPE
- CNT_CHILDREN
- CNT_FAM_MEMBERS
- NAME_FAMILY_STATUS

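The validity checks flag a record as anomalous when any of the following holds. The `DAYS_*` fields count days relative to the application date, so past events are negative:

- `DAYS_BIRTH > DAYS_REGISTRATION` (the registration predates the applicant's birth)
- `DAYS_BIRTH > DAYS_ID_PUBLISH` (the ID document predates the applicant's birth)
- `NAME_FAMILY_STATUS` conflicts with `CNT_CHILDREN` or `CNT_FAM_MEMBERS`
- `CNT_FAM_MEMBERS < CNT_CHILDREN`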
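
The check of `NAME_FAMILY_STATUS` against the two counts is not spelled out here. One possible reading, given purely as an assumption, is that a partnered applicant implies at least two adults in the household and any other status implies at least one, so the family size should be no smaller than children plus adults. The set `PARTNERED_STATUSES`, the helper `family_status_conflict`, and the toy rows are hypothetical.

```python
import pandas as pd

# Assumed labels for partnered applicants; adjust to the labels actually present in the data.
PARTNERED_STATUSES = {'Married', 'Civil marriage'}

def family_status_conflict(df: pd.DataFrame) -> pd.Series:
    """True where the family size is smaller than status and child count imply."""
    # Assumption: 2 adults for partnered applicants, 1 adult otherwise.
    adults = df['NAME_FAMILY_STATUS'].isin(PARTNERED_STATUSES).astype(int) + 1
    return df['CNT_FAM_MEMBERS'] < df['CNT_CHILDREN'] + adults

# Toy rows (hypothetical values).
toy = pd.DataFrame({
    'NAME_FAMILY_STATUS': ['Married', 'Married', 'Single / not married'],
    'CNT_CHILDREN': [2, 2, 1],
    'CNT_FAM_MEMBERS': [4.0, 3.0, 2.0],
})
conflict = family_status_conflict(toy)
print(toy[conflict])
```

This rule is deliberately loose: it only catches families that are too small, since extra household members such as grandparents are plausible.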

In [7]:
PER_FAM_FACTORS = ['SK_ID_CURR', 'TARGET', 'CODE_GENDER',
                   'DAYS_BIRTH', 'DAYS_REGISTRATION', 'DAYS_ID_PUBLISH',
                   'NAME_EDUCATION_TYPE', 'CNT_CHILDREN', 'CNT_FAM_MEMBERS',
                   'NAME_FAMILY_STATUS']
# Work on a copy so that flag columns added later do not touch user_df.
user_per_fam_df = user_df[PER_FAM_FACTORS].copy()

In [9]:
user_per_fam_df[user_per_fam_df['DAYS_BIRTH'] > user_per_fam_df['DAYS_REGISTRATION']]

| | SK_ID_CURR | TARGET | CODE_GENDER | DAYS_BIRTH | DAYS_REGISTRATION | DAYS_ID_PUBLISH | NAME_EDUCATION_TYPE | CNT_CHILDREN | CNT_FAM_MEMBERS | NAME_FAMILY_STATUS |
|---|---|---|---|---|---|---|---|---|---|---|
| 266366 | 408583 | 0 | F | -10116 | -10116.041667 | -2715 | Secondary / secondary special | 2 | 4.0 | Married |


In [10]:
user_per_fam_df[user_per_fam_df['DAYS_BIRTH'] > user_per_fam_df['DAYS_ID_PUBLISH']]

| | SK_ID_CURR | TARGET | CODE_GENDER | DAYS_BIRTH | DAYS_REGISTRATION | DAYS_ID_PUBLISH | NAME_EDUCATION_TYPE | CNT_CHILDREN | CNT_FAM_MEMBERS | NAME_FAMILY_STATUS |
|---|---|---|---|---|---|---|---|---|---|---|


In [11]:
user_per_fam_df[user_per_fam_df['CNT_FAM_MEMBERS'] < user_per_fam_df['CNT_CHILDREN']]

| | SK_ID_CURR | TARGET | CODE_GENDER | DAYS_BIRTH | DAYS_REGISTRATION | DAYS_ID_PUBLISH | NAME_EDUCATION_TYPE | CNT_CHILDREN | CNT_FAM_MEMBERS | NAME_FAMILY_STATUS |
|---|---|---|---|---|---|---|---|---|---|---|
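
The filters in this section only display the offending rows. Since anomalies are meant to be flagged rather than deleted, one way to record the same three conditions is as boolean columns, sketched below on toy data. The `FLAG_*` and `IS_ANOMALY` column names and every value in the toy frame except the first row's dates are hypothetical.

```python
import pandas as pd

# Toy rows; the first row mirrors the dates of the record found above.
toy = pd.DataFrame({
    'DAYS_BIRTH': [-10116, -12000, -9000, -15000],
    'DAYS_REGISTRATION': [-10116.041667, -3000.0, -100.0, -2000.0],
    'DAYS_ID_PUBLISH': [-2715, -13000, -500, -1000],
    'CNT_CHILDREN': [2, 0, 3, 1],
    'CNT_FAM_MEMBERS': [4.0, 2.0, 2.0, 3.0],
})

# One boolean column per check, so the reason for each flag stays visible.
flags = pd.DataFrame({
    'FLAG_REG_BEFORE_BIRTH': toy['DAYS_BIRTH'] > toy['DAYS_REGISTRATION'],
    'FLAG_ID_BEFORE_BIRTH': toy['DAYS_BIRTH'] > toy['DAYS_ID_PUBLISH'],
    'FLAG_FAM_LT_CHILDREN': toy['CNT_FAM_MEMBERS'] < toy['CNT_CHILDREN'],
})

# Keep every row; add the flags plus a summary column.
flagged = toy.join(flags)
flagged['IS_ANOMALY'] = flags.any(axis=1)
print(flagged[['IS_ANOMALY'] + list(flags.columns)])
```

Keeping one column per rule makes it possible to drop or down-weight records later on a rule-by-rule basis.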
