![](./img/chinahadoop.png)
# Risk Control in Practice: Telecom Risk Modeling
**[小象学院](http://www.chinahadoop.cn/course/landpage/15) "Machine Learning Bootcamp" hands-on project case, by [@寒小阳](http://www.chinahadoop.cn/user/49339/about)**

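## Notes
This notebook covers the feature engineering part. It lays out the basic approach and framework; please fill in the steps for each stage. You are also encouraged to try more and better approaches to data processing, feature engineering, and modeling.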

### Import libraries

In [2]:
import numpy as np
import pandas as pd
import scipy as sp
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, f1_score
import xgboost as xgb

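We extract single-dimension features from each table on its own, and cross-dimension features by joining across multiple tables.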

### voice features
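#### First, extract features over the entire time span
1. Total number of calls per user: opp
2. Number of distinct people called: voice_all_unique_cnt
3. Ratio of call count to number of distinct people: voice_all_cnt_all_unique_cnt_rate
4. Counts by the first n digits of the peer number, and the number of distinct prefixes: opp_head
5. Distribution counts of peer-number length: opp_len
6. Call duration statistics (maximum, mean, minimum, range, and so on): start_time, end_time
7. Distribution counts or ratios of call type: call_type
8. Distribution counts and ratios of call direction (incoming/outgoing): in_out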
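
Items 1 to 3 of the list above can be sketched with a single `groupby`. This is a minimal illustration on made-up data, not the course's reference solution; the column name `voice_all_cnt` is an assumption (the notebook only names the unique count and the ratio).

```python
import pandas as pd

# Toy call records; column names follow the notebook, the values are made up.
voice = pd.DataFrame({
    'uid':     ['u1', 'u1', 'u1', 'u2', 'u2'],
    'opp_num': ['a',  'a',  'b',  'c',  'c'],
})

# One row per uid: total number of calls and number of distinct peers.
# `voice_all_cnt` is a hypothetical name chosen for this sketch.
agg = voice.groupby('uid')['opp_num'].agg(
    voice_all_cnt='count',
    voice_all_unique_cnt='nunique',
).reset_index()

# Calls per distinct peer.
agg['voice_all_cnt_all_unique_cnt_rate'] = (
    agg['voice_all_cnt'] / agg['voice_all_unique_cnt']
)

print(agg)
```

The resulting frame has one row per `uid`, so it can be left-joined onto the base table of users.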

In [3]:
# Column names shared by all voice tables
voice_cols = ['uid', 'opp_num', 'opp_head', 'opp_len',
              'start_time', 'end_time', 'call_type', 'in_out']

df_train_voice = pd.read_csv('../data/train/voice_train.txt', names=voice_cols, sep='\t', low_memory=False)

df_train_label = pd.read_csv('../data/train/uid_train.txt', names=['uid', 'label'], sep='\t', low_memory=False)

df_testA_voice = pd.read_csv('../data/testA/voice_test_a.txt', names=voice_cols, sep='\t', low_memory=False)

df_testB_voice = pd.read_csv('../data/testB/voice_test_b.txt', names=voice_cols, sep='\t', low_memory=False)

In [None]:
def get_voice_features(df_train_voice, target='train', Type=None):
    # To keep the function general, `target` marks whether we are processing the
    # training set or a test set, and `Type` selects test set A or B
    if target == 'train':
        # Start from a copy of the label table
        df_train = df_train_label.copy()
    else:
        if Type == 'A':
            df_train = pd.DataFrame(data={'uid': ['u' + str(i) for i in range(5000, 7000)]})
        else:
            df_train = pd.DataFrame(data={'uid': ['u' + str(i) for i in range(7000, 10000)]})
    
    
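    # Total number of calls
    # your code here
    
    
    
    # Number of distinct peer numbers across all calls
    # your code here
    
    
    
    # Ratio of call count to distinct peers, i.e. calls per peer: voice_all_per_opp_rate
    # your code here
    
    
    
    # Peer-number prefixes (first n digits): number of distinct prefixes, plus per-prefix
    # counts and ratios (some features still to be decided): opp_head_cnt_{k}, opp_head_rate_{k}
    # your code here
    
    
    
    # The opp_head values contacted most often and least often
    # your code here
    
    
    
    # Call count of the most frequently contacted opp_head
    # your code here
    
    
    
    # Length of the peer number in the most recent call
    # your code here
    
    
    
    # call_type distribution (counts)
    # your code here
    
    
    
    # call_type ratios
    # your code here
    
    
    
    # in_out ratios
    # your code here
    
    
    
    # Call duration statistics: maximum, mean, minimum, range, and so on
    # your code here
    
    
    
    # Distribution over days
    # Same as in the previous notebook
    # your code here
    
    
    
    # Distribution over hours
    # Same as in the previous notebook
    # your code here
    
    
    
    # Distribution over minutes
    # your code here
    
    
    
    # Other summary statistics: maximum, minimum, quantiles
    # your code here
    
    
    
    # Average interval between calls
    # your code here
    
    
    
    # Other supplementary features
    # your code here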


                      
    return df_train
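
As one possible way to fill in the "call_type distribution" and "call_type ratios" placeholders, the sketch below counts calls per `(uid, call_type)`, pivots them into columns, and left-joins the result onto the base table. The data is made up and the column names `voice_call_type_cnt_{k}` / `voice_call_type_rate_{k}` are assumptions modelled on the `opp_head_cnt_{k}` naming in the template.

```python
import pandas as pd

# Toy data; values are made up. u3 has no call records at all.
base = pd.DataFrame({'uid': ['u1', 'u2', 'u3']})
voice = pd.DataFrame({
    'uid':       ['u1', 'u1', 'u1', 'u2'],
    'call_type': [1, 1, 2, 3],
})

# Count calls per (uid, call_type) and spread call_type values into columns.
cnt = voice.groupby(['uid', 'call_type']).size().unstack(fill_value=0)
cnt.columns = [f'voice_call_type_cnt_{k}' for k in cnt.columns]

# Normalise each row by that user's total number of calls to get ratios.
rate = cnt.div(cnt.sum(axis=1), axis=0)
rate.columns = [c.replace('_cnt_', '_rate_') for c in rate.columns]

# A left join keeps users without any call records; their features are NaN.
feats = cnt.join(rate).reset_index()
base = base.merge(feats, on='uid', how='left')

print(base)
```

Users that never appear in the voice table end up with NaN features, which is why the notebook later calls `fillna(0)` on the result. The same pattern would apply to `in_out` and `opp_head`.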

In [None]:
# Helper: difference in seconds between two timestamps encoded as DDHHMMSS integers
def diff_time(a, b):
    # Use floor division (//): in Python 3, / returns a float and would leak
    # the lower-order digits into the day, hour, and minute fields
    a_day, a_hour, a_minute, a_second = (a // 1000000, a // 10000 % 100, a // 100 % 100, a % 100)
    b_day, b_hour, b_minute, b_second = (b // 1000000, b // 10000 % 100, b // 100 % 100, b % 100)
    
    d_day = b_day - a_day
    d_hour = b_hour - a_hour
    d_minute = b_minute - a_minute
    d_second = b_second - a_second
    
    diff = d_day * 24 * 60 * 60 + d_hour * 60 * 60 + d_minute * 60 + d_second
    return diff

def get_diff_time(x):
    diff_t = []
    for d in x:
        diff_t.append(diff_time(d[0],d[1]))
    return diff_t
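
The row-by-row loop in `get_diff_time` can also be expressed with vectorised pandas arithmetic, which tends to be faster on large tables. The sketch below assumes that `start_time` and `end_time` are integers in `DDHHMMSS` form, as the arithmetic in `diff_time` suggests; `to_seconds` and the `voice_duration_*` column names are hypothetical helpers for illustration, and the data is made up.

```python
import pandas as pd

def to_seconds(t):
    """Convert DDHHMMSS integers (a scalar or a Series) to seconds since day 0."""
    day, rest = t // 1000000, t % 1000000
    hour, rest = rest // 10000, rest % 10000
    minute, second = rest // 100, rest % 100
    return ((day * 24 + hour) * 60 + minute) * 60 + second

# Toy call records; the first call crosses midnight.
voice = pd.DataFrame({
    'uid':        ['u1', 'u1', 'u2'],
    'start_time': [1235950, 2101500, 15080000],
    'end_time':   [2000010, 2101630, 15080045],
})

# Call duration in seconds, plus the hour of day in which each call started.
voice['duration'] = to_seconds(voice['end_time']) - to_seconds(voice['start_time'])
voice['start_hour'] = voice['start_time'] // 10000 % 100

# Per-user duration statistics: max, min, mean, and range.
stats = voice.groupby('uid')['duration'].agg(['max', 'min', 'mean'])
stats['range'] = stats['max'] - stats['min']
stats = stats.add_prefix('voice_duration_').reset_index()

print(stats)
```

The `start_hour` column can then be fed into the same count-and-pivot pattern used for categorical columns to obtain the hour distribution.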

### Run the feature engineering

In [None]:
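# get_voice_features takes `target`, not `mode`; df_test holds the test B features
df_test = get_voice_features(df_testB_voice, target='test', Type='B')
df_testA = get_voice_features(df_testA_voice, target='test', Type='A')
df_train = get_voice_features(df_train_voice)

# Users without any voice records get NaN features; fill them with 0 in all three sets
df_train.fillna(0, inplace=True)
df_test.fillna(0, inplace=True)
df_testA.fillna(0, inplace=True)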

df_train.to_csv('./features/df_train_voice_feat.csv',index=False)
df_test.to_csv('./features/df_testB_voice_feat.csv',index=False)
df_testA.to_csv('./features/df_testA_voice_feat.csv',index=False)

### Inspect the data

In [None]:
df_train.info()
df_test.info()
df_test.head()