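# Splitting the training and test sets
Basic approach:  
- First, take the most recent timestamp in each user's behavior, spending, and credit card records as the criterion for splitting users into training and test sets (the code below uses the bank statements and credit card bills).
- Then split the test set into leaderboard A and leaderboard B; this second split is random.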

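## Getting each user's latest timestamp
The procedure has the following steps:
  1. Read the data from dataV5
  2. Take the user IDs from profile and build a new DataFrame
  3. From the bank statements, get each user's latest timestamp and add it to the DataFrame
  4. From the credit card bills, get each user's latest timestamp and add it to the DataFrame
  5. Keep the larger of the two timestamps as each user's split criterion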
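
The steps above can be tried on a toy dataset before running them on the real files. The sketch below is only an illustration: the rows are made up and the helper name `latest_time_per_user` is hypothetical; only the column names `用户标识` (user ID), `流水时间` (transaction time) and `账单时间戳` (bill timestamp) come from the notebook. It uses a left join so that only users from the ID list are kept, whereas the notebook uses an outer join.

```python
import pandas as pd

def latest_time_per_user(user_ids, df_bank, df_bill):
    """Return one row per user with the latest bank and bill timestamps."""
    df = pd.DataFrame({'用户标识': user_ids})
    # Latest timestamp per user in each source table
    bank_max = df_bank.groupby('用户标识', as_index=False)['流水时间'].max()
    bill_max = df_bill.groupby('用户标识', as_index=False)['账单时间戳'].max()
    # Users missing from a table get NaN in that column
    df = df.merge(bank_max, on='用户标识', how='left')
    df = df.merge(bill_max, on='用户标识', how='left')
    return df

# Toy data, made up for illustration
toy_bank = pd.DataFrame({'用户标识': [0, 0, 1], '流水时间': [10, 30, 20]})
toy_bill = pd.DataFrame({'用户标识': [1, 2], '账单时间戳': [25, 5]})
toy_latest = latest_time_per_user([0, 1, 2], toy_bank, toy_bill)
print(toy_latest)
```

User 0 has no bill and user 2 has no bank record, so those cells come back as NaN, which is why the notebook fills missing values before comparing the two columns.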


In [1]:
import pandas as pd
# step1: read the data from dataV5
df_profile = pd.read_csv("../data/dataV5/profile.csv")
df_bank = pd.read_csv("../data/dataV5/bankStatement.csv")
df_bill = pd.read_csv("../data/dataV5/creditBill.csv")
df_behaviors = pd.read_csv("../data/dataV5/behaviors.csv")
df_overdue = pd.read_csv("../data/dataV5/label.csv")

# step2: take the user IDs (用户标识) from profile and build a new df
df_userTime = pd.DataFrame(df_profile['用户标识'])
df_userTime.columns = ['用户标识']

In [2]:
# step3: from the bank statements, get each user's latest transaction time (流水时间) and merge it into df
# df_bank.head()
df_userTimeBank = df_bank.groupby('用户标识').agg({'流水时间':'max'})
df_userTime = pd.merge(df_userTime,df_userTimeBank,on='用户标识',how='outer')

In [3]:
# step4: from the credit card bills, get each user's latest bill timestamp (账单时间戳) and merge it into df
# df_bill.head()
df_userTimeBill = df_bill.groupby('用户标识').agg({'账单时间戳':'max'})
df_userTime = pd.merge(df_userTime,df_userTimeBill,on='用户标识',how='outer')

# fill NaN values (users with no records in a table) with 0
df_userTime.fillna(0, inplace = True)

In [4]:
df_userTime.head()

Unnamed: 0,用户标识,流水时间,账单时间戳
0,0,3816705000.0,0.0
1,1,3822616000.0,3815255000.0
2,2,0.0,3805389000.0
3,3,3822528000.0,3838115000.0
4,4,0.0,4036882000.0


In [5]:
# step5: take the larger of the two timestamps as each user's split criterion (时间标准)
df_userTime['时间标准'] = df_userTime.apply(lambda x: max(x.流水时间, x.账单时间戳), axis = 1)
df_userTime.drop(['流水时间','账单时间戳'], axis = 1, inplace=True)
df_userTime.head()

Unnamed: 0,用户标识,时间标准
0,0,3816705000.0
1,1,3822616000.0
2,2,3805389000.0
3,3,3838115000.0
4,4,4036882000.0
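
Filling missing values with 0 works here because every real timestamp is positive. A possible alternative, sketched below on made-up values, is `DataFrame.max(axis=1)`, which skips NaN by default. With it no fill is needed, and a user with no records in either table stays NaN instead of silently becoming 0.

```python
import numpy as np
import pandas as pd

# Made-up values; user 3 has no record in either table
toy = pd.DataFrame({
    '用户标识': [0, 1, 2, 3],
    '流水时间': [30.0, 20.0, np.nan, np.nan],
    '账单时间戳': [np.nan, 25.0, 5.0, np.nan],
})

# Row-wise maximum over the two timestamp columns, ignoring NaN
toy['时间标准'] = toy[['流水时间', '账单时间戳']].max(axis=1)
print(toy[['用户标识', '时间标准']])
```

Whether such users should be dropped, or kept and sorted first as in the notebook, is a design choice the notebook does not discuss.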


## Sorting by time and splitting into training and test sets at a 4:1 ratio


In [6]:
# sort by time (ascending)
df_userTime.sort_values(by='时间标准',inplace=True)
# the earliest 80% are flagged 1 (train); the most recent 20% are flagged 0 (test)
n_train = int(df_userTime.shape[0]*0.8)
train_list = [1] * n_train
test_list = [0] * (df_userTime.shape[0] - n_train)

df_userTime['signTrain'] = train_list + test_list
df_userTime.drop(['时间标准'], axis = 1, inplace=True)
df_userTime.head()

Unnamed: 0,用户标识,signTrain
43239,43239,1
66462,66462,1
32625,32625,1
32623,32623,1
54790,54790,1
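
The time-ordered split can be checked on a small example. The following is a minimal sketch, assuming one row per user and a single timestamp column; the helper name `time_ordered_split` and the toy values are illustrative, not from the notebook.

```python
import pandas as pd

def time_ordered_split(df, time_col, train_frac=0.8):
    """Flag the earliest train_frac of rows as train (1) and the rest as test (0)."""
    ordered = df.sort_values(by=time_col, kind='stable').reset_index(drop=True)
    n_train = int(len(ordered) * train_frac)
    ordered['signTrain'] = [1] * n_train + [0] * (len(ordered) - n_train)
    return ordered

# Made-up timestamps for five users
toy = pd.DataFrame({'用户标识': [0, 1, 2, 3, 4], '时间标准': [50, 10, 40, 20, 30]})
flagged = time_ordered_split(toy, '时间标准')
print(flagged)
```

With five users, four land in the training set and the user with the latest timestamp (user 0) becomes the test set, so every test timestamp is at least as late as every training timestamp.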


In [37]:
df_userTime.sort_values(by='用户标识',inplace = True)
df_userTime.head()

Unnamed: 0,用户标识,signTrain
0,0,1
1,1,1
2,2,1
3,3,1
4,4,0


In [38]:
# merge the signTrain flag into each table
df_profile = pd.merge(df_profile,df_userTime,on='用户标识',how='left')
df_bank = pd.merge(df_bank,df_userTime,on='用户标识',how='left')
df_bill = pd.merge(df_bill,df_userTime,on='用户标识',how='left')
df_behaviors = pd.merge(df_behaviors,df_userTime,on='用户标识',how='left')
df_overdue = pd.merge(df_overdue,df_userTime,on='用户标识',how='left')
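
Because `df_userTime` is meant to hold exactly one row per user, each left merge should leave the row count of the original table unchanged. One way to guard that assumption, sketched here on made-up frames, is the `validate` argument of `pd.merge`, which raises `MergeError` when the right-hand keys are not unique. The `amount` column is hypothetical and used only for illustration.

```python
import pandas as pd

# Made-up frames; 'amount' is a hypothetical column used only for illustration
toy_records = pd.DataFrame({'用户标识': [0, 0, 1, 2], 'amount': [5, 7, 3, 9]})
toy_flags = pd.DataFrame({'用户标识': [0, 1, 2], 'signTrain': [1, 1, 0]})

# many_to_one: many record rows per user on the left, one flag row per user on the right
merged = pd.merge(toy_records, toy_flags, on='用户标识', how='left',
                  validate='many_to_one')

# A duplicated user on the right-hand side is rejected instead of duplicating rows
dup_flags = pd.concat([toy_flags, toy_flags.iloc[[0]]], ignore_index=True)
try:
    pd.merge(toy_records, dup_flags, on='用户标识', how='left',
             validate='many_to_one')
    duplicate_detected = False
except pd.errors.MergeError:
    duplicate_detected = True

print(len(merged), duplicate_detected)
```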

In [39]:
# then split each table into training and test sets by signTrain
def getTrainTest(df,featSignTrain):
    # drop() without inplace returns a new DataFrame, which avoids SettingWithCopyWarning
    df_train = df[df[featSignTrain] == 1].drop([featSignTrain],axis=1)
    df_test = df[df[featSignTrain] == 0].drop([featSignTrain],axis=1)
    return df_train,df_test

df_profile_train, df_profile_test = getTrainTest(df_profile,'signTrain')
df_bank_train, df_bank_test = getTrainTest(df_bank,'signTrain')
df_bill_train, df_bill_test = getTrainTest(df_bill,'signTrain')
df_behaviors_train, df_behaviors_test = getTrainTest(df_behaviors,'signTrain')
df_overdue_train, df_overdue_test = getTrainTest(df_overdue,'signTrain')


## Splitting the test set into leaderboard A and leaderboard B
The split is random, at a 1:1 ratio.

In [69]:
# shuffle the rows
df_overdue_test = df_overdue_test.sample(frac=1)

# first half flagged 1 (leaderboard A), second half flagged 0 (leaderboard B)
A_list = [1] * int(df_overdue_test.shape[0]*0.5)
B_list = [0] *( df_overdue_test.shape[0]-int(df_overdue_test.shape[0]*0.5) )

df_overdue_test['signAB'] = A_list + B_list

# select on df_overdue_test's own signAB column, then sort copies instead of sorting slices in place
df_overdue_testA = df_overdue_test[df_overdue_test['signAB']==1].sort_values(by='用户标识')
df_overdue_testB = df_overdue_test[df_overdue_test['signAB']==0].sort_values(by='用户标识')


Unnamed: 0,用户标识,signAB
4,4,0
6,6,0
13,13,0
19,19,1
27,27,1
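
`sample(frac=1)` draws a fresh permutation on every run, so the A/B assignment changes each time the cell is executed. The following is a minimal sketch of a reproducible variant, assuming a fixed seed is acceptable. The helper name `split_ab`, the seed value 0 and the `label` column are illustrative and not taken from the notebook.

```python
import pandas as pd

def split_ab(df, seed=0):
    """Shuffle with a fixed seed, then cut the rows into two halves."""
    shuffled = df.sample(frac=1, random_state=seed)
    n_a = len(shuffled) // 2  # same as int(n * 0.5) for non-negative n
    part_a = shuffled.iloc[:n_a].sort_values(by='用户标识')
    part_b = shuffled.iloc[n_a:].sort_values(by='用户标识')
    return part_a, part_b

# Made-up labels for seven users
toy_labels = pd.DataFrame({'用户标识': range(7), 'label': [0, 1, 0, 0, 1, 0, 1]})
a1, b1 = split_ab(toy_labels)
a2, b2 = split_ab(toy_labels)
print(len(a1), len(b1), a1.equals(a2))
```

Calling the helper twice with the same seed returns identical halves, and with an odd number of rows the extra row goes to B, matching the notebook's rounding.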


## Writing out all the data

In [76]:
# write the training data
df_profile_train.to_csv("../data/dataV6/train/train_profile.csv",index=None)
df_bank_train.to_csv("../data/dataV6/train/train_bankStatement.csv",index=None)
df_bill_train.to_csv("../data/dataV6/train/train_creditBill.csv",index=None)
df_behaviors_train.to_csv("../data/dataV6/train/train_behaviors.csv",index=None)
df_overdue_train.to_csv("../data/dataV6/train/train_label.csv",index=None)

In [77]:
# write the test data
df_profile_test.to_csv("../data/dataV6/test/test_profile.csv",index=None)
df_bank_test.to_csv("../data/dataV6/test/test_bankStatement.csv",index=None)
df_bill_test.to_csv("../data/dataV6/test/test_creditBill.csv",index=None)
df_behaviors_test.to_csv("../data/dataV6/test/test_behaviors.csv",index=None)
df_overdue_testA.to_csv("../data/dataV6/test/test_label_A.csv",index=None)
df_overdue_testB.to_csv("../data/dataV6/test/test_label_B.csv",index=None)
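
Before relying on the written files, it can be worth confirming that no user appears in more than one split. The check below is a sketch only: the helper name `splits_are_disjoint` is hypothetical and the ID lists are made up. In the notebook it could be fed the `用户标识` columns of the training labels and the two test label frames.

```python
def splits_are_disjoint(train_ids, test_a_ids, test_b_ids):
    """Return True if no user ID is shared between any two of the three splits."""
    train, part_a, part_b = set(train_ids), set(test_a_ids), set(test_b_ids)
    return not (train & part_a) and not (train & part_b) and not (part_a & part_b)

# Made-up ID lists for illustration
ok = splits_are_disjoint([0, 1, 2, 3], [4, 6], [13, 19])
leaky = splits_are_disjoint([0, 1, 2, 3], [3, 6], [13, 19])
print(ok, leaky)
```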