**PaddlePaddle Regular Competition: MarTech Challenge User Purchase Prediction, 6th-Place Solution (August)**

In [None]:
# View the currently mounted dataset directory.
# Changes under this directory are reverted automatically when the environment restarts.
!ls /home/aistudio/data

In [0]:
# View the personal work directory.
# All changes under this directory are kept even after a reset.
# Please clean up unnecessary files promptly to speed up environment loading.
!ls /home/aistudio/work

In [None]:
# For a persistent installation, install into a persistent path, for example:
!mkdir -p /home/aistudio/external-libraries
!pip install beautifulsoup4 -t /home/aistudio/external-libraries

In [None]:
# Also add the following code and run it each time the environment (kernel) starts:
import sys 
sys.path.append('/home/aistudio/external-libraries')

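Solution approach:
Data exploration: the member-information columns contain a large number of missing values and are handled separately; the province and city columns contain only a few missing values, which can be treated as a category of their own.
Data preprocessing: missing-value handling and outlier handling.
Dataset split: the task is to predict whether historical users will purchase in September, so the model is trained by using data up to the end of July to predict August purchases, and the full data set is then used to predict September as the test set.
Feature engineering:
* The data set is large enough to support training with many features, so the main idea is to construct as many auxiliary features as possible.
* From the goods, order, and transaction-amount perspectives, features are built along several dimensions, including the mean and standard deviation.
* Basic user information such as city, and spending habits (a user's preferred purchase time may affect customer stickiness), are also turned into features at multiple scales.
* Multi-scale time features are built as well. We believe the freshness of a product also influences purchases, so listing and delisting times are included.
Model training: a LightGBM (lgb) model trained with the standard `lgb.train` API and early stopping. Because the classes are severely imbalanced, the raw predicted probabilities are kept and the threshold is adjusted manually.
Model observations:
* During training, the model converged at an AUC of about 0.83 on the training set and about 0.80 on the validation set, which is a good result; the model is saved and used for prediction.
* Because the classes are severely imbalanced, and considering that about 150,000 new users arrived in August and that September is also a peak sales season, a relatively large number of predicted buyers was chosen; the threshold was then adjusted by inspecting the final predictions to reach a better value.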
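
The time-based split described above can be sketched on a toy table. The column names mirror the dataset, but the rows and the helper name `make_train_and_labels` are made up for illustration; this is a minimal sketch under those assumptions, not the exact notebook code.

```python
import pandas as pd

def make_train_and_labels(orders: pd.DataFrame, cutoff: str):
    """Split orders at `cutoff` and label each earlier customer by later activity."""
    pay_time = pd.to_datetime(orders["order_pay_time"])
    cutoff_ts = pd.Timestamp(cutoff)
    history = orders[pay_time <= cutoff_ts]          # features are built from these rows
    future_buyers = set(orders.loc[pay_time > cutoff_ts, "customer_id"])
    # One label per customer seen in the history window
    customers = pd.Index(history["customer_id"].unique(), name="customer_id")
    labels = pd.Series(
        [int(c in future_buyers) for c in customers],
        index=customers, name="label",
    )
    return history, labels

# Toy data (hypothetical): customers 1 and 2 buy before the cutoff,
# customer 1 buys again afterwards, customer 3 only appears after the cutoff.
toy = pd.DataFrame({
    "customer_id": [1, 2, 1, 3],
    "order_pay_time": ["2013-07-01 10:00:00", "2013-07-15 12:00:00",
                       "2013-08-02 09:30:00", "2013-08-20 18:00:00"],
})
history, labels = make_train_and_labels(toy, "2013-07-31 23:59:59")
print(labels.to_dict())
```

Customer 3 gets no training row at all, which is why new users who first appear in the label month cannot be learned from directly in this setup.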



In [1]:
import pandas as pd
import numpy as np
import pickle
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
import lightgbm as lgb
from sklearn.model_selection import KFold
import matplotlib.pyplot as plt

In [2]:
raw = pd.read_csv("train.csv")
print(raw.shape)    # (2306871, 29)
# Check for missing values
print(raw.isna().sum())   # customer_province, customer_city and goods_price have a few missing values
# raw

(2306871, 29)
order_detail_id                 0
order_id                        0
order_total_num                 0
order_amount                    0
order_total_payment             0
order_total_discount            0
order_pay_time                  0
order_status                    0
order_count                     0
is_customer_rate                0
order_detail_status             0
order_detail_goods_num          0
order_detail_amount             0
order_detail_payment            0
order_detail_discount           0
customer_province            1139
customer_city                1150
member_id                       0
customer_id                     0
customer_gender           1671081
member_status             1671081
is_member_actived         1671081
goods_id                        0
goods_class_id                  0
goods_price                   436
goods_status                    0
goods_has_discount              0
goods_list_time                 0
goods_delist_time               0


In [3]:
# Sort the source data by payment time, so that a per-customer groupby().last() returns the most recent record
raw.sort_values("order_pay_time",ascending=True, inplace=True)
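
Why the sort matters: `groupby(...).last()` returns, for each group, the last non-null value in row order, so it only means "most recent" once the rows are sorted by payment time. A small sketch on made-up rows (the values are hypothetical):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "customer_id":    [1, 1, 2, 2],
    "order_pay_time": ["2013-03-05 08:00:00", "2013-01-10 09:00:00",
                       "2013-02-01 10:00:00", "2013-04-01 11:00:00"],
    "goods_price":    [30.0, 10.0, 20.0, np.nan],
})

# Without sorting, "last" just means "last row as stored"
unsorted_last = toy.groupby("customer_id")["goods_price"].last()

# After sorting by payment time, "last" means "most recent non-null value"
toy_sorted = toy.sort_values("order_pay_time", ascending=True)
recent_last = toy_sorted.groupby("customer_id")["goods_price"].last()

print(unsorted_last.to_dict())
print(recent_last.to_dict())
```

For customer 2 the most recent row has a missing price, so `last()` falls back to the earlier non-null value. The same thing can happen with the `*_last` features built in this notebook.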

In [4]:
# Data exploration
print(raw.shape)   # full data set: (2306871, 29)
print(max(raw["order_pay_time"]))  # latest payment time: 2013-08-31 23:59:59
# Build the training set
train_raw = raw[raw["order_pay_time"] <= "2013-07-31 23:59:59"]
train_raw.shape  # order rows up to the end of July: (2080703, 29)
print(train_raw["customer_id"].nunique())
# number of customers in the training window: 1435404
raw[raw["order_pay_time"]>"2013-07-31 23:59:59"].shape  
# August order rows: (226168, 29)
print(raw.columns)
label_raw = set(raw[raw["order_pay_time"]>"2013-07-31 23:59:59"]["customer_id"])
len(label_raw)  # 173385 customers placed an order in August

(2306871, 29)
2013-08-31 23:59:59
1435404
Index(['order_detail_id', 'order_id', 'order_total_num', 'order_amount',
       'order_total_payment', 'order_total_discount', 'order_pay_time',
       'order_status', 'order_count', 'is_customer_rate',
       'order_detail_status', 'order_detail_goods_num', 'order_detail_amount',
       'order_detail_payment', 'order_detail_discount', 'customer_province',
       'customer_city', 'member_id', 'customer_id', 'customer_gender',
       'member_status', 'is_member_actived', 'goods_id', 'goods_class_id',
       'goods_price', 'goods_status', 'goods_has_discount', 'goods_list_time',
       'goods_delist_time'],
      dtype='object')


173385

In [5]:
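# Feature engineering

def preprocessing(raw):
    # Build a new per-customer DataFrame (one row per customer_id)
    # raw = add_features(raw)
    data = pd.DataFrame(raw.groupby("customer_id")["customer_gender"].last().fillna(0))
    # Goods-related features (from the customer's most recent record)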
    data[[ 'goods_id_last', 'goods_class_id_last',
       'goods_price_last', 'goods_status_last', 'goods_has_discount_last', 'goods_list_time_last',
       'goods_delist_time_last']] = raw.groupby("customer_id")[[ 'goods_id', 'goods_class_id',
       'goods_price', 'goods_status', 'goods_has_discount', 'goods_list_time',
       'goods_delist_time']].last()
    # Order-related features (from the customer's most recent record)
    data[['order_detail_id_last', 'order_id_last', 'order_total_num_last', 'order_amount_last',
       'order_total_payment_last', 'order_total_discount_last', 'order_pay_time_last',
       'order_status_last', 'order_count_last', 'is_customer_rate_last',
       'order_detail_status_last', 'order_detail_goods_num_last', 'order_detail_amount_last',
       'order_detail_payment_last', 'order_detail_discount_last']] = \
       raw.groupby("customer_id")[['order_detail_id', 'order_id', 'order_total_num', 'order_amount',
       'order_total_payment', 'order_total_discount', 'order_pay_time',
       'order_status', 'order_count', 'is_customer_rate',
       'order_detail_status', 'order_detail_goods_num', 'order_detail_amount',
       'order_detail_payment', 'order_detail_discount']].last()
    # Original order amount (min, max, mean, std)
    data[["order_amount_min","order_amount_max","order_amount_mean","order_amount_std"]] = \
          raw.groupby("customer_id")["order_amount"].agg(["min", "max", "mean", "std"]).fillna(0)
    # Actual order payment (min, max, mean, std)
    data[["order_total_payment_min","order_total_payment_max","order_total_payment_mean","order_total_payment_std"]] = \
          raw.groupby("customer_id")["order_total_payment"].agg(["min", "max", "mean", "std"]).fillna(0)
    # Order discount (mean, sum)
    data[["order_total_discount_mean","order_total_discount_sum"]] = raw.groupby("customer_id")["order_total_discount"].agg(["mean", "sum"])

    # Number of order rows per customer (order_id is counted per row, not de-duplicated)
    data["order_id_count"] = raw.groupby("customer_id")["order_id"].count()

    # # Interval since the prior order: last value and mean
    # data["days_since_prior_order_last"] = raw.groupby("customer_id")["days_since_prior_order"].last()
    # data["days_since_prior_order_mean"] = raw.groupby("customer_id")["days_since_prior_order"].mean()

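    ## Customer status information
    # Customer province: treat a missing province as its own category ("0")
    data["customer_province_last"] = raw.groupby("customer_id")["customer_province"].last().fillna(str(0))
    # Customer city: treat a missing city as its own category ("0")
    data["customer_city_last"] = raw.groupby("customer_id")["customer_city"].last().fillna(str(0))
    # Member status: fill missing values with 0 and add an indicator feature
    data["member_status"] = raw.groupby("customer_id")["member_status"].last().fillna(0)
    # Indicator: 0 when member_status is 1, otherwise 1 (this includes the filled-in missing values)
    data["member_status_default"] = [ 0 if i ==1 else 1 for i in data["member_status"] ]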
 
    # Whether the customer left a rating (mean, sum, count)
    data[["is_customer_rate_mean","is_customer_rate_sum"]] = raw.groupby("customer_id")["is_customer_rate"].agg(["mean", "sum"])
    data["is_customer_rate_count"] = raw.groupby("customer_id")["is_customer_rate"].count()

    # Number of goods rows purchased by the customer
    data["good_id_count"] = raw.groupby("customer_id")["goods_id"].count()
    # Original goods price (min, max, mean, std)
    data[["goods_price_min","goods_price_max","goods_price_mean","goods_price_std"]] = \
          raw.groupby("customer_id")["goods_price"].agg(["min", "max", "mean", "std"]).fillna(0)
    # Goods discount flag (mean, sum)
    data[["goods_has_discount_mean","goods_has_discount_sum"]] = raw.groupby("customer_id")["goods_has_discount"].agg(["mean", "sum"])

    # Payment time (multi-scale time features and time differences)
    start_time = pd.to_datetime("2012-11-01 00:00:07")
    data["order_pay_time_last"] = pd.to_datetime(data["order_pay_time_last"])
    data["order_pay_time_last_month"]  = data["order_pay_time_last"].dt.month
    data["order_pay_time_last_day"]  = data["order_pay_time_last"].dt.day
    data["order_pay_time_last_hour"]  = data["order_pay_time_last"].dt.hour
    data["order_pay_time_last_minute"]  = data["order_pay_time_last"].dt.minute
    data["order_pay_time_last_weekday"]  = data["order_pay_time_last"].dt.weekday
    # Payment-time difference: days between the customer's last order and the
    # earliest timestamp in the data ("2012-11-01 00:00:07")
    data["order_pay_time_last_delta"] = (data["order_pay_time_last"] - start_time).dt.days

    # Listing-time difference for the last purchased item (reference time "2012-11-01 00:00:07")
    data["goods_list_time_last"] = pd.to_datetime(data["goods_list_time_last"])
    data["goods_list_time_last_delta"] = (data["goods_list_time_last"] - start_time).dt.days
    # Delisting-time difference for the last purchased item (reference time "2012-11-01 00:00:07")
    data["goods_delist_time_last"] = pd.to_datetime(data["goods_delist_time_last"]) 
    data["goods_delist_time_last_delta"] = (data["goods_delist_time_last"] - start_time).dt.days
    # Display duration of the item (delisting time minus listing time)
    data["good_display_time"] = (data["goods_delist_time_last"] - data["goods_list_time_last"]).dt.days

    # Drop the raw datetime columns, which are not used as features
    data.drop(["order_pay_time_last","goods_list_time_last","goods_delist_time_last"],axis=1,inplace=True)

    
    return data


In [6]:
train = preprocessing(train_raw)
print(train.shape)  # 54 features constructed in total

(1435404, 54)


In [32]:
# train.head()

In [33]:
# Add labels to the training set: 1 if the customer purchased in August, otherwise 0
train["label"] = train.index.map(lambda x: int(x in label_raw))
train["label"].value_counts()  # 22803 purchased in August, 1412601 did not
# train.info()

0    1412601
1      22803
Name: label, dtype: int64
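
To put the imbalance in numbers, the arithmetic below uses the two counts printed above. The training parameters later in this notebook leave `scale_pos_weight` and `is_unbalance` commented out and rely on manual thresholding instead; the negative-to-positive ratio computed here is only a common starting point should one want to try `scale_pos_weight`, not a setting used in this solution.

```python
# Label counts taken from the value_counts() output
n_negative = 1412601
n_positive = 22803

positive_rate = n_positive / (n_positive + n_negative)
# A common heuristic for LightGBM's scale_pos_weight is n_negative / n_positive
suggested_scale_pos_weight = n_negative / n_positive

print(f"positive rate: {positive_rate:.4%}")
print(f"n_negative / n_positive: {suggested_scale_pos_weight:.2f}")
```

Roughly 1.6% of the training customers are positives, so a default 0.5 probability cutoff would label almost nobody as a buyer.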

In [34]:
# Build the test set from the full data (through August), used to predict September
test = preprocessing(raw)

In [35]:
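# Label-encode the categorical features
print(train.select_dtypes("O").columns)
label_features = ['customer_province_last', 'customer_city_last']

# Fit each encoder on the test frame, which is built from the full data,
# then reuse the fitted encoder on the training frame
encoders = []
for feat in label_features:
  print(feat)
  enc = LabelEncoder()
  test[feat] = enc.fit_transform(test[feat])
  encoders.append(enc)

# transform() raises an error if train contains a value the encoder has not seen
for i,feat in enumerate(label_features):
  train[feat] = encoders[i].transform(train[feat])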

# test[label_features]

Index(['customer_province_last', 'customer_city_last'], dtype='object')
customer_province_last
customer_city_last
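
One way to make the encoding robust to values that occur in only one of the two frames is to build the mapping from the union of both columns. The sketch below uses plain Python and a hypothetical helper `fit_label_mapping` as a stand-in for `LabelEncoder` (sorted distinct values mapped to consecutive integers); the city names are made up.

```python
def fit_label_mapping(*columns):
    """Map each distinct value across all given columns to an integer code.

    Codes follow the sorted order of the distinct values, similar in spirit
    to sklearn's LabelEncoder (this helper is a hypothetical stand-in).
    """
    classes = sorted(set().union(*columns))
    return {value: code for code, value in enumerate(classes)}

# Toy columns; "0" plays the role of the filled-in missing value
train_city = ["Hangzhou", "Beijing", "0"]
test_city = ["Beijing", "Shanghai", "0", "Hangzhou"]

mapping = fit_label_mapping(train_city, test_city)
train_codes = [mapping[v] for v in train_city]
test_codes = [mapping[v] for v in test_city]
print(mapping)
print(train_codes, test_codes)
```

Because the mapping is fitted on both columns, neither lookup can hit an unknown value.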


In [30]:
# Train with LightGBM
import lightgbm as lgb
from sklearn.model_selection import train_test_split
# param = {
#     'num_leaves':41,
#     'boosting_type': 'gbdt',
#     'objective':'binary',
#     'max_depth':15,
#     'learning_rate':0.001,
#     'metric':'binary_logloss'}
param = {'boosting_type':'gbdt',
                         'objective' : 'binary', #
                         #'metric' : 'binary_logloss',
                         'metric' : 'auc',
#                          'metric' : 'self_metric',
                         'learning_rate' : 0.01,
                         'max_depth' : 15,
                         'feature_fraction':0.8,
                         'bagging_fraction': 0.9,
                         'bagging_freq': 8,
                         'lambda_l1': 0.6,
                         'lambda_l2': 0,
#                          'scale_pos_weight':k,
#                         'is_unbalance':True
        }

X_train, X_valid, y_train, y_valid = train_test_split(train.drop('label',axis=1), train['label'], test_size=0.2, random_state=42)
train_data = lgb.Dataset(X_train, label=y_train)
valid_data = lgb.Dataset(X_valid, label=y_valid)

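# Note: early_stopping_rounds and verbose_eval are accepted by LightGBM 3.x;
# LightGBM 4.0 removed them in favor of callbacks (lgb.early_stopping, lgb.log_evaluation)
model = lgb.train(param,train_data,valid_sets=[train_data,valid_data],num_boost_round = 10000 ,early_stopping_rounds=200,verbose_eval=25)
# Save the model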
model.save_model("model.txt")

[LightGBM] [Info] Number of positive: 18303, number of negative: 1130020
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 7723
[LightGBM] [Info] Number of data points in the train set: 1148323, number of used features: 54
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.015939 -> initscore=-4.122926
[LightGBM] [Info] Start training from score -4.122926
Training until validation scores don't improve for 200 rounds
[25]	training's auc: 0.791325	valid_1's auc: 0.784718
[50]	training's auc: 0.793722	valid_1's auc: 0.786721
[75]	training's auc: 0.797172	valid_1's auc: 0.790401
[100]	training's auc: 0.799352	valid_1's auc: 0.792597
[125]	training's auc: 0.800488	valid_1's auc: 0.793859
[150]	training's auc: 0.802439	valid_1's auc: 0.795582
[175]	training's auc: 0.803506	valid_1's auc: 0.796055
[200]	training's auc: 0.805132	valid_1's auc: 0.796979
[225]	training's auc: 0.806823	valid_1's auc: 0.79793
[250]	training's auc: 0.808222	valid_1's auc: 0.79870

In [17]:
# Predict on the test set; the output is the raw probability for each customer
predict=model.predict(test)
predict

In [None]:
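# Choose a suitable threshold to turn the predicted probabilities into classes
# The key is to control how many customers are predicted to buy (or not buy) next month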
test["label"] = predict


In [None]:
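# From the data: of the customers who had purchased by the end of July, only 22803 purchased again in August, while 1412601 did not.
# Yet 173385 customers purchased in August, so 150582 of them were new customers; new customers tend to have a high next-month purchase rate.
# Adding the earlier buyers, about 350000 existing customers are expected to order again, so the cutoff is set at the top 500000 customers.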
threshold = test.label.sort_values(ascending=False).iloc[500000]
test['label'] = test["label"].map(lambda x: 1 if x>threshold else 0 )
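
The rank-based cutoff used here can be sketched in isolation. The helper name `top_k_labels` and the scores are made up for illustration; the logic mirrors the two lines above (take the score at rank `k` as the threshold, then keep everything strictly greater).

```python
def top_k_labels(scores, k):
    """Label the k highest scores as 1 and the rest as 0 (ties at the cutoff get 0)."""
    if not 0 < k < len(scores):
        raise ValueError("k must be between 1 and len(scores) - 1")
    ranked = sorted(scores, reverse=True)
    threshold = ranked[k]          # the (k+1)-th largest score
    return [1 if s > threshold else 0 for s in scores], threshold

scores = [0.02, 0.91, 0.40, 0.07, 0.66, 0.15]
labels, threshold = top_k_labels(scores, k=2)
print(labels, threshold)
```

Because the comparison is strict, scores tied with the threshold are labeled 0, so the number of predicted buyers can end up slightly below `k` when many customers share the same probability.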

In [None]:
submit = pd.read_csv("submission.csv")
submission = submit.set_index("customer_id").join(test["label"]).drop("result",axis=1).reset_index()
submission.columns = submit.columns
submission

# Save the result
submission.to_csv("lgb_submission.csv",index=False)

Please click [here](https://ai.baidu.com/docs#/AIStudio_Project_Notebook/a38e5576) for basic usage instructions for this environment.