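## Project Overview

### 1 Task Description
The smart marketing tool helps merchants predict users' purchasing behavior. Using a brand merchant's historical order data, it builds and trains a predictive model that estimates the probability that a group of users will make a purchase within a specified time window.
The model applies to a wide range of e-commerce data analysis. Beyond helping merchants sell goods and take payments on top of platform traffic, it uses MarTech techniques to target core users more precisely and predict their purchasing behavior.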

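### 2 Task Type
Based on the task description and the actual data, the task is framed as a binary classification problem in data mining. It can be analyzed with time series models or solved as a conventional classification task. This project builds on the baseline and uses a multilayer perceptron (MLP).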

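### 3 Solution
The solution starts from the baseline provided by the competition and varies the model structure and the binary classification threshold to improve the evaluation metric.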

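### 4 Summary and Improvements
As the competition emphasizes, handling the dataset's various features sensibly and effectively is the key to the classification task.
This project only uses a fairly basic multilayer perceptron for classification. Possible improvements include, but are not limited to:

1. Refine the feature processing and invest more in feature engineering;
2. Improve or replace the model architecture, for example by trying more advanced neural network models available in modern deep learning frameworks;
3. Change the approach and tackle the problem with time series analysis methods and models from traditional machine learning.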

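### 5 Using PaddlePaddle
When doing deep learning with PaddlePaddle, it pays to combine theory courses with hands-on practice. On the one hand, learn the basics of the framework through documentation and video courses. On the other hand, work on concrete applications (such as PaddlePaddle competitions) to become fluent in each stage of deep learning: data preprocessing, model construction, training, tuning, and application.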

## 1 Data Import

### 1.1 Data Loading

In [1]:
import os
import re
import gc
import time
import random
import numpy as np  
import pandas as pd 
import seaborn as sns
import matplotlib.pyplot as plt

from scipy.stats import skew 
from scipy.special import boxcox1p
from scipy.stats import boxcox_normmax
from itertools import product
import calendar
import datetime as dt
from datetime import datetime, date, timedelta


# Load the data
PATH = './data/data19383/'
train = pd.read_csv(PATH + 'train.csv')
test  = pd.read_csv(PATH + 'submission.csv').set_index('customer_id')

### 1.2 Memory Optimization

In [2]:
# @from: https://www.kaggle.com/arjanso/reducing-dataframe-memory-size-by-65/code
# @license: Apache 2.0
# @author: weijian

def reduce_mem_usage(props):
    # Memory usage before optimization
    start_mem_usg = props.memory_usage().sum() / 1024 ** 2
    print("Memory usage of the dataframe is :", start_mem_usg, "MB")
    
    # Track the columns that contain missing values; they are filled with -999 because np.nan is treated as a float
    NAlist = []
    for col in props.columns:
        # Only object columns are skipped here; if your data contains other non-numeric dtypes, skip them as well
        if (props[col].dtypes != object):
            
            # print("**************************")
            # print("columns: ", col)
            # print("dtype before", props[col].dtype)
            
            # Whether the column can be stored as an integer type
            isInt = False
            
            # Integer does not support NA, therefore NA needs to be filled
            if not np.isfinite(props[col]).all():
                NAlist.append(col)
                props[col] = props[col].fillna(-999) # fill with -999
            
            # Take min/max after filling so the -999 sentinel is part of the range check
            mmax = props[col].max()
            mmin = props[col].min()
                
            # test if column can be converted to an integer
            asint = props[col].fillna(0).astype(np.int64)
            result = np.fabs(props[col] - asint)
            result = result.sum()
            if result < 0.01: # convertible when the total absolute error is below 0.01; adjust per task
                isInt = True
            
            # make integer / unsigned integer datatypes
            if isInt:
                if mmin >= 0: # minimum is non-negative: use an unsigned integer type
                    if mmax <= 255:
                        props[col] = props[col].astype(np.uint8)
                    elif mmax <= 65535:
                        props[col] = props[col].astype(np.uint16)
                    elif mmax <= 4294967295:
                        props[col] = props[col].astype(np.uint32)
                    else:
                        props[col] = props[col].astype(np.uint64)
                else: # otherwise use a signed integer type
                    if mmin > np.iinfo(np.int8).min and mmax < np.iinfo(np.int8).max:
                        props[col] = props[col].astype(np.int8)
                    elif mmin > np.iinfo(np.int16).min and mmax < np.iinfo(np.int16).max:
                        props[col] = props[col].astype(np.int16)
                    elif mmin > np.iinfo(np.int32).min and mmax < np.iinfo(np.int32).max:
                        props[col] = props[col].astype(np.int32)
                    elif mmin > np.iinfo(np.int64).min and mmax < np.iinfo(np.int64).max:
                        props[col] = props[col].astype(np.int64)  
            else: # note: every float column is cast to float16 here; change this to suit your data
                props[col] = props[col].astype(np.float16)
            
            # print("dtype after", props[col].dtype)
            # print("********************************")
    print("___MEMORY USAGE AFTER COMPLETION:___")
    mem_usg = props.memory_usage().sum() / 1024**2 
    print("Memory usage is: ",mem_usg," MB")
    print("This is ",100*mem_usg/start_mem_usg,"% of the initial size")
    return props, NAlist
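
The idea behind `reduce_mem_usage` can be tried on a small frame. The sketch below does not call the function above. It swaps in the pandas built-in `pd.to_numeric(..., downcast=...)` to show how narrower dtypes shrink a DataFrame. The column names `count` and `amount` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical columns; both start as 8-byte dtypes.
toy = pd.DataFrame({
    "count": np.arange(1000, dtype=np.int64) % 200,  # values 0..199 fit in uint8
    "amount": np.arange(1000) * 0.5,                 # float64
})
before = toy.memory_usage(deep=True).sum()

small = toy.copy()
# Pick the smallest unsigned integer / float dtype that can hold the values.
small["count"] = pd.to_numeric(small["count"], downcast="unsigned")
small["amount"] = pd.to_numeric(small["amount"], downcast="float")
after = small.memory_usage(deep=True).sum()

print(small.dtypes)
print("bytes before:", before, "after:", after)
```

Note that `downcast="float"` stops at `float32`, whereas the function above goes down to `float16`, which saves more memory at the cost of precision.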

## 2 Data Preprocessing

### 2.1 Field Processing

In [3]:
# Set the dtypes of the id fields (6)
train['order_detail_id'] = train['order_detail_id'].astype(np.uint32)
train['order_id'] = train['order_id'].astype(np.uint32)
train['customer_id'] = train['customer_id'].astype(np.uint32)
train['goods_id'] = train['goods_id'].astype(np.uint32)
train['goods_class_id'] = train['goods_class_id'].astype(np.uint32)
train['member_id'] = train['member_id'].astype(np.uint32)

# Set the dtypes of the status fields and fill missing values with 0 (10)
train['order_status'] = train['order_status'].astype(np.uint8)
train['goods_has_discount'] = train['goods_has_discount'].astype(np.uint8)
train["is_member_actived"].fillna(0, inplace=True)
train["is_member_actived"]=train["is_member_actived"].astype(np.int8)
train["member_status"].fillna(0, inplace=True)
train["member_status"]=train["member_status"].astype(np.int8)
train["customer_gender"].fillna(0, inplace=True)
train["customer_gender"]=train["customer_gender"].astype(np.int8)
train['is_customer_rate'] = train['is_customer_rate'].astype(np.uint8)
train['order_detail_status'] = train['order_detail_status'].astype(np.uint8)

# Parse the date fields (3)
train['goods_list_time']=pd.to_datetime(train['goods_list_time'],format="%Y-%m-%d")
train['order_pay_time']=pd.to_datetime(train['order_pay_time'],format="%Y-%m-%d")
train['goods_delist_time']=pd.to_datetime(train['goods_delist_time'],format="%Y-%m-%d")


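### 2.2 Feature Construction

#### 2.2.1 Daily Payment Amount

Note that the number of customers with a successful transaction does not equal the total number of customers. A sizable share of customers placed orders but never completed one, so they should be left out of the training set.
When merging the data, `test.csv` already carries a default value of 0, so a `left join` with the predicted labels is all that is needed.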

In [4]:
df = train[train.order_pay_time>'2013-02-01'].copy() # sample by order payment time; copy so the new column is not written to a slice of train
df['date'] = pd.DatetimeIndex(df['order_pay_time']).date  # add a date column


In [5]:
df_payment = df[['customer_id','date','order_total_payment']]
df_payment = df_payment.groupby(['date','customer_id']).agg({'order_total_payment': ['sum']})
df_payment.columns = ['day_total_payment']  # rename the column
df_payment.reset_index(inplace=True)

df_payment = df_payment.set_index(
    ["customer_id", "date"])[["day_total_payment"]].unstack(level=-1).fillna(0)
df_payment.columns = df_payment.columns.get_level_values(1)

#### 2.2.2 Daily Purchase Quantity
Transactions occur on every day in this scenario, so there is no need to generate and fill a complete date range.

In [6]:
df_goods = df[['customer_id','date','order_total_num']]
df_goods = df_goods.groupby(['date','customer_id']).agg({'order_total_num': ['sum']})
df_goods.columns = ['day_total_num']
df_goods.reset_index(inplace=True)
df_goods = df_goods.set_index(
    ["customer_id", "date"])[["day_total_num"]].unstack(level=-1).fillna(0)
df_goods.columns = df_goods.columns.get_level_values(1)
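
The long-to-wide reshaping used for `df_payment` and `df_goods` can be checked on a toy order table. This is a minimal sketch with made-up data. The column names follow the notebook, but the values are hypothetical and the dates are stored as timestamps rather than `date` objects.

```python
import pandas as pd

# Hypothetical toy orders: two customers over three days.
orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 1],
    "date": pd.to_datetime(["2013-02-02", "2013-02-02", "2013-02-03", "2013-02-04"]),
    "order_total_payment": [10.0, 5.0, 7.0, 3.0],
})

daily = (
    orders.groupby(["customer_id", "date"])["order_total_payment"]
    .sum()                   # total paid per customer per day
    .unstack(level="date")   # one row per customer, one column per day
    .fillna(0)               # no order on that day -> 0
)
print(daily)
```

Each row of `daily` is one customer's daily series, which is the shape the sliding-window features in the next section slice by date.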

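### 2.3 Data Preparation

1. Building the dataset involves a shortcut. Apart from the back-to-school season, the September to be predicted is not an especially unusual month, so recent factors matter most. The dataset also starts on February 1, which largely avoids the influence of Singles' Day (November 11) and the New Year holiday, while the Spring Festival holiday is kept. `customer_id` is retained when building the dataset, mainly so the result can be joined with other features.
2. A single function builds the sliding time windows for both payment amount and item quantity, mainly because building them separately and merging afterwards uses more memory. The function also applies memory optimization before returning, trading time for a lower risk of running out of memory.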
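
The sliding-window idea described above can be sketched on a tiny wide table. The helper `window` and the data are hypothetical. They mirror the shape of the notebook's time-window helper but assume the columns are timestamps.

```python
import numpy as np
import pandas as pd
from datetime import date, timedelta

# Hypothetical wide table: one row per customer, one column per day.
days = pd.date_range("2013-06-24", periods=7, freq="D")
wide = pd.DataFrame(
    [[0, 5, 0, 0, 2, 0, 0],
     [0, 0, 0, 0, 0, 0, 0]],
    index=pd.Index([101, 102], name="customer_id"),
    columns=days,
    dtype=float,
)

def window(df, ref_day, minus, periods):
    """Columns for `periods` days starting `minus` days before `ref_day`."""
    start = pd.Timestamp(ref_day) - timedelta(days=minus)
    return df[pd.date_range(start, periods=periods, freq="D")]

ref = date(2013, 7, 1)         # first day of the prediction window
n = 7
tmp = window(wide, ref, n, n)  # the 7 days before `ref`

# Decay-weighted sum: the most recent day has weight 1, older days 0.9, 0.81, ...
decay_sum = (tmp * np.power(0.9, np.arange(n)[::-1])).sum(axis=1)
# Number of days with at least one purchase.
active_days = (tmp != 0).sum(axis=1)
# Days between the last purchase and `ref`.
days_since_last = n - ((tmp != 0) * np.arange(n)).max(axis=1)
```

With this encoding, a customer with no purchase in the window and a customer whose only purchase fell on the first day of the window both get `days_since_last == n`, so the feature is best read together with `active_days`.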

#### 2.3.1 Data Preparation Function

In [7]:
# Sliding-window helper: returns the columns of df for `periods` days starting `minus` days before dt, for further computation
def get_timespan(df, dt, minus, periods, freq='D'):
    return df[pd.date_range(dt - timedelta(days=minus), periods=periods, freq=freq)]

    
def prepare_dataset(df_payment, df_goods, t2018, is_train=True):
    X = {}
    # Attach the customer id
    tmp = df_payment.reset_index()
    X['customer_id'] = tmp['customer_id']
    # Payment features
    print('Preparing payment feature...')
    for i in [14,30,60,91]:
        tmp = get_timespan(df_payment, t2018, i, i)
        # X['diff_%s_mean' % i] = tmp_1.diff(axis=1).mean(axis=1).values
        X['mean_%s_decay' % i] = (tmp * np.power(0.9, np.arange(i)[::-1])).sum(axis=1).values
        # X['mean_%s' % i] = tmp_1.mean(axis=1).values
        # X['median_%s' % i] = tmp.median(axis=1).values
        # X['min_%s' % i] = tmp_1.min(axis=1).values
        X['max_%s' % i] = tmp.max(axis=1).values
        # X['std_%s' % i] = tmp_1.std(axis=1).values
        X['sum_%s' % i] = tmp.sum(axis=1).values
    for i in [14,30,60,91]:
        tmp = get_timespan(df_payment, t2018 + timedelta(days=-7), i, i)
        X['mean_%s_decay_2' % i] = (tmp * np.power(0.9, np.arange(i)[::-1])).sum(axis=1).values
        # X['mean_%s_2' % i] = tmp_2.mean(axis=1).values
        # X['median_%s_2' % i] = tmp.median(axis=1).values
        # X['min_%s_2' % i] = tmp_2.min(axis=1).values
        X['max_%s_2' % i] = tmp.max(axis=1).values
        # X['std_%s_2' % i] = tmp_2.std(axis=1).values
    for i in [14,30,60,91]:
        tmp = get_timespan(df_payment, t2018, i, i)
        X['has_sales_days_in_last_%s' % i] = (tmp != 0).sum(axis=1).values
        X['last_has_sales_day_in_last_%s' % i] = i - ((tmp != 0) * np.arange(i)).max(axis=1).values
        X['first_has_sales_day_in_last_%s' % i] = ((tmp != 0) * np.arange(i, 0, -1)).max(axis=1).values

    # Tweaked here to emphasize recent activity
    for i in range(1, 4):
        X['day_%s_2018' % i] = get_timespan(df_payment, t2018, i*30, 30).sum(axis=1).values
    # Item-quantity features; the windows are deliberately offset from the payment windows to widen the time coverage
    print('Preparing num feature...')
    for i in [21,49,84]:
            tmp = get_timespan(df_goods, t2018, i, i)
            # X['goods_diff_%s_mean' % i] = tmp_1.diff(axis=1).mean(axis=1).values
            # X['goods_mean_%s_decay' % i] = (tmp_1 * np.power(0.9, np.arange(i)[::-1])).sum(axis=1).values
            X['goods_mean_%s' % i] = tmp.mean(axis=1).values
            # X['goods_median_%s' % i] = tmp.median(axis=1).values
            # X['goods_min_%s' % i] = tmp_1.min(axis=1).values
            X['goods_max_%s' % i] = tmp.max(axis=1).values
            # X['goods_std_%s' % i] = tmp_1.std(axis=1).values
            X['goods_sum_%s' % i] = tmp.sum(axis=1).values
    for i in [21,49,84]:    
            tmp = get_timespan(df_goods, t2018 + timedelta(weeks=-1), i, i)
            # X['goods_diff_%s_mean_2' % i] = tmp_2.diff(axis=1).mean(axis=1).values
            # X['goods_mean_%s_decay_2' % i] = (tmp_2 * np.power(0.9, np.arange(i)[::-1])).sum(axis=1).values
            X['goods_mean_%s_2' % i] = tmp.mean(axis=1).values
            # X['goods_median_%s_2' % i] = tmp.median(axis=1).values
            # X['goods_min_%s_2' % i] = tmp_2.min(axis=1).values
            X['goods_max_%s_2' % i] = tmp.max(axis=1).values
            X['goods_sum_%s_2' % i] = tmp.sum(axis=1).values
    for i in [21,49,84]:    
            tmp = get_timespan(df_goods, t2018, i, i)
            X['goods_has_sales_days_in_last_%s' % i] = (tmp > 0).sum(axis=1).values
            X['goods_last_has_sales_day_in_last_%s' % i] = i - ((tmp > 0) * np.arange(i)).max(axis=1).values
            X['goods_first_has_sales_day_in_last_%s' % i] = ((tmp > 0) * np.arange(i, 0, -1)).max(axis=1).values


    # Tweaked here to emphasize recent activity
    for i in range(1, 4):
        X['goods_day_%s_2018' % i] = get_timespan(df_goods, t2018, i*28, 28).sum(axis=1).values

    X = pd.DataFrame(X)
    
    reduce_mem_usage(X)
    
    if is_train:
        # With the wide table, labelling is a direct slice over the next 30 days
        # This assumes the daily totals contain no negative values
        X['label'] = (df_goods[pd.date_range(t2018, periods=30)].max(axis=1).values > 0).astype(np.uint8)
    return X

#### 2.3.2 Preparing the Training Data

In [8]:
num_days = 4  # number of weekly reference dates
t2017 = date(2013, 7, 1)
X_l = []
for i in range(num_days):
    delta = timedelta(days=7 * i)
    X_l.append(prepare_dataset(df_payment, df_goods, t2017 + delta))

X_train = pd.concat(X_l, axis=0)
del X_l

Preparing payment feature...
Preparing num feature...
Memory usage of the dataframe is : 345.16221618652344 MB
___MEMORY USAGE AFTER COMPLETION:___
Memory usage is:  73.87003993988037  MB
This is  21.401542948710667 % of the initial size




Preparing payment feature...
Preparing num feature...
Memory usage of the dataframe is : 345.16221618652344 MB
___MEMORY USAGE AFTER COMPLETION:___
Memory usage is:  73.87003993988037  MB
This is  21.401542948710667 % of the initial size




Preparing payment feature...
Preparing num feature...
Memory usage of the dataframe is : 345.16221618652344 MB
___MEMORY USAGE AFTER COMPLETION:___
Memory usage is:  73.87003993988037  MB
This is  21.401542948710667 % of the initial size




Preparing payment feature...
Preparing num feature...
Memory usage of the dataframe is : 345.16221618652344 MB
___MEMORY USAGE AFTER COMPLETION:___
Memory usage is:  73.87003993988037  MB
This is  21.401542948710667 % of the initial size




#### 2.3.3 Preparing the Test Data

In [9]:
X_test = prepare_dataset(df_payment, df_goods, date(2013, 9, 1), is_train=False)
X_test = pd.concat([X_test], axis=1)

Preparing payment feature...
Preparing num feature...
Memory usage of the dataframe is : 345.16221618652344 MB
___MEMORY USAGE AFTER COMPLETION:___
Memory usage is:  73.87003993988037  MB
This is  21.401542948710667 % of the initial size


#### 2.3.4 Saving the Data

In [10]:
X_train.to_csv('./X_train.csv')
X_test.to_csv('./X_test.csv')

### 2.4 Feature Selection

In [11]:
# Read the preprocessed data
X_train_read = pd.read_csv('X_train.csv')
X_train_read.drop(['Unnamed: 0','customer_id'], inplace=True, axis=1)

X_test_read = pd.read_csv('X_test.csv')
X_test_read.drop(['Unnamed: 0','customer_id'], inplace=True, axis=1)

In [12]:
# Select the feature columns to feed into the model

input_train_features = [  
                     'has_sales_days_in_last_14',
                     'last_has_sales_day_in_last_14', 
                     'first_has_sales_day_in_last_14',

                     'has_sales_days_in_last_30',
                     'last_has_sales_day_in_last_30',
                     'first_has_sales_day_in_last_30', 

                     'has_sales_days_in_last_60',
                     'last_has_sales_day_in_last_60', 
                     'first_has_sales_day_in_last_60',

                     'has_sales_days_in_last_91',
                     'last_has_sales_day_in_last_91',

                     'goods_mean_21', 'goods_max_21', 'goods_sum_21',
                     'goods_mean_49', 'goods_max_49', 'goods_sum_49',
                     'goods_mean_84', 'goods_max_84', 'goods_sum_84', 
                     'goods_mean_21_2', 'goods_max_21_2','goods_sum_21_2', 
                     'goods_mean_49_2', 'goods_max_49_2', 'goods_sum_49_2',
                     'goods_mean_84_2', 'goods_max_84_2', 'goods_sum_84_2',

                     'goods_has_sales_days_in_last_21',
                     'goods_last_has_sales_day_in_last_21',
                     'goods_first_has_sales_day_in_last_21',

                     'goods_has_sales_days_in_last_49',
                     'goods_last_has_sales_day_in_last_49',
                     'goods_first_has_sales_day_in_last_49',

                     'goods_has_sales_days_in_last_84',
                     'goods_last_has_sales_day_in_last_84',
                     'goods_first_has_sales_day_in_last_84', 

                     'goods_day_1_2018','goods_day_2_2018', 'goods_day_3_2018',

                     'label'
                ]

input_test_features = input_train_features.copy()
del input_test_features[-1]

In [13]:
X_train = X_train_read[input_train_features]  
X_test = X_test_read[input_test_features]

X_train = X_train.reset_index(drop=True)
X_test = X_test.reset_index(drop=True)

In [14]:
# Min-max normalization
X_train = (X_train - X_train.min()) / (X_train.max() - X_train.min())
X_test = (X_test - X_test.min()) / (X_test.max() - X_test.min())

# The label column was normalized too, so restore it to 0/1
X_train.loc[X_train['label'] > 0, 'label'] = 1
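
The cell above scales `X_test` with its own minimum and maximum. A common alternative, sketched below and not what this notebook does, reuses the training statistics so that both sets share one scale, and leaves the label column out of the scaling. The frames and the column names `f1` and `f2` are hypothetical.

```python
import pandas as pd

# Hypothetical toy data; `label` is excluded from scaling.
train_df = pd.DataFrame({"f1": [0.0, 5.0, 10.0], "f2": [3.0, 3.0, 3.0], "label": [0, 1, 0]})
test_df = pd.DataFrame({"f1": [2.5, 20.0], "f2": [3.0, 4.0]})

feature_cols = ["f1", "f2"]
col_min = train_df[feature_cols].min()
# Replace a zero range with 1 so constant columns do not produce 0/0 = NaN.
col_range = (train_df[feature_cols].max() - col_min).replace(0, 1.0)

train_scaled = train_df.copy()
train_scaled[feature_cols] = (train_df[feature_cols] - col_min) / col_range
# The test set is scaled with the *training* min and range.
test_scaled = (test_df[feature_cols] - col_min) / col_range

print(train_scaled)
print(test_scaled)
```

Test values outside the training range land outside [0, 1], as `f1 = 20.0` does here. Whether to clip them is a separate design choice.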

### 2.5 Data Split

In [15]:
# Training data and validation data
def load_data(df,istrain):
    
    data = df

    feature_num = len(data.columns)

    # Reshape the raw data
    data = np.array(data)
    data = data.reshape([-1, feature_num])
    
    # Split ratio
    if istrain:
        ratio = 0.8
        offset = int(data.shape[0] * ratio)
        training_data = data[:offset]
        test_data = data[offset:]
    else:
        training_data = data
        test_data = None

    return training_data, test_data


# Load the processed data
training_data, dev_data = load_data(X_train,True)
print('train set done.')
test_data, none = load_data(X_test,False)
print('test set done.')

train set done.
test set done.


## 3 Building the Network

### 3.1 Network Definition

In [16]:
import paddle
import paddle.fluid as fluid
import paddle.fluid.dygraph as dygraph
from paddle.fluid.dygraph import Linear


class Regressor(fluid.dygraph.Layer):

    def __init__(self, name_scope):
        super(Regressor, self).__init__(name_scope)
        name_scope = self.full_name()

        self.fc1 = Linear(input_dim=len(input_test_features), output_dim=512, act='relu') 
        self.fc2 = Linear(input_dim=512, output_dim=128, act='relu') 
        self.fc3 = Linear(input_dim=128, output_dim=64, act='relu')
        self.fc4 = Linear(input_dim=64, output_dim=1, act='sigmoid')
    
    def forward(self, inputs):
        x = self.fc1(inputs)
        x = self.fc2(x)
        x = self.fc3(x)
        x = self.fc4(x)
        return x

### 3.2 Model Configuration

In [17]:
with fluid.dygraph.guard():

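    # Instantiate the network defined above
    model = Regressor("Regressor")

    # Switch the model to training mode
    model.train()

    # Define the optimizer; Adam is used here
    opt = fluid.optimizer.Adam(learning_rate=0.00005, parameter_list=model.parameters())

    
# Custom loss function for the class-imbalance problem
def wce_loss(pred, label, w=48, epsilon=1e-05): # w is the weight of the y=1 class; larger values put more emphasis on it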
    label = fluid.layers.clip(label, epsilon, 1-epsilon)
    pred = fluid.layers.clip(pred, epsilon, 1-epsilon)

    loss = -1 * (w * label * fluid.layers.log(pred) + (1 - label) * fluid.layers.log(1 - pred))
    loss = fluid.layers.reduce_mean(loss)
    return loss

W0801 11:21:51.053849    98 device_context.cc:252] Please NOTE: device: 0, CUDA Capability: 70, Driver API Version: 11.2, Runtime API Version: 9.0
W0801 11:21:51.058857    98 device_context.cc:260] device: 0, cuDNN Version: 7.6.
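
The effect of the class weight in `wce_loss` can be checked outside PaddlePaddle. The NumPy function below, `wce_loss_np`, is a hypothetical reference written for this illustration. It clips only the predictions, not the labels as the cell above does, and the input values are made up.

```python
import numpy as np

def wce_loss_np(pred, label, w=48.0, epsilon=1e-5):
    """NumPy reference of a binary cross-entropy with weight w on the positive class."""
    pred = np.clip(pred, epsilon, 1 - epsilon)  # keep log() finite
    loss = -(w * label * np.log(pred) + (1 - label) * np.log(1 - pred))
    return loss.mean()

pred = np.array([0.9, 0.1, 0.5])
label = np.array([1.0, 0.0, 1.0])

plain = wce_loss_np(pred, label, w=1.0)      # ordinary cross-entropy
weighted = wce_loss_np(pred, label, w=48.0)  # positives weigh 48 times more

# A confidently missed positive versus an equally confident false alarm.
missed_positive = wce_loss_np(np.array([0.1]), np.array([1.0]))
false_alarm = wce_loss_np(np.array([0.9]), np.array([0.0]))
print(plain, weighted, missed_positive / false_alarm)
```

With `w=48`, missing a buyer costs 48 times as much as an equally confident false alarm, which pushes the network toward predicting the rare positive class.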


### 3.3 Model Training

In [18]:
with dygraph.guard(fluid.CPUPlace()):

    EPOCH_NUM = 10   
    BATCH_SIZE = 7000
    
    # Outer loop over epochs
    for epoch_id in range(EPOCH_NUM):
        # Shuffle the training data at the start of every epoch
        np.random.shuffle(training_data)
        # Split the training data into mini-batches
        mini_batches = [training_data[k:k+BATCH_SIZE] for k in range(0, len(training_data), BATCH_SIZE)]
        
        # Inner loop over mini-batches
        for iter_id, mini_batch in enumerate(mini_batches):
            x = np.array(mini_batch[:, :-1]).astype('float32') # features of the current batch
            y = np.array(mini_batch[:, -1:]).astype('float32') # labels of the current batch

            # Convert the NumPy arrays to PaddlePaddle dygraph variables
            buyer_features = dygraph.to_variable(x)
            result = dygraph.to_variable(y)
            
            # Forward pass
            predicts = model(buyer_features)
            #loss = fluid.layers.log_loss(predicts, result)
            loss = wce_loss(predicts, result)
            avg_loss = fluid.layers.mean(loss)
            
            # Print training progress
            if iter_id % 20 == 0:
                print("epoch: {}, iter: {}, loss is: {}".format(epoch_id, iter_id, avg_loss.numpy()))
                # print(predicts)
     
            # Backward pass
            avg_loss.backward()
            # Minimize the loss and update the parameters
            opt.minimize(avg_loss)
            # Clear the gradients
            model.clear_gradients()
   

epoch: 0, iter: 0, loss is: [5.5129666]
epoch: 0, iter: 20, loss is: [4.6771607]
epoch: 0, iter: 40, loss is: [4.1181993]
epoch: 0, iter: 60, loss is: [3.6884623]
epoch: 0, iter: 80, loss is: [3.33423]
epoch: 0, iter: 100, loss is: [3.2013075]
epoch: 0, iter: 120, loss is: [2.9309716]
epoch: 0, iter: 140, loss is: [2.7775202]
epoch: 0, iter: 160, loss is: [2.703033]
epoch: 0, iter: 180, loss is: [2.6392713]
epoch: 0, iter: 200, loss is: [2.5794866]
epoch: 0, iter: 220, loss is: [2.5288477]
epoch: 0, iter: 240, loss is: [2.5392087]
epoch: 0, iter: 260, loss is: [2.5116413]
epoch: 0, iter: 280, loss is: [2.5610123]
epoch: 0, iter: 300, loss is: [2.4655511]
epoch: 1, iter: 0, loss is: [2.4796627]
epoch: 1, iter: 20, loss is: [2.4942076]
epoch: 1, iter: 40, loss is: [2.4632065]
epoch: 1, iter: 60, loss is: [2.455991]
epoch: 1, iter: 80, loss is: [2.458689]
epoch: 1, iter: 100, loss is: [2.4525628]
epoch: 1, iter: 120, loss is: [2.4503186]
epoch: 1, iter: 140, loss is: [2.4641035]
epoch: 1,

### 3.4 Saving the Model

In [19]:
model_path = './work/model'
with dygraph.guard():
    fluid.save_dygraph(model.state_dict(), model_path)
print("Model parameters saved successfully")

Model parameters saved successfully


## 4 Applying the Model

### 4.1 Prediction

In [20]:
model_path = './work/model'
with dygraph.guard():
    # Load the model parameters
    model_dict, _ = fluid.load_dygraph(model_path)
    model.load_dict(model_dict)
    model.eval()
    pre = test_data.astype('float32')
    pre = dygraph.to_variable(pre)
    results = model(pre)

### 4.2 Processing the Results

In [21]:
id_column = pd.read_csv('X_test.csv', usecols=['customer_id'])

df_preds = pd.DataFrame(
{    "customer_id": id_column.customer_id, 
    "pred": results.numpy().flatten()}
)

### 4.3 Generating the Submission File

In [22]:
# Read the submission template
sub = pd.read_csv('data/data19383/submission.csv')

# Merge in the predictions; customers without a prediction get 0
submission = pd.merge(sub, df_preds, on='customer_id', how='left')
submission.fillna(0,inplace=True)
submission = submission[['customer_id','pred']]
submission.rename(columns={'pred':'result'}, inplace=True)

# Convert the probability into a buy / no-buy label
def f(x):
    if x <= 0.35:   # adjustable threshold
        return 0
    else:
        return 1
submission['result'] = submission['result'].map(f)

# Save the result
submission.to_csv('submission.csv',index=False)
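
The threshold above is tuned by hand. One way to choose it more systematically, sketched here under assumptions, is to sweep candidate thresholds on held-out data and keep the one with the best score. The source does not state the competition metric, so F1 is an assumption. `f1_at_threshold` is a hypothetical helper and the probabilities and labels are made up.

```python
import numpy as np

def f1_at_threshold(probs, labels, threshold):
    """F1 score when predicting 1 for probabilities above `threshold`."""
    preds = (probs > threshold).astype(int)
    tp = int(((preds == 1) & (labels == 1)).sum())
    fp = int(((preds == 1) & (labels == 0)).sum())
    fn = int(((preds == 0) & (labels == 1)).sum())
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical validation probabilities and true labels.
probs = np.array([0.12, 0.31, 0.42, 0.47, 0.83, 0.22])
labels = np.array([0, 0, 1, 1, 1, 0])

candidates = np.linspace(0.05, 0.95, 19)  # 0.05, 0.10, ..., 0.95
scores = [f1_at_threshold(probs, labels, t) for t in candidates]
best = candidates[int(np.argmax(scores))]  # first threshold with the top score
print("best threshold:", round(float(best), 2), "F1:", max(scores))
```

In this notebook the held-out data could be `dev_data` from the data split step, which is created but not otherwise used.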