# MarTech Challenge: Click Anti-Fraud Prediction, Competition Approach and Implementation

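## 1 Background
Ad fraud is one of the major challenges facing digital marketing: click fraud wastes large amounts of advertisers' money and also makes click data misleading. This competition provides about 500,000 click records. Note that the data is simulated, the meaning of some features is hidden, and the data has been anonymized.
The task is to predict whether a user's click is a normal click or a fraudulent one. Click-fraud prediction applies to feed ad placement, banner ad placement, and the Baidu Union (百度网盟) platform, helping merchants identify click fraud and target genuine users.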

[Competition page](https://aistudio.baidu.com/aistudio/competition/detail/52)

This write-up covers three aspects: data analysis, data exploration & feature engineering, and modeling.

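## 2 Problem analysis

* Feature engineering: identify and process the important features; build new features from the existing ones.
* Count-feature modeling: from the business scenario, the number of clicks is an important signal in click anti-fraud prediction. Click fraud often shows up as repeated clicks, so building count features on top of the original features is a key part of this modeling work.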
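
A minimal sketch of the count-feature idea, assuming a pandas DataFrame that has the competition's `android_id` and `media_id` columns. The toy values and the `*_count` column names are illustrative and are not part of the original solution.

```python
import pandas as pd

# Toy click log; values are made up for illustration.
clicks = pd.DataFrame({
    "android_id": [101, 101, 101, 202, 303, 303],
    "media_id":   [1, 1, 2, 1, 3, 3],
})

# Clicks per device: repeated clicks from one device give a high count.
clicks["android_id_count"] = (
    clicks.groupby("android_id")["android_id"].transform("count")
)

# Clicks per (device, media) pair.
clicks["android_media_count"] = (
    clicks.groupby(["android_id", "media_id"])["android_id"].transform("count")
)

print(clicks)
```

`transform("count")` returns a value for every row, so the new columns line up with the original rows and can be passed to the model alongside the raw features.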


## 3 Overall approach (classical machine learning + Baidu deep learning model)

LightGBM + PALM language model


## 4 Solution details

### Loading the data

In [1]:
import pandas as pd
import warnings
warnings.filterwarnings('ignore')
from paddle.io import Dataset
from paddle.metric import Accuracy
import paddle.nn as nn
import paddle.tensor as tensor
from paddle.static import InputSpec
# Load the data
train = pd.read_csv('data/data97586/train.csv')
test = pd.read_csv('data/data97586/test1.csv')
train

Unnamed: 0.1,Unnamed: 0,android_id,apptype,carrier,dev_height,dev_ppi,dev_width,label,lan,media_id,...,os,osv,package,sid,timestamp,version,fea_hash,location,fea1_hash,cus_type
0,0,316361,1199,46000.0,0.0,0.0,0.0,1,,104,...,android,9,18,1438873,1.559893e+12,8,2135019403,0,2329670524,601
1,1,135939,893,0.0,0.0,0.0,0.0,1,,19,...,android,8.1,0,1185582,1.559994e+12,4,2782306428,1,2864801071,1000
2,2,399254,821,0.0,760.0,0.0,360.0,1,,559,...,android,8.1.0,0,1555716,1.559837e+12,0,1392806005,2,628911675,696
3,3,68983,1004,46000.0,2214.0,0.0,1080.0,0,,129,...,android,8.1.0,0,1093419,1.560042e+12,0,3562553457,3,1283809327,753
4,4,288999,1076,46000.0,2280.0,0.0,1080.0,1,zh-CN,64,...,android,8.0.0,0,1400089,1.559867e+12,5,2364522023,4,1510695983,582
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
499995,499995,392477,1028,46000.0,1920.0,3.0,1080.0,1,zh-CN,144,...,Android,7.1.2,25,1546078,1.559834e+12,7,861755946,79,140647032,373
499996,499996,346134,1001,0.0,1424.0,0.0,720.0,0,,29,...,android,8.1.0,0,1480612,1.559814e+12,3,1714444511,23,2745131047,525
499997,499997,499635,761,46000.0,1280.0,0.0,720.0,0,,54,...,android,6.0.1,9,1698442,1.559676e+12,0,3843262581,25,1326115882,810
499998,499998,239786,917,46001.0,960.0,0.0,540.0,0,zh_CN,109,...,android,5.1.1,0,1331155,1.559840e+12,0,1984296118,225,1446741112,772


### Field descriptions

![](https://ai-studio-static-online.cdn.bcebos.com/c5a7a8f10ce44593a6dd3310cda0352efea701c63a854ee395a2be52d0fec0ab)

**label indicates whether the click is fraudulent: 0 is normal, 1 is fraud**

### Initial feature selection

In [2]:
test = test.iloc[:, 1:]
train = train.iloc[:, 1:]
train

Unnamed: 0,android_id,apptype,carrier,dev_height,dev_ppi,dev_width,label,lan,media_id,ntt,os,osv,package,sid,timestamp,version,fea_hash,location,fea1_hash,cus_type
0,316361,1199,46000.0,0.0,0.0,0.0,1,,104,6.0,android,9,18,1438873,1.559893e+12,8,2135019403,0,2329670524,601
1,135939,893,0.0,0.0,0.0,0.0,1,,19,6.0,android,8.1,0,1185582,1.559994e+12,4,2782306428,1,2864801071,1000
2,399254,821,0.0,760.0,0.0,360.0,1,,559,0.0,android,8.1.0,0,1555716,1.559837e+12,0,1392806005,2,628911675,696
3,68983,1004,46000.0,2214.0,0.0,1080.0,0,,129,2.0,android,8.1.0,0,1093419,1.560042e+12,0,3562553457,3,1283809327,753
4,288999,1076,46000.0,2280.0,0.0,1080.0,1,zh-CN,64,2.0,android,8.0.0,0,1400089,1.559867e+12,5,2364522023,4,1510695983,582
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
499995,392477,1028,46000.0,1920.0,3.0,1080.0,1,zh-CN,144,6.0,Android,7.1.2,25,1546078,1.559834e+12,7,861755946,79,140647032,373
499996,346134,1001,0.0,1424.0,0.0,720.0,0,,29,2.0,android,8.1.0,0,1480612,1.559814e+12,3,1714444511,23,2745131047,525
499997,499635,761,46000.0,1280.0,0.0,720.0,0,,54,6.0,android,6.0.1,9,1698442,1.559676e+12,0,3843262581,25,1326115882,810
499998,239786,917,46001.0,960.0,0.0,540.0,0,zh_CN,109,2.0,android,5.1.1,0,1331155,1.559840e+12,0,1984296118,225,1446741112,772


### Data exploration & feature engineering

#### Building helpers to find key feature values

In [3]:
#train.info()
#train['lan'].value_counts()
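# Object-typed columns: lan, os, osv, version, fea_hash
# String columns need to be converted to numeric values (LabelEncoder)
object_cols = train.select_dtypes(include='object').columns

# Number of missing values per column
temp = train.isnull().sum()
# Columns with missing values: lan, osv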
temp[temp>0]

lan    183280
osv      6561
dtype: int64

#### Use feature types and meanings to decide which fields to inspect

In [4]:
# ['os', 'osv', 'lan', 'sid']
features = train.columns.tolist()
features.remove('label')
print(features)

['android_id', 'apptype', 'carrier', 'dev_height', 'dev_ppi', 'dev_width', 'lan', 'media_id', 'ntt', 'os', 'osv', 'package', 'sid', 'timestamp', 'version', 'fea_hash', 'location', 'fea1_hash', 'cus_type']


In [5]:
for feature in features:
    print(feature, train[feature].nunique())

android_id 362258
apptype 89
carrier 5
dev_height 798
dev_ppi 92
dev_width 346
lan 21
media_id 284
ntt 8
os 2
osv 154
package 1950
sid 500000
timestamp 500000
version 22
fea_hash 402980
location 332
fea1_hash 4959
cus_type 58
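
In the output above, `sid` and `timestamp` each have 500000 distinct values, so they behave like row identifiers rather than categorical features. A small sketch of how such columns could be flagged automatically; the helper name `id_like_columns` and the 0.99 threshold are assumptions made for illustration.

```python
import pandas as pd

def id_like_columns(df: pd.DataFrame, threshold: float = 0.99) -> list:
    """Return columns whose share of distinct values is at least `threshold`."""
    ratios = df.nunique() / len(df)
    return ratios[ratios >= threshold].index.tolist()

# Toy frame; values are made up for illustration.
toy = pd.DataFrame({
    "sid": [11, 12, 13, 14],
    "timestamp": [1.0, 2.0, 3.0, 4.0],
    "carrier": [46000, 46000, 0, 46001],
})
print(id_like_columns(toy))
```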


In [6]:
train['fea_hash'].map(lambda x: len(str(x))).value_counts()

10    378925
9     108904
8      11235
7        740
6         93
38        37
39        28
37        16
5         11
36         3
33         2
32         2
1          2
31         1
30         1
Name: fea_hash, dtype: int64

In [7]:
train['fea1_hash'].map(lambda x: len(str(x))).value_counts()

10    391669
9      99347
8       8977
7          6
5          1
Name: fea1_hash, dtype: int64

#### With the fields chosen, find their key feature values

In [8]:
# Clean the osv field
def osv_trans(x):
    x = str(x).replace('Android_', '').replace('Android ', '').replace('W', '')
    if str(x).find('.')>0:
        temp_index1 = x.find('.')
        if x.find(' ')>0:
            temp_index2 = x.find(' ')
        else:
            temp_index2 = len(x)

        if x.find('-')>0:
            temp_index2 = x.find('-')
            
        result = x[0:temp_index1] + '.' + x[temp_index1+1:temp_index2].replace('.', '')
        try:
            return float(result)
        except ValueError:
            print(x+'#########')
            return 0
    try:
        return float(x)
    except ValueError:
        print(x+'#########')
        return 0
#train['osv'] => LabelEncoder ?
# Fill missing values with the mode
train['osv'] = train['osv'].fillna('8.1.0')
# Clean the data
train['osv'] = train['osv'].apply(osv_trans)

# Fill missing values with the mode
test['osv'] = test['osv'].fillna('8.1.0')
# Clean the data
test['osv'] = test['osv'].apply(osv_trans)


f073b_changxiang_v01_b1b8_20180915#########
%E6%B1%9F%E7%81%B5OS+5.0#########
GIONEE_YNGA#########
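
The three lines above are `osv` values that `osv_trans` could not parse and therefore mapped to 0. As an alternative sketch (not the method used in this solution), a regular expression can pull out a `(major, minor)` pair and return `None` for unparseable strings. The function name `parse_osv` and the accepted `Android` prefix forms are assumptions.

```python
import re

# Optional "Android_" / "Android " prefix, then major and optional minor version.
_OSV_PATTERN = re.compile(r"(?:android[_ ]?)?(\d+)(?:\.(\d+))?", re.IGNORECASE)

def parse_osv(raw):
    """Return (major, minor) for a version-like string, or None if it does not parse."""
    match = _OSV_PATTERN.match(str(raw).strip())
    if match is None:
        return None
    major = int(match.group(1))
    minor = int(match.group(2)) if match.group(2) else 0
    return major, minor

for value in ["8.1.0", "Android_9", "GIONEE_YNGA"]:
    print(value, parse_osv(value))
```

Keeping major and minor as separate integers avoids mapping `8.1.0` and `8.10` to the same float, at the cost of one extra column.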


In [9]:
remove_list = ['os', 'lan', 'sid']
col = features
for i in remove_list:
    col.remove(i)
col

['android_id',
 'apptype',
 'carrier',
 'dev_height',
 'dev_ppi',
 'dev_width',
 'media_id',
 'ntt',
 'osv',
 'package',
 'timestamp',
 'version',
 'fea_hash',
 'location',
 'fea1_hash',
 'cus_type']

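#### Build new feature fields

In [10]:
# Feature selection (copy to avoid pandas chained-assignment warnings)
features = train[col].copy()
# Build the fea_hash_len feature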
features['fea_hash_len'] = features['fea_hash'].map(lambda x: len(str(x)))
features['fea1_hash_len'] = features['fea1_hash'].map(lambda x: len(str(x)))
features['fea_hash'] = features['fea_hash'].map(lambda x: 0 if len(str(x))>16 else int(x))
features['fea1_hash'] = features['fea1_hash'].map(lambda x: 0 if len(str(x))>16 else int(x))
features

Unnamed: 0,android_id,apptype,carrier,dev_height,dev_ppi,dev_width,media_id,ntt,osv,package,timestamp,version,fea_hash,location,fea1_hash,cus_type,fea_hash_len,fea1_hash_len
0,316361,1199,46000.0,0.0,0.0,0.0,104,6.0,9.00,18,1.559893e+12,8,2135019403,0,2329670524,601,10,10
1,135939,893,0.0,0.0,0.0,0.0,19,6.0,8.10,0,1.559994e+12,4,2782306428,1,2864801071,1000,10,10
2,399254,821,0.0,760.0,0.0,360.0,559,0.0,8.10,0,1.559837e+12,0,1392806005,2,628911675,696,10,9
3,68983,1004,46000.0,2214.0,0.0,1080.0,129,2.0,8.10,0,1.560042e+12,0,3562553457,3,1283809327,753,10,10
4,288999,1076,46000.0,2280.0,0.0,1080.0,64,2.0,8.00,0,1.559867e+12,5,2364522023,4,1510695983,582,10,10
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
499995,392477,1028,46000.0,1920.0,3.0,1080.0,144,6.0,7.12,25,1.559834e+12,7,861755946,79,140647032,373,9,9
499996,346134,1001,0.0,1424.0,0.0,720.0,29,2.0,8.10,0,1.559814e+12,3,1714444511,23,2745131047,525,10,10
499997,499635,761,46000.0,1280.0,0.0,720.0,54,6.0,6.01,9,1.559676e+12,0,3843262581,25,1326115882,810,10,10
499998,239786,917,46001.0,960.0,0.0,540.0,109,2.0,5.11,0,1.559840e+12,0,1984296118,225,1446741112,772,10,10


In [11]:
test_features = test[col].copy()
# Build the fea_hash_len feature
test_features['fea_hash_len'] = test_features['fea_hash'].map(lambda x: len(str(x)))
test_features['fea1_hash_len'] = test_features['fea1_hash'].map(lambda x: len(str(x)))
# Thinking: why map very large, very long fea_hash values to 0?
# If fea_hash is very long, map it to 0; otherwise keep its own value
test_features['fea_hash'] = test_features['fea_hash'].map(lambda x: 0 if len(str(x))>16 else int(x))
test_features['fea1_hash'] = test_features['fea1_hash'].map(lambda x: 0 if len(str(x))>16 else int(x))
test_features

Unnamed: 0,android_id,apptype,carrier,dev_height,dev_ppi,dev_width,media_id,ntt,osv,package,timestamp,version,fea_hash,location,fea1_hash,cus_type,fea_hash_len,fea1_hash_len
0,317625,1181,46000.0,2196.0,2.0,1080.0,639,2.0,8.10,188,1.559872e+12,7,1672223856,57,3872258917,658,10,10
1,435108,944,46003.0,2280.0,3.0,1080.0,704,6.0,8.10,221,1.559739e+12,3,3767901757,23,129322164,943,10,9
2,0,1106,46000.0,0.0,0.0,0.0,39,2.0,5.10,1562,1.559614e+12,0,454638703,30,4226678391,411,9,10
3,451504,761,46000.0,1344.0,0.0,720.0,54,2.0,7.11,9,1.559668e+12,0,1507622951,65,3355419572,848,10,10
4,0,1001,46000.0,665.0,0.0,320.0,29,5.0,8.10,4,1.559694e+12,0,4116351093,148,2644467751,411,10,10
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
149995,0,1001,46000.0,760.0,0.0,360.0,29,2.0,8.10,4,1.559957e+12,0,3162887451,126,2711576615,411,10,10
149996,0,1001,46000.0,780.0,0.0,360.0,29,2.0,9.00,4,1.559863e+12,0,97238959,322,2678022183,411,8,10
149997,0,1001,46000.0,780.0,0.0,360.0,29,5.0,8.10,4,1.560041e+12,0,1320118495,46,2610913319,411,10,10
149998,500925,1052,46000.0,854.0,240.0,480.0,249,6.0,4.42,0,1.559688e+12,2,1292986591,41,1898209327,430,10,10


In [12]:
#train['os'].value_counts()
# Train with LightGBM
import lightgbm as lgb
model = lgb.LGBMClassifier()
# Model training
model.fit(features.drop(['timestamp', 'version'], axis=1), train['label'])
result = model.predict(test_features.drop(['timestamp', 'version'], axis=1))
result



array([0, 0, 0, ..., 1, 1, 1])

#### Save the predictions for later voting

In [13]:
#features['version'].value_counts()
res = pd.DataFrame(test['sid'])
res['label'] = result
res.to_csv('baseline2.csv', index=False)
res

Unnamed: 0,sid,label
0,1440682,0
1,1606824,0
2,1774642,0
3,1742535,0
4,1689686,1
...,...,...
149995,1165373,1
149996,1444115,1
149997,1134378,1
149998,1700238,1


### PALM modeling

#### Data preprocessing

In [15]:
import pandas as pd
train= pd.read_csv('data/data97586/train.csv',encoding='utf-8')
test = pd.read_csv('data/data97586/test1.csv',encoding='utf-8')
sid = test.sid
features = train.drop(['Unnamed: 0','label','os','sid'],axis=1)
labels = train['label']
test = test[features.columns]

#### Convert timestamps to hours and truncate to integers

In [16]:
from datetime import datetime as dt 
def get_date(features):
    features['timestamp'] = features['timestamp'].apply(lambda x: dt.fromtimestamp(x/1000))
    start_time = features['timestamp'].min()
    features['time_diff'] = features['timestamp'] - start_time
    features['time_diff'] = features['time_diff'].dt.days*24 + features['time_diff'].dt.seconds/3600
    features.drop(['timestamp'],axis=1,inplace = True)
    return features

features = get_date(features)
test = get_date(test)

In [17]:
# Truncate to integers
features.time_diff = features.time_diff.astype(int)
test.time_diff = test.time_diff.astype(int)
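
The two cells above turn a millisecond timestamp into whole hours since the earliest record. A vectorized sketch of the same computation, assuming the input is a Series of millisecond timestamps; the helper name `hours_since_start` is hypothetical. For non-negative differences it should give the same integers as the `get_date` plus `astype(int)` steps.

```python
import pandas as pd

MS_PER_HOUR = 3_600_000

def hours_since_start(ts_ms: pd.Series) -> pd.Series:
    """Whole hours elapsed since the earliest millisecond timestamp in the series."""
    elapsed_ms = ts_ms - ts_ms.min()
    return (elapsed_ms // MS_PER_HOUR).astype(int)

# Toy timestamps: the start, one hour later, and 25 hours later.
ts = pd.Series([1.559676e12, 1.559676e12 + 3_600_000, 1.559676e12 + 90_000_000])
print(hours_since_start(ts).tolist())
```

As in the cells above, the result is relative to the minimum of whichever frame is passed in, so train and test each get their own starting point.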

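#### Missing-value handling
osv is filled with the mode. For missing values in lan, since lan is a string, the string 'nan' is filled in as a feature value, because missingness itself may carry information.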

In [18]:
features.loc[:,"osv"] = features.loc[:,"osv"].fillna(features.loc[:,"osv"].mode()[0])
features.loc[:,"lan"] = features.loc[:,"lan"].fillna('nan')

test.loc[:,"osv"] = test.loc[:,"osv"].fillna(test.loc[:,"osv"].mode()[0])
test.loc[:,"lan"] = test.loc[:,"lan"].fillna('nan') 

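#### Feature concatenation
The features are split into two groups, user information and media information. Each group is joined with spaces into a sentence, so that each feature acts like a word in that sentence. The click relationship between user and media is then modeled as a sentence-pair task similar to question answering in NLP: the user information goes into text_a and the media information into text_b.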

In [19]:
# Join functions

def sentence(row):
    return ' '.join([str(row[i]) for i in int_type])


def sentence1(row):
    return ' '.join([str(row[i]) for i in string_type]) 

In [20]:
# Split the columns into media information (string_type) and user information (int_type)
string_type =['package','apptype','version','android_id','media_id']
int_type = []
for i in features.columns:
    if i not in string_type:
        int_type.append(i)

In [21]:
# Build the PALM training and prediction data
train_palm = pd.DataFrame()
train_palm['label'] = train['label']
train_palm['text_a'] = features[int_type].apply(sentence,axis=1)
train_palm['text_b'] = features[string_type].apply(sentence1,axis=1)

test_palm = pd.DataFrame()
test_palm['label'] = test.apptype # label must not be empty; any placeholder value works
test_palm['text_a'] = test[int_type].apply(sentence,axis=1)
test_palm['text_b'] = test[string_type].apply(sentence1,axis=1)
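
To make the sentence-pair construction concrete, here is a small sketch on made-up rows. It uses a vectorized join instead of a row-wise `apply`; the column lists `user_cols` and `media_cols` are hypothetical stand-ins for `int_type` and `string_type`.

```python
import pandas as pd

# Toy rows; values are made up for illustration.
toy = pd.DataFrame({
    "carrier": [46000.0, 0.0],
    "osv": ["9", "8.1.0"],
    "package": [18, 0],
    "media_id": [104, 19],
})
user_cols = ["carrier", "osv"]        # hypothetical stand-in for int_type
media_cols = ["package", "media_id"]  # hypothetical stand-in for string_type

def join_columns(df: pd.DataFrame, cols: list) -> pd.Series:
    """Join the given columns of each row into one space-separated string."""
    return df[cols].astype(str).agg(" ".join, axis=1)

pairs = pd.DataFrame({
    "text_a": join_columns(toy, user_cols),
    "text_b": join_columns(toy, media_cols),
})
print(pairs)
```

Each feature value becomes one whitespace-separated token, which is what the tokenizer of the pretrained model then splits further.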

In [22]:
# Save the data PALM needs
train_palm.to_csv('data/data97586/train_palm.csv', sep='\t', index=False)
test_palm.to_csv('data/data97586/test_palm.csv', sep='\t', index=False)

#### Building and training the PALM model

In [23]:
!pip install paddlepalm 

Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip available: [0m[31;49m22.1.2[0m[39;49m -> [0m[32;49m22.3.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


In [24]:
# List and download the pretrained models
from paddlepalm import downloader
downloader.ls('pretrain')

Available pretrain items:
  => RoBERTa-zh-base
  => RoBERTa-zh-large
  => ERNIE-v2-en-base
  => ERNIE-v2-en-large
  => XLNet-cased-base
  => XLNet-cased-large
  => ERNIE-v1-zh-base
  => ERNIE-v1-zh-base-max-len-512
  => BERT-en-uncased-large-whole-word-masking
  => BERT-en-cased-large-whole-word-masking
  => BERT-en-uncased-base
  => BERT-en-uncased-large
  => BERT-en-cased-base
  => BERT-en-cased-large
  => BERT-multilingual-uncased-base
  => BERT-multilingual-cased-base
  => BERT-zh-base


In [25]:
# Download
downloader.download('pretrain', 'ERNIE-v2-en-base', './pretrain_models')

Downloading pretrain: ERNIE-v2-en-base from https://ernie.bj.bcebos.com/ERNIE_Base_en_stable-2.0.0.tar.gz...
>> Downloading... 100.0% done!
Extracting ERNIE_Base_en_stable-2.0.0.tar.gz... done!
done!


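#### Set the PALM parameters and start training
The parameters here follow the PaddlePALM example [Quora question similarity matching](https://aistudio.baidu.com/aistudio/projectdetail/402733?channelType=0&channel=0) and the [first-place solution from April](https://aistudio.baidu.com/aistudio/projectdetail/2195561?contributionType=1), with changes to the learning rate, number of epochs, dropout rate, and so on. Feel free to tune them yourself.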


In [26]:
import paddle
import json
import paddlepalm


max_seqlen = 128
batch_size = 32
num_epochs = 4
lr = 1e-6
weight_decay = 0.0001
num_classes = 2
random_seed = 1
dropout_prob = 0.002
save_path = './outputs/'
save_type = 'ckpt'
pred_model_path = './outputs/ckpt.step15000'
print_steps = 1000
pred_output = './outputs/predict/'
pre_params =  '/home/aistudio/pretrain_models/pretrain/ERNIE-v2-en-base/params'
task_name = 'Quora Question Pairs matching'
vocab_path = '/home/aistudio/pretrain_models/pretrain/ERNIE-v2-en-base/vocab.txt'
train_file = '/home/aistudio/data/data97586/train_palm.csv'
predict_file = '/home/aistudio/data/data97586/test_palm.csv'
config = json.load(open('/home/aistudio/pretrain_models/pretrain/ERNIE-v2-en-base/ernie_config.json'))
input_dim = config['hidden_size']
paddle.enable_static()

In [27]:
match_reader = paddlepalm.reader.MatchReader(vocab_path, max_seqlen, seed=random_seed)
# step 1-2: load the training data
match_reader.load_data(train_file, file_format='tsv', num_epochs=num_epochs, batch_size=batch_size)
# step 2: create a backbone of the model to extract text features
ernie = paddlepalm.backbone.ERNIE.from_config(config)
# step 3: register the backbone in reader
match_reader.register_with(ernie)
# step 4: create the task output head
match_head = paddlepalm.head.Match(num_classes, input_dim, dropout_prob)
# step 5-1: create a task trainer
trainer = paddlepalm.Trainer(task_name)
# step 5-2: build forward graph with backbone and task head

loss_var = trainer.build_forward(ernie, match_head)
# step 6-1*: use warmup
n_steps = match_reader.num_examples * num_epochs // batch_size
warmup_steps = int(0.1 * n_steps)
sched = paddlepalm.lr_sched.TriangularSchedualer(warmup_steps, n_steps)
# step 6-2: create a optimizer
adam = paddlepalm.optimizer.Adam(loss_var, lr, sched)
# step 6-3: build backward
trainer.build_backward(optimizer=adam, weight_decay=weight_decay)
# step 7: fit prepared reader and data
trainer.fit_reader(match_reader)
# step 8-1*: load pretrained parameters
trainer.load_pretrain(pre_params, False)
# step 8-2*: set saver to save model
save_steps = 15000
trainer.set_saver(save_path=save_path, save_steps=save_steps, save_type=save_type)
# step 8-3: start training
trainer.train(print_steps=print_steps)
# Prediction code follows, assuming the trained model was saved at ./outputs/training_pred_model:
print('prepare to predict...')

W0103 21:21:28.141738 14786 device_context.cc:404] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 11.2, Runtime API Version: 10.1
W0103 21:21:28.147297 14786 device_context.cc:422] device: 0, cuDNN Version: 7.6.


ok!
Loading pretraining parameters from /home/aistudio/pretrain_models/pretrain/ERNIE-v2-en-base/params...

step 1000/15625 (epoch 0), loss: 0.625, speed: 7.95 steps/s
step 2000/15625 (epoch 0), loss: 0.517, speed: 7.90 steps/s
step 3000/15625 (epoch 0), loss: 0.299, speed: 7.94 steps/s
step 4000/15625 (epoch 0), loss: 0.386, speed: 7.91 steps/s
step 5000/15625 (epoch 0), loss: 0.281, speed: 7.88 steps/s
step 6000/15625 (epoch 0), loss: 0.367, speed: 7.92 steps/s
step 7000/15625 (epoch 0), loss: 0.496, speed: 7.84 steps/s
step 8000/15625 (epoch 0), loss: 0.427, speed: 7.91 steps/s
step 9000/15625 (epoch 0), loss: 0.353, speed: 7.92 steps/s
step 10000/15625 (epoch 0), loss: 0.225, speed: 7.86 steps/s
step 11000/15625 (epoch 0), loss: 0.274, speed: 7.90 steps/s
step 12000/15625 (epoch 0), loss: 0.158, speed: 7.89 steps/s
step 13000/15625 (epoch 0), loss: 0.298, speed: 7.92 steps/s
step 14000/15625 (epoch 0), loss: 0.189, speed: 7.91 steps/s
checkpoint has been saved at ./outputs/ckpt.ste
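
The step counts in the log follow from the settings above. A short arithmetic sketch, assuming one training example per row (500,000 rows, matching the `sid` count). The `triangular_factor` function is a hypothetical illustration of a warmup-then-decay schedule and is not taken from the PaddlePALM source.

```python
num_examples = 500_000   # assumption: one example per training row
num_epochs = 4
batch_size = 32
save_steps = 15_000

steps_per_epoch = num_examples // batch_size       # matches "x/15625" in the log
n_steps = num_examples * num_epochs // batch_size  # total training steps
warmup_steps = int(0.1 * n_steps)                  # first 10% of steps are warmup

# Steps at which a checkpoint is written when saving every `save_steps` steps.
checkpoints = list(range(save_steps, n_steps + 1, save_steps))

def triangular_factor(step: int, warmup_steps: int, n_steps: int) -> float:
    """Hypothetical LR multiplier: linear warmup to 1.0, then linear decay to 0.0."""
    if step < warmup_steps:
        return step / warmup_steps
    return max(0.0, (n_steps - step) / (n_steps - warmup_steps))

print(steps_per_epoch, n_steps, warmup_steps, checkpoints)
```

With these numbers the last checkpoint written is at step 60000, which is the one loaded for prediction in the next cell.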

In [28]:
# After validation, the parameters trained from the pretrained model up to step 60000 predict better
vocab_path = '/home/aistudio/pretrain_models/pretrain/ERNIE-v2-en-base/vocab.txt'

predict_match_reader = paddlepalm.reader.MatchReader(vocab_path, max_seqlen, seed=random_seed, phase='predict')
# step 1-2: load the prediction data
predict_match_reader.load_data(predict_file, batch_size)
# step 2: create a backbone of the model to extract text features
pred_ernie = paddlepalm.backbone.ERNIE.from_config(config, phase='predict')
# step 3: register the backbone in reader
predict_match_reader.register_with(pred_ernie)
# step 4: create the task output head
match_pred_head = paddlepalm.head.Match(num_classes, input_dim, phase='predict')
predicter=paddlepalm.Trainer(task_name)
# step 5: build forward graph with backbone and task head
predicter.build_predict_forward(pred_ernie, match_pred_head)

#pred_model_path ='./outputs/ckpt.step480000'
pred_model_path='outputs/ckpt.step60000'
# step 6: load pretrained model
pred_ckpt = predicter.load_ckpt(pred_model_path)
# step 7: fit prepared reader and data
predicter.fit_reader(predict_match_reader, phase='predict')

# step 8: predict
print('predicting..')
predicter.predict(print_steps=print_steps, output_dir=pred_output)

Loading pretraining parameters from outputs/ckpt.step60000...

ok!
predicting..
batch 1000/4687, speed: 20.76 steps/s
batch 2000/4687, speed: 21.16 steps/s
batch 3000/4687, speed: 20.79 steps/s
batch 4000/4687, speed: 20.72 steps/s
Predictions saved at ./outputs/predict/predictions.json


  'label': 1,  'logits': [-0.2326768934726715, 0.37437596917152405],  'probs': [0.35273176431655884, 0.6472682356834412]}, {'index': 751,  'label': 1,  'logits': [-0.18504130840301514, 0.06233903765678406],  'probs': [0.4384683668613434, 0.5615316033363342]}, {'index': 752,  'label': 0,  'logits': [2.3856663703918457, -1.4404159784317017],  'probs': [0.978670060634613, 0.021329952403903008]}, {'index': 753,  'label': 0,  'logits': [0.9735164046287537, -0.9860730767250061],  'probs': [0.8764885663986206, 0.12351148575544357]}, {'index': 754,  'label': 1,  'logits': [-1.3987982273101807, 1.2861785888671875],  'probs': [0.06386568397283554, 0.9361343383789062]}, {'index': 755,  'label': 0,  'logits': [0.20144575834274292, 0.1881381720304489],  'probs': [0.5033268332481384, 0.4966731369495392]}, {'index': 756,  'label': 1,  'logits': [-1.6628142595291138, 1.4502067565917969],  'probs': [0.042573343962430954, 0.9574267268180847]}, {'index': 757,  'label': 1,  'l

#### Read the PALM predictions


In [33]:
palm_proba = pd.read_json('./outputs/predict/predictions.json',lines=True)

### Model fusion

In [30]:
## Read the PALM-predicted probability that a click is fraudulent
palm_res = palm_proba.probs.apply(lambda x: x[1])

palm_res=palm_res.apply(lambda x:1 if x>=0.5 else 0)

In [31]:
lgb_sub=pd.read_csv("baseline2.csv")
res=(lgb_sub.label+palm_res).apply(lambda x:1 if x>=1 else 0)

In [32]:
## Save the final result
a = pd.DataFrame(sid)
a['label']= res
a.to_csv('composition.csv',index = False)
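
The fusion rule above flags a click as fraud when either model flags it. A toy sketch of that rule, with made-up labels and probabilities; the stricter "both must agree" variant is an assumption added for comparison and is not used in this solution.

```python
import pandas as pd

lgb_label = pd.Series([0, 1, 0, 1])           # made-up LightGBM labels
palm_prob = pd.Series([0.2, 0.4, 0.7, 0.9])   # made-up PALM fraud probabilities

# Rule used above: threshold PALM at 0.5, then flag if either model flags the click.
palm_label = (palm_prob >= 0.5).astype(int)
either = ((lgb_label + palm_label) >= 1).astype(int)

# Stricter variant (assumption, for comparison): both models must agree on fraud.
both = ((lgb_label + palm_label) == 2).astype(int)

print(either.tolist(), both.tolist())
```

Adding two Series aligns them on their index, so both prediction sets need the same row order and index before they are combined.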

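## 5 Takeaways & acknowledgements

Feature engineering + lgb model iteration reaches a score of 88.112 at 60000 rounds, and fusing in PALM raises it to 88.18. Some of the data cleaning and feature transformations follow a [trick](https://aistudio.baidu.com/aistudio/projectdetail/461026?channelType=0&channel=0) published in another project, and the PALM part follows [this project](https://aistudio.baidu.com/aistudio/projectdetail/2513951?channelType=0&channel=0). Thanks to everyone who is committed to open source, and thanks to Baidu PaddlePaddle for the competition opportunity and compute support! Discussion is welcome.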