# 1、竞赛背景
如何有效利用海量的广告数据和用户数据去预测用户的广告点击概率，是大数据应用在精准营销中的关键问题。   
本次大赛提供了讯飞AI营销云的海量广告投放数据，参赛选手通过人工智能技术构建预测模型预估用户的广告点击概率，即给定广告点击相关的广告、媒体、用户、上下文内容等信息的条件下预测广告点击概率。

# 2、评分标准
## 评分算法
other
## 评分标准
![](http://third.datacastle.cn/pkbigdata/master.other.img/6c1e634c-897a-4e84-bfac-2f5c4d440009.png)

## AB榜的划分方式和比例：
【1】评分采用AB榜形式。排行榜显示A榜成绩，竞赛结束后2小时切换成B榜单。B榜成绩以选定的两次提交或者默认的最后两次提交的最高分为准，最终比赛成绩以B榜单为准。   
【2】此题目的AB榜是随机划分，A榜为随机抽样测试集50%数据，B榜为另外50%。

# 3、任务
讯飞AI营销广告点击率预估，预测广告被点击的概率。

# 4、数据
提供下载的数据集包括两个部分：  
初赛：
1. round1_iflyad_train.txt 训练集，每一行数据为一个样本，可分为5类数据，包含基础广告投放数据、广告素材信息、媒体信息、用户信息和上下文信息，共1001650条数据。其中‘click’字段为要预测的标签，其它34个字段为特征字段。
2. round1_iflyad_test_feature.txt 测试集，共40024条数据，与训练集文件相比，测试集文件无‘click’字段，其它字段同训练集。

复赛：
1. round2_iflyad_train.txt 训练集，每一行数据为一个样本，可分为5类数据，包含基础广告投放数据、广告素材信息、媒体信息、用户信息和上下文信息，共1998350条数据。其中‘click’字段为要预测的标签，其它34个字段为特征字段。 
2. round2_iflyad_test_feature.txt 测试集，共80276条数据，与训练集文件相比，测试集文件无‘click’字段，其它字段同训练集。

出于数据安全保证的考虑，所有数据均为脱敏处理后的数据。数据集提供了若干天的样本，最后一天数据构成了测试集，其余日期的数据作为训练数据。

注意：此次比赛分为初赛和复赛两个阶段，两个阶段的区别是所提供样本的量级有所不同，其他的设置均相同。

基础广告投放数据（1+1）、广告素材信息（16）、媒体信息（5）、用户信息（1）、上下文信息（11）

# 1、导入依赖包

In [2]:
import numpy as np            # ndarray数组
import pandas as pd           # DataFrame表格
from sklearn import datasets  # 自带的数据集
from sklearn.model_selection import train_test_split # 随机划分为训练子集和测试子集
from sklearn.model_selection import cross_val_score  # 模型评价：训练误差和测试误差
from sklearn.feature_selection import SelectFromModel# 特征选择(三种方法)
from sklearn.metrics import roc_auc_score            # 评价指标
from sklearn.metrics import f1_score 
from sklearn.cross_validation import StratifiedKFold # K折交叉验证

from sklearn.neighbors import KNeighborsClassifier   # KNN
from sklearn.linear_model import LogisticRegression  # 逻辑斯特回归LR
from sklearn.tree import DecisionTreeClassifier      # DT
from sklearn.ensemble import RandomForestClassifier  # RFC随机森林分类
from sklearn.ensemble import RandomForestRegressor   # RFR随机森林回归
from sklearn.ensemble import ExtraTreesClassifier    # ETC极端随机树分类
from sklearn.ensemble import ExtraTreesRegressor     # ETR极端随机树回归
from sklearn.naive_bayes import GaussianNB           # GNB朴素贝叶斯
from sklearn import svm                              # SVM支持向量机
import xgboost as xgb                                # XGB
import lightgbm as lgb                               # LGB

import matplotlib as mpl
import matplotlib.pyplot as plt      # 作图
import seaborn as sns                # 作图

from IPython.display import display  # 输出语句
plt.style.use("fivethirtyeight")
sns.set_style({'font.sans-serif':['simhei','Arial']})

import warnings                      # 消除警告
warnings.filterwarnings("ignore")
import os             # 系统模块
%matplotlib inline  

# 深度学习框架
import mxnet
import tensorflow as tf

# 检查Python版本
from sys import version_info
if version_info.major != 3:
    raise Exception('请使用 Python 3 来完成此项目')

# 2、加载数据集

In [3]:
# 加载数据
train = pd.read_table('./data/round1_iflyad_train.txt')
test = pd.read_table('./data/round1_iflyad_test_feature.txt')

# 合并训练集，验证集
data = pd.concat([train,test], axis=0, ignore_index=True)

data.head()

Unnamed: 0,adid,advert_id,advert_industry_inner,advert_name,app_cate_id,app_id,app_paid,campaign_id,carrier,city,...,make,model,nnt,orderid,os,os_name,osv,province,time,user_tags
0,1560128,230000063,102400_102401,B4734117F35EE97F,107.0,2089229.0,False,1000023,1,137103102105100,...,HUAWEI,HUAWEI-CAZ-AL10,1,3010798,2,android,7.0.0,137103102100100,2190219034,
1,1488859,230000063,102400_102401,B4734117F35EE97F,108.0,2070079.0,False,1000023,3,137105101100100,...,Xiaomi,Redmi Note 4,1,2311397,2,android,6.0,137105101100100,2190221070,"2100191,2100078,3001825,,3001781,3001791,30017..."
2,1537089,230000065,101700_101704,E257895F74792E81,100.0,2089397.0,False,1000021,3,137103104111100,...,OPPO,OPPO+R11s,1,3008491,2,android,7.1.1,137103104100100,2190219793,
3,1577884,230001710,101900_101902,0A421D7B11EABFC5,100.0,2071234.0,False,1003544,0,137103102113100,...,,OPPO A57,1,3011304,2,android,6.0.1,137103102100100,2190221704,"2100098,gd_2100000,3001791,3001795,3002193,300..."
4,1432367,230000063,102400_102401,B4734117F35EE97F,103.0,1030051.0,False,1000023,1,137103102109100,...,Apple,iPhone 7,3,2304491,1,ios,11.1.1,137103102100100,2190220024,


In [4]:
train.shape, test.shape, data.shape

((1001650, 35), (40024, 34), (1041674, 35))

In [6]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1041674 entries, 0 to 1041673
Data columns (total 35 columns):
adid                     1041674 non-null int64
advert_id                1041674 non-null int64
advert_industry_inner    1041674 non-null object
advert_name              1041674 non-null object
app_cate_id              1039376 non-null float64
app_id                   1039376 non-null float64
app_paid                 1041674 non-null bool
campaign_id              1041674 non-null int64
carrier                  1041674 non-null int64
city                     1041674 non-null int64
click                    1001650 non-null float64
creative_has_deeplink    1041674 non-null bool
creative_height          1041674 non-null int64
creative_id              1041674 non-null int64
creative_is_download     1041674 non-null bool
creative_is_js           1041674 non-null bool
creative_is_jump         1041674 non-null bool
creative_is_voicead      1041674 non-null bool
creative_tp_dnf      

In [7]:
data['f_channel'].count()

79777

# 3、初探
纠结，大佬说缺失值也是一种特征，所以缺失值使用-1填充，最后训练的时候使用 原始特征+新添的特征
## 缺失值

In [26]:
cnt = data.shape[0]   # 总的样本数量
print('训练样本', train.shape[0])
print('测试样本', test.shape[0])
print('总样本', cnt)

for feat in data.columns:
    tmp = cnt - data[feat].count()
    if tmp > 0:
        print(feat, tmp, '\t%.3f'%(tmp / cnt * 100), '%')

训练样本 1001650
测试样本 40024
总样本 1041674
app_cate_id 2298 	0.221 %
app_id 2298 	0.221 %
click 40024 	3.842 %
f_channel 961897 	92.341 %
make 103043 	9.892 %
model 7793 	0.748 %
osv 8211 	0.788 %
user_tags 323002 	31.008 %


- app_cate_id app分类 2298  
- app_id 媒体id 2298  
- f_channel 一级频道 961897  （缺失太多）  
- make 品牌 103043  
- model 机型 7793 
- osv 操作系统 8211
- user_tags 用户标签信息 323002

In [42]:
pd.unique(data.dtypes)

array([dtype('int64'), dtype('O'), dtype('float64'), dtype('bool')],
      dtype=object)

## 去重

In [84]:
data.drop_duplicates(subset=None, keep='first', inplace=False)

Unnamed: 0,adid,advert_id,advert_name,app_cate_id,app_id,app_paid,campaign_id,carrier,city,click,creative_has_deeplink,creative_height,creative_id,creative_is_download,creative_is_js,creative_is_jump,creative_is_voicead,creative_tp_dnf,creative_type,creative_width,devtype,inner_slot_id,instance_id,nnt,orderid,os_name,province,time,user_tags,advert_industry_inner_0,advert_industry_inner_1,device,system,county,level
0,1560128,230000063,B4734117F35EE97F,107.0,2089229.0,0,1000023,1,310210,0.0,0,720,2338420,0,0,1,0,8390267734059046014,8,1280,2,xf_275C061483984E075832A4373BDDF27B,86294719979897807,1,3010798,android,3102,2190219034,,102400,102401,HUAWEI_HUAWEI-CAZ-AL10,2_7.0.0,31,
1,1488859,230000063,B4734117F35EE97F,108.0,2070079.0,0,1000023,3,510110,0.0,0,640,2310417,0,0,1,0,8390208550469153745,8,960,2,xf_D84DAB691E2E08C5B80D2FF5135F886E,2699289844928136052,1,2311397,android,5101,2190221070,"2100191,2100078,3001825,,3001781,3001791,30017...",102400,102401,Xiaomi_Redmi Note 4,2_6.0,51,
2,1537089,230000065,E257895F74792E81,100.0,2089397.0,0,1000021,3,310411,0.0,0,640,2337017,0,0,1,0,8390430283595430291,8,960,2,xf_7F9FF3BEA11FE5B3AE6332EFBBD59496,3117527168445845752,1,3008491,android,3104,2190219793,,101700,101704,OPPO_OPPO+R11s,2_7.1.1,31,
3,1577884,230001710,0A421D7B11EABFC5,100.0,2071234.0,0,1003544,0,310211,0.0,0,720,2342152,0,0,1,0,8390229093704413749,3,1280,2,iqy_1000000000381-1-15-15,3398484891050993371,1,3011304,android,3102,2190221704,"2100098,gd_2100000,3001791,3001795,3002193,300...",101900,101902,nan_OPPO A57,2_6.0.1,31,
4,1432367,230000063,B4734117F35EE97F,103.0,1030051.0,0,1000023,1,310210,0.0,0,640,2305409,0,0,1,0,8390208550469153745,8,960,2,xf_6C4DCB36DBE7EB12CE55EDF319FF8D93,2035477570591176488,3,2304491,ios,3102,2190220024,,102400,102401,Apple_iPhone 7,1_11.1.1,31,
5,1559158,230000833,862FF2E9B0AD4C14,103.0,2071267.0,0,1002835,1,410410,0.0,0,480,2340236,0,0,1,0,8390266143715132170,5,320,2,xf_F6EC79C9FDC4018896C44BEC4EF1C676,2065527640347419040,1,3005529,android,4104,2190221228,"3002265,3002613,3002993,3003055,3003147,300331...",4,100206,"Xiaomi,MI 6,sagit_MI 6",2_8.0.0,41,
6,1580496,230001696,310405E93895BD58,100.0,2071133.0,0,1003456,1,310110,0.0,0,480,2342232,0,0,1,0,8390272894417697957,5,320,2,xf_E04181AD642C3716CADDFA943CAD1750,9137625226649553828,1,3012112,ios,3101,2190221448,,101000,101002,Apple_iPhone 7 Plus,1_11.4.1,31,
7,1560128,230000063,B4734117F35EE97F,107.0,2089229.0,0,1000023,1,110210,0.0,0,720,2338420,0,0,1,0,8390267734059046014,8,1280,2,xf_275C061483984E075832A4373BDDF27B,8256661937587962948,1,3010798,android,1102,2190221002,"gd_2100000,2100143,ag_2100040,3001837,3002907,...",102400,102401,HUAWEI_HUAWEI-NXT-AL10,2_7.0.0,11,
8,1577884,230001710,0A421D7B11EABFC5,100.0,2071234.0,0,1003544,0,510411,0.0,0,720,2342152,0,0,1,0,8390229093704413749,3,1280,2,iqy_1000000000381-1-15-30,2270018983665706006,1,3011304,android,5104,2190221002,",3004484,3004430,3004434,3004490,3004468,30045...",101900,101902,nan_OPPO R9s,2_6.0.1,51,
9,1579122,230000063,B4734117F35EE97F,100.0,2089397.0,0,1000023,1,410111,0.0,0,640,2334795,0,0,1,0,8390430283595430291,8,960,2,xf_7F9FF3BEA11FE5B3AE6332EFBBD59496,5028130673845168373,1,3012120,android,4101,2190220484,"2100197,2100150,2100094,2100237,2100132,ag_210...",102400,102401,vivo_vivo+Y13iL,2_4.4.4,41,


## 删除f_channel 一级频道特征

In [45]:
data.drop('f_channel', axis=1, inplace=True)

## 提取字符类型特征

In [47]:
from pandas.api.types import is_string_dtype

for feat in data.columns.tolist():
    if is_string_dtype(data[feat]):
        print(feat)

advert_industry_inner
advert_name
inner_slot_id
make
model
os_name
osv
user_tags


## 特征工程
补充说明：  
1. advert_industry_inner字段数据样例为102400_102401，“102400(前者)”表示广告主一级行业标签id，“102401(后者)”表示广告主二级行业id，如“教育_培训”
2. time字段脱敏后为有序排列，且时间间隔和与真实时间对应。

1、advert_industry_inner拆分开  
2、advert_name都是一些B4734117F35EE97F字符串，可以考虑`删除`或者进行pd.get_dummies处理  
3、inner_slot_id媒体广告位也是字符串xf_275C061483984E075832A4373BDDF27B，可以考虑取‘_’前的字段  
4、make model品牌机型可以考虑给机型加上品牌前缀， os_name osv操作系统和版本同理  
5、user_tags用户标签没有什么可以提取的

In [49]:
data['advert_industry_inner_0'] = data['advert_industry_inner'].apply(lambda x: x.split('_')[0])
data['advert_industry_inner_1'] = data['advert_industry_inner'].apply(lambda x: x.split('_')[1])
data['inner_slot_id_0'] = data['inner_slot_id'].apply(lambda x:x.split('_')[0])
data['model'] = data['make'].astype(str).values + '_' + data['model'].astype(str).values
data['osv'] = data['os'].astype(str).values + '_' + data['osv'].astype(str).values

data.drop(['advert_industry_inner', 'advert_name', 'inner_slot_id', 'make', 'model', 'os', 'osv'], axis=1, inplace=True)

In [50]:
data.shape

(1041674, 33)

## 提取bool布尔类型特征
转换为int类型

In [83]:
from pandas.api.types import is_bool_dtype

for feat in data.columns.tolist():
    if is_bool_dtype(data[feat]):
        print(feat)
        data[feat] = data[feat].astype(int)

app_paid
creative_has_deeplink
creative_is_download
creative_is_js
creative_is_jump
creative_is_voicead


## 时间time
补充说明：  
1. advert_industry_inner字段数据样例为102400_102401，“102400(前者)”表示广告主一级行业标签id，“102401(后者)”表示广告主二级行业id，如“教育_培训”
2. time字段脱敏后为有序排列，且时间间隔和与真实时间对应。

数据集提供了若干天的样本，最后一天数据构成了测试集，其余日期的数据作为训练数据。

所以年份就不需要了。

In [40]:
import time
day = data['time'].apply(lambda x: int(time.strftime("%d", time.localtime(x))))
hour = data['time'].apply(lambda x: int(time.strftime("%H", time.localtime(x))))
day.head()

0    29
1    29
2    29
3    29
4    29
Name: time, dtype: int64

In [39]:
time.localtime(2190219034)

time.struct_time(tm_year=2039, tm_mon=5, tm_mday=29, tm_hour=2, tm_min=10, tm_sec=34, tm_wday=6, tm_yday=149, tm_isdst=0)

## city特征

In [77]:
head = []
tail = []
for city in data['city']:
    h = str(city)[0:5]
    t = str(city)[-3:]
    head.append(h)
    tail.append(t)
print(set(head), set(tail))

{'0', '13710'} {'100', '0'}


结论：
1. 前5个数字都是一样的，尾部3个数字都是一样的，无效。
2. city中数字为0。
3. 总的长度是15。
4. 城市：国家、省份、城市、等级
5. 基于4，剩下7个数字，等级占1个，其余占2个，即5-7为国家、5-9为省份、5-11为城市、11-12为等级

In [74]:
data['city'][117]

0

In [78]:
data['city'][0:5]

0    137103102105100
1    137105101100100
2    137103104111100
3    137103102113100
4    137103102109100
Name: city, dtype: int64

In [79]:
data['county'] = data['city'].apply(lambda x: str(x)[5:7])
data['province'] = data['city'].apply(lambda x: str(x)[5:9])
data['city'] = data['city'].apply(lambda x: str(x)[5:11])
data['level'] = data['city'].apply(lambda x: str(x)[11:12])

In [48]:
pd.set_option('display.max_columns', None)    # 强制显示所有属性值
data.head()

Unnamed: 0,adid,advert_id,advert_industry_inner,advert_name,app_cate_id,app_id,app_paid,campaign_id,carrier,city,click,creative_has_deeplink,creative_height,creative_id,creative_is_download,creative_is_js,creative_is_jump,creative_is_voicead,creative_tp_dnf,creative_type,creative_width,devtype,inner_slot_id,instance_id,make,model,nnt,orderid,os,os_name,osv,province,time,user_tags
0,1560128,230000063,102400_102401,B4734117F35EE97F,107.0,2089229.0,False,1000023,1,137103102105100,0.0,False,720,2338420,False,False,True,False,8390267734059046014,8,1280,2,xf_275C061483984E075832A4373BDDF27B,86294719979897807,HUAWEI,HUAWEI-CAZ-AL10,1,3010798,2,android,7.0.0,137103102100100,2190219034,
1,1488859,230000063,102400_102401,B4734117F35EE97F,108.0,2070079.0,False,1000023,3,137105101100100,0.0,False,640,2310417,False,False,True,False,8390208550469153745,8,960,2,xf_D84DAB691E2E08C5B80D2FF5135F886E,2699289844928136052,Xiaomi,Redmi Note 4,1,2311397,2,android,6.0,137105101100100,2190221070,"2100191,2100078,3001825,,3001781,3001791,30017..."
2,1537089,230000065,101700_101704,E257895F74792E81,100.0,2089397.0,False,1000021,3,137103104111100,0.0,False,640,2337017,False,False,True,False,8390430283595430291,8,960,2,xf_7F9FF3BEA11FE5B3AE6332EFBBD59496,3117527168445845752,OPPO,OPPO+R11s,1,3008491,2,android,7.1.1,137103104100100,2190219793,
3,1577884,230001710,101900_101902,0A421D7B11EABFC5,100.0,2071234.0,False,1003544,0,137103102113100,0.0,False,720,2342152,False,False,True,False,8390229093704413749,3,1280,2,iqy_1000000000381-1-15-15,3398484891050993371,,OPPO A57,1,3011304,2,android,6.0.1,137103102100100,2190221704,"2100098,gd_2100000,3001791,3001795,3002193,300..."
4,1432367,230000063,102400_102401,B4734117F35EE97F,103.0,1030051.0,False,1000023,1,137103102109100,0.0,False,640,2305409,False,False,True,False,8390208550469153745,8,960,2,xf_6C4DCB36DBE7EB12CE55EDF319FF8D93,2035477570591176488,Apple,iPhone 7,3,2304491,1,ios,11.1.1,137103102100100,2190220024,


In [127]:
# 新添的特征
addFeature = ['advert_industry_inner_0', 'inner_slot_id_0',  'county', 'level']
# 特征选择
advertFeature = ['adid', 'advert_id', 'orderid',  'advert_industry_inner', 'advert_name', 'campaign_id', 'creative_id', 'creative_type', 'creative_tp_dnf', 'creative_has_deeplink', 'creative_is_jump', 'creative_is_download']  # 没有后4个特征
mediaFeature = ['app_cate_id', 'f_channel', 'app_id', 'inner_slot_id']   # 没有app_paid
contentFeature = ['city', 'carrier', 'province', 'nnt', 'devtype', 'os_name', 'osv', 'os', 'make', 'model']  # 没有time
otherFeature = ['creative_width', 'creative_height', 'hour']

feature = advertFeature + mediaFeature + contentFeature + addFeature
len(feature)

30

In [None]:
减少 time\user_tags\app_paid\creative_is_js\creative_is_voicead\creative_width\creative_height\instance_id   8
增加 adid_prefix\level\couty\inner_slot_prefix\advert_industry_inner_1   5