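### Tmall repeat-purchase data analysis and prediction

- Take an overall look at the samples
- Missing values and sample handling (should key missing values be checked only after the features are built?)
- Data distribution: the balance of positive and negative samples, and how to handle it
- Inspect the data distribution (before or after feature engineering, or does this step drive the feature engineering?). This step is mainly about finding which features relate to the outcome. Correlation can be examined by computing probabilities, through visualization, or by combining several features (on the Titanic, third-class passengers had a low survival rate, but third-class infants did not). Features can be numeric or categorical. New features can also be mined: for example, extracting a title from a name and finding that the repurchase rate differs across titles.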
  
  
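1. Import the data
2. Inspect the data: dtypes, size, details, missing values, ... (compress)
3. Observe and explore: data distribution, directions for feature construction, positive/negative sample distribution, ...
4. Feature processing: cleaning, missing-value handling, normalization, transformation, dimensionality reduction, derivation, ...
5. Feature construction (suggestion: get a first version running early)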
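
The "compute probabilities to observe correlation" idea from the notes above can be sketched as a grouped mean of the 0/1 label. The table below is a made-up toy stand-in for the merged training data, not the real dataset; only the column names `age_range`, `gender`, and `label` come from this notebook.

```python
import pandas as pd

# Toy stand-in for the merged training table; values are invented for illustration.
toy = pd.DataFrame({
    'age_range': [2, 2, 3, 3, 3, 4, 4, 4],
    'gender':    [0, 1, 0, 0, 1, 0, 1, 1],
    'label':     [0, 1, 0, 1, 0, 0, 0, 1],
})

# Repurchase rate by a single feature: the mean of a 0/1 label is the positive share.
rate_by_age = toy.groupby('age_range')['label'].mean()

# Repurchase rate by a combination of features, with group sizes for context,
# since a rate computed on a handful of rows is not very informative.
rate_by_age_gender = (
    toy.groupby(['age_range', 'gender'])['label']
       .agg(rate='mean', n='size')
)

print(rate_by_age)
print(rate_by_age_gender)
```

Keeping the group size next to each rate makes it easier to tell a real pattern (such as the Titanic infant example) from noise in a tiny subgroup.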

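#####  Basic information about the dataset
    The dataset contains anonymized users' shopping logs for the six months before "Double 11" and on the day of "Double 11", labeled by whether the user is a repeat buyer. The task is to predict, for the given merchants, which new buyers will become loyal customers, i.e. the probability that these new buyers purchase again within six months.  
    Key tables:  
    User behaviour log:
- user_id: unique ID of the shopper
- item_id: unique ID of the item
- cat_id: unique ID of the category the item belongs to
- merchant_id: unique ID of the merchant (the raw log names this column `seller_id`; it is renamed below)
- brand_id: unique ID of the item's brand
- time_stamp: date of the action (format: mmdd)
- action_type: one of {0, 1, 2, 3}; 0 is a click, 1 is add-to-cart, 2 is a purchase, 3 is add-to-favourites  
  
    User table:
- user_id: unique ID of the shopper
- age_range: the user's age bracket. 1 for <18; 2 for [18,24]; 3 for [25,29]; 4 for [30,34]; 5 for [35,39]; 6 for [40,49]; 7 and 8 for >= 50; 0 and NULL mean unknown
- gender: the user's gender. 0 is female, 1 is male, 2 and NULL mean unknown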
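
For plots and tables it can help to turn these codes into readable labels. The sketch below transcribes the encoding listed above into two dictionaries; the names `AGE_LABELS`, `GENDER_LABELS`, and `decode_user_info` are hypothetical helpers introduced here, and the sample rows are invented.

```python
import pandas as pd

# Code-to-label maps transcribed from the field descriptions above.
AGE_LABELS = {
    0: 'unknown', 1: '<18', 2: '18-24', 3: '25-29', 4: '30-34',
    5: '35-39', 6: '40-49', 7: '>=50', 8: '>=50',
}
GENDER_LABELS = {0: 'female', 1: 'male', 2: 'unknown'}

def decode_user_info(df):
    """Return a copy with readable age and gender labels; NULL is treated as unknown."""
    out = df.copy()
    out['age_label'] = out['age_range'].fillna(0).astype(int).map(AGE_LABELS)
    out['gender_label'] = out['gender'].fillna(2).astype(int).map(GENDER_LABELS)
    return out

# Invented rows that exercise the NULL and "unknown" codes.
toy_users = pd.DataFrame({
    'user_id':   [1, 2, 3, 4],
    'age_range': [3.0, None, 8.0, 0.0],
    'gender':    [0.0, 1.0, None, 2.0],
})
decoded = decode_user_info(toy_users)
print(decoded)
```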

In [104]:
# Load the data
import pandas as pd

user_info = pd.read_csv('user_info_format1.csv')
print(user_info.info())

user_log = pd.read_csv('user_log_format1.csv')
print(user_log.info())

train_data = pd.read_csv('train_format1.csv')
print(train_data.info())

test_data = pd.read_csv('test_format1.csv')
print(test_data.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 424170 entries, 0 to 424169
Data columns (total 3 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   user_id    424170 non-null  int64  
 1   age_range  421953 non-null  float64
 2   gender     417734 non-null  float64
dtypes: float64(2), int64(1)
memory usage: 9.7 MB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 54925330 entries, 0 to 54925329
Data columns (total 7 columns):
 #   Column       Dtype  
---  ------       -----  
 0   user_id      int64  
 1   item_id      int64  
 2   cat_id       int64  
 3   seller_id    int64  
 4   brand_id     float64
 5   time_stamp   int64  
 6   action_type  int64  
dtypes: float64(1), int64(6)
memory usage: 2.9 GB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 260864 entries, 0 to 260863
Data columns (total 3 columns):
 #   Column       Non-Null Count   Dtype
---  ------       --------------   -----
 0   user_id      260864 non-null  

In [132]:
# Downcast numeric columns: user_log takes too much memory
import numpy as np

def reduce_memmory_usage(df):
    """Downcast int64/float64 columns to the smallest dtype that fits their range."""
    for col in df.columns:
        col_dtype = str(df[col].dtype)
        num_min, num_max = df[col].min(), df[col].max()

        if col_dtype == 'int64':
            # Check both ends of the range so negative values cannot overflow
            for int_type in (np.int8, np.int16, np.int32):
                info = np.iinfo(int_type)
                if info.min < num_min and num_max < info.max:
                    df[col] = df[col].astype(int_type)
                    break
        elif col_dtype == 'float64':
            # Caution: float16 stores integers exactly only up to 2048, so
            # ID-like or count columns held as floats lose precision here
            for float_type in (np.float16, np.float32):
                info = np.finfo(float_type)
                if info.min < num_min and num_max < info.max:
                    df[col] = df[col].astype(float_type)
                    break

    return df

user_log = reduce_memmory_usage(user_log)
print(user_log.info())


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 54925330 entries, 0 to 54925329
Data columns (total 7 columns):
 #   Column       Dtype  
---  ------       -----  
 0   user_id      int32  
 1   item_id      int32  
 2   cat_id       int16  
 3   merchant_id  int16  
 4   brand_id     float16
 5   time_stamp   int16  
 6   action_type  int8   
dtypes: float16(1), int16(3), int32(2), int8(1)
memory usage: 890.5 MB
None
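
The compressed log stores `brand_id` as float16, which cannot hold every integer ID exactly. The sketch below shows the rounding on one value and a more conservative alternative based on `pd.to_numeric(..., downcast=...)`, which never goes below float32 for floats. The helper name `downcast_numeric` and the toy values are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def downcast_numeric(df):
    """Return a copy whose numeric columns use the smallest safe pandas downcast."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_integer_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast='integer')
        elif pd.api.types.is_float_dtype(out[col]):
            # downcast='float' stops at float32, so whole numbers up to 2**24 stay exact
            out[col] = pd.to_numeric(out[col], downcast='float')
    return out

# Invented rows shaped like the behaviour log.
toy = pd.DataFrame({
    'action_type': [0, 1, 2, 3],
    'item_id':     [10, 70000, 5, 1100000],
    'brand_id':    [2049.0, np.nan, 8477.0, 1.0],
})
small = downcast_numeric(toy)
print(small.dtypes)

# float16 has an 11-bit significand: 2049 is not representable and gets rounded.
rounded = float(np.float16(2049.0))
print(rounded)
```

If the memory saving of a 16-bit column matters, another option is to fill missing IDs with a sentinel value and cast to an integer dtype instead of a float one.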


In [106]:
# Process user_info
print('Missing values in user_info:\n',user_info.isna().sum())
print('Age distribution:\n',user_info.age_range.value_counts())
print('Gender distribution:\n',user_info.gender.value_counts())

# Fill missing ages with 0 and missing genders with 2 (both codes mean "unknown")
user_info['age_range'] = user_info['age_range'].fillna(0)
user_info['gender'] = user_info['gender'].fillna(2)

# Convert and unify dtypes
user_info['user_id'] = user_info['user_id'].astype(np.int32)
user_info['age_range'] = user_info['age_range'].astype(np.int8)
user_info['gender'] = user_info['gender'].astype(np.int8)

print(user_info.info())

# The raw log names the merchant column seller_id; align it with the other tables
user_log = user_log.rename(columns={'seller_id':'merchant_id'})

Missing values in user_info:
 user_id         0
age_range    2217
gender       6436
dtype: int64
Age distribution:
 3.0    111654
0.0     92914
4.0     79991
2.0     52871
5.0     40777
6.0     35464
7.0      6992
8.0      1266
1.0        24
Name: age_range, dtype: int64
Gender distribution:
 0.0    285638
1.0    121670
2.0     10426
Name: gender, dtype: int64
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 424170 entries, 0 to 424169
Data columns (total 3 columns):
 #   Column     Non-Null Count   Dtype
---  ------     --------------   -----
 0   user_id    424170 non-null  int32
 1   age_range  424170 non-null  int8 
 2   gender     424170 non-null  int8 
dtypes: int32(1), int8(2)
memory usage: 2.4 MB
None


In [107]:
# Process the training set
#print('Missing values in the training set:\n',train_data.isna().sum())
print('Label counts:\n',train_data.label.value_counts())

# Convert and unify dtypes
train_data.user_id = train_data.user_id.astype(np.int32)
train_data.merchant_id = train_data.merchant_id.astype(np.int16)
train_data.label = train_data.label.astype(np.float16)

print(train_data.info())

Label counts:
 0    244912
1     15952
Name: label, dtype: int64
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 260864 entries, 0 to 260863
Data columns (total 3 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   user_id      260864 non-null  int32  
 1   merchant_id  260864 non-null  int16  
 2   label        260864 non-null  float16
dtypes: float16(1), int16(1), int32(1)
memory usage: 2.0 MB
None
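
The label counts above show a strong class imbalance. A small sketch of the numbers worth keeping in mind follows; it uses only the two counts printed above, and the variable names are introduced here for illustration. The negative-to-positive ratio is a common starting point for class weights (for example XGBoost's `scale_pos_weight` parameter), though whether weighting helps on this data is not established by this notebook.

```python
import pandas as pd

# Label counts copied from the value_counts() output above.
label_counts = pd.Series({0: 244912, 1: 15952})

# Share of repeat buyers among all labelled user-merchant pairs.
positive_rate = label_counts[1] / label_counts.sum()

# Accuracy obtained by always predicting the majority class (no repeat purchase).
majority_baseline_accuracy = label_counts.max() / label_counts.sum()

# Negatives per positive, often used as a first guess for a positive-class weight.
neg_pos_ratio = label_counts[0] / label_counts[1]

print(f'positive rate:              {positive_rate:.4f}')
print(f'majority-baseline accuracy: {majority_baseline_accuracy:.4f}')
print(f'negative/positive ratio:    {neg_pos_ratio:.2f}')
```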


In [108]:
# Process the test set
test_data = pd.read_csv('test_format1.csv')

# Convert and unify dtypes
test_data.user_id = test_data.user_id.astype(np.int32)
test_data.merchant_id =test_data.merchant_id.astype(np.int16)
# The test file has an empty prob column; rename it so it lines up with the training label
test_data.rename(columns={'prob':'label'}, inplace = True) 
test_data.label = test_data.label.astype(np.float16)

print(test_data.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 261477 entries, 0 to 261476
Data columns (total 3 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   user_id      261477 non-null  int32  
 1   merchant_id  261477 non-null  int16  
 2   label        0 non-null       float16
dtypes: float16(1), int16(1), int32(1)
memory usage: 2.0 MB
None


##### Feature engineering

In [109]:
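# Stack train and test so features are built once for both
# (DataFrame.append was removed in pandas 2.0; pd.concat keeps the same index)
dataset = pd.concat([train_data, test_data])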
print(dataset.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 522341 entries, 0 to 261476
Data columns (total 3 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   user_id      522341 non-null  int32  
 1   merchant_id  522341 non-null  int16  
 2   label        260864 non-null  float16
dtypes: float16(1), int16(1), int32(1)
memory usage: 8.0 MB
None


In [110]:
import gc

# Merge basic user info: gender and age
dataset = dataset.merge(user_info,on=['user_id'],how='left')
print(dataset.info())

del train_data,test_data,user_info
gc.collect()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 522341 entries, 0 to 522340
Data columns (total 5 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   user_id      522341 non-null  int32  
 1   merchant_id  522341 non-null  int16  
 2   label        260864 non-null  float16
 3   age_range    522341 non-null  int8   
 4   gender       522341 non-null  int8   
dtypes: float16(1), int16(1), int32(1), int8(2)
memory usage: 9.0 MB
None
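
A left merge like the one above silently leaves NaN for users missing from the lookup table and silently duplicates rows if the lookup key is not unique. The sketch below, on invented toy tables, uses the `validate` and `indicator` options of `DataFrame.merge` to check both conditions; the table and variable names are assumptions for illustration.

```python
import pandas as pd

# Invented user-merchant pairs and a user table that lacks user 3.
pairs = pd.DataFrame({'user_id': [1, 1, 2, 3], 'merchant_id': [10, 11, 10, 12]})
users = pd.DataFrame({'user_id': [1, 2], 'age_range': [3, 0], 'gender': [0, 2]})

merged = pairs.merge(
    users,
    on='user_id',
    how='left',
    validate='many_to_one',  # raises MergeError if user_id repeats in `users`
    indicator=True,          # adds a _merge column: 'both' or 'left_only'
)

# Rows whose user_id had no match in the user table.
unmatched = int((merged['_merge'] == 'left_only').sum())
print(merged)
print('unmatched rows:', unmatched)
```

In the notebook's own output every merged row has a non-null `age_range` and `gender`, so all users were found; the check is mainly a guard for reruns on other data.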


226

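    User behaviour log:
   - user_id: unique ID of the shopper
   - item_id: unique ID of the item
   - cat_id: unique ID of the category the item belongs to
   - seller_id: unique ID of the merchant
   - brand_id: unique ID of the item's brand
   - time_stamp: date of the action (format: mmdd)
   - action_type: one of {0, 1, 2, 3}; 0 is a click, 1 is add-to-cart, 2 is a purchase, 3 is add-to-favourites  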

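##### Single-entity and combined features
    User features  
- Total clicks by the user
- Total add-to-carts by the user
- Total purchases by the user
- Total favourites by the user 
    
    Merchant features  
- Number of distinct visiting users
- Number of items
- Number of categories
- Number of brands
- Total clicks
- Total add-to-carts
- Total purchases
- Total favourites
    
    User + merchant features 
- Clicks by the user at the merchant
- Add-to-carts by the user at the merchant
- Purchases by the user at the merchant
- Favourites by the user at the merchant
    
##### Derived features  
    User preference features (unsure from here on: should the training set be combined with the user log?)
- For the merchants a user bought from or visited: number of items, categories, and brands, plus clicks, add-to-carts, purchases, and favourites

    Merchant customer features (unsure)
- Mode of gender and age among the merchant's customers

    User behaviour at the merchant (unsure)
- Months and days on which the user appeared at the shop, and visits per month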
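
The last item above (months, days, and visits per month for a user at a shop) is not implemented later in this notebook. One possible way to build it is sketched below on an invented toy log, relying only on the `mmdd` format of `time_stamp`; the feature names prefixed `um_` are hypothetical.

```python
import pandas as pd

# Invented rows shaped like the behaviour log (time_stamp is mmdd).
toy_log = pd.DataFrame({
    'user_id':     [1, 1, 1, 1, 2, 2],
    'merchant_id': [10, 10, 10, 11, 10, 10],
    'time_stamp':  [511, 511, 1111, 1111, 829, 1111],
    'action_type': [0, 2, 0, 0, 0, 2],
})

# With mmdd integers, integer division by 100 yields the month.
toy_log['month'] = toy_log['time_stamp'] // 100

um = (
    toy_log.groupby(['user_id', 'merchant_id'])
           .agg(
               um_active_days=('time_stamp', 'nunique'),   # distinct days with any action
               um_active_months=('month', 'nunique'),      # distinct months with any action
               um_actions=('action_type', 'size'),         # all logged actions
           )
           .reset_index()
)
um['um_actions_per_month'] = um['um_actions'] / um['um_active_months']
print(um)
```

The resulting frame has one row per (user_id, merchant_id) pair, so it can be left-merged onto `dataset` in the same way as the other user + merchant features.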

In [111]:
# Single-entity features: user attributes (one count per action type)

def user_attribute(df_data,col_name, action_type,dataset):
    df = df_data[df_data['action_type']==action_type].groupby(['user_id'])[['user_id']].count().rename(columns={'user_id':col_name})
    dataset = dataset.merge(df, on=['user_id'], how='left')
    return dataset

dataset = user_attribute(user_log,'user_click_total', 0,dataset)
dataset = user_attribute(user_log,'user_addtocart_total', 1,dataset)
dataset = user_attribute(user_log,'user_purchase_total', 2,dataset)
dataset = user_attribute(user_log,'user_favourite_total', 3,dataset)

print(dataset.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 522341 entries, 0 to 522340
Data columns (total 9 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   user_id               522341 non-null  int32  
 1   merchant_id           522341 non-null  int16  
 2   label                 260864 non-null  float16
 3   age_range             522341 non-null  int8   
 4   gender                522341 non-null  int8   
 5   user_click_total      521981 non-null  float64
 6   user_addtocart_total  38179 non-null   float64
 7   user_purchase_total   522341 non-null  int64  
 8   user_favourite_total  294859 non-null  float64
dtypes: float16(1), float64(3), int16(1), int32(1), int64(1), int8(2)
memory usage: 24.9 MB
None
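
In the output above, `user_addtocart_total` is non-null for only 38179 rows: a user with no add-to-cart events gets NaN from the left merge, although the natural value is 0. The four counts can also be produced in one pass over the log. The sketch below does this with `pd.crosstab` on an invented toy log; `ACTION_NAMES` is a hypothetical mapping that mirrors the column names used in this notebook, and memory behaviour on the full log is not something this sketch verifies.

```python
import pandas as pd

# Hypothetical mapping from action_type codes to the suffixes used in this notebook.
ACTION_NAMES = {0: 'click', 1: 'addtocart', 2: 'purchase', 3: 'favourite'}

# Invented rows; note that nobody has an add-to-cart event (code 1).
toy_log = pd.DataFrame({
    'user_id':     [1, 1, 1, 2, 2, 3],
    'action_type': [0, 0, 2, 0, 3, 2],
})

# One pass: rows are users, columns are action types, cells are event counts.
counts = pd.crosstab(toy_log['user_id'], toy_log['action_type'])

# Make sure all four action types are present, with 0 where none occurred.
counts = counts.reindex(columns=list(ACTION_NAMES), fill_value=0)
counts.columns = [f'user_{ACTION_NAMES[a]}_total' for a in counts.columns]
counts = counts.reset_index()
print(counts)
```

After merging such a table, any remaining NaN would mean the user never appears in the log at all, which is a different situation from a zero count.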


In [112]:
# Single-entity features: merchant attributes (distinct users, items, categories, brands)

def m_attribute1(df_data,col_name,cnt_col,dataset):
    df = df_data.groupby(['merchant_id'])[[cnt_col]].nunique().rename(columns={cnt_col:col_name})
    dataset = dataset.merge(df, on=['merchant_id'], how='left')
    return dataset

dataset = m_attribute1(user_log,'m_users_t', 'user_id',dataset)
dataset = m_attribute1(user_log,'m_items_t', 'item_id',dataset)
dataset = m_attribute1(user_log,'m_catgrys_t', 'cat_id',dataset)
dataset = m_attribute1(user_log,'m_brands_t', 'brand_id',dataset)

print(dataset.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 522341 entries, 0 to 522340
Data columns (total 13 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   user_id               522341 non-null  int32  
 1   merchant_id           522341 non-null  int16  
 2   label                 260864 non-null  float16
 3   age_range             522341 non-null  int8   
 4   gender                522341 non-null  int8   
 5   user_click_total      521981 non-null  float64
 6   user_addtocart_total  38179 non-null   float64
 7   user_purchase_total   522341 non-null  int64  
 8   user_favourite_total  294859 non-null  float64
 9   m_users_t             522341 non-null  int64  
 10  m_items_t             522341 non-null  int64  
 11  m_catgrys_t           522341 non-null  int64  
 12  m_brands_t            522341 non-null  int64  
dtypes: float16(1), float64(3), int16(1), int32(1), int64(5), int8(2)
memory usage: 40.8 MB
None


In [114]:
def m_attribute2(df_data,col_name, action_type,dataset):
    df = df_data[df_data['action_type']==action_type].groupby(['merchant_id'])[['user_id']].count().rename(columns={'user_id':col_name})
    dataset = dataset.merge(df, on=['merchant_id'], how='left')
    return dataset

dataset = m_attribute2(user_log,'m_click_t', 0,dataset)
dataset = m_attribute2(user_log,'m_addtocart_t', 1,dataset)
dataset = m_attribute2(user_log,'m_purchase_t', 2,dataset)
dataset = m_attribute2(user_log,'m_favourite_t', 3,dataset)

print(dataset.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 522341 entries, 0 to 522340
Data columns (total 17 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   user_id               522341 non-null  int32  
 1   merchant_id           522341 non-null  int16  
 2   label                 260864 non-null  float16
 3   age_range             522341 non-null  int8   
 4   gender                522341 non-null  int8   
 5   user_click_total      521981 non-null  float64
 6   user_addtocart_total  38179 non-null   float64
 7   user_purchase_total   522341 non-null  int64  
 8   user_favourite_total  294859 non-null  float64
 9   m_users_t             522341 non-null  int64  
 10  m_items_t             522341 non-null  int64  
 11  m_catgrys_t           522341 non-null  int64  
 12  m_brands_t            522341 non-null  int64  
 13  m_click_t             522341 non-null  int64  
 14  m_addtocart_t         518289 non-null  float64
 15  

In [128]:
# Combined features: user + merchant
def uim_attribute1(df_data,col_name, action_type,dataset):
    df = df_data[df_data['action_type']==action_type].groupby(['user_id','merchant_id'])[['user_id']].count().rename(columns={'user_id':col_name})
    dataset = dataset.merge(df,on=['user_id','merchant_id'], how='left')
    return dataset

dataset = uim_attribute1(user_log,'uim_click', 0,dataset)
dataset = uim_attribute1(user_log,'uim_addtocart', 1,dataset)
dataset = uim_attribute1(user_log,'uim_purchase', 2,dataset)
dataset = uim_attribute1(user_log,'uim_favourite', 3,dataset)

print(dataset.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 522341 entries, 0 to 522340
Data columns (total 21 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   user_id               522341 non-null  int32  
 1   merchant_id           522341 non-null  int16  
 2   label                 260864 non-null  float16
 3   age_range             522341 non-null  int8   
 4   gender                522341 non-null  int8   
 5   user_click_total      521981 non-null  float64
 6   user_addtocart_total  38179 non-null   float64
 7   user_purchase_total   522341 non-null  int64  
 8   user_favourite_total  294859 non-null  float64
 9   m_users_t             522341 non-null  int64  
 10  m_items_t             522341 non-null  int64  
 11  m_catgrys_t           522341 non-null  int64  
 12  m_brands_t            522341 non-null  int64  
 13  m_click_t             522341 non-null  int64  
 14  m_addtocart_t         518289 non-null  float64
 15  

In [133]:
# Compress again
dataset = reduce_memmory_usage(dataset)
print(dataset.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 522341 entries, 0 to 522340
Data columns (total 21 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   user_id               522341 non-null  int32  
 1   merchant_id           522341 non-null  int16  
 2   label                 260864 non-null  float16
 3   age_range             522341 non-null  int8   
 4   gender                522341 non-null  int8   
 5   user_click_total      521981 non-null  float16
 6   user_addtocart_total  38179 non-null   float16
 7   user_purchase_total   522341 non-null  int16  
 8   user_favourite_total  294859 non-null  float16
 9   m_users_t             522341 non-null  int32  
 10  m_items_t             522341 non-null  int16  
 11  m_catgrys_t           522341 non-null  int16  
 12  m_brands_t            522341 non-null  int8   
 13  m_click_t             522341 non-null  int32  
 14  m_addtocart_t         518289 non-null  float16
 15  

In [138]:
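# Split back into training and test sets (test rows are the ones without a label)
# .copy() gives independent frames, so later in-place edits do not warn about views
test_data = dataset[dataset['label'].isna()].copy()
train_data = dataset[~dataset['label'].isna()].copy()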

test_data.to_csv('test_data.csv',index=False)
train_data.to_csv('train_data.csv',index=False)

print(test_data.info())
print(train_data.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 261477 entries, 260864 to 522340
Data columns (total 21 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   user_id               261477 non-null  int32  
 1   merchant_id           261477 non-null  int16  
 2   label                 0 non-null       float16
 3   age_range             261477 non-null  int8   
 4   gender                261477 non-null  int8   
 5   user_click_total      261297 non-null  float16
 6   user_addtocart_total  19130 non-null   float16
 7   user_purchase_total   261477 non-null  int16  
 8   user_favourite_total  147732 non-null  float16
 9   m_users_t             261477 non-null  int32  
 10  m_items_t             261477 non-null  int16  
 11  m_catgrys_t           261477 non-null  int16  
 12  m_brands_t            261477 non-null  int8   
 13  m_click_t             261477 non-null  int32  
 14  m_addtocart_t         259486 non-null  float16


##### Model training

In [9]:
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

train_X, train_y = train_data.drop(['label'], axis=1), train_data['label']

# Replace missing and infinite values with 0
def fill_nanvsinf(data):
    data.replace(np.nan, 0, inplace=True)
    data.replace([np.inf, -np.inf], 0, inplace=True)
    return data

train_X = fill_nanvsinf(train_X)

# Standardize the features (zero mean, unit variance)
stdScaler = StandardScaler()
X = stdScaler.fit_transform(train_X)

# Split into training and validation sets
X_train, X_valid, y_train, y_valid = train_test_split(X,train_y)
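
The cell above fits the scaler on all rows before splitting and splits without stratification. A variant that some would prefer is sketched below on synthetic data: split first with `stratify` so both parts keep the same positive share, then fit the scaler on the training part only. The synthetic arrays, the 6% positive rate, and the variable names are assumptions for illustration, and this is not a claim that it changes the scores reported in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 1000 rows, 4 features, roughly 6% positives.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(1000, 4))
y_toy = (rng.random(1000) < 0.06).astype(int)

# Stratified split: the positive share is preserved in both parts.
X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=0
)

# Fit the scaler on the training part only, then apply it to both parts,
# so no statistics from the validation rows leak into training.
scaler = StandardScaler().fit(X_tr)
X_tr_s = scaler.transform(X_tr)
X_va_s = scaler.transform(X_va)

print('positive share, train:', y_tr.mean())
print('positive share, valid:', y_va.mean())
```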

In [149]:
# Logistic regression (fast)
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(random_state=0, solver='lbfgs', multi_class='multinomial').fit(X_train, y_train)
print('Logistic regression score:', clf.score(X_valid, y_valid))

Logistic regression score: 0.9379293424926398
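
`clf.score` reports accuracy. Since 244912 of the 260864 training labels are 0 (about 93.9%), a model that always predicts "no repeat purchase" would reach a similar accuracy, so accuracy says little here. The sketch below, on synthetic imbalanced data, compares a majority-class baseline with logistic regression using both accuracy and ROC AUC; the data and names are invented for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic data with roughly 6% positives, mimicking the imbalance in this task.
X_toy, y_toy = make_classification(
    n_samples=5000, n_features=10, weights=[0.94, 0.06], random_state=0
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, stratify=y_toy, random_state=0
)

# Baseline that always predicts the majority class.
baseline = DummyClassifier(strategy='most_frequent').fit(X_tr, y_tr)
logreg = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

baseline_acc = accuracy_score(y_va, baseline.predict(X_va))
logreg_acc = accuracy_score(y_va, logreg.predict(X_va))

# AUC is computed from predicted probabilities of the positive class.
baseline_auc = roc_auc_score(y_va, baseline.predict_proba(X_va)[:, 1])
logreg_auc = roc_auc_score(y_va, logreg.predict_proba(X_va)[:, 1])

print(f'baseline: accuracy={baseline_acc:.3f}  auc={baseline_auc:.3f}')
print(f'logreg:   accuracy={logreg_acc:.3f}  auc={logreg_auc:.3f}')
```

The baseline scores high on accuracy but its constant predictions give an AUC of 0.5, which is why the XGBoost cells further down evaluate with AUC.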


In [152]:
# KNN (computationally heavy)
from sklearn.neighbors import KNeighborsClassifier

clf = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
print('KNN score:', clf.score(X_valid, y_valid))

KNN score: 0.9277324582924436


In [153]:
# Gaussian naive Bayes (near-instant)
from sklearn.naive_bayes import GaussianNB

clf = GaussianNB().fit(X_train, y_train)
print('Gaussian naive Bayes score:', clf.score(X_valid, y_valid))

Gaussian naive Bayes score: 0.8582096418056918


In [154]:
# Decision tree (fast)
from sklearn import tree

clf = tree.DecisionTreeClassifier()
clf = clf.fit(X_train, y_train)
print('Decision tree score:', clf.score(X_valid, y_valid))

Decision tree score: 0.8759966879293425


In [160]:
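# Random forest (fairly fast)
from sklearn.ensemble import RandomForestClassifier

rf_clf = RandomForestClassifier(
    n_estimators=10, 
    min_samples_split=2,
    max_depth=None, 
    max_features='sqrt')

rf_clf.fit(X_train, y_train)
# Score rf_clf, not clf: clf still holds the decision tree from the previous cell.
# The value recorded for this cell, 0.8759966879293425, equals the decision-tree
# score because it came from clf.score; rerun to obtain the forest's accuracy.
print('Random forest score:', rf_clf.score(X_valid, y_valid))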


In [159]:
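# Random forest with a fixed random_state
from sklearn.ensemble import RandomForestClassifier

rf_clf = RandomForestClassifier(
    n_estimators=10, 
    min_samples_split=2,
    max_depth=None, 
    random_state=0)

rf_clf.fit(X_train, y_train)
# Score rf_clf, not clf: clf still holds the decision tree fitted earlier.
# The value recorded for this cell, 0.8759966879293425, equals the decision-tree
# score because it came from clf.score; rerun to obtain the forest's accuracy.
print('Random forest score:', rf_clf.score(X_valid, y_valid))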


In [163]:
%pip install xgboost

Collecting xgboost
  Downloading xgboost-1.3.3-py3-none-macosx_10_14_x86_64.macosx_10_15_x86_64.macosx_11_0_x86_64.whl (1.2 MB)
Installing collected packages: xgboost
Successfully installed xgboost-1.3.3
Note: you may need to restart the kernel to use updated packages.


In [12]:
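# XGBoost
import xgboost as xgb
from sklearn.metrics import roc_auc_score

dtrain = xgb.DMatrix(X_train, label=y_train)
dvalid = xgb.DMatrix(X_valid, label=y_valid)

params = {
    # General parameters
    'booster':'gbtree',
    # Booster parameters
    'eta': 0.02, # learning rate; typical values 0.01-0.2
    'min_child_weight': 2, # default 1; larger values curb overfitting, too large underfits
    'max_depth': 5, # default 6, typical 3-10; deeper trees fit local patterns and overfit more easily
    'gamma':1, # minimum loss reduction needed to split; default 0, larger is more conservative
    'subsample': 0.7, # row sampling ratio; default 1, typical 0.5-1; lower is more conservative, too low underfits
    'colsample_bytree': 0.7,  # column sampling ratio per tree; default 1, typical 0.5-1
    'colsample_bylevel':0.7, # column sampling ratio per depth level of a tree
    'lambda':10, # L2 regularization on the weights; curbs overfitting
    # Learning task parameters
    'objective': 'binary:logistic', 
    #'num_class':2,
    'eval_metric':'auc'
}

evallist = [(dtrain, 'train'),(dvalid,'valid')]
# Keyword arguments keep this call valid where evals is keyword-only
model = xgb.train(params, dtrain, num_boost_round=1000, evals=evallist, early_stopping_rounds=10)

# In this run predict() used every trained round, so the score printed below
# matches the last logged valid-auc (0.618473), not the best iteration's (0.619134)
X_valid_DMatrix = xgb.DMatrix(X_valid)
y_pred = model.predict(X_valid_DMatrix)

print('XGBoost_score:', roc_auc_score(y_valid,y_pred))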


[0]	train-auc:0.581805	valid-auc:0.581448
Multiple eval metrics have been passed: 'valid-auc' will be used for early stopping.

Will train until valid-auc hasn't improved in 10 rounds.
[1]	train-auc:0.610193	valid-auc:0.60793
[2]	train-auc:0.61242	valid-auc:0.610067
[3]	train-auc:0.616985	valid-auc:0.61334
[4]	train-auc:0.620087	valid-auc:0.61532
[5]	train-auc:0.624415	valid-auc:0.619134
[6]	train-auc:0.625289	valid-auc:0.618792
[7]	train-auc:0.624962	valid-auc:0.618415
[8]	train-auc:0.624451	valid-auc:0.617901
[9]	train-auc:0.623937	valid-auc:0.617332
[10]	train-auc:0.623454	valid-auc:0.616932
[11]	train-auc:0.623173	valid-auc:0.616911
[12]	train-auc:0.623378	valid-auc:0.617086
[13]	train-auc:0.624206	valid-auc:0.616862
[14]	train-auc:0.625734	valid-auc:0.618238
[15]	train-auc:0.626025	valid-auc:0.618473
Stopping. Best iteration:
[5]	train-auc:0.624415	valid-auc:0.619134

XGBoost_score: 0.6184728695627957


In [22]:
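# XGBoost without standardization
train_X, train_y = train_data.drop(['label'], axis=1), train_data['label']

# Replace missing and infinite values with 0
train_X = fill_nanvsinf(train_X)

# Split into training and validation sets
X_train, X_valid, y_train, y_valid = train_test_split(train_X,train_y)

import xgboost as xgb
from sklearn.metrics import roc_auc_score

dtrain = xgb.DMatrix(X_train, label=y_train)
dvalid = xgb.DMatrix(X_valid, label=y_valid)

params = {
    # General parameters
    'booster':'gbtree',
    # Booster parameters
    'eta': 0.02, # learning rate; typical values 0.01-0.2
    'min_child_weight': 2, # default 1; larger values curb overfitting, too large underfits
    'max_depth': 5, # default 6, typical 3-10; deeper trees fit local patterns and overfit more easily
    'gamma':1, # minimum loss reduction needed to split; default 0, larger is more conservative
    'subsample': 0.7, # row sampling ratio; default 1, typical 0.5-1; lower is more conservative, too low underfits
    'colsample_bytree': 0.7,  # column sampling ratio per tree; default 1, typical 0.5-1
    'colsample_bylevel':0.7, # column sampling ratio per depth level of a tree
    'lambda':10, # L2 regularization on the weights; curbs overfitting
    # Learning task parameters
    'objective': 'binary:logistic', 
    #'num_class':2,
    'eval_metric':'auc'
}

evallist = [(dtrain, 'train'),(dvalid,'valid')]
# Keyword arguments keep this call valid where evals is keyword-only
model = xgb.train(params, dtrain, num_boost_round=1000, evals=evallist, early_stopping_rounds=10)

# In this run predict() used every trained round, so the score printed below
# matches the last logged valid-auc (0.619692), not the best iteration's (0.621822)
X_valid_DMatrix = xgb.DMatrix(X_valid)
y_pred = model.predict(X_valid_DMatrix)

print('XGBoost_score:', roc_auc_score(y_valid,y_pred))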


[0]	train-auc:0.578581	valid-auc:0.584764
Multiple eval metrics have been passed: 'valid-auc' will be used for early stopping.

Will train until valid-auc hasn't improved in 10 rounds.
[1]	train-auc:0.607752	valid-auc:0.613713
[2]	train-auc:0.614369	valid-auc:0.618882
[3]	train-auc:0.616379	valid-auc:0.621168
[4]	train-auc:0.617643	valid-auc:0.621822
[5]	train-auc:0.619744	valid-auc:0.620275
[6]	train-auc:0.620828	valid-auc:0.620984
[7]	train-auc:0.619797	valid-auc:0.620132
[8]	train-auc:0.619373	valid-auc:0.619761
[9]	train-auc:0.618656	valid-auc:0.619074
[10]	train-auc:0.618541	valid-auc:0.618968
[11]	train-auc:0.618903	valid-auc:0.619339
[12]	train-auc:0.619049	valid-auc:0.619149
[13]	train-auc:0.619393	valid-auc:0.619468
[14]	train-auc:0.619632	valid-auc:0.619692
Stopping. Best iteration:
[4]	train-auc:0.617643	valid-auc:0.621822

XGBoost_score: 0.6196915416832752


In [16]:
# Process the test set: drop the empty label column and fill missing values
test_data.drop(['label'], axis=1, inplace=True)
test_data = fill_nanvsinf(test_data)

print(test_data.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 261477 entries, 0 to 261476
Data columns (total 20 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   user_id               261477 non-null  int32  
 1   merchant_id           261477 non-null  int16  
 2   age_range             261477 non-null  int8   
 3   gender                261477 non-null  int8   
 4   user_click_total      261477 non-null  float16
 5   user_addtocart_total  261477 non-null  float16
 6   user_purchase_total   261477 non-null  int16  
 7   user_favourite_total  261477 non-null  float16
 8   m_users_t             261477 non-null  int32  
 9   m_items_t             261477 non-null  int16  
 10  m_catgrys_t           261477 non-null  int16  
 11  m_brands_t            261477 non-null  int8   
 12  m_click_t             261477 non-null  int32  
 13  m_addtocart_t         261477 non-null  float16
 14  m_purchase_t          261477 non-null  int16  
 15  

In [23]:
test_data_DMatrix = xgb.DMatrix(test_data)
pred = model.predict(test_data_DMatrix)
result = pd.DataFrame()
result['user_id'] = test_data['user_id']
result['merchant_id'] = test_data['merchant_id']
result['prob'] = pred

print(result.head(10))
result.to_csv('prediction.csv', index=False)

   user_id  merchant_id      prob
0   163968         4605  0.385365
1   360576         1581  0.388769
2    98688         1964  0.382539
3    98688         3645  0.381955
4   295296         3361  0.393740
5    33408           98  0.381955
6   230016         1742  0.391033
7   164736          598  0.382829
8   164736         1963  0.381955
9   164736         2634  0.381955


In [24]:
result.to_csv('prediction.csv', index=False)