## CSDN用户画像技术评测(https://www.biendata.xyz/competition/smpcup2017/)
任务1：用户内容主题词生成

给定若干用户文档（博客或帖子），为每一篇文档生成3个最合适的主题词。要求生成的主题词必须出现在文档中。

任务2：用户兴趣标注

给定若干用户的文档信息（博客或帖子）和行为数据（浏览、评论、收藏、转发、点赞/踩、关注、私信等），为每一个用户标注3个最合适的兴趣方向。标签空间由CSDN给定。

任务3：用户成长预测

给定若干用户在一段时间内（至少1年）的文档信息（博客或帖子）和行为数据（浏览、评论、收藏、转发、点赞/踩、关注、私信等），预测每一个用户在未来一段时间内（半年至1年）的成长值。用户成长值是根据用户的综合表现打分所得，但不会公布具体打分准则。成长值将会归一化到[0, 1]区间，其中值为0表示用户流失。 

### 由于我对NLP方面的一些认识不足、这次任务一、任务二的成绩也不是很好，所以我就只讲一下任务三数据挖掘的一些东西

**<font color=red>看看数据长什么样</font>**<br>
还是用pandas加载数据

In [23]:
# 这个ipython notebook主要是我解决smp问题的思路和过程
def set_ch():
    from pylab import mpl
    mpl.rcParams['font.sans-serif'] = ['FangSong'] # 指定默认字体
    mpl.rcParams['axes.unicode_minus'] = False # 解决保存图像是负号'-'显示为方块的问题
set_ch()

import pandas as pd #数据分析工具
import pickle # 保存中间数据
import os 
import re # 正则化处理文本
import time # 处理时间
import numpy as np # 科学计算工具
from datetime import datetime # 时间处理工具
from collections import Counter # 统计词频
import jieba # 分词工具
from sklearn.feature_extraction.text import TfidfVectorizer # 文本向量化
import xgboost as xgb
from sklearn.cross_validation import train_test_split
from PIL import Image
import matplotlib.pyplot as plt
import networkx as nx # 网络工具包
#from wordcloud import WordCloud,STOPWORDS,ImageColorGenerator
#import sys 
#reload(sys)
#sys.setdefaultencoding('utf-8') # python2.7处理中文必须语句
%matplotlib inline

### 数据路径

In [12]:
# 总共9个数据：包括行为数据、社交数据、文本数据。
Post_path = u'F:/文档/zhangshuo/SMPCUP2017数据集/2_Post.txt'
Browse_path = u'F:/文档/zhangshuo/SMPCUP2017数据集/3_Browse.txt'
Comment_path = u'F:/文档/zhangshuo/SMPCUP2017数据集/4_Comment.txt'
Vote_up_path = u'F:/文档/zhangshuo/SMPCUP2017数据集/5_Vote-up.txt'
Vote_down_path = u'F:/文档/zhangshuo/SMPCUP2017数据集/6_Vote-down.txt'
Favorite_path = u'F:/文档/zhangshuo/SMPCUP2017数据集/7_Favorite.txt'
Follow_path = u'F:/文档/zhangshuo/SMPCUP2017数据集/8_Follow.txt'
Letter_path = u'F:/文档/zhangshuo/SMPCUP2017数据集/9_Letter.txt'
Blog_content_path = u'F:/文档/zhangshuo/SMPCUP2017数据集/BlogContent.txt'

In [3]:
# 文本中排除的停用词
stop_word = open('stop_word.txt').read().decode('utf8')
stop_word[:10}

u'\u554a\n\u963f\n\u54ce\n\u54ce\u5440\n\u54ce'

### 数据读取

In [13]:
# 读取行为数据。注意：社交数据follow没有time
def Read_content(path):
    name = re.split(r'[/_.]',path)[-2]
    csv_path  = name+'.csv'
    if os.path.exists(csv_path):
        data = pd.read_csv(csv_path)
    else:
        data = pd.read_csv(path, sep='', header=None, names = ['userid', 'contentid', 'time'])
        data['userid'] = data['userid'].str.slice(1)#STR
        data['contentid'] = data['contentid'].str.slice(1)
        data.to_csv(csv_path, index=False, encoding='utf8')
    return data

post = Read_content(Post_path)
browse = Read_content(Browse_path)
comment = Read_content(Comment_path)
voteup = Read_content(Vote_up_path)
votedown = Read_content(Vote_down_path)
favorite = Read_content(Favorite_path)
follow = Read_content(Follow_path)
letter = Read_content(Letter_path)
post[:10]

Unnamed: 0,userid,contentid,time
0,U0024827,D0874760,2015-02-05 18:05:49.0
1,U0042700,D0874412,2015-08-15 15:27:07.0
2,U0107217,D0347438,2015-05-17 22:13:54.0
3,U0148074,D0696319,2015-01-06 09:49:48.0
4,U0083747,D0250212,2015-07-27 14:53:41.0
5,U0012666,D0831450,2015-07-25 16:34:45.0
6,U0134425,D0186488,2015-02-26 19:38:22.0
7,U0050880,D0726216,2015-04-25 23:18:01.0
8,U0087326,D0254441,2015-11-09 17:54:56.0
9,U0035101,D0483304,2015-09-08 19:36:54.0


###  <font color=red>发表、浏览、评论、点赞、点踩、私信的行为数据形式都如上所示，包含时间。关系数据不包括时间，而且是有向无权。</font><br> 


In [5]:
follow[:10]

Unnamed: 0,userid1,userid2
0,124114,20107
1,75485,78016
2,46039,144644
3,151927,4992
4,86950,5413
5,123910,133518
6,13999,114909
7,75496,105967
8,127017,139062
9,118129,31311


**<font color=red>我们看大概有以下这些字段</font>**<br>
userid => 用户id<br>
contentid => 博客id<br>
time => 行为时间<br>
userid1 => 被关注用户<br>
userid2 => 关注用户<br>

### 时间处理

In [11]:
# 将时间转化为时间戳 以1970-1-1 0：0:0 为0 单位为：s
def time_stamp(t):
    try:
        timeArray = time.strptime(t, "%Y-%m-%d %H:%M:%S")
    except:
        timeArray = time.strptime(t, "%Y%m%d %H:%M:%S")
    timeStamp = int(time.mktime(timeArray))
    return timeStamp

### <font color=red>特征工程</font><br>
### 行为数据提取特征：发表博客1

In [5]:
def get_feature_post1(post): # 发表帖子的数量、发表时间间隔的统计变量
    csv_path = u'F:/文档/zhangshuo/SMPCUP2017数据集/post_feature1.csv'
    if os.path.exists(csv_path):
        post1 = pd.read_csv(csv_path)
    else:
        post['time'] = post['time'].str.split('.').str[0]#对时间进行处理
        post['time'] = post['time'].map(time_stamp)
        post.sort_values(['userid','time'], inplace = True)#
        #以上是数据的一些格式处理
        post.index = range(len(post))#重新排列索引
        first_index = list(post.drop_duplicates('userid').index)#返回的是去除重复行的userid的索引
        #
        post['time'] -=  post['time'].shift(1)#目的是算时间间隔
        post['time'][first_index] = 0
        post.index = range(len(post))#重新排列索引
        #注意手法
        gg = post.groupby('userid')##******
        g_dict = dict(list(gg))
        
        post1 = pd.DataFrame() 
        post1['userid'] = pd.unique(post['userid'])
        post1['post_contentid'] = [list(g_dict[i]['contentid']) for i in post1['userid']]#以列表的形式返回
        post1['post_count'] = list(gg['contentid'].size())
        post1['post_maxt'] = list(gg['time'].max())
#        post1['mint'] = list(gg['time'].min())
        post1['post_meant'] = list(gg['time'].mean())
        post1['post_vart'] = list(gg['time'].var())
        post1.fillna(0, inplace=True)#空值，添为0
        post1.to_csv(csv_path, index=False)
    return post1

post_feature1 = get_feature_post1(post)
post_feature1[:10]

Unnamed: 0,userid,post_contentid,post_count,post_maxt,post_meant,post_vart
0,1,"[653345, 249294, 106694, 235712, 139383, 10923...",26,14863338,727618.0385,8390000000000.0
1,2,"[760834, 422221, 888644, 682217]",4,11228133,3242629.0,28900000000000.0
2,3,"[818966, 193031, 918426]",3,39932,13592.0,520524800.0
3,4,"[548739, 236223]",2,197783,98891.5,19559060000.0
4,5,"[412100, 484744, 97908, 60576, 218041, 144625,...",156,1557233,164389.6987,26411560000.0
5,6,[378263],1,0,0.0,0.0
6,7,"[442157, 200342, 53878, 791026, 206581, 244631...",23,6902913,950117.3913,2960000000000.0
7,8,"[925844, 877670, 693419, 418368, 688350, 68266...",10,899811,160065.5,74494520000.0
8,11,"[904740, 372821, 643971, 547558, 526165, 44438...",130,2075631,239230.0154,124000000000.0
9,13,"[578318, 985214, 227065, 428724]",4,1207929,417568.0,292000000000.0


**<font color=red>contentid：用户发表的博客<br>
count：发表博客的数量<br>
maxt：博客发表的最大时间间隔<br>
mint：博客发表的最小时间间隔<br>
meant：博客发表的平均时间间隔<br>
vart：博客发表时间的方差<br></font>**

### 行为数据提取特征：发表博客2

In [14]:
def get_feature_post2(post): # 每个月、二个月、三个月、四个月、五个月、六个月发表帖子的数量
    csv_path = u'post_feature2.csv'
    if os.path.exists(csv_path):
        df = pd.read_csv(csv_path)
    else:
        post['month'] = post['time'].str.split('-').str[1].astype('int')
        post['count'] = 1
        post = post.groupby(['userid','month'])['count'].sum().reset_index()
        dic_post = dict(list(post.groupby(['userid','month'])['count']))
        df = pd.DataFrame()
        df['userid'] = pd.unique(post['userid'])        
        all_month = []                    
        for month in range(1,13):
            ll = []
            for user in pd.unique(post['userid']):            
                try:
                    ll.append(dic_post[user,month].values[0])
                except:
                    ll.append(0)
            all_month.append(ll)
        columns_each1 = ['post_count_month_each1'+str(i) for i in range(1,13)] # 每个月
        columns_each2 = ['post_count_month_each2'+str(i) for i in range(1,12)] # 每二个月
        columns_each3 = ['post_count_month_each3'+str(i) for i in range(1,11)] # 每三个月
        columns_each4 = ['post_count_month_each4'+str(i) for i in range(1,10)] # 每四个月
        columns_each5 = ['post_count_month_each5'+str(i) for i in range(1,19)] # 每五个月
        columns_each6 = ['post_count_month_each6'+str(i) for i in range(1,8)]# 每六个月
        for i in range(len(columns_each1)):
            df[columns_each1[i]] = all_month[i]
        for i in range(11):
            df[columns_each2[i]] = df[columns_each1[i:i+2]].sum(axis=1)           
        for i in range(10):
            df[columns_each3[i]] = df[columns_each1[i:i+3]].sum(axis=1)
        for i in range(9):
            df[columns_each4[i]] = df[columns_each1[i:i+4]].sum(axis=1)  
        for i in range(8):
            df[columns_each5[i]] = df[columns_each1[i:i+5]].sum(axis=1)              
        for i in range(7):
            df[columns_each6[i]] = df[columns_each1[i:i+6]].sum(axis=1)
        df.to_csv(csv_path, index=False)
    return df

post_feature2 = get_feature_post2(post)
post_feature2[:10]

Unnamed: 0,userid,post_count_month_each11,post_count_month_each12,post_count_month_each13,post_count_month_each14,post_count_month_each15,post_count_month_each16,post_count_month_each17,post_count_month_each18,post_count_month_each19,...,post_count_month_each56,post_count_month_each57,post_count_month_each58,post_count_month_each61,post_count_month_each62,post_count_month_each63,post_count_month_each64,post_count_month_each65,post_count_month_each66,post_count_month_each67
0,U0000001,0,0,0,0,1,0,0,0,0,...,5,24,25,1,1,1,1,6,24,25
1,U0000002,0,0,0,0,0,0,1,2,0,...,3,3,3,0,1,3,3,3,3,4
2,U0000003,0,3,0,0,0,0,0,0,0,...,0,0,0,3,3,0,0,0,0,0
3,U0000004,2,0,0,0,0,0,0,0,0,...,0,0,0,2,0,0,0,0,0,0
4,U0000005,0,0,17,20,15,13,16,11,8,...,62,69,75,65,81,92,83,77,82,91
5,U0000006,0,0,0,1,0,0,0,0,0,...,0,0,0,1,1,1,1,0,0,0
6,U0000007,2,0,9,7,0,0,3,1,1,...,5,5,2,18,19,20,12,5,5,5
7,U0000008,0,0,0,0,0,0,0,0,0,...,0,4,10,0,0,0,0,0,4,10
8,U0000011,14,2,10,16,9,8,21,12,10,...,58,60,50,59,66,76,76,67,68,71
9,U0000013,0,0,0,0,2,2,0,0,0,...,2,0,0,4,4,4,4,4,2,0


### 行为数据提取特征：发表博客3

In [9]:
def IF_Weekend(line): #周末、晚上发表博客的次数
    s = datetime.strptime(line, '%Y-%m-%d %H:%M:%S')
    s = s.weekday()
    if (s == 5) or (s == 6):
        return 1
    else:
        return 0

def Weekend_count(line):
    try:
        c = gg_weekday[line][gg_weekday[line]['weekday']==1]['count'].values[0]
    except:
        c = 0
    return c

def nonWeekend_count(line):
    try:
        c = gg_weekday[line][gg_weekday[line]['weekday']==0]['count'].values[0]
    except:
        c = 0
    return c

def evening_count(line):
    try:
        c = gg_evening[line][gg_evening[line]['evening']==1]['count'].values[0]
    except:
        c = 0
    return c

def nonevening_count(line):
    try:
        c = gg_evening[line][gg_evening[line]['evening']==0]['count'].values[0]
    except:
        c = 0
    return c

def get_feature(post):
    csv_path = 'F:/文档/zhangshuo/SMPCUP2017数据集/post_feature3.csv'
    if os.path.exists(csv_path):
        post_df = pd.read_csv(csv_path)
    else:
        post = pd.read_csv(post)
        post['time'] = post['time'].str.split('.').str[0]
        post['hour'] = post['time'].str.slice(11,13).astype('int')
        post['evening'] = post['hour']>17
        post['evening'] = post['evening'].map({True:1,False:0})
        post['weekday'] = post['time'].map(IF_Weekend)
        post['count'] = 1

        post_evening = post.groupby(['userid','evening'])['count'].sum().reset_index()
        post_weekday = post.groupby(['userid','weekday'])['count'].sum().reset_index()
        gg_evening = dict(list(post_evening.groupby('userid')))
        gg_weekday = dict(list(post_weekday.groupby('userid')))
        userid = pd.unique(post['userid'])

        post_df = pd.DataFrame()
        post_df['userid'] = userid
        post_df['post_evening_count'] = post_df['userid'].map(evening_count)
        post_df['post_nonevening_count'] = post_df['userid'].map(nonevening_count)
        post_df['post_weekend_count'] = post_df['userid'].map(Weekend_count)
        post_df['post_nonweekend_count'] = post_df['userid'].map(nonWeekend_count)
        post_df.to_csv(csv_path, index=False)
    return post_df

#post_df = get_feature('Post.csv')
post_df = pd.read_csv('post_feature3.csv')
post_df.head(10)

Unnamed: 0,userid,post_evening_count,post_nonevening_count,post_weekend_count,post_nonweekend_count
0,U0024827,4,43,0,47
1,U0042700,0,1,1,0
2,U0107217,18,20,20,18
3,U0148074,155,238,98,295
4,U0083747,111,144,81,174
5,U0012666,34,89,31,92
6,U0134425,3,4,1,6
7,U0050880,1,0,1,0
8,U0087326,6,26,5,27
9,U0035101,2,0,0,2


### 行为数据提取特征：浏览博客1

In [7]:
def get_feature_browse1(browse): # 浏览帖子的数量、浏览时间
    csv_path = u'F:/文档/zhangshuo/SMPCUP2017数据集/browse_feature1.csv'
    if os.path.exists(csv_path):
        df = pd.read_csv(csv_path)
    else:
        browse['time'] = browse['time'].map(time_stamp)
        browse.sort_values(['userid','contentid','time'], inplace = True)
        browse.index = range(len(browse))
        first_browse = browse.drop_duplicates(['userid','contentid'],keep='first')
        last_browse = browse.drop_duplicates(['userid','contentid'],keep='last')
        gg = browse.groupby(['userid','contentid'])
        df = pd.DataFrame()
        df['userid'] = first_browse['userid'].values
        df['browse_contentid'] = first_browse['contentid'].values
        df['browse_count'] = list(gg['time'].size())
        df['browse_time'] = last_browse['time'].values - first_browse['time'].values
        df.fillna(0, inplace=True)
        df.to_csv(csv_path, index=False)
    return df   

browse_feature1 = get_feature_browse1(browse)
browse_feature1[:10]

Unnamed: 0,userid,browse_contentid,browse_count,browse_time
0,1,87949,1,0
1,1,90462,2,1022066
2,1,106694,1,0
3,1,109238,4,134740
4,1,120422,2,168852
5,1,121164,1,0
6,1,131358,2,169
7,1,139383,2,137830
8,1,141263,1,0
9,1,151517,2,1458


**<font color=red>contentid：用户浏览的博客<br>
count：浏览博客的数量<br>
time：浏览博客的时间<br></font>**

### 行为数据提取特征：浏览博客2  方法同发表一样

In [15]:
def get_feature_browse2(browse): # 每个月、二个月、三个月、四个月、五个月、六个月浏览帖子的数量
    csv_path = u'browse_feature2.csv'
    if os.path.exists(csv_path):
        df = pd.read_csv(csv_path)
    else:
        browse['month'] = browse['time'].str.split(' ').str[0].str[4:6].astype('int')
        browse['count'] = 1
        browse = browse.groupby(['userid','month'])['count'].sum().reset_index()
        dic_browse = dict(list(browse.groupby(['userid','month'])['count']))
        df = pd.DataFrame()
        df['userid'] = pd.unique(browse['userid'])        
        all_month = []                    
        for month in range(1,13):
            ll = []
            for user in pd.unique(browse['userid']):            
                try:
                    ll.append(dic_browse[user,month].values[0])
                except:
                    ll.append(0)
            all_month.append(ll)
        columns_each1 = ['browse_count_month_each1'+str(i) for i in range(1,13)] # Ã¿¸öÔÂ
        columns_each2 = ['browse_count_month_each2'+str(i) for i in range(1,12)] # Ã¿¶þ¸öÔÂ
        columns_each3 = ['browse_count_month_each3'+str(i) for i in range(1,11)] # Ã¿Èý¸öÔÂ
        columns_each4 = ['browse_count_month_each4'+str(i) for i in range(1,10)] # Ã¿ËÄ¸öÔÂ
        columns_each5 = ['browse_count_month_each5'+str(i) for i in range(1,9)] # Ã¿Îå¸öÔÂ
        columns_each6 = ['browse_count_month_each6'+str(i) for i in range(1,8)]# Ã¿Áù¸öÔÂ
        for i in range(len(columns_each1)):
            df[columns_each1[i]] = all_month[i]
        df['browse_count'] = df[columns_each1].sum(axis=1)
        for i in range(11):
            df[columns_each2[i]] = df[columns_each1[i:i+2]].sum(axis=1)
        for i in range(10):
            df[columns_each3[i]] = df[columns_each1[i:i+3]].sum(axis=1)
        for i in range(9):
            df[columns_each4[i]] = df[columns_each1[i:i+3]].sum(axis=1)
        for i in range(8):
            df[columns_each5[i]] = df[columns_each1[i:i+3]].sum(axis=1)            
        for i in range(7):
            df[columns_each6[i]] = df[columns_each1[i:i+6]].sum(axis=1)
        df.to_csv(csv_path, index=False)
    return df

browse_feature2 = get_feature_browse2(browse)
browse_feature2[:10]

Unnamed: 0,userid,browse_count_month_each11,browse_count_month_each12,browse_count_month_each13,browse_count_month_each14,browse_count_month_each15,browse_count_month_each16,browse_count_month_each17,browse_count_month_each18,browse_count_month_each19,...,browse_count_month_each56,browse_count_month_each57,browse_count_month_each58,browse_count_month_each61,browse_count_month_each62,browse_count_month_each63,browse_count_month_each64,browse_count_month_each65,browse_count_month_each66,browse_count_month_each67
0,U0000001,0,0,0,0,0,0,0,0,0,...,0,0,5,0,0,0,0,5,31,33
1,U0000002,0,0,0,0,0,0,1,1,1,...,2,3,2,0,1,2,3,3,3,3
2,U0000003,3,1,0,0,0,0,1,0,0,...,1,1,3,4,2,1,1,4,4,4
3,U0000004,2,0,0,0,0,0,0,0,0,...,0,0,0,2,0,0,0,0,0,0
4,U0000005,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,4,32
5,U0000006,0,0,1,1,0,0,1,0,0,...,1,1,0,2,3,3,2,1,1,1
6,U0000007,0,0,9,7,3,0,5,4,6,...,9,15,10,19,24,28,25,18,15,15
7,U0000008,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,3,9
8,U0000009,0,0,0,0,0,0,13,6,2,...,19,21,8,0,13,19,21,21,21,21
9,U0000010,32,21,13,5,4,22,17,21,6,...,60,44,28,97,82,82,75,71,67,45


### 行为数据提取特征：浏览博客3

In [9]:
def get_feature_browse3(browse_feature1): # 浏览帖子的时间间隔统计量
    csv_path = u'browse_feature3.csv'
    if os.path.exists(csv_path):
        df = pd.read_csv(csv_path)
    else:
        df = pd.DataFrame()
        df['userid'] = pd.unique(browse_feature1['userid'])
        gg = browse_feature1.groupby('userid')
        
        df['browse_maxt'] = list(gg['browse_time'].max())
        df['browse_sumt'] = list(gg['browse_time'].sum())
        df['browse_meant'] = list(gg['browse_time'].mean())
        df['browse_vart'] = list(gg['browse_time'].var())
        df.fillna(0, inplace=True)
        df.to_csv(csv_path, index=False)
    return df 

browse_feature3 = get_feature_browse3(browse)
browse_feature3[:10]

Unnamed: 0,userid,browse_maxt,browse_sumt,browse_meant,browse_vart
0,1,1022066,1799464,74977.666667,47267550000.0
1,2,0,0,0.0,0.0
2,3,6556907,6556907,1092817.83333,7165505000000.0
3,4,327,327,327.0,0.0
4,5,74,213,7.607143,382.8399
5,6,0,0,0.0,0.0
6,7,5403224,12613785,741987.352941,2445975000000.0
7,8,37280,45943,7657.166667,222308700.0
8,9,12822,13624,717.052632,8626542.0
9,10,15120456,77612765,834545.860215,6832051000000.0


### 行为数据提取特征：浏览博客4

In [16]:
# 统计夜晚、周末的行为时间  与post一样

### 行为数据提取特征：评论博客1

In [10]:
def get_feature_comment1(comment): # 评论帖子的数量
    csv_path = u'F:/文档/zhangshuo/SMPCUP2017数据集/comment_feature1.csv'
    if os.path.exists(csv_path):
        comment = pd.read_csv(csv_path)
    else:
        comment['time'] = comment['time'].str.split('.').str[0]
        comment['time'] = comment['time'].map(time_stamp)
        comment.sort_values(['userid','contentid','time'], inplace = True)
        comment.index = range(len(comment))
        comment_count = list(comment.groupby(['userid','contentid']).size())
        comment = comment.drop_duplicates(['userid','contentid'])   
        comment['comment_count'] = comment_count
        comment['comment_contentid'] = comment['contentid']
        del comment['time'], comment['contentid']
        comment.to_csv(csv_path, index=False)
    return comment

comment_feature1 = get_feature_comment1(comment)
comment_feature1[:10]

Unnamed: 0,userid,comment_count,comment_contentid
0,37,1,91812
1,37,1,195673
2,37,1,223317
3,37,1,244046
4,37,1,316208
5,37,1,327513
6,37,1,341512
7,37,1,392462
8,37,1,418648
9,37,1,434499


**<font color=red>contentid：用户评论的博客<br>
count：用户评论博客的数量<br>
<br></font>**

### 行为数据提取特征：评论博客2 方法同上

In [20]:
def get_feature_comment2(comment):
    csv_path = u'comment_feature2.csv'
    if os.path.exists(csv_path):
        df = pd.read_csv(csv_path)
    else:
        comment['month'] = comment['time'].str.split('-').str[1].astype('int')
        comment['count'] = 1
        comment = comment.groupby(['userid','month'])['count'].sum().reset_index()
        dic_comment = dict(list(comment.groupby(['userid','month'])['count']))
        df = pd.DataFrame()
        df['userid'] = pd.unique(comment['userid'])        
        all_month = []                    
        for month in range(1,13):
            ll = []
            for user in pd.unique(comment['userid']):            
                try:
                    ll.append(dic_comment[user,month].values[0])
                except:
                    ll.append(0)
            all_month.append(ll)
        columns_each1 = ['comment_count_month_each1'+str(i) for i in range(1,13)] # 每个月
        columns_each2 = ['comment_count_month_each2'+str(i) for i in range(1,12)] # 每二个月
        columns_each3 = ['comment_count_month_each3'+str(i) for i in range(1,11)] # 每三个月
        columns_each4 = ['comment_count_month_each4'+str(i) for i in range(1,10)] # 每四个月
        columns_each5 = ['comment_count_month_each5'+str(i) for i in range(1,9)] # 每五个月
        columns_each6 = ['comment_count_month_each6'+str(i) for i in range(1,8)]# 每六个月
        for i in range(len(columns_each1)):
            df[columns_each1[i]] = all_month[i]
        df['comment_count'] = df[columns_each1].sum(axis=1)
        for i in range(11):
            df[columns_each2[i]] = df[columns_each1[i:i+2]].sum(axis=1)
        for i in range(10):
            df[columns_each3[i]] = df[columns_each1[i:i+3]].sum(axis=1)    
        for i in range(9):
            df[columns_each4[i]] = df[columns_each1[i:i+4]].sum(axis=1)
        for i in range(8):
            df[columns_each5[i]] = df[columns_each1[i:i+5]].sum(axis=1)
        for i in range(7):
            df[columns_each6[i]] = df[columns_each1[i:i+6]].sum(axis=1)
        df.to_csv(csv_path, index=False)
    return df

comment_feature2 = get_feature_comment2(comment)
comment_feature2[:10]

Unnamed: 0,userid,comment_count_month_each11,comment_count_month_each12,comment_count_month_each13,comment_count_month_each14,comment_count_month_each15,comment_count_month_each16,comment_count_month_each17,comment_count_month_each18,comment_count_month_each19,...,comment_count_month_each56,comment_count_month_each57,comment_count_month_each58,comment_count_month_each61,comment_count_month_each62,comment_count_month_each63,comment_count_month_each64,comment_count_month_each65,comment_count_month_each66,comment_count_month_each67
0,U0000037,0,0,0,5,17,0,14,0,0,...,14,14,0,22,36,36,36,31,14,14
1,U0000105,0,0,0,0,0,0,0,0,0,...,0,0,3,0,0,0,0,0,0,3
2,U0000130,1,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
3,U0000177,1,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
4,U0000260,0,0,0,0,0,0,0,0,8,...,8,19,19,0,0,0,8,8,19,19
5,U0000279,0,0,0,0,0,0,0,0,0,...,3,3,3,0,0,0,0,3,3,3
6,U0000286,0,0,0,0,0,0,0,0,4,...,4,4,4,0,0,0,4,4,4,4
7,U0000307,0,0,0,0,0,0,0,3,0,...,3,3,3,0,0,3,3,3,3,3
8,U0000317,0,0,0,0,2,0,0,0,0,...,0,0,0,2,2,2,2,2,0,0
9,U0000323,0,0,0,0,0,0,0,0,0,...,0,0,3,0,0,0,0,0,0,3


### 行为数据提取特征：评论博客3 晚上、周末的浏览量 方法同上

### 行为数据提取特征：收藏博客1

In [12]:
def get_feature_favorite1(favorite): # 收藏帖子的数量、收藏的帖子
    csv_path = u'F:/文档/zhangshuo/SMPCUP2017数据集/favorite_feature1.csv'
    if os.path.exists(csv_path):
       df = pd.read_csv(csv_path)
    else:  
        favorite['time'] = favorite['time'].map(time_stamp)
        gg = favorite.groupby('userid')
        g_dict = dict(list(gg))
        df = pd.DataFrame()
        df['userid'] = pd.unique(favorite['userid'])
        df['favorite_contentid'] = [list(g_dict[i]['contentid']) for i in df['userid']]
        df['favorite_count'] = [len(g_dict[i]['contentid']) for i in df['userid']]
        df.to_csv(csv_path, index=False)
    return df

favorite_feature1 = get_feature_favorite1(favorite)
favorite_feature1[:10]

Unnamed: 0,userid,favorite_contentid,favorite_count
0,14911,"[552113, 696764]",2
1,773,"[39946, 80900, 766615, 766615, 766615, 766615,...",11
2,110434,"[527721, 644282, 683582, 5707, 3737, 597258, 1...",19
3,70951,"[4411, 363179, 108251, 14886]",4
4,62070,"[76629, 18753, 68709, 90089, 36336, 111701, 85...",12
5,30979,"[655667, 523132, 126896, 317348]",4
6,42843,"[636475, 659807, 379474, 52823, 78320]",5
7,9168,"[505642, 657302, 444585, 519634]",4
8,73866,"[36949, 95695, 183069, 260]",4
9,135690,"[131917, 325425, 509174, 358443, 443695, 29121...",28


**<font color=red>contentid：用户收藏的博客<br>
count：用户评论博客的数量<br>
<br></font>**

### 行为数据提取特征：收藏博客2

In [19]:
def get_feature_favorite2(favorite):    
    csv_path = u'favorite_feature2.csv'
    if os.path.exists(csv_path):
        df = pd.read_csv(csv_path)
    else:
        favorite['month'] = favorite['time'].str.split('-').str[1].astype('int')
        favorite['count'] = 1
        favorite = favorite.groupby(['userid','month'])['count'].sum().reset_index()
        dic_favorite = dict(list(favorite.groupby(['userid','month'])['count']))
        df = pd.DataFrame()
        df['userid'] = pd.unique(favorite['userid'])        
        all_month = []                    
        for month in range(1,13):
            ll = []
            for user in pd.unique(favorite['userid']):            
                try:
                    ll.append(dic_favorite[user,month].values[0])
                except:
                    ll.append(0)
            all_month.append(ll)
        columns_each1 = ['favorite_count_month_each1'+str(i) for i in range(1,13)] # 每个月
        columns_each2 = ['favorite_count_month_each2'+str(i) for i in range(1,12)] # 每二个月
        columns_each3 = ['favorite_count_month_each3'+str(i) for i in range(1,11)] # 每三个月
        columns_each4 = ['favorite_count_month_each4'+str(i) for i in range(1,10)] # 每四个月
        columns_each5 = ['favorite_count_month_each5'+str(i) for i in range(1,9)] # 每五个月
        columns_each6 = ['favorite_count_month_each6'+str(i) for i in range(1,8)]# 每六个月
        for i in range(len(columns_each1)):
            df[columns_each1[i]] = all_month[i]
        for i in range(11):
            df[columns_each2[i]] = df[columns_each1[i:i+2]].sum(axis=1)
        for i in range(10):
            df[columns_each3[i]] = df[columns_each1[i:i+3]].sum(axis=1)
        for i in range(9):
            df[columns_each4[i]] = df[columns_each1[i:i+4]].sum(axis=1)
        for i in range(8):
            df[columns_each5[i]] = df[columns_each1[i:i+5]].sum(axis=1)            
        for i in range(7):
            df[columns_each6[i]] = df[columns_each1[i:i+6]].sum(axis=1)
        df.to_csv(csv_path, index=False)
    return df

favorite_feature2 = get_feature_favorite2(favorite)
favorite_feature2[:10]

Unnamed: 0,userid,favorite_count_month_each11,favorite_count_month_each12,favorite_count_month_each13,favorite_count_month_each14,favorite_count_month_each15,favorite_count_month_each16,favorite_count_month_each17,favorite_count_month_each18,favorite_count_month_each19,...,favorite_count_month_each56,favorite_count_month_each57,favorite_count_month_each58,favorite_count_month_each61,favorite_count_month_each62,favorite_count_month_each63,favorite_count_month_each64,favorite_count_month_each65,favorite_count_month_each66,favorite_count_month_each67
0,U0000007,0,0,0,0,0,0,0,3,0,...,3,3,3,0,0,3,3,3,3,3
1,U0000011,0,0,0,0,0,0,0,0,0,...,0,1,1,0,0,0,0,0,1,1
2,U0000014,0,0,5,0,0,8,0,0,0,...,8,0,0,13,13,13,8,8,8,0
3,U0000023,0,0,0,0,0,0,0,0,0,...,0,4,4,0,0,0,0,0,4,4
4,U0000030,0,0,0,0,0,0,1,0,0,...,1,1,0,0,1,1,1,1,1,1
5,U0000031,0,0,0,0,0,0,0,0,0,...,0,1,3,0,0,0,0,0,1,3
6,U0000037,0,1,0,2,0,0,0,0,0,...,0,0,0,3,3,2,2,0,0,0
7,U0000042,0,0,0,0,0,0,0,0,0,...,3,3,3,0,0,0,0,3,3,3
8,U0000044,0,1,0,0,0,0,0,0,0,...,0,0,0,1,1,0,0,0,0,0
9,U0000066,0,0,0,0,1,0,0,0,0,...,0,0,0,1,1,1,1,1,0,0


### 行为数据提取特征：收藏博客3  晚上、周末的浏览量 方法同上

### 行为数据提取特征：点赞博客1   

In [14]:
def get_feature_voteup1(voteup): # 点赞的博客、点赞的数量
    csv_path = u'F:/文档/zhangshuo/SMPCUP2017数据集/voteup_feature1.csv'
    if os.path.exists(csv_path):
        df = pd.read_csv(csv_path) 
    else:
        voteup['time'] = voteup['time'].map(time_stamp)
        gg = voteup.groupby('userid')
        dic_gg = dict(list(gg))
        df = pd.DataFrame()
        df['userid'] = pd.unique(voteup['userid'])
        df['voteup_contentid'] = [list(dic_gg[i]['contentid']) for i in df['userid']]
        df['voteup_count'] = [len(dic_gg[i]['contentid']) for i in df['userid']]
        df.to_csv(csv_path, index=False)
    return df

voteup_feature1 = get_feature_voteup1(voteup)
voteup_feature1[:10]

Unnamed: 0,userid,voteup_contentid,voteup_count
0,111639,"[627490, 743941, 168160, 853209, 642075, 61149...",36
1,43080,"[848333, 369330, 336867, 677858, 476029, 67726...",28
2,27205,"[246166, 144359, 177310, 264211, 796829, 28938...",147
3,103742,"[145377, 177767, 68273, 207921, 878629, 282305...",167
4,90244,"[430112, 441076, 391276, 905837]",4
5,125180,[713482],1
6,113626,"[683835, 841428, 686960, 719076, 647884, 83894...",89
7,140406,"[141295, 558647, 327327, 198585, 999449, 14114...",79
8,44720,"[683451, 317919, 536948, 723014, 692337, 85038...",77
9,123390,"[168474, 857310, 595026, 700724, 877621, 31381...",94


### 行为数据提取特征：点赞博客2   

In [21]:
def get_feature_voteup2(voteup):
    csv_path = u'voteup_feature2.csv'
    if os.path.exists(csv_path):
        df = pd.read_csv(csv_path) 
    else:
        voteup['month'] = voteup['time'].str.split('-').str[1].astype('int')
        voteup['count'] = 1
        voteup = voteup.groupby(['userid','month'])['count'].sum().reset_index()
        dic_voteup = dict(list(voteup.groupby(['userid','month'])['count']))
        df = pd.DataFrame()
        df['userid'] = pd.unique(voteup['userid'])        
        all_month = []                    
        for month in range(1,13):
            ll = []
            for user in pd.unique(voteup['userid']):            
                try:
                    ll.append(dic_voteup[user,month].values[0])
                except:
                    ll.append(0)
            all_month.append(ll)
        columns_each1 = ['voteup_count_month_each1'+str(i) for i in range(1,13)] # 每个月
        columns_each2 = ['voteup_count_month_each2'+str(i) for i in range(1,12)] # 每二个月
        columns_each3 = ['voteupe_count_month_each3'+str(i) for i in range(1,11)] # 每三个月
        columns_each4 = ['voteup_count_month_each4'+str(i) for i in range(1,10)] # 每四个月
        columns_each5 = ['voteup_count_month_each5'+str(i) for i in range(1,9)] # 每五个月
        columns_each6 = ['voteup_count_month_each6'+str(i) for i in range(1,8)]# 每六个月
        for i in range(len(columns_each1)):
            df[columns_each1[i]] = all_month[i]
        for i in range(11):
            df[columns_each2[i]] = df[columns_each1[i:i+2]].sum(axis=1)
        for i in range(10):
            df[columns_each3[i]] = df[columns_each1[i:i+3]].sum(axis=1)
        for i in range(9):
            df[columns_each4[i]] = df[columns_each1[i:i+4]].sum(axis=1)
        for i in range(8):
            df[columns_each5[i]] = df[columns_each1[i:i+5]].sum(axis=1)            
        for i in range(7):
            df[columns_each6[i]] = df[columns_each1[i:i+6]].sum(axis=1)
        df.to_csv(csv_path, index=False)
    return df

voteup_feature2 = get_feature_voteup2(voteup)
voteup_feature2[:10]

Unnamed: 0,userid,voteup_count_month_each11,voteup_count_month_each12,voteup_count_month_each13,voteup_count_month_each14,voteup_count_month_each15,voteup_count_month_each16,voteup_count_month_each17,voteup_count_month_each18,voteup_count_month_each19,...,voteup_count_month_each56,voteup_count_month_each57,voteup_count_month_each58,voteup_count_month_each61,voteup_count_month_each62,voteup_count_month_each63,voteup_count_month_each64,voteup_count_month_each65,voteup_count_month_each66,voteup_count_month_each67
0,U0000002,0,0,0,0,0,0,0,2,0,...,2,2,2,0,0,2,2,2,2,2
1,U0000010,2,0,0,0,0,0,0,0,0,...,0,0,0,2,0,0,0,0,0,0
2,U0000011,0,0,0,0,0,0,0,1,0,...,1,1,1,0,0,1,1,1,1,1
3,U0000016,0,0,0,0,0,0,0,0,0,...,0,1,2,0,0,0,0,0,1,2
4,U0000022,0,0,1,0,0,0,0,1,0,...,2,2,2,1,1,2,1,2,2,2
5,U0000031,0,0,0,0,1,0,0,0,0,...,0,0,0,1,1,1,1,1,0,0
6,U0000042,0,0,0,0,1,0,0,0,0,...,0,0,0,1,1,1,1,1,0,0
7,U0000045,0,0,0,0,0,0,0,0,0,...,0,1,2,0,0,0,0,0,1,2
8,U0000046,0,0,0,0,0,0,0,1,0,...,1,1,1,0,0,1,1,1,1,1
9,U0000049,0,0,0,1,0,0,0,0,0,...,1,1,1,1,1,1,1,1,1,1


### 行为数据提取特征：点赞博客3  晚上、周末的浏览量 方法同上

### 行为数据提取特征：点踩博客1   

In [16]:
def get_feature_votedown1(votedown):
    csv_path = u'F:/文档/zhangshuo/SMPCUP2017数据集/votedown_feature1.csv'
    if os.path.exists(csv_path):
        df = pd.read_csv(csv_path) 
    else:
        votedown['time'] = votedown['time'].map(time_stamp)
        gg = votedown.groupby('userid')
        dic_gg = dict(list(gg))
        df = pd.DataFrame()
        df['userid'] = pd.unique(votedown['userid'])
        df['votedown_contentid'] = [list(dic_gg[i]['contentid']) for i in df['userid']]
        df['votedown_count'] = [len(dic_gg[i]['contentid']) for i in df['userid']]
        df.to_csv(csv_path, index=False)
    return df

votedown_feature1 = get_feature_votedown1(votedown)
votedown_feature1[:10]

Unnamed: 0,userid,votedown_contentid,votedown_count
0,19111,[582423],1
1,107086,"[361451, 266519, 357065, 244357, 983809, 27885...",330
2,86803,"[240354, 791525, 73048, 550507, 831075, 574497...",20
3,9882,[977807],1
4,104831,"[417942, 213293, 175297, 603158, 767838, 13851...",161
5,108810,"[68401, 376141, 838471, 339535, 442305, 294139...",28
6,130596,"[820655, 457198, 735195]",3
7,117083,"[79718, 621331, 265685, 298196, 453623, 981986...",249
8,131552,"[703768, 745190, 747689, 878566, 786931, 70736...",7
9,157386,"[255993, 22482, 59643]",3


### 行为数据提取特征：点踩博客2  

In [22]:
def get_feature_votedown2(votedown):
    csv_path = u'votedown_feature2.csv'
    if os.path.exists(csv_path):
        df = pd.read_csv(csv_path) 
    else:
        votedown['month'] = votedown['time'].str.split('-').str[1].astype('int')
        votedown['count'] = 1
        votedown = votedown.groupby(['userid','month'])['count'].sum().reset_index()
        dic_votedown = dict(list(votedown.groupby(['userid','month'])['count']))
        df = pd.DataFrame()
        df['userid'] = pd.unique(votedown['userid'])        
        all_month = []                    
        for month in range(1,13):
            ll = []
            for user in pd.unique(votedown['userid']):            
                try:
                    ll.append(dic_votedown[user,month].values[0])
                except:
                    ll.append(0)
            all_month.append(ll)
        columns_each1 = ['votedown_count_month_each1'+str(i) for i in range(1,13)] # 每个月
        columns_each2 = ['votedown_count_month_each2'+str(i) for i in range(1,12)] # 每二个月
        columns_each3 = ['votedowne_count_month_each3'+str(i) for i in range(1,11)] # 每三个月
        columns_each4 = ['votedown_count_month_each4'+str(i) for i in range(1,10)] # 每四个月
        columns_each5 = ['votedown_count_month_each5'+str(i) for i in range(1,9)] # 每五个月
        columns_each6 = ['votedown_count_month_each6'+str(i) for i in range(1,8)]# 每六个月
        for i in range(len(columns_each1)):
            df[columns_each1[i]] = all_month[i]
        for i in range(11):
            df[columns_each2[i]] = df[columns_each1[i:i+2]].sum(axis=1)
        for i in range(10):
            df[columns_each3[i]] = df[columns_each1[i:i+3]].sum(axis=1)
        for i in range(9):
            df[columns_each4[i]] = df[columns_each1[i:i+4]].sum(axis=1)
        for i in range(8):
            df[columns_each5[i]] = df[columns_each1[i:i+5]].sum(axis=1)            
        for i in range(7):
            df[columns_each6[i]] = df[columns_each1[i:i+6]].sum(axis=1)
        df.to_csv(csv_path, index=False)
    return df

votedown_feature2 = get_feature_votedown2(votedown)
votedown_feature2[:10]

Unnamed: 0,userid,votedown_count_month_each11,votedown_count_month_each12,votedown_count_month_each13,votedown_count_month_each14,votedown_count_month_each15,votedown_count_month_each16,votedown_count_month_each17,votedown_count_month_each18,votedown_count_month_each19,...,votedown_count_month_each56,votedown_count_month_each57,votedown_count_month_each58,votedown_count_month_each61,votedown_count_month_each62,votedown_count_month_each63,votedown_count_month_each64,votedown_count_month_each65,votedown_count_month_each66,votedown_count_month_each67
0,U0000075,0,0,0,0,0,0,1,0,0,...,1,1,0,0,1,1,1,1,1,1
1,U0000105,0,0,0,0,0,0,0,1,0,...,1,1,1,0,0,1,1,1,1,1
2,U0000112,0,0,0,0,0,0,0,0,1,...,1,1,1,0,0,0,1,1,1,1
3,U0000129,0,0,0,1,0,0,0,0,0,...,0,1,1,1,1,1,1,0,1,1
4,U0000209,0,0,0,0,0,0,0,0,1,...,1,1,1,0,0,0,1,1,1,1
5,U0000254,0,0,0,1,0,0,0,0,0,...,0,0,0,1,1,1,1,0,0,0
6,U0000260,0,0,0,0,0,0,0,0,0,...,0,1,1,0,0,0,0,0,1,1
7,U0000307,0,0,1,0,0,0,0,0,0,...,0,0,0,1,1,1,0,0,0,0
8,U0000314,0,0,0,0,0,0,0,0,0,...,1,1,1,0,0,0,0,1,1,1
9,U0000526,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,1


### 行为数据提取特征：点踩博客3  晚上、周末的浏览量 方法同上

### 社交数据提取特征：入度、出度、是否构成三角形、四边形、

In [26]:
def load_data():
    df = pd.read_csv('8_Follow.txt', header=None, sep='\001', names=['src','dst'])
#    df = pd.read_csv('9_Letter.txt', header=None, sep='\001', names=['src','dst','time'])
    G = nx.from_pandas_dataframe(df, 'src', 'dst', create_using=nx.DiGraph())
    return df, G
    
def In_degree(node):
    return in_degree[node]

def Out_degree(node):
    return out_degree[node]
    
def friend_out_measure(edges):   
    try:    
        edges = eval(edges)
    except:
        pass
    node1 = edges[0]
    node2 = edges[1]
    c = 0
    if node1 == [] or node2 == []:
        return 0
    else:
        f1 = G.neighbors(node1)
        f2 = G.neighbors(node2) 
        for i in f1:
            for j in f2:
                if str((f1, f2)) in edges:
                    c += 1
    return c

def friend_in_measure(edges):  
    try:    
        edges = eval(edges)
    except:
        pass
    node1 = edges[0]
    node2 = edges[1]
    c = 0
    if node1 == [] or node2 == []:
        return 0
    else:
        f1 = G.neighbors(node1)
        f2 = G.neighbors(node2) 
        for i in f1:
            for j in f2:
                if str((f1, f2)) in edges:
                    c += 1
    return c
    
def common_friend(edges):
    try:    
        edges = eval(edges)
    except:
        pass
    node1 = edges[0]
    node2 = edges[1]
    f1 = G.neighbors(node1)
    f2 = G.neighbors(node2) 
    return len(set(f1) & set(f2))
    
def total_friend(edges):
    try:    
        edges = eval(edges)
    except:
        pass
    node1 = edges[0]
    node2 = edges[1]
    f1 = G.neighbors(node1)
    f2 = G.neighbors(node2)  
    return len(set(f1) | set(f2))

def neighbor_in_degree(node):
    neighbors = G.neighbors(node)
    ll = []
    for neighbor in neighbors:
        ll.append(In_degree(neighbor))
    return ll

def neighbor_out_degree(node):
    neighbors = G.neighbors(node)
    ll = []
    for neighbor in neighbors:
        ll.append(Out_degree(neighbor))
    return ll

def MAX(node):
    try:
        return max(node)
    except:
        return 0

def MIN(node):
    try:
        return min(node)
    except:
        return 0  

def group3(node):
    neighbors = G.neighbors(node)
    c = 0
    for i in neighbors:
        if node in G.neighbors(i):
            c += 1
    return c

def group4(node):
    neighbors = G.neighbors(node)
    c = 0
    for i in neighbors:
        for j in G.neighbors(i):
            if node in G.neighbors(j):
                c += 1
    return c

def main():
    csv_path = u'net.csv'
    if os.path.exists(csv_path):
        net = pd.read_csv(csv_path) 
    else:
        df, G = load_data()
        degree = G.degree()
        in_degree = G.in_degree()
        out_degree = G.out_degree()  
        nodes = G.nodes()
        net = pd.DataFrame()
        net['userid'] = nodes
        net['group3_num'] = net['userid'].map(group3)
        net['group4_num'] = net['userid'].map(group4)
        net.fillna(0, inplace=True)
        net['in_degree'] = net['userid'].map(In_degree)
        net['out_degree'] = net['userid'].map(Out_degree) 
        net['degree'] = net['in_degree'] + net['out_degree']
        net['neighbor_in_degree'] = net['userid'].map(neighbor_in_degree)
        net['neighbor_out_degree'] = net['userid'].map(neighbor_out_degree)
        net['neighbor_degree'] = net['neighbor_in_degree'] + net['neighbor_out_degree']
        net['neighbor_in_degree_sum'] = net['neighbor_in_degree'].map(sum)
        net['neighbor_in_degree_max'] = net['neighbor_in_degree'].map(MAX)
        net['neighbor_in_degree_min'] = net['neighbor_in_degree'].map(MIN)
        net['neighbor_in_degree_mean'] = net['neighbor_in_degree'].map(np.mean)
        net['neighbor_in_degree_median'] = net['neighbor_in_degree'].map(np.median)
        net['neighbor_in_degree_var'] = net['neighbor_in_degree'].map(np.var)
        net['neighbor_out_degree_sum'] = net['neighbor_out_degree'].map(sum)
        net['neighbor_out_degree_max'] = net['neighbor_out_degree'].map(MAX)
        net['neighbor_out_degree_min'] = net['neighbor_out_degree'].map(MIN)
        net['neighbor_out_degree_mean'] = net['neighbor_out_degree'].map(np.mean)
        net['neighbor_out_degree_median'] = net['neighbor_out_degree'].map(np.median)
        net['neighbor_out_degree_var'] = net['neighbor_out_degree'].map(np.var)
        net['neighbor_degree_sum'] = net['neighbor_degree'].map(sum)
        net['neighbor_degree_max'] = net['neighbor_degree'].map(MAX)
        net['neighbor_degree_min'] = net['neighbor_degree'].map(MIN)
        net['neighbor_degree_mean'] = net['neighbor_degree'].map(np.mean)
        net['neighbor_degree_median'] = net['neighbor_degree'].map(np.median)
        net['neighbor_degree_var'] = net['neighbor_degree'].map(np.var)
        net.fillna(0,inplace=True)
        net.to_csv(csv_path,index=False,encoding='utf8')
    return net

net = main()
net.head(50)

Unnamed: 0,userid,group3_num,group3num_rate,group4_num,group4num_rate,if_group3,if_group4,in_degree,out_degree,degree,...,neighbor_out_degree_min,neighbor_out_degree_mean,neighbor_out_degree_median,neighbor_out_degree_var,neighbor_degree_sum,neighbor_degree_max,neighbor_degree_min,neighbor_degree_mean,neighbor_degree_median,neighbor_degree_var
0,U0000001,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,U0000002,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,U0000003,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,U0000004,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,U0000005,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,3.0,...,0.0,1.333333,1.0,1.555556,381.0,262.0,0.0,63.5,6.5,9264.25
5,U0000006,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,...,0.0,0.0,0.0,0.0,12.0,12.0,0.0,6.0,6.0,36.0
6,U0000007,0.0,0.0,0.0,0.0,0.0,0.0,1.0,3.0,4.0,...,0.0,5.666667,6.0,20.222222,1182.0,624.0,0.0,197.0,12.0,72605.333333
7,U0000008,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,U0000009,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,U0000010,0.0,0.0,0.0,0.0,0.0,0.0,34.0,0.0,34.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### 社交数据提取特征：私信  

In [28]:
def get_letter_feature(letter): # 发信数量，收信数量、是否既发信又收信、比率
    csv_path = u'letter_feature.csv'
    if os.path.exists(csv_path):
        letter= pd.read_csv(csv_path)
    else:
        letter = pd.read_csv('9_letter.txt', header=None, sep='\001', names=['src','dst','time'])[['src','dst']]
        letter['count'] = 1
        faxin = letter.groupby('src')['count'].sum().reset_index()
        faxin.columns = ['userid','faxin_count']
        shouxin  = letter.groupby('dst')['count'].sum().reset_index()
        shouxin.columns = ['userid','shouxin_count']
        letter = pd.merge(faxin, shouxin, on='userid' ,how='outer')
        letter['sandf'] = letter['faxin_count'] + letter['shouxin_count']
        letter.fillna(0, inplace=True)
        letter['rate_faxin'] = letter['faxin_count'] / letter['faxin_count'].sum()
        letter['rate_shouxin'] = letter['shouxin_count'] / letter['shouxin_count'].sum()
        letter['shouxin_rate'] = letter['shouxin_count'] / letter['sandf']
        letter['faxin_rate'] = letter['faxin_count'] / letter['sandf']
        letter.fillna(0, inplace=True)
        letter.to_csv(csv_path,index=False,encoding='utf8')
    return letter

letter = get_letter_feature(letter)
letter.head(50)

Unnamed: 0,userid,faxin_count,shouxin_count,sandf,rate_faxin,rate_shouxin,shouxin_rate,faxin_rate
0,U0000001,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,U0000002,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,U0000003,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,U0000004,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,U0000005,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,U0000006,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,U0000007,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,U0000008,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,U0000009,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,U0000010,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### 浏览后评论的博客

In [31]:
def bro_com():
    csv_path  = 'browse-comment.csv'
    if os.path.exists(csv_path):
        df = pd.read_csv(csv_path)
    else:
        bro = pd.read_csv('Browse.csv')[['userid','contentid']]
        bro = bro.groupby('userid')['contentid'].agg(' '.join).reset_index()
        bro['contentid'] = bro['contentid'].apply(lambda x:set(x.split(' ')))
        com = pd.read_csv('Comment.csv')[['userid','contentid']]
        com = com.groupby('userid')['contentid'].agg(' '.join).reset_index()
        com['contentid'] = com['contentid'].apply(lambda x:set(x.split(' ')))
        
        df = pd.merge(bro,com,on='userid')
        df.columns = ['userid','bro_contentid','contentid']
        df['num_bro'] = df['bro_contentid'].map(len)
        ll = [len(df['bro_contentid'][i] & df['contentid'][i]) for i in range(len(df))]
        df['bro_com'] = ll
        df['bro_com_rate'] = df['bro_com'] / df['num_bro']
        df.to_csv(csv_path,index=False,encoding='utf8')
    return df

bro = bro_com()
bro.head(10)

Unnamed: 0,userid,bro_contentid,contentid,bro_com,num_bro,bro_com_rate
0,U0000037,"{'D0111640', 'D0481808', 'D0811906', 'D0418648...","{'D0327513', 'D0481808', 'D0392462', 'D0434499...",13,27,0.481481
1,U0000105,"{'D0163728', 'D0020861', 'D0828490', 'D0102889...",{'D0027672'},0,16,0.0
2,U0000130,"{'D0765669', 'D0463634', 'D0669370', 'D0505318...",{'D0952240'},0,19,0.0
3,U0000177,"{'D0113918', 'D0175297', 'D0674302', 'D0196612...",{'D0100032'},1,28,0.035714
4,U0000260,"{'D0436601', 'D0331782', 'D0520763', 'D0845720...","{'D0800690', 'D0331782', 'D0520763', 'D0819109...",5,8,0.625
5,U0000279,"{'D0345181', 'D0047057', 'D0857313', 'D0625812...",{'D0016623'},0,37,0.0
6,U0000286,"{'D0621623', 'D0312841', 'D0628774', 'D0335257...","{'D0446246', 'D0621623', 'D0628774'}",3,10,0.3
7,U0000307,"{'D0297553', 'D0515095', 'D0220283', 'D0934472...","{'D0610565', 'D0472920'}",1,128,0.007812
8,U0000317,"{'D0074110', 'D0099979', 'D0285843', 'D0176560...","{'D0301338', 'D0385855'}",2,82,0.02439
9,U0000323,"{'D0056024', 'D0128370', 'D0044858', 'D0211795...","{'D0056024', 'D0128370'}",2,34,0.058824


In [30]:
def day(): #由于一年之中并不是每天都产生行为数据、因此通过用户发表时间和浏览时间确定用户的有效天数
    csv_path  = 'day.csv'
    if os.path.exists(csv_path):
        day = pd.read_csv(csv_path)
    else:
        post = pd.read_csv('Post.csv')[['userid','time']]
        post['time'] = post['time'].apply(lambda x:x.split(' ')[0])
        post['time'] = post['time'].apply(lambda x:''.join(x.split('-')))
        post.drop_duplicates(['userid','time'], inplace=True)
        post['count'] = 1
        #post = post.groupby('userid')['count'].sum()
        bro = pd.read_csv('Browse.csv')[['userid','time']]
        bro['time'] = bro['time'].apply(lambda x:x.split(' ')[0])
        bro.drop_duplicates(['userid','time'], inplace=True)
        bro['count'] = 1
        
        day = pd.concat([post, bro],ignore_index=True)
        day = day.groupby('userid')['count'].sum().reset_index()
        day.to_csv('day.csv', index=False, encoding='utf8')
    return day

day = day()
day.head(50)

Unnamed: 0,userid,day
0,U0000001,39
1,U0000002,7
2,U0000003,7
3,U0000004,3
4,U0000005,182
5,U0000006,4
6,U0000007,34
7,U0000008,13
8,U0000009,7
9,U0000010,57


### 发表过的微博被浏览、评论的统计量

In [33]:
def cha(file):
    df = pd.read_csv(file)[['userid','contentid']]
    df['count'] = 1
    df = df.groupby(['contentid'])['count'].sum().reset_index()
    return df

def main():
    csv_path  = 'content_count_feature.csv'
    if os.path.exists(csv_path):
        df = pd.read_csv(csv_path)
    else:
        p = pd.read_csv('Post.csv')
        b = cha('Browse.csv')
        c = cha('Comment.csv')
        f = cha('Favorite.csv')
        u = cha('Vote-up.csv')
        d = cha('Vote-down.csv')
        
        p = p.merge(b, on='contentid', how='left').merge(c, on='contentid', how='left').merge(f, on='contentid', how='left').merge(u, on='contentid', how='left').merge(d, on='contentid', how='left')
        p.fillna(0, inplace=True)
        del p['time']
        p.columns = ['userid','contentid','b_contentid','c_contentid','f_contentid','u_contentid','d_contentid']
        p['sum_contentid'] = p[p.columns[2:]].sum(axis=1)
        df = pd.DataFrame()
        df['userid'] = pd.unique(p.userid)
        for i in ['b', 'c', 'f', 'u', 'd', 'sum']:
            df[i+'_count_sum'] = list(p.groupby('userid')[i+'_contentid'].sum())
            df[i+'_count_max'] = list(p.groupby('userid')[i+'_contentid'].max())
            df[i+'_count_min'] = list(p.groupby('userid')[i+'_contentid'].min())
            df[i+'_count_mean'] = list(p.groupby('userid')[i+'_contentid'].mean())
            df[i+'_count_median'] = list(p.groupby('userid')[i+'_contentid'].median())
            df[i+'_count_var'] = list(p.groupby('userid')[i+'_contentid'].var())
            
        df.fillna(0, inplace=True)
        df.to_csv('content_count_feature.csv', index=False, encoding='utf8')
    return df

df = main()
df.head(10)

Unnamed: 0,b_count_sum,b_count_max,b_count_min,b_count_mean,b_count_median,b_count_var,c_count_sum,c_count_max,c_count_min,c_count_mean,...,d_count_mean,d_count_median,d_count_var,sum_count_sum,sum_count_max,sum_count_min,sum_count_mean,sum_count_median,sum_count_var,userid
0,32.0,4.0,0.0,1.230769,1.0,0.664615,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,32.0,4.0,0.0,1.230769,1.0,0.664615,U0024827
1,1.0,1.0,0.0,0.25,0.0,0.25,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,2.0,2.0,0.0,0.5,0.0,1.0,U0042700
2,1.0,1.0,0.0,0.333333,0.0,0.333333,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,1.0,0.0,0.333333,0.0,0.333333,U0107217
3,2.0,2.0,0.0,1.0,1.0,2.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,2.0,2.0,0.0,1.0,1.0,2.0,U0148074
4,267.0,33.0,0.0,1.711538,1.0,11.187221,2.0,1.0,0.0,0.012821,...,0.038462,0.0,0.050124,284.0,35.0,0.0,1.820513,1.0,13.399835,U0083747
5,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,U0012666
6,34.0,9.0,0.0,1.478261,1.0,3.897233,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,35.0,9.0,0.0,1.521739,1.0,3.806324,U0134425
7,9.0,2.0,0.0,0.9,1.0,0.766667,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,9.0,2.0,0.0,0.9,1.0,0.766667,U0050880
8,1186.0,95.0,0.0,9.123077,3.0,273.333572,0.0,0.0,0.0,0.0,...,0.038462,0.0,0.052773,1229.0,97.0,0.0,9.453846,3.0,300.311807,U0087326
9,2.0,2.0,0.0,0.5,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,2.0,2.0,0.0,0.5,0.0,1.0,U0035101


### 特征合并

In [34]:
# 尽量一个个匹配、防止出现全为0的情况
def merge():
    csv_path  = 'task3_feature.csv'
    if os.path.exists(csv_path):
        df = pd.read_csv(csv_path)
    else:
        userid = pd.read_csv('userid.csv')

        letter = pd.read_csv('letter_feature.csv')
        day = pd.read_csv('day.csv')
        letter = pd.merge(userid,letter,on='userid',how='left')
        net = pd.read_csv('net.csv')
        net = pd.merge(userid,net,on='userid',how='left')
        content_count = pd.read_csv('content_count_feature.csv')
        content_count = pd.merge(userid,content_count,on='userid',how='left')
        bc = pd.read_csv('browse-comment.csv')
        bc = pd.merge(userid,bc,on='userid',how='left')

        p1 = pd.read_csv('post_feature1.csv')
        p1 = pd.merge(userid,p1,on='userid',how='left')
        p2 = pd.read_csv('post_feature2.csv')
        p2 = pd.merge(userid,p2,on='userid',how='left')
        p3 = pd.read_csv('post_feature3.csv')
        p3 = pd.merge(userid,p3,on='userid',how='left')

        b1 = pd.read_csv('browse_feature1.csv')
        b1 = pd.merge(userid,b1,on='userid',how='left')
        b2 = pd.read_csv('browse_feature2.csv')
        b2 = pd.merge(userid,b2,on='userid',how='left')
        b3 = pd.read_csv('browse_feature3.csv')
        b3 = pd.merge(userid,b3,on='userid',how='left')
        b4 = pd.read_csv('browse_feature4.csv')
        b4 = pd.merge(userid,b4,on='userid',how='left')

        c1 = pd.read_csv('comment_feature1.csv')
        c1 = pd.merge(userid,c1,on='userid',how='left')
        c2 = pd.read_csv('comment_feature2.csv')
        c2 = pd.merge(userid,c2,on='userid',how='left')
        c3 = pd.read_csv('comment_feature3.csv')
        c3 = pd.merge(userid,c3,on='userid',how='left')

        f1 = pd.read_csv('favorite_feature1.csv')
        f1 = pd.merge(userid,f1,on='userid',how='left')
        f2 = pd.read_csv('favorite_feature2.csv')
        f2 = pd.merge(userid,f2,on='userid',how='left')
        f3 = pd.read_csv('favorite_feature3.csv')
        f3 = pd.merge(userid,f3,on='userid',how='left')

        u1 = pd.read_csv('voteup_feature1.csv')
        u1 = pd.merge(userid,u1,on='userid',how='left')
        u2 = pd.read_csv('voteup_feature2.csv')
        u2 = pd.merge(userid,u2,on='userid',how='left')
        u3 = pd.read_csv('voteup_feature3.csv')
        u3 = pd.merge(userid,u3,on='userid',how='left')

        d1 = pd.read_csv('votedown_feature1.csv')
        d1 = pd.merge(userid,d1,on='userid',how='left')
        d2 = pd.read_csv('votedown_feature2.csv')
        d2 = pd.merge(userid,d2,on='userid',how='left')
        d3 = pd.read_csv('votedown_feature3.csv')
        d3 = pd.merge(userid,d3,on='userid',how='left')

        df = pd.concat([day,letter,net,bc,content_count,p1,p2,p3,b1,b2,b3,b4,c1,c2,c3,f1,f2,f3,u1,u2,u3,d1,d2,d3],axis=1)
        contentidname = list(df.columns[df.columns.str.contains('contentid')])
        netnames = ['neighbor_degree','neighbor_out_degree','neighbor_in_degree','group3num_rate','group4num_rate','if_group3','if_group4']
        names = contentidname+netnames
        feature = list(set(df.columns) ^ set(names))
        df = df[feature]
        df.fillna(0,inplace=True)
        del df['userid']
        df.insert(0,'userid',userid.userid)
        df.to_csv(csv_path,index=False,encoding='utf8')
    return df

feature = merge()
feature.head(10)

Unnamed: 0,userid,votedown_count_month_each67,comment_count_month_each58,voteup_evening_count,voteup_count_month_each61,comment_count_month_each22,voteup_count_month_each12,favorite_count_month_each36,votedown_evening_count,favorite_count_month_each46,...,favorite_weekend_count,favorite_count_month_each44,favorite_count_month_each210,post_count_month_each57,f_count_sum,votedowne_count_month_each39,voteupe_count_month_each35,neighbor_degree_median,browse_count_month_each46,votedown_count_month_each61
0,U0000001,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,24.0,1.0,0.0,0.0,0.0,0.0,0.0
1,U0000002,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,2.0,0.0
2,U0000003,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
3,U0000004,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
4,U0000005,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,69.0,0.0,0.0,0.0,6.5,0.0,0.0
5,U0000006,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,6.0,1.0,0.0
6,U0000007,0.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0,3.0,...,0.0,0.0,0.0,5.0,0.0,0.0,0.0,12.0,9.0,0.0
7,U0000008,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0
8,U0000009,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,19.0,0.0
9,U0000010,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,60.0,0.0


### <font color=red>模型选择</font><br>
###  XGBOOST 

In [None]:
try:
    os.mkdir('feature')
    os.mkdir('result')
except:
    pass
    
def run_score3(preds, dtrain):
    gaps = dtrain.get_label()
    score = abs(gaps-preds)/ np.maximum(gaps,preds)
    score = 1-np.mean(score)
    return 'loss', score
    
def xgboost(train, n_ter):  
    train_x = train
    test_IMSI = test.userid    
    target = 'score'   
    feature_name = [x for x in train_x.columns if x not in ['userid','score']]
    train, val = train_test_split(train_x, test_size = 0.1,random_state=2017)
    a = train    
    for i in range(10):
        train = train.append(a)  
    y = train[target]
    valy = val[target]
    train = train[feature_name]
    val = val[feature_name] 
    tests = test[feature_name]
    dtest = xgb.DMatrix(tests, missing=np.nan)
    dtrain = xgb.DMatrix(train, label = y, missing=np.nan)
    dval = xgb.DMatrix(val, label = valy, missing=np.nan)
       
    params = {
                'booster':'gbtree',
                'objective': 'reg:logistic',
#                'objective': 'multi:softmax',
                'early_stopping_rounds':50,
#                'eval_metric': 'merror',
                'gamma':0.001,
#                'num_class':42,
                'max_depth':10,
                'lambda': 0.001,
                'max_delta_step':0,
                'subsample':0.75 ,       
                'colsample_bytree':0.3,
#                'min_child_weight':0.75, 
                'eta': 0.05,
#                'seed':1000
            }   
        
    watchlist  = [(dtrain,'train'),(dval,'val')]
    num_round = n_ter
    model = xgb.train(params, dtrain, num_round, evals=watchlist, feval=run_score3)  
    feature_score = model.get_fscore()
    feature_score = sorted(feature_score.items(), key=lambda x:x[1],reverse=True)
    fs = []
    for (key,value) in feature_score:
        fs.append("{0},{1}\n".format(key,value))
    
    with open('feature/feature_score13.csv','w') as f:
        f.writelines("feature,score\n")
        f.writelines(fs)
                  
    ttest_y = model.predict(dtest, ntree_limit=model.best_ntree_limit)
    result = pd.DataFrame()
    result['userid'] = test_IMSI
    result['growthvalue'] = ttest_y
    result['growthvalue'] = result['growthvalue'].apply(lambda x:round(x,4))
    result.to_csv('result34.csv', index=False, encoding='utf8')
    return result

train = pd.read_csv('SMPCUP2017_3/SMPCUP2017_TrainingData_Task3.txt',sep='\001',header=None,names=['userid','score'])
val = pd.read_csv('task3_feature.csv')

names_evening = list(val.columns[val.columns.str.contains('_evening_count')])
names_weekend = list(val.columns[val.columns.str.contains('_weekend_count')])
val['evening_count'] = val[names_evening].sum(axis=1)
val['weekend_count'] = val[names_weekend].sum(axis=1)   
names_c = list(val.columns[val.columns.str.contains('comment_count_month')])
names_fa = list(val.columns[val.columns.str.contains('favotite_count_month')])
names_vu = list(val.columns[val.columns.str.contains('voteup_count_month')])
names_vd = list(val.columns[val.columns.str.contains('votedown_count_month')])
names = names_c+names_fa+names_vd+names_vu
names = list(set(val.columns) ^ set(names))
val = val[names]

for i in val.columns[val.columns.str.contains('count')]:
    val[i+'_rate'] = val[i]/val['day']

test = pd.read_csv('test3.txt')        
train = pd.merge(train, val, on='userid', how='left')
test = pd.merge(val, test, on='userid', how='right')
train = train[train.score!=0]
result = xgboost(train,352)

###  模型调参

这次比赛我是用的手动调参，给大家主要讲一下该怎么调：
1、首先这次比赛的训练集很小，只有1000左右的样本，很难训练出好的结果，所以我采用重采样的方法，使样本数变为原来的10倍，“为啥是10倍”——调参得出来的。
2、训练集和验证集的划分。由于样本本身就很少、所以只能分给验证集很少的样本数、但又必须保证验证集结果具有代表性、所以我选择验证集数量为0.1，也是调参得出来的。
3、先把学习率调到最大、看是否出现明显的左右抖动情况在慢慢降学习率，一般eta为：0.03，这次比赛的学习率为0.04.
4、再调 subsample， 每次迭代所选取的样本数、过大容易过拟合、过小容易欠拟合、一般选 0.75.
5、再调 colsample_bytree ，每次迭代所选取的特征数，一般选择0.75，但本次比赛的特征数较多，所以需要进行调参得到最合适的数值，本次选0.3.
6、max_depth 、gamma、lambda也都需要进行调整、越小模型越容易过拟合。