<img src='NEFUlogo.png' align='left'>

# Query Intent Recognition Based on Query Logs

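## 1. Introduction

### 1.1 Research Background

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;The World Wide Web is the largest information repository in the world and one of the main ways people obtain information quickly; it has profoundly changed how people live. According to the 37th *Statistical Report on Internet Development in China*, released by the China Internet Network Information Center (CNNIC) in January 2016, China had 688 million Internet users as of December 2015, an Internet penetration rate of 50.3%. At the same date, China had 4.23 million websites, up 26.3% year on year, and 212.3 billion web pages, up 11.8% year on year, as shown in the figures below.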

<img src='Extraction Features.jpeg' align='left'>

<img src='中国网站数量.png' align='left'>

<img src='中国网页数量.png' align='left'>

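&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Because user queries are usually short and natural language is inherently ambiguous, a query often fails to express the user's intent clearly. Results that merely contain the query keywords may therefore not satisfy the user's need, and the thousands of documents a search engine returns cause serious information overload. Studies of query logs report that at least 16% of queries are ambiguous, and that more than 75% of queries carry more complex information needs that cannot be met by a simple answer or a specific URL.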

<img src='结果点击排名.png' align='left'>

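### 1.2 Research Significance

* **Higher user satisfaction with search results**: correctly identifying user intent helps mitigate problems such as information overload and improves search efficiency;<br>
* **More precise ad recommendation**: showing search engine users ads relevant to their current search increases commercial value;<br>
* **Better organization and retrieval of information by search engines**: web resources can be organized by intent to build more efficient retrieval systems.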

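### 1.3 Research Status in China and Abroad

* **Taxonomy construction** for query intent<br>
* **Feature extraction** for query intent<br>
* **Recognition methods** for query intent<br>
* **Datasets and evaluation methods**<br>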

### 1.4 Thesis Structure

<img src='论文框架.png' align='left'>

## 2. Related Theory

## 3. Experiments

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Import the Python libraries used below and configure plots to display inside the notebook.

In [1]:
#-*- coding:utf-8 -*-

In [2]:
import graphlab
import re
import pandas as pd
import string
import jieba.posseg as pseg
import jieba
from collections import OrderedDict
from collections import Counter
from pyltp import Segmentor, Postagger, Parser

In [3]:
import matplotlib.pyplot as plt
%matplotlib inline

In [4]:
segmentor = Segmentor()
postagger = Postagger()
parser = Parser()
segmentor.load('E:/Python/pyltp/ltp_data/cws.model')
postagger.load('E:/Python/pyltp/ltp_data/pos.model')
parser.load('E:/Python/pyltp/ltp_data/parser.model')

In [5]:
#graphlab.canvas.set_target('ipynb')

### 3.1 Data Collection

#### 3.1.1 Query Log Data (subset)

In [6]:
data = graphlab.SFrame.read_csv('SogouM.txt', delimiter='\t', header=True, column_type_hints=[str, str, str, str, str])


In [7]:
data.show()


In [10]:
data['result_and_click'] = data['result_click'].apply(lambda x : re.split(r'\s+',x))

In [11]:
# Split the parsed pair into the result rank and the click rank.
data['result_rank'] = data['result_and_click'].apply(lambda l: int(l[0]))
data['click_rank'] = data['result_and_click'].apply(lambda l: int(l[1]))
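The cells above turn the `result_click` field (for example `8 3`) into two integer columns. A minimal plain-Python sketch of the same parsing step, independent of GraphLab, is shown here; the helper name `parse_result_click` and the explicit validation are assumptions added for illustration and are not part of the original notebook.

```python
import re

def parse_result_click(field):
    """Split a 'result_rank click_rank' string into two integers."""
    parts = re.split(r'\s+', field.strip())
    if len(parts) != 2:
        # The log format is assumed to hold exactly two numbers per field.
        raise ValueError('expected two integers, got %r' % field)
    return int(parts[0]), int(parts[1])

result_rank, click_rank = parse_result_click('8 3')
print(result_rank, click_rank)
```

Stripping the field first avoids an empty leading token when a line starts with whitespace.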

In [12]:
data['query'] = data['query'].apply(lambda x: str(x)).apply(lambda x: x.strip().lstrip('[').rstrip(']'))

In [13]:
def remove_tail(url):
    # Drop a single trailing slash; other URLs (including empty ones) pass through.
    return url[:-1] if url.endswith('/') else url

In [14]:
data['url'] = data['url'].apply(remove_tail)

In [15]:
data_words = []
data_poses = []
data_parsers = []
for query in data['query']:
    # Segment the query, tag parts of speech, then run the dependency parser.
    temp_words = segmentor.segment(query)
    temp_poses = postagger.postag(temp_words)
    temp_arcs = parser.parse(temp_words, temp_poses)
    # Store each analysis as a space-separated string.
    data_words.append(' '.join(temp_words))
    data_poses.append(' '.join(temp_poses))
    data_parsers.append(' '.join(arc.relation for arc in temp_arcs))

In [16]:
data['words'] = data_words
data['poses'] = data_poses
data['parsers'] = data_parsers

In [17]:
data['len_words'] = data['query'].apply(lambda x: len(x.decode('utf-8')))
data['len_seg'] = data['words'].apply(lambda x: len(re.split(r'\s+',x)))

In [18]:
data.head()

time,user_id,query,result_click,url,result_and_click,result_rank
00:00:00,2982199073774412,360安全卫士,8 3,download.it.com.cn/softwe b/software/firewall/a ...,"[8, 3]",8
00:00:00,7594220010824798,哄抢救灾物资,1 1,news.21cn.com/social/daqi an/2008/05/29/4777194 ...,"[1, 1]",1
00:00:00,5228056822071097,75810部队,14 5,www.greatoo.com/greatoo_c n/list.asp?link_id=27 ...,"[14, 5]",14
00:00:00,6140463203615646,绳艺,62 36,www.jd-cd.com/jd_opus/xx/ 200607/706.html ...,"[62, 36]",62
00:00:00,8561366108033201,汶川地震原因,3 2,www.big38.net,"[3, 2]",3
00:00:00,23908140386148713,莫衷一是的意思,1 2,www.chinabaike.com/articl e/81/82/110/2007/2007 ...,"[1, 2]",1
00:00:00,1797943298449139,星梦缘全集在线观看,8 5,www.6wei.net/dianshiju/?? ??\xa1\xe9|????do=index ...,"[8, 5]",8
00:00:00,717725924582846,闪字吧,1 2,www.shanziba.com,"[1, 2]",1
00:00:00,41416219018952116,霍震霆与朱玲玲照片,2 6,bbs.gouzai.cn/thread-6987 36.html ...,"[2, 6]",2
00:00:00,9975666857142764,电脑创业,2 2,ks.cn.yahoo.com/question/ 1307120203719.html ...,"[2, 2]",2

click_rank,words,poses,parsers,len_words,len_seg
3,360 安全 卫士,nz a n,ATT ATT HED,7,3
1,哄抢 救灾 物资,v v n,HED ATT VOB,6,3
5,75810 部队,m n,ATT HED,7,2
36,绳艺,nh,HED,2,1
2,汶川 地震 原因,ns n n,ATT ATT HED,6,3
2,莫衷一是 的 意思,i u n,ATT RAD HED,7,3
5,星 梦 缘 全集 在线 观看 ...,n n p n n v,ATT ATT ATT ATT SBV HED,9,6
2,闪字 吧,n u,HED RAD,3,2
6,霍震霆 与 朱玲玲 照片 ...,nh p nh n,ATT LAD COO HED,9,4
2,电脑 创业,n v,SBV HED,4,2


In [16]:
data['query', 'words', 'poses', 'parsers']

query,words,poses,parsers
360安全卫士,360 安全 卫士,nz a n,ATT ATT HED
哄抢救灾物资,哄抢 救灾 物资,v v n,HED ATT VOB
75810部队,75810 部队,m n,ATT HED
绳艺,绳艺,nh,HED
汶川地震原因,汶川 地震 原因,ns n n,ATT ATT HED
莫衷一是的意思,莫衷一是 的 意思,i u n,ATT RAD HED
星梦缘全集在线观看,星 梦 缘 全集 在线 观看 ...,n n p n n v,ATT ATT ATT ATT SBV HED
闪字吧,闪字 吧,n u,HED RAD
霍震霆与朱玲玲照片,霍震霆 与 朱玲玲 照片 ...,nh p nh n,ATT LAD COO HED
电脑创业,电脑 创业,n v,SBV HED


In [17]:
data.save('Sogou1W_ltp1.csv')

#### 3.1.2 Open Directory Project (ODP) Taxonomy

<img src='odp.png' align='left'>

In [18]:
odp = graphlab.SFrame.read_csv('ODP.csv', delimiter=',', header=True)

------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[str,str,str,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


In [19]:
odp.head()

url,name,label,url_new
http://news.jmu.edu.cn/,集美大学新闻网,大专院校,news.jmu.edu.cn/
http://jjxj.swufe.edu.cn/,经济学家,出版物,jjxj.swufe.edu.cn/
http://www.jsacd.gov.cn/,江苏省农业资源开发局,江苏,www.jsacd.gov.cn/
http://www.yndaily.com/,云南日报网,地区,www.yndaily.com/
http://www.panda.org.cn/,成都大熊猫繁育研究基地,熊猫,www.panda.org.cn/
http://www.fjinfo.gov.cn/,福建科技信息,福建,www.fjinfo.gov.cn/
http://www.klxuexi.com/,快乐学习教育科技集团,上海,www.klxuexi.com/
http://www.haier.com/cn/,海尔集团,消费电子产品,www.haier.com/cn/
http://www.jstvu.edu.cn/,江苏广播电视大学,江苏,www.jstvu.edu.cn/
http://www.gxtc.edu.cn/,广西师范学院,大专院校,www.gxtc.edu.cn/


In [20]:
def remove_head(url):
    # str.replace works on both Python 2 and 3; string.replace is Python 2 only.
    return url.replace('http://', '')

In [21]:
def remove_tail(url):
    # Drop a single trailing slash; other URLs (including empty ones) pass through.
    return url[:-1] if url.endswith('/') else url

In [22]:
odp['url_new'] = odp['url'].apply(remove_head).apply(remove_tail)
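The two helpers are chained so that ODP URLs take the same form as the URLs in the query log (no scheme, no trailing slash). As a rough sketch, the same normalization can be written as a single function; `normalize_url` is a hypothetical name, and unlike `remove_head` it strips `http://` only when it appears at the start of the string, which is an assumption about the intended behavior.

```python
def normalize_url(url):
    """Drop a leading 'http://' and a single trailing '/'."""
    if url.startswith('http://'):
        url = url[len('http://'):]
    if url.endswith('/'):
        url = url[:-1]
    return url

print(normalize_url('http://www.haier.com/cn/'))
```

Normalizing both sources the same way is what lets a substring match between log URLs and ODP URLs succeed later on.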

In [23]:
odp.head()

url,name,label,url_new
http://news.jmu.edu.cn/,集美大学新闻网,大专院校,news.jmu.edu.cn
http://jjxj.swufe.edu.cn/,经济学家,出版物,jjxj.swufe.edu.cn
http://www.jsacd.gov.cn/,江苏省农业资源开发局,江苏,www.jsacd.gov.cn
http://www.yndaily.com/,云南日报网,地区,www.yndaily.com
http://www.panda.org.cn/,成都大熊猫繁育研究基地,熊猫,www.panda.org.cn
http://www.fjinfo.gov.cn/,福建科技信息,福建,www.fjinfo.gov.cn
http://www.klxuexi.com/,快乐学习教育科技集团,上海,www.klxuexi.com
http://www.haier.com/cn/,海尔集团,消费电子产品,www.haier.com/cn
http://www.jstvu.edu.cn/,江苏广播电视大学,江苏,www.jstvu.edu.cn
http://www.gxtc.edu.cn/,广西师范学院,大专院校,www.gxtc.edu.cn


### 3.2 Building a New Taxonomy to Label the Query Log Data

#### 3.2.1 Mapping the ODP Topic Taxonomy onto the Rose Taxonomy

* Rose taxonomy

<img src='rose.png' align='left'>

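&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Based on an analysis of the Rose taxonomy, this work maps the ODP topic taxonomy onto Rose's three top-level classes, *Information*, *Resource*, and *Navigation*, with most entries falling under *Information* and *Navigation*. The query log data are then labeled accordingly.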

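* **Navigation**: a log entry is labeled Navigation when its URL matches only a web server name, e.g. www.nefu.edu.cn.
* **Resource**: first, a heuristic labels as Resource any log entry whose URL contains strings such as download/game/music/movie/book. Then, following the analysis above, the ODP URLs that belong to the resource class are selected manually to build a resource URL list, *resource*; a log entry whose URL matches an entry in *resource* is labeled Resource. For example, the ODP shopping and games categories belong to the resource class.
* **Information**: every entry that falls into neither of the two classes above is labeled Information.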
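The three rules can be sketched as a small standalone function. This is a simplified illustration, not the notebook's implementation: the name `label_url`, the shortened top-level-domain list, and the stricter host-only pattern are all assumptions, and the keyword list only includes the heuristic strings named in the Resource rule.

```python
import re

# Assumed, shortened TLD list; the pattern accepts only a bare "www." host name.
NAVIGATION_RE = re.compile(r'^www\.[^/]*\.(com|cn|org|net|gov|edu)$')
# Heuristic keywords taken from the Resource rule above.
RESOURCE_KEYS = ['download', 'game', 'music', 'movie', 'book']

def label_url(url, resource_keys=RESOURCE_KEYS):
    """Return 'Navigation', 'Resource', or 'Information' for a normalized URL."""
    if NAVIGATION_RE.match(url):
        return 'Navigation'
    if any(key in url for key in resource_keys):
        return 'Resource'
    return 'Information'

for url in ['www.nefu.edu.cn', 'download.it.com.cn/software', 'news.21cn.com/social']:
    print(url, label_url(url))
```

The order of the checks matters: a bare host name is labeled Navigation even if it happens to contain a resource keyword.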

In [24]:
r = ['二手货','交通工具','休闲','体育用品','健康饮食','健康器材','在线销售','化妆美容','出版物','办公用品','化妆美容','古董与收藏','图书',\
     '婴幼儿用品','宠物','家具','家居与园艺','批发','日用商品','服装饰品','消费电子产品','图书','玩具与游戏','杂志','音像制品','化妆品',\
     '珠宝首饰','礼品','视觉艺术','计算机','食品','鲜花','精油香氛','分类','拍卖','目录','大学出版社','家具','家居装饰','电器','体育用品',\
    '乒乓球','渔具','飞镖','家具','文具','办公室服务','购物','批发与分销','烟草','机动车','珠宝首饰','眼镜','行李与包','食品','女装','童装',\
     '鞋帽','饰品','电子通讯','数字卡','虚拟物品交易','摄影','画','饮料','茶','葡萄酒',\
    '卡牌游戏','投币式游戏','棋类游戏','牌类游戏','电子游戏','电脑游戏','益智游戏','网络游戏','角色扮演','赌博',\
        '中国象棋','军棋','围棋','国际象棋','连珠','黑白棋','组织','休闲','体育','冒险','动作','射击','格斗','模拟','益智','策略','网络游戏',\
        '角色扮演','赛车','音乐与舞蹈','射击','格斗','网页游戏','魔兽争霸','魔兽世界','大型多人在线','网页游戏','角色扮演',\
       '网络泥巴','角色扮演','冒险岛','天龙八部','永恒之塔','魔兽世界','手持平台','游戏机平台','网页游戏','计算机平台','手机','新闻与评论',\
        '世嘉','任天堂','微软','索尼','下载','下载','会议展览','作弊与攻略','家族与公会','开发商与发布商','新闻与评论',\
        '电子竞技','聊天与论坛','麻将','体育','彩票','赌场']

In [25]:
# Append the URL of every ODP entry whose label is in the resource list.
with open('resource_show.csv', 'a') as f:
    for i in r:
        for url in odp[odp['label'] == i]['url_new']:
            f.write(url + '\n')

In [27]:
resource = graphlab.SFrame.read_csv('resource_show.csv',header=False)


In [28]:
resource.unique().save('resource_show.csv')

In [19]:
resource = graphlab.SFrame.read_csv('resources.csv',header=True)


In [20]:
resource.head()

url
download
book
read
music
movie
software
52384.com
map.baidu.com
htffund.com
hnbys.gov.cn


#### 3.2.2 Labeling the Query Log Data

Function for labeling Resource entries

In [21]:
def label_resource(row, resource):
    for url_i in resource['url']:
        if row['url'].find(url_i) != -1:
            return True
    return False

Function for labeling the whole dataset

In [22]:
def log_label(row, resource):
    # Navigation: the URL starts with "www" and ends in a known top-level domain.
    if re.match(r'www(.*?)\b(com|cn|org|net|gov|xin|red|pub|ink|info|xyz|win|edu|mil|tv|TV|mobi|'
                r'travel|name|aero|museum|pro|biz|coop)\b$', row['url']):
        return 'Navigation'
    # Resource: the URL contains an entry from the resource list.
    elif label_resource(row, resource):
        return 'Resource'
    # Information: everything else.
    else:
        return 'Information'

In [23]:
data['label'] = data.apply(lambda x: log_label(x, resource))

In [24]:
data['label'].show()


In [35]:
data.save('Sogou1W_ltp_label.csv')

### 3.3 Extracting Features with NLP Techniques

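&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;This work mainly uses NLP techniques to extract features from each query, including word segmentation and part-of-speech statistics. These are combined with the click-rank and result-rank features to form the feature set for query intent recognition.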
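To make the feature set concrete, the sketch below assembles the per-query features from an already segmented and tagged query. It assumes Python 3 (where `len` on a `str` counts characters), and `query_features` is a hypothetical helper, not a function from the notebook. Note that it sums token lengths to get `len_words`, which equals the character length of the raw query only when segmentation drops no characters.

```python
from collections import Counter

def query_features(words, poses, result_rank, click_rank):
    """Build a feature dict from space-separated tokens and POS tags."""
    tokens = words.split()
    tags = poses.split()
    return {
        'len_words': sum(len(t) for t in tokens),  # characters in the query
        'len_seg': len(tokens),                    # number of segmented words
        'pos_counts': dict(Counter(tags)),         # part-of-speech statistics
        'result_rank': result_rank,
        'click_rank': click_rank,
    }

feats = query_features('汶川 地震 原因', 'ns n n', 3, 2)
print(feats)
```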

### 3.4 Comparing the Performance of Logistic Regression and Boosted Trees on Query Intent Recognition

In [36]:
data.head()

time,user_id,query,result_click,url,result_and_click,result_rank
00:00:00,2982199073774412,360安全卫士,8 3,download.it.com.cn/softwe b/software/firewall/a ...,"[8, 3]",8
00:00:00,7594220010824798,哄抢救灾物资,1 1,news.21cn.com/social/daqi an/2008/05/29/4777194 ...,"[1, 1]",1
00:00:00,5228056822071097,75810部队,14 5,www.greatoo.com/greatoo_c n/list.asp?link_id=27 ...,"[14, 5]",14
00:00:00,6140463203615646,绳艺,62 36,www.jd-cd.com/jd_opus/xx/ 200607/706.html ...,"[62, 36]",62
00:00:00,8561366108033201,汶川地震原因,3 2,www.big38.net,"[3, 2]",3
00:00:00,23908140386148713,莫衷一是的意思,1 2,www.chinabaike.com/articl e/81/82/110/2007/2007 ...,"[1, 2]",1
00:00:00,1797943298449139,星梦缘全集在线观看,8 5,www.6wei.net/dianshiju/?? ??\xa1\xe9|????do=index ...,"[8, 5]",8
00:00:00,717725924582846,闪字吧,1 2,www.shanziba.com,"[1, 2]",1
00:00:00,41416219018952116,霍震霆与朱玲玲照片,2 6,bbs.gouzai.cn/thread-6987 36.html ...,"[2, 6]",2
00:00:00,9975666857142764,电脑创业,2 2,ks.cn.yahoo.com/question/ 1307120203719.html ...,"[2, 2]",2

click_rank,words,poses,parsers,len_words,len_seg,label
3,360 安全 卫士,nz a n,ATT ATT HED,7,3,Resource
1,哄抢 救灾 物资,v v n,HED ATT VOB,6,3,Information
5,75810 部队,m n,ATT HED,7,2,Information
36,绳艺,nh,HED,2,1,Information
2,汶川 地震 原因,ns n n,ATT ATT HED,6,3,Navigation
2,莫衷一是 的 意思,i u n,ATT RAD HED,7,3,Information
5,星 梦 缘 全集 在线 观看 ...,n n p n n v,ATT ATT ATT ATT SBV HED,9,6,Information
2,闪字 吧,n u,HED RAD,3,2,Navigation
6,霍震霆 与 朱玲玲 照片 ...,nh p nh n,ATT LAD COO HED,9,4,Resource
2,电脑 创业,n v,SBV HED,4,2,Resource


#### 3.4.1 Data Preparation (train_data, validation_data, test_data)

In [25]:
data['words_count'] = graphlab.text_analytics.count_words(data['words'])
data['poses_count'] = graphlab.text_analytics.count_words(data['poses'])
data['parsers_count'] = graphlab.text_analytics.count_words(data['parsers'])

In [26]:
data['words_tfidf'] = graphlab.text_analytics.tf_idf(data['words_count'])
data['poses_tfidf'] = graphlab.text_analytics.tf_idf(data['poses_count'])
data['parsers_tfidf'] = graphlab.text_analytics.tf_idf(data['parsers_count'])
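The weights printed further down appear consistent with the plain form tf × ln(N / df): the tag `hed` occurs in every query and gets 0.0, and a word weighted 9.21034 matches ln(10000) for a term that occurs in one document out of 10,000. The sketch below implements that formula in plain Python as an illustration; it is not GraphLab's implementation, and the function name `tf_idf` and the toy documents are assumptions.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {token: weight} dict per document."""
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each token once per document
    weighted = []
    for tokens in docs:
        counts = Counter(tokens)
        weighted.append({t: c * math.log(float(n_docs) / doc_freq[t])
                         for t, c in counts.items()})
    return weighted

docs = [['ATT', 'ATT', 'HED'], ['HED', 'ATT', 'VOB'], ['HED']]
weights = tf_idf(docs)
print(weights[0])
```

A token present in every document gets weight zero, which is why a feature such as `hed` carries no information for the classifier under this weighting.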

In [27]:
data.save('Sogou1W_ltp_label.csv')

In [39]:
data['query', 'words_count', 'poses_count', 'parsers_count']

query,words_count,poses_count,parsers_count
360安全卫士,"{'\xe5\x8d\xab\xe5\xa3\xa b': 1L, '\xe5\xae\x89 ...","{'a': 1L, 'nz': 1L, 'n': 1L} ...","{'hed': 1L, 'att': 2L}"
哄抢救灾物资,"{'\xe6\x95\x91\xe7\x81\xb e': 1L, '\xe7\x89\xa9 ...","{'v': 2L, 'n': 1L}","{'hed': 1L, 'att': 1L, 'vob': 1L} ..."
75810部队,"{'\xe9\x83\xa8\xe9\x98\x9 f': 1L, '75810': 1L} ...","{'m': 1L, 'n': 1L}","{'hed': 1L, 'att': 1L}"
绳艺,{'\xe7\xbb\xb3\xe8\x89\xb a': 1L} ...,{'nh': 1L},{'hed': 1L}
汶川地震原因,"{'\xe6\xb1\xb6\xe5\xb7\x9 d': 1L, '\xe5\x8e\x9f ...","{'ns': 1L, 'n': 2L}","{'hed': 1L, 'att': 2L}"
莫衷一是的意思,"{'\xe6\x84\x8f\xe6\x80\x9 d': 1L, '\xe7\x9a\x84': ...","{'i': 1L, 'u': 1L, 'n': 1L} ...","{'hed': 1L, 'rad': 1L, 'att': 1L} ..."
星梦缘全集在线观看,"{'\xe7\xbc\x98': 1L, '\xe6\x98\x9f': 1L, ' ...","{'p': 1L, 'n': 4L, 'v': 1L} ...","{'hed': 1L, 'att': 4L, 'sbv': 1L} ..."
闪字吧,"{'\xe5\x90\xa7': 1L, '\xe 9\x97\xaa\xe5\xad\x97': ...","{'u': 1L, 'n': 1L}","{'hed': 1L, 'rad': 1L}"
霍震霆与朱玲玲照片,"{'\xe6\x9c\xb1\xe7\x8e\xb 2\xe7\x8e\xb2': 1L, ...","{'nh': 2L, 'p': 1L, 'n': 1L} ...","{'hed': 1L, 'lad': 1L, 'att': 1L, 'coo': 1L} ..."
电脑创业,"{'\xe7\x94\xb5\xe8\x84\x9 1': 1L, '\xe5\x88\x9b ...","{'n': 1L, 'v': 1L}","{'hed': 1L, 'sbv': 1L}"


In [40]:
data['query', 'words_tfidf', 'poses_tfidf', 'parsers_tfidf']

query,words_tfidf,poses_tfidf,parsers_tfidf
360安全卫士,{'\xe5\x8d\xab\xe5\xa3\xa b': 8.517193191416238 ...,"{'a': 2.6678684114693785, 'nz': 2.701571235004501, ...","{'hed': 0.0, 'att': 0.9496303724859151} ..."
哄抢救灾物资,{'\xe6\x95\x91\xe7\x81\xb e': 3.432688048753526 ...,"{'v': 1.9955439474840764, 'n': 0.35967945295903 ...","{'hed': 0.0, 'att': 0.47481518624295754, ..."
75810部队,"{'\xe9\x83\xa8\xe9\x98\x9 f': 7.600902459542082, ...","{'m': 2.4349742810397914, 'n': 0.35967945295903 ...","{'hed': 0.0, 'att': 0.47481518624295754} ..."
绳艺,{'\xe7\xbb\xb3\xe8\x89\xb a': 7.264430222920869} ...,{'nh': 1.4610179073158271} ...,{'hed': 0.0}
汶川地震原因,{'\xe6\xb1\xb6\xe5\xb7\x9 d': 3.158251203051766 ...,"{'ns': 1.6809339141391704, 'n': ...","{'hed': 0.0, 'att': 0.9496303724859151} ..."
莫衷一是的意思,"{'\xe6\x84\x8f\xe6\x80\x9 d': 6.812445099177812, ...","{'i': 5.0206856299497575, 'u': 2.8051119139453413, ...","{'hed': 0.0, 'rad': 2.8100829266673615, ..."
星梦缘全集在线观看,"{'\xe7\xbc\x98': 8.517193191416238, ...","{'p': 4.254513314374922, 'n': 1.4387178118361246, ...","{'hed': 0.0, 'att': 1.8992607449718302, ..."
闪字吧,"{'\xe5\x90\xa7': 6.907755278982137, '\ ...","{'u': 2.8051119139453413, 'n': 0.35967945295903 ...","{'hed': 0.0, 'rad': 2.8100829266673615} ..."
霍震霆与朱玲玲照片,{'\xe6\x9c\xb1\xe7\x8e\xb 2\xe7\x8e\xb2': ...,"{'nh': 2.9220358146316543, 'p': ...","{'hed': 0.0, 'lad': 4.8283137373023015, ..."
电脑创业,{'\xe7\x94\xb5\xe8\x84\x9 1': 6.214608098422191 ...,"{'n': 0.35967945295903114, ...","{'hed': 0.0, 'sbv': 1.6607312068216509} ..."


In [41]:
data['query', 'len_words', 'len_seg']

query,len_words,len_seg
360安全卫士,7,3
哄抢救灾物资,6,3
75810部队,7,2
绳艺,2,1
汶川地震原因,6,3
莫衷一是的意思,7,3
星梦缘全集在线观看,9,6
闪字吧,3,2
霍震霆与朱玲玲照片,9,4
电脑创业,4,2


In [42]:
data['query', 'result_rank', 'click_rank']

query,result_rank,click_rank
360安全卫士,8,3
哄抢救灾物资,1,1
75810部队,14,5
绳艺,62,36
汶川地震原因,3,2
莫衷一是的意思,1,2
星梦缘全集在线观看,8,5
闪字吧,1,2
霍震霆与朱玲玲照片,2,6
电脑创业,2,2


In [43]:
data['words_tfidf', 'poses_tfidf', 'parsers_tfidf', 'len_words', 'len_seg', 'result_rank', 'click_rank'].save('featureslala.csv')

In [44]:
df = pd.read_csv('featureslala.csv')

In [45]:
df

Unnamed: 0,words_tfidf,poses_tfidf,parsers_tfidf,len_words,len_seg,result_rank,click_rank
0,"{""卫士"":8.51719,""安全"":7.26443,""360"":8.11173}","{""n"":0.359679,""a"":2.66787,""nz"":2.70157}","{""hed"":0,""att"":0.94963}",7,3,8,3
1,"{""物资"":3.4389,""救灾"":3.43269,""哄抢"":3.4389}","{""n"":0.359679,""v"":1.99554}","{""vob"":1.76785,""att"":0.474815,""hed"":0}",6,3,1,1
2,"{""部队"":7.6009,""75810"":9.21034}","{""n"":0.359679,""m"":2.43497}","{""hed"":0,""att"":0.474815}",7,2,14,5
3,"{""绳艺"":7.26443}","{""nh"":1.46102}","{""hed"":0}",2,1,62,36
4,"{""地震"":2.80677,""原因"":3.32424,""汶川"":3.15825}","{""n"":0.719359,""ns"":1.68093}","{""hed"":0,""att"":0.94963}",6,3,3,2
5,"{""意思"":6.81245,""的"":3.00175,""莫衷一是"":9.21034}","{""n"":0.359679,""u"":2.80511,""i"":5.02069}","{""hed"":0,""rad"":2.81008,""att"":0.474815}",7,3,1,2
6,"{""全集"":5.08321,""缘"":8.51719,""在线"":4.75599,""梦"":7.0...","{""v"":0.997772,""p"":4.25451,""n"":1.43872}","{""hed"":0,""sbv"":1.66073,""att"":1.89926}",9,6,8,5
7,"{""吧"":6.90776,""闪字"":8.51719}","{""u"":2.80511,""n"":0.359679}","{""rad"":2.81008,""hed"":0}",3,2,1,2
8,"{""照片"":4.77952,""朱玲玲"":8.51719,""与"":5.42615,""霍震霆"":...","{""n"":0.359679,""p"":4.25451,""nh"":2.92204}","{""hed"":0,""coo"":2.21916,""lad"":4.82831,""att"":0.4...",9,4,2,6
9,"{""创业"":7.41858,""电脑"":6.21461}","{""v"":0.997772,""n"":0.359679}","{""hed"":0,""sbv"":1.66073}",4,2,2,2


In [47]:
labels = data['label']

In [48]:
data_one_hot_encoded = data['label'].apply(lambda x: {x: 1})    
data_unpacked = data_one_hot_encoded.unpack(column_name_prefix='label')
    
# Change None's to 0's
for column in data_unpacked.column_names():
    data_unpacked[column] = data_unpacked[column].fillna(0)

data.add_columns(data_unpacked)
features = data.column_names()
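The cell above expands the single `label` column into one 0/1 column per class, so that a binary classifier can be trained against each class in turn. A plain-Python sketch of the same one-hot encoding follows; the helper name `one_hot` is an assumption, and the column naming mirrors the `label.<Class>` names shown in the notebook output.

```python
def one_hot(labels, prefix='label'):
    """Return one dict per label with a 0/1 entry for every class seen."""
    classes = sorted(set(labels))
    return [{'%s.%s' % (prefix, c): int(c == lab) for c in classes}
            for lab in labels]

rows = one_hot(['Resource', 'Information', 'Navigation'])
print(rows[0])
```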

In [49]:
data.head()

time,user_id,query,result_click,url,result_and_click,result_rank
00:00:00,2982199073774412,360安全卫士,8 3,download.it.com.cn/softwe b/software/firewall/a ...,"[8, 3]",8
00:00:00,7594220010824798,哄抢救灾物资,1 1,news.21cn.com/social/daqi an/2008/05/29/4777194 ...,"[1, 1]",1
00:00:00,5228056822071097,75810部队,14 5,www.greatoo.com/greatoo_c n/list.asp?link_id=27 ...,"[14, 5]",14
00:00:00,6140463203615646,绳艺,62 36,www.jd-cd.com/jd_opus/xx/ 200607/706.html ...,"[62, 36]",62
00:00:00,8561366108033201,汶川地震原因,3 2,www.big38.net,"[3, 2]",3
00:00:00,23908140386148713,莫衷一是的意思,1 2,www.chinabaike.com/articl e/81/82/110/2007/2007 ...,"[1, 2]",1
00:00:00,1797943298449139,星梦缘全集在线观看,8 5,www.6wei.net/dianshiju/?? ??\xa1\xe9|????do=index ...,"[8, 5]",8
00:00:00,717725924582846,闪字吧,1 2,www.shanziba.com,"[1, 2]",1
00:00:00,41416219018952116,霍震霆与朱玲玲照片,2 6,bbs.gouzai.cn/thread-6987 36.html ...,"[2, 6]",2
00:00:00,9975666857142764,电脑创业,2 2,ks.cn.yahoo.com/question/ 1307120203719.html ...,"[2, 2]",2

click_rank,words,poses,parsers,len_words,len_seg,label
3,360 安全 卫士,nz a n,ATT ATT HED,7,3,Resource
1,哄抢 救灾 物资,v v n,HED ATT VOB,6,3,Information
5,75810 部队,m n,ATT HED,7,2,Information
36,绳艺,nh,HED,2,1,Information
2,汶川 地震 原因,ns n n,ATT ATT HED,6,3,Navigation
2,莫衷一是 的 意思,i u n,ATT RAD HED,7,3,Information
5,星 梦 缘 全集 在线 观看 ...,n n p n n v,ATT ATT ATT ATT SBV HED,9,6,Information
2,闪字 吧,n u,HED RAD,3,2,Navigation
6,霍震霆 与 朱玲玲 照片 ...,nh p nh n,ATT LAD COO HED,9,4,Resource
2,电脑 创业,n v,SBV HED,4,2,Resource

words_count,poses_count,parsers_count,words_tfidf
"{'\xe5\x8d\xab\xe5\xa3\xa b': 1L, '\xe5\xae\x89 ...","{'a': 1L, 'nz': 1L, 'n': 1L} ...","{'hed': 1L, 'att': 2L}",{'\xe5\x8d\xab\xe5\xa3\xa b': 8.517193191416238 ...
"{'\xe6\x95\x91\xe7\x81\xb e': 1L, '\xe7\x89\xa9 ...","{'v': 2L, 'n': 1L}","{'hed': 1L, 'att': 1L, 'vob': 1L} ...",{'\xe6\x95\x91\xe7\x81\xb e': 3.432688048753526 ...
"{'\xe9\x83\xa8\xe9\x98\x9 f': 1L, '75810': 1L} ...","{'m': 1L, 'n': 1L}","{'hed': 1L, 'att': 1L}","{'\xe9\x83\xa8\xe9\x98\x9 f': 7.600902459542082, ..."
{'\xe7\xbb\xb3\xe8\x89\xb a': 1L} ...,{'nh': 1L},{'hed': 1L},{'\xe7\xbb\xb3\xe8\x89\xb a': 7.264430222920869} ...
"{'\xe6\xb1\xb6\xe5\xb7\x9 d': 1L, '\xe5\x8e\x9f ...","{'ns': 1L, 'n': 2L}","{'hed': 1L, 'att': 2L}",{'\xe6\xb1\xb6\xe5\xb7\x9 d': 3.158251203051766 ...
"{'\xe6\x84\x8f\xe6\x80\x9 d': 1L, '\xe7\x9a\x84': ...","{'i': 1L, 'u': 1L, 'n': 1L} ...","{'hed': 1L, 'rad': 1L, 'att': 1L} ...","{'\xe6\x84\x8f\xe6\x80\x9 d': 6.812445099177812, ..."
"{'\xe7\xbc\x98': 1L, '\xe6\x98\x9f': 1L, ' ...","{'p': 1L, 'n': 4L, 'v': 1L} ...","{'hed': 1L, 'att': 4L, 'sbv': 1L} ...","{'\xe7\xbc\x98': 8.517193191416238, ..."
"{'\xe5\x90\xa7': 1L, '\xe 9\x97\xaa\xe5\xad\x97': ...","{'u': 1L, 'n': 1L}","{'hed': 1L, 'rad': 1L}","{'\xe5\x90\xa7': 6.907755278982137, '\ ..."
"{'\xe6\x9c\xb1\xe7\x8e\xb 2\xe7\x8e\xb2': 1L, ...","{'nh': 2L, 'p': 1L, 'n': 1L} ...","{'hed': 1L, 'lad': 1L, 'att': 1L, 'coo': 1L} ...",{'\xe6\x9c\xb1\xe7\x8e\xb 2\xe7\x8e\xb2': ...
"{'\xe7\x94\xb5\xe8\x84\x9 1': 1L, '\xe5\x88\x9b ...","{'n': 1L, 'v': 1L}","{'hed': 1L, 'sbv': 1L}",{'\xe7\x94\xb5\xe8\x84\x9 1': 6.214608098422191 ...

poses_tfidf,parsers_tfidf,label.Information,label.Navigation,label.Resource
"{'a': 2.6678684114693785, 'nz': 2.701571235004501, ...","{'hed': 0.0, 'att': 0.9496303724859151} ...",0,0,1
"{'v': 1.9955439474840764, 'n': 0.35967945295903 ...","{'hed': 0.0, 'att': 0.47481518624295754, ...",1,0,0
"{'m': 2.4349742810397914, 'n': 0.35967945295903 ...","{'hed': 0.0, 'att': 0.47481518624295754} ...",1,0,0
{'nh': 1.4610179073158271} ...,{'hed': 0.0},1,0,0
"{'ns': 1.6809339141391704, 'n': ...","{'hed': 0.0, 'att': 0.9496303724859151} ...",0,1,0
"{'i': 5.0206856299497575, 'u': 2.8051119139453413, ...","{'hed': 0.0, 'rad': 2.8100829266673615, ...",1,0,0
"{'p': 4.254513314374922, 'n': 1.4387178118361246, ...","{'hed': 0.0, 'att': 1.8992607449718302, ...",1,0,0
"{'u': 2.8051119139453413, 'n': 0.35967945295903 ...","{'hed': 0.0, 'rad': 2.8100829266673615} ...",0,1,0
"{'nh': 2.9220358146316543, 'p': ...","{'hed': 0.0, 'lad': 4.8283137373023015, ...",0,0,1
"{'n': 0.35967945295903114, ...","{'hed': 0.0, 'sbv': 1.6607312068216509} ...",0,0,1


In [50]:
features

['\xef\xbb\xbftime',
 'user_id',
 'query',
 'result_click',
 'url',
 'result_and_click',
 'result_rank',
 'click_rank',
 'words',
 'poses',
 'parsers',
 'len_words',
 'len_seg',
 'label',
 'words_count',
 'poses_count',
 'parsers_count',
 'words_tfidf',
 'poses_tfidf',
 'parsers_tfidf',
 'label.Information',
 'label.Navigation',
 'label.Resource']

In [54]:
features1 = ['words_tfidf', 'poses_tfidf', 'parsers_tfidf', 'len_words', 'len_seg', 'result_rank', 'click_rank']
features2 = ['words_count']

In [55]:
train_data, test_data = data.random_split(.8, seed=0)

In [56]:
train_data, validation_data = train_data.random_split(0.75, seed=0)
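Splitting off 80% first and then keeping 75% of that portion yields roughly 60% training, 20% validation, and 20% test data (0.8 × 0.75 = 0.6). The sketch below simulates this with a simple per-row random draw; `random_split` here is a hypothetical stand-in written for illustration, and treating the split as an independent draw per row is an assumption about how an approximate split of this kind behaves.

```python
import random

def random_split(rows, fraction, seed=0):
    """Send each row to the first list with probability `fraction`."""
    rng = random.Random(seed)
    first, second = [], []
    for row in rows:
        (first if rng.random() < fraction else second).append(row)
    return first, second

rows = list(range(10000))
train_val, test = random_split(rows, 0.8, seed=0)
train, validation = random_split(train_val, 0.75, seed=0)
print(len(train), len(validation), len(test))
```

Because the split is random, the three sizes are only approximately 6,000, 2,000, and 2,000 for 10,000 rows.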

#### 3.4.2 Logistic Regression

* Classifier training

In [61]:
logistic_model_1 = graphlab.logistic_classifier.create(
    train_data, target='label', features=features1,
    validation_set=validation_data,
    max_iterations=50, l2_penalty=0.01, l1_penalty=0)

In [58]:
logistic_model_2 = graphlab.logistic_classifier.create(
    train_data, target='label', features=features2,
    validation_set=validation_data,
    max_iterations=50, l2_penalty=0.05, l1_penalty=0)


In [75]:
logistic_model_1.evaluate(test_data)

{'accuracy': 0.6676829268292683,
 'auc': 0.7108382904723468,
 'confusion_matrix': Columns:
 	target_label	str
 	predicted_label	str
 	count	int
 
 Rows: 9
 
 Data:
 +--------------+-----------------+-------+
 | target_label | predicted_label | count |
 +--------------+-----------------+-------+
 |   Resource   |    Navigation   |   15  |
 |  Navigation  |    Navigation   |  110  |
 |  Navigation  |   Information   |  100  |
 | Information  |    Navigation   |   65  |
 | Information  |     Resource    |  196  |
 |  Navigation  |     Resource    |   11  |
 |   Resource   |     Resource    |  133  |
 |   Resource   |   Information   |  267  |
 | Information  |   Information   |  1071 |
 +--------------+-----------------+-------+
 [9 rows x 3 columns],
 'f1_score': 0.5536276282344855,
 'log_loss': 2.561308966948614,
 'precision': 0.5716360872729151,
 'recall': 0.5407578461086612,
 'roc_curve': Columns:
 	threshold	float
 	fpr	float
 	tpr	float
 	p	int
 	n	int
 	class	int
 
 Rows: 300003
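The summary scores in this output can be recomputed from the confusion matrix alone, which also shows that the reported precision, recall, and F1 match unweighted (macro) averages over the three classes. The sketch below does that arithmetic in plain Python; the counts are copied from the table above, and the function name `macro_metrics` is an assumption.

```python
# (target_label, predicted_label) -> count, copied from the evaluation output.
confusion = {
    ('Resource', 'Navigation'): 15,
    ('Navigation', 'Navigation'): 110,
    ('Navigation', 'Information'): 100,
    ('Information', 'Navigation'): 65,
    ('Information', 'Resource'): 196,
    ('Navigation', 'Resource'): 11,
    ('Resource', 'Resource'): 133,
    ('Resource', 'Information'): 267,
    ('Information', 'Information'): 1071,
}

def macro_metrics(confusion):
    """Accuracy plus macro-averaged precision, recall, and F1."""
    classes = sorted(set(t for t, _ in confusion) | set(p for _, p in confusion))
    total = float(sum(confusion.values()))
    correct = sum(n for (t, p), n in confusion.items() if t == p)
    precisions, recalls, f1s = [], [], []
    for c in classes:
        tp = confusion.get((c, c), 0)
        predicted = sum(n for (t, p), n in confusion.items() if p == c)
        actual = sum(n for (t, p), n in confusion.items() if t == c)
        precision = tp / float(predicted) if predicted else 0.0
        recall = tp / float(actual) if actual else 0.0
        denom = precision + recall
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(2 * precision * recall / denom if denom else 0.0)
    n = len(classes)
    return {'accuracy': correct / total,
            'precision': sum(precisions) / n,
            'recall': sum(recalls) / n,
            'f1_score': sum(f1s) / n}

metrics = macro_metrics(confusion)
print(metrics)
```

The per-class values behind these averages show where the model struggles: only 133 of the 415 Resource queries are recovered, while most errors are Resource and Navigation queries predicted as Information.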
 


In [115]:
boost_tree_1 = graphlab.boosted_trees_classifier.create(
    train_data, 'label', features=features1,
    max_iterations=50, validation_set=validation_data,
    max_depth=20)

In [117]:
boost_tree_1.evaluate(test_data)

{'accuracy': 0.7235772357723578,
 'auc': 0.7866244990978039,
 'confusion_matrix': Columns:
 	target_label	str
 	predicted_label	str
 	count	int
 
 Rows: 8
 
 Data:
 +--------------+-----------------+-------+
 | target_label | predicted_label | count |
 +--------------+-----------------+-------+
 | Information  |     Resource    |   92  |
 | Information  |    Navigation   |   56  |
 |  Navigation  |    Navigation   |  133  |
 |  Navigation  |   Information   |   88  |
 |   Resource   |    Navigation   |   23  |
 |   Resource   |     Resource    |  107  |
 |   Resource   |   Information   |  285  |
 | Information  |   Information   |  1184 |
 +--------------+-----------------+-------+
 [8 rows x 3 columns],
 'f1_score': 0.5941712303098206,
 'log_loss': 0.6858752157606018,
 'precision': 0.6418278900308144,
 'recall': 0.5828433896470749,
 'roc_curve': Columns:
 	threshold	float
 	fpr	float
 	tpr	float
 	p	int
 	n	int
 	class	int
 
 Rows: 300003
 
 Data:
 +-----------+-----+-----+------+---

In [116]:
boost_tree_2 = graphlab.boosted_trees_classifier.create(
    train_data, 'label', features=features2,
    max_iterations=50, validation_set=validation_data,
    max_depth=20)

In [81]:
svm_model_1_information = graphlab.svm_classifier.create(
    train_data, target='label.Information', features=features1,
    validation_set=validation_data,
    max_iterations=1400)

In [82]:
svm_model_1_information.evaluate(test_data)

{'accuracy': 0.6595528455284553, 'confusion_matrix': Columns:
 	target_label	int
 	predicted_label	int
 	count	int
 
 Rows: 4
 
 Data:
 +--------------+-----------------+-------+
 | target_label | predicted_label | count |
 +--------------+-----------------+-------+
 |      1       |        0        |  371  |
 |      0       |        0        |  337  |
 |      0       |        1        |  299  |
 |      1       |        1        |  961  |
 +--------------+-----------------+-------+
 [4 rows x 3 columns], 'f1_score': 0.7415123456790124, 'precision': 0.7626984126984127, 'recall': 0.7214714714714715}
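
The metrics in this output can be recomputed directly from the four confusion-matrix counts, which helps when checking how accuracy, precision, recall and F1 relate to each other. The sketch below assumes class 1 (Information) is the positive class; `binary_metrics` is a helper name introduced here for illustration.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 for the positive class."""
    # float() keeps the divisions exact under Python 2 as well as Python 3.
    precision = float(tp) / (tp + fp)
    recall = float(tp) / (tp + fn)
    return {
        "accuracy": float(tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1_score": 2 * precision * recall / (precision + recall),
    }


# Counts read from the svm_model_1_information confusion matrix above:
# (target=1, predicted=1) = 961, (0, 1) = 299, (1, 0) = 371, (0, 0) = 337.
m = binary_metrics(tp=961, fp=299, fn=371, tn=337)
for name in ("accuracy", "precision", "recall", "f1_score"):
    print("%-9s %.4f" % (name, m[name]))
```

The recomputed values agree with the ones reported by `evaluate`, so the reported precision and recall refer to the Information class rather than to an average over both classes.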

In [85]:
svm_model_2_information = graphlab.svm_classifier.create(
    train_data, target='label.Information', features=features2,
    validation_set=validation_data, max_iterations=1000)

In [86]:
svm_model_2_information.evaluate(test_data)

{'accuracy': 0.6478658536585366, 'confusion_matrix': Columns:
 	target_label	int
 	predicted_label	int
 	count	int
 
 Rows: 4
 
 Data:
 +--------------+-----------------+-------+
 | target_label | predicted_label | count |
 +--------------+-----------------+-------+
 |      0       |        1        |  264  |
 |      1       |        1        |  903  |
 |      0       |        0        |  372  |
 |      1       |        0        |  429  |
 +--------------+-----------------+-------+
 [4 rows x 3 columns], 'f1_score': 0.7226890756302522, 'precision': 0.7737789203084833, 'recall': 0.6779279279279279}

In [72]:
logistic_model_1_information = graphlab.logistic_classifier.create(
    train_data, target='label.Information', features=features1,
    validation_set=validation_data, max_iterations=50,
    l2_penalty=0.05, l1_penalty=0)

In [73]:
logistic_model_1_information.evaluate(test_data)

{'accuracy': 0.6686991869918699,
 'auc': 0.6361083961319807,
 'confusion_matrix': Columns:
 	target_label	int
 	predicted_label	int
 	count	int
 
 Rows: 4
 
 Data:
 +--------------+-----------------+-------+
 | target_label | predicted_label | count |
 +--------------+-----------------+-------+
 |      0       |        1        |  339  |
 |      1       |        0        |  313  |
 |      0       |        0        |  297  |
 |      1       |        1        |  1019 |
 +--------------+-----------------+-------+
 [4 rows x 3 columns],
 'f1_score': 0.7576208178438661,
 'log_loss': 2.469118473734102,
 'precision': 0.7503681885125184,
 'recall': 0.765015015015015,
 'roc_curve': Columns:
 	threshold	float
 	fpr	float
 	tpr	float
 	p	int
 	n	int
 
 Rows: 100001
 
 Data:
 +-----------+----------------+----------------+------+-----+
 | threshold |      fpr       |      tpr       |  p   |  n  |
 +-----------+----------------+----------------+------+-----+
 |    0.0    |      1.0       |      1.0

In [91]:
logistic_model_1_navigation = graphlab.logistic_classifier.create(
    train_data, target='label.Navigation',
    features=['tfidf', 'search_tfidf', 'result_rank', 'click_rank'],
    validation_set=validation_data_navigation, max_iterations=50,
    l2_penalty=0, l1_penalty=30)

In [92]:
logistic_model_1_navigation.evaluate(test_data)

{'accuracy': 0.9014227642276422,
 'auc': 0.8685283368774397,
 'confusion_matrix': Columns:
 	target_label	int
 	predicted_label	int
 	count	int
 
 Rows: 4
 
 Data:
 +--------------+-----------------+-------+
 | target_label | predicted_label | count |
 +--------------+-----------------+-------+
 |      1       |        1        |   71  |
 |      0       |        1        |   44  |
 |      1       |        0        |  150  |
 |      0       |        0        |  1703 |
 +--------------+-----------------+-------+
 [4 rows x 3 columns],
 'f1_score': 0.42261904761904767,
 'log_loss': 0.249450913006891,
 'precision': 0.6173913043478261,
 'recall': 0.3212669683257919,
 'roc_curve': Columns:
 	threshold	float
 	fpr	float
 	tpr	float
 	p	int
 	n	int
 
 Rows: 100001
 
 Data:
 +-----------+----------------+-----+-----+------+
 | threshold |      fpr       | tpr |  p  |  n   |
 +-----------+----------------+-----+-----+------+
 |    0.0    |      1.0       | 1.0 | 221 | 1747 |
 |   1e-05   |      

In [93]:
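# NOTE: the validation set is validation_data_navigation, the same one used for
# the Navigation model; confirm that this is intended for the Resource target.
logistic_model_1_resource = graphlab.logistic_classifier.create(
    train_data, target='label.Resource',
    features=['tfidf', 'search_tfidf', 'result_rank', 'click_rank'],
    validation_set=validation_data_navigation, max_iterations=50,
    l2_penalty=0, l1_penalty=30)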

In [94]:
logistic_model_1_resource.evaluate(test_data)

{'accuracy': 0.9735772357723578,
 'auc': 0.6935112847222209,
 'confusion_matrix': Columns:
 	target_label	int
 	predicted_label	int
 	count	int
 
 Rows: 4
 
 Data:
 +--------------+-----------------+-------+
 | target_label | predicted_label | count |
 +--------------+-----------------+-------+
 |      0       |        1        |   5   |
 |      1       |        0        |   47  |
 |      0       |        0        |  1915 |
 |      1       |        1        |   1   |
 +--------------+-----------------+-------+
 [4 rows x 3 columns],
 'f1_score': 0.037037037037037035,
 'log_loss': 0.10829273084385369,
 'precision': 0.16666666666666666,
 'recall': 0.020833333333333332,
 'roc_curve': Columns:
 	threshold	float
 	fpr	float
 	tpr	float
 	p	int
 	n	int
 
 Rows: 100001
 
 Data:
 +-----------+-----+-----+----+------+
 | threshold | fpr | tpr | p  |  n   |
 +-----------+-----+-----+----+------+
 |    0.0    | 1.0 | 1.0 | 48 | 1920 |
 |   1e-05   | 1.0 | 1.0 | 48 | 1920 |
 |   2e-05   | 1.0 | 1.
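
The Resource classifier reports an accuracy of about 0.974 together with an F1 score of about 0.037. The gap comes from class imbalance: only 48 of the 1968 test queries are Resource queries. The sketch below compares the model with a hypothetical baseline that labels every query as "not Resource"; the baseline is an assumption added for illustration and is not part of the notebook.

```python
# Counts read from the logistic_model_1_resource confusion matrix above.
tp, fp, fn, tn = 1, 5, 47, 1915
total = tp + fp + fn + tn

model_accuracy = float(tp + tn) / total
# Hypothetical baseline: always predict 0, so every negative is correct.
baseline_accuracy = float(fp + tn) / total
model_recall = float(tp) / (tp + fn)

print("model accuracy:    %.4f" % model_accuracy)
print("baseline accuracy: %.4f" % baseline_accuracy)
print("model recall:      %.4f" % model_recall)
```

Because the always-negative baseline is slightly more accurate than the trained model, accuracy alone says little about this class; recall and F1 are the more informative figures here.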

In [41]:
logistic_model_2 = graphlab.logistic_classifier.create(
    train_data, target='label',
    features=['search_tfidf', 'result_rank', 'click_rank'],
    validation_set=validation_data)

* Classifier testing

In [42]:
logistic_model_1.evaluate(test_data)

{'accuracy': 0.8689024390243902,
 'auc': 0.7889915995372349,
 'confusion_matrix': Columns:
 	target_label	str
 	predicted_label	str
 	count	int
 
 Rows: 9
 
 Data:
 +--------------+-----------------+-------+
 | target_label | predicted_label | count |
 +--------------+-----------------+-------+
 |  Navigation  |    Navigation   |  103  |
 | Information  |    Navigation   |   76  |
 | Information  |     Resource    |   23  |
 |  Navigation  |   Information   |  117  |
 |   Resource   |   Information   |   40  |
 | Information  |   Information   |  1600 |
 |  Navigation  |     Resource    |   1   |
 |   Resource   |     Resource    |   7   |
 |   Resource   |    Navigation   |   1   |
 +--------------+-----------------+-------+
 [9 rows x 3 columns],
 'f1_score': 0.5389522755075119,
 'log_loss': 0.46973028282117485,
 'precision': 0.5695572718513214,
 'recall': 0.5178757038047105,
 'roc_curve': Columns:
 	threshold	float
 	fpr	float
 	tpr	float
 	p	int
 	n	int
 	class	int
 
 Rows: 300003
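
For the three-class models, the reported precision, recall and F1 are much lower than the accuracy. Recomputing them from the confusion matrix shows that they match an unweighted (macro) average over the three classes, so the rare Navigation and Resource classes count as much as Information. The sketch below is a plain-Python reconstruction under that reading; `macro_metrics` is a name introduced here, and skipping classes whose precision is undefined is an assumption of this sketch.

```python
def macro_metrics(confusion, classes):
    """Macro-averaged (precision, recall, f1) from {(target, predicted): count}."""
    precisions, recalls, f1s = [], [], []
    for c in classes:
        tp = confusion.get((c, c), 0)
        predicted = sum(confusion.get((t, c), 0) for t in classes)  # column total
        actual = sum(confusion.get((c, p), 0) for p in classes)     # row total
        if predicted:  # undefined for a class that is never predicted
            precisions.append(float(tp) / predicted)
        if actual:
            recalls.append(float(tp) / actual)
        if predicted + actual:
            f1s.append(2.0 * tp / (predicted + actual))

    def mean(values):
        return sum(values) / len(values)

    return mean(precisions), mean(recalls), mean(f1s)


classes = ["Information", "Navigation", "Resource"]
# Counts copied from the logistic_model_1 confusion matrix above.
logistic_1 = {
    ("Navigation", "Navigation"): 103, ("Information", "Navigation"): 76,
    ("Information", "Resource"): 23, ("Navigation", "Information"): 117,
    ("Resource", "Information"): 40, ("Information", "Information"): 1600,
    ("Navigation", "Resource"): 1, ("Resource", "Resource"): 7,
    ("Resource", "Navigation"): 1,
}
precision, recall, f1 = macro_metrics(logistic_1, classes)
print("precision %.4f  recall %.4f  f1 %.4f" % (precision, recall, f1))
```

Per class, Information reaches a recall of roughly 0.94 while Resource stays near 0.15, which is what pulls the macro averages down to about 0.5.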


In [43]:
logistic_model_2.evaluate(test_data)

{'accuracy': 0.8648373983739838,
 'auc': 0.7887784476518876,
 'confusion_matrix': Columns:
 	target_label	str
 	predicted_label	str
 	count	int
 
 Rows: 9
 
 Data:
 +--------------+-----------------+-------+
 | target_label | predicted_label | count |
 +--------------+-----------------+-------+
 | Information  |     Resource    |   20  |
 | Information  |    Navigation   |   85  |
 |  Navigation  |    Navigation   |  102  |
 |  Navigation  |   Information   |  118  |
 |   Resource   |   Information   |   41  |
 | Information  |   Information   |  1594 |
 |  Navigation  |     Resource    |   1   |
 |   Resource   |     Resource    |   6   |
 |   Resource   |    Navigation   |   1   |
 +--------------+-----------------+-------+
 [9 rows x 3 columns],
 'f1_score': 0.5274333672364083,
 'log_loss': 0.479065690164296,
 'precision': 0.5580245864682271,
 'recall': 0.5082458006972427,
 'roc_curve': Columns:
 	threshold	float
 	fpr	float
 	tpr	float
 	p	int
 	n	int
 	class	int
 
 Rows: 300003
 


#### 3.4.3 Boosted Tree

* Classifier training

In [44]:
boost_model_1 = graphlab.boosted_trees_classifier.create(
    train_data, target='label',
    features=['tfidf', 'search_tfidf', 'result_rank', 'click_rank'],
    validation_set=validation_data)

In [45]:
boost_model_2 = graphlab.boosted_trees_classifier.create(
    train_data, target='label',
    features=['search_tfidf', 'result_rank', 'click_rank'],
    validation_set=validation_data)

* Classifier testing

In [46]:
boost_model_1.evaluate(test_data)

{'accuracy': 0.8785569105691057,
 'auc': 0.746247017529306,
 'confusion_matrix': Columns:
 	target_label	str
 	predicted_label	str
 	count	int
 
 Rows: 7
 
 Data:
 +--------------+-----------------+-------+
 | target_label | predicted_label | count |
 +--------------+-----------------+-------+
 |   Resource   |    Navigation   |   3   |
 |  Navigation  |    Navigation   |   55  |
 |  Navigation  |   Information   |  166  |
 | Information  |    Navigation   |   26  |
 |   Resource   |   Information   |   44  |
 | Information  |   Information   |  1673 |
 |   Resource   |     Resource    |   1   |
 +--------------+-----------------+-------+
 [7 rows x 3 columns],
 'f1_score': 0.44519569459256186,
 'log_loss': 0.36782414625815874,
 'precision': 0.8477459137310438,
 'recall': 0.41813299737727605,
 'roc_curve': Columns:
 	threshold	float
 	fpr	float
 	tpr	float
 	p	int
 	n	int
 	class	int
 
 Rows: 300003
 
 Data:
 +-----------+-----+-----+------+-----+-------+
 | threshold | fpr | tpr |  p 

In [47]:
boost_model_2.evaluate(test_data)

{'accuracy': 0.8683943089430894,
 'auc': 0.7198899577124401,
 'confusion_matrix': Columns:
 	target_label	str
 	predicted_label	str
 	count	int
 
 Rows: 5
 
 Data:
 +--------------+-----------------+-------+
 | target_label | predicted_label | count |
 +--------------+-----------------+-------+
 |  Navigation  |    Navigation   |   19  |
 | Information  |    Navigation   |   9   |
 |  Navigation  |   Information   |  202  |
 |   Resource   |   Information   |   48  |
 | Information  |   Information   |  1690 |
 +--------------+-----------------+-------+
 [5 rows x 3 columns],
 'f1_score': 0.36047901416051675,
 'log_loss': 0.41459987487687805,
 'precision': 0.7748527245949927,
 'recall': 0.3602252056706234,
 'roc_curve': Columns:
 	threshold	float
 	fpr	float
 	tpr	float
 	p	int
 	n	int
 	class	int
 
 Rows: 300003
 
 Data:
 +-----------+-----+-----+------+-----+-------+
 | threshold | fpr | tpr |  p   |  n  | class |
 +-----------+-----+-----+------+-----+-------+
 |    0.0    | 1.0 | 1

#### 3.4.4 Comparing the effect of different classifiers and feature sets on query intent recognition

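&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;This part still **needs improvement**: the choice of classifier and of feature set does affect the results, but only slightly.
* Changing the classifier shifts query intent recognition by about **1 percentage point**;
* Changing the feature set shifts it by only about **0.1 percentage points**.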
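
The size of these effects can be checked against the test-set accuracies printed above. The sketch below takes the four `accuracy` values from the `evaluate` outputs and expresses the gaps in percentage points. It assumes that `logistic_model_1`, like `boost_model_1`, was trained on the feature set that includes `tfidf`; its training cell is not shown in this part of the notebook.

```python
# Test-set accuracies copied from the evaluate() outputs above.
accuracy = {
    ("logistic", "with tfidf"): 0.8689024390243902,          # logistic_model_1
    ("logistic", "without tfidf"): 0.8648373983739838,       # logistic_model_2
    ("boosted trees", "with tfidf"): 0.8785569105691057,     # boost_model_1
    ("boosted trees", "without tfidf"): 0.8683943089430894,  # boost_model_2
}


def gap_in_points(a, b):
    """Absolute accuracy difference between two runs, in percentage points."""
    return abs(accuracy[a] - accuracy[b]) * 100


classifier_gap = gap_in_points(("logistic", "with tfidf"),
                               ("boosted trees", "with tfidf"))
feature_gap_logistic = gap_in_points(("logistic", "with tfidf"),
                                     ("logistic", "without tfidf"))
feature_gap_boosted = gap_in_points(("boosted trees", "with tfidf"),
                                    ("boosted trees", "without tfidf"))

print("classifier gap (with tfidf):   %.2f points" % classifier_gap)
print("feature gap (logistic):        %.2f points" % feature_gap_logistic)
print("feature gap (boosted trees):   %.2f points" % feature_gap_boosted)
```

On these four runs the classifier gap is close to the 1 point quoted above, while the feature gaps come out at roughly 0.4 and 1.0 points. The 0.1 figure may therefore refer to a different metric, such as AUC, or to runs not shown here.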

## 4. Conclusion

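&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Because the thesis is not yet finished, the conclusions still need further analysis.
* Before the defense, study the **LTP toolkit** in more depth to improve feature extraction;
* Also **tune the classifier parameters** to improve the classification results.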
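
One simple way to organize the parameter tuning is an exhaustive grid search scored on the validation set. The sketch below is a generic illustration, not the notebook's method: `grid_search` and `toy_evaluate` are hypothetical names, and the stand-in scorer would be replaced by a function that trains a classifier with the given `max_depth` and `max_iterations` and returns its score on `validation_data`.

```python
import itertools


def grid_search(evaluate, grid):
    """Try every parameter combination; return (best_score, best_params).

    `evaluate` is a caller-supplied function that maps a parameter dict to a
    validation score, where higher is better.
    """
    names = sorted(grid)
    best_score, best_params = None, None
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if best_score is None or score > best_score:
            best_score, best_params = score, params
    return best_score, best_params


# Stand-in scorer for illustration only; it peaks at max_depth=6 and
# max_iterations=50 and involves no real training.
def toy_evaluate(params):
    return (-abs(params["max_depth"] - 6)
            - abs(params["max_iterations"] - 50) / 100.0)


grid = {"max_depth": [4, 6, 10, 20], "max_iterations": [10, 50, 100]}
best_score, best_params = grid_search(toy_evaluate, grid)
print(best_params)
```

Selecting parameters on the validation set and reporting the final figures on `test_data` only once keeps the test results from being tuned to.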

<img src='classification.jpg' align='left'>

# Comments and corrections are welcome. Thank you!