<img src='NEFUlogo.png' align='left'>

# 基于查询日志的查询意图识别研究

## 1. 讲在前面

### 1.1 研究背景

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;互联网（World Wide Web）是世界上最大的信息资源库，是人们快速获取信息的重要途径之一，极大的改变了人们的生活方式。据2016年1月中国互联网络信息中心（CNNIC）发布的第37次*《中国互联网络发展状况统计报告》*显示，截至2015年12月，中国网民已达6.88亿，互联网普及率高达50.3%。截至2015年12月，中国网站数量为423万个，年增长26.3%，中国网页的数量为2123亿个，年增长11.8%，如图所示。

<img src='Extraction Features.jpeg' align='left'>

<img src='中国网站数量.png' align='left'>

<img src='中国网页数量.png' align='left'>

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;由于用户提交的查询往往较短，自然语言存在模糊性，无法清晰的表达用户的意图。所以包含查询关键字的查询结果，有可能不足以满足用户的需求；搜索引擎返回的数以千计的文档也造成严重的信息过载。有研究显示，在查询日志中，有至少16%的歧义查询，有超过75%的查询具有更复杂的信息需求，难以被简单的答案或者特定的网址满足。

<img src='结果点击排名.png' align='left'>

### 1.2 研究意义

* **增加用户对搜索结果的满意度**：正确对用户意图进行识别，可以有效解决“信息过载”等问题，提高搜索效率；<br>
* **提高广告推荐的精度**：向搜索引擎用户提供和用户当前搜索相关的广告，提高商业价值；<br>
* **帮助搜索引擎组织和检索信息**：可以根据意图整理网络上的信息资源，建立更高效的检索系统。

### 1.3 国内外研究现状

* 查询意图**类目体系构建**<br>
* 查询意图**特征提取**<br>
* 查询意图的**识别方法研究**<br>
* **数据集与评价方法**<br>

### 1.4 论文框架

<img src='论文框架.png' align='left'>

## 2. 相关理论介绍

## 3. 实验

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;导入要用到的python程序库，并且设置数据展示在notebook中。

In [1]:
import graphlab
import re
import pandas as pd
import string
import jieba.posseg as pseg
import jieba
from collections import OrderedDict
from collections import Counter

In [2]:
graphlab.canvas.set_target('ipynb')

### 3.1 搜集数据

#### 3.1.1 查询日志数据（subset）

In [3]:
data = graphlab.SFrame.read_csv('SogouQ.txt', delimiter='\t', header=True, column_type_hints=[str, str, str, str, str])

This non-commercial license of GraphLab Create is assigned to guoxiuhe@nefu.edu.cn and will expire on April 02, 2017. For commercial licensing options, visit https://dato.com/buy/.


2016-05-07 17:50:24,084 [INFO] graphlab.cython.cy_server, 176: GraphLab Create v1.9 started. Logging: C:\Users\heguoxiu\AppData\Local\Temp\graphlab_server_1462614612.log.0


In [4]:
data

time,user_id,query,result_click,url
00:00:00,2982199073774412,[360安全卫士],8 3,download.it.com.cn/softwe b/software/firewall/a ...
00:00:00,7594220010824798,[哄抢救灾物资],1 1,news.21cn.com/social/daqi an/2008/05/29/4777194 ...
00:00:00,5228056822071097,[75810部队],14 5,www.greatoo.com/greatoo_c n/list.asp?link_id=27 ...
00:00:00,6140463203615646,[绳艺],62 36,www.jd-cd.com/jd_opus/xx/ 200607/706.html ...
00:00:00,8561366108033201,[汶川地震原因],3 2,www.big38.net/
00:00:00,23908140386148713,[莫衷一是的意思],1 2,www.chinabaike.com/articl e/81/82/110/2007/2007 ...
00:00:00,1797943298449139,[星梦缘全集在线观 看] ...,8 5,www.6wei.net/dianshiju/?? ??\xa1\xe9|????do=index ...
00:00:00,717725924582846,[闪字吧],1 2,www.shanziba.com/
00:00:00,41416219018952116,[霍震霆与朱玲玲照 片] ...,2 6,bbs.gouzai.cn/thread-6987 36.html ...
00:00:00,9975666857142764,[电脑创业],2 2,ks.cn.yahoo.com/question/ 1307120203719.html ...


In [5]:
data['result_and_click'] = data['result_click'].apply(lambda x : re.split(r'\s+',x))

In [6]:
def getfirst(l):
    return int(l[0])
def getsecond(l):
    return int(l[1])

data['result_rank'] = data['result_and_click'].apply(lambda l: getfirst(l))
data['click_rank'] = data['result_and_click'].apply(lambda l: getsecond(l))

In [7]:
data['query'] = data['query'].apply(lambda x: str(x)).apply(lambda x: x.strip().lstrip('[').rstrip(']'))

In [8]:
def remove_tail(url):
    if url[-1] == '/':
        return url[:-1]
    else:
        return url[:]

In [9]:
data['url'] = data['url'].apply(remove_tail)

In [10]:
data

time,user_id,query,result_click,url,result_and_click,result_rank
00:00:00,2982199073774412,360安全卫士,8 3,download.it.com.cn/softwe b/software/firewall/a ...,"[8, 3]",8
00:00:00,7594220010824798,哄抢救灾物资,1 1,news.21cn.com/social/daqi an/2008/05/29/4777194 ...,"[1, 1]",1
00:00:00,5228056822071097,75810部队,14 5,www.greatoo.com/greatoo_c n/list.asp?link_id=27 ...,"[14, 5]",14
00:00:00,6140463203615646,绳艺,62 36,www.jd-cd.com/jd_opus/xx/ 200607/706.html ...,"[62, 36]",62
00:00:00,8561366108033201,汶川地震原因,3 2,www.big38.net,"[3, 2]",3
00:00:00,23908140386148713,莫衷一是的意思,1 2,www.chinabaike.com/articl e/81/82/110/2007/2007 ...,"[1, 2]",1
00:00:00,1797943298449139,星梦缘全集在线观� �� ...,8 5,www.6wei.net/dianshiju/?? ??\xa1\xe9|????do=index ...,"[8, 5]",8
00:00:00,717725924582846,闪字吧,1 2,www.shanziba.com,"[1, 2]",1
00:00:00,41416219018952116,霍震霆与朱玲玲照� �� ...,2 6,bbs.gouzai.cn/thread-6987 36.html ...,"[2, 6]",2
00:00:00,9975666857142764,电脑创业,2 2,ks.cn.yahoo.com/question/ 1307120203719.html ...,"[2, 2]",2

click_rank
3
1
5
36
2
2
5
2
6
2


#### 3.1.2 Open Directory Project(ODP) 体系

<img src='odp.png' align='left'>

In [11]:
odp = graphlab.SFrame.read_csv('ODP.csv', delimiter=',', header=True)

------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[str,str,str,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


In [12]:
odp.head()

url,name,label,url_new
http://news.jmu.edu.cn/,集美大学新闻网,大专院校,news.jmu.edu.cn/
http://jjxj.swufe.edu.cn/,经济学家,出版物,jjxj.swufe.edu.cn/
http://www.jsacd.gov.cn/,江苏省农业资源开� ��局 ...,江苏,www.jsacd.gov.cn/
http://www.yndaily.com/,云南日报网,地区,www.yndaily.com/
http://www.panda.org.cn/,成都大熊猫繁育研� ��基地 ...,熊猫,www.panda.org.cn/
http://www.fjinfo.gov.cn/,福建科技信息,福建,www.fjinfo.gov.cn/
http://www.klxuexi.com/,快乐学习教育科技� ��团 ...,上海,www.klxuexi.com/
http://www.haier.com/cn/,海尔集团,消费电子产品,www.haier.com/cn/
http://www.jstvu.edu.cn/,江苏广播电视大学,江苏,www.jstvu.edu.cn/
http://www.gxtc.edu.cn/,广西师范学院,大专院校,www.gxtc.edu.cn/


In [13]:
def remove_head(url):
    return string.replace(url, 'http://', '')

In [14]:
def remove_tail(url):
    if url[-1] == '/':
        return url[:-1]
    else:
        return url[:]

In [15]:
odp['url_new'] = odp['url'].apply(remove_head).apply(remove_tail)

In [16]:
odp.head()

url,name,label,url_new
http://news.jmu.edu.cn/,集美大学新闻网,大专院校,news.jmu.edu.cn
http://jjxj.swufe.edu.cn/,经济学家,出版物,jjxj.swufe.edu.cn
http://www.jsacd.gov.cn/,江苏省农业资源开� ��局 ...,江苏,www.jsacd.gov.cn
http://www.yndaily.com/,云南日报网,地区,www.yndaily.com
http://www.panda.org.cn/,成都大熊猫繁育研� ��基地 ...,熊猫,www.panda.org.cn
http://www.fjinfo.gov.cn/,福建科技信息,福建,www.fjinfo.gov.cn
http://www.klxuexi.com/,快乐学习教育科技� ��团 ...,上海,www.klxuexi.com
http://www.haier.com/cn/,海尔集团,消费电子产品,www.haier.com/cn
http://www.jstvu.edu.cn/,江苏广播电视大学,江苏,www.jstvu.edu.cn
http://www.gxtc.edu.cn/,广西师范学院,大专院校,www.gxtc.edu.cn


### 3.2 构建新的类目体系以标注查询日志数据

#### 3.2.1 将ODP主题类目体系映射到Rose类目体系

* Rose类目体系

<img src='rose.png' align='left'>

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;通过对Rose分类体系的分析，本文将ODP主题类目体系映射到Rose类目体系的三大类即*信息类* 、*资源类* 和*导航类* 中，主要是*信息类* 和*导航类*。然后对查询日志数据进行标注。

* **导航类**：本文将日志数据中url仅能匹配到web服务器名称的标记为导航类。如：www.nefu.edu.cn
* **资源类**：本文首先利用启发式的方法，把日志数据中url能匹配到download/game/music/movie/book等字符的标记为资源类。然后利用上述的分析，人工筛选出ODP中属于资源类的url，构建资源类url库*resource*，当日志数据中的url可以匹配到*resource*时，标记为资源类。例如：ODP中的购物类、游戏类等都属于资源类。
* **信息类**：将其他不属于以上两类的标注为信息类。

In [17]:
r = ['二手货','交通工具','休闲','体育用品','健康饮食','健康器材','在线销售','化妆美容','出版物','办公用品','化妆美容','古董与收藏','图书',\
     '婴幼儿用品','宠物','家具','家居与园艺','批发','日用商品','服装饰品','消费电子产品','图书','玩具与游戏','杂志','音像制品','化妆品',\
     '珠宝首饰','礼品','视觉艺术','计算机','食品','鲜花','精油香氛','分类','拍卖','目录','大学出版社','家具','家居装饰','电器','体育用品',\
    '乒乓球','渔具','飞镖','家具','文具','办公室服务','购物','批发与分销','烟草','机动车','珠宝首饰','眼镜','行李与包','食品','女装','童装',\
     '鞋帽','饰品','电子通讯','数字卡','虚拟物品交易','摄影','画','饮料','茶','葡萄酒',\
    '卡牌游戏','投币式游戏','棋类游戏','牌类游戏','电子游戏','电脑游戏','益智游戏','网络游戏','角色扮演','赌博',\
        '中国象棋','军棋','围棋','国际象棋','连珠','黑白棋','组织','休闲','体育','冒险','动作','射击','格斗','模拟','益智','策略','网络游戏',\
        '角色扮演','赛车','音乐与舞蹈','射击','格斗','网页游戏','魔兽争霸','魔兽世界','大型多人在线','网页游戏','角色扮演',\
       '网络泥巴','角色扮演','冒险岛','天龙八部','永恒之塔','魔兽世界','手持平台','游戏机平台','网页游戏','计算机平台','手机','新闻与评论',\
        '世嘉','任天堂','微软','索尼','下载','下载','会议展览','作弊与攻略','家族与公会','开发商与发布商','新闻与评论',\
        '电子竞技','聊天与论坛','麻将','体育','彩票','赌场']

In [18]:
f = open('resource_show.csv', 'a')

In [19]:
for i in r:
    for url in odp[odp['label'] == i]['url_new']:
        f.writelines(url+'\n')
f.close()

In [20]:
resource = graphlab.SFrame.read_csv('resource_show.csv',header=False)

------------------------------------------------------


Inferred types from first 100 line(s) of file as 
column_type_hints=[str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


In [21]:
resource.unique().save('resource_show.csv')

In [22]:
resource = graphlab.SFrame.read_csv('resource_show.csv',header=False)

------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument


------------------------------------------------------


In [23]:
resource.head()

X1
X1
wow.uuu9.com/immtc
www.bmw-motorsport.com.cn
zh.wikipedia.org/zh- cn/Xbox ...
www.zglyyx.com
www.csapa.org
app.hicloud.com
www.e800.com.cn
d3.178.com
sports.sohu.com/weiqi.sht ml ...


#### 3.2.2 对查询日志数据进行标注

标注资源类数据的方法

In [24]:
def label_resource(row, resource):
    for url_i in resource['X1']:
        if row['url'].find(url_i) != -1:
            return True
    return False

对整个数据集进行标注的方法

In [25]:
def log_label(row, resource):
    if re.match(r'www(.*?)\b(com|cn|org|net|gov|xin|red|pub|ink|info|xyz|win|edu|mil|tv|TV|mobi|travel|name|aero|museum|pro|biz|coop|aero)\b$',\
                row['url']):
        return 'Navigation'
    elif label_resource(row, resource):
        return 'Resource'
    else:
        return 'Information'

In [26]:
data['label'] = data.apply(lambda x: log_label(x, resource))

In [27]:
#data['label'].show()

### 3.3 利用NLP技术提取特征

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;本文主要利用NLP技术来提取Query中的特征，包括：分词、词性统计等特征。结合点击排序特征和结果排序特征共同作为查询意图识别的特征。

In [28]:
def cut(query):
    flag_cut = ''
    temp = pseg.cut(query)
    for word, flag in temp:
        flag_cut = flag_cut + ' ' + flag
    return flag_cut

In [29]:
data['flag_cut'] = data['query'].apply(cut)

Building prefix dict from the default dictionary ...
2016-05-07 17:52:07,730 [DEBUG] jieba, 111: Building prefix dict from the default dictionary ...
Loading model from cache c:\users\heguoxiu\appdata\local\temp\jieba.cache
2016-05-07 17:52:07,733 [DEBUG] jieba, 131: Loading model from cache c:\users\heguoxiu\appdata\local\temp\jieba.cache
Loading model cost 1.660 seconds.
2016-05-07 17:52:09,392 [DEBUG] jieba, 163: Loading model cost 1.660 seconds.
Prefix dict has been built succesfully.
2016-05-07 17:52:09,400 [DEBUG] jieba, 164: Prefix dict has been built succesfully.


In [30]:
def search_cut(query):
    query_search_cut = ''
    temp = jieba.cut_for_search(query)
    for word in temp:
        query_search_cut = query_search_cut + ' ' + word 
    return query_search_cut

In [31]:
data['query_search_cut'] = data['query'].apply(search_cut)

In [32]:
data

time,user_id,query,result_click,url,result_and_click,result_rank
00:00:00,2982199073774412,360安全卫士,8 3,download.it.com.cn/softwe b/software/firewall/a ...,"[8, 3]",8
00:00:00,7594220010824798,哄抢救灾物资,1 1,news.21cn.com/social/daqi an/2008/05/29/4777194 ...,"[1, 1]",1
00:00:00,5228056822071097,75810部队,14 5,www.greatoo.com/greatoo_c n/list.asp?link_id=27 ...,"[14, 5]",14
00:00:00,6140463203615646,绳艺,62 36,www.jd-cd.com/jd_opus/xx/ 200607/706.html ...,"[62, 36]",62
00:00:00,8561366108033201,汶川地震原因,3 2,www.big38.net,"[3, 2]",3
00:00:00,23908140386148713,莫衷一是的意思,1 2,www.chinabaike.com/articl e/81/82/110/2007/2007 ...,"[1, 2]",1
00:00:00,1797943298449139,星梦缘全集在线观� �� ...,8 5,www.6wei.net/dianshiju/?? ??\xa1\xe9|????do=index ...,"[8, 5]",8
00:00:00,717725924582846,闪字吧,1 2,www.shanziba.com,"[1, 2]",1
00:00:00,41416219018952116,霍震霆与朱玲玲照� �� ...,2 6,bbs.gouzai.cn/thread-6987 36.html ...,"[2, 6]",2
00:00:00,9975666857142764,电脑创业,2 2,ks.cn.yahoo.com/question/ 1307120203719.html ...,"[2, 2]",2

click_rank,label,flag_cut,query_search_cut
3,Information,m nz,360 安全 卫士 安全卫士 ...
1,Information,v l,哄抢 救灾 物资 救灾物资 ...
5,Information,m n,75810 部队
36,Information,n,绳艺
2,Navigation,ns n n,汶川 地震 原因
2,Information,i uj n,莫衷一是 的 意思
5,Information,nr n b v,星梦 星梦缘 全集 在线 观看 ...
2,Navigation,n y,闪字 吧
6,Information,nr p nr n,霍震霆 与 玲玲 朱玲玲 照片 ...
2,Information,n n,电脑 创业


### 3.4 比较Logistic Regression和Boost Tree对查询意图识别的效率

#### 3.4.1 数据准备(train_data, validation_data, test_data)

In [33]:
data['flag_word_count'] = graphlab.text_analytics.count_words(data['flag_cut'])

In [34]:
data['query_search_word_count'] = graphlab.text_analytics.count_words(data['query_search_cut'])

In [35]:
data['tfidf'] = graphlab.text_analytics.tf_idf(data['flag_word_count'])

In [36]:
data['search_tfidf'] = graphlab.text_analytics.tf_idf(data['query_search_word_count'])

In [37]:
train_data, test_data = data.random_split(.8, seed=0)

In [38]:
#train_data, validation_data = train_data.random_split(0.75, seed=0)

#### 3.4.2 Logistic Regression

* 分类器训练

In [39]:
logistic_model_1 = graphlab.logistic_classifier.create(train_data, target='label', \
                                                       features=['tfidf', 'search_tfidf', 'result_rank', 'click_rank'])

PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.



In [40]:
logistic_model_2 = graphlab.logistic_classifier.create(train_data, target='label', \
                                                       features=['search_tfidf', 'result_rank', 'click_rank'])

PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.



* 分类器测试

In [41]:
logistic_model_1.evaluate(test_data)

{'accuracy': 0.9028419265294769,
 'auc': 0.8959618612890848,
 'confusion_matrix': Columns:
 	target_label	str
 	predicted_label	str
 	count	int
 
 Rows: 9
 
 Data:
 +--------------+-----------------+--------+
 | target_label | predicted_label | count  |
 +--------------+-----------------+--------+
 |   Resource   |    Navigation   |  429   |
 | Information  |     Resource    |  1238  |
 | Information  |    Navigation   | 10609  |
 |  Navigation  |   Information   | 15139  |
 |  Navigation  |    Navigation   | 25166  |
 |  Navigation  |     Resource    |  122   |
 |   Resource   |     Resource    |  778   |
 |   Resource   |   Information   |  5994  |
 | Information  |   Information   | 285643 |
 +--------------+-----------------+--------+
 [9 rows x 3 columns],
 'f1_score': 0.5896148074104733,
 'log_loss': 0.27898994137215655,
 'precision': 0.6633735508033296,
 'recall': 0.5635740414195917,
 'roc_curve': Columns:
 	threshold	float
 	fpr	float
 	tpr	float
 	p	int
 	n	int
 	class	int
 
 

In [42]:
logistic_model_2.evaluate(test_data)

{'accuracy': 0.9018915269559976,
 'auc': 0.8943043480440952,
 'confusion_matrix': Columns:
 	target_label	str
 	predicted_label	str
 	count	int
 
 Rows: 9
 
 Data:
 +--------------+-----------------+--------+
 | target_label | predicted_label | count  |
 +--------------+-----------------+--------+
 |   Resource   |    Navigation   |  397   |
 | Information  |     Resource    |  1160  |
 | Information  |    Navigation   | 10592  |
 |  Navigation  |   Information   | 15545  |
 |  Navigation  |    Navigation   | 24761  |
 |  Navigation  |     Resource    |  121   |
 |   Resource   |     Resource    |  760   |
 |   Resource   |   Information   |  6044  |
 | Information  |   Information   | 285738 |
 +--------------+-----------------+--------+
 [9 rows x 3 columns],
 'f1_score': 0.5864773450838197,
 'log_loss': 0.2814398149571041,
 'precision': 0.664911406702427,
 'recall': 0.5595079175529878,
 'roc_curve': Columns:
 	threshold	float
 	fpr	float
 	tpr	float
 	p	int
 	n	int
 	class	int
 
 Ro

#### 3.4.3 Boosted Tree

* 分类器训练

In [43]:
boost_model_1 = graphlab.boosted_trees_classifier.create(train_data, target='label', \
                                                         features=['tfidf', 'search_tfidf', 'result_rank', 'click_rank'])

PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.



In [44]:
boost_model_2 = graphlab.boosted_trees_classifier.create(train_data, target='label', \
                                                         features=['search_tfidf', 'result_rank', 'click_rank'])

PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.



* 分类器测试

In [45]:
boost_model_1.evaluate(test_data)

{'accuracy': 0.8818925700774808,
 'auc': 0.7855897148994869,
 'confusion_matrix': Columns:
 	target_label	str
 	predicted_label	str
 	count	int
 
 Rows: 8
 
 Data:
 +--------------+-----------------+--------+
 | target_label | predicted_label | count  |
 +--------------+-----------------+--------+
 |   Resource   |    Navigation   |  137   |
 |  Navigation  |    Navigation   | 10751  |
 |  Navigation  |   Information   | 29676  |
 |   Resource   |     Resource    |  420   |
 |   Resource   |   Information   |  6644  |
 | Information  |   Information   | 293186 |
 | Information  |    Navigation   |  4303  |
 | Information  |     Resource    |   1    |
 +--------------+-----------------+--------+
 [8 rows x 3 columns],
 'f1_score': 0.47733909090299687,
 'log_loss': 0.35691104906725496,
 'precision': 0.8650402566458418,
 'recall': 0.43659788373308245,
 'roc_curve': Columns:
 	threshold	float
 	fpr	float
 	tpr	float
 	p	int
 	n	int
 	class	int
 
 Rows: 300003
 
 Data:
 +-----------+-----+-

In [46]:
boost_model_2.evaluate(test_data)

{'accuracy': 0.8760394995334929,
 'auc': 0.7310632720035798,
 'confusion_matrix': Columns:
 	target_label	str
 	predicted_label	str
 	count	int
 
 Rows: 8
 
 Data:
 +--------------+-----------------+--------+
 | target_label | predicted_label | count  |
 +--------------+-----------------+--------+
 |  Navigation  |    Navigation   |  4754  |
 |  Navigation  |   Information   | 35673  |
 |   Resource   |     Resource    |  421   |
 |   Resource   |   Information   |  6779  |
 | Information  |   Information   | 297162 |
 |   Resource   |    Navigation   |   1    |
 | Information  |    Navigation   |  327   |
 | Information  |     Resource    |   1    |
 +--------------+-----------------+--------+
 [8 rows x 3 columns],
 'f1_score': 0.4174111051698739,
 'log_loss': 0.39827852907597866,
 'precision': 0.9360293588455798,
 'recall': 0.3916520736545807,
 'roc_curve': Columns:
 	threshold	float
 	fpr	float
 	tpr	float
 	p	int
 	n	int
 	class	int
 
 Rows: 300003
 
 Data:
 +-----------+-----+---

#### 3.4.4 比较不同分类器和不同特征提取对查询意图识别的影响

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;呃，其实这部分还**有待改进**。毕竟虽然有影响，但是影响比较小。
* 不同分类器对查询意图识别会有**1个百分点**的影响；
* 不同特征提取对意图识别的影响仅仅有**0.1个百分点**。

## 4. 结论

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;由于论文没有完成，所以结论部分还有待分析。
* 在答辩前，要深入学习**LTP工具**，优化特征提取；
* 同时要**调整分类器的参数**，优化分类结果。

<img src='classification.jpg' align='left'>

# 欢迎大家批评指正，谢谢！