# 竞争性关键词推荐算法

## 数据预处理
#### 1. 读取搜索内容  
从比赛数据的训练集中提取出搜索记录，并以utf-8格式保存，每条记录占一行  
搜索记录保存在以"query_words.train"为名称的文件中

In [19]:
data = open('./user_tag_query.10W.TRAIN','r',encoding='gb18030')
output_data = open('./query_words.train','w',encoding='utf-8')
for line in data:
    print(line)
    line_list = line.split(' ')
    line_list = line_list[4:]
    output_line = '\n'.join(line_list)
    output_data.write(output_line + '\n')
data.close()
output_data.close()

version https://git-lfs.github.com/spec/v1 oid sha256:4131c60c704be000dcfd9450a427a670fea3714f700be5b7b6b91705d654c384 size 224899985



#### 2. 数据清洗 去除链接等非中文文本
观察原始数据，发现文本数据条目中包含http格式的网页链接名称，与要得到的关键词无关，甚至会产生干扰，因而使用正则表达式将其去除  
将去除后网页链接后的数据保存在以"link_clean_data'为名称的文件中

In [20]:
import re
train_data = open('./query_words.train','r',encoding='utf-8')
result_data = open('./link_clean_data.train','w',encoding='utf-8')

for line in train_data:
    word_list = line.split('\t')
    pattern = re.compile(r'[:]?http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+') 
    if pattern.match(word_list[0]):
        # print(word_list[0])
        continue
    line_string = '\t'.join(word_list) + '\n'
    result_data.write(line_string)
train_data.close()
result_data.close()

#### 3.使用jieba分词

In [21]:
import jieba
train_data = open('./link_clean_data.train','r',encoding='utf-8')
result_data = open('./words_segmentation_data.train','w',encoding='utf-8')

for sentence in train_data:
    sentence = sentence[:-1]
    word_seg = jieba.cut(sentence)
    line_string = "\t".join(word_seg) + '\n'
    result_data.write(line_string)
train_data.close()
result_data.close()

Building prefix dict from the default dictionary ...
Dumping model to file cache C:\Users\86150\AppData\Local\Temp\jieba.cache
Loading model cost 0.714 seconds.
Prefix dict has been built successfully.


#### 4.过滤停用词
文本中有很多无效的词，比如“着”，“和”，还有一些标点符号，这些都会对最后的关键词种子筛选产生干扰，因此需要去掉。  
我们下载了一个中文停用词表，其中包含常用的停用词，我们用该表将我们文本的停用词过滤掉。

In [8]:
#停用词表文件
stop_words = "stop_words.txt"
stop_words_dict = open(stop_words, 'r', encoding='utf-8')
stop_words_content = stop_words_dict.read()

#将停用词表转换为list  
stop_words_list = stop_words_content.splitlines()
stop_words_dict.close()

In [9]:
train_data = open('./words_segmentation_data.train','r',encoding='utf-8')
result_data = open('./filter_stopwords_data.train','w',encoding='utf-8')

#过滤分词结果中的停用词
def stop_words_filter(word_list,stop_words_list):
    word_cleaned=[]
    stopwords_list = set(stop_words_list)
    for word in word_list:
        if word not in stop_words_list:
            word_cleaned.append(word)
    return word_cleaned

for line in train_data:
    line = line[:-1]
    word_list = line.split('\t')
    word_list = stop_words_filter(word_list,stop_words_list)
    if len(word_list) == 0:
        continue
    line_string = "\t".join(word_list) + '\n'
    result_data.write(line_string)
train_data.close()
result_data.close()

## 选取种子关键词  
使用python自带的collections.Counter类进行词频统计  
用其中的most_common()方法打印出词频出现前20的词，选择10个，作为本次项目的种子关键词

In [10]:
from collections import Counter

def read_word(filename):
    wordlist = [];
    data_file = open(filename,'r',encoding='utf-8')
    for line in data_file:
        line = line[:-1]
        words = line.split('\t')
        wordlist.extend(words)
    data_file.close()
    return wordlist

word_list = read_word('./filter_stopwords_data.train')
count_result = Counter(word_list)
for key, val in count_result.most_common(20):
    print(key, val)

 6


**选取的10个种子关键词为：图片 手机 小说 视频 下载 qq 电影 百度 英语 游戏** 

## 筛选还有种子关键词的搜索条目
在选取了关键种子后，从10万条原始的搜索数据中，筛选出含有种子关键词的搜索数据  
存在以"seedwords_query.train"为名称的文件中

In [11]:
train_data = open('./query_words.train','r',encoding='utf-8')
result_data = open('./seedwords_query.train','w',encoding='utf-8')

seedwords_list=['图片','手机','小说','视频','下载','qq','电影','百度','英语','游戏']
for line in train_data:
    flag = False
    for seedword in seedwords_list:
        if seedword in line:
            flag = True
            break
    if(flag==True):
        result_data.write(line)
train_data.close()
result_data.close()
    