# Natural Language Analysis with Word2Vec

## Preparing the Corpus

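High-quality corpora are scarce online, and high-quality Chinese corpora are scarcer still. I found an article in a Zhihu column that summarizes [public datasets across various domains](https://zhuanlan.zhihu.com/p/25138563).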

This article uses the [Sogou news corpus](https://www.sogou.com/labs/resource/list_news.php). The archive I downloaded is about 640 MB; once unpacked it is a single `.dat` file with the following format:

```xml
<doc>
<url>page URL</url>
<docno>page ID</docno>
<contenttitle>page title</contenttitle>
<content>page content</content>
</doc>
```

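The raw corpus is GBK-encoded and shows up as garbled text on Linux, so it has to be converted to UTF-8. Since only the text inside the `<content>` tags is needed, the filtering can be done in the same pass. A single shell command handles both: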

```shell
iconv -c -f gbk -t utf-8 news_tensite_xml.dat | grep "<content>" > sougou_corpus.txt
```

The result of running the command looks like this:

![sougou_corpus.txt](img/sougou_corpus.png)
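
For readers without `iconv` and `grep` at hand, the same conversion can be sketched in Python. This is only a rough equivalent, not the method used in this article: the function name `extract_content_lines` is made up for illustration, and `errors='ignore'` is assumed to play the role of `iconv -c` by dropping bytes that cannot be decoded.

```python
def extract_content_lines(src_path, dst_path):
    """Convert a GBK file to UTF-8, keeping only lines that contain <content>.

    Hypothetical helper; returns the number of lines written.
    """
    kept = 0
    # errors='ignore' silently drops undecodable bytes (assumed similar to iconv -c)
    with open(src_path, 'r', encoding='gbk', errors='ignore') as src, \
            open(dst_path, 'w', encoding='utf-8') as dst:
        for line in src:
            if '<content>' in line:
                dst.write(line)
                kept += 1
    return kept
```

Calling it with `news_tensite_xml.dat` as the source and `sougou_corpus.txt` as the destination should produce a file comparable to the output of the shell command.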



## Word Segmentation

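With the corpus ready, the next step is word segmentation. This step matters for Chinese: English text is already made up of individual words, whereas Chinese text is a sequence of characters, and in many cases the characters only carry meaning once they are grouped into words.

There are several word segmentation tools for Chinese NLP; here we use [jieba (结巴分词)](https://github.com/fxsjy/jieba).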
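
To make "grouping characters into words" concrete, here is a toy dictionary-based segmenter using forward maximum matching. This is a different and much simpler technique than what jieba does; the function name, the tiny vocabulary, and the `max_word_len` parameter are all assumptions made up for illustration.

```python
def forward_max_match(text, vocabulary, max_word_len=4):
    """Greedy segmentation: at each position take the longest dictionary word."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to a single character
        for size in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted so unknown text still advances
            if size == 1 or candidate in vocabulary:
                tokens.append(candidate)
                i += size
                break
    return tokens


toy_vocabulary = {'自然语言', '自然', '语言', '处理', '很', '有趣'}
segmented = forward_max_match('自然语言处理很有趣', toy_vocabulary)
print(' '.join(segmented))  # 自然语言 处理 很 有趣
```

The space-joined output has the same shape as what the segmentation step in this article writes to disk: words separated by spaces.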

In [1]:
#-*-coding:utf-8-*- 
import logging
import os.path
import jieba
import sys
import re

In [2]:
# Strip the <content> and </content> tags with a regular expression
def reTest(content):
    reContent = re.sub('<content>|</content>', '', content)
    return reContent
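
As a quick check of what the cleaning does to a single line, the sketch below combines the tag stripping above with the non-Chinese filter used in the segmentation loop further down. The helper name `clean_line` and the sample sentence are made up for illustration.

```python
import re


def clean_line(line):
    """Remove <content> tags, then replace every non-Chinese character with a space."""
    text = re.sub(r'</?content>', '', line)
    # \u4e00-\u9fa5 covers the common CJK ideographs; digits, punctuation,
    # Latin letters and the trailing newline all fall outside it
    return re.sub(r'[^\u4e00-\u9fa5]', ' ', text)


sample = '<content>2018年，搜狗发布了新闻语料。</content>\n'
cleaned = clean_line(sample)
print(cleaned.split())  # ['年', '搜狗发布了新闻语料']
```

Note that the newline is replaced as well, so a cleaned line no longer ends with a line break.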

In [3]:
# Initialize the logging configuration
program = os.path.basename(sys.argv[0])
logger = logging.getLogger(program)
logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s')
logging.root.setLevel(level=logging.INFO)
# logging.info("running %s" % ' '.join(sys.argv))

In [16]:
# Absolute path of the current working directory
base_path = os.path.abspath('.')
dataset_path = base_path + '/' + 'dataset'
input_corpus = dataset_path + '/' + 'sougou_corpus.txt'
output_corpus = dataset_path + '/' + 'seg_sougou_corpus.txt'

i = 0
# Context managers close both files even if an error occurs part-way through
with open(input_corpus, 'r', encoding='utf-8') as f, \
        open(output_corpus, 'w', encoding='utf-8') as fo:
    for line in f:
        # Replace every non-Chinese character (including the newline) with a space
        new_sentence = re.sub(r'[^\u4e00-\u9fa5]', ' ', reTest(line))
        line_seg = jieba.cut(new_sentence)
        # No newline is written, so the whole corpus ends up on one space-separated line
        fo.write(' '.join(line_seg))
        i = i + 1
        if i % 10000 == 0:
            logger.info("Saved " + str(i))

logger.info("Finished Saved " + str(i))

Building prefix dict from the default dictionary ...
2018-04-17 22:17:30,288: DEBUG: Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
2018-04-17 22:17:30,291: DEBUG: Loading model from cache /tmp/jieba.cache
Loading model cost 1.615 seconds.
2018-04-17 22:17:31,905: DEBUG: Loading model cost 1.615 seconds.
Prefix dict has been built succesfully.
2018-04-17 22:17:31,907: DEBUG: Prefix dict has been built succesfully.
2018-04-17 22:18:08,713: INFO: Saved 10000
2018-04-17 22:18:48,057: INFO: Saved 20000
2018-04-17 22:19:24,854: INFO: Saved 30000
2018-04-17 22:20:03,984: INFO: Saved 40000
2018-04-17 22:20:41,316: INFO: Saved 50000
2018-04-17 22:21:18,775: INFO: Saved 60000
2018-04-17 22:21:55,472: INFO: Saved 70000
2018-04-17 22:22:30,392: INFO: Saved 80000
2018-04-17 22:23:05,705: INFO: Saved 90000
2018-04-17 22:23:45,710: INFO: Saved 100000
2018-04-17 22:24:25,828: INFO: Saved 110000
2018-04-17 22:25:07,726: INFO: Saved 120000
2018-04-17 22:2

In [4]:
import time
import numpy as np
import tensorflow as tf
from collections import Counter

In [None]:
base_path = os.path.abspath('.')
dataset_path = base_path + '/' + 'dataset'
output_corpus = dataset_path + '/' + 'seg_sougou_corpus.txt'

words_list = []
with open(output_corpus, 'r', encoding='utf-8') as f:
    for line in f:
        # Accumulate tokens from every line instead of overwriting the list
        words_list.extend(line.split())

#word_counts = Counter(words_list)
#trimmed_words = [word for word in words_list if word_counts[word] > 5]
logger.info(words_list[:30])
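
The commented-out lines above hint at dropping rare words before training. A minimal sketch of that idea follows, assuming the same threshold of 5. The `build_vocab` helper and the word-to-integer lookup tables are illustrative assumptions rather than part of the original notebook, and the toy word list stands in for `words_list`.

```python
from collections import Counter


def build_vocab(words, min_count=5):
    """Drop words seen min_count times or fewer and build integer lookup tables."""
    counts = Counter(words)
    trimmed = [w for w in words if counts[w] > min_count]
    # Most frequent words get the smallest ids; ties are broken alphabetically
    vocab = sorted(set(trimmed), key=lambda w: (-counts[w], w))
    word_to_int = {w: i for i, w in enumerate(vocab)}
    int_to_word = {i: w for w, i in word_to_int.items()}
    return trimmed, word_to_int, int_to_word


# Toy data standing in for words_list
toy_words = ['新闻'] * 6 + ['语料'] * 7 + ['罕见']
trimmed, word_to_int, int_to_word = build_vocab(toy_words, min_count=5)
print(word_to_int)  # {'语料': 0, '新闻': 1}
```

Mapping words to integer ids like this is a common preparation step before feeding a corpus to an embedding model.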
