## TextCNN, BiGRU, and CNN-BiGRU Classifiers

**Overview**

This package wraps TextCNN, BiGRU, CNN-BiGRU, and SVM classifiers behind a unified calling interface, and supports pretrained word2vec word vectors.

**Requirements:**
* This package is built on PyTorch; install [PyTorch v2](http://pytorch.org/) first
* numpy
* [word2vec](https://pypi.python.org/pypi/word2vec)

**Package layout**

    |-nnclf
        |- bigru.py
        |- textcnn.py
        |- cnnbigru.py
        |- word_vector.py
        |- utils.py

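### word2vec word vectors

`word_vector.py` under `nnclf` wraps the word vector interfaces: training word vectors, and getting the word embeddings for a given vocabulary.

#### 1. Training word vectors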

    from nnclf.word_vector import WordVector

    # The model path must be passed to the constructor.
    w2v = WordVector('./runtime/word2vec.bin')
    # Pass a corpus file that has already been word-segmented.
    w2v.train('./corpus/data.txt')

b'Starting training using file ./corpus/data.txt\n'b'Vocab size: 119\n'b'Words in train file: 2802\n'
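
`train` expects a corpus file that has already been word-segmented. The following is a minimal sketch of preparing such a file, assuming one sentence per line with tokens separated by single spaces (a common word2vec text layout; the exact format this package requires is an assumption). `write_segmented_corpus` and its `tokenize` parameter are hypothetical helpers and not part of `nnclf`.

```python
import os
import tempfile


def write_segmented_corpus(sentences, path, tokenize):
    """Write one space-separated, tokenized sentence per line; return the line count."""
    written = 0
    with open(path, 'w', encoding='utf-8') as f:
        for sentence in sentences:
            # Drop whitespace-only tokens so the separator stays unambiguous.
            tokens = [t for t in tokenize(sentence.strip()) if t.strip()]
            if not tokens:
                continue  # skip empty sentences
            f.write(' '.join(tokens) + '\n')
            written += 1
    return written


# Stand-in tokenizer that splits a sentence into characters.
demo_path = os.path.join(tempfile.mkdtemp(), 'data.txt')
demo_count = write_segmented_corpus(['你好', '', '天气'], demo_path, list)
```

For real Chinese text, a word segmenter such as `jieba.lcut` could be passed as `tokenize` in place of `list`.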

#### 2. Loading the word vector model

    from nnclf.word_vector import WordVector

    w2v = WordVector('./runtime/word2vec.bin')
    model = w2v.load()
    string = '歌曲'
    if string not in model:
        print('Word is not in the vocabulary')
    else:
        print('Current word')
        print(string)
        print('Word vector')
        print(model[string])
        print('Similar words (cosine similarity):')
        indexes, metrics = model.cosine(string)
        print(model.generate_response(indexes, metrics).tolist())

Current word
歌曲
Word vector
[  1.84643999e-01  -1.63279787e-01   1.51028126e-01  -6.36137947e-02
   2.97602694e-02  -8.20529604e-05  -2.33809903e-01   1.60071868e-02
  -2.55316585e-01   3.75809595e-02   5.21310084e-02   2.23071173e-01
  -1.57158133e-02  -7.26312771e-02   1.01293981e-01   5.98003343e-02
  -1.81273967e-02   8.28594267e-02  -1.97308492e-02  -1.18765816e-01
  -2.22959071e-02  -3.11312117e-02  -1.73354782e-02  -2.33786300e-01
  -2.18710192e-02  -4.98845056e-02  -1.25703827e-01   6.02751300e-02
  -7.71608995e-03   3.37346494e-02   2.61401036e-03   1.41794518e-01
  -1.84635252e-01   2.02002153e-02   1.10111341e-01   8.26536417e-02
   3.16254981e-02  -1.93025954e-02   8.53149518e-02   6.20240942e-02
  -1.58089474e-01  -2.15821505e-01  -7.68213719e-02   1.09975271e-01
   2.11332679e-01   8.96924287e-02  -4.05575447e-02  -8.47879855e-04
  -7.65577182e-02   1.10694002e-02   5.87318763e-02   1.14083150e-03
  -6.76005557e-02  -1.54716924e-01   1.75064877e-01   8.62063318e-02
   1.35511190e-01  -1.1

**For more ways to use the model, see the [word2vec documentation](http://nbviewer.jupyter.org/github/danielfrg/word2vec/blob/master/examples/word2vec.ipynb)**

#### 3. Getting word embeddings for a vocabulary

    from nnclf.word_vector import WordVector

    vocabs = ['天气', '翻译', '你好', '!', '太阳']
    w2v = WordVector('./runtime/word2vec.bin')
    word_embedding, embedding_size = w2v.word_embedding(vocabs)
    print('word_embedding: ')
    print(word_embedding)
    print('embedding size:')
    print(embedding_size)

word_embedding: 
[[0.11678606271743774, -0.07953370362520218, 0.0841088742017746, -0.00850905105471611, 0.025422848761081696, -0.13431769609451294, -0.19506625831127167, -0.15132324397563934, -0.1915084272623062, 0.06224394589662552, 0.059790752828121185, 0.19149821996688843, -0.02925119176506996, 0.009541577659547329, -0.017398856580257416, 0.242618590593338, 0.011982823722064495, -0.014623040333390236, 0.047549884766340256, -0.007477962411940098, -0.1123051568865776, 0.10901962965726852, 0.06360229849815369, -0.2498617023229599, 0.07356486469507217, -0.12932728230953217, -0.1167450100183487, 0.07358821481466293, 0.10482751578092575, 0.08983588963747025, 0.007777365390211344, 0.023336948826909065, -0.21854901313781738, 0.011877217330038548, 0.07356367260217667, -0.02871868945658207, 0.11140771210193634, -0.054802056401968, 0.11645830422639847, -0.07012410461902618, -0.06390493363142014, -0.1712428480386734, -0.13599325716495514, 0.029902875423431396, 0.1328081488609314, 0.197781562805

Words that are missing from the word vector model's vocabulary are filled in with a mean value.
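
As an illustration of mean filling, here is a minimal sketch in plain Python. It assumes that the fill value is the element-wise mean of the vectors that were found for the requested vocabulary; the package may average over a different set of vectors. `build_embedding` and `toy_vectors` are hypothetical names and the vectors are made-up toy data.

```python
def build_embedding(vocabs, vectors):
    """Return (matrix, embedding_size); unknown words get the mean of the known vectors."""
    known = [vectors[w] for w in vocabs if w in vectors]
    if not known:
        raise ValueError('no word in vocabs has a pretrained vector')
    size = len(known[0])
    # Element-wise mean over all vectors that were found.
    mean = [sum(v[i] for v in known) / len(known) for i in range(size)]
    matrix = [list(vectors[w]) if w in vectors else list(mean) for w in vocabs]
    return matrix, size


# Toy 2-dimensional vectors; '!' has no pretrained vector.
toy_vectors = {'天气': [1.0, 0.0], '太阳': [3.0, 2.0]}
matrix, size = build_embedding(['天气', '!', '太阳'], toy_vectors)
```

In this sketch the row for `'!'` becomes `[2.0, 1.0]`, the mean of the two known rows, so every word in the vocabulary still maps to a vector of the same size.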

### Training a model

**Parameters**
    
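    Constructor parameters:
    model_dir        directory where the model is saved, required
    [hidden_size]    hidden layer size, int, default 64
    [batch_size]     training batch size, int, default 1
    [lr_rate]        learning rate, float, default 0.01
    [epoch]          number of training epochs (total training steps = epochs x dataset size), int, default 3
    [autolr]         whether to adjust the learning rate automatically, bool, default True
    [word2vec]       whether to train with word2vec word vectors, bool, default False
    [word2vec_model] path to the word2vec model, string, default None
    [print_log]      whether to print training progress, bool, default False
    [log_interval]   number of steps between progress messages, int, default 10

    train parameters:
    datas            dataset, list
    labels           labels, list
    retrain          whether to retrain from scratch, bool, default False (True discards the previously trained model; False continues training from it)

    predict parameters:
    texts            input texts, list
    k                number of top-ranked classes returned per text, default 1

    predict_proba parameters:
    texts            input texts, list
    k                number of top-ranked classes returned per text, default 1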
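
To make the role of `k` concrete, the sketch below ranks a set of per-class scores and keeps the best `k`. It is only an illustration of top-k selection under the assumption that each class has one score; `top_k` is a hypothetical helper, the scores are invented, and this is not the package's internal code.

```python
def top_k(scores, k=1):
    """Return the k (label, score) pairs with the highest scores, best first."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


# Invented scores for a single input text.
scores = {'weather': 0.90, 'music': 0.03, 'poetry': 0.06}
top_pairs = top_k(scores, k=2)                   # shaped like a predict_proba result
top_labels = [label for label, _ in top_pairs]   # shaped like a predict result
```

With `k=1` only the best class is returned for each text; larger values return the runners-up as well.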

**Training a model (BiGRU as an example)**

    import os
    import jieba
    from nnclf.bigru import BiGRU

    def load_data(corpus_dir='./corpus/data/'):
        """Read one <label>.txt file per class; return tokenized sentences and labels."""
        datas = []
        labels = []
        for filename in os.listdir(corpus_dir):
            # The file name without its extension is the class label.
            label = os.path.splitext(filename)[0]
            with open(os.path.join(corpus_dir, filename), 'r', encoding='utf-8') as f:
                for line in f:
                    line = line.strip().replace(' ', '')
                    if not line:
                        continue
                    datas.append(jieba.lcut(line))
                    labels.append(label)
        return datas, labels

    datas, labels = load_data()
    # word2vec=False: train without pretrained word vectors.
    clf = BiGRU('./runtime/bigru', epoch=2, word2vec=False, word2vec_model='./runtime/word2vec.bin')
    clf.train(datas, labels, retrain=True)

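After training finishes, a `log.txt` file is written to the model directory. It records how certain parameters, such as `lr_rate`, change during training.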
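
The `load_data` function above implies a corpus layout of one text file per class, named after the label, with one sentence per line. The sketch below builds a tiny corpus of that shape in a temporary directory and reads it back. `load_corpus` is a hypothetical variant that takes the tokenizer as an argument so it runs without jieba; the file names and sentences are made up for illustration.

```python
import os
import tempfile


def load_corpus(corpus_dir, tokenize):
    """Read every <label>.txt file; return tokenized sentences and their labels."""
    datas, labels = [], []
    for filename in sorted(os.listdir(corpus_dir)):
        label, ext = os.path.splitext(filename)
        if ext != '.txt':
            continue  # ignore files that are not class files
        with open(os.path.join(corpus_dir, filename), encoding='utf-8') as f:
            for line in f:
                line = line.strip()
                if line:
                    datas.append(tokenize(line))
                    labels.append(label)
    return datas, labels


# Build a two-class toy corpus: weather.txt and music.txt.
toy_dir = tempfile.mkdtemp()
for name, lines in {'weather': ['查天气'], 'music': ['放首歌', '']}.items():
    with open(os.path.join(toy_dir, name + '.txt'), 'w', encoding='utf-8') as f:
        f.write('\n'.join(lines) + '\n')

# Character splitting stands in for a real word segmenter.
toy_datas, toy_labels = load_corpus(toy_dir, list)
```

The two returned lists line up index by index, which is the shape that `train(datas, labels)` is called with above.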

**Testing the model**

    from nnclf.bigru import BiGRU

    clf = BiGRU('./runtime/bigru')
    texts = ['查看一下天气预报', '帮我查下南京明天的天气', '我想听你最珍贵这首歌曲', '放首诗', '放首歌', 'ProgressBar', '~~~']

    print('predict:')
    labels = clf.predict(texts, k=5)
    for text, top_labels in zip(texts, labels):
        print(text)
        print(top_labels)
    print()
    print('predict_proba:')
    probas = clf.predict_proba(texts, k=5)
    for text, ranked in zip(texts, probas):
        print(text)
        for label, proba in ranked:
            print(label, ',', proba)

predict:
查看一下天气预报
['weather', 'datetime', 'poetry', 'music', 'flight']
帮我查下南京明天的天气
['weather', 'datetime', 'poetry', 'flight', 'news']
我想听你最珍贵这首歌曲
['music', 'datetime', 'news', 'translation', 'weather']
放首诗
['poetry', 'datetime', 'translation', 'flight', 'weather']
放首歌
['music', 'datetime', 'poetry', 'calculate', 'news']
ProgressBar
['datetime', 'flight', 'news', 'poetry', 'weather']
~~~
['datetime', 'news', 'flight', 'weather', 'translation']

predict_proba:
查看一下天气预报
weather , 0.899519789251753
poetry , 0.05757084415898526
music , 0.03049541187102234
datetime , 0.026541493235452556
flight , 0.02025885451264558
帮我查下南京明天的天气
weather , 0.8091128259155912
datetime , 0.11590967034875785
poetry , 0.006010602758339545
flight , 0.0011516178487235168
news , 0.0005142051721974991
我想听你最珍贵这首歌曲
music , 0.09933670839043972
datetime , 0.0441298779317347
calculate , 0.0011244938783366303
news , 0.0009335566017812685
translation , 0.0005220645624636176
放首诗
poetry , 0.7543563226274153
flight , 0.0565502
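
Since `predict` returns one ranked list of labels per text, a held-out set with known labels can be scored with a few lines of Python. The following is a sketch only: `top_k_accuracy` is a hypothetical helper, and the predictions and gold labels are invented rather than taken from the run above.

```python
def top_k_accuracy(predictions, gold, k=1):
    """Fraction of texts whose gold label is among the first k predicted labels."""
    if len(predictions) != len(gold):
        raise ValueError('predictions and gold must have the same length')
    if not gold:
        return 0.0
    hits = sum(1 for ranked, truth in zip(predictions, gold) if truth in ranked[:k])
    return hits / len(gold)


# Invented ranked predictions (best first) and gold labels for three texts.
predictions = [['weather', 'datetime'], ['music', 'news'], ['datetime', 'poetry']]
gold = ['weather', 'news', 'poetry']
top1 = top_k_accuracy(predictions, gold, k=1)
top2 = top_k_accuracy(predictions, gold, k=2)
```

Comparing top-1 with top-k accuracy shows how often the correct class is ranked close to the top even when it is not the first choice.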

**Using pretrained word vectors**

    import os
    import jieba
    from nnclf.textcnn import TextCNN

    def load_data(corpus_dir='./corpus/data/'):
        """Read one <label>.txt file per class; return tokenized sentences and labels."""
        datas = []
        labels = []
        for filename in os.listdir(corpus_dir):
            # The file name without its extension is the class label.
            label = os.path.splitext(filename)[0]
            with open(os.path.join(corpus_dir, filename), 'r', encoding='utf-8') as f:
                for line in f:
                    line = line.strip().replace(' ', '')
                    if not line:
                        continue
                    datas.append(jieba.lcut(line))
                    labels.append(label)
        return datas, labels

    datas, labels = load_data()
    # word2vec=True: train with the word vectors stored at word2vec_model.
    clf = TextCNN('./runtime/textcnn_word2vec', epoch=2, word2vec=True, word2vec_model='./runtime/word2vec.bin')
    clf.train(datas, labels, retrain=True)


    from nnclf.textcnn import TextCNN

    clf = TextCNN('./runtime/textcnn_word2vec')
    texts = ['查看一下天气预报', '帮我查下南京明天的天气', '我想听你最珍贵这首歌曲', '放首诗', '放首歌', 'ProgressBar', '~~~']

    print('predict:')
    labels = clf.predict(texts, k=5)
    for text, top_labels in zip(texts, labels):
        print(text)
        print(top_labels)
    print()
    print('predict_proba:')
    probas = clf.predict_proba(texts, k=5)
    for text, ranked in zip(texts, probas):
        print(text)
        for label, proba in ranked:
            print(label, ',', proba)

predict:
查看一下天气预报
['weather', 'poetry', 'datetime', 'translation', 'flight']
帮我查下南京明天的天气
['weather', 'poetry', 'news', 'flight', 'translation']
我想听你最珍贵这首歌曲
['music', 'news', 'poetry', 'flight', 'weather']
放首诗
['poetry', 'translation', 'flight', 'weather', 'music']
放首歌
['music', 'flight', 'news', 'calculate', 'datetime']
ProgressBar
['music', 'weather', 'poetry', 'translation', 'datetime']
~~~
['music', 'poetry', 'weather', 'translation', 'datetime']

predict_proba:
查看一下天气预报
weather , 0.9994122808664151
poetry , 0.07423397561943722
datetime , 0.0498361481601903
flight , 0.02043510065436431
translation , 0.010683390906245425
帮我查下南京明天的天气
weather , 0.9925422160782983
poetry , 0.011179422477336142
flight , 0.004159712908065514
datetime , 0.00238254319962543
news , 0.0018012380550213494
我想听你最珍贵这首歌曲
music , 0.9019259068374911
news , 0.02519644451310079
poetry , 0.01134424498632857
weather , 0.004700863630096767
datetime , 0.003402854966421797
放首诗
poetry , 0.10573242220872656
translation , 0.0
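
The test inputs include strings such as `'ProgressBar'` and `'~~~'` that do not belong to any class, yet a ranked list is still returned for them. One possible way to handle such inputs is to reject a prediction whose best score is too low. The sketch below assumes the `predict_proba` result shape shown above (a list of `(label, proba)` pairs per text, best first); `accept_or_reject`, the `0.5` threshold, and the `'unknown'` fallback are all hypothetical choices, and the scores are invented.

```python
def accept_or_reject(ranked, threshold=0.5, fallback='unknown'):
    """Return the best label if its score reaches the threshold, else the fallback."""
    if not ranked:
        return fallback
    label, proba = ranked[0]  # pairs are assumed to be sorted best first
    return label if proba >= threshold else fallback


# Invented predict_proba-style results for two texts.
results = [
    [('weather', 0.99), ('poetry', 0.01)],
    [('music', 0.21), ('poetry', 0.20)],
]
decisions = [accept_or_reject(ranked) for ranked in results]
```

A suitable threshold depends on the model and data, so it would need to be tuned on held-out examples rather than fixed in advance.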

**For more examples, see the demo in this directory.**