# 7 - TextCNN
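The previous section went over how to do `text data processing` with the torchtext package in PyTorch. This section shows how to build and train a `deep learning model`, using a CNN on a classification task as the example.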

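### Basic principles
`Convolutional neural networks` originated in image processing. The basic idea is to slide a convolution window across a two-dimensional pixel matrix and apply the convolution inside each window, which captures information shared by neighboring pixels and shrinks the length of each dimension. For text, each sentence is treated as a matrix in which every word is a fixed-length vector. The `kernel width equals the word-vector length` (so the kernel only slides from word to word and never splits a word, somewhat in the spirit of n-grams), while several kernel heights can be used. The resulting feature maps go through a max-pooling layer and then a fully connected layer, which performs the classification.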
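
To make the sliding-window idea concrete, here is a minimal pure-Python sketch (no PyTorch) of a single kernel of height 2 moving over a toy sentence matrix, followed by max-over-time pooling. The toy word vectors, the kernel weights, and the helper names `conv_over_words` and `max_over_time` are made up for illustration; the real model further down uses `nn.Conv1d` instead.

```python
def conv_over_words(sentence, kernel):
    """Slide a kernel of shape (h, D) over a sentence of shape (L, D).

    The kernel spans the full embedding width D, so it only moves along
    the word axis and yields L - h + 1 values (one per h-gram).
    """
    h = len(kernel)
    outputs = []
    for start in range(len(sentence) - h + 1):
        window = sentence[start:start + h]
        # Element-wise product summed over the whole window
        value = sum(w * k
                    for word_vec, kernel_row in zip(window, kernel)
                    for w, k in zip(word_vec, kernel_row))
        outputs.append(max(value, 0.0))  # ReLU
    return outputs


def max_over_time(feature_map):
    """Keep only the strongest response, whatever the sentence length."""
    return max(feature_map)


# Toy sentence: 4 words, each a 3-dimensional vector (made-up numbers)
sentence = [
    [1.0, 0.0, 2.0],
    [0.0, 1.0, 0.0],
    [3.0, 1.0, 1.0],
    [0.0, 0.0, 1.0],
]
# One kernel covering 2 consecutive words (a "bigram" detector)
kernel = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
]

feature_map = conv_over_words(sentence, kernel)
print(feature_map)                 # one value per pair of adjacent words
print(max_over_time(feature_map))  # a single number per kernel
```

Because pooling reduces every feature map to one number, kernels of different heights can be combined even though their feature maps have different lengths.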

<img src='images/textcnn.png'>

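### Code implementation
Modeling in PyTorch can be split into the following parts:
+ Build the model: define the model class. Its `__init__` sets up the model structure (embedding layer, convolution layers, fully connected layer, ...), and `forward` defines one forward pass
+ Define the train/test/predict functions
+ Define the main module: parameter configuration, data loading, calling the model and the functions

Please pay close attention to the comments in the code while reading [nobody knows the tears that were shed]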

<img src='images/memes/debugging.jpeg'>

In [1]:
import sys
import os
import pickle
import re
import traceback

import torch
import torch.nn as nn 
import torch.autograd as autograd
import torch.nn.functional as F
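# torchtext 0.9-0.11 moved Field, TabularDataset and BucketIterator to
# torchtext.legacy; torchtext 0.12 removed them
from torchtext import data, vocab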
from torchtext.data import TabularDataset
from torchtext.data import BucketIterator
import fasttext

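#### Building the model
Building a model is really just `assembling` the modules that already exist in torch.nn.

The key is to be clear, at every stage, about the `shape` of the `data` going in and out and the `size` of each `convolution/pooling kernel`!!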
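
As a rough aid for that shape bookkeeping, the sketch below computes the shape expected after each stage of the model defined next, without running PyTorch. The helper `textcnn_shapes` is hypothetical. The batch size, embedding dimension, kernel count and kernel widths are taken from the `Config` class later in this section, while the sequence length of 20 and the class count of 3 are assumptions chosen for illustration.

```python
def textcnn_shapes(batch_size, seq_length, embed_dim, kernel_num,
                   kernel_widths, class_num):
    """Return the tensor shape expected after each stage of the model."""
    shapes = {
        "input": (batch_size, seq_length),
        "embedded": (batch_size, seq_length, embed_dim),
        # Conv1d expects (batch, channels, length), hence the permute
        "permuted": (batch_size, embed_dim, seq_length),
    }
    for w in kernel_widths:
        if seq_length < w:
            raise ValueError(
                f"sequence length {seq_length} is shorter than kernel width {w}")
        # No padding, stride 1: the output length is seq_length - w + 1
        shapes[f"conv_w{w}"] = (batch_size, kernel_num, seq_length - w + 1)
        # Max-over-time pooling removes the length dimension
        shapes[f"pooled_w{w}"] = (batch_size, kernel_num)
    shapes["concatenated"] = (batch_size, kernel_num * len(kernel_widths))
    shapes["logits"] = (batch_size, class_num)
    return shapes


shapes = textcnn_shapes(96, 20, 128, 100, [3, 4, 5], 3)
for stage, shape in shapes.items():
    print(f"{stage:>13}: {shape}")
```

The length check also hints at why the tokenizer in the main module pads very short sentences to five tokens: the widest kernel needs at least five positions to slide over.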

In [2]:
class TextCNN(nn.Module):
    def __init__(self,config):
        super(TextCNN,self).__init__()
        self.config = config
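        N = config.embed_num # vocabulary size (number of words in the embedding table)
        D = config.embed_dim # dimension of each word vector
        C = config.class_num # number of classes
        Ci = 1 # not used below
        Co = config.kernal_thick # number of output channels per kernel width, fixed value
        Ws = config.kernal_widths # width/window size of each (1D) kernel; several of them -> list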

        self.embed = nn.Embedding(N,D)
        self.conv1ds = nn.ModuleList([nn.Conv1d(D,Co,W) for W in Ws])
        self.dropout = nn.Dropout(config.dropout)
        self.fc1 = nn.Linear(Co*len(Ws),C)

    def forward(self,x):
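        x = self.embed(x) #(batch_size, seq_length, embed_dim)
        if self.config.static: 
            # Freeze the embeddings: detach() cuts the graph here, so no
            # gradient flows back into self.embed (replaces the deprecated
            # autograd.Variable wrapper)
            x = x.detach()
        x = x.permute(0,2,1) #(batch_size, embed_dim, seq_length)
        
        # One feature map per kernel width W -> [(batch_size, Co, seq_length-W+1), ...]
        x = [F.relu(conv(x)) for conv in self.conv1ds]
        
        # Pooling removes the length differences caused by the different
        # window sizes -> [(batch_size, Co), ...]
        # xi.size(2) is the size of xi's third dimension
        x = [F.max_pool1d(xi,xi.size(2)).squeeze(2) for xi in x] 
        
        # Concatenate the outputs of all kernels -> (batch_size, Co*len(Ws))
        x = torch.cat(x,1) 
        x = self.dropout(x)
        return self.fc1(x) # -> (batch_size, C)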

#### train/evaluate/predict/save
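+ train: trains for the given number of epochs, updating the parameters epoch_num\*batch_num times in total (counted in steps). Accuracy is reported periodically and the best model so far is saved.
+ evaluate: evaluates the model; the steps largely mirror train
+ predict: classifies a text with the model
+ save: saves the model to disk via `torch.save`(model.state_dict(),path)

A few things to watch out for:
1. torchtext maps the labels to indices starting at 1, but `F.cross_entropy` expects class indices starting at 0 -> `target.sub_(1)` [a trailing _ marks an in-place operation]
2. Reset the optimizer's gradients to zero after every step so that they do not accumulate
3. The evaluation metric is accuracy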
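
Point 2 can be checked with a tiny experiment. The sketch below is only an illustration on a single scalar parameter `w` (a made-up stand-in for the model weights), with an SGD optimizer that is created purely so that `zero_grad()` can be called on it.

```python
import torch

w = torch.tensor(1.0, requires_grad=True)
optimizer = torch.optim.SGD([w], lr=0.1)  # only used for zero_grad() here

# Two backward passes without clearing: the gradients add up
(3 * w).backward()
(3 * w).backward()
accumulated = w.grad.item()

# Clearing first gives the gradient of the latest pass only
optimizer.zero_grad()
(3 * w).backward()
fresh = w.grad.item()

print(accumulated, fresh)
```

The derivative of `3 * w` is 3, so the first value is twice the second one: without `zero_grad()`, each update would mix in gradients from earlier batches.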

In [10]:
def train(model,train_iter,test_iter,config):
    # The optimizer controls how the model parameters are updated
    optimizer = torch.optim.Adam(model.parameters(),lr=config.lr)
    
    # Set self.training to True; the model behaves differently in each mode
    model.train() 
    
    step = 0
    best_acc = 0
    last_step = 0 # step of the last improvement on the test set
    for epoch in range(1,config.epochs+1):
        stop_flag = 0
        for batch in train_iter:
            step += 1
            feature, target = batch.text, batch.label
            with torch.no_grad(): # keep this out of backpropagation
                target.sub_(1) # target-1: class indices must start at 0 (otherwise an error is raised)
            
            # This actually runs model.forward(feature) -> (batch_size, C)
            logit = model(feature) 

            loss = F.cross_entropy(logit,target)
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
            
            # Report
            if step % config.log_interval == 0:
                # torch.max(logit,1)[1] is the argmax over the second dimension, i.e. the most probable class
                # view is similar to reshape
                corrects = (torch.max(logit,1)[1].view(target.size()).data == target.data).sum()
                accuracy = 100.0 * corrects/batch.batch_size
                # '\r' overwrites the previous output line
                sys.stdout.write(f"\rStep[{step}] - loss: {loss.item():.6f} \
                    acc: {accuracy:.4f}%({corrects.item()}/{batch.batch_size})")
            
            # Periodic test and checkpoint; decide whether to stop early
            if step % config.test_interval == 0:
                test_acc = evaluate(model,test_iter)
                model.train() # evaluate() switched to eval mode; switch back
                if test_acc > best_acc:
                    best_acc = test_acc
                    last_step = step
                    save(model, config.save_dir, 'best', step)
                else:
                    if step - last_step >= config.early_stop:
                        print('early stop by {} steps.'.format(config.early_stop))
                        stop_flag = 1
                        break
        if stop_flag: break
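
The checkpoint and early-stopping rule inside train can be replayed on its own. The sketch below is a simplified stand-alone version of that rule; `replay_early_stopping` is a hypothetical helper, the accuracies are the test accuracies at steps 100 to 800 from the training log shown later in this section, and the patience of 300 in the second call is an assumed value used only to show the stopping branch.

```python
def replay_early_stopping(test_accs, test_interval, early_stop):
    """Replay the best-checkpoint / early-stop rule on a list of test accuracies.

    Returns (best_step, best_acc, stopped_at); stopped_at is None if the
    patience limit was never reached.
    """
    best_acc, best_step = 0.0, 0
    for i, acc in enumerate(test_accs, start=1):
        step = i * test_interval
        if acc > best_acc:
            best_acc, best_step = acc, step  # a checkpoint would be saved here
        elif step - best_step >= early_stop:
            return best_step, best_acc, step
    return best_step, best_acc, None


accs = [62.8110, 66.0416, 67.6012, 68.5481, 67.2856, 66.4129, 67.3784, 66.7843]
print(replay_early_stopping(accs, test_interval=100, early_stop=1000))
print(replay_early_stopping(accs, test_interval=100, early_stop=300))
```

With the configured patience of 1000 steps the run is never cut short within these eight tests, and the checkpoint from step 400 stays the best one.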


In evaluate, remember to switch the model.training mode.

During train the reported accuracy is computed on a single batch; here it has to cover the whole dataset.
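
Why the totals are accumulated before dividing can be seen in a small pure-Python sketch. The per-batch counts are hypothetical, and `dataset_accuracy` and `mean_of_batch_accuracies` are illustrative helpers rather than part of the code below.

```python
def dataset_accuracy(batches):
    """batches: list of (num_correct, batch_size) pairs."""
    total_correct = sum(correct for correct, _ in batches)
    total_size = sum(size for _, size in batches)
    return 100.0 * total_correct / total_size


def mean_of_batch_accuracies(batches):
    """Average of per-batch accuracies; over-weights small batches."""
    return sum(100.0 * correct / size for correct, size in batches) / len(batches)


# Hypothetical counts: two full batches and a smaller final batch
batches = [(80, 96), (72, 96), (2, 8)]
print(dataset_accuracy(batches))
print(mean_of_batch_accuracies(batches))
```

The two numbers differ because the small last batch counts as much as a full one in the second version. Summing correct predictions (and summing the loss with `reduction='sum'`) and dividing once by the dataset size avoids that.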

In [4]:
def evaluate(model,data_iter):
    # Set self.training to False
    model.eval() 
    
    corrects, avg_loss = 0,0
    with torch.no_grad(): # no gradients are needed for evaluation
        for batch in data_iter: # each batch is one partial evaluation; the totals are averaged at the end
            feature, target = batch.text, batch.label
            target.sub_(1) # shift the labels to start at 0, as in train

            logit = model(feature)
            # reduction='sum': do not average within the batch
            loss = F.cross_entropy(logit,target,reduction='sum') 

            avg_loss += loss.item()
            corrects += (torch.max(logit,1)[1].view(target.size())
                        == target).sum().item()

    size = len(data_iter.dataset)
    avg_loss /= size 
    accuracy = 100.0 * corrects/size 
    print(f"\nEvaluation - loss: {avg_loss:.6f} acc: {accuracy:.4f}%({corrects}/{size})\n")
    return accuracy

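predict first has to preprocess the input text (in the same way as before training).

Also note that the label values were decremented by 1 earlier, so 1 has to be added back for the final output.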
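
The label shift in both directions can be sketched without torchtext. The list `itos` below is an assumed stand-in for `label_field.vocab.itos`: it presumes that index 0 holds `'<unk>'` (consistent with `class_num = len(label_field.vocab) - 1` in the main module), and the ordering of the three labels is hypothetical.

```python
# Assumed label vocabulary: index 0 is '<unk>', real labels start at 1
itos = ['<unk>', 'KIRK', 'SPOCK', 'MCCOY']  # hypothetical ordering
stoi = {label: index for index, label in enumerate(itos)}


def to_target(label):
    """Vocabulary index -> class index used for the loss (starts at 0)."""
    return stoi[label] - 1


def to_label(class_index):
    """Predicted class index -> label string (undo the shift)."""
    return itos[class_index + 1]


targets = [to_target(name) for name in ['KIRK', 'SPOCK', 'MCCOY']]
print(targets)
print([to_label(t) for t in targets])
```

Forgetting the `+ 1` at prediction time would shift every answer by one label and could even print `'<unk>'`.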

In [5]:
def predict(text,model,text_field,label_field,vectors=None):
    assert isinstance(text,str), "plz use str object as input."
    # Set self.training to False
    model.eval()
    
    # Equivalent to creating one example
    text = text_field.preprocess(text)
    # Map the tokens to indices with the vocabulary from build_vocab
    text = [[text_field.vocab.stoi[word] for word in text]]
    x = torch.tensor(text) # -> (batch_size,seq_length), batch_size=1
    
    with torch.no_grad(): # inference only, no gradients needed
        output = model(x)
    _, pred = torch.max(output,1)
    # pred.item() turns the one-element tensor into a Python int;
    # +1 undoes the label shift applied during training
    print(label_field.vocab.itos[pred.item()+1])

Saving the model: state_dict() records all of the model's parameters, so training can also be resumed after reloading.
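
A minimal round trip, as a sketch: a small `nn.Linear` layer stands in for TextCNN, and the file name is hypothetical. It only demonstrates that loading a saved `state_dict` into a freshly built model of the same architecture restores identical parameters.

```python
import os
import tempfile

import torch
import torch.nn as nn

model = nn.Linear(4, 2)  # stand-in for TextCNN

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_steps_0.pt")  # hypothetical file name
    torch.save(model.state_dict(), path)

    # Reloading needs a model with the same architecture
    restored = nn.Linear(4, 2)
    restored.load_state_dict(torch.load(path))

same = all(torch.equal(a, b) for a, b in
           zip(model.state_dict().values(), restored.state_dict().values()))
print(same)
```

Note that the model's `state_dict` holds the model parameters only. To resume training exactly where it stopped, the optimizer's own `state_dict()` would need to be saved and restored as well.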

In [6]:
def save(model,save_dir,save_prefix,steps):
    if not os.path.isdir(save_dir):
        os.makedirs(save_dir)
    save_path = f"{save_dir}/{save_prefix}_steps_{steps}.pt"
    torch.save(model.state_dict(),save_path)

#### config
Holds the parameter configuration

In [16]:
class Config:
    # data
    shuffle = False # whether to shuffle data every epoch, de:False

    # model
    dropout = 0.5 # dropout rate, de:0.5
    max_norm = 3.0 # l2 constraint of parameters, de:3.0
    embed_dim = 128 # embedding(word vectors) dimension, de:128
    kernal_thick = 100 # number of each kind of kernel, de:100
    kernal_widths = [3,4,5] # different kernel sizes used for convolution, de:[3,4,5]
    static = False # whether to fix embeddings during training, de:False
    save_dir = 'model/TextCNN' # where to save the model

    # training
    lr = 0.001 # initial learning rate, de:0.001
    epochs = 5 # number of epochs for train, de:5
    batch_size = 96 # batch size for training, de:64
    log_interval = 1 # how many steps to wait before logging training status, de:1
    test_interval = 100 # how many steps to wait before testing, de:100
    early_stop = 1000 # iteration numbers to stop without performance increasing, de:1000

    # use
    cuda = False # whether to use cuda, de:False

#### main
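The main module: data preprocessing, then training/loading/testing the model

The data is processed the same way as in the previous section ⬇️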

In [2]:
config = Config()

### Data ###
# Custom tokenizer
def tokenizer(text):
    text = text.lower()
    text = re.sub(r',|\.|\?|!','',text)
    tokens = text.split()
    tokens = list(filter(lambda x:len(x)>0,tokens))
    if len(tokens)<5:
        tokens.extend([' ']*(5-len(tokens)))
    return tokens


# Define the Fields for text and label
text_field = data.Field(lower=True,tokenize=tokenizer,batch_first=True)
label_field = data.Field(sequential=False,batch_first=True)


# Read the data and create the train and test datasets
data_path = 'processed_data'
train_dataset, test_dataset = TabularDataset.splits(
        path=data_path, format='csv', skip_header=True,
        train='best3_train.csv', test='best3_test.csv',
        fields=[('text',text_field),('label',label_field)]) # must follow the column order of the table!!


text_field.build_vocab(train_dataset,test_dataset)
label_field.build_vocab(train_dataset,test_dataset)
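# Save the fields so they can be reused later
os.makedirs('model',exist_ok=True) # make sure the target directory exists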
with open('model/text_field.txt','wb') as f:
    pickle.dump(text_field, f)
with open('model/label_field.txt','wb') as f:
    pickle.dump(label_field, f)


# Create the iterators
train_iter, test_iter = BucketIterator.splits(
    datasets=(train_dataset,test_dataset),
    sort_key=lambda x:len(x.text),shuffle=True,
    batch_sizes=(config.batch_size,int(len(test_dataset)/16)))
# for batch on iterator -> (batch_size,seq_length)
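
To see what the custom tokenizer produces, here is a condensed stand-alone copy of it (the redundant empty-string filter is left out) applied to two of the sentences used in the prediction demo at the end of this section.

```python
import re


def tokenizer(text):
    text = text.lower()
    text = re.sub(r',|\.|\?|!', '', text)  # strip , . ? !
    tokens = text.split()
    if len(tokens) < 5:
        # Pad short sentences so that the widest kernel (5) still fits
        tokens.extend([' '] * (5 - len(tokens)))
    return tokens


print(tokenizer("Beam me up, Scotty."))
print(tokenizer("Jim, he's dead."))
```

Sentences shorter than five tokens are padded with `' '`, so every example is at least as long as the largest kernel width in `kernal_widths`.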

Data processing is done; time to train the model.

In [18]:
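### Model ###
# Train the model
config.embed_num = len(text_field.vocab)
config.class_num = len(label_field.vocab) - 1
textcnn = TextCNN(config)

# Start training from scratch: remove old checkpoints
# (save() writes them to config.save_dir)
if os.path.isdir(config.save_dir):
    for file in os.listdir(config.save_dir):
        if file.endswith('.pt'):
            os.remove(os.path.join(config.save_dir,file))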
    
try:
    train(textcnn,train_iter,test_iter,config)
except KeyboardInterrupt:
    print('\n'+'-'*88)
    print("interrupted by keyboard, stop training...")

Step[100] - loss: 0.833781                     acc: 61.4583%(59/96)
Evaluation - loss: 0.846032 acc: 62.8110%(3383/5386)

Step[200] - loss: 0.570657                     acc: 77.0833%(74/96)
Evaluation - loss: 0.780924 acc: 66.0416%(3557/5386)

Step[300] - loss: 0.421939                     acc: 83.3333%(80/96)
Evaluation - loss: 0.757187 acc: 67.6012%(3641/5386)

Step[400] - loss: 0.270577                     acc: 91.6667%(88/96)
Evaluation - loss: 0.774350 acc: 68.5481%(3692/5386)

Step[500] - loss: 0.348337                     acc: 91.6667%(88/96)
Evaluation - loss: 0.797730 acc: 67.2856%(3624/5386)

Step[600] - loss: 0.249587                     acc: 91.6667%(88/96))
Evaluation - loss: 0.848579 acc: 66.4129%(3577/5386)

Step[700] - loss: 0.188181                     acc: 92.7083%(89/96))
Evaluation - loss: 0.852056 acc: 67.3784%(3629/5386)

Step[800] - loss: 0.096367                     acc: 96.8750%(93/96))
Evaluation - loss: 0.894710 acc: 66.7843%(3597/5386)

Step[900] - loss: 0.1

In [19]:
# Pick the best model: checkpoints are only written on improvement,
# so the most recently modified file in config.save_dir is the best one
models_path = list(filter(lambda x:x.endswith('.pt'), os.listdir(config.save_dir)))
models_path = list(map(lambda x:os.path.join(config.save_dir,x), models_path))
models_path = list(sorted(models_path, key=lambda x:os.path.getmtime(x)))
best_model_path = models_path[-1]
# The model class used for reloading must match the one that was saved
textcnn = TextCNN(config)
state_dict = torch.load(best_model_path)
textcnn.load_state_dict(state_dict)

# Evaluate the model
try:
    evaluate(textcnn,test_iter)
except Exception:
    print(traceback.format_exc())


Evaluation - loss: 0.774350 acc: 68.5481%(3692/5386)



In [20]:
# Model prediction -> interactive
with open('model/text_field.txt','rb') as f:
    text_field = pickle.load(f)
with open('model/label_field.txt','rb') as f:
    label_field = pickle.load(f)
while True:
    try:
        text = input("Plz enter a sentence for prediction:\n")
        predict(text,textcnn,text_field,label_field)
        print()
    except KeyboardInterrupt:
        print('Exiting...')
        break

Plz enter a sentence for prediction:
 Beam me up, Scotty.


KIRK



Plz enter a sentence for prediction:
 Sensor scan to one half parsec. Negative, Captain.


SPOCK



Plz enter a sentence for prediction:
 Change course, come about to one eight five, mark three.


KIRK



Plz enter a sentence for prediction:
 I believed the Romulans have developed a cloaking device which renders our tracking sensors useless.


SPOCK



Plz enter a sentence for prediction:
 Jim, he's dead.


MCCOY



Plz enter a sentence for prediction:
 Well, you can see for yourself, he's mentally depressed, physically weak, disoriented, displays of feelings of persecution and rebellion.


MCCOY

Exiting...
