# TextCNN from Scratch with TensorFlow

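In this Jupyter notebook, we give a simple Keras implementation of [**Convolutional Neural Networks for Sentence Classification**](https://arxiv.org/pdf/1408.5882.pdf). You can use this notebook to reproduce the results of the **CNN-rand** and **CNN-non-static** models from that paper.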

## Loading and preprocessing the data

For simplicity, we only load and process the [MR](http://www.cs.cornell.edu/people/pabo/movie-review-data/rt-polaritydata.tar.gz) data set used in [**Convolutional Neural Networks for Sentence Classification**](https://arxiv.org/pdf/1408.5882.pdf).

+ **Import dependencies**

In [None]:
import gensim
import numpy as np
import keras
import keras.layers as L
import re
from collections import Counter
from sklearn.model_selection import train_test_split

+ **Hyperparameters**

In [None]:
# hyperparameters
USE_PRE_TRAIN_EMBEDDING = True
EMBEDDING_DIM = 300
POSITIVE_DATA_FILE = './rt-polaritydata/rt-polarity.pos'
NEGATIVE_DATA_FILE = './rt-polaritydata/rt-polarity.neg'
DEV_SAMPLE_PERCENTAGE = 0.1
NUM_CLASSES = 2
NUM_FILTERS = 128
FILTER_SIZES = (3, 4, 5)

## Loading the MR data

In [None]:
def clean_str(string):
    """
    Tokenization/string cleaning for all datasets except for SST.
    Original taken from https://github.com/yoonkim/CNN_sentence/blob/master/process_data.py
    """
    string = re.sub(r"[^A-Za-z0-9(),!?\'\`]", " ", string)
    string = re.sub(r"\'s", " \'s", string)
    string = re.sub(r"\'ve", " \'ve", string)
    string = re.sub(r"n\'t", " n\'t", string)
    string = re.sub(r"\'re", " \'re", string)
    string = re.sub(r"\'d", " \'d", string)
    string = re.sub(r"\'ll", " \'ll", string)
    string = re.sub(r",", " , ", string)
    string = re.sub(r"!", " ! ", string)
    # The replacement strings hold plain characters; a backslash here would end up in the tokens
    string = re.sub(r"\(", " ( ", string)
    string = re.sub(r"\)", " ) ", string)
    string = re.sub(r"\?", " ? ", string)
    string = re.sub(r"\s{2,}", " ", string)
    return string.strip().lower()
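
To make the effect of `clean_str` concrete, here is a minimal sketch that applies a trimmed-down copy of its rules to one sentence. The helper name `clean_str_demo` and the sample sentence are made up for illustration; only a subset of the rules is repeated so that the snippet runs on its own.

```python
import re

def clean_str_demo(text):
    # Trimmed-down copy of the rules in clean_str (illustrative subset only)
    text = re.sub(r"[^A-Za-z0-9(),!?'`]", " ", text)   # keep letters, digits and a few marks
    text = re.sub(r"n't", " n't", text)                # split negation contractions
    text = re.sub(r"'s", " 's", text)                  # split "'s" from the preceding word
    text = re.sub(r",", " , ", text)                   # surround commas with spaces
    text = re.sub(r"!", " ! ", text)                   # surround exclamation marks with spaces
    text = re.sub(r"\s{2,}", " ", text)                # collapse runs of whitespace
    return text.strip().lower()

cleaned = clean_str_demo("It's great, isn't it!")
print(cleaned)             # it 's great , is n't it !
print(cleaned.split(' '))  # tokens are separated by single spaces
```

After cleaning, splitting on a single space yields the tokens, which is the convention the rest of the notebook relies on.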

In [None]:
def load_data_and_labels(positive_data_file, negative_data_file):
    """
    Loads MR polarity data from files, cleans each sentence and generates labels.
    Returns the cleaned sentences and one-hot labels.
    """
    # Load data from files; `with` closes each file handle once it has been read
    with open(positive_data_file, "r", encoding='utf-8') as f:
        positive_examples = [s.strip() for s in f]
    with open(negative_data_file, "r", encoding='utf-8') as f:
        negative_examples = [s.strip() for s in f]
    # Clean the text (tokens are separated by single spaces afterwards)
    x_text = positive_examples + negative_examples
    x_text = [clean_str(sent) for sent in x_text]
    # Generate one-hot labels: [0, 1] = positive, [1, 0] = negative
    positive_labels = [[0, 1] for _ in positive_examples]
    negative_labels = [[1, 0] for _ in negative_examples]
    y = np.concatenate([positive_labels, negative_labels], 0)
    return [x_text, y]
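
The label layout used by `load_data_and_labels` can be checked on a toy example. This is a small sketch with made-up sentences, not the MR data; it only shows that positives map to `[0, 1]`, negatives to `[1, 0]`, and that `argmax` recovers the class index.

```python
import numpy as np

# Toy stand-ins for the two review files (hypothetical sentences)
positive_examples = ["a charming film", "truly moving"]
negative_examples = ["a dull mess"]

# Same convention as load_data_and_labels: column 1 is "positive", column 0 is "negative"
positive_labels = [[0, 1] for _ in positive_examples]
negative_labels = [[1, 0] for _ in negative_examples]
y_demo = np.concatenate([positive_labels, negative_labels], 0)

# argmax along axis 1 turns each one-hot row back into a class index
class_ids = np.argmax(y_demo, axis=1)
print(y_demo.shape, class_ids)
```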

In [None]:
x_text, y = load_data_and_labels(POSITIVE_DATA_FILE, NEGATIVE_DATA_FILE)
print('Total records of the MR data set: ', len(x_text))
max_doc_length = max([len(x.split(' ')) for x in x_text])
print("Max document length: ", max_doc_length)

In [None]:
tokens = [t for doc in x_text for t in doc.split(' ')]
print("Total tokens in the MR data set: ", len(tokens))
counter = Counter(tokens)
index2word = list(counter.keys())
index2word.insert(0, 'PAD')
print("Vocabulary size in MR data set(contains 'PAD' as first): ", len(index2word))
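
As a rough illustration of the vocabulary step, the sketch below builds the same kind of `index2word` list on two made-up sentences and adds a reverse lookup dictionary. The names `index2word_demo` and `word2index_demo` are introduced here for illustration only.

```python
from collections import Counter

docs = ["the film is good", "the film is bad"]   # hypothetical cleaned sentences
tokens = [t for doc in docs for t in doc.split(' ')]
counter = Counter(tokens)

# Counter keeps first-seen order, so ids follow the order in which words first appear;
# index 0 is reserved for the padding token
index2word_demo = ['PAD'] + list(counter.keys())
word2index_demo = {w: i for i, w in enumerate(index2word_demo)}

print(index2word_demo)
print(word2index_demo['good'])
```

A dictionary gives constant-time word-to-id lookups, which matters once the vocabulary has many thousands of entries.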

In [None]:
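def as_matrix(sequences, max_len, index2word):
    # Build the word -> index lookup once; calling list.index() per token scans the whole vocabulary
    word2index = {w: i for i, w in enumerate(index2word)}
    # Index 0 is 'PAD', so rows shorter than max_len stay zero-padded on the right
    matrix = np.full((len(sequences), max_len), 0)
    for i, seq in enumerate(sequences):
        row_ix = [word2index[w] for w in seq.split(' ')]
        matrix[i, :len(row_ix)] = row_ix
    return matrix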

In [None]:
x_matrix = as_matrix(x_text, max_doc_length, index2word)
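
The padding scheme can be seen on a toy vocabulary. This is a minimal sketch with made-up sentences and hypothetical names (`index2word_demo`, `padded`); shorter sentences are right-padded with 0, the id of `'PAD'`.

```python
import numpy as np

index2word_demo = ['PAD', 'the', 'film', 'is', 'good', 'really']   # hypothetical vocabulary
word2index_demo = {w: i for i, w in enumerate(index2word_demo)}
docs = ["the film is good", "really good"]
max_len = max(len(d.split(' ')) for d in docs)

# One row per sentence, one column per position; unused positions keep the 'PAD' id 0
padded = np.zeros((len(docs), max_len), dtype=int)
for i, doc in enumerate(docs):
    ids = [word2index_demo[w] for w in doc.split(' ')]
    padded[i, :len(ids)] = ids
print(padded)
```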

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x_matrix, y, test_size=DEV_SAMPLE_PERCENTAGE)
print('Train records: ', len(x_train))
print('Test records:', len(x_test))

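## Loading pre-trained word2vec
We use the publicly available word2vec vectors that were trained on 100 billion words from Google News. The vectors have a dimensionality of 300 and were trained using the continuous bag-of-words architecture [(Mikolov et al., 2013)](https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf). Words that are not present in the set of pre-trained words are initialized randomly. You can download Google's pre-trained word2vec [here]().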

In [None]:
word2vec_model = gensim.models.KeyedVectors.load_word2vec_format('./GoogleNews-vectors-negative300.bin', binary=True)

In [None]:
def get_pre_train_word2vec(model, index2word, vocab_size):
    embedding_size = model.vector_size
    # [vocab_size, embedding_size]; row 0 stays all-zero for the first word, 'PAD'
    word_embedding = np.zeros((vocab_size, embedding_size))
    pre_count = 0    # number of vocabulary words found in the pre-trained word2vec
    # loop over the vocabulary, skipping the first entry, 'PAD'
    for i in range(1, vocab_size):
        # membership test and indexing on the KeyedVectors object avoid relying on
        # `model.vocab`, which is not available in newer gensim releases
        if index2word[i] in model:
            word_embedding[i] = model[index2word[i]]
            pre_count += 1
        else:
            # initialize randomly if the word does not exist in the pre-trained word2vec
            word_embedding[i] = np.random.uniform(-0.1, 0.1, embedding_size)
    return word_embedding, pre_count

In [None]:
word_embedding, pre_count = get_pre_train_word2vec(word2vec_model, index2word, len(index2word))
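
The lookup-or-random-initialization step can be sketched without downloading the Google News vectors. In the snippet below a plain dictionary named `pretrained` stands in for the word2vec model; it and the toy vocabulary are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
embedding_size = 4
# Hypothetical "pre-trained" vectors; in the notebook these come from word2vec_model
pretrained = {'film': np.ones(embedding_size), 'good': np.full(embedding_size, 2.0)}
index2word_demo = ['PAD', 'film', 'good', 'zzz']

emb = np.zeros((len(index2word_demo), embedding_size))   # row 0 ('PAD') stays zero
hits = 0
for i, w in enumerate(index2word_demo[1:], start=1):
    if w in pretrained:
        emb[i] = pretrained[w]                            # copy the pre-trained vector
        hits += 1
    else:
        emb[i] = rng.uniform(-0.1, 0.1, embedding_size)   # word missing from the pre-trained set
print(emb.shape, hits)
```

Comparing the hit count with the vocabulary size is a quick way to see how much of the data set is covered by the pre-trained vectors.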

# TextCNN model

The model is the same as the one in [**Convolutional Neural Networks for Sentence Classification**](https://arxiv.org/pdf/1408.5882.pdf). Figure 1 shows the model architecture. (Figure 1 is taken from [**A Sensitivity Analysis of (and Practitioners' Guide to) Convolutional Neural Networks for Sentence Classification**](https://arxiv.org/pdf/1510.03820.pdf).)
<br>
![Fig 1 Text CNN](http://aimaksen.bslience.cn/textcnn.jpg)<br>
<center>*Fig 1 Text CNN model architecture*</center>
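
To illustrate what a single filter in this architecture computes, here is a minimal NumPy sketch: one filter that spans the full embedding width slides over the sentence, a ReLU is applied, and max-over-time pooling keeps one number. The sizes and random values are arbitrary assumptions, chosen only to keep the example small.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, embedding_size, filter_size = 7, 5, 3
sentence = rng.normal(size=(seq_len, embedding_size))     # one embedded sentence
kernel = rng.normal(size=(filter_size, embedding_size))   # one filter spanning the full embedding width
bias = 0.1

# Slide the filter over every window of `filter_size` consecutive words ('valid' padding)
feature_map = np.array([
    max(0.0, float(np.sum(sentence[i:i + filter_size] * kernel) + bias))   # ReLU activation
    for i in range(seq_len - filter_size + 1)
])
pooled = feature_map.max()   # max-over-time pooling keeps one number per filter
print(feature_map.shape, pooled)
```

With several filter sizes and many filters per size, the pooled values are concatenated into one feature vector that feeds the softmax layer.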

+ **Function for computing the model's accuracy**

In [None]:
def precision(model, x_test, y_true):
    # NOTE: despite its name, this returns accuracy (the fraction of correct predictions)
    y_true = np.argmax(y_true, axis=1)
    y_predict = model.predict(x_test)
    y_predict = np.argmax(y_predict, axis=1)
    true_count = sum(y_true == y_predict)
    return true_count / len(y_true)
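
The `precision` helper above counts matching argmax predictions, which is accuracy. The sketch below, using made-up labels and hypothetical softmax outputs, shows that computation next to precision for the positive class, so the difference between the two metrics is visible.

```python
import numpy as np

# One-hot labels and hypothetical softmax outputs for five examples
y_true = np.array([[0, 1], [0, 1], [1, 0], [1, 0], [1, 0]])
y_prob = np.array([[0.2, 0.8], [0.6, 0.4], [0.7, 0.3], [0.9, 0.1], [0.4, 0.6]])

true_ids = np.argmax(y_true, axis=1)
pred_ids = np.argmax(y_prob, axis=1)

# Accuracy: share of all examples whose predicted class matches the label
accuracy = np.mean(true_ids == pred_ids)
# Precision for the positive class: of everything predicted positive, how much really is positive
predicted_pos = pred_ids == 1
precision_pos = np.sum((true_ids == 1) & predicted_pos) / np.sum(predicted_pos)
print(accuracy, precision_pos)
```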

+ **The Text CNN architecture from Figure 1**

In [None]:
def text_cnn(sequence_length, num_classes, vocab_size, embedding_size, 
             filter_sizes, num_filters, embedding_matrix, drop_out=0.5, l2_reg_lambda=0.0):
    # NOTE: l2_reg_lambda is accepted but is not applied anywhere below
    input_x = L.Input(shape=(sequence_length,), name='input_x')
    
    # embedding layer
    if embedding_matrix is None:
        embedding = L.Embedding(vocab_size, embedding_size, name='embedding')(input_x)
    else:
        embedding = L.Embedding(vocab_size, embedding_size, weights=[embedding_matrix], name='embedding')(input_x)
    # Add a channel axis: 4D tensor [batch_size, seq_len, embedding_size, 1], like a grayscale image
    embedding_chars = L.Reshape([sequence_length, embedding_size, 1])(embedding)
    
    # conv->max pool
    pooled_outputs = []
    for i, filter_size in enumerate(filter_sizes):
        conv = L.Conv2D(filters=num_filters, 
                        kernel_size=[filter_size, embedding_size],
                        strides=1,
                        padding='valid',
                        activation='relu',
                        kernel_initializer=keras.initializers.TruncatedNormal(mean=0.0, stddev=0.1),
                        bias_initializer=keras.initializers.Constant(value=0.1),
                        name=('conv_%d' % filter_size))(embedding_chars)
        # print("conv-%d: " % i, conv)
        max_pool = L.MaxPool2D(pool_size=[sequence_length - filter_size + 1, 1],
                               strides=(1, 1),
                               padding='valid',
                               name=('max_pool_%d' % filter_size))(conv)
        pooled_outputs.append(max_pool)
        # print("max_pool-%d: " % i, max_pool)
    
    # combine all the pooled features
    num_filters_total = num_filters * len(filter_sizes)
    h_pool = L.Concatenate(axis=3)(pooled_outputs)
    h_pool_flat = L.Reshape([num_filters_total])(h_pool)
    # add dropout
    dropout = L.Dropout(drop_out)(h_pool_flat)
    
    # output layer
    output = L.Dense(num_classes,
                     kernel_initializer='glorot_normal',
                     bias_initializer=keras.initializers.Constant(0.1),
                     activation='softmax',
                     name='output')(dropout)
    
    model = keras.models.Model(inputs=input_x, outputs=output)
    
    return model
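
The tensor shapes inside `text_cnn` follow from simple arithmetic. The helper below is a hypothetical sketch (`text_cnn_shapes` is not part of the notebook) that traces per-example shapes with the batch axis omitted; the sequence length of 56 is an assumed value used only for the demonstration.

```python
def text_cnn_shapes(sequence_length, embedding_size, filter_sizes, num_filters):
    """Trace per-example tensor shapes (batch axis omitted) through the layers of text_cnn."""
    shapes = {'embedding': (sequence_length, embedding_size),
              'reshape': (sequence_length, embedding_size, 1)}
    for fs in filter_sizes:
        # 'valid' convolution with a kernel as wide as the embedding leaves a single column
        shapes['conv_%d' % fs] = (sequence_length - fs + 1, 1, num_filters)
        # pooling over the whole feature map leaves one value per filter
        shapes['max_pool_%d' % fs] = (1, 1, num_filters)
    # concatenating along the filter axis and flattening gives the final feature vector
    shapes['flat'] = (num_filters * len(filter_sizes),)
    return shapes

shapes = text_cnn_shapes(sequence_length=56, embedding_size=300,
                         filter_sizes=(3, 4, 5), num_filters=128)
for name, shape in shapes.items():
    print(name, shape)
```

With the hyperparameters defined earlier (three filter sizes, 128 filters each), the vector passed to dropout and the output layer has 384 entries.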

## CNN-rand

Here we run the **CNN-rand** model from the original paper. After 10 epochs, we reach an accuracy of about 75%.

In [None]:
cnn_rand = text_cnn(x_train.shape[1], NUM_CLASSES, len(index2word), EMBEDDING_DIM, FILTER_SIZES, NUM_FILTERS, None)
cnn_rand.compile('adam', 'categorical_crossentropy', metrics=['accuracy'])
cnn_rand_history = cnn_rand.fit(x_train, y_train, epochs=10, batch_size=128)

+ **Evaluate the accuracy on the dev set. After 10 epochs, we reach about 75%.**

In [None]:
precision(cnn_rand, x_test, y_test)

## CNN-non-static
Here we run the **CNN-non-static** model from the original paper. After 10 epochs, we reach an accuracy of about 79%.

In [None]:
cnn_non_static = text_cnn(x_train.shape[1], NUM_CLASSES, len(index2word), EMBEDDING_DIM, FILTER_SIZES, NUM_FILTERS, word_embedding)
cnn_non_static.compile('adam', 'categorical_crossentropy', metrics=['accuracy'])
cnn_non_static_history = cnn_non_static.fit(x_train, y_train, epochs=10, batch_size=128)

+ **Evaluate the accuracy on the dev set. After 10 epochs, we reach about 79%.**

In [None]:
precision(cnn_non_static, x_test, y_test)