欢迎！

在这个code lab里，你会学习如何创造简单的text embedding，然后用神经网络来进行文本信息分类。

文本信息分类是一个常见的NLP任务。它可以帮助我们自动理解文本信息，寻找相似信息和进行信息推送。

在本次练习里，我们会使用StackOverflow questions数据集。我们会训练一个文本信息分类器，并用它来分类文本中的问题所涉及到的编程语言。

首先，我们import所需的library。

In [33]:
import numpy as np
import pandas as pd
import pickle
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics.pairwise import cosine_distances
import nltk
try:
    from nltk.corpus import stopwords
except:
    nltk.download('stopwords')
    from nltk.corpus import stopwords

import tensorflow as tf

查看并了解数据集：

In [2]:
sample_size = 200000
stackoverflow_df = pd.read_csv('tagged_posts.tsv', sep='\t').sample(sample_size, random_state=0)
tags = pd.read_csv('tagged_posts.tsv', sep='\t')['tag'].unique()
print('tags: ', tags)
stackoverflow_df.head()

tags:  ['c#' 'php' 'c_cpp' 'python' 'ruby' 'java' 'javascript' 'vb' 'r' 'swift']


Unnamed: 0,post_id,title,tag
2168983,43837842,Efficient Algorithm to compose valid expressio...,python
1084095,15747223,Why does this basic thread program fail with C...,c_cpp
1049020,15189594,Link to scroll to top not working,javascript
200466,3273927,Is it possible to implement ping on windows ph...,c#
1200249,17684551,GLSL normal mapping issue,c_cpp


通常，我们会对文字信息进行简化（去除复数，用词根代替衍生词等等）。

定义预处理方法：

In [3]:
def text_prepare(text):
    """Performs tokenization and simple preprocessing."""
    
    replace_by_space_re = re.compile('[/(){}\[\]\|@,;]')
    bad_symbols_re = re.compile('[^0-9a-z #+_]')
    stopwords_set = set(stopwords.words('english'))

    text = text.lower()
    text = replace_by_space_re.sub(' ', text)
    text = bad_symbols_re.sub('', text)
    text = ' '.join([x for x in text.split() if x and x not in stopwords_set])
    
    tokenizer = nltk.tokenize.TreebankWordTokenizer()
    tokens = tokenizer.tokenize(text.strip())
    
    stemmer = nltk.stem.PorterStemmer()
    return ' '.join(stemmer.stem(token) for token in tokens)

def one_hot(tag, tags = tags):
    oh = tags == tag
    return np.uint8(np.array(oh))

测试预处理结果：

In [4]:
print(text_prepare('What do pRogRammers love ABOUT@ python?'))
print(one_hot('php'))
print(one_hot('ruby'))

programm love python
[0 1 0 0 0 0 0 0 0 0]
[0 0 0 0 1 0 0 0 0 0]


对整个数据集进行预处理：

In [5]:
stackoverflow_df['title_processed'] = stackoverflow_df['title'].apply(text_prepare)
stackoverflow_df['oh_tag'] = stackoverflow_df['tag'].apply(one_hot)
stackoverflow_df.head()

Unnamed: 0,post_id,title,tag,title_processed,oh_tag
2168983,43837842,Efficient Algorithm to compose valid expressio...,python,effici algorithm compos valid express specif t...,"[0, 0, 0, 1, 0, 0, 0, 0, 0, 0]"
1084095,15747223,Why does this basic thread program fail with C...,c_cpp,basic thread program fail clang pass g++,"[0, 0, 1, 0, 0, 0, 0, 0, 0, 0]"
1049020,15189594,Link to scroll to top not working,javascript,link scroll top work,"[0, 0, 0, 0, 0, 0, 1, 0, 0, 0]"
200466,3273927,Is it possible to implement ping on windows ph...,c#,possibl implement ping window phone 7,"[1, 0, 0, 0, 0, 0, 0, 0, 0, 0]"
1200249,17684551,GLSL normal mapping issue,c_cpp,glsl normal map issu,"[0, 0, 1, 0, 0, 0, 0, 0, 0, 0]"


将数据集分成训练集和测试集：

In [6]:
X = stackoverflow_df['title_processed'].values
y = np.array(stackoverflow_df['oh_tag'].values.tolist())
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.05, random_state = 42)
# 查看结果
print(X_train[:10])
print(y_train[:10])

['implement scrollview python kivi' 'jqueri drag drop text div smaller div'
 'jdk16 + understand err file' 'node travers can not null hibern updat'
 'subqueri nspredic'
 'stack trace wekacorewekaexcept wekaclassifiersfunctionssmo enough train instanc class label requir 1 provid 0'
 'get number line file text c' 'use deleteallonsubmit tabl primari key'
 'java simpl implement command pattern oncomplet callback'
 'c # tcp server client crosscommun requir']
[[0 0 0 1 0 0 0 0 0 0]
 [0 0 0 0 0 0 1 0 0 0]
 [0 0 0 0 0 1 0 0 0 0]
 [0 0 0 0 0 1 0 0 0 0]
 [0 0 1 0 0 0 0 0 0 0]
 [0 0 0 0 0 1 0 0 0 0]
 [0 0 1 0 0 0 0 0 0 0]
 [1 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 1 0 0 0 0]
 [1 0 0 0 0 0 0 0 0 0]]


对文本信息进行text embedding处理，将文本信息根据上下文单词出现的频率进行向量化：

In [7]:
def tfidf_features(X_train, X_test):
    tfidf_vectorizer = TfidfVectorizer(ngram_range = (1, 5), token_pattern = '(\S+)', max_df = 0.9, min_df = 5)
    X_train = tfidf_vectorizer.fit_transform(X_train)
    X_test = tfidf_vectorizer.transform(X_test)
    pickle.dump(tfidf_vectorizer, open('tfidf_vectorizer.pkl', 'wb'))
    
    return X_train, X_test, tfidf_vectorizer.vocabulary_, tfidf_vectorizer

In [8]:
X_train_tfidf, X_test_tfidf, tf_idf_vocab, tfidf_trans = tfidf_features(X_train, X_test)
X_train_tfidf

<190000x41030 sparse matrix of type '<class 'numpy.float64'>'
	with 1530064 stored elements in Compressed Sparse Row format>

创建mini batch：

In [9]:
def batch_generator(X, y, batch_size = 32):
    shape = X.shape[0]
    for i in range(0, shape, batch_size):
        yield X[i:i+batch_size].toarray(), y[i:i+batch_size]

创建全连接神经网络架构:

In [10]:
X = tf.placeholder(dtype = tf.float32, shape = (None, X_train_tfidf.shape[1]))
y = tf.placeholder(dtype = tf.float32, shape = (None, 10))

num_neurons = [1024, 128, 10]

fc = tf.layers.dense(X, num_neurons[0], activation = tf.nn.relu)
for i in num_neurons[1:-2]:
    fc = tf.layers.dense(fc, i, activation = tf.nn.relu)
fc = tf.layers.dense(fc, num_neurons[-1], activation = tf.nn.sigmoid)

loss = tf.losses.log_loss(labels = y, predictions = fc, reduction = tf.losses.Reduction.NONE)
cost = tf.reduce_mean(loss, axis = -1)
optimizer = tf.train.AdamOptimizer(0.001)
train_op = optimizer.minimize(cost)

训练分类器并测试准确率。

In [11]:
num_epochs = 6
batch_size = 32
bg_train = batch_generator(X_train_tfidf, y_train, batch_size)
bg_test = batch_generator(X_test_tfidf, y_test, batch_size)

sess = tf.Session()
sess.run(tf.global_variables_initializer())

for i in range(num_epochs):
    for X_train_, y_train_ in bg_train:
        sess.run(train_op, feed_dict = {X: X_train_, y: y_train_})
        
    if (i % 2 == 0) or (i == num_epochs - 1):
        ov_acc = []
        for X_test_, y_test_ in bg_test:
            y_pred_ = sess.run(fc, feed_dict = {X: X_test_})
            acc = accuracy_score(np.argmax(y_test_, axis = -1), np.argmax(y_pred_, axis = -1))
            ov_acc.append(acc)
        print('Epoch {} validation accuracy is {}'.format(i, np.mean(ov_acc)))
        
    bg_train = batch_generator(X_train_tfidf, y_train, batch_size)
    bg_test = batch_generator(X_test_tfidf, y_test, batch_size)

Epoch 0 validation accuracy is 0.8061102236421726
Epoch 2 validation accuracy is 0.7875399361022364
Epoch 4 validation accuracy is 0.7846445686900958
Epoch 5 validation accuracy is 0.7807507987220448


In [12]:
# 测试结果
test_case = tfidf_trans.transform([text_prepare('How to use recursion in c++?')]).toarray().astype(np.float32)
y_pred = sess.run(fc, feed_dict = {X: test_case})
print('Test case prediction is \'{}\''.format(tags[np.argmax(y_pred)]))

Test case prediction is 'c_cpp'


用简单的全连接模型，平均测试准确率为80%。

我们可以用StarSpace来创造更复杂的embeddings，并用RNN来理解、分类问题。

In [13]:
# 载入预训练好的StarSpace embedding
embeddings = pickle.load(open('starspace_embedding.pkl', 'rb'))

In [14]:
# utils functions
def build_encode(x, max_len, emb = embeddings, emb_dim = 100):
    emb_ls = []
    for word in x.split():
        try:
            emb_ls.append(emb[word])
        except:
            emb_ls.append(np.zeros((emb_dim,)))
            
    padding = np.zeros((max_len, emb_dim))
    if len(emb_ls) > 0:
        padding[:len(emb_ls)] = np.array(emb_ls)
        return padding
    else:
        return padding

def RNN_batch_generator(X, y, max_len, batch_size = 32):
    shape = X.shape[0]
    for i in range(0, shape, batch_size):
        x_mini_batch = np.array([build_encode(x, max_len) for x in X[i:i+batch_size]]).astype(np.float32)
        y_mini_batch = y[i:i+batch_size]
        yield x_mini_batch, np.float32(y_mini_batch)

创造一个简单的RNN with LSTM cell。

In [15]:
max_len = np.max([len(x.split()) for x in X_train])
emb_dim = 100
X_rnn = tf.placeholder(dtype = tf.float32, shape = (None, max_len, emb_dim))
y_rnn = tf.placeholder(dtype = tf.float32, shape = (None, 10))

cell = tf.nn.rnn_cell.BasicLSTMCell(100, reuse = tf.AUTO_REUSE) 
outputs, state = tf.nn.dynamic_rnn(cell, 
                                   X_rnn, 
                                   dtype = tf.float32)
y_pred = tf.layers.dense(state[0], 10, activation = tf.nn.sigmoid)

loss_rnn = tf.losses.log_loss(labels = y_rnn, predictions = y_pred, reduction = tf.losses.Reduction.NONE)
cost_rnn = tf.reduce_mean(loss_rnn, axis = -1)
optimizer_rnn = tf.train.AdamOptimizer(0.001)
train_op_rnn = optimizer_rnn.minimize(cost_rnn)

In [18]:
num_epochs = 6
batch_size = 32
bg_train = RNN_batch_generator(X_train, y_train, max_len, batch_size)
bg_test = RNN_batch_generator(X_test, y_test, max_len, batch_size)

sess = tf.Session()
sess.run(tf.global_variables_initializer())

for i in range(num_epochs):
    for X_train_, y_train_ in bg_train:
        sess.run(train_op_rnn, feed_dict = {X_rnn: X_train_, y_rnn: y_train_})
        
    if (i % 2 == 0) or (i == num_epochs - 1):
        ov_acc = []
        for X_test_, y_test_ in bg_test:
            y_pred_ = sess.run(y_pred, feed_dict = {X_rnn: X_test_})
            acc = accuracy_score(np.argmax(y_test_, axis = -1), np.argmax(y_pred_, axis = -1))
            ov_acc.append(acc)
        print('Epoch {} validation accuracy is {}'.format(i, np.mean(ov_acc)))
        
    bg_train = RNN_batch_generator(X_train, y_train, max_len, batch_size)
    bg_test = RNN_batch_generator(X_test, y_test, max_len, batch_size)

Epoch 0 validation accuracy is 0.8413538338658147
Epoch 2 validation accuracy is 0.8483426517571885
Epoch 4 validation accuracy is 0.8498402555910544
Epoch 5 validation accuracy is 0.8501397763578274


测试训练结果：<br>
case 1：训练准确性<br>
case 2：问题之间的相似性

In [19]:
# Test case 1
test_case = build_encode('How to use list comprehension?', 23)
test_case = np.expand_dims(test_case.astype(np.float32), 0)
test_pred = sess.run(y_pred, feed_dict = {X_rnn: test_case})
print('Test case prediction is {}'.format(tags[np.argmax(test_pred)]))

Test case prediction is python


In [30]:
# Test case 2
test_case_0 = build_encode('How do you set, clear, and toggle a single bit?', 23) #C/C++ problem
test_case_0 = np.expand_dims(test_case_0.astype(np.float32), 0)

test_case_1 = build_encode('Maximum recursion depth in python?', 23)
test_case_1 = np.expand_dims(test_case_1.astype(np.float32), 0)

test_case_2 = build_encode('How to use list comprehension?', 23)
test_case_2 = np.expand_dims(test_case_2.astype(np.float32), 0)

test_emb_0 = np.squeeze(sess.run(state[1], feed_dict = {X_rnn: test_case_0})).reshape(1,-1)
test_emb_1 = np.squeeze(sess.run(state[1], feed_dict = {X_rnn: test_case_1})).reshape(1,-1)
test_emb_2 = np.squeeze(sess.run(state[1], feed_dict = {X_rnn: test_case_2})).reshape(1,-1)

compare_0_1 = cosine_distances(test_emb_0, test_emb_1)
compare_0_2 = cosine_distances(test_emb_0, test_emb_2)
compare_1_2 = cosine_distances(test_emb_1, test_emb_2)

print('Cosine distances between c_cpp and python problems: {} and {}'.format(compare_0_1, compare_0_2))
print('Cosine distance between python problems: {}'.format(compare_1_2))

Cosine distances between c_cpp and python problems: [[ 0.61002129]] and [[ 0.54072559]]
Cosine distance between python problems: [[ 0.07263857]]


从测试结果我们可以看到，准确率明显上升。

总结：

1. 文本信息分类是一个常见的NLP任务。
2. 它的常见流程是：a) 文字预处理，b) 生成单词的embedding（基于计数方法或者训练方法），c) 训练模型。
3. 由训练生成的embedding更能反映一个单词的意义。同时，它的表现方式更为密集（相对于计数方法生成的稀疏向量）。
4. RNN是NLP领域常见的模型，性能往往比普通的全连接模型更好。
5. RNN中的隐藏状态可能可以反映出文本之间的相似度关系。