Sentiment analysis on [z17176 dataset](https://github.com/z17176/Chinese_conversation_sentiment).

This dataset was used in the following research.  They have built a 3m corpus for the research but only released the 30k dataset.

* [1]L. Zhang and C. Chen, “Sentiment Classification with Convolutional Neural Networks: An Experimental Study on a Large-Scale Chinese Conversation Corpus,” in 2016 12th International Conference on Computational Intelligence and Security (CIS), 2016, pp. 165–169. http://ieeexplore.ieee.org/abstract/document/7820437/

In [2]:
path = "data/conversation_sentiment"

In [3]:
import numpy as np, pandas as pd, matplotlib.pyplot as plt
import os, math, re, pickle
#import jieba
from keras.models import Model, Sequential
from keras.layers import Embedding, Dense, Flatten, Conv1D, MaxPooling1D, BatchNormalization, Dropout

#jieba.set_dictionary("data/dict.txt.big")

Using TensorFlow backend.
  return f(*args, **kwds)


# Setup

In [4]:
_train = None
_valid = None

def load_train_valid():
    global _train, _valid
    if _train is None:
        _train = pd.read_csv(os.path.join(path, "sentiment_XS_30k.txt"))
    if _valid is None:
        _valid = pd.read_csv(os.path.join(path, "sentiment_XS_test.txt"))
    return _train, _valid

Load word embedding dictionary.

In [5]:
dictionary_path = os.path.join(path, "dictionary.pkl")

def create_dictionary(*data):
    phrases = {}
    for d in data:
        for sentence in d:
            for ph in sentence.split(" "):
                phrases[ph] = True
    with open(os.path.join(path, "dictionary.txt"), "w") as fh:
        fh.writelines([ ph + "\n" for ph in phrases.keys() ])
    !cd $path; mkdir -p models; ln ../fasttext/wiki.zh.bin models/wiki.zh.bin
    !cd $path; ../../../bin/fasttext print-word-vectors models/wiki.zh.bin < dictionary.txt > dictionary.vec
    dictionary = pd.read_csv(os.path.join(path, "dictionary.vec"), 
                             delim_whitespace=True, engine="python", header=None, index_col=0)
    with open(dictionary_path, "wb") as fh:
        pickle.dump([{ ph: i for i, ph in enumerate(dictionary.index) }, dictionary], fh)

def load_dictionary():
    with open(dictionary_path, "rb") as fh:
        [ dict_index, dictionary ] = pickle.load(fh)
        return dict_index, dictionary
    
if not os.path.exists(dictionary_path):
    train, valid = load_train_valid()
    create_dictionary(train.text, valid.text)

dict_index, dictionary = load_dictionary()
phrases_n = len(dictionary)
latent_n = len(dictionary.columns)

Encode lables and embed phrases.

In [6]:
# phrase-length (min, max, mean, std) = (1, 23, 4.7941782325330093, 2.0175720386692686)
input_length = 8

data_path = os.path.join(path, "data.pkl")

if not os.path.exists(data_path):
    def get_label(df):
        labels = df["labels"].values
        labels[labels == "positive"] = 1
        labels[labels == "negative"] = 0
        return labels

    def get_text(df):
        texts = np.zeros((len(df), input_length))
        for i, text in enumerate(df.text.values):
            for j, ph in enumerate(text.split(" ")[:input_length]):
                if ph in dict_index:
                    texts[i, j] = dict_index[ph]
        return texts
    
    train, valid = load_train_valid()
    train_x, train_y = get_text(train), get_label(train)
    valid_x, valid_y = get_text(valid), get_label(valid)
    
    with open(data_path, "wb") as fh:
        pickle.dump([(train_x, train_y), (valid_x, valid_y)], fh)
else:
    with open(data_path, "rb") as fh:
        [(train_x, train_y), (valid_x, valid_y)] = pickle.load(fh)

# Simple CNN

In [7]:
def simple_cnn_model():
    model = Sequential()
    model.add(Embedding(phrases_n, latent_n, input_length=input_length, weights=[dictionary], trainable=False))
    model.add(BatchNormalization())
    model.add(Dropout(0.2))
    model.add(Conv1D(64, 3, border_mode="same", activation="relu"))
    model.add(BatchNormalization())
    model.add(Dropout(0.2))
    model.add(MaxPooling1D())
    model.add(Flatten())
    model.add(Dense(100, activation="relu"))
    model.add(Dropout(0.7))
    model.add(Dense(1, activation="sigmoid"))
    return model
    
simple_cnn = simple_cnn_model()
simple_cnn.compile("adam", loss="binary_crossentropy", metrics=["accuracy"])
simple_cnn.summary()

____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
embedding_1 (Embedding)          (None, 8, 300)        6644400     embedding_input_1[0][0]          
____________________________________________________________________________________________________
batchnormalization_1 (BatchNorma (None, 8, 300)        1200        embedding_1[0][0]                
____________________________________________________________________________________________________
dropout_1 (Dropout)              (None, 8, 300)        0           batchnormalization_1[0][0]       
____________________________________________________________________________________________________
convolution1d_1 (Convolution1D)  (None, 8, 64)         57664       dropout_1[0][0]                  
___________________________________________________________________________________________

In [7]:
def train_simple_cnn(lr=None, epoch=1, full=False):
    if lr is not None:
        simple_cnn.optimizer.lr = lr
    if full:
        simple_cnn.layers[0].trainable = True
    simple_cnn.fit(train_x, train_y, nb_epoch=epoch, validation_data=(valid_x, valid_y))
    
train_simple_cnn(1e-4)
train_simple_cnn(1e-1, 4)
train_simple_cnn(1e-2, 16)
train_simple_cnn(1e-3, 16)
train_simple_cnn(1e-4, 2, full=True)

Train on 29613 samples, validate on 11562 samples
Epoch 1/1
Train on 29613 samples, validate on 11562 samples
Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4
Train on 29613 samples, validate on 11562 samples
Epoch 1/16
Epoch 2/16
Epoch 3/16
Epoch 4/16
Epoch 5/16
Epoch 6/16
Epoch 7/16
Epoch 8/16
Epoch 9/16
Epoch 10/16
Epoch 11/16
Epoch 12/16
Epoch 13/16
Epoch 14/16
Epoch 15/16
Epoch 16/16
Train on 29613 samples, validate on 11562 samples
Epoch 1/16
Epoch 2/16
Epoch 3/16
Epoch 4/16
Epoch 5/16
Epoch 6/16
Epoch 7/16
Epoch 8/16
Epoch 9/16
Epoch 10/16
Epoch 11/16
Epoch 12/16
Epoch 13/16
Epoch 14/16
Epoch 15/16
Epoch 16/16
Train on 29613 samples, validate on 11562 samples
Epoch 1/2
Epoch 2/2


In [8]:
simple_cnn.save_weights(os.path.join(path, "models", "simple_cnn.h5"))

# Evaluation

In [9]:
model = simple_cnn
model.load_weights(os.path.join(path, "models", "simple_cnn_random.h5"))

In [14]:
pred = model.predict(valid_x)[:, 0]

In [35]:
train, valid = load_train_valid()

false_positive = pred > 0.6 * ~valid_y
false_negative = pred < 0.4 * valid_y

False positives.

In [36]:
valid["text"][false_positive]

0                                  AP 好 任性
1                   a 甜心 一手 货源 招 代理  这个 骗子
2                    っ ╥ ╯ ﹏ ╰ ╥ c 被 土豪 欺负
3                       ❀ ℋ č 点点  讨厌 不许 亲亲
4                         Don t care  大爷 此
5                           ee 好 霸道 欺负 老实人
6                                     E 无语
7                                 fct 很 讨厌
8                                   fly 白痴
9                                Gay 鉴定 完毕
10                      gg 管理 随波逐流  波哥 好 
11                      gg 无尽 空虚  又 调戏 美 女
12                                 hi 小 傻子
13       jtituthbx bc hjjje 呀呀  干 啥 哭 啥 事情
14                            lol 本身 就是 抄袭
15                                   mc 麻木
16                             MD 你们 这群 屌丝
17                           MM 冬夜  已 病入膏肓
18                         mm 管理 奔驰  切 不 好
19                            mm 管理 奔驰  装傻
20                            MM 雪 忽悠 加 骗子
21                        Moment 花不弃 欺负 毛线
22                               NND 掩饰 一下
23         

False negatives.

In [37]:
valid["text"][false_negative]

6265      Accompa ╮ ° 哎 呦 哒 晚安
6268           bot 不灵 还 得 人 哈哈
6270                cool 组队 插件
6271            DL 丶 无非  果断 分解
6273               en 只要 长得 漂亮
6282               G 晚上   都 不错
6291              mm 岛主 伊人  晚安
6305               mm 素素  真 靓女
6306        mm 优雅  叫 情人 买 就是 咯
6313              M 女 汉子  情人 谁
6316                O ∩ ∩ O 哈哈
6324            SS 高级 中级 分解 嘿嘿
6331           v 句 v 哈哈 白白净净 吧
6332                     XX 威武
6340                 阿姐  阿姐 威武
6343           哎 说 为什么 都 喜欢 女人
6350                  哎 呦 不错 哦
6351            哎 呦 不要 这么 谦虚 啦
6352     哎 这个 社会 发进 走 太笨 没人 喜欢
6357                         爱
6363                      爱 美女
6364                    爱 你们 哟
6373                 爱情 家庭 电视剧
6374                   爱情 经典语录
6376                    爱情 美 好
6377                     爱情 什么
6378                  爱情 什么 东东
6385                      爱人利物
6389                      爱 喜欢
6390                   爱 喜欢 区别
                 ...          
11406                 只要 有人 喜欢
11407   