# Join Platform Public Forum (眾開講)

## Setup

In [1]:
path = "data/join"
topic = "立法方式保障"
# topic = "同性伴侣法"
# topic = "同性婚姻法"

In [2]:
from __future__ import division, print_function
import pandas as pd, numpy as np
import jieba
from keras.models import Sequential
from keras.layers import Embedding, Dense, Flatten, Dropout
import os, math

Using TensorFlow backend.


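Read the messages from CSV. On the first run, keep only messages whose content is longer than 20 characters and cache them as `<topic>-good.csv`; on later runs, load that cached file, which also holds the manually assigned ORID labels.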

In [3]:
def get_messages_from_orig(topic):
    messages = pd.read_csv(os.path.join(path, topic + ".csv"), index_col=0)
    # Keep only messages whose content is longer than 20 characters (not bytes).
    mask = messages.content.astype('str').map(lambda x: len(x.decode('utf-8'))) > 20
    messages = messages[mask]
    messages.to_csv(os.path.join(path, topic + "-good.csv"))
    return messages

def get_labeled_messages(topic):
    return pd.read_csv(os.path.join(path, topic + "-good.csv"), index_col=0)

def labeled_only(messages):
    return messages[messages.ORID.notnull()]

def unlabeled_only(messages):
    return messages[messages.ORID.isnull()]
    
good_path = os.path.join(path, topic + "-good.csv")
all_messages = get_labeled_messages(topic) if os.path.exists(good_path) else get_messages_from_orig(topic)
print("Total messages: {count}".format(count=len(all_messages)))
messages = labeled_only(all_messages)
print("Labeled messages: {count}".format(count=len(messages)))
messages.head()

Total messages: 10215
Labeled messages: 63


| id | createDate | authorName | content | ORID | comments |
|---|---|---|---|---|---|
| 14136 | 2015-08-03 8:57:21 | Shaffer Lin | 「政府對全體人民的人權有履行義務且不應以公眾之意見作為履行的條件」那現在在投什麼？自打嘴巴？... | R | |
| 14135 | 2015-08-03 9:30:52 | Oliver Lin | 贊成歸贊成但更贊成就直接修民法就好了不用疊床架屋我要的沒有比較特別就是現在一堆人在結的那個婚姻 | D | |
| 14134 | 2015-08-03 10:05:26 | 蛍一 森里 | 與其以立法的方式來保障不如用修法的方式來保障不是比較方便一些? | I | |
| 14133 | 2015-08-03 10:08:00 | 楊剛 | 投你個花開富貴啦我要不要結婚關順性別異性戀沙豬什麼事啊？啊連個草案都沒有的東西是要投三小喔？... | R | |
| 14132 | 2015-08-03 10:14:57 | 黑桐喵 | 原來別人要不要結婚須要所有人一起投票決定。既然都說了「政府對全體人民的人權有履行義務且不應以... | R | |
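
The length filter and the labeled/unlabeled split can be sketched without pandas. This is a minimal illustration in Python 3, where `len` on a `str` already counts characters, so no `decode` call is needed. The rows and the `MIN_LENGTH` name are invented for the example; only the field names and the threshold of 20 come from the notebook.

```python
# Toy rows standing in for the CSV; the values are invented for illustration.
rows = [
    {"id": 1, "content": "short", "ORID": None},
    {"id": 2, "content": "x" * 25, "ORID": "R"},
    {"id": 3, "content": "y" * 30, "ORID": None},
]

MIN_LENGTH = 20  # hypothetical name; same threshold as the notebook's filter


def long_enough(row):
    # Count characters, not bytes, so multi-byte text is measured correctly.
    return len(row["content"]) > MIN_LENGTH


good = [r for r in rows if long_enough(r)]
labeled = [r for r in good if r["ORID"] is not None]
unlabeled = [r for r in good if r["ORID"] is None]

print(len(good), len(labeled), len(unlabeled))
```

Only the labeled subset is used for training; the unlabeled subset is what the model is later asked to classify.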


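Build a dictionary of every phrase that jieba segments out of the messages, then use fastText to look up a word vector for each phrase and load the result.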

In [4]:
def write_dictionary(messages):
    contents = [ jieba.lcut(c) for c in messages.content ]
    all_phrases = set([ ph for c in contents for ph in c ])
    with open(os.path.join(path, "dictionary.txt"), "w") as fh:
        for ph in all_phrases:
            fh.write(ph.encode("utf-8") + "\n")
            
def read_dictionary():
    dictionary = pd.read_csv(os.path.join(path, "dictionary.vec"), 
                       delim_whitespace=True, engine="python", header=None, index_col=0)
    return dictionary

if not os.path.exists(os.path.join(path, "dictionary.vec")):
    write_dictionary(all_messages)
    !cd data/join; ../../../bin/fasttext print-word-vectors models/wiki.zh.bin < dictionary.txt > dictionary.vec
dictionary = read_dictionary()
dictionary.shape

Building prefix dict from the default dictionary ...
Loading model from cache /var/folders/sy/q12w5xyn4lngqxh_j63vwr4h0000gn/T/jieba.cache
Loading model cost 0.453 seconds.
Prefix dict has been built succesfully.


(44956, 300)
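
As a rough sketch of what `read_dictionary` relies on, the snippet below parses text in the layout the notebook assumes for `dictionary.vec`: one token per line followed by its whitespace-separated vector components. The function name `parse_word_vectors` and the three-dimensional sample vectors are invented for illustration; the real file has 300 components per phrase.

```python
def parse_word_vectors(lines):
    """Parse lines of the form '<token> <v1> <v2> ... <vn>' into a dict."""
    vectors = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(v) for v in parts[1:]]
    return vectors


# Invented sample data in the assumed layout.
sample = ["同性 0.1 -0.2 0.3", "婚姻 0.0 0.5 -0.5", ""]
vectors = parse_word_vectors(sample)
print(len(vectors), len(vectors["同性"]))
```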

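Build an index that maps each phrase to its row number in the embedding matrix, and check that looking a phrase up by position gives the same vector as looking it up by label.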

In [5]:
dict_index = { ph.decode("utf-8"): i for i, ph in enumerate(dictionary.index) }
dict_index[u"同性"], dictionary.iloc[dict_index[u"同性"]] == dictionary.loc["同性"]

(35102, 1      True
 2      True
 3      True
 4      True
 5      True
 6      True
 7      True
 8      True
 9      True
 10     True
 11     True
 12     True
 13     True
 14     True
 15     True
 16     True
 17     True
 18     True
 19     True
 20     True
 21     True
 22     True
 23     True
 24     True
 25     True
 26     True
 27     True
 28     True
 29     True
 30     True
        ... 
 271    True
 272    True
 273    True
 274    True
 275    True
 276    True
 277    True
 278    True
 279    True
 280    True
 281    True
 282    True
 283    True
 284    True
 285    True
 286    True
 287    True
 288    True
 289    True
 290    True
 291    True
 292    True
 293    True
 294    True
 295    True
 296    True
 297    True
 298    True
 299    True
 300    True
 Name: 同性, Length: 300, dtype: bool)
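
The index is just a phrase-to-row-number mapping. The sketch below uses a three-phrase stand-in for `dictionary.index` and adds a hypothetical `lookup` helper that returns a default for unknown phrases; the notebook itself indexes `dict_index` directly, which raises `KeyError` for a phrase that is not in the dictionary.

```python
phrases = ["的", "同性", "婚姻"]  # invented stand-in for dictionary.index
dict_index = {ph: i for i, ph in enumerate(phrases)}


def lookup(phrase, default=None):
    # Return the row number, or `default` when the phrase is unknown.
    return dict_index.get(phrase, default)


row = lookup("同性")
print(row, phrases[row] == "同性", lookup("unknown"))
```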

In [6]:
input_length = 50
batch_size = 64

In [7]:
def create_embeddings(dictionary, input_length=100):
    phrases, latents = dictionary.shape
    embedding = Embedding(phrases, latents, input_length=input_length, weights=[dictionary])
    return embedding
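
Conceptually, an embedding layer's forward pass is a row lookup: each index in the input sequence is replaced by the matching row of the weight matrix. The following sketch shows this with a tiny invented matrix; `embed` is a hypothetical helper, not a Keras function. In the notebook the result for one message would have shape `(50, 300)`.

```python
# Tiny invented weight matrix: 3 phrases, 2 latent dimensions.
embedding_matrix = [
    [0.0, 0.0],  # row 0
    [0.1, 0.2],  # row 1
    [0.3, 0.4],  # row 2
]


def embed(indices, matrix):
    # Replace every index with the corresponding row of the matrix.
    return [matrix[i] for i in indices]


sequence = [2, 1, 0, 0]
embedded = embed(sequence, embedding_matrix)
shape = (len(embedded), len(embedded[0]))
print(shape)
```

One thing to keep in mind: the notebook pads short messages with index 0, which is also the row of a real phrase, so padding positions receive that phrase's vector.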

Randomly split the labeled messages into a training set (about 90%) and a validation set (about 10%).

In [8]:
select = np.random.random(len(messages)) < 0.9
train = messages[select]
valid = messages[~select]
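
Because the mask is drawn at random, the split sizes vary from run to run. A seeded variant makes the split reproducible. The sketch below uses only the standard library; `split_mask` and its `seed` parameter are assumptions for illustration and do not appear in the notebook.

```python
import random


def split_mask(n, train_fraction=0.9, seed=0):
    # One independent draw per item; True means "goes to training".
    rng = random.Random(seed)
    return [rng.random() < train_fraction for _ in range(n)]


items = list(range(63))  # stand-in for the 63 labeled messages
mask = split_mask(len(items))
train_items = [x for x, keep in zip(items, mask) if keep]
valid_items = [x for x, keep in zip(items, mask) if not keep]
print(len(train_items), len(valid_items))
```

With NumPy, the same effect could be had by seeding the random generator before drawing the mask.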

In [9]:
print("Training data: {count}".format(count=len(train)))
train.head()

Training data: 57


| id | createDate | authorName | content | ORID | comments |
|---|---|---|---|---|---|
| 14134 | 2015-08-03 10:05:26 | 蛍一 森里 | 與其以立法的方式來保障不如用修法的方式來保障不是比較方便一些? | I | |
| 14133 | 2015-08-03 10:08:00 | 楊剛 | 投你個花開富貴啦我要不要結婚關順性別異性戀沙豬什麼事啊？啊連個草案都沒有的東西是要投三小喔？... | R | |
| 14132 | 2015-08-03 10:14:57 | 黑桐喵 | 原來別人要不要結婚須要所有人一起投票決定。既然都說了「政府對全體人民的人權有履行義務且不應以... | R | |
| 14131 | 2015-08-03 10:21:00 | Janus Chang | 讓更多人可以結婚到底有甚麼問題???? 一堆拿幾千年前教條來干預現代人生活的人真的是社會的敗類。 | R | |
| 14130 | 2015-08-03 13:06:12 | Thomas Chen | 支持婚姻平權.一步到位.台灣並不以宗教立國.人民有人民的格調.請政府跟上我们的腳步.如果我们... | D | |


In [10]:
print("Validation data: {count}".format(count=len(valid)))
valid.head()

Validation data: 6


| id | createDate | authorName | content | ORID | comments |
|---|---|---|---|---|---|
| 14136 | 2015-08-03 8:57:21 | Shaffer Lin | 「政府對全體人民的人權有履行義務且不應以公眾之意見作為履行的條件」那現在在投什麼？自打嘴巴？... | R | |
| 14135 | 2015-08-03 9:30:52 | Oliver Lin | 贊成歸贊成但更贊成就直接修民法就好了不用疊床架屋我要的沒有比較特別就是現在一堆人在結的那個婚姻 | D | |
| 14126 | 2015-08-03 15:46:03 | 馮耿 | １。讚成同性也有婚姻的基本權利,但無須另設新法(改民法的用語定義不就好了@@?)。２。讚成伴... | D | |
| 14122 | 2015-08-03 17:15:52 | 小白 | 命題本身就充滿歧視！我贊成修改民法，返還長期剝奪同性戀者組成配偶的權益！ | D | |
| 35 | 2015-10-31 14:27:20 | 宮紅雪 | XDD我覺得她長的滿可愛 就是笑的有點假最後一張好酷 | R | |


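Convert each message into a fixed-length sequence of dictionary indices (truncated or zero-padded to `input_length`), and each ORID label into a one-hot vector. The embedding layer turns the indices into vectors later, inside the model.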

In [11]:
def get_data(messages):
    jieba_cut = np.frompyfunc(lambda x: [ ph for ph in jieba.lcut(x.decode("utf-8")) if ph != u" " ][:input_length], 1, 1)
    word_embed = np.frompyfunc(lambda x: [ dict_index[ph] for ph in x ], 1, 1)
    fill_to_length = np.frompyfunc(lambda x: x + [0] * (input_length - len(x)), 1, 1)
    return np.stack(fill_to_length(word_embed(jieba_cut(messages.content.values))), axis=0)

answers = { "O": [1., 0., 0., 0.], "R": [0., 1., 0., 0.], "I": [0., 0., 1., 0.], "D": [0., 0., 0., 1.] }

def get_answer(messages):
    return np.array([ answers[x] for x in messages.ORID ])

train = (get_data(train), get_answer(train))
valid = (get_data(valid), get_answer(valid))

In [12]:
train[0][10], train[1][10], valid[0][0], valid[1][0]

(array([39403, 35102,  5442, 17112, 20923, 42769, 44736, 27524, 35102,
        10186, 18667, 10186,  5442, 17112,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0]),
 array([ 0.,  0.,  0.,  1.]),
 array([36935, 36621, 21149, 12023, 39845,  4414, 39925, 17642, 18025,
         6176, 18552, 36065, 36659, 11449, 26840, 18025,  4414, 14437,
        11592, 26247, 43183, 43183, 19029, 32546, 13995, 44738, 32546,
        36362, 43962, 27354, 44736, 31155,  5909, 33185, 34499,  4414,
        32548, 29445, 24924, 10395, 37848,  2970,  2757, 27354, 18432,
        29616, 34147, 38528, 27241,  6885]),
 array([ 0.,  1.,  0.,  0.]))
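
The two encodings can be written out in plain Python. This is a simplified sketch of the same idea as `get_data` and `get_answer`, without jieba or NumPy; `encode`, `one_hot`, and the toy index are hypothetical names and data. The label order O, R, I, D matches the `answers` dictionary.

```python
def encode(tokens, index, length, pad=0):
    # Map tokens to row numbers, truncate to `length`, then right-pad.
    ids = [index[t] for t in tokens][:length]
    return ids + [pad] * (length - len(ids))


LABELS = ["O", "R", "I", "D"]


def one_hot(label):
    # 1.0 at the label's position, 0.0 elsewhere.
    return [1.0 if name == label else 0.0 for name in LABELS]


toy_index = {"a": 5, "b": 7}
print(encode(["a", "b"], toy_index, 4))
print(encode(["a", "b", "a"], toy_index, 2))
print(one_hot("D"))
```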

## Single hidden layer model

In [13]:
def linear_model():
    model = Sequential()
    model.add(create_embeddings(dictionary, input_length))
    model.add(Flatten())
    model.add(Dense(512, activation="relu"))
    model.add(Dropout(0.6))
    model.add(Dense(4, activation="softmax"))
    return model

linear = linear_model()
linear.summary()

____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
embedding_1 (Embedding)          (None, 50, 300)       13486800    embedding_input_1[0][0]          
____________________________________________________________________________________________________
flatten_1 (Flatten)              (None, 15000)         0           embedding_1[0][0]                
____________________________________________________________________________________________________
dense_1 (Dense)                  (None, 512)           7680512     flatten_1[0][0]                  
____________________________________________________________________________________________________
dropout_1 (Dropout)              (None, 512)           0           dense_1[0][0]                    
___________________________________________________________________________________________
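
The parameter counts in the summary follow directly from the layer shapes. The arithmetic below reproduces them from `dictionary.shape`, `input_length`, and the layer sizes in `linear_model`. The variable names are chosen for the example; the count for the final softmax layer is computed from the model definition, since its row is cut off in the summary shown here.

```python
vocab_size, latents = 44956, 300  # dictionary.shape
input_length = 50
hidden, classes = 512, 4

embedding_params = vocab_size * latents        # one vector per phrase
flattened = input_length * latents             # size of the Flatten output
dense_1_params = flattened * hidden + hidden   # weights plus biases
dense_2_params = hidden * classes + classes    # computed, not shown in the summary

print(embedding_params, flattened, dense_1_params, dense_2_params)
```

With only 57 training messages, a model of this size has far more parameters than examples, so the validation numbers should be read with caution.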

In [14]:
linear.compile("adam", loss="categorical_crossentropy", metrics=["accuracy"])
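
To make the loss concrete, here is a small pure-Python sketch of softmax followed by categorical cross-entropy for a single example. It illustrates the formulas only and is not Keras's implementation; the logits are invented and the clipping constant `eps` is an assumption.

```python
import math


def softmax(logits):
    # Subtract the maximum for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_crossentropy(target, probs, eps=1e-7):
    # Negative log-probability of the true class; clip to avoid log(0).
    return -sum(t * math.log(max(p, eps)) for t, p in zip(target, probs))


probs = softmax([0.0, 2.0, 1.0, -1.0])  # invented logits
loss = categorical_crossentropy([0.0, 1.0, 0.0, 0.0], probs)
print(probs, loss)
```

With a one-hot target, the loss reduces to the negative log of the probability assigned to the true class.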

In [15]:
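from keras import backend as K

def train_linear(lr=None, epoch=1):
    if lr is not None:
        # Update the existing learning-rate variable in place; rebinding
        # `linear.optimizer.lr` to a float does not reach a training function that is already built.
        K.set_value(linear.optimizer.lr, lr)
    linear.fit(train[0], train[1], nb_epoch=epoch, validation_data=valid, batch_size=batch_size)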
    
train_linear()

Train on 57 samples, validate on 6 samples
Epoch 1/1


In [16]:
train_linear(lr=0.1, epoch=2)
train_linear(lr=0.01, epoch=4)
train_linear(lr=0.001, epoch=4)

Train on 57 samples, validate on 6 samples
Epoch 1/2
Epoch 2/2
Train on 57 samples, validate on 6 samples
Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4
Train on 57 samples, validate on 6 samples
Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


## Evaluation

In [17]:
model = linear

In [18]:
unlabeled = unlabeled_only(all_messages)
test_data = get_data(unlabeled)

In [19]:
pred = model.predict(test_data, batch_size=batch_size)

In [20]:
unlabeled.head(10)

| id | createDate | authorName | content | ORID | comments |
|---|---|---|---|---|---|
| 14138 | 2015-08-03 8:12:21 | 小K | 那麼亞洲國家的現況呢？以新加坡或日本這兩個已開發的國家來說，是否有進行相關政策？ | | |
| 14110 | 2015-08-04 2:41:32 | Oliver Lin | 你搞錯了，不論性向有錢人是最不受制度限制的一群人如果你擔心這點，請採用共產制度根本上讓大家均貧就好 | | |
| 14103 | 2015-08-04 3:43:13 | in.j li | 你(妳)知道性取向是天生的,那試問, 如果自然法則是同性交配才能繁殖下一代, 那你(妳)可以... | | |
| 14097 | 2015-08-04 4:29:26 | black756133 | 無神論者無法相信任何沒有存在證據的神說的話，跟神溝通也只能透過古代人編寫的各種版本同人作品自... | | |
| 14096 | 2015-08-04 4:33:21 | RED | 冰山一角http://news.ltn.com.tw/news/wo... | | |
| 14088 | 2015-08-04 5:21:16 | RED | 網路很多新聞，平時你不看的那種。https://chinesemanif.wordpress... | | |
| 14086 | 2015-08-04 5:22:52 | 炎暴龍 | 不支持不適用於異性戀的伴侶法！此議題的命題方式已嚴重歧視各性向的族群我支持可適用於全民的伴侶... | | |
| 14085 | 2015-08-04 5:27:30 | I-Ling Yeh | 婚姻和伴侶關係應是獨立兩種關係，兩者都不應因為對象的性別受限。認可同性婚姻應直接在民法明文說... | | |
| 14083 | 2015-08-04 5:33:41 | Jun-Yuan Guo | 「隔離而平等不是真平等」我支持修改民法，還給同志族群與異性戀者相同的婚姻權和親權；並新增相關... | | |
| 14081 | 2015-08-04 5:51:28 | Grace Guan | 如果個人權力可以凌駕於國家整體考量之上，社會豈不亂乎？跟紅衛兵時代所行有何差別？革命烈士在天... | | |


In [21]:
pred[:10]

array([[  2.33472145e-38,   3.96522671e-01,   6.03477359e-01,
          4.94881489e-37],
       [  0.00000000e+00,   9.99618769e-01,   3.81167250e-04,
          0.00000000e+00],
       [  1.31799101e-32,   9.99994159e-01,   5.82005259e-06,
          3.91706221e-32],
       [  3.62049274e-31,   9.99779761e-01,   2.20267350e-04,
          3.46005305e-31],
       [  9.84984932e-37,   9.08753090e-03,   9.90912437e-01,
          7.28294228e-36],
       [  8.88519972e-38,   1.80057094e-01,   8.19942892e-01,
          9.98887517e-37],
       [  1.47125918e-36,   9.99647141e-01,   3.52818635e-04,
          1.06983474e-35],
       [  6.70091965e-32,   9.95061934e-01,   4.93810000e-03,
          9.72486100e-31],
       [  6.39575915e-35,   9.98556316e-01,   1.44370634e-03,
          1.07126428e-33],
       [  1.16256888e-29,   9.90411043e-01,   9.58895311e-03,
          2.20837434e-29]], dtype=float32)
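
Each row of `pred` is a probability distribution over the four classes in the column order O, R, I, D used by `answers`. A minimal sketch of turning rows into labels by taking the most probable class is shown below; the three rows are copied from the output above, and `to_label` is a hypothetical helper.

```python
LABELS = ["O", "R", "I", "D"]  # same column order as the one-hot encoding

# Rows 0, 1 and 4 of the prediction output.
pred_rows = [
    [2.33472145e-38, 3.96522671e-01, 6.03477359e-01, 4.94881489e-37],
    [0.0, 9.99618769e-01, 3.81167250e-04, 0.0],
    [9.84984932e-37, 9.08753090e-03, 9.90912437e-01, 7.28294228e-36],
]


def to_label(row):
    # Pick the index of the largest probability and report its label.
    best = max(range(len(row)), key=lambda i: row[i])
    return LABELS[best], row[best]


labels = [to_label(r)[0] for r in pred_rows]
print(labels)
```

In the ten rows shown, nearly all of the probability mass falls on R or I, while O and D receive values close to zero.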