# Join Platform Public Discussion (眾開講)

## Setup

- [x] Read messages from CSV.
- [x] Segment message content with Jieba.
- [ ] Build embeddings using fastText word vectors pre-trained on the Wikipedia corpus.
- [ ] Build training and validation datasets.

In [585]:
path = "data/join"
topic = "立法方式保障"
# topic = "同性伴侣法"
# topic = "同性婚姻法"

In [678]:
from __future__ import division, print_function
import pandas as pd, numpy as np
import jieba
from keras.models import Sequential
from keras.layers import Embedding, Dense, Flatten, Dropout
import os, math

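Read the messages from CSV, keeping only those whose content is longer than 20 characters. The filtered set is cached in a `-good.csv` file, which also holds the ORID labels.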

In [648]:
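def get_messages_from_orig(topic):
    messages = pd.read_csv(os.path.join(path, topic + ".csv"), index_col=0)
    # Keep only messages whose content is longer than 20 characters (not bytes).
    lengths = messages.content.astype('str').map(lambda x: len(x.decode('utf-8')))
    messages = messages[lengths > 20]
    # Cache the filtered messages so they can be labeled and reloaded later.
    messages.to_csv(os.path.join(path, topic + "-good.csv"))
    return messages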

def get_labeled_messages(topic):
    return pd.read_csv(os.path.join(path, topic + "-good.csv"), index_col=0)

def labeled_only(messages):
    return messages[messages.ORID.notnull()]

def unlabeled_only(messages):
    return messages[messages.ORID.isnull()]

good_csv = os.path.join(path, topic + "-good.csv")
all_messages = get_labeled_messages(topic) if os.path.exists(good_csv) else get_messages_from_orig(topic)
print("Total messages: {count}".format(count=len(all_messages)))
messages = labeled_only(all_messages)
print("Labeled messages: {count}".format(count=len(messages)))
messages.head()

Total messages: 10215
Labeled messages: 58


id,createDate,authorName,content,ORID
6,2015-10-31 15:53:48,黃道明,在台灣已經有同志收養小孩了，你的資訊是多落伍～,O
9,2015-10-31 15:52:09,黃道明,真奇怪，明明是歐美一個個陸續通過同婚，你是眼睛瞎了嗎？,O
10,2015-10-31 15:51:24,高守謙,我來回答吧四處約砲 無固定性伴侶 就算是性解放的一種,I
14,2015-10-31 15:38:13,了了,援交 毒品 賭博 全部比同性戀禍害更深 那麼怕的話別生了地球很危險的,I
16,2015-10-31 15:34:41,路過的，呵呵。,同性戀領養現在是合法的喔，因為我們未婚都是單身者，現在單身者是可以領養小朋友的。,O
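
The length filter counts characters rather than bytes, which matters for Chinese text because each of these characters occupies three bytes in UTF-8. The following is a minimal Python 3 sketch of the same idea (the notebook itself runs on Python 2); the sample strings and the helper name `keep_long_messages` are assumptions made for illustration.

```python
# Sketch (Python 3): filter messages by character count, not byte count.
# The sample strings and the helper name are illustrative assumptions.

MIN_CHARS = 20  # same threshold as the filter above


def keep_long_messages(contents, min_chars=MIN_CHARS):
    """Return only the strings with more than `min_chars` characters."""
    return [c for c in contents if len(c) > min_chars]


samples = [
    "同性婚姻",    # 4 characters, 12 UTF-8 bytes: dropped
    "a" * 21,      # 21 characters: kept
    "這是一則超過二十個字元的留言，用來示範長度篩選的結果。",  # 27 characters: kept
]

kept = keep_long_messages(samples)
print(len(kept))
print(len(samples[0]), len(samples[0].encode("utf-8")))
```

In Python 3, `len` on a `str` already counts characters, so the explicit `decode("utf-8")` step is only needed when working with byte strings.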


Build dictionary of phrases and load word embeddings.

In [588]:
def write_dictionary(messages):
    contents = [ jieba.lcut(c) for c in messages.content ]
    all_phrases = set([ ph for c in contents for ph in c ])
    with open(os.path.join(path, "dictionary.txt"), "w") as fh:
        for ph in all_phrases:
            fh.write(ph.encode("utf-8") + "\n")
            
def read_dictionary():
    dictionary = pd.read_csv(os.path.join(path, "dictionary.vec"), 
                       delim_whitespace=True, engine="python", header=None, index_col=0)
    return dictionary

if not os.path.exists(os.path.join(path, "dictionary.vec")):
    write_dictionary(all_messages)
    !cd data/join; ../../../bin/fasttext print-word-vectors models/wiki.zh.bin < dictionary.txt > dictionary.vec
dictionary = read_dictionary()
dictionary.shape

(44956, 300)
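
The `dictionary.vec` file produced by `print-word-vectors` holds one phrase per line followed by its vector components, separated by whitespace. As a rough sketch of what `read_dictionary` does with that layout, the Python 3 snippet below parses such lines by hand. The sample text, the three-dimensional vectors, and the helper name `parse_word_vectors` are assumptions for illustration; the sketch also assumes that no phrase contains whitespace.

```python
# Sketch (Python 3): parse "phrase v1 v2 ... vN" lines into a lookup table.
# The sample text and 3-dimensional vectors are made up for illustration;
# the real file has 300 values per phrase.

def parse_word_vectors(lines):
    """Map each phrase to its list of floats, skipping blank lines."""
    vectors = {}
    for line in lines:
        parts = line.split()
        if not parts:
            continue
        phrase, values = parts[0], [float(v) for v in parts[1:]]
        vectors[phrase] = values
    return vectors


sample = """同性 0.1 -0.2 0.3
婚姻 0.0 0.5 -1.0
"""

vectors = parse_word_vectors(sample.splitlines())
n_phrases = len(vectors)
n_dims = len(next(iter(vectors.values())))
print(n_phrases, n_dims)
```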

Build an index that maps each phrase to its row number in the embedding matrix, and check that positional and label-based lookups return the same vector.

In [601]:
dict_index = { ph.decode("utf-8"): i for i, ph in enumerate(dictionary.index) }
dict_index[u"同性"], dictionary.iloc[dict_index[u"同性"]] == dictionary.loc["同性"]

(35096, 1      True
 2      True
 3      True
 4      True
 5      True
 6      True
 7      True
 8      True
 9      True
 10     True
 11     True
 12     True
 13     True
 14     True
 15     True
 16     True
 17     True
 18     True
 19     True
 20     True
 21     True
 22     True
 23     True
 24     True
 25     True
 26     True
 27     True
 28     True
 29     True
 30     True
        ... 
 271    True
 272    True
 273    True
 274    True
 275    True
 276    True
 277    True
 278    True
 279    True
 280    True
 281    True
 282    True
 283    True
 284    True
 285    True
 286    True
 287    True
 288    True
 289    True
 290    True
 291    True
 292    True
 293    True
 294    True
 295    True
 296    True
 297    True
 298    True
 299    True
 300    True
 Name: 同性, Length: 300, dtype: bool)
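
The index is a plain phrase-to-row mapping, so looking up a phrase that is not in the dictionary raises a `KeyError`. That cannot happen here as long as the dictionary is built from all messages, but a fallback may be useful if new text is scored later. The sketch below (Python 3) shows one possible approach; the tiny vocabulary and the choice of row 0 as the fallback are assumptions, not the notebook's behaviour.

```python
# Sketch (Python 3): phrase -> row index, with a fallback for unseen phrases.
# `vocab` is a made-up stand-in for dictionary.index; using row 0 as the
# fallback is an assumption for illustration, not the notebook's behaviour.

vocab = ["的", "同性", "婚姻", "法"]
phrase_to_row = {phrase: row for row, phrase in enumerate(vocab)}


def lookup(phrases, index, unknown_row=0):
    """Convert phrases to row numbers, mapping unseen phrases to `unknown_row`."""
    return [index.get(ph, unknown_row) for ph in phrases]


rows = lookup(["同性", "婚姻", "平權"], phrase_to_row)
print(rows)
```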

In [591]:
input_length = 50
batch_size = 64

In [592]:
def create_embeddings(dictionary, input_length=100):
    # One embedding row per phrase, initialised from the pre-trained fastText vectors.
    phrases, latents = dictionary.shape
    embedding = Embedding(phrases, latents, input_length=input_length, weights=[dictionary.values])
    return embedding

Separate the messages into training and validation datasets.

In [665]:
select = np.random.random(len(messages)) < 0.9
train = messages[select]
valid = messages[~select]

In [666]:
print("Training data: {count}".format(count=len(train)))
train.head()

Training data: 51


id,createDate,authorName,content,ORID
6,2015-10-31 15:53:48,黃道明,在台灣已經有同志收養小孩了，你的資訊是多落伍～,O
9,2015-10-31 15:52:09,黃道明,真奇怪，明明是歐美一個個陸續通過同婚，你是眼睛瞎了嗎？,O
10,2015-10-31 15:51:24,高守謙,我來回答吧四處約砲 無固定性伴侶 就算是性解放的一種,I
14,2015-10-31 15:38:13,了了,援交 毒品 賭博 全部比同性戀禍害更深 那麼怕的話別生了地球很危險的,I
16,2015-10-31 15:34:41,路過的，呵呵。,同性戀領養現在是合法的喔，因為我們未婚都是單身者，現在單身者是可以領養小朋友的。,O


In [667]:
print("Validation data: {count}".format(count=len(valid)))
valid.head()

Validation data: 7


id,createDate,authorName,content,ORID
30,2015-10-31 14:36:05,模糊,"我只是想有些人知道,一夫一妻的制度不見得比同性婚姻好,台灣像我這種家庭一堆,有幾個人真的有去...",R
59,2015-10-31 13:53:07,Ato Otto,最早性解放性自主是女權運動帶領的好嗎？這麼偉大的credit你就直接安給同志運動！妳一定是同...,O
14091,2015-08-04 4:56:35,杜里昂,"同性戀要什麼權利還要你們""給""，你們是哪根蔥啊？人權是最基本的保障好嗎？但至少是第一步…… ...",R
14108,2015-08-04 3:21:45,PIX,不必講什麼大道理，但是自然法則中人類的繁衍需要異性！這一代的人們享受上一代祖先們所創造的一切...,I
14113,2015-08-04 2:20:23,Lih-woei Chen,我不喜歡這個標題，為什麼人權需要由其他人來評論是否可以給予...這應該是政府要保障的基本需求吧。,R
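
Because the mask is drawn from an unseeded random generator, the sizes of the two sets (51 and 7 in this run) change every time the cell is executed. If a repeatable split is wanted, seeding the generator is one option. The Python 3 sketch below illustrates this with the standard library; the seed value, the helper name `split_indices`, and the use of integer positions in place of messages are assumptions.

```python
# Sketch (Python 3): reproducible ~90/10 split using a seeded generator.
# The seed value and the integer stand-ins for messages are assumptions.
import random


def split_indices(n_items, train_fraction=0.9, seed=42):
    """Return (train, valid) index lists; the same seed gives the same split."""
    rng = random.Random(seed)
    train, valid = [], []
    for i in range(n_items):
        (train if rng.random() < train_fraction else valid).append(i)
    return train, valid


train_idx, valid_idx = split_indices(58)
again_train, again_valid = split_indices(58)
print(len(train_idx), len(valid_idx))
print(train_idx == again_train)
```

With NumPy, the equivalent would be to draw the mask from a seeded generator instead of calling `np.random.random` directly.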


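Convert each message into a fixed-length sequence of dictionary indices (truncated or zero-padded to `input_length`), and each ORID label into a one-hot vector.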

In [668]:
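def get_data(messages):
    # Segment each message with Jieba, drop bare spaces, and keep at most input_length phrases.
    jieba_cut = np.frompyfunc(lambda x: [ ph for ph in jieba.lcut(x.decode("utf-8")) if ph != u" " ][:input_length], 1, 1)
    # Replace each phrase with its row number in the dictionary.
    word_embed = np.frompyfunc(lambda x: [ dict_index[ph] for ph in x ], 1, 1)
    # Right-pad short sequences with zeros up to input_length.
    fill_to_length = np.frompyfunc(lambda x: x + [0] * (input_length - len(x)), 1, 1)
    return np.stack(fill_to_length(word_embed(jieba_cut(messages.content.values))), axis=0)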

answers = { "O": [1., 0., 0., 0.], "R": [0., 1., 0., 0.], "I": [0., 0., 1., 0.], "D": [0., 0., 0., 1.] }

def get_answer(messages):
    return np.array([ answers[x] for x in messages.ORID ])

train = (get_data(train), get_answer(train))
valid = (get_data(valid), get_answer(valid))

In [669]:
train[0][10], train[1][10], valid[0][0], valid[1][0]

(array([17293, 22897, 38608,  3570, 17341,  6481, 20914, 40590,  2743,
         4100,  6481, 28378, 19941, 10141, 10141, 10141, 10141, 10141,
         3570, 19941, 17363, 25141, 32542,  3069, 30973,  4387,  6481,
        20303,   679, 10141,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0]),
 array([ 0.,  1.,  0.,  0.]),
 array([ 8155, 24372, 15045, 25237, 24918, 15686, 39736, 27440,  4387,
        42074,  5503, 25216,  7705, 35096,  5416, 11512, 39736, 11227,
         6374,  8155,  3159, 42857, 27042, 39736, 17636, 13156, 24918,
         3230, 17636, 37720, 32255, 28114, 43195,  8155,  4387, 23531,
        39736, 44663, 34642,  9379, 39736, 19941, 42166,  4387,  5416,
         2960, 34183,  7799, 39736, 18009]),
 array([ 0.,  1.,  0.,  0.]))
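
The two conversions can be seen in isolation in the Python 3 sketch below: sequences are truncated or zero-padded to a fixed length, and labels become one-hot vectors in the order O, R, I, D. The short sequences, `length=5`, and the helper names are assumptions for illustration. Note that row 0 is also a real phrase in the dictionary, so with this scheme padding cannot be told apart from that phrase.

```python
# Sketch (Python 3): truncate or zero-pad index sequences and one-hot encode labels.
# The short sequences and length=5 are illustrative assumptions.

LABELS = ["O", "R", "I", "D"]  # same order as the `answers` mapping above


def pad_or_truncate(seq, length, pad_value=0):
    """Cut `seq` to `length` items, or right-pad it with `pad_value`."""
    seq = list(seq)[:length]
    return seq + [pad_value] * (length - len(seq))


def one_hot(label, labels=LABELS):
    """Return a list with 1.0 at the label's position and 0.0 elsewhere."""
    return [1.0 if name == label else 0.0 for name in labels]


short = pad_or_truncate([7, 3, 9], length=5)
long_ = pad_or_truncate([1, 2, 3, 4, 5, 6, 7], length=5)
print(short, long_, one_hot("R"))
```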

## Single hidden layer model

In [687]:
def linear_model():
    model = Sequential()
    model.add(create_embeddings(dictionary, input_length))
    model.add(Flatten())
    model.add(Dense(512, activation="relu"))
    model.add(Dropout(0.6))
    model.add(Dense(4, activation="softmax"))
    return model

linear = linear_model()
linear.summary()

____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
embedding_20 (Embedding)         (None, 50, 300)       13486800    embedding_input_20[0][0]         
____________________________________________________________________________________________________
flatten_7 (Flatten)              (None, 15000)         0           embedding_20[0][0]               
____________________________________________________________________________________________________
dense_32 (Dense)                 (None, 512)           7680512     flatten_7[0][0]                  
____________________________________________________________________________________________________
dropout_3 (Dropout)              (None, 512)           0           dense_32[0][0]                   
___________________________________________________________________________________________
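
The parameter counts in the summary follow directly from the layer shapes. The Python 3 sketch below reproduces them as plain arithmetic; the count for the final `Dense(4)` layer is computed from the model definition rather than read from the summary.

```python
# Sketch (Python 3): reproduce the parameter counts from the layer shapes.
vocab_size, embed_dim = 44956, 300   # dictionary.shape
input_length = 50
hidden_units, n_classes = 512, 4

embedding_params = vocab_size * embed_dim                # one vector per phrase
flattened = input_length * embed_dim                     # size of the Flatten output
hidden_params = flattened * hidden_units + hidden_units  # weights + biases
output_params = hidden_units * n_classes + n_classes     # weights + biases

print(embedding_params)  # 13486800
print(flattened)         # 15000
print(hidden_params)     # 7680512
print(output_params)     # 2052
```

Almost all parameters sit in the embedding and the first dense layer, which is a lot of capacity for 51 training messages.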

In [688]:
linear.compile("adam", loss="categorical_crossentropy", metrics=["accuracy"])

In [689]:
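from keras import backend as K

def train_linear(lr=None, epoch=1):
    if lr is not None:
        # Update the learning-rate variable in place so an already-built training function sees it.
        K.set_value(linear.optimizer.lr, lr)
    linear.fit(train[0], train[1], nb_epoch=epoch, validation_data=valid, batch_size=batch_size)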
    
train_linear()

Train on 51 samples, validate on 7 samples
Epoch 1/1


In [690]:
train_linear(lr=0.1, epoch=2)
train_linear(lr=0.01, epoch=4)
train_linear(lr=0.001, epoch=4)

Train on 51 samples, validate on 7 samples
Epoch 1/2
Epoch 2/2
Train on 51 samples, validate on 7 samples
Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4
Train on 51 samples, validate on 7 samples
Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4
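
With 51 training samples and a batch size of 64, every epoch consists of a single batch, so the whole schedule amounts to only a handful of gradient updates. The Python 3 sketch below tallies this; the `schedule` list simply restates the calls above, with `None` standing for the optimizer's default learning rate.

```python
# Sketch (Python 3): tally the training schedule used above.
import math

n_train, batch_size = 51, 64
# (learning rate, epochs); None means the optimizer's default learning rate.
schedule = [(None, 1), (0.1, 2), (0.01, 4), (0.001, 4)]

updates_per_epoch = math.ceil(n_train / batch_size)
total_epochs = sum(epochs for _, epochs in schedule)
total_updates = updates_per_epoch * total_epochs
print(updates_per_epoch, total_epochs, total_updates)
```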


## Evaluation

In [674]:
model = linear

In [694]:
unlabeled = unlabeled_only(all_messages)
test_data = get_data(unlabeled)

In [676]:
pred = model.predict(test_data, batch_size=batch_size)

In [695]:
unlabeled.head(10)

id,createDate,authorName,content,ORID
2,2015-10-31 15:55:50,大少爺,離婚率高低與否沒辦法當作拒絕同婚的理由同性戀離婚率肯定比異性戀低，因為對象非常難找，一但找到...,
4,2015-10-31 15:55:20,Jasi,異性戀比同性戀更多，強姦案就已經特別多了 也是四處約砲 無固定性伴侶 就算是性解放的一種,
5,2015-10-31 15:54:52,MT Lin,笑死人，你這樣的想法就不自私嗎？你又有多少數據可以證明，對下一代、或是其他人的影響是什麼？如...,
11,2015-10-31 15:51:17,宮紅雪,異性戀有資格離婚為甚麼同性戀沒資格離婚?你還真是可笑,
12,2015-10-31 15:49:02,就像異性婚姻離婚的人，換了對象結婚，還是以,就像異性婚姻離婚的人，換了對象結婚，還是以離婚收場的人比比皆是!! .只是因為目前還沒有現成...,
13,2015-10-31 15:42:19,可笑,奇怪了離婚對下一代沒影響嗎？妳從哪看到單親家庭小孩沒影響的數據?如一樣影響***嘛妳只反對同...,
18,2015-10-31 15:32:16,林聖曜,那麼我這個下一代也想為了我的下一代自私一點呢？,
21,2015-10-31 15:31:19,宮紅雪,喔你提不出證據我們也提不出證據我們差不多差不多而已何況我還只是個孩子喔~~,
22,2015-10-31 15:31:07,Jasi,恐同人士的回复和留言都是一模一樣的，答非所問。,
23,2015-10-31 15:29:29,May Lin,同性戀收養孩子，全世界合法通過的時間並不長，沒有足夠的數據可以證明，對下一代的影響是什麼？如...,


In [693]:
pred[:10]

array([[  0.00000000e+00,   3.64373234e-04,   9.99635637e-01,
          8.24223876e-26],
       [  0.00000000e+00,   2.23523541e-14,   1.00000000e+00,
          2.80774144e-35],
       [  0.00000000e+00,   7.14709889e-03,   9.92852926e-01,
          1.20382141e-28],
       [  0.00000000e+00,   8.42293269e-16,   1.00000000e+00,
          6.63537288e-36],
       [  0.00000000e+00,   1.97134502e-02,   9.80286539e-01,
          2.44966530e-25],
       [  0.00000000e+00,   5.26312396e-11,   1.00000000e+00,
          2.22592137e-30],
       [  0.00000000e+00,   5.24598777e-14,   1.00000000e+00,
          2.63017976e-38],
       [  0.00000000e+00,   1.01536751e-16,   1.00000000e+00,
          3.30662313e-36],
       [  0.00000000e+00,   1.10566260e-15,   1.00000000e+00,
          3.37982119e-38],
       [  1.40129846e-45,   3.63183558e-01,   6.36816442e-01,
          1.66385469e-23]], dtype=float32)
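
Each row of `pred` is a probability distribution over the four labels in the order O, R, I, D, so the predicted label is the position of the largest value. The Python 3 sketch below applies this to three rows copied from the output above (the first, second, and last); the helper name `to_label` is an assumption.

```python
# Sketch (Python 3): turn softmax rows into ORID labels by taking the argmax.
from collections import Counter

LABELS = ["O", "R", "I", "D"]  # same order as the `answers` mapping

# Rows 1, 2 and 10 of the predictions printed above.
pred_rows = [
    [0.00000000e+00, 3.64373234e-04, 9.99635637e-01, 8.24223876e-26],
    [0.00000000e+00, 2.23523541e-14, 1.00000000e+00, 2.80774144e-35],
    [1.40129846e-45, 3.63183558e-01, 6.36816442e-01, 1.66385469e-23],
]


def to_label(probabilities, labels=LABELS):
    """Return the label whose probability is largest."""
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return labels[best]


predicted = [to_label(row) for row in pred_rows]
counts = Counter(predicted)
print(predicted, counts)
```

In all ten rows shown above the largest probability is in the third column, so every one of these messages is classified as "I".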