# 1. Few Preprocessings
# 2. Model: FastText by Keras
## 2.1 Change Preprocessings:
- Do lower case 

In [2]:
import numpy as np
import pandas as pd

from collections import defaultdict

import keras
import keras.backend as K
from keras.layers import Dense, GlobalAveragePooling1D, Embedding
from keras.callbacks import EarlyStopping
from keras.models import Sequential
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import Tokenizer
from keras.utils import to_categorical

from sklearn.model_selection import train_test_split

np.random.seed(7)
# Seeding NumPy makes the generated random numbers reproducible

Using TensorFlow backend.


In [3]:
df = pd.read_csv('./keras_fasttext_data/train.zip')
a2c = {'EAP': 0, 'HPL' : 1, 'MWS' : 2}
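# Map the three author labels to the integers 0, 1, 2
y = np.array([a2c[a] for a in df.author])
# All elements of an np.array share one dtype (and therefore one size)
# At this point y.shape == (19579,): a 1-D array of 19579 integer labels

y = to_categorical(y)
# to_categorical converts a vector of integer class labels into a binary (one-hot) class matrix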

In [19]:
y.shape

(19579, 3)
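To make the shape change concrete, here is a minimal, dependency-free sketch of what one-hot encoding does to the integer labels. The helper `one_hot` and the toy label list are hypothetical stand-ins written for illustration; this is not the Keras implementation of `to_categorical`.

```python
# Illustrative stand-in for one-hot encoding; `one_hot` is a hypothetical helper,
# not the Keras `to_categorical` implementation.
a2c = {'EAP': 0, 'HPL': 1, 'MWS': 2}

def one_hot(labels, num_classes):
    """Return one row per label with 1.0 at the label's index and 0.0 elsewhere."""
    return [[1.0 if i == label else 0.0 for i in range(num_classes)]
            for label in labels]

authors = ['EAP', 'HPL', 'EAP', 'MWS']   # toy labels, not the real data
y_int = [a2c[a] for a in authors]        # integer labels: [0, 1, 0, 2]
y_onehot = one_hot(y_int, num_classes=len(a2c))

print(y_onehot[0])                       # [1.0, 0.0, 0.0]
print(len(y_onehot), len(y_onehot[0]))   # 4 3
```

Four labels become a 4 x 3 matrix, in the same way that 19579 labels become the (19579, 3) matrix shown above.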

In [24]:
y[0]
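# Each entry is now an array of 0s and 1s.
# Note: the 3x2 output below appears to come from running `y = to_categorical(y)` a second time
# (giving shape (19579, 3, 2)); after a single call, y[0] is a length-3 one-hot vector.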

array([[0., 1.],
       [1., 0.],
       [1., 0.]], dtype=float32)

In [25]:
# The last element
y[-1]

array([[1., 0.],
       [0., 1.],
       [1., 0.]], dtype=float32)

In [28]:
# The length is still 19579; only the representation of each entry changed
len(y)

19579

In [9]:
# Take a look at the contents
df.head(5)

Unnamed: 0,id,text,author
0,id26305,"This process, however, afforded me no means of...",EAP
1,id17569,It never once occurred to me that the fumbling...,HPL
2,id11008,"In his left hand was a gold snuff box, from wh...",EAP
3,id27763,How lovely is spring As we looked from Windsor...,MWS
4,id12958,"Finding nothing else, not even gold, the Super...",HPL


In [27]:
# The three author labels map to 0, 1, 2
a2c

{'EAP': 0, 'HPL': 1, 'MWS': 2}

In [33]:
# The mapping is a dictionary
type(a2c)

dict

In [37]:
# Look up a key in the dictionary
a2c['HPL']

1

# 1. **Few Preprocessings**

- In traditional NLP tasks, preprocessing plays an important role, but...

## **Low-frequency words**

-   In my experience, fastText is very fast, but I need to delete rare words to avoid overfitting.

**NOTE**:
- Some keywords are rare words, such as *Cthulhu* in the *Cthulhu Mythos* of *Howard Phillips Lovecraft*.
- Rare as they are, such words are useful for this task.
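The trade-off in the note can be seen on a toy corpus. The sketch below counts words with `collections.Counter` and splits them by a minimum count; the sentences and the threshold are made-up assumptions for illustration, not the competition data.

```python
from collections import Counter

# Toy corpus (hypothetical sentences, not the competition data)
corpus = [
    "the call of Cthulhu",
    "the shadow of the sea",
    "the call of the sea",
]

# Count every whitespace-separated word across all documents
word_counts = Counter(word for doc in corpus for word in doc.split())

min_count = 2  # keep words that occur at least this many times
vocabulary = {w for w, c in word_counts.items() if c >= min_count}
rare_words = {w for w, c in word_counts.items() if c < min_count}

print(sorted(vocabulary))   # ['call', 'of', 'sea', 'the']
print(sorted(rare_words))   # ['Cthulhu', 'shadow']
```

In this toy case the filter discards `Cthulhu`, which is exactly the kind of author-specific keyword the note warns about.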

## **Removing Stopwords**

- Stopwords: some NLP tasks drop frequent words that carry little meaning on their own. For example, the 100 most frequent words in an article will likely include many words such as "is", "a", and "the"; these are stopwords.

- None: no stopwords are removed.
-  To identify an author from a sentence, some stopwords play an important role because each author uses them in a characteristic way.
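As a rough illustration of why stopwords can carry signal, the sketch below compares the relative frequency of a few stopwords in two made-up sentences. The sample texts, the labels `A` and `B`, and the helper `stopword_rates` are all hypothetical.

```python
from collections import Counter

# Hypothetical toy sentences standing in for two different writers
samples = {
    'A': "upon the whole it was the best of the plans",
    'B': "it was a plan and it was a good one",
}
stopwords = ['the', 'a', 'upon', 'it']

def stopword_rates(text, stopwords):
    """Relative frequency of each stopword among all tokens of `text`."""
    tokens = text.split()
    counts = Counter(tokens)
    return {w: counts[w] / len(tokens) for w in stopwords}

for name, text in samples.items():
    print(name, stopword_rates(text, stopwords))
```

The two toy writers differ mainly in how often they use `the` and `a`; removing stopwords would erase that difference.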


## **Stemming and Lowercase**
- Stemming = reducing inflected forms (such as past or future tense) to the base form
- Lowercase = converting text to lower case

- None: neither is applied here.
-   The reason is the same as for not removing stopwords.
-   I also guess that some stemming rules provided by libraries are a poor fit for this task, because all of the authors are from an older era.

## **Cutting long sentences**

-   Documents that are too long are truncated.

## **Punctuation**

-   Because I guess each author uses punctuation in a distinctive way in their novels, I separate punctuation marks from words.

- e.g. `Don't worry` -> `Don ' t worry`

## **Is it slow?**

- Don't worry! FastText is a very fast algorithm even when it runs on a CPU.

# Let's check the character distribution per author

In [4]:
# defaultdict (imported from collections above) is used like a dict,
# but creates a default value (here int() == 0) for missing keys
# set() is an unordered collection of unique elements
counter = {name: defaultdict(int) for name in set(df.author)}
for (text, author) in zip(df.text, df.author):
    text = text.replace(' ', '')
    # Iterating over a string yields its characters one by one
    for c in text:
        # counter[author][c] counts how often character c occurs in that author's texts
        counter[author][c] += 1

# Collect every character seen for any author, without duplicates
chars = set()
for v in counter.values():
    # counter.values() yields each author's per-character dictionary;
    # v.keys() are the characters that author used
    chars |= v.keys()

names = [author for author in counter.keys()]
# counter.keys() = dict_keys(['MWS', 'EAP', 'HPL'])

print('c ', end='')
# end='' suppresses the trailing newline; end=' ' ends the output with a space instead
for n in names:
    print(n, end='   ')
print()
for c in chars:
    print(c, end=' ')
    for n in names:
        # names holds the three authors (n); c is one character
        # counter[author][character] is how often that author used that character
        print(counter[n][c], end=' ')
    print()

c MWS   EAP   HPL   
e 97515 114885 88259 
î 0 1 0 
D 227 491 334 
Σ 0 0 1 
α 0 0 2 
F 232 383 269 
E 445 435 281 
â 0 6 0 
L 307 458 249 
x 1267 1951 1061 
Z 2 23 51 
O 282 414 503 
q 677 1030 779 
Π 0 0 1 
M 415 1065 645 
C 308 395 439 
A 943 1258 1167 
G 246 313 318 
m 20471 22792 17622 
w 16062 17507 15554 
à 0 10 0 
P 365 442 320 
l 27819 35371 30273 
R 385 258 237 
ü 0 1 5 
ï 0 0 7 
t 63142 82426 62235 
δ 0 0 2 
Ν 0 0 1 
ä 0 1 6 
z 400 634 529 
ñ 0 0 7 
" 1469 2987 513 
X 4 17 5 
a 55274 68525 56815 
n 50291 62636 50879 
b 9611 13245 10636 
Å 0 0 1 
d 35315 36862 33366 
Q 7 21 10 
B 395 835 533 
Υ 0 0 1 
: 339 176 47 
? 419 510 169 
h 43738 51580 42770 
, 12045 17594 8581 
j 682 683 424 
i 46080 60952 44250 
N 204 411 345 
é 0 47 15 
K 35 86 176 
s 45962 53841 43915 
I 4917 4846 3480 
V 57 156 67 
J 66 164 210 
g 12601 16088 14951 
H 669 864 741 
ô 0 8 0 
Y 234 282 111 
p 12361 17422 10965 
; 2662 1354 1143 
c 17911 24127 18338 
y 14877 17001 12534 
k 3707 4277 5204 
ê 0 28 2 
. 

In [46]:
type(counter)

dict

In [58]:
# Inspect the dictionary; nothing has been counted yet
counter

[('HPL', defaultdict(int, {})),
 ('MWS', defaultdict(int, {})),
 ('EAP', defaultdict(int, {}))]

In [8]:
text , author

('Helaidagnarledclawonmyshoulder,anditseemedtomethatitsshakingwasnotaltogetherthatofmirth.',
 'HPL')

In [12]:
len(text)

88

In [13]:
len(author)

3

In [22]:
counter['EAP']['B']

835

In [17]:
counter['MWS'][0]

0
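A side effect worth knowing here: reading a missing key from a `defaultdict` inserts that key with the default value. This is presumably why integer keys such as `0` show up among the characters in the later outputs. A small self-contained sketch, with a made-up dictionary named `counts`:

```python
from collections import defaultdict

counts = defaultdict(int)
counts['a'] += 1

print(len(counts))        # 1
_ = counts[0]             # reading a missing key inserts it with the default value 0
print(len(counts))        # 2
print(counts.get(1, 0))   # 0, and .get() reads without inserting
print(len(counts))        # 2
```

Using `.get(key, 0)` or `key in counts` for exploration avoids adding stray keys.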

In [18]:
# The dictionary now holds the number of occurrences of each character
counter

{'EAP': defaultdict(int,
             {'"': 2987,
              "'": 1334,
              ',': 17594,
              '.': 8406,
              ':': 176,
              ';': 1354,
              '?': 510,
              'A': 1258,
              'B': 835,
              'C': 395,
              'D': 491,
              'E': 435,
              'F': 383,
              'G': 313,
              'H': 864,
              'I': 4846,
              'J': 164,
              'K': 86,
              'L': 458,
              'M': 1065,
              'N': 411,
              'O': 414,
              'P': 442,
              'Q': 21,
              'R': 258,
              'S': 729,
              'T': 2217,
              'U': 166,
              'V': 156,
              'W': 739,
              'X': 17,
              'Y': 282,
              'Z': 23,
              'a': 68525,
              'b': 13245,
              'c': 24127,
              'd': 36862,
              'e': 114885,
              'f': 22354,
              'g': 1

In [40]:
counter.values()

dict_values([defaultdict(<class 'int'>, {'Q': 7, '.': 5761, 'i': 46080, 'F': 232, 'L': 307, ',': 12045, "'": 476, 'Z': 2, 'v': 7948, 'O': 282, '?': 419, 'o': 53386, 'H': 669, 'b': 9611, 'r': 44042, 'Y': 234, 'u': 21025, 'X': 4, 'j': 682, 'p': 12361, ':': 339, 'B': 395, 'M': 415, 't': 63142, 'w': 16062, 'y': 14877, 'f': 18351, 'V': 57, '"': 1469, 'J': 66, 'g': 12601, 'k': 3707, 'U': 46, 'A': 943, 'x': 1267, 's': 45962, 'D': 227, 'n': 50291, 'c': 17911, 'd': 35315, 'R': 385, 0: 0, 'E': 445, 'W': 681, 'C': 308, ';': 2662, 'h': 43738, 'S': 578, 'T': 1230, 'K': 35, 'q': 677, 'I': 4917, 'P': 365, 'z': 400, 'a': 55274, 'N': 204, 'l': 27819, 'm': 20471, 'G': 246, 'e': 97515}), defaultdict(<class 'int'>, {'Q': 21, '.': 8406, 'i': 60952, 'F': 383, 'â': 6, 'L': 458, ',': 17594, "'": 1334, 'Z': 23, 'é': 47, 'à': 10, 'v': 9624, 'O': 414, '?': 510, 'o': 67145, 'H': 864, 'b': 13245, 'r': 51221, 'Y': 282, 'u': 26311, 'X': 17, 'j': 683, 'p': 17422, ':': 176, 'B': 835, 'I': 4846, 't': 82426, 'Æ': 1, 'w'

In [33]:
counter.keys()

dict_keys(['MWS', 'EAP', 'HPL'])

In [43]:
# These keys are added to chars
print(v.keys())

dict_keys(['Q', '.', 'i', 'F', 1, 'L', ',', 87, "'", 'Z', 'δ', 'v', 'O', '?', 'o', 'H', 'b', 'r', 'Y', 'u', 'ἶ', 'Υ', 'X', 'æ', 'p', ':', 'B', 'M', 't', 'N', 'w', 'y', 'f', 'V', '"', 'J', 'α', 'Σ', 'g', 'é', 'k', 'U', 'A', 'x', 's', 'D', 'n', 'c', 'ê', 'd', 'R', 'Π', 'E', 'W', 'C', ';', 'h', 'S', 'ä', 'ï', 'Ν', 88, 'T', 'ñ', 'K', 'q', 'I', 'Ο', 'P', 100, 'ë', 'Æ', 'z', 'a', 'j', 'l', 'm', 'G', 0, 'ö', 'e', 'Å', 'ü'])


In [37]:
len(chars)

89

In [38]:
# chars now holds every character from all the dictionaries, without duplicates
chars

{0,
 '.',
 1,
 ',',
 'Z',
 'δ',
 'à',
 'ö',
 'b',
 'r',
 'ἶ',
 ':',
 'B',
 'I',
 'Æ',
 'w',
 'y',
 'f',
 'k',
 'x',
 's',
 'O',
 'X',
 'ê',
 'Υ',
 'd',
 'E',
 'C',
 ';',
 'D',
 'h',
 'S',
 'ï',
 87,
 88,
 'T',
 'K',
 'Ο',
 'é',
 100,
 'ë',
 'Y',
 'N',
 'Ν',
 'm',
 'U',
 'Π',
 'Q',
 'i',
 'F',
 'â',
 "'",
 'v',
 '?',
 'o',
 'H',
 'u',
 'R',
 'p',
 'ç',
 'M',
 't',
 'j',
 'G',
 'æ',
 'ô',
 'V',
 'Σ',
 'g',
 'A',
 'e',
 'Å',
 'î',
 'n',
 'c',
 '"',
 'α',
 'W',
 'ä',
 'ñ',
 'q',
 'P',
 'L',
 'z',
 'a',
 'l',
 'J',
 'è',
 'ü'}

# **Summary of character distribution**

- HPL and EAP use non-ASCII characters such as `ä`.

- The punctuation counts seem to be a good feature.
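One way to quantify the punctuation difference is to compare ratios rather than raw counts, since the authors contribute different amounts of text. The sketch below uses the comma, semicolon, and double-quote counts printed in the character table above; the choice of the semicolon-to-comma ratio is only an illustrative assumption, not a feature the model uses directly.

```python
# Counts copied from the per-author character table above
counts = {
    'MWS': {',': 12045, ';': 2662, '"': 1469},
    'EAP': {',': 17594, ';': 1354, '"': 2987},
    'HPL': {',': 8581, ';': 1143, '"': 513},
}

# Semicolons per comma: a crude, length-independent style indicator
ratios = {author: c[';'] / c[','] for author, c in counts.items()}

for author, ratio in sorted(ratios.items(), key=lambda item: -item[1]):
    print(author, round(ratio, 3))
```

On these counts MWS uses the most semicolons per comma and EAP the fewest.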

# **Preprocessing**

My preprocessing steps are:

- Separate punctuation from words
- Remove low-frequency words (the code keeps tokens that occur at least `min_count = 2` times)
- Truncate documents to at most `256` tokens

In [5]:
# Separate punctuation from words

def preprocess(text):
    # Put spaces around an apostrophe that is followed by a space
    text = text.replace("' ", " ' ")
    signs = set(',.:;"?!')
    # Punctuation marks that actually occur in this text
    prods = set(text) & signs
    if not prods:
        # None of ,.:;"?! occurs in the text, so return it unchanged
        return text

    for sign in prods:
        # Surround every occurrence of the mark with spaces, e.g. ',' -> ' , '
        text = text.replace(sign, ' {} '.format(sign))
    return text
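To see what this function does in practice, the sketch below repeats the same logic in compact form (so it runs on its own) and applies it to two made-up strings.

```python
def preprocess(text):
    # Same logic as the cell above, repeated so this sketch runs on its own
    text = text.replace("' ", " ' ")
    for sign in set(text) & set(',.:;"?!'):
        text = text.replace(sign, ' {} '.format(sign))
    return text

print(preprocess("Don't worry, he said.").split())
# ["Don't", 'worry', ',', 'he', 'said', '.']
print(preprocess("the boys' room").split())
# ['the', 'boys', "'", 'room']
```

As written, only an apostrophe followed by a space is separated, so an apostrophe inside a word such as `Don't` stays attached; splitting it as well would need an additional rule.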

In [6]:
# Build the documents: separate punctuation, then append word n-grams

def create_docs(df, n_gram_max=2):
    def add_ngram(q, n_gram_max):
        ngrams = []
        for n in range(2, n_gram_max+1):
            # With the default n_gram_max=2, only n=2 (bi-grams) is generated
            for w_index in range(len(q)-n+1):
                ngrams.append('--'.join(q[w_index:w_index+n]))
        return q + ngrams
    docs = []
    for doc in df.text:
        # preprocess separates punctuation; split() tokenizes on whitespace
        doc = preprocess(doc).split()
        docs.append(' '.join(add_ngram(doc, n_gram_max)))
    return docs
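The n-gram step is easiest to follow on a short token list. The sketch below copies the inner helper as a standalone function (with renamed arguments) and runs it on a made-up sentence.

```python
def add_ngram(tokens, n_gram_max):
    """Append word n-grams (n = 2..n_gram_max), joined by '--', to the token list."""
    ngrams = []
    for n in range(2, n_gram_max + 1):
        for i in range(len(tokens) - n + 1):
            ngrams.append('--'.join(tokens[i:i + n]))
    return tokens + ngrams

tokens = ['It', 'was', 'dark', '.']   # toy example
print(add_ngram(tokens, 2))
# ['It', 'was', 'dark', '.', 'It--was', 'was--dark', 'dark--.']
print(len(add_ngram(tokens, 3)))      # 4 unigrams + 3 bi-grams + 2 tri-grams = 9
```

Because the n-grams are joined with `--` and contain no spaces, the tokenizer later treats each of them as a single token.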

In [7]:
# Drop rare tokens, then pad or truncate every document to a fixed length

min_count = 2

docs = create_docs(df)
tokenizer = Tokenizer(lower=False, filters='')
tokenizer.fit_on_texts(docs)
# fit_on_texts(texts) builds the token dictionary from a list of texts, one element per document

# Number of tokens that occur at least min_count times
num_words = sum([1 for _, v in tokenizer.word_counts.items() if v >= min_count])

tokenizer = Tokenizer(num_words=num_words, lower=False, filters='')

tokenizer.fit_on_texts(docs)

docs = tokenizer.texts_to_sequences(docs)
# texts_to_sequences converts each document into a list of word indices

maxlen = 256

docs = pad_sequences(sequences=docs, maxlen=maxlen)
# maxlen: None or an integer, the maximum sequence length
# Shorter sequences are padded with 0 (at the front, by default) to reach that length
# Sequences longer than maxlen are truncated to match it
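By default, `pad_sequences` pads at the front and also truncates from the front (`padding='pre'`, `truncating='pre'`). The pure-Python sketch below mimics that default behaviour for a single sequence; `pad_or_truncate` is a hypothetical helper for illustration, not part of Keras.

```python
def pad_or_truncate(seq, maxlen, value=0):
    """Mimic the pad_sequences defaults: pad at the front, truncate from the front."""
    if len(seq) >= maxlen:
        return seq[-maxlen:]                     # keep the last maxlen items
    return [value] * (maxlen - len(seq)) + seq   # left-pad with `value`

print(pad_or_truncate([5, 8, 2], maxlen=5))              # [0, 0, 5, 8, 2]
print(pad_or_truncate([1, 2, 3, 4, 5, 6, 7], maxlen=5))  # [3, 4, 5, 6, 7]
```

With these defaults, a document longer than `maxlen` loses its earliest tokens rather than its last ones.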

# **2. Model: FastText by Keras**

FastText is a very fast and strong baseline algorithm for text classification. Its architecture is based on the Continuous Bag-of-Words (CBOW) model from Word2vec.

FastText contains only three layers:

1. Embedding layer: the inputs are all words (and word n-grams) in a sentence/document
2. Mean/average pooling layer: takes the average of the embedding vectors
3. Softmax layer
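The three layers can be written out as a small forward pass. The sketch below uses plain Python lists with random weights standing in for trained parameters; the toy sizes, the variable names, and the function `forward` are assumptions for illustration, not the Keras implementation.

```python
import math
import random

random.seed(0)
vocab_size, embedding_dims, num_classes = 10, 4, 3   # toy sizes (assumptions)

# Randomly initialised parameters, standing in for trained weights
embedding = [[random.uniform(-1, 1) for _ in range(embedding_dims)]
             for _ in range(vocab_size)]
weights = [[random.uniform(-1, 1) for _ in range(num_classes)]
           for _ in range(embedding_dims)]
bias = [0.0] * num_classes

def forward(token_ids):
    # 1. Embedding layer: look up one vector per token id
    vectors = [embedding[t] for t in token_ids]
    # 2. Average pooling: mean over the sequence, one value per dimension
    pooled = [sum(col) / len(vectors) for col in zip(*vectors)]
    # 3. Softmax layer: linear map followed by normalised exponentials
    logits = [sum(pooled[d] * weights[d][k] for d in range(embedding_dims)) + bias[k]
              for k in range(num_classes)]
    m = max(logits)                              # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = forward([1, 4, 7, 2])
print(probs, sum(probs))
```

The output is one probability per class, and the probabilities sum to 1 regardless of how many tokens the document has.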

There are some implementations of FastText:

- Original library provided by Facebook AI research: https://github.com/facebookresearch/fastText
- Keras: https://github.com/fchollet/keras/blob/master/examples/imdb_fasttext.py
- Gensim: https://radimrehurek.com/gensim/models/wrappers/fasttext.html

Original paper: https://arxiv.org/abs/1607.01759 (more detailed information about the fastText classification model)

# My FastText parameters are:

- The dimension of the word vectors is 20
- The optimizer is `Adam`
- Inputs are words and word bi-grams
  - You can change this by passing the maximum n-gram size as the `n_gram_max` argument of the `create_docs` function.
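These settings keep the model small apart from the embedding table. As a rough sketch, the number of trainable parameters can be computed from the vocabulary size; the helper `fasttext_param_count` and the value `50_000` are hypothetical, since the real `input_dim` depends on the data.

```python
def fasttext_param_count(input_dim, embedding_dims=20, num_classes=3):
    """Trainable parameters of an Embedding -> average pooling -> Dense(softmax) model."""
    embedding_params = input_dim * embedding_dims               # one vector per token index
    dense_params = embedding_dims * num_classes + num_classes   # weights + biases
    return embedding_params + dense_params

# 50_000 is a hypothetical vocabulary size, used only for illustration
print(fasttext_param_count(input_dim=50_000))   # 1000063
```

Almost all parameters sit in the embedding layer, which is one reason why pruning rare words and n-grams matters for overfitting.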


In [7]:
input_dim = np.max(docs) + 1
embedding_dims = 20

In [8]:
def create_model(embedding_dims=20, optimizer='adam'):
    model = Sequential()
    model.add(Embedding(input_dim=input_dim, output_dim=embedding_dims))
    model.add(GlobalAveragePooling1D())
    model.add(Dense(3, activation='softmax'))  # one output per author (EAP, HPL, MWS)

    model.compile(loss='categorical_crossentropy',
                  optimizer=optimizer,
                  metrics=['accuracy'])
    return model

In [9]:
%%time
epochs = 25
x_train, x_test, y_train, y_test = train_test_split(docs, y, test_size=0.2)

model = create_model()

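# EarlyStopping halts training once the monitored quantity stops improving
# monitor = the quantity to watch
# patience = number of epochs without improvement (e.g. val_loss no longer decreasing) to tolerate before stopping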
hist = model.fit(x_train, y_train,
                 batch_size=16,
                 validation_data=(x_test, y_test),
                 epochs=epochs,
                 callbacks=[EarlyStopping(patience=2, monitor='val_loss')])
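The effect of `patience=2` can be sketched without Keras. The function below is a simplified stand-in for the early-stopping rule (stop after `patience` consecutive epochs without a new best validation loss); `early_stopping_epoch` and the loss values are made up for illustration.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the 0-based epoch at which training would stop, or None if it never does."""
    best = float('inf')
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, wait = loss, 0     # new best: reset the counter
        else:
            wait += 1                # no improvement this epoch
            if wait >= patience:
                return epoch
    return None

losses = [0.90, 0.70, 0.60, 0.61, 0.62, 0.50]   # made-up validation losses
print(early_stopping_epoch(losses))              # 4
```

Training stops at the fifth epoch (index 4), so the later improvement to 0.50 is never reached; a larger `patience` trades longer training for a chance to see such late improvements.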

Train on 15663 samples, validate on 3916 samples
Epoch 1/25
Epoch 2/25
Epoch 3/25
Epoch 4/25
Epoch 5/25
Epoch 6/25
Epoch 7/25
Epoch 8/25
Epoch 9/25
Epoch 10/25
Epoch 11/25
Epoch 12/25
Epoch 13/25
Epoch 14/25
Epoch 15/25
Epoch 16/25
Epoch 17/25
CPU times: user 3min 35s, sys: 2min 6s, total: 5min 41s
Wall time: 22min 49s


# **2.1 Change Preprocessings**

Next, I change some parameters and preprocessing steps to improve the fastText model.

## **2.1.1 Do lower case**

In [None]:
docs = create_docs(df)
tokenizer = Tokenizer(lower=True, filters='')
tokenizer.fit_on_texts(docs)
num_words = sum([1 for _, v in tokenizer.word_counts.items() if v >= min_count])

tokenizer = Tokenizer(num_words=num_words, lower=True, filters='')
tokenizer.fit_on_texts(docs)
docs = tokenizer.texts_to_sequences(docs)

maxlen = 256

docs = pad_sequences(sequences=docs, maxlen=maxlen)

input_dim = np.max(docs) + 1
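Lowercasing merges tokens that differ only in case, which shrinks the vocabulary and therefore `input_dim`. A minimal sketch on two made-up documents:

```python
docs_toy = ["The night was dark .", "the Night came ."]   # toy documents

vocab_cased = {tok for doc in docs_toy for tok in doc.split()}
vocab_lower = {tok.lower() for doc in docs_toy for tok in doc.split()}

print(len(vocab_cased), len(vocab_lower))   # 8 6
```

Whether the smaller vocabulary helps depends on how much author-specific signal the capitalisation carried, so it is worth comparing validation loss with and without it.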

In [11]:
epochs = 16
x_train, x_test, y_train, y_test = train_test_split(docs, y, test_size=0.2)

model = create_model()
hist = model.fit(x_train, y_train,
                 batch_size=16,
                 validation_data=(x_test, y_test),
                 epochs=epochs,
                 callbacks=[EarlyStopping(patience=2, monitor='val_loss')])

Train on 15663 samples, validate on 3916 samples
Epoch 1/16
Epoch 2/16
Epoch 3/16
Epoch 4/16
Epoch 5/16
Epoch 6/16
Epoch 7/16
Epoch 8/16
Epoch 9/16
Epoch 10/16
Epoch 11/16
Epoch 12/16
Epoch 13/16
Epoch 14/16
Epoch 15/16
Epoch 16/16


In [14]:
test_df = pd.read_csv('./keras_fasttext_data/test.zip')
docs = create_docs(test_df)
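# Feed in the test set
docs = tokenizer.texts_to_sequences(docs)
docs = pad_sequences(sequences=docs, maxlen=maxlen)
y = model.predict(docs)
# Predicted probability of each author; for this softmax model, predict returns the class
# probabilities (predict_proba was removed from newer Keras releases)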

result = pd.read_csv('./keras_fasttext_data/sample_submission.zip')
for a, i in a2c.items():
    result[a] = y[:, i]

In [None]:
result.to_csv('./keras_fasttext_data/kefastText_result.csv', index=False)
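The column assignment relies on the prediction columns following the index order defined in `a2c`. The sketch below shows that mapping with plain lists; the probabilities are made up, and `columns` is a hypothetical stand-in for the submission DataFrame.

```python
a2c = {'EAP': 0, 'HPL': 1, 'MWS': 2}

# Made-up probabilities for two test sentences; each row follows the a2c index order
probs = [
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
]

# One column of probabilities per author, picked out by the author's index
columns = {author: [row[i] for row in probs] for author, i in a2c.items()}
print(columns['EAP'])   # [0.7, 0.1]
print(columns['MWS'])   # [0.1, 0.6]

# Sanity check: every row should sum to (approximately) 1
rows_ok = all(abs(sum(row) - 1.0) < 1e-9 for row in probs)
print(rows_ok)          # True
```

A check like `rows_ok` is a cheap way to catch a mismatch between the model output and the expected submission columns before writing the file.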

In [8]:
a2c.items()

dict_items([('MWS', 2), ('EAP', 0), ('HPL', 1)])