# **1. Few Preprocessings**
# **2. Model: FastText by Keras**
## **2.1** Change Preprocessings:
- Do lower case

In [1]:
import numpy as np

import pandas as pd

from collections import defaultdict

import keras
import keras.backend as K
from keras.layers import Dense, GlobalAveragePooling1D, Embedding
from keras.callbacks import EarlyStopping
from keras.models import Sequential
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import Tokenizer
from keras.utils import to_categorical

from sklearn.model_selection import train_test_split

np.random.seed(7)
# Fixing the seed makes the generated random numbers reproducible

Using TensorFlow backend.


In [2]:
df = pd.read_csv('./keras_fasttext_data/train.zip')
a2c = {'EAP': 0, 'HPL' : 1, 'MWS' : 2}
y = np.array([a2c[a] for a in df.author])
y = to_categorical(y)

In [4]:
y[0:10]

array([[1., 0., 0.],
       [0., 1., 0.],
       [1., 0., 0.],
       [0., 0., 1.],
       [0., 1., 0.],
       [0., 0., 1.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [0., 0., 1.]], dtype=float32)
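
The array above is the one-hot encoding of the author labels. As a minimal, dependency-free sketch of what that encoding does (the helper `one_hot` is hypothetical and not part of Keras; the toy `authors` list is chosen to match the first four rows shown above):

```python
def one_hot(labels, num_classes):
    """Turn integer class ids into rows with a single 1.0 at the class position."""
    rows = []
    for label in labels:
        row = [0.0] * num_classes
        row[label] = 1.0
        rows.append(row)
    return rows

a2c = {'EAP': 0, 'HPL': 1, 'MWS': 2}
authors = ['EAP', 'HPL', 'EAP', 'MWS']  # toy input, not read from train.zip
encoded = one_hot([a2c[a] for a in authors], num_classes=len(a2c))
print(encoded)
```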

# 1. **Few Preprocessings**

In traditional NLP tasks, preprocessing plays an important role, but...

## **Low-frequency words**
In my experience, fastText is very fast, but rare words need to be removed to avoid overfitting.

**NOTE**:
Some keywords are rare words, such as *Cthulhu* in the *Cthulhu Mythos* of *Howard Phillips Lovecraft*.
These are still useful for this task.
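
A minimal sketch of frequency-based filtering, to make the trade-off in the note concrete. The helper `build_vocab` and the toy documents are assumptions for illustration; the notebook itself does the filtering through the Keras `Tokenizer`.

```python
from collections import Counter

def build_vocab(tokenized_docs, min_count=2):
    """Return the set of tokens that occur at least `min_count` times."""
    counts = Counter(tok for doc in tokenized_docs for tok in doc)
    return {tok for tok, c in counts.items() if c >= min_count}

# Toy corpus (hypothetical), already split into tokens
toy_docs = [
    ['the', 'call', 'of', 'Cthulhu'],
    ['the', 'shadow', 'of', 'the', 'past'],
]
vocab = build_vocab(toy_docs, min_count=2)
filtered = [[tok for tok in doc if tok in vocab] for doc in toy_docs]
print(filtered)
```

In this toy corpus `Cthulhu` occurs only once, so it is dropped together with the other rare words, which is exactly the risk the note describes.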

## **Removing Stopwords**

Not applied.
When identifying the author of a sentence, stopwords play an important role because each author uses them in a characteristic way.

## **Stemming and Lowercase**

Not applied.
The reason is the same as for stopword removal.
I also suspect that the stemming rules provided by libraries are a poor fit for this task because all of the authors are from an older era.

## **Cutting long sentence**

Documents that are too long are truncated.
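
A small sketch of the truncation behaviour, assuming the defaults of `pad_sequences` (`padding='pre'`, `truncating='pre'`), which the later call in this notebook does not override. The helper `pad_or_truncate` is hypothetical and only mimics those defaults for one sequence with a positive `maxlen`.

```python
def pad_or_truncate(seq, maxlen, value=0):
    """Keep the last `maxlen` ids and left-pad shorter sequences with `value`."""
    seq = seq[-maxlen:]
    return [value] * (maxlen - len(seq)) + seq

long_doc = pad_or_truncate([5, 3, 8, 2, 9], maxlen=3)
short_doc = pad_or_truncate([5, 3], maxlen=3)
print(long_doc)   # [8, 2, 9]
print(short_doc)  # [0, 5, 3]
```

Under these defaults it is the beginning of a long document that is discarded, not the end.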

## **Punctuation**

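Because I suspect that each author uses punctuation in a characteristic way, I separate punctuation marks from words.

e.g. `Don't worry, he said.` -> `Don't worry , he said .` (after whitespace splitting)

Note that the `preprocess` function defined later pads an apostrophe only when it is followed by a space, so an in-word apostrophe such as the one in `Don't` stays attached to the word.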
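
If in-word apostrophes should be split as well (giving `Don ' t`), a regular expression is one option. This is a sketch of a variant, not the `preprocess` function this notebook uses: that function only pads an apostrophe when a space follows it. The name `split_punctuation` is hypothetical.

```python
import re

# Pads every listed punctuation mark with spaces, including in-word apostrophes.
_PUNCT = re.compile(r"""([,.:;"?!'])""")

def split_punctuation(text):
    """Return tokens with each punctuation mark as a separate token."""
    return _PUNCT.sub(r' \1 ', text).split()

tokens = split_punctuation("Don't worry, he said.")
print(tokens)  # ['Don', "'", 't', 'worry', ',', 'he', 'said', '.']
```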

## **Is it slow?**

Don't worry! FastText is a very fast algorithm, even when it runs on a CPU.

In [3]:
counter = {name : defaultdict(int) for name in set(df.author)}
for (text, author) in zip(df.text, df.author):
    text = text.replace(' ', '')
    for c in text:
        counter[author][c] += 1

chars = set()
for v in counter.values():
    chars |= v.keys()
    
names = [author for author in counter.keys()]

print('c ', end='')
for n in names:
    print(n, end='   ')
print()
for c in chars:    
    print(c, end=' ')
    for n in names:
        print(counter[n][c], end=' ')
    print()


c EAP   MWS   HPL   
. 8406 5761 5908 
a 68525 55274 56815 
V 156 57 67 
J 164 66 210 
O 414 282 503 
l 35371 27819 30273 
A 1258 943 1167 
: 176 339 47 
j 683 682 424 
B 835 395 533 
s 53841 45962 43915 
ä 1 0 6 
ç 1 0 0 
h 51580 43738 42770 
" 2987 1469 513 
Z 23 2 51 
Æ 1 0 4 
H 864 669 741 
I 4846 4917 3480 
N 411 204 345 
i 60952 46080 44250 
Q 21 7 10 
, 17594 12045 8581 
c 24127 17911 18338 
ï 0 0 7 
Υ 0 0 1 
t 82426 63142 62235 
ô 8 0 0 
w 17507 16062 15554 
v 9624 7948 6529 
ö 16 0 3 
m 22792 20471 17622 
P 442 365 320 
æ 36 0 10 
δ 0 0 2 
ë 0 0 12 
ἶ 0 0 2 
k 4277 3707 5204 
T 2217 1230 1583 
E 435 445 281 
M 1065 415 645 
X 17 4 5 
à 10 0 0 
L 458 307 249 
b 13245 9611 10636 
D 491 227 334 
ê 28 0 2 
C 395 308 439 
y 17001 14877 12534 
o 67145 53386 50996 
é 47 0 15 
Π 0 0 1 
S 729 578 841 
Σ 0 0 1 
g 16088 12601 14951 
; 1354 2662 1143 
f 22354 18351 16272 
U 166 46 94 
Ο 0 0 3 
p 17422 12361 10965 
î 1 0 0 
ñ 0 0 7 
n 62636 50291 50879 
' 1334 476 1710 
G 313 246 318 
K 86

# **Summary of character distribution**

- HPL and EAP use non-ASCII characters such as `ä`.
- Punctuation counts look like a useful feature.
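
Raw counts depend on how much text each author contributed, so a normalized rate is easier to compare. The sketch below divides each punctuation count by the number of full stops, as a rough per-sentence rate. That normalization is my assumption, not something the notebook does. The counts are copied from the table above.

```python
# Counts copied from the character table (author -> character -> count)
counts = {
    'EAP': {'.': 8406, ',': 17594, ';': 1354, ':': 176},
    'MWS': {'.': 5761, ',': 12045, ';': 2662, ':': 339},
    'HPL': {'.': 5908, ',': 8581, ';': 1143, ':': 47},
}

def per_period(counts, char):
    """Occurrences of `char` per full stop for each author."""
    return {author: c[char] / c['.'] for author, c in counts.items()}

for ch in ',;:':
    rates = per_period(counts, ch)
    print(ch, {author: round(rate, 3) for author, rate in rates.items()})

semicolon_rates = per_period(counts, ';')
```

By this measure MWS uses semicolons far more often than the other two authors, which supports keeping punctuation as separate tokens.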


# **Preprocessing**

My preprocessing steps are:

- Separate punctuation from words
- Remove low-frequency words (keep only tokens that occur at least `min_count = 2` times)
- Truncate documents longer than `256` tokens

In [4]:
def preprocess(text):
    text = text.replace("' ", " ' ")
    signs = set(',.:;"?!')
    prods = set(text) & signs
    if not prods:
        return text

    for sign in prods:
        text = text.replace(sign, ' {} '.format(sign) )
    return text

In [5]:
def create_docs(df, n_gram_max=2):
    def add_ngram(q, n_gram_max):
        # Append word n-grams (n = 2..n_gram_max), joined by '--', after the unigrams
        ngrams = []
        for n in range(2, n_gram_max+1):
            for w_index in range(len(q)-n+1):
                ngrams.append('--'.join(q[w_index:w_index+n]))
        return q + ngrams
        
    docs = []
    for doc in df.text:
        doc = preprocess(doc).split()
        docs.append(' '.join(add_ngram(doc, n_gram_max)))
    
    return docs
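
A worked example of the n-gram step, with `add_ngram` copied out of `create_docs` as a standalone function so it can run on its own. The token list is a toy input.

```python
def add_ngram(tokens, n_gram_max=2):
    """Return the tokens followed by their word n-grams, joined with '--'."""
    ngrams = []
    for n in range(2, n_gram_max + 1):
        for i in range(len(tokens) - n + 1):
            ngrams.append('--'.join(tokens[i:i + n]))
    return tokens + ngrams

tokens = ['It', 'was', 'late', '.']
with_bigrams = add_ngram(tokens, 2)
print(with_bigrams)
# ['It', 'was', 'late', '.', 'It--was', 'was--late', 'late--.']
print(len(add_ngram(tokens, 3)))  # 4 unigrams + 3 bigrams + 2 trigrams = 9
```

With bigrams a document of `k` tokens grows to `2k - 1` tokens, so a fixed maximum length covers roughly half as many original words.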

In [6]:
min_count = 2

docs = create_docs(df)
tokenizer = Tokenizer(lower=False, filters='')
tokenizer.fit_on_texts(docs)
num_words = sum([1 for _, v in tokenizer.word_counts.items() if v >= min_count])

tokenizer = Tokenizer(num_words=num_words, lower=False, filters='')
tokenizer.fit_on_texts(docs)
docs = tokenizer.texts_to_sequences(docs)

maxlen = 256

docs = pad_sequences(sequences=docs, maxlen=maxlen)
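
One detail worth checking here: as far as I understand the Keras `Tokenizer`, word indices start at 1 and `texts_to_sequences` keeps only words whose index is strictly below `num_words`. The sketch below mimics that rule in plain Python on a toy corpus; the helpers `fit_word_index` and `to_sequences` are hypothetical and are not the Keras implementation.

```python
from collections import Counter

def fit_word_index(docs):
    """Rank words by descending frequency, starting at index 1."""
    counts = Counter(tok for doc in docs for tok in doc.split())
    word_index = {w: i + 1 for i, (w, _) in enumerate(counts.most_common())}
    return word_index, counts

def to_sequences(docs, word_index, num_words):
    """Keep only words whose index is strictly below `num_words`."""
    return [[word_index[t] for t in doc.split() if word_index[t] < num_words]
            for doc in docs]

toy_docs = ['a b a c', 'a b d']  # counts: a=3, b=2, c=1, d=1
word_index, counts = fit_word_index(toy_docs)
min_count = 2
num_words = sum(1 for v in counts.values() if v >= min_count)  # 2, as computed in the cell above

print(to_sequences(toy_docs, word_index, num_words))      # [[1, 1], [1]]
print(to_sequences(toy_docs, word_index, num_words + 1))  # [[1, 2, 1], [1, 2]]
```

If that reading of the index rule is right, passing `num_words + 1` would be needed to keep every word that reaches `min_count`; with `num_words` alone, the least frequent of those words is dropped.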

# **2. Model: FastText by Keras**

FastText is a very fast and strong baseline algorithm for text classification, based on the Continuous Bag-of-Words (CBOW) model of word2vec.

FastText contains only three layers:

1. Embedding layer: the inputs are all words (and word n-grams) in a sentence/document
2. Mean/average pooling layer: takes the average of the embedding vectors
3. Softmax layer
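
The three layers can be sketched as a dependency-free forward pass. This is only an illustration under toy weights that I made up: `fasttext_forward` is a hypothetical helper, not the Keras model built later, and it ignores training entirely.

```python
import math

def fasttext_forward(token_ids, embeddings, class_weights, bias):
    """Embedding lookup, average pooling, then a softmax over the classes."""
    # 1. Embedding layer: one vector per token id
    vectors = [embeddings[i] for i in token_ids]
    # 2. Average pooling over all positions
    dim = len(vectors[0])
    mean = [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]
    # 3. Dense layer followed by a numerically stable softmax
    logits = [sum(m * w for m, w in zip(mean, ws)) + b
              for ws, b in zip(class_weights, bias)]
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

embeddings = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 4 token ids, 2 dims
class_weights = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]           # one weight vector per class
bias = [0.0, 0.0, 0.0]
probs = fasttext_forward([1, 2], embeddings, class_weights, bias)
print(probs)
```

One caveat when comparing with the Keras version: unless masking is enabled, padded positions (id `0`) are averaged in as well, so heavily padded documents are pulled toward the padding vector.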

There are some implementations of FastText:

- Original library provided by Facebook AI research: https://github.com/facebookresearch/fastText
- Keras: https://github.com/fchollet/keras/blob/master/examples/imdb_fasttext.py
- Gensim: https://radimrehurek.com/gensim/models/wrappers/fasttext.html

Original paper: https://arxiv.org/abs/1607.01759 (more detailed information about the fastText classification model)

# My FastText parameters are:

- The dimension of word vector is 20
- Optimizer is `Adam`
- Inputs are words and word bi-grams
  - You can change this by passing the maximum n-gram size as the `n_gram_max` argument of `create_docs`.
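
To get a feel for the model size these parameters imply, the trainable parameters can be counted by hand: the embedding table plus the dense softmax layer (the pooling layer has none). The helper name and the vocabulary size of `50_000` are assumptions for illustration; the real `input_dim` depends on the tokenizer.

```python
def fasttext_param_count(input_dim, embedding_dims=20, num_classes=3):
    """Trainable parameters of embedding + average pooling + dense softmax."""
    embedding = input_dim * embedding_dims              # one vector per token id
    dense = embedding_dims * num_classes + num_classes  # weights plus biases
    return embedding + dense

total = fasttext_param_count(input_dim=50_000)  # hypothetical vocabulary size
print(total)  # 1000063
```

Almost all parameters sit in the embedding table, which is why trimming rare words and n-grams shrinks the model so effectively.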


In [7]:
input_dim = np.max(docs) + 1
embedding_dims = 20

In [8]:
def create_model(embedding_dims=20, optimizer='adam'):
    model = Sequential()
    model.add(Embedding(input_dim=input_dim, output_dim=embedding_dims))
    model.add(GlobalAveragePooling1D())
    model.add(Dense(3, activation='softmax'))

    model.compile(loss='categorical_crossentropy',
                  optimizer=optimizer,
                  metrics=['accuracy'])
    return model

In [9]:
%%time
epochs = 25
x_train, x_test, y_train, y_test = train_test_split(docs, y, test_size=0.2)

model = create_model()
hist = model.fit(x_train, y_train,
                 batch_size=16,
                 validation_data=(x_test, y_test),
                 epochs=epochs,
                 callbacks=[EarlyStopping(patience=2, monitor='val_loss')])

Train on 15663 samples, validate on 3916 samples
Epoch 1/25
Epoch 2/25
Epoch 3/25
Epoch 4/25
Epoch 5/25
Epoch 6/25
Epoch 7/25
Epoch 8/25
Epoch 9/25
Epoch 10/25
Epoch 11/25
Epoch 12/25
Epoch 13/25
Epoch 14/25
Epoch 15/25
Epoch 16/25
Epoch 17/25
CPU times: user 3min 34s, sys: 2min 4s, total: 5min 39s
Wall time: 9min 33s


In [10]:
hist.history

{'acc': [0.40745706442218627,
  0.609844857322251,
  0.7850347953890578,
  0.8461342016330725,
  0.8793973057676319,
  0.9060844027363598,
  0.9275362318916689,
  0.9441358615846261,
  0.9574155653488862,
  0.966289982761923,
  0.9742067292345017,
  0.979952754900083,
  0.9848049543548237,
  0.9875502777245738,
  0.9904871352869821,
  0.9921470982608444,
  0.993679371771666],
 'loss': [1.0677945384012362,
  0.9350210330476099,
  0.7280177331546147,
  0.5686626990225202,
  0.4549655123573758,
  0.3686974758938386,
  0.3008757487404646,
  0.24674754453323386,
  0.202612399660117,
  0.1668055929238332,
  0.13685051731093134,
  0.11276970568323458,
  0.09282629702946377,
  0.07696351547214915,
  0.06346491431530935,
  0.052670161264240914,
  0.04374515521879672],
 'val_acc': [0.45352400411624344,
  0.7142492339730384,
  0.7553626149740599,
  0.7867722164866142,
  0.8036261490708839,
  0.819203268702354,
  0.8296731359137943,
  0.8447395301936718,
  0.8449948926868186,
  0.8475485188359503,

# **2.1 Change Preprocessings**

Next, I change some parameters and preprocessing steps to improve the fastText model.
## **2.1.1 Do lower case**
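
Lowercasing merges tokens that differ only in case, which shrinks the vocabulary and gives each remaining token more training examples. A toy illustration follows; the helper `vocab_size` and the two documents are hypothetical.

```python
def vocab_size(docs, lower):
    """Count distinct whitespace-separated tokens, optionally lowercased."""
    vocab = set()
    for doc in docs:
        for tok in doc.split():
            vocab.add(tok.lower() if lower else tok)
    return len(vocab)

toy_docs = ['The night was dark', 'the Night passed']
cased = vocab_size(toy_docs, lower=False)
uncased = vocab_size(toy_docs, lower=True)
print(cased, uncased)  # 7 5
```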

In [11]:
docs = create_docs(df)
tokenizer = Tokenizer(lower=True, filters='')
tokenizer.fit_on_texts(docs)
num_words = sum([1 for _, v in tokenizer.word_counts.items() if v >= min_count])

tokenizer = Tokenizer(num_words=num_words, lower=True, filters='')
tokenizer.fit_on_texts(docs)
docs = tokenizer.texts_to_sequences(docs)

maxlen = 256

docs = pad_sequences(sequences=docs, maxlen=maxlen)

input_dim = np.max(docs) + 1

In [12]:
epochs = 16
x_train, x_test, y_train, y_test = train_test_split(docs, y, test_size=0.2)

model = create_model()
hist = model.fit(x_train, y_train,
                 batch_size=16,
                 validation_data=(x_test, y_test),
                 epochs=epochs,
                 callbacks=[EarlyStopping(patience=2, monitor='val_loss')])

Train on 15663 samples, validate on 3916 samples
Epoch 1/16
Epoch 2/16
Epoch 3/16
Epoch 4/16
Epoch 5/16
Epoch 6/16
Epoch 7/16
Epoch 8/16
Epoch 9/16
Epoch 10/16
Epoch 11/16
Epoch 12/16
Epoch 13/16
Epoch 14/16
Epoch 15/16
Epoch 16/16


In [13]:
test_df = pd.read_csv('./keras_fasttext_data/test.zip')
docs = create_docs(test_df)
docs = tokenizer.texts_to_sequences(docs)
docs = pad_sequences(sequences=docs, maxlen=maxlen)
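# `predict` returns the softmax probabilities; `predict_proba` is no longer available in recent Keras releases
y = model.predict(docs)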

result = pd.read_csv('./keras_fasttext_data/sample_submission.zip')
for a, i in a2c.items():
    result[a] = y[:, i]

In [14]:
result.to_csv('./keras_fasttext_data/kefastText_result.csv', index=False)

In [15]:
y

array([[2.4272846e-02, 2.5963726e-02, 9.4976342e-01],
       [9.9948221e-01, 5.1771977e-04, 1.2118203e-07],
       [8.9132594e-04, 9.9599361e-01, 3.1150738e-03],
       ...,
       [7.3178154e-01, 1.8311663e-01, 8.5101813e-02],
       [6.2676385e-02, 6.6918219e-03, 9.3063182e-01],
       [6.0933948e-02, 9.3904918e-01, 1.6923084e-05]], dtype=float32)
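
Each row of this array holds the predicted probabilities for `EAP`, `HPL` and `MWS`, in the column order given by `a2c`. A short sketch of reading off the most likely author per row, using the first three rows printed above (the inverse mapping `c2a` is a name introduced here for illustration):

```python
a2c = {'EAP': 0, 'HPL': 1, 'MWS': 2}
c2a = {i: a for a, i in a2c.items()}  # class index -> author

# First three rows of the prediction array shown above
probs = [
    [2.4272846e-02, 2.5963726e-02, 9.4976342e-01],
    [9.9948221e-01, 5.1771977e-04, 1.2118203e-07],
    [8.9132594e-04, 9.9599361e-01, 3.1150738e-03],
]

predicted = [c2a[max(range(len(row)), key=row.__getitem__)] for row in probs]
print(predicted)  # ['MWS', 'EAP', 'HPL']
```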