# **This notebook's best result: val_acc is 0.8658, val_loss is 0.3467**

# **1. Few Preprocessings**
# **2. Model: FastText by Keras**

In [37]:
import numpy as np
import pandas as pd
from collections import defaultdict
import keras
from keras.layers import Dense, GlobalAveragePooling1D, Embedding
import keras.backend as K
from keras.callbacks import EarlyStopping
from keras.models import Sequential
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import Tokenizer
from keras.utils import to_categorical
from sklearn.model_selection import train_test_split

np.random.seed(7)

In [38]:
train = pd.read_csv('train.csv')
author_dict = {'EAP': 0, 'HPL' : 1, 'MWS' : 2}


In [39]:
y = np.array([author_dict[a] for a in train.author])
y = to_categorical(y)

# 1. **Few Preprocessings**

In traditional NLP tasks, preprocessings play an important role, but...

## **Low-frequency words**
In my experience, fastText is very fast, but I need to delete rare words to avoid overfitting.

**NOTE**:
Some keywords are rare words, such like *Cthulhu* in *Cthulhu Mythos* of *Howard Phillips Lovecraft*.
But these are useful for this task.

## **Removing Stopwords**

Nothing.
To identify author from a sentence, some stopwords play an important role because one has specific usages of them.

## **Stemming and Lowercase**

Nothing.
This reason is the same for stopwords removing.
And I guess some stemming rules provided by libraries is bad for this task because all author is the older author.

## **Cutting long sentence**

Too long documents are cut.

## **Punctuation**

Because I guess each author has unique punctuations's usage in the novel, I separate them from words.

e.g. `Don't worry` -> `Don ' t worry`

## **Is it slow?**

Don't worry! FastText is a very fast algorithm if it runs on CPU. 

# **Let's check character distribution per author**

In [40]:
counter = {name : defaultdict(int) for name in set(train.author)}
for (text, author) in zip(train.text, train.author):
    text = text.replace(' ', '')
    for c in text:
        counter[author][c] += 1

chars = set()
for v in counter.values():
    chars |= v.keys()
    
names = [author for author in counter.keys()]

print('c ', end='')
for n in names:
    print(n, end='   ')
print()
for c in chars:    
    print(c, end=' ')
    for n in names:
        print(counter[n][c], end=' ')
    print()


c MWS   HPL   EAP   
z 400 529 634 
è 0 0 15 
α 0 2 0 
ἶ 0 2 0 
Ο 0 3 0 
P 365 320 442 
Ν 0 1 0 
Å 0 1 0 
ê 0 2 28 
S 578 841 729 
à 0 0 10 
y 14877 12534 17001 
A 943 1167 1258 
ñ 0 7 0 
J 66 210 164 
V 57 67 156 
U 46 94 166 
: 339 47 176 
Υ 0 1 0 
c 17911 18338 24127 
b 9611 10636 13245 
I 4917 3480 4846 
E 445 281 435 
l 27819 30273 35371 
T 1230 1583 2217 
f 18351 16272 22354 
w 16062 15554 17507 
? 419 169 510 
δ 0 2 0 
L 307 249 458 
; 2662 1143 1354 
X 4 5 17 
B 395 533 835 
F 232 269 383 
ö 0 3 16 
h 43738 42770 51580 
ë 0 12 0 
v 7948 6529 9624 
i 46080 44250 60952 
W 681 732 739 
Π 0 1 0 
m 20471 17622 22792 
ô 0 0 8 
s 45962 43915 53841 
H 669 741 864 
æ 0 10 36 
p 12361 10965 17422 
Σ 0 1 0 
a 55274 56815 68525 
Q 7 10 21 
ä 0 6 1 
, 12045 8581 17594 
g 12601 14951 16088 
D 227 334 491 
n 50291 50879 62636 
e 97515 88259 114885 
K 35 176 86 
î 0 0 1 
. 5761 5908 8406 
N 204 345 411 
r 44042 40590 51221 
t 63142 62235 82426 
Æ 0 4 1 
' 476 1710 1334 
Y 234 111 282 
Z 2 51 2

# **Summary of character distribution**

- HPL and EAP used non ascii characters like a `ä`.
- The number of punctuations seems to be good feature


# **Preprocessing**

My preproceeings are 

- Separate punctuation from words
- Remove lower frequency words ( <= 2)
- Cut a longer document which contains `256` words

In [41]:
def preprocess(text):
    text = text.replace("' ", " ' ")
    signs = set(',.:;"?!')
    prods = set(text) & signs
    if not prods:
        return text

    for sign in prods:
        text = text.replace(sign, ' {} '.format(sign) )
    return text

In [42]:
def create_docs(df, n_gram_max=2):
    def add_ngram(q, n_gram_max):
            ngrams = []
            for n in range(2, n_gram_max+1):
                for w_index in range(len(q)-n+1):
                    ngrams.append('--'.join(q[w_index:w_index+n]))
            return q + ngrams
        
    docs = []
    for doc in df.text:
        doc = preprocess(doc).split()
        docs.append(' '.join(add_ngram(doc, n_gram_max)))
    
    return docs

In [43]:
min_count = 2

docs = create_docs(train)


In [44]:
docs

['This process , however , afforded me no means of ascertaining the dimensions of my dungeon ; as I might make its circuit , and return to the point whence I set out , without being aware of the fact ; so perfectly uniform seemed the wall . This--process process--, ,--however however--, ,--afforded afforded--me me--no no--means means--of of--ascertaining ascertaining--the the--dimensions dimensions--of of--my my--dungeon dungeon--; ;--as as--I I--might might--make make--its its--circuit circuit--, ,--and and--return return--to to--the the--point point--whence whence--I I--set set--out out--, ,--without without--being being--aware aware--of of--the the--fact fact--; ;--so so--perfectly perfectly--uniform uniform--seemed seemed--the the--wall wall--.',
 'It never once occurred to me that the fumbling might be a mere mistake . It--never never--once once--occurred occurred--to to--me me--that that--the the--fumbling fumbling--might might--be be--a a--mere mere--mistake mistake--.',
 'In hi

In [45]:
tokenizer = Tokenizer(lower=False, filters='')
num_words = sum([1 for _, v in tokenizer.word_counts.items() if v >= min_count])

tokenizer = Tokenizer(num_words=num_words, lower=False, filters='')
tokenizer.fit_on_texts(docs)


In [46]:
docs

['This process , however , afforded me no means of ascertaining the dimensions of my dungeon ; as I might make its circuit , and return to the point whence I set out , without being aware of the fact ; so perfectly uniform seemed the wall . This--process process--, ,--however however--, ,--afforded afforded--me me--no no--means means--of of--ascertaining ascertaining--the the--dimensions dimensions--of of--my my--dungeon dungeon--; ;--as as--I I--might might--make make--its its--circuit circuit--, ,--and and--return return--to to--the the--point point--whence whence--I I--set set--out out--, ,--without without--being being--aware aware--of of--the the--fact fact--; ;--so so--perfectly perfectly--uniform uniform--seemed seemed--the the--wall wall--.',
 'It never once occurred to me that the fumbling might be a mere mistake . It--never never--once once--occurred occurred--to to--me me--that that--the the--fumbling fumbling--might might--be be--a a--mere mere--mistake mistake--.',
 'In hi

In [35]:
docs = tokenizer.texts_to_sequences(docs)



In [36]:
docs

[[174,
  6008,
  1,
  224,
  1,
  2481,
  26,
  46,
  469,
  3,
  20045,
  2,
  4827,
  3,
  15,
  10367,
  14,
  21,
  7,
  120,
  282,
  59,
  9408,
  1,
  5,
  482,
  6,
  2,
  393,
  4601,
  7,
  533,
  106,
  1,
  206,
  182,
  1587,
  3,
  2,
  506,
  14,
  49,
  2645,
  11508,
  142,
  2,
  725,
  4,
  76598,
  20046,
  245,
  273,
  45016,
  9409,
  4206,
  1866,
  1312,
  31891,
  31892,
  76599,
  76600,
  90,
  31893,
  76601,
  4602,
  219,
  704,
  16908,
  20047,
  76602,
  76603,
  10,
  16909,
  1792,
  42,
  3239,
  76604,
  24545,
  20048,
  11509,
  2231,
  1046,
  11510,
  76605,
  3666,
  13,
  2368,
  76606,
  1313,
  31894,
  76607,
  76608,
  20049,
  1907,
  5078],
 [78,
  143,
  201,
  1264,
  6,
  26,
  12,
  2,
  16910,
  120,
  33,
  8,
  710,
  4603,
  4,
  76609,
  31895,
  76610,
  5359,
  211,
  1065,
  118,
  76611,
  76612,
  771,
  1086,
  3119,
  76613,
  20050],
 [121,
  20,
  217,
  274,
  11,
  8,
  1242,
  11511,
  982,
  1,
  28,
  23,
  1,
  2

In [13]:
maxlen = 256

docs = pad_sequences(sequences=docs, maxlen=maxlen)

# **2. Model: FastText by Keras**

FastText is very fast and strong baseline algorithm for text classification based on Continuous Bag-of-Words model a.k.a Word2vec.

FastText contains only three layers:

1. Embeddings layer: Input words (and word n-grams) are all words in a sentence/document
2. Mean/AveragePooling Layer: Taking average vector of Embedding vectors
3. Softmax layer

There are some implementations of FastText:

- Original library provided by Facebook AI research: https://github.com/facebookresearch/fastText
- Keras: https://github.com/fchollet/keras/blob/master/examples/imdb_fasttext.py
- Gensim: https://radimrehurek.com/gensim/models/wrappers/fasttext.html

Original Paper: https://arxiv.org/abs/1607.01759 : More detail information about fastText classification model

# My FastText parameters are:

- The dimension of word vector is 20
- Optimizer is `Adam`
- Inputs are words and word bi-grams
  - you can change this parameter by passing the max n-gram size to argument of `create_docs` function.


In [14]:
docs

array([[     0,      0,      0, ...,  20049,   1907,   5078],
       [     0,      0,      0, ...,   3119,  76613,  20050],
       [     0,      0,      0, ...,  76622,  31898,  24547],
       ..., 
       [     0,      0,      0, ..., 257068, 257069, 257070],
       [     0,      0,      0, ..., 257078,  21725,     95],
       [     0,      0,      0, ...,    444, 257086,  50416]])

In [15]:
input_dim = np.max(docs) + 1
embedding_dims = 20

In [16]:
model = Sequential()
model.add(Embedding(input_dim=input_dim, output_dim=embedding_dims))
model.add(GlobalAveragePooling1D())
model.add(Dense(3, activation='softmax'))

model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

In [17]:
epochs = 45
x_train, x_test, y_train, y_test = train_test_split(docs, y, test_size=0.15)

n_samples = x_train.shape[0]

hist = model.fit(x_train, y_train,
                 batch_size=16,
                 validation_data=(x_test, y_test),
                 epochs=epochs,
                 callbacks=[EarlyStopping(patience=2, monitor='val_loss')])

Train on 16642 samples, validate on 2937 samples
Epoch 1/45
Epoch 2/45
Epoch 3/45
Epoch 4/45
Epoch 5/45
Epoch 6/45
Epoch 7/45
Epoch 8/45
Epoch 9/45
Epoch 10/45
Epoch 11/45
Epoch 12/45
Epoch 13/45
Epoch 14/45
Epoch 15/45
Epoch 16/45
Epoch 17/45
Epoch 18/45
Epoch 19/45


In [19]:
test_df = pd.read_csv('test.csv')
docs = create_docs(test_df)
docs = tokenizer.texts_to_sequences(docs)
docs = pad_sequences(sequences=docs, maxlen=maxlen)
y = model.predict_proba(docs)

result = pd.read_csv('sample_submission.csv')
for a, i in a2c.items():
    result[a] = y[:, i]



In [20]:
result.head()

Unnamed: 0,id,EAP,HPL,MWS
0,id02310,0.009186,0.00778,0.9830348
1,id24541,0.999987,1.3e-05,7.818713e-09
2,id00134,0.000862,0.999047,9.093016e-05
3,id27757,0.978151,0.020689,0.001160134
4,id04081,0.538227,0.208888,0.2528852


In [23]:
result.shape

(8392, 4)

In [24]:
sample_sub = pd.read_csv('sample_submission.csv')

In [25]:
sample_sub.shape

(8392, 4)

In [22]:
result.to_csv('predictions.csv', index=False)