# Advanced NLP Project 1

In celebration of Halloween 2017, Kaggle launched a playground competition titled “Spooky Author Identification”, which tasked participants with identifying horror authors from their writing. The dataset for this NLP classification task consists of excerpts from horror stories by Edgar Allan Poe [EAP], Mary Shelley [MWS], and H. P. Lovecraft [HPL], and is pre-split into a training set (19,579 samples) and a test set (8,392 samples). Each entry consists of an id (a unique identifier for the excerpt) and a single sentence of varying length from one of the three authors; the training set also includes the author label. Example training observation:

- "id22965","A youth passed in solitude, my best years spent under your gentle and feminine fosterage, has so refined the groundwork of my character that I cannot overcome an intense distaste to the usual brutality exercised on board ship: I have never believed it to be necessary, and when I heard of a mariner equally noted for his kindliness of heart and the respect and obedience paid to him by his crew, I felt myself peculiarly fortunate in being able to secure his services.","MWS"


In [2]:
import importlib
import simple_BERT as SSB
importlib.reload(SSB)

%reload_ext autoreload
%autoreload 2

In [3]:
import numpy as np
import tokenization 
import seaborn as sns
import pandas as pd
import time
import tensorflow_hub as hub
from tensorflow.keras.utils import to_categorical
from tensorflow.keras import callbacks

sns.set_style("whitegrid")
notebookstart = time.time()
pd.options.display.max_colwidth = 500

SSB.print_versions()

Tensorflow Version:  2.10.0
TF-Hub version:  0.13.0
Eager mode enabled:  True
Instructions for updating:
Use `tf.config.list_physical_devices('GPU')` instead.
Metal device set to: Apple M2
GPU available:  True


2023-03-27 13:24:57.992126: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:306] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2023-03-27 13:24:57.992367: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:272] Created TensorFlow device (/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: <undefined>)



systemMemory: 16.00 GB
maxCacheSize: 5.33 GB



My approach to this problem leverages the popular BERT (Bidirectional Encoder Representations from Transformers) language model by fine-tuning it for this specific classification task. The BERT model was pulled from TensorFlow Hub and follows what the original paper denotes as BERT-LARGE (L=24, H=1024, A=16, total parameters=340M), where L is the number of transformer blocks (hidden layers), H is the hidden size, and A is the number of self-attention heads. The weights are those released by the original BERT authors; the model was pre-trained for English on Wikipedia and BooksCorpus.
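As a rough sanity check on those figures, the parameter count can be rebuilt from L and H. This is a minimal sketch that assumes the standard uncased BERT layout: a vocabulary of 30,522 WordPiece tokens, 512 position embeddings, 2 segment types, and a feed-forward width of 4H. None of these values are stated in this notebook, and `bert_param_count` is a hypothetical helper name.

```python
def bert_param_count(num_layers, hidden, vocab_size=30522,
                     max_positions=512, type_vocab=2):
    """Approximate BERT encoder parameter count under the assumed layout."""
    ffn = 4 * hidden          # assumed feed-forward width
    layer_norm = 2 * hidden   # scale and bias

    # Token, position and segment embeddings plus one layer norm
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm

    # Query, key, value and output projections (weights + biases)
    attention = 4 * (hidden * hidden + hidden)
    # Two dense layers of the feed-forward block (weights + biases)
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden)
    per_layer = attention + layer_norm + feed_forward + layer_norm

    # Dense layer applied to the [CLS] vector
    pooler = hidden * hidden + hidden
    return embeddings + num_layers * per_layer + pooler


total = bert_param_count(num_layers=24, hidden=1024)
print(f"{total:,}")
```

Under these assumptions the sketch gives roughly 335M parameters, close to the paper's rounded 340M. The number of heads A does not enter the count because the heads split the same H-wide projections.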


In [8]:
%%time
# Import the BERT model from TensorFlow Hub (the %%time cell magic has to be the first line of the cell)
module_url = "https://tfhub.dev/tensorflow/bert_en_uncased_L-24_H-1024_A-16/1"
bert_layer = hub.KerasLayer(module_url, trainable=True)


2023-03-27 19:30:36.739583: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:306] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2023-03-27 19:30:36.740259: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:272] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: <undefined>)


CPU times: user 29.1 s, sys: 3.62 s, total: 32.7 s
Wall time: 58.6 s


In [15]:
MAX_LEN = 64*3
BATCH_SIZE = 8
EPOCHS = 15
SEED = 42
NROWS = None
TEXTCOL = "text"
TARGETCOL = "author"
NCLASS = 3

dir = '../spooky-author-identification'

train = pd.read_csv(f"{dir}/train.csv")
test = pd.read_csv(f"{dir}/test.csv")
testdex = test.id
submission = pd.read_csv(f"{dir}/sample_submission.csv")

sub_cols = submission.columns

print("Train Shape: {} Rows, {} Columns".format(*train.shape))
print("Test Shape: {} Rows, {} Columns".format(*test.shape))

# Character (not token) lengths, computed over the train and test texts combined
length_info = [len(x) for x in np.concatenate([train[TEXTCOL].values, test[TEXTCOL].values])]
print("Train Sequence Length - Mean {:.1f} +/- {:.1f}, Max {:.1f}, Min {:.1f}".format(np.mean(length_info), np.std(length_info), np.max(length_info), np.min(length_info)))


Train Shape: 19579 Rows, 3 Columns
Test Shape: 8392 Rows, 2 Columns
Train Sequence Length - Mean 148.7 +/- 107.7, Max 4663.0, Min 21.0
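`MAX_LEN` is measured in WordPiece tokens, whereas the statistics printed above come from `len(x)` on raw strings. One way to check how many excerpts a given `MAX_LEN` would truncate is sketched below. `truncation_rate` is a hypothetical helper, and the token counts in the usage example are made up for illustration, not taken from the dataset.

```python
import numpy as np


def truncation_rate(token_lengths, max_len, num_special=2):
    """Fraction of sequences that exceed max_len once special tokens are added.

    num_special accounts for the [CLS] and [SEP] tokens that share the budget.
    """
    lengths = np.asarray(token_lengths)
    budget = max_len - num_special
    return float(np.mean(lengths > budget))


# Illustrative token counts only (not real dataset statistics)
example_lengths = [12, 40, 190, 191, 300]
print(truncation_rate(example_lengths, max_len=192))
```

In the notebook, the input could be something like `[len(tokenizer.tokenize(t)) for t in train[TEXTCOL]]` once the tokenizer has been created.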


In [25]:
print(train[TEXTCOL].values[0])
print(len(train[TEXTCOL].values[0]))
print(length_info[0])

This process, however, afforded me no means of ascertaining the dimensions of my dungeon; as I might make its circuit, and return to the point whence I set out, without being aware of the fact; so perfectly uniform seemed the wall.
231
231


In [10]:
# https://stackoverflow.com/questions/59654175/how-to-get-the-vocab-file-for-bert-tokenizer-from-tf-hub
vocab_file = bert_layer.resolved_object.vocab_file.asset_path.numpy()
do_lower_case = bert_layer.resolved_object.do_lower_case.numpy()
tokenizer = tokenization.FullTokenizer(vocab_file, do_lower_case)

In [52]:
# https://github.com/microsoft/SDNet/blob/master/bert_vocab_files/bert-base-uncased-vocab.txt
text = "Hello, how was your day? My name is Ravindra lolokfehfe"
print(text)
text = tokenizer.tokenize(text)
print(text)
text = text[:MAX_LEN-2]
print(text)
input_sequence = ["[CLS]"] + text + ["[SEP]"]
print(input_sequence)
pad_len = MAX_LEN - len(input_sequence)
tokens = tokenizer.convert_tokens_to_ids(input_sequence)
print(tokens)
tokens += [0] * pad_len
pad_masks = [1] * len(input_sequence) + [0] * pad_len
segment_ids = [0] * MAX_LEN
print(tokens)
print(pad_masks)


Hello, how was your day? My name is Ravindra lolokfehfe
['hello', ',', 'how', 'was', 'your', 'day', '?', 'my', 'name', 'is', 'ravi', '##ndra', 'lo', '##lok', '##fe', '##h', '##fe']
['hello', ',', 'how', 'was', 'your', 'day', '?', 'my', 'name', 'is', 'ravi', '##ndra', 'lo', '##lok', '##fe', '##h', '##fe']
['[CLS]', 'hello', ',', 'how', 'was', 'your', 'day', '?', 'my', 'name', 'is', 'ravi', '##ndra', 'lo', '##lok', '##fe', '##h', '##fe', '[SEP]']
[101, 7592, 1010, 2129, 2001, 2115, 2154, 1029, 2026, 2171, 2003, 16806, 17670, 8840, 29027, 7959, 2232, 7959, 102]
[101, 7592, 1010, 2129, 2001, 2115, 2154, 1029, 2026, 2171, 2003, 16806, 17670, 8840, 29027, 7959, 2232, 7959, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
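The steps in this cell are presumably what `SSB.bert_encode` applies to every excerpt, although that helper's source is not shown here. A minimal sketch of such a batch encoder follows. `encode_batch` and the toy `WhitespaceTokenizer` are hypothetical stand-ins so the example runs without the BERT vocabulary; the real `tokenization.FullTokenizer` would be passed in instead.

```python
import numpy as np


class WhitespaceTokenizer:
    """Toy stand-in for FullTokenizer (assumption: only these two methods are needed)."""

    def __init__(self):
        self.vocab = {"[PAD]": 0, "[CLS]": 1, "[SEP]": 2}

    def tokenize(self, text):
        return text.lower().split()

    def convert_tokens_to_ids(self, tokens):
        # Assign a fresh id to every token not seen before
        return [self.vocab.setdefault(tok, len(self.vocab)) for tok in tokens]


def encode_batch(texts, tokenizer, max_len=192):
    """Return (token_ids, pad_masks, segment_ids), each of shape (n, max_len)."""
    all_tokens, all_masks, all_segments = [], [], []
    for text in texts:
        # Leave room for the [CLS] and [SEP] markers
        tokens = tokenizer.tokenize(text)[: max_len - 2]
        sequence = ["[CLS]"] + tokens + ["[SEP]"]
        pad_len = max_len - len(sequence)
        all_tokens.append(tokenizer.convert_tokens_to_ids(sequence) + [0] * pad_len)
        all_masks.append([1] * len(sequence) + [0] * pad_len)
        all_segments.append([0] * max_len)  # single-sentence input: one segment
    return np.array(all_tokens), np.array(all_masks), np.array(all_segments)


ids, masks, segments = encode_batch(
    ["Hello there", "One two three four five six"],
    WhitespaceTokenizer(),
    max_len=6,
)
print(ids.shape, masks.sum(axis=1))
```

The second example sentence is longer than the budget, so it is cut to `max_len - 2` tokens and its mask contains no padding positions.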

In [11]:
# Encode the training and testing inputs with bert encoder
train_input = SSB.bert_encode(train[TEXTCOL].values, tokenizer, max_len=MAX_LEN)
test_input = SSB.bert_encode(test[TEXTCOL].values, tokenizer, max_len=MAX_LEN)

# Dictionary with author names as keys and integer class indices as values
label_mapper = {name: i for i,name in enumerate(set(train[TARGETCOL].values))}
# List of author names converted to integer with mapper
num_label = np.vectorize(label_mapper.get)(train[TARGETCOL].values)
# num_label converted from integers to one hot encoding [2 --> [0 0 1]]
train_labels = to_categorical(num_label)
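One caveat with building the mapping from `set(...)`: Python does not guarantee a stable iteration order for a set of strings across interpreter sessions, so the integer assigned to each author can change between runs. A minimal sketch of a reproducible alternative is shown below, using sorted labels and a NumPy one-hot encoding in place of `to_categorical`. `build_label_mapper` and `one_hot` are hypothetical names.

```python
import numpy as np


def build_label_mapper(labels):
    """Map each distinct label to an integer, in sorted order for reproducibility."""
    return {name: i for i, name in enumerate(sorted(set(labels)))}


def one_hot(int_labels, num_classes):
    """One-hot encode integer labels, e.g. 2 -> [0, 0, 1] for three classes."""
    return np.eye(num_classes, dtype="float32")[int_labels]


authors = ["EAP", "HPL", "EAP", "MWS"]
mapper = build_label_mapper(authors)
encoded = one_hot([mapper[a] for a in authors], num_classes=len(mapper))
print(mapper)
print(encoded)
```

With a sorted mapping the class indices stay the same every time the notebook is rerun, which makes saved predictions easier to compare across sessions.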

In [42]:
print(train_input[0][1])
print(len(train_input[0]), len(train_input[0][0]))
print(len(train_input[1]), len(train_input[0][1]))
print(len(train_input[2]), len(train_input[1][1]))
print(label_mapper)
print(len(num_label), num_label[8:20])
print(train_labels)

[  101  2009  2196  2320  4158  2000  2033  2008  1996 11865 29256  2453
  2022  1037  8210  6707  1012   102     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0   

In [12]:
print(train_input)

(array([[  101,  2023,  2832, ...,     0,     0,     0],
       [  101,  2009,  2196, ...,     0,     0,     0],
       [  101,  1999,  2010, ...,     0,     0,     0],
       ...,
       [  101, 14736,  2015, ...,     0,     0,     0],
       [  101,  2005,  2019, ...,     0,     0,     0],
       [  101,  2002,  4201, ...,     0,     0,     0]]), array([[1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0],
       ...,
       [1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0]]), array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]]))


In [13]:
# Build model and print summary
model = SSB.build_model(bert_layer, NCLASS, max_len=MAX_LEN)
model.summary()

Model: "model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_word_ids (InputLayer)    [(None, 192)]        0           []                               
                                                                                                  
 input_mask (InputLayer)        [(None, 192)]        0           []                               
                                                                                                  
 segment_ids (InputLayer)       [(None, 192)]        0           []                               
                                                                                                  
 keras_layer (KerasLayer)       [(None, 1024),       335141889   ['input_word_ids[0][0]',         
                                 (None, 192, 1024)]               'input_mask[0][0]',         

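The source of `SSB.build_model` is not included here. A common pattern, and only an assumption about what it does, is to feed the pooled `(None, 1024)` output of the BERT layer into a dense softmax layer with `NCLASS` units. The NumPy sketch below shows just that forward pass with random weights; `softmax_head` is a hypothetical name.

```python
import numpy as np


def softmax_head(pooled, weights, bias):
    """Dense layer followed by a numerically stable softmax over classes."""
    logits = pooled @ weights + bias
    logits = logits - logits.max(axis=1, keepdims=True)  # avoid overflow in exp
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)


rng = np.random.default_rng(42)
pooled = rng.normal(size=(8, 1024))               # stand-in for the BERT pooled output
weights = rng.normal(scale=0.02, size=(1024, 3))  # one column per author class
bias = np.zeros(3)

probs = softmax_head(pooled, weights, bias)
print(probs.shape, probs.sum(axis=1))
```

Each row of `probs` is a probability distribution over the three authors, which is the shape the `(8392, 3)` prediction array later in the notebook also has.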


In [16]:
checkpoint = callbacks.ModelCheckpoint('model.h5', monitor='val_loss', save_best_only=True, save_freq='epoch')
es = callbacks.EarlyStopping(monitor='val_loss', min_delta=0.0001,
                             patience=4, verbose=1, mode='min', baseline=None,
                             restore_best_weights=False)


train_history = model.fit(
    train_input, train_labels,
    validation_split=0.2,
    epochs=EPOCHS,
    callbacks=[checkpoint, es],
    batch_size=BATCH_SIZE
)

Epoch 1/15

2023-03-27 20:53:52.830221: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:114] Plugin optimizer for device_type GPU is enabled.


Epoch 2/15
Epoch 3/15
Epoch 4/15
 313/1958 [===>..........................] - ETA: 1:08:31 - loss: 0.0208 - accuracy: 0.9940

KeyboardInterrupt: 
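With `patience=4` and `min_delta=0.0001`, training stops once the validation loss has failed to improve by more than `min_delta` for four consecutive epochs. The pure-Python sketch below is a simplified illustration of that rule, not the Keras implementation. `early_stop_epoch` is a hypothetical helper and the loss values are invented.

```python
def early_stop_epoch(val_losses, patience=4, min_delta=1e-4):
    """Return (stop_epoch, best_epoch) as 0-based indices.

    stop_epoch is None if the patience limit is never reached.
    """
    best = float("inf")
    best_epoch = None
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:  # counts as an improvement
            best, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return None, best_epoch


# Invented validation losses for illustration
losses = [0.50, 0.40, 0.41, 0.42, 0.43, 0.44, 0.30]
print(early_stop_epoch(losses))
```

Because `restore_best_weights=False`, the model in memory after stopping holds the weights of the last epoch rather than the best one, so reloading the checkpoint written by `ModelCheckpoint` is what brings back the best validation-loss weights.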

In [17]:
import matplotlib.pyplot as plt

In [19]:
model.load_weights('model.h5')
test_pred = model.predict(test_input)
test_pred.shape

2023-03-28 00:05:57.600170: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:114] Plugin optimizer for device_type GPU is enabled.




(8392, 3)

In [20]:
submission = pd.DataFrame(test_pred, columns=label_mapper.keys())
submission['id'] = testdex

submission = submission[sub_cols]
submission.to_csv('submission_bert.csv', index=False)
print(submission.shape)


(8392, 4)
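The column reordering matters because `label_mapper` need not list the authors in the order the sample submission expects. A small sketch with made-up probabilities shows the idea; the ids, values, and mapping below are illustrative only.

```python
import numpy as np
import pandas as pd

# Illustrative mapping and predictions (not real model output)
label_mapper = {"MWS": 0, "EAP": 1, "HPL": 2}
test_pred = np.array([[0.7, 0.2, 0.1],
                      [0.1, 0.1, 0.8]])
test_ids = pd.Series(["id00001", "id00002"], name="id")
sub_cols = ["id", "EAP", "HPL", "MWS"]  # column order of the sample submission

# Order the author names by class index so they line up with the prediction columns
ordered_names = sorted(label_mapper, key=label_mapper.get)
submission = pd.DataFrame(test_pred, columns=ordered_names)
submission["id"] = test_ids.values
submission = submission[sub_cols]
print(submission)
```

Sorting the names by their integer index makes the alignment explicit instead of relying on the dictionary's insertion order.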


Label mapping (index: author):

- 0: MWS
- 1: EAP
- 2: HPL

In [74]:
to_submit = pd.read_csv('submission_bert.csv')
# print(to_submit)
to_submit['author'] = to_submit.iloc[:, 1:].idxmax(axis=1)
print(to_submit)
checking = pd.read_csv(f"{dir}/sub_fe.csv")
checking2 = pd.read_csv(f"{dir}/grammar_results.csv")

checking['author'] = checking.iloc[:, 1:].idxmax(axis=1)
checking2['author'] = checking2.iloc[:, 1:].idxmax(axis=1)

print(checking)
checking['author2'] = to_submit['author']
print(checking)

def compare_results(df1, df2):
    """Count the rows where two prediction frames disagree on the predicted author."""
    # zip pairs rows by position, so both frames must be in the same id order
    return sum(a != b for a, b in zip(df1.author, df2.author))

print(compare_results(checking, to_submit))
print(compare_results(checking2, to_submit))
print(compare_results(checking2, checking))

           id       EAP       HPL           MWS author
0     id02310  0.000774  0.007881  9.913454e-01    MWS
1     id24541  0.998308  0.000174  1.518073e-03    EAP
2     id00134  0.000174  0.999821  5.248776e-06    HPL
3     id27757  0.980690  0.018335  9.744923e-04    EAP
4     id04081  0.204278  0.752165  4.355776e-02    HPL
...       ...       ...       ...           ...    ...
8387  id11749  0.038030  0.009188  9.527821e-01    MWS
8388  id10526  0.910455  0.021065  6.848040e-02    EAP
8389  id13477  0.999748  0.000169  8.373301e-05    EAP
8390  id13761  0.312758  0.047380  6.398616e-01    MWS
8391  id04282  0.000062  0.999938  9.586085e-08    HPL

[8392 rows x 5 columns]
           id       EAP       HPL       MWS author
0     id02310  0.033902  0.009306  0.956793    MWS
1     id24541  0.998126  0.001162  0.000712    EAP
2     id00134  0.023430  0.972711  0.003859    HPL
3     id27757  0.708356  0.280364  0.011279    EAP
4     id04081  0.932844  0.035967  0.031188    EAP
...      