<a href="https://colab.research.google.com/github/Doris-QZ/spooky_author_identification/blob/main/2_GloVe_LSTM_Spooky_Author_Identification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Introduction

This is the second deep learning model for the 'Spooky Author Identification' project. In this notebook, I will directly load the data from my Google Drive to train the **LSTM model with GloVe Embedding**. For the EDA section, please check the notebook: [1_LSTM_Spooky_Author_Identification.ipynb](https://github.com/Doris-QZ/spooky_author_identification/blob/main/1_LSTM_Spooky_Author_Identification.ipynb).

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
# Install packages
!pip install keras_tuner

# Load Important packages
import pandas as pd
import numpy as np
import math
import re

# Modeling
import tensorflow as tf
from tensorflow import keras
from keras.models import Model
from keras.layers import Input, Embedding, SpatialDropout1D, Bidirectional, LSTM, GlobalMaxPooling1D
from keras.layers import Dropout, Dense, TextVectorization, Concatenate
from keras.optimizers import Adam
from keras.losses import SparseCategoricalCrossentropy
from keras.callbacks import EarlyStopping, ModelCheckpoint, ReduceLROnPlateau, Callback
import keras_tuner
from kerastuner.tuners import BayesianOptimization
from sklearn.model_selection import train_test_split
from sklearn.metrics import log_loss, classification_report
from sklearn.preprocessing import StandardScaler
from keras.saving import register_keras_serializable
from keras.models import load_model

Collecting keras_tuner
  Downloading keras_tuner-1.4.7-py3-none-any.whl.metadata (5.4 kB)
Collecting kt-legacy (from keras_tuner)
  Downloading kt_legacy-1.0.5-py3-none-any.whl.metadata (221 bytes)
Downloading keras_tuner-1.4.7-py3-none-any.whl (129 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m129.1/129.1 kB[0m [31m5.0 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading kt_legacy-1.0.5-py3-none-any.whl (9.6 kB)
Installing collected packages: kt-legacy, keras_tuner
Successfully installed keras_tuner-1.4.7 kt-legacy-1.0.5


  from kerastuner.tuners import BayesianOptimization


In [3]:
# Load the data
train = pd.read_csv('/content/drive/MyDrive/ColabNotebooks/Spooky_Author_Identification/train.csv')
test = pd.read_csv('/content/drive/MyDrive/ColabNotebooks/Spooky_Author_Identification/test.csv')

### LSTM with GloVe Embedding


In [4]:
# Split the training set to training and validation set
training_set, validation_set = train_test_split(train, test_size = 0.2, stratify = train['Author'], random_state = 1)

In [5]:
# load GloVe embeddings
embedding_dim = 200   # use glove.6B.200d.txt
embedding_index = {}

glove_path = '/content/drive/MyDrive/ColabNotebooks/glove.6B.200d.txt'

with open(glove_path, 'r', encoding = 'utf-8') as f:
  for line in f:
    values = line.split()
    word = values[0]
    embeddings = np.asarray(values[1:], dtype = 'float32')
    embedding_index[word] = embeddings
  f.close()

print(f'Found {len(embedding_index)} word vectors in GloVe.')

Found 400000 word vectors in GloVe.


In [6]:
# Check the number of unique words in the text data
words = re.sub(r'[^\w\s]', '', ' '.join(training_set['text']).lower()).split()
vocab_size = len(set(words))
vocab_size

23230

In [7]:
# Check the length of each text
text_length = training_set['text'].str.split().str.len()
print(text_length.describe())

# Find the 95th percentile
sequence_length = int(text_length.quantile(0.95))
print(f'95% of texts have {sequence_length} words or fewer.')


count    15663.000000
mean        26.726553
std         19.472995
min          2.000000
25%         15.000000
50%         23.000000
75%         34.000000
max        861.000000
Name: text, dtype: float64
95% of texts have 58 words or fewer.


The length of the text data has a wide range, from 2 to 861, with a mean of 27 and a median of 23. Since 95% of texts have 58 words or fewer, I will set the sequence length in the TextVectorization to be 58.

In [8]:
# Create the text vectorizer
vectorizer = TextVectorization(
    max_tokens = vocab_size,
    output_mode = 'int',
    output_sequence_length = sequence_length
)

# Build a vocabulary of all string tokens seen in the training_set['text']
vectorizer.adapt(training_set['text'].values)

# Vectorize the data
training_text = vectorizer(training_set['text'].values)
validation_text = vectorizer(validation_set['text'].values)

In [9]:
# Get the vocabulary index from the vectorizer
word_index = vectorizer.get_vocabulary()
vocab_size = len(word_index)
print(f'Vocabulary size: {vocab_size}')


Vocabulary size: 23230


In [10]:
embedding_matrix = np.zeros((vocab_size + 1, embedding_dim))

for i, word in enumerate(word_index):
  embedding_vector = embedding_index.get(word)
  if embedding_vector is not None:
    embedding_matrix[i] = embedding_vector

#### The first GloVe_LSTM model

In [None]:
# An integer input for vocabulary indices
inputs = Input(shape = (None, ), dtype = 'int64')

# Embedding layer
x = Embedding(vocab_size + 1, embedding_dim, weights = [embedding_matrix])(inputs)

# Embedding dropout
x = SpatialDropout1D(0.1)(x)

# Bidirectional LSTM layer
x = Bidirectional(LSTM(512, recurrent_dropout = 0.2, return_sequences=True))(x)

# Maxpooling layer
x = GlobalMaxPooling1D()(x)

# Dropout
x = Dropout(0.5)(x)

# Output
outputs = Dense(3, activation = 'softmax')(x)

bi_lstm1 = Model(inputs, outputs)
bi_lstm1.summary()

In [None]:
# Compile the model with optimizer, loss function, and metrics
bi_lstm1.compile(optimizer = Adam(learning_rate = 0.001),
    loss = SparseCategoricalCrossentropy(),
    metrics = ['accuracy']
)

I'll define a custom callback called `TemporalAveraging` that performs temporal averaging of model weights at the epoch level. It maintains an exponential moving average (EMA) during training and loads the averaged weights into the model at the end.

In [11]:
@register_keras_serializable()
class TemporalAveraging(Callback):
    def __init__(self, beta=0.99):
        super().__init__()
        self.beta = beta
        self.ema_weights = None
        self.epochs = 0

    # Initialize EMA weights at the start of training
    def on_train_begin(self, logs=None):
        self.ema_weights = [w.numpy() for w in self.model.trainable_weights]
        self.epochs = 0

    # Update EMA weights after each epoch
    def on_epoch_end(self, epoch, logs=None):
        self.epochs += 1
        for i, w in enumerate(self.model.trainable_weights):
            current_w = w.numpy()
            self.ema_weights[i] = (
                self.beta * self.ema_weights[i] + (1.0 - self.beta) * current_w
            )

    # Load the averaged weights into the model at the end of training
    def on_train_end(self, logs=None):
        # Correction to counteract bias towards zero at the start
        correction = 1.0 - self.beta ** self.epochs if self.epochs > 0 else 1.0
        corrected_weights = [w / correction for w in self.ema_weights]

        # Update model weights with averaged weights
        for w, avg_w in zip(self.model.trainable_weights, corrected_weights):
            w.assign(avg_w)

    # Returns the config of the callback. Enables serialization and deserialization
    def get_config(self):
        config = super().get_config()
        config.update({
            'beta': self.beta,
        })
        return config

In [12]:
# Define the other callbacks
early_stoppig = EarlyStopping(
    monitor = 'val_accuracy',
    patience = 5,
    verbose = 1,
    restore_best_weights = True
)

check_point = ModelCheckpoint(
    filepath = '/content/drive/MyDrive/ColabNotebooks/Spooky_Author_Identification/bi_lstm1.keras',
    monitor = 'val_accuracy',
    save_best_only = True
)

reduced_lr = ReduceLROnPlateau(
    monitor = 'val_loss',
    factor = 0.5,
    patience = 3,
    verbose = 1
)

In [None]:
history1 = bi_lstm1.fit(
    training_text,
    training_set['author_encoded'],
    steps_per_epoch = math.ceil(training_set.shape[0] / 64),
    batch_size = 64,
    validation_data = (validation_text, validation_set['author_encoded']),
    validation_steps = math.ceil(validation_set.shape[0] / 64),
    epochs = 20,
    callbacks = [early_stoppig, check_point, reduced_lr, TemporalAveraging()]
)

Epoch 1/20
[1m245/245[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m106s[0m 431ms/step - accuracy: 0.7815 - loss: 0.5408 - val_accuracy: 0.8138 - val_loss: 0.4591 - learning_rate: 0.0010
Epoch 2/20
[1m245/245[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m127s[0m 370ms/step - accuracy: 0.8842 - loss: 0.3000 - val_accuracy: 0.8243 - val_loss: 0.4311 - learning_rate: 0.0010
Epoch 3/20
[1m245/245[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m141s[0m 368ms/step - accuracy: 0.9431 - loss: 0.1620 - val_accuracy: 0.8401 - val_loss: 0.4463 - learning_rate: 0.0010
Epoch 4/20
[1m245/245[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m141s[0m 363ms/step - accuracy: 0.9647 - loss: 0.1028 - val_accuracy: 0.8384 - val_loss: 0.5335 - learning_rate: 0.0010
Epoch 5/20
[1m245/245[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 342ms/step - accuracy: 0.9791 - loss: 0.0596
Epoch 5: ReduceLROnPlateau reducing learning rate to 0.0005000000237487257.
[1m245/245[0m [32m━━━━━━━━━━━━━━━━━━━━

Early stopping was triggered at epoch 8, with the best validation accuracy of 0.84 occurring at epoch 3. The training accuracy reached 0.99 in the end, while the validation accuracy stayed at 0.83, indicating overfitting.


#### GloVe_LSTM with linguistic features

I will add linguistic features to the model and see if it improves the performance.

In [13]:
# Extract linguistic features
lingu_train = training_set[['gunning_fog', 'sent_len', 'word_len', 'noun_freq',
            'verb_freq', 'adj_freq', 'adv_freq', 'funct_word', 'type_token']]
lingu_validation = validation_set[['gunning_fog', 'sent_len', 'word_len', 'noun_freq',
            'verb_freq', 'adj_freq', 'adv_freq', 'funct_word', 'type_token']]

# Normalize the linguistic features
scaler = StandardScaler()
lingu_train = scaler.fit_transform(lingu_train)
lingu_validation = scaler.transform(lingu_validation)

In [None]:
# Define the input layer for text
input_text = Input(shape = (None, ), dtype = 'int64')

# Define the input layer for linguistic features
input_lingu = Input(shape = (9, ))

# Embedding layer
x = Embedding(vocab_size + 1, embedding_dim, weights = [embedding_matrix])(input_text)

# Embedding dropout
x = SpatialDropout1D(0.1)(x)

# Bidirectional LSTM layer
x = Bidirectional(LSTM(512, recurrent_dropout = 0.2, return_sequences=True))(x)

# Maxpooling layer
x = GlobalMaxPooling1D()(x)

# Concatenate maxpooling layer and the linguistic features
x = Concatenate()([x, input_lingu])

# Dropout
x = Dropout(0.5)(x)

# Output
outputs = Dense(3, activation = 'softmax')(x)

bi_lstm2 = Model([input_text, input_lingu], outputs)
bi_lstm2.summary()

In [None]:
# Compile the model with optimizer, loss function, and metrics
bi_lstm2.compile(optimizer = Adam(learning_rate = 0.001),
    loss = SparseCategoricalCrossentropy(),
    metrics = ['accuracy']
)

# Define new callbacks
check_point2 = ModelCheckpoint(
    filepath = '/content/drive/MyDrive/ColabNotebooks/Spooky_Author_Identification/bi_lstm2.keras',
    monitor = 'val_accuracy',
    save_best_only = True
)

history2 = bi_lstm2.fit(
    [training_text, lingu_train],
    training_set['author_encoded'],
    steps_per_epoch = math.ceil(training_set.shape[0] / 64),
    batch_size = 64,
    validation_data = ([validation_text, lingu_validation], validation_set['author_encoded']),
    validation_steps = math.ceil(validation_set.shape[0] / 64),
    epochs = 20,
    callbacks = [early_stoppig, check_point2, reduced_lr, TemporalAveraging()]
)

Epoch 1/20
[1m245/245[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m118s[0m 455ms/step - accuracy: 0.5748 - loss: 0.8908 - val_accuracy: 0.7666 - val_loss: 0.5775 - learning_rate: 0.0010
Epoch 2/20
[1m245/245[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m124s[0m 383ms/step - accuracy: 0.8082 - loss: 0.4846 - val_accuracy: 0.8340 - val_loss: 0.4333 - learning_rate: 0.0010
Epoch 3/20
[1m245/245[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m140s[0m 375ms/step - accuracy: 0.9024 - loss: 0.2651 - val_accuracy: 0.8468 - val_loss: 0.4090 - learning_rate: 0.0010
Epoch 4/20
[1m245/245[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m146s[0m 393ms/step - accuracy: 0.9509 - loss: 0.1393 - val_accuracy: 0.8447 - val_loss: 0.4593 - learning_rate: 0.0010
Epoch 5/20
[1m245/245[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m142s[0m 392ms/step - accuracy: 0.9722 - loss: 0.0796 - val_accuracy: 0.8261 - val_loss: 0.5785 - learning_rate: 0.0010
Epoch 6/20
[1m245/245[0m [32m━━━━━━━━━━━━━━━━━━

Early stopping was triggered at epoch 8, with the best weights occurring at epoch 3 again. The model performance improved slightly compared to the first model:

* val_accuracy: 0.8468 vs. 0.8401
* val_loss: 0.409 vs. 0.4463

However, the overfitting problem is still existing.

### Tune the model

I will tune the following hyperparameters of the second model(Bidirectional LSTM with linguistic features) using `BayesianOptimization` from keras_tunner:

* Embedding dropout rate
* Number of units in LSTM layer
* Recurrent dropout rate in LSTM layer
* Max-pooled output dropout rate
* Learning rate

In [None]:
def build_model(hp):
    input_text = Input(shape = (None, ), dtype = 'int64')
    input_lingu = Input(shape = (9, ))
    x = Embedding(vocab_size + 1, embedding_dim, weights = [embedding_matrix])(input_text)
    x = SpatialDropout1D(rate = hp.Choice('embedding_dropout',
                                          values = [0.1, 0.2, 0.3]))(x)
    x = Bidirectional(LSTM(units = hp.Choice('lstm_units',
                                             values = [64, 128, 256, 512]),
                           recurrent_dropout = hp.Choice('recurrent_dropout',
                                                         values = [0.2, 0.3, 0.4]),
                           return_sequences=True))(x)
    x = GlobalMaxPooling1D()(x)
    x = Concatenate()([x, input_lingu])
    x = Dropout(rate = hp.Choice('max_pool_dropout',
                                 values = [0.4, 0.5]))(x)
    outputs = Dense(3, activation = 'softmax')(x)
    model = Model([input_text, input_lingu], outputs)
    model.compile(optimizer = Adam(learning_rate = hp.Choice('learning_rate',
                                                             values = [0.001, 0.0001])),
                  loss = SparseCategoricalCrossentropy(),
                  metrics = ['accuracy'])
    return model

In [None]:
# Create the tuner
tuner = BayesianOptimization(
    build_model,
    objective = 'val_accuracy',
    max_trials = 30
)

# Search for the best hyperparameters
tuner.search(
    [training_text, lingu_train],
    training_set['author_encoded'],
    steps_per_epoch = math.ceil(training_set.shape[0] / 64),
    batch_size = 64,
    validation_data = ([validation_text, lingu_validation], validation_set['author_encoded']),
    validation_steps = math.ceil(validation_set.shape[0] / 64),
    epochs = 10,
    callbacks = [early_stoppig, check_point, reduced_lr, TemporalAveraging()]
)

Trial 21 Complete [00h 18m 17s]
val_accuracy: 0.8416751623153687

Best val_accuracy So Far: 0.8485699892044067
Total elapsed time: 07h 00m 07s

Search: Running Trial #22

Value             |Best Value So Far |Hyperparameter
0.1               |0.3               |embedding_dropout
512               |512               |lstm_units
0.2               |0.3               |recurrent_dropout
0.4               |0.4               |max_pool_dropout
0.001             |0.001             |learning_rate

Epoch 1/10
[1m245/245[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m102s[0m 387ms/step - accuracy: 0.5767 - loss: 0.8934 - val_accuracy: 0.7778 - val_loss: 0.5705 - learning_rate: 0.0010
Epoch 2/10
[1m245/245[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m96s[0m 392ms/step - accuracy: 0.8219 - loss: 0.4565 - val_accuracy: 0.8080 - val_loss: 0.4685 - learning_rate: 0.0010
Epoch 3/10
[1m245/245[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m145s[0m 404ms/step - accuracy: 0.9083 - loss: 0.2488 - va

The runtime of Google Colab was disconnected after 7 hours of tuning, during trial 22. Since the 'Best val_accuracy' had been unchanged for a long time, I will just use the 'Best Value So Far' for the hyperparameters and retrain the model accordingly.

In [None]:
# Define the input layer for text
input_text = Input(shape = (None, ), dtype = 'int64')

# Define the input layer for linguistic features
input_lingu = Input(shape = (9, ))

# Embedding layer
x = Embedding(vocab_size + 1, embedding_dim, weights = [embedding_matrix])(input_text)

# Embedding dropout = 0.3
x = SpatialDropout1D(0.3)(x)

# Bidirectional LSTM layer (lstm_units = 512, recurrent_dropout = 0.3)
x = Bidirectional(LSTM(512, recurrent_dropout = 0.3, return_sequences=True))(x)

# Maxpooling layer
x = GlobalMaxPooling1D()(x)

# Concatenate maxpooling layer and the linguistic features
x = Concatenate()([x, input_lingu])

# max_pool_dropout = 0.4
x = Dropout(0.4)(x)

# Output
outputs = Dense(3, activation = 'softmax')(x)

tuned_lstm = Model([input_text, input_lingu], outputs)
tuned_lstm.summary()


In [None]:
# Compile the model with optimizer, loss function, and metrics
tuned_lstm.compile(optimizer = Adam(learning_rate = 0.001),
    loss = SparseCategoricalCrossentropy(),
    metrics = ['accuracy']
)

# Define new callbacks
check_point_tuned = ModelCheckpoint(
    filepath = '/content/drive/MyDrive/ColabNotebooks/Spooky_Author_Identification/tuned_glove_lstm.keras',
    monitor = 'val_accuracy',
    save_best_only = True
)

history_tuned_lstm = tuned_lstm.fit(
    [training_text, lingu_train],
    training_set['author_encoded'],
    steps_per_epoch = math.ceil(training_set.shape[0] / 64),
    batch_size = 64,
    validation_data = ([validation_text, lingu_validation], validation_set['author_encoded']),
    validation_steps = math.ceil(validation_set.shape[0] / 64),
    epochs = 20,
    callbacks = [early_stoppig, check_point_tuned, reduced_lr, TemporalAveraging()]
)

Epoch 1/20
[1m245/245[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m90s[0m 332ms/step - accuracy: 0.5543 - loss: 0.9236 - val_accuracy: 0.7503 - val_loss: 0.6120 - learning_rate: 0.0010
Epoch 2/20
[1m245/245[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m152s[0m 381ms/step - accuracy: 0.7694 - loss: 0.5556 - val_accuracy: 0.8212 - val_loss: 0.4644 - learning_rate: 0.0010
Epoch 3/20
[1m245/245[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m91s[0m 372ms/step - accuracy: 0.8622 - loss: 0.3558 - val_accuracy: 0.8386 - val_loss: 0.4087 - learning_rate: 0.0010
Epoch 4/20
[1m245/245[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m85s[0m 347ms/step - accuracy: 0.9126 - loss: 0.2344 - val_accuracy: 0.8338 - val_loss: 0.4560 - learning_rate: 0.0010
Epoch 5/20
[1m245/245[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m84s[0m 344ms/step - accuracy: 0.9448 - loss: 0.1549 - val_accuracy: 0.8419 - val_loss: 0.4394 - learning_rate: 0.0010
Epoch 6/20
[1m245/245[0m [32m━━━━━━━━━━━━━━━━━━━━[

The tuned_lstm model performs slightly worse than bi_lstm2 in terms of val_accuracy(0.8442 vs. 0.8468) and val_loss (0.4962 vs. 0.4090). I will print out the classification report of both models for further comparison.

In [14]:
tuned_glove_lstm =load_model('/content/drive/MyDrive/ColabNotebooks/Spooky_Author_Identification/tuned_glove_lstm.keras')

In [15]:
# Print the classification report  of the model's performance on the validation set
pred_prob = tuned_glove_lstm.predict([validation_text, lingu_validation], verbose = 0)
pred = np.argmax(pred_prob, axis = 1)
print('Model: tuned_glove_lstm:')
print(classification_report(validation_set['author_encoded'], pred))


              precision    recall  f1-score   support

           0       0.81      0.88      0.84      1580
           1       0.87      0.82      0.84      1209
           2       0.87      0.82      0.84      1127

    accuracy                           0.84      3916
   macro avg       0.85      0.84      0.84      3916
weighted avg       0.85      0.84      0.84      3916



In [14]:
bi_lstm2 = load_model('/content/drive/MyDrive/ColabNotebooks/Spooky_Author_Identification/bi_lstm2.keras')

# Print the classification report  of the model's performance on the validation set
pred_prob = bi_lstm2.predict([validation_text, lingu_validation], verbose = 0)
pred = np.argmax(pred_prob, axis = 1)
print('Model: bi_lstm2:')
print(classification_report(validation_set['author_encoded'], pred))

Model: bi_lstm2:
              precision    recall  f1-score   support

           0       0.84      0.85      0.84      1580
           1       0.83      0.87      0.85      1209
           2       0.88      0.82      0.85      1127

    accuracy                           0.85      3916
   macro avg       0.85      0.85      0.85      3916
weighted avg       0.85      0.85      0.85      3916



The precision of both models is the same, while the recall, F1-score and accuracy of bi_lstm2 are slightly higher than those of the tuned_glove_lstm model.

Next, I will make prediction on the test set using both models and then submit to Kaggle.

In [15]:
# Prepare test data
test_text = vectorizer(test['text'].values)
lingu_test = test[['gunning_fog', 'sent_len', 'word_len', 'noun_freq',
            'verb_freq', 'adj_freq', 'adv_freq', 'funct_word', 'type_token']]

# Normalize the linguistic features
lingu_test = scaler.fit_transform(lingu_test)

In [17]:
# Make prediction using tuned_glove_lstm
prediction = tuned_glove_lstm.predict([test_text, lingu_test], verbose = 0)

glove_lstm_prediction = pd.DataFrame(prediction, columns = ['EAP', 'MWS', 'HPL'])
glove_lstm_prediction = pd.concat([test['id'], glove_lstm_prediction], axis = 1)
glove_lstm_prediction = glove_lstm_prediction[['id', 'EAP', 'HPL', 'MWS']]
glove_lstm_prediction.head()

Unnamed: 0,id,EAP,HPL,MWS
0,id02310,0.00226,0.000831,0.9969086
1,id24541,0.999772,0.000223,5.334878e-06
2,id00134,0.000218,0.999781,6.477031e-07
3,id27757,0.983829,0.014512,0.001659929
4,id04081,0.546374,0.125165,0.3284606


In [None]:
glove_lstm_prediction.to_csv('/content/drive/MyDrive/ColabNotebooks/Spooky_Author_Identification/glove_lstm_prediction.csv',
                             index = False)

In [16]:
# Make prediction using bi_lstm2
prediction = bi_lstm2.predict([test_text, lingu_test], verbose = 0)

glove_lstm_prediction2 = pd.DataFrame(prediction, columns = ['EAP', 'MWS', 'HPL'])
glove_lstm_prediction2 = pd.concat([test['id'], glove_lstm_prediction2], axis = 1)
glove_lstm_prediction2 = glove_lstm_prediction2[['id', 'EAP', 'HPL', 'MWS']]
glove_lstm_prediction2.head()

Unnamed: 0,id,EAP,HPL,MWS
0,id02310,0.0214,0.010359,0.968241
1,id24541,0.994623,0.00477,0.000607
2,id00134,0.001531,0.99845,1.9e-05
3,id27757,0.703209,0.28678,0.010011
4,id04081,0.279795,0.062369,0.657836


In [17]:
glove_lstm_prediction2.to_csv('/content/drive/MyDrive/ColabNotebooks/Spooky_Author_Identification/glove_lstm_prediction2.csv',
                             index = False)

After submitting to Kaggle, I got a private score of 0.5007 for the tuned_glove_lstm model and a private score of 0.4169 for the bi_lstm2 model (Log-loss).