# Model Training Journal - News Classification
**Name:** Linfeng Liu  
**Email:** linfeng.liu@mail.mcgill.ca 
**Kaggle:** https://www.kaggle.com/c/hw2-ycbs-273-intro-to-prac-ml/overview 

# 1. 12th-Aug to 14th-Aug work on bag of words model
**Steps:** 
1. Loading and Dataset Partitioning
2. Adding Dense layer  
**Conclusion:**  
1. The bag-of-words model reaches a validation accuracy of 0.91 but does not go higher.   
2. However I tune the dropout layers, the model still overfits.  
**Experiment Notes:**  
1. LayerNormalization makes the model converge more steadily.  
2. 'relu' and 'tanh' activations have almost the same effect on the outcome.  
3. Regularization did not work well.  
4. The learning rate can start large, for example 5e-4, but once training approaches convergence it should be reduced, for example to 5e-6.  
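
The learning-rate note above can be sketched as a two-phase schedule. This is only an illustration: the switch epoch of 10 is a hypothetical value, since the journal does not record when the rate was lowered, and the names `two_phase_lr`, `HIGH_LR`, `LOW_LR` and `SWITCH_EPOCH` are made up for this example.

```python
# Two-phase learning-rate schedule: large steps first, small steps near convergence.
HIGH_LR = 5e-4      # value from the note above
LOW_LR = 5e-6       # value from the note above
SWITCH_EPOCH = 10   # hypothetical: the journal does not say when to switch


def two_phase_lr(epoch, lr=None):
    """Return the learning rate to use for a given (0-based) epoch.

    The second argument (the current rate) is accepted but ignored, so the
    function can be called either as f(epoch) or as f(epoch, lr).
    """
    return HIGH_LR if epoch < SWITCH_EPOCH else LOW_LR


# Learning rate for each of the first 12 epochs
schedule = [two_phase_lr(epoch) for epoch in range(12)]
print(schedule)
```

A function of this shape could be handed to `keras.callbacks.LearningRateScheduler` and passed to `model.fit` through `callbacks`, instead of recompiling the model by hand with a smaller rate.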

## 1.1 Loading

In [None]:
import matplotlib.pyplot as plt
import tensorflow as tf
import zipfile
import pandas as pd
import numpy as np

from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization

Download the data from Competition2 on Kaggle and upload the compressed file (data_v2.zip) to the Colab workspace,<br>
then extract it with the following code:<br>
&emsp;**with zipfile.ZipFile('/content/data_v2.zip', 'r') as zip_ref:**

In [None]:
# Execute this only in colab after loading the 'data_v2.zip' in the workspace

with zipfile.ZipFile('/content/data_v2.zip', 'r') as zip_ref:
    zip_ref.extractall('data_v2')

Dataset partitioning: use **keras.preprocessing.text_dataset_from_directory** to import the data from the 'train' directory. We set **batch_size to 512** and the **seed to 1337** (the seed fixes the random number generator, so the shuffle and the train/validation split are the same on every run).<br>
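
Why the same seed must be used for both subsets can be shown with a small sketch. It assumes the split works by shuffling the file list with the seed and cutting it at the split point; this is a simplification for illustration, not the Keras implementation, and `seeded_split` and the file names are hypothetical.

```python
import random


def seeded_split(items, validation_split, subset, seed):
    """Shuffle items with a fixed seed, then return one side of the split."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)   # same seed -> same order every call
    n_val = int(len(shuffled) * validation_split)
    cut = len(shuffled) - n_val
    return shuffled[:cut] if subset == "training" else shuffled[cut:]


files = [f"news_{i}.txt" for i in range(10)]   # hypothetical file names
train = seeded_split(files, 0.3, "training", seed=1337)
val = seeded_split(files, 0.3, "validation", seed=1337)

print(len(train), len(val))      # sizes of the two subsets
print(set(train) & set(val))     # files that appear in both subsets
```

Because both calls shuffle in the same order, the two subsets never share a file. With two different seeds the orders differ, and the same file could land in both training and validation.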


In [None]:
# Loading the dataset from the 'train' directory

batch_size = 512
seed = 1337 # Keep the seed same for both 'train' & 'validation' to avoid overlap

train_ds = keras.preprocessing.text_dataset_from_directory(
    "/content/data_v2/train", 
    batch_size=batch_size,
    label_mode='int',
    validation_split=0.3,
    subset='training',
    seed=seed)

val_ds = keras.preprocessing.text_dataset_from_directory(
    "/content/data_v2/train",
    batch_size=batch_size,
    label_mode='int',
    validation_split=0.3,
    subset='validation',
    seed=seed)

text_only_train_ds = train_ds.map(lambda x, y: x)

Found 120000 files belonging to 4 classes.
Using 84000 files for training.
Found 120000 files belonging to 4 classes.
Using 36000 files for validation.


## 1.2 Text vectorization


Create a TextVectorization instance that uses n-grams up to 2 (unigrams and bigrams) and **'count'** output mode. Note that **'text_vectorization'** can also be used as a Keras layer; we will use this during prediction on the test data.
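
What a 2-gram 'count' vector contains can be sketched in plain Python. This is a simplified illustration: it only lowercases and splits on whitespace, whereas the real layer also strips punctuation and caps the vocabulary at max_tokens. The function name `ngram_counts` and the sample sentence are made up for this example.

```python
from collections import Counter


def ngram_counts(text, max_n=2):
    """Count every n-gram of length 1..max_n in a whitespace-tokenized text."""
    tokens = text.lower().split()
    grams = []
    for n in range(1, max_n + 1):
        grams += [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return Counter(grams)


counts = ngram_counts("The dollar and the euro")

# Hypothetical vocabulary order; each vocabulary entry gets one vector slot
vocabulary = sorted(counts)
vector = [counts[term] for term in vocabulary]

print(vocabulary)
print(vector)
```

Each document becomes one fixed-length vector of counts, so word order beyond neighbouring pairs is lost. That is why this representation is called a bag of words.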

In [None]:
# max_length = 50
max_tokens = 20000
text_vectorization = TextVectorization(
    ngrams=2,
    output_mode="count",
    max_tokens=max_tokens,
)

# Fit it on the train dataset
text_vectorization.adapt(text_only_train_ds)

# Map the vocabulary on the 'train' and 'validation' sets

count_train_ds = train_ds.map(lambda x, y: (text_vectorization(x), y))
count_val_ds = val_ds.map(lambda x, y: (text_vectorization(x), y))


In [None]:
# Printing few samples of the raw data

for text_batch, label_batch in train_ds.take(1):
  for i in range(10):
    print("News: ", text_batch.numpy()[i])
    print("Label:", label_batch.numpy()[i])

News:  b'IBM Plans Web Meeting Service, Takes Aim at WebEx (Reuters)Reuters - IBM  plans to offer\\Web-conferencing as a hosted Internet service, seeking to reach\\small and medium-sized business customers while taking on more\\established rivals in the market, the company said on Tuesday.'
Label: 3
News:  b"Dollar Clings to Gains Vs Euro LONDON (Reuters) - The dollar retained most of the previous  session's gains against the euro on Monday after a positive  U.S. jobs report last week reinforced expectations for an  interest rate rise later this month."
Label: 2
News:  b'History promises memorable Chennai encounterOver the years, the ground has produced some of the most endearing moments between the two countries. In the 1969-70 series, the match produced a shoot-out between two of the best post-war '
Label: 1
News:  b'AOL Tests Desktop Search (PC World)PC World - Upcoming browser will feature tools for finding files on your PC.'
Label: 3
News:  b'Belgian Grand Prix, FridayWith rain fo

***Note:*** Each label corresponds to a news topic.
Judging from the samples above, for example, label 2 appears to be business news.

In [None]:
# Retrieve a batch (of 512 news and labels) from the dataset and printing 1 sample

text_batch, label_batch = next(iter(train_ds))
first_news, first_label = text_batch[0], label_batch[0]
print("News", first_news)
print("Label", first_label)

News tf.Tensor(b'Air Force GPS Satellite Roars Into Space (AP)AP - After a series of delays, a Boeing Delta 2 rocket carrying a Global Positioning System satellite for the Air Force roared into space early Saturday.', shape=(), dtype=string)
Label tf.Tensor(3, shape=(), dtype=int32)


In [None]:
# Helper function for using 'text_vectorization'

def count_vectorize_text(text, label):
  text = tf.expand_dims(text, -1)
  return text_vectorization(text), label

In [None]:
# Printing out vectorized text data using 'text_vectorization' layer

print("'count' vectorized question:",
      count_vectorize_text(first_news, first_label)[0])

'count' vectorized question: tf.Tensor([[22.  1.  0. ...  0.  0.  0.]], shape=(1, 20000), dtype=float32)


## 1.3 Bag of words modelling

In [None]:
from tensorflow.keras.optimizers import RMSprop
from tensorflow.keras import regularizers

**In our tests, adding dense layers reduced the loss to a certain extent, but beyond three layers the improvement was no longer significant.**

For the parameter settings, we reuse those of the Competition1 model: the first Dense layer has 256 neurons, the optimizer is RMSprop, and the learning rate is 0.005 (a value that worked well when we tuned the Competition1 model).
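
A rough parameter count helps explain why this model overfits so easily. The sketch below counts only the weights and biases of the Dense layers (20000 inputs, then 256, 128 and 4 units, as in the model of this section) and ignores the BatchNormalization parameters; `dense_params` is a helper written for this example.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases of one fully connected layer."""
    return n_inputs * n_units + n_units


# Input size followed by the number of units in each Dense layer
layer_sizes = [20000, 256, 128, 4]

per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(per_layer)

print(per_layer)   # [5120256, 32896, 516]
print(total)       # 5153668
```

Almost all of the roughly 5.15 million parameters sit in the first layer, while the training split has 84000 documents, so the network has far more free parameters than examples.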

In [None]:
inputs = keras.Input(shape=(max_tokens,))
x = layers.Dense(256)(inputs) # ,kernel_regularizer=regularizers.l2(0.0001)
x = layers.BatchNormalization()(x)
x = layers.Activation('relu')(x)
x = layers.Dropout(0.3)(x)

x = layers.Dense(128)(x)
x = layers.BatchNormalization()(x)
x = layers.Activation('relu')(x)  
x = layers.Dropout(0.6)(x)

outputs = layers.Dense(4, activation="softmax")(x)
model = keras.Model(inputs, outputs)

model.compile(optimizer=RMSprop(learning_rate=0.005),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])


In [None]:
callbacks = [
    keras.callbacks.EarlyStopping(monitor="val_loss",
                                  patience=2),
    keras.callbacks.ModelCheckpoint("bow_2grams_1.keras",
                                    save_best_only=True)
]

In [None]:
# Train the model and use validation ds for early stopping and model saving
# (no batch_size argument: count_train_ds is already batched with batch_size=512)

history_bow_2grams_1 = model.fit(count_train_ds,
                                 validation_data=count_val_ds,
                                 epochs=50,
                                 callbacks=callbacks)
model = keras.models.load_model("bow_2grams_1.keras")
# Note: this is the accuracy on the validation set, not on the Kaggle test set
print(f"Test acc: {model.evaluate(count_val_ds)[1]:.3f}")

Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Test acc: 0.910


# 2. 17th-Aug work on Sequence modelling
**Steps:**<br>
&emsp;1. Sequence Modeling<br>
&emsp;2. Adding **Embedding Layer**  
**Conclusion:**  
The outcome on Kaggle is still not good. I tried many variations, but none improved the prediction performance.    
**Experiment Note:**  
1. Adding more Dense layers has no positive effect on the model and makes training much slower.  
2. Dropout is necessary, but dropout alone is not enough to remove the overfitting.  

## 2.1 TextVectorization

In [None]:
from tensorflow.keras.optimizers import RMSprop

**Preparing for sequence modeling:** we use the **TextVectorization** layer with integer output.

In [None]:
max_length = 600
max_tokens = 20000
text_vectorization = TextVectorization(
    max_tokens=max_tokens,
    output_mode="int",
    output_sequence_length=max_length,
)
text_vectorization.adapt(text_only_train_ds)

int_train_ds = train_ds.map(lambda x, y: (text_vectorization(x), y))
int_val_ds = val_ds.map(lambda x, y: (text_vectorization(x), y))

#### Buffer 
The code in this buffer cell belongs to a deprecated model, so I won't go into detail here.

In [None]:
'''
class TransformerEncoder(layers.Layer):
    def __init__(self, embed_dim, dense_dim, num_heads, **kwargs):
        super().__init__(**kwargs)
        self.embed_dim = embed_dim
        self.dense_dim = dense_dim
        self.num_heads = num_heads
        self.attention = layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=embed_dim)

        self.dense_proj = keras.Sequential(
            [layers.Dense(dense_dim, activation="relu"),
             layers.Dense(embed_dim),]
        )
        self.layernorm_1 = layers.LayerNormalization()
        self.layernorm_2 = layers.LayerNormalization()

    def call(self, inputs, mask=None):
        if mask is not None:
            mask = mask[:, tf.newaxis, :]
        attention_output = self.attention(
            inputs, inputs, attention_mask=mask)
        proj_input = self.layernorm_1(inputs + attention_output)
        proj_output = self.dense_proj(proj_input)
        return self.layernorm_2(proj_input + proj_output)

    def get_config(self):
        config = super().get_config()
        config.update({
            "embed_dim": self.embed_dim,
            "num_heads": self.num_heads,
            "dense_dim": self.dense_dim,
        })
        return config
        '''

An embedding maps each token id to a dense vector. It can be understood as a form of dimensionality reduction (vector mapping) and is one of the basic operations of any deep learning framework. Mapping the high-dimensional, sparse input to a low-dimensional space solves the problem of sparse input data.<br>
That is why we set both the embed_dim and dense_dim parameters here.

In [None]:
vocab_size = 20000
embed_dim = 256
num_heads = 2
dense_dim = 32

In [None]:
dense_proj = keras.Sequential(
            [layers.Dense(dense_dim, activation="relu"),   # With the new model, where activation= 'relu'
             layers.Dense(embed_dim),]
        )

In [None]:
embedding_layer = layers.Embedding(input_dim=max_tokens, output_dim=256)

## 2.2 Adding Embedding Layer

First, some background on embeddings, to motivate adding an embedding layer. The embedding layer grows out of one-hot encoding, which turns each token of a text into a row of a sparse matrix with a single 1 at the token's index. When such a row is multiplied by a weight matrix, only the weights at the position of the 1 contribute, so the product reduces to looking up one row of the weight matrix, which is much cheaper than a full matrix multiplication.
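
The equivalence between a one-hot product and a row lookup can be checked with a small NumPy sketch. The sizes (5 tokens, 3 dimensions) and the random weights are arbitrary values chosen for illustration.

```python
import numpy as np

vocab_size, embed_dim = 5, 3            # toy sizes, for illustration only
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(vocab_size, embed_dim))

token_ids = np.array([3, 1, 3])         # a short sequence of token ids

# Route 1: one-hot encode each token, then multiply by the weight matrix
one_hot = np.eye(vocab_size)[token_ids]            # shape (3, 5)
via_matmul = one_hot @ embedding_matrix            # shape (3, 3)

# Route 2: look the rows up directly, which is what an embedding layer does
via_lookup = embedding_matrix[token_ids]           # shape (3, 3)

print(np.allclose(via_matmul, via_lookup))
```

Both routes give the same vectors, but the lookup never builds the sparse one-hot matrix, which with max_tokens = 20000 would have 20000 columns per token.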

In [None]:
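inputs = keras.Input(shape=(None,), dtype="int64")
# Keep a handle on the layer so the padding mask comes from the layer that
# actually embeds the inputs (it has mask_zero=True)
token_embedding = layers.Embedding(
    input_dim=max_tokens, output_dim=embed_dim, mask_zero=True)
embedded = token_embedding(inputs)

# Padding mask of shape (batch, seq_len), expanded to (batch, 1, seq_len)
# so that it broadcasts over the query positions
mask = token_embedding.compute_mask(inputs)[:, tf.newaxis, :]
attention_output = layers.MultiHeadAttention(
            num_heads=num_heads, 
            key_dim=embed_dim
            )(embedded, embedded, attention_mask=mask)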

proj_input = layers.LayerNormalization()(embedded + attention_output)
proj_output = dense_proj(proj_input)

x = layers.LayerNormalization()(proj_input + proj_output)

x = layers.GlobalMaxPooling1D()(x)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(4, activation="softmax")(x)
model = keras.Model(inputs, outputs)

In [None]:
model.compile(optimizer=RMSprop(learning_rate=5e-5),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

In [None]:
model.fit(int_train_ds,validation_data=int_val_ds,epochs=4)  # int_train_ds is already batched, so no batch_size

Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


<tensorflow.python.keras.callbacks.History at 0x7fe1dc3d6050>

In [None]:
model.fit(int_train_ds,validation_data=int_val_ds,epochs=2)  # int_train_ds is already batched, so no batch_size

Epoch 1/2
Epoch 2/2


<tensorflow.python.keras.callbacks.History at 0x7fe192117390>

# 3. 17th-Aug work on Model Construction with PositionalEmbedding
**Steps:**<br>
&emsp;1. Text Vectorization<br>
&emsp;2. Model Construction<br>
&emsp;3. Fitting the Model  
**Conclusion:**  
Not much improvement in prediction; the result is still below the benchmark.

## 3.1 TextVectorization

In [None]:
from tensorflow.keras import layers
# from tensorflow.keras.layers.experimental.preprocessing import TextVectorization

max_length = 600
max_tokens = 20000
text_vectorization = layers.experimental.preprocessing.TextVectorization(
    max_tokens=max_tokens,
    output_mode="int",
    output_sequence_length=max_length,
)
text_vectorization.adapt(text_only_train_ds)

int_train_ds = train_ds.map(lambda x, y: (text_vectorization(x), y))
int_val_ds = val_ds.map(lambda x, y: (text_vectorization(x), y))
#int_test_ds = test_ds.map(lambda x, y: (text_vectorization(x), y))

## 3.2 Model Construction with PositionalEmbedding

In [None]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

class PositionalEmbedding(layers.Layer):
    def __init__(self, sequence_length, input_dim, output_dim, **kwargs):
        super().__init__(**kwargs)
        self.token_embeddings = layers.Embedding(
            input_dim=input_dim, output_dim=output_dim)
        self.position_embeddings = layers.Embedding(
            input_dim=sequence_length, output_dim=output_dim)
        self.sequence_length = sequence_length
        self.input_dim = input_dim
        self.output_dim = output_dim

    def call(self, inputs):
        length = tf.shape(inputs)[-1]
        positions = tf.range(start=0, limit=length, delta=1)
        embedded_tokens = self.token_embeddings(inputs)
        embedded_positions = self.position_embeddings(positions)
        return embedded_tokens + embedded_positions

    def compute_mask(self, inputs, mask=None):
        return tf.math.not_equal(inputs, 0)

    def get_config(self):
        config = super().get_config()
        config.update({
            "output_dim": self.output_dim,
            "sequence_length": self.sequence_length,
            "input_dim": self.input_dim,
        })
        return config

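We still use RMSprop as the optimizer when compiling the model, and set **dense_dim** and **embed_dim** to the same values as in the previous model. The **TransformerEncoder** class used here is the one defined in section 4.2.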
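
What the PositionalEmbedding layer above computes can be sketched with NumPy. The sizes are toy values and the tables are random rather than learned; the sketch only shows the token-plus-position sum and the padding mask.

```python
import numpy as np

sequence_length, vocab_size, embed_dim = 6, 10, 4   # toy sizes, for illustration
rng = np.random.default_rng(0)
token_table = rng.normal(size=(vocab_size, embed_dim))          # one row per token id
position_table = rng.normal(size=(sequence_length, embed_dim))  # one row per position

token_ids = np.array([[7, 2, 7, 0, 0, 0]])   # batch of 1, padded with zeros

# Same steps as in call(): positions 0..length-1, two lookups, then a sum
positions = np.arange(token_ids.shape[-1])
embedded = token_table[token_ids] + position_table[positions]   # shape (1, 6, 4)

# Same rule as in compute_mask(): id 0 marks padding
mask = token_ids != 0

print(embedded.shape)
print(mask)
```

Token 7 appears at positions 0 and 2 and receives a different vector at each, which is how the model can tell word order apart even though attention itself is order-independent.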

In [None]:
from tensorflow.keras.optimizers import RMSprop

vocab_size = 20000
sequence_length = 600
embed_dim = 256
num_heads = 2
dense_dim = 32

inputs = keras.Input(shape=(None,), dtype="int64")
x = PositionalEmbedding(sequence_length, vocab_size, embed_dim)(inputs)
x = TransformerEncoder(embed_dim, dense_dim, num_heads)(x)
x = layers.GlobalMaxPooling1D()(x)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(4, activation="softmax")(x)
model = keras.Model(inputs, outputs)
model.compile(optimizer=RMSprop(learning_rate=5e-5),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

## 3.3 Fitting the Model
Adjusting the callback parameter settings

In [None]:
callbacks = [
    keras.callbacks.ModelCheckpoint("full_transformer_encoder.keras",
                                    save_best_only=True)
]
model.fit(int_train_ds, validation_data=int_val_ds, epochs=20, callbacks=callbacks)
model = keras.models.load_model(
    "full_transformer_encoder.keras",
    custom_objects={"TransformerEncoder": TransformerEncoder,
                    "PositionalEmbedding": PositionalEmbedding})

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20

# 4. 17th-Aug work on model with PositionalEmbedding and TransformerEncoder
**Steps:**<br>
&emsp;1. Text Vectorization<br>
&emsp;2. Model Construction<br>
&emsp;3. Fitting the Model  
**Conclusion:**  
Still not much improvement, and training takes even longer than for the other models.

## 4.1 Text Vectorization

In [None]:
from tensorflow.keras import layers
# from tensorflow.keras.layers.experimental.preprocessing import TextVectorization

max_length = 600
max_tokens = 20000
text_vectorization = layers.experimental.preprocessing.TextVectorization(
    max_tokens=max_tokens,
    output_mode="int",
    output_sequence_length=max_length,
)
text_vectorization.adapt(text_only_train_ds)

int_train_ds = train_ds.map(lambda x, y: (text_vectorization(x), y))
int_val_ds = val_ds.map(lambda x, y: (text_vectorization(x), y))
#int_test_ds = test_ds.map(lambda x, y: (text_vectorization(x), y))

## 4.2 Model Construction

In [None]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

class TransformerEncoder(layers.Layer):
    def __init__(self, embed_dim, dense_dim, num_heads, **kwargs):
        super().__init__(**kwargs)
        self.embed_dim = embed_dim
        self.dense_dim = dense_dim
        self.num_heads = num_heads
        self.attention = layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=embed_dim)
        self.dense_proj = keras.Sequential(
            [layers.Dense(dense_dim, activation="relu"),
             layers.Dense(embed_dim),]
        )
        self.layernorm_1 = layers.LayerNormalization()
        self.layernorm_2 = layers.LayerNormalization()

    def call(self, inputs, mask=None):
        if mask is not None:
            mask = mask[:, tf.newaxis, :]
        attention_output = self.attention(
            inputs, inputs, attention_mask=mask)
        proj_input = self.layernorm_1(inputs + attention_output)
        proj_output = self.dense_proj(proj_input)
        return self.layernorm_2(proj_input + proj_output)

    def get_config(self):
        config = super().get_config()
        config.update({
            "embed_dim": self.embed_dim,
            "num_heads": self.num_heads,
            "dense_dim": self.dense_dim,
        })
        return config

In [None]:
class PositionalEmbedding(layers.Layer):
    def __init__(self, sequence_length, input_dim, output_dim, **kwargs):
        super().__init__(**kwargs)
        self.token_embeddings = layers.Embedding(
            input_dim=input_dim, output_dim=output_dim)
        self.position_embeddings = layers.Embedding(
            input_dim=sequence_length, output_dim=output_dim)
        self.sequence_length = sequence_length
        self.input_dim = input_dim
        self.output_dim = output_dim

    def call(self, inputs):
        length = tf.shape(inputs)[-1]
        positions = tf.range(start=0, limit=length, delta=1)
        embedded_tokens = self.token_embeddings(inputs)
        embedded_positions = self.position_embeddings(positions)
        return embedded_tokens + embedded_positions

    def compute_mask(self, inputs, mask=None):
        return tf.math.not_equal(inputs, 0)

    def get_config(self):
        config = super().get_config()
        config.update({
            "output_dim": self.output_dim,
            "sequence_length": self.sequence_length,
            "input_dim": self.input_dim,
        })
        return config

## 4.3 Fitting the Model

Layer configuration and parameter settings

In [None]:
from tensorflow.keras.optimizers import RMSprop

In [None]:
vocab_size = 20000
sequence_length = 600
embed_dim = 256
num_heads = 2
dense_dim = 32

inputs = keras.Input(shape=(None,), dtype="int64")
x = PositionalEmbedding(sequence_length, vocab_size, embed_dim)(inputs)
x = TransformerEncoder(embed_dim, dense_dim, num_heads)(x)
x = layers.GlobalMaxPooling1D()(x)
#embedded = layers.Embedding(input_dim=max_tokens, output_dim=256, mask_zero=True)(x)
#x = layers.Bidirectional(layers.LSTM(32))(embedded)
x = layers.Dropout(0.4)(x)
x = layers.Dense(128)(x)
x = layers.BatchNormalization()(x)
x = layers.Activation("relu")(x)
x = layers.Dropout(0.4)(x)
outputs = layers.Dense(4, activation="softmax")(x)
model = keras.Model(inputs, outputs)
model.compile(optimizer=RMSprop(learning_rate=0.005),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
callbacks = [
    keras.callbacks.ModelCheckpoint("full_transformer_encoder.keras",
                                    save_best_only=True)
]



In [None]:
# batch_size is omitted: int_train_ds is already batched
model.fit(int_train_ds, validation_data=int_val_ds, epochs=20, callbacks=callbacks)
model = keras.models.load_model(
    "full_transformer_encoder.keras",
    custom_objects={"TransformerEncoder": TransformerEncoder,
                    "PositionalEmbedding": PositionalEmbedding})
# int_test_ds is never created in this notebook, so evaluate on the validation set
print(f"Val acc: {model.evaluate(int_val_ds)[1]:.3f}")

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20

KeyboardInterrupt: ignored

# 5. 18th-Aug work on pre-trained embedding model
**Steps:**  
1. Data loading (GloVe)  
2. Adding the pre-trained embedding layer to the model  
**Conclusion:**  
The performance is even worse!
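
Step 2 can be sketched on a toy scale: every vocabulary word that has a pre-trained vector gets that vector as its row in the embedding matrix, and every other word keeps a row of zeros. The three-dimensional vectors and the five-word vocabulary below are invented for illustration; the real run uses the 100-dimensional GloVe file.

```python
import numpy as np

embedding_dim = 3
# Stand-in for the GloVe lookup table (invented values)
pretrained = {
    "dollar": np.array([0.1, 0.2, 0.3]),
    "euro": np.array([0.4, 0.5, 0.6]),
}
# Stand-in for text_vectorization.get_vocabulary(); index 0 is padding
vocabulary = ["", "[UNK]", "dollar", "euro", "webex"]

embedding_matrix = np.zeros((len(vocabulary), embedding_dim))
missing = []
for index, word in enumerate(vocabulary):
    vector = pretrained.get(word)
    if vector is not None:
        embedding_matrix[index] = vector   # copy the pre-trained vector
    else:
        missing.append(word)               # row stays all zeros

coverage = 1 - len(missing) / len(vocabulary)
print(missing)
print(f"coverage: {coverage:.0%}")
```

Checking the coverage on the real vocabulary is a cheap diagnostic: with `trainable=False`, every word missing from the pre-trained table is stuck with a zero vector, which may be one of the reasons a frozen embedding performs worse.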

In [None]:
!wget http://nlp.stanford.edu/data/glove.6B.zip
!unzip -q glove.6B.zip

--2021-08-18 16:37:52--  http://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/glove.6B.zip [following]
--2021-08-18 16:37:52--  https://nlp.stanford.edu/data/glove.6B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip [following]
--2021-08-18 16:37:52--  http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘glove.6B.zip’


2021-0

In [None]:
import numpy as np
path_to_glove_file = "glove.6B.100d.txt"

embeddings_index = {}
with open(path_to_glove_file) as f:
    for line in f:
        word, coefs = line.split(maxsplit=1)
        coefs = np.fromstring(coefs, "f", sep=" ")
        embeddings_index[word] = coefs

print(f"Found {len(embeddings_index)} word vectors.")

Found 400000 word vectors.


In [None]:
embedding_dim = 100

vocabulary = text_vectorization.get_vocabulary()
word_index = dict(zip(vocabulary, range(len(vocabulary))))

embedding_matrix = np.zeros((max_tokens, embedding_dim))
for word, i in word_index.items():
    if i < max_tokens:
        embedding_vector = embeddings_index.get(word)
        # Words missing from GloVe keep an all-zero row
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector

In [None]:
embedding_layer = layers.Embedding(
    max_tokens,
    embedding_dim,
    embeddings_initializer=keras.initializers.Constant(embedding_matrix),
    trainable=False,
    mask_zero=True,
)

In [None]:
inputs = keras.Input(shape=(None,), dtype="int64")
embedded = embedding_layer(inputs)
x = layers.Dropout(0.3)(embedded)
x = layers.LayerNormalization()(x)
x = layers.Bidirectional(layers.LSTM(32,return_sequences=True))(x)
x = layers.LayerNormalization()(x) 
x = layers.Bidirectional(layers.LSTM(32))(x)
x = layers.LayerNormalization()(x) 
outputs = layers.Dense(4, activation="softmax")(x)
model = keras.Model(inputs, outputs)

In [None]:
model.compile(optimizer=RMSprop(learning_rate=8e-3),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

In [None]:
model.fit(int_train_ds, validation_data=int_val_ds, epochs=50)  # int_train_ds is already batched, so no batch_size

Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
 12/188 [>.............................] - ETA: 47s - loss: 0.4213 - accuracy: 0.8428

KeyboardInterrupt: ignored

# 6. 18th-Aug work on BidirectionalLSTM model  
**Steps:**  
1. TextVectorization  
2. Model Construction  
**Conclusion:**  
This is the model I spent the most time on. It works best of all the models, but it is still below the benchmark.  
**Experiment Note:**  
1. output_dim in the embedding layer should be 32 (tested 16, 32, 64 and 128).  
2. LayerNormalization is really important for the model to converge steadily.  
3. There should be no more than two Bidirectional LSTM layers. My guess is that the sequence loses more information as the number of layers grows.  
4. Do not add a dropout layer right after a Bidirectional LSTM layer; it has a negative impact.  
5. The number of dense layers has no significant impact on the model's predictions.   
6. The learning rate should be high at the beginning to shorten training time, and low once the model is close to its peak performance so that it can converge. If a constant learning rate is wanted, 5e-6 is a good choice.

In [None]:
from tensorflow.keras import layers
# from tensorflow.keras.layers.experimental.preprocessing import TextVectorization

max_length = 700
max_tokens = 30000
text_vectorization = layers.experimental.preprocessing.TextVectorization(
    max_tokens=max_tokens,
    output_mode="int",
    output_sequence_length=max_length,
)
text_vectorization.adapt(text_only_train_ds)

int_train_ds = train_ds.map(lambda x, y: (text_vectorization(x), y))
int_val_ds = val_ds.map(lambda x, y: (text_vectorization(x), y))
#int_test_ds = test_ds.map(lambda x, y: (text_vectorization(x), y))

In [None]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

In [None]:
from tensorflow.keras.optimizers import RMSprop
from tensorflow.keras import regularizers
from tensorflow.keras.callbacks import LearningRateScheduler as LRS

In [None]:
vocab_size = 30000
sequence_length = 700
embed_dim = 256
num_heads = 2
dense_dim = 32

inputs = keras.Input(shape=(None,), dtype="int64")
#x = PositionalEmbedding(sequence_length, vocab_size, embed_dim)(inputs)
#x = TransformerEncoder(embed_dim, dense_dim, num_heads)(x)
'''
embedded = embedding_layer(inputs)
x = layers.LayerNormalization()(embedded) # A LayerNormalization is necessary here because it makes model more stable
x = layers.Dropout(0.3)(x)
'''

embedded = layers.Embedding(input_dim=max_tokens, output_dim=32, mask_zero=True)(inputs)
x = layers.LayerNormalization()(embedded) # A LayerNormalization is necessary here because it makes model more stable
x = layers.Dropout(0.3)(x)
#embedded = tf.one_hot(inputs, depth=max_tokens)
x = layers.Bidirectional(layers.LSTM(32,return_sequences=True))(x)
x = layers.LayerNormalization()(x)  
#x = layers.Dropout(0.2)(x)  # According to the experiment, had better not add dropout here
x = layers.Bidirectional(layers.LSTM(32))(x)
x = layers.LayerNormalization()(x)
#x = layers.Dropout(0.3)(x)   # According to the experiment, had better not add dropout here
'''
x = layers.Dense(256)(x) #,kernel_regularizer=regularizers.l1(0.0001)
x = layers.BatchNormalization()(x)
x = layers.Activation("tanh")(x)
x = layers.Dropout(0.6)(x)
'''
x = layers.Dense(128)(x) #,kernel_regularizer=regularizers.l1(0.0001)
x = layers.BatchNormalization()(x)
x = layers.Activation("relu")(x)
x = layers.Dropout(0.5)(x)

outputs = layers.Dense(4, activation="softmax")(x)
model = keras.Model(inputs, outputs)

lr= LRS(lambda epoch:5e-4 * 10 ** (epoch/20))
optimizer = tf.keras.optimizers.SGD(learning_rate=5e-4,momentum=0.9)



"""
model.compile(loss=tf.keras.losses.Huber(),optimizer=optimizer,metrics=["mae"])
"""
callbacks = [
    keras.callbacks.ModelCheckpoint("jena_lstm.keras",   
                                    save_best_only=True)
]



In [None]:
model.compile(optimizer=RMSprop(learning_rate=8e-4),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

In [None]:
model.fit(int_train_ds, validation_data=int_val_ds, epochs=50) # batch_size omitted: the dataset is already batched; optionally add callbacks=callbacks
#model = keras.models.load_model("full_transformer_encoder.keras") #custom_objects={"TransformerEncoder": TransformerEncoder,"PositionalEmbedding": PositionalEmbedding}
#print(f"Test acc: {model.evaluate(int_test_ds)[1]:.3f}")

Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50

KeyboardInterrupt: ignored

# 7. 31st-Aug work on BidirectionalLSTM model with different max_length and max_tokens (Error)

In [None]:
from tensorflow.keras.optimizers import RMSprop

In [None]:
from tensorflow.keras import layers

max_length = 700  # max_length =600 before
max_tokens = 30000  # max_tokens = 20000 before
text_vectorization = TextVectorization(
    max_tokens=max_tokens,
    output_mode="int",
    output_sequence_length=max_length,
)
text_vectorization.adapt(text_only_train_ds)

int_train_ds = train_ds.map(lambda x, y: (text_vectorization(x), y))
int_val_ds = val_ds.map(lambda x, y: (text_vectorization(x), y))
#int_test_ds = test_ds.map(lambda x, y: (text_vectorization(x), y))

In [None]:
import tensorflow as tf
inputs = keras.Input(shape=(None,), dtype="int64")
embedded = layers.Embedding(input_dim=max_tokens, output_dim=32, mask_zero=True)(inputs)
x = layers.Bidirectional(layers.LSTM(32))(embedded)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(4, activation="softmax")(x)  # one output unit per class (4 classes)
model = keras.Model(inputs, outputs)
model.compile(optimizer=RMSprop(learning_rate=5e-5),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

In [None]:
callbacks = [
    keras.callbacks.ModelCheckpoint("one_hot_bidir_lstm.keras",
                                    save_best_only=True)
]
model.fit(int_train_ds, validation_data=int_val_ds, epochs=5, callbacks=callbacks)
model = keras.models.load_model("one_hot_bidir_lstm.keras")
# int_test_ds is never created in this notebook, so evaluate on the validation set
print(f"Val acc: {model.evaluate(int_val_ds)[1]:.3f}")

Epoch 1/5


ResourceExhaustedError: ignored

# 8. 28th-Aug to 1st-Sep Bidirectional LSTM & Embedding Layer(glove)
**Conclusion:** After adding the GloVe embedding layer, the validation loss dropped from the 0.26 range to the 0.25 range, and the Kaggle score dropped from 0.1414 to 0.12239.  


In [None]:
# Execute this only in colab after loading the 'data_v2.zip' in the workspace

with zipfile.ZipFile('/content/data_v2.zip', 'r') as zip_ref:
    zip_ref.extractall('data_v2')

In [None]:
# Loading the dataset from the 'train' directory

batch_size = 128
seed = 1337 # Keep the seed same for both 'train' & 'validation' to avoid overlap

train_ds = keras.preprocessing.text_dataset_from_directory(
    "/content/data_v2/train", 
    batch_size=batch_size,
    label_mode='int',
    validation_split=0.2,
    subset='training',
    seed=seed)

val_ds = keras.preprocessing.text_dataset_from_directory(
    "/content/data_v2/train",
    batch_size=batch_size,
    label_mode='int',
    validation_split=0.2,
    subset='validation',
    seed=seed)

text_only_train_ds = train_ds.map(lambda x, y: x)

In [None]:
from tensorflow.keras.optimizers import RMSprop

In [None]:
inputs = keras.Input(shape=(None,), dtype="int64")

embedded = embedding_layer(inputs)
x = layers.LayerNormalization()(embedded)
x = layers.Dropout(0.3)(x)

x = layers.Bidirectional(layers.LSTM(64))(x)
x = layers.LayerNormalization()(x)  

x = layers.Dense(128)(x) #,kernel_regularizer=regularizers.l1(0.0001)
x = layers.BatchNormalization()(x)
x = layers.Activation("relu")(x)
x = layers.Dropout(0.5)(x)

outputs = layers.Dense(4, activation="softmax")(x)
model = keras.Model(inputs, outputs)


In [None]:
model.compile(optimizer=RMSprop(learning_rate=5e-5),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

In [None]:
history = model.fit(count_train_ds,validation_data = count_val_ds, epochs=20, callbacks=callbacks)
model = keras.models.load_model("bow_2grams_1.keras")
print(f"Test acc: {model.evaluate(count_val_ds)[1]:.3f}")

Epoch 1/20
Epoch 2/20
Epoch 3/20
Test acc: 0.914


# Prediction
&emsp; **Prediction and to_csv**

In [None]:
# Using the trained model to make prediction on unseen (test) data
# Here we use the 'adapted' text_vectorization layer and include it as part of a prediction_model

prediction_model = tf.keras.Sequential(
    [text_vectorization, model])

prediction_model.compile(
    loss=keras.losses.SparseCategoricalCrossentropy(),
    optimizer='adam',
    metrics=['accuracy'])

# Test it with `val_ds`, which yields raw strings
loss, accuracy = prediction_model.evaluate(val_ds)
print("Accuracy: {:2.2%}".format(accuracy))

Accuracy: 91.35%


In [None]:
# Read the test data in the form of a dataframe

df_test_data = pd.read_csv('/content/data_v2/data_test_df.csv')
inputs = df_test_data['data']

In [None]:
# Make sure you use the 'prediction_model' and not the trained 'model' alone
# If you use the 'model' object, you will run into an error, as the data is still in 'text' format and needs vectorization

predicted_scores = prediction_model.predict(inputs)
predicted_scores[0:5]

array([[2.4190295e-01, 3.7546334e-04, 7.2275102e-01, 3.4970611e-02],
       [1.3789261e-03, 3.0286697e-04, 3.4502917e-03, 9.9486792e-01],
       [2.5791232e-03, 3.6273059e-04, 1.0878188e-02, 9.8617989e-01],
       [1.8249109e-02, 1.1308333e-02, 9.9732224e-03, 9.6046925e-01],
       [5.0529139e-03, 3.0256074e-04, 1.1376259e-02, 9.8326832e-01]],
      dtype=float32)

In [None]:
# populating the dataframe to make a submission on Kaggle

df_predictions = pd.DataFrame(predicted_scores, columns=['solution_' + str(i+1) for i in range(4)])
df_predictions.index.rename('Id', inplace=True)

df_predictions.head(30)

| Id | solution_1 | solution_2 | solution_3 | solution_4 |
|---|---|---|---|---|
| 0 | 0.241903 | 0.000375 | 0.722751 | 0.034971 |
| 1 | 0.001379 | 0.000303 | 0.00345 | 0.994868 |
| 2 | 0.002579 | 0.000363 | 0.010878 | 0.98618 |
| 3 | 0.018249 | 0.011308 | 0.009973 | 0.960469 |
| 4 | 0.005053 | 0.000303 | 0.011376 | 0.983268 |
| 5 | 0.0027 | 0.000632 | 0.014085 | 0.982583 |
| 6 | 0.003746 | 0.000264 | 0.003889 | 0.992101 |
| 7 | 0.003598 | 0.000739 | 0.007225 | 0.988439 |
| 8 | 0.004128 | 0.000116 | 0.018604 | 0.977152 |
| 9 | 0.023131 | 0.002121 | 0.9353 | 0.039448 |


In [None]:
# If using colab, then download this and submit on Kaggle

df_predictions.to_csv('df_predictions.csv')