# BERT - Experiments

This notebook contains AdHoc experiments with document representation using BERT/Transformer mechanisms then using them to perform document classification. This model will later be used to classify corpora of messages and dicern if they are related to a particular set of products and services.

In [1]:
!mkdir -p artifacts/models/keras artifacts/models/sklearn artifacts/results artifacts/results/logs artifacts/results/checkpoints

In [2]:
import os

os.environ['MODIN_CPUS'] = "10"
os.environ['MODIN_OUT_OF_CORE'] = "true"

artifacts_path = os.path.join(os.path.curdir, 'artifacts/')
models_path = os.path.join(artifacts_path, 'models/')
sklearn_models = os.path.join(models_path, 'sklearn/')
kears_modesl = os.path.join(models_path, 'keras/')
results_path = os.path.join(artifacts_path, 'results/')
logs_path = os.path.join(results_path, 'logs/')
data_path = os.path.join(artifacts_path, 'data/')

Perform the document tokenization using the pyspark script. It is much faster than pure python.

In [4]:
%%time
%%bash
export raw_reviews="/media/ohtar10/Adder-Storage/datasets/pre-processed/product-documents-small-shuffle/"
export current_dir=$(pwd)
if [ ! -f "artifacts/data/tokenized_docs.parquet" ]; then
    spark-submit ../../../dataprep/scripts/document_tokenizer.py --input "${raw_reviews}" --output "${current_dir}/artifacts/data/" --column document --vocab-size 200000 --maxlen 300 --batches 1 > spark.log
fi

 Using user defined output committer class org.apache.parquet.hadoop.ParquetOutputCommitter
20/11/26 18:25:49 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
20/11/26 18:25:49 INFO SQLHadoopMapReduceCommitProtocol: Using output committer class org.apache.parquet.hadoop.ParquetOutputCommitter
20/11/26 18:25:49 INFO CodeGenerator: Code generated in 7.470777 ms
20/11/26 18:25:49 INFO CodeGenerator: Code generated in 8.837825 ms
20/11/26 18:25:49 INFO MemoryStore: Block broadcast_6 stored as values in memory (estimated size 287.5 KB, free 362.3 MB)
20/11/26 18:25:49 INFO BlockManagerInfo: Removed broadcast_4_piece0 on 192.168.1.110:44875 in memory (size: 12.1 KB, free: 362.9 MB)
20/11/26 18:25:49 INFO MemoryStore: Block broadcast_6_piece0 stored as bytes in memory (estimated size 23.9 KB, free 362.3 MB)
20/11/26 18:25:49 INFO BlockManagerInfo: Added broadcast_6_piece0 in memory on 192.168.1.110:44875 (size: 23.9 KB, free: 362.9 MB)
20/11/26 18:25:49 INFO SparkContext

In [11]:
%%bash
export current_dir=$(pwd)
if [ ! -f "artifacts/data/tokenized_docs.parquet" ]; then
    mv artifacts/data/*.parquet artifacts/data/tokenized_docs.parquet 
fi

In [5]:
import modin.pandas as pd

reviews_path = os.path.join(data_path, 'tokenized_docs.parquet')
reviews = pd.read_parquet(reviews_path, columns=['categories', 'document', 'tokenized_document'], engine="pyarrow")
reviews.head()

Unnamed: 0,categories,document,tokenized_document
0,Music,Quality Radio App\nI use this app a few times ...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
1,Books,Buy this Book!\nAdmittedly I know the author o...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
2,Health & Personal Care,No Mess Jewelry Cleaner!\nEasy to use Jewelry ...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
3,"Technology, Electronics & Accessories",IPad cover\nThis cover is wonderful. Well mad...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
4,Office & School Supplies,Great thin pen\nLove the look and feel of this...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."


In [6]:
print(f"Total documents: {len(reviews):,}")

Total documents: 1,553,620


### Category binarizer

This is a simple category binarizer to encode the document categories

In [7]:
categories = reviews['categories'].apply(lambda cat: cat.split(";")).values.tolist()

In [8]:
from sklearn.preprocessing import MultiLabelBinarizer
import pickle

categories_encoder_path = os.path.join(sklearn_models, 'category_encoder.pkl')
categories_encoder = None
if os.path.exists(categories_encoder_path):
    with open(categories_encoder_path, 'rb') as file:
        categories_encoder = pickle.load(file)
else:
    categories_encoder = MultiLabelBinarizer()
    categories_encoder.fit(categories)
    with open(categories_encoder_path, 'wb') as file:
        pickle.dump(categories_encoder, file)

### Defining a baseline to beat

In [9]:
from collections import Counter

flat_categories = [item for sublist in categories for item in sublist]
count = Counter(flat_categories)

In [10]:
sorted(count.items(), key=lambda kv: kv[1], reverse=True)

[('Books', 457672),
 ('Technology, Electronics & Accessories', 269065),
 ('Home & Kitchen', 249313),
 ('Clothing, Shoes & Jewelry', 162550),
 ('Health & Personal Care', 125661),
 ('Toys & Games', 117803),
 ('Sports & Outdoors', 104651),
 ('Music', 82883),
 ('Movies & TV', 80190),
 ('Office & School Supplies', 28248)]

In [11]:
print(f"Baseline: {count['Books'] / len(categories):.3f}")

Baseline: 0.295


### Creating a stratified sample

While testing different model versions, first let's have a stratified sample of the dataset. In this case, we are going to use only 8% of the data set.

In [12]:
import numpy as np

def stratified_sample(dataset: pd.DataFrame, classes: np.ndarray, fraction: float, seed: int = None, class_col: str = 'categories') -> pd.DataFrame:
    samples = []
    for c in classes:
        if seed:
            samples.append(dataset[dataset[class_col].str.contains(c)].sample(frac=fraction, random_state=seed))
        else:
            samples.append(dataset[dataset[class_col].str.contains(c)].sample(frac=fraction))
    return pd.concat(samples)

In [13]:
sample = stratified_sample(reviews, categories_encoder.classes_, 0.008, 123)
sample.head()

Unnamed: 0,categories,document,tokenized_document
889468,Books,"Just damn good\nThe book is just really, reall...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
894813,Books,Looks Are Very Deceiving\nMs. Weldon wrote a c...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
1329162,Books,Great book\nThis book was most helpful in just...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
1424706,Books,Breath Taking\nI loved this book. I read it on...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
861722,Books,Daughter of Joy\nI loved this book and the mes...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."


In [14]:
print(f"Sample size: {len(sample):,}")

Sample size: 13,424


In [15]:
sample_categories = sample.categories.apply(lambda cat: cat.split(";")).values.tolist()
flat_categories = [item for sublist in sample_categories for item in sublist]
count = Counter(flat_categories)
sorted(count.items(), key=lambda kv: kv[1], reverse=True)

[('Books', 3782),
 ('Technology, Electronics & Accessories', 2616),
 ('Home & Kitchen', 2413),
 ('Clothing, Shoes & Jewelry', 1650),
 ('Sports & Outdoors', 1233),
 ('Health & Personal Care', 1091),
 ('Toys & Games', 984),
 ('Music', 714),
 ('Movies & TV', 693),
 ('Office & School Supplies', 229)]

This smaller sample preserves the proportion of the original data set. We can use this to try out different models.

### Preparing artifacts for training and evaluation

With the spark script above, we tokenized and saved a word index python dict that we can import in a sklearn based document tokenizer. We don't need to re-tokenize since the document is already tokenized. However, we will load the word index in case we want to tokenize new entries.

In [16]:
vocab_size = 200000
maxlen = 300

In [17]:
from sklearn.base import BaseEstimator, TransformerMixin
from nltk.tokenize import RegexpTokenizer

class DocumentTokenizer(BaseEstimator, TransformerMixin):

    def __init__(self, corpus_column: str, 
                lowercase: bool = True, 
                tokenizer=None, 
                vocab_size=None, 
                maxlen=None,
                word_index=None):
        self.corpus_column = corpus_column
        self.lowercase = lowercase
        self.vocab_size = vocab_size
        self.maxlen = maxlen
        if tokenizer:
            self.tokenizer = tokenizer
        else:
            self.tokenizer = RegexpTokenizer(r"\w+")
        if word_index:
            self.word_index = word_index
        else:
            self.word_index = None

    def fit(self, X, y=None):
        word_count = X[self.corpus_column].apply(
                lambda corpus: self.tokenizer.tokenize(corpus.lower() if self.lowercase else corpus)
            ).explode().value_counts().sort_values(ascending=False)

        if self.vocab_size:
            word_count = word_count.iloc[0:self.vocab_size]
        word_index = word_count.reset_index()['index'].to_dict()
        self.word_index = {v:k for k, v in word_index.items()}
        return self

    def transform(self, X, y=None):
        def tokenize(string):
            tokens = self.tokenizer.tokenize(string.lower())
            tokens = [self.word_index[token] for token in tokens if token in self.word_index]
            if self.maxlen:
                tokens = tokens[:self.maxlen]
            return tokens

        X[f'tokenized_{self.corpus_column}'] = X[self.corpus_column].apply(tokenize)
        return X


In [18]:
%%time

fit_tokenizer = False
doc_tokenizer_path = os.path.join(sklearn_models, 'document_tokenizer.pkl')

if os.path.exists(doc_tokenizer_path):
    with open(doc_tokenizer_path, 'rb') as file:
        doc_tokenizer = pickle.load(file)
else:
    if fit_tokenizer:
        doc_tokenizer = DocumentTokenizer(corpus_column='document', vocab_size=vocab_size, maxlen=maxlen)
        doc_tokenizer.fit(sample)
    else:
        word_index = os.path.join(data_path, 'word_index.pkl')
        doc_tokenizer = DocumentTokenizer(corpus_column='document', vocab_size=vocab_size, maxlen=maxlen, word_index=word_index)
    with open(doc_tokenizer_path, 'wb') as file:
         pickle.dump(doc_tokenizer, file)

CPU times: user 0 ns, sys: 1.18 ms, total: 1.18 ms
Wall time: 670 µs


In [19]:
%%time

tokenized_docs = sample
# Undoment this code if you want to re-tokenize the documents

# tokenized_docs_path = os.path.join(data_path, 'tokenized_docs.parquet')
# if os.path.exists(tokenized_docs_path):
#     tokenized_docs = pd.read_parquet(tokenized_docs_path, columns=['categories', 'document', 'tokenized_document'], engine='pyarrow')
# else:
#     tokenized_docs = doc_tokenizer.transform(sample)
#     tokenized_docs.to_parquet(tokenized_docs_path)


CPU times: user 2 µs, sys: 1e+03 ns, total: 3 µs
Wall time: 5.25 µs


In [20]:
print(f"Total tokenized documents: {len(tokenized_docs):,}")

Total tokenized documents: 13,424


## Defining the Attention based text classifier

Resource: https://keras.io/examples/nlp/text_classification_with_transformer/

In [21]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

### Implement multi head self attention as Keras layer

In [22]:
class MultiHeadSelfAttention(layers.Layer):
    def __init__(self, embed_dim, num_heads=8):
        super(MultiHeadSelfAttention, self).__init__()
        self.embed_dim = embed_dim
        self.num_heads = num_heads
        if embed_dim % num_heads != 0:
            raise ValueError(
                f"embedding dimmension = {embed_dim} should be divisible by number of heads = {num_heads}"
            )
        self.projection_dim = embed_dim // num_heads
        self.query_dense = layers.Dense(embed_dim)
        self.key_dense = layers.Dense(embed_dim)
        self.value_dense = layers.Dense(embed_dim)
        self.combine_heads = layers.Dense(embed_dim)

    def attention(self, query, key, value):
        score = tf.matmul(query, key, transpose_b=True)
        dim_key = tf.cast(tf.shape(key)[-1], tf.float32)
        scaled_score = score / tf.math.sqrt(dim_key)
        weights = tf.nn.softmax(scaled_score, axis=-1)
        output = tf.matmul(weights, value)
        return output, weights

    def separate_heads(self, x, batch_size):
        x = tf.reshape(x, (batch_size, -1, self.num_heads, self.projection_dim))
        return tf.transpose(x, perm=[0, 2, 1, 3])

    def call(self, inputs):
        # x.shape = [batch_size, seq_len, embedding_dim]
        batch_size = tf.shape(inputs)[0]
        query = self.query_dense(inputs) # (batch_size, seq_len, embed_dim)
        key = self.key_dense(inputs) # (batch_size, seq_len, embed_dim)
        value = self.value_dense(inputs) # (batch_size, seq_len, embed_dim)
        query = self.separate_heads(
            query, batch_size
        ) # (batch_size, num_heads, seq_len, projection_dim)
        key = self.separate_heads(
            query, batch_size
        ) # (batch_size, num_heads, seq_len, projection_dim)
        value = self.separate_heads(
            query, batch_size
        ) # (batch_size, num_heads, seq_len, projection_dim)

        attention, weights = self.attention(query, key, value)
        attention = tf.transpose(
            attention, perm=[0, 2, 1, 3]
        ) # (batch_size, seq_len, embed_dim)
        concat_attention = tf.reshape(
            attention, (batch_size, -1, self.embed_dim)
        ) # (batch_size, seq_len, embed_dim)
        output = self.combine_heads(
            concat_attention
        ) # (batch_size, seq_len, embed_dim)
        return output


### Implement a Transformer block as a layer

In [23]:
class TransformerBlock(layers.Layer):
    def __init__(self, embed_dim, num_heads, ff_dim, dropout_rate=0.1):
        super(TransformerBlock, self).__init__()
        self.att = MultiHeadSelfAttention(embed_dim=embed_dim, num_heads=num_heads)
        self.ffn = keras.Sequential(
            [
                layers.Dense(ff_dim, activation='relu'),
                layers.Dense(embed_dim)
            ]
        )
        self.layernorm1 = layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = layers.LayerNormalization(epsilon=1e-6)
        self.dropout1 = layers.Dropout(dropout_rate)
        self.dropout2 = layers.Dropout(dropout_rate)

    def call(self, inputs, training):
        attn_output = self.att(inputs)
        attn_output = self.dropout1(attn_output, training=training)
        out1 = self.layernorm1(inputs + attn_output)
        ffn_output = self.ffn(out1)
        ffn_output = self.dropout2(ffn_output, training=training)
        return self.layernorm2(out1 + ffn_output)

### Implement embedding layer

In [24]:
class TokenAndPositionEmbedding(layers.Layer):
    def __init__(self, maxlen, vocab_size, embed_dim):
        super(TokenAndPositionEmbedding, self).__init__()
        self.token_emb = layers.Embedding(input_dim=vocab_size, output_dim=embed_dim, mask_zero=True)
        self.pos_emb = layers.Embedding(input_dim=maxlen, output_dim=embed_dim, mask_zero=True)

    def call(self, x):
        maxlen = tf.shape(x)[-1]
        positions = tf.range(start=0, limit=maxlen, delta=1)
        positions = self.pos_emb(positions)
        x = self.token_emb(x)
        return x + positions

### Create classifier model using transformer layer
Transformer layer outputs one vector for each time step of our input sequence. Here, we take the mean across all time steps and use a feed forward network on top of it to classify text.

In [35]:
embed_dim = 256 # Embedding size for each token
num_heads = 2 # number of attention heads
ff_dim = 128 # Hidden layer size in feed forward network inside transformer

# Note: these below were already defined a some cells above. Redefinding them to remember them
maxlen = maxlen # Only consider the first 300 words of each product review
vocab_size = vocab_size # Only consider the top 200k words

inputs = layers.Input(shape=(maxlen,))
embedding_layer = TokenAndPositionEmbedding(maxlen, vocab_size, embed_dim)
x = embedding_layer(inputs)
transformer_block = TransformerBlock(embed_dim, num_heads, ff_dim)
x = transformer_block(x)
x = layers.GlobalAveragePooling1D()(x)
x = layers.Dropout(0.2)(x)
x = layers.Dense(512, activation='relu')(x)
x = layers.Dropout(0.2)(x)
x = layers.Dense(216, activation='relu')(x)

outputs = layers.Dense(len(categories_encoder.classes_), activation='softmax')(x)

model = keras.Model(inputs=inputs, outputs=outputs)
model.summary()

Model: "model_5"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_6 (InputLayer)         [(None, 300)]             0         
_________________________________________________________________
token_and_position_embedding (None, 300, 256)          51276800  
_________________________________________________________________
transformer_block_5 (Transfo (None, 300, 256)          330112    
_________________________________________________________________
global_average_pooling1d_5 ( (None, 256)               0         
_________________________________________________________________
dropout_22 (Dropout)         (None, 256)               0         
_________________________________________________________________
dense_51 (Dense)             (None, 512)               131584    
_________________________________________________________________
dropout_23 (Dropout)         (None, 512)               0   

### Prepare the data for the experiment from the sample sub set

Prepare the labels using the sample sub set

In [27]:
sample_categories = tokenized_docs['categories'].apply(lambda cat: cat.split(";")).values.tolist()
y = categories_encoder.transform(sample_categories)
print(f"Total categories: {len(y)} with size: {len(y[0])} each")

Total categories: 13424 with size: 10 each


Prepare the encoded corpora from the sample sub set

In [28]:
%%time
import numpy as np

train_data_path = os.path.join(data_path, 'train_data.npz')
if not os.path.exists(train_data_path):
    X = tokenized_docs['tokenized_document'].values
    X = np.array(X.tolist())
    np.savez_compressed(train_data_path, X=X)
else:
    X = np.load(train_data_path)['X']

print(f"Total documents for training: {len(X)}, with sequence length: {len(X[0])}")

Total documents for training: 13424, with sequence length: 300
CPU times: user 3min 29s, sys: 23.8 s, total: 3min 52s
Wall time: 1h 20s


Prepare training, dev, and test stratified sample sets

In [29]:
from sklearn.model_selection import train_test_split

random_state = 42
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=random_state)
X_dev, X_test, y_dev, y_test = train_test_split(X_test, y_test, test_size=0.5, random_state=random_state)


### Train and evaluate the model

In [36]:
%%time
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping, TensorBoard, ModelCheckpoint

checkpoints_path = os.path.join(results_path, 'checkpoints/')
logs_path = os.path.join(results_path, 'logs/')
callbacks = [
    EarlyStopping(monitor='loss', patience=5, min_delta=1e-7, restore_best_weights=True),
    TensorBoard(log_dir=logs_path),
    ModelCheckpoint(checkpoints_path, monitor='val_acc')
]

optimizer = Adam(learning_rate=1e-4)

# losses
# kullback_leibler_divergence
# categorical_hinge
# categorical_crossentropy
model.compile(optimizer="adam", loss='kullback_leibler_divergence', metrics=['accuracy'])
history = model.fit(X_train, y_train, batch_size=32, epochs=20, validation_data=(X_dev, y_dev), callbacks=callbacks)


Train on 12081 samples, validate on 671 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
CPU times: user 3h 20min 58s, sys: 30min 42s, total: 3h 51min 40s
Wall time: 47min 42s


In [70]:
loaded = tf.keras.models.load_model(checkpoints_path)

In [51]:
model.evaluate(X_test, y_test)



[1.8868522360211326, 0.6785714]