## Import Libraries and Set Mixed Precision Policy

In [1]:
import os
import tensorflow as tf
from transformers import BertTokenizer, TFBertModel
from datasets import Dataset, load_from_disk, concatenate_datasets
from tensorflow.keras import mixed_precision
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.mixed_precision import LossScaleOptimizer

# Set the global mixed precision policy (non-experimental API, available since TF 2.4)
mixed_precision.set_global_policy('mixed_float16')


INFO:tensorflow:Mixed precision compatibility check (mixed_float16): OK
Your GPU will likely run quickly with dtype policy mixed_float16 as it has compute capability of at least 7.0. Your GPU: NVIDIA GeForce RTX 3050 6GB Laptop GPU, compute capability 8.6
Instructions for updating:
Use tf.keras.mixed_precision.LossScaleOptimizer instead. LossScaleOptimizer now has all the functionality of DynamicLossScale


## Set TensorFlow GPU Configuration

In [2]:
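# Enable memory growth on every visible GPU.
# This has to run before anything initializes the GPU; otherwise
# TensorFlow raises the RuntimeError printed below.
physical_devices = tf.config.list_physical_devices('GPU')
if physical_devices:
    try:
        for device in physical_devices:
            tf.config.experimental.set_memory_growth(device, True)
    except RuntimeError as e:
        print(e)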


Physical devices cannot be modified after being initialized


## Load IndoBERT Tokenizer and Model

In [3]:
# Load IndoBERT tokenizer and model (TensorFlow version)
tokenizer = BertTokenizer.from_pretrained("indobenchmark/indobert-base-p1")
bert_model = TFBertModel.from_pretrained("indobenchmark/indobert-base-p1")

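# Request gradient checkpointing for memory efficiency.
# Note: this config flag may be ignored by the TensorFlow BERT implementation,
# so it should not be relied on to reduce memory use here.
bert_model.config.gradient_checkpointing = True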


Some layers from the model checkpoint at indobenchmark/indobert-base-p1 were not used when initializing TFBertModel: ['mlm___cls', 'nsp___cls']
- This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertModel were initialized from the model checkpoint at indobenchmark/indobert-base-p1.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.


## Define Generator for Reading Text Files

In [4]:
# Generator that yields the cleaned, lowercased content of each non-empty .txt file
def read_text_files_generator(folder_path):
    # Sort filenames so the batches are reproducible across runs
    for filename in sorted(os.listdir(folder_path)):
        if filename.endswith(".txt"):
            with open(os.path.join(folder_path, filename), 'r', encoding='utf-8') as file:
                content = file.read().strip().lower()
                if content:
                    yield clean_text(content)

# Remove a fixed set of punctuation and markup characters from the text
def clean_text(text):
    unwanted_chars = ['*', '#', '_', ')', '(', '!', '?', '.', ',', '-']
    for char in unwanted_chars:
        text = text.replace(char, '')
    return text
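
The character-by-character loop in `clean_text` can also be written as a single pass with `str.translate`. The sketch below is one possible equivalent, not part of the original notebook; the names `UNWANTED_CHARS` and `clean_text_translate` are hypothetical and introduced only for illustration.

```python
# Hypothetical alternative to clean_text: delete all unwanted characters in one pass.
# The character set is copied from the unwanted_chars list in clean_text.
UNWANTED_CHARS = "*#_)(!?.,-"
_DELETE_TABLE = str.maketrans("", "", UNWANTED_CHARS)


def clean_text_translate(text: str) -> str:
    """Return text with every character in UNWANTED_CHARS removed."""
    return text.translate(_DELETE_TABLE)


print(clean_text_translate("halo, dunia! (tes_1)"))  # halo dunia tes1
print(clean_text_translate("anak-anak"))             # anakanak
```

As the second call shows, deleting hyphens joins the two halves of hyphenated words into a single token, which is also what the original `clean_text` does. Whether that is desirable depends on the downstream task.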


## Process and Tokenize Text in Batches Using Generator

In [11]:
# Tokenize a batch of examples, padding/truncating every text to 512 tokens
def tokenize_function(examples, tokenizer):
    return tokenizer(examples["text"], padding="max_length", truncation=True, max_length=512)

# Tokenize one batch of texts and save it to its own directory
def tokenize_and_save_batch(texts, batch_index, tokenizer, save_dir):
    batch_dataset = Dataset.from_dict({"text": texts})
    tokenized_batch = batch_dataset.map(
        tokenize_function,
        batched=True,
        remove_columns=["text"],
        fn_kwargs={"tokenizer": tokenizer},  # use the tokenizer passed in, not the global one
    )
    tokenized_batch.save_to_disk(f'{save_dir}/tokenized_dataset_batch_{batch_index}')

# Process texts in batches and save tokenized datasets
def process_in_batches(folder_path, batch_size, tokenizer, save_dir):
    batch_texts = []
    batch_index = 0
    for text in read_text_files_generator(folder_path):
        batch_texts.append(text)
        if len(batch_texts) >= batch_size:
            tokenize_and_save_batch(batch_texts, batch_index, tokenizer, save_dir)
            batch_texts = []
            batch_index += 1
    if batch_texts:
        tokenize_and_save_batch(batch_texts, batch_index, tokenizer, save_dir)

# Example usage:
process_in_batches('../Dataset/nlp_dataset', batch_size=100, tokenizer=tokenizer, save_dir='../saved_model/nlp_saved/nlp_02/new_tokenized')


Map:   0%|          | 0/100 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/100 [00:00<?, ? examples/s]

[The same Map / Saving progress pair repeats for 12 more batches of 100 examples.]

Map:   0%|          | 0/92 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/92 [00:00<?, ? examples/s]
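
The tokenization step writes one directory per batch, named `tokenized_dataset_batch_<index>`, so a later training loop needs to know how many of them exist. The sketch below is one way to count them from disk instead of hard-coding a number. The helper name `count_saved_batches` is hypothetical, and it assumes the directory naming used by `tokenize_and_save_batch`.

```python
import os
import re

# Directory-name pattern written by tokenize_and_save_batch.
_BATCH_DIR = re.compile(r"^tokenized_dataset_batch_(\d+)$")


def count_saved_batches(save_dir: str) -> int:
    """Count the tokenized_dataset_batch_<i> directories inside save_dir."""
    if not os.path.isdir(save_dir):
        return 0
    return sum(
        1
        for name in os.listdir(save_dir)
        if _BATCH_DIR.match(name) and os.path.isdir(os.path.join(save_dir, name))
    )
```

The result could then be passed as the number of batches to train on, for example `count_saved_batches('../saved_model/nlp_saved/nlp_02/new_tokenized')`.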

## Load and Prepare Dataset with Prefetching

In [15]:
# Load dataset with prefetching for efficient training
def load_and_prepare_dataset_with_prefetch(batch_index, save_dir):
    batch_dataset = load_from_disk(f'{save_dir}/tokenized_dataset_batch_{batch_index}')
    dataset = batch_dataset.to_tf_dataset(
        columns=["input_ids", "attention_mask"],
        label_cols=["input_ids"],
        shuffle=True,
        batch_size=1,
        collate_fn=None,
    )
    return dataset.prefetch(tf.data.AUTOTUNE)


## Define and Compile BertAutoencoder Model

In [16]:
# Define the BertAutoencoder model
class BertAutoencoder(tf.keras.Model):
    def __init__(self, bert_model):
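        super().__init__()
        self.bert = bert_model
        # Output raw logits (no softmax) because the loss below uses from_logits=True;
        # dtype='float32' keeps the logits in full precision under mixed_float16.
        self.dense = tf.keras.layers.Dense(bert_model.config.vocab_size, dtype='float32')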

    def call(self, inputs):
        outputs = self.bert(**inputs)
        sequence_output = outputs.last_hidden_state
        reconstructed = self.dense(sequence_output)
        return reconstructed

# Instantiate the model
autoencoder_model = BertAutoencoder(bert_model)

# Optimizer using LossScaleOptimizer with dynamic loss scale
base_optimizer = Adam(learning_rate=2e-5)
optimizer = LossScaleOptimizer(base_optimizer)

# Compile the model
autoencoder_model.compile(
    optimizer=optimizer,
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)


## Train Model in Batches with Prefetching

In [17]:
# Function to train the model on prefetch-enabled datasets in batches
def train_model_with_prefetching(model, num_batches, save_dir):
    for i in range(num_batches):
        print(f"Training on batch {i+1}/{num_batches}")
        train_dataset = load_and_prepare_dataset_with_prefetch(i, save_dir)
        model.fit(train_dataset, epochs=3)

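# Example usage.
# Note: the tokenization run above produced 14 batches (indices 0-13), so a
# num_batches larger than that will fail in load_from_disk once the saved batches run out.
train_model_with_prefetching(autoencoder_model, 500, '../saved_model/nlp_saved/nlp_02/new_tokenized')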


Training on batch 1/500
Epoch 1/3


ResourceExhaustedError: Graph execution error:

Detected at node 'bert_autoencoder_2/tf_bert_model/bert/encoder/layer_._2/attention/self/MatMul' defined at (most recent call last):
    File "C:\Users\gabri\anaconda3\envs\myenv\lib\runpy.py", line 196, in _run_module_as_main
      return _run_code(code, main_globals, None,
    File "C:\Users\gabri\anaconda3\envs\myenv\lib\runpy.py", line 86, in _run_code
      exec(code, run_globals)
    File "C:\Users\gabri\anaconda3\envs\myenv\lib\site-packages\ipykernel_launcher.py", line 17, in <module>
      app.launch_new_instance()
    File "C:\Users\gabri\anaconda3\envs\myenv\lib\site-packages\traitlets\config\application.py", line 1075, in launch_instance
      app.start()
    File "C:\Users\gabri\anaconda3\envs\myenv\lib\site-packages\ipykernel\kernelapp.py", line 701, in start
      self.io_loop.start()
    File "C:\Users\gabri\anaconda3\envs\myenv\lib\site-packages\tornado\platform\asyncio.py", line 205, in start
      self.asyncio_loop.run_forever()
    File "C:\Users\gabri\anaconda3\envs\myenv\lib\asyncio\windows_events.py", line 321, in run_forever
      super().run_forever()
    File "C:\Users\gabri\anaconda3\envs\myenv\lib\asyncio\base_events.py", line 603, in run_forever
      self._run_once()
    File "C:\Users\gabri\anaconda3\envs\myenv\lib\asyncio\base_events.py", line 1909, in _run_once
      handle._run()
    File "C:\Users\gabri\anaconda3\envs\myenv\lib\asyncio\events.py", line 80, in _run
      self._context.run(self._callback, *self._args)
    File "C:\Users\gabri\anaconda3\envs\myenv\lib\site-packages\ipykernel\kernelbase.py", line 534, in dispatch_queue
      await self.process_one()
    File "C:\Users\gabri\anaconda3\envs\myenv\lib\site-packages\ipykernel\kernelbase.py", line 523, in process_one
      await dispatch(*args)
    File "C:\Users\gabri\anaconda3\envs\myenv\lib\site-packages\ipykernel\kernelbase.py", line 429, in dispatch_shell
      await result
    File "C:\Users\gabri\anaconda3\envs\myenv\lib\site-packages\ipykernel\kernelbase.py", line 767, in execute_request
      reply_content = await reply_content
    File "C:\Users\gabri\anaconda3\envs\myenv\lib\site-packages\ipykernel\ipkernel.py", line 429, in do_execute
      res = shell.run_cell(
    File "C:\Users\gabri\anaconda3\envs\myenv\lib\site-packages\ipykernel\zmqshell.py", line 549, in run_cell
      return super().run_cell(*args, **kwargs)
    File "C:\Users\gabri\anaconda3\envs\myenv\lib\site-packages\IPython\core\interactiveshell.py", line 3075, in run_cell
      result = self._run_cell(
    File "C:\Users\gabri\anaconda3\envs\myenv\lib\site-packages\IPython\core\interactiveshell.py", line 3130, in _run_cell
      result = runner(coro)
    File "C:\Users\gabri\anaconda3\envs\myenv\lib\site-packages\IPython\core\async_helpers.py", line 129, in _pseudo_sync_runner
      coro.send(None)
    File "C:\Users\gabri\anaconda3\envs\myenv\lib\site-packages\IPython\core\interactiveshell.py", line 3334, in run_cell_async
      has_raised = await self.run_ast_nodes(code_ast.body, cell_name,
    File "C:\Users\gabri\anaconda3\envs\myenv\lib\site-packages\IPython\core\interactiveshell.py", line 3517, in run_ast_nodes
      if await self.run_code(code, result, async_=asy):
    File "C:\Users\gabri\anaconda3\envs\myenv\lib\site-packages\IPython\core\interactiveshell.py", line 3577, in run_code
      exec(code_obj, self.user_global_ns, self.user_ns)
    File "C:\Users\gabri\AppData\Local\Temp\ipykernel_25068\2295026956.py", line 9, in <module>
      train_model_with_prefetching(autoencoder_model, 500, '../saved_model/nlp_saved/nlp_02/new_tokenized')
    File "C:\Users\gabri\AppData\Local\Temp\ipykernel_25068\2295026956.py", line 6, in train_model_with_prefetching
      model.fit(train_dataset, epochs=3)
    File "C:\Users\gabri\anaconda3\envs\myenv\lib\site-packages\keras\utils\traceback_utils.py", line 64, in error_handler
      return fn(*args, **kwargs)
    File "C:\Users\gabri\anaconda3\envs\myenv\lib\site-packages\keras\engine\training.py", line 1384, in fit
      tmp_logs = self.train_function(iterator)
    File "C:\Users\gabri\anaconda3\envs\myenv\lib\site-packages\keras\engine\training.py", line 1021, in train_function
      return step_function(self, iterator)
    File "C:\Users\gabri\anaconda3\envs\myenv\lib\site-packages\keras\engine\training.py", line 1010, in step_function
      outputs = model.distribute_strategy.run(run_step, args=(data,))
    File "C:\Users\gabri\anaconda3\envs\myenv\lib\site-packages\keras\engine\training.py", line 1000, in run_step
      outputs = model.train_step(data)
    File "C:\Users\gabri\anaconda3\envs\myenv\lib\site-packages\keras\engine\training.py", line 859, in train_step
      y_pred = self(x, training=True)
    File "C:\Users\gabri\anaconda3\envs\myenv\lib\site-packages\keras\utils\traceback_utils.py", line 64, in error_handler
      return fn(*args, **kwargs)
    File "C:\Users\gabri\anaconda3\envs\myenv\lib\site-packages\keras\engine\base_layer.py", line 1096, in __call__
      outputs = call_fn(inputs, *args, **kwargs)
    File "C:\Users\gabri\anaconda3\envs\myenv\lib\site-packages\keras\utils\traceback_utils.py", line 92, in error_handler
      return fn(*args, **kwargs)
    File "C:\Users\gabri\AppData\Local\Temp\ipykernel_25068\2740090188.py", line 9, in call
      outputs = self.bert(**inputs)
    File "C:\Users\gabri\anaconda3\envs\myenv\lib\site-packages\keras\utils\traceback_utils.py", line 64, in error_handler
      return fn(*args, **kwargs)
    File "C:\Users\gabri\anaconda3\envs\myenv\lib\site-packages\keras\engine\base_layer.py", line 1096, in __call__
      outputs = call_fn(inputs, *args, **kwargs)
    File "C:\Users\gabri\anaconda3\envs\myenv\lib\site-packages\keras\utils\traceback_utils.py", line 92, in error_handler
      return fn(*args, **kwargs)
    File "C:\Users\gabri\anaconda3\envs\myenv\lib\site-packages\transformers\modeling_tf_utils.py", line 1208, in run_call_with_unpacked_inputs
      """
    File "C:\Users\gabri\anaconda3\envs\myenv\lib\site-packages\transformers\models\bert\modeling_tf_bert.py", line 1235, in call
      outputs = self.bert(
    File "C:\Users\gabri\anaconda3\envs\myenv\lib\site-packages\keras\utils\traceback_utils.py", line 64, in error_handler
      return fn(*args, **kwargs)
    File "C:\Users\gabri\anaconda3\envs\myenv\lib\site-packages\keras\engine\base_layer.py", line 1096, in __call__
      outputs = call_fn(inputs, *args, **kwargs)
    File "C:\Users\gabri\anaconda3\envs\myenv\lib\site-packages\keras\utils\traceback_utils.py", line 92, in error_handler
      return fn(*args, **kwargs)
    File "C:\Users\gabri\anaconda3\envs\myenv\lib\site-packages\transformers\modeling_tf_utils.py", line 1208, in run_call_with_unpacked_inputs
      """
    File "C:\Users\gabri\anaconda3\envs\myenv\lib\site-packages\transformers\models\bert\modeling_tf_bert.py", line 995, in call
      encoder_outputs = self.encoder(
    File "C:\Users\gabri\anaconda3\envs\myenv\lib\site-packages\keras\utils\traceback_utils.py", line 64, in error_handler
      return fn(*args, **kwargs)
    File "C:\Users\gabri\anaconda3\envs\myenv\lib\site-packages\keras\engine\base_layer.py", line 1096, in __call__
      outputs = call_fn(inputs, *args, **kwargs)
    File "C:\Users\gabri\anaconda3\envs\myenv\lib\site-packages\keras\utils\traceback_utils.py", line 92, in error_handler
      return fn(*args, **kwargs)
    File "C:\Users\gabri\anaconda3\envs\myenv\lib\site-packages\transformers\models\bert\modeling_tf_bert.py", line 629, in call
      for i, layer_module in enumerate(self.layer):
    File "C:\Users\gabri\anaconda3\envs\myenv\lib\site-packages\transformers\models\bert\modeling_tf_bert.py", line 635, in call
      layer_outputs = layer_module(
    File "C:\Users\gabri\anaconda3\envs\myenv\lib\site-packages\keras\utils\traceback_utils.py", line 64, in error_handler
      return fn(*args, **kwargs)
    File "C:\Users\gabri\anaconda3\envs\myenv\lib\site-packages\keras\engine\base_layer.py", line 1096, in __call__
      outputs = call_fn(inputs, *args, **kwargs)
    File "C:\Users\gabri\anaconda3\envs\myenv\lib\site-packages\keras\utils\traceback_utils.py", line 92, in error_handler
      return fn(*args, **kwargs)
    File "C:\Users\gabri\anaconda3\envs\myenv\lib\site-packages\transformers\models\bert\modeling_tf_bert.py", line 528, in call
      self_attention_outputs = self.attention(
    File "C:\Users\gabri\anaconda3\envs\myenv\lib\site-packages\keras\utils\traceback_utils.py", line 64, in error_handler
      return fn(*args, **kwargs)
    File "C:\Users\gabri\anaconda3\envs\myenv\lib\site-packages\keras\engine\base_layer.py", line 1096, in __call__
      outputs = call_fn(inputs, *args, **kwargs)
    File "C:\Users\gabri\anaconda3\envs\myenv\lib\site-packages\keras\utils\traceback_utils.py", line 92, in error_handler
      return fn(*args, **kwargs)
    File "C:\Users\gabri\anaconda3\envs\myenv\lib\site-packages\transformers\models\bert\modeling_tf_bert.py", line 412, in call
      self_outputs = self.self_attention(
    File "C:\Users\gabri\anaconda3\envs\myenv\lib\site-packages\keras\utils\traceback_utils.py", line 64, in error_handler
      return fn(*args, **kwargs)
    File "C:\Users\gabri\anaconda3\envs\myenv\lib\site-packages\keras\engine\base_layer.py", line 1096, in __call__
      outputs = call_fn(inputs, *args, **kwargs)
    File "C:\Users\gabri\anaconda3\envs\myenv\lib\site-packages\keras\utils\traceback_utils.py", line 92, in error_handler
      return fn(*args, **kwargs)
    File "C:\Users\gabri\anaconda3\envs\myenv\lib\site-packages\transformers\models\bert\modeling_tf_bert.py", line 316, in call
      attention_scores = tf.matmul(query_layer, key_layer, transpose_b=True)
Node: 'bert_autoencoder_2/tf_bert_model/bert/encoder/layer_._2/attention/self/MatMul'
OOM when allocating tensor with shape[1,12,512,512] and type half on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
	 [[{{node bert_autoencoder_2/tf_bert_model/bert/encoder/layer_._2/attention/self/MatMul}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.
 [Op:__inference_train_function_53369]
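
The tensor named in the OOM message is small on its own, which suggests the GPU memory was already nearly exhausted by weights, optimizer state, and earlier activations when this allocation was attempted. The sketch below is plain arithmetic, not a profiler: it estimates the size of individual tensors to show how they scale with sequence length. The helper `tensor_mib` is hypothetical, and the vocabulary size of 50,000 is an assumed placeholder; the real value is `bert_model.config.vocab_size`.

```python
# Bytes per element for the dtypes that appear in this notebook.
BYTES_PER_ELEMENT = {"float16": 2, "float32": 4}


def tensor_mib(shape, dtype):
    """Size in MiB of a dense tensor with the given shape and dtype."""
    n = 1
    for dim in shape:
        n *= dim
    return n * BYTES_PER_ELEMENT[dtype] / 2**20


# Shape and dtype taken from the OOM message: [1, 12, 512, 512], half.
attention_512 = tensor_mib((1, 12, 512, 512), "float16")
# Same tensor if the texts were truncated to 128 tokens instead of 512.
attention_128 = tensor_mib((1, 12, 128, 128), "float16")

# ASSUMPTION: 50_000 is a placeholder vocabulary size, not read from the model.
ASSUMED_VOCAB_SIZE = 50_000
logits_512 = tensor_mib((1, 512, ASSUMED_VOCAB_SIZE), "float32")

print(f"attention scores, seq 512: {attention_512:.3f} MiB")  # 6.000 MiB
print(f"attention scores, seq 128: {attention_128:.3f} MiB")  # 0.375 MiB
print(f"float32 logits,   seq 512: {logits_512:.1f} MiB")     # 97.7 MiB
```

Attention scores grow quadratically with sequence length, and one such tensor is kept per encoder layer for the backward pass. Under the assumed vocabulary size, the float32 output of the dense layer is much larger than any single attention tensor. A shorter `max_length` in the tokenization step would shrink both.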