####        This GitHub Repository accompanies the Paper
## **How to Design and Employ Specialized Large Language Models for Accounting and Tax Research: The Example of TaxBERT**
**Frank Hechtner, Lukas Schmidt, Andreas Seebeck, and Marius Weiß**
##### If the following Guide/Repository is used for academic or scientific purposes, please cite the paper Hechtner et al. (2025), How to Design and Employ Specialized Large Language Models for Accounting and Tax Research: The Example of TaxBERT.
##### Link to paper: SSRN/Arxiv
##### Version as of February 2025
## Part 1 of example code: Domain adaptive pretraining


This notebook provides a 10-step guide to domain-adaptive pretraining for accounting and tax researchers.

1. **Library Imports**: Import necessary libraries from PyTorch and Hugging Face for handling large language models (LLMs) and datasets.

2. **Hardware Setup**: Detect whether an NVIDIA GPU is available and set the computation device accordingly (CUDA for GPU, otherwise CPU).

3. **Data Loading**: Initialize directories and load .txt files, handling lowercasing and encoding correction.

4. **Text Preprocessing**: Check text and BERT compatibility and define a process for extracting important tokens for vocabulary enhancement.

5. **Redundancy Reduction** with DataSelector class.

6. **Tokenization**: Convert texts into tokenized datasets and get important tokens.

7. **Model Setup**: Initialize and customize a DistilRoBERTa model for domain-specific training.

8. **Optimizer & Scheduler**: Configure the AdamW optimizer and implement custom learning rate schedulers.

9. **Reproducible Training** Loop with Early Stopping: Define suitable hyperparameters.

10. **Model Evaluation** and Saving.


#### **1. Library Imports**
The following libraries and imports are required to start. Additionally, **PyTorch, Hugging Face Transformers, and NVIDIA CUDA** are mandatory dependencies.
**Note**: The example code requires Python 3.9 or later. It is tested on the stable PyTorch 2.6.0, Transformers 4.48.3, and NVIDIA CUDA 12.4.
We cannot guarantee the stability for newer or older versions of these packages.
Note: There are several great websites to start learning Python, depending on your learning style and goals. Here are some recommendations:

[Real Python](https://realpython.com/)

[Codecademy](https://www.codecademy.com/)

[DataCamp](https://www.datacamp.com/)


In [None]:
import os
import math
import random
import re
import torch
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler, TensorDataset
from torch.optim import AdamW
from transformers import Trainer, TrainingArguments, AutoTokenizer, DataCollatorWithPadding  # AdamW is taken from torch.optim (line above); the transformers version is deprecated
from transformers import RobertaForMaskedLM, DataCollatorForLanguageModeling
from datasets import load_dataset, Dataset
from collections import Counter
tokenizer = AutoTokenizer.from_pretrained('distilroberta-base')

#### **2. Hardware Setup**
Training a BERT model involves extensive matrix multiplications and tensor operations, which are highly parallelizable. Thus, training is significantly faster on an **NVIDIA GPU** than on a CPU.
For training, we used an RTX 4090 GPU, an Intel i9-14900K CPU, and 128 GB of RAM.
The GPU, particularly its VRAM, plays a crucial role in determining the batch size, which directly impacts training speed - insufficient VRAM can significantly slow down or even hinder training.
We recommend at least 12 GB of VRAM (such as an RTX A2000 or RTX 3080) for a corpus of similar size to TaxBERT (20 million tokens). For an initial assessment of suitable 
NVIDIA GPUs, refer to the [CUDA Compute Capability list](https://developer.nvidia.com/cuda-gpus). Higher Compute Capability generally indicates better performance.
Once PyTorch and CUDA are installed, we determine whether a compatible NVIDIA GPU is available for computations and set the device accordingly:


In [None]:
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
print(f"Using device: {device}")

#### **3. Data Loading**
In the following step, we define input folders, as well as the lists that will later contain the training and evaluation texts.


In [None]:
folders = [
    r'C:\Data_Source_Folder_I',
    r'C:\Data_Source_Folder_II',
    r'C:\Data_Source_Folder_III',
]

train_texts = []
eval_texts = []

In the following, we assume that all relevant files are in the form of .txt files. However, you can also use other file formats.
Note: Using .txt files is preferable because they are lightweight, universally compatible, and free from formatting artifacts that might interfere with text processing.
Unlike PDFs or Word documents, which may contain hidden metadata, non-standard encoding, or structural elements, plain text files ensure clean and consistent input for NLP tasks.
The function, **load_text_files(directory)**, takes a folder path as input. It scans the folder for files and selects only those with a .txt extension.
For each of these files, it attempts to read the content using the UTF-8 encoding. If this fails due to encoding issues, it tries again with the Latin-1 encoding.
If that also fails, it makes a final attempt with the Windows-1252 encoding.
The text from the files is processed to ensure consistency: it is converted to lowercase.
Note: Different data sources might require different preprocessing solutions; there is no one-size-fits-all procedure.
Successfully processed texts are then added to a list. Finally, the function returns this list of processed text contents.


In [None]:
import os

def load_text_files(directory):
    """Read every .txt file in `directory` and return the lowercased contents."""
    texts = []
    # Encodings are tried in this order; the first one that works is used.
    # Note: latin-1 maps every byte to a character and never raises
    # UnicodeDecodeError, so windows-1252 is only reached after other errors.
    encodings = ('utf-8', 'latin-1', 'windows-1252')
    for filename in os.listdir(directory):
        if not filename.endswith(".txt"):
            continue
        file_path = os.path.join(directory, filename)
        for encoding in encodings:
            try:
                with open(file_path, 'r', encoding=encoding) as file:
                    texts.append(file.read().lower())
                break  # file was read successfully
            except (UnicodeDecodeError, FileNotFoundError, PermissionError) as e:
                last_error = e
        else:
            # All encodings failed: report the last error and skip the file
            print(f"{filename} {last_error}")
    return texts
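The order of the fallback encodings matters. The following minimal sketch illustrates this on a byte string instead of a file; the helper name `decode_with_fallback` and the sample string are hypothetical and only serve the illustration.

```python
def decode_with_fallback(raw, encodings=("utf-8", "latin-1", "windows-1252")):
    """Return (text, encoding) for the first encoding that decodes `raw`."""
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the encodings could decode the input")

# A string as it might be stored by a Windows application
raw = "Körperschaftsteuer 100 €".encode("windows-1252")

# Same order as in load_text_files: UTF-8 fails, latin-1 accepts any byte
text, used = decode_with_fallback(raw)
print(used)          # latin-1
print("€" in text)   # False: latin-1 turns byte 0x80 into a control character

# Trying windows-1252 before latin-1 recovers the euro sign
text2, used2 = decode_with_fallback(raw, ("utf-8", "windows-1252", "latin-1"))
print(used2, "€" in text2)   # windows-1252 True
```

If your sources contain characters such as the euro sign or typographic quotes, it may be worth checking a few loaded files manually or trying windows-1252 before latin-1.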

#### **4. Text Preprocessing**
BERT models, including variants like DistilBERT and RoBERTa, have a maximum input limit of 512 tokens.
If a text exceeds this limit, BERT will either truncate the excess tokens or throw an error during processing.
The **chunk_text function** is designed to split long text into smaller chunks that fit within this 512-token limit.
This ensures that longer documents can be processed without losing important information or causing errors.

In [None]:
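# Function to split text into chunks of at most `chunk_size` tokens
def chunk_text(text, chunk_size=512):
    tokens = tokenizer.encode(text, add_special_tokens=False)
    # Decode each chunk back to a string, because the later steps
    # (DataSelector, get_important_tokens, tokenize_function) expect text
    return [tokenizer.decode(tokens[i:i+chunk_size]) for i in range(0, len(tokens), chunk_size)]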
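The slicing step can be illustrated without loading a tokenizer. The sketch below uses placeholder integers in place of real token IDs; `chunk_ids` is a hypothetical helper that mirrors only the list comprehension used above.

```python
def chunk_ids(token_ids, chunk_size=512):
    """Split a list of token IDs into consecutive chunks of at most chunk_size."""
    return [token_ids[i:i + chunk_size] for i in range(0, len(token_ids), chunk_size)]

# Placeholder IDs standing in for the output of tokenizer.encode(...)
token_ids = list(range(1200))
chunks = chunk_ids(token_ids)

print(len(chunks))                # 3
print([len(c) for c in chunks])   # [512, 512, 176]
```

The last chunk is shorter than the others; it is padded to the full length later, during tokenization.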

The **clean_token** function is used to clean and standardize tokens to ensure that identical words are not treated as different tokens due to capitalization or trailing punctuation.
It performs two main tasks: first, it converts the token to lowercase, ensuring that words like 'Tax' and 'tax' are treated as the same.
Second, it removes specific punctuation marks: periods (.), commas (,), exclamation points (!), and question marks (?) - if they appear at the end of the token.
For example, 'Profit.' becomes 'profit', 'Revenue,' becomes 'revenue', and 'taxes?' becomes 'taxes'.


In [None]:
# Function to clean tokens for new token extraction
def clean_token(token):
    return re.sub(r'[.,!?]$', '', token.lower())
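As a quick check, the examples from the text can be reproduced directly. The function is repeated here so that the snippet runs on its own; the last two inputs are additional illustrative cases.

```python
import re

def clean_token(token):
    return re.sub(r'[.,!?]$', '', token.lower())

for raw in ['Profit.', 'Revenue,', 'taxes?', 'U.S.', 'really?!']:
    print(raw, '->', clean_token(raw))
# Profit. -> profit
# Revenue, -> revenue
# taxes? -> taxes
# U.S. -> u.s
# really?! -> really?
```

Note that only a single trailing punctuation mark is removed and that punctuation inside a token is left untouched.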

The function **get_important_tokens** is designed to identify and extract the vocabulary to be added during the process of vocabulary augmentation.
Focusing solely on adding terms simplifies vocabulary augmentation by balancing efficiency and effectiveness, making it an ideal strategy for many domain-adaptive pretraining tasks.
In the process of pretraining TaxBERT, we follow Webersinke et al. [2022] by augmenting the vocabulary of DistilRoBERTa through the addition of the 235 most common tokens in our tax corpus to the tokenizer, leading to a vocabulary increase from 50,265 to 50,500.

Below you find a simple custom solution for identifying important tokens. First, we merge the entire corpus into a single string and then split it into individual tokens.
Next, Python’s Counter counts how frequently each token appears in the corpus. Before counting, each token is passed through the previously designed **clean_token function**.
We further recommend filtering out tokens that are not suitable for adding to the tokenizer’s vocabulary. We keep only those tokens that appear at least 50 times in the corpus.
Additionally, tokens that already exist in the tokenizer’s current vocabulary are excluded.
As a further example showcased here, we also remove tokens that contain the special string '<num>', tokens that are only one character long, tokens that contain any digits, and tokens that contain parentheses.
Note: Since the best filtering approach depends on the specific characteristics of the text corpus, we encourage experimenting with different strategies to optimize token selection.

In [None]:
def get_important_tokens(corpus, tokenizer, min_freq=50, max_vocab_size=50500):
    all_tokens = ' '.join(corpus).split()
    token_freq = Counter(clean_token(token) for token in all_tokens)
    vocab = tokenizer.get_vocab()  # fetch once instead of rebuilding it for every token

    # Filter tokens by min frequency, length, not containing digits, excluding "<num>", and excluding parentheses
    new_tokens = [
        token for token, freq in token_freq.items()
        if freq >= min_freq and
           token not in vocab and
           '<num>' not in token and
           len(token) > 1 and
           not any(char.isdigit() for char in token) and
           '(' not in token and ')' not in token
    ]
    
    # Sort the new tokens by frequency in descending order
    sorted_new_tokens = sorted(new_tokens, key=lambda x: token_freq[x], reverse=True)
    
    # Calculate the number of new tokens to add
    original_vocab_size = len(tokenizer.get_vocab())
    num_new_tokens_to_add = max_vocab_size - original_vocab_size
    
    # Select the most frequent new tokens up to the allowed limit
    important_tokens = sorted_new_tokens[:num_new_tokens_to_add]
    
    return important_tokens
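The filtering logic can be tried on a toy example before running it on a full corpus. In the sketch below, the two sentences, the `existing_vocab` set (standing in for tokenizer.get_vocab()), and the threshold of 2 are assumptions chosen for illustration only.

```python
import re
from collections import Counter

def clean_token(token):
    return re.sub(r'[.,!?]$', '', token.lower())

# Hypothetical toy corpus and vocabulary, for illustration only
corpus = [
    "deferred tax assets and deferred tax liabilities.",
    "Deferred tax expense (net) rose in 2023.",
]
existing_vocab = {"tax", "and", "in"}   # stands in for tokenizer.get_vocab()
min_freq = 2                            # the notebook uses 50 on the real corpus

token_freq = Counter(clean_token(t) for t in ' '.join(corpus).split())

candidates = [
    tok for tok, freq in token_freq.items()
    if freq >= min_freq
    and tok not in existing_vocab
    and len(tok) > 1
    and not any(ch.isdigit() for ch in tok)
    and '(' not in tok and ')' not in tok
]
candidates.sort(key=lambda t: token_freq[t], reverse=True)

print(candidates)   # ['deferred']
```

Here 'tax' is frequent but already known, '(net)' and '2023' are filtered out by the character rules, and all remaining tokens occur only once.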

Next, we initialize the relevant folders and create lists that will later store the final training and evaluation data.

In [None]:
folders = [
    r'C:\enter_your_first_folder',
    r'C:\enter_your_second_folder'
 ]

# Initialize lists to hold the final train and evaluation data
train_texts = []
eval_texts = []

#### **5. Redundancy Reduction**
In the next step, we provide a simple yet effective solution for dealing with redundancy within a given corpus.
Redundancy is a common characteristic in corporate reporting, as firms often operate closely within legal frameworks, leading to standardized language and repetitive phrasing across documents.
To address this, we define **DataSelector** as a class specifically designed to filter and select text samples based on diversity metrics.
Upon initialization, it accepts two parameters: **keep**, which determines the proportion of texts to retain; **diversity_metrics**, a list specifying which diversity measures to apply.
In addition, for purposes of demonstration, we also include the **tokenizer** parameter (although it is not currently used in the snippet).
The DataSelector class employs two primary metrics to quantify text diversity: the Type-Token Ratio (TTR) and entropy.
The TTR evaluates lexical variety by calculating the ratio of unique tokens to the total number of tokens within a text.
In contrast, entropy measures the unpredictability or informational richness of a text, derived from the probability distribution of token frequencies.
The fit method computes a diversity score for each text in the dataset by aggregating the selected metrics.
Subsequently, the transform method ranks the texts based on these scores and retains a specified fraction of texts with the lowest diversity scores, effectively prioritizing less diverse or more uniform texts.


In [None]:
import numpy as np
from collections import Counter
import math

class DataSelector:
    def __init__(self, keep, tokenizer, diversity_metrics):
        self.keep = keep
        self.tokenizer = tokenizer
        self.diversity_metrics = diversity_metrics

    def _type_token_ratio(self, text):
        tokens = text.split()
        return len(set(tokens)) / len(tokens) if len(tokens) > 0 else 0

    def _entropy(self, text):
        token_counts = Counter(text.split())
        total = sum(token_counts.values())
        return -sum((count/total) * math.log2(count/total) for count in token_counts.values() if count > 0)

    def fit(self, texts):
        self.scores = []
        for text in texts:
            score = 0
            if "type_token_ratio" in self.diversity_metrics:
                score += self._type_token_ratio(text)
            if "entropy" in self.diversity_metrics:
                score += self._entropy(text)
            self.scores.append(score)

    def transform(self, texts):
        # Sort texts by their diversity scores
        sorted_indices = np.argsort(self.scores)
        n_keep = int(self.keep * len(texts))
        # Keep the texts with the lowest diversity scores (ascending sort, first n_keep entries)
        selected_indices = sorted_indices[:n_keep]
        return [texts[i] for i in selected_indices]
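To get a feeling for the two metrics, they can be computed by hand for two short texts. The sketch below re-implements both measures as standalone functions; the two example strings are hypothetical.

```python
import math
from collections import Counter

def type_token_ratio(text):
    tokens = text.split()
    return len(set(tokens)) / len(tokens) if tokens else 0

def entropy(text):
    counts = Counter(text.split())
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

repetitive = "tax tax tax tax"          # one unique token out of four
varied = "income tax return form"       # four unique tokens out of four

scores = {t: type_token_ratio(t) + entropy(t) for t in (repetitive, varied)}

# TTR: 0.25 vs. 1.0; entropy: 0 bits vs. 2 bits
ranked = sorted(scores, key=scores.get)   # ascending, as in transform()
kept = ranked[:int(0.5 * len(ranked))]
print(kept)   # ['tax tax tax tax']
```

With an ascending sort and keep=0.5, the text with the lower combined score is the one that is retained. If you prefer to keep the most diverse texts instead, one option is to select from the end of the sorted indices.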

Next, we process the text corpus from multiple sources. The workflow begins by iterating over a set of predefined folders, each containing text files.
For each folder, the script loads the raw text files using the **load_text_files** function and subsequently splits the texts into smaller, manageable chunks of 512 tokens via the **chunk_text function**.
A critical step in the workflow is the selective application of the DataSelector class.
If the DataSelector should not be applied to every folder, restrict it to the desired folders with an if statement, as done below for folders whose path ends with _example.
In our case, the DataSelector is configured to retain 50% of the text chunks, namely those with the lowest combined diversity score from the Type-Token Ratio and entropy.
Once the texts are filtered, the script converts them into a structured dataset using the Dataset.from_dict method from the Hugging Face datasets library.
Each dataset is then split into training and evaluation subsets, with 20% of the data reserved for evaluation to ensure robust model validation.
The split is randomized using a **fixed seed for reproducibility**. Finally, the resulting training and evaluation texts from each folder are appended to their respective lists, aggregating data across different sources for comprehensive model training and assessment.


In [None]:
# Load tax corpus from different sources and split into chunks
for folder in folders:
    raw_texts = load_text_files(folder)
    tax_texts = []
    for text in raw_texts:
        tax_texts.extend(chunk_text(text, chunk_size=512))

    # Apply DataSelector to folders that end with _example
    if folder.endswith('_example'):
        selector = DataSelector(
            keep=0.5,
            tokenizer=tokenizer,
            diversity_metrics=[
                "type_token_ratio",
                "entropy",
            ],
        )
        selector.fit(tax_texts)
        tax_texts = selector.transform(tax_texts)

    # Create Dataset for the current folder
    dataset = Dataset.from_dict({'text': tax_texts})
    
    # Split dataset into train and evaluation sets
    train_test_split = dataset.train_test_split(test_size=0.2, seed=42) #Note: seed fixed for reproducibility
    train_dataset = train_test_split['train']
    eval_dataset = train_test_split['test']
    
    # Append the split datasets to the final lists
    train_texts.extend(train_dataset['text'])
    eval_texts.extend(eval_dataset['text'])

In the next step, we consolidate the previously processed data into training and evaluation datasets.


In [None]:
# Build the final training and evaluation datasets from the collected chunks
final_train_dataset = Dataset.from_dict({'text': train_texts})
final_eval_dataset = Dataset.from_dict({'text': eval_texts})

#### **6. Tokenization**
The next script retrieves the original **vocabulary of the tokenizer** using get_vocab(). This vocabulary consists of all tokens that the pretrained model recognizes by default.
Next, we call the previously defined function get_important_tokens to identify frequently occurring or semantically significant tokens that are critical for accurate text representation.
These newly identified tokens are then added to the tokenizer’s vocabulary using **add_tokens()**.
After expanding the vocabulary, the script calculates the difference between the new and original vocabularies to determine which tokens were successfully added.
It also performs an intersection check to identify any tokens from the important set that were already present in the original vocabulary, ensuring that only truly novel terms are considered additions.
**Finally, we recommend manually reviewing** the added tokens to assess whether they meaningfully extend the vocabulary.


In [None]:
# Initialize tokenizer and get important tokens
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('distilroberta-base')
original_vocab = set(tokenizer.get_vocab().keys())

# Call get_important_tokens
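# Use the chunks from all folders; tax_texts only holds the last folder after the loop
important_tokens = get_important_tokens(train_texts + eval_texts, tokenizer)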
tokenizer.add_tokens(important_tokens)
new_vocab = set(tokenizer.get_vocab().keys())
added_tokens = new_vocab - original_vocab

# Check if any of the added tokens are in the original vocab
tokens_already_in_vocab = original_vocab.intersection(important_tokens)
tokens_not_in_vocab = set(important_tokens) - original_vocab

print("Added tokens:", added_tokens)

Next, we define and apply a **tokenize_function** to prepare text datasets for input into a transformer-based model.
The script begins by defining the tokenize_function, which takes a batch of text samples and applies the tokenizer to each example.
The tokenizer processes the text with specific parameters to standardize the input format.
The function ensures that all tokenized sequences are padded to a uniform length using the padding='max_length' parameter.
This uniformity is essential for efficient batch processing, as it allows the model to handle multiple inputs simultaneously without encountering dimensional inconsistencies.
Additionally, the truncation=True setting ensures that any text exceeding the maximum allowable length is appropriately shortened.
The maximum length is set to 512 tokens, aligning with the typical input size constraints of transformer models like BERT or RoBERTa, which ensures that all data conforms to the model’s architectural limitations.
Once the tokenization function is defined, we apply it to both the training and evaluation datasets using the map method from the Hugging Face datasets library.
By setting batched=True, the function receives multiple text samples per call instead of one at a time, which significantly accelerates the tokenization process, especially when dealing with large datasets.
The resulting tokenized datasets, **tokenized_train_dataset** and **tokenized_eval_dataset**, are now in a format suitable for model training and evaluation.

In [None]:
# Tokenize function
def tokenize_function(examples):
    return tokenizer(examples['text'], padding='max_length', truncation=True, max_length=512)
# Tokenize both datasets
tokenized_train_dataset = final_train_dataset.map(tokenize_function, batched=True)
tokenized_eval_dataset = final_eval_dataset.map(tokenize_function, batched=True)

#### **7. Model Setup**
Now, let’s start with domain-adaptive pretraining. First, we initialize a **data collator**.
The DataCollatorForLanguageModeling is imported from the Hugging Face transformers library and is designed to dynamically apply masking to tokens within input sequences during the training process.
By applying masks dynamically, the data collator ensures that the model is continuously exposed to new masking patterns in each epoch.
We configure the data collator with the previously initialized tokenizer, ensuring that the tokenization scheme remains consistent throughout the preprocessing pipeline.
The parameter mlm=True specifies that the collator will prepare data for masked language modeling.
Note: You can also configure the mlm_probability parameter, which sets the share of tokens selected for masking (the library default is 0.15). The value of 0.20 used below means that 20% of the tokens in each input sequence are randomly selected for masking during training.
Second, we load a pretrained transformer model and customize it for a masked language modeling (MLM) task.
We begin by importing **RobertaConfig** and **RobertaForMaskedLM** from the Hugging Face transformers library.
For many researchers, especially those in academic or smaller institutional settings, computational resources are limited and expensive to acquire.
In line with Webersinke et al. [2022], we therefore recommend using **DistilRoBERTa** as the starting point for training.
It is a distilled version of RoBERTa, which itself is an optimized version of the BERT model, enhancing performance across various NLP tasks.
Thus, the RobertaConfig object is initialized using the configuration of DistilRoBERTa.
To better adapt the model to the specific dataset and potentially address overfitting issues, we recommend modifying key hyperparameters within this configuration.
Both the **hidden_dropout_prob** and **attention_probs_dropout_prob** can be increased (the default is 0.1; in this case, for example, they are increased to 0.3).
The hidden dropout affects the feedforward layers of the model, while the attention dropout applies to the attention mechanisms that allow the model to focus on different parts of the input text.
By increasing these dropout rates, the model is regularized more aggressively, which helps prevent overfitting. Overfitting is a common issue when fine-tuning on smaller or highly specialized datasets, such as those used in accounting and taxation research.
We also recommend increasing these rates when dealing with repetitive language and limited variation, which might cause the model to memorize patterns rather than generalize effectively.
The example script then loads the pretrained DistilRoBERTa model with the modified configuration. Next, the model's token embeddings are **resized** using model.resize_token_embeddings(len(tokenizer)).
This step is crucial because the tokenizer was previously expanded to include new, domain-specific tokens not present in the original vocabulary.
Finally, the model is moved to the device selected in Step 2 using model.to(device). If a GPU is available, subsequent training steps leverage the **GPU** resources for optimal performance.
This stage sets the foundation for adapting a general-purpose language model to domain-specific texts.

In [None]:
# Data collator
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.20)
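The idea of dynamic masking can be sketched in a few lines of plain Python. This is a simplified illustration, not the collator's implementation: the helper `mask_tokens` is hypothetical, it works on words instead of token IDs, and it always inserts the mask token, whereas the Hugging Face collator by default replaces only about 80% of the selected tokens with the mask token, swaps about 10% for random tokens, and leaves about 10% unchanged.

```python
import random

def mask_tokens(tokens, mlm_probability=0.20, mask_token="<mask>", rng=None):
    """Replace each token with the mask token with probability mlm_probability."""
    rng = rng or random.Random()
    return [mask_token if rng.random() < mlm_probability else tok for tok in tokens]

tokens = "the deferred tax asset is recognised for deductible temporary differences".split()

# A different random state in every epoch gives a different masking pattern
epoch_1 = mask_tokens(tokens, rng=random.Random(1))
epoch_2 = mask_tokens(tokens, rng=random.Random(2))

print(" ".join(epoch_1))
print(" ".join(epoch_2))
```

Because the mask is drawn anew each time a batch is built, the model sees the same sentence with different gaps across epochs.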

In [None]:
# Load model and resize token embeddings
from transformers import RobertaConfig, RobertaForMaskedLM

# Load the model configuration and adjust the dropout rates
config = RobertaConfig.from_pretrained('distilroberta-base')
config.hidden_dropout_prob = 0.3
config.attention_probs_dropout_prob = 0.3

model = RobertaForMaskedLM.from_pretrained('distilroberta-base', config=config)
model.resize_token_embeddings(len(tokenizer))
model.to(device)
print(f"Model is on device: {next(model.parameters()).device}")

#### **8. Optimizer & Scheduler**
Let’s set up the next stages. We use the **AdamW optimizer** from PyTorch (torch.optim.AdamW).
AdamW is an advanced optimization algorithm designed to adjust the model’s weights during training, ensuring that the model learns effectively from the data.
The optimizer is configured to manage the model's parameters during training. The learning rate (here: lr=5e-5) controls how much the model’s weights are adjusted with each step.
A learning rate that’s too high could make the model unstable, causing it to jump over the optimal solution, while a rate that’s too low would slow down the learning process.
The eps parameter, known as epsilon, is a small number added to prevent division by zero during calculations. It ensures numerical stability without significantly affecting the learning process.
The betas (here: 0.9, 0.999) parameters are coefficients for calculating moving averages of the gradients and their squares.
These values help smooth out the learning process by balancing how much the optimizer relies on past gradients versus new information.
This makes the learning process more stable, even when the data is noisy or complex, which is often the case in accounting and tax-related texts.
Finally, weight_decay=0.01 adds regularization by penalizing large weights, helping the model generalize better to new, unseen data.

In [None]:
from torch.optim import AdamW  # the AdamW class in transformers is deprecated

In [None]:
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5, eps=1e-6, betas=(0.9, 0.999), weight_decay=0.01)

The next section of the code introduces an early stopping mechanism for your BERT model training, which is a critical feature to ensure the model doesn’t overfit your dataset.
Overfitting occurs when a model learns patterns that are too specific to the training data and fails to generalize well to new, unseen data.
The code starts by importing the necessary classes from the Hugging Face transformers library. These imports allow you to customize and control the training process.
The **TrainerCallback** class serves as a base for creating hooks that can intervene at various stages of training, such as during evaluation or after each epoch.
**TrainerState** holds the current state of training, and **TrainerControl** manages how the training process proceeds based on certain conditions.
When initializing the **EarlyStoppingCallback** class, we define how patient the model should be before stopping training if no improvement occurs.
The **early_stopping_patience** parameter, set to 1 here, means that if the model doesn’t improve after one evaluation cycle, it will stop training.
The best_metric keeps track of the lowest evaluation loss achieved, while **patience_counter** counts how many consecutive times the model has failed to improve.
The core functionality lies in the **on_evaluate** method. Every time the model evaluates its performance on the validation set, this method checks the evaluation loss.
If the current evaluation loss is better (i.e., lower) than the best one recorded, it updates best_metric and resets the patience_counter to zero.
However, if there’s no improvement, it increases the patience_counter by one. Once the patience counter reaches the set threshold (early_stopping_patience), the training stops automatically by setting control.should_training_stop = True.
This ensures that the model remains efficient and avoids overfitting.

In [None]:
from transformers import TrainerCallback, TrainerState, TrainerControl

class EarlyStoppingCallback(TrainerCallback):
    def __init__(self, early_stopping_patience=1):
        self.early_stopping_patience = early_stopping_patience
        self.best_metric = None
        self.best_model_checkpoint = None
        self.patience_counter = 0

    def on_evaluate(self, args, state: TrainerState, control: TrainerControl, **kwargs):
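        # The Trainer passes the evaluation results as the `metrics` keyword argument
        metrics = kwargs.get("metrics") or {}
        current_metric = metrics.get("eval_loss")
        if current_metric is None:
            return  # no evaluation loss available, nothing to compare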

        if self.best_metric is None or current_metric < self.best_metric:
            self.best_metric = current_metric
            self.best_model_checkpoint = state.global_step
            self.patience_counter = 0
        else:
            self.patience_counter += 1

        if self.patience_counter >= self.early_stopping_patience:
            control.should_training_stop = True
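The patience logic can be traced on a list of evaluation losses without running any training. The helper `stopping_evaluation` and the loss values below are hypothetical; the comparison and counter updates follow the callback above.

```python
def stopping_evaluation(eval_losses, patience=1):
    """Return the index of the evaluation at which training would stop, or None."""
    best = None
    counter = 0
    for i, loss in enumerate(eval_losses):
        if best is None or loss < best:
            best, counter = loss, 0      # improvement: remember it, reset patience
        else:
            counter += 1                 # no improvement
        if counter >= patience:
            return i
    return None

losses = [2.10, 1.85, 1.90, 1.70]
print(stopping_evaluation(losses, patience=1))   # 2
print(stopping_evaluation(losses, patience=2))   # None
```

With a patience of 1, training stops at the third evaluation and never reaches the better loss of 1.70. A larger patience tolerates such temporary setbacks at the cost of longer training.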

The next part of the code sets up the **data loading** process, which is essential for efficiently feeding your dataset into the BERT model during both training and evaluation.
First, we import the DataLoader class from PyTorch. DataLoader handles batching, shuffling, and loading data in parallel, optimizing the training process.
Next, we create two data loaders: one for training and one for evaluation. The train_dataloader takes the tokenized_train_dataset and organizes it into batches of 64 samples.
The shuffle=True argument ensures that the training data is randomly mixed at the start of each epoch.
This randomness prevents the model from learning the sequence of the data, which helps improve generalization and reduce overfitting.
The eval_dataloader handles the validation dataset (tokenized_eval_dataset) with the same batch size of 64, but without shuffling.
Keeping the evaluation data in a consistent order ensures that model performance is measured reliably across epochs.
This consistency is crucial when tracking improvements or identifying potential issues during training.

In [None]:
from torch.utils.data import DataLoader

In [None]:
train_dataloader = DataLoader(tokenized_train_dataset, batch_size=64, shuffle=True)
eval_dataloader = DataLoader(tokenized_eval_dataset, batch_size=64)

Next, we set the core training parameters. The **num_train_epochs** = 12 means the model will pass through the entire training dataset 12 times.
This gives the model multiple opportunities to learn from the data, refining its understanding with each epoch. However, too many epochs can lead to overfitting, while too few may result in underfitting.
The **train_batch_size** = 64 defines how many samples the model processes before updating its weights.
A batch size of 64 is large enough to provide stable gradient estimates while still being efficient for most hardware setups.
The **gradient_accumulation_steps** = 8 allows for accumulating gradients over multiple mini-batches before performing an optimizer step to update the model weights.
This effectively simulates a larger batch size without requiring additional memory.
For instance, with a batch size of 64 and accumulation steps of 8, the model behaves as if it’s processing a batch size of 512. This is particularly useful when memory constraints are a limiting factor.


In [None]:
# Training parameters
num_train_epochs = 12
train_batch_size = 64
gradient_accumulation_steps = 8 
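Why accumulation simulates a larger batch can be verified on a toy problem. The sketch below assumes a one-parameter model y = w * x with a mean squared error loss and equally sized mini-batches; all names and numbers are illustrative and much smaller than in the actual training setup.

```python
import math

def grad_mse(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

xs = [float(i) for i in range(1, 17)]   # 16 toy samples
ys = [3.0 * x + 1.0 for x in xs]
w = 0.5
micro_batch_size = 2
accumulation_steps = 8                  # 2 * 8 = effective batch size of 16

# Gradient computed on the full batch of 16 samples at once
full_batch_grad = grad_mse(w, xs, ys)

# Gradient accumulated over 8 mini-batches of 2 samples each
accumulated = 0.0
for step in range(accumulation_steps):
    lo = step * micro_batch_size
    hi = lo + micro_batch_size
    accumulated += grad_mse(w, xs[lo:hi], ys[lo:hi]) / accumulation_steps
# Only at this point would the optimizer update w once

print(math.isclose(accumulated, full_batch_grad))   # True
```

Each mini-batch still requires its own forward and backward pass; only the weight update is postponed, which is why memory usage stays at the level of the small batch.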


Next, the code calculates how many optimizer steps the training run will take, which is needed to set up the learning rate schedule.
Here, **steps_per_epoch** is the number of samples in tokenized_train_dataset divided by train_batch_size.
The math.ceil function rounds this number up so that the last, possibly incomplete, batch is counted as well.
Multiplying steps_per_epoch by num_train_epochs gives the total number of mini-batches processed during training.
Because gradients are accumulated over gradient_accumulation_steps mini-batches before each weight update, an integer division by this number yields the number of actual weight updates, **total_training_steps**.
Finally, the **warmup_steps** parameter controls how many steps are used to gradually increase the learning rate at the start of training.
Setting it to 2.5% of the total training steps means that the learning rate starts small and ramps up over the first 2.5% of the weight updates.
This gradual warm-up helps prevent large, unstable updates at the beginning of training, which could otherwise cause the model to perform poorly or even fail to converge.

In [None]:
# Calculate total training steps
steps_per_epoch = math.ceil(len(tokenized_train_dataset) / train_batch_size)
total_training_steps = num_train_epochs * steps_per_epoch // gradient_accumulation_steps
warmup_steps = int(0.025 * total_training_steps)  # 2.5% of total training steps
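
As a worked example of this arithmetic, the sketch below repeats the calculation for a hypothetical training set of 100,000 samples. The sample count is an assumption chosen for illustration; the remaining values are those used in this notebook.

```python
import math

n_train_samples = 100_000  # hypothetical corpus size, for illustration only
num_train_epochs = 12
train_batch_size = 64
gradient_accumulation_steps = 8

# 100,000 / 64 = 1562.5, rounded up so the last incomplete batch is counted
steps_per_epoch = math.ceil(n_train_samples / train_batch_size)
# 12 * 1563 = 18,756 mini-batches, grouped into weight updates of 8 mini-batches each
total_training_steps = num_train_epochs * steps_per_epoch // gradient_accumulation_steps
# 2.5% of the weight updates are used for warm-up
warmup_steps = int(0.025 * total_training_steps)

print(steps_per_epoch)       # 1563
print(total_training_steps)  # 2344
print(warmup_steps)          # 58
```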

The next part of our code sets up a custom **learning rate scheduler**, which controls how the learning rate changes throughout the training process.
While get_linear_schedule_with_warmup is a standard scheduler from the Hugging Face Transformers library, our script demonstrates how to implement a custom schedule using LambdaLR from PyTorch.
**LambdaLR** takes a function, passed as the lr_lambda parameter, that returns a multiplicative factor for each training step; the effective learning rate is the learning rate configured in the optimizer multiplied by this factor.
**initial_lr** represents the learning rate at the end of the warm-up phase, and **final_lr** is the rate the schedule moves towards as training progresses. For the factors to translate into exactly these rates, the learning rate of the optimizer should equal initial_lr.
In the warm-up phase, when current_step is less than warmup_steps, the factor increases linearly from 0 to 1, so the learning rate rises from 0 to the learning rate of the optimizer.
Starting with a low learning rate helps stabilize the model in the early stages, especially when adapting it to sensitive financial data.
After the warm-up period, the factor changes linearly from 1 to final_lr / initial_lr, based on the share of post-warm-up steps that have already been completed.
With the example values initial_lr = 5e-6 and final_lr = 5e-5, this ratio is 10, so the learning rate keeps rising linearly after the warm-up. To obtain a decaying schedule, choose final_lr smaller than initial_lr, for example by swapping the two values.
A gradual reduction in the learning rate allows the model to make finer adjustments as it converges, helping it settle into a good state without overshooting.
Finally, we connect the custom lr_lambda function to the optimizer, ensuring that the learning rate updates at every optimizer step according to the defined schedule.


In [None]:
from transformers import get_linear_schedule_with_warmup  # Standard alternative, not used below
from torch.optim.lr_scheduler import LambdaLR

# Learning rate schedule parameters
initial_lr = 5e-6
final_lr = 5e-5

def lr_lambda(current_step: int):
    # Returns the factor that LambdaLR multiplies with the optimizer's learning rate
    if current_step < warmup_steps:
        # Linear warmup: the factor rises from 0 to 1
        return float(current_step) / float(max(1, warmup_steps))
    else:
        # Linear change of the factor from 1 to final_lr / initial_lr
        progress = float(current_step - warmup_steps) / float(max(1, total_training_steps - warmup_steps))
        return max(0.0, initial_lr - progress * (initial_lr - final_lr)) / initial_lr

scheduler = LambdaLR(optimizer, lr_lambda)
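
To check what a given pair of values does before training, the factor function can be evaluated at a few steps without an optimizer. The sketch below wraps the same formula in a hypothetical helper, `make_lr_lambda`, and uses made-up step counts (100 warm-up steps, 1,000 total steps) purely to keep the numbers readable.

```python
import math

def make_lr_lambda(warmup_steps, total_training_steps, initial_lr, final_lr):
    """Build the factor function from the cell above for the given settings."""
    def lr_lambda(current_step):
        if current_step < warmup_steps:
            return float(current_step) / float(max(1, warmup_steps))
        progress = float(current_step - warmup_steps) / float(max(1, total_training_steps - warmup_steps))
        return max(0.0, initial_lr - progress * (initial_lr - final_lr)) / initial_lr
    return lr_lambda

# Hypothetical step counts, chosen only for illustration
warmup_steps, total_training_steps = 100, 1000

# Values as in the cell above: final_lr is larger than initial_lr
rising = make_lr_lambda(warmup_steps, total_training_steps, initial_lr=5e-6, final_lr=5e-5)
# Swapped values: final_lr is smaller than initial_lr
decaying = make_lr_lambda(warmup_steps, total_training_steps, initial_lr=5e-5, final_lr=5e-6)

# Print the factor at the start, mid warm-up, end of warm-up, mid training, and the last step
for step in (0, 50, 100, 550, 1000):
    print(step, round(rising(step), 3), round(decaying(step), 3))
```

Both variants reach a factor of 1 at the end of the warm-up. At the last step, the factor is about 10 for the first variant and about 0.1 for the second, so only the second one lowers the learning rate over the course of training.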

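As an **alternative**, you could also set up a **cosine learning rate scheduler** with warm-up. Note that running the following cell overwrites the scheduler variable defined above.
After the linear warm-up, this scheduler lowers the learning rate along a half cosine curve: the rate decreases slowly at first, faster in the middle of training, and flattens out as it approaches zero at the end.
Compared with a linear decay, the rate stays higher during the first half of training and lower during the second half, which can help the model converge more smoothly.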

In [None]:
from transformers import get_cosine_schedule_with_warmup
scheduler = get_cosine_schedule_with_warmup(optimizer, warmup_steps, total_training_steps)
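
The shape of this schedule can be inspected with a few lines of plain Python. The following sketch reimplements the factor for a linear warm-up followed by a half cosine decay, assuming the default half-cycle setting of get_cosine_schedule_with_warmup; the function name `cosine_factor` and the step counts are hypothetical.

```python
import math

def cosine_factor(current_step, num_warmup_steps, num_training_steps):
    """Factor for a linear warm-up followed by a half cosine decay to zero."""
    if current_step < num_warmup_steps:
        return current_step / max(1, num_warmup_steps)
    progress = (current_step - num_warmup_steps) / max(1, num_training_steps - num_warmup_steps)
    return max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))

# Hypothetical schedule: 100 warm-up steps out of 1,000 steps in total
for step in (0, 100, 325, 550, 775, 1000):
    print(step, round(cosine_factor(step, 100, 1000), 3))
```

The factor is 1 at the end of the warm-up, roughly 0.5 halfway through the decay phase, and 0 at the last step.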

####**9. Reproducible Training**
The following code defines the **training parameters for pretraining** a transformer-based model on a specific domain using the TrainingArguments class from the Hugging Face transformers library.
Hyperparameters determine the configuration and behavior of the model during training, influencing how effectively it learns from the domain-specific corpus.

First, the **number of epochs** refers to the total number of complete passes the model makes through the training dataset during pretraining.
A sufficient number of epochs allows the model to learn complex patterns and relationships within the domain-specific corpus.
However, too many epochs can lead to overfitting, where the model memorizes the training data instead of generalizing well to unseen data.
For domain-adaptive pretraining, the number of epochs is typically set between three and 20, depending on the dataset’s size and complexity.

Second, the effective **batch size** specifies the total number of training samples used collectively to update the model’s parameters.
There is a trade-off when selecting a batch size: larger batch sizes improve computational efficiency and provide smoother gradient updates but require more memory, whereas smaller batch sizes introduce more noise into updates, which can sometimes enhance generalization but slow down convergence.
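The effective batch size is the product of the batch size per device, the number of gradient accumulation steps, and the number of devices. Gradient accumulation is a technique that accumulates gradients over multiple steps before updating the model.
This approach simulates a larger batch size, stabilizing training. In the code below, a per-device batch size of 32 with 16 accumulation steps gives an effective batch size of 512 on a single GPU, the same as the 64 × 8 setting used for the step calculation above.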

To speed up data loading, dataloader_num_workers is set to 32, so that up to 32 worker processes fetch training samples in parallel. This is particularly beneficial when working with large datasets; on machines with fewer CPU cores, a smaller value is usually more appropriate.
Mixed-precision training (fp16=True) is enabled, allowing the model to use lower-precision floating-point arithmetic. On supported GPUs, this can significantly accelerate computations while reducing memory consumption.

Third, the **learning rate** controls the step size at which the model updates its parameters (weights) during training in response to computed gradients.
It determines how quickly the model learns from the data. Choosing the optimal learning rate is challenging:
if it is too low, the model takes very small steps, resulting in slow training.
If it is too high, the model may take excessively large steps and overshoot the minimum of the loss function, which measures the difference between the model’s predictions and the true labels; training can then become unstable or fail to converge.

Finally, setting the **random seed** to 42 ensures **reproducibility**, making it possible to obtain consistent results when re-running the training process.
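
The Trainer applies the seed given in TrainingArguments, and Transformers also provides set_seed for the same purpose. For code that runs outside the Trainer, a helper along the following lines can be used. This is a minimal sketch; the function name `seed_everything` is hypothetical, and NumPy and PyTorch are seeded only if they are installed.

```python
import random

def seed_everything(seed: int = 42) -> None:
    """Seed Python's random module and, if they are installed, NumPy and PyTorch."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
        if torch.cuda.is_available():
            torch.cuda.manual_seed_all(seed)
    except ImportError:
        pass

# Drawing random numbers twice with the same seed gives identical sequences
seed_everything(42)
first_run = [random.random() for _ in range(3)]
seed_everything(42)
second_run = [random.random() for _ in range(3)]
print(first_run == second_run)
```

Even with fixed seeds, some GPU operations are not deterministic, so small differences between runs can remain.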

We further recommend performing a **grid search** over learning rates ∈ {2e-5, 3e-5, 4e-5, 5e-5, 2e-4, 3e-4, 4e-4, 5e-4} and batch sizes ∈ {8, 16, 32, 64}.
This hyperparameter tuning method systematically trains and evaluates the model for every combination of these values and identifies the combination that yields the best performance.
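
The grid described above can be enumerated as in the following sketch. The function `train_and_evaluate` is a hypothetical placeholder that returns a dummy score so that the example runs; in practice, it would train the model with the given settings and return the evaluation loss.

```python
from itertools import product

learning_rates = [2e-5, 3e-5, 4e-5, 5e-5, 2e-4, 3e-4, 4e-4, 5e-4]
batch_sizes = [8, 16, 32, 64]

def train_and_evaluate(learning_rate, batch_size):
    """Hypothetical placeholder: train with these settings and return the evaluation loss."""
    # Dummy score (assumption) so that the sketch runs; replace with a real training run
    return abs(learning_rate - 5e-5) * 1e4 + abs(batch_size - 32) / 100

# Evaluate all 8 x 4 = 32 combinations
results = {
    (lr, bs): train_and_evaluate(lr, bs)
    for lr, bs in product(learning_rates, batch_sizes)
}

# Select the combination with the lowest (dummy) evaluation loss
best_lr, best_bs = min(results, key=results.get)
print(len(results), best_lr, best_bs)
```

Because each combination requires a full training run, the grid is often explored on a subset of the corpus or with fewer epochs first.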


In [None]:
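# Training arguments
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir=r'C:\insert_your_path\Results',
    overwrite_output_dir=True,
    num_train_epochs=12,
    per_device_train_batch_size=32,  # Bigger batch sizes for bigger GPUs
    gradient_accumulation_steps=16,  # 32 x 16 = effective batch size of 512 on one GPU
    dataloader_num_workers=32,  # Number of subprocesses to use for data loading
    save_strategy="epoch",  # Save at the end of each epoch, matching eval_strategy
    save_total_limit=3,
    fp16=True,  # Enable mixed precision for more efficient training (requires a supported GPU)
    logging_dir=r'C:\insert_your_path\Logs',  # Specify a directory for logs
    eval_strategy="epoch",  # Evaluate the model at the end of each epoch
    load_best_model_at_end=True,  # Reload the best checkpoint; required for early stopping in some versions
    metric_for_best_model="eval_loss",  # Metric monitored by EarlyStoppingCallback
    greater_is_better=False,  # A lower evaluation loss is better
    warmup_steps=warmup_steps,  # Only used if the Trainer creates its own scheduler
    learning_rate=5e-5,  # Only used if the Trainer creates its own optimizer
    weight_decay=0.001,  # Only used if the Trainer creates its own optimizer
    seed=42  # Crucial for reproducibility
)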

####**10. Model Evaluation**
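The remainder is straightforward: initialize the trainer with the model, the datasets, the custom optimizer and scheduler, and an early stopping callback, then train, save, and evaluate the model.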


In [None]:
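# Initialize trainer
from transformers import Trainer, EarlyStoppingCallback

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_eval_dataset,
    optimizers=(optimizer, scheduler),  # Custom optimizer and scheduler defined above
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)]  # Stop after 3 evaluations without improvement
)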
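
The effect of early_stopping_patience=3 can be illustrated with a simplified sketch of the patience logic. This is not the library's implementation; the function name `epochs_until_stop` and the list of evaluation losses are hypothetical.

```python
def epochs_until_stop(eval_losses, patience=3):
    """Return the 1-based epoch after which training would stop, or None if it never stops early."""
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(eval_losses, start=1):
        if loss < best_loss:
            # New best evaluation loss: reset the counter
            best_loss = loss
            epochs_without_improvement = 0
        else:
            # No improvement: count this evaluation against the patience budget
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None

# Hypothetical evaluation losses, one per epoch
hypothetical_losses = [2.00, 1.80, 1.70, 1.75, 1.72, 1.71, 1.60]
print(epochs_until_stop(hypothetical_losses))  # 6
```

In this example, the best loss is reached in epoch 3. Epochs 4 to 6 bring no improvement, so training stops after epoch 6 and the lower loss of epoch 7 is never reached. A larger patience value would avoid this at the cost of longer training.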

In [None]:
torch.cuda.empty_cache()

In [None]:
# Train the model
trainer.train()

In [None]:
model.save_pretrained(r'C:\your_path\Model')
tokenizer.save_pretrained(r'C:\your_path\Model')

In [None]:
# Evaluate the model
trainer.evaluate()
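
trainer.evaluate() returns a dictionary that includes the evaluation loss under the key eval_loss. For masked language modeling, this loss is the mean cross-entropy over the masked tokens, so its exponential can be reported as a perplexity-style measure. The sketch below uses a hypothetical metrics dictionary in place of the real output.

```python
import math

# Hypothetical output of trainer.evaluate(); the real values depend on the model and data
eval_metrics = {"eval_loss": 1.25}

# Perplexity is the exponential of the mean cross-entropy loss
perplexity = math.exp(eval_metrics["eval_loss"])
print(f"Perplexity: {perplexity:.2f}")
```

Lower values indicate that the model predicts the masked tokens of the evaluation set with more confidence. Comparing this value before and after domain-adaptive pretraining gives a first indication of how much the model has adapted to the domain corpus.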

In [None]:
trainer.save_model(r"C:\your_path\Model\model")

MIT License
Copyright (c) 2025 Marius Weiß

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the Software), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
##### If the Software is used for academic or scientific purposes, cite the paper Hechtner et al., (2025) How to Design and Employ Specialized Large Language Models for Accounting and Tax Research: The Example of TaxBERT.

THE SOFTWARE IS PROVIDED **AS IS**, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
