#### This GitHub Repository accompanies the Paper
## **How to Design and Employ Specialized Large Language Models for Accounting and Tax Research: The Example of TaxBERT**
**Frank Hechtner, Lukas Schmidt, Andreas Seebeck, and Marius Weiß**
##### If this guide/repository is used for academic or scientific purposes, please cite the paper: Hechtner et al. (2025), How to Design and Employ Specialized Large Language Models for Accounting and Tax Research: The Example of TaxBERT.
##### Link to paper: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5146523
##### Version as of February 2025
## Part 2 of example code: Fine-tuning


This notebook aims to give a comprehensive guide for fine-tuning the previously pretrained LLM.

1. **Library Imports**: Import necessary libraries from PyTorch and Hugging Face for handling large language models (LLMs) and datasets.

2. **Reproducibility**: Set the random seed to a fixed number.

3. **Set up examples in Excel**: Organize the examples in Excel and store them in lists.

4. **Create dataset**: Define a custom dataset class.

5. **Create training and validation dataset**: Split the dataset into training and validation subsets.

6. **Fine-tuning**: Run the fine-tuning and evaluation process.

7. **Save the model**


####**1. Library Imports**
The following libraries and imports are required to start. **PyTorch** and **Hugging Face Transformers** are mandatory dependencies; **NVIDIA CUDA** is needed for training on a GPU (the code below falls back to the CPU if no GPU is found).
**Note**: The example code requires Python 3.9 or later. It was tested with the stable PyTorch 2.6.0, Transformers 4.48.3, and NVIDIA CUDA 12.4.
We cannot guarantee that it runs unchanged with newer or older versions of these packages.


In [None]:
import random
import re

import numpy as np
import pandas as pd
import torch
from sklearn.metrics import f1_score
from torch.utils.data import Dataset, DataLoader, random_split
from tqdm import tqdm
# Note: AdamW from transformers is deprecated; torch.optim.AdamW is the suggested replacement
from transformers import AutoTokenizer, AutoModelForSequenceClassification, AdamW

####**2. Reproducibility**
Setting the **random seed** to a fixed number (here: 42) supports **reproducibility**, making it possible to obtain consistent results when re-running the training process. Some GPU operations can still introduce small run-to-run differences.

In [None]:
random_seed = 42
random.seed(random_seed)
np.random.seed(random_seed)
torch.manual_seed(random_seed)
torch.cuda.manual_seed_all(random_seed)
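The seeding calls above can be bundled into one function so that they are easy to repeat before each run. The following is a minimal sketch; the helper name `set_seed` and the two cuDNN settings are assumptions added for illustration and are not part of the original notebook.

```python
import random

import numpy as np
import torch


def set_seed(seed: int = 42) -> None:
    """Seed Python, NumPy and PyTorch (hypothetical helper, not from the notebook)."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # silently ignored when no GPU is present
    # Optional assumption: prefer deterministic cuDNN kernels over speed
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False


# Re-seeding reproduces the same random draws
set_seed(42)
first_draw = torch.rand(3)
set_seed(42)
second_draw = torch.rand(3)
print(torch.equal(first_draw, second_draw))
```

If the two draws match, the seed is applied as intended on your machine.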

####**3. Set up examples in Excel**
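We recommend organizing the examples in an Excel file with exactly two columns.
The Text column contains the sentences/paragraphs/documents and is converted into a list called texts.
The Label column contains the class of each example (e.g., tax-related vs. not tax-related) and is converted into a list called labels.
Because the labels are later converted to integer tensors, they should be encoded as integers starting at 0 (e.g., 0 = not tax-related, 1 = tax-related).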

In [None]:
# read in your examples
data = pd.read_excel(r'C:\specify_your_path.xlsx')

# two lists for texts and labels
texts = data['Text'].tolist()
labels = data['Label'].tolist()
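If the Label column holds text categories rather than integers, they need to be mapped to integer ids before the dataset is built. The sketch below shows one way to do this; the label names and the variable `raw_labels` are made up for illustration.

```python
# Hypothetical example values; in the notebook they would come from data['Label'].tolist()
raw_labels = ["tax", "not_tax", "tax", "not_tax", "tax"]

# Assign each distinct label an integer id, in sorted order for a stable mapping
label2id = {label: idx for idx, label in enumerate(sorted(set(raw_labels)))}
id2label = {idx: label for label, idx in label2id.items()}

# Integer labels that can be converted to a torch.long tensor later on
labels = [label2id[label] for label in raw_labels]

print(label2id)
print(labels)
```

Keeping `id2label` around makes it possible to translate model predictions back into readable categories.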

####**4. Create dataset**
The AutoTokenizer processes the text data: it truncates inputs so that they do not exceed the maximum sequence length of 512 tokens,
pads shorter sequences so that all inputs have the same length, and returns the result as PyTorch tensors (**return_tensors="pt"**).
This transformation is essential because transformer models operate on numerical tensor representations, not on raw text.
Next, a custom **dataset class**, TextDataset, is defined, inheriting from PyTorch's Dataset class.
This class takes the tokenized encodings and the corresponding labels as input and structures them into a format that can be fed efficiently into the model.
Within the dataset class, the **__getitem__** method returns each sample as a dictionary containing both the tokenized text and its label, which is explicitly converted into a PyTorch tensor of type long.
The **__len__** method returns the total number of samples in the dataset.
Finally, an instance of TextDataset is created from the processed encodings and labels.
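To make truncation and padding concrete, here is a toy sketch that works on made-up token ids. The function `pad_and_truncate` is hypothetical and only mimics the effect of `truncation=True, padding=True`; the real tokenizer additionally handles special tokens (for example, it keeps the closing separator token when truncating).

```python
from typing import Dict, List


def pad_and_truncate(batch: List[List[int]], max_length: int, pad_id: int = 0) -> Dict[str, List[List[int]]]:
    """Toy stand-in for truncation and padding on already-tokenized ids."""
    # Cut every sequence to at most max_length tokens
    truncated = [ids[:max_length] for ids in batch]
    # Pad to the longest remaining sequence in the batch
    target_len = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (target_len - len(ids)) for ids in truncated]
    # 1 marks real tokens, 0 marks padding
    attention_mask = [[1] * len(ids) + [0] * (target_len - len(ids)) for ids in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Made-up token ids; max_length=4 instead of 512 keeps the example readable
encoded = pad_and_truncate([[101, 7, 8, 9, 10, 102], [101, 5, 102]], max_length=4)
print(encoded["input_ids"])
print(encoded["attention_mask"])
```

The attention mask tells the model which positions are padding and should be ignored.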

In [None]:
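# path to the pretrained model from Part 1 (the same path is used to load the model in step 6)
model_path = r'C:\specify_your_models_path'

tokenizer = AutoTokenizer.from_pretrained(model_path)
encodings = tokenizer(texts, truncation=True, padding=True, max_length=512, return_tensors="pt")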

class TextDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: val[idx] for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx], dtype=torch.long)  # set the dtype explicitly
        return item

    def __len__(self):
        return len(self.labels)

dataset = TextDataset(encodings, labels)
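To see what a single sample looks like, the dataset class can be tried out with small hand-made tensors instead of real tokenizer output. This is only an illustrative sketch; the token ids and labels in `toy_encodings` and `toy_dataset` are invented.

```python
import torch
from torch.utils.data import Dataset


class TextDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        # one row of every encoding tensor plus the label as a long tensor
        item = {key: val[idx] for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx], dtype=torch.long)
        return item

    def __len__(self):
        return len(self.labels)


# Made-up encodings for three short "texts"
toy_encodings = {
    "input_ids": torch.tensor([[101, 7, 102], [101, 8, 102], [101, 9, 102]]),
    "attention_mask": torch.ones(3, 3, dtype=torch.long),
}
toy_dataset = TextDataset(toy_encodings, [0, 1, 0])

sample = toy_dataset[1]
print(len(toy_dataset))
print(sample["input_ids"], sample["labels"], sample["labels"].dtype)
```

Each sample is a dictionary whose keys match the argument names the model expects, which is why a batch can later be passed as `model(**batch)`.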

####**5. Create training and validation dataset**
The next step is responsible for splitting the dataset into training and validation subsets and preparing it for efficient batch processing during model training and evaluation.
It also initializes key variables to track model performance over multiple evaluation rounds.
The previously defined dataset is first divided into two subsets.
The training set comprises 80% of the data, while the remaining 20% is allocated to validation.
This is done using the random_split function from PyTorch, which assigns the samples at random to prevent any systematic bias in training or validation.
Note: random_split draws from PyTorch's global random number generator, so the seed set in step 2 fixes the split as long as the notebook is run from top to bottom. Passing an explicit, seeded generator makes the split independent of any other random calls.
Once the datasets are created, they are wrapped in DataLoader objects (**train_loader** and **val_loader**), which facilitate efficient batch processing.
Finally, a few variables are initialized.
num_evaluations sets the number of independent fine-tuning runs (here: 1).
Two empty lists, val_losses and val_f1_scores, are also initialized.
These will store the validation loss and the F1 score of each run, respectively.

In [None]:
train_size = int(0.8 * len(dataset))
val_size = len(dataset) - train_size
train_dataset, val_dataset = random_split(dataset, [train_size, val_size])
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=64)

num_evaluations = 1
val_losses = []
val_f1_scores = []
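A split that does not depend on the global random state can be obtained by handing random_split its own seeded generator. The sketch below uses a plain list as a stand-in for the dataset; the helper `split` and the seed value are assumptions for illustration.

```python
import torch
from torch.utils.data import random_split

items = list(range(10))  # stand-in for the TextDataset instance
train_size = int(0.8 * len(items))
val_size = len(items) - train_size


def split(seed: int):
    """Split `items` 80/20 with a dedicated, seeded generator (hypothetical helper)."""
    generator = torch.Generator().manual_seed(seed)
    return random_split(items, [train_size, val_size], generator=generator)


train_a, val_a = split(42)
train_b, val_b = split(42)

print(len(train_a), len(val_a))
print(val_a.indices == val_b.indices)  # same seed, same validation samples
```

With this variant, the same samples end up in the validation set even if other random operations run before the split.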

####**6. Fine-tuning**
Let's run the fine-tuning and evaluation process. The script first checks whether a GPU (CUDA) is available and sets the computing device accordingly.
The model is loaded from your specified path (see Part 1 of the example code, step 10) and moved to the selected **device** (the GPU if one is available, otherwise the CPU).
Again, **AdamW** is the optimizer, here with a learning rate of 5e-5.
The training process consists of multiple epochs, each involving two main steps: training and validation.
During **training**, the model iterates over the batches of data, computing the loss, performing backpropagation, and updating the model's parameters.
The loss is accumulated to monitor training progress.
After each epoch, the model is **evaluated** on the validation set. The **validation loss**, the weighted **F1 score**, and the **accuracy** are calculated.
To limit overfitting, we employ **early stopping** by tracking the validation loss.
If the validation loss does not improve for patience consecutive epochs, training halts early to avoid unnecessary computation. With the values used below (patience = 4, num_epochs = 5), this can happen only in the final epoch, so lower patience or raise num_epochs if early stopping should take effect.
Whenever a new best validation loss is achieved, the model's parameters are saved.
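The early-stopping rule described above can be sketched in isolation, without a model. The function `run_early_stopping` and the loss values are made up for illustration; the logic mirrors the counters best_val_loss and epochs_no_improve used in the training code.

```python
from typing import List, Tuple


def run_early_stopping(val_losses: List[float], patience: int) -> Tuple[int, float, bool]:
    """Return (epochs run, best validation loss, whether training stopped early)."""
    best_val_loss = float("inf")
    epochs_no_improve = 0
    for epoch, val_loss in enumerate(val_losses, start=1):
        if val_loss < best_val_loss:
            # new best loss: remember it and reset the counter
            best_val_loss = val_loss
            epochs_no_improve = 0
        else:
            epochs_no_improve += 1
            if epochs_no_improve >= patience:
                return epoch, best_val_loss, True
    return len(val_losses), best_val_loss, False


# Made-up validation losses, one per epoch
losses = [0.60, 0.45, 0.47, 0.50, 0.52, 0.55, 0.58]
print(run_early_stopping(losses, patience=2))
print(run_early_stopping(losses[:5], patience=4))
```

In the first call, training stops after two epochs without improvement; in the second, five epochs are too few for a patience of 4 to trigger.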


In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [None]:
model_path = r'C:\specify_your_models_path'

for evaluation in range(num_evaluations):
    print(f'Starting Evaluation Run {evaluation + 1}/{num_evaluations}')

    # initialize model
    model = AutoModelForSequenceClassification.from_pretrained(model_path)
    model.config.problem_type = "single_label_classification"
    model.to(device)
    optimizer = AdamW(model.parameters(), lr=5e-5)
    
    # Early stopping
    patience = 4
    best_val_loss = float('inf')
    epochs_no_improve = 0
    num_epochs = 5

    for epoch in range(num_epochs):
        model.train()
        total_train_loss = 0
        for batch in tqdm(train_loader, desc=f'Training Epoch {epoch + 1}'):
            batch = {k: v.to(device) for k, v in batch.items()}
            optimizer.zero_grad()
            outputs = model(**batch)
            loss = outputs.loss
            loss.backward()
            optimizer.step()
            total_train_loss += loss.item()
        
        avg_train_loss = total_train_loss / len(train_loader)
        
        # evaluation after each epoch
        model.eval()
        val_loss = 0
        total = 0
        correct = 0
        all_labels = []
        all_predictions = []
        with torch.no_grad():
            for batch in val_loader:
                batch = {k: v.to(device) for k, v in batch.items()}
                outputs = model(**batch)
                val_loss += outputs.loss.item()

                predictions = torch.argmax(outputs.logits, dim=-1)
                # collect labels and predictions so that F1 is computed once over the whole validation set
                all_labels.extend(batch['labels'].cpu().tolist())
                all_predictions.extend(predictions.cpu().tolist())

                total += batch['labels'].size(0)
                correct += (predictions == batch['labels']).sum().item()
        
        avg_val_loss = val_loss / len(val_loader)
        # weighted F1 over all validation samples (averaging per-batch F1 scores would give a different value)
        avg_val_f1 = f1_score(all_labels, all_predictions, average='weighted')
        val_accuracy = correct / total

        print(f'Epoch {epoch + 1}: Training Loss: {avg_train_loss:.4f}, Validation Loss: {avg_val_loss:.4f}, Validation F1 Score: {avg_val_f1:.4f}, Validation Accuracy: {val_accuracy:.4f}')

        # early stopping
        if avg_val_loss < best_val_loss:
            best_val_loss = avg_val_loss
            epochs_no_improve = 0
            # optional: save the best model
            torch.save(model.state_dict(), f'best_model_run{evaluation+1}.pt')
        else:
            epochs_no_improve += 1
            if epochs_no_improve >= patience:
                print('Early stopping triggered')
                break

    # store the best validation loss of this run and the F1 score of its last completed epoch
    val_losses.append(best_val_loss)
    val_f1_scores.append(avg_val_f1)

mean_val_loss = np.mean(val_losses)
std_val_loss = np.std(val_losses)
mean_val_f1 = np.mean(val_f1_scores)
std_val_f1 = np.std(val_f1_scores)

print(f'Validation Loss (Mean ± Std): {mean_val_loss:.4f} ± {std_val_loss:.4f}')
print(f'Validation F1 Score (Mean ± Std): {mean_val_f1:.4f} ± {std_val_f1:.4f}')
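How the validation F1 score is aggregated matters: the mean of per-batch F1 scores generally differs from the F1 score computed once over all validation predictions, especially when batches are small or unevenly sized. The following sketch illustrates the difference with made-up labels and predictions for two batches.

```python
from sklearn.metrics import f1_score

# Made-up validation batches: (true labels, predictions)
batches = [
    ([1, 1, 0, 0], [1, 1, 0, 0]),
    ([1, 0], [0, 1]),
]

# Variant 1: weighted F1 per batch, then averaged over batches
per_batch_f1 = [f1_score(y_true, y_pred, average="weighted") for y_true, y_pred in batches]
mean_batch_f1 = sum(per_batch_f1) / len(per_batch_f1)

# Variant 2: weighted F1 computed once over all samples
all_true = [label for y_true, _ in batches for label in y_true]
all_pred = [pred for _, y_pred in batches for pred in y_pred]
overall_f1 = f1_score(all_true, all_pred, average="weighted")

print(f"Mean of per-batch F1: {mean_batch_f1:.4f}")
print(f"F1 over all samples:  {overall_f1:.4f}")
```

The second variant does not depend on the batch size and is usually the figure to report.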

####**7. Save the model**
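In the last step, we save the model and the tokenizer to save_path in the Hugging Face format, so that both can be reloaded later with from_pretrained.
Note that model holds the weights of the last completed epoch, which is not necessarily the epoch with the lowest validation loss.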


In [None]:
save_path = r'C:\specify_your_path_here'
model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)
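If the best epoch rather than the last one should be exported, the state dict written during training can be loaded back into the model before calling save_pretrained. The sketch below shows the mechanism with a small linear layer as a stand-in for the fine-tuned model; the temporary directory and the variable names are assumptions for illustration.

```python
import os
import tempfile

import torch

stand_in_model = torch.nn.Linear(4, 2)  # small stand-in for the fine-tuned classifier
checkpoint_path = os.path.join(tempfile.mkdtemp(), "best_model_run1.pt")

# During training: store the weights of the best epoch
torch.save(stand_in_model.state_dict(), checkpoint_path)
best_weight = stand_in_model.weight.detach().clone()

# Later epochs keep changing the weights
with torch.no_grad():
    stand_in_model.weight.add_(1.0)

# Before exporting: restore the best weights
stand_in_model.load_state_dict(torch.load(checkpoint_path, map_location="cpu"))
print(torch.equal(stand_in_model.weight, best_weight))
```

With the real model, the same load_state_dict call would precede model.save_pretrained(save_path).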

MIT License
Copyright (c) 2025 Marius Weiß

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the Software), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
##### If the Software is used for academic or scientific purposes, cite the paper Hechtner et al., (2025) How to Design and Employ Specialized Large Language Models for Accounting and Tax Research: The Example of TaxBERT.


THE SOFTWARE IS PROVIDED **AS IS**, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
