# Self-Supervised Fine-Tuning of Mistral-7B on Audit Reports

This notebook demonstrates how to adapt a Large Language Model (Mistral-7B) to the domain of professional audit reports using self-supervised fine-tuning (continued pretraining). 

- **Objective**: Enhance the model's domain fluency, vocabulary, and stylistic consistency for audit documentation.
- **Method**: Causal Language Modeling (Next-Token Prediction) on raw text extracted from PDF reports.
- **Hardware**: Optimized for a T4 GPU (Google Colab free tier compatible) using QLoRA (4-bit quantization + LoRA).

## 1. Setup and Installation
We need to install the necessary libraries for PDF extraction, efficient model loading, and training.

**IMPORTANT**: After running the installation cell below, you MUST restart the runtime/session (Runtime > Restart session) for the updates to take effect, then run the cells starting from the imports.

In [1]:
# Install all key dependencies including PyTorch components to ensure version compatibility
!pip install -q -U torch torchvision torchaudio transformers peft datasets bitsandbytes trl pdfplumber accelerate

print("Installation complete. Please RESTART the runtime (Runtime > Restart session) to apply changes, then run the next cells.")

Installation complete. Please RESTART the runtime (Runtime > Restart session) to apply changes, then run the next cells.
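
After restarting the runtime, it can help to confirm which versions were actually picked up. The snippet below is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical, and the package list is simply copied from the `pip install` line above.

```python
from importlib.metadata import PackageNotFoundError, version

# Package names copied from the pip install line in the installation cell.
PACKAGES = [
    "torch", "transformers", "peft", "datasets",
    "bitsandbytes", "trl", "pdfplumber", "accelerate",
]

def installed_versions(packages):
    """Return {package: version string, or None if the package is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

versions = installed_versions(PACKAGES)
for name, ver in versions.items():
    print(f"{name:15s} {ver or 'NOT INSTALLED'}")
```

Any package reported as `NOT INSTALLED` suggests the installation cell did not finish or the runtime was not restarted.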


In [2]:
import os
import glob
import pdfplumber
import torch
from datasets import Dataset, DatasetDict
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    Trainer,
    DataCollatorForLanguageModeling
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training, TaskType
import re

# Set seed for reproducibility
torch.manual_seed(42)

<torch._C.Generator at 0x7b00d3b86a30>

In [3]:
import sys
from pathlib import Path

# Check if running in Colab
if 'google.colab' in sys.modules:
    from google.colab import drive
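    try:
        drive.mount('/content/drive')
    except Exception as e:
        # Report mount failures instead of silently ignoring them
        print(f"Could not mount Google Drive: {e}")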
    # Update DATA_DIR to point to mounted Drive
    # Make sure you have uploaded the Data folder to your Google Drive root
    DATA_DIR = Path('/content/drive/MyDrive/Data')
    print(f"Mounted Google Drive. DATA_DIR set to: {DATA_DIR}")
else:
    DATA_DIR = Path("./Data")
    print(f"Not running in Colab. Using local Data directory: {DATA_DIR}")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Mounted Google Drive. DATA_DIR set to: /content/drive/MyDrive/Data


## 2. Data Preparation

We will extract text from the PDF audit reports located in the `Data` directory. 

**Cleaning Steps**:
- Extract text using `pdfplumber`.
- Remove potential headers and footers (heuristic: very short lines at top/bottom of pages).
- Normalize whitespace.
- Anonymize sensitive patterns (placeholder implementation).
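
To see what these steps do before running them on real PDFs, the sketch below applies the same heuristics to a made-up page string. The thresholds (50 and 20 characters) and the email regex mirror the extraction cell in this notebook; the sample text and the helper names `strip_header_footer` and `normalize_and_mask` are hypothetical.

```python
import re

def strip_header_footer(page_text, header_max=50, footer_max=20):
    """Drop a short first line (likely header) and a short last line (likely page number)."""
    lines = page_text.split("\n")
    if len(lines) > 2:
        if len(lines[0]) < header_max:
            lines = lines[1:]
        if lines and len(lines[-1]) < footer_max:
            lines = lines[:-1]
    return "\n".join(lines)

def normalize_and_mask(text):
    """Collapse all whitespace runs to single spaces, then mask email addresses."""
    text = re.sub(r"\s+", " ", text).strip()
    return re.sub(r"[\w\.-]+@[\w\.-]+", "[EMAIL]", text)

# Made-up page: a short title line, two body lines, and a page number.
page = (
    "Audit Quality Review 2025\n"
    "The inspection covered a sample of audits selected on a risk basis.\n"
    "Queries can be sent to jane.doe@example.com   before the deadline.\n"
    "12"
)

cleaned = normalize_and_mask(strip_header_footer(page))
print(cleaned)
```

Note that the header rule is length-based only, so a genuine first line shorter than 50 characters (for example a short heading that belongs to the body) would also be dropped.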

In [4]:
def extract_text_from_pdf(pdf_path):
    text_content = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            # Extract text
            text = page.extract_text()
            if not text:
                continue
            
            lines = text.split('\n')
            
            # Basic Heuristic: Remove first and last lines if they likely resemble headers/footers (e.g., page numbers or short titles)
            # Adjust this logic based on your specific PDF layout
            if len(lines) > 2:
                # Remove header if short (arbitrary length < 50 chars as a heuristic)
                if len(lines[0]) < 50:
                    lines = lines[1:]
                # Remove footer if short and looks like page number
                if len(lines) > 0 and len(lines[-1]) < 20:
                    lines = lines[:-1]
            
            page_text = "\n".join(lines)
            text_content.append(page_text)
    
    full_text = "\n\n".join(text_content)
    return full_text

def clean_data(text):
    # Normalize whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    
    # Placeholder for anonymization (e.g., replace emails, phone numbers)
    # This regex is a simple example and should be expanded for real production use
    text = re.sub(r'[\w\.-]+@[\w\.-]+', '[EMAIL]', text)
    
    return text

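# Main Data Loading Loop
# DATA_DIR is set in the Colab/local detection cell; fall back to ./Data if that cell was skipped
try:
    data_dir = DATA_DIR
except NameError:
    data_dir = "./Data"
    
print(f"Searching for PDFs in: {data_dir}")
# os.path.join works for both str and Path values of data_dir
pdf_files = glob.glob(os.path.join(data_dir, "*.pdf"))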

raw_texts = []
for pdf_file in pdf_files:
    print(f"Processing {pdf_file}...")
    try:
        raw_text = extract_text_from_pdf(pdf_file)
        cleaned_text = clean_data(raw_text)
        if len(cleaned_text) > 500: # Only keep documents with substantial content
            raw_texts.append(cleaned_text)
    except Exception as e:
        print(f"Error processing {pdf_file}: {e}")

print(f"\nSuccessfully loaded {len(raw_texts)} documents.")

Searching for PDFs in: /content/drive/MyDrive/Data
Processing /content/drive/MyDrive/Data/Annual_Review_of_Audit_Quality_2025.pdf...
Processing /content/drive/MyDrive/Data/BDO_LLP_Audit_Quality_Inspection_and_Supervision_2025.pdf...
Processing /content/drive/MyDrive/Data/Deloitte_LLP_Audit_Quality_Inspection_and_Supervision_2025.pdf...
Processing /content/drive/MyDrive/Data/Ernst__Young_LLP_Audit_Quality_Inspection_and_Supervision_2025.pdf...
Processing /content/drive/MyDrive/Data/Forvis_Mazars_LLP_Audit_Quality_Inspection_and_Supervision_2025.pdf...
Processing /content/drive/MyDrive/Data/KPMG_LLP_Audit_Quality_Inspection_and_Supervision_2025.pdf...
Processing /content/drive/MyDrive/Data/PricewaterhouseCoopers_LLP_Audit_Quality_Inspection_and_Supervision_2025.pdf...
Processing /content/drive/MyDrive/Data/Annual_Review_of_Audit_Quality_2024_7yhxTsi.pdf...
Processing /content/drive/MyDrive/Data/Tier_1_Firms__Overview_2023.pdf...
Processing /content/drive/MyDrive/Data/FRC_Audit_Quality_In

## 3. Dataset Tokenization and Chunking

We need to split the tokenized text into fixed-length chunks that fit in GPU memory during training.
- **Context Window**: 1024 tokens per chunk (reduced from 2048 to save VRAM).
- **Overlap**: None. Documents are concatenated and split into consecutive, non-overlapping chunks (packing).
- **Format**: A Hugging Face `Dataset`.
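
The packing strategy can be illustrated without a tokenizer. The sketch below uses small integer lists as stand-ins for tokenized documents and a chunk size of 4 instead of 1024; the function name `pack_into_chunks` is hypothetical.

```python
from typing import List

def pack_into_chunks(docs: List[List[int]], chunk_size: int) -> List[List[int]]:
    """Concatenate token lists and split into non-overlapping chunks, dropping the remainder."""
    flat = [tok for doc in docs for tok in doc]
    usable = (len(flat) // chunk_size) * chunk_size
    return [flat[i : i + chunk_size] for i in range(0, usable, chunk_size)]

# Three toy "documents" with 5, 3 and 5 tokens (13 tokens in total).
docs = [[1, 2, 3, 4, 5], [6, 7, 8], [9, 10, 11, 12, 13]]
chunks = pack_into_chunks(docs, chunk_size=4)

for chunk in chunks:
    print(chunk)
```

Two properties are worth noticing: a chunk can span a document boundary (the second chunk mixes tokens from the first two documents), and the trailing token that does not fill a full chunk is discarded.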

In [5]:
# Create HF Dataset
dataset = Dataset.from_dict({"text": raw_texts})

# Split into train and validation
dataset = dataset.train_test_split(test_size=0.1, seed=42)
print(dataset)

# Load Tokenizer
model_id = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token # Mistral has no pad token by default

def chunk_and_tokenize(examples):
    chunk_size = 1024 # Reduced from 2048 to save VRAM
    
    # Tokenize without padding or truncation
    tokens = tokenizer(examples["text"], truncation=False, return_attention_mask=False)["input_ids"]
    
    # Flatten the per-document token lists into one long list of tokens
    concatenated_tokens = [tok for doc in tokens for tok in doc]
    
    # Keep the largest length divisible by chunk_size
    # The small remainder at the end of each mapped batch is dropped
    total_length = len(concatenated_tokens)
    if total_length >= chunk_size:
        total_length = (total_length // chunk_size) * chunk_size
    else:
        # Unlikely case where the whole batch has fewer than chunk_size tokens:
        # pad with EOS up to chunk_size
        concatenated_tokens += [tokenizer.eos_token_id] * (chunk_size - total_length)
        total_length = chunk_size

    # Split into consecutive chunks of chunk_size tokens
    chunks = [concatenated_tokens[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
    
    # For causal LM the labels are the inputs themselves; the model shifts them internally
    return {"input_ids": chunks, "labels": [chunk.copy() for chunk in chunks]}

# Apply processing
tokenized_dataset = dataset.map(
    chunk_and_tokenize,
    batched=True,
    remove_columns=dataset["train"].column_names,
    desc="Chunking and Tokenizing"
)

print(f"Train chunks: {len(tokenized_dataset['train'])}")
print(f"Test chunks: {len(tokenized_dataset['test'])}")

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 9
    })
    test: Dataset({
        features: ['text'],
        num_rows: 2
    })
})


Error while fetching `HF_TOKEN` secret value from your vault: 'Requesting secret HF_TOKEN timed out. Secrets can only be fetched when running from the Colab UI.'.
You are not authenticated with the Hugging Face Hub in this notebook.
If the error persists, please let us know by opening an issue on GitHub (https://github.com/huggingface/huggingface_hub/issues/new).


Train chunks: 107
Test chunks: 23


## 4. Model Loading with QLoRA

We load Mistral-7B in 4-bit quantization to fit on a T4 GPU.
Then we attach LoRA adapters for parameter-efficient fine-tuning.
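
A back-of-the-envelope estimate shows why 4-bit loading matters on a 16 GB T4. The sketch below is a rough calculation only: the base parameter count is derived from the totals printed by `print_trainable_parameters()` in this notebook (all parameters minus LoRA parameters), and it ignores quantization constants, layers that stay in higher precision, activations, optimizer state, and CUDA overhead.

```python
# Base model parameters: 7,283,675,136 total minus 41,943,040 LoRA parameters.
BASE_PARAMS = 7_283_675_136 - 41_943_040

GIB = 1024 ** 3

def weight_memory_gib(num_params, bits_per_param):
    """Approximate memory for the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / GIB

fp16_gib = weight_memory_gib(BASE_PARAMS, 16)
nf4_gib = weight_memory_gib(BASE_PARAMS, 4)

print(f"fp16 weights:  ~{fp16_gib:.1f} GiB")
print(f"4-bit weights: ~{nf4_gib:.1f} GiB")
```

Under these assumptions the half-precision weights alone would nearly fill the card, while the 4-bit weights leave room for activations, LoRA gradients, and optimizer state.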

In [6]:
# 4-bit Quantization Config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16, # or bfloat16 if supported by hardware
    bnb_4bit_use_double_quant=False,
)

# Load Base Model
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True
)

# Enable gradient checkpointing to save memory
model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)

# LoRA Configuration
peft_config = LoraConfig(
    r=16, # Rank
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    # Target all linear layers for better adaptation
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]
)

model = get_peft_model(model, peft_config)
model.print_trainable_parameters()

trainable params: 41,943,040 || all params: 7,283,675,136 || trainable%: 0.5758
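
The trainable-parameter count above can be reproduced by hand. For a linear layer with `in` inputs and `out` outputs, LoRA adds two matrices of shape `r x in` and `out x r`, so `r * (in + out)` parameters. The sketch below assumes the layer shapes of the public Mistral-7B-v0.1 configuration (hidden size 4096, MLP size 14336, 8 key/value heads of dimension 128, 32 layers); these values are not stated in this notebook.

```python
R = 16            # LoRA rank, as in the LoraConfig above
NUM_LAYERS = 32   # assumed from the Mistral-7B-v0.1 config
HIDDEN = 4096     # assumed hidden size
KV_DIM = 8 * 128  # assumed: 8 key/value heads x head dimension 128
MLP = 14336       # assumed intermediate (MLP) size

# (in_features, out_features) for each targeted module in one decoder layer
module_shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}

def lora_params(in_features, out_features, rank):
    """Parameters in the A (rank x in) and B (out x rank) matrices."""
    return rank * (in_features + out_features)

per_layer = sum(lora_params(i, o, R) for i, o in module_shapes.values())
total_lora = per_layer * NUM_LAYERS

print(f"LoRA parameters per layer: {per_layer:,}")
print(f"LoRA parameters in total:  {total_lora:,}")
```

The total matches the `trainable params` figure printed by `print_trainable_parameters()`, which confirms that all seven target modules in every layer received an adapter.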


## 5. Training

We use the basic `Trainer` with `DataCollatorForLanguageModeling`. 
The objective is purely self-supervised next-token prediction.

In [7]:
# Clear cache before training
torch.cuda.empty_cache()

# Data Collator
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

# Training Arguments
training_args = TrainingArguments(
    output_dir="./audit-mistral-finetuned",
    per_device_train_batch_size=1, # Reduced to 1 to fit T4 VRAM
    gradient_accumulation_steps=8, # Increased to 8 to maintain effective batch size
    learning_rate=2e-4,
    logging_steps=10,
    num_train_epochs=3, # Increased to 3 epochs
    save_strategy="epoch",
    eval_strategy="steps", # Evaluate more frequently
    eval_steps=10,
    fp16=True,
    optim="paged_adamw_8bit", # Memory efficient optimizer
    report_to="none"
)

# Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    data_collator=data_collator,
)

# Start Training
trainer.train()

| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 10 | 2.144592 | 2.107712 |
| 20 | 1.78909 | 1.95172 |
| 30 | 1.445677 | 1.882343 |
| 40 | 1.28202 | 1.864601 |


TrainOutput(global_step=42, training_loss=1.6343639180773781, metrics={'train_runtime': 1842.9007, 'train_samples_per_second': 0.174, 'train_steps_per_second': 0.023, 'total_flos': 1.4106535567294464e+16, 'train_loss': 1.6343639180773781, 'epoch': 3.0})
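
The step count in the training output follows from the dataset size and the batch settings. The sketch below is a worked calculation assuming a single GPU and assuming that a partial final accumulation group counts as one optimizer step, which is consistent with the logged `global_step=42`; other `transformers` versions may round differently.

```python
import math

train_chunks = 107        # "Train chunks" printed after tokenization
per_device_batch = 1      # per_device_train_batch_size
grad_accum = 8            # gradient_accumulation_steps
epochs = 3                # num_train_epochs
chunk_size = 1024         # tokens per chunk

batches_per_epoch = math.ceil(train_chunks / per_device_batch)
steps_per_epoch = math.ceil(batches_per_epoch / grad_accum)
total_steps = steps_per_epoch * epochs

# Tokens contributing to each optimizer update
tokens_per_step = per_device_batch * grad_accum * chunk_size

print(f"Optimizer steps per epoch: {steps_per_epoch}")
print(f"Total optimizer steps:     {total_steps}")
print(f"Tokens per optimizer step: {tokens_per_step}")
```

With `eval_steps=10`, this also explains why the log contains exactly four evaluation rows (steps 10, 20, 30 and 40).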

## 6. Evaluation

We calculate Perplexity as a quantitative metric of how well the model predicts the domain text.

In [8]:
import math

eval_results = trainer.evaluate()
perplexity = math.exp(eval_results['eval_loss'])
print(f"Perplexity: {perplexity:.2f}")

Perplexity: 6.47
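
Perplexity is the exponential of the mean per-token negative log-likelihood, PPL = exp(-(1/N) * sum(log p(x_i | x_<i))), which is why it can be computed directly from `eval_loss`. The sketch below is a small illustration of that relation using values from this notebook's logs; the helper names are hypothetical.

```python
import math

def perplexity(mean_nll):
    """Perplexity from a mean negative log-likelihood (natural log), e.g. eval_loss."""
    return math.exp(mean_nll)

def perplexity_from_token_logprobs(logprobs):
    """Perplexity from a list of per-token log-probabilities."""
    return math.exp(-sum(logprobs) / len(logprobs))

# Validation loss logged at step 40 in the training table.
ppl_step40 = perplexity(1.864601)

# Inverse direction: the loss implied by the final reported perplexity of 6.47.
implied_loss = math.log(6.47)

# Sanity check: a model that spreads probability evenly over 4 tokens has perplexity 4.
uniform_ppl = perplexity_from_token_logprobs([math.log(0.25)] * 8)

print(f"Perplexity at step 40: {ppl_step40:.2f}")
print(f"Loss implied by 6.47:  {implied_loss:.3f}")
print(f"Uniform over 4 tokens: {uniform_ppl:.2f}")
```

A perplexity near 6.5 can be read as the model being, on average, about as uncertain as if it were choosing among six or seven equally likely tokens at each position.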


## 7. Inference

Test the model's generation capabilities on an audit-related prompt.

In [None]:
# Save the model (adapters only)
trainer.save_model("/content/drive/MyDrive/Self_Supervised_finetuning_Model/audit-mistral-7b-qlora")

# Inference Prompt
prompt = "The audit of the financial statements reveals that"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Generate
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=150,
        temperature=0.7,
        top_p=0.9,
        do_sample=True
    )

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


The audit of the financial statements reveals that the audited entity is likely to be impacted by a significant risk related to climate change. The entity must include a statement in its annual report, as required under the Companies Act 2006, to that effect. In this example, the entity’s statement is included on the inside front cover of its annual report and sets out the following: “Climate change is one of the greatest threats to our future prosperity. We are taking action to reduce our impact on the environment and to prepare for the opportunities and risks of climate change. We are setting science-based targets to reduce our emissions and to increase our use of renewable energy. We are developing our climate risk disclosures to provide greater transparency to our stakeholders
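
The generation call above uses `temperature=0.7` and `top_p=0.9`. The sketch below shows, on a made-up four-token vocabulary, what these two settings do conceptually: temperature sharpens the distribution, and nucleus (top-p) filtering keeps only the most probable tokens whose cumulative mass reaches the threshold. This is a simplified pure-Python illustration, not the `transformers` implementation; the logits and function names are hypothetical.

```python
import math
import random

def apply_temperature(logits, temperature):
    """Softmax over logits divided by the temperature."""
    scaled = {tok: value / temperature for tok, value in logits.items()}
    top = max(scaled.values())
    exps = {tok: math.exp(value - top) for tok, value in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

def top_p_filter(probs, top_p):
    """Keep the smallest set of most probable tokens with cumulative mass >= top_p, renormalized."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, mass = [], 0.0
    for tok, p in ranked:
        kept.append((tok, p))
        mass += p
        if mass >= top_p:
            break
    return {tok: p / mass for tok, p in kept}

# Made-up logits for the next token.
logits = {"the": 2.0, "a": 1.0, "audit": 0.5, "zebra": -3.0}

probs = apply_temperature(logits, temperature=0.7)
nucleus = top_p_filter(probs, top_p=0.9)

rng = random.Random(42)  # fixed seed so the draw is repeatable
tokens, weights = zip(*nucleus.items())
sample = rng.choices(tokens, weights=weights, k=1)[0]

print("Nucleus:", nucleus)
print("Sampled token:", sample)
```

Lowering the temperature or `top_p` makes the output more conservative, while raising them allows less likely continuations.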


