# DPO Fine-Tuning for Audit Report Generation

This notebook implements **Direct Preference Optimization (DPO)** to align your fine-tuned Mistral model to:
1. **Stop hallucinating** - Only use facts from provided data
2. **Follow audit report structure** - Generate professional, structured reports
3. **Avoid repetition** - Produce coherent, non-looping text

## What is DPO?
DPO teaches the model to prefer "good" responses over "bad" ones by training on pairs:
- ✅ **Chosen**: Accurate, grounded, professional audit text
- ❌ **Rejected**: Hallucinated, repetitive, or poorly structured text
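
To make the objective concrete, here is a minimal sketch of the DPO loss for a single preference pair, written in plain Python rather than with TRL. `dpo_loss` is a hypothetical helper and the log-probabilities are made-up numbers for illustration only. The loss is the negative log-sigmoid of `beta` times the difference between how much the policy has raised the chosen response and how much it has raised the rejected response, both measured relative to the frozen reference model.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair, given sequence log-probabilities."""
    # Implicit rewards: how far the policy has moved from the reference
    chosen_reward = policy_chosen_logp - ref_chosen_logp
    rejected_reward = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_reward - rejected_reward)
    # -log(sigmoid(margin)) written as log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))

# Illustrative (made-up) log-probabilities
loss_at_start = dpo_loss(-50.0, -60.0, -50.0, -60.0)  # policy identical to reference
loss_improved = dpo_loss(-45.0, -70.0, -50.0, -60.0)  # policy now favors the chosen text
print(f"start: {loss_at_start:.4f}, improved: {loss_improved:.4f}")
```

When the policy equals the reference the margin is zero and the loss is log 2; it falls as the policy shifts probability toward the chosen response and away from the rejected one.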

## Workflow
1. Load your self-supervised fine-tuned model
2. Create preference dataset (good vs bad examples)
3. Train with DPO
4. Evaluate improvements

## 1. Setup & Installation

In [None]:
!pip install -q -U transformers peft datasets bitsandbytes accelerate trl

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig, PeftModel, prepare_model_for_kbit_training, get_peft_model
from datasets import Dataset
from trl import DPOTrainer, DPOConfig
import gc

torch.manual_seed(42)

<torch._C.Generator at 0x790cb9607630>

In [None]:
# Mount Google Drive
import sys
from pathlib import Path

if 'google.colab' in sys.modules:
    from google.colab import drive
    try:
        drive.mount('/content/drive')
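        print("✅ Google Drive mounted!")
    except Exception as e:
        # Report the failure instead of silently ignoring it
        print(f"⚠️ Could not mount Google Drive: {e}")
else:
    print("ℹ️ Not in Colab")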

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
✅ Google Drive mounted!


## 2. Load Your Fine-Tuned Model
We load the model you already trained with self-supervised learning.

In [None]:
# Install the Gemini SDK used in Section 3
!pip install -q -U google-genai

# Clear memory
torch.cuda.empty_cache()
gc.collect()

BASE_MODEL_ID = "mistralai/Mistral-7B-v0.1"
FINETUNED_MODEL_PATH = "/content/drive/MyDrive/Self_Supervised_finetuning_Model/audit-mistral-7b-qlora"

# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=False,
    llm_int8_enable_fp32_cpu_offload=True
)

print("Loading base model...")
model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL_ID,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
    low_cpu_mem_usage=True,
    max_memory={0: "14GB", "cpu": "30GB"}
)

print("Loading fine-tuned adapters...")
model = PeftModel.from_pretrained(model, FINETUNED_MODEL_PATH)

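# Merge adapters into base model for DPO training.
# Note: the base weights are 4-bit quantized, so merging requantizes them
# and may introduce small rounding errors.
print("Merging adapters...")
model = model.merge_and_unload()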

# Prepare for new LoRA training (DPO)
model = prepare_model_for_kbit_training(model)

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL_ID)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

print("✅ Model loaded and ready for DPO!")

Loading base model...


AcceleratorError: CUDA error: out of memory
Search for `cudaErrorMemoryAllocation' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.


## 3. Create DPO Preference Dataset

**Format**: Each example has:
- `prompt`: The instruction/query
- `chosen`: Good response (grounded in data, professional)
- `rejected`: Bad response (hallucinated, repetitive, or off-topic)

**TODO**: Replace the examples below with your own audit-specific data.
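
Before training, it can help to check that every record matches this format. The following is a minimal sketch, not part of TRL: `validate_pairs` is a hypothetical helper and the example records contain made-up placeholder text. It drops records with missing or empty fields and records whose chosen and rejected texts are identical, since those carry no preference signal.

```python
REQUIRED_KEYS = ("prompt", "chosen", "rejected")

def validate_pairs(records):
    """Keep records with non-empty string fields and distinct chosen/rejected text."""
    valid = []
    for rec in records:
        # Every required field must be a non-empty string
        if not all(isinstance(rec.get(k), str) and rec[k].strip() for k in REQUIRED_KEYS):
            continue
        # Identical responses give DPO nothing to prefer
        if rec["chosen"].strip() == rec["rejected"].strip():
            continue
        valid.append(rec)
    return valid

# Made-up placeholder records for illustration
example_records = [
    {"prompt": "How should an auditor determine materiality?",
     "chosen": "Placeholder for a grounded, structured answer.",
     "rejected": "Placeholder placeholder placeholder placeholder."},
    {"prompt": "Explain the audit risk model components.",
     "chosen": "",  # dropped: empty chosen response
     "rejected": "Placeholder for a poor answer."},
]
clean_records = validate_pairs(example_records)
print(f"Kept {len(clean_records)} of {len(example_records)} records")
```

The cleaned list can then be passed to `Dataset.from_list` in the same way as the generated pairs.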

In [None]:
import os
import time
from tqdm import tqdm
from google import genai
from google.genai import types
import torch

# --- Configuration ---
# Read the API key from the environment; never hard-code secrets in a notebook.
# Assumes the key has been exported as GEMINI_API_KEY.
GEMINI_API_KEY = os.environ["GEMINI_API_KEY"]
NUM_SAMPLES = 10
# Initialize Gemini client
client = genai.Client(api_key=GEMINI_API_KEY)
# Use Gemini 2.5 Flash to rewrite rejected drafts into chosen responses
JUDGE_MODEL_ID = "gemini-2.5-flash"
# List of questions
questions = [
    "Draft the Revenue Recognition section for a company with £109.1 million turnover.",
    "Describe the independence threats related to non-audit fees.",
    "What were the key weaknesses identified in the audit inspections?",
    "Explain the concept of professional skepticism in auditing.",
    "How should an auditor determine materiality?",
    "What are the auditor's responsibilities regarding fraud?",
    "Summarize the findings on Going Concern assessments.",
    "What are the requirements for partner rotation?",
    "Draft an opinion on the financial statements.",
    "Explain the audit risk model components."
]
def get_gemini_correction(prompt, bad_response):
    """
    Asks Gemini to rewrite the bad response into a perfect audit response.
    """
    correction_prompt = f"""
    You are an expert Audit Partner at a Big 4 firm. I will give you a Question and a Draft Answer.
    The Draft Answer might be repetitive, hallucinated, or unprofessionally written.
    
    **Task**: rewrite the answer to be:
    1. Factually accurate (generalize if specific numbers in draft are suspicious/hallucinated).
    2. Professional, concise, and structured (like a real audit report).
    3. Free of repetition or looping text.
    
    **Question**: {prompt}
    **Draft Answer (Rejected)**: {bad_response}
    
    **Output**: Just the corrected text. No preamble.
    """
    
    try:
        response = client.models.generate_content(
            model=JUDGE_MODEL_ID,
            contents=correction_prompt,
            config=types.GenerateContentConfig(
                temperature=0.2,
                max_output_tokens=1024
            )
        )
        return response.text.strip()
    except Exception as e:
        print(f"Gemini API Error: {e}")
        return None

dpo_data = []  # Collected preference pairs

for q in tqdm(questions):
    # 1. Generate 'Rejected' response from your model
    inputs = tokenizer(q, return_tensors="pt").to("cuda")
    with torch.no_grad():
        # Generate with parameters that might induce current bad behavior (to catch it)
        outputs = model.generate(
            **inputs,
            max_new_tokens=200,
            temperature=0.7,
            do_sample=True,
            repetition_penalty=1.0  # 1.0 disables the penalty, so any repetition shows up
        )

    # Decode only the newly generated tokens so the prompt is excluded
    prompt_len = inputs["input_ids"].shape[1]
    rejected_response = tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True).strip()

    print(f"\n📝 Question: {q}")
    print(f"❌ Rejected (Model): {rejected_response[:100]}...")
    
    # 2. Get 'Chosen' response from Gemini
    chosen_response = get_gemini_correction(q, rejected_response)
    
    if chosen_response:
        print(f"✅ Chosen (Gemini): {chosen_response[:100]}...")
        
        # 3. Add to dataset
        dpo_data.append({
            "prompt": q,
            "chosen": chosen_response,
            "rejected": rejected_response
        })
    else:
        print("⚠️ Skipping due to API error")
        
    # Sleep briefly to avoid rate limits
    time.sleep(1)

# Convert to Dataset for training
dpo_dataset = Dataset.from_list(dpo_data)
print(f"\n✨ Generated {len(dpo_dataset)} DPO pairs successfully!")

## 4. Configure DPO Training
We add new LoRA adapters on top of the merged model.
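
To get a feel for how small the new adapters are, the sketch below counts the trainable parameters that a rank-16 LoRA adds to the seven targeted projections. The layer dimensions are an assumption based on the published Mistral-7B-v0.1 configuration (hidden size 4096, MLP size 14336, 32 layers, key/value projection width 1024); check `model.config` before relying on them. `lora_params` is a hypothetical helper.

```python
# Assumed Mistral-7B-v0.1 dimensions (verify against model.config)
HIDDEN, INTERMEDIATE, KV_DIM, NUM_LAYERS = 4096, 14336, 1024, 32
RANK = 16  # matches r=16 in the LoRA config

# (in_features, out_features) for each targeted projection
module_shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_params(in_features, out_features, rank):
    """LoRA adds matrices A (rank x in) and B (out x rank) per wrapped linear layer."""
    return rank * (in_features + out_features)

per_layer = sum(lora_params(i, o, RANK) for i, o in module_shapes.values())
total = per_layer * NUM_LAYERS
print(f"Trainable LoRA parameters: {total:,}")
```

Under these assumptions the adapters come to roughly 42 million parameters, well under 1% of the 7B base model.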

In [None]:
# New LoRA config for DPO
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]
)

# DPO Training Arguments
training_args = DPOConfig(
    output_dir="./audit-mistral-dpo",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    learning_rate=5e-5,  # Lower LR for DPO
    num_train_epochs=3,
    logging_steps=5,
    save_strategy="epoch",
    fp16=True,
    optim="paged_adamw_8bit",
    beta=0.1,  # DPO temperature: higher values keep the policy closer to the reference model
    max_length=1024,
    max_prompt_length=512,
    remove_unused_columns=False,
    report_to="none"
)

print("✅ DPO configuration ready")

## 5. Train with DPO

In [None]:
# Initialize DPO Trainer
dpo_trainer = DPOTrainer(
    model=model,
    ref_model=None,  # With a PEFT config, the model with adapters disabled serves as the reference (saves memory)
    args=training_args,
    train_dataset=dpo_dataset,
    processing_class=tokenizer,  # Older TRL releases take `tokenizer=` instead
    peft_config=peft_config,
)

print("Starting DPO training...")
dpo_trainer.train()
print("✅ DPO training complete!")

## 6. Save the DPO-Aligned Model

In [None]:
# Save DPO adapters
output_path = "/content/drive/MyDrive/Self_Supervised_finetuning_Model/audit-mistral-7b-dpo"
dpo_trainer.save_model(output_path)
tokenizer.save_pretrained(output_path)

print(f"✅ DPO model saved to: {output_path}")

## 7. Test the DPO-Aligned Model

In [None]:
# Load the DPO model for inference
print("Loading DPO model for testing...")
test_model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL_ID,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True
)

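# The DPO adapters were trained on top of the merged self-supervised model,
# so restore those weights before attaching the DPO adapters.
test_model = PeftModel.from_pretrained(test_model, FINETUNED_MODEL_PATH)
test_model = test_model.merge_and_unload()
test_model = PeftModel.from_pretrained(test_model, output_path)
test_model.eval()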

# Test prompts
test_prompts = [
    "Draft the Revenue Recognition section for a company with £109.1 million turnover.",
    "Describe independence threats when non-audit fees exceed audit fees by 400%.",
    "What were the key weaknesses in going concern assessments?"
]

print("\n" + "="*70)
print("TESTING DPO-ALIGNED MODEL")
print("="*70)

for prompt in test_prompts:
    print(f"\n📋 PROMPT: {prompt}")
    print("-"*70)
    
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    
    with torch.no_grad():
        outputs = test_model.generate(
            **inputs,
            max_new_tokens=200,
            temperature=0.3,
            top_p=0.9,
            do_sample=True,
            repetition_penalty=1.2
        )
    
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    # Remove the prompt from response
    response = response[len(prompt):].strip()
    
    print(f"🤖 RESPONSE:\n{response}")
    print("="*70)

## 8. Compare: Before vs After DPO
Load both models and compare their outputs side-by-side.
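
Reading two outputs side by side is subjective, so a simple number for the "avoid repetition" goal can help. The following is a rough sketch of one possible metric, not an established benchmark: `repeated_ngram_fraction` is a hypothetical helper that reports the share of word trigrams that duplicate an earlier trigram. The two sample strings are made up for illustration.

```python
def repeated_ngram_fraction(text, n=3):
    """Fraction of word n-grams that duplicate an earlier n-gram in the text."""
    words = text.split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0  # text too short to contain any n-gram
    return 1.0 - len(set(ngrams)) / len(ngrams)

# Made-up strings for illustration
looping = "the auditor shall the auditor shall the auditor shall"
varied = "the auditor shall obtain sufficient appropriate audit evidence"

print(f"looping: {repeated_ngram_fraction(looping):.2f}")
print(f"varied:  {repeated_ngram_fraction(varied):.2f}")
```

Applied to `before_response` and `after_response` from the cell below, a lower value after DPO would suggest less looping, though it says nothing about factual accuracy or report structure.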

In [None]:
print("Loading BEFORE DPO model (original fine-tuned)...")
before_model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL_ID,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True
)
before_model = PeftModel.from_pretrained(before_model, FINETUNED_MODEL_PATH)
before_model.eval()

test_prompt = "Describe independence threats when non-audit fees exceed audit fees by 400%."

print("\n" + "="*70)
print("COMPARISON: BEFORE vs AFTER DPO")
print("="*70)
print(f"\n📋 PROMPT: {test_prompt}\n")

inputs = tokenizer(test_prompt, return_tensors="pt").to("cuda")

# Use identical greedy decoding for both models so the comparison is fair.
# (temperature only takes effect when do_sample=True, so it is omitted here.)
gen_kwargs = dict(max_new_tokens=150, do_sample=False, repetition_penalty=1.1)

# Before DPO
print("-"*70)
print("❌ BEFORE DPO (Original Fine-Tuned Model):")
print("-"*70)
with torch.no_grad():
    before_output = before_model.generate(**inputs, **gen_kwargs)
before_response = tokenizer.decode(before_output[0], skip_special_tokens=True)[len(test_prompt):].strip()
print(before_response)

# After DPO
print("\n" + "-"*70)
print("✅ AFTER DPO (Aligned Model):")
print("-"*70)
with torch.no_grad():
    after_output = test_model.generate(**inputs, **gen_kwargs)
after_response = tokenizer.decode(after_output[0], skip_special_tokens=True)[len(test_prompt):].strip()
print(after_response)

print("\n" + "="*70)