This use case fine-tunes the Mistral-7B model on the Gretel Financial Risk Analysis dataset using parameter-efficient fine-tuning (PEFT) via LoRA (Low-Rank Adaptation). The end goal is a domain-adapted model that performs structured financial risk assessment in natural language, in the form of feature extraction and text summarisation, using a constrained, cost-effective approach suitable for deployment and continual learning. The use case prioritizes the feasibility of a local-run approach that uses a consumer-grade GPU to deliver a real-world solution.

Task: Sequence generation for structured financial risk extraction.

Setting: Resource-efficient fine-tuning using quantized models and LoRA adapters.

Challenges:

        Domain shift from general pretraining to financial documents.

        Memory constraints from 7B parameters on consumer hardware.

        Dual NLP tasks: feature extraction and text summarisation

This use case was originally proposed and authored by Daniel Gan, https://github.com/FinalTwilite.



Mistral-7B Financial Risk Analysis Fine-Tuning


This notebook documents the process of fine-tuning the Mistral-7B large language model for specialized financial risk analysis. We employ parameter-efficient techniques including Low-Rank Adaptation (LoRA) and quantization to create a model capable of identifying financial risks in documents and providing structured assessments.

The approach leverages state-of-the-art techniques to enable fine-tuning of a 7B parameter model with modest computational resources, producing a specialized model for financial domain applications.

Environment Setup and Dependencies

We begin by importing necessary libraries that provide the foundation for our fine-tuning process. The key libraries include:

transformers: Hugging Face's library providing access to pre-trained models and training utilities
peft: Parameter-Efficient Fine-Tuning library that implements LoRA and other PEFT methods
datasets: Hugging Face's dataset handling library for efficient data loading and processing
torch: PyTorch for deep learning operations
BitsAndBytesConfig: Quantization configuration class from transformers (backed by the bitsandbytes library) for reducing memory requirements


Base Model: Mistral-7B
Chosen for its open access, competitive performance, and an architecture simple enough to run on consumer-grade hardware.

Mistral is a dense decoder-only transformer, similar to LLaMA 2, but with architectural changes for attention efficiency and context handling (grouped-query attention and sliding-window attention).

In [None]:
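from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    DataCollatorForLanguageModeling,  # Used when building the Trainer below
    Trainer,                          # Used for the training loop below
    TrainingArguments,
)
from peft import LoraConfig, get_peft_model, TaskType
from trl import SFTTrainer

import torch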

Dataset Selection and Analysis

For this fine-tuning task, we use the specialized gretelai/gretel-financial-risk-analysis-v1 dataset from the Hugging Face Hub. This dataset is particularly valuable because it contains:

Financial documents and texts with varying levels of risk indicators
Expert-crafted risk assessments in structured JSON format
Diverse financial contexts and scenarios
Pre-split train and test sets for proper evaluation

The dataset enables the model to learn patterns of financial risk indicators and how to express them in a standardized, structured format suitable for further automated processing.


In [None]:
# Load the financial risk analysis dataset from HuggingFace Hub
print("Loading dataset...")
dataset = load_dataset("gretelai/gretel-financial-risk-analysis-v1")

# Convert to pandas DataFrames for easier inspection (optional)
train_df = dataset["train"].to_pandas()
val_df = dataset["test"].to_pandas()

# Dataset exploration (not in original code but useful for understanding)
# print(f"Training examples: {len(train_df)}")
# print(f"Validation examples: {len(val_df)}")
# print(f"Sample input:\n{train_df['input'][0][:500]}...")
# print(f"Sample output:\n{train_df['output'][0][:500]}...")
# Alternatively, sign in to the Hugging Face Hub and browse the dataset page to see more examples

Tokenizer Configuration

The tokenizer is responsible for converting raw text into tokens that the model can process. Mistral-7B uses a specialized tokenizer based on BPE (Byte-Pair Encoding) that must be properly configured to handle our specific prompt format and task requirements.
Key tokenizer considerations:

We set the pad token to the EOS (End of Sequence) token, a common workaround because the Mistral tokenizer does not define a pad token by default
The tokenizer must properly handle the chat format we'll use, including system messages and instruction formatting
We load it from the same path as our base model to ensure compatibility

In [None]:
# Initialize the Mistral tokenizer from local path
print("Loading tokenizer...")
model_path = "./Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_path)

# The Mistral tokenizer ships without a pad token, so we reuse the EOS token for padding
# Without this, padding to a fixed length in the tokenizer call would raise an error
tokenizer.pad_token = tokenizer.eos_token

# Note: We could inspect tokenizer properties here
# print(f"Vocabulary size: {len(tokenizer)}")
# print(f"Model max length: {tokenizer.model_max_length}")
# print(f"Special tokens: {tokenizer.all_special_tokens}")

Quantization Strategy

Quantization is a technique to reduce the memory footprint and computational requirements of large language models by representing weights with lower precision. We use 4-bit quantization through the BitsAndBytesConfig to dramatically reduce the VRAM requirements for fine-tuning.
This approach enables us to fine-tune a 7B parameter model on hardware with limited GPU memory (e.g., a single consumer GPU with 16-24GB VRAM), which would not be feasible with full-precision training.
Key quantization parameters:

4-bit precision for weights (1/8 the memory of FP32)
Double quantization for further memory savings
NF4 data type balancing precision and efficiency
Float16 compute dtype for faster operations while maintaining reasonable precision
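
To make the memory argument concrete, the following is a rough, weight-only estimate for a model of this size. It is a back-of-the-envelope sketch: the nominal parameter count of 7 billion is an assumption, and activations, optimizer state, LoRA adapters, and quantization constants are deliberately ignored.

```python
# Rough weight-only memory estimate for a 7B-parameter model.
# Assumption: a nominal 7e9 parameters; activations, optimizer state,
# and quantization constants are ignored.
NUM_PARAMS = 7_000_000_000

BYTES_PER_PARAM = {
    "fp32": 4.0,
    "fp16": 2.0,
    "nf4 (4-bit)": 0.5,
}


def weight_memory_gb(num_params: int, bytes_per_param: float) -> float:
    """Return the weight footprint in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


estimates = {
    name: weight_memory_gb(NUM_PARAMS, size) for name, size in BYTES_PER_PARAM.items()
}

for name, gb in estimates.items():
    print(f"{name:>12}: {gb:.1f} GB")
```

Under these assumptions the 4-bit weights take about one eighth of the FP32 footprint, which is what leaves room for activations and adapter gradients on a 16-24GB card.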

Note: The author used an RTX 4090. If stronger hardware is available, the quantization settings, hyperparameters, etc. can be scaled up accordingly.

In [None]:
# Configure 4-bit quantization to reduce memory requirements
# Without quantization, the weights of a 7B model alone would require ~14GB in FP16 and ~28GB in FP32
print("Setting up quantization...")
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,              # Use 4-bit quantization for weights (vs FP16/FP32)
    bnb_4bit_compute_dtype=torch.float16,  # Compute operations in float16 precision
    bnb_4bit_use_double_quant=True, # Nested quantization: also quantize the quantization constants
    bnb_4bit_quant_type="nf4"       # NF4 format: normalized float 4 - balances quality & compression
)

# Load the base model with quantization settings applied
print("Loading model with quantization...")
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    quantization_config=quantization_config,  # Apply our quantization settings
    device_map="auto",              # Automatically optimize model placement across available devices
    torch_dtype=torch.float16       # Use half-precision for non-quantized tensors
)

Low-Rank Adaptation (LoRA) Implementation

LoRA is a parameter-efficient fine-tuning technique that dramatically reduces the number of trainable parameters by adding small, trainable rank decomposition matrices to existing weights rather than modifying all model weights.
This approach offers several significant advantages:

       Memory efficiency: Only a small fraction of parameters need to be stored and updated       

       Training speed: Fewer parameter updates means faster training       

       Adaptability: Low-rank updates can effectively adapt pre-trained knowledge to specialized domains       

       Stability: Limited parameter updates help prevent catastrophic forgetting and overfitting       

We carefully configure LoRA for the Mistral architecture by:

Targeting attention layers and MLP projections
Using an appropriate rank (r=16) that balances expressivity and efficiency
Applying proper scaling (alpha=32) and regularization (dropout=0.5)
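
As a sanity check on how small the trainable footprint is, the sketch below counts the LoRA parameters for this configuration. The layer dimensions are assumptions taken from the published Mistral-7B-v0.1 configuration (hidden size 4096, 8 key/value heads of size 128, MLP size 14336, 32 layers); the number reported by `print_trainable_parameters()` is the authoritative one for a given run.

```python
# Assumed Mistral-7B-v0.1 dimensions (from its published config).
HIDDEN = 4096          # hidden_size
KV_DIM = 1024          # 8 key/value heads x 128 head_dim (grouped-query attention)
INTERMEDIATE = 14336   # MLP intermediate_size
NUM_LAYERS = 32
RANK = 16              # LoRA rank r used in this notebook

# (in_features, out_features) of each targeted linear layer
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features: int, out_features: int, r: int) -> int:
    """Two low-rank factors: one of shape (in, r) and one of shape (r, out)."""
    return r * (in_features + out_features)


per_layer = sum(lora_params(i, o, RANK) for i, o in TARGET_SHAPES.values())
total_lora = per_layer * NUM_LAYERS

print(f"LoRA parameters per layer: {per_layer:,}")
print(f"LoRA parameters in total:  {total_lora:,}")
```

With these assumed dimensions the adapters add roughly 42 million trainable parameters, well under 1% of the base model.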

In [None]:
# Configure Low-Rank Adaptation (LoRA) for efficient fine-tuning
# LoRA adds small trainable matrices to existing weights using the decomposition:
# ΔW = A×B where A ∈ ℝ^(d×r), B ∈ ℝ^(r×k), and typically r << min(d,k)
print("Configuring LoRA...")
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,   # Configure for causal (autoregressive) language modeling
    r=16,                           # Rank of LoRA decomposition - higher means more capacity but more parameters
    lora_alpha=32,                  # Scaling factor for LoRA updates (effectively α/r scaling for updates)
    lora_dropout=0.5,               # Dropout probability for LoRA layers - important for regularization
    
    # Target specific modules in the Mistral architecture
    # This is a critical choice that impacts performance and efficiency
    target_modules=[
        "q_proj",                   # Query projection in attention mechanism
        "k_proj",                   # Key projection in attention mechanism
        "v_proj",                   # Value projection in attention mechanism
        "o_proj",                   # Output projection from attention mechanism
        "gate_proj",                # Gate projection in MLP blocks (for SwiGLU activation)
        "up_proj",                  # Up-projection in MLP blocks
        "down_proj",                # Down-projection in MLP blocks
    ],
    bias="none",                    # Do not train bias parameters to reduce overfitting risk
)

# Apply LoRA configuration to the model
model = get_peft_model(model, lora_config)

# Print parameter statistics to verify our configuration
# This shows what percentage of parameters will be trained vs frozen
model.print_trainable_parameters()  # Expected output: trainable params << total params

Data Formatting and Instruction Tuning Strategy

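Instruction fine-tuning requires careful formatting of prompts and responses so that training and inference use the same input pattern. Mistral-7B-v0.1 is a base model rather than an instruction-tuned one, so we adopt the [INST] ... [/INST] instruction syntax used by Mistral's instruct models and wrap the system message in Llama-2-style <<SYS>> tags, followed by the user input and the assistant response.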
Our approach:

       Include a consistent system message defining the financial risk analysis role       
       Format each example using Mistral's specific chat syntax: <s>[INST] prompt [/INST] response</s>       
       Include raw inputs from financial documents and structured outputs with risk assessments       
       Ensure proper tokenization, padding, and attention masking       

The resulting training examples teach the model both:

       How to understand financial documents and identify risks
       How to express those risks as a structured JSON assessment
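
The prompt layout described above can be sketched as a small helper that serves both training (with a response) and inference (without one). The name `build_prompt` and the shortened system message are hypothetical and used only for illustration; the notebook itself builds the string inline.

```python
# Shortened placeholder; the notebook uses a longer system message.
SYSTEM_MESSAGE = "You are an expert financial risk analyst."


def build_prompt(user_text, response=None, system_message=SYSTEM_MESSAGE):
    """Build an [INST]-style prompt with a <<SYS>> block.

    With a response, the result is a full training example ending in </s>.
    Without one, it stops after [/INST] so the model generates the answer.
    Note: if the tokenizer adds the BOS token itself, drop the literal <s>
    here or tokenize with add_special_tokens=False to avoid a duplicate.
    """
    prompt = f"<s>[INST] <<SYS>>\n{system_message}\n<</SYS>>\n\n{user_text} [/INST]"
    if response is not None:
        prompt += f" {response}</s>"
    return prompt


training_example = build_prompt("Revenue depends on one customer.", '{"risk_detected": true}')
inference_prompt = build_prompt("Revenue depends on one customer.")

print(training_example)
print(inference_prompt)
```

Keeping one helper for both phases is a design choice that guards against a mismatch between the format seen in training and the format used at inference time.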

In [None]:
# Define a clear system prompt that establishes the model's role and output format
# This guides the model to produce structured, consistent risk analyses
SYSTEM_MESSAGE = (
    "You are an expert financial risk analyst. Analyze the provided text for financial risks, "
    "and output a structured assessment in JSON format including risk detection, specific risk flags, "
    "financial exposure details, and analysis notes."
)

# Data preprocessing function to format examples as [INST]-style instruction prompts
def preprocess_function(examples):
    formatted_prompts = []
    for text, response in zip(examples["input"], examples["output"]):
        # [INST] instruction syntax with a Llama-2-style system block
        # Format: <s>[INST] <<SYS>>\nsystem message\n<</SYS>>\n\nuser input [/INST] model response</s>
        prompt = f"<s>[INST] <<SYS>>\n{SYSTEM_MESSAGE}\n<</SYS>>\n\n{text} [/INST] {response}</s>"
        formatted_prompts.append(prompt)
    
    # Tokenize all prompts with proper padding and truncation settings
    tokenized = tokenizer(
        formatted_prompts,
        padding="max_length",       # Pad all sequences to same length for batch processing
        truncation=True,            # Truncate if exceeds maximum allowed length
        max_length=512,             # Training sequence length (not the model's full context window)
        add_special_tokens=False,   # <s> and </s> are already in the string; avoids a duplicate BOS token
        return_tensors="pt"         # Return PyTorch tensors
    )
    
    # Prepare dataset in the format expected by the Trainer
    # For causal LM, labels are identical to input_ids for teacher forcing
    result = {
        "input_ids": tokenized["input_ids"],
        "attention_mask": tokenized["attention_mask"],
        "labels": tokenized["input_ids"].clone()  # Copy input_ids to labels
    }
    
    return result

# Process training and validation datasets
print("Processing training data...")
train_dataset = dataset["train"].map(
    preprocess_function,
    batched=True,                     # Process multiple examples at once for efficiency
    batch_size=16,                    # Number of examples to process in each batch
    remove_columns=dataset["train"].column_names  # Remove original text columns after processing
)

print("Processing validation data...")
val_dataset = dataset["test"].map(
    preprocess_function,
    batched=True,
    batch_size=16,
    remove_columns=dataset["test"].column_names
)

# Optionally inspect the processed dataset (not in original code)
# print(f"Training example length: {len(train_dataset[0]['input_ids'])}")
# print(f"Number of training examples: {len(train_dataset)}")
# print(f"Number of validation examples: {len(val_dataset)}")

Training Configuration and Hyperparameter Selection

Selecting appropriate training hyperparameters is critical for effective fine-tuning. Our configuration balances:

       Memory constraints: Using small batch sizes with gradient accumulation       
       Learning dynamics: Proper learning rate, schedule, and regularization       
       Training efficiency: Optimal epochs, evaluation frequency, and checkpoint saving       
       Model quality: Validation-based model selection and proper precision settings       

Key hyperparameter choices:

       Small per-device batch size (2) with gradient accumulation (16) simulating a larger batch of 32       
       Moderate learning rate (2e-5) with cosine schedule and 10% warmup       
       Strong weight decay (0.5) to prevent overfitting given the small dataset       
       Mixed precision training (fp16) for memory efficiency       
       Checkpoint management keeping at most 3 checkpoints on disk and reloading the best one at the end
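
The batch-size and schedule arithmetic behind these choices can be sketched as follows. The dataset size is a hypothetical placeholder, and the schedule is a simplified version of linear warmup followed by cosine decay, so the step counts reported by the Trainer may differ slightly.

```python
import math

PER_DEVICE_BATCH = 2
GRAD_ACCUM = 16
NUM_EPOCHS = 6
WARMUP_RATIO = 0.1
BASE_LR = 2e-5
NUM_TRAIN_EXAMPLES = 800   # hypothetical size, for illustration only

# One optimizer step consumes per-device batch x accumulation steps examples
effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM
steps_per_epoch = math.ceil(NUM_TRAIN_EXAMPLES / effective_batch)
total_steps = steps_per_epoch * NUM_EPOCHS
warmup_steps = round(total_steps * WARMUP_RATIO)


def lr_at(step: int) -> float:
    """Linear warmup to BASE_LR, then cosine decay to zero."""
    if step < warmup_steps:
        return BASE_LR * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return BASE_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


print(f"effective batch size: {effective_batch}")
print(f"optimizer steps: {total_steps} (warmup: {warmup_steps})")
for step in (0, warmup_steps, total_steps // 2, total_steps):
    print(f"step {step:>4}: lr = {lr_at(step):.2e}")
```

The learning rate peaks right after warmup and decays smoothly, so the last epochs make only small adjustments to the adapters.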

In [None]:
# Configure the training process with carefully tuned hyperparameters
# These settings are optimized for LoRA fine-tuning of Mistral-7B on consumer hardware
training_args = TrainingArguments(
    output_dir="./mistral_risk_finetuned",  # Directory to save checkpoints
    
    # Evaluation strategy
    eval_strategy="epoch",          # Evaluate after each epoch
    eval_steps=100,                 # Only used when eval_strategy="steps"; ignored here
    
    # Saving strategy
    save_strategy="epoch",          # Save a checkpoint after each epoch (must match eval_strategy for load_best_model_at_end)
    save_steps=100,                 # Only used when save_strategy="steps"; ignored here
    save_total_limit=3,             # Keep at most 3 checkpoints on disk to save space
    load_best_model_at_end=True,    # Load best model after training (based on eval metric)
    
    # Learning rate and schedule
    learning_rate=2e-5,             # Conservative choice; LoRA runs often use higher rates (e.g., 1e-4 to 2e-4)
    lr_scheduler_type="cosine",     # Cosine schedule works well for LLM fine-tuning
    warmup_ratio=0.1,               # Warm up learning rate for 10% of training steps
    
    # Batch size configuration - critical for memory management
    per_device_train_batch_size=2,  # Small batch size due to memory constraints
    per_device_eval_batch_size=2,   # Small batch size for evaluation
    gradient_accumulation_steps=16, # Accumulate gradients over 16 steps (effective batch size = 32)
    
    # Training length
    num_train_epochs=6,             # Total number of training epochs
    
    # Regularization
    weight_decay=0.5,               # Strong L2 regularization to prevent overfitting
    
    # Mixed precision
    fp16=True,                      # Use mixed precision training for memory efficiency
    
    # Logging
    logging_steps=10,               # Log metrics every 10 steps
    report_to="tensorboard",        # Log metrics to TensorBoard for visualization
    
    # Model selection
    metric_for_best_model="eval_loss",  # Select best model based on validation loss
    greater_is_better=False,        # For loss, lower values are better
)

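# Data collator handles batching and formatting for language modeling
# Note: with mlm=False it rebuilds labels from input_ids and masks pad tokens with -100;
# since pad_token == eos_token here, the closing </s> is masked from the loss as well
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,            # Use our configured tokenizer
    mlm=False                       # Use causal language modelling (not masked LM)
)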

# Initialize and configure the HuggingFace Trainer
# The Trainer handles the training loop, evaluation, and checkpointing
trainer = Trainer(
    model=model,                    # Our LoRA-adapted model
    args=training_args,             # Training configuration
    train_dataset=train_dataset,    # Processed training data
    eval_dataset=val_dataset,       # Processed validation data
    data_collator=data_collator,    # Data formatting utility
)

# Start the fine-tuning process
print("Starting training...")
trainer.train()

Model Saving and Deployment Strategy

After fine-tuning, we need to properly save the model artifacts for future use. With LoRA, we have two main options:

Save adapter weights only: Much smaller files, but requires loading the base model separately
Create a merged model: Combines base model and adaptations for simpler deployment

We implement both approaches to provide flexibility for different deployment scenarios:

In [None]:
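import os

from peft import PeftModel

# Save the fine-tuned LoRA adapter weights
# These are much smaller than the full model (roughly tens to a few hundred MB vs ~14GB in FP16)
save_directory = "./mistral-risk-finetuned-final"
os.makedirs(save_directory, exist_ok=True)

# Save the LoRA adapter weights and tokenizer configuration
model.save_pretrained(save_directory)
tokenizer.save_pretrained(save_directory)
print(f"Model adapter and tokenizer saved to {save_directory}")

# Optional: Create a merged model for easier inference
# This combines the base model weights with the LoRA adaptations
print("Creating merged model for easier inference...")
merged_model_path = "./mistral-risk-merged"
os.makedirs(merged_model_path, exist_ok=True)

# One common approach (an assumption, not necessarily the author's): reload the base
# model in FP16 rather than merging into the 4-bit weights, attach the saved adapter,
# and fold it in with merge_and_unload() from the PEFT library.
# Loading on CPU needs roughly 14GB of system RAM.
base_model_fp16 = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.float16, device_map="cpu"
)
merged_model = PeftModel.from_pretrained(base_model_fp16, save_directory).merge_and_unload()

# Save a standalone model that doesn't require separate adapter loading
merged_model.save_pretrained(merged_model_path)
tokenizer.save_pretrained(merged_model_path)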
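
Why merging is lossless in exact arithmetic can be shown with a toy example. The sketch below uses tiny hand-written matrices and the same notation as the LoRA comment earlier (ΔW = A×B, scaled by α/r); the helper names are hypothetical, and real merges operate on the model's tensors and are subject to floating-point rounding.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def add_scaled(w, delta, scale):
    """Return w + scale * delta, element by element."""
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


# Toy sizes: d=3 inputs, k=2 outputs, rank r=1, alpha=2 (so alpha / r = 2)
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # frozen base weight, d x k
A = [[1.0], [2.0], [3.0]]                  # LoRA factor, d x r
B = [[0.5, -1.0]]                          # LoRA factor, r x k
SCALE = 2.0 / 1.0                          # alpha / r
x = [[1.0, 2.0, 3.0]]                      # one input row vector

# Adapter path: base output plus the scaled low-rank correction
y_adapter = add_scaled(matmul(x, W), matmul(matmul(x, A), B), SCALE)

# Merged path: fold the correction into the weight once, then a single matmul
W_merged = add_scaled(W, matmul(A, B), SCALE)
y_merged = matmul(x, W_merged)

print("adapter path:", y_adapter)
print("merged path: ", y_merged)
```

Both paths give the same output, which is why a merged model needs no adapter at inference time and adds no extra latency.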

Inference and Model Usage

To effectively use our fine-tuned model for financial risk analysis, we implement a streamlined inference pipeline. 
This approach loads the model with its LoRA adapters and provides a straightforward function for analyzing financial documents:



       Model Loading: We load the base model and apply LoRA weights separately for maximum flexibility
       Memory Management: We use device mapping and offloading capabilities to handle memory constraints
       Generation Parameters:
              Temperature of 0.7 balances creativity with accuracy
              Top-p sampling of 0.95 ensures reasonable output diversity
              Maximum token generation of 512 allows for comprehensive analysis
       Performance Measurement: Tracking generation time helps optimize deployment
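
To illustrate what the temperature and top-p settings do to the next-token distribution, here is a simplified, pure-Python sketch over a toy set of logits. It is not the transformers implementation, and the function names and logit values are made up for illustration.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities; lower temperature sharpens the distribution."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose mass reaches top_p, renormalized."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for idx in order:
        kept.append(idx)
        cumulative += probs[idx]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


def sample_token(logits, temperature=0.7, top_p=0.95, rng=random):
    """Sample one token id after temperature scaling and nucleus filtering."""
    filtered = top_p_filter(softmax_with_temperature(logits, temperature), top_p)
    ids = list(filtered)
    return rng.choices(ids, weights=[filtered[i] for i in ids], k=1)[0]


toy_logits = [2.0, 1.0, 0.5, -3.0]   # made-up scores for a 4-token vocabulary
probs_default = softmax_with_temperature(toy_logits, 1.0)
probs_sharp = softmax_with_temperature(toy_logits, 0.7)
nucleus = top_p_filter(probs_sharp, 0.95)

print("T=1.0:", [round(p, 3) for p in probs_default])
print("T=0.7:", [round(p, 3) for p in probs_sharp])
print("kept token ids after top-p:", sorted(nucleus))
print("sampled token id:", sample_token(toy_logits, rng=random.Random(0)))
```

For structured JSON output, lower temperatures (or greedy decoding) tend to give more reproducible results; the values above simply mirror the defaults used in this notebook.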

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import time

# Load fine-tuned model and tokenizer
base_model_path = "./Mistral-7B-v0.1"             # Base model
finetuned_model_path = "./mistral-risk-finetuned-final"    # Fine-tuned weights

# Initialize tokenizer from base model
tokenizer = AutoTokenizer.from_pretrained(base_model_path)
tokenizer.pad_token = tokenizer.eos_token  # Ensure proper padding token

# Load the base model first
model = AutoModelForCausalLM.from_pretrained(
    base_model_path,
    torch_dtype=torch.float16,         # Use half-precision for efficiency
    device_map="auto",                 # Automatically distribute across available GPUs
    offload_folder="./offload"         # Optional disk offloading for large models
)

# Apply fine-tuned LoRA weights to the base model
model = PeftModel.from_pretrained(model, finetuned_model_path)
model.eval()  # Set model to evaluation mode

# Define the inference function for financial risk analysis
def generate_analysis(input_data: str, max_new_tokens=512, temperature=0.7, top_p=0.95):
    """
    Generate financial risk analysis from input text.
    
    Args:
        input_data: Financial text to analyze
        max_new_tokens: Maximum number of tokens to generate
        temperature: Controls randomness (lower = more deterministic)
        top_p: Nucleus sampling parameter (higher = more diverse)
        
    Returns:
        Generated risk analysis text
    """
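    # Create the prompt in the same instruction format used during fine-tuning
    # (the tokenizer adds the leading <s> token automatically)
    system_message = (
        "You are an expert financial risk analyst. Analyze the provided text for financial risks, "
        "and output a structured assessment in JSON format including risk detection, specific risk flags, "
        "financial exposure details, and analysis notes."
    )
    prompt = f"[INST] <<SYS>>\n{system_message}\n<</SYS>>\n\n{input_data} [/INST]"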
    
    # Tokenize input
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    
    # Generate output with timing
    with torch.no_grad():
        start = time.time()
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            temperature=temperature,
            top_p=top_p,
            pad_token_id=tokenizer.eos_token_id
        )
        end = time.time()
    
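    # Decode only the newly generated tokens (drop the echoed prompt) and return the result
    generated_ids = outputs[0][inputs["input_ids"].shape[1]:]
    result = tokenizer.decode(generated_ids, skip_special_tokens=True)
    print(f"\n Generation time: {end - start:.2f} seconds")
    return result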

# Example usage
if __name__ == "__main__":
    # Example financial text for analysis
    input_text = """
    The Company has entered into a five-year contract to purchase raw materials
    from a single supplier in a volatile market. The contract requires minimum
    purchases of $10M annually with no cancellation clause. Recent market analysis
    suggests potential price fluctuations of up to 40% in the next year.
    """
    
    # Generate and display the analysis
    output = generate_analysis(input_text)
    print("\n Generated Analysis:\n", output)

LLM Compliance Evaluation Framework

This document provides comprehensive documentation for a framework designed to evaluate language model compliance using a multi-metric approach. The framework systematically evaluates model responses against predefined test prompts using established NLP evaluation metrics.
Overview

The framework performs the following operations:

       Loads a causal language model (LLM) and tokenizer       
       Iterates through predefined compliance-focused test prompts       
       Generates model responses for each prompt       
       Computes multiple evaluation metrics for each response       
       Calculates performance statistics across multiple test runs       
       Exports results to a structured document format       
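
Two of the metrics used by the framework are simple enough to sketch from first principles. The pure-Python versions below are for illustration only: the framework itself takes perplexity from the model loss and edit distance from the Levenshtein package, and the function names here are hypothetical.

```python
import math


def perplexity(token_log_probs):
    """Perplexity is exp of the mean negative log-likelihood per token."""
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


# A model that assigns probability 0.25 to every token has perplexity 4
uniform_log_probs = [math.log(0.25)] * 4
print(f"perplexity: {perplexity(uniform_log_probs):.2f}")
print(f"edit distance: {levenshtein('kitten', 'sitting')}")
```

Perplexity is driven by the model's token probabilities, while edit distance only compares surface strings, so the two can disagree on the same response.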

In [None]:
# Core libraries
import time
import torch
import numpy as np
import matplotlib.pyplot as plt

# NLP evaluation packages
import evaluate
from Levenshtein import distance as levenshtein_distance
from bert_score import score as bert_score

# Hugging Face components
from transformers import AutoModelForCausalLM, AutoTokenizer

# Document handling
from docx import Document

# Load evaluation metrics using the evaluate package from Hugging Face
rouge = evaluate.load("rouge")  # Text summarization evaluation metric
bleu = evaluate.load("bleu")    # Machine translation evaluation metric

# Test prompts for a compliance-focused LLM
# They cover a variety of scenarios to test model compliance across different risk domains
prompts = [
    "What are the key environmental risks highlighted in this report?",
    "Summarize key cybersecurity vulnerabilities in the filing.",
    "Assess financial statement accuracy based on disclosed risks.",
    "Identify any potential conflicts of interest in governance structures.",
    "What are the tax risks identified in the latest filing?",
    "Highlight any liquidity risks mentioned in this filing.",
    "Evaluate compliance with international anti-bribery laws.",
    "Analyze the impact of market risk factors disclosed in this report.",
    "What are the key risks identified in this company’s supply chain?",
    "Summarize risk factors from the latest 10-Q filing.",
    "Identify any references to potential violations of the Foreign Corrupt Practices Act (FCPA).",
    "What governance issues are raised in this proxy statement?",
    "Detect any conflicts between financial projections and risk disclosures.",
    "Identify operational inefficiencies highlighted in the filing.",
    "Summarize management's response to identified risks.",
    "What mitigation strategies are proposed for identified risks?",
    "Review the risk management framework outlined in the document.",
    "Analyze potential strategic risks identified in the report.",
    "Highlight reputational risks mentioned in the document.",
    "Identify regulatory penalties or fines discussed in the report.",
    "Detect any signs of financial misreporting in the filing.",
    "Analyze the company’s risk appetite based on the disclosed risks.",
    "Identify political risks related to operations in foreign countries.",
    "Assess the adequacy of the company’s disaster recovery plans.",
    "What are the sustainability risks identified in this report?",
    "Summarize legal proceedings related to risk in this filing.",
    "Identify insurance coverage gaps discussed in the filing.",
    "Detect any potential issues with intellectual property management.",
    "Highlight key fraud risks disclosed in the document.",
    "Summarize the company’s risk tolerance as outlined in the filing.",
    "Assess the company’s risk diversification strategy.",
    "Review how the company plans to handle potential supply chain disruptions.",
    "Identify reputational risks related to the company’s brand.",
    "Analyze risk exposure from foreign exchange fluctuations.",
    "Summarize any risks related to customer concentration.",
    "Detect conflicts of interest in the company’s executive compensation plan.",
    "Identify risks related to mergers and acquisitions in the filing.",
    "Analyze whether the company has adequate legal compliance programs.",
    "Review how the company addresses regulatory changes in this document.",
    "Summarize the company’s risk management priorities for the next year.",
    "Assess risks related to changes in government policy.",
    "Identify any material weaknesses in internal controls.",
    "Evaluate the company’s risk management performance over time.",
    "Highlight risks associated with operational outsourcing.",
    "Assess the financial impact of risk events disclosed in the filing.",
    "Summarize risks related to the company’s digital transformation efforts.",
    "Identify key social risks disclosed in the filing.",
    "What are the potential risks associated with the company’s new product launch?",
    "Evaluate the company’s approach to mitigating operational risks.",
    "Summarize risks related to the company’s leadership transitions.",
    "Highlight financial risks related to the company’s capital structure.",
    "What are the potential risks associated with the company’s debt?",
    "Identify any environmental liabilities discussed in the filing.",
    "Assess the company’s readiness for changes in tax law.",
    "What are the key factors contributing to the company’s credit risk?",
    "Analyze the company’s approach to managing legal risks.",
    "Highlight risks related to compliance with labor laws.",
    "What are the financial implications of disclosed risks?",
    "Assess the company’s approach to managing reputation risk.",
    "Summarize the company’s governance structure and related risks.",
    "Identify risks related to competition in the company’s industry.",
    "Analyze risks associated with the company’s expansion strategy.",
    "Evaluate how the company mitigates risks from geopolitical tensions.",
    "Summarize the company’s approach to managing climate-related risks.",
    "Detect any emerging risks in the company’s business environment.",
    "What are the risks associated with the company’s reliance on technology?",
    "Highlight any risks identified in the company’s corporate social responsibility (CSR) reports.",
    "Identify risks associated with intellectual property infringement.",
    "Summarize risks related to compliance with the GDPR.",
    "Evaluate the company’s exposure to risks from commodity price volatility.",
    "What operational risks are associated with the company’s logistics network?",
    "Identify key market risks affecting the company’s performance.",
    "Analyze risks arising from changes in consumer behavior.",
    "Summarize risks related to the company’s reliance on key suppliers.",
    "Identify risks related to changes in the regulatory landscape for healthcare.",
    "Evaluate risks associated with the company’s use of third-party vendors.",
    "What are the risks related to the company’s employee compensation plans?",
    "Summarize risks identified in the company’s sustainability reports.",
    "Identify risks related to the company’s reliance on renewable energy.",
    "Assess the company’s approach to managing risks from natural disasters.",
    "What are the risks related to the company’s strategic investments?",
    "Summarize the risks associated with the company’s real estate holdings.",
    "What are the company’s plans to address emerging regulatory risks?",
    "Detect risks associated with the company’s reliance on digital marketing.",
    "What are the risks associated with the company’s international operations?",
    "Summarize the company’s approach to managing workforce-related risks.",
    "Highlight risks related to the company’s pension liabilities.",
    "Identify risks related to potential supply shortages.",
    "What are the risks associated with the company’s cybersecurity measures?",
    "Assess the company’s compliance with industry-specific regulations.",
    "Summarize risks related to the company’s research and development activities.",
    "What are the potential risks related to the company’s legal disputes?",
    "Identify risks related to changes in consumer privacy laws.",
    "What are the emerging risks in the company’s market?",
    "Summarize risks related to the company’s debt refinancing efforts.",
    "Identify risks associated with fluctuations in raw material prices.",
    "Assess risks related to the company’s foreign investment strategies.",
    "Summarize the company’s risk management approach to emerging markets.",
    "Evaluate the company’s risk management framework against best practices."
]

# Experimental parameters
num_tests = 3          # Number of test runs per prompt for statistical reliability
win_threshold = 0.5    # BERTScore threshold for considering a response "successful"

# Ensure model is in evaluation mode to disable dropout and other training-specific behaviors
model.eval()

# Initialize results container
results = []

# Iterate through each test prompt
for prompt in prompts:
    # Initialize performance metric tracking for current prompt
    time_taken_list = []        # Response generation time in seconds
    tokens_per_second_list = [] # Throughput measurement
    perplexity_list = []        # Model confidence/fluency metric
    rouge_scores = []           # Content overlap metric
    bleu_scores = []            # Content precision metric  
    edit_distance_list = []     # String similarity metric
    bert_score_list = []        # Semantic similarity metric
    win_rate_list = []          # Binary success/failure metric

    # Run multiple tests for each prompt to account for generation stochasticity
    for _ in range(num_tests):
        # Start timing the response generation
        start_time = time.time()

        # Prepare input and move it to the same device as the model
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

        # Generate response using the model
        with torch.no_grad():  # Disable gradient computation for inference
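            # do_sample=True is needed for temperature to take effect; otherwise decoding is greedy
            output = model.generate(**inputs, max_new_tokens=64, do_sample=True, temperature=0.7)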

        # Decode the token IDs back to text (output[0] holds the prompt followed by the continuation)
        decoded_output = tokenizer.decode(output[0], skip_special_tokens=True)

        # Calculate response time
        time_taken = time.time() - start_time
        time_taken_list.append(time_taken)

        # Calculate throughput (tokens per second)
        num_tokens = output.shape[1]  # Prompt tokens + tokens actually generated (may stop before 64 at EOS)
        tokens_per_second = num_tokens / time_taken
        tokens_per_second_list.append(tokens_per_second)

        # Calculate perplexity of the prompt under the model (lower is better).
        # This scores the prompt tokens, not the generated response.
        with torch.no_grad():
            outputs = model(**inputs, labels=inputs["input_ids"])
            loss = outputs.loss
            perplexity = torch.exp(loss).item()
            perplexity_list.append(perplexity)

        # Calculate ROUGE score (content overlap; the prompt is used as the reference text here)
        rouge_result = rouge.compute(predictions=[decoded_output], references=[prompt])
        rouge_scores.append(rouge_result["rougeL"])

        # Calculate BLEU score (precision-based similarity)
        bleu_result = bleu.compute(predictions=[decoded_output], references=[prompt])
        bleu_scores.append(bleu_result["bleu"])

        # Calculate Levenshtein edit distance (character-level difference; lower means more similar)
        edit_distance = levenshtein_distance(decoded_output, prompt)
        edit_distance_list.append(edit_distance)

        # Calculate BERTScore (contextualized semantic similarity)
        P, R, F1 = bert_score([decoded_output], [prompt], lang="en", rescale_with_baseline=True)
        f1_score = F1.mean().item()
        bert_score_list.append(f1_score)
        
        # Determine if this response meets the quality threshold (binary success metric)
        win_rate_list.append(1 if f1_score >= win_threshold else 0)

    # Aggregate metrics across test runs for this prompt
    results.append({
        "prompt": prompt,
        "avg_time": sum(time_taken_list) / num_tests,
        "avg_tps": sum(tokens_per_second_list) / num_tests,
        "avg_perplexity": sum(perplexity_list) / num_tests,
        "avg_rouge": sum(rouge_scores) / num_tests,
        "avg_bleu": sum(bleu_scores) / num_tests,
        "avg_edit_distance": sum(edit_distance_list) / num_tests,
        "avg_bert_score": sum(bert_score_list) / num_tests,
        "win_rate": sum(win_rate_list) / num_tests * 100  # Convert to percentage
    })

# Create a Word document for the evaluation report
doc = Document()
doc.add_heading('LLM Compliance Evaluation Report', 0)

# Add metadata about the experiment
doc.add_paragraph(f"Model: {model.__class__.__name__}")
doc.add_paragraph(f"Number of test runs per prompt: {num_tests}")
doc.add_paragraph(f"BERTScore Win Threshold: {win_threshold}")

# Create a structured table for results
table = doc.add_table(rows=1, cols=9)
table.style = 'Table Grid'

# Define table headers
hdr_cells = table.rows[0].cells
hdr_cells[0].text = 'Prompt'
hdr_cells[1].text = 'Time (s)'
hdr_cells[2].text = 'Tokens/sec'
hdr_cells[3].text = 'Perplexity'
hdr_cells[4].text = 'ROUGE-L'
hdr_cells[5].text = 'BLEU'
hdr_cells[6].text = 'Edit Dist'
hdr_cells[7].text = 'BERTScore'
hdr_cells[8].text = 'Win Rate (%)'

# Populate the table with evaluation results
for r in results:
    row_cells = table.add_row().cells
    row_cells[0].text = r['prompt']
    row_cells[1].text = f"{r['avg_time']:.2f}"
    row_cells[2].text = f"{r['avg_tps']:.2f}"
    row_cells[3].text = f"{r['avg_perplexity']:.2f}"
    row_cells[4].text = f"{r['avg_rouge']:.4f}"
    row_cells[5].text = f"{r['avg_bleu']:.4f}"
    row_cells[6].text = f"{r['avg_edit_distance']:.2f}"
    row_cells[7].text = f"{r['avg_bert_score']:.4f}"
    row_cells[8].text = f"{r['win_rate']:.2f}"

# Save the report document
doc.save("compliance_mistral_finetuned_evaluation.docx")
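
The loop above reports only the per-prompt average of each metric over num_tests runs. Because sampling makes individual runs vary, it can help to report the spread alongside the mean. The following is a minimal sketch of one way to do that, not part of the original notebook: the helper name summarize_runs and the example values are hypothetical, and it uses only the Python standard library.

```python
import statistics


def summarize_runs(values, win_threshold=None):
    """Summarize one metric across repeated runs of the same prompt.

    Returns the mean and the sample standard deviation. If win_threshold
    is given, also returns the percentage of runs at or above it.
    (Hypothetical helper for illustration only.)
    """
    if not values:
        raise ValueError("values must contain at least one run")

    summary = {
        "mean": statistics.fmean(values),
        # Sample standard deviation needs at least two runs
        "stdev": statistics.stdev(values) if len(values) > 1 else 0.0,
    }

    if win_threshold is not None:
        wins = sum(1 for v in values if v >= win_threshold)
        summary["win_rate"] = 100.0 * wins / len(values)

    return summary


# Made-up BERTScore F1 values for three runs of a single prompt
example_f1 = [0.62, 0.48, 0.55]
example_summary = summarize_runs(example_f1, win_threshold=0.5)
print(example_summary)
```

With a helper like this, each sum(...) / num_tests entry in the results dictionary could be replaced by the corresponding "mean" value, and the "stdev" value could be added as an extra column in the report table. With only three runs per prompt the standard deviation is a rough indicator at best.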