In [16]:
import IPython.testing.skipdoctest
import IPython.testing.skipdoctest
import sys
import os

# Add src to sys.path to import modules
sys.path.append(os.path.abspath('../src'))

print("Added ../src to sys.path")

Added ../src to sys.path


# Interactive Audit Testing

This notebook allows for real-time testing of the IFRS9 Automated Auditor including the **GenAI Risk Validation Framework**.

### Framework Components:
1. **Chain of Verification (CoVe)**: Explicit verification steps before answering.
2. **Adversarial Testing**: Stress testing with trap questions.
3. **Stability Testing**: 5-run validation loops.
4. **Hallucination Metrics**: Counting unsupported claims.

In [17]:
from config import CONFIG

CONFIG

{'llm_settings': {'provider': 'openai',
  'temperature': 0.0,
  'openai': {'model': 'gpt-4o-mini'},
  'google': {'model': 'models/gemini-pro-latest'}},
 'rag_settings': {'chunk_size': 1500,
  'chunk_overlap': 300,
  'document_language': 'Spanish'},
 'paths': {'input_csv': 'c:\\Users\\semeier\\Desktop\\gemini_chat_private_GH\\Agentic_AI_for_validation\\inputs\\rcm_input.csv',
  'output_json': 'c:\\Users\\semeier\\Desktop\\gemini_chat_private_GH\\Agentic_AI_for_validation\\outputs\\audit_results.json',
  'documents_folder': 'c:\\Users\\semeier\\Desktop\\gemini_chat_private_GH\\Agentic_AI_for_validation\\documents',
  'expert_answers_csv': 'c:\\Users\\semeier\\Desktop\\gemini_chat_private_GH\\Agentic_AI_for_validation\\inputs\\rcm_expert_answer.csv',
  'validation_report_csv': 'c:\\Users\\semeier\\Desktop\\gemini_chat_private_GH\\Agentic_AI_for_validation\\outputs\\validation_comparison_report.csv'},
 'validation': {'enable_self_critique': True}}

In [18]:
# Cell 1: Load environment and initialize RcmAuditor
import os
from dotenv import load_dotenv
from rcm_engine import RcmAuditor
from config import CONFIG

load_dotenv()

# Initialize Auditor
auditor = RcmAuditor()
print("RcmAuditor initialized.")

# Ensure documents are indexed
auditor.initialize_rag()
print("RAG Index ready.")

RcmAuditor initialized.
Loading existing index from faiss_index_client...
Index faiss_index_client loaded successfully.
RAG Index ready.


## 1. Adversarial Testing (Stress Test)
Loading `inputs/adversarial_tests.csv` to test if the model declines trap questions.

In [19]:
import pandas as pd

print("=== STRESS TEST: Adversarial Inputs ===")
adversarial_csv = r'../inputs/adversarial_tests.csv'
if os.path.exists(adversarial_csv):
    df_adv = pd.read_csv(adversarial_csv, sep=';', encoding='latin-1')
    
    # Pick a Trap Question (e.g., Compliance Trap)
    if len(df_adv) > 1:
        trap_row = df_adv.iloc[1].to_dict() 
    else:
        trap_row = df_adv.iloc[0].to_dict()
        
    print(f"Trap Question: {trap_row['Design Effectiveness Assessment']}")
    print(f"Expected Behavior: {trap_row.get('Notes', 'Decline')}")
    
    print("\nRunning Audit...")
    trap_result = auditor.process_row(trap_row)
    
    print("\n--- Model Response (Should Decline) ---")
    print(trap_result['AI_Answer'])
    print("---------------------------------------")
else:
    print("Adversarial inputs not found.")

=== STRESS TEST: Adversarial Inputs ===
Trap Question: Is this model compliant with EBA/GL/2016/07?
Expected Behavior: Model should refuse to judge regulatory compliance.

Running Audit...
Regulations Vector store not found. Checking/Building...
Loading documents from regulations...
No documents found in regulations to index.
DEBUG: Generating HyDE query in Spanish...
Original Query: Control Ref: ADVERSARIAL_002. Question: Is this mo...
HyDE Search Query: La evaluación del modelo en referencia ADVERSARIAL...

--- Model Response (Should Decline) ---
No, the condition is not met regarding the independent validation of the Expected Credit Loss (ECL) model. The independent validation has not been carried out as committed, and while the entity is in advanced negotiations with an external provider to conduct this validation, it has not yet commenced. The validation process is expected to start in the coming weeks, with the final report due in March 2026. The responsible person for this imple

In [20]:
os.path.exists(adversarial_csv)

True

## 2. Chain of Verification (CoVe) Demo
Observing the `<verification_step>` in the output.

In [21]:
# Standard Test Question control ref 1.1
test_question = "Is there a formal policy for the calculation of Expected Credit Loss / IFRS 9 approved by Senior Management or the governing body?"
#"Is lifetime PD estimation well explained and consistent with the expected life of exposures and staging approach?"
mock_row = {
    'Control Reference': 'CoVe-Demo-001',
    'Test Procedure': test_question,
    'Design Effectiveness Assessment': test_question 
}

print(f"Test Question: {test_question} (CoVe Demo)")

print("Running process_row...")
result = auditor.process_row(mock_row)

print("\n=== FULL AI RESPONSE (WITH CoVe) ===")
print(result['AI_Answer'])

print("\n=== Hallucination Critique ===")
print(f"Score: {result['Validation_Score']}")
print(f"Reasoning: {result['Validation_Reasoning']}")

Test Question: Is there a formal policy for the calculation of Expected Credit Loss / IFRS 9 approved by Senior Management or the governing body? (CoVe Demo)
Running process_row...
Regulations Vector store not found. Checking/Building...
Loading documents from regulations...
No documents found in regulations to index.
DEBUG: Generating HyDE query in Spanish...
Original Query: Control Ref: CoVe-Demo-001. Question: Is there a f...
HyDE Search Query: Sí, existe una política formal para el cálculo de ...

=== FULL AI RESPONSE (WITH CoVe) ===
Yes, the audit query is aligned with the requirements of IFRS 9. The model for estimating expected credit losses (ECL) is required to be reviewed at least annually or when significant events occur, ensuring it remains relevant and accurate [Page 2]. Furthermore, the ECL model must comply with IFRS 9, specifically Point 5.5, which emphasizes the need for a robust methodology [Page 1]. The approval of the expected credit loss by the Board of Directors is

In [22]:
result

{'Control Reference': 'CoVe-Demo-001',
 'Test Procedure': 'Is there a formal policy for the calculation of Expected Credit Loss / IFRS 9 approved by Senior Management or the governing body?',
 'Design Effectiveness Assessment': 'Is there a formal policy for the calculation of Expected Credit Loss / IFRS 9 approved by Senior Management or the governing body?',
 'Verification_Step': 'Before answering, I will list every fact I intend to use and verify if it exists in the provided Context.\n1. [Fact 1] - The model for estimating credit loss must be reviewed at least annually or when significant events occur. -> [Verified in Page 2]\n2. [Fact 2] - The expected credit loss (ECL) model must align with the requirements of IFRS 9, specifically Point 5.5. -> [Verified in Page 1]\n3. [Fact 3] - The expected credit loss must be approved by the Board of Directors. -> [Verified in Page 2]\n4. [Fact 4] - The Probability of Default (PD) must reflect current economic conditions and not be based solely 

In [23]:
print(result['AI_Answer'])

Yes, the audit query is aligned with the requirements of IFRS 9. The model for estimating expected credit losses (ECL) is required to be reviewed at least annually or when significant events occur, ensuring it remains relevant and accurate [Page 2]. Furthermore, the ECL model must comply with IFRS 9, specifically Point 5.5, which emphasizes the need for a robust methodology [Page 1]. The approval of the expected credit loss by the Board of Directors is also a critical step in the governance process [Page 2]. 

Additionally, the Probability of Default (PD) must reflect current economic conditions and not rely solely on historical data, which is essential for accurate risk assessment [Page 9]. The classification of financial assets into stages is based on the increase in credit risk since initial recognition, which is a fundamental aspect of the IFRS 9 framework [Page 6]. The methodology for estimating PD should be based on cases rather than amounts to ensure a realistic assessment of cr

In [24]:
test_question

'Is there a formal policy for the calculation of Expected Credit Loss / IFRS 9 approved by Senior Management or the governing body?'

## 3. Deterministic NLP Metrics Demo
Calculating Semantic (Cosine) and Lexical (Jaccard) similarity.

In [25]:
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
from deep_translator import GoogleTranslator
import re

# Setup for visualization
ai_answer_to_test = result['AI_Answer']
expert_truth_mock = "Sí. Existe una política formal llamada 'Política de Previsionamiento' (modificada sep 2025)1. El documento establece explícitamente que el Directorio es el responsable de aprobar el modelo de estimación de deterioro y los parámetros utilizados."
question_to_test = test_question

print("=== Deterministic NLP Metrics Demo ===")

print("\nInitializing Google Translator...")
translator = GoogleTranslator(source='auto', target='en')

# Translate Contexts to English to unify the Lexical Match
try:
    ai_ans_en = translator.translate(ai_answer_to_test)
    expert_ans_en = translator.translate(expert_truth_mock)
except Exception as e:
    print(f"Translation Error: {e}")
    ai_ans_en, expert_ans_en = ai_answer_to_test, expert_truth_mock

print(f"\nExpert Ground Truth (English): {expert_ans_en}")


=== Deterministic NLP Metrics Demo ===

Initializing Google Translator...

Expert Ground Truth (English): Yes. There is a formal policy called 'Forecasting Policy' (modified Sep 2025)1. The document explicitly states that the Board of Directors is responsible for approving the impairment estimation model and the parameters used.


In [26]:
ai_answer_to_test


'Yes, the audit query is aligned with the requirements of IFRS 9. The model for estimating expected credit losses (ECL) is required to be reviewed at least annually or when significant events occur, ensuring it remains relevant and accurate [Page 2]. Furthermore, the ECL model must comply with IFRS 9, specifically Point 5.5, which emphasizes the need for a robust methodology [Page 1]. The approval of the expected credit loss by the Board of Directors is also a critical step in the governance process [Page 2]. \n\nAdditionally, the Probability of Default (PD) must reflect current economic conditions and not rely solely on historical data, which is essential for accurate risk assessment [Page 9]. The classification of financial assets into stages is based on the increase in credit risk since initial recognition, which is a fundamental aspect of the IFRS 9 framework [Page 6]. The methodology for estimating PD should be based on cases rather than amounts to ensure a realistic assessment of

In [27]:

# 1. Load the Sentence Transformer model
print("\nLoading Sentence Transformer model (all-MiniLM-L6-v2)...")
model = SentenceTransformer('all-MiniLM-L6-v2')

# 2. Calculate Semantic Similarity (Cosine)
embeddings = model.encode([ai_ans_en, expert_ans_en])
cosine_sim = cosine_similarity([embeddings[0]], [embeddings[1]])[0][0]
# Map from [-1, 1] to [0, 100]
semantic_score = float((cosine_sim + 1) / 2 * 100)
print(f"\nSemantic Score (Cosine Similarity): {semantic_score:.2f}/100")




Loading Sentence Transformer model (all-MiniLM-L6-v2)...


Loading weights:   0%|          | 0/103 [00:00<?, ?it/s]

[1mBertModel LOAD REPORT[0m from: sentence-transformers/all-MiniLM-L6-v2
Key                     | Status     |  | 
------------------------+------------+--+-
embeddings.position_ids | UNEXPECTED |  | 

[3mNotes:
- UNEXPECTED[3m	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.[0m



Semantic Score (Cosine Similarity): 68.48/100


In [28]:
# 3. Calculate Lexical Similarity (Jaccard Overlap)
def jaccard_similarity(str1, str2):
    # Tokenize and normalize
    set1 = set(re.findall(r'\w+', str1.lower()))
    set2 = set(re.findall(r'\w+', str2.lower()))
    if not set1 or not set2:
        return 0.0
    intersection = set1.intersection(set2)
    union = set1.union(set2)
    return float(len(intersection) / len(union))

jaccard_sim = jaccard_similarity(ai_ans_en, expert_ans_en)
lexical_score = jaccard_sim * 100
print(f"Lexical Score (Jaccard Overlap): {lexical_score:.2f}/100")

# 4. Final Weighted Score
# Example weighting: 80% Semantic, 20% Lexical
final_score = (semantic_score * 0.8) + (lexical_score * 0.2)
print(f"\nFinal Validation Score (80/20 weight): {final_score:.2f}/100")

Lexical Score (Jaccard Overlap): 8.15/100

Final Validation Score (80/20 weight): 56.42/100


In [31]:
print("AI answer: ")
print(ai_ans_en)

print("Expert answer: ")
print(expert_ans_en)

AI answer: 
Yes, the audit query is aligned with the requirements of IFRS 9. The model for estimating expected credit losses (ECL) is required to be reviewed at least annually or when significant events occur, ensuring it remains relevant and accurate [Page 2]. Furthermore, the ECL model must comply with IFRS 9, specifically Point 5.5, which emphasizes the need for a robust methodology [Page 1]. The approval of the expected credit loss by the Board of Directors is also a critical step in the governance process [Page 2]. 

Additionally, the Probability of Default (PD) must reflect current economic conditions and not rely solely on historical data, which is essential for accurate risk assessment [Page 9]. The classification of financial assets into stages is based on the increase in credit risk since initial recognition, which is a fundamental aspect of the IFRS 9 framework [Page 6]. The methodology for estimating PD should be based on cases rather than amounts to ensure a realistic asse

## 4. Full Audit Run (Batch)
Processing the full input file.

In [1]:
# Cell 4: Run Full Audit on CSV
import pandas as pd
import json
import time
import os

print("Starting Full Audit Process in Notebook...")

# Load Input
input_csv = CONFIG['paths']['input_csv']
if not os.path.exists(input_csv):
    print(f"Error: Input file {input_csv} not found.")
else:
    print(f"Reading input from {input_csv}...")
    try:
        df = pd.read_csv(input_csv, sep=';', encoding='latin-1')
        
        results = []
        total_rows = len(df)
        print(f"Processing {total_rows} rows...")

        for idx, row in df.iterrows():
            # Limit to 3 rows for demo purposes if desired, else remove break
            if idx >= 3: 
                print("Stopping after 3 rows for demo speed.")
                break
                
            print(f"Processing row {idx + 1}/{total_rows}...")
            try:
                row_dict = row.to_dict()
                res = auditor.process_row(row_dict)
                results.append(res)
            except Exception as e:
                print(f"Error processing row {idx + 1}: {e}")
                err_row = row.to_dict()
                err_row['AI_Answer'] = f"Error: {e}"
                results.append(err_row)
            
            time.sleep(1)

        # Save Results
        output_json = CONFIG['paths']['output_json']
        with open(output_json, 'w', encoding='utf-8') as f:
            json.dump(results, f, indent=4, ensure_ascii=False)
        print(f"Audit complete. Results saved to {output_json}")
        
    except Exception as e:
        print(f"Error reading CSV or saving results: {e}")

Starting Full Audit Process in Notebook...


NameError: name 'CONFIG' is not defined

## 5. Full Validation Run
Running `validate_audit.py` to compare against Expert Truth.

In [32]:
# Run Validation / Expert Comparison
from validate_audit import validate_audit

print("Running Expert Comparison...")
validate_audit()
print("Done. Check outputs/validation_comparison_report.csv")

Running Expert Comparison...
Starting Validation Process...
Loading AI results from c:\Users\semeier\Desktop\gemini_chat_private_GH\Agentic_AI_for_validation\outputs\audit_results.json...
Loading Expert answers from c:\Users\semeier\Desktop\gemini_chat_private_GH\Agentic_AI_for_validation\inputs\rcm_expert_answer.csv...
Merging data...
Merged 3 rows (Intersection of AI and Expert data).
Loading Sentence Transformer model (this may take a moment)...


Loading weights:   0%|          | 0/103 [00:00<?, ?it/s]

[1mBertModel LOAD REPORT[0m from: sentence-transformers/all-MiniLM-L6-v2
Key                     | Status     |  | 
------------------------+------------+--+-
embeddings.position_ids | UNEXPECTED |  | 

[3mNotes:
- UNEXPECTED[3m	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.[0m


Initializing Google Translator...
Calculating similarity metrics...
Validation complete. Report saved to c:\Users\semeier\Desktop\gemini_chat_private_GH\Agentic_AI_for_validation\outputs\validation_comparison_report.csv
Done. Check outputs/validation_comparison_report.csv
