# **Summarisation & Evasion Notebook**

# **Handover Notes:** [delete after]
- Library imports and versions are saved in environments/summarisation_evasion_env.txt
- This notebook was originally built for a macbook pro M3 chip so some settings may need to be altered depending on your machine
- All files related/ generated by this notebook can be found in notebooks/summarisation_evasion_files

### **Work progress**
1. **Complete**
- Summarise banker answers using baseline model.
- Use Local RAG pipeline to bring in relevant external documents (PRA risk definitions) to create PRA aligned summaries.
- Developed a evasion detection prototype that generates evasion scores based on bankers answers (uses baseline model, LLM- natural language inference using RoBERTa and a blended score)
- Used jpm_2025 transcripts to get the pipeline working. Validated the evasion pipeline using jpm-23-1q data (involved human labelling the answer as Direct or Evasive- file saved in notebooks/summarisation_evasion_files).

2. **Not complete**
- Visualisations e.g. how many evasive answers were there? etc - apply evasion pipeline to dataset and generate statistics on evasiveness 
- Need to test pipleine on larger data set (e.g. jpm 2023-2025) and check against HSBC to make conclusions & comment on generalisability (answering research question: How does one bank’s tone and thematic profile compare to peers? Are divergences systemic or firm specific?)
- Summarisation pipeline could be improved using a two-stage pipeline: by first extractive summarisation to capture the context and details and then a second model to reframe the summary to be PRA and evasion aligned.
- Post-processing on the output file for the PRA aligned summaries by Mistral model so they are clearer- can this output be fed into another model to extract more insights/ detect evasion or risk?
- Increase the size of the validation set for the evasion pipeline prototype (e.g. more human labelling)
- Need to fine tune the evasion pipeline to increase accuracy
- Optional extensions e.g. using Agents, more complex RAG pipeline (including more useful context for the model), validation of instances of evasion using external news sources)

# 1. **Objectives**

# **2. Set up Workspace**

In [175]:
# Import libraries
# Core python
import os
import numpy as np
import pandas as pd
import re
import json
import pathlib
from pathlib import Path
from typing import List, Dict, Any 
import csv
import math

# NLP & Summarisation
from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification
from llama_cpp import Llama 
import torch
import torch.nn.functional as F

# Retrieval
from sentence_transformers import SentenceTransformer 

# ML
from sklearn.model_selection import train_test_split, GroupShuffleSplit
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, f1_score, precision_score, recall_score, accuracy_score
from sklearn.isotonic import IsotonicRegression

# Visualisations
import matplotlib.pyplot as plt
import seaborn as sns 

# Set global SEED.
SEED = 42

# **3. Load the dataset**

In [2]:
# Load the dataset.
jpm_2025_df = pd.read_csv('../data/processed/jpm/all_jpm_2025.csv')

# View the data.
jpm_2025_df.head()

Unnamed: 0,question_number,answer_number,speaker_name,role,company,content,year,quarter,source_pdf
0,1,,Ken Usdin,analyst,Autonomous Research,"Good morning, Jeremy. Wondering if you could s...",2025,Q1,data/raw/jpm/jpm-1q25-earnings-call-transcript...
1,1,1.0,Jeremy Barnum,Chief Financial Officer,JPMorganChase,"Sure, Ken. So I mean, at a high level, I would...",2025,Q1,data/raw/jpm/jpm-1q25-earnings-call-transcript...
2,2,,Ken Usdin,analyst,Autonomous Research,Yeah. And just one question on the NII ex. Mar...,2025,Q1,data/raw/jpm/jpm-1q25-earnings-call-transcript...
3,2,1.0,Jeremy Barnum,Chief Financial Officer,JPMorganChase,"Yeah, that's a good question, Ken. You're righ...",2025,Q1,data/raw/jpm/jpm-1q25-earnings-call-transcript...
4,2,2.0,Jamie Dimon,Chairman & Chief Executive Officer,JPMorganChase,In the curve basically.,2025,Q1,data/raw/jpm/jpm-1q25-earnings-call-transcript...


# **4. Preprocessing**

- Used all_jpm_2025.csv dataset
- Preliminary preprocessing to label roles as analyst vs banker (invalid roles were corrected) to make downstream analysis easier. Created a new column 'role_normalised'.

In [3]:
# View speaker roles.
jpm_2025_df['role'].unique()

array(['analyst', 'Chief Financial Officer',
       'Chairman & Chief Executive Officer',
       'And then some. Theres a lot of value added.', 'Okay'],
      dtype=object)

In [4]:
# View rows with invalid roles.
valid_roles = 'analyst', 'Chief Financial Officer', 'Chairman & Chief Executive Officer'
invalid_roles_df = jpm_2025_df[~jpm_2025_df['role'].isin(valid_roles)]

# Number of rows with invalid roles.
print('Number of rows:', invalid_roles_df.shape[0])

# View the rows.
invalid_roles_df.head()

Number of rows: 2


Unnamed: 0,question_number,answer_number,speaker_name,role,company,content,year,quarter,source_pdf
201,35,5.0,"Chief Financial Officer, JPMorganChase",And then some. Theres a lot of value added.,JPMorganChase,"Yeah. And obviously, I mean, we're not going t...",2025,Q2,data/raw/jpm/jpm-2q25-earnings-call-transcript...
205,36,3.0,"Chief Financial Officer, JPMorganChase",Okay,there you have it.,"But it's not like I thought it would do badly,...",2025,Q2,data/raw/jpm/jpm-2q25-earnings-call-transcript...


In [5]:
# Input the correct role information.
jpm_2025_df.at[205, 'role'] = 'Chief Financial Officer'
jpm_2025_df.at[209, 'role'] = 'Chief Financial Officer'

# Verify the roles have been updated.
jpm_2025_df['role'].unique()

array(['analyst', 'Chief Financial Officer',
       'Chairman & Chief Executive Officer',
       'And then some. Theres a lot of value added.'], dtype=object)

In [6]:
# Define role mapping.
role_map = {
    'analyst': 'analyst',
    'Chief Financial Officer': 'banker',
    'Chairman & Chief Executive Officer': 'banker'
}

# Apply to dataset.
jpm_2025_df['role_normalised'] = jpm_2025_df['role'].map(role_map)

In [7]:
# View the dataset.
jpm_2025_df.head()

Unnamed: 0,question_number,answer_number,speaker_name,role,company,content,year,quarter,source_pdf,role_normalised
0,1,,Ken Usdin,analyst,Autonomous Research,"Good morning, Jeremy. Wondering if you could s...",2025,Q1,data/raw/jpm/jpm-1q25-earnings-call-transcript...,analyst
1,1,1.0,Jeremy Barnum,Chief Financial Officer,JPMorganChase,"Sure, Ken. So I mean, at a high level, I would...",2025,Q1,data/raw/jpm/jpm-1q25-earnings-call-transcript...,banker
2,2,,Ken Usdin,analyst,Autonomous Research,Yeah. And just one question on the NII ex. Mar...,2025,Q1,data/raw/jpm/jpm-1q25-earnings-call-transcript...,analyst
3,2,1.0,Jeremy Barnum,Chief Financial Officer,JPMorganChase,"Yeah, that's a good question, Ken. You're righ...",2025,Q1,data/raw/jpm/jpm-1q25-earnings-call-transcript...,banker
4,2,2.0,Jamie Dimon,Chairman & Chief Executive Officer,JPMorganChase,In the curve basically.,2025,Q1,data/raw/jpm/jpm-1q25-earnings-call-transcript...,banker


# **5. Summarisation**

## **5.1 Baseline**

- Initial model exploration using BART and mistral-7B-instruct to summarise banker's answers (no additional context given to model)

### **5.1.1 BART**

In [8]:
# Filter data to banker answers only.
banker_answers = jpm_2025_df[jpm_2025_df['role_normalised'] == 'banker']['content'].tolist()
print(banker_answers[0][:200])

Sure, Ken. So I mean, at a high level, I would say that obviously, some of the salient news flow is quite recent. So, we've done some soundings and some checking both on the consumer side and on the w


In [9]:
# Summarisation baseline (BART)
bart = pipeline('summarization', model='facebook/bart-large-cnn')

sample_text = banker_answers[0]
summary_bart = bart(sample_text, max_length=80, min_length=30, do_sample=False)
print('Original:', sample_text[:400])
print('Summary:', summary_bart[0]['summary_text'])

Device set to use mps:0


Original: Sure, Ken. So I mean, at a high level, I would say that obviously, some of the salient news flow is quite recent. So, we've done some soundings and some checking both on the consumer side and on the wholesale side. I think on the consumer side, the thing to check is the spending data. And to be honest, the main thing that we see there, what would appear to be a certain amount of frontloading of sp
Summary: The main thing that we see there, what would appear to be a certain amount of frontloading of spending ahead of people expecting price increases from tariffs. So ironically, that's actually somewhat supportive, all else equal. In terms of our corporate clients, obviously, they've been reacting to the changes in tariff policy.


- bart was able to extract ket ideas, focussing on fronloading of spending and tariff policy. 
- Compressed the response into two sentences and the summary is coherent, removing filler phrases.
- However, the summary is not fully neutral (e.g. includes ironically) and preserves tone
- Also there is a loss of context- e.g. consumer side vs wholesale side distinction is no longer explicit.

In [10]:
# Prompt conditioning to make PRA relevant.
prompt = "Summarise this answer, focusing on risk, capital and evasion of detail: " + sample_text
summary_bart_prompted = bart(prompt, max_length=80, min_length=30)
print('Original:', sample_text[:400])
print('Summary:', summary_bart_prompted[0]['summary_text'])

Original: Sure, Ken. So I mean, at a high level, I would say that obviously, some of the salient news flow is quite recent. So, we've done some soundings and some checking both on the consumer side and on the wholesale side. I think on the consumer side, the thing to check is the spending data. And to be honest, the main thing that we see there, what would appear to be a certain amount of frontloading of sp
Summary: Corporates are taking a wait-and-see approach to tariff policy. Some sectors are going to be much more exposed than others. Small business and smaller corporates are probably a little more challenged.


- Prompted summary shifts emphasis and includes interpretation around risk, even though those words were no explicit in the original
- This version is more aligned to evasion detection but moves away from concrete detail 
- Improved approach would be to have a two stage-pipeline: first extractive summarisation to capture the context and details and then a second model to reframe the summary to be PRA and evasion aligned.

### **5.1.2 Mistral-7B-Instruct**

- Mistral model: mistral-7b-instruct-v0.1.Q4_K_M.gguf
- Mistral-7B-Instruct model download: https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF?show_file_info=mistral-7b-instruct-v0.1.Q4_K_M.gguf
- Also saved in shared team folder models

In [11]:
# Summarisation baseline (Mistral-7B-Instruct) with basic prompt.
llm = Llama(model_path='/Users/laurenbrixey/Documents/Data Science Career Accelerator/Project Submissions/Course 3/topic_project_4.1/mistral-7b-instruct-v0.1.Q4_K_M.gguf',
            n_ctx=4096, n_gpu_layers=-1, verbose=False, seed=SEED)  # change path as needed 

prompt = f"<s>[INST] Summarise the following answer in 2 sentences, focusing on concrete facts. Avoid opinions.\n\n{sample_text}\n[/INST]"

output = llm.create_chat_completion(
    messages=[{'role': 'user', 'content': prompt}],
    max_tokens=180,
    temperature=0.1,
    stop=['</s>']
)

summary_mistral = output['choices'][0]['message']['content'].strip()  

print('Original:', sample_text[:400])
print('Summary:', summary_mistral)

llama_context: n_ctx_per_seq (4096) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
ggml_metal_init: skipping kernel_get_rows_bf16                     (not supported)
ggml_metal_init: skipping kernel_set_rows_bf16                     (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32                   (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_c4                (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_1row              (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_l4                (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_bf16                  (not supported)
ggml_metal_init: skipping kernel_mul_mv_id_bf16_f32                (not supported)
ggml_metal_init: skipping kernel_mul_mm_bf16_f32                   (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_bf16_f16                (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h64 

Original: Sure, Ken. So I mean, at a high level, I would say that obviously, some of the salient news flow is quite recent. So, we've done some soundings and some checking both on the consumer side and on the wholesale side. I think on the consumer side, the thing to check is the spending data. And to be honest, the main thing that we see there, what would appear to be a certain amount of frontloading of sp
Summary: The speaker is discussing the impact of recent news flow on the consumer and corporate sides. On the consumer side, there has been some frontloading of spending ahead of expected price increases from tariffs, which may distort the data and make it difficult to draw larger conclusions. On the corporate side, clients are reacting to changes in tariff policy by shifting their focus towards short-term work and optimizing supply chains. The speaker characterizes the attitude of corporate clients as a wait-and-see attitude, with smaller clients and smaller corporates being more c

- Preserves details and nuance and is more contextual and interpretive than the BART baseline model.
- However, the result is longer with heavier phrasing and includes phrases like 'distort the data' which is not explicit in the original.

In [12]:
# Summarisation baseline (Mistral-7B-Instruct) with more detailed prompt.
llm = Llama(model_path='/Users/laurenbrixey/Documents/Data Science Career Accelerator/Project Submissions/Course 3/topic_project_4.1/mistral-7b-instruct-v0.1.Q4_K_M.gguf',
            n_ctx=4096, n_gpu_layers=-1, verbose=False, seed=SEED)  # change path as needed 

prompt = f"<s>[INST] Summarise the following answer in 2 sentences, focusing on concrete facts. Avoid opinions. Focus on risk, capital and evasion of detail.\n\n{sample_text}\n[/INST]"

output = llm.create_chat_completion(
    messages=[{'role': 'user', 'content': prompt}],
    max_tokens=180,
    temperature=0.1,
    stop=['</s>']
)

summary_mistral_prompted = output['choices'][0]['message']['content'].strip()  

print('Original:', sample_text[:400])
print('Summary:', summary_mistral_prompted)

llama_context: n_ctx_per_seq (4096) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
ggml_metal_init: skipping kernel_get_rows_bf16                     (not supported)
ggml_metal_init: skipping kernel_set_rows_bf16                     (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32                   (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_c4                (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_1row              (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_l4                (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_bf16                  (not supported)
ggml_metal_init: skipping kernel_mul_mv_id_bf16_f32                (not supported)
ggml_metal_init: skipping kernel_mul_mm_bf16_f32                   (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_bf16_f16                (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h64 

Original: Sure, Ken. So I mean, at a high level, I would say that obviously, some of the salient news flow is quite recent. So, we've done some soundings and some checking both on the consumer side and on the wholesale side. I think on the consumer side, the thing to check is the spending data. And to be honest, the main thing that we see there, what would appear to be a certain amount of frontloading of sp
Summary: The speaker is discussing the impact of recent news flow on the consumer and corporate sides of their business. On the consumer side, they have observed some frontloading of spending ahead of expected price increases from tariffs, which may distort data and make it difficult to draw larger conclusions. On the corporate side, clients are shifting their focus towards optimizing supply chains and responding to the current environment, rather than prioritizing more strategic work. The speaker notes that smaller clients and smaller corporates may be more challenged than larger o

- This summary brings in risk- language and is closer to the task objective.
- However, some interpretations are generated by the model rather than explicitly detailed in the answer

## **5.2 Adding Context**

Retrieve PRA risk categories to give greater PRA focus to summaries (local RAG loop).
- measure cosine similarity between transcript chunks and PRA risk categories (vectors)
- retrieve the top 2-3 most relevant risk categories 
- prepend them to the summarisation prompt to make summaries PRA-aligned instead of just summarised answers

- Attempting to use BART resulted in prompt echoing.
- New attempt using Mistral-7B-Instruct.
- Using sentence-BERT vs TF-IDF for vectorisation.

### **5.2.1 Mistral-7B-Instruct**

**Process**
- Performed some light cleaning of the transcript to remove whitespace.
- Split the transcript into smaller chunks that the model can summarise to avoid truncation
- Loaded the PRA categories csv file (contains category and definition)
- Embedded the PRA categories and chunks, evaluated the similarity to extract the PRA risk categories that were relevant to the text
- Summarised the chunk using detailed prompted and relevant PRA categories as additional context. 

**Output File**:
- The output file of this can be found in notebooks/summarisation_evasion_files, name = jpm_mistral_pra_summary.json
- It is in the format: summary, evidence, PRA category that relates to summary and reasoning for selecting these categories.

- Needed to use a lot of fine tuning for the prompt and set strict rules for the model
- Need to be very clear about the output expected or else the model deviates a lot, especially as it processes more data.
- Include lines about lack of evidence if not the model may hallucinate

In [13]:
# Function to remove whitespace in text.
def clean_text(text: str):
    return re.sub(r'\s+', ' ', text).strip()

In [14]:
# Function to split the transcript into smaller chunks.
def chunk_text(text: str, max_chars: int = 6000):
    sentences = re.split(r'(?<=[.!?])\s+', text.strip()) # split into sentences 
    chunks, current_chunk, current_len = [], [], 0 # list of chunks, sentences collecting for current chunk, character count for current chunk

    for s in sentences:
        if current_len + len(s) + 1 <= max_chars: # if the characters of current chunk + new sentence is below the limit:
            current_chunk.append(s) # add sentence to current chunk 
            current_len += len(s) + 1 # update running character count 
        
        else: # if the characters is above the limit:
            chunks.append(' '.join(current_chunk)) # add the current chunk to the final chunk list
            current_chunk, current_len = [s], len(s) # start a new chunk containing the sentence and update current len

    if current_chunk:
        chunks.append(' '.join(current_chunk)) # add any sentences in current chunk after loop ends 

    return chunks 

In [15]:
# Function to load PRA categories and definitions from CSV.
def load_pra_categories(path: Path):
    with open(path, newline='', encoding='utf-8') as f:
        reader = csv.DictReader(f)
        return [
            (row.get('category', '').strip(), [row.get('definition', '').strip()])
            for row in reader if row.get('category')
        ]

In [16]:
# Build a Sentence-BERT embedding index for PRA categories.
def build_embedding_index(pra_categories):
    embedder = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
    docs = [f"{name} {' '.join(defs)}" for name, defs in pra_categories]
    pra_risk_embeddings = embedder.encode(docs, batch_size=32, normalize_embeddings=True)

    return embedder, np.asarray(pra_risk_embeddings)

In [17]:
# Function to find the relevant PRA categories to the transcript chunks.
def find_rel_categories(chunk, pra_categories, embedder, pra_risk_embeddings, top_k=2):
    query_vec = embedder.encode([chunk], normalize_embeddings=True) # turns chunk into embedding
    sims = cosine_similarity(query_vec, pra_risk_embeddings).ravel() # compares the chunk to each category doc 
    top_indices = np.argsort(-sims)[:top_k] # sorts scores descending and selected top k cateogories 

    return [pra_categories[i] for i in top_indices]

In [18]:
# Function to parse JSON
def parse_tagged_json(raw):
    m = re.search(r"<json>\s*(\{[\s\S]*?\})\s*</json>", raw, flags=re.IGNORECASE)
    if not m:
        return None
    try:
        return json.loads(m.group(1))
    except json.JSONDecodeError:
        return None

In [19]:
# Function to summarise the text chunks.
def summarise_chunk(model, chunk, relevant_categories, max_evidence=5):

    # Build PRA notes (limit to 2 bullets per category)
    lines = []
    for name, definition in relevant_categories:
        lines.append(f'- {name}:')
        for d in list(definition)[:2]:
            lines.append(f'- {d}')
    notes_block = '\n'.join(lines)

    system_prompt = (
        "You are a careful data extraction model. "
        "Return ONLY valid JSON wrapped in <json>...</json> tags."
    )

    user_prompt = f"""
TRANSCRIPT:
{chunk}

PRA NOTES:
{notes_block}

TASK:
Return JSON ONLY, wrapped exactly like this:
<json>{{"summary": "...", "evidence": ["..."], "pra_categories": [{{"category":"...","why":"..."}}]}}</json>

RULES:
- 4-6 sentence neutral summary.
- Up to {max_evidence} evidence bullets (quotes/facts).
- 1-3 pra_categories objects.
- If evidence is lacking, use a single bullet "Insufficient evidence".
- Only choose categories supported by the evidence.
""".strip()

    response = model.create_chat_completion(
        messages=[
            {'role': 'system', 'content': system_prompt},
            {'role': 'user', 'content': user_prompt},
        ],
        temperature=0.2,
        top_p=0.9,
        max_tokens=700,
        repeat_penalty=1.1,
    )

    raw = (response['choices'][0]['message']['content'] or '').strip()

    # Parse the tagged JSON
    parsed = parse_tagged_json(raw)

    # Fallback if model didn’t follow instructions
    if not parsed:
        return (
            {'summary': '', 'evidence': ['Insufficient evidence'], 'pra_categories': []},
            raw,
        )

    # Light coercion to guarantee keys exist
    result = {
        'summary': parsed.get('summary', '') or '',
        'evidence': parsed.get('evidence', []) or [],
        'pra_categories': parsed.get('pra_categories', []) or []
    }
    return result, raw

In [20]:
# Define variables.
MODEL_PATH = '/Users/laurenbrixey/Documents/Data Science Career Accelerator/Project Submissions/Course 3/topic_project_4.1/mistral-7b-instruct-v0.1.Q4_K_M.gguf'
PRA_NOTES_PATH = '../data/RAG-resources/PRA_risk_categories.csv'
TRANSCRIPT_PATH = '../data/processed/jpm/all_jpm_2025.csv'
OUTPUT_PATH = pathlib.Path('../notebooks/summarisation_evasion_files/jpm_mistral_pra_summary_raw.json')
TOP_K = 2

In [21]:
# Runner code.
pra_categories = load_pra_categories(Path(PRA_NOTES_PATH))
embedder, category_embeddings = build_embedding_index(pra_categories)

# Load and chunk transcript
transcript_text = Path(TRANSCRIPT_PATH).read_text(encoding='utf-8')
transcript_chunks = chunk_text(transcript_text)

n_threads = max(4, (os.cpu_count() or 8) - 2)

# Define the model.
model = Llama(
    model_path=str(MODEL_PATH),
    n_ctx=4096,
    n_gpu_layers=20,
    chat_format='mistral-instruct',
    n_threads=n_threads,
)

raw_outputs = []

for i, chunk in enumerate(transcript_chunks, 1):
    try:
        top_categories = find_rel_categories(
            chunk, pra_categories, embedder, category_embeddings, top_k=TOP_K
        )
        _, raw = summarise_chunk(
            model, chunk, top_categories, max_evidence=5
        )
        raw_outputs.append({'chunk': i, 'raw': raw})

    except Exception:
        raw_outputs.append({'chunk': i, 'raw': ''})

final_output = {'raw_outputs': raw_outputs}

OUTPUT_PATH.write_text(json.dumps(final_output, indent=2, ensure_ascii=False), encoding='utf-8')
print(f'Wrote final JSON to: {OUTPUT_PATH.resolve()}')

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
llama_model_load_from_file_impl: using device Metal (Apple M3) - 3559 MiB free
llama_model_loader: loaded meta data with 20 key-value pairs and 291 tensors from /Users/laurenbrixey/Documents/Data Science Career Accelerator/Project Submissions/Course 3/topic_project_4.1/mistral-7b-instruct-v0.1.Q4_K_M.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = mistralai_mistral-7b-instruct-v0.1
llama_model_loader: - kv   2:                       llama.context_len

KeyboardInterrupt: 

- Need to preprocess the output so it is visually clearer (summary, evidence, PRA categories (name & why the model chose this))
- Can this information be fed to the model again and can it detect any early PRA risk indicators?

# **6. Evasion Detection Pipeline**

1. **Baseline Evasion score** (rule-based) is made up of three components:
- **Cosine similarity**- similarity of the question and answer, lower similarity = more evasive
- **Numeric specificity check**- does the question require a number, if so does the answer contain a number?, e.g. requests for financial data
- **Evasive phrases**- does the answer contain evasive phrases?, presence = more evasive

2. **LLM evasion score** (RoBERTa-MNLI) uses entailment/neutral/contradiction between the question and answer
- Lower entailment (and higher neutral + contradiction) = more evasive
  
3. **Blended evasion score** combines both scores including a weight for the LLM component
- Rationale is that baseline enforces precision while the LLM will capture semantics

### **Data Preprocessing**

In [45]:
# Load dataset
all_jpm_2023_2025 = pd.read_csv('../data/processed/jpm/all_jpm_2023_2025.csv')

# View dataset.
display(all_jpm_2023_2025.head())

# Number of rows.
print('Number of rows:', all_jpm_2023_2025.shape[0])

Unnamed: 0,section,question_number,answer_number,speaker_name,role,company,content,year,quarter,is_pleasantry,source_pdf
0,presentation,,,Jeremy Barnum,Chief Financial Officer,JPMorganChase,"Thanks, and good morning, everyone. The presen...",2023,Q1,False,data/raw/jpm/.ipynb_checkpoints/jpm-1q23-earni...
1,qa,,,Steven Chubak,analyst,Wolfe Research LLC,"Hey, good morning.",2023,Q1,True,data/raw/jpm/.ipynb_checkpoints/jpm-1q23-earni...
2,qa,,,Jeremy Barnum,Chief Financial Officer,JPMorgan Chase & Co.,"Good morning, Steve.",2023,Q1,True,data/raw/jpm/.ipynb_checkpoints/jpm-1q23-earni...
3,qa,1.0,,Steven Chubak,analyst,Wolfe Research LLC,"So, Jamie, I was actually hoping to get your p...",2023,Q1,False,data/raw/jpm/.ipynb_checkpoints/jpm-1q23-earni...
4,qa,1.0,1.0,Jamie Dimon,Chairman & Chief Executive Officer,JPMorgan Chase & Co.,"Well, I think you were already kind of complet...",2023,Q1,False,data/raw/jpm/.ipynb_checkpoints/jpm-1q23-earni...


Number of rows: 1411


In [46]:
# Remove pleasantries.
all_jpm_2023_2025_cleaned = all_jpm_2023_2025[all_jpm_2023_2025['is_pleasantry'] == False]
print('Number of rows:', all_jpm_2023_2025_cleaned.shape[0])

Number of rows: 1241


In [47]:
# Check content column.
print('Number of rows with no content:', all_jpm_2023_2025_cleaned['content'].isna().sum())

Number of rows with no content: 23


In [48]:
# Drop rows with no content.
all_jpm_2023_2025_cleaned = all_jpm_2023_2025_cleaned.dropna(subset=['content'])

In [49]:
# Check content column.
print('Number of rows with no content:', all_jpm_2023_2025_cleaned['content'].isna().sum())

Number of rows with no content: 0


In [50]:
# View roles.
all_jpm_2023_2025_cleaned['role'].unique()

array(['Chief Financial Officer', 'analyst',
       'Chairman & Chief Executive Officer',
       'And then some. Theres a lot of value added.', 'Okay',
       "We're fundamentally", 'Thanks', 'Almost no chance.'], dtype=object)

- Some text has leaked into role column.

In [51]:
# View rows with invalid roles. 
valid_roles = 'analyst', 'Chief Financial Officer', 'Chairman & Chief Executive Officer'
invalid_roles_df = all_jpm_2023_2025_cleaned[~all_jpm_2023_2025_cleaned['role'].isin(valid_roles)]
invalid_roles_df.head(10)

Unnamed: 0,section,question_number,answer_number,speaker_name,role,company,content,year,quarter,is_pleasantry,source_pdf
305,qa,22.0,4.0,"Chief Financial Officer, JPMorganChase",And then some. Theres a lot of value added.,JPMorganChase,"Yeah. And obviously, I mean, we're not going t...",2025,Q2,False,data/raw/jpm/.ipynb_checkpoints/jpm-2q25-earni...
309,qa,23.0,3.0,"Chief Financial Officer, JPMorganChase",Okay,there you have it.,"But it's not like I thought it would do badly,...",2025,Q2,False,data/raw/jpm/.ipynb_checkpoints/jpm-2q25-earni...
650,qa,10.0,3.0,Who knows how important politics are in all th...,We're fundamentally,"as I said, I think on the press call, happy to...",little bit cautious about the pull-forward dyn...,2024,Q1,False,data/raw/jpm/jpm-1q24-earnings-call-transcript...
924,qa,8.0,2.0,"Chief Financial Officer, JPMorgan Chase & Co.",Thanks,Glenn.,"Operator: Next, we'll go to the line of Matt O...",2024,Q2,False,data/raw/jpm/jpm-2q24-earnings-call-transcript...
1059,qa,22.0,4.0,"Chief Financial Officer, JPMorganChase",And then some. Theres a lot of value added.,JPMorganChase,"Yeah. And obviously, I mean, we're not going t...",2025,Q2,False,data/raw/jpm/jpm-2q25-earnings-call-transcript...
1063,qa,23.0,3.0,"Chief Financial Officer, JPMorganChase",Okay,there you have it.,"But it's not like I thought it would do badly,...",2025,Q2,False,data/raw/jpm/jpm-2q25-earnings-call-transcript...
1274,qa,23.0,1.0,"Chairman & Chief Executive Officer, JPMorgan C...",Almost no chance.,JPMorganChase,"Well, but having – it's very important. While ...",2024,Q3,False,data/raw/jpm/jpm-3q24-earnings-conference-call...


In [52]:
# Input the correct role information.
all_jpm_2023_2025_cleaned.loc[[305, 309, 924, 1059, 1063], 'role'] = 'Chief Financial Officer'
all_jpm_2023_2025_cleaned.loc[[1274], 'role'] = 'Chairman & Chief Executive Officer'

# Drop nonsence row.
all_jpm_2023_2025_cleaned = all_jpm_2023_2025_cleaned.drop(index=650)

In [53]:
# Check the roles have been updated.
all_jpm_2023_2025_cleaned['role'].unique()

array(['Chief Financial Officer', 'analyst',
       'Chairman & Chief Executive Officer'], dtype=object)

In [54]:
# Normalise role names.
role_map = {
    'analyst': 'analyst',
    'Chief Financial Officer': 'banker',
    'Chairman & Chief Executive Officer': 'banker'
}

# Map roles.
all_jpm_2023_2025_cleaned['role_normalised'] = all_jpm_2023_2025_cleaned['role'].map(role_map)

In [55]:
# View dataset.
display(all_jpm_2023_2025_cleaned.head())
print('Number of rows:', all_jpm_2023_2025_cleaned.shape[0])

Unnamed: 0,section,question_number,answer_number,speaker_name,role,company,content,year,quarter,is_pleasantry,source_pdf,role_normalised
0,presentation,,,Jeremy Barnum,Chief Financial Officer,JPMorganChase,"Thanks, and good morning, everyone. The presen...",2023,Q1,False,data/raw/jpm/.ipynb_checkpoints/jpm-1q23-earni...,banker
3,qa,1.0,,Steven Chubak,analyst,Wolfe Research LLC,"So, Jamie, I was actually hoping to get your p...",2023,Q1,False,data/raw/jpm/.ipynb_checkpoints/jpm-1q23-earni...,analyst
4,qa,1.0,1.0,Jamie Dimon,Chairman & Chief Executive Officer,JPMorgan Chase & Co.,"Well, I think you were already kind of complet...",2023,Q1,False,data/raw/jpm/.ipynb_checkpoints/jpm-1q23-earni...,banker
5,qa,1.0,1.0,Steven Chubak,analyst,Wolfe Research LLC,Got it. And just in terms of appetite for the ...,2023,Q1,False,data/raw/jpm/.ipynb_checkpoints/jpm-1q23-earni...,analyst
6,qa,1.0,2.0,Jamie Dimon,Chairman & Chief Executive Officer,JPMorgan Chase & Co.,"Oh, yeah.",2023,Q1,False,data/raw/jpm/.ipynb_checkpoints/jpm-1q23-earni...,banker


Number of rows: 1217


In [56]:
# Save the cleaned dataset.
all_jpm_2023_2025_cleaned.to_csv('../data/processed/jpm/cleaned/all_jpm_2023_2025_cleaned') 

In [57]:
# Helper function to remove duplicates within questions and answers. 
def clean_repeats(text):
    if not isinstance(text, str):
        return text

    # 1) Normalize whitespace
    t = ' '.join(text.split()).strip()
    if not t:
        return t

    # 2) If the whole-string is a back-to-back duplicate (A+A) = keep first half
    mid = len(t) // 2
    if len(t) % 2 == 0 and t[:mid] == t[mid:]:
        t = t[:mid]

    # 3) Collapse immediate repeated token spans (n-grams)
    toks = t.split()
    out = []
    i = 0
    while i < len(toks):
        matched = False
        max_span = min(50, len(toks) - i)  # cap span to remaining length
        for n in range(max_span, 4, -1):  # try longer spans first: 50..5
            if i + 2*n <= len(toks) and toks[i:i+n] == toks[i+n:i+2*n]:
                out.extend(toks[i:i+n])  # keep one copy
                i += 2*n                # skip the duplicate block
                matched = True
                break
        if not matched:
            out.append(toks[i])
            i += 1
    t = ' '.join(out)

    # 4) Remove duplicate sentences globally (order-preserving)
    sents = re.split(r'(?<=[.!?])\s+', t)
    seen = set()
    uniq = []
    for s in sents:
        s_norm = s.strip()
        if not s_norm:
            continue
        key = ' '.join(s_norm.lower().split())
        if key not in seen:
            seen.add(key)
            uniq.append(s_norm)
    return ' '.join(uniq)

In [58]:
# Function to convert datasets into question and answer pairs.
def create_qa_pairs(df, min_answer_words=30):
    # Keep only the Q&A section.
    qa_df = df[df['section'].astype(str).str.lower() == 'qa'].copy()

    # Split into roles.
    analyst_rows = qa_df[qa_df['role_normalised'] == 'analyst'].copy()
    banker_rows  = qa_df[qa_df['role_normalised'] == 'banker' ].copy()

    # Keys to keep quarters separated
    key_q = ['year', 'quarter', 'question_number']

    # Build full question text per (year, quarter, question_number)
    question_text_map = (
        analyst_rows
        .groupby(key_q, dropna=False)['content']
        .apply(lambda parts: clean_repeats(' '.join(parts.astype(str))))
        .rename('question')
        .reset_index()
    )

    # Ensure bankers have an answer_number — sequential per (year, quarter, question_number) if missing
    if 'answer_number' not in banker_rows.columns or banker_rows['answer_number'].isna().any():
        banker_rows = banker_rows.sort_index().copy()
        banker_rows['answer_number'] = (
            banker_rows
            .groupby(key_q, dropna=False)
            .cumcount() + 1
        )

    # Combine multiple banker utterances belonging to the same answer
    banker_answers = (
        banker_rows
        .groupby(key_q + ['answer_number'], dropna=False)
        .agg({
            'content':        lambda parts: clean_repeats(' '.join(parts.astype(str))),
            'speaker_name':   'first',
            'role':           'first',
            'role_normalised':'first',
            'source_pdf':     'first'
        })
        .rename(columns={'content': 'answer'})
        .reset_index()
    )

    # Merge question text back onto each answer row
    qa_pairs = banker_answers.merge(
        question_text_map,
        on=key_q,
        how='left',
        validate='many_to_one'
    )

    # Order columns for readability
    column_order = [
        'year', 'quarter', 'question_number', 'answer_number',
        'question', 'answer',
        'speaker_name', 'role', 'role_normalised',
        'source_pdf'
    ]
    qa_pairs = qa_pairs.reindex(columns=[c for c in column_order if c in qa_pairs.columns])

    # Sort and reset index.
    qa_pairs = qa_pairs.sort_values(['year', 'quarter', 'question_number', 'answer_number']).reset_index(drop=True)

    # Drop duplicate answers.
    qa_pairs = qa_pairs.drop_duplicates(subset=['answer'])

    # Drop short answers below threshold to ensure quality answers.
    qa_pairs = qa_pairs[qa_pairs['answer'].astype(str).str.split().str.len() >= int(min_answer_words)]

    return qa_pairs

In [59]:
# Create q&A pairs.
all_jpm_2023_2025_qa = create_qa_pairs(all_jpm_2023_2025_cleaned)

In [60]:
# View number of examples.
print('Number of examples:', all_jpm_2023_2025_qa.shape[0])

Number of examples: 309


In [64]:
# Split into prediction set and validation/test set.
jpm_2025_predict_qa = all_jpm_2023_2025_qa[all_jpm_2023_2025_qa['year'] == 2025]
jpm_2023_2024_qa = all_jpm_2023_2025_qa[all_jpm_2023_2025_qa['year'].isin([2023, 2024])]

# Save the datasets.
jpm_2025_predict_qa.to_csv('../data/processed/jpm/cleaned/jpm_2025_predict_qa.csv') 
jpm_2023_2024_qa.to_csv('../data/processed/jpm/cleaned/jpm_2023_2024_qa.csv')  

The jpm_2023_2024_qa dataset was then manually labelled according to whether the banker's answer was deemed 'Direct' or 'Evasive'. The label was appended by a new column 'label'.

In [176]:
# Load the labelled dataset.
jpm_2023_2024_qa_labelled = pd.read_csv('../data/processed/jpm/cleaned/jpm_2023_2024_qa_labelled.csv')

# View the dataset.
jpm_2023_2024_qa_labelled = jpm_2023_2024_qa_labelled.drop('Unnamed: 0', axis=1)
display(jpm_2023_2024_qa_labelled.head())
print('Number of examples:', jpm_2023_2024_qa_labelled.shape[0])

Unnamed: 0,year,quarter,question_number,answer_number,question,answer,speaker_name,role,role_normalised,source_pdf,label
0,2023,Q4,1.0,1.0,Good morning. Thanks for all the comments on t...,"Yeah. Matt, not particularly updating. I think...",Jeremy Barnum,Chief Financial Officer,banker,data/raw/jpm/jpm-4q23-earnings-call-transcript...,Direct
1,2023,Q4,2.0,1.0,"Okay. And then just separately, you bought bac...",Yeah. Good question. And I think you framed it...,Jeremy Barnum,Chief Financial Officer,banker,data/raw/jpm/jpm-4q23-earnings-call-transcript...,Direct
2,2023,Q4,3.0,1.0,"Thanks. Jeremy, could you give a little more c...","Yeah. Actually, John, this quarter, that's all...",Jeremy Barnum,Chief Financial Officer,banker,data/raw/jpm/jpm-4q23-earnings-call-transcript...,Direct
3,2023,Q4,4.0,1.0,"Okay. And then, just to follow up on the NII, ...","Sure. Yeah, happy to do that, John. So, I thin...",Jeremy Barnum,Chief Financial Officer,banker,data/raw/jpm/jpm-4q23-earnings-call-transcript...,Direct
4,2023,Q4,5.0,1.0,Hey. Good morning. Maybe just to follow up in ...,Yeah. Both good questions. So let's do reprice...,Jeremy Barnum,Chief Financial Officer,banker,data/raw/jpm/jpm-4q23-earnings-call-transcript...,Direct


Number of examples: 215


In [177]:
# Function to split the jpm_2023_2024 dataset into test and validation ensuring answers are not leaked.
def val_test_split(df, group_key, label_col='label', test_size=0.5):

    is_evasive = df[label_col].astype(str).str.lower().eq("evasive")
    g = df.assign(_ev=is_evasive.astype(int)).groupby(group_key).agg(
        n=("__dummy__", "size") if "__dummy__" in df.columns else ("_ev", "size"),
        ev=("_ev", "sum")
    )

    # Order groups: evasive-heavy first, then larger groups
    order = g.sort_values(["ev", "n"], ascending=False).index.tolist()

    # Greedy pack groups into two halves balancing evasive counts, then size
    A, B = [], []
    evA = evB = nA = nB = 0
    target_n_each = len(df) * (1 - test_size)

    for grp in order:
        ev, n = int(g.loc[grp, "ev"]), int(g.loc[grp, "n"])
        # choose the side with fewer evasives; on tie, choose the smaller side by n
        if (evA < evB) or (evA == evB and nA <= nB):
            A.append(grp); evA += ev; nA += n
        else:
            B.append(grp); evB += ev; nB += n

    # Build frames: A = validation, B = test (roughly 50/50 by rows)
    val_set  = df[df[group_key].isin(A)].reset_index(drop=True)
    test_set = df[df[group_key].isin(B)].reset_index(drop=True)
    return val_set, test_set

In [178]:
# Make a group key so that all answers for the same question stay in the same set. 
jpm_2023_2024_qa_labelled['group_key'] = (
    jpm_2023_2024_qa_labelled['year'].astype(str) + '_' +
    jpm_2023_2024_qa_labelled['quarter'].astype(str) + '_' +
    jpm_2023_2024_qa_labelled['question_number'].astype(str)
)

In [179]:
# Split into validation and test set.
jpm_val_qa_labelled, jpm_test_qa_labelled = val_test_split(
    jpm_2023_2024_qa_labelled,
    group_key='group_key',
    label_col='label'
)

print(f'Number of validation examples: {jpm_val_qa_labelled.shape[0]} \n{jpm_val_qa_labelled["label"].value_counts()}')
print(f'Number of test examples: {jpm_test_qa_labelled.shape[0]} \n{jpm_test_qa_labelled["label"].value_counts()}')

Number of validation examples: 108 
label
Direct     87
Evasive    21
Name: count, dtype: int64
Number of test examples: 107 
label
Direct     86
Evasive    21
Name: count, dtype: int64


In [180]:
# Save the datasets.
jpm_val_qa_labelled.to_csv('../data/processed/jpm/cleaned/jpm_val_qa_labelled.csv')
jpm_test_qa_labelled.to_csv('../data/processed/jpm/cleaned/jpm_test_qa_labelled.csv')

### **LLM Model Set-up**

In [181]:
# Model name checkpoints.
roberta_name = 'roberta-large-mnli'
deberta_name = 'microsoft/deberta-large-mnli'
zs_deberta_name = 'MoritzLaurer/deberta-v3-large-zeroshot-v2.0'

# Load tokenizers and models.
roberta_tokenizer = AutoTokenizer.from_pretrained(roberta_name)
roberta = AutoModelForSequenceClassification.from_pretrained(roberta_name)

deberta_tokenizer = AutoTokenizer.from_pretrained(deberta_name)
deberta = AutoModelForSequenceClassification.from_pretrained(deberta_name)

zs_deberta_tokenizer = AutoTokenizer.from_pretrained(zs_deberta_name)
zs_deberta = AutoModelForSequenceClassification.from_pretrained(zs_deberta_name)

Some weights of the model checkpoint at roberta-large-mnli were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of the model checkpoint at microsoft/deberta-large-mnli were not used when initializing DebertaForSequenceClassification: ['config']
- This IS expected if you are initializing DebertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. ini

In [182]:
# Verify label order per model.
print("roberta id2label:", roberta.config.id2label)
print("deberta id2label:", deberta.config.id2label)
print("zs_deberta id2label:", zs_deberta.config.id2label)

roberta id2label: {0: 'CONTRADICTION', 1: 'NEUTRAL', 2: 'ENTAILMENT'}
deberta id2label: {0: 'CONTRADICTION', 1: 'NEUTRAL', 2: 'ENTAILMENT'}
zs_deberta id2label: {0: 'entailment', 1: 'not_entailment'}


- Roberta and deberta have the standard 3 MNLI labels whereas zero shot deberta is binary.

In [183]:
# Add models and tokenizers to dictionary.
models_and_tokenizers = {
        'roberta': (roberta, roberta_tokenizer),
        'deberta': (deberta, deberta_tokenizer),
        'zs_deberta': (zs_deberta, zs_deberta_tokenizer)
        }

In [184]:
# Set device 
USE_MPS = True

if USE_MPS:
    device = torch.device('mps')
    DTYPE = torch.float16
else:
    device = device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    DTYPE = torch.float16 if device.type == "cuda" else torch.float32

for model, tok in models_and_tokenizers.values():
    model.to(device, dtype=DTYPE).eval()

torch.set_grad_enabled(False)

torch.autograd.grad_mode.set_grad_enabled(mode=False)

### **Baseline Functions**

In [185]:
# List of evasive phrases
EVASIVE_PHRASES = [
    r"\btoo early\b",
    r"\bcan't (?:comment|share|discuss)\b",
    r"\bwon't (?:comment|share|provide)\b",
    r"\bno (?:update|comment)\b",
    r"\bwe (?:don't|do not) (?:break out|provide guidance)\b",
    r"\bnot (?:going to|able to) (?:comment|share|provide)\b",
    r"\bwe'll (?:come back|circle back)\b",
    r"\bnot something we disclose\b",
    r"\bas (?:we|I) (?:said|mentioned)\b",
    r"\bgenerally speaking\b",
    r"\bit's premature\b",
    r"\bit's difficult to say\b",
    r"\bI (?:wouldn't|won't) want to (?:speculate|get into)\b",
    r"\bI (?:think|guess|suppose)\b",
    r"\bkind of\b",
    r"\bsort of\b",
    r"\baround\b",
    r"\broughly\b",
    r"\bwe (?:prefer|plan) not to\b",
    r"\bwe're not prepared to\b",
]

# List of words that suggest the answer needs specific financial numbers to properly answer the question.
SPECIFICITY_TRIGGERS = [
    "how much","how many","what is","what are","when","which","where","who","why",
    "range","guidance","margin","capex","opex","revenue","sales","eps","ebitda",
    "timeline","date","target","growth","update","split","dividend","cost","price",
    "units","volumes","gross","net","tax","percentage","utilization","order book"
]

NUMERIC_PATTERN = r"(?:\d+(?:\.\d+)?%|\b\d{1,3}(?:,\d{3})*(?:\.\d+)?\b|£|\$|€)"

In [186]:
# Function to calculate cosine similarity between question and answers.
def cosine_sim(q, a):
    vec = TfidfVectorizer(stop_words='english').fit_transform([q, a]) # converts text to vectors 
    sim = float(cosine_similarity(vec[0], vec[1])[0, 0]) # calculate the cosine similarity between the two vectors

    return sim

In [187]:
# Function to compute baseline evasion score.
def baseline_evasion_score(q, a):
    # 1. Cosine similarity
    sim = cosine_sim(q, a) # calculates cosine similarity using previous function
    sim_component = (1 - sim) * 45 # less similar the answer is, the bigger the contribution to the evasion score, scaled by 45

    # 2. Numerical specificity- Does the question require and answer with financial figures/ a specific answer?
    needs_num = any(t in q.lower() for t in SPECIFICITY_TRIGGERS) # true if the question requires a numeric/ specific answer
    has_num = bool(re.search(NUMERIC_PATTERN, a)) # true if the answer includes a number 
    numeric_component = 25 if needs_num and not has_num else 0 # score of 25 if the question needs a number but the answer doesn't give one

    # 3. Evasive phrases- does the answer contain evasive phrases?
    phrase_hits = sum(len(re.findall(p, a.lower())) for p in EVASIVE_PHRASES) # counts how many times an evasive phrase appears in the answer
    phrase_component = min(3, phrase_hits) * 8 # max of 3 hits counted, each hit = 8 points 

    # Final evasion score.
    score = min(100, sim_component + numeric_component + phrase_component) # adds components together and caps score at 100
    
    return score, sim, phrase_hits, needs_num, has_num

### **LLM Functions**

In [188]:
# Function to build the premise for the model (question + answer).
def build_premise(q, a):
    return f'[QUESTION] {q} [ANSWER] {a}'

In [189]:
def model_max_len(tokenizer, model):
    m = getattr(tokenizer, "model_max_length", None)
    if m is None or m == int(1e30):
        m = getattr(getattr(model, "config", None), "max_position_embeddings", 512)
    return int(m or 512)

def token_len(tokenizer, text):
    return len(tokenizer.encode(text, add_special_tokens=False))

def compute_answer_budget(tokenizer, model, question, hyp_max_tokens, q_cap=128, safety_margin=12):
    max_len = model_max_len(tokenizer, model)            # usually 512
    specials = tokenizer.num_special_tokens_to_add(pair=True)
    q_tokens = min(token_len(tokenizer, question), q_cap)
    budget = max_len - specials - q_tokens - hyp_max_tokens - safety_margin
    return max(32, budget)

def chunk_answer_for_pair(tokenizer, answer, answer_budget, stride_tokens=128):
    """
    Chunk the ANSWER using tokenizer.tokenize (no model max-length checks),
    then stitch back to text with convert_tokens_to_string.
    """
    toks = tokenizer.tokenize(answer)  
    if len(toks) <= answer_budget:
        return [answer]

    chunks, i = [], 0
    while i < len(toks):
        window_tokens = toks[i:i+answer_budget]
        window_text = tokenizer.convert_tokens_to_string(window_tokens)
        chunks.append(window_text)
        if i + answer_budget >= len(toks):
            break
        i += max(1, answer_budget - stride_tokens)
    return chunks

def pair_logits_chunks(model, tokenizer, device, premise, hypothesis, max_length=None, stride=128):
    if max_length is None:
        max_length = model_max_len(tokenizer, model)

    enc = tokenizer(
        premise,
        hypothesis,
        return_tensors='pt',
        truncation='only_first',          # split/truncate Q+A only
        max_length=max_length,
        stride=stride,
        return_overflowing_tokens=True,
        padding='max_length'             
    )

    # keep only keys the model expects
    input_names = set(getattr(tokenizer, "model_input_names",
                              ["input_ids", "attention_mask", "token_type_ids"]))

    def to_batch(enc_dict, i=None):
        batch = {}
        for k, v in enc_dict.items():
            if k in input_names and isinstance(v, torch.Tensor):
                batch[k] = (v[i:i+1] if i is not None else v).to(device)
        return batch

    # single chunk
    if enc["input_ids"].shape[0] == 1:
        batch = to_batch(enc)
        with torch.no_grad():
            logits = model(**batch).logits
        return [logits.squeeze(0)]

    # multiple overflowed chunks
    logits_list = []
    n = enc["input_ids"].shape[0]
    for i in range(n):
        batch = to_batch(enc, i)
        with torch.no_grad():
            out = model(**batch).logits
        logits_list.append(out.squeeze(0))
    return logits_list

def get_label_idx(model, name, default):
    id2label = getattr(model.config, "id2label", {})
    if id2label:
        for k, v in id2label.items():
            if name in str(v).lower():
                return int(k)
    return default

def p_entail_from_logits(logits, model, temperature=1.0):
    nlab = logits.shape[-1]
    ent_i = get_label_idx(model, "entail", 2 if nlab==3 else 1)
    probs = torch.softmax(logits / float(temperature), dim=-1)
    return float(probs[ent_i])

# --- your templates (unchanged) ---
DIRECT_TEMPLATES = [
    "The answer gives a direct and specific response to the question.",
    "The answer addresses the question explicitly and concretely.",
    "The answer responds directly with actionable specifics.",
]
EVASIVE_TEMPLATES = [
    "The answer avoids giving a direct response to the question.",
    "The answer is evasive or deflects without specifics.",
    "The answer sidesteps the question and withholds details.",
]

def llm_evasion_score(question, answer, model, tokenizer, device, temperature=2.0, stride=128):
    max_len = model_max_len(tokenizer, model)
    n_dir, n_eva = len(DIRECT_TEMPLATES), len(EVASIVE_TEMPLATES)

    p_ent_direct_list, p_ent_evasive_list = [], []

    premise = f"Q: {question}\nA: {answer}"

    # Collect P(entailment) for DIRECT hypotheses (over chunks), then mean over templates
    for h in DIRECT_TEMPLATES:
        logits_chunks = pair_logits_chunks(model, tokenizer, device, premise, h, max_length=max_len, stride=stride)
        # For each chunk, compute P(entail); take the max across chunks (recall-friendly)
        pents = [p_entail_from_logits(log, model, temperature) for log in logits_chunks]
        p_ent_direct_list.append(max(pents))

    # Same for EVASIVE hypotheses
    for h in EVASIVE_TEMPLATES:
        logits_chunks = pair_logits_chunks(model, tokenizer, device, premise, h, max_length=max_len, stride=stride)
        pents = [p_entail_from_logits(log, model, temperature) for log in logits_chunks]
        p_ent_evasive_list.append(max(pents))

    # Mean over templates
    p_ent_direct  = float(torch.tensor(p_ent_direct_list).mean())
    p_ent_evasive = float(torch.tensor(p_ent_evasive_list).mean())

    # Neutral-aware normalization (don’t force a 2-class softmax over logits)
    denom = p_ent_evasive + p_ent_direct + 1e-9
    p_evasive = float(p_ent_evasive / denom)
    p_direct  = 1.0 - p_evasive

    return {
        'p_direct': p_direct,
        'p_evasive': p_evasive,
        'p_ent_direct': p_ent_direct,
        'p_ent_evasive': p_ent_evasive
    }

### **Generate Raw Evasion Scores**

- v1 tests three LLM models, an average of these (ensemble) vs a baseline (rule-based) with a blended score of avg + baseline 

In [190]:
# Function to generate raw evasion scores.
def evasion_pipeline(df, models_and_tokenizers, device, LLM_WEIGHT, selected_llm=None):
    rows = []
    for _, row in df.iterrows():
        q, a = str(row['question']), str(row['answer'])

        # baseline (raw 0..100 float)
        base_score, *_ = baseline_evasion_score(q, a)

        rec = {
            'question_number': row.get('question_number'),
            'question': q,
            'answer': a,
            'evasion_score_baseline': float(base_score),
            'label': row['label'],  # keep human label
        }

        if selected_llm is None:
            # ----- v1 mode: compute ALL LLMs + llm_avg + blended -----
            llm_scores = {}
            for name, (m, t) in models_and_tokenizers.items():
                r = llm_evasion_score(q, a, m, t, device)
                s = float(100.0 * r['p_evasive'])
                rec[f'evasion_score_{name}'] = s
                llm_scores[name] = s

            llm_avg = float(np.mean(list(llm_scores.values()))) if llm_scores else 0.0
            rec['evasion_score_llm_avg'] = llm_avg
            rec['evasion_score_blended'] = float(np.clip((1.0 - LLM_WEIGHT)*base_score + LLM_WEIGHT*llm_avg, 0.0, 100.0))

        else:
            # ----- v2 mode: compute SELECTED LLM only + blended -----
            m, t = models_and_tokenizers[selected_llm]
            r = llm_evasion_score(q, a, m, t, device)
            sel = float(100.0 * r['p_evasive'])

            # Keep both the model-specific column and the 'deberta' alias for compatibility
            rec[f'evasion_score_{selected_llm}'] = sel
            rec['evasion_score_deberta'] = sel
            rec['evasion_score_blended'] = float(np.clip((1.0 - LLM_WEIGHT)*base_score + LLM_WEIGHT*sel, 0.0, 100.0))

        rows.append(rec)

    return pd.DataFrame(rows)

### **Fine-Tune Score Thresholds**

In [191]:
# Perform an initial run to get raw evasion scores on validation set. 
LLM_WEIGHT = 0.30

jpm_val_qa_scores = evasion_pipeline(
    jpm_val_qa_labelled, 
    models_and_tokenizers, 
    device, 
    LLM_WEIGHT
    )

In [192]:
# View the results.
jpm_val_qa_scores.head()

Unnamed: 0,question_number,question,answer,evasion_score_baseline,label,evasion_score_roberta,evasion_score_deberta,evasion_score_zs_deberta,evasion_score_llm_avg,evasion_score_blended
0,1.0,Good morning. Thanks for all the comments on t...,"Yeah. Matt, not particularly updating. I think...",52.351637,Direct,45.764036,83.365531,82.812918,70.647495,57.840394
1,4.0,"Okay. And then, just to follow up on the NII, ...","Sure. Yeah, happy to do that, John. So, I thin...",39.634667,Direct,25.137211,54.344919,48.187033,42.556387,40.511183
2,7.0,"And maybe just taking that a step further, the...","Yeah. So, good question on the multi-family, a...",46.315719,Direct,41.942957,73.980192,25.500992,47.14138,46.563417
3,9.0,"Thanks. And just as a follow-up, the $90 billi...","A little bit of that is in there, yeah. So, yo...",41.902543,Direct,18.178387,56.679955,53.427371,42.761904,42.160351
4,15.0,Hey. Good morning. Thanks for taking my questi...,"Yeah, sure. So, as you know, all else equal, t...",37.891113,Direct,42.432212,57.817084,20.090643,40.113313,38.557773


### **Calibrate Validation**

In [193]:
# Function to extract ground truth (1 = Evasive, 0 = Direct)
def extract_y_true(df):
    return (df['label'].astype(str).str.strip().str.lower() == 'evasive').astype(int).values

In [194]:
def fit_calibrator(raw_scores, y, method):
    s = np.asarray(raw_scores, dtype=float) / 100.0

    if method == 'isotonic':
        iso = IsotonicRegression(out_of_bounds='clip')
        iso.fit(s, y)
        return lambda v: iso.predict(np.asarray(v, dtype=float) / 100.0)

    if method == 'sigmoid':
        lr = LogisticRegression(max_iter=1000)
        lr.fit(s.reshape(-1, 1), y)
        return lambda v: lr.predict_proba((np.asarray(v, dtype=float)/ 100.0).reshape(-1, 1))[:, 1]

In [None]:
# Fit calibrators on validation set. 
y_true_val = extract_y_true(jpm_val_qa_scores)
CALIB_METHOD = 'isotonic'

# cal_base   = fit_calibrator(jpm_val_qa_scores['evasion_score_baseline'].values,   y_true_val, method='sigmoid')
cal_llmavg = fit_calibrator(jpm_val_qa_scores['evasion_score_llm_avg'].values,    y_true_val, CALIB_METHOD)
cal_blend  = fit_calibrator(jpm_val_qa_scores['evasion_score_blended'].values,    y_true_val, CALIB_METHOD)

cal_rob = fit_calibrator(jpm_val_qa_scores['evasion_score_roberta'].values,       y_true_val, CALIB_METHOD)
cal_deb = fit_calibrator(jpm_val_qa_scores['evasion_score_deberta'].values,       y_true_val, CALIB_METHOD)
cal_zsd = fit_calibrator(jpm_val_qa_scores['evasion_score_zs_deberta'].values,    y_true_val, CALIB_METHOD)

In [206]:
# Generate calibrated probabilities.
p_base_val = jpm_val_qa_scores['evasion_score_baseline'].values
p_llmavg_val = cal_llmavg(jpm_val_qa_scores['evasion_score_llm_avg'].values)
p_blend_val  = cal_blend(jpm_val_qa_scores['evasion_score_blended'].values)

p_rob_val = cal_rob(jpm_val_qa_scores['evasion_score_roberta'].values)
p_deb_val = cal_deb(jpm_val_qa_scores['evasion_score_deberta'].values)
p_zsd_val = cal_zsd(jpm_val_qa_scores['evasion_score_zs_deberta'].values)

In [207]:
from sklearn.metrics import (
    precision_score, recall_score, f1_score, accuracy_score,
    precision_recall_fscore_support
)

def tune_threshold_calibrated(probs, y, thr_grid=np.linspace(0.05, 0.95, 19)):
    rows = []
    for thr in thr_grid:
        yp = (probs >= float(thr)).astype(int)

        # Overall (Evasive=1)
        precision = precision_score(y, yp, zero_division=0)
        recall    = recall_score(y, yp, zero_division=0)
        f1        = f1_score(y, yp, zero_division=0)
        accuracy  = accuracy_score(y, yp)

        # Per-class: labels=[0,1] => [Direct, Evasive]
        prec_cls, rec_cls, f1_cls, sup_cls = precision_recall_fscore_support(
            y, yp, labels=[0,1], average=None, zero_division=0
        )

        rows.append({
            'threshold': float(thr),                 # prob space 0..1
            'threshold_pct': float(thr * 100.0),    # display like 65.0
            'precision': precision,
            'recall':    recall,
            'f1':        f1,
            'accuracy':  accuracy,
            'f1_macro':  (f1_cls[0] + f1_cls[1]) / 2.0,

            'precision_direct':  prec_cls[0],
            'recall_direct':     rec_cls[0],
            'f1_direct':         f1_cls[0],

            'precision_evasive': prec_cls[1],
            'recall_evasive':    rec_cls[1],
            'f1_evasive':        f1_cls[1],
        })

    return pd.DataFrame(rows)

In [208]:
# Define thr_grid.
thr_grid = np.linspace(0.1, 0.9, 33)

In [209]:
# Tune
# --- Tune thresholds on VAL (calibrated) ---
base_calib_results   = tune_threshold_calibrated(p_base_val,   y_true_val, thr_grid)
llmavg_calib_results = tune_threshold_calibrated(p_llmavg_val, y_true_val, thr_grid)
blend_calib_results  = tune_threshold_calibrated(p_blend_val,  y_true_val, thr_grid)

rob_calib_results = tune_threshold_calibrated(p_rob_val, y_true_val, thr_grid)
deb_calib_results = tune_threshold_calibrated(p_deb_val, y_true_val, thr_grid)
zsd_calib_results = tune_threshold_calibrated(p_zsd_val, y_true_val, thr_grid)

In [210]:
# Rank & take top 5 (pick your sort — here: Evasive F1 then Evasive recall)
rank_keys = ['f1_evasive','recall_evasive']
ascending = [False, False]
cols = [
    'threshold_pct','precision','recall','f1','accuracy','f1_macro',
    'precision_evasive','recall_evasive','f1_evasive','support_evasive',
    'precision_direct','recall_direct','f1_direct','support_direct'
]

baseline_top5_df   = base_calib_results.sort_values(rank_keys, ascending=ascending).head(10)
llm_avg_top5_df    = llmavg_calib_results.sort_values(rank_keys, ascending=ascending).head(5)
blended_top5_df    = blend_calib_results.sort_values(rank_keys, ascending=ascending).head(5)
roberta_top5_df    = rob_calib_results.sort_values(rank_keys, ascending=ascending).head(5)
deberta_top5_df    = deb_calib_results.sort_values(rank_keys, ascending=ascending).head(5)
zs_deberta_top5_df = zsd_calib_results.sort_values(rank_keys, ascending=ascending).head(5)

print("\nBaseline — Top 5\n", display(baseline_top5_df))
print("\nLLM Avg — Top 5\n", display(llm_avg_top5_df))
print("\nBlended — Top 5\n", display(blended_top5_df))
print("\nRoBERTa — Top 5\n", display(roberta_top5_df))
print("\nDeBERTa — Top 5\n", display(deberta_top5_df))
print("\nZS-DeBERTa — Top 5\n", display(zs_deberta_top5_df))

Unnamed: 0,threshold,threshold_pct,precision,recall,f1,accuracy,f1_macro,precision_direct,recall_direct,f1_direct,precision_evasive,recall_evasive,f1_evasive
0,0.1,10.0,0.194444,1.0,0.325581,0.194444,0.162791,0.0,0.0,0.0,0.194444,1.0,0.325581
1,0.125,12.5,0.194444,1.0,0.325581,0.194444,0.162791,0.0,0.0,0.0,0.194444,1.0,0.325581
2,0.15,15.0,0.194444,1.0,0.325581,0.194444,0.162791,0.0,0.0,0.0,0.194444,1.0,0.325581
3,0.175,17.5,0.194444,1.0,0.325581,0.194444,0.162791,0.0,0.0,0.0,0.194444,1.0,0.325581
4,0.2,20.0,0.194444,1.0,0.325581,0.194444,0.162791,0.0,0.0,0.0,0.194444,1.0,0.325581
5,0.225,22.5,0.194444,1.0,0.325581,0.194444,0.162791,0.0,0.0,0.0,0.194444,1.0,0.325581
6,0.25,25.0,0.194444,1.0,0.325581,0.194444,0.162791,0.0,0.0,0.0,0.194444,1.0,0.325581
7,0.275,27.5,0.194444,1.0,0.325581,0.194444,0.162791,0.0,0.0,0.0,0.194444,1.0,0.325581
8,0.3,30.0,0.194444,1.0,0.325581,0.194444,0.162791,0.0,0.0,0.0,0.194444,1.0,0.325581
9,0.325,32.5,0.194444,1.0,0.325581,0.194444,0.162791,0.0,0.0,0.0,0.194444,1.0,0.325581



Baseline — Top 5
 None


Unnamed: 0,threshold,threshold_pct,precision,recall,f1,accuracy,f1_macro,precision_direct,recall_direct,f1_direct,precision_evasive,recall_evasive,f1_evasive
2,0.15,15.0,0.254237,0.714286,0.375,0.537037,0.503676,0.877551,0.494253,0.632353,0.254237,0.714286,0.375
3,0.175,17.5,0.254237,0.714286,0.375,0.537037,0.503676,0.877551,0.494253,0.632353,0.254237,0.714286,0.375
4,0.2,20.0,0.254237,0.714286,0.375,0.537037,0.503676,0.877551,0.494253,0.632353,0.254237,0.714286,0.375
5,0.225,22.5,0.32,0.380952,0.347826,0.722222,0.585678,0.843373,0.804598,0.823529,0.32,0.380952,0.347826
6,0.25,25.0,0.32,0.380952,0.347826,0.722222,0.585678,0.843373,0.804598,0.823529,0.32,0.380952,0.347826



LLM Avg — Top 5
 None


Unnamed: 0,threshold,threshold_pct,precision,recall,f1,accuracy,f1_macro,precision_direct,recall_direct,f1_direct,precision_evasive,recall_evasive,f1_evasive
0,0.1,10.0,0.207921,1.0,0.344262,0.259259,0.246599,1.0,0.08046,0.148936,0.207921,1.0,0.344262
1,0.125,12.5,0.207921,1.0,0.344262,0.259259,0.246599,1.0,0.08046,0.148936,0.207921,1.0,0.344262
2,0.15,15.0,0.207921,1.0,0.344262,0.259259,0.246599,1.0,0.08046,0.148936,0.207921,1.0,0.344262
3,0.175,17.5,0.207921,1.0,0.344262,0.259259,0.246599,1.0,0.08046,0.148936,0.207921,1.0,0.344262
4,0.2,20.0,0.2125,0.809524,0.336634,0.37963,0.377012,0.857143,0.275862,0.417391,0.2125,0.809524,0.336634



Blended — Top 5
 None


Unnamed: 0,threshold,threshold_pct,precision,recall,f1,accuracy,f1_macro,precision_direct,recall_direct,f1_direct,precision_evasive,recall_evasive,f1_evasive
2,0.15,15.0,0.222222,0.857143,0.352941,0.388889,0.386997,0.888889,0.275862,0.421053,0.222222,0.857143,0.352941
3,0.175,17.5,0.222222,0.857143,0.352941,0.388889,0.386997,0.888889,0.275862,0.421053,0.222222,0.857143,0.352941
0,0.1,10.0,0.203883,1.0,0.33871,0.240741,0.223703,1.0,0.057471,0.108696,0.203883,1.0,0.33871
1,0.125,12.5,0.203883,1.0,0.33871,0.240741,0.223703,1.0,0.057471,0.108696,0.203883,1.0,0.33871
4,0.2,20.0,0.315789,0.285714,0.3,0.740741,0.570455,0.831461,0.850575,0.840909,0.315789,0.285714,0.3



RoBERTa — Top 5
 None


Unnamed: 0,threshold,threshold_pct,precision,recall,f1,accuracy,f1_macro,precision_direct,recall_direct,f1_direct,precision_evasive,recall_evasive,f1_evasive
2,0.15,15.0,0.25,0.761905,0.376471,0.509259,0.485945,0.886364,0.448276,0.59542,0.25,0.761905,0.376471
3,0.175,17.5,0.25,0.761905,0.376471,0.509259,0.485945,0.886364,0.448276,0.59542,0.25,0.761905,0.376471
4,0.2,20.0,0.25,0.761905,0.376471,0.509259,0.485945,0.886364,0.448276,0.59542,0.25,0.761905,0.376471
5,0.225,22.5,0.25,0.761905,0.376471,0.509259,0.485945,0.886364,0.448276,0.59542,0.25,0.761905,0.376471
1,0.125,12.5,0.236111,0.809524,0.365591,0.453704,0.442958,0.888889,0.367816,0.520325,0.236111,0.809524,0.365591



DeBERTa — Top 5
 None


Unnamed: 0,threshold,threshold_pct,precision,recall,f1,accuracy,f1_macro,precision_direct,recall_direct,f1_direct,precision_evasive,recall_evasive,f1_evasive
1,0.125,12.5,0.285714,0.666667,0.4,0.611111,0.556164,0.881356,0.597701,0.712329,0.285714,0.666667,0.4
2,0.15,15.0,0.285714,0.666667,0.4,0.611111,0.556164,0.881356,0.597701,0.712329,0.285714,0.666667,0.4
3,0.175,17.5,0.285714,0.666667,0.4,0.611111,0.556164,0.881356,0.597701,0.712329,0.285714,0.666667,0.4
4,0.2,20.0,0.285714,0.666667,0.4,0.611111,0.556164,0.881356,0.597701,0.712329,0.285714,0.666667,0.4
5,0.225,22.5,0.295455,0.619048,0.4,0.638889,0.570861,0.875,0.643678,0.741722,0.295455,0.619048,0.4



ZS-DeBERTa — Top 5
 None


- ZS-deberta (0.20)- recall evasive = 62%, F1 evasive= 0.39 (most balanced and higher precision)
- DeBERTa (0.20) - recall evasive = 76%, F1 evaisive = 0.37 (recall optimised)
- Baseline threshold = 0.45

- Best balanced performance of LLM models was the deberta model with threshold = 60, giving 79% recall, 46% accuracy and F1 = 0.36.
- Use baseline threshold = 0.40 as baseline detector as this gave the highest F1 score across the grid search, giving the most balanced model and so is the fairest comparison. 

In [215]:
SELECTED_LLM = 'deberta'
SELECTED_THR = 0.20
BASE_THR = 0.40
BLENDED_THR = 0.05

### **Re-calibrate blended uisng selected LLM**

In [212]:
# Calibrated VAL probabilities (use your VAL-fitted calibrators)
p_base_val = cal_base(jpm_val_qa_scores['evasion_score_baseline'].values)
p_deb_val  = cal_deb(jpm_val_qa_scores['evasion_score_deberta'].values)
y_val = (jpm_val_qa_scores['label'].astype(str).str.strip().str.lower() == 'evasive').astype(int).values

# Train a simple logistic blend on calibrated probs
blend_lr = LogisticRegression(max_iter=1000)
X_blend_val = np.column_stack([p_base_val, p_deb_val])
blend_lr.fit(X_blend_val, y_val)

# Blended probabilities on VAL + pick best threshold (0..1) using your tuner
p_blend_val = blend_lr.predict_proba(X_blend_val)[:, 1]
blend_deb_results = tune_threshold_calibrated(p_blend_val, y_val)  # uses your existing function
best_thr_blend_sel = float(
    blend_deb_results.sort_values(['f1_evasive','recall_evasive'], ascending=[False, False]).iloc[0]['threshold']
)

# Helper to score TEST later with this learned blend
def blend_sel_prob_from_raw(base_score_0_100, deb_score_0_100):
    b = cal_base(np.asarray(base_score_0_100))
    d = cal_deb(np.asarray(deb_score_0_100))
    X = np.column_stack([b, d])
    return blend_lr.predict_proba(X)[:, 1]

In [214]:
# Show the chosen blended threshold (probability 0..1) and a top-5 table
print("Best blended threshold (prob):", round(best_thr_blend_sel, 4),
      "| percent:", round(100*best_thr_blend_sel, 1))

cols = [
    'threshold_pct','precision','recall','f1','accuracy','f1_macro',
    'precision_evasive','recall_evasive','f1_evasive',
    'precision_direct','recall_direct','f1_direct'
]
blend_top5 = blend_deb_results.sort_values(['f1_evasive','recall_evasive'],
                                           ascending=[False, False]).head(5)[cols]
print("\nBlended (Baseline+DeBERTa) — Top 5 configs on VAL")
print(blend_top5)

Best blended threshold (prob): 0.05 | percent: 5.0

Blended (Baseline+DeBERTa) — Top 5 configs on VAL
   threshold_pct  precision  recall  ...  precision_direct  recall_direct  f1_direct
0            5.0   0.194444     1.0  ...          0.000000            0.0   0.000000
1           10.0   0.194444     1.0  ...          0.000000            0.0   0.000000
2           15.0   0.194444     1.0  ...          0.000000            0.0   0.000000
3           20.0   0.000000     0.0  ...          0.805556            1.0   0.892308
4           25.0   0.000000     0.0  ...          0.805556            1.0   0.892308

[5 rows x 12 columns]


### **Main Pipeline v2**

- This pipeline incorperates the results from the validation threshold tuning and best model performances 
- Updates some of the previous functions.

In [None]:
# Run evasion pipeline v2 with test dataset. 
LLM_WEIGHT= 0.70

jpm_test_qa_scores = evasion_pipeline(
    jpm_test_qa_labelled,
    models_and_tokenizers,
    device,
    LLM_WEIGHT,
    SELECTED_LLM
)

In [216]:
# ====== 5B) Calibrated probabilities on TEST (use VAL-fitted calibrators) ======
p_base_test = jpm_test_qa_scores['evasion_score_baseline'].values
if SELECTED_LLM == 'roberta':
    p_sel_test = cal_rob(jpm_test_qa_scores['evasion_score_deberta'].values)
elif SELECTED_LLM == 'deberta':
    p_sel_test = cal_deb(jpm_test_qa_scores['evasion_score_deberta'].values)
else:
    p_sel_test = cal_zsd(jpm_test_qa_scores['evasion_score_deberta'].values)
p_blend_test = cal_blend(jpm_test_qa_scores['evasion_score_blended'].values)

In [220]:
def evaluate_evasion_scores(df):
    # True labels: 1 = Evasive, 0 = Direct
    y_true = (df['label'].astype(str).str.strip().str.lower() == 'evasive').astype(int).values

    def to_binary(pred_series):
        return (pred_series.astype(str).str.strip().str.lower() == 'evasive').astype(int).values

    y_pred_base  = to_binary(df['prediction_baseline'])
    y_pred_deb   = to_binary(df['prediction_deberta'])   # holds the SELECTED_LLM predictions
    y_pred_blend = to_binary(df['prediction_blended'])

    return {
        'baseline': {
            'classification_report': classification_report(
                y_true, y_pred_base, target_names=["Direct", "Evasive"], digits=3, zero_division=0
            ),
            'confusion_matrix': confusion_matrix(y_true, y_pred_base)
        },
        'deberta': {  # keep key name so downstream code works (even if selected LLM isn't DeBERTa)
            'classification_report': classification_report(
                y_true, y_pred_deb, target_names=["Direct", "Evasive"], digits=3, zero_division=0
            ),
            'confusion_matrix': confusion_matrix(y_true, y_pred_deb)
        },
        'blended': {
            'classification_report': classification_report(
                y_true, y_pred_blend, target_names=["Direct", "Evasive"], digits=3, zero_division=0
            ),
            'confusion_matrix': confusion_matrix(y_true, y_pred_blend)
        }
    }

In [221]:
# ====== 5C) Apply the frozen VAL thresholds (prob space 0..1) ======
jpm_test_qa_scores = jpm_test_qa_scores.copy()
jpm_test_qa_scores['prediction_baseline'] = np.where(p_base_test >= BASE_THR, 'Evasive', 'Direct')
jpm_test_qa_scores['prediction_deberta']  = np.where(p_sel_test  >= SELECTED_THR,  'Evasive', 'Direct')  # 'deberta' col = selected
jpm_test_qa_scores['prediction_blended']  = np.where(p_blend_test>= BLENDED_THR, 'Evasive', 'Direct')

# keep human label already present; evaluate
eval_test = evaluate_evasion_scores(jpm_test_qa_scores)
print("\n=== TEST — Baseline ===\n", eval_test['baseline']['classification_report'])
print("\n=== TEST — {} ===\n".format(SELECTED_LLM), eval_test['deberta']['classification_report'])
print("\n=== TEST — Blended ===\n", eval_test['blended']['classification_report'])


=== TEST — Baseline ===
               precision    recall  f1-score   support

      Direct      0.000     0.000     0.000        86
     Evasive      0.196     1.000     0.328        21

    accuracy                          0.196       107
   macro avg      0.098     0.500     0.164       107
weighted avg      0.039     0.196     0.064       107


=== TEST — deberta ===
               precision    recall  f1-score   support

      Direct      0.848     0.453     0.591        86
     Evasive      0.230     0.667     0.341        21

    accuracy                          0.495       107
   macro avg      0.539     0.560     0.466       107
weighted avg      0.726     0.495     0.542       107


=== TEST — Blended ===
               precision    recall  f1-score   support

      Direct      1.000     0.023     0.045        86
     Evasive      0.200     1.000     0.333        21

    accuracy                          0.215       107
   macro avg      0.600     0.512     0.189       10

In [222]:
def show_confusion_matrices(eval_dict):
    order = ['baseline', 'deberta', 'blended']  # 'deberta' = selected LLM in your pipeline
    for k in order:
        cm = eval_dict[k]['confusion_matrix']   # [[TN, FP],[FN, TP]]
        df_cm = pd.DataFrame(
            cm,
            index=['Direct (true)', 'Evasive (true)'],
            columns=['Direct (pred)', 'Evasive (pred)']
        )
        print(f"\n=== Confusion Matrix — {k} ===")
        print(df_cm)

In [223]:
show_confusion_matrices(eval_test)


=== Confusion Matrix — baseline ===
                Direct (pred)  Evasive (pred)
Direct (true)               0              86
Evasive (true)              0              21

=== Confusion Matrix — deberta ===
                Direct (pred)  Evasive (pred)
Direct (true)              39              47
Evasive (true)              7              14

=== Confusion Matrix — blended ===
                Direct (pred)  Evasive (pred)
Direct (true)               2              84
Evasive (true)              0              21
