# **Summarisation & Evasion Notebook**

# **Handover Notes:** [delete after]
- Library imports and versions are saved in environments/summarisation_evasion_env.txt
- This notebook was originally built for a macbook pro M3 chip so some settings may need to be altered depending on your machine
- All files related/ generated by this notebook can be found in notebooks/summarisation_evasion_files

### **Work progress**
1. **Complete**
- Summarise banker answers using baseline model.
- Use Local RAG pipeline to bring in relevant external documents (PRA risk definitions) to create PRA aligned summaries.
- Developed a evasion detection prototype that generates evasion scores based on bankers answers (uses baseline model, LLM- natural language inference using RoBERTa and a blended score)
- Used jpm_2025 transcripts to get the pipeline working. Validated the evasion pipeline using jpm-23-1q data (involved human labelling the answer as Direct or Evasive- file saved in notebooks/summarisation_evasion_files).

2. **Not complete**
- Visualisations e.g. how many evasive answers were there? etc - apply evasion pipeline to dataset and generate statistics on evasiveness 
- Need to test pipleine on larger data set (e.g. jpm 2023-2025) and check against HSBC to make conclusions & comment on generalisability (answering research question: How does one bank’s tone and thematic profile compare to peers? Are divergences systemic or firm specific?)
- Summarisation pipeline could be improved using a two-stage pipeline: by first extractive summarisation to capture the context and details and then a second model to reframe the summary to be PRA and evasion aligned.
- Post-processing on the output file for the PRA aligned summaries by Mistral model so they are clearer- can this output be fed into another model to extract more insights/ detect evasion or risk?
- Increase the size of the validation set for the evasion pipeline prototype (e.g. more human labelling)
- Need to fine tune the evasion pipeline to increase accuracy
- Optional extensions e.g. using Agents, more complex RAG pipeline (including more useful context for the model), validation of instances of evasion using external news sources)

# 1. **Objectives**

# **2. Set up Workspace**

In [1]:
# Import libraries
# Core python
import os
import numpy as np
import pandas as pd
import re
import json
import pathlib
from pathlib import Path
from typing import List, Dict, Any 
import csv
import math

# NLP & Summarisation
from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification
from llama_cpp import Llama 
import torch
import torch.nn.functional as F

# Retrieval
from sentence_transformers import SentenceTransformer 

# ML
from sklearn.model_selection import train_test_split, GroupShuffleSplit
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

# Visualisations
import matplotlib.pyplot as plt
import seaborn as sns 

# Set SEED.
SEED = 42


  from .autonotebook import tqdm as notebook_tqdm


# **3. Load the dataset**

In [2]:
# Load the dataset.
jpm_2025_df = pd.read_csv('../data/processed/jpm/all_jpm_2025.csv')

# View the data.
jpm_2025_df.head()

Unnamed: 0,question_number,answer_number,speaker_name,role,company,content,year,quarter,source_pdf
0,1,,Ken Usdin,analyst,Autonomous Research,"Good morning, Jeremy. Wondering if you could s...",2025,Q1,data/raw/jpm/jpm-1q25-earnings-call-transcript...
1,1,1.0,Jeremy Barnum,Chief Financial Officer,JPMorganChase,"Sure, Ken. So I mean, at a high level, I would...",2025,Q1,data/raw/jpm/jpm-1q25-earnings-call-transcript...
2,2,,Ken Usdin,analyst,Autonomous Research,Yeah. And just one question on the NII ex. Mar...,2025,Q1,data/raw/jpm/jpm-1q25-earnings-call-transcript...
3,2,1.0,Jeremy Barnum,Chief Financial Officer,JPMorganChase,"Yeah, that's a good question, Ken. You're righ...",2025,Q1,data/raw/jpm/jpm-1q25-earnings-call-transcript...
4,2,2.0,Jamie Dimon,Chairman & Chief Executive Officer,JPMorganChase,In the curve basically.,2025,Q1,data/raw/jpm/jpm-1q25-earnings-call-transcript...


# **4. Preprocessing**

- Used all_jpm_2025.csv dataset
- Preliminary preprocessing to label roles as analyst vs banker (invalid roles were corrected) to make downstream analysis easier. Created a new column 'role_normalised'.

In [3]:
# View speaker roles.
jpm_2025_df['role'].unique()

array(['analyst', 'Chief Financial Officer',
       'Chairman & Chief Executive Officer',
       'And then some. Theres a lot of value added.', 'Okay'],
      dtype=object)

In [4]:
# View rows with invalid roles.
valid_roles = 'analyst', 'Chief Financial Officer', 'Chairman & Chief Executive Officer'
invalid_roles_df = jpm_2025_df[~jpm_2025_df['role'].isin(valid_roles)]

# Number of rows with invalid roles.
print('Number of rows:', invalid_roles_df.shape[0])

# View the rows.
invalid_roles_df.head()

Number of rows: 2


Unnamed: 0,question_number,answer_number,speaker_name,role,company,content,year,quarter,source_pdf
201,35,5.0,"Chief Financial Officer, JPMorganChase",And then some. Theres a lot of value added.,JPMorganChase,"Yeah. And obviously, I mean, we're not going t...",2025,Q2,data/raw/jpm/jpm-2q25-earnings-call-transcript...
205,36,3.0,"Chief Financial Officer, JPMorganChase",Okay,there you have it.,"But it's not like I thought it would do badly,...",2025,Q2,data/raw/jpm/jpm-2q25-earnings-call-transcript...


In [5]:
# Input the correct role information.
jpm_2025_df.at[205, 'role'] = 'Chief Financial Officer'
jpm_2025_df.at[209, 'role'] = 'Chief Financial Officer'

# Verify the roles have been updated.
jpm_2025_df['role'].unique()

array(['analyst', 'Chief Financial Officer',
       'Chairman & Chief Executive Officer',
       'And then some. Theres a lot of value added.'], dtype=object)

In [6]:
# Define role mapping.
role_map = {
    'analyst': 'analyst',
    'Chief Financial Officer': 'banker',
    'Chairman & Chief Executive Officer': 'banker'
}

# Apply to dataset.
jpm_2025_df['role_normalised'] = jpm_2025_df['role'].map(role_map)

In [7]:
# View the dataset.
jpm_2025_df.head()

Unnamed: 0,question_number,answer_number,speaker_name,role,company,content,year,quarter,source_pdf,role_normalised
0,1,,Ken Usdin,analyst,Autonomous Research,"Good morning, Jeremy. Wondering if you could s...",2025,Q1,data/raw/jpm/jpm-1q25-earnings-call-transcript...,analyst
1,1,1.0,Jeremy Barnum,Chief Financial Officer,JPMorganChase,"Sure, Ken. So I mean, at a high level, I would...",2025,Q1,data/raw/jpm/jpm-1q25-earnings-call-transcript...,banker
2,2,,Ken Usdin,analyst,Autonomous Research,Yeah. And just one question on the NII ex. Mar...,2025,Q1,data/raw/jpm/jpm-1q25-earnings-call-transcript...,analyst
3,2,1.0,Jeremy Barnum,Chief Financial Officer,JPMorganChase,"Yeah, that's a good question, Ken. You're righ...",2025,Q1,data/raw/jpm/jpm-1q25-earnings-call-transcript...,banker
4,2,2.0,Jamie Dimon,Chairman & Chief Executive Officer,JPMorganChase,In the curve basically.,2025,Q1,data/raw/jpm/jpm-1q25-earnings-call-transcript...,banker


# **5. Summarisation**

## **5.1 Baseline**

- Initial model exploration using BART and mistral-7B-instruct to summarise banker's answers (no additional context given to model)

### **5.1.1 BART**

In [8]:
# Filter data to banker answers only.
banker_answers = jpm_2025_df[jpm_2025_df['role_normalised'] == 'banker']['content'].tolist()
print(banker_answers[0][:200])

Sure, Ken. So I mean, at a high level, I would say that obviously, some of the salient news flow is quite recent. So, we've done some soundings and some checking both on the consumer side and on the w


In [9]:
# Summarisation baseline (BART)
bart = pipeline('summarization', model='facebook/bart-large-cnn')

sample_text = banker_answers[0]
summary_bart = bart(sample_text, max_length=80, min_length=30, do_sample=False)
print('Original:', sample_text[:400])
print('Summary:', summary_bart[0]['summary_text'])

Device set to use mps:0


Original: Sure, Ken. So I mean, at a high level, I would say that obviously, some of the salient news flow is quite recent. So, we've done some soundings and some checking both on the consumer side and on the wholesale side. I think on the consumer side, the thing to check is the spending data. And to be honest, the main thing that we see there, what would appear to be a certain amount of frontloading of sp
Summary: The main thing that we see there, what would appear to be a certain amount of frontloading of spending ahead of people expecting price increases from tariffs. So ironically, that's actually somewhat supportive, all else equal. In terms of our corporate clients, obviously, they've been reacting to the changes in tariff policy.


- bart was able to extract ket ideas, focussing on fronloading of spending and tariff policy. 
- Compressed the response into two sentences and the summary is coherent, removing filler phrases.
- However, the summary is not fully neutral (e.g. includes ironically) and preserves tone
- Also there is a loss of context- e.g. consumer side vs wholesale side distinction is no longer explicit.

In [10]:
# Prompt conditioning to make PRA relevant.
prompt = "Summarise this answer, focusing on risk, capital and evasion of detail: " + sample_text
summary_bart_prompted = bart(prompt, max_length=80, min_length=30)
print('Original:', sample_text[:400])
print('Summary:', summary_bart_prompted[0]['summary_text'])

Original: Sure, Ken. So I mean, at a high level, I would say that obviously, some of the salient news flow is quite recent. So, we've done some soundings and some checking both on the consumer side and on the wholesale side. I think on the consumer side, the thing to check is the spending data. And to be honest, the main thing that we see there, what would appear to be a certain amount of frontloading of sp
Summary: Corporates are taking a wait-and-see approach to tariff policy. Some sectors are going to be much more exposed than others. Small business and smaller corporates are probably a little more challenged.


- Prompted summary shifts emphasis and includes interpretation around risk, even though those words were no explicit in the original
- This version is more aligned to evasion detection but moves away from concrete detail 
- Improved approach would be to have a two stage-pipeline: first extractive summarisation to capture the context and details and then a second model to reframe the summary to be PRA and evasion aligned.

### **5.1.2 Mistral-7B-Instruct**

- Mistral model: mistral-7b-instruct-v0.1.Q4_K_M.gguf
- Mistral-7B-Instruct model download: https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF?show_file_info=mistral-7b-instruct-v0.1.Q4_K_M.gguf
- Also saved in shared team folder models

In [11]:
# Summarisation baseline (Mistral-7B-Instruct) with basic prompt.
llm = Llama(model_path='/Users/laurenbrixey/Documents/Data Science Career Accelerator/Project Submissions/Course 3/topic_project_4.1/mistral-7b-instruct-v0.1.Q4_K_M.gguf',
            n_ctx=4096, n_gpu_layers=-1, verbose=False, seed=SEED)  # change path as needed 

prompt = f"<s>[INST] Summarise the following answer in 2 sentences, focusing on concrete facts. Avoid opinions.\n\n{sample_text}\n[/INST]"

output = llm.create_chat_completion(
    messages=[{'role': 'user', 'content': prompt}],
    max_tokens=180,
    temperature=0.1,
    stop=['</s>']
)

summary_mistral = output['choices'][0]['message']['content'].strip()  

print('Original:', sample_text[:400])
print('Summary:', summary_mistral)

llama_context: n_ctx_per_seq (4096) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
ggml_metal_init: skipping kernel_get_rows_bf16                     (not supported)
ggml_metal_init: skipping kernel_set_rows_bf16                     (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32                   (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_c4                (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_1row              (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_l4                (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_bf16                  (not supported)
ggml_metal_init: skipping kernel_mul_mv_id_bf16_f32                (not supported)
ggml_metal_init: skipping kernel_mul_mm_bf16_f32                   (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_bf16_f16                (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h64 

Original: Sure, Ken. So I mean, at a high level, I would say that obviously, some of the salient news flow is quite recent. So, we've done some soundings and some checking both on the consumer side and on the wholesale side. I think on the consumer side, the thing to check is the spending data. And to be honest, the main thing that we see there, what would appear to be a certain amount of frontloading of sp
Summary: The speaker is discussing the impact of recent news flow on the consumer and corporate sides. On the consumer side, there has been some frontloading of spending ahead of expected price increases from tariffs, which may distort the data and make it difficult to draw larger conclusions. On the corporate side, clients are reacting to changes in tariff policy by shifting their focus towards short-term work and optimizing supply chains. The speaker characterizes the attitude of corporate clients as a wait-and-see attitude, with smaller clients and smaller corporates being more c

- Preserves details and nuance and is more contextual and interpretive than the BART baseline model.
- However, the result is longer with heavier phrasing and includes phrases like 'distort the data' which is not explicit in the original.

In [12]:
# Summarisation baseline (Mistral-7B-Instruct) with more detailed prompt.
llm = Llama(model_path='/Users/laurenbrixey/Documents/Data Science Career Accelerator/Project Submissions/Course 3/topic_project_4.1/mistral-7b-instruct-v0.1.Q4_K_M.gguf',
            n_ctx=4096, n_gpu_layers=-1, verbose=False, seed=SEED)  # change path as needed 

prompt = f"<s>[INST] Summarise the following answer in 2 sentences, focusing on concrete facts. Avoid opinions. Focus on risk, capital and evasion of detail.\n\n{sample_text}\n[/INST]"

output = llm.create_chat_completion(
    messages=[{'role': 'user', 'content': prompt}],
    max_tokens=180,
    temperature=0.1,
    stop=['</s>']
)

summary_mistral_prompted = output['choices'][0]['message']['content'].strip()  

print('Original:', sample_text[:400])
print('Summary:', summary_mistral_prompted)

llama_context: n_ctx_per_seq (4096) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
ggml_metal_init: skipping kernel_get_rows_bf16                     (not supported)
ggml_metal_init: skipping kernel_set_rows_bf16                     (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32                   (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_c4                (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_1row              (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_l4                (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_bf16                  (not supported)
ggml_metal_init: skipping kernel_mul_mv_id_bf16_f32                (not supported)
ggml_metal_init: skipping kernel_mul_mm_bf16_f32                   (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_bf16_f16                (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h64 

Original: Sure, Ken. So I mean, at a high level, I would say that obviously, some of the salient news flow is quite recent. So, we've done some soundings and some checking both on the consumer side and on the wholesale side. I think on the consumer side, the thing to check is the spending data. And to be honest, the main thing that we see there, what would appear to be a certain amount of frontloading of sp
Summary: The speaker is discussing the impact of recent news flow on the consumer and corporate sides of their business. On the consumer side, they have observed some frontloading of spending ahead of expected price increases from tariffs, which may distort data and make it difficult to draw larger conclusions. On the corporate side, clients are shifting their focus towards optimizing supply chains and responding to the current environment, rather than prioritizing more strategic work. The speaker notes that smaller clients and smaller corporates may be more challenged than larger o

- This summary brings in risk- language and is closer to the task objective.
- However, some interpretations are generated by the model rather than explicitly detailed in the answer

## **5.2 Adding Context**

Retrieve PRA risk categories to give greater PRA focus to summaries (local RAG loop).
- measure cosine similarity between transcript chunks and PRA risk categories (vectors)
- retrieve the top 2-3 most relevant risk categories 
- prepend them to the summarisation prompt to make summaries PRA-aligned instead of just summarised answers

- Attempting to use BART resulted in prompt echoing.
- New attempt using Mistral-7B-Instruct.
- Using sentence-BERT vs TF-IDF for vectorisation.

### **5.2.1 Mistral-7B-Instruct**

**Process**
- Performed some light cleaning of the transcript to remove whitespace.
- Split the transcript into smaller chunks that the model can summarise to avoid truncation
- Loaded the PRA categories csv file (contains category and definition)
- Embedded the PRA categories and chunks, evaluated the similarity to extract the PRA risk categories that were relevant to the text
- Summarised the chunk using detailed prompted and relevant PRA categories as additional context. 

**Output File**:
- The output file of this can be found in notebooks/summarisation_evasion_files, name = jpm_mistral_pra_summary.json
- It is in the format: summary, evidence, PRA category that relates to summary and reasoning for selecting these categories.

- Needed to use a lot of fine tuning for the prompt and set strict rules for the model
- Need to be very clear about the output expected or else the model deviates a lot, especially as it processes more data.
- Include lines about lack of evidence if not the model may hallucinate

In [13]:
# Function to remove whitespace in text.
def clean_text(text: str):
    return re.sub(r'\s+', ' ', text).strip()

In [14]:
# Function to split the transcript into smaller chunks.
def chunk_text(text: str, max_chars: int = 6000):
    sentences = re.split(r'(?<=[.!?])\s+', text.strip()) # split into sentences 
    chunks, current_chunk, current_len = [], [], 0 # list of chunks, sentences collecting for current chunk, character count for current chunk

    for s in sentences:
        if current_len + len(s) + 1 <= max_chars: # if the characters of current chunk + new sentence is below the limit:
            current_chunk.append(s) # add sentence to current chunk 
            current_len += len(s) + 1 # update running character count 
        
        else: # if the characters is above the limit:
            chunks.append(' '.join(current_chunk)) # add the current chunk to the final chunk list
            current_chunk, current_len = [s], len(s) # start a new chunk containing the sentence and update current len

    if current_chunk:
        chunks.append(' '.join(current_chunk)) # add any sentences in current chunk after loop ends 

    return chunks 

In [15]:
# Function to load PRA categories and definitions from CSV.
def load_pra_categories(path: Path):
    with open(path, newline='', encoding='utf-8') as f:
        reader = csv.DictReader(f)
        return [
            (row.get('category', '').strip(), [row.get('definition', '').strip()])
            for row in reader if row.get('category')
        ]

In [16]:
# Build a Sentence-BERT embedding index for PRA categories.
def build_embedding_index(pra_categories):
    embedder = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
    docs = [f"{name} {' '.join(defs)}" for name, defs in pra_categories]
    pra_risk_embeddings = embedder.encode(docs, batch_size=32, normalize_embeddings=True)

    return embedder, np.asarray(pra_risk_embeddings)

In [17]:
# Function to find the relevant PRA categories to the transcript chunks.
def find_rel_categories(chunk, pra_categories, embedder, pra_risk_embeddings, top_k=2):
    query_vec = embedder.encode([chunk], normalize_embeddings=True) # turns chunk into embedding
    sims = cosine_similarity(query_vec, pra_risk_embeddings).ravel() # compares the chunk to each category doc 
    top_indices = np.argsort(-sims)[:top_k] # sorts scores descending and selected top k cateogories 

    return [pra_categories[i] for i in top_indices]

In [18]:
# Function to parse JSON
def parse_tagged_json(raw):
    m = re.search(r"<json>\s*(\{[\s\S]*?\})\s*</json>", raw, flags=re.IGNORECASE)
    if not m:
        return None
    try:
        return json.loads(m.group(1))
    except json.JSONDecodeError:
        return None

In [19]:
# Function to summarise the text chunks.
def summarise_chunk(model, chunk, relevant_categories, max_evidence=5):

    # Build PRA notes (limit to 2 bullets per category)
    lines = []
    for name, definition in relevant_categories:
        lines.append(f'- {name}:')
        for d in list(definition)[:2]:
            lines.append(f'- {d}')
    notes_block = '\n'.join(lines)

    system_prompt = (
        "You are a careful data extraction model. "
        "Return ONLY valid JSON wrapped in <json>...</json> tags."
    )

    user_prompt = f"""
TRANSCRIPT:
{chunk}

PRA NOTES:
{notes_block}

TASK:
Return JSON ONLY, wrapped exactly like this:
<json>{{"summary": "...", "evidence": ["..."], "pra_categories": [{{"category":"...","why":"..."}}]}}</json>

RULES:
- 4-6 sentence neutral summary.
- Up to {max_evidence} evidence bullets (quotes/facts).
- 1-3 pra_categories objects.
- If evidence is lacking, use a single bullet "Insufficient evidence".
- Only choose categories supported by the evidence.
""".strip()

    response = model.create_chat_completion(
        messages=[
            {'role': 'system', 'content': system_prompt},
            {'role': 'user', 'content': user_prompt},
        ],
        temperature=0.2,
        top_p=0.9,
        max_tokens=700,
        repeat_penalty=1.1,
    )

    raw = (response['choices'][0]['message']['content'] or '').strip()

    # Parse the tagged JSON
    parsed = parse_tagged_json(raw)

    # Fallback if model didn’t follow instructions
    if not parsed:
        return (
            {'summary': '', 'evidence': ['Insufficient evidence'], 'pra_categories': []},
            raw,
        )

    # Light coercion to guarantee keys exist
    result = {
        'summary': parsed.get('summary', '') or '',
        'evidence': parsed.get('evidence', []) or [],
        'pra_categories': parsed.get('pra_categories', []) or []
    }
    return result, raw

In [20]:
# Define variables.
MODEL_PATH = '/Users/laurenbrixey/Documents/Data Science Career Accelerator/Project Submissions/Course 3/topic_project_4.1/mistral-7b-instruct-v0.1.Q4_K_M.gguf'
PRA_NOTES_PATH = '../data/RAG-resources/PRA_risk_categories.csv'
TRANSCRIPT_PATH = '../data/processed/jpm/all_jpm_2025.csv'
OUTPUT_PATH = pathlib.Path('../notebooks/summarisation_evasion_files/jpm_mistral_pra_summary_raw.json')
TOP_K = 2

In [21]:
# Runner code.
pra_categories = load_pra_categories(Path(PRA_NOTES_PATH))
embedder, category_embeddings = build_embedding_index(pra_categories)

# Load and chunk transcript
transcript_text = Path(TRANSCRIPT_PATH).read_text(encoding='utf-8')
transcript_chunks = chunk_text(transcript_text)

n_threads = max(4, (os.cpu_count() or 8) - 2)

# Define the model.
model = Llama(
    model_path=str(MODEL_PATH),
    n_ctx=4096,
    n_gpu_layers=20,
    chat_format='mistral-instruct',
    n_threads=n_threads,
)

raw_outputs = []

for i, chunk in enumerate(transcript_chunks, 1):
    try:
        top_categories = find_rel_categories(
            chunk, pra_categories, embedder, category_embeddings, top_k=TOP_K
        )
        _, raw = summarise_chunk(
            model, chunk, top_categories, max_evidence=5
        )
        raw_outputs.append({'chunk': i, 'raw': raw})

    except Exception:
        raw_outputs.append({'chunk': i, 'raw': ''})

final_output = {'raw_outputs': raw_outputs}

OUTPUT_PATH.write_text(json.dumps(final_output, indent=2, ensure_ascii=False), encoding='utf-8')
print(f'Wrote final JSON to: {OUTPUT_PATH.resolve()}')

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
llama_model_load_from_file_impl: using device Metal (Apple M3) - 3559 MiB free
llama_model_loader: loaded meta data with 20 key-value pairs and 291 tensors from /Users/laurenbrixey/Documents/Data Science Career Accelerator/Project Submissions/Course 3/topic_project_4.1/mistral-7b-instruct-v0.1.Q4_K_M.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = mistralai_mistral-7b-instruct-v0.1
llama_model_loader: - kv   2:                       llama.context_len

KeyboardInterrupt: 

- Need to preprocess the output so it is visually clearer (summary, evidence, PRA categories (name & why the model chose this))
- Can this information be fed to the model again and can it detect any early PRA risk indicators?

# **6. Evasion Detection Pipeline**

1. **Baseline Evasion score** (rule-based) is made up of three components:
- **Cosine similarity**- similarity of the question and answer, lower similarity = more evasive
- **Numeric specificity check**- does the question require a number, if so does the answer contain a number?, e.g. requests for financial data
- **Evasive phrases**- does the answer contain evasive phrases?, presence = more evasive

2. **LLM evasion score** (RoBERTa-MNLI) uses entailment/neutral/contradiction between the question and answer
- Lower entailment (and higher neutral + contradiction) = more evasive
  
3. **Blended evasion score** combines both scores including a weight for the LLM component
- Rationale is that baseline enforces precision while the LLM will capture semantics

### **Data Preprocessing**

In [45]:
# Load dataset
all_jpm_2023_2025 = pd.read_csv('../data/processed/jpm/all_jpm_2023_2025.csv')

# View dataset.
display(all_jpm_2023_2025.head())

# Number of rows.
print('Number of rows:', all_jpm_2023_2025.shape[0])

Unnamed: 0,section,question_number,answer_number,speaker_name,role,company,content,year,quarter,is_pleasantry,source_pdf
0,presentation,,,Jeremy Barnum,Chief Financial Officer,JPMorganChase,"Thanks, and good morning, everyone. The presen...",2023,Q1,False,data/raw/jpm/.ipynb_checkpoints/jpm-1q23-earni...
1,qa,,,Steven Chubak,analyst,Wolfe Research LLC,"Hey, good morning.",2023,Q1,True,data/raw/jpm/.ipynb_checkpoints/jpm-1q23-earni...
2,qa,,,Jeremy Barnum,Chief Financial Officer,JPMorgan Chase & Co.,"Good morning, Steve.",2023,Q1,True,data/raw/jpm/.ipynb_checkpoints/jpm-1q23-earni...
3,qa,1.0,,Steven Chubak,analyst,Wolfe Research LLC,"So, Jamie, I was actually hoping to get your p...",2023,Q1,False,data/raw/jpm/.ipynb_checkpoints/jpm-1q23-earni...
4,qa,1.0,1.0,Jamie Dimon,Chairman & Chief Executive Officer,JPMorgan Chase & Co.,"Well, I think you were already kind of complet...",2023,Q1,False,data/raw/jpm/.ipynb_checkpoints/jpm-1q23-earni...


Number of rows: 1411


In [46]:
# Remove pleasantries.
all_jpm_2023_2025_cleaned = all_jpm_2023_2025[all_jpm_2023_2025['is_pleasantry'] == False]
print('Number of rows:', all_jpm_2023_2025_cleaned.shape[0])

Number of rows: 1241


In [47]:
# Check content column.
print('Number of rows with no content:', all_jpm_2023_2025_cleaned['content'].isna().sum())

Number of rows with no content: 23


In [48]:
# Drop rows with no content.
all_jpm_2023_2025_cleaned = all_jpm_2023_2025_cleaned.dropna(subset=['content'])

In [49]:
# Check content column.
print('Number of rows with no content:', all_jpm_2023_2025_cleaned['content'].isna().sum())

Number of rows with no content: 0


In [50]:
# View roles.
all_jpm_2023_2025_cleaned['role'].unique()

array(['Chief Financial Officer', 'analyst',
       'Chairman & Chief Executive Officer',
       'And then some. Theres a lot of value added.', 'Okay',
       "We're fundamentally", 'Thanks', 'Almost no chance.'], dtype=object)

- Some text has leaked into role column.

In [51]:
# View rows with invalid roles. 
valid_roles = 'analyst', 'Chief Financial Officer', 'Chairman & Chief Executive Officer'
invalid_roles_df = all_jpm_2023_2025_cleaned[~all_jpm_2023_2025_cleaned['role'].isin(valid_roles)]
invalid_roles_df.head(10)

Unnamed: 0,section,question_number,answer_number,speaker_name,role,company,content,year,quarter,is_pleasantry,source_pdf
305,qa,22.0,4.0,"Chief Financial Officer, JPMorganChase",And then some. Theres a lot of value added.,JPMorganChase,"Yeah. And obviously, I mean, we're not going t...",2025,Q2,False,data/raw/jpm/.ipynb_checkpoints/jpm-2q25-earni...
309,qa,23.0,3.0,"Chief Financial Officer, JPMorganChase",Okay,there you have it.,"But it's not like I thought it would do badly,...",2025,Q2,False,data/raw/jpm/.ipynb_checkpoints/jpm-2q25-earni...
650,qa,10.0,3.0,Who knows how important politics are in all th...,We're fundamentally,"as I said, I think on the press call, happy to...",little bit cautious about the pull-forward dyn...,2024,Q1,False,data/raw/jpm/jpm-1q24-earnings-call-transcript...
924,qa,8.0,2.0,"Chief Financial Officer, JPMorgan Chase & Co.",Thanks,Glenn.,"Operator: Next, we'll go to the line of Matt O...",2024,Q2,False,data/raw/jpm/jpm-2q24-earnings-call-transcript...
1059,qa,22.0,4.0,"Chief Financial Officer, JPMorganChase",And then some. Theres a lot of value added.,JPMorganChase,"Yeah. And obviously, I mean, we're not going t...",2025,Q2,False,data/raw/jpm/jpm-2q25-earnings-call-transcript...
1063,qa,23.0,3.0,"Chief Financial Officer, JPMorganChase",Okay,there you have it.,"But it's not like I thought it would do badly,...",2025,Q2,False,data/raw/jpm/jpm-2q25-earnings-call-transcript...
1274,qa,23.0,1.0,"Chairman & Chief Executive Officer, JPMorgan C...",Almost no chance.,JPMorganChase,"Well, but having – it's very important. While ...",2024,Q3,False,data/raw/jpm/jpm-3q24-earnings-conference-call...


In [52]:
# Input the correct role information.
all_jpm_2023_2025_cleaned.loc[[305, 309, 924, 1059, 1063], 'role'] = 'Chief Financial Officer'
all_jpm_2023_2025_cleaned.loc[[1274], 'role'] = 'Chairman & Chief Executive Officer'

# Drop nonsence row.
all_jpm_2023_2025_cleaned = all_jpm_2023_2025_cleaned.drop(index=650)

In [53]:
# Check the roles have been updated.
all_jpm_2023_2025_cleaned['role'].unique()

array(['Chief Financial Officer', 'analyst',
       'Chairman & Chief Executive Officer'], dtype=object)

In [54]:
# Normalise role names.
role_map = {
    'analyst': 'analyst',
    'Chief Financial Officer': 'banker',
    'Chairman & Chief Executive Officer': 'banker'
}

# Map roles.
all_jpm_2023_2025_cleaned['role_normalised'] = all_jpm_2023_2025_cleaned['role'].map(role_map)

In [55]:
# View dataset.
display(all_jpm_2023_2025_cleaned.head())
print('Number of rows:', all_jpm_2023_2025_cleaned.shape[0])

Unnamed: 0,section,question_number,answer_number,speaker_name,role,company,content,year,quarter,is_pleasantry,source_pdf,role_normalised
0,presentation,,,Jeremy Barnum,Chief Financial Officer,JPMorganChase,"Thanks, and good morning, everyone. The presen...",2023,Q1,False,data/raw/jpm/.ipynb_checkpoints/jpm-1q23-earni...,banker
3,qa,1.0,,Steven Chubak,analyst,Wolfe Research LLC,"So, Jamie, I was actually hoping to get your p...",2023,Q1,False,data/raw/jpm/.ipynb_checkpoints/jpm-1q23-earni...,analyst
4,qa,1.0,1.0,Jamie Dimon,Chairman & Chief Executive Officer,JPMorgan Chase & Co.,"Well, I think you were already kind of complet...",2023,Q1,False,data/raw/jpm/.ipynb_checkpoints/jpm-1q23-earni...,banker
5,qa,1.0,1.0,Steven Chubak,analyst,Wolfe Research LLC,Got it. And just in terms of appetite for the ...,2023,Q1,False,data/raw/jpm/.ipynb_checkpoints/jpm-1q23-earni...,analyst
6,qa,1.0,2.0,Jamie Dimon,Chairman & Chief Executive Officer,JPMorgan Chase & Co.,"Oh, yeah.",2023,Q1,False,data/raw/jpm/.ipynb_checkpoints/jpm-1q23-earni...,banker


Number of rows: 1217


In [56]:
# Save the cleaned dataset.
all_jpm_2023_2025_cleaned.to_csv('../data/processed/jpm/cleaned/all_jpm_2023_2025_cleaned') 

In [57]:
# Helper function to remove duplicates within questions and answers. 
def clean_repeats(text):
    if not isinstance(text, str):
        return text

    # 1) Normalize whitespace
    t = ' '.join(text.split()).strip()
    if not t:
        return t

    # 2) If the whole-string is a back-to-back duplicate (A+A) = keep first half
    mid = len(t) // 2
    if len(t) % 2 == 0 and t[:mid] == t[mid:]:
        t = t[:mid]

    # 3) Collapse immediate repeated token spans (n-grams)
    toks = t.split()
    out = []
    i = 0
    while i < len(toks):
        matched = False
        max_span = min(50, len(toks) - i)  # cap span to remaining length
        for n in range(max_span, 4, -1):  # try longer spans first: 50..5
            if i + 2*n <= len(toks) and toks[i:i+n] == toks[i+n:i+2*n]:
                out.extend(toks[i:i+n])  # keep one copy
                i += 2*n                # skip the duplicate block
                matched = True
                break
        if not matched:
            out.append(toks[i])
            i += 1
    t = ' '.join(out)

    # 4) Remove duplicate sentences globally (order-preserving)
    sents = re.split(r'(?<=[.!?])\s+', t)
    seen = set()
    uniq = []
    for s in sents:
        s_norm = s.strip()
        if not s_norm:
            continue
        key = ' '.join(s_norm.lower().split())
        if key not in seen:
            seen.add(key)
            uniq.append(s_norm)
    return ' '.join(uniq)

In [58]:
# Function to convert datasets into question and answer pairs.
def create_qa_pairs(df, min_answer_words=30):
    # Keep only the Q&A section.
    qa_df = df[df['section'].astype(str).str.lower() == 'qa'].copy()

    # Split into roles.
    analyst_rows = qa_df[qa_df['role_normalised'] == 'analyst'].copy()
    banker_rows  = qa_df[qa_df['role_normalised'] == 'banker' ].copy()

    # Keys to keep quarters separated
    key_q = ['year', 'quarter', 'question_number']

    # Build full question text per (year, quarter, question_number)
    question_text_map = (
        analyst_rows
        .groupby(key_q, dropna=False)['content']
        .apply(lambda parts: clean_repeats(' '.join(parts.astype(str))))
        .rename('question')
        .reset_index()
    )

    # Ensure bankers have an answer_number — sequential per (year, quarter, question_number) if missing
    if 'answer_number' not in banker_rows.columns or banker_rows['answer_number'].isna().any():
        banker_rows = banker_rows.sort_index().copy()
        banker_rows['answer_number'] = (
            banker_rows
            .groupby(key_q, dropna=False)
            .cumcount() + 1
        )

    # Combine multiple banker utterances belonging to the same answer
    banker_answers = (
        banker_rows
        .groupby(key_q + ['answer_number'], dropna=False)
        .agg({
            'content':        lambda parts: clean_repeats(' '.join(parts.astype(str))),
            'speaker_name':   'first',
            'role':           'first',
            'role_normalised':'first',
            'source_pdf':     'first'
        })
        .rename(columns={'content': 'answer'})
        .reset_index()
    )

    # Merge question text back onto each answer row
    qa_pairs = banker_answers.merge(
        question_text_map,
        on=key_q,
        how='left',
        validate='many_to_one'
    )

    # Order columns for readability
    column_order = [
        'year', 'quarter', 'question_number', 'answer_number',
        'question', 'answer',
        'speaker_name', 'role', 'role_normalised',
        'source_pdf'
    ]
    qa_pairs = qa_pairs.reindex(columns=[c for c in column_order if c in qa_pairs.columns])

    # Sort and reset index.
    qa_pairs = qa_pairs.sort_values(['year', 'quarter', 'question_number', 'answer_number']).reset_index(drop=True)

    # Drop duplicate answers.
    qa_pairs = qa_pairs.drop_duplicates(subset=['answer'])

    # Drop short answers below threshold to ensure quality answers.
    qa_pairs = qa_pairs[qa_pairs['answer'].astype(str).str.split().str.len() >= int(min_answer_words)]

    return qa_pairs

In [59]:
# Create q&A pairs.
all_jpm_2023_2025_qa = create_qa_pairs(all_jpm_2023_2025_cleaned)

In [60]:
# View number of examples.
print('Number of examples:', all_jpm_2023_2025_qa.shape[0])

Number of examples: 309


In [64]:
# Split into prediction set and validation/test set.
jpm_2025_predict_qa = all_jpm_2023_2025_qa[all_jpm_2023_2025_qa['year'] == 2025]
jpm_2023_2024_qa = all_jpm_2023_2025_qa[all_jpm_2023_2025_qa['year'].isin([2023, 2024])]

# Save the datasets.
jpm_2025_predict_qa.to_csv('../data/processed/jpm/cleaned/jpm_2025_predict_qa.csv') 
jpm_2023_2024_qa.to_csv('../data/processed/jpm/cleaned/jpm_2023_2024_qa.csv')  

The jpm_2023_2024_qa dataset was then manually labelled according to whether the banker's answer was deemed 'Direct' or 'Evasive'. The label was appended by a new column 'label'.

In [68]:
# Load the labelled dataset.
jpm_2023_2024_qa_labelled = pd.read_csv('../data/processed/jpm/cleaned/jpm_2023_2024_qa_labelled.csv')

# View the dataset.
jpm_2023_2024_qa_labelled = jpm_2023_2024_qa_labelled.drop('Unnamed: 0', axis=1)
display(jpm_2023_2024_qa_labelled.head())
print('Number of examples:', jpm_2023_2024_qa_labelled.shape[0])

Unnamed: 0,year,quarter,question_number,answer_number,question,answer,speaker_name,role,role_normalised,source_pdf,label
0,2023,Q4,1.0,1.0,Good morning. Thanks for all the comments on t...,"Yeah. Matt, not particularly updating. I think...",Jeremy Barnum,Chief Financial Officer,banker,data/raw/jpm/jpm-4q23-earnings-call-transcript...,Direct
1,2023,Q4,2.0,1.0,"Okay. And then just separately, you bought bac...",Yeah. Good question. And I think you framed it...,Jeremy Barnum,Chief Financial Officer,banker,data/raw/jpm/jpm-4q23-earnings-call-transcript...,Direct
2,2023,Q4,3.0,1.0,"Thanks. Jeremy, could you give a little more c...","Yeah. Actually, John, this quarter, that's all...",Jeremy Barnum,Chief Financial Officer,banker,data/raw/jpm/jpm-4q23-earnings-call-transcript...,Direct
3,2023,Q4,4.0,1.0,"Okay. And then, just to follow up on the NII, ...","Sure. Yeah, happy to do that, John. So, I thin...",Jeremy Barnum,Chief Financial Officer,banker,data/raw/jpm/jpm-4q23-earnings-call-transcript...,Direct
4,2023,Q4,5.0,1.0,Hey. Good morning. Maybe just to follow up in ...,Yeah. Both good questions. So let's do reprice...,Jeremy Barnum,Chief Financial Officer,banker,data/raw/jpm/jpm-4q23-earnings-call-transcript...,Direct


Number of examples: 215


In [102]:
# Function to split the jpm_2023_2024 dataset into test and validation ensuring answers are not leaked.
def val_test_split(df, group_key, label_col='label', test_size=0.5):

    is_evasive = df[label_col].astype(str).str.lower().eq("evasive")
    g = df.assign(_ev=is_evasive.astype(int)).groupby(group_key).agg(
        n=("__dummy__", "size") if "__dummy__" in df.columns else ("_ev", "size"),
        ev=("_ev", "sum")
    )

    # Order groups: evasive-heavy first, then larger groups
    order = g.sort_values(["ev", "n"], ascending=False).index.tolist()

    # Greedy pack groups into two halves balancing evasive counts, then size
    A, B = [], []
    evA = evB = nA = nB = 0
    target_n_each = len(df) * (1 - test_size)

    for grp in order:
        ev, n = int(g.loc[grp, "ev"]), int(g.loc[grp, "n"])
        # choose the side with fewer evasives; on tie, choose the smaller side by n
        if (evA < evB) or (evA == evB and nA <= nB):
            A.append(grp); evA += ev; nA += n
        else:
            B.append(grp); evB += ev; nB += n

    # Build frames: A = validation, B = test (roughly 50/50 by rows)
    val_set  = df[df[group_key].isin(A)].reset_index(drop=True)
    test_set = df[df[group_key].isin(B)].reset_index(drop=True)
    return val_set, test_set

In [103]:
# Make a group key so that all answers for the same question stay in the same set. 
jpm_2023_2024_qa_labelled['group_key'] = (
    jpm_2023_2024_qa_labelled['year'].astype(str) + '_' +
    jpm_2023_2024_qa_labelled['quarter'].astype(str) + '_' +
    jpm_2023_2024_qa_labelled['question_number'].astype(str)
)

In [104]:
# Split into validation and test set.
jpm_val_qa, jpm_test_qa = val_test_split(
    jpm_2023_2024_qa_labelled,
    group_key='group_key',
    label_col='label'
)

print(f'Number of validation examples: {jpm_val_qa.shape[0]} \n{jpm_val_qa["label"].value_counts()}')
print(f'Number of test examples: {jpm_test_qa.shape[0]} \n{jpm_test_qa["label"].value_counts()}')

Number of validation examples: 108 
label
Direct     87
Evasive    21
Name: count, dtype: int64
Number of test examples: 107 
label
Direct     86
Evasive    21
Name: count, dtype: int64


In [105]:
# Save the datasets.
jpm_val_qa.to_csv('../data/processed/jpm/cleaned/jpm_val_qa.csv')
jpm_test_qa.to_csv('../data/processed/jpm/cleaned/jpm_test_qa.csv')

- Human label the validation and test dataset with evasive or direct labels.

### **Load the Clean & Labelled Datasets**

In [2]:
# Load the training and labelled validation & test datasets.
jpm_predict_qa = pd.read_csv('../data/processed/jpm/cleaned/jpm_predict_qa.csv')
jpm_test_qa_labelled = pd.read_csv('../data/processed/jpm/cleaned/jpm_test_qa_labelled.csv')
jpm_val_qa_labelled = pd.read_csv('../data/processed/jpm/cleaned/jpm_val_qa_labelled.csv')

### **Re-make the validation and test datasets**

In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit
from collections import Counter

def _class_props(y):
    c = Counter(y); n = sum(c.values())
    return {k: c[k]/n for k in c}

def _score_stratification(y_overall, y_subset):
    """Lower is better. Sum of absolute diffs in class proportions."""
    p_all = _class_props(y_overall)
    p_sub = _class_props(y_subset)
    keys = set(p_all) | set(p_sub)
    return sum(abs(p_all.get(k,0) - p_sub.get(k,0)) for k in keys)

def stratified_group_shuffle_split(
    df: pd.DataFrame,
    y_col: str,
    group_col: str,
    test_size: float = 0.25,
    min_evasive_test: int = 100,
    evasive_label='Evasive',  # or 1 if your 2-class uses ints
    max_trials: int = 500,
    random_state: int = 42,
):
    """
    Returns new_val_df, new_test_df
    - Group-aware (by group_col)
    - Approx stratified (choose the split that best matches overall class mix)
    - Enforces at least `min_evasive_test` items in test
    """
    df = df.copy()
    assert y_col in df and group_col in df, "Missing y_col or group_col."
    y_all   = df[y_col].values
    groups  = df[group_col].values

    best = None
    best_score = np.inf
    rng = np.random.RandomState(random_state)

    for _ in range(max_trials):
        rs = int(rng.randint(0, 1e9))
        gss = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=rs)
        idx_tr, idx_te = next(gss.split(np.zeros(len(df)), y_all, groups))
        y_te = y_all[idx_te]

        # ensure enough evasives in test
        evasive_count = np.sum(y_te == evasive_label)
        if evasive_count < min_evasive_test:
            continue

        score = _score_stratification(y_all, y_te)
        if score < best_score:
            best_score = score
            best = (idx_tr, idx_te)

    if best is None:
        raise ValueError(
            f"Could not satisfy min_evasive_test={min_evasive_test}. "
            "Increase test_size, lower min_evasive_test, or pool more data."
        )

    idx_tr, idx_te = best
    new_val  = df.iloc[idx_tr].reset_index(drop=True)
    new_test = df.iloc[idx_te].reset_index(drop=True)

    # quick report
    def _counts(d):
        c = Counter(d[y_col]); return dict(sorted(c.items(), key=lambda x: str(x[0])))
    print("=== New Split Summary ===")
    print(f"Total: {len(df)} | Val: {len(new_val)} | Test: {len(new_test)}")
    print("Overall:", _counts(df))
    print("Val:    ", _counts(new_val))
    print("Test:   ", _counts(new_test))
    print("Stratification score (abs diff sum):", best_score)

    return val, test


In [6]:
# Pool validation + test
dev = pd.concat([jpm_val_qa_labelled, jpm_test_qa_labelled], ignore_index=True)

# Build a group key so all answers to the same Q stay together
dev['group_id'] = dev['year'].astype(str) + "_" + dev['quarter'].astype(str) + "_" + dev['question_number'].astype(str)

# Run split
jpm_val_qa_labelled, jpm_test_qa_labelled = stratified_group_shuffle_split(
    dev,
    y_col='label',          # your ground-truth label column
    group_col='group_id',   # <- key!
    test_size=0.30,
    min_evasive_test=12,
    evasive_label='Evasive',
    random_state=42,
    max_trials=2000
)

=== New Split Summary ===
Total: 215 | Val: 143 | Test: 72
Overall: {'Direct': 173, 'Evasive': 42}
Val:     {'Direct': 115, 'Evasive': 28}
Test:    {'Direct': 58, 'Evasive': 14}
Stratification score (abs diff sum): 0.0018087855297157507


### **continue**

In [244]:
# View the labelled datasets.
display(jpm_test_qa_labelled.head())
display(jpm_val_qa_labelled.head())

Unnamed: 0.1,Unnamed: 0,year,quarter,question_number,answer_number,question,answer,speaker_name,role,role_normalised,source_pdf,label,group_id
0,0,2023,Q1,1.0,1.0,"So, Jamie, I was actually hoping to get your p...","Well, I think you were already kind of complet...",Jamie Dimon,Chairman & Chief Executive Officer,banker,data/raw/jpm/.ipynb_checkpoints/jpm-1q23-earni...,Direct,2023_Q1_1.0
1,2,2023,Q1,1.0,3.0,"So, Jamie, I was actually hoping to get your p...","Well, we've told you that we're kind of pencil...",Jamie Dimon,Chairman & Chief Executive Officer,banker,data/raw/jpm/.ipynb_checkpoints/jpm-1q23-earni...,Direct,2023_Q1_1.0
2,10,2023,Q1,7.0,2.0,"So, as you think about all of what you've just...",Okay. Let's take a crack. Let's see what the b...,Jeremy Barnum,Chief Financial Officer,banker,data/raw/jpm/.ipynb_checkpoints/jpm-1q23-earni...,Direct,2023_Q1_7.0
3,19,2023,Q1,13.0,1.0,"Hi, good morning. I guess, maybe one question,...","Yeah, so Ebrahim let me sort of respond narrow...",Jeremy Barnum,Chief Financial Officer,banker,data/raw/jpm/.ipynb_checkpoints/jpm-1q23-earni...,Direct,2023_Q1_13.0
4,21,2023,Q1,13.0,3.0,"Hi, good morning. I guess, maybe one question,...","Yeah. And then in terms of the office space, a...",Jeremy Barnum,Chief Financial Officer,banker,data/raw/jpm/.ipynb_checkpoints/jpm-1q23-earni...,Direct,2023_Q1_13.0


Unnamed: 0.1,Unnamed: 0,year,quarter,question_number,answer_number,question,answer,speaker_name,role,role_normalised,source_pdf,label,group_id
0,4,2023,Q1,2.0,1.0,"Hey, thanks. Good morning. Hey, Jeremy, I was ...","Yeah, sure. So let me just summarize the drive...",Jeremy Barnum,Chief Financial Officer,banker,data/raw/jpm/.ipynb_checkpoints/jpm-1q23-earni...,Direct,2023_Q1_2.0
1,5,2023,Q1,3.0,1.0,"Yeah, and as a follow-up on the point about ra...","Well first of all, I don't quite believe it. S...",Jamie Dimon,Chairman & Chief Executive Officer,banker,data/raw/jpm/.ipynb_checkpoints/jpm-1q23-earni...,Direct,2023_Q1_3.0
2,6,2023,Q1,4.0,1.0,"Hi, thanks. Jeremy, wanted to follow up again ...","Yeah. John, it's a really good question, and w...",Jeremy Barnum,Chief Financial Officer,banker,data/raw/jpm/.ipynb_checkpoints/jpm-1q23-earni...,Direct,2023_Q1_4.0
3,7,2023,Q1,5.0,1.0,Okay. And then I wanted to ask Jamie – there's...,Yeah. I wouldn't use the word credit crunch if...,Jamie Dimon,Chairman & Chief Executive Officer,banker,data/raw/jpm/.ipynb_checkpoints/jpm-1q23-earni...,Direct,2023_Q1_5.0
4,8,2023,Q1,6.0,1.0,Hi. Good morning. My first question is you men...,"Yeah. So, Erika, as you know, we take – not go...",Jeremy Barnum,Chief Financial Officer,banker,data/raw/jpm/.ipynb_checkpoints/jpm-1q23-earni...,Evasive,2023_Q1_6.0


### **LLM Model Set-up**

- Initial testing three LLM models

In [245]:
# Model name checkpoints.
roberta_name = 'roberta-large-mnli'
deberta_name = 'microsoft/deberta-large-mnli'
zs_deberta_name = 'MoritzLaurer/deberta-v3-large-zeroshot-v2.0'

# Load tokenizers and models.
roberta_tokenizer = AutoTokenizer.from_pretrained(roberta_name)
roberta = AutoModelForSequenceClassification.from_pretrained(roberta_name)

deberta_tokenizer = AutoTokenizer.from_pretrained(deberta_name)
deberta = AutoModelForSequenceClassification.from_pretrained(deberta_name)

zs_deberta_tokenizer = AutoTokenizer.from_pretrained(zs_deberta_name)
zs_deberta = AutoModelForSequenceClassification.from_pretrained(zs_deberta_name)

Some weights of the model checkpoint at roberta-large-mnli were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of the model checkpoint at microsoft/deberta-large-mnli were not used when initializing DebertaForSequenceClassification: ['config']
- This IS expected if you are initializing DebertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. ini

In [247]:
# Verify label order per model.
print("roberta id2label:", roberta.config.id2label)
print("deberta id2label:", deberta.config.id2label)
print("zs_deberta id2label:", zs_deberta.config.id2label)

roberta id2label: {0: 'CONTRADICTION', 1: 'NEUTRAL', 2: 'ENTAILMENT'}
deberta id2label: {0: 'CONTRADICTION', 1: 'NEUTRAL', 2: 'ENTAILMENT'}
zs_deberta id2label: {0: 'entailment', 1: 'not_entailment'}


- Roberta and deberta have the standard 3 MNLI labels whereas zero shot deberta is binary.

In [248]:
# Add models and tokenizers to dictionary.
models_and_tokenizers = {
        'roberta': (roberta, roberta_tokenizer),
        'deberta': (deberta, deberta_tokenizer),
        'zs_deberta': (zs_deberta, zs_deberta_tokenizer)
        }

In [249]:
# Set device 
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")  # no MPS
for model, tok in models_and_tokenizers.values():
    model.to(device).eval()

### **Baseline Evasion Score Functions**

In [250]:
# List of evasive phrases
EVASIVE_PHRASES = [
    r"\btoo early\b",
    r"\bcan't (?:comment|share|discuss)\b",
    r"\bwon't (?:comment|share|provide)\b",
    r"\bno (?:update|comment)\b",
    r"\bwe (?:don't|do not) (?:break out|provide guidance)\b",
    r"\bnot (?:going to|able to) (?:comment|share|provide)\b",
    r"\bwe'll (?:come back|circle back)\b",
    r"\bnot something we disclose\b",
    r"\bas (?:we|I) (?:said|mentioned)\b",
    r"\bgenerally speaking\b",
    r"\bit's premature\b",
    r"\bit's difficult to say\b",
    r"\bI (?:wouldn't|won't) want to (?:speculate|get into)\b",
    r"\bI (?:think|guess|suppose)\b",
    r"\bkind of\b",
    r"\bsort of\b",
    r"\baround\b",
    r"\broughly\b",
    r"\bwe (?:prefer|plan) not to\b",
    r"\bwe're not prepared to\b",
]

# List of words that suggest the answer needs specific financial numbers to properly answer the question.
SPECIFICITY_TRIGGERS = [
    "how much","how many","what is","what are","when","which","where","who","why",
    "range","guidance","margin","capex","opex","revenue","sales","eps","ebitda",
    "timeline","date","target","growth","update","split","dividend","cost","price",
    "units","volumes","gross","net","tax","percentage","utilization","order book"
]

NUMERIC_PATTERN = r"(?:\d+(?:\.\d+)?%|\b\d{1,3}(?:,\d{3})*(?:\.\d+)?\b|£|\$|€)"

In [251]:
# Function to calculate cosine similarity between question and answers.
def cosine_sim(q, a):
    vec = TfidfVectorizer(stop_words='english').fit_transform([q, a]) # converts text to vectors 
    sim = float(cosine_similarity(vec[0], vec[1])[0, 0]) # calculate the cosine similarity between the two vectors

    return sim

In [252]:
# Function to compute baseline evasion score.
def baseline_evasion_score(q, a):
    # 1. Cosine similarity
    sim = cosine_sim(q, a) # calculates cosine similarity using previous function
    sim_component = (1 - sim) * 45 # less similar the answer is, the bigger the contribution to the evasion score, scaled by 45

    # 2. Numerical specificity- Does the question require and answer with financial figures/ a specific answer?
    needs_num = any(t in q.lower() for t in SPECIFICITY_TRIGGERS) # true if the question requires a numeric/ specific answer
    has_num = bool(re.search(NUMERIC_PATTERN, a)) # true if the answer includes a number 
    numeric_component = 25 if needs_num and not has_num else 0 # score of 25 if the question needs a number but the answer doesn't give one

    # 3. Evasive phrases- does the answer contain evasive phrases?
    phrase_hits = sum(len(re.findall(p, a.lower())) for p in EVASIVE_PHRASES) # counts how many times an evasive phrase appears in the answer
    phrase_component = min(3, phrase_hits) * 8 # max of 3 hits counted, each hit = 8 points 

    # Final evasion score.
    score = min(100, sim_component + numeric_component + phrase_component) # adds components together and caps score at 100
    
    return score, sim, phrase_hits, needs_num, has_num

### **LLM and Blended Evasion Score Functions**

In [253]:
# Function to build the premise for the model (question + answer).
def build_premise(q, a):
    return f'[QUESTION] {q} [ANSWER] {a}'

In [263]:
def model_max_len(tokenizer, model):
    m = getattr(tokenizer, "model_max_length", None)
    if m is None or m == int(1e30):
        m = getattr(getattr(model, "config", None), "max_position_embeddings", 512)
    return int(m or 512)

def token_len(tokenizer, text):
    return len(tokenizer.encode(text, add_special_tokens=False))

def compute_answer_budget(tokenizer, model, question, hyp_max_tokens, q_cap=128, safety_margin=12):
    max_len = model_max_len(tokenizer, model)            # usually 512
    specials = tokenizer.num_special_tokens_to_add(pair=True)
    q_tokens = min(token_len(tokenizer, question), q_cap)
    budget = max_len - specials - q_tokens - hyp_max_tokens - safety_margin
    return max(32, budget)

def chunk_answer_for_pair(tokenizer, answer, answer_budget, stride_tokens=128):
    """
    Chunk the ANSWER using tokenizer.tokenize (no model max-length checks),
    then stitch back to text with convert_tokens_to_string.
    """
    toks = tokenizer.tokenize(answer)  # <-- avoids the max-length warning
    if len(toks) <= answer_budget:
        return [answer]

    chunks, i = [], 0
    while i < len(toks):
        window_tokens = toks[i:i+answer_budget]
        window_text = tokenizer.convert_tokens_to_string(window_tokens)
        chunks.append(window_text)
        if i + answer_budget >= len(toks):
            break
        i += max(1, answer_budget - stride_tokens)
    return chunks

def pair_logits_chunks(model, tokenizer, device, premise, hypothesis, max_length=None, stride=128):
    if max_length is None:
        max_length = model_max_len(tokenizer, model)

    enc = tokenizer(
        premise,
        hypothesis,
        return_tensors='pt',
        truncation='only_first',          # split/truncate Q+A only
        max_length=max_length,
        stride=stride,
        return_overflowing_tokens=True,
        padding='max_length'              # <-- add this
    )

    # keep only keys the model expects
    input_names = set(getattr(tokenizer, "model_input_names",
                              ["input_ids", "attention_mask", "token_type_ids"]))

    def to_batch(enc_dict, i=None):
        batch = {}
        for k, v in enc_dict.items():
            if k in input_names and isinstance(v, torch.Tensor):
                batch[k] = (v[i:i+1] if i is not None else v).to(device)
        return batch

    # single chunk
    if enc["input_ids"].shape[0] == 1:
        batch = to_batch(enc)
        with torch.no_grad():
            logits = model(**batch).logits
        return [logits.squeeze(0)]

    # multiple overflowed chunks
    logits_list = []
    n = enc["input_ids"].shape[0]
    for i in range(n):
        batch = to_batch(enc, i)
        with torch.no_grad():
            out = model(**batch).logits
        logits_list.append(out.squeeze(0))
    return logits_list

def get_label_idx(model, name, default):
    id2label = getattr(model.config, "id2label", {})
    if id2label:
        for k, v in id2label.items():
            if name in str(v).lower():
                return int(k)
    return default

def p_entail_from_logits(logits, model, temperature=1.0):
    nlab = logits.shape[-1]
    ent_i = get_label_idx(model, "entail", 2 if nlab==3 else 1)
    probs = torch.softmax(logits / float(temperature), dim=-1)
    return float(probs[ent_i])

# --- your templates (unchanged) ---
DIRECT_TEMPLATES = [
    "The answer gives a direct and specific response to the question.",
    "The answer addresses the question explicitly and concretely.",
    "The answer responds directly with actionable specifics.",
]
EVASIVE_TEMPLATES = [
    "The answer avoids giving a direct response to the question.",
    "The answer is evasive or deflects without specifics.",
    "The answer sidesteps the question and withholds details.",
]

def llm_evasion_score(question, answer, model, tokenizer, device, temperature=2.0, stride=128):
    max_len = model_max_len(tokenizer, model)
    n_dir, n_eva = len(DIRECT_TEMPLATES), len(EVASIVE_TEMPLATES)

    p_ent_direct_list, p_ent_evasive_list = [], []

    premise = f"Q: {question}\nA: {answer}"

    # Collect P(entailment) for DIRECT hypotheses (over chunks), then mean over templates
    for h in DIRECT_TEMPLATES:
        logits_chunks = pair_logits_chunks(model, tokenizer, device, premise, h, max_length=max_len, stride=stride)
        # For each chunk, compute P(entail); take the max across chunks (recall-friendly)
        pents = [p_entail_from_logits(log, model, temperature) for log in logits_chunks]
        p_ent_direct_list.append(max(pents))

    # Same for EVASIVE hypotheses
    for h in EVASIVE_TEMPLATES:
        logits_chunks = pair_logits_chunks(model, tokenizer, device, premise, h, max_length=max_len, stride=stride)
        pents = [p_entail_from_logits(log, model, temperature) for log in logits_chunks]
        p_ent_evasive_list.append(max(pents))

    # Mean over templates
    p_ent_direct  = float(torch.tensor(p_ent_direct_list).mean())
    p_ent_evasive = float(torch.tensor(p_ent_evasive_list).mean())

    # Neutral-aware normalization (don’t force a 2-class softmax over logits)
    denom = p_ent_evasive + p_ent_direct + 1e-9
    p_evasive = float(p_ent_evasive / denom)
    p_direct  = 1.0 - p_evasive

    return {
        'p_direct': p_direct,
        'p_evasive': p_evasive,
        'p_ent_direct': p_ent_direct,
        'p_ent_evasive': p_ent_evasive
    }


In [255]:
# Function to compute blended evasion score and return all scores.
def compute_all_evasion_scores(q, a, *, models_and_tokenizers=models_and_tokenizers, device, LLM_WEIGHT=0.30):
    
    # Compute baseline evasion score.
    base_score, _, _, _, _ = baseline_evasion_score(q, a)

    # Individual LLM scores.
    llm_scores = {}
    for name, (m, t) in models_and_tokenizers.items():
        scores = llm_evasion_score(q, a, m, t, device)
        llm_scores[name] = float(100.0 * scores['p_evasive'])

    # Ensemble LLM score.
    llm_avg = float(np.mean(list(llm_scores.values()))) if llm_scores else 0.0

    # Compute blended score.
    blended_score = float(np.clip((1.0 - LLM_WEIGHT) * base_score + LLM_WEIGHT * llm_avg, 0.0, 100.0))

    return {
        'baseline': base_score,
        'llm_individual': llm_scores,
        'llm_avg': llm_avg,
        'blended': blended_score
        }

### **Main Pipeline v1**

- v1 tests three LLM models, an average of these (ensemble) vs a baseline (rule-based) with a blended score of avg + baseline 

In [256]:
# Function to label 'Direct' or 'Evasive' based on the score.
def label_from_score(score, threshold):
    return 'Evasive' if score >= threshold else 'Direct'

In [257]:
# Evasion Pipeline.
def evasion_pipeline(df, models_and_tokenizers, device, LLM_WEIGHT, EVASION_THRESHOLD_BASE, EVASION_THRESHOLD_LLM, EVASION_THRESHOLD_BLENDED):

    records = []

    for _, row in df.iterrows():
        q, a = str(row['question']), str(row['answer'])
        output = compute_all_evasion_scores(q=q, a=a, LLM_WEIGHT=LLM_WEIGHT, models_and_tokenizers=models_and_tokenizers, device=device)

        pred_base = label_from_score(output['baseline'], EVASION_THRESHOLD_BASE)
        pred_llm_avg = label_from_score(output['llm_avg'], EVASION_THRESHOLD_LLM)
        pred_blended = label_from_score(output['blended'], EVASION_THRESHOLD_BLENDED)

        record = {
            'question_number': row.get('question_number'),
            'question': q,
            'answer': a,

            # Evasion Scores
            'evasion_score_baseline': int(output['baseline']),
            'evasion_score_llm_avg': int(output['llm_avg']),
            "evasion_score_blended": int(output['blended']),

            # Predicted labels.
            'prediction_baseline': pred_base,
            'prediction_llm_avg': pred_llm_avg,
            'prediction_blended': pred_blended,
        }

        for model_name, score in output['llm_individual'].items():
            record[f'evasion_score_{model_name}'] = int(score)
            record[f'prediction_{model_name}'] = label_from_score(score, EVASION_THRESHOLD_LLM)

        records.append(record)

    return pd.DataFrame(records)

### **Fine-Tune Score Thresholds**

In [264]:
# Perform an initial run with preliminary threshold values.
LLM_WEIGHT = 0.30
EVASION_THRESHOLD_BASE = 30.0
EVASION_THRESHOLD_LLM = 30.0
EVASION_THRESHOLD_BLENDED = 30.0

jpm_val_qa_scores = evasion_pipeline(
    jpm_val_qa_labelled, 
    models_and_tokenizers, 
    device, 
    LLM_WEIGHT, 
    EVASION_THRESHOLD_BASE, 
    EVASION_THRESHOLD_LLM, 
    EVASION_THRESHOLD_BLENDED
    )

In [265]:
# View the results and reappend the label.
jpm_val_qa_scores['label'] = jpm_val_qa_labelled['label'].values
jpm_val_qa_scores.head()

Unnamed: 0,question_number,question,answer,evasion_score_baseline,evasion_score_llm_avg,evasion_score_blended,prediction_baseline,prediction_llm_avg,prediction_blended,evasion_score_roberta,prediction_roberta,evasion_score_deberta,prediction_deberta,evasion_score_zs_deberta,prediction_zs_deberta,label
0,2.0,"Hey, thanks. Good morning. Hey, Jeremy, I was ...","Yeah, sure. So let me just summarize the drive...",80,40,68,Evasive,Evasive,Evasive,45,Evasive,50,Evasive,25,Direct,Direct
1,3.0,"Yeah, and as a follow-up on the point about ra...","Well first of all, I don't quite believe it. S...",40,64,47,Evasive,Evasive,Evasive,40,Evasive,81,Evasive,70,Evasive,Direct
2,4.0,"Hi, thanks. Jeremy, wanted to follow up again ...","Yeah. John, it's a really good question, and w...",78,67,74,Evasive,Evasive,Evasive,54,Evasive,84,Evasive,63,Evasive,Direct
3,5.0,Okay. And then I wanted to ask Jamie – there's...,Yeah. I wouldn't use the word credit crunch if...,55,67,59,Evasive,Evasive,Evasive,58,Evasive,69,Evasive,74,Evasive,Direct
4,6.0,Hi. Good morning. My first question is you men...,"Yeah. So, Erika, as you know, we take – not go...",44,60,49,Evasive,Evasive,Evasive,36,Evasive,84,Evasive,59,Evasive,Evasive


In [266]:
# Function to extract ground truth (1 = Evasive, 0 = Direct)
def extract_y_true(df):
    return (df['label'].astype(str).str.strip().str.lower() == 'evasive').astype(int).values

In [267]:
# Function calculate metrics for each threshold.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def tune_threshold(df, score_col, thr_grid):
    y_true = extract_y_true(df)                     # get true labels
    scores = df[score_col].astype(float).values     # get raw evasion scores 

    rows = []
    for thr in thr_grid:
        y_pred = (scores >= thr).astype(int) # label response evasive (1) if score is higher than threshold

        precision = precision_score(y_true, y_pred, zero_division=0)
        recall = recall_score(y_true, y_pred, zero_division=0)
        f1 = f1_score(y_true, y_pred, zero_division=0)
        accuracy = accuracy_score(y_true, y_pred)

        rows.append({
            'threshold': float(thr),
            'precision': precision,
            'recall': recall,
            'f1': f1,
            'accuracy': accuracy
        })
    
    results = pd.DataFrame(rows).sort_values(
        by=['f1', 'recall'],
        ascending=[False, False]
        ).reset_index(drop=True)
    
    return results

In [268]:
# Define threshold ranges around current thresholds.
thr_base_grid = np.arange(40, 85, 5)
thr_llm_grid = np.arange(35, 85, 5)
thr_blend_grid = np.arange(40, 85, 5)

In [269]:
# Baseline / blended / avg LLM 
base_results = tune_threshold(jpm_val_qa_scores, 'evasion_score_baseline', thr_base_grid)
llm_avg_results = tune_threshold(jpm_val_qa_scores, 'evasion_score_llm_avg', thr_llm_grid)
blend_results = tune_threshold(jpm_val_qa_scores, 'evasion_score_blended', thr_blend_grid)

# Individual LLM models
roberta_results = tune_threshold(jpm_val_qa_scores, 'evasion_score_roberta', thr_llm_grid)
deberta_results = tune_threshold(jpm_val_qa_scores, 'evasion_score_deberta', thr_llm_grid)
zs_deberta_results = tune_threshold(jpm_val_qa_scores, 'evasion_score_zs_deberta', thr_llm_grid)

In [270]:
# Extract the best thresholds based on recall.
best_base_thr = base_results.loc[0, 'threshold']
best_avg_llm_thr = llm_avg_results.loc[0, 'threshold']
best_blend_thr = blend_results.loc[0, 'threshold']

best_roberta_thr = roberta_results.loc[0, 'threshold']
best_deberta_thr = deberta_results.loc[0, 'threshold']
best_zs_derberta_thr = zs_deberta_results.loc[0, 'threshold']

print('Best Baseline Threshold:', best_base_thr)
print('Best avg LLM Threshold:', best_avg_llm_thr)
print('Best Blended Threshold', best_base_thr)

print('Best roberta Threshold:', best_roberta_thr)
print('Best deberta Threshold:', best_deberta_thr)
print('Best zs deberta Threshold', best_zs_derberta_thr)

Best Baseline Threshold: 40.0
Best avg LLM Threshold: 50.0
Best Blended Threshold 40.0
Best roberta Threshold: 35.0
Best deberta Threshold: 60.0
Best zs deberta Threshold 55.0


In [271]:
# Inspect trade-offs.
print('\nTop 5 baseline configs:\n', base_results.head())
print('\nTop 5 llm configs:\n', llm_avg_results.head())
print('\nTop 5 blended configs:\n', blend_results.head())

print('\nTop 5 roberta configs:\n', roberta_results.head())
print('\nTop 5 deberta configs:\n', deberta_results.head())
print('\nTop 5 zs deberta configs:\n', zs_deberta_results.head())


Top 5 baseline configs:
    threshold  precision    recall        f1  accuracy
0       40.0   0.208955  1.000000  0.345679  0.258741
1       45.0   0.183486  0.714286  0.291971  0.321678
2       65.0   0.209677  0.464286  0.288889  0.552448
3       70.0   0.236842  0.321429  0.272727  0.664336
4       55.0   0.177778  0.571429  0.271186  0.398601

Top 5 llm configs:
    threshold  precision    recall        f1  accuracy
0       50.0   0.211538  0.785714  0.333333  0.384615
1       35.0   0.198529  0.964286  0.329268  0.230769
2       40.0   0.198413  0.892857  0.324675  0.272727
3       55.0   0.215190  0.607143  0.317757  0.489510
4       45.0   0.198198  0.785714  0.316547  0.335664

Top 5 blended configs:
    threshold  precision    recall        f1  accuracy
0       40.0   0.201439  1.000000  0.335329  0.223776
1       65.0   0.250000  0.464286  0.325000  0.622378
2       45.0   0.193548  0.857143  0.315789  0.272727
3       50.0   0.183486  0.714286  0.291971  0.321678
4       55

- Best balanced performance of LLM models was the deberta model with threshold = 60, giving 79% recall, 46% accuracy and F1 = 0.36.
- Use baseline threshold = 0.40 as baseline detector as this gave the highest F1 score across the grid search, giving the most balanced model and so is the fairest comparison. 

### **Main Pipeline v2**

- This pipeline incorperates the results from the validation threshold tuning and best model performances 
- Updates some of the previous functions.

In [415]:
# def compute_all_evasion_scores_v2(q, a, *, models_and_tokenizers, device, LLM_WEIGHT=0.30):
#     # Baseline score
#     base_score, _, _, _, _ = baseline_evasion_score(q, a)

#     # Get RoBERTa from dict.
#     if 'roberta' in models_and_tokenizers:
#         roberta_model, roberta_tok = models_and_tokenizers['roberta']
#     else:
#         candidates = [k for k in models_and_tokenizers.keys() if 'roberta' in k.lower()]
#         if not candidates:
#             raise ValueError("RoBERTa model not found in models_and_tokenizers. Expected a key containing 'roberta'.")
#         roberta_model, roberta_tok = models_and_tokenizers[candidates[0]]

#     # RoBERTa LLM score
#     r_scores = llm_evasion_score(q, a, roberta_model, roberta_tok, device)
#     roberta_score = float(100.0 * r_scores['p_evasive'])

#     # Blended (Baseline <-> RoBERTa)
#     blended_score = float(np.clip((1.0 - LLM_WEIGHT) * base_score + LLM_WEIGHT * roberta_score, 0.0, 100.0))

#     return {
#         'baseline': base_score,
#         'roberta': roberta_score,
#         'blended': blended_score
#     }

In [None]:
# def entailment_logit(model, tokenizer, device, premise, hypothesis):
#     model.eval()
#     enc = tokenizer(premise, hypothesis, return_tensors='pt', truncation=True, max_length=512)
#     enc = {k: v.to(device) for k, v in enc.items()}
#     with torch.no_grad():
#         logits = model(**enc).logits.squeeze(0)
#     n = logits.shape[-1]
#     if n == 3:
#         return float(logits[2])  # index 2 = entailment
#     elif n == 2:
#         return float(logits[1])  # index 1 ~ entailment
#     else:
#         raise ValueError(f"Unexpected num_labels={n}")

In [None]:
# def llm_evasion_score_v2(question, answer, model, tokenizer, device):
#     premise   = f"[QUESTION] {question} [ANSWER] {answer}"
#     H_DIRECT  = "The answer gives a direct and specific response to the question."
#     H_EVASIVE = "The answer avoids giving a direct response to the question."

#     s_direct  = entailment_logit(model, tokenizer, device, premise, H_DIRECT)
#     s_evasive = entailment_logit(model, tokenizer, device, premise, H_EVASIVE)

#     # Pairwise softmax over entailment logits
#     s = torch.tensor([s_direct, s_evasive])
#     p = F.softmax(s, dim=0).tolist()
#     return {'p_direct': p[0], 'p_evasive': p[1]}

In [272]:
def compute_all_evasion_scores_v2(q, a, *, models_and_tokenizers, device, LLM_WEIGHT=0.30):
    base_score, _, _, _, _ = baseline_evasion_score(q, a)

    # pick deberta
    if 'deberta' in models_and_tokenizers:
        deberta_model, deberta_tok = models_and_tokenizers['deberta']
    else:
        k = next(k for k in models_and_tokenizers if 'deberta' in k.lower())
        deberta_model, deberta_tok = models_and_tokenizers[k]

    r = llm_evasion_score(q, a, deberta_model, deberta_tok, device)
    deberta_score = float(100.0 * r['p_evasive'])

    blended_score = float(np.clip((1.0 - LLM_WEIGHT) * base_score + LLM_WEIGHT * deberta_score, 0.0, 100.0))
    return {'baseline': base_score, 'deberta': deberta_score, 'blended': blended_score}

In [273]:
# Evasion pipeline v2 
def evasion_pipeline_v2(df, models_and_tokenizers, device, LLM_WEIGHT, EVASION_THRESHOLD_BASE, EVASION_THRESHOLD_DEBERTA, EVASION_THRESHOLD_BLENDED):
    records = []
    for _, row in df.iterrows():
        q, a = str(row['question']), str(row['answer'])
        output = compute_all_evasion_scores_v2(
            q=q, a=a,
            LLM_WEIGHT=LLM_WEIGHT,
            models_and_tokenizers=models_and_tokenizers,
            device=device
        )

        pred_base    = label_from_score(output['baseline'], EVASION_THRESHOLD_BASE)
        pred_deberta = label_from_score(output['deberta'], EVASION_THRESHOLD_DEBERTA)
        pred_blended = label_from_score(output['blended'], EVASION_THRESHOLD_BLENDED)

        record = {
            'question_number': row.get('question_number'),
            'question': q,
            'answer': a,

            # Scores
            'evasion_score_baseline': int(output['baseline']),
            'evasion_score_deberta': int(output['deberta']),
            'evasion_score_blended': int(output['blended']),

            # Predictions
            'prediction_baseline': pred_base,
            'prediction_deberta': pred_deberta,
            'prediction_blended': pred_blended,
        }
        records.append(record)

    return pd.DataFrame(records)

In [274]:
# View test dataset.
jpm_test_qa_labelled.head()

Unnamed: 0.1,Unnamed: 0,year,quarter,question_number,answer_number,question,answer,speaker_name,role,role_normalised,source_pdf,label,group_id
0,0,2023,Q1,1.0,1.0,"So, Jamie, I was actually hoping to get your p...","Well, I think you were already kind of complet...",Jamie Dimon,Chairman & Chief Executive Officer,banker,data/raw/jpm/.ipynb_checkpoints/jpm-1q23-earni...,Direct,2023_Q1_1.0
1,2,2023,Q1,1.0,3.0,"So, Jamie, I was actually hoping to get your p...","Well, we've told you that we're kind of pencil...",Jamie Dimon,Chairman & Chief Executive Officer,banker,data/raw/jpm/.ipynb_checkpoints/jpm-1q23-earni...,Direct,2023_Q1_1.0
2,10,2023,Q1,7.0,2.0,"So, as you think about all of what you've just...",Okay. Let's take a crack. Let's see what the b...,Jeremy Barnum,Chief Financial Officer,banker,data/raw/jpm/.ipynb_checkpoints/jpm-1q23-earni...,Direct,2023_Q1_7.0
3,19,2023,Q1,13.0,1.0,"Hi, good morning. I guess, maybe one question,...","Yeah, so Ebrahim let me sort of respond narrow...",Jeremy Barnum,Chief Financial Officer,banker,data/raw/jpm/.ipynb_checkpoints/jpm-1q23-earni...,Direct,2023_Q1_13.0
4,21,2023,Q1,13.0,3.0,"Hi, good morning. I guess, maybe one question,...","Yeah. And then in terms of the office space, a...",Jeremy Barnum,Chief Financial Officer,banker,data/raw/jpm/.ipynb_checkpoints/jpm-1q23-earni...,Direct,2023_Q1_13.0


In [275]:
# Run evasion pipeline v2 with test dataset. 
LLM_WEIGHT = 0.70
EVASION_THRESHOLD_BASE = 40.0
EVASION_THRESHOLD_DEBERTA = 60.0
EVASION_THRESHOLD_BLENDED = 40.0

jpm_test_qa_scores = evasion_pipeline_v2(
    jpm_test_qa_labelled,
    models_and_tokenizers,
    device,
    LLM_WEIGHT,
    EVASION_THRESHOLD_BASE,
    EVASION_THRESHOLD_DEBERTA,
    EVASION_THRESHOLD_BLENDED
)

### **Evaluation**

In [276]:
# View results.
jpm_test_qa_scores['label'] = jpm_test_qa_labelled['label'].values
jpm_test_qa_scores.head()

Unnamed: 0,question_number,question,answer,evasion_score_baseline,evasion_score_deberta,evasion_score_blended,prediction_baseline,prediction_deberta,prediction_blended,label
0,1.0,"So, Jamie, I was actually hoping to get your p...","Well, I think you were already kind of complet...",73,58,62,Evasive,Direct,Evasive,Direct
1,1.0,"So, Jamie, I was actually hoping to get your p...","Well, we've told you that we're kind of pencil...",49,62,58,Evasive,Evasive,Evasive,Direct
2,7.0,"So, as you think about all of what you've just...",Okay. Let's take a crack. Let's see what the b...,35,88,72,Direct,Evasive,Evasive,Direct
3,13.0,"Hi, good morning. I guess, maybe one question,...","Yeah, so Ebrahim let me sort of respond narrow...",82,53,62,Evasive,Direct,Evasive,Direct
4,13.0,"Hi, good morning. I guess, maybe one question,...","Yeah. And then in terms of the office space, a...",72,58,62,Evasive,Direct,Evasive,Direct


In [277]:
# Function to evaluate the evasion scores vs true labels.
def evaluate_evasion_scores(df):

    # True labels: 1 = Evasive, 0 = Direct (using 'human_label').
    y_true = (df['label'].astype(str).str.strip().str.lower() == 'evasive').astype(int).values

    # Convert predicted label strings to binary (1 = Evasive, 0 = Direct).
    def to_binary(pred_series):
        return (pred_series.astype(str).str.strip().str.lower() == 'evasive').astype(int).values

    # Convert predicted labels to binary.
    y_pred_base  = to_binary(df['prediction_baseline'])
    y_pred_deberta   = to_binary(df['prediction_deberta'])
    y_pred_blend = to_binary(df['prediction_blended'])

    return {
        'baseline': {
            'classification_report': classification_report(y_true, y_pred_base, target_names=["Direct", "Evasive"], digits=3, zero_division=0),
            'confusion_matrix': confusion_matrix(y_true, y_pred_base)
        },
        'deberta': {
            'classification_report': classification_report(y_true, y_pred_deberta, target_names=["Direct", "Evasive"], digits=3, zero_division=0),
            'confusion_matrix': confusion_matrix(y_true, y_pred_deberta)
        },
        'blended': {
            'classification_report': classification_report(y_true, y_pred_blend, target_names=["Direct", "Evasive"], digits=3, zero_division=0),
            'confusion_matrix': confusion_matrix(y_true, y_pred_blend) 
        }
    }

In [278]:
# Extract results.
eval_dict = evaluate_evasion_scores(jpm_test_qa_scores)
baseline_eval, deberta_eval, blended_eval = eval_dict['baseline'], eval_dict['deberta'], eval_dict['blended']

In [279]:
# View baseline results.
base_cr, base_cm = baseline_eval['classification_report'], baseline_eval['confusion_matrix']

print(base_cr)
print(base_cm)

              precision    recall  f1-score   support

      Direct      1.000     0.155     0.269        58
     Evasive      0.222     1.000     0.364        14

    accuracy                          0.319        72
   macro avg      0.611     0.578     0.316        72
weighted avg      0.849     0.319     0.287        72

[[ 9 49]
 [ 0 14]]


In [280]:
# View deberta results.
deberta_cr, deberta_cm = deberta_eval['classification_report'], deberta_eval['confusion_matrix']

print(deberta_cr)
print(deberta_cm)

              precision    recall  f1-score   support

      Direct      0.857     0.414     0.558        58
     Evasive      0.227     0.714     0.345        14

    accuracy                          0.472        72
   macro avg      0.542     0.564     0.451        72
weighted avg      0.735     0.472     0.517        72

[[24 34]
 [ 4 10]]


In [281]:
# View blended results.
blended_cr, blended_cm = blended_eval['classification_report'], blended_eval['confusion_matrix']

print(blended_cr)
print(blended_cm)

              precision    recall  f1-score   support

      Direct      1.000     0.017     0.034        58
     Evasive      0.197     1.000     0.329        14

    accuracy                          0.208        72
   macro avg      0.599     0.509     0.182        72
weighted avg      0.844     0.208     0.091        72

[[ 1 57]
 [ 0 14]]
