# **Summarisation & Evasion Notebook**

# **Handover Notes:** [delete after]
- Library imports and versions are saved in environments/summarisation_evasion_env.txt
- This notebook was originally built for a macbook pro M3 chip so some settings may need to be altered depending on your machine
- All files related/ generated by this notebook can be found in notebooks/summarisation_evasion_files

### **Work progress**
1. **Complete**
- Summarise banker answers using baseline model.
- Use Local RAG pipeline to bring in relevant external documents (PRA risk definitions) to create PRA aligned summaries.
- Developed a evasion detection prototype that generates evasion scores based on bankers answers (uses baseline model, LLM- natural language inference using RoBERTa and a blended score)
- Used jpm_2025 transcripts to get the pipeline working. Validated the evasion pipeline using jpm-23-1q data (involved human labelling the answer as Direct or Evasive- file saved in notebooks/summarisation_evasion_files).

2. **Not complete**
- Need to test pipleine on larger data set (e.g. jpm 2023-2025) and check against HSBC to make conclusions & comment on generalisability (answering research question: How does one bank’s tone and thematic profile compare to peers? Are divergences systemic or firm specific?)
- Summarisation pipeline could be improved using a two-stage pipeline: by first extractive summarisation to capture the context and details and then a second model to reframe the summary to be PRA and evasion aligned.
- Post-processing on the output file for the PRA aligned summaries by Mistral model so they are clearer- can this output be fed into another model to extract more insights/ detect evasion or risk?
- Increase the size of the validation set for the evasion pipeline prototype (e.g. more human labelling)
- Need to fine tune the evasion pipeline to increase accuracy
- Optional extensions e.g. using Agents, more complex RAG pipeline (including more useful context for the model), validation of instances of evasion using external news sources)

[we can also look at some of this in Week 5]

# 1. **Objectives**

# **2. Set up Workspace**

In [27]:
# Import libraries
# Core python
import os
import numpy as np
import pandas as pd
import re
import json
import pathlib
from pathlib import Path
from typing import List, Dict, Any 
import csv
import math

# NLP & Summarisation
from transformers import pipeline, AutoTokenizer, AutoModelForSeq2SeqLM, AutoModelForSequenceClassification
import nltk
import spacy
from llama_cpp import Llama 
import torch
import torch.nn.functional as F

# Evaluation
from rouge_score import rouge_scorer
import evaluate
from bert_score import score as bertscore 

# Retrieval
from sentence_transformers import SentenceTransformer 
import faiss
import chromadb
import langchain
import llama_index

# ML
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

# Visualisations
import matplotlib.pyplot as plt
import seaborn as sns 

# Set SEED.
SEED = 42


# **3. Load the dataset**

In [28]:
# Load the dataset.
jpm_2025_df = pd.read_csv('../data/processed/jpm/all_jpm_2025.csv')

# View the data.
jpm_2025_df.head()

Unnamed: 0,question_number,answer_number,speaker_name,role,company,content,year,quarter,source_pdf
0,1,,Ken Usdin,analyst,Autonomous Research,"Good morning, Jeremy. Wondering if you could s...",2025,Q1,data/raw/jpm/jpm-1q25-earnings-call-transcript...
1,1,1.0,Jeremy Barnum,Chief Financial Officer,JPMorganChase,"Sure, Ken. So I mean, at a high level, I would...",2025,Q1,data/raw/jpm/jpm-1q25-earnings-call-transcript...
2,2,,Ken Usdin,analyst,Autonomous Research,Yeah. And just one question on the NII ex. Mar...,2025,Q1,data/raw/jpm/jpm-1q25-earnings-call-transcript...
3,2,1.0,Jeremy Barnum,Chief Financial Officer,JPMorganChase,"Yeah, that's a good question, Ken. You're righ...",2025,Q1,data/raw/jpm/jpm-1q25-earnings-call-transcript...
4,2,2.0,Jamie Dimon,Chairman & Chief Executive Officer,JPMorganChase,In the curve basically.,2025,Q1,data/raw/jpm/jpm-1q25-earnings-call-transcript...


# **4. Preprocessing**

- Used all_jpm_2025.csv dataset
- Preliminary preprocessing to label roles as analyst vs banker (invalid roles were corrected) to make downstream analysis easier. Created a new column 'role_normalised'.

In [29]:
# View speaker roles.
jpm_2025_df['role'].unique()

array(['analyst', 'Chief Financial Officer',
       'Chairman & Chief Executive Officer',
       'And then some. Theres a lot of value added.', 'Okay'],
      dtype=object)

In [30]:
# View rows with invalid roles.
valid_roles = 'analyst', 'Chief Financial Officer', 'Chairman & Chief Executive Officer'
invalid_roles_df = jpm_2025_df[~jpm_2025_df['role'].isin(valid_roles)]

# Number of rows with invalid roles.
print('Number of rows:', invalid_roles_df.shape[0])

# View the rows.
invalid_roles_df.head()

Number of rows: 2


Unnamed: 0,question_number,answer_number,speaker_name,role,company,content,year,quarter,source_pdf
201,35,5.0,"Chief Financial Officer, JPMorganChase",And then some. Theres a lot of value added.,JPMorganChase,"Yeah. And obviously, I mean, we're not going t...",2025,Q2,data/raw/jpm/jpm-2q25-earnings-call-transcript...
205,36,3.0,"Chief Financial Officer, JPMorganChase",Okay,there you have it.,"But it's not like I thought it would do badly,...",2025,Q2,data/raw/jpm/jpm-2q25-earnings-call-transcript...


In [31]:
# Input the correct role information.
jpm_2025_df.at[205, 'role'] = 'Chief Financial Officer'
jpm_2025_df.at[209, 'role'] = 'Chief Financial Officer'

# Verify the roles have been updated.
jpm_2025_df['role'].unique()

array(['analyst', 'Chief Financial Officer',
       'Chairman & Chief Executive Officer',
       'And then some. Theres a lot of value added.'], dtype=object)

In [32]:
# Define role mapping.
role_map = {
    'analyst': 'analyst',
    'Chief Financial Officer': 'banker',
    'Chairman & Chief Executive Officer': 'banker'
}

# Apply to dataset.
jpm_2025_df['role_normalised'] = jpm_2025_df['role'].map(role_map)

In [33]:
# View the dataset.
jpm_2025_df.head()

Unnamed: 0,question_number,answer_number,speaker_name,role,company,content,year,quarter,source_pdf,role_normalised
0,1,,Ken Usdin,analyst,Autonomous Research,"Good morning, Jeremy. Wondering if you could s...",2025,Q1,data/raw/jpm/jpm-1q25-earnings-call-transcript...,analyst
1,1,1.0,Jeremy Barnum,Chief Financial Officer,JPMorganChase,"Sure, Ken. So I mean, at a high level, I would...",2025,Q1,data/raw/jpm/jpm-1q25-earnings-call-transcript...,banker
2,2,,Ken Usdin,analyst,Autonomous Research,Yeah. And just one question on the NII ex. Mar...,2025,Q1,data/raw/jpm/jpm-1q25-earnings-call-transcript...,analyst
3,2,1.0,Jeremy Barnum,Chief Financial Officer,JPMorganChase,"Yeah, that's a good question, Ken. You're righ...",2025,Q1,data/raw/jpm/jpm-1q25-earnings-call-transcript...,banker
4,2,2.0,Jamie Dimon,Chairman & Chief Executive Officer,JPMorganChase,In the curve basically.,2025,Q1,data/raw/jpm/jpm-1q25-earnings-call-transcript...,banker


# **5. Summarisation**

## **5.1 Baseline**

- Initial model exploration using BART and mistral-7B-instruct to summarise banker's answers (no additional context given to model)

### **5.1.1 BART**

In [34]:
# Filter data to banker answers only.
banker_answers = jpm_2025_df[jpm_2025_df['role_normalised'] == 'banker']['content'].tolist()
print(banker_answers[0][:200])

Sure, Ken. So I mean, at a high level, I would say that obviously, some of the salient news flow is quite recent. So, we've done some soundings and some checking both on the consumer side and on the w


In [None]:
# Summarisation baseline (BART)
bart = pipeline('summarization', model='facebook/bart-large-cnn')

sample_text = banker_answers[0]
summary_bart = bart(sample_text, max_length=80, min_length=30, do_sample=False)
print('Original:', sample_text[:400])
print('Summary:', summary_bart[0]['summary_text'])

Device set to use mps:0


- bart was able to extract ket ideas, focussing on fronloading of spending and tariff policy. 
- Compressed the response into two sentences and the summary is coherent, removing filler phrases.
- However, the summary is not fully neutral (e.g. includes ironically) and preserves tone
- Also there is a loss of context- e.g. consumer side vs wholesale side distinction is no longer explicit.

In [None]:
# Prompt conditioning to make PRA relevant.
prompt = "Summarise this answer, focusing on risk, capital and evasion of detail: " + sample_text
summary_bart_prompted = bart(prompt, max_length=80, min_length=30)
print('Original:', sample_text[:400])
print('Summary:', summary_bart_prompted[0]['summary_text'])

Original: Sure, Ken. So I mean, at a high level, I would say that obviously, some of the salient news flow is quite recent. So, we've done some soundings and some checking both on the consumer side and on the wholesale side. I think on the consumer side, the thing to check is the spending data. And to be honest, the main thing that we see there, what would appear to be a certain amount of frontloading of sp
Summary: Corporates are taking a wait-and-see approach to tariff policy. Some sectors are going to be much more exposed than others. Small business and smaller corporates are probably a little more challenged.


- Prompted summary shifts emphasis and includes interpretation around risk, even though those words were no explicit in the original
- This version is more aligned to evasion detection but moves away from concrete detail 
- Improved approach would be to have a two stage-pipeline: first extractive summarisation to capture the context and details and then a second model to reframe the summary to be PRA and evasion aligned.

### **5.1.2 Mistral-7B-Instruct**

- Mistral model: mistral-7b-instruct-v0.1.Q4_K_M.gguf
- Mistral-7B-Instruct model download: https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF?show_file_info=mistral-7b-instruct-v0.1.Q4_K_M.gguf
- Also saved in shared team folder models

In [None]:
# Summarisation baseline (Mistral-7B-Instruct) with basic prompt.
llm = Llama(model_path='/Users/laurenbrixey/Documents/Data Science Career Accelerator/Project Submissions/Course 3/topic_project_4.1/mistral-7b-instruct-v0.1.Q4_K_M.gguf',
            n_ctx=4096, n_gpu_layers=-1, verbose=False, seed=SEED)  # change path as needed 

prompt = f"<s>[INST] Summarise the following answer in 2 sentences, focusing on concrete facts. Avoid opinions.\n\n{sample_text}\n[/INST]"

output = llm.create_chat_completion(
    messages=[{'role': 'user', 'content': prompt}],
    max_tokens=180,
    temperature=0.1,
    stop=['</s>']
)

summary_mistral = output["choices"][0]['message']['content'].strip()  

print('Original:', sample_text[:400])
print('Summary:', summary_mistral)

llama_context: n_ctx_per_seq (4096) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
ggml_metal_init: skipping kernel_get_rows_bf16                     (not supported)
ggml_metal_init: skipping kernel_set_rows_bf16                     (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32                   (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_c4                (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_1row              (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_l4                (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_bf16                  (not supported)
ggml_metal_init: skipping kernel_mul_mv_id_bf16_f32                (not supported)
ggml_metal_init: skipping kernel_mul_mm_bf16_f32                   (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_bf16_f16                (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h64 

Original: Sure, Ken. So I mean, at a high level, I would say that obviously, some of the salient news flow is quite recent. So, we've done some soundings and some checking both on the consumer side and on the wholesale side. I think on the consumer side, the thing to check is the spending data. And to be honest, the main thing that we see there, what would appear to be a certain amount of frontloading of sp
Summary: The speaker is discussing the impact of recent news flow on the consumer and corporate sides. On the consumer side, there has been some frontloading of spending ahead of expected price increases from tariffs, which may distort the data and make it difficult to draw larger conclusions. On the corporate side, clients are reacting to changes in tariff policy by shifting their focus towards short-term work and optimizing supply chains. The speaker characterizes the attitude of corporate clients as a wait-and-see attitude, with smaller clients and smaller corporates being more c

- Preserves details and nuance and is more contextual and interpretive than the BART baseline model.
- However, the result is longer with heavier phrasing and includes phrases like 'distort the data' which is not explicit in the original.

In [None]:
# Summarisation baseline (Mistral-7B-Instruct) with more detailed prompt.
llm = Llama(model_path='/Users/laurenbrixey/Documents/Data Science Career Accelerator/Project Submissions/Course 3/topic_project_4.1/mistral-7b-instruct-v0.1.Q4_K_M.gguf',
            n_ctx=4096, n_gpu_layers=-1, verbose=False, seed=SEED)  # change path as needed 

prompt = f"<s>[INST] Summarise the following answer in 2 sentences, focusing on concrete facts. Avoid opinions. Focus on risk, capital and evasion of detail.\n\n{sample_text}\n[/INST]"

output = llm.create_chat_completion(
    messages=[{'role': 'user', 'content': prompt}],
    max_tokens=180,
    temperature=0.1,
    stop=['</s>']
)

summary_mistral_prompted = output["choices"][0]['message']['content'].strip()  

print('Original:', sample_text[:400])
print('Summary:', summary_mistral_prompted)

llama_context: n_ctx_per_seq (4096) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
ggml_metal_init: skipping kernel_get_rows_bf16                     (not supported)
ggml_metal_init: skipping kernel_set_rows_bf16                     (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32                   (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_c4                (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_1row              (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_l4                (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_bf16                  (not supported)
ggml_metal_init: skipping kernel_mul_mv_id_bf16_f32                (not supported)
ggml_metal_init: skipping kernel_mul_mm_bf16_f32                   (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_bf16_f16                (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h64 

Original: Sure, Ken. So I mean, at a high level, I would say that obviously, some of the salient news flow is quite recent. So, we've done some soundings and some checking both on the consumer side and on the wholesale side. I think on the consumer side, the thing to check is the spending data. And to be honest, the main thing that we see there, what would appear to be a certain amount of frontloading of sp
Summary: The speaker is discussing the impact of recent news flow on the consumer and corporate sides of their business. On the consumer side, they have observed some frontloading of spending ahead of expected price increases from tariffs, which may distort data and make it difficult to draw larger conclusions. On the corporate side, clients are shifting their focus towards optimizing supply chains and responding to the current environment, rather than prioritizing more strategic work. The speaker notes that smaller clients and smaller corporates may be more challenged than larger o

- This summary brings in risk- language and is closer to the task objective.
- However, some interpretations are generated by the model rather than explicitly detailed in the answer

## **5.2 Adding Context**

Retrieve PRA risk categories to give greater PRA focus to summaries (local RAG loop).
- measure cosine similarity between transcript chunks and PRA risk categories (vectors)
- retrieve the top 2-3 most relevant risk categories 
- prepend them to the summarisation prompt to make summaries PRA-aligned instead of just summarised answers

- Attempting to use BART resulted in prompt echoing.
- New attempt using Mistral-7B-Instruct.
- Using sentence-BERT vs TF-IDF for vectorisation.

### **5.2.1 Mistral-7B-Instruct**

**Process**
- Performed some light cleaning of the transcript to remove whitespace.
- Split the transcript into smaller chunks that the model can summarise to avoid truncation
- Loaded the PRA categories csv file (contains category and definition)
- Embedded the PRA categories and chunks, evaluated the similarity to extract the PRA risk categories that were relevant to the text
- Summarised the chunk using detailed prompted and relevant PRA categories as additional context. 

**Output File**:
- The output file of this can be found in notebooks/summarisation_evasion_files, name = jpm_mistral_pra_summary.json
- It is in the format: summary, evidence, PRA category that relates to summary and reasoning for selecting these categories.

- Needed to use a lot of fine tuning for the prompt and set strict rules for the model
- Need to be very clear about the output expected or else the model deviates a lot, especially as it processes more data.
- Include lines about lack of evidence if not the model may hallucinate

In [None]:
# Function to remove whitespace in text.
def clean_text(text: str):
    return re.sub(r'\s+', ' ', text).strip()

In [None]:
# Function to split the transcript into smaller chunks.
def chunk_text(text: str, max_chars: int = 6000):
    sentences = re.split(r'(?<=[.!?])\s+', text.strip()) # split into sentences 
    chunks, current_chunk, current_len = [], [], 0 # list of chunks, sentences collecting for current chunk, character count for current chunk

    for s in sentences:
        if current_len + len(s) + 1 <= max_chars: # if the characters of current chunk + new sentence is below the limit:
            current_chunk.append(s) # add sentence to current chunk 
            current_len += len(s) + 1 # update running character count 
        
        else: # if the characters is above the limit:
            chunks.append(' '.join(current_chunk)) # add the current chunk to the final chunk list
            current_chunk, current_len = [s], len(s) # start a new chunk containing the sentence and update current len

    if current_chunk:
        chunks.append(' '.join(current_chunk)) # add any sentences in current chunk after loop ends 

    return chunks 

In [None]:
# Function to load PRA categories and definitions from CSV.
def load_pra_categories(path: Path):
    with open(path, newline='', encoding='utf-8') as f:
        reader = csv.DictReader(f)
        return [
            (row.get('category', '').strip(), [row.get('definition', '').strip()])
            for row in reader if row.get('category')
        ]

In [None]:
# Build a Sentence-BERT embedding index for PRA categories.
def build_embedding_index(pra_categories):
    embedder = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
    docs = [f"{name} {' '.join(defs)}" for name, defs in pra_categories]
    pra_risk_embeddings = embedder.encode(docs, batch_size=32, normalize_embeddings=True)

    return embedder, np.asarray(pra_risk_embeddings)

In [None]:
# Function to find the relevant PRA categories to the transcript chunks.
def find_rel_categories(chunk, pra_categories, embedder, pra_risk_embeddings, top_k=2):
    query_vec = embedder.encode([chunk], normalize_embeddings=True) # turns chunk into embedding
    sims = cosine_similarity(query_vec, pra_risk_embeddings).ravel() # compares the chunk to each category doc 
    top_indices = np.argsort(-sims)[:top_k] # sorts scores descending and selected top k cateogories 

    return [pra_categories[i] for i in top_indices]

In [None]:
# Function to parse JSON
def parse_tagged_json(raw):
    m = re.search(r"<json>\s*(\{[\s\S]*?\})\s*</json>", raw, flags=re.IGNORECASE)
    if not m:
        return None
    try:
        return json.loads(m.group(1))
    except json.JSONDecodeError:
        return None

In [None]:
# Function to summarise the text chunks.
def summarise_chunk(model, chunk, relevant_categories, max_evidence=5):

    # Build PRA notes (limit to 2 bullets per category)
    lines = []
    for name, definition in relevant_categories:
        lines.append(f"- {name}:")
        for d in list(definition)[:2]:
            lines.append(f"- {d}")
    notes_block = "\n".join(lines)

    system_prompt = (
        "You are a careful data extraction model. "
        "Return ONLY valid JSON wrapped in <json>...</json> tags."
    )

    user_prompt = f"""
TRANSCRIPT:
{chunk}

PRA NOTES:
{notes_block}

TASK:
Return JSON ONLY, wrapped exactly like this:
<json>{{"summary": "...", "evidence": ["..."], "pra_categories": [{{"category":"...","why":"..."}}]}}</json>

RULES:
- 4-6 sentence neutral summary.
- Up to {max_evidence} evidence bullets (quotes/facts).
- 1-3 pra_categories objects.
- If evidence is lacking, use a single bullet "Insufficient evidence".
- Only choose categories supported by the evidence.
""".strip()

    response = model.create_chat_completion(
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        temperature=0.2,
        top_p=0.9,
        max_tokens=700,
        repeat_penalty=1.1,
    )

    raw = (response["choices"][0]["message"]["content"] or "").strip()

    # Parse the tagged JSON
    parsed = parse_tagged_json(raw)

    # Fallback if model didn’t follow instructions
    if not parsed:
        return (
            {"summary": "", "evidence": ["Insufficient evidence"], "pra_categories": []},
            raw,
        )

    # Light coercion to guarantee keys exist
    result = {
        "summary": parsed.get("summary", "") or "",
        "evidence": parsed.get("evidence", []) or [],
        "pra_categories": parsed.get("pra_categories", []) or []
    }
    return result, raw

In [None]:
# Define variables.
MODEL_PATH = '/Users/laurenbrixey/Documents/Data Science Career Accelerator/Project Submissions/Course 3/topic_project_4.1/mistral-7b-instruct-v0.1.Q4_K_M.gguf'
PRA_NOTES_PATH = '../data/RAG-resources/PRA_risk_categories.csv'
TRANSCRIPT_PATH = '../data/processed/jpm/all_jpm_2025.csv'
OUTPUT_PATH = pathlib.Path("jpm_mistral_pra_summary_raw.json")
TOP_K = 2

In [None]:
# Runner code.
pra_categories = load_pra_categories(Path(PRA_NOTES_PATH))
embedder, category_embeddings = build_embedding_index(pra_categories)

# Load and chunk transcript
transcript_text = Path(TRANSCRIPT_PATH).read_text(encoding="utf-8")
transcript_chunks = chunk_text(transcript_text)

n_threads = max(4, (os.cpu_count() or 8) - 2)

# Define the model.
model = Llama(
    model_path=str(MODEL_PATH),
    n_ctx=4096,
    n_gpu_layers=20,
    chat_format="mistral-instruct",
    n_threads=n_threads,
)

raw_outputs = []

for i, chunk in enumerate(transcript_chunks, 1):
    try:
        top_categories = find_rel_categories(
            chunk, pra_categories, embedder, category_embeddings, top_k=TOP_K
        )
        _, raw = summarise_chunk(
            model, chunk, top_categories, i, max_evidence=5
        )
        raw_outputs.append({"chunk": i, "raw": raw})

    except Exception:
        raw_outputs.append({"chunk": i, "raw": ""})

final_output = {"raw_outputs": raw_outputs}

OUTPUT_PATH.write_text(json.dumps(final_output, indent=2, ensure_ascii=False), encoding="utf-8")
print(f"Wrote final JSON to: {OUTPUT_PATH.resolve()}")

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
llama_model_load_from_file_impl: using device Metal (Apple M3) - 8456 MiB free
llama_model_loader: loaded meta data with 20 key-value pairs and 291 tensors from /Users/laurenbrixey/Documents/Data Science Career Accelerator/Project Submissions/Course 3/topic_project_4.1/mistral-7b-instruct-v0.1.Q4_K_M.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = mistralai_mistral-7b-instruct-v0.1
llama_model_loader: - kv   2:                       llama.context_len

Wrote final JSON to: /Users/laurenbrixey/Documents/GitHub Repositories/cam_ds_ep_FinSight/notebooks/jpm_mistral_pra_summary.json


- Need to preprocess the output so it is visually clearer (summary, evidence, PRA categories (name & why the model chose this))
- Can this information be fed to the model again and can it detect any early PRA risk indicators?

# **6. Evasion Scoring**

- Use LLM to summarise answer and then tag with an evasion score.
- Detect evasiveness of bankers in relation to analyst questions and give an evasiveness score.

## **6.1 Preprocessing**

In [None]:
# View data.
jpm_2025_df.head()

Unnamed: 0,question_number,answer_number,speaker_name,role,company,content,year,quarter,source_pdf,role_normalised
0,1,,Ken Usdin,analyst,Autonomous Research,"Good morning, Jeremy. Wondering if you could s...",2025,Q1,data/raw/jpm/jpm-1q25-earnings-call-transcript...,analyst
1,1,1.0,Jeremy Barnum,Chief Financial Officer,JPMorganChase,"Sure, Ken. So I mean, at a high level, I would...",2025,Q1,data/raw/jpm/jpm-1q25-earnings-call-transcript...,banker
2,2,,Ken Usdin,analyst,Autonomous Research,Yeah. And just one question on the NII ex. Mar...,2025,Q1,data/raw/jpm/jpm-1q25-earnings-call-transcript...,analyst
3,2,1.0,Jeremy Barnum,Chief Financial Officer,JPMorganChase,"Yeah, that's a good question, Ken. You're righ...",2025,Q1,data/raw/jpm/jpm-1q25-earnings-call-transcript...,banker
4,2,2.0,Jamie Dimon,Chairman & Chief Executive Officer,JPMorganChase,In the curve basically.,2025,Q1,data/raw/jpm/jpm-1q25-earnings-call-transcript...,banker


In [None]:
# Pair each analyst question with all the banker's answers.
def create_qa_pairs(df):
    questions = df[df['role_normalised'] == 'analyst']
    answers = df[df['role_normalised'] == 'banker']

    qa_pairs = []

    for q_num, q_row in questions.groupby('question_number'):
        q_text = ' '.join(q_row['content'].astype(str))
        a_rows = answers[answers['question_number'] == q_num]
        if not a_rows.empty:
            a_text = ' '.join(a_rows['content'].astype(str))
            qa_pairs.append({
                'question_number': q_num, 
                'question': q_text,
                'answer': a_text
            })

    return pd.DataFrame(qa_pairs)

In [None]:
# Create qa pairs.
jpm_2025_qa_pairs_df = create_qa_pairs(jpm_2025_df)

# View the results.
jpm_2025_qa_pairs_df.head()

Unnamed: 0,question_number,question,answer
0,1,"Good morning, Jeremy. Wondering if you could s...","Sure, Ken. So I mean, at a high level, I would..."
1,2,Yeah. And just one question on the NII ex. Mar...,"Yeah, that's a good question, Ken. You're righ..."
2,3,Yes. Good morning. This question is for Jamie....,"I just – before Jamie answers that, Erika, I j..."
3,4,Got it. And a second follow-up question. And I...,"Yeah, Erika, it's a good question. But the tru..."
4,5,Thank you. Operator: Thank you. Our next quest...,"Thanks, Erika. Operator: I apologize. Our next..."


## **6.2 Evasion Detection (prototype)**

1. **Baseline Evasion score** (rule-based) is made up of three components:
- **Cosine similarity**- similarity of the question and answer, lower similarity = more evasive
- **Numeric specificity check**- does the question require a number, if so does the answer contain a number?, e.g. requests for financial data
- **Evasive phrases**- does the answer contain evasive phrases?, presence = more evasive

2. **LLM evasion score** (RoBERTa-MNLI) uses entailment/neutral/contradiction between the question and answer
- Lower entailment (and higher neutral + contradiction) = more evasive
  
3. **Blended evasion score** combines both scores including a weight for the LLM component
- Rationale is that baseline enforces precision while the LLM will capture semantics

In [None]:
# Import model and tokenizer.
model_name = "roberta-large-mnli"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

# RoBERTa has 3 labels
assert model.config.num_labels == 3

# Label order for roberta-large-mnli
id2label = {0: "contradiction", 1: "neutral", 2: "entailment"}

Some weights of the model checkpoint at roberta-large-mnli were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [None]:
# Set device (change this if not using a macbook) 
if torch.backends.mps.is_available():
    device = torch.device("mps")   # Apple Metal
elif torch.cuda.is_available():
    device = torch.device("cuda")
else:
    device = torch.device("cpu")
    
model.to(device)
None

### **Baseline evasion score functions**

In [None]:
# List of evasive phrases
EVASIVE_PHRASES = [
    r"\btoo early\b",
    r"\bcan't (?:comment|share|discuss)\b",
    r"\bwon't (?:comment|share|provide)\b",
    r"\bno (?:update|comment)\b",
    r"\bwe (?:don't|do not) (?:break out|provide guidance)\b",
    r"\bnot (?:going to|able to) (?:comment|share|provide)\b",
    r"\bwe'll (?:come back|circle back)\b",
    r"\bnot something we disclose\b",
    r"\bas (?:we|I) (?:said|mentioned)\b",
    r"\bgenerally speaking\b",
    r"\bit's premature\b",
    r"\bit's difficult to say\b",
    r"\bI (?:wouldn't|won't) want to (?:speculate|get into)\b",
    r"\bI (?:think|guess|suppose)\b",
    r"\bkind of\b",
    r"\bsort of\b",
    r"\baround\b",
    r"\broughly\b",
    r"\bwe (?:prefer|plan) not to\b",
    r"\bwe're not prepared to\b",
]

# List of words that suggest the answer needs specific financial numbers to properly answer the question.
SPECIFICITY_TRIGGERS = [
    "how much","how many","what is","what are","when","which","where","who","why",
    "range","guidance","margin","capex","opex","revenue","sales","eps","ebitda",
    "timeline","date","target","growth","update","split","dividend","cost","price",
    "units","volumes","gross","net","tax","percentage","utilization","order book"
]

NUMERIC_PATTERN = r"(?:\d+(?:\.\d+)?%|\b\d{1,3}(?:,\d{3})*(?:\.\d+)?\b|£|\$|€)"

In [None]:
# Function to calculate cosine similarity between question and answers.
def cosine_sim(q, a):
    vec = TfidfVectorizer(stop_words='english').fit_transform([q, a]) # converts text to vectors 
    sim = float(cosine_similarity(vec[0], vec[1])[0, 0]) # calculate the cosine similarity between the two vectors

    return sim

In [None]:
# Function to compute baseline evasion score.
def baseline_evasion_score(q, a):
    # 1. Cosine similarity
    sim = cosine_sim(q, a) # calculates cosine similarity using previous function
    sim_component = (1 - sim) * 45 # less similar the answer is, the bigger the contribution to the evasion score, scaled by 45

    # 2. Numerical specificity- Does the question require and answer with financial figures/ a specific answer?
    needs_num = any(t in q.lower() for t in SPECIFICITY_TRIGGERS) # true if the question requires a numeric/ specific answer
    has_num = bool(re.search(NUMERIC_PATTERN, a)) # true if the answer includes a number 
    numeric_component = 25 if needs_num and not has_num else 0 # score of 25 if the question needs a number but the answer doesn't give one

    # 3. Evasive phrases- does the answer contain evasive phrases?
    phrase_hits = sum(len(re.findall(p, a.lower())) for p in EVASIVE_PHRASES) # counts how many times an evasive phrase appears in the answer
    phrase_component = min(3, phrase_hits) * 8 # max of 3 hits counted, each hit = 8 points 

    # Final evasion score.
    score = min(100, sim_component + numeric_component + phrase_component) # adds components together and caps score at 100
    
    return score, sim, phrase_hits, needs_num, has_num

### **LLM and blended evasion score functions**

In [None]:
# Function to compute llm label scores: entailment, contradiction and neutral
def llm_label_scores(question, answer):

    # Function to calculate probabilities based on RoBERTa-MNLI labels.
    def probs(premise, hypothesis):
        inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True, max_length=512).to(device)  # tokenize the input
        with torch.no_grad():  # disable gradient tracking 
            logits = model(**inputs).logits  # runs the model and outputs raw scores

        probs = F.softmax(logits, dim=-1).squeeze().tolist()  # converts raw scores to probabilities

        # RoBERTa-MNLI label order: [contradiction, neutral, entailment]
        # contradiction = contradicts the question, neutral = related but not committing, entailment = directly answers the question
        return {"contradiction": probs[0], "neutral": probs[1], "entailment": probs[2]}

    # Get probability in both directions (e.g. does answer entail question, does question entail answer?)
    pA, pB = probs(answer, question), probs(question, answer)

    # Calculates scores.
    entail = math.sqrt(max(1e-9, pA["entailment"] * pB["entailment"]))
    neutral = 0.5 * (pA["neutral"] + pB["neutral"])
    contradiction = 0.5 * (pA["contradiction"] + pB["contradiction"])

    # Normalises the scores.
    s = entail + neutral + contradiction
    entail, neutral, contradiction = entail/s, neutral/s, contradiction/s
    return {
        "entailment": entail,
        "neutral": neutral,
        "contradiction": contradiction
    }

# Function to compute LLM evasion score from the label scores.
def llm_evasion_score(entail, neutral, contradiction):
    # Maps (entail, neutral, contradiction) -> evasion score in range 0..100
    evasion = (1 - entail) * 60 + neutral * 30 + contradiction * 10
    return max(0.0, min(100.0, evasion))

In [None]:
# Function to compute blended evasion score and return all scores.
def compute_all_evasion_scores(q, a, LLM_WEIGHT=0.30):
    
    # Compute baseline evasion score.
    base_score, _, _, _, _ = baseline_evasion_score(q, a)

    # Compute LLM evasion score.
    llm_label = llm_label_scores(q, a)
    llm_score = llm_evasion_score(llm_label['entailment'], llm_label['neutral'], llm_label['contradiction'])

    # Compute blended score.
    blended_score = max(0.0, min(100.0, base_score + LLM_WEIGHT * llm_score))

    return {
        'baseline': float(base_score),
        'llm_only': float(llm_score),
        'blended': float(blended_score)
        }

In [None]:
# Function to label based on the score.
def label_from_score(score, threshold):
    return 'Evasive' if score >= threshold else 'Direct'

### **Main Pipeline**

In [None]:
# Define thresholds.
LLM_WEIGHT = 0.30
EVASION_THRESHOLD_BASE = 60.0
EVASION_THRESHOLD_LLM = 50.0
EVASION_THRESHOLD_BLENDED = 60.0

In [None]:
# Evasion Pipeline.
def evasion_pipeline(df):
    records = []
    for _, row in df.iterrows():
        q, a = str(row["question"]), str(row["answer"])
        out = compute_all_evasion_scores(q, a)

        pred_base = label_from_score(out["baseline"], EVASION_THRESHOLD_BASE)
        pred_llm = label_from_score(out["llm_only"], EVASION_THRESHOLD_LLM)
        pred_blended = label_from_score(out["blended"], EVASION_THRESHOLD_BLENDED)

        records.append({
            "question_number": row.get("question_number"),
            "question": q,
            "answer": a,

            # Evasion Scores
            "evasion_score_baseline": int(out["baseline"]),
            "evasion_score_llm": int(out["llm_only"]),
            "evasion_score_blended": int(out["blended"]),

            # Predicted labels.
            "prediction_baseline": pred_base,
            "prediction_llm": pred_llm,
            "prediction_blended": pred_blended,
        })

    return pd.DataFrame(records)


In [None]:
# Run evasion pipeline.
jpm_2025_evasion_results = evasion_pipeline(jpm_2025_qa_pairs_df)

# View results.
jpm_2025_evasion_results.head()

Unnamed: 0,question_number,question,answer,evasion_score_baseline,evasion_score_llm,evasion_score_blended,prediction_baseline,prediction_llm,prediction_blended
0,1,"Good morning, Jeremy. Wondering if you could s...","Sure, Ken. So I mean, at a high level, I would...",64,70,85,Evasive,Evasive,Evasive
1,2,Yeah. And just one question on the NII ex. Mar...,"Yeah, that's a good question, Ken. You're righ...",44,73,66,Direct,Evasive,Evasive
2,3,Yes. Good morning. This question is for Jamie....,"I just – before Jamie answers that, Erika, I j...",35,75,57,Direct,Evasive,Direct
3,4,Got it. And a second follow-up question. And I...,"Yeah, Erika, it's a good question. But the tru...",88,81,100,Evasive,Evasive,Evasive
4,5,Thank you. Operator: Thank you. Our next quest...,"Thanks, Erika. Operator: I apologize. Our next...",67,73,89,Evasive,Evasive,Evasive


### **Validation**

- Built a test set of 26 examples from 2023 1q jpm results, human-labelled these evasive or direct.

In [None]:
# Load 2023 transcript.
jpm_1q_23_df = pd.read_csv('../data/processed/jpm/jpm-1q23-earnings-call-transcript_qa.csv')

In [None]:
# View speaker roles.
jpm_1q_23_df['role'].unique()

array(['Chief Financial Officer', 'analyst',
       'Chairman & Chief Executive Officer'], dtype=object)

In [None]:
# Define role mapping.
role_map = {
    'analyst': 'analyst',
    'Chief Financial Officer': 'banker',
    'Chairman & Chief Executive Officer': 'banker'
}

# Apply to dataset.
jpm_1q_23_df['role_normalised'] = jpm_1q_23_df['role'].map(role_map)

# View the dataset.
jpm_1q_23_df.head()

Unnamed: 0,section,question_number,answer_number,speaker_name,role,company,content,year,quarter,is_pleasantry,role_normalised
0,presentation,,,Jeremy Barnum,Chief Financial Officer,JPMorganChase,"Thanks, and good morning, everyone. The presen...",2023,Q1,False,banker
1,qa,,,Steven Chubak,analyst,Wolfe Research LLC,"Hey, good morning.",2023,Q1,True,analyst
2,qa,,,Jeremy Barnum,Chief Financial Officer,JPMorgan Chase & Co.,"Good morning, Steve.",2023,Q1,True,banker
3,qa,1.0,,Steven Chubak,analyst,Wolfe Research LLC,"So, Jamie, I was actually hoping to get your p...",2023,Q1,False,analyst
4,qa,1.0,1.0,Jamie Dimon,Chairman & Chief Executive Officer,JPMorgan Chase & Co.,"Well, I think you were already kind of complet...",2023,Q1,False,banker


In [None]:
# Filter out presention and is_pleasantry == True
jpm_1q_23_df = jpm_1q_23_df[jpm_1q_23_df['section'] == 'qa']
jpm_1q_23_df = jpm_1q_23_df[jpm_1q_23_df['is_pleasantry'] == False]

# View the dataset.
jpm_1q_23_df

Unnamed: 0,section,question_number,answer_number,speaker_name,role,company,content,year,quarter,is_pleasantry,role_normalised
3,qa,1.0,,Steven Chubak,analyst,Wolfe Research LLC,"So, Jamie, I was actually hoping to get your p...",2023,Q1,False,analyst
4,qa,1.0,1.0,Jamie Dimon,Chairman & Chief Executive Officer,JPMorgan Chase & Co.,"Well, I think you were already kind of complet...",2023,Q1,False,banker
5,qa,1.0,1.0,Steven Chubak,analyst,Wolfe Research LLC,Got it. And just in terms of appetite for the ...,2023,Q1,False,analyst
6,qa,1.0,2.0,Jamie Dimon,Chairman & Chief Executive Officer,JPMorgan Chase & Co.,"Oh, yeah.",2023,Q1,False,banker
7,qa,1.0,2.0,Steven Chubak,analyst,Wolfe Research LLC,...elevated macro uncertainties.,2023,Q1,False,analyst
...,...,...,...,...,...,...,...,...,...,...,...
93,qa,26.0,,Matt O'Connor,analyst,"Deutsche Bank Securities, Inc.",Okay. And then just separately to squeeze in –...,2023,Q1,False,analyst
94,qa,26.0,1.0,Jamie Dimon,Chairman & Chief Executive Officer,JPMorgan Chase & Co.,That'll be every quarter for the rest of our l...,2023,Q1,False,banker
95,qa,26.0,2.0,Jeremy Barnum,Chief Financial Officer,JPMorgan Chase & Co.,Cheap.,2023,Q1,False,banker
96,qa,26.0,3.0,Jamie Dimon,Chairman & Chief Executive Officer,JPMorgan Chase & Co.,Cheap.,2023,Q1,False,banker


In [None]:
# Create qa pairs.
jpm_1q_23_qa_pairs_df = create_qa_pairs(jpm_1q_23_df)

# View dataset.
display(jpm_1q_23_qa_pairs_df.head())

# View shape.
print('Number of samples:', jpm_1q_23_qa_pairs_df.shape[0])

Unnamed: 0,question_number,question,answer
0,1.0,"So, Jamie, I was actually hoping to get your p...","Well, I think you were already kind of complet..."
1,2.0,"Hey, thanks. Good morning. Hey, Jeremy, I was ...","Yeah, sure. So let me just summarize the drive..."
2,3.0,"Yeah, and as a follow-up on the point about ra...","Well first of all, I don't quite believe it. S..."
3,4.0,"Hi, thanks. Jeremy, wanted to follow up again ...","Yeah. John, it's a really good question, and w..."
4,5.0,Okay. And then I wanted to ask Jamie – there's...,Yeah. I wouldn't use the word credit crunch if...


Number of samples: 26


In [None]:
# Create the test set. 
jpm_1q_23_test_set_df = jpm_1q_23_qa_pairs_df.copy()

# Create a blank label column and export to CSV for human to label.
jpm_1q_23_test_set_df['label'] = ''  # fill with Direct or Evasive
jpm_1q_23_test_set_df.to_csv('jpm_1q_23_test_set.csv', index=False)

- The test set was human labelled with either 'direct' or 'evasive'.

In [None]:
# Import the labelled test set. 
jpm_1q_23_test_set_labelled_df = pd.read_csv('../notebooks/summarisation_evasion_files/jpm_1q_23_test_set_labelled.csv')

# View dataset.
jpm_1q_23_test_set_labelled_df.head()

Unnamed: 0,question_number,question,answer,label
0,1,"So, Jamie, I was actually hoping to get your p...","Well, I think you were already kind of complet...",Evasive
1,2,"Hey, thanks. Good morning. Hey, Jeremy, I was ...","Yeah, sure. So let me just summarize the drive...",Direct
2,3,"Yeah, and as a follow-up on the point about ra...","Well first of all, I don't quite believe it. S...",Direct
3,4,"Hi, thanks. Jeremy, wanted to follow up again ...","Yeah. John, it's a really good question, and w...",Evasive
4,5,Okay. And then I wanted to ask Jamie – there's...,Yeah. I wouldn't use the word credit crunch if...,Direct


In [None]:
# Run evasion pipeline.
jpm_1q_23_evasion_results = evasion_pipeline(jpm_1q_23_test_set_labelled_df)

# Reappend the human label.
jpm_1q_23_evasion_results['human_label'] = jpm_1q_23_test_set_labelled_df['label']

# View results.
jpm_1q_23_evasion_results.head()

Unnamed: 0,question_number,question,answer,evasion_score_baseline,evasion_score_llm,evasion_score_blended,prediction_baseline,prediction_llm,prediction_blended,human_label
0,1,"So, Jamie, I was actually hoping to get your p...","Well, I think you were already kind of complet...",55,79,79,Direct,Evasive,Evasive,Evasive
1,2,"Hey, thanks. Good morning. Hey, Jeremy, I was ...","Yeah, sure. So let me just summarize the drive...",80,78,100,Evasive,Evasive,Evasive,Direct
2,3,"Yeah, and as a follow-up on the point about ra...","Well first of all, I don't quite believe it. S...",40,83,65,Direct,Evasive,Evasive,Direct
3,4,"Hi, thanks. Jeremy, wanted to follow up again ...","Yeah. John, it's a really good question, and w...",78,73,99,Evasive,Evasive,Evasive,Evasive
4,5,Okay. And then I wanted to ask Jamie – there's...,Yeah. I wouldn't use the word credit crunch if...,55,81,80,Direct,Evasive,Evasive,Direct


In [None]:
# Function to evaluate the evasion scores.
def evaluate_evasion_scores(df, thr_base=EVASION_THRESHOLD_BASE, thr_llm=EVASION_THRESHOLD_LLM, thr_blend=EVASION_THRESHOLD_BLENDED):

    # Ground truth: 1 = Evasive, 0 = Direct (using 'human_label')
    y_true = (df["human_label"].astype(str).str.strip().str.lower() == "evasive").astype(int).values

    # Convert predicted label strings to binary (1 = Evasive, 0 = Direct)
    def to_binary(pred_series):
        return (pred_series.astype(str).str.strip().str.lower() == "evasive").astype(int).values

    y_pred_base  = to_binary(df["prediction_baseline"])
    y_pred_llm   = to_binary(df["prediction_llm"])
    y_pred_blend = to_binary(df["prediction_blended"])

    return {
        'baseline': {
            'classification_report': classification_report(y_true, y_pred_base, target_names=["Direct", "Evasive"], digits=3, zero_division=0),
            'confusion_matrix': confusion_matrix(y_true, y_pred_base)
        },
        'llm': {
            'classification_report': classification_report(y_true, y_pred_llm, target_names=["Direct", "Evasive"], digits=3, zero_division=0),
            'confusion_matrix': confusion_matrix(y_true, y_pred_llm)
        },
        'blended': {
            'classification_report': classification_report(y_true, y_pred_blend, target_names=["Direct", "Evasive"], digits=3, zero_division=0),
            'confusion_matrix': confusion_matrix(y_true, y_pred_blend) 
        }
    }

In [None]:
# Extract results.
eval_dict = evaluate_evasion_scores(jpm_1q_23_evasion_results)
baseline_eval, llm_eval, blended_eval = eval_dict['baseline'], eval_dict['llm'], eval_dict['blended']

In [None]:
# View baseline results.
base_cr, base_cm = baseline_eval['classification_report'], baseline_eval['confusion_matrix']

print(base_cr)
print(base_cm)

              precision    recall  f1-score   support

      Direct      0.588     0.714     0.645        14
     Evasive      0.556     0.417     0.476        12

    accuracy                          0.577        26
   macro avg      0.572     0.565     0.561        26
weighted avg      0.573     0.577     0.567        26

[[10  4]
 [ 7  5]]


In [None]:
# View llm results.
llm_cr, llm_cm = llm_eval['classification_report'], llm_eval['confusion_matrix']

print(llm_cr)
print(llm_cm)

              precision    recall  f1-score   support

      Direct      0.000     0.000     0.000        14
     Evasive      0.462     1.000     0.632        12

    accuracy                          0.462        26
   macro avg      0.231     0.500     0.316        26
weighted avg      0.213     0.462     0.291        26

[[ 0 14]
 [ 0 12]]


In [None]:
# View blended results.
blended_cr, blended_cm = blended_eval['classification_report'], blended_eval['confusion_matrix']

print(blended_cr)
print(blended_cm)

              precision    recall  f1-score   support

      Direct      1.000     0.071     0.133        14
     Evasive      0.480     1.000     0.649        12

    accuracy                          0.500        26
   macro avg      0.740     0.536     0.391        26
weighted avg      0.760     0.500     0.371        26

[[ 1 13]
 [ 0 12]]


- Need to fine tune threshold using grid search to improve accuracy.