# **FinSight Evasion Detection Notebook**

## **1. Documentation**

### **1.1 Purpose & Scope**
This notebook implements a pipeline to detect evasive versus direct answers in Q&A transcripts. The primary business goal is to maximise recall for the evasive class (minimise false negatives) so that reviewers receive a shortlist of likely evasive answers to inspect. We report ranking metrics such as P@K (precision within the top K% of predicted positives) to quantify shortlist quality for downstream PRA review.

### **1.2 Data & Inputs**
- **Data:** Q&A transcripts (J.P. Morgan 2023-2025 and HSBC 2023-2025). J.P. Morgan is used for validation and reporting; HSBC is used for reporting.
- **Input format:** A table with the following columns: `question_number`, `answer_number`, `speaker_name`, `role`, `company`, `content`, `year`, `quarter` and `source_pdf`.
- **Splits:** `train` (used to test exemplar building), `jpm_val_qa_labelled` (for tuning thresholds and model selection) and `jpm_test_qa_labelled` (for final reporting). Final predictions were run on `jpm_2025_predict_qa` and `hsbc_2025_predict_qa`.
- **Imbalance:** Strong class imbalance (evasive minority)
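As a minimal sketch of how the input format above could be checked before running the pipeline, the helper below reports which expected columns are absent. The function name `missing_columns` is hypothetical and not taken from the notebook; it accepts any iterable of column names (for example the header row of a CSV, or the `columns` of a DataFrame).

```python
# Column names as listed under "Input format".
REQUIRED_COLUMNS = (
    "question_number", "answer_number", "speaker_name", "role",
    "company", "content", "year", "quarter", "source_pdf",
)


def missing_columns(columns):
    """Return the required columns that are absent, in the documented order.

    Hypothetical helper for illustration; not part of the notebook itself.
    """
    present = set(columns)
    return [c for c in REQUIRED_COLUMNS if c not in present]


# Example: a table that only carries three of the nine expected columns.
print(missing_columns(["content", "year", "quarter"]))
```

An empty list means the table has every expected column; extra columns are ignored.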

### **1.3 Pipeline Overview**
- **Rule-based Baseline:** Flag potential evasions by assessing question-answer semantic similarity, numerical expectations (e.g. question asks for numbers, answer lacks numbers) and evasive phrase hits.
- **NLI-Based Scoring:** Treats each question-answer pair as the premise and tests entailment against hypotheses representing Direct and Evasive answers, using large NLI models, namely `roberta-large-mnli`, `microsoft/deberta-large-mnli` and `MoritzLaurer/deberta-v3-large-zeroshot-v2.0`. The model logits/probabilities are then mapped to an evasion score.
- **Few-Shot RAG Exemplars (tested):** Retrieves a small set of labelled exemplars (via SBERT similarity on the question) and prepends them to the NLI context.
- **Blending:** Combines baseline and NLI scores in a weighted blend and averages the scores across the NLI models.
- **Thresholding:** Converts scores to binary flags using tuned thresholds. 
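The baseline, blending and thresholding steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the notebook's implementation: the phrase list, the numeric cues, the rule weights (0.4/0.3/0.3), the blend weight `w_baseline`, the default threshold, and the choice to average the NLI scores before blending are all hypothetical. The semantic similarity is assumed to be computed elsewhere (for example from sentence embeddings) and passed in as a value in [0, 1].

```python
import re

# Hypothetical phrase list; the notebook's actual list is not shown here.
EVASIVE_PHRASES = ("too early to say", "we don't comment", "as i said before")
# Hypothetical cues that a question expects a numeric answer.
NUMERIC_CUES = ("how much", "how many", "what percentage")
# A "number" here is any digit run, optionally with separators or a % sign.
NUMBER_RE = re.compile(r"\d[\d,.]*%?")


def baseline_score(question, answer, similarity):
    """Rule-based evasion score in [0, 1]; the weights are illustrative.

    `similarity` is a question-answer semantic similarity in [0, 1],
    assumed to be computed outside this function.
    """
    q, a = question.lower(), answer.lower()
    expects_number = any(cue in q for cue in NUMERIC_CUES)
    missing_number = expects_number and NUMBER_RE.search(a) is None
    phrase_hit = any(p in a for p in EVASIVE_PHRASES)
    score = 0.4 * (1.0 - similarity) + 0.3 * missing_number + 0.3 * phrase_hit
    return min(1.0, max(0.0, score))


def blend(baseline, nli_scores, w_baseline=0.3):
    """Average the per-model NLI scores, then blend with the baseline."""
    nli_mean = sum(nli_scores) / len(nli_scores)
    return w_baseline * baseline + (1.0 - w_baseline) * nli_mean


def flag(score, threshold=0.5):
    """Convert a blended score into a binary evasive flag."""
    return score >= threshold


b = baseline_score(
    "How much did costs rise this quarter?",
    "It is too early to say, we remain disciplined.",
    similarity=0.5,
)
print(flag(blend(b, [0.9, 0.7, 0.8])))
```

In practice the threshold would be tuned on the validation split rather than fixed; a lower threshold trades precision for higher recall on the evasive class.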

### **1.4 Evaluation Metrics**
The notebook reports the following per threshold:
- **Precision / Recall / F1** (Direct)
- **Precision / Recall / F1** (Evasive): primary focus on `recall_evasive`
- **AUPRC:** area under precision-recall curve for evasive class 
- **P@K% predicted positives:** e.g. K = 10%, 25%, 50%; used to set the shortlist size for flagged evasive answers (fraction of correctly predicted positives within the top K% of the ranked list of predicted positives)
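To make the shortlist metrics concrete, here is a small sketch of evasive-class recall and P@K% on toy data. The function names are hypothetical, labels are assumed to be 1 for evasive and 0 for direct, and rounding the shortlist size up with `ceil` is an assumption rather than something the notebook specifies.

```python
import math


def recall_evasive(flags, labels):
    """Share of truly evasive answers (label 1) that were flagged."""
    tp = sum(1 for f, y in zip(flags, labels) if f and y == 1)
    fn = sum(1 for f, y in zip(flags, labels) if not f and y == 1)
    return tp / (tp + fn) if (tp + fn) else 0.0


def precision_at_k_percent(scores, labels, threshold, k_percent):
    """Precision within the top k% of predicted positives, ranked by score."""
    positives = [(s, y) for s, y in zip(scores, labels) if s >= threshold]
    if not positives:
        return 0.0
    positives.sort(key=lambda p: p[0], reverse=True)
    # Assumption: round the shortlist size up and keep at least one item.
    n_top = max(1, math.ceil(len(positives) * k_percent / 100))
    return sum(y for _, y in positives[:n_top]) / n_top


# Toy data: six answers, four of them truly evasive.
scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.3]
labels = [1, 1, 0, 1, 0, 1]
flags = [s >= 0.5 for s in scores]

print(recall_evasive(flags, labels))                       # 3 of 4 evasive answers flagged
print(precision_at_k_percent(scores, labels, 0.5, 50))     # 2 of the top 3 are evasive
```

Here five answers are predicted positive, so the top 50% is a shortlist of three; two of those are truly evasive, giving P@50% of 2/3, while recall on the evasive class is 0.75.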