# **Evasion Detection Notebook**

# **Objectives**

This notebook contains the evasion detection pipeline. It computes three scores:

1. **Baseline evasion score** (rule-based) is made up of three components:
- **Cosine similarity**: similarity between the question and the answer; lower similarity = more evasive
- **Numeric specificity check**: does the question require a number (e.g. a request for financial data) and, if so, does the answer contain one?
- **Evasive phrases**: does the answer contain evasive phrases? Presence = more evasive

2. **LLM evasion score** (RoBERTa-MNLI) uses the entailment/neutral/contradiction prediction between the question and the answer
- Lower entailment (and higher neutral + contradiction) = more evasive

3. **Blended evasion score** combines the baseline and LLM scores, with a weight on the LLM component
- Rationale: the baseline enforces precision, while the LLM captures semantics
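
The three scores above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the pipeline's actual implementation: the phrase list, the numeric cues, the equal weighting of the baseline components, the default `llm_weight`, and all function names are hypothetical. A bag-of-words cosine stands in for whichever embedding the notebook uses, and the LLM score takes an entailment probability as input rather than calling RoBERTa-MNLI.

```python
import math
import re
from collections import Counter

# Hypothetical lists, for illustration only.
EVASIVE_PHRASES = ("too early to say", "we don't comment", "as i said before")
NUMERIC_CUES = ("how much", "how many", "what percentage")


def _tokens(text: str) -> list:
    """Lowercase word/number tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def cosine_sim(question: str, answer: str) -> float:
    """Bag-of-words cosine similarity (stand-in for TF-IDF or sentence embeddings)."""
    q, a = Counter(_tokens(question)), Counter(_tokens(answer))
    dot = sum(q[t] * a[t] for t in q)
    norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(
        sum(v * v for v in a.values())
    )
    return dot / norm if norm else 0.0


def baseline_evasion(question: str, answer: str) -> float:
    """Average of three components, each in [0, 1]; higher = more evasive."""
    dissimilarity = 1.0 - cosine_sim(question, answer)
    wants_number = any(cue in question.lower() for cue in NUMERIC_CUES)
    missing_number = 1.0 if wants_number and not re.search(r"\d", answer) else 0.0
    has_phrase = 1.0 if any(p in answer.lower() for p in EVASIVE_PHRASES) else 0.0
    # Assumption: equal weights across the three components.
    return (dissimilarity + missing_number + has_phrase) / 3.0


def llm_evasion(p_entailment: float) -> float:
    """With three softmax classes, neutral + contradiction = 1 - entailment."""
    return 1.0 - p_entailment


def blended_evasion(baseline: float, llm: float, llm_weight: float = 0.5) -> float:
    """Convex combination of the two scores; llm_weight is an assumed default."""
    return (1.0 - llm_weight) * baseline + llm_weight * llm


question = "How much revenue did you book in Q3?"
direct = "We booked 120 million in revenue in Q3."
evasive = "It is too early to say, we are pleased with progress."

print(baseline_evasion(question, direct), baseline_evasion(question, evasive))
```

The direct answer shares vocabulary with the question and contains a number, so it scores lower than the evasive answer, which triggers all three components.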

# **1. Set up Workspace**

In [None]:
# Import libraries
# Core python
import os
import numpy as np
import pandas as pd
import re
import json
import pathlib
from pathlib import Path
from typing import List, Dict, Any 
import csv
import math

# NLP & Summarisation
from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification
from llama_cpp import Llama 
import torch
import torch.nn.functional as F

# Retrieval
from sentence_transformers import SentenceTransformer 

# ML
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

# Visualisations
import matplotlib.pyplot as plt
import seaborn as sns 

# Random seed for reproducibility
SEED = 42
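
Defining `SEED` does not by itself make results reproducible; the value has to be passed to each random number generator in use. A minimal sketch, assuming the notebook relies on Python's `random`, NumPy, and PyTorch (the helper name `seed_everything` is hypothetical, and the guarded imports only keep the snippet self-contained):

```python
import random

SEED = 42


def seed_everything(seed: int = SEED) -> None:
    """Seed the random number generators this notebook is likely to touch."""
    random.seed(seed)
    try:
        import numpy as np

        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed; nothing to seed
    try:
        import torch

        torch.manual_seed(seed)
        if torch.cuda.is_available():
            torch.cuda.manual_seed_all(seed)
    except ImportError:
        pass  # PyTorch not installed; nothing to seed


seed_everything()
```

Functions such as `train_test_split` take their own `random_state` argument, so `SEED` would still need to be passed to them explicitly.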

# **2. Data Preprocessing**