<a href="https://colab.research.google.com/github/Abumaude/AI-Foolosophy/blob/main/FTE_HARM_Full_Validation_Pipeline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# FTE-HARM FULL: Complete Validation Pipeline

This notebook implements the **complete FTE-HARM validation pipeline** with:

- **Multiple hypotheses** (8+, one generated per discovered label)
- **Two P_Score methods** (Option A: Binary, Option B3: Confidence-Weighted)
- **Three validation approaches** (Approach 1: Binary, Approach 3: Two-Stage, Approach 5: Confidence Calibration)
- **Multi-dataset processing** with results saved to Google Drive

## Dataset Structure

```
Base Path: /content/drive/My Drive/thesis/dataset/

Supported naming patterns:
- log.log / label.log
- log_auth.log / label_auth.log  
- openvpn.log / openvpn_labels.log
- <name>.log / <name>_labels.log
```
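
As a rough illustration of how these naming patterns pair up, the sketch below maps a log filename to the label filenames it could pair with. `candidate_label_names` is a hypothetical helper written only for this example; the pipeline does its own matching in Cell 2.

```python
def candidate_label_names(log_name):
    """Return the label filenames a given log file could pair with."""
    candidates = []
    if log_name == 'log.log':
        candidates.append('label.log')
    if log_name.startswith('log_'):
        # log_auth.log -> label_auth.log
        candidates.append('label_' + log_name[len('log_'):])
    if log_name.endswith('.log') and not log_name.endswith('_labels.log'):
        # openvpn.log -> openvpn_labels.log
        candidates.append(log_name[:-len('.log')] + '_labels.log')
    return candidates


for name in ['log.log', 'log_auth.log', 'openvpn.log']:
    print(name, '->', candidate_label_names(name))
```

A file can yield more than one candidate, which is why the patterns need to be tried in a fixed order.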

## Workflow

1. **Label Discovery** - Find all unique labels across datasets
2. **Hypothesis Generation** - Create hypotheses from discovered labels
3. **Entity Extraction** - Use transformer NER model
4. **P_Score Calculation** - Score with Option A (Binary) and B3 (Confidence)
5. **Validation** - Binary, Two-Stage, and Calibration approaches
6. **Save Results** - Export to Google Drive

---
## Cell 1: Imports and Configuration

In [None]:
# ═══════════════════════════════════════════════════════════════════════════
# FTE-HARM FULL: IMPORTS AND CONFIGURATION
# ═══════════════════════════════════════════════════════════════════════════

import os
import re
import json
import numpy as np
from datetime import datetime
from collections import defaultdict

# Install transformers if needed
try:
    from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
except ImportError:
    !pip install -q transformers
    from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

print("\n" + "="*60)
print("FTE-HARM FULL VALIDATION PIPELINE")
print("="*60)

In [None]:
# ─────────────────────────────────────────────────────────────────────────────
# PATH CONFIGURATION
# ─────────────────────────────────────────────────────────────────────────────

DATASET_BASE_PATH = '/content/drive/My Drive/thesis/dataset'
OUTPUT_PATH = '/content/drive/My Drive/thesis/hypotheses_validation'

os.makedirs(OUTPUT_PATH, exist_ok=True)

# Transformer models - update paths to your models
MODELS = {
    'distilbert': '/content/drive/My Drive/thesis/transformer/distilberta_base_uncased/results/checkpoint-5245',
    'distilroberta': '/content/drive/My Drive/thesis/transformer/distilroberta_base/results/checkpoint-5275',
    'roberta_large': '/content/drive/My Drive/thesis/transformer/roberta_large/results/checkpoint-2772',
    'xlm_roberta_base': '/content/drive/My Drive/thesis/transformer/xlm_roberta_base/results/checkpoint-12216',
    'xlm_roberta_large': '/content/drive/My Drive/thesis/transformer/xlm_roberta_large/results/checkpoint-12240',
}

# Select which model to use
SELECTED_MODEL = 'xlm_roberta_large'

# Confidence thresholds for triage
THRESHOLDS = {
    'HIGH': 0.65,
    'MEDIUM': 0.50,
    'LOW': 0.35
}

# Triage priority descriptions
TRIAGE_PRIORITIES = {
    'HIGH': 'Priority 1: Investigate immediately',
    'MEDIUM': 'Priority 2: Queue for investigation',
    'LOW': 'Priority 3: Investigate later',
    'INSUFFICIENT': 'Priority 4: Archive for future relevance'
}

print("✓ Configuration loaded")
print(f"  Dataset path: {DATASET_BASE_PATH}")
print(f"  Output path: {OUTPUT_PATH}")
print(f"  Selected model: {SELECTED_MODEL}")

---
## Cell 2: Dataset Discovery

In [None]:
# ═══════════════════════════════════════════════════════════════════════════
# DATASET DISCOVERY - FLEXIBLE FILE NAMING
# ═══════════════════════════════════════════════════════════════════════════

def find_log_label_pair(folder_path):
    """Find log and label files with variable naming conventions."""
    if not os.path.isdir(folder_path):
        return None, None
    
    files = os.listdir(folder_path)
    log_files = [f for f in files if f.endswith('.log')]
    
    log_file = None
    label_file = None
    
    # Strategy 1: log.log / label.log
    if 'log.log' in log_files and 'label.log' in log_files:
        log_file, label_file = 'log.log', 'label.log'
    
    # Strategy 2: log_*.log / label_*.log
    # Files are sorted so the chosen pair does not depend on os.listdir order.
    if log_file is None:
        for f in sorted(log_files):
            if f.startswith('log_'):
                suffix = f[4:]
                potential_label = f'label_{suffix}'
                if potential_label in log_files:
                    log_file, label_file = f, potential_label
                    break
    
    # Strategy 3: <name>.log / <name>_labels.log
    # Tried whenever the earlier strategies found no pair, so a folder holding
    # an unpaired log_*.log file can still match here.
    if log_file is None:
        for f in sorted(log_files):
            if not f.endswith('_labels.log'):
                base_name = f[:-4]
                potential_label = f'{base_name}_labels.log'
                if potential_label in log_files:
                    log_file, label_file = f, potential_label
                    break
    
    if log_file and label_file:
        return os.path.join(folder_path, log_file), os.path.join(folder_path, label_file)
    return None, None


def scan_all_datasets(base_path):
    """Recursively scan for all valid dataset pairs."""
    datasets = []
    
    for root, dirs, files in os.walk(base_path):
        log_path, label_path = find_log_label_pair(root)
        
        if log_path and label_path:
            rel_path = os.path.relpath(root, base_path)
            datasets.append({
                'name': rel_path,
                'folder': root,
                'log_path': log_path,
                'label_path': label_path,
                'log_file': os.path.basename(log_path),
                'label_file': os.path.basename(label_path)
            })
    
    return datasets


# Scan datasets
print("Scanning for datasets...")
all_datasets = scan_all_datasets(DATASET_BASE_PATH)
print(f"\n✓ Found {len(all_datasets)} dataset pairs")

for i, ds in enumerate(all_datasets, 1):
    print(f"  {i}. {ds['name']}")

---
## Cell 3: Label Discovery (Comprehensive)
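
The label files are read as JSON lines, one object per labelled log line, each carrying a `labels` list. A minimal sketch of the counting step, using made-up sample records rather than real dataset contents (the label names and line numbers are assumptions for illustration):

```python
import json
from collections import Counter

# Hypothetical label-file lines; real files come from the dataset folders
sample_lines = [
    '{"line": 12, "labels": ["attacker_change_user", "escalate"], "rules": {}}',
    '{"line": 13, "labels": ["escalate"], "rules": {}}',
    'not valid json',  # malformed lines are skipped
]

counts = Counter()
for raw in sample_lines:
    try:
        entry = json.loads(raw)
    except json.JSONDecodeError:
        continue
    counts.update(entry.get('labels', []))

# Labels ordered from most to least frequent
print(counts.most_common())
```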

In [None]:
# ═══════════════════════════════════════════════════════════════════════════
# LABEL DISCOVERY - COMPREHENSIVE
# ═══════════════════════════════════════════════════════════════════════════

def discover_labels(label_path):
    """Discover all unique labels in a label file."""
    all_labels = set()
    label_counts = defaultdict(int)
    first_label = None
    total_entries = 0
    
    if not os.path.exists(label_path):
        return None
    
    with open(label_path, 'r', encoding='utf-8', errors='ignore') as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            
            try:
                entry = json.loads(line)
                labels = entry.get('labels', [])
                
                if labels:
                    total_entries += 1
                    for label in labels:
                        all_labels.add(label)
                        label_counts[label] += 1
                        if first_label is None:
                            first_label = label
            except json.JSONDecodeError:
                continue
    
    sorted_labels = sorted(label_counts.items(), key=lambda x: x[1], reverse=True)
    
    return {
        'all_labels': all_labels,
        'label_counts': dict(label_counts),
        'sorted_labels': sorted_labels,
        'first_label': first_label,
        'most_common': sorted_labels[0][0] if sorted_labels else None,
        'total_entries': total_entries
    }


def discover_all_labels(datasets):
    """Discover labels across all datasets."""
    combined_labels = set()
    combined_counts = defaultdict(int)
    labels_per_dataset = {}
    
    print("\n" + "="*80)
    print("LABEL DISCOVERY")
    print("="*80)
    
    for ds in datasets:
        result = discover_labels(ds['label_path'])
        
        if result:
            labels_per_dataset[ds['name']] = list(result['all_labels'])
            combined_labels.update(result['all_labels'])
            
            for label, count in result['label_counts'].items():
                combined_counts[label] += count
            
            print(f"\n{ds['name']}:")
            print(f"  Entries: {result['total_entries']}, Labels: {list(result['all_labels'])}")
    
    sorted_combined = sorted(combined_counts.items(), key=lambda x: x[1], reverse=True)
    
    print(f"\n{'─'*40}")
    print(f"COMBINED: {len(combined_labels)} unique labels")
    for label, count in sorted_combined:
        print(f"  {label}: {count}")
    
    return {
        'all_labels': combined_labels,
        'label_counts': dict(combined_counts),
        'sorted_labels': sorted_combined,
        'labels_per_dataset': labels_per_dataset
    }

In [None]:
# Run label discovery
label_discovery = discover_all_labels(all_datasets)

# Save discovery results
discovery_path = os.path.join(OUTPUT_PATH, f'label_discovery_{datetime.now().strftime("%Y%m%d_%H%M%S")}.txt')
with open(discovery_path, 'w') as f:
    f.write("LABEL DISCOVERY RESULTS\n")
    f.write(f"Timestamp: {datetime.now().isoformat()}\n")
    f.write(f"Datasets scanned: {len(all_datasets)}\n\n")
    f.write(f"Unique labels: {len(label_discovery['all_labels'])}\n\n")
    for label, count in label_discovery['sorted_labels']:
        f.write(f"  {label}: {count}\n")

print(f"\n✓ Discovery saved: {discovery_path}")

---
## Cell 4: Data Loaders
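
Ground truth is joined to the logs by 1-indexed line number: a log line counts as malicious when its number appears in the label file. A small sketch of that join with invented log lines and labels (`log_text` and `ground_truth` are illustrative stand-ins, not dataset contents):

```python
log_text = "session opened\n\nsu: root by alice\n"

# Number lines from 1 before dropping blanks, so numbering still matches the file
logs = [
    (n, line.strip())
    for n, line in enumerate(log_text.splitlines(), 1)
    if line.strip()
]

# Hypothetical ground truth keyed by line number
ground_truth = {3: {'labels': ['escalate'], 'rules': {}}}

for line_number, text in logs:
    is_malicious = line_number in ground_truth
    print(line_number, is_malicious, text)
```

Numbering before filtering matters: the blank second line is dropped, yet the third line keeps number 3 and still matches its label.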

In [None]:
# ═══════════════════════════════════════════════════════════════════════════
# DATA LOADERS
# ═══════════════════════════════════════════════════════════════════════════

def load_ground_truth(label_path):
    """Load ground truth from a JSON-lines label file, keyed by line number (no tokenization)."""
    ground_truth = {}
    
    if not os.path.exists(label_path):
        return ground_truth
    
    with open(label_path, 'r', encoding='utf-8', errors='ignore') as f:
        for json_line in f:
            json_line = json_line.strip()
            if not json_line:
                continue
            
            try:
                entry = json.loads(json_line)
                line_num = entry.get('line')
                
                if line_num is not None:
                    ground_truth[line_num] = {
                        'labels': entry.get('labels', []),
                        'rules': entry.get('rules', {})
                    }
            except json.JSONDecodeError:
                continue
    
    return ground_truth


def get_label_for_line(line_number, ground_truth):
    """Get ground truth for a line."""
    if line_number in ground_truth:
        return {
            "is_malicious": True,
            "labels": ground_truth[line_number]['labels'],
            "rules": ground_truth[line_number]['rules']
        }
    return {"is_malicious": False, "labels": [], "rules": {}}


def load_raw_logs(log_path):
    """Load logs with 1-indexed line tracking."""
    logs = []
    
    if not os.path.exists(log_path):
        return logs
    
    with open(log_path, 'r', encoding='utf-8', errors='ignore') as f:
        for line_number, log_text in enumerate(f, 1):
            log_text = log_text.strip()
            if log_text:
                logs.append((line_number, log_text))
    
    return logs


def load_dataset(dataset_info):
    """Load complete dataset."""
    logs = load_raw_logs(dataset_info['log_path'])
    ground_truth = load_ground_truth(dataset_info['label_path'])
    
    stats = {
        'name': dataset_info['name'],
        'total_lines': len(logs),
        'malicious_lines': len(ground_truth),
        'benign_lines': len(logs) - len(ground_truth),
        'malicious_pct': (len(ground_truth) / len(logs) * 100) if logs else 0
    }
    
    return logs, ground_truth, stats


print("✓ Data loaders defined")

---
## Cell 5: Model Loading and Entity Extraction
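
The extraction code in this cell snaps model-predicted spans onto whole whitespace- and bracket-delimited tokens, so a span that covers only part of a token is widened to the full token. The sketch below shows the idea with hand-written spans standing in for NER output; `snap_to_tokens` and the sample spans are illustrative assumptions, and no model is loaded.

```python
import re

def snap_to_tokens(text, spans):
    """Widen each predicted span to the full physical token it overlaps."""
    snapped = []
    for tok in re.finditer(r'[^\[\]\s]+', text):
        t_start, t_end = tok.span()
        hits = [s for s in spans if s['start'] < t_end and s['end'] > t_start]
        if hits:
            snapped.append((tok.group(), hits[0]['entity_group']))
    return snapped


text = "Failed password for root from 10.0.0.5"
user_start = text.index("root")
ip_start = text.index("10.0.0.5")

# Hand-written spans; the second covers only "10.0", i.e. part of the IP token
spans = [
    {'start': user_start, 'end': user_start + 4, 'entity_group': 'Username'},
    {'start': ip_start, 'end': ip_start + 4, 'entity_group': 'IPAddress'},
]

print(snap_to_tokens(text, spans))
```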

In [None]:
# ═══════════════════════════════════════════════════════════════════════════
# MODEL LOADING AND ENTITY EXTRACTION
# ═══════════════════════════════════════════════════════════════════════════

def get_model_pipeline(model_path):
    """Load NER model."""
    print(f"Loading model: {model_path}")
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForTokenClassification.from_pretrained(model_path)
    nlp = pipeline('ner', model=model, tokenizer=tokenizer, aggregation_strategy="simple")
    print("✓ Model loaded")
    return nlp


def process_text(text, nlp_pipeline):
    """Physical Token Quantization: snap NER spans onto whole whitespace/bracket-delimited tokens."""
    raw_results = nlp_pipeline(text)
    physical_tokens = list(re.finditer(r'[^\[\]\s]+', text))
    
    atomic_entities = []
    
    for pt in physical_tokens:
        t_start, t_end = pt.span()
        matches = [r for r in raw_results if r['start'] < t_end and r['end'] > t_start]
        
        if not matches:
            continue
        
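        labels = set(m['entity_group'] for m in matches)
        # Prefer IPAddress when a token is tagged both IPAddress and DNSName;
        # otherwise take the label of the first overlapping span
        chosen_label = 'IPAddress' if 'IPAddress' in labels and 'DNSName' in labels else matches[0]['entity_group']
        # Token confidence is the mean score of all overlapping spans
        avg_score = sum(float(m['score']) for m in matches) / len(matches)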
        
        atomic_entities.append({
            "label": chosen_label,
            "text": text[t_start:t_end],
            "start": t_start,
            "end": t_end,
            "confidence": avg_score
        })
    
    # Merge adjacent same-label entities
    if not atomic_entities:
        return []
    
    final = [atomic_entities[0]]
    for curr in atomic_entities[1:]:
        prev = final[-1]
        between = text[prev['end']:curr['start']]
        merge = between.strip() == '' and '[' not in between and ']' not in between
        
        if merge and prev['label'] == curr['label']:
            prev['end'] = curr['end']
            prev['text'] = text[prev['start']:prev['end']]
            prev['confidence'] = (prev['confidence'] + curr['confidence']) / 2
        else:
            final.append(curr)
    
    for e in final:
        e['confidence'] = round(e['confidence'], 4)
    
    return final


def extract_entities(line_number, log_text, nlp):
    """Extract entities with line tracking."""
    entities = process_text(log_text, nlp)
    
    entity_types = defaultdict(list)
    for e in entities:
        entity_types[e['label']].append({'value': e['text'], 'confidence': e['confidence']})
    
    return {
        'line_number': line_number,
        'log_text': log_text,
        'entities': entities,
        'entity_types': dict(entity_types)
    }


print("✓ Entity extraction functions defined")

In [None]:
# Load model
nlp = get_model_pipeline(MODELS[SELECTED_MODEL])

---
## Cell 6: Multiple Hypotheses (Dynamic Generation)
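
Each generated hypothesis is a dictionary holding entity-type weights, a critical entity, and a penalty factor. The sketch below builds one by hand from the `dns` weight template used in this cell and checks that the weights sum to 1, which keeps the maximum possible P_Score at 1.0. The label name is a made-up placeholder.

```python
import math

# The 'dns' weight template used by the hypothesis generator
dns_weights = {'DNSName': 0.35, 'IPAddress': 0.25, 'Action': 0.15,
               'Process': 0.15, 'DateTime': 0.10}

hypothesis = {
    'name': 'H1_example_dns_label',          # hypothetical label name
    'description': 'Hypothesis for example_dns_label',
    'target_labels': ['example_dns_label'],
    'weights': dns_weights,
    'critical_entity': 'DNSName',
    'penalty_factor': 0.20,
}

total = sum(hypothesis['weights'].values())
print(f"weights sum to {total:.2f}")

# Sanity checks worth running on any hand-edited template
assert math.isclose(total, 1.0)
assert hypothesis['critical_entity'] in hypothesis['weights']
```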

In [None]:
# ═══════════════════════════════════════════════════════════════════════════
# MULTIPLE HYPOTHESES - DYNAMIC GENERATION
# ═══════════════════════════════════════════════════════════════════════════

def generate_hypotheses_from_labels(discovered_labels):
    """
    Generate hypotheses based on discovered labels.
    
    Creates a hypothesis for each unique label found in the dataset.
    """
    hypotheses = {}
    
    # Default weight templates by category
    weight_templates = {
        'escalate': {
            'Process': 0.30, 'Username': 0.25, 'Action': 0.20,
            'DateTime': 0.15, 'Status': 0.10
        },
        'attacker': {
            'Username': 0.30, 'IPAddress': 0.25, 'Action': 0.20,
            'Process': 0.15, 'DateTime': 0.10
        },
        'lateral': {
            'IPAddress': 0.30, 'Username': 0.25, 'Process': 0.20,
            'Action': 0.15, 'DateTime': 0.10
        },
        'dns': {
            'DNSName': 0.35, 'IPAddress': 0.25, 'Action': 0.15,
            'Process': 0.15, 'DateTime': 0.10
        },
        'default': {
            'Process': 0.25, 'Action': 0.20, 'Username': 0.20,
            'IPAddress': 0.15, 'DateTime': 0.10, 'Status': 0.10
        }
    }
    
    for i, (label, count) in enumerate(discovered_labels, 1):
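        # Select weight template based on label keywords
        label_lower = label.lower()
        # Match 'su'/'sudo' as whole tokens; a bare substring test would also
        # hit unrelated labels such as 'suspicious' or 'result'
        label_tokens = set(re.split(r'[^a-z0-9]+', label_lower))
        
        if 'escalat' in label_lower or label_tokens & {'su', 'sudo'} or 'privilege' in label_lower: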
            weights = weight_templates['escalate'].copy()
            critical = 'Process'
        elif 'attacker' in label_lower or 'change_user' in label_lower:
            weights = weight_templates['attacker'].copy()
            critical = 'Username'
        elif 'lateral' in label_lower or 'movement' in label_lower or 'ssh' in label_lower:
            weights = weight_templates['lateral'].copy()
            critical = 'IPAddress'
        elif 'dns' in label_lower or 'query' in label_lower:
            weights = weight_templates['dns'].copy()
            critical = 'DNSName'
        else:
            weights = weight_templates['default'].copy()
            critical = 'Process'
        
        hyp_name = f'H{i}_{label}'
        hypotheses[hyp_name] = {
            'name': hyp_name,
            'description': f'Hypothesis for {label}',
            'target_labels': [label],  # Labels this hypothesis maps to
            'weights': weights,
            'critical_entity': critical,
            'penalty_factor': 0.20
        }
    
    return hypotheses

In [None]:
# Generate hypotheses from discovered labels
HYPOTHESES = generate_hypotheses_from_labels(label_discovery['sorted_labels'])

print(f"\n✓ Generated {len(HYPOTHESES)} hypotheses:")
for name, hyp in HYPOTHESES.items():
    print(f"  {name}: critical={hyp['critical_entity']}, targets={hyp['target_labels']}")

---
## Cell 7: P_Score Option A (Binary Presence)
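
A worked example of the binary scoring rule, using the `attacker` weight template and the 0.20 penalty factor from Cell 6. The set of entity types marked present is a hypothetical extraction result chosen to show the penalty taking effect.

```python
# 'attacker' weight template; Username is its critical entity
weights = {'Username': 0.30, 'IPAddress': 0.25, 'Action': 0.20,
           'Process': 0.15, 'DateTime': 0.10}
critical, penalty = 'Username', 0.20

# Hypothetical extraction result: entity types found in one log line
present = {'IPAddress', 'Action', 'DateTime'}

# E_i is 1 for present types and 0 otherwise, so only present weights add up
weighted_sum = sum(w for etype, w in weights.items() if etype in present)

# The penalty applies only because the critical entity is missing
p_score = weighted_sum if critical in present else weighted_sum * (1 - penalty)

print(round(weighted_sum, 4), round(p_score, 4))
```

Here the weighted sum is 0.55, which the missing `Username` reduces to 0.44. With the thresholds from Cell 1, that moves the line from MEDIUM to LOW.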

In [None]:
# ═══════════════════════════════════════════════════════════════════════════
# P_SCORE OPTION A: BINARY PRESENCE
# ═══════════════════════════════════════════════════════════════════════════

def calculate_pscore_option_a(entity_types, hypothesis):
    """
    P_Score with BINARY presence.
    Formula: P_Score = Σ(W_i × E_i) × (1 - P_F), where the (1 - P_F) penalty
    is applied only when the critical entity is absent.
    E_i = 1 if entity type i is present, 0 if absent
    """
    weights = hypothesis['weights']
    critical = hypothesis['critical_entity']
    penalty = hypothesis['penalty_factor']
    
    weighted_sum = 0.0
    breakdown = {}
    
    for etype, weight in weights.items():
        present = etype in entity_types and len(entity_types[etype]) > 0
        contrib = weight * (1 if present else 0)
        weighted_sum += contrib
        breakdown[etype] = {'weight': weight, 'present': present, 'contribution': contrib}
    
    critical_present = critical in entity_types and len(entity_types[critical]) > 0
    p_score = weighted_sum if critical_present else weighted_sum * (1 - penalty)
    
    if p_score >= THRESHOLDS['HIGH']:
        level = 'HIGH'
    elif p_score >= THRESHOLDS['MEDIUM']:
        level = 'MEDIUM'
    elif p_score >= THRESHOLDS['LOW']:
        level = 'LOW'
    else:
        level = 'INSUFFICIENT'
    
    return {
        'p_score': round(p_score, 4),
        'confidence_level': level,
        'is_malicious': p_score >= THRESHOLDS['LOW'],
        'triage_priority': TRIAGE_PRIORITIES[level],
        'critical_present': critical_present,
        'breakdown': breakdown,
        'method': 'Option_A'
    }


print("✓ P_Score Option A (Binary) defined")

---
## Cell 8: P_Score Option B3 (Confidence Weighted)
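
A worked example of the confidence-weighted rule, again with the `attacker` weight template from Cell 6. The confidence values are invented for illustration; in the pipeline they come from the NER model.

```python
# 'attacker' weight template; Username is its critical entity
weights = {'Username': 0.30, 'IPAddress': 0.25, 'Action': 0.20,
           'Process': 0.15, 'DateTime': 0.10}
critical, penalty = 'Username', 0.20

# Hypothetical per-type confidences for one log line
entity_confidences = {
    'Username': [0.90, 0.70],
    'IPAddress': [0.80],
}

weighted_sum = 0.0
for etype, weight in weights.items():
    confs = entity_confidences.get(etype, [])
    if confs:
        # C_i is the mean confidence of all entities of this type
        weighted_sum += weight * (sum(confs) / len(confs))

# Username is present, so no penalty is applied
critical_present = bool(entity_confidences.get(critical))
p_score = weighted_sum if critical_present else weighted_sum * (1 - penalty)

print(round(p_score, 4))
```

The same entities would score 0.55 under Option A. Weighting by confidence lowers that to 0.44, so uncertain extractions count for less.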

In [None]:
# ═══════════════════════════════════════════════════════════════════════════
# P_SCORE OPTION B3: CONFIDENCE WEIGHTED
# ═══════════════════════════════════════════════════════════════════════════

def calculate_pscore_option_b3(entity_types, hypothesis):
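    """
    P_Score with CONFIDENCE weighting.
    Formula: P_Score = Σ(W_i × C_i) × (1 - P_F), where the (1 - P_F) penalty
    is applied only when the critical entity is absent.
    C_i = average confidence of entities of type i (0.0-1.0), 0 if absent
    """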
    weights = hypothesis['weights']
    critical = hypothesis['critical_entity']
    penalty = hypothesis['penalty_factor']
    
    weighted_sum = 0.0
    breakdown = {}
    
    for etype, weight in weights.items():
        if etype in entity_types and entity_types[etype]:
            confidences = [e['confidence'] for e in entity_types[etype]]
            avg_conf = sum(confidences) / len(confidences)
            contrib = weight * avg_conf
            breakdown[etype] = {
                'weight': weight, 'present': True,
                'avg_confidence': round(avg_conf, 4), 'contribution': round(contrib, 4)
            }
        else:
            contrib = 0.0
            breakdown[etype] = {'weight': weight, 'present': False, 'contribution': 0.0}
        
        weighted_sum += contrib
    
    critical_present = critical in entity_types and len(entity_types[critical]) > 0
    p_score = weighted_sum if critical_present else weighted_sum * (1 - penalty)
    
    if p_score >= THRESHOLDS['HIGH']:
        level = 'HIGH'
    elif p_score >= THRESHOLDS['MEDIUM']:
        level = 'MEDIUM'
    elif p_score >= THRESHOLDS['LOW']:
        level = 'LOW'
    else:
        level = 'INSUFFICIENT'
    
    return {
        'p_score': round(p_score, 4),
        'confidence_level': level,
        'is_malicious': p_score >= THRESHOLDS['LOW'],
        'triage_priority': TRIAGE_PRIORITIES[level],
        'critical_present': critical_present,
        'breakdown': breakdown,
        'method': 'Option_B3'
    }


print("✓ P_Score Option B3 (Confidence) defined")

---
## Cell 9: Multi-Hypothesis Scoring
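
Hypotheses are ranked by P_Score in descending order. Python's sort is stable, so hypotheses with equal scores keep their insertion order; since hypotheses built from the same weight template score identically, the earlier one wins such ties. A short sketch with hypothetical scores and names:

```python
# Hypothetical P_Scores, in hypothesis insertion order
scores = {'H1_first': 0.44, 'H2_second': 0.55, 'H3_third': 0.55}

# Stable sort: H2 stays ahead of H3 because it was inserted first
ranking = sorted(scores.items(), key=lambda item: item[1], reverse=True)
top_name, top_score = ranking[0]

print(ranking)
print("top:", top_name, top_score)
```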

In [None]:
# ═══════════════════════════════════════════════════════════════════════════
# MULTI-HYPOTHESIS SCORING
# ═══════════════════════════════════════════════════════════════════════════

def score_all_hypotheses(entity_types, hypotheses, method='option_a'):
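    """Score against all hypotheses and rank (ties keep hypothesis insertion order)."""
    score_funcs = {'option_a': calculate_pscore_option_a, 'option_b3': calculate_pscore_option_b3}
    if method not in score_funcs:
        raise ValueError(f"Unknown P_Score method: {method!r}")
    score_func = score_funcs[method]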
    
    scores = {name: score_func(entity_types, hyp) for name, hyp in hypotheses.items()}
    ranking = sorted([(n, s['p_score']) for n, s in scores.items()], key=lambda x: x[1], reverse=True)
    
    top = ranking[0] if ranking else (None, 0.0)
    
    return {
        'scores': scores,
        'ranking': ranking,
        'top_hypothesis': top[0],
        'top_score': top[1],
        'top_level': scores[top[0]]['confidence_level'] if top[0] else 'INSUFFICIENT',
        'is_malicious': top[1] >= THRESHOLDS['LOW'],
        'triage': scores[top[0]]['triage_priority'] if top[0] else TRIAGE_PRIORITIES['INSUFFICIENT']
    }


def process_log_full(line_number, log_text, nlp, hypotheses, method='option_a'):
    """Complete processing for one log line."""
    extraction = extract_entities(line_number, log_text, nlp)
    scoring = score_all_hypotheses(extraction['entity_types'], hypotheses, method)
    
    return {
        'line_number': line_number,
        'log_text': log_text,
        'entities': extraction['entities'],
        'entity_types': extraction['entity_types'],
        'scoring': scoring,
        'prediction': {
            'hypothesis': scoring['top_hypothesis'],
            'p_score': scoring['top_score'],
            'confidence': scoring['top_level'],
            'is_malicious': scoring['is_malicious'],
            'triage': scoring['triage']
        }
    }


print("✓ Multi-hypothesis scoring defined")

---
## Cell 10: Validation Approach 1 (Binary)
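
The binary metrics follow directly from the confusion-matrix counts. A worked example with hypothetical counts (not results from any dataset):

```python
# Hypothetical confusion-matrix counts
tp, fp, tn, fn = 8, 2, 88, 2

precision = tp / (tp + fp) if tp + fp > 0 else 0
recall = tp / (tp + fn) if tp + fn > 0 else 0
f1 = 2 * precision * recall / (precision + recall) if precision + recall > 0 else 0
accuracy = (tp + tn) / (tp + fp + tn + fn)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}")
```

Accuracy (0.96 here) is dominated by the many benign lines, so F1 (0.80) is the more informative figure when malicious lines are rare.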

In [None]:
# ═══════════════════════════════════════════════════════════════════════════
# VALIDATION APPROACH 1: BINARY
# ═══════════════════════════════════════════════════════════════════════════

def validate_binary(predictions, ground_truth):
    """Binary validation: malicious vs benign."""
    tp = fp = tn = fn = 0
    details = {'tp': [], 'fp': [], 'tn': [], 'fn': []}
    
    for pred in predictions:
        line = pred['line_number']
        pred_mal = pred['prediction']['is_malicious']
        gt = get_label_for_line(line, ground_truth)
        actual_mal = gt['is_malicious']
        
        if pred_mal and actual_mal:
            tp += 1
            details['tp'].append({'line': line, 'score': pred['prediction']['p_score'], 'labels': gt['labels']})
        elif pred_mal and not actual_mal:
            fp += 1
            details['fp'].append({'line': line, 'score': pred['prediction']['p_score']})
        elif not pred_mal and not actual_mal:
            tn += 1
        else:
            fn += 1
            details['fn'].append({'line': line, 'labels': gt['labels']})
    
    total = tp + fp + tn + fn
    prec = tp / (tp + fp) if tp + fp > 0 else 0
    rec = tp / (tp + fn) if tp + fn > 0 else 0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec > 0 else 0
    acc = (tp + tn) / total if total > 0 else 0
    
    return {
        'approach': 'Approach 1: Binary',
        'cm': {'TP': tp, 'FP': fp, 'TN': tn, 'FN': fn},
        'metrics': {'precision': round(prec, 4), 'recall': round(rec, 4), 'f1': round(f1, 4), 'accuracy': round(acc, 4)},
        'details': details
    }


print("✓ Validation Approach 1 (Binary) defined")

---
## Cell 11: Validation Approach 3 (Two-Stage)
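
Stage B counts a true positive as correctly attributed when any ground-truth label and any target label of the top hypothesis contain one another, ignoring case. The combined score is the detection F1 multiplied by that hypothesis accuracy. A sketch of both pieces follows; `labels_match` is a hypothetical helper name, and the label strings and stage results are made up.

```python
def labels_match(gt_labels, target_labels):
    """Case-insensitive substring match in either direction."""
    return any(
        t.lower() in g.lower() or g.lower() in t.lower()
        for g in gt_labels
        for t in target_labels
    )


print(labels_match(['escalated_command'], ['escalate']))  # substring hit
print(labels_match(['dns_query'], ['escalate']))          # no overlap

# Hypothetical stage results
detection_f1, hypothesis_acc = 0.80, 0.75
overall = round(detection_f1 * hypothesis_acc, 4)
print(overall)
```

Because matching is by substring, a short target label can match several longer ground-truth labels, which tends to raise the reported hypothesis accuracy.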

In [None]:
# ═══════════════════════════════════════════════════════════════════════════
# VALIDATION APPROACH 3: TWO-STAGE
# ═══════════════════════════════════════════════════════════════════════════

def validate_two_stage(predictions, ground_truth, hypotheses):
    """Two-stage: detection + hypothesis accuracy."""
    # Stage A: Binary
    binary = validate_binary(predictions, ground_truth)
    
    # Stage B: Hypothesis accuracy for true positives
    correct = incorrect = 0
    # Index predictions by line number once instead of scanning the list per TP
    pred_by_line = {p['line_number']: p for p in predictions}
    
    for tp in binary['details']['tp']:
        line = tp['line']
        gt_labels = tp['labels']
        
        # Find predicted hypothesis
        pred = pred_by_line.get(line)
        if not pred:
            continue
        
        pred_hyp = pred['prediction']['hypothesis']
        if pred_hyp and pred_hyp in hypotheses:
            target_labels = hypotheses[pred_hyp].get('target_labels', [])
            
            # Check if any gt label matches target labels
            matched = any(
                any(tl.lower() in gl.lower() or gl.lower() in tl.lower() 
                    for tl in target_labels)
                for gl in gt_labels
            )
            
            if matched:
                correct += 1
            else:
                incorrect += 1
    
    total_tp = correct + incorrect
    hyp_acc = correct / total_tp if total_tp > 0 else 0
    
    return {
        'approach': 'Approach 3: Two-Stage',
        'stage_a': binary['metrics'],
        'stage_b': {
            'total_tp': total_tp,
            'correct': correct,
            'incorrect': incorrect,
            'hypothesis_accuracy': round(hyp_acc, 4)
        },
        'combined': {
            'detection_f1': binary['metrics']['f1'],
            'hypothesis_acc': round(hyp_acc, 4),
            'overall': round(binary['metrics']['f1'] * hyp_acc, 4)
        }
    }


print("✓ Validation Approach 3 (Two-Stage) defined")

---
## Cell 12: Validation Approach 5 (Confidence Calibration)
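
The calibration score in this cell compares the observed accuracy in each confidence level against a fixed target accuracy and weights each gap by the share of predictions in that level. A worked example using the same targets and hypothetical per-level counts:

```python
# Target accuracy per confidence level, as used in this cell
expected = {'HIGH': 0.90, 'MEDIUM': 0.70, 'LOW': 0.50, 'INSUFFICIENT': 0.30}

# Hypothetical outcomes per level: (correct, total)
observed = {'HIGH': (9, 10), 'MEDIUM': (10, 20), 'LOW': (0, 0), 'INSUFFICIENT': (0, 0)}

total_preds = sum(total for _, total in observed.values())

# Empty levels are skipped so they neither divide by zero nor add to the score
ece = sum(
    (total / total_preds) * abs(correct / total - expected[level])
    for level, (correct, total) in observed.items()
    if total > 0
)

quality = 'GOOD' if ece < 0.15 else 'MODERATE' if ece < 0.25 else 'POOR'
print(round(ece, 4), quality)
```

HIGH matches its target exactly and contributes nothing. MEDIUM is 0.20 below its target and holds two thirds of the predictions, giving a score of about 0.133.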

In [None]:
# ═══════════════════════════════════════════════════════════════════════════
# VALIDATION APPROACH 5: CONFIDENCE CALIBRATION
# ═══════════════════════════════════════════════════════════════════════════

def validate_calibration(predictions, ground_truth):
    """Confidence calibration: does confidence correlate with accuracy?"""
    buckets = {level: {'correct': 0, 'incorrect': 0} for level in ['HIGH', 'MEDIUM', 'LOW', 'INSUFFICIENT']}
    
    for pred in predictions:
        line = pred['line_number']
        level = pred['prediction']['confidence']
        pred_mal = pred['prediction']['is_malicious']
        
        gt = get_label_for_line(line, ground_truth)
        correct = (pred_mal == gt['is_malicious'])
        
        if correct:
            buckets[level]['correct'] += 1
        else:
            buckets[level]['incorrect'] += 1
    
    # Calculate accuracy per level
    calibration = {}
    # Target accuracy for each confidence level (fixed, hard-coded values)
    expected = {'HIGH': 0.90, 'MEDIUM': 0.70, 'LOW': 0.50, 'INSUFFICIENT': 0.30}
    
    for level, bucket in buckets.items():
        total = bucket['correct'] + bucket['incorrect']
        acc = bucket['correct'] / total if total > 0 else 0
        calibration[level] = {'total': total, 'correct': bucket['correct'], 'accuracy': round(acc, 4)}
    
    # ECE-style score: count-weighted gap between observed accuracy and the target for each level
    total_preds = sum(c['total'] for c in calibration.values())
    ece = sum(
        (calibration[l]['total'] / total_preds * abs(calibration[l]['accuracy'] - expected[l]))
        for l in calibration if calibration[l]['total'] > 0
    ) if total_preds > 0 else 0
    
    quality = 'GOOD' if ece < 0.15 else 'MODERATE' if ece < 0.25 else 'POOR'
    
    return {
        'approach': 'Approach 5: Calibration',
        'by_level': calibration,
        'ece': round(ece, 4),
        'quality': quality
    }


print("✓ Validation Approach 5 (Calibration) defined")

---
## Cell 13: Full Validation Pipeline

In [None]:
# ═══════════════════════════════════════════════════════════════════════════
# FULL VALIDATION PIPELINE
# ═══════════════════════════════════════════════════════════════════════════

def run_full_validation(dataset_info, nlp, hypotheses):
    """Run complete validation with both P_Score methods and all approaches."""
    print(f"\n{'='*80}")
    print(f"FULL VALIDATION: {dataset_info['name']}")
    print(f"{'='*80}")
    
    logs, ground_truth, stats = load_dataset(dataset_info)
    print(f"Loaded: {stats['total_lines']} logs, {stats['malicious_lines']} malicious")
    
    results = {
        'dataset': dataset_info['name'],
        'stats': stats,
        'timestamp': datetime.now().isoformat(),
        'model': SELECTED_MODEL,
        'option_a': {},
        'option_b3': {}
    }
    
    for method in ['option_a', 'option_b3']:
        print(f"\n─── Processing with {method.upper()} ───")
        
        # Process all logs
        predictions = []
        for i, (line_num, log_text) in enumerate(logs):
            pred = process_log_full(line_num, log_text, nlp, hypotheses, method)
            predictions.append(pred)
            
            if (i + 1) % 500 == 0:
                print(f"  {i+1}/{len(logs)} processed")
        
        print(f"  ✓ {len(predictions)} predictions")
        
        # Run validations
        v1 = validate_binary(predictions, ground_truth)
        v3 = validate_two_stage(predictions, ground_truth, hypotheses)
        v5 = validate_calibration(predictions, ground_truth)
        
        results[method] = {
            'predictions': predictions,
            'approach_1': v1,
            'approach_3': v3,
            'approach_5': v5
        }
        
        print(f"  Binary F1: {v1['metrics']['f1']:.4f}")
        print(f"  Hypothesis Acc: {v3['stage_b']['hypothesis_accuracy']:.4f}")
        print(f"  Calibration: {v5['quality']} (ECE={v5['ece']:.4f})")
    
    return results


print("✓ Full validation pipeline defined")

---
## Cell 14: Save Results to Google Drive

In [None]:
# ═══════════════════════════════════════════════════════════════════════════
# SAVE RESULTS TO GOOGLE DRIVE
# ═══════════════════════════════════════════════════════════════════════════

def save_results(results, output_path=OUTPUT_PATH):
    """Save all validation results."""
    dataset_name = results['dataset'].replace('/', '_')
    ts = datetime.now().strftime('%Y%m%d_%H%M%S')
    folder = os.path.join(output_path, dataset_name)
    os.makedirs(folder, exist_ok=True)
    
    saved = []
    
    for method in ['option_a', 'option_b3']:
        mr = results[method]
        
        # Approach 1 - Binary
        path = os.path.join(folder, f'{method}_binary_{ts}.txt')
        with open(path, 'w', encoding='utf-8') as f:
            f.write("FTE-HARM Validation - Approach 1 (Binary)\n")
            f.write(f"Dataset: {results['dataset']}\n")
            f.write(f"Method: {method}\n")
            f.write(f"Timestamp: {results['timestamp']}\n\n")
            f.write("Confusion Matrix:\n")
            for k, v in mr['approach_1']['cm'].items():
                f.write(f"  {k}: {v}\n")
            f.write("\nMetrics:\n")
            for k, v in mr['approach_1']['metrics'].items():
                f.write(f"  {k}: {v}\n")
        saved.append(path)
        
        # Approach 3 - Two-Stage
        path = os.path.join(folder, f'{method}_twostage_{ts}.txt')
        with open(path, 'w', encoding='utf-8') as f:
            f.write("FTE-HARM Validation - Approach 3 (Two-Stage)\n")
            f.write(f"Dataset: {results['dataset']}\n")
            f.write(f"Method: {method}\n\n")
            f.write("Stage A (Detection):\n")
            for k, v in mr['approach_3']['stage_a'].items():
                f.write(f"  {k}: {v}\n")
            f.write("\nStage B (Hypothesis):\n")
            for k, v in mr['approach_3']['stage_b'].items():
                f.write(f"  {k}: {v}\n")
            f.write("\nCombined:\n")
            for k, v in mr['approach_3']['combined'].items():
                f.write(f"  {k}: {v}\n")
        saved.append(path)
        
        # Approach 5 - Calibration
        path = os.path.join(folder, f'{method}_calibration_{ts}.txt')
        with open(path, 'w', encoding='utf-8') as f:
            f.write("FTE-HARM Validation - Approach 5 (Calibration)\n")
            f.write(f"Dataset: {results['dataset']}\n")
            f.write(f"Method: {method}\n\n")
            f.write("By Confidence Level:\n")
            for level, data in mr['approach_5']['by_level'].items():
                f.write(f"  {level}: {data}\n")
            f.write(f"\nECE: {mr['approach_5']['ece']}\n")
            f.write(f"Quality: {mr['approach_5']['quality']}\n")
        saved.append(path)
    
    print(f"\n✓ Saved {len(saved)} files to: {folder}")
    return saved


print("✓ Save functions defined")
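
The text reports are easy to read but awkward to load back for later analysis. A minimal sketch of a JSON sidecar, assuming the metric dictionaries hold only plain Python values, NumPy scalars, and NumPy arrays: the helpers `to_jsonable` and `save_metrics_json` are hypothetical names and are not part of the notebook's `save_results`.

```python
import json
import numpy as np

def to_jsonable(obj):
    """Recursively convert NumPy scalars/arrays and tuples into plain JSON types."""
    if isinstance(obj, dict):
        return {str(k): to_jsonable(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [to_jsonable(v) for v in obj]
    if isinstance(obj, np.ndarray):
        return obj.tolist()
    if isinstance(obj, np.generic):
        return obj.item()  # e.g. np.int64 -> int, np.float32 -> float
    return obj

def save_metrics_json(metrics, path):
    """Write a metrics dictionary to `path` as UTF-8 JSON and return the path."""
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(to_jsonable(metrics), f, indent=2)
    return path
```

One possible use is calling `save_metrics_json(mr['approach_1'], ...)` next to each text report, with a `.json` file name of your choosing.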

---
## Cell 15: Run All Datasets

In [None]:
# ═══════════════════════════════════════════════════════════════════════════
# RUN VALIDATION ON ALL DATASETS
# ═══════════════════════════════════════════════════════════════════════════

def run_all_datasets(datasets, nlp, hypotheses):
    """Run validation on all datasets."""
    all_results = {}
    
    print(f"\n{'='*80}")
    print(f"RUNNING FULL VALIDATION ON {len(datasets)} DATASETS")
    print(f"{'='*80}")
    
    for i, ds in enumerate(datasets, 1):
        print(f"\n[{i}/{len(datasets)}] {ds['name']}")
        
        try:
            results = run_full_validation(ds, nlp, hypotheses)
            save_results(results)
            all_results[ds['name']] = results
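        except Exception as e:
            # Record the failure and continue with the remaining datasets
            print(f"  ERROR ({type(e).__name__}): {e}")
            all_results[ds['name']] = {'error': f"{type(e).__name__}: {e}"}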
    
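    # Summary (make sure the output folder exists even if every dataset failed before saving)
    os.makedirs(OUTPUT_PATH, exist_ok=True)
    summary_path = os.path.join(OUTPUT_PATH, f'summary_{datetime.now().strftime("%Y%m%d_%H%M%S")}.txt')
    with open(summary_path, 'w', encoding='utf-8') as f: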
        f.write("FTE-HARM VALIDATION SUMMARY\n")
        f.write(f"Datasets: {len(datasets)}\n\n")
        
        for name, res in all_results.items():
            f.write(f"\n{name}:\n")
            if 'error' in res:
                f.write(f"  ERROR: {res['error']}\n")
            else:
                for m in ['option_a', 'option_b3']:
                    f1 = res[m]['approach_1']['metrics']['f1']
                    hyp_acc = res[m]['approach_3']['stage_b']['hypothesis_accuracy']
                    f.write(f"  {m}: F1={f1:.4f}, HypAcc={hyp_acc:.4f}\n")
    
    print(f"\n✓ Summary saved: {summary_path}")
    return all_results


print("✓ Batch processing functions defined")
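
The dictionary returned by `run_all_datasets` can also be aggregated in memory, for example to compare the two P_Score methods across datasets. A minimal sketch, assuming the nested layout used in the summary loop; the helper name `mean_f1_by_method` is hypothetical, and the demo values are placeholders for illustration, not real results.

```python
def mean_f1_by_method(all_results, methods=('option_a', 'option_b3')):
    """Average the binary F1 per method over datasets that finished without error."""
    scores = {m: [] for m in methods}
    for res in all_results.values():
        if 'error' in res:
            continue  # failed datasets carry only an 'error' entry
        for m in methods:
            scores[m].append(res[m]['approach_1']['metrics']['f1'])
    # None marks a method with no successful dataset
    return {m: (sum(v) / len(v) if v else None) for m, v in scores.items()}


# Placeholder values for illustration only
demo = {
    'dataset_one': {
        'option_a': {'approach_1': {'metrics': {'f1': 0.5}}},
        'option_b3': {'approach_1': {'metrics': {'f1': 1.0}}},
    },
    'dataset_two': {
        'option_a': {'approach_1': {'metrics': {'f1': 0.25}}},
        'option_b3': {'approach_1': {'metrics': {'f1': 0.5}}},
    },
    'dataset_three': {'error': 'label file missing'},
}
print(mean_f1_by_method(demo))  # {'option_a': 0.375, 'option_b3': 0.75}
```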

---
## Cell 16: Execute Validation

In [None]:
# ═══════════════════════════════════════════════════════════════════════════
# EXECUTE VALIDATION
# ═══════════════════════════════════════════════════════════════════════════

# Option 1: Run on single dataset for testing
if all_datasets:
    print("Running on first dataset for testing...")
    test_results = run_full_validation(all_datasets[0], nlp, HYPOTHESES)
    save_results(test_results)
else:
    print("No datasets found. Check the dataset base path and file naming patterns.")

In [None]:
# Option 2: Run on ALL datasets (uncomment to execute)
# all_results = run_all_datasets(all_datasets, nlp, HYPOTHESES)

---
## Output Structure

```
/content/drive/My Drive/thesis/hypotheses_validation/
├── label_discovery_<ts>.txt
├── summary_<ts>.txt
└── <dataset_name>/
    ├── option_a_binary_<ts>.txt
    ├── option_a_twostage_<ts>.txt
    ├── option_a_calibration_<ts>.txt
    ├── option_b3_binary_<ts>.txt
    ├── option_b3_twostage_<ts>.txt
    └── option_b3_calibration_<ts>.txt
```
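
The layout above can be checked after a run. A minimal sketch, assuming the file names follow the `<method>_<approach>_<ts>.txt` pattern written by `save_results` with a `YYYYMMDD_HHMMSS` timestamp; the helper `missing_reports` and the constants are hypothetical names.

```python
import re

METHODS = ('option_a', 'option_b3')
APPROACHES = ('binary', 'twostage', 'calibration')

# Matches e.g. option_a_binary_20240101_120000.txt
REPORT_RE = re.compile(
    r'^(option_a|option_b3)_(binary|twostage|calibration)_\d{8}_\d{6}\.txt$'
)

def missing_reports(filenames):
    """Return the (method, approach) pairs that have no report among `filenames`."""
    found = set()
    for name in filenames:
        match = REPORT_RE.match(name)
        if match:
            found.add((match.group(1), match.group(2)))
    return [(m, a) for m in METHODS for a in APPROACHES if (m, a) not in found]
```

In the notebook this could be called as `missing_reports(os.listdir(folder))` on a dataset folder; an empty list means all six reports are present.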

## Checklist

- [ ] Label discovery completed
- [ ] Hypotheses generated from labels
- [ ] Both P_Score methods tested (Option A + B3)
- [ ] All 3 validation approaches run (Binary, Two-Stage, Calibration)
- [ ] Results saved to Google Drive
- [ ] Summary generated