<a href="https://colab.research.google.com/github/Abumaude/AI-Foolosophy/blob/main/dataset_loader.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Dataset Loader and Ground Truth Pairing for FTE-HARM Validation

This notebook implements a comprehensive dataset loading and ground truth pairing system for forensic log analysis validation. The system enables rigorous validation of **FTE-HARM** (Forensic Triage Entity - Hypothesis Assessment Risk Model) against known attack patterns.

---

## Purpose

**Why Ground Truth Pairing is Critical:**
- **Validation Requirement:** FTE-HARM's hypothesis scoring must be validated against known attack patterns
- **Ground Truth Necessity:** Without ground truth labels, we cannot measure precision, recall, or accuracy
- **Dataset Pairing:** Log files and ground truth must be matched correctly to ensure evaluation validity
- **Forensic Accountability:** Every triage decision must be traceable to verified evidence

---

## Supported Ground Truth Formats

| Format | Extension | Example |
|--------|-----------|--------|
| Line-by-Line | `.log`, `.txt` | `benign,0,none` or `malicious,1,privilege_escalation` |
| CSV with Line Numbers | `.csv` | `line_number,label,attack_type,confidence` |
| JSON Temporal | `.json` | Attack windows with start/end times |
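
As a minimal sketch of how a line-by-line entry such as `benign,0,none` could be parsed, the snippet below splits each line into its three fields. The field order (`label,binary,attack_type`) is inferred from the examples in the table, and `ParsedLabel` and `parse_label_line` are hypothetical names used for illustration only; they are not part of the `dataset_loader` module.

```python
from typing import NamedTuple


class ParsedLabel(NamedTuple):
    label: str          # e.g. "benign" or "malicious"
    is_malicious: bool  # derived from the 0/1 field
    attack_type: str    # e.g. "none" or "privilege_escalation"


def parse_label_line(line: str) -> ParsedLabel:
    """Parse one 'label,binary,attack_type' line (assumed field order)."""
    parts = [part.strip() for part in line.strip().split(",")]
    if len(parts) != 3:
        raise ValueError(f"expected 3 comma-separated fields, got {len(parts)}: {line!r}")
    label, binary, attack_type = parts
    if binary not in ("0", "1"):
        raise ValueError(f"binary field must be 0 or 1, got {binary!r}")
    return ParsedLabel(label, binary == "1", attack_type)


# The two example lines from the table above
for raw in ("benign,0,none", "malicious,1,privilege_escalation"):
    print(parse_label_line(raw))
```

Rejecting malformed lines early, rather than defaulting them to benign, keeps a corrupted label file from silently skewing the evaluation.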

---

## Setup and Installation

In [None]:
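# Mount Google Drive (Colab only; skipped when running locally)
try:
    from google.colab import drive
    drive.mount('/content/drive')
except ImportError:
    print("Not running in Colab - skipping Google Drive mount")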

In [None]:
# Import the dataset loader module
import sys
import os

# Check if running in Colab
try:
    from google.colab import drive
    IN_COLAB = True
except ImportError:
    IN_COLAB = False
    print("Not running in Colab - using local paths")

# Add module path - adjust based on where dataset_loader.py is located
if IN_COLAB:
    # Option 1: If module is in Google Drive
    module_path = '/content/drive/My Drive/thesis'
    if os.path.exists(module_path):
        sys.path.insert(0, module_path)
    
    # Option 2: If module is in the cloned repo
    repo_path = '/content/AI-Foolosophy'
    if os.path.exists(repo_path):
        sys.path.insert(0, repo_path)
else:
    # Local development: notebooks do not define __file__, so use the working directory
    module_path = os.getcwd()
    sys.path.insert(0, module_path)

# Import the dataset loader
from dataset_loader import (
    DatasetConfig,
    DatasetScanner,
    DatasetPairer,
    GroundTruthLoader,
    DatasetValidator,
    DatasetStatsGenerator,
    DatasetIterator,
    FTEHARMValidator,
    load_and_pair_datasets,
    validate_datasets,
    iterate_with_groundtruth,
    GroundTruthEntry,
    DatasetPair,
    ValidationResult,
    DatasetStatistics,
    GroundTruthFormat
)

print("Dataset loader module imported successfully!")

---

## Step 1: Configure Dataset Paths

Configure the paths to your forensic log datasets. The default configuration expects the following structure:

```
/content/drive/My Drive/thesis/dataset/
├── grp1/                    # Group 1: Primary datasets
│   ├── rm/                  # RussellMitchell AITv2 dataset
│   │   ├── log_auth.log     # SSH authentication logs (raw)
│   │   ├── label_auth.log   # Ground truth for log_auth.log
│   │   └── ...
│   └── santos/              # Santos DNS exfiltration dataset
│       ├── dns_queries.log  # DNS query logs (raw)
│       ├── dns_labels.log   # Ground truth for dns_queries.log
│       └── ...
└── grp2/                    # Group 2: Secondary/validation datasets
    └── ...
```

In [None]:
# Configure dataset paths
# Modify these paths according to your directory structure

DATASET_PATHS = {
    'grp1': '/content/drive/My Drive/thesis/dataset/grp1',
    'grp2': '/content/drive/My Drive/thesis/dataset/grp2'
}

# Create custom configuration
config = DatasetConfig()
config.DATASET_PATHS = DATASET_PATHS

# Display configuration
print("Dataset Configuration:")
print("=" * 60)
for group, path in DATASET_PATHS.items():
    exists = os.path.exists(path)
    status = "EXISTS" if exists else "NOT FOUND"
    print(f"  {group}: {path} [{status}]")

---

## Step 2: Scan Dataset Directories

Scan the configured directories to discover log files and potential ground truth files.

In [None]:
# Scan all dataset directories
scanner = DatasetScanner(config)

print("Scanning dataset directories...")
print("=" * 60)

all_datasets = {}
for group_name, group_path in DATASET_PATHS.items():
    if os.path.exists(group_path):
        group_datasets = scanner.scan_directory(group_path)
        print(f"\n{group_name.upper()} ({group_path}):")
        
        if not group_datasets:
            print("  No datasets found")
        else:
            for subdir, info in group_datasets.items():
                print(f"  {subdir}/")
                print(f"    Log files: {info['log_files']}")
                print(f"    Label files: {info['label_files']}")
                
                # Store with group prefix
                all_datasets[f"{group_name}/{subdir}"] = info
    else:
        print(f"\n{group_name.upper()}: Directory not found")

print(f"\nTotal subdirectories scanned: {len(all_datasets)}")

---

## Step 3: Pair Log Files with Ground Truth

Match each log file with its corresponding ground truth annotation file using multiple pairing rules:

1. **Prefix match:** `log_X.log` → `label_X.log`
2. **Suffix match:** `X.log` → `X_labels.csv`
3. **Root name match:** `X.log` → `X_gt.txt`
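
The three rules above can be sketched as a function that generates candidate label file names for a given log file. This is an illustrative assumption about how such matching might work, not the actual `DatasetPairer` logic; `candidate_label_names` and `find_label` are hypothetical helpers. Note that a pair such as `dns_queries.log` and `dns_labels.log` is not covered by these three patterns as written and would need an additional rule.

```python
import os
from typing import Iterable, List, Optional


def candidate_label_names(log_name: str) -> List[str]:
    """Return candidate label file names for a log file, in rule order."""
    root, ext = os.path.splitext(log_name)
    candidates = []
    # Rule 1, prefix match: log_X.log -> label_X.log
    if root.startswith("log_"):
        candidates.append("label_" + root[len("log_"):] + ext)
    # Rule 2, suffix match: X.log -> X_labels.csv
    candidates.append(root + "_labels.csv")
    # Rule 3, root name match: X.log -> X_gt.txt
    candidates.append(root + "_gt.txt")
    return candidates


def find_label(log_name: str, available: Iterable[str]) -> Optional[str]:
    """Return the first candidate present in `available`, or None if unpaired."""
    available_set = set(available)
    for candidate in candidate_label_names(log_name):
        if candidate in available_set:
            return candidate
    return None


print(find_label("log_auth.log", ["label_auth.log", "notes.txt"]))
print(find_label("dns_queries.log", ["dns_labels.log"]))
```

Trying the rules in a fixed order makes the pairing deterministic when more than one candidate file exists in the same directory.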

In [None]:
# Create dataset pairs
pairer = DatasetPairer(config)

if all_datasets:
    dataset_pairs = pairer.create_dataset_pairs(all_datasets)
    
    print("Dataset Pairing Results:")
    print("=" * 60)
    
    paired_count = sum(1 for p in dataset_pairs if p.paired)
    unpaired_count = len(dataset_pairs) - paired_count
    
    print(f"\nTotal log files: {len(dataset_pairs)}")
    print(f"Successfully paired: {paired_count}")
    print(f"Unpaired: {unpaired_count}")
    
    print("\nDetailed Pairing:")
    print("-" * 60)
    
    for pair in dataset_pairs:
        status = "PAIRED" if pair.paired else "UNPAIRED"
        log_name = os.path.basename(pair.log_file)
        label_name = os.path.basename(pair.label_file) if pair.label_file else "N/A"
        format_name = pair.ground_truth_format.value if pair.paired else "N/A"
        
        print(f"[{status}] {pair.dataset_name}")
        print(f"  Log: {log_name}")
        print(f"  Label: {label_name}")
        print(f"  Format: {format_name}")
        print()
else:
    dataset_pairs = []
    print("No datasets found to pair. Please check your dataset paths.")

---

## Step 4: Validate Dataset Integrity

Validate that paired datasets are correctly matched:
- Check that files exist
- Verify line counts match (for line-by-line format)
- Ensure all referenced lines exist (for CSV format)
- Validate ground truth label values
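
To make the first two checks concrete, here is a minimal sketch of a file-existence and line-count comparison for the line-by-line format. The helper names (`count_lines`, `check_line_counts`) and the error message wording are assumptions for illustration; `DatasetValidator` may implement these checks differently.

```python
import os
import tempfile
from typing import List


def count_lines(path: str) -> int:
    """Count lines without loading the whole file into memory."""
    with open(path, "r", encoding="utf-8", errors="replace") as fh:
        return sum(1 for _ in fh)


def check_line_counts(log_path: str, label_path: str) -> List[str]:
    """Return a list of error strings; an empty list means the pair looks consistent."""
    errors = [f"missing file: {p}" for p in (log_path, label_path) if not os.path.isfile(p)]
    if errors:
        return errors
    n_log, n_label = count_lines(log_path), count_lines(label_path)
    if n_log != n_label:
        errors.append(f"line count mismatch: {n_log} log lines vs {n_label} label lines")
    return errors


# Demonstration on throwaway files: two log lines but only one label line
with tempfile.TemporaryDirectory() as tmp:
    log_path = os.path.join(tmp, "log_auth.log")
    label_path = os.path.join(tmp, "label_auth.log")
    with open(log_path, "w", encoding="utf-8") as fh:
        fh.write("entry one\nentry two\n")
    with open(label_path, "w", encoding="utf-8") as fh:
        fh.write("benign,0,none\n")
    demo_errors = check_line_counts(log_path, label_path)

print(demo_errors)
```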

In [None]:
# Validate all dataset pairs
validator = DatasetValidator()

if dataset_pairs:
    print("Validating Dataset Pairs...")
    print("=" * 60)
    
    validation_results = validator.validate_all(dataset_pairs)
    
    valid_count = sum(1 for r in validation_results.values() if r.valid)
    invalid_count = len(validation_results) - valid_count
    
    print("\nValidation Summary:")
    print(f"  Valid: {valid_count}")
    print(f"  Invalid: {invalid_count}")
    
    # Show per-dataset details: errors, warnings, or a plain VALID line
    print("\nValidation Details:")
    print("-" * 60)
    
    for dataset_name, result in validation_results.items():
        if not result.valid:
            print(f"\n[INVALID] {dataset_name}")
            for error in result.errors:
                print(f"  ERROR: {error}")
            for warning in result.warnings:
                print(f"  WARNING: {warning}")
        elif result.warnings:
            print(f"\n[VALID with warnings] {dataset_name}")
            for warning in result.warnings:
                print(f"  WARNING: {warning}")
        else:
            print(f"[VALID] {dataset_name}")
else:
    print("No dataset pairs to validate.")

---

## Step 5: Generate Dataset Statistics

Generate comprehensive statistics about the paired datasets, including:
- Total log entries
- Malicious vs benign distribution
- Attack type breakdown
- Per-group statistics
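
As a rough illustration of the first three statistics, the sketch below tallies a label distribution with `collections.Counter`. The input shape (an iterable of `(is_malicious, attack_type)` tuples) and the function name `summarize_labels` are assumptions made for this example; `DatasetStatsGenerator` may compute and report these values differently.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def summarize_labels(labels: Iterable[Tuple[bool, str]]) -> Dict[str, object]:
    """Tally totals, malicious/benign counts, and attack types for malicious entries."""
    total = 0
    malicious = 0
    attack_types = Counter()
    for is_malicious, attack_type in labels:
        total += 1
        if is_malicious:
            malicious += 1
            attack_types[attack_type] += 1
    return {
        "total": total,
        "malicious": malicious,
        "benign": total - malicious,
        # Guard against division by zero for empty datasets
        "malicious_pct": 100.0 * malicious / total if total else 0.0,
        "attack_types": dict(attack_types),
    }


sample = [
    (False, "none"),
    (True, "privilege_escalation"),
    (True, "lateral_movement"),
    (False, "none"),
]
print(summarize_labels(sample))
```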

In [None]:
# Generate dataset statistics
stats_generator = DatasetStatsGenerator()

if dataset_pairs:
    paired_datasets = [p for p in dataset_pairs if p.paired]
    
    if paired_datasets:
        print("Generating Dataset Statistics...")
        stats = stats_generator.generate_stats(dataset_pairs)
        
        # Print detailed report
        stats_generator.print_report(stats)
    else:
        print("No paired datasets available for statistics generation.")
else:
    print("No datasets available for statistics.")

---

## Step 6: Iterate Through Dataset Pairs

Iterate through matched log-ground truth pairs for FTE-HARM processing. This demonstrates how to access each log entry along with its corresponding ground truth label.
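
Conceptually, iterating over a line-by-line pair means walking the log and its label file in lockstep. The generator below is a minimal sketch of that idea under stated assumptions: line numbers start at 1, and a length mismatch is treated as an error. The name `iter_log_with_labels` is hypothetical and the real `DatasetIterator` may behave differently.

```python
from itertools import zip_longest
from typing import Iterable, Iterator, Tuple

_MISSING = object()  # sentinel marking the shorter stream's end


def iter_log_with_labels(
    log_lines: Iterable[str], label_lines: Iterable[str]
) -> Iterator[Tuple[int, str, str]]:
    """Yield (line_number, log_line, label_line), failing if the streams differ in length."""
    pairs = zip_longest(log_lines, label_lines, fillvalue=_MISSING)
    for line_number, (log_line, label_line) in enumerate(pairs, start=1):
        if log_line is _MISSING or label_line is _MISSING:
            raise ValueError(f"log and label streams diverge at line {line_number}")
        yield line_number, log_line.rstrip("\n"), label_line.rstrip("\n")


logs = ["sshd: accepted password for admin\n", "sshd: session closed\n"]
labels = ["malicious,1,privilege_escalation\n", "benign,0,none\n"]
for number, log_line, label_line in iter_log_with_labels(logs, labels):
    print(number, label_line, "|", log_line)
```

Using `zip_longest` with a sentinel, rather than plain `zip`, avoids silently dropping trailing log lines that have no label.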

In [None]:
# Example: Iterate through dataset pairs with a simple processor
iterator = DatasetIterator()

def example_processor(log_line, ground_truth, line_number):
    """
    Example processing function for each log entry.
    
    In real FTE-HARM usage, this would:
    1. Extract entities from log_line using NER
    2. Score all hypotheses
    3. Make triage decision
    
    Returns processing result for collection.
    """
    # Get ground truth values
    if isinstance(ground_truth, GroundTruthEntry):
        is_malicious = ground_truth.is_malicious
        attack_type = ground_truth.attack_type
    else:
        is_malicious = ground_truth.get('binary', 0) == 1
        attack_type = ground_truth.get('attack_type', 'unknown')
    
    return {
        'line': line_number,
        'log_preview': (log_line[:50] + '...') if len(log_line) > 50 else log_line,
        'is_malicious': is_malicious,
        'attack_type': attack_type
    }

# Process a sample of entries (limit for demonstration)
if dataset_pairs:
    paired_datasets = [p for p in dataset_pairs if p.paired]
    
    if paired_datasets:
        print("Iterating through dataset pairs...")
        print("=" * 60)
        
        # Process first dataset as example
        results = iterator.iterate_pairs(paired_datasets[:1], example_processor, verbose=True)
        
        print(f"\nProcessed {len(results)} log entries")
        print("\nSample results (first 5):")
        print("-" * 60)
        
        for r in results[:5]:
            entry = r['result']
            status = "MALICIOUS" if entry['is_malicious'] else "BENIGN"
            print(f"Line {entry['line']} [{status}] {entry['attack_type']}")
            print(f"  {entry['log_preview']}")
            print()
    else:
        print("No paired datasets available for iteration.")
else:
    print("No datasets available for iteration.")

---

## Step 7: FTE-HARM Validation Integration

This section demonstrates how to integrate the dataset loader with FTE-HARM hypothesis validation. The `FTEHARMValidator` class provides a complete workflow for:

1. Loading paired datasets
2. Extracting entities from logs
3. Scoring hypotheses
4. Comparing predictions with ground truth
5. Calculating validation metrics (precision, recall, F1, accuracy)
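
Steps 4 and 5 can be sketched as follows: hypothesis scores are thresholded into binary predictions, compared with ground truth, and reduced to the four metrics. This is an illustrative sketch, not the `FTEHARMValidator` implementation. The function name `binary_metrics` is hypothetical, and treating a score greater than or equal to the threshold as a malicious prediction is an assumption; the default of 0.45 mirrors the `triage_threshold` used in this notebook.

```python
from typing import Dict, Sequence


def binary_metrics(
    scores: Sequence[float], truths: Sequence[bool], threshold: float = 0.45
) -> Dict[str, float]:
    """Compute precision, recall, F1, and accuracy from scores and boolean labels."""
    if len(scores) != len(truths):
        raise ValueError("scores and truths must have the same length")
    tp = fp = tn = fn = 0
    for score, truth in zip(scores, truths):
        predicted = score >= threshold  # assumption: inclusive threshold
        if predicted and truth:
            tp += 1
        elif predicted and not truth:
            fp += 1
        elif not predicted and truth:
            fn += 1
        else:
            tn += 1
    total = tp + fp + tn + fn
    # Each ratio falls back to 0.0 when its denominator is zero
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


demo_scores = [0.65, 0.65, 0.25, 0.25, 0.65]
demo_truths = [True, True, False, True, False]
print(binary_metrics(demo_scores, demo_truths))
```

In this toy input there are 2 true positives, 1 false positive, 1 false negative, and 1 true negative, so precision and recall are both 2/3 and accuracy is 3/5.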

In [None]:
# Example FTE-HARM Validation Integration
# Note: This requires actual FTE-HARM entity extraction and hypothesis scoring functions

# Placeholder functions for demonstration (replace with actual implementations)
def placeholder_entity_extractor(log_line):
    """
    Placeholder entity extractor.
    
    In real implementation, this would use the NER transformer model
    to extract entities like UserName, IPAddress, ProcessName, etc.
    """
    # Return mock entities for demonstration
    return {
        'UserName': 'admin' if 'admin' in log_line.lower() else 'user',
        'ProcessName': 'sshd' if 'ssh' in log_line.lower() else 'unknown',
        'IPAddress': '192.168.1.1'
    }

def placeholder_hypothesis_scorer(entities, hypothesis_config):
    """
    Placeholder hypothesis scorer.
    
    In real implementation, this would calculate P(H|E) using
    Bayesian inference with the hypothesis configuration.
    """
    # Return mock score based on entities
    if entities.get('UserName') == 'admin':
        return {'p_score': 0.65}
    return {'p_score': 0.25}

# Example hypothesis configurations
example_hypothesis_configs = {
    'privilege_escalation': {
        'name': 'Privilege Escalation',
        'prior': 0.15,
        'required_entities': ['UserName', 'ProcessName'],
        'evidence_weights': {'UserName': 0.3, 'ProcessName': 0.4}
    },
    'lateral_movement': {
        'name': 'Lateral Movement', 
        'prior': 0.10,
        'required_entities': ['IPAddress', 'UserName'],
        'evidence_weights': {'IPAddress': 0.5, 'UserName': 0.3}
    }
}

# Run validation (if datasets are available)
if dataset_pairs:
    paired_datasets = [p for p in dataset_pairs if p.paired]
    
    if paired_datasets:
        print("Running FTE-HARM Validation...")
        print("=" * 60)
        print("(Using placeholder functions - replace with actual implementations)")
        print()
        
        fte_validator = FTEHARMValidator()
        
        # Run validation on a subset for demonstration.
        # Stored as fte_results so Step 4's validation_results is not overwritten.
        fte_results = fte_validator.validate(
            pairs=paired_datasets[:1],  # Limit for demo
            entity_extractor=placeholder_entity_extractor,
            hypothesis_scorer=placeholder_hypothesis_scorer,
            hypothesis_configs=example_hypothesis_configs,
            triage_threshold=0.45
        )
        
        # Print validation report
        fte_validator.print_validation_report(fte_results)
    else:
        print("No paired datasets available for FTE-HARM validation.")
else:
    print("No datasets available. Showing example validation report structure:")
    print()
    print("=" * 60)
    print("FTE-HARM VALIDATION REPORT (EXAMPLE)")
    print("=" * 60)
    print("Metrics that would be calculated:")
    print("  - Precision: TP / (TP + FP)")
    print("  - Recall: TP / (TP + FN)")  
    print("  - F1 Score: 2 * (P * R) / (P + R)")
    print("  - Accuracy: (TP + TN) / Total")

---

## Quick Reference: Convenience Functions

The dataset loader provides convenient one-liner functions for common operations:

In [None]:
# Quick Reference: One-liner convenience functions

# 1. Load and pair all datasets in one call
# pairs, stats = load_and_pair_datasets(DATASET_PATHS)

# 2. Validate all pairs
# validation_results = validate_datasets(pairs)

# 3. Iterate with custom processor
# results = iterate_with_groundtruth(pairs, my_processor_fn)

print("Convenience Function Examples:")
print("=" * 60)
print()
print("# Load and pair all datasets")
print("pairs, stats = load_and_pair_datasets(DATASET_PATHS)")
print()
print("# Validate all pairs")
print("validation_results = validate_datasets(pairs)")
print()
print("# Iterate with custom processor")
print("results = iterate_with_groundtruth(pairs, my_processor_fn)")

---

## Validation Checklist

Before proceeding to FTE-HARM hypothesis testing, ensure:

- [ ] Dataset directories scanned successfully
- [ ] All log files identified
- [ ] Ground truth files located
- [ ] Log-ground truth pairing completed
- [ ] Dataset integrity validated
- [ ] No line count mismatches
- [ ] Ground truth format understood
- [ ] Dataset statistics generated
- [ ] Iteration workflow tested
- [ ] Ready for FTE-HARM integration

---

## Expected Dataset Statistics

| Dataset | Total Logs | Malicious % | Primary Attack Types |
|---------|-----------|-------------|---------------------|
| RussellMitchell AITv2 | ~50,000 | ~15% | privilege_escalation, lateral_movement |
| Santos DNS | ~100,000 | ~5% | exfiltration, command_and_control |

---

## Notes for Thesis

**Methodological Contribution:**
The dataset pairing framework ensures validation integrity by establishing explicit correspondence between raw forensic data and verified ground truth annotations. This addresses the methodological challenge of validating probabilistic triage systems against deterministic forensic requirements.

**Technical Finding:**
The line-by-line ground truth format (RussellMitchell) provides stronger validation guarantees than sparse annotation formats because every log line carries a label, so every triage decision can be evaluated against a known label. However, CSV formats with confidence scores enable more nuanced evaluation of threshold calibration.

**Practical Impact:**
Dataset pairing automation reduces manual validation effort from hours to seconds, enabling comprehensive evaluation across multiple datasets and attack types. This is critical for demonstrating FTE-HARM's generalization capability beyond single-domain testing.