# Address-Aware GNN for Cryptographic Function Detection

This notebook provides a complete implementation of a Graph Neural Network for detecting cryptographic functions in binary code.

**Key Features:**
- Address-aware spatial features (novel)
- Multiple GNN architectures (GCN, GAT, SAGE, GIN)
- 100+ features per function
- End-to-end pipeline: Data → Training → Inference

**Notebook Structure:**
1. Setup & Imports
2. Address Feature Extraction
3. Data Loading & Preprocessing
4. GNN Model Architectures
5. Training Pipeline
6. Evaluation & Visualization
7. Inference Pipeline
8. Complete Training & Testing

---
## Part 1: Setup & Imports

In [None]:
# Core imports
import json
import glob
import os
import pickle
import warnings
from pathlib import Path
from collections import Counter, defaultdict
from typing import List, Dict, Tuple, Optional

# Data processing
import numpy as np
import pandas as pd

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Progress bars
from tqdm.notebook import tqdm

# Machine Learning
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    classification_report, confusion_matrix,
    accuracy_score, f1_score, precision_recall_fscore_support
)
from sklearn.preprocessing import LabelEncoder, StandardScaler

# PyTorch
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, Dataset

# PyTorch Geometric
from torch_geometric.data import Data, Batch
from torch_geometric.nn import (
    GCNConv, GATConv, SAGEConv, GINConv,
    global_mean_pool, global_max_pool, global_add_pool,
    BatchNorm, GraphNorm
)

# Settings
warnings.filterwarnings('ignore')
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")
%matplotlib inline

# Set random seeds for reproducibility
torch.manual_seed(42)
np.random.seed(42)

print("✓ All imports successful!")
print(f"PyTorch version: {torch.__version__}")
print(f"Device: {'GPU' if torch.cuda.is_available() else 'CPU'}")

---
## Part 2: Address Feature Extraction

This section implements **novel address-aware features** that capture spatial patterns in binary code.

In [None]:
class AddressFeatureExtractor:
    """
    Extracts advanced address-based features from binary code.
    
    Address features capture spatial patterns, code locality, and
    memory layout information that's crucial for crypto detection.
    """
    
    @staticmethod
    def normalize_address(address: str) -> str:
        """
        Normalize address format to plain hex string.
        
        Handles formats:
        - "00010000" → "00010000"
        - "code:010000" → "010000"
        - "CODE:ABCDEF" → "ABCDEF"
        - "0x10000" → "10000"
        """
        # Handle code: prefix (case-insensitive)
        if address.lower().startswith('code:'):
            address = address[5:]
        # Handle 0x prefix
        if address.startswith('0x') or address.startswith('0X'):
            address = address[2:]
        return address
    
    @staticmethod
    def extract_address_features(address: str) -> Dict[str, float]:
        """
        Extract multiple features from a hexadecimal address.
        
        Returns 10 features:
        - Normalized value, alignment (4/8/16)
        - Section detection (text/data/bss)
        - Entropy, bit patterns
        """
        # Normalize address format
        address = AddressFeatureExtractor.normalize_address(address)
        addr_int = int(address, 16)
        
        features = {
            'addr_value_normalized': addr_int / 0xFFFFFFFF,
            'addr_alignment_4': float(addr_int % 4 == 0),
            'addr_alignment_8': float(addr_int % 8 == 0),
            'addr_alignment_16': float(addr_int % 16 == 0),
            
            'is_text_section': float(0x8000 <= addr_int < 0x100000),
            'is_data_section': float(0x100000 <= addr_int < 0x200000),
            'is_bss_section': float(0x200000 <= addr_int < 0x300000),
            
            'addr_entropy': AddressFeatureExtractor._calculate_hex_entropy(address),
            'addr_ones_ratio': bin(addr_int).count('1') / 32,
            'addr_nibble_variety': len(set(address)) / 16,
        }
        
        return features
    
    @staticmethod
    def _calculate_hex_entropy(hex_string: str) -> float:
        """Calculate Shannon entropy of hex string."""
        if not hex_string:
            return 0.0
        
        hex_string = AddressFeatureExtractor.normalize_address(hex_string)
        freq = Counter(hex_string)
        length = len(hex_string)
        
        entropy = -sum((count/length) * np.log2(count/length)
                      for count in freq.values() if count > 0)
        
        return entropy / 4.0 if length > 0 else 0.0
    
    @staticmethod
    def compute_edge_address_features(src_addr: str, dst_addr: str) -> Dict[str, float]:
        """
        Compute address-based features for control flow edges.
        
        Returns 9 features about jump distances and patterns.
        """
        # Normalize addresses
        src_addr = AddressFeatureExtractor.normalize_address(src_addr)
        dst_addr = AddressFeatureExtractor.normalize_address(dst_addr)
        
        src_int = int(src_addr, 16)
        dst_int = int(dst_addr, 16)
        
        jump_distance = dst_int - src_int
        abs_distance = abs(jump_distance)
        
        features = {
            'jump_distance': jump_distance,
            'abs_jump_distance': abs_distance,
            'jump_distance_log': np.log1p(abs_distance),
            
            'is_forward_jump': float(jump_distance > 0),
            'is_backward_jump': float(jump_distance < 0),
            'is_short_jump': float(abs_distance < 256),
            'is_long_jump': float(abs_distance > 4096),
            
            'alignment_preserved': float((src_int % 16) == (dst_int % 16)),
            'crosses_section': float(abs_distance > 0x10000),
        }
        
        return features
    
    @staticmethod
    def compute_graph_address_features(addresses: List[str]) -> Dict[str, float]:
        """
        Compute global address features for entire function graph.
        
        Returns 5 features about code layout and density.
        """
        if not addresses:
            return {f'graph_addr_{k}': 0.0 for k in [
                'span', 'span_log', 'density', 'avg_gap', 'locality_score'
            ]}
        
        # Normalize all addresses
        normalized_addresses = [AddressFeatureExtractor.normalize_address(addr) for addr in addresses]
        addr_ints = sorted([int(addr, 16) for addr in normalized_addresses])
        
        span = addr_ints[-1] - addr_ints[0] if len(addr_ints) > 1 else 0
        density = len(addr_ints) / (span + 1) if span > 0 else 1.0
        
        gaps = [addr_ints[i+1] - addr_ints[i] for i in range(len(addr_ints)-1)]
        avg_gap = np.mean(gaps) if gaps else 0
        locality_score = 1.0 / (1.0 + np.log1p(avg_gap))
        
        return {
            'graph_addr_span': span,
            'graph_addr_span_log': np.log1p(span),
            'graph_addr_density': density,
            'graph_addr_avg_gap': avg_gap,
            'graph_addr_locality_score': locality_score,
        }

print("✓ AddressFeatureExtractor defined")

# Test it
test_addr = "code:010000"
features = AddressFeatureExtractor.extract_address_features(test_addr)
print(f"\nTest: '{test_addr}' → {len(features)} features extracted")
for k, v in list(features.items())[:3]:
    print(f"  {k}: {v}")
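
`addr_entropy` is divided by 4.0 because a single hex digit can carry at most log2(16) = 4 bits, so the scaled value always falls in [0, 1]. A minimal standalone sketch of that normalization (the function name `hex_entropy` is illustrative and not part of the notebook):

```python
import math
from collections import Counter

def hex_entropy(hex_string: str) -> float:
    """Shannon entropy of the hex digits, scaled to [0, 1] by log2(16) = 4 bits."""
    if not hex_string:
        return 0.0
    counts = Counter(hex_string)
    n = len(hex_string)
    # Sum of -p * log2(p) over the observed digits.
    bits = sum(-(c / n) * math.log2(c / n) for c in counts.values())
    return bits / 4.0

print(hex_entropy("000000"))            # a single repeated digit: no variety
print(hex_entropy("010000"))            # mostly zeros: low entropy
print(hex_entropy("0123456789abcdef"))  # every digit once: maximum entropy
```

Addresses with long runs of zeros, which are common for aligned code, score near the bottom of the range.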

---
## Part 3: Dataset & Data Loading

Load Ghidra JSON files and convert to graph representations.

In [None]:
# Configuration
CONFIG = {
    'data_dir': '/home/bhoomi/Desktop/compilerRepo/vestigo-data/ghidra_json',
    'output_dir': './gnn_outputs',
    'model_dir': './gnn_models',
    
    # Model hyperparameters
    'hidden_dim': 256,
    'num_layers': 4,
    'dropout': 0.3,
    'conv_type': 'gat',  # 'gcn', 'gat', 'sage', 'gin'
    'pooling': 'concat',
    
    # Training hyperparameters
    'batch_size': 32,
    'num_epochs': 110,  # Reduced for notebook
    'lr': 0.001,
    'weight_decay': 1e-4,
    
    # Data split
    'train_ratio': 0.7,
    'val_ratio': 0.15,
    'test_ratio': 0.15,
}

# Create output directories
os.makedirs(CONFIG['output_dir'], exist_ok=True)
os.makedirs(CONFIG['model_dir'], exist_ok=True)

print("Configuration:")
for k, v in CONFIG.items():
    print(f"  {k}: {v}")

In [None]:
# Load JSON files
json_files = glob.glob(os.path.join(CONFIG['data_dir'], '*.json'))
print(f"Found {len(json_files)} JSON files")

# Quick preview of a file
if json_files:
    print(f"\nSample file: {os.path.basename(json_files[0])}")
    with open(json_files[0], 'r') as f:
        sample_data = json.load(f)
    print(f"  Metadata: {sample_data.get('metadata', {})}")
    print(f"  Number of functions: {len(sample_data.get('functions', []))}")
    if sample_data.get('functions'):
        func = sample_data['functions'][0]
        print(f"  Sample function address: {func.get('address', 'N/A')}")
        print(f"  Sample function label: {func.get('label', 'N/A')}")

### Analyze Label Distribution

In [None]:
# Analyze label distribution
all_labels = []
all_complexities = []

print("Analyzing dataset...")
for json_file in tqdm(json_files[:50], desc="Loading files"):  # Sample for speed
    try:
        with open(json_file, 'r') as f:
            data = json.load(f)
        
        for func in data.get('functions', []):
            if 'label' in func:
                all_labels.append(func['label'])
                complexity = func.get('graph_level', {}).get('cyclomatic_complexity', 0)
                all_complexities.append(complexity)
    except Exception as e:
        print(f"Error loading {json_file}: {e}")
        continue

print(f"\nTotal functions analyzed: {len(all_labels)}")

# Plot distribution
label_counts = Counter(all_labels)
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Bar chart
labels, counts = zip(*label_counts.most_common(15))
axes[0].barh(range(len(labels)), counts, color='steelblue')
axes[0].set_yticks(range(len(labels)))
axes[0].set_yticklabels(labels)
axes[0].set_xlabel('Count')
axes[0].set_title('Label Distribution (Top 15)', fontweight='bold')
axes[0].grid(axis='x', alpha=0.3)

# Pie chart
top_10 = label_counts.most_common(10)
labels_pie, counts_pie = zip(*top_10)
axes[1].pie(counts_pie, labels=labels_pie, autopct='%1.1f%%', startangle=90)
axes[1].set_title('Top 10 Labels', fontweight='bold')

plt.tight_layout()
plt.savefig(os.path.join(CONFIG['output_dir'], 'label_distribution.png'), dpi=300, bbox_inches='tight')
plt.show()

print("\nLabel counts:")
for label, count in label_counts.most_common():
    print(f"  {label}: {count}")

---
## Part 4: Graph Dataset Implementation

Convert JSON functions to PyTorch Geometric graphs.

In [None]:
# The full GraphDataset implementation lives in new_gnn.py;
# the notebook imports it from there instead of redefining it.

import sys
sys.path.insert(0, '/home/bhoomi/Desktop/compilerRepo/vestigo-data/ml')

from new_gnn import GraphDataset, collate_fn

print("✓ GraphDataset imported from new_gnn.py")
print("\nNote: Full GraphDataset implementation is in new_gnn.py")
print("It includes:")
print("  - JSON parsing")
print("  - Graph construction (nodes, edges)")
print("  - Feature extraction (100+ features)")
print("  - Feature scaling (StandardScaler)")
print("  - Label encoding")

### Load Training Data

In [None]:
# Split data
train_files, test_files = train_test_split(
    json_files,
    test_size=CONFIG['test_ratio'],
    random_state=42
)

train_files, val_files = train_test_split(
    train_files,
    test_size=CONFIG['val_ratio'] / (CONFIG['train_ratio'] + CONFIG['val_ratio']),
    random_state=42
)

print(f"Data split:")
print(f"  Train: {len(train_files)} files")
print(f"  Val: {len(val_files)} files")
print(f"  Test: {len(test_files)} files")
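
The second split's `test_size` is measured against the files left after the test set has been removed, which is why it is `val_ratio / (train_ratio + val_ratio)` rather than `val_ratio` itself. The arithmetic below uses the ratios from `CONFIG`; the file count of 1000 is a made-up figure for illustration:

```python
train_ratio, val_ratio, test_ratio = 0.7, 0.15, 0.15  # values from CONFIG
n_files = 1000  # hypothetical file count, for illustration only

# First split: hold out the test fraction.
remaining = 1.0 - test_ratio

# Second split: validation as a fraction of what remains.
val_fraction_of_remaining = val_ratio / (train_ratio + val_ratio)

final_val = remaining * val_fraction_of_remaining
final_train = remaining - final_val

print(f"val share of remaining files: {val_fraction_of_remaining:.4f}")
print(f"final train/val/test: {final_train:.2f}/{final_val:.2f}/{test_ratio:.2f}")
print(
    "approx. files: "
    f"{round(final_train * n_files)}/{round(final_val * n_files)}/{round(test_ratio * n_files)}"
)
```

`train_test_split` rounds each split to whole files, so the real counts can differ slightly from these fractions.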

In [None]:
# Load datasets
print("Loading training dataset...")
train_dataset = GraphDataset(train_files)

print("\nLoading validation dataset...")
val_dataset = GraphDataset(val_files, train_dataset.label_encoder)
val_dataset.node_scaler = train_dataset.node_scaler
val_dataset.edge_scaler = train_dataset.edge_scaler
val_dataset.graph_scaler = train_dataset.graph_scaler

print("\nLoading test dataset...")
test_dataset = GraphDataset(test_files, train_dataset.label_encoder)
test_dataset.node_scaler = train_dataset.node_scaler
test_dataset.edge_scaler = train_dataset.edge_scaler
test_dataset.graph_scaler = train_dataset.graph_scaler

print(f"\n✓ Datasets loaded successfully!")
print(f"Classes: {train_dataset.label_encoder.classes_}")
print(f"\nDataset sizes:")
print(f"  Train: {len(train_dataset)} functions")
print(f"  Val: {len(val_dataset)} functions")
print(f"  Test: {len(test_dataset)} functions")
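
Reusing the training set's scalers for the validation and test sets keeps statistics from held-out files out of the preprocessing. A minimal sketch of the principle in plain Python follows. The feature values are invented, and the notebook's actual scaling happens inside `GraphDataset` in `new_gnn.py`, which is not shown here:

```python
from statistics import fmean, pstdev

train_values = [2.0, 4.0, 6.0, 8.0]   # invented feature column from training graphs
val_values = [5.0, 11.0]              # invented feature column from validation graphs

# Fit: statistics come from the training data only.
mean = fmean(train_values)
std = pstdev(train_values) or 1.0     # fall back to 1.0 for a constant column

def transform(values):
    """Standardize values with the training mean and standard deviation."""
    return [(v - mean) / std for v in values]

train_scaled = transform(train_values)
val_scaled = transform(val_values)    # same statistics, never refit on validation data

print(mean, std)
print(val_scaled)
```

The sketch uses the population standard deviation, which matches `StandardScaler`'s default. Assigning the scalers after construction, as the cell above does, only has an effect if `GraphDataset` applies them after `__init__` has finished; whether it does depends on `new_gnn.py`.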

In [None]:
# Verify edge feature dimensions are correct
print("Verifying edge feature fix...")
print("-" * 60)

# Check the actual dimension used in new_gnn.py
test_graph = train_dataset.graphs[0]
edge_dim = test_graph['edge_features'].shape[1]

print(f"Edge feature dimension: {edge_dim}")

# Verify consistency across the first 50 graphs
EXPECTED_EDGE_DIM = 13  # edge feature width this notebook expects from new_gnn.py
edge_dims = {graph['edge_features'].shape[1] for graph in train_dataset.graphs[:50]}

if edge_dims == {EXPECTED_EDGE_DIM}:
    print(f"✓ All graphs have consistent edge feature dimension: {edge_dims}")
    print("✓ Edge feature fix is working correctly!")
else:
    print(f"Unexpected or inconsistent edge dimensions: {edge_dims} (expected {EXPECTED_EDGE_DIM})")
    print("   You may need to restart the kernel and reload new_gnn.py")

print("-" * 60)

In [None]:
# Create data loaders
train_loader = DataLoader(
    train_dataset,
    batch_size=CONFIG['batch_size'],
    shuffle=True,
    collate_fn=collate_fn
)

val_loader = DataLoader(
    val_dataset,
    batch_size=CONFIG['batch_size'],
    shuffle=False,
    collate_fn=collate_fn
)

test_loader = DataLoader(
    test_dataset,
    batch_size=CONFIG['batch_size'],
    shuffle=False,
    collate_fn=collate_fn
)

print(f"Data loaders created:")
print(f"  Train batches: {len(train_loader)}")
print(f"  Val batches: {len(val_loader)}")
print(f"  Test batches: {len(test_loader)}")

# Check a sample batch
sample_batch = next(iter(train_loader))
print(f"\nSample batch:")
print(f"  Graphs: {sample_batch.num_graphs}")
print(f"  Total nodes: {sample_batch.x.shape[0]}")
print(f"  Node features: {sample_batch.x.shape[1]}")
print(f"  Edges: {sample_batch.edge_index.shape[1]}")
print(f"  Graph features: {sample_batch.graph_features.shape[-1]}")
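
The sample batch reports one node count and one edge count because graph mini-batches are usually built as a single large disconnected graph. Node features are concatenated, each graph's edge indices are shifted by the number of nodes that precede it, and a `batch` vector records which graph every node belongs to. The plain-Python sketch below assumes `collate_fn` follows this usual PyTorch Geometric convention; the helper `batch_graphs` and the toy graphs are illustrative only:

```python
def batch_graphs(graphs):
    """Merge graphs into one disjoint graph.

    Each graph is a dict with 'num_nodes' and 'edges', where 'edges' is a list
    of (src, dst) pairs using node indices local to that graph.
    """
    edges, batch, offset = [], [], 0
    for graph_id, g in enumerate(graphs):
        # Shift local node indices so they stay unique in the merged graph.
        edges.extend((src + offset, dst + offset) for src, dst in g["edges"])
        # Record the owning graph for every node; pooling layers rely on this.
        batch.extend([graph_id] * g["num_nodes"])
        offset += g["num_nodes"]
    return {"num_nodes": offset, "edges": edges, "batch": batch}

g1 = {"num_nodes": 3, "edges": [(0, 1), (1, 2)]}
g2 = {"num_nodes": 2, "edges": [(0, 1), (1, 0)]}
merged = batch_graphs([g1, g2])

print(merged["edges"])  # [(0, 1), (1, 2), (3, 4), (4, 3)]
print(merged["batch"])  # [0, 0, 0, 1, 1]
```

Because no edge crosses between graphs, message passing stays within each function's graph, and the `batch` vector lets the pooling step reduce nodes per graph.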

---
## Part 5: GNN Model Architecture

Define the Address-Aware GNN model.

In [None]:
# Import model from new_gnn.py
from new_gnn import AddressAwareGNN, HierarchicalGNN

print("✓ GNN models imported")
print("\nAvailable architectures:")
print("  1. AddressAwareGNN - Main model (GCN/GAT/SAGE/GIN)")
print("  2. HierarchicalGNN - Alternative with attention pooling")

In [None]:
# Get feature dimensions
sample = train_dataset[0]
num_node_features = sample.x.shape[1]
num_edge_features = sample.edge_attr.shape[1] if sample.edge_attr.numel() > 0 else 0
num_graph_features = sample.graph_features.shape[0]
num_classes = len(train_dataset.label_encoder.classes_)

print(f"Feature dimensions:")
print(f"  Node features: {num_node_features}")
print(f"  Edge features: {num_edge_features}")
print(f"  Graph features: {num_graph_features}")
print(f"  Number of classes: {num_classes}")

# Build model
model = AddressAwareGNN(
    num_node_features=num_node_features,
    num_edge_features=num_edge_features,
    num_graph_features=num_graph_features,
    num_classes=num_classes,
    hidden_dim=CONFIG['hidden_dim'],
    num_layers=CONFIG['num_layers'],
    dropout=CONFIG['dropout'],
    conv_type=CONFIG['conv_type'],
    pooling=CONFIG['pooling'],
)

total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f"\n✓ Model built: {CONFIG['conv_type'].upper()}")
print(f"  Total parameters: {total_params:,}")
print(f"  Trainable parameters: {trainable_params:,}")
print(f"\nModel summary:")
print(model)

---
## Part 6: Training Pipeline

In [None]:
# Import trainer
from new_gnn import GNNTrainer

# Auto-detect device (GPU if available, CPU otherwise); torch is already imported in Part 1
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print("="*60)
print(f"Using device: {device.upper()}")
if device == 'cuda':
    print(f"   GPU: {torch.cuda.get_device_name(0)}")
    print(f"   Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
    print("   Expected training time: ~30-45 minutes")
else:
    print("   Expected training time: ~2-3 hours")
print("="*60)
print()

# Create trainer (auto-selects device)
trainer = GNNTrainer(
    model=model,
    train_loader=train_loader,
    val_loader=val_loader,
    test_loader=test_loader,
    label_encoder=train_dataset.label_encoder,
    device=device,  # Auto-detected device
    lr=CONFIG['lr'],
    weight_decay=CONFIG['weight_decay']
)

print("✓ Trainer created")
print(f"  Device: {trainer.device}")
print(f"  Optimizer: AdamW")
print(f"  Learning rate: {CONFIG['lr']}")
print(f"  Weight decay: {CONFIG['weight_decay']}")
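
In this notebook's training output, a "Saved best model" line follows each epoch whose validation F1 exceeds every earlier epoch's, which suggests the trainer checkpoints on validation F1. The sketch below replays that selection rule on the first six validation F1 values from the log. The function name is illustrative, and the real logic lives in `GNNTrainer` in `new_gnn.py` and may differ:

```python
def epochs_that_checkpoint(val_f1_per_epoch):
    """Return the 1-based epochs at which validation F1 sets a new best."""
    best_f1 = float("-inf")
    saved = []
    for epoch, f1 in enumerate(val_f1_per_epoch, start=1):
        if f1 > best_f1:  # only a strict improvement triggers a save
            best_f1 = f1
            saved.append(epoch)
    return saved

# Validation F1 for epochs 1-6, copied from the training output.
val_f1 = [0.3012, 0.3504, 0.3220, 0.3539, 0.3834, 0.3610]
print(epochs_that_checkpoint(val_f1))  # [1, 2, 4, 5]
```

Selecting on validation F1 rather than training accuracy matters here because the two diverge in the log: training accuracy keeps rising while validation loss fluctuates.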

### Train the Model

In [14]:
# Train the model
print(f"Starting training for {CONFIG['num_epochs']} epochs...\n")

trainer.train(
    num_epochs=CONFIG['num_epochs'],
    save_dir=CONFIG['model_dir']
)

print("\n✓ Training complete!")

Train Loss: 1.1063 | Train Acc: 0.7034
Val Loss: 2.3050 | Val Acc: 0.4121 | Val F1: 0.3012
✓ Saved best model (F1: 0.3012)

Epoch 2/110
--------------------------------------------------


                                                           

Train Loss: 0.9230 | Train Acc: 0.7357
Val Loss: 1.9124 | Val Acc: 0.4643 | Val F1: 0.3504
✓ Saved best model (F1: 0.3504)

Epoch 3/110
--------------------------------------------------


                                                           

Train Loss: 0.8640 | Train Acc: 0.7421
Val Loss: 1.9381 | Val Acc: 0.4203 | Val F1: 0.3220

Epoch 4/110
--------------------------------------------------


                                                           

Train Loss: 0.8203 | Train Acc: 0.7437
Val Loss: 1.7220 | Val Acc: 0.4794 | Val F1: 0.3539
✓ Saved best model (F1: 0.3539)

Epoch 5/110
--------------------------------------------------


                                                           

Train Loss: 0.7857 | Train Acc: 0.7512
Val Loss: 1.6731 | Val Acc: 0.4973 | Val F1: 0.3834
✓ Saved best model (F1: 0.3834)

Epoch 6/110
--------------------------------------------------


                                                           

Train Loss: 0.7401 | Train Acc: 0.7586
Val Loss: 2.0508 | Val Acc: 0.4519 | Val F1: 0.3610

Epoch 7/110
--------------------------------------------------


                                                           

Train Loss: 0.7242 | Train Acc: 0.7668
Val Loss: 1.9080 | Val Acc: 0.4821 | Val F1: 0.3793

Epoch 8/110
--------------------------------------------------


                                                           

Train Loss: 0.7065 | Train Acc: 0.7669
Val Loss: 1.6416 | Val Acc: 0.4821 | Val F1: 0.3871
✓ Saved best model (F1: 0.3871)

Epoch 9/110
--------------------------------------------------


                                                           

Train Loss: 0.6630 | Train Acc: 0.7766
Val Loss: 1.7804 | Val Acc: 0.4986 | Val F1: 0.3908
✓ Saved best model (F1: 0.3908)

Epoch 10/110
--------------------------------------------------


                                                           

Train Loss: 0.6518 | Train Acc: 0.7805
Val Loss: 1.6490 | Val Acc: 0.4712 | Val F1: 0.3733

Epoch 11/110
--------------------------------------------------


                                                           

Train Loss: 0.6479 | Train Acc: 0.7785
Val Loss: 1.5852 | Val Acc: 0.5124 | Val F1: 0.4278
✓ Saved best model (F1: 0.4278)

Epoch 12/110
--------------------------------------------------


                                                           

Train Loss: 0.6243 | Train Acc: 0.7860
Val Loss: 1.7267 | Val Acc: 0.4615 | Val F1: 0.4036

Epoch 13/110
--------------------------------------------------


                                                           

Train Loss: 0.6076 | Train Acc: 0.7912
Val Loss: 1.5981 | Val Acc: 0.4780 | Val F1: 0.4051

Epoch 14/110
--------------------------------------------------


                                                           

Train Loss: 0.5918 | Train Acc: 0.7963
Val Loss: 1.4836 | Val Acc: 0.5124 | Val F1: 0.4810
✓ Saved best model (F1: 0.4810)

Epoch 15/110
--------------------------------------------------


                                                           

Train Loss: 0.5779 | Train Acc: 0.7987
Val Loss: 1.6729 | Val Acc: 0.4931 | Val F1: 0.4333

Epoch 16/110
--------------------------------------------------


                                                           

Train Loss: 0.5725 | Train Acc: 0.8068
Val Loss: 1.5792 | Val Acc: 0.5275 | Val F1: 0.4538

Epoch 17/110
--------------------------------------------------


                                                           

Train Loss: 0.5485 | Train Acc: 0.8093
Val Loss: 1.6835 | Val Acc: 0.4890 | Val F1: 0.4155

Epoch 18/110
--------------------------------------------------


                                                           

Train Loss: 0.5450 | Train Acc: 0.8117
Val Loss: 1.7310 | Val Acc: 0.5220 | Val F1: 0.4686

Epoch 19/110
--------------------------------------------------


                                                           

Train Loss: 0.5332 | Train Acc: 0.8167
Val Loss: 1.5377 | Val Acc: 0.5096 | Val F1: 0.4532

Epoch 20/110
--------------------------------------------------


                                                           

Train Loss: 0.5111 | Train Acc: 0.8198
Val Loss: 1.5996 | Val Acc: 0.5110 | Val F1: 0.4813
✓ Saved best model (F1: 0.4813)

Epoch 21/110
--------------------------------------------------


                                                           

Train Loss: 0.5179 | Train Acc: 0.8182
Val Loss: 1.7024 | Val Acc: 0.4808 | Val F1: 0.4293

Epoch 22/110
--------------------------------------------------


                                                           

Train Loss: 0.5103 | Train Acc: 0.8215
Val Loss: 1.5889 | Val Acc: 0.5082 | Val F1: 0.4418

Epoch 23/110
--------------------------------------------------


                                                           

Train Loss: 0.5108 | Train Acc: 0.8198
Val Loss: 1.8152 | Val Acc: 0.5124 | Val F1: 0.4834
✓ Saved best model (F1: 0.4834)

Epoch 24/110
--------------------------------------------------


                                                           

Train Loss: 0.4914 | Train Acc: 0.8287
Val Loss: 1.6433 | Val Acc: 0.5000 | Val F1: 0.4468

Epoch 25/110
--------------------------------------------------


                                                           

Train Loss: 0.4664 | Train Acc: 0.8411
Val Loss: 1.4883 | Val Acc: 0.5618 | Val F1: 0.5197
✓ Saved best model (F1: 0.5197)

Epoch 26/110
--------------------------------------------------


                                                           

Train Loss: 0.4647 | Train Acc: 0.8367
Val Loss: 1.5981 | Val Acc: 0.5275 | Val F1: 0.4727

Epoch 27/110
--------------------------------------------------


                                                           

Train Loss: 0.4516 | Train Acc: 0.8396
Val Loss: 1.4648 | Val Acc: 0.5549 | Val F1: 0.5331
✓ Saved best model (F1: 0.5331)

Epoch 28/110
--------------------------------------------------


                                                           

Train Loss: 0.4603 | Train Acc: 0.8441
Val Loss: 1.5038 | Val Acc: 0.5632 | Val F1: 0.5217

Epoch 29/110
--------------------------------------------------


                                                           

Train Loss: 0.4495 | Train Acc: 0.8428
Val Loss: 1.7386 | Val Acc: 0.5508 | Val F1: 0.4933

Epoch 30/110
--------------------------------------------------


                                                           

Train Loss: 0.4394 | Train Acc: 0.8440
Val Loss: 1.4575 | Val Acc: 0.5481 | Val F1: 0.5045

Epoch 31/110
--------------------------------------------------


                                                           

Train Loss: 0.4345 | Train Acc: 0.8454
Val Loss: 1.6540 | Val Acc: 0.5522 | Val F1: 0.5167

Epoch 32/110
--------------------------------------------------


                                                           

Train Loss: 0.4261 | Train Acc: 0.8519
Val Loss: 1.5736 | Val Acc: 0.5824 | Val F1: 0.5558
✓ Saved best model (F1: 0.5558)

Epoch 33/110
--------------------------------------------------


                                                           

Train Loss: 0.4217 | Train Acc: 0.8523
Val Loss: 1.6112 | Val Acc: 0.5481 | Val F1: 0.4845

Epoch 34/110
--------------------------------------------------


                                                           

Train Loss: 0.4150 | Train Acc: 0.8527
Val Loss: 1.7488 | Val Acc: 0.5261 | Val F1: 0.4971

Epoch 35/110
--------------------------------------------------


                                                           

Train Loss: 0.4200 | Train Acc: 0.8530
Val Loss: 1.6491 | Val Acc: 0.5604 | Val F1: 0.5343

Epoch 36/110
--------------------------------------------------


                                                           

Train Loss: 0.4096 | Train Acc: 0.8590
Val Loss: 1.7401 | Val Acc: 0.5398 | Val F1: 0.4962

Epoch 37/110
--------------------------------------------------


                                                           

Train Loss: 0.4046 | Train Acc: 0.8542
Val Loss: 1.7765 | Val Acc: 0.5247 | Val F1: 0.4823

Epoch 38/110
--------------------------------------------------


                                                           

Train Loss: 0.3959 | Train Acc: 0.8596
Val Loss: 1.5008 | Val Acc: 0.5783 | Val F1: 0.5524

Epoch 39/110
--------------------------------------------------


                                                           

Train Loss: 0.3822 | Train Acc: 0.8669
Val Loss: 1.6081 | Val Acc: 0.5769 | Val F1: 0.5527

Epoch 40/110
--------------------------------------------------


                                                           

Train Loss: 0.3861 | Train Acc: 0.8618
Val Loss: 1.8039 | Val Acc: 0.5687 | Val F1: 0.5313

Epoch 41/110
--------------------------------------------------


                                                           

Train Loss: 0.3782 | Train Acc: 0.8634
Val Loss: 1.7228 | Val Acc: 0.5508 | Val F1: 0.5043

Epoch 42/110
--------------------------------------------------


                                                           

Train Loss: 0.3882 | Train Acc: 0.8635
Val Loss: 1.5945 | Val Acc: 0.5701 | Val F1: 0.5389

Epoch 43/110
--------------------------------------------------


                                                           

Train Loss: 0.3635 | Train Acc: 0.8681
Val Loss: 1.7359 | Val Acc: 0.5879 | Val F1: 0.5594
✓ Saved best model (F1: 0.5594)

Epoch 44/110
--------------------------------------------------


                                                           

Train Loss: 0.3654 | Train Acc: 0.8732
Val Loss: 1.6779 | Val Acc: 0.5673 | Val F1: 0.5272

Epoch 45/110
--------------------------------------------------


                                                           

Train Loss: 0.3547 | Train Acc: 0.8741
Val Loss: 1.7222 | Val Acc: 0.5920 | Val F1: 0.5556

Epoch 46/110
--------------------------------------------------


                                                           

Train Loss: 0.3529 | Train Acc: 0.8765
Val Loss: 1.5901 | Val Acc: 0.5824 | Val F1: 0.5609
✓ Saved best model (F1: 0.5609)

Epoch 47/110
--------------------------------------------------


                                                           

Train Loss: 0.3429 | Train Acc: 0.8778
Val Loss: 1.7064 | Val Acc: 0.5755 | Val F1: 0.5396

Epoch 48/110
--------------------------------------------------


                                                           

Train Loss: 0.3457 | Train Acc: 0.8771
Val Loss: 1.7276 | Val Acc: 0.5810 | Val F1: 0.5465

Epoch 49/110
--------------------------------------------------


                                                           

Train Loss: 0.3468 | Train Acc: 0.8786
Val Loss: 1.8230 | Val Acc: 0.5440 | Val F1: 0.5278

Epoch 50/110
--------------------------------------------------


                                                           

Train Loss: 0.3375 | Train Acc: 0.8786
Val Loss: 1.8198 | Val Acc: 0.5852 | Val F1: 0.5671
✓ Saved best model (F1: 0.5671)

Epoch 51/110
--------------------------------------------------


                                                           

Train Loss: 0.3467 | Train Acc: 0.8808
Val Loss: 2.0476 | Val Acc: 0.6003 | Val F1: 0.5804
✓ Saved best model (F1: 0.5804)

Epoch 52/110
--------------------------------------------------


                                                           

Train Loss: 0.3303 | Train Acc: 0.8809
Val Loss: 1.6317 | Val Acc: 0.5989 | Val F1: 0.5783

Epoch 53/110
--------------------------------------------------


                                                           

Train Loss: 0.3288 | Train Acc: 0.8808
Val Loss: 1.9575 | Val Acc: 0.5810 | Val F1: 0.5489

Epoch 54/110
--------------------------------------------------


                                                           

Train Loss: 0.3051 | Train Acc: 0.8916
Val Loss: 1.9381 | Val Acc: 0.5742 | Val F1: 0.5605

Epoch 55/110
--------------------------------------------------


                                                           

Train Loss: 0.3178 | Train Acc: 0.8882
Val Loss: 1.9518 | Val Acc: 0.5659 | Val F1: 0.5485

Epoch 56/110
--------------------------------------------------


                                                           

Train Loss: 0.3204 | Train Acc: 0.8838
Val Loss: 1.7166 | Val Acc: 0.5865 | Val F1: 0.5672

Epoch 57/110
--------------------------------------------------


                                                           

Train Loss: 0.3162 | Train Acc: 0.8903
Val Loss: 1.5785 | Val Acc: 0.5948 | Val F1: 0.5634

Epoch 58/110
--------------------------------------------------


                                                           

Train Loss: 0.3060 | Train Acc: 0.8905
Val Loss: 1.9601 | Val Acc: 0.5632 | Val F1: 0.5362

Epoch 59/110
--------------------------------------------------


                                                           

Train Loss: 0.2982 | Train Acc: 0.8939
Val Loss: 2.0419 | Val Acc: 0.5659 | Val F1: 0.5324

Epoch 60/110
--------------------------------------------------


                                                           

Train Loss: 0.3124 | Train Acc: 0.8888
Val Loss: 1.7052 | Val Acc: 0.5879 | Val F1: 0.5649

Epoch 61/110
--------------------------------------------------


                                                           

Train Loss: 0.3102 | Train Acc: 0.8927
Val Loss: 2.0365 | Val Acc: 0.5989 | Val F1: 0.5781

Epoch 62/110
--------------------------------------------------


                                                           

Train Loss: 0.3050 | Train Acc: 0.8903
Val Loss: 2.0218 | Val Acc: 0.5797 | Val F1: 0.5458

Epoch 63/110
--------------------------------------------------


                                                           

Train Loss: 0.2776 | Train Acc: 0.8989
Val Loss: 1.5917 | Val Acc: 0.5948 | Val F1: 0.5781

Epoch 64/110
--------------------------------------------------


                                                           

Train Loss: 0.2583 | Train Acc: 0.9069
Val Loss: 1.6694 | Val Acc: 0.5962 | Val F1: 0.5769

Epoch 65/110
--------------------------------------------------


                                                           

Train Loss: 0.2562 | Train Acc: 0.9057
Val Loss: 1.7644 | Val Acc: 0.6030 | Val F1: 0.5800

Epoch 66/110
--------------------------------------------------


                                                           

Train Loss: 0.2568 | Train Acc: 0.9101
Val Loss: 1.6581 | Val Acc: 0.6099 | Val F1: 0.6054
✓ Saved best model (F1: 0.6054)

Epoch 67/110
--------------------------------------------------


                                                           

Train Loss: 0.2531 | Train Acc: 0.9132
Val Loss: 1.7661 | Val Acc: 0.6003 | Val F1: 0.5754

Epoch 68/110
--------------------------------------------------


                                                           

Train Loss: 0.2462 | Train Acc: 0.9133
Val Loss: 1.9124 | Val Acc: 0.6058 | Val F1: 0.5837

Epoch 69/110
--------------------------------------------------


                                                           

Train Loss: 0.2431 | Train Acc: 0.9142
Val Loss: 1.8718 | Val Acc: 0.6140 | Val F1: 0.5808

Epoch 70/110
--------------------------------------------------


                                                           

Train Loss: 0.2369 | Train Acc: 0.9154
Val Loss: 1.7212 | Val Acc: 0.6140 | Val F1: 0.5893

Epoch 71/110
--------------------------------------------------


                                                           

Train Loss: 0.2377 | Train Acc: 0.9148
Val Loss: 2.1275 | Val Acc: 0.5769 | Val F1: 0.5409

Epoch 72/110
--------------------------------------------------


                                                           

Train Loss: 0.2282 | Train Acc: 0.9190
Val Loss: 1.9259 | Val Acc: 0.5934 | Val F1: 0.5645

Epoch 73/110
--------------------------------------------------


                                                           

Train Loss: 0.2209 | Train Acc: 0.9235
Val Loss: 1.7963 | Val Acc: 0.6250 | Val F1: 0.6023

Epoch 74/110
--------------------------------------------------


                                                           

Train Loss: 0.2362 | Train Acc: 0.9174
Val Loss: 2.1784 | Val Acc: 0.5962 | Val F1: 0.5483

Epoch 75/110
--------------------------------------------------


                                                           

Train Loss: 0.2394 | Train Acc: 0.9183
Val Loss: 1.7632 | Val Acc: 0.5975 | Val F1: 0.5728

Epoch 76/110
--------------------------------------------------


                                                           

Train Loss: 0.2328 | Train Acc: 0.9142
Val Loss: 1.9784 | Val Acc: 0.6058 | Val F1: 0.5828

Epoch 77/110
--------------------------------------------------


                                                           

Train Loss: 0.2277 | Train Acc: 0.9202
Val Loss: 1.9605 | Val Acc: 0.6113 | Val F1: 0.5888

Epoch 78/110
--------------------------------------------------


                                                           

Train Loss: 0.2195 | Train Acc: 0.9205
Val Loss: 1.9530 | Val Acc: 0.6209 | Val F1: 0.5998

Epoch 79/110
--------------------------------------------------


                                                           

Train Loss: 0.2054 | Train Acc: 0.9282
Val Loss: 1.9152 | Val Acc: 0.6154 | Val F1: 0.5920

Epoch 80/110
--------------------------------------------------


                                                           

Train Loss: 0.1963 | Train Acc: 0.9281
Val Loss: 1.8130 | Val Acc: 0.6305 | Val F1: 0.6146
✓ Saved best model (F1: 0.6146)

Epoch 81/110
--------------------------------------------------


                                                           

Train Loss: 0.2100 | Train Acc: 0.9269
Val Loss: 2.0339 | Val Acc: 0.5865 | Val F1: 0.5626

Epoch 82/110
--------------------------------------------------


                                                           

Train Loss: 0.1955 | Train Acc: 0.9320
Val Loss: 1.8620 | Val Acc: 0.6113 | Val F1: 0.5925

Epoch 83/110
--------------------------------------------------


                                                           

Train Loss: 0.1969 | Train Acc: 0.9283
Val Loss: 1.9912 | Val Acc: 0.6044 | Val F1: 0.5767

Epoch 84/110
--------------------------------------------------


                                                           

Train Loss: 0.1958 | Train Acc: 0.9282
Val Loss: 1.8152 | Val Acc: 0.6277 | Val F1: 0.6117

Epoch 85/110
--------------------------------------------------


                                                           

Train Loss: 0.1928 | Train Acc: 0.9330
Val Loss: 1.9556 | Val Acc: 0.6126 | Val F1: 0.5931

Epoch 86/110
--------------------------------------------------


                                                           

Train Loss: 0.1959 | Train Acc: 0.9279
Val Loss: 1.9622 | Val Acc: 0.6030 | Val F1: 0.5830

Epoch 87/110
--------------------------------------------------


                                                           

Train Loss: 0.1827 | Train Acc: 0.9320
Val Loss: 1.8837 | Val Acc: 0.6181 | Val F1: 0.6017

Epoch 88/110
--------------------------------------------------


                                                           

Train Loss: 0.1916 | Train Acc: 0.9332
Val Loss: 1.9174 | Val Acc: 0.6236 | Val F1: 0.6072

Epoch 89/110
--------------------------------------------------


                                                           

Train Loss: 0.1932 | Train Acc: 0.9318
Val Loss: 1.9003 | Val Acc: 0.6085 | Val F1: 0.5939

Epoch 90/110
--------------------------------------------------


                                                           

Train Loss: 0.1895 | Train Acc: 0.9330
Val Loss: 1.8784 | Val Acc: 0.6236 | Val F1: 0.6036

Epoch 91/110
--------------------------------------------------


                                                           

Train Loss: 0.1918 | Train Acc: 0.9285
Val Loss: 2.3371 | Val Acc: 0.5920 | Val F1: 0.5539

Epoch 92/110
--------------------------------------------------


                                                           

Train Loss: 0.1832 | Train Acc: 0.9337
Val Loss: 2.0819 | Val Acc: 0.6181 | Val F1: 0.5970

Epoch 93/110
--------------------------------------------------


                                                           

Train Loss: 0.1800 | Train Acc: 0.9367
Val Loss: 1.8145 | Val Acc: 0.6470 | Val F1: 0.6289
✓ Saved best model (F1: 0.6289)

Epoch 94/110
--------------------------------------------------


                                                           

Train Loss: 0.1697 | Train Acc: 0.9422
Val Loss: 1.7630 | Val Acc: 0.6236 | Val F1: 0.6164

Epoch 95/110
--------------------------------------------------


                                                           

Train Loss: 0.1788 | Train Acc: 0.9349
Val Loss: 1.8962 | Val Acc: 0.6360 | Val F1: 0.6195

Epoch 96/110
--------------------------------------------------


                                                           

Train Loss: 0.1705 | Train Acc: 0.9390
Val Loss: 2.1088 | Val Acc: 0.6181 | Val F1: 0.5961

Epoch 97/110
--------------------------------------------------


                                                           

Train Loss: 0.1805 | Train Acc: 0.9375
Val Loss: 2.3437 | Val Acc: 0.5893 | Val F1: 0.5541

Epoch 98/110
--------------------------------------------------


                                                           

Train Loss: 0.1695 | Train Acc: 0.9377
Val Loss: 2.1508 | Val Acc: 0.6168 | Val F1: 0.5950

Epoch 99/110
--------------------------------------------------


                                                           

Train Loss: 0.1737 | Train Acc: 0.9364
Val Loss: 1.9096 | Val Acc: 0.6346 | Val F1: 0.6179

Epoch 100/110
--------------------------------------------------


                                                           

Train Loss: 0.1728 | Train Acc: 0.9378
Val Loss: 2.0597 | Val Acc: 0.6140 | Val F1: 0.5920

Epoch 101/110
--------------------------------------------------


                                                           

Train Loss: 0.1714 | Train Acc: 0.9387
Val Loss: 2.3866 | Val Acc: 0.5948 | Val F1: 0.5629

Epoch 102/110
--------------------------------------------------


                                                           

Train Loss: 0.1681 | Train Acc: 0.9393
Val Loss: 1.9543 | Val Acc: 0.6181 | Val F1: 0.5999

Epoch 103/110
--------------------------------------------------


                                                           

Train Loss: 0.1712 | Train Acc: 0.9391
Val Loss: 2.2753 | Val Acc: 0.6016 | Val F1: 0.5730

Epoch 104/110
--------------------------------------------------


                                                           

Train Loss: 0.1625 | Train Acc: 0.9429
Val Loss: 1.9134 | Val Acc: 0.6346 | Val F1: 0.6232

Epoch 105/110
--------------------------------------------------


                                                           

Train Loss: 0.1655 | Train Acc: 0.9393
Val Loss: 2.0300 | Val Acc: 0.6250 | Val F1: 0.6088

Epoch 106/110
--------------------------------------------------


                                                           

Train Loss: 0.1628 | Train Acc: 0.9400
Val Loss: 2.0794 | Val Acc: 0.6319 | Val F1: 0.6155

Epoch 107/110
--------------------------------------------------


                                                           

Train Loss: 0.1603 | Train Acc: 0.9412
Val Loss: 1.9727 | Val Acc: 0.6223 | Val F1: 0.6066

Epoch 108/110
--------------------------------------------------


                                                           

Train Loss: 0.1583 | Train Acc: 0.9406
Val Loss: 1.9690 | Val Acc: 0.6305 | Val F1: 0.6107

Epoch 109/110
--------------------------------------------------


                                                           

Train Loss: 0.1649 | Train Acc: 0.9416
Val Loss: 1.8886 | Val Acc: 0.6236 | Val F1: 0.6090

Epoch 110/110
--------------------------------------------------


                                                           

Train Loss: 0.1663 | Train Acc: 0.9410
Val Loss: 1.9909 | Val Acc: 0.6154 | Val F1: 0.6008

Training completed! Best Val F1: 0.6289

✓ Training complete!


### Visualize Training History

In [None]:
# Plot training curves
trainer.plot_training_history(
    save_path=os.path.join(CONFIG['output_dir'], 'training_history.png')
)
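
The log above saves a checkpoint each time validation F1 improves, and the last improvement comes at epoch 93 while training loss keeps falling. A minimal sketch of that best-score tracking, with a patience-based stop added, might look like the following. The function name `track_best` and the `patience` parameter are illustrative and are not part of the notebook's trainer; the F1 values are copied from epochs 93 to 110 of the log.

```python
from typing import List, Optional, Tuple


def track_best(val_f1: List[float], patience: int) -> Tuple[int, float, Optional[int]]:
    """Return (best_index, best_f1, stop_index).

    stop_index is the first index at which `patience` evaluations have passed
    without a new best score, or None if that never happens.
    Hypothetical helper, not the notebook's trainer logic.
    """
    best_idx, best_f1 = -1, float("-inf")
    for i, f1 in enumerate(val_f1):
        if f1 > best_f1:
            best_idx, best_f1 = i, f1  # a checkpoint would be saved here
        elif i - best_idx >= patience:
            return best_idx, best_f1, i  # no improvement for `patience` epochs
    return best_idx, best_f1, None


# Validation F1 for epochs 93..110, taken from the training log above
f1_tail = [0.6289, 0.6164, 0.6195, 0.5961, 0.5541, 0.5950, 0.6179, 0.5920, 0.5629,
           0.5999, 0.5730, 0.6232, 0.6088, 0.6155, 0.6066, 0.6107, 0.6090, 0.6008]

best_idx, best_f1, stop_idx = track_best(f1_tail, patience=10)
print(f"best epoch: {93 + best_idx}, best F1: {best_f1:.4f}")
print("stop epoch:", None if stop_idx is None else 93 + stop_idx)
```

With a patience of 10, this sketch would stop at epoch 103 and keep the epoch 93 checkpoint, which is the same checkpoint the full run ends up with.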

---
## Part 7: Evaluation & Testing

In [None]:
# Load best model
best_checkpoint = torch.load(
    os.path.join(CONFIG['model_dir'], 'best_model.pth'),
    map_location=trainer.device
)
trainer.model.load_state_dict(best_checkpoint['model_state_dict'])


print(f"✓ Loaded best model from epoch {best_checkpoint['epoch']}")
print(f"  Best Val F1: {best_checkpoint['val_f1']:.4f}")

In [None]:
# Test the model
test_results = trainer.test()

print("\n" + "="*60)
print("TEST SET RESULTS")
print("="*60)
print(f"Test Accuracy: {test_results['test_acc']:.4f}")
print(f"Test F1 Score: {test_results['test_f1']:.4f}")

In [None]:
# Plot confusion matrix
trainer.plot_confusion_matrix(
    test_results['confusion_matrix'],
    save_path=os.path.join(CONFIG['output_dir'], 'confusion_matrix.png')
)

### Per-Class Performance Analysis

In [None]:
# Detailed per-class metrics
precision, recall, f1, support = precision_recall_fscore_support(
    test_results['labels'],
    test_results['predictions'],
    labels=range(num_classes)
)

performance_df = pd.DataFrame({
    'Class': train_dataset.label_encoder.classes_,
    'Precision': precision,
    'Recall': recall,
    'F1-Score': f1,
    'Support': support
}).sort_values('F1-Score', ascending=False)

print("\nPer-class performance:")
print(performance_df.to_string(index=False))

# Visualize
fig, ax = plt.subplots(figsize=(14, 6))
x = np.arange(len(performance_df))
width = 0.25

ax.bar(x - width, performance_df['Precision'], width, label='Precision', alpha=0.8)
ax.bar(x, performance_df['Recall'], width, label='Recall', alpha=0.8)
ax.bar(x + width, performance_df['F1-Score'], width, label='F1-Score', alpha=0.8)

ax.set_xlabel('Crypto Algorithm', fontsize=12)
ax.set_ylabel('Score', fontsize=12)
ax.set_title('Per-Class Performance Metrics', fontweight='bold', fontsize=14)
ax.set_xticks(x)
ax.set_xticklabels(performance_df['Class'], rotation=45, ha='right')
ax.legend()
ax.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.savefig(os.path.join(CONFIG['output_dir'], 'per_class_performance.png'), dpi=300, bbox_inches='tight')
plt.show()
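
For reference, the per-class precision, recall, and F1 reported above can be derived directly from a confusion matrix. The sketch below does this in plain Python, assuming rows are true classes and columns are predicted classes (scikit-learn's convention). The 3x3 matrix and the helper `per_class_metrics` are made-up illustrations, not results from this model.

```python
from typing import List, Tuple


def per_class_metrics(cm: List[List[int]]) -> List[Tuple[float, float, float, int]]:
    """Compute (precision, recall, f1, support) for each class.

    Assumes a square confusion matrix with true classes as rows and
    predicted classes as columns. Undefined ratios are reported as 0.0.
    """
    n = len(cm)
    metrics = []
    for k in range(n):
        tp = cm[k][k]
        predicted_k = sum(cm[r][k] for r in range(n))  # column sum: predicted as k
        support = sum(cm[k])                           # row sum: truly class k
        precision = tp / predicted_k if predicted_k else 0.0
        recall = tp / support if support else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics.append((precision, recall, f1, support))
    return metrics


# Made-up confusion matrix for three hypothetical classes
toy_cm = [
    [8, 2, 0],
    [1, 6, 3],
    [0, 0, 5],
]

toy_metrics = per_class_metrics(toy_cm)
for k, (p, r, f, s) in enumerate(toy_metrics):
    print(f"class {k}: precision={p:.3f} recall={r:.3f} f1={f:.3f} support={s}")
```

Reading the metrics this way makes it easier to see whether a low F1 for a class comes from missed functions (low recall) or from other classes being mislabeled as it (low precision).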

---
## Part 8: Save Model for Inference

In [None]:
# Save metadata for inference pipeline
metadata = {
    'label_encoder': train_dataset.label_encoder,
    'node_scaler': train_dataset.node_scaler,
    'edge_scaler': train_dataset.edge_scaler,
    'graph_scaler': train_dataset.graph_scaler,
    'model_config': {
        'num_node_features': num_node_features,
        'num_edge_features': num_edge_features,
        'num_graph_features': num_graph_features,
        'num_classes': num_classes,
        'hidden_dim': CONFIG['hidden_dim'],
        'num_layers': CONFIG['num_layers'],
        'dropout': CONFIG['dropout'],
        'conv_type': CONFIG['conv_type'],
        'pooling': CONFIG['pooling'],
    }
}

metadata_path = os.path.join(CONFIG['model_dir'], 'metadata.pkl')
with open(metadata_path, 'wb') as f:
    pickle.dump(metadata, f)

print(f"✓ Metadata saved to: {metadata_path}")
print(f"✓ Model saved to: {os.path.join(CONFIG['model_dir'], 'best_model.pth')}")
print("\nYou can now use these for inference on new binaries!")
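
Before relying on the saved files, it can help to confirm that the metadata round-trips. The sketch below writes and reloads a dictionary with the same top-level keys as the cell above, then checks that every key the inference side would need is present. The placeholder values, the temporary path, the `REQUIRED_KEYS` set, and the `load_metadata` helper are all assumptions for illustration; in the real file the scalers and label encoder are the fitted scikit-learn objects. Because `pickle.load` can execute arbitrary code, only load metadata files you created yourself.

```python
import os
import pickle
import tempfile

# Top-level keys written by the metadata cell above
REQUIRED_KEYS = {"label_encoder", "node_scaler", "edge_scaler", "graph_scaler", "model_config"}


def load_metadata(path: str) -> dict:
    """Load a metadata pickle and fail early if an expected key is missing.

    Hypothetical helper, not part of the notebook's inference pipeline.
    """
    with open(path, "rb") as f:
        metadata = pickle.load(f)
    missing = REQUIRED_KEYS - set(metadata)
    if missing:
        raise KeyError(f"metadata is missing keys: {sorted(missing)}")
    return metadata


# Placeholder content: None stands in for the fitted scalers and encoder,
# and the model_config values are illustrative only.
placeholder = {
    "label_encoder": None,
    "node_scaler": None,
    "edge_scaler": None,
    "graph_scaler": None,
    "model_config": {"hidden_dim": 128, "num_layers": 3},
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "metadata.pkl")
    with open(path, "wb") as f:
        pickle.dump(placeholder, f)
    restored = load_metadata(path)

print(restored["model_config"])
```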

---
## Part 9: Inference Demo

Run inference on a sample file.

In [None]:
# Import inference pipeline
from new_gnn import CryptoDetectionPipeline

# Create pipeline
pipeline = CryptoDetectionPipeline(
    model_path=os.path.join(CONFIG['model_dir'], 'best_model.pth'),
    metadata_path=metadata_path
)

print("✓ Inference pipeline created")

In [None]:
# Run inference on a test file
if test_files:
    demo_file = test_files[0]
    output_path = os.path.join(CONFIG['output_dir'], 'detection_results.json')
    
    print(f"Running inference on: {os.path.basename(demo_file)}")
    results = pipeline.process_json(demo_file, output_path)
    
    # Display top detections
    if results['crypto_functions']:
        print("\n" + "="*60)
        print("TOP 5 CRYPTO DETECTIONS")
        print("="*60)
        for i, func in enumerate(results['crypto_functions'][:5], 1):
            print(f"\n{i}. Address: {func['address']}")
            print(f"   Name: {func['name']}")
            print(f"   Algorithm: {func['algorithm']}")
            print(f"   Confidence: {func['confidence']:.4f}")
            print("   Top 3 probabilities:")
            sorted_probs = sorted(func['probabilities'].items(), key=lambda x: x[1], reverse=True)[:3]
            for algo, prob in sorted_probs:
                print(f"     {algo}: {prob:.4f}")
else:
    print("No test files available for inference demo")
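
Since every detection carries a confidence score, a downstream step will often keep only the confident ones. The sketch below filters and summarizes a results dictionary shaped like the one printed above. The field names (`crypto_functions`, `address`, `name`, `algorithm`, `confidence`) come from that cell, while the sample entries, the 0.8 threshold, and the helper `filter_detections` are made up for illustration.

```python
from collections import Counter
from typing import Dict, List


def filter_detections(results: Dict, min_confidence: float = 0.8) -> List[Dict]:
    """Keep detections at or above min_confidence, highest confidence first.

    Hypothetical helper; the threshold default is an arbitrary example value.
    """
    kept = [
        func for func in results.get("crypto_functions", [])
        if func["confidence"] >= min_confidence
    ]
    return sorted(kept, key=lambda func: func["confidence"], reverse=True)


# Made-up results, mimicking the structure used in the demo cell above
sample_results = {
    "crypto_functions": [
        {"address": "0x401000", "name": "FUN_00401000", "algorithm": "AES", "confidence": 0.93},
        {"address": "0x402340", "name": "FUN_00402340", "algorithm": "SHA256", "confidence": 0.55},
        {"address": "0x4051a0", "name": "FUN_004051a0", "algorithm": "AES", "confidence": 0.81},
    ]
}

confident = filter_detections(sample_results, min_confidence=0.8)
per_algorithm = Counter(func["algorithm"] for func in confident)

for func in confident:
    print(f"{func['address']}  {func['algorithm']}  {func['confidence']:.2f}")
print("Detections per algorithm:", dict(per_algorithm))
```

A suitable threshold depends on how costly false positives are for the analysis at hand, so it is worth choosing it against the validation set.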

---
## Summary

This notebook demonstrated:

✅ **Address-aware feature extraction** - Novel spatial features  
✅ **Data loading** - Ghidra JSON → PyTorch Geometric graphs  
✅ **GNN training** - Multiple architectures (GCN, GAT, SAGE, GIN)  
✅ **Evaluation** - Comprehensive metrics and visualizations  
✅ **Inference** - Production-ready detection pipeline  

### Next Steps

1. **Hyperparameter tuning**: Run `gnn_hyperparameter_tuning.py`
2. **Try different architectures**: Change `conv_type` to 'gcn', 'sage', or 'gin'
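3. **Address overfitting**: The run above already used 110 epochs; training accuracy reached about 0.94 while validation F1 peaked at 0.6289, so stronger regularization or early stopping may help more than a larger `num_epochs`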
4. **Deploy**: Use the inference pipeline on new binaries

### Files Generated

- `gnn_models/best_model.pth` - Trained model weights
- `gnn_models/metadata.pkl` - Scalers and label encoder
- `gnn_outputs/training_history.png` - Learning curves
- `gnn_outputs/confusion_matrix.png` - Performance matrix
- `gnn_outputs/per_class_performance.png` - Per-class metrics
- `gnn_outputs/detection_results.json` - Inference results

**Ready to detect crypto! 🔐🔍**