
Complete End-to-End Pipeline Integration with Error Recovery and Testing #37


Summary

Complete the integration of all StringyMcStringFace components into a cohesive end-to-end pipeline with comprehensive error recovery mechanisms and full integration test coverage.

Context

StringyMcStringFace currently consists of individual components that are either implemented or framework-ready:

  • ✅ Format detection (ELF, PE, Mach-O via goblin)
  • ✅ Container parsers with section classification
  • ✅ Import/export extraction
  • 🚧 String extraction engines (ASCII/UTF-8, UTF-16)
  • 🚧 Semantic classification (URLs, paths, GUIDs, etc.)
  • 🚧 Ranking and scoring system
  • 🚧 Output formatters (JSON, human-readable, YARA)

The pipeline integration task involves connecting these components into a complete data flow from binary input to formatted output.

Pipeline Architecture

Binary Input
    ↓
Format Detection (goblin)
    ↓
Container Parser Selection (ELF/PE/Mach-O)
    ↓
Section Classification & Weighting
    ↓
String Extraction (ASCII/UTF-8/UTF-16)
    ↓
Semantic Classification & Tagging
    ↓
Scoring & Ranking
    ↓
Output Formatting (JSON/Text/YARA)
    ↓
Final Output

Proposed Solution

1. Pipeline Orchestration Module

Create a pipeline.rs module that coordinates the flow:

  • Pipeline struct: Central orchestrator that manages state and error context
  • PipelineStage enum: Track which stage failed for better error messages
  • PipelineResult: Accumulates results and partial failures throughout execution
  • run_pipeline(): Main entry point that executes all stages in sequence
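
A minimal sketch of what pipeline.rs could look like. Only Pipeline, PipelineStage, PipelineResult, and run_pipeline() come from the list above; everything else (StageError, the ScoredString placeholder, field names) is illustrative:

/// Identifies which stage failed, for error context.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PipelineStage {
    FormatDetection,
    ContainerParsing,
    SectionClassification,
    StringExtraction,
    SemanticClassification,
    Scoring,
    OutputFormatting,
}

/// A recoverable failure recorded during a run (illustrative).
#[derive(Debug)]
pub struct StageError {
    pub stage: PipelineStage,
    pub message: String,
}

/// Placeholder standing in for the project's real scored-string type.
#[derive(Debug)]
pub struct ScoredString {
    pub text: String,
    pub score: u8,
}

/// Accumulates results and partial failures throughout execution.
#[derive(Debug, Default)]
pub struct PipelineResult {
    pub strings: Vec<ScoredString>,
    pub errors: Vec<StageError>,
}

/// Central orchestrator that manages state and error context.
pub struct Pipeline {
    result: PipelineResult,
    strict: bool, // mirrors the proposed --strict flag
}

impl Pipeline {
    pub fn new(strict: bool) -> Self {
        Self { result: PipelineResult::default(), strict }
    }
}

/// Main entry point that executes all stages in sequence.
pub fn run_pipeline(input: &[u8], strict: bool) -> Result<PipelineResult, StageError> {
    let pipeline = Pipeline::new(strict);
    // Stage calls would run here in order: detect format, parse the
    // container, classify sections, extract, tag, score, format.
    let _ = input;
    Ok(pipeline.result)
}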

2. Error Recovery Strategy

Implement graceful degradation at each stage:

  • Format Detection Failure: Fall back to raw binary analysis mode
  • Parser Errors: Skip corrupted sections, continue with valid ones
  • Extraction Failures: Log warnings, continue with other encodings
  • Classification Errors: Mark strings as unclassified but retain them
  • Scoring Failures: Use default score (50) as fallback

Error handling pattern:

match stage_result {
    // Stage succeeded: merge its output into the accumulated results.
    Ok(data) => pipeline.add_results(data),
    // Stage failed: record the error with stage context, then carry on
    // with whatever partial data earlier stages have produced.
    Err(e) => {
        pipeline.log_error(stage, e);
        pipeline.continue_with_partial();
    }
}
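
Under the same assumptions as the pipeline.rs sketch above (and living in the same module), this pattern generalizes to a helper that picks between fail-fast and degradation; run_stage and its fallback parameter are hypothetical names:

fn run_stage<T>(
    pipeline: &mut Pipeline,
    stage: PipelineStage,
    fallback: T,
    f: impl FnOnce() -> Result<T, String>,
) -> Result<T, StageError> {
    match f() {
        Ok(data) => Ok(data),
        // --strict: surface the first error instead of recovering.
        Err(message) if pipeline.strict => Err(StageError { stage, message }),
        // Otherwise record the failure and continue with a neutral
        // fallback, e.g. the default score of 50 for the scoring stage.
        Err(message) => {
            pipeline.result.errors.push(StageError { stage, message });
            Ok(fallback)
        }
    }
}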

3. Data Flow Integration

Ensure consistent data structures flow between stages:

  • Stage 1→2: BinaryFormat → ContainerParser
  • Stage 2→3: Section metadata → StringExtractor
  • Stage 3→4: RawString → Classifier
  • Stage 4→5: TaggedString → Scorer
  • Stage 5→6: ScoredString → OutputFormatter
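
To make the handoffs concrete, the stage boundaries could be sketched as a trait. The placeholder types below echo the list above but are not confirmed to match the crate's actual API (ScoredString reuses the placeholder from the pipeline.rs sketch):

pub struct BinaryFormat;
pub struct Section;
pub struct RawString;
pub struct TaggedString;

trait PipelineStages {
    fn detect_format(&self, input: &[u8]) -> Result<BinaryFormat, String>;
    fn parse_container(&self, format: &BinaryFormat, input: &[u8]) -> Result<Vec<Section>, String>;
    fn extract_strings(&self, sections: &[Section]) -> Vec<RawString>;
    fn classify(&self, raw: Vec<RawString>) -> Vec<TaggedString>;
    fn score(&self, tagged: Vec<TaggedString>) -> Vec<ScoredString>;
    fn format_output(&self, scored: &[ScoredString]) -> String;
}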

4. End-to-End Integration Tests

Create a comprehensive test suite in tests/integration/:

Test Fixtures

  • test_binary_elf: ELF with known strings in .rodata
  • test_binary_pe.exe: PE with UTF-16 strings and resources
  • test_binary_macho: Mach-O with __cstring section
  • corrupted_binary: Intentionally malformed for error recovery tests
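
A small shared helper the integration tests might use to load these fixtures; the tests/fixtures path and the helper name are assumptions:

use std::path::PathBuf;

fn fixture(name: &str) -> Vec<u8> {
    let path = PathBuf::from(env!("CARGO_MANIFEST_DIR"))
        .join("tests")
        .join("fixtures")
        .join(name);
    std::fs::read(&path)
        .unwrap_or_else(|e| panic!("missing fixture {}: {e}", path.display()))
}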

Test Categories

Happy Path Tests:

  • test_elf_complete_pipeline(): Verify full ELF analysis
  • test_pe_complete_pipeline(): Verify full PE analysis with UTF-16
  • test_macho_complete_pipeline(): Verify full Mach-O analysis
  • test_json_output_format(): Validate JSON structure
  • test_semantic_tagging(): Verify URLs, paths, GUIDs detected
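
Sketch of one happy-path test, assuming the run_pipeline() signature and fixture helper sketched earlier; the marker string is a hypothetical example of a known .rodata string:

#[test]
fn test_elf_complete_pipeline() {
    let input = fixture("test_binary_elf");
    let result = run_pipeline(&input, false).expect("non-strict run should succeed");

    // The ELF fixture embeds known strings in .rodata.
    assert!(result.errors.is_empty());
    assert!(result.strings.iter().any(|s| s.text.contains("stringy_marker")));
}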

Error Recovery Tests:

  • test_corrupted_section_recovery(): Skip bad sections, continue analysis
  • test_unknown_format_fallback(): Handle unrecognized formats gracefully
  • test_partial_utf16_recovery(): Handle incomplete UTF-16 sequences
  • test_empty_sections_handling(): Process binaries with no string data
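
Sketch of one recovery test under the same assumptions: a corrupted fixture should yield partial results plus recorded errors, not a hard failure:

#[test]
fn test_corrupted_section_recovery() {
    let input = fixture("corrupted_binary");
    let result = run_pipeline(&input, false).expect("non-strict run should not abort");

    // Recovery means failures are recorded rather than fatal, and each
    // error carries stage context for the final report.
    assert!(!result.errors.is_empty());
    assert!(result.errors.iter().any(|e| e.stage == PipelineStage::ContainerParsing));
}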

Performance Tests:

  • test_large_binary_performance(): Verify large binaries (e.g., 10 MB) stay within the 5-second target
  • test_memory_limits(): Verify memory usage stays within bounds
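
A rough sketch of the timing check against the 5-second acceptance criterion; a real suite would likely use a benchmarking harness instead, and the 10 MB fixture name is hypothetical:

use std::time::{Duration, Instant};

#[test]
#[ignore] // timing-sensitive; run explicitly with `cargo test -- --ignored`
fn test_large_binary_performance() {
    let input = fixture("large_binary_10mb");
    let start = Instant::now();
    let _ = run_pipeline(&input, false);
    // Matches the acceptance criterion: 10 MB in under 5 seconds.
    assert!(start.elapsed() < Duration::from_secs(5));
}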

5. Configuration & CLI Integration

Update CLI to support pipeline configuration:

  • --max-errors N: Stop after N errors (default: continue through all errors)
  • --skip-sections PATTERN: Exclude sections matching pattern
  • --strict: Fail fast on any error (no recovery)
  • --verbose: Show pipeline stage progress
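
One way these flags could map onto an argument struct, assuming the CLI uses clap's derive API (which this issue does not confirm):

use clap::Parser;

#[derive(Parser, Debug)]
struct PipelineArgs {
    /// Stop after N errors (default: continue through all of them)
    #[arg(long = "max-errors", value_name = "N")]
    max_errors: Option<usize>,

    /// Exclude sections whose names match this pattern
    #[arg(long = "skip-sections", value_name = "PATTERN")]
    skip_sections: Option<String>,

    /// Fail fast on any error (no recovery)
    #[arg(long)]
    strict: bool,

    /// Show pipeline stage progress
    #[arg(long)]
    verbose: bool,
}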

Acceptance Criteria

  • Pipeline orchestration module implemented with all stages connected
  • Error recovery mechanisms in place for all failure modes
  • At least 15 integration tests covering happy path and error scenarios
  • All test fixtures (test_binary_elf, test_binary_pe.exe, test_binary_macho) processed successfully
  • JSON output validates against expected schema
  • Semantic tagging correctly identifies URLs, paths, GUIDs in test binaries
  • CLI successfully processes real-world binaries (/bin/ls, example PE file)
  • Error messages include pipeline stage context and recovery actions taken
  • Performance: Process 10MB binary in < 5 seconds on typical hardware
  • Documentation updated with pipeline architecture and error handling strategy
  • Code coverage > 80% for pipeline module

Implementation Checklist

  • Create src/pipeline.rs module
  • Define Pipeline, PipelineStage, PipelineResult types
  • Implement run_pipeline() orchestration function
  • Add error recovery logic for each stage
  • Create tests/integration/pipeline_tests.rs
  • Generate test fixtures with known string content
  • Write happy path integration tests (3 formats × 2 output modes)
  • Write error recovery tests (5+ scenarios)
  • Add CLI flags for pipeline configuration
  • Update main.rs to use pipeline orchestrator
  • Add pipeline architecture diagram to docs
  • Write troubleshooting guide for common errors

Dependencies

  • Blocked by: Main Extraction Pipeline implementation
  • Depends on: String extraction engines must be functional
  • Depends on: Semantic classification system must be operational

Related Issues

Reference any related issues for string extraction, classification, or scoring components.

Task ID

stringy-analyzer/complete-pipeline-integration
