Summary
Complete the integration of all StringyMcStringFace components into a cohesive end-to-end pipeline with comprehensive error recovery mechanisms and full integration test coverage.
Context
StringyMcStringFace's individual components are currently either implemented (✅) or framework-ready (🚧):
- ✅ Format detection (ELF, PE, Mach-O via goblin)
- ✅ Container parsers with section classification
- ✅ Import/export extraction
- 🚧 String extraction engines (ASCII/UTF-8, UTF-16)
- 🚧 Semantic classification (URLs, paths, GUIDs, etc.)
- 🚧 Ranking and scoring system
- 🚧 Output formatters (JSON, human-readable, YARA)
The pipeline integration task involves connecting these components into a complete data flow from binary input to formatted output.
Pipeline Architecture
```
Binary Input
    ↓
Format Detection (goblin)
    ↓
Container Parser Selection (ELF/PE/Mach-O)
    ↓
Section Classification & Weighting
    ↓
String Extraction (ASCII/UTF-8/UTF-16)
    ↓
Semantic Classification & Tagging
    ↓
Scoring & Ranking
    ↓
Output Formatting (JSON/Text/YARA)
    ↓
Final Output
```
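As a concrete sketch of the first two stages, goblin's `Object` enum can drive parser selection. goblin is already the named detection library; `ParserChoice` and the raw-analysis fallback are hypothetical names tied to the error recovery strategy below, not a settled API:

```rust
// Sketch of stages 1–2; assumes goblin is a dependency in Cargo.toml.
use goblin::Object;

#[derive(Debug, PartialEq)]
enum ParserChoice {
    Elf,
    Pe,
    MachO,
    /// Unknown or unparseable input: raw binary analysis mode.
    Raw,
}

fn select_parser(bytes: &[u8]) -> ParserChoice {
    match Object::parse(bytes) {
        Ok(Object::Elf(_)) => ParserChoice::Elf,
        Ok(Object::PE(_)) => ParserChoice::Pe,
        Ok(Object::Mach(_)) => ParserChoice::MachO,
        // Anything else (archives, unknown magic, parse errors) falls back
        // to raw analysis instead of aborting the pipeline.
        _ => ParserChoice::Raw,
    }
}
```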
Proposed Solution
1. Pipeline Orchestration Module
Create a `pipeline.rs` module that coordinates the flow:
- `Pipeline` struct: central orchestrator that manages state and error context
- `PipelineStage` enum: tracks which stage failed for better error messages
- `PipelineResult`: accumulates results and partial failures throughout execution
- `run_pipeline()`: main entry point that executes all stages in sequence
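A minimal sketch of how these pieces could fit together; every field, signature, and method body here is an assumption, not a final design:

```rust
// src/pipeline.rs — sketch only; shapes are assumptions.

/// Identifies which stage an error came from, for contextual messages.
#[derive(Debug, Clone, Copy)]
pub enum PipelineStage {
    FormatDetection,
    ContainerParsing,
    StringExtraction,
    Classification,
    Scoring,
    OutputFormatting,
}

/// Placeholder for the stage-5 output type; a fuller shape is sketched
/// under "Data Flow Integration" below.
pub struct ScoredString;

/// Accumulates results and partial failures throughout execution.
#[derive(Default)]
pub struct PipelineResult {
    pub strings: Vec<ScoredString>,
    pub errors: Vec<(PipelineStage, String)>,
}

/// Central orchestrator that manages state and error context.
#[derive(Default)]
pub struct Pipeline {
    result: PipelineResult,
}

impl Pipeline {
    /// Merge a successful stage's output into the accumulated result.
    pub fn add_results(&mut self, strings: Vec<ScoredString>) {
        self.result.strings.extend(strings);
    }

    /// Record an error with its stage so messages carry context.
    pub fn log_error(&mut self, stage: PipelineStage, err: impl std::fmt::Display) {
        self.result.errors.push((stage, err.to_string()));
    }
}

/// Main entry point that executes all stages in sequence.
pub fn run_pipeline(bytes: &[u8]) -> PipelineResult {
    let mut pipeline = Pipeline::default();
    // detect format → parse container → extract → classify → score,
    // recovering at each stage per the strategy below.
    let _ = bytes; // stage wiring elided in this sketch
    pipeline.result
}
```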
2. Error Recovery Strategy
Implement graceful degradation at each stage:
- Format Detection Failure: Fall back to raw binary analysis mode
- Parser Errors: Skip corrupted sections, continue with valid ones
- Extraction Failures: Log warnings, continue with other encodings
- Classification Errors: Mark strings as unclassified but retain them
- Scoring Failures: Use default score (50) as fallback
Error handling pattern:
```rust
match stage_result {
    Ok(data) => pipeline.add_results(data),
    Err(e) => {
        pipeline.log_error(stage, e);
        pipeline.continue_with_partial();
    }
}
```
3. Data Flow Integration
Ensure consistent data structures flow between stages:
- Stage 1→2: `BinaryFormat` → `ContainerParser`
- Stage 2→3: `Section` metadata → `StringExtractor`
- Stage 3→4: `RawString` → `Classifier`
- Stage 4→5: `TaggedString` → `Scorer`
- Stage 5→6: `ScoredString` → `OutputFormatter`
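One possible shape for the string-carrying types in stages 3–5; the fields and enum variants below are illustrative assumptions, not the final schema:

```rust
// Hypothetical shapes for the data flowing between stages 3–6.

pub enum Encoding {
    Ascii,
    Utf8,
    Utf16Le,
}

/// Stage 3 output: a decoded string plus where it was found.
pub struct RawString {
    pub text: String,
    pub section: String, // e.g. ".rodata" or "__cstring"
    pub offset: u64,
    pub encoding: Encoding,
}

pub enum Tag {
    Url,
    FilePath,
    Guid,
    /// Classification failed or matched nothing: retained, not dropped.
    Unclassified,
}

/// Stage 4 output: the same string with semantic tags attached.
pub struct TaggedString {
    pub raw: RawString,
    pub tags: Vec<Tag>,
}

/// Stage 5 output: a 0–100 relevance score (falls back to 50 on failure).
pub struct ScoredString {
    pub tagged: TaggedString,
    pub score: u8,
}
```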
4. End-to-End Integration Tests
Create a comprehensive test suite in `tests/integration/`:
Test Fixtures
- `test_binary_elf`: ELF with known strings in `.rodata`
- `test_binary_pe.exe`: PE with UTF-16 strings and resources
- `test_binary_macho`: Mach-O with `__cstring` section
- `corrupted_binary`: intentionally malformed for error recovery tests
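A small helper for loading these fixtures in tests; the `tests/fixtures/` directory and helper name are assumptions:

```rust
// tests/integration/common.rs (hypothetical location)
use std::path::PathBuf;

/// Read a checked-in binary fixture by name.
pub fn load_fixture(name: &str) -> Vec<u8> {
    let path = PathBuf::from(env!("CARGO_MANIFEST_DIR"))
        .join("tests/fixtures")
        .join(name);
    std::fs::read(&path)
        .unwrap_or_else(|e| panic!("missing fixture {}: {e}", path.display()))
}
```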
Test Categories
Happy Path Tests:
- `test_elf_complete_pipeline()`: verify full ELF analysis
- `test_pe_complete_pipeline()`: verify full PE analysis with UTF-16
- `test_macho_complete_pipeline()`: verify full Mach-O analysis
- `test_json_output_format()`: validate JSON structure
- `test_semantic_tagging()`: verify URLs, paths, GUIDs detected
Error Recovery Tests:
- `test_corrupted_section_recovery()`: skip bad sections, continue analysis
- `test_unknown_format_fallback()`: handle unrecognized formats gracefully
- `test_partial_utf16_recovery()`: handle incomplete UTF-16 sequences
- `test_empty_sections_handling()`: process binaries with no string data
Performance Tests:
- `test_large_binary_performance()`: ensure reasonable performance on large binaries
- `test_memory_limits()`: verify memory usage stays within bounds
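Tying the sketches together, one happy-path test could look like this; the `stringy` crate name and the assertions on fixture content are assumptions:

```rust
// tests/integration/pipeline_tests.rs — sketch reusing the helpers above.
mod common;

#[test]
fn test_elf_complete_pipeline() {
    let bytes = common::load_fixture("test_binary_elf");
    let result = stringy::pipeline::run_pipeline(&bytes);

    // The ELF fixture embeds known strings in .rodata; a clean run should
    // surface them with no recorded stage errors.
    assert!(result.errors.is_empty(), "unexpected errors: {:?}", result.errors);
    assert!(!result.strings.is_empty());
}
```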
5. Configuration & CLI Integration
Update the CLI to support pipeline configuration:
- `--max-errors N`: stop after N errors (default: continue through all)
- `--skip-sections PATTERN`: exclude sections matching pattern
- `--strict`: fail fast on any error (no recovery)
- `--verbose`: show pipeline stage progress
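If the CLI is built on clap's derive API, the new flags might be declared roughly like this; the struct and field names are assumptions layered onto whatever arguments already exist:

```rust
use clap::Parser;

/// Hypothetical pipeline-related additions to the CLI argument struct.
#[derive(Parser, Debug)]
struct Args {
    /// Stop after N errors; by default, continue through all of them.
    #[arg(long, value_name = "N")]
    max_errors: Option<usize>,

    /// Exclude sections whose names match PATTERN.
    #[arg(long, value_name = "PATTERN")]
    skip_sections: Option<String>,

    /// Fail fast on any error (disables recovery).
    #[arg(long)]
    strict: bool,

    /// Show pipeline stage progress.
    #[arg(long)]
    verbose: bool,
}
```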
Acceptance Criteria
- Pipeline orchestration module implemented with all stages connected
- Error recovery mechanisms in place for all failure modes
- At least 15 integration tests covering happy path and error scenarios
- All test fixtures (`test_binary_elf`, `test_binary_pe.exe`, `test_binary_macho`) processed successfully
- JSON output validates against expected schema
- Semantic tagging correctly identifies URLs, paths, GUIDs in test binaries
- CLI successfully processes real-world binaries (`/bin/ls`, example PE file)
- Error messages include pipeline stage context and recovery actions taken
- Performance: process a 10 MB binary in < 5 seconds on typical hardware
- Documentation updated with pipeline architecture and error handling strategy
- Code coverage > 80% for pipeline module
Implementation Checklist
- Create `src/pipeline.rs` module
- Define `Pipeline`, `PipelineStage`, `PipelineResult` types
- Implement `run_pipeline()` orchestration function
- Add error recovery logic for each stage
- Create `tests/integration/pipeline_tests.rs`
- Generate test fixtures with known string content
- Write happy path integration tests (3 formats × 2 output modes)
- Write error recovery tests (5+ scenarios)
- Add CLI flags for pipeline configuration
- Update `main.rs` to use the pipeline orchestrator
- Add pipeline architecture diagram to docs
- Write troubleshooting guide for common errors
Dependencies
- Blocked by: Main Extraction Pipeline implementation
- Depends on: String extraction engines must be functional
- Depends on: Semantic classification system must be operational
Related Issues
Reference any related issues for string extraction, classification, or scoring components.
Task ID
stringy-analyzer/complete-pipeline-integration