Feature/surv v1 phase5 metrics #145

Merged
ryanmccann1024 merged 22 commits into feature/surv-v1-phase4-rl-integration from feature/surv-v1-phase5-metrics
Nov 7, 2025
Conversation

@ryanmccann1024
Collaborator

Quick merge.

ryanmccann1024 and others added 22 commits October 16, 2025 15:57
Implement comprehensive testing, documentation, and performance validation
for survivability v1 features.

Testing:
- Add integration tests for end-to-end survivability pipeline
- Add performance benchmarks for all time/memory budgets
- Add regression tests for backward compatibility

Documentation:
- Update main README with survivability section
- Update reporting README with survivability features
- Add 4 example configurations with comprehensive guide

Example Configurations:
- Link failure with KSP-FF baseline
- Geographic failure with 1+1 protection
- RL policy evaluation with BC
- Dataset generation for training

All Phase 6 acceptance criteria met:
- Integration tests verify E2E workflow
- Performance tests validate all budgets (decision time ≤2ms, etc.)
- Comprehensive documentation and examples
- Backward compatibility preserved

Related: phase6-quality/50-testing.md, 51-documentation.md, 52-performance.md
This commit fixes all type annotation and linting errors in the
survivability test suite to ensure code quality and type safety.

Changes:
- Fix KPathCache import from fusion.modules.routing.k_path_cache
- Update KSPFFPolicy instantiation (no constructor arguments)
- Fix select_path method calls to use correct signature (state, action_mask)
- Update get_path_features calls to match actual API signature
- Add network_spectrum dict creation in tests for path feature extraction
- Remove unused variable assignments flagged by ruff
- Fix line length violations (E501)
- Remove duplicate backup test files

All mypy type checks and ruff linting checks now pass successfully.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Integrating phase 5 metrics and reporting functionality into the phase 6 quality assurance branch.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Added comprehensive survivability-related configuration sections across
all config files and templates including:
- Offline RL settings for policy configuration
- Dataset logging settings for training data collection
- Recovery timing parameters for failure simulation
- Protection settings for network resilience

Updated logging configuration to support dataset logging requirements.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Implemented full integration of DatasetLogger into the simulation engine
to enable offline RL dataset collection during simulations.

Changes:
- Added DatasetLogger initialization in SimulationEngine.__init__ with
  proper directory structure (data/training_data/{network}/{date}/{time}/{thread}/)
- Implemented _log_dataset_transition() to capture state-action-reward
  transitions after each routing decision
- Ensured logger is properly closed on simulation completion
- Added all survivability configuration sections to schema.py:
  * dataset_logging (log_offline_dataset, dataset_output_path, epsilon_mix)
  * offline_rl_settings (policy_type, fallback_policy, device)
  * recovery_timing (protection_switchover_ms, restoration_latency_ms, etc.)
  * protection_settings (protection_mode)
  * routing_settings (route_method, k_paths, path_ordering, precompute_paths)
  * failure_settings (failure_type, geo settings, timing parameters)
  * reporting (export_csv, csv_output_path)
- Updated .gitignore to exclude data/training_data directory

Dataset format:
Each transition includes state (src, dst, bandwidth, k_paths), action
(selected path index), reward (+1.0/-1.0), action_mask (path feasibility),
and metadata (request_id, arrival_time, decision_time_ms).
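
As a rough illustration of the record described above, here is a minimal sketch of one transition serialized as a JSONL line. The function names, the exact key layout, and the rule that +1.0 means a successful allocation and -1.0 a blocked request are assumptions made for illustration; the commit only names the fields, not the DatasetLogger API.

```python
import json


def build_transition(src, dst, bandwidth, k_paths, action, action_mask,
                     allocated, request_id, arrival_time, decision_time_ms):
    """Assemble one state-action-reward record (key layout is assumed)."""
    return {
        "state": {"src": src, "dst": dst,
                  "bandwidth": bandwidth, "k_paths": k_paths},
        "action": action,                      # index of the selected path
        # Assumption: +1.0 for an allocated request, -1.0 for a blocked one.
        "reward": 1.0 if allocated else -1.0,
        "action_mask": action_mask,            # per-path feasibility flags
        "metadata": {"request_id": request_id,
                     "arrival_time": arrival_time,
                     "decision_time_ms": decision_time_ms},
    }


def to_jsonl_line(transition):
    """Serialize one transition as a single JSON Lines record."""
    return json.dumps(transition, sort_keys=True) + "\n"
```

Writing one JSON object per line keeps the file appendable and lets a training script stream transitions without loading the whole dataset.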

Related: fusion/configs/examples/dataset_generation.ini now functional

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Changed sim_start format from '%m%d_%H_%M_%S_%f' to '%H_%M_%S_%f'
and created separate self.date to avoid date duplication in paths.

Before: data/output/NSFNet/1027/1027_17_54_36_579394/s1/
After:  data/output/NSFNet/1027/17_54_36_579394/s1/
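
The effect of the format change can be reproduced with `datetime.strftime`. This is a minimal sketch: the helper name `output_dir` and the `base`/`thread` defaults are hypothetical, and the year in the sample timestamp is arbitrary because it does not appear in the path.

```python
from datetime import datetime
from pathlib import PurePosixPath

OLD_SIM_START_FORMAT = "%m%d_%H_%M_%S_%f"   # date repeated inside sim_start
NEW_SIM_START_FORMAT = "%H_%M_%S_%f"        # time only
DATE_FORMAT = "%m%d"                        # kept separately as the date part


def output_dir(network, now, thread="s1", base="data/output"):
    """Build <base>/<network>/<date>/<sim_start>/<thread> (hypothetical helper)."""
    date = now.strftime(DATE_FORMAT)
    sim_start = now.strftime(NEW_SIM_START_FORMAT)
    return PurePosixPath(base) / network / date / sim_start / thread


now = datetime(2025, 10, 27, 17, 54, 36, 579394)  # year chosen arbitrarily
print(output_dir("NSFNet", now))
# data/output/NSFNet/1027/17_54_36_579394/s1
```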

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Fixed multiple critical bugs in simulation and dataset generation:

1. Erlang loop bug: BatchRunner was ignoring erlang_start/stop/step
   parameters and defaulting to erlang=300. Now properly reads config
   values and makes erlang_stop inclusive.

2. CLI default override bug: --max_iters had default=3 in CLI parser,
   which was overriding config file values. Changed to default=None
   to respect config files.

3. Last iteration save: Added an explicit check so that the last iteration
   always saves statistics, regardless of the save_step value.

4. Dataset file naming: Added erlang value to dataset filename
   (dataset_erlang_{erlang}.jsonl) so each traffic volume gets its
   own file instead of overwriting.

5. Dataset metadata: Added erlang and iteration fields to each
   transition in the dataset for better tracking.
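
The CLI default override in item 2 is a common argparse pitfall: a non-None default makes it impossible to tell whether the user passed the flag. The sketch below shows the general pattern with `default=None`; the function name and the flat `config` dict are assumptions, not the project's actual parser code.

```python
import argparse


def resolve_max_iters(cli_args, config):
    """Return the CLI value if given, otherwise the config file value."""
    parser = argparse.ArgumentParser()
    # default=None means "not provided", so the config value can win.
    parser.add_argument("--max_iters", type=int, default=None)
    args = parser.parse_args(cli_args)
    if args.max_iters is not None:
        return args.max_iters
    return config["max_iters"]
```

With `default=3`, the first branch would always be taken and the config value would never be used.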

Files changed:
- fusion/cli/parameters/traffic.py: Remove default=3 from max_iters
- fusion/sim/batch_runner.py: Fix erlang parameter reading
- fusion/sim/network_simulator.py: Make erlang_stop inclusive
- fusion/core/simulation.py: Fix save logic, dataset naming, metadata
- fusion/reporting/dataset_logger.py: Revert append mode to write mode
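
A minimal sketch of the inclusive Erlang sweep and the per-Erlang dataset filename follows. The helper names are hypothetical and the real BatchRunner logic may differ; only the `dataset_erlang_{erlang}.jsonl` pattern comes from the commit message.

```python
def erlang_values(start, stop, step):
    """List the traffic volumes from start to stop, with stop included."""
    if step <= 0:
        raise ValueError("step must be positive")
    values = []
    n = 0
    # The small tolerance keeps stop inclusive when float steps are used.
    while start + n * step <= stop + 1e-9:
        values.append(start + n * step)
        n += 1
    return values


def dataset_filename(erlang):
    """One file per traffic volume, so later runs do not overwrite earlier ones."""
    return f"dataset_erlang_{erlang}.jsonl"
```

Because each Erlang value gets its own file, opening the file in write mode (as the last entry in the file list notes) no longer discards data from other traffic volumes.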

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Add complete CLI argument support for survivability experiments including
failure injection, protection mechanisms, RL policies, and dataset logging.

- Create fusion/cli/parameters/survivability.py with all argument groups
- Register survivability arguments in CLI registry
- Add survivability args to run_sim command
- Enable CLI override of config file parameters

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Implements Section 6 (Integration) from survivability-v1 specs, completing
the missing integration between FailureManager and the simulation execution.

Changes:
- SimulationEngine: Add FailureManager initialization and scheduling
- SDNController: Add path feasibility checking for failed links
- Automatic type conversion for node IDs (handles string/int mismatch)
- Schedule failures using actual Poisson arrival times instead of indices
- Add repair checking in main simulation loop
- Update example config with valid link and debug logging

Integration flow:
1. FailureManager created after topology initialization
2. Failure scheduled in first iteration using real request times
3. SDNController checks path feasibility before allocation
4. Repairs processed during request handling loop

Fixes issue where failures were configured but never injected during
simulation execution. All survivability phase 2-5 modules now fully
integrated and functional.
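
The feasibility check in step 3, together with the node ID type conversion, could look roughly like the sketch below. It assumes undirected links and a path given as a node sequence; the function names are hypothetical and do not reflect the actual SDNController or FailureManager interfaces.

```python
def normalize_link(u, v):
    """Canonical key for an undirected link; tolerates int/str node IDs."""
    return tuple(sorted((str(u), str(v))))


def is_path_feasible(path, failed_links):
    """Return True if no hop of the path crosses a currently failed link."""
    failed = {normalize_link(u, v) for u, v in failed_links}
    hops = zip(path, path[1:])
    return all(normalize_link(u, v) not in failed for u, v in hops)
```

Converting both endpoints to strings before comparison is one way to handle the string/int mismatch mentioned above, for example when the topology uses integer node IDs but the config supplies the failed link as strings.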

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
…processing bugs

- Fix 7 ruff E501 line-too-long errors in sdn_controller.py and simulation.py
- Rename config sections to follow *_settings naming convention:
  - dataset_logging -> dataset_logging_settings
  - recovery_timing -> recovery_timing_settings
  - reporting -> reporting_settings
- Fix test_run_generic_sim_multiple_erlangs_sequential expecting 3 runs
- Fix test_get_logger_with_new_name_calls_setup assertion signature
- Fix KeyError when processing missing optional config sections
- Fix TypeError in failure scheduling by not setting missing optional values to None
- Update config processing to skip missing optional options instead of setting to None

All ruff checks now pass and the unit tests are fixed.
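
The last three fixes share one idea: an optional option that is absent from the config file should be left out of the result rather than stored as None. A minimal sketch with the standard `configparser` module is shown below. The section and option names are taken from the commit messages above, but the converter types, the sample value, and the function name are assumptions.

```python
import configparser

# Assumed converters; the real schema may use different types.
OPTIONAL_OPTIONS = {
    "recovery_timing_settings": {
        "protection_switchover_ms": float,
        "restoration_latency_ms": float,
    },
    "failure_settings": {"failure_type": str},
}


def read_optional(parser, optional=OPTIONAL_OPTIONS):
    """Collect only the optional options that are actually present."""
    props = {}
    for section, options in optional.items():
        if not parser.has_section(section):
            continue  # missing section: skip instead of raising KeyError
        for name, convert in options.items():
            if parser.has_option(section, name):
                props.setdefault(section, {})[name] = convert(
                    parser.get(section, name))
            # missing option: leave the key out rather than storing None
    return props
```

Downstream code can then use `dict.get` with its own default, which avoids a TypeError from arithmetic on None.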
- Rename .github/issue_template to ISSUE_TEMPLATE (GitHub canonical format)
- Fix broken links in issue template config.yml (Architecture Plan, Publications)
- Add comprehensive ARCHITECTURE.md with system design, components, and data flow
- Enhance README Publications section with structured citation format
- Remove GitHub Discussions link from issue resources
- Add placeholder for community-contributed publications

All issue template resource links now point to existing documentation.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Modernize all GitHub issue templates, PR templates, and commit
message guide by removing emojis from section headers and titles.
This creates a more professional appearance appropriate for a
research simulator while maintaining all functionality and structure.

Files updated:
- Issue templates (bug report, feature request, config)
- PR templates (feature, hotfix, general)
- Commit message guide

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Update config validation error message to be path-agnostic since users
can pass config files from any location via command line, not just
ini/run_ini/. Remove emojis from user-facing error messages in run_gui
and run_train for cleaner output. Update TODO entries to clarify that
GUI and multi-processing features need full implementation. Standardize
docstring formatting across all CLI modules for consistency.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Corrected CLI invocation syntax throughout documentation by adding the
missing 'run_sim' subcommand. The correct format is:
`python -m fusion.cli.run_sim run_sim --config_path ...`

Added comprehensive "Templates vs Examples" section to configs/README.md
explaining the distinction between generic reusable templates and
specific ready-to-run example configurations.

Changes include:
- Fix CLI command examples in cli/README.md and configs/examples/README.md
- Add "Templates vs Examples" section with comparison table and usage guidance
- Add TODO for YAML/JSON configuration file input support
- Add TODO for single entry point CLI architecture (fusion run_sim)
- Add TODO for schema system consolidation (schema.py vs schemas/*.json)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Remove emojis from all top-level markdown files for professional
presentation while maintaining readability and structure.

Documentation improvements:
- Remove emojis from README.md and DEVELOPMENT_QUICKSTART.md
- Add comprehensive CLAUDE.md with project context for AI assistants
- Fix placeholder email in CODE_OF_CONDUCT.md enforcement section
- Streamline CONTRIBUTING.md with references to detailed standards
- Remove research planning files (new-paper-*.md)

Code quality improvements:
- Remove redundant default values in network_analysis.py
- Fix docstring formatting in cli_to_config.py
- Add ML support TODO item in core/TODO.md
- Remove verbose seeding comment block in simulation.py
…ture

Resolve configuration duplication issues by implementing a hybrid system that
supports both nested sections and flat backward-compatible access patterns.

Changes:
- Update config loader to preserve non-general sections as nested dicts
- Add mirroring function to copy nested values to root for backward compat
- Move route_method and allocation_method from required to optional settings
- Reorganize routing and spectrum parameters into dedicated sections
- Add missing ml_settings parameters across all config files
- Add missing failure_settings parameters to survivability examples

This allows new code to access engine_props["routing_settings"]["k_paths"]
while legacy code continues to work with engine_props["k_paths"].

All configuration files now have clean separation between general_settings
and specialized sections (routing_settings, spectrum_settings, ml_settings).
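
The mirroring step can be pictured with the sketch below. It is an illustration under assumptions: the function name is hypothetical, and the choice not to overwrite keys that already exist at the root is a design guess rather than confirmed behavior.

```python
def mirror_nested_to_root(engine_props):
    """Copy values from nested section dicts to the root for legacy access."""
    mirrored = dict(engine_props)
    for values in engine_props.values():
        if isinstance(values, dict):
            for key, value in values.items():
                # Assumption: an existing root-level key is not overwritten.
                mirrored.setdefault(key, value)
    return mirrored


props = mirror_nested_to_root({"routing_settings": {"k_paths": 3}})
assert props["routing_settings"]["k_paths"] == 3   # new nested access
assert props["k_paths"] == 3                       # legacy flat access
```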

Generated with Claude Code

Co-Authored-By: Claude <noreply@anthropic.com>
Fix test failures caused by recent routing architecture refactoring
that introduced route_props for storing routing algorithm results.
Also fix config tests to match hybrid nested/flat configuration
architecture and remove emoji expectations per project guidelines.

Changes include:
- Add default values in network_analysis.get_link_usage_summary
- Update factory tests to mock route_props.paths_matrix
- Fix config_setup tests for nested optional options
- Update CLI tests to remove emoji expectations (GUI and train)
- Fix schema tests to match current required options structure
- Complete route_props integration in routing algorithms

Generated with Claude Code

Co-Authored-By: Claude <noreply@anthropic.com>
feat(survivability): implement phase 6 - quality assurance
fix(quality): resolve linting errors, unit test failures, and config processing bugs
@ryanmccann1024 merged commit a69d2f4 into feature/surv-v1-phase4-rl-integration on Nov 7, 2025
6 checks passed
@ryanmccann1024 deleted the feature/surv-v1-phase5-metrics branch on January 19, 2026 19:12

1 participant