Skip to content

Feature/checkpoint resume#5

Merged
Gambitnl merged 18 commits intomainfrom
feature/checkpoint-resume
Oct 24, 2025
Merged

Feature/checkpoint resume#5
Gambitnl merged 18 commits intomainfrom
feature/checkpoint-resume

Conversation

@Gambitnl
Copy link
Copy Markdown
Owner

No description provided.

Gambitnl and others added 17 commits October 23, 2025 21:24
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
The load method is not robust against corrupted or malformed checkpoint files. If path.read_text() succeeds but the content is not valid JSON, json.loads will raise a JSONDecodeError. Similarly, if the JSON is valid but missing keys, CheckpointRecord(**data) will raise a TypeError. This would crash the pipeline. It's safer to wrap this logic in a try...except block to handle potential parsing errors, log a warning, and return None, treating the checkpoint as invalid.

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Replacing the Unicode checkbox '🔲' with just spaces makes the list item for current objectives look incomplete. To maintain visual structure and adhere to the ASCII-only guidelines, it would be better to use an ASCII representation for an unchecked box, such as [ ].

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
The word 'swords' appears to be a typo or an incomplete replacement for the '⚔️' emoji. For consistency with other ASCII tags being introduced (like [COMBAT] in the action_icons dictionary on line 375), this header should use a similar tag format.

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
While the '🎒' emoji was removed, the header for 'Inventory' now lacks the ASCII tag format that has been applied to most other headers in this file (e.g., [BOOK], [TARGET]). For consistency, consider adding a descriptive tag like [BAG] or [INVENTORY].

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
The _get_env_as_bool method handles None but not empty strings. If the environment variable is set to an empty string (e.g., MY_VAR=), it will evaluate to False because "" is not in the set of true values. This is inconsistent with _get_env_as_int, which correctly treats an empty string as a reason to use the default value. For consistency, an empty string should also cause _get_env_as_bool to fall back to its default.

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Implements P1-FEATURE-003: Batch Processing feature with:
- BatchProcessor class for sequential multi-file processing
- Automatic checkpoint resumption for interrupted sessions
- Progress reporting with rich progress bars
- Batch summary reports in markdown format
- CLI integration via `batch` command
- Comprehensive test suite (17 tests)

Also includes initial pipeline.py test coverage (12 tests).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@gemini-code-assist
Copy link
Copy Markdown
Contributor

Summary of Changes

Hello @Gambitnl, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a crucial checkpoint and resume system, enabling more resilient processing of audio sessions. It also delivers a new batch processing capability via the CLI, allowing users to efficiently handle multiple recordings. Concurrently, the UI has undergone substantial refactoring to improve modularity and maintainability. These functional enhancements are complemented by critical bug fixes for configuration parsing and a significant expansion of the project's planning and testing documentation, laying a stronger foundation for future development.

Highlights

  • Checkpoint and Resume System: Introduced a robust checkpoint system (P0-BUG-003) allowing processing to be resumed from the last successful stage, preventing data loss in case of interruptions. This includes human-readable JSON checkpoint files, UI integration for a 'Resume' option, and a CLI flag.
  • Batch Processing CLI: Added a new CLI command (cli.py batch) and a dedicated BatchProcessor module (P1-FEATURE-003) to enable processing of multiple audio files sequentially. This feature supports automatic resumption from checkpoints, graceful error handling, and generates a comprehensive summary report.
  • Major UI Refactoring: Undertook significant refactoring of the main app.py by extracting the 'Process Session' and 'Party Management' UI components into their own dedicated modules (src/ui/process_session_tab.py and src/ui/party_management_tab.py). This improves maintainability and modularity, aligning with P0-REFACTOR-001 and P0-REFACTOR-003.
  • Configuration Safety and Consistency: Implemented critical bug fixes (P0-BUG-002) by creating safe utility methods (get_env_as_int, get_env_as_bool) for parsing environment variables, preventing application crashes due to invalid .env values. This also addresses inconsistencies in handling empty strings between integer and boolean parsing.
  • Comprehensive Planning and Testing Documentation: Introduced extensive documentation outlining detailed implementation plans for current and future features (P0-P4) across multiple markdown files. A new TESTING.md and TEST_PLANS.md provide a comprehensive overview of test coverage, execution guides, smoke testing checklists, and test priority recommendations, alongside a new pytest.ini configuration.
  • Expanded Test Suite: Added numerous new unit tests for core components including OOCAnalyzer, AudioProcessor, BatchProcessor, CampaignDashboard, ClassifierFactory, SpeakerDiarizer, KnowledgeExtractor, and TranscriberFactory. This significantly increases test coverage and stability.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This is a very large pull request that introduces multiple features (batch processing), major refactoring (app.py), bug fixes (config loading), and a significant amount of documentation and tests. While the quality of the changes is high, the size of the PR makes it very difficult to review thoroughly. In the future, it would be highly beneficial to split such large changes into smaller, more focused pull requests (e.g., one for the feature, one for the refactoring, one for the tests).

That said, the changes are excellent. The refactoring of app.py into smaller, manageable UI modules is a great improvement for maintainability. The new batch processing feature is a valuable addition and is well-implemented with good CLI ergonomics and progress reporting. The massive addition of documentation and tests dramatically improves the project's quality and developer experience. The fixes to configuration handling are also correct and well-tested. I have one suggestion for improving the file discovery logic in the new batch command.

Comment thread cli.py
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread tests/test_pipeline.py
Comment thread tests/test_snipper.py
@Gambitnl Gambitnl merged commit 0280ae7 into main Oct 24, 2025
Gambitnl pushed a commit that referenced this pull request Nov 12, 2025
- Add Classification and TranscriptFilter enums to src/constants.py
- Implement format_filtered() method with flexible filtering
- Update format_ic_only/ooc_only to delegate to new method
- Maintain 100% backward compatibility with existing code

Changes:
- src/constants.py: New file with Classification and TranscriptFilter enums
- src/formatter.py: Added format_filtered() method (80 lines)
- src/formatter.py: Refactored format_ic_only() to delegate (11 lines, was 36)
- src/formatter.py: Refactored format_ooc_only() to delegate (11 lines, was 33)

Benefits:
- Eliminates ~58 lines of duplicate code
- Strategy pattern allows easy extension (e.g., MIXED_ONLY filter)
- Type-safe filtering with enums instead of magic strings
- No breaking changes - all existing code continues to work

Part of: Refactor #5 - Consolidate Formatter Methods
Related: Refactor #7 - Replace Magic Strings with Enums
Gambitnl pushed a commit that referenced this pull request Nov 12, 2025
Add 14 comprehensive test cases covering:
- All filter types (ALL, IC_ONLY, OOC_ONLY, MIXED_ONLY)
- Edge cases (empty transcript, invalid filter type)
- Speaker profile mapping and character name usage
- Backward compatibility with format_ic_only/ooc_only
- Timestamp formatting
- Default parameter behavior

Tests ensure:
- New method works correctly for all scenarios
- Old methods produce identical output to new method
- Type safety with enum validation
- Proper error handling

Coverage: 100% of format_filtered() logic paths

Part of: Refactor #5 - Consolidate Formatter Methods
Gambitnl added a commit that referenced this pull request Nov 12, 2025
* feat(formatter): add format_filtered with Strategy pattern

- Add Classification and TranscriptFilter enums to src/constants.py
- Implement format_filtered() method with flexible filtering
- Update format_ic_only/ooc_only to delegate to new method
- Maintain 100% backward compatibility with existing code

Changes:
- src/constants.py: New file with Classification and TranscriptFilter enums
- src/formatter.py: Added format_filtered() method (80 lines)
- src/formatter.py: Refactored format_ic_only() to delegate (11 lines, was 36)
- src/formatter.py: Refactored format_ooc_only() to delegate (11 lines, was 33)

Benefits:
- Eliminates ~58 lines of duplicate code
- Strategy pattern allows easy extension (e.g., MIXED_ONLY filter)
- Type-safe filtering with enums instead of magic strings
- No breaking changes - all existing code continues to work

Part of: Refactor #5 - Consolidate Formatter Methods
Related: Refactor #7 - Replace Magic Strings with Enums

* test(formatter): add comprehensive tests for format_filtered

Add 14 comprehensive test cases covering:
- All filter types (ALL, IC_ONLY, OOC_ONLY, MIXED_ONLY)
- Edge cases (empty transcript, invalid filter type)
- Speaker profile mapping and character name usage
- Backward compatibility with format_ic_only/ooc_only
- Timestamp formatting
- Default parameter behavior

Tests ensure:
- New method works correctly for all scenarios
- Old methods produce identical output to new method
- Type safety with enum validation
- Proper error handling

Coverage: 100% of format_filtered() logic paths

Part of: Refactor #5 - Consolidate Formatter Methods

* fix(formatter): restore backward compatibility for MIXED segments

Critical fix addressing code review feedback from gemini-code-assist
and chatgpt-codex-connector bots.

Problem:
The original format_ic_only() used `if classification == "OOC": continue`,
which included BOTH IC and MIXED segments (excluded only OOC).
The new format_filtered() used `if classification != IC: continue`, which
EXCLUDED both OOC and MIXED segments, breaking backward compatibility.

Solution:
- Changed IC_ONLY filter to exclude only OOC (keeps IC and MIXED)
- Changed OOC_ONLY filter to exclude only IC (keeps OOC and MIXED)
- MIXED_ONLY filter continues to include only MIXED segments
- Added detailed comments explaining the filtering logic

Additional fixes:
- Changed TranscriptFilter.ALL header from "FULL VERSION" to "FILTERED VERSION"
  to avoid confusion with format_full_transcript() output
- Updated test fixtures to include MIXED classification
- Updated test expectations to verify MIXED segments appear in both IC and OOC outputs
- Added backward compatibility verification comments

This ensures existing code that calls format_ic_only()/format_ooc_only()
continues to work exactly as before, with MIXED content appearing in both
generated transcripts.

Fixes: Code review issues P1 (breaking change) and High (header confusion)

---------

Co-authored-by: Claude <noreply@anthropic.com>
Gambitnl pushed a commit that referenced this pull request Nov 14, 2025
Implemented 16 TODO test cases covering all critical functionality:

Initialization Tests (3):
- test_init_with_defaults: Verify default config values
- test_init_with_custom_params: Verify custom parameter support
- test_init_loads_vad_model: Verify Silero VAD model loading

Chunking Logic Tests (4):
- test_chunk_audio_basic: Basic chunking with 30s audio
- test_chunk_audio_creates_overlap: Verify proper chunk overlap
- test_chunk_audio_respects_max_length: Ensure max length compliance
- test_chunk_audio_with_short_file: Handle audio shorter than chunk size

VAD Detection Tests (4):
- test_find_best_split_point_with_silence: Find optimal split in gaps
- test_find_best_split_point_no_silence: Fallback to target time
- test_proximity_scoring: Prefer gaps closer to target
- test_width_scoring: Prefer wider silence gaps

Edge Case Tests (5):
- test_empty_audio_file: Handle zero-duration audio
- test_audio_exact_chunk_length: Handle exact chunk length match
- test_very_long_audio: Handle long audio files (1+ hour)
- test_audio_with_invalid_sample_rate: Handle non-16kHz audio
- test_audio_with_multiple_channels: Handle stereo audio

All tests use proper mocking to avoid downloading VAD models during
testing. Helper functions create test audio files programmatically.
Only the slow integration test remains skipped (requires real audio).

Resolves task #5 - tests/test_chunker.py:114-290
Gambitnl pushed a commit that referenced this pull request Nov 14, 2025
- Updated TEST_PLANS.md with P0-2 completion status
- Added detailed implementation summary with all 16 tests
- Updated TESTING.md with HybridChunker coverage
- Added Priority 0 components tracking section
- Updated test structure diagram

References:
- Task #5: Complete chunker test suite
- tests/test_chunker.py: 16 tests implemented
- Coverage: Initialization, chunking logic, VAD detection, edge cases
Gambitnl added a commit that referenced this pull request Nov 14, 2025
* feat(tests): Implement comprehensive test suite for HybridChunker

Implemented 16 TODO test cases covering all critical functionality:

Initialization Tests (3):
- test_init_with_defaults: Verify default config values
- test_init_with_custom_params: Verify custom parameter support
- test_init_loads_vad_model: Verify Silero VAD model loading

Chunking Logic Tests (4):
- test_chunk_audio_basic: Basic chunking with 30s audio
- test_chunk_audio_creates_overlap: Verify proper chunk overlap
- test_chunk_audio_respects_max_length: Ensure max length compliance
- test_chunk_audio_with_short_file: Handle audio shorter than chunk size

VAD Detection Tests (4):
- test_find_best_split_point_with_silence: Find optimal split in gaps
- test_find_best_split_point_no_silence: Fallback to target time
- test_proximity_scoring: Prefer gaps closer to target
- test_width_scoring: Prefer wider silence gaps

Edge Case Tests (5):
- test_empty_audio_file: Handle zero-duration audio
- test_audio_exact_chunk_length: Handle exact chunk length match
- test_very_long_audio: Handle long audio files (1+ hour)
- test_audio_with_invalid_sample_rate: Handle non-16kHz audio
- test_audio_with_multiple_channels: Handle stereo audio

All tests use proper mocking to avoid downloading VAD models during
testing. Helper functions create test audio files programmatically.
Only the slow integration test remains skipped (requires real audio).

Resolves task #5 - tests/test_chunker.py:114-290

* docs: Update test documentation for completed chunker tests

- Updated TEST_PLANS.md with P0-2 completion status
- Added detailed implementation summary with all 16 tests
- Updated TESTING.md with HybridChunker coverage
- Added Priority 0 components tracking section
- Updated test structure diagram

References:
- Task #5: Complete chunker test suite
- tests/test_chunker.py: 16 tests implemented
- Coverage: Initialization, chunking logic, VAD detection, edge cases

---------

Co-authored-by: Claude <noreply@anthropic.com>
Gambitnl pushed a commit that referenced this pull request Nov 14, 2025
Summary of Changes:
- Updated MASTER_PLAN.md: Moved 6 completed bugs to Done section
- Updated TEST_PLANS.md: Marked P0-2, P2-1, P2-2 as completed with summaries
- Updated refactoring/README.md: Marked #5, #6, #8, #9 as completed
- Created CONCURRENT_ORCHESTRATION_SPRINT_2025-11-14.md: Comprehensive sprint report

Sprint Results:
- 15 concurrent tasks completed across 3 batches
- 220+ tests added, 99% coverage for party_config.py
- ~5,000 lines of production code
- Zero merge conflicts maintained
- 100% task independence validated

Documentation Updates:
- 6 bugs moved from "To Do" to "Done" in MASTER_PLAN.md
- 3 test plans marked complete with implementation summaries
- 4 refactoring tasks marked complete with actual effort tracking
- Added comprehensive sprint report with metrics and lessons learned

Files Modified:
- docs/MASTER_PLAN.md
- docs/TEST_PLANS.md
- docs/refactoring/README.md
- docs/CONCURRENT_ORCHESTRATION_SPRINT_2025-11-14.md (new)
Gambitnl added a commit that referenced this pull request Nov 14, 2025
Summary of Changes:
- Updated MASTER_PLAN.md: Moved 6 completed bugs to Done section
- Updated TEST_PLANS.md: Marked P0-2, P2-1, P2-2 as completed with summaries
- Updated refactoring/README.md: Marked #5, #6, #8, #9 as completed
- Created CONCURRENT_ORCHESTRATION_SPRINT_2025-11-14.md: Comprehensive sprint report

Sprint Results:
- 15 concurrent tasks completed across 3 batches
- 220+ tests added, 99% coverage for party_config.py
- ~5,000 lines of production code
- Zero merge conflicts maintained
- 100% task independence validated

Documentation Updates:
- 6 bugs moved from "To Do" to "Done" in MASTER_PLAN.md
- 3 test plans marked complete with implementation summaries
- 4 refactoring tasks marked complete with actual effort tracking
- Added comprehensive sprint report with metrics and lessons learned

Files Modified:
- docs/MASTER_PLAN.md
- docs/TEST_PLANS.md
- docs/refactoring/README.md
- docs/CONCURRENT_ORCHESTRATION_SPRINT_2025-11-14.md (new)

Co-authored-by: Claude <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant