feat: Add Google Drive OAuth authentication for private document access#2

Merged
Gambitnl merged 4 commits into main from claude/explore-project-011CUM5RvDiBmudCXkhQzq3x on Oct 21, 2025
Conversation

@Gambitnl (Owner)

Replace public document sharing requirement with proper OAuth 2.0 authentication.

Changes:

  • Add Google OAuth dependencies (google-auth, google-auth-oauthlib, google-api-python-client)
  • Create src/google_drive_auth.py module with OAuth flow implementation
  • Update app.py Document Viewer with OAuth authorization UI
  • Add comprehensive setup guide (GOOGLE_DRIVE_SETUP.md)
  • Update .gitignore to protect OAuth credentials

Benefits:

  • Access private Google Docs without making them publicly shared
  • Secure OAuth 2.0 flow with token refresh handling
  • Better security for campaign notes and session logs
  • Enables using historical session notes (sessions 1-5) for character profiling

Technical details:

  • Uses out-of-band OAuth flow compatible with Gradio
  • Token storage in outputs/gdrive_token.json
  • Auto-refresh for expired tokens
  • Read-only Drive API scope

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@gemini-code-assist (Contributor)

Summary of Changes

Hello @Gambitnl, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly upgrades the application's document viewing capabilities by integrating Google Drive OAuth 2.0 authentication. This change eliminates the need for users to make their Google Docs publicly accessible, thereby enhancing security and privacy. The new system allows the application to securely fetch content from private Google Docs, which can then be used for internal processes like character profiling, making the tool more robust and user-friendly for handling sensitive campaign notes and session logs.

Highlights

  • Google Drive OAuth Integration: Implemented OAuth 2.0 authentication to allow secure access to private Google Docs, replacing the previous requirement for public sharing.
  • New google_drive_auth.py Module: Introduced a dedicated module to encapsulate the Google Drive OAuth flow, credential management, and API interactions for document retrieval.
  • Updated Gradio UI for Authorization: The 'Document Viewer' tab in app.py now includes a comprehensive user interface for managing Google Drive authorization, including status checks, starting/completing the flow, and revoking access.
  • Detailed Setup Guide: Added GOOGLE_DRIVE_SETUP.md which provides step-by-step instructions for configuring Google Cloud Project, enabling the Drive API, and setting up OAuth credentials.
  • Dependency and .gitignore Updates: Included necessary Google authentication libraries in requirements.txt and added credential files (gdrive_credentials.json, gdrive_token.json) to .gitignore for security.


gemini-code-assist (bot) left a comment


Code Review

This pull request is a significant improvement, replacing public document sharing with a secure Google Drive OAuth 2.0 flow. The implementation is well-structured with a dedicated authentication module and comprehensive setup documentation. My review focuses on enhancing the robustness and correctness of the implementation. I've identified a critical issue regarding state management in the Gradio application that needs to be addressed to ensure correct behavior for concurrent users or multiple browser tabs. Additionally, I've provided suggestions to improve error handling by using exceptions instead of string comparisons and to make the Google Doc ID parsing more reliable.

Comment thread app.py Outdated

PROJECT_ROOT = Path(__file__).resolve().parent
NOTEBOOK_CONTEXT = ""
OAUTH_FLOW = None # Global to store OAuth flow between steps


critical

The use of a global variable OAUTH_FLOW to manage the OAuth flow state is not safe for a web application, even one like Gradio. This will cause issues if multiple users access the app concurrently, or if a single user opens it in multiple browser tabs, as they will all share the same OAUTH_FLOW instance, leading to race conditions and incorrect behavior. The correct way to handle user-specific state in Gradio is by using gr.State.

To fix this, you should:

  1. Add a gr.State component to your UI to hold the flow object: oauth_flow_state = gr.State(None).
  2. Update start_oauth_flow to return the flow object and wire it to oauth_flow_state in the .click() handler's outputs.
  3. Update complete_oauth_flow to accept the flow object from oauth_flow_state as an input.
  4. On successful authorization, complete_oauth_flow should return None to clear the oauth_flow_state.
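The four steps above can be sketched with plain functions plus a stand-in flow object (the real code would hold a google-auth `Flow`; the Gradio wiring is shown in comments since it only runs inside a `gr.Blocks` app):

```python
class OAuthFlow:
    """Hypothetical stand-in for the google-auth Flow object."""
    def __init__(self):
        self.auth_url = "https://accounts.google.com/o/oauth2/auth?..."

def start_oauth_flow(_current_state):
    # Create a fresh flow for this session and return it as the new state.
    flow = OAuthFlow()
    instructions = f"Visit {flow.auth_url} and paste the code below."
    return instructions, flow

def complete_oauth_flow(code, flow):
    if flow is None:
        return "Error: start the authorization flow first.", None
    # With the real object, flow.fetch_token(code=code) would run here.
    result = f"Authorized with code {code!r}."
    return result, None  # returning None clears the gr.State on success

# Inside gr.Blocks(), the handlers would be wired roughly as:
#   oauth_flow_state = gr.State(None)
#   start_btn.click(start_oauth_flow, inputs=[oauth_flow_state],
#                   outputs=[instructions_box, oauth_flow_state])
#   complete_btn.click(complete_oauth_flow,
#                      inputs=[code_box, oauth_flow_state],
#                      outputs=[result_box, oauth_flow_state])
```

Because `gr.State` is per-session, two browser tabs each get their own flow object instead of racing on a shared global.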

Comment thread app.py
Comment on lines 197 to 208
  try:
-     doc_id_start = doc_url.find("/d/") + 3
-     doc_id_end = doc_url.find("/edit")
-     if doc_id_end == -1:
-         doc_id_end = doc_url.find("/view")
-     if doc_id_end == -1:
-         doc_id_end = len(doc_url)
-
-     doc_id = doc_url[doc_id_start:doc_id_end]
-     export_url = f"https://docs.google.com/document/d/{doc_id}/export?format=txt"
-
-     response = requests.get(export_url)
-     response.raise_for_status()
-     NOTEBOOK_CONTEXT = response.text or ""
-     return NOTEBOOK_CONTEXT
+     if not is_authenticated():
+         return "Error: Not authenticated with Google Drive. Please authorize first using the 'Authorize Google Drive' section below."
+
+     content = get_document_content(doc_url)
+
+     # Only update NOTEBOOK_CONTEXT if we got valid content
+     if not content.startswith("Error"):
+         NOTEBOOK_CONTEXT = content
+
+     return content
  except Exception as e:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The error handling in view_google_doc relies on checking if the returned content string starts with "Error". This is a fragile pattern because the error message could change, or a valid document could coincidentally start with that word. A more robust approach is to have get_document_content and other functions in src/google_drive_auth.py raise specific exceptions for different error conditions (e.g., AuthenticationError, DocumentNotFoundError). The calling function view_google_doc can then use a try...except block to catch these specific exceptions and format an appropriate error message for the user. This makes the code cleaner, more reliable, and decouples the UI from the implementation details of the auth module.

Suggested change
-     try:
-         if not is_authenticated():
-             return "Error: Not authenticated with Google Drive. Please authorize first using the 'Authorize Google Drive' section below."
-
-         content = get_document_content(doc_url)
-
-         # Only update NOTEBOOK_CONTEXT if we got valid content
-         if not content.startswith("Error"):
-             NOTEBOOK_CONTEXT = content
-
-         return content
-     except Exception as e:
+     try:
+         # This change requires refactoring get_document_content() to raise exceptions on errors.
+         content = get_document_content(doc_url)
+         NOTEBOOK_CONTEXT = content
+         return content
+     except Exception as e:
+         # For better user feedback, you could catch specific custom exceptions
+         # from the google_drive_auth module here.
+         return f"Error: {e}"

Comment thread src/google_drive_auth.py
Comment on lines +84 to +86
except Exception as e:
    print(f"Error exchanging code for token: {e}")
    return False


high

Catching a broad Exception can hide bugs and make debugging difficult. It's better to catch more specific exceptions that you expect can be raised by flow.fetch_token(). According to the google-auth-oauthlib library, this would typically be a subclass of google.auth.exceptions.GoogleAuthError. This makes your error handling more explicit and robust.

Suggested change
- except Exception as e:
-     print(f"Error exchanging code for token: {e}")
-     return False
+ except Exception as e:  # Consider catching a more specific exception, e.g., from google.auth.exceptions
+     print(f"Error exchanging code for token: {e}")
+     return False

Comment thread src/google_drive_auth.py
Comment on lines +102 to +108
if creds and creds.expired and creds.refresh_token:
    try:
        creds.refresh(Request())
        TOKEN_FILE.write_text(creds.to_json())
    except Exception as e:
        print(f"Error refreshing token: {e}")
        return None


high

Similar to other parts of the code, this except Exception is too broad. When refreshing a token, the google-auth library can raise a google.auth.exceptions.RefreshError. Catching this specific exception makes the code more robust and clearly documents the expected failure mode. If a refresh fails, the token file might be corrupt, and it could be a good idea to delete it to force a full re-authentication.

Suggested change
- if creds and creds.expired and creds.refresh_token:
-     try:
-         creds.refresh(Request())
-         TOKEN_FILE.write_text(creds.to_json())
-     except Exception as e:
-         print(f"Error refreshing token: {e}")
-         return None
+ if creds and creds.expired and creds.refresh_token:
+     try:
+         creds.refresh(Request())
+         TOKEN_FILE.write_text(creds.to_json())
+     except Exception as e:  # Consider catching google.auth.exceptions.RefreshError
+         print(f"Error refreshing token: {e}")
+         # The token may be invalid, consider deleting it to force re-auth.
+         # TOKEN_FILE.unlink(missing_ok=True)
+         return None

Comment thread GOOGLE_DRIVE_SETUP.md Outdated
8. Click **"Download JSON"**
9. Save the downloaded file as `gdrive_credentials.json` in your project root directory

**Important:** The file MUST be named exactly `gdrive_credentials.json` and placed in the root of your Video_chunking directory.


medium

The instruction hardcodes the directory name Video_chunking. This might be confusing for users who have cloned the repository under a different name. It's better to use a more generic term like "project root directory" to avoid confusion and make the documentation more generally applicable.

Suggested change
- **Important:** The file MUST be named exactly `gdrive_credentials.json` and placed in the root of your Video_chunking directory.
+ **Important:** The file MUST be named exactly `gdrive_credentials.json` and placed in the project root directory.

Comment thread src/google_drive_auth.py Outdated
Comment on lines +181 to +193
# If it's already just an ID (no slashes), return as-is
if '/' not in doc_url:
    return doc_url

# Extract from URL format: https://docs.google.com/document/d/{ID}/edit
doc_id_start = doc_url.find("/d/") + 3
doc_id_end = doc_url.find("/edit")
if doc_id_end == -1:
    doc_id_end = doc_url.find("/view")
if doc_id_end == -1:
    doc_id_end = len(doc_url)

return doc_url[doc_id_start:doc_id_end]


medium

The current implementation for extracting the document ID uses string.find(), which can be brittle. For example, it might not correctly handle URLs with extra query parameters. Using a regular expression is a more robust way to parse the ID from various Google Doc URL formats.

This also simplifies the code by removing the chained if statements.

Suggested change
- # If it's already just an ID (no slashes), return as-is
- if '/' not in doc_url:
-     return doc_url
-
- # Extract from URL format: https://docs.google.com/document/d/{ID}/edit
- doc_id_start = doc_url.find("/d/") + 3
- doc_id_end = doc_url.find("/edit")
- if doc_id_end == -1:
-     doc_id_end = doc_url.find("/view")
- if doc_id_end == -1:
-     doc_id_end = len(doc_url)
-
- return doc_url[doc_id_start:doc_id_end]
+ import re
+
+ def _extract_doc_id(doc_url: str) -> str:
+     """
+     Extract document ID from Google Docs URL.
+
+     Args:
+         doc_url: Full URL or just the document ID
+
+     Returns:
+         Document ID string
+     """
+     # Regex to find the document ID in various URL formats
+     match = re.search(r'/d/([a-zA-Z0-9-_]+)', doc_url)
+     if match:
+         return match.group(1)
+
+     # If no match, assume the input is the ID itself.
+     return doc_url


chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.


Comment thread src/google_drive_auth.py Outdated
Comment on lines +47 to +53
# Use out-of-band flow for Gradio compatibility
# User will copy-paste the authorization code
flow = Flow.from_client_secrets_file(
    str(CLIENT_CONFIG_FILE),
    scopes=SCOPES,
    redirect_uri='urn:ietf:wg:oauth:2.0:oob'
)


P1: Replace deprecated OOB OAuth redirect

The OAuth flow is hard‑wired to redirect_uri='urn:ietf:wg:oauth:2.0:oob'. Google permanently disabled the OOB redirect for new client IDs in 2022, so users who follow the new setup guide will receive invalid_request errors when trying to authorize. The flow should use the loopback/localhost pattern via InstalledAppFlow.run_local_server (or an equivalent custom callback) so that newly created credentials can complete authentication.


Critical bug fix identified by gemini-code-assist bot review.

Problem:
- Global OAUTH_FLOW variable caused race conditions with concurrent users
- Multiple browser tabs would share the same OAuth flow instance
- Not safe for multi-user Gradio web application

Solution:
- Replace global variable with gr.State for per-session state management
- Update start_oauth_flow() to return (instructions, flow_object) tuple
- Update complete_oauth_flow() to accept flow from State and return (result, cleared_state)
- Wire gr.State through UI click handlers

Benefits:
- Thread-safe OAuth flow for concurrent users
- Each browser tab/session has isolated OAuth state
- Proper Gradio best practices for stateful components

Technical changes:
- Removed: OAUTH_FLOW global variable
- Added: oauth_flow_state = gr.State(None) in Document Viewer tab
- Updated: Function signatures to accept/return flow object
- Updated: Click handlers to wire State through inputs/outputs

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Fixes all issues identified by gemini-code-assist and codex bot reviews:

**CRITICAL P1 Fix:**
- Replace deprecated OOB OAuth (disabled 2022) with localhost redirect
- Update flow to use http://localhost:8080/ instead of urn:ietf:wg:oauth:2.0:oob
- Support both full redirect URL and authorization code as input
- Extract code from URL query parameters using urlparse
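The "extract code from URL query parameters" step can be sketched with the standard library alone. This is an illustrative helper, not the project's exact function; it accepts either the full pasted redirect URL or a bare authorization code, as the fix describes:

```python
from urllib.parse import urlparse, parse_qs

def extract_auth_code(user_input: str) -> str:
    """Return the OAuth authorization code from a redirect URL or bare code."""
    user_input = user_input.strip()
    if user_input.startswith("http"):
        # e.g. http://localhost:8080/?state=...&code=4%2F0Axyz&scope=...
        query = parse_qs(urlparse(user_input).query)
        codes = query.get("code")
        if not codes:
            raise ValueError("no 'code' parameter found in redirect URL")
        return codes[0]  # parse_qs already percent-decodes the value
    return user_input  # input was already a bare authorization code
```

Because the localhost redirect page never loads, the user pastes the full URL from the browser's address bar, and this helper pulls out just the code.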

**Error Handling Improvements:**
- Replace broad Exception catches with specific RefreshError
- Auto-delete invalid token file on refresh failure to force re-auth
- Better error messages and handling

**Robust URL Parsing:**
- Replace string.find() with regex in _extract_doc_id()
- Handle various Google Docs URL formats with query parameters
- More reliable document ID extraction

**Documentation Updates:**
- Use generic "project root" instead of hardcoded "Video_chunking"
- Update OAuth flow instructions for localhost redirect
- Clear explanation that localhost redirect page won't load (expected)
- Detailed steps for copying redirect URL from browser

**UI Updates:**
- Updated authorization instructions in app.py
- Changed textbox label to "Redirect URL or Authorization Code"
- Better placeholder text explaining URL format
- Clearer step-by-step guidance for users

Technical changes:
- Added imports: re, urllib.parse (urlparse, parse_qs), RefreshError
- Updated get_auth_url() redirect_uri
- Updated exchange_code_for_token() to parse URLs
- Updated get_credentials() to use specific exception handling
- Refactored _extract_doc_id() with regex pattern

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@Gambitnl Gambitnl merged commit 5ba89b5 into main Oct 21, 2025
@Gambitnl Gambitnl deleted the claude/explore-project-011CUM5RvDiBmudCXkhQzq3x branch November 6, 2025 22:50
Gambitnl pushed a commit that referenced this pull request Nov 12, 2025
Extract _build_prompt() and _parse_response() methods from OllamaClassifier
and GroqClassifier into the shared BaseClassifier to eliminate code duplication.

Changes:
- Add _build_prompt() to BaseClassifier with comprehensive docstrings
- Add _parse_response() to BaseClassifier with standard format handling
- Remove duplicate _build_prompt() from OllamaClassifier (~18 lines)
- Remove duplicate _build_prompt() from GroqClassifier (~18 lines)
- Remove duplicate _parse_response() from OllamaClassifier (~45 lines)
- Remove duplicate _parse_response() from GroqClassifier (~45 lines)

Benefits:
- DRY principle: Single implementation eliminates ~100 lines of duplicate code
- Maintainability: Changes to prompt building or parsing only need one update
- Consistency: Both classifiers use identical logic guaranteed
- Extensibility: New classifiers can inherit these methods without reimplementation

The shared implementation:
- Uses prompt_template attribute set during initialization
- Handles standard Classificatie/Reden/Vertrouwen/Personage format
- Provides graceful fallbacks for invalid responses
- Properly clamps confidence values using ConfidenceDefaults

Related: Part of Refactor #2+#3 (Consolidate Classifier Methods)
Uses: Classification enum from constants.py (Agent A's work)
Gambitnl pushed a commit that referenced this pull request Nov 12, 2025
Consolidated duplicate _parse_response() and _build_prompt() methods
from OllamaClassifier and GroqClassifier into the BaseClassifier,
eliminating ~65 lines of duplicate code.

Changes:
- Moved _build_prompt() to BaseClassifier with comprehensive docstring
  - Handles prompt template formatting with character/player context
  - Subclasses can override for custom prompt building

- Moved _parse_response() to BaseClassifier with comprehensive docstring
  - Parses Dutch-format LLM responses (Classificatie/Reden/Vertrouwen/Personage)
  - Handles Classification enum, confidence clamping, and error cases
  - Subclasses can override for different response formats

- Added preflight_check() to GroqClassifier
  - Validates API key configuration
  - Tests API connectivity
  - Matches OllamaClassifier pattern

Benefits:
- DRY principle: Single implementation instead of duplicate code
- Maintainability: Bug fixes and improvements apply to all classifiers
- Extensibility: New classifiers can inherit default implementations
- Type safety: Uses Classification enum from constants module

Testing:
- All 37 existing tests pass
- Code coverage: 78%
- No breaking changes - backward compatible

Dependencies:
- Uses Classification enum from src/constants.py (Refactor #7)

Refs: Refactor #2+#3 - Consolidate Classifier Methods
Gambitnl added a commit that referenced this pull request Nov 12, 2025
* refactor(classifier): consolidate duplicate methods into BaseClassifier

Extract _build_prompt() and _parse_response() methods from OllamaClassifier
and GroqClassifier into the shared BaseClassifier to eliminate code duplication.

Changes:
- Add _build_prompt() to BaseClassifier with comprehensive docstrings
- Add _parse_response() to BaseClassifier with standard format handling
- Remove duplicate _build_prompt() from OllamaClassifier (~18 lines)
- Remove duplicate _build_prompt() from GroqClassifier (~18 lines)
- Remove duplicate _parse_response() from OllamaClassifier (~45 lines)
- Remove duplicate _parse_response() from GroqClassifier (~45 lines)

Benefits:
- DRY principle: Single implementation eliminates ~100 lines of duplicate code
- Maintainability: Changes to prompt building or parsing only need one update
- Consistency: Both classifiers use identical logic guaranteed
- Extensibility: New classifiers can inherit these methods without reimplementation

The shared implementation:
- Uses prompt_template attribute set during initialization
- Handles standard Classificatie/Reden/Vertrouwen/Personage format
- Provides graceful fallbacks for invalid responses
- Properly clamps confidence values using ConfidenceDefaults

Related: Part of Refactor #2+#3 (Consolidate Classifier Methods)
Uses: Classification enum from constants.py (Agent A's work)

* refactor(classifier): improve _parse_response robustness

Address code review feedback to make response parsing more resilient:

1. Fix potential IndexError when response lines don't contain ':'
   - Now splits first and checks if we have exactly 2 parts
   - Skips malformed lines with 'continue' instead of crashing

2. Add logging for invalid confidence values
   - Previously silently ignored ValueError, now logs warning
   - Consistent with how invalid classification values are handled

3. Improve parsing precision
   - Use exact key matching (==) instead of startswith()
   - Strips whitespace from both key and value separately
   - More robust against extra whitespace (e.g., "Classificatie : value")
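The three fixes above can be sketched as small helpers. These are illustrative stand-ins for the parsing internals, not the project's exact code:

```python
import logging

logger = logging.getLogger(__name__)

def parse_line(line):
    """Split 'Key: value' safely; return None for malformed lines."""
    key, sep, value = line.partition(":")
    if not sep:
        return None  # no ':' present: skip instead of raising IndexError
    # Strip both sides so "Classificatie : IC" matches exactly.
    return key.strip(), value.strip()

def parse_confidence(raw, default=0.5):
    """Parse a confidence value, logging (not swallowing) invalid input."""
    try:
        return float(raw)
    except ValueError:
        logger.warning("Invalid confidence value: %r", raw)
        return default
```

Exact key equality after stripping avoids the `startswith()` false positives, and the warning makes bad LLM output visible in the logs instead of silently defaulting.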

Benefits:
- Prevents crashes on malformed LLM responses
- Better observability through consistent warning logs
- More precise key matching reduces false positives
- Graceful degradation with clear fallback values

Thanks to gemini-code-assist bot for the review feedback!

---------

Co-authored-by: Claude <noreply@anthropic.com>
Gambitnl added a commit that referenced this pull request Nov 12, 2025
* refactor(classifier): consolidate duplicate methods in BaseClassifier

Consolidated duplicate _parse_response() and _build_prompt() methods
from OllamaClassifier and GroqClassifier into the BaseClassifier,
eliminating ~65 lines of duplicate code.

Changes:
- Moved _build_prompt() to BaseClassifier with comprehensive docstring
  - Handles prompt template formatting with character/player context
  - Subclasses can override for custom prompt building

- Moved _parse_response() to BaseClassifier with comprehensive docstring
  - Parses Dutch-format LLM responses (Classificatie/Reden/Vertrouwen/Personage)
  - Handles Classification enum, confidence clamping, and error cases
  - Subclasses can override for different response formats

- Added preflight_check() to GroqClassifier
  - Validates API key configuration
  - Tests API connectivity
  - Matches OllamaClassifier pattern

Benefits:
- DRY principle: Single implementation instead of duplicate code
- Maintainability: Bug fixes and improvements apply to all classifiers
- Extensibility: New classifiers can inherit default implementations
- Type safety: Uses Classification enum from constants module

Testing:
- All 37 existing tests pass
- Code coverage: 78%
- No breaking changes - backward compatible

Dependencies:
- Uses Classification enum from src/constants.py (Refactor #7)

Refs: Refactor #2+#3 - Consolidate Classifier Methods

* fix(classifier): improve response parsing robustness and logging

Addressed code review feedback from gemini-code-assist:

1. Fixed high-severity multi-line parsing issue:
   - Replaced brittle line-by-line parsing with regex-based extraction
   - Used re.DOTALL flag for reasoning field to handle multi-line values
   - Prevents corruption when field keywords appear in multi-line text
   - Handles out-of-order fields gracefully

2. Fixed medium-severity silent error handling:
   - Added warning log for invalid confidence values
   - Now consistent with classification error handling
   - Improves debuggability by surfacing parsing issues

Changes:
- _parse_response() now uses regex patterns for each field
- Reden (reasoning) captures everything until next field
- More resilient to LLM output variations
- Better error visibility for debugging
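The regex approach can be sketched like this. Field names follow the Dutch format above; the patterns are illustrative rather than the project's exact ones:

```python
import re

FIELDS = r"Classificatie|Reden|Vertrouwen|Personage"

def extract_fields(response: str) -> dict:
    """Extract known fields; Reden may span multiple lines."""
    out = {}
    for field in ("Classificatie", "Vertrouwen", "Personage"):
        m = re.search(rf"{field}:\s*(.+)", response)
        if m:
            out[field] = m.group(1).strip()
    # Reden captures everything until the next known field or end of text,
    # so multi-line reasoning survives (re.DOTALL lets '.' cross newlines).
    m = re.search(rf"Reden:\s*(.*?)(?=\n(?:{FIELDS}):|\Z)", response, re.DOTALL)
    if m:
        out["Reden"] = m.group(1).strip()
    return out
```

Because each field is matched independently, out-of-order responses parse the same as in-order ones.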

Testing:
- All 37 tests still passing
- Backward compatible with existing test cases

Co-authored-by: gemini-code-assist[bot]

---------

Co-authored-by: Claude <noreply@anthropic.com>
Gambitnl pushed a commit that referenced this pull request Nov 14, 2025
refactor: Extract methods from SpeakerDiarizer.diarize() for better maintainability

Extract complex logic from _extract_speaker_embeddings() method into
smaller, focused methods to improve code readability and testability.

Changes:
- Extract _load_audio_for_embeddings(): Handles pydub audio loading
  with proper error handling for embedding extraction
- Extract _extract_single_speaker_embedding(): Extracts embedding for
  a single speaker, including audio concatenation and inference
- Simplify _extract_speaker_embeddings(): Now orchestrates the
  extraction process by calling the new helper methods

The refactoring reduces the complexity of _extract_speaker_embeddings()
from 81 lines to 57 lines, making it easier to understand and maintain.
Each extracted method has a single, well-defined responsibility.
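The resulting shape can be sketched with simplified placeholders (the real helpers use pydub audio and a pyannote inference model; here both are stubbed so only the orchestration structure is shown):

```python
class SpeakerDiarizer:
    def _load_audio_for_embeddings(self, audio_path):
        # Real code loads the file with pydub and handles errors;
        # here we simulate failure with an empty path.
        if not audio_path:
            return None
        return {"path": audio_path}

    def _extract_single_speaker_embedding(self, audio, speaker, segments):
        # Real code concatenates the speaker's segments and runs inference;
        # here we return a placeholder vector.
        return [float(len(segments))]

    def _extract_speaker_embeddings(self, audio_path, speaker_segments):
        # The public method now just orchestrates the two helpers.
        audio = self._load_audio_for_embeddings(audio_path)
        if audio is None:
            return {}
        return {
            speaker: self._extract_single_speaker_embedding(audio, speaker, segs)
            for speaker, segs in speaker_segments.items()
        }
```

Each helper can now be unit-tested in isolation, which is what the six new tests mentioned below exercise.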

Testing:
- Add 6 new unit tests for the extracted methods
- Tests cover success cases, error handling, and edge cases
- Syntax validation passed for all modified files

This completes task #2 from the refactoring plan (06-diarizer-complex-method.md).
Gambitnl added a commit that referenced this pull request Nov 14, 2025
* refactor: Extract methods from SpeakerDiarizer.diarize() for better maintainability

Extract complex logic from _extract_speaker_embeddings() method into
smaller, focused methods to improve code readability and testability.

Changes:
- Extract _load_audio_for_embeddings(): Handles pydub audio loading
  with proper error handling for embedding extraction
- Extract _extract_single_speaker_embedding(): Extracts embedding for
  a single speaker, including audio concatenation and inference
- Simplify _extract_speaker_embeddings(): Now orchestrates the
  extraction process by calling the new helper methods

The refactoring reduces the complexity of _extract_speaker_embeddings()
from 81 lines to 57 lines, making it easier to understand and maintain.
Each extracted method has a single, well-defined responsibility.

Testing:
- Add 6 new unit tests for the extracted methods
- Tests cover success cases, error handling, and edge cases
- Syntax validation passed for all modified files

This completes task #2 from the refactoring plan (06-diarizer-complex-method.md).

* docs: Update refactoring documentation for completed task #6

Add completion summary to the diarizer refactoring task document
and update the refactoring README to track progress.

Changes:
- Add detailed implementation summary to 06-diarizer-complex-method.md
  - Document what was implemented (2 extracted methods)
  - Record actual effort (2 hours vs 10-12 estimated)
  - Note deviations from original plan
  - Include metrics showing 30% code reduction
  - List 6 new unit tests added

- Update refactoring README.md
  - Mark task #6 as completed with ✅ indicator
  - Add completion summary with key metrics
  - Update summary statistics to show 1/10 completed
  - Update last modified date to 2025-11-14

No architecture documentation updates needed - the refactoring was
internal to the SpeakerDiarizer class and doesn't change the public
API or high-level design documented in DEVELOPMENT.md.

* refactor: Improve type hints and optimize imports in diarizer

Address code review feedback from gemini-code-assist to improve
code quality, type safety, and performance.

Changes:
1. Type Safety Improvements
   - Add TYPE_CHECKING imports for AudioSegment and Annotation
   - Replace Optional[Any] with Optional['AudioSegment'] in
     _load_audio_for_embeddings()
   - Replace Any with 'Annotation' and 'AudioSegment' type hints
     in _extract_single_speaker_embedding()
   - Update _perform_diarization() return type to use 'Annotation'
   - Update _extract_speaker_embeddings() parameter type

2. Documentation Accuracy
   - Fix Raises clause in _extract_single_speaker_embedding()
     to specify RuntimeError instead of generic Exception

3. Performance Optimization
   - Replace `from pydub import AudioSegment` import inside loop
     with `type(audio).empty()` to avoid repeated module lookups
   - This removes import overhead from the per-speaker iteration

These improvements enhance static analysis capabilities, provide
better IDE autocomplete support, and reduce unnecessary import
overhead in the embedding extraction loop.
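The `TYPE_CHECKING` pattern described in point 1 looks roughly like this (function name and stubbed body are illustrative; the real method lives on `SpeakerDiarizer`):

```python
from typing import TYPE_CHECKING, Optional

if TYPE_CHECKING:
    # Only evaluated by static analyzers: pydub and pyannote need not be
    # installed at runtime just to import this module.
    from pydub import AudioSegment
    from pyannote.core import Annotation

def load_audio_for_embeddings(path: str) -> "Optional[AudioSegment]":
    # String annotation keeps the hint without a runtime import.
    # Real code would call AudioSegment.from_file(path); stubbed here.
    return None
```

IDEs and type checkers see the precise types, while the module imports cleanly even without the heavy audio dependencies.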

---------

Co-authored-by: Claude <noreply@anthropic.com>