feat: implement async batch processing for classification pipeline #11
kunalbhardwaj2006 wants to merge 1 commit into AOSSIE-Org:main
Conversation
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 2 passed
Actionable comments posted: 9
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@generator/src/knowledge_utils.py`:
- Around line 50-54: The async function is performing blocking file I/O with
open() (see the block that reads into prompts and the similar code at lines
marked 67-68); replace these synchronous reads with aiofiles: import aiofiles,
use async with aiofiles.open(filepath, 'r') and await f.read() (and preserve the
same exception handling to log errors with logger.error and append "" on
failure) so the function remains non-blocking and compatible with asyncio-based
concurrency.
- Around line 291-295: The code sets stream_code, subject_name, subtopic_name to
None when meta is None but continues and still performs the DB insert (using
those values) while only skipping frontend export; change the flow so that when
meta is None for a given subtopic_id you skip all further processing for that
subtopic (e.g., early continue/return) and do not perform the DB insert. Locate
the block that unpacks meta and the subsequent DB insertion code (references:
meta, subtopic_id, stream_code, subject_name, subtopic_name and the DB insert
logic) and make the missing-metadata branch bail out before any insert or
downstream work.
- Line 39: Remove the redundant fallback in knowledge_utils.py by changing the
batch_size assignment so it uses the validated configuration value directly
(i.e., set batch_size = CLASSIFICATION_BATCH_SIZE) instead of
"CLASSIFICATION_BATCH_SIZE or 3"; this avoids masking misconfiguration that is
already asserted as positive in config.py and keeps behavior consistent with the
validation in config.
- Around line 60-68: The loop uses zip(batch_files, results) which can silently
truncate if lengths differ; add an explicit length check before iterating (e.g.,
if len(batch_files) != len(results): raise ValueError or log and handle the
mismatch) or switch to zip(batch_files, results, strict=True) if your runtime
supports it, so you never write responses to the wrong response_file; update the
code around the loop that iterates over filepath, response to validate lengths
and fail/handle early instead of silently truncating.
- Around line 82-86: The slice logic can produce an empty string when there is a
'{' but no closing '}', so change the checks around start/end: call end_index =
text.rfind('}'), ensure start != -1 and end_index != -1 and end_index > start
before setting end = end_index + 1 and json_str = text[start:end]; if that
condition fails, avoid creating json_str (or explicitly set it to
None/raise/skip) so malformed input is handled safely; update the code around
the variables start, end_index/end, and json_str accordingly.
- Around line 301-303: The call to the synchronous generate_text(prompt) inside
the async block blocks the event loop; either (A) wrap the call in the event
loop's executor and await it (e.g., content = await
asyncio.get_running_loop().run_in_executor(None, generate_text, prompt)) or (B,
preferred) refactor generate_text to be async (use aiohttp instead of requests)
and then await generate_text(prompt) directly; update the code where content =
generate_text(prompt) is used and any callers of generate_text to match the
chosen approach.
- Around line 177-182: The loop currently reassigns loop variables subject_name
and subtopic_name after checking special cases, which is confusing and flagged
by linters; instead, create new variables (e.g., norm_subject_name and
norm_subtopic_name) to hold the normalized values from
normalize_subtopic(subject_name) and normalize_subtopic(subtopic_name), handle
the "other"/"unclassifiable" mapping into those new variables (e.g., set
norm_subject_name="general aptitude", norm_subtopic_name="miscellaneous" when
matched), and update any subsequent code in this block to use norm_subject_name
and norm_subtopic_name rather than mutating the original loop variables.
- Line 16: The import of process_batch from generator.src.llm_utils is failing
because process_batch is not defined in llm_utils.py; locate the call results =
await process_batch(prompts) in knowledge_utils.py and either (A) implement an
async process_batch(prompts) function inside generator/src/llm_utils.py that
matches the call signature and uses existing helpers (e.g., generate_text) to
process prompt lists, or (B) if an equivalent batch helper already exists under
a different name in llm_utils.py, change the import and the call in
knowledge_utils.py to that existing function name (ensuring it is async and
returns the expected results structure). Ensure the symbol names process_batch
(or the replaced function name) and generate_text are consistent between the two
modules so the import and await call succeed.
- Around line 403-414: The current loop parsing question IDs (variables q_id,
parts) assumes a prefix_year_qno_... format and emits a generic logger.warning
when malformed; update the logger.warning in the for loop that builds q_path
(q_path, q_list) to include the actual malformed q_id and the expected format
(e.g., "expected 'prefix_year_qno[_...]'") and optionally an example, so the
message becomes descriptive enough to aid debugging; keep skipping behavior but
make the warning text include q_id and the expected pattern using the same
logger.warning call.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 644b91ce-5b74-49dd-830b-c16c4a5e12b0
📒 Files selected for processing (1)
generator/src/knowledge_utils.py
```python
    CLASSIFICATION_BATCH_SIZE,
)

from generator.src.llm_utils import generate_text, process_batch
```
🧩 Analysis chain
🏁 Script executed:
```shell
#!/bin/bash
# Search for process_batch definition
rg -n "def process_batch" --type py
```
Repository: AOSSIE-Org/LibrEd
Length of output: 43
🏁 Script executed:
```shell
# First, let me check the exact import statement in knowledge_utils.py
head -30 generator/src/knowledge_utils.py | cat -n
```
Repository: AOSSIE-Org/LibrEd
Length of output: 1019
🏁 Script executed:
```shell
# Check what functions are actually defined in llm_utils.py
cat generator/src/llm_utils.py
```
Repository: AOSSIE-Org/LibrEd
Length of output: 1604
🏁 Script executed:
```shell
# Search for any usage of process_batch in the codebase to see if it's actually called
rg "process_batch" --type py -n
```
Repository: AOSSIE-Org/LibrEd
Length of output: 244
Critical: process_batch is imported but does not exist in llm_utils.py.
Line 16 imports process_batch from generator.src.llm_utils, but this function is not defined in that module. The function is actively called on line 57 (results = await process_batch(prompts)), so this import error will cause an ImportError at runtime and break the module.
Either implement process_batch in llm_utils.py or update the import to use an existing function.
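One way option (A) could look: a minimal sketch of an async `process_batch` for llm_utils.py, with the blocking `generate_text` stubbed here for illustration (the real one makes an HTTP call). This is an assumption about the intended call signature, not the project's actual implementation:

```python
import asyncio

def generate_text(prompt):
    # Stand-in for the real blocking LLM call in llm_utils.py
    return f"response to: {prompt}"

async def process_batch(prompts):
    # Run each blocking generate_text call in a worker thread so the event
    # loop stays free; gather preserves input order, so results line up with
    # prompts (and with batch_files in the caller).
    tasks = [asyncio.to_thread(generate_text, p) for p in prompts]
    return await asyncio.gather(*tasks)

results = asyncio.run(process_batch(["q1", "q2"]))
print(results)  # ['response to: q1', 'response to: q2']
```

Because `asyncio.gather` returns results in submission order, the caller's `zip(batch_files, results)` pairing stays correct as long as every prompt yields a result.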
```python
if limit:
    files = files[:limit]

batch_size = CLASSIFICATION_BATCH_SIZE or 3
```
🧹 Nitpick | 🔵 Trivial
Redundant fallback value.
CLASSIFICATION_BATCH_SIZE is already validated to be positive in config.py via an assertion. The or 3 fallback is unnecessary and could mask configuration issues.
♻️ Remove redundant fallback
```diff
-    batch_size = CLASSIFICATION_BATCH_SIZE or 3
+    batch_size = CLASSIFICATION_BATCH_SIZE
```
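Python's `or` falls back whenever the left operand is falsy, so a misconfigured value of 0 would be silently replaced with 3 instead of surfacing the assertion in config.py. A minimal illustration:

```python
CLASSIFICATION_BATCH_SIZE = 0  # simulated misconfiguration

# `0 or 3` evaluates to 3 — the bad config value is masked, not reported
batch_size = CLASSIFICATION_BATCH_SIZE or 3
print(batch_size)  # 3
```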
```python
        with open(filepath, 'r') as f:
            prompts.append(f.read())
    except Exception as e:
        logger.error(f"Failed to read {filepath}: {e}")
        prompts.append("")
```
🧹 Nitpick | 🔵 Trivial
Blocking file I/O in async function.
Using open() in an async function blocks the event loop. For batch processing where multiple batches run sequentially, this may not be critical, but it prevents proper concurrency if callers expect non-blocking behavior.
♻️ Consider using aiofiles for async file operations
```diff
+import aiofiles

 # In process_classification_prompts:
-    with open(filepath, 'r') as f:
-        prompts.append(f.read())
+    async with aiofiles.open(filepath, 'r') as f:
+        prompts.append(await f.read())

-    with open(response_file, 'w') as f:
-        f.write(content)
+    async with aiofiles.open(response_file, 'w') as f:
+        await f.write(content)
```
Also applies to: lines 67-68
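If adding an aiofiles dependency is undesirable, a similar non-blocking effect is possible with the standard library alone by pushing the blocking read into a worker thread. A sketch under that assumption (`read_prompts` and the file names are illustrative, not the project's actual API):

```python
import asyncio
import logging
import os
import tempfile

logger = logging.getLogger(__name__)

def _read_file(filepath):
    with open(filepath) as f:
        return f.read()

async def read_prompts(filepaths):
    prompts = []
    for filepath in filepaths:
        try:
            # to_thread offloads the blocking open()/read() to a thread,
            # keeping the event loop responsive for other tasks
            prompts.append(await asyncio.to_thread(_read_file, filepath))
        except Exception as e:
            # same fallback behavior as the reviewed code: log and append ""
            logger.error(f"Failed to read {filepath}: {e}")
            prompts.append("")
    return prompts

# demo with one real file and one missing file
tmp = tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False)
tmp.write("classify this question")
tmp.close()
prompts = asyncio.run(read_prompts([tmp.name, "/nonexistent/file.txt"]))
print(prompts)  # ['classify this question', '']
os.remove(tmp.name)
```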
🧰 Tools
🪛 Ruff (0.15.6)
[warning] 50-50: Async functions should not open files with blocking methods like open
(ASYNC230)
[warning] 50-50: Unnecessary mode argument
Remove mode argument
(UP015)
[warning] 52-54: try-except within a loop incurs performance overhead
(PERF203)
[warning] 52-52: Do not catch blind exception: Exception
(BLE001)
[warning] 53-53: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
[warning] 53-53: Logging statement uses f-string
(G004)
```python
for filepath, response in zip(batch_files, results):
    try:
        content = response or ""

        base_name = os.path.basename(filepath).replace('.txt', '')
        response_file = os.path.join(RESPONSE_DIR, f"{base_name}_response.json")

        with open(response_file, 'w') as f:
            f.write(content)
```
Risk of silent data corruption: zip() without strict=True.
If process_batch returns fewer results than the number of prompts submitted (e.g., due to partial failures), the zip() will silently truncate, causing responses to be written to the wrong files or some files to be skipped entirely without any error.
🛡️ Proposed fix to enforce length matching
```diff
-    for filepath, response in zip(batch_files, results):
+    for filepath, response in zip(batch_files, results, strict=True):
```
Alternatively, add an explicit length check before the loop if you need to handle mismatches gracefully.
🧰 Tools
🪛 Ruff (0.15.6)
[warning] 60-60: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
[warning] 67-67: Async functions should not open files with blocking methods like open
(ASYNC230)
```python
start = text.find('{')
end = text.rfind('}') + 1

if start != -1 and end != -1:
    json_str = text[start:end]
```
Edge case: malformed JSON with { but no }.
If the text contains { but no }, rfind('}') returns -1, making end = 0. The condition start != -1 and end != -1 passes (since end is 0, not -1), but text[start:0] returns an empty string, which may not be the intended behavior.
🛡️ Proposed fix to handle edge case
```diff
     start = text.find('{')
     end = text.rfind('}') + 1
-    if start != -1 and end != -1:
+    if start != -1 and end > start:
         json_str = text[start:end]
```
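The failure mode and a guarded version can be compared directly (`extract_json` is a name used here only for illustration):

```python
def extract_json(text):
    # Guarded extraction: only slice when a '}' actually closes after '{'
    start = text.find('{')
    end_index = text.rfind('}')
    if start != -1 and end_index > start:
        return text[start:end_index + 1]
    return None

# Buggy pattern for comparison: rfind misses -> end == 0 -> empty slice
text = 'prefix { "a": 1 with no closing brace'
end = text.rfind('}') + 1               # -1 + 1 == 0
print(repr(text[text.find('{'):end]))   # '' — silently empty

print(extract_json(text))                       # None
print(extract_json('noise {"a": 1} trailer'))   # {"a": 1}
```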
```python
if subject_name.lower() == "other" or subtopic_name.lower() == "unclassifiable":
    subject_name = "general aptitude"
    subtopic_name = "miscellaneous"
else:
    subject_name = normalize_subtopic(subject_name)
    subtopic_name = normalize_subtopic(subtopic_name)
```
🧹 Nitpick | 🔵 Trivial
Loop variables overwritten inside loop body.
Reassigning subject_name and subtopic_name within the loop that iterates over them can be confusing and is flagged by linters (PLW2901). Consider using distinct variable names for the normalized values.
♻️ Use distinct variable names for clarity
```diff
-    if subject_name.lower() == "other" or subtopic_name.lower() == "unclassifiable":
-        subject_name = "general aptitude"
-        subtopic_name = "miscellaneous"
+    if subject_name.lower() == "other" or subtopic_name.lower() == "unclassifiable":
+        norm_subject = "general aptitude"
+        norm_subtopic = "miscellaneous"
     else:
-        subject_name = normalize_subtopic(subject_name)
-        subtopic_name = normalize_subtopic(subtopic_name)
+        norm_subject = normalize_subtopic(subject_name)
+        norm_subtopic = normalize_subtopic(subtopic_name)
+
+    # Use norm_subject and norm_subtopic in the rest of the block
```
🧰 Tools
🪛 Ruff (0.15.6)
[warning] 178-178: for loop variable subject_name overwritten by assignment target
(PLW2901)
[warning] 179-179: for loop variable subtopic_name overwritten by assignment target
(PLW2901)
[warning] 181-181: for loop variable subject_name overwritten by assignment target
(PLW2901)
[warning] 182-182: for loop variable subtopic_name overwritten by assignment target
(PLW2901)
```python
if meta:
    stream_code, subject_name, subtopic_name = meta
else:
    logger.warning(f"No metadata found for {subtopic_id}")
    stream_code, subject_name, subtopic_name = (None, None, None)
```
Metadata query may return None, but subsequent code assumes valid values.
When meta is None, you set stream_code, subject_name, subtopic_name = (None, None, None) and continue processing. This leads to the conditional on line 332 skipping the frontend export, but the database insert on lines 324-327 still occurs with potentially invalid data. Consider whether to skip processing entirely when metadata is missing.
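A minimal sketch of the suggested bail-out, with the metadata lookup and DB insert stubbed out for illustration (the dict, `insert_row`, and the IDs are hypothetical stand-ins for the real query and insert logic):

```python
import logging

logging.basicConfig()
logger = logging.getLogger(__name__)

# Illustrative stand-ins for the real metadata query and DB insert
METADATA = {"st_001": ("CS", "algorithms", "sorting")}
inserted = []

def insert_row(subtopic_id, stream_code, subject_name, subtopic_name):
    inserted.append((subtopic_id, stream_code, subject_name, subtopic_name))

for subtopic_id in ["st_001", "st_missing"]:
    meta = METADATA.get(subtopic_id)
    if meta is None:
        # Bail out before any DB work — never insert with None values
        logger.warning("No metadata found for %s; skipping", subtopic_id)
        continue
    stream_code, subject_name, subtopic_name = meta
    insert_row(subtopic_id, stream_code, subject_name, subtopic_name)

print(inserted)  # only st_001 was inserted
```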
🧰 Tools
🪛 Ruff (0.15.6)
[warning] 294-294: Logging statement uses f-string
(G004)
````python
try:
    content = generate_text(prompt)

    if content.startswith('```markdown'):
        content = content.replace('```markdown', '', 1)
    if content.startswith('```'):
        content = content.replace('```', '', 1)
    if content.endswith('```'):
        # 🤖 Generate theory
        content = generate_text(prompt) or ""
````
Synchronous generate_text() blocks the async event loop.
generate_text() is a synchronous function (makes blocking HTTP calls via requests.post). Calling it directly in an async function blocks the event loop, negating any async benefits and preventing concurrent processing.
♻️ Wrap in run_in_executor or make generate_text async
Option 1: Run synchronous function in thread pool:
```diff
-    content = generate_text(prompt) or ""
+    loop = asyncio.get_event_loop()
+    content = await loop.run_in_executor(None, generate_text, prompt) or ""
```
Option 2 (preferred): Refactor generate_text to use aiohttp instead of requests, then await it directly.
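Option 1 can be exercised end to end with a blocking stub standing in for `generate_text` (the sleep simulates the blocking HTTP call; `generate_theory` is an illustrative name, not the project's function):

```python
import asyncio
import time

def generate_text(prompt):
    # Stand-in for the blocking requests.post call
    time.sleep(0.05)
    return f"theory for {prompt}"

async def generate_theory(prompts):
    loop = asyncio.get_running_loop()
    # Each blocking call runs in the default thread-pool executor; launching
    # them all before awaiting lets them overlap instead of running serially.
    futures = [loop.run_in_executor(None, generate_text, p) for p in prompts]
    return await asyncio.gather(*futures)

start = time.perf_counter()
out = asyncio.run(generate_theory(["sorting", "graphs", "dp"]))
elapsed = time.perf_counter() - start
print(out)
print(f"{elapsed:.2f}s")  # typically well under 3 × 0.05s due to overlap
```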
```python
for (q_id,) in questions:
    try:
        parts = q_id.split('_')

        if len(parts) >= 3:
            path_stream = parts[0]
            path_year = parts[1]
            path_qno = parts[2]

            q_path = f"questions/{path_year}/{path_qno}/"
            q_list.append(q_path)
        else:
            logger.warning(f"[Manifest] Skipping malformed ID: {q_id}")
```
🧹 Nitpick | 🔵 Trivial
Question ID parsing assumes specific format.
The code assumes question IDs follow a prefix_year_qno_... format. While error handling exists, the warning message could be more descriptive about the expected format to aid debugging.
♻️ Improve warning message clarity
```diff
         else:
-            logger.warning(f"[Manifest] Skipping malformed ID: {q_id}")
+            logger.warning(f"[Manifest] Skipping malformed ID: {q_id} (expected format: prefix_year_qno_...)")
```
🧰 Tools
🪛 Ruff (0.15.6)
[warning] 414-414: Logging statement uses f-string
(G004)
Hi @maintainers, it looks like the workflow requires approval to run. Could you please approve it so the CI checks can complete? Thanks!
🚀 Overview
This PR introduces async batch processing for the classification pipeline to improve performance and scalability.
🔧 Changes Made
⚡ Impact
🧪 Testing
📌 Notes
This PR focuses on classification pipeline improvements. LLM optimization is handled in a separate PR.
Summary by CodeRabbit