[DONE] feat/flat-folder — port flat-folder ingestion strategy#39
Merged
Conversation
Pull request overview
This PR ports the “v2 flat-folder” ingestion pipeline into @bb/ingest-github, replacing the old v1 single-pass file strategy with a multi-phase, crash-resumable pipeline that writes intermediate artifacts to disk and persists a richer analysis shape into Mongo + Neo4j (including new :Repo / :Folder nodes and extended FileAnalysis fields). It also adds an LLM-backed skip-decision gate to reduce unnecessary analysis calls for unknown file types and common noise files.
Changes:
- Replace v1 worker/strategy plumbing with a new orchestrated pipeline (pipeline/run.ts) and the 7-phase flat-folder strategy.
- Add skip-decision infrastructure (seed data, cache, YES/NO LLM prompt) to filter files before analysis.
- Extend shared types/config and Neo4j/Mongo/LLM helpers to support richer analysis fields and new graph nodes.
Reviewed changes
Copilot reviewed 99 out of 100 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
| packages/types/src/job.ts | Add orgId to payloads |
| packages/types/src/context.md | Document orgId override |
| packages/types/src/config.ts | Add flat-folder config keys |
| packages/neo4j/src/repo.ts | Add :Repo upsert |
| packages/neo4j/src/index.ts | Export new upserts/indexes |
| packages/neo4j/src/folder.ts | Add :Folder upsert |
| packages/neo4j/src/flatFolderIndexes.ts | Add repo/folder constraints |
| packages/neo4j/src/fileVersions.ts | Snapshot extended file props |
| packages/neo4j/src/files.ts | Extend file node properties |
| packages/neo4j/src/context.md | Update Neo4j module docs |
| packages/mongo/src/raw.ts | Extend FileAnalysis shape |
| packages/mongo/src/index.ts | Export new analysis types |
| packages/llm/src/jsonClient.ts | Add JSON + YES/NO clients |
| packages/llm/src/index.ts | Export JSON helpers |
| packages/ingest-github/tsconfig.json | Add src/* path alias |
| packages/ingest-github/src/worker.ts | Remove v1 worker |
| packages/ingest-github/src/types/strategy.ts | New strategy port types |
| packages/ingest-github/src/types/pipeline.ts | Pipeline/scan/skip types |
| packages/ingest-github/src/types/meta-paths.ts | Meta artifact paths type |
| packages/ingest-github/src/types/ingest-runner.ts | Runner dependency types |
| packages/ingest-github/src/types/index.ts | Types barrel exports |
| packages/ingest-github/src/types/file-analysis.ts | emptyFileAnalysis helper |
| packages/ingest-github/src/types/context.md | Types tier documentation |
| packages/ingest-github/src/types/condensed-file-analysis.ts | Condensed artifact shape |
| packages/ingest-github/src/types/big-file.ts | Big-file artifact shapes |
| packages/ingest-github/src/Strategy.ts | Remove v1 strategy interface |
| packages/ingest-github/src/strategies/flat-folder/types.ts | Flat-folder domain types |
| packages/ingest-github/src/strategies/flat-folder/repo-summary.ts | Repo summary generation |
| packages/ingest-github/src/strategies/flat-folder/prompts/repo-summary.ts | Repo summary prompts |
| packages/ingest-github/src/strategies/flat-folder/prompts/folder-summary.ts | Folder summary prompts |
| packages/ingest-github/src/strategies/flat-folder/prompts/file-analysis.ts | File analysis prompts |
| packages/ingest-github/src/strategies/flat-folder/prompts/file-analysis-fields.ts | Field schema block |
| packages/ingest-github/src/strategies/flat-folder/prompts/context.md | Prompt module docs |
| packages/ingest-github/src/strategies/flat-folder/prompts/condense.ts | Condense prompt |
| packages/ingest-github/src/strategies/flat-folder/prompts/chunk.ts | Chunk prompt |
| packages/ingest-github/src/strategies/flat-folder/prompts/backfill.ts | Backfill prompt |
| packages/ingest-github/src/strategies/flat-folder/phases/store-flat-analysis.ts | Neo4j write phase |
| packages/ingest-github/src/strategies/flat-folder/phases/process-big-files.ts | Big-file phase |
| packages/ingest-github/src/strategies/flat-folder/phases/context.md | Phase docs |
| packages/ingest-github/src/strategies/flat-folder/phases/classify-and-analyse-small.ts | Phase 1 classify/analyze |
| packages/ingest-github/src/strategies/flat-folder/index.ts | Strategy orchestrator |
| packages/ingest-github/src/strategies/flat-folder/folder-summary.ts | Folder summary phase |
| packages/ingest-github/src/strategies/flat-folder/folder-path.ts | Folder path helpers |
| packages/ingest-github/src/strategies/flat-folder/context.md | Strategy docs |
| packages/ingest-github/src/strategies/flat-folder/big-file/storage.ts | Big-file cache storage |
| packages/ingest-github/src/strategies/flat-folder/big-file/index.ts | Big-file processing |
| packages/ingest-github/src/strategies/flat-folder/big-file/detector.ts | Big-file list I/O |
| packages/ingest-github/src/strategies/flat-folder/big-file/context.md | Big-file docs |
| packages/ingest-github/src/strategies/flat-folder/big-file/chunker.ts | Token-based chunking |
| packages/ingest-github/src/strategies/flat-folder/big-file/chunk-analyzer.ts | Chunk LLM analysis |
| packages/ingest-github/src/strategies/flat-folder/big-file/cache.ts | Big-file cache inspect |
| packages/ingest-github/src/strategies/flat-folder/backfill/fields.ts | Phase 3 backfill |
| packages/ingest-github/src/strategies/flat-folder/backfill/context.md | Backfill docs |
| packages/ingest-github/src/strategies/flat-folder/backfill/big-files.ts | Phase 4 backfill big |
| packages/ingest-github/src/strategies/flat-folder/analyse-file.ts | Small-file analysis wrapper |
| packages/ingest-github/src/strategies/context.md | Strategies tree docs |
| packages/ingest-github/src/strategies/basic-file-analysis/context.md | Archived v1 docs |
| packages/ingest-github/src/strategies/basic-file-analysis/BasicFileAnalysisStrategy.ts.archived | Archive v1 source |
| packages/ingest-github/src/scan.ts | Remove v1 scanner |
| packages/ingest-github/src/pipeline/source.ts | Clone-or-sync + HEAD hash |
| packages/ingest-github/src/pipeline/skip-decisions/seed.ts | Load skip seed data |
| packages/ingest-github/src/pipeline/skip-decisions/seed-data/filenameIgnore.json | Seed filename rejects |
| packages/ingest-github/src/pipeline/skip-decisions/seed-data/extensions.json | Known ext→language map |
| packages/ingest-github/src/pipeline/skip-decisions/seed-data/directoryIgnore.json | Seed dir rejects |
| packages/ingest-github/src/pipeline/skip-decisions/seed-data/context.md | Seed data docs |
| packages/ingest-github/src/pipeline/skip-decisions/prompts/skip-decision.ts | Skip gate prompts |
| packages/ingest-github/src/pipeline/skip-decisions/prompts/context.md | Skip prompt docs |
| packages/ingest-github/src/pipeline/skip-decisions/index.ts | Skip-decisions barrel |
| packages/ingest-github/src/pipeline/skip-decisions/decider.ts | Skip decision logic |
| packages/ingest-github/src/pipeline/skip-decisions/context.md | Skip-decisions docs |
| packages/ingest-github/src/pipeline/skip-decisions/cache.ts | Skip decision cache |
| packages/ingest-github/src/pipeline/scan.ts | New streaming scanner |
| packages/ingest-github/src/pipeline/run.ts | New orchestrator runner |
| packages/ingest-github/src/pipeline/paths.ts | Repo/meta path helpers |
| packages/ingest-github/src/pipeline/filters.ts | Static filters wiring |
| packages/ingest-github/src/pipeline/context.md | Pipeline docs |
| packages/ingest-github/src/pipeline/concurrency.ts | Concurrency limiter |
| packages/ingest-github/src/pipeline/cancellation.ts | Cooperative cancellation |
| packages/ingest-github/src/pipeline/branch.ts | Branch validation |
| packages/ingest-github/src/payload/narrow.ts | Payload narrowing |
| packages/ingest-github/src/payload/context.md | Payload docs |
| packages/ingest-github/src/paths.ts | Remove v1 paths helper |
| packages/ingest-github/src/index.ts | New worker wiring + exports |
| packages/ingest-github/src/handlers/ingest-job.ts | Job handlers |
| packages/ingest-github/src/handlers/context.md | Handler docs |
| packages/ingest-github/src/context.md | Rewrite package docs |
| packages/ingest-github/src/concurrency.ts | Remove v1 concurrency |
| packages/ingest-github/src/bigFile.ts | Remove v1 big-file logic |
| packages/ingest-github/src/analyze.ts | Remove v1 analysis entry |
| packages/ingest-github/src/analysisShared.ts | Remove v1 shared analysis |
| packages/ingest-github/src/adapters/llm-file-analyzer.ts | LLM analyzer adapter |
| packages/ingest-github/src/adapters/index.ts | Adapter barrel |
| packages/ingest-github/src/adapters/context.md | Adapter docs |
| packages/errors/src/ingest-errors.ts | Add CancellationError |
| packages/errors/src/index.ts | Export CancellationError |
| packages/errors/src/context.md | Document cancellation error |
| packages/config/src/schema.ts | Add new flat-folder config defaults |
Comments suppressed due to low confidence (1)
packages/ingest-github/src/pipeline/scan.ts:81
`scanRepository` treats files with `countLines(content) > Config.BigFileLineThreshold` as `kind: "oversized"`. In Phase 1, oversized files are written as permanent "too-large" stubs and never sent through the big-file chunk/analyse/condense path, so line-threshold files will be silently skipped rather than analyzed. If the intent is to route these into the big-file pipeline, consider emitting a distinct kind (e.g. `"big"`) or yielding a normal `"file"` entry and letting Phase 1 enqueue it for Phase 2 instead of stubbing it out.
```ts
const content = buf.toString("utf8");
if (countLines(content) > limits.bigFileLineThreshold) {
  counts.oversized += 1;
  yield { kind: "oversized", relativePath, absolutePath: abs, sizeBytes };
  continue;
}
```
Comment on lines 15 to 41 of packages/config/src/schema.ts:
```ts
export const configSchema = z
  .object({
    server_port: z.number().int().min(1).max(65535).default(8080),
    mongo_uri: z.string().default(""),
    neo4j_uri: z.string().default(""),
    neo4j_user: z.string().default(""),
    neo4j_password: z.string().default(""),
    redis_url: z.string().default(""),
    openrouter_api_key: z.string().default(""),
    openrouter_model: z.string().default("deepseek/deepseek-v4-flash"),
    openrouter_fallback_model_1: z.string().default("qwen/qwen3.5-flash-02-23"),
    openrouter_fallback_model_2: z.string().default("minimax/minimax-m2.7"),
    openrouter_fallback_model_3: z.string().default("moonshotai/kimi-k2.5"),
    openrouter_fallback_model_4: z.string().default("x-ai/grok-4.3"),
    concurrency: concurrencySchema.default({ github: 2 }),
    log_level: z.enum(LOG_LEVELS).default("info"),
    log_retention_days: z.number().int().positive().default(14),
    llm_cache_enabled: z.boolean().default(true),
    "context.window.limit": z.number().int().positive().default(15000),
    "max.tokens.per.chunk": z.number().int().positive().default(6000),
    "big.file.concurrency": z.number().int().positive().default(25),
    "absolute.file.size.cap": z.number().int().positive().default(52428800),
    "concurrent.workers": z.number().int().positive().default(4),
    "condense.context.limit": z.number().int().positive().default(12000),
    "condense.prompt.overhead": z.number().int().nonnegative().default(1500),
    "small.file.dedup.threshold": z.number().int().positive().default(3),
  })
```
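Elsewhere in the PR these dotted keys surface as PascalCase accessors (`Config.ContextWindowLimit`, `Config.AbsoluteFileSizeCap`, and so on). One plausible mapping between the two spellings — the helper name below is hypothetical, not part of the PR:

```typescript
// Hypothetical helper: map a dotted config key ("context.window.limit")
// to the PascalCase accessor name used in the PR (ContextWindowLimit).
function toAccessorName(key: string): string {
  return key
    .split(/[._]/)                                        // split on dots/underscores
    .map((part) => part.charAt(0).toUpperCase() + part.slice(1))
    .join("");
}
```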
Comment on lines +63 to +66 of packages/ingest-github/src/pipeline/scan.ts:
```ts
const sizeBytes = (await stat(abs)).size;
const relativePath = path.relative(rootDir, abs);
const ext = path.extname(entry.name).toLowerCase();
if (sizeBytes > limits.absoluteCap) {
```
Comment on lines +48 to +54:
```ts
if (entry.kind === "oversized") {
  bigFileBuffer.push({
    relativePath: entry.relativePath,
    sizeBytes: entry.sizeBytes,
    tokenCount: 0,
    reason: "too-large",
  });
```
Comment on lines +34 to +38:
```ts
const ATTACH_FILE_TO_FOLDER = `
MATCH (f:File {knowledgeId: $knowledgeId, relativePath: $relativePath})
MATCH (folder:Folder {knowledgeId: $knowledgeId, folderPath: $folderPath})
MERGE (folder)-[:CONTAINS]->(f)
`;
```
Comment on lines +21 to +26:
```ts
const UPSERT_FOLDER = `
MERGE (folder:Folder {orgId: $orgId, knowledgeId: $knowledgeId, repoId: $repoId, folderPath: $folderPath})
SET folder.purpose = $purpose,
    folder.summary = $summary,
    folder.dependencyGraph = $dependencyGraph,
    folder.updatedAt = $updatedAt
```
Comment on lines 28 to 41:
```cypher
fv.summary = f.summary,
fv.businessContext = f.businessContext,
fv.dataFlowDirection = f.dataFlowDirection,
fv.ontologyConcepts = f.ontologyConcepts,
fv.businessEntities = f.businessEntities,
fv.systemCapabilities = f.systemCapabilities,
fv.sideEffects = f.sideEffects,
fv.configDependencies = f.configDependencies,
fv.integrationSurface = f.integrationSurface,
fv.contractsProvided = f.contractsProvided,
fv.contractsConsumed = f.contractsConsumed,
fv.sectionNames = f.sectionNames,
fv.sectionDescriptions = f.sectionDescriptions,
fv.snapshotAt = $snapshotAt
```
Comment on lines +165 to +172:
```ts
async function persistStats(input: PersistStatsInput): Promise<void> {
  const estimatedCost = await estimateCostFromBreakdown({});
  await recordProcessingStats({
    knowledgeId: input.knowledgeId,
    repoName: input.repoName,
    commitHash: input.commitHash,
    modelTokens: {},
    estimatedCost,
```
What changed
Ingestion pipeline (v2 flat-folder strategy)
- Replaced `BasicFileAnalysisStrategy` (single-pass per-file LLM, no folder/repo summaries) with a 7-phase flat-folder strategy ported from kube v2:
  - per-file analysis with extended fields (`keywords`, `sideEffects`, `configDependencies`, `dataFlowDirection`)
  - folder summaries (purpose, summary, `dependencyGraph` per folder)
  - repo summary (single-shot when it fits `ContextWindowLimit`, else batch + merge)
  - graph writes (`:Repo` → `:Folder` → `:File` with extended analysis props)
- Disk artifacts (`bigFiles.json`, `file-analysis/*.json`, `folder-summaries/*.json`, `big-file-analysis/*.manifest.json`, `repo-summary.json`) are the inter-phase contract — a crash mid-run resumes from the next sub-phase boundary.
- The v1 strategy is archived at `packages/ingest-github/src/strategies/basic-file-analysis/BasicFileAnalysisStrategy.ts.archived` for reference (not compiled, not exported).
- The package is restructured into `types/`, `pipeline/`, `adapters/`, `payload/`, `handlers/`, `strategies/flat-folder/`. Every folder has its own `context.md`.

LLM skip-decision gate
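In sketch form, the skip gate is a cached YES/NO LLM decision keyed by extension or filename. The cache shape, key scheme, and prompt below are illustrative assumptions, not the PR's actual API:

```typescript
// Assumed cache shape: keyed by extension (".tfvars") or bare filename
// ("CODEOWNERS"); users can hand-edit `ignore` to un-ignore an entry.
type SkipDecision = { ignore: boolean; decidedAt: string };
type SkipCache = Record<string, SkipDecision>;

async function shouldSkip(
  key: string,
  cache: SkipCache,
  askYesNo: (q: string) => Promise<boolean | null>,
): Promise<boolean> {
  // 1. A cached answer wins; no LLM call is made.
  const cached = cache[key];
  if (cached) return cached.ignore;
  // 2. Otherwise ask the YES/NO gate; an unparseable (null) answer
  //    fails open, i.e. the file is NOT skipped.
  const answer = await askYesNo(
    `Should files like "${key}" be skipped during code analysis? Answer YES or NO.`,
  );
  const ignore = answer === true;
  cache[key] = { ignore, decidedAt: new Date().toISOString() };
  return ignore;
}
```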
- `pipeline/skip-decisions/` — kube-style YES/NO LLM gate for files with unknown extensions or extensionless filenames.
- Kube's static ignore lists ship as `seed-data/` and are merged into `SKIP_DIRS`/`SKIP_FILES`/`BINARY_EXTENSIONS`, so the scanner now rejects `CODEOWNERS`, `Chart.yaml`, `_helpers.tpl`, `.tfvars`, etc. without LLM calls.
- Decisions are cached in `~/.bytebell/llmDecisions.json` (atomic-write, mode 0600). Users hand-edit `ignore: true → false` to permanently un-ignore an extension.
- The gate is toggled by `Config.SkipDecisionEnabled` (default `true`).

Cross-package changes
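One cross-package detail worth illustrating: Neo4j node properties cannot hold nested objects, so a `sectionMap` record has to be flattened into two aligned arrays before the Cypher write. A minimal sketch of that transform — the function names are illustrative, not the PR's:

```typescript
// Split a sectionMap record into aligned name/description arrays,
// since Neo4j node properties only allow primitives and arrays of primitives.
function splitSectionMap(
  sectionMap: Record<string, string>,
): { sectionNames: string[]; sectionDescriptions: string[] } {
  const entries = Object.entries(sectionMap);
  return {
    sectionNames: entries.map(([name]) => name),
    sectionDescriptions: entries.map(([, desc]) => desc),
  };
}

// Inverse, for reading a node's parallel arrays back into a record.
function joinSectionMap(
  names: string[],
  descs: string[],
): Record<string, string> {
  return Object.fromEntries(names.map((n, i) => [n, descs[i] ?? ""]));
}
```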
- `@bb/config` + `@bb/types`: added `ContextWindowLimit`, `MaxTokensPerChunk`, `BigFileConcurrency`, `AbsoluteFileSizeCap`, `ConcurrentWorkers`, `CondenseContextLimit`, `CondensePromptOverhead`, `SmallFileDedupThreshold`, `BigFileLineThreshold`, `OrgId` (locked to `"local"`), `SkipDecisionEnabled`, `SkipDecisionMaxCharsForLlm`, `SkipDecisionCachePath`.
- `@bb/llm`: added `askJsonLLM<T>` (JSON-shape calls with parse + 1-retry fallback) and `askYesNoLLM` (boolean parse, null on failure) next to the existing `askLLM`.
- `@bb/mongo`: extended `FileAnalysis` with kube's field set (`ontologyConcepts`, `businessEntities`, `systemCapabilities`, `sideEffects`, `configDependencies`, `dataFlowDirection`, `integrationSurface`, `contractsProvided`, `contractsConsumed`, `sectionMap`). All new fields are optional for back-compat.
- `@bb/neo4j`: added `upsertRepoNode`, `upsertFolderNode`, `ensureFlatFolderIndexes`; extended `upsertFileNode` with `orgId`/`repoId`, a Folder→File `CONTAINS` edge, and extended analysis props. Cypher writes split `sectionMap` into parallel `sectionNames`/`sectionDescriptions` arrays (Neo4j doesn't support nested objects).
- `@bb/errors`: added `CancellationError` to `ingest-errors.ts`; the orchestrator catches it and re-throws without flipping Mongo state to `FAILED`.
- `@bb/server`: parked `POST /api/v1/github/pull` at HTTP 503 with a structured "migrating" body until the pull plan lands. Re-index for now.

Plumbing
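A self-contained, p-limit-shaped limiter of the kind described here can be sketched in a few lines. This is an illustration of the shape, assuming the zero-dependency design stated above, not the PR's actual implementation:

```typescript
// p-limit-style limiter: at most n tasks run concurrently; extra tasks
// queue and start as running tasks complete. No external dependencies.
function withConcurrency(n: number) {
  let active = 0;
  const queue: Array<() => void> = [];
  const release = () => {
    active--;
    queue.shift()?.(); // wake the oldest waiter, if any
  };
  return async function limit<T>(task: () => Promise<T>): Promise<T> {
    if (active >= n) {
      // Park until a running task releases a slot.
      await new Promise<void>((resolve) => queue.push(resolve));
    }
    active++;
    try {
      return await task();
    } finally {
      release();
    }
  };
}
```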
- The `pipeline/run.ts` orchestrator wraps the strategy with state transitions, clone-or-fetch+reset, meta-dir setup, stats persistence, commit anchoring, and cooperative cancellation.
- `resolveOrgId(payload)` returns `payload.orgId ?? Config.OrgId` — the single source of truth for org resolution.
- `pipeline/scan.ts` is an async generator with `Config.AbsoluteFileSizeCap` (50 MiB default) + `Config.BigFileLineThreshold` (1200 lines default) classifying oversized files; it emits end-of-scan totals (`acceptStatic`/`acceptLlm`/`rejectStatic`/`rejectLlm`/`oversized`/`binary`).
- `pipeline/concurrency.ts` exposes a self-contained `withConcurrency(n)` (p-limit shape, zero external deps).

Stats
Why
This PR opens the door on bytebell-public's ingestion engine. The flat-folder pipeline shipped here is the same strategy we run in our internal kube deployment — same prompts, same phases, same disk artifacts, same graph shape. We are not shipping a stripped-down OSS demo or a teaser; we are committing the production ingestion code so the community can run it locally and form their own opinions about how it works.