[DONE] feat/flat-folder — port flat-folder ingestion strategy#39
Merged
Conversation
Pull request overview
This PR ports the “v2 flat-folder” ingestion pipeline into @bb/ingest-github, replacing the old v1 single-pass file strategy with a multi-phase, crash-resumable pipeline that writes intermediate artifacts to disk and persists a richer analysis shape into Mongo + Neo4j (including new :Repo / :Folder nodes and extended FileAnalysis fields). It also adds an LLM-backed skip-decision gate to reduce unnecessary analysis calls for unknown file types and common noise files.
Changes:
- Replace v1 worker/strategy plumbing with a new orchestrated pipeline (pipeline/run.ts) and the 7-phase flat-folder strategy.
- Add skip-decision infrastructure (seed data, cache, YES/NO LLM prompt) to filter files before analysis.
- Extend shared types/config and Neo4j/Mongo/LLM helpers to support richer analysis fields and new graph nodes.
Reviewed changes
Copilot reviewed 99 out of 100 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
| packages/types/src/job.ts | Add orgId to payloads |
| packages/types/src/context.md | Document orgId override |
| packages/types/src/config.ts | Add flat-folder config keys |
| packages/neo4j/src/repo.ts | Add :Repo upsert |
| packages/neo4j/src/index.ts | Export new upserts/indexes |
| packages/neo4j/src/folder.ts | Add :Folder upsert |
| packages/neo4j/src/flatFolderIndexes.ts | Add repo/folder constraints |
| packages/neo4j/src/fileVersions.ts | Snapshot extended file props |
| packages/neo4j/src/files.ts | Extend file node properties |
| packages/neo4j/src/context.md | Update Neo4j module docs |
| packages/mongo/src/raw.ts | Extend FileAnalysis shape |
| packages/mongo/src/index.ts | Export new analysis types |
| packages/llm/src/jsonClient.ts | Add JSON + YES/NO clients |
| packages/llm/src/index.ts | Export JSON helpers |
| packages/ingest-github/tsconfig.json | Add src/* path alias |
| packages/ingest-github/src/worker.ts | Remove v1 worker |
| packages/ingest-github/src/types/strategy.ts | New strategy port types |
| packages/ingest-github/src/types/pipeline.ts | Pipeline/scan/skip types |
| packages/ingest-github/src/types/meta-paths.ts | Meta artifact paths type |
| packages/ingest-github/src/types/ingest-runner.ts | Runner dependency types |
| packages/ingest-github/src/types/index.ts | Types barrel exports |
| packages/ingest-github/src/types/file-analysis.ts | emptyFileAnalysis helper |
| packages/ingest-github/src/types/context.md | Types tier documentation |
| packages/ingest-github/src/types/condensed-file-analysis.ts | Condensed artifact shape |
| packages/ingest-github/src/types/big-file.ts | Big-file artifact shapes |
| packages/ingest-github/src/Strategy.ts | Remove v1 strategy interface |
| packages/ingest-github/src/strategies/flat-folder/types.ts | Flat-folder domain types |
| packages/ingest-github/src/strategies/flat-folder/repo-summary.ts | Repo summary generation |
| packages/ingest-github/src/strategies/flat-folder/prompts/repo-summary.ts | Repo summary prompts |
| packages/ingest-github/src/strategies/flat-folder/prompts/folder-summary.ts | Folder summary prompts |
| packages/ingest-github/src/strategies/flat-folder/prompts/file-analysis.ts | File analysis prompts |
| packages/ingest-github/src/strategies/flat-folder/prompts/file-analysis-fields.ts | Field schema block |
| packages/ingest-github/src/strategies/flat-folder/prompts/context.md | Prompt module docs |
| packages/ingest-github/src/strategies/flat-folder/prompts/condense.ts | Condense prompt |
| packages/ingest-github/src/strategies/flat-folder/prompts/chunk.ts | Chunk prompt |
| packages/ingest-github/src/strategies/flat-folder/prompts/backfill.ts | Backfill prompt |
| packages/ingest-github/src/strategies/flat-folder/phases/store-flat-analysis.ts | Neo4j write phase |
| packages/ingest-github/src/strategies/flat-folder/phases/process-big-files.ts | Big-file phase |
| packages/ingest-github/src/strategies/flat-folder/phases/context.md | Phase docs |
| packages/ingest-github/src/strategies/flat-folder/phases/classify-and-analyse-small.ts | Phase 1 classify/analyze |
| packages/ingest-github/src/strategies/flat-folder/index.ts | Strategy orchestrator |
| packages/ingest-github/src/strategies/flat-folder/folder-summary.ts | Folder summary phase |
| packages/ingest-github/src/strategies/flat-folder/folder-path.ts | Folder path helpers |
| packages/ingest-github/src/strategies/flat-folder/context.md | Strategy docs |
| packages/ingest-github/src/strategies/flat-folder/big-file/storage.ts | Big-file cache storage |
| packages/ingest-github/src/strategies/flat-folder/big-file/index.ts | Big-file processing |
| packages/ingest-github/src/strategies/flat-folder/big-file/detector.ts | Big-file list I/O |
| packages/ingest-github/src/strategies/flat-folder/big-file/context.md | Big-file docs |
| packages/ingest-github/src/strategies/flat-folder/big-file/chunker.ts | Token-based chunking |
| packages/ingest-github/src/strategies/flat-folder/big-file/chunk-analyzer.ts | Chunk LLM analysis |
| packages/ingest-github/src/strategies/flat-folder/big-file/cache.ts | Big-file cache inspect |
| packages/ingest-github/src/strategies/flat-folder/backfill/fields.ts | Phase 3 backfill |
| packages/ingest-github/src/strategies/flat-folder/backfill/context.md | Backfill docs |
| packages/ingest-github/src/strategies/flat-folder/backfill/big-files.ts | Phase 4 backfill big |
| packages/ingest-github/src/strategies/flat-folder/analyse-file.ts | Small-file analysis wrapper |
| packages/ingest-github/src/strategies/context.md | Strategies tree docs |
| packages/ingest-github/src/strategies/basic-file-analysis/context.md | Archived v1 docs |
| packages/ingest-github/src/strategies/basic-file-analysis/BasicFileAnalysisStrategy.ts.archived | Archive v1 source |
| packages/ingest-github/src/scan.ts | Remove v1 scanner |
| packages/ingest-github/src/pipeline/source.ts | Clone-or-sync + HEAD hash |
| packages/ingest-github/src/pipeline/skip-decisions/seed.ts | Load skip seed data |
| packages/ingest-github/src/pipeline/skip-decisions/seed-data/filenameIgnore.json | Seed filename rejects |
| packages/ingest-github/src/pipeline/skip-decisions/seed-data/extensions.json | Known ext→language map |
| packages/ingest-github/src/pipeline/skip-decisions/seed-data/directoryIgnore.json | Seed dir rejects |
| packages/ingest-github/src/pipeline/skip-decisions/seed-data/context.md | Seed data docs |
| packages/ingest-github/src/pipeline/skip-decisions/prompts/skip-decision.ts | Skip gate prompts |
| packages/ingest-github/src/pipeline/skip-decisions/prompts/context.md | Skip prompt docs |
| packages/ingest-github/src/pipeline/skip-decisions/index.ts | Skip-decisions barrel |
| packages/ingest-github/src/pipeline/skip-decisions/decider.ts | Skip decision logic |
| packages/ingest-github/src/pipeline/skip-decisions/context.md | Skip-decisions docs |
| packages/ingest-github/src/pipeline/skip-decisions/cache.ts | Skip decision cache |
| packages/ingest-github/src/pipeline/scan.ts | New streaming scanner |
| packages/ingest-github/src/pipeline/run.ts | New orchestrator runner |
| packages/ingest-github/src/pipeline/paths.ts | Repo/meta path helpers |
| packages/ingest-github/src/pipeline/filters.ts | Static filters wiring |
| packages/ingest-github/src/pipeline/context.md | Pipeline docs |
| packages/ingest-github/src/pipeline/concurrency.ts | Concurrency limiter |
| packages/ingest-github/src/pipeline/cancellation.ts | Cooperative cancellation |
| packages/ingest-github/src/pipeline/branch.ts | Branch validation |
| packages/ingest-github/src/payload/narrow.ts | Payload narrowing |
| packages/ingest-github/src/payload/context.md | Payload docs |
| packages/ingest-github/src/paths.ts | Remove v1 paths helper |
| packages/ingest-github/src/index.ts | New worker wiring + exports |
| packages/ingest-github/src/handlers/ingest-job.ts | Job handlers |
| packages/ingest-github/src/handlers/context.md | Handler docs |
| packages/ingest-github/src/context.md | Rewrite package docs |
| packages/ingest-github/src/concurrency.ts | Remove v1 concurrency |
| packages/ingest-github/src/bigFile.ts | Remove v1 big-file logic |
| packages/ingest-github/src/analyze.ts | Remove v1 analysis entry |
| packages/ingest-github/src/analysisShared.ts | Remove v1 shared analysis |
| packages/ingest-github/src/adapters/llm-file-analyzer.ts | LLM analyzer adapter |
| packages/ingest-github/src/adapters/index.ts | Adapter barrel |
| packages/ingest-github/src/adapters/context.md | Adapter docs |
| packages/errors/src/ingest-errors.ts | Add CancellationError |
| packages/errors/src/index.ts | Export CancellationError |
| packages/errors/src/context.md | Document cancellation error |
| packages/config/src/schema.ts | Add new flat-folder config defaults |
Comments suppressed due to low confidence (1)
packages/ingest-github/src/pipeline/scan.ts:81
`scanRepository` treats files with `countLines(content) > Config.BigFileLineThreshold` as `kind: "oversized"`. In Phase 1, oversized files are written as permanent "too-large" stubs and never sent through the big-file chunk/analyse/condense path, so line-threshold files will be silently skipped rather than analyzed. If the intent is to route these into the big-file pipeline, consider emitting a distinct kind (e.g. `"big"`) or yielding a normal `"file"` entry and letting Phase 1 enqueue it for Phase 2 instead of stubbing it out.
```ts
const content = buf.toString("utf8");
if (countLines(content) > limits.bigFileLineThreshold) {
  counts.oversized += 1;
  yield { kind: "oversized", relativePath, absolutePath: abs, sizeBytes };
  continue;
}
```
Comment on lines 15 to 41 of packages/config/src/schema.ts:
```ts
export const configSchema = z
  .object({
    server_port: z.number().int().min(1).max(65535).default(8080),
    mongo_uri: z.string().default(""),
    neo4j_uri: z.string().default(""),
    neo4j_user: z.string().default(""),
    neo4j_password: z.string().default(""),
    redis_url: z.string().default(""),
    openrouter_api_key: z.string().default(""),
    openrouter_model: z.string().default("deepseek/deepseek-v4-flash"),
    openrouter_fallback_model_1: z.string().default("qwen/qwen3.5-flash-02-23"),
    openrouter_fallback_model_2: z.string().default("minimax/minimax-m2.7"),
    openrouter_fallback_model_3: z.string().default("moonshotai/kimi-k2.5"),
    openrouter_fallback_model_4: z.string().default("x-ai/grok-4.3"),
    concurrency: concurrencySchema.default({ github: 2 }),
    log_level: z.enum(LOG_LEVELS).default("info"),
    log_retention_days: z.number().int().positive().default(14),
    llm_cache_enabled: z.boolean().default(true),
    "context.window.limit": z.number().int().positive().default(15000),
    "max.tokens.per.chunk": z.number().int().positive().default(6000),
    "big.file.concurrency": z.number().int().positive().default(25),
    "absolute.file.size.cap": z.number().int().positive().default(52428800),
    "concurrent.workers": z.number().int().positive().default(4),
    "condense.context.limit": z.number().int().positive().default(12000),
    "condense.prompt.overhead": z.number().int().nonnegative().default(1500),
    "small.file.dedup.threshold": z.number().int().positive().default(3),
  })
```
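Elsewhere in the PR these dotted keys surface as PascalCase accessors (`Config.ContextWindowLimit`, `Config.AbsoluteFileSizeCap`, and so on). One plausible mapping between the two spellings — the helper name below is hypothetical, not part of the PR:

```typescript
// Hypothetical helper: map a dotted config key ("context.window.limit")
// to the PascalCase accessor name used in the PR (ContextWindowLimit).
function toAccessorName(key: string): string {
  return key
    .split(/[._]/)                                        // split on dots/underscores
    .map((part) => part.charAt(0).toUpperCase() + part.slice(1))
    .join("");
}
```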
Comment on lines +63 to +66 of packages/ingest-github/src/pipeline/scan.ts:
```ts
const sizeBytes = (await stat(abs)).size;
const relativePath = path.relative(rootDir, abs);
const ext = path.extname(entry.name).toLowerCase();
if (sizeBytes > limits.absoluteCap) {
```
Comment on lines +48 to +54:
```ts
if (entry.kind === "oversized") {
  bigFileBuffer.push({
    relativePath: entry.relativePath,
    sizeBytes: entry.sizeBytes,
    tokenCount: 0,
    reason: "too-large",
  });
```
Comment on lines +34 to +38:
```ts
const ATTACH_FILE_TO_FOLDER = `
MATCH (f:File {knowledgeId: $knowledgeId, relativePath: $relativePath})
MATCH (folder:Folder {knowledgeId: $knowledgeId, folderPath: $folderPath})
MERGE (folder)-[:CONTAINS]->(f)
`;
```
Comment on lines +21 to +26:
```ts
const UPSERT_FOLDER = `
MERGE (folder:Folder {orgId: $orgId, knowledgeId: $knowledgeId, repoId: $repoId, folderPath: $folderPath})
SET folder.purpose = $purpose,
    folder.summary = $summary,
    folder.dependencyGraph = $dependencyGraph,
    folder.updatedAt = $updatedAt
```
Comment on lines 28 to 41:
```cypher
fv.summary = f.summary,
fv.businessContext = f.businessContext,
fv.dataFlowDirection = f.dataFlowDirection,
fv.ontologyConcepts = f.ontologyConcepts,
fv.businessEntities = f.businessEntities,
fv.systemCapabilities = f.systemCapabilities,
fv.sideEffects = f.sideEffects,
fv.configDependencies = f.configDependencies,
fv.integrationSurface = f.integrationSurface,
fv.contractsProvided = f.contractsProvided,
fv.contractsConsumed = f.contractsConsumed,
fv.sectionNames = f.sectionNames,
fv.sectionDescriptions = f.sectionDescriptions,
fv.snapshotAt = $snapshotAt
```
Comment on lines +165 to +172:
```ts
async function persistStats(input: PersistStatsInput): Promise<void> {
  const estimatedCost = await estimateCostFromBreakdown({});
  await recordProcessingStats({
    knowledgeId: input.knowledgeId,
    repoName: input.repoName,
    commitHash: input.commitHash,
    modelTokens: {},
    estimatedCost,
```
What changed
Ingestion pipeline (v2 flat-folder strategy)
- Replaced `BasicFileAnalysisStrategy` (single-pass per-file LLM, no folder/repo summaries) with a 7-phase flat-folder strategy ported from kube v2:
  - per-file analysis with extended fields (`keywords`, `sideEffects`, `configDependencies`, `dataFlowDirection`)
  - folder summaries (purpose, summary, `dependencyGraph` per folder)
  - repo summary (single-shot when it fits `ContextWindowLimit`, else batch + merge)
  - graph writes (`:Repo` → `:Folder` → `:File` with extended analysis props)
- Disk artifacts (`bigFiles.json`, `file-analysis/*.json`, `folder-summaries/*.json`, `big-file-analysis/*.manifest.json`, `repo-summary.json`) are the inter-phase contract — a crash mid-run resumes from the next sub-phase boundary.
- The v1 strategy is archived at `packages/ingest-github/src/strategies/basic-file-analysis/BasicFileAnalysisStrategy.ts.archived` for reference (not compiled, not exported).
- The package is restructured into `types/`, `pipeline/`, `adapters/`, `payload/`, `handlers/`, `strategies/flat-folder/`. Every folder has its own `context.md`.

LLM skip-decision gate
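In sketch form, the skip gate is a cached YES/NO LLM decision keyed by extension or filename. The cache shape, key scheme, and prompt below are illustrative assumptions, not the PR's actual API:

```typescript
// Assumed cache shape: keyed by extension (".tfvars") or bare filename
// ("CODEOWNERS"); users can hand-edit `ignore` to un-ignore an entry.
type SkipDecision = { ignore: boolean; decidedAt: string };
type SkipCache = Record<string, SkipDecision>;

async function shouldSkip(
  key: string,
  cache: SkipCache,
  askYesNo: (q: string) => Promise<boolean | null>,
): Promise<boolean> {
  // 1. A cached answer wins; no LLM call is made.
  const cached = cache[key];
  if (cached) return cached.ignore;
  // 2. Otherwise ask the YES/NO gate; an unparseable (null) answer
  //    fails open, i.e. the file is NOT skipped.
  const answer = await askYesNo(
    `Should files like "${key}" be skipped during code analysis? Answer YES or NO.`,
  );
  const ignore = answer === true;
  cache[key] = { ignore, decidedAt: new Date().toISOString() };
  return ignore;
}
```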
- `pipeline/skip-decisions/` — kube-style YES/NO LLM gate for files with unknown extensions or extensionless filenames.
- Kube's static ignore lists ship as `seed-data/` and are merged into `SKIP_DIRS`/`SKIP_FILES`/`BINARY_EXTENSIONS`, so the scanner now rejects `CODEOWNERS`, `Chart.yaml`, `_helpers.tpl`, `.tfvars`, etc. without LLM calls.
- Decisions are cached in `~/.bytebell/llmDecisions.json` (atomic-write, mode 0600). Users hand-edit `ignore: true → false` to permanently un-ignore an extension.
- The gate is toggled by `Config.SkipDecisionEnabled` (default `true`).

Cross-package changes
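One cross-package detail worth illustrating: Neo4j node properties cannot hold nested objects, so a `sectionMap` record has to be flattened into two aligned arrays before the Cypher write. A minimal sketch of that transform — the function names are illustrative, not the PR's:

```typescript
// Split a sectionMap record into aligned name/description arrays,
// since Neo4j node properties only allow primitives and arrays of primitives.
function splitSectionMap(
  sectionMap: Record<string, string>,
): { sectionNames: string[]; sectionDescriptions: string[] } {
  const entries = Object.entries(sectionMap);
  return {
    sectionNames: entries.map(([name]) => name),
    sectionDescriptions: entries.map(([, desc]) => desc),
  };
}

// Inverse, for reading a node's parallel arrays back into a record.
function joinSectionMap(
  names: string[],
  descs: string[],
): Record<string, string> {
  return Object.fromEntries(names.map((n, i) => [n, descs[i] ?? ""]));
}
```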
- `@bb/config` + `@bb/types`: added `ContextWindowLimit`, `MaxTokensPerChunk`, `BigFileConcurrency`, `AbsoluteFileSizeCap`, `ConcurrentWorkers`, `CondenseContextLimit`, `CondensePromptOverhead`, `SmallFileDedupThreshold`, `BigFileLineThreshold`, `OrgId` (locked to `"local"`), `SkipDecisionEnabled`, `SkipDecisionMaxCharsForLlm`, `SkipDecisionCachePath`.
- `@bb/llm`: added `askJsonLLM<T>` (JSON-shape calls with parse + 1-retry fallback) and `askYesNoLLM` (boolean parse, null on failure) next to the existing `askLLM`.
- `@bb/mongo`: extended `FileAnalysis` with kube's field set (`ontologyConcepts`, `businessEntities`, `systemCapabilities`, `sideEffects`, `configDependencies`, `dataFlowDirection`, `integrationSurface`, `contractsProvided`, `contractsConsumed`, `sectionMap`). All new fields are optional for back-compat.
- `@bb/neo4j`: added `upsertRepoNode`, `upsertFolderNode`, `ensureFlatFolderIndexes`; extended `upsertFileNode` with `orgId`/`repoId`, a Folder→File `CONTAINS` edge, and extended analysis props. Cypher writes split `sectionMap` into parallel `sectionNames`/`sectionDescriptions` arrays (Neo4j doesn't support nested objects).
- `@bb/errors`: added `CancellationError` to `ingest-errors.ts`; the orchestrator catches it and re-throws without flipping Mongo state to `FAILED`.
- `@bb/server`: parked `POST /api/v1/github/pull` at HTTP 503 with a structured "migrating" body until the pull plan lands. Re-index for now.

Plumbing
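A self-contained, p-limit-shaped limiter of the kind described here can be sketched in a few lines. This is an illustration of the shape, assuming the zero-dependency design stated above, not the PR's actual implementation:

```typescript
// p-limit-style limiter: at most n tasks run concurrently; extra tasks
// queue and start as running tasks complete. No external dependencies.
function withConcurrency(n: number) {
  let active = 0;
  const queue: Array<() => void> = [];
  const release = () => {
    active--;
    queue.shift()?.(); // wake the oldest waiter, if any
  };
  return async function limit<T>(task: () => Promise<T>): Promise<T> {
    if (active >= n) {
      // Park until a running task releases a slot.
      await new Promise<void>((resolve) => queue.push(resolve));
    }
    active++;
    try {
      return await task();
    } finally {
      release();
    }
  };
}
```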
- The `pipeline/run.ts` orchestrator wraps the strategy with state transitions, clone-or-fetch+reset, meta-dir setup, stats persistence, commit anchoring, and cooperative cancellation.
- `resolveOrgId(payload)` returns `payload.orgId ?? Config.OrgId` — the single source of truth for org resolution.
- `pipeline/scan.ts` is an async generator with `Config.AbsoluteFileSizeCap` (50 MiB default) + `Config.BigFileLineThreshold` (1200 lines default) classifying oversized files; it emits end-of-scan totals (`acceptStatic`/`acceptLlm`/`rejectStatic`/`rejectLlm`/`oversized`/`binary`).
- `pipeline/concurrency.ts` exposes a self-contained `withConcurrency(n)` (p-limit shape, zero external deps).

Stats
Why
This PR opens the door on bytebell-public's ingestion engine. The flat-folder pipeline shipped here is the same strategy we run in our internal kube deployment — same prompts, same phases, same disk artifacts, same graph shape. We are not shipping a stripped-down OSS demo or a teaser; we are committing the production ingestion code so the community can run it locally and form their own opinions about how it works.