This repository was archived by the owner on Apr 23, 2026. It is now read-only.

Preprocessor for scraped llvm/llvm-project PRs & comments → documents for Pinecone #88

@jonathanMLDev

Summary

Preprocess already scraped llvm/llvm-project pull request data (including comments) into document objects (content + metadata) for Pinecone upsert. No scraping or GitHub API calls.

Scope

  • Input: scraped PR data for llvm/llvm-project, with comments.
  • Output: documents with content and metadata (e.g. repo, number, state, author, created_at, labels, url).
  • In scope: parse/validate, normalize text, include comments in content/chunking, document schema.
  • Out of scope: GitHub fetch; Pinecone API.
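
As a rough illustration of the scope above, the sketch below turns one scraped PR record into a { content, metadata } document. The scraped format is not specified in this issue, so every input key used here (`title`, `body`, `comments`, `author`, `url`, and so on) and both function names are assumptions chosen for illustration, not the agreed format.

```python
import re
from typing import Any, Optional


def normalize_text(text: Optional[str]) -> str:
    """Normalize line endings, trailing whitespace, and runs of blank lines."""
    if not text:
        return ""
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    text = re.sub(r"[ \t]+\n", "\n", text)  # drop trailing whitespace on each line
    text = re.sub(r"\n{3,}", "\n\n", text)  # keep at most one blank line in a row
    return text.strip()


def pr_to_document(pr: dict[str, Any], repo: str = "llvm/llvm-project") -> dict[str, Any]:
    """Build one {content, metadata} document from a scraped PR record.

    All input keys are hypothetical; adapt them to the real scraped format.
    """
    parts = [f"# {pr['title'].strip()}", normalize_text(pr.get("body"))]
    for comment in pr.get("comments", []):
        body = normalize_text(comment.get("body"))
        if body:  # skip empty or whitespace-only comments
            parts.append(f"Comment by {comment.get('author', 'unknown')}:\n{body}")
    content = "\n\n".join(p for p in parts if p)
    metadata = {
        "repo": repo,
        "type": "PR",
        "number": int(pr["number"]),
        "state": pr.get("state", ""),
        "author": pr.get("author", ""),
        "created_at": pr.get("created_at", ""),
        "labels": list(pr.get("labels", [])),
        "url": pr.get("url", ""),
    }
    return {"content": content, "metadata": metadata}
```

Comments are appended in their original order after the PR body, so the same input always yields the same content string. Chunking is left out of this sketch.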

Result

A library or CLI that converts scraped payload(s) → list of { content, metadata }, with config for field mapping and truncation. Deliverables: code, tests, and a README documenting the document schema.
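
One possible shape for the field mapping/truncation config is sketched below. The class name `PreprocessConfig`, its fields, the default limit of 8000 characters, and the source key `user_login` are all hypothetical; the issue does not define them.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass(frozen=True)
class PreprocessConfig:
    # Maps output metadata key -> key in the scraped payload (hypothetical names).
    field_map: dict[str, str] = field(default_factory=lambda: {
        "number": "number",
        "author": "user_login",
        "created_at": "created_at",
    })
    max_content_chars: int = 8000          # assumed limit, not from the issue
    truncation_marker: str = " [truncated]"


def map_fields(payload: dict[str, Any], config: PreprocessConfig) -> dict[str, Any]:
    """Copy mapped fields from the payload, skipping keys that are absent."""
    return {dst: payload[src] for dst, src in config.field_map.items() if src in payload}


def truncate(content: str, config: PreprocessConfig) -> str:
    """Cut content to max_content_chars, ending with the marker when cut."""
    limit = config.max_content_chars
    if len(content) <= limit:
        return content
    keep = max(limit - len(config.truncation_marker), 0)
    # The final slice guards against a limit shorter than the marker itself.
    return (content[:keep] + config.truncation_marker)[:limit]
```

Keeping the mapping in data rather than code means a change in the scraped format would only require a new config, not a new parser.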

Acceptance criteria

  • Accepts scraped llvm/llvm-project PR (+ comments) data in the agreed format; outputs stable content + metadata.
  • Metadata includes an identifier, repo (llvm/llvm-project), type (PR), and filterable fields. The schema is documented.
  • No GitHub API or scraping in this component.
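
The criteria above could be checked with something like the following sketch. It assumes a deterministic identifier scheme and a required-key set that are both invented here, and it assumes that filterable metadata values should be strings, numbers, booleans, or lists of strings, which is the set of types Pinecone documents for metadata; verify that against the current Pinecone documentation before relying on it.

```python
from typing import Any

# Hypothetical required keys; the real schema is to be documented in the README.
REQUIRED_KEYS = ("id", "repo", "type", "number")


def document_id(repo: str, number: int, chunk_index: int = 0) -> str:
    """Deterministic identifier: same PR and chunk always give the same id."""
    return f"{repo}#pr-{number}-chunk-{chunk_index}"


def metadata_problems(metadata: dict[str, Any]) -> list[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in metadata]
    for key, value in metadata.items():
        if isinstance(value, (str, bool, int, float)):
            continue
        if isinstance(value, list) and all(isinstance(v, str) for v in value):
            continue
        problems.append(f"unsupported value type for {key}: {type(value).__name__}")
    return problems
```

Because the identifier is derived only from the repo, PR number, and chunk index, re-running the preprocessor on the same input would overwrite existing records on upsert instead of duplicating them.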

Metadata

Labels

enhancement: New feature or request
