Implement private gallery indexing engine#392
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: dd00e7dc96
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if (options is null) throw new ArgumentNullException(nameof(options)); | ||
|
|
||
| using var http = PrivateGalleryHttp.CreateClient(options.RequestTimeoutSeconds, _httpHandler); | ||
| var packageBase = await ResolvePackageBaseAddressAsync(http, serviceIndexUrl, options, cancellationToken).ConfigureAwait(false); |
There was a problem hiding this comment.
Cache package base address lookups across downloads
DownloadPackageAsync resolves the NuGet V3 service index on every package download, but PrivateGalleryIndexer calls this method once per package version; for feeds with many versions this creates a repeated extra HTTP request per download (service index + package), which significantly slows indexing and increases throttling/timeout risk in CI. Resolve the package base URL once per feed (or cache by serviceIndexUrl) before iterating versions.
Useful? React with 👍 / 👎.
| else if (lowerPath.Contains("/docs/", StringComparison.Ordinal) && IsTextDocument(lowerName)) | ||
| kind = "docs"; | ||
| else if (lowerPath.Contains("/examples/", StringComparison.Ordinal) && IsExampleDocument(lowerName)) | ||
| kind = "example"; |
There was a problem hiding this comment.
Detect docs/examples at package root paths
The document classifier only matches "/docs/" and "/examples/" substrings, so common root-level paths like docs/GettingStarted.md or examples/Run.ps1 are never tagged as docs/example. In those packages the generated metadata and search.json silently miss documentation/example entries, reducing discoverability and producing incomplete quality counts.
Useful? React with 👍 / 👎.
| "protocolType=NuGet", | ||
| "includeUrls=true", | ||
| "includeAllVersions=" + includeAllVersions.ToString(CultureInfo.InvariantCulture).ToLowerInvariant(), | ||
| "includeDescription=" + includeDescription.ToString(CultureInfo.InvariantCulture).ToLowerInvariant(), | ||
| "$top=" + top.ToString(CultureInfo.InvariantCulture), |
There was a problem hiding this comment.
Include prerelease packages in inventory query
The Azure Artifacts packages request omits the isRelease query parameter; Microsoft’s REST API defaults isRelease to true, which filters results to non-prerelease versions when includeAllVersions is used. This means feeds that publish prerelease-only modules (or rely on prerelease history) are indexed incompletely even though the task advertises full private-gallery inventory.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 0761dd8461
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if (packages.Count >= maxPackages) | ||
| break; |
There was a problem hiding this comment.
Run prerelease pass even when package cap is reached
When IncludeAllVersions is enabled, the client correctly plans two queries (isRelease=true then isRelease=false), but this early break exits before the prerelease pass whenever the first pass reaches MaxPackages (500 by default). In feeds with many packages, that means each selected package only keeps release versions, so the output silently violates the "all versions" contract and misses prerelease history. Azure’s isRelease default is true (release-only), so the second pass is required even after the package cap is met; only adding new package IDs should be capped, not merging additional versions for already selected packages.
Useful? React with 👍 / 👎.
| }; | ||
| using var stream = entry.Open(); | ||
| using var reader = XmlReader.Create(stream, settings); | ||
| var document = XDocument.Load(reader, LoadOptions.None); |
There was a problem hiding this comment.
Enforce size limit before loading help XML
The inspector caps manifest text reads to MaxTextBytes, but ApplyHelpXml parses every -help.xml entry with XDocument.Load without any entry-length check. A package can include a very large help XML file (still within the 250 MB package cap), which can drive high memory/CPU usage and stall or crash indexing when includePackageContent is enabled. Add a size guard (or streaming parse limits) before loading help XML content.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f11bc6d8a9
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| var fetchedForFilter = 0; | ||
| while (fetchedForFilter < maxPackages) |
There was a problem hiding this comment.
Enforce MaxPackages across all release-filter passes
The package cap is applied per isRelease pass because fetchedForFilter is reset for each filter, so with IncludeAllVersions=true this can return more than MaxPackages package IDs (e.g., release packages in the first pass plus prerelease-only packages in the second). That violates the MaxPackages contract and can materially increase runtime/output size on large feeds.
Useful? React with 👍 / 👎.
| if (fetched < take) | ||
| break; |
There was a problem hiding this comment.
Paginate using API page size, not parsed item count
Pagination currently stops when fetched < take, but fetched only counts successfully parsed packages. If Azure returns any malformed item (already handled by ParsePackage returning null), fetched drops below take and the loop breaks early, truncating the remaining pages. This means one bad row can silently prevent indexing many valid later packages.
Useful? React with 👍 / 👎.
| if (fetched < take) | ||
| break; | ||
|
|
||
| skip += fetched; |
There was a problem hiding this comment.
Increment
$skip by returned rows, not parsed rows
The next page offset is advanced by fetched, which excludes rows that failed parsing. Azure's $skip is based on rows in the response, so after any malformed entry the client requests an overlapping page ($skip too small), causing duplicates and potentially preventing progress to later packages under MaxPackages limits. Use the response row count (value length) to advance pagination even when some rows are discarded.
Useful? React with 👍 / 👎.
| "anyOf": [ | ||
| { "required": ["out"] }, | ||
| { "required": ["output"] }, | ||
| { "required": ["outputDirectory"] }, | ||
| { "required": ["output-directory"] } | ||
| ] |
There was a problem hiding this comment.
Keep schema output optional to match runtime default
The runner intentionally supports omitting out/output* by defaulting private gallery output to ./data/private-gallery, but the pipeline schema now requires one of those properties. This creates false validation failures in schema-aware tooling/CI for configurations that the runtime accepts, making documented default behavior unusable in strict schema validation flows.
Useful? React with 👍 / 👎.
…-engine-merge # Conflicts: # Docs/PowerForge.Web.PrivateGallery.md
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7f16d40109
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
|
||
| skip += returnedRows; | ||
|
|
||
| if (packages.Count >= maxPackages && fetched == 0) |
There was a problem hiding this comment.
Keep paging prerelease results after hitting MaxPackages
With includeAllVersions=true, the second (isRelease=false) pass can still contain prerelease versions for already-selected packages, but the loop currently stops at packages.Count >= maxPackages && fetched == 0. Because fetched only increments when a row merges into an existing package or adds a new one, any prerelease page that contains only out-of-cap package IDs causes an early break, even though later pages may still contain prerelease rows for selected packages. Fresh evidence: the new guard at this line still short-circuits pagination based on a single page’s composition, so capped runs can silently miss prerelease history.
Useful? React with 👍 / 👎.
| Directory.CreateDirectory(moduleDirectory); | ||
| foreach (var package in result.Document.Packages) |
There was a problem hiding this comment.
Prune obsolete module JSON when regenerating gallery output
The generator writes current package/version files under modules/ but never removes files from packages or versions that disappeared since the previous run. Re-running against a changed feed therefore leaves stale JSON artifacts on disk that no longer match feed.json/search.json, which can leak outdated package data to consumers that read or serve the per-module/per-version files directly. Cleaning or reconciling the modules subtree before writing would keep the output contract consistent.
Useful? React with 👍 / 👎.
Summary
private-gallery-indexpipeline task and schema/docs coverageValidation
Stacking
codex/private-gallery-web-engine-plan