Add portal docs index pipeline task#393
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 29850183f6
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| catch (Exception ex) when (ex is IOException or UnauthorizedAccessException or JsonException or HttpRequestException or TaskCanceledException) | ||
| { | ||
| AddSourceWarning(normalizedSource, document.Warnings, $"Source '{normalizedSource.Id}' failed: {ex.Message}"); | ||
| } |
There was a problem hiding this comment.
Preserve partial docs when a GitHub file fetch fails
This source-level catch turns any HttpRequestException/TaskCanceledException from one file fetch into a failure for the entire source, so all documents collected earlier in IndexGitHubSource are dropped because AddRange(docs) is never reached. In practice, a single transient timeout in a large repo can result in zero repository docs while the pipeline still reports success with only a warning. Handle per-file fetch failures inside IndexGitHubSource (and continue) so one bad request does not wipe the full source output.
Useful? React with 👍 / 👎.
| if (char.IsLetterOrDigit(ch)) | ||
| builder.Append(char.ToLowerInvariant(ch)); | ||
| else if (builder.Length == 0 || builder[^1] != '-') | ||
| builder.Append('-'); |
There was a problem hiding this comment.
Prevent duplicate IDs from lossy slug generation
MakeSafeFragment replaces every non-alphanumeric character with -, which makes distinct paths collapse to the same ID (for example foo_bar.md and foo-bar.md under the same source). Because MakeStableId relies on this slug directly, docs.json/search.json can emit duplicate id values, which can overwrite entries in downstream maps keyed by id. Add a collision-resistant suffix (e.g., hash of the original path) or a reversible encoding to keep IDs unique.
Useful? React with 👍 / 👎.
| if (!json.RootElement.TryGetProperty("tree", out var tree) || tree.ValueKind != JsonValueKind.Array) | ||
| return ExplicitIncludePaths(include); | ||
|
|
||
| return tree.EnumerateArray() |
There was a problem hiding this comment.
Detect truncated GitHub tree responses before indexing
The GitHub tree API can return truncated: true for recursive listings, which means the tree array is incomplete; this implementation never checks that flag and treats the partial list as complete. On large repositories, portal-docs-index will silently miss documents and produce incomplete docs.json/search.json without any warning. Add explicit handling for truncated (warn/fail, or fall back to non-recursive subtree traversal) before building the path list.
Useful? React with 👍 / 👎.
| if (string.IsNullOrWhiteSpace(doc.Kind)) | ||
| doc.Kind = Classify(asset.Path, classify); |
There was a problem hiding this comment.
Apply fallback classification when package kind is missing
This guard is unreachable because CreateDocument already normalizes empty kind values to "docs", so string.IsNullOrWhiteSpace(doc.Kind) is never true. As a result, package assets with missing/empty kind are never classified by path (README, CHANGELOG, LICENSE, custom classify rules) and are mislabeled as generic docs. Check asset.Kind (or classify before CreateDocument) so fallback classification actually runs.
Useful? React with 👍 / 👎.
| case "portal-docs-index": | ||
| case "portal-docs": | ||
| case "company-portal-docs": | ||
| ExecutePortalDocsIndex(step, baseDir, stepResult); | ||
| break; |
There was a problem hiding this comment.
Register portal-docs outputs in pipeline cache tracking
This commit wires portal-docs-index into task execution, but cache output inference has no matching case, so expectedOutputs is empty for this task. With pipeline caching enabled, a subsequent run can report a cache hit even when data/portal/docs.json or search.json were deleted, because output existence is never checked. Add portal-docs-index/aliases to GetExpectedStepOutputs so cache hits require the generated files to still exist.
Useful? React with 👍 / 👎.
| return null; | ||
| } | ||
|
|
||
| return JsonSerializer.Deserialize<PrivateGalleryDocument>(File.ReadAllText(galleryPath), WebJson.Options); |
There was a problem hiding this comment.
Avoid failing all indexing on malformed gallery JSON
LoadPrivateGallery deserializes the gallery file before per-source processing, and a malformed feed.json throws JsonException that aborts the entire task, including unrelated local/github sources. This is inconsistent with the existing soft-fail behavior for a missing gallery file (warning + continue) and makes portal indexing brittle when package data is optional. Catch parse failures here and downgrade to a warning so non-package sources can still be emitted.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 912f1e9988
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| case "portal-docs-index": | ||
| case "portal-docs": | ||
| case "company-portal-docs": | ||
| ExecutePortalDocsIndex(step, baseDir, stepResult); |
There was a problem hiding this comment.
Track private gallery feed in portal-docs cache fingerprint
Adding portal-docs-index to task execution makes it cacheable, but this task depends on privateGallery/private-gallery/privateGalleryFeed input files that are not part of fingerprint path tracking (FingerprintPathKeys only covers sources/source here). With pipeline cache enabled, package-backed docs can remain stale after feed.json changes because the step fingerprint does not change and the task is skipped. Please include the private-gallery input aliases in fingerprint path tracking for this task.
Useful? React with 👍 / 👎.
| var docs = source.Kind?.Trim().ToLowerInvariant() switch | ||
| { | ||
| "local" => IndexLocalSource(sourceSpec.Defaults, source, normalizedSource, baseDir, options, document.Warnings), | ||
| "package" => IndexPackageSource(sourceSpec.Defaults, source, normalizedSource, gallery, document.Warnings), | ||
| "github" => await IndexGitHubSource(sourceSpec.Defaults, source, normalizedSource, http, options, document.Warnings).ConfigureAwait(false), | ||
| "azure-devops" or "azuredevops" => await IndexAzureDevOpsSource(sourceSpec.Defaults, source, normalizedSource, http, options, token, document.Warnings).ConfigureAwait(false), | ||
| _ => UnsupportedSource(source, normalizedSource, document.Warnings) |
There was a problem hiding this comment.
Default missing source kind to local before dispatch
Source dispatch uses source.Kind?.Trim().ToLowerInvariant() directly, so a source entry without kind falls into the unsupported branch and emits no documents. This conflicts with the model normalization that already defaults source kind to "local", causing silently skipped local sources when kind is omitted. Apply the same default before the switch (or switch on normalized kind) so omitted kind behaves consistently.
Useful? React with 👍 / 👎.
| if (!string.IsNullOrWhiteSpace(token)) | ||
| client.DefaultRequestHeaders.Authorization = new AuthenticationHeaderValue("Bearer", token); |
There was a problem hiding this comment.
Avoid sending one token to every remote source host
CreateHttpClient sets DefaultRequestHeaders.Authorization once for the shared client, and that client is reused across GitHub and Azure DevOps fetches. In mixed-source configs, an Azure PAT (or other non-GitHub token) will be sent to GitHub endpoints, which is unnecessary credential disclosure to another service. Set authorization per request/source instead of as a global default header.
Useful? React with 👍 / 👎.
| if (bytes.Length > maxContentBytes) | ||
| { | ||
| AddSourceWarning(source, warnings, $"Azure DevOps content request '{url}' returned {bytes.Length} bytes and was truncated to {maxContentBytes} bytes."); | ||
| bytes = bytes.Take(maxContentBytes).ToArray(); | ||
| } |
There was a problem hiding this comment.
Parse Azure DevOps JSON before truncating response bytes
For Azure DevOps content fetches, the code truncates the raw byte payload first and then parses JSON when the media type is JSON. If a valid JSON response exceeds maxContentBytes, truncation can make the JSON invalid and throw JsonException; this is caught at the source level and drops all docs from that source. Parse full JSON first (or stream parse) and truncate only the extracted content string.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 4104e904ea
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| ? MakeSafeFragment($"{source.Kind}-{source.Title}-{source.Module}-{source.Path}-{source.Repo}") | ||
| : source.Id!; |
There was a problem hiding this comment.
Include Azure DevOps repository fields in fallback source IDs
When id is omitted, fallback IDs are built from kind/title/module/path/repo, but Azure DevOps sources commonly use repository (plus organization/project) instead of repo. That means multiple Azure DevOps entries can collapse to the same generated source ID (often just azure-devops), and then MakeStableId(sourceId, path) can produce duplicate document IDs for shared paths like README.md, causing ambiguous or overwritten entries in downstream ID-keyed consumers.
Useful? React with 👍 / 👎.
| "projects", "projectFiles", "project-files", | ||
| "module", "modules", "moduleFiles", "module-files", | ||
| "out", "output", "source", "destination", "dest", | ||
| "privateGallery", "private-gallery", "privateGalleryFeed", "private-gallery-feed", "gallery", |
There was a problem hiding this comment.
Track sourcesFile aliases in portal-docs cache fingerprints
portal-docs-index accepts sourcesFile/sources-file, but those keys are not included in FingerprintPathKeys. In cached pipeline runs, edits to a sources file referenced via these aliases will not change the step fingerprint, so the step can be treated as a cache hit and leave stale docs.json/search.json in place.
Useful? React with 👍 / 👎.
| return extension.Equals(".md", StringComparison.OrdinalIgnoreCase) || | ||
| extension.Equals(".mdx", StringComparison.OrdinalIgnoreCase) || | ||
| extension.Equals(".txt", StringComparison.OrdinalIgnoreCase) || | ||
| string.IsNullOrWhiteSpace(extension) && Path.GetFileName(path).Equals("LICENSE", StringComparison.OrdinalIgnoreCase); |
There was a problem hiding this comment.
Accept extensionless README/CHANGELOG as text documents
IsTextDocument only allows extensionless LICENSE, but the classifier and common repo layouts also rely on extensionless README/CHANGELOG. Because all source indexers filter through this function, repositories that store docs as README or CHANGELOG (without .md) are silently excluded even when explicitly included by pattern, resulting in missing core docs in docs.json and search output.
Useful? React with 👍 / 👎.
Summary
Validation
Notes