Skip to content

Add portal docs index pipeline task#393

Merged
PrzemyslawKlys merged 4 commits into
mainfrom
codex/portal-docs-index
May 24, 2026
Merged

Add portal docs index pipeline task#393
PrzemyslawKlys merged 4 commits into
mainfrom
codex/portal-docs-index

Conversation

@PrzemyslawKlys
Copy link
Copy Markdown
Member

Summary

  • add a reusable PowerForge.Web portal docs indexer that composes website-local docs, private gallery package document assets, and GitHub repository docs into docs/search JSON
  • expose it through the portal-docs-index pipeline task and schema aliases
  • document the portal.sources.json contract and add focused tests for local/package composition and GitHub-backed docs ingestion

Validation

  • dotnet test .\PowerForge.Tests\PowerForge.Tests.csproj -c Release --filter PortalDocsIndexTests
  • dotnet build .\PowerForge.Web.Cli\PowerForge.Web.Cli.csproj -c Release --no-restore
  • ConvertFrom-Json -AsHashtable on Schemas\powerforge.web.pipelinespec.schema.json
  • local demo pipeline via built CLI: dotnet C:\Support\GitHub\PSPublishModule-portal-docs-index\PowerForge.Web.Cli\bin\Release\net10.0\PowerForge.Web.Cli.dll pipeline --config C:\Support\GitHub\PowerShellGalleryDemo\pipeline.json --mode dev
  • browser smoke at http://127.0.0.1:8091/: loaded private gallery + portal docs data, displayed 13 docs, searched repository-ingested docs with no console errors

Notes

  • Azure DevOps repository doc sources are accepted by the source contract but intentionally emit an explicit warning in this slice; GitHub repository docs are fully enumerated/fetched.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 29850183f6

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +63 to +66
catch (Exception ex) when (ex is IOException or UnauthorizedAccessException or JsonException or HttpRequestException or TaskCanceledException)
{
AddSourceWarning(normalizedSource, document.Warnings, $"Source '{normalizedSource.Id}' failed: {ex.Message}");
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Preserve partial docs when a GitHub file fetch fails

This source-level catch turns any HttpRequestException/TaskCanceledException from one file fetch into a failure for the entire source, so all documents collected earlier in IndexGitHubSource are dropped because AddRange(docs) is never reached. In practice, a single transient timeout in a large repo can result in zero repository docs while the pipeline still reports success with only a warning. Handle per-file fetch failures inside IndexGitHubSource (and continue) so one bad request does not wipe the full source output.

Useful? React with 👍 / 👎.

Comment on lines +706 to +709
if (char.IsLetterOrDigit(ch))
builder.Append(char.ToLowerInvariant(ch));
else if (builder.Length == 0 || builder[^1] != '-')
builder.Append('-');
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Prevent duplicate IDs from lossy slug generation

MakeSafeFragment replaces every non-alphanumeric character with -, which makes distinct paths collapse to the same ID (for example foo_bar.md and foo-bar.md under the same source). Because MakeStableId relies on this slug directly, docs.json/search.json can emit duplicate id values, which can overwrite entries in downstream maps keyed by id. Add a collision-resistant suffix (e.g., hash of the original path) or a reversible encoding to keep IDs unique.

Useful? React with 👍 / 👎.

Comment on lines +289 to +292
if (!json.RootElement.TryGetProperty("tree", out var tree) || tree.ValueKind != JsonValueKind.Array)
return ExplicitIncludePaths(include);

return tree.EnumerateArray()
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Detect truncated GitHub tree responses before indexing

The GitHub tree API can return truncated: true for recursive listings, which means the tree array is incomplete; this implementation never checks that flag and treats the partial list as complete. On large repositories, portal-docs-index will silently miss documents and produce incomplete docs.json/search.json without any warning. Add explicit handling for truncated (warn/fail, or fall back to non-recursive subtree traversal) before building the path list.

Useful? React with 👍 / 👎.

Comment on lines +210 to +211
if (string.IsNullOrWhiteSpace(doc.Kind))
doc.Kind = Classify(asset.Path, classify);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Apply fallback classification when package kind is missing

This guard is unreachable because CreateDocument already normalizes empty kind values to "docs", so string.IsNullOrWhiteSpace(doc.Kind) is never true. As a result, package assets with missing/empty kind are never classified by path (README, CHANGELOG, LICENSE, custom classify rules) and are mislabeled as generic docs. Check asset.Kind (or classify before CreateDocument) so fallback classification actually runs.

Useful? React with 👍 / 👎.

Comment on lines +58 to +62
case "portal-docs-index":
case "portal-docs":
case "company-portal-docs":
ExecutePortalDocsIndex(step, baseDir, stepResult);
break;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Register portal-docs outputs in pipeline cache tracking

This commit wires portal-docs-index into task execution, but cache output inference has no matching case, so expectedOutputs is empty for this task. With pipeline caching enabled, a subsequent run can report a cache hit even when data/portal/docs.json or search.json were deleted, because output existence is never checked. Add portal-docs-index/aliases to GetExpectedStepOutputs so cache hits require the generated files to still exist.

Useful? React with 👍 / 👎.

return null;
}

return JsonSerializer.Deserialize<PrivateGalleryDocument>(File.ReadAllText(galleryPath), WebJson.Options);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Avoid failing all indexing on malformed gallery JSON

LoadPrivateGallery deserializes the gallery file before per-source processing, and a malformed feed.json throws JsonException that aborts the entire task, including unrelated local/github sources. This is inconsistent with the existing soft-fail behavior for a missing gallery file (warning + continue) and makes portal indexing brittle when package data is optional. Catch parse failures here and downgrade to a warning so non-package sources can still be emitted.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 912f1e9988

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +58 to +61
case "portal-docs-index":
case "portal-docs":
case "company-portal-docs":
ExecutePortalDocsIndex(step, baseDir, stepResult);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Track private gallery feed in portal-docs cache fingerprint

Adding portal-docs-index to task execution makes it cacheable, but this task depends on privateGallery/private-gallery/privateGalleryFeed input files that are not part of fingerprint path tracking (FingerprintPathKeys only covers sources/source here). With pipeline cache enabled, package-backed docs can remain stale after feed.json changes because the step fingerprint does not change and the task is skipped. Please include the private-gallery input aliases in fingerprint path tracking for this task.

Useful? React with 👍 / 👎.

Comment on lines +52 to +58
var docs = source.Kind?.Trim().ToLowerInvariant() switch
{
"local" => IndexLocalSource(sourceSpec.Defaults, source, normalizedSource, baseDir, options, document.Warnings),
"package" => IndexPackageSource(sourceSpec.Defaults, source, normalizedSource, gallery, document.Warnings),
"github" => await IndexGitHubSource(sourceSpec.Defaults, source, normalizedSource, http, options, document.Warnings).ConfigureAwait(false),
"azure-devops" or "azuredevops" => await IndexAzureDevOpsSource(sourceSpec.Defaults, source, normalizedSource, http, options, token, document.Warnings).ConfigureAwait(false),
_ => UnsupportedSource(source, normalizedSource, document.Warnings)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Default missing source kind to local before dispatch

Source dispatch uses source.Kind?.Trim().ToLowerInvariant() directly, so a source entry without kind falls into the unsupported branch and emits no documents. This conflicts with the model normalization that already defaults source kind to "local", causing silently skipped local sources when kind is omitted. Apply the same default before the switch (or switch on normalized kind) so omitted kind behaves consistently.

Useful? React with 👍 / 👎.

Comment on lines +552 to +553
if (!string.IsNullOrWhiteSpace(token))
client.DefaultRequestHeaders.Authorization = new AuthenticationHeaderValue("Bearer", token);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Avoid sending one token to every remote source host

CreateHttpClient sets DefaultRequestHeaders.Authorization once for the shared client, and that client is reused across GitHub and Azure DevOps fetches. In mixed-source configs, an Azure PAT (or other non-GitHub token) will be sent to GitHub endpoints, which is unnecessary credential disclosure to another service. Set authorization per request/source instead of as a global default header.

Useful? React with 👍 / 👎.

Comment on lines +97 to +101
if (bytes.Length > maxContentBytes)
{
AddSourceWarning(source, warnings, $"Azure DevOps content request '{url}' returned {bytes.Length} bytes and was truncated to {maxContentBytes} bytes.");
bytes = bytes.Take(maxContentBytes).ToArray();
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Parse Azure DevOps JSON before truncating response bytes

For Azure DevOps content fetches, the code truncates the raw byte payload first and then parses JSON when the media type is JSON. If a valid JSON response exceeds maxContentBytes, truncation can make the JSON invalid and throw JsonException; this is caught at the source level and drops all docs from that source. Parse full JSON first (or stream parse) and truncate only the extracted content string.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4104e904ea

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +356 to +357
? MakeSafeFragment($"{source.Kind}-{source.Title}-{source.Module}-{source.Path}-{source.Repo}")
: source.Id!;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Include Azure DevOps repository fields in fallback source IDs

When id is omitted, fallback IDs are built from kind/title/module/path/repo, but Azure DevOps sources commonly use repository (plus organization/project) instead of repo. That means multiple Azure DevOps entries can collapse to the same generated source ID (often just azure-devops), and then MakeStableId(sourceId, path) can produce duplicate document IDs for shared paths like README.md, causing ambiguous or overwritten entries in downstream ID-keyed consumers.

Useful? React with 👍 / 👎.

"projects", "projectFiles", "project-files",
"module", "modules", "moduleFiles", "module-files",
"out", "output", "source", "destination", "dest",
"privateGallery", "private-gallery", "privateGalleryFeed", "private-gallery-feed", "gallery",
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Track sourcesFile aliases in portal-docs cache fingerprints

portal-docs-index accepts sourcesFile/sources-file, but those keys are not included in FingerprintPathKeys. In cached pipeline runs, edits to a sources file referenced via these aliases will not change the step fingerprint, so the step can be treated as a cache hit and leave stale docs.json/search.json in place.

Useful? React with 👍 / 👎.

Comment on lines +628 to +631
return extension.Equals(".md", StringComparison.OrdinalIgnoreCase) ||
extension.Equals(".mdx", StringComparison.OrdinalIgnoreCase) ||
extension.Equals(".txt", StringComparison.OrdinalIgnoreCase) ||
string.IsNullOrWhiteSpace(extension) && Path.GetFileName(path).Equals("LICENSE", StringComparison.OrdinalIgnoreCase);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Accept extensionless README/CHANGELOG as text documents

IsTextDocument only allows extensionless LICENSE, but the classifier and common repo layouts also rely on extensionless README/CHANGELOG. Because all source indexers filter through this function, repositories that store docs as README or CHANGELOG (without .md) are silently excluded even when explicitly included by pattern, resulting in missing core docs in docs.json and search output.

Useful? React with 👍 / 👎.

@PrzemyslawKlys PrzemyslawKlys merged commit a129404 into main May 24, 2026
6 checks passed
@PrzemyslawKlys PrzemyslawKlys deleted the codex/portal-docs-index branch May 24, 2026 10:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant