feat: adds automation to create PR to add popular sources#20
Conversation
Signed-off-by: Amit Singh <singhamitch@outlook.com>
Signed-off-by: Amit Singh <singhamitch@outlook.com>
Signed-off-by: Amit Singh <singhamitch@outlook.com>
Signed-off-by: Amit Singh <singhamitch@outlook.com>
Signed-off-by: Amit Singh <singhamitch@outlook.com>
Signed-off-by: Amit Singh <singhamitch@outlook.com>
ⓘ You've reached your Qodo monthly free-tier limit. Reviews pause until next month — upgrade your plan to continue now, or link your paid account if you already have one. |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds an automated pipeline: Firecrawl scraper, OpenRouter LLM processing (extract/validate/generate YAML), Bash orchestration, scheduled GitHub Actions workflow that conditionally commits and opens PRs, plus dependency and .gitignore updates. ChangesSource Scraping Pipeline
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Poem
🚥 Pre-merge checks | ✅ 3 | ❌ 2❌ Failed checks (2 warnings)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Tip 💬 Introducing Slack Agent: The best way for teams to turn conversations into code.Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.
Built for teams:
One agent for your entire SDLC. Right inside Slack. Comment |
There was a problem hiding this comment.
Actionable comments posted: 7
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In @.github/workflows/fetch_new_sources.yml:
- Around line 19-31: The workflow is missing Python setup and dependency
installation which causes ModuleNotFoundError when running Python scripts;
update the job steps (e.g., after the "Checkout" and "Setup Git" steps and
before any Python script runs) to add an actions/setup-python@v4 step to install
the required Python version and a step that runs pip install -r requirements.txt
(optionally with pip cache) so dependencies like firecrawl and pyyaml are
available for subsequent steps such as the "Create branch from main" related
scripts.
- Around line 33-53: The workflow step that runs bash scripts/source_scraper.sh
is missing required environment variables (TOP_SOURCES_LIST, FIRECRAWL_API_KEY,
OPENROUTER_API_KEY) so the script fails with "env var not set"; update the "Run
source scraper (with retries)" step to provide these variables (either by adding
an env block for TOP_SOURCES_LIST, FIRECRAWL_API_KEY, OPENROUTER_API_KEY or
exporting them before invoking scripts/source_scraper.sh) so the script receives
them during each attempt and retries behave correctly.
In `@requirements.txt`:
- Around line 1-4: The requirements file uses an incorrect package name and
lacks pinned versions; change "firecrawl" to the correct PyPI name
"firecrawl-py" and add explicit version pins for all packages (e.g., pyyaml,
jsonschema, referencing, firecrawl-py) to ensure reproducible installs—update
the lines for "pyyaml", "jsonschema", "referencing", and the renamed
"firecrawl-py" to include the chosen version specifiers (== or ~=) consistent
with your compatibility policy.
In `@scripts/openrouter.py`:
- Around line 35-37: The urlopen calls in fetch_text and post_openrouter lack
timeouts and can hang indefinitely; update both functions to pass a reasonable
timeout (e.g., timeout=10) to urlopen and handle socket.timeout/URLError
exceptions accordingly so the workflow fails fast and logs or raises a clear
error; locate the urlopen usage in fetch_text and the corresponding call in
post_openrouter and add the timeout argument and appropriate exception handling
around the network call.
- Around line 287-292: The code uses parsed.get('name') and immediately calls
.strip() which will raise AttributeError if the YAML lacks a name; update the
logic around the filename assignment (the variable filename and the call to
re.sub) to first check for None/empty, e.g. obtain raw_name = parsed.get('name')
and if raw_name is falsy use a safe default (like "unnamed-{uuid4()}" or a
timestamp) before calling .strip() and re.sub, then append ".yaml"; ensure this
change is applied where filename is computed so downstream code receives a valid
string.
In `@scripts/scrape_firecrawl.py`:
- Around line 22-30: Update the misleading comment above the fc.scrape call:
change the text that says the SDK "returns a dict with a 'markdown' field" to
state that fc.scrape returns an object (or remove the type reference entirely),
since the code uses doc.markdown; ensure the comment references fc.scrape and
doc.markdown so it accurately reflects the SDK returns an object with a markdown
attribute.
In `@scripts/source_scraper.sh`:
- Around line 4-9: The script crashes under set -u because URL is assigned from
an unset TOP_SOURCES_LIST; change the assignment URL="$TOP_SOURCES_LIST" to use
the safe default expansion URL="${TOP_SOURCES_LIST:-}" so the script can run the
custom error check (the TOP_SOURCES_LIST variable is the unique symbol to
update), leaving the subsequent checks that already use the :- pattern
unchanged.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: ceee9ef7-f81f-4391-a31d-513f19b7d2a0
📒 Files selected for processing (6)
.github/workflows/fetch_new_sources.yml.gitignorerequirements.txtscripts/openrouter.pyscripts/scrape_firecrawl.pyscripts/source_scraper.sh
Signed-off-by: Amit Singh <singhamitch@outlook.com>
4e50063 to
d90143f
Compare
There was a problem hiding this comment.
4 issues found across 6 files
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="scripts/openrouter.py">
<violation number="1" location="scripts/openrouter.py:287">
P1: `parsed.get('name')` can return `None` if the YAML document lacks a `name` field, causing `filename.strip()` to raise `AttributeError`. Add a guard to skip documents without a valid `name`.</violation>
</file>
<file name="scripts/source_scraper.sh">
<violation number="1" location="scripts/source_scraper.sh:4">
P2: Handle `TOP_SOURCES_LIST` without a bare expansion here; under `set -u`, an unset variable exits before the explicit error message runs.</violation>
</file>
<file name=".github/workflows/fetch_new_sources.yml">
<violation number="1" location=".github/workflows/fetch_new_sources.yml:23">
P0: The workflow runs Python scripts that depend on packages from `requirements.txt` (firecrawl, pyyaml, etc.), but never sets up Python or installs dependencies. Add `actions/setup-python` and `pip install -r requirements.txt` steps before executing the scripts, otherwise they will fail with `ModuleNotFoundError`.</violation>
<violation number="2" location=".github/workflows/fetch_new_sources.yml:33">
P0: The `source_scraper.sh` script requires `TOP_SOURCES_LIST`, `FIRECRAWL_API_KEY`, and `OPENROUTER_API_KEY` environment variables, but none are set in this workflow step. The script will fail immediately with "env var not set" errors. Add an `env` block to pass these secrets.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
|
@semmet95 I have started the AI code review. It will take a few minutes to complete. |
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (3)
.github/workflows/fetch_new_sources.yml (2)
33-40:⚠️ Potential issue | 🔴 Critical | ⚡ Quick winPass required secrets/env vars into the scraper step.
source_scraper.shdepends onTOP_SOURCES_LIST,FIRECRAWL_API_KEY, andOPENROUTER_API_KEY; without them this step fails deterministically.🔧 Proposed fix
- name: Run source scraper (with retries) + env: + TOP_SOURCES_LIST: ${{ secrets.TOP_SOURCES_LIST }} + FIRECRAWL_API_KEY: ${{ secrets.FIRECRAWL_API_KEY }} + OPENROUTER_API_KEY: ${{ secrets.OPENROUTER_API_KEY }} run: | set -e attempt=1🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.github/workflows/fetch_new_sources.yml around lines 33 - 40, The workflow step "Run source scraper (with retries)" calls scripts/source_scraper.sh but doesn't pass the required environment variables; update that step to export or pass TOP_SOURCES_LIST, FIRECRAWL_API_KEY, and OPENROUTER_API_KEY (sourced from repository or organization secrets) into the step environment so the script can access them (e.g., using env: or by exporting them before invoking bash) and ensure secret values use GitHub Secrets to avoid plaintext exposure.
20-33:⚠️ Potential issue | 🔴 Critical | ⚡ Quick winSet up Python and install requirements before running scraper scripts.
The workflow invokes Python tooling but does not provision Python packages first, so runs can fail with missing-module errors.
🔧 Proposed fix
- name: Checkout uses: actions/checkout@v4 + - name: Set up Python + uses: actions/setup-python@v5 + with: + python-version: '3.11' + + - name: Install dependencies + run: pip install -r requirements.txt + - name: Setup Git run: | git config user.name "github-actions[bot]"🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.github/workflows/fetch_new_sources.yml around lines 20 - 33, The workflow is missing a Python setup and dependency install before running the scraper; add a step after "Setup Git" (and before "Run source scraper (with retries)") to use actions/setup-python to install the required Python version and then run pip install -r requirements.txt (or the project's install command) so the scraper has its modules available; ensure the new step is clearly named (e.g., "Set up Python and install requirements") and placed between the "Create branch from main" and "Run source scraper (with retries)" steps so dependencies are installed prior to executing the scraper.scripts/source_scraper.sh (1)
4-4:⚠️ Potential issue | 🟠 Major | ⚡ Quick winGuard
TOP_SOURCES_LISTbefore expansion under strict mode.On Line 4,
set -ucan abort before your intended validation flow ifTOP_SOURCES_LISTis unset. Use safe expansion plus an explicit check so the failure is deterministic and readable.🔧 Proposed fix
-URL="$TOP_SOURCES_LIST" +URL="${TOP_SOURCES_LIST:-}" + +if [[ -z "$URL" ]]; then + echo "Error: TOP_SOURCES_LIST is not set." >&2 + exit 1 +fi🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@scripts/source_scraper.sh` at line 4, The script uses strict mode (set -u) but expands TOP_SOURCES_LIST directly into URL which can cause an immediate unhelpful abort; change the assignment to use a safe expansion (e.g., default-empty or parameter expansion) and add an explicit check for TOP_SOURCES_LIST before using it so failures are deterministic and readable—update the line that sets URL and add a subsequent conditional that tests TOP_SOURCES_LIST (referencing the TOP_SOURCES_LIST variable and the URL assignment) and emits a clear error and exit if it is unset or empty.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In @.github/workflows/fetch_new_sources.yml:
- Around line 60-65: The workflow uses unquoted $GITHUB_OUTPUT in the echo
redirections (the echo "changed=false" >> $GITHUB_OUTPUT and echo "changed=true"
>> $GITHUB_OUTPUT lines), which risks word-splitting; update those redirections
to quote the variable (use >> "$GITHUB_OUTPUT") so the shell treats the path as
a single token and avoids SC2086 warnings.
---
Duplicate comments:
In @.github/workflows/fetch_new_sources.yml:
- Around line 33-40: The workflow step "Run source scraper (with retries)" calls
scripts/source_scraper.sh but doesn't pass the required environment variables;
update that step to export or pass TOP_SOURCES_LIST, FIRECRAWL_API_KEY, and
OPENROUTER_API_KEY (sourced from repository or organization secrets) into the
step environment so the script can access them (e.g., using env: or by exporting
them before invoking bash) and ensure secret values use GitHub Secrets to avoid
plaintext exposure.
- Around line 20-33: The workflow is missing a Python setup and dependency
install before running the scraper; add a step after "Setup Git" (and before
"Run source scraper (with retries)") to use actions/setup-python to install the
required Python version and then run pip install -r requirements.txt (or the
project's install command) so the scraper has its modules available; ensure the
new step is clearly named (e.g., "Set up Python and install requirements") and
placed between the "Create branch from main" and "Run source scraper (with
retries)" steps so dependencies are installed prior to executing the scraper.
In `@scripts/source_scraper.sh`:
- Line 4: The script uses strict mode (set -u) but expands TOP_SOURCES_LIST
directly into URL which can cause an immediate unhelpful abort; change the
assignment to use a safe expansion (e.g., default-empty or parameter expansion)
and add an explicit check for TOP_SOURCES_LIST before using it so failures are
deterministic and readable—update the line that sets URL and add a subsequent
conditional that tests TOP_SOURCES_LIST (referencing the TOP_SOURCES_LIST
variable and the URL assignment) and emits a clear error and exit if it is unset
or empty.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: 9a358d31-10ee-4745-aa50-1d4d1e3d0245
📒 Files selected for processing (2)
.github/workflows/fetch_new_sources.ymlscripts/source_scraper.sh
There was a problem hiding this comment.
1 issue found across 6 files
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name=".github/workflows/fetch_new_sources.yml">
<violation number="1" location=".github/workflows/fetch_new_sources.yml:16">
P2: Branch naming uses only `github.run_id`, so reruns reuse the same branch name and can fail when pushing updates. Include `github.run_attempt` (or another unique suffix) in the branch name.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
There was a problem hiding this comment.
1 issue found across 4 files (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="scripts/openrouter.py">
<violation number="1" location="scripts/openrouter.py:289">
P1: Don't reference `e` here; this branch runs outside the exception handler and will crash when `name` is missing.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
There was a problem hiding this comment.
Actionable comments posted: 4
♻️ Duplicate comments (2)
.github/workflows/fetch_new_sources.yml (1)
71-77:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winQuote
$GITHUB_OUTPUTin the redirections.These writes should use
>> "$GITHUB_OUTPUT"so the path is treated as a single token and the step stays shell-safe.🔧 Suggested change
- echo "changed=false" >> $GITHUB_OUTPUT + echo "changed=false" >> "$GITHUB_OUTPUT" ... - echo "changed=true" >> $GITHUB_OUTPUT + echo "changed=true" >> "$GITHUB_OUTPUT"🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.github/workflows/fetch_new_sources.yml around lines 71 - 77, The redirections currently append to an unquoted $GITHUB_OUTPUT which can break with spaces or special chars; update the redirect usages in the shell block that contains the git check/commit (the lines using git diff --cached --quiet and git commit -m "feat(sources): adds new most popular sources") to append to the output variable using >> "$GITHUB_OUTPUT" (replace all occurrences of >> $GITHUB_OUTPUT with >> "$GITHUB_OUTPUT") so the path is treated as a single token and the step remains shell-safe.scripts/openrouter.py (1)
190-197:⚠️ Potential issue | 🟠 Major | ⚡ Quick winValidate parsed YAML is a mapping before using
.get().Both code paths assume
yaml.safe_load(...)returns a dict. LLM output can still parse into a scalar/list/None, which makes.get(...)blow up. Also, the missing-name guard still referenceseon Line 289, so that fallback path raisesNameErrorinstead of skipping the bad doc.🛡️ Suggested change
for doc_str in source_docs: try: doc = yaml.safe_load(doc_str) except Exception as e: print(f"Warning: failed to parse source_doc; not adding it to the source list it. Error: {e}", file=sys.stderr) continue + if not isinstance(doc, dict): + print("Warning: source_doc is not a YAML mapping; skipping", file=sys.stderr) + continue duplicate = False for src in existing: if doc.get('name') == src.get('name') or doc.get('uri') == src.get('uri'): duplicate = True break ... for doc_str in unique_src_docs: try: parsed = yaml.safe_load(doc_str) except Exception as e: print(f"Warning: failed to parse src_doc : {doc_str}: {e}", file=sys.stderr) continue + if not isinstance(parsed, dict): + print(f"Warning: parsed source doc is not a mapping, skipping: {doc_str}", file=sys.stderr) + continue filename = parsed.get('name') - if filename == None: - print(f"Warning: failed to extract name field from : {doc_str}: {e}", file=sys.stderr) + if not filename: + print(f"Warning: failed to extract name field from: {doc_str}", file=sys.stderr) continueCan PyYAML's yaml.safe_load return a scalar, list, or None instead of a dict depending on the YAML document content?Also applies to: 281-293
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@scripts/openrouter.py` around lines 190 - 197, The YAML loader may return non-mapping types, so before using doc.get(...) validate that the parsed doc is a dict/mapping (e.g., isinstance(doc, dict)) and skip/log a warning and continue if it's None, a scalar, or a list; update the duplicate/name checks that reference doc.get('name') or doc.get('uri') to only run when doc is a mapping. Also fix the missing-name guard that incorrectly references the exception variable e (causing NameError) — replace that reference with a safe message or use the parsed doc/state instead so the fallback path cleanly skips bad docs (adjust logic around doc_str/doc and existing to avoid referencing e).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In @.github/workflows/fetch_new_sources.yml:
- Around line 15-20: The workflow uses a repo-specific secret GH_TOKEN which may
be absent; update the env entry that sets GH_TOKEN to use the built-in Actions
token instead (replace the GH_TOKEN value ${ { secrets.GH_TOKEN } } with the
default token expression such as ${{ secrets.GITHUB_TOKEN }} or ${{ github.token
}}), so steps that run gh pr create can use the provided workflow token;
specifically modify the env block where BRANCH and GH_TOKEN are defined to set
GH_TOKEN to the built-in token.
In `@scripts/openrouter.py`:
- Around line 145-155: The current selection of latest = max(files, key=lambda
p: os.path.getmtime(p)) is unreliable in CI because mtime reflects checkout
time; replace this logic to choose a deterministic or VCS-derived "latest"
instead: either query git for the most recently added/committed source under
sources_dir (e.g., using git log/git ls-files to pick the file with the newest
commit that added or modified it) and set latest to that path, or pick a
deterministic fixed sample (e.g., sorted(files)[0] or a specific filename) so
behavior is stable in CI; update any error handling around entries/files/latest
to report failures if git lookup or deterministic selection returns nothing.
- Line 26: MD_PROCESSING_SKILL_URL currently points to the mutable
refs/heads/main raw URL; replace it with an immutable reference (either the raw
URL pinned to a specific commit SHA for the upstream file or vendored into this
repo and point MD_PROCESSING_SKILL_URL to the local path). Update the constant
MD_PROCESSING_SKILL_URL in scripts/openrouter.py to use the chosen fixed
location and ensure any tests/consumers still read from that constant.
- Around line 54-64: The code in post_openrouter currently only catches
HTTPError which misses transport errors (URLError) like timeouts or DNS failures
and causes the run to abort; update the exception handling around urlopen (the
try/except block that reads status/body) to also catch urllib.error.URLError (or
a combined except HTTPError, URLError as e) and set status and body similar to
the HTTPError path (e.g., set status to a sentinel or None and body to the
readable message from e or empty string) so the caller can continue fallback to
the next model; locate the try block that opens urlopen(req, timeout=10) and
modify its except clauses to handle URLError alongside HTTPError.
---
Duplicate comments:
In @.github/workflows/fetch_new_sources.yml:
- Around line 71-77: The redirections currently append to an unquoted
$GITHUB_OUTPUT which can break with spaces or special chars; update the redirect
usages in the shell block that contains the git check/commit (the lines using
git diff --cached --quiet and git commit -m "feat(sources): adds new most
popular sources") to append to the output variable using >> "$GITHUB_OUTPUT"
(replace all occurrences of >> $GITHUB_OUTPUT with >> "$GITHUB_OUTPUT") so the
path is treated as a single token and the step remains shell-safe.
In `@scripts/openrouter.py`:
- Around line 190-197: The YAML loader may return non-mapping types, so before
using doc.get(...) validate that the parsed doc is a dict/mapping (e.g.,
isinstance(doc, dict)) and skip/log a warning and continue if it's None, a
scalar, or a list; update the duplicate/name checks that reference
doc.get('name') or doc.get('uri') to only run when doc is a mapping. Also fix
the missing-name guard that incorrectly references the exception variable e
(causing NameError) — replace that reference with a safe message or use the
parsed doc/state instead so the fallback path cleanly skips bad docs (adjust
logic around doc_str/doc and existing to avoid referencing e).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: 44ba98e9-2b0a-4fe4-8f92-681932d269e0
📒 Files selected for processing (4)
.github/workflows/fetch_new_sources.ymlrequirements.txtscripts/openrouter.pyscripts/scrape_firecrawl.py
✅ Files skipped from review due to trivial changes (1)
- requirements.txt
a812e98 to
6b7f3d9
Compare
|
@semmet95 I have started the AI code review. It will take a few minutes to complete. |
There was a problem hiding this comment.
Actionable comments posted: 3
♻️ Duplicate comments (4)
scripts/openrouter.py (3)
54-62:⚠️ Potential issue | 🔴 Critical | ⚡ Quick winInitialize
statuson the failure path.If
urlopenraises here, Line 62 returns an undefinedstatus, so the script dies withUnboundLocalErrorbefore it can fall back to the next model.🩹 Suggested change
def post_openrouter(api_key: str, payload: dict) -> Tuple[int, str]: data = json.dumps(payload).encode("utf-8") + status = 0 + body = "" req = Request( "https://openrouter.ai/api/v1/chat/completions", data=data, @@ try: with urlopen(req, timeout=60) as r: status = r.getcode() body = r.read().decode("utf-8") except Exception as e: print(f"Error making request to OpenRouter API: {e}", file=sys.stderr) - body = "" return status, body🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@scripts/openrouter.py` around lines 54 - 62, The except block after the urlopen call leaves `status` undefined which causes an UnboundLocalError on return; in the except branch inside the function that uses urlopen(req, timeout=60) set `status` to a safe sentinel (e.g., None or -1) before returning along with `body` (keep the existing error print), so change the except handler to assign status = None (or -1) and body = "" so the final `return status, body` always has both variables defined.
26-26:⚠️ Potential issue | 🟠 Major | ⚡ Quick winPin
MD_PROCESSING_SKILL_URLto an immutable revision.Using
refs/heads/mainmeans upstream prompt edits can silently change scheduled output without any PR in this repo.🔒 Suggested change
-MD_PROCESSING_SKILL_URL = "https://raw.githubusercontent.com/semmet95/agent-skills/refs/heads/main/md-processing/SKILL.md" +MD_PROCESSING_SKILL_URL = "https://raw.githubusercontent.com/semmet95/agent-skills/<pinned-commit-sha>/md-processing/SKILL.md"🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@scripts/openrouter.py` at line 26, MD_PROCESSING_SKILL_URL currently points to a mutable branch ref ("refs/heads/main") which allows upstream changes to silently alter behavior; update MD_PROCESSING_SKILL_URL to use an immutable commit or tag reference (SHA or annotated tag) instead, e.g., replace the "refs/heads/main" segment with the specific commit SHA or a release tag so the URL always fetches the exact same SKILL.md; locate the constant MD_PROCESSING_SKILL_URL in scripts/openrouter.py and change its value to the pinned immutable revision.
141-152:⚠️ Potential issue | 🟠 Major | ⚡ Quick win
mtimeis not a stable “latest source” signal in CI.On a fresh GitHub Actions checkout, file mtimes reflect checkout timing, not when a source was added. This makes the schema sample effectively arbitrary between runs.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@scripts/openrouter.py` around lines 141 - 152, The current selection of latest source uses filesystem mtime (latest = max(files, key=lambda p: os.path.getmtime(p))) which is unstable in CI; replace this with a Git-commit-based timestamp comparison. Add a helper (e.g., get_git_commit_time(path)) that runs git log --format=%ct -1 -- <path> via subprocess and returns an integer timestamp (fall back to os.path.getmtime if git fails), then compute latest = max(files, key=get_git_commit_time) so selection is deterministic across checkouts; update the code paths that build files and the latest assignment to use this helper..github/workflows/fetch_new_sources.yml (1)
15-19:⚠️ Potential issue | 🔴 Critical | ⚡ Quick winExport a token for
gh pr create.This job has the right permissions, but the GitHub CLI will not use them unless
GH_TOKENorGITHUB_TOKENis present in the environment. The PR step will fail on the first run that actually has changes.🔑 Suggested change
env: BRANCH: src-auto-add-${{ github.run_id }} + GH_TOKEN: ${{ github.token }} TOP_SOURCES_LIST: ${{ vars.TOP_SOURCES_LIST }} FIRECRAWL_API_KEY: ${{ secrets.FIRECRAWL_API_KEY }} OPENROUTER_API_KEY: ${{ secrets.OPENROUTER_API_KEY }}🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.github/workflows/fetch_new_sources.yml around lines 15 - 19, The job sets env vars (BRANCH, TOP_SOURCES_LIST, FIRECRAWL_API_KEY, OPENROUTER_API_KEY) but does not export a GitHub token for the GitHub CLI; add GITHUB_TOKEN (or GH_TOKEN) to the environment so `gh pr create` can authenticate (e.g., set GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} in the env block) and ensure the job will succeed on first run that creates a PR.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In @.github/workflows/fetch_new_sources.yml:
- Line 16: The branch name currently set in the BRANCH env ("src-auto-add-${{
github.run_id }}") can collide on workflow reruns; update the BRANCH value to
include the run attempt (e.g., append `${{ github.run_attempt }}`) so the BRANCH
variable (BRANCH: src-auto-add-...) becomes unique per attempt and avoids
non-fast-forward push failures or PR conflicts.
In `@scripts/openrouter.py`:
- Around line 187-194: The code calls doc.get(...) and src.get(...) but
yaml.safe_load(doc_str) can return None or non-mapping types; update the loop in
scripts/openrouter.py to first check that doc is a mapping (e.g.,
isinstance(doc, dict)) and skip with a warning if not, and likewise ensure each
existing item (src) is a mapping before calling src.get() to avoid
AttributeError; keep the duplicate detection logic the same once both doc and
src are guaranteed dicts.
- Around line 284-292: The sanitized filename may be empty (e.g., when
parsed.get('name') is only punctuation/whitespace), so after performing the
re.sub on parsed.get('name') and before appending ".yaml" you must validate the
resulting basename (the variable filename) is non-empty and safe; if it is
empty, either skip the document or replace it with a default safe basename
(e.g., "untitled-<unique>" or derived from another field) and log a warning to
stderr. Update the code that sets filename (uses parsed.get('name'),
re.sub(...), and then filename = filename + ".yaml") to perform this check and
take one of the safe actions (skip/write with default) so you never write to
"sources/.yaml".
---
Duplicate comments:
In @.github/workflows/fetch_new_sources.yml:
- Around line 15-19: The job sets env vars (BRANCH, TOP_SOURCES_LIST,
FIRECRAWL_API_KEY, OPENROUTER_API_KEY) but does not export a GitHub token for
the GitHub CLI; add GITHUB_TOKEN (or GH_TOKEN) to the environment so `gh pr
create` can authenticate (e.g., set GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} in
the env block) and ensure the job will succeed on first run that creates a PR.
In `@scripts/openrouter.py`:
- Around line 54-62: The except block after the urlopen call leaves `status`
undefined which causes an UnboundLocalError on return; in the except branch
inside the function that uses urlopen(req, timeout=60) set `status` to a safe
sentinel (e.g., None or -1) before returning along with `body` (keep the
existing error print), so change the except handler to assign status = None (or
-1) and body = "" so the final `return status, body` always has both variables
defined.
- Line 26: MD_PROCESSING_SKILL_URL currently points to a mutable branch ref
("refs/heads/main") which allows upstream changes to silently alter behavior;
update MD_PROCESSING_SKILL_URL to use an immutable commit or tag reference (SHA
or annotated tag) instead, e.g., replace the "refs/heads/main" segment with the
specific commit SHA or a release tag so the URL always fetches the exact same
SKILL.md; locate the constant MD_PROCESSING_SKILL_URL in scripts/openrouter.py
and change its value to the pinned immutable revision.
- Around line 141-152: The current selection of latest source uses filesystem
mtime (latest = max(files, key=lambda p: os.path.getmtime(p))) which is unstable
in CI; replace this with a Git-commit-based timestamp comparison. Add a helper
(e.g., get_git_commit_time(path)) that runs git log --format=%ct -1 -- <path>
via subprocess and returns an integer timestamp (fall back to os.path.getmtime
if git fails), then compute latest = max(files, key=get_git_commit_time) so
selection is deterministic across checkouts; update the code paths that build
files and the latest assignment to use this helper.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: 08bd737e-8b65-4167-a277-f99e1427603b
📒 Files selected for processing (4)
.github/workflows/fetch_new_sources.ymlrequirements.txtscripts/openrouter.pyscripts/scrape_firecrawl.py
✅ Files skipped from review due to trivial changes (1)
- requirements.txt
There was a problem hiding this comment.
4 issues found across 6 files
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name=".github/workflows/fetch_new_sources.yml">
<violation number="1" location=".github/workflows/fetch_new_sources.yml:82">
P2: Authenticate `gh pr create` with `GH_TOKEN`; otherwise PR creation can fail in CI due to missing GitHub CLI auth.</violation>
</file>
<file name="scripts/openrouter.py">
<violation number="1" location="scripts/openrouter.py:60">
P1: `status` can be undefined on request errors, causing `UnboundLocalError` when returning from `post_openrouter`.</violation>
<violation number="2" location="scripts/openrouter.py:290">
P1: If the `name` field is all whitespace or punctuation, `re.sub` sanitizes it to an empty string, resulting in writing to `sources/.yaml`. Add a guard after sanitization to skip empty basenames. Also, use `is None` instead of `== None`.</violation>
</file>
<file name="scripts/source_scraper.sh">
<violation number="1" location="scripts/source_scraper.sh:4">
P2: Validate `TOP_SOURCES_LIST` before using it. An empty repo variable currently becomes a blank URL and causes a downstream scrape failure.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
Signed-off-by: Amit Singh <singhamitch@outlook.com>
6b7f3d9 to
e73294b
Compare
|
@semmet95 I have started the AI code review. It will take a few minutes to complete. |
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (3)
scripts/openrouter.py (3)
26-26:⚠️ Potential issue | 🟠 Major | ⚡ Quick winPin
MD_PROCESSING_SKILL_URLto an immutable revision.Fetching the prompt from
refs/heads/mainmakes scheduled runs non-deterministic and lets upstream changes alter generation behavior without any PR in this repo. Use a commit SHA or vendor the prompt locally.Suggested change
-MD_PROCESSING_SKILL_URL = "https://raw.githubusercontent.com/semmet95/agent-skills/refs/heads/main/md-processing/SKILL.md" +MD_PROCESSING_SKILL_URL = "https://raw.githubusercontent.com/semmet95/agent-skills/<pinned-commit-sha>/md-processing/SKILL.md"🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@scripts/openrouter.py` at line 26, MD_PROCESSING_SKILL_URL currently points to refs/heads/main making prompt fetches non-deterministic; change it to an immutable reference by replacing the branch URL with a commit SHA URL (use the raw.githubusercontent.com URL with the specific commit hash for the SKILL.md) or vendor the SKILL.md into the repo and update MD_PROCESSING_SKILL_URL to a local relative path; update any code that fetches this constant (scripts/openrouter.py -> MD_PROCESSING_SKILL_URL) and add a short comment explaining why we pin the revision.
188-198:⚠️ Potential issue | 🟠 Major | ⚡ Quick winGuard non-mapping YAML before calling
.get().
yaml.safe_load(doc_str)can return a list or scalar. In that case, Line 198 raisesAttributeErrorondoc.get(...)and aborts the run instead of skipping malformed model output.Suggested change
for doc_str in source_docs: try: doc = yaml.safe_load(doc_str) except Exception as e: print(f"Warning: failed to parse source_doc; not adding it to the source list it. Error: {e}", file=sys.stderr) continue - if doc == None: - print(f"Warning: skipping the following yaml string as it failed to load: {doc_str}", file=sys.stderr) + if not isinstance(doc, dict): + print(f"Warning: generated source doc is not a YAML mapping; skipping: {doc_str}", file=sys.stderr) continue duplicate = False for src in existing: if doc.get('name') == src.get('name') or doc.get('uri') == src.get('uri'):🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@scripts/openrouter.py` around lines 188 - 198, The YAML loader can return non-mapping types so calling doc.get(...) can raise; update the block after yaml.safe_load to verify the parsed doc is a mapping (e.g., isinstance(doc, dict)) and if not, warn to stderr and continue; also guard the loop over existing by ensuring each src is a mapping before calling src.get(...) (skip or coerce non-mappings) so the duplicate check in the for src in existing loop (the doc.get(...) and src.get(...) calls) never invokes .get on non-dict objects.
138-153:⚠️ Potential issue | 🟠 Major | ⚡ Quick winUse a deterministic schema sample instead of checkout
mtime.On GitHub Actions checkouts, file mtimes reflect checkout timing rather than when a source was added or last updated in git, so
max(files, key=os.path.getmtime)can pick an arbitrary sample file. That makes the schema prompt unstable across runs.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@scripts/openrouter.py` around lines 138 - 153, Replace the filesystem mtime-based selection in get_latest_source with a git-commit-time-based selection so results are deterministic in CI: for each candidate file in files (the list built from sources_dir), call git to get the last commit timestamp for that path (e.g. via subprocess.run(["git","log","-1","--format=%ct","--", path], ...)), parse the epoch seconds, and use that timestamp as the key to select latest instead of os.path.getmtime; update error handling around the subprocess calls and ensure the variable latest is assigned from this git-timestamp-based max selection (still using sources_dir, files and latest identifiers).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@scripts/openrouter.py`:
- Around line 196-202: The current dedupe loop only compares each new document
(doc) against on-disk files (existing), so duplicate docs generated in the same
run can slip through; update the logic in the dedup block (where variables
duplicate, existing, cleaned, doc, doc_str are used) to compare against both
existing and already-collected cleaned entries (e.g., iterate over existing +
cleaned or check cleaned entries by parsing their names/uris) and skip adding
doc_str if any entry in either set has the same doc.get('name') or
doc.get('uri'), preserving current behavior for disk dedupe while preventing
intra-run duplicates.
---
Duplicate comments:
In `@scripts/openrouter.py`:
- Line 26: MD_PROCESSING_SKILL_URL currently points to refs/heads/main making
prompt fetches non-deterministic; change it to an immutable reference by
replacing the branch URL with a commit SHA URL (use the
raw.githubusercontent.com URL with the specific commit hash for the SKILL.md) or
vendor the SKILL.md into the repo and update MD_PROCESSING_SKILL_URL to a local
relative path; update any code that fetches this constant (scripts/openrouter.py
-> MD_PROCESSING_SKILL_URL) and add a short comment explaining why we pin the
revision.
- Around line 188-198: The YAML loader can return non-mapping types so calling
doc.get(...) can raise; update the block after yaml.safe_load to verify the
parsed doc is a mapping (e.g., isinstance(doc, dict)) and if not, warn to stderr
and continue; also guard the loop over existing by ensuring each src is a
mapping before calling src.get(...) (skip or coerce non-mappings) so the
duplicate check in the for src in existing loop (the doc.get(...) and
src.get(...) calls) never invokes .get on non-dict objects.
- Around line 138-153: Replace the filesystem mtime-based selection in
get_latest_source with a git-commit-time-based selection so results are
deterministic in CI: for each candidate file in files (the list built from
sources_dir), call git to get the last commit timestamp for that path (e.g. via
subprocess.run(["git","log","-1","--format=%ct","--", path], ...)), parse the
epoch seconds, and use that timestamp as the key to select latest instead of
os.path.getmtime; update error handling around the subprocess calls and ensure
the variable latest is assigned from this git-timestamp-based max selection
(still using sources_dir, files and latest identifiers).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: 8766f742-99ae-4ebb-a7f7-3d318d2cd19c
📒 Files selected for processing (4)
.github/workflows/fetch_new_sources.ymlrequirements.txtscripts/openrouter.pyscripts/scrape_firecrawl.py
🚧 Files skipped from review as they are similar to previous changes (2)
- requirements.txt
- .github/workflows/fetch_new_sources.yml
Fixes: #19
Summary by CodeRabbit
New Features
Chores