feat(seo): add bot detection for dynamic og:image tags #3171
MarkusNeusinger merged 4 commits into main
Conversation
- Add nginx bot detection map for social media crawlers
(Twitter, Facebook, LinkedIn, Slack, Telegram, WhatsApp,
Google, Bing, Discord, Pinterest, Apple)
- Add SEO proxy endpoints for bot-optimized HTML with og:tags:
- /seo-proxy/ - home page
- /seo-proxy/catalog - catalog page
- /seo-proxy/{spec_id} - spec overview (default og:image)
- /seo-proxy/{spec_id}/{library} - implementation (dynamic preview_url)
- Use error_page 418 trick for safe nginx conditional proxying
- Add comprehensive unit tests for all SEO proxy endpoints
This ensures social media bots (which don't execute JavaScript)
receive proper meta tags with dynamic og:image URLs instead of
the static default og-image.png.
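The "error_page 418 trick" mentioned above works around nginx's restriction that `proxy_pass` with a URI part cannot be used inside an `if` block: bot requests raise a synthetic 418 status that `error_page` remaps to a named location. A minimal sketch of the pattern (the map regex is abbreviated for illustration; it is not the PR's exact `nginx.conf`):

```nginx
# Sketch of the error_page 418 trick (abbreviated, not the PR's exact config).
# proxy_pass with a URI is not allowed inside "if", so bot requests
# return a synthetic 418 that error_page remaps to a named location.
map $http_user_agent $is_seo_bot {
    default 0;
    ~*(twitterbot|facebookexternalhit|linkedinbot|slackbot) 1;
}

server {
    location / {
        error_page 418 = @seo_proxy;
        if ($is_seo_bot) {
            return 418;   # jump to @seo_proxy via error_page
        }
        try_files $uri $uri/ /index.html;   # normal SPA routing
    }

    location @seo_proxy {
        proxy_pass https://api.pyplots.ai/seo-proxy$request_uri;
        proxy_set_header Host api.pyplots.ai;
        proxy_ssl_server_name on;
    }
}
```

Regular browsers never hit the named location, so the SPA path is unchanged.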
🤖 Generated with [Claude Code](https://claude.ai/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
```python
BOT_HTML_TEMPLATE.format(
    title=f"{spec_id} | pyplots.ai",
    description=DEFAULT_DESCRIPTION,
    image=DEFAULT_IMAGE,
    url=f"https://pyplots.ai/{html.escape(spec_id)}",
)
```
**Check warning — Code scanning / CodeQL:** Reflected server-side cross-site scripting (Medium)

**Copilot Autofix** (AI, 4 months ago)
In general, to fix reflected server-side XSS, every user-controlled value inserted into an HTML document must be properly escaped for the context in which it appears (HTML body, attribute, URL, etc.). In this file, all uses of spec_id in HTML contexts should be consistently passed through html.escape, just as is already done for url in this same branch and for title/description when the DB is available.
The single best minimal fix is to escape spec_id when it is interpolated into the title for the DB-unavailable fallback in seo_spec_overview. Specifically, change line 124 from title=f"{spec_id} | pyplots.ai", to title=f"{html.escape(spec_id)} | pyplots.ai",. This mirrors the escaping already used for the url field in the same response and for the title field later in the function when spec is loaded from the database. No new imports are needed because html is already imported at the top of api/routers/seo.py. No other behavioral changes are introduced; only the unsafe direct inclusion of the raw path parameter into HTML is corrected.
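The behavior `html.escape` provides here can be checked in isolation with only the standard library (the payload below is a made-up example, not from the PR):

```python
import html

# html.escape encodes the characters that could break out of an HTML
# text node (<, >, &); with quote=True (the default) it also encodes
# quotes, making the result safe inside attribute values too.
payload = '<script>alert(1)</script>'
escaped = html.escape(payload)
print(escaped)  # &lt;script&gt;alert(1)&lt;/script&gt;

# The fix above applies exactly this to the title f-string:
title = f"{html.escape(payload)} | pyplots.ai"
assert "<script>" not in title
```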
```diff
@@ -121,7 +121,7 @@
         # Fallback when DB unavailable
         return HTMLResponse(
             BOT_HTML_TEMPLATE.format(
-                title=f"{spec_id} | pyplots.ai",
+                title=f"{html.escape(spec_id)} | pyplots.ai",
                 description=DEFAULT_DESCRIPTION,
                 image=DEFAULT_IMAGE,
                 url=f"https://pyplots.ai/{html.escape(spec_id)}",
```
```python
BOT_HTML_TEMPLATE.format(
    title=f"{spec_id} - {library} | pyplots.ai",
    description=DEFAULT_DESCRIPTION,
    image=DEFAULT_IMAGE,
    url=f"https://pyplots.ai/{html.escape(spec_id)}/{html.escape(library)}",
)
```
**Check warning — Code scanning / CodeQL:** Reflected server-side cross-site scripting (Medium)

**Copilot Autofix** (AI, 4 months ago)
In general, to fix reflected server-side XSS in this endpoint, all user-controlled values (spec_id, library) must be HTML-escaped before being interpolated into BOT_HTML_TEMPLATE, not only when used in URLs but also when used in text nodes like the <title> element. The Python standard library’s html.escape() is already imported and used for some fields; we should extend its use to every occurrence where raw user input is inserted into the template.
Concretely, in seo_spec_implementation’s DB-unavailable fallback (lines ~155–161), the title field currently embeds spec_id and library without escaping. We should wrap these in html.escape(), as is already done for the url field. This preserves existing functionality (the same values are displayed) but ensures any <, >, &, quotes, etc. are encoded and cannot break out of the HTML context. No new imports or helpers are needed; we only adjust the f-string expressions in that block. The rest of the function already escapes user-derived values where necessary.
```diff
@@ -154,7 +154,7 @@
         # Fallback when DB unavailable
         return HTMLResponse(
             BOT_HTML_TEMPLATE.format(
-                title=f"{spec_id} - {library} | pyplots.ai",
+                title=f"{html.escape(spec_id)} - {html.escape(library)} | pyplots.ai",
                 description=DEFAULT_DESCRIPTION,
                 image=DEFAULT_IMAGE,
                 url=f"https://pyplots.ai/{html.escape(spec_id)}/{html.escape(library)}",
```
Codecov Report: ✅ All modified and coverable lines are covered by tests.
Pull request overview
This PR adds bot detection and SEO proxy endpoints to serve dynamic og:image tags for social media crawlers. The solution uses nginx User-Agent detection to route bot traffic to backend endpoints that return pre-rendered HTML with correct meta tags, while regular browsers continue to receive the client-side rendered React app with zero performance impact.
Key Changes
- nginx bot detection for 11 social media crawlers (Twitter, Facebook, LinkedIn, Slack, Telegram, WhatsApp, Google, Bing, Discord, Pinterest, Apple)
- Four new SEO proxy endpoints that return HTML with dynamic og:tags based on database content
- Dynamic `og:image` from database `preview_url` for implementation pages
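The nginx User-Agent map is the PR's actual detection mechanism; as an illustration, the same matching logic can be sketched in Python using the crawler names listed in the PR description (the exact regex tokens are assumptions, not the PR's nginx map):

```python
import re

# Illustrative equivalent of the nginx bot-detection map. The real
# pattern lives in app/nginx.conf; the tokens below follow the list of
# crawlers named in the PR description and are assumed, not copied.
BOT_UA_RE = re.compile(
    r"twitterbot|facebookexternalhit|linkedinbot|slackbot"
    r"|telegrambot|whatsapp|googlebot|bingbot|discordbot|pinterest|applebot",
    re.IGNORECASE,
)

def is_social_bot(user_agent: str) -> bool:
    """Return True when the User-Agent looks like a social media crawler."""
    return bool(BOT_UA_RE.search(user_agent))

print(is_social_bot("Twitterbot/1.0"))                    # True
print(is_social_bot("Mozilla/5.0 (X11; Linux) Firefox"))  # False
```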
Reviewed changes
Copilot reviewed 3 out of 4 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| app/nginx.conf | Adds bot detection map and proxy routing to redirect bot requests to SEO endpoints while preserving SPA routing for normal browsers |
| api/routers/seo.py | Adds four new SEO proxy endpoints (home, catalog, spec overview, spec implementation) with HTML template and dynamic meta tag generation |
| tests/unit/api/test_routers.py | Adds comprehensive test suite (9 tests) covering all SEO proxy endpoints with and without database, including fallback scenarios |
```python
return HTMLResponse(
    BOT_HTML_TEMPLATE.format(
        title=f"{html.escape(spec.title)} - {html.escape(library)} | pyplots.ai",
        description=html.escape(spec.description or DEFAULT_DESCRIPTION),
        image=image,
```
The preview_url from the database is inserted directly into the HTML template without HTML escaping. This could lead to XSS vulnerabilities if the preview_url contains malicious content. The image variable should be HTML-escaped before being used in the template, similar to how spec.title and spec.description are escaped.
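Why `quote=True` matters for this field specifically: `og:image` is interpolated into an HTML *attribute*, so an unescaped double quote could terminate the attribute and smuggle in new ones. A quick standard-library check (the hostile URL is a made-up example):

```python
import html

# A hostile preview_url trying to break out of content="..."
malicious_url = 'https://evil.example/x.png" onload="alert(1)'

# quote=True (the default) encodes double quotes as &quot;,
# so the value cannot terminate the surrounding attribute.
escaped = html.escape(malicious_url, quote=True)
tag = f'<meta property="og:image" content="{escaped}">'
assert '" onload=' not in tag
print(tag)
```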
```diff
+        escaped_image = html.escape(image, quote=True)
         return HTMLResponse(
             BOT_HTML_TEMPLATE.format(
                 title=f"{html.escape(spec.title)} - {html.escape(library)} | pyplots.ai",
                 description=html.escape(spec.description or DEFAULT_DESCRIPTION),
-                image=image,
+                image=escaped_image,
```
```python
# Fallback when DB unavailable
return HTMLResponse(
    BOT_HTML_TEMPLATE.format(
        title=f"{spec_id} - {library} | pyplots.ai",
```
The spec_id and library parameters are not HTML-escaped in the fallback URL construction when the database is unavailable. While they are path parameters and less likely to contain malicious content, they should still be escaped for consistency and defense-in-depth, similar to how they are escaped in lines 127, 141, 160, and 178.
```diff
-                title=f"{spec_id} - {library} | pyplots.ai",
+                title=f"{html.escape(spec_id)} - {html.escape(library)} | pyplots.ai",
```
```nginx
# Named location for bot SEO proxy
location @seo_proxy {
    proxy_pass https://api.pyplots.ai/seo-proxy$request_uri;
    proxy_set_header Host api.pyplots.ai;
    proxy_ssl_server_name on;
}
```
In location @seo_proxy you are proxying to https://api.pyplots.ai without enabling TLS certificate verification, and by default nginx does not verify upstream certificates for proxy_pass over HTTPS. An attacker who can influence DNS or the network path between this frontend and api.pyplots.ai could perform a man-in-the-middle attack to tamper with or replace the SEO HTML returned to bots. Enable strict certificate validation for this upstream (for example by turning on proxy_ssl_verify and configuring trusted CAs, ideally at the http or server level so all HTTPS proxies inherit it) to ensure the backend’s identity is authenticated.
- Escape spec_id and library in fallback title (XSS prevention)
- Escape preview_url with quote=True before inserting in HTML template
- Enable proxy_ssl_verify for backend proxy to prevent MITM attacks
- Add trusted CA certificate path for TLS verification

Addresses Copilot and GitHub Advanced Security findings.

🤖 Generated with [Claude Code](https://claude.ai/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
```python
# Fallback when DB unavailable
return HTMLResponse(
    BOT_HTML_TEMPLATE.format(
        title=f"{spec_id} | pyplots.ai",
```
The spec_id in the title is not HTML-escaped, which could lead to XSS if a malicious spec_id is provided. All user-controlled inputs should be escaped before insertion into HTML. The url parameter already escapes it correctly, but the title needs the same treatment.
```diff
-                title=f"{spec_id} | pyplots.ai",
+                title=f"{html.escape(spec_id)} | pyplots.ai",
```
```python
# Fallback when DB unavailable
return HTMLResponse(
    BOT_HTML_TEMPLATE.format(
        title=f"{html.escape(spec_id)} - {html.escape(library)} | pyplots.ai",
```
The library parameter in the title is not HTML-escaped, which could lead to XSS if a malicious library value is provided. All user-controlled inputs should be escaped before insertion into HTML.
Summary

- Bot detection + SEO proxy endpoints serving dynamic `og:tags` for bots
- Dynamic `og:image` from database `preview_url` for implementation pages

Problem

The CSR (Client-Side Rendered) React app sets meta tags via React Helmet after JavaScript execution. Social media bots don't execute JavaScript, so all pages show the default `og-image.png` instead of dynamic plot previews.

Solution

Endpoints

- `/seo-proxy/` - home page
- `/seo-proxy/catalog` - catalog page
- `/seo-proxy/{spec_id}` - spec overview
- `/seo-proxy/{spec_id}/{library}` - implementation (`preview_url` from DB)

Test plan

`curl -A "Twitterbot" https://pyplots.ai/scatter-basic/matplotlib`

🤖 Generated with Claude Code
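The fallback rendering discussed throughout the review can be sketched end to end with only the standard library. The template below is a hypothetical stand-in (the real BOT_HTML_TEMPLATE lives in api/routers/seo.py and is not quoted in full here); the field names mirror the `.format()` calls shown in the review, and DEFAULT_DESCRIPTION is a placeholder value:

```python
import html

# Hypothetical stand-in for the PR's BOT_HTML_TEMPLATE (not quoted in
# full in this PR page); field names mirror the review's format() calls.
BOT_HTML_TEMPLATE = """<!doctype html>
<html><head>
<title>{title}</title>
<meta property="og:title" content="{title}">
<meta property="og:description" content="{description}">
<meta property="og:image" content="{image}">
<meta property="og:url" content="{url}">
</head><body></body></html>"""

DEFAULT_DESCRIPTION = "Python plotting examples"        # placeholder
DEFAULT_IMAGE = "https://pyplots.ai/og-image.png"

def render_spec_fallback(spec_id: str, library: str) -> str:
    """DB-unavailable fallback: every user-controlled field is escaped."""
    return BOT_HTML_TEMPLATE.format(
        title=f"{html.escape(spec_id)} - {html.escape(library)} | pyplots.ai",
        description=DEFAULT_DESCRIPTION,
        image=DEFAULT_IMAGE,
        url=f"https://pyplots.ai/{html.escape(spec_id)}/{html.escape(library)}",
    )

page = render_spec_fallback("scatter-basic", "matplotlib")
assert "scatter-basic - matplotlib | pyplots.ai" in page
```

Because both path parameters pass through `html.escape`, a hostile `spec_id` cannot break out of the title or URL context.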