-
Notifications
You must be signed in to change notification settings - Fork 60
#108 Sitemap XML Generator #441
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
#108 Sitemap XML Generator #441
Conversation
- page init - shadcn Input
- generate xml & stream it to the client
- handle error. loading state, better ui for output
- priority & last modified
- fix url parsing, better log handling, env for limit
- gui responsiveness, use ref instead of state
- gui show found pages count
- update tools.json
- lock file
- refactoring & tests
- extend tests
- edge runtime
|
The latest updates on your projects. Learn more about Vercel for Git ↗︎
|
WalkthroughThe changes introduce a feature for generating XML sitemaps, including the addition of a new environment variable in the Changes
Sequence Diagram(s)sequenceDiagram
participant User
participant ReactComponent
participant API
participant SitemapGenerator
User->>ReactComponent: Input URL
ReactComponent->>API: Fetch sitemap for URL
API->>SitemapGenerator: Generate sitemap
SitemapGenerator-->>API: Return sitemap XML
API-->>ReactComponent: Return sitemap XML
ReactComponent-->>User: Display sitemap
Warning There were issues while running some tools. Please review the errors and either fix the tool’s configuration or disable the tool if it’s a critical failure. 🔧 eslint
warning eslint@8.57.1: This version is no longer supported. Please see https://eslint.org/version-support for other options. Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 8
🧹 Outside diff range and nitpick comments (12)
src/lib/utils.js (1)
8-20: Enhance JSDoc documentation for clarity.The implementation correctly validates URLs using the built-in URL constructor and appropriately restricts protocols to HTTP/HTTPS. However, the documentation could be more descriptive.
Consider enhancing the JSDoc to clarify the protocol restriction:
/** - * + * Validates if the provided string is a valid HTTP/HTTPS URL. * @param {string} url + * @description Checks if the URL is well-formed and uses either HTTP or HTTPS protocol * @returns {boolean} */src/app/api/generator/sitemap-xml/route.js (2)
18-20: Enhance error message for better user experience.The current error message "Missing URL" could be more descriptive to help users understand what's expected.
- return new Response("Missing URL", { status: 400 }); + return new Response("Missing 'url' parameter in the request query", { status: 400 });
46-51: Consider adding compression for large sitemaps.The response headers are appropriate, but for large sitemaps, consider adding compression support.
return new Response(stream, { headers: { "Content-Type": "application/xml", "Cache-Control": "no-store", + "Content-Encoding": "gzip", }, });Note: If implementing compression, ensure to wrap the stream with appropriate compression transform stream.
__tests__/lib/generator/sitemapXml.test.js (3)
27-29: Fix malformed HTML in mock data.The mock HTML contains a syntax error:
<p>some other elements>. This should be properly closed.- '<html><a href="/page1">Page 1</a><p>some other elements><a href="/page2">Page 2</a></html>', + '<html><a href="/page1">Page 1</a><p>some other elements</p><a href="/page2">Page 2</a></html>',
5-40: Enhance mock coverage for edge cases.The current mock implementation only handles the success path. Consider adding test coverage for:
- Invalid URLs
- Network errors
- Different content types
- Various HTTP status codes
Would you like me to provide an example implementation with these scenarios?
51-84: Enhance test coverage for XML generation.The test case could be improved in several ways:
- Consider using a constant or environment variable for the base URL
- Add test cases for XML escaping (e.g., URLs with special characters)
- Validate the generated XML against the sitemap schema
Example test for XML escaping:
test("handles special characters in URLs", async () => { const baseUrl = "https://example.com/path with spaces/"; // ... test implementation expect(sitemapXML).toContain("https://example.com/path%20with%20spaces/"); });src/app/generator/sitemap-xml/page.jsx (2)
15-22: Consider cleaning up the downloadSitemapUrl blob.The
downloadSitemapUrlstate contains a blob URL created withURL.createObjectURL(). To prevent memory leaks, consider revoking this URL when it's no longer needed.Add cleanup in a useEffect:
+ useEffect(() => { + return () => { + if (downloadSitemapUrl) { + URL.revokeObjectURL(downloadSitemapUrl); + } + }; + }, [downloadSitemapUrl]);
186-198: Consider adding TypeScript and improving type safety.The Button component could benefit from better type safety and defaults.
Consider converting to TypeScript:
interface ButtonProps extends React.ButtonHTMLAttributes<HTMLButtonElement> { isDarkMode: boolean; className?: string; } function Button({ children, className, isDarkMode, type = 'button', ...props }: ButtonProps) { return ( <button type={type} {...props} className={cn( `${isDarkMode ? "bg-slate-100 text-slate-950" : "bg-slate-800 text-slate-100"} disabled:opacity-60 px-4 py-2 flex items-center justify-center rounded font-medium`, className, )} > {children} </button> ); }src/lib/generator/sitemapXml.js (4)
178-183: Avoid assignment within expression in while loopAssigning within the
whileloop condition can be confusing and is flagged by linters. Refactor the loop for clarity.Apply this diff:
- while ((match = linkRegex.exec(buffer)) !== null) { + let match; + while ((match = linkRegex.exec(buffer))) {This change enhances readability and conforms to best practices.
🧰 Tools
🪛 Biome
[error] 178-178: The assignment should not be in an expression.
The use of assignments in expressions is confusing.
Expressions are often considered as side-effect free.(lint/suspicious/noAssignInExpressions)
172-186: Use an HTML parser instead of regex for parsing linksParsing HTML with regular expressions is error-prone and may miss valid links or fail on malformed HTML. Consider using an HTML parser like
cheeriofor robust link extraction.Example modification:
+ import cheerio from 'cheerio'; async function getLinks(currentUrl, baseUrl) { // ... - const links = processBuffer(baseUrl, buffer); + const html = buffer; + const links = parseLinksFromHTML(html, baseUrl); // ... } - function processBuffer(baseUrl, buffer) { - const linkRegex = /<a\s+(?:[^>]*?\s+)?href="([^"]*)"/g; - // existing code - } + function parseLinksFromHTML(html, baseUrl) { + const $ = cheerio.load(html); + const links = []; + $('a[href]').each((_, element) => { + const href = $(element).attr('href'); + const url = parseLink(href, baseUrl); + if (url) { + links.push(url); + } + }); + return links; + }This approach ensures accurate parsing and better handles complex HTML structures.
🧰 Tools
🪛 Biome
[error] 178-178: The assignment should not be in an expression.
The use of assignments in expressions is confusing.
Expressions are often considered as side-effect free.(lint/suspicious/noAssignInExpressions)
54-54: Set a User-Agent header in fetch requestsWhen fetching resources, setting a User-Agent header can help avoid being blocked and identifies your crawler appropriately.
Apply this diff:
// In generateSitemapXML - const response = await fetch(url, { method: "HEAD" }); + const response = await fetch(url, { + method: "HEAD", + headers: { "User-Agent": "WebDevToolsSitemapGenerator/1.0" }, + }); // In getLinks - const response = await fetch(currentUrl); + const response = await fetch(currentUrl, { + headers: { "User-Agent": "WebDevToolsSitemapGenerator/1.0" }, + });Replace
"WebDevToolsSitemapGenerator/1.0"with an appropriate user-agent string.Also applies to: 122-122
27-110: Consider respectingrobots.txtfor ethical crawlingThe current implementation does not check
robots.txtdirectives, which could lead to unintended crawling of disallowed pages. Implementingrobots.txtparsing ensures compliance with websites' crawling policies.Consider using a library like
robots-txt-parserto parse and respectrobots.txtrules before fetching pages.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (7)
.env.example(1 hunks)__tests__/lib/generator/sitemapXml.test.js(1 hunks)src/app/api/generator/sitemap-xml/route.js(1 hunks)src/app/generator/sitemap-xml/page.jsx(1 hunks)src/db/tools.json(1 hunks)src/lib/generator/sitemapXml.js(1 hunks)src/lib/utils.js(1 hunks)
✅ Files skipped from review due to trivial changes (1)
- .env.example
🧰 Additional context used
🪛 Biome
src/lib/generator/sitemapXml.js
[error] 178-178: The assignment should not be in an expression.
The use of assignments in expressions is confusing.
Expressions are often considered as side-effect free.
(lint/suspicious/noAssignInExpressions)
🔇 Additional comments (7)
src/app/api/generator/sitemap-xml/route.js (2)
22-30: Verify URL validation and consider protocol handling.
The URL validation and normalization look good, but consider adding explicit protocol handling to ensure only HTTP(S) URLs are processed.
#!/bin/bash
# Description: Check URL validation implementation
# Test: Check URL validation implementation in utils.js
rg -A 5 "isUrlValid.*=" "src/lib/utils.js" if (!isUrlValid(url)) {
- return new Response("Invalid URL", { status: 400 });
+ return new Response("Invalid URL. Please provide a valid HTTP or HTTPS URL", { status: 400 });
}
+const protocol = new URL(url).protocol;
+if (protocol !== 'http:' && protocol !== 'https:') {
+ return new Response("Only HTTP and HTTPS protocols are supported", { status: 400 });
+}1-3: Verify import dependencies.
The imports look correct for the implementation. Let's verify there are no circular dependencies.
✅ Verification successful
Import dependencies are clean and well-structured
The verification confirms:
- No circular dependencies exist between the files
- All imported functions are properly exported from their respective modules:
generateSitemapXMLfromsrc/lib/generator/sitemapXml.jsisUrlValidandgetSitemapXmlGeneratorLimitfromsrc/lib/utils.js
- The imports follow a unidirectional flow from API route to utility modules
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check for potential circular dependencies in the imported files
# Test 1: Check if sitemapXml.js imports from route.js
rg -l "from.*api/generator/sitemap-xml/route" "src/lib/generator/sitemapXml.js"
# Test 2: Check if utils.js imports from route.js
rg -l "from.*api/generator/sitemap-xml/route" "src/lib/utils.js"
Length of output: 136
Script:
#!/bin/bash
# Let's verify the imports and potential circular dependencies more thoroughly
# Check the implementation of imported functions
echo "=== Checking sitemapXml.js ==="
fd "sitemapXml.js" --exec rg -l "import.*from.*api/generator/sitemap-xml"
echo "=== Checking utils.js ==="
fd "utils.js" --exec rg -l "import.*from.*api/generator/sitemap-xml"
# Check reverse dependencies
echo "=== Checking implementations of imported functions ==="
rg "export.*generateSitemapXML" -A 2
rg "export.*getSitemapXmlGeneratorLimit|export.*isUrlValid" -A 2
Length of output: 1016
src/db/tools.json (2)
164-168: LGTM! The new tool entry follows consistent structure.
The new Sitemap XML Generator entry maintains proper formatting and follows existing patterns for id sequencing, naming, and categorization.
167-167: Verify the route implementation for the sitemap generator.
The link path /generator/sitemap-xml should correspond to the actual route implementation.
✅ Verification successful
Route implementation for sitemap generator is properly implemented
The route /generator/sitemap-xml is correctly implemented with both:
- A page component at
src/app/generator/sitemap-xml/page.jsxthat handles the UI - An API route at
src/app/api/generator/sitemap-xml/route.jsthat generates the sitemap XML
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify the existence of the corresponding route implementation
# Expected: Find route implementation files for the sitemap generator
# Check for page component implementation
fd -t f "page.tsx?" "src/app/generator/sitemap-xml"
# Check for API route implementation
fd -t f "route.tsx?" "src/app/api/generator/sitemap-xml"
Length of output: 108
Script:
#!/bin/bash
# Let's try a broader search for sitemap-related routes and files
# Search for any files containing sitemap in their path
fd "sitemap"
# Search for route implementations containing sitemap
rg -l "sitemap" "src/app"
# Search for any API endpoints or route handlers with sitemap
ast-grep --pattern 'export async function GET($$$) { $$$ sitemap $$$ }'
Length of output: 373
Script:
#!/bin/bash
# Let's verify the content of the found route implementations
cat src/app/generator/sitemap-xml/page.jsx
cat src/app/api/generator/sitemap-xml/route.js
Length of output: 7374
__tests__/lib/generator/sitemapXml.test.js (1)
1-108: Verify streaming implementation and resource usage.
While the tests cover the basic functionality, consider adding tests to verify:
- Memory efficiency of the streaming implementation
- Proper cleanup of resources
- Performance with larger HTML responses
✅ Verification successful
Tests adequately cover the implementation's resource management
The implementation shows proper resource handling with:
- Stream-based processing using
getReader()to avoid loading entire pages into memory - Comprehensive error handling with try-catch blocks at critical points
- Proper URL validation and error logging
- Graceful degradation when encountering invalid links or failed requests
The current test suite effectively validates this through:
- Testing the streaming output with
for awaititeration - Verifying limit enforcement which tests the stream termination
- Mocking the stream reader with proper done/value protocol
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for memory management and resource cleanup in the implementation
rg -A 5 'getReader|ReadableStream|finally|close' src/lib/generator/sitemapXml.js
Length of output: 282
Script:
#!/bin/bash
# Let's examine the full implementation to understand the resource management
ast-grep --pattern 'export async function* generateSitemapXML($$$)'
# Also check for error handling patterns
rg -A 10 'try|catch|finally' src/lib/generator/sitemapXml.js
Length of output: 1924
src/app/generator/sitemap-xml/page.jsx (1)
1-14: LGTM! Clean setup with necessary imports and theme toggle.
The component setup is well-structured with appropriate imports and a clean dark mode implementation.
src/lib/generator/sitemapXml.js (1)
195-213: Verify URL normalization in parseLink function
Ensure that the URL normalization correctly resolves both absolute and relative links without introducing errors. Test with various link formats to confirm expected behavior.
Run the following script to test URL normalization:
This script helps verify that parseLink behaves as expected with different link formats.
- code review changes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 4
🧹 Outside diff range and nitpick comments (10)
__tests__/lib/generator/sitemapXml.test.js (4)
5-40: Enhance mock implementation for better test coverage.The current mock setup has several areas for improvement:
- Add error scenarios (network errors, non-200 responses)
- Fix malformed HTML in the mock response
- Add different content-type scenarios
Consider applying these changes:
global.fetch = jest.fn(() => { + // Simulate different scenarios based on URL or call count + if (fetchCount > 2) { + return Promise.reject(new Error('Network error')); + } return Promise.resolve({ ok: true, headers: { get: (header) => { if (header === "content-type") { return "text/html"; } if (header === "last-modified") { return "Mon, 01 Jan 2024 00:00:00 GMT"; } return undefined; }, }, body: { getReader: () => { return { read: () => { if (fetchCount === 0) { fetchCount++; return Promise.resolve({ done: false, value: new TextEncoder().encode( - '<html><a href="/page1">Page 1</a><p>some other elements><a href="/page2">Page 2</a></html>', + '<html><a href="/page1">Page 1</a><p>some other elements</p><a href="/page2">Page 2</a></html>', ), }); } return Promise.resolve({ done: true, }); }, }; }, }, }); });
42-56: Optimize global cleanup approach.The test setup/teardown is well structured, but we can improve the cleanup approach:
Consider this alternative to using delete:
afterAll(() => { - delete global.TextEncoder; - delete global.TextDecoder; + global.TextEncoder = undefined; + global.TextDecoder = undefined; });Also, consider adding error handling for TextEncoder/TextDecoder setup:
beforeAll(async () => { + if (!global.TextEncoder) { global.TextEncoder = require("util").TextEncoder; + } + if (!global.TextDecoder) { global.TextDecoder = require("util").TextDecoder; + } });🧰 Tools
🪛 Biome
[error] 54-54: Avoid the delete operator which can impact performance.
Unsafe fix: Use an undefined assignment instead.
(lint/performance/noDelete)
[error] 55-55: Avoid the delete operator which can impact performance.
Unsafe fix: Use an undefined assignment instead.
(lint/performance/noDelete)
58-91: Enhance XML validation in tests.While the test covers basic functionality, consider these improvements:
- Add XML schema validation
- Make date expectations dynamic
- Verify URL uniqueness
Example implementation:
const validateXMLSchema = (xml) => { // Add XML schema validation logic const schema = `<?xml version="1.0" encoding="UTF-8"?> <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"> <!-- Add schema definition --> </xs:schema>`; // Return validation result }; test("generates valid sitemap XML with multiple links", async () => { // ... existing setup ... const sitemapXML = sitemapParts.join(""); // Validate schema expect(validateXMLSchema(sitemapXML)).toBe(true); // Use dynamic date const expectedDate = new Date("2024-01-01T00:00:00+00:00") .toISOString() .split('.')[0]+'+00:00'; // Track unique URLs const urls = new Set(); sitemapXML.match(/<loc>([^<]+)<\/loc>/g)?.forEach(match => { const url = match.replace(/<\/?loc>/g, ''); expect(urls.has(url)).toBe(false); urls.add(url); }); // ... rest of the assertions ... });
93-114: Strengthen limit test assertions.While the test verifies basic limit functionality, consider adding:
- Exact URL count validation
- Verification that the sitemap is well-formed even when limited
Add these assertions:
const sitemapXML = sitemapParts.join(""); + // Verify exact URL count + const urlCount = (sitemapXML.match(/<url>/g) || []).length; + expect(urlCount).toBe(limit); + + // Verify well-formed XML structure + expect(sitemapXML).toMatch( + /^<\?xml.*?>\s*<urlset.*?>.*?<\/urlset>$/s + ); + expect(sitemapXML).toContain("<urlset"); // ... rest of the assertions ...src/app/generator/sitemap-xml/page.jsx (2)
85-194: Enhance user feedback and accessibility.The UI needs improvements in user feedback and accessibility:
- Loading state for copy operation
- Success feedback for copy operation
- Missing aria-live regions for dynamic content
Apply these improvements:
+ const [isCopying, setIsCopying] = useState(false); + const [copySuccess, setCopySuccess] = useState(false); // In the copy button click handler onClick={async () => { + setIsCopying(true); try { await navigator.clipboard.writeText(sitemap); + setCopySuccess(true); + setTimeout(() => setCopySuccess(false), 2000); } catch (e) { setError("Failed to copy to clipboard"); } finally { + setIsCopying(false); } }} // Add aria-live region for dynamic content + <div aria-live="polite" className="sr-only"> + {isLoading && "Generating sitemap..."} + {copySuccess && "Copied to clipboard"} + {error && `Error: ${error}`} + </div> // Update copy button content - <ContentCopyIcon className="w-5 h-5" /> + {isCopying ? ( + <CircularProgress size="1.25rem" color="inherit" /> + ) : copySuccess ? ( + "Copied!" + ) : ( + <ContentCopyIcon className="w-5 h-5" /> + )}
196-208: Add type safety to the Button component.Consider adding proper TypeScript types to improve maintainability and prevent potential runtime errors.
Convert the component to TypeScript:
-function Button({ children, className, isDarkMode, ...props }) { +type ButtonProps = React.ButtonHTMLAttributes<HTMLButtonElement> & { + children: React.ReactNode; + className?: string; + isDarkMode: boolean; +}; + +function Button({ children, className, isDarkMode, ...props }: ButtonProps) {src/lib/generator/sitemapXml.js (4)
5-16: Consider adding more static asset extensions to IGNORED_SUFFIX.The current list covers common static assets, but consider adding these extensions to prevent unnecessary processing:
- Document formats:
.doc,.docx,.xls,.xlsx,.ppt,.pptx- Media files:
.mp3,.mp4,.wav,.webp,.avif- Archive files:
.zip,.rar,.tar,.gz
86-86: Consider a more nuanced priority calculation.The current priority calculation (
0.8 ** depth) decays exponentially with depth, which might not be suitable for all site structures. Some deep pages could be more important than shallow ones based on their content or purpose.Consider:
- Using path segments instead of crawl depth
- Adding configuration for the decay factor (0.8)
- Supporting priority overrides via page metadata
173-173: Improve href attribute matching regex.The current regex might miss valid href formats. Consider using a more comprehensive pattern that handles:
- Single quotes:
href='...'- Unquoted attributes:
href=http://example.com- URLs with quotes:
href="http://example.com/?q=\"quoted\""Apply this diff:
-const linkRegex = /<a\s+(?:[^>]*?\s+)?href="([^"]*)"/g; +const linkRegex = /<a\s+(?:[^>]*?\s+)?href=(?:"([^"]*)|'([^']*)'|([^\s>]*))/g;
195-214: Enhance URL normalization.The current implementation could be improved to handle more edge cases:
- Normalize paths (remove
.and..)- Handle URL fragments consistently
- Normalize query parameter order
Apply this diff:
try { const baseUrlObj = new URL(baseUrl); + // Normalize the base URL + baseUrlObj.hash = ''; + baseUrlObj.search = new URLSearchParams(baseUrlObj.search).toString(); const linkUrl = new URL(link, baseUrlObj.origin); + // Normalize the link URL + linkUrl.hash = ''; + linkUrl.search = new URLSearchParams(linkUrl.search).toString(); + return removeWWW(linkUrl.hostname) === removeWWW(baseUrlObj.hostname) - ? new URL(linkUrl.pathname, baseUrl).href + ? new URL(linkUrl.pathname + linkUrl.search, baseUrl).href : null;
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (5)
.env.example(1 hunks)__tests__/lib/generator/sitemapXml.test.js(1 hunks)src/app/api/generator/sitemap-xml/route.js(1 hunks)src/app/generator/sitemap-xml/page.jsx(1 hunks)src/lib/generator/sitemapXml.js(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- .env.example
- src/app/api/generator/sitemap-xml/route.js
🧰 Additional context used
🪛 Biome
__tests__/lib/generator/sitemapXml.test.js
[error] 54-54: Avoid the delete operator which can impact performance.
Unsafe fix: Use an undefined assignment instead.
(lint/performance/noDelete)
[error] 55-55: Avoid the delete operator which can impact performance.
Unsafe fix: Use an undefined assignment instead.
(lint/performance/noDelete)
src/lib/generator/sitemapXml.js
[error] 178-178: The assignment should not be in an expression.
The use of assignments in expressions is confusing.
Expressions are often considered as side-effect free.
(lint/suspicious/noAssignInExpressions)
🔇 Additional comments (1)
src/app/generator/sitemap-xml/page.jsx (1)
1-8: LGTM: Component setup and imports are appropriate.
The necessary dependencies are correctly imported, and the "use client" directive is properly placed for client-side rendering.
- code review changes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
🧹 Outside diff range and nitpick comments (7)
src/app/generator/sitemap-xml/page.jsx (3)
46-54: Enhance error handling for network failures.The error handling could be more specific about the type of failure.
Consider this improvement:
const response = await fetch( `/api/generator/sitemap-xml?url=${encodeURIComponent(url)}`, ); if (!response.ok) { - console.error("Failed to fetch sitemap"); - setError("Failed to fetch sitemap"); + const errorMessage = response.status === 404 + ? "Website not found or inaccessible" + : response.status === 429 + ? "Too many requests. Please try again later" + : `Failed to fetch sitemap (HTTP ${response.status})`; + console.error(errorMessage); + setError(errorMessage); return; }
160-166: Enhance loading state accessibility.The loading indicator should have an ARIA label for screen readers.
Apply this improvement:
<CircularProgress size="1.5rem" color="inherit" className="text-white" + aria-label="Generating sitemap..." />
215-227: Consider adding TypeScript for better type safety.The Button component would benefit from TypeScript props interface definition.
Consider this improvement:
+interface ButtonProps extends React.ButtonHTMLAttributes<HTMLButtonElement> { + isDarkMode: boolean; + className?: string; + children: React.ReactNode; +} -function Button({ children, className, isDarkMode, ...props }) +function Button({ children, className, isDarkMode, ...props }: ButtonProps)src/lib/generator/sitemapXml.js (4)
5-16: Enhance URL filtering by adding more file extensions and prefixesConsider adding more common file extensions and prefixes to improve filtering:
const IGNORED_SUFFIX = [ ".jpg", ".jpeg", ".png", ".gif", ".svg", ".pdf", ".css", ".js", ".ico", + ".mp4", + ".webp", + ".webm", + ".mp3", + ".wav", + ".woff", + ".woff2", + ".ttf", ]; const IGNORED_PREFIX = [ "#", "mailto:", "tel:", + "javascript:", + "data:", + "blob:", ];
86-86: Document priority calculation formulaThe priority calculation uses 0.8 as a base value, which should be documented or made configurable.
+ // Priority decreases by 20% for each level deeper from the root URL const priority = 0.8 ** currentPage.depth;
53-64: Add rate limiting to prevent overwhelming target serversConsider implementing rate limiting between requests to be a good web citizen.
+ // Add at the top of the file + const RATE_LIMIT_MS = 100; // Configurable delay between requests + + // Add before the fetch + await new Promise(resolve => setTimeout(resolve, RATE_LIMIT_MS)); const response = await fetch(url, { method: "HEAD" });
214-217: Enhance URL normalizationThe current URL normalization could be improved to handle more edge cases.
- return removeWWW(linkUrl.hostname) === removeWWW(baseUrlObj.hostname) - ? new URL(linkUrl.pathname, baseUrl).href + if (removeWWW(linkUrl.hostname) !== removeWWW(baseUrlObj.hostname)) { + return null; + } + // Normalize the URL: remove fragments, normalize path + linkUrl.hash = ''; + const normalizedPath = linkUrl.pathname.replace(/\/+/g, '/').replace(/\/$/, ''); + return new URL(normalizedPath + linkUrl.search, baseUrl).href
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (2)
src/app/generator/sitemap-xml/page.jsx(1 hunks)src/lib/generator/sitemapXml.js(1 hunks)
🧰 Additional context used
🪛 Biome
src/lib/generator/sitemapXml.js
[error] 186-186: The assignment should not be in an expression.
The use of assignments in expressions is confusing.
Expressions are often considered as side-effect free.
(lint/suspicious/noAssignInExpressions)
🔇 Additional comments (2)
src/app/generator/sitemap-xml/page.jsx (1)
1-8: LGTM! Dependencies and setup look good.
All necessary imports are present and properly organized.
src/lib/generator/sitemapXml.js (1)
82-84: Add error handling consistency in HEAD request catch block
The error handling here should follow the same pattern as other error cases by removing the URL from visited and adding it to rejected.
- improve test, improve url parsing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 4
🧹 Outside diff range and nitpick comments (5)
src/lib/generator/sitemapXml.js (5)
5-16: Consider enhancing the ignored file extensions listThe IGNORED_SUFFIX list could be expanded to include more common file types that shouldn't be in sitemaps:
- Document formats: .doc, .docx, .xls, .xlsx, .ppt, .pptx
- Archive formats: .zip, .rar, .tar, .gz
- Media formats: .mp3, .mp4, .wav, .webp, .avif
Consider making these lists configurable through environment variables to allow customization without code changes.
85-85: Consider making priority calculation configurableThe priority calculation (
0.8 ** depth) is hardcoded. Consider making the base value (0.8) configurable to allow different priority decay rates for different use cases.+const PRIORITY_DECAY_RATE = process.env.NEXT_PUBLIC_SITEMAP_PRIORITY_DECAY_RATE || 0.8; + - const priority = 0.8 ** currentPage.depth; + const priority = PRIORITY_DECAY_RATE ** currentPage.depth;
31-33: Consider using Set for queue to avoid duplicate processingWhile visited and rejected URLs are tracked in Sets, the queue might contain duplicate URLs that will be filtered out later. Consider using a Set-based queue implementation to avoid pushing duplicate URLs altogether.
- const queue = [{ url: baseUrl, depth: 0 }]; + const queueSet = new Set([baseUrl]); + const queueItems = [{ url: baseUrl, depth: 0 }]; + + // In the push operation (around line 102): - queue.push(...links.map((url) => ({ url, depth: currentPage.depth + 1 }))); + const newItems = links + .filter(url => !queueSet.has(url)) + .map(url => ({ url, depth: currentPage.depth + 1 })); + newItems.forEach(item => queueSet.add(item.url)); + queueItems.push(...newItems);
213-216: Enhance hostname comparison logicThe current hostname comparison might fail with subdomains. Consider using a more robust comparison that handles various subdomain scenarios.
- return removeWWW(linkUrl.hostname) === removeWWW(baseUrlObj.hostname) + const normalizedLinkHost = removeWWW(linkUrl.hostname).split('.').slice(-2).join('.'); + const normalizedBaseHost = removeWWW(baseUrlObj.hostname).split('.').slice(-2).join('.'); + return normalizedLinkHost === normalizedBaseHost
223-230: Improve JSDoc for removeWWW functionThe current JSDoc could be more descriptive and include edge cases.
/** + * Removes 'www.' prefix from a hostname if present. * - * @param {String} hostname e.g. `www.example.com` - * @returns String e.g. `example.com` + * @param {string} hostname - The hostname to process (e.g., 'www.example.com', 'sub.example.com') + * @returns {string} The hostname without 'www.' prefix (e.g., 'example.com', 'sub.example.com') */
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (2)
__tests__/lib/generator/sitemapXml.test.js(1 hunks)src/lib/generator/sitemapXml.js(1 hunks)
🧰 Additional context used
🪛 Biome
__tests__/lib/generator/sitemapXml.test.js
[error] 56-56: Avoid the delete operator which can impact performance.
Unsafe fix: Use an undefined assignment instead.
(lint/performance/noDelete)
[error] 57-57: Avoid the delete operator which can impact performance.
Unsafe fix: Use an undefined assignment instead.
(lint/performance/noDelete)
src/lib/generator/sitemapXml.js
[error] 185-185: The assignment should not be in an expression.
The use of assignments in expressions is confusing.
Expressions are often considered as side-effect free.
(lint/suspicious/noAssignInExpressions)
|
Hi @Bashamega, can you please have a look at my PR. Thanks. |
Bashamega
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The current sitemap API isn’t generating a full sitemap for my large site—it only captures part of it. I’m already on Vercel’s free plan, so I’d prefer a solution that doesn’t require upgrading. Are there any prebuilt sitemap APIs we could use to handle this?
I appreciate your enthusiasm for developing the API, but I’m concerned about the potential ongoing costs.
annuk123
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
Hello @annuk123 |
okay |
That's probably because of the limit (100), meaning - 100 pages will be processed (fetched html and parsed links). xml-sitemaps.com/ also has a limit of 500 in free tier. But please provide the page for me to verify.
I understand, however I don't think there is a need to upgrade hosting and worry about cost, you can change the limit (which will affect ram usage and execution time) based on the usage and for those users who need more, this is an open-source project, they have access to the source code and can run it themself, slefhost, copy, modify, etc. |
- not ok response error handling
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (5)
src/lib/generator/sitemapXml.js (5)
5-16: Consider expanding ignored patterns for comprehensive filteringThe current list of ignored patterns could be enhanced to cover more cases:
Additional file extensions to ignore:
- Archives:
.zip,.rar,.tar,.gz- Documents:
.doc,.docx,.xls,.xlsx,.ppt,.pptx- Media:
.mp3,.mp4,.wav,.webp,.avifAdditional prefixes to ignore (line 16):
- Protocols:
ftp:,sftp:,data:,javascript:- Special links:
sms:,whatsapp:
85-85: Consider adjusting priority calculation for better SEOThe current priority calculation (
0.8 ** depth) drops too quickly for deeper pages. For a depth of 5, the priority would be 0.328, which might be too low for important content.Consider using a more gradual decrease:
-const priority = 0.8 ** currentPage.depth; +const priority = Math.max(0.5, 1 - (currentPage.depth * 0.1));This ensures that even at depth 5, the priority would still be 0.5.
53-54: Add rate limiting to prevent overwhelming serversThe current implementation makes requests as fast as possible, which could overwhelm the target server or trigger rate limiting.
Consider adding a delay between requests:
+const DELAY_BETWEEN_REQUESTS = 100; // milliseconds + try { + await new Promise(resolve => setTimeout(resolve, DELAY_BETWEEN_REQUESTS)); const response = await fetch(url, { method: "HEAD" });
180-190: Modernize link extraction with for...of loopReplace the while loop with assignment with a more modern and readable approach using
for...ofwithmatchAll().-let match; -while ((match = linkRegex.exec(buffer)) !== null) { - const url = parseLink(match[2], baseUrl); +for (const match of buffer.matchAll(linkRegex)) { + const url = parseLink(match[2], baseUrl); if (url) { links.push(url); } }🧰 Tools
🪛 Biome
[error] 185-185: The assignment should not be in an expression.
The use of assignments in expressions is confusing.
Expressions are often considered as side-effect free.(lint/suspicious/noAssignInExpressions)
202-221: Enhance URL parsing robustnessThe current implementation could be enhanced to handle more edge cases:
function parseLink(link, baseUrl) { + // Trim whitespace and decode entities + link = link.trim().replace(/&/g, '&'); + + // Handle hash-only links + if (link === '#') return null; + if ( IGNORED_SUFFIX.some((ext) => link.endsWith(ext)) || IGNORED_PREFIX.some((prefix) => link.startsWith(prefix)) ) { return null; } try { const baseUrlObj = new URL(baseUrl); + // Remove hash fragments + link = link.split('#')[0]; + // Remove query parameters if needed + // link = link.split('?')[0]; const linkUrl = new URL(link, baseUrlObj.origin); return removeWWW(linkUrl.hostname) === removeWWW(baseUrlObj.hostname) ? new URL(linkUrl.pathname, baseUrl).href : null; } catch (e) { devLogger(`Error parsing link: ${link}`, e); return null; } }
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)
src/lib/generator/sitemapXml.js(1 hunks)
🧰 Additional context used
🪛 Biome
src/lib/generator/sitemapXml.js
[error] 185-185: The assignment should not be in an expression.
The use of assignments in expressions is confusing.
Expressions are often considered as side-effect free.
(lint/suspicious/noAssignInExpressions)
|
Thank you for the prompt reply @LenVavro This is the website that i have tried it on: https://adambashaahmednaji.com/ |
I've checked it @Bashamega and no issue was found, reason for the clipped sitemap is the default limit (100). You can increase it easily using env variable From my side, everything's ready for the merge. |
Can we use another api? |
|
@Bashamega you can adjust the limit as you want using I didn't find any free API for this purpose and I don't think someone will provide their server resources for free or without limits. Vercel is already providing free hosting and I've implemented it, so in a way, this is the only available free and unlimited API to generate sitemaps 😄 |
Bashamega
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you very much



Description
Issue #108 Sitemap xml generator, that recursively iterate through website's pages, process its html and parse links to create
sitemap.xmlfile.Key points
Lightweight
To keep it lightweight I've decided not to use Playwright, Puppeteer or other similar package, but a simple fetch and regex. Furthemore, to minimize ram usage, I am processing HTML response in stream, therefore only chunk of the HTML is stored in the ram at once and immediately processed. The generated sitemap is also streamed to the frontend, so the user can see the progress in real time.
Limitations
Getting HTML content from different page is almost impossible without backend, bc of CORS policies in browsers. Therefore I had to fetch website's content on the server. However, I can see that web is hosted on the Vercel, which has a timeout for server/edge functions. Therefore I set runtime to edge, which should allow streaming response beyond 25s limit (source).
Limit visited pages
Visited page is URL, which content was fetched and processed. To limit number of visited pages I added a limit, which can be set in new env property
NEXT_PUBLIC_GENERATOR_SITEMAP_XML_LIMIT, if you leave it empty, default limit is 100. Meaning at most 100 pages will be in the final sitemap.xml. This limit is important to save hosting resources.Type of change
Checklist:
Summary by CodeRabbit
Release Notes
New Features
Tests
Chores