Skip to content

Conversation

@LenVavro
Copy link
Contributor

@LenVavro LenVavro commented Nov 7, 2024

Description

Issue #108 Sitemap xml generator, that recursively iterate through website's pages, process its html and parse links to create sitemap.xml file.

Key points

  • Lightweight
  • Stream real-time results to FE
  • Create downloadable file in the browser

Lightweight

To keep it lightweight I've decided not to use Playwright, Puppeteer or other similar package, but a simple fetch and regex. Furthemore, to minimize ram usage, I am processing HTML response in stream, therefore only chunk of the HTML is stored in the ram at once and immediately processed. The generated sitemap is also streamed to the frontend, so the user can see the progress in real time.

Limitations

Getting HTML content from different page is almost impossible without backend, bc of CORS policies in browsers. Therefore I had to fetch website's content on the server. However, I can see that web is hosted on the Vercel, which has a timeout for server/edge functions. Therefore I set runtime to edge, which should allow streaming response beyond 25s limit (source).

Limit visited pages

Visited page is URL, which content was fetched and processed. To limit number of visited pages I added a limit, which can be set in new env property NEXT_PUBLIC_GENERATOR_SITEMAP_XML_LIMIT, if you leave it empty, default limit is 100. Meaning at most 100 pages will be in the final sitemap.xml. This limit is important to save hosting resources.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)

Checklist:

  • My code follows the style guidelines of this project
  • Any new tools are named according to the Tool Naming Convention outlined in contributing.md
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • Any dependent changes have been merged and published in downstream modules

Summary by CodeRabbit

Release Notes

  • New Features

    • Introduced an API endpoint for generating XML sitemaps based on user-provided URLs.
    • Added a user interface for generating and downloading XML sitemaps, including error handling and dark mode support.
    • New environment variable for sitemap generation limit to control the number of pages processed.
  • Tests

    • Implemented a comprehensive test suite for the sitemap generation functionality, ensuring robustness and reliability.
  • Chores

    • Added a new tool entry for the "Sitemap XML Generator" in the tools list.

- page init
- shadcn Input
- generate xml & stream it to the client
- handle error. loading state, better ui for output
- priority & last modified
- fix url parsing, better log handling, env for limit
- gui responsiveness, use ref instead of state
- gui show found pages count
- update tools.json
- refactoring & tests
- extend tests
- edge runtime
@LenVavro LenVavro requested a review from Bashamega as a code owner November 7, 2024 15:50
@vercel
Copy link

vercel bot commented Nov 7, 2024

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
web-dev-tools ✅ Ready (Inspect) Visit Preview 💬 Add feedback Nov 14, 2024 7:09am

@coderabbitai
Copy link

coderabbitai bot commented Nov 7, 2024

Walkthrough

The changes introduce a feature for generating XML sitemaps, including the addition of a new environment variable in the .env.example file, an API endpoint for sitemap generation, a React component for user interaction, and utility functions for URL validation and limit retrieval. A comprehensive test suite for the sitemap generation function has been established, and a new entry for a sitemap generator tool has been added to the tools JSON file.

Changes

File Path Change Summary
.env.example Added environment variable NEXT_PUBLIC_GENERATOR_SITEMAP_XML_LIMIT.
__tests__/lib/generator/sitemapXml.test.js Introduced a test suite for generateSitemapXML with mock fetch, including three main tests.
src/app/api/generator/sitemap-xml/route.js Added API endpoint for generating XML sitemaps with error handling for URL validation.
src/app/generator/sitemap-xml/page.jsx Introduced a React component for generating sitemaps, including state management and error handling.
src/db/tools.json Added new tool entry for "Sitemap XML Generator" with id 28 and link "/generator/sitemap-xml".
src/lib/generator/sitemapXml.js Added functions for generating XML sitemaps, including generateSitemapXML and utility functions.
src/lib/utils.js Added functions isUrlValid(url) and getSitemapXmlGeneratorLimit() for URL validation and limit retrieval.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant ReactComponent
    participant API
    participant SitemapGenerator

    User->>ReactComponent: Input URL
    ReactComponent->>API: Fetch sitemap for URL
    API->>SitemapGenerator: Generate sitemap
    SitemapGenerator-->>API: Return sitemap XML
    API-->>ReactComponent: Return sitemap XML
    ReactComponent-->>User: Display sitemap
Loading

🐰 "In the garden where sitemaps bloom,
A new tool has come to dispel the gloom.
With URLs valid and limits in sight,
We generate sitemaps, oh what a delight!
So hop along, let’s fetch and create,
In the world of XML, we celebrate!" 🌼

Warning

There were issues while running some tools. Please review the errors and either fix the tool’s configuration or disable the tool if it’s a critical failure.

🔧 eslint

If the error stems from missing dependencies, add them to the package.json file. For unrecoverable errors (e.g., due to private dependencies), disable the tool in the CodeRabbit configuration.

warning eslint@8.57.1: This version is no longer supported. Please see https://eslint.org/version-support for other options.
warning eslint > @humanwhocodes/config-array@0.13.0: Use @eslint/config-array instead
warning eslint > @humanwhocodes/config-array > @humanwhocodes/object-schema@2.0.3: Use @eslint/object-schema instead
warning eslint > file-entry-cache > flat-cache > rimraf@3.0.2: Rimraf versions prior to v4 are no longer supported
warning eslint > file-entry-cache > flat-cache > rimraf > glob@7.2.3: Glob versions prior to v9 are no longer supported
warning eslint > file-entry-cache > flat-cache > rimraf > glob > inflight@1.0.6: This module is not supported, and leaks memory. Do not use it. Check out lru-cache if you want a good and tested way to coalesce async requests by a key value, which is much more comprehensive and powerful.


Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 8

🧹 Outside diff range and nitpick comments (12)
src/lib/utils.js (1)

8-20: Enhance JSDoc documentation for clarity.

The implementation correctly validates URLs using the built-in URL constructor and appropriately restricts protocols to HTTP/HTTPS. However, the documentation could be more descriptive.

Consider enhancing the JSDoc to clarify the protocol restriction:

 /**
- *
+ * Validates if the provided string is a valid HTTP/HTTPS URL.
  * @param {string} url
+ * @description Checks if the URL is well-formed and uses either HTTP or HTTPS protocol
  * @returns {boolean}
  */
src/app/api/generator/sitemap-xml/route.js (2)

18-20: Enhance error message for better user experience.

The current error message "Missing URL" could be more descriptive to help users understand what's expected.

-    return new Response("Missing URL", { status: 400 });
+    return new Response("Missing 'url' parameter in the request query", { status: 400 });

46-51: Consider adding compression for large sitemaps.

The response headers are appropriate, but for large sitemaps, consider adding compression support.

 return new Response(stream, {
   headers: {
     "Content-Type": "application/xml",
     "Cache-Control": "no-store",
+    "Content-Encoding": "gzip",
   },
 });

Note: If implementing compression, ensure to wrap the stream with appropriate compression transform stream.

__tests__/lib/generator/sitemapXml.test.js (3)

27-29: Fix malformed HTML in mock data.

The mock HTML contains a syntax error: <p>some other elements>. This should be properly closed.

-                  '<html><a href="/page1">Page 1</a><p>some other elements><a href="/page2">Page 2</a></html>',
+                  '<html><a href="/page1">Page 1</a><p>some other elements</p><a href="/page2">Page 2</a></html>',

5-40: Enhance mock coverage for edge cases.

The current mock implementation only handles the success path. Consider adding test coverage for:

  • Invalid URLs
  • Network errors
  • Different content types
  • Various HTTP status codes

Would you like me to provide an example implementation with these scenarios?


51-84: Enhance test coverage for XML generation.

The test case could be improved in several ways:

  1. Consider using a constant or environment variable for the base URL
  2. Add test cases for XML escaping (e.g., URLs with special characters)
  3. Validate the generated XML against the sitemap schema

Example test for XML escaping:

test("handles special characters in URLs", async () => {
  const baseUrl = "https://example.com/path with spaces/";
  // ... test implementation
  expect(sitemapXML).toContain("https://example.com/path%20with%20spaces/");
});
src/app/generator/sitemap-xml/page.jsx (2)

15-22: Consider cleaning up the downloadSitemapUrl blob.

The downloadSitemapUrl state contains a blob URL created with URL.createObjectURL(). To prevent memory leaks, consider revoking this URL when it's no longer needed.

Add cleanup in a useEffect:

+ useEffect(() => {
+   return () => {
+     if (downloadSitemapUrl) {
+       URL.revokeObjectURL(downloadSitemapUrl);
+     }
+   };
+ }, [downloadSitemapUrl]);

186-198: Consider adding TypeScript and improving type safety.

The Button component could benefit from better type safety and defaults.

Consider converting to TypeScript:

interface ButtonProps extends React.ButtonHTMLAttributes<HTMLButtonElement> {
  isDarkMode: boolean;
  className?: string;
}

function Button({ 
  children, 
  className, 
  isDarkMode, 
  type = 'button',
  ...props 
}: ButtonProps) {
  return (
    <button
      type={type}
      {...props}
      className={cn(
        `${isDarkMode ? "bg-slate-100 text-slate-950" : "bg-slate-800 text-slate-100"} disabled:opacity-60 px-4 py-2 flex items-center justify-center rounded font-medium`,
        className,
      )}
    >
      {children}
    </button>
  );
}
src/lib/generator/sitemapXml.js (4)

178-183: Avoid assignment within expression in while loop

Assigning within the while loop condition can be confusing and is flagged by linters. Refactor the loop for clarity.

Apply this diff:

- while ((match = linkRegex.exec(buffer)) !== null) {
+ let match;
+ while ((match = linkRegex.exec(buffer))) {

This change enhances readability and conforms to best practices.

🧰 Tools
🪛 Biome

[error] 178-178: The assignment should not be in an expression.

The use of assignments in expressions is confusing.
Expressions are often considered as side-effect free.

(lint/suspicious/noAssignInExpressions)


172-186: Use an HTML parser instead of regex for parsing links

Parsing HTML with regular expressions is error-prone and may miss valid links or fail on malformed HTML. Consider using an HTML parser like cheerio for robust link extraction.

Example modification:

+ import cheerio from 'cheerio';

async function getLinks(currentUrl, baseUrl) {
  // ...
- const links = processBuffer(baseUrl, buffer);
+ const html = buffer;
+ const links = parseLinksFromHTML(html, baseUrl);
  // ...
}

- function processBuffer(baseUrl, buffer) {
-   const linkRegex = /<a\s+(?:[^>]*?\s+)?href="([^"]*)"/g;
-   // existing code
- }

+ function parseLinksFromHTML(html, baseUrl) {
+   const $ = cheerio.load(html);
+   const links = [];
+   $('a[href]').each((_, element) => {
+     const href = $(element).attr('href');
+     const url = parseLink(href, baseUrl);
+     if (url) {
+       links.push(url);
+     }
+   });
+   return links;
+ }

This approach ensures accurate parsing and better handles complex HTML structures.

🧰 Tools
🪛 Biome

[error] 178-178: The assignment should not be in an expression.

The use of assignments in expressions is confusing.
Expressions are often considered as side-effect free.

(lint/suspicious/noAssignInExpressions)


54-54: Set a User-Agent header in fetch requests

When fetching resources, setting a User-Agent header can help avoid being blocked and identifies your crawler appropriately.

Apply this diff:

// In generateSitemapXML
- const response = await fetch(url, { method: "HEAD" });
+ const response = await fetch(url, {
+   method: "HEAD",
+   headers: { "User-Agent": "WebDevToolsSitemapGenerator/1.0" },
+ });

// In getLinks
- const response = await fetch(currentUrl);
+ const response = await fetch(currentUrl, {
+   headers: { "User-Agent": "WebDevToolsSitemapGenerator/1.0" },
+ });

Replace "WebDevToolsSitemapGenerator/1.0" with an appropriate user-agent string.

Also applies to: 122-122


27-110: Consider respecting robots.txt for ethical crawling

The current implementation does not check robots.txt directives, which could lead to unintended crawling of disallowed pages. Implementing robots.txt parsing ensures compliance with websites' crawling policies.

Consider using a library like robots-txt-parser to parse and respect robots.txt rules before fetching pages.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between e91bf2e and e1d5450.

📒 Files selected for processing (7)
  • .env.example (1 hunks)
  • __tests__/lib/generator/sitemapXml.test.js (1 hunks)
  • src/app/api/generator/sitemap-xml/route.js (1 hunks)
  • src/app/generator/sitemap-xml/page.jsx (1 hunks)
  • src/db/tools.json (1 hunks)
  • src/lib/generator/sitemapXml.js (1 hunks)
  • src/lib/utils.js (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • .env.example
🧰 Additional context used
🪛 Biome
src/lib/generator/sitemapXml.js

[error] 178-178: The assignment should not be in an expression.

The use of assignments in expressions is confusing.
Expressions are often considered as side-effect free.

(lint/suspicious/noAssignInExpressions)

🔇 Additional comments (7)
src/app/api/generator/sitemap-xml/route.js (2)

22-30: Verify URL validation and consider protocol handling.

The URL validation and normalization look good, but consider adding explicit protocol handling to ensure only HTTP(S) URLs are processed.

#!/bin/bash
# Description: Check URL validation implementation

# Test: Check URL validation implementation in utils.js
rg -A 5 "isUrlValid.*=" "src/lib/utils.js"
 if (!isUrlValid(url)) {
-    return new Response("Invalid URL", { status: 400 });
+    return new Response("Invalid URL. Please provide a valid HTTP or HTTPS URL", { status: 400 });
 }

+const protocol = new URL(url).protocol;
+if (protocol !== 'http:' && protocol !== 'https:') {
+    return new Response("Only HTTP and HTTPS protocols are supported", { status: 400 });
+}

1-3: Verify import dependencies.

The imports look correct for the implementation. Let's verify there are no circular dependencies.

✅ Verification successful

Import dependencies are clean and well-structured

The verification confirms:

  • No circular dependencies exist between the files
  • All imported functions are properly exported from their respective modules:
    • generateSitemapXML from src/lib/generator/sitemapXml.js
    • isUrlValid and getSitemapXmlGeneratorLimit from src/lib/utils.js
  • The imports follow a unidirectional flow from API route to utility modules
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for potential circular dependencies in the imported files

# Test 1: Check if sitemapXml.js imports from route.js
rg -l "from.*api/generator/sitemap-xml/route" "src/lib/generator/sitemapXml.js"

# Test 2: Check if utils.js imports from route.js
rg -l "from.*api/generator/sitemap-xml/route" "src/lib/utils.js"

Length of output: 136


Script:

#!/bin/bash
# Let's verify the imports and potential circular dependencies more thoroughly

# Check the implementation of imported functions
echo "=== Checking sitemapXml.js ==="
fd "sitemapXml.js" --exec rg -l "import.*from.*api/generator/sitemap-xml"

echo "=== Checking utils.js ==="
fd "utils.js" --exec rg -l "import.*from.*api/generator/sitemap-xml"

# Check reverse dependencies
echo "=== Checking implementations of imported functions ==="
rg "export.*generateSitemapXML" -A 2
rg "export.*getSitemapXmlGeneratorLimit|export.*isUrlValid" -A 2

Length of output: 1016

src/db/tools.json (2)

164-168: LGTM! The new tool entry follows consistent structure.

The new Sitemap XML Generator entry maintains proper formatting and follows existing patterns for id sequencing, naming, and categorization.


167-167: Verify the route implementation for the sitemap generator.

The link path /generator/sitemap-xml should correspond to the actual route implementation.

✅ Verification successful

Route implementation for sitemap generator is properly implemented

The route /generator/sitemap-xml is correctly implemented with both:

  • A page component at src/app/generator/sitemap-xml/page.jsx that handles the UI
  • An API route at src/app/api/generator/sitemap-xml/route.js that generates the sitemap XML
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the existence of the corresponding route implementation
# Expected: Find route implementation files for the sitemap generator

# Check for page component implementation
fd -t f "page.tsx?" "src/app/generator/sitemap-xml"

# Check for API route implementation
fd -t f "route.tsx?" "src/app/api/generator/sitemap-xml"

Length of output: 108


Script:

#!/bin/bash
# Let's try a broader search for sitemap-related routes and files
# Search for any files containing sitemap in their path
fd "sitemap" 

# Search for route implementations containing sitemap
rg -l "sitemap" "src/app"

# Search for any API endpoints or route handlers with sitemap
ast-grep --pattern 'export async function GET($$$) { $$$ sitemap $$$ }'

Length of output: 373


Script:

#!/bin/bash
# Let's verify the content of the found route implementations
cat src/app/generator/sitemap-xml/page.jsx
cat src/app/api/generator/sitemap-xml/route.js

Length of output: 7374

__tests__/lib/generator/sitemapXml.test.js (1)

1-108: Verify streaming implementation and resource usage.

While the tests cover the basic functionality, consider adding tests to verify:

  1. Memory efficiency of the streaming implementation
  2. Proper cleanup of resources
  3. Performance with larger HTML responses
✅ Verification successful

Tests adequately cover the implementation's resource management

The implementation shows proper resource handling with:

  • Stream-based processing using getReader() to avoid loading entire pages into memory
  • Comprehensive error handling with try-catch blocks at critical points
  • Proper URL validation and error logging
  • Graceful degradation when encountering invalid links or failed requests

The current test suite effectively validates this through:

  • Testing the streaming output with for await iteration
  • Verifying limit enforcement which tests the stream termination
  • Mocking the stream reader with proper done/value protocol
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for memory management and resource cleanup in the implementation
rg -A 5 'getReader|ReadableStream|finally|close' src/lib/generator/sitemapXml.js

Length of output: 282


Script:

#!/bin/bash
# Let's examine the full implementation to understand the resource management
ast-grep --pattern 'export async function* generateSitemapXML($$$)'

# Also check for error handling patterns
rg -A 10 'try|catch|finally' src/lib/generator/sitemapXml.js

Length of output: 1924

src/app/generator/sitemap-xml/page.jsx (1)

1-14: LGTM! Clean setup with necessary imports and theme toggle.

The component setup is well-structured with appropriate imports and a clean dark mode implementation.

src/lib/generator/sitemapXml.js (1)

195-213: Verify URL normalization in parseLink function

Ensure that the URL normalization correctly resolves both absolute and relative links without introducing errors. Test with various link formats to confirm expected behavior.

Run the following script to test URL normalization:

This script helps verify that parseLink behaves as expected with different link formats.

- code review changes
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🧹 Outside diff range and nitpick comments (10)
__tests__/lib/generator/sitemapXml.test.js (4)

5-40: Enhance mock implementation for better test coverage.

The current mock setup has several areas for improvement:

  1. Add error scenarios (network errors, non-200 responses)
  2. Fix malformed HTML in the mock response
  3. Add different content-type scenarios

Consider applying these changes:

 global.fetch = jest.fn(() => {
+  // Simulate different scenarios based on URL or call count
+  if (fetchCount > 2) {
+    return Promise.reject(new Error('Network error'));
+  }
   return Promise.resolve({
     ok: true,
     headers: {
       get: (header) => {
         if (header === "content-type") {
           return "text/html";
         }
         if (header === "last-modified") {
           return "Mon, 01 Jan 2024 00:00:00 GMT";
         }
         return undefined;
       },
     },
     body: {
       getReader: () => {
         return {
           read: () => {
             if (fetchCount === 0) {
               fetchCount++;
               return Promise.resolve({
                 done: false,
                 value: new TextEncoder().encode(
-                  '<html><a href="/page1">Page 1</a><p>some other elements><a href="/page2">Page 2</a></html>',
+                  '<html><a href="/page1">Page 1</a><p>some other elements</p><a href="/page2">Page 2</a></html>',
                 ),
               });
             }
             return Promise.resolve({
               done: true,
             });
           },
         };
       },
     },
   });
 });

42-56: Optimize global cleanup approach.

The test setup/teardown is well structured, but we can improve the cleanup approach:

Consider this alternative to using delete:

   afterAll(() => {
-    delete global.TextEncoder;
-    delete global.TextDecoder;
+    global.TextEncoder = undefined;
+    global.TextDecoder = undefined;
   });

Also, consider adding error handling for TextEncoder/TextDecoder setup:

   beforeAll(async () => {
+    if (!global.TextEncoder) {
       global.TextEncoder = require("util").TextEncoder;
+    }
+    if (!global.TextDecoder) {
       global.TextDecoder = require("util").TextDecoder;
+    }
   });
🧰 Tools
🪛 Biome

[error] 54-54: Avoid the delete operator which can impact performance.

Unsafe fix: Use an undefined assignment instead.

(lint/performance/noDelete)


[error] 55-55: Avoid the delete operator which can impact performance.

Unsafe fix: Use an undefined assignment instead.

(lint/performance/noDelete)


58-91: Enhance XML validation in tests.

While the test covers basic functionality, consider these improvements:

  1. Add XML schema validation
  2. Make date expectations dynamic
  3. Verify URL uniqueness

Example implementation:

const validateXMLSchema = (xml) => {
  // Add XML schema validation logic
  const schema = `<?xml version="1.0" encoding="UTF-8"?>
    <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
      <!-- Add schema definition -->
    </xs:schema>`;
  // Return validation result
};

test("generates valid sitemap XML with multiple links", async () => {
  // ... existing setup ...
  
  const sitemapXML = sitemapParts.join("");
  
  // Validate schema
  expect(validateXMLSchema(sitemapXML)).toBe(true);
  
  // Use dynamic date
  const expectedDate = new Date("2024-01-01T00:00:00+00:00")
    .toISOString()
    .split('.')[0]+'+00:00';
    
  // Track unique URLs
  const urls = new Set();
  sitemapXML.match(/<loc>([^<]+)<\/loc>/g)?.forEach(match => {
    const url = match.replace(/<\/?loc>/g, '');
    expect(urls.has(url)).toBe(false);
    urls.add(url);
  });
  
  // ... rest of the assertions ...
});

93-114: Strengthen limit test assertions.

While the test verifies basic limit functionality, consider adding:

  1. Exact URL count validation
  2. Verification that the sitemap is well-formed even when limited

Add these assertions:

   const sitemapXML = sitemapParts.join("");
+  // Verify exact URL count
+  const urlCount = (sitemapXML.match(/<url>/g) || []).length;
+  expect(urlCount).toBe(limit);
+
+  // Verify well-formed XML structure
+  expect(sitemapXML).toMatch(
+    /^<\?xml.*?>\s*<urlset.*?>.*?<\/urlset>$/s
+  );
+
   expect(sitemapXML).toContain("<urlset");
   // ... rest of the assertions ...
src/app/generator/sitemap-xml/page.jsx (2)

85-194: Enhance user feedback and accessibility.

The UI needs improvements in user feedback and accessibility:

  1. Loading state for copy operation
  2. Success feedback for copy operation
  3. Missing aria-live regions for dynamic content

Apply these improvements:

+  const [isCopying, setIsCopying] = useState(false);
+  const [copySuccess, setCopySuccess] = useState(false);

   // In the copy button click handler
   onClick={async () => {
+    setIsCopying(true);
     try {
       await navigator.clipboard.writeText(sitemap);
+      setCopySuccess(true);
+      setTimeout(() => setCopySuccess(false), 2000);
     } catch (e) {
       setError("Failed to copy to clipboard");
     } finally {
+      setIsCopying(false);
     }
   }}

   // Add aria-live region for dynamic content
+  <div aria-live="polite" className="sr-only">
+    {isLoading && "Generating sitemap..."}
+    {copySuccess && "Copied to clipboard"}
+    {error && `Error: ${error}`}
+  </div>

   // Update copy button content
-  <ContentCopyIcon className="w-5 h-5" />
+  {isCopying ? (
+    <CircularProgress size="1.25rem" color="inherit" />
+  ) : copySuccess ? (
+    "Copied!"
+  ) : (
+    <ContentCopyIcon className="w-5 h-5" />
+  )}

196-208: Add type safety to the Button component.

Consider adding proper TypeScript types to improve maintainability and prevent potential runtime errors.

Convert the component to TypeScript:

-function Button({ children, className, isDarkMode, ...props }) {
+type ButtonProps = React.ButtonHTMLAttributes<HTMLButtonElement> & {
+  children: React.ReactNode;
+  className?: string;
+  isDarkMode: boolean;
+};
+
+function Button({ children, className, isDarkMode, ...props }: ButtonProps) {
src/lib/generator/sitemapXml.js (4)

5-16: Consider adding more static asset extensions to IGNORED_SUFFIX.

The current list covers common static assets, but consider adding these extensions to prevent unnecessary processing:

  • Document formats: .doc, .docx, .xls, .xlsx, .ppt, .pptx
  • Media files: .mp3, .mp4, .wav, .webp, .avif
  • Archive files: .zip, .rar, .tar, .gz

86-86: Consider a more nuanced priority calculation.

The current priority calculation (0.8 ** depth) decays exponentially with depth, which might not be suitable for all site structures. Some deep pages could be more important than shallow ones based on their content or purpose.

Consider:

  1. Using path segments instead of crawl depth
  2. Adding configuration for the decay factor (0.8)
  3. Supporting priority overrides via page metadata

173-173: Improve href attribute matching regex.

The current regex might miss valid href formats. Consider using a more comprehensive pattern that handles:

  • Single quotes: href='...'
  • Unquoted attributes: href=http://example.com
  • URLs with quotes: href="http://example.com/?q=\"quoted\""

Apply this diff:

-const linkRegex = /<a\s+(?:[^>]*?\s+)?href="([^"]*)"/g;
+const linkRegex = /<a\s+(?:[^>]*?\s+)?href=(?:"([^"]*)|'([^']*)'|([^\s>]*))/g;

195-214: Enhance URL normalization.

The current implementation could be improved to handle more edge cases:

  1. Normalize paths (remove . and ..)
  2. Handle URL fragments consistently
  3. Normalize query parameter order

Apply this diff:

 try {
   const baseUrlObj = new URL(baseUrl);
+  // Normalize the base URL
+  baseUrlObj.hash = '';
+  baseUrlObj.search = new URLSearchParams(baseUrlObj.search).toString();
 
   const linkUrl = new URL(link, baseUrlObj.origin);
+  // Normalize the link URL
+  linkUrl.hash = '';
+  linkUrl.search = new URLSearchParams(linkUrl.search).toString();
+
   return removeWWW(linkUrl.hostname) === removeWWW(baseUrlObj.hostname)
-    ? new URL(linkUrl.pathname, baseUrl).href
+    ? new URL(linkUrl.pathname + linkUrl.search, baseUrl).href
     : null;
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between e1d5450 and 3281566.

📒 Files selected for processing (5)
  • .env.example (1 hunks)
  • __tests__/lib/generator/sitemapXml.test.js (1 hunks)
  • src/app/api/generator/sitemap-xml/route.js (1 hunks)
  • src/app/generator/sitemap-xml/page.jsx (1 hunks)
  • src/lib/generator/sitemapXml.js (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • .env.example
  • src/app/api/generator/sitemap-xml/route.js
🧰 Additional context used
🪛 Biome
__tests__/lib/generator/sitemapXml.test.js

[error] 54-54: Avoid the delete operator which can impact performance.

Unsafe fix: Use an undefined assignment instead.

(lint/performance/noDelete)


[error] 55-55: Avoid the delete operator which can impact performance.

Unsafe fix: Use an undefined assignment instead.

(lint/performance/noDelete)

src/lib/generator/sitemapXml.js

[error] 178-178: The assignment should not be in an expression.

The use of assignments in expressions is confusing.
Expressions are often considered as side-effect free.

(lint/suspicious/noAssignInExpressions)

🔇 Additional comments (1)
src/app/generator/sitemap-xml/page.jsx (1)

1-8: LGTM: Component setup and imports are appropriate.

The necessary dependencies are correctly imported, and the "use client" directive is properly placed for client-side rendering.

- code review changes
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Outside diff range and nitpick comments (7)
src/app/generator/sitemap-xml/page.jsx (3)

46-54: Enhance error handling for network failures.

The error handling could be more specific about the type of failure.

Consider this improvement:

 const response = await fetch(
   `/api/generator/sitemap-xml?url=${encodeURIComponent(url)}`,
 );

 if (!response.ok) {
-  console.error("Failed to fetch sitemap");
-  setError("Failed to fetch sitemap");
+  const errorMessage = response.status === 404 
+    ? "Website not found or inaccessible"
+    : response.status === 429
+    ? "Too many requests. Please try again later"
+    : `Failed to fetch sitemap (HTTP ${response.status})`;
+  console.error(errorMessage);
+  setError(errorMessage);
   return;
 }

160-166: Enhance loading state accessibility.

The loading indicator should have an ARIA label for screen readers.

Apply this improvement:

 <CircularProgress
   size="1.5rem"
   color="inherit"
   className="text-white"
+  aria-label="Generating sitemap..."
 />

215-227: Consider adding TypeScript for better type safety.

The Button component would benefit from TypeScript props interface definition.

Consider this improvement:

+interface ButtonProps extends React.ButtonHTMLAttributes<HTMLButtonElement> {
+  isDarkMode: boolean;
+  className?: string;
+  children: React.ReactNode;
+}

-function Button({ children, className, isDarkMode, ...props })
+function Button({ children, className, isDarkMode, ...props }: ButtonProps)
src/lib/generator/sitemapXml.js (4)

5-16: Enhance URL filtering by adding more file extensions and prefixes

Consider adding more common file extensions and prefixes to improve filtering:

 const IGNORED_SUFFIX = [
   ".jpg",
   ".jpeg",
   ".png",
   ".gif",
   ".svg",
   ".pdf",
   ".css",
   ".js",
   ".ico",
+  ".mp4",
+  ".webp",
+  ".webm",
+  ".mp3",
+  ".wav",
+  ".woff",
+  ".woff2",
+  ".ttf",
 ];
 const IGNORED_PREFIX = [
   "#",
   "mailto:",
   "tel:",
+  "javascript:",
+  "data:",
+  "blob:",
 ];

86-86: Document priority calculation formula

The priority calculation uses 0.8 as a base value, which should be documented or made configurable.

+  // Priority decreases by 20% for each level deeper from the root URL
   const priority = 0.8 ** currentPage.depth;

53-64: Add rate limiting to prevent overwhelming target servers

Consider implementing rate limiting between requests to be a good web citizen.

+  // Add at the top of the file
+  const RATE_LIMIT_MS = 100; // Configurable delay between requests
+
+  // Add before the fetch
+  await new Promise(resolve => setTimeout(resolve, RATE_LIMIT_MS));
   const response = await fetch(url, { method: "HEAD" });

214-217: Enhance URL normalization

The current URL normalization could be improved to handle more edge cases.

-    return removeWWW(linkUrl.hostname) === removeWWW(baseUrlObj.hostname)
-      ? new URL(linkUrl.pathname, baseUrl).href
+    if (removeWWW(linkUrl.hostname) !== removeWWW(baseUrlObj.hostname)) {
+      return null;
+    }
+    // Normalize the URL: remove fragments, normalize path
+    linkUrl.hash = '';
+    const normalizedPath = linkUrl.pathname.replace(/\/+/g, '/').replace(/\/$/, '');
+    return new URL(normalizedPath + linkUrl.search, baseUrl).href
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 3281566 and 73c76d3.

📒 Files selected for processing (2)
  • src/app/generator/sitemap-xml/page.jsx (1 hunks)
  • src/lib/generator/sitemapXml.js (1 hunks)
🧰 Additional context used
🪛 Biome
src/lib/generator/sitemapXml.js

[error] 186-186: The assignment should not be in an expression.

The use of assignments in expressions is confusing.
Expressions are often considered as side-effect free.

(lint/suspicious/noAssignInExpressions)

🔇 Additional comments (2)
src/app/generator/sitemap-xml/page.jsx (1)

1-8: LGTM! Dependencies and setup look good.

All necessary imports are present and properly organized.

src/lib/generator/sitemapXml.js (1)

82-84: Add error handling consistency in HEAD request catch block

The error handling here should follow the same pattern as other error cases by removing the URL from visited and adding it to rejected.

- improve test, improve url parsing
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🧹 Outside diff range and nitpick comments (5)
src/lib/generator/sitemapXml.js (5)

5-16: Consider enhancing the ignored file extensions list

The IGNORED_SUFFIX list could be expanded to include more common file types that shouldn't be in sitemaps:

  • Document formats: .doc, .docx, .xls, .xlsx, .ppt, .pptx
  • Archive formats: .zip, .rar, .tar, .gz
  • Media formats: .mp3, .mp4, .wav, .webp, .avif

Consider making these lists configurable through environment variables to allow customization without code changes.


85-85: Consider making priority calculation configurable

The priority calculation (0.8 ** depth) is hardcoded. Consider making the base value (0.8) configurable to allow different priority decay rates for different use cases.

+const PRIORITY_DECAY_RATE = process.env.NEXT_PUBLIC_SITEMAP_PRIORITY_DECAY_RATE || 0.8;
+
-      const priority = 0.8 ** currentPage.depth;
+      const priority = PRIORITY_DECAY_RATE ** currentPage.depth;

31-33: Consider using Set for queue to avoid duplicate processing

While visited and rejected URLs are tracked in Sets, the queue might contain duplicate URLs that will be filtered out later. Consider using a Set-based queue implementation to avoid pushing duplicate URLs altogether.

-  const queue = [{ url: baseUrl, depth: 0 }];
+  const queueSet = new Set([baseUrl]);
+  const queueItems = [{ url: baseUrl, depth: 0 }];
+
+  // In the push operation (around line 102):
-  queue.push(...links.map((url) => ({ url, depth: currentPage.depth + 1 })));
+  const newItems = links
+    .filter(url => !queueSet.has(url))
+    .map(url => ({ url, depth: currentPage.depth + 1 }));
+  newItems.forEach(item => queueSet.add(item.url));
+  queueItems.push(...newItems);

213-216: Enhance hostname comparison logic

The current hostname comparison might fail with subdomains. Consider using a more robust comparison that handles various subdomain scenarios.

-    return removeWWW(linkUrl.hostname) === removeWWW(baseUrlObj.hostname)
+    const normalizedLinkHost = removeWWW(linkUrl.hostname).split('.').slice(-2).join('.');
+    const normalizedBaseHost = removeWWW(baseUrlObj.hostname).split('.').slice(-2).join('.');
+    return normalizedLinkHost === normalizedBaseHost

223-230: Improve JSDoc for removeWWW function

The current JSDoc could be more descriptive and include edge cases.

 /**
+ * Removes 'www.' prefix from a hostname if present.
  *
- * @param {String} hostname  e.g. `www.example.com`
- * @returns String e.g. `example.com`
+ * @param {string} hostname - The hostname to process (e.g., 'www.example.com', 'sub.example.com')
+ * @returns {string} The hostname without 'www.' prefix (e.g., 'example.com', 'sub.example.com')
  */
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 73c76d3 and 79ada46.

📒 Files selected for processing (2)
  • __tests__/lib/generator/sitemapXml.test.js (1 hunks)
  • src/lib/generator/sitemapXml.js (1 hunks)
🧰 Additional context used
🪛 Biome
__tests__/lib/generator/sitemapXml.test.js

[error] 56-56: Avoid the delete operator which can impact performance.

Unsafe fix: Use an undefined assignment instead.

(lint/performance/noDelete)


[error] 57-57: Avoid the delete operator which can impact performance.

Unsafe fix: Use an undefined assignment instead.

(lint/performance/noDelete)

src/lib/generator/sitemapXml.js

[error] 185-185: The assignment should not be in an expression.

The use of assignments in expressions is confusing.
Expressions are often considered as side-effect free.

(lint/suspicious/noAssignInExpressions)

@LenVavro
Copy link
Contributor Author

Hi @Bashamega, can you please have a look at my PR. Thanks.

Copy link
Owner

@Bashamega Bashamega left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current sitemap API isn’t generating a full sitemap for my large site—it only captures part of it. I’m already on Vercel’s free plan, so I’d prefer a solution that doesn’t require upgrading. Are there any prebuilt sitemap APIs we could use to handle this?

I appreciate your enthusiasm for developing the API, but I’m concerned about the potential ongoing costs.

@Bashamega Bashamega requested a review from annuk123 November 13, 2024 06:17
Copy link
Collaborator

@annuk123 annuk123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Screenshot from 2024-11-13 20-22-06
it gives an error

@Bashamega
Copy link
Owner

Screenshot from 2024-11-13 20-22-06 it gives an error

Hello @annuk123
This was a issue with a tool, and io forgot to fix it. It should be solved now

@annuk123
Copy link
Collaborator

Screenshot from 2024-11-13 20-22-06 it gives an error

Hello @annuk123 This was a issue with a tool, and io forgot to fix it. It should be solved now

okay

@LenVavro
Copy link
Contributor Author

LenVavro commented Nov 14, 2024

@Bashamega

The current sitemap API isn’t generating a full sitemap for my large site—it only captures part of it.

That's probably because of the limit (100), meaning - 100 pages will be processed (fetched html and parsed links). xml-sitemaps.com/ also has a limit of 500 in free tier. But please provide the page for me to verify.


I’m already on Vercel’s free plan, so I’d prefer a solution that doesn’t require upgrading. Are there any prebuilt sitemap APIs we could use to handle this?

I appreciate your enthusiasm for developing the API, but I’m concerned about the potential ongoing costs.

I understand, however I don't think there is a need to upgrade hosting and worry about cost, you can change the limit (which will affect ram usage and execution time) based on the usage and for those users who need more, this is an open-source project, they have access to the source code and can run it themself, slefhost, copy, modify, etc.

- not ok response error handling
@LenVavro
Copy link
Contributor Author

Screenshot from 2024-11-13 20-22-06 it gives an error

I've just added better error message

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (5)
src/lib/generator/sitemapXml.js (5)

5-16: Consider expanding ignored patterns for comprehensive filtering

The current list of ignored patterns could be enhanced to cover more cases:

  1. Additional file extensions to ignore:

    • Archives: .zip, .rar, .tar, .gz
    • Documents: .doc, .docx, .xls, .xlsx, .ppt, .pptx
    • Media: .mp3, .mp4, .wav, .webp, .avif
  2. Additional prefixes to ignore (line 16):

    • Protocols: ftp:, sftp:, data:, javascript:
    • Special links: sms:, whatsapp:

85-85: Consider adjusting priority calculation for better SEO

The current priority calculation (0.8 ** depth) drops too quickly for deeper pages. For a depth of 5, the priority would be 0.328, which might be too low for important content.

Consider using a more gradual decrease:

-const priority = 0.8 ** currentPage.depth;
+const priority = Math.max(0.5, 1 - (currentPage.depth * 0.1));

This ensures that even at depth 5, the priority would still be 0.5.


53-54: Add rate limiting to prevent overwhelming servers

The current implementation makes requests as fast as possible, which could overwhelm the target server or trigger rate limiting.

Consider adding a delay between requests:

+const DELAY_BETWEEN_REQUESTS = 100; // milliseconds
+
 try {
+  await new Promise(resolve => setTimeout(resolve, DELAY_BETWEEN_REQUESTS));
   const response = await fetch(url, { method: "HEAD" });

180-190: Modernize link extraction with for...of loop

Replace the while loop with assignment with a more modern and readable approach using for...of with matchAll().

-let match;
-while ((match = linkRegex.exec(buffer)) !== null) {
-  const url = parseLink(match[2], baseUrl);
+for (const match of buffer.matchAll(linkRegex)) {
+  const url = parseLink(match[2], baseUrl);
   if (url) {
     links.push(url);
   }
 }
🧰 Tools
🪛 Biome

[error] 185-185: The assignment should not be in an expression.

The use of assignments in expressions is confusing.
Expressions are often considered as side-effect free.

(lint/suspicious/noAssignInExpressions)


202-221: Enhance URL parsing robustness

The current implementation could be enhanced to handle more edge cases:

 function parseLink(link, baseUrl) {
+  // Trim whitespace and decode entities
+  link = link.trim().replace(/&amp;/g, '&');
+
+  // Handle hash-only links
+  if (link === '#') return null;
+
   if (
     IGNORED_SUFFIX.some((ext) => link.endsWith(ext)) ||
     IGNORED_PREFIX.some((prefix) => link.startsWith(prefix))
   ) {
     return null;
   }

   try {
     const baseUrlObj = new URL(baseUrl);
+    // Remove hash fragments
+    link = link.split('#')[0];
+    // Remove query parameters if needed
+    // link = link.split('?')[0];

     const linkUrl = new URL(link, baseUrlObj.origin);
     return removeWWW(linkUrl.hostname) === removeWWW(baseUrlObj.hostname)
       ? new URL(linkUrl.pathname, baseUrl).href
       : null;
   } catch (e) {
     devLogger(`Error parsing link: ${link}`, e);
     return null;
   }
 }
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 79ada46 and 7fe5680.

📒 Files selected for processing (1)
  • src/lib/generator/sitemapXml.js (1 hunks)
🧰 Additional context used
🪛 Biome
src/lib/generator/sitemapXml.js

[error] 185-185: The assignment should not be in an expression.

The use of assignments in expressions is confusing.
Expressions are often considered as side-effect free.

(lint/suspicious/noAssignInExpressions)

@Bashamega
Copy link
Owner

Thank you for the prompt reply @LenVavro

This is the website that i have tried it on: https://adambashaahmednaji.com/
I fear that the API will exceed limit, but we can push it to prod and see what will happen/

@LenVavro
Copy link
Contributor Author

Thank you for the prompt reply @LenVavro

This is the website that i have tried it on: https://adambashaahmednaji.com/ I fear that the API will exceed limit, but we can push it to prod and see what will happen/

I've checked it @Bashamega and no issue was found, reason for the clipped sitemap is the default limit (100). You can increase it easily using env variable NEXT_PUBLIC_GENERATOR_SITEMAP_XML_LIMIT=.

From my side, everything's ready for the merge.

@LenVavro LenVavro requested a review from Bashamega November 18, 2024 07:14
@Bashamega
Copy link
Owner

Thank you for the prompt reply @LenVavro
This is the website that i have tried it on: https://adambashaahmednaji.com/ I fear that the API will exceed limit, but we can push it to prod and see what will happen/

I've checked it @Bashamega and no issue was found, reason for the clipped sitemap is the default limit (100). You can increase it easily using env variable NEXT_PUBLIC_GENERATOR_SITEMAP_XML_LIMIT=.

From my side, everything's ready for the merge.

Can we use another api?
So it can scrape all the website.
I don't want the generated sitemaps to be incomplete. I don't have a problem with using a third party free api

@LenVavro
Copy link
Contributor Author

@Bashamega you can adjust the limit as you want using NEXT_PUBLIC_GENERATOR_SITEMAP_XML_LIMIT e.g. 10mil and it will for sure generate full sitemap, but once again even xml-sitemaps.com has a limit of 500 in free tier, as you can read it on their homepage.

I didn't find any free API for this purpose and I don't think someone will provide their server resources for free or without limits. Vercel is already providing free hosting and I've implemented it, so in a way, this is the only available free and unlimited API to generate sitemaps 😄

Copy link
Owner

@Bashamega Bashamega left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much

@Bashamega Bashamega merged commit a7795de into Bashamega:main Nov 21, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants