Add properties to tests, add additional params for tests, fix broken numbers. by SuppliedOrange · Pull Request #9 · SuppliedOrange/WebscrapeCounting

SuppliedOrange · 2025-05-05T23:08:24Z

Summary by CodeRabbit

New Features
- Added support for configuring headless mode and page load timeout when running tests.
- Test commands now accept optional arguments for headless mode and timeout settings.
Documentation
- Updated installation instructions and CLI usage in the README to reflect new test options.
Refactor
- Standardized test function signatures to accept additional properties for enhanced configurability.

…numbers

coderabbitai · 2025-05-05T23:08:29Z

Walkthrough

The changes introduce enhanced configurability for the web scraper by allowing test runners to specify headless mode and pageLoadTimeout as options. These are threaded through new and updated type definitions, test runner logic, and the scraper implementation. Documentation is updated to reflect the new optional test arguments, and test files are refactored to accept the new properties interface.

Changes

| File(s) | Change Summary |
|---|---|
| README.md | Updated installation instructions to mention new test arguments (`--headless`, `--pageLoadTimeout`) and clarified CLI usage wording. |
| data/numbers.ts | Updated URLs and DOM extraction logic for queries 4, 5, and 6 to match new target websites and structures. |
| src/types/TestProperties.ts | Added new `TestProperties` interface with optional `scraperConstructorProperties` of type `ScraperConstructor`. |
| src/webscraper/Scraper.ts | Extended `ScraperConstructor` with optional `headless` and `pageLoadTimeout`; updated `Scraper` class and methods to use these options. |
| test.ts | Modified test runner to accept and pass `headless` and `pageLoadTimeout` as part of `TestProperties` to tests. |
| tests/check_formats.test.ts, tests/check_order.test.ts, tests/check_repeating_domain.test.ts | Updated test function signatures to accept an unused `TestProperties` parameter. |
| tests/scrape_and_check.test.ts | Updated test function to accept and utilize `TestProperties`, passing options to the `Scraper` constructor and logging properties. |
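
The type changes summarized above can be pictured with a minimal sketch. Only `headless`, `pageLoadTimeout`, and `scraperConstructorProperties` come from the change summary; the real `ScraperConstructor` may carry other fields, and `resolveScraperOptions` is a hypothetical helper added here for illustration. The defaults (headless `true`, 15000 ms) are the ones mentioned later in this review.

```typescript
// Assumed shape: only the two optional fields named in the change summary.
interface ScraperConstructor {
  headless?: boolean;
  pageLoadTimeout?: number; // milliseconds
}

// Properties object handed from the test runner to each test function.
interface TestProperties {
  scraperConstructorProperties?: ScraperConstructor;
}

// Hypothetical helper: fills in defaults for anything the runner did not set.
function resolveScraperOptions(
  props: TestProperties = {}
): Required<ScraperConstructor> {
  const opts = props.scraperConstructorProperties ?? {};
  return {
    headless: opts.headless ?? true,
    pageLoadTimeout: opts.pageLoadTimeout ?? 15000,
  };
}
```

A test that ignores the new parameter keeps working because every field is optional.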

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant CLI/Test Runner
    participant TestFunction
    participant Scraper

    User->>CLI/Test Runner: Run test with --headless/--pageLoadTimeout
    CLI/Test Runner->>TestFunction: Call with data, {scraperConstructorProperties}
    TestFunction->>Scraper: Instantiate with options (headless, pageLoadTimeout)
    Scraper->>Scraper: Launch browser (headless), set timeout
    Scraper->>Scraper: Fetch page with timeout
    Scraper-->>TestFunction: Return result
    TestFunction-->>CLI/Test Runner: Return test result
    CLI/Test Runner-->>User: Output results

Poem

🐇
A hop and a skip, new options abound,
Headless or not, let your scraper be crowned.
Timeouts are set, so tests run with grace,
Each rabbit dev beams with a smile on their face.
Docs now explain how to pass what you need—
Configurable tests, a rabbit’s good deed!


coderabbitai

Actionable comments posted: 4

🔭 Outside diff range comments (2)

tests/check_order.test.ts (1)
1-43: Note the pipeline failure in CI

While your code changes look good, there appears to be a test failure in the CI pipeline. This isn't related to your parameter changes, but you might need to investigate:
Test Failure: Number 5 failed to scrape. Error: Failed to evaluate query function due to TypeError: Cannot read properties of null (reading 'innerText') during page.evaluate.
🏁 Script executed:
#!/bin/bash
# Let's locate the definition for number 5 in the data to understand what might be failing

# Search for number 5 definition in the data files
echo "Looking for number 5 definition in data files..."
rg -A 10 -B 2 "number: 5" --type ts

# Check if any query functions use innerText on potentially null elements
echo "Checking for potential null element access in query functions..."
rg -A 5 -B 5 "innerText" --type ts | grep -C 2 "number: 5"
Guard against null in number 5’s queryFunction

The CI failure shows that for number 5, document.querySelector(".we-customer-ratings__averages__display") can return null, leading to a TypeError when accessing .innerText. You’ll need to add a null‐check or fallback in the queryFunction for number 5:

• Location to update:

data/numbers.ts, the object with number: 5

• Suggested diff:
 queryFunction: () => {
-    return Math.ceil(document.querySelector(".we-customer-ratings__averages__display").innerText)
+    const el = document.querySelector<HTMLElement>(".we-customer-ratings__averages__display");
+    if (!el) {
+        throw new Error("Element .we-customer-ratings__averages__display not found for number 5");
+    }
+    // innerText is a string, so parse it before rounding up.
+    return Math.ceil(parseFloat(el.innerText));
 }
This replaces the opaque null-access TypeError during page.evaluate with a descriptive error, so a missing element is easier to diagnose when the test fails.
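
The parsing half of that guard can be checked without a browser. The following is a rough sketch, assuming the rating text looks like a decimal number; `parseRating` and its `label` parameter are hypothetical names, not part of the repository.

```typescript
// Hypothetical helper: turns the text of a rating element into a rounded-up integer.
// Throws a descriptive error instead of letting a null or non-numeric value slip through.
function parseRating(
  text: string | null | undefined,
  label = "rating"
): number {
  if (text == null || text.trim() === "") {
    throw new Error(`No text found for ${label}`);
  }
  const value = parseFloat(text);
  if (isNaN(value)) {
    throw new Error(`Could not parse "${text}" as a number for ${label}`);
  }
  return Math.ceil(value);
}
```

Inside the query function this would be called with the element's `innerText`, or with `null` when the element is missing.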
data/numbers.ts (1)
1-63: Consider adding retry logic and error handling across all query functions.

None of the query functions have proper error handling for cases where elements aren't found or have unexpected content.

Consider implementing a helper function to standardize error handling across all queries:
function safeQuery(selector: string, defaultValue: string, transform?: (text: string) => string): string {
  // Type the lookup as HTMLElement so that innerText is available.
  const element = document.querySelector<HTMLElement>(selector);
  if (!element) return defaultValue;

  const text = element.innerText;
  return transform ? transform(text) : text;
}
Then update each query function to use this pattern for more consistent error handling.

Please check if the websites being scraped have robots.txt files that might prohibit scraping:

🏁 Script executed:
#!/bin/bash
# Check robots.txt files for the three updated URLs

echo "Checking PokemonDB robots.txt"
curl -s https://pokemondb.net/robots.txt

echo -e "\nChecking Apple App Store robots.txt"
curl -s https://apps.apple.com/robots.txt

echo -e "\nChecking Microsoft robots.txt"
curl -s https://dotnet.microsoft.com/robots.txt
Add standardized error handling and confirm robots.txt compliance

I’ve checked the robots.txt for all three domains:
- pokemondb.net: /pokedex/* isn’t disallowed (only /pokebase/* is blocked, with a crawl-delay of 2)
- apps.apple.com: your /us/app/... path is allowed
- dotnet.microsoft.com: only /download/thank-you/ is disallowed, not /en-us/download/dotnet/6.0

To improve resilience when elements are missing or text is malformed, introduce a helper in, say, src/utils/safeQuery.ts:
export function safeQuery<T>(
  selector: string,
  defaultValue: T,
  transform: (text: string, el: HTMLElement) => T = text => (text as unknown as T),
  retries = 1,
  delayMs = 200
): Promise<T> {
  return new Promise((resolve) => {
    // Try the lookup, retrying up to `retries` more times before giving up.
    const attempt = (n: number) => {
      // Typed as HTMLElement so that innerText is available.
      const el = document.querySelector<HTMLElement>(selector);
      if (el) {
        try {
          resolve(transform(el.innerText, el));
        } catch {
          // A failing transform falls back to the default value.
          resolve(defaultValue);
        }
      } else if (n > 0) {
        setTimeout(() => attempt(n - 1), delayMs);
      } else {
        resolve(defaultValue);
      }
    };
    attempt(retries);
  });
}
Then in data/numbers.ts, refactor each queryFunction:
- queryFunction: () => {
-   return document.querySelector("td").innerText.substring(3,4)
- }
+ queryFunction: () =>
+   safeQuery("td", "0", text => text.substring(3, 4)),
• Wrap all document.querySelector / querySelectorAll calls in safeQuery
• Provide sensible defaults ("0", "", etc.) and optional retry counts
• Centralize error handling for missing nodes or transform failures
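
The fallback behavior described in these bullets can be sketched without a DOM by injecting the lookup root. This is a simplified, synchronous variant with retries left out; `TextLike`, `QueryRootLike`, and `safeQuerySync` are hypothetical names. One caveat worth checking: if `queryFunction` runs through `page.evaluate`, as the CI error message suggests, a helper imported on the Node side would not automatically exist in the page, so it would need to be inlined or injected.

```typescript
// Minimal structural types so the sketch runs without a browser DOM.
interface TextLike {
  innerText: string;
}
interface QueryRootLike {
  querySelector(selector: string): TextLike | null;
}

// Hypothetical helper: returns the transformed text of the first match,
// or defaultValue when the node is missing or the transform throws.
function safeQuerySync<T>(
  root: QueryRootLike,
  selector: string,
  defaultValue: T,
  transform: (text: string) => T
): T {
  const el = root.querySelector(selector);
  if (!el) return defaultValue;
  try {
    return transform(el.innerText);
  } catch {
    return defaultValue;
  }
}
```

In the browser, `document` satisfies `QueryRootLike` only loosely (its `querySelector` returns `Element`), so a real version would use the typed `querySelector<HTMLElement>` form.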

🧹 Nitpick comments (4)

src/types/TestProperties.ts (1)
1-7: Consider simplifying interface structure.

The interface definition has unnecessary blank lines (lines 4 and 6) that could be removed for conciseness. The interface structure is correct and properly imports the ScraperConstructor type from the appropriate location.
import { ScraperConstructor } from "../webscraper/Scraper";

export default interface TestProperties {
-
    scraperConstructorProperties?: ScraperConstructor;
-    
}
test.ts (1)
19-21: Review default headless value and consider adding validation for pageLoadTimeout.

The default headless mode is set to true, which is a sensible default. However, there's no validation for pageLoadTimeout which could lead to issues if invalid values are provided.

Consider adding validation for pageLoadTimeout to ensure it's a positive number:
let headless = (["true", "false"].includes(args.headless)) ? args.headless === "true" : true;
-let pageLoadTimeout = args.pageLoadTimeout || null;
+let pageLoadTimeout = args.pageLoadTimeout ? parseInt(args.pageLoadTimeout, 10) : null;
+if (pageLoadTimeout !== null && (isNaN(pageLoadTimeout) || pageLoadTimeout <= 0)) {
+    console.warn("Invalid pageLoadTimeout provided, using default timeout");
+    pageLoadTimeout = null;
+}
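
The same validation can be pulled into small pure functions, which makes it easy to test. This is a sketch under assumptions: `parseHeadless` and `parsePageLoadTimeout` are hypothetical names, and the strict `Number` conversion (which rejects inputs such as "15s" that `parseInt` would accept as 15) is a design choice of this example, not something the PR specifies.

```typescript
// Hypothetical helper: only the literal string "false" disables headless mode,
// matching the default-to-true behavior described above.
function parseHeadless(raw: unknown): boolean {
  return raw !== "false";
}

// Hypothetical helper: returns a positive whole number of milliseconds,
// or null when the argument is absent or invalid.
function parsePageLoadTimeout(raw: unknown): number | null {
  if (raw === undefined || raw === null || raw === "") return null;
  const value = typeof raw === "number" ? raw : Number(String(raw));
  if (!isFinite(value) || value <= 0) return null;
  return Math.floor(value);
}
```

Returning `null` lets the scraper fall back to its own default timeout instead of failing on a bad flag.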
src/webscraper/Scraper.ts (2)
120-136: Fix timeout error message inconsistency.

The error message mentions a 30-second timeout but the default pageLoadTimeout is set to 15000ms (15 seconds). This inconsistency could be confusing.
if (error.message.includes("Timeout")) {

    this.error(
-        `Timeout of 30s while navigating to ${scrapeQuery.url}. Skipping waiting for page load and executing query function.`
+        `Timeout of ${this.pageLoadTimeout/1000}s while navigating to ${scrapeQuery.url}. Skipping waiting for page load and executing query function.`
    );

}
122-122: Consider more robust error type checking.

Checking for timeout errors by looking for the string "Timeout" in the error message is brittle and may break if Playwright changes their error message format.

Consider using a more reliable method to check for timeout errors:
-if (error.message.includes("Timeout")) {
+if (error.name === "TimeoutError" || error.message.includes("Timeout")) {
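
Both Scraper.ts suggestions can be combined into two small helpers. This is an illustrative sketch: `isTimeoutError` and `describeTimeout` are hypothetical names, and the check relies only on the error's `name` and `message` rather than importing Playwright, so it would need adjusting if the project prefers an `instanceof` check against the library's own error class.

```typescript
// Hypothetical helper: treats an error as a timeout if its name says so,
// falling back to a case-insensitive message match.
function isTimeoutError(error: unknown): boolean {
  if (!(error instanceof Error)) return false;
  return error.name === "TimeoutError" || /timeout/i.test(error.message);
}

// Hypothetical helper: builds the log message from the configured timeout
// instead of a hard-coded "30s".
function describeTimeout(url: string, pageLoadTimeoutMs: number): string {
  return `Timeout of ${pageLoadTimeoutMs / 1000}s while navigating to ${url}.`;
}
```

With the default of 15000 ms, the message reports 15s, which removes the mismatch noted above.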

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between aedc95e and 194faf7.

📒 Files selected for processing (9)

README.md (1 hunks)
data/numbers.ts (1 hunks)
src/types/TestProperties.ts (1 hunks)
src/webscraper/Scraper.ts (6 hunks)
test.ts (3 hunks)
tests/check_formats.test.ts (1 hunks)
tests/check_order.test.ts (1 hunks)
tests/check_repeating_domain.test.ts (1 hunks)
tests/scrape_and_check.test.ts (2 hunks)

🧰 Additional context used

🧬 Code Graph Analysis (4)

tests/check_formats.test.ts (4)
- tests/check_order.test.ts (1): data (5-43)
- tests/check_repeating_domain.test.ts (1): data (5-60)
- src/types/ScrapeQuery.ts (1): ScrapeQuery (1-5)
- src/types/TestProperties.ts (1): TestProperties (3-7)

tests/check_order.test.ts (2)
- src/types/ScrapeQuery.ts (1): ScrapeQuery (1-5)
- src/types/TestProperties.ts (1): TestProperties (3-7)

src/types/TestProperties.ts (1)
- src/webscraper/Scraper.ts (1): ScraperConstructor (9-15)

test.ts (3)
- tests/scrape_and_check.test.ts (1): data (6-59)
- src/types/ScrapeQuery.ts (1): ScrapeQuery (1-5)
- src/types/TestProperties.ts (1): TestProperties (3-7)

🪛 GitHub Actions: Webscraped Results

tests/check_formats.test.ts