Add properties to tests, add additional params for tests, fix broken numbers.#9
Add properties to tests, add additional params for tests, fix broken numbers.#9SuppliedOrange merged 2 commits intomainfrom
Conversation
WalkthroughThe changes introduce enhanced configurability for the web scraper by allowing test runners to specify Changes
Sequence Diagram(s)sequenceDiagram
participant User
participant CLI/Test Runner
participant TestFunction
participant Scraper
User->>CLI/Test Runner: Run test with --headless/--pageLoadTimeout
CLI/Test Runner->>TestFunction: Call with data, {scraperConstructorProperties}
TestFunction->>Scraper: Instantiate with options (headless, pageLoadTimeout)
Scraper->>Scraper: Launch browser (headless), set timeout
Scraper->>Scraper: Fetch page with timeout
Scraper-->>TestFunction: Return result
TestFunction-->>CLI/Test Runner: Return test result
CLI/Test Runner-->>User: Output results
Poem
✨ Finishing Touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Actionable comments posted: 4
🔭 Outside diff range comments (2)
tests/check_order.test.ts (1)
1-43: 💡 Verification agent🧩 Analysis chain
Note the pipeline failure in CI
While your code changes look good, there appears to be a test failure in the CI pipeline. This isn't related to your parameter changes, but you might need to investigate:
Test Failure: Number 5 failed to scrape. Error: Failed to evaluate query function due to TypeError: Cannot read properties of null (reading 'innerText') during page.evaluate.
🏁 Script executed:
#!/bin/bash # Let's locate the definition for number 5 in the data to understand what might be failing # Search for number 5 definition in the data files echo "Looking for number 5 definition in data files..." rg -A 10 -B 2 "number: 5" --type ts # Check if any query functions use innerText on potentially null elements echo "Checking for potential null element access in query functions..." rg -A 5 -B 5 "innerText" --type ts | grep -C 2 "number: 5"Length of output: 1240
Guard against null in number 5’s queryFunction
The CI failure shows that for number 5,
document.querySelector(".we-customer-ratings__averages__display")can returnnull, leading to aTypeErrorwhen accessing.innerText. You’ll need to add a null‐check or fallback in thequeryFunctionfor number 5:• Location to update:
- data/numbers.ts, the object with
number: 5• Suggested diff:
queryFunction: () => { - return Math.ceil(document.querySelector(".we-customer-ratings__averages__display").innerText) + const el = document.querySelector(".we-customer-ratings__averages__display"); + if (!el) { + throw new Error("Element .we-customer-ratings__averages__display not found for number 5"); + } + return Math.ceil(el.innerText); }This will prevent the null‐access error during
page.evaluateand make the test more robust.data/numbers.ts (1)
1-63: 💡 Verification agent🧩 Analysis chain
Consider adding retry logic and error handling across all query functions.
None of the query functions have proper error handling for cases where elements aren't found or have unexpected content.
Consider implementing a helper function to standardize error handling across all queries:
function safeQuery(selector: string, defaultValue: string, transform?: (text: string) => string): string { const element = document.querySelector(selector); if (!element) return defaultValue; const text = element.innerText; return transform ? transform(text) : text; }Then update each query function to use this pattern for more consistent error handling.
Please check if the websites being scraped have robots.txt files that might prohibit scraping:
🏁 Script executed:
#!/bin/bash # Check robots.txt files for the three updated URLs echo "Checking PokemonDB robots.txt" curl -s https://pokemondb.net/robots.txt echo -e "\nChecking Apple App Store robots.txt" curl -s https://apps.apple.com/robots.txt echo -e "\nChecking Microsoft robots.txt" curl -s https://dotnet.microsoft.com/robots.txtLength of output: 1645
Add standardized error handling and confirm robots.txt compliance
I’ve checked the robots.txt for all three domains—
- pokemondb.net:
/pokedex/*isn’t disallowed (only/pokebase/*is blocked, with a crawl-delay of 2)- apps.apple.com: your
/us/app/...path is allowed- dotnet.microsoft.com: only
/download/thank-you/is disallowed, not/en-us/download/dotnet/6.0To improve resilience when elements are missing or text is malformed, introduce a helper in, say,
src/utils/safeQuery.ts:export function safeQuery<T>( selector: string, defaultValue: T, transform: (text: string, el: Element) => T = text => (text as unknown as T), retries = 1, delayMs = 200 ): Promise<T> { return new Promise((resolve) => { const attempt = (n: number) => { const el = document.querySelector(selector); if (el) { try { resolve(transform(el.innerText, el)); } catch { resolve(defaultValue); } } else if (n > 0) { setTimeout(() => attempt(n - 1), delayMs); } else { resolve(defaultValue); } }; attempt(retries); }); }Then in
data/numbers.ts, refactor eachqueryFunction:- queryFunction: () => { - return document.querySelector("td").innerText.substring(3,4) - } + queryFunction: () => + safeQuery("td", "0", text => text.substring(3, 4)),• Wrap all
document.querySelector/querySelectorAllcalls insafeQuery
• Provide sensible defaults ("0","", etc.) and optional retry counts
• Centralize error handling for missing nodes or transform failures
🧹 Nitpick comments (4)
src/types/TestProperties.ts (1)
1-7: Consider simplifying interface structure.The interface definition has unnecessary blank lines (lines 4 and 6) that could be removed for conciseness. The interface structure is correct and properly imports the ScraperConstructor type from the appropriate location.
import { ScraperConstructor } from "../webscraper/Scraper"; export default interface TestProperties { - scraperConstructorProperties?: ScraperConstructor; - }test.ts (1)
19-21: Review default headless value and consider adding validation for pageLoadTimeout.The default headless mode is set to true, which is a sensible default. However, there's no validation for pageLoadTimeout which could lead to issues if invalid values are provided.
Consider adding validation for pageLoadTimeout to ensure it's a positive number:
let headless = (["true", "false"].includes(args.headless)) ? args.headless === "true" : true; -let pageLoadTimeout = args.pageLoadTimeout || null; +let pageLoadTimeout = args.pageLoadTimeout ? parseInt(args.pageLoadTimeout) : null; +if (pageLoadTimeout !== null && (isNaN(pageLoadTimeout) || pageLoadTimeout <= 0)) { + console.warn("Invalid pageLoadTimeout provided, using default timeout"); + pageLoadTimeout = null; +}src/webscraper/Scraper.ts (2)
120-136: Fix timeout error message inconsistency.The error message mentions a 30-second timeout but the default pageLoadTimeout is set to 15000ms (15 seconds). This inconsistency could be confusing.
if (error.message.includes("Timeout")) { this.error( - `Timeout of 30s while navigating to ${scrapeQuery.url}. Skipping waiting for page load and executing query function.` + `Timeout of ${this.pageLoadTimeout/1000}s while navigating to ${scrapeQuery.url}. Skipping waiting for page load and executing query function.` ); }
122-122: Consider more robust error type checking.Checking for timeout errors by looking for the string "Timeout" in the error message is brittle and may break if Playwright changes their error message format.
Consider using a more reliable method to check for timeout errors:
-if (error.message.includes("Timeout")) { +if (error.name === "TimeoutError" || error.message.includes("Timeout")) {
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (9)
README.md(1 hunks)data/numbers.ts(1 hunks)src/types/TestProperties.ts(1 hunks)src/webscraper/Scraper.ts(6 hunks)test.ts(3 hunks)tests/check_formats.test.ts(1 hunks)tests/check_order.test.ts(1 hunks)tests/check_repeating_domain.test.ts(1 hunks)tests/scrape_and_check.test.ts(2 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (4)
tests/check_formats.test.ts (4)
tests/check_order.test.ts (1)
data(5-43)tests/check_repeating_domain.test.ts (1)
data(5-60)src/types/ScrapeQuery.ts (1)
ScrapeQuery(1-5)src/types/TestProperties.ts (1)
TestProperties(3-7)
tests/check_order.test.ts (2)
src/types/ScrapeQuery.ts (1)
ScrapeQuery(1-5)src/types/TestProperties.ts (1)
TestProperties(3-7)
src/types/TestProperties.ts (1)
src/webscraper/Scraper.ts (1)
ScraperConstructor(9-15)
test.ts (3)
tests/scrape_and_check.test.ts (1)
data(6-59)src/types/ScrapeQuery.ts (1)
ScrapeQuery(1-5)src/types/TestProperties.ts (1)
TestProperties(3-7)
🪛 GitHub Actions: Webscraped Results
tests/check_formats.test.ts
[error] 78-78: Test Failure: Number 5 failed to scrape. Error: Failed to evaluate query function due to TypeError: Cannot read properties of null (reading 'innerText') during page.evaluate.
tests/check_order.test.ts
[error] 78-78: Test Failure: Number 5 failed to scrape. Error: Failed to evaluate query function due to TypeError: Cannot read properties of null (reading 'innerText') during page.evaluate.
tests/check_repeating_domain.test.ts
[error] 78-78: Test Failure: Number 5 failed to scrape. Error: Failed to evaluate query function due to TypeError: Cannot read properties of null (reading 'innerText') during page.evaluate.
tests/scrape_and_check.test.ts
[error] 78-78: Test Failure: Number 5 failed to scrape. Error: Failed to evaluate query function due to TypeError: Cannot read properties of null (reading 'innerText') during page.evaluate.
test.ts
[error] 78-78: Test Failure: Number 5 failed to scrape. Error: Failed to evaluate query function due to TypeError: Cannot read properties of null (reading 'innerText') during page.evaluate.
🔇 Additional comments (20)
tests/check_order.test.ts (1)
2-2: LGTM: TestProperties parameter added correctlyThese changes align well with the broader refactoring to support passing optional test configuration properties. The unused parameter is correctly prefixed with an underscore following TypeScript conventions.
Also applies to: 5-5
tests/check_formats.test.ts (1)
2-2: LGTM: TestProperties parameter added consistentlyThe addition of the TestProperties import and parameter follows the same pattern as in other test files, maintaining consistency throughout the codebase.
Also applies to: 5-5
tests/check_repeating_domain.test.ts (1)
2-2: LGTM: TestProperties parameter added consistentlyThis file follows the same pattern as the other test files, correctly importing and adding the TestProperties parameter.
Also applies to: 5-5
README.md (2)
18-20: Documentation updated correctly for new parametersGreat job updating the README to document the new test properties functionality. This clearly informs users about the new optional parameters available for tests.
26-26: Minor clarity improvementThis minor wording change improves clarity for users.
test.ts (4)
5-5: LGTM: Appropriate import for TestProperties.The TestProperties import is correctly added to support the new type being used in test function signatures.
30-30: LGTM: Correct type declaration for test functions.The Map type has been properly updated to include the optional TestProperties parameter in the function signature.
41-43: LGTM: Updated test function signature to accept optional TestProperties.The test function signature has been correctly updated to include the optional TestProperties parameter while preserving backward compatibility.
58-63: LGTM: Properly passing scraper configuration properties to test functions.The test invocation properly constructs and passes TestProperties with scraperConstructorProperties containing the headless and pageLoadTimeout values from command line arguments.
tests/scrape_and_check.test.ts (5)
4-4: LGTM: Appropriate TestProperties import.The TestProperties import is correctly added to support the new type being used as a parameter.
14-14: LGTM: Useful logging of scraper constructor properties.Logging the scraper properties provides useful debug information, though it would be better with a null check on properties.
19-19: LGTM: Properly spreading constructor properties into scraper config.The spread operator correctly applies any provided scraperConstructorProperties to the Scraper constructor, giving tests flexibility to configure scraper behavior.
32-35: LGTM: Improved code readability with whitespace.The added blank lines improve readability around error handling code.
43-43: LGTM: Improved code readability with whitespace.The added blank line improves the visual separation between error handling code and the try/catch block.
src/webscraper/Scraper.ts (6)
13-14: LGTM: Good extension of the ScraperConstructor interface.The interface correctly adds optional headless and pageLoadTimeout properties to allow configurable scraper behavior.
29-30: LGTM: Added private properties with sensible default for pageLoadTimeout.The private properties are correctly added with a sensible default timeout of 15 seconds.
44-45: LGTM: Updated JSDoc with new parameters.The JSDoc comment has been properly updated to document the new headless and pageLoadTimeout parameters.
54-55: LGTM: Properly setting properties in constructor.The constructor correctly sets the headless and pageLoadTimeout properties when provided, with an undefined check for the boolean headless property.
72-74: LGTM: Properly applying headless mode to browser launch.The headless configuration is correctly applied when launching the browser.
113-116: LGTM: Properly setting page timeout during navigation.The pageLoadTimeout is correctly applied to the page navigation options.
| import TestProperties from "../src/types/TestProperties"; | ||
|
|
||
| export default async ( data: ScrapeQuery[] ) => { | ||
| export default async ( data: ScrapeQuery[], properties: TestProperties ) => { |
There was a problem hiding this comment.
🛠️ Refactor suggestion
Add null check for the properties parameter.
The function accepts a properties parameter but doesn't check if it's undefined or null before accessing its properties, which could lead to runtime errors.
-export default async ( data: ScrapeQuery[], properties: TestProperties ) => {
+export default async ( data: ScrapeQuery[], properties?: TestProperties ) => {
+ properties = properties || {};📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| export default async ( data: ScrapeQuery[], properties: TestProperties ) => { | |
| export default async ( data: ScrapeQuery[], properties?: TestProperties ) => { | |
| properties = properties || {}; | |
| // …rest of implementation… |
| url: "https://pokemondb.net/pokedex/charmander", | ||
| queryFunction: () => { | ||
| return Array.from( | ||
| document.querySelectorAll(".user-action-time > span.relativetime") ) | ||
| .map(x => x.innerText.split(" ")[1] | ||
| ).filter( x => x[0] == 4 )[0][0] | ||
| return document.querySelector("td").innerText.substring(3,4) | ||
| } |
There was a problem hiding this comment.
🛠️ Refactor suggestion
Add more specific selector for improved robustness.
The current implementation selects the first td element on the page without any specific filtering. This approach is fragile as it could break if the page structure changes or if the order of table cells is modified.
Consider using a more specific selector that targets the exact data you need:
- url: "https://pokemondb.net/pokedex/charmander",
queryFunction: () => {
- return document.querySelector("td").innerText.substring(3,4)
+ return document.querySelector("table.vitals-table td:nth-child(1)").innerText.substring(3,4)
}Also, add a comment explaining what this substring is extracting for better maintainability.
Committable suggestion skipped: line range outside the PR's diff.
| url: "https://dotnet.microsoft.com/en-us/download/dotnet/6.0", | ||
| queryFunction: () => { | ||
| return document.getElementById("splash-stars").innerText.split(" ").length + 1 | ||
| return document.querySelector("h1").innerText.substring(14) | ||
| } |
There was a problem hiding this comment.
🛠️ Refactor suggestion
Improve robustness of .NET version extraction.
The current implementation relies on a fixed substring position (14) which is brittle and may break if the heading text changes.
Use a regex or more targeted approach:
url: "https://dotnet.microsoft.com/en-us/download/dotnet/6.0",
queryFunction: () => {
- return document.querySelector("h1").innerText.substring(14)
+ const h1Text = document.querySelector("h1")?.innerText || "";
+ const match = h1Text.match(/\.NET (\d+)\.(\d+)/);
+ return match ? match[1] : "6"; // Extract the major version or use default
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| url: "https://dotnet.microsoft.com/en-us/download/dotnet/6.0", | |
| queryFunction: () => { | |
| return document.getElementById("splash-stars").innerText.split(" ").length + 1 | |
| return document.querySelector("h1").innerText.substring(14) | |
| } | |
| url: "https://dotnet.microsoft.com/en-us/download/dotnet/6.0", | |
| queryFunction: () => { | |
| const h1Text = document.querySelector("h1")?.innerText || ""; | |
| const match = h1Text.match(/\.NET (\d+)\.(\d+)/); | |
| return match ? match[1] : "6"; // Extract the major version or use default | |
| } |
| url: "https://apps.apple.com/us/app/nyan-cat-lost-in-space/id433592592", | ||
| queryFunction: () => { | ||
| let [hp, atk] = document.querySelector("a[href='/dex/sm/pokemon/bewear/']") | ||
| .parentElement.parentElement.querySelectorAll(".PokemonAltRow-hp, .PokemonAltRow-atk"); | ||
| return atk.querySelector("span").innerText - hp.querySelector("span").innerText | ||
| return Math.ceil(document.querySelector(".we-customer-ratings__averages__display").innerText) | ||
| } |
There was a problem hiding this comment.
🛠️ Refactor suggestion
Add error handling to the rating extraction.
The current implementation doesn't handle cases where the rating element might not be present or have an unexpected format.
Add error handling and make the selector more specific:
url: "https://apps.apple.com/us/app/nyan-cat-lost-in-space/id433592592",
queryFunction: () => {
- return Math.ceil(document.querySelector(".we-customer-ratings__averages__display").innerText)
+ const ratingElement = document.querySelector(".we-customer-ratings__averages__display");
+ if (!ratingElement) return '5'; // Default fallback
+ const rating = parseFloat(ratingElement.innerText);
+ return isNaN(rating) ? '5' : Math.ceil(rating).toString();
}Committable suggestion skipped: line range outside the PR's diff.
Summary by CodeRabbit
New Features
Documentation
Refactor