sweep: provide me all css selectors for content for selectolax #10

Closed
4 tasks done
Hardeepex opened this issue Jan 3, 2024 · 2 comments · Fixed by #11 or #12
Labels
sweep Sweep your software chores

Comments

@Hardeepex
Owner

Hardeepex commented Jan 3, 2024

<iframe id="google_ads_iframe_/1030735/redflagdeals/hotdeals_2" name="google_ads_iframe_/1030735/redflagdeals/hotdeals_2" title="3rd party ad content" width="1" height="4" scrolling="no" marginwidth="0" marginheight="0" frameborder="0" role="region" aria-label="Advertisement" tabindex="0" allow="attribution-reporting" sandbox="allow-forms allow-popups allow-popups-to-escape-sandbox allow-pointer-lock allow-same-origin allow-scripts allow-top-navigation-by-user-activation" style="border: 0px; vertical-align: bottom;" data-load-complete="true" data-google-container-id="4"></iframe>
<style type="text/css">#customStyles .list_item_body{text-align: left;}#customStyles h2.offer_title {font-size: 1rem; letter-spacing: 0;line-height: 1.4;font-weight: 600; margin: 0;}#customStyles .list_item_body li {list-style-type: none; padding: 5px 0 5px 20px; position: relative;}#customStyles .list_item_body {font-size: .75rem; min-height: 90px;}#customStyles .list_item_body strong {font-weight: 600;}#customStyles .sponsored_flag {background-color: #eaeaea; color: red; float: right; font-size: 11px; font-weight: 600; padding: 5px 10px; font-style: italic;}#customStyles { margin-bottom: 30px; padding: 15px 0 0 15px; border-bottom: 1px solid #e2e2e2;}#customStyles .list_item_body ul {display: none;}#customStyles .list_item_body li:before { content: '\f0da'; font-family: FontAwesome; font-size: .75rem;left: 0; padding-right: 15px; position: absolute;top: 7px;}#customStyles .offer_description a {color: #444;text-decoration: none;}@media only screen and (min-width: 40.063em){#customStyles .list_item_body ul {display: block;}}</style>
SPONSORED
Next

Displaying 120 of 260 Deals

Checklist
  • Modify docs/examples/tutorial/redflagdeals_scraper.py (a53e451)
  • Running GitHub Actions for docs/examples/tutorial/redflagdeals_scraper.py
  • Create tests/live/test_new_scraper.py (de65ed7)
  • Running GitHub Actions for tests/live/test_new_scraper.py
@sweep-ai sweep-ai bot added the sweep Sweep your software chores label Jan 3, 2024
Contributor

sweep-ai bot commented Jan 3, 2024

🚀 Here's the PR! #12

See Sweep's progress at the progress dashboard!
💎 Sweep Pro: I'm using GPT-4. You have unlimited GPT-4 tickets. (tracking ID: 4eab7df4bc)


Sandbox Execution ✓

Here are the sandbox execution logs prior to making any changes:

Sandbox logs for f0a2d7b
Checking docs/examples/tutorial/redflagdeals_scraper.py for syntax errors...
✅ docs/examples/tutorial/redflagdeals_scraper.py has no syntax errors! (1/1 ✓)

Sandbox passed on the latest main, so sandbox checks will be enabled for this issue.


Step 1: 🔎 Searching

I found the following snippets in your repository. I will now analyze these snippets and come up with a plan.

Some code snippets I think are relevant in decreasing order of relevance (click to expand). If some file is missing from here, you can mention the path in the ticket description.

    },
    extra_preprocessors=[CSS("div.list_item")],
)

# Define the SchemaScraper for the single deal pages
deal_scraper = SchemaScraper(
    {
        "title": "str",
        "url": "url",
        "price": "float",
        "regular_price": "float",
        "details": "str",
    },

actor_schema = {
    "actor": "string",
    "image": "url",
    "roles": {"name": "string", "character": "string"},

If you visit the page <https://comedybangbang.fandom.com/wiki/Operation_Golden_Orb> viewing the source will reveal that all of the interesting content is in an element `<div id="content" class="page-content">`.
Just as we might if we were writing a real scraper, we'll write a CSS selector to grab this element, `div.page-content` will do.
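
To make the selector step above concrete, here is a minimal sketch that pulls the text out of the first `div` carrying the `page-content` class. It uses Python's standard-library `html.parser` as a stand-in for a real CSS engine, and the sample markup, the `ClassTextExtractor` class, and the `select_text` helper are all hypothetical names introduced for illustration.

```python
from html.parser import HTMLParser

# Hypothetical markup, loosely shaped like the page described above.
SAMPLE = (
    '<html><body><div id="nav">Menu</div>'
    '<div id="content" class="page-content"><h1>Title</h1>'
    '<div>Nested</div> text</div>'
    '<div>Footer</div></body></html>'
)


class ClassTextExtractor(HTMLParser):
    """Collect text inside the first <tag> whose class list contains cls."""

    def __init__(self, tag, cls):
        super().__init__()
        self.tag = tag
        self.cls = cls
        self.depth = 0      # nesting depth of `tag` inside the match; 0 means outside
        self.done = False   # set once the first match has been closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth:
            # Already inside the match: track nested elements of the same tag.
            if tag == self.tag:
                self.depth += 1
        elif tag == self.tag and self.cls in (dict(attrs).get("class") or "").split():
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag == self.tag:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def select_text(markup, tag, cls):
    """Return whitespace-normalised text of the first matching element."""
    parser = ClassTextExtractor(tag, cls)
    parser.feed(markup)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())
```

Calling `select_text(SAMPLE, "div", "page-content")` returns only the text inside the content element and skips the navigation and footer. With selectolax installed, the equivalent lookup should be `selectolax.parser.HTMLParser(html).css_first("div.page-content")`; that call is not exercised in this sketch.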

I also found the following external resources that might be helpful:

Summaries of links found in the content:

https://w.dam-img.rfdcontent.com/offers/013/736/864/200x200_pad.jpg:

The page contains a list of deals from different merchants. The first deal is a sponsored offer for a Samsung 77" OLED 4K Smart TV, which is $1000 off. The offer includes a description of the TV's features such as deep blacks, clean whites, and full shades of lively colors. The second deal is from Kitchen Stuff Plus, offering 50% off the Ballarini Bologna Non-Stick Wok & Frypan Set and more. The third deal is from the PlayStation Store, offering free monthly games for PlayStation Plus members. The page also includes pagination for navigating through the deals.

https://q.dam-img.rfdcontent.com/offers/013/736/863/100x100_pad.jpg:

The page contains a list of deals from different merchants. The first deal is from Kitchen Stuff Plus, offering 50% off on the Ballarini Bologna Non-Stick Wok & Frypan Set. The second deal is from PlayStation Store, offering free monthly games for PlayStation Plus members. The page also includes pagination for navigating through the deals.

https://p.dam-img.rfdcontent.com/offers/013/736/864/100x100_pad.jpg:

The page contains information about two deals: a $1000 off 77" OLED 4K Smart TV from Samsung and 50% off Ballarini Bologna Non-Stick Wok & Frypan Set from Kitchen Stuff Plus. The Samsung TV features OLED technology with deep blacks, clean whites, and full shades of lively colors. It also has 4K AI upscaling and a Dolby Atmos experience. The Kitchen Stuff Plus deal includes various red hot deals. The page also includes pagination for more deals and a footer with additional information.

https://t.dam-img.rfdcontent.com/offers/013/736/863/100x100_pad.jpg:

The page contains a list of deals from different merchants. The first deal is from Kitchen Stuff Plus, offering 50% off on the Ballarini Bologna Non-Stick Wok & Frypan Set. The second deal is from PlayStation Store, offering free monthly games for PlayStation Plus members. The page also includes pagination for navigating through the deals.

https://tpc.googlesyndication.com/simgad/461817906561256968:

The page contains a list of deals from different merchants. The first deal is a $1000 off 77" OLED 4K Smart TV from Samsung. The deal includes a description of the TV's features such as deep blacks, clean whites, and full shades of lively colors. The second deal is about free monthly games for PlayStation Plus subscribers. The games mentioned are A Plague Tale Requiem and Evil West. The page also includes pagination for navigating through the deals.

https://h.dam-img.rfdcontent.com/offers/013/736/863/200x200_pad.jpg:

The page contains a list of deals from different merchants. The first deal is a sponsored offer for a $1000 discount on a 77" OLED 4K Smart TV from Samsung. The second deal is from Kitchen Stuff Plus, offering 50% off on a Ballarini Bologna Non-Stick Wok & Frypan Set. The third deal is from PlayStation Store, offering free monthly games for PlayStation Plus members. The page also includes pagination for navigating through the deals.

https://o.dam-img.rfdcontent.com/offers/013/736/864/100x100_pad.jpg:

The page contains information about two deals: a $1000 off 77" OLED 4K Smart TV from Samsung and 50% off Ballarini Bologna Non-Stick Wok & Frypan Set from Kitchen Stuff Plus. The Samsung TV features OLED technology with deep blacks, clean whites, and full shades of lively colors. It also has 4K AI upscaling and a Dolby Atmos experience. The Kitchen Stuff Plus deal includes various red hot deals. The page also includes pagination for more deals and a footer with additional information.


Step 2: ⌨️ Coding

  • Modify docs/examples/tutorial/redflagdeals_scraper.py (a53e451)
Modify docs/examples/tutorial/redflagdeals_scraper.py with contents:
• Modify the SchemaScraper definition to include the additional fields the user wants to scrape. The modified schema should look like this: { "title": "str", "url": "url", "image": "url", "description": "str", "price": "float", "regular_price": "float", "details": "str", }
• Add the appropriate CSS selectors to the extra_preprocessors list to target the correct elements on the webpage. The selectors should target the elements containing the title, URL, image, and description for each deal. The modified extra_preprocessors list should look like this: extra_preprocessors=[CSS("div.list_item"), CSS("a.offer_image"), CSS("h2.offer_title"), CSS("p.offer_description")]
--- 
+++ 
@@ -11,7 +11,7 @@
         "dealer": "str",
         "comments_count": "int",
     },
-    extra_preprocessors=[CSS("div.list_item")],
+    extra_preprocessors=[CSS("div.list_item"), CSS("a.offer_image"), CSS("h2.offer_title"), CSS("p.offer_description")],
 )
 
 # Define the SchemaScraper for the single deal pages
@@ -19,6 +19,8 @@
     {
         "title": "str",
         "url": "url",
+        "image": "url",
+        "description": "str",
         "price": "float",
         "regular_price": "float",
         "details": "str",
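
As a rough illustration of how the selectors named in this change could map onto the new schema fields, the sketch below extracts `title`, `url`, `image`, and `description` from each `div.list_item`. It is an assumption-heavy example: the fixture markup is hypothetical (the real page structure may differ), `extract_deals` is an illustrative helper rather than part of the scraper, and it uses the standard-library `xml.etree.ElementTree` with exact `class` attribute matches instead of a real CSS selector engine.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fixture; the live page markup may differ.
SAMPLE = """
<div id="deals">
  <div class="list_item">
    <a class="offer_image" href="/deals/example-tv/"><img src="https://example.com/tv.jpg" /></a>
    <h2 class="offer_title">Example TV deal</h2>
    <p class="offer_description">Save on an example TV.</p>
  </div>
  <div class="list_item">
    <a class="offer_image" href="/deals/example-wok/"><img src="https://example.com/wok.jpg" /></a>
    <h2 class="offer_title">Example wok deal</h2>
    <p class="offer_description">Half off an example wok.</p>
  </div>
</div>
"""


def extract_deals(markup):
    """Return one dict per deal, querying each field inside its own list item."""
    root = ET.fromstring(markup.strip())
    deals = []
    # [@class='...'] is an exact attribute match, unlike a CSS class selector.
    for item in root.findall(".//div[@class='list_item']"):
        link = item.find("a[@class='offer_image']")
        img = link.find("img") if link is not None else None
        deals.append({
            "title": item.findtext("h2[@class='offer_title']"),
            "url": link.get("href") if link is not None else None,
            "image": img.get("src") if img is not None else None,
            "description": item.findtext("p[@class='offer_description']"),
        })
    return deals
```

The point of the sketch is that each field is looked up relative to its own `div.list_item`, so the title, link, image, and description stay grouped per deal.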
  • Running GitHub Actions for docs/examples/tutorial/redflagdeals_scraper.py
Check docs/examples/tutorial/redflagdeals_scraper.py with contents:

Ran GitHub Actions for a53e451d9659b74837cc2291a418d7802299642b:

  • Create tests/live/test_new_scraper.py (de65ed7)
Create tests/live/test_new_scraper.py with contents:
• Create a new test case to ensure that the modified scraper correctly extracts the desired data from the webpage.
• Import the necessary libraries and modules at the beginning of the file. This should include unittest and the modified scraper from the redflagdeals_scraper.py file.
• Define a new class for the test case that inherits from unittest.TestCase.
• Within this class, define a new method for the test case. This method should use the modified scraper to scrape data from a test webpage that contains the same structure as the actual webpage the user wants to scrape.
• The test case should assert that the scraped data matches the expected data for each field in the schema. The expected data should be manually defined within the test case based on the content of the test webpage.
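
The test plan above could be sketched roughly as follows. This is not the generated `tests/live/test_new_scraper.py`: `scrape_fixture` is a hypothetical stand-in for the modified scraper, the fixture markup is invented for illustration, and `run_tests` is just a small helper for running the case programmatically.

```python
import unittest
import xml.etree.ElementTree as ET

# Hypothetical test page with the same field layout as the schema.
FIXTURE = """<div class="list_item">
  <a class="offer_image" href="https://example.com/deal"><img src="https://example.com/deal.jpg" /></a>
  <h2 class="offer_title">Example deal</h2>
  <p class="offer_description">An example description.</p>
</div>"""


def scrape_fixture(markup):
    """Hypothetical stand-in for the modified scraper (assumption)."""
    item = ET.fromstring(markup)
    return {
        "title": item.findtext("h2[@class='offer_title']"),
        "url": item.find("a[@class='offer_image']").get("href"),
        "image": item.find("a[@class='offer_image']/img").get("src"),
        "description": item.findtext("p[@class='offer_description']"),
    }


class TestNewScraper(unittest.TestCase):
    def test_fields_match_expected(self):
        # Expected values are defined by hand from the fixture above.
        expected = {
            "title": "Example deal",
            "url": "https://example.com/deal",
            "image": "https://example.com/deal.jpg",
            "description": "An example description.",
        }
        self.assertEqual(scrape_fixture(FIXTURE), expected)


def run_tests():
    """Run the test case and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestNewScraper)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Keeping the fixture inline means the test does not depend on the live site, which changes daily; a live test would need looser assertions, such as checking field types rather than exact values.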
  • Running GitHub Actions for tests/live/test_new_scraper.py
Check tests/live/test_new_scraper.py with contents:

Ran GitHub Actions for de65ed7faa208042b58c260aa323d2e2a4f36141:


Step 3: 🔁 Code Review

I have finished reviewing the code for completeness. I did not find errors for sweep/provide_me_all_css_selectors_for_content_1.


🎉 Latest improvements to Sweep:

  • We just released a dashboard to track Sweep's progress on your issue in real-time, showing every stage of the process – from search to planning and coding.
  • Sweep uses OpenAI's latest Assistant API to plan code changes and modify code! This is 3x faster and significantly more reliable as it allows Sweep to edit code and validate the changes in tight iterations, the same way as a human would.
  • Try using the GitHub issues extension to create Sweep issues directly from your editor! GitHub Issues and Pull Requests.

💡 To recreate the pull request edit the issue title or description. To tweak the pull request, leave a comment on the pull request.

@Hardeepex
Owner Author

sweep: How will this work in practice? For example, I need it for this URL: https://www.redflagdeals.com/deals/
