- Python 3.10+
- An API key for `scrapegraph_py` (SGAI). Get one from ScrapeGraphAI.
- Clone the repo and enter the folder.
- Install dependencies (choose one):
  ```bash
  # Using uv (recommended if you have uv)
  uv sync

  # Or using pip
  pip install -e .
  ```
- Configure your API key:
  - Option A: Open `main.py` and set your key where the client is initialized.
  - Option B: Set `SGAI_API_KEY` in a `.env` file (see the sketch below).
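A minimal sketch of what the client setup can look like, assuming the standard `scrapegraph_py` `Client` import; the `python-dotenv` usage is only an assumption made to illustrate the `.env` option, and the exact code in `main.py` may differ:

```python
# Sketch only: the real initialization in main.py may look different.
import os

from dotenv import load_dotenv      # assumption: .env loading via python-dotenv
from scrapegraph_py import Client

load_dotenv()  # Option B: pick up SGAI_API_KEY from a local .env file

# Option A: replace the environment lookup with a hard-coded key string.
client = Client(api_key=os.environ["SGAI_API_KEY"])
```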
- Run the scraper: `python main.py`
- Pages: Adjust `TOTAL_PAGES` at the top of `main.py` for the first step.
- Concurrency: Update `CONCURRENCY` to control parallel requests (both constants are shown in the sketch below).
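Roughly, the tuning knobs near the top of `main.py` look like this; only `TOTAL_PAGES` and `CONCURRENCY` are named in this README, the other identifiers and values are hypothetical:

```python
# Illustrative values only.
TOTAL_PAGES = 5      # number of listing result pages to scan in the first step
CONCURRENCY = 10     # maximum number of detail requests running in parallel

LISTING_URL = "https://www.example.com/phones"                  # hypothetical listing URL
LISTING_PROMPT = "List every phone with its product page URL"   # hypothetical prompt
```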
- Configure pagination
  - Set the website URL and prompt to match your requirements.
  - Set `TOTAL_PAGES` to control how many result pages to scan.
- Request listing data
  - Calls `client.smartscraper(...)` with `output_schema=MainSchema` and `total_pages=TOTAL_PAGES` (see the first sketch after this list, which also covers the URL extraction below).
- Extract URLs
  - Iterates `response.result.pages[*].result.phones` and collects the product URLs.
- Helpers
  - `format_value_as_text(...)` normalizes values.
  - `build_row_from_detail(...)` maps a detail response to a CSV row (see the second sketch after this list).
- Async parallel detail scraping
  - `CONCURRENCY` sets max parallelism.
  - `fetch_one(...)` runs `client.smartscraper` in a worker thread via `asyncio.to_thread`.
  - `scrape_all(...)` uses a semaphore and `asyncio.gather` to fetch all URLs (see the last sketch after this list).
- Write CSV
  - Calls `asyncio.run(scrape_all(product_urls))`, then writes rows with `csv.DictWriter`.
- Be mindful of target site policies and rate limits.
- You can change the listing and detail prompts in `main.py` to extract exactly what you need.
- You can also adjust `MainSchema` and the CSV columns (`build_row_from_detail(...)` and `fieldnames`) to fit your data model, as in the sketch below.
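For instance, adding a field usually means touching the schema, the row builder, and the column list together. A hypothetical sketch, assuming `MainSchema` is built from Pydantic models (which is what `output_schema` typically expects); the real model in `main.py` nests results as `response.result.pages[*].result.phones`, so mirror its structure and field names rather than this flat example:

```python
from pydantic import BaseModel

class Phone(BaseModel):
    url: str
    name: str | None = None
    price: str | None = None
    ram: str | None = None    # example of a newly added field

# Keep the CSV side in sync: add "ram" to fieldnames and map it in
# build_row_from_detail(...) so the new column is actually written.
fieldnames = ["url", "name", "price", "ram"]
```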
MIT