A smart desktop web scraper built with Python, PyQt6, and Playwright. ScrapeBot follows gallery and search result pages to their source images, downloads full-size photos, and automatically pages through results — all while respecting a hardcoded rate limit.
- Two-phase crawler — detects item/detail page links on listing pages, follows each one to find and download the full-size source image, then backtracks and moves to the next
- Smart pagination — automatically detects and clicks Next buttons, numbered pagination, rel="next" links, and Load More buttons
- 6-strategy image detection — finds the best image on any detail page using direct download links, og:image, JSON-LD structured data, data-* attributes, srcset, and the largest visible <img>
- Strict quality filters — minimum 500×500 resolution, verified photo MIME type (JPEG/PNG/WebP/GIF/AVIF/BMP), minimum 10 KB file size
- Thumbnail rejection — skips images with thumbnail URL patterns or dimensions at or below the minimum threshold
- Hardcoded 5-second delay between every image download — never skipped, even across page transitions
- JavaScript-rendered pages — uses a real Chromium browser via Playwright, so dynamic and JS-heavy sites work out of the box
- Lazy image support — scrolls pages gradually to trigger lazy-loaded content before extracting
- Dark-themed desktop UI — clean PyQt6 interface with live log, progress bar, and real-time counters
- 4 parallel scrape tabs — run up to four independent scrape sessions at the same time in one window
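To make the srcset strategy from the feature list concrete, here is a minimal sketch of picking the largest candidate from a `srcset` attribute. The function name `best_srcset_candidate` is hypothetical and this is not taken from ScrapeBot's source. It assumes candidates are comma-separated, that URLs contain no commas, and that a single attribute uses either width (`w`) or density (`x`) descriptors, not both.

```python
from typing import Optional


def best_srcset_candidate(srcset: str) -> Optional[str]:
    """Return the URL whose w or x descriptor is largest, or None if empty."""
    best_url: Optional[str] = None
    best_score = -1.0
    for entry in srcset.split(","):
        parts = entry.strip().split()
        if not parts:
            continue  # skip empty fragments such as a trailing comma
        url = parts[0]
        score = 1.0  # a candidate without a descriptor counts as 1x
        if len(parts) > 1:
            try:
                # Drop the trailing unit letter: "1200w" -> 1200.0, "2x" -> 2.0
                score = float(parts[1][:-1])
            except ValueError:
                continue  # malformed descriptor, ignore this candidate
        if score > best_score:
            best_url, best_score = url, score
    return best_url
```

For example, `best_srcset_candidate("a.jpg 480w, b.jpg 1200w")` selects `b.jpg`. A production parser would follow the HTML specification's full srcset grammar rather than splitting on commas.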
Listing / Search Results Page
│
├─ Collects all item detail page links
├─ Collects the "Next Page" URL
│
└─ For each item:
     Navigate to detail page
     Find best full-size image (6 strategies)
     Download image
     Wait 5 seconds ← hardcoded, always enforced
     Move to next item
     │
     └─ Navigate to Next Page → repeat until stopped
If no item links are detected (e.g. the page is already a direct image gallery), ScrapeBot downloads full-size images straight from the page.
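The crawl flow above can be sketched as a plain loop. This is an illustration under stated assumptions, not ScrapeBot's actual engine: the callables `get_listing`, `find_image`, and `download` are hypothetical stand-ins for the Playwright-backed steps, and the direct-gallery fallback is left out for brevity. The injectable `sleep` parameter exists only so the delay can be observed without waiting.

```python
import time
from typing import Callable, List, Optional, Tuple

DOWNLOAD_DELAY_SECONDS = 5  # fixed delay described in this README


def crawl(
    start_url: str,
    get_listing: Callable[[str], Tuple[List[str], Optional[str]]],
    find_image: Callable[[str], Optional[str]],
    download: Callable[[str], None],
    max_pages: int = 0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Visit listing pages, follow each item link, and download its image.

    max_pages == 0 means unlimited: keep going until no next page exists.
    Returns the number of images downloaded.
    """
    downloaded = 0
    pages_done = 0
    url: Optional[str] = start_url
    while url is not None and (max_pages == 0 or pages_done < max_pages):
        # Phase 1: collect item links and the next-page URL from the listing
        item_links, next_url = get_listing(url)
        # Phase 2: visit each detail page and fetch its best image
        for item_url in item_links:
            image_url = find_image(item_url)
            if image_url is None:
                continue  # nothing suitable on this detail page
            download(image_url)
            downloaded += 1
            sleep(DOWNLOAD_DELAY_SECONDS)  # enforced after every download
        pages_done += 1
        url = next_url
    return downloaded
```

Because the delay sits directly after each download, it is also in effect when the loop moves from the last item of one listing page to the first item of the next.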
- macOS, Linux, or Windows
- Python 3.9 or higher
# 1. Clone the repo
git clone https://github.com/realbobcorbin/scrapebot.git
cd scrapebot
# 2. Install Python dependencies
pip install -r requirements.txt
# 3. Install the Playwright Chromium browser
playwright install chromium
python main.py
- Double-click Start Scrapebot.command in the live app folder
- Or double-click ScrapeBot.app from the suite Apps folder
- Enter a URL — paste the starting URL (search results page, gallery index, etc.)
- Set Max Pages — number of listing pages to crawl. Set to 0 for unlimited (runs until you press Stop)
- Choose output folder — a timestamped subfolder is created automatically
- Click Start Scraping
Each top-level tab is a separate scrape session, so you can run multiple jobs in parallel without mixing logs, counters, or output folders.
Each session creates a folder like output/archive_org_20240315_142301/ containing:
images/
img_000001.jpg
img_000002.png
img_000003.gif
...
data.json
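The naming pattern in the example above can be reproduced with a few lines of Python. This is a sketch inferred only from the sample names `archive_org_20240315_142301` and `img_000001.jpg`; the helper names `session_dir_name` and `image_filename` are hypothetical, and the real rules (for instance how a `www.` prefix is handled) may differ.

```python
import re
from datetime import datetime
from urllib.parse import urlparse


def session_dir_name(start_url: str, now: datetime) -> str:
    """Build '<host with underscores>_<YYYYMMDD>_<HHMMSS>' (assumed pattern)."""
    host = urlparse(start_url).hostname or "site"  # fallback is an assumption
    safe_host = re.sub(r"[^A-Za-z0-9]+", "_", host).strip("_")
    return f"{safe_host}_{now.strftime('%Y%m%d_%H%M%S')}"


def image_filename(index: int, extension: str) -> str:
    """Build a zero-padded name such as 'img_000001.jpg' (assumed pattern)."""
    return f"img_{index:06d}.{extension.lstrip('.')}"
```

Calling `session_dir_name` with the start URL from this README and a timestamp of 15 March 2024, 14:23:01 yields the folder name shown in the example.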
data.json contains metadata for every downloaded image:
{
"start_url": "https://archive.org/search?query=dogs&mediatype=image",
"listing_pages": 3,
"items_visited": 90,
"images_downloaded": 87,
"images": [
{
"url": "https://archive.org/download/dogs-1923/photo.jpg",
"local_file": "img_000001.jpg",
"size_bytes": 862304,
"width": 1200,
"height": 900,
"alt": "Dogs at the park, 1923",
"method": "a-link"
}
]
}
| Filter | Value |
|---|---|
| Minimum resolution | 500 × 500 px |
| Minimum file size | 10 KB |
| Accepted formats | JPEG, PNG, GIF, WebP, AVIF, BMP, TIFF |
| MIME type verified | Yes — Content-Type header must be a valid image type |
| Thumbnails skipped | Yes — URL keywords and small dimensions rejected |
| Download delay | 5 seconds (hardcoded) |
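The filters in the table can be combined into a single check, sketched below. Several details are assumptions rather than facts from ScrapeBot's source: the function name `passes_filters`, the keyword list `THUMB_HINTS`, reading 10 KB as 10 × 1024 bytes, and accepting an image that is exactly 500 px on a side.

```python
from urllib.parse import urlparse

MIN_WIDTH = 500          # px, from the table
MIN_HEIGHT = 500         # px, from the table
MIN_BYTES = 10 * 1024    # assumes 1 KB = 1024 bytes
ACCEPTED_MIME = {
    "image/jpeg", "image/png", "image/gif", "image/webp",
    "image/avif", "image/bmp", "image/tiff",
}
THUMB_HINTS = ("thumb", "preview")  # hypothetical URL keywords


def passes_filters(url: str, content_type: str, size_bytes: int,
                   width: int, height: int) -> bool:
    """Return True if an image meets the MIME, size, resolution, and URL checks."""
    # Strip parameters such as "; charset=utf-8" from the Content-Type header
    mime = content_type.split(";")[0].strip().lower()
    if mime not in ACCEPTED_MIME:
        return False
    if size_bytes < MIN_BYTES:
        return False
    if width < MIN_WIDTH or height < MIN_HEIGHT:
        return False
    # Reject URLs whose path looks like a thumbnail
    path = urlparse(url).path.lower()
    return not any(hint in path for hint in THUMB_HINTS)
```

The sample entry from data.json (a 1200 × 900 JPEG of 862304 bytes) passes all four checks, while the same image served from a path containing `thumbs` would be rejected.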
ScrapeBot is designed to work on any public website with a gallery or search results layout. It has been tested on:
- Internet Archive image search
- Standard image gallery sites
- Any site using JavaScript rendering (SPAs, lazy loading, dynamic pagination)
Note: ScrapeBot only downloads publicly accessible images. It does not bypass authentication, paywalls, or rate-limiting mechanisms beyond the built-in delay.
scrapebot/
├── main.py # Entry point — launch the desktop app
├── requirements.txt # Python dependencies
├── scrapebot/
│ ├── app.py # PyQt6 desktop UI
│ ├── worker.py # QThread bridge (keeps UI responsive)
│ └── scraper.py # Playwright crawl engine
- The 5-second download delay is hardcoded and cannot be changed from the UI
- Only scrape sites where you have permission to do so
- Respect each site's robots.txt and terms of service
- Do not use this tool to scrape private, copyrighted, or personal data without authorization
MIT License — free to use, modify, and distribute.