ScrapeBot

A smart desktop web scraper built with Python, PyQt6, and Playwright. ScrapeBot follows gallery and search result pages to their source images, downloads full-size photos, and automatically pages through results — all while respecting a hardcoded rate limit.

Features

Two-phase crawler — detects item/detail page links on listing pages, follows each one to find and download the full-size source image, then backtracks and moves to the next
Smart pagination — automatically detects and clicks Next buttons, numbered pagination, rel="next" links, and Load More buttons
6-strategy image detection — finds the best image on any detail page using direct download links, og:image, JSON-LD structured data, data-* attributes, srcset, and largest visible <img>
Strict quality filters — minimum 500×500 resolution, verified photo MIME type (JPEG/PNG/WebP/GIF/AVIF/BMP), minimum 10 KB file size
Thumbnail rejection — skips images with thumbnail URL patterns or dimensions at or below the minimum threshold
Hardcoded 5-second delay between every image download — never skipped, even across page transitions
JavaScript-rendered pages — uses a real Chromium browser via Playwright, so dynamic and JS-heavy sites work out of the box
Lazy image support — scrolls pages gradually to trigger lazy-loaded content before extracting
Dark-themed desktop UI — clean PyQt6 interface with live log, progress bar, and real-time counters
4 parallel scrape tabs — run up to four independent scrape sessions at the same time in one window
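
Two of the detection strategies listed above, og:image and srcset, can be illustrated on static HTML with the standard library alone. This is a minimal sketch, not ScrapeBot's actual code: `largest_from_srcset`, `ImageCandidateParser`, and `best_candidate` are hypothetical names, the priority order (og:image first) is an assumption, and the real scraper works on a live Playwright page rather than an HTML string.

```python
from html.parser import HTMLParser
from typing import List, Optional


def largest_from_srcset(srcset: str) -> Optional[str]:
    """Return the URL with the largest width (w) or density (x) descriptor.

    Simplification: URLs that themselves contain commas are not handled.
    """
    best_url, best_score = None, -1.0
    for candidate in srcset.split(","):
        parts = candidate.strip().split()
        if not parts:
            continue
        score = 1.0  # a bare URL with no descriptor counts as 1x
        if len(parts) > 1:
            try:
                score = float(parts[1][:-1])  # "1200w" -> 1200.0, "2x" -> 2.0
            except ValueError:
                continue  # skip malformed descriptors
        if score > best_score:
            best_url, best_score = parts[0], score
    return best_url


class ImageCandidateParser(HTMLParser):
    """Collects the first og:image URL and the best srcset URL per <img>."""

    def __init__(self) -> None:
        super().__init__()
        self.og_image: Optional[str] = None
        self.srcset_urls: List[str] = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta" and attr.get("property") == "og:image":
            if self.og_image is None and attr.get("content"):
                self.og_image = attr["content"]
        elif tag == "img" and attr.get("srcset"):
            url = largest_from_srcset(attr["srcset"])
            if url:
                self.srcset_urls.append(url)


def best_candidate(html: str) -> Optional[str]:
    """Assumed priority: og:image first, then the first srcset winner."""
    parser = ImageCandidateParser()
    parser.feed(html)
    if parser.og_image:
        return parser.og_image
    return parser.srcset_urls[0] if parser.srcset_urls else None
```

Relative URLs returned by either strategy would still need to be resolved against the page URL before downloading.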

How It Works

Listing / Search Results Page
  │
  ├─ Collects all item detail page links
  ├─ Collects the "Next Page" URL
  │
  └─ For each item:
        Navigate to detail page
        Find best full-size image (6 strategies)
        Download image
        Wait 5 seconds  ← hardcoded, always enforced
        Move to next item
  │
  └─ Navigate to Next Page → repeat until stopped
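
The "Next Page" step in the diagram depends on pagination detection. Of the signals listed under Features, rel="next" is the easiest to show on static HTML. The sketch below is an illustration only: `NextLinkParser` and `find_next_page` are hypothetical names, and ScrapeBot itself presumably inspects the rendered page through Playwright, which also lets it click Next and Load More buttons.

```python
from html.parser import HTMLParser
from typing import Optional
from urllib.parse import urljoin


class NextLinkParser(HTMLParser):
    """Remembers the href of the first <a> or <link> whose rel includes "next"."""

    def __init__(self) -> None:
        super().__init__()
        self.next_href: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if self.next_href is not None or tag not in ("a", "link"):
            return
        attr = dict(attrs)
        # rel is a space-separated token list, e.g. rel="next nofollow"
        rel_tokens = (attr.get("rel") or "").lower().split()
        if "next" in rel_tokens and attr.get("href"):
            self.next_href = attr["href"]


def find_next_page(html: str, base_url: str) -> Optional[str]:
    """Return the absolute URL of the next page, or None if no rel="next" link exists."""
    parser = NextLinkParser()
    parser.feed(html)
    if parser.next_href is None:
        return None
    return urljoin(base_url, parser.next_href)
```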

If no item links are detected (e.g. the page is already a direct image gallery), ScrapeBot downloads full-size images straight from the page.
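
The flow above, including the fallback for pages without item links, can be sketched as a plain loop. The browser work is passed in as callables so the control flow can be read (and run) without Playwright. All names here (`crawl`, `get_listing`, `find_image`, `download`) are hypothetical, and two details are assumptions: the delay is applied after every download attempt, and the fallback is simplified to one image per page.

```python
import time
from typing import Callable, List, Optional, Tuple

DOWNLOAD_DELAY_SECONDS = 5  # the README states this value is hardcoded

# Hypothetical return shape: (item detail URLs, next page URL or None)
Listing = Tuple[List[str], Optional[str]]


def crawl(
    start_url: str,
    get_listing: Callable[[str], Listing],
    find_image: Callable[[str], Optional[str]],
    download: Callable[[str], bool],
    max_pages: int = 0,  # 0 means unlimited, as in the UI
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Walk listing pages and their items; return the number of images saved."""
    downloaded = 0
    pages_seen = 0
    url: Optional[str] = start_url
    while url is not None and (max_pages == 0 or pages_seen < max_pages):
        item_urls, next_url = get_listing(url)
        pages_seen += 1
        # Fallback: with no item links, treat the listing page itself
        # as the image source (simplified here to a single image).
        for target in item_urls or [url]:
            image_url = find_image(target)
            if image_url is None:
                continue
            if download(image_url):
                downloaded += 1
            sleep(DOWNLOAD_DELAY_SECONDS)  # assumed: wait after every attempt
        url = next_url
    return downloaded
```

Passing `sleep` as a parameter is only a convenience for trying the loop without waiting; it is not meant to suggest the delay is configurable in ScrapeBot.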

Requirements

macOS, Linux, or Windows
Python 3.9 or higher

Installation

# 1. Clone the repo
git clone https://github.com/realbobcorbin/scrapebot.git
cd scrapebot

# 2. Install Python dependencies
pip install -r requirements.txt

# 3. Install the Playwright Chromium browser
playwright install chromium

Usage

python main.py

Double-click launch (macOS)

Double-click Start Scrapebot.command in the live app folder
Or double-click ScrapeBot.app from the suite Apps folder

Steps

Enter a URL — paste the starting URL (search results page, gallery index, etc.)
Set Max Pages — number of listing pages to crawl. Set to 0 for unlimited (runs until you press Stop)
Choose output folder — a timestamped subfolder is created automatically
Click Start Scraping

Each top-level tab is a separate scrape session, so you can run multiple jobs in parallel without mixing logs, counters, or output folders.

Output

Each session creates a folder like output/archive_org_20240315_142301/ containing:

images/
  img_000001.jpg
  img_000002.png
  img_000003.gif
  ...
data.json
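
The example folder name above looks like the start URL's host with punctuation replaced by underscores, followed by a `YYYYMMDD_HHMMSS` timestamp. That naming rule is inferred from the single example, not confirmed by the source. Under that assumption, the layout could be created as follows (`session_folder_name` and `make_session_dirs` are hypothetical helpers):

```python
import re
from datetime import datetime
from pathlib import Path
from urllib.parse import urlparse


def session_folder_name(start_url: str, now: datetime) -> str:
    """Build a name like archive_org_20240315_142301 (assumed scheme)."""
    host = urlparse(start_url).hostname or "site"
    slug = re.sub(r"[^a-z0-9]+", "_", host.lower()).strip("_")
    return f"{slug}_{now:%Y%m%d_%H%M%S}"


def make_session_dirs(output_root: Path, start_url: str, now: datetime) -> Path:
    """Create <output_root>/<session>/images and return the session folder."""
    session = output_root / session_folder_name(start_url, now)
    (session / "images").mkdir(parents=True, exist_ok=True)
    return session
```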

data.json contains metadata for every downloaded image:

{
  "start_url": "https://archive.org/search?query=dogs&mediatype=image",
  "listing_pages": 3,
  "items_visited": 90,
  "images_downloaded": 87,
  "images": [
    {
      "url": "https://archive.org/download/dogs-1923/photo.jpg",
      "local_file": "img_000001.jpg",
      "size_bytes": 862304,
      "width": 1200,
      "height": 900,
      "alt": "Dogs at the park, 1923",
      "method": "a-link"
    }
  ]
}
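
Because data.json is plain JSON, a finished session can be inspected with the standard library. The sketch below relies only on the fields shown in the example above; `summarize` and `load_and_summarize` are hypothetical helpers, not part of ScrapeBot.

```python
import json
from collections import Counter
from pathlib import Path
from typing import Any, Dict


def summarize(data: Dict[str, Any]) -> Dict[str, Any]:
    """Count images, total bytes, and how often each detection method was used."""
    images = data.get("images", [])
    return {
        "images": len(images),
        "total_bytes": sum(img.get("size_bytes", 0) for img in images),
        "by_method": dict(Counter(img.get("method", "unknown") for img in images)),
    }


def load_and_summarize(path: Path) -> Dict[str, Any]:
    """Read a session's data.json and summarize it."""
    with open(path, encoding="utf-8") as fh:
        return summarize(json.load(fh))
```

For example, `load_and_summarize(Path("output/archive_org_20240315_142301/data.json"))` would summarize the session shown above.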

Image Quality Rules

| Filter | Value |
| --- | --- |
| Minimum resolution | 500 × 500 px |
| Minimum file size | 10 KB |
| Accepted formats | JPEG, PNG, GIF, WebP, AVIF, BMP, TIFF |
| MIME type verified | Yes: the Content-Type header must be a valid image type |
| Thumbnails skipped | Yes: URL keywords and small dimensions are rejected |
| Download delay | 5 seconds (hardcoded) |
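
The rules above can be read as a single pass/fail check per image. The following is a rough sketch under stated assumptions: `passes_quality_rules` is a hypothetical function, 10 KB is taken as 10 × 1024 bytes, an image exactly at 500 px is accepted, and the thumbnail keywords are illustrative guesses because the actual URL patterns are not listed here.

```python
import re

MIN_WIDTH = 500
MIN_HEIGHT = 500
MIN_BYTES = 10 * 1024  # assumed: 10 KB = 10,240 bytes

# Accepted formats as listed in the table above.
ACCEPTED_MIME = {
    "image/jpeg", "image/png", "image/gif", "image/webp",
    "image/avif", "image/bmp", "image/tiff",
}

# Illustrative guesses only; the real keyword list is not documented.
THUMBNAIL_PATTERN = re.compile(r"thumb|_small|/icons?/", re.IGNORECASE)


def passes_quality_rules(
    url: str, content_type: str, size_bytes: int, width: int, height: int
) -> bool:
    """Return True if the image satisfies every rule in the table."""
    # Drop parameters such as "; charset=utf-8" before comparing.
    mime = content_type.split(";")[0].strip().lower()
    if mime not in ACCEPTED_MIME:
        return False
    if size_bytes < MIN_BYTES:
        return False
    if width < MIN_WIDTH or height < MIN_HEIGHT:
        return False
    if THUMBNAIL_PATTERN.search(url):
        return False
    return True
```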

Supported Sites

ScrapeBot is designed to work on any public website with a gallery or search results layout. It has been tested on:

Internet Archive image search
Standard image gallery sites
Any site using JavaScript rendering (SPAs, lazy loading, dynamic pagination)

Note: ScrapeBot only downloads publicly accessible images. It does not bypass authentication, paywalls, or rate-limiting mechanisms; the only throttling it applies is its own built-in delay.

Project Structure

scrapebot/
├── main.py                  # Entry point: launches the desktop app
├── requirements.txt         # Python dependencies
└── scrapebot/
    ├── app.py               # PyQt6 desktop UI
    ├── worker.py            # QThread bridge (keeps UI responsive)
    └── scraper.py           # Playwright crawl engine

Responsible Use

The 5-second download delay is hardcoded and cannot be changed from the UI
Only scrape sites where you have permission to do so
Respect each site's robots.txt and terms of service
Do not use this tool to scrape private, copyrighted, or personal data without authorization
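
One possible way to keep a delay fixed in code is to store it as a module-level constant and route every download through a small throttle that has no setting for it. This is a sketch of that idea, not ScrapeBot's implementation: `DownloadThrottle` is a hypothetical class, and it enforces a minimum gap between downloads by measuring elapsed time, which is a slightly different technique from an unconditional sleep.

```python
import time
from typing import Callable, Optional

DOWNLOAD_DELAY_SECONDS = 5.0  # fixed in code; deliberately not a constructor argument


class DownloadThrottle:
    """Ensures at least DOWNLOAD_DELAY_SECONDS between consecutive downloads."""

    def __init__(
        self,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None  # time the previous download was released

    def wait(self) -> float:
        """Block until the next download is allowed; return the seconds slept."""
        now = self._clock()
        pause = 0.0
        if self._last is not None:
            pause = max(0.0, DOWNLOAD_DELAY_SECONDS - (now - self._last))
            if pause > 0:
                self._sleep(pause)
        self._last = now + pause
        return pause
```

Calling `throttle.wait()` immediately before each download keeps the gap intact across page transitions, since the throttle does not know or care which page the image came from.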

License

MIT License — free to use, modify, and distribute.

Author

realbobcorbin
