ShadowfetchAI/scrapebot

ScrapeBot

A desktop web scraper built with Python, PyQt6, and Playwright. ScrapeBot follows gallery and search result pages to their source images, downloads the full-size photos, and pages through results automatically, always enforcing a hardcoded 5-second delay between image downloads.


Features

  • Two-phase crawler — detects item/detail page links on listing pages, follows each one to find and download the full-size source image, then backtracks and moves to the next
  • Smart pagination — automatically detects and clicks Next buttons, numbered pagination, rel="next" links, and Load More buttons
  • 6-strategy image detection — finds the best image on any detail page using direct download links, og:image, JSON-LD structured data, data-* attributes, srcset, and largest visible <img>
  • Strict quality filters — minimum 500×500 resolution, verified photo MIME type (JPEG/PNG/WebP/GIF/AVIF/BMP), minimum 10 KB file size
  • Thumbnail rejection — skips images with thumbnail URL patterns or dimensions at or below the minimum threshold
  • Hardcoded 5-second delay between every image download — never skipped, even across page transitions
  • JavaScript-rendered pages — uses a real Chromium browser via Playwright, so dynamic and JS-heavy sites work out of the box
  • Lazy image support — scrolls pages gradually to trigger lazy-loaded content before extracting
  • Dark-themed desktop UI — clean PyQt6 interface with live log, progress bar, and real-time counters
  • 4 parallel scrape tabs — run up to four independent scrape sessions at the same time in one window
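
As an illustration of the srcset strategy listed above, here is a minimal sketch of picking the largest candidate from a srcset attribute. The function name pick_largest_from_srcset is hypothetical and is not taken from ScrapeBot's source. Splitting on commas is a simplification that assumes the URLs themselves contain no commas.

```python
from typing import Optional


def pick_largest_from_srcset(srcset: str) -> Optional[str]:
    """Return the candidate URL with the largest width (w) or density (x) descriptor.

    Hypothetical helper for illustration; not ScrapeBot's actual implementation.
    """
    best_url = None
    best_rank = (-1, 0.0)
    for candidate in srcset.split(","):
        parts = candidate.split()
        if not parts:
            continue
        url = parts[0]
        # A candidate without a descriptor is treated as 1x
        descriptor = parts[1] if len(parts) > 1 else "1x"
        kind = descriptor[-1]
        if kind not in ("w", "x"):
            continue
        try:
            value = float(descriptor[:-1])
        except ValueError:
            continue
        # Width descriptors are ranked above density descriptors
        rank = (1 if kind == "w" else 0, value)
        if rank > best_rank:
            best_url, best_rank = url, rank
    return best_url
```

Calling pick_largest_from_srcset("a.jpg 480w, b.jpg 1200w") returns "b.jpg". An empty attribute returns None.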

How It Works

Listing / Search Results Page
  │
  ├─ Collects all item detail page links
  ├─ Collects the "Next Page" URL
  │
  └─ For each item:
        Navigate to detail page
        Find best full-size image (6 strategies)
        Download image
        Wait 5 seconds  ← hardcoded, always enforced
        Move to next item
  │
  └─ Navigate to Next Page → repeat until stopped

If no item links are detected (e.g. the page is already a direct image gallery), ScrapeBot downloads full-size images straight from the page.
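
The flow above can be sketched as plain control flow. This is only an outline under stated assumptions: get_listing, find_image, and download are hypothetical callables standing in for the Playwright logic, and sleep is injectable so the loop can be exercised without waiting. The direct-gallery fallback and the Stop button are left out.

```python
import time

DOWNLOAD_DELAY_SECONDS = 5  # the documented delay between downloads


def crawl(start_url, get_listing, find_image, download, max_pages=0, sleep=time.sleep):
    """Two-phase crawl: collect item links per listing page, then visit each item.

    get_listing(url) -> (item_urls, next_page_url or None)   (hypothetical)
    find_image(item_url) -> image_url or None                (hypothetical)
    download(image_url) -> None                              (hypothetical)
    max_pages=0 means unlimited, as in the UI.
    """
    pages = 0
    downloaded = 0
    url = start_url
    while url is not None:
        item_urls, next_url = get_listing(url)
        pages += 1
        for item_url in item_urls:
            image_url = find_image(item_url)
            if image_url is None:
                continue  # nothing suitable on this detail page
            download(image_url)
            downloaded += 1
            sleep(DOWNLOAD_DELAY_SECONDS)  # enforced after every download
        if max_pages and pages >= max_pages:
            break
        url = next_url
    return pages, downloaded
```

Because the delay sits inside the item loop, it also applies to the last download on a page, which is how it can carry across page transitions.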


Requirements

  • macOS, Linux, or Windows
  • Python 3.9 or higher

Installation

# 1. Clone the repo
git clone https://github.com/realbobcorbin/scrapebot.git
cd scrapebot

# 2. Install Python dependencies
pip install -r requirements.txt

# 3. Install the Playwright Chromium browser
playwright install chromium

Usage

python main.py

Double-click launch

Steps

  1. Enter a URL — paste the starting URL (search results page, gallery index, etc.)
  2. Set Max Pages — number of listing pages to crawl. Set to 0 for unlimited (runs until you press Stop)
  3. Choose output folder — a timestamped subfolder is created automatically
  4. Click Start Scraping

Each top-level tab is a separate scrape session, so you can run multiple jobs in parallel without mixing logs, counters, or output folders.

Output

Each session creates a folder like output/archive_org_20240315_142301/ containing:

images/
  img_000001.jpg
  img_000002.png
  img_000003.gif
  ...
data.json
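
A folder name like the one in the example could be derived from the start URL's host and the current time. The sketch below is an assumption about the naming scheme, inferred only from that example. session_folder_name is a hypothetical helper, and stripping a leading "www." is a guess.

```python
import re
from datetime import datetime
from urllib.parse import urlparse


def session_folder_name(start_url, now=None):
    """Build a name such as archive_org_20240315_142301 (hypothetical scheme)."""
    now = now or datetime.now()
    host = urlparse(start_url).hostname or "site"
    if host.startswith("www."):
        host = host[4:]  # assumption: a leading "www." is dropped
    # Replace anything that is not a letter or digit with an underscore
    slug = re.sub(r"[^a-z0-9]+", "_", host.lower()).strip("_")
    return f"{slug}_{now:%Y%m%d_%H%M%S}"
```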

data.json contains metadata for every downloaded image:

{
  "start_url": "https://archive.org/search?query=dogs&mediatype=image",
  "listing_pages": 3,
  "items_visited": 90,
  "images_downloaded": 87,
  "images": [
    {
      "url": "https://archive.org/download/dogs-1923/photo.jpg",
      "local_file": "img_000001.jpg",
      "size_bytes": 862304,
      "width": 1200,
      "height": 900,
      "alt": "Dogs at the park, 1923",
      "method": "a-link"
    }
  ]
}
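
Since data.json is plain JSON, it is easy to post-process. The following sketch summarizes a session using only the fields shown in the example above. load_metadata and summarize are hypothetical helper names, not part of ScrapeBot.

```python
import json
from collections import Counter
from pathlib import Path


def load_metadata(session_dir):
    """Read data.json from a session folder."""
    return json.loads((Path(session_dir) / "data.json").read_text(encoding="utf-8"))


def summarize(data):
    """Count images, total bytes, and how many images each detection method found."""
    images = data.get("images", [])
    return {
        "images": len(images),
        "total_bytes": sum(img.get("size_bytes", 0) for img in images),
        "by_method": dict(Counter(img.get("method", "unknown") for img in images)),
    }
```

For example, summarize(load_metadata("output/archive_org_20240315_142301")) would report the totals for that session.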

Image Quality Rules

| Filter | Value |
| --- | --- |
| Minimum resolution | 500 × 500 px |
| Minimum file size | 10 KB |
| Accepted formats | JPEG, PNG, GIF, WebP, AVIF, BMP, TIFF |
| MIME type verified | Yes (Content-Type header must be a valid image type) |
| Thumbnails skipped | Yes (URL keywords and small dimensions rejected) |
| Download delay | 5 seconds (hardcoded) |
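
The rules in this table could be expressed as a single predicate. The sketch below is a rough illustration, not ScrapeBot's code: the thumbnail keywords are made up, 10 KB is assumed to mean 10 × 1024 bytes, and an image of exactly 500 × 500 px is assumed to pass.

```python
from urllib.parse import urlparse

MIN_WIDTH = 500
MIN_HEIGHT = 500
MIN_BYTES = 10 * 1024  # assumption: 10 KB means 10 * 1024 bytes
ACCEPTED_MIME_TYPES = {
    "image/jpeg", "image/png", "image/gif", "image/webp",
    "image/avif", "image/bmp", "image/tiff",
}
THUMBNAIL_KEYWORDS = ("thumb", "preview", "icon")  # hypothetical keyword list


def passes_quality_rules(url, content_type, size_bytes, width, height):
    """Return True if an image satisfies every rule in the table."""
    # Content-Type may carry parameters, e.g. "image/jpeg; charset=binary"
    mime = content_type.split(";")[0].strip().lower()
    if mime not in ACCEPTED_MIME_TYPES:
        return False
    if size_bytes < MIN_BYTES:
        return False
    if width < MIN_WIDTH or height < MIN_HEIGHT:
        return False
    # Reject URLs whose path looks like a thumbnail
    path = urlparse(url).path.lower()
    return not any(keyword in path for keyword in THUMBNAIL_KEYWORDS)
```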

Supported Sites

ScrapeBot is designed to work on any public website with a gallery or search results layout. It has been tested on:

  • Internet Archive image search
  • Standard image gallery sites
  • Any site using JavaScript rendering (SPAs, lazy loading, dynamic pagination)

Note: ScrapeBot only downloads publicly accessible images. It does not attempt to bypass authentication, paywalls, or a site's rate limiting, and its own pacing is limited to the built-in 5-second delay.


Project Structure

scrapebot/
├── main.py                  # Entry point — launch the desktop app
├── requirements.txt         # Python dependencies
├── scrapebot/
│   ├── app.py               # PyQt6 desktop UI
│   ├── worker.py            # QThread bridge (keeps UI responsive)
│   └── scraper.py           # Playwright crawl engine

Responsible Use

  • The 5-second download delay is hardcoded and cannot be changed from the UI
  • Only scrape sites where you have permission to do so
  • Respect each site's robots.txt and terms of service
  • Do not use this tool to scrape private, copyrighted, or personal data without authorization

License

MIT License — free to use, modify, and distribute.


Author

realbobcorbin

About

Desktop web scraper with 4 parallel sessions, Playwright Chromium, PyQt6 dark UI, lazy image support — double-clickable macOS app
