# ⚡ Parallel Web Scraper

Blazing-fast web scraper for structured websites, built with C++20, Intel TBB and libcurl.
Scrape, parse and analyse large sets of pages – in serial and in parallel – and measure real speedup on modern CPUs.
- 💥 The Problem
- 💡 The Solution
- ✨ Key Features
- 🏗 Architecture & Tech Stack
- 🔥 How It Works (The Flow)
- 📈 Performance
- 🛠 Getting Started
- 📂 Project Structure
- 👥 Author
- 📄 License
## 💥 The Problem

Traditional web scrapers usually:
- Run single-threaded, leaving most CPU cores idle.
- Mix network I/O, HTML parsing and data aggregation in one big ball of mud.
- Make it hard to measure and compare performance between serial and parallel approaches.
If you want to experiment with parallelism, benchmark speedups, or learn Intel TBB on a real-world workload (HTTP calls + parsing + stats), you need a clean, focused playground.
## 💡 The Solution

Parallel Web Scraper is a C++20 project that:
- Scrapes a structured demo site (e.g. an online book catalogue).
- Implements both serial and parallel scraping pipelines.
- Uses Intel Threading Building Blocks (TBB) for task-based parallelism.
- Aggregates statistics in a thread-safe way.
- Outputs comparable metrics so you can see the actual speedup.
It’s designed as a practical reference for:
- Parallel programming with Intel TBB
- Efficient use of libcurl in C++
- Lock-aware, thread-safe statistics aggregation
## ✨ Key Features

- ⚙️ Dual implementation – serial scraper vs. parallel scraper
- 🧵 Task-based parallelism with Intel TBB (no manual thread management)
- 🌐 HTTP layer using libcurl with timeouts & retry logic
- 📊 Rich statistics:
- Total pages visited & unique URLs
- Total items/books scraped
- Rating distribution (1★–5★)
- Average rating & average price
- Cheapest & most expensive item
- 🧷 Thread-safe stats using `std::atomic`, `std::mutex` and TBB concurrent containers
- 📝 Human-readable report printed to the console and saved to `results.txt`
- 🧪 Perfect as a benchmark / learning project for parallel patterns
## 🏗 Architecture & Tech Stack

- Language: C++20
- Parallelism: Intel Threading Building Blocks (TBB)
- Networking: libcurl
- Standard Library: `<atomic>`, `<mutex>`, `<queue>`, `<vector>`, `<string>`, `<chrono>`, `<fstream>`, `<iostream>`, etc.
- **Book / BookInfo** (`Book.h`)
  Small structs representing a scraped item (title, price, rating, category) plus helper info for tracking the cheapest and most expensive book.
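A rough sketch of what these structs look like; the field names are assumptions based on the description above, and the real `Book.h` may differ:

```cpp
#include <string>

// Sketch of the data model; actual field names in Book.h may differ.
struct Book {
    std::string title;
    std::string category;
    double price = 0.0;  // list price as shown on the page
    int rating = 0;      // 1-5 stars
};

// Helper used when tracking the cheapest / most expensive book.
struct BookInfo {
    std::string title;
    double price = 0.0;
};
```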
- **Stats** (`Stats.h`, `Stats.cpp`)
  Thread-safe statistics aggregator:
  - Uses `std::atomic` counters for counts, sums and price accumulators.
  - Uses a `std::mutex` to protect min/max book updates.
  - `update(const std::vector<Book>&)` merges a local batch into the global stats.
  - `reset()` clears everything for a fresh run.
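A minimal sketch of such an aggregator, combining `std::atomic` counters with a `std::mutex`-guarded min/max section. The method names follow the description above, but the real class holds more fields (rating distribution, per-category counts, etc.):

```cpp
#include <atomic>
#include <limits>
#include <mutex>
#include <string>
#include <vector>

// Minimal stand-in for the Book struct from Book.h.
struct Book {
    std::string title;
    double price;
    int rating;
};

class Stats {
public:
    // Merge a locally collected batch into the global statistics.
    void update(const std::vector<Book>& batch) {
        for (const Book& b : batch) {
            books_.fetch_add(1, std::memory_order_relaxed);
            // Accumulate prices as integer cents so a plain 64-bit atomic works.
            priceCents_.fetch_add(static_cast<long long>(b.price * 100 + 0.5),
                                  std::memory_order_relaxed);
            std::lock_guard<std::mutex> lock(minMaxMutex_);  // guards the min/max pair
            if (b.price < minPrice_) { minPrice_ = b.price; cheapest_ = b.title; }
            if (b.price > maxPrice_) { maxPrice_ = b.price; priciest_ = b.title; }
        }
    }

    // Clear everything for a fresh run.
    void reset() {
        books_ = 0;
        priceCents_ = 0;
        std::lock_guard<std::mutex> lock(minMaxMutex_);
        minPrice_ = std::numeric_limits<double>::max();
        maxPrice_ = std::numeric_limits<double>::lowest();
        cheapest_.clear();
        priciest_.clear();
    }

    double averagePrice() const {
        long long n = books_.load();
        return n ? (priceCents_.load() / 100.0) / n : 0.0;
    }

    std::string cheapestTitle() const {
        std::lock_guard<std::mutex> lock(minMaxMutex_);
        return cheapest_;
    }

private:
    std::atomic<long long> books_{0};
    std::atomic<long long> priceCents_{0};
    mutable std::mutex minMaxMutex_;
    double minPrice_ = std::numeric_limits<double>::max();
    double maxPrice_ = std::numeric_limits<double>::lowest();
    std::string cheapest_, priciest_;
};
```

The split mirrors the design decision in the bullet list: monotonic counters stay lock-free, while the min/max pair (two values that must change together) sits behind a mutex.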
- **Downloader** (`Downloader.h`, `Downloader.cpp`)
  Thin wrapper around libcurl:
  - Performs HTTP GET with a timeout.
  - Retries failed requests a few times before giving up.
  - Returns the HTML as a `std::string`.
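The retry behaviour can be sketched independently of libcurl by injecting the single-attempt transport as a callable. In the real `Downloader` that callable would wrap `curl_easy_perform` with `CURLOPT_TIMEOUT` set; the names below are illustrative:

```cpp
#include <functional>
#include <optional>
#include <string>

// One attempt: returns the body on success, std::nullopt on failure.
using FetchFn = std::function<std::optional<std::string>(const std::string& url)>;

// Retry a failing GET a few times before giving up.
std::optional<std::string> fetchWithRetry(const std::string& url,
                                          const FetchFn& fetchOnce,
                                          int maxAttempts = 3) {
    for (int attempt = 1; attempt <= maxAttempts; ++attempt) {
        if (auto body = fetchOnce(url)) return body;  // success: hand back the HTML
        // On failure, fall through and try again (a real client might back off here).
    }
    return std::nullopt;  // all attempts failed
}
```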
- **Parser** (`Parser.cpp` + header)
  Lightweight HTML parser tuned for the target site:
  - Extracts title, price, rating and category from each product block.
  - Finds links to the next page and other relevant URLs.
  - Normalizes relative URLs into absolute ones.
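Extracting one field can be sketched with `<regex>`; the pattern below targets the `price_color` markup that books.toscrape.com uses for prices, while the real parser also extracts titles, ratings and categories:

```cpp
#include <regex>
#include <string>
#include <vector>

// Pull every price out of a page. books.toscrape.com renders prices as
// <p class="price_color">£51.77</p>, so we capture the numeric part.
std::vector<double> extractPrices(const std::string& html) {
    static const std::regex pricePattern(
        R"(<p class="price_color">[^0-9]*([0-9]+\.[0-9]{2})</p>)");
    std::vector<double> prices;
    for (std::sregex_iterator it(html.begin(), html.end(), pricePattern), end;
         it != end; ++it) {
        prices.push_back(std::stod((*it)[1].str()));  // capture group 1 = the number
    }
    return prices;
}
```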
- **Scraper** (`Scraper.cpp` + header)
  Orchestrates the whole process:
  - Maintains:
    - `tbb::concurrent_unordered_set<std::string> visited;`
    - `tbb::concurrent_unordered_set<std::string> categories;`
    - A shared `Stats` instance.
  - Provides both:
    - `serialScrape(startUrl)`
    - `parallelScrape(startUrl)`
  - Uses TBB task groups in the parallel version to recursively spawn new work.
- **Entry Point** (`main.cpp`)
  - Initializes libcurl.
  - Defines the starting URL (e.g. `https://books.toscrape.com/index.html`).
  - Runs the serial scrape and measures execution time.
  - Resets shared state.
  - Runs the parallel scrape and measures execution time again.
  - Computes averages and min/max data.
  - Writes a detailed report to `std::cout` and to `results.txt`.
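Measuring each run is a standard `std::chrono` pattern; a sketch (the scrape calls in the trailing comment are placeholders for the real methods):

```cpp
#include <chrono>
#include <functional>

// Run a workload and return its wall-clock duration in milliseconds.
double timeMs(const std::function<void()>& work) {
    auto start = std::chrono::steady_clock::now();  // steady_clock: immune to wall-clock jumps
    work();
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

// Usage sketch for main():
//   double serialMs   = timeMs([&] { scraper.serialScrape(startUrl); });
//   scraper.reset();
//   double parallelMs = timeMs([&] { scraper.parallelScrape(startUrl); });
//   double speedup    = serialMs / parallelMs;
```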
## 🔥 How It Works (The Flow)

1. **Bootstrap** 🚀
   `main.cpp` initializes libcurl and creates a `Scraper` instance with fresh `Stats` and empty visited sets.
2. **Serial Scrape** 🐢
   - Starts from the root URL.
   - Uses a simple queue (BFS-style) to:
     - download a page → parse the HTML → extract books & new URLs,
     - update statistics,
     - enqueue new URLs that haven't been visited yet.
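The serial pipeline above is a plain breadth-first traversal; a compact, self-contained sketch with the download-and-parse step injected as a callable (names are illustrative):

```cpp
#include <functional>
#include <queue>
#include <string>
#include <unordered_set>
#include <vector>

// Visit every page reachable from startUrl exactly once, breadth-first.
// processPage downloads + parses one URL and returns the links it found.
std::size_t serialScrape(
    const std::string& startUrl,
    const std::function<std::vector<std::string>(const std::string&)>& processPage) {
    std::unordered_set<std::string> visited{startUrl};
    std::queue<std::string> frontier;
    frontier.push(startUrl);
    while (!frontier.empty()) {
        std::string url = std::move(frontier.front());
        frontier.pop();
        for (std::string& next : processPage(url)) {  // download → parse → links
            if (visited.insert(next).second) {        // not seen before
                frontier.push(std::move(next));       // enqueue for a later iteration
            }
        }
    }
    return visited.size();  // pages visited
}
```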
3. **Reset** 🔄
   - Clears `visited` and other shared data.
   - Resets `Stats` to zero.
4. **Parallel Scrape** ⚡
   - Starts from the same root URL.
   - Creates TBB tasks for discovered URLs; each task:
     - checks `visited` (a concurrent set),
     - downloads and parses the page,
     - updates the shared `Stats` through its thread-safe API,
     - spawns new tasks for newly discovered URLs.
5. **Reporting** 📊
   Once both runs finish, the program:
   - prints serial vs. parallel stats,
   - shows total time, time per page/book, and the computed speedup,
   - saves the same report to `results.txt` for later inspection.
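Emitting the same report to both the console and `results.txt` is easiest if it is formatted once into a string; a sketch with only the timing fields (the real report contains all the statistics listed earlier):

```cpp
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>

// Format the comparison once, so console and file output cannot diverge.
std::string buildReport(double serialMs, double parallelMs) {
    std::ostringstream out;
    out << "Serial time:   " << serialMs << " ms\n"
        << "Parallel time: " << parallelMs << " ms\n"
        << "Speedup:       " << serialMs / parallelMs << "x\n";
    return out.str();
}

void writeReport(const std::string& report) {
    std::cout << report;                     // console copy
    std::ofstream("results.txt") << report;  // persisted copy
}
```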
## 📈 Performance

The project is designed to showcase speedup on multi-core CPUs:
- Parallel execution leverages multiple cores for downloading & parsing.
- Thread-safe statistics ensure correctness despite concurrency.
- Serial and parallel implementations scrape the same dataset, so metrics are directly comparable.
Exact timings and speedup depend on your machine and network, but on a modern multi-core CPU you should clearly see the parallel version outperform the serial baseline.
## 🛠 Getting Started

### Clone the repository

```bash
git clone https://github.com/MilanSazdov/parallel-web-scraper.git
cd parallel-web-scraper/Web_sakupljac
```

You have two options for building:
### Option 1: Visual Studio

- Open `Web_sakupljac/Web_sakupljac.sln` in Visual Studio.
- Select the desired configuration (e.g. `x64-Debug` or `x64-Release`).
- Press Build → Build Solution or `Ctrl+Shift+B`.
- Run the project with `F5`.
### Option 2: Command line

If you want to build from the command line (Linux, WSL, MinGW, etc.), make sure you have:
- A C++20-compatible compiler (`g++`, `clang++`, etc.)
- Intel TBB installed
- libcurl development libraries installed
From inside `Web_sakupljac/`:

```bash
g++ -std=c++20 -O2 \
    main.cpp web_scraper.cpp \
    -ltbb -lcurl \
    -o web_scraper
```

Adjust the library flags (`-ltbb`, `-lcurl`) if your platform uses different names.
### Run

From the `Web_sakupljac/` directory:

```bash
./web_scraper
# or on Windows
web_scraper.exe
```

The program will:

- Run the serial scraper.
- Run the parallel scraper.
- Print a side-by-side comparison.
- Generate `results.txt` with the same report.
## 📂 Project Structure

```
.
├── Book.h         # Book & BookInfo structs
├── Downloader.h   # Downloader interface
├── Downloader.cpp # libcurl implementation
├── Parser.cpp     # HTML parsing & URL discovery
├── Scraper.cpp    # Scraper class (serial + parallel)
├── Stats.h        # Thread-safe statistics (declaration)
├── Stats.cpp      # Thread-safe statistics (implementation)
└── main.cpp       # Entry point, timing & reporting
```
## 👥 Author

Project developed by:
## 📄 License

This project is licensed under the MIT License.
You are free to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the software, subject to the conditions stated in the license.
See the LICENSE file for the full text.
