# ⚡ Parallel Web Scraper

Blazing-fast web scraper for structured websites, built with C++20, Intel TBB and libcurl.
Scrape, parse and analyse large sets of pages – in serial and in parallel – and measure real speedup on modern CPUs.
- 💥 The Problem
- 💡 The Solution
- ✨ Key Features
- 🏗 Architecture & Tech Stack
- 🔥 How It Works (The Flow)
- 📈 Performance
- 🛠 Getting Started
- 📂 Project Structure
- 👥 Author
- 📄 License
## 💥 The Problem

Traditional web scrapers usually:
- Run single-threaded, leaving most CPU cores idle.
- Mix network I/O, HTML parsing and data aggregation in one big ball of mud.
- Make it hard to measure and compare performance between serial and parallel approaches.
If you want to experiment with parallelism, benchmark speedups, or learn Intel TBB on a real-world workload (HTTP calls + parsing + stats), you need a clean, focused playground.
## 💡 The Solution

Parallel Web Scraper is a C++20 project that:
- Scrapes a structured demo site (e.g. an online book catalogue).
- Implements both serial and parallel scraping pipelines.
- Uses Intel Threading Building Blocks (TBB) for task-based parallelism.
- Aggregates statistics in a thread-safe way.
- Outputs comparable metrics so you can see the actual speedup.
It’s designed as a practical reference for:
- Parallel programming with Intel TBB
- Efficient use of libcurl in C++
- Lock-aware, thread-safe statistics aggregation
## ✨ Key Features

- ⚙️ Dual implementation – serial scraper vs. parallel scraper
- 🧵 Task-based parallelism with Intel TBB (no manual thread management)
- 🌐 HTTP layer using libcurl with timeouts & retry logic
- 📊 Rich statistics:
- Total pages visited & unique URLs
- Total items/books scraped
- Rating distribution (1★–5★)
- Average rating & average price
- Cheapest & most expensive item
- 🧷 Thread-safe stats using `std::atomic`, `std::mutex` and TBB concurrent containers
- 📝 Human-readable report printed to the console and saved to `results.txt`
- 🧪 Perfect as a benchmark / learning project for parallel patterns
## 🏗 Architecture & Tech Stack

- Language: C++20
- Parallelism: Intel Threading Building Blocks (TBB)
- Networking: libcurl
- Standard Library: `<atomic>`, `<mutex>`, `<queue>`, `<vector>`, `<string>`, `<chrono>`, `<fstream>`, `<iostream>`, etc.
- **Book / BookInfo** (`Book.h`)
  Small structs representing a scraped item (title, price, rating, category) plus helper info for tracking the cheapest and most expensive book.
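A rough sketch of what these structs look like; the field names are assumptions based on the description above, and the real `Book.h` may differ:

```cpp
#include <string>

// Sketch of the data model; actual field names in Book.h may differ.
struct Book {
    std::string title;
    std::string category;
    double price = 0.0;  // list price as shown on the page
    int rating = 0;      // 1-5 stars
};

// Helper used when tracking the cheapest / most expensive book.
struct BookInfo {
    std::string title;
    double price = 0.0;
};
```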
- **Stats** (`Stats.h`, `Stats.cpp`)
  Thread-safe statistics aggregator:
  - Uses `std::atomic` counters for counts, sums and price accumulators.
  - Uses a `std::mutex` to protect min/max book updates.
  - `update(const std::vector<Book>&)` merges a local batch into the global stats.
  - `reset()` clears everything for a fresh run.
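A minimal sketch of such an aggregator, combining `std::atomic` counters with a `std::mutex`-guarded min/max section. The method names follow the description above, but the real class holds more fields (rating distribution, per-category counts, etc.):

```cpp
#include <atomic>
#include <limits>
#include <mutex>
#include <string>
#include <vector>

// Minimal stand-in for the Book struct from Book.h.
struct Book {
    std::string title;
    double price;
    int rating;
};

class Stats {
public:
    // Merge a locally collected batch into the global statistics.
    void update(const std::vector<Book>& batch) {
        for (const Book& b : batch) {
            books_.fetch_add(1, std::memory_order_relaxed);
            // Accumulate prices as integer cents so a plain 64-bit atomic works.
            priceCents_.fetch_add(static_cast<long long>(b.price * 100 + 0.5),
                                  std::memory_order_relaxed);
            std::lock_guard<std::mutex> lock(minMaxMutex_);  // guards the min/max pair
            if (b.price < minPrice_) { minPrice_ = b.price; cheapest_ = b.title; }
            if (b.price > maxPrice_) { maxPrice_ = b.price; priciest_ = b.title; }
        }
    }

    // Clear everything for a fresh run.
    void reset() {
        books_ = 0;
        priceCents_ = 0;
        std::lock_guard<std::mutex> lock(minMaxMutex_);
        minPrice_ = std::numeric_limits<double>::max();
        maxPrice_ = std::numeric_limits<double>::lowest();
        cheapest_.clear();
        priciest_.clear();
    }

    double averagePrice() const {
        long long n = books_.load();
        return n ? (priceCents_.load() / 100.0) / n : 0.0;
    }

    std::string cheapestTitle() const {
        std::lock_guard<std::mutex> lock(minMaxMutex_);
        return cheapest_;
    }

private:
    std::atomic<long long> books_{0};
    std::atomic<long long> priceCents_{0};
    mutable std::mutex minMaxMutex_;
    double minPrice_ = std::numeric_limits<double>::max();
    double maxPrice_ = std::numeric_limits<double>::lowest();
    std::string cheapest_, priciest_;
};
```

The split mirrors the design decision in the bullet list: monotonic counters stay lock-free, while the min/max pair (two values that must change together) sits behind a mutex.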
- **Downloader** (`Downloader.h`, `Downloader.cpp`)
  Thin wrapper around libcurl:
  - Performs HTTP GET with a timeout.
  - Retries failed requests a few times before giving up.
  - Returns the HTML as a `std::string`.
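The retry behaviour can be sketched independently of libcurl by injecting the single-attempt transport as a callable. In the real `Downloader` that callable would wrap `curl_easy_perform` with `CURLOPT_TIMEOUT` set; the names below are illustrative:

```cpp
#include <functional>
#include <optional>
#include <string>

// One attempt: returns the body on success, std::nullopt on failure.
using FetchFn = std::function<std::optional<std::string>(const std::string& url)>;

// Retry a failing GET a few times before giving up.
std::optional<std::string> fetchWithRetry(const std::string& url,
                                          const FetchFn& fetchOnce,
                                          int maxAttempts = 3) {
    for (int attempt = 1; attempt <= maxAttempts; ++attempt) {
        if (auto body = fetchOnce(url)) return body;  // success: hand back the HTML
        // On failure, fall through and try again (a real client might back off here).
    }
    return std::nullopt;  // all attempts failed
}
```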
- **Parser** (`Parser.cpp` + header)
  Lightweight HTML parser tuned for the target site:
  - Extracts title, price, rating and category from each product block.
  - Finds links to the next page and other relevant URLs.
  - Normalizes relative URLs into absolute ones.
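Extracting one field can be sketched with `<regex>`; the pattern below targets the `price_color` markup that books.toscrape.com uses for prices, while the real parser also extracts titles, ratings and categories:

```cpp
#include <regex>
#include <string>
#include <vector>

// Pull every price out of a page. books.toscrape.com renders prices as
// <p class="price_color">£51.77</p>, so we capture the numeric part.
std::vector<double> extractPrices(const std::string& html) {
    static const std::regex pricePattern(
        R"(<p class="price_color">[^0-9]*([0-9]+\.[0-9]{2})</p>)");
    std::vector<double> prices;
    for (std::sregex_iterator it(html.begin(), html.end(), pricePattern), end;
         it != end; ++it) {
        prices.push_back(std::stod((*it)[1].str()));  // capture group 1 = the number
    }
    return prices;
}
```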
- **Scraper** (`Scraper.cpp` + header)
  Orchestrates the whole process:
  - Maintains:
    - `tbb::concurrent_unordered_set<std::string> visited;`
    - `tbb::concurrent_unordered_set<std::string> categories;`
    - A shared `Stats` instance.
  - Provides both:
    - `serialScrape(startUrl)`
    - `parallelScrape(startUrl)`
  - Uses TBB task groups in the parallel version to recursively spawn new work.
- **Entry Point** (`main.cpp`)
  - Initializes libcurl.
  - Defines the starting URL (e.g. `https://books.toscrape.com/index.html`).
  - Runs the serial scrape and measures execution time.
  - Resets shared state.
  - Runs the parallel scrape and measures execution time again.
  - Computes averages and min/max data.
  - Writes a detailed report to `std::cout` and to `results.txt`.
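Measuring each run is a standard `std::chrono` pattern; a sketch (the scrape calls in the trailing comment are placeholders for the real methods):

```cpp
#include <chrono>
#include <functional>

// Run a workload and return its wall-clock duration in milliseconds.
double timeMs(const std::function<void()>& work) {
    auto start = std::chrono::steady_clock::now();  // steady_clock: immune to wall-clock jumps
    work();
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

// Usage sketch for main():
//   double serialMs   = timeMs([&] { scraper.serialScrape(startUrl); });
//   scraper.reset();
//   double parallelMs = timeMs([&] { scraper.parallelScrape(startUrl); });
//   double speedup    = serialMs / parallelMs;
```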
## 🔥 How It Works (The Flow)

1. **Bootstrap** 🚀
   `main.cpp` initializes libcurl and creates a `Scraper` instance with fresh `Stats` and empty visited sets.
2. **Serial Scrape** 🐢
   - Starts from the root URL.
   - Uses a simple queue (BFS-style) to:
     - download a page → parse the HTML → extract books & new URLs,
     - update statistics,
     - enqueue new URLs that haven't been visited yet.
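The serial pipeline above is a plain breadth-first traversal; a compact, self-contained sketch with the download-and-parse step injected as a callable (names are illustrative):

```cpp
#include <functional>
#include <queue>
#include <string>
#include <unordered_set>
#include <vector>

// Visit every page reachable from startUrl exactly once, breadth-first.
// processPage downloads + parses one URL and returns the links it found.
std::size_t serialScrape(
    const std::string& startUrl,
    const std::function<std::vector<std::string>(const std::string&)>& processPage) {
    std::unordered_set<std::string> visited{startUrl};
    std::queue<std::string> frontier;
    frontier.push(startUrl);
    while (!frontier.empty()) {
        std::string url = std::move(frontier.front());
        frontier.pop();
        for (std::string& next : processPage(url)) {  // download → parse → links
            if (visited.insert(next).second) {        // not seen before
                frontier.push(std::move(next));       // enqueue for a later iteration
            }
        }
    }
    return visited.size();  // pages visited
}
```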
3. **Reset** 🔄
   - Clears `visited` and other shared data.
   - Resets `Stats` to zero.
4. **Parallel Scrape** ⚡
   - Starts from the same root URL.
   - Creates TBB tasks for discovered URLs; each task:
     - checks `visited` (a concurrent set),
     - downloads and parses the page,
     - updates the shared `Stats` through its thread-safe API,
     - spawns new tasks for newly discovered URLs.
5. **Reporting** 📊
   Once both runs finish, the program:
   - prints serial vs. parallel stats,
   - shows total time, time per page/book, and the computed speedup,
   - saves the same report to `results.txt` for later inspection.
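Emitting the same report to both the console and `results.txt` is easiest if it is formatted once into a string; a sketch with only the timing fields (the real report contains all the statistics listed earlier):

```cpp
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>

// Format the comparison once, so console and file output cannot diverge.
std::string buildReport(double serialMs, double parallelMs) {
    std::ostringstream out;
    out << "Serial time:   " << serialMs << " ms\n"
        << "Parallel time: " << parallelMs << " ms\n"
        << "Speedup:       " << serialMs / parallelMs << "x\n";
    return out.str();
}

void writeReport(const std::string& report) {
    std::cout << report;                     // console copy
    std::ofstream("results.txt") << report;  // persisted copy
}
```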
## 📈 Performance

The project is designed to showcase speedup on multi-core CPUs:
- Parallel execution leverages multiple cores for downloading & parsing.
- Thread-safe statistics ensure correctness despite concurrency.
- Serial and parallel implementations scrape the same dataset, so metrics are directly comparable.
Exact timings and speedup depend on your machine and network, but on a modern multi-core CPU you should clearly see the parallel version outperform the serial baseline.
## 🛠 Getting Started

### Clone the repository

```bash
git clone https://github.com/MilanSazdov/parallel-web-scraper.git
cd parallel-web-scraper/Web_sakupljac
```

You have two options for building:
### Option 1: Visual Studio

- Open `Web_sakupljac/Web_sakupljac.sln` in Visual Studio.
- Select the desired configuration (e.g. `x64-Debug` or `x64-Release`).
- Press Build → Build Solution or `Ctrl+Shift+B`.
- Run the project with `F5`.
### Option 2: Command line

If you want to build from the command line (Linux, WSL, MinGW, etc.), make sure you have:
- A C++20-compatible compiler (`g++`, `clang++`, etc.)
- Intel TBB installed
- libcurl development libraries installed
From inside `Web_sakupljac/`:

```bash
g++ -std=c++20 -O2 \
    main.cpp web_scraper.cpp \
    -ltbb -lcurl \
    -o web_scraper
```

Adjust the library flags (`-ltbb`, `-lcurl`) if your platform uses different names.
### Run

From the `Web_sakupljac/` directory:

```bash
./web_scraper
# or on Windows
web_scraper.exe
```

The program will:

- Run the serial scraper.
- Run the parallel scraper.
- Print a side-by-side comparison.
- Generate `results.txt` with the same report.
## 📂 Project Structure

```
.
├── Book.h         # Book & BookInfo structs
├── Downloader.h   # Downloader interface
├── Downloader.cpp # libcurl implementation
├── Parser.cpp     # HTML parsing & URL discovery
├── Scraper.cpp    # Scraper class (serial + parallel)
├── Stats.h        # Thread-safe statistics (declaration)
├── Stats.cpp      # Thread-safe statistics (implementation)
└── main.cpp       # Entry point, timing & reporting
```
## 👥 Author

Project developed by:
## 📄 License

This project is licensed under the MIT License.
You are free to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the software, subject to the conditions stated in the license.
See the LICENSE file for the full text.
