MuhammadHashimRN/daraz-scraper

Daraz Scraper & Analytics API

A production-deployed Flask service that scrapes laptop listings from Daraz.pk, cleans them, reduces their dimensionality with PCA and UMAP, and serves the results through a REST + WebSocket API. Deployment is handled end to end with systemd, an ngrok tunnel, and MCP-driven automation.

Python · Flask · scikit-learn · Plotly · Deployed · License: MIT


What it does

  1. Scrapes laptop product listings from Daraz.pk — title, price, ratings, specs (RAM, storage), and images.
  2. Preprocesses the raw scrape output (raw.csv) into a clean tabular dataset (processed.csv) with derived numeric features.
  3. Reduces dimensionality via PCA and UMAP and clusters with K-Means / DBSCAN, surfacing groupings like "budget vs. mid vs. premium" and price-vs-spec outliers.
  4. Serves the results through a Flask + SocketIO API with live progress events, downloadable CSVs, and interactive Plotly visualisations.
  5. Auto-runs daily via an MCP-triggered cron job + GitHub Actions workflow, so the dataset stays fresh without manual intervention.
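
To make the "derived numeric features" in step 2 concrete, here is a minimal sketch of pulling RAM, storage, and price out of listing text. The regular expressions, the function names (`parse_price`, `extract_features`), and the output column names are assumptions for illustration; the repository's preprocess.py may use different rules.

```python
import re

# Hypothetical patterns: real Daraz titles vary, so treat these as a starting point.
RAM_RE = re.compile(r"(\d+)\s*GB\s*(?:DDR\d\w*\s*)?RAM", re.IGNORECASE)
STORAGE_RE = re.compile(
    r"(\d+)\s*(GB|TB)\s*(?:NVMe\s*|M\.2\s*)?(SSD|HDD)", re.IGNORECASE
)
PRICE_RE = re.compile(r"\d[\d,]*(?:\.\d+)?")


def parse_price(text):
    """Turn a price string such as 'Rs. 154,999' into a float, or None."""
    match = PRICE_RE.search(text)
    return float(match.group().replace(",", "")) if match else None


def extract_features(title, price_text):
    """Derive numeric columns from a listing title and its price string."""
    ram = RAM_RE.search(title)
    storage = STORAGE_RE.search(title)
    storage_gb = None
    if storage:
        size = int(storage.group(1))
        # Normalise terabytes to gigabytes so the column has one unit.
        storage_gb = size * 1024 if storage.group(2).upper() == "TB" else size
    return {
        "ram_gb": int(ram.group(1)) if ram else None,
        "storage_gb": storage_gb,
        "price_pkr": parse_price(price_text),
    }
```

Rows where a field comes back as `None` would need to be dropped or imputed before any dimensionality reduction, since PCA and K-Means do not accept missing values.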

Built as the final project for the Professional in AI Lab course at GIKI.

Architecture

┌──────────────────────────────────────────────────────────────────────┐
│                       Daraz Scraper Service                          │
│                                                                       │
│   ┌──────────────┐    ┌──────────────┐    ┌─────────────────────┐    │
│   │  scraper.py  │ →  │ preprocess   │ →  │   dims.py           │    │
│   │  (BS4 + req) │    │   .py        │    │ (PCA + UMAP + KMeans│    │
│   │              │    │ (pandas)     │    │  + Plotly figures)  │    │
│   └──────┬───────┘    └──────┬───────┘    └──────────┬──────────┘    │
│          │                   │                       │                │
│          ▼                   ▼                       ▼                │
│       raw.csv          processed.csv         pca/umap PNGs + JSON    │
│                                                                       │
└───────────────────────────────┬───────────────────────────────────────┘
                                │
                                ▼
                  ┌─────────────────────────────┐
                  │  Flask + SocketIO (app.py)  │
                  │  REST endpoints + live      │
                  │  progress over WebSocket    │
                  └──────────────┬──────────────┘
                                 │
                                 ▼
                  ┌─────────────────────────────┐
                  │   systemd unit + ngrok      │
                  │   tunnel (public URL)       │
                  └─────────────────────────────┘
                                 │
                                 ▼
                  ┌─────────────────────────────┐
                  │   MCP automation client     │
                  │   (mcp_automation.py)       │
                  │   runs daily via cron / GHA │
                  └─────────────────────────────┘

API surface

| Endpoint | Method | What it does |
|---|---|---|
| / | GET | Health check + endpoint list |
| /status | GET | Pipeline state + file existence |
| /scrape | POST | Trigger a new scrape (async, emits progress over WebSocket) |
| /process | POST | Preprocess raw scrape into processed CSV |
| /all | POST | Run the full pipeline (scrape → process → reduce → plot) |
| /csv | GET | Download the latest processed CSV |
| /plot/<name> | GET | Download a specific plot (pca / umap_ram / umap_storage / umap_price / umap_composite) |

Swagger UI is enabled via flasgger for browsable documentation.
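
For scripted use, the endpoints can be called with the standard library alone. The sketch below is illustrative: `BASE_URL`, `build_request`, and `call` are hypothetical names, and it assumes the JSON-returning endpoints (such as /status) respond with a JSON body and that the POST endpoints need no request payload, neither of which is confirmed here.

```python
import json
import urllib.request

# Assumption: the API is running locally on the default port from "Local run".
BASE_URL = "http://localhost:5000"


def build_request(path, method="GET"):
    """Build (but do not send) a request for one of the documented endpoints."""
    # No request body is documented for the POST endpoints, so send an empty one.
    data = b"" if method == "POST" else None
    return urllib.request.Request(BASE_URL + path, data=data, method=method)


def call(path, method="GET", timeout=30):
    """Send the request and decode a JSON response.

    Assumption: the endpoint answers with JSON. /csv and /plot/<name>
    return files and should be read as raw bytes instead.
    """
    request = build_request(path, method)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, `call("/status")` would fetch the pipeline state and `call("/all", method="POST")` would trigger a full run. Because /all may take a while, a longer `timeout` could be needed.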

Tech stack

| Layer | Tech |
|---|---|
| Scraping | requests + BeautifulSoup |
| Data processing | pandas, numpy |
| Dimensionality reduction | scikit-learn (PCA), umap-learn |
| Clustering | scikit-learn (K-Means, DBSCAN) |
| API | Flask + Flask-SocketIO + Flask-CORS + Flasgger |
| Visualisation | matplotlib (static) + Plotly (interactive) |
| Deployment | gunicorn + systemd + ngrok tunnel |
| Automation | MCP client (mcp_automation.py) + GitHub Actions cron |
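
As a rough illustration of how the dimensionality-reduction and clustering layers fit together, the sketch below standardises three numeric features, projects them to two dimensions with PCA, and clusters the projection with K-Means. The data is synthetic and the feature columns (RAM, storage, price) are assumptions standing in for processed.csv. UMAP and DBSCAN are left out to keep the example dependent on scikit-learn and numpy only, and this is not the code in dims.py.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for processed.csv: RAM (GB), storage (GB), price (PKR).
rng = np.random.default_rng(0)
tiers = [(4, 256, 60_000), (8, 512, 120_000), (16, 1024, 250_000)]
X = np.vstack([
    np.column_stack([
        np.full(30, ram),
        np.full(30, storage),
        rng.normal(price, price * 0.05, size=30),  # 5% price noise per tier
    ])
    for ram, storage, price in tiers
])

# Standardise first: price is orders of magnitude larger than RAM.
X_scaled = StandardScaler().fit_transform(X)

# Project to 2-D for plotting, then cluster in the projected space.
coords = PCA(n_components=2).fit_transform(X_scaled)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(coords)
```

Scaling before PCA matters here: without it, the price column would dominate the variance and the projection would largely ignore RAM and storage.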

Repository layout

daraz-scraper/
├── app.py                  # Flask + SocketIO API server
├── scraper.py              # Daraz product scraper (BS4 + requests)
├── preprocess.py           # Raw → clean tabular pipeline
├── dims.py                 # PCA + UMAP + clustering + plot generation
├── mcp_automation.py       # MCP client that drives the API end-to-end
├── mcp_scheduler.py        # Cron-style scheduler for periodic runs
├── performance_test.py     # Load test for the API
├── deploy.sh, setup.sh,    # Production deployment scripts
│   backup.sh, monitor.sh
├── start_flash.sh,         # Service start helpers
│   start_ngrok.sh
├── ngrok.yml,              # ngrok tunnel config
│   ngrok-tunnel.service
├── systemd/system/         # systemd unit files for daraz-mcp + ngrok
├── static/                 # Static assets (index.html + generated plots)
├── .github/workflows/      # Daily MCP cron via GitHub Actions
└── requirements.txt

Local run

git clone https://github.com/MuhammadHashimRN/daraz-scraper.git
cd daraz-scraper

python -m venv .venv
source .venv/bin/activate     # Windows: .venv\Scripts\activate
pip install -r requirements.txt

# start the API on localhost:5000
python app.py

Then in another terminal:

# trigger a scrape + processing + plot end-to-end
curl -X POST http://localhost:5000/all

Production deployment (Linux server)

bash setup.sh                  # one-time: install deps, create systemd units
sudo systemctl start daraz-mcp # bring up the API
sudo systemctl start ngrok-tunnel
bash get-ngrok-url.sh          # prints the public ngrok URL

Notable engineering bits

  • Async scraping with live progress — SocketIO emits progress events per page so the frontend can show a real-time counter.
  • Idempotent image cache — images keyed by MD5 hash of URL; re-scrapes don't re-download.
  • Systemd-managed lifecycle — proper service units with restart-on-failure for both the API and the ngrok tunnel.
  • MCP-driven daily pipeline — a model-context-protocol-style client (mcp_automation.py) drives the entire scrape → process → plot pipeline through the API and uploads artefacts.
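
The idempotent image cache can be sketched as follows. This is a minimal illustration of keying files by the MD5 hash of the URL, not the repository's implementation: `cache_path`, `fetch_image`, the injected `download` callable, and the `.jpg` fallback extension are all assumptions.

```python
import hashlib
import os
from pathlib import Path
from urllib.parse import urlparse


def cache_path(url, cache_dir):
    """Map an image URL to a deterministic file path inside cache_dir."""
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    # Keep the URL's file extension if it has one; default to .jpg (assumption).
    suffix = os.path.splitext(urlparse(url).path)[1] or ".jpg"
    return Path(cache_dir) / f"{digest}{suffix}"


def fetch_image(url, cache_dir, download):
    """Return the cached path, calling download(url) -> bytes only on a miss."""
    path = cache_path(url, cache_dir)
    if not path.exists():
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(download(url))
    return path
```

Passing the downloader in as a function keeps the cache logic testable without network access. In a real scraper it could be a small wrapper around `requests.get(url).content`.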

Author

Muhammad Hashim — BS Artificial Intelligence, GIK Institute (2026) 📧 muhammad808alvi@gmail.com · 🔗 github.com/MuhammadHashimRN

License

MIT
