Daraz Scraper & Analytics API

A production-deployed Flask service that scrapes laptop listings from Daraz.pk, cleans them, reduces their dimensionality with PCA and UMAP, and serves the results through a REST + WebSocket API. Deployment is handled with systemd, an ngrok tunnel, and an MCP automation client.

What it does

Scrapes laptop product listings from Daraz.pk — title, price, ratings, specs (RAM, storage), and images.
Preprocesses the raw scrape output into a clean tabular dataset with derived numeric features (for example RAM and storage sizes parsed from the listing text).
Reduces dimensionality via PCA and UMAP and clusters with K-Means / DBSCAN, surfacing groupings like "budget vs. mid vs. premium" and price-vs-spec outliers.
Serves the results through a Flask + SocketIO API with live progress events, downloadable CSVs, and interactive Plotly visualisations.
Auto-runs daily via an MCP-triggered cron job + GitHub Actions workflow, so the dataset stays fresh without manual intervention.
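
The preprocessing step above turns free-text listing fields into numeric columns. A minimal sketch of that idea follows, assuming titles mention specs in forms such as "8GB RAM" and "512GB SSD" and prices look like "Rs. 124,999". The function names, the regular expressions, and the 1 TB = 1024 GB convention are illustrative assumptions, not taken from preprocess.py.

```python
import re
from typing import Optional

# Hypothetical patterns; the real preprocess.py may parse these fields differently.
_RAM_RE = re.compile(r"(\d+)\s*GB\s*(?:DDR\d\w*\s*)?RAM", re.IGNORECASE)
_STORAGE_RE = re.compile(
    r"(\d+)\s*(GB|TB)\s*(?:NVMe\s*|SATA\s*)?(?:SSD|HDD)", re.IGNORECASE
)
_PRICE_RE = re.compile(r"\d[\d,]*(?:\.\d+)?")


def parse_price(text: str) -> Optional[float]:
    """Extract the first number from a price string such as 'Rs. 124,999'."""
    match = _PRICE_RE.search(text)
    if match is None:
        return None
    return float(match.group(0).replace(",", ""))


def parse_ram_gb(title: str) -> Optional[int]:
    """Return the RAM size in GB if the title states it, else None."""
    match = _RAM_RE.search(title)
    return int(match.group(1)) if match else None


def parse_storage_gb(title: str) -> Optional[int]:
    """Return the storage size in GB (1 TB counted as 1024 GB), else None."""
    match = _STORAGE_RE.search(title)
    if match is None:
        return None
    size = int(match.group(1))
    return size * 1024 if match.group(2).upper() == "TB" else size


def extract_features(title: str, price_text: str) -> dict:
    """Build one row of derived numeric features from raw listing text."""
    return {
        "price": parse_price(price_text),
        "ram_gb": parse_ram_gb(title),
        "storage_gb": parse_storage_gb(title),
    }
```

Returning None for a missing spec, rather than 0, keeps "unknown" distinguishable from a real value when the rows are later loaded into pandas.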

Built as the final project for the Professional in AI Lab course at GIKI.

Architecture

┌──────────────────────────────────────────────────────────────────────┐
│                       Daraz Scraper Service                          │
│                                                                       │
│   ┌──────────────┐    ┌──────────────┐    ┌─────────────────────┐    │
│   │  scraper.py  │ →  │ preprocess   │ →  │   dims.py           │    │
│   │  (BS4 + req) │    │   .py        │    │ (PCA + UMAP + KMeans│    │
│   │              │    │ (pandas)     │    │  + Plotly figures)  │    │
│   └──────┬───────┘    └──────┬───────┘    └──────────┬──────────┘    │
│          │                   │                       │                │
│          ▼                   ▼                       ▼                │
│       raw.csv          processed.csv         pca/umap PNGs + JSON    │
│                                                                       │
└───────────────────────────────┬───────────────────────────────────────┘
                                │
                                ▼
                  ┌─────────────────────────────┐
                  │  Flask + SocketIO (app.py)  │
                  │  REST endpoints + live      │
                  │  progress over WebSocket    │
                  └──────────────┬──────────────┘
                                 │
                                 ▼
                  ┌─────────────────────────────┐
                  │   systemd unit + ngrok      │
                  │   tunnel (public URL)       │
                  └─────────────────────────────┘
                                 │
                                 ▼
                  ┌─────────────────────────────┐
                  │   MCP automation client     │
                  │   (mcp_automation.py)       │
                  │   runs daily via cron / GHA │
                  └─────────────────────────────┘
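
The diagram shows the reduction stage only as a box. As a rough sketch of what a PCA-then-cluster step can look like with scikit-learn, the block below standardises three numeric features, projects them to two dimensions, and runs K-Means on the projection. The `reduce_and_cluster` and `synthetic_laptops` names, the choice of three clusters, and the made-up input rows are assumptions for illustration; the UMAP and DBSCAN steps are left out, and dims.py may be organised differently.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def reduce_and_cluster(features: np.ndarray, n_clusters: int = 3, seed: int = 0):
    """Standardise features, project to 2-D with PCA, then cluster the projection."""
    # Scaling matters here: price is in the tens of thousands, RAM in single digits.
    scaled = StandardScaler().fit_transform(features)
    pca = PCA(n_components=2, random_state=seed)
    embedding = pca.fit_transform(scaled)
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    labels = kmeans.fit_predict(embedding)
    return embedding, labels, pca.explained_variance_ratio_


def synthetic_laptops(n_per_tier: int = 30, seed: int = 0) -> np.ndarray:
    """Made-up rows of [price_pkr, ram_gb, storage_gb] in three loose tiers.

    These values are invented for the example and are not real Daraz data.
    """
    rng = np.random.default_rng(seed)
    centres = np.array(
        [[60_000, 4, 256], [120_000, 8, 512], [300_000, 16, 1024]], dtype=float
    )
    spread = np.array([8_000, 1, 64], dtype=float)
    return np.vstack([rng.normal(c, spread, size=(n_per_tier, 3)) for c in centres])


if __name__ == "__main__":
    embedding, labels, ratio = reduce_and_cluster(synthetic_laptops())
    print("embedding shape:", embedding.shape)
    print("cluster sizes:", np.bincount(labels))
    print("explained variance ratio:", ratio)
```

Clustering on the 2-D projection keeps the clusters aligned with what a scatter plot shows; clustering on the full scaled feature matrix is an equally reasonable choice.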

API surface

| Endpoint | Method | What it does |
|---|---|---|
| `/` | GET | Health check + endpoint list |
| `/status` | GET | Pipeline state + file existence |
| `/scrape` | POST | Trigger a new scrape (async, emits progress over WebSocket) |
| `/process` | POST | Preprocess raw scrape into processed CSV |
| `/all` | POST | Run the full pipeline (scrape → process → reduce → plot) |
| `/csv` | GET | Download the latest processed CSV |
| `/plot/<name>` | GET | Download a specific plot (`pca`, `umap_ram`, `umap_storage`, `umap_price`, `umap_composite`) |

Swagger UI is enabled via flasgger for browsable documentation.
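
For scripted access, the endpoints above can be wrapped in a few small helpers. This is a sketch using only the standard library. It assumes `/status` answers with a JSON body and that `/all` accepts an empty POST body; neither is confirmed by this README, and the helper names are hypothetical.

```python
import json
import urllib.request
from urllib.parse import urljoin

# Plot names as listed in the endpoint table.
PLOT_NAMES = ("pca", "umap_ram", "umap_storage", "umap_price", "umap_composite")


def endpoint(base_url: str, path: str) -> str:
    """Join a base URL and an endpoint path without doubling or dropping slashes."""
    return urljoin(base_url.rstrip("/") + "/", path.lstrip("/"))


def plot_url(base_url: str, name: str) -> str:
    """Build the /plot/<name> URL, rejecting names the table does not list."""
    if name not in PLOT_NAMES:
        raise ValueError(f"unknown plot name: {name!r}")
    return endpoint(base_url, f"plot/{name}")


def get_status(base_url: str, timeout: float = 10.0) -> dict:
    """GET /status and decode the body (assumed here to be JSON)."""
    with urllib.request.urlopen(endpoint(base_url, "status"), timeout=timeout) as resp:
        return json.load(resp)


def run_full_pipeline(base_url: str, timeout: float = 10.0) -> int:
    """POST /all with an empty body and return the HTTP status code."""
    request = urllib.request.Request(endpoint(base_url, "all"), data=b"", method="POST")
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return resp.status


if __name__ == "__main__":
    base = "http://localhost:5000"
    print(run_full_pipeline(base))
    print(get_status(base))
```

Validating the plot name on the client side avoids a round trip for an obvious typo; the server remains the authority on which plots actually exist.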

Tech stack

| Layer | Tech |
|---|---|
| Scraping | requests + BeautifulSoup |
| Data processing | pandas, numpy |
| Dimensionality reduction | scikit-learn (PCA), umap-learn |
| Clustering | scikit-learn (K-Means, DBSCAN) |
| API | Flask + Flask-SocketIO + Flask-CORS + Flasgger |
| Visualisation | matplotlib (static) + Plotly (interactive) |
| Deployment | gunicorn + systemd + ngrok tunnel |
| Automation | MCP client (`mcp_automation.py`) + GitHub Actions cron |

Repository layout

daraz-scraper/
├── app.py                  # Flask + SocketIO API server
├── scraper.py              # Daraz product scraper (BS4 + requests)
├── preprocess.py           # Raw → clean tabular pipeline
├── dims.py                 # PCA + UMAP + clustering + plot generation
├── mcp_automation.py       # MCP client that drives the API end-to-end
├── mcp_scheduler.py        # Cron-style scheduler for periodic runs
├── performance_test.py     # Load test for the API
├── deploy.sh, setup.sh,    # Production deployment scripts
│   backup.sh, monitor.sh
├── start_flash.sh,         # Service start helpers
│   start_ngrok.sh
├── ngrok.yml,              # ngrok tunnel config
│   ngrok-tunnel.service
├── systemd/system/         # systemd unit files for daraz-mcp + ngrok
├── static/                 # Static assets (index.html + generated plots)
├── .github/workflows/      # Daily MCP cron via GitHub Actions
└── requirements.txt

Local run

git clone https://github.com/MuhammadHashimRN/daraz-scraper.git
cd daraz-scraper

python -m venv .venv
source .venv/bin/activate     # Windows: .venv\Scripts\activate
pip install -r requirements.txt

# start the API on localhost:5000
python app.py

Then in another terminal:

# trigger a scrape + processing + plot end-to-end
curl -X POST http://localhost:5000/all

Production deployment (Linux server)

bash setup.sh                  # one-time: install deps, create systemd units
sudo systemctl start daraz-mcp # bring up the API
sudo systemctl start ngrok-tunnel
bash get-ngrok-url.sh          # prints the public ngrok URL
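
The contents of get-ngrok-url.sh are not shown here. One common way to look up the public URL is to query the ngrok agent's local inspection API, which by default listens on 127.0.0.1:4040 and lists active tunnels at `/api/tunnels`. The sketch below takes that route in Python; it assumes the default address is unchanged in ngrok.yml, and the function names are hypothetical.

```python
import json
import urllib.request
from typing import Optional

# Default address of the ngrok agent's local API; adjust if ngrok.yml overrides it.
NGROK_API = "http://127.0.0.1:4040/api/tunnels"


def pick_public_url(payload: dict) -> Optional[str]:
    """Return the first https public URL in a tunnels payload, else any URL, else None."""
    urls = [
        tunnel["public_url"]
        for tunnel in payload.get("tunnels", [])
        if tunnel.get("public_url")
    ]
    for url in urls:
        if url.startswith("https://"):
            return url
    return urls[0] if urls else None


def fetch_public_url(api_url: str = NGROK_API, timeout: float = 5.0) -> Optional[str]:
    """Query the running ngrok agent and return its public URL, if a tunnel is up."""
    with urllib.request.urlopen(api_url, timeout=timeout) as resp:
        return pick_public_url(json.load(resp))


if __name__ == "__main__":
    print(fetch_public_url() or "no active tunnel")
```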

Notable engineering bits

Async scraping with live progress — SocketIO emits progress events per page so the frontend can show a real-time counter.
Idempotent image cache — images keyed by MD5 hash of URL; re-scrapes don't re-download.
Systemd-managed lifecycle — proper service units with restart-on-failure for both the API and the ngrok tunnel.
MCP-driven daily pipeline — a model-context-protocol-style client (mcp_automation.py) drives the entire scrape → process → plot pipeline through the API and uploads artefacts.
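
The image cache described above can be sketched in a few lines: hash the URL, use the digest as the filename, and skip the download when that file already exists. The `cache_path` and `fetch_image` names, the `.jpg` fallback extension, and the injectable `download` parameter are assumptions made for this example rather than details of scraper.py.

```python
import hashlib
import urllib.request
from pathlib import Path
from typing import Callable, Optional
from urllib.parse import urlparse


def cache_path(url: str, cache_dir: Path) -> Path:
    """Map an image URL to a stable path: <md5 of URL><original extension>."""
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    # Keep the extension from the URL path; fall back to .jpg if there is none.
    suffix = Path(urlparse(url).path).suffix or ".jpg"
    return cache_dir / f"{digest}{suffix}"


def _http_download(url: str) -> bytes:
    """Default downloader: fetch the raw bytes over HTTP."""
    with urllib.request.urlopen(url, timeout=15) as resp:
        return resp.read()


def fetch_image(
    url: str,
    cache_dir: Path,
    download: Optional[Callable[[str], bytes]] = None,
) -> Path:
    """Return the cached file for url, downloading only on a cache miss."""
    target = cache_path(url, cache_dir)
    if target.exists():
        return target  # cache hit: a re-scrape does no network work
    cache_dir.mkdir(parents=True, exist_ok=True)
    fetch = download if download is not None else _http_download
    target.write_bytes(fetch(url))
    return target
```

MD5 is used here only as a cheap, stable key for filenames, not for security. Passing a custom `download` callable makes the cache easy to exercise in tests without touching the network.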

Author

Muhammad Hashim, BS Artificial Intelligence, GIK Institute (2026)
📧 muhammad808alvi@gmail.com · 🔗 github.com/MuhammadHashimRN

License

MIT
