A production-deployed Flask service that scrapes laptop listings from Daraz.pk, cleans them, reduces their dimensionality with PCA + UMAP, and serves the results through a REST + WebSocket API, with full systemd / ngrok / MCP-automation deployment.
- Scrapes laptop product listings from Daraz.pk — title, price, ratings, specs (RAM, storage), and images.
- Preprocesses the raw HTML into a clean tabular dataset with derived numeric features.
- Reduces dimensionality via PCA and UMAP and clusters with K-Means / DBSCAN, surfacing groupings like "budget vs. mid vs. premium" and price-vs-spec outliers.
- Serves the results through a Flask + SocketIO API with live progress events, downloadable CSVs, and interactive Plotly visualisations.
- Auto-runs daily via an MCP-triggered cron job + GitHub Actions workflow, so the dataset stays fresh without manual intervention.
Built as the final project for the Professional in AI Lab course at GIKI.
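
As a rough illustration of the preprocessing step listed above, the sketch below pulls numeric RAM, storage, and price features out of a listing title and price string. The regular expressions, the function names, and the `ram_gb` / `storage_gb` / `price_pkr` column names are assumptions made for this example; the actual `preprocess.py` may parse specs differently (it uses pandas, which is left out here to keep the sketch dependency-free).

```python
import re

# Hypothetical patterns; real Daraz titles are messier than this.
_RAM_RE = re.compile(r"(\d+)\s*GB\s*(?:DDR\d\w*\s*)?RAM", re.IGNORECASE)
_STORAGE_RE = re.compile(
    r"(\d+(?:\.\d+)?)\s*(GB|TB)\s*(?:NVMe\s*|SATA\s*)?(?:SSD|HDD)", re.IGNORECASE
)
_PRICE_RE = re.compile(r"\d[\d,]*(?:\.\d+)?")


def parse_price(text):
    """Turn a price string such as 'Rs. 154,999' into a float, or None."""
    match = _PRICE_RE.search(text or "")
    return float(match.group(0).replace(",", "")) if match else None


def parse_ram_gb(title):
    """Return the RAM size in GB mentioned in the title, or None."""
    match = _RAM_RE.search(title or "")
    return int(match.group(1)) if match else None


def parse_storage_gb(title):
    """Return the storage size in GB (1 TB counted as 1024 GB), or None."""
    match = _STORAGE_RE.search(title or "")
    if not match:
        return None
    size, unit = float(match.group(1)), match.group(2).upper()
    return int(size * 1024) if unit == "TB" else int(size)


def extract_features(title, price_text):
    """Build one row of derived numeric features for a listing."""
    return {
        "ram_gb": parse_ram_gb(title),
        "storage_gb": parse_storage_gb(title),
        "price_pkr": parse_price(price_text),
    }


if __name__ == "__main__":
    print(extract_features("HP Pavilion 15 Core i5 8GB RAM 512GB SSD", "Rs. 154,999"))
```

Missing values come back as `None` rather than `0`, so a later step can decide whether to drop or impute them before PCA.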
    ┌──────────────────────────────────────────────────────────────────────┐
    │                        Daraz Scraper Service                         │
    │                                                                      │
    │  ┌──────────────┐   ┌──────────────┐   ┌─────────────────────┐       │
    │  │ scraper.py   │ → │ preprocess.py│ → │ dims.py             │       │
    │  │ (BS4 + req)  │   │ (pandas)     │   │ (PCA + UMAP + KMeans│       │
    │  │              │   │              │   │  + Plotly figures)  │       │
    │  └──────┬───────┘   └──────┬───────┘   └──────────┬──────────┘       │
    │         │                  │                      │                  │
    │         ▼                  ▼                      ▼                  │
    │      raw.csv         processed.csv      pca/umap PNGs + JSON         │
    │                                                                      │
    └───────────────────────────────────┬──────────────────────────────────┘
                                        │
                                        ▼
                         ┌─────────────────────────────┐
                         │  Flask + SocketIO (app.py)  │
                         │  REST endpoints + live      │
                         │  progress over WebSocket    │
                         └──────────────┬──────────────┘
                                        │
                                        ▼
                         ┌─────────────────────────────┐
                         │  systemd unit + ngrok       │
                         │  tunnel (public URL)        │
                         └──────────────┬──────────────┘
                                        │
                                        ▼
                         ┌─────────────────────────────┐
                         │  MCP automation client      │
                         │  (mcp_automation.py)        │
                         │  runs daily via cron / GHA  │
                         └─────────────────────────────┘
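
The "live progress over WebSocket" box in the diagram can be sketched as a scrape loop that reports after every page. To keep the example runnable without a server, the emitter is passed in as a plain callable. In the real service this would presumably be Flask-SocketIO's `socketio.emit(event, data)`. The `fetch_page` callable, the payload keys, and the final `done` event are assumptions for illustration, not taken from `scraper.py`.

```python
from typing import Callable, Dict, List

Listing = Dict[str, str]


def scrape_with_progress(
    total_pages: int,
    fetch_page: Callable[[int], List[Listing]],
    emit: Callable[[str, dict], None],
) -> List[Listing]:
    """Scrape pages 1..total_pages, emitting a 'progress' event after each page."""
    listings: List[Listing] = []
    for page in range(1, total_pages + 1):
        listings.extend(fetch_page(page))
        # One event per page lets a frontend render a real-time counter.
        emit(
            "progress",
            {"page": page, "total_pages": total_pages, "items": len(listings)},
        )
    # Hypothetical completion event; the real service may signal this differently.
    emit("done", {"items": len(listings)})
    return listings


if __name__ == "__main__":
    def fake_fetch(page: int) -> List[Listing]:
        # Stand-in for a real HTTP request plus BeautifulSoup parsing.
        return [{"title": f"Laptop {page}-{i}"} for i in range(2)]

    scrape_with_progress(3, fake_fetch, lambda event, data: print(event, data))
```

Injecting the emitter also makes the loop easy to unit-test: a test can collect events in a list instead of opening a socket.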
| Endpoint | Method | What it does |
|---|---|---|
| `/` | GET | Health check + endpoint list |
| `/status` | GET | Pipeline state + file existence |
| `/scrape` | POST | Trigger a new scrape (async, emits progress over WebSocket) |
| `/process` | POST | Preprocess raw scrape into processed CSV |
| `/all` | POST | Run the full pipeline (scrape → process → reduce → plot) |
| `/csv` | GET | Download the latest processed CSV |
| `/plot/<name>` | GET | Download a specific plot (`pca` / `umap_ram` / `umap_storage` / `umap_price` / `umap_composite`) |
Swagger UI is enabled via flasgger for browsable documentation.
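
A minimal client for the endpoints in the table might look like the sketch below. It uses only the standard library. The endpoint paths, methods, and plot names come from the table; the helper names and the assumption that non-file endpoints answer with JSON are illustrative, since the response shapes are not documented here.

```python
import json
import urllib.request

BASE_URL = "http://localhost:5000"  # local default; use the ngrok URL when deployed

# Endpoint -> HTTP method, copied from the endpoint table.
ENDPOINTS = {
    "/": "GET",
    "/status": "GET",
    "/scrape": "POST",
    "/process": "POST",
    "/all": "POST",
    "/csv": "GET",
}
PLOT_NAMES = ("pca", "umap_ram", "umap_storage", "umap_price", "umap_composite")


def build_request(path, base_url=BASE_URL):
    """Create a urllib Request with the right HTTP method for a known endpoint."""
    if path.startswith("/plot/"):
        if path[len("/plot/"):] not in PLOT_NAMES:
            raise ValueError(f"unknown plot name in {path!r}")
        method = "GET"
    else:
        method = ENDPOINTS[path]  # KeyError for endpoints not in the table
    data = b"" if method == "POST" else None
    return urllib.request.Request(base_url.rstrip("/") + path, data=data, method=method)


def call(path, base_url=BASE_URL, timeout=30):
    """Send the request; decode JSON bodies, return raw bytes otherwise."""
    with urllib.request.urlopen(build_request(path, base_url), timeout=timeout) as resp:
        body = resp.read()
        content_type = resp.headers.get("Content-Type", "")
    # Assumption: status-style endpoints reply with JSON, file endpoints with bytes.
    return json.loads(body) if "json" in content_type else body


if __name__ == "__main__":
    print(call("/status"))  # requires the API to be running locally
```

With the server up, `call("/all")` would trigger the full pipeline and `call("/plot/pca")` would return the plot file as bytes.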
| Layer | Tech |
|---|---|
| Scraping | requests + BeautifulSoup |
| Data processing | pandas, numpy |
| Dimensionality reduction | scikit-learn (PCA), umap-learn |
| Clustering | scikit-learn (K-Means, DBSCAN) |
| API | Flask + Flask-SocketIO + Flask-CORS + Flasgger |
| Visualisation | matplotlib (static) + Plotly (interactive) |
| Deployment | gunicorn + systemd + ngrok tunnel |
| Automation | MCP client (mcp_automation.py) + GitHub Actions cron |
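
To show the idea behind the clustering row without pulling in scikit-learn, the sketch below swaps in a tiny pure-Python 1-D k-means (Lloyd's algorithm) on price alone and maps the three clusters to budget / mid / premium tiers. This is a simplified stand-in, not the project's method: the project lists scikit-learn's K-Means and DBSCAN, and the feature space it clusters on is not specified here. The function names and sample prices are made up for the example.

```python
from statistics import mean


def kmeans_1d(values, k=3, max_iter=100):
    """Lloyd's algorithm on scalars; returns (centroids, label per value)."""
    ordered = sorted(values)
    # Deterministic init: pick starting centroids spread across the sorted data.
    centroids = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each value goes to its nearest centroid.
        labels = [min(range(k), key=lambda c: abs(v - centroids[c])) for v in values]
        # Update step: each centroid moves to the mean of its members.
        updated = []
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            updated.append(mean(members) if members else centroids[c])
        if updated == centroids:
            break  # converged
        centroids = updated
    return centroids, labels


def price_tiers(prices):
    """Label each price as budget, mid, or premium using 3 clusters."""
    names = ["budget", "mid", "premium"]
    # Centroids start sorted and stay sorted in 1-D, so label 0 is the cheapest tier.
    _, labels = kmeans_1d(prices, k=3)
    return [names[lab] for lab in labels]


if __name__ == "__main__":
    sample = [45000, 52000, 60000, 110000, 125000, 135000, 280000, 310000, 350000]
    for price, tier in zip(sample, price_tiers(sample)):
        print(price, tier)
```

Clustering on price alone cannot surface price-vs-spec outliers; that needs the multi-feature space that PCA and UMAP operate on.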
    daraz-scraper/
    ├── app.py                  # Flask + SocketIO API server
    ├── scraper.py              # Daraz product scraper (BS4 + requests)
    ├── preprocess.py           # Raw → clean tabular pipeline
    ├── dims.py                 # PCA + UMAP + clustering + plot generation
    ├── mcp_automation.py       # MCP client that drives the API end-to-end
    ├── mcp_scheduler.py        # Cron-style scheduler for periodic runs
    ├── performance_test.py     # Load test for the API
    ├── deploy.sh, setup.sh,    # Production deployment scripts
    │   backup.sh, monitor.sh
    ├── start_flash.sh,         # Service start helpers
    │   start_ngrok.sh
    ├── ngrok.yml,              # ngrok tunnel config
    │   ngrok-tunnel.service
    ├── systemd/system/         # systemd unit files for daraz-mcp + ngrok
    ├── static/                 # Static assets (index.html + generated plots)
    ├── .github/workflows/      # Daily MCP cron via GitHub Actions
    └── requirements.txt
    git clone https://github.com/MuhammadHashimRN/daraz-scraper.git
    cd daraz-scraper
    python -m venv .venv
    source .venv/bin/activate    # Windows: .venv\Scripts\activate
    pip install -r requirements.txt

    # start the API on localhost:5000
    python app.py

Then in another terminal:

    # trigger a scrape + processing + plot end-to-end
    curl -X POST http://localhost:5000/all

For a production deployment:

    bash setup.sh                        # one-time: install deps, create systemd units
    sudo systemctl start daraz-mcp       # bring up the API
    sudo systemctl start ngrok-tunnel
    bash get-ngrok-url.sh                # prints the public ngrok URL

- Async scraping with live progress: SocketIO emits `progress` events per page so the frontend can show a real-time counter.
- Idempotent image cache: images are keyed by the MD5 hash of their URL, so re-scrapes don't re-download.
- Systemd-managed lifecycle: proper service units with restart-on-failure for both the API and the ngrok tunnel.
- MCP-driven daily pipeline: a model-context-protocol-style client (`mcp_automation.py`) drives the entire scrape → process → plot pipeline through the API and uploads artefacts.
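
The image cache described in the list above can be sketched as follows: the file name is the MD5 hex digest of the image URL, so a URL seen on an earlier scrape maps to a file that already exists and is skipped. The cache directory, the function names, and the injected `download` callable are assumptions for this example. In the real scraper the download would presumably go through `requests`.

```python
import hashlib
from pathlib import Path, PurePosixPath
from urllib.parse import urlparse

DEFAULT_CACHE_DIR = "static/images"  # hypothetical location


def cache_path(url, cache_dir=DEFAULT_CACHE_DIR):
    """Map an image URL to a stable file path named after the URL's MD5 digest."""
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    # Keep the original extension when the URL path has one; default to .jpg.
    suffix = PurePosixPath(urlparse(url).path).suffix or ".jpg"
    return Path(cache_dir) / f"{digest}{suffix}"


def fetch_image(url, download, cache_dir=DEFAULT_CACHE_DIR):
    """Return the cached file for url, calling download(url) only on a cache miss."""
    path = cache_path(url, cache_dir)
    if path.exists():
        return path  # cache hit: no network call
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(download(url))
    return path


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        fake = lambda u: b"fake image bytes"  # stand-in for an HTTP download
        first = fetch_image("https://example.com/laptop.png", fake, tmp)
        second = fetch_image("https://example.com/laptop.png", fake, tmp)
        print(first.name, first == second)
```

MD5 is used here only as a cheap, stable key and not for security. Two different URLs serving the same image would still be stored twice, because the key is the URL rather than the image content.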
Muhammad Hashim — BS Artificial Intelligence, GIK Institute (2026) 📧 muhammad808alvi@gmail.com · 🔗 github.com/MuhammadHashimRN