BlockQuant v2.0

Discord bot for quantizing HuggingFace models to EXL3 format using ExLlamaV3, with automatic upload to HuggingFace Hub. This branch runs in HF-model-only mode (GGUF conversion is not in the active pipeline).

Features

/quant — Submit a model for quantization with selectable BPW (bits-per-weight)
Live progress updates in a dedicated Discord thread
Pre-flight validation (HF token, model existence, write access) before any work starts
Automatic HuggingFace repo creation & upload
EXP system — users earn EXP by chatting, spend it on quantization jobs
Admin commands for queue management and EXP grants
Persistent jobs with restart recovery and idempotent EXP charge/refund ledger
Local model cache reuse with prune controls
Graceful shutdown with workspace cleanup
Winston-based structured logging

Prerequisites

Node.js ≥ 20
Python ≥ 3.10 with CUDA-capable PyTorch

ExLlamaV3 cloned separately and installed:

git clone https://github.com/turboderp-org/exllamav3.git
cd exllamav3
python3 -m pip install --user -r requirements.txt
python3 -m pip install --user -e .
cd ..

Hugging Face helper script dependencies:

python3 -m pip install --user -r scripts/requirements.txt

Setup

# 1. Clone and install Node deps
git clone <this-repo>
cd BlockQuant
npm install

# 2. Configure environment
cp .env.example .env
# Edit .env with your Discord bot token, HF token, etc.
# Set EXLLAMAV3_DIR to your external exllamav3 clone path.

# 3. Install Python deps (required)
python3 -m pip install --user -r scripts/requirements.txt

# 4. Run
npm start        # production
npm run dev      # development (debug logging)

Tooling

npm run lint         # lint JavaScript
npm run format:check # check formatting
npm run test         # run unit tests
npm run preflight    # run Python preflight script

Reliability and Security

Persistent jobs: queue state checkpoints are stored in data/jobs.json.
Startup recovery: queued/running/interrupted jobs are recovered on boot.
Cache reuse: downloaded source models are cached under tmp/workdir/cache.
Retries/timeouts: HF and quantization subprocesses are governed by .env timeout/retry settings.
Secret hygiene: logs and user-facing errors are redacted for token-like values.

Project Structure

BlockQuant/
├── src/
│   ├── index.js              # Entry point — bot bootstrap
│   ├── config.js             # Validated env config
│   ├── logger.js             # Winston logger
│   ├── commands/
│   │   ├── definitions.js    # SlashCommandBuilder definitions
│   │   ├── router.js         # Command → handler dispatch
│   │   ├── quant.js          # /quant handler
│   │   ├── info.js           # /score, /leaderboard, /queue
│   │   └── admin.js          # /give, /pause, /resume
│   ├── services/
│   │   ├── db.js             # JSON file persistence
│   │   ├── workspace.js      # Filesystem workspace management
│   │   ├── huggingface.js    # HF download/upload/preflight
│   │   ├── quantizer.js      # ExLlamaV3 convert.py runner
│   │   └── queue.js          # PQueue job orchestration
│   └── utils/
│       ├── embeds.js         # Discord embed builders
│       └── format.js         # Progress bars, duration formatting
├── scripts/
│   ├── preflight.py          # HF token & model validation
│   ├── download_model.py     # Model downloader
│   ├── upload_model.py       # Model uploader
│   ├── generate_quant_readme.py # HF model card writer
│   ├── check_repo.py         # Post-upload verification
│   └── requirements.txt      # Python deps
├── tests/
│   └── format.test.js
├── data/                     # Auto-created JSON databases
├── logs/                     # Auto-created log files
├── .env.example
├── .gitignore
└── package.json

Slash Commands

Command	Description	Access
`/quant`	Submit a quantization job	All
`/queue`	Check queue status	All
`/health`	Check bot health summary	All
`/score`	Check EXP balance	All
`/leaderboard`	Top 10 EXP earners	All
`/give`	Grant/deduct EXP	Admin
`/pause`	Pause the job queue	Admin
`/resume`	Resume the job queue	Admin
`/diag`	Detailed diagnostics	Admin
`/cache`	Cache status/prune/clear	Admin

How Quantization Works

Pre-flight — Validates HF token has write access and the source model exists
Download — snapshot_download() pulls the model to a temp workspace
Validate — Checks for config.json and weight files
Quantize — Runs ExLlamaV3 convert.py (single-step EXL3 conversion, profile-aware)
Upload — upload_folder() pushes the quantized model to HuggingFace
Manifest — blockquant-manifest.json is written into output artifacts
Cleanup — Active workspace is wiped after each job (cache retained)

Each BPW in a job follows steps 4-5 sequentially. If any step fails, EXP is refunded and the user is notified.

Differences from v1 (ExLlamaV2)

Single-step conversion — No separate measurement pass; ExLlamaV3 computes Hessians on-the-fly
EXL3 format — Based on QTIP, significantly better quality/size ratio
Simpler CLI — -i input -o output -w workdir -b bpw is all you need
Pre-flight checks — Token and model validation before any GPU work
Modular architecture — Commands, services, and utilities are cleanly separated

Quantization Guide

Understanding BPW (Bits Per Weight)

BPW determines the compression level and quality of your quantized model:

BPW	Quality	Use Case	Relative Size
3.0–3.5	⚠️ Low	VRAM-constrained only	~40% of FP16
4.0–4.5	✅ Good	Sweet spot — most popular	~50% of FP16
5.0–5.5	✅ Great	High quality	~60% of FP16
6.0–6.5	⭐ Excellent	Near-lossless	~70% of FP16
8.0	⭐ Premium	Minimal loss	~90% of FP16

Auto Mode (Recommended)

BlockQuant uses Auto mode by default, which intelligently adjusts parameters based on your chosen BPW:

/quant url: meta-llama/Llama-3.1-8B bpw: 4.0,5.0

Auto mode automatically sets:

head_bits — Precision for the output layer (lm_head)
cal_rows/cal_cols — Calibration data size for better quality

BPW	head_bits	Calibration	Description
≤3.5	6	200×2048	Fast, minimal VRAM
4.0–4.5	6	250×2048	Balanced (default)
5.0–5.5	8	300×2560	High quality
6.0+	8	300×2560	Higher lm_head precision

Manual Profiles

Override auto settings with profiles (optional):

/quant url: model bpw: 4.0 profile: quality   # Fixed head_bits=8
/quant url: model bpw: 4.0 profile: fast      # Fast conversion

Profile	head_bits	Best For
`auto`	Scales with BPW	Most users (recommended)
`fast`	6	Quick conversion
`balanced`	6	Original behavior
`quality`	8	Maximum quality

Note: Admins can override head_bits directly with head_bits: 4-16.

Technical Notes

head_bits differences: Based on community testing (ExLlamaV2/V3), the difference between head_bits=6 and head_bits=8 is typically less than 1% in perplexity. The Auto mode uses head_bits=6 for most BPWs as the quality gain from head_bits=8 is minimal compared to the VRAM cost.

Calibration data: The default calibration size (250 rows × 2048 columns) works well for most models. Increasing calibration data (e.g., to 300×2560) may marginally improve quality for high-BPW quants but significantly increases conversion time.

Source format: EXL3 uses QTIP quantization. The calibration data is used to compute Hessians on-the-fly during conversion, unlike EXL2 which required a separate measurement pass.

Recommended Workflows

For VRAM-constrained (24GB GPU):

/quant url: model bpw: 3.5        # Fits 70B models

For balanced quality/size (most popular):

/quant url: model bpw: 4.0,4.5    # ~90% quality, half the size

For high-quality outputs:

/quant url: model bpw: 5.0,6.0    # 95-98% quality

Multiple variants at once:

/quant url: model bpw: 3.5,4.0,4.5,5.0

Name		Name	Last commit message	Last commit date
Latest commit History 21 Commits
data		data
scripts		scripts
src		src
tests		tests
.env.example		.env.example
.gitignore		.gitignore
.prettierignore		.prettierignore
.prettierrc.json		.prettierrc.json
README.md		README.md
eslint.config.js		eslint.config.js
package-lock.json		package-lock.json
package.json		package.json

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

BlockQuant v2.0

Features

Prerequisites

Setup

Tooling

Reliability and Security

Project Structure

Slash Commands

How Quantization Works

Differences from v1 (ExLlamaV2)

Quantization Guide

Understanding BPW (Bits Per Weight)

Auto Mode (Recommended)

Manual Profiles

Technical Notes

Recommended Workflows

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

BlockQuant v2.0

Features

Prerequisites

Setup

Tooling

Reliability and Security

Project Structure

Slash Commands

How Quantization Works

Differences from v1 (ExLlamaV2)

Quantization Guide

Understanding BPW (Bits Per Weight)

Auto Mode (Recommended)

Manual Profiles

Technical Notes

Recommended Workflows

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages