Course Link: https://www.samy101.com/edge-ai-26/
This repository demonstrates a full compression and deployment pipeline that makes large language models practical on commodity edge devices (for example, Raspberry Pi 5). The pipeline compresses Qwen2-0.5B using layer-aware structured sparsity, knowledge distillation, and SG-GPTQ 4-bit quantisation to produce a small, fast model suitable for on-device inference.
LLMs exhibit impressive functionality but are impractical for deployment outside cloud environments owing to massive memory and computational overhead. This project implements EdgeLLM, a full-fledged model compression framework to make LLMs practical on commodity edge devices. The pipeline used for Qwen2-0.5B comprises three steps:
- SparseGPT with Layer-Aware Sparsity Sensitivity (LASS): an efficient sparse pruning algorithm that assigns sparsity budgets to individual layers according to their output tolerance to sparsity.
- Knowledge Distillation (KD): mitigates accuracy loss resulting from sparse pruning.
- SG-GPTQ with NF4 4-bit quantisation: an effective compression technique that scales the model size down by ~4.4× (942 MB → 215 MB).
The final compressed model achieves a WikiText-2 perplexity of 21.58 (≈ +3.49 from the baseline), while being small and fast enough to run on edge hardware.
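The layer-aware budget assignment in the first step can be illustrated with a minimal sketch. It assumes each layer already has a scalar sensitivity score (higher means less tolerant of pruning) and that all layers hold the same number of weights. The function name `allocate_sparsity`, the `spread` parameter, and the linear rule are illustrative assumptions, not necessarily the rule the notebooks use.

```python
def allocate_sparsity(sensitivities, target=0.5, spread=0.2, lo=0.0, hi=0.95):
    """Give each layer a sparsity ratio; less sensitive layers are pruned harder.

    Hypothetical rule: budgets vary linearly around `target` by at most
    `spread`. With equally sized layers and no clipping, the mean budget
    equals `target`.
    """
    n = len(sensitivities)
    mean = sum(sensitivities) / n
    # Largest deviation from the mean; fall back to 1.0 if all scores are equal.
    max_dev = max(abs(s - mean) for s in sensitivities) or 1.0
    budgets = [target - spread * (s - mean) / max_dev for s in sensitivities]
    # Clip so that no layer is pruned outside the allowed range.
    return [min(hi, max(lo, b)) for b in budgets]


# Three layers, from most to least sensitive (made-up scores).
budgets = allocate_sparsity([0.9, 0.5, 0.1], target=0.5, spread=0.2)
print(budgets)  # approximately [0.3, 0.5, 0.7]
```

The most sensitive layer receives the smallest budget, while the average stays at the overall target.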
- notebooks/: Jupyter notebooks for experiments, training, quantisation, and evaluation.
- third_party/llama.cpp/: vendor build and helper scripts used for deploying quantised/gguf models with llama.cpp tooling.
- README.md: this file.
- Reproducible three-step compression pipeline: sparsity-aware pruning (LASS), knowledge distillation, and SG-GPTQ NF4 quantisation.
- Working artifacts: compressed GGUF models ready for runtime on CPU-bound devices.
- Notebook-driven experiments for conversion, evaluation, and lightweight inference.
For best results, download and run the notebooks from Kaggle where free GPU compute is available:
- Download the notebooks:
  - notebooks/qwen.ipynb: Qwen2-0.5B compression
  - notebooks/tinyllama.ipynb: TinyLlama-1.1B compression
  - notebooks/opt1-3.ipynb: OPT-1.3B compression
- Upload them to the Kaggle Notebook environment.
- Follow the notebook cells in order:
  - Data preparation & pruning
  - Knowledge distillation
  - SG-GPTQ quantisation
  - Model export to .gguf and evaluation
- Download the resulting .gguf models and test results to your local (CPU) machine.
Open the notebooks in notebooks/ to reproduce experiments and analyses:
- qwen.ipynb: compression steps, KD, and quantisation applied to Qwen2-0.5B.
- tinyllama.ipynb: smaller model experiments and speed/size trade-offs.
- opt1-3.ipynb: additional experiments, baselines, and comparisons.
Each notebook contains detailed, runnable steps for data preparation, pruning, distillation, quantisation, and evaluation. They also show how to export final models to the gguf format for llama.cpp consumption.
- Sparse Pruning (SparseGPT + LASS)
  - Analyze layer sensitivity and assign per-layer sparsity budgets.
  - Run SparseGPT-style pruning to induce sparsity while preserving important weights.
- Knowledge Distillation
  - Use a teacher–student setup to fine-tune the pruned model and recover accuracy.
  - Distillation scripts and parameter choices are documented in the notebooks.
- SG-GPTQ (NF4) Quantisation
  - Convert the distilled/pruned model to a 4-bit NF4 representation using SG-GPTQ.
  - Export the quantised model to .gguf for use with llama.cpp and other runtime backends.
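To make the 4-bit NF4 representation concrete, here is a minimal sketch of block-wise NF4 rounding. It is plain round-to-nearest with an absmax scale and does not include the GPTQ-style error compensation that SG-GPTQ performs; the level values are the NF4 levels published with QLoRA, rounded to four decimals, and the function names are hypothetical.

```python
# The 16 NF4 levels (rounded to 4 decimals), spanning [-1, 1].
NF4_LEVELS = [
    -1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
    0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0,
]


def quantise_block(weights):
    """Map one block of weights to 4-bit codes plus a single absmax scale."""
    scale = max(abs(w) for w in weights) or 1.0  # avoid dividing by zero
    codes = [
        min(range(16), key=lambda i: abs(NF4_LEVELS[i] - w / scale))
        for w in weights
    ]
    return codes, scale


def dequantise_block(codes, scale):
    """Recover approximate weights from the codes and the block scale."""
    return [NF4_LEVELS[c] * scale for c in codes]


block = [0.5, -0.25, 0.0, 0.1]  # made-up weights
codes, scale = quantise_block(block)
print(codes, scale)
print(dequantise_block(codes, scale))
```

Each weight is stored as a 4-bit index, so the only full-precision value kept per block is the scale.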
- The final Qwen2-0.5B compressed model achieves WikiText-2 perplexity ≈ 21.58 (baseline +3.49).
- Notebooks provide the evaluation scripts and the exact commands used to compute perplexity and other metrics.
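For reference, perplexity is the exponential of the mean negative log-likelihood per token. The sketch below shows only that definition on hypothetical per-token log-probabilities; the notebooks' evaluation code (tokenisation, context windows, stride) is not reproduced here.

```python
import math


def perplexity(token_log_probs):
    """Perplexity from natural-log probabilities of each target token."""
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Hypothetical example: every token is predicted with probability 0.25.
print(perplexity([math.log(0.25)] * 4))

# Baseline implied by the figures above: 21.58 - 3.49, i.e. about 18.09.
implied_baseline = 21.58 - 3.49
```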
- Open each notebook in the Kaggle environment (GPU enabled).
- Follow the notebook cells in order:
  - Data preparation: Load WikiText-2 or a custom dataset
  - SparseGPT + LASS: Run layer-aware sparsity analysis and pruning
  - Knowledge Distillation: Fine-tune the pruned model with teacher guidance
  - SG-GPTQ Quantisation: Convert to 4-bit NF4 and export to .gguf
  - Evaluation: Compute perplexity and other metrics
- Download the resulting models and results to your models/ directory.
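The knowledge distillation step listed above fine-tunes the pruned student against a teacher's output distribution. A minimal sketch of the commonly used temperature-scaled KL loss follows; the temperature value and function names are assumptions, and the exact loss and hyperparameters used in the notebooks may differ.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)  # teacher targets
    q = softmax(student_logits, temperature)  # student predictions
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


# Made-up logits for a single token position over a 3-word vocabulary.
print(distillation_loss([3.0, 2.0, 1.0], [1.0, 2.0, 3.0]))
```

In training, a term like this is typically averaged over token positions and often combined with the ordinary next-token cross-entropy.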
This section covers deploying the quantised GGUF models on Raspberry Pi 5 for edge inference.
Before attempting SSH connection or deployment:
- Laptop connected to hotspot/WiFi
- Raspberry Pi connected to same hotspot/WiFi
- Hotspot band set to 2.4 GHz (recommended for better range)
- Raspberry Pi IP obtained (run hostname -I on the Pi)
- SSH enabled on Raspberry Pi
On the Raspberry Pi, open a terminal and run the following commands one at a time:
- Get the Raspberry Pi username:
  whoami
  Example output:
  rpi10
- Get the Raspberry Pi IP address:
  hostname -I
  Example output:
  192.168.1.25
Note the username and IP address for later steps.
- On your laptop, open VS Code.
- Press Ctrl + Shift + P (or Cmd + Shift + P on macOS).
- Search for and select: Remote-SSH: Connect to Host
- Select: Add New SSH Host
- Enter the SSH command:
  ssh <username>@<ip_address>
  Example:
  ssh rpi10@192.168.1.25
- Select an SSH config file location when prompted.
- Choose Linux as the remote platform.
- Enter the Raspberry Pi password when prompted.
After successful connection, VS Code will open the Raspberry Pi folder remotely.
SSH into your Raspberry Pi and run the setup commands:
# Update system packages
sudo apt update && sudo apt upgrade -y
# Install required build tools
sudo apt install -y git cmake build-essential
# Optional but recommended for monitoring
sudo apt install -y htop
# Clone and build llama.cpp
cd ~
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build -j$(nproc)
This creates the llama-server binary in build/bin/.
On the Raspberry Pi, make sure the models directory exists before copying anything into it:
mkdir -p ~/models
On your laptop, transfer the .gguf model to the Pi, using the username and IP address noted earlier:
scp models/tinyllama.gguf <username>@<pi_ip>:~/models/
Or for Qwen2-0.5B:
scp models/qwen2.5.gguf <username>@<pi_ip>:~/models/
Back on the Raspberry Pi, verify that the .gguf file is present:
ls ~/models/
Create a script to run the inference server. On the Raspberry Pi:
nano ~/smoke_test.sh
Paste the following content:
#!/bin/bash
# MODEL must match the .gguf file copied to ~/models/ in the previous step
MODEL=~/models/tinyllama.gguf
SERVER=~/llama.cpp/build/bin/llama-server
echo "Starting LLM server on Raspberry Pi..."
"$SERVER" \
-m "$MODEL" \
-c 2048 \
-t "$(nproc)" \
--host 0.0.0.0 \
--port 8080
Make it executable:
chmod +x ~/smoke_test.sh
Run it:
~/smoke_test.sh
The server will start and listen on http://<pi_ip>:8080/.
From your laptop or another machine, query the server:
curl -X POST http://<pi_ip>:8080/completion \
-H "Content-Type: application/json" \
-d '{
"prompt": "Hello, world!",
"n_predict": 50
}'
Replace <pi_ip> with your Raspberry Pi IP address (e.g., 192.168.1.25).
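The same request can be sent from a script. The sketch below uses only the Python standard library and mirrors the curl call above; the helper names are hypothetical, and reading the generated text from a `content` field is an assumption about the server's JSON response that should be checked against the llama.cpp build in use.

```python
import json
import urllib.request


def build_completion_request(host, prompt, n_predict=50, port=8080):
    """Build the POST request for the llama-server /completion endpoint."""
    payload = json.dumps({"prompt": prompt, "n_predict": n_predict}).encode("utf-8")
    return urllib.request.Request(
        f"http://{host}:{port}/completion",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(host, prompt, n_predict=50, timeout=120):
    """Send the request and return the decoded JSON response as a dict."""
    request = build_completion_request(host, prompt, n_predict)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Usage (requires the server to be running on the Pi):
#   result = complete("192.168.1.25", "Hello, world!")
#   print(result.get("content"))  # assumed field name for the generated text
```

A generous timeout is used because generation on the Pi can take tens of seconds.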
- Quantised model size: ~215 MB (TinyLlama/Qwen2-0.5B)
- Memory usage on Pi: ~500–800 MB for inference
- Inference speed: 1–3 tokens/second (depends on model size and the number of threads)
- Optimal settings: Use -t $(nproc) to utilize all CPU threads on Pi 5
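As a rough guide to what these throughput figures mean in practice, the arithmetic below estimates how long the 50-token request from the earlier example would take. It ignores prompt processing and network time, so real latency will be somewhat higher; the function name is hypothetical.

```python
def generation_time_seconds(n_tokens, tokens_per_second):
    """Time to generate n_tokens at a steady decoding rate."""
    return n_tokens / tokens_per_second


slow = generation_time_seconds(50, 1.0)  # at 1 token/second
fast = generation_time_seconds(50, 3.0)  # at 3 tokens/second
print(f"50 tokens: {fast:.1f} s to {slow:.1f} s")
```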
- third_party/llama.cpp/: used for fast CPU inference of GGUF models; follow its README for building and platform-specific notes.
- Any other third-party tools used for conversion/quantisation are documented inline in the notebooks.
- If inference fails with memory errors, ensure you are using the quantised .gguf model, not the original checkpoint.
- For build issues with llama.cpp, consult third_party/llama.cpp/README.md and the project's issue tracker.
- If tokenizer mismatches occur, confirm that the tokenizer used in conversion matches the runtime tokenizer.
Contributions are welcome — open an issue or a pull request for fixes, improvements, or additional conversion scripts. When contributing:
- Describe the OS and environment used for testing.
- Include commands and minimal reproduction steps.
If you use this work in research, please cite the underlying ideas and the compression pipeline concepts implemented here. Example citation (adapt to your citation format):
EdgeLLM: Layer-aware sparsity + KD + SG-GPTQ for edge deployment. Perplexity improvements and compression details are provided in the project documentation and notebooks.
Check the repository root for a license file. Third-party components (e.g., llama.cpp) carry their own licenses, whose terms must also be followed.
For questions, open an issue or contact the maintainers via the repository issue tracker.
For step-by-step reproduction and parametrization, see the notebooks in notebooks/ and create your own models/ directory.