[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/IST-DASLab/sparsegpt/blob/master/demo.ipynb)

Install dependencies

In [1]:
!pip install -q datasets
!pip install -q transformers

Clone repository

In [2]:
!git clone https://github.com/IST-DASLab/sparsegpt

Cloning into 'sparsegpt'...
remote: Enumerating objects: 46, done.
remote: Counting objects: 100% (35/35), done.
remote: Compressing objects: 100% (25/25), done.
remote: Total 46 (delta 22), reused 13 (delta 10), pack-reused 11 (from 1)
Receiving objects: 100% (46/46), 26.80 KiB | 13.40 MiB/s, done.
Resolving deltas: 100% (22/22), done.


### Pruning example
---

The example below applies SparseGPT to an OPT model.

In [3]:
%cd sparsegpt

/content/sparsegpt


Create a directory to store the pruned model(s).

In [4]:
!mkdir -p sparse_opt

We will use the `opt.py` script to prune the model.
Select one of the following OPT versions, which fit into Colab (with `bitsandbytes`, one should be able to use the larger 6.7b and 13b models):
* facebook/opt-125m
* facebook/opt-350m
* facebook/opt-1.3b

To prune the model, select a dataset for calibration (`c4`, `ptb` or `wikitext2`). The SparseGPT paper uses `c4` by default.

With SparseGPT, one can prune the model to uniform sparsity using either unstructured pruning or a semi-structured `N:M` pattern.

To apply unstructured pruning, specify `--sparsity`, a floating-point number in `[0, 1]`.

For semi-structured pruning, specify the `--prunen` and `--prunem` arguments (both integers).
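To make the `N:M` pattern concrete, here is a minimal sketch in plain Python. It assumes that `n` is the number of weights zeroed in every group of `m` consecutive weights, and it picks them by smallest magnitude. SparseGPT itself chooses weights with its own saliency criterion, so the magnitude rule and the helper name `nm_prune` are illustrative assumptions, not the logic of `opt.py`.

```python
def nm_prune(weights, n, m):
    """Zero the n smallest-magnitude entries in every group of m consecutive weights.

    Illustrative only: magnitude is used as a stand-in selection rule.
    """
    if len(weights) % m != 0:
        raise ValueError("row length must be a multiple of m")
    pruned = list(weights)
    for start in range(0, len(weights), m):
        group = range(start, start + m)
        # Indices of the n entries with the smallest absolute value in this group.
        drop = sorted(group, key=lambda i: abs(weights[i]))[:n]
        for i in drop:
            pruned[i] = 0.0
    return pruned


row = [0.9, -0.1, 0.4, 0.05, -0.7, 0.3, 0.2, -0.8]
print(nm_prune(row, n=2, m=4))
# -> [0.9, 0.0, 0.4, 0.0, -0.7, 0.0, 0.0, -0.8]
```

Every group of four keeps exactly two non-zero weights, which is the regularity that distinguishes a semi-structured pattern from unstructured sparsity.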

To apply magnitude pruning instead of SparseGPT, select the `--gmp` option.
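For intuition, magnitude pruning can be sketched as follows: rank the weights of a layer by absolute value and zero the smallest fraction. This is a simplified, dependency-free illustration on a flat list; the helper name `magnitude_prune` is hypothetical and the script's actual implementation may differ in details such as thresholding and tie handling.

```python
def magnitude_prune(weights, sparsity):
    """Zero the `sparsity` fraction of weights with the smallest absolute value."""
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must lie in [0, 1]")
    num_to_prune = int(len(weights) * sparsity)
    # Rank positions by magnitude and drop the lowest-ranked ones.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(order[:num_to_prune])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]


layer = [0.9, -0.1, 0.05, 0.02, -0.7, 0.6, 0.5, -0.8]
pruned = magnitude_prune(layer, sparsity=0.5)
print(pruned)
# -> [0.9, 0.0, 0.0, 0.0, -0.7, 0.6, 0.0, -0.8]
```

Note that the zeros land wherever the small weights happen to be (three in the first half, one in the second), with no constraint on their layout.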

To apply quantization on top of sparsity, specify `--wbits`.

In the example below, we prune `facebook/opt-125m` to 0.5 unstructured sparsity via SparseGPT. Feel free to try different options.


In [5]:
!python opt.py facebook/opt-125m c4 --sparsity 0.5 --save sparse_opt/opt-125m

At the end of its run, the command above prints perplexity on the `wikitext2`, `ptb` and `c4` benchmarks.
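As a reminder of what the reported number means, perplexity is conventionally the exponential of the average per-token negative log-likelihood, so lower is better. The sketch below computes it from a list of hypothetical per-token log-probabilities; it illustrates the definition only and is not the evaluation code from the repository.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp(mean negative log-likelihood per token)."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Hypothetical natural-log probabilities that a model assigned to four tokens.
log_probs = [math.log(0.25)] * 4
print(perplexity(log_probs))  # approximately 4
```

A model that assigns probability 1/4 to every token is, on average, as uncertain as a uniform choice among four options, hence a perplexity of about 4.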

### Compare generations
---

Let us compare generations produced by the dense and sparse models.

In [6]:
from transformers import AutoTokenizer, OPTForCausalLM

In [7]:
device = 'cuda'

In [8]:
# load dense model
model_dn = OPTForCausalLM.from_pretrained('facebook/opt-125m', torch_dtype='auto').to(device)
# load sparse model
model_sp = OPTForCausalLM.from_pretrained('sparse_opt/opt-125m', torch_dtype='auto').to(device)
# init tokenizer
tokenizer = AutoTokenizer.from_pretrained('facebook/opt-125m')


OSError: sparse_opt/opt-125m is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models'
If this is a private repository, make sure to pass a token having permission to this repo either by logging in with `huggingface-cli login` or by passing `token=<your_token>`

In [None]:
input_text = "It takes a great deal of bravery"

In [None]:
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to(device)

Completion by dense model:

In [None]:
output_ids = model_dn.generate(input_ids)

In [None]:
print(tokenizer.decode(output_ids[0].cpu(), skip_special_tokens=True))

Completion by sparse model:

In [None]:
output_ids = model_sp.generate(input_ids)

In [None]:
print(tokenizer.decode(output_ids[0].cpu(), skip_special_tokens=True))
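Besides comparing generations, it can be useful to confirm that the loaded checkpoint is actually sparse. The sketch below counts exact zeros in every `torch.nn.Linear` layer. The helper name `linear_sparsity` is hypothetical, and a small stand-in model is used so that the snippet runs on its own; in this notebook one would pass `model_sp` instead.

```python
import torch


def linear_sparsity(model):
    """Return {module_name: fraction of zero weights} for every nn.Linear in `model`."""
    report = {}
    for name, module in model.named_modules():
        if isinstance(module, torch.nn.Linear):
            weight = module.weight.data
            report[name] = (weight == 0).sum().item() / weight.numel()
    return report


# Stand-in model (an assumption for illustration, not the OPT checkpoint).
toy = torch.nn.Sequential(
    torch.nn.Linear(4, 4), torch.nn.ReLU(), torch.nn.Linear(4, 2)
)
with torch.no_grad():
    toy[0].weight[:, :2] = 0.0  # zero half of the first layer's weights

for name, fraction in linear_sparsity(toy).items():
    print(f"{name}: {fraction:.2f}")
```

For a model pruned with `--sparsity 0.5`, one would expect the pruned layers to report values close to 0.5; layers that the pruning script does not touch, if any, would report values near zero.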