## Load Base Model

In [1]:
from transformers import AutoModelForCausalLM, AutoTokenizer
from pruna.algorithms.smasher_config import SmasherConfig
from pruna.smash import smash

# Load the tokenizer and base model, then move the model to the GPU.
tokenizer = AutoTokenizer.from_pretrained('facebook/opt-125m')
model = AutoModelForCausalLM.from_pretrained('facebook/opt-125m', trust_remote_code=True, torch_dtype="auto")
model.to('cuda')
# Tokenize the prompt once; both the base and the smashed model reuse these inputs.
ins = tokenizer("What are we having for dinner?", return_tensors="pt", truncation=True).to('cuda')

Post-training Optimization Tool is deprecated and will be removed in the future. Please use Neural Network Compression Framework instead: https://github.com/openvinotoolkit/nncf
Nevergrad package could not be imported. If you are planning to use any hyperparameter optimization algo, consider installing it using pip. This implies advanced usage of the tool. Note that nevergrad is compatible only with Python 3.7+
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


## Smash it!

### Define Config

In [2]:
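smasher_config = SmasherConfig()
# Compile for generation with the CTranslate2 backend, using 16-bit weights.
smasher_config['compiler'] = 'ctranslate2_generation'
smasher_config['n_quantization_bits'] = 16
# The tokenizer object loaded above is passed under the 'tokenizer_name' key.
smasher_config['tokenizer_name'] = tokenizer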

### Smash

In [3]:
smashed_model = smash(
    model=model,
    data_module="Polyglot_1000",
    api_key='your-api-key',
    model_config=None,
    smasher_config=smasher_config,
    device='cuda',
)

Compile...


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Success.


## Base Model Generation

In [4]:
%%time
results = model.generate(**ins, max_length=50)

CPU times: user 1.11 s, sys: 18.7 ms, total: 1.13 s
Wall time: 1.13 s
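
A single `%%time` run can include one-off costs, such as lazy initialization on the first call. As a minimal sketch of a more repeatable measurement, the hypothetical helper `time_call` below (not part of pruna or transformers) runs a few warm-up calls and then reports the median wall time. For GPU work you would likely also call `torch.cuda.synchronize()` before reading the clock; that is left out here so the sketch runs without a GPU.

```python
import statistics
import time


def time_call(fn, warmup=2, repeats=5):
    """Return the median wall time of fn() in seconds (hypothetical helper)."""
    # Warm-up calls absorb one-off costs such as lazy initialization.
    for _ in range(warmup):
        fn()
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


# Stand-in workload so the sketch is self-contained; in this notebook it
# would be: lambda: model.generate(**ins, max_length=50)
median_s = time_call(lambda: sum(range(100_000)))
print(f"median wall time: {median_s * 1e3:.2f} ms")
```

The same helper could be applied to the smashed model so that both measurements are taken under the same conditions.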


In [5]:
# decode() keeps special tokens by default, so the text starts with '</s>'.
output = tokenizer.decode(results[0])

In [6]:
output

"</s>What are we having for dinner?\nA nice dinner with a friend.\nI'm not sure what to do with the rest of the night.\nI'm going to have to go to bed.\nI'm going to have to go"

## Smashed Model Generation

In [7]:
%%time
results = smashed_model(ins, max_length=50)

CPU times: user 141 ms, sys: 40.2 ms, total: 181 ms
Wall time: 180 ms
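
To make the comparison concrete, the short sketch below computes the speedup from the two wall times reported by the `%%time` cells above. The variable names are illustrative, and both numbers come from single runs, so the ratio should be read as a rough indication rather than a benchmark.

```python
# Wall times copied from the two %%time cells above (single runs).
base_wall_s = 1.13      # base model: model.generate(**ins, max_length=50)
smashed_wall_s = 0.180  # smashed model: smashed_model(ins, max_length=50)

speedup = base_wall_s / smashed_wall_s
reduction = 1 - smashed_wall_s / base_wall_s
print(f"speedup: {speedup:.1f}x, wall time reduced by {reduction:.0%}")
# prints "speedup: 6.3x, wall time reduced by 84%"
```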


In [8]:
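# The smashed model returns CTranslate2-style results; the generated
# token ids of the first hypothesis are in sequences_ids[0].
output = tokenizer.decode(results[0].sequences_ids[0])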

In [9]:
output

"What are we having for dinner?\nA nice dinner with a friend.\nI'm not sure what to do with the rest of the night.\nI'm going to have to go to bed.\nI'm going to have to go to"