In [1]:
# %load ../firstcell.py
%load_ext autoreload
%autoreload 2
%matplotlib inline

# Typical Inference with PyTorch

This Jupyter notebook uses the Hugging Face Transformers library to demonstrate a typical inference workflow with the GPT-2 model.

## Workflow

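1. Load the GPT-2 model from Hugging Face Transformers.
2. Tokenize the input text using the GPT-2 tokenizer.
3. Pass the tokenized input to the GPT-2 model for inference.
4. Decode the output from the GPT-2 model to obtain the generated text. This step needs a model with a language-modeling head, such as `GPT2LMHeadModel`; the bare `GPT2Model` loaded in this notebook returns hidden states only.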

## Code Example

Here's an example of how to run inference with the `gpt2-large` checkpoint from Hugging Face Transformers:

In [2]:
from transformers import GPT2Tokenizer, GPT2Model

In [3]:
tokenizer = GPT2Tokenizer.from_pretrained("gpt2-large")
model = GPT2Model.from_pretrained("gpt2-large")

In [4]:
text = "Testing a model for everyone."
encoded_input = tokenizer(text, return_tensors="pt")

In [9]:
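# GPT2Model has no language-modeling head, so `output.last_hidden_state` holds hidden states, not generated text
output = model(**encoded_input)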

In [10]:
model

GPT2Model(
  (wte): Embedding(50257, 1280)
  (wpe): Embedding(1024, 1280)
  (drop): Dropout(p=0.1, inplace=False)
  (h): ModuleList(
    (0-35): 36 x GPT2Block(
      (ln_1): LayerNorm((1280,), eps=1e-05, elementwise_affine=True)
      (attn): GPT2Attention(
        (c_attn): Conv1D()
        (c_proj): Conv1D()
        (attn_dropout): Dropout(p=0.1, inplace=False)
        (resid_dropout): Dropout(p=0.1, inplace=False)
      )
      (ln_2): LayerNorm((1280,), eps=1e-05, elementwise_affine=True)
      (mlp): GPT2MLP(
        (c_fc): Conv1D()
        (c_proj): Conv1D()
        (act): NewGELUActivation()
        (dropout): Dropout(p=0.1, inplace=False)
      )
    )
  )
  (ln_f): LayerNorm((1280,), eps=1e-05, elementwise_affine=True)
)
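
The cells above cover steps 1 to 3 only: `GPT2Model` is the bare transformer without a language-modeling head, so there is no generated text to decode. A minimal sketch of the full four-step workflow might look like the following. It assumes the `GPT2LMHeadModel` class and the smaller `gpt2` checkpoint, neither of which is used in the original cells. The function name `generate_text` and its parameters are illustrative, and the imports sit inside the function so that defining it neither loads the libraries nor downloads weights.

```python
def generate_text(prompt, model_name="gpt2", max_new_tokens=20):
    """Run the four workflow steps and return the decoded continuation.

    `model_name` defaults to the small "gpt2" checkpoint (an assumption made
    to keep the download light); pass "gpt2-large" to match the notebook.
    """
    # Imported lazily: calling this function requires torch and transformers.
    import torch
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    # Step 1: load the tokenizer and a model with a language-modeling head.
    tokenizer = GPT2Tokenizer.from_pretrained(model_name)
    model = GPT2LMHeadModel.from_pretrained(model_name)
    model.eval()

    # Step 2: tokenize the input text into PyTorch tensors.
    encoded_input = tokenizer(prompt, return_tensors="pt")

    # Step 3: run greedy generation without tracking gradients.
    with torch.no_grad():
        output_ids = model.generate(
            **encoded_input,
            max_new_tokens=max_new_tokens,
            do_sample=False,
            pad_token_id=tokenizer.eos_token_id,  # GPT-2 defines no pad token
        )

    # Step 4: decode the token IDs back into text.
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling `generate_text("Testing a model for everyone.")` downloads the checkpoint on first use and returns the prompt followed by the model's continuation.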

# DeepCrunch Model Compression

DeepCrunch is a model compression tool that can accelerate inference by compressing the model. This notebook demonstrates weight-only compression, which works without tuning. Activation compression, by contrast, requires tuning.
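
One way to see why weight-only compression can skip tuning is that the quantization range can be read directly from the weights themselves, whereas activation ranges depend on the inputs. The following is a minimal pure-Python sketch of per-row affine quantization to unsigned 8-bit values. It is an illustration under that assumption, not DeepCrunch's implementation; the helper names `quantize_row` and `dequantize_row` and the sample weights are hypothetical.

```python
def quantize_row(row, levels=256):
    """Map one row of float weights onto integers in [0, levels - 1]."""
    lo, hi = min(row), max(row)
    span = hi - lo
    # A constant row has zero span; fall back to a scale of 1.0.
    scale = span / (levels - 1) if span > 0 else 1.0
    quantized = [round((w - lo) / scale) for w in row]
    return quantized, scale, lo


def dequantize_row(quantized, scale, lo):
    """Recover approximate float weights from the stored integers."""
    return [lo + scale * v for v in quantized]


# Hypothetical weights: each row gets its own scale and offset.
weights = [[-0.50, 0.00, 0.25, 1.00], [0.10, 0.20, 0.30, 0.40]]

results = []
for row in weights:
    q, scale, lo = quantize_row(row)
    restored = dequantize_row(q, scale, lo)
    max_error = max(abs(a - b) for a, b in zip(row, restored))
    results.append((q, scale, max_error))
    print(q, f"max error {max_error:.5f}")
```

The rounding error per weight is bounded by half the row's scale, and no input data is needed to compute `scale` or `lo`.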

## Workflow

1. Load the GPT-2 model from Hugging Face Transformers.
2. Compress the model using DeepCrunch.
3. Evaluate the compressed model (here, by comparing file sizes).

## Code Example

Here's an example of how to use DeepCrunch for weight-only compression without tuning:

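```python
# If deepcrunch is not installed and you cloned the repo,
# add the repo root to the import path.
import importlib.util
import sys

# find_spec checks whether the package can be imported, not whether it already has been
if importlib.util.find_spec('deepcrunch') is None:
    sys.path.append('../..')
```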

In [6]:
# if deepcrunch is installed on your system
import deepcrunch

You can specify the backend to use for quantization:

```python
deepcrunch.config(framework='torch', mode='inference', backend='neural_compressor')
```

In [18]:
deepcrunch.config(framework='torch', mode='inference')

quantized_model = deepcrunch.quantize(model, type='dynamic', dtype='quint8', output_path='quantized_gpt2_large.pt')

Quantized model saved to quantized_gpt2_large.pt


In [19]:
quantized_model

GPT2Model(
  (wte): QuantizedEmbedding(num_embeddings=50257, embedding_dim=1280, dtype=torch.quint8, qscheme=torch.per_channel_affine_float_qparams)
  (wpe): QuantizedEmbedding(num_embeddings=1024, embedding_dim=1280, dtype=torch.quint8, qscheme=torch.per_channel_affine_float_qparams)
  (drop): Dropout(p=0.1, inplace=False)
  (h): ModuleList(
    (0-35): 36 x GPT2Block(
      (ln_1): LayerNorm((1280,), eps=1e-05, elementwise_affine=True)
      (attn): GPT2Attention(
        (c_attn): Conv1D()
        (c_proj): Conv1D()
        (attn_dropout): Dropout(p=0.1, inplace=False)
        (resid_dropout): Dropout(p=0.1, inplace=False)
      )
      (ln_2): LayerNorm((1280,), eps=1e-05, elementwise_affine=True)
      (mlp): GPT2MLP(
        (c_fc): Conv1D()
        (c_proj): Conv1D()
        (act): NewGELUActivation()
        (dropout): Dropout(p=0.1, inplace=False)
      )
    )
  )
  (ln_f): LayerNorm((1280,), eps=1e-05, elementwise_affine=True)
)
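
In the printout above, only the two embedding tables (`wte` and `wpe`) appear as `QuantizedEmbedding` modules; the `Conv1D` layers inside each block are printed unchanged. A back-of-the-envelope estimate of the resulting saving can be sketched as follows. It assumes 4 bytes per float32 weight, 1 byte per `quint8` weight, and 1 MB = 10^6 bytes, and it ignores the per-row quantization parameters, so the numbers are approximate.

```python
EMBED_DIM = 1280  # embedding width shown in the printout

# Number of rows in each embedding table, taken from the printout.
embedding_rows = {"wte": 50257, "wpe": 1024}

n_params = sum(rows * EMBED_DIM for rows in embedding_rows.values())

fp32_mb = n_params * 4 / 1e6   # float32: 4 bytes per weight
quint8_mb = n_params * 1 / 1e6  # quint8: 1 byte per weight
saved_mb = fp32_mb - quint8_mb

print(f"embedding parameters: {n_params:,}")
print(f"float32: {fp32_mb:.2f} MB, quint8: {quint8_mb:.2f} MB, saved: {saved_mb:.2f} MB")
```

Because the embeddings are a small share of the model's parameters, a saving of this order leaves most of the file size untouched.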

In [15]:
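# Original model size (assumes the unquantized model was saved to 'gpt2-large.pt' beforehand; that step is not shown here)
print("Original Model Size")
# Jupyter echoes only a cell's last expression, so this value is not displayed; wrap the call in print() to see it
deepcrunch.performance.size_in_mb('gpt2-large.pt', human_readable=True)

# Quantized model size
print("Quantized Model Size")
deepcrunch.performance.size_in_mb('quantized_gpt2_large.pt', human_readable=True)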

Original Model Size
Quantized Model Size


'2801.54 MB'
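
To put two checkpoint sizes side by side, a standard-library sketch along the following lines could be used. The helpers `size_mb` and `compression_ratio` are hypothetical and are not part of DeepCrunch; the sketch assumes 1 MB = 10^6 bytes, which may differ from the convention `deepcrunch.performance.size_in_mb` uses. Two throwaway files stand in for the saved models.

```python
import os
import tempfile


def size_mb(path):
    """Return the size of the file at `path` in megabytes (10**6 bytes)."""
    return os.path.getsize(path) / 1e6


def compression_ratio(original_path, compressed_path):
    """Return how many times smaller the compressed file is."""
    return os.path.getsize(original_path) / os.path.getsize(compressed_path)


# Stand-in files; replace the paths with the real checkpoint files.
with tempfile.TemporaryDirectory() as tmp:
    original = os.path.join(tmp, "original.bin")
    compressed = os.path.join(tmp, "compressed.bin")
    with open(original, "wb") as f:
        f.write(b"\0" * 4_000_000)
    with open(compressed, "wb") as f:
        f.write(b"\0" * 1_000_000)

    original_mb = size_mb(original)
    compressed_mb = size_mb(compressed)
    ratio = compression_ratio(original, compressed)
    print(f"{original_mb:.2f} MB -> {compressed_mb:.2f} MB ({ratio:.1f}x smaller)")
```

Printing both values explicitly avoids relying on the notebook's display of the last expression.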