# GPT-J 6B

## 1. Load model and tokenizer from the Hugging Face Hub

By default, GPT-J is loaded in fp32, which takes about 24 GB of CPU memory.
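
The 24 GB figure follows from simple arithmetic: roughly 6 billion parameters at 4 bytes each. The sketch below reproduces that estimate; `NUM_PARAMS` is a rounded assumption rather than the exact parameter count, and the helper name `weight_memory_gb` is made up for illustration.

```python
# Back-of-the-envelope estimate of weight memory for a ~6B-parameter model.
# NUM_PARAMS is a rounded assumption, not the exact parameter count.
NUM_PARAMS = 6_000_000_000
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_memory_gb(num_params, dtype):
    """Return the weight footprint in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: ~{weight_memory_gb(NUM_PARAMS, dtype):.0f} GB")
```

This counts weights only; activations and framework overhead come on top.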

In [1]:
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")

model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B")

## 2. Use BMInf wrapper for low-resource inference

In [2]:
import torch
import bminf
with torch.cuda.device(0):
    model = bminf.wrapper(model, quantization=False, memory_limit=8 << 30)  # 8GB
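
The `memory_limit` value above is written as a bit shift. Going by the `# 8GB` comment in the cell, it appears to be a byte count. The sketch below only checks the arithmetic; the helper name `gib` is hypothetical and is not part of BMInf.

```python
GIB = 1 << 30  # 2**30 bytes


def gib(n):
    """Return n gibibytes expressed in bytes (hypothetical helper)."""
    return n * GIB


memory_limit = 8 << 30
assert memory_limit == gib(8) == 8_589_934_592
print(memory_limit // (1 << 20), "MiB")  # 8192 MiB
```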

## 3. Check GPU memory usage

In [3]:
print(torch.cuda.memory_summary())

|                  PyTorch CUDA memory summary, device ID 0                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 0            |        cudaMalloc retries: 0         |
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      |    9297 MB |    9297 MB |    9297 MB |       0 B  |
|       from large pool |    9296 MB |    9296 MB |    9296 MB |       0 B  |
|       from small pool |       1 MB |       1 MB |       1 MB |       0 B  |
|---------------------------------------------------------------------------|
| Active memory         |    9297 MB |    9297 MB |    9297 MB |       0 B  |
|       from large pool |    9296 MB |    9296 MB |    9296 MB |       0 B  |
|       from small pool |       1 MB |       1 MB |       1 MB |       0 B  |
|---------------------------------------------------------------------------|
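
When only a few numbers are needed rather than the full summary, PyTorch also exposes individual counters. The following is a minimal sketch, assuming a CUDA device is present; the helper names `to_mib` and `cuda_memory_report` are made up for this example.

```python
def to_mib(num_bytes):
    """Convert a byte count to mebibytes (1 MiB = 2**20 bytes)."""
    return num_bytes / (1 << 20)


def cuda_memory_report(device=0):
    """Return current, peak, and reserved CUDA memory in MiB for one device."""
    import torch  # imported lazily so to_mib works without PyTorch installed

    return {
        "allocated_mib": to_mib(torch.cuda.memory_allocated(device)),
        "peak_allocated_mib": to_mib(torch.cuda.max_memory_allocated(device)),
        "reserved_mib": to_mib(torch.cuda.memory_reserved(device)),
    }


# Usage, on a machine with a CUDA device:
# print(cuda_memory_report(0))
```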

## 4. Run generation

In [9]:
prompt = "To be or not to be, that"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
gen_tokens = model.generate(
    input_ids.cuda(),
    do_sample=True,
    temperature=0.9,
    max_length=20
)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
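
The warning above can likely be avoided by passing the tokenizer's `attention_mask` and an explicit `pad_token_id` to `generate`. The sketch below is one way to do that, not the notebook's original cell; the wrapper name `generate_with_mask` and its `device` parameter are assumptions made for illustration.

```python
def generate_with_mask(model, tokenizer, prompt, device="cuda", **gen_kwargs):
    """Tokenize `prompt` and call `model.generate` with an explicit mask and pad id."""
    enc = tokenizer(prompt, return_tensors="pt")
    return model.generate(
        enc.input_ids.to(device),
        attention_mask=enc.attention_mask.to(device),
        # Reuse EOS as the pad token, which is what the warning falls back to.
        pad_token_id=tokenizer.eos_token_id,
        **gen_kwargs,
    )


# Usage with the objects defined earlier in this notebook:
# gen_tokens = generate_with_mask(
#     model, tokenizer, "To be or not to be, that",
#     do_sample=True, temperature=0.9, max_length=20,
# )
```

Sampling is still random, so the generated text will differ from run to run.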


## 5. Get the generated text

In [10]:
tokenizer.batch_decode(gen_tokens)

['To be or not to be, that is the question — that has been the question, and still']