# Loading the LLM

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, pipeline
import torch

# Load model and tokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Configure 8-bit quantization through BitsAndBytesConfig; passing
# load_in_8bit directly to from_pretrained is deprecated.
quantization_config = BitsAndBytesConfig(load_in_8bit=True)

model = AutoModelForCausalLM.from_pretrained("gpt2",
                                             quantization_config=quantization_config,
                                             device_map="auto",
                                             output_hidden_states=True)

# Create a pipeline

generator = pipeline("text-generation",
                     model=model,
                     tokenizer=tokenizer,
                     return_full_text=False,
                     max_new_tokens=50,
                     do_sample=False)

The following generation flags are not valid and may be ignored: ['output_hidden_states']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
Device set to use cuda:0
The following generation flags are not valid and may be ignored: ['temperature']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


# The inputs and outputs of a trained Transformer LLM

In [None]:
prompt = "Write an email apologizing to Sarah for the tragic gardening mishap. Explain hot it happened."
output = generator(prompt)

print(output)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': " Give all the info you can about the process and the process itself.\n\nI will contact you if you would like to help me find this piece about hot it's a big problem.\n\nPlease email us your contact information, it really helps"}]


In [None]:
print(output[0])

{'generated_text': " Give all the info you can about the process and the process itself.\n\nI will contact you if you would like to help me find this piece about hot it's a big problem.\n\nPlease email us your contact information, it really helps"}


In [None]:
print(output[0]['generated_text'])

 Give all the info you can about the process and the process itself.

I will contact you if you would like to help me find this piece about hot it's a big problem.

Please email us your contact information, it really helps


In [None]:
print(model)

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Linear8bitLt(in_features=768, out_features=2304, bias=True)
          (c_proj): Linear8bitLt(in_features=768, out_features=768, bias=True)
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Linear8bitLt(in_features=768, out_features=3072, bias=True)
          (c_proj): Linear8bitLt(in_features=3072, out_features=768, bias=True)
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwi

# Choosing a single token from the probability distribution (sampling/decoding)

In [None]:
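prompt = "The capital of France is "

# Tokenize the prompt and move the token IDs to the GPU
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
input_ids = input_ids.to("cuda")

# Forward pass; hidden_states is returned because output_hidden_states=True was set at load time
model_output = model(input_ids)

# Project the final hidden states through the LM head: one score per vocabulary token
lm_head_output = model.lm_head(model_output.hidden_states[-1])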

In [None]:
# Greedy choice: take the highest-scoring token at the last position
token_id = lm_head_output[0, -1].argmax(-1)
tokenizer.decode(token_id)

'\xa0'

In [None]:
model_output[0].shape

torch.Size([1, 6, 50257])

In [None]:
lm_head_output.shape

torch.Size([1, 6, 50257])
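The shapes above read as (batch size, sequence length, vocabulary size): one score per vocabulary entry at each of the 6 positions, and only the last position is used to pick the next token. The following is a minimal sketch of two ways to turn such a score vector into a single token, written in plain Python so it runs without the model. The 5-word vocabulary, the scores, and the function names `softmax`, `greedy_pick`, and `sample_pick` are all made up for illustration; they are not part of the transformers API.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; a lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_pick(logits):
    """Greedy decoding: always return the index of the highest score (argmax)."""
    return max(range(len(logits)), key=lambda i: logits[i])

def sample_pick(logits, temperature=1.0, rng=random):
    """Sampling: draw an index in proportion to its softmax probability."""
    probs = softmax(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

# Hypothetical vocabulary and scores standing in for lm_head_output[0, -1]
vocab = ["Paris", "London", "the", "a", "Berlin"]
logits = [4.0, 2.5, 1.0, 0.5, 2.0]

print(vocab[greedy_pick(logits)])  # Paris

rng = random.Random(0)  # seeded so repeated runs draw the same samples
print([vocab[sample_pick(logits, rng=rng)] for _ in range(5)])
```

Greedy selection corresponds to the `argmax(-1)` call used earlier and to `do_sample=False` in the pipeline. Sampling trades that determinism for variety, with the temperature controlling how often lower-scoring tokens are drawn.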

# Speeding up generation by caching keys and values

In [None]:
prompt = "Write a very long email apologizing to Sarah for the tragic gardening mishap. Explain how it happened."

# Tokenize the input prompt
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
input_ids = input_ids.to("cuda")
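Calling `generate` with only `input_ids` makes transformers warn that no attention mask or pad token id was set, and it falls back to GPT-2's EOS id (50256) for padding. As a rough illustration of what an attention mask encodes, here is a minimal plain-Python sketch that pads a batch of token-id lists and marks real tokens with 1 and padding with 0. The helper name `pad_batch` and the token ids are made up for this example, and left-padding is an assumption chosen because a decoder-only model continues from the last position.

```python
PAD_ID = 50256  # GPT-2 defines no pad token, so its EOS id is reused as padding

def pad_batch(sequences, pad_id=PAD_ID):
    """Left-pad token-id lists to a common length and build the matching attention mask."""
    width = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = width - len(seq)
        input_ids.append([pad_id] * n_pad + list(seq))
        attention_mask.append([0] * n_pad + [1] * len(seq))  # 0 = ignore, 1 = attend
    return input_ids, attention_mask

batch = [[464, 3139, 286], [15496]]  # made-up token ids of different lengths
ids, mask = pad_batch(batch)
print(ids)   # [[464, 3139, 286], [50256, 50256, 15496]]
print(mask)  # [[1, 1, 1], [0, 0, 1]]
```

With a single unpadded prompt, as in this notebook, the mask is all ones. The tokenizer call already returns an `attention_mask` next to `input_ids`, so it can be passed to `model.generate` together with `pad_token_id=tokenizer.eos_token_id` to silence the warnings.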

In [None]:
%%timeit -n 1

generation_output = model.generate(
    input_ids=input_ids,
    max_new_tokens=100,
    use_cache=True,
)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.

4.61 s ± 499 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [None]:
%%timeit -n 1

generation_output = model.generate(
    input_ids=input_ids,
    max_new_tokens=100,
    use_cache=False,
)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.

7.12 s ± 1.42 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
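The timings above (4.61 s with the cache, 7.12 s without) reflect how much work is repeated at each generation step. The following toy sketch, in plain Python, tries to show why: without a cache the keys and values of every earlier token are recomputed at each step, while with a cache only the newest token is projected. The `ToyAttention` class, its stand-in projections, and the input vectors are invented for illustration and are far simpler than GPT-2's real attention; the token vectors are also fixed in advance rather than generated.

```python
import math

class ToyAttention:
    """A single attention head over tiny vectors that counts key/value projections."""

    def __init__(self):
        self.kv_projections = 0  # number of times a token was projected to (key, value)

    def project_kv(self, x):
        self.kv_projections += 1
        key = [2.0 * v for v in x]    # stand-in for x @ W_k
        value = [v + 1.0 for v in x]  # stand-in for x @ W_v
        return key, value

    def attend(self, query, keys, values):
        # Scaled dot-product attention of one query over all keys and values
        scale = math.sqrt(len(query))
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        dim = len(values[0])
        return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]

def run_without_cache(tokens):
    attn = ToyAttention()
    outputs = []
    for t in range(1, len(tokens) + 1):
        # Recompute keys and values for the whole prefix at every step
        kv = [attn.project_kv(x) for x in tokens[:t]]
        keys = [k for k, _ in kv]
        values = [v for _, v in kv]
        outputs.append(attn.attend(tokens[t - 1], keys, values))
    return outputs, attn.kv_projections

def run_with_cache(tokens):
    attn = ToyAttention()
    keys, values, outputs = [], [], []
    for x in tokens:
        # Project only the newest token; earlier keys and values come from the cache
        k, v = attn.project_kv(x)
        keys.append(k)
        values.append(v)
        outputs.append(attn.attend(x, keys, values))
    return outputs, attn.kv_projections

tokens = [[0.1, 0.2], [0.3, -0.1], [0.0, 0.5], [0.4, 0.4]]  # made-up token vectors
out_plain, n_plain = run_without_cache(tokens)
out_cached, n_cached = run_with_cache(tokens)

print(out_plain == out_cached)  # True: caching does not change the result
print(n_plain, n_cached)        # 10 4
```

For n tokens the uncached loop performs n(n+1)/2 projections against n for the cached loop, so the gap widens as the generated sequence grows.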
