<a href="https://colab.research.google.com/github/StrikingHour/Hands-On-Large-Language-Models/blob/main/Chap_3_Looking_Insider_LLMs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [11]:
%%capture
!pip install transformers>=4.41.2 accelerate>=0.31.0

In [12]:
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    device_map = "cuda",
    torch_dtype = "auto",
    trust_remote_code = True,
)

tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

from transformers import pipeline

generator = pipeline(
    "text-generation",
    model = model,
    tokenizer = tokenizer,
    return_full_text = False, # only output will be returned ; prompt will not returned
    max_new_tokens = 500,
    do_sample = False # whether the model uses a sampling strategy
)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Device set to use cuda


In [13]:
prompt = "Write an email apologizing to Sarah for the tragic gardening mishap. Explain how it happend"
output = generator(prompt)
print(output[0]['generated_text'])

, express your sincere regret, and propose a plan to make amends.

Dear Sarah,

I hope this email finds you well. I am writing to express my deepest apologies for the unfortunate incident that occurred in your garden. It was a tragic mishap that I never intended to happen, and I am truly sorry for the damage caused.

The incident happened when I was attempting to trim the overgrown branches of the tree in your garden. Unfortunately, due to a momentary lapse in concentration, I accidentally cut through the power line that was hidden beneath the foliage. This resulted in a power outage in your house and caused significant damage to the electrical system.

I understand the inconvenience and distress this has caused you, and I am sincerely sorry for my carelessness. Please know that I take full responsibility for my actions, and I am committed to making things right.

To make amends, I have already contacted a professional electrician to assess the damage and repair the electrical system i

In [14]:
print(model)

Phi3ForCausalLM(
  (model): Phi3Model(
    (embed_tokens): Embedding(32064, 3072, padding_idx=32000)
    (embed_dropout): Dropout(p=0.0, inplace=False)
    (layers): ModuleList(
      (0-31): 32 x Phi3DecoderLayer(
        (self_attn): Phi3Attention(
          (o_proj): Linear(in_features=3072, out_features=3072, bias=False)
          (qkv_proj): Linear(in_features=3072, out_features=9216, bias=False)
          (rotary_emb): Phi3RotaryEmbedding()
        )
        (mlp): Phi3MLP(
          (gate_up_proj): Linear(in_features=3072, out_features=16384, bias=False)
          (down_proj): Linear(in_features=8192, out_features=3072, bias=False)
          (activation_fn): SiLU()
        )
        (input_layernorm): Phi3RMSNorm()
        (resid_attn_dropout): Dropout(p=0.0, inplace=False)
        (resid_mlp_dropout): Dropout(p=0.0, inplace=False)
        (post_attention_layernorm): Phi3RMSNorm()
      )
    )
    (norm): Phi3RMSNorm()
  )
  (lm_head): Linear(in_features=3072, out_features=3206

<span style = "color: Yellow;">**Choosing a single token from the probability distribution (sampling / decoding)**</span>

In [15]:
prompt = "The capital of France is"

input_ids = tokenizer(prompt, return_tensors = 'pt').input_ids.to("cuda")

model_output = model.model(input_ids)
lm_head_output = model.lm_head(model_output[0])

token_id = lm_head_output[0, -1].argmax(-1)
tokenizer.decode(token_id)


'Paris'

**Speed up generation by caching keys and values**

In [16]:
prompt = "Write a very long email apologizing to Sarah for the tragic gardening mishap. Explain how it happened."

input_ids = tokenizer(prompt, return_tensors = 'pt').input_ids.to('cuda')

In [17]:
%%timeit -n 1
# Generate the text
generation_output = model.generate(
  input_ids=input_ids,
  max_new_tokens=100,
  use_cache=True
)

4.96 s ± 322 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [18]:
%%timeit -n 1
# Generate the text
generation_output = model.generate(
  input_ids=input_ids,
  max_new_tokens=100,
  use_cache=False
)

32.7 s ± 87.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
