<h1>Chapter 3 - Looking Inside Transformer LLMs</h1>
<i>An extensive look into the transformer architecture of generative LLMs</i>

<a href="https://www.amazon.com/Hands-Large-Language-Models-Understanding/dp/1098150961"><img src="https://img.shields.io/badge/Buy%20the%20Book!-grey?logo=amazon"></a>
<a href="https://www.oreilly.com/library/view/hands-on-large-language/9781098150952/"><img src="https://img.shields.io/badge/O'Reilly-white.svg?logo=data:image/svg%2bxml;base64,PHN2ZyB3aWR0aD0iMzQiIGhlaWdodD0iMjciIHZpZXdCb3g9IjAgMCAzNCAyNyIgZmlsbD0ibm9uZSIgeG1sbnM9Imh0dHA6Ly93d3cudzMub3JnLzIwMDAvc3ZnIj4KPGNpcmNsZSBjeD0iMTMiIGN5PSIxNCIgcj0iMTEiIHN0cm9rZT0iI0Q0MDEwMSIgc3Ryb2tlLXdpZHRoPSI0Ii8+CjxjaXJjbGUgY3g9IjMwLjUiIGN5PSIzLjUiIHI9IjMuNSIgZmlsbD0iI0Q0MDEwMSIvPgo8L3N2Zz4K"></a>
<a href="https://github.com/HandsOnLLM/Hands-On-Large-Language-Models"><img src="https://img.shields.io/badge/GitHub%20Repository-black?logo=github"></a>
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/HandsOnLLM/Hands-On-Large-Language-Models/blob/main/chapter03/Chapter%203%20-%20Looking%20Inside%20LLMs.ipynb)

---

This notebook is for Chapter 3 of the [Hands-On Large Language Models](https://www.amazon.com/Hands-Large-Language-Models-Understanding/dp/1098150961) book by [Jay Alammar](https://www.linkedin.com/in/jalammar) and [Maarten Grootendorst](https://www.linkedin.com/in/mgrootendorst/).

---

<a href="https://www.amazon.com/Hands-Large-Language-Models-Understanding/dp/1098150961">
<img src="https://raw.githubusercontent.com/HandsOnLLM/Hands-On-Large-Language-Models/main/images/book_cover.png" width="350"/></a>

### [OPTIONAL] - Installing Packages on <img src="https://colab.google/static/images/icons/colab.png" width=100>

If you are viewing this notebook on Google Colab (or any other cloud vendor), you need to **uncomment and run** the following code block to install the dependencies for this chapter:

---

💡 **NOTE**: We will want to use a GPU to run the examples in this notebook. In Google Colab, go to
**Runtime > Change runtime type > Hardware accelerator > GPU > GPU type > T4**.

---


In [None]:
# %%capture
# !pip install "transformers>=4.41.2" "accelerate>=0.31.0"

# Loading the LLM

In [2]:
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
import torch

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    device_map="auto",  # Let accelerate place the model on the available device
    torch_dtype=torch.float16,  # Load the weights in half precision
    trust_remote_code=False,
)

# Create a pipeline
generator = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    return_full_text=False,
    max_new_tokens=50,
    do_sample=False,
)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Loading checkpoint shards: 100%|██████████| 2/2 [00:05<00:00,  2.93s/it]


# The Inputs and Outputs of a Trained Transformer LLM


In [3]:
prompt = "Write an email apologizing to Sarah for the tragic gardening mishap. Explain how it happened."

output = generator(prompt)

print(output[0]['generated_text'])

You are not running the flash-attention implementation, expect numerical differences.


 Mention that you've taken steps to prevent it in the future.


Email to Sarah:

Subject: Sincere Apologies for the Gardening Mishap


Dear Sarah,


I hope
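
The output above stops mid-sentence because the pipeline was created with `max_new_tokens=50` and text is produced one token at a time. The loop below is a minimal sketch of that process, not the `transformers` implementation. The `TOY_NEXT_TOKEN` table and the `generate` helper are hypothetical stand-ins, and the toy looks only at the last token, whereas a real model conditions on the whole sequence.

```python
# Hypothetical stand-in for a model's forward pass: maps the last token to the
# next one. A real LLM conditions on the entire sequence, not just one token.
TOY_NEXT_TOKEN = {
    "Dear": "Sarah",
    "Sarah": ",",
    ",": "I",
    "I": "hope",
    "hope": "you",
    "you": "are",
    "are": "well",
    "well": "<eos>",
}


def generate(prompt_tokens, max_new_tokens, eos_token="<eos>"):
    """Greedy autoregressive loop: append one token per step until a limit is hit."""
    tokens = list(prompt_tokens)
    new_tokens = []
    for _ in range(max_new_tokens):
        next_token = TOY_NEXT_TOKEN.get(tokens[-1], eos_token)
        if next_token == eos_token:
            break  # The model signalled that the text is complete
        tokens.append(next_token)
        new_tokens.append(next_token)
    return new_tokens


# A small token budget cuts the text off early; a larger one lets it finish.
short = generate(["Dear"], max_new_tokens=4)
full = generate(["Dear"], max_new_tokens=50)
print(short)
print(full)
```

With a budget of 4 the toy text is cut off after "hope", which mirrors how the email above ends at "I hope".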


In [8]:
print(model)

Phi3ForCausalLM(
  (model): Phi3Model(
    (embed_tokens): Embedding(32064, 3072, padding_idx=32000)
    (embed_dropout): Dropout(p=0.0, inplace=False)
    (layers): ModuleList(
      (0-31): 32 x Phi3DecoderLayer(
        (self_attn): Phi3Attention(
          (o_proj): Linear(in_features=3072, out_features=3072, bias=False)
          (qkv_proj): Linear(in_features=3072, out_features=9216, bias=False)
          (rotary_emb): Phi3RotaryEmbedding()
        )
        (mlp): Phi3MLP(
          (gate_up_proj): Linear(in_features=3072, out_features=16384, bias=False)
          (down_proj): Linear(in_features=8192, out_features=3072, bias=False)
          (activation_fn): SiLU()
        )
        (input_layernorm): Phi3RMSNorm()
        (resid_attn_dropout): Dropout(p=0.0, inplace=False)
        (resid_mlp_dropout): Dropout(p=0.0, inplace=False)
        (post_attention_layernorm): Phi3RMSNorm()
      )
    )
    (norm): Phi3RMSNorm()
  )
  (lm_head): Linear(in_features=3072, out_features=3206
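
The sizes in the printout can be cross-checked with a little arithmetic. The sketch below copies the numbers from the module tree and counts the weights of the bias-free linear layers and the embedding. It assumes that `lm_head` projects to the same vocabulary size as `embed_tokens` (32064) and is not weight-tied to it, and it leaves out the small RMSNorm weight vectors, so the total is approximate.

```python
# Sizes copied from the printed module tree above.
hidden_size = 3072
vocab_size = 32064      # from embed_tokens; assumed to match lm_head
qkv_out = 9216          # out_features of qkv_proj
gate_up_out = 16384     # out_features of gate_up_proj
mlp_inner = 8192        # in_features of down_proj
num_layers = 32

# The fused projections pack several matrices into one linear layer.
assert qkv_out == 3 * hidden_size      # queries, keys, and values
assert gate_up_out == 2 * mlp_inner    # gate and up projections


def linear_params(in_features, out_features):
    """Weight count of a linear layer with bias=False."""
    return in_features * out_features


attention = linear_params(hidden_size, qkv_out) + linear_params(hidden_size, hidden_size)
mlp = linear_params(hidden_size, gate_up_out) + linear_params(mlp_inner, hidden_size)
per_layer = attention + mlp

embedding = vocab_size * hidden_size
lm_head = linear_params(hidden_size, vocab_size)  # assumes no weight tying

total = num_layers * per_layer + embedding + lm_head
print(f"Per decoder layer: {per_layer:,}")
print(f"Total (without norm weights): {total:,}")
```

The total lands at roughly 3.8 billion weights, most of them in the 32 decoder layers.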

# Choosing a single token from the probability distribution (sampling / decoding)

In [10]:
prompt = "The capital of France is"

# Tokenize the input prompt
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

# Move the token IDs to the same device as the model
input_ids = input_ids.to(model.device)

# Get the output of the model before the lm_head
model_output = model.model(input_ids)

# Get the output of the lm_head
lm_head_output = model.lm_head(model_output[0])

In [16]:
lm_head_output

tensor([[[25.0625, 25.1719, 23.3438,  ..., 19.2656, 19.2656, 19.2812],
         [31.0000, 31.4375, 25.9219,  ..., 25.8281, 25.8281, 25.8281],
         [31.3750, 28.7969, 31.0000,  ..., 26.1250, 26.1250, 26.1250],
         [33.0625, 31.8906, 35.9375,  ..., 27.7812, 27.7812, 27.7812],
         [27.8438, 29.4688, 28.0156,  ..., 20.4062, 20.4062, 20.4062]]],
       device='mps:0', dtype=torch.float16, grad_fn=<LinearBackward0>)

In [13]:
# Greedy choice: take the highest-scoring token at the last position
token_id = lm_head_output[0, -1].argmax(-1)
tokenizer.decode(token_id)

'Paris'
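
Taking the `argmax` of the last position's scores is greedy decoding. The scores are logits, so a softmax turns them into a probability distribution that can also be sampled from. The sketch below uses a made-up four-word vocabulary and made-up logits (both hypothetical) to contrast the two strategies. It uses only the standard library, so it is an illustration of the idea and not the decoding code that `transformers` runs.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Convert raw scores into probabilities; higher temperature flattens them."""
    scaled = [x / temperature for x in logits]
    largest = max(scaled)  # Subtract the max for numerical stability
    exps = [math.exp(x - largest) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical vocabulary and scores for the prompt "The capital of France is"
vocab = ["Paris", "London", "the", "a"]
logits = [9.0, 6.5, 5.0, 4.0]

probs = softmax(logits)

# Greedy decoding: always pick the most likely token
greedy_token = vocab[max(range(len(logits)), key=logits.__getitem__)]

# Sampling: draw a token in proportion to its (temperature-scaled) probability
rng = random.Random(0)
sampled_token = rng.choices(vocab, weights=softmax(logits, temperature=1.5), k=1)[0]

for token, p in zip(vocab, probs):
    print(f"{token:>6}: {p:.3f}")
print("greedy:", greedy_token)
print("sampled:", sampled_token)
```

Greedy decoding is deterministic, which matches the `do_sample=False` setting used for the pipeline earlier in this notebook.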

In [17]:
model_output

BaseModelOutputWithPast(last_hidden_state=tensor([[[-0.2644,  1.2207,  0.4192,  ..., -0.3667,  0.7808,  0.1046],
         [-0.1307,  0.3118,  0.3672,  ...,  0.5542, -0.1405, -0.5952],
         [-0.5864,  1.0713,  1.5508,  ..., -0.4216,  0.2820,  0.4099],
         [-0.4075,  0.6694,  0.2534,  ..., -0.0640,  0.8022, -0.6709],
         [-0.6655,  0.6943,  0.5415,  ...,  0.2544,  0.1948, -0.2866]]],
       device='mps:0', dtype=torch.float16, grad_fn=<MulBackward0>), past_key_values=((tensor([[[[-2.9736e-01,  1.3831e-01, -7.8308e-02,  ..., -9.1919e-02,
           -2.6321e-02,  5.8441e-02],
          [ 8.7036e-02,  9.7473e-02, -1.3647e-01,  ..., -2.3621e-02,
           -1.0889e-01, -2.7527e-02],
          [ 4.0747e-01,  4.2480e-02, -1.9275e-01,  ..., -1.5112e-01,
           -2.1228e-01,  8.7402e-02],
          [ 2.5903e-01, -3.2959e-02, -3.7750e-02,  ..., -1.6571e-02,
           -1.1986e-02,  6.4278e-03],
          [ 9.8511e-02, -9.8511e-02, -7.1533e-02,  ..., -1.9299e-01,
            2.314

In [14]:
model_output[0].shape

torch.Size([1, 5, 3072])

In [15]:
lm_head_output.shape

torch.Size([1, 5, 32064])
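
The two shapes differ only in the last dimension: `lm_head` maps each of the 5 positions from a hidden vector to one score per vocabulary entry. The sketch below reproduces that with tiny, made-up sizes and plain Python lists (the `linear` helper and all values are hypothetical). It also shows that projecting only the last position gives the same scores as slicing the full result, which is all that next-token prediction needs.

```python
def linear(hidden_states, weight):
    """Bias-free linear layer: `weight` holds one row per vocabulary entry."""
    return [
        [sum(h * w for h, w in zip(vector, row)) for row in weight]
        for vector in hidden_states
    ]


# Tiny stand-ins for the real sizes (5 positions, 3072 hidden, 32064 vocabulary)
seq_len, hidden_size, vocab_size = 5, 3, 4

hidden_states = [
    [float(pos + dim) for dim in range(hidden_size)] for pos in range(seq_len)
]
weight = [
    [float((entry + 1) * (dim + 1)) for dim in range(hidden_size)]
    for entry in range(vocab_size)
]

# Project every position: (seq_len, hidden_size) -> (seq_len, vocab_size)
logits = linear(hidden_states, weight)
print((len(hidden_states), len(hidden_states[0])))
print((len(logits), len(logits[0])))

# Only the last position is needed to choose the next token
last_logits = linear(hidden_states[-1:], weight)[0]
print(last_logits == logits[-1])
```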

# Speeding up generation by caching keys and values


In [19]:
prompt = "Write a very long email apologizing to Sarah for the tragic gardening mishap. Explain how it happened."

# Tokenize the input prompt
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

# Move the token IDs to the same device as the model
input_ids = input_ids.to(model.device)

In [None]:
%%timeit -n 1
# Generate with the key/value cache enabled
generation_output = model.generate(
  input_ids=input_ids,
  max_new_tokens=100,
  use_cache=True
)

In [None]:
%%timeit -n 1
# Generate without the cache, so each step recomputes keys and values for all previous tokens
generation_output = model.generate(
  input_ids=input_ids,
  max_new_tokens=100,
  use_cache=False
)
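
To see why the cached run is faster, the toy below runs single-head causal attention step by step and counts how many tokens have their keys and values projected. It is a minimal sketch using plain Python lists. The `project`, `attend`, and `decode` helpers and the fixed scaling factors are hypothetical stand-ins for learned projections, not the `transformers` cache implementation.

```python
import math


def project(vector, scale):
    """Hypothetical stand-in for a learned key or value projection."""
    return [scale * x for x in vector]


def attend(query, keys, values):
    """Softmax-weighted average of `values`, scored by query-key dot products."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    largest = max(scores)
    weights = [math.exp(s - largest) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w / total * v[d] for w, v in zip(weights, values)) for d in range(dim)]


def decode(token_vectors, use_cache):
    """Process tokens one step at a time and count key/value projections."""
    projections = 0
    outputs = []
    cached_keys, cached_values = [], []
    for step, current in enumerate(token_vectors):
        if use_cache:
            # Project only the newest token and reuse everything stored so far
            cached_keys.append(project(current, 0.5))
            cached_values.append(project(current, 2.0))
            projections += 1
            keys, values = cached_keys, cached_values
        else:
            # Recompute keys and values for every token seen so far
            seen = token_vectors[: step + 1]
            keys = [project(v, 0.5) for v in seen]
            values = [project(v, 2.0) for v in seen]
            projections += len(seen)
        outputs.append(attend(current, keys, values))
    return outputs, projections


tokens = [[0.1 * (i + 1), 0.2 * (i + 1)] for i in range(6)]
cached_out, cached_count = decode(tokens, use_cache=True)
uncached_out, uncached_count = decode(tokens, use_cache=False)

print("same outputs:", cached_out == uncached_out)
print("projections with cache:", cached_count)
print("projections without cache:", uncached_count)
```

Both runs produce identical outputs, but the work without a cache grows quadratically with sequence length (21 projections versus 6 for six tokens). In a real model that saving applies in every layer and attention head.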