<h1>Chapter 3 - Looking Inside Transformer LLMs</h1>
<i>An extensive look into the transformer architecture of generative LLMs</i>

<a href="https://www.amazon.com/Hands-Large-Language-Models-Understanding/dp/1098150961"><img src="https://img.shields.io/badge/Buy%20the%20Book!-grey?logo=amazon"></a>
<a href="https://www.oreilly.com/library/view/hands-on-large-language/9781098150952/"><img src="https://img.shields.io/badge/O'Reilly-white.svg?logo=data:image/svg%2bxml;base64,PHN2ZyB3aWR0aD0iMzQiIGhlaWdodD0iMjciIHZpZXdCb3g9IjAgMCAzNCAyNyIgZmlsbD0ibm9uZSIgeG1sbnM9Imh0dHA6Ly93d3cudzMub3JnLzIwMDAvc3ZnIj4KPGNpcmNsZSBjeD0iMTMiIGN5PSIxNCIgcj0iMTEiIHN0cm9rZT0iI0Q0MDEwMSIgc3Ryb2tlLXdpZHRoPSI0Ii8+CjxjaXJjbGUgY3g9IjMwLjUiIGN5PSIzLjUiIHI9IjMuNSIgZmlsbD0iI0Q0MDEwMSIvPgo8L3N2Zz4K"></a>
<a href="https://github.com/HandsOnLLM/Hands-On-Large-Language-Models"><img src="https://img.shields.io/badge/GitHub%20Repository-black?logo=github"></a>
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/HandsOnLLM/Hands-On-Large-Language-Models/blob/main/chapter03/Chapter%203%20-%20Looking%20Inside%20LLMs.ipynb)

---

This notebook is for Chapter 3 of the [Hands-On Large Language Models](https://www.amazon.com/Hands-Large-Language-Models-Understanding/dp/1098150961) book by [Jay Alammar](https://www.linkedin.com/in/jalammar) and [Maarten Grootendorst](https://www.linkedin.com/in/mgrootendorst/).

---

<a href="https://www.amazon.com/Hands-Large-Language-Models-Understanding/dp/1098150961">
<img src="https://raw.githubusercontent.com/HandsOnLLM/Hands-On-Large-Language-Models/main/images/book_cover.png" width="350"/></a>

### [OPTIONAL] - Installing Packages on <img src="https://colab.google/static/images/icons/colab.png" width=100>

If you are viewing this notebook on Google Colab (or any other cloud vendor), you need to **run** the following code block to install the dependencies for this chapter:

---

💡 **NOTE**: We will want to use a GPU to run the examples in this notebook. In Google Colab, go to
**Runtime > Change runtime type > Hardware accelerator > GPU > GPU type > T4**.

---


In [1]:
%%capture
!pip install "transformers>=4.41.2" "accelerate>=0.31.0"

# Loading the LLM

In [2]:
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

In [3]:
# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



In [4]:
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
)


A new version of the following files was downloaded from https://huggingface.co/microsoft/Phi-3-mini-4k-instruct:
- configuration_phi3.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


A new version of the following files was downloaded from https://huggingface.co/microsoft/Phi-3-mini-4k-instruct:
- modeling_phi3.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.



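How do you know when to use `trust_remote_code`? It is needed when the model repository ships its own Python code (here `configuration_phi3.py` and `modeling_phi3.py`) that must be downloaded and executed to build the model. Because that code runs on your machine, review it or pin a revision first, as the warnings above advise.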

In [5]:
# Create a pipeline
generator = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    return_full_text=False,
    max_new_tokens=50,
    do_sample=False,
)

# The Inputs and Outputs of a Trained Transformer LLM


In [7]:
prompt = "Write an email apologizing to Sarah for the tragic gardening mishap. Explain how it happened."

In [8]:
output = generator(prompt)

The `seen_tokens` attribute is deprecated and will be removed in v4.41. Use the `cache_position` model input instead.


In [9]:
print(output[0]['generated_text'])

 Mention the steps you're taking to prevent it in the future.

Dear Sarah,

I hope this message finds you well. I am writing to express my deepest apologies for the unfortunate incident that occurred in


In [10]:
print(model)

Phi3ForCausalLM(
  (model): Phi3Model(
    (embed_tokens): Embedding(32064, 3072, padding_idx=32000)
    (embed_dropout): Dropout(p=0.0, inplace=False)
    (layers): ModuleList(
      (0-31): 32 x Phi3DecoderLayer(
        (self_attn): Phi3Attention(
          (o_proj): Linear(in_features=3072, out_features=3072, bias=False)
          (qkv_proj): Linear(in_features=3072, out_features=9216, bias=False)
          (rotary_emb): Phi3RotaryEmbedding()
        )
        (mlp): Phi3MLP(
          (gate_up_proj): Linear(in_features=3072, out_features=16384, bias=False)
          (down_proj): Linear(in_features=8192, out_features=3072, bias=False)
          (activation_fn): SiLU()
        )
        (input_layernorm): Phi3RMSNorm()
        (resid_attn_dropout): Dropout(p=0.0, inplace=False)
        (resid_mlp_dropout): Dropout(p=0.0, inplace=False)
        (post_attention_layernorm): Phi3RMSNorm()
      )
    )
    (norm): Phi3RMSNorm()
  )
  (lm_head): Linear(in_features=3072, out_features=32064, bias=False)
)
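
The printed layer shapes are enough to estimate the model's size. The sketch below tallies the weights by hand. It assumes each `Phi3RMSNorm` holds a single weight vector of size 3072 and that `Phi3RotaryEmbedding` has no trainable parameters; the printout shows neither directly. The `lm_head` output size is taken to equal the vocabulary size of `embed_tokens` (32064).

```python
# Tally parameters from the shapes printed above (all Linear layers have bias=False).
hidden = 3072
vocab = 32064
n_layers = 32

embed_tokens = vocab * hidden

# One decoder layer
o_proj = hidden * hidden
qkv_proj = hidden * 9216        # 9216 = 3 * 3072 (query, key, value fused)
gate_up_proj = hidden * 16384   # 16384 = 2 * 8192 (gate and up fused)
down_proj = 8192 * hidden
layer_norms = 2 * hidden        # assumption: one weight vector per RMSNorm
per_layer = o_proj + qkv_proj + gate_up_proj + down_proj + layer_norms

final_norm = hidden             # assumption, as above
lm_head = hidden * vocab

total = embed_tokens + n_layers * per_layer + final_norm + lm_head
print(f"per layer: {per_layer:,}")
print(f"total:     {total:,}")
```

Under these assumptions the total comes to about 3.8 billion parameters. With the model loaded, the figure can be compared against `sum(p.numel() for p in model.parameters())`.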

# Choosing a single token from the probability distribution (sampling / decoding)

In [11]:
prompt = "The capital of France is"

# Tokenize the input prompt
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
input_ids

tensor([[ 450, 7483,  310, 3444,  338]])

In [12]:
# send to the GPU
input_ids = input_ids.to("cuda")

```python
Phi3ForCausalLM(
  (model): Phi3Model(
    .................
  )
  (lm_head): Linear(in_features=3072, out_features=32064, bias=False)
)
```

In [14]:
# Get the output of the model before the lm_head
model_output = model.model(input_ids)

In [22]:
type(model_output)

transformers.modeling_outputs.BaseModelOutputWithPast

In [32]:
# last_hidden_state: (batch, tokens, hidden size)
model_output[0].shape

torch.Size([1, 5, 3072])

In [35]:
# past_key_values: key tensor of the first layer, (batch, heads, tokens, head dim)
model_output[1][0][0].shape

torch.Size([1, 32, 5, 96])
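
The shape `[1, 32, 5, 96]` reads as (batch, attention heads, tokens, head dimension). As a quick arithmetic sketch: 32 heads of 96 dimensions recombine into the hidden size of 3072, and the same numbers give a rough size for the cached keys and values. The 2 bytes per value is an assumption (16-bit floats, which `torch_dtype="auto"` may or may not select), and the 4096-token context is read off the `4k` in the model name.

```python
# Shape of one cached key tensor, copied from the output above.
batch, n_heads, seq_len, head_dim = 1, 32, 5, 96
hidden_size = 3072
n_layers = 32

# The heads partition the hidden dimension.
assert n_heads * head_dim == hidden_size

# Each layer stores one key vector and one value vector per token and head.
bytes_per_value = 2  # assumption: 16-bit floats
values_per_token = 2 * n_layers * n_heads * head_dim
bytes_per_token = values_per_token * bytes_per_value

context_len = 4096  # assumption: the "4k" in the model name
cache_mib = context_len * bytes_per_token / 2**20

print(f"cache per token: {bytes_per_token / 1024:.0f} KiB")
print(f"cache at {context_len} tokens: {cache_mib:.0f} MiB")
```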

In [36]:
# Get the output of the lm_head
lm_head_output = model.lm_head(model_output[0])

In [37]:
token_id = lm_head_output[0,-1].argmax(-1)
tokenizer.decode(token_id)

'Paris'
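
Calling `argmax` on the logits is greedy decoding: it always picks the highest-scoring token. To make the "probability distribution" in this section's title concrete, here is a small sketch in plain Python that turns logits into probabilities with a softmax and then either takes the most likely token or samples from the distribution. The four-word vocabulary and its logit values are made up for illustration; the real model scores 32,064 tokens.

```python
import math
import random

# Hypothetical logits for the last position (toy vocabulary, made-up values).
vocab = ["Paris", "Lyon", "London", "the"]
logits = [6.0, 2.5, 1.0, 3.0]

def softmax(scores, temperature=1.0):
    """Convert raw scores into probabilities that sum to 1."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax(logits)

# Greedy decoding: the same choice as argmax on the raw logits.
greedy_token = vocab[probs.index(max(probs))]

# Sampling: draw a token in proportion to its probability.
rng = random.Random(0)  # fixed seed so the sketch is repeatable
sampled_token = rng.choices(vocab, weights=probs, k=1)[0]

print("greedy: ", greedy_token)
print("sampled:", sampled_token)
```

Softmax preserves the ordering of the logits, which is why the notebook can call `argmax` directly on `lm_head_output` without computing probabilities first.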

In [38]:
model_output[0].shape

torch.Size([1, 5, 3072])

In [39]:
lm_head_output.shape

torch.Size([1, 5, 32064])
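
The two shapes differ only in the last dimension: `lm_head` is a bias-free linear layer that maps each position's 3072-dimensional hidden state to one score per vocabulary token. The sketch below mimics that with tiny made-up sizes (hidden size 3, vocabulary of 4) and random weights in plain Python, so it illustrates the shape change only, not the model's actual weights.

```python
import random

rng = random.Random(0)
seq_len, hidden_size, vocab_size = 5, 3, 4  # toy sizes, not the model's

# Stand-in for model_output[0][0]: one hidden vector per input token.
hidden_states = [[rng.uniform(-1, 1) for _ in range(hidden_size)]
                 for _ in range(seq_len)]

# Stand-in for lm_head.weight: one row of weights per vocabulary token.
lm_head_weight = [[rng.uniform(-1, 1) for _ in range(hidden_size)]
                  for _ in range(vocab_size)]

def lm_head(states):
    """Bias-free linear layer: logits[t][v] = dot(states[t], weight[v])."""
    return [[sum(h * w for h, w in zip(state, row)) for row in lm_head_weight]
            for state in states]

logits = lm_head(hidden_states)
print(len(hidden_states), len(hidden_states[0]))  # 5 3
print(len(logits), len(logits[0]))                # 5 4

# Only the last position's scores are needed to pick the next token.
last = logits[-1]
next_token_id = last.index(max(last))
```

This mirrors why the notebook indexes `lm_head_output[0, -1]`: logits exist for every input position, but the next token is chosen from the final one.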

**Question**

Prompt:

**Role:** You are an expert in LLMs and in using Hugging Face libraries to accomplish the required tasks.

**Context:** I'm reading a book that looks inside LLMs. Could you please clear up the doubts I have about it?

**Task/Query:**

```python
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
)
prompt = "The capital of France is"

# Tokenize the input prompt
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
input_ids

# Get the output of the model before the lm_head
model_output = model.model(input_ids)

# Get the output of the lm_head
lm_head_output = model.lm_head(model_output[0])

token_id = lm_head_output[0,-1].argmax(-1)
tokenizer.decode(token_id)
```

What is the benefit of using the model this way instead of running inference directly through the whole end-to-end model?

<details>
	<summary><h3>Answer by Claude</h3></summary>
  Let me explain why this step-by-step approach is valuable compared to direct end-to-end inference.

The code you've shown is actually "peeking under the hood" of how language models work internally. Here's why this approach is useful:

1. **Understanding Model Components**
   - The code separates the model into two main components:
     - The base model (`model.model`) - handles contextual understanding
     - The language modeling head (`model.lm_head`) - handles token prediction
   - This separation helps understand how LLMs transform input tokens into predictions

2. **Intermediate Representations**
   - By accessing `model_output`, you can analyze the internal representations (hidden states) before final prediction
   - These representations contain rich semantic information that can be useful for:
     - Feature extraction
     - Transfer learning
     - Model analysis and interpretability studies

3. **Debugging & Research**
   ```python
   # Normal end-to-end inference
   output = model.generate(input_ids)  # Black box

   # Step-by-step approach
   hidden_states = model.model(input_ids)  # Intermediate representation
   logits = model.lm_head(hidden_states[0])  # Final layer transformation
   ```
   - You can inspect intermediate values
   - Helpful for debugging unexpected behaviors
   - Useful for understanding model decisions

4. **Custom Modifications**
   - This approach allows you to:
     - Modify intermediate representations
     - Insert custom processing steps
     - Experiment with different decoding strategies
     - Add additional layers or transformations

5. **Educational Value**
   The step-by-step approach demonstrates:
   - How tokens flow through the model
   - Where transformations occur
   - How final predictions are made

Here's a practical example of why this might be useful:

```python
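# Get intermediate representations and, optionally, attention weights.
# Attention weights are only returned when requested, and some attention
# implementations may not support returning them.
outputs = model.model(input_ids, output_attentions=True)
hidden_states = outputs[0]
attention_weights = outputs.attentions

# Custom processing (placeholder: replace with your own transformation)
def custom_processing(states):
    return states  # identity, for illustration only

modified_hidden_states = custom_processing(hidden_states)
custom_outputs = model.lm_head(modified_hidden_states)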
```

This granular access is particularly valuable for:
- Researchers studying model behavior
- Developers implementing custom modifications
- Anyone wanting to understand LLM internals

Would you like me to elaborate on any of these aspects or show more specific examples of how this approach can be used in practice?
</details >


<details>
	<summary><h3>Answer by ChatGPT</h3></summary>
  In the code you provided, the distinction between using the model in a stepwise manner (by extracting intermediate representations) and directly performing inference via the whole model architecture boils down to gaining insights into how the model processes information internally. Here's a breakdown of the differences:

### 1. **Direct Inference (End-to-End Model)**:
Normally, when you call a language model end-to-end, like this:

```python
output = model.generate(input_ids)
```

It handles the entire pipeline for you, producing a sequence of predicted tokens as output. You don’t see the individual layers or components. This is great for fast, standard predictions, where you're only interested in the final result (e.g., generating text or answering a question).

### 2. **Manual Inspection (Step-by-Step Processing)**:
In the code you provided, you are accessing the model's components in a more granular way:

- **`model.model(input_ids)`** retrieves the output from the transformer layers **before** passing it through the final language modeling head (`lm_head`), which is used to map the hidden states to vocabulary logits.
  
- **`model.lm_head(model_output[0])`** then applies the final linear transformation (often called the "language modeling head") to generate the raw scores (logits) for each token in the vocabulary.

This approach allows you to inspect intermediate representations, such as the hidden states output by the transformer layers. This can be useful for:
- **Debugging**: Understanding which part of the model is behaving unexpectedly.
- **Fine-tuning/Custom Modifications**: You may want to modify the hidden states before they are passed to the `lm_head` for specific tasks.
- **Interpretability**: You can analyze the model’s internal representations to better understand how it is processing the input text, which might help in determining why certain predictions are made.

By doing this, you can also experiment with **alternative heads** (e.g., using a different layer instead of the `lm_head`) or even fine-tune specific components of the model.

In summary:
- **Direct inference** gives you fast and straightforward results.
- **Manual inspection** gives you access to how the model processes information internally, providing more control and understanding of its inner workings.

Let me know if you'd like further clarification!
</details>

<details>
	<summary><h3>Answer by Phind</h3></summary>
  As an expert in LLMs and using the HuggingFace ecosystem, I'll address your question about the use of the model in the provided code snippet versus directly using the entire end-to-end model architecture. Let me break down the key points and explain why this approach might be useful in certain scenarios.

### Understanding the Code Approach

The code snippet you've shared demonstrates a specific way of interacting with a language model:

```python
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
)

prompt = "The capital of France is"

# Tokenize the input prompt
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

# Get the output of the model before the lm_head
model_output = model.model(input_ids)

# Get the output of the lm_head
lm_head_output = model.lm_head(model_output[0])

token_id = lm_head_output[0,-1].argmax(-1)
tokenizer.decode(token_id)
```

### Key Points to Consider

1. **Specific Model Access**: This approach allows for more granular control over the model architecture. By accessing `model.model` and `model.lm_head`, you're working directly with the core components of the transformer stack.

2. **Intermediate Representation**: By getting the output before the `lm_head`, you're capturing the full representation of the input sequence through the entire transformer stack.

3. **Attention Mechanism**: If you also request attention weights (for example with `output_attentions=True`), this method lets you inspect the attention patterns the model applies during processing.

4. **Debugging and Analysis**: It provides insights into how the model processes inputs and generates outputs, which can be valuable for understanding model behavior.

5. **Flexibility**: This approach gives you more flexibility to modify or extend the model's behavior if needed.

### Comparison with Direct End-to-End Usage

Using the model directly for inference (e.g., `model(**inputs)`), as shown in the HuggingFace documentation, is generally simpler and faster for production use cases. However, the approach you're using has its advantages:

1. **Detailed Analysis**: It allows for a deeper dive into the model's inner workings, which can be crucial for research or fine-tuning purposes.

2. **Customization**: You can potentially modify or extend the model's behavior at various stages of processing.

3. **Debugging Tools**: Libraries like PyTorch offer debugging tools that work well with this level of access to the model components.

### Best Practices

1. **Choose Wisely**: For most practical applications, the simpler end-to-end approach is sufficient and faster.

2. **Consider Your Needs**: If you need deep insights into model behavior or are working on research projects, this approach can be valuable.

3. **Performance Trade-offs**: Be aware that accessing intermediate representations can be slower than using the model directly.

4. **Documentation**: When working with custom model architectures or modifications, ensure you understand the implications of each component.

5. **Testing**: Always thoroughly test your modified approaches to ensure they produce expected results.

In summary, while the end-to-end approach is simpler and faster for most use cases, the method you're using provides more granular control and insight into the model's inner workings. Choose the approach that best fits your needs, considering factors like performance requirements, analysis depth needed, and the specific task at hand.

</details>

# Speeding up generation by caching keys and values


In [40]:
prompt = "Write a very long email apologizing to Sarah for the tragic gardening mishap. Explain how it happened."

# Tokenize the input prompt
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
input_ids = input_ids.to("cuda")

In [41]:
%%timeit -n 1
# Generate the text
generation_output = model.generate(
  input_ids=input_ids,
  max_new_tokens=100,
  use_cache=True
)

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


5.15 s ± 972 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [42]:
%%timeit -n 1
# Generate the text
generation_output = model.generate(
  input_ids=input_ids,
  max_new_tokens=100,
  use_cache=False
)

30.5 s ± 707 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
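
The two timings differ by roughly a factor of six (30.5 s versus 5.15 s). Without the cache, every generation step recomputes the keys and values of all earlier tokens; with it, each token's keys and values are computed once and reused. The toy single-head attention below is a sketch of that idea in plain Python, not the Phi-3 implementation: the projection function, vector size, and inputs are made up, and a counter tracks how many key/value projections each strategy performs.

```python
import math

projection_calls = 0

def project_kv(x):
    """Toy key/value projection (hypothetical weights); counts its calls."""
    global projection_calls
    projection_calls += 1
    key = [2.0 * v for v in x]
    value = [v + 1.0 for v in x]
    return key, value

def attend(query, keys, values):
    """Softmax-weighted average of values, scored by query-key dot products."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(dim)]

def run_without_cache(tokens):
    """At every step, re-project all tokens seen so far."""
    outputs = []
    for t in range(1, len(tokens) + 1):
        kv = [project_kv(x) for x in tokens[:t]]
        keys = [k for k, _ in kv]
        values = [v for _, v in kv]
        # The raw token vector doubles as the query to keep the sketch short.
        outputs.append(attend(tokens[t - 1], keys, values))
    return outputs

def run_with_cache(tokens):
    """Project only the newest token and append it to the cache."""
    keys, values, outputs = [], [], []
    for x in tokens:
        key, value = project_kv(x)
        keys.append(key)
        values.append(value)
        outputs.append(attend(x, keys, values))
    return outputs

tokens = [[0.1 * i, 0.2 * i] for i in range(1, 9)]  # 8 made-up token vectors

projection_calls = 0
uncached = run_without_cache(tokens)
calls_uncached = projection_calls

projection_calls = 0
cached = run_with_cache(tokens)
calls_cached = projection_calls

print(calls_uncached, calls_cached)  # 36 8
```

Both strategies return the same outputs; only the amount of repeated work changes, growing quadratically with sequence length without the cache and linearly with it.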
