## Chapter 3: Looking Inside Large Language Models

- The model does not generate the text all in one operation; it generates one token at a time.
- Each token generation step is one forward pass through the model.
- After each step, we build the input for the next step by appending the output token to the end of the input prompt.
- Models that consume their earlier predictions to make later predictions are called autoregressive models, which is why text generation LLMs are described as autoregressive.
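
The generation loop described above can be sketched in a few lines. This is a toy illustration only: `toy_next_token` is a hypothetical stand-in for a real forward pass (it just looks up the last token in a hand-written table), and none of these names come from a library.

```python
# Hypothetical stand-in for one forward pass of a model: it only looks at the
# last token and returns a hard-coded continuation.
NEXT_TOKEN = {
    "The": "capital",
    "capital": "of",
    "of": "France",
    "France": "is",
    "is": "Paris",
}

def toy_next_token(tokens):
    """Return the next token for the sequence, or "<eos>" if none is known."""
    return NEXT_TOKEN.get(tokens[-1], "<eos>")

def generate(prompt_tokens, max_new_tokens):
    """Autoregressive loop: predict one token, append it, and repeat."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        next_token = toy_next_token(tokens)  # one "forward pass"
        if next_token == "<eos>":
            break
        tokens.append(next_token)  # the output becomes part of the next input
    return tokens[len(prompt_tokens):]

print(generate(["The", "capital"], max_new_tokens=10))
print(generate(["The", "capital"], max_new_tokens=2))
```

The second call returns fewer tokens than the first because the loop hits the `max_new_tokens` limit before the toy model produces its end-of-sequence marker.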

In [1]:
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=False,
)

# Create a pipeline
generator = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    return_full_text=False,
    max_new_tokens=50,
    do_sample=False,
)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Device set to use cuda


In [2]:
prompt = "Write an email apologizing to Sarah for the tragic gardening mishap. Explain how it happened."

output = generator(prompt)

print(output[0]['generated_text'])

 Mention the steps you're taking to prevent it in the future.

Dear Sarah,

I hope this message finds you well. I am writing to express my deepest apologies for the unfortunate incident that occurred in


- The output stops abruptly because generation reached the limit we set with max_new_tokens=50.

In [3]:
print("Generating with max_new_tokens=100 (overriding default):")
output_100 = generator("What is the capital of France?", max_new_tokens=100)
print(output_100[0]['generated_text'])

Generating with max_new_tokens=100 (overriding default):


# Answer
The capital of France is Paris.


## The Components of the Forward Pass
- The tokenizer breaks down the text into a sequence of token IDs that then become the input to the model.
- The tokenizer is followed by the neural network: a stack of Transformer blocks that do all of the processing. That stack is then followed by the LM head, which translates the output of the stack into a probability score for each candidate next token.
  
![Alt text for the image](images/transformer_blocks.png)

- The tokenizer contains a table of tokens, which is the tokenizer’s vocabulary. The model has a vector representation (a token embedding) associated with each token in the vocabulary.

![Alt text for the image](images/tokenzier.png)

- The flow of the computation follows the direction of the arrow from top to bottom. For each generated token, the process flows once through each of the Transformer blocks in the stack in order, then to the LM head, which finally outputs the probability distribution for the next token.
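
As a rough sketch of that flow (tokenizer, embedding lookup, a stack of blocks, then the LM head), the toy code below uses plain Python lists. Everything in it is an assumption made for illustration: the six-word vocabulary, the two-dimensional embedding values, `toy_block` (a stand-in that only mixes each position with the positions before it), and the choice to reuse the embedding table as the LM head. A real Transformer block and LM head are far larger and are learned.

```python
import math

# Hypothetical toy vocabulary; a real tokenizer has tens of thousands of entries.
VOCAB = ["the", "capital", "of", "france", "is", "paris"]
TOKEN_TO_ID = {token: i for i, token in enumerate(VOCAB)}

# One embedding vector per vocabulary entry (made-up 2-dimensional values).
EMBEDDINGS = [[0.1, 0.0], [0.0, 0.2], [0.3, 0.1], [0.5, 0.5], [0.2, 0.4], [0.9, 0.8]]

def tokenize(text):
    """Map whitespace-separated words to token IDs."""
    return [TOKEN_TO_ID[word] for word in text.lower().split()]

def toy_block(vectors):
    """Stand-in for a Transformer block: add the mean of positions 0..i to position i."""
    dim = len(vectors[0])
    out = []
    for i in range(len(vectors)):
        prefix = vectors[: i + 1]
        mean = [sum(v[d] for v in prefix) / len(prefix) for d in range(dim)]
        out.append([x + m for x, m in zip(vectors[i], mean)])
    return out

def lm_head(vector):
    """Score every vocabulary entry (assumption: reuses the embedding table)."""
    return [sum(a * b for a, b in zip(vector, row)) for row in EMBEDDINGS]

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def forward(text, num_blocks=2):
    """One forward pass: token IDs -> embeddings -> blocks -> LM head -> probabilities."""
    hidden = [EMBEDDINGS[i] for i in tokenize(text)]
    for _ in range(num_blocks):
        hidden = toy_block(hidden)
    return softmax(lm_head(hidden[-1]))  # only the last position feeds the LM head

probs = forward("The capital of France is")
print(len(probs), round(sum(probs), 6))
```

The result is one probability per vocabulary entry, which is the same kind of output the real model produces over its full vocabulary.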

  

In [4]:
print(model)

Phi3ForCausalLM(
  (model): Phi3Model(
    (embed_tokens): Embedding(32064, 3072, padding_idx=32000)
    (layers): ModuleList(
      (0-31): 32 x Phi3DecoderLayer(
        (self_attn): Phi3Attention(
          (o_proj): Linear(in_features=3072, out_features=3072, bias=False)
          (qkv_proj): Linear(in_features=3072, out_features=9216, bias=False)
        )
        (mlp): Phi3MLP(
          (gate_up_proj): Linear(in_features=3072, out_features=16384, bias=False)
          (down_proj): Linear(in_features=8192, out_features=3072, bias=False)
          (activation_fn): SiLU()
        )
        (input_layernorm): Phi3RMSNorm((3072,), eps=1e-05)
        (post_attention_layernorm): Phi3RMSNorm((3072,), eps=1e-05)
        (resid_attn_dropout): Dropout(p=0.0, inplace=False)
        (resid_mlp_dropout): Dropout(p=0.0, inplace=False)
      )
    )
    (norm): Phi3RMSNorm((3072,), eps=1e-05)
    (rotary_emb): Phi3RotaryEmbedding()
  )
  (lm_head): Linear(in_features=3072, out_features=32064, 

## Choosing a Single Token from the Probability Distribution (Sampling/Decoding)

- At the end of processing, the output of the model is a probability score for each token in the vocabulary.
- The method of choosing a single token from the probability distribution is called the decoding strategy.
- The easiest decoding strategy is to always pick the token with the highest probability score. This is called greedy decoding.
- In practice, greedy decoding doesn’t tend to lead to the best outputs for most use cases. A better approach is to add some randomness and sometimes choose the second or third highest probability token (sampling).
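
The two strategies can be compared on a made-up score vector. This is a minimal sketch using only the standard library; the three scores and the function names are assumptions for illustration, not output from the model used in this chapter.

```python
import math
import random

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_decode(scores):
    """Always pick the index of the highest score."""
    return max(range(len(scores)), key=lambda i: scores[i])

def sample_decode(scores, rng):
    """Pick an index at random, weighted by its probability."""
    probs = softmax(scores)
    return rng.choices(range(len(scores)), weights=probs, k=1)[0]

scores = [1.0, 3.0, 2.0]  # made-up scores for a 3-token vocabulary
rng = random.Random(0)    # seeded so the run is repeatable

samples = [sample_decode(scores, rng) for _ in range(1000)]
counts = [samples.count(i) for i in range(len(scores))]

print("greedy choice:", greedy_decode(scores))
print("sample counts:", counts)
```

Greedy decoding returns the same index every time, while sampling mostly picks the top-scoring token but sometimes picks the others.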

In [18]:
prompt = 'The capital of France is'

## Tokenize the input prompt
input_ids=tokenizer(prompt,return_tensors='pt').input_ids
input_ids

tensor([[ 450, 7483,  310, 3444,  338]])

- Each token in the prompt is encoded as an integer ID, and the IDs are stored in the input_ids tensor.

In [20]:
input_ids=input_ids.to('cuda') ##send to gpu
## Get the output of the model before the lm_head
model_output=model.model(input_ids)
## Get the output of lm head
lm_head_output=model.lm_head(model_output[0])

In [21]:
model_output[0].shape

torch.Size([1, 5, 3072])

In [22]:
lm_head_output.shape

torch.Size([1, 5, 32064])

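- The output before the LM head has shape [1, 5, 3072]: a batch of one input string, containing five tokens, each represented by a vector of size 3,072 (the output vector after the stack of Transformer blocks). The LM head turns each of these vectors into 32,064 scores, one per token in the vocabulary.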

In [23]:
## Pick the highest-probability token (greedy decoding).
## Index 0 selects the only sequence in the batch; index -1 selects the last position
## in the sequence, whose scores predict the next token.
token_id=lm_head_output[0,-1].argmax(-1)
tokenizer.decode(token_id)

'Paris'

## Parallel Token Processing and Context Size
- Transformers process the tokens of the input in parallel.
- The tokenizer breaks down the text into tokens. Each of these input tokens then flows through its own computation path (its own stream).
- Current Transformer models have a limit on how many tokens they can process at once. That limit is called the model’s context length. A model with a 4K context length can only process 4K tokens and would only have 4K of these streams.
- Each of the token streams starts with an input vector (the embedding vector plus some positional information).
  
![Alt text for the image](images/floe.png)

- For text generation, only the output result of the last stream is used to predict the next token. That output vector is the only input into the LM head as it calculates the probabilities of the next token.
- The calculations of the previous streams are required and used in calculating the final stream. Yes, we’re not using their final output vector, but we use earlier outputs (in each Transformer block) in the Transformer block’s attention mechanism.

## Speeding Up Generation by Caching Keys and Values
- Recall that when generating the second token, we simply append the output token to the input and do another forward pass through the model.
- If we give the model the ability to cache the results of the previous calculation (especially some of the specific vectors in the attention mechanism), we no longer need to repeat the calculations of the previous streams. The only calculation needed is for the last stream. This optimization is called the key-value (KV) cache, and it speeds up the generation process.
- In Hugging Face Transformers, the cache is enabled by default. We can disable it by setting use_cache to False. We can see the difference in speed by asking for a long generation and timing it with and without caching.
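
To see why caching helps, the sketch below simply counts how many token positions need their keys and values computed over a whole generation. It is a bookkeeping illustration under simplifying assumptions (one layer, one head, no actual math), and `count_kv_computations` is a hypothetical helper, not a Transformers API.

```python
def count_kv_computations(prompt_len, new_tokens, use_cache):
    """Count positions whose keys/values are computed across all generation steps."""
    computed = 0
    cached = 0            # positions whose keys/values are already stored
    seq_len = prompt_len
    for _ in range(new_tokens):
        if use_cache:
            computed += seq_len - cached  # only positions not seen before
            cached = seq_len
        else:
            computed += seq_len           # every position, every step
        seq_len += 1                      # the generated token joins the input
    return computed

without_cache = count_kv_computations(prompt_len=5, new_tokens=3, use_cache=False)
with_cache = count_kv_computations(prompt_len=5, new_tokens=3, use_cache=True)
print(without_cache, with_cache)
```

Without the cache the work per step grows with the sequence length; with the cache, each step after the first only processes the single new token.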


In [24]:

prompt = "Write a very long email apologizing to Sarah for the tragic gardening mishap. Explain how it happened."

# Tokenize the input prompt
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
input_ids = input_ids.to("cuda")

In [25]:

%%timeit -n 1
# Generate the text
generation_output = model.generate(
  input_ids=input_ids,
  max_new_tokens=100,
  use_cache=True
)

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


4.24 s ± 102 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [26]:
%%timeit -n 1
# Generate the text
generation_output = model.generate(
  input_ids=input_ids,
  max_new_tokens=100,
  use_cache=False
)

5.74 s ± 445 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


- Enabling the key-value cache saves time: in this run, generating 100 tokens took about 4.24 s with the cache and 5.74 s without it.

## Inside the Transformer Block

- A Transformer LLM consists of a series of Transformer blocks (six in the original paper, “Attention Is All You Need”).
- Each block processes its inputs, then passes the results of its processing to the next block.
- Each Transformer block is made up of an attention layer, which incorporates relevant information from other tokens and positions, and a feedforward neural network (FFN), which houses the majority of the model's processing capacity.

![Alt text for the image](images/transformer.png)

- Attention is a mechanism that helps the model incorporate context as it’s processing a specific token. For example, in “The dog chased the squirrel because it”, attention helps the model associate “it” with “dog”.

- Attention is a way to score how relevant each of the previous input tokens is to the token currently being processed (the last token).

- To give the Transformer more extensive attention capability, the attention mechanism is duplicated and executed multiple times in parallel. Each of these parallel applications of attention is conducted in an attention head.

- Attention starts by multiplying the inputs by the projection matrices to create three new matrices: the queries, keys, and values matrices. These matrices contain the information of the input tokens projected to three different spaces that help carry out the two steps of attention: relevance scoring and combining information.

![Alt text for the image](images/key_value_query.png)
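
The projection step can be illustrated with tiny hand-written matrices. The values of `X`, `W_q`, `W_k`, and `W_v` below are made up so the results are easy to check by eye; in a real model the projection matrices are learned and much larger.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

# Three input tokens, each a 2-dimensional vector (made-up values).
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Hypothetical projection matrices, chosen to be easy to read.
W_q = [[1.0, 0.0], [0.0, 1.0]]  # identity: queries equal the inputs
W_k = [[2.0, 0.0], [0.0, 2.0]]  # doubles every input
W_v = [[0.0, 1.0], [1.0, 0.0]]  # swaps the two dimensions

Q = matmul(X, W_q)  # one query vector per token
K = matmul(X, W_k)  # one key vector per token
V = matmul(X, W_v)  # one value vector per token

print(Q)
print(K)
print(V)
```

Each row of `Q`, `K`, and `V` belongs to one input token, so the same inputs end up represented in three different spaces.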

**Self-attention: Relevance scoring**

In a generative Transformer, we’re generating one token at a time. This means we’re processing one position at a time. So the attention mechanism here is only concerned with this one position, and how information from other positions can be pulled in to inform this position.

- The relevance scoring step of attention is conducted by multiplying the query vector of the current position with the keys matrix. This produces a score stating how relevant each previous token is. Passing those scores through a softmax operation normalizes them so they sum to 1.

![Alt text for the image](images/query_key_multi.png)

**Self-attention: Combining information**

-  Now that we have the relevance scores, we multiply the value vector associated with each token by that token’s score. Summing up those resulting vectors produces the output of this attention step.


![Alt text for the image](images/key_value_sum.png)
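
Both steps, relevance scoring and combining information, fit in one small function for a single head and a single query position. This is a sketch with made-up vectors. One detail is an addition to these notes: the scores are divided by the square root of the vector dimension, as in the scaled dot-product attention of the original Transformer paper.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Single-head attention for one query position.

    Step 1 (relevance scoring): dot the query with every key, then softmax.
    Step 2 (combining information): weighted sum of the value vectors.
    """
    scale = math.sqrt(len(query))  # scaling from the original paper (not in these notes)
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, output

# Made-up query for the current position, plus keys/values for three tokens.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

weights, output = attend(query, keys, values)
print([round(w, 3) for w in weights])
print([round(x, 3) for x in output])
```

Tokens whose keys line up with the query get larger weights, so their value vectors contribute more to the output.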

## Recent Improvements to the Transformer Architecture

**Local/sparse attention**

-  Sparse attention limits the context of previous tokens that the model can attend to.

![Alt text for the image](images/attention_types.png)

- **Full attention**: In full attention, every token in a sequence attends to every other token in that same sequence. For each token, its "query" is compared against the "keys" of all tokens (including itself) to calculate attention scores, and these scores are then used to create a weighted sum of the "values" from all tokens. The cost is quadratic in the sequence length: for a sequence of length N, the attention mechanism requires O(N^2) computations and memory.
  - Advantages: global context and simplicity.
  - Disadvantages: poor scalability and high resource consumption.

- **Sparse attention:** Sparse attention is an optimization technique designed to reduce the computational and memory demands of full attention, particularly for long sequences. Instead of every token attending to all other tokens, sparse attention introduces "structured sparsity" in the attention matrix, so each token only attends to a subset of other tokens. This reduces the complexity from quadratic, O(N^2), to something more efficient, often linear, O(N), or nearly linear, such as O(N√N) or O(N log N).

**Approaches to sparsity**:
- **Local Attention** (Windowed/Sliding Attention): Tokens only attend to other tokens within a fixed-size local window around them. This is efficient but might miss very long-range dependencies.
- **Fixed Patterns**: Predetermined patterns of attention, such as:
  - **Strided Attention**: Tokens attend to others at regular intervals (e.g., every k-th token).
  - **Fixed Attention**: Tokens in a block attend to other tokens within the same block, or to certain "global" tokens.
- **Random Attention**: Each token attends to a fixed number of randomly selected tokens
- **Global Attention**: Some designated "global" tokens can attend to all other tokens, acting as intermediaries for broader communication.
- **Learnable/Dynamic Sparsity**: The model learns which connections are most important and prunes the less important ones or routes queries to specific key subsets (e.g., Routing Transformers, Native Sparse Attention).
- **Content-Based Sparsity**: Attention is focused on the most relevant tokens based on their content, rather than just their position.
- **Benefits:**
  - **Scalability**: Significantly reduces computational cost and memory footprint, enabling the processing of much longer sequences than full attention. This is crucial for tasks like long-document summarization, time-series forecasting, and handling large-scale generative models.
  - **Efficiency**: Faster training and inference due to fewer computations.
- **Drawbacks**: potential loss of information, more complex design, and implementation challenges.
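
The saving from local attention can be made concrete by building attention masks and counting the allowed query-key pairs. The sketch assumes a decoder setting in which each token only looks at itself and earlier tokens; the function names and the window size are illustrative choices.

```python
def causal_mask(n):
    """Full (causal) attention: position i may attend to every position j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]

def sliding_window_mask(n, window):
    """Local attention: position i may attend only to the last `window` positions up to i."""
    return [[i - window < j <= i for j in range(n)] for i in range(n)]

def count_pairs(mask):
    """Number of (query, key) pairs the mask allows."""
    return sum(sum(row) for row in mask)

n = 8
full_pairs = count_pairs(causal_mask(n))
local_pairs = count_pairs(sliding_window_mask(n, window=3))
print(full_pairs, local_pairs)

# Print the local mask: "x" marks an allowed pair, "." a blocked one.
for row in sliding_window_mask(n, window=3):
    print("".join("x" if allowed else "." for allowed in row))
```

With a fixed window the number of pairs grows linearly with the sequence length, whereas full attention grows quadratically.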

**Multi-query and grouped-query attention**

- Presented in research paper: “GQA: Training generalized multi-query transformer models from multi-head checkpoints”
  
![Alt text for the image](images/multi_head.png)

1 - **Multi-Head Attention (MHA):** In standard multi-head attention, each attention head has its own independent set of learnable weight matrices for Queries (Q), Keys (K), and Values (V). This means if you have 'H' attention heads, you have 'H' sets of (Q, K, V) projection matrices.

- **Benefit:** This allows the model to learn diverse relationships and focus on different aspects of the input simultaneously, contributing to its strong performance

- **Drawbacks:**

- **High Memory Bandwidth**: During inference, particularly in autoregressive decoding (generating text token by token), the Key and Value (KV) cache for previous tokens grows with sequence length. With 'H' separate K and V matrices, this cache can become very large, leading to significant memory consumption and bandwidth bottlenecks.

- **Slower Inference**: Loading and processing these large KV caches for each head can slow down the generation process.

2 - **Multi-Query Attention (MQA)**

- Mechanism: MQA is an aggressive optimization where all attention heads share the same single set of Key (K) and Value (V) projection matrices. Only the Query (Q) matrices remain separate for each head.

- **Benefits**

- Significantly Reduced KV Cache Size: Since all query heads use the same K and V, the KV cache size is drastically reduced (by a factor of 'H', the number of heads). This leads to much lower memory consumption and memory bandwidth requirements.


- Faster Inference: The smaller KV cache means less data needs to be loaded from memory, resulting in substantial speedups during inference. This is particularly beneficial for long sequence generation.

- **Drawbacks**

- Potential Quality Degradation: The major trade-off is a potential reduction in model quality or performance. Forcing all query heads to look at the same K and V information might limit the model's ability to learn diverse and intricate relationships, as each head can't specialize as much. This "information bottleneck" can lead to a slight drop in accuracy on some tasks.


- Training Instability: Some studies have noted that MQA can sometimes lead to training instability compared to MHA.

3 - **Grouped-Query Attention (GQA):** GQA strikes a balance between MHA and MQA. Instead of sharing a single K and V set across all query heads (like MQA), GQA divides the query heads into a smaller number of groups, and each group shares a single set of Key (K) and Value (V) projection matrices.

- **Benefits**

- **Balances Speed and Quality:** GQA achieves much of the speedup of MQA by reducing the KV cache size (though not as much as MQA), while retaining more of the representational power and quality of MHA. It reduces memory bandwidth and latency significantly compared to MHA, but typically performs better than MQA in terms of model quality.

- **Flexible Trade-off:** The number of groups 'G' is a tunable hyperparameter, allowing developers to explicitly choose the trade-off between inference speed and model quality.

- **Widely Adopted:** Many modern LLMs (e.g., Llama 2 70B, Mistral 7B) use GQA because of its favorable balance.

- Overall, GQA improves the inference scalability of larger models by reducing the size of the key and value matrices involved.
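
The difference between MHA, GQA, and MQA comes down to how many key/value heads exist and which query heads share them. The sketch below shows the head-to-group mapping and a back-of-the-envelope KV cache size. The model dimensions (32 layers, 32 query heads, head size 96, 4,096 tokens, 2 bytes per value) are assumed values for illustration, not a configuration taken from a specific model, and both function names are hypothetical.

```python
def kv_head_for_query_head(query_head, num_query_heads, num_kv_heads):
    """Return the index of the key/value head that a given query head uses."""
    group_size = num_query_heads // num_kv_heads
    return query_head // group_size

def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Approximate KV cache size: keys and values for every layer, head, and position."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value

# Mapping for 8 query heads under each scheme.
print("MHA:", [kv_head_for_query_head(h, 8, 8) for h in range(8)])
print("GQA:", [kv_head_for_query_head(h, 8, 2) for h in range(8)])
print("MQA:", [kv_head_for_query_head(h, 8, 1) for h in range(8)])

# Assumed model dimensions, for illustration only.
NUM_LAYERS, HEAD_DIM, SEQ_LEN = 32, 96, 4096
sizes = {}
for name, kv_heads in [("MHA", 32), ("GQA", 8), ("MQA", 1)]:
    sizes[name] = kv_cache_bytes(NUM_LAYERS, kv_heads, HEAD_DIM, SEQ_LEN)
    print(f"{name}: {kv_heads} KV heads, {sizes[name] / 2**20:.0f} MiB")
```

Under these assumptions the cache shrinks in direct proportion to the number of key/value heads, which is why the number of groups acts as a dial between memory use and the per-head diversity of MHA.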

![Alt text for the image](images/attn1.png)

![Alt text for the image](images/attn2.png)

![Alt text for the image](images/attn3.png)
