<a href="https://colab.research.google.com/github/RoboMaroof/LLM-Applications-Building-Blocks/blob/main/LLM__Architecture.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Installs and Imports

In [1]:
!pip3 install -q -U transformers==4.38.0


In [2]:
import os
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoModel,
    AutoTokenizer,
    AutoConfig,
    BitsAndBytesConfig,
    HfArgumentParser,
    TrainingArguments,
    pipeline,
    logging,
)

# Hugging Face Login

In [3]:
from huggingface_hub import notebook_login
notebook_login()


# Load Model

In [4]:
model_name = "google/gemma-2b-it"

In [5]:
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



In [None]:
print(model)

GemmaForCausalLM(
  (model): GemmaModel(
    (embed_tokens): Embedding(256000, 2048, padding_idx=0)
    (layers): ModuleList(
      (0-17): 18 x GemmaDecoderLayer(
        (self_attn): GemmaSdpaAttention(
          (q_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (k_proj): Linear(in_features=2048, out_features=256, bias=False)
          (v_proj): Linear(in_features=2048, out_features=256, bias=False)
          (o_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (rotary_emb): GemmaRotaryEmbedding()
        )
        (mlp): GemmaMLP(
          (gate_proj): Linear(in_features=2048, out_features=16384, bias=False)
          (up_proj): Linear(in_features=2048, out_features=16384, bias=False)
          (down_proj): Linear(in_features=16384, out_features=2048, bias=False)
          (act_fn): GELUActivation()
        )
        (input_layernorm): GemmaRMSNorm()
        (post_attention_layernorm): GemmaRMSNorm()
      )
    )
    (norm): GemmaRM

### **Model Architecture Breakdown: `GemmaForCausalLM`**

#### 1. **Embedding Layer**
- **`(embed_tokens): Embedding(256000, 2048, padding_idx=0)`**
  - **Description**: Converts input tokens into dense vectors.
  - **Matrix**:
    - **Embedding Matrix**: Shape `(256000, 2048)`
      - **Details**: 256,000 rows (one for each token in the vocabulary) and 2048 columns (the dimensionality of each token embedding). Given an input token, the corresponding row is selected as the token's embedding vector.
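
As a minimal sketch of the lookup described above, the following uses a deliberately tiny `nn.Embedding` (the sizes 10 and 4 are illustrative stand-ins for 256000 and 2048, not Gemma's values) to show that embedding a token ID simply selects a row of the matrix.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy sizes for illustration only; Gemma uses (256000, 2048).
vocab_size, hidden_size = 10, 4
embed_tokens = nn.Embedding(vocab_size, hidden_size, padding_idx=0)

token_ids = torch.tensor([[5, 2, 0]])   # shape: (batch=1, seq_len=3)
vectors = embed_tokens(token_ids)       # shape: (1, 3, 4)

print(vectors.shape)
# Each output vector is exactly the matching row of the weight matrix.
print(torch.equal(vectors[0, 0], embed_tokens.weight[5]))
# The row at padding_idx is initialized to zeros.
print(vectors[0, 2])
```

With the real model, the same lookup would be `model.get_input_embeddings()(token_ids)`.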

#### 2. **Self-Attention Mechanism**
- **Within `GemmaDecoderLayer`, focusing on the `GemmaSdpaAttention` layer (18 layers in total):**
  - **`(q_proj): Linear(in_features=2048, out_features=2048, bias=False)`**
    - **Description**: Projects the input into query vectors.
    - **Matrix**:
      - **Query Projection Matrix**: Shape `(2048, 2048)`
        - **Details**: Maps each 2048-dimensional hidden state to 2048 query features, which are split into 8 heads of 256 dimensions each (`num_attention_heads` × `head_dim`).

  - **`(k_proj): Linear(in_features=2048, out_features=256, bias=False)`**
    - **Description**: Projects the input into key vectors.
    - **Matrix**:
      - **Key Projection Matrix**: Shape `(2048, 256)`
        - **Details**: Transforms 2048-dimensional hidden states into 256-dimensional key vectors. There is only one key head (256 = 1 × `head_dim`), and all 8 query heads share it.

  - **`(v_proj): Linear(in_features=2048, out_features=256, bias=False)`**
    - **Description**: Projects the input into value vectors.
    - **Matrix**:
      - **Value Projection Matrix**: Shape `(2048, 256)`
        - **Details**: Transforms 2048-dimensional hidden states into 256-dimensional value vectors. As with the keys, a single value head is shared by all 8 query heads.

  - **`(o_proj): Linear(in_features=2048, out_features=2048, bias=False)`**
    - **Description**: Projects the concatenated output of the attention heads back into the original input dimensionality.
    - **Matrix**:
      - **Output Projection Matrix**: Shape `(2048, 2048)`
        - **Details**: After attention computation, the results are projected back to a 2048-dimensional space.
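
The shapes in this section are written as `(in_features, out_features)`. PyTorch's `nn.Linear` stores its weight transposed, as `(out_features, in_features)`. The sketch below checks this and shows how one key head can be broadcast across 8 query heads. It is a simplified illustration, not the library's attention code: it omits rotary embeddings, the causal mask, and the softmax, and the batch and sequence sizes are arbitrary.

```python
import torch
import torch.nn as nn

hidden_size, num_heads, num_kv_heads, head_dim = 2048, 8, 1, 256

q_proj = nn.Linear(hidden_size, num_heads * head_dim, bias=False)
k_proj = nn.Linear(hidden_size, num_kv_heads * head_dim, bias=False)

# nn.Linear stores weights as (out_features, in_features).
print(q_proj.weight.shape)  # torch.Size([2048, 2048])
print(k_proj.weight.shape)  # torch.Size([256, 2048])

x = torch.randn(1, 4, hidden_size)  # (batch, seq_len, hidden)
q = q_proj(x).view(1, 4, num_heads, head_dim).transpose(1, 2)     # (1, 8, 4, 256)
k = k_proj(x).view(1, 4, num_kv_heads, head_dim).transpose(1, 2)  # (1, 1, 4, 256)

# The single key head broadcasts across all 8 query heads.
scores = q @ k.transpose(-1, -2) / head_dim ** 0.5
print(scores.shape)  # torch.Size([1, 8, 4, 4])
```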

#### 3. **Feedforward Network (MLP)**
- **Within `GemmaDecoderLayer`, focusing on the `GemmaMLP` layer (18 layers in total):**
  - **`(gate_proj): Linear(in_features=2048, out_features=16384, bias=False)`**
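    - **Description**: The gating branch of the MLP. Its output is passed through the activation function (`act_fn`, GELU) and multiplied element-wise with the output of `up_proj`, which controls how much of each intermediate feature passes through.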
    - **Matrix**:
      - **Gate Projection Matrix**: Shape `(2048, 16384)`
        - **Details**: Projects 2048-dimensional input into a higher-dimensional space (16384).

  - **`(up_proj): Linear(in_features=2048, out_features=16384, bias=False)`**
    - **Description**: Another projection layer in the MLP that projects the input to a higher dimension.
    - **Matrix**:
      - **Up Projection Matrix**: Shape `(2048, 16384)`
        - **Details**: Similar to `gate_proj`, this layer projects the input to 16384 dimensions.

  - **`(down_proj): Linear(in_features=16384, out_features=2048, bias=False)`**
    - **Description**: Projects the high-dimensional representation back down to the original input dimension.
    - **Matrix**:
      - **Down Projection Matrix**: Shape `(16384, 2048)`
        - **Details**: Reduces the dimensionality from 16384 back to 2048, completing the MLP's transformation.
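
The three projections combine into a gated feedforward block. The following is a minimal sketch of that pattern, assuming the usual gated-GELU formulation `down_proj(gelu(gate_proj(x)) * up_proj(x))`. The class name `GatedMLP` and the toy sizes (8 and 32) are hypothetical, chosen so it runs instantly.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GatedMLP(nn.Module):
    """Hypothetical stand-in for a gated feedforward block."""

    def __init__(self, hidden_size: int, intermediate_size: int):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The activated gate scales the up-projected features element-wise.
        gate = F.gelu(self.gate_proj(x))
        return self.down_proj(gate * self.up_proj(x))


# Toy sizes for illustration; Gemma uses 2048 and 16384.
mlp = GatedMLP(hidden_size=8, intermediate_size=32)
x = torch.randn(2, 5, 8)   # (batch, seq_len, hidden)
y = mlp(x)
print(y.shape)             # same shape as the input
```

Because the block ends with `down_proj`, the output has the same shape as the input and can be added back to the residual stream.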

#### 4. **Normalization Layers**
- **`(input_layernorm): GemmaRMSNorm()`** and **`(post_attention_layernorm): GemmaRMSNorm()`**
  - **Description**: `input_layernorm` normalizes the hidden states before the attention block, and `post_attention_layernorm` normalizes them after attention, before they enter the feedforward network.
  - **Matrix**:
    - **Normalization Parameters**: Each RMSNorm layer holds a single learned scale vector of size 2048. Unlike LayerNorm, RMSNorm does not subtract the mean and has no shift (bias) term.
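
For intuition, here is a small pure-Python sketch of RMS normalization. The `(1 + weight)` scaling reflects my understanding of how Gemma's RMSNorm is implemented in Transformers; treat it as an assumption, since many other models multiply by `weight` directly. The function name and sample values are illustrative.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """Scale x so its root mean square is ~1, then apply a learned scale."""
    mean_square = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_square + eps)
    # Assumed Gemma convention: the learned weight is an offset from 1.
    return [v * inv_rms * (1.0 + w) for v, w in zip(x, weight)]


hidden = [3.0, -4.0, 0.0, 5.0]
out = rms_norm(hidden, weight=[0.0] * len(hidden))

# With zero weights the output has a root mean square of ~1.
rms_after = math.sqrt(sum(v * v for v in out) / len(out))
print(out)
print(round(rms_after, 4))
```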

#### 5. **Output Layer**
- **`(lm_head): Linear(in_features=2048, out_features=256000, bias=False)`**
  - **Description**: The final layer that maps the output from the last decoder layer back to the size of the vocabulary, producing logits for each token.
  - **Matrix**:
    - **Output Projection Matrix**: Shape `(2048, 256000)`
      - **Details**: Projects the 2048-dimensional vectors back into the 256,000-dimensional space of the vocabulary, where each dimension corresponds to a logit representing the likelihood of a particular token.

### **Summary of Matrix Dimensions**
- **Embedding Layer**: `(256000, 2048)`
- **Attention Mechanism** (for each of the 18 layers):
  - **Query Projection**: `(2048, 2048)`
  - **Key Projection**: `(2048, 256)`
  - **Value Projection**: `(2048, 256)`
  - **Output Projection**: `(2048, 2048)`
- **Feedforward Network (MLP)** (for each of the 18 layers):
  - **Gate Projection**: `(2048, 16384)`
  - **Up Projection**: `(2048, 16384)`
  - **Down Projection**: `(16384, 2048)`
- **Output Layer**: `(2048, 256000)`
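
The shapes above are enough to estimate the total parameter count. The sketch below makes two assumptions that the printed output does not confirm on its own: the `lm_head` shares its weights with the embedding matrix (so it is not counted twice), and the final `norm` layer has one 2048-element scale vector like the per-layer norms.

```python
hidden_size = 2048
intermediate_size = 16384
vocab_size = 256000
num_layers = 18
q_out, kv_out = 2048, 256  # output sizes of q_proj and of k_proj / v_proj

embedding = vocab_size * hidden_size

# q_proj + k_proj + v_proj + o_proj
attention = hidden_size * q_out + 2 * hidden_size * kv_out + q_out * hidden_size
# gate_proj + up_proj + down_proj
mlp = 3 * hidden_size * intermediate_size
# input_layernorm + post_attention_layernorm
norms = 2 * hidden_size

per_layer = attention + mlp + norms
final_norm = hidden_size  # assumed: one scale vector for the final norm

total = embedding + num_layers * per_layer + final_norm
print(f"{total:,}")  # 2,506,172,416
```

In the notebook, the estimate can be compared against `sum(p.numel() for p in model.parameters())`.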


In [None]:
config = AutoConfig.from_pretrained(model_name)
print(config)

GemmaConfig {
  "_name_or_path": "google/gemma-2b-it",
  "architectures": [
    "GemmaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 2,
  "eos_token_id": 1,
  "head_dim": 256,
  "hidden_act": "gelu",
  "hidden_size": 2048,
  "initializer_range": 0.02,
  "intermediate_size": 16384,
  "max_position_embeddings": 8192,
  "model_type": "gemma",
  "num_attention_heads": 8,
  "num_hidden_layers": 18,
  "num_key_value_heads": 1,
  "pad_token_id": 0,
  "rms_norm_eps": 1e-06,
  "rope_scaling": null,
  "rope_theta": 10000.0,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.38.0",
  "use_cache": true,
  "vocab_size": 256000
}



**Model Configuration Breakdown**

- **`_name_or_path`**: `"google/gemma-2b-it"`  
  Specifies the name or path of the model. Here it is the Hub repository `google/gemma-2b-it`: the 2-billion-parameter Gemma model, where the `-it` suffix stands for the instruction-tuned variant.

- **`architectures`**: `["GemmaForCausalLM"]`  
  Defines the model architecture as "Causal Language Modeling," used for predicting the next word in a sequence.

- **`attention_bias`**: `false`  
  Indicates that no bias terms are used in the attention mechanism.

- **`attention_dropout`**: `0.0`  
  Specifies the dropout rate for attention layers, with `0.0` meaning no dropout is applied.

- **`bos_token_id`**: `2`  
  The ID of the beginning-of-sequence token (`<bos>`), which marks the start of an input sequence.

- **`eos_token_id`**: `1`  
  The ID of the end-of-sequence token (`<eos>`), which marks the end of a sequence and signals generation to stop.

- **`head_dim`**: `256`  
  The dimensionality of each attention head, with each head processing 256-dimensional vectors. Affects how much information each head can process. Larger head_dim allows each head to capture more complex features, but it also increases the computational cost.

- **`hidden_act`**: `"gelu"`  
  Specifies the activation function used in hidden layers, with `"gelu"` standing for Gaussian Error Linear Unit.

- **`hidden_size`**: `2048`  
  The size of the model's hidden layers, particularly the size of the input and output vectors for the multi-head attention and feedforward layers, with each hidden state vector having 2048 components. Determines the overall capacity of the model to learn and represent information. A larger hidden_size allows the model to capture more nuanced patterns but requires more computational power.

- **`initializer_range`**: `0.02`  
  The standard deviation of the normal distribution used to initialize the model's weights.

- **`intermediate_size`**: `16384`  
  The size of the intermediate layer in the feedforward network (FFN) within each Transformer block. The FFN processes the output of the attention mechanism. In Gemma it is a gated variant with three linear layers: `gate_proj` and `up_proj` expand the 2048-dimensional hidden state to `intermediate_size`, and `down_proj` maps it back. A larger `intermediate_size` increases the capacity of this transformation at the cost of more parameters and compute.

- **`max_position_embeddings`**: `8192`  
  Defines the maximum number of tokens the model can handle in a single sequence, allowing processing of sequences up to 8192 tokens long.

- **`model_type`**: `"gemma"`  
  Indicates the specific model type or family.

- **`num_attention_heads`**: `8`  
  The number of parallel attention heads in each multi-head attention layer. Multi-head attention allows the model to focus on different parts of the input sequence simultaneously, with each head capturing different types of dependencies or patterns. These independent results are then concatenated and combined to form the final output. The number of attention heads helps the model to capture a wider range of dependencies in the input sequence.

- **`num_hidden_layers`**: `18`  
  The total number of hidden layers (Transformer blocks) in the model. Each hidden layer in a Transformer model consists of a multi-head attention mechanism followed by a feedforward neural network.  The depth of the model (i.e., the number of hidden layers) affects its ability to learn complex hierarchical features. More layers generally make the model more powerful but also make it harder to train and more prone to overfitting if not managed properly.

- **`num_key_value_heads`**: `1`  
  Specifies how many key/value heads the attention layers use. When this is smaller than `num_attention_heads`, several query heads share one key/value head, which shrinks the key and value projections and the cache kept during generation. With `num_key_value_heads` = 1 (multi-query attention), all 8 query heads attend over the same single set of keys and values, which is why `k_proj` and `v_proj` output only 256 features.

- **`pad_token_id`**: `0`  
  The ID of the padding token used to fill in empty spaces in the input sequence.

- **`rms_norm_eps`**: `1e-06`  
  A small constant added in Root Mean Square Layer Normalization to prevent division by zero.

- **`rope_scaling`**: `null`  
  Indicates no additional scaling applied to Rotary Position Embeddings (RoPE).

- **`rope_theta`**: `10000.0`  
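  The base used to compute RoPE's rotation frequencies. Each pair of dimensions rotates at a frequency of `rope_theta^(-2i/head_dim)`, so a larger base gives longer wavelengths for the slowest-rotating dimensions.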

- **`torch_dtype`**: `"bfloat16"`  
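  The data type in which the checkpoint weights were saved; `"bfloat16"` halves memory use relative to `float32`. `from_pretrained` does not apply this value automatically: unless `torch_dtype` is passed, the weights are loaded as `float32`.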

- **`transformers_version`**: `"4.38.0"`  
  Specifies the version of the Transformers library used.

- **`use_cache`**: `true`  
  Enables caching of the attention key and value tensors from previous tokens during generation, so they are not recomputed at every step.

- **`vocab_size`**: `256000`  
  The size of the model's vocabulary, indicating it can recognize up to 256,000 unique tokens.
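
Several of these settings determine each other's consequences. As a rough sketch, the snippet below derives the query and key/value projection widths from the config values and estimates the key/value cache size for a full 8192-token sequence, comparing one key/value head with a hypothetical 8. The helper `kv_cache_bytes` and the 2-bytes-per-value (bfloat16) assumption are illustrative, not part of the library.

```python
# Values copied from the printed GemmaConfig.
config = {
    "hidden_size": 2048,
    "head_dim": 256,
    "num_attention_heads": 8,
    "num_key_value_heads": 1,
    "num_hidden_layers": 18,
    "max_position_embeddings": 8192,
}

# Output widths of q_proj and of k_proj / v_proj.
q_features = config["num_attention_heads"] * config["head_dim"]
kv_features = config["num_key_value_heads"] * config["head_dim"]
print(q_features, kv_features)  # 2048 256


def kv_cache_bytes(cfg, num_kv_heads, seq_len, bytes_per_value=2):
    """Estimate cache size: keys and values for every layer and token."""
    per_token = 2 * cfg["num_hidden_layers"] * num_kv_heads * cfg["head_dim"]
    return per_token * seq_len * bytes_per_value


seq_len = config["max_position_embeddings"]
mqa = kv_cache_bytes(config, config["num_key_value_heads"], seq_len)
mha = kv_cache_bytes(config, config["num_attention_heads"], seq_len)
print(mqa / 2**20, mha / 2**20)  # 144.0 1152.0 (MiB)
```

Under these assumptions, sharing one key/value head cuts the cache for a single sequence by a factor of 8.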


In [None]:
for name, param in model.named_parameters():
    print(f"Parameter: {name}, dtype: {param.dtype}")

Parameter: model.embed_tokens.weight, dtype: torch.float32
Parameter: model.layers.0.self_attn.q_proj.weight, dtype: torch.float32
Parameter: model.layers.0.self_attn.k_proj.weight, dtype: torch.float32
Parameter: model.layers.0.self_attn.v_proj.weight, dtype: torch.float32
Parameter: model.layers.0.self_attn.o_proj.weight, dtype: torch.float32
Parameter: model.layers.0.mlp.gate_proj.weight, dtype: torch.float32
Parameter: model.layers.0.mlp.up_proj.weight, dtype: torch.float32
Parameter: model.layers.0.mlp.down_proj.weight, dtype: torch.float32
Parameter: model.layers.0.input_layernorm.weight, dtype: torch.float32
Parameter: model.layers.0.post_attention_layernorm.weight, dtype: torch.float32
Parameter: model.layers.1.self_attn.q_proj.weight, dtype: torch.float32
Parameter: model.layers.1.self_attn.k_proj.weight, dtype: torch.float32
Parameter: model.layers.1.self_attn.v_proj.weight, dtype: torch.float32
Parameter: model.layers.1.self_attn.o_proj.weight, dtype: torch.float32
Parameter

# Casting

In [None]:
# Cast model parameters to lower precision bfloat16 (if supported by GPU)
model = model.to(dtype=torch.bfloat16)

In [None]:
for name, param in model.named_parameters():
    print(f"Parameter: {name}, dtype: {param.dtype}")
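
To see what the cast saves, here is a back-of-the-envelope estimate of weight memory per data type. The parameter count is an assumption derived from the layer shapes printed earlier (with the output layer sharing the embedding weights), and the helper name is illustrative.

```python
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2}


def weight_memory_gib(num_params: int, dtype: str) -> float:
    """Memory needed for the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


num_params = 2_506_172_416  # assumed parameter count for gemma-2b-it

for dtype in ("float32", "bfloat16"):
    print(f"{dtype}: {weight_memory_gib(num_params, dtype):.2f} GiB")
# float32: 9.34 GiB
# bfloat16: 4.67 GiB
```

In the notebook, `num_params` can instead be measured with `sum(p.numel() for p in model.parameters())`. Activations and the key/value cache need additional memory on top of this.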

# Load model in bfloat16 (if supported by model and GPU)

In [None]:
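# Use AutoModelForCausalLM (as above) so the language-modeling head is kept;
# AutoModel would load only the base GemmaModel.
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)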

In [None]:
for name, param in model.named_parameters():
    print(f"Parameter: {name}, Data Type: {param.dtype}")

# Additional loading parameters

## Common Options for `from_pretrained`

### `pretrained_model_name_or_path`

Specifies the name of the pre-trained model or the path to the local directory containing the model weights and configuration files.
- **Example:** `'bert-base-uncased'` or `'/path/to/local/model'`.

### `config`
Accepts a configuration object (`PretrainedConfig` subclass) that overrides the default model configuration.
- **Example:** `config=AutoConfig.from_pretrained('bert-base-uncased')`.

### `state_dict`
Provides a state dictionary (a Python dictionary object) with custom model weights.
- **Example:** `state_dict=custom_state_dict`.

### `cache_dir`
Specifies the directory where the downloaded model weights and configuration files will be cached.
- **Example:** `cache_dir='./cache'`.

### `force_download`
Forces the download of the model weights and configuration files, even if they are already cached.
- **Example:** `force_download=True`.

### `resume_download`
Resumes downloading a model that was partially downloaded.
- **Example:** `resume_download=True`.

### `proxies`
A dictionary of proxy servers to use by protocol or endpoint.
- **Example:** `proxies={'http': 'http://10.10.1.10:3128', 'https': 'http://10.10.1.10:1080'}`.

### `token` (formerly `use_auth_token`)
The token to use as HTTP bearer authorization for remote files. `use_auth_token` is deprecated in recent versions of Transformers in favor of `token`.
- **Example:** `token='your_token_here'`.

### `revision`
Specifies the model version to use (can be a branch name, tag name, or commit id).
- **Example:** `revision='main'` or `revision='v1.2.3'`.

### `subfolder`
Loads the model from a subfolder in the repository.
- **Example:** `subfolder='my_model'`.

### `mirror`
Specifies the base URL of the mirror to use to download the model weights and configuration files.
- **Example:** `mirror='https://mirror.s3.amazonaws.com'`.

## Precision and Device Management

### `torch_dtype`
Specifies the desired data type for the model parameters. Useful for reducing memory usage or leveraging specific hardware features.
- **Example:** `torch_dtype=torch.bfloat16` or `torch_dtype=torch.float16`.

### `device_map`
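Specifies which device each part of the model is placed on. It accepts either a dictionary mapping module names to devices or a strategy string such as `'auto'`, and requires the `accelerate` library. Useful for model parallelism.
- **Example:** `device_map='auto'`, or an explicit mapping such as `device_map={'layer.0': 'cuda:0', 'layer.1': 'cuda:1'}` (the module names here are illustrative).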

### `load_in_8bit`
Loads the model in 8-bit precision using the bitsandbytes library, which can significantly reduce memory usage.
- **Example:** `load_in_8bit=True`.

### `quantization_config`
A configuration object describing how the weights should be quantized when the model is loaded, such as a `BitsAndBytesConfig` for 8-bit or 4-bit loading.
- **Example:** `quantization_config=BitsAndBytesConfig(load_in_4bit=True)`.

### `low_cpu_mem_usage`
Reduces the CPU memory usage by loading the model in a more memory-efficient way, particularly useful when working with very large models.
- **Example:** `low_cpu_mem_usage=True`.


In [None]:
# Example
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,         # Set precision to bfloat16
    cache_dir='./cache',                # Set cache directory
    force_download=True,                # Force re-download of the model
    low_cpu_mem_usage=True              # Reduce CPU memory usage
)

# Tokenizer

In [7]:
print(len(tokenizer))

256000


In [6]:
print(tokenizer)

GemmaTokenizerFast(name_or_path='google/gemma-2b-it', vocab_size=256000, model_max_length=1000000000000000019884624838656, is_fast=True, padding_side='left', truncation_side='right', special_tokens={'bos_token': '<bos>', 'eos_token': '<eos>', 'unk_token': '<unk>', 'pad_token': '<pad>', 'additional_special_tokens': ['<start_of_turn>', '<end_of_turn>']}, clean_up_tokenization_spaces=False),  added_tokens_decoder={
	0: AddedToken("<pad>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	1: AddedToken("<eos>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	2: AddedToken("<bos>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	3: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	4: AddedToken("<mask>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False),
	5: AddedToken("<2mass>", rstrip=False, lstrip=False, single_w