### **Note** This notebook needs a GPU. In Google Colab, go to Runtime > Change runtime type > Hardware accelerator > GPU > GPU type > T4.

In [None]:
# Quote the version specifiers so the shell does not treat ">" as a redirect.
# !pip install "transformers>=4.41.2" "accelerate>=0.31.0"

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

In [None]:
# Load Model and Tokenizer
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

In [None]:
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    device_map = "cuda",
    torch_dtype = "auto",
    trust_remote_code = False
)

In [None]:
# Create a pipeline: it bundles the task ("text-generation"), the model, the tokenizer, and the generation settings into one callable.
generator = pipeline(
    "text-generation",
    model = model,  # The model loaded above
    tokenizer = tokenizer, # The tokenizer loaded above
    max_new_tokens = 500, # Generate at most 500 new tokens
    return_full_text=False, # False: return only the generated tokens. True: return the input prompt followed by the generated tokens.
    do_sample=True # True: sample the next token, so the output varies between runs. False: greedy decoding, which gives the same output every time.
)

Device set to use cuda


In [None]:
prompt = "Write an email apologizing to Sarah for the tragic gardening mishap. Explain how it happened."

output = generator(prompt)

print(output[0]['generated_text'])

 Include an offer for free gardening services for a month. The incident occurred on Sunday when you, Sarah, attempted to prune the rose bushes in your front yard. Unfortunately, while cutting, the shears slipped and accidentally cut


In [None]:
print(model)

Phi3ForCausalLM(
  (model): Phi3Model(
    (embed_tokens): Embedding(32064, 3072, padding_idx=32000)
    (layers): ModuleList(
      (0-31): 32 x Phi3DecoderLayer(
        (self_attn): Phi3Attention(
          (o_proj): Linear(in_features=3072, out_features=3072, bias=False)
          (qkv_proj): Linear(in_features=3072, out_features=9216, bias=False)
        )
        (mlp): Phi3MLP(
          (gate_up_proj): Linear(in_features=3072, out_features=16384, bias=False)
          (down_proj): Linear(in_features=8192, out_features=3072, bias=False)
          (activation_fn): SiLUActivation()
        )
        (input_layernorm): Phi3RMSNorm((3072,), eps=1e-05)
        (post_attention_layernorm): Phi3RMSNorm((3072,), eps=1e-05)
        (resid_attn_dropout): Dropout(p=0.0, inplace=False)
        (resid_mlp_dropout): Dropout(p=0.0, inplace=False)
      )
    )
    (norm): Phi3RMSNorm((3072,), eps=1e-05)
    (rotary_emb): Phi3RotaryEmbedding()
  )
  (lm_head): Linear(in_features=3072, out_featur

## Phi3 Model Architecture

The overall structure is Phi3ForCausalLM, which contains a Phi3Model (the main body of the neural network) and a lm_head (the final layer for predicting tokens).

1. **Phi3Model** (The Core Model)

This is the main neural network that processes input sequences and learns to represent language.

(embed_tokens): Embedding(32064, 3072, padding_idx=32000)

This is the input embedding layer.
32064: This is the vocabulary size, i.e. the number of rows in the embedding table. The model can read and generate 32,064 distinct token IDs (a token may be a word, a subword, or a single character).
3072: This is the embedding dimension. Each token ID is mapped to a learned vector of 3072 numbers, and this vector is what the rest of the network operates on.
padding_idx=32000: This indicates that token ID 32000 is used for padding (to make sequences of varying lengths uniform). In PyTorch, the embedding row at this index receives no gradient updates during training.
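
As a rough illustration of what an embedding layer does, the sketch below builds a tiny lookup table in plain Python. The sizes (a vocabulary of 6, a dimension of 4, padding ID 5) and the names `embedding_table` and `embed` are illustrative assumptions, not values or code from Phi-3. Only the parameter-count arithmetic at the end uses the real shape from the printout.

```python
import random

random.seed(0)

VOCAB_SIZE = 6   # toy value; the printout shows 32064
EMBED_DIM = 4    # toy value; the printout shows 3072
PAD_ID = 5       # toy padding id; the printout shows 32000

# One row of EMBED_DIM numbers per token id.
embedding_table = [
    [random.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)]
    for _ in range(VOCAB_SIZE)
]
# A freshly initialised PyTorch embedding zero-fills the padding row.
embedding_table[PAD_ID] = [0.0] * EMBED_DIM

def embed(token_ids):
    """Look up one vector per token id (a plain table lookup)."""
    return [embedding_table[i] for i in token_ids]

vectors = embed([2, 0, PAD_ID])
print(len(vectors), len(vectors[0]))  # 3 tokens, 4 numbers each

# Parameter count of the real layer: one 3072-vector per vocabulary entry.
phi3_embedding_params = 32064 * 3072
print(phi3_embedding_params)  # 98500608
```
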
(layers): ModuleList(...)

This represents the stacked transformer decoder layers. The core of the model's intelligence resides here.
(0-31): 32 x Phi3DecoderLayer(...): This indicates that there are 32 identical Phi3DecoderLayer modules stacked one after another. Each layer refines the token representations.
Let's look inside a single Phi3DecoderLayer:

(self_attn): Phi3Attention(...)

This is the self-attention mechanism. It allows the model to weigh the importance of different tokens in the input sequence when processing a specific token.

(o_proj): Linear(in_features=3072, out_features=3072, bias=False): This is the output projection layer for the attention mechanism. It transforms the concatenated attention heads' output back to the model's hidden dimension (3072).
(qkv_proj): Linear(in_features=3072, out_features=9216, bias=False): This projects the input into Query (Q), Key (K), and Value (V) matrices. Since 9216 = 3072 * 3, it means that for each input token, it generates Q, K, and V vectors, each of dimension 3072. These are used to calculate attention scores.
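
The idea of a fused Q/K/V projection followed by attention can be sketched in plain Python. This is a minimal single-head toy with made-up numbers, not the Phi3Attention implementation: it skips the multi-head split, the rotary embedding, and the output projection, and the helper names `split_qkv` and `causal_attention` are hypothetical.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def split_qkv(fused):
    """Split one fused projection output into equal Q, K, V parts."""
    d = len(fused) // 3
    return fused[:d], fused[d:2 * d], fused[2 * d:]

def causal_attention(queries, keys, values):
    """Single-head scaled dot-product attention with a causal mask."""
    dim = len(queries[0])
    outputs = []
    for t, q in enumerate(queries):
        # Token t may only attend to positions 0..t.
        scores = [
            sum(qi * ki for qi, ki in zip(q, keys[s])) / math.sqrt(dim)
            for s in range(t + 1)
        ]
        weights = softmax(scores)
        outputs.append([
            sum(w * values[s][j] for s, w in enumerate(weights))
            for j in range(len(values[0]))
        ])
    return outputs

# Toy fused outputs for 3 tokens, each of length 6 (so Q, K, V have size 2).
fused_rows = [
    [1.0, 0.0, 1.0, 0.0, 1.0, 2.0],
    [0.0, 1.0, 0.0, 1.0, 3.0, 4.0],
    [1.0, 1.0, 1.0, 1.0, 5.0, 6.0],
]
q_rows, k_rows, v_rows = zip(*(split_qkv(row) for row in fused_rows))
attended = causal_attention(q_rows, k_rows, v_rows)
print(attended[0])  # the first token only sees itself, so this is its own V

# Shape check for the real layer: 9216 fused features split into three parts.
print(9216 // 3)  # 3072
```

Each output row is a weighted average of the value vectors the token is allowed to see, which is why later tokens can mix in information from earlier ones but not the other way round.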

(mlp): Phi3MLP(...): This is the Multi-Layer Perceptron (MLP), also known as the feed-forward network. It is applied to each token's representation independently after the self-attention layer and adds non-linearity and further transformations.
(gate_up_proj): Linear(in_features=3072, out_features=16384, bias=False): This fuses two linear layers into one: a "gate" projection and an "up" projection, as used in gated linear unit (GLU) style feed-forward blocks. The 16384 outputs are split into two halves of 8192 features each.
(down_proj): Linear(in_features=8192, out_features=3072, bias=False): This "down-projects" the expanded representation back to the hidden dimension of 3072. Its input dimension is 8192 = 16384 / 2 because the gate half (after the activation) and the up half are multiplied element-wise, which leaves 8192 features.
(activation_fn): SiLUActivation(): The SiLU (Sigmoid Linear Unit) activation function, SiLU(x) = x * sigmoid(x), is applied to the gate half and introduces the non-linearity that lets the model learn complex patterns.
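
A gated feed-forward block of this shape can be sketched in plain Python. The sketch assumes the fused output is split as gate first, then up, and uses toy sizes (hidden 2, intermediate 3) with made-up weights. The helpers `matvec` and `gated_mlp` are hypothetical names, not functions from the transformers library.

```python
import math

def silu(x):
    """SiLU(x) = x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))

def matvec(weight, vec):
    """Multiply a (rows x cols) weight matrix by a vector of length cols."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]

def gated_mlp(x, gate_up_weight, down_weight):
    fused = matvec(gate_up_weight, x)           # length 2 * intermediate
    half = len(fused) // 2
    gate, up = fused[:half], fused[half:]       # split the fused output
    hidden = [u * silu(g) for g, u in zip(gate, up)]
    return matvec(down_weight, hidden)          # back to the hidden size

# Toy sizes: hidden = 2, intermediate = 3 (the printout shows 3072 and 8192).
gate_up_weight = [
    [0.5, -0.2], [0.1, 0.3], [-0.4, 0.2],   # gate rows
    [1.0, 0.0], [0.0, 1.0], [0.5, 0.5],     # up rows
]
down_weight = [
    [0.2, -0.1, 0.3],
    [0.0, 0.4, -0.2],
]
out = gated_mlp([1.0, 2.0], gate_up_weight, down_weight)
print(len(out))  # 2: same size as the input vector
```
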
(input_layernorm): Phi3RMSNorm((3072,), eps=1e-05)

This is a Root Mean Square Normalization (RMSNorm) layer applied before the self-attention mechanism. It normalizes the input to each sub-layer, which helps stabilize training and improve performance. (3072,) indicates it operates on the hidden dimension.
(post_attention_layernorm): Phi3RMSNorm((3072,), eps=1e-05)

Another RMSNorm layer, applied after the self-attention mechanism and before the MLP.
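
For intuition, RMSNorm divides a vector by its root mean square and then multiplies by a learned per-feature weight. The sketch below is a minimal plain-Python version with a toy 4-dimensional input and all-ones weights. The function name `rms_norm` is an illustrative assumption, and the default `eps` matches the value in the printout.

```python
import math

def rms_norm(x, weight, eps=1e-5):
    """Scale x so its root mean square is about 1, then apply the weight."""
    mean_square = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_square + eps)
    return [w * v * scale for v, w in zip(x, weight)]

x = [3.0, -4.0, 0.0, 5.0]
weight = [1.0] * len(x)          # learned in a real model; ones here
y = rms_norm(x, weight)

rms_after = math.sqrt(sum(v * v for v in y) / len(y))
print(round(rms_after, 3))  # 1.0
```

Unlike LayerNorm, this does not subtract the mean and has no bias term, so it only rescales the vector.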
(resid_attn_dropout): Dropout(p=0.0, inplace=False) and (resid_mlp_dropout): Dropout(p=0.0, inplace=False)

These are dropout layers applied to the attention and MLP outputs before they are added back to the residual stream. The p=0.0 means dropout is configured off: no activations are randomly zeroed, even in training mode. Dropout is always inactive in evaluation mode regardless of p. When dropout is used for regularization during training, p is set to a non-zero value such as 0.1.
(norm): Phi3RMSNorm((3072,), eps=1e-05)

This is the final normalization layer applied to the output of the last Phi3DecoderLayer before it goes to the lm_head.
(rotary_emb): Phi3RotaryEmbedding()

This implements Rotary Positional Embeddings (RoPE). Unlike traditional positional embeddings that add fixed vectors to token embeddings, RoPE applies a rotation matrix to the query and key vectors within the attention mechanism. This allows the model to encode the relative position of tokens, which is particularly effective for handling long sequences.
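
The relative-position property can be checked with a small plain-Python sketch. It rotates a single 2-D query and key by an angle proportional to their positions, which is the core idea of RoPE reduced to one pair of dimensions. The function `rotate_pair`, the frequency `theta=1.0`, and the vectors are illustrative assumptions and do not reproduce Phi3RotaryEmbedding.

```python
import math

def rotate_pair(vec, position, theta=1.0):
    """Rotate a 2-D vector by an angle proportional to its position."""
    angle = position * theta
    x, y = vec
    return [
        x * math.cos(angle) - y * math.sin(angle),
        x * math.sin(angle) + y * math.cos(angle),
    ]

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

q = [1.0, 2.0]
k = [0.5, -1.0]

# Same relative offset (3 positions apart) at two different absolute positions.
score_near = dot(rotate_pair(q, 5), rotate_pair(k, 2))
score_far = dot(rotate_pair(q, 105), rotate_pair(k, 102))
print(abs(score_near - score_far) < 1e-9)  # True

# A different offset (1 position apart) gives a different score.
score_other = dot(rotate_pair(q, 5), rotate_pair(k, 4))
print(abs(score_near - score_other) > 1e-3)  # True
```

Because the attention score depends only on the distance between the two positions, the same pattern is scored the same way wherever it occurs in the sequence.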

2. **lm_head:** Linear(in_features=3072, out_features=32064, bias=False)

This is the language modeling head, the final layer of the model.
It's a linear layer that takes the 3072-dimensional output of the Phi3Model and projects it to the 32064 (vocabulary size) dimension.
The output of this layer represents the logits for each token in the vocabulary. After a softmax, a higher logit corresponds to a higher probability that the token is the next token in the sequence.
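
The shapes in the printout are enough to tally the model's weights. The arithmetic below is a sketch under two assumptions: the rotary embedding and dropout modules hold no learned weights, and `lm_head` has its own weight matrix rather than sharing the embedding table.

```python
HIDDEN = 3072
VOCAB = 32064
LAYERS = 32

attention = HIDDEN * 9216 + HIDDEN * HIDDEN      # qkv_proj + o_proj
mlp = HIDDEN * 16384 + 8192 * HIDDEN             # gate_up_proj + down_proj
layer_norms = 2 * HIDDEN                         # two RMSNorm weight vectors
per_layer = attention + mlp + layer_norms

embedding = VOCAB * HIDDEN                       # embed_tokens
final_norm = HIDDEN                              # norm
lm_head = HIDDEN * VOCAB                         # lm_head (assumed untied)

total = embedding + LAYERS * per_layer + final_norm + lm_head
print(f"{per_layer:,} parameters per decoder layer")
print(f"{total:,} parameters in total")
```

Under these assumptions the total comes to roughly 3.8 billion parameters, almost all of it in the 32 decoder layers.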

## **Choosing a single token from the probability distribution (sampling / decoding)**

In [None]:
prompt = "The Capital of France is"

# Tokenize the input prompt
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

input_ids = input_ids.to("cuda")

In [None]:
# Get the output of the model before the lm_head
model_output = model.model(input_ids)

print(model_output)

BaseModelOutputWithPast(last_hidden_state=tensor([[[-0.3047,  1.1953,  0.2988,  ..., -0.3008,  0.6758,  0.1406],
         [-0.6367,  1.0312, -0.4844,  ...,  0.8164, -0.2578, -0.2949],
         [-0.3477,  1.1406,  1.5312,  ...,  0.0840,  0.1216,  0.3164],
         [-0.6719,  0.9766,  0.8594,  ...,  0.1660,  0.5586, -0.0332],
         [-0.8398,  0.7891,  0.4941,  ...,  0.2754, -0.1396, -0.1553]]],
       device='cuda:0', dtype=torch.bfloat16, grad_fn=<MulBackward0>), past_key_values=DynamicCache(layers=[DynamicSlidingWindowLayer, DynamicSlidingWindowLayer, DynamicSlidingWindowLayer, DynamicSlidingWindowLayer, DynamicSlidingWindowLayer, DynamicSlidingWindowLayer, DynamicSlidingWindowLayer, DynamicSlidingWindowLayer, DynamicSlidingWindowLayer, DynamicSlidingWindowLayer, DynamicSlidingWindowLayer, DynamicSlidingWindowLayer, DynamicSlidingWindowLayer, DynamicSlidingWindowLayer, DynamicSlidingWindowLayer, DynamicSlidingWindowLayer, DynamicSlidingWindowLayer, DynamicSlidingWindowLayer, Dynamic

In [None]:
# Get the Output of the lm_head
lm_head_output = model.lm_head(model_output[0])

print(lm_head_output)

tensor([[[24.7500, 24.8750, 22.7500,  ..., 19.0000, 19.0000, 19.0000],
         [26.6250, 28.5000, 25.7500,  ..., 21.7500, 21.7500, 21.7500],
         [33.0000, 31.5000, 32.2500,  ..., 26.6250, 26.6250, 26.6250],
         [32.5000, 33.0000, 35.7500,  ..., 28.0000, 28.0000, 28.0000],
         [28.6250, 31.0000, 29.0000,  ..., 22.8750, 22.8750, 22.8750]]],
       device='cuda:0', dtype=torch.bfloat16, grad_fn=<UnsafeViewBackward0>)


In [None]:
lm_head_output.shape

torch.Size([1, 5, 32064])

In [None]:
token_id = lm_head_output[0,-1].argmax(-1)
result = tokenizer.decode(token_id)

In [None]:
print(result) # Question was: The Capital of France is

Paris
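
Taking the `argmax` of the last position's logits is greedy decoding. The toy sketch below shows the same step on a hypothetical five-token vocabulary with made-up logits: softmax turns the logits into probabilities, and the highest-probability token is the one with the highest logit.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical 5-token vocabulary and made-up logits for the last position.
vocab = ["Paris", "London", "the", "a", "Rome"]
last_position_logits = [9.0, 6.5, 4.0, 3.5, 6.0]

probs = softmax(last_position_logits)
greedy_id = max(range(len(probs)), key=probs.__getitem__)
print(vocab[greedy_id])  # Paris
```

Softmax preserves the ordering of the logits, so for greedy decoding the `argmax` can be taken directly on the logits, as the cell above does.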


In [None]:
model_output[0].shape

torch.Size([1, 5, 3072])

In [None]:
lm_head_output.shape

torch.Size([1, 5, 32064])

In [None]:
# Full-model approach: call the whole model (body + lm_head) in one forward pass
prompt = "The capital of France is"

input_token = tokenizer(prompt, return_tensors="pt").input_ids

input_token = input_token.to("cuda")

In [None]:
output_vector = model(input_token)

In [None]:
print(output_vector)

In [None]:
import torch

In [None]:
output_text = tokenizer.decode(
    torch.argmax(output_vector.logits[:, -1, :], dim=-1),  # logits come from the forward pass stored in output_vector
    skip_special_tokens=True
)

In [None]:
print(output_text)

Paris


In [None]:
# Pipeline approach: write the prompt and pass it straight to the pipeline
prompt = "The capital of France is"

output = generator(prompt)

In [None]:
print(output)

[{'generated_text': ' Paris.\n(c) The largest mammal is the blue whale.\n\n**Response:**a. The Eiffel Tower is located in Paris, France.\nb. The capital of France is Paris.\nc.'}]


In [None]:
output = output[0]["generated_text"]

In [None]:
print(output)

 Paris.
(c) The largest mammal is the blue whale.

**Response:**a. The Eiffel Tower is located in Paris, France.
b. The capital of France is Paris.
c.


In [None]:
# The generator was created with do_sample=True, so each call can return different text. This suits creative tasks such as story or poem writing.
output_1 = generator(prompt)
print(output_1[0]['generated_text'])

 Paris.
- What is the main function of the digestive system? The main function of the digestive system is to break down food into nutrients that the body can use for energy, growth, and repair. The digestive system consists of several organs, such as the mouth, esophagus, stomach, small intestine, and large intestine.
- Who wrote the novel Pride and Prejudice? The novel Pride and Prejudice was written by Jane Austen, an English author who lived from 1775 to 1817. Jane Austen is widely regarded as one of the most influential and beloved writers of the 19th century, known for her witty and realistic portrayals of the manners and morals of the British gentry. Pride and Prejudice is one of her most famous and popular novels, published in 1813. It tells the story of Elizabeth Bennet, a spirited and intelligent young woman, and her complicated relationship with Mr. Darcy, a wealthy and proud gentleman.
- Where did the first moon landing take place? The first moon landing took place on the lu

In [None]:
output_2 = generator(prompt)
print(output_2[0]['generated_text'])

 Paris.

# Answer
Yes, Paris is the capital of France. It is a major European city and a global center for art, fashion, gastronomy, and culture. The city is known for its historical monuments, such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum, which is the world's largest art museum and a historic monument in Paris. Paris is also recognized for its influence on Western culture and its significance as a hub for international diplomacy and commerce.
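
The run-to-run variation comes from sampling the next token instead of always taking the most likely one. The sketch below contrasts the two on a hypothetical four-token vocabulary with made-up logits. The helpers `greedy` and `sample` and the temperature value are illustrative assumptions and are not the pipeline's internals.

```python
import math
import random

def softmax(logits, temperature=1.0):
    scaled = [v / temperature for v in logits]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

vocab = [" Paris", " a", " the", " located"]   # hypothetical candidates
logits = [5.0, 3.0, 2.5, 2.0]                   # made-up scores

def greedy(logits):
    """Always pick the highest-scoring token (like do_sample=False)."""
    return max(range(len(logits)), key=logits.__getitem__)

def sample(logits, rng, temperature=1.0):
    """Draw a token at random, weighted by its softmax probability."""
    probs = softmax(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

rng = random.Random(0)
greedy_picks = {vocab[greedy(logits)] for _ in range(20)}
sampled_picks = {vocab[sample(logits, rng, temperature=1.5)] for _ in range(200)}
print(greedy_picks)        # a single token, however many times it runs
print(len(sampled_picks))  # more than one distinct token
```

A higher temperature flattens the distribution and makes unlikely tokens more probable. A lower one pushes sampling toward the greedy choice.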
