**💡 NOTE:** We will want to use a GPU to run the examples in this notebook. In Google Colab, go to **Runtime > Change runtime type > Hardware accelerator > GPU > GPU type > T4.**

In [16]:
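%%capture
# 1. Install Dependencies
# %%capture: Suppresses the cell output (e.g., installation logs). A cell magic must be the first line of the cell.
# transformers: For working with pre-trained models from Hugging Face.
# accelerate: Handles device placement when loading models onto hardware like GPUs (needed for device_map).
# The version specifiers are quoted so the shell does not treat ">" as an output redirect.
!pip install "transformers>=4.41.2" "accelerate>=0.31.0"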
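
A quick way to confirm the installed versions meet the minimums above is sketched below. The helper names (`version_tuple`, `meets_minimum`, `check_installed`) are hypothetical, and the comparison only looks at the leading numeric parts of a version string, so treat it as a rough check rather than a full version parser.

```python
import re
from importlib import metadata

# Minimum versions taken from the pip command above.
MINIMUMS = {"transformers": "4.41.2", "accelerate": "0.31.0"}

def version_tuple(version):
    """Turn '4.41.2' (or '4.42.0.dev0') into a tuple of its leading integers."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at the first non-numeric component such as 'dev0'
        parts.append(int(match.group()))
    return tuple(parts)

def meets_minimum(installed, minimum):
    """Compare two version strings by their numeric components."""
    return version_tuple(installed) >= version_tuple(minimum)

def check_installed(minimums=MINIMUMS):
    """Report, per package, whether it is installed and recent enough."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed"
            continue
        if meets_minimum(installed, minimum):
            report[name] = "ok"
        else:
            report[name] = f"{installed} < {minimum}"
    return report

print(check_installed())
```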

In [17]:
# 2. Import Required Modules

from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# Purpose:
# AutoModelForCausalLM: Loads a causal language model (designed for text generation).
# AutoTokenizer: Loads the tokenizer corresponding to the model.
# pipeline: Simplifies using Hugging Face models for various tasks like text generation, sentiment analysis, etc.


In [18]:
# 3. Load the Tokenizer
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

# Purpose:
# Load the tokenizer for the Phi-3-mini-4k-instruct model from the Hugging Face repository.
# The tokenizer converts input text into numerical tokens (IDs) for the model and converts output token IDs back into text.
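
To make the text-to-IDs round trip concrete, here is a minimal sketch with a made-up word-level vocabulary. `TOY_VOCAB`, `encode`, and `decode` are hypothetical stand-ins; the real tokenizer works on subword pieces and has a much larger vocabulary.

```python
# Hypothetical toy vocabulary (word-level); not the real Phi-3 vocabulary.
TOY_VOCAB = {"<unk>": 0, "The": 1, "capital": 2, "of": 3, "France": 4, "is": 5, "Paris": 6}
ID_TO_TOKEN = {token_id: token for token, token_id in TOY_VOCAB.items()}

def encode(text):
    """Map each whitespace-separated word to its ID; unknown words map to <unk>."""
    return [TOY_VOCAB.get(word, TOY_VOCAB["<unk>"]) for word in text.split()]

def decode(ids):
    """Map IDs back to words and join them with spaces."""
    return " ".join(ID_TO_TOKEN[token_id] for token_id in ids)

ids = encode("The capital of France is")
print(ids)          # [1, 2, 3, 4, 5]
print(decode(ids))  # The capital of France is
```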

In [19]:
# 4. Load the Pre-trained Model
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
)

# Purpose:
# Load the pre-trained Phi-3-mini-4k-instruct model, which is optimized for causal language tasks (e.g., text generation).
# Parameters:
# device_map="cuda": Ensures the model runs on the GPU for faster processing.
# torch_dtype="auto": Loads the weights in the precision recorded in the checkpoint (typically a 16-bit format) instead of defaulting to FP32.
# trust_remote_code=True: Allows execution of custom model code from the repository.

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [20]:
# 5. Create a Text-Generation Pipeline
generator = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    return_full_text=False,
    max_new_tokens=50,
    do_sample=False,
)

**Purpose:**
Creates a pipeline for text generation, which integrates the model and tokenizer into a simple interface.

**Parameters:**
- "text-generation": Specifies the task type as text generation.
- model=model: Uses the loaded Phi-3-mini-4k-instruct model.
- tokenizer=tokenizer: Uses the loaded tokenizer for tokenizing input and decoding output.
- return_full_text=False: Ensures the output only includes the generated text, not the input prompt.
- max_new_tokens=50: Limits the output to at most 50 newly generated tokens (the prompt tokens are not counted).
- do_sample=False: Disables randomness, making the output deterministic (greedy decoding).
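
The difference between greedy decoding (`do_sample=False`) and sampling can be sketched on a toy score vector. The logits and helper names below are made up for illustration; the pipeline performs the equivalent choice internally at every generation step.

```python
import math
import random

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def pick_greedy(logits):
    """Greedy decoding: always take the index with the highest score."""
    return max(range(len(logits)), key=lambda i: logits[i])

def pick_sampled(logits, rng):
    """Sampling: draw an index at random, weighted by its probability."""
    return rng.choices(range(len(logits)), weights=softmax(logits), k=1)[0]

toy_logits = [2.0, 0.5, 1.0]  # hypothetical scores for three candidate tokens
rng = random.Random(0)

greedy_picks = {pick_greedy(toy_logits) for _ in range(200)}
sampled_picks = {pick_sampled(toy_logits, rng) for _ in range(200)}

print("greedy choices: ", sorted(greedy_picks))   # always the same index
print("sampled choices:", sorted(sampled_picks))  # varies from draw to draw
```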

In [21]:
# 6. Pipeline Functionality
# Input: You can pass text prompts directly to the generator object. For example:

output = generator("Explain the concept of AI in simple terms.")
print(output)

# Output: The model will generate a continuation of the provided prompt, up to max_new_tokens tokens.

[{'generated_text': ' AI, or Artificial Intelligence, refers to the creation of machines or software that can perform tasks that typically require human intelligence. These tasks include learning, problem-solving, recognizing patterns, understanding language, and making decisions.'}]


**Benefits of Using pipeline**
- Simplified API: Abstracts away tokenization and model inference complexities.
- Flexibility: Supports various tasks beyond text generation, such as sentiment analysis, translation, etc.
- Customizability: Lets you adjust generation behavior with parameters like max_new_tokens and do_sample.

This setup is now ready to generate text efficiently with the Phi-3-mini-4k-instruct model!

In [22]:
# 7. Set a Prompt and Generate Text

prompt = "Write an email apologizing to Sarah for the tragic gardening mishap. Explain how it happened."

output = generator(prompt)

print(output[0]['generated_text'])

# Purpose: Generates text based on the prompt using the previously created generator pipeline.
# Output: The generated text is printed, showing the model's response to the prompt.
# Note: The response below stops mid-sentence because max_new_tokens=50 caps the generated length.

 Mention the steps you're taking to prevent it in the future.

Dear Sarah,

I hope this message finds you well. I am writing to express my deepest apologies for the unfortunate incident that occurred in


In [23]:
print(model)

Phi3ForCausalLM(
  (model): Phi3Model(
    (embed_tokens): Embedding(32064, 3072, padding_idx=32000)
    (embed_dropout): Dropout(p=0.0, inplace=False)
    (layers): ModuleList(
      (0-31): 32 x Phi3DecoderLayer(
        (self_attn): Phi3Attention(
          (o_proj): Linear(in_features=3072, out_features=3072, bias=False)
          (qkv_proj): Linear(in_features=3072, out_features=9216, bias=False)
          (rotary_emb): Phi3RotaryEmbedding()
        )
        (mlp): Phi3MLP(
          (gate_up_proj): Linear(in_features=3072, out_features=16384, bias=False)
          (down_proj): Linear(in_features=8192, out_features=3072, bias=False)
          (activation_fn): SiLU()
        )
        (input_layernorm): Phi3RMSNorm()
        (resid_attn_dropout): Dropout(p=0.0, inplace=False)
        (resid_mlp_dropout): Dropout(p=0.0, inplace=False)
        (post_attention_layernorm): Phi3RMSNorm()
      )
    )
    (norm): Phi3RMSNorm()
  )
  (lm_head): Linear(in_features=3072, out_features=3206
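
The layer shapes printed above are enough for a back-of-the-envelope parameter count. The sketch below assumes each RMSNorm holds one weight vector of size 3072, that the rotary embedding and dropout layers have no learned parameters, that lm_head is not weight-tied to the embedding, and that its (truncated) output size equals the embedding vocabulary size of 32064.

```python
HIDDEN = 3072
VOCAB = 32064   # embedding rows; assumed to equal the lm_head output size
LAYERS = 32

def linear(in_features, out_features):
    """Parameter count of a Linear layer with bias=False."""
    return in_features * out_features

per_layer = (
    linear(HIDDEN, 3072)      # o_proj
    + linear(HIDDEN, 9216)    # qkv_proj
    + linear(HIDDEN, 16384)   # gate_up_proj
    + linear(8192, HIDDEN)    # down_proj
    + 2 * HIDDEN              # input_layernorm + post_attention_layernorm (assumed one vector each)
)

total = (
    VOCAB * HIDDEN            # embed_tokens
    + LAYERS * per_layer      # 32 decoder layers
    + HIDDEN                  # final norm
    + linear(HIDDEN, VOCAB)   # lm_head
)

print(f"parameters per decoder layer: {per_layer:,}")
print(f"total parameters:             {total:,}")
for name, bytes_per_param in [("16-bit", 2), ("32-bit", 4)]:
    print(f"{name} weights: about {total * bytes_per_param / 2**30:.1f} GiB")
```

Under these assumptions the total comes out near 3.8 billion parameters, which is why a 16-bit precision matters for fitting the model on a T4.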

In [24]:
# Tokenization of a Prompt
prompt = "The capital of France is"

# Tokenize the input prompt
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
input_ids = input_ids.to("cuda")

# Purpose: Converts the prompt into token IDs and moves them to the GPU for processing.
# return_tensors="pt": Ensures the output is in PyTorch tensor format.

In [25]:
# Accessing Model Outputs
# Get the output of the model before the lm_head
model_output = model.model(input_ids)

# Get the output of the lm_head
lm_head_output = model.lm_head(model_output[0])

# Purpose:
# model.model(input_ids): Runs the Transformer stack and returns the final hidden states (one vector per input token) before the language model head (lm_head) is applied.
# model.lm_head(model_output[0]): Applies the final layer (lm_head) to turn each hidden state into logits over the vocabulary.
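
What lm_head does to the shapes can be sketched with plain lists: every hidden vector is multiplied against one weight row per vocabulary entry. The sizes and values below are toy numbers chosen for illustration, and `project` is a hypothetical helper, not part of the library.

```python
def project(hidden_states, weight):
    """Bias-free linear layer: logits[t][v] = dot(hidden_states[t], weight[v])."""
    return [
        [sum(h * w for h, w in zip(vector, row)) for row in weight]
        for vector in hidden_states
    ]

# Toy sizes standing in for (sequence length, hidden size, vocabulary size).
seq_len, hidden, vocab = 5, 4, 6

# One hidden vector per position.
hidden_states = [[0.1 * (t + d) for d in range(hidden)] for t in range(seq_len)]

# Weight is stored as [vocab, hidden], one row per vocabulary entry.
lm_head_weight = [
    [1.0 if d == v % hidden else 0.0 for d in range(hidden)]
    for v in range(vocab)
]

logits = project(hidden_states, lm_head_weight)

print(len(hidden_states), len(hidden_states[0]))  # 5 4
print(len(logits), len(logits[0]))                # 5 6
```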


In [26]:
# Token Decoding

token_id = lm_head_output[0, -1].argmax(-1)
tokenizer.decode(token_id)

# Purpose:
# lm_head_output[0, -1]: Selects the logits at the last position in the sequence, which score the next token.
# .argmax(-1): Identifies the token ID with the highest logit (and therefore the highest probability).
# tokenizer.decode(token_id): Converts the predicted token ID back into a human-readable word.

'Paris'
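
Text generation repeats this single-step prediction: pick the next token, append it, and predict again. The loop below is a rough sketch of that idea using a hypothetical lookup table (`TOY_LOGITS`) in place of the real model, so the scores and tokens are made up.

```python
# Hypothetical scores: the "model" here only looks at the last token.
TOY_LOGITS = {
    "is": {"Paris": 3.0, "a": 1.0, ".": 0.1},
    "Paris": {".": 2.5, "is": 0.2, "a": 0.1},
    ".": {"<eos>": 4.0, "Paris": 0.1, "a": 0.1},
    "a": {".": 1.0, "Paris": 0.5, "is": 0.2},
}

def predict_next(tokens):
    """Return the highest-scoring next token given the sequence so far."""
    scores = TOY_LOGITS[tokens[-1]]
    return max(scores, key=scores.get)

def generate_greedy(prompt_tokens, max_new_tokens):
    """Append one predicted token per step until the limit or <eos> is reached."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        next_token = predict_next(tokens)
        if next_token == "<eos>":
            break
        tokens.append(next_token)
    return tokens

prompt_tokens = ["The", "capital", "of", "France", "is"]
print(generate_greedy(prompt_tokens, max_new_tokens=1))
print(generate_greedy(prompt_tokens, max_new_tokens=5))
```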

In [27]:
model_output[0].shape

torch.Size([1, 5, 3072])

In [28]:
lm_head_output.shape

torch.Size([1, 5, 32064])

In [29]:
# Generating a Longer Text

prompt = "Write a very long email apologizing to Sarah for the tragic gardening mishap. Explain how it happened."

# Tokenize the input prompt
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
input_ids = input_ids.to("cuda")

# Purpose: Prepares the input tokens for generating a longer response


In [31]:
# Timing Text Generation with time.time()
import time

# Timing with `use_cache=True`
start_time = time.time()
generation_output = model.generate(
    input_ids=input_ids,
    max_new_tokens=100,
    use_cache=True
)
end_time = time.time()
print(f"Execution Time with Cache: {end_time - start_time:.4f} seconds")

# Timing with `use_cache=False`
start_time = time.time()
generation_output = model.generate(
    input_ids=input_ids,
    max_new_tokens=100,
    use_cache=False
)
end_time = time.time()
print(f"Execution Time without Cache: {end_time - start_time:.4f} seconds")


Execution Time with Cache: 7.5783 seconds
Execution Time without Cache: 28.1427 seconds
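
A single timed run can be noisy. One possible refinement, sketched here with a hypothetical `time_call` helper, is to repeat the call a few times and keep the fastest run. The workload in the example is a plain Python sum so the block runs without a GPU.

```python
import time

def time_call(fn, repeats=3):
    """Call fn() several times; return (fastest_seconds, last_result)."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()  # monotonic, high-resolution clock
        result = fn()
        best = min(best, time.perf_counter() - start)
    return best, result

# Stand-in workload; any zero-argument callable works.
seconds, value = time_call(lambda: sum(range(100_000)))
print(f"Best of 3 runs: {seconds:.6f} seconds (result={value})")
```

In the notebook, the same helper could wrap the generation call, for example `time_call(lambda: model.generate(input_ids=input_ids, max_new_tokens=100, use_cache=True))`.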


**Purpose:**
- time.time(): Records a timestamp before and after each generate call; the difference is the execution time.
- use_cache=True:
  - Enables caching of the attention key-value pairs so that each decoding step only has to process the newest token.
  - Suitable for long text generation tasks.
- use_cache=False:
  - Disables caching, so every step recomputes attention inputs for the whole sequence; this is slower but uses less memory.
- Comparison: In the run above, caching reduced generation time from about 28.1 seconds to about 7.6 seconds (roughly 3.7x faster).
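
A rough way to see why caching helps is to count how many token positions the model has to process in total. The sketch below is a simplified cost model with a hypothetical prompt length of 25 tokens; it ignores per-step overheads, so the measured wall-clock gain is usually smaller than this count suggests.

```python
def tokens_processed(prompt_len, new_tokens, use_cache):
    """Count token positions fed through the model across all decoding steps."""
    total = 0
    seq_len = prompt_len
    for step in range(new_tokens):
        if use_cache and step > 0:
            total += 1        # cached keys/values: only the newest token is processed
        else:
            total += seq_len  # no cache (or first step): the whole sequence is processed
        seq_len += 1          # the generated token extends the sequence
    return total

prompt_len, new_tokens = 25, 100  # prompt length is an assumed example value
with_cache = tokens_processed(prompt_len, new_tokens, use_cache=True)
without_cache = tokens_processed(prompt_len, new_tokens, use_cache=False)

print(f"with cache:    {with_cache} token positions")     # 124
print(f"without cache: {without_cache} token positions")  # 7450
```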

#### Summary of the Code
1. Generates text using both a pipeline and direct model calls.
2. Demonstrates tokenization, model output inspection, and decoding.
3. Explores the performance of text generation with and without caching.
4. Provides insight into internal operations of a Hugging Face model, such as logits extraction and token prediction.

