Copyright (C) 2025 Advanced Micro Devices, Inc. All rights reserved.
Portions of this notebook consist of AI-generated content.

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

# Lab 1: Hello World with LLaMA - Introduction to Large Language Models

## Lab Overview

This lab introduces you to working with Large Language Models (LLMs) using the LLaMA architecture. You'll learn how to load a pre-trained model, tokenize text, and generate responses.

## Learning Objectives

By the end of this lab, you will:
- Understand how to load and use pre-trained LLMs
- Learn about tokenization and text preprocessing
- Experience text generation with transformer models
- Understand the basic workflow of LLM inference

---

## 1. Environment Setup

We'll configure our environment for LLM inference on an AMD GPU. ROCm builds of PyTorch expose AMD GPUs through the `torch.cuda` API, so the code below reports the device as `cuda`.

In [None]:
# Core libraries for LLM inference
import warnings

import torch
from transformers import AutoTokenizer, GPT2LMHeadModel, GPT2Tokenizer, LlamaForCausalLM

warnings.filterwarnings("ignore")

# Configure device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")
print(f"PyTorch version: {torch.__version__}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

AMD GPU environment initialized successfully
Using device: cuda
PyTorch version: 2.6.0+gitdbfe118
GPU: AMD Radeon Graphics
GPU Memory: 68.7 GB

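The device name alone does not say which GPU backend PyTorch was built against. The following is a minimal sketch for checking it, assuming that `torch.version.hip` holds a version string on ROCm builds and is `None` otherwise; the helper name `describe_backend` is hypothetical and not part of the lab code.

```python
def describe_backend() -> str:
    """Return a short label for the GPU backend PyTorch was built against."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed"

    # Assumption: torch.version.hip is set only on ROCm builds.
    hip_version = getattr(torch.version, "hip", None)
    if hip_version is not None:
        return f"ROCm/HIP build ({hip_version})"

    # torch.version.cuda is None on CPU-only builds.
    cuda_version = getattr(torch.version, "cuda", None)
    if cuda_version is not None:
        return f"CUDA build ({cuda_version})"
    return "CPU-only build"


print(describe_backend())
```

On an AMD system this is expected to report a ROCm/HIP build even though the device string is `cuda`.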

## 2. Model Loading and Configuration

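Now we'll load a pre-trained model built on the LLaMA architecture. This lab uses `TinyLlama/TinyLlama-1.1B-Chat-v1.0`, a publicly available model with about 1.1 billion parameters that is small enough to download quickly while still demonstrating the core concepts.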

**Model Loading Process:**
1. **Model Architecture**: Load the LLaMA model with causal language modeling head
2. **Tokenizer**: Load the corresponding tokenizer for text preprocessing
3. **GPU Transfer**: Move model to AMD GPU for efficient inference

**Note**: In a production environment, you would typically load a larger LLaMA-family checkpoint; the loading steps stay the same. If TinyLlama cannot be loaded, the cell below falls back to GPT-2.

In [11]:
# Load LLaMA model (using a smaller model for demonstration)
# Note: Replace with actual model path or use a publicly available model
try:
    # For this demo, we'll use a smaller compatible model
    # In practice, you would use the official LLaMA model path
    model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # Using a smaller Llama model for demonstration

    print("Loading model...")
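    model = LlamaForCausalLM.from_pretrained(
        model_name,
        # Half precision on the GPU; full precision on CPU, where float16 is poorly supported
        torch_dtype=torch.float16 if device.type == "cuda" else torch.float32,
    )

    # Move model to the selected device. device_map="auto" is not used here:
    # it hands placement to Accelerate, which makes an explicit .to() call
    # redundant and can conflict with it.
    model = model.to(device)
    model.eval()  # Set to evaluation mode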

    print(f"Model loaded successfully on {device}")
    print(f"Model parameters: {sum(p.numel() for p in model.parameters()):,}")

except Exception as e:
    print(f"Model loading failed: {e}")
    print("Using a fallback smaller model for demonstration...")

    # Fallback to a smaller model that's publicly available
    # (GPT2LMHeadModel is already imported in the setup cell)
    model_name = "gpt2"

    model = GPT2LMHeadModel.from_pretrained(model_name)
    model = model.to(device)
    model.eval()
    print("Fallback model loaded successfully")

Loading model...


config.json:   0%|          | 0.00/608 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/2.20G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

Model loaded successfully on cuda
Model parameters: 1,100,048,384

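The parameter count printed above gives a quick way to estimate how much memory the weights need at a given precision. The sketch below is a back-of-the-envelope calculation, not a measurement: the helper name `weight_memory_gb` is hypothetical, and the estimate covers weights only (activations and the key/value cache used during generation are excluded).

```python
# Bytes used per parameter for common dtypes (weights only).
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2, "int8": 1}


def weight_memory_gb(num_params: int, dtype: str) -> float:
    """Estimate the memory needed to hold the weights, in GB (1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


num_params = 1_100_048_384  # parameter count printed by the loading cell
for dtype in ("float32", "float16"):
    print(f"{dtype}: {weight_memory_gb(num_params, dtype):.2f} GB")
```

For `float16` this gives about 2.20 GB, which is consistent with the size of `model.safetensors` in the download log above.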

In [12]:
# Load tokenizer corresponding to the model
try:
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    # Set padding token if not available
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    print("Tokenizer loaded successfully")
    print(f"Vocabulary size: {len(tokenizer)}")
    print(f"Special tokens: EOS={tokenizer.eos_token}, PAD={tokenizer.pad_token}")

except Exception as e:
    print(f"Tokenizer loading failed: {e}")
    print("Using fallback tokenizer...")

    # Fallback tokenizer (GPT2Tokenizer is already imported in the setup cell)
    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    tokenizer.pad_token = tokenizer.eos_token
    print("Fallback tokenizer loaded successfully")

tokenizer_config.json: 0.00B [00:00, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/551 [00:00<?, ?B/s]

Tokenizer loaded successfully
Vocabulary size: 32000
Special tokens: EOS=</s>, PAD=</s>


## 3. Text Tokenization and Processing

Now we'll process input text through tokenization. This converts human-readable text into numerical tokens that the model can understand.

**Tokenization Process:**
1. **Text Input**: Raw text string
2. **Encoding**: Convert text to token IDs using the tokenizer
3. **Tensor Conversion**: Convert to PyTorch tensors
4. **Device Transfer**: Move tensors to AMD GPU

**Key Components:**
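- **input_ids**: Numerical representation of text tokens
- **attention_mask**: Indicates which tokens should be attended to (1) and which are padding to be ignored (0)
- **return_tensors**: Format of returned data (`"pt"` for PyTorch tensors)

**Note**: Token IDs depend on which tokenizer was loaded. The sample output below shows `Ġ`-prefixed tokens, which is how GPT-2's byte-level BPE tokenizer marks a leading space. LLaMA-family SentencePiece tokenizers use `▁` instead and prepend a beginning-of-sequence token, so your IDs will differ when TinyLlama is in use.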

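To make the roles of `input_ids` and `attention_mask` concrete, here is a toy sketch of encoding a batch of two texts with padding. The vocabulary and the helper `encode_batch` are hypothetical; real tokenizers use learned subword vocabularies rather than whitespace splitting, but the padding and masking logic follows the same idea.

```python
from __future__ import annotations

# Hypothetical toy vocabulary for illustration only.
VOCAB = {"<pad>": 0, "<unk>": 1, "hello": 2, "world": 3, "llama": 4}


def encode_batch(texts: list[str]) -> dict[str, list[list[int]]]:
    """Whitespace-tokenize, map words to IDs, and right-pad to the longest sequence."""
    ids = [[VOCAB.get(w, VOCAB["<unk>"]) for w in t.lower().split()] for t in texts]
    max_len = max(len(seq) for seq in ids)
    input_ids, attention_mask = [], []
    for seq in ids:
        pad = max_len - len(seq)
        input_ids.append(seq + [VOCAB["<pad>"]] * pad)      # pad with the <pad> ID
        attention_mask.append([1] * len(seq) + [0] * pad)   # 1 = real token, 0 = padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = encode_batch(["hello world", "hello big llama world"])
print(batch["input_ids"])       # [[2, 3, 0, 0], [2, 1, 4, 3]]
print(batch["attention_mask"])  # [[1, 1, 0, 0], [1, 1, 1, 1]]
```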
In [None]:
# Process input text through tokenization
prompt = "Hello! Please introduce yourself and explain what you can do."

# Use English prompt for better compatibility with most models
selected_prompt = prompt

print("Original text:")
print(f"'{selected_prompt}'")
print()

# Tokenize the input
input_data = tokenizer(selected_prompt, return_tensors="pt", padding=True, truncation=True, max_length=512)

# Move to AMD GPU
input_ids = input_data["input_ids"].to(device)
attention_mask = input_data["attention_mask"].to(device)

print("Tokenized input:")
print(f"Input IDs shape: {input_ids.shape}")
print(f"Input IDs: {input_ids}")
print(f"Attention mask: {attention_mask}")
print()

# Show token-to-text mapping
print("Token breakdown:")
tokens = tokenizer.convert_ids_to_tokens(input_ids[0])
for i, (token_id, token) in enumerate(zip(input_ids[0], tokens)):
    print(f"  {i:2d}: {token_id:5d} -> '{token}'")

Original text:
'Hello! Please introduce yourself and explain what you can do.'

Tokenized input:
Input IDs shape: torch.Size([1, 12])
Input IDs: tensor([[15496,     0,  4222, 10400,  3511,   290,  4727,   644,   345,   460,
           466,    13]], device='cuda:0')
Attention mask: tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], device='cuda:0')

Token breakdown:
   0: 15496 -> 'Hello'
   1:     0 -> '!'
   2:  4222 -> 'ĠPlease'
   3: 10400 -> 'Ġintroduce'
   4:  3511 -> 'Ġyourself'
   5:   290 -> 'Ġand'
   6:  4727 -> 'Ġexplain'
   7:   644 -> 'Ġwhat'
   8:   345 -> 'Ġyou'
   9:   460 -> 'Ġcan'
  10:   466 -> 'Ġdo'
  11:    13 -> '.'


## 4. Text Generation

Now we'll use the model to generate text based on our input prompt. The model will predict the next tokens autoregressively.

**Generation Process:**
1. **Input Processing**: Feed tokenized input to the model
2. **Forward Pass**: Model computes probability distributions over vocabulary
3. **Token Sampling**: Select next tokens based on probabilities
4. **Autoregressive Generation**: Repeat process for multiple tokens

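The four steps above can be sketched with a toy stand-in for the model. Everything here is hypothetical: `NEXT_TOKEN_PROBS` is a hand-written lookup table that conditions only on the last token, whereas a real LLM computes probabilities from the whole sequence. The loop structure (predict, pick a token, append, repeat until EOS or the token budget) is the part that carries over.

```python
import random

# Hypothetical toy "model": maps the last token to next-token probabilities.
NEXT_TOKEN_PROBS = {
    "<s>": {"hello": 0.9, "world": 0.1},
    "hello": {"world": 0.8, "</s>": 0.2},
    "world": {"</s>": 1.0},
}


def generate(prompt, max_new_tokens, do_sample, seed=0):
    """Append one token at a time until EOS or the token budget is reached."""
    rng = random.Random(seed)
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        probs = NEXT_TOKEN_PROBS[tokens[-1]]  # stands in for the forward pass
        if do_sample:
            # Sampling: draw the next token in proportion to its probability.
            next_token = rng.choices(list(probs), weights=list(probs.values()))[0]
        else:
            # Greedy decoding: always take the most likely token.
            next_token = max(probs, key=probs.get)
        tokens.append(next_token)
        if next_token == "</s>":  # stop at end-of-sequence
            break
    return tokens


print(generate(["<s>"], max_new_tokens=5, do_sample=False))
# ['<s>', 'hello', 'world', '</s>']
```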
**Generation Parameters:**
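- **max_new_tokens**: Maximum number of tokens to generate beyond the prompt. `max_length` instead caps the total length (prompt plus generated tokens); when both are set, `max_new_tokens` takes precedence
- **temperature**: Controls randomness when sampling (lower = more deterministic)
- **do_sample**: Whether to sample from the probability distribution (`True`) or use greedy decoding (`False`)
- **pad_token_id**: Token used for padding sequences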

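The effect of `temperature` can be seen by applying it to a small set of logits before the softmax. This is a sketch of the standard temperature-scaled softmax, with made-up logits for three candidate tokens; the function name `softmax_with_temperature` is hypothetical. Lower temperatures concentrate probability on the top token, while higher ones flatten the distribution.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]  # hypothetical scores for three candidate tokens
for t in (0.1, 0.7, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(f"temperature={t}: {[round(p, 3) for p in probs]}")
```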
In [5]:
# Generate text using the model
print("Generating text on AMD GPU...")

with torch.no_grad():  # Disable gradient computation for inference
    try:
        # Generate text with controlled parameters
        generated_ids = model.generate(
            input_ids,
            attention_mask=attention_mask,
            max_new_tokens=50,  # Maximum new tokens to generate (prompt not counted)
            temperature=0.7,  # Control randomness
            do_sample=True,  # Use sampling instead of greedy
            pad_token_id=tokenizer.pad_token_id,
            eos_token_id=tokenizer.eos_token_id,
            repetition_penalty=1.1,  # Reduce repetition
            no_repeat_ngram_size=3,  # Avoid repeating 3-grams
        )

        print("Text generation completed!")
        print(f"Generated tensor shape: {generated_ids.shape}")
        print(f"Generated IDs: {generated_ids}")

    except Exception as e:
        print(f"Generation failed: {e}")
        # Fallback to simpler greedy generation
        generated_ids = model.generate(
            input_ids,
            attention_mask=attention_mask,
            max_length=60,  # Total length (prompt + generated)
            pad_token_id=tokenizer.pad_token_id,
        )
        print("Fallback generation completed!")



Generating text on AMD GPU...
Text generation completed!
Generated tensor shape: torch.Size([1, 62])
Generated IDs: tensor([[15496,     0,  4222, 10400,  3511,   290,  4727,   644,   345,   460,
           466,    13,   314,  1183,  1826,   351,   257,  1178,   286,   616,
          7810,   379,   262,  2059,   357,   732,   869,   606,   705,   464,
          4380, 33809,   508,   481,   307,  1498,   284,  1037,   514,   503,
            11,   655,   416,  4737,  2683,   546,  2972,  1243,   326,   356,
           892,   389,  1593,   329,  4152,  2444,   553,   339,   531,    11,
          4375,   326]], device='cuda:0')


## 5. Output Decoding and Analysis

Finally, we'll convert the generated token IDs back to human-readable text and analyze the results.

**Decoding Process:**
1. **Token to Text**: Convert generated IDs back to text using tokenizer
2. **Special Token Handling**: Remove or handle special tokens appropriately
3. **Post-processing**: Clean up the output text
4. **Analysis**: Examine the generation quality and characteristics

**Key Considerations:**
- **skip_special_tokens**: Whether to remove special tokens from output
- **clean_up_tokenization_spaces**: Handle tokenization artifacts
- **Batch processing**: Handle multiple sequences if applicable

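As a simplified illustration of decoding, the sketch below reverses the GPT-2 style token breakdown shown in Section 3. The helper `simple_decode` is hypothetical and handles only the `Ġ` leading-space marker and one special token; a real tokenizer's `decode` also maps byte-level symbols back to arbitrary UTF-8 text.

```python
# Tokens copied from the "Token breakdown" output in Section 3.
tokens = ["Hello", "!", "ĠPlease", "Ġintroduce", "Ġyourself", "Ġand",
          "Ġexplain", "Ġwhat", "Ġyou", "Ġcan", "Ġdo", "."]
SPECIAL_TOKENS = {"<|endoftext|>"}  # GPT-2's end-of-text token


def simple_decode(tokens, skip_special_tokens=True):
    """Join byte-level BPE tokens and turn the 'Ġ' marker back into a space."""
    if skip_special_tokens:
        tokens = [t for t in tokens if t not in SPECIAL_TOKENS]
    return "".join(tokens).replace("Ġ", " ")


text = simple_decode(tokens + ["<|endoftext|>"])
print(text)  # Hello! Please introduce yourself and explain what you can do.
```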
In [6]:
# Decode the generated tokens back to text
print("Decoding generated text...")

# Decode the full sequence (input + generated)
full_output = tokenizer.batch_decode(generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True)

# Extract only the generated part (remove input prompt)
input_length = input_ids.shape[1]
generated_only_ids = generated_ids[:, input_length:]

generated_text = tokenizer.batch_decode(generated_only_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True)

print("Results:")
print("=" * 50)
print(f"Original prompt: '{selected_prompt}'")
print("=" * 50)
print(f"Full output: {full_output[0]}")
print("=" * 50)
print(f"Generated text only: '{generated_text[0]}'")
print("=" * 50)

# Analysis
print("\nGeneration Analysis:")
print(f"Input tokens: {input_length}")
print(f"Generated tokens: {generated_only_ids.shape[1]}")
print(f"Total tokens: {generated_ids.shape[1]}")
print(f"Generated text length: {len(generated_text[0])} characters")

# Performance info
print("\nPerformance:")
print(f"Device used: {device}")
print(f"Model dtype: {next(model.parameters()).dtype}")
print("Lab 1 completed successfully!")

Decoding generated text...
Results:
Original prompt: 'Hello! Please introduce yourself and explain what you can do.'
Full output: Hello! Please introduce yourself and explain what you can do. I'll meet with a few of my colleagues at the University (we call them 'The People'), who will be able to help us out, just by asking questions about various things that we think are important for college students," he said, adding that
Generated text only: ' I'll meet with a few of my colleagues at the University (we call them 'The People'), who will be able to help us out, just by asking questions about various things that we think are important for college students," he said, adding that'

Generation Analysis:
Input tokens: 12
Generated tokens: 50
Total tokens: 62
Generated text length: 236 characters

Performance:
Device used: cuda
Model dtype: torch.float32
Lab 1 completed successfully!


## Lab Summary

### Technical Concepts Learned
- **Transformer Architecture**: Understanding of autoregressive text generation
- **Tokenization Process**: Text to token conversion and vice versa
- **Model Loading**: Loading and configuring pre-trained LLMs
- **Text Generation**: Generating coherent text using sampling strategies
- **Model Inference Pipeline**: Complete workflow from input to output

### Experiment Further
- Change the `temperature` value (0.1 to 2.0) to control randomness
- Adjust `max_new_tokens` for longer/shorter outputs
- Experiment with different prompts
- Try a prompt in another language (for example, Chinese) to probe the model's multilingual capabilities