# Phi-3 Instruct Open Model

Phi-3-Mini-4K-Instruct is a 3.8B-parameter model with a 4K-token context length. The [model](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct#model) is a dense decoder-only Transformer that is fine-tuned with supervised fine-tuning (SFT) and Direct Preference Optimization (DPO) to align it with human preferences and safety guidelines. The model supports a vocabulary size of up to 32,064 tokens.

The model is designed for general-purpose AI systems and applications that require:

- memory- or compute-constrained environments
- latency-bound scenarios
- strong reasoning (especially math and logic)

## 🛠️ Supported Hardware

This notebook can run on a CPU or a GPU.

✅ AMD Instinct™ Accelerators  
✅ AMD Radeon™ RX/PRO Graphics Cards  
⚠️ AMD EPYC™ Processors  
⚠️ AMD Ryzen™ (AI) Processors  

Suggested hardware: **AMD Instinct™ Accelerators**. This notebook can also run on a CPU, but inference will be slow.

## ⚡ Recommended Software Environment

::::{tab-set}

:::{tab-item} Linux
- [Install Docker container](https://amdresearch.github.io/aup-ai-tutorials//env/env-gpu.html)
- [Install PyTorch](https://amdresearch.github.io/aup-ai-tutorials//env/env-cpu.html)
:::

:::{tab-item} Windows
- [Install Direct-ML](https://amdresearch.github.io/aup-ai-tutorials//env/env-gpu-windows.html)
- [Install PyTorch](https://amdresearch.github.io/aup-ai-tutorials//env/env-cpu.html)
:::
::::

## 🎯 Goals

- Show you how to download a model from Hugging Face
- Run Phi-3 Instruct on an AMD platform
- Prompt the model and explore system and user role prompts


:::{seealso}
- [Phi-3-Mini-4K-Instruct](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct)
- [Phi open models](https://azure.microsoft.com/en-us/products/phi/)
- [Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone](https://arxiv.org/abs/2404.14219)
:::

## 🚀 Run Phi-3 Instruct on an AMD Platform

Import the necessary packages

In [None]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

Check if GPU is available for acceleration.

```{note}
Running the model on a GPU is strongly recommended. If your device is `cpu`, token generation will be slow.
```

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f'{device=}')

Download model and tokenizer from Hugging Face

In [None]:
model_id = "microsoft/Phi-3-mini-4k-instruct"

torch.random.manual_seed(0)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map=device,
    torch_dtype="auto",
    trust_remote_code=True,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)

In [None]:
print(f'Model size: {model.num_parameters() * model.dtype.itemsize / 1024 / 1024:.2f} MB')
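The size printed above depends on the data type the weights were loaded in. As a rough, hedged sketch of the same arithmetic, the snippet below estimates the weight footprint for a few common data types. It assumes the rounded 3.8B parameter count quoted in the introduction, and `estimate_weight_mib` is a hypothetical helper written for this illustration, not part of any library. The estimate covers weights only and ignores activations and the KV cache.

```python
# Bytes needed to store one parameter in each data type.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2, "int8": 1}


def estimate_weight_mib(num_params: int, dtype: str) -> float:
    """Return the approximate size of the model weights in MiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024 / 1024


# Assumption: the rounded 3.8B parameter count from the model description.
NUM_PARAMS = 3_800_000_000

for dtype in BYTES_PER_PARAM:
    print(f"{dtype:>8}: {estimate_weight_mib(NUM_PARAMS, dtype):,.0f} MiB")
```

Halving the bytes per parameter halves the footprint, which is why 16-bit weights are a common choice on memory-constrained devices.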

Define the pipeline and the generation arguments.
We use the [transformers pipeline](https://huggingface.co/docs/transformers/en/main_classes/pipelines) API to call the model and pass it the user prompt.

We start by creating a pipeline object for the `text-generation` task, specifying the model and the tokenizer.

`generation_args` is a helper dictionary that we pass to the pipeline call. It sets the maximum number of new tokens, the `temperature` (higher values make the output more random), and `do_sample` (if `True`, the next token is sampled from the model's probability distribution; if `False`, the most likely token is always chosen). With `do_sample` set to `False`, as it is here, `temperature` has no effect.

In [None]:
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
)

generation_args = {
    "max_new_tokens": 512,
    "return_full_text": False,
    "temperature": 0.01,
    "do_sample": False,
}
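To make the roles of `temperature` and `do_sample` more concrete, here is a minimal sketch of how a single next token could be picked from a list of logits. It is a toy illustration in plain Python, not the code that `transformers` runs internally. The helper names `softmax_with_temperature` and `pick_token` and the four logit values are made up for this example.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing them by `temperature`."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def pick_token(logits, temperature=1.0, do_sample=False, rng=random):
    """Return the index of the chosen token."""
    if not do_sample:
        # Greedy decoding: always take the highest logit, temperature is unused.
        return max(range(len(logits)), key=lambda i: logits[i])
    # Sampling: draw an index according to the temperature-scaled probabilities.
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


# Made-up logits for a toy four-token vocabulary.
toy_logits = [2.0, 1.0, 0.5, -1.0]

for t in (0.1, 1.0, 2.0):
    probs = softmax_with_temperature(toy_logits, t)
    print(f"temperature={t}: {[round(p, 3) for p in probs]}")

print("greedy choice:", pick_token(toy_logits))
```

A low temperature concentrates almost all probability on the top token, so sampling behaves much like greedy decoding. A high temperature flattens the distribution and makes less likely tokens more frequent.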

Let's define a system prompt for our model

In [None]:
system_prompt = {"role": "system", "content": "You are a helpful AI assistant."}

Define a prompt that asks a simple math question

In [None]:
prompt = [
    system_prompt,
    {"role": "user", "content": "What about solving an 2x + 3 = 7 equation?"}
]

Generate model response

In [None]:
output = pipe(prompt, **generation_args) 
print(f'Prompt:\n {prompt[1]["content"]}\n\nResponse:\n{output[0]["generated_text"]}')
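As a reference when judging the model's response, the equation in the prompt has a single solution that can be checked by hand:

```{math}
2x + 3 = 7 \;\Longrightarrow\; 2x = 4 \;\Longrightarrow\; x = 2
```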

The response is good, but we want the model to respond more concisely. For this we use [few-shot prompting](https://www.promptingguide.ai/techniques/fewshot); in this case, one-shot prompting. The prompt fed to the model includes an example of what we would like the response to look like.

- The system prompt defines the model as a helpful assistant.
- Next come an example user question and an example of how we would like the model to answer, followed by the actual question we want the model to answer.

In [None]:
messages_oneshot = [
    system_prompt,
    {"role": "user", "content": "Can you provide ways to eat combinations of bananas and dragonfruits?"},
    {"role": "assistant", "content": "Sure! Here are some ways to eat bananas and dragonfruits together: 1. Banana and dragonfruit smoothie: Blend bananas and dragonfruits together with some milk and honey. 2. Banana and dragonfruit salad: Mix sliced bananas and dragonfruits together with some lemon juice and honey."},
    {"role": "user", "content": "What about solving an 2x + 3 = 7 equation?"}
]
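The message list above follows a fixed pattern: the system prompt, then alternating user and assistant example turns, then the real question. As a hedged sketch, the same structure could be built by a small helper, which makes it easier to try more than one example (few-shot). `build_few_shot_messages` is a hypothetical function written for this illustration, and the example answer is shortened to a placeholder.

```python
def build_few_shot_messages(system_prompt, examples, question):
    """Build a chat message list: system prompt, worked examples, final question."""
    messages = [system_prompt]
    for user_text, assistant_text in examples:
        # Each example is one user turn followed by the desired assistant turn.
        messages.append({"role": "user", "content": user_text})
        messages.append({"role": "assistant", "content": assistant_text})
    # The real question always comes last so the model answers it next.
    messages.append({"role": "user", "content": question})
    return messages


demo_system_prompt = {"role": "system", "content": "You are a helpful AI assistant."}
demo_examples = [
    (
        "Can you provide ways to eat combinations of bananas and dragonfruits?",
        "Sure! 1. Banana and dragonfruit smoothie. 2. Banana and dragonfruit salad.",
    ),
]

demo_messages = build_few_shot_messages(
    demo_system_prompt,
    demo_examples,
    "What about solving an 2x + 3 = 7 equation?",
)

for message in demo_messages:
    print(f'{message["role"]:>9}: {message["content"]}')
```

With one example pair this produces the same four-role sequence as `messages_oneshot`. Adding more pairs to `demo_examples` turns it into a few-shot prompt.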

Generate model response

In [None]:
output = pipe(messages_oneshot, **generation_args) 
print(f'Prompt:\n {messages_oneshot[3]["content"]}\n\nResponse:\n{output[0]["generated_text"]}')

Note how the response from the model is more concise now.

```{tip}
Exercise for the reader: modify the `generation_args` configuration. For instance, set `do_sample` to `True` and increase the value of `temperature` (for example, to `1.0` or higher). What is the outcome?
```

----------
Copyright (C) 2025 Advanced Micro Devices, Inc. All rights reserved.

SPDX-License-Identifier: MIT