# Janus-Pro DeepSeek Model on AMD 🧪

This tutorial demonstrates how to perform multimodal inference with **Janus-Pro**, an autoregressive framework that is part of the [Janus-Series](https://github.com/deepseek-ai/Janus#-janus-series-unified-multimodal-understanding-and-generation-models) from DeepSeek AI. We will run the model on high-performance AMD hardware, including EPYC™ CPUs and Instinct™ GPUs.

"Multimodal" simply means the model can understand and process information from multiple sources at once, such as both text and images. By unifying these different data types, known as "modalities," Janus enables sophisticated understanding and generation tasks.

When running the model on a CPU, we also show how to use **AMD ZenDNN**, or more precisely the **zentorch** plugin for PyTorch, to accelerate CPU inference.

## Prerequisites 🧪


### Hardware requirements
For this tutorial, you will need a system featuring an AMD Instinct GPU.  If you intend to also run the model on CPU and use AMD ZenDNN, you will need an AMD EPYC CPU. 

This tutorial was tested on:
* AMD Instinct MI100
* AMD Instinct MI210
* AMD Instinct MI300X
* 4th Gen AMD EPYC (Genoa)
* 5th Gen AMD EPYC (Turin)

### Software Requirements
* Ubuntu 22.04 or later
* ROCm 6.3 (for GPU execution)
* PyTorch 2.6 or later
* zentorch 5.0.2 or later (for CPU execution with ZenDNN)
* The official [Janus-Pro DeepSeek repository](https://github.com/deepseek-ai/Janus) cloned
  
**Note**: This tutorial was tested with `torch2.7.1+rocm6.3`, `torch2.6.0+cpu`, and `zentorch-5.0.2`.

### Install and launch Jupyter Notebooks
If Jupyter is not already installed on your system, install it and launch JupyterLab using the following commands:

```
pip install jupyter
```

Then start the Jupyter server with the following command:

```
jupyter-lab --ip=0.0.0.0 --port=8888 --no-browser --allow-root
```

**Note**: Ensure port 8888 is not already in use on your system before running the above command. If it is, you can specify a different port by replacing `--port=8888` with another port number, for example, `--port=8890`.
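
The port check described in the note can also be done programmatically. The following is a minimal sketch that uses only Python's standard `socket` module. The helper names `port_is_free` and `first_free_port` are hypothetical (they are not part of Jupyter), and the sketch assumes a port counts as free when a TCP socket can be bound to it.

```python
import socket


def port_is_free(port: int, host: str = "0.0.0.0") -> bool:
    """Return True if a TCP socket can be bound to (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except (OSError, OverflowError):
            # OSError: typically "address already in use" or a permission error.
            # OverflowError: the port number is outside 0-65535.
            return False
    return True


def first_free_port(start: int = 8888, attempts: int = 20, host: str = "0.0.0.0") -> int:
    """Return the first bindable port in [start, start + attempts)."""
    for port in range(start, start + attempts):
        if port_is_free(port, host):
            return port
    raise RuntimeError(f"No free port found in {start}..{start + attempts - 1}")


if __name__ == "__main__":
    # Hypothetical usage: suggest a value for the --port flag
    print(f"Suggested flag: --port={first_free_port(8888)}")
```

The result is only a hint: another process could still claim the port between this check and the moment `jupyter-lab` starts.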

After the command executes, the terminal output displays a URL and token. Copy and paste this URL into your web browser on the host machine to access JupyterLab. After launching JupyterLab, upload this notebook to the environment and continue to follow the steps in this tutorial.

## 0️⃣ Prepare Environment: Install Dependencies

The following commands install all the dependencies required to run this tutorial. We also install `janus` from DeepSeek AI's GitHub repository.

In [None]:
!pip3 install torch torchvision --index-url https://download.pytorch.org/whl/rocm6.3
!pip install transformers ipywidgets
!pip install git+https://github.com/deepseek-ai/Janus.git

Let's run a quick sanity check of the PyTorch environment. We validate that the GPU hardware is accessible via the ROCm backend. If it is not, we can still execute the model on CPU.

In [2]:
import torch 
print(f"PyTorch Version: {torch.__version__}")

print("--- GPU Verification ---")
if torch.cuda.is_available():
    print("✅ PyTorch has access to the GPU.")
    print(f"ROCm Version (detected by PyTorch): {torch.version.hip}")
    print(f"Number of available GPUs: {torch.cuda.device_count()}")
    print(f"GPU installed on the system: {torch.cuda.get_device_name(0)}")
else:
    print("❌ PyTorch cannot access the GPU. Check your ROCm installation and drivers, or continue with CPU execution.")

We start by importing the Python libraries required for this tutorial.

In [None]:
import torch
from transformers import AutoModelForCausalLM
from janus.models import MultiModalityCausalLM, VLChatProcessor
from janus.utils.io import load_pil_images
import time

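Next, we initialize the variables used during inference: `iteration` is the number of timed runs, `warmup` is the number of untimed warmup runs, `max_new_tokens` is the number of tokens generated per run, and `dtype` is the precision the model is cast to: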

In [4]:
iteration = 1
warmup = 0
max_new_tokens = 512
dtype = "bfloat16"

## 0️⃣ Choose Hardware Backend: CPU or GPU

This example supports two hardware backends for the AI workload: CPU and GPU. To deploy on GPU, set `device = "cuda"`; to deploy on CPU, set `device = "cpu"`. On CPU you can choose among several software backends: `zentorch`, the Intel® Extension for PyTorch (`ipex`), `inductor` (the default `torch.compile` backend), or native mode (that is, eager mode as opposed to graph mode).

**Note**: At the time of publishing this tutorial, the zentorch plugin version 5.0.2 requires the CPU-only build of PyTorch 2.6. This is a limitation of PyTorch. When `device = "cpu"` and `backend = "zentorch"`, the cell below therefore removes any existing torch installation and installs the CPU-only build.

In [5]:
device = "cuda" # Change to "cpu" to execute on CPU
backend = "zentorch" # Available CPU backends: zentorch, inductor, ipex, or native [Eager Mode]

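if device == "cpu" and backend == "zentorch":
    # zentorch 5.0.2 needs the CPU-only build of PyTorch 2.6, so replace the ROCm build
    !pip uninstall -y torch torchvision torchaudio pytorch-triton-rocm
    !pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cpu
    !pip install zentorch  # add --no-cache-dir to force a fresh download
    # If torch was already imported in this kernel, restart the kernel and re-run
    # the earlier cells so that the newly installed CPU build is the one loaded.
    import torch
    import zentorch
    print(f"PyTorch Version: {torch.__version__}")
    print(f"Zentorch Version: {zentorch.__version__}")

# Reduced precision is enabled for every dtype except float32
amp_enabled = dtype != "float32"
# Map the dtype name (for example "bfloat16") to the torch dtype object
amp_dtype = getattr(torch, dtype)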
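
To make the effect of the two settings easier to see, here is a minimal sketch that maps a `device`/`backend` pair to the way the model ends up being executed. The helper `describe_execution` is hypothetical and not part of the tutorial's code. It mirrors the dispatch used later in Step 5 of this notebook, where only `zentorch` and `inductor` are passed to `torch.compile` and any other CPU backend value runs in eager mode.

```python
CPU_BACKENDS = ("zentorch", "inductor", "ipex", "native")


def describe_execution(device: str, backend: str) -> str:
    """Summarize how the model would run for a given device/backend pair."""
    if device == "cuda":
        # The CPU backend setting is ignored when running on GPU
        return "gpu"
    if device != "cpu":
        raise ValueError(f"Unknown device {device!r}; expected 'cuda' or 'cpu'")
    if backend not in CPU_BACKENDS:
        raise ValueError(f"Unknown CPU backend {backend!r}; expected one of {CPU_BACKENDS}")
    if backend in ("zentorch", "inductor"):
        return f"cpu, torch.compile with {backend}"
    # Assumption: this notebook has no compile branch for the remaining values
    return "cpu, eager mode"


print(describe_execution("cuda", "zentorch"))
print(describe_execution("cpu", "zentorch"))
print(describe_execution("cpu", "native"))
```

Validating the pair up front surfaces a typo such as `device = "gpu"` immediately, rather than later when the model is moved to the device.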

## 1️⃣ Model Initialization and Setup

We begin by specifying the model path and initializing the necessary components for processing images and text.


In [None]:
# Specify the path to the model  
model_path = "deepseek-ai/Janus-Pro-7B"  

# Load the multimodal chat processor and tokenizer  
vl_chat_processor: VLChatProcessor = VLChatProcessor.from_pretrained(model_path)  
tokenizer = vl_chat_processor.tokenizer  

vl_gpt: MultiModalityCausalLM = AutoModelForCausalLM.from_pretrained(  
    model_path, trust_remote_code=True  
)


Convert the model to the selected precision (BFloat16 by default) and move it to the chosen device for inference:

In [None]:
vl_gpt = vl_gpt.to(amp_dtype).to(device).eval() 

## 2️⃣ Define Image and User Input
We define an image and a text-based query to analyze its content.

In [None]:
image = "../assets/deepseek_janus_demo_small.jpg" 

Let's use the code snippet below to check our image. That will also confirm that the image path is correct. 

In [None]:
from IPython.display import Image

# Render the image inline to confirm that the path is correct
Image(filename=image)

We will now prepare the input payload for the vision-language model by constructing a `conversation` list that adheres to the model's required chat template. The user's message is a dictionary containing the role, the textual question prefixed with an `<image_placeholder>` token, and a list with the corresponding image path.

In [10]:
question = "What is happening in this image?"  

conversation = [  
    {  
        "role": "<|User|>",  
        "content": f"<image_placeholder>\n{question}",  
        "images": [image],  
    },   
]
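
Before handing the conversation to the processor, it can help to check that the placeholders and images line up. The sketch below is a hypothetical helper, not part of the Janus API, and it assumes that each message should contain exactly one `<image_placeholder>` token per entry in its `images` list.

```python
IMAGE_TOKEN = "<image_placeholder>"


def find_placeholder_mismatches(conversation):
    """Return (message_index, n_placeholders, n_images) for each inconsistent message."""
    mismatches = []
    for index, message in enumerate(conversation):
        n_placeholders = message.get("content", "").count(IMAGE_TOKEN)
        n_images = len(message.get("images", []))
        if n_placeholders != n_images:
            mismatches.append((index, n_placeholders, n_images))
    return mismatches


# Same structure as the conversation defined in this tutorial
example = [
    {
        "role": "<|User|>",
        "content": f"{IMAGE_TOKEN}\nWhat is happening in this image?",
        "images": ["../assets/deepseek_janus_demo_small.jpg"],
    },
]
print(find_placeholder_mismatches(example))
```

An empty list means every message is consistent under that assumption; each tuple otherwise points to the message that needs attention.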


## 3️⃣ Preprocess Inputs (Image + Text)
We load the image and convert the conversation data into a format suitable for the model.

In [12]:
# Load images and prepare inputs  
pil_images = load_pil_images(conversation)  

# Process conversation and images into model-compatible input format  
prepare_inputs = vl_chat_processor(  
    conversations=conversation, images=pil_images, force_batchify=True  
).to(vl_gpt.device)  


## 4️⃣ Generate Image Embeddings
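Before running inference, we encode the image and merge the resulting image embeddings with the text token embeddings to obtain the input embeddings for the language model.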

In [13]:
inputs_embeds = vl_gpt.prepare_inputs_embeds(**prepare_inputs)  

## 5️⃣ Leveraging AMD ZenDNN Plugin for PyTorch (`zentorch`) 
We have registered a custom backend to `torch.compile` called `zentorch`. This backend integrates ZenDNN optimizations after AOTAutograd through a function called `optimize()`. This function operates on the FX-based graph at the ATen IR level and produces an optimized FX-based graph as its output. Please check out the [ZenDNN User Guide](https://docs.amd.com/r/en-US/57300-ZenDNN-user-guide/ZenDNN) for more information on how the plugin works.

In [None]:
if device == "cpu":
    if backend == "zentorch":
        print("Backend: ZenTorch")
        import zentorch  # importing zentorch registers the "zentorch" backend
        torch._dynamo.reset()  # clear any previously compiled graphs
        vl_gpt.language_model.forward = torch.compile(vl_gpt.language_model.forward, backend="zentorch")

    elif backend == "inductor":
        print("Backend: Inductor")
        torch._dynamo.reset()  # clear any previously compiled graphs
        vl_gpt.language_model.forward = torch.compile(vl_gpt.language_model.forward)

    else:
        # Any other backend value runs without torch.compile
        print("Running in Eager mode")
else:
    print("We are executing on GPU, so we won't use any CPU-acceleration software such as zentorch.")

## Profiler
The **Profiler** helps verify the operations (ops) and assess the effectiveness of **torch.compile** in optimizing the model. It provides insights into the performance of the model by tracking execution times and pinpointing areas where optimizations can be made, ensuring that torch.compile is working as expected.

This part can be skipped if the focus is on performance checks rather than detailed analysis.

In [15]:
# Optional: uncomment this cell to profile generation on CPU.
# from torch.profiler import profile, schedule, ProfilerActivity
#
# def trace_handler(prof):
#     # Print the per-operator table each time an active profiling window ends
#     print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=-1))
#
# with profile(
#     activities=[ProfilerActivity.CPU],
#     record_shapes=False,
#     schedule=schedule(wait=1, warmup=1, active=1),
#     on_trace_ready=trace_handler,
# ) as prof:
#     for i in range(3):
#         # Run the model to get the response
#         outputs = vl_gpt.language_model.generate(
#             inputs_embeds=inputs_embeds,
#             attention_mask=prepare_inputs.attention_mask,
#             pad_token_id=tokenizer.eos_token_id,
#             bos_token_id=tokenizer.bos_token_id,
#             eos_token_id=tokenizer.eos_token_id,
#             max_new_tokens=max_new_tokens,
#             do_sample=False,
#             use_cache=True,
#         )
#         prof.step()
#
# # To check the data type of each parameter
# for name, param in vl_gpt.named_parameters():
#     print(f"Parameter: {name}, Shape: {param.shape}, Data Type: {param.dtype}")
#     print(f"First few values: {param.flatten()[:5]}\n")

## 6️⃣ Warmup Inference (Stabilization Runs)
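To ensure stable performance, we run a few untimed inference cycles before measuring. The loop below runs `warmup` times, so with the default `warmup = 0` set earlier it is skipped. Increase `warmup` to enable it, which is especially useful with `torch.compile` because the first call includes compilation time.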


In [16]:
for i in range(warmup):  
    # Generate a response without timing for warmup  
    outputs = vl_gpt.language_model.generate(  
        inputs_embeds=inputs_embeds,  
        attention_mask=prepare_inputs.attention_mask,  
        pad_token_id=tokenizer.eos_token_id,  
        bos_token_id=tokenizer.bos_token_id,  
        eos_token_id=tokenizer.eos_token_id,
        min_new_tokens=max_new_tokens,
        max_new_tokens=max_new_tokens,
        do_sample=False,  
        use_cache=True,  
    )  
    print(f"WARMUP:{i+1} COMPLETED!")  

## 7️⃣ Timed Inference Execution
We now run actual inference while measuring latency for performance analysis.

In [17]:
total_time = 0.0

for i in range(iteration):
    tic = time.perf_counter()  # Start time (monotonic, high-resolution clock)

    # Generate response from the model
    outputs = vl_gpt.language_model.generate(
        inputs_embeds=inputs_embeds,
        attention_mask=prepare_inputs.attention_mask,
        pad_token_id=tokenizer.eos_token_id,
        bos_token_id=tokenizer.bos_token_id,
        eos_token_id=tokenizer.eos_token_id,
        # min == max forces exactly max_new_tokens tokens per run,
        # which keeps the throughput calculation valid
        min_new_tokens=max_new_tokens,
        max_new_tokens=max_new_tokens,
        do_sample=False,
        use_cache=True,
    )

    toc = time.perf_counter()  # End time
    delta = toc - tic  # Time taken by this iteration
    total_time += delta

## 8️⃣ Compute and Display Latency
We calculate the average latency and the throughput, then print both results.

In [None]:
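# Use a separate name so that re-running this cell does not divide twice
avg_latency = total_time / iteration
print(
    f"e2e_latency (TTFT + Generation Time) for step: {avg_latency:.6f} sec",
    flush=True,
)

# Each run generates exactly max_new_tokens tokens (min_new_tokens == max_new_tokens)
tps_per_step = max_new_tokens / avg_latency
print(
    f"Throughput: {tps_per_step:.6f} tokens/sec",
    flush=True,
)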
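
The warmup, timing, and throughput steps can be combined into one reusable pattern. The following is a minimal sketch under stated assumptions: `benchmark` and `fake_generate` are hypothetical names, and `fake_generate` simply sleeps to stand in for a real `generate()` call so that the example runs without a model.

```python
import time
from statistics import mean


def benchmark(fn, tokens_per_call: int, warmup: int = 1, iterations: int = 3) -> dict:
    """Time `fn` and derive the average latency and tokens/sec."""
    if iterations < 1:
        raise ValueError("iterations must be at least 1")
    for _ in range(warmup):
        fn()  # untimed runs to stabilize performance
    latencies = []
    for _ in range(iterations):
        tic = time.perf_counter()
        fn()
        latencies.append(time.perf_counter() - tic)
    avg_latency = mean(latencies)
    return {
        "latencies_s": latencies,
        "avg_latency_s": avg_latency,
        "tokens_per_s": tokens_per_call / avg_latency,
    }


def fake_generate():
    # Placeholder workload standing in for a model's generate() call
    time.sleep(0.01)


result = benchmark(fake_generate, tokens_per_call=512, warmup=1, iterations=3)
print(f"avg latency: {result['avg_latency_s']:.4f} s, "
      f"throughput: {result['tokens_per_s']:.1f} tokens/sec")
```

The throughput figure is only meaningful when every call produces exactly `tokens_per_call` tokens, which this tutorial ensures by setting `min_new_tokens` equal to `max_new_tokens`. On GPU, calling `torch.cuda.synchronize()` before reading the clock is a common extra precaution, since GPU work is launched asynchronously.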

## 9️⃣ Decode and Display the Model’s Output
Finally, we decode the generated token sequence into human-readable text.

In [None]:
# Move the generated token IDs to host memory before decoding
answer = tokenizer.decode(outputs[0].cpu().tolist(), skip_special_tokens=True)
print(prepare_inputs["sft_format"][0], answer)

## ✨ Summary of the Pipeline
* ✅ Step 0: Prepare the environment and choose your hardware backend (CPU or GPU).
* ✅ Step 1: Load the Janus-Pro model and processor.
* ✅ Step 2: Define the image and question for multimodal understanding.
* ✅ Step 3: Preprocess the text and image inputs.
* ✅ Step 4: Generate the input embeddings for the model.
* ✅ Step 5: Compile the language model with zentorch (CPU only).
* ✅ Step 6: Run warmup iterations to stabilize performance.
* ✅ Step 7: Perform timed inference to measure latency.
* ✅ Step 8: Compute and display the average latency and throughput.
* ✅ Step 9: Decode and display the AI-generated response.