<a href="https://colab.research.google.com/github/JSJeong-me/LiteLLM-OnDeive-App/blob/main/006-Device-Deployment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# L3: Preparing for on-device deployment




## Capture trained model

In [None]:
!pip install qai-hub-models

In [1]:
from qai_hub_models.models.ffnet_40s import Model as FFNet_40s

# Load from pre-trained weights
ffnet_40s = FFNet_40s.from_pretrained()

Downloading data at https://github.com/quic/aimet-model-zoo/releases/download/torch_segmentation_ffnet/ffnet40S_dBBB_cityscapes_state_dict_quarts.pth to /root/.qaihm/models/ffnet/v1/ffnet40S/ffnet40S_dBBB_cityscapes_state_dict_quarts.pth... 

100%|██████████| 55.8M/55.8M [00:00<00:00, 70.6MB/s]


Done
cityscapes_segmentation requires repository https://github.com/Qualcomm-AI-research/FFNet.git . Ok to clone? [Y/n] y
Cloning https://github.com/Qualcomm-AI-research/FFNet.git to /root/.qaihm/models/cityscapes_segmentation/v2/Qualcomm-AI-research_FFNet_git...
Done
Loading pretrained model state dict from /root/.qaihm/models/ffnet/v1/ffnet40S/ffnet40S_dBBB_cityscapes_state_dict_quarts.pth
Initializing ffnnet40S_dBBB_mobile weights


In [2]:
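import torch

# One RGB image at 1024x2048 resolution, laid out as (batch, channels, height, width)
input_shape = (1, 3, 1024, 2048)
# Random values are enough here: tracing only needs an input of the right shape
example_inputs = torch.rand(input_shape)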
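For a sense of scale, the element count and raw size of this input follow from the shape alone. The sketch below assumes 4 bytes per element (float32, the default dtype returned by `torch.rand`). `tensor_nbytes` is a hypothetical helper written for illustration, not part of any library.

```python
import math

def tensor_nbytes(shape, bytes_per_element=4):
    """Raw size in bytes of a dense tensor with the given shape.

    bytes_per_element=4 assumes float32 (an assumption, see the text above).
    """
    return math.prod(shape) * bytes_per_element

input_shape = (1, 3, 1024, 2048)
num_elements = math.prod(input_shape)          # 1 * 3 * 1024 * 2048
size_mib = tensor_nbytes(input_shape) / 2**20  # bytes -> MiB

print(num_elements, size_mib)
```

A single full-resolution input is therefore about 24 MiB before any compression, which is worth keeping in mind when uploading sample data.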

In [3]:
# Run the model once on the example input and record the executed operations as TorchScript
traced_model = torch.jit.trace(ffnet_40s, example_inputs)

In [None]:
traced_model

## Compile for device


In [None]:
!pip install qai-hub
# The token below is a redacted placeholder; substitute your own AI Hub API token
!qai-hub configure --api_token 1xxx

In [13]:
!pip install python-dotenv

Collecting python-dotenv
  Downloading python_dotenv-1.0.1-py3-none-any.whl (19 kB)
Installing collected packages: python-dotenv
Successfully installed python-dotenv-1.0.1


In [14]:
import os
from dotenv import load_dotenv, find_dotenv


def load_env():
    _ = load_dotenv(find_dotenv())

def get_ai_hub_api_token():
    load_env()
    ai_hub_api_token = os.getenv("AI_HUB_API_KEY")
    return ai_hub_api_token
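`os.getenv` returns `None` when the variable is missing, so a misconfigured `.env` file only shows up later as an authentication failure. A minimal sketch of a fail-fast lookup follows. `require_env` and the `EXAMPLE_*` variable names are hypothetical and used only for this demonstration.

```python
import os

def require_env(name):
    """Return the value of environment variable `name`; raise if unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"{name} is not set; add it to your .env file or export it in the shell."
        )
    return value

# Demonstration with a throwaway variable and a placeholder value (not a real token)
os.environ["EXAMPLE_AI_HUB_API_KEY"] = "placeholder-token"
token = require_env("EXAMPLE_AI_HUB_API_KEY")
print(token)
```

The same check could wrap the `AI_HUB_API_KEY` lookup above, so that a missing key stops the notebook at the cell that caused the problem.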

In [None]:
import qai_hub
import qai_hub_models

# from utils import get_ai_hub_api_token

ai_hub_api_token = get_ai_hub_api_token()

# Pass the token loaded above to the CLI (IPython expands $variable inside ! commands)
!qai-hub configure --api_token $ai_hub_api_token

In [19]:
for device in qai_hub.get_devices():
    print(device.name)

Google Pixel 3 (Family)
Google Pixel 3
Google Pixel 3a
Google Pixel 3 XL
Google Pixel 4
Google Pixel 4
Google Pixel 4a
Google Pixel 5
Samsung Galaxy Tab S7
Samsung Galaxy Tab A8 (2021)
Samsung Galaxy Note 20 (Intl)
Samsung Galaxy S21 (Family)
Samsung Galaxy S21
Samsung Galaxy S21+
Samsung Galaxy S21 Ultra
Xiaomi Redmi Note 10 5G
Google Pixel 3a XL
Google Pixel 4a
Google Pixel 5 (Family)
Google Pixel 5
Google Pixel 5a 5G
Google Pixel 6
Samsung Galaxy A53 5G
Samsung Galaxy A73 5G
RB3 Gen 2 (Proxy)
QCS6490 (Proxy)
RB5 (Proxy)
QCS8250 (Proxy)
QCS8550 (Proxy)
Samsung Galaxy S21 (Family)
Samsung Galaxy S21
Samsung Galaxy S21 Ultra
Samsung Galaxy S22 (Family)
Samsung Galaxy S22 Ultra 5G
Samsung Galaxy S22 5G
Samsung Galaxy S22+ 5G
Samsung Galaxy Tab S8
Xiaomi 12 (Family)
Xiaomi 12
Xiaomi 12 Pro
Google Pixel 6 (Family)
Google Pixel 6
Google Pixel 6a
Google Pixel 7 (Family)
Google Pixel 7
Google Pixel 7 Pro
Samsung Galaxy A14 5G
Samsung Galaxy S22 5G
QCS8450 (Proxy)
XR2 Gen 2 (Proxy)
Samsung Ga

<p style="background-color:#fff6e4; padding:15px; border-width:3px; border-color:#f5ecda; border-style:solid; border-radius:6px"> ⏳ <b>Note:</b> To spread the load across various devices, we are selecting a random device. Feel free to change it to any other device you prefer.</p>

In [20]:
devices = [
    "Samsung Galaxy S22 Ultra 5G",
    "Samsung Galaxy S22 5G",
    "Samsung Galaxy S22+ 5G",
    "Samsung Galaxy Tab S8",
    "Xiaomi 12",
    "Xiaomi 12 Pro",
    "Samsung Galaxy S22 5G",
    "Samsung Galaxy S23",
    "Samsung Galaxy S23+",
    "Samsung Galaxy S23 Ultra",
    "Samsung Galaxy S24",
    "Samsung Galaxy S24 Ultra",
    "Samsung Galaxy S24+",
]

import random
selected_device = random.choice(devices)
print(selected_device)

Samsung Galaxy Tab S8
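The list above names "Samsung Galaxy S22 5G" twice, so `random.choice` is twice as likely to pick it as any other entry. The sketch below removes repeats while keeping the original order, and uses a seeded generator so that the pick is repeatable when debugging. The seed value is arbitrary. Leave the seed out if you want to keep spreading load across devices.

```python
import random

devices = [
    "Samsung Galaxy S22 Ultra 5G",
    "Samsung Galaxy S22 5G",
    "Samsung Galaxy S22+ 5G",
    "Samsung Galaxy Tab S8",
    "Xiaomi 12",
    "Xiaomi 12 Pro",
    "Samsung Galaxy S22 5G",
    "Samsung Galaxy S23",
    "Samsung Galaxy S23+",
    "Samsung Galaxy S23 Ultra",
    "Samsung Galaxy S24",
    "Samsung Galaxy S24 Ultra",
    "Samsung Galaxy S24+",
]

# dict.fromkeys keeps the first occurrence of each name and preserves order
unique_devices = list(dict.fromkeys(devices))

# A dedicated generator with a fixed (arbitrary) seed gives a repeatable choice
rng = random.Random(0)
selected_device = rng.choice(unique_devices)

print(len(devices), len(unique_devices))
print(selected_device)
```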


In [21]:
device = qai_hub.Device(selected_device)

# Compile for target device
compile_job = qai_hub.submit_compile_job(
    model=traced_model,                        # Traced PyTorch model
    input_specs={"image": input_shape},        # Input specification
    device=device,                             # Device
)

Uploading model: 100%|██████████| 53.6M/53.6M [00:01<00:00, 52.4MB/s]


Scheduled compile job (j1gllwlmg) successfully. To see the status and results:
    https://app.aihub.qualcomm.com/jobs/j1gllwlmg/



In [22]:
# Download and save the target model for use on-device
target_model = compile_job.get_target_model()

Waiting for compile job (j1gllwlmg) completion. Type Ctrl+C to stop waiting at any time.
    ✅ SUCCESS                          


## Exercise: Try different runtimes

In [23]:
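# Pick ONE runtime. Each assignment overwrites the previous one, so only the
# last uncommented line takes effect; comment out the ones you do not want.
compile_options="--target_runtime tflite"                  # Uses TensorFlow Lite
compile_options="--target_runtime onnx"                    # Uses ONNX runtime
compile_options="--target_runtime qnn_lib_aarch64_android" # Runs with Qualcomm AI Engine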

compile_job_expt = qai_hub.submit_compile_job(
    model=traced_model,                        # Traced PyTorch model
    input_specs={"image": input_shape},        # Input specification
    device=device,                             # Device
    options=compile_options,
)

Uploading model: 100%|██████████| 53.6M/53.6M [00:00<00:00, 57.9MB/s]


Scheduled compile job (jw56wowyg) successfully. To see the status and results:
    https://app.aihub.qualcomm.com/jobs/jw56wowyg/



Explore more compiler options <a href="https://app.aihub.qualcomm.com/docs/hub/compile_examples.html#compiling-pytorch-to-tflite">here</a>.
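Because the exercise cell assigns `compile_options` three times in a row, it is easy to submit a job with a runtime you did not intend. One way to make the choice explicit is to build the option string from a runtime name. This is a minimal sketch: `compile_options_for` is a hypothetical helper, and the tuple lists only the three runtimes shown in this notebook, not everything AI Hub may accept.

```python
# Runtimes that appear in this notebook (not an exhaustive list)
RUNTIMES_IN_THIS_NOTEBOOK = ("tflite", "onnx", "qnn_lib_aarch64_android")

def compile_options_for(runtime):
    """Build the --target_runtime option string for one of the runtimes above."""
    if runtime not in RUNTIMES_IN_THIS_NOTEBOOK:
        raise ValueError(
            f"Unknown runtime {runtime!r}; expected one of {RUNTIMES_IN_THIS_NOTEBOOK}"
        )
    return f"--target_runtime {runtime}"

compile_options = compile_options_for("onnx")
print(compile_options)
```

The resulting string can be passed as `options=compile_options` to `qai_hub.submit_compile_job`, exactly as in the exercise cell.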

## On-Device Performance Profiling

In [24]:
from qai_hub_models.utils.printing import print_profile_metrics_from_job

# Choose device
device = qai_hub.Device(selected_device)

# Runs a performance profile on-device
profile_job = qai_hub.submit_profile_job(
    model=target_model,                       # Compiled model
    device=device,                            # Device
)

# Print summary
profile_data = profile_job.download_profile()
print_profile_metrics_from_job(profile_job, profile_data)

Scheduled profiling job (j1p36o6np) successfully. To see the status and results:
    https://app.aihub.qualcomm.com/jobs/j1p36o6np/

Waiting for profile job (j1p36o6np) completion. Type Ctrl+C to stop waiting at any time.
    ✅ SUCCESS                          

------------------------------------------------------------
Performance results on-device for Job_J1Gllwlmg_Optimized_Tflite.
------------------------------------------------------------
Device                          : Samsung Galaxy Tab S8 (12)
Runtime                         : TFLITE                    
Estimated inference time (ms)   : 41.5                      
Estimated peak memory usage (MB): [2, 93]                   
Total # Ops                     : 94                        
Compute Unit(s)                 : NPU (94 ops)              
------------------------------------------------------------
More details: https://app.aihub.qualcomm.com/jobs/j1p36o6np/
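The estimated inference time can be turned into a rough throughput figure. The sketch below assumes frames are processed back to back and ignores pre-processing, post-processing, and data transfer, so it is an upper bound, not a measured frame rate.

```python
inference_ms = 41.5  # "Estimated inference time (ms)" from the profile above

# Upper bound on throughput if inference were the only cost per frame
fps = 1000.0 / inference_ms

# Time budget per frame for a 30 frames-per-second target
budget_ms_30fps = 1000.0 / 30
meets_30fps = inference_ms <= budget_ms_30fps

print(f"{fps:.1f} frames per second")  # prints "24.1 frames per second"
print(f"30 FPS budget: {budget_ms_30fps:.1f} ms, met: {meets_30fps}")
```

At 41.5 ms per inference the model stays under roughly 24 frames per second on this device, which falls short of a 30 FPS target at this input resolution.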



## Exercise: Try different compute units

In [25]:
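# Pick ONE compute unit. Each assignment overwrites the previous one, so only
# the last uncommented line takes effect; comment out the ones you do not want.
profile_options="--compute_unit cpu"     # Use cpu
profile_options="--compute_unit gpu"     # Use gpu (with cpu fallback)
profile_options="--compute_unit npu"     # Use npu (with cpu fallback)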

# Runs a performance profile on-device
profile_job_expt = qai_hub.submit_profile_job(
    model=target_model,                     # Compiled model
    device=device,                          # Device
    options=profile_options,
)

Scheduled profiling job (j1pv727rp) successfully. To see the status and results:
    https://app.aihub.qualcomm.com/jobs/j1pv727rp/



## On-Device Inference

In [None]:
sample_inputs = ffnet_40s.sample_inputs()
sample_inputs

In [None]:
# Run the same sample through the local PyTorch model to get a reference output
torch_inputs = torch.Tensor(sample_inputs['image'][0])
torch_outputs = ffnet_40s(torch_inputs)
torch_outputs

In [28]:
inference_job = qai_hub.submit_inference_job(
        model=target_model,          # Compiled model
        inputs=sample_inputs,        # Sample input
        device=device,               # Device
)

Uploading dataset: 100%|██████████| 21.5M/21.5M [00:00<00:00, 38.7MB/s]


Scheduled inference job (jlpey6yv5) successfully. To see the status and results:
    https://app.aihub.qualcomm.com/jobs/jlpey6yv5/



In [None]:
ondevice_outputs = inference_job.download_output_data()
ondevice_outputs['output_0']

In [30]:
from qai_hub_models.utils.printing import print_inference_metrics
print_inference_metrics(inference_job, ondevice_outputs, torch_outputs)


Comparing on-device vs. local-cpu inference for Job_J1Gllwlmg_Optimized_Tflite.
+---------------+----------------------------+--------+
| output_name   | shape                      |   psnr |
| output_0      | torch.Size([19, 128, 256]) |  66.24 |
+---------------+----------------------------+--------+

- psnr: Peak Signal-to-Noise Ratio (PSNR). >30 dB is typically considered good.

More details: https://app.aihub.qualcomm.com/jobs/jlpey6yv5/
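To make the PSNR figure concrete, here is a minimal sketch of the textbook definition, PSNR = 20 · log10(peak / RMSE), on a tiny example. Taking the peak as the largest absolute value in the reference is an assumption made here; the exact convention used inside `print_inference_metrics` is not confirmed by this notebook. `psnr_db` is a hypothetical helper.

```python
import math

def psnr_db(reference, candidate):
    """Peak signal-to-noise ratio in dB between two equal-length sequences.

    Assumption: the peak is the largest absolute value in `reference`.
    """
    if len(reference) != len(candidate):
        raise ValueError("reference and candidate must have the same length")
    mse = sum((r - c) ** 2 for r, c in zip(reference, candidate)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals: no noise at all
    peak = max(abs(r) for r in reference)
    if peak == 0:
        raise ValueError("reference is all zeros, so the peak is undefined")
    return 20 * math.log10(peak / math.sqrt(mse))

reference = [0.0, 1.0, 2.0, 4.0]   # stand-in for the local PyTorch output
candidate = [0.0, 1.0, 2.0, 3.9]   # stand-in for the on-device output
value = psnr_db(reference, candidate)

print(f"{value:.2f} dB")
```

In this toy case a single error of 0.1 against a peak of 4.0 gives roughly 38 dB. Higher values mean the two outputs agree more closely.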


## Get ready for deployment!

In [31]:
target_model = compile_job.get_target_model()
_ = target_model.download("FFNet_40s.tflite")

job_j1gllwlmg_optimized_tflite_mxqv2p10n.tflite: 100%|██████████| 53.1M/53.1M [00:01<00:00, 49.0MB/s]
