description: Quantized Llama2-7B with MLC LLM on Jetson
title: Quantized Llama2-7B with MLC LLM on Jetson
keywords: Edge, reComputer, Jetson, Llama2, MLC LLM
slug: /Quantized_Llama2_7B_with_MLC_LLM_on_Jetson
last_update: date 04/1/2024, author Jiahao
no_comments: false

Quantized Llama2-7B with MLC LLM on Jetson

Introduction

In recent years, large language models such as GPT-3 have revolutionized natural language processing. However, these models have billions of parameters, so running them takes substantial compute and memory, which makes them hard to deploy on edge devices. To address this, researchers have developed quantization techniques that store model weights at lower precision, shrinking the model with only a small loss in accuracy.
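To make the idea of quantization concrete, here is a minimal sketch of symmetric 4-bit quantization applied to one small group of weights. It is an illustration only: the function names and sample values are made up for this example, and it is not the scheme MLC LLM implements for `q4f16_ft`.

```python
from typing import List, Tuple


def quantize_int4(weights: List[float]) -> Tuple[List[int], float]:
    """Symmetric 4-bit quantization of one weight group.

    Returns integer codes in [-7, 7] plus the scale needed to decode them.
    """
    max_abs = max(abs(w) for w in weights)
    # Guard against an all-zero group, where any scale decodes correctly.
    scale = max_abs / 7 if max_abs > 0 else 1.0
    codes = [max(-7, min(7, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_int4(codes: List[int], scale: float) -> List[float]:
    """Map the integer codes back to approximate floating-point weights."""
    return [c * scale for c in codes]


if __name__ == "__main__":
    # Hypothetical weights, chosen only for illustration.
    group = [0.12, -0.70, 0.33, 0.05, -0.41, 0.66, -0.02, 0.27]
    codes, scale = quantize_int4(group)
    restored = dequantize_int4(codes, scale)
    worst = max(abs(a - b) for a, b in zip(group, restored))
    print("codes:", codes)
    print("scale:", scale)
    print("worst-case error:", worst)
```

Each weight is stored as a 4-bit code plus one shared scale per group, and the rounding error per weight is at most half the scale. Real schemes differ in group size, packing, and kernel support, but the storage saving comes from the same trade.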

In this project, we take a quantized version of Llama2-7B, a large language model pretrained on 2 trillion tokens, and deploy it on the Jetson Orin NX. We use Machine Learning Compilation for Large Language Models (MLC LLM) to quantize the model and speed up inference. With the quantized Llama2-7B and MLC LLM running on the Jetson Orin NX, developers can build natural language processing applications that keep good accuracy and low latency on edge devices.

Hardware components

reComputer (or another Jetson-based device)

Install dependencies:

sudo apt-get update && sudo apt-get install git python3-pip
git clone --depth=1 https://github.com/dusty-nv/jetson-containers
cd jetson-containers && pip3 install -r requirements.txt
cd ./data && git clone https://github.com/LJ-Hao/MLC-LLM-on-Jetson-Nano.git MLC-LLM-on-Jetson && cd ..

Install and run the container

First step: install the image

./run.sh --env HUGGINGFACE_TOKEN=<YOUR-ACCESS-TOKEN> $(./autotag mlc) /bin/bash -c 'ln -s $(huggingface-downloader meta-llama/Llama-2-7b-chat-hf) /data/models/mlc/dist/models/Llama-2-7b-chat-hf'

Use sudo docker images to check whether the image has been installed.
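To make this check scriptable, the sketch below filters a Docker image listing for entries containing `mlc`. The helper names are hypothetical. It relies on `docker images --format "{{.Repository}}:{{.Tag}}"`, which prints one `repository:tag` per line, and the sample listing in the demo is made up apart from the image tag quoted in this guide.

```python
import subprocess
from typing import List


def find_images(listing: str, keyword: str = "mlc") -> List[str]:
    """Return the repository:tag entries whose name contains `keyword`."""
    return [line.strip() for line in listing.splitlines() if keyword in line]


def list_local_images() -> str:
    """Ask Docker for all local images, one repository:tag per line."""
    result = subprocess.run(
        ["sudo", "docker", "images", "--format", "{{.Repository}}:{{.Tag}}"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


if __name__ == "__main__":
    # Sample listing so the demo runs anywhere; on the Jetson, pass
    # list_local_images() to find_images() instead.
    sample = "dustynv/mlc:51fb0f4-builder-r35.4.1\nubuntu:20.04\n"
    matches = find_images(sample)
    if matches:
        print("MLC image(s) found:", ", ".join(matches))
    else:
        print("No MLC image found; rerun the first step.")
```

The matching `repository:tag` string is also the image name needed when entering the container in the third step.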


Second step: install Llama-2-7b-chat-hf and quantize the model with MLC

./run.sh $(./autotag mlc) \
python3 -m mlc_llm.build \
--model Llama-2-7b-chat-hf \
--quantization q4f16_ft \
--artifact-path /data/models/mlc/dist \
--max-seq-len 4096 \
--target cuda \
--use-cuda-graph \
--use-flash-attn-mqa
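Once the build finishes, it can help to confirm that the output exists before moving on. The sketch below assumes, and this is not stated in this guide, that `mlc_llm.build` writes its results to `<artifact-path>/<model>-<quantization>` with the converted weights in a `params` subdirectory. Both helper names are hypothetical; adjust the layout if your MLC version differs.

```python
from pathlib import Path


def expected_artifact_dir(artifact_path: str, model: str, quantization: str) -> Path:
    """Assumed output layout: <artifact-path>/<model>-<quantization>."""
    return Path(artifact_path) / f"{model}-{quantization}"


def build_looks_complete(artifact_dir: Path) -> bool:
    """Treat the build as finished if a non-empty params/ directory exists."""
    params = artifact_dir / "params"
    return params.is_dir() and any(params.iterdir())


if __name__ == "__main__":
    # Values taken from the build command; the path is as seen inside the container.
    out = expected_artifact_dir("/data/models/mlc/dist", "Llama-2-7b-chat-hf", "q4f16_ft")
    status = "looks complete" if build_looks_complete(out) else "missing or incomplete"
    print(f"{out}: {status}")
```

Run it inside the container, where `/data` is mounted, so the path resolves the same way as in the build command.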

Third step: run and enter the Docker container

./run.sh <YOUR IMAGE NAME>
# In my case: dustynv/mlc:51fb0f4-builder-r35.4.1 (see the image list from the first step)


Let's run it

Run Llama without MLC LLM quantization

cd /data/MLC-LLM-on-Jetson && python3 Llama-2-7b-chat-hf.py 


Without MLC quantization, the Jetson Orin NX 16GB can load the model but cannot run it.
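A rough memory estimate suggests one plausible reason for this. The sketch below counts weight storage only and takes "7B" as a round 7 billion parameters, which is an assumption. It ignores the KV cache, activations, and the per-group scales that quantized formats also store.

```python
GIB = 1024 ** 3  # bytes per gibibyte


def weight_memory_gib(num_params: float, bits_per_weight: float) -> float:
    """Approximate memory needed to hold the weights alone, in GiB."""
    return num_params * bits_per_weight / 8 / GIB


if __name__ == "__main__":
    params = 7e9  # assumption: "7B" taken as a round 7 billion parameters
    for label, bits in [("fp16", 16), ("4-bit", 4)]:
        print(f"{label}: {weight_memory_gib(params, bits):.1f} GiB")
```

Under these assumptions the fp16 weights come to about 13 GiB and the 4-bit weights to about 3.3 GiB. On a Jetson, the CPU and GPU share the same 16 GB of memory, so the fp16 weights leave little room for the operating system and inference buffers, while the 4-bit model leaves most of the memory free.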

Run Llama with MLC LLM quantization

cd /data/MLC-LLM-on-Jetson && python3 Llama-2-7b-chat-hf-q4f16_ft.py 

Here is the result:

[Screenshot: output of the quantized model]

Video of running Llama with MLC on Jetson Orin NX 16GB:

<iframe width="560" height="315" src="https://www.youtube.com/embed/c2zbIwrOYyk?si=RydTL8dqmz5KRFpr" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>

Project Outlook

In this project, we have demonstrated how to deploy a quantized version of Llama2-7B with MLC LLM on the Jetson Orin. With the powerful computing capabilities of the Jetson Orin, developers can build natural language processing applications that deliver high accuracy and low latency on edge devices. In the future, we will continue to explore the potential of deploying large language models on edge devices and develop more efficient and optimized deployment methods.