Meta Llama: Next generation of Meta's Language Model

TorchServe supports serving Meta Llama in a number of ways. The examples in this document range from an introductory chat app for someone new to TorchServe, to advanced use of micro-batching and streaming responses with Meta Llama.

🦙💬 Meta Llama Chatbot

This example shows how to deploy a Llama chat app using TorchServe. We use Streamlit to create the app.

This example uses llama-cpp-python.

You can run this example on your laptop to learn how to use TorchServe, how to scale TorchServe backend workers up and down, and how changing batch_size affects inference time.
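As a rough illustration, the sketch below shows how a llama-cpp-python model might be wrapped in a TorchServe custom handler; the GGUF file name and generation parameters are placeholders, not the values used by the example.

```python
# A minimal sketch of a TorchServe custom handler wrapping llama-cpp-python.
# The GGUF file name and max_tokens value below are placeholders.
from llama_cpp import Llama
from ts.torch_handler.base_handler import BaseHandler


class LlamaCppHandler(BaseHandler):
    def initialize(self, ctx):
        model_dir = ctx.system_properties.get("model_dir")
        # llama-cpp-python runs the quantized GGUF model on CPU, so this works on a laptop
        self.model = Llama(model_path=f"{model_dir}/llama-model.gguf")
        self.initialized = True

    def preprocess(self, requests):
        # TorchServe delivers a (micro-)batch of requests; extract the prompt from each
        prompts = []
        for req in requests:
            data = req.get("body") or req.get("data")
            prompts.append(data.decode("utf-8") if isinstance(data, (bytes, bytearray)) else str(data))
        return prompts

    def inference(self, prompts):
        return [self.model(prompt, max_tokens=128) for prompt in prompts]

    def postprocess(self, outputs):
        # One response per request in the batch
        return [out["choices"][0]["text"] for out in outputs]
```

Once the model is registered, workers can be scaled up or down through TorchServe's management API, for example `curl -X PUT "http://localhost:8081/models/<model_name>?min_worker=2"`, and batch_size is set when the model is registered, which is how the example lets you observe its effect on inference time.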

Chatbot Architecture

Meta Llama with HuggingFace

This example shows how to serve the meta-llama/Meta-Llama-3-70B-Instruct model with limited resources using HuggingFace. It shows the following optimizations: 1) HuggingFace accelerate, which can be activated with low_cpu_mem_usage=True; 2) quantization from bitsandbytes, using load_in_8bit=True. The model is first created on the meta device (with empty weights), and the state dict is then loaded into it (shard by shard in the case of a sharded checkpoint).
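A minimal sketch of this loading path is shown below; `device_map="auto"` is an added assumption here (it lets accelerate place the quantized layers on the available devices), and newer transformers releases express the 8-bit option through `BitsAndBytesConfig` instead of the flag used here.

```python
# A minimal sketch of the two optimizations described above; the gated
# meta-llama/Meta-Llama-3-70B-Instruct checkpoint requires access approval.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-70B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    low_cpu_mem_usage=True,  # accelerate: create the model on the meta device, then load shards into it
    load_in_8bit=True,       # bitsandbytes 8-bit quantization
    device_map="auto",       # assumption: let accelerate place layers on the available devices
)
```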

Llama 2 on Inferentia

This example shows how to serve the Llama 2 model on AWS Inferentia2 for text completion with micro batching and streaming response support.

Inferentia2 uses the Neuron SDK, which is built on top of the PyTorch XLA stack. For large model inference, the transformers-neuronx package handles model partitioning and running inference.
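The sketch below illustrates the general flow with transformers-neuronx, assuming the `LlamaForSampling` API; the checkpoint path, `tp_degree`, and sequence length are placeholders, and exact entry points can vary between Neuron SDK releases. The TorchServe example layers micro-batching and streaming responses on top of this.

```python
# A minimal sketch of compiling and sampling a Llama 2 model with transformers-neuronx.
# Paths and parallelism settings are placeholders, not the example's actual configuration.
import torch
from transformers import AutoTokenizer
from transformers_neuronx.llama.model import LlamaForSampling

model = LlamaForSampling.from_pretrained(
    "llama-2-13b-split",  # placeholder path to a saved/split checkpoint
    batch_size=1,
    tp_degree=8,          # tensor-parallel degree across NeuronCores
    amp="f16",
)
model.to_neuron()         # trace and compile the model for Inferentia2

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-13b-hf")
input_ids = tokenizer("The capital of France is", return_tensors="pt").input_ids
with torch.inference_mode():
    generated = model.sample(input_ids, sequence_length=128)
print([tokenizer.decode(seq, skip_special_tokens=True) for seq in generated])
```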

Inferentia 2 Software Stack