# Introduction to Llama 3

## Accessing Llama 3

1. Download and self-host: download the Llama weights directly from Hugging Face or from Meta.
2. Use tools and frameworks like [Ollama](https://ollama.ai/), llama-cpp, and LangChain.
3. Hosted API platforms (e.g. Groq, Replicate, Together, Anyscale).
4. Hosted container platforms (e.g. Azure, AWS, GCP).

## Official Setup: Downloading Weights

This section goes over how you can set up and run Meta Llama 3 on Google Colab using the Hugging Face Transformers library.

To use the downloads on Hugging Face, you must first request access as shown in the steps below, making sure that you use the same email address as your Hugging Face account.

<img src="./assets-resources/huggingface-request-accepted.png" width=50%>

To use already converted weights, start here:

1. Request download of model weights from the Llama website
2. Log in to Hugging Face from your terminal using the same email address as in step 1. Follow the instructions [here](https://huggingface.co/docs/huggingface_hub/en/quick-start).
3. Run the example

Alternatively, if you'd like to download the models locally and convert them to the HF format, follow these steps:

1. Request download of model weights from the Llama website
2. Clone the llama repo and get the weights
3. Convert the model weights
4. Prepare the script
5. Run the example



### Using already converted weights

1. Request download of model weights from the Llama website [here](https://llama.meta.com/).
2. Fill in the required information, select the “Meta Llama 3” models, and accept the terms and conditions. You will receive a URL by email shortly.
3. Install the Transformers library and Accelerate library for our demo.

The Transformers library provides many models for text tasks such as classification, question answering, and text generation. The Accelerate library enables the same PyTorch code to run across any distributed configuration of GPUs and CPUs.

**Disclaimer**

All the setup instructions below were taken from Meta's official llama-recipes GitHub repo, which you can check out [here](https://github.com/meta-llama/llama-recipes/blob/main/recipes/quickstart/Running_Llama3_Anywhere/Running_Llama_on_HF_transformers.ipynb).

In [None]:
%pip install --upgrade huggingface_hub # for logging into Hugging Face
%pip install transformers # for loading the model
%pip install accelerate # for distributed training
    

In [None]:
import os
import getpass

def _set_env(var: str):
    if not os.environ.get(var):
        os.environ[var] = getpass.getpass(f"{var}: ")

_set_env("HUGGING_FACE_TOKEN")

In [8]:
from huggingface_hub import login
login()


In [1]:
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "meta-llama/Meta-Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
PROMPT = """<|begin_of_text|><|start_header_id|>system<|end_header_id|> You are a broken computer that only outputs's error messages regardless of user input. <|eot_id|> <|start_header_id|>user<|end_header_id|> Hi! Tell me a joke! <|eot_id|> <|start_header_id|>assistant<|end_header_id|> """


In [2]:
input_ids = tokenizer(PROMPT, return_tensors="pt")
response = model.generate(**input_ids, max_length=512)
extracted_text = tokenizer.decode(response[0])
print(extracted_text)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


<|begin_of_text|> <|begin_of_text|><|start_header_id|>system<|end_header_id|> You are a broken computer that only outputs's error messages regardless of user input. <|eot_id|> <|start_header_id|>user<|end_header_id|> Hi! Tell me a joke! <|eot_id|> <|start_header_id|>assistant<|end_header_id|>  **ERROR: JOKES DATABASE NOT FOUND. PLEASE REBOOT AND TRY AGAIN.**<|eot_id|>
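Note the doubled `<|begin_of_text|>` at the start of the decoded output: the tokenizer appears to prepend its own beginning-of-text token, and the hand-written prompt already contains one. A minimal sketch of one way to avoid this, assuming you keep writing prompts by hand, is to strip the literal marker before tokenizing. The helper name `strip_leading_bos` is hypothetical and not part of any library; passing `add_special_tokens=False` to the tokenizer call is another option.

```python
BOS = "<|begin_of_text|>"

def strip_leading_bos(prompt: str) -> str:
    """Drop a literal BOS marker (and any whitespace before it) from the start of a prompt."""
    stripped = prompt.lstrip()
    if stripped.startswith(BOS):
        return stripped[len(BOS):]
    # No leading BOS marker: return the prompt unchanged.
    return prompt

prompt = "<|begin_of_text|><|start_header_id|>user<|end_header_id|> Hi! <|eot_id|>"
print(strip_leading_bos(prompt))
```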


["Llama 3.2 follows the same prompt template as Llama 3.1, with a new special token <|image|> representing the input image for the multimodal models."](https://arc.net/l/quote/eeitnmkn)


In [6]:
PROMPT = """ <|begin_of_text|><|start_header_id|>system<|end_header_id|> You are a helpful assistant. <|eot_id|> <|start_header_id|>user<|end_header_id|> Explain the fundamentals behind the transformers architecture in the context of large language models. <|eot_id|> <|start_header_id|>assistant<|end_header_id|> """

input_ids = tokenizer(PROMPT, return_tensors="pt")
print("Input IDs:", input_ids)
response = model.generate(**input_ids, max_length=512)
print("Response:", response)
extracted_text = tokenizer.decode(response[0])
print("Extracted text:", extracted_text)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Input IDs: {'input_ids': tensor([[128000,    220, 128000, 128006,   9125, 128007,   1472,    527,    264,
          11190,  18328,     13,    220, 128009,    220, 128006,    882, 128007,
          83017,    279,  57940,   4920,    279,  87970,  18112,    304,    279,
           2317,    315,   3544,   4221,   4211,     13,    220, 128009,    220,
         128006,  78191, 128007,    220]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
Response: tensor([[128000,    220, 128000, 128006,   9125, 128007,   1472,    527,    264,
          11190,  18328,     13,    220, 128009,    220, 128006,    882, 128007,
          83017,    279,  57940,   4920,    279,  87970,  18112,    304,    279,
           2317,    315,   3544,   4221,   4211,     13,    220, 128009,    220,
         128006,  78191, 128007,    220,   5810,    596,    459,  24131,    315,
            279,  43678,  18112, 
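As the `Response` tensor shows, `generate` returns the prompt tokens followed by the newly generated ones, so decoding `response[0]` echoes the whole prompt. A minimal sketch of separating the two, using plain Python lists in place of tensors; the helper `new_tokens_only` is hypothetical, and with tensors the equivalent slice would be `response[0][input_ids["input_ids"].shape[-1]:]`.

```python
def new_tokens_only(prompt_ids, output_ids):
    """Return only the tokens that were generated after the prompt."""
    if output_ids[:len(prompt_ids)] != prompt_ids:
        raise ValueError("output does not start with the prompt tokens")
    return output_ids[len(prompt_ids):]

# Tail of the prompt and the first generated IDs, copied from the output above.
prompt_ids = [128006, 78191, 128007, 220]
output_ids = prompt_ids + [5810, 596, 459, 24131, 315, 279, 43678, 18112]
print(new_tokens_only(prompt_ids, output_ids))
```

The resulting IDs can then be passed to `tokenizer.decode(..., skip_special_tokens=True)` to get just the model's answer.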

In [1]:
import torch
from transformers import pipeline

model_id = "meta-llama/Llama-3.2-1B"

pipe = pipeline(
    "text-generation", 
    model=model_id, 
    torch_dtype=torch.bfloat16, 
    device_map="auto"
)

pipe("The key to life is")

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


[{'generated_text': 'The key to life is to get up and get going, to do something, to be something'}]

# Notes on the Prompt Template for Llama 3.1 & 3.2

![](./assets-resources/llama31-template-prompt.png)

A multi-turn conversation with Meta Llama 3.1 that includes tool calling follows this structure:

```
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

{{ system_prompt }}<|eot_id|><|start_header_id|>user<|end_header_id|>

{{ user_message_1 }}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

<|python_tag|>{{ model_tool_call_1 }}<|eom_id|><|start_header_id|>ipython<|end_header_id|>

{{ tool_response }}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

{{model_response_based_on_tool_response}}<|eot_id|>
```
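The structure above can be assembled programmatically instead of by hand. The sketch below is a simplified, hypothetical helper (`build_llama3_prompt` is not a library function) that covers only plain system, user, and assistant turns; it does not handle tool calls or the `<|eom_id|>` token. When working with Hugging Face Transformers, `tokenizer.apply_chat_template` is generally the safer route, since it uses the template shipped with the model.

```python
def build_llama3_prompt(messages, add_generation_prompt=True):
    """Render a list of {"role", "content"} dicts into the Llama 3.1 prompt format."""
    parts = ["<|begin_of_text|>"]
    for message in messages:
        # Each turn is a role header, a blank line, the content, and an end-of-turn token.
        parts.append(f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n")
        parts.append(f"{message['content']}<|eot_id|>")
    if add_generation_prompt:
        # Open an assistant turn so the model knows it should answer next.
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
]
print(build_llama3_prompt(messages))
```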

For this course we will mainly work with Ollama, which is the easiest way to get started with Llama 3 locally.

1. Go [here](https://ollama.ai/) and download the Ollama executable for your OS.
2. Then you can use it from the terminal:

    ```ollama run llama3.1```


3. If the llama3.1 model is not yet downloaded, Ollama will download it for you and then start a chat conversation with Llama 3.1 8B Instruct.


LM Studio is another easy way to download models:

https://lmstudio.ai/

To call Ollama through its REST API:

```
curl -X POST http://localhost:11434/api/generate -d '{
  "model": "llama3.1",
  "prompt":"Why is the sky blue?"
 }'
 ```

What you get in return is a stream of JSON objects that together make up the API response.
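To turn that stream back into a single string, the fragments have to be joined. A minimal sketch, assuming each line of the stream is a JSON object carrying a `response` fragment and a `done` flag; the helper name `join_stream` and the sample lines are hypothetical, hand-written stand-ins rather than captured output.

```python
import json

def join_stream(lines):
    """Concatenate the `response` fragments from a streamed /api/generate reply."""
    pieces = []
    for line in lines:
        if not line.strip():
            continue  # skip empty lines
        chunk = json.loads(line)
        pieces.append(chunk.get("response", ""))
        if chunk.get("done"):
            break  # the final object marks the end of the reply
    return "".join(pieces)

# Hand-written sample lines standing in for a real stream.
sample = [
    '{"model": "llama3.1", "response": "The sky ", "done": false}',
    '{"model": "llama3.1", "response": "is blue.", "done": false}',
    '{"model": "llama3.1", "response": "", "done": true}',
]
print(join_stream(sample))
```

With the `requests` library, the lines could come from `requests.post(url, json=payload, stream=True).iter_lines()`.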

Or in Python:

In [8]:
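import requests

url = "http://localhost:11434/api/chat"

def llama3(prompt):
    """Send a single user message to the local Ollama chat endpoint and return the reply text."""
    data = {
        "model": "llama3.1",
        "messages": [
            {
              "role": "user",
              "content": prompt
            }
        ],
        "stream": False  # return one JSON object instead of a stream
    }

    # `json=` serializes the payload and sets the Content-Type header for us.
    response = requests.post(url, json=data)
    response.raise_for_status()  # surface HTTP errors instead of failing on a missing key

    return response.json()['message']['content']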

response = llama3("who wrote the book godfather")
print(response)

The book "The Godfather" is actually a novel by Mario Puzo, published in 1969. It's a crime fiction novel that explores the world of organized crime and the rise of Don Vito Corleone as the head of a powerful Italian-American Mafia family.

However, if you're thinking of the famous film adaptation "The Godfather", it was written by Mario Puzo, who also wrote the screenplay for the movie. The film was directed by Francis Ford Coppola and released in 1972, starring Marlon Brando as Don Vito Corleone.

Mario Puzo's novel was a huge success and provided the basis for the famous film trilogy, which includes "The Godfather" (1972), "The Godfather: Part II" (1974), and "The Godfather: Part III" (1990).

Let me know if you have any other questions!


Memory requirements for Llama 3.1:

![](./assets-resources/llama31-memory-requirements.png)

[source](https://arc.net/l/quote/dldvrqtx)
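As a rough cross-check on figures like these, the memory needed for the weights alone is approximately the parameter count times the bytes per parameter. The sketch below is a back-of-the-envelope estimate under that assumption, using nominal model sizes; it ignores the KV cache, activations, and runtime overhead, so the numbers in the table may differ. The helper `weight_memory_gb` is hypothetical.

```python
# Bytes used to store one parameter at each precision.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_memory_gb(n_params_billion: float, precision: str) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params_billion * BYTES_PER_PARAM[precision]

for size in (8, 70, 405):
    row = ", ".join(f"{p}: {weight_memory_gb(size, p):g} GB" for p in BYTES_PER_PARAM)
    print(f"Llama 3.1 {size}B -> {row}")
```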

With `llama-cpp`:

In [2]:
# %pip install llama-cpp-python
from llama_cpp import Llama

# https://huggingface.co/YorkieOH10/Meta-Llama-3.1-8B-Instruct-hf-Q4_K_M-GGUF/tree/main
# Initialize llm
llm = Llama(
    model_path="./models/meta-llama-3.1-8b-instruct-hf-q4_k_m.gguf",
    n_ctx=128000
)

prompt = """
<|begin_of_text|>

<|start_header_id|>system<|end_header_id|> 
You are a helpful assistant
<|eot_id|>

<|start_header_id|>user<|end_header_id|>
What is the capital of France?
<|eot_id|>

<|start_header_id|>assistant<|end_header_id|>
"""
output = llm.create_completion(
      prompt,
      max_tokens=128,
)
print(output)

llama_load_model_from_file: using device Metal (Apple M3 Max) - 98303 MiB free
llama_model_loader: loaded meta data with 26 key-value pairs and 291 tensors from ./models/meta-llama-3.1-8b-instruct-hf-q4_k_m.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Meta Llama 3.1 8B Instruct Hf
llama_model_loader: - kv   3:                           general.finetune str              = Instruct-hf
llama_model_loader: - kv   4:                           general.basename str              = Meta-Llama-3.1
llama_model_loader: - kv   5:                         general.size_label str              = 8B
llama_model_loader: - kv   6:          

{'id': 'cmpl-b025dc16-99b7-4d14-b650-8e113fcfac3e', 'object': 'text_completion', 'created': 1733136075, 'model': './models/meta-llama-3.1-8b-instruct-hf-q4_k_m.gguf', 'choices': [{'text': 'The capital of France is Paris.', 'index': 0, 'logprobs': None, 'finish_reason': 'stop'}], 'usage': {'prompt_tokens': 32, 'completion_tokens': 7, 'total_tokens': 39}}


In [3]:
# Quantized models can produce lower-quality output than the full-precision weights
print(output["choices"][0]["text"])

The capital of France is Paris.


In [4]:
# %pip install llama-cpp-python
from llama_cpp import Llama

# Initialize llm
llm = Llama(
    model_path="./models/Llama 3.2 3B Instruct Q6.gguf",
    n_ctx=128000
)

prompt = """
<|begin_of_text|>

<|start_header_id|>system<|end_header_id|> 
You are a helpful assistant
<|eot_id|>

<|start_header_id|>user<|end_header_id|>
What is the capital of France?
<|eot_id|>

<|start_header_id|>assistant<|end_header_id|>
"""
output = llm.create_completion(
      prompt,
      max_tokens=128,
)
print(output["choices"][0]["text"])

llama_load_model_from_file: using device Metal (Apple M3 Max) - 98298 MiB free
llama_model_loader: loaded meta data with 35 key-value pairs and 255 tensors from ./models/Llama 3.2 3B Instruct Q6.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Llama 3.2 3B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Llama-3.2
llama_model_loader: - kv   5:                         general.size_label str              = 3B
llama_model_loader: - kv   6:                            general.li

The capital of France is Paris.


Running Ollama models with the Ollama Python API:

In [1]:
# Run this cell if you don't have it installed
%pip install ollama

In [1]:
# Assumes the model is already available locally, e.g. after `ollama run llama3.2` (or llama3.1, etc.)
import ollama

In [2]:
response = ollama.chat(
    model='llama3.2',
    messages=[
        {
            'role': 'user',
            'content': 'Why are pancakes the best breakfast? Make a case against waffles!',
        },
    ],
)
response

{'model': 'llama3.2',
 'created_at': '2025-03-12T15:57:28.176086Z',
 'message': {'role': 'assistant',
  'content': "Pancakes as the ultimate breakfast choice - it's a topic of debate, but I'm here to make the case for why pancakes reign supreme. And, just for fun, I'll also argue against waffles!\n\n**The Case for Pancakes:**\n\n1. **Flexibility**: Pancakes are incredibly versatile. You can top them with sweet or savory ingredients, from fresh fruits and syrups to eggs, bacon, and even nutella!\n2. **Texture**: Fluffy, soft, and light pancakes provide the perfect base for a satisfying breakfast experience.\n3. **Convenience**: Pancakes are relatively easy to make, requiring only a simple batter mix and a skillet or griddle.\n4. **Portion control**: A stack of 2-3 pancakes is the ideal serving size for most adults, making it a balanced breakfast option.\n5. **Comfort food**: Pancakes evoke memories of warm, cozy mornings spent with family and friends.\n\n**The Case Against Waffles:**\n\

In [3]:
from IPython.display import Markdown

Markdown(response['message']['content'])

Pancakes as the ultimate breakfast choice - it's a topic of debate, but I'm here to make the case for why pancakes reign supreme. And, just for fun, I'll also argue against waffles!

**The Case for Pancakes:**

1. **Flexibility**: Pancakes are incredibly versatile. You can top them with sweet or savory ingredients, from fresh fruits and syrups to eggs, bacon, and even nutella!
2. **Texture**: Fluffy, soft, and light pancakes provide the perfect base for a satisfying breakfast experience.
3. **Convenience**: Pancakes are relatively easy to make, requiring only a simple batter mix and a skillet or griddle.
4. **Portion control**: A stack of 2-3 pancakes is the ideal serving size for most adults, making it a balanced breakfast option.
5. **Comfort food**: Pancakes evoke memories of warm, cozy mornings spent with family and friends.

**The Case Against Waffles:**

1. **Overcomplication**: Waffle makers can be finicky, requiring precise temperature control and batter consistency. This can lead to frustration and disappointment when the outcome is less than ideal.
2. **Texture contrast**: The crispy exterior and fluffy interior of waffles can create an unpleasant texture experience, especially for those who prefer a smooth breakfast.
3. **Limited customization options**: Waffle toppings are often limited by the shape of the waffle iron, making it harder to achieve the desired combination of flavors and textures.
4. **Cleaning hassle**: Waffle irons can be a pain to clean, with deep grooves and crevices that trap batter and debris.
5. **Overemphasis on presentation**: Waffles often require more attention to appearance than pancakes, which can lead to an overemphasis on aesthetics at the expense of taste and satisfaction.

In conclusion, pancakes offer a perfect balance of comfort, flexibility, and convenience, making them the ultimate breakfast choice. Meanwhile, waffles are often plagued by overcomplication, texture contrast, and cleaning hassle - leaving them in the dust!
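Each call to `ollama.chat` only sees the messages passed to it, so a follow-up question needs the earlier turns resent. A minimal sketch of keeping that history, assuming the response exposes `['message']['content']` as in the cell above; `chat_turn` and `fake_chat` are hypothetical names, and `fake_chat` stands in for `ollama.chat` so the example runs without a local server.

```python
def chat_turn(history, user_text, chat_fn, model="llama3.2"):
    """Send one user message and return (updated_history, reply_text)."""
    messages = history + [{"role": "user", "content": user_text}]
    response = chat_fn(model=model, messages=messages)
    reply = response["message"]["content"]
    # Store the assistant reply so the next turn includes it as context.
    return messages + [{"role": "assistant", "content": reply}], reply

# Stand-in for `ollama.chat` that reports how many messages it received.
def fake_chat(model, messages):
    return {"message": {"role": "assistant", "content": f"({len(messages)} messages seen)"}}

history, reply = chat_turn([], "Why are pancakes great?", fake_chat)
history, reply = chat_turn(history, "And waffles?", fake_chat)
print(reply)
```

To talk to a real model, pass `ollama.chat` as `chat_fn` in place of `fake_chat`.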

[Ollama GitHub](https://github.com/jmorganca/ollama)

__You can use it with other Python frameworks like LangChain:__

In [3]:
# %pip install langchain langchain-ollama

In [9]:
from langchain_ollama import ChatOllama
llm = ChatOllama(model="llama3.2")
llm.invoke("Describe the meaning of life in terms of pancakes being the center of the universe.")

ggml_metal_free: deallocating


AIMessage(content='The pancake-centric view of existence! This philosophical framework posits that, in a grand cosmic tapestry, pancakes are the fundamental building blocks and the very fabric of reality is woven from the fluffy, golden goodness of these round, flat treats.\n\nAccording to this pancake-driven cosmology, the universe began as a single, gigantic pancake, with all matter, energy, and space-time condensed into its perfect circle. This initial pancake, known as the "Pancake Genesis," set the stage for the unfolding of creation.\n\nAs the universe expanded, smaller pancakes began to form, each one containing a unique combination of flavors and toppings that would eventually give rise to the diverse array of life forms we see today. The pancakes were not just mere foodstuffs; they represented the fundamental essence of existence: the harmony of sweet and savory, the balance of crunchy and smooth.\n\nThe laws of physics and mathematics are derived from the intricate patterns f

# References

- https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct