**Disclaimer**

All the setup instructions below were taken from Meta's official llama-recipes GitHub repo; you can check it out [here](https://github.com/meta-llama/llama-recipes/blob/main/recipes/quickstart/Running_Llama3_Anywhere/Running_Llama_on_HF_transformers.ipynb).

# Introduction to Llama 3.1

## Accessing Llama 3.1

1. Download + Self Host (i.e. download Llama weights directly with Hugging Face or from Meta.ai)
2. Use tools and frameworks like [Ollama](https://ollama.ai/), llama-cpp, LangChain
3. Hosted API Platform (e.g. Groq, Replicate, Together, Anyscale)
4. Hosted Container Platform (e.g. Azure, AWS, GCP)

## Official Setup Downloading Weights
This notebook goes over how you can set up and run Meta Llama 3 on Google Colab using the Hugging Face transformers library.

To use the downloads on Hugging Face, you must first request a download as shown in the steps below, making sure that you use the same email address as your Hugging Face account.

To use already converted weights, start here:

1. Request download of model weights from the Llama website
2. Log in to Hugging Face from your terminal using the same email address as in (1). Follow the instructions [here](https://huggingface.co/docs/huggingface_hub/en/quick-start).
3. Run the example

Otherwise, if you'd like to download the models locally and convert them to the HF format, follow these steps:

1. Request download of model weights from the Llama website
2. Clone the llama repo and get the weights
3. Convert the model weights
4. Prepare the script
5. Run the example

### Using already converted weights

1. Request download of model weights from the Llama website [here](https://llama.meta.com/).
2. Fill in the required information, select the “Meta Llama 3” models, and accept the terms & conditions. You will receive a URL by email shortly afterwards.
3. Install the Transformers library and Accelerate library for our demo.

The Transformers library provides many models for text tasks such as classification, question answering, and text generation. The Accelerate library enables the same PyTorch code to run across any distributed configuration of GPUs and CPUs.

In [None]:
!pip install --upgrade huggingface_hub
!pip install transformers
!pip install accelerate

In [8]:
from huggingface_hub import login
login()


In [2]:
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "meta-llama/Meta-Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
PROMPT = """ <|begin_of_text|><|start_header_id|>system<|end_header_id|> You are a broken computer that only outputs's error messages regardless of user input. <|eot_id|> <|start_header_id|>user<|end_header_id|> Hi! Tell me a joke! <|eot_id|> <|start_header_id|>assistant<|end_header_id|> """

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

In [4]:
input_ids = tokenizer(PROMPT, return_tensors="pt")
response = model.generate(**input_ids, max_length=512)
extracted_text = tokenizer.decode(response[0])
print(extracted_text)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


<|begin_of_text|> <|begin_of_text|><|start_header_id|>system<|end_header_id|> You are a broken computer that only outputs's error messages regardless of user input. <|eot_id|> <|start_header_id|>user<|end_header_id|> Hi! Tell me a joke! <|eot_id|> <|start_header_id|>assistant<|end_header_id|>  **KERNEL PANIC: Unable to process request. Error: invalid input. Please restart system. **<|eot_id|>
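The decoded output above contains the whole prompt followed by the model's answer. The sketch below shows one way to pull out just the assistant's reply by splitting on the assistant header and cutting at the first end-of-turn marker. The helper name `extract_assistant_reply` and the shortened sample string are illustrative assumptions, not part of the original notebook. In a live session, slicing the generated IDs at the prompt length before decoding is an equally valid route.

```python
ASSISTANT_HEADER = "<|start_header_id|>assistant<|end_header_id|>"
END_MARKERS = ("<|eot_id|>", "<|eom_id|>", "<|end_of_text|>")


def extract_assistant_reply(decoded: str) -> str:
    """Return only the text generated after the last assistant header."""
    # Everything before the last assistant header is the echoed prompt.
    _, found, tail = decoded.rpartition(ASSISTANT_HEADER)
    if not found:
        # No header present: return the input unchanged (minus whitespace).
        return decoded.strip()
    # Cut at the earliest end-of-turn marker, if the model emitted one.
    cut = len(tail)
    for marker in END_MARKERS:
        idx = tail.find(marker)
        if idx != -1:
            cut = min(cut, idx)
    return tail[:cut].strip()


# Shortened stand-in for the decoded text printed above.
decoded = (
    "<|begin_of_text|><|start_header_id|>user<|end_header_id|> Hi! Tell me a joke! <|eot_id|> "
    "<|start_header_id|>assistant<|end_header_id|>  **KERNEL PANIC: Unable to process request.**<|eot_id|>"
)
print(extract_assistant_reply(decoded))
```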


In [6]:
PROMPT = """ <|begin_of_text|><|start_header_id|>system<|end_header_id|> You are a helpful assistant. <|eot_id|> <|start_header_id|>user<|end_header_id|> Explain the fundamentals behind the transformers architecture in the context of large language models. <|eot_id|> <|start_header_id|>assistant<|end_header_id|> """

input_ids = tokenizer(PROMPT, return_tensors="pt")
print("Input IDs:", input_ids)
response = model.generate(**input_ids, max_length=512)
print("Response:", response)
extracted_text = tokenizer.decode(response[0])
print("Extracted text:", extracted_text)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Input IDs: {'input_ids': tensor([[128000,    220, 128000, 128006,   9125, 128007,   1472,    527,    264,
          11190,  18328,     13,    220, 128009,    220, 128006,    882, 128007,
          83017,    279,  57940,   4920,    279,  87970,  18112,    304,    279,
           2317,    315,   3544,   4221,   4211,     13,    220, 128009,    220,
         128006,  78191, 128007,    220]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
Response: tensor([[128000,    220, 128000, 128006,   9125, 128007,   1472,    527,    264,
          11190,  18328,     13,    220, 128009,    220, 128006,    882, 128007,
          83017,    279,  57940,   4920,    279,  87970,  18112,    304,    279,
           2317,    315,   3544,   4221,   4211,     13,    220, 128009,    220,
         128006,  78191, 128007,    220,   5810,    596,    459,  24131,    315,
            279,  43678,  18112, 
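The ID 128000 appears twice at the start of the tensors above, which suggests the tokenizer prepends its own begin-of-text token on top of the `<|begin_of_text|>` already written in the prompt. The sketch below annotates the first few IDs to make that visible. The ID-to-name mapping is inferred by aligning the printed IDs with the prompt text, and the helper names are illustrative. If the duplicate is unwanted, one option (an assumption to verify in your setup) is to drop the token from the prompt string or to pass `add_special_tokens=False` to the tokenizer call.

```python
# Special-token IDs as they appear in the printed tensors above; the names
# are inferred by aligning the IDs with the prompt text.
SPECIAL_TOKENS = {
    128000: "<|begin_of_text|>",
    128006: "<|start_header_id|>",
    128007: "<|end_header_id|>",
    128009: "<|eot_id|>",
}

# First tokens of the "Input IDs" tensor printed above.
input_ids = [128000, 220, 128000, 128006, 9125, 128007, 1472, 527, 264,
             11190, 18328, 13, 220, 128009]


def annotate(ids):
    """Replace known special-token IDs with their names; keep others as ints."""
    return [SPECIAL_TOKENS.get(i, i) for i in ids]


def count_bos(ids, bos_id=128000):
    """Count how many times the begin-of-text token occurs."""
    return sum(1 for i in ids if i == bos_id)


print(annotate(input_ids)[:6])
print("BOS count:", count_bos(input_ids))
```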

In [21]:
# import transformers

# tokenizer = transformers.AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")

# pipeline = transformers.pipeline(
# "text-generation",
#       model=model,
#       tokenizer=tokenizer,
#       device_map="auto",
# )

In [22]:
# sequences = pipeline(
#     'I have tomatoes, basil and cheese at home. What can I cook for dinner?\n',
#     do_sample=True,
#     top_k=10,
#     num_return_sequences=1,
#     eos_token_id=tokenizer.eos_token_id,
#     truncation = True,
#     max_length=400,
# )

# for seq in sequences:
#     print(f"Result: {seq['generated_text']}")

# Notes on the Prompt Template for Llama 3.1

![](./assets-resources/llama31-template-prompt.png)

A multi-turn conversation with Meta Llama 3.1 that includes tool calling follows this structure:

```
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

{{ system_prompt }}<|eot_id|><|start_header_id|>user<|end_header_id|>

{{ user_message_1 }}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

<|python_tag|>{{ model_tool_call_1 }}<|eom_id|><|start_header_id|>ipython<|end_header_id|>

{{ tool_response }}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

{{model_response_based_on_tool_response}}<|eot_id|>
```
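Writing this template by hand is error-prone, so here is a minimal sketch of a helper that renders plain system, user, and assistant turns into the structure above. The function name `build_llama31_prompt` and the message format (a list of `role`/`content` dicts) are assumptions for illustration. Tool-calling turns (`<|python_tag|>`, `<|eom_id|>`, the `ipython` role) are deliberately left out. With the Hugging Face tokenizer loaded, `tokenizer.apply_chat_template` is the library-provided way to do the same job.

```python
def build_llama31_prompt(messages, add_generation_prompt=True):
    """Render a list of {"role", "content"} dicts into the template above."""
    parts = ["<|begin_of_text|>"]
    for message in messages:
        # Each turn: role header, blank line, content, end-of-turn token.
        parts.append(
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content']}<|eot_id|>"
        )
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


prompt = build_llama31_prompt([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
])
print(prompt)
```

Note that the result already starts with `<|begin_of_text|>`, so a tokenizer that adds this token on its own would duplicate it.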

For this course we will mainly work with Ollama, which is the easiest way to get started with Llama 3.1 locally.

1. Go [here](https://ollama.ai/) and download the Ollama executable for your OS.
2. Then you can use it from the terminal:

    ```ollama run llama3.1```


3. If the llama3.1 model is not yet downloaded, Ollama will download it for you and then start a chat conversation with Llama 3.1 8B Instruct.


To run it as a REST API call:

```
curl -X POST http://localhost:11434/api/generate -d '{
  "model": "llama3.1",
  "prompt":"Why is the sky blue?"
 }'
 ```

What you get in return is a stream with the API response.
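To my understanding of the Ollama API, that stream is newline-delimited JSON: each line is an object carrying a `response` fragment and a `done` flag. The sketch below joins the fragments into one string. The helper name `collect_stream` is illustrative, and the sample chunks are simulated to show the shape of the stream; they are not real model output.

```python
import json


def collect_stream(lines):
    """Join the "response" fragments from newline-delimited JSON chunks."""
    pieces = []
    for raw in lines:
        if not raw:
            continue  # skip blank keep-alive lines
        if isinstance(raw, bytes):
            raw = raw.decode("utf-8")
        chunk = json.loads(raw)
        pieces.append(chunk.get("response", ""))
        if chunk.get("done"):
            break  # the final chunk marks the end of the stream
    return "".join(pieces)


# Simulated chunks shaped like the stream; not real model output.
sample = [
    b'{"model": "llama3.1", "response": "Because ", "done": false}',
    b'{"model": "llama3.1", "response": "of scattering.", "done": false}',
    b'{"model": "llama3.1", "response": "", "done": true}',
]
print(collect_stream(sample))
```

Against a running server, the same function could be fed `requests.post(url, json=payload, stream=True).iter_lines()`.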

Or in Python

In [8]:
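import requests

# Ollama's local chat endpoint (default port 11434)
url = "http://localhost:11434/api/chat"

def llama3(prompt):
    # Single-turn chat request; "stream": False returns one JSON object
    data = {
        "model": "llama3.1",
        "messages": [
            {
              "role": "user",
              "content": prompt
            }
        ],
        "stream": False
    }

    # json=data serializes the payload and sets the Content-Type header
    response = requests.post(url, json=data)
    # Fail loudly on HTTP errors instead of a confusing KeyError below
    response.raise_for_status()

    return response.json()['message']['content']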

response = llama3("who wrote the book godfather")
print(response)

The book "The Godfather" is actually a novel by Mario Puzo, published in 1969. It's a crime fiction novel that explores the world of organized crime and the rise of Don Vito Corleone as the head of a powerful Italian-American Mafia family.

However, if you're thinking of the famous film adaptation "The Godfather", it was written by Mario Puzo, who also wrote the screenplay for the movie. The film was directed by Francis Ford Coppola and released in 1972, starring Marlon Brando as Don Vito Corleone.

Mario Puzo's novel was a huge success and provided the basis for the famous film trilogy, which includes "The Godfather" (1972), "The Godfather: Part II" (1974), and "The Godfather: Part III" (1990).

Let me know if you have any other questions!


Memory requirements for Llama 3.1:

![](./assets-resources/llama31-memory-requirements.png)

[source](https://arc.net/l/quote/dldvrqtx)
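A rough rule of thumb behind tables like this one is that the memory needed for the weights is about the parameter count times the bytes per parameter. The sketch below applies it to the three Llama 3.1 sizes. It covers the weights only (the KV cache and activations come on top), and the figures it prints are back-of-the-envelope estimates, not values read from the image above.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(params_billions, precision):
    """Approximate memory for the weights alone, in GB (10^9 bytes)."""
    # (params_billions * 1e9 params) * bytes / 1e9 bytes per GB
    return params_billions * BYTES_PER_PARAM[precision]


for size in (8, 70, 405):
    row = ", ".join(
        f"{precision}: {weight_memory_gb(size, precision):g} GB"
        for precision in BYTES_PER_PARAM
    )
    print(f"Llama 3.1 {size}B -> {row}")
```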

With `llama-cpp`:

In [4]:
from llama_cpp import Llama

# Initialize llm
llm = Llama(
    model_path="./models/meta-llama-3.1-8b-instruct-hf-q4_k_m.gguf",
    n_ctx=128000
)

prompt = """
<|begin_of_text|>

<|start_header_id|>system<|end_header_id|>
You are a helpful assistant
<|eot_id|>

<|start_header_id|>user<|end_header_id|>
What is the capital of France?
<|eot_id|>

<|start_header_id|>assistant<|end_header_id|>
"""
output = llm.create_completion(
      prompt,
      max_tokens=128,
)
print(output)

llama_model_loader: loaded meta data with 26 key-value pairs and 291 tensors from ./models/meta-llama-3.1-8b-instruct-hf-q4_k_m.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Meta Llama 3.1 8B Instruct Hf
llama_model_loader: - kv   3:                           general.finetune str              = Instruct-hf
llama_model_loader: - kv   4:                           general.basename str              = Meta-Llama-3.1
llama_model_loader: - kv   5:                         general.size_label str              = 8B
llama_model_loader: - kv   6:                          llama.block_count u32              = 32
llama_model_loader: - k

{'id': 'cmpl-0a578785-de59-407d-b232-f83fff1c3e78', 'object': 'text_completion', 'created': 1723112667, 'model': './models/meta-llama-3.1-8b-instruct-hf-q4_k_m.gguf', 'choices': [{'text': "The capital of France is Paris.assistant\n\nWould you like to know more about Paris or France in general?assistant\n\nThere's a lot to explore! Would you like some information on famous landmarks, cultural events, cuisine, or something else?assistant\n\nYou can also ask me questions about other countries and capitals if you'd like. Or we could play a quiz game where I'll give you the capital of a country, and you try to guess which country it is!assistant\n\nLet's say you want to know more about Paris... What would you like to know? Would", 'index': 0, 'logprobs': None, 'finish_reason': 'length'}], 'usage': {'prompt_tokens': 33, 'completion_tokens': 128, 'total_tokens': 161}}


In [5]:
# Quantization can lower output quality; the repeated "assistant" markers, however,
# suggest generation ran past the end of the first turn (finish_reason is 'length')
print(output["choices"][0]["text"])

The capital of France is Paris.assistant

Would you like to know more about Paris or France in general?assistant

There's a lot to explore! Would you like some information on famous landmarks, cultural events, cuisine, or something else?assistant

You can also ask me questions about other countries and capitals if you'd like. Or we could play a quiz game where I'll give you the capital of a country, and you try to guess which country it is!assistant

Let's say you want to know more about Paris... What would you like to know? Would


Running Ollama models with the Ollama Python API:

In [1]:
!pip install ollama

In [4]:
import ollama

In [5]:
response = ollama.chat(model='llama3.1', messages=[
  {
    'role': 'user',
    'content': 'Why are pancakes the best breakfast? Make a case against waffles!',
  },
])
response

{'model': 'llama3.1',
 'created_at': '2024-08-08T09:23:25.735924Z',
 'message': {'role': 'assistant',
  'content': "The age-old debate: pancakes vs waffles. While both are delicious breakfast options, I'd like to present some arguments for why pancakes are the superior choice:\n\n1. **Flexibility**: Pancakes can be made in a variety of sizes and thicknesses, allowing for endless customization possibilities. You can have a small, delicate cake or a thick, fluffy patty - the choice is yours! Waffles, on the other hand, are typically more uniform in size and texture.\n2. **Flavor profile**: Pancakes offer a richer, more complex flavor experience due to their ability to soak up butter, syrup, and fresh fruit without getting soggy or falling apart. Waffles can get a bit...well, waffly when topped with liquids, losing some of their crispy goodness.\n3. **Texture contrast**: The best pancakes have a light, airy interior and a crunchy exterior - it's like a match made in heaven! Waffles often 

In [6]:
print(response['message']['content'])

The age-old debate: pancakes vs waffles. While both are delicious breakfast options, I'd like to present some arguments for why pancakes are the superior choice:

1. **Flexibility**: Pancakes can be made in a variety of sizes and thicknesses, allowing for endless customization possibilities. You can have a small, delicate cake or a thick, fluffy patty - the choice is yours! Waffles, on the other hand, are typically more uniform in size and texture.
2. **Flavor profile**: Pancakes offer a richer, more complex flavor experience due to their ability to soak up butter, syrup, and fresh fruit without getting soggy or falling apart. Waffles can get a bit...well, waffly when topped with liquids, losing some of their crispy goodness.
3. **Texture contrast**: The best pancakes have a light, airy interior and a crunchy exterior - it's like a match made in heaven! Waffles often fall flat (pun intended) in terms of texture, being either too crispy or too soggy depending on how they're cooked.
4. *

[Ollama GitHub](https://github.com/jmorganca/ollama)

In [7]:
from IPython.display import Markdown


Markdown(response['message']['content'])

The age-old debate: pancakes vs waffles. While both are delicious breakfast options, I'd like to present some arguments for why pancakes are the superior choice:

1. **Flexibility**: Pancakes can be made in a variety of sizes and thicknesses, allowing for endless customization possibilities. You can have a small, delicate cake or a thick, fluffy patty - the choice is yours! Waffles, on the other hand, are typically more uniform in size and texture.
2. **Flavor profile**: Pancakes offer a richer, more complex flavor experience due to their ability to soak up butter, syrup, and fresh fruit without getting soggy or falling apart. Waffles can get a bit...well, waffly when topped with liquids, losing some of their crispy goodness.
3. **Texture contrast**: The best pancakes have a light, airy interior and a crunchy exterior - it's like a match made in heaven! Waffles often fall flat (pun intended) in terms of texture, being either too crispy or too soggy depending on how they're cooked.
4. **Pancake toppings are endless!** From classic butter and syrup to fresh fruit, whipped cream, chocolate chips, and even savory options like cheese or bacon - the possibilities for pancake toppings are virtually limitless. Waffles, while delicious with some of these same toppings, can feel a bit...stuffy?
5. **Cultural significance**: Let's face it: pancakes have been a staple of breakfast culture for centuries! From traditional Irish boxty to American-style flapjacks, pancakes have played a significant role in shaping the way we think about breakfast. Waffles may be tasty, but they can't compete with the rich history and cultural significance of their pancake counterparts.
6. **Kid-friendly**: Pancakes are often a crowd-pleaser among kids (and let's be real, adults too!). They're easy to make and fun to eat - who doesn't love a good old-fashioned pancake breakfast? Waffles can be a bit more...fussy, don't you think?

So there you have it - pancakes reign supreme as the ultimate breakfast champions! What do you say, waffle fans? Are you ready to concede defeat?

__You can use it with other Python frameworks like LangChain:__

In [3]:
# !pip install langchain langchain-community

In [1]:
from langchain_community.chat_models import ChatOllama
llm = ChatOllama(model="llama3.1")
llm.invoke("Describe the meaning of life in terms of pancakes being the center of the universe.")

# References

- https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct