## Open notebook in:

| Colab | Gradient |
|:------|:---------|
| [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/https://github.com/Nicolepcx/transformers-the-definitive-guide/blob/main/CH02/ch02_llama_index_llama3.ipynb) | [![Gradient](https://assets.paperspace.io/img/gradient-badge.svg)](https://console.paperspace.com//github.com/Nicolepcx/transformers-the-definitive-guide/blob/main/CH02/ch02_llama_index_llama3.ipynb) |

# About this notebook


In this notebook, you use the [Hugging Face Chat Template](https://huggingface.co/docs/transformers/main/en/chat_templating) to prompt Llama 3. Chat templates streamline prompting across different models, because the tokenizer automatically applies the correct prompt format for the language model you are using.
Additionally, you will use Llama 3 for text completion and to ask for factual information about the Statue of Liberty. You will load the Llama 3 model with 4-bit [quantization](https://huggingface.co/docs/bitsandbytes/main/en/index), which reduces the memory needed to hold the model weights.
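
To get a feel for why quantization matters here, the following back-of-the-envelope sketch estimates the memory needed for the model weights alone. It assumes roughly 8 billion parameters (based on the "8B" in the name of the model loaded later in this notebook) and ignores activations, the KV cache, and any quantization overhead, so treat the numbers as rough lower bounds. The function name `approx_weight_memory_gib` is made up for this illustration.

```python
def approx_weight_memory_gib(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for the weights alone, in GiB."""
    bytes_total = num_params * bits_per_param / 8
    return bytes_total / 1024**3


NUM_PARAMS = 8e9  # assumption: roughly 8 billion parameters

for bits in (16, 8, 4):
    gib = approx_weight_memory_gib(NUM_PARAMS, bits)
    print(f"{bits:>2}-bit weights: ~{gib:.1f} GiB")
```

Under these assumptions, 4-bit weights need about a quarter of the memory of 16-bit weights, which is what makes an 8B model practical on a single consumer or free-tier GPU.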


# Installs

In [None]:
!pip -q install transformers==4.38.2 \
                datasets==2.18.0 \
                loralib==0.1.2 \
                sentencepiece==0.1.99 \
                bitsandbytes==0.43.0 \
                accelerate==0.28.0

In [None]:
import os
import torch
import transformers
from huggingface_hub import HfApi, HfFolder
from transformers import (AutoModelForCausalLM,
                          AutoTokenizer,
                          PreTrainedTokenizer,
                          PreTrainedModel,
                          BitsAndBytesConfig,
                          pipeline
                        )

from textwrap import TextWrapper

In [None]:
def print_wrapper(print):
    """Adapted from: https://stackoverflow.com/questions/27621655/how-to-overload-print-function-to-expand-its-functionality/27621927"""

    def function_wrapper(text):
        if not isinstance(text, str):
            text = str(text)
        wrapper = TextWrapper()
        return print("\n".join([wrapper.fill(line) for line in text.split("\n")]))

    return function_wrapper

print = print_wrapper(print)
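
As a quick check of what the wrapper does, this standalone sketch applies the same `TextWrapper` logic to a sample string without replacing the built-in `print`. `TextWrapper()` defaults to a width of 70 characters; the helper name `wrap_text` and the sample text are illustrative only.

```python
from textwrap import TextWrapper


def wrap_text(text) -> str:
    """Wrap each line of `text` separately so existing line breaks are kept."""
    wrapper = TextWrapper()  # default width is 70 characters
    return "\n".join(wrapper.fill(line) for line in str(text).split("\n"))


sample = "word " * 40 + "\nsecond paragraph"
wrapped = wrap_text(sample)
print(wrapped)
```

Wrapping line by line, rather than calling `fill` on the whole string, is what preserves the paragraph breaks in longer model responses.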

In [None]:
# Hugging Face access token
hf_token = "your_access_token"

# HfFolder to save the token for subsequent API calls
HfFolder.save_token(hf_token)
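
Hard-coding a token in a notebook makes it easy to leak by accident. One common alternative, sketched here, is to read it from an environment variable. The variable name `HF_TOKEN` matches the secret name that appears in the Hugging Face warning later in this notebook; the helper `get_hf_token` is hypothetical and not part of `huggingface_hub`.

```python
import os


def get_hf_token(env_var: str = "HF_TOKEN") -> str:
    """Return the access token stored in `env_var`, or raise a clear error."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"Set the {env_var} environment variable to your Hugging Face access token."
        )
    return token
```

With the variable set, `hf_token = get_hf_token()` could replace the hard-coded string before the call to `HfFolder.save_token(hf_token)`.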

In [None]:
model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id, token=True)

# BitsAndBytes configuration for 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    load_in_8bit=False,  # To load in 8-bit instead, set this to True and load_in_4bit to False
    bnb_4bit_use_double_quant=False,
    bnb_4bit_quant_type="fp4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

# Load the model in 4-bit
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=bnb_config,
    token=True,
)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.



In [None]:
prompt = "Nicole lives in Zurich, Switzerland and is a Data Scientist. Her personal interests include "

# Tokenize the prompt and move the tensors to the model's device
model_input = tokenizer(prompt, return_tensors="pt").to(model.device)
model.eval()
with torch.no_grad():
    # Passing the whole encoding supplies the attention mask along with the input IDs
    output_ids = model.generate(
        **model_input,
        max_new_tokens=10,
        pad_token_id=tokenizer.eos_token_id,
    )[0]
    response = tokenizer.decode(output_ids, skip_special_tokens=True)
    print(response)


Nicole lives in Zurich, Switzerland and is a Data Scientist. Her
personal interests include 3D printing, DIY electronics, and coding.


In [None]:
messages = [
    {
        "role": "system",
        "content": "You are an expert in History and providing factual information.",
    },
    {"role": "user", "content": "Tell me five facts about the Statue of Liberty."},
]

# Render the conversation as a string to inspect the Llama 3 prompt format
chat_template = tokenizer.apply_chat_template(messages, tokenize=False)
print(chat_template)

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are an expert in History and providing factual
information.<|eot_id|><|start_header_id|>user<|end_header_id|>

Tell me five facts about the Statue of Liberty.<|eot_id|>
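
To make the format above concrete, here is a hand-written sketch that reproduces it with plain string formatting. It is inferred from the printed output, not taken from the tokenizer: the real template is the Jinja template shipped with the model's tokenizer, and `render_llama3_chat` is a hypothetical helper. The `add_generation_prompt` flag mirrors the parameter of the same name in `apply_chat_template`; the assumption here is that it appends an empty assistant header so the model continues with its reply.

```python
def render_llama3_chat(messages, add_generation_prompt=False):
    """Approximate the Llama 3 chat format shown in the output above."""
    parts = ["<|begin_of_text|>"]
    for message in messages:
        parts.append(
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content'].strip()}<|eot_id|>"
        )
    if add_generation_prompt:
        # Assumption: an open assistant header cues the model to answer next
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


example = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
]
print(render_llama3_chat(example, add_generation_prompt=True))
```

In practice you would keep using `tokenizer.apply_chat_template`, since it stays correct when you switch to a model with a different prompt format.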


In [None]:
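# Tokenize the conversation; add_generation_prompt=True appends the assistant
# header so that the model responds instead of continuing the user turn
prompt = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)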
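
The generation cell that follows samples with `temperature=0.2` and `top_p=0.8`. The sketch below illustrates, in plain Python, what these two settings do to a toy distribution over three tokens: a temperature below 1 sharpens the distribution, and nucleus (top-p) filtering keeps the smallest set of most likely tokens whose cumulative probability reaches `top_p`. This is a simplified illustration with made-up logits, not the implementation used by `model.generate`.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing them by the temperature."""
    scaled = [x / temperature for x in logits]
    max_scaled = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - max_scaled) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Return the indices of the most likely tokens whose cumulative probability reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    return kept


toy_logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
for temperature in (1.0, 0.2):
    probs = softmax_with_temperature(toy_logits, temperature)
    kept = top_p_filter(probs, top_p=0.8)
    print(f"temperature={temperature}: probs={[round(p, 3) for p in probs]}, kept={kept}")
```

With these toy numbers, the low temperature concentrates almost all probability on one token, so only that token survives the filter. Settings like these make sampling close to greedy decoding, which suits factual questions.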

In [None]:
# Stop generating at either the EOS token or Llama 3's end-of-turn token
terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

outputs = model.generate(
    prompt,
    max_new_tokens=512,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.2,
    top_p=0.8,
)
# Keep only the newly generated tokens by slicing off the prompt
response = outputs[0][prompt.shape[-1]:]
print(tokenizer.decode(response, skip_special_tokens=True))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


The Statue of Liberty! A iconic symbol of freedom and democracy. Here
are five fascinating facts about this beloved landmark:

1. **Gift from France**: The Statue of Liberty was a gift from the
people of France to the people of the United States. It was designed
by French sculptor Frédéric Auguste Bartholdi and built by Gustave
Eiffel. The statue was dedicated on October 28, 1886.
2. **Colossal Size**: The Statue of Liberty stands 151 feet tall,
including the pedestal. The statue itself is 111 feet tall, making it
one of the largest statues in the world at the time of its
construction.
3. **Broken Chains**: The statue depicts Libertas, the Roman goddess
of freedom, holding a torch above her head and broken chains at her
feet. The broken chains represent the abolition of slavery and the
idea of freedom from oppression.
4. **Seven Points of the Crown**: The Statue of Liberty's crown is
made up of seven points, representing the seven seas and continents.
The crown is also adorned with a t