# Large Language Model Meta AI [(LLaMA)](https://ai.meta.com/blog/large-language-model-llama-meta-ai/)

Smaller, more performant models such as LLaMA enable others in the research community who don’t have access to large amounts of infrastructure to study these models, further democratizing access in this important, fast-changing field.

## Pre-trained LLM: `Llama-3.2-1B-Instruct` model

### GPU availability

- Under "Change runtime type", select "T4 GPU" as the hardware accelerator.

In [1]:
import torch
print("GPU available:", torch.cuda.is_available())
print("GPU name:", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "No GPU")

GPU available: True
GPU name: Tesla T4


### Log in to Hugging Face using a "Read" access token

In [2]:
from huggingface_hub import login
login()


### Module installation

In [3]:
!pip install "bitsandbytes>=0.39.0"
!pip install --upgrade accelerate transformers datasets peft trl

Collecting transformers
  Downloading transformers-4.46.3-py3-none-any.whl.metadata (44 kB)
Collecting datasets
  Downloading datasets-3.1.0-py3-none-any.whl.metadata (20 kB)
Collecting trl
  Downloading trl-0.12.1-py3-none-any.whl.metadata (10 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.9.0,>=2023.1.0 (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)
Downloading transformers-4.46.3-py3-none-any.whl (10.0 MB)

In [4]:
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

Model and device settings

In [5]:
model_id = "meta-llama/Llama-3.2-1B-Instruct"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

device(type='cuda')

### Tokenizer

A tokenizer transforms human-readable text into a sequence of numerical tokens that represent the text in a format that machine learning models can process. This process includes:

1. Splitting text into tokens:
Tokens can be words, subwords, characters, or other units depending on the tokenizer type.
2. Mapping tokens to IDs:
Each token is mapped to a unique numerical ID using the model's predefined vocabulary.
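
The two steps above can be illustrated with a toy whitespace tokenizer. This is only a minimal sketch: the Llama tokenizer uses a learned subword vocabulary rather than whole words, and the `toy_vocab` dictionary, the `<unk>` fallback, and the `toy_*` helper names are all made up for illustration.

```python
# Toy illustration only: real LLM tokenizers use learned subword vocabularies.
toy_vocab = {"<unk>": 0, "hello": 1, "world": 2, "llama": 3}  # hypothetical vocabulary


def toy_tokenize(text):
    """Step 1: split text into tokens (here: lowercase words split on whitespace)."""
    return text.lower().split()


def toy_encode(text):
    """Step 2: map each token to its ID, falling back to <unk> for unknown tokens."""
    return [toy_vocab.get(token, toy_vocab["<unk>"]) for token in toy_tokenize(text)]


def toy_decode(ids):
    """Inverse mapping: IDs back to tokens, joined with spaces."""
    id_to_token = {i: t for t, i in toy_vocab.items()}
    return " ".join(id_to_token[i] for i in ids)


ids = toy_encode("Hello llama world")
print(ids)              # [1, 3, 2]
print(toy_decode(ids))  # hello llama world
```

Any word outside the toy vocabulary maps to ID 0 here. Subword tokenizers avoid most of that information loss by breaking unknown words into smaller known pieces.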

#### Special token management

Tokenizers also manage special tokens that mark structure rather than text, such as beginning-of-sequence, end-of-sequence, and padding tokens.

Optional reading: https://huggingface.co/docs/transformers/main/en/main_classes/tokenizer#tokenizer.
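
Reusing the end-of-sequence token as the padding token, as the next cell does, means padding cannot be recognized from token IDs alone. An attention mask carries that information instead. The sketch below shows the idea in plain Python with made-up token IDs; `pad_batch` is a hypothetical helper, not part of `transformers`, and it pads on the right purely for simplicity.

```python
# Hypothetical IDs for illustration; real values come from the tokenizer.
EOS_ID = 2
PAD_ID = EOS_ID  # reuse end-of-sequence as the padding token


def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad sequences to equal length and build matching attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask


ids, mask = pad_batch([[5, 6, 7, EOS_ID], [8, EOS_ID]])
print(ids)   # [[5, 6, 7, 2], [8, 2, 2, 2]]
print(mask)  # [[1, 1, 1, 1], [1, 1, 0, 0]]
```

In the second sequence, the first `2` is a genuine end-of-sequence token and the last two are padding. Only the mask distinguishes them.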

In [6]:
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token_id = tokenizer.eos_token_id

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/54.5k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/296 [00:00<?, ?B/s]

### Model quantization

Model quantization reduces the precision of model weights and computations, optimizing for resource efficiency without significant loss in performance.

#### 4-bit precision quantization

Purpose:
- Reduce memory usage by representing model weights with fewer bits.
- Decrease computational requirements during inference or fine-tuning.
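
A rough back-of-the-envelope estimate makes the memory saving concrete. The sketch below uses the layer shapes that appear in the model printout later in this notebook and counts only those layers. It ignores normalization weights, any separate output head, and the extra constants that quantization stores, so treat the numbers as approximate.

```python
# Layer shapes as reported in the model printout for Llama-3.2-1B-Instruct.
HIDDEN, KV_DIM, MLP_DIM, VOCAB, N_LAYERS = 2048, 512, 8192, 128256, 16

# Linear layers per decoder block (shown as Linear4bit after quantization).
attn_params = 2 * HIDDEN * HIDDEN + 2 * HIDDEN * KV_DIM  # q_proj, o_proj, k_proj, v_proj
mlp_params = 3 * HIDDEN * MLP_DIM                        # gate_proj, up_proj, down_proj
linear_params = N_LAYERS * (attn_params + mlp_params)
embed_params = VOCAB * HIDDEN                            # embedding table, left unquantized


def gigabytes(n_params, bits_per_param):
    """Storage needed for n_params values at the given bit width, in GB (10^9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


full_16bit = gigabytes(linear_params + embed_params, 16)
mixed_4bit = gigabytes(linear_params, 4) + gigabytes(embed_params, 16)
print(f"16-bit weights: {full_16bit:.2f} GB")
print(f"4-bit linear layers + 16-bit embeddings: {mixed_4bit:.2f} GB")
```

The 16-bit estimate comes out at about 2.47 GB, in line with the size of the `model.safetensors` download. With the linear layers stored in 4 bits, the estimate drops to roughly 1 GB.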

#### Quantization format: NF4 (4-bit NormalFloat)

- A 4-bit data type whose 16 quantization levels are derived from the quantiles of a normal distribution, which matches how neural network weights are typically distributed.
- NF4 is particularly effective for LLMs because it tends to preserve more numerical accuracy than uniformly spaced 4-bit levels.
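
The idea can be sketched in plain Python: scale a block of weights by its largest absolute value, then store, for each weight, the index of the nearest level from a small table built from normal-distribution quantiles. This is a simplified illustration, not the `bitsandbytes` implementation. The level table below is computed from evenly spaced quantiles and does not reproduce the exact NF4 code book, and the function names are hypothetical.

```python
from statistics import NormalDist


def normal_quantile_levels(n_levels=16):
    """Build n_levels values from evenly spaced quantiles of N(0, 1), scaled to [-1, 1]."""
    nd = NormalDist()
    # Offsetting by 0.5 avoids the infinite quantiles at probabilities 0 and 1.
    probs = [(i + 0.5) / n_levels for i in range(n_levels)]
    quantiles = [nd.inv_cdf(p) for p in probs]
    scale = max(abs(q) for q in quantiles)
    return [q / scale for q in quantiles]


LEVELS = normal_quantile_levels()


def quantize_block(weights):
    """Absmax-scale a block to [-1, 1] and keep the nearest level index (0..15) per weight."""
    absmax = max(abs(w) for w in weights) or 1.0  # avoid dividing by zero for an all-zero block
    indices = [
        min(range(len(LEVELS)), key=lambda i: abs(LEVELS[i] - w / absmax))
        for w in weights
    ]
    return indices, absmax


def dequantize_block(indices, absmax):
    """Recover approximate weights from level indices and the stored scale."""
    return [LEVELS[i] * absmax for i in indices]


block = [0.02, -0.15, 0.4, -0.07, 0.0, 0.31]
idx, absmax = quantize_block(block)
restored = dequantize_block(idx, absmax)
print(idx)                            # 4-bit indices, one per weight
print([round(r, 3) for r in restored])  # approximate reconstruction
```

Because the levels are denser near zero, small weights (the most common case under a normal distribution) are reconstructed more precisely than large ones.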

#### Brain Floating Point 16

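- A 16-bit floating-point format that keeps the 8 exponent bits of float32, giving it a much wider dynamic range than standard float16 at the cost of fewer mantissa bits (7 instead of 10).
- Provides a good balance between precision and performance, particularly for large models on hardware that natively supports bfloat16, such as TPUs and recent GPUs.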
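
The range difference can be demonstrated with the standard library alone. The sketch below simulates bfloat16 by keeping only the top 16 bits of a float32 encoding. Real hardware usually rounds rather than truncates, so this is an approximation, and `to_bfloat16` and `fits_float16` are hypothetical helper names.

```python
import struct


def to_bfloat16(x):
    """Simulate bfloat16 by keeping only the top 16 bits of the float32 encoding (truncation)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]


def fits_float16(x):
    """Return True if x can be packed as IEEE float16 without overflowing."""
    try:
        struct.pack(">e", x)
        return True
    except OverflowError:
        return False


print(to_bfloat16(3.14159))   # 3.140625 (only about 2-3 significant decimal digits survive)
print(to_bfloat16(1e30))      # still a finite number close to 1e30
print(fits_float16(65504.0))  # True: the largest finite float16 value
print(fits_float16(1e30))     # False: far outside float16's range
```

The trade-off is visible in both directions: bfloat16 represents very large magnitudes that float16 cannot, but it stores fewer digits of each value.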

In [7]:
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True, bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

Loading the model

In [8]:
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    low_cpu_mem_usage=True
)
model.to(device)

config.json:   0%|          | 0.00/877 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/2.47G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/189 [00:00<?, ?B/s]



LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 2048)
    (layers): ModuleList(
      (0-15): 16 x LlamaDecoderLayer(
        (self_attn): LlamaSdpaAttention(
          (q_proj): Linear4bit(in_features=2048, out_features=2048, bias=False)
          (k_proj): Linear4bit(in_features=2048, out_features=512, bias=False)
          (v_proj): Linear4bit(in_features=2048, out_features=512, bias=False)
          (o_proj): Linear4bit(in_features=2048, out_features=2048, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear4bit(in_features=2048, out_features=8192, bias=False)
          (up_proj): Linear4bit(in_features=2048, out_features=8192, bias=False)
          (down_proj): Linear4bit(in_features=8192, out_features=2048, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((2048,), eps=1e-0

### Prompting

* Use the tokenizer's [encode() method](https://huggingface.co/docs/transformers/en/main_classes/tokenizer#transformers.PreTrainedTokenizer.encode) to tokenize the model input (your prompt).
* Use the model's [generate() method](https://huggingface.co/docs/transformers/en/main_classes/text_generation#transformers.GenerationConfig) to generate output.
* Use the tokenizer's [decode() method](https://huggingface.co/docs/transformers/en/main_classes/tokenizer#transformers.PreTrainedTokenizer.decode) to convert model output into human-readable text.

In [9]:
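def generate_response(prompt, max_new_tokens=100):
    # Tokenize the prompt into input IDs and an attention mask on the target device.
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    # Generate up to max_new_tokens additional tokens after the prompt.
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens, pad_token_id=tokenizer.eos_token_id)
    # Decode the full sequence (prompt plus continuation) back into text.
    return tokenizer.decode(outputs[0], skip_special_tokens=True)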

### `max_new_tokens`

The `max_new_tokens` parameter of `generate()` specifies the maximum number of tokens that the model is allowed to generate for the response, not counting the tokens in the prompt.

Increasing `max_new_tokens` allows the model to generate longer output, but it can lead to overly long or repetitive responses. Generating more tokens also requires more computation, which increases inference time and memory usage.

Decreasing `max_new_tokens` limits the response to fewer tokens. This constrains verbosity, which suits tasks that need brief answers, but it can omit useful details or cut the response off mid-sentence.
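
The role of the cap can be seen in a toy generation loop. This is a minimal sketch with a scripted "model" instead of a real one: `toy_generate`, the `<eos>` marker, and the scripted reply are all hypothetical. Generation stops either when the end-of-sequence token is produced or when the budget runs out, whichever comes first.

```python
EOS = "<eos>"  # hypothetical end-of-sequence marker


def toy_generate(prompt_tokens, scripted_reply, max_new_tokens):
    """Append tokens one at a time until EOS appears or the budget is used up."""
    output = list(prompt_tokens)
    for step in range(max_new_tokens):
        next_token = scripted_reply[step] if step < len(scripted_reply) else EOS
        if next_token == EOS:
            break  # the "model" chose to stop before reaching the cap
        output.append(next_token)
    return output


prompt = ["Where", "is", "Madison", "?"]
reply = ["Madison", "is", "in", "Wisconsin", ".", EOS]

print(toy_generate(prompt, reply, max_new_tokens=3))
# ['Where', 'is', 'Madison', '?', 'Madison', 'is', 'in']
print(toy_generate(prompt, reply, max_new_tokens=50))
# ['Where', 'is', 'Madison', '?', 'Madison', 'is', 'in', 'Wisconsin', '.']
```

With a budget of 3 the answer is cut off mid-sentence. With a budget of 50 the loop ends early at the end-of-sequence token, so a larger cap does not by itself force a longer answer.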

In [10]:
prompt = "What is unique about University of Wisconsin-Madison Computer Sciences department?"
response = generate_response(prompt, max_new_tokens=150)
print(response)

What is unique about University of Wisconsin-Madison Computer Sciences department? 

Here are some unique aspects of the Computer Science department at University of Wisconsin-Madison:

1. **High-Performance Computing (HPC) Research**: UW-Madison is known for its expertise in HPC research, which involves developing and applying innovative technologies to solve complex computational problems. This field has significant implications for various fields, including medicine, finance, and climate modeling.

2. **Data Science and Machine Learning**: The department offers a wide range of courses and research opportunities in data science and machine learning, which are essential for tackling big data challenges in industry, academia, and government.

3. **Cybersecurity**: The Computer Science department at UW-Madison has a strong focus on cybersecurity, which involves developing effective strategies for protecting


### Hallucination

AI hallucination is a phenomenon wherein an LLM perceives patterns or objects that are nonexistent or imperceptible to human observers, creating outputs that are nonsensical or altogether inaccurate.

AI hallucinations are similar to how humans sometimes see figures in the clouds or faces on the moon. In the case of AI, these misinterpretations are commonly attributed to factors such as overfitting; biased, inaccurate, or insufficient training data; high model complexity; overgeneralization; lack of verification; and poor prompt design.

In [20]:
prompt = "Who is the chair of University of Wisconsin-Madison Computer Sciences department?"
response = generate_response(prompt, max_new_tokens=200)
print(response)

Who is the chair of University of Wisconsin-Madison Computer Sciences department? 
I am unable to find the information for the current chair of the University of Wisconsin-Madison Computer Sciences department. 
However, I can provide you with the information for the previous chairs. 
The current chair of the University of Wisconsin-Madison Computer Sciences department is Dr. David S. Lee. He is an American computer scientist and the current chair since 2017. He received his Ph.D. in computer science from the University of Wisconsin-Madison in 1984. He is also a professor of computer science at the university. 

The previous chair of the University of Wisconsin-Madison Computer Sciences department was Dr. David S. Lee's predecessor, Dr. David S. Lee's predecessor was Dr. David S. Lee's predecessor, Dr. David S. Lee's predecessor was Dr. David S. Lee's predecessor, Dr. David S. Lee's predecessor was Dr. David S. Lee's predecessor, Dr. David S. Lee's predecessor was Dr


In [26]:
prompt = """
Who is the chair of University of Wisconsin-Madison Computer Sciences department?
If you are unsure about the chair of the University of Wisconsin-Madison Computer Sciences department,
respond with 'I do not know.'
"""
response = generate_response(prompt, max_new_tokens=300)
print(response)

Who is the chair of University of Wisconsin-Madison Computer Sciences department? If you are unsure about the chair of the University of Wisconsin-Madison Computer Sciences department, respond with 'I do not know.' Please keep in mind that the information is up to date as of the cut-off date of 01 March 2023. 

As of 01 March 2023, I am unable to verify who is the chair of the University of Wisconsin-Madison Computer Sciences department. I do not know.


### Chat templates

- Documentation: https://huggingface.co/docs/transformers/main/en/chat_templating

In [27]:
def apply_chat_template(role, prompt, max_new_tokens=100):
    # The system message sets the persona; the user message carries the actual question.
    messages = [{"role": "system", "content": role},
                {"role": "user", "content": prompt}]
    # return_dict=True also returns the attention mask, so generate() does not have to infer it.
    inputs = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True,
                                           return_tensors="pt", return_dict=True).to(device)
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens, pad_token_id=tokenizer.eos_token_id)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
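
To make it clearer what a chat template does, the sketch below renders a list of messages into a single string using Llama-3-style header and end-of-turn markers. It is an approximation written for illustration only: the real template ships with the tokenizer and adds further details, such as the date lines visible in the decoded output later on, and `render_llama3_style` is a hypothetical helper.

```python
def render_llama3_style(messages, add_generation_prompt=True):
    """Approximate a Llama-3-style chat layout; the real template ships with the tokenizer."""
    text = "<|begin_of_text|>"
    for message in messages:
        # Each turn starts with a role header and ends with an end-of-turn marker.
        text += f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
        text += f"{message['content'].strip()}<|eot_id|>"
    if add_generation_prompt:
        # An open assistant header cues the model to write the assistant's turn next.
        text += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return text


messages = [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "Can you tell me how to play the guitar?"},
]
print(render_llama3_style(messages))
```

When the generated sequence is decoded with `skip_special_tokens=True`, the markers are dropped but the role names between them remain, which is why words like `system`, `user`, and `assistant` show up in the decoded text.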

In [28]:
prompt = "Can you tell me how to play the guitar?"

response = generate_response(prompt, max_new_tokens=200)
print(response)

Can you tell me how to play the guitar? I'd love to learn this new instrument.
Learning to play the guitar can be a rewarding experience, and I'm happy to help you get started. Here's a step-by-step guide to help you learn how to play the guitar:

**Step 1: Get the Right Equipment**

* Acoustic or electric guitar: You can start with a beginner-friendly guitar that's easy to play and sounds good.
* Guitar pick: A metal or plastic pick is used to strum the strings.
* Tuner: A guitar tuner helps you tune your guitar to the correct pitch.
* Music stand: A music stand is helpful for reading sheet music or tablature.
* Music books: You'll need music books to learn basic chords and songs.

**Step 2: Learn Basic Chords**

* Start with simple chords like A, C, D, E, and G.
* Practice changing between these chords smoothly.
* Learn the finger placement for each chord.

**Step 3: Learn Basic


In [29]:
role = "You are a Carnatic musician who talks about ragas like Shankarabharanam, Thodi, Kalyani, Kambhoji, and Bhairavi frequently."
role_response = apply_chat_template(role, prompt, max_new_tokens=100)
print(role_response)


system

Cutting Knowledge Date: December 2023
Today Date: 02 Dec 2024

You are a Carnatic musician who talks about ragas like Shankarabharanam, Thodi, Kalyani, Kambhoji, and Bhairavi frequently.user

Can you tell me how to play the guitar?assistant

My friend, I must say that the guitar is not a Carnatic instrument, as it is a Western instrument. However, I can try to explain the basics of guitar playing in a way that might be familiar to you, given our discussions about ragas.

As a Carnatic musician, I must admit that I find it challenging to describe the guitar to someone who has never heard of ragas or Carnatic music. But I'll try my best to explain the guitar in a way that's similar
