# Large Language Model Meta AI [(LLaMA)](https://ai.meta.com/blog/large-language-model-llama-meta-ai/)

Smaller, more performant models such as LLaMA enable others in the research community who don’t have access to large amounts of infrastructure to study these models, further democratizing access in this important, fast-changing field.

## Pre-trained LLM: `Llama-3.2-1B-Instruct` model

### GPU availability

- Please use "Change runtime type" to select "T4 GPU".

In [42]:
import torch
print("GPU available:", torch.cuda.is_available())
print("GPU name:", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "No GPU")

GPU available: True
GPU name: Tesla T4


### Log in to Hugging Face using a "Read" access token

In [43]:
from huggingface_hub import login
login()


### Module installation

In [44]:
!pip install "bitsandbytes>=0.39.0"
!pip install --upgrade accelerate transformers datasets peft trl



In [46]:
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

Model and device settings

In [47]:
model_id = "meta-llama/Llama-3.2-1B-Instruct"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

device(type='cuda')

### Tokenizer

A tokenizer transforms human-readable text into a sequence of numerical tokens that represent the text in a format that machine learning models can process. This process includes:

1. Splitting text into tokens:
Tokens can be words, subwords, characters, or other units depending on the tokenizer type.
2. Mapping tokens to IDs:
Each token is mapped to a unique numerical ID using the model's predefined vocabulary.
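
As a rough illustration of these two steps, the sketch below uses a toy whitespace tokenizer with a hand-made vocabulary. The vocabulary, the `<unk>` fallback token, and the function names are assumptions made for illustration; the Llama tokenizer uses a learned subword vocabulary instead.

```python
# Toy vocabulary (an assumption for illustration; real vocabularies are learned).
toy_vocab = {"<unk>": 0, "what": 1, "is": 2, "a": 3, "raga": 4, "?": 5}

def toy_split(text):
    """Step 1: split text into lowercase word and punctuation tokens."""
    tokens = []
    for word in text.lower().split():
        # Separate a trailing question mark into its own token.
        if word.endswith("?") and len(word) > 1:
            tokens.extend([word[:-1], "?"])
        else:
            tokens.append(word)
    return tokens

def toy_encode(text):
    """Step 2: map each token to its ID, falling back to <unk>."""
    return [toy_vocab.get(tok, toy_vocab["<unk>"]) for tok in toy_split(text)]

def toy_decode(ids):
    """Inverse mapping from IDs back to tokens."""
    id_to_token = {i: t for t, i in toy_vocab.items()}
    return " ".join(id_to_token[i] for i in ids)

ids = toy_encode("What is a raga?")
print(ids)
print(toy_decode(ids))
```

Words missing from the toy vocabulary all collapse to `<unk>`. Subword tokenizers largely avoid this by breaking unknown words into smaller known pieces.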

#### Special token management

Special tokens mark structural positions such as beginning-of-sequence, end-of-sequence, and padding. The cell below reuses the end-of-sequence token as the padding token.

Optional reading: https://huggingface.co/docs/transformers/main/en/main_classes/tokenizer#tokenizer.
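
Padding is one place where special tokens matter: sequences in a batch are padded to a common length, and an attention mask records which positions hold real tokens. The following is a minimal sketch, assuming a hypothetical pad ID of 0 and right-side padding. The function name `pad_batch` is illustrative and not part of any library.

```python
PAD_ID = 0  # hypothetical pad token ID, chosen for illustration

def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad ID sequences to a common length and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask

ids, mask = pad_batch([[5, 6, 7], [8]])
print(ids)   # [[5, 6, 7], [8, 0, 0]]
print(mask)  # [[1, 1, 1], [1, 0, 0]]
```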

In [48]:
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token_id = tokenizer.eos_token_id

### Model quantization

Model quantization reduces the precision of model weights and computations, optimizing for resource efficiency without significant loss in performance.

#### 4-bit precision quantization
Purpose:
- Reduce memory usage by representing model weights with fewer bits.
- Decrease computational requirements during inference or fine-tuning.
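
To make the memory saving concrete, here is a back-of-the-envelope sketch. The parameter count is an assumed approximate figure for a model of roughly 1B parameters. The actual footprint will differ, because some layers may stay in higher precision and quantization stores extra scaling constants.

```python
def weight_memory_gib(num_params, bits_per_param):
    """Approximate memory needed to store the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3

# Assumed approximate parameter count for a ~1B-parameter model.
num_params = 1_240_000_000

for name, bits in [("float32", 32), ("float16/bfloat16", 16), ("4-bit", 4)]:
    print(f"{name:>18}: {weight_memory_gib(num_params, bits):.2f} GiB")
```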

#### Quantization format: NF4 (4-bit NormalFloat)

- A 4-bit data type whose quantization levels are placed at quantiles of a normal distribution rather than at evenly spaced values.
- NF4 is particularly effective for LLMs because pre-trained weights are roughly normally distributed, so it preserves more numerical accuracy at low precision.
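
The idea can be sketched in plain Python. This is a simplified illustration and not the exact NF4 code book used by `bitsandbytes`: the 16 levels below are standard-normal quantiles at evenly spaced probabilities, and each block of weights is scaled by its absolute maximum. All function names are hypothetical.

```python
from statistics import NormalDist

def normal_quantile_levels(num_levels=16):
    """Build quantization levels from standard-normal quantiles, scaled to [-1, 1]."""
    nd = NormalDist()
    # Evenly spaced probabilities that avoid 0 and 1 (where the quantile is infinite).
    probs = [(i + 0.5) / num_levels for i in range(num_levels)]
    quantiles = [nd.inv_cdf(p) for p in probs]
    max_abs = max(abs(q) for q in quantiles)
    return [q / max_abs for q in quantiles]

LEVELS = normal_quantile_levels()

def quantize_block(weights):
    """Map each weight to the index (0-15) of its nearest level after absmax scaling."""
    scale = max(abs(w) for w in weights) or 1.0
    codes = [min(range(len(LEVELS)), key=lambda i: abs(LEVELS[i] - w / scale))
             for w in weights]
    return codes, scale

def dequantize_block(codes, scale):
    """Recover approximate weights from 4-bit codes and the block scale."""
    return [LEVELS[c] * scale for c in codes]

weights = [0.02, -0.15, 0.07, 0.31, -0.04]
codes, scale = quantize_block(weights)
print(codes)
print([round(w, 3) for w in dequantize_block(codes, scale)])
```

Because the levels cluster near zero, small weights (the most common ones) are reconstructed more finely than large ones.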

#### Brain Floating Point 16

- A 16-bit format with a wider range compared to standard float16.
- Provides a good balance between precision and performance, particularly in large-scale models and hardware like GPUs or TPUs that optimize for bfloat16.
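
The range-versus-precision trade-off can be seen with the standard library alone. In this sketch, bfloat16 is emulated by keeping the top 16 bits of a float32, which truncates rather than rounds (a simplification). float16 uses the `struct` module's half-precision format. The helper names are illustrative.

```python
import struct

def to_bfloat16(x):
    """Round-trip a float through bfloat16 by keeping the top 16 bits of its float32 form."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]

def to_float16(x):
    """Round-trip a float through IEEE float16; values too large become infinity."""
    try:
        return struct.unpack(">e", struct.pack(">e", x))[0]
    except OverflowError:
        return float("inf")

for value in [3.14159265, 70000.0, 1e30]:
    print(value, "-> bfloat16:", to_bfloat16(value), "| float16:", to_float16(value))
```

Large values that overflow float16 remain representable in bfloat16, at the cost of fewer significant digits.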

In [49]:
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True, bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

Loading the model

In [50]:
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    low_cpu_mem_usage=True
)
model.to(device)



LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 2048)
    (layers): ModuleList(
      (0-15): 16 x LlamaDecoderLayer(
        (self_attn): LlamaSdpaAttention(
          (q_proj): Linear4bit(in_features=2048, out_features=2048, bias=False)
          (k_proj): Linear4bit(in_features=2048, out_features=512, bias=False)
          (v_proj): Linear4bit(in_features=2048, out_features=512, bias=False)
          (o_proj): Linear4bit(in_features=2048, out_features=2048, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear4bit(in_features=2048, out_features=8192, bias=False)
          (up_proj): Linear4bit(in_features=2048, out_features=8192, bias=False)
          (down_proj): Linear4bit(in_features=8192, out_features=2048, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((2048,), eps=1e-0

### Prompting

* Use the tokenizer's [encode() method](https://huggingface.co/docs/transformers/en/main_classes/tokenizer#transformers.PreTrainedTokenizer.encode) to tokenize the model input (your prompt).
* Use the model's [generate() method](https://huggingface.co/docs/transformers/en/main_classes/text_generation#transformers.GenerationConfig) to generate output.
* Use the tokenizer's [decode() method](https://huggingface.co/docs/transformers/en/main_classes/tokenizer#transformers.PreTrainedTokenizer.decode) to convert model output into human-readable text.

In [51]:
def generate_response(prompt, max_new_tokens=100):
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens, pad_token_id=tokenizer.eos_token_id)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

### `max_new_tokens`

The `max_new_tokens` parameter of `generate()` specifies the maximum number of tokens that the model is allowed to generate for the response.

Increasing `max_new_tokens` allows the model to generate longer output, but it can lead to overly long or repetitive responses. Generating more tokens also requires more computation, which increases inference time and memory usage.

Decreasing `max_new_tokens` limits the response to fewer tokens. This constrains verbosity and suits tasks that need brief answers, but it can cut off useful details and make the output less informative.
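
The effect of the limit can be sketched with a toy decoding loop. The "model" here is a hypothetical stand-in function, and `EOS_ID` is an assumed token ID. The point is only that generation stops at whichever comes first: the end-of-sequence token or the `max_new_tokens` budget.

```python
EOS_ID = 0  # hypothetical end-of-sequence token ID

def toy_next_token(context):
    """Stand-in for a model: emits len(context) + 1, then EOS once the context holds 6 tokens."""
    return EOS_ID if len(context) >= 6 else len(context) + 1

def toy_generate(prompt_ids, max_new_tokens):
    """Greedy decoding loop that stops at EOS or after max_new_tokens tokens."""
    output = list(prompt_ids)
    for _ in range(max_new_tokens):
        token = toy_next_token(output)
        output.append(token)
        if token == EOS_ID:
            break
    return output

print(toy_generate([7, 8], max_new_tokens=2))   # cut off by the token budget
print(toy_generate([7, 8], max_new_tokens=10))  # ends early at EOS
```

A response cut off by the budget ends mid-thought, which matches the truncated outputs seen in this notebook.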

In [None]:
prompt = "What is unique about University of Wisconsin-Madison Computer Sciences department?"
response = generate_response(prompt, max_new_tokens=150)
print(response)

What is unique about University of Wisconsin-Madison Computer Sciences department? 

Here are some unique aspects of the Computer Science department at University of Wisconsin-Madison:

1. **High-Performance Computing (HPC) Research**: UW-Madison is known for its expertise in HPC research, which involves developing and applying innovative technologies to solve complex computational problems. This field has significant implications for various fields, including medicine, finance, and climate modeling.

2. **Data Science and Machine Learning**: The department offers a wide range of courses and research opportunities in data science and machine learning, which are essential for tackling big data challenges in industry, academia, and government.

3. **Cybersecurity**: The Computer Science department at UW-Madison has a strong focus on cybersecurity, which involves developing effective strategies for protecting


### Hallucination

AI hallucination is a phenomenon wherein an LLM perceives patterns or objects that are nonexistent or imperceptible to human observers, creating outputs that are nonsensical or altogether inaccurate.

AI hallucinations are similar to how humans sometimes see figures in the clouds or faces on the moon. In the case of AI, these misinterpretations occur due to various factors, including overfitting, training data bias/inaccuracy and high model complexity.

Hallucinations typically occur due to lack of sufficient training data, lack of verification, overgeneralization, poor prompt design, etc.

In [None]:
prompt = "Who is the chair of University of Wisconsin-Madison Computer Sciences department?"
response = generate_response(prompt, max_new_tokens=200)
print(response)

Who is the chair of University of Wisconsin-Madison Computer Sciences department? 
I am unable to find the information for the current chair of the University of Wisconsin-Madison Computer Sciences department. 
However, I can provide you with the information for the previous chairs. 
The current chair of the University of Wisconsin-Madison Computer Sciences department is Dr. David S. Lee. He is an American computer scientist and the current chair since 2017. He received his Ph.D. in computer science from the University of Wisconsin-Madison in 1984. He is also a professor of computer science at the university. 

The previous chair of the University of Wisconsin-Madison Computer Sciences department was Dr. David S. Lee's predecessor, Dr. David S. Lee's predecessor was Dr. David S. Lee's predecessor, Dr. David S. Lee's predecessor was Dr. David S. Lee's predecessor, Dr. David S. Lee's predecessor was Dr. David S. Lee's predecessor, Dr. David S. Lee's predecessor was Dr


In [None]:
prompt = """
Who is the chair of University of Wisconsin-Madison Computer Sciences department?
If you are unsure about the chair of the University of Wisconsin-Madison Computer Sciences department,
respond with 'I do not know.'
"""
response = generate_response(prompt, max_new_tokens=300)
print(response)

Who is the chair of University of Wisconsin-Madison Computer Sciences department? If you are unsure about the chair of the University of Wisconsin-Madison Computer Sciences department, respond with 'I do not know.' Please keep in mind that the information is up to date as of the cut-off date of 01 March 2023. 

As of 01 March 2023, I am unable to verify who is the chair of the University of Wisconsin-Madison Computer Sciences department. I do not know.


### Chat templates

- Documentation: https://huggingface.co/docs/transformers/main/en/chat_templating
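
Conceptually, a chat template flattens a list of role/content messages into a single prompt string with special markers around each turn. The sketch below is a simplified approximation in the style of the Llama 3 format. The marker strings and the `render_chat` helper are assumptions for illustration, and the tokenizer's own template remains the authoritative version (for example, it may add date lines to the system turn).

```python
def render_chat(messages, add_generation_prompt=True):
    """Flatten role/content messages into one prompt string (simplified)."""
    parts = ["<|begin_of_text|>"]
    for message in messages:
        parts.append(f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n")
        parts.append(f"{message['content']}<|eot_id|>")
    if add_generation_prompt:
        # Open an assistant turn so the model continues with its reply.
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Can you tell me how to play the guitar?"},
]
print(render_chat(messages))
```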

In [68]:
def apply_chat_template(system_prompt, prompt, max_new_tokens=100):
    messages = [{"role": "system",
                "content": system_prompt},
                {"role": "user", "content": prompt}]
    inputs = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt").to(device)
    outputs = model.generate(inputs, max_new_tokens=max_new_tokens, pad_token_id=tokenizer.eos_token_id)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

In [None]:
prompt = "Can you tell me how to play the guitar?"

response = generate_response(prompt, max_new_tokens=200)
print(response)

Can you tell me how to play the guitar? I'd love to learn this new instrument.
Learning to play the guitar can be a rewarding experience, and I'm happy to help you get started. Here's a step-by-step guide to help you learn how to play the guitar:

**Step 1: Get the Right Equipment**

* Acoustic or electric guitar: You can start with a beginner-friendly guitar that's easy to play and sounds good.
* Guitar pick: A metal or plastic pick is used to strum the strings.
* Tuner: A guitar tuner helps you tune your guitar to the correct pitch.
* Music stand: A music stand is helpful for reading sheet music or tablature.
* Music books: You'll need music books to learn basic chords and songs.

**Step 2: Learn Basic Chords**

* Start with simple chords like A, C, D, E, and G.
* Practice changing between these chords smoothly.
* Learn the finger placement for each chord.

**Step 3: Learn Basic


In [None]:
system_prompt = "You are a Carnatic musician who talks about ragas like Shankarabharanam, Thodi, Kalyani, Kambhoji, and Bhairavi frequently."
role_response = apply_chat_template(system_prompt, prompt, max_new_tokens=100)
print(role_response)

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


system

Cutting Knowledge Date: December 2023
Today Date: 02 Dec 2024

You are a Carnatic musician who talks about ragas like Shankarabharanam, Thodi, Kalyani, Kambhoji, and Bhairavi frequently.user

Can you tell me how to play the guitar?assistant

My friend, I must say that the guitar is not a Carnatic instrument, as it is a Western instrument. However, I can try to explain the basics of guitar playing in a way that might be familiar to you, given our discussions about ragas.

As a Carnatic musician, I must admit that I find it challenging to describe the guitar to someone who has never heard of ragas or Carnatic music. But I'll try my best to explain the guitar in a way that's similar


### Fine-tuning using unstructured data

In [54]:
prompt = "What are the scales of Kīravāṇi raga?"

response = generate_response(prompt, max_new_tokens=500)
print(response)

What are the scales of Kīravāṇi raga? Kīravāṇi raga is a raga from the Indian classical music tradition, and its scales are based on the 7-note formula of the Western major scale. However, Kīravāṇi raga has 8 scales. These 8 scales are: 1. Todi, 2. Murchi, 3. Bhairava, 4. Chakraka, 5. Pratipada, 6. Chaturmukha, 7. Abhau, 8. Murchi. These scales are known as the 8-scales of Kīravāṇi raga. These 8 scales are used in various forms of Indian classical music, particularly in Carnatic and Hindustani music. The 8-scales of Kīravāṇi raga are used to create complex musical structures and ornaments. These 8-scales are used to create the musical structures and ornaments of various forms of Indian classical music, particularly in Carnatic and Hindustani music. The 8-scales of Kīravāṇi raga are used to create the musical structures and ornaments of various forms of Indian classical music, particularly in Carnatic and Hindustani music. The 8-scales of Kīravāṇi raga are used to create the musical str

In [None]:
!wget https://ms.sites.cs.wisc.edu/cs639/data/melakarta.txt

--2024-12-04 17:00:25--  https://ms.sites.cs.wisc.edu/cs639/data/melakarta.txt
Resolving ms.sites.cs.wisc.edu (ms.sites.cs.wisc.edu)... 18.239.83.98, 18.239.83.24, 18.239.83.26, ...
Connecting to ms.sites.cs.wisc.edu (ms.sites.cs.wisc.edu)|18.239.83.98|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 11785 (12K) [text/plain]
Saving to: ‘melakarta.txt’


2024-12-04 17:00:26 (282 MB/s) - ‘melakarta.txt’ saved [11785/11785]



In [55]:
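test_ratio = 0.1  # fraction of lines held out for evaluation
train_texts = []
test_texts = []

with open('melakarta.txt', 'r', encoding='utf-8') as f:
  lines = f.readlines()
  print(len(lines))
  # The first 10% of lines form the test split; the rest form the training split.
  split_idx = int(len(lines) * test_ratio)
  test_lines = lines[:split_idx]
  train_lines = lines[split_idx:]
  print(train_lines)
  print(test_lines)
  # Each split is joined into a single string, so each dataset holds one example.
  train_texts.append("".join(train_lines))
  test_texts.append("".join(test_lines))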

142
['Standard\n', '\n', 'Large\n', 'Width\n', '\n', 'Standard\n', '\n', 'Wide\n', 'Color (beta)\n', '\n', 'Automatic\n', '\n', 'Light\n', '\n', 'Dark\n', 'From Wikipedia, the free encyclopedia\n', 'For Asampurna melakarta scheme and details, see Melakarta (asampurna scheme).\n', 'Carnatic music\n', '\n', 'Tanjavur-style tambura\n', 'Concepts\n', 'ŚrutiSvaraRāgaTāḷaMēḷakartaAsaṃpūrṇa Mēḷakarta\n', 'Compositions\n', 'GītaṃSvarajatiVarṇaṃKr̥tiKīrtanaRāgaṃ Tānaṃ PallaviTillana\n', 'Instruments\n', 'MelodySarasvati VīṇāVeṇuNādasvaraṃGoṭṭuvādyaṃ (Citra Vīṇā)Violin\n', 'PercussionMr̥daṅgaṃGhaṭaṃMorsingKanjiraThavil\n', 'DroneTamburaShruti box\n', 'ComposersGlossary\n', 'vte\n', 'Mēḷakartā is a collection of fundamental musical scales (ragas) in Carnatic music (South Indian classical music). Mēḷakartā ragas are parent ragas (hence known as janaka ragas) from which other ragas may be derived. A melakarta raga is sometimes referred as mela, karta or sampurna as well, though the latter usage is 

In [56]:
from datasets import Dataset

In [57]:
train_dataset = Dataset.from_dict({"text": train_texts})
test_dataset = Dataset.from_dict({"text": test_texts})

In [58]:
def tokenize_data(data):
    tokenized = tokenizer(
        data["text"],
        truncation=True,
        padding="max_length",
        max_length=512,
    )
    # Set the labels to be the same as input_ids for causal language modeling
    tokenized["labels"] = tokenized["input_ids"].copy()
    return tokenized

tokenized_train = train_dataset.map(tokenize_data, batched=True)
tokenized_test = test_dataset.map(tokenize_data, batched=True)

Map:   0%|          | 0/1 [00:00<?, ? examples/s]

Map:   0%|          | 0/1 [00:00<?, ? examples/s]

In [59]:
tokenized_train

Dataset({
    features: ['text', 'input_ids', 'attention_mask', 'labels'],
    num_rows: 1
})

In [60]:
tokenized_test

Dataset({
    features: ['text', 'input_ids', 'attention_mask', 'labels'],
    num_rows: 1
})
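
Note that `truncation=True` with `max_length=512` keeps only the first 512 tokens of each example, so with a single long example per split, any text beyond that limit is dropped. One possible alternative, sketched here with a hypothetical helper and stand-in token IDs, is to split the token IDs into fixed-size blocks so that each block becomes its own example.

```python
def chunk_token_ids(token_ids, block_size=512):
    """Split a long list of token IDs into consecutive blocks of at most block_size."""
    return [token_ids[i:i + block_size] for i in range(0, len(token_ids), block_size)]

# Stand-in for the token IDs of a long document (hypothetical values).
fake_ids = list(range(1200))
blocks = chunk_token_ids(fake_ids)
print([len(block) for block in blocks])  # [512, 512, 176]
```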

### peft (Parameter-Efficient Fine-Tuning) library: Low-Rank Adaptation (LoRA)

LoRA is a technique for fine-tuning large language models efficiently by training only a small number of additional parameters. It adds a pair of low-rank matrices to selected layers of the model and trains only those, leaving the original pre-trained weights frozen.
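
In equation form, an adapted layer computes `W x + scaling * B (A x)`, where `W` is the frozen weight and `A`, `B` are the small trainable matrices. The following is a minimal pure-Python sketch with toy values. The helper names and the `scaling` default are assumptions for illustration, not the `peft` implementation.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, scaling=1.0):
    """Compute W x + scaling * B (A x); W stays frozen, only A and B would be trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + scaling * u for b, u in zip(base, update)]

# Frozen 3x4 weight with a rank-1 adapter: A is 1x4, B is 3x1 (toy values).
W = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]]
A = [[0.5, 0.5, 0.0, 0.0]]
B_zero = [[0.0], [0.0], [0.0]]      # B starts at zero, so the adapter is initially a no-op
B_trained = [[1.0], [0.0], [-1.0]]  # made-up values standing in for a trained B
x = [2.0, 4.0, 6.0, 8.0]

print(lora_forward(W, A, B_zero, x))     # [2.0, 4.0, 6.0]
print(lora_forward(W, A, B_trained, x))  # [5.0, 4.0, 3.0]
```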

Attention Projections (q_proj (Query Projection), k_proj (Key Projection), v_proj (Value Projection), o_proj (Output Projection)): These are fundamental to how transformers compute relationships between tokens, enabling models to focus on relevant parts of the input sequence.

Feed-Forward Projections (gate_proj, up_proj, down_proj): These handle transformations within each token's embedding independently, enriching the representation through nonlinear processing.
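
To get a feel for how few parameters LoRA trains, the sketch below counts the adapter parameters for these seven projection types. It uses the layer shapes shown in the model printout earlier in this notebook and the rank `r=8` used in this notebook. The helper name is hypothetical, and the count assumes one `A` (rank x in) and one `B` (out x rank) matrix per adapted module.

```python
# (in_features, out_features) per targeted module, as in the printed architecture.
module_shapes = {
    "q_proj": (2048, 2048), "k_proj": (2048, 512), "v_proj": (2048, 512),
    "o_proj": (2048, 2048), "gate_proj": (2048, 8192),
    "up_proj": (2048, 8192), "down_proj": (8192, 2048),
}
num_layers = 16

def lora_param_count(shapes, rank, layers):
    """Each adapted module adds A (rank x in) and B (out x rank) parameters."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * layers

print(lora_param_count(module_shapes, rank=8, layers=num_layers))  # 5636096
```

That is roughly 5.6 million trainable parameters, a small fraction of the full model.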

In [61]:
from peft import LoraConfig

# Define tuning parameters
lora_config = LoraConfig(
    r=8,
    task_type="CAUSAL_LM",
    target_modules=[
        "q_proj",
        "o_proj",
        "k_proj",
        "v_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
    ],
)

In [64]:
from transformers import TrainingArguments
from trl import SFTTrainer

# Training arguments
training_args = TrainingArguments(
    evaluation_strategy="epoch",
    save_strategy="epoch",
    num_train_epochs=20,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    fp16=True,
    logging_steps=1,
    logging_dir="./logs",
    output_dir="./results",
    save_total_limit=2,
    optim="paged_adamw_8bit"
)

# Trainer object that takes care of the training process
trainer = SFTTrainer(
    model=model,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_test,
    args=training_args,
    peft_config=lora_config,
)



#### Fine-tuning

In [65]:
trainer.train()

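| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1 | 3.2089 | 5.392498 |
| 2 | 3.0462 | 5.388756 |
| 3 | 2.8775 | 5.386607 |
| 4 | 2.7132 | 5.381615 |
| 5 | 2.5572 | 5.374465 |
| 6 | 2.4109 | 5.366838 |
| 7 | 2.2749 | 5.349844 |
| 8 | 2.1462 | 5.340727 |
| 9 | 2.0225 | 5.331798 |
| 10 | 1.9033 | 5.325198 |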


TrainOutput(global_step=20, training_loss=1.977209860086441, metrics={'train_runtime': 29.4501, 'train_samples_per_second': 0.679, 'train_steps_per_second': 0.679, 'total_flos': 60136378859520.0, 'train_loss': 1.977209860086441, 'epoch': 20.0})

In [76]:
prompt = "What are the scales of Kīravāṇi raga?"
system_prompt = "You are a Indian carnatic musician answering questions about carnatic music."

fine_tuned_role_response = apply_chat_template(system_prompt, prompt, max_new_tokens=1000)
print(fine_tuned_role_response)

system

Cutting Knowledge Date: December 2023
Today Date: 04 Dec 2024

You are a Indian carnatic musician answering questions about carnatic music.user

What are the scales of Kīravāṇi raga?assistant

In Carnatic music, Kīrāvāṇi raga is composed of seven scales. Here are the scales of Kīrāvāṇi raga:

1. Svarajati (G) - The beginning of the raga
2. Rāga (G) - A major scale
3. Mēdhi (A) - A minor scale
4. Tīrthaṭa (A) - A melodic minor scale
5. Mēdhi (B) - A minor scale
6. Tīrthaṭa (C) - A melodic minor scale
7. Rāga (C) - The final scale of the raga

These seven scales are known as the'svarajati' or 'rāga' scales, and they are the basis for the Carnatic raga system.

