Text Generation

In [6]:
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers import pipeline
import torch

model_id = "meta-llama/Llama-3.2-1B-Instruct"
device = "cuda"

tokenizer = AutoTokenizer.from_pretrained(model_id, padding_side="left")
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map=device)

In [3]:
generation_pipeline = pipeline(task="text-generation", model=model, tokenizer=tokenizer)
generation_pipeline("Hello what are you?", max_new_tokens=25)

Device set to use cuda
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


[{'generated_text': "Hello what are you? I'm a computer program, and I'm here to help answer any questions you may have.\n\nWhat do you do? I"}]

In [4]:
generation_pipeline(["Hello what are you?","What is the capitol of India?"], max_new_tokens=25)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


[[{'generated_text': 'Hello what are you? You have a lot of info on your profile. What is your name and what are you? I am a bot, but'}],
 [{'generated_text': 'What is the capitol of India? New Delhi\nNew Delhi is the capital of India.'}]]

Tokenization

In [10]:
input_prompt = ["Hello what are you?","What is the capitol of India?"]

tokenized = tokenizer(input_prompt, return_tensors="pt", padding=True).to(device)
print(tokenized["input_ids"].shape)
tokenized["input_ids"]

torch.Size([2, 9])


tensor([[128009, 128009, 128009, 128000,   9906,   1148,    527,    499,     30],
        [128000,   3923,    374,    279,   2107,  27094,    315,   6890,     30]],
       device='cuda:0')
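
The first row above starts with three copies of id 128009 because the tokenizer was created with padding_side="left" and the pad token was set to the end-of-sequence token. A minimal pure-Python sketch of that padding step follows, including the attention mask the tokenizer returns next to input_ids. `left_pad` is a hypothetical helper written for illustration, not a transformers function; the ids are copied from the output above.

```python
PAD_ID = 128009  # id used for padding in the output above

def left_pad(batch, pad_id=PAD_ID):
    """Pad every sequence on the left to the length of the longest one."""
    width = max(len(seq) for seq in batch)
    input_ids, attention_mask = [], []
    for seq in batch:
        n_pad = width - len(seq)
        input_ids.append([pad_id] * n_pad + list(seq))
        # 0 marks padding, 1 marks a real token
        attention_mask.append([0] * n_pad + [1] * len(seq))
    return input_ids, attention_mask

batch = [
    [128000, 9906, 1148, 527, 499, 30],                     # first prompt
    [128000, 3923, 374, 279, 2107, 27094, 315, 6890, 30],   # second prompt
]
input_ids, attention_mask = left_pad(batch)
print(input_ids[0])
print(attention_mask[0])
```

Left padding keeps the last real token at the end of every row, which is the position generation continues from.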

In [11]:
tokenizer.batch_decode(tokenized["input_ids"])

['<|eot_id|><|eot_id|><|eot_id|><|begin_of_text|>Hello what are you?',
 '<|begin_of_text|>What is the capitol of India?']

Instruction Prompts & Chat Templates

In [None]:
prompt_template = [
    {
        "role":"system",
        "content":"You are a helpful AI assistant who answers questions."
    }, 
    {
        "role":"user",
        "content":"When does the sun rise?"
    }
]
tokenizer.pad_token = tokenizer.eos_token
tokenized = tokenizer.apply_chat_template(prompt_template, add_generation_prompt=True, tokenize=True, padding=True, return_tensors="pt")
print(tokenized)


tensor([[128000, 128006,   9125, 128007,    271,  38766,   1303,  33025,   2696,
             25,   6790,    220,   2366,     18,    198,  15724,   2696,     25,
            220,    605,  13806,    220,   2366,     20,    271,   2675,    527,
            264,  11190,  15592,  18328,    889,  11503,   4860,     13, 128009,
         128006,    882, 128007,    271,   4599,   1587,    279,   7160,  10205,
             30, 128009, 128006,  78191, 128007,    271]])


In [19]:
tokenized = tokenized.to(device)
out = model.generate(tokenized, max_new_tokens=20)
print(tokenizer.batch_decode(out)[0])

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 10 Feb 2025

You are a helpful AI assistant who answers questions.<|eot_id|><|start_header_id|>user<|end_header_id|>

When does the sun rise?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

The sun rises in the east and sets in the west due to the Earth's rotation on its axis


Continue Final Message

In [29]:
prompt_template = [
    {
        "role":"system",
        "content":"You are a helpful AI assistant who answers questions."
    }, 
    {
        "role":"user",
        "content":"When does the sun rise?"
    },
    {
        "role":"assistant",
        "content":"The sun rises"
    }
]

tokenized = tokenizer.apply_chat_template(prompt_template, add_generation_prompt=False, continue_final_message=True, tokenize=True, padding=True, return_tensors="pt")
print(tokenized)

tensor([[128000, 128006,   9125, 128007,    271,  38766,   1303,  33025,   2696,
             25,   6790,    220,   2366,     18,    198,  15724,   2696,     25,
            220,    605,  13806,    220,   2366,     20,    271,   2675,    527,
            264,  11190,  15592,  18328,    889,  11503,   4860,     13, 128009,
         128006,    882, 128007,    271,   4599,   1587,    279,   7160,  10205,
             30, 128009, 128006,  78191, 128007,    271,    791,   7160,  38268]])


In [31]:
out = model.generate(tokenized.to(device), max_new_tokens=20)
print(tokenizer.batch_decode(out)[0])

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 10 Feb 2025

You are a helpful AI assistant who answers questions.<|eot_id|><|start_header_id|>user<|end_header_id|>

When does the sun rise?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

The sun rises in the east and sets in the west. This is because the Earth rotates on its axis, which
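
The two prompts above differ only in how the template ends: add_generation_prompt=True appends an empty assistant header, while continue_final_message=True leaves the last assistant message open by omitting its closing <|eot_id|>. A simplified sketch of that rendering logic is below. `render_chat` is a hypothetical stand-in for the real Jinja template: it skips the "Cutting Knowledge Date" header and any tool handling.

```python
def render_chat(messages, add_generation_prompt=False, continue_final_message=False):
    """Simplified Llama-3-style rendering (no date header, no tool support)."""
    text = "<|begin_of_text|>"
    for i, msg in enumerate(messages):
        text += f"<|start_header_id|>{msg['role']}<|end_header_id|>\n\n{msg['content']}"
        is_last = i == len(messages) - 1
        # Leave the final message open when the model should continue it
        if not (continue_final_message and is_last):
            text += "<|eot_id|>"
    if add_generation_prompt:
        # Empty assistant header: the model starts a fresh reply
        text += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return text

messages = [
    {"role": "user", "content": "When does the sun rise?"},
]
prompt_for_new_turn = render_chat(messages, add_generation_prompt=True)
prompt_for_continuation = render_chat(
    messages + [{"role": "assistant", "content": "The sun rises"}],
    continue_final_message=True,
)
print(repr(prompt_for_new_turn))
print(repr(prompt_for_continuation))
```

The continuation prompt is the new-turn prompt plus the partial answer, which matches the token tensors above: the second tensor is the first one with three extra ids at the end.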


Next Word Prediction

In [None]:
import torch.nn as nn

text = "Hello how are"
input_ids = tokenizer([text], return_tensors="pt")["input_ids"].to(device)
out = model(input_ids = input_ids)
print(out.logits.shape)
most_likely = out.logits.argmax(axis=-1)[0,-1].item()
print("Most Likely Token:",most_likely)
probability_dist = nn.Softmax(dim=-1)(out.logits[0,-1])  # normalise over the vocabulary axis
print("With Probability:",probability_dist[most_likely].item())
print("String Value:",tokenizer.convert_ids_to_tokens(most_likely))

torch.Size([1, 4, 128256])
Most Likely Token: 499
With Probability: 0.98828125
String Value: Ġyou
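
The logits have shape [1, 4, 128256]: one score per vocabulary entry at each of the 4 input positions, and only the last position predicts the next token. The sketch below repeats the argmax and softmax steps on a toy 4-entry vocabulary in plain Python. The logit values and the vocabulary are made up for illustration; the "Ġ" prefix is how byte-level BPE tokenizers mark a leading space.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for the last position over a 4-token toy vocabulary
vocab = ["Ġyou", "Ġwe", "Ġthey", "Ġthings"]
last_position_logits = [9.0, 4.0, 3.5, 1.0]

probs = softmax(last_position_logits)
most_likely = max(range(len(probs)), key=probs.__getitem__)
print("Most Likely Token:", most_likely)
print("With Probability:", probs[most_likely])
print("String Value:", vocab[most_likely])
```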


Loss Function

In [52]:
sentence = ["subscribe to neural breakdown with avb"]
tokenized = tokenizer(sentence, return_tensors="pt")["input_ids"]
print(tokenized)
print(tokenizer.batch_decode(tokenized))

tensor([[128000,   9569,    311,  30828,  31085,    449,   1860,     65]])
['<|begin_of_text|>subscribe to neural breakdown with avb']


In [53]:
input_ids = tokenized[:,:-1]
target_ids = tokenized[:,1:]

print("Input Seq:", input_ids)
print("Target Seq:", target_ids)

Input Seq: tensor([[128000,   9569,    311,  30828,  31085,    449,   1860]])
Target Seq: tensor([[ 9569,   311, 30828, 31085,   449,  1860,    65]])
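
Dropping the last token from the inputs and the first token from the targets lines the two sequences up so that position t of the input is trained to predict token t+1. A small plain-Python sketch that spells out each (context, target) pair, using the ids printed above:

```python
token_ids = [128000, 9569, 311, 30828, 31085, 449, 1860, 65]

input_ids = token_ids[:-1]   # everything except the last token
target_ids = token_ids[1:]   # everything except the first token

# With causal attention, position t only sees input_ids[: t + 1]
for position, target in enumerate(target_ids):
    context = input_ids[: position + 1]
    print(f"{context} -> {target}")
```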


In [55]:
prompt = [
    { "role":"user", "content":"What is the capital of India?" },
    { "role":"assistant", "content":"Captial:" }
]

chat_template = tokenizer.apply_chat_template(prompt, continue_final_message=True, tokenize=False)
print(chat_template)

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 10 Feb 2025

<|eot_id|><|start_header_id|>user<|end_header_id|>

What is the capital of India?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Captial:


In [64]:
prompt = [
    { "role":"user", "content":"What is the capital of India?" },
    { "role":"assistant", "content":"Captial:" }
]
answer = "New Delhi"

chat_template = tokenizer.apply_chat_template(prompt, continue_final_message=True, tokenize=False)
full_response_text = chat_template + " " + answer + tokenizer.eos_token

print(full_response_text)
print("............................................................................................")
tokenized = tokenizer(full_response_text, return_tensors="pt", add_special_tokens=False)["input_ids"]
print(tokenized)

input_ids = tokenized[:,:-1]
target_ids = tokenized[:,1:]

print("Input Seq:", input_ids.shape)
print("Target Seq:", target_ids.shape)

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 10 Feb 2025

<|eot_id|><|start_header_id|>user<|end_header_id|>

What is the capital of India?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Captial: New Delhi<|eot_id|>
............................................................................................
tensor([[128000, 128006,   9125, 128007,    271,  38766,   1303,  33025,   2696,
             25,   6790,    220,   2366,     18,    198,  15724,   2696,     25,
            220,    605,  13806,    220,   2366,     20,    271, 128009, 128006,
            882, 128007,    271,   3923,    374,    279,   6864,    315,   6890,
             30, 128009, 128006,  78191, 128007,    271,  41636,    532,     25,
           1561,  22767, 128009]])
Input Seq: torch.Size([1, 47])
Target Seq: torch.Size([1, 47])


In [66]:
labels_tokenized = tokenizer([" " + answer + tokenizer.eos_token], add_special_tokens=False, return_tensors="pt",padding="max_length", max_length=target_ids.shape[1])["input_ids"]
print(labels_tokenized)
labels_tokenized_fixed = torch.where(labels_tokenized != tokenizer.pad_token_id, labels_tokenized, -100)
labels_tokenized_fixed[:,-1] = tokenizer.eos_token_id
print(labels_tokenized_fixed)

tensor([[128009, 128009, 128009, 128009, 128009, 128009, 128009, 128009, 128009,
         128009, 128009, 128009, 128009, 128009, 128009, 128009, 128009, 128009,
         128009, 128009, 128009, 128009, 128009, 128009, 128009, 128009, 128009,
         128009, 128009, 128009, 128009, 128009, 128009, 128009, 128009, 128009,
         128009, 128009, 128009, 128009, 128009, 128009, 128009, 128009,   1561,
          22767, 128009]])
tensor([[  -100,   -100,   -100,   -100,   -100,   -100,   -100,   -100,   -100,
           -100,   -100,   -100,   -100,   -100,   -100,   -100,   -100,   -100,
           -100,   -100,   -100,   -100,   -100,   -100,   -100,   -100,   -100,
           -100,   -100,   -100,   -100,   -100,   -100,   -100,   -100,   -100,
           -100,   -100,   -100,   -100,   -100,   -100,   -100,   -100,   1561,
          22767, 128009]])
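
The cell above builds the labels by left-padding the answer to the full sequence length and replacing pad tokens with -100, the default ignore_index of PyTorch's cross-entropy loss. Because the pad token and the end token share id 128009, the final token gets masked too and has to be restored by hand. One alternative, sketched here in plain Python, masks by prompt length instead, so no id comparison is needed. `mask_prompt` is a hypothetical helper, and the ids are the tail of the tensor printed above.

```python
IGNORE_INDEX = -100  # default ignore_index of nn.CrossEntropyLoss

def mask_prompt(full_ids, n_answer_tokens, ignore_index=IGNORE_INDEX):
    """Keep the last n_answer_tokens as labels and ignore everything before them."""
    n_prompt = len(full_ids) - n_answer_tokens
    return [ignore_index] * n_prompt + list(full_ids[n_prompt:])

# Tail of the full sequence: assistant header, "Captial:", then " New Delhi<|eot_id|>"
full_ids = [128006, 78191, 128007, 271, 41636, 532, 25, 1561, 22767, 128009]
labels = mask_prompt(full_ids, n_answer_tokens=3)
print(labels)
```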


In [81]:
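def generate_input_output_pair(prompt, target_responses):
    # Render each conversation up to and including the partial assistant message
    chat_templates = tokenizer.apply_chat_template(prompt, continue_final_message=True, tokenize=False)

    # Append the target response and the end token to every rendered prompt
    full_response_text = [
        (chat_template + " " + target_response + tokenizer.eos_token)
        for chat_template, target_response in zip(chat_templates, target_responses)
    ]
    # padding=True left-pads shorter rows (padding_side="left") so a batch stacks into one tensor
    encoded = tokenizer(full_response_text, return_tensors="pt", add_special_tokens=False, padding=True)
    input_ids_tokenized = encoded["input_ids"]

    # Left-pad the answers to the same width so they line up with the end of each row
    labels_tokenized = tokenizer([" " + response + tokenizer.eos_token for response in target_responses],
            add_special_tokens=False, return_tensors="pt", padding="max_length", max_length=input_ids_tokenized.shape[1])["input_ids"]

    # Ignore everything except the answer; pad and EOS share an id, so restore the final EOS
    labels_tokenized_fixed = torch.where(labels_tokenized != tokenizer.pad_token_id, labels_tokenized, -100)
    labels_tokenized_fixed[:, -1] = tokenizer.eos_token_id

    # Position t of the inputs predicts token t+1: drop the last input and the first label
    input_ids_tokenized_left_shifted = input_ids_tokenized[:, :-1]
    labels_tokenized_right_shifted  = labels_tokenized_fixed[:, 1:]
    # Use the tokenizer's mask: comparing against pad_token_id would also mask the real <|eot_id|> tokens
    attention_mask = encoded["attention_mask"][:, :-1]

    return {"input_ids": input_ids_tokenized_left_shifted, "attention_mask": attention_mask, "labels": labels_tokenized_right_shifted}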

In [82]:
prompt = [[
    { "role":"user", "content":"What is the capital of India?" },
    { "role":"assistant", "content":"Captial:" }
]]
answer = ["New Delhi"]
data = generate_input_output_pair(prompt, answer)

print(data["input_ids"])
print(data["labels"])

tensor([[128000, 128006,   9125, 128007,    271,  38766,   1303,  33025,   2696,
             25,   6790,    220,   2366,     18,    198,  15724,   2696,     25,
            220,    605,  13806,    220,   2366,     20,    271, 128009, 128006,
            882, 128007,    271,   3923,    374,    279,   6864,    315,   6890,
             30, 128009, 128006,  78191, 128007,    271,  41636,    532,     25,
           1561,  22767]])
tensor([[  -100,   -100,   -100,   -100,   -100,   -100,   -100,   -100,   -100,
           -100,   -100,   -100,   -100,   -100,   -100,   -100,   -100,   -100,
           -100,   -100,   -100,   -100,   -100,   -100,   -100,   -100,   -100,
           -100,   -100,   -100,   -100,   -100,   -100,   -100,   -100,   -100,
           -100,   -100,   -100,   -100,   -100,   -100,   -100,   -100,   1561,
          22767, 128009]])


Loss Calculation

In [69]:
import torch.nn as nn
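def calculate_loss(logits, labels):
    # reduction='none' returns one loss per position; labels equal to -100 are ignored and yield 0
    loss_fn = nn.CrossEntropyLoss(reduction='none')
    # Flatten (batch, seq, vocab) -> (batch*seq, vocab) and (batch, seq) -> (batch*seq,)
    cross_entropy_loss = loss_fn(logits.view(-1, logits.size(-1)), labels.view(-1))
    return cross_entropy_loss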

In [87]:
out = model(input_ids=data["input_ids"].to(device))
print(out.logits.shape)
print(calculate_loss(out.logits, data["labels"].to(device)))
print(out.logits.argmax(-1))

torch.Size([1, 47, 128256])
tensor([0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
        0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
        0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
        0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
        0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0806,
        0.1206, 0.4883], device='cuda:0', dtype=torch.bfloat16,
       grad_fn=<NllLossBackward0>)
tensor([[ 16309, 128006,    198,    271,    567,   1303,    311,  12299,     25,
           5936,    220,   2366,     18,    198,    791,    596,     25,   2360,
             19,   5936,    220,   2366,     19,    198,   3146, 128006,  78191,
           3638,    271,     40,    527,    279,   1925,    315,   1561,   5380,
           1561, 128006,  78191, 128007,    271,    791,   1711,    315,   1561,
          22767, 128009]], device='cuda:0')
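
With reduction='none', every position labelled -100 contributes a loss of exactly 0, which is why the output above is 44 zeros followed by three real values. Averaging over all 47 positions therefore gives a much smaller number than averaging over the three supervised tokens. A plain-Python sketch of the difference, using the values printed above:

```python
per_token_loss = [0.0] * 44 + [0.0806, 0.1206, 0.4883]
labels = [-100] * 44 + [1561, 22767, 128009]

# Keep only the positions that actually carry a training signal
supervised = [loss for loss, label in zip(per_token_loss, labels) if label != -100]

mean_all = sum(per_token_loss) / len(per_token_loss)
mean_supervised = sum(supervised) / len(supervised)

print(f"mean over all positions:        {mean_all:.4f}")
print(f"mean over supervised positions: {mean_supervised:.4f}")
```

nn.CrossEntropyLoss with its default reduction='mean' averages over the non-ignored targets only, so it would report the second number.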


Basic Fine Tuning

In [93]:
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    task_type="CAUSAL_LM",
    r=64,
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=['q_proj', 'v_proj'])

model = get_peft_model(model, lora_config) 
model.print_trainable_parameters()

trainable params: 6,815,744 || all params: 1,242,630,144 || trainable%: 0.5484933737451648
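
The trainable parameter count can be reproduced by hand. For each targeted linear layer of shape d_in x d_out, LoRA adds two matrices, A (r x d_in) and B (d_out x r), so r * (d_in + d_out) parameters. The sketch below assumes the Llama-3.2-1B dimensions (hidden size 2048, 16 layers, 8 key/value heads of 64 dims each); these are not printed anywhere in this notebook and are stated here as assumptions.

```python
# Assumed Llama-3.2-1B dimensions (not shown in this notebook)
hidden_size = 2048
num_layers = 16
kv_dim = 8 * 64  # 8 key/value heads x 64 dims per head = 512

# Values from the LoraConfig above
r = 64
lora_alpha = 16

def lora_params(d_in, d_out, rank):
    """Parameters added by one LoRA pair: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)

q_proj = lora_params(hidden_size, hidden_size, r)  # 2048 -> 2048
v_proj = lora_params(hidden_size, kv_dim, r)       # 2048 -> 512
trainable = (q_proj + v_proj) * num_layers

print(f"trainable params: {trainable:,}")
print("percent of all params:", 100 * trainable / 1_242_630_144)
print("LoRA scaling (lora_alpha / r):", lora_alpha / r)
```

Under those assumptions the total matches the 6,815,744 reported by print_trainable_parameters(). With lora_alpha=16 and r=64, the adapter update is scaled by 0.25 before being added to the frozen weights.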


In [94]:
training_prompt = [
    { "role":"user", "content":"Who to subscribe to on YT for ML?" },
    { "role":"assistant", "content":"Subscribe to:" }
]
target_response = "neural breakdown with avb"

In [97]:
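# OBSERVING UNFINETUNED OUTPUT
# Note: model.base_model is the LoRA-injected model, so any trained adapter weights still apply here.
# To see the true base-model output, call generate inside `with model.disable_adapter():`.
test_tokenized = tokenizer.apply_chat_template(training_prompt, continue_final_message=True, return_tensors="pt").to(device)
test_out = model.base_model.generate(test_tokenized, max_new_tokens=10)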
print(tokenizer.batch_decode(test_out, skip_special_tokens=True)[0])


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


system

Cutting Knowledge Date: December 2023
Today Date: 10 Feb 2025

user

Who to subscribe to on YT for ML?assistant

Subscribe to: neural breakdown with avb


In [98]:
from torch.optim import AdamW  # the AdamW shipped with transformers is deprecated

data = generate_input_output_pair(prompt=[training_prompt], target_responses=[target_response])
data["input_ids"] = data["input_ids"].to(device)
data["labels"] = data["labels"].to(device)

optimizer = AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)


for _ in range(10):
    out = model(input_ids=data["input_ids"])
    loss = calculate_loss(out.logits, data["labels"]).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    print("loss: ", loss.item())




loss:  0.00017833709716796875
loss:  0.0001850128173828125
loss:  0.000186920166015625
loss:  0.0001678466796875
loss:  0.00017833709716796875
loss:  0.00017833709716796875
loss:  0.0001583099365234375
loss:  0.00018787384033203125
loss:  0.00016689300537109375
loss:  0.0001678466796875


In [92]:
# OBSERVING FINETUNED OUTPUT
test_tokenized = tokenizer.apply_chat_template(training_prompt, continue_final_message=True, return_tensors="pt").to(device)
test_out = model.generate(test_tokenized, max_new_tokens=10)
print(tokenizer.batch_decode(test_out, skip_special_tokens=True)[0])

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


system

Cutting Knowledge Date: December 2023
Today Date: 10 Feb 2025

user

Who to subscribe to on YT for ML?assistant

Subscribe to: neural breakdown with avb
