<a href="https://colab.research.google.com/github/Ak-Gautam/llm_infer/blob/main/llama3_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Llama 3.2 model inference

### Llama-3.2-1B

In [1]:
%%capture
!pip install unsloth
# Also get the latest nightly Unsloth!
!pip uninstall unsloth -y && pip install --upgrade --no-cache-dir "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"

In [2]:
from unsloth import FastLanguageModel
import torch
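max_seq_length = 2048  # maximum context length (prompt + generated tokens)
dtype = None           # None = auto-detect (float16 on a Tesla T4, bfloat16 on Ampere or newer)
load_in_4bit = False   # set to True to load 4-bit quantized weights and save GPU memory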

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.


In [3]:
from google.colab import userdata

In [7]:
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "meta-llama/Llama-3.2-1B-Instruct",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    token = userdata.get('HF_TOKEN_READ'), # Hugging Face read token from Colab secrets; needed because the meta-llama repos are gated
)

==((====))==  Unsloth 2024.9.post2: Fast Llama patching. Transformers = 4.44.2.
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.4.1+cu121. CUDA = 7.5. CUDA Toolkit = 12.1.
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.28.post1. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


meta-llama/Llama-3.2-1B-Instruct does not have a padding token! Will use pad_token = <|finetune_right_pad_id|>.


In [12]:
from transformers import TextStreamer
FastLanguageModel.for_inference(model)

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 2048)
    (layers): ModuleList(
      (0-15): 16 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (k_proj): Linear(in_features=2048, out_features=512, bias=False)
          (v_proj): Linear(in_features=2048, out_features=512, bias=False)
          (o_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (rotary_emb): LlamaExtendedRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=2048, out_features=8192, bias=False)
          (up_proj): Linear(in_features=2048, out_features=8192, bias=False)
          (down_proj): Linear(in_features=8192, out_features=2048, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
      )
    )
    (no
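
The layer shapes printed above are enough to estimate where the parameters of the 1B model sit. The following is a rough sketch, not part of the original notebook: it counts only the modules visible in the (truncated) printout, so the final norm and the output head are left out, and the helper names are made up for illustration.

```python
def linear_params(in_features, out_features):
    """Weights of a Linear layer with bias=False, as in the printout."""
    return in_features * out_features

def decoder_layer_params(hidden, kv_dim, mlp_dim):
    """Parameters of one LlamaDecoderLayer, using the shapes shown in the repr."""
    attn = 2 * linear_params(hidden, hidden)   # q_proj and o_proj
    attn += 2 * linear_params(hidden, kv_dim)  # k_proj and v_proj
    mlp = 3 * linear_params(hidden, mlp_dim)   # gate_proj, up_proj, down_proj
    norms = 2 * hidden                         # two RMSNorm weight vectors
    return attn + mlp + norms

# Values read off the Llama-3.2-1B printout above.
hidden, kv_dim, mlp_dim, num_layers, vocab = 2048, 512, 8192, 16, 128256

per_layer = decoder_layer_params(hidden, kv_dim, mlp_dim)
blocks = num_layers * per_layer
embedding = vocab * hidden

print(f"per decoder layer: {per_layer:,}")
print(f"all {num_layers} layers:     {blocks:,}")
print(f"token embedding:   {embedding:,}")
```

With these shapes the decoder layers hold roughly 0.97B parameters and the token embedding roughly 0.26B, which is consistent with the "1B" in the model name.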

In [14]:
generation_args = {
    "max_new_tokens": 512,
    "temperature": 0.7,
    "top_p": 0.9,
    "top_k": 50,
    "do_sample": True,
}
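
To make the effect of `temperature`, `top_k` and `top_p` concrete, here is a minimal pure-Python sketch of how these three settings narrow the next-token distribution before sampling. It is a simplification written for illustration, not the `transformers` implementation; the function names are hypothetical, and the order (temperature, then top-k, then top-p) is an assumption about how the library chains these filters.

```python
import math
import random

def filter_probs(logits, temperature=0.7, top_k=50, top_p=0.9):
    """Return {token_id: probability} after temperature, top-k and top-p filtering."""
    # Temperature: divide the logits before softmax; values below 1 sharpen the distribution.
    scaled = [x / temperature for x in logits]
    # Top-k: keep only the k highest-scoring token ids.
    ranked = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]
    # Softmax over the survivors (subtract the max for numerical stability).
    peak = scaled[ranked[0]]
    weights = [math.exp(scaled[i] - peak) for i in ranked]
    total = sum(weights)
    probs = [w / total for w in weights]
    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for token_id, p in zip(ranked, probs):
        kept.append((token_id, p))
        cumulative += p
        if cumulative >= top_p:
            break
    # Renormalize so the kept probabilities sum to 1.
    norm = sum(p for _, p in kept)
    return {token_id: p / norm for token_id, p in kept}

def sample(logits, rng=random, **kwargs):
    """Draw one token id from the filtered distribution."""
    dist = filter_probs(logits, **kwargs)
    ids = list(dist)
    return rng.choices(ids, weights=[dist[i] for i in ids], k=1)[0]

# Toy vocabulary of four tokens.
toy_logits = [2.0, 1.0, 0.0, -1.0]
print(filter_probs(toy_logits, temperature=1.0, top_k=3, top_p=0.9))
```

On the toy logits, top-k drops token 3 and top-p then drops token 2, so only tokens 0 and 1 remain candidates. Setting `do_sample` to `False` would skip all of this and pick the highest-scoring token.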

In [18]:
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

EOS_TOKEN = tokenizer.eos_token
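
Because the model echoes the whole prompt back, it can help to wrap the template in two small helpers: one that fills it in and one that pulls out only the text after `### Response:`. This is a sketch under assumptions, not code from the notebook: `build_prompt` and `extract_response` are hypothetical names, the template uses named placeholders instead of the positional `{}` above, and the extractor assumes the instruction and input do not themselves contain the response marker.

```python
ALPACA_TEMPLATE = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Input:
{input}

### Response:
{response}"""

RESPONSE_MARKER = "### Response:"

def build_prompt(instruction, input_text, response=""):
    """Fill in the template; leave `response` empty when prompting for generation."""
    return ALPACA_TEMPLATE.format(
        instruction=instruction, input=input_text, response=response
    )

def extract_response(decoded, eos_token=""):
    """Return only the text after the first response marker, cut at eos_token if given."""
    _, _, tail = decoded.partition(RESPONSE_MARKER)
    if eos_token:
        tail = tail.split(eos_token, 1)[0]
    return tail.strip()

prompt = build_prompt(
    "You are helpful AI Assistant.",
    "Write a C program to implement binary search.",
)
print(prompt)
```

In the notebook, `extract_response` could be applied to the decoded output of `model.generate`, passing `EOS_TOKEN` as `eos_token`.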

### Result 1B

In [20]:
# @title Write a C program to implement binary search
text = "Write a C program to implement binary search."
inputs = tokenizer(
    [
        alpaca_prompt.format(
            "You are helpful AI Assistant.",  # instruction
            text,  # input
            "",  # output - leave this blank for generation!
        )
    ],
    return_tensors="pt",
    padding=True,
    truncation=True,
).to("cuda")


text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer=text_streamer, **generation_args)
print()

<|begin_of_text|>Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
You are helpful AI Assistant.

### Input:
Write a C program to implement binary search.

### Response:
Below is a C program that implements binary search.

```c
#include <stdio.h>

// Function to perform binary search
int binarySearch(int arr[], int n, int target) {
    int low = 0, high = n - 1;

    // Continue until low is greater than high
    while (low <= high) {
        // Calculate mid
        int mid = (low + high) / 2;

        // If target is found
        if (arr[mid] == target) {
            return mid;
        }
        // If target is smaller than arr[mid]
        else if (arr[mid] > target) {
            high = mid - 1;
        }
        // If target is larger than arr[mid]
        else {
            low = mid + 1;
        }
    }

    // If target is not found
    return -1; // R
```

### Llama-3.2-3B

In [21]:
model2, tokenizer2 = FastLanguageModel.from_pretrained(
    model_name = "meta-llama/Llama-3.2-3B-Instruct",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    token = userdata.get('HF_TOKEN_READ'), # Hugging Face read token from Colab secrets; needed because the meta-llama repos are gated
)

==((====))==  Unsloth 2024.9.post2: Fast Llama patching. Transformers = 4.44.2.
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.4.1+cu121. CUDA = 7.5. CUDA Toolkit = 12.1.
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.28.post1. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


meta-llama/Llama-3.2-3B-Instruct does not have a padding token! Will use pad_token = <|finetune_right_pad_id|>.


In [22]:
FastLanguageModel.for_inference(model2)

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 3072)
    (layers): ModuleList(
      (0-27): 28 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear(in_features=3072, out_features=3072, bias=False)
          (k_proj): Linear(in_features=3072, out_features=1024, bias=False)
          (v_proj): Linear(in_features=3072, out_features=1024, bias=False)
          (o_proj): Linear(in_features=3072, out_features=3072, bias=False)
          (rotary_emb): LlamaExtendedRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=3072, out_features=8192, bias=False)
          (up_proj): Linear(in_features=3072, out_features=8192, bias=False)
          (down_proj): Linear(in_features=8192, out_features=3072, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((3072,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((3072,), eps=1e-05)
      )
    )
    (
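
In both printouts `k_proj` and `v_proj` are narrower than `q_proj` (512 vs 2048 for 1B, 1024 vs 3072 for 3B), which shrinks the key/value cache kept during generation. The sketch below estimates that cache size at the notebook's `max_seq_length` of 2048. It assumes 2 bytes per value (float16, since the banner reports `Bfloat16 = FALSE`), a batch size of 1, and a cache that stores exactly the `k_proj` and `v_proj` outputs for every token; the actual memory layout used by Unsloth may differ, and `kv_cache_bytes` is a hypothetical helper.

```python
def kv_cache_bytes(num_layers, kv_dim, seq_len, bytes_per_value=2, batch_size=1):
    """Approximate KV-cache size: one key and one value vector per layer per token."""
    return 2 * num_layers * kv_dim * seq_len * bytes_per_value * batch_size

MIB = 1024 ** 2
seq_len = 2048  # max_seq_length used in this notebook

one_b = kv_cache_bytes(num_layers=16, kv_dim=512, seq_len=seq_len)
three_b = kv_cache_bytes(num_layers=28, kv_dim=1024, seq_len=seq_len)
# Hypothetical comparison: the 3B model if k_proj/v_proj were as wide as q_proj.
three_b_full_width = kv_cache_bytes(num_layers=28, kv_dim=3072, seq_len=seq_len)

print(f"1B: {one_b / MIB:.0f} MiB")
print(f"3B: {three_b / MIB:.0f} MiB")
print(f"3B with full-width keys/values: {three_b_full_width / MIB:.0f} MiB")
```

Under these assumptions the cache is 64 MiB for the 1B model and 224 MiB for the 3B model, a third of what full-width keys and values would need for 3B.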

### Result 3B

In [23]:
# @title Write a C program to implement binary search
inputs2 = tokenizer2(
    [
        alpaca_prompt.format(
            "You are helpful AI Assistant.",  # instruction
            "Write a C program to implement binary search.",  # input
            "",  # output - leave this blank for generation!
        )
    ],
    return_tensors="pt",
    padding=True,
    truncation=True,
).to("cuda")


text_streamer = TextStreamer(tokenizer2)
_ = model2.generate(**inputs2, streamer=text_streamer, **generation_args)
print()

<|begin_of_text|>Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
You are helpful AI Assistant.

### Input:
Write a C program to implement binary search.

### Response:
```c
/**
 * binary_search.c
 *
 * Implementation of binary search algorithm.
 *
 * @author [Your Name]
 */

#include <stdio.h>
#include <stdlib.h>

/**
 * Searches for an element in a sorted array using binary search.
 *
 * @param arr   The sorted array to search in.
 * @param target The element to search for.
 * @param size  The size of the array.
 * @return The index of the target element if found, -1 otherwise.
 */
int binary_search(int arr[], int target, int size) {
    int left = 0;
    int right = size - 1;

    while (left <= right) {
        int mid = left + (right - left) / 2;

        // Check if target is at the middle
        if (arr[mid] == target) {
            return mid;
        