# Run Inference on Llama-3-8B-Instruct

## References
1. [Model Card on HuggingFace](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct)

## Install and Load Dependencies

In [1]:
!pip install accelerate



In [2]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from huggingface_hub import notebook_login as hfl



In [3]:
hfl()


## Import Model and Tokenizer

In [4]:
checkpoint = "meta-llama/Meta-Llama-3-8B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)

model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    torch_dtype=torch.float16,
    device_map="auto",
)


In [5]:
print(f"Memory footprint: {model.get_memory_footprint() / 1e6:.2f} MB")

Memory footprint: 16060.53 MB
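
As a sanity check on this number, the footprint can be estimated from the parameter count alone. The sketch below assumes the architecture values of the published Llama 3 8B configuration (hidden size 4096, 32 layers, MLP size 14336, 8 key/value heads of dimension 128, vocabulary of 128256, untied input and output embeddings). None of these values appear in this notebook, and `llama_param_count` is an illustrative helper, not a library function.

```python
def llama_param_count(vocab=128256, hidden=4096, layers=32,
                      intermediate=14336, kv_heads=8, head_dim=128):
    """Count the weights of a Llama-style decoder (all defaults are assumed config values)."""
    kv_dim = kv_heads * head_dim
    attention = 2 * hidden * hidden + 2 * hidden * kv_dim  # q and o, plus k and v projections
    mlp = 3 * hidden * intermediate                        # gate, up and down projections
    norms = 2 * hidden                                     # two RMSNorm weight vectors per layer
    per_layer = attention + mlp + norms
    embeddings = 2 * vocab * hidden                        # input embedding + untied LM head
    return layers * per_layer + embeddings + hidden        # + final RMSNorm weights


params = llama_param_count()
bytes_per_param = 2  # torch.float16 stores each weight in 2 bytes
print(f"Parameters: {params:,}")
print(f"Estimated footprint: {params * bytes_per_param / 1e6:.2f} MB")
```

Under these assumptions the estimate comes to about 16,060.5 MB, in line with the value reported above. Any small remainder is plausibly due to non-parameter buffers, which `get_memory_footprint()` also counts by default.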


## Create Test Prompt

In [6]:
prompt = """
Write a script in Python that implements a circular queue and creates one with user input. Give an example of how the script works and document the code with appropriate comments and docstrings wherever necessary. 
Generate a response only for the prompt that is given and nothing else. 
"""

In [7]:
messages = [
    {"role": "system", "content": "You are a helpful and masterful programming assistant that generates markdown documentation for any input program."},
    {"role": "user", "content": prompt},
]

## Run Inference with `model.generate()`

In [8]:
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)
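
To make it clearer what `apply_chat_template` produces before tokenization, here is a rough plain-Python approximation of the Llama 3 Instruct chat layout. It is a sketch based on the published prompt format, not the tokenizer's actual template. `render_llama3_prompt` and the example messages are hypothetical names introduced for illustration, and the template shipped with the tokenizer remains authoritative.

```python
def render_llama3_prompt(messages, add_generation_prompt=True):
    """Approximate the Llama 3 Instruct chat layout as plain text (illustrative only)."""
    text = "<|begin_of_text|>"
    for message in messages:
        # Each turn is a role header, a blank line, the content, then an end-of-turn marker.
        text += f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
        text += message["content"].strip() + "<|eot_id|>"
    if add_generation_prompt:
        # An open assistant header cues the model to write the reply next.
        text += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return text


example = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Write a circular queue in Python."},
]
print(render_llama3_prompt(example))
```

To inspect the real string instead of this approximation, call `tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)`.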

In [9]:
terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>")
]
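
The terminator list matters because decoding ends on one of two conditions: the model emits a terminator token, or the `max_new_tokens` budget runs out. The toy loop below illustrates that logic under simplifying assumptions. It is not how `generate()` is implemented, and `generate_until_stop`, `scripted_model` and the token ids are all hypothetical.

```python
def generate_until_stop(next_token, terminators, max_new_tokens):
    """Toy decoding loop: stop on a terminator token or when the token budget is spent."""
    generated = []
    for _ in range(max_new_tokens):
        token = next_token(generated)
        generated.append(token)
        if token in terminators:
            return generated, "terminator"
    return generated, "max_new_tokens"


STOP_ID = 0  # hypothetical terminator id for this toy example
script = [11, 22, 33, STOP_ID, 44]  # made-up token ids the fake model will emit in order


def scripted_model(generated):
    """Stand-in for a model: returns the next scripted token."""
    return script[len(generated)]


tokens, reason = generate_until_stop(scripted_model, {STOP_ID}, max_new_tokens=10)
print(tokens, reason)

short_tokens, short_reason = generate_until_stop(scripted_model, {STOP_ID}, max_new_tokens=2)
print(short_tokens, short_reason)
```

With a generous budget the loop ends at the terminator. With a budget of 2 it stops early, which is how a response can end mid-sentence when `max_new_tokens` is too small for the task.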

In [10]:
%%time
outputs = model.generate(
    input_ids,
    attention_mask=torch.ones_like(input_ids),  # single unpadded prompt: attend to every token
    pad_token_id=tokenizer.eos_token_id,  # set explicitly instead of relying on the fallback
    max_new_tokens=256,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
)
# Keep only the newly generated tokens by slicing off the prompt.
response = outputs[0][input_ids.shape[-1]:]


CPU times: user 18.3 s, sys: 39.6 ms, total: 18.3 s
Wall time: 19 s
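
The sampling settings used in this call (`do_sample=True`, `temperature=0.6`, `top_p=0.9`) can be illustrated with a small standard-library sketch. It shows the general idea of temperature scaling followed by nucleus (top-p) filtering. It is not the Transformers implementation, and `sample_top_p` and the toy logits are made up for the example.

```python
import math
import random


def sample_top_p(logits, temperature=0.6, top_p=0.9, rng=random):
    """Sample one token index using temperature scaling and nucleus (top-p) filtering."""
    # A temperature below 1 sharpens the distribution; above 1 flattens it.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]  # subtract the max for numerical stability
    total = sum(weights)
    probs = [w / total for w in weights]

    # Keep the most likely tokens until their cumulative probability reaches top_p.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # choices() treats weights as relative, so the kept probabilities need no renormalisation.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


toy_logits = [4.0, 3.0, 1.0, -2.0]  # made-up scores for a four-token vocabulary
rng = random.Random(0)
draws = [sample_top_p(toy_logits, rng=rng) for _ in range(1000)]
print(sorted(set(draws)))
```

For these toy logits only the two highest-scoring tokens survive the top-p cut, so the low-probability tail is never sampled. Lowering `temperature` or `top_p` pushes the behaviour further toward greedy decoding.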


In [11]:
print(tokenizer.decode(response, skip_special_tokens=True))

Here is a Python script that implements a circular queue and creates one with user input:
```
class CircularQueue:
    """
    A circular queue implementation in Python.
    """
    def __init__(self, size):
        """
        Initialize the circular queue with a given size.

        Args:
            size (int): The size of the queue.
        """
        self.size = size
        self.queue = [None] * size
        self.front = 0
        self.rear = 0

    def is_full(self):
        """
        Check if the queue is full.

        Returns:
            bool: True if the queue is full, False otherwise.
        """
        return (self.rear + 1) % self.size == self.front

    def is_empty(self):
        """
        Check if the queue is empty.

        Returns:
            bool: True if the queue is empty, False otherwise.
        """
        return self.front == self.rear

    def enqueue(self, item):
        """
        Enqueue an item into the queue.

        Args:
            item: Th

## Test Generate with Custom Input

In [12]:
language = "Python"
code = """
def test_euler_method(self):
    def dydx(x, y):
        return x + y

    x0, y0 = 0, 1
    x_end = 1
    h = 0.01
    x, y = differential_equations.euler_method(dydx, x0, y0, x_end, h)
    expected = 2 * np.exp(x_end) - x_end - 1  # Analytical solution of dy/dx = x + y with y(0) = 1
    self.assertAlmostEqual(y[-1], expected, places=2)
"""

In [13]:
prompt = f"""
Given a script file in {language}, generate its documentation for each function. For each function in the script, document its name, arguments, return values, and a brief explanation of its logic. Ensure that code within comments is not parsed and documented. Generate nothing else than what is asked.

Strictly use the following format for each function:

## Function Name: `function_name`

### Arguments
* `arg1` (type): Description of argument 1.
* `arg2` (type): Description of argument 2.
* ...

### Return Values
* `return_value1` (type): Description of return value 1.
* `return_value2` (type): Description of return value 2.
* ...

### Explanation of Function Logic:
1. Brief explanation of the function logic step by step.
2. ...
3. ...

-----------------------------------------------------------------------------

Here is the function I want you to generate documentation of:

{code}
"""

In [14]:
prompt

'\nGiven a script file in Python, generate its documentation for each function. For each function in the script, document its name, arguments, return values, and a brief explanation of its logic. Ensure that code within comments is not parsed and documented. Generate nothing else than what is asked.\n\nStrictly use the following format for each function:\n\n## Function Name: `function_name`\n\n### Arguments\n* `arg1` (type): Description of argument 1.\n* `arg2` (type): Description of argument 2.\n* ...\n\n### Return Values\n* `return_value1` (type): Description of return value 1.\n* `return_value2` (type): Description of return value 2.\n* ...\n\n### Explanation of Function Logic:\n1. Brief explanation of the function logic step by step.\n2. ...\n3. ...\n\n-----------------------------------------------------------------------------\n\nHere is the function I want you to generate documentation of:\n\n\ndef test_euler_method(self):\n    def dydx(x, y):\n        return x + y\n\n    x0, y0

In [21]:
%%time
messages = [
    {"role": "system", "content": "You are a helpful and masterful programming assistant that generates markdown documentation for any input program."},
    {"role": "user", "content": prompt},
]

input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

outputs = model.generate(
    input_ids,
    attention_mask=torch.ones_like(input_ids),  # single unpadded prompt: attend to every token
    pad_token_id=tokenizer.eos_token_id,  # set explicitly instead of relying on the fallback
    max_new_tokens=1024,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
)
# Keep only the newly generated tokens by slicing off the prompt.
response = outputs[0][input_ids.shape[-1]:]


CPU times: user 17.7 s, sys: 0 ns, total: 17.7 s
Wall time: 17.7 s


In [22]:
print(tokenizer.decode(response, skip_special_tokens=True))

## Function Name: `test_euler_method`

### Arguments
* `self` (object): Instance of the class being tested.
* `dydx` (function): The function that defines the differential equation dy/dx = x + y.
* `x0` (float): The initial value of x.
* `y0` (float): The initial value of y.
* `x_end` (float): The end value of x.
* `h` (float): The step size for the Euler method.

### Return Values
* `y` (list of floats): The solution of the differential equation at each step.

### Explanation of Function Logic:
1. The function tests the Euler method for solving the differential equation dy/dx = x + y.
2. It defines the function dydx(x, y) that returns the value of dy/dx at a given point (x, y).
3. It sets the initial values of x and y, the end value of x, and the step size h.
4. It calls the euler_method function to solve the differential equation and gets the solution y.
5. It compares the last value of y with the analytical solution and checks if they are close within a certain tolerance.


## Remarks

- The model works well for our use case: the generated documentation follows the requested format.
- However, we still need a way to objectively quantify its performance.
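
One possible starting point for such a measurement is a structural check that scores how closely a response follows the requested template. The sketch below is only an assumption about what a first metric could look like. `REQUIRED_HEADINGS` and `format_score` are hypothetical names, and the check says nothing about whether the documented arguments and return values are actually correct.

```python
import re

# Headings demanded by the prompt's output format, as multiline regular expressions.
REQUIRED_HEADINGS = [
    r"^## Function Name: `[^`]+`\s*$",
    r"^### Arguments\s*$",
    r"^### Return Values\s*$",
    r"^### Explanation of Function Logic:\s*$",
]


def format_score(markdown_text):
    """Return the fraction of required headings present in a generated document."""
    found = sum(
        1 for pattern in REQUIRED_HEADINGS
        if re.search(pattern, markdown_text, flags=re.MULTILINE)
    )
    return found / len(REQUIRED_HEADINGS)


sample = """## Function Name: `test_euler_method`

### Arguments
* `self` (object): Instance of the class being tested.

### Return Values
* `y` (list of floats): The solution at each step.

### Explanation of Function Logic:
1. The function tests the Euler method.
"""

print(format_score(sample))
print(format_score("## Function Name: `f`\n"))
```

A score like this could be averaged over a set of test functions. Judging factual accuracy would need a separate check, such as comparing the documented arguments against the function's real signature.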