In [1]:
import transformers
import torch
from transformers import pipeline
from functools import wraps
import time
import logging

In [2]:
import warnings
warnings.filterwarnings("ignore")

In [3]:
from IPython.display import Markdown, display

In [4]:
# Load the model in bfloat16 on a single GPU (no quantization is applied here)
model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
text_generation_pipeline = pipeline(
    "text-generation",
    model=model_id,
    token="##",  # placeholder; supply your own Hugging Face access token
    model_kwargs={"torch_dtype": torch.bfloat16},
    device="cuda"
)  # for multiple GPUs, pass device_map="auto" instead of device

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [5]:
# Configure logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

def logging_decorator(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        # Start timing (perf_counter is monotonic, so it suits measuring intervals)
        start_time = time.perf_counter()

        # Extract the system and user prompts, whether passed by keyword or by position
        system_prompt = kwargs['system_prompt'] if 'system_prompt' in kwargs else args[0]
        user_prompt = kwargs['user_prompt'] if 'user_prompt' in kwargs else args[1]

        # Count whitespace-separated words (not tokens) in the input prompts
        input_word_count = len(system_prompt.split()) + len(user_prompt.split())

        # Execute the function and get the result
        result = func(*args, **kwargs)

        # End timing
        end_time = time.perf_counter()
        execution_time = end_time - start_time

        # Calculate the total number of words in the output
        output_word_count = len(result.split())

        # Log the results
        logging.info(f"Execution Time: {execution_time:.2f} seconds")
        logging.info(f"Input Word Count: {input_word_count}")
        logging.info(f"Output Word Count: {output_word_count}")

        return result
    
    return wrapper
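
The decorator above picks out the prompts by checking `kwargs` first and falling back to `args[0]` and `args[1]`. A possible alternative, sketched here as an assumption and not as part of the original notebook, is to let `inspect.signature` bind the call's arguments to parameter names. The names `logging_decorator_bound` and `echo_prompts` are hypothetical, and `echo_prompts` is only a stand-in for `generate_text` so the cell runs without a model.

```python
import inspect
import logging
import time
from functools import wraps

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

def logging_decorator_bound(func):
    """Hypothetical variant of logging_decorator that resolves prompts by parameter name."""
    signature = inspect.signature(func)

    @wraps(func)
    def wrapper(*args, **kwargs):
        # Bind positional and keyword arguments to the function's parameter names
        bound = signature.bind(*args, **kwargs)
        system_prompt = bound.arguments["system_prompt"]
        user_prompt = bound.arguments["user_prompt"]
        input_word_count = len(system_prompt.split()) + len(user_prompt.split())

        # Time only the wrapped call
        start_time = time.perf_counter()
        result = func(*args, **kwargs)
        execution_time = time.perf_counter() - start_time

        logging.info("Execution Time: %.2f seconds", execution_time)
        logging.info("Input Word Count: %d", input_word_count)
        logging.info("Output Word Count: %d", len(result.split()))
        return result

    return wrapper

@logging_decorator_bound
def echo_prompts(system_prompt, user_prompt):
    # Stand-in for generate_text: returns the two prompts joined by a space
    return f"{system_prompt} {user_prompt}"

# Keyword arguments in reverse order are still resolved correctly
reply = echo_prompts(user_prompt="What is the architecture of LLaMa-3?",
                     system_prompt="You're a helpful assistant.")
```

Binding by name keeps the word counts correct for any mix of positional and keyword arguments, at the cost of one extra `bind` call per invocation.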

In [6]:
@logging_decorator
def generate_text(system_prompt, 
                  user_prompt, 
                  pipeline=text_generation_pipeline,
                  print_prompt=False):
    # Prepare the message prompts
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt}
    ]

    # Generate the prompt
    prompt = pipeline.tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )

    # Optionally print the entire prompt
    if print_prompt:
        print("Generated Prompt: ", prompt)

    # Define the termination tokens
    terminators = [
        pipeline.tokenizer.eos_token_id,
        pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>")
    ]

    # Perform the text generation
    outputs = pipeline(
        prompt,
        max_new_tokens=4000,
        eos_token_id=terminators,
        do_sample=False,  # greedy decoding: output is deterministic, and temperature/top_p are ignored
        pad_token_id=pipeline.tokenizer.eos_token_id
    )

    # Return the generated text
    return outputs[0]["generated_text"][len(prompt):]
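
To make the prompt-building and slicing steps in `generate_text` easier to follow, here is a minimal sketch that runs without the model. The helper `render_llama3_prompt` is hypothetical and only approximates the Llama 3 Instruct chat layout as I understand it; the tokenizer's own `apply_chat_template` remains the authoritative source for the exact string. The completion text is made up for illustration.

```python
def render_llama3_prompt(messages, add_generation_prompt=True):
    """Hypothetical helper approximating the Llama 3 Instruct chat layout."""
    parts = ["<|begin_of_text|>"]
    for message in messages:
        # Each turn: a role header, a blank line, the content, then an end-of-turn token
        parts.append(
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content'].strip()}<|eot_id|>"
        )
    if add_generation_prompt:
        # An open assistant header tells the model to write the reply next
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)

messages = [
    {"role": "system", "content": "You're a helpful assistant."},
    {"role": "user", "content": "What is the architecture of LLaMa-3?"},
]
prompt = render_llama3_prompt(messages)

# The pipeline returns prompt + completion, so slicing by len(prompt) keeps only the completion
generated_text = prompt + "Sure."  # made-up completion for illustration
completion = generated_text[len(prompt):]
```

This also shows why `<|eot_id|>` is listed among the terminators: in this layout it marks the end of every turn, so the model is expected to emit it when its reply is finished.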

In [7]:
system_prompt = "You're a helpful assistant."

In [11]:
user_prompt = "What is the architecture of LLaMa-3?"

In [12]:
Answer = generate_text(system_prompt, user_prompt)

print(Answer)

2024-06-28 06:19:46,713 - INFO - Execution Time: 14.38 seconds
2024-06-28 06:19:46,713 - INFO - Input Word Count: 10
2024-06-28 06:19:46,714 - INFO - Output Word Count: 387


LLaMA-3 is a large language model developed by Meta AI, and its architecture is based on the transformer model introduced in the paper "Attention Is All You Need" by Vaswani et al. in 2017. Here's a high-level overview of the architecture:

1. **Encoder**: The encoder is a stack of identical layers, each consisting of two sub-layers:
	* **Self-Attention Mechanism**: This mechanism allows the model to attend to different parts of the input sequence simultaneously and weigh their importance. It's composed of three linear layers (query, key, and value) and a softmax function.
	* **Feed-Forward Network (FFN)**: This is a fully connected feed-forward network with a ReLU activation function and a linear output layer.
2. **Decoder**: The decoder is also a stack of identical layers, each consisting of three sub-layers:
	* **Self-Attention Mechanism**: Similar to the encoder, this mechanism allows the model to attend to different parts of the input sequence and weigh their importance.
	* **Enco

In [13]:
display(Markdown(Answer))

LLaMA-3 is a large language model developed by Meta AI, and its architecture is based on the transformer model introduced in the paper "Attention Is All You Need" by Vaswani et al. in 2017. Here's a high-level overview of the architecture:

1. **Encoder**: The encoder is a stack of identical layers, each consisting of two sub-layers:
	* **Self-Attention Mechanism**: This mechanism allows the model to attend to different parts of the input sequence simultaneously and weigh their importance. It's composed of three linear layers (query, key, and value) and a softmax function.
	* **Feed-Forward Network (FFN)**: This is a fully connected feed-forward network with a ReLU activation function and a linear output layer.
2. **Decoder**: The decoder is also a stack of identical layers, each consisting of three sub-layers:
	* **Self-Attention Mechanism**: Similar to the encoder, this mechanism allows the model to attend to different parts of the input sequence and weigh their importance.
	* **Encoder-Decoder Attention**: This mechanism allows the model to attend to the encoder output and weigh its importance.
	* **FFN**: Similar to the encoder, this is a fully connected feed-forward network with a ReLU activation function and a linear output layer.
3. **Positional Encoding**: To help the model understand the position of each token in the input sequence, positional encoding is added to the input embeddings.
4. **Input Embeddings**: The input text is first embedded into a vector space using a learned embedding layer.
5. **Output Layer**: The final output of the decoder is a probability distribution over the vocabulary, which is used to generate the next token in the sequence.

LLaMA-3 uses a combination of these components to generate text. The model is trained on a large corpus of text data and uses a masked language modeling objective to predict the missing tokens in the input sequence.

Some specific details about LLaMA-3's architecture include:

* The model has 12 encoder layers and 12 decoder layers.
* The self-attention mechanism uses a combination of parallel and serial attention mechanisms.
* The FFN layers use a ReLU activation function and a linear output layer.
* The model uses a byte-level byte-pair encoding (BPE) tokenizer to split the input text into subwords.

Keep in mind that this is a simplified overview, and there may be additional details or nuances in the actual architecture of LLaMA-3.