# **A simple Bot powered by Llama**
<center>

[![link](https://i.imgur.com/itX91TX.png)](https://huggingface.co/meta-llama/Llama-2-13b-chat)
Click on picture for more information
</center>

Llama 2 is a collection of pretrained and fine-tuned generative text models ranging from 7 billion to 70 billion parameters. The fine-tuned versions, called Llama-2-Chat, are optimized for dialogue use cases.

According to Meta, the chat models outperform other open-source chat models on most benchmarks tested and are on par with some popular closed-source models in human evaluations for helpfulness and safety.

The objective of `llama.cpp` is to run the LLaMA model with 4-bit integer quantization on a MacBook. It is a plain C/C++ implementation optimized for Apple silicon and x86 architectures, supporting various integer quantization schemes and BLAS libraries. Originally a web chat example, it now serves as a development playground for features of the ggml library.

`GGML` is a C library for machine learning that also defines a file format for distributing large language models (LLMs). It uses quantization so that LLMs can run efficiently on consumer hardware. GGML files contain binary-encoded data: a version number, hyperparameters, the vocabulary, and the weights. The vocabulary comprises the tokens used for language generation, while the weights make up the bulk of the file and so determine its size. Quantization stores the weights at reduced numeric precision, trading some accuracy for lower memory and compute requirements.
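
To make the idea of quantization concrete, here is a minimal sketch of symmetric 4-bit quantization applied to one small block of weights. It is an illustration only and is not the layout GGML actually uses; the function names and the sample weights are made up for this example.

```python
def quantize_block(values, bits=4):
    """Map a block of floats to signed integers that share one scale factor."""
    qmax = 2 ** (bits - 1) - 1                    # 7 for 4-bit signed values
    absmax = max(abs(v) for v in values)
    scale = absmax / qmax if absmax > 0 else 1.0  # avoid dividing by zero
    quants = [round(v / scale) for v in values]
    return scale, quants


def dequantize_block(scale, quants):
    """Recover approximate floats from the integers and the shared scale."""
    return [scale * q for q in quants]


# Hypothetical weights, chosen only for illustration
weights = [0.12, -0.50, 0.33, 0.07, -0.21, 0.44, -0.05, 0.29]
scale, quants = quantize_block(weights)
restored = dequantize_block(scale, quants)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("quantized integers:", quants)
print(f"max reconstruction error: {max_error:.4f}")
```

Each weight is now stored as a small integer plus one shared scale per block, which is where the memory saving comes from. The reconstruction error is the precision that is given up in exchange.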

## **Install All the Required Packages**
In this section, we prepare the Colab environment for running Llama 2. We install two packages: `llama-cpp-python`, the Python bindings for `llama.cpp` (pinned to version 0.1.78), and `huggingface_hub`, which lets us download model files from the Hugging Face Hub. The version pin matters: the model file used here is in the GGML format, which newer releases of `llama-cpp-python` no longer load because they expect the GGUF format. We then define `model_name_or_path` and `model_basename` to identify the exact model file we want: the 13B chat variant, quantized for efficient inference and hosted by 'TheBloke' on Hugging Face.

In [1]:
# Install the LLaMA C++ Python bindings and dependencies
!pip install --quiet huggingface_hub
!pip install --quiet llama-cpp-python==0.1.78


In [2]:
# Repository of the chat-tuned LLaMA 2 13B model, hosted by 'TheBloke' on Hugging Face.
model_name_or_path = "TheBloke/Llama-2-13B-chat-GGML"

# Specify the basename of the model file, i.e. the filename under which the model is stored
# in the repository. The name encodes several details about the model:
# - 'llama-2-13b-chat': the LLaMA 2 model with 13 billion parameters, tuned for chat applications.
# - 'ggmlv3': version 3 of the GGML file format.
# - 'q5_K_M': the quantization scheme (5-bit "k-quant", medium variant), i.e. how the
#   weights were compressed.
# - '.bin': a binary file, a common container for pre-trained model weights.
model_basename = "llama-2-13b-chat.ggmlv3.q5_K_M.bin"
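
The naming convention described in the comments can be checked programmatically. The sketch below splits such a filename into its parts. The helper `parse_ggml_filename` is hypothetical, and it assumes the `<model>.<format>.<quantization>.<extension>` pattern seen in this one filename; other repositories may name their files differently.

```python
def parse_ggml_filename(filename):
    """Split '<model>.<format>.<quantization>.<extension>' into its parts."""
    stem, extension = filename.rsplit(".", 1)
    model, file_format, quantization = stem.rsplit(".", 2)
    return {
        "model": model,
        "format": file_format,
        "quantization": quantization,
        "extension": extension,
    }


parts = parse_ggml_filename("llama-2-13b-chat.ggmlv3.q5_K_M.bin")
for key, value in parts.items():
    print(f"{key}: {value}")
```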

## **Import All the Required Libraries**
In this section, we import the two names the rest of the notebook relies on. `hf_hub_download` from the `huggingface_hub` library downloads a single file, such as model weights or a configuration file, from a repository on the Hugging Face Hub. The `Llama` class from the `llama_cpp` module is a Python wrapper around the C++ implementation in `llama.cpp`: it loads a model file and exposes text generation to Python while the computation itself runs in C++.

In [3]:
# Import the hf_hub_download function from the huggingface_hub library.
# This function is used to download files from the Hugging Face Model Hub.
# It is particularly useful for retrieving model files, configuration files, or any other
# resources associated with models hosted on the Hugging Face Hub. The function requires
# at least the name or path of the repository (model) on the Hub and the filename of the
# resource to download. Additional parameters can specify versioning, caching behavior,
# and more.

from huggingface_hub import hf_hub_download

In [4]:
# Import the Llama class from the llama_cpp module.
# The Llama class provides an interface to the LLaMA (Large Language Model - Meta AI) model implementations
# in C++. This class enables the instantiation of a LLaMA model object in Python, allowing for the execution
# of various NLP tasks such as text generation, question answering, and more, leveraging the efficiency
# and performance optimizations of C++.
# The llama_cpp module is a Python wrapper that facilitates interaction with the underlying C++ implementation
# of LLaMA models, making it accessible and usable within Python environments without needing direct
# C++ integration.

from llama_cpp import Llama

## **Download the Model**
In this section, we call `hf_hub_download` with the repository identifier and the filename defined earlier to fetch the chat-tuned Llama 2 model file. The function returns the local path of the downloaded file, which by default sits in the Hugging Face cache directory. We pass that path to `llama-cpp-python` in the next step to load the model.

In [5]:
# Use the hf_hub_download function to download the specific model file from the Hugging Face Hub.
# The function returns the local file path to the downloaded file. This path can then be used to load
# or further process the model file in your application.

model_path = hf_hub_download(repo_id=model_name_or_path, filename=model_basename)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


llama-2-13b-chat.ggmlv3.q5_K_M.bin:   0%|          | 0.00/9.23G [00:00<?, ?B/s]
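
The download size gives a rough sense of what quantization buys. The back-of-the-envelope sketch below estimates the average number of bits stored per weight. It assumes that the "G" in the progress bar means 10**9 bytes and that the model has about 13 billion parameters, as the "13b" in its name suggests. The file also holds the vocabulary and metadata, so the result is only an approximation.

```python
file_size_bytes = 9.23e9   # from the progress bar; assumes "G" means 10**9 bytes
parameter_count = 13e9     # assumed from the "13b" in the model name

# Average storage cost per weight in the quantized file
bits_per_weight = file_size_bytes * 8 / parameter_count

# Size the same weights would need at 16-bit precision (2 bytes each)
fp16_size_bytes = parameter_count * 2
compression_ratio = fp16_size_bytes / file_size_bytes

print(f"about {bits_per_weight:.2f} bits per weight")
print(f"about {compression_ratio:.1f}x smaller than 16-bit weights")
```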

## **Loading the Model**
In this section, we load the model into memory. We first check that the model file exists, then construct the `Llama` object inside a `try` block so that any failure during loading is caught and printed. Three parameters are set to match the hardware: `n_threads` (the number of CPU threads), `n_batch` (how many prompt tokens are processed per batch), and `n_gpu_layers` (how many layers to offload to the GPU). Finally, we check whether PyTorch can see a CUDA device. This check describes the runtime, not the model: `llama.cpp` uses the GPU only if `llama-cpp-python` was built with GPU support. In the output of this cell, `BLAS = 0` together with "CUDA is not available" indicates that the model runs on the CPU even though 32 GPU layers were requested.

In [6]:
import os
import torch

# model_path was returned by hf_hub_download in the previous section.
# Stop early with a clear message if the file is missing.
if not os.path.exists(model_path):
    raise FileNotFoundError(f"Model file not found: {model_path}")

# Define the variable up front so it exists even if loading fails
lcpp_llm = None

# Attempt to load the model with error handling
try:
    lcpp_llm = Llama(
        model_path=model_path,
        n_threads=2,  # Number of CPU threads to use
        n_batch=512,  # Prompt tokens processed per batch; lower it if memory is tight
        n_gpu_layers=32  # Layers to offload to the GPU; only effective in a GPU-enabled build
    )
    # Report the configured number of GPU layers. This is the requested value,
    # not proof that the layers were actually offloaded.
    print(f"Number of GPU layers loaded: {lcpp_llm.params.n_gpu_layers}")
except Exception as e:
    print(f"An error occurred while loading the model: {e}")

# Optional: check whether PyTorch can see a CUDA device. This describes the runtime only;
# llama.cpp uses the GPU only if llama-cpp-python was built with GPU support.
if torch.cuda.is_available():
    print("CUDA is available. GPU offload is possible if llama-cpp-python was built with GPU support.")
else:
    print("CUDA is not available. Model is on CPU.")

Number of GPU layers loaded: 32
CUDA is not available. Model is on CPU.


AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 
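
The system information line printed during loading can be read programmatically. The sketch below parses it into a dictionary of flags; `parse_system_info` is a hypothetical helper written for this notebook, not part of `llama-cpp-python`. A value of `BLAS = 0` suggests that this build was compiled without a BLAS or GPU backend, which would explain why the model stays on the CPU.

```python
def parse_system_info(line):
    """Turn 'NAME = 0 | NAME = 1 | ...' into a {name: int} dictionary."""
    flags = {}
    for field in line.split("|"):
        field = field.strip()
        if not field:          # skip the empty piece after the trailing '|'
            continue
        name, _, value = field.partition("=")
        flags[name.strip()] = int(value)
    return flags


system_info = (
    "AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | "
    "FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | "
    "WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | "
)
flags = parse_system_info(system_info)

if flags["BLAS"] == 0:
    print("No BLAS backend in this build; expect CPU-only inference.")
```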


## **Create a Prompt Template**
In this section, we build the prompt that is sent to the model. It has three parts: a system prompt that casts the model as a digital encyclopedia offering fun, detailed answers; the user's question, labeled with `USER:`; and a trailing `Suggestion:` cue after which the model writes its answer. Combining these into a single template gives every request the same structure, so the model receives consistent context and the tone of its answers stays predictable.

In [7]:
# Define the inquiry question that will be posed to the model. This variable contains the text of the question
# that the user wants to ask.
the_inquiry = "what does a fun euro trip look like?"

# Define the system prompt, which sets the context for the model's response. This prompt informs the model
# of its role, thus guiding the tone and type of response expected.
system_prompt = """
SYSTEM: You are a digitalized encyclopedia with up-to-date knowledge about the world. You offer fun detailed answers to questions.
"""

# Construct the user prompt by embedding the user's inquiry within a formatted string that labels it as a user's question.
user_prompt = f"USER: {the_inquiry}"

# Combine the system prompt and user prompt into a single template, followed by a cue ("Suggestion:\n") for the model
prompt_template = f"{system_prompt}\n{user_prompt}\n\nSuggestion:\n"
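
The same template can be wrapped in a small function so that it is easy to reuse with different questions. This is a sketch, not part of the original notebook: `build_prompt` and `DEFAULT_SYSTEM_MESSAGE` are names introduced here, and unlike the cell above the result does not start with a blank line.

```python
DEFAULT_SYSTEM_MESSAGE = (
    "You are a digitalized encyclopedia with up-to-date knowledge about "
    "the world. You offer fun detailed answers to questions."
)


def build_prompt(question, system_message=DEFAULT_SYSTEM_MESSAGE):
    """Assemble the SYSTEM / USER / Suggestion prompt for one question."""
    question = question.strip()
    if not question:
        raise ValueError("question must not be empty")
    return f"SYSTEM: {system_message}\n\nUSER: {question}\n\nSuggestion:\n"


print(build_prompt("what does a fun euro trip look like?"))
```

Rejecting empty questions here keeps the check in one place instead of repeating it wherever a prompt is built.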

## **Generating the Response**
In this section, we pass the prompt template to the model by calling the `lcpp_llm` object. The call takes several generation parameters: `max_tokens` caps the length of the output, `temperature` scales how random the sampling is, `top_k` and `top_p` restrict sampling to the most likely tokens, and `repeat_penalty` discourages the model from repeating itself. With `echo=True`, the returned text includes the prompt followed by the completion. The call returns a dictionary; we print it in full first and then extract just the generated text.

In [8]:
# Generate a response using the LLaMA C++ language model (`lcpp_llm`). This function call
# requests the model to process the constructed `prompt_template` and produce a response
# adhering to the specified parameters. Each parameter controls a different aspect of the
# generation process, influencing how the response is formulated.

response = lcpp_llm(prompt=prompt_template, max_tokens=256, temperature=0.5, top_p=0.95,
                    repeat_penalty=1.2, top_k=150,
                    echo=True)
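
To build intuition for `temperature`, `top_k`, and `top_p`, the sketch below applies them to a toy set of token scores. It is a conceptual illustration only: it is not the sampler that `llama.cpp` implements, the order of the steps is an assumption, `repeat_penalty` is left out, and the tokens and scores are invented.

```python
import math
import random


def sample_token(logits, temperature=0.5, top_k=150, top_p=0.95, rng=random):
    """Pick one token from a {token: logit} mapping (conceptual sketch)."""
    # Temperature: dividing by a value below 1 sharpens the distribution
    scaled = {token: logit / temperature for token, logit in logits.items()}

    # Top-k: keep only the k highest-scoring tokens
    ranked = sorted(scaled.items(), key=lambda item: item[1], reverse=True)[:top_k]

    # Softmax over the survivors (subtract the max for numerical stability)
    max_logit = ranked[0][1]
    weights = [(token, math.exp(logit - max_logit)) for token, logit in ranked]
    total = sum(weight for _, weight in weights)

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p
    kept_tokens, kept_probs, cumulative = [], [], 0.0
    for token, weight in weights:
        prob = weight / total
        kept_tokens.append(token)
        kept_probs.append(prob)
        cumulative += prob
        if cumulative >= top_p:
            break

    # Draw one token in proportion to the remaining probabilities
    return rng.choices(kept_tokens, weights=kept_probs, k=1)[0]


# Hypothetical scores for three candidate next tokens
toy_logits = {"Paris": 5.0, "Rome": 1.0, "Prague": 0.0}
print(sample_token(toy_logits))
```

With these toy scores and the default settings, "Paris" alone already exceeds the `top_p` threshold, so it is the only candidate left. Raising `temperature` flattens the distribution and lets the other tokens back in.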

In [9]:
print(response)

{'id': 'cmpl-7976138c-eb3a-43fe-aa3e-7ab8e4a96ddd', 'object': 'text_completion', 'created': 1712477238, 'model': '/root/.cache/huggingface/hub/models--TheBloke--Llama-2-13B-chat-GGML/snapshots/3140827b4dfcb6b562cd87ee3d7f07109b014dd0/llama-2-13b-chat.ggmlv3.q5_K_M.bin', 'choices': [{'text': "\nSYSTEM: You are a digitalized encyclopedia with up-to-date knowledge about the world. You offer fun detailed answers to questions.\n\nUSER: what does a fun euro trip look like?\n\nSuggestion:\nA fun Euro trip can include exploring historic cities, visiting famous landmarks, enjoying local cuisine and drinks, attending cultural events, and taking in the natural beauty of Europe's diverse landscapes. Here are some specific ideas for a memorable Euro trip:\n\n1. Explore Rome's Colosseum and Vatican City, then indulge in delicious Italian cuisine like pizza and pasta.\n2. Visit the iconic Eiffel Tower in Paris and sample French wines and cheeses at a charming bistro.\n3. Take a gondola ride through V

In [10]:
# Print the generated response text from the LLaMA model.
# The text is stored under response["choices"][0]["text"]; because echo=True was used,
# it begins with the prompt itself, followed by the completion.
print(response["choices"][0]["text"])


SYSTEM: You are a digitalized encyclopedia with up-to-date knowledge about the world. You offer fun detailed answers to questions.

USER: what does a fun euro trip look like?

Suggestion:
A fun Euro trip can include exploring historic cities, visiting famous landmarks, enjoying local cuisine and drinks, attending cultural events, and taking in the natural beauty of Europe's diverse landscapes. Here are some specific ideas for a memorable Euro trip:

1. Explore Rome's Colosseum and Vatican City, then indulge in delicious Italian cuisine like pizza and pasta.
2. Visit the iconic Eiffel Tower in Paris and sample French wines and cheeses at a charming bistro.
3. Take a gondola ride through Venice's canals and admire St. Mark's Square, then try some authentic seafood risotto for dinner.
4. Wander the medieval streets of Prague and visit Charles Bridge, followed by a night at a lively beer hall.
5. Enjoy Barcelona's beaches and modernist architecture, including Antoni Gaudí's famous Sagrada
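
The answer above stops mid-sentence, which is what happens when generation reaches the `max_tokens` limit. The sketch below shows one way to detect this. It assumes the response dictionary follows the OpenAI-style completion layout with a `finish_reason` field in each choice, which the truncated printout above does not show; the example dictionary is hypothetical and trimmed to the fields the check needs.

```python
def was_truncated(response):
    """Return True if generation stopped because max_tokens was reached."""
    return response["choices"][0].get("finish_reason") == "length"


# Hypothetical response, trimmed to the fields this check needs
example_response = {
    "choices": [
        {"text": "... Antoni Gaudí's famous Sagrada", "finish_reason": "length"}
    ],
}

if was_truncated(example_response):
    print("The answer was cut off; consider raising max_tokens.")
```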

## **Creating the Chat Bot**
In this section, we introduce 'Quick_Tips', an interactive chatbot that presents itself as a digitalized encyclopedia. Its core is the `ask_question` function: given a user's question, it builds the prompt template from the question and the predefined system prompt, passes it to the model, and prints the answer. Setting `echo=False` keeps the prompt out of the returned text, so the user sees only the model's answer.

For user interaction, we display a text input box directly in the Colab notebook. A callback registered on the box runs whenever the user presses Enter: it forwards the typed question to `ask_question`, or asks for input if the box is empty. Each submitted question is answered independently, because the prompt contains only the system prompt and the current question.

In [18]:
def ask_question(question):
    # Construct the prompt template
    the_inquiry = question
    system_prompt = "SYSTEM: You are a digitalized encyclopedia with up-to-date knowledge about the world. You offer fun detailed answers to questions."
    user_prompt = f"USER: {the_inquiry}"
    prompt_template = f"{system_prompt}\n{user_prompt}\n\nSuggestion:\n"

    # Generating the response (assuming lcpp_llm is already loaded)
    response = lcpp_llm(prompt=prompt_template, max_tokens=256, temperature=0.5, top_p=0.95,
                        repeat_penalty=1.2, top_k=150,
                        echo=False)  # Set echo to False to exclude the prompt from the response

    # Display the response to the user
    print(response["choices"][0]["text"])
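
One possible variation is to return the answer instead of printing it and to pass the model in as an argument, which makes the function easy to try out without loading the 13B model. This is a sketch under stated assumptions: `answer_question`, `SYSTEM_PROMPT`, and the stand-in `fake_llm` are names introduced here and are not part of the notebook or of `llama-cpp-python`.

```python
SYSTEM_PROMPT = (
    "SYSTEM: You are a digitalized encyclopedia with up-to-date knowledge "
    "about the world. You offer fun detailed answers to questions."
)


def answer_question(question, llm, max_tokens=256):
    """Build the prompt, call the model, and return the answer text."""
    prompt = f"{SYSTEM_PROMPT}\nUSER: {question}\n\nSuggestion:\n"
    response = llm(prompt=prompt, max_tokens=max_tokens, echo=False)
    return response["choices"][0]["text"].strip()


def fake_llm(prompt, max_tokens, echo):
    """Stand-in for lcpp_llm that returns a canned completion (hypothetical)."""
    return {"choices": [{"text": "\nA compass is a navigational tool.\n"}]}


print(answer_question("What is a compass?", fake_llm))
```

In the notebook itself, `lcpp_llm` would be passed in place of `fake_llm`.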

In [19]:
from IPython.display import display
import ipywidgets as widgets

# Introduction message
print("Hello! I'm Quick_Tips, your digitalized encyclopedia. I can provide up-to-date knowledge about the world.")
print("What would you like to know today?")

# Creating an input box for user input
input_box = widgets.Text(
    value='',
    placeholder='Type your question here...',
    description='Your Question:',
    disabled=False
)

display(input_box)

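def on_enter_pressed(text_widget):
    # on_submit passes the Text widget itself to the callback
    if text_widget.value:
        ask_question(text_widget.value)
    else:
        print("Please type a question to get an answer.")

# Run the callback when the user presses Enter in the input box.
# Note: on_submit is deprecated in recent ipywidgets releases.
input_box.on_submit(on_enter_pressed)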

Hello! I'm Quick_Tips, your digitalized encyclopedia. I can provide up-to-date knowledge about the world.
What would you like to know today?


Text(value='', description='Your Question:', placeholder='Type your question here...')

Llama.generate: prefix-match hit


You could also ask for more information, like "What kind of compass?" or "For what purpose?", in order to give a more accurate answer.


Llama.generate: prefix-match hit


A compass is a navigational tool used for determining direction. It typically consists of a magnetic needle that points towards the Earth's magnetic North Pole, and a series of markings or a dial that indicate the cardinal directions (north, south, east, west). The compass has been an essential instrument for navigation since ancient times, and is still widely used today in various fields such as geography, cartography, sailing, hiking, and aviation.

Would you like to know more about the history of the compass? Or perhaps you'd prefer information on how to use a compass for navigation?
