#***Llama 2 Discord ChatBot***

The Llama 2 collection comprises pretrained and fine-tuned generative text models ranging from 7 billion to 70 billion parameters. This tutorial focuses on the chat-oriented models, which are fine-tuned for two-way conversation.

We are using the free tier of Google Colab, so we will use the **Llama 2 13B-chat** model. The Hugging Face community provides quantized versions of the model that run efficiently on Colab's T4 GPU. Download models only from sources you trust.

Several variations are available, but the ones of interest to us are based on the GGML library.

In this case, we will use the model called [Llama-2-13B-chat-GGML](https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML), specifically the ggmlv3.q5_1.bin variation, a 9.76 GB file (the matching 7B file is 5.06 GB and needs about 7.56 GB of memory). The model card lists the memory each variation requires.

We can see the different variations that Llama-2-13B-GGML has [here](https://huggingface.co/models?search=llama%202%20ggml).




You can switch to a smaller model, such as the **Llama 2 7B-chat model**, for quicker generation.
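As a rough sanity check when choosing between model sizes, the size of a quantized file is approximately the parameter count times the bits stored per weight. The sketch below assumes that q5_1 costs about 6 bits per weight (5-bit values plus per-block metadata) and uses rounded parameter counts; both are assumptions for illustration, not figures taken from the model card.

```python
def estimate_file_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size, in GB (10**9 bytes), of a quantized model file."""
    return n_params * bits_per_weight / 8 / 1e9


# Assumed values, for illustration only.
Q5_1_BITS_PER_WEIGHT = 6.0  # assumption: 5-bit weights plus per-block metadata
PARAMS_13B = 13.0e9         # rounded parameter count
PARAMS_7B = 6.7e9           # rounded parameter count

print(f"13B q5_1: ~{estimate_file_size_gb(PARAMS_13B, Q5_1_BITS_PER_WEIGHT):.2f} GB")
print(f"7B  q5_1: ~{estimate_file_size_gb(PARAMS_7B, Q5_1_BITS_PER_WEIGHT):.2f} GB")
```

The 13B estimate lands close to the 9.76 GB that the download in Step 3 reports. The model also needs room for the context buffers, so actual memory use is higher than the file size.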

Here are some of the tools we will use to build the chatbot:


**llama.cpp** is a C/C++ implementation of LLaMA inference. Its original goal was to run the model with 4-bit integer quantization on a MacBook; it now supports a range of platforms and serves as the main playground for developing new ggml features.

**LangChain** is an intuitive, open-source framework designed to streamline the development of applications utilizing large language models (LLMs), such as those from OpenAI or Hugging Face. This platform provides a multitude of invaluable tools, including prompt templates and memory buffers, to enhance the development process.

**GGML** is a C library that enables efficient execution of large language models (LLMs) on regular hardware through quantization. GGML files store version info, hyperparameters, vocabulary, and weights. Quantization reduces memory use by storing the weights at lower precision.
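To make the idea of quantization concrete, here is a minimal sketch of block-wise quantization in plain Python. It is a simplified illustration only: the function names are hypothetical, and GGML's actual block formats and kernels differ in their details.

```python
def quantize_block(weights, bits=4):
    """Map floats to integer codes in [0, 2**bits - 1] using a per-block scale and minimum."""
    levels = (1 << bits) - 1
    w_min, w_max = min(weights), max(weights)
    # Fall back to a scale of 1.0 when every weight in the block is identical.
    scale = (w_max - w_min) / levels or 1.0
    codes = [round((w - w_min) / scale) for w in weights]
    return codes, scale, w_min


def dequantize_block(codes, scale, w_min):
    """Recover approximate float weights from the integer codes."""
    return [c * scale + w_min for c in codes]


weights = [-1.0, -0.4, 0.0, 0.25, 1.0]
codes, scale, w_min = quantize_block(weights, bits=4)
restored = dequantize_block(codes, scale, w_min)
print(codes)
print([round(r, 3) for r in restored])
```

Each weight is stored as a small integer plus a shared scale and minimum per block, which is where the memory saving comes from. The restored values differ from the originals by at most half a quantization step.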

#**Step 1: Install All the Required Packages**

In [None]:
# Installing GPU llama-cpp-python
!CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir --verbose
# To download the models, install huggingface_hub
!pip install huggingface_hub
# Install the packages needed for development.
!pip install -q transformers einops accelerate langchain bitsandbytes
# Install the Discord library to build the bot, plus nest_asyncio to run it inside the notebook.
!pip install discord.py nest_asyncio

Using pip 23.1.2 from /usr/local/lib/python3.10/dist-packages/pip (python 3.10)
Collecting llama-cpp-python
  Downloading llama_cpp_python-0.1.78.tar.gz (1.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.7/1.7 MB 23.2 MB/s eta 0:00:00
  Running command pip subprocess to install build dependencies
  Using pip 23.1.2 from /usr/local/lib/python3.10/dist-packages/pip (python 3.10)
  Collecting setuptools>=42
    Downloading setuptools-68.1.2-py3-none-any.whl (805 kB)
       ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 805.1/805.1 kB 6.3 MB/s eta 0:00:00
  Collecting scikit-build>=0.13
    Downloading scikit_build-0.17.6-py3-none-any.whl (84 kB)
       ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 84.3/84.3 kB 6.2 MB/s eta 0:00:00
  Collecting cmake>=3.18
    Downloading cmake-3.27.2-py2.py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (26.1 MB)
       ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 26.1/26.1 MB 29.9 MB/s eta 0:00:00
  Collecting ni

#**Step 2: Import All the Required Libraries**

In [None]:
from huggingface_hub import hf_hub_download
from langchain.llms import LlamaCpp
from langchain import PromptTemplate, LLMChain
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain import ConversationChain
from langchain.memory import ConversationBufferMemory, ConversationBufferWindowMemory

#**Step 3: Download the Model**


Explore the links provided above to discover more models and their variations to find the one that best suits your needs.

In [None]:
model_name_or_path = "TheBloke/Llama-2-13B-chat-GGML" # Repository ID on the Hugging Face Hub.
model_basename = "llama-2-13b-chat.ggmlv3.q5_1.bin" # File name of the chosen variation inside the repository.

In [None]:
model_path = hf_hub_download(repo_id=model_name_or_path, filename=model_basename)

Downloading (…)chat.ggmlv3.q5_1.bin:   0%|          | 0.00/9.76G [00:00<?, ?B/s]

#**Step 4: Loading the Model using llama.cpp**

Consider changing the parameters below according to your needs.

In [None]:
# Load the model with GPU offloading.
lcpp_llm = None  # Release the reference to any previously loaded model before reloading.
lcpp_llm = LlamaCpp(
    model_path=model_path,
    n_threads=2,  # Number of CPU threads to use.
    n_batch=512,  # Tokens processed per batch; between 1 and n_ctx. Lower it if VRAM is tight.
    n_ctx=2048,  # Maximum length, in tokens, of the prompt and output combined.
    n_gpu_layers=64,  # Layers to offload to the GPU; adjust for your model and available VRAM.
    temperature=0.5,  # Sampling randomness; the closer to 0, the more deterministic the output.
    max_tokens=150,  # Maximum number of tokens generated per response.
    top_p=0.95,  # Nucleus sampling: keep the most likely tokens whose cumulative probability reaches 95%.
    repeat_penalty=1.2,  # Penalty on repeated tokens; values above 1 discourage repetition.
    top_k=50,  # Sample only from the 50 most likely tokens.
)
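
To see how `temperature`, `top_k`, and `top_p` interact, the following sketch filters a toy distribution in plain Python. It is an illustration under simplifying assumptions: the function name and the toy logits are hypothetical, and llama.cpp may apply these steps in a different order and with additional samplers.

```python
import math


def filter_distribution(logits, temperature=0.5, top_k=50, top_p=0.95):
    """Return {token: probability} after temperature scaling, top-k, and top-p filtering."""
    # Temperature: divide the logits before softmax; lower values sharpen the distribution.
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    peak = max(scaled.values())
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(exps.values())
    ranked = sorted(
        ((tok, e / total) for tok, e in exps.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )
    # Top-k: keep only the k most likely tokens.
    ranked = ranked[:top_k]
    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= top_p:
            break
    # Renormalize so the surviving probabilities sum to 1.
    norm = sum(prob for _, prob in kept)
    return {tok: prob / norm for tok, prob in kept}


toy_logits = {"novel": 2.0, "poem": 1.0, "essay": 0.0, "manual": -1.0}
print(filter_distribution(toy_logits, temperature=0.5, top_k=50, top_p=0.95))
print(filter_distribution(toy_logits, temperature=1.0, top_k=50, top_p=0.95))
```

With the lower temperature, fewer tokens survive the top-p cut because most of the probability mass concentrates on the top candidates.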

#**Step 5: Create a Prompt Template**

A prompt template is the fixed text sent to the model ahead of each user message. It sets the persona and the conversation format, which steers the model's predictions, and its placeholders (`{history}` and `{input}`) are filled in on every turn.

In [None]:
template = """SYSTEM: You are an AI designed for roleplaying in chat conversations.
Imagine you are a book enthusiast named Alex who has read a wide range of literary works and suggests books to readers.
Current conversation:
{history}
Reader: {input}
Alex:"""
PROMPT = PromptTemplate(input_variables=["history", "input"], template=template)
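
The placeholder substitution can be illustrated without LangChain. The sketch below uses Python's built-in `str.format` on a shortened copy of the template; `render_prompt` is a hypothetical helper, not part of LangChain.

```python
# Shortened copy of the template, with the same two placeholders.
TEMPLATE = """SYSTEM: You are an AI designed for roleplaying in chat conversations.
Current conversation:
{history}
Reader: {input}
Alex:"""


def render_prompt(history: str, user_input: str) -> str:
    """Fill the placeholders to produce the text that is sent to the model."""
    return TEMPLATE.format(history=history, input=user_input)


print(render_prompt("Reader: Hi\nAlex: Hello! What do you like to read?", "Any sci-fi picks?"))
```

The rendered text ends with `Alex:`, so the model continues the conversation in the persona's voice.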

#**Step 6: Generating the Response**

We use LangChain's ConversationChain, which is designed for chatbots.

With the prompt template alone we can already start a conversation with the model, but the model does not retain that conversation between calls.

To give the model a memory of the conversation, we use ConversationBufferWindowMemory from LangChain. Adjust its parameters to match your prompt: here `ai_prefix` is "Alex" and `human_prefix` is "Reader", and `k` is the number of conversation exchanges kept in memory. Once the conversation exceeds 5 exchanges, the oldest one is dropped.

In [None]:
convo = ConversationChain(llm=lcpp_llm, prompt=PROMPT,  verbose=True, memory=ConversationBufferWindowMemory(ai_prefix="Alex", human_prefix="Reader", k=5))

In [None]:
response = convo.run(input="Talk about the book you recommended")
print(response)
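
The windowing behaviour can be sketched in a few lines with `collections.deque`. The `WindowMemory` class below is a hypothetical stand-in that only mirrors the idea; it is not LangChain's implementation.

```python
from collections import deque


class WindowMemory:
    """Keep only the last k exchanges and render them as prompt history."""

    def __init__(self, k=5, human_prefix="Reader", ai_prefix="Alex"):
        self.turns = deque(maxlen=k)  # the oldest exchange is dropped automatically
        self.human_prefix = human_prefix
        self.ai_prefix = ai_prefix

    def save(self, human_text, ai_text):
        """Record one exchange: a user message and the model's reply."""
        self.turns.append((human_text, ai_text))

    def history(self):
        """Render the retained exchanges in the format the prompt expects."""
        lines = []
        for human_text, ai_text in self.turns:
            lines.append(f"{self.human_prefix}: {human_text}")
            lines.append(f"{self.ai_prefix}: {ai_text}")
        return "\n".join(lines)


memory = WindowMemory(k=2)
memory.save("Hi", "Hello!")
memory.save("Any sci-fi picks?", "Try Dune.")
memory.save("Something shorter?", "Try a short story collection.")
print(memory.history())  # the first exchange is no longer included
```

A larger `k` gives the model more context at the cost of a longer prompt, which counts against `n_ctx`.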

# **DISCORD BOT SECTION**

Remember to add your bot's token. You can find it in the [Discord Developer Portal](https://discord.com/developers/applications). Keep the token secret, because anyone who has it can control your bot.
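
One way to keep the token out of the notebook is to read it from an environment variable. This is a minimal sketch; the helper `load_bot_token` and the variable name `DISCORD_BOT_TOKEN` are assumptions chosen for illustration.

```python
import os


def load_bot_token(env=os.environ, var_name="DISCORD_BOT_TOKEN"):
    """Return the bot token from the environment, or fail with a clear message."""
    token = env.get(var_name, "").strip()
    if not token:
        raise RuntimeError(f"Set the {var_name} environment variable to your bot token.")
    return token
```

The result can then be passed to `client.run(load_bot_token())` in place of a hard-coded string.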

In [None]:
import asyncio

import discord
import nest_asyncio

# Allow client.run() to start inside the notebook's already-running event loop.
nest_asyncio.apply()

intents = discord.Intents.default()
client = discord.Client(intents=intents)
generation_in_progress = False  # True while a reply is being generated

@client.event
async def on_ready():
    print('We have logged in as {0.user}'.format(client))
    activity = discord.Activity(name="Warning, hallucinating bot incoming.", type=discord.ActivityType.watching)
    await client.change_presence(status=discord.Status.online, activity=activity)

@client.event
async def on_message(message):
    global generation_in_progress  # Access the global flag

    # Ignore the bot's own messages and any message that does not mention it.
    if message.author == client.user:
        return
    content = message.content.replace('<@!', '<@')
    if client.user.mention not in content:
        return
    # Handle one request at a time; skip mentions that arrive while busy.
    if not client.is_ready() or generation_in_progress:
        return

    generation_in_progress = True
    try:
        async with message.channel.typing():
            print("Generating")
            user_message = content.replace(client.user.mention, '').strip()
            # Run the blocking generation in a worker thread so the event loop stays responsive.
            results = await asyncio.to_thread(convo.run, input=user_message)
            await message.channel.send(results)
    finally:
        generation_in_progress = False  # Reset the flag even if generation fails

client.run('YOUR_BOT_TOKEN')  # Replace with your own bot token; never publish a real token.
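
The bot uses the text of a message that mentions it as the prompt. That extraction step can be written as a standalone function, which makes it easy to test without a live bot. The sketch below uses a regular expression; `extract_user_message` and the example IDs are hypothetical.

```python
import re


def extract_user_message(content: str, bot_id: int) -> str:
    """Remove every mention of the bot (<@id> or <@!id>) and trim the whitespace."""
    pattern = re.compile(rf"<@!?{bot_id}>")
    return pattern.sub("", content).strip()


print(extract_user_message("<@!123> recommend a book", 123))
print(extract_user_message("is 5 > 3? <@123>", 123))
```

Unlike splitting on `>`, this keeps any `>` characters that are part of the user's own text and leaves mentions of other users untouched.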
