# Chat with any LLM! 🤖

## Introduction
This notebook demonstrates how to build an application for chatting with a Large Language Model (LLM) using Gradio. We use the [Hugging Face API](https://huggingface.co) to call a hosted model. The tutorial was originally written for `falcon-40b-instruct`, one of the top-ranking open-source LLMs; because its Inference API has since been deactivated, the cells below use `tiiuae/falcon-7b-instruct` instead. By the end of this notebook, you will have a working chat application that responds to users in natural language.

Key steps include:
1. Setting up the environment and installing necessary packages.
2. Loading the LLM and configuring the API.
3. Building the Gradio interface for user interaction.
4. Adding advanced features like chat history and customized prompts.

### Setting up the environment and installing necessary packages

In [28]:
# Install necessary packages
%pip install python-dotenv text-generation gradio

Note: you may need to restart the kernel to use updated packages.


In [30]:
# Imports (only the modules this notebook actually uses)
import os
import warnings
from dotenv import load_dotenv, find_dotenv
from text_generation import Client
import gradio as gr

# Load environment variables
load_dotenv(find_dotenv())
hf_api_key = os.environ.get('HF_API_KEY')
hf_api_falcom_base = os.environ.get('HF_API_FALCOM_BASE')

# Uncomment the following lines to print the HF API key and the endpoint URL
#print("HF API Key:", hf_api_key)
#print("Endpoint URL:", hf_api_falcom_base)

To configure the text-generation client, set the endpoint (`HF_API_FALCOM_BASE`) to the URL of the model you want to use. The original tutorial suggested an [Inference Endpoint](https://huggingface.co/inference-endpoints) for `falcon-40b-instruct`, but that model's Inference API has since been deactivated. This notebook therefore uses `tiiuae/falcon-7b-instruct`, chosen for its similar capabilities, efficient performance, and lower resource requirements. It remains a practical choice for text-generation tasks.
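A missing key or endpoint usually only shows up later as a confusing client error, so it can help to fail early. The following is a minimal sketch, not part of the original notebook: `require_env` is a hypothetical helper (it is not provided by `python-dotenv`), and the URL in the stand-in mapping is a placeholder, not a real endpoint.

```python
import os
from typing import Mapping


def require_env(name: str, env: Mapping[str, str] = os.environ) -> str:
    """Return the value of `name` from `env`, or raise a clear error if it is unset or empty."""
    value = env.get(name)
    if not value:
        raise RuntimeError(
            f"Missing environment variable {name!r}; add it to your .env file."
        )
    return value


# Illustration with a stand-in mapping (placeholder URL, assumption for the example)
fake_env = {"HF_API_FALCOM_BASE": "https://example.invalid/falcon-7b-instruct"}
endpoint = require_env("HF_API_FALCOM_BASE", fake_env)
print(endpoint)
```

In the notebook itself you would call `require_env("HF_API_KEY")` and `require_env("HF_API_FALCOM_BASE")` after `load_dotenv(find_dotenv())`, which reads from the real environment by default.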

In [31]:
# Initialize the text generation client
client = Client(hf_api_falcom_base, headers={"Authorization": f"Bearer {hf_api_key}"}, timeout=120)
warnings.filterwarnings("ignore", message="Field .* has conflict with protected namespace .*")

- Method 1: suggestion from GitHub Copilot

In [32]:
# Define the chat function
def generate(input, max_new_tokens):
    try:
        response = client.generate(input, max_new_tokens=max_new_tokens)
        return response.generated_text
    except Exception as e:
        return f"Error: {e}"

# Create the Gradio interface
demo = gr.Interface(
    fn=generate,
    inputs=[
        gr.Textbox(label="Prompt"),
        gr.Slider(label="Max new tokens", value=20, maximum=1024, minimum=1)
    ],
    outputs=gr.Textbox(label="Completion")
)

# Launch the interface
demo.launch(share=True, server_port=int(os.environ.get('PORT1', 7860)))

* Running on local URL:  http://127.0.0.1:7860
* Running on public URL: https://3b1a945758ed97c56f.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)




In [33]:
gr.close_all()

Closing server running on port: 7860
Closing server running on port: 7860
Closing server running on port: 7860


## Building an app to chat with any LLM

The original tutorial used an [Inference Endpoint](https://huggingface.co/inference-endpoints) for `falcon-40b-instruct`, then the best-ranking open-source LLM on the [🤗 Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard). The `falcon-40b-instruct` Inference API has since been deactivated, so this notebook uses `tiiuae/falcon-7b-instruct` instead, chosen for its similar capabilities, efficient performance, and lower resource requirements. It remains a practical choice for text-generation tasks.

In [34]:
prompt = "Has math been invented or discovered?"
response=client.generate(prompt, max_new_tokens=256)
print(response)


generated_text='\nMath has been discovered, not invented. It is a system of rules and formulas that are used to describe the natural world and its behavior.' details=Details(finish_reason=<FinishReason.EndOfSequenceToken: 'eos_token'>, generated_tokens=30, seed=None, prefill=[], tokens=[Token(id=193, text='\n', logprob=-0.004764557, special=False), Token(id=25864, text='Math', logprob=-0.19885254, special=False), Token(id=504, text=' has', logprob=-0.3696289, special=False), Token(id=650, text=' been', logprob=-0.6875, special=False), Token(id=6524, text=' discovered', logprob=-0.5136719, special=False), Token(id=23, text=',', logprob=-1.4863281, special=False), Token(id=416, text=' not', logprob=-0.79785156, special=False), Token(id=21886, text=' invented', logprob=-0.0031585693, special=False), Token(id=25, text='.', logprob=-0.09484863, special=False), Token(id=605, text=' It', logprob=-1.2431641, special=False), Token(id=304, text=' is', logprob=-0.93408203, special=False), Token(i

In [35]:
# Initialize the text generation client
client = Client(hf_api_falcom_base, headers={"Authorization": f"Bearer {hf_api_key}"}, timeout=120)
warnings.filterwarnings("ignore", message="Field .* has conflict with protected namespace .*")

In [36]:
#import gradio as gr
def generate(input, slider):
    output = client.generate(input, max_new_tokens=slider).generated_text
    return output

demo = gr.Interface(fn=generate, 
                    inputs=[gr.Textbox(label="Prompt"), 
                            gr.Slider(label="Max new tokens", 
                                      value=20,  
                                      maximum=1024, 
                                      minimum=1)], 
                    outputs=[gr.Textbox(label="Completion")])

gr.close_all()
# Fall back to port 7860 when PORT1 is not set, instead of raising a KeyError
demo.launch(share=True, server_port=int(os.environ.get('PORT1', 7860)))

Closing server running on port: 7860
Closing server running on port: 7860
Closing server running on port: 7860
* Running on local URL:  http://127.0.0.1:7860
* Running on public URL: https://983b3a5daccd2629d7.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)




In [None]:
gr.close_all()

## `gr.Chatbot()`

- `gr.Chatbot()` allows you to save the chat history (between the user and the LLM) as well as display the dialogue in the app.
- Define your `fn` to take in a `gr.Chatbot()` object.  
  - Within your defined `fn` function, append a tuple (or a list) containing the user message and the LLM's response:
`chatbot_object.append( (user_message, llm_message) )`

- Include the chatbot object in both the inputs and the outputs of the app.
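The pattern in the bullets above can be tried without Gradio at all. This is a minimal sketch under that assumption: `echo_respond` is a hypothetical stand-in for your `fn`, and the plain list plays the role of the chatbot object that Gradio would pass in and receive back on every call.

```python
def echo_respond(message, chat_history):
    """Append one (user, bot) turn and return an empty string to clear the textbox."""
    bot_message = f"You said: {message}"
    chat_history.append((message, bot_message))
    return "", chat_history


# Simulate two submissions; Gradio would feed the returned history back in each time
history = []
cleared, history = echo_respond("Hello", history)
cleared, history = echo_respond("I like Gradio", history)

for user_message, bot_message in history:
    print(f"User: {user_message} | Bot: {bot_message}")
```

Returning the history as an output is what makes the next call see the earlier turns, which is why the chatbot appears in both `inputs` and `outputs`.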

In [None]:
import random

def respond(message, chat_history):
    # No LLM here, just respond with a random pre-made message
    bot_message = random.choice(["Tell me more about it",
                                 "Cool, but I'm not interested",
                                 "Hmmmm, ok then"])
    chat_history.append((message, bot_message))
    return "", chat_history

with gr.Blocks() as demo:
    chatbot = gr.Chatbot(height=240) #just to fit the notebook
    msg = gr.Textbox(label="Prompt")
    btn = gr.Button("Submit")
    clear = gr.ClearButton(components=[msg, chatbot], value="Clear console")

    btn.click(respond, inputs=[msg, chatbot], outputs=[msg, chatbot])
    msg.submit(respond, inputs=[msg, chatbot], outputs=[msg, chatbot]) #Press enter to submit

gr.close_all()
demo.launch(share=True, server_port=int(os.environ.get('PORT2', 7870)))


In [None]:
gr.close_all()

### Format the prompt with the chat history

- You can iterate through the chatbot object with a for loop.
- Each item is a tuple containing the user message and the LLM's message.

```Python
for turn in chat_history:
    user_msg, bot_msg = turn
    ...
```

In [None]:
def format_chat_prompt(message, chat_history):
    prompt = ""
    for turn in chat_history:
        user_message, bot_message = turn
        prompt = f"{prompt}\nUser: {user_message}\nAssistant: {bot_message}"
    prompt = f"{prompt}\nUser: {message}\nAssistant:"
    return prompt

def respond(message, chat_history):
    formatted_prompt = format_chat_prompt(message, chat_history)
    # stop_sequences keeps the model from writing the user's next turn
    bot_message = client.generate(formatted_prompt,
                                  max_new_tokens=1024,
                                  stop_sequences=["\nUser:", "<|endoftext|>"]).generated_text
    chat_history.append((message, bot_message))
    return "", chat_history

with gr.Blocks() as demo:
    chatbot = gr.Chatbot(height=240) #just to fit the notebook
    msg = gr.Textbox(label="Prompt")
    btn = gr.Button("Submit")
    clear = gr.ClearButton(components=[msg, chatbot], value="Clear console")

    btn.click(respond, inputs=[msg, chatbot], outputs=[msg, chatbot])
    msg.submit(respond, inputs=[msg, chatbot], outputs=[msg, chatbot]) #Press enter to submit

gr.close_all()
demo.launch(share=True, server_port=int(os.environ.get('PORT3', 7880)))
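To see what the model actually receives, it can help to print the formatted prompt for a small conversation. The sketch below repeats the logic of `format_chat_prompt` so that it runs on its own; the sample history is made up for illustration.

```python
def format_chat_prompt(message, chat_history):
    """Same logic as the notebook's function, repeated so this block runs on its own."""
    prompt = ""
    for user_message, bot_message in chat_history:
        prompt = f"{prompt}\nUser: {user_message}\nAssistant: {bot_message}"
    return f"{prompt}\nUser: {message}\nAssistant:"


# Hypothetical conversation, for illustration only
sample_history = [("Hi", "Hello! How can I help?")]
sample_prompt = format_chat_prompt("What is Gradio?", sample_history)
print(sample_prompt)
```

The prompt ends with `Assistant:` so that the model continues in the assistant's voice, and the `"\nUser:"` stop sequence keeps it from also writing the user's next message.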

### Adding other advanced features

In [24]:
def format_chat_prompt(message, chat_history, instruction):
    prompt = f"System: {instruction}"
    for turn in chat_history:
        user_message, bot_message = turn
        prompt = f"{prompt}\nUser: {user_message}\nAssistant: {bot_message}"
    prompt = f"{prompt}\nUser: {message}\nAssistant:"
    return prompt

### Streaming

- If your LLM can provide its tokens one at a time in a stream, you can accumulate those tokens in the chatbot object.
- The `for` loop in the following function goes through all the tokens that are in the stream and appends them to the most recent conversational turn in the chatbot's message history.
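The accumulation described above can be sketched without a model. In this minimal example, `fake_token_stream` is a hypothetical stand-in for the LLM's token stream, and `stream_into_history` is an illustrative name rather than part of Gradio or `text_generation`.

```python
from typing import Iterator, List


def fake_token_stream() -> Iterator[str]:
    """Stand-in for an LLM stream; yields hard-coded tokens one at a time."""
    yield from [" Math", " has", " been", " discovered", "."]


def stream_into_history(message: str,
                        chat_history: List[List[str]]) -> Iterator[List[List[str]]]:
    """Yield the chat history after each token is appended to the latest turn."""
    # Start a new turn with an empty assistant reply
    chat_history = chat_history + [[message, ""]]
    for idx, token in enumerate(fake_token_stream()):
        # Drop the leading space of the very first token
        if idx == 0 and token.startswith(" "):
            token = token[1:]
        chat_history[-1][1] += token
        yield chat_history


# Record the assistant's partial reply after every yielded update
snapshots = [
    history[-1][1]
    for history in stream_into_history("Has math been invented or discovered?", [])
]
print(snapshots)
```

Each yielded history is what Gradio would render at that moment, so the reply appears to grow token by token in the chat window.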

In [25]:
def respond(message, chat_history, instruction, temperature=0.7):
    prompt = format_chat_prompt(message, chat_history, instruction)
    # Add the new user turn with an empty assistant reply to fill in
    chat_history = chat_history + [[message, ""]]
    # stop_sequences keeps the model from generating the user's next turn
    stream = client.generate_stream(prompt,
                                    max_new_tokens=1024,
                                    stop_sequences=["\nUser:", "<|endoftext|>"],
                                    temperature=temperature)
    # Stream the tokens into the most recent turn
    for idx, response in enumerate(stream):
        text_token = response.token.text

        # `details` is only set on the final message of the stream; stop there
        if response.details:
            return

        # Drop the leading space of the first token
        if idx == 0 and text_token.startswith(" "):
            text_token = text_token[1:]

        last_turn = list(chat_history.pop(-1))
        last_turn[-1] += text_token
        chat_history = chat_history + [last_turn]
        yield "", chat_history

In [None]:
with gr.Blocks() as demo:
    chatbot = gr.Chatbot(height=240) #just to fit the notebook
    msg = gr.Textbox(label="Prompt")
    with gr.Accordion(label="Advanced options",open=False):
        system = gr.Textbox(label="System message", lines=2, value="A conversation between a user and an LLM-based AI assistant. The assistant gives helpful and honest answers.")
        temperature = gr.Slider(label="temperature", minimum=0.1, maximum=1, value=0.7, step=0.1)
    btn = gr.Button("Submit")
    clear = gr.ClearButton(components=[msg, chatbot], value="Clear console")

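    # Pass the temperature slider too; otherwise respond() always uses its default of 0.7
    btn.click(respond, inputs=[msg, chatbot, system, temperature], outputs=[msg, chatbot])
    msg.submit(respond, inputs=[msg, chatbot, system, temperature], outputs=[msg, chatbot]) #Press enter to submit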

gr.close_all()
demo.queue().launch(share=True, server_port=int(os.environ.get('PORT4', 7890)))

Note that the cell above calls `demo.queue().launch()` instead of `demo.launch()`. Enabling the queue helps the demo perform better when handling requests. See [setting up a demo for maximum performance](https://www.gradio.app/guides/setting-up-a-demo-for-maximum-performance) for more details.

In [None]:
gr.close_all()

## Conclusion

In this notebook, we built a chat application using Gradio and the Hugging Face API. The tutorial was designed around the `falcon-40b-instruct` model; since its Inference API has been deactivated, we used `tiiuae/falcon-7b-instruct` instead. We covered setting up the environment, configuring the client, and creating an interactive interface with chat history, a system message, and streaming. We hope this tutorial helps you understand how to integrate LLMs into your applications.

For further reading and exploration:
- [Gradio Documentation](https://gradio.app)
- [Hugging Face API](https://huggingface.co/docs)
- [Open LLM Leaderboard](https://huggingface.co/open-llm-leaderboard)

Feel free to experiment with different models and customize the application to suit your needs. Happy coding!