# Task
Create a Google Colab notebook that sets up and runs a free, highly interactive, and advanced talking bot for customer service using the most capable free and open-source large language model runnable efficiently within Google Colab's typical resource limits.

## Setup environment

### Subtask:
Install necessary libraries and dependencies (e.g., `transformers`, `accelerate`, `bitsandbytes`, `gradio`).


**Reasoning**:
Install the required libraries using pip.



In [1]:
!pip install transformers accelerate bitsandbytes gradio

Collecting bitsandbytes
  Downloading bitsandbytes-0.46.1-py3-none-manylinux_2_24_x86_64.whl.metadata (10 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch>=2.0.0->accelerate)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch>=2.0.0->accelerate)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch>=2.0.0->accelerate)
  Downloading nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70 (from torch>=2.0.0->accelerate)
  Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu12==12.4.5.8 (from torch>=2.0.0->accelerate)
  Downloading nvidia_cublas_cu12-12.4.5.8-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cufft-cu12==11.2
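
Because the install log above is cut off, it can be worth confirming what was actually installed before moving on. The following is a minimal sketch that uses only the standard library; the helper name `installed_versions` is illustrative and not part of the original notebook.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if it is missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    # The four packages requested by the pip command in cell 1.
    required = ["transformers", "accelerate", "bitsandbytes", "gradio"]
    for name, ver in installed_versions(required).items():
        print(f"{name}: {ver or 'NOT INSTALLED'}")
```

If any entry is reported as missing, rerunning the install cell (and restarting the runtime if Colab asks for it) is usually the first thing to try.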

# Task
Create a Google Colab notebook that sets up and runs a free, highly interactive, and advanced talking bot for customer service using the most capable free and open-source large language model available that can be run efficiently within Google Colab's typical resource limits (e.g., a quantized version of Llama 3 8B, Mistral 7B, or similar, prioritizing performance and quality). The notebook should include model and tokenizer loading, pipeline setup, and a Gradio interface for interaction.

## Model selection and loading

### Subtask:
Choose a suitable free and open-source LLM (like a quantized version of Llama 3 8B or Mistral 7B) that can run efficiently in Colab and load it using `transformers`.


**Reasoning**:
Import the necessary class and load a suitable quantized model using `from_pretrained` with appropriate arguments for Colab efficiency.



In [2]:
from transformers import AutoModelForCausalLM

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    load_in_4bit=True,
    torch_dtype="auto",
    trust_remote_code=True,
)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


OSError: You are trying to access a gated repo.
Make sure to have access to it at https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct.
401 Client Error. (Request ID: Root=1-6879fbe0-21bce2235252f71845670983;95a64768-028a-4cb0-9715-1bacb3ab2c19)

Cannot access gated repo for url https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct/resolve/main/config.json.
Access to model meta-llama/Meta-Llama-3-8B-Instruct is restricted. You must have access to it and be authenticated to access it. Please log in.
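
The error above is an authentication failure, not a problem with the loading code itself. As an alternative to switching models, one could accept the license on the model page and pass a token. The sketch below only shows how a token might be looked up; `resolve_hf_token` is a hypothetical helper, and it assumes the secret or environment variable is named `HF_TOKEN`, as in the warning above.

```python
import os


def resolve_hf_token(env=None):
    """Return a Hugging Face token from Colab secrets or the environment, else None."""
    env = os.environ if env is None else env
    try:
        # Only importable inside Colab; assumes a secret named HF_TOKEN was created.
        from google.colab import userdata
        token = userdata.get("HF_TOKEN")
        if token:
            return token
    except Exception:
        # Not running in Colab, or the secret is missing or not shared with this notebook.
        pass
    return env.get("HF_TOKEN") or None


# Usage sketch (not executed here): pass the token through to from_pretrained.
# token = resolve_hf_token()
# model = AutoModelForCausalLM.from_pretrained(model_id, token=token, device_map="auto")
```

A token alone is not enough for a gated repository: the account that owns it must also have been granted access to the model.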

**Reasoning**:
The previous attempt to load the model failed because it was a gated model requiring authentication. I will choose an alternative, readily available open-source model that is suitable for Colab and try loading it again.



In [3]:
from transformers import AutoModelForCausalLM

model_id = "NousResearch/Llama-3-8B-Instruct-DPO" # An alternative, freely available model
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    load_in_4bit=True,
    torch_dtype="auto",
    trust_remote_code=True,
)

OSError: NousResearch/Llama-3-8B-Instruct-DPO is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models'
If this is a private repository, make sure to pass a token having permission to this repo either by logging in with `huggingface-cli login` or by passing `token=<your_token>`

**Reasoning**:
The previous model ID was incorrect. I need to find a publicly available, 4-bit quantized Llama 3 8B model on Hugging Face Hub that is known to work well on Colab and try loading that.



In [4]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "unsloth/llama-3-8b-Instruct-bnb-4bit" # A readily available 4-bit quantized model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.float16,
    load_in_4bit=True,
    trust_remote_code=True,
)

tokenizer_config.json: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/345 [00:00<?, ?B/s]

config.json: 0.00B [00:00, ?B/s]

The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.


model.safetensors:   0%|          | 0.00/5.70G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/220 [00:00<?, ?B/s]
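
The deprecation warning in the output above asks for a `BitsAndBytesConfig` passed as `quantization_config` instead of `load_in_4bit=True`. A sketch of what that could look like follows. The function name `load_4bit_model` is illustrative, and the NF4 quantization type is an assumption rather than something the notebook specifies. For a checkpoint that is already stored in 4-bit form, the settings saved with the checkpoint may take precedence over the ones passed here.

```python
def load_4bit_model(model_id="unsloth/llama-3-8b-Instruct-bnb-4bit"):
    """Load a causal LM in 4-bit using quantization_config instead of load_in_4bit."""
    # Imports live inside the function so that defining it has no heavy dependencies.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    quant_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",             # assumption: NF4 is a common choice
        bnb_4bit_compute_dtype=torch.float16,  # matches the float16 dtype used in cell 4
    )
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        device_map="auto",
        quantization_config=quant_config,
    )
    return model, tokenizer


# Usage sketch (downloads several GB and needs a GPU runtime):
# model, tokenizer = load_4bit_model()
```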

## Tokenizer loading

### Subtask:
Load the corresponding tokenizer for the selected model.


**Reasoning**:
Load the tokenizer for the chosen model. Cell 4 already did this, so the call below is redundant but harmless.



In [5]:
tokenizer = AutoTokenizer.from_pretrained(model_id)

## Pipeline setup

### Subtask:
Set up a text generation pipeline using `transformers`.


**Reasoning**:
Import the `pipeline` function and create a text generation pipeline using the loaded model and tokenizer.



In [6]:
from transformers import pipeline

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

Device set to use cuda:0
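
Before building a UI on top of the pipeline, a quick smoke test can confirm that it generates text at all. The wrapper below is a sketch, not part of the original notebook: `generate_reply` is a hypothetical helper, and it relies on the text-generation pipeline's `return_full_text=False` option to return only the newly generated text.

```python
def generate_reply(pipe, prompt, max_new_tokens=64):
    """Run a text-generation pipeline and return only the newly generated text."""
    outputs = pipe(
        prompt,
        max_new_tokens=max_new_tokens,
        do_sample=False,         # greedy decoding keeps the smoke test repeatable
        return_full_text=False,  # drop the echoed prompt from the output
    )
    return outputs[0]["generated_text"].strip()


# Usage sketch with the pipeline created in cell 6:
# print(generate_reply(pipe, "Briefly introduce yourself as a support assistant."))
```

Because the pipeline is passed in as an argument, the same function can be exercised with a stub during testing, without loading the model.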


## Gradio interface creation

### Subtask:
Create a simple Gradio interface for the chatbot.


**Reasoning**:
Define the Gradio interface components and the chatbot response function using the previously loaded model and tokenizer.



In [7]:
import gradio as gr

def chatbot_response(message, history):
    # Format the input for the model (adjust based on the model's expected format)
    prompt = f"<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n{message}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
    response = pipe(prompt, max_new_tokens=256)[0]['generated_text']

    # Extract only the assistant's response
    # This is a simple approach; more robust parsing might be needed depending on the model's output
    response_start = response.find("<|start_header_id|>assistant<|end_header_id|>")
    if response_start != -1:
        response = response[response_start + len("<|start_header_id|>assistant<|end_header_id|>") + 1:].strip()

    return response

# Define the Gradio interface
iface = gr.Interface(
    fn=chatbot_response,
    inputs=gr.Textbox(lines=2, placeholder="Enter your message here..."),
    outputs="textbox",
    title="Customer Service Chatbot",
    description="Interact with a free and open-source LLM for customer service.",
    examples=[["What is your return policy?"], ["How can I track my order?"]]
)



**Reasoning**:
The previous cell generated a warning: `chatbot_response` takes two arguments (`message` and `history`), as Gradio's chat components expect, but the interface was built with a single input textbox. A chatbot should also carry the conversation history from turn to turn. I will switch to `gr.ChatInterface`, which is designed for chatbot applications and passes both the new message and the history to the function automatically.



In [8]:
import gradio as gr

def chatbot_response(message, history):
    # history is a list of (user_message, bot_response) tuples.
    # Llama 3 expects a single <|begin_of_text|> at the very start and a blank line after each header.
    conversation = "<|begin_of_text|>"
    for user, bot in history:
        conversation += f"<|start_header_id|>user<|end_header_id|>\n\n{user}<|eot_id|>"
        conversation += f"<|start_header_id|>assistant<|end_header_id|>\n\n{bot}<|eot_id|>"
    conversation += f"<|start_header_id|>user<|end_header_id|>\n\n{message}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"

    response = pipe(conversation, max_new_tokens=256, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)[0]['generated_text']

    # Extract only the assistant's response from the potentially longer generated text
    response_start = response.rfind("<|start_header_id|>assistant<|end_header_id|>")
    if response_start != -1:
        response = response[response_start + len("<|start_header_id|>assistant<|end_header_id|>") + 1:].strip()
        # Remove any trailing end-of-turn or end-of-text tokens if the model generated them
        response = response.split("<|eot_id|>")[0].strip()

    return response

# Define the Gradio ChatInterface
iface = gr.ChatInterface(
    fn=chatbot_response,
    title="Customer Service Chatbot",
    description="Interact with a free and open-source LLM for customer service.",
    examples=[["What is your return policy?"], ["How can I track my order?"]]
)

  self.chatbot = Chatbot(
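
The response extraction in the cell above can be pulled out into a small pure function, which makes it easy to check against sample strings without running the model. This is a sketch under the assumption that the model emits Llama 3 style markers; `extract_assistant_reply` and the list of stop markers are illustrative choices, not part of the original notebook.

```python
ASSISTANT_HEADER = "<|start_header_id|>assistant<|end_header_id|>"
# Markers that signal the end of the assistant's turn (assumed Llama 3 special tokens).
STOP_MARKERS = ("<|eot_id|>", "<|end_of_text|>", "<|start_header_id|>")


def extract_assistant_reply(generated_text):
    """Return the text after the last assistant header, cut at the first stop marker."""
    start = generated_text.rfind(ASSISTANT_HEADER)
    if start == -1:
        # No header found: fall back to treating the whole string as the reply.
        reply = generated_text
    else:
        reply = generated_text[start + len(ASSISTANT_HEADER):]
    for marker in STOP_MARKERS:
        reply = reply.split(marker)[0]
    return reply.strip()


if __name__ == "__main__":
    sample = (
        "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nHi<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\nHello!<|eot_id|>"
    )
    print(extract_assistant_reply(sample))
```

Inside `chatbot_response`, the call would replace the `rfind` and `split` lines: `return extract_assistant_reply(response)`.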


## Run the gradio app

### Subtask:
Launch the Gradio app.


**Reasoning**:
Launch the Gradio interface to make the chatbot accessible.



In [9]:
iface.launch(share=True)

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://ea991d82182abfc062.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)




## Summary:

### Key Findings

*   A free and open-source language model suitable for Google Colab's resources, specifically `unsloth/llama-3-8b-Instruct-bnb-4bit`, was successfully loaded using 4-bit quantization.
*   The corresponding tokenizer for the selected model was loaded.
*   A text generation pipeline was successfully set up using the loaded model and tokenizer, configured to utilize the available GPU.
*   A Gradio `ChatInterface` was created, including a function to handle conversation history and generate responses using the text generation pipeline.
*   The Gradio application was launched, providing a public URL for interaction with the customer service chatbot.

### Insights or Next Steps

*   The current implementation extracts the assistant's response by searching for a specific header and splitting on end-of-turn tokens. Further refinement of the output parsing logic could improve robustness against unexpected model output formats.
*   Exploring techniques like prompt engineering or fine-tuning on customer service dialogues could enhance the chatbot's performance and relevance for specific customer service scenarios.
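
As one concrete starting point for the prompt engineering mentioned above, a system message can be placed ahead of the conversation. The sketch below builds the prompt string by hand in the Llama 3 chat layout; `build_llama3_prompt` and the example system prompt are illustrative assumptions, not part of the original notebook. The tokenizer's `apply_chat_template` method is an alternative that avoids hand-written markers.

```python
def build_llama3_prompt(message, history=(), system_prompt=None):
    """Build a Llama 3 style prompt from an optional system prompt, past turns, and a new message."""

    def turn(role, content):
        return f"<|start_header_id|>{role}<|end_header_id|>\n\n{content.strip()}<|eot_id|>"

    parts = ["<|begin_of_text|>"]  # appears exactly once, at the start
    if system_prompt:
        parts.append(turn("system", system_prompt))
    for user, bot in history:  # history as (user_message, bot_response) tuples
        parts.append(turn("user", user))
        parts.append(turn("assistant", bot))
    parts.append(turn("user", message))
    # Open the assistant turn so the model continues from here.
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


if __name__ == "__main__":
    # Hypothetical system prompt; adapt the wording to the actual support policy.
    system = "You are a polite customer service assistant. If you do not know an answer, say so."
    print(build_llama3_prompt("How can I track my order?", system_prompt=system))
```

Keeping the prompt builder separate from the Gradio callback also makes it straightforward to compare different system prompts on the same set of test questions.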
