## Retrieval Augmented Generation (RAG) with Hugging Face Transformers on Intel® Data Center GPU

**Note:** Select the "PyTorch 2.7" kernel when running this notebook.

#### 1. Retrieval Augmented Generation (RAG):
Retrieval Augmented Generation (RAG) combines a retrieval system with the generative capabilities of a transformer model such as Falcon. In this RAG system, which runs on an Intel® Data Center GPU Max Series 1100, a question first triggers a search for relevant documents or passages in a corpus; the retrieved text is then passed to the language model together with the query. This two-step process lets the model draw on both external knowledge from the corpus and its own internal knowledge to produce more informed and contextually accurate responses.
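
The two steps described above can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions: the bag-of-words similarity, the `toy_corpus` passages, and the helper names (`tokenize`, `retrieve`, `build_prompt`) are hypothetical stand-ins, whereas the notebook itself uses sentence embeddings and a vector store. The prompt layout mirrors the one used later in `retrieval_mechanism()`.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase a string and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def cosine_similarity(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, corpus, k=1):
    """Step 1: return the k passages most similar to the query."""
    q = Counter(tokenize(query))
    ranked = sorted(
        corpus,
        key=lambda passage: cosine_similarity(q, Counter(tokenize(passage))),
        reverse=True,
    )
    return ranked[:k]

def build_prompt(query, passages):
    """Step 2: place the retrieved passages ahead of the question."""
    context = "\n".join(passages)
    return f"Context: {context}\n\nQuestion: {query}\nAnswer:"

# Hypothetical three-passage corpus, one passage per domain.
toy_corpus = [
    "The robot arm needs its joints lubricated every six months.",
    "A zone defense assigns each player an area of the court.",
    "Momentum is conserved in a closed system.",
]
question = "How often should the robot joints be lubricated?"
prompt = build_prompt(question, retrieve(question, toy_corpus, k=1))
print(prompt)
```

The resulting `prompt` string is what would be handed to the language model; the model call itself is left out so that the sketch runs without downloading any weights.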

#### 2. In-context Learning:
Traditional machine learning models learn from large labeled datasets. In-context learning, by contrast, refers to a model (typically a language model) using a few examples or other context supplied at inference time to tailor its output, with no weight updates. Our implementation with Falcon3-1B-Instruct relies on this capability: the retrieved passages are placed in the prompt and processed with Hugging Face Transformers.
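
As a rough illustration of supplying examples at inference time, the sketch below assembles a few-shot prompt from demonstration pairs. The helper name `build_few_shot_prompt` and the arithmetic examples are assumptions made for this illustration; the notebook itself supplies retrieved passages as context rather than worked examples.

```python
def build_few_shot_prompt(examples, query):
    """Format (question, answer) demonstrations followed by the new question."""
    lines = []
    for q, a in examples:
        lines.append(f"Question: {q}\nAnswer: {a}\n")
    # The final question is left unanswered so the model completes it.
    lines.append(f"Question: {query}\nAnswer:")
    return "\n".join(lines)

# Hypothetical demonstrations shown to the model inside the prompt.
demos = [
    ("What is 2 + 2?", "4"),
    ("What is 3 + 5?", "8"),
]
few_shot_prompt = build_few_shot_prompt(demos, "What is 6 + 1?")
print(few_shot_prompt)
```

Nothing about the model changes between queries; only the prompt text does, which is why the demonstrations can be swapped freely.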

#### 3. LLM Chains/Pipelines:
LLM chains or pipelines involve stringing together multiple stages or components of a system to achieve a complex task. In our RAG system, the pipeline includes a vector database for efficient retrieval, followed by the Hugging Face Transformers pipeline for response generation. This modular approach allows for easy optimization and updates to individual components.
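
One way to picture this modularity is a chain of small functions that each read and extend a shared state dictionary. The sketch below is an assumption-laden toy: `make_pipeline` and the three stage functions are hypothetical names, and the retrieval and generation stages are stand-ins that return fixed or derived strings instead of calling a vector database or a model.

```python
from functools import reduce

def make_pipeline(*stages):
    """Compose stages left to right; each stage takes and returns a dict."""
    def run(state):
        return reduce(lambda acc, stage: stage(acc), stages, state)
    return run

def retrieve_stage(state):
    # Stand-in for a vector database lookup (hypothetical fixed result).
    return {**state, "context": "Store hours are 9am to 9pm."}

def prompt_stage(state):
    # Combine the retrieved context and the query into one prompt.
    prompt = f"Context: {state['context']}\n\nQuestion: {state['query']}\nAnswer:"
    return {**state, "prompt": prompt}

def generate_stage(state):
    # Stand-in for the language model call; it only reports the prompt length.
    reply = f"(model output for a {len(state['prompt'])}-character prompt)"
    return {**state, "response": reply}

rag_pipeline = make_pipeline(retrieve_stage, prompt_stage, generate_stage)
result = rag_pipeline({"query": "When does the store close?"})
print(result["response"])
```

Because each stage only depends on the keys it reads, any one of them can be replaced (for example, a different retriever) without touching the others.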

#### 4. RAG for On-Premise LLM Applications:
With the growing need for data privacy and proprietary data handling, many enterprises seek solutions to harness the power of LLMs in-house. Our implementation, leveraging Intel's GPU infrastructure, demonstrates how RAG can be deployed effectively on-premise. By integrating RAG with local data repositories and utilizing hardware acceleration, enterprises can build powerful LLM applications tailored to their specific needs while ensuring data confidentiality.

#### 5. RAG vs Fine-Tuning:
RAG and fine-tuning both adapt an LLM to new knowledge, but in different ways: fine-tuning updates the model's weights, while RAG supplies external knowledge at query time. Our implementation uses RAG with the Falcon3-1B-Instruct model to show how external knowledge retrieval can complement the model's built-in capabilities. Because the knowledge lives in the corpus rather than in the weights, it can be updated dynamically without continuous model retraining.

### Getting Started
1. Run the installation cell (first time only)
2. Execute all cells in order
3. The interactive interface will appear in the final cell
4. Select your preferences and start chatting!

### Install dependencies. Only run the first time.

In [None]:
import sys
import os

%pip install langchain accelerate transformers datasets tiktoken chromadb sentence_transformers langchain-community ipywidgets --no-warn-script-location

The code below builds RAGBot. The RAGBot class is a streamlined implementation of a Retrieval-Augmented Generation (RAG) chatbot that uses a large language model to generate contextually relevant responses. It manages the download and initialization of Hugging Face Transformers models such as Falcon3-1B-Instruct, with support for handling large model files, and it automates fetching and structuring dialogue datasets from predefined sources. Running on Intel® Data Center GPU Max Series 1100, the system uses hardware acceleration for model inference, with adjustable parameters such as the number of threads and the maximum number of tokens. A key feature of RAGBot is its vector database for text retrieval, which lets it pull the document snippets most relevant to a user query. Together with the retrieval mechanism and the Transformers-based inference method, this makes RAGBot a simple starting point for developers who want to learn about RAG and build their own implementation on top of it.

## Package Imports

Core libraries for the RAG implementation, including Hugging Face Transformers, LangChain components, and utilities for UI and data handling.

In [None]:
%env HF_HOME=/opt/notebooks/.cache/huggingface

In [None]:
import sys
import os
import requests
import contextlib
import pandas as pd
import time
import io
import ipywidgets as widgets
from IPython.display import display, HTML

from tqdm import tqdm
from transformers import AutoTokenizer, AutoModelForCausalLM
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate
from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.indexes import VectorstoreIndexCreator
from langchain.embeddings import HuggingFaceEmbeddings
from datasets import load_dataset

### RAGBot Class Implementation

The RAGBot class implements a Retrieval-Augmented Generation system optimized for Intel® Data Center GPU Max Series 1100. Key components include:

#### Core Features
- Model Management: Handles initialization of Hugging Face Transformer models (currently supporting Falcon3-1B-Instruct)
- Dataset Handling: Manages dialogue datasets from various domains (robot maintenance, sports coaching, academia, retail)
- Vector Database: Creates and manages embeddings for efficient text retrieval
- RAG Pipeline: Combines context retrieval with language model inference

#### Key Methods
- `get_model()`: Maps the selected model label to a Hugging Face model ID
- `download_dataset()`: Fetches a dialogue dataset and saves it to a local text file
- `load_model()`: Loads the tokenizer and model and stores the generation settings
- `build_vectordb()`: Creates a searchable vector database from text data
- `retrieval_mechanism()`: Retrieves context for the user query and assembles the prompt
- `inference()`: Generates a response from the loaded model using the assembled prompt

The implementation leverages Hugging Face Transformers for model operations and LangChain for RAG functionality, providing a streamlined approach to context-aware response generation.

In [None]:
class RAGBot:
    """
    A class to handle model downloading, dataset management, model loading, vector database
    creation, retrieval mechanisms, and inference for a response generation bot.
    """
    def __init__(self):
        self.model_name = ""
        self.data_path = ""
        self.user_input = ""
        self.model = ""
        self.max_tokens = 50
        self.top_k = 50

    def get_model(self, model, chunk_size: int = 10000):
        # Map the dropdown label to a Hugging Face model ID.
        # `chunk_size` is unused and kept only for backward compatibility.
        self.model = model
        if model != "Falcon":
            # Covers the "More Models Coming Soon!" placeholder and any unknown label.
            print("More models coming soon, defaulting to Falcon for now!")
        self.model_name = "tiiuae/Falcon3-1B-Instruct"

    def download_dataset(self, dataset):
        # Cache each dataset locally as '<name>_dialogues.txt'.
        self.data_path = dataset + '_dialogues.txt'
        if not os.path.isfile(self.data_path):
            # Map the dropdown label to its Hugging Face dataset ID.
            datasets = {
                "robot maintenance": "FunDialogues/customer-service-robot-support",
                "basketball coach": "FunDialogues/sports-basketball-coach",
                "physics professor": "FunDialogues/academia-physics-office-hours",
                "grocery cashier": "FunDialogues/customer-service-grocery-cashier"
            }
            dialogues = load_dataset(datasets[dataset])['train']
            df = pd.DataFrame(dialogues, columns=['id', 'description', 'dialogue'])
            # Keep only the dialogue text, which is what gets indexed.
            df['dialogue'].to_csv(self.data_path, sep=' ', index=False)
        else:
            print('Data already exists in path.')

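    def load_model(self, n_threads, max_tokens, repeat_penalty, n_batch, top_k):
        """
        Load the tokenizer and model, then store the generation settings.

        n_threads, repeat_penalty, and n_batch are accepted for interface
        compatibility but are not used by this implementation.
        """
        self.tokenizer = AutoTokenizer.from_pretrained(self.model_name)
        self.llm = AutoModelForCausalLM.from_pretrained(
            self.model_name,
            torch_dtype="auto",   # use the dtype stored in the checkpoint
            device_map="auto"     # place the weights on the available device(s)
        )
        # Reuse the end-of-sequence token for padding.
        self.tokenizer.pad_token = self.tokenizer.eos_token
        self.llm.config.pad_token_id = self.llm.config.eos_token_id
        self.max_tokens = max_tokens
        self.top_k = top_k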

    def build_vectordb(self, chunk_size, overlap):
        loader = TextLoader(self.data_path)
        text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=overlap)
        self.index = VectorstoreIndexCreator(
            embedding=HuggingFaceEmbeddings(), 
            text_splitter=text_splitter
        ).from_loaders([loader])

    def retrieval_mechanism(self, user_input, top_k=1, context_verbosity=False, rag_off=False):
        self.user_input = user_input
        self.context_verbosity = context_verbosity   
        results = self.index.vectorstore.similarity_search(self.user_input, k=top_k)
        context = "\n".join([document.page_content for document in results])
        if self.context_verbosity:
            print(f"Retrieving information related to your question...")
            print(f"Found this content which is most similar to your question: {context}")
        if rag_off:
            self.full_prompt = f"Question: {user_input}\nAnswer:"
        else:     
            self.full_prompt = f"""Context: {context}\n\nQuestion: {user_input}\nAnswer:"""

    def inference(self):
        if self.context_verbosity:
            print(f"Your Query: {self.full_prompt}")
        
        messages = [
            {"role": "system", "content": "You are a helpful assistant. Use the provided context to answer questions accurately."},
            {"role": "user", "content": self.full_prompt}
        ]
        text = self.tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=True
        )
        model_inputs = self.tokenizer([text], return_tensors="pt").to(self.llm.device)
        generated_ids = self.llm.generate(
            **model_inputs,
            max_new_tokens=self.max_tokens,
            top_k=self.top_k
        )
        generated_ids = [
            output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
        ]
        response = self.tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
        return response

### Customizing Your RAG LLM Chatbot Experience

Welcome to the RAG LLM Chatbot interface! This guide will help you understand how to use the various widgets to customize your chatbot experience.

#### Interface Elements

##### Model Selection
- **Model Dropdown**: Choose the model for your chatbot.
  - Currently using the Falcon3-1B-Instruct model from TII.
  - More models will be added in future updates.

##### Query Input
- **Query Text Box**: Enter your query here.
  - Type the question or statement you want the chatbot to respond to.
  - The text area provides ample space for longer queries.

##### Response Customization
- **Top K Slider**: Adjust the number of top results to consider for generating responses.
  - Slide to increase or decrease the value. The range is from 1 to 4.
  - This adjusts how many similar context passages are retrieved for answering.

- **RAG OFF Checkbox**: Toggle whether to use Retrieval-Augmented Generation (RAG) or not.
  - Check this box if you want to turn off RAG and use only the base model for responses.
  - Useful for comparing RAG-enhanced vs. basic model responses.

##### Data Processing Settings
- **Chunk Size Input**: Set the size of text chunks for processing.
  - Enter a value to determine how large each text chunk should be (5-5000).
  - This affects how the context documents are segmented for retrieval.

- **Overlap Input**: Define the overlap size between chunks.
  - Set a value for how much overlap there should be between text chunks (0-1000).
  - Higher overlap can help maintain context continuity between chunks.
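
To make the interaction between chunk size and overlap concrete, here is a minimal sliding-window sketch. It is a simplification and an assumption on our part: the notebook uses LangChain's `RecursiveCharacterTextSplitter`, which also tries to split on natural separators, while the hypothetical `chunk_text` helper below cuts at fixed character offsets.

```python
def chunk_text(text, chunk_size, overlap):
    """Split text into fixed-size character windows that share `overlap` characters."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    # Each new window starts `chunk_size - overlap` characters after the previous one.
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_text(sample, chunk_size=10, overlap=3)
for chunk in chunks:
    print(chunk)
# abcdefghij
# hijklmnopq
# opqrstuvwx
# vwxyz
```

Each chunk repeats the last three characters of the previous one, which is how overlap preserves continuity across chunk boundaries. The sketch also shows why the overlap should stay below the chunk size: otherwise the window would never advance.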

##### Dataset and Performance
- **Dataset Dropdown**: Choose the dataset for context retrieval.
  - Options include: 'robot maintenance', 'basketball coach', 'physics professor', 'grocery cashier'.
  - Each dataset provides different domain-specific knowledge.

- **Threads Slider**: Select the number of threads for processing.
  - Adjust the slider to set the number of threads (2-200).
  - The value is passed to `load_model()`, which currently does not use it, so changing it reloads the model without changing inference behavior.

- **Max Tokens Input**: Specify the maximum length of generated responses.
  - Enter a value to set the token limit (5-500).
  - Higher values allow for longer, more detailed responses.

##### Submit Button
- **Submit**: Click this button to process your query and generate a response.
  - The response will appear in a styled blue box below the interface.

### How to Use
1. Select your desired model from the **Model Dropdown**.
2. Type your query in the **Query Text Box**.
3. Set the **Top K Slider** for context retrieval amount.
4. Configure **Chunk Size** and **Overlap** for text processing.
5. Choose a dataset from the **Dataset Dropdown**.
6. Adjust performance parameters (**Threads**, **Max Tokens**).
7. Toggle **RAG OFF** if you want to test the base model.
8. Click **Submit** to generate the response.

> Note: The first execution loads the model into memory and may take longer. Subsequent queries will be faster unless you change a setting that triggers a model reload (Threads, Max Tokens, or Top K).
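
The reload behavior in the note comes down to comparing the current settings with the ones used last time. The sketch below shows that idea in isolation; the `SettingsCache` class is a hypothetical helper written for this illustration, whereas the notebook tracks the previous values in module-level `previous_*` variables.

```python
class SettingsCache:
    """Remember the last settings seen and report whether they changed."""

    def __init__(self):
        self._previous = None

    def changed(self, **settings):
        # True on the first call and whenever any value differs from the last call.
        is_new = settings != self._previous
        self._previous = dict(settings)
        return is_new

model_settings = SettingsCache()
first = model_settings.changed(threads=8, max_tokens=50)
second = model_settings.changed(threads=8, max_tokens=50)
third = model_settings.changed(threads=8, max_tokens=100)
print(first, second, third)  # True False True
```

Only the calls that return `True` would need to trigger the expensive model load; the second query reuses what is already in memory.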

Enjoy interacting with your custom RAG LLM Chatbot!

In [None]:
def process_inputs(b):
    """
    Process inputs from the interactive chat interface.
    """
    global previous_threads, previous_max_tokens, previous_top_k, previous_dataset
    global previous_chunk_size, previous_overlap

    with output:
        output.clear_output()
        f = io.StringIO()
        with contextlib.redirect_stdout(f), contextlib.redirect_stderr(f):
            model = model_dropdown.value
            query = query_text.value
            top_k = top_k_slider.value
            chunk_size = chunk_size_input.value
            overlap = overlap_input.value
            dataset = dataset_dropdown.value
            threads = threads_slider.value
            max_tokens = max_token_input.value
            rag_off = rag_off_checkbox.value
            
            bot.get_model(model=model)
            bot.download_dataset(dataset=dataset)
            if (threads != previous_threads or max_tokens != previous_max_tokens or 
                top_k != previous_top_k):
                print("Loading model with new parameters")
                bot.load_model(
                    n_threads=threads,
                    max_tokens=max_tokens,
                    repeat_penalty=1.50,
                    n_batch=threads,
                    top_k=top_k
                )
                previous_threads = threads
                previous_max_tokens = max_tokens
                previous_top_k = top_k
            
            # Rebuild vector DB if needed
            if (dataset != previous_dataset or chunk_size != previous_chunk_size or 
                overlap != previous_overlap):
                print("Rebuilding vector DB")
                bot.build_vectordb(chunk_size=chunk_size, overlap=overlap)
                previous_dataset = dataset
                previous_chunk_size = chunk_size
                previous_overlap = overlap
            bot.retrieval_mechanism(
                user_input=query, 
                top_k=top_k,  # use the Top K slider value for retrieval
                context_verbosity=True,
                rag_off=rag_off
            )
            response = bot.inference()            
            styled_response = f"""
            <div style="
                background-color: lightblue;
                border-radius: 15px;
                padding: 10px;
                font-family: Arial, sans-serif;
                color: black;
                max-width: 600px;
                word-wrap: break-word;
                margin: 10px;
                font-size: 14px;">
                {response}
            </div>
            """
            display(HTML(styled_response))

def create_chat_interface():
    """
    Create and display the interactive chat interface widgets.
    """
    global model_dropdown, query_text, top_k_slider, rag_off_checkbox
    global chunk_size_input, overlap_input, dataset_dropdown
    global threads_slider, max_token_input, output
    
    model_dropdown = widgets.Dropdown(
        options=['Falcon', 'More Models Coming Soon!'],
        description='Model:',
        disabled=False,
    )
    query_layout = widgets.Layout(width='400px', height='100px')
    query_text = widgets.Textarea(
        placeholder='Type your query here',
        description='Query:',
        disabled=False,
        layout=query_layout
    )
    top_k_slider = widgets.IntSlider(
        value=2,
        min=1,
        max=4,
        step=1,
        description='Top K:',
        disabled=False,
        continuous_update=False,
        orientation='horizontal',
        readout=True,
        readout_format='d'
    )
    rag_off_checkbox = widgets.Checkbox(
        value=False,
        description='RAG OFF?',
        disabled=False,
        indent=False,
        tooltip='Turn off RAG and use only the base model'
    )
    chunk_size_input = widgets.BoundedIntText(
        value=500,
        min=5,
        max=5000,
        step=1,
        description='Chunk Size:',
        disabled=False
    )
    overlap_input = widgets.BoundedIntText(
        value=50,
        min=0,
        max=1000,
        step=1,
        description='Overlap:',
        disabled=False
    )
    dataset_dropdown = widgets.Dropdown(
        options=['robot maintenance',
                 'basketball coach',
                 'physics professor',
                 'grocery cashier'],
        description='Dataset:',
        disabled=False,
    )
    threads_slider = widgets.IntSlider(
        value=8,
        min=2,
        max=200,
        step=1,
        description='Threads:',
        disabled=False,
        continuous_update=False,
        orientation='horizontal',
        readout=True,
        readout_format='d'
    )
    max_token_input = widgets.BoundedIntText(
        value=50,
        min=5,
        max=500,
        step=5,
        description='Max Tokens:',
        disabled=False
    )

    # Layout
    left_column = widgets.VBox([
        model_dropdown, 
        top_k_slider, 
        rag_off_checkbox,
        chunk_size_input, 
        overlap_input, 
        dataset_dropdown, 
        threads_slider,
        max_token_input
    ])

    submit_button = widgets.Button(description="Submit")
    submit_button.on_click(process_inputs)
    right_column = widgets.VBox([query_text, submit_button])
    interface_layout = widgets.HBox([left_column, right_column])
    output = widgets.Output()
    display(interface_layout, output)

In [None]:
bot = RAGBot()
previous_threads = None
previous_max_tokens = None
previous_top_k = None
previous_dataset = None
previous_chunk_size = None
previous_overlap = None
previous_temp = None

# Create and display the interface
create_chat_interface()

## Disclaimer for Using Large Language Models

Please be aware that while Large Language Models like Falcon are powerful tools for text generation, they may sometimes produce results that are unexpected, biased, or inconsistent with the given prompt. It's advisable to carefully review the generated text and consider the context and application in which you are using these models.

Usage of these models must also adhere to the licensing agreements and be in accordance with ethical guidelines and best practices for AI. If you have any concerns or encounter issues with the models, please refer to the respective model cards and documentation.

To the extent that any public or non-Intel datasets or models are referenced by or accessed using these materials those datasets or models are provided by the third party indicated as the content source. Intel does not create the content and does not warrant its accuracy or quality. By accessing the public content, or using materials trained on or with such content, you agree to the terms associated with that content and that your use complies with the applicable license.

 
Intel expressly disclaims the accuracy, adequacy, or completeness of any such public content, and is not liable for any errors, omissions, or defects in the content, or for any reliance on the content. Intel is not liable for any liability or damages relating to your use of public content.

Intel’s provision of these resources does not expand or otherwise alter Intel’s applicable published warranties or warranty disclaimers for Intel products or solutions, and no additional obligations, indemnifications, or liabilities arise from Intel providing such resources. Intel reserves the right, without notice, to make corrections, enhancements, improvements, and other changes to its materials.