### A cluster has been created for this demo
To run this demo, just select the cluster `dbdemos-llm-rag-chatbot-jason_bricks_std` from the dropdown menu ([open cluster configuration](https://dbc-458303b2-a0c9.cloud.databricks.com/#setting/clusters/0112-071913-b6023fae/configuration)). <br />
*Note: If the cluster was deleted after 30 days, you can re-create it with `dbdemos.create_cluster('llm-rag-chatbot')` or re-install the demo: `dbdemos.install('llm-rag-chatbot')`*

# 2/ Creating the chatbot with Retrieval Augmented Generation (RAG)

<img src="https://github.com/databricks-demos/dbdemos-resources/blob/main/images/product/chatbot-rag/llm-rag-self-managed-flow-2.png?raw=true" style="float: right; margin-left: 10px"  width="900px;">

Our Vector Search Index is now ready!

Let's now create and deploy a new Model Serving Endpoint to perform RAG.

The flow will be the following:

- A user asks a question
- The question is sent to our serverless Chatbot RAG endpoint
- The endpoint computes the embeddings and searches for docs similar to the question, leveraging the Vector Search Index
- The endpoint creates a prompt enriched with the doc
- The prompt is sent to the Foundation Model Serving Endpoint
- We display the output to our users!
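
To make the flow above concrete, here is a minimal, self-contained sketch of the same steps. Everything in it is a hypothetical stand-in: `DOCS`, `retrieve`, `build_prompt` and `call_llm` are toy placeholders for the Vector Search Index and the Foundation Model endpoint, and the word-overlap ranking only imitates a similarity search.

```python
# Toy in-memory stand-ins for the real services (all names are hypothetical).
DOCS = [
    "Use the Usage page in the account console to view billable usage.",
    "Apache Spark is a unified analytics engine for large-scale data processing.",
]


def retrieve(question: str, k: int = 1) -> list[str]:
    """Stand-in for the Vector Search lookup: rank docs by shared words."""
    words = set(question.lower().split())
    ranked = sorted(
        DOCS,
        key=lambda doc: len(words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question: str, docs: list[str]) -> str:
    """Enrich the prompt with the retrieved docs."""
    context = "\n".join(docs)
    return f"Context:\n{context}\nQuestion: {question}\nAnswer:"


def call_llm(prompt: str) -> str:
    """Stand-in for the Foundation Model endpoint (no real model is called)."""
    return f"(answer generated from a {len(prompt)}-character prompt)"


def answer(question: str) -> str:
    docs = retrieve(question)                # search for similar docs
    prompt = build_prompt(question, docs)    # build the enriched prompt
    return call_llm(prompt)                  # send it to the model


print(answer("How do I view billable usage?"))
```

In the rest of this notebook, each stand-in is replaced by a real component: the retriever, the chat model, and the chain that ties them together.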



*Note: RAG performs document searches using Databricks Vector Search. In this notebook, we assume that the search index is ready for use. Make sure you run the previous [01-Data-Preparation-and-Index]($./01-Data-Preparation-and-Index [DO NOT EDIT]) notebook.*


In [0]:
%pip install mlflow==2.9.0 langchain==0.0.344 databricks-vectorsearch==0.22 databricks-sdk==0.12.0 mlflow[databricks]

Note: you may need to restart the kernel using dbutils.library.restartPython() to use updated packages.
Collecting mlflow==2.9.0
  Downloading mlflow-2.9.0-py3-none-any.whl (19.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 19.1/19.1 MB 41.6 MB/s eta 0:00:00
Collecting langchain==0.0.344
  Downloading langchain-0.0.344-py3-none-any.whl (1.9 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.9/1.9 MB 63.6 MB/s eta 0:00:00
Collecting databricks-vectorsearch==0.22
  Downloading databricks_vectorsearch-0.22-py3-none-any.whl (8.5 kB)
Collecting databricks-sdk==0.12.0
  Downloading databricks_sdk-0.12.0-py3-none-any.whl (301 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 301.7/301.7 kB 27.3 MB/s eta 0:00:00
Collecting mlflow[databricks]
  Downloading mlflow-2.9.2-py3-none-any.whl (19.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 19.1/19.1 MB 54.2 MB/s eta 0:00:00
Collecting docker<7,>=4.0.0
  Downloading docker-6.1.3-py3-none-any.whl (148 kB)
     ━━━━━━━━━━━━━━━━━━━━━━

#### Install Arize Phoenix AI Observability

This LLM-powered application uses LangChain. Arize Phoenix captures LangChain span and trace information, which helps with debugging LangChain LLM calls.

In addition to LangChain, Phoenix supports instrumentation using OTEL LLM tracing and LlamaIndex, as well as general dataframe analysis.

Key Concepts:

LLM Traces are a category of telemetry data that is used to understand the execution of LLMs and the surrounding application context (such as retrieval from vector stores, usage of external tools, etc).

Traces are made up of a sequence of spans. A span represents a unit of work or operation (think a span of time).

LLM Evaluations help you get visibility into the performance of the application.
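
As a rough illustration of the trace and span concepts above, here is a small sketch. It is not Phoenix's data model: the `Span` and `Trace` classes, the span kinds and the timings are all made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    """One unit of work inside a trace (hypothetical, simplified fields)."""
    name: str
    kind: str       # e.g. "CHAIN", "RETRIEVER", "LLM"
    start_ms: int
    end_ms: int

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms


@dataclass
class Trace:
    """A trace is a sequence of spans."""
    spans: list[Span] = field(default_factory=list)

    def time_by_kind(self) -> dict[str, int]:
        """Sum the time spent in each kind of span."""
        totals: dict[str, int] = {}
        for span in self.spans:
            totals[span.kind] = totals.get(span.kind, 0) + span.duration_ms
        return totals


# Made-up timings for a single question going through a RAG chain.
trace = Trace([
    Span("RetrievalQA", "CHAIN", 0, 950),
    Span("vector_search", "RETRIEVER", 10, 130),
    Span("chat_model", "LLM", 140, 940),
])
print(trace.time_by_kind())
```

Looking at time per span kind is one simple way to see whether retrieval or the LLM call dominates the latency of a request.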

In [0]:
!pip install typing-extensions==4.7.1
!pip install arize-phoenix
!pip install --upgrade openai
!pip install --upgrade nest_asyncio
!pip install tiktoken
dbutils.library.restartPython()

Note: you may need to restart the kernel using dbutils.library.restartPython() to use updated packages.
Collecting typing-extensions==4.7.1
  Downloading typing_extensions-4.7.1-py3-none-any.whl (33 kB)
Installing collected packages: typing-extensions
  Attempting uninstall: typing-extensions
    Found existing installation: typing_extensions 4.4.0
    Not uninstalling typing-extensions at /databricks/python3/lib/python3.10/site-packages, outside environment /local_disk0/.ephemeral_nfs/envs/pythonEnv-43c61980-4d46-44f8-a334-26be5a687d60
    Can't uninstall 'typing_extensions'. No files were found to uninstall.
Successfully installed typing-extensions-4.7.1
Note: you may need to restart the kernel using dbutils.library.restartPython() to use updated packages.
Collecting arize-phoenix
  Downloading arize_phoenix-2.5.0-py3-none-any.whl (1.2 MB)
     ━━━━━━━━━━

In [0]:
%run ../_resources/00-init $reset_all_data=false

USE CATALOG `main`
using catalog.database `main`.`rag_chatbot`


DataFrame[]

  
###  This demo requires a secret to work:
Your Model Serving Endpoint needs a secret to authenticate against your Vector Search Index (see [Documentation](https://docs.databricks.com/en/security/secrets/secrets.html)).  <br/>
**Note: if you are using a shared demo workspace and the secret is already set up, please don't run these steps and do not override its value**<br/>

- You'll need to [set up the Databricks CLI](https://docs.databricks.com/en/dev-tools/cli/install.html) on your laptop or using this cluster terminal: <br/>
`pip install databricks-cli` <br/>
- Configure the CLI. You'll need your workspace URL and a PAT token from your profile page<br>
`databricks configure`
- Create the dbdemos scope:<br/>
`databricks secrets create-scope dbdemos`
- Save your service principal secret. It will be used by the Model Endpoint to authenticate. If this is a demo/test, you can use one of your [PAT tokens](https://docs.databricks.com/en/dev-tools/auth/pat.html).<br>
`databricks secrets put-secret dbdemos rag_sp_token`

*Note: Make sure your service principal has access to the Vector Search index:*

```
spark.sql('GRANT USAGE ON CATALOG <catalog> TO `<YOUR_SP>`')
spark.sql('GRANT USAGE ON DATABASE <catalog>.<db> TO `<YOUR_SP>`')

from databricks.sdk import WorkspaceClient
import databricks.sdk.service.catalog as c

# Grant SELECT on the index to the service principal
WorkspaceClient().grants.update(
    c.SecurableType.TABLE,
    "<index_name>",
    changes=[c.PermissionsChange(add=[c.Privilege["SELECT"]], principal="<YOUR_SP>")],
)
```

In [0]:
index_name = f"{catalog}.{db}.databricks_documentation_vs_index"
host = "https://" + spark.conf.get("spark.databricks.workspaceUrl")

test_demo_permissions(
    host,
    secret_scope="dbdemos",
    secret_key="rag_sp_token",
    vs_endpoint_name=VECTOR_SEARCH_ENDPOINT_NAME,
    index_name=index_name,
    embedding_endpoint_name="databricks-bge-large-en",
)

Secret and permissions seems to be properly setup, you can continue the demo!


### Langchain retriever

<img src="https://github.com/databricks-demos/dbdemos-resources/blob/main/images/product/chatbot-rag/llm-rag-self-managed-model-1.png?raw=true" style="float: right" width="500px">

Let's start by building our Langchain retriever. 

It will be in charge of:

* Creating the input question embeddings (with Databricks `bge-large-en`)
* Calling the vector search index to find similar documents to augment the prompt with

The Databricks Langchain wrapper makes this easy to do in one step, handling all the underlying logic and API calls for you.
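
To show what the retriever does conceptually, here is a toy sketch of the two steps (embed the question, then find the closest documents). It is an illustration only: `toy_embed` is a hypothetical word-count "embedding" over a tiny vocabulary, not `bge-large-en`, and the in-memory `INDEX` dictionary stands in for the Vector Search index.

```python
import math

VOCAB = ["billing", "usage", "spark", "cluster"]  # hypothetical toy vocabulary


def toy_embed(text: str) -> list[float]:
    """Hypothetical embedding: count how often each vocabulary word appears."""
    tokens = text.lower().split()
    return [float(tokens.count(word)) for word in VOCAB]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Stand-in for the Vector Search index: document text -> embedding.
INDEX = {
    doc: toy_embed(doc)
    for doc in [
        "view billing usage in the account console",
        "spark runs jobs on a cluster",
    ]
}


def search(question: str, k: int = 1) -> list[str]:
    """Embed the question and return the k most similar documents."""
    query_vector = toy_embed(question)
    ranked = sorted(INDEX, key=lambda doc: cosine(query_vector, INDEX[doc]), reverse=True)
    return ranked[:k]


print(search("how do i track billing usage"))
```

The question and the documents must be embedded by the same model, otherwise the similarity scores are meaningless. This is why the embedding endpoint used here has to match the one used to build the index.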

In [0]:
import os

# URL used to send requests to your model from the serverless endpoint
host = "https://" + spark.conf.get("spark.databricks.workspaceUrl")
os.environ["DATABRICKS_TOKEN"] = dbutils.secrets.get("dbdemos", "rag_sp_token")

In [0]:
from databricks.vector_search.client import VectorSearchClient
from langchain.vectorstores import DatabricksVectorSearch
from langchain.embeddings import DatabricksEmbeddings

# Test embedding Langchain model
# NOTE: your question embedding model must match the one used to embed the chunks in the previous notebook
embedding_model = DatabricksEmbeddings(endpoint="databricks-bge-large-en")
print(
    f"Test embeddings: {embedding_model.embed_query('What is Apache Spark?')[:20]}..."
)


def get_retriever(persist_dir: str = None):
    os.environ["DATABRICKS_HOST"] = host
    # Get the vector search index
    vsc = VectorSearchClient(
        workspace_url=host, personal_access_token=os.environ["DATABRICKS_TOKEN"]
    )
    vs_index = vsc.get_index(
        endpoint_name=VECTOR_SEARCH_ENDPOINT_NAME, index_name=index_name
    )

    # Create the retriever
    vectorstore = DatabricksVectorSearch(
        vs_index, text_column="content", embedding=embedding_model
    )
    return vectorstore.as_retriever()


# Test our retriever
retriever = get_retriever()
similar_documents = retriever.get_relevant_documents(
    "How do I track my Databricks Billing?"
)
print(f"Relevant documents: {similar_documents[0]}")

Test embeddings: [0.0186004638671875, -0.0141448974609375, -0.0574951171875, 0.0034027099609375, 0.008453369140625, -0.0216064453125, -0.02471923828125, -0.004688262939453125, 0.0136566162109375, 0.050384521484375, -0.0272064208984375, -0.01470184326171875, 0.054718017578125, -0.0538330078125, -0.01035308837890625, -0.0162200927734375, -0.0188140869140625, -0.017242431640625, -0.051300048828125, 0.0177764892578125]...
[NOTICE] Using a Personal Authentication Token (PAT). Recommended for development only. For improved performance, please use Service Principal based authentication. To disable this message, pass disable_notice=True to VectorSearchClient().
Relevant documents: page_content='View billable usage using the account console  \nThis article describes how to use the Usage page in the account console to view usage data across workspaces in your account. You can also download billable usage logs using the Account API.  \nYou can also access and query billable usage data using Datab

### Building Databricks Chat Model to query llama-2-70b-chat foundation model

<img src="https://github.com/databricks-demos/dbdemos-resources/blob/main/images/product/chatbot-rag/llm-rag-self-managed-model-3.png?raw=true" style="float: right" width="500px">

Our chatbot will use the Llama 2 foundation model to provide answers.

While the model is available using the built-in [Foundation endpoint](/ml/endpoints) (using the `/serving-endpoints/databricks-llama-2-70b-chat/invocations` API), we can use Databricks Langchain Chat Model wrapper to easily build our chain.  

Note: multiple types of endpoints or Langchain models can be used:

- Databricks Foundation models (what we'll use)
- Your fine-tuned model
- An external model provider (such as Azure OpenAI)

In [0]:
# Test Databricks Foundation LLM model
from langchain.chat_models import ChatDatabricks

chat_model = ChatDatabricks(
    endpoint="databricks-llama-2-70b-chat", max_tokens=200
)
print(f"Test chat model: {chat_model.predict('What is Apache Spark')}")

Test chat model: 
Apache Spark is a unified analytics engine for large-scale data processing. It provides high-level APIs in Java, Python, Scala, and R, and an optimized engine that supports general execution graphs. It also provides high-level tools and libraries for data loading, transformation, and machine learning.

Spark is designed to handle large-scale data processing tasks and can process data in real-time or batch mode. It is highly scalable and can handle data processing tasks that are too large for a single machine to handle. It is also highly fault-tolerant, meaning that it can continue to process data even if one or more machines fail.

Spark is widely used in a variety of industries, including finance, healthcare, retail, and telecommunications. It is often used for data warehousing, machine learning, and stream processing.

Some of the key features of Apache Spark include:

1


## Arize Phoenix AI Observability to Visualize and Troubleshoot RAG


In [0]:
from phoenix.trace.langchain import OpenInferenceTracer, LangChainInstrumentor

# If no exporter is specified, the tracer will export to the locally running Phoenix server
tracer = OpenInferenceTracer()
# If no tracer is specified, a tracer is constructed for you
LangChainInstrumentor(tracer).instrument()

2024-01-21 15:22:57.624439: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


In [0]:
import phoenix as px

session = px.launch_app()

🌍 To view the Phoenix app in your browser, visit https://dbc-458303b2-a0c9.cloud.databricks.com/driver-proxy/o/1126084104782633/0112-070001-9dtid9cg/6006/
📺 To view the Phoenix app in a notebook, run `px.active_session().view()`
📖 For more information on how to use Phoenix, check out https://docs.arize.com/phoenix


In [0]:
session.view()

📺 Opening a view to the Phoenix app. The app is running at https://dbc-458303b2-a0c9.cloud.databricks.com/driver-proxy/o/1126084104782633/0112-070001-9dtid9cg/6006/



### Assembling the complete RAG Chain

<img src="https://github.com/databricks-demos/dbdemos-resources/blob/main/images/product/chatbot-rag/llm-rag-self-managed-model-2.png?raw=true" style="float: right" width="600px">


Let's now merge the retriever and the model in a single Langchain chain.

We will use a custom Langchain template so that our assistant gives proper answers.

Make sure you take some time to try different templates and adjust your assistant's tone and personality to your requirements.
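
One lightweight way to experiment with templates before wiring them into a chain is to render them with plain string formatting. The sketch below is an assumption-laden illustration: the two templates are made up, and joining the documents with a blank line is only meant to imitate how a "stuff" chain places all retrieved documents into `{context}`.

```python
# Two hypothetical templates with different tones.
CONCISE = (
    "You are an assistant for Databricks users. Keep the answer as concise as possible.\n"
    "Context:\n{context}\nQuestion: {question}\nAnswer:\n"
)
FRIENDLY = (
    "You are a friendly Databricks guide. Explain step by step and stay encouraging.\n"
    "Context:\n{context}\nQuestion: {question}\nAnswer:\n"
)


def render(template: str, question: str, docs: list[str]) -> str:
    """Fill a template: every retrieved doc goes into the {context} slot."""
    return template.format(context="\n\n".join(docs), question=question)


docs = ["Use the Usage page in the account console to view billable usage."]
for name, template in [("concise", CONCISE), ("friendly", FRIENDLY)]:
    print(f"--- {name} ---")
    print(render(template, "How can I track billing usage?", docs))
```

Printing the rendered prompt makes it easy to check what the model will actually receive before you compare the answers each template produces.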




In [0]:
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
from langchain.chat_models import ChatDatabricks

TEMPLATE = """You are an assistant for Databricks users. You are answering python, coding, SQL, data engineering, spark, data science, DW and platform, API or infrastructure administration question related to Databricks. If the question is not related to one of these topics, kindly decline to answer. If you don't know the answer, just say that you don't know, don't try to make up an answer. Keep the answer as concise as possible.
Use the following pieces of context to answer the question at the end:
{context}
Question: {question}
Answer:
"""
prompt = PromptTemplate(
    template=TEMPLATE, input_variables=["context", "question"]
)

chain = RetrievalQA.from_chain_type(
    llm=chat_model,
    chain_type="stuff",
    retriever=get_retriever(),
    chain_type_kwargs={"prompt": prompt},
)

[NOTICE] Using a Personal Authentication Token (PAT). Recommended for development only. For improved performance, please use Service Principal based authentication. To disable this message, pass disable_notice=True to VectorSearchClient().


In [0]:
# langchain.debug = True #uncomment to see the chain details and the full prompt being sent
question = {"query": "How can I track billing usage on my workspaces?"}
answer = chain.run(question)
print(answer)

You can track billing usage on your workspaces by using the Usage page in the Databricks account console. The Usage page allows you to view detailed usage data for your workspaces, including estimated costs in $USD or DBUs. You can use the graph on the page to view usage data by workspace, SKU, or tag, and you can filter the data by date range, workspace, tag, or SKU. Additionally, you can download aggregated or unaggregated usage data by date range. To access the Usage page, go to the account console and click the Usage icon.


In [0]:
# langchain.debug = True #uncomment to see the chain details and the full prompt being sent
question = {"query": "Where is the URL path for a workspace located?"}
answer = chain.run(question)
print(answer)

The URL path for a workspace is located in the Workspace browser. To access the Workspace browser, click Workspace in the sidebar. The URL path is displayed in the address bar of your web browser. It starts with https://<workspace-name>.databricks.com, where <workspace-name> is the name of your Databricks workspace.


### Arize Phoenix - Run Optional RAG Evaluations 
Arize Phoenix supports a suite of LLM Evaluations for retrieval. Phoenix works with a large set of models for Evals; this example only uses OpenAI.

This example runs through:

- Hallucination Eval - Is the answer made up?
- Retrieval Eval - Is the chunk relevant?
- Q&A Eval - Is the overall answer correct based on the reference text?



In [0]:
# Get traces from Phoenix into dataframe

spans_df = px.active_session().get_spans_dataframe()
spans_df[
    [
        "name",
        "span_kind",
        "attributes.input.value",
        "attributes.retrieval.documents",
    ]
].head()

from phoenix.session.evaluation import (
    get_qa_with_reference,
    get_retrieved_documents,
)

retrieved_documents_df = get_retrieved_documents(px.active_session())
queries_df = get_qa_with_reference(px.active_session())

In [0]:
# If you want to use a different model please see the Phoenix documentation
# https://docs.arize.com/phoenix/api/evaluation-models

import os
from getpass import getpass

# Uses your OpenAI API Key
if not (openai_api_key := os.getenv("OPENAI_API_KEY")):
    openai_api_key = getpass("🔑 Enter your OpenAI API key: ")

os.environ["OPENAI_API_KEY"] = openai_api_key

In [0]:
# For fast Eval runs enable concurrency
# import nest_asyncio
# nest_asyncio.apply()

In [0]:
from phoenix.trace import SpanEvaluations, DocumentEvaluations
from phoenix.experimental.evals import (
    HALLUCINATION_PROMPT_RAILS_MAP,
    HALLUCINATION_PROMPT_TEMPLATE,
    QA_PROMPT_RAILS_MAP,
    QA_PROMPT_TEMPLATE,
    OpenAIModel,
    llm_classify,
)

In [0]:
# Test the model is working - call GPT-4 turbo
model = OpenAIModel("gpt-4-1106-preview", temperature=0.0)
model("hello!")

'Hello! How can I assist you today?'

Next, let's take a look at how to use LLM evals to evaluate our application.

We will be going through a few common evaluation metrics


#### Evaluation Example Results



<img src="https://storage.googleapis.com/arize-assets/fixtures/Embeddings/GENERATIVE/databricks_notebook_eval2.png"  >

In [0]:
if not openai_api_key:
    print("Skipping Evals as no OpenAI key is configured")
else:
    # Creating Hallucination Eval which checks if the application hallucinated
    hallucination_eval = llm_classify(
        dataframe=queries_df,
        model=OpenAIModel("gpt-4-1106-preview", temperature=0.0),
        template=HALLUCINATION_PROMPT_TEMPLATE,
        rails=list(HALLUCINATION_PROMPT_RAILS_MAP.values()),
        provide_explanation=True,  # Makes the LLM explain its reasoning
        concurrency=4,
    )
    hallucination_eval["score"] = (
        hallucination_eval.label[~hallucination_eval.label.isna()] == "factual"
    ).astype(int)

    # Creating Q&A Eval which checks if the application answered the question correctly
    qa_correctness_eval = llm_classify(
        dataframe=queries_df,
        model=OpenAIModel("gpt-4-1106-preview", temperature=0.0),
        template=QA_PROMPT_TEMPLATE,
        rails=list(QA_PROMPT_RAILS_MAP.values()),
        provide_explanation=True,  # Makes the LLM explain its reasoning
        concurrency=4,
    )

    # Score the Q&A eval from its own labels (not the hallucination labels)
    qa_correctness_eval["score"] = (
        qa_correctness_eval.label[~qa_correctness_eval.label.isna()] == "correct"
    ).astype(int)

    # Logs the Evaluations to Phoenix
    px.log_evaluations(
        SpanEvaluations(
            eval_name="Hallucination", dataframe=hallucination_eval
        ),
        SpanEvaluations(
            eval_name="QA Correctness", dataframe=qa_correctness_eval
        ),
    )



llm_classify |          | 0/2 (0.0%) | ⏳ 00:00<? | ?it/s



llm_classify |          | 0/2 (0.0%) | ⏳ 00:00<? | ?it/s



Sending Evaluations: 100%|██████████| 4/4 [00:00<00:00, 38.87it/s]



We can use Retrieval Relevance Evals to identify if issues are caused by the retrieval process for RAG. We are going to use an LLM to grade whether or not the chunks retrieved are relevant to the query.
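
Once each retrieved chunk has a relevance label, the labels can be rolled up into a per-query score. The sketch below is a plain-Python illustration with made-up rows; the label strings `"relevant"` and `"irrelevant"` and the `precision_per_query` helper are assumptions for the example, not part of the Phoenix API.

```python
from collections import defaultdict

# Made-up rows: (query id, document position, label).
# None means the evaluator returned no usable label for that chunk.
rows = [
    ("q1", 0, "relevant"),
    ("q1", 1, "irrelevant"),
    ("q2", 0, "relevant"),
    ("q2", 1, None),
]


def precision_per_query(rows):
    """Share of graded chunks labeled relevant, per query (ungraded chunks are skipped)."""
    hits = defaultdict(int)
    graded = defaultdict(int)
    for query_id, _position, label in rows:
        if label is None:
            continue
        graded[query_id] += 1
        hits[query_id] += int(label == "relevant")
    return {query_id: hits[query_id] / graded[query_id] for query_id in graded}


print(precision_per_query(rows))
```

A low score for a query suggests that a poor answer is a retrieval problem rather than a generation problem.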

#### Evaluation of Retrieved Chunks - Example Results


<img src="https://storage.googleapis.com/arize-assets/fixtures/Embeddings/GENERATIVE/databricks_notebook_retriever_eval.png"  >

In [0]:
# Generating Retrieval Relevance Eval

from phoenix.experimental.evals import (
    RAG_RELEVANCY_PROMPT_RAILS_MAP,
    RAG_RELEVANCY_PROMPT_TEMPLATE,
    OpenAIModel,
    llm_classify,
)

retrieved_documents_eval = llm_classify(
    dataframe=retrieved_documents_df,
    model=OpenAIModel("gpt-4-1106-preview", temperature=0.0),
    template=RAG_RELEVANCY_PROMPT_TEMPLATE,
    rails=list(RAG_RELEVANCY_PROMPT_RAILS_MAP.values()),
    provide_explanation=True,
)

retrieved_documents_eval["score"] = (
    retrieved_documents_eval.label[~retrieved_documents_eval.label.isna()]
    == "relevant"
).astype(int)

px.log_evaluations(
    DocumentEvaluations(
        eval_name="Relevance", dataframe=retrieved_documents_eval
    )
)



llm_classify |          | 0/8 (0.0%) | ⏳ 00:00<? | ?it/s




Sending Evaluations: 100%|██████████| 8/8 [00:00<00:00, 77.42it/s]


### Saving our model to Unity Catalog registry

Now that our model is ready, we can register it within our Unity Catalog schema:

In [0]:
from mlflow.models import infer_signature
import langchain  # needed for langchain.__version__ below
import mlflow

mlflow.set_registry_uri("databricks-uc")
model_name = f"{catalog}.{db}.dbdemos_chatbot_model"

with mlflow.start_run(run_name="dbdemos_chatbot_rag") as run:
    signature = infer_signature(question, answer)
    model_info = mlflow.langchain.log_model(
        chain,
        loader_fn=get_retriever,  # Load the retriever with DATABRICKS_TOKEN env as secret (for authentication).
        artifact_path="chain",
        registered_model_name=model_name,
        pip_requirements=[
            "mlflow==" + mlflow.__version__,
            "langchain==" + langchain.__version__,
            "databricks-vectorsearch",
        ],
        input_example=question,
        signature=signature,
    )



Uploading artifacts:   0%|          | 0/7 [00:00<?, ?it/s]

Registered model 'main.rag_chatbot.dbdemos_chatbot_model' already exists. Creating a new version of this model...


Uploading artifacts:   0%|          | 0/7 [00:00<?, ?it/s]

Created version '3' of model 'main.rag_chatbot.dbdemos_chatbot_model'.


### Deploying our Chat Model as a Serverless Model Endpoint 

Our model is saved in Unity Catalog. The last step is to deploy it as a Model Serving endpoint.

We'll then be able to send requests from our assistant frontend.

In [0]:
# Create or update serving endpoint
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.serving import (
    EndpointCoreConfigInput,
    ServedModelInput,
)

serving_endpoint_name = f"dbdemos_endpoint_{catalog}_{db}"[:63]
latest_model_version = get_latest_model_version(model_name)

w = WorkspaceClient()
endpoint_config = EndpointCoreConfigInput(
    name=serving_endpoint_name,
    served_models=[
        ServedModelInput(
            model_name=model_name,
            model_version=latest_model_version,
            workload_size="Small",
            scale_to_zero_enabled=True,
            environment_vars={
                "DATABRICKS_TOKEN": "{{secrets/dbdemos/rag_sp_token}}",  # <scope>/<secret> that contains an access token
                "PHOENIX_COLLECTOR_ENDPOINT": "URL of Phoenix Server",  # Where to send trace data for Phoenix
            },
        )
    ],
)

existing_endpoint = next(
    (e for e in w.serving_endpoints.list() if e.name == serving_endpoint_name),
    None,
)
serving_endpoint_url = f"{host}/ml/endpoints/{serving_endpoint_name}"
if existing_endpoint is None:
    print(
        f"Creating the endpoint {serving_endpoint_url}, this will take a few minutes to package and deploy the endpoint..."
    )
    w.serving_endpoints.create_and_wait(
        name=serving_endpoint_name, config=endpoint_config
    )
else:
    print(
        f"Updating the endpoint {serving_endpoint_url} to version {latest_model_version}, this will take a few minutes to package and deploy the endpoint..."
    )
    w.serving_endpoints.update_config_and_wait(
        served_models=endpoint_config.served_models, name=serving_endpoint_name
    )

displayHTML(
    f'Your Model Endpoint Serving is now available. Open the <a href="/ml/endpoints/{serving_endpoint_name}">Model Serving Endpoint page</a> for more details.'
)

Updating the endpoint https://dbc-458303b2-a0c9.cloud.databricks.com/ml/endpoints/dbdemos_endpoint_main_rag_chatbot to version 3, this will take a few minutes to package and deploy the endpoint...


Our endpoint is now deployed! You can search for the endpoint name in the [Serving Endpoint UI](#/mlflow/endpoints) and visualize its performance!

Let's run a REST query to try it in Python. We send a question to the endpoint and it returns the answer generated by our RAG chain.
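
For reference, here is a sketch of what such a request could look like when built by hand with the standard library. The workspace URL and token below are placeholders, `build_query_request` is a hypothetical helper, and the `{"inputs": [...]}` payload shape is assumed to match the input format of the model we logged. The request is only constructed here, not sent.

```python
import json
import urllib.request


def build_query_request(host: str, endpoint_name: str, token: str, question: str):
    """Build (but do not send) a POST request to a model serving endpoint."""
    url = f"{host}/serving-endpoints/{endpoint_name}/invocations"
    body = json.dumps({"inputs": [{"query": question}]}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


# Placeholder values: replace with your workspace URL, endpoint name and token.
req = build_query_request(
    "https://example.cloud.databricks.com",
    "dbdemos_endpoint_main_rag_chatbot",
    "<your-token>",
    "How can I track billing usage on my workspaces?",
)
print(req.full_url)
print(req.data.decode("utf-8"))
# To actually send it: urllib.request.urlopen(req)
```

The SDK call in the next cell takes care of the URL and authentication for you, so the manual version is mostly useful when calling the endpoint from an application that does not use the Databricks SDK.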

In [0]:
question = "How can I track billing usage on my workspaces?"

answer = w.serving_endpoints.query(
    serving_endpoint_name, inputs=[{"query": question}]
)
print(answer.predictions[0])




### Let's give it a try, using Gradio as UI!

All you now have to do is deploy your chatbot UI. Here is a simple example using Gradio ([License](https://github.com/gradio-app/gradio/blob/main/LICENSE)). Explore the chatbot gradio [implementation](https://huggingface.co/spaces/databricks-demos/chatbot/blob/main/app.py).

*Note: this UI is hosted and maintained by Databricks for demo purposes and doesn't use the model you just created. We'll soon show you how to do that with Lakehouse Apps!*

In [0]:
display_gradio_app("databricks-demos-chatbot")



## Congratulations! You have deployed your first GenAI RAG model!

You're now ready to deploy the same logic for your internal knowledge base leveraging Lakehouse AI.

We've seen how the Lakehouse AI is uniquely positioned to help you solve your GenAI challenge:

- Simplify Data Ingestion and preparation with Databricks Engineering Capabilities
- Accelerate Vector Search deployment with fully managed indexes
- Leverage the Databricks Llama 2 foundation model endpoint
- Deploy a real-time model endpoint to perform RAG and provide Q&A capabilities

Lakehouse AI is uniquely positioned to accelerate your GenAI deployment.


## Next: ready to take it to the next level?

Open the [02-advanced/01-PDF-Advanced-Data-Preparation]($../02-advanced/01-PDF-Advanced-Data-Preparation) notebook series to learn more about unstructured data, advanced chain, model evaluation and monitoring.

# Cleanup

To free up resources, uncomment and run the cell below.

In [0]:
# /!\ THIS WILL DROP YOUR DEMO SCHEMA ENTIRELY /!\
# cleanup_demo(catalog, db, serving_endpoint_name, f"{catalog}.{db}.databricks_documentation_vs_index")

