# Application deployment

This notebook demonstrates the end-to-end process of building, testing, and deploying a banking virtual assistant using MLRun, LangChain, and Milvus for vector storage. It covers project setup, data ingestion for retrieval-augmented generation, application graph definition with guardrails and analysis steps, local testing, and deployment to Kubernetes. An interactive Streamlit UI is also provided for user testing and demonstration.

![](images/03_application_deployment_architecture.png)

In [None]:
# %pip install -r requirements.txt

In [None]:
import mlrun
from langchain_community.document_loaders import UnstructuredMarkdownLoader
from langchain_openai import OpenAIEmbeddings
from dotenv import load_dotenv
from langchain_huggingface.embeddings import HuggingFaceEmbeddings
import warnings
from langchain_milvus import Milvus

load_dotenv("ai_gateway.env")
mlrun.set_env_from_file("ai_gateway.env")

### Set up the project

Load the project already created in the first notebook.

In [None]:
project = mlrun.get_or_create_project("banking-agent", user_project=True)

This tutorial runs [Milvus](https://milvus.io/api-reference/pymilvus/v2.4.x/About.md) locally for simplicity. To connect to a Milvus instance that is not local, see [Manage Milvus Connections](https://milvus.io/docs/v2.1.x/manage_connection.md).
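
As a minimal sketch of switching between the two modes, the helper below returns a local database file by default and a server address when an environment variable is set. The helper and the variable name `MILVUS_URI` are assumptions made for illustration and are not part of this project; the returned dictionary only mirrors the `{"uri": ...}` shape that this notebook passes as `connection_args`.

```python
import os


def milvus_connection_args(env=None, default_db="milvus_demo.db"):
    """Return connection args for a local file or a remote Milvus server.

    MILVUS_URI is a hypothetical variable name used only in this sketch.
    """
    env = os.environ if env is None else env
    remote_uri = env.get("MILVUS_URI")
    if remote_uri:
        # For example "http://localhost:19530" for a standalone server.
        return {"uri": remote_uri}
    # Fall back to a local database file in the working directory.
    return {"uri": os.path.join(os.getcwd(), default_db)}


print(milvus_connection_args({}))
print(milvus_connection_args({"MILVUS_URI": "http://localhost:19530"}))
```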

##### Note: Because this tutorial uses a local Milvus database, the deployment works only when you run it on the notebook service. Consider running this notebook on the notebook service in IGZ.

In [None]:
import os

db_path = os.path.join(os.getcwd(), "milvus_demo.db")
if os.path.exists(db_path):
    os.remove(db_path)

MILVUS_ARGS = {"uri": db_path}
MILVUS_ARGS

### Ingest data for vector store retrieval

This section covers the ingestion of banking knowledge base documents into the Milvus vector store. Loading and embedding markdown files that contain general bank information, account details, and customer FAQs enables efficient retrieval-augmented generation for the virtual assistant. The following steps demonstrate how to load, embed, and store these documents for downstream use in the application.
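
To make the idea concrete, here is a toy sketch of what "embed, store, retrieve" means, using only the standard library. It is not how Milvus or the embedding models work internally: the word-count "embedding", the `ToyVectorStore` class, and the sample sentences are all invented for illustration.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: lowercase word counts (a stand-in for a real model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorStore:
    """Hypothetical in-memory store that keeps (vector, text) pairs."""

    def __init__(self):
        self.rows = []

    def add_documents(self, docs):
        for doc in docs:
            self.rows.append((embed(doc), doc))

    def similarity_search(self, query, k=1):
        query_vector = embed(query)
        ranked = sorted(
            self.rows, key=lambda row: cosine(query_vector, row[0]), reverse=True
        )
        return [doc for _, doc in ranked[:k]]


store = ToyVectorStore()
store.add_documents(
    [
        "branch hours are listed on the locations page",
        "a checking account requires two forms of id",
        "reset your password from the login page",
    ]
)
print(store.similarity_search("how do i reset my password", k=1))
```

A real embedding model captures meaning rather than exact word overlap, which is why the notebook uses `OpenAIEmbeddings` or `HuggingFaceEmbeddings` instead.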

In [None]:
warnings.filterwarnings("ignore", category=DeprecationWarning, module="pkg_resources")

openai_available = os.environ.get("OPENAI_API_KEY")

if not openai_available:
    embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")
else:
    embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

vectorstore = Milvus(
    collection_name="banking_agent",
    embedding_function=embeddings,
    connection_args=MILVUS_ARGS,
    auto_id=True,
)

In [None]:
if not vectorstore.col:
    general_bank_info_kb = UnstructuredMarkdownLoader(
        "data/general_bank_info_kb.md"
    ).load()
    checking_savings_kb = UnstructuredMarkdownLoader(
        "data/checking_savings_kb.md"
    ).load()
    customer_faq = UnstructuredMarkdownLoader("data/customer_faq.md").load()
    pages = general_bank_info_kb + checking_savings_kb + customer_faq
    vectorstore.add_documents(pages)


milvus_artifact = project.log_artifact("vectorstore", local_path=MILVUS_ARGS["uri"])

MILVUS_ARGS["uri"] = milvus_artifact.uri

In [None]:
vectorstore.col.num_entities

### Define application serving graph

The application serving graph orchestrates the flow of user queries through a series of modular steps:

- **Input guardrails:**  
    Ensure only safe and relevant queries proceed by filtering out inappropriate or off-topic inputs.

- **Sentiment & churn analysis:**  
    Analyze user sentiment and predict churn propensity to enrich the context for downstream processing.

- **Context building:**  
    Aggregate user information, sentiment, and churn data to construct a detailed context for the agent.

- **Response generation:**  
    The agent leverages the built context and retrieval-augmented generation from the vector store to provide accurate and personalized responses.

This **graph-based architecture** enables robust, explainable, and extensible deployment of the banking virtual assistant, ensuring each step is modular and transparent for easier maintenance and future enhancements.

See real-time serving graphs in the [documentation](https://docs.mlrun.org/en/stable/serving/serving-graph.html) for more information.
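
The four stages above can be sketched in plain Python to show the control flow. This is a simplified illustration only: the keyword rules, the function names, and the fixed analysis values are assumptions, whereas the real graph calls deployed guardrail functions, models, and an LLM.

```python
def topic_guardrail(event):
    # Toy rule: accept only if a banking keyword is present.
    keywords = ("account", "deposit", "atm", "savings", "checking")
    return any(word in event["question"].lower() for word in keywords)


def toxicity_guardrail(event):
    # Toy rule: reject a small set of hostile words.
    return not any(word in event["question"].lower() for word in ("hate", "stupid"))


def analyze(event):
    # Placeholder analysis; real values would come from the models.
    negative = "frustrated" in event["question"].lower()
    return {"sentiment": "negative" if negative else "neutral", "churn": "low"}


def build_context(event, analysis):
    return (
        f"name: {event['name']}, sentiment: {analysis['sentiment']}, "
        f"churn: {analysis['churn']}"
    )


def respond(event, context):
    # Stand-in for the LLM call.
    return f"[answer to '{event['question']}' using context: {context}]"


def run_flow(event):
    # Both guardrails must pass before the request continues.
    checks = {
        "banking-topic-guardrail": topic_guardrail(event),
        "toxicity-guardrail": toxicity_guardrail(event),
    }
    if not all(checks.values()):
        return {"accepted": False, "guardrails": checks, "output": "Request rejected."}
    analysis = analyze(event)
    context = build_context(event, analysis)
    return {"accepted": True, "guardrails": checks, "output": respond(event, context)}


print(run_flow({"name": "John", "question": "How do I open a checking account?"}))
print(run_flow({"name": "John", "question": "i hate you"}))
```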

In [None]:
banking_topic_guardail = project.get_function("banking-topic-guardrail")
toxicity_guardrail = project.get_function("toxicity-guardrail")
churn_model = project.get_function("serving")
agent_graph = project.get_function("banking-agent")

In [None]:
mlrun.get_run_db().get_function("serving", project=project.name)

In [None]:
from mlrun.serving import ModelRunnerStep

graph = agent_graph.set_topology("flow", engine="async", exist_ok=True)
# Enrich the incoming request so that invoking the graph is simpler and needs fewer arguments
graph.add_step(
    name="enrich_request",
    handler="enrich_request",
)

# Topic and toxicity guardrail router (from notebook 2)
guardrails_router = graph.add_step(
    "*ParallelRunMerger",
    name="input-guardrails",
    output_key="guardrails_output",
    extend_event=True,
    after="enrich_request"
)
guardrails_router.add_route(
    key="banking-topic-guardrail",
    class_name="mlrun.serving.remote.RemoteStep",
    method="POST",
    url=banking_topic_guardail.get_url(),
)
guardrails_router.add_route(
    key="toxicity-guardrail",
    class_name="mlrun.serving.remote.RemoteStep",
    method="POST",
    url=toxicity_guardrail.get_url(),
)

# Filtering accept and reject
graph.add_step(
    name="guardrail-filter",
    class_name="GuardrailsChoice",
    mapping={"True": "accept", "False": "reject"},
    after="input-guardrails",
)

graph.add_step(name="accept", handler="accept", after="guardrail-filter")

# Add model runner step to run the sentiment and churn analysis
model_runner_step = ModelRunnerStep(
    name="input-analysis",
    result_path="input_analysis_output",
)
model_runner_step.add_model(
    model_class="SentimentAnalysisModelServer",
    endpoint_name="sentiment_analysis_output",
    result_path="sentiment_analysis_output",
    execution_mechanism="naive",
)
model_runner_step.add_model(
    model_class="ChurnModelServer",
    endpoint_name="churn_model_output",
    execution_mechanism="naive",
    dataset=f"store://datasets/{project.name}/data-process-data_test#0:latest",
    label_column="churn",
    endpoint_url=churn_model.get_url(),
    churn_mappings={"high": 0.50, "medium": 0.20, "low": 0},
    result_path="churn_model_output",
)

graph.add_step(model_runner_step, after=["accept"], full_event=True)


graph.add_step(
    name="build-context",
    class_name="BuildContext",
    context_mappings={
        "name": "sentiment_analysis_output.name",  # name is nested inside sentiment_analysis_output
        "sentiment": "sentiment_analysis_output.response[0]",  # direct path, no input_analysis_output wrapper
        "churn": "churn_model_output.response[0]",  # direct path
    },
    output_key="formatted_prompt",
    prompt="""
    This is context about the user and their query:
    <user_context>
    name: {name}
    sentiment: {sentiment}
    churn propensity percentage: {churn}
    </user_context>

    If they have a high churn propensity, consider asking them if they would like to escalate to a human operator.
    Do not offer to escalate for low churn propensity.
    Do NOT mention the churn propensity but use it to craft your response.
    Use the sentiment to craft your response.
    """,
    after="input-analysis",
    full_event=True,
)
# Add the BankingAgent LLM: OpenAI if credentials are available, otherwise Hugging Face
MRS_banking_agent = ModelRunnerStep(name="banking-agent")

if not openai_available:
    MRS_banking_agent.add_model(
        model_class="BankingAgentHuggingFace",
        endpoint_name="BankingAgentHuggingFace",
        execution_mechanism="naive",
        model_name=os.environ.get("HF_MODEL_NAME", "Qwen/Qwen2.5-1.5B-Instruct"),
        prompt_input_key="formatted_prompt",
        messages_input_key="inputs",
        max_new_tokens=256,
        temperature=0.2,
        result_path="banking-agent",
    )
else:
    MRS_banking_agent.add_model(
        model_class="BankingAgentOpenAI",
        endpoint_name="BankingAgentOpenAI",
        execution_mechanism="naive",
        model_name="gpt-4o-mini",
        system_prompt="You are a helpful assistant for IGZ Bank. Respond in a concise, but detailed way. Use web search if the customer asks about other banks or external information.",
        result_path="banking-agent",
        after="build-context",
        prompt_input_key="formatted_prompt",
        messages_input_key="inputs",
        vector_db_collection="banking_agent",
        vector_db_args=MILVUS_ARGS,
        vector_db_description="Use this to answer any questions about general bank info like locations, hours, guidelines for opening savings/checking accounts, APY for savings/checking, as well as general FAQ like resetting passwords, ATM fees, setting up direct deposit, etc.",
    )

graph.add_step(MRS_banking_agent, after=["build-context"])

graph.add_step(name="reject", handler="reject", after="guardrail-filter")

graph.add_step(
    name="output", handler="responder", after=["banking-agent", "reject"]
).respond()

In [None]:
graph.plot(rankdir="LR")

### Test locally with a mock server

Running the LLM is resource intensive, and some systems can't run it locally, so the mock server is used only when OpenAI credentials are available.

In [None]:
if openai_available:
    mock = agent_graph.to_mock_server()

### Test the input guardrails

In [None]:
HIGH_PROPENSITY_CHURN_USER_ID = 32
LOW_PROPENSITY_CHURN_USER_ID = 2296

def _format_question(question: str, role: str = "user"):
    return {"role": role, "content": question}

A question the agent cannot answer: the input is rejected.

In [None]:
if openai_available:
    resp = mock.test(
        path="/",
        body={
            "name": "John",
            "inputs": [_format_question("What is a mortgage, from the bank?")],
            "user_id": LOW_PROPENSITY_CHURN_USER_ID,
        },
    )
    print(resp["outputs"][0])

A hostile message, which is also rejected:

In [None]:
if openai_available:
    resp = mock.test(
        path="/",
        body={
            "name": "John",
            "inputs": [_format_question("i hate you")],
            "user_id": LOW_PROPENSITY_CHURN_USER_ID,
        },
    )
    print(resp["outputs"][0])

### Test the banking agent - sentiment analysis and churn propensity

Standard Q&A with neutral sentiment and low churn

In [None]:
if openai_available:
    resp = mock.test(
        path="/",
        body={
            "name": "John",
            "inputs": [_format_question("how to apply for checking account?")],
            "user_id": LOW_PROPENSITY_CHURN_USER_ID,
        },
    )
    print(resp["outputs"][0])

Standard Q&A with negative sentiment and low churn

In [None]:
if openai_available:
    resp = mock.test(
        path="/",
        body={
            "name": "John",
            "inputs": [
                _format_question(
                    "how to apply for checking account? I keep trying but I'm really frustrated"
                )
            ],
            "user_id": LOW_PROPENSITY_CHURN_USER_ID,
        },
    )
    print(resp["outputs"][0])

Standard Q&A with negative sentiment and high churn. Note that the model offers to escalate to a human operator. This behavior can be customized through the input guardrails and input analysis.
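
As a rough sketch of how a churn score could be turned into a label, the snippet below treats the `churn_mappings` values from the graph definition as lower thresholds. That interpretation is an assumption, and the project's `ChurnModelServer` may use the values differently. In the graph itself, the decision to offer escalation is left to the LLM prompt; the `should_offer_escalation` helper here is hypothetical.

```python
CHURN_MAPPINGS = {"high": 0.50, "medium": 0.20, "low": 0}


def churn_label(probability, mappings=CHURN_MAPPINGS):
    """Return the label with the largest threshold not exceeding probability."""
    by_threshold = sorted(mappings.items(), key=lambda item: item[1], reverse=True)
    for label, threshold in by_threshold:
        if probability >= threshold:
            return label
    raise ValueError("probability is below every threshold")


def should_offer_escalation(probability):
    # Hypothetical rule: only high-propensity users are offered a human operator.
    return churn_label(probability) == "high"


for score in (0.7, 0.3, 0.05):
    print(score, churn_label(score), should_offer_escalation(score))
```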

Standard multi-turn Q&A

In [None]:
if openai_available:
    resp = mock.test(
        path="/",
        body={
            "name": "Alice",
            "inputs": [
                {"role": "user", "content": "Hi—how do I open a checking account?"},
                {
                    "role": "assistant",
                    "content": "To open a checking account, you need two forms of ID and a minimum deposit of $25.",
                },
                {"role": "user", "content": "Is it possible to get cashback rewards?"},
            ],
            "user_id": HIGH_PROPENSITY_CHURN_USER_ID,  # <-- High churn propensity user
        },
    )
    print(resp["outputs"][0])

### Full outputs

Below is the comprehensive output from the application graph. This includes all intermediate and final results: user input, guardrails decisions, input analysis (such as sentiment and churn predictions), any tool calls, and the generated response from the model. Use this section to trace the end-to-end flow and understand how each component contributes to the final answer.
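
When the full response is deeply nested, a small helper that lists every leaf value with its path can make tracing easier. The sketch below is generic; the `sample` dictionary is a made-up stand-in that only reuses the `guardrails_output` and `outputs` key names, and the real response structure may differ.

```python
def flatten(obj, prefix=""):
    """Yield (path, value) pairs for every leaf in nested dicts and lists."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            yield from flatten(value, f"{prefix}.{key}" if prefix else str(key))
    elif isinstance(obj, list):
        for index, value in enumerate(obj):
            yield from flatten(value, f"{prefix}[{index}]")
    else:
        yield prefix, obj


# Hypothetical payload for illustration only.
sample = {
    "guardrails_output": {"toxicity-guardrail": True},
    "outputs": ["Hello Alice"],
}

for path, value in flatten(sample):
    print(f"{path} = {value}")
```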

In [None]:
if openai_available:
    resp

### Deploy to Kubernetes

Deploy the banking agent application to a production-ready Kubernetes endpoint. This enables robust, scalable, and highly available access for integration with real-world applications. Kubernetes orchestration ensures automatic scaling and reliability based on demand.
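
Once deployed, the endpoint can also be called from outside the notebook over HTTP. The sketch below only builds the request, using the same body shape as the invocations in this notebook. The `build_request` helper and the `example.com` URL are placeholders for illustration, and whether the endpoint is reachable or requires authentication depends on your environment.

```python
import json
import urllib.request


def build_request(url, name, question, user_id):
    """Build a POST request with the body shape used in this notebook."""
    body = {
        "name": name,
        "inputs": [{"role": "user", "content": question}],
        "user_id": user_id,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_request(
    "https://banking-agent.example.com/",  # placeholder URL
    "Alice",
    "Hi, how do I open a checking account?",
    32,
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
# To send it: urllib.request.urlopen(request, timeout=30)
```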

In [None]:
# Reminder: because the Milvus database is local, the deployment works only when running on the notebook service
project.deploy_function(agent_graph)

In [None]:
resp = agent_graph.invoke(
    path="/",
    body={
        "name": "Alice",
        "inputs": [{"role": "user", "content": "Hi, how do I open a checking account?"}],
        "user_id": HIGH_PROPENSITY_CHURN_USER_ID,  # <-- High churn propensity user
    },
)
print(resp)

In [None]:
resp = agent_graph.invoke(
    path="/",
    body={
        "name": "Alice",
        "inputs": [{"role": "user", "content": "what is a mortgage?"}],
        "user_id": HIGH_PROPENSITY_CHURN_USER_ID,  # <-- High churn propensity user
    },
)
print(resp)

### Application UI

<div style="padding: 10px; border-left: 6px solid #f0ad4e; background: #fcf8e3;">
<b>⚠️ Warning:</b> This section is not supported on Community Edition.
</div><br>

The Streamlit UI offers an interactive environment to test and explore the banking agent's capabilities. Its main features include:

- **Chat window:**  
    Engage in a conversational interface where you can enter questions and receive responses from the assistant, simulating real user interactions.

- **Tool usage visualization:**  
    When the agent invokes external tools or APIs (such as retrieving information from the vector store), these actions are surfaced in the chat, allowing you to see when and how tools are used.

- **Intermediate graph steps:**  
    The UI displays outputs from key stages of the application graph, including:
    - **Input guardrails:**  
        - *Toxicity guardrail passed* — Indicates if the input passes the toxicity filter.
        - *Banking topic guardrail passed* — Shows whether the query is relevant to banking topics.
    - **Input analysis:**  
        - *Sentiment analysis* — Reveals the detected sentiment of the user's input.
        - *Churn prediction* — Estimates the user's propensity to leave, which can influence the assistant's response.

This transparent design helps users trace the end-to-end flow of their queries, understand decision points, and gain insight into how each system component contributes to the final answer.

![](images/banking_agent_ui.png)

In [None]:
!tar -czvf frontend_ui.tar.gz ./src/functions/frontend_ui.py

In [None]:
frontend_source = project.log_artifact("frontend_source", local_path="frontend_ui.tar.gz", upload=True)

In [None]:
ui_fn = project.set_function(
    name="frontend",
    kind="application",
    image="mlrun/mlrun",
    requirements=["streamlit==1.49.1"]
)

API_URL = agent_graph.get_url()
API_URL

In [None]:
ui_fn.set_env("API_URL", API_URL)
ui_fn.with_source_archive(frontend_source.target_path, pull_at_runtime=False)
ui_fn.set_internal_application_port(8000)
ui_fn.spec.command = "streamlit"
ui_fn.spec.args = ["run", "--server.port", "8000", "/home/mlrun_code/src/functions/frontend_ui.py"]

In [None]:
ui_fn.deploy(with_mlrun=False, create_default_api_gateway=False)
ui_fn.create_api_gateway(
    name="banking-agent-ui",
    path="/",
    direct_port_access=True,
    ssl_redirect=True,
    set_as_default=False,
    authentication_mode="none"
)