# Deploying a LangChain Application as a Custom MLServer Model

This notebook walks through deploying a simple LangChain application as a custom MLServer model on a kind Kubernetes cluster. We are going to deploy the [RAG tutorial](https://python.langchain.com/docs/tutorials/rag/) from the LangChain documentation. You will need an OpenAI API key, although the process is easy to adapt to other LLM API providers.

Before we start we'll need to create a virtual environment with all the required dependencies. We'll use conda:

```sh
conda create -n langchain-model python=3.10
conda activate langchain-model
```

Next, install the dependencies:

```sh
pip install mlserver
pip install langchain-openai
pip install beautifulsoup4
pip install langchain
pip install langchain-chroma
pip install langchain-community
```

The following is the Custom MLServer Model that implements the LangChain application:

In [None]:
from mlserver import types
from mlserver.model import MLModel
from mlserver.codecs import NumpyCodec, StringCodec
from mlserver.types import InferenceRequest, InferenceResponse
from mlserver.logging import logger

import os
import bs4

from langchain_openai import ChatOpenAI
from langchain import hub
from langchain_chroma import Chroma
from langchain_community.document_loaders import WebBaseLoader
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter


class LangChainApp(MLModel):
    async def load(self) -> bool:
        openai_api_key = self._settings.parameters.extra['openai_api_key']
        os.environ["OPENAI_API_KEY"] = openai_api_key
        self.llm = ChatOpenAI(
            model="gpt-4o-mini",
        )
        self.loader = WebBaseLoader(
            web_paths=("https://lilianweng.github.io/posts/2023-06-23-agent/",),
            bs_kwargs=dict(
                parse_only=bs4.SoupStrainer(
                    class_=("post-content", "post-title", "post-header")
                )
            ),
        )
        docs = self.loader.load()
        text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
        splits = text_splitter.split_documents(docs)
        vectorstore = Chroma.from_documents(documents=splits, embedding=OpenAIEmbeddings())
        retriever = vectorstore.as_retriever()
        prompt = hub.pull("rlm/rag-prompt")

        def format_docs(docs):
            return "\n\n".join(doc.page_content for doc in docs)

        self.rag_chain = (
            {"context": retriever | format_docs, "question": RunnablePassthrough()}
            | prompt
            | self.llm
            | StrOutputParser()
        )

        self.ready = True
        logger.info("---- loaded LangChain Application ----")
        return self.ready

    def unpack_input(self, payload: InferenceRequest):
        request_data = {}
        for inp in payload.inputs:
            if inp.name == 'query':
                request_data[inp.name] = (
                    NumpyCodec
                    .decode_input(inp)
                    .flatten()
                    .tolist()
                )
        return request_data
    
    async def predict(self, payload: types.InferenceRequest) -> types.InferenceResponse:
        unpacked_payload = self.unpack_input(payload)
        response = self.rag_chain.invoke(unpacked_payload['query'][0])
        outputs = [StringCodec.encode_output("response", [response])]
        return InferenceResponse(
            id=payload.id,
            model_name=self.name,
            model_version=self.version,
            outputs=outputs,
        )


The `load` method above is responsible for initialising and setting up the LangChain application. In production you'd probably want to provision a vector database separately and insert the embedded chunks in a process separate from `load`. Because this is just a demo and we're only loading and processing a single webpage, it is fine to do it all in the `load` method. Note, however, that if we deploy multiple replicas of this model, each replica reruns the `load` method, which you probably don't want if the vector database setup takes a long time.
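One way to avoid repeating expensive setup in every replica is to do the work once and have `load` only read the result. The sketch below illustrates that pattern with a JSON file standing in for a real vector database. The names `load_or_build_index`, `build_fn`, and `cache_path` are hypothetical and are not part of MLServer or LangChain; in a real deployment the shared store would be an external vector database rather than a local file.

```python
import json
from pathlib import Path
from typing import Callable, List


def load_or_build_index(cache_path: Path, build_fn: Callable[[], List[str]]) -> List[str]:
    """Return previously built chunks if they exist; otherwise build and store them."""
    if cache_path.exists():
        # Fast path: an earlier run (or a separate ingestion job) already did the work.
        return json.loads(cache_path.read_text())
    # Slow path: run the expensive build once and persist the result.
    chunks = build_fn()
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    cache_path.write_text(json.dumps(chunks))
    return chunks
```

With this shape, the expensive step (loading, splitting, and embedding the page) runs only when the stored result is missing.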

Now that we have the above we need to write a `model-settings.json` file. This just contains the model configuration. In this case, we're going to pass the `openai_api_key` to the model here. But you could also include other settings if you wish. 

```json
{
    "name": "langchain-model",
    "implementation": "model.LangChainApp",
    "parameters": {
        "extra": {
            "openai_api_key": "<openai_api_key>"
        }
    }
}
```

Both files, `model.py` and `model-settings.json`, should be stored in the same folder; I've called mine `langchain_model`.
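If you'd rather not paste the key into the file by hand, the settings file can be generated from a script. The following is a minimal sketch: `write_model_settings` is a hypothetical helper, and reading the key from an `OPENAI_API_KEY` environment variable is an assumption about your setup.

```python
import json
from pathlib import Path


def write_model_settings(model_dir: str, api_key: str) -> Path:
    """Write model-settings.json for the LangChainApp model into model_dir."""
    settings = {
        "name": "langchain-model",
        "implementation": "model.LangChainApp",
        "parameters": {"extra": {"openai_api_key": api_key}},
    }
    path = Path(model_dir) / "model-settings.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(settings, indent=4))
    return path


# Example usage (assumes OPENAI_API_KEY is exported in the current shell):
#   import os
#   write_model_settings("langchain_model", os.environ["OPENAI_API_KEY"])
```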

Next, we can test run all of the above using `mlserver start langchain_model/`. If everything is set up correctly you should now be able to query the model locally using the following:

In [3]:
import requests

inference_request = {
    "inputs": [
        {
            "name": "query", 
            "shape": [1, 1], 
            "datatype": "BYTES", 
            "data": ['What is Task Decomposition?']
        }
    ]
}

response = requests.post(
    "http://localhost:8080/v2/models/langchain-model/infer",
    json=inference_request,
)

print(response.json()['outputs'][0]['data'][0])

Task Decomposition is the process of breaking down a complicated task into smaller, manageable steps. It often employs techniques like Chain of Thought (CoT) to enhance performance by prompting the model to think step by step. Additionally, it can involve human input or task-specific instructions to guide the decomposition process.
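The request and response shapes used above can be wrapped in two small helpers so the same code serves both the local and the cluster queries. This is only a sketch: `build_query_request` and `extract_answer` are hypothetical names, and the input name `query` and output name `response` are taken from the `LangChainApp` model defined earlier.

```python
from typing import Any, Dict


def build_query_request(question: str) -> Dict[str, Any]:
    """Build an inference request with a single BYTES input named 'query'."""
    return {
        "inputs": [
            {
                "name": "query",
                "shape": [1, 1],
                "datatype": "BYTES",
                "data": [question],
            }
        ]
    }


def extract_answer(response_json: Dict[str, Any]) -> str:
    """Return the first element of the output named 'response'."""
    for output in response_json.get("outputs", []):
        if output.get("name") == "response":
            return output["data"][0]
    raise KeyError("no output named 'response' in the inference response")
```

Looking the output up by name rather than by position gives a clearer error if the model returns something unexpected, for example an error payload with no `outputs` field.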


## Deploying with Kubernetes

Next, we're going to deploy using Kubernetes. The base MLServer image doesn't include the LangChain dependencies, so we'll need to provide them ourselves. There are a couple of ways of doing this.

1. We create a tarball of the environment and bundle it with the `model.py` and `model-settings.json` files. When we deploy these, the load step also unpacks the dependencies so that the custom MLServer model has access to what it needs.
2. We create a new Docker image that includes the relevant dependencies and deploy it as a new server. If a model requires specific dependencies, we can add a `requirements` field to `model.yaml` and Seldon Core will deploy the model on the correct server.

We're going to take approach 1 here, which is also discussed [here](https://mlserver.readthedocs.io/en/latest/examples/conda/README.html#serialise-our-custom-environment). Approach 2, in case you're interested, is detailed [here](https://mlserver.readthedocs.io/en/latest/examples/custom/README.html#deployment).

First, we need to use `conda-pack` to create the tarball. Run `conda install conda-pack` to install [conda-pack](https://conda.github.io/conda-pack/#conda-pack), which we'll use to create the serialized conda environment. We run:

```sh
conda pack --force -n langchain-model -o langchain_model/langchain-model.tar.gz
```

Now the `langchain_model` folder should contain:

In [5]:
!tree langchain_model

langchain_model
├── langchain-model.tar.gz
├── model.py
└── model-settings.json

0 directories, 3 files
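Before uploading, it can be worth checking that the packed environment really contains the packages the model imports. The sketch below uses only the standard library; `tarball_has_package` is a hypothetical helper, and it assumes the archive stores packages under a `site-packages/<package>/` path, which is how a packed conda environment is normally laid out.

```python
import tarfile


def tarball_has_package(tarball_path: str, package: str) -> bool:
    """Return True if any archive member sits under site-packages/<package>/."""
    marker = f"site-packages/{package}/"
    with tarfile.open(tarball_path, "r:gz") as tar:
        # Iterating the TarFile reads members lazily and stops at the first match.
        return any(marker in member.name for member in tar)


# Example usage:
#   tarball_has_package("langchain_model/langchain-model.tar.gz", "langchain")
```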


We'll also need to update the `model-settings.json` file to reference the `.tar.gz` file.

```json
{
    "name": "langchain-model",
    "implementation": "model.LangChainApp",
    "parameters": {
        "environment_tarball": "./langchain-model.tar.gz",
        "extra": {
            "openai_api_key": "<openai_api_key>"
        }
    }
}
```

The Seldon Core deployment workflow requires that the model assets be stored on some external storage. Options include [MinIO](https://docs.seldon.io/projects/seldon-core/en/v2/contents/kubernetes/storage-secrets/index.html), but for simplicity we'll use Google Cloud Storage. In particular, I've uploaded the model files to a Google bucket: `gs://llm-demos/llm-demos/langchain-example/langchain_model`. Note that I've removed the OpenAI API key from the `model-settings.json` file uploaded there. You'll want to set up your own key and Google bucket, and replace the `storageUri` in the model YAML below to point at your bucket.
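If you share your model folder publicly, as with the bucket above, strip the key from the settings first. A minimal sketch of that step is shown here; `redact_api_key` is a hypothetical helper, and the placeholder string simply mirrors the one used in the JSON examples in this notebook.

```python
import copy
from typing import Any, Dict


def redact_api_key(settings: Dict[str, Any], placeholder: str = "<openai_api_key>") -> Dict[str, Any]:
    """Return a copy of the model settings with the OpenAI key replaced."""
    redacted = copy.deepcopy(settings)  # leave the caller's dict untouched
    extra = redacted.get("parameters", {}).get("extra", {})
    if "openai_api_key" in extra:
        extra["openai_api_key"] = placeholder
    return redacted
```

Remember that the deployed model still reads the key from `parameters.extra`, so the copy you deploy from your own bucket needs a real key.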

If we now apply the following `model.yaml` file to a cluster running Seldon Core v2, the LangChain model should deploy successfully:

```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: langchain-model
spec:
  storageUri: "gs://llm-demos/langchain-example/langchain_model"
  requirements:
  - mlserver
```


In [13]:
!kubectl apply -f model.yaml -n seldon-mesh

model.mlops.seldon.io/langchain-model created


In [14]:
import subprocess

def get_mesh_ip():
    # Look up the external IP of the seldon-mesh LoadBalancer service.
    cmd = "kubectl get svc seldon-mesh -n seldon-mesh -o jsonpath='{.status.loadBalancer.ingress[0].ip}'"
    return subprocess.check_output(cmd, shell=True).decode('utf-8').strip()

In [17]:
import requests


inference_request = {
    "inputs": [
        {
            "name": "query", 
            "shape": [1, 1], 
            "datatype": "BYTES", 
            "data": ['What is Task Decomposition?']
        }
    ]
}

response = requests.post(
    f"http://{get_mesh_ip()}/v2/models/langchain-model/infer",
    json=inference_request,
)

data = response.json()['outputs'][0]['data'][0]
print(data)

Task Decomposition is the process of breaking down a complicated task into smaller, manageable steps. This can be achieved through techniques like Chain of Thought (CoT) and Tree of Thoughts, which guide the model to think step by step and explore multiple reasoning paths. By doing so, it enhances the model's performance on complex tasks and provides insight into its reasoning process.


## Final Note

Although it is possible to deploy LangChain applications with MLServer as shown above, this isn't how we recommend deploying large language model applications in general. The problem with the above approach is that every component in the application, including the vector database and the embedding model, is packaged into a single model running on a single server. This means that:

1. There is a single point of failure in the system
2. You can't scale each component independently
3. Data passed between components isn't exposed, so debugging or auditing the data becomes harder

Seldon's suggested approach is to deploy large language models using our [LLM module](https://www.seldon.io/solutions/llm-module). The LLM module provides a set of MLServer runtimes for each of the components that you might need in an LLM application. It allows you to deploy each component separately and then wire them up using Seldon Core v2 [pipelines](https://docs.seldon.io/projects/seldon-core/en/v2/contents/pipelines/index.html). This way you get:

- A truly distributed system
- Reusable components
- The ability to examine the data coming in and out of each component
- The option to deploy things like drift detectors or explainers on your pipeline or for specific components
- The ability to scale/autoscale components independently of the others
