# Conversational Memory Example

This example demonstrates how to run a Conversational Memory Runtime MLServer instance locally. It 
also illustrates the different ways it can be used. (Note: this example only showcases the Conversational 
Memory as a stand-alone component. For a more integrated example, check out [chatbot](/examples/chat_bot/chat_bot.md).)

To get up and running, we first need to install the runtime. Create a virtual environment within 
the base memory runtime directory:

```sh
python3 -m venv venv
source venv/bin/activate
```

and then install:

```sh
pip install .
```

We are now ready to create the `model-settings.json`. If this kind of file is new to you, make sure to 
check out the [MLServer docs](https://mlserver.readthedocs.io/en/latest/reference/model-settings.html) for 
a detailed description of what it is and how it works.

```sh
cat filesys/model-settings.json
```

```json
{
  "name": "conversational_memory",
  "implementation": "mlserver_memory.ConversationalMemory",
  "parameters": {
    "extra": {
      "database_config": {
        "database": "filesys"
      },
      "memory_config": {
        "window_size": 2,
        "tensor_names": ["content"]
      }
    }
  }
}
```


See the ConversationalMemory [specification](api-ref-cbms) for an explanation of the `memory_config` 
parameters as well as the file system backend [specification](api-ref-fsc). 
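
If you would rather generate the settings file from a script (for instance, to experiment with different `window_size` values), the following is a minimal sketch. It only reuses the keys shown above. The helper name `write_model_settings` is hypothetical and not part of the runtime.

```python
import json
import tempfile
from pathlib import Path


def write_model_settings(directory, window_size=2, tensor_names=("content",)):
    """Write a model-settings.json that mirrors the one shown above."""
    settings = {
        "name": "conversational_memory",
        "implementation": "mlserver_memory.ConversationalMemory",
        "parameters": {
            "extra": {
                "database_config": {"database": "filesys"},
                "memory_config": {
                    "window_size": window_size,
                    "tensor_names": list(tensor_names),
                },
            }
        },
    }
    path = Path(directory) / "model-settings.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(settings, indent=2) + "\n")
    return path


# A temporary directory keeps this example free of side effects.
# Point it at the `filesys` directory to produce the file used in this example.
with tempfile.TemporaryDirectory() as tmp:
    settings_path = write_model_settings(tmp, window_size=4)
    loaded = json.loads(settings_path.read_text())
    print(loaded["parameters"]["extra"]["memory_config"])
```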


Now that the `model-settings.json` file is ready, we can start the runtime using:

```sh
mlserver start filesys
```
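
Before sending requests, it can help to wait until the model reports that it is ready. The sketch below polls the model readiness endpoint that MLServer exposes as part of the V2 inference protocol. The helper names, the retry counts and the port (`8080`, taken from the requests used later in this example) are assumptions for illustration.

```python
import time
import urllib.error
import urllib.request


def model_is_ready(url, timeout=2.0):
    """Return True if a GET request to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout or a non-2xx answer: not ready yet.
        return False


def wait_until_ready(probe, attempts=10, delay=1.0):
    """Call `probe()` until it returns True or the attempts run out."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False


# Usage, with the server started as shown above:
# ready_url = "http://localhost:8080/v2/models/conversational_memory/ready"
# wait_until_ready(lambda: model_is_ready(ready_url))
```

Passing the probe as a callable keeps the retry loop independent of the HTTP client, so it can be exercised without a running server.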

There are two intended workflows for using the memory runtime: memory updates and memory retrieval. In 
the first, the developer adds new messages to the memory allocated to an LLM; in the second, the developer 
reads messages back from the memory store. Every request to the runtime must include a `memory_id` tensor, 
which the runtime uses to identify the conversation in question. Regardless of whether the request updates 
the conversation history, the runtime returns the last `window_size` messages from the memory.

![developer updates conversation history](./update_workflow.png)

For a concrete example of the retrieval workflow, imagine we want the history of a conversation about 
some financial documents, one that includes the question `Why did Apple acquire X company?`. A developer can 
get it by sending only the `memory_id` tensor for that conversation. The runtime then returns the last 
`window_size` messages it has stored, or all of them if fewer are available: for example, 7 messages if only 
7 have been stored so far and `window_size` is larger.

For the update workflow, suppose we have set `window_size` to 10 and would like to add a new message on 
behalf of our user. To do so, we send a `memory_id` tensor together with a `content` tensor. The runtime 
appends the new message to the memory, reads the last `window_size` messages (the new message is the last 
of these, at index `window_size - 1` once the window is full) and returns them. This is done to make the 
typical flow of a conversation easy to implement with the runtime. The next examples will shed some light 
on how this process works.
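
To make the windowing behaviour of the update and retrieval workflows concrete, here is a toy in-memory model of it. This is only a sketch of the semantics described above, not the runtime's implementation. The class name `InMemoryConversationStore` and its `infer` method are hypothetical.

```python
from collections import defaultdict


class InMemoryConversationStore:
    """Toy stand-in for the windowing behaviour; not the real runtime."""

    def __init__(self, window_size):
        self.window_size = window_size
        self._messages = defaultdict(list)

    def infer(self, memory_id, content=None, window_size=None):
        """Append `content` (if any), then return the most recent messages."""
        if content:
            self._messages[memory_id].extend(content)
        size = self.window_size if window_size is None else window_size
        history = self._messages[memory_id]
        return history[-size:] if size > 0 else []


store = InMemoryConversationStore(window_size=2)

# Update workflow: write four messages, get the last two back.
updated = store.infer(
    "5P5pB2Lv",
    content=[
        "Hello, how are you?",
        "I'm well, how are you?",
        "Good, i'd like to buy a car.",
        "Really, what kind of car?",
    ],
)
print(updated)

# Retrieval workflow: only the id is sent. A window larger than the
# history simply returns every stored message.
retrieved = store.infer("5P5pB2Lv", window_size=5)
print(len(retrieved))  # 4
```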

**Note**: A developer can store more than one tensor per message by extending the list of `tensor_names` 
in the `model-settings.json` file. The only restriction is that, when multiple tensors are configured, every 
request must send the same number of entries for each of them, so that the store does not get out of sync.
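
As an illustration of the equal-length restriction, the sketch below builds an update request for two tensors and refuses to build one when their lengths differ. It assumes `tensor_names` has been set to `["role", "content"]`. The `role` tensor and the helper `build_update_request` are hypothetical and only serve the example.

```python
def build_update_request(memory_id, tensors):
    """Build a V2 inference request carrying one or more message tensors.

    `tensors` maps a tensor name to a list of strings. All lists must have
    the same length so that the stored tensors stay aligned.
    """
    lengths = {name: len(values) for name, values in tensors.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"tensors have unequal lengths: {lengths}")

    def string_tensor(name, values):
        return {
            "name": name,
            "shape": [len(values)],
            "datatype": "BYTES",
            "data": list(values),
            "parameters": {"content_type": "str"},
        }

    inputs = [string_tensor("memory_id", [memory_id])]
    inputs += [string_tensor(name, values) for name, values in tensors.items()]
    return {"inputs": inputs}


request = build_update_request(
    "6e1f1413",
    {
        "role": ["user", "assistant"],  # hypothetical second tensor
        "content": ["Hello, how are you?", "I'm well, how are you?"],
    },
)
print([tensor["name"] for tensor in request["inputs"]])
```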

**Note**: The `window_size` parameter is intended to be used to control the size of requests being sent 
along the pipelines in which the memory runtime is deployed. It's not intended to be used to control the 
number of tokens sent to the LLM. Token bounds can be configured in the [LLM inference runtime](../api/index.md).

To get started, let's send a request with a `memory_id` and some content to initialize a conversational 
store. We've chosen the `memory_id=6e1f1413` at random.

```python
import pprint

import requests

# Update request: a `memory_id` tensor plus the message to store.
inference_request = {
    "inputs": [
        {
            "name": "memory_id",
            "shape": [1],
            "datatype": "BYTES",
            "data": ["6e1f1413"],
            "parameters": {"content_type": "str"},
        },
        {
            "name": "content",
            "shape": [1],
            "datatype": "BYTES",
            "data": ["Hello, how are you?"],
            "parameters": {"content_type": "str"},
        },
    ]
}

endpoint = "http://localhost:8080/v2/models/conversational_memory/infer"
response_with_params = requests.post(
    endpoint,
    json=inference_request,
)

pprint.pprint(response_with_params.json())
```

```
{'id': 'dccd1dbd-4ec8-4172-a737-fb6480d20bc4',
 'model_name': 'conversational_memory',
 'outputs': [{'data': ['6e1f1413'],
              'datatype': 'BYTES',
              'name': 'memory_id',
              'parameters': {'content_type': 'str'},
              'shape': [1, 1]},
             {'data': ['Hello, how are you?'],
              'datatype': 'BYTES',
              'name': 'content',
              'parameters': {'content_type': 'str'},
              'shape': [1, 1]}],
 'parameters': {}}
```


As you can see, the response contains the `memory_id` and the content. The message you sent 
has now been written to the conversational store. We can now send a second request with the `memory_id` 
alone to see what the memory contains:

```python
# Retrieval request: only the `memory_id` tensor is sent.
inference_request = {
    "inputs": [
        {
            "name": "memory_id",
            "shape": [1],
            "datatype": "BYTES",
            "data": ["6e1f1413"],
            "parameters": {"content_type": "str"},
        }
    ]
}
endpoint = "http://localhost:8080/v2/models/conversational_memory/infer"
response_with_params = requests.post(
    endpoint,
    json=inference_request,
)

pprint.pprint(response_with_params.json())
```

```
{'id': '618a7d1d-1a2b-40d5-9c43-28ab7769f4dc',
 'model_name': 'conversational_memory',
 'outputs': [{'data': ['6e1f1413'],
              'datatype': 'BYTES',
              'name': 'memory_id',
              'parameters': {'content_type': 'str'},
              'shape': [1, 1]},
             {'data': ['Hello, how are you?'],
              'datatype': 'BYTES',
              'name': 'content',
              'parameters': {'content_type': 'str'},
              'shape': [1, 1]}],
 'parameters': {}}
```


As expected, we got back the message we just sent.
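
In practice you will usually want just the stored messages rather than the whole response. A minimal sketch of pulling them out of a response shaped like the ones above follows. The helper name `extract_tensor` is hypothetical, and the sample dictionary is a trimmed copy of the response shown above.

```python
def extract_tensor(response_json, name="content"):
    """Return the `data` list of the output tensor called `name`."""
    for output in response_json.get("outputs", []):
        if output.get("name") == name:
            return output.get("data", [])
    raise KeyError(f"no output tensor named {name!r}")


# Trimmed copy of the response returned by the runtime above.
sample_response = {
    "model_name": "conversational_memory",
    "outputs": [
        {"name": "memory_id", "datatype": "BYTES", "shape": [1, 1],
         "data": ["6e1f1413"]},
        {"name": "content", "datatype": "BYTES", "shape": [1, 1],
         "data": ["Hello, how are you?"]},
    ],
}

print(extract_tensor(sample_response))               # ['Hello, how are you?']
print(extract_tensor(sample_response, "memory_id"))  # ['6e1f1413']
```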

Multiple messages can be sent to the memory runtime at one time. As an example:

```python
inference_request = {
    "inputs": [
        {
            "name": "memory_id",
            "shape": [1],
            "datatype": "BYTES",
            "data": ["5P5pB2Lv"],
            "parameters": {"content_type": "str"},
        },
        {
            "name": "content",
            # The shape reflects the number of messages in `data`.
            "shape": [4],
            "datatype": "BYTES",
            "data": [
                "Hello, how are you?",
                "I'm well, how are you?",
                "Good, i'd like to buy a car.",
                "Really, what kind of car?",
            ],
            "parameters": {"content_type": "str"},
        },
    ],
}

endpoint = "http://localhost:8080/v2/models/conversational_memory/infer"
response_with_params = requests.post(
    endpoint,
    json=inference_request,
)

pprint.pprint(response_with_params.json())
```

```
{'id': 'fac725c0-0195-408e-a8fe-be4ad2b2ad2a',
 'model_name': 'conversational_memory',
 'outputs': [{'data': ['5P5pB2Lv'],
              'datatype': 'BYTES',
              'name': 'memory_id',
              'parameters': {'content_type': 'str'},
              'shape': [1, 1]},
             {'data': ["Good, i'd like to buy a car.",
                       'Really, what kind of car?'],
              'datatype': 'BYTES',
              'name': 'content',
              'parameters': {'content_type': 'str'},
              'shape': [2, 1]}],
 'parameters': {}}
```


The above writes all the messages in the `content` tensor in a single request. This is useful for loading 
conversations from a different store into the current conversational memory. Note that we only received 2 
messages back out of the 4 we sent. This is because we've set the `window_size` parameter in the 
`model-settings.json` to 2. To change this, we can either edit the model settings and restart MLServer, or 
pass a parameter in the inference request that sets the `window_size` value dynamically:

```python
inference_request = {
    "inputs": [
        {
            "name": "memory_id",
            "shape": [1],
            "datatype": "BYTES",
            "data": ["5P5pB2Lv"],
            "parameters": {"content_type": "str"},
        },
    ],
    # Overrides the `window_size` from model-settings.json for this request.
    "parameters": {
        "memory_parameters": {"window_size": 5}
    },
}

endpoint = "http://localhost:8080/v2/models/conversational_memory/infer"
response_with_params = requests.post(
    endpoint,
    json=inference_request,
)

pprint.pprint(response_with_params.json())
```

```
{'id': '2e970796-4f75-4372-9cef-84c2afca854c',
 'model_name': 'conversational_memory',
 'outputs': [{'data': ['5P5pB2Lv'],
              'datatype': 'BYTES',
              'name': 'memory_id',
              'parameters': {'content_type': 'str'},
              'shape': [1, 1]},
             {'data': ['Hello, how are you?',
                       "I'm well, how are you?",
                       "Good, i'd like to buy a car.",
                       'Really, what kind of car?'],
              'datatype': 'BYTES',
              'name': 'content',
              'parameters': {'content_type': 'str'},
              'shape': [4, 1]}],
 'parameters': {}}
```
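
Since retrieval requests differ only in the `memory_id` and the optional `window_size` override, they are easy to build with a small helper. The sketch below mirrors the request structure used above. The function name `build_retrieval_request` and its validation of `window_size` are assumptions made for this example, not behaviour of the runtime.

```python
def build_retrieval_request(memory_id, window_size=None):
    """Build a retrieval request, optionally overriding `window_size`."""
    if window_size is not None and (
        not isinstance(window_size, int) or window_size < 1
    ):
        # Validation chosen for this sketch; the runtime may accept other values.
        raise ValueError("window_size must be a positive integer")

    request = {
        "inputs": [
            {
                "name": "memory_id",
                "shape": [1],
                "datatype": "BYTES",
                "data": [memory_id],
                "parameters": {"content_type": "str"},
            }
        ]
    }
    if window_size is not None:
        request["parameters"] = {
            "memory_parameters": {"window_size": window_size}
        }
    return request


default_request = build_retrieval_request("5P5pB2Lv")
wide_request = build_retrieval_request("5P5pB2Lv", window_size=5)
print("parameters" in default_request)  # False
print(wide_request["parameters"])       # {'memory_parameters': {'window_size': 5}}
```

Either dictionary can be passed as the `json` argument of `requests.post`, exactly as in the requests above.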


The above showcases the memory runtime as a stand-alone server. The runtime itself is intended to be used 
as a component within a Seldon Core V2 pipeline, along with other components such as large language models 
and other LLM-based tools. This [chatbot](/examples/chat_bot/chat_bot.md) example illustrates such a use case.