# Filter by Tags in Zilliz Cloud Pipelines

> (Note) Zilliz Cloud Pipelines is about to be deprecated. Please stay tuned for detailed instructions on alternative solutions.

In the previous [notebook](./zilliz_pipeline_rag.ipynb), we learned the basics of Zilliz Cloud Pipelines. In this notebook, we show an example of filtering retrieval results by tags. The Pipelines operations are wrapped in a helper class to simplify the code.

## Setup
### Prerequisites
Please make sure you have a Serverless cluster in Zilliz Cloud. If you don't have one yet, you can [sign up for free](https://cloud.zilliz.com/signup?utm_source=referral&utm_medium=partner&utm_campaign=2023-12-22_github-docs_pipeline-filter-notebook_github).

To learn how to create a Serverless cluster and get your CLOUD_REGION, CLUSTER_ID, API_KEY and PROJECT_ID, please refer to this [page](https://docs.zilliz.com/docs/create-cluster).

In [10]:
import os

CLOUD_REGION = 'gcp-us-west1'
CLUSTER_ID = 'your CLUSTER_ID'
API_KEY = 'your API_KEY'
PROJECT_ID = 'your PROJECT_ID'
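
As an alternative to hardcoding credentials in the notebook, the four values could be read from environment variables. The sketch below is only an illustration: the helper `load_zilliz_config` and the `ZILLIZ_*` variable names are assumptions made for this example, not names defined by Zilliz Cloud.

```python
import os


def load_zilliz_config(env=os.environ):
    """Read the four connection settings from environment variables.

    The ZILLIZ_* variable names are hypothetical, chosen for this sketch.
    """
    keys = ["CLOUD_REGION", "CLUSTER_ID", "API_KEY", "PROJECT_ID"]
    # Collect every missing variable so the error names all of them at once.
    missing = ["ZILLIZ_" + k for k in keys if not env.get("ZILLIZ_" + k)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {k: env["ZILLIZ_" + k] for k in keys}
```

With the variables exported in your shell, `config = load_zilliz_config()` returns a dictionary whose values can be assigned to `CLOUD_REGION`, `CLUSTER_ID`, `API_KEY` and `PROJECT_ID`.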

### Create an ingestion pipeline
[Ingestion pipelines](https://docs.zilliz.com/docs/understanding-pipelines#ingestion-pipelines) can transform unstructured data into searchable vector embeddings and store them in Zilliz Cloud Vector Database.

In the following example, we create an Ingestion pipeline named `my_ingestion_pipeline`. As part of creating the Ingestion pipeline, a vector database collection named `my_rag_collection` is created in the cluster. It contains five fields:
- `doc_name`, `chunk_id`, `chunk_text`, `embedding` as defined by `INDEX_DOC` function
- `version` as defined by `PRESERVE` function

In [11]:
from pipeline_utils import IngestionPipeline

collection_name = 'my_rag_collection'

functions = [
    {
        "name": "index_my_doc",
        "action": "INDEX_DOC",
        "inputField": "doc_url",
        "language": "ENGLISH",
        "chunkSize": 500,
    },
    {
        "name": "keep_doc_info",
        "action": "PRESERVE",
        "inputField": "version",
        "outputField": "version",
        "fieldType": "VarChar"
    }
]

ingestion_pipeline = IngestionPipeline(cloud_region=CLOUD_REGION,
                                       cluster_id=CLUSTER_ID,
                                       api_key=API_KEY,
                                       project_id=PROJECT_ID,
                                       collection_name=collection_name,
                                       pipeline_name='my_ingestion_pipeline',
                                       functions=functions)

If you run this code and get the error "This collection already exists", it means you have created this collection before. You can change the `collection_name` or delete the collection manually.

### Create a search pipeline
[Search pipelines](https://docs.zilliz.com/docs/understanding-pipelines#search-pipelines) enable semantic search by converting a query string into a vector embedding and then retrieving the top-K nearest neighbor vectors along with their doc chunks.

In the following example we create a Search pipeline named `my_search_pipeline`. It searches the collection created by the Ingestion pipeline above.



In [12]:
from pipeline_utils import SearchPipeline

functions = [
    {
        "name": "search_chunk_text",
        "action": "SEARCH_DOC_CHUNK",
        "inputField": "query_text",
        "clusterId": CLUSTER_ID,
        "collectionName": collection_name
    }
]

search_pipeline = SearchPipeline(
    cloud_region=CLOUD_REGION,
    api_key=API_KEY,
    project_id=PROJECT_ID,
    pipeline_name='my_search_pipeline',
    functions=functions,
)

### Run ingestion pipeline

An Ingestion pipeline accepts files from an object storage service such as [AWS S3](https://docs.aws.amazon.com/AmazonS3/latest/userguide/upload-objects.html) or [Google Cloud Storage (GCS)](https://cloud.google.com/storage/docs/uploads-downloads).

We use two versions of the same document from Milvus (an open-source vector database): one from [Milvus 2.3](https://milvus.io/docs/delete_data.md) and one from [Milvus 2.2](https://milvus.io/docs/v2.2.x/delete_data.md). Both are stored on Google Cloud Storage. We attach the version information to each document by passing it as a keyword argument to the `run()` method.


In [13]:
gcs_url_23 = 'https://publicdataset.zillizcloud.com/milvus_doc.md'  # The latest milvus 2.3 version documentation
gcs_url_22 = 'https://publicdataset.zillizcloud.com/milvus_doc_22.md'  # Milvus 2.2 version documentation

ingestion_pipeline.run(gcs_url=gcs_url_22, version='2.2')
ingestion_pipeline.run(gcs_url=gcs_url_23, version='2.3')

<Response [200]>

We have now ingested both documents: each one was split into doc chunks, and the generated embeddings were uploaded into the vector database collection together with the `version` tag. If you want to inspect the data in the collection, you can use the Data Preview tool in the [Zilliz Cloud web UI](https://cloud.zilliz.com).

## Build RAG application with Search pipeline

### Run search pipeline
We can use the `run()` method to run a search pipeline. It takes a question as input and returns the top-k most relevant knowledge fragments.
To filter by tag, we also pass `other_output_fields=['version']` so that the `version` field is included in the results, and a filter condition, here `version == "2.2"`.
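
The filter is a plain string, so it can be built programmatically. Below is a minimal sketch of a helper that produces the equality expression used in this notebook. The function name `version_filter` is hypothetical, and rejecting quotes and backslashes is a conservative choice made for this example, not a documented requirement of the filter syntax.

```python
def version_filter(version: str, field: str = "version") -> str:
    """Build a filter expression such as: version == "2.2"."""
    # Reject characters that could terminate the quoted value early
    # and change the meaning of the expression.
    if '"' in version or "\\" in version:
        raise ValueError(f"Unsupported character in tag value: {version!r}")
    return f'{field} == "{version}"'


print(version_filter("2.2"))  # version == "2.2"
```

The returned string can be passed as the `filter` argument of `search_pipeline.run()`.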

In [14]:
question = 'Can user delete milvus entities through non-primary key filtering?'
search_pipeline.run(question=question, top_k=2, other_output_fields=['version'], filter='version == "2.2"', )

[{'chunk_text': '# Delete Entities\nThis topic describes how to delete entities in Milvus.  \nMilvus supports deleting entities by primary key filtered with boolean expression.  \nDeleted entities can still be retrieved immediately after the deletion if the consistency level is set lower than Strong.\nEntities deleted beyond the pre-specified span of time for Time Travel cannot be retrieved again.\nFrequent deletion operations will impact the system performance.',
  'version': '2.2'},
 {'chunk_text': '# Delete Entities\n## Prepare boolean expression\nPrepare the boolean expression that filters the entities to delete.  \nMilvus only supports deleting entities with clearly specified primary keys, which can be achieved merely with the term expression in. Other operators can be used only in query or scalar filtering in vector search. See Boolean Expression Rules for more information.  \nThe following example filters data with primary key values of 0 and 1.  \n```python\nexpr = "book_id in 

Let's try changing the filter condition to `version == "2.3"`.

In [15]:
search_pipeline.run(question=question, top_k=2, other_output_fields=['version'], filter='version == "2.3"', )

[{'chunk_text': '# Delete Entities\nThis topic describes how to delete entities in Milvus.  \nMilvus supports deleting entities by primary key or complex boolean expressions. Deleting entities by primary key is much faster and lighter than deleting them by complex boolean expressions. This is because Milvus executes queries first when deleting data by complex boolean expressions.  \nDeleted entities can still be retrieved immediately after the deletion if the consistency level is set lower than Strong.\nEntities deleted beyond the pre-specified span of time for Time Travel cannot be retrieved again.\nFrequent deletion operations will impact the system performance.  \nBefore deleting entities by comlpex boolean expressions, make sure the collection has been loaded.\nDeleting entities by complex boolean expressions is not an atomic operation. Therefore, if it fails halfway through, some data may still be deleted.\nDeleting entities by complex boolean expressions is supported only when th

We can see that, for the same question, the search run returns the top-k knowledge fragments from the requested version only. This retrieval step is the foundation of a RAG application.
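
The results are a list of dictionaries with `chunk_text` and `version` keys, as shown in the outputs above. Before handing them to an LLM, it can help to turn them into a readable context string. The helper below is a hypothetical sketch (`format_context` and `max_chars` are names introduced here); the chatbot later in this notebook passes the raw list into the prompt instead, which also works.

```python
def format_context(results, max_chars=2000):
    """Join retrieved chunks into one numbered context string."""
    parts = []
    for i, hit in enumerate(results, start=1):
        text = hit.get("chunk_text", "").strip()
        version = hit.get("version", "unknown")
        # Label each chunk with its rank and version tag.
        parts.append(f"[{i}] (Milvus {version})\n{text}")
    # Truncate by character count to keep the prompt bounded.
    return "\n\n".join(parts)[:max_chars]


sample = [
    {"chunk_text": "# Delete Entities\nFirst chunk.", "version": "2.2"},
    {"chunk_text": "Second chunk.", "version": "2.2"},
]
print(format_context(sample))
```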

### Build a chatbot powered by RAG 
Below, we show a simple RAG app that answers questions based on the knowledge we ingested previously. It uses OpenAI `gpt-3.5-turbo` as the LLM with a simple prompt. To test it, set the `OPENAI_API_KEY` environment variable to your own OpenAI API key.

In [16]:
from openai import OpenAI

# Pass the key at construction time; it is read from the OPENAI_API_KEY environment variable.
client = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))


class MilvusDocChatbot:
    def __init__(self, search_pipeline):
        self.search_pipeline = search_pipeline

    def retrieve(self, query: str, milvus_version: str) -> list:
        """
        Retrieve relevant text with Zilliz Cloud Pipelines.
        """
        results = self.search_pipeline.run(question=query, top_k=2, other_output_fields=['version'], filter=f'version == "{milvus_version}"', )
        return results

    def generate_answer(self, query: str, context_str: list) -> str:
        """
        Generate answer based on context, which is from the result of Search pipeline run.
        """
        completion = client.chat.completions.create(
            model="gpt-3.5-turbo",
            temperature=0,
            messages=
            [
                {"role": "user",
                 "content":
                     f"We have provided context information below. \n"
                     f"---------------------\n"
                     f"{context_str}"
                     f"\n---------------------\n"
                     f"Given this information, please answer the question: {query}"
                 }
            ]
        ).choices[0].message.content
        return completion

    def chat_with_rag(self, query: str, milvus_version: str) -> str:
        context_str = self.retrieve(query, milvus_version=milvus_version)
        completion = self.generate_answer(query, context_str)
        return completion



chatbot = MilvusDocChatbot(search_pipeline)

This implements a RAG chatbot. It uses the Search pipeline to retrieve the most relevant chunks from the ingested documents, restricted to the requested Milvus version, and passes them to the LLM as context to improve answer quality. Let's see how it works in action!

### Chat with RAG

In [17]:
question = 'Can user delete milvus entities through non-primary key filtering?'
chatbot.chat_with_rag(question, milvus_version='2.2')

'No, according to the provided information, Milvus only supports deleting entities by primary key filtered with a boolean expression. Other operators can be used only in query or scalar filtering in vector search.'

The ground truth content in the original knowledge text is:
> **Milvus supports deleting entities by primary key filtered with boolean expression.**

In [18]:
chatbot.chat_with_rag(question, milvus_version='2.3')

'Yes, users can delete Milvus entities through non-primary key filtering by using complex boolean expressions.'

The ground truth content in the original knowledge text is:
> **Milvus supports deleting entities by primary key or complex boolean expressions**. Deleting entities by primary key is much faster and lighter than deleting them by complex boolean expressions. This is because Milvus executes queries first when deleting data by complex boolean expressions.


Indeed, Milvus 2.3 has enhanced the [Delete Entities](https://milvus.io/docs/delete_data.md) function: as of version 2.3, deleting by complex boolean expressions is supported. By filtering on the Milvus version tag, the same RAG application can answer from different knowledge sources.
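
To make the effect of the tag filter concrete, here is a toy in-memory sketch that restricts candidates by tag and then ranks the remainder by cosine similarity. It is purely conceptual and does not describe how Zilliz Cloud executes a filtered search. The records and their two-dimensional embeddings are made up for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def filtered_search(records, query_vec, version, top_k=2):
    """Keep only records with the given version tag, then rank by similarity."""
    candidates = [r for r in records if r["version"] == version]
    ranked = sorted(candidates,
                    key=lambda r: cosine(r["embedding"], query_vec),
                    reverse=True)
    return ranked[:top_k]


# Made-up records; real embeddings have hundreds of dimensions.
records = [
    {"chunk_text": "delete by primary key only", "version": "2.2", "embedding": [1.0, 0.0]},
    {"chunk_text": "unrelated chunk", "version": "2.2", "embedding": [0.0, 1.0]},
    {"chunk_text": "delete by primary key or boolean expression", "version": "2.3", "embedding": [0.9, 0.1]},
]

for version in ("2.2", "2.3"):
    hits = filtered_search(records, [1.0, 0.0], version=version)
    print(version, [h["chunk_text"] for h in hits])
```

The same query vector yields different top results depending on the tag, which mirrors why the chatbot gives different answers for versions 2.2 and 2.3.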

That's how to use Zilliz Cloud Pipelines to build RAG applications. To learn more, you can refer to https://docs.zilliz.com/docs/pipelines for detailed information.

If you have any questions, feel free to contact us at support@zilliz.com.