# Overview

📚 Useful Documentation: [Metadata Filtering](https://docs.haystack.deepset.ai/v2.0/docs/metadata-filtering)

Although new retrieval techniques are great, sometimes you already know that you want to search only a specific group of documents in your document store: for example, all documents related to a specific user, or all documents published after a certain date. Metadata filtering is useful in these situations. In this tutorial, we create a few simple documents containing information about Haystack, where the metadata records which version of Haystack the information relates to. We then use metadata filtering to make sure we answer the question based only on information about Haystack 2.0.

## Enabling Telemetry

Knowing you’re using this tutorial helps us decide where to invest our efforts to build a better product, but you can always opt out by commenting out the following line. See Telemetry for more details.

In [1]:
from haystack.telemetry import tutorial_running

tutorial_running(31)



## Metadata Filtering

Since the topic has come up, here is a quick review.

When you index Documents into your Document Store, you can attach metadata to them.
One example is the `DocumentLanguageClassifier`, which adds the language of the Document's content to its metadata.
Components like `MetadataRouter` can then route Documents based on their metadata.
Additionally, you can apply filters to queries used with Retrievers to limit the scope of your search based on this metadata and ensure that your Answers come from a specific slice of your data.

### Filters

To illustrate how filters work, imagine you have a set of annual reports from various companies.
You may want to perform a search on just a specific year and just on a small selection of companies.
This can reduce the workload of the Retriever and also ensure that you get more relevant results.

Filters are applied via the `filters` argument of the `Retriever` class.
When working with a Pipeline, the filter can be given to `Pipeline.run()`, which will then route it to the `Retriever` class (see Pipelines docs on how to work with a Pipeline).

For example, you can supply a filter in the form of a nested dictionary where `field` is set to a Document metadata field, `operator` is set to a comparison such as `==` or `in`, and `value` is a single accepted value or a list of accepted values.
In the example below, the filter ensures that any returned Document has a value of `2019` in the `years` metadata field and either `BMW` or `Mercedes` in the `companies` metadata field.

In [1]:
data = {"retrieval": 
        {
            "query": "Why did the revenue increase?",
            "filters": {"operator": "AND",
                        "conditions": [
                            {"field": "meta.years", "operator": "==", "value": "2019"},
                            {"field": "meta.companies", "operator": "in", "value": ["BMW", "Mercedes"]}]
                       }
        }
       }
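# `pipeline` is only built later in this tutorial, so running this cell at this point raises the NameError shown below.
pipeline.run(data=data)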

NameError: name 'pipeline' is not defined
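
If you build this kind of filter in several places, a small helper can keep the nested structure consistent. The sketch below is only an illustration: `build_report_filter` is a hypothetical name, not part of Haystack, and the `meta.years` and `meta.companies` fields are taken from the annual-report example above.

```python
from typing import Any, Dict, List


def build_report_filter(year: str, companies: List[str]) -> Dict[str, Any]:
    """Hypothetical helper (not a Haystack API): build the AND filter from the example above."""
    return {
        "operator": "AND",
        "conditions": [
            # Comparison dictionary: the year must match exactly.
            {"field": "meta.years", "operator": "==", "value": year},
            # Comparison dictionary: the company must be one of the accepted values.
            {"field": "meta.companies", "operator": "in", "value": list(companies)},
        ],
    }


report_filter = build_report_filter("2019", ["BMW", "Mercedes"])

# Same payload shape as the cell above: {component name: {input name: value}}.
data = {"retrieval": {"query": "Why did the revenue increase?", "filters": report_filter}}
```

The resulting `data` dictionary can be passed to `pipeline.run(data=data)` once a pipeline with a component named `retrieval` exists.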

### Filtering Logic

Technically speaking, filters are defined as nested dictionaries that can be of two types: Comparison or Logic.

#### Comparison

Comparison dictionaries must contain the following keys:
- field
- operator
- value

The `field` value in Comparison dictionaries **must be the name of one of the meta fields** of a document, such as `meta.years`.

The `operator` value in Comparison dictionaries must be one of the following:
- ==
- !=
- \>
- \>=
- <
- <=
- in
- not in

The `value` key takes a single value or, in the case of `in` and `not in`, a list of values.
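
To make the Comparison rules concrete, here is a minimal plain-Python sketch that checks one Comparison dictionary against a metadata dictionary. It is not Haystack's implementation: `matches_comparison` is a hypothetical name, and treating a missing metadata field as "no match" is an assumption of this sketch.

```python
import operator
from typing import Any, Dict

# Map each operator string listed above to a Python callable.
COMPARATORS = {
    "==": operator.eq,
    "!=": operator.ne,
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
    "in": lambda actual, allowed: actual in allowed,
    "not in": lambda actual, allowed: actual not in allowed,
}


def matches_comparison(meta: Dict[str, Any], condition: Dict[str, Any]) -> bool:
    """Hypothetical helper: evaluate one Comparison dictionary against a plain meta dict."""
    missing = {"field", "operator", "value"} - set(condition)
    if missing:
        raise ValueError(f"Comparison is missing keys: {sorted(missing)}")
    if condition["operator"] not in COMPARATORS:
        raise ValueError(f"Unknown operator: {condition['operator']!r}")

    # Strip the "meta." prefix to get the key inside the metadata dict.
    field = condition["field"]
    key = field[len("meta."):] if field.startswith("meta.") else field

    # Assumption of this sketch: a document without the field never matches.
    if key not in meta:
        return False
    return COMPARATORS[condition["operator"]](meta[key], condition["value"])


meta = {"years": "2019", "companies": "BMW"}
print(matches_comparison(meta, {"field": "meta.years", "operator": "==", "value": "2019"}))
print(matches_comparison(meta, {"field": "meta.companies", "operator": "in", "value": ["BMW", "Mercedes"]}))
```

Both calls print `True` for this metadata dictionary, because the year matches exactly and `BMW` is in the list of accepted companies.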

#### Logic

Logic dictionaries must contain the following keys:
- operator
- conditions

The `conditions` key must be a list of dictionaries, either of type Comparison or Logic.

The `operator` values in Logic dictionaries must be one of the following:
- NOT
- OR
- AND

In the Haystack code base, the filtering logic is defined in the DocumentStore protocol.
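
Because `conditions` may itself contain Logic dictionaries, filters form a tree that is naturally evaluated recursively. The following plain-Python sketch shows the idea under stated assumptions: `matches` is a hypothetical function, not Haystack's code; `NOT` is interpreted here as "not all conditions hold"; and a missing metadata field counts as no match. The sample metadata mirrors the versions and dates used later in this tutorial.

```python
import operator
from datetime import datetime
from typing import Any, Dict

OPS = {
    "==": operator.eq, "!=": operator.ne,
    ">": operator.gt, ">=": operator.ge,
    "<": operator.lt, "<=": operator.le,
    "in": lambda actual, allowed: actual in allowed,
    "not in": lambda actual, allowed: actual not in allowed,
}


def matches(meta: Dict[str, Any], flt: Dict[str, Any]) -> bool:
    """Hypothetical evaluator: recursively apply a Comparison or Logic dictionary."""
    if "conditions" in flt:
        # Logic dictionary: evaluate every child, then combine the results.
        results = [matches(meta, child) for child in flt["conditions"]]
        if flt["operator"] == "AND":
            return all(results)
        if flt["operator"] == "OR":
            return any(results)
        if flt["operator"] == "NOT":
            return not all(results)  # assumption: NOT negates the AND of its conditions
        raise ValueError(f"Unknown logic operator: {flt['operator']!r}")

    # Comparison dictionary: look up the field without its "meta." prefix.
    field = flt["field"]
    key = field[len("meta."):] if field.startswith("meta.") else field
    return key in meta and OPS[flt["operator"]](meta[key], flt["value"])


metas = [
    {"version": 1.15, "date": datetime(2023, 3, 30)},
    {"version": 1.22, "date": datetime(2023, 11, 7)},
    {"version": 2.0, "date": datetime(2023, 12, 4)},
]
flt = {
    "operator": "AND",
    "conditions": [
        {"field": "meta.version", "operator": ">", "value": 1.21},
        {"field": "meta.date", "operator": ">", "value": datetime(2023, 11, 7)},
    ],
}
print([m["version"] for m in metas if matches(m, flt)])  # [2.0]
```

Swapping `"AND"` for `"OR"` in `flt` would select both `1.22` and `2.0`, since each of them satisfies at least one condition.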

## Preparing Documents

First, let’s prepare some documents. 
Below, we’re manually creating 3 simple documents with `meta` attached.
We’re then writing these documents to an `InMemoryDocumentStore`, but you can use any of the available document stores instead, such as OpenSearch, Chroma, Pinecone, and more. (Note that not all of them can store documents in memory, and some may require extra setup.)

> ⭐️ For more information on how to write documents into different document stores, you can follow our tutorial on indexing different file types.

In [3]:
from datetime import datetime

from haystack import Document  # text, metadata, etc
from haystack.document_stores.in_memory import InMemoryDocumentStore  # DocumentStore: Object that stores Documents
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever  # Retrieval: Search Document in DocumentStore

documents = [Document(content="Use pip to install a basic version of Haystack's latest release: pip install farm-haystack. All the core Haystack components live in the haystack repo. But there's also the haystack-extras repo which contains components that are not as widely used, and you need to install them separately.",
                      meta={"version": 1.15, "date": datetime(2023, 3, 30)}),
             Document(content="Use pip to install a basic version of Haystack's latest release: pip install farm-haystack[inference]. All the core Haystack components live in the haystack repo. But there's also the haystack-extras repo which contains components that are not as widely used, and you need to install them separately.",
                      meta={"version": 1.22, "date": datetime(2023, 11, 7)}),
             Document(content="Use pip to install only the Haystack 2.0 code: pip install haystack-ai. The haystack-ai package is built on the main branch which is an unstable beta version, but it's useful if you want to try the new features as soon as they are merged.",
                      meta={"version": 2.0, "date": datetime(2023, 12, 4)}),
]
document_store = InMemoryDocumentStore(bm25_algorithm="BM25Plus")
document_store.write_documents(documents=documents)

3

## Building a Document Search Pipeline

As an example, we build a simple document search pipeline below that contains only a retriever. You can extend this pipeline to do more, such as generating answers to questions.

In [9]:
from haystack import Pipeline

pipeline = Pipeline()
pipeline.add_component(instance=InMemoryBM25Retriever(document_store=document_store), name="retrieval")

## Do Metadata Filtering

Now let's apply the filter `version` > 1.21 to the documents and ask a question.

See the sections above for the kinds of filtering that are possible.

In [10]:
query = "Haystack installation"
data = {
    "retrieval": {
        "query": query,
        "filters": {"field": "meta.version", "operator": ">", "value": 1.21},
    }
}
pipeline.run(data=data)

Ranking by BM25...: 100%|██████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 20919.22 docs/s]


{'retrieval': {'documents': [Document(id=b53625c67fee5ba5ac6dc86e7ca0adff567bf8376e86ae4b3fc6f6f858ccf1e5, content: 'Use pip to install a basic version of Haystack's latest release: pip install farm-haystack[inference...', meta: {'version': 1.22, 'date': datetime.datetime(2023, 11, 7, 0, 0)}, score: 1.1808764808011376),
   Document(id=8ac1f8119bdec5c898d5a5c69f49ff47f64056bce1a0f95073e34493bbaf9354, content: 'Use pip to install only the Haystack 2.0 code: pip install haystack-ai. The haystack-ai package is b...', meta: {'version': 2.0, 'date': datetime.datetime(2023, 12, 4, 0, 0)}, score: 1.0867343954443756)]}}

As a final step, let's add a filter with a logical operator.
This time, we require version > 1.21 AND a date after November 7, 2023.

In [12]:
query = "Haystack installation"
data = {
    "retrieval": {
        "query": query,
        "filters": {
            "operator": "AND",
            "conditions": [
                {"field": "meta.version", "operator": ">", "value": 1.21},
                {"field": "meta.date", "operator": ">", "value": datetime(2023, 11, 7)},
            ],
        },
    }
}
pipeline.run(data=data)

Ranking by BM25...: 100%|██████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 10866.07 docs/s]


{'retrieval': {'documents': [Document(id=8ac1f8119bdec5c898d5a5c69f49ff47f64056bce1a0f95073e34493bbaf9354, content: 'Use pip to install only the Haystack 2.0 code: pip install haystack-ai. The haystack-ai package is b...', meta: {'version': 2.0, 'date': datetime.datetime(2023, 12, 4, 0, 0)}, score: 1.8483924814931876)]}}
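
The 1.22 document passes the version condition but is missing from this result. Its date is exactly November 7, 2023, and `>` is a strict comparison. The plain-Python sketch below illustrates only the date comparison (it does not call Haystack); `release_dates` is a hypothetical dictionary that copies the dates from the documents above.

```python
from datetime import datetime

cutoff = datetime(2023, 11, 7)

# Version -> date, copied from the three documents in this tutorial.
release_dates = {
    1.15: datetime(2023, 3, 30),
    1.22: datetime(2023, 11, 7),
    2.0: datetime(2023, 12, 4),
}

# ">" excludes a date equal to the cutoff; ">=" includes it.
strictly_after = [version for version, date in release_dates.items() if date > cutoff]
on_or_after = [version for version, date in release_dates.items() if date >= cutoff]

print(strictly_after)  # [2.0]
print(on_or_after)     # [1.22, 2.0]
```

If you want documents dated on the cutoff day to be included as well, use `>=` as the operator in the date condition.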

## Bonus: Adding Custom Retrieval