# Filters
Filters remove unwanted (e.g. irrelevant) chunks based on the chunks' attributes.

For this notebook, we'll use a small example corpus, defined below as `corpus`:

In [None]:
import io
from pyarrow.csv import read_csv
from retrievall.ocr import corpus_from_tesseract_table

csv_str = (
    "level,page_num,block_num,par_num,line_num,word_num,left,top,width,height,conf,text\n"
    "1,1,0,0,0,0,0,0,300,400,-1,\n"
    "2,1,1,0,0,0,20,20,110,90,-1,\n"
    "3,1,1,1,0,0,20,20,180,30,-1,\n"
    "4,1,1,1,1,0,20,20,110,10,-1,\n"
    "5,1,1,1,1,1,20,20,30,10,96.063751,The\n"
    "5,1,1,1,1,2,60,20,50,10,95.965691,(quick)\n"
    "4,1,1,1,2,0,20,40,200,10,-1,\n"
    "5,1,1,1,2,1,20,40,70,10,95.835831,[brown]\n"
    "5,1,1,1,2,2,100,40,30,10,94.899742,fox\n"
    "5,1,1,1,2,3,140,40,60,10,96.683357,jumps!\n"
    "3,1,1,2,0,0,20,80,90,30,-1,\n"
    "4,1,1,2,1,0,20,80,80,10,-1,\n"
    "5,1,1,2,1,1,20,80,40,10,96.912064,Over\n"
    "5,1,1,2,1,2,40,80,30,10,96.887390,the\n"
    "4,1,1,2,2,0,20,100,100,10,-1,\n"
    "5,1,1,2,2,1,20,100,60,10,90.893219,<lazy>\n"
    "5,1,1,2,2,2,90,100,30,10,96.538940,dog\n"
    "1,2,0,0,0,0,0,0,300,400,-1,\n"
    "2,2,1,0,0,0,20,20,110,90,-1,\n"
    "3,2,1,1,0,0,20,20,180,30,-1,\n"
    "4,2,1,1,1,0,20,20,110,10,-1,\n"
    "5,2,1,1,1,1,20,20,30,10,96.063751,The\n"
    "5,2,1,1,1,2,60,20,50,10,95.965691,~groovy\n"
    "4,2,1,1,2,0,20,40,200,10,-1,\n"
    "5,2,1,1,2,1,20,40,70,10,95.835831,minute!\n"
    "5,2,1,1,2,2,100,40,30,10,94.899742,dog\n"
    "5,2,1,1,2,3,140,40,60,10,96.683357,bounds\n"
    "3,2,1,2,0,0,20,80,90,30,-1,\n"
    "4,2,1,2,1,0,20,80,80,10,-1,\n"
    "5,2,1,2,1,1,20,80,40,10,96.912064,UPON\n"
    "5,2,1,2,1,2,40,80,30,10,96.887390,the\n"
    "4,2,1,2,2,0,20,100,100,10,-1,\n"
    "5,2,1,2,2,1,20,100,60,10,90.893219,sleepy\n"
    "5,2,1,2,2,2,90,100,30,10,96.538940,fox\n"
)

corpus = corpus_from_tesseract_table(
    read_csv(io.BytesIO(csv_str.encode())), document_id="123"
)

corpus

<retrievall.core.Corpus at 0x10a4a2050>

## `ChunkFilter`s

Chunks are filtered with `ChunkFilter`s.

`ChunkFilter`s are user-definable classes with a `__call__` method that accepts a `Chunks` object and returns a `Chunks` object. Generally, the output is identical to the input except that fewer chunks are defined in its internal `chunks` table.

If a `ChunkFilter` acts on a specific attribute, that attribute must be passed to `__init__`, since `__call__` accepts nothing other than the input `Chunks` object.

In [None]:
from retrievall.core import ChunkFilter, Chunks


class MyFilter(ChunkFilter):
    """
    Filters out all but the first `n` chunks that happen to be
    in the input.
    """

    def __init__(self, n: int):
        self.n = n

    def __call__(self, chunks: Chunks) -> Chunks:
        return Chunks(
            corpus=chunks.corpus,
            # Just grab whatever's at the top of the table
            chunks=chunks.chunks[: self.n],
            chunk_atoms=chunks.chunk_atoms,
        )
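
`MyFilter` does not look at any attribute. For a filter that does, the attribute name is passed through `__init__` in the same way as `n` above. The sketch below illustrates only that pattern with plain Python stand-ins: `Row`, `MinValue`, and the `conf` attribute are hypothetical and not part of `retrievall`, and a list of dicts stands in for the `chunks` table.

```python
from typing import Any

# Hypothetical stand-in for a chunk table: one dict of attributes per chunk.
Row = dict[str, Any]


class MinValue:
    """Keeps rows whose `attr` value is at least `minimum` (illustrative only)."""

    def __init__(self, attr: str, minimum: float):
        # Everything the filter needs besides the input itself is stored here.
        self.attr = attr
        self.minimum = minimum

    def __call__(self, rows: list[Row]) -> list[Row]:
        # The call takes only the input; configuration comes from `self`.
        return [row for row in rows if row[self.attr] >= self.minimum]


rows = [
    {"text": "fox", "conf": 94.9},
    {"text": "<lazy>", "conf": 90.9},
    {"text": "dog", "conf": 96.5},
]

kept = MinValue("conf", 94.0)(rows)
print([row["text"] for row in kept])  # ['fox', 'dog']
```

Because the configuration lives on the instance, the same `MinValue("conf", 94.0)` object can be reused on any number of inputs.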

Like chunk and attribute expressions, `ChunkFilter`s can be used in a few different ways. The idiomatic way is to pass one to the `Chunks.filter` method:

In [None]:
len(corpus.chunk("line").filter(MyFilter(3)))

3

A filter can also be called directly on a `Chunks` object:

In [None]:
# Separate instantiation and calling:
filt = MyFilter(3)

print(len(filt(corpus.chunk("line"))))

# One-liner:
print(len(MyFilter(4)(corpus.chunk("line"))))

3
4


## Existing filters
While custom `ChunkFilter`s can be defined, there are three built-in filters that likely cover the vast majority of retrieval use cases:
* `TopK`: Select the `k` chunks that have the highest value for a given `attr`. An optional keyword argument flips this around to select the *lowest*-valued chunks instead.
* `Threshold`: Select all chunks whose attribute value is greater than or less than a given threshold. (The comparison can optionally be inclusive or exclusive.)
* `EqualTo`: Select all chunks whose attribute value equals any value in a collection of input values. (The collection can be one item long, which makes this a traditional "is equal" filter rather than an "is in" filter.)

More built-in filters may be added in the future.
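
The selection logic described for the three built-in filters can be sketched in plain Python. This is not the `retrievall` implementation: the function names and parameters (`top_k`, `threshold`, `equal_to`, `lowest`, `above`, `inclusive`) are assumptions made for illustration, and a list of dicts stands in for the `chunks` table.

```python
import operator
from collections.abc import Collection
from typing import Any

Row = dict[str, Any]  # hypothetical stand-in for one chunk's attributes


def top_k(rows: list[Row], attr: str, k: int, lowest: bool = False) -> list[Row]:
    """Keep the k rows with the highest (or, if `lowest`, the lowest) `attr`."""
    # Note: the result is in ranked order, not in the original row order.
    ranked = sorted(rows, key=lambda row: row[attr], reverse=not lowest)
    return ranked[:k]


def threshold(
    rows: list[Row], attr: str, value: float, above: bool = True, inclusive: bool = False
) -> list[Row]:
    """Keep rows whose `attr` lies above (or below) `value`."""
    if above:
        compare = operator.ge if inclusive else operator.gt
    else:
        compare = operator.le if inclusive else operator.lt
    return [row for row in rows if compare(row[attr], value)]


def equal_to(rows: list[Row], attr: str, values: Collection[Any]) -> list[Row]:
    """Keep rows whose `attr` equals any of `values` (an "is in" test)."""
    return [row for row in rows if row[attr] in values]


rows = [
    {"text": "The", "conf": 96.06},
    {"text": "fox", "conf": 94.90},
    {"text": "<lazy>", "conf": 90.89},
    {"text": "dog", "conf": 96.54},
]

print([r["text"] for r in top_k(rows, "conf", 2)])                # ['dog', 'The']
print([r["text"] for r in threshold(rows, "conf", 95.0)])         # ['The', 'dog']
print([r["text"] for r in equal_to(rows, "text", {"fox", "dog"})])  # ['fox', 'dog']
```

Passing a one-element collection such as `{"fox"}` to `equal_to` gives the plain "is equal" behavior mentioned in the list above.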