# Chunks

In `retrievall`, "chunks" are collections of arbitrary pieces of documents. A chunk can be as large as a whole document or even a full [corpus](01_corpus.ipynb), or as small as a few words or a single token. Chunks are also not *required* to be contiguous.

Broadly speaking, the retrieval process is all about 1) divvying documents up into chunks, and 2) selecting which chunks you want to actually use (usually based on some kind of metric or relevancy score).

For this notebook, we'll use a small example corpus, defined below as `corpus`:

In [None]:
import io
from pyarrow.csv import read_csv
from retrievall.ocr import corpus_from_tesseract_table

csv_str = (
    "level,page_num,block_num,par_num,line_num,word_num,left,top,width,height,conf,text\n"
    "1,1,0,0,0,0,0,0,300,400,-1,\n"
    "2,1,1,0,0,0,20,20,110,90,-1,\n"
    "3,1,1,1,0,0,20,20,180,30,-1,\n"
    "4,1,1,1,1,0,20,20,110,10,-1,\n"
    "5,1,1,1,1,1,20,20,30,10,96.063751,The\n"
    "5,1,1,1,1,2,60,20,50,10,95.965691,(quick)\n"
    "4,1,1,1,2,0,20,40,200,10,-1,\n"
    "5,1,1,1,2,1,20,40,70,10,95.835831,[brown]\n"
    "5,1,1,1,2,2,100,40,30,10,94.899742,fox\n"
    "5,1,1,1,2,3,140,40,60,10,96.683357,jumps!\n"
    "3,1,1,2,0,0,20,80,90,30,-1,\n"
    "4,1,1,2,1,0,20,80,80,10,-1,\n"
    "5,1,1,2,1,1,20,80,40,10,96.912064,Over\n"
    "5,1,1,2,1,2,40,80,30,10,96.887390,the\n"
    "4,1,1,2,2,0,20,100,100,10,-1,\n"
    "5,1,1,2,2,1,20,100,60,10,90.893219,<lazy>\n"
    "5,1,1,2,2,2,90,100,30,10,96.538940,dog\n"
    "1,2,0,0,0,0,0,0,300,400,-1,\n"
    "2,2,1,0,0,0,20,20,110,90,-1,\n"
    "3,2,1,1,0,0,20,20,180,30,-1,\n"
    "4,2,1,1,1,0,20,20,110,10,-1,\n"
    "5,2,1,1,1,1,20,20,30,10,96.063751,The\n"
    "5,2,1,1,1,2,60,20,50,10,95.965691,~groovy\n"
    "4,2,1,1,2,0,20,40,200,10,-1,\n"
    "5,2,1,1,2,1,20,40,70,10,95.835831,minute!\n"
    "5,2,1,1,2,2,100,40,30,10,94.899742,dog\n"
    "5,2,1,1,2,3,140,40,60,10,96.683357,bounds\n"
    "3,2,1,2,0,0,20,80,90,30,-1,\n"
    "4,2,1,2,1,0,20,80,80,10,-1,\n"
    "5,2,1,2,1,1,20,80,40,10,96.912064,UPON\n"
    "5,2,1,2,1,2,40,80,30,10,96.887390,the\n"
    "4,2,1,2,2,0,20,100,100,10,-1,\n"
    "5,2,1,2,2,1,20,100,60,10,90.893219,sleepy\n"
    "5,2,1,2,2,2,90,100,30,10,96.538940,fox\n"
)

corpus = corpus_from_tesseract_table(
    read_csv(io.BytesIO(csv_str.encode())), document_id="123"
)

corpus

<retrievall.core.Corpus at 0x105f709a0>

## `Chunks` objects
Chunks in `retrievall` are represented by the `Chunks` object, which is backed by tabular data structures. A `Chunks` object stores which atoms are in which chunks, plus metadata or attributes about each chunk.

A `Chunks` object represents a collection of chunks and has three attributes (which need to be supplied on instantiation):
* `corpus`: The parent `Corpus` of these chunks. This is necessary for accessing atom metadata: the `Chunks` object itself only stores references to atoms, so it doesn't intrinsically have any details on things like the actual text content of atoms. If a `Chunks` object is added manually to a `Corpus`, this needs to be the same exact object as the parent corpus.
* `chunks`: A PyArrow `Table` that stores the IDs of all chunks of this type, along with any other chunk-level metadata. Chunk-level metadata defines values that are specific to each chunk (e.g. spatial information, particular textual representations of chunks, metrics and scores, etc.).
* `chunk_atoms`: A PyArrow `Table` that outlines which atoms are parts of which chunks. This table simply references atoms and chunks using atom and chunk ID values, and doesn't store any additional information about what is in the chunks or atoms.

In documentation, different kinds of chunk-level metadata (i.e. different columns in the `chunks` table) are usually referred to as different "attributes" of the chunks.
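
To make the split between the `chunks` and `chunk_atoms` tables more concrete, here is a minimal sketch that models them with plain Python lists of dicts instead of PyArrow `Table`s. The column names (`id`, `chunk`, `atom`, `text`) and the `chunk_text` helper are assumptions made for illustration, not the library's actual schema or API.

```python
# Plain-Python stand-ins for the tables described above.
# The real library stores these as PyArrow Tables; the column
# names used here are illustrative assumptions.
atoms = [
    {"id": 0, "text": "The"},
    {"id": 1, "text": "(quick)"},
    {"id": 2, "text": "[brown]"},
    {"id": 3, "text": "fox"},
]

# `chunks`: one row per chunk, holding chunk-level attributes.
chunks = [
    {"id": 0, "ordinal": 1},
    {"id": 1, "ordinal": 2},
]

# `chunk_atoms`: one row per (chunk, atom) membership, IDs only.
chunk_atoms = [
    {"chunk": 0, "atom": 0},
    {"chunk": 0, "atom": 3},  # chunk 0 skips atoms 1 and 2: non-contiguous
    {"chunk": 1, "atom": 1},
    {"chunk": 1, "atom": 2},
]


def chunk_text(chunk_id):
    """Hypothetical helper: resolve a chunk's atom IDs to rebuild its text."""
    text_by_atom = {atom["id"]: atom["text"] for atom in atoms}
    members = [row["atom"] for row in chunk_atoms if row["chunk"] == chunk_id]
    return " ".join(text_by_atom[atom_id] for atom_id in members)


texts = {chunk["id"]: chunk_text(chunk["id"]) for chunk in chunks}
print(texts)
```

The membership table carries no text of its own, which is why a `Chunks` object needs a reference to its parent corpus before anything like chunk text can be produced.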

### `ChunkExpr`s

Creating new chunks is most easily done with a `ChunkExpr` (chunk expression).

`ChunkExpr`s are user-definable classes that have a `__call__` function which accepts a `Corpus` as an input and returns a `Chunks` as an output. The usual pattern is to define parameters for the `ChunkExpr` in its `__init__`:

In [None]:
import pyarrow as pa
from retrievall.core import Corpus, Chunks, ChunkExpr


class MyExampleChunker(ChunkExpr):
    def __init__(self, coolness: int, size: int):
        self.coolness = coolness
        self.size = size

    def __call__(self, corpus: Corpus) -> Chunks:
        new_chunks = pa.Table.from_pydict(
            {"id": range(self.size), "coolness": [self.coolness] * self.size}
        )

        # Which atoms are in which chunks
        # (this example just has 1-token chunks)
        chunk_atoms = pa.Table.from_pydict(
            {
                "chunk": new_chunks["id"],  # chunk IDs
                "atom": corpus.atoms["id"][: self.size],  # atom IDs (column name assumed to be "atom")
            }
        )

        return Chunks(corpus=corpus, chunks=new_chunks, chunk_atoms=chunk_atoms)

When defining your own `ChunkExpr`s, remember that each chunk needs a unique chunk ID, which you'll have to generate manually.
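
As a rough sketch of what generating IDs manually can look like, the helpers below produce either sequential integers or random hex strings and check that the result is unique. The function names are hypothetical, and whether the library expects a particular ID type is an assumption not covered here (the example above uses integers from `range`).

```python
import uuid


def sequential_ids(n, start=0):
    """Hypothetical helper: n consecutive integer IDs beginning at `start`."""
    return list(range(start, start + n))


def random_ids(n):
    """Hypothetical helper: n random 32-character hex IDs."""
    return [uuid.uuid4().hex for _ in range(n)]


def all_unique(ids):
    """True when no ID appears more than once."""
    return len(set(ids)) == len(ids)


seq = sequential_ids(4)
rand = random_ids(4)
print(seq, all_unique(seq), all_unique(rand))
```

Sequential IDs are easy to read and reproduce, while random IDs avoid collisions when chunks from several independent runs are later combined.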

A chunk expression can be used to create chunks via `Corpus.chunk()`:

In [None]:
corpus.chunk(MyExampleChunker(coolness=10, size=4))

<retrievall.core.Chunks at 0x11d08f310>

or standalone by instantiating it and then calling it:

In [None]:
# Separate instantiation and calling:
chunker = MyExampleChunker(coolness=10, size=4)
chunks = chunker(corpus)

# One-liner
chunks = MyExampleChunker(coolness=10, size=4)(corpus)
chunks

<retrievall.core.Chunks at 0x11d08f3a0>

## Accessing chunks

Chunks are most easily accessed via `Corpus.chunk()`, which either returns pre-existing persistent chunks or generates ephemeral ones. See the [Corpus chunks](01_corpus.ipynb#Corpus-chunks) documentation for more details on interfacing with chunks via their parent corpus.

You can also create chunks by instantiating them manually or [using a `ChunkExpr`](#chunkexprs). 

Interacting with chunks works the same way whether you use the `Corpus.chunk()` method, call a `ChunkExpr` manually, or build a `Chunks` object by hand: each approach gives you a `Chunks` object to work with.

## Retrieving from chunks

Once you have some chunks, you can start retrieving the most relevant ones!

`Chunks` objects have three useful methods for enriching chunks with additional metadata and retrieving chunks based on that metadata:
* `Chunks.enrich` adds new columns (or "attributes") to the internal `chunks` table, which is helpful for adding additional metadata to determine relevance.
* `Chunks.filter` returns a new `Chunks` object with certain chunks filtered out. This is the act of retrieval itself.
* `Chunks.select` materializes the loose collections of atoms and metadata contained in a `Chunks` object into a single, concrete PyArrow `Table`.

Here's a fairly minimal example retrieval process:


In [None]:
from retrievall.exprs import SimpleStringify
from retrievall.filters import Threshold

# Retrieve the text of the first 2 lines (a built-in chunk in this corpus)
# based on their `ordinal` value (a built-in attribute for line chunks)
(
    corpus.chunk("line")
    .filter(Threshold("ordinal", "<=", 2))
    .select(text=SimpleStringify())
)

pyarrow.Table
text: large_string
----
text: [["The (quick)","[brown] fox jumps!"]]

### Enrich
The `Chunks.enrich` method is somewhat analogous to Polars' `with_columns`, PySpark's `withColumns`, or Ibis' `mutate` function: it adds new metadata/attributes to the chunk collection, and passes them on alongside all the previously existing attributes.

`enrich` accepts multiple `AttrExpr` keyword arguments (and *no* positional arguments!). The name provided for the argument becomes the name of the column in the `chunks` table, which can then be referenced in future operations:

In [None]:
from retrievall.exprs import SimpleStringify

names = corpus.chunk("line").chunks.column_names

print(names)

names = corpus.chunk("line").enrich(my_attr=SimpleStringify()).chunks.column_names

print(names)

['paragraph', 'left', 'top', 'width', 'height', 'ordinal', 'id']
['paragraph', 'left', 'top', 'width', 'height', 'ordinal', 'id', 'my_attr']
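
The keyword-argument behavior described above can be approximated in plain Python. This is only a sketch of the semantics: the `enrich` function below is a hypothetical stand-in that works on a list of dicts, and the lambda plays the role of an `AttrExpr`, whose real interface is not shown in this notebook.

```python
def enrich(rows, **exprs):
    """Return new rows with one extra key per keyword argument.

    Each keyword name becomes the new column name; each value is a
    function that computes that column from an existing row.
    """
    return [
        {**row, **{name: fn(row) for name, fn in exprs.items()}}
        for row in rows
    ]


lines = [
    {"id": 0, "ordinal": 1, "width": 110},
    {"id": 1, "ordinal": 2, "width": 200},
]

enriched = enrich(lines, is_wide=lambda row: row["width"] > 150)

print(list(lines[0]))     # original columns are untouched
print(list(enriched[0]))  # original columns plus "is_wide"
```

As with `Chunks.enrich`, existing columns are passed through unchanged and the new column is appended under the name given as the keyword.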


### Filter

The `Chunks.filter` method, like most DataFrame libraries' `filter` functions, removes unwanted chunks, leaving just the ones that are queried for. See the [filters notebook](04_filters.ipynb) for more in-depth info about filters.

`filter` accepts multiple `ChunkFilter` *positional* arguments; if multiple filters are provided, they're applied in order.

In [None]:
from retrievall.filters import TopK

chunks = corpus.chunk("line")
print(len(chunks))

# Filter
chunks = corpus.chunk("line").filter(TopK("ordinal", 3))
print(len(chunks))

8
3
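
Because multiple filters are applied in order, the same filters can give different results depending on how they are arranged. The sketch below illustrates this with plain Python functions. `threshold`, `top_k`, and `apply_filters` are hypothetical stand-ins rather than the library's `Threshold` and `TopK`, and treating top-k as "keep the k largest values" is an assumption of this sketch.

```python
def threshold(attr, limit):
    """Build a filter that keeps rows whose `attr` is <= limit."""
    return lambda rows: [row for row in rows if row[attr] <= limit]


def top_k(attr, k):
    """Build a filter that keeps the k rows with the largest `attr`."""
    return lambda rows: sorted(rows, key=lambda row: row[attr], reverse=True)[:k]


def apply_filters(rows, *filters):
    """Apply each filter to the output of the previous one, left to right."""
    for chunk_filter in filters:
        rows = chunk_filter(rows)
    return rows


lines = [{"ordinal": n} for n in range(1, 9)]

# Threshold first, then top-k: the top 2 of ordinals 1..4.
a = apply_filters(lines, threshold("ordinal", 4), top_k("ordinal", 2))

# Top-k first, then threshold: the top 2 overall (8 and 7) both fail the threshold.
b = apply_filters(lines, top_k("ordinal", 2), threshold("ordinal", 4))

print([row["ordinal"] for row in a])
print([row["ordinal"] for row in b])
```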


### Select

`Chunks.select` takes the loose collection of associated data that makes up a `Chunks` object and puts it together into a much more usable DataFrame/table. This lets you control 1) how the contents of your chunks are manifested for any kind of downstream use and 2) *which* attributes and metadata you actually want to use.

Similar to [Polars' `select` function](https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.select.html), `Chunks.select` allows you to leverage chunk attribute/column names, expressions, and positional/keyword arguments to give you flexibility over what data you select:
* Existing chunk attributes can be selected by passing the name of the attribute as a string (e.g. `"ordinal"`, or `"custom_score"` if you've created an attribute with that name via `Chunks.enrich`)
* Attributes can be created on-the-fly by passing an `AttrExpr` as a keyword argument. Like `Chunks.enrich`, the argument name provided will become the name of the column in the table that is returned.
* Any attribute not selected will not be returned.

`select` returns a PyArrow `Table`, which can easily be converted to any DataFrame library, or to native Python data structures like lists-of-dicts or dicts-of-lists, or serialized to Arrow or Parquet or CSV, or just used as-is.

In [None]:
from retrievall.exprs import SimpleStringify

corpus.chunk("line").select("ordinal", fancy_text_col=SimpleStringify())

pyarrow.Table
ordinal: uint32
fancy_text_col: large_string
----
ordinal: [[1,2,3,4,5,6,7,8]]
fancy_text_col: [["The (quick)","[brown] fox jumps!","Over the","<lazy> dog","The ~groovy","minute! dog bounds","UPON the","sleepy fox"]]