In [1]:
%load_ext autoreload
%autoreload 2

# API workshopping
Let's figure out a better API!

## Existing component inputs/outputs
* Corpus: Nothin'. That's the main thing we're working with/passing around.
* Chunker: settings -> chunks (which may or may not have any additional attributes? And may or may not need to be persistent)
    * (If we have lists of things, "atoms" may also be able to be considered a "value for each chunk")
* Scorer: chunks -> a value for each chunk
    * We also maybe want a chunk/atom *modifier* (like, something that changes which atoms are in the chunk) and also something that adds/removes chunks to an existing chunk set.
* Filter: chunks -> chunks (but less of them). We actually use a "filter" for the 
* Materializer: chunks -> One thing per chunk.
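
To make those shapes a bit more concrete, here's a minimal sketch of the four component signatures as plain Python type aliases, with toy implementations. Everything in it (the `Chunk` dataclass, the alias names, the toy functions) is hypothetical and not part of `retrievall`; it's only meant to show "chunks in, one value per chunk out" next to "chunks in, fewer chunks out".

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stand-ins, not retrievall's real types.
@dataclass(frozen=True)
class Chunk:
    id: int
    atom_ids: tuple[int, ...]  # a chunk only references atoms, it doesn't own them

Atoms = dict[int, str]  # atom id -> text, standing in for the corpus

Chunker = Callable[[Atoms], list[Chunk]]                    # settings baked into the callable
Scorer = Callable[[list[Chunk], Atoms], list[float]]        # a value for each chunk
Filter = Callable[[list[Chunk], list[float]], list[Chunk]]  # chunks, but fewer of them
Materializer = Callable[[list[Chunk], Atoms], list[str]]    # one thing per chunk

def pair_chunker(atoms: Atoms) -> list[Chunk]:
    """Group consecutive atom ids into chunks of (at most) two."""
    ids = sorted(atoms)
    return [Chunk(i, tuple(ids[s:s + 2])) for i, s in enumerate(range(0, len(ids), 2))]

def length_scorer(chunks: list[Chunk], atoms: Atoms) -> list[float]:
    """Score each chunk by the total character count of its atoms."""
    return [float(sum(len(atoms[a]) for a in c.atom_ids)) for c in chunks]

def keep_above(threshold: float) -> Filter:
    """Build a filter that keeps chunks scoring above `threshold`."""
    def _filter(chunks: list[Chunk], scores: list[float]) -> list[Chunk]:
        return [c for c, s in zip(chunks, scores) if s > threshold]
    return _filter

def join_text(chunks: list[Chunk], atoms: Atoms) -> list[str]:
    """Materialize each chunk as its atoms' text joined by spaces."""
    return [" ".join(atoms[a] for a in c.atom_ids) for c in chunks]

atoms = {0: "The", 1: "quick", 2: "brown", 3: "fox", 4: "jumps"}
chunks = pair_chunker(atoms)
scores = length_scorer(chunks, atoms)
kept = keep_above(6.0)(chunks, scores)
texts = join_text(kept, atoms)
```

Note that every step past the chunker needs the atoms passed in alongside the chunks, which is exactly the cross-referencing awkwardness these notes keep running into.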

## General notes
* **Ownership.** Part of the difficulty is that a chunk doesn't "own" its atoms. It's all associations! We usually want to use the contents/attributes of the atoms (e.g. `text` or `ordinal`) when doing anything useful, but we also can't do that directly from the chunk itself because the chunk just knows which atoms it's referring to, not their other attributes (which could be something complicated, up to and including per-atom embeddings?). It seems like it would be nice to access atom attributes directly, but they aren't actually immediately available and require some cross-referencing.
* **Objects/memory/pointers.** If it's all referential, couldn't we just cover this all with *actual* data/objects? Like rather than referencing IDs everywhere, could we have actual objects or pointers? I mean, the fact that this stuff lives in memory means it *already* has a unique address, right? Can't all these references-of-references be handled by using in-memory objects?
* **Normal forms.** If we've got custom attributes all the time, maybe 6th normal form *is* helpful? Like, if we register every single attribute separately, maybe that makes it easier to have an interface to access them?
* **Emphasizing materialization**. Materializing *does* result in a single value per chunk. At least, allegedly. So maybe we just need to ensure that any pipeline has a materialization step? Like, all the processes lead up to providing something to materialize?
    * Sticking with that idea for a sec, are there any cases where we *don't* want to materialize?
        * With a chunker, we're either persisting things (so someone can use them later) or we're actively working towards materialization. And we materialize chunks.
        * A "scorer" is adding metadata to a chunk. Usually as part of a materialization process. If it isn't then it also doesn't need to be cleaned up, it can just exist as tables. This might be better as an "annotate" function or something?
        * Anyways...
* **Atom select statement.** There's something kicking around in here about not just a chunk select, but an atom select. Like, maybe for defining (or modifying) chunks, we need a separate layer of select statements to pick out which atoms we want? So we have an atom select clause and also a chunk select clause?
* **Declarative group by operation.** Is this all some kind of crazy, declarative `group_by`?
    * In this case, "persisted" chunks are just things that we have handy groups pre-defined for. A sorta trivial group by operation.
    * Can we filter groups from a GroupBy? (That's a `HAVING` clause, pretty much, but I'm not sure what the options are via PyArrow/Polars/whatever.)
    * How would something like the `ChunkExpander` fit in to this? You can't modify what things belong in a group that way at that stage, can you?
* **Potential delineations.** There are a few ways we could conceptually slice what we're working on. We just have to figure out the best combination.
    * **Chunks & atoms.** Everything is a chunk or an atom. Chunks are made of atoms, and the end goal of what we're trying to do is to grab the right chunks
        * **Are chunks immutable?** Once you have a chunk, maybe that's it—you can't do anything else with it. So if you want to expand a chunk (for example), what you would actually have is an `Expander` that takes *existing* chunks (either persisted or ephemeral) and uses them as an input to generate completely new chunks. So an "expander" is just a chunker that uses existing chunks as an input.
        * **And what is a `Chunks` object?** Based on the above, a `Chunks` object contains both the chunk attributes and the chunk relations. Plus a reference to the parent corpus, probably. So that's actually not too bad, it's just three elements. (We probably want to change the column names of the internal tables, so that there's just one source of truth. Or maybe not even have a name there? Maybe the chunk name exists externally, not internally [i.e. it's just an alias or something?]).
        * **Flexibility downsides?** This *does* make it a bit trickier to represent relationships between different kinds of chunks, if people are interested in that. But that also feels like a pret
    * **Objects & relationships.** Everything is an object or a relationship. Atoms and chunks are both just objects (which can have attributes). Which atoms are associated with a chunk are defined by relationships. The primary output of a retrievall process is to operate on a `chunk-atom` relationship, maybe? This paradigm is *really* general, but maybe not the best suited to a situation where we're largely doing the same things over and over... For example, it would let us more easily reflect relationships between multiple kinds of "chunk" (which is not a super common thing for our use-cases?), but also seems like it would make it more difficult to actually get a chunk consistently (because then you need to get the atoms, etc). This feels like it's kind of leaning towards a graph or hypergraph sorta thing.
* **Leading options.** It feels like there are two leading options right now:
    * **Embrace databases.** This would lean towards storing everything in DuckDB and basically writing helper wrapper functions around everything. That would be performant, and maybe elucidate some ideas on how things should work, but may also hurt flexibility? Kinda depends on what the DuckDB interface allows for. It'll take some exploration. The thesis here is that databases are established and flexible technology, and it would be wise to build on them rather than reinventing the wheel.
    * **Custom representation.** This leans towards breaking things down even further in a custom way: more nested data, references of references, etc. The thesis for this one is that databases are great and all, but they *aren't* built for handling the type of thing we're trying to do, and we need something more suited to our specific use cases.
    * **Does it matter either way?** I'm curious if the backend stuff could be independent of the API. Maybe both are valid, and having a great interface could let either option work well?
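
The "chunks are immutable" and "three-element `Chunks` object" ideas from the notes above can be sketched roughly like this. This is a toy under stated assumptions, not the real `retrievall` classes: the `Corpus` and `Chunks` dataclasses, the `expand` function, and the `expanded_by` attribute are all made up for illustration, atom IDs are assumed to be positions in the corpus, and every chunk is assumed to hold at least one atom.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Corpus:
    atoms: tuple[str, ...]  # atom id is just the index here

@dataclass(frozen=True)
class Chunks:
    corpus: Corpus                          # reference to the parent corpus
    attrs: tuple[dict, ...]                 # one attribute record per chunk
    relations: tuple[tuple[int, int], ...]  # (chunk index, atom id) pairs

def expand(chunks: Chunks, by: int = 1) -> Chunks:
    """Hypothetical expander: a chunker that takes existing chunks as input.

    Nothing is modified in place; a completely new `Chunks` is returned.
    """
    last_atom = len(chunks.corpus.atoms) - 1
    new_relations = []
    for chunk_idx in range(len(chunks.attrs)):
        members = [a for c, a in chunks.relations if c == chunk_idx]
        lo = max(min(members) - by, 0)          # clamp to the start of the corpus
        hi = min(max(members) + by, last_atom)  # clamp to the end of the corpus
        new_relations.extend((chunk_idx, a) for a in range(lo, hi + 1))
    new_attrs = tuple({**attr, "expanded_by": by} for attr in chunks.attrs)
    return Chunks(chunks.corpus, new_attrs, tuple(new_relations))

corpus = Corpus(atoms=("The", "quick", "brown", "fox", "jumps"))
original = Chunks(corpus, attrs=({"name": "middle"},), relations=((0, 2),))
expanded = expand(original)
```

The original chunk set is untouched afterwards, so both the persisted and the expanded versions can sit side by side, each pointing back at the same corpus.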

## How DuckDB went
* One of the things *I* value is not having to put names on stuff/using variables. (Which maybe isn't super important in the grand scheme of things, but feels like something that helps with generalizability.) With DuckDB, using variable table names feels really difficult—it takes some work to avoid "officially" registering a table with a super simple string name. 
* Managing multiple tables for a single concept is indeed quite tricky. Having `document` and `document-atom` and `atom` all be accessed by having to know the string names and how they tie together is cumbersome, especially when *actually using* the data (e.g. materializing it, scoring it) almost always requires combining multiple data sources.
* With that in mind, the whole relational thing really is being put to the test here. While normalizing things like we have is great for data consistency and stuff, having to then bring the data back together all the time is challenging. It may indeed be better to give up on that and start using de-normalized data, if we're going to be joining all the time anyways. (But on the other hand, if we're going to be "joining all the time anyways", we need to be super careful about *what* we're joining, and who owns what data, and how we access it, etc. For example, we wouldn't want to de-normalize atom attributes by default, because then we may miss some if we add attributes or we may end up duplicating tons and tons of info.)
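
One middle ground for the last point is to keep the tables normalized and only join in the atom attributes a caller explicitly asks for. Here's a rough sketch of that, using Python's built-in `sqlite3` purely as a stand-in for DuckDB (an assumption for the sake of a self-contained example). The table layout, column names, and the `atoms_for_chunks` helper are all hypothetical.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE atom (id INTEGER PRIMARY KEY, text TEXT, ordinal INTEGER);
    CREATE TABLE chunk_atom (chunk INTEGER, atom INTEGER);
    INSERT INTO atom VALUES (1, 'The', 1), (2, 'quick', 2), (3, 'fox', 3);
    INSERT INTO chunk_atom VALUES (10, 1), (10, 2), (20, 3);
""")

def atoms_for_chunks(con: sqlite3.Connection, attrs: list[str]) -> list[tuple]:
    """Join only the requested atom attributes onto the chunk-atom relation."""
    # Column names get interpolated into the SQL, so check them against the
    # actual schema first (PRAGMA table_info rows are (cid, name, type, ...)).
    known = {row[1] for row in con.execute("PRAGMA table_info(atom)")}
    unknown = set(attrs) - known
    if unknown:
        raise ValueError(f"unknown atom attributes: {sorted(unknown)}")
    cols = ", ".join(f"atom.{a}" for a in attrs)
    query = f"""
        SELECT chunk_atom.chunk, {cols}
        FROM chunk_atom JOIN atom ON atom.id = chunk_atom.atom
        ORDER BY chunk_atom.chunk, atom.id
    """
    return con.execute(query).fetchall()

rows = atoms_for_chunks(con, ["text"])
```

Attributes added to `atom` later are picked up automatically because the helper reads the schema at call time, and nothing gets duplicated unless someone asks for it.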

**ColBERT mockup example??** That may be a good test case for actually using something complex.

## Examples
To flesh this out more, I'm gonna throw some examples out here.

Let's say we've got tokenized documents (like we do with an `OCRCorpus`), and we're trying to **retrieve the most positive haikus**. What would this process look like?

Here are a few potential APIs:
### Pure SQL
Since `retrievall` is *pretty much* just querying stuff from a database, we can pseudocode some SQL for haiku-querying.

First, we need to find where the haikus are. Since this is a bit complex, and maybe we want to re-use the haikus we found in the corpus for something else, we'll persist the chunks to the corpus. Ideally, any API we end up going with should allow us to both persist chunks for re-use *and* just use them ephemerally for one-off tasks.
```sql
CREATE TABLE haiku_atom AS
    SELECT 42 AS i, 84 AS j;

SELECT
    
FROM corpus
GROUP BY

HAVING
```
If we wanted to pre-compute haiku chunks, rather than 

### SQL-like
Despite sharing a lot in common with regular database querying operations, `retrievall` has a much greater focus on grouping atoms and aggregating them. So what we might want instead is something *like* SQL, but with less boilerplate for doing the things we're trying to do.

```sql

```

### Expression-y (Polars/Spark-like)

### Other

```python
corpus.chunk("existing_chunk").filter(tfidf)

# Another approach
corpus.chunk(fixedwindow(size=100, offset=...))  # offset value still to be decided
```

Another option: A `Chunks` object contains chunk attributes and also the references to which atoms the chunk contains (A "view" of them, as it were). Regardless of how we actually store the info about the atoms in the chunk, accessing the atoms themselves can be done by:
```python
my_chunks = Chunks()  # Whatever goes here to make chunks
my_chunks.atoms()  # Access a group_by of all the atom stuff, perhaps? Somehow, this should get us access to the contents of the chunks.
```
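
Here's one way that `atoms()` idea could look in plain Python, just to see how it feels. This is a hypothetical sketch: the constructor arguments and the dict-of-lists return shape are assumptions, and the real thing would presumably hand back something Arrow- or Polars-backed rather than Python dicts.

```python
from collections import defaultdict

class Chunks:
    """Hypothetical sketch: chunk attributes plus a view of each chunk's atoms."""

    def __init__(self, corpus_atoms, chunk_atoms, attrs=None):
        self.corpus_atoms = corpus_atoms  # atom id -> attribute dict (lives in the corpus)
        self.chunk_atoms = chunk_atoms    # list of (chunk id, atom id) pairs
        self.attrs = attrs or {}          # chunk id -> chunk-level attributes

    def atoms(self):
        """Group atom records by chunk, keeping the order of the pairs."""
        grouped = defaultdict(list)
        for chunk_id, atom_id in self.chunk_atoms:
            grouped[chunk_id].append(self.corpus_atoms[atom_id])
        return dict(grouped)

corpus_atoms = {1: {"text": "The"}, 2: {"text": "quick"}, 3: {"text": "fox"}}
my_chunks = Chunks(corpus_atoms, chunk_atoms=[("a", 1), ("a", 2), ("b", 3)])

# A materializer then only has to walk the groups.
texts = {
    chunk_id: " ".join(atom["text"] for atom in atoms)
    for chunk_id, atoms in my_chunks.atoms().items()
}
```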

Gotta look at some other approaches, too.

In [2]:
tesseract_csv = (
    "level,page_num,block_num,par_num,line_num,word_num,left,top,width,height,conf,text\n"
    "1,1,0,0,0,0,0,0,300,400,-1,\n"
    "2,1,1,0,0,0,20,20,110,90,-1,\n"
    "3,1,1,1,0,0,20,20,180,30,-1,\n"
    "4,1,1,1,1,0,20,20,110,10,-1,\n"
    "5,1,1,1,1,1,20,20,30,10,96.063751,The\n"
    "5,1,1,1,1,2,60,20,50,10,95.965691,(quick)\n"
    "4,1,1,1,2,0,20,40,200,10,-1,\n"
    "5,1,1,1,2,1,20,40,70,10,95.835831,[brown]\n"
    "5,1,1,1,2,2,100,40,30,10,94.899742,fox\n"
    "5,1,1,1,2,3,140,40,60,10,96.683357,jumps!\n"
    "3,1,1,2,0,0,20,80,90,30,-1,\n"
    "4,1,1,2,1,0,20,80,80,10,-1,\n"
    "5,1,1,2,1,1,20,80,40,10,96.912064,Over\n"
    "5,1,1,2,1,2,40,80,30,10,96.887390,the\n"
    "4,1,1,2,2,0,20,100,100,10,-1,\n"
    "5,1,1,2,2,1,20,100,60,10,90.893219,<lazy>\n"
    "5,1,1,2,2,2,90,100,30,10,96.538940,dog\n"
    "1,2,0,0,0,0,0,0,300,400,-1,\n"
    "2,2,1,0,0,0,20,20,110,90,-1,\n"
    "3,2,1,1,0,0,20,20,180,30,-1,\n"
    "4,2,1,1,1,0,20,20,110,10,-1,\n"
    "5,2,1,1,1,1,20,20,30,10,96.063751,The\n"
    "5,2,1,1,1,2,60,20,50,10,95.965691,~groovy\n"
    "4,2,1,1,2,0,20,40,200,10,-1,\n"
    "5,2,1,1,2,1,20,40,70,10,95.835831,minute!\n"
    "5,2,1,1,2,2,100,40,30,10,94.899742,dog\n"
    "5,2,1,1,2,3,140,40,60,10,96.683357,bounds\n"
    "3,2,1,2,0,0,20,80,90,30,-1,\n"
    "4,2,1,2,1,0,20,80,80,10,-1,\n"
    "5,2,1,2,1,1,20,80,40,10,96.912064,UPON\n"
    "5,2,1,2,1,2,40,80,30,10,96.887390,the\n"
    "4,2,1,2,2,0,20,100,100,10,-1,\n"
    "5,2,1,2,2,1,20,100,60,10,90.893219,sleepy\n"
    "5,2,1,2,2,2,90,100,30,10,96.538940,fox\n"
)

In [3]:
from pyarrow.csv import read_csv
import io
# from retrievall.basic import corpus_from_tesseract_table

## Metadata fetcher?

In [39]:
import polars as pl
import pyarrow as pa
from retrievall.core import AttrExpr, Chunks


class AtomData(AttrExpr):
    """
    Get atom metadata?
    """

    def __init__(self, attrs: list):
        self.attrs = attrs

    # Returns a Polars DataFrame (one row per chunk), not a `pa.Array`
    def __call__(self, chunks: Chunks) -> pl.DataFrame:
        return (
            pl.from_arrow(chunks.chunk_atoms)
            .join(pl.from_arrow(chunks.corpus.atoms), left_on="atom", right_on="id")
            .group_by("chunk")
            .agg(pl.col(self.attrs))
        )

In [59]:
from retrievall.ocr import corpus_from_tesseract_table

corpus = corpus_from_tesseract_table(
    read_csv(io.BytesIO(tesseract_csv.encode())), document_id="123"
)

AtomData("tesseract_coord")(corpus.chunk("page")).to_arrow().schema

chunk: uint64
tesseract_coord: large_list<item: struct<page_num: int64, block_num: int64, par_num: int64, line_num: int64, word_num: int64>>
  child 0, item: struct<page_num: int64, block_num: int64, par_num: int64, line_num: int64, word_num: int64>
      child 0, page_num: int64
      child 1, block_num: int64
      child 2, par_num: int64
      child 3, line_num: int64
      child 4, word_num: int64

In [47]:
AtomData("tesseract_coord")(corpus.chunk("page"))[:1]

chunk,tesseract_coord
u64,list[struct[5]]
14712110704280823429,"[{1,1,1,1,1}, {1,1,1,1,2}, … {1,1,2,2,2}]"


## Experiments

In [6]:
from retrievall.ocr import corpus_from_tesseract_table

corpus = corpus_from_tesseract_table(
    read_csv(io.BytesIO(tesseract_csv.encode())), document_id="123"
)

In [9]:
from retrievall.filters import Threshold
from retrievall.exprs import SimpleStringify

(
    corpus.chunk("page")
    .filter(Threshold("ordinal", "<=", 1))
    .select(text=SimpleStringify())
)

pyarrow.Table
text: large_string
----
text: [["The (quick) [brown] fox jumps! Over the <lazy> dog"]]

In [None]:
from retrievall.chunkers import FixedSizeChunk
from retrievall.exprs import SimpleStringify
from retrievall.sparsetext import Tfidf
from retrievall.filters import TopK

(
    corpus.chunk(FixedSizeChunk("page", 64, offset=-32))
    .enrich(tfidf=Tfidf(SimpleStringify(), query="brown fox"))
    .filter(TopK("tfidf", 3))
    .select(text=SimpleStringify())
)

### Example usage
Let's try to put this together

In [None]:
# (
#     corp.chunk(
#         FixedSizeChunker("fixed", size=3, offset=-1, constrain_to="line")
#     )  # Create some ephemeral chunks
#     .enrich(tfidf=TfidfScorer(SimpleStringifier(), query="Inability to exercise"))
#     .filter("WHERE tfidf > 1.2")
#     .select(
#         "tfidf",
#         "service_request_id",
#         OCRTokens(),  # ???
#     )
# )

Could we just do this with a `HAVING` SQL clause anyways? (see [this DuckDB page](https://duckdb.org/docs/sql/query_syntax/having#examples) for example):
```sql
SELECT
    chunk_id,
    SimpleStringifier(),
    avg(income)
FROM addresses
GROUP BY city, street_name
HAVING avg(income) > 2 * median(income);
```
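
As a quick sanity check that `GROUP BY` plus `HAVING` really does cover "filter whole chunks by an aggregate of their atoms", here's a small runnable sketch. It uses Python's built-in `sqlite3` as a stand-in for DuckDB (an assumption, so the example stays self-contained), and the `atom`/`chunk_atom` tables and their columns are made up.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE atom (id INTEGER PRIMARY KEY, text TEXT, conf REAL);
    CREATE TABLE chunk_atom (chunk TEXT, atom INTEGER);
    INSERT INTO atom VALUES
        (1, 'The', 96.0), (2, 'quick', 95.0), (3, 'lazy', 60.0), (4, 'dog', 70.0);
    INSERT INTO chunk_atom VALUES
        ('line1', 1), ('line1', 2), ('line2', 3), ('line2', 4);
""")

# GROUP BY forms the chunks, the aggregates act as scorers,
# and HAVING is the chunk-level filter.
rows = con.execute("""
    SELECT chunk_atom.chunk, count(*) AS n_atoms, avg(atom.conf) AS mean_conf
    FROM chunk_atom JOIN atom ON atom.id = chunk_atom.atom
    GROUP BY chunk_atom.chunk
    HAVING avg(atom.conf) > 90
""").fetchall()
```

Only `line1` survives the filter here. What this doesn't answer is the `ChunkExpander` question from earlier: `HAVING` can drop groups, but it can't change which atoms belong to a group.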

### Using a retrieval configuration
If you're working with an automated job or an experimentation workflow, it may be handier to *configure* your retrieval process via a config file, rather than having to go in and make changes to your Python code by hand. Below is a helper function that can enable this configurability.

(This is pretty hacked together, it might be smarter to try out something like [Pydantic](https://docs.pydantic.dev/latest/) if you want to do this seriously.)

In [None]:
import importlib
from retrievall import Corpus


def config_retrieve(corpus: Corpus, cfg: dict) -> pa.Table:
    """
    Retrieve data from `corpus`, according to the retrieval process
    defined in `cfg`.

    Parameters
    ----------
    corpus
        `Corpus` to retrieve from.
    cfg
        Configuration dict. Requires 2 keys:
        * `chunk`, which is either a string matching an existing chunk
          name, or a dict with a `name` key that declares the (qualified) name of the
          `ChunkExpr` to use and an `args` key that contains a dict of arguments to pass
          when instantiating the chunk expression.
        * `retrieve`, which contains a list of dicts, each one with an `operation` key
          (either `filter`, `enrich` or `select`) and an `args` dict. Arguments may be
          strings or expressions, defined in the same way as the chunk expression above
          (i.e. a dict with a `name` and an `args` key)
    """

    def object_from_dict(d: dict):
        """
        Instantiate an arbitrary object from a dict that contains a
        `name` key (giving the class name) and an `args` key (providing
        arguments to the instantiation)
        """
        module_name, class_name = d["name"].rsplit(".", 1)
        obj_class = getattr(importlib.import_module(module_name), class_name)

        # Instantiate
        return obj_class(**d["args"])

    chunkexpr = cfg["chunk"]
    # Instantiate chunk expression, if the `chunk` key contains one
    chunkexpr = chunkexpr if isinstance(chunkexpr, str) else object_from_dict(chunkexpr)

    res = corpus.chunk(chunkexpr)

    for op in cfg["retrieve"]:
        # If there's just one argument, instantiate it. Otherwise, loop
        # through all passed arguments and instantiate them if possible.
        if op["args"].keys() == {"args", "name"}:
            res = getattr(res, op["operation"])(object_from_dict(op["args"]))
        else:
            res = getattr(res, op["operation"])(
                **{
                    k: v if isinstance(v, str) else object_from_dict(v)
                    for k, v in op["args"].items()
                }
            )

    return res
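
The core trick in `config_retrieve` is the "qualified name plus args dict" instantiation, which is easy to try in isolation. Below is a standalone sketch of it against a standard-library class (`fractions.Fraction`) so it runs without `retrievall` installed. The `allowed_modules` guard is an addition for illustration, not something the helper above does; it's there because importing arbitrary names from a config file means running arbitrary code.

```python
import importlib

def object_from_dict(d: dict, allowed_modules: tuple = ("fractions",)):
    """Instantiate the class named by d["name"] with keyword arguments d["args"]."""
    module_name, class_name = d["name"].rsplit(".", 1)
    # Hypothetical safety check: only import from modules we explicitly trust.
    if module_name not in allowed_modules:
        raise ValueError(f"module {module_name!r} is not in the allow-list")
    obj_class = getattr(importlib.import_module(module_name), class_name)
    return obj_class(**d.get("args", {}))

spec = {"name": "fractions.Fraction", "args": {"numerator": 3, "denominator": 4}}
frac = object_from_dict(spec)
```

For `retrievall` configs, the allow-list would presumably hold module names like `retrievall.filters` and `retrievall.exprs`.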

In [None]:
config_retrieve(
    corpus,
    {
        "chunk": "line",
        "retrieve": [
            {
                "operation": "filter",
                "args": {
                    "name": "retrievall.filters.EqualTo",
                    "args": {"attr": "ordinal", "values": [1]},
                },
            },
            {
                "operation": "select",
                "args": {
                    "ordinal": "ordinal",
                    "text": {
                        "name": "retrievall.filters.EqualTo",
                        "args": {"attr": "ordinal", "values": [1]},
                    },
                },
            },
        ],
    },
)

TypeError: 'Chunks' object is not iterable