# Corpus

A `Corpus` is the main source of data in `retrievall`—the thing that you retrieve *from*.

[The term *corpus*](https://en.wikipedia.org/wiki/Text_corpus) (plural: *corpora*) is commonly used in the fields of linguistics and more traditional information retrieval/natural language processing to refer to a dataset or collection of documents.

Retrievall uses this word in the same sense: it's the collection of documents or data that you want to retrieve from. This collection can be large (e.g. thousands of documents, all of Wikipedia), but it can also be small (e.g. a single PDF, a single text string).

## Representation
Many RAG libraries represent corpora and documents (and chunks of documents) in a top-down way, where your documents are what they are and you retrieve from them as-is: you either retrieve the full content of a document, or you destructively break the document down into smaller chunks that *replace* your documents as new, smaller documents.

In retrievall, a corpus and its documents are represented from the *bottom up*. Rather than storing documents as-is in fixed configurations, a `Corpus` stores the smallest constituent parts of all its documents (referred to as **`atom`s**, e.g. individual OCR tokens or individual strings), and then deals with arbitrary *collections* of these atoms.

Rather than distinct documents, a `Corpus` is like a big soup of all the atoms in all the documents of the corpus. This means that the atoms can be freely rearranged or reassociated into new collections on-the-fly to "chunk" the corpus in new ways. Even "documents" themselves are just collections of atoms (or "chunks"), which means that a full document, a paragraph, a 100-token rolling window, and even a single-token span all have the same primacy and representation. You can easily switch between different "chunking" methods to retrieve exactly what you're looking for, whether that's something big or something really little.
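To make the "soup of atoms" idea concrete, here is a toy sketch in plain Python. It is not how retrievall stores anything internally; the dictionaries, the chunk ids, and the `chunk_text` helper are all made up for illustration. The point is only that a document, a pair of tokens, and a single token are the same kind of thing: a named collection of atom ids.

```python
# Toy illustration only: plain Python, not retrievall's internal representation.
# The "soup": every atom, keyed by its id.
atoms = {123: "Lorem", 355: "ipsum", 684: "dolor", 235: "sit", 407: "amet"}

# Each "chunking" maps a (hypothetical) chunk id to the atom ids it contains.
document = {"doc-1": [123, 355, 684, 235, 407]}
pairs = {"pair-1": [123, 355], "pair-2": [684, 235], "pair-3": [407]}
single_tokens = {f"tok-{atom_id}": [atom_id] for atom_id in atoms}


def chunk_text(chunking, chunk_id):
    """Join the text of every atom that belongs to one chunk."""
    return " ".join(atoms[atom_id] for atom_id in chunking[chunk_id])


print(chunk_text(document, "doc-1"))       # Lorem ipsum dolor sit amet
print(chunk_text(pairs, "pair-2"))         # dolor sit
print(chunk_text(single_tokens, "tok-407"))  # amet
```

None of the three chunkings modifies `atoms`, so switching between them never destroys or replaces the underlying data.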

See [here](02_chunks.ipynb) for more details on chunks.

## Creation
A `Corpus` is instantiated with an `atoms` PyArrow `Table`.

### Atoms
At the *bare minimum*, atoms all need an `id` column, with a distinct identifier value for each atom. There is no prescribed data type for IDs: any type is allowed, but some types may be easier to use or more performant than others.

Atoms can also have any kind of "attributes" or metadata. For example, text-based tokens will have a `text` column (and likely an `ordinal` column to stipulate reading order). Similarly, OCR-based atoms may have location or bounding box information to pinpoint where they exist on document pages.

In [None]:
import pyarrow as pa
from retrievall import Corpus

atoms = pa.Table.from_pydict(
    {
        "id": [123, 355, 684, 235, 407],  # Literally putting random values here.
        "text": ["Lorem", "ipsum", "dolor", "sit", "amet"],
        "ordinal": [1, 2, 3, 4, 5],
    }
)

corpus = Corpus(atoms=atoms)
corpus

<retrievall.core.Corpus at 0x108b0cca0>

## Corpus chunks
Instantiation creates an empty `chunks` dictionary as an attribute of the `Corpus`. This is where all chunk information is stored and accessible. Note that not all useful chunks *need* to be stored in the corpus—in plenty of cases, it's fine to chunk on-the-fly and then throw the chunks away after you get what you need.

### Adding persistent chunks
A `Corpus` cannot be *created* with chunks at instantiation time; chunks must be added to the corpus' chunk dict separately, after instantiation. The easiest way to do this is via `Corpus.set_chunk`.

Any chunks added to the corpus' chunk dict can be thought of as "persistent": since they're stored in the corpus, you can access the contents of the chunks again and again without having to re-run any chunking-related computations.

As an example, note that our `corpus` object from earlier in the notebook doesn't have any chunks whatsoever. One really common chunk to set up is a `document` chunk, which tells us which atoms are in which document(s) in our corpus; in this case, we only have one document, but a corpus may have many.

Let's manually add `document` chunks to this corpus. (See the [`Chunks` documentation](02_chunks.ipynb) for details about instantiating `Chunks` objects.)

In [None]:
from retrievall import Chunks

# *Very* manually create our document chunks
doc_chunks = Chunks(
    corpus=corpus,
    chunks=pa.Table.from_pydict({"id": [1234]}),
    chunk_atoms=pa.Table.from_pydict(
        {"chunk": [1234, 1234, 1234, 1234, 1234], "atom": [123, 355, 684, 235, 407]}
    ),
)

corpus.set_chunk(name="document", chunks=doc_chunks)

# Manually check the updated chunk dict
corpus.chunks

{'document': <retrievall.core.Chunks at 0x108a39420>}

We can now retrieve relevant `document`s in the future without having to re-determine which atoms belong to which document.

### Getting ephemeral chunks
Chunks that are used for one-off calculations or retrieval processes, which don't need to be stored long-term, can be thought about as "ephemeral" chunks. You can create ephemeral chunks from a chunker by passing a `ChunkExpr` to a corpus' `Corpus.chunk` function.

In [None]:
from retrievall.chunkers import FixedSizeChunk

fixed_chunks = corpus.chunk(FixedSizeChunk("document", 4, offset=-2))
fixed_chunks, len(fixed_chunks)

(<retrievall.core.Chunks at 0x10e539540>, 3)

(Note that many chunk expressions may require referencing some kind of existing chunk—in this case, we used `document`s as the boundaries of our fixed-size windows.)
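To illustrate what "using documents as boundaries" means for windowed chunks, here is a generic plain-Python sketch of overlapping fixed-size windows computed separately within each document. The `rolling_windows` helper and its `size` and `step` parameters are hypothetical; this is not a description of how `FixedSizeChunk` interprets its arguments (in particular `offset`).

```python
def rolling_windows(items, size, step):
    """Split `items` into windows of up to `size` items, one starting every `step` items."""
    if size < 1 or step < 1:
        raise ValueError("size and step must both be positive")
    return [items[start:start + size] for start in range(0, len(items), step)]


# Two toy documents; windows are computed per document, so none spans both.
documents = {
    "doc-1": ["Lorem", "ipsum", "dolor", "sit", "amet"],
    "doc-2": ["Some", "other", "little", "doc"],
}

windows = {
    doc_id: rolling_windows(tokens, size=4, step=2)
    for doc_id, tokens in documents.items()
}

for doc_id, doc_windows in windows.items():
    for window in doc_windows:
        print(doc_id, window)
```

Because each document is windowed on its own, the last window of `doc-1` is simply shorter rather than borrowing tokens from `doc-2`.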

If we check the corpus, we can see that no new chunks were added, and that these fixed size chunks were just temporary.

In [None]:
corpus.chunks

{'document': <retrievall.core.Chunks at 0x108a39420>}

If we wanted to persist these initially-ephemeral chunks to the corpus, we could pass our ephemeral chunks to `Corpus.set_chunk`:

In [None]:
corpus.set_chunk(
    "my_fixed_chunk", corpus.chunk(FixedSizeChunk("document", 4, offset=-2))
)

corpus.chunks

{'document': <retrievall.core.Chunks at 0x108a39420>,
 'my_fixed_chunk': <retrievall.core.Chunks at 0x108a86920>}

## Merging
If you have multiple `Corpus` objects that share compatible atom types (e.g. you created corpora for the same kinds of documents from two different sources), they can be merged into a single, larger corpus with `Corpus.merge`.
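Since merging pools the atoms (and chunks) of several corpora, ids presumably need to be distinct across all of them, as the comments in this section's example also suggest. A small pre-merge sanity check might look like the sketch below. `duplicated_ids` is a hypothetical helper, not part of retrievall; with PyArrow tables, the id lists could be obtained with something like `atoms["id"].to_pylist()`.

```python
from collections import Counter


def duplicated_ids(*id_lists):
    """Return the ids that occur more than once across all the given id lists."""
    counts = Counter(item for ids in id_lists for item in ids)
    return sorted(item for item, count in counts.items() if count > 1)


ids_1 = [123, 355, 684, 235, 407]
ids_2 = [987, 654, 999, 765]

print(duplicated_ids(ids_1, ids_2))        # []
print(duplicated_ids(ids_1, [407, 1000]))  # [407]
```

The same check applies to chunk ids (e.g. `1234` and `4321` for the `document` chunks), which also need to stay distinct across corpora.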

In [None]:
# Manually set up another corpus
atoms_2 = pa.Table.from_pydict(
    {
        "id": [987, 654, 999, 765],  # Better be distinct from the other corpus!
        "text": ["Some", "other", "little", "doc"],
        "ordinal": [1, 2, 3, 4],
    }
)

corpus_2 = Corpus(atoms=atoms_2)

# Also include some doc chunks
corpus_2.set_chunk(
    name="document",
    chunks=Chunks(
        corpus=corpus_2,
        chunks=pa.Table.from_pydict({"id": [4321]}),  # Needs to be distinct!
        chunk_atoms=pa.Table.from_pydict(
            {"chunk": [4321, 4321, 4321, 4321], "atom": [987, 654, 999, 765]}
        ),
    ),
)

# We also need to have the same persistent chunk types
corpus_2.set_chunk(
    "my_fixed_chunk", corpus_2.chunk(FixedSizeChunk("document", 4, offset=-2))
)

corpus_2  # New corpus, similar to the original

<retrievall.core.Corpus at 0x10e5396f0>

In [None]:
merged = Corpus.merge([corpus, corpus_2])  # New, combined corpus!
merged

<retrievall.core.Corpus at 0x10f646e90>

Notice that we have more `my_fixed_chunk` chunks than before (because of the merge)!

In [None]:
len(merged.chunk("my_fixed_chunk"))

5