# Bodhilib - Models, Components and Interfaces

The core models of bodhilib are -

1. Prompt
1. PromptStream
1. PromptTemplate
1. Document
1. Node

The core components of bodhilib are -
1. DataLoader
1. Splitter
1. Embedder
1. PromptSource
1. VectorDB
1. LLM

## Prompt and PromptStream

In [None]:
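# Imports used by the interface definitions in this notebook
import abc
from enum import Enum
from typing import Any, Dict, Iterable, Iterator, List, Literal, Optional, Protocol, TypeAlias, Union

from pydantic import BaseModel  # assumption: BaseModel is pydantic's base class

class Role(str, Enum):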
    SYSTEM = "system"
    AI = "ai"
    USER = "user"

class Source(str, Enum):
    INPUT = "input"
    OUTPUT = "output"

class SupportsText(Protocol):
    @property
    def text(self) -> str: ...

class Prompt(BaseModel):
    text: str
    role: Role
    source: Source
    extras: Dict[str, Any]

class PromptStream(Iterator[Prompt]):
    def __iter__(self) -> Iterator[Prompt]: ...
    def __next__(self) -> Prompt: ...
    @property
    def text(self) -> str: ...

## PromptTemplate

In [None]:
Engine = Literal["default", "jinja2"]

class PromptTemplate:
    def __init__(
        self,
        template: str,
        engine: Optional[Engine] = "default",
    ) -> None: ...

    def to_prompt(self, **kwargs: Dict[str, Any]) -> List[Prompt]: ...

`PromptTemplate` lets you generate a prompt for your use-case by injecting the right context into a template. It reuses Python's existing templating ecosystem rather than reinventing the wheel.

PromptTemplate supports 4 formats:
1. `fstring`

    For simple prompts involving variable injection, you can use the `fstring` format. It uses Python's native f-string formatting and interpolation to inject your variables. You can then pass your variables to the `to_prompt` method to build your prompt.

2. `jinja2`

    For more complex prompts involving loops or `if-else` conditionals, you can use the `jinja2` templating library and pass the template as a `jinja2`-compatible template. You can then pass your variables to the `to_prompt` method to build your prompt.

3. `bodhilib-fstring`

    `bodhilib-fstring` allows you to load simple prompts using the `PromptSource` component. The prompts are serialized in the bodhilib-prompt-template format and use the `f-string` format for variable injection. Check out the `PromptSource` component for details.

4. `bodhilib-jinja2`

    `bodhilib-jinja2` allows you to load complex prompts using the `PromptSource` component. The prompts are serialized in the bodhilib-prompt-template format and use `jinja2` templates for variable injection. Check out the `PromptSource` component for details.
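As a rough illustration of how the `fstring` and `jinja2` formats differ, the sketch below fills a template through Python's `str.format` (standing in for the `fstring` format) or, optionally, through `jinja2`. `render_template` is a hypothetical helper written for this example and is not part of bodhilib; the real `PromptTemplate.to_prompt` may behave differently.

```python
from typing import Any


def render_template(template: str, engine: str = "default", **variables: Any) -> str:
    """Hypothetical helper: fill `template` with `variables` using the chosen engine."""
    if engine == "default":
        # f-string style placeholders such as {topic}
        return template.format(**variables)
    if engine == "jinja2":
        # imported lazily so the default path needs no third-party package
        from jinja2 import Template

        return Template(template).render(**variables)
    raise ValueError(f"unknown engine: {engine}")


simple = render_template("Tell me a joke about {topic}.", topic="architects")
print(simple)
```

The `jinja2` branch only matters once a template needs loops or conditionals; for plain variable substitution the default branch is enough.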

## Document and Nodes

In [None]:
Embedding: TypeAlias = List[float]

class Document(BaseModel):
    text: str
    metadata: Dict[str, Any]

class Node(BaseModel):
    id: Optional[str]
    text: str
    parent: Optional[Document]
    metadata: Dict[str, Any]
    embedding: Optional[Embedding]

## Serialized Input

In [None]:
class SupportsText(Protocol):
    @property
    def text(self) -> str: ...

SerializedInput: TypeAlias = str | List[str] | SupportsText | List[SupportsText] | Dict[str, Any] | List[Dict[str, Any]]

`SupportsText` is a Python `Protocol`, so you don't need to implement it explicitly: any object with a `text` property satisfies it automatically. All the main models (`Prompt`, `PromptStream`, `Document` and `Node`) support this protocol by exposing a `text` property that holds the main content.

This makes the components highly composable. You can pass in a `str`; a `List[str]` to process several texts; a `SupportsText` or a `List[SupportsText]`, which can be any of `Prompt`, `PromptStream`, `Document` or `Node`; or a dict representation of an object with a `text` key, such as `{"text": "your content"}`, or a list of such dicts.
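To make the structural typing concrete, here is a minimal sketch of how a component might normalize these input shapes into plain strings. `Note` and `to_texts` are hypothetical names introduced for this example; bodhilib's own normalization logic is not shown in this document and may differ.

```python
from typing import Any, List, Protocol, runtime_checkable


@runtime_checkable
class SupportsText(Protocol):
    @property
    def text(self) -> str: ...


class Note:
    """Hypothetical class: it never mentions SupportsText, it just has `text`."""

    def __init__(self, text: str) -> None:
        self.text = text


def to_texts(inputs: Any) -> List[str]:
    """Hypothetical normalizer: flatten a SerializedInput-like value to strings."""
    if isinstance(inputs, list):
        return [text for item in inputs for text in to_texts(item)]
    if isinstance(inputs, str):
        return [inputs]
    if isinstance(inputs, dict):
        return [inputs["text"]]
    if isinstance(inputs, SupportsText):
        # structural check: passes for any object that has a `text` attribute
        return [inputs.text]
    raise TypeError(f"unsupported input type: {type(inputs).__name__}")


mixed = to_texts(["plain string", Note("from an object"), {"text": "from a dict"}])
print(mixed)
```

Because `Note` only needs a `text` attribute to pass the check, user-defined classes can be fed to components without inheriting from anything.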

## DataLoader

A `DataLoader` is configured using the `add_resource` method. Once configured, you can either iterate over it to fetch resources as `Document` objects on demand, or call `load` to fetch them all eagerly as a `List[Document]`.

In [None]:
class DataLoader(Iterable[Document], abc.ABC):
    @abc.abstractmethod
    def add_resource(self, **kwargs: Dict[str, Any]) -> None: ...

    @abc.abstractmethod
    def __iter__(self) -> Iterator[Document]: ...

    def load(self) -> List[Document]: ...
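The difference between lazy iteration and eager loading can be sketched with a toy loader. Everything here is an assumption made for illustration: `Document` is a dataclass stand-in for the bodhilib model, `InMemoryLoader` and its `texts` keyword are hypothetical, and `load` is assumed to simply drain the iterator.

```python
import abc
from dataclasses import dataclass, field
from typing import Any, Dict, Iterable, Iterator, List


@dataclass
class Document:
    """Stand-in for bodhilib's Document model."""

    text: str
    metadata: Dict[str, Any] = field(default_factory=dict)


class DataLoader(Iterable[Document], abc.ABC):
    @abc.abstractmethod
    def add_resource(self, **kwargs: Any) -> None: ...

    @abc.abstractmethod
    def __iter__(self) -> Iterator[Document]: ...

    def load(self) -> List[Document]:
        # assumed default: the eager variant drains the iterator into a list
        return list(self)


class InMemoryLoader(DataLoader):
    """Hypothetical loader serving strings registered via add_resource(texts=[...])."""

    def __init__(self) -> None:
        self._texts: List[str] = []

    def add_resource(self, **kwargs: Any) -> None:
        self._texts.extend(kwargs.get("texts", []))

    def __iter__(self) -> Iterator[Document]:
        # documents are built one at a time, only when the caller asks for them
        for index, text in enumerate(self._texts):
            yield Document(text=text, metadata={"index": index})


loader = InMemoryLoader()
loader.add_resource(texts=["first file", "second file"])
lazy_first = next(iter(loader))  # builds only the first document
all_docs = loader.load()         # builds every document
```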

## Splitter

A `Splitter` splits a `Document` into right-sized, processable chunks. For flexibility and composability, it takes in a `SerializedInput` and returns a list of `Node` objects whose text corresponds to the splits made by the implementation.

Typically, you pass in a `Document` or a list of `Document` and get back a list of `Node` split into processable chunks.

In [None]:
class Splitter(abc.ABC):
    @abc.abstractmethod
    def split(self, inputs: SerializedInput) -> List[Node]: ...
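A minimal sketch of what an implementation could look like, assuming a naive word-count strategy. `WordCountSplitter` is hypothetical, the dataclasses are stand-ins for the bodhilib models, and for brevity the sketch only accepts a list of `Document` rather than the full `SerializedInput`.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Document:
    """Stand-in for bodhilib's Document model."""

    text: str
    metadata: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Node:
    """Stand-in for bodhilib's Node model."""

    text: str
    parent: Optional[Document] = None
    metadata: Dict[str, Any] = field(default_factory=dict)


class WordCountSplitter:
    """Hypothetical splitter: chunks of at most `max_words` words per node."""

    def __init__(self, max_words: int) -> None:
        if max_words < 1:
            raise ValueError("max_words must be positive")
        self.max_words = max_words

    def split(self, inputs: List[Document]) -> List[Node]:
        nodes: List[Node] = []
        for doc in inputs:
            words = doc.text.split()
            for start in range(0, len(words), self.max_words):
                chunk = " ".join(words[start : start + self.max_words])
                # each node keeps a reference to the document it came from
                nodes.append(Node(text=chunk, parent=doc, metadata=dict(doc.metadata)))
        return nodes


doc = Document(text="one two three four five", metadata={"filename": "numbers.txt"})
nodes = WordCountSplitter(max_words=2).split([doc])
print([node.text for node in nodes])
```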

## Embedder

An `Embedder` takes in a `SerializedInput` and returns a list of `Node` enriched with an `embedding`. If you pass in a `Node` or a `List[Node]`, the passed objects themselves are enriched with the `embedding`.

In [None]:
class Embedder(abc.ABC):
    @abc.abstractmethod
    def embed(self, inputs: SerializedInput) -> List[Node]: ...
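The in-place enrichment can be illustrated with a toy embedder. This is only a sketch: `VowelCountEmbedder` is a hypothetical class, its 5-dimensional vowel-frequency vector is not a meaningful embedding, and `Node` is a dataclass stand-in for the bodhilib model.

```python
from dataclasses import dataclass
from typing import List, Optional, Union


@dataclass
class Node:
    """Stand-in for bodhilib's Node model."""

    text: str
    embedding: Optional[List[float]] = None


class VowelCountEmbedder:
    """Hypothetical embedder: a 5-dimensional vector of vowel frequencies."""

    def embed(self, inputs: List[Union[str, Node]]) -> List[Node]:
        # plain strings are wrapped in new nodes; existing nodes are reused as-is
        nodes = [item if isinstance(item, Node) else Node(text=item) for item in inputs]
        for node in nodes:
            lowered = node.text.lower()
            total = max(len(lowered), 1)
            # mutate the node in place, so callers holding it see the embedding too
            node.embedding = [lowered.count(vowel) / total for vowel in "aeiou"]
        return nodes


existing = Node(text="banana")
result = VowelCountEmbedder().embed([existing, "kiwi"])
print(result[0] is existing, existing.embedding)
```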

## PromptSource

`PromptSource` provides an interface to browse and search a collection of effective prompts. This way, you can test multiple prompt templates for your use-case and find the one that works best for you.

In [None]:
class PromptSource(abc.ABC):
    @abc.abstractmethod
    def find(self, keywords: str | List[str]) -> List[PromptTemplate]: ...
    
    @abc.abstractmethod
    def list_all(self) -> List[PromptTemplate]: ...    

## VectorDB

In [None]:
class VectorDB(abc.ABC):
    @abc.abstractmethod
    def upsert(self, collection_name: str, nodes: List[Node]) -> List[Node]: ...

    @abc.abstractmethod
    def query(
        self, collection_name: str, embedding: Embedding, filter: Optional[Dict[str, Any]], **kwargs: Dict[str, Any]
    ) -> List[Node]: ...

`VectorDB` has two main methods: `upsert` and `query`.

`upsert` takes in a list of `Node` and inserts or updates the underlying vector database with the `text`, `metadata` and `embedding` of each `Node` object. These can later be queried by property or by vector search.

`query` lets you query the underlying vector database with the given embedding and property filters. The property filters use the `MongoDB` query syntax and are not tied to a specific vector database. The `VectorDB` implementation transforms them into database-specific filters.
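The sketch below shows what a MongoDB-style property filter combined with a vector search could look like against a toy in-memory store. It is an illustration under several assumptions: `InMemoryVectorDB`, `matches`, `cosine` and the `limit` parameter are hypothetical, only equality, `$in` and `$gte` are handled, nodes are keyed by their text for simplicity, and `Node` is a dataclass stand-in.

```python
import math
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional

Embedding = List[float]


@dataclass
class Node:
    """Stand-in for bodhilib's Node model."""

    text: str
    embedding: Embedding
    metadata: Dict[str, Any] = field(default_factory=dict)


def matches(metadata: Dict[str, Any], query_filter: Optional[Dict[str, Any]]) -> bool:
    """Evaluate a small subset of MongoDB query syntax: equality, $in, $gte."""
    for key, condition in (query_filter or {}).items():
        value = metadata.get(key)
        if isinstance(condition, dict):
            if "$in" in condition and value not in condition["$in"]:
                return False
            if "$gte" in condition and (value is None or value < condition["$gte"]):
                return False
        elif value != condition:
            return False
    return True


def cosine(a: Embedding, b: Embedding) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class InMemoryVectorDB:
    """Hypothetical store: one dict of nodes per collection, keyed by node text."""

    def __init__(self) -> None:
        self._collections: Dict[str, Dict[str, Node]] = {}

    def upsert(self, collection_name: str, nodes: List[Node]) -> List[Node]:
        collection = self._collections.setdefault(collection_name, {})
        for node in nodes:
            collection[node.text] = node  # insert, or overwrite an existing entry
        return nodes

    def query(self, collection_name: str, embedding: Embedding,
              filter: Optional[Dict[str, Any]] = None, limit: int = 3) -> List[Node]:
        stored = self._collections.get(collection_name, {}).values()
        candidates = [node for node in stored if matches(node.metadata, filter)]
        # most similar first
        candidates.sort(key=lambda node: cosine(node.embedding, embedding), reverse=True)
        return candidates[:limit]


db = InMemoryVectorDB()
db.upsert("articles", [
    Node("SpaceX launch", [1.0, 0.0], {"year": 2023, "topic": "space"}),
    Node("Tesla earnings", [0.0, 1.0], {"year": 2023, "topic": "cars"}),
    Node("Old rocket news", [0.9, 0.1], {"year": 2015, "topic": "space"}),
])
hits = db.query("articles", [1.0, 0.0], filter={"topic": "space", "year": {"$gte": 2020}})
print([node.text for node in hits])
```

A real implementation would translate the filter dict into the target database's own filter objects instead of evaluating it in Python.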

## LLM

In [None]:
class LLM(abc.ABC):
    @abc.abstractmethod
    def generate(
        self,
        prompt_input: SerializedInput,
        *,
        stream: Optional[bool] = None,
        **kwargs: Dict[str, Any]) -> Union[Prompt, PromptStream]: ...

The `LLM` has a `generate` method that takes in a flexible `SerializedInput`. It returns a `Prompt` when `stream` is `False`, and a `PromptStream` when `stream` is `True`.

So, any of the following calls are valid and will produce a response:

In [None]:
llm.generate("tell me a joke")
llm.generate(["tell me a joke", "joke should be related to architects"])
llm.generate(Prompt("tell me a joke"))
llm.generate([Prompt("you are a helpful AI assistant.", role="system"), Prompt("tell me a joke")])
llm.generate({"text": "tell me a joke", "role": "user"})
llm.generate([{"text": "you are a helpful AI assistant.", "role": "system"}, {"text": "tell me a joke", "role": "user"}])
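The effect of the `stream` flag can be sketched with a toy model. `EchoLLM` is hypothetical and only accepts a plain string; the `Prompt` dataclass is a stand-in for the bodhilib model, and a plain iterator of `Prompt` stands in for `PromptStream`.

```python
from dataclasses import dataclass
from typing import Iterator, Union


@dataclass
class Prompt:
    """Stand-in for bodhilib's Prompt model."""

    text: str
    role: str = "user"
    source: str = "input"


class EchoLLM:
    """Hypothetical LLM that answers by upper-casing the prompt text."""

    def generate(
        self, prompt_input: str, *, stream: bool = False
    ) -> Union[Prompt, Iterator[Prompt]]:
        reply = prompt_input.upper()
        if stream:
            # streaming: hand back the reply one word at a time
            return (Prompt(word, role="ai", source="output") for word in reply.split())
        return Prompt(reply, role="ai", source="output")


llm = EchoLLM()
whole = llm.generate("tell me a joke")
chunks = list(llm.generate("tell me a joke", stream=True))
print(whole.text, [chunk.text for chunk in chunks])
```

With streaming, the caller can start displaying text as soon as the first chunk arrives instead of waiting for the full response.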

## Composability and RAG use-case

The `bodhilib` library is designed with composability in mind. It borrows many ideas from purely functional languages such as `Haskell` in the design and implementation of its interfaces.

Using the bodhilib library, you can simplify the ingestion phase of your RAG process as follows:

In [None]:
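from functools import partial

from fn import F  # fn.py library

data_loader = get_data_loader("file")
splitter = get_splitter("sentence_splitter")
embedder = get_embedder("sentence_embedder")
vector_db = get_vector_db("qdrant", location=":memory:")

data_loader.add_resource(dir="./data")

# F(...) >> F(...) builds a composed function; nothing runs until it is called
ingest = (
    F(data_loader.load)
    >> F(splitter.split)
    >> F(embedder.embed)
    # upsert also needs a collection name; reuse the one queried below
    >> F(partial(vector_db.upsert, "articles_collection"))
)
result = ingest()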

To query your VectorDB, you can compose the steps like this:

In [None]:
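query = "Who is the CEO of SpaceX?"
# find returns a list of templates; take the first match
template = get_prompt_source("bodhiprompts").find("extractive_qna")[0]

answer_query = (
    F(embedder.embed)
    # embed returns a list of nodes; the query needs the embedding itself
    >> F(lambda nodes: nodes[0].embedding)
    >> F(lambda embedding: vector_db.query("articles_collection", embedding, filter=None))
    >> F(lambda nodes: [node.text for node in nodes])
    >> F(lambda texts: {"context": "\n\n".join(texts)})
    # to_prompt takes keyword arguments, so unpack the variables dict
    >> F(lambda variables: template.to_prompt(query=query, **variables))
    >> F(llm.generate)
)
answer = answer_query(query)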
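Both pipelines rely on the fact that `>>` on `F` objects builds a new function rather than running anything. The sketch below mimics that behaviour with a minimal stand-in for fn.py's `F`, so it runs without the library installed; the three stage functions are hypothetical placeholders for load, split and embed.

```python
from typing import Any, Callable, List, Tuple


class F:
    """Minimal stand-in for fn.py's F: wraps a callable and composes with >>."""

    def __init__(self, func: Callable[..., Any]) -> None:
        self.func = func

    def __rshift__(self, other: Callable[..., Any]) -> "F":
        # (self >> other)(x) == other(self(x))
        return F(lambda *args, **kwargs: other(self(*args, **kwargs)))

    def __call__(self, *args: Any, **kwargs: Any) -> Any:
        return self.func(*args, **kwargs)


def load() -> List[str]:
    return ["a b", "c"]


def split(docs: List[str]) -> List[str]:
    return [word for doc in docs for word in doc.split()]


def embed(words: List[str]) -> List[Tuple[str, float]]:
    return [(word, float(ord(word))) for word in words]


pipeline = F(load) >> F(split) >> F(embed)  # still just a function
result = pipeline()                          # runs the stages left to right
print(result)
```

Keeping the pipeline as a value means it can be built once and called repeatedly, for example once per data directory or once per query.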
