### Simple RAG with litesearch
> We will build a simple RAG pipeline with litesearch. Do not be deceived: there is a lot of heavy lifting going on under the hood for very little code.

In [16]:
from fastlite import *
import numpy as np
from litesearch import *
import pymupdf
from chonkie import AutoEmbeddings, Pipeline

In [21]:
db:Database=database('breugel.db')
store=db.get_store()

#### Ingest PDF documents
> We will ingest the PDFs from the `pdfs` folder.

In [22]:
pdfs = globtastic('pdfs')
pdfs

(#9) ['pdfs/dir_1993_13_2022-05-28_eng.pdf','pdfs/dir_1993_83_2019-06-06_eng.pdf','pdfs/dir_1996_9_2019-06-06_eng.pdf','pdfs/dir_2006_112_2022-07-01_eng.pdf','pdfs/pdfa2a','pdfs/reg_2021_241_2024-03-01_eng.pdf','pdfs/reg_2021_523_2024-03-01_eng.pdf','pdfs/reg_2021_694_2023-09-21_eng.pdf','pdfs/reg_2021_695_2024-03-01_eng.pdf']

Now let's load the pdfs into the db. We will write a quick chunking function using chonkie.

In [28]:
chunk_fn = (Pipeline()
            .chunk_with('recursive', tokenizer='gpt2', chunk_size=2048)
            .chunk_with('semantic', chunk_size=1024)
            .refine_with('overlap', context_size=128)
            .refine_with('embeddings', embedding_model='minishlab/potion-retrieval-32M')).run
def doc2content(pth):
    doc=pymupdf.open(pth)
    def idx2content(i):
        def chunk2content(chunk):
            meta = dict(tokens=chunk.token_count, start_index=chunk.start_index, end_index=chunk.end_index, context=chunk.context, pg_no=i+1)
            meta.update(doc.metadata)
            return dict(content=chunk.text, embedding=chunk.embedding.tobytes(), metadata=meta)
        return L(chunk_fn(doc[i].get_text()).chunks).map(chunk2content)
    return L(range(len(doc))).map(idx2content).concat()
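
To make `chunk_size` and `context_size` concrete, here is a minimal pure-Python sketch of chunking with overlap context. It is only an illustration under simplifying assumptions: it splits on whitespace instead of using a real tokenizer, and the helper `chunk_with_overlap` is hypothetical, not part of chonkie, whose refinery may attach context differently.

```python
def chunk_with_overlap(text, chunk_size=8, context_size=2):
    """Split text into chunks of `chunk_size` whitespace tokens and attach the
    last `context_size` tokens of the preceding text as context.
    Hypothetical helper for illustration; not chonkie's implementation."""
    tokens = text.split()
    chunks = []
    for start in range(0, len(tokens), chunk_size):
        body = tokens[start:start + chunk_size]
        # Context is the tail of whatever came just before this chunk.
        context = tokens[max(0, start - context_size):start]
        chunks.append(dict(text=' '.join(body),
                           context=' '.join(context),
                           token_count=len(body)))
    return chunks

sample = ' '.join(f'w{i}' for i in range(20))
chunks = chunk_with_overlap(sample)
for c in chunks:
    print(c)
```

The first chunk has no context because nothing precedes it. Every later chunk carries a short tail of the previous one, which helps a retrieved chunk make sense on its own.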

In [29]:
for p in pdfs: store.insert_all(doc2content(p))
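
Each row stores its embedding as raw bytes, so it helps to see that round trip in isolation. The sketch below uses only the standard library: `array('f')` stands in for a `np.float32` vector, and the `demo_store` table is a made-up layout for illustration, not litesearch's actual schema.

```python
import json
import sqlite3
from array import array

# 'f' is a 4-byte float, playing the role of np.float32 here.
vec = array('f', [0.25, -1.5, 3.0])
row = dict(content='hello',
           embedding=vec.tobytes(),  # same idea as chunk.embedding.tobytes()
           metadata=json.dumps(dict(tokens=1, pg_no=1)))

# Hypothetical table for the demo only.
con = sqlite3.connect(':memory:')
con.execute('create table demo_store '
            '(id integer primary key, content text, embedding blob, metadata text)')
con.execute('insert into demo_store (content, embedding, metadata) '
            'values (:content, :embedding, :metadata)', row)

blob, meta = con.execute('select embedding, metadata from demo_store').fetchone()

# Decoding needs the same element type that was used when writing the bytes.
restored = array('f')
restored.frombytes(blob)
print(len(blob), restored.tolist(), json.loads(meta))
```

The bytes carry no type information, which is presumably why the search call later passes `dtype=np.float32` explicitly.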

Cool, let's search through these contents

To embed the query, we first need to initialise the same embedding model that was used during ingestion.

In [None]:
emb_fn = AutoEmbeddings().get_embeddings('minishlab/potion-retrieval-32M').embed

In [70]:
search = bind(db.search, columns=['content','id'], limit=15,dtype=np.float32)
q = 'economic growth in Europe'

In [71]:
res=search(q, emb_fn(q).tobytes())
first(res)

```python
{ 'content': 'Innovation requires a systemic, cross-cutting and multifaceted '
             "approach. Europe's \n"
             'economic progress, social welfare and quality of life rely on '
             'its ability to boost productivity and growth, which, in ',
  'id': 3988}
```
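
The vector half of a search like this can be pictured as ranking stored embeddings by their similarity to the query embedding. The sketch below is a minimal pure-Python version using cosine similarity on toy two-dimensional vectors. How litesearch actually scores and combines text and vector matches is not shown here, so treat `cosine` and `top_k` as hypothetical helpers rather than its implementation.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, rows, k=2):
    """Return the k (score, id) pairs with the highest similarity."""
    scored = [(cosine(query_vec, vec), doc_id) for doc_id, vec in rows]
    return sorted(scored, reverse=True)[:k]

# Toy (id, embedding) rows; real embeddings have hundreds of dimensions.
rows = [(1, [1.0, 0.0]), (2, [0.7, 0.7]), (3, [0.0, 1.0])]
hits = top_k([1.0, 0.1], rows)
print(hits)
```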

You do not need to, but you can now rerank these results with FlashRank. The potion-retrieval model is a static embedding model distilled from a BGE model, with strong retrieval capabilities, whereas FlashRank scores each query and passage pair with a cross-encoder.

In [61]:
from flashrank import Ranker, RerankRequest

It's a good idea to find the maximum token count of the documents in the db so that we can set the max length of the ranker accordingly. Setting it too high hurts performance, so a good rule of thumb is max tokens plus 100-200.

In [62]:
max_token = int(db.q("select max(json_extract(metadata, '$.tokens')) as m from store")[0]['m'])
ranker = Ranker(max_length=max_token+150)
res1=ranker.rerank(RerankRequest(q, L(res).map(lambda r: dict(text=r.content, id=r.id))))
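
The rule of thumb above can be sketched without a database. The helper `ranker_max_length` and the token counts below are made up for illustration. The margin presumably leaves headroom for the query and special tokens, and for the ranker's tokenizer counting differently from the one used at chunking time.

```python
import json

def ranker_max_length(metadata_rows, margin=150):
    """Largest stored chunk token count plus a fixed headroom.
    Hypothetical helper; `margin` follows the 100-200 rule of thumb."""
    longest = max(int(json.loads(m)['tokens']) for m in metadata_rows)
    return longest + margin

# Made-up metadata rows shaped like the JSON stored per chunk.
metadata_rows = ['{"tokens": 312, "pg_no": 1}',
                 '{"tokens": 497, "pg_no": 2}',
                 '{"tokens": 88, "pg_no": 3}']
max_len = ranker_max_length(metadata_rows)
print(max_len)
```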

In [66]:
print(first(res1))

{'text': "Europe's \neconomic progress, social welfare and quality of life rely on its ability to boost productivity and growth, which, in \nturn, depends heavily on its ability to innovate. Innovation is also key to solving the major challenges that lie ahead \nfor the Union. Innovation must be responsible, ethical and sustainable.\nEN\nOfficial Journal of the European Union \nL 167 I/64                                                                                                                      ", 'id': 3989, 'score': np.float32(0.99915826)}


Now let's compare the litesearch ranking (left column) with the FlashRank ranking (right column).

In [72]:
for r1,r2 in zip(res,res1): print(r1['id'], r2['id'])

3988 3989
3349 3700
3989 3988
3700 3612
3612 3349
3490 3490
1 5
2 4
3 1
4 3
5 6
6 8
7 2
8 9
9 7
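
One way to put numbers on this comparison is to measure how much the two top-k sets overlap and how far each id moved. The sketch below reuses the ids printed above. `overlap_at_k` and `rank_shifts` are hypothetical helper names, and this is just one simple way to compare rankings.

```python
# Ids in the order printed above: litesearch (left), FlashRank (right).
before = [3988, 3349, 3989, 3700, 3612, 3490, 1, 2, 3, 4, 5, 6, 7, 8, 9]
after = [3989, 3700, 3988, 3612, 3349, 3490, 5, 4, 1, 3, 6, 8, 2, 9, 7]

def overlap_at_k(a, b, k):
    """Fraction of ids shared by the top-k of both rankings."""
    return len(set(a[:k]) & set(b[:k])) / k

def rank_shifts(a, b):
    """Map each id to (position in b) - (position in a); positive means it moved down."""
    pos = {doc_id: i for i, doc_id in enumerate(b)}
    return {doc_id: pos[doc_id] - i for i, doc_id in enumerate(a) if doc_id in pos}

print(overlap_at_k(before, after, 3), overlap_at_k(before, after, 5))
print(rank_shifts(before, after))
```

On these ids the top five contain the same documents in both rankings, just in a different order, so the reranker mostly reshuffles the head of the list.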


So there you have it: a simple RAG pipeline.