# Long Document (Book) Search Tutorial

### Ensure Roo-VectorDB has been installed

**Please verify that Roo-VectorDB is installed before running any tutorial. See the Installation section of the main README file for installation instructions.**

### Download data files

In this tutorial, we demonstrate how to perform long-document search using Roo-VectorDB. As an example, we treat books as long documents, and show how to run searches at different levels of granularity—such as chapters, sections, paragraphs, or the entire book.
We use the [Pile](https://huggingface.co/datasets/EleutherAI/pile) dataset (specifically the *Books1* subset) as the source of data. This subset contains approximately 18,000 books, each ranging from 100,000 to 1,000,000 words. For this tutorial, we randomly sample 1,000 books from the dataset.
Because a single book is too large to store as one record—making both storage and search inefficient—we preprocess the books using [LangChain TextSplitter](https://python.langchain.com/docs/concepts/text_splitters/). Each book is split into 1024‑token chunks with a 30% overlap between adjacent chunks. Each chunk is stored as an individual row in the database table. The preprocessed dataset can be downloaded here:

1. Preprocessed Pile Books1 data file: [pile_book1.jsonl](https://rooagi8-my.sharepoint.com/:u:/g/personal/chaoma_rooagi_com/EUNiTYeTIPNKrr-NI4t-BOcB9mlh_15NIVNSIF7D75RztA?e=AecDqa) (506.5MB)
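
As a rough illustration of the chunking described above, the sketch below splits a token list into fixed-size windows with a fractional overlap. It is not the preprocessing script used to build `pile_book1.jsonl`: the function name `split_with_overlap` is hypothetical, and whitespace-separated words stand in for real tokenizer tokens.

```python
def split_with_overlap(tokens, chunk_size=1024, overlap_ratio=0.3):
    """Split a token list into windows of at most chunk_size tokens.

    Adjacent windows share about overlap_ratio * chunk_size tokens.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap_ratio < 1:
        raise ValueError("overlap_ratio must be in [0, 1)")
    overlap = int(chunk_size * overlap_ratio)
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the text
    return chunks


# Whitespace "tokens" are an assumption made only for this illustration.
words = ("lorem ipsum " * 1500).split()  # 3000 pseudo-tokens
chunks = split_with_overlap(words)
print(len(chunks), [len(c) for c in chunks])
```

With a 30% overlap, each window advances by roughly 70% of the chunk size, so a book produces about 1.4 times as many chunks as it would without overlap.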

Please verify the file size after downloading to ensure the download completed successfully and the file is not corrupted.
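
One way to check the download is sketched below: it reports the file size and counts how many lines parse as JSON objects with the four fields that this tutorial reads later (`book_id`, `chunk_id`, `chunk_text`, `book_title`). The helper name `summarize_jsonl` is hypothetical, and the size is computed with 1 MB = 10^6 bytes, which may differ from the convention behind the figure listed above.

```python
import json
import os


def summarize_jsonl(path, max_lines=None):
    """Return (size_in_mb, n_valid, n_invalid) for a JSON Lines file."""
    required = {"book_id", "chunk_id", "chunk_text", "book_title"}
    n_valid = n_invalid = 0
    with open(path, "r", encoding="utf-8") as f:
        for i, line in enumerate(f):
            if max_lines is not None and i >= max_lines:
                break
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                n_invalid += 1
                continue
            # A usable record is a JSON object that has every required field
            if isinstance(record, dict) and required.issubset(record):
                n_valid += 1
            else:
                n_invalid += 1
    size_mb = os.path.getsize(path) / 1e6
    return size_mb, n_valid, n_invalid


data_path = "./pile_book1.jsonl"
if os.path.exists(data_path):
    # Only the first 2000 lines are inspected to keep the check quick
    print(summarize_jsonl(data_path, max_lines=2000))
```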

### Setup

In [1]:
# Set PostgreSQL login info here; replace each <...> placeholder with your own value
PG_USERNAME = <YOUR-USER-NAME> # for example: 'ann'
PG_DBNAME = <YOUR-DBNAME> # for example: 'ann'
PG_HOST = <YOUR-HOST> # for example: 'localhost' 
PG_PORT = <YOUR-PORT> # for example: 58432
PG_PSWORD = <YOUR-USER-PASSWORD>

# Set path of demo data file
embedding_info = (
    "pilebook1_dim768", # name of table in postgres 
    "./pile_book1.jsonl" # demo data file
)
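
If you would rather not hard-code credentials in the notebook, the settings can be read from environment variables instead. The sketch below is one possible approach, not part of Roo-VectorDB: the function `load_pg_settings` and the `ROO_PG_*` variable names are hypothetical, and the defaults simply reuse the example values from the comments above.

```python
import os


def load_pg_settings(env=os.environ):
    """Read connection settings from environment variables (hypothetical names)."""
    return {
        "user": env.get("ROO_PG_USER", "ann"),
        "dbname": env.get("ROO_PG_DBNAME", "ann"),
        "host": env.get("ROO_PG_HOST", "localhost"),
        "port": int(env.get("ROO_PG_PORT", "58432")),
        "password": env.get("ROO_PG_PASSWORD", ""),
    }


settings = load_pg_settings()
# Avoid printing the password when inspecting the settings
print({k: v for k, v in settings.items() if k != "password"})
```

The dictionary keys match the keyword arguments passed to `psycopg.connect` in this tutorial (`user`, `dbname`, `host`, `port`, `password`), so the values can be assigned to the `PG_*` variables or passed on directly.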

### Load Text Embedding Model

In [2]:
import sentence_transformers
import pickle
import os
import torch
# Force CPU inference by making torch report that no GPU is available
torch.cuda.is_available = lambda: False

SENT_EMBED_MODEL_PATH = "sent_embedding_model_cache.pickle"

def save_model(model, fn=SENT_EMBED_MODEL_PATH):
    # Cache the downloaded model on disk so later runs can skip the download
    with open(fn, 'wb') as fp:
        pickle.dump(model, fp)

class MySentenceEmbeddingModel:

    def __init__(self):
        self.model = self.load_if_exist()

    def compute_embedding(self, text_batch):
        # Returns one embedding row per input text
        return self.model.encode(text_batch)

    def load_if_exist(self, fn=SENT_EMBED_MODEL_PATH):
        # Reuse the pickled model if it exists; otherwise download and cache it
        if os.path.exists(fn):
            with open(fn, 'rb') as fp:
                model = pickle.load(fp)
                print("model loaded from ", fn)
                return model
        else:
            print("downloading model")
            model = sentence_transformers.SentenceTransformer("all-mpnet-base-v2")
            save_model(model, fn=fn)
            return model
        
    def get_dimension(self):
        return 768



### Prepare demo data

In [3]:
import sys
sys.path.append('../python')
import random
import json
import os
import psycopg
import roovector.psycopg as roovec_psycopg

def cleanup_chunk_text(text):
    line_text = [line for line in text.splitlines() if not (line.strip() == "")]
    return " [NEWLINE] ".join(line_text)

class LongDocumentSearchDemo(object):

    def __init__(self, tb_name, data_fn):
        self.pg_conn = self.make_connection()
        self.cur = self.pg_conn.cursor()
        self.table_name = tb_name
        self.n_rows = 0
        self.data_fn = data_fn
        self.sentemb_model = MySentenceEmbeddingModel()
        self.dimension = self.sentemb_model.get_dimension()

    def make_connection(self):
        conn = psycopg.connect(user=PG_USERNAME, dbname=PG_DBNAME, host=PG_HOST, port=PG_PORT, password=PG_PSWORD, autocommit=True)
        roovec_psycopg.register_roovector(conn)
        return conn

    def prepare_table(self, copy_data=True):
        self.cur.execute("DROP TABLE IF EXISTS %s" % self.table_name)
        self.cur.execute(
            "CREATE TABLE %s (book_id int, chunk_id int, chunk_text varchar, book_title varchar, embedding roovector(%d))" % (self.table_name, self.dimension))
        storage_fmt = "PLAIN"
        if self.dimension > 2000:
            storage_fmt = "EXTENDED"
        self.cur.execute("ALTER TABLE %s ALTER COLUMN embedding SET STORAGE %s" % (self.table_name, storage_fmt))

        if copy_data:
            print("copying data...")
            with self.cur.copy(f"COPY {self.table_name} (book_id, chunk_id, chunk_text, book_title, embedding) FROM STDIN WITH (FORMAT BINARY)") as copy:
                copy.set_types(["int4", "int4", "varchar", "varchar", "roovector"])

                cnt = 0
                lncnt = 0

                batch_size = 256
                batch_tuples = []
                batch_texts = []
                with open(self.data_fn, 'r') as f:
                    for line in f:
                        try:
                            json_obj = json.loads(line)
                            chunk_text = json_obj['chunk_text']
                            chunk_id = json_obj['chunk_id']
                            book_id = json_obj['book_id']
                            book_title = json_obj['book_title']

                            batch_tuples.append((book_id, chunk_id, cleanup_chunk_text(chunk_text), book_title))
                            batch_texts.append(chunk_text)
                            cnt += 1

                            if len(batch_texts) >= batch_size:
                                batch_embedding = self.sentemb_model.compute_embedding(batch_texts)
                                print("Computed", cnt, "rows")
                                print("Embedding matrix shape:", batch_embedding.shape)
                                for j in range(0, len(batch_texts)):
                                    copy.write_row((batch_tuples[j][0], batch_tuples[j][1], batch_tuples[j][2], batch_tuples[j][3], batch_embedding[j].tolist()))
                                batch_texts.clear()
                                batch_tuples.clear()

                            # Demo limit: stop after the first 1024 valid lines of the data file
                            lncnt += 1
                            if lncnt >= 1024:
                                break
                        except json.JSONDecodeError as e:
                            print(f"Error decoding JSON on line: {line.strip()} - {e}")

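                    # Flush the final partial batch, if any, so that no rows are dropped
                    if batch_texts:
                        batch_embedding = self.sentemb_model.compute_embedding(batch_texts)
                        for j in range(0, len(batch_texts)):
                            copy.write_row((*batch_tuples[j], batch_embedding[j].tolist()))
                    self.n_rows = cnt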
                print("done writing table!")

    def create_index_ivfflat(self, nlists, nprobes, force_use_index=True):
        print("creating index...")
        self.cur.execute(
            "CREATE INDEX demo_index ON %s USING roo_ivfflat (embedding roovector_cosine_ops) WITH (lists = %d)" % (
                self.table_name, nlists))
        self.cur.execute("SET roo_ivfflat.probes = %d" % nprobes)
        print("done index creation!")
        if force_use_index:
            self.cur.execute("SET enable_seqscan=false")

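    # query for the books most similar to the query_text; each book is scored by the
    # cosine distance of its closest chunk (returned as "similarity"; smaller is more similar)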
    def query_for_book_title(self, query_text, k):
        qxs = self.sentemb_model.compute_embedding([query_text])
        query_stm = "SELECT book_id, book_title, MIN(embedding <=> '%s') AS similarity FROM %s GROUP BY book_id, book_title ORDER BY similarity LIMIT %s"
        self.cur.execute(query_stm % ( str(qxs[0].tolist()), self.table_name, k), binary=True, prepare=True)
        return self.cur.fetchall()

    # query for the most similar text chunk to the query_text
    def query_for_text_segment(self, query_text, k):
        qxs = self.sentemb_model.compute_embedding([query_text])
        query_stm = "SELECT book_id, book_title, chunk_text FROM %s ORDER BY embedding <=> '%s' LIMIT %s"
        self.cur.execute(query_stm % (self.table_name, str(qxs[0].tolist()), k), binary=True, prepare=True)
        return self.cur.fetchall()

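    # query for the most similar text chunks together with their distances, intended to
    # return distinct books; because the GROUP BY includes chunk_id, a book can still appear more than once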
    def query_for_text_segment_distinict_book(self, query_text, k):
        qxs = self.sentemb_model.compute_embedding([query_text])
        query_stm = "SELECT book_id, book_title, chunk_id, chunk_text, MIN(embedding <=> '%s') AS similarity FROM %s GROUP BY book_id, book_title, chunk_id, chunk_text ORDER BY similarity LIMIT %s"
        self.cur.execute(query_stm % ( str(qxs[0].tolist()), self.table_name, k), binary=True, prepare=True)
        return self.cur.fetchall()
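
The methods above build SQL by interpolating the table name with `%`. That is fine for a fixed demo table, but if the name ever comes from user input it is worth validating first. The sketch below is one simple, assumed approach: the helper `check_identifier` is hypothetical, it accepts only plain unquoted identifiers, and the 63-character limit is PostgreSQL's default maximum identifier length.

```python
import re

# Plain unquoted identifier: a letter or underscore, then letters, digits, underscores
_IDENTIFIER_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def check_identifier(name, max_length=63):
    """Return name unchanged if it is a plain SQL identifier, else raise ValueError."""
    if not isinstance(name, str) or not _IDENTIFIER_RE.fullmatch(name):
        raise ValueError(f"unsafe SQL identifier: {name!r}")
    if len(name) > max_length:
        raise ValueError(f"identifier longer than {max_length} characters: {name!r}")
    return name


print(check_identifier("pilebook1_dim768"))
```

A call such as `check_identifier(tb_name)` could be placed in `__init__` before the table name is stored.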
    

### Create the demo object

In [4]:
demo = LongDocumentSearchDemo(embedding_info[0],
                              embedding_info[1])

model loaded from  sent_embedding_model_cache.pickle


### Create table and load data

In [5]:
demo.prepare_table()

print("Total number of rows:", demo.n_rows)

copying data...
Computed 256 rows
Embedding matrix shape: (256, 768)
Computed 512 rows
Embedding matrix shape: (256, 768)
Computed 768 rows
Embedding matrix shape: (256, 768)
Computed 1024 rows
Embedding matrix shape: (256, 768)
done writing table!
Total number of rows: 1024


### Build index for approximate vector search

In [6]:
# Choose parameters for IVF-Flat approximate vector search:
# nlists = number of clusters (lists) in the index, nprobes = number of lists scanned per query
nlists = 1000
nprobes = 10

In [7]:
demo.create_index_ivfflat(nlists, nprobes, force_use_index=False)

creating index...
done index creation!
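
The values of `nlists` and `nprobes` trade recall against speed. As a starting point, the sketch below encodes the rule of thumb documented for pgvector's IVFFlat index (lists = rows / 1000 up to one million rows, sqrt(rows) beyond that, and probes = sqrt(lists)). Whether the same guidance applies to `roo_ivfflat` is an assumption, and the helper name `suggest_ivfflat_params` is hypothetical.

```python
import math


def suggest_ivfflat_params(n_rows):
    """Return (lists, probes) using the pgvector IVFFlat rule of thumb."""
    if n_rows <= 0:
        raise ValueError("n_rows must be positive")
    if n_rows <= 1_000_000:
        lists = max(1, n_rows // 1000)
    else:
        lists = max(1, math.isqrt(n_rows))
    probes = max(1, math.isqrt(lists))
    return lists, probes


for n in (1024, 1_000_000, 4_000_000):
    print(n, suggest_ivfflat_params(n))
```

For the 1,024 rows loaded in this demo, the rule suggests far fewer lists than the `nlists = 1000` used above, so treat both sets of numbers as starting points and tune them on the full dataset.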


In [8]:
def decorate_text(text):
    paragraphs = text.split(' [NEWLINE] ')
    paragraphs2 = ["<p style=\"text-align:left\">" + par + "</p>" for par in paragraphs]
    paragraphs2 = paragraphs2[0:min(3, len(paragraphs2))]
    return "".join(paragraphs2) +  "...... <em>[remaining content not displayed]</em>"

In [9]:
def display_table(data, header=None):
    from IPython.display import HTML, display
    html = "<table>"
    
    if header is not None:
        html += "<tr>"
        for name in header:
            html += "<td><b>%s</b></td>"%(name)
        html += "</tr>"
    
    for row in data:
        html += "<tr>"
        for field in row:
            html += "<td>%s</td>"%(field)
        html += "</tr>"
    html += "</table>"
    display(HTML(html))

### Prepare a query text

In [10]:
query_text = "It does not do to dwell on dreams and forget to live, remember that."

### Query Type 1: Search for similar book paragraphs that may originate from the same book

Search for the `topk` paragraphs (chunks) most similar to `query_text`. Several of the returned paragraphs may come from the same book.
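
The queries in this tutorial rank rows with the `<=>` operator and later report `1.0 - distance` as the similarity, which is consistent with `<=>` being a cosine distance. The pure-Python sketch below shows that ranking on toy vectors; it is an illustration of the idea under that assumption, not the database's implementation, and the function names are hypothetical.

```python
import math


def cosine_distance(a, b):
    """Cosine distance: 0 for the same direction, 1 for orthogonal, 2 for opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


def top_k(query, rows, k):
    """rows is a list of (label, vector); return the k labels closest to query."""
    ranked = sorted(rows, key=lambda row: cosine_distance(query, row[1]))
    return [label for label, _ in ranked[:k]]


toy_rows = [("a", [1.0, 0.0]), ("b", [0.0, 1.0]), ("c", [1.0, 1.0])]
print(top_k([1.0, 0.1], toy_rows, 2))
```

An exact scan like this compares the query with every row; the IVF-Flat index avoids that by scanning only the lists closest to the query.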

In [11]:
topk = 10

In [12]:
import time

start_time = time.time()
results = demo.query_for_text_segment(query_text, topk)
total_time = time.time() - start_time
print("Query time:", total_time, "seconds")

Query time: 0.030430316925048828 seconds
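
Note that `time.time()` returns seconds, so a reading of about 0.03 corresponds to roughly 30 milliseconds. For steadier measurements, a small helper along the following lines can repeat the call and report the best run in milliseconds. The helper name `time_call` is hypothetical, and the first run may include one-off costs such as statement preparation.

```python
import time


def time_call(fn, *args, repeats=5, **kwargs):
    """Call fn several times; return (last_result, best_elapsed_milliseconds)."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()  # monotonic, high-resolution clock
        result = fn(*args, **kwargs)
        best = min(best, time.perf_counter() - start)
    return result, best * 1000.0


# Stand-in workload; in the notebook this could be
# time_call(demo.query_for_text_segment, query_text, topk)
total, elapsed_ms = time_call(sum, range(100_000))
print(f"best of 5 runs: {elapsed_ms:.3f} ms")
```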


In [13]:
res_table = []
for res in results:
    book_id, book_title, text = res
    res_table.append([book_id, book_title, decorate_text(text)])
    
display_table(res_table, header=["Book ID", "Book Title", "Paragraph Text"])

Book ID,Book Title,Paragraph Text
6,The dogs may bark but the caravan moves on the spirit realm,"I also expect to create my own imagined physical environment which reflects, possibly, my past lives; or, more probably, my preferred environment. If the latter, I would be located on a luscious mountain-side, with a fast-flowing river, both visible and audible; yet to be able to see a beach and the sea.What I find mysteriously fascinating is that I have already had a dream of such an environment, after months of pondering what my next temporary home would be like.During this dream, I heard human voices, but they did not come near me – for which I was grateful. This dream offered me a true R&R environment. It is also consistent with my recent life as a recluse in an isolating environment, with absolutely minimal human contact....... [remaining content not displayed]"
6,The dogs may bark but the caravan moves on the spirit realm,"I have achieved spiritual peace through my exposure to the spirit realm. That has led to a deeper understanding of humanity, and its strengths and foibles. Meaningful patterns of significance may be discerned through perusing the complex mesh of inter-twined destinies.My reality now involves 3 dimensions: the physical, the mental, and the spiritual. While the mental can throw light upon the physical, it is the spiritual, the ephemeral, the ethereal realm which illuminates the totality of existence.### The spirit realm and I (Part 1)...... [remaining content not displayed]"
6,The dogs may bark but the caravan moves on the spirit realm,"This life was imposed upon me, but it is acceptable as consistent with the guidance offered by Hinduism. _Hinduism recommends that, once one has completed one's commitments to family and society, one could withdraw from society to live a life of contemplation and meditation._For example, a cave in the Himalayan mountains had been the meditation home for 3 years of the yogi who had come down to Malaya to guide my widowed mother and I about our respective futures. _Years later, when I detected a coherent pattern in my life, I knew that he had been sent to us_. I remember that he was clearly at peace, and apparently unaffected by the cold of the mountain.In my more comfortable retirement 'cave' I too have achieved peace (after a turbulent life). While the dogs do bark (and snap), this caravan will move on, ignoring those who foolishly insist that only their beliefs must prevail. Certainty is, in my experience, not a human condition....... [remaining content not displayed]"
6,The dogs may bark but the caravan moves on the spirit realm,"Since I had been advised by a casual clairvoyant (or seer) to listen to my subconscious for messages from my Spirit Guide, I wonder if my dream was more than wishful thinking. Living in a flat country whose highest mountain is a mere pimple, whose rivers do not seem to flow like those in New Zealand, and whose dry terrain does not attract much rain (except for sudden troubling downpours occasionally), my subconscious may be seeking to compensate for this deprivation by Nature._In my dream, I was on a lush mountain top, with a raging river below on one side and a cliff on the other – which allowed me to see the distant sea and a rocky shore. It was raining, but I do not remember getting wet. I heard voices, yet neither saw nor met anyone. It was as if we were all avoiding one another._ In the morning, I again remembered this compensatory dream. After all, had I not been born and bred in a lush tropical terrain? Had I not enjoyed the years I had lived there?Then, much to my great surprise, **during my sleep a few nights later, I had a thought flitting through my mind. Intuitively, I felt that spirits created their own personal environments in the Afterlife.** Was that message from my Spirit Guide? As a recluse of many years, I am attracted to this possibility....... [remaining content not displayed]"
5,Satan the sworn enemy of mankind,"Examples are provided in the Qur'an of the types of forgetfulness which Satan seeks to inspire in believers. Among these examples are instances of remaining in the company of those who ridicule the verses of the Qur'an. Allah advises the believers to avoid such discussions, and warns them of Satan's propensity to inspire forgetfulness:When you see people engrossed in mockery of Our Signs, turn from them until they start to talk of other things. And if Satan should ever cause you to forget, once you remember, do not stay sitting with the wrongdoers. (Surat al-An`am, 68)Another stipulation recalls that it is only possible to do something if Allah has so ordained it:...... [remaining content not displayed]"
6,The dogs may bark but the caravan moves on the spirit realm,#THE DOGS MAY BARK BUT THE CARAVAN MOVES ON...... [remaining content not displayed]
6,The dogs may bark but the caravan moves on the spirit realm,"The message? Go, with faith, wherever the currents in the ocean of existence take you.### A Seeker wanders and wonders – cremation and compassion**A sofrologist friend (a medico who uses hypnotherapy to treat his patients) sought to help me with my stress.** _Under hypnosis, I learnt to deal with an unhappy memory thus. I was to place this memory on a stage, and close the curtains, saying 'The show is over.' That process did help._...... [remaining content not displayed]"
7,3 book romance bundle loving the bull rider cowboy down unde,"Coercing myself to breathe slowly, I lowered my arm while I tried to remember anything I'd passed on the road; a gas station, a motel or a roadside diner. But I couldn't recall seeing a damn thing for at least an hour. In the other direction, meanwhile, was the unknown. However, it seemed a fair bet that if nothing was behind me for an hour, than something had to be coming up ahead...didn't it?I simply did not know what to do for the best. But I did feel the compulsion to do something, because standing right where I was equalled lying down and accepting I would have to spend the night in the middle of nowhere, with God knows who, and what, for company.So, keeping my phone in my hand and snatching desperate glances at it with every few steps, I began to walk away from the car - hoping that my guess about a truck stop, or something, up ahead was right....... [remaining content not displayed]"
6,The dogs may bark but the caravan moves on the spirit realm,"I was reminded then of _the most significant spirit intervention in my life. I had been pulled out of that metaphoric deep well_ (where I had no thought, no feeling, and no future) into the sunlit realm of normal existence by a young, chatty, attractive, and kind girl. We had eloped and married. We lived in Singapore for a year. _When we returned to Australia, the higher beings in the spirit realm should have been pleased, should they not?_When I had completed my qualifications and obtained employment in the national capital (which was set in a desert), my wife remained in her State-capital metropolitan home-city. **My palm-readers had been right all along!** I then wondered: when would I enjoy a normal stable life?_Through a chance meeting (or, was it?), I met a young woman who shared all my values!_ I could not believe my good fortune. We married; we produced 2 children (but lost others), while experiencing what seemed to be a normal barrage of life-problems. _We enjoyed a happy family life for a quarter of a century._...... [remaining content not displayed]"


### Query Type 2: Search for similar book paragraphs from distinct books

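Search for the paragraphs most similar to `query_text`, with the goal of returning at most one paragraph per book. Note that the query used here groups by `chunk_id` as well as `book_id`, so the results can still contain several paragraphs from the same book.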

In [14]:
start_time = time.time()
results = demo.query_for_text_segment_distinict_book(query_text, topk)
total_time = time.time() - start_time
print("Query time:", total_time, "seconds")

Query time: 0.06387162208557129 seconds


In [15]:
res_table = []
for res in results:
    book_id, book_title, text_id, text, sim = res
    res_table.append([book_id, book_title, text_id, decorate_text(text), 1.0 - sim])
    
display_table(res_table, header=["Book ID", "Book Title", "Paragraph ID", "Paragraph Text", "Similarity"])

Book ID,Book Title,Paragraph ID,Paragraph Text,Similarity
6,The dogs may bark but the caravan moves on the spirit realm,129,"I also expect to create my own imagined physical environment which reflects, possibly, my past lives; or, more probably, my preferred environment. If the latter, I would be located on a luscious mountain-side, with a fast-flowing river, both visible and audible; yet to be able to see a beach and the sea.What I find mysteriously fascinating is that I have already had a dream of such an environment, after months of pondering what my next temporary home would be like.During this dream, I heard human voices, but they did not come near me – for which I was grateful. This dream offered me a true R&R environment. It is also consistent with my recent life as a recluse in an isolating environment, with absolutely minimal human contact....... [remaining content not displayed]",0.3874507114232435
6,The dogs may bark but the caravan moves on the spirit realm,9,"I have achieved spiritual peace through my exposure to the spirit realm. That has led to a deeper understanding of humanity, and its strengths and foibles. Meaningful patterns of significance may be discerned through perusing the complex mesh of inter-twined destinies.My reality now involves 3 dimensions: the physical, the mental, and the spiritual. While the mental can throw light upon the physical, it is the spiritual, the ephemeral, the ethereal realm which illuminates the totality of existence.### The spirit realm and I (Part 1)...... [remaining content not displayed]",0.3695750754818372
6,The dogs may bark but the caravan moves on the spirit realm,126,"This life was imposed upon me, but it is acceptable as consistent with the guidance offered by Hinduism. _Hinduism recommends that, once one has completed one's commitments to family and society, one could withdraw from society to live a life of contemplation and meditation._For example, a cave in the Himalayan mountains had been the meditation home for 3 years of the yogi who had come down to Malaya to guide my widowed mother and I about our respective futures. _Years later, when I detected a coherent pattern in my life, I knew that he had been sent to us_. I remember that he was clearly at peace, and apparently unaffected by the cold of the mountain.In my more comfortable retirement 'cave' I too have achieved peace (after a turbulent life). While the dogs do bark (and snap), this caravan will move on, ignoring those who foolishly insist that only their beliefs must prevail. Certainty is, in my experience, not a human condition....... [remaining content not displayed]",0.3592447852638242
6,The dogs may bark but the caravan moves on the spirit realm,57,"Since I had been advised by a casual clairvoyant (or seer) to listen to my subconscious for messages from my Spirit Guide, I wonder if my dream was more than wishful thinking. Living in a flat country whose highest mountain is a mere pimple, whose rivers do not seem to flow like those in New Zealand, and whose dry terrain does not attract much rain (except for sudden troubling downpours occasionally), my subconscious may be seeking to compensate for this deprivation by Nature._In my dream, I was on a lush mountain top, with a raging river below on one side and a cliff on the other – which allowed me to see the distant sea and a rocky shore. It was raining, but I do not remember getting wet. I heard voices, yet neither saw nor met anyone. It was as if we were all avoiding one another._ In the morning, I again remembered this compensatory dream. After all, had I not been born and bred in a lush tropical terrain? Had I not enjoyed the years I had lived there?Then, much to my great surprise, **during my sleep a few nights later, I had a thought flitting through my mind. Intuitively, I felt that spirits created their own personal environments in the Afterlife.** Was that message from my Spirit Guide? As a recluse of many years, I am attracted to this possibility....... [remaining content not displayed]",0.35543066263199097
5,Satan the sworn enemy of mankind,28,"Examples are provided in the Qur'an of the types of forgetfulness which Satan seeks to inspire in believers. Among these examples are instances of remaining in the company of those who ridicule the verses of the Qur'an. Allah advises the believers to avoid such discussions, and warns them of Satan's propensity to inspire forgetfulness:When you see people engrossed in mockery of Our Signs, turn from them until they start to talk of other things. And if Satan should ever cause you to forget, once you remember, do not stay sitting with the wrongdoers. (Surat al-An`am, 68)Another stipulation recalls that it is only possible to do something if Allah has so ordained it:...... [remaining content not displayed]",0.34251680031518994
6,The dogs may bark but the caravan moves on the spirit realm,0,#THE DOGS MAY BARK BUT THE CARAVAN MOVES ON...... [remaining content not displayed],0.32534212700709497
6,The dogs may bark but the caravan moves on the spirit realm,89,"The message? Go, with faith, wherever the currents in the ocean of existence take you.### A Seeker wanders and wonders – cremation and compassion**A sofrologist friend (a medico who uses hypnotherapy to treat his patients) sought to help me with my stress.** _Under hypnosis, I learnt to deal with an unhappy memory thus. I was to place this memory on a stage, and close the curtains, saying 'The show is over.' That process did help._...... [remaining content not displayed]",0.3238738074227463
7,3 book romance bundle loving the bull rider cowboy down unde,47,"Coercing myself to breathe slowly, I lowered my arm while I tried to remember anything I'd passed on the road; a gas station, a motel or a roadside diner. But I couldn't recall seeing a damn thing for at least an hour. In the other direction, meanwhile, was the unknown. However, it seemed a fair bet that if nothing was behind me for an hour, than something had to be coming up ahead...didn't it?I simply did not know what to do for the best. But I did feel the compulsion to do something, because standing right where I was equalled lying down and accepting I would have to spend the night in the middle of nowhere, with God knows who, and what, for company.So, keeping my phone in my hand and snatching desperate glances at it with every few steps, I began to walk away from the car - hoping that my guess about a truck stop, or something, up ahead was right....... [remaining content not displayed]",0.32269211546731746
6,The dogs may bark but the caravan moves on the spirit realm,117,"I was reminded then of _the most significant spirit intervention in my life. I had been pulled out of that metaphoric deep well_ (where I had no thought, no feeling, and no future) into the sunlit realm of normal existence by a young, chatty, attractive, and kind girl. We had eloped and married. We lived in Singapore for a year. _When we returned to Australia, the higher beings in the spirit realm should have been pleased, should they not?_When I had completed my qualifications and obtained employment in the national capital (which was set in a desert), my wife remained in her State-capital metropolitan home-city. **My palm-readers had been right all along!** I then wondered: when would I enjoy a normal stable life?_Through a chance meeting (or, was it?), I met a young woman who shared all my values!_ I could not believe my good fortune. We married; we produced 2 children (but lost others), while experiencing what seemed to be a normal barrage of life-problems. _We enjoyed a happy family life for a quarter of a century._...... [remaining content not displayed]",0.32081785798073037
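
The results above still list several paragraphs from book 6, because each chunk forms its own group in the query. One way to keep only the closest paragraph per book is to post-process the rows on the client, as sketched below. The helper `best_chunk_per_book` is hypothetical and the sample rows are toy values, not output from the database. A server-side alternative would be PostgreSQL's `SELECT DISTINCT ON (book_id)`, which is not shown here.

```python
def best_chunk_per_book(rows, k):
    """Keep the closest chunk for each book and return the k best books.

    rows: iterable of (book_id, book_title, chunk_id, chunk_text, distance).
    """
    best = {}
    for row in rows:
        book_id, distance = row[0], row[4]
        # Remember only the smallest-distance row seen so far for this book
        if book_id not in best or distance < best[book_id][4]:
            best[book_id] = row
    return sorted(best.values(), key=lambda row: row[4])[:k]


# Toy rows for illustration only
toy_rows = [
    (6, "Book A", 129, "text 1", 0.61),
    (6, "Book A", 9, "text 2", 0.63),
    (5, "Book B", 28, "text 3", 0.66),
    (7, "Book C", 47, "text 4", 0.68),
]
for row in best_chunk_per_book(toy_rows, 2):
    print(row[0], row[2], row[4])
```

When using this approach, fetch more than `topk` rows from the database, since many of the nearest chunks may belong to the same book.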


### Query Type 3: Search for similar book titles

For each book, we find the paragraph most similar to `query_text` and use its cosine distance as the representative distance for the whole book. Books are then ranked by this distance, and the table reports `1 - distance` as the similarity score.

In [16]:
start_time = time.time()
results = demo.query_for_book_title(query_text, topk)
total_time = time.time() - start_time
print("Query time:", total_time, "seconds")

Query time: 0.030543088912963867 seconds


In [17]:
res_table = []
for res in results:
    book_id, book_title, similarity = res
    res_table.append([book_id, book_title, 1.0 - similarity])
    
display_table(res_table, ["Book ID", "Book Title", "Similarity"])

| Book ID | Book Title | Similarity |
|---|---|---|
| 6 | The dogs may bark but the caravan moves on the spirit realm | 0.3874507114232435 |
| 5 | Satan the sworn enemy of mankind | 0.34251680031518994 |
| 7 | 3 book romance bundle loving the bull rider cowboy down unde | 0.32269211546731746 |
| 0 | More haste the marital trials of brother segun | 0.31883724444134565 |
| 1 | The amulet custodian novel 1 | 0.3046301007270835 |
| 4 | Gateway to heaven | 0.2997284799571468 |
| 3 | The units | 0.24236312508583246 |
| 2 | 100 seconds to midnight | 0.2079512426795056 |
| 8 | The arab states and the palestine conflict | 0.13372031547306307 |
