# Full Text Search Tutorial

### Ensure Roo-VectorDB has been installed

**Please verify that Roo-VectorDB is installed before running any tutorial. See the Installation section of the main README for instructions.**

### Download data files

In this tutorial, we use the [Amazon-QA](https://www.kaggle.com/datasets/praneshmukhopadhyay/amazon-questionanswer-dataset) dataset. Each entry in the dataset consists of a question followed by one or more corresponding answers (user reviews). We process the data by enumerating each answer and pairing it with its associated question to form distinct question-answer pairs.
We then create a table with three text columns: `question`, `answer`, and `qacombine`, which holds the question and answer concatenated. Embeddings are computed from this combined text, so vector search captures semantic similarity across both the question and its answer.
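
The pairing step can be sketched on its own, without a database. This is a minimal illustration that assumes the `query` and `pos` field names used by the loader later in this tutorial; `expand_record` and the sample line are hypothetical and are not taken from the dataset.

```python
import json

def expand_record(line):
    """Turn one JSONL line into (question, answer, combined) tuples."""
    record = json.loads(line)
    question = record["query"]
    # One tuple per answer; the combined text is what gets embedded.
    return [(question, ans, question + " " + ans) for ans in record["pos"]]

# Hypothetical sample line, not taken from the dataset.
sample = json.dumps({"query": "Is it waterproof?", "pos": ["Yes.", "Only splash-proof."]})
pairs = expand_record(sample)
for pair in pairs:
    print(pair)
```

A question with two answers therefore produces two rows that share the same `question` text.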

The embedding vectors are generated on-the-fly using a [Sentence Transformer](https://sbert.net/) model. The model is downloaded the first time it is needed and cached locally as a pickle file, so later runs load it from disk.

1. Amazon-QA text data jsonl file: [amazon-qa.jsonl](https://rooagi8-my.sharepoint.com/:u:/g/personal/chaoma_rooagi_com/ETPyi_peQj9Kg_v5RkQF7OwBvS6a2Q1on0gAJV48uPh9Rg?e=QdGapa) (678MB)
<!-- 1. [Optional] Sentence Transformer model file: [sent_embed_all-mpnet-base-v2.pickle](https://rooagi8-my.sharepoint.com/:u:/g/personal/chaoma_rooagi_com/EV1p4jOKJ8lEmBv-zhogEosB3XgCAql8WNIGdAZ5JwxlWQ?e=WvjiTb) (418MB) -->

Please verify the file size after downloading to ensure the download completed successfully and the file is not corrupted.
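
One way to run this check from Python is sketched below. The helper names and the tolerance are hypothetical. The default tolerance is deliberately loose because the listed size does not say whether "MB" means 10^6 or 2^20 bytes.

```python
import os

def size_in_mb(path):
    """Return the file size in megabytes (1 MB = 1,000,000 bytes here)."""
    return os.path.getsize(path) / 1_000_000

def looks_complete(path, expected_mb, tolerance_mb=40.0):
    """True if the file exists and its size is within tolerance of the expected size."""
    return os.path.exists(path) and abs(size_in_mb(path) - expected_mb) <= tolerance_mb

# Only report if the data file is in the current directory.
if os.path.exists("amazon-qa.jsonl"):
    print("size (MB):", round(size_in_mb("amazon-qa.jsonl")))
    print("looks complete:", looks_complete("amazon-qa.jsonl", 678))
```

A size check only catches truncated downloads; it does not detect corrupted content of the right length.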

### Setup

In [9]:
# Set postgres login info here
PG_USERNAME = <YOUR-USER-NAME> # for example: 'ann'
PG_DBNAME = <YOUR-DBNAME> # for example: 'ann'
PG_HOST = <YOUR-HOST> # for example: 'localhost' 
PG_PORT = <YOUR-PORT> # for example: 58432
PG_PSWORD = <YOUR-USER-PASSWORD>

# Set path of sentence embedding model file
SENT_EMBED_MODEL_PATH = "sent_embed_all-mpnet-base-v2.pickle" 

# Set path of demo data file
embedding_info = (
    "amazonqa1m_dim768", # table name
    768,                 # embedding dimension 
    "amazon-qa.jsonl" # data file
)

### Load Text Embedding Model

In [10]:
import sentence_transformers
import pickle
import os

def save_model(model, fn=SENT_EMBED_MODEL_PATH):
    # Cache the model on disk so later runs can skip the download.
    with open(fn, 'wb') as fp:
        pickle.dump(model, fp)

class MySentenceEmbeddingModel:

    def __init__(self):
        self.model = self.load_if_exist()

    def compute_embedding(self, text_batch):
        # Returns one embedding vector per input text.
        return self.model.encode(text_batch)

    def load_if_exist(self, fn=SENT_EMBED_MODEL_PATH):
        # Load the cached model if present; otherwise download and cache it.
        if os.path.exists(fn):
            with open(fn, 'rb') as fp:
                model = pickle.load(fp)
                print("model loaded from", fn)
                return model
        else:
            print("downloading model")
            model = sentence_transformers.SentenceTransformer("all-mpnet-base-v2")
            save_model(model, fn=fn)
            return model

### Prepare demo data

In [11]:
import sys
sys.path.append('../python')
import json
import psycopg
import roovector.psycopg as roovec_psycopg

class AmazonQuestionAnswerDemo(object):

    def __init__(self, tb_name, dim, data_fn):
        self.pg_conn = self.make_connection()
        self.cur = self.pg_conn.cursor()
        self.table_name = tb_name
        self.dimension = dim
        self.n_rows = 0
        self.data_fn = data_fn
        self.sentemb_model = MySentenceEmbeddingModel()

    def make_connection(self):
        conn = psycopg.connect(user=PG_USERNAME, dbname=PG_DBNAME, host=PG_HOST, port=PG_PORT, password=PG_PSWORD, autocommit=True)
        roovec_psycopg.register_roovector(conn)
        return conn

    def prepare_table(self, copy_data=True):
        self.cur.execute("DROP TABLE IF EXISTS %s" % self.table_name)
        self.cur.execute(
            "CREATE TABLE %s (id int, question varchar, answer varchar, qacombine varchar, embedding roovector(%d))" % (self.table_name, self.dimension))
        storage_fmt = "PLAIN"
        if self.dimension > 2000:
            storage_fmt = "EXTENDED"
        self.cur.execute("ALTER TABLE %s ALTER COLUMN embedding SET STORAGE %s" % (self.table_name, storage_fmt))

        if copy_data:
            print("copying data...")
            with self.cur.copy(f"COPY {self.table_name} (id, question, answer, qacombine, embedding) FROM STDIN WITH (FORMAT BINARY)") as copy:
                copy.set_types(["int4", "varchar", "varchar", "varchar", "roovector"])

                cnt = 0
                lncnt = 0
                batch_size = 1000
                batch_tuples = []
                batch_texts = []
                with open(self.data_fn, 'r') as f:
                    for line in f:
                        try:
                            json_obj = json.loads(line)
                            question = json_obj['query']
                            answers = json_obj['pos']
                            for ans in answers:
                                qatogether = question + " " + ans
                                batch_tuples.append((cnt, question, ans, qatogether))
                                batch_texts.append(qatogether)
                                cnt += 1

                                if len(batch_texts) >= batch_size:
                                    batch_embedding = self.sentemb_model.compute_embedding(batch_texts)
                                    print("batch =", cnt)
                                    for j in range(0, len(batch_texts)):
                                        copy.write_row((batch_tuples[j][0], batch_tuples[j][1], batch_tuples[j][2], batch_tuples[j][3], batch_embedding[j].tolist()))
                                    batch_texts.clear()
                                    batch_tuples.clear()

                            lncnt += 1
                            # Keep the demo small: stop after roughly the first 10,000 input lines.
                            if lncnt > 10000:
                                break
                        except json.JSONDecodeError as e:
                            print(f"Error decoding JSON on line: {line.strip()} - {e}")

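                    # Flush the final partial batch, which the loop above never writes.
                    if batch_texts:
                        batch_embedding = self.sentemb_model.compute_embedding(batch_texts)
                        for j in range(len(batch_texts)):
                            copy.write_row((*batch_tuples[j], batch_embedding[j].tolist()))

                    self.n_rows = cnt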
                print("done writing table!")

    def create_index_ivfflat(self, nlists, nprobes, force_use_index=True):
        print("creating index...")
        index_name = self.table_name + "_demo_index"
        self.cur.execute("DROP INDEX IF EXISTS %s" % index_name)
        self.cur.execute(
            "CREATE INDEX %s ON %s USING roo_ivfflat (embedding roovector_cosine_ops) WITH (lists = %d)" % (
                index_name, self.table_name, nlists))
        self.cur.execute("SET roo_ivfflat.probes = %d" % nprobes)
        print("done index creation!")
        if force_use_index:
            self.cur.execute("SET enable_seqscan=false")

    def query(self, query_question, k):
        query_vec = self.sentemb_model.compute_embedding([query_question])[0].tolist()
        query_stm = "SELECT id, question, answer, embedding FROM %s ORDER BY embedding <=> '%s' LIMIT %s"
        self.cur.execute(query_stm % (self.table_name, str(query_vec), k), binary=True, prepare=True)
        return self.cur.fetchall()


#Demo_amazonqa1m_dim768 = AmazonQuestionAnswerDemo("amazonqa1m_dim768", 768, "/data/qa_data/amazon-qa.jsonl")
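
To make the `ORDER BY embedding <=> ...` clause concrete, here is a small brute-force sketch of the same ranking in plain Python. It assumes that `<=>` denotes cosine distance, as the `roovector_cosine_ops` operator class suggests; `cosine_distance` and `exact_top_k` are hypothetical helpers and are not part of Roo-VectorDB.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity; 0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def exact_top_k(query_vec, rows, k):
    """rows is a list of (id, vector); return the k ids closest to query_vec."""
    ranked = sorted(rows, key=lambda row: cosine_distance(query_vec, row[1]))
    return [row_id for row_id, _ in ranked[:k]]

# Toy 2-dimensional "embeddings" for illustration only.
rows = [(0, [1.0, 0.0]), (1, [0.0, 1.0]), (2, [1.0, 1.0])]
print(exact_top_k([1.0, 0.1], rows, 2))
```

This exact scan touches every row, which is what an index is meant to avoid on large tables.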

### Choose the text searcher demo you want

In [13]:
"""
So far, candidates are:
 - Demo_amazonqa1m_dim768
"""
demo = AmazonQuestionAnswerDemo(embedding_info[0],
                                embedding_info[1],
                                embedding_info[2])

downloading model


### Create table and index

(This step may take some time)

In [14]:
demo.prepare_table()

print("Total number of rows:", demo.n_rows)

copying data...
batch = 4096
batch = 8192
batch = 12288
batch = 16384
batch = 20480
done writing table!
Total number of rows: 22652


### Build index for approximate vector search

In [15]:
# Parameters for IVF-flat approximate vector search:
# nlists is the number of lists the index is built with; nprobes is how many of them each query scans
nlists = 100
nprobes = 5

In [16]:
demo.create_index_ivfflat(nlists, nprobes, force_use_index=False)

creating index...
done index creation!
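
The roles of `nlists` and `nprobes` can be illustrated with a toy version of IVF-flat search. This sketch assumes `roo_ivfflat` follows the usual IVF-flat design (vectors are grouped into lists around centroids, and a query scans only the closest lists); it is not Roo-VectorDB's implementation, and the fixed centroids stand in for the clustering step a real index would run.

```python
import math

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def build_lists(vectors, centroids):
    """Assign each vector index to the list of its nearest centroid."""
    lists = [[] for _ in centroids]
    for idx, vec in enumerate(vectors):
        nearest = min(range(len(centroids)),
                      key=lambda c: cosine_distance(vec, centroids[c]))
        lists[nearest].append(idx)
    return lists

def ivf_search(query, vectors, centroids, lists, nprobes, k):
    """Scan only the nprobes lists whose centroids are closest to the query."""
    probe_order = sorted(range(len(centroids)),
                         key=lambda c: cosine_distance(query, centroids[c]))
    candidates = [idx for c in probe_order[:nprobes] for idx in lists[c]]
    candidates.sort(key=lambda idx: cosine_distance(query, vectors[idx]))
    return candidates[:k]

# Toy data: three lists (nlists = 3) over five 2-dimensional vectors.
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [-1.0, 0.0]]
centroids = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.1]]
lists = build_lists(vectors, centroids)
print(ivf_search([1.0, 0.2], vectors, centroids, lists, nprobes=1, k=2))
```

Raising `nprobes` scans more lists, which generally improves recall at the cost of query time; probing every list is equivalent to an exact scan.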


### Prepare a query text

In [17]:
query_text = "Can you recommend 3 TVs?"

### Run the query

In [18]:
topk = 10

In [19]:
import time

start_time = time.time()
results = demo.query(query_text, topk)
total_time = time.time() - start_time
print("Query time:", total_time, "seconds")

Query time: 0.026128053665161133 seconds
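
Since `time.time()` returns seconds, reporting milliseconds needs an explicit conversion. The helper below is a hypothetical sketch that uses `time.perf_counter()`, which is intended for measuring short intervals; the `sum` call is only a stand-in for `demo.query`.

```python
import time

def timed_ms(fn, *args, **kwargs):
    """Run fn once and return (result, elapsed milliseconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return result, elapsed_ms

# Stand-in workload; in the notebook this would be demo.query(query_text, topk).
result, elapsed_ms = timed_ms(sum, range(1000))
print(f"result={result}, elapsed={elapsed_ms:.3f} ms")
```

A single measurement also includes the time to embed the query text, so averaging over several runs gives a steadier figure.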


### Check query results

In [20]:
def display_table(data):
    from html import escape
    from IPython.display import HTML, display
    html = "<table>"
    for row in data:
        html += "<tr>"
        for field in row:
            # Escape field text so characters such as < and & render literally.
            html += "<td>%s</td>" % escape(str(field))
        html += "</tr>"
    html += "</table>"
    display(HTML(html))

In [21]:
res_table = []
for res in results:
    idx, question, answer, emb_vec = res
    res_table.append([idx, question, answer])
    #print(idx, "\n", question, "\n---------->", answer)
    
display_table(res_table)

id,question,answer
18327,"Does anyone out there have the 75inch Samsung UN75F6400, I would like to see a review on that, seems like all the review I read was 40-55inch","This is an amazing TV and it is the second one I purchase. The picture quality is great but it is a lot better if you calibrate the TV. The TV is great for bright rooms and looks great from every angle. I have got lots of compliments on it. Stay away of the voice control it's not worth your time, in my opinion. Also, it is surprisingly light so two people can easily mount it on the wall. As Brian below said the 3D is unbelievable."
7807,"Panasonic Plasma vs. Sony LCD Please help. i've narrowed my choices down to these two TVs: 1.) Panasonic PH46PZ85U [plasma] and 2.) Sony KDL46Z4100B [LCD]. i've been flipping back and forth for three months now trying to decide between these two TVs. i've done tons of research and have been to several stores looking at these two TVs and i just can't make up my mind. i,ve been to consumer reports.com and cnet.com. both had reviews for the Panasonic but little or no information on the Sony. i like to watch lots of DVD movies, sports (football and NASCAR mostly) and primetime standard cable TV. i plan on buying a Blue-Ray player before christmas. i have an Onkyo Integra receiver and 6 disc CD changer, Sony DVD player and DVR DVD recorder. i would love to here your comments.thank you,Lee Panasonic Plasma vs. Sony LCD Please help. i've narrowed my choices down to these two TVs: 1.) Panasonic PH46PZ85U [plasma] and 2.) Sony KDL46Z4100B [LCD]. i've been flipping back and... » Read More Panasonic Plasma vs. Sony LCD Please help. i've narrowed my choices down to these two TVs: 1.) Panasonic PH46PZ85U [plasma] and 2.) Sony KDL46Z4100B [LCD]. i've been flipping back and forth for three months now trying to decide between these two TVs. i've done tons of research and have been to several stores looking at these two TVs and i just can't make up my mind. i,ve been to consumer reports.com and cnet.com. both had reviews for the Panasonic but little or no information on the Sony. i like to watch lots of DVD movies, sports (football and NASCAR mostly) and primetime standard cable TV. i plan on buying a Blue-Ray player before christmas. i have an Onkyo Integra receiver and 6 disc CD changer, Sony DVD player and DVR DVD recorder. i would love to here your comments.thank you,Lee « Show Less","I don't know if you made your purchase yet or not, but I'd check out the difference in glare/reflections from the screens. 
I have heard the Sony has a non-glare screen, but most plasmas have a glass screen that reflects background light and can be distracting."
5165,Vizio Quality VIZIO XVT553SV 55-Inch Class Full Array TruLED with Smart Dimming LCD HDTV 240 Hz SPS with VIZIO Internet AppsI am thinking about purchasing this set but am wondering about Vizio quality. Has anyone had any wxperience with Vizio and how they hold up? Vizio Quality VIZIO XVT553SV 55-Inch Class Full Array TruLED with Smart Dimming LCD HDTV 240 Hz SPS with VIZIO Internet AppsI am thinking about purchasing this set but am wondering about... » Read More Vizio Quality VIZIO XVT553SV 55-Inch Class Full Array TruLED with Smart Dimming LCD HDTV 240 Hz SPS with VIZIO Internet AppsI am thinking about purchasing this set but am wondering about Vizio quality. Has anyone had any wxperience with Vizio and how they hold up? « Show Less,"I deleted my apps as well, and the RVR is still occurring. It's been long enough Vizio, it's time for you to fix the tvs you've already sold, and are not slowing down selling!!! FIX IT!I won't be buying another Vizio product again, I'll tell you that..."
5168,Vizio Quality VIZIO XVT553SV 55-Inch Class Full Array TruLED with Smart Dimming LCD HDTV 240 Hz SPS with VIZIO Internet AppsI am thinking about purchasing this set but am wondering about Vizio quality. Has anyone had any wxperience with Vizio and how they hold up? Vizio Quality VIZIO XVT553SV 55-Inch Class Full Array TruLED with Smart Dimming LCD HDTV 240 Hz SPS with VIZIO Internet AppsI am thinking about purchasing this set but am wondering about... » Read More Vizio Quality VIZIO XVT553SV 55-Inch Class Full Array TruLED with Smart Dimming LCD HDTV 240 Hz SPS with VIZIO Internet AppsI am thinking about purchasing this set but am wondering about Vizio quality. Has anyone had any wxperience with Vizio and how they hold up? « Show Less,"There is a thread in avsforum dedicated to the xvt3 series. There have been some issues reported - RVR (random Vizio Reboot), Sets turning on by itself, occasional defective units. Not all TVs are affected and similar problems are not uncommon in other models. Vizio 's support has been prompt in most cases.Picture Quality wise - Vizio ranks highly - esp in pro reviews. There are the standard issues - blooming and some uniformity issues. Also the set does not handle 1080p/24 correctly according to pro reviews .. It uses 3:2 pulldown on 1080p/24 material (e.g. Blu Ray) which can lead to judder in some scenes, There is a thread in avsforum dedicated to the xvt3 series. There have been some issues reported - RVR (random Vizio Reboot), Sets turning on by itself, occasional defective units. Not all TVs are affected and similar problems are not uncommon in other models. Vizio 's support has been prompt in most cases.Picture Quality wise - Vizio ranks highly - esp in pro reviews. There are the standard issues - blooming and some uniformity issues. Also the set does not handle 1080p/24 correctly according to pro reviews .. It uses 3:2 pulldown on 1080p/24 material (e.g. Blu Ray) which can... 
» Read More There is a thread in avsforum dedicated to the xvt3 series. There have been some issues reported - RVR (random Vizio Reboot), Sets turning on by itself, occasional defective units. Not all TVs are affected and similar problems are not uncommon in other models. Vizio 's support has been prompt in most cases.Picture Quality wise - Vizio ranks highly - esp in pro reviews. There are the standard issues - blooming and some uniformity issues. Also the set does not handle 1080p/24 correctly according to pro reviews .. It uses 3:2 pulldown on 1080p/24 material (e.g. Blu Ray) which can lead to judder in some scenes, « Show Less"
2040,So I'm considering buying this tv used. What are peoples reviews on this tv 7 years later?,"I own it and it is mostly fine. After a move a few years ago, a yellow line runs down the middle that is only really noticeable on blue backgrounds. It runs hot and is an energy hog (200w) compared to current LCD tvs. Also the contrast and resolution is not as sharp as current tvs even at 720p. The remote only works if you point it at the lower right corner of the TV, where the sensor is. 40"" tvs are $300 new (with better specs), so I hope you get a good deal."
5169,Vizio Quality VIZIO XVT553SV 55-Inch Class Full Array TruLED with Smart Dimming LCD HDTV 240 Hz SPS with VIZIO Internet AppsI am thinking about purchasing this set but am wondering about Vizio quality. Has anyone had any wxperience with Vizio and how they hold up? Vizio Quality VIZIO XVT553SV 55-Inch Class Full Array TruLED with Smart Dimming LCD HDTV 240 Hz SPS with VIZIO Internet AppsI am thinking about purchasing this set but am wondering about... » Read More Vizio Quality VIZIO XVT553SV 55-Inch Class Full Array TruLED with Smart Dimming LCD HDTV 240 Hz SPS with VIZIO Internet AppsI am thinking about purchasing this set but am wondering about Vizio quality. Has anyone had any wxperience with Vizio and how they hold up? « Show Less,I ended up deleting the apps I dont use and my random reset problem went away. It was doing it nearly every day until I deleted the apps. I think the apps update on their own and every time it does it auto-restarts.
6473,"does anyone have experience with Sony customer service? I have, and they are so sub standard as to be beyond belief. Save yourself some aggravation, don't buy Sony!","Hello George. Talked to Sony customer service three times today. Gave them Model & Serial numbers of our Vega Triniton. They confirmed it was a projector TV that entitled people to a free tv with free shipping. Our TV was not a projection TV. Suppose yours was? Anyway, the TV is at the curb for the City to pick up on Wednesday. Sony and I are no longer going to do business, but that's OK. Thanks for all your help."
2039,So I'm considering buying this tv used. What are peoples reviews on this tv 7 years later?,"I do not have the original purchased TV. In 2009 a white line appeared across the width of the screen. It could not be fixed. Samsung replaced the TV but since it was out of warranty I had to pay $300 for new one. The second one, same model has been fine."
14976,"This or the Sony STRDN1030? I have Sony TV and a PS3, so just wondering if it really makes all that difference in having ""everything sony""?","No, there is no advantage to having all components in an AV system being all the same brand. Unless you like the idea of seeing the same name on everything...."
5164,Vizio Quality VIZIO XVT553SV 55-Inch Class Full Array TruLED with Smart Dimming LCD HDTV 240 Hz SPS with VIZIO Internet AppsI am thinking about purchasing this set but am wondering about Vizio quality. Has anyone had any wxperience with Vizio and how they hold up? Vizio Quality VIZIO XVT553SV 55-Inch Class Full Array TruLED with Smart Dimming LCD HDTV 240 Hz SPS with VIZIO Internet AppsI am thinking about purchasing this set but am wondering about... » Read More Vizio Quality VIZIO XVT553SV 55-Inch Class Full Array TruLED with Smart Dimming LCD HDTV 240 Hz SPS with VIZIO Internet AppsI am thinking about purchasing this set but am wondering about Vizio quality. Has anyone had any wxperience with Vizio and how they hold up? « Show Less,"a little bit of Blooming but hardly noticable. I still havce the back light at 85 percent and i only notice it during the Adult Swim Bumps where it is a black screen with bright white text, and even then it is not that noticable."
