# Local Retrieval Augmented Generation
## Trevor Andrus

### The following notebook contains the functions and packages necessary to run two versions of RAG on a local machine: 
### 1. Pull context from Wikipedia 
### 2. Vectorize a local directory for context

#### There are also 3 examples at the bottom of the notebook that show initial success in these techniques. 
#### The specifics of each function are detailed in the declarations below, but the general flow of each method is the following:


## Wikipedia
1. Take a user query
2. Extract nouns, proper nouns and named entities from the query
3. Dedupe these keywords
4. Create all ordered pairs (two-word permutations) of these keywords
5. For each keyword, pull the top N wiki articles
6. Extract context from wiki articles sentence by sentence (from either summary or full text)
7. Record the line of each sentence, article it came from, and url
8. Create vector embedding of each sentence
9. Push vector, sentence, url, and title to a local pgvector database
10. Taking the original query, pull the top n similar sentences as context from the database
11. Return the resulting context as a dataframe
12. Using the context dataframe and original query, prompt the generation model for final results
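
Steps 6 and 7 amount to tagging every sentence with its position and its source. A minimal sketch of that bookkeeping, assuming a hypothetical helper `build_records` (it is not one of the notebook's functions) and placeholder sentences:

```python
from typing import Dict, List


def build_records(title: str, url: str, sentences: List[str]) -> List[Dict[str, object]]:
    """Attach the line number, article title and URL to each sentence."""
    return [
        {"line": i, "document": title, "link": url, "content": sentence}
        for i, sentence in enumerate(sentences)
    ]


# Placeholder sentences, for illustration only
records = build_records(
    "George W. Bush",
    "https://en.wikipedia.org/wiki/George_W._Bush",
    ["First sentence.", "Second sentence."],
)
for record in records:
    print(record)
```

Each record here carries the same fields that are later written to the database next to the embedding.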


## Local Directory
1. Pass a local filepath to the directory you'd like to vectorize
2. Function will loop through the directory and identify 3 file types: .docx, .ipynb, .pdf
   (This can be expanded to accept additional file types)
3. Extract context from each document line by line
4. Record line, document name, filepath and sentence
5. Vectorize sentence
6. Push vector, sentence, filepath, and title to a local pgvector database
7. Taking the original query, pull the top n similar sentences as context from the database
8. Return the resulting context as a dataframe
9. Using the context dataframe and original query, prompt the generation model for final results
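
One way to keep step 2 easy to extend is a lookup table from file extension to loader. The sketch below is only an illustration: the `LOADERS` table and `pick_loader` helper are hypothetical, and the values are just the names of the loader functions defined later in this notebook.

```python
import os

# Hypothetical table: file extension -> name of the loader function to call
LOADERS = {
    ".docx": "process_word_doc",
    ".ipynb": "process_notebook",
    ".pdf": "process_pdf",
}


def pick_loader(filename):
    """Return the loader name for a file, or None if the type is unsupported."""
    extension = os.path.splitext(filename)[1].lower()
    return LOADERS.get(extension)


for name in ["Scheduling.ipynb", "Critical Analysis.docx", "notes.txt"]:
    print(name, "->", pick_loader(name))
```

Supporting another file type would then mean adding one entry to the table and one loader function.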

In [1]:
# basic imports
import os
import warnings
import pandas as pd
from collections import OrderedDict

# wiki imports
import wikipedia as wp
from wikipedia import WikipediaPage

# generation model and sentence transformer
from transformers import AutoTokenizer, AutoModelForCausalLM
from sentence_transformers import SentenceTransformer, util

# NLP packages
import nltk
from nltk.tokenize import sent_tokenize
import spacy

# vector database
from pgvector.psycopg import register_vector
import psycopg

# for document ingestion
import PyPDF2
import nbformat
from docx import Document

warnings.filterwarnings("ignore")
nlp = spacy.load("en_core_web_sm")
# Set HF_TOKEN in your shell before launching Jupyter (a real token should never be committed):
# os.environ["HF_TOKEN"] = "<your Hugging Face access token>"



In [2]:
def extract_keywords(sentence):
    """
    Take an input sentence (usually the user's question), extract all nouns, proper nouns and named entities
    
    PARAMETERS:
        sentence: string representing the sentence
        
    RETURNS:
        list of unique strings, each element being a noun, proper noun or named entity from the input
    """
    
    # use spacy to wrap the input sentence
    doc = nlp(sentence)

    # Extract keywords (nouns and proper nouns)
    keywords = [token.text for token in doc if token.pos_ in ["NOUN", "PROPN"]]
    entities = [ent.text for ent in doc.ents]

    # return list of extracted elements
    return list(set(keywords).union(set(entities)))

In [3]:
def dedupe(keywords):
    
    """
    Take the output of the extract_keywords function and drop redundant single words
    (remove 'george' and 'bush' if 'george bush' also exists)
    
    PARAMETERS:
        keywords: list of keywords from the extract_keywords function
    
    RETURNS:
        list of deduped keywords
    """
    
    words_to_delete = set()
    words = keywords.copy()
    for word in words:
        parts = word.split()
        if len(parts) == 1:
            continue
        # when both leading words of a multi-word keyword also appear alone, mark them for removal
        if parts[0] in words and parts[1] in words:
            words_to_delete.update(parts[:2])

    # a set avoids calling remove() twice on the same word, which would raise ValueError
    words = [word for word in words if word not in words_to_delete]
        

    return words
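
As written, the loop in `dedupe` drops the single words when the multi-word keyword that contains them is also present. A small standalone sketch of that behaviour, using a hypothetical `drop_component_words` function and a set to collect the words to remove:

```python
def drop_component_words(keywords):
    """Drop single words that are already covered by a multi-word keyword."""
    redundant = set()
    for keyword in keywords:
        parts = keyword.split()
        # Only multi-word keywords can make single words redundant
        if len(parts) > 1 and parts[0] in keywords and parts[1] in keywords:
            redundant.update(parts[:2])
    return [keyword for keyword in keywords if keyword not in redundant]


result = drop_component_words(["george", "bush", "george bush", "president"])
print(result)  # ['george bush', 'president']
```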

In [4]:
def create_word_list(user_query):
    """
    Take the original user query, and call the two above functions
    with the deduped output, create ordered pairs of all keywords
    if we have 'cat' and 'dog' and 'run', we return the 3 words plus every two-word ordering ('cat dog', 'dog cat', ...)
    
    PARAMETERS:
        user_query: string of raw user input
        
    RETURNS:
        list of augmented keywords - the deduped list plus all two-word permutations of it
    """
    
    # get keywords
    words = extract_keywords(user_query)
    
    # dedupe
    words = dedupe(words)

    new_words = words.copy()

    # create all two-word permutations of existing keywords
    for i in words:
        for j in words:
            if i == j:
                continue
            new_word = str(i + ' ' + j)
            if new_word in new_words:
                continue
            else:
                new_words.append(new_word)
                
    # return new list
    return new_words
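
The nested loop above builds every ordered pair of keywords. The same list can be produced with `itertools.permutations`; the sketch below assumes the keywords are already unique, and `pair_keywords` is a hypothetical name rather than a function used elsewhere in this notebook.

```python
from itertools import permutations


def pair_keywords(words):
    """Return the keywords followed by every ordered two-word combination."""
    pairs = [" ".join(pair) for pair in permutations(words, 2)]
    # dict.fromkeys removes repeats while preserving order
    return list(dict.fromkeys(list(words) + pairs))


terms = pair_keywords(["cat", "dog", "run"])
print(len(terms))  # 9
print(terms)
```

For n unique keywords this yields n + n(n-1) search terms, so the number of Wikipedia searches grows quadratically with the length of the query.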

In [5]:
def gather_context(keywords, paragraph = False):
    
    """
    Take the deduped keywords from the input query, and gather wiki context for each
    For each word, pull the top n wiki pages from search
    Pull the content sentence by sentence (or paragraph by paragraph)
    
    PARAMETERS:
        keywords: list of deduped keywords
        paragraph (bool): if true, gather context in paragraphs, if false, by sentence
        
    RETURNS:
        context_list: a list of lists of strings, one inner list per wiki page; each string is
            either a sentence or a paragraph
        file_names: a list of strings representing the wiki articles the context was retrieved from
        words: the words used to gather context
        links: the links from wiki pages
    """
    
    # lists to store information gathered, the file it came from, and the word that produced it
    context_list = []
    gathered_files = []
    file_names = []
    words = []
    links = []

    # loop through all keywords
    for word in keywords:
        
        # for each keyword, pull the top 2 wiki pages (change this if needed)
        result = wp.search(word, results = 2)
        
        # for each wiki page
        for i in result:
            try:
                
                # if we've already stored the info from this page, skip
                page = wp.page(i, auto_suggest=False)
                if page.title in gathered_files:
                    continue
                
                # record the name of the page, which word was used to search
                # and the url of the page
                file_names.append(page.title)
                gathered_files.append(page.title)
                words.append(word)
                links.append(page.url)
                
                # get the summary of the page, and split it into sentences
                context = page.summary # can also change this to pull the full page if needed
                sentences = sent_tokenize(context)
                sentences = list(OrderedDict.fromkeys(sentences))
                
                # group sentences into paragraphs (looking for \n is unreliable, so these 
                # are pseudo-paragraphs of 4 sentences)
                if paragraph:
                    sentences = [' '.join(sentences[k:k+4]) for k in range(0, len(sentences), 4)]
                
                # we gather all this to create a dataframe as output
                context_list.append(sentences)
                
            # we pull the name of the wiki article, then grab it by title
            # sometimes the title returned in a search is not directly associated with a wiki page
            # if this is the case, an error is thrown. We use this to skip these cases. 
            except Exception:
                continue

    return context_list, file_names, words, links
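
The pseudo-paragraph grouping is the same fixed-size chunking in every loader, so it can be read in isolation. A minimal sketch, assuming a hypothetical helper `group_sentences`; sentences are joined with a single space here so that sentence boundaries stay visible:

```python
def group_sentences(sentences, size=4):
    """Join consecutive sentences into chunks of at most `size` sentences."""
    return [
        " ".join(sentences[i:i + size])
        for i in range(0, len(sentences), size)
    ]


chunks = group_sentences(["A.", "B.", "C.", "D.", "E."])
print(chunks)  # ['A. B. C. D.', 'E.']
```

The last chunk is simply shorter when the number of sentences is not a multiple of `size`.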

In [6]:
def vectorize_context(context_list, file_names, key_words, links):
    """
    Using the output of gather_context, create a database table
    each row in the table contains:
        vectorized context, context, file name, and link
        
    PARAMETERS:
        context_list: a list of lists of strings, one inner list per wiki page
            (each string is either a sentence or a paragraph)
        file_names: a list of strings representing the wiki articles the context was retrieved from
        key_words: the words used to gather context
        links: the links from wiki pages
    """

    # create connection
    conn = psycopg.connect(dbname='rag', user='rag', password='rag', autocommit=True)
    conn.execute('CREATE EXTENSION IF NOT EXISTS vector')
    register_vector(conn)

    # replace old data
    conn.execute('DROP TABLE IF EXISTS documents')
    conn.execute('CREATE TABLE documents (id bigserial PRIMARY KEY, line integer, word text, document text, content text, embedding vector(384), link text, page text)')
    
    model = SentenceTransformer('all-MiniLM-L6-v2')
    
    for i in range(len(context_list)):
        # embed every sentence (or paragraph) of this page in one batch
        embeddings = model.encode(context_list[i])

        # wiki pages have no page numbers, so 'page' is always 'N/A'
        for line, (content, embedding) in enumerate(zip(context_list[i], embeddings)):
            conn.execute('INSERT INTO documents (line, word, document, content, embedding, link, page) VALUES (%s, %s, %s, %s, %s, %s, %s)', (line, key_words[i], file_names[i], content, embedding, links[i], 'N/A'))

In [7]:
def query_database(query, n):
    """
    Query the local vector database for the top n vectors most cosine-similar to the input
    
    PARAMETERS:
        query: string to be matched in database
        n: number of results to be returned
        
    RETURNS:
        pandas dataframe representing query results
    
    """
    string = query
    
    # instantiate sentence transformer for embedding
    model = SentenceTransformer('all-MiniLM-L6-v2')
    
    # embed the input string
    embedding = model.encode(string)
    
    # establish connection to database
    conn = psycopg.connect(dbname='rag', user='rag', password='rag', autocommit=True)

    # create embedding array for pgvector
    embedding_pgarray = "[" + ",".join(map(str, embedding)) + "]"

    # Execute a SQL query to find the n vectors closest to the embedding vector (<=> is cosine distance)
    query = """
        SELECT content, document, page, line, link
        FROM documents 
        ORDER BY embedding <=> %(embedding)s 
        LIMIT %(n)s
    """
    neighbors = conn.execute(query, {'embedding': embedding_pgarray, 'n':n}).fetchall()
    
    # organize query results into dataframe
    df = pd.DataFrame(neighbors, columns=['Content', 'Document', 'Page', 'Line', 'Link'])
    df['Content'] = df['Content'].str.strip()

    df = df.reset_index(drop=True)
    return df        
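
The `<=>` operator in pgvector returns cosine distance, which is 1 minus cosine similarity, so ordering by it in ascending order puts the most similar rows first. The pure-Python sketch below mirrors that ranking on toy 2-dimensional vectors; `cosine_distance` and `top_n` are illustrative names, and the real query runs inside PostgreSQL on 384-dimensional embeddings.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def top_n(query_vector, rows, n):
    """rows is a list of (content, vector); return the n rows closest to the query."""
    return sorted(rows, key=lambda row: cosine_distance(query_vector, row[1]))[:n]


# Toy data, for illustration only
rows = [("a", [1.0, 0.0]), ("b", [0.0, 1.0]), ("c", [1.0, 1.0])]
closest = top_n([1.0, 0.0], rows, 2)
print([content for content, _ in closest])  # ['a', 'c']
```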

In [8]:
def internet_query(query, n, paragraph=False):
    """
    Combine all the above functions into a single call - take a query and return results from wikipedia
    
    PARAMETERS:
        query: raw string input from user
        n: number of results to return
        paragraph (bool): whether or not to return context as paragraphs
        
    RETURNS:
        dataframe with query results
    """
    # create search terms
    new_words = create_word_list(query)
    
    # gather context from search terms
    context_list, file_names, key_words, links = gather_context(new_words, paragraph)
    
    # vectorize gathered context
    vectorize_context(context_list, file_names, key_words, links)
    
    # return top n results
    return query_database(query, n)

In [9]:
def process_pdf(pdf_path, paragraph = False):
    """
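    Function to import pdf content and split it into sentences
    
    PARAMETERS:
        pdf_path: path to pdf file
        paragraph (bool): if true, group sentences into pseudo-paragraphs of 4
        
    RETURNS:
        full_sentences: list of sentences (or paragraphs) gathered from the pdf
        page_numbers: list with the 0-based page number of each element of full_sentences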
    """
    text = ""
    
    full_sentences = []
    # get all lines from pdf file
    with open(pdf_path, 'rb') as file:
        reader = PyPDF2.PdfReader(file)
        num_pages = len(reader.pages)
        page_numbers = []
        for page_num in range(num_pages):
            # add page number to db          
            page = reader.pages[page_num]
            text = page.extract_text()
            page_text = sent_tokenize(text)
            page_text = list(OrderedDict.fromkeys(page_text))
            
            for i in range(len(page_text)):
                page_numbers.append(page_num)
                           
            full_sentences.extend(page_text)
    
    # group if necessary
    if paragraph:
        full_sentences = [' '.join(full_sentences[i:i+4]) for i in range(0, len(full_sentences), 4)]
        # keep one page number per group: the page of the group's first sentence
        page_numbers = page_numbers[::4]
    
    return full_sentences, page_numbers

In [10]:
def process_notebook(notebook_file_path, paragraph=False):
    """
    Function to import ipynb content and split it into lines
    
    PARAMETERS:
        notebook_file_path: path to ipynb file
        paragraph (bool): if true, group lines into blocks of 4
        
    RETURNS:
        list of lines (or 4-line blocks) gathered from the notebook
    """
    
    # Read the notebook file
    with open(notebook_file_path, 'r', encoding='utf-8') as f:
        notebook_content = nbformat.read(f, as_version=4)

    # Extract the cells from the notebook content
    cells = notebook_content['cells']

    # Initialize an empty list to store the lines
    lines = []

    # Iterate through cells and extract text from code and markdown cells
    for cell in cells:
        if cell['cell_type'] in ('code', 'markdown'):
            # Split the cell source by newline characters
            lines.extend(cell['source'].splitlines())
          
    # group if necessary (keep the line breaks inside each 4-line block)
    if paragraph:
        lines = ['\n'.join(lines[i:i+4]) for i in range(0, len(lines), 4)]

    # Return the list of lines
    return lines

In [11]:
def process_word_doc(word_doc_file_path, paragraph=False):
    
    """
    Function to import docx content and split it into sentences
    
    PARAMETERS:
        word_doc_file_path: path to docx file
        paragraph (bool): if true, group sentences into pseudo-paragraphs of 4
        
    RETURNS:
        list of sentences (or paragraphs) gathered from the docx
    """
    
    # create word doc object
    doc = Document(word_doc_file_path)

    # Initialize an empty string to store the extracted text
    text = ""

    # Iterate through paragraphs in the document and extract text
    for paragraph1 in doc.paragraphs:
        text += paragraph1.text + "\n"

    # break text into sentences
    sentences = sent_tokenize(text)
    sentences = list(OrderedDict.fromkeys(sentences))
    
    # group if necessary
    if paragraph:
        sentences = [' '.join(sentences[i:i+4]) for i in range(0, len(sentences), 4)]
    
    return sentences

In [12]:
def vectorize_directory(directory, paragraph=False):
    """
    Loop through a given directory, create a large list of sentences from all compatible files
    Push large list of sentences to vector database
    
    PARAMETERS:
        directory: file path to directory to be vectorized
        paragraph: whether to group context in paragraphs
    """
    
    # instantiate sentences transformer for embeddings
    model = SentenceTransformer('all-MiniLM-L6-v2')
    
    # create connection
    conn = psycopg.connect(dbname='rag', user='rag', password='rag', autocommit=True)
    conn.execute('CREATE EXTENSION IF NOT EXISTS vector')
    register_vector(conn)

    # replace old data
    conn.execute('DROP TABLE IF EXISTS documents')
    conn.execute('CREATE TABLE documents (id bigserial PRIMARY KEY, line integer, document text, content text, embedding vector(384), link text, page text)')

    # loop through directory
    directory_path = directory
    all_lines = []
    file_names = []
    links = []
    pages_by_doc = []

    for filename in os.listdir(directory_path):
        if filename.endswith('.ipynb'):
            
            # Construct the full path to the notebook file
            file_path = os.path.join(directory_path, filename)

            # Process the notebook file and append lines to the list
            lines = process_notebook(file_path, paragraph)
            file_names.append(filename)
            
            # notebooks have no page numbers
            pages_by_doc.append(['N/A'] * len(lines))
                
            all_lines.append(lines)
            links.append('/user/tandrus/lab/tree/' + file_path[2:])
            
        elif filename.endswith('.docx'):
            
            # Construct the full path to the word doc file
            file_path = os.path.join(directory_path, filename)
            file_names.append(filename)

            sentences = process_word_doc(file_path, paragraph)
            
            pages = []
            for i in range(len(sentences)):
                pages.append('N/A')
            
            pages_by_doc.append(pages)
                
            all_lines.append(sentences)
            links.append('/user/tandrus/lab/tree/' + file_path[2:])
            
        elif filename.endswith('.pdf'):
            
            # Construct the full path to the pdf file
            file_path = os.path.join(directory_path, filename)
            file_names.append(filename)

            sentences, pdf_pages = process_pdf(file_path, paragraph)
            
            pages_by_doc.append(pdf_pages)
            
            all_lines.append(sentences)
            links.append('/user/tandrus/lab/tree/' + file_path[2:])

    print('Number of Documents: ', len(all_lines))
    
    # push all gathered context to vector database
    for i in range(len(all_lines)):
        
        print(file_names[i], ': ', len(all_lines[i]), ' sentences found.')
        
        # embed every sentence (or paragraph) of this document in one batch
        embeddings = model.encode(all_lines[i])

        for line, (content, embedding) in enumerate(zip(all_lines[i], embeddings)):
            conn.execute('INSERT INTO documents (line, document, content, embedding, link, page) VALUES (%s, %s, %s, %s, %s, %s)', (line, file_names[i], content, embedding, links[i], pages_by_doc[i][line]))

In [13]:
def create_long_context(context_df, length):
    """
    Take the results of a query (dataframe), append all the context to a single string
    limit the context to a specified length
    
    PARAMETERS:
        context_df: pandas dataframe output from a vector db query
        length: maximum number of characters in the long context
        
    RETURNS:
        string of concatenated context, truncated to length characters
    """
    
    # append all rows from context df 'Content' column, separated by a space
    context = ''
    for i in range(len(context_df)):
        context += context_df.loc[i, 'Content'] + ' '

    context = context[:length]
    
    return context

In [14]:
def create_prompt(query, result_df, local=False):   
    """
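    Prompt engineering for generation
    Combine query and context
     
    PARAMETERS:
        query: string query from user
        result_df: pandas dataframe from query result
        local: currently unused
    
    RETURNS:
        prompt string containing up to 1000 characters of context followed by the question
    
    """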
    context = create_long_context(result_df, 1000)
    prompt = f"""
    Context: {context}

    Question: According to the context provided, {query}
    """
    return prompt

In [15]:
def generate(prompt, query):
    """
    Given a prompt, use Google's gemma-7b-it to generate output
    
    PARAMETERS:
        prompt: prompt for generation model
        query: original query used to create the prompt
    """
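    # maximum total length in tokens (prompt + generated text)
    max_length = 500
    
    # tokenize input (truncation=True is required for max_length to take effect here)
    input_ids = tokenizer(prompt, return_tensors="pt", max_length=max_length, truncation=True)
    
    # generate
    outputs = model.generate(**input_ids, max_length=max_length)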
    
    # gather response
    response = tokenizer.decode(outputs[0])
    
    # if the output contains 'cannot answer' - (the answer to the question is not found in the context)
    # regenerate without context. 
    if 'cannot answer' in response:
        print('Context was unhelpful - regenerating')
        input_ids = tokenizer(query, return_tensors="pt", max_length=max_length)
        outputs = model.generate(**input_ids, max_length=max_length)
        response = tokenizer.decode(outputs[0])
        print(response)
    else:
        
        # I split the response because the generation tends to include the full question and context
        # in its output, this removes that for cleaner output
        # (maxsplit=1 and [-1] avoid an IndexError when 'Question' is missing from the response)
        print(response.split('Question', 1)[-1])
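
The retry logic in `generate` can be separated from the model call, which makes it easy to check without loading Gemma. In the sketch below, `answer_with_fallback` is a hypothetical helper and `generate_fn` is a stand-in for the tokenize, generate and decode steps; the sketch returns the text instead of printing it.

```python
def answer_with_fallback(generate_fn, prompt, query, marker="cannot answer"):
    """Generate from the prompt; if the model declines, retry with the bare query.

    Returns (response, used_fallback).
    """
    response = generate_fn(prompt)
    if marker in response:
        # The context did not contain the answer, so ask again without it
        return generate_fn(query), True
    return response, False


# Stand-in for the real model, for illustration only
def fake_model(text):
    if text.startswith("Context"):
        return "I cannot answer from the context."
    return "plain answer"


print(answer_with_fallback(fake_model, "Context: ...", "some question"))
```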

In [16]:
def make_clickable(val):
    """
    Function to make links clickable in the result dataframes
    """
    # target _blank to open new window
    return '<a target="_blank" href="{}">{}</a>'.format(val, val)

In [None]:
tokenizer = AutoTokenizer.from_pretrained("google/gemma-7b-it")
model = AutoModelForCausalLM.from_pretrained("google/gemma-7b-it", device_map="auto")


# Example 1

In [None]:
query = 'was george bush president in 2008?'

In [None]:
%%time
query = 'was george bush president in 2008'
result_df = internet_query(query, 10, paragraph=False)

CPU times: user 6.25 s, sys: 106 ms, total: 6.35 s
Wall time: 9.1 s


Unnamed: 0,Content,Document,Page,Line,Link
0,"George Walker Bush (born July 6, 1946) is an American politician and businessman who served as the 43rd president of the United States from 2001 to 2009.",George W. Bush,,0,https://en.wikipedia.org/wiki/George_W._Bush
1,"The 2008 United States presidential election was the 56th quadrennial presidential election, held on November 4, 2008.",2008 United States presidential election,,0,https://en.wikipedia.org/wiki/2008_United_States_presidential_election
2,"George Herbert Walker Bush (June 12, 1924 – November 30, 2018) was an American politician, diplomat, and businessman who served as the 41st president of the United States from 1989 to 1993.",George H. W. Bush,,0,https://en.wikipedia.org/wiki/George_H._W._Bush


In [143]:
prompt = create_prompt(query, result_df)
generate(prompt, query)
print()
pretty_df = result_df.head(3).style.format({'Link': make_clickable})
pretty_df

: According to the context provided, was george bush president in 2008
    Answer: Yes, George Bush was president in 2008.<eos>



Unnamed: 0,Content,Document,Page,Line,Link
0,"George Walker Bush (born July 6, 1946) is an American politician and businessman who served as the 43rd president of the United States from 2001 to 2009.",George W. Bush,,0,https://en.wikipedia.org/wiki/George_W._Bush
1,"The 2008 United States presidential election was the 56th quadrennial presidential election, held on November 4, 2008.",2008 United States presidential election,,0,https://en.wikipedia.org/wiki/2008_United_States_presidential_election
2,"George Herbert Walker Bush (June 12, 1924 – November 30, 2018) was an American politician, diplomat, and businessman who served as the 41st president of the United States from 1989 to 1993.",George H. W. Bush,,0,https://en.wikipedia.org/wiki/George_H._W._Bush


# Example 2

In [149]:
%%time
query = 'Was Steve young quarterback for byu in 1984'
result_df = internet_query(query, 10, paragraph=True)

CPU times: user 9.46 s, sys: 355 ms, total: 9.82 s
Wall time: 14 s


In [150]:
prompt = create_prompt(query, result_df)
generate(prompt, query)
pretty_df = result_df.head(3).style.format({'Link': make_clickable})
pretty_df

Context was unhelpful - regenerating
<bos>Was Steve young quarterback for byu in 1984?

The answer is no. Steve Young was not a quarterback for BYU in 1984.<eos>


Unnamed: 0,Content,Document,Page,Line,Link
0,"Jon Steven Young (born October 11, 1961) is an American former football quarterback who played in the National Football League (NFL) for 15 seasons, most notably with the San Francisco 49ers.He was drafted by and played for the Tampa Bay Buccaneers.Prior to his NFL career, Young was a member of the Los Angeles Express in the United States Football League (USFL) for two seasons.He played college football for the BYU Cougars, setting school and NCAA records en route to being runner-up for the 1983 Heisman Trophy.",Steve Young,,0,https://en.wikipedia.org/wiki/Steve_Young
1,"The 1984 BYU Cougars football team represented Brigham Young University (BYU) in the 1984 NCAA Division I-A football season.The Cougars were led by 13th-year head coach LaVell Edwards and played their home games at Cougar Stadium in Provo, Utah.The team competed as a member of the Western Athletic Conference, winning the conference for the ninth consecutive year.The Cougars finished the regular season as the only undefeated team in Division I-A, and secured their first ever national title by defeating Michigan in the 1984 Holiday Bowl.",1984 BYU Cougars football team,,0,https://en.wikipedia.org/wiki/1984_BYU_Cougars_football_team
2,"The BYU Cougars football team is the college football program representing Brigham Young University (BYU) in Provo, Utah.The Cougars began collegiate football competition in 1922, and have won 23 conference championships and one national championship in 1984.The team has competed in several different athletic conferences during its history, from July 1, 2011 to 2022, they competed as an FBS Independent.On September 10, 2021, the Big 12 Conference unanimously accepted BYU’s application to the conference.",BYU Cougars football,,0,https://en.wikipedia.org/wiki/BYU_Cougars_football


# Example 3

In [151]:
%%time
vectorize_directory('./Examples', paragraph=False)
print()

Number of Documents:  4
Infant Health Study.pdf :  957  sentences found.
FetalUltrasounds.pdf :  211  sentences found.
Scheduling.ipynb :  1844  sentences found.
Critical Analysis.docx :  51  sentences found.

CPU times: user 1min 35s, sys: 4.32 s, total: 1min 40s
Wall time: 18.7 s


In [158]:
query = 'Are ultrasounds important for prenatal care?'
result_df = query_database(query, 5)

In [159]:
pretty_df = result_df.head(3).style.format({'Link': make_clickable})

pretty_df

Unnamed: 0,Content,Document,Page,Line,Link
0,"B ACKGROUND When it comes to prenatal care, obtaining quality ultra- sounds and properly understanding those ultrasounds is critical to achieving proper diagnoses and providing premium patient care.",FetalUltrasounds.pdf,0,2,/user/tandrus/lab/tree/Examples/FetalUltrasounds.pdf
1,"A LTERNATIVE DATA Due to the time constraints on the project, and the diffi- culty of obtaining the desired dataset, a premade alternative ultrasound dataset, concerned with detecting and diagnosing tumors in ultrasounds, was used in an effort to test potential architectures in preparation for when the fetal dataset would be ready for use.",FetalUltrasounds.pdf,2,72,/user/tandrus/lab/tree/Examples/FetalUltrasounds.pdf
2,"It is important to understand what healthcare profes - sionals who meet expecting and new parents through pregnancy, childbirth, and postnatal care follow-ups should consider when providing support and informa - tion.",Infant Health Study.pdf,1,49,/user/tandrus/lab/tree/Examples/Infant Health Study.pdf


In [2]:
prompt = create_prompt(query, result_df)
generate(prompt, query)

Answer: Yes, according to the text, ultrasounds are important for prenatal care as they are used to achieve proper diagnoses and provide premium patient care.<eos>


In [160]:
prompt = create_prompt(query, result_df)
generate(prompt, query)


: According to the context provided, Are ultrasounds important for prenatal care?
    Answer: Yes, according to the text, ultrasounds are important for prenatal care as they are used to achieve proper diagnoses and provide premium patient care.<eos>


In [38]:
conn = psycopg.connect(dbname='rag', user='rag', password='rag', autocommit=True)
                       
# Execute a SQL query to count the rows stored in the documents table
query = """
    SELECT count(*)
    FROM documents 
"""
neighbors = conn.execute(query).fetchall()

In [46]:
# Paragraph
neighbors

[(525,)]

In [29]:
#sentence by sentence
neighbors

[(2097,)]