## Data Extraction & Database Embeddings Integration

In [2]:
import sys
sys.path.append("/home/tagore/repos/ai/scripts")
import TextExtractor as te
import VectorDB as vdb

### Text Extractor: Class Definition

/scripts/TextExtractor.py

### Text Extractor: test

In [3]:
# Named *_dir so the built-in input() is not shadowed.
output_dir = '/home/tagore/repos/ai/data/exampe_debug_folder'
input_dir = '/home/tagore/repos/ai/data/example_data'
chroma_db = '/home/tagore/repos/ai/data/example_database'
test_extractor = te.TextExtractor(output_dir, input_dir)
pdfs = test_extractor.get_pdf()
test_extractor.extract_pdf_texts(pdfs[0])[0]

IndexError: list index out of range
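
The traceback does not show which index failed: `pdfs[0]` fails if `get_pdf()` found no files, and the trailing `[0]` fails if `extract_pdf_texts` returned nothing. A minimal debugging sketch, assuming both calls return lists; `list_pdfs` and `first_or_none` are hypothetical helpers, not part of `TextExtractor`:

```python
from pathlib import Path


def list_pdfs(folder):
    """Return the PDF files directly inside `folder`, sorted by path.

    Hypothetical helper for debugging; returns [] if the folder is missing.
    """
    folder = Path(folder)
    if not folder.is_dir():
        return []
    return sorted(p for p in folder.iterdir() if p.suffix.lower() == ".pdf")


def first_or_none(items):
    """Return the first element of a sequence, or None when it is empty."""
    return items[0] if items else None
```

Calling `list_pdfs` on the input folder shows whether any PDFs are present, and `first_or_none(pdfs)` replaces `pdfs[0]` so an empty result can be handled explicitly instead of raising.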

In [3]:
urltext

['s://github.com/enormandeau/ncbi_blast_tutorial.\n\nAbout\nCrash course for NCBI blast tools\n\nTopics\n\nResources\n\nStars\n\nWatchers\n\nForks\n\nReleases\n\nFooter\n\nFooter navigation',
 "archive. For example:\n\nAdd the bin folder from the extracted archive to your path. For example, add\nthe following line to your ~/.bashrc file:\n\nAnd change the /PATH/TO part to the path where you have put the extracted\narchive.\n\nExample sequences to use with the tutorial\nIn order to test blast, you need a test fasta file. Use the following files\nthat come with the tutorial:\n\nCreate blast database\nThe different blast tools require a formatted database to search against. In\norder to create the database, we use the makeblastdb tool:\n\nThis will create a list of files in the databases folder. These are all part\nof the blast database.\n\nBlast\nWe can now blast our sequences against the database. In this case, both our\nquery sequences and database sequences are DNA sequences, so we us

### Vector Database: Class Definition

/home/tagore/repos/ai/scripts/VectorDB.py

### Vector Database: test

In [8]:
import sys
sys.path.append("/home/tagore/repos/ai/scripts")
import TextExtractor as te
import VectorDB as vdb


# Named *_dir so the built-in input() is not shadowed.
input_dir = '/home/tagore/repos/ai/data/example_data/'
output_dir = '/home/tagore/repos/ai/data/exampe_debug_folder'
chroma_db = '/home/tagore/repos/ai/data/example_db/'
collection_name = 'test'

test_vector_db = vdb.VectorDB(input_dir, output_dir, chroma_db, collection_name)


In [15]:
len(test_vector_db.peek())

7
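
Whether this 7 is a record count depends on what `VectorDB.peek()` returns, which this notebook does not show. If it forwards the result of chromadb's `Collection.peek()`, that result is a dict of parallel lists, so `len()` counts its keys rather than the stored records, and an empty collection would still look populated. A minimal sketch, assuming that dict shape; `count_records` is a hypothetical helper:

```python
def count_records(peek_result):
    """Count records in a peek-style result.

    Accepts either a Chroma-style dict of parallel lists (keyed by "ids",
    "documents", ...) or a plain list of records.
    """
    if isinstance(peek_result, dict):
        # The "ids" list has one entry per record; it may be missing or None.
        return len(peek_result.get("ids") or [])
    return len(peek_result)
```

If the wrapper exposes the underlying collection, chromadb's `Collection.count()` returns the number of stored records directly and avoids the ambiguity.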

### RAG: Class Definition

In [30]:
import logging
import ollama  # Python client for a locally running Ollama server
import TextExtractor as te
import VectorDB as vdb

class RAG:
    def __init__(self, input_dir: str, output_dir: str, chroma_db_dir: str, chroma_db_name: str, model="mxbai-embed-large"):
        # Configure logging first so INFO messages emitted during setup are shown.
        logging.basicConfig(level=logging.INFO)
        self.vector_db = self._setup_vector_db(input_dir, output_dir, chroma_db_dir, chroma_db_name, model)
    
    def _setup_vector_db(self, input_dir, output_dir, chroma_db_dir, chroma_db_name, model):
        """Check if the database exists and set up if not."""
        try:
            # Attempt to load existing vector database
            vector_db = vdb.VectorDB(input_dir, output_dir, chroma_db_dir, chroma_db_name, model)
            # Check if the database is empty or needs updating
            if not self._is_database_populated(vector_db):
                logging.info("Database is not populated. Loading data...")
                vector_db.load_data()
            return vector_db
        except Exception as e:
            logging.error(f"Error setting up VectorDB: {e}")
            raise

    def _is_database_populated(self, vector_db):
        """Check if the vector database has data."""
        return len(vector_db.peek()) > 0
    
    def generate_prompt(self, question, context):
        template = """You need to answer questions about specific software.
        Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know.
        Keep the answer concise.
        user
        Question: {question}
        Context: {context}
        Do not say "according to the text". Just give the answer, with no comment."""
        return template.format(question=question, context=context)

    def generate_answer(self, query_text, k=4):
        """Generate an answer using the vector database and Ollama model."""
        try:
            # Retrieve the k most relevant chunks, then build the prompt from them.
            context = self.vector_db.query(query_text, k)
            output = ollama.generate(
                model="llama3",
                prompt=self.generate_prompt(query_text, context),
            )
            return output['response']
        except Exception as e:
            logging.error(f"Error generating answer: {e}")
            return "Error generating answer."
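
`generate_answer` interpolates whatever `self.vector_db.query(query_text, k)` returns straight into `{context}`. The return type of `VectorDB.query` is not shown here. If it is a raw chromadb query result (a dict whose `"documents"` entry holds one list of strings per query), the model would see a Python repr rather than plain text. A minimal sketch of flattening such a result first, assuming that shape; `format_context` is a hypothetical helper:

```python
def format_context(query_result, separator="\n\n"):
    """Flatten retrieved chunks into a single context string.

    Handles a plain string, a list of strings, or a Chroma-style dict
    whose "documents" value is a list of lists of strings.
    """
    if isinstance(query_result, str):
        return query_result
    if isinstance(query_result, dict):
        nested = query_result.get("documents") or []
        chunks = [doc for docs in nested for doc in (docs or [])]
    else:
        chunks = list(query_result)
    # Drop empty chunks and trim surrounding whitespace before joining.
    return separator.join(c.strip() for c in chunks if c and c.strip())
```

The formatted string would then be passed as the `context` argument of `generate_prompt`.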

### RAG: test

In [31]:
# Named *_dir so the built-in input() is not shadowed.
input_dir = '/home/tagore/repos/ai/data/example_data/'
output_dir = '/home/tagore/repos/ai/data/exampe_debug_folder'
chroma_db = '/home/tagore/repos/ai/data/example_db/'
collection_name = 'test'

testRAG = RAG(input_dir, output_dir, chroma_db, collection_name)

In [36]:
print(testRAG.generate_answer("What does word size parameter mean in BLAST?"))

The word size parameter in BLAST refers to the size of words that can score at least T when compared with words from the query, and is valid for values between 2-7.


In [33]:
print(testRAG.generate_answer("How to get the results of BLASTP in XML format?"))

To get the results of BLASTP in XML format, you can use the `blast_formatter` command with the `-outfmt 5` option, as shown in the example: `$ blast_formatter –rid X3R7GAUS014 –out test.xml –outfmt 5`


In [35]:
print(testRAG.generate_answer("How to perform a BLAST on a specific taxonomic group?"))

To perform a BLAST on a specific taxonomic group, you can use the `-taxids` option and specify the NCBI taxonomy ID(s) for the given organism(s). For example: `$ blastn –db nt –query QUERY –taxids 9606 –outfmt 7 –out OUTPUT.tab`.


In [38]:
print(testRAG.generate_answer("What parameters do I use to perform BLAST with epitopes smaller than 10 amino acids?"))

You can use the "word_size" parameter and set it to 2 or less, as epitopes smaller than 10 amino acids are likely to be shorter sequences.


In [39]:
print(testRAG.generate_answer("Which kind of databases can be searched with BLASTX?"))

Protein databases (e.g., PDB, GenBank) and nucleotide databases (e.g., nt, nr).
