### Data Preprocessing

In [1]:
import fitz  # PyMuPDF
import os
import torch
from dotenv import load_dotenv, find_dotenv

In [81]:
import nest_asyncio
nest_asyncio.apply()

In [2]:
load_dotenv(find_dotenv())

True

In [None]:
# Create a sample pdf for ch1,2
def save_page_ranges(source_pdf_path, output_pdf_path, page_ranges):
    """
    Saves specified ranges of pages from a source PDF to a new PDF file.

    Args:
    source_pdf_path (str): Path to the source PDF file.
    output_pdf_path (str): Path to the output PDF file.
    page_ranges (list of tuples): List of tuples, where each tuple represents a page range to save (inclusive, 0-indexed).
    """
    # Open the source PDF file
    doc = fitz.open(source_pdf_path)
    # Create a new PDF to save selected pages
    new_doc = fitz.open()

    # Iterate through each range and add the pages to the new document
    for start, end in page_ranges:
        new_doc.insert_pdf(doc, from_page=start, to_page=end)

    # Save the new document
    new_doc.save(output_pdf_path)
    new_doc.close()
    doc.close()
    print(f"Specified page ranges have been saved to {output_pdf_path}")

# path to input pdf file
source_pdf_path = '../data/ConceptsofBiology-WEB.pdf'
# path to output pdf file
output_pdf_path = 'sample_ch1_ch2_ConceptsofBiology.pdf'

# pass range of pages to extract
page_ranges = [(18, 38), (40, 66)]
save_page_ranges(source_pdf_path, output_pdf_path, page_ranges)


### TODO 
- Get unstractured pdf data into structured format like JSON

### Embedding model

In [3]:
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-large-en-v1.5", device=('cuda' if torch.cuda.is_available() else 'cpu'))

  from .autonotebook import tqdm as notebook_tqdm
modules.json: 100%|██████████| 349/349 [00:00<00:00, 1.93MB/s]
config_sentence_transformers.json: 100%|██████████| 124/124 [00:00<00:00, 590kB/s]
README.md: 100%|██████████| 94.6k/94.6k [00:00<00:00, 421kB/s]
sentence_bert_config.json: 100%|██████████| 52.0/52.0 [00:00<00:00, 347kB/s]
config.json: 100%|██████████| 779/779 [00:00<00:00, 5.61MB/s]
model.safetensors: 100%|██████████| 1.34G/1.34G [03:59<00:00, 5.60MB/s]
tokenizer_config.json: 100%|██████████| 366/366 [00:00<00:00, 3.52MB/s]
vocab.txt: 100%|██████████| 232k/232k [00:00<00:00, 492kB/s]
tokenizer.json: 100%|██████████| 711k/711k [00:00<00:00, 841kB/s]
special_tokens_map.json: 100%|██████████| 125/125 [00:00<00:00, 328kB/s]
1_Pooling/config.json: 100%|██████████| 191/191 [00:00<00:00, 1.66MB/s]


In [4]:
embeddings = embed_model.get_text_embedding("Hello World!")
print(len(embeddings))

1024


### LLM

In [None]:
from llama_index.llms.huggingface import HuggingFaceInferenceAPI

llm_hf = HuggingFaceInferenceAPI(model_name="microsoft/Phi-3-mini-4k-instruct", 
                                 temperature=0.0,
                                 token=os.getenv("HUGGING_FACE_TOKEN"))

In [None]:
print(llm_hf.complete("Hi"))

In [5]:
from llama_index.llms.ollama import Ollama

phi3 = Ollama(
    model="llama2:13b-chat",
    request_timeout=50.0,
    temperature=0.0
)

In [6]:
print(phi3.complete("world is "))

The world is a vast and diverse place, encompassing a wide range of cultures, landscapes, and ecosystems. Here are some interesting facts about the world:

1. The world is home to over 7.9 billion people, with the population projected to reach 9.7 billion by 2050. (Source: United Nations Department of Economic and Social Affairs)
2. The largest country in the world by land area is Russia, which covers an area of approximately 17.1 million square kilometers. (Source: CIA World Factbook)
3. The highest mountain in the world is Mount Everest, located in the Himalayas between Nepal and Tibet. It stands at a height of 8,848 meters (29,029 feet) above sea level. (Source: National Geographic)
4. The deepest lake in the world is Lake Baikal in Russia, which reaches a maximum depth of approximately 1,642 meters (5,387 feet). (Source: Lake Baikal Foundation)
5. The longest river in the world is the Nile River, which flows for approximately 6,695 kilometers (4,160 miles) through Egypt, Sudan, and

In [7]:
from llama_index.core.llms import ChatMessage

messages = [
    ChatMessage(role="system", content="You are CEO of MetaAI"),
    ChatMessage(role="user", content="Introduce Llama2 to the world."),
]
response = phi3.chat(messages)

In [11]:
print(response)

assistant: 
Hello, fellow humans! I am Mark Zuckerberg, CEO of Meta AI, and I am thrilled to introduce you to our latest creation: Llama2! 🐪❤️

Llama2 is a revolutionary new AI model that represents the next generation of language understanding. This incredible technology has been designed to understand and respond to human input in a more natural, human-like way than ever before.

With Llama2, you can have conversations with our AI just like you would with a real person! 💬👩‍💻 Our AI is so advanced that it can understand the nuances of human language and respond in a way that is both appropriate and engaging.

But that's not all - Llama2 also has a range of exciting features that make it stand out from other AI models. For example, it can:

🔍 Understand context and intent behind human input, allowing for more accurate and relevant responses.

💬 Generate human-like text based on given prompts or topics, making it perfect for applications like chatbots and virtual assistants.

🎨 Create a

### Load Data

In [12]:
from llama_index.core import SimpleDirectoryReader

loader = SimpleDirectoryReader(
    input_dir="../data/sample/",
    recursive=True,
    required_exts=[".pdf"],
)

documents = loader.load_data()

In [76]:
# Sample Document
documents[3]

Document(id_='a6d4877b-6047-43a7-90c0-dbd1ab8d6ccf', embedding=None, metadata={'page_label': '4', 'file_name': 'sample_ch1_ch2_ConceptsofBiology.pdf', 'file_path': '/home/c3po/Documents/project/learning/amar-works/askbio/src/../data/sample/sample_ch1_ch2_ConceptsofBiology.pdf', 'file_type': 'application/pdf', 'file_size': 8980495, 'creation_date': '2024-04-26', 'last_modified_date': '2024-04-25'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={}, text='FIGURE 1.4Although no tw o look alik e, these kit tens ha ve inherit ed g enes fr om both par ents and shar e man y of the same char acteristics.\n(credit: Piet er & R enée L anser)\nRegulation/Homeos tasis\nEven the smal lest organisms ar e comple x and r equir e mul tiple r egulatory mechanisms t o coordinat e i

In [94]:
from llama_index.core.node_parser import SentenceSplitter

splitter = SentenceSplitter(chunk_size=600,chunk_overlap=128)
nodes = splitter.get_nodes_from_documents(documents)

### Indexing Data (IN-MEMORY)

In [14]:
from llama_index.core import VectorStoreIndex

index0 = VectorStoreIndex.from_documents(
    documents,
    embed_model=embed_model,
    show_progress=True)   

Parsing nodes: 100%|██████████| 48/48 [00:00<00:00, 1157.10it/s]
Generating embeddings: 100%|██████████| 66/66 [00:03<00:00, 16.68it/s]


In [95]:
index1 = VectorStoreIndex(nodes=nodes,
                          use_async=True,
                          embed_model=embed_model,
                          show_progress=True)

Generating embeddings:   0%|          | 0/11 [00:00<?, ?it/s]

Generating embeddings: 100%|██████████| 11/11 [00:06<00:00,  1.77it/s]


### Querying

In [60]:
query_engine = index0.as_query_engine(llm=phi3)

In [96]:
query_engine1 = index1.as_query_engine(llm=phi3)
print(query_engine1.query("The type of logical thinking that uses related observations to arrive at a general conclusion is called?"))

The type of logical thinking that uses related observations to arrive at a general conclusion is called inductive reasoning.


In [93]:
response = query_engine.query("The type of logical thinking that uses related observations to arrive at a general conclusion is called?")
print(response)

The type of logical thinking that uses related observations to arrive at a general conclusion is called inductive reasoning.


In [None]:
def get_response(query: str):
    query_engine = index.as_query_engine(llm=phi3) # TODO Need to move as class attribute
    response = query_engine.query(query)
    return response


### Dataset Generation and Evaluation

In [113]:
from llama_index.core.evaluation import DatasetGenerator
data_gen = DatasetGenerator(nodes=nodes, 
                            llm=phi3, 
                            num_questions_per_chunk=2, 
                            question_gen_query="Generate 2 questions per chunk.Restrict the questions to the context information provided.")

  data_gen = DatasetGenerator(nodes=nodes,


In [118]:
eval_questions = data_gen.generate_questions_from_nodes()

  return QueryResponseDataset(queries=queries, responses=responses_dict)


In [126]:
eval_questions_updated = [q for q in eval_questions if ("How" in q or "What" in q and not ("pdf" in q or "PDF" in q))]
len(eval_questions_updated)

503

In [134]:
from llama_index.core.evaluation import RelevancyEvaluator
import json

rel_eval = RelevancyEvaluator(llm=phi3)

ques = eval_questions_updated[0]

relevancy_results = []
for q in eval_questions_updated[:10]:
    ques_response = query_engine.query(q)
    eval_result = json.loads(rel_eval.evaluate_response(query=ques, response=ques_response).json())
    relevancy_results.append(eval_result)
    print(f" q --> {q} score --> {eval_result['score']}")

# print(f"Q --> {ques} \nsource --> {ques_response.source_nodes[0].node.get_content()} \neval_result --> {eval_result}\n")

 q --> What are the first forms of life on Earth thought to have been? score --> 1.0
 q --> How long ago did plants and animals appear on Earth? score --> 1.0
 q --> What is the science that studies life called? score --> 0.0
 q --> What is an example of a sub-discipline in biology that studies viruses? score --> 0.0
 q --> What are the properties of life that can be identified and described? score --> 1.0
 q --> What is the difference between living entities and non-living entities? score --> 0.0
 q --> What are the four questions that biologists have struggled with since the early beginnings of biology? score --> 0.0
 q --> How do the various living things function, and how do we organize them to better understand them? score --> 0.0
 q --> What are the eight characteristics that define life, according to biologists? score --> 1.0
 q --> How do cells specialize in specific functions, and how do they come together to form organs such as the heart, lung, or skin? score --> 1.0


### PromptTemplate Using Retriever

In [71]:
from llama_index.core import PromptTemplate

template1 = ("Your name is AskBio. You are AI chatbot who can answer question by using provided context information from book named Concepts of Biology"
            "Be more specific and do not facricate the answers.\n" 
            "If you are unsure about answer, please ask for clarfications.\n"
            "Use the provided context information below to answer the user questions. \n"
            "-------------------------------------------\n"
            "{context_str}" 
            "\n -------------------------------------------\n"
            "Given this information, please answer user questions: {query_str} \n")
qa_template1 = PromptTemplate(template1)


In [143]:
user_query = "How long ago humans inhabited Earth?"
retriever = index1.as_retriever()

for qry in eval_questions_updated[:10]:
    nodes = retriever.retrieve(qry)
    context = " ".join(node.get_text() for node in nodes)
    prompt1 = qa_template1.format(context_str=context, query_str=qry)
    response1 = query_engine.query(prompt1)
    eval_result = json.loads(rel_eval.evaluate_response(query=qry, response=response1).json())
    print(f" q --> {qry} score --> {eval_result['score']}")
    

 q --> What are the first forms of life on Earth thought to have been? score --> 0.0
 q --> How long ago did plants and animals appear on Earth? score --> 1.0
 q --> What is the science that studies life called? score --> 1.0
 q --> What is an example of a sub-discipline in biology that studies viruses? score --> 1.0
 q --> What are the properties of life that can be identified and described? score --> 1.0
 q --> What is the difference between living entities and non-living entities? score --> 1.0
 q --> What are the four questions that biologists have struggled with since the early beginnings of biology? score --> 1.0
 q --> How do the various living things function, and how do we organize them to better understand them? score --> 1.0
 q --> What are the eight characteristics that define life, according to biologists? score --> 1.0
 q --> How do cells specialize in specific functions, and how do they come together to form organs such as the heart, lung, or skin? score --> 1.0


In [144]:
def get_response_with_retriever(query: str):
    # TODO Need to move as class attribute
    custom_template = ("Your name is AskBio. You are AI chatbot who can answer question by using provided context information from book named Concepts of Biology"
            "Be more specific and do not facricate the answers.\n" 
            "If you are unsure about answer, please ask for clarfications.\n"
            "Use the provided context information below to answer the user questions. \n"
            "-------------------------------------------\n"
            "{context_str}" 
            "\n -------------------------------------------\n"
            "Given this information, please answer user questions: {query_str} \n")
    askbio_template = PromptTemplate(custom_template)
    retriever = index.as_retriever() # TODO Need to move as class attribute
    retrieved_nodes = retriever.retrieve(qry)
    retrieved_context = " ".join(node.get_text() for node in retrieved_nodes)
    formatted_prompt = askbio_template.format(context_str=retrieved_context, query_str=qry)
    response = query_engine.query(formatted_prompt)
    return response