## **Notebook 1: Development and Implementation - RagAgent with PDF Parsing and Prechunking**
## **Introduction:**

In this notebook, we will build a Retrieval-Augmented Generation (RAG) agent that answers questions about a PDF document. The pipeline combines PDF parsing, chunking, and document retrieval to handle unstructured data. We’ll use the PDFtoTextParser to extract content from the PDF, the SentenceChunker to split the text into sentence-level chunks, the TfidfVectorStore to index those chunks, and the RagAgent, backed by a Groq-hosted LLM, to answer queries from the retrieved chunks.

By the end of this notebook, you’ll have implemented a solution capable of parsing, chunking, and retrieving information using an LLM-based agent.

In [9]:
!pip install groq

Collecting groq
  Using cached groq-0.11.0-py3-none-any.whl.metadata (13 kB)
Collecting distro<2,>=1.7.0 (from groq)
  Using cached distro-1.9.0-py3-none-any.whl.metadata (6.8 kB)
Using cached groq-0.11.0-py3-none-any.whl (106 kB)
Using cached distro-1.9.0-py3-none-any.whl (20 kB)
Installing collected packages: distro, groq
Successfully installed distro-1.9.0 groq-0.11.0


In [11]:
!pip install python-dotenv

Collecting python-dotenv
  Using cached python_dotenv-1.0.1-py3-none-any.whl.metadata (23 kB)
Using cached python_dotenv-1.0.1-py3-none-any.whl (19 kB)
Installing collected packages: python-dotenv
Successfully installed python-dotenv-1.0.1


**Import necessary modules**

In [12]:
import os
from swarmauri.chunkers.concrete.SentenceChunker import SentenceChunker
from swarmauri_community.parsers.concrete.FitzPdfParser import PDFtoTextParser as Parser
from swarmauri.llms.concrete.GroqModel import GroqModel as LLM
from swarmauri.vector_stores.concrete.TfidfVectorStore import TfidfVectorStore
from swarmauri.agents.concrete.RagAgent import RagAgent
from swarmauri.documents.concrete.Document import Document
from swarmauri.messages.concrete.SystemMessage import SystemMessage
from swarmauri.conversations.concrete.MaxSystemContextConversation import MaxSystemContextConversation
from dotenv import load_dotenv

**Load environment variables (ensure GROQ_API_KEY is set in .env)**

In [13]:
load_dotenv()

True
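
`load_dotenv()` returning `True` only says that a `.env` file was found, not that the key you need is in it. A minimal sketch of a guard that fails early when a variable is missing; `require_env` and `DEMO_API_KEY` are hypothetical names used for illustration and are not part of swarmauri or python-dotenv:

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise if it is missing or empty."""
    value = os.getenv(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set; add it to your .env file or shell environment."
        )
    return value


# Demonstration with a throwaway variable (hypothetical name, not used by the notebook).
os.environ["DEMO_API_KEY"] = "dummy-value"
print(require_env("DEMO_API_KEY"))
```

In the notebook you would call the helper with whichever variable name your `.env` file actually defines.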

**Initialize PDF parser**

In [15]:
parser = Parser()
file_path = "main.pdf"
documents = parser.parse(file_path)
documents

[Document(name=None, id='dbf3d94b-3890-4488-89a3-0c98d0da11a2', members=[], owner=None, host=None, resource='Document', version='0.1.0', type='Document', content='Engineering Applications of Artificial Intelligence 126 (2023) 107021\nAvailable online 4 September 2023\n0952-1976/© 2023 Elsevier Ltd. All rights reserved.\nContents lists available at ScienceDirect\nEngineering Applications of Artificial Intelligence\njournal homepage: www.elsevier.com/locate/engappai\nSurvey paper\nTransformer for object detection: Review and benchmark✩\nYong Li a,∗, Naipeng Miao a, Liangdi Ma b, Feng Shuang a, Xingwen Huang a\na Guangxi Key Laboratory of Intelligent Control and Maintenance of Power Equipment, School of Electrical Engineering, Guangxi University, No. 100, Daxuedong\nRoad, Xixiangtang District, Nanning, 530004, Guangxi, China\nb School of Software, Tsinghua University, No. 30 Shuangqing Road, Haidian District, Beijing, 100084, China\nA R T I C L E\nI N F O\nKeywords:\nReview\nObject detect

**Chunk the parsed text**

In [18]:
chunker = SentenceChunker()
chunked_documents = [Document(content=chunk) for chunk in chunker.chunk_text(documents[0].content)]
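
To build intuition for what sentence-level chunking produces, here is a rough standard-library sketch. It is an assumption-laden stand-in, not the `SentenceChunker` implementation: it simply splits after `.`, `!`, or `?` when whitespace follows, and `split_sentences` is a hypothetical helper name.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Naive sentence splitter: break after ., ! or ? when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    # Collapse internal newlines/extra spaces and drop empty fragments.
    return [" ".join(part.split()) for part in parts if part.strip()]


sample = "Transformers use attention. CNNs use convolutions!\nWhich is better?"
for sentence in split_sentences(sample):
    print(sentence)
```

A rule this simple mis-splits abbreviations such as "No. 100", which appear in the parsed PDF header, so a dedicated chunker is the better choice for real documents.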


**Initialize Vector Store and add chunked documents**

In [19]:
vector_store = TfidfVectorStore()
vector_store.add_documents(chunked_documents)
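
The idea behind a TF-IDF store can be sketched in plain Python: weight each term by how often it occurs in a chunk and how rare it is across chunks, then rank chunks by cosine similarity to the query. This is an illustrative sketch under those assumptions, not how `TfidfVectorStore` is implemented; all function names below are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and keep alphanumeric runs as tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def to_vector(tokens: list[str], idf: dict[str, float]) -> dict[str, float]:
    """Sparse TF-IDF vector; terms unseen in the corpus are ignored."""
    counts = Counter(tokens)
    return {term: count * idf[term] for term, count in counts.items() if term in idf}


def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors (0.0 if either is empty)."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, docs: list[str], top_k: int = 2) -> list[str]:
    """Return the top_k docs ranked by TF-IDF cosine similarity to the query."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(docs)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    vectors = [to_vector(tokens, idf) for tokens in tokenized]
    query_vec = to_vector(tokenize(query), idf)
    ranked = sorted(
        range(n_docs), key=lambda i: cosine(query_vec, vectors[i]), reverse=True
    )
    return [docs[i] for i in ranked[:top_k]]


chunks = [
    "Transformers rely on self-attention to model long-range dependencies.",
    "Convolutional networks extract local spatial features with filters.",
    "The paper was published in 2023.",
]
print(retrieve("How does self-attention work?", chunks, top_k=1))
```

Because TF-IDF matches on shared vocabulary, a query only retrieves chunks that use the same words; it has no notion of synonyms.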

**Set up the LLM and system context for RagAgent**

In [21]:
API_KEY = os.getenv("GROQ_API_KEY")  # same variable name the .env note above asks for
llm = LLM(api_key=API_KEY)
system_context = SystemMessage(content="Your name is PDF reading assistant. You are to help readers with accurate information.")
conversation = MaxSystemContextConversation(system_context=system_context, max_size=4)
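
The `max_size=4` argument bounds how much history the conversation keeps. The sketch below shows one way such a bound can work, assuming the system message is always retained and only the most recent `max_size` other messages are kept. `BoundedConversation` is a hypothetical class written for illustration; it does not reproduce the internals of `MaxSystemContextConversation`.

```python
from collections import deque


class BoundedConversation:
    """Keeps one system message plus the most recent `max_size` other messages."""

    def __init__(self, system_message: str, max_size: int) -> None:
        self.system_message = system_message
        # A deque with maxlen silently discards the oldest entry when full.
        self._history = deque(maxlen=max_size)

    def add(self, role: str, content: str) -> None:
        self._history.append((role, content))

    def messages(self) -> list[tuple[str, str]]:
        # The system message always comes first, followed by the retained history.
        return [("system", self.system_message), *self._history]


demo = BoundedConversation("You are a PDF reading assistant.", max_size=4)
for turn in range(6):
    demo.add("user", f"question {turn}")

for role, content in demo.messages():
    print(role, "->", content)
```

A small window keeps prompts short, at the cost of the model forgetting earlier turns.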


**Initialize RagAgent**

In [22]:
agent = RagAgent(llm=llm, conversation=conversation, system_context=system_context, vector_store=vector_store)

In [23]:
# Check agent resource type
print("RagAgent Resource Type:", agent.resource)

RagAgent Resource Type: Agent


In [26]:
agent.exec("difference between transformers and convolutional networks")

'**Transformers and convolutional networks are two distinct deep learning architectures with unique strengths and weaknesses:**\n\n**Transformers:**\n\n* **Attention-based:** Uses self-attention and cross-attention between token pairs to capture long-range dependencies.\n* **Sequence-to-sequence learning:** Designed for sequence-to-sequence tasks, such as machine translation, text summarization, and language modeling.\n* **Parallel processing:** Can process entire sequences in parallel, leading to faster training and inference.\n* **Non-sequential:** Does not require data to be presented in a specific order.\n* **Memory-efficient:** Uses attention to capture dependencies without storing the entire sequence in memory.\n\n\n**Convolutional Networks:**\n\n* **Spatial feature extraction:** Uses convolution filters to extract spatial features from data, such as images and time series.\n* **Local processing:** Processes data locally, considering only a small neighborhood of each input.\n* **

## **Key Points:**

* GroqModel is initialized and passed to the RAG agent as its LLM (`llm`).
* The RAG agent first retrieves the chunks most relevant to the query from the vector store, then passes those chunks to the Groq LLM to generate a response.
* With GroqModel integrated, the agent both retrieves relevant information and turns it into a natural-language answer.
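
The retrieve-then-generate flow described in these points can be sketched end to end with a stand-in model. Everything here is an assumption for illustration: `answer_with_rag`, `word_overlap`, and `echo_llm` are hypothetical names, the ranking is a crude word-overlap count rather than TF-IDF, and the "LLM" just returns its prompt so you can inspect what a real model would receive.

```python
from typing import Callable


def word_overlap(query: str, chunk: str) -> int:
    """Crude relevance score: number of lowercase words shared by query and chunk."""
    return len(set(query.lower().split()) & set(chunk.lower().split()))


def answer_with_rag(
    query: str,
    chunks: list[str],
    generate: Callable[[str], str],
    top_k: int = 2,
) -> str:
    """Retrieve the top_k chunks for the query, build a prompt, and call the model."""
    ranked = sorted(chunks, key=lambda c: word_overlap(query, c), reverse=True)
    context = "\n".join(f"- {chunk}" for chunk in ranked[:top_k])
    prompt = (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )
    return generate(prompt)


def echo_llm(prompt: str) -> str:
    """Stand-in for a real LLM call: returns the prompt unchanged."""
    return prompt


rag_chunks = [
    "Transformers use self-attention over token pairs.",
    "Convolutional networks use local filters.",
]
print(answer_with_rag("what do transformers use", rag_chunks, echo_llm, top_k=1))
```

Swapping `echo_llm` for a function that calls a hosted model turns the sketch into a working pipeline; the retrieval step decides which text the model is allowed to ground its answer in.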

## **Conclusion:**

With GroqModel integrated, we have built a RAG agent that parses a PDF, chunks its text, retrieves the chunks relevant to a query, and uses an LLM to generate a response from them. The setup combines document search with language generation, so answers can draw on the document's content rather than on the model alone.

## **NOTEBOOK METADATA**

In [27]:
import os
import platform
import sys
from datetime import datetime

# Display author information
author_name = "Dominion John " 
github_username = "DOMINION-JOHN1"  

print(f"Author: {author_name}")
print(f"GitHub Username: {github_username}")

# Last modified datetime (file's metadata)
notebook_file = "Notebook 1 Development and Implementation.ipynb"
try:
    last_modified_time = os.path.getmtime(notebook_file)
    last_modified_datetime = datetime.fromtimestamp(last_modified_time)
    print(f"Last Modified: {last_modified_datetime}")
except Exception as e:
    print(f"Could not retrieve last modified datetime: {e}")

# Display platform, Python version, and Swarmauri version
print(f"Platform: {platform.system()} {platform.release()}")
print(f"Python Version: {sys.version}")

# Checking Swarmauri version
try:
    import swarmauri
    print(f"Swarmauri Version: {swarmauri.__version__}")
except ImportError:
    print("Swarmauri is not installed.")



Author: Dominion John 
GitHub Username: DOMINION-JOHN1
Last Modified: 2024-10-23 09:39:18.741447
Platform: Windows 11
Python Version: 3.12.7 (tags/v3.12.7:0b05ead, Oct  1 2024, 03:06:41) [MSC v.1941 64 bit (AMD64)]
Swarmauri Version: 0.5.0
