# Scottish Water: Multi-Document Agent

## Setup

#### First, load the environment variables, which here hold the OpenAI API key

In [1]:
from dotenv import load_dotenv
import os

load_dotenv('config.env')

True
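`load_dotenv` returning `True` does not by itself guarantee that the key was read. A minimal sketch of an explicit check, assuming the key is stored under the name `OPENAI_API_KEY` (the variable the OpenAI client looks up by default). `require_env` is a hypothetical helper written for this notebook, not part of `python-dotenv`.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is missing or blank."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; check config.env")
    return value


# Intended use, right after load_dotenv('config.env'):
#     require_env("OPENAI_API_KEY")
```

Failing here gives a clearer message than a connection or authentication error later in the notebook.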

#### Many LLM libraries run in async mode, so the next block patches the notebook's event loop so that async code can also run inside Jupyter

(This also affects launching Gradio.)

In [None]:
import nest_asyncio
nest_asyncio.apply()

## 1. Load, read and extract the data from the documents

**Note**: The PDF files are in a folder called 'ScottishWater'.
This can be modified to fetch the documents directly from a website.
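One way to fetch the documents from a website is sketched below using only the standard library. `download_pdfs` is a hypothetical helper, and the URLs passed to it are whatever document links you choose; none are provided in this notebook.

```python
from pathlib import Path
from urllib.parse import unquote, urlparse
from urllib.request import urlopen


def download_pdfs(urls, dest_dir="ScottishWater"):
    """Download each URL into dest_dir and return the local paths as strings."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    local_paths = []
    for url in urls:
        # Use the last path segment of the URL as the local file name.
        filename = Path(unquote(urlparse(url).path)).name
        target = dest / filename
        if not target.exists():  # skip files that were already downloaded
            with urlopen(url, timeout=30) as response:
                target.write_bytes(response.read())
        local_paths.append(str(target))
    return local_paths
```

The returned list has the same shape as the `papers` list used later, so it could be passed to the rest of the notebook unchanged.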

In [3]:
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex, SummaryIndex
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.tools import FunctionTool, QueryEngineTool
from llama_index.core.vector_stores import MetadataFilters, FilterCondition
from typing import List, Optional

In [4]:
# Define the LLM and embedding model to be used in this task
from llama_index.core import Settings
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding

llm = OpenAI(model="gpt-3.5-turbo")
Settings.embed_model = OpenAIEmbedding(model="text-embedding-ada-002")

#### List the Scottish Water papers to load

In [5]:
papers = [
    "ScottishWater/310823SWDeveloperGuide_June23.pdf",
    "ScottishWater/150219WaterForScotlandV4.pdf",
    "ScottishWater/080822MisconnectionsExplainedAug22.pdf",
    "ScottishWater/190718SurfaceWaterGuidanceDoc8ppA4PagesHiRes.pdf",
    "ScottishWater/SewersForScotlandv4.pdf",
    "ScottishWater/021220CampervanWasteDisposalGuidance.pdf",
    "ScottishWater/190718WaterConnectionsCodeScotlandJul14.pdf",
    "ScottishWater/170718swbyelawsexplained.pdf",
    "ScottishWater/120221SWPrivateToPublicv4aPages.pdf",
    "ScottishWater/170718swleaddrinkingwaterawpageslr.pdf",
    "ScottishWater/170718swbyelawsexplained.pdf",
    "ScottishWater/120221SWPrivateToPublicv4aPages.pdf",
    "ScottishWater/170718swleaddrinkingwaterawpageslr.pdf",
    "ScottishWater/150120SWByelaws2021HiresPagesWeb.pdf",
    "ScottishWater/021222A70118SWAccessToLandBrochureJune2022SinglePagesHR.pdf",
    "ScottishWater/261121ScottishWaterDomesticQuickGuide.pdf"  
]
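Hand-typed lists can easily repeat or omit a file. As an alternative, the list could be built from the folder contents. This is a minimal sketch, assuming every PDF in the folder should be loaded; `list_pdfs` is a hypothetical helper.

```python
from pathlib import Path


def list_pdfs(folder="ScottishWater"):
    """Return the path of every .pdf file in folder, sorted by name."""
    # as_posix keeps forward slashes so the strings match the style used in `papers`.
    return [p.as_posix() for p in sorted(Path(folder).glob("*.pdf"))]


# Intended use: papers = list_pdfs("ScottishWater")
```

Each file appears exactly once because the list is read from the file system.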

### Reading the Data
#### This function takes a PDF file from the list above and does the following:
* Loads the document and splits it into chunks of up to 1024 tokens
* Creates nodes from the chunks
* Builds two tools, since we want to be able to both search and summarize the document: the vector tool and the summary tool
* The vector tool embeds the nodes into a vector index, which is stored to make searching easier
* The summary tool builds a summary index over the same nodes and generates summaries from them
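To make the chunking step concrete, here is a toy, standard-library sketch of sentence-aware chunking. It is not how `SentenceSplitter` is implemented: the real splitter budgets by tokens and supports overlap between chunks, whereas this hypothetical `chunk_sentences` counts words and has no overlap.

```python
import re


def chunk_sentences(text, max_words=50):
    """Greedily pack whole sentences into chunks of roughly max_words words.

    A single sentence longer than max_words becomes its own oversized chunk.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        # Start a new chunk when adding this sentence would exceed the budget.
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Keeping sentences whole means each chunk, and therefore each node, reads as coherent text when it is retrieved.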

In [9]:
from typing import Tuple

def get_doc_tools(
    file_path: str,
    name: str,
) -> Tuple[FunctionTool, QueryEngineTool]:
    """Get vector query and summary query tools from a document."""

    # load documents
    documents = SimpleDirectoryReader(input_files=[file_path]).load_data()
    splitter = SentenceSplitter(chunk_size=1024)
    nodes = splitter.get_nodes_from_documents(documents)
    vector_index = VectorStoreIndex(nodes)
    
    def vector_query(
        query: str, 
        page_numbers: Optional[List[str]] = None
    ) -> str:
        """Use to answer questions over a given paper.
    
        Useful if you have specific questions over the paper.
        Always leave page_numbers as None UNLESS there is a specific page you want to search for.
    
        Args:
            query (str): the string query to be embedded.
            page_numbers (Optional[List[str]]): Filter by set of pages. Leave as NONE 
                if we want to perform a vector search
                over all pages. Otherwise, filter by the set of specified pages.
        
        """
    
        page_numbers = page_numbers or []
        metadata_dicts = [
            {"key": "page_label", "value": p} for p in page_numbers
        ]
        
        query_engine = vector_index.as_query_engine(
            similarity_top_k=2,
            filters=MetadataFilters.from_dicts(
                metadata_dicts,
                condition=FilterCondition.OR
            )
        )
        response = query_engine.query(query)
        return response
        
    
    vector_query_tool = FunctionTool.from_defaults(
        name=f"vector_tool_{name}",
        fn=vector_query
    )
    
    summary_index = SummaryIndex(nodes)
    summary_query_engine = summary_index.as_query_engine(
        response_mode="tree_summarize",
        use_async=True,
    )
    summary_tool = QueryEngineTool.from_defaults(
        name=f"summary_tool_{name}",
        query_engine=summary_query_engine,
        description=(
            f"Useful for summarization questions related to {name}"
        ),
    )

    return vector_query_tool, summary_tool
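The page filter in `vector_query` combines one filter per page with `FilterCondition.OR`, and an empty list means no restriction. The same logic is sketched below on plain dictionaries. `filter_by_page` and the node layout are hypothetical stand-ins for illustration, not LlamaIndex objects.

```python
def filter_by_page(nodes, page_numbers=None):
    """Keep nodes whose page_label matches any requested page (OR semantics)."""
    if not page_numbers:
        return list(nodes)  # no filter: every page is searched
    wanted = set(page_numbers)
    return [n for n in nodes if n["metadata"].get("page_label") in wanted]


nodes = [
    {"text": "intro", "metadata": {"page_label": "1"}},
    {"text": "byelaws", "metadata": {"page_label": "2"}},
    {"text": "contacts", "metadata": {"page_label": "3"}},
]
selected = filter_by_page(nodes, ["1", "3"])
```

In this sketch the page labels are strings, matching the `List[str]` type hint on `page_numbers`, so a page passed as the integer `2` would not match the label `"2"`.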

In [11]:
from pathlib import Path

paper_to_tools_dict = {}
for paper in papers:
    print(f"Getting tools for paper: {paper}")
    vector_tool, summary_tool = get_doc_tools(paper, Path(paper).stem)
    paper_to_tools_dict[paper] = [vector_tool, summary_tool]

Getting tools for paper: ScottishWater/310823SWDeveloperGuide_June23.pdf
Getting tools for paper: ScottishWater/150219WaterForScotlandV4.pdf
Getting tools for paper: ScottishWater/080822MisconnectionsExplainedAug22.pdf


incorrect startxref pointer(1)


Getting tools for paper: ScottishWater/190718SurfaceWaterGuidanceDoc8ppA4PagesHiRes.pdf
Getting tools for paper: ScottishWater/SewersForScotlandv4.pdf
Getting tools for paper: ScottishWater/021220CampervanWasteDisposalGuidance.pdf


incorrect startxref pointer(1)


Getting tools for paper: ScottishWater/190718WaterConnectionsCodeScotlandJul14.pdf


incorrect startxref pointer(3)


Getting tools for paper: ScottishWater/170718swbyelawsexplained.pdf
Getting tools for paper: ScottishWater/120221SWPrivateToPublicv4aPages.pdf
Getting tools for paper: ScottishWater/170718swleaddrinkingwaterawpageslr.pdf


incorrect startxref pointer(3)


Getting tools for paper: ScottishWater/170718swbyelawsexplained.pdf
Getting tools for paper: ScottishWater/120221SWPrivateToPublicv4aPages.pdf
Getting tools for paper: ScottishWater/170718swleaddrinkingwaterawpageslr.pdf
Getting tools for paper: ScottishWater/150120SWByelaws2021HiresPagesWeb.pdf
Getting tools for paper: ScottishWater/021222A70118SWAccessToLandBrochureJune2022SinglePagesHR.pdf
Getting tools for paper: ScottishWater/261121ScottishWaterDomesticQuickGuide.pdf
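Each tool name is built from the PDF's file stem, so long file names give long tool names. For instance, the stem `021222A70118SWAccessToLandBrochureJune2022SinglePagesHR` yields names longer than 64 characters. OpenAI's function-calling API has, at least historically, limited function names to 64 characters drawn from letters, digits, underscores and hyphens. Treat that limit as an assumption to check against the current API reference. A minimal sketch of a hypothetical `safe_tool_name` helper that could replace the f-strings in `get_doc_tools`:

```python
import re

MAX_TOOL_NAME_LEN = 64  # assumed limit; verify against the current OpenAI API reference


def safe_tool_name(prefix, stem, max_len=MAX_TOOL_NAME_LEN):
    """Build a tool name using only letters, digits, '_' and '-', cut to max_len."""
    cleaned = re.sub(r"[^a-zA-Z0-9_-]", "_", stem)
    return f"{prefix}_{cleaned}"[:max_len]


long_name = safe_tool_name(
    "summary_tool", "021222A70118SWAccessToLandBrochureJune2022SinglePagesHR"
)
```

Truncation can make two long names collide if they share their first 64 characters, so shorter, hand-picked names per document may be the safer choice.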


### Extend the Agent with Tool Retrieval

Flatten the per-paper tools into a single list for easy processing

In [12]:
# iterate over the dict so that a paper listed twice contributes its tools only once
all_tools = [t for tools in paper_to_tools_dict.values() for t in tools]

The next step is to make the tools retrievable. As we have a large number of documents and two tools per document, it would be costly and time-consuming to send every tool to the LLM on each call. To avoid this, we create an object index over the tools and retrieve only the ones relevant to the prompt provided by the user.
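The idea of retrieving only the best-matching tools can be sketched without any embedding model. The toy below ranks tool descriptions by bag-of-words cosine similarity, which is a deliberately crude stand-in for the embedding similarity that the object index uses. All names in it are hypothetical.

```python
import math
import re
from collections import Counter


def _bag_of_words(text):
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def _cosine(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve_tools(query, tool_descriptions, top_k=3):
    """Return the names of the top_k tools whose descriptions best match the query."""
    q = _bag_of_words(query)
    ranked = sorted(
        tool_descriptions.items(),
        key=lambda item: _cosine(q, _bag_of_words(item[1])),
        reverse=True,
    )
    return [name for name, _ in ranked[:top_k]]


toy_tools = {
    "byelaws": "Summarise the water byelaws",
    "campervan": "Campervan waste disposal guidance",
    "lead": "Lead in drinking water",
}
best = retrieve_tools("water byelaws for owners", toy_tools, top_k=2)
```

Only the tools returned by the retriever are passed to the LLM, which keeps each request small however many documents are indexed.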

In [13]:
# define an "object" index and retriever over these tools
from llama_index.core import VectorStoreIndex
from llama_index.core.objects import ObjectIndex

obj_index = ObjectIndex.from_objects(
    all_tools,
    index_cls=VectorStoreIndex,
)

The code below sets `similarity_top_k=3`, so the retriever returns the three tools that best match the user prompt.

In [14]:
obj_retriever = obj_index.as_retriever(similarity_top_k=3)

In [15]:
tools = obj_retriever.retrieve(
    "Tell me about the water byelaws for private owners"
)

In [36]:
tools[0].metadata

ToolMetadata(description='Useful for summarization questions related to 261121ScottishWaterDomesticQuickGuide', name='summary_tool_261121ScottishWaterDomesticQuickGuide', fn_schema=<class 'llama_index.core.tools.types.DefaultToolFnSchema'>, return_direct=False)

#### Implementing the Agent
The agent makes this RAG application more effective on complex or multi-step prompts. Say we want to search for a section of a document and then summarize it: the agent helps the LLM break the request down into actions that can be performed one at a time.
An agent consists of a worker and a runner. The worker carries out each step of a task: it reasons about what to do next and calls the chosen tool. The runner orchestrates the task as a whole: it creates the task, keeps track of state such as the conversation history, runs the worker step by step, and returns the final response to the user.

In [16]:
from llama_index.core.agent import FunctionCallingAgentWorker
from llama_index.core.agent import AgentRunner

agent_worker = FunctionCallingAgentWorker.from_tools(
    tool_retriever=obj_retriever,
    llm=llm, 
    system_prompt=""" \
You are an agent designed to answer queries over a set of given papers.
Please always use the tools provided to answer a question. Do not rely on prior knowledge.\

""",
    verbose=False
)
agent = AgentRunner(agent_worker)

To get responses, we can use either `query` or `chat`. `query` is a one-off call in which the agent doesn't keep track of what was previously asked, whereas `chat` keeps the conversation history so that follow-up questions can build on earlier ones.
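The difference between a stateless and a stateful call can be illustrated with a toy class. `ToySession` is a hypothetical stand-in that only counts the messages it can see. With the real agent the corresponding calls would be `agent.query(...)` and `agent.chat(...)`.

```python
class ToySession:
    """Hypothetical stand-in for an agent, used only to contrast query() and chat()."""

    def __init__(self):
        self.history = []  # messages remembered across chat() calls

    def _answer(self, message, history):
        # A real agent would call the LLM here; the toy just reports what it can see.
        return f"{message!r} answered with {len(history)} earlier message(s) as context"

    def query(self, message):
        """One-off call: earlier messages are ignored and nothing is stored."""
        return self._answer(message, [])

    def chat(self, message):
        """Conversational call: earlier messages are visible and this one is stored."""
        reply = self._answer(message, self.history)
        self.history.append(message)
        return reply


session = ToySession()
first = session.chat("What are the water byelaws?")
follow_up = session.chat("Who enforces them?")
one_off = session.query("Who enforces them?")
```

Here the follow-up asked through `chat` can refer back to the first question, while the same text sent through `query` arrives without that context.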

In [38]:
response = agent.query(
    "Tell me about the water byelaws for private owners "
    "and compare it against the byelaws for public owners"
)
print(str(response))

Retrying llama_index.embeddings.openai.base.get_embedding in 0.5911091986894219 seconds as it raised APIConnectionError: Connection error..


KeyboardInterrupt: 

In [18]:
response = agent.query(
    "Compare and contrast the private and public related papers. "
    "What are the major differences in the approach of Scottish Water to both entities. "
)

In [19]:
print(str(response))

assistant: The major differences in the approach of Scottish Water to private and public entities are as follows:

Private Related Paper:
- Emphasizes the importance of compliance with regulations and standards for maintaining water safety and quality in homes.
- Recommends the use of licensed plumbers and contractors who are members of recognized national licensing schemes for water-related work.
- Highlights the need for regular maintenance of domestic appliances, pipes, cisterns, shower hoses, toilet WC, and hose union taps to prevent backflow.
- Encourages access to servicing valves and fittings for easy maintenance and repairs.

Public Related Paper:
- Focuses on maintaining water quality and safety in homes and business premises in Scotland.
- Mentions the implementation of Water Byelaws to prevent backflow contamination of the public water supply.
- Stresses compliance with installation requirements and the use of appropriate backflow prevention devices to protect against differ

In [20]:
response = agent.query(
    "Provide a summary of the hydraulic modelling design"
)
print(str(response))



assistant: Hydraulic modelling design involves creating detailed simulations to analyze and predict the behavior of fluids in various systems. This process typically includes assessing factors such as fluid flow, pressure, and potential risks like backflow. By utilizing hydraulic modelling design, engineers can optimize the performance of water systems, ensure compliance with regulations such as Water Byelaws, and enhance overall safety and efficiency in water supply and distribution networks.


In [21]:
import gradio as gr
def generate(prompt, max_new_tokens):
    # max_new_tokens comes from the slider but is not yet passed to the agent
    response = agent.query(prompt)
    return str(response)

demo = gr.Interface(fn=generate, 
                    inputs=[gr.Textbox(label="What would you like to know"), 
                            gr.Slider(label="Max new tokens", 
                                      value=20,  
                                      maximum=1024, 
                                      minimum=1)], 
                    outputs=[gr.Textbox(label="Response")])

gr.close_all()
demo.launch(share=True)

Running on local URL:  http://127.0.0.1:7860
Running on public URL: https://60bf20168ea10cbe07.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)
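If `agent.query` raises, for example on a connection error, the Gradio callback fails with it. A minimal sketch of a wrapper that returns a readable message instead. `make_handler` is a hypothetical helper, and the wording of the messages is only a suggestion.

```python
def make_handler(query_fn):
    """Wrap query_fn so that failures come back as readable text instead of raising."""

    def handler(prompt, max_new_tokens=None):
        # max_new_tokens receives the slider value; it is accepted but not used here.
        if not prompt or not prompt.strip():
            return "Please enter a question."
        try:
            return str(query_fn(prompt))
        except Exception as exc:  # broad on purpose: this is the UI boundary
            return f"Sorry, the request failed: {exc}"

    return handler


# Intended use: gr.Interface(fn=make_handler(agent.query), inputs=[...], outputs=[...])
```

Passing the query function in as an argument also makes the callback easy to try out with a stub before launching the interface.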


