<a href="https://colab.research.google.com/github/SantoshIBM/PythonicOOPs/blob/main/XMLRetrievers_v1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Our aim here is to create a custom retriever which is capable of integrating itself into a langchain agent which will be used as a framework on top on GPT-4o LLM to build a robust order enquiry engine.
The retriever will be passed a tool to our Langchain agent and will be deployed over Gradio for providing a chat interface.

In [1]:
!pip install langchainhub
!pip install langchain-openai
!pip install langchain
!pip install beautifulsoup4
!pip install langchain-community
!pip install faiss-cpu
!pip install -U langchain-community tavily-python
!pip gradio_client==0.2.10
!pip install gradio==3.38.0
!pip install unstructured


Collecting langchainhub
  Downloading langchainhub-0.1.21-py3-none-any.whl.metadata (659 bytes)
Collecting types-requests<3.0.0.0,>=2.31.0.2 (from langchainhub)
  Downloading types_requests-2.32.0.20241016-py3-none-any.whl.metadata (1.9 kB)
Downloading langchainhub-0.1.21-py3-none-any.whl (5.2 kB)
Downloading types_requests-2.32.0.20241016-py3-none-any.whl (15 kB)
Installing collected packages: types-requests, langchainhub
Successfully installed langchainhub-0.1.21 types-requests-2.32.0.20241016
Collecting langchain-openai
  Downloading langchain_openai-0.2.14-py3-none-any.whl.metadata (2.7 kB)
Collecting langchain-core<0.4.0,>=0.3.27 (from langchain-openai)
  Downloading langchain_core-0.3.28-py3-none-any.whl.metadata (6.3 kB)
Collecting openai<2.0.0,>=1.58.1 (from langchain-openai)
  Downloading openai-1.58.1-py3-none-any.whl.metadata (27 kB)
Collecting tiktoken<1,>=0.7 (from langchain-openai)
  Downloading tiktoken-0.8.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.met

Collecting unstructured
  Downloading unstructured-0.16.11-py3-none-any.whl.metadata (24 kB)
Collecting filetype (from unstructured)
  Downloading filetype-1.2.0-py2.py3-none-any.whl.metadata (6.5 kB)
Collecting python-magic (from unstructured)
  Downloading python_magic-0.4.27-py2.py3-none-any.whl.metadata (5.8 kB)
Collecting emoji (from unstructured)
  Downloading emoji-2.14.0-py3-none-any.whl.metadata (5.7 kB)
Collecting python-iso639 (from unstructured)
  Downloading python_iso639-2024.10.22-py3-none-any.whl.metadata (13 kB)
Collecting langdetect (from unstructured)
  Downloading langdetect-1.0.9.tar.gz (981 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m981.5/981.5 kB[0m [31m13.6 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
Collecting rapidfuzz (from unstructured)
  Downloading rapidfuzz-3.11.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (11 kB)
Collecting backoff (from unstructured)
  Downl

In [1]:
import getpass
import os

In [2]:
os.environ["OPENAI_API_KEY"] = getpass.getpass()

··········


In [3]:
from langchain.schema import BaseRetriever, Document
from typing import List
import xml.etree.ElementTree as ET
from pathlib import Path
from pydantic import Field, BaseModel
import re

class XMLOrderRetriever(BaseRetriever, BaseModel):
  file_path: Path = Field(..., description="Path to the XML file")
  root: ET.Element = None
  def __init__(self, file_path: Path, **kwargs):
    super().__init__(file_path=file_path, **kwargs)
    self.file_path = file_path
    self.root = self._parse_xml()

  def _parse_xml(self):
    tree = ET.parse(self.file_path)
    return tree.getroot()

  def _get_relevant_documents(self, query: str) -> List[Document]:
    query_parts = query.lower().split()
    documents = []

    if "order" in query_parts:
      order_info = self._extract_order_info()
      documents.append(Document(page_content=str(order_info), metadata={"type": "order_info"}))

    if "billing" in query_parts:
      billing_info = self._extract_billing_info()
      documents.append(Document(page_content=str(billing_info), metadata={"type": "billing_info"}))

    if "orderlines" in query_parts or "items" in query_parts:
      order_lines = self._extract_order_lines()
      documents.append(Document(page_content=str(order_lines), metadata={"type": "order_lines"}))

    if "charges" in query_parts:
      header_charges = self._extract_header_charges()
      documents.append(Document(page_content=str(header_charges), metadata={"type": "header_charges"}))

    return documents

  def _extract_order_info(self):
    return {
        "OrderNo": self.root.get("OrderNo"),
        "DocumentType": self.root.get("DocumentType"),
        #"TotalAmount": self.root.find("PriceInfo").get("TotalAmount"),
        #"Currency": self.root.find("PriceInfo").get("Currency"),
        "PaymentStatus": self.root.get("PaymentStatus")
        }

  def _extract_billing_info(self):
        bill_to = self.root.find("PersonInfoBillTo")
        return {
            "AddressLine1": bill_to.get("AddressLine1"),
            "AddressLine2": bill_to.get("AddressLine2"),
            "State": bill_to.get("State"),
            "Country": bill_to.get("Country"),
            "ZipCode": bill_to.get("ZipCode")
        }

  def _extract_order_lines(self):
        order_lines = []
        for line in self.root.find("OrderLines").findall("OrderLine"):
            order_line = {
                "OrderedQty": line.get("OrderedQty"),
                "PrimeLineNo": line.get("PrimeLineNo"),
                "ItemID": line.find("Item").get("ItemID"),
                "UnitPrice": line.find("LinePriceInfo").get("UnitPrice"),
                "LineTotal": line.find("LinePriceInfo").get("LineTotal")
            }
            order_lines.append(order_line)
        return order_lines

  def _extract_header_charges(self):
        header_charges = []
        for charge in self.root.find("HeaderCharges").findall("HeaderCharge"):
            header_charge = {
                "ChargeAmount": charge.get("ChargeAmount"),
                "ChargeCategory": charge.get("ChargeCategory"),
                "ChargeName": charge.get("ChargeName")
            }
            header_charges.append(header_charge)
        return header_charges

  async def _aget_relevant_documents(self, query: str) -> List[Document]:
        return self._get_relevant_documents(query)


The below code shows how to use the above retriever.

In [4]:
from pathlib import Path
file_path = Path("/content/sample_data/Order_sample.xml")

In [26]:
# Initialize the retriever
#retriever = XMLOrderRetriever(file_path=file_path)

# Query for specific information
#order_info = retriever.get_relevant_documents("order")
#billing_info = retriever.get_relevant_documents("billing")
#order_lines = retriever.get_relevant_documents("orderlines")
#charges = retriever.get_relevant_documents("charges")

# Print the retrieved information
#for doc in order_info + order_lines :
    #print(f"Type: {doc.metadata['type']}")
    #print(doc.page_content)
    #print()


In [5]:
from langchain.vectorstores import FAISS
#from langchain.embeddings import OpenAIEmbeddings
from langchain_openai import OpenAIEmbeddings
ordRetriever = XMLOrderRetriever(file_path=file_path)

# Query for specific information
order_info = ordRetriever.get_relevant_documents("order")
#billing_info = ordRetriever.get_relevant_documents("billing")
order_lines = ordRetriever.get_relevant_documents("orderlines")
#charges = ordRetriever.get_relevant_documents("charges")

# Create a list of Document objects with text and metadata
docObjects = []
for docObject in order_info + order_lines:
    docObjects.append(Document(page_content=docObject.page_content, metadata=docObject.metadata))

# Create and populate the vector store with metadata
embeddings = OpenAIEmbeddings()
vectorstore = FAISS.from_documents(docObjects, embeddings)  # Use from_documents



  order_info = ordRetriever.get_relevant_documents("order")


In [7]:
from langchain.tools.retriever import create_retriever_tool
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.chains import create_retrieval_chain
from langchain_openai import ChatOpenAI
from langchain import hub
from langchain.agents import create_openai_functions_agent
from langchain.agents import OpenAIFunctionsAgent, AgentExecutor
from langchain.agents.openai_functions_agent.base import OpenAIFunctionsAgentOutputParser
from langchain.prompts.chat import ChatPromptTemplate
from langchain.prompts import MessagesPlaceholder
from langchain.prompts import HumanMessagePromptTemplate


# Instead of importing from langchain_core.prompts, import from langchain.prompts.chat:
from langchain.schema import (
    SystemMessage,
    HumanMessage,
    AIMessage
)

from langchain_core.output_parsers import StrOutputParser
from langchain_openai import ChatOpenAI
from pydantic import Field # Import Field from pydantic
from langchain.tools import Tool
from langchain.schema import BaseRetriever, Document
from typing import List


# Get the prompt to use - you can modify this!
output_parser = StrOutputParser()

# Updated prompt to include 'context' and 'input_documents'
# Added a tool_code placeholder for the agent to specify tools
prompt = ChatPromptTemplate.from_messages([
    SystemMessage(content="You are a helpful AI assistant.Answer questions only based on information available in context."),
    HumanMessage(content="{input}"),
    HumanMessage(content="{context}"),  # Include context in the prompt
    #HumanMessage(content="Order Information: {order_info}\nOrder Lines: {order_lines}"),  # Include metadata placeholders
    MessagesPlaceholder(variable_name="agent_scratchpad"), # Placeholder to store tool call and results. This is important for OpenAI function calling
    HumanMessagePromptTemplate.from_template("{input_documents}")
])



#prompt = hub.pull("hwchase17/openai-functions-agent")
#print(prompt)
# You need to set OPENAI_API_KEY environment variable or pass it as argument `api_key`.
#llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=1)
llm = ChatOpenAI(model="gpt-4o", temperature=0)

retriever = vectorstore.as_retriever()
# Modified to pass 'input' and 'agent_scratchpad' for create_stuff_documents_chain
document_chain = create_stuff_documents_chain(
    llm,
    prompt,
    document_variable_name="input_documents"
)


def create_modified_retrieval_chain(retriever, chain):
  """Create a retrieval chain that handles list of Documents."""
  def _run(query: str) -> str:
    docs = retriever.get_relevant_documents(query)  # Get list of Documents
    #relevant_docs = [doc for doc in docs if doc.metadata.get("type") in ["order_info", "order_lines"]]
    #order_info = next((doc.page_content for doc in docs if doc.metadata.get("type") == "order_info"), "")
    #order_lines = next((doc.page_content for doc in docs if doc.metadata.get("type") == "order_lines"), "")
    # Chain expects a dictionary, so we provide it
    #import uuid
    #tool_call_id = str(uuid.uuid4())
    # Modified agent_scratchpad to include 'role' and 'content'
    #return chain.invoke({"input_documents": docs, "context": "",
                        # "input": query, "agent_scratchpad":
                         # [{"role": "tool", "content": f"Tool Order_search returned: {docs}", # Changed to 'role' and 'content' keys
                           # "tool_call_id": tool_call_id}]}) # Changed to 'role' and 'content' keys
    return chain.invoke({"input_documents": docs, "context": "", "input": query, "agent_scratchpad": []})  # Removed 'agent_scratchpad'
  return _run

retrieval_chain = create_modified_retrieval_chain(retriever, document_chain)

# Create a custom tool that wraps your retrieval_chain function
class CustomRetrieverTool(Tool):
   # name and description are now instance variables
  def __init__(self, name: str = "Order_search", description: str = "Search for information about Order. For any questions about Order, you must use this tool!"):
    """Initialize the tool."""
    super().__init__(name=name, func=self._run, description=description)  # Pass name and description directly
    self.name = name
    self.description = description

  def _run(self, query: str) -> str:
    return retrieval_chain(query)  # Call your retrieval_chain function

  async def _arun(self, query: str) -> str:
    raise NotImplementedError("This tool does not support async")

# Use the custom tool instead of create_retriever_tool
retriever_tool = CustomRetrieverTool()

tools = [retriever_tool]

#agent = create_openai_functions_agent(llm, tools, prompt)
#agent_executor = AgentExecutor(agent=agent, tools=tools, verbose=True)


# Modified AgentExecutor to fetch context dynamically
class ContextAwareAgentExecutor(AgentExecutor):
  retriever_tool: Tool = Field(..., description="The retriever tool to use for context") # Define retriever_tool as a Pydantic Field

  def __init__(self,retriever_tool, *args, **kwargs):
    """Initialize the agent executor."""
    super().__init__(*args, retriever_tool=retriever_tool, **kwargs)
     # Access the retriever tool from the tools list
    #self.retriever_tool = next((tool for tool in self.tools if tool.name == "Order_search"), None)
    #self.retriever_tool = retriever_tool

  async def _ainvoke(self, inputs: dict) -> dict:
    """invoke the agent chain."""
    # Fetch context from the retriever tool
    #context = self.retriever_tool.run(inputs["input"]) # Removed this line. The _take_next_step function will handle tool calls.

    # Update the inputs with the fetched context
    #updated_inputs = {**inputs, "context": context}  # Removed this line. The _take_next_step function will handle tool calls.

    # Generate a unique tool_call_id
    #import uuid # Removed this line. The _take_next_step function will handle tool calls.
    #tool_call_id = str(uuid.uuid4()) # Removed this line. The _take_next_step function will handle tool calls.
    #updated_inputs["agent_scratchpad"] = [ # Removed this line. The _take_next_step function will handle tool calls.
    #  {"role": "tool", "content": f"Tool Order_search returned: {context}", "tool_call_id": tool_call_id} # Removed this line. The _take_next_step function will handle tool calls.
    #]

    # Call the original _ainvoke method with updated inputs
    return await super()._ainvoke(inputs)  # Removed updated_inputs and passed original inputs to handle tool call

  def _take_next_step(self, name_to_tool_map, color_mapping, inputs, intermediate_steps, run_manager):
    """Override _take_next_step to include context in kwargs."""
    # Fetch context if not already present in inputs
    if "context" not in inputs:
      context = self.retriever_tool.run(inputs["input"])
      inputs["context"] = context # Add context to inputs if it's not already there

    # Get relevant documents and add them to inputs
    retrievedDocs = self.retriever_tool.func(inputs["input"])  # Get documents using the retriever tool's function
    inputs["input_documents"] = retrievedDocs # Removed .page_content to pass the entire document

    return super()._take_next_step(name_to_tool_map, color_mapping, inputs, intermediate_steps, run_manager)


agent = OpenAIFunctionsAgent(llm=llm, tools=tools, prompt=prompt, output_parser=OpenAIFunctionsAgentOutputParser)
#agent_executor = AgentExecutor.from_agent_and_tools(agent=agent, tools=tools, verbose=True)
agent_executor = ContextAwareAgentExecutor(retriever_tool, agent=agent, tools=tools, verbose=True)

In [8]:
query = "What is the order number provided in the context? Please provide a brief answer."
result = agent_executor.invoke({"input": query}, config={"max_tokens": 300})
print(result["output"])



[1m> Entering new ContextAwareAgentExecutor chain...[0m
[32;1m[1;3mIt looks like you have details for an order with OrderNo 'ImportedOrderS1'. Here is a summary of the order:

- **Document Type**: 0001
- **Payment Status**: AUTHORIZED

**Items in the Order:**

1. **Item 1**:
   - Ordered Quantity: 1
   - Prime Line Number: 001
   - Item ID: 100001
   - Unit Price: $20
   - Line Total: $20.00

2. **Item 2**:
   - Ordered Quantity: 1
   - Prime Line Number: 002
   - Item ID: 100001
   - Unit Price: $35
   - Line Total: $35.00

**Total Order Amount**: $55.00

If you need more information or have specific questions about this order, please let me know![0m

[1m> Finished chain.[0m
It looks like you have details for an order with OrderNo 'ImportedOrderS1'. Here is a summary of the order:

- **Document Type**: 0001
- **Payment Status**: AUTHORIZED

**Items in the Order:**

1. **Item 1**:
   - Ordered Quantity: 1
   - Prime Line Number: 001
   - Item ID: 100001
   - Unit Price: $20
  

In [22]:
# Summarization (using transformers)
from transformers import pipeline
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
summary = summarizer(result["output"], max_length=30, min_length=5)
print(summary[0]["summary_text"])

Device set to use cpu


If you have any questions or need further details about the order "ImportedOrderS1," feel free to ask! Whether it's


In [9]:
print(result["output"])

It looks like you have details for an order with OrderNo 'ImportedOrderS1'. Here is a summary of the order:

- **Document Type**: 0001
- **Payment Status**: AUTHORIZED

**Items in the Order:**

1. **Item 1**:
   - Ordered Quantity: 1
   - Prime Line Number: 001
   - Item ID: 100001
   - Unit Price: $20
   - Line Total: $20.00

2. **Item 2**:
   - Ordered Quantity: 1
   - Prime Line Number: 002
   - Item ID: 100001
   - Unit Price: $35
   - Line Total: $35.00

**Total Order Amount**: $55.00

If you need more information or have specific questions about this order, please let me know!


In [10]:
import gradio as gr



In [11]:
#from transformers import pipeline

def predict(message, _):
  result = agent_executor.invoke({"input": message})
  #summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
  #summary = summarizer(result["output"])
  return result["output"]

In [12]:
gr.ChatInterface(predict,
    chatbot=gr.Chatbot(height=300),
    textbox=gr.Textbox(placeholder="Hi I am your virtual assistant, how can I help you today?", container=False, scale=7),
    title="Order Support",
    description="Ask anything about Orders",
    theme="soft",
    examples=["What is the order number", "Who placed this order?"],
    retry_btn=None,
    undo_btn="Delete Previous",
    clear_btn="Clear",).launch(share=True)

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
IMPORTANT: You are using gradio version 3.38.0, however version 4.44.1 is available, please upgrade.
--------
Running on public URL: https://5797be4fb864d7f16f.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)


