# LangChain Retrieval-Augmented Generation

This notebook introduces how to build a retrieval-augmented generation (RAG) pipeline with LangChain.
Made by Csaba Hegedűs and Attila Frankó, BME-TMIT. 

## Chapter 0 Setup

### Python packages 
Install the prerequisites: the LangChain packages, the Azure Document Intelligence SDK, and the PDF and image utilities.

In [1]:
%pip install --quiet langchain langchain-community langchain-openai langchain_chroma 
%pip install --upgrade --quiet  langchain langchain-community azure-ai-documentintelligence
%pip install --quiet azure-identity pillow PyMuPDF

### Configure LLM

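Always run this cell before trying anything else.

You can use either OpenAI or Azure OpenAI. Run only the cell for the provider you choose, because both cells assign `llm` and `embedder`.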

In [2]:
AZURE_OPENAI_ENDPOINT = ""
AZURE_OPENAI_API_KEY = ""
AZURE_OPENAI_API_VERSION = "2024-05-01-preview"
AZURE_OPENAI_DEPLOYMENT_NAME = "gpt4o"
AZURE_OPENAI_EMBEDDING_MODEL = "text-embedding-3-large"

from langchain_openai import AzureChatOpenAI, AzureOpenAIEmbeddings

llm = AzureChatOpenAI(
    azure_endpoint=AZURE_OPENAI_ENDPOINT,
    api_key=AZURE_OPENAI_API_KEY,
    api_version=AZURE_OPENAI_API_VERSION,
    deployment_name=AZURE_OPENAI_DEPLOYMENT_NAME,
)

embedder = AzureOpenAIEmbeddings(
    azure_endpoint=AZURE_OPENAI_ENDPOINT,
    api_key=AZURE_OPENAI_API_KEY,
    api_version=AZURE_OPENAI_API_VERSION,
    model=AZURE_OPENAI_EMBEDDING_MODEL,
)
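
Hard-coded keys are easy to leak when a notebook is shared. As a minimal sketch, assuming the credentials are exported as environment variables named like the constants above (the variable names and the `read_setting` helper are illustrative assumptions, not part of LangChain), the constants could be filled like this:

```python
import os


def read_setting(name: str, default: str = "") -> str:
    """Return the environment variable `name`, or `default` when it is unset."""
    return os.environ.get(name, default)


# Assumed variable names; adjust them to match your own environment.
AZURE_OPENAI_ENDPOINT = read_setting("AZURE_OPENAI_ENDPOINT")
AZURE_OPENAI_API_KEY = read_setting("AZURE_OPENAI_API_KEY")
AZURE_OPENAI_API_VERSION = read_setting("AZURE_OPENAI_API_VERSION", "2024-05-01-preview")
```

The rest of the cell can stay unchanged, since it only reads these constants.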

ALTERNATIVE: Using OpenAI as LLM

In [22]:
OPENAI_API_KEY = ""

from langchain_openai import ChatOpenAI, OpenAIEmbeddings

llm = ChatOpenAI(api_key=OPENAI_API_KEY, model="gpt-4o")
embedder = OpenAIEmbeddings(api_key=OPENAI_API_KEY)

### Configure Azure Document Intelligence

In [19]:
AZURE_DOCUMENT_AI_ENDPOINT = ""
AZURE_DOCUMENT_AI_API_KEY = ""
DOC_PATH = "./docs/copilotRC.pdf"
DOC_OUTPUT_PATH = "./langchain_rag_processed/"

## Chapter 1 Processing documents naively  

The naive pipeline has three steps: load the document, split it into chunks, and store (embed) the chunks.

First, the file must be loaded into `Document` objects.
If you need additional help, follow this tutorial: https://python.langchain.com/v0.2/docs/tutorials/rag/

There are many types of loaders in LangChain: https://python.langchain.com/v0.2/docs/integrations/document_loaders/

How to load PDFs specifically: https://python.langchain.com/v0.2/docs/how_to/document_loader_pdf/

I have used the Unstructured library, because it has built-in OCR and supports multi-modality and many file types. It has a LangChain integration: https://python.langchain.com/v0.2/docs/integrations/providers/unstructured/

However, for an introductory demo a simpler loader is sufficient, so we will use the PyPDF loader: https://python.langchain.com/v0.2/docs/how_to/document_loader_pdf/#using-pypdf

The API has changed since v0.3: `load_and_split()` is now deprecated, but it is still useful for demonstrating basic splitting.

In [3]:
%pip install -q pypdf


In [14]:
from langchain_community.document_loaders import PyPDFLoader
from pprint import pprint

# Files - scientific paper vs. project proposal
file_path = "./docs/copilotRC.pdf"
#file_path = "./docs/example_dtops.pdf"

# Loader
loader = PyPDFLoader(file_path)
#pages = loader.load_and_split()
pages = loader.load() # Just pages
pages

[Document(metadata={'source': './docs/copilotRC.pdf', 'page': 0}, page_content='Co-pilots for Arrowhead-based\nCyber-Physical System of Systems Engineering\nCsaba Heged ˝us, P ´al Varga\nDepartment of Telecommunications and Artificial Intelligence\nBudapest University of Technology and Economics\nM˝uegyetem rkp. 3., H-1111 Budapest, Hungary.\nEmail: {hegeduscs, pvarga }@tmit.bme.hu\nAbstract —One benefit of Large Language Model (LLM) based\napplications (e.g. chat assistants or co-pilots) is that they can\nbring humans closer to the loop in various IT and OT solutions.\nCo-pilots can achieve many things at once, i.e. provide a context-\naware natural language interface to knowledge bases, reach\nvarious systems (via APIs), or even help solving multi-step\nproblems with their planning and reasoning abilities. However,\nmaking production-grade chat assistants is a topical challenge,\nas fast-evolving LLMs expose new types of application design\nand security issues that need tackling. The

In [5]:
len(pages)

11

The next step is to split the large Documents into smaller chunks that can later be injected into prompts.

Note that GPT-4o currently supports an input context window of roughly 120K tokens. This window is shared by:

* system prompt
* chat history
* user query
* context injected by RAG pipeline

We usually inject a few relevant chunks, say 3. Chunks of around 10k tokens each therefore fit comfortably next to the other parts of the prompt. With earlier models (GPT-4-32K or GPT-3.5), the chunk size had to be much smaller.
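
The budget reasoning above can be written as simple arithmetic. The sketch below is only an illustration: every number except the roughly 120K window is an assumed placeholder, and `chunk_token_budget` is a hypothetical helper name.

```python
def chunk_token_budget(context_window, system_prompt, chat_history,
                       user_query, answer_reserve, num_chunks):
    """Return the maximum number of tokens each retrieved chunk may use."""
    remaining = (context_window - system_prompt - chat_history
                 - user_query - answer_reserve)
    if remaining <= 0:
        raise ValueError("The fixed parts of the prompt already exceed the context window.")
    return remaining // num_chunks


# All values below are illustrative assumptions, not measurements.
budget = chunk_token_budget(
    context_window=120_000,
    system_prompt=1_000,
    chat_history=20_000,
    user_query=500,
    answer_reserve=4_000,
    num_chunks=3,
)
print(budget)  # prints 31500
```

Under these assumptions the ceiling is about 31K tokens per chunk, so a chunk size of around 10k tokens leaves generous headroom for longer chat histories.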

In [17]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Chunk size is set to 1,000 characters (the splitter counts characters by default, not tokens), so we can see what's happening.
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
splits = text_splitter.split_documents(pages)

splits

[Document(metadata={'source': './docs/copilotRC.pdf', 'page': 0}, page_content='Co-pilots for Arrowhead-based\nCyber-Physical System of Systems Engineering\nCsaba Heged ˝us, P ´al Varga\nDepartment of Telecommunications and Artificial Intelligence\nBudapest University of Technology and Economics\nM˝uegyetem rkp. 3., H-1111 Budapest, Hungary.\nEmail: {hegeduscs, pvarga }@tmit.bme.hu\nAbstract —One benefit of Large Language Model (LLM) based\napplications (e.g. chat assistants or co-pilots) is that they can\nbring humans closer to the loop in various IT and OT solutions.\nCo-pilots can achieve many things at once, i.e. provide a context-\naware natural language interface to knowledge bases, reach\nvarious systems (via APIs), or even help solving multi-step\nproblems with their planning and reasoning abilities. However,\nmaking production-grade chat assistants is a topical challenge,\nas fast-evolving LLMs expose new types of application design\nand security issues that need tackling. The

In [16]:
len(splits)

41
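
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified fixed-window splitter in plain Python. It is only a sketch of the overlap idea: `RecursiveCharacterTextSplitter` is more sophisticated, because it first tries to split on separators such as paragraph and line breaks. The function name `split_with_overlap` is made up for this illustration.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut `text` into windows of `chunk_size` characters sharing `chunk_overlap` characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
for chunk in split_with_overlap(sample, chunk_size=10, chunk_overlap=4):
    print(chunk)
# prints:
# abcdefghij
# ghijklmnop
# mnopqrstuv
# stuvwxyz
```

Each chunk repeats the last four characters of the previous one, which is how overlap keeps a sentence that straddles a boundary retrievable from either side.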

Now we need to build a knowledge base using a vector database. We'll use a simple in-memory vector DB. In other projects, we use Postgres with a plugin as the vector DB.

Further reading: https://python.langchain.com/v0.2/docs/how_to/vectorstores/
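
To build intuition for what a vector store does, the following toy version stores (vector, text) pairs in a list and ranks them by cosine similarity. It is a sketch under strong assumptions: `toy_embed` is a bag-of-words stand-in for a real embedding model, and `ToyVectorStore` is a hypothetical class, not the API of any actual vector database.

```python
import math
import re
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorStore:
    def __init__(self):
        self._items = []  # list of (vector, text) pairs kept in memory

    def add_texts(self, texts):
        for text in texts:
            self._items.append((toy_embed(text), text))

    def similarity_search(self, query, k=2):
        query_vector = toy_embed(query)
        ranked = sorted(
            self._items,
            key=lambda item: cosine_similarity(query_vector, item[0]),
            reverse=True,
        )
        return [text for _, text in ranked[:k]]


store = ToyVectorStore()
store.add_texts([
    "The Service Registry stores the services offered in a local cloud.",
    "Chunks are embedded and stored in a vector database.",
    "The Orchestration System matches consumers with providers.",
])
print(store.similarity_search("Where are the embedded chunks stored?", k=1))
```

A real embedding model captures meaning rather than word overlap, but the store-then-rank flow is the same.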

### Try with Azure Document AI 

https://azure.microsoft.com/en-us/products/ai-services/ai-document-intelligence

In [20]:
from langchain_community.document_loaders import AzureAIDocumentIntelligenceLoader

azure_loader = AzureAIDocumentIntelligenceLoader(
    api_endpoint=AZURE_DOCUMENT_AI_ENDPOINT, api_key=AZURE_DOCUMENT_AI_API_KEY,
    file_path= DOC_PATH,  
    api_model="prebuilt-layout",
    mode="markdown"
)

article_from_azure = azure_loader.load()

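In [16]:
import os

print("Number of documents generated: " + str(len(article_from_azure)))

# Make sure the output folder exists, then write the file once in "w" mode
# so that re-running the cell does not append duplicate content.
os.makedirs(DOC_OUTPUT_PATH, exist_ok=True)
with open(os.path.join(DOC_OUTPUT_PATH, "copilotRC_processed.md"), "w", encoding="utf-8") as file:
    for doc in article_from_azure:
        file.write(doc.page_content)

# Keep the Markdown text of the (single) document for the next cells
document = article_from_azure[0].page_content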

Number of documents generated: 1


Now, let's see the Table of Contents for the document, extracted from the markdown parsed version. 

In [34]:
## GENERATED WITH GITHUB COPILOT

import re
import tiktoken

def extract_toc(markdown_text):
    toc = []
    lines = markdown_text.split('\n')
    for line in lines:
        match = re.match(r'^(#{1,6})\s+(.*)', line)
        if match:
            level = len(match.group(1))
            title = match.group(2)
            toc.append((level, title))
    return toc

def print_toc(toc):
    for level, title in toc:
        indent = '  ' * (level - 1)
        print(f"{indent}- {title}")

def count_tokens(text, model="gpt-4o"):
    encoding = tiktoken.encoding_for_model(model)
    tokens = encoding.encode(text)
    return len(tokens)

def pretty_print_toc_with_token_counts(toc, document):
    # Split the document at every heading; each piece starts with its heading title
    sections = re.split(r'(?m)^#{1,6}\s+', document)
    sections = sections[1:]  # Drop whatever precedes the first heading
    toc_string = ""
    for level, title in toc:
        # The count covers the title plus the text up to the next heading (subsections excluded)
        section_text = next((s for s in sections if s.startswith(title)), "")
        token_count = count_tokens(section_text)
        indent = '  ' * (level - 1)
        toc_string += f"{indent}- {title} (Tokens: {token_count})\n"
    return toc_string

toc = extract_toc(document)
print(pretty_print_toc_with_token_counts(toc, document))

- Co-pilots for Arrowhead-based Cyber-Physical System of Systems Engineering (Tokens: 316)
  - I. INTRODUCTION (Tokens: 432)
    - A. Capabilities of Chat Assistants (Tokens: 153)
    - B. Motivation and Structure of the Paper (Tokens: 841)
  - II. RELATED WORKS (Tokens: 6)
    - A. LLM Engineering Used in Chat Assistants (Tokens: 1464)
    - B. Copilot Products on the Market (Tokens: 189)
  - III. COPILOTS ACROSS THE SOS LIFECYCLE (Tokens: 14)
    - A. Use Cases for Arrowhead Copilots (Tokens: 1500)
    - B. Findings of the Proof of Concept (Tokens: 735)
    - C. Future Work (Tokens: 637)
  - IV. CONCLUSIONS (Tokens: 128)
  - ACKNOWLEDGMENT (Tokens: 107)
  - REFERENCES (Tokens: 1354)



Thought experiment: how can we embed this? 
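
One possible answer to the thought experiment is to chunk along the document's own structure, so that every chunk is a section that carries its heading path as metadata. The sketch below is plain Python rather than a LangChain API; the `split_markdown_sections` name and the record layout are assumptions made for this illustration.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)")


def split_markdown_sections(markdown_text: str) -> list[dict]:
    """Return one record per heading with its heading path and body text."""
    sections = []
    stack = []      # (level, title) of the headings that enclose the current line
    current = None
    for line in markdown_text.split("\n"):
        match = HEADING.match(line)
        if match:
            level, title = len(match.group(1)), match.group(2).strip()
            # Leave every heading that is at the same or a deeper level
            while stack and stack[-1][0] >= level:
                stack.pop()
            stack.append((level, title))
            current = {"path": " > ".join(t for _, t in stack), "text": ""}
            sections.append(current)
        elif current is not None:
            current["text"] += line + "\n"
    return sections


sample = (
    "# Paper\nAbstract text.\n"
    "## I. INTRODUCTION\nIntro text.\n"
    "### A. Capabilities\nDetails.\n"
    "## II. RELATED WORKS\n"
)
for section in split_markdown_sections(sample):
    print(section["path"], "|", section["text"].strip())
```

Sections that contain only a heading (like the 6-token "II. RELATED WORKS" entry above) produce empty bodies and could be skipped or merged with their first subsection before embedding.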

## Chapter 2 Multimodal document pipelines

Try image extraction using the Document AI SDK (no LangChain)

Snippet forked from https://github.com/microsoft/Form-Recognizer-Toolkit/blob/main/SampleCode/Python/sample_figure_understanding.ipynb

Small utilities to crop images from files, because Azure Document AI currently does NOT extract images into the MD / JSON version of the recognized file.

In [18]:

from PIL import Image
import fitz  # PyMuPDF
import mimetypes

def crop_image_from_image(image_path, page_number, bounding_box):
    """
    Crops an image based on a bounding box.

    :param image_path: Path to the image file.
    :param page_number: The page number of the image to crop (for TIFF format).
    :param bounding_box: A tuple of (left, upper, right, lower) coordinates for the bounding box.
    :return: A cropped image.
    :rtype: PIL.Image.Image
    """
    with Image.open(image_path) as img:
        if img.format == "TIFF":
            # Open the TIFF image
            img.seek(page_number)
            img = img.copy()
            
        # The bounding box is expected to be in the format (left, upper, right, lower).
        cropped_image = img.crop(bounding_box)
        return cropped_image

def crop_image_from_pdf_page(pdf_path, page_number, bounding_box):
    """
    Crops a region from a given page in a PDF and returns it as an image.

    :param pdf_path: Path to the PDF file.
    :param page_number: The page number to crop from (0-indexed).
    :param bounding_box: A tuple of (x0, y0, x1, y1) coordinates for the bounding box.
    :return: A PIL Image of the cropped area.
    """
    doc = fitz.open(pdf_path)
    page = doc.load_page(page_number)
    
    # Cropping the page. The rect requires the coordinates in the format (x0, y0, x1, y1).
    bbx = [x * 72 for x in bounding_box]
    rect = fitz.Rect(bbx)
    pix = page.get_pixmap(matrix=fitz.Matrix(300/72, 300/72), clip=rect)
    
    img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
    
    doc.close()

    return img

def crop_image_from_file(file_path, page_number, bounding_box):
    """
    Crop an image from a file.

    Args:
        file_path (str): The path to the file.
        page_number (int): The page number (for PDF and TIFF files, 0-indexed).
        bounding_box (tuple): The bounding box coordinates in the format (x0, y0, x1, y1).

    Returns:
        A PIL Image of the cropped area.
    """
    mime_type = mimetypes.guess_type(file_path)[0]
    
    if mime_type == "application/pdf":
        return crop_image_from_pdf_page(file_path, page_number, bounding_box)
    else:
        return crop_image_from_image(file_path, page_number, bounding_box)
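
The PDF cropping utility above multiplies the bounding box by 72 and renders at 300 DPI, which implies that the box arrives in inches. The sketch below spells out these conversions, assuming the polygon is a flat list of x, y pairs in inches (as in the sample output further down). The helper names are hypothetical. Taking the minimum and maximum of all four corners is a slightly more defensive alternative to reading only two corners, in case a region is a little skewed.

```python
def polygon_to_bbox(polygon):
    """Return (x0, y0, x1, y1) enclosing a flat [x, y, x, y, ...] polygon."""
    xs = polygon[0::2]
    ys = polygon[1::2]
    return (min(xs), min(ys), max(xs), max(ys))


def inches_to_points(bbox):
    """PDF coordinates use 72 points per inch."""
    return tuple(value * 72 for value in bbox)


def rendered_size_in_pixels(bbox, dpi=300):
    """Approximate pixel size of the crop when rendered at `dpi`."""
    x0, y0, x1, y1 = bbox
    return (round((x1 - x0) * dpi), round((y1 - y0) * dpi))


# Polygon of the first figure body, copied from the sample output of this notebook
polygon = [0.6746, 5.4034, 4.1645, 5.4039, 4.164, 7.2426, 0.6744, 7.2425]
bbox = polygon_to_bbox(polygon)
print(bbox)
print(inches_to_points(bbox))
print(rendered_size_in_pixels(bbox))
```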

Read the same document, this time with the Azure SDK directly

In [20]:
from azure.ai.documentintelligence import DocumentIntelligenceClient
from azure.ai.documentintelligence.models import ContentFormat
from azure.core.credentials import AzureKeyCredential

# Note: ContentFormat and analyze_request match the SDK version this notebook was run with;
# newer azure-ai-documentintelligence releases may name them differently.
document_intelligence_client = DocumentIntelligenceClient(
    endpoint=AZURE_DOCUMENT_AI_ENDPOINT,
    credential=AzureKeyCredential(AZURE_DOCUMENT_AI_API_KEY),
)

with open(DOC_PATH, "rb") as f:
    poller = document_intelligence_client.begin_analyze_document(
        "prebuilt-layout",
        analyze_request=f,
        content_type="application/octet-stream",
        output_content_format=ContentFormat.MARKDOWN,
    )

result = poller.result()
md_content = result.content

Actual code that extracts images and captions

In [30]:
from PIL import ImageDraw, ImageFont
import os
import base64
from io import BytesIO

figures_data = []

# Placeholder mapping of sections to their starting pages (1-indexed); replace with the real values.
# It is deliberately NOT named `toc`, so the `toc` extracted from the Markdown earlier stays intact.
section_pages = [
    {"title": "Introduction", "page_number": 1},
    {"title": "Methodology", "page_number": 5},
    {"title": "Results", "page_number": 10},
    # Add more sections as needed
]

def get_section_from_toc(page_number):
    # Find the section based on the page number
    for i in range(len(section_pages) - 1):
        if section_pages[i]["page_number"] <= page_number < section_pages[i + 1]["page_number"]:
            return section_pages[i]["title"]
    return section_pages[-1]["title"]  # Return the last section if page number is beyond the last section

print("Figures:")
for idx, figure in enumerate(result.figures):
    figure_content = ""
    print(f"Figure #{idx} has the following spans: {figure.spans}")
    for i, span in enumerate(figure.spans):
        figure_content += md_content[span.offset:span.offset + span.length]

    if figure.caption:
        caption_region = figure.caption.bounding_regions
        print(f"\tCaption: {figure.caption.content}")
        print(f"\tCaption bounding region: {caption_region}")
        for region in figure.bounding_regions:
            if region not in caption_region:
                print(f"\tFigure body bounding regions: {region}")
                boundingbox = (
                    region.polygon[0],  # x0 (left)
                    region.polygon[1],  # y0 (top)
                    region.polygon[4],  # x1 (right)
                    region.polygon[5]   # y1 (bottom)
                )
                print(f"\tFigure body bounding box in (x0, y0, x1, y1): {boundingbox}")
                cropped_image = crop_image_from_file(DOC_PATH, region.page_number - 1, boundingbox) # page_number is 1-indexed

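                # Add figure caption to figure
                draw = ImageDraw.Draw(cropped_image)
                try:
                    font = ImageFont.truetype("arial.ttf", 20)  # Use a larger font size
                except OSError:
                    # arial.ttf is not available on every system; fall back to Pillow's built-in font
                    font = ImageFont.load_default()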
                text = figure.caption.content

                # Calculate text size and position
                text_bbox = draw.textbbox((0, 0), text, font=font)
                text_width = text_bbox[2] - text_bbox[0]
                text_height = text_bbox[3] - text_bbox[1]
                image_width, image_height = cropped_image.size
                text_position = ((image_width - text_width) / 2, image_height + 10)  # Bottom middle with padding

                # Create a new image with extra space for the caption
                new_image_height = image_height + text_height + 20  # Add extra space for the caption
                new_image = Image.new("RGB", (image_width, new_image_height), "white")
                new_image.paste(cropped_image, (0, 0))

                # Draw the caption on the new image
                draw = ImageDraw.Draw(new_image)
                draw.text(text_position, text, fill="black", font=font)

                # Get the base name of the file
                base_name = os.path.basename(DOC_PATH)
                # Remove the file extension
                file_name_without_extension = os.path.splitext(base_name)[0]

                output_file = f"{file_name_without_extension}_cropped_image_{idx}.png"
                cropped_image_filename = os.path.join(DOC_OUTPUT_PATH, output_file)

                new_image.save(cropped_image_filename)
                print(f"\tFigure {idx} cropped and saved as {cropped_image_filename}")

                # Get the section from the table of contents
                section = get_section_from_toc(region.page_number)

                # Add to figures_data array
                # Convert the image to JPEG format and encode it in base64
                buffered = BytesIO()
                new_image.save(buffered, format="JPEG")
                img_str = base64.b64encode(buffered.getvalue()).decode("utf-8")

                figures_data.append({
                    "location": cropped_image_filename,
                    "image": img_str,
                    "caption": figure.caption.content,
                    "section": section
                })

# Print the figures_data array
print(figures_data)

Figures:
Figure #0 has the following spans: [{'offset': 7247, 'length': 367}]
	Caption: Fig. 1. A generic overview on an Arrowhead local cloud [10]
	Caption bounding region: [{'pageNumber': 2, 'polygon': [0.9803, 7.3654, 3.8536, 7.3644, 3.8537, 7.5117, 0.9804, 7.5127]}]
	Figure body bounding regions: {'pageNumber': 2, 'polygon': [0.6746, 5.4034, 4.1645, 5.4039, 4.164, 7.2426, 0.6744, 7.2425]}
	Figure body bounding box in (x0, y0, x1, y1): (0.6746, 5.4034, 4.164, 7.2426)
	Figure 0 cropped and saved as ./langchain_rag_processed/copilotRC_cropped_image_0.png
Figure #1 has the following spans: [{'offset': 11560, 'length': 198}]
	Caption: Fig. 2. Overview of Retrieval Augmented Generation
	Caption bounding region: [{'pageNumber': 3, 'polygon': [1.1743, 4.6652, 3.6608, 4.6648, 3.6608, 4.81, 1.1743, 4.8104]}]
	Figure body bounding regions: {'pageNumber': 3, 'polygon': [0.7952, 3.6857, 4.0464, 3.6859, 4.046, 4.5336, 0.795, 4.5336]}
	Figure body bounding box in (x0, y0, x1, y1): (0.7952, 3.6857

We can now use GPT-4o to describe the contents of the figures, so that the descriptions can be added to the vector DB later.

In [36]:
from langchain_core.messages import HumanMessage, SystemMessage

describe_image_prompt = [
    SystemMessage(content="Describe the image below so we can embed the summary of the image for a RAG pipeline. The image is from a document with the following Table of Contents:"),
    HumanMessage(
        content=[
            {"type": "text", "text": f"{pretty_print_toc_with_token_counts(toc, document)}"},
            {
                "type": "image_url",
                "image_url": {"url": f"data:image/jpeg;base64,{figures_data[0]['image']}"},
            },
        ]
    )
]

response = llm.invoke(describe_image_prompt)
response.pretty_print()


The image is a diagram titled "Fig. 1. A generic overview on an Arrowhead local cloud [10]" and it represents the architecture of an Arrowhead-based Cyber-Physical System of Systems (CPSoS). The diagram includes several components within a cloud structure labeled "Governance Body," and these components are interconnected to various application systems and devices.

Components within the cloud include:
- Onboarding Controller (orange)
- Service Registry (blue)
- Certificate Authority (red)
- Orchestration System (green)
- Authorization System (red)
- Gatekeeper System (yellow)

Outside the cloud, there are:
- Application Systems (gray) connected to devices
- Gateway System (black)

The diagram visually represents how different systems and controllers interact within an Arrowhead local cloud to manage and orchestrate services and devices.
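
Before such a description can go into the vector DB, it has to be paired with metadata that points back to the figure. The sketch below shows one possible shape, assuming entries laid out like `figures_data` above. `FigureRecord` and `build_figure_records` are hypothetical names; in a LangChain pipeline the `Document` class, which has the same `page_content` and `metadata` fields, would likely play this role. The description string is a shortened stand-in for a model response.

```python
from dataclasses import dataclass, field


@dataclass
class FigureRecord:
    """Hypothetical container: the text to embed plus metadata for tracing it back."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def build_figure_records(figures, descriptions):
    """Pair each figure entry with its generated description."""
    records = []
    for figure, description in zip(figures, descriptions):
        records.append(FigureRecord(
            # Caption first, so exact caption matches also score well at retrieval time
            page_content=f"{figure['caption']}\n\n{description}",
            metadata={
                "type": "figure",
                "location": figure["location"],
                "section": figure["section"],
            },
        ))
    return records


example_figures = [{
    "location": "./langchain_rag_processed/copilotRC_cropped_image_0.png",
    "image": "",  # base64 payload omitted in this sketch
    "caption": "Fig. 1. A generic overview on an Arrowhead local cloud [10]",
    "section": "Introduction",
}]
example_descriptions = ["Diagram of the core systems inside an Arrowhead local cloud."]

for record in build_figure_records(example_figures, example_descriptions):
    print(record.metadata["location"])
    print(record.page_content)
```

The base64 image itself is left out of the record on purpose: only the text is embedded, while the `location` field lets the application load the image when a figure chunk is retrieved.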


## Chapter 3 Retrieval and generation

Read: https://python.langchain.com/v0.2/docs/tutorials/rag/

In [26]:
from langchain_chroma import Chroma

vectorstore = Chroma.from_documents(documents=splits, embedding=embedder)
retriever = vectorstore.as_retriever()

pprint(retriever.invoke("What is arrowhead design copilot?"))



[Document(page_content='Fig. 3. The graphical overview of the Arrowhead Engineering Process (AEP) [24] to be supported by Co-pilots\nthe ecosystem, answering design and integration-related\nquestions. This can be embedded in the Arrowhead\nFramework Wiki [9] as an inline chatbot. Intended users\nare anyone who visits the Wiki.\n2) The Arrowhead Management Copilot interacting with\nvarious Arrowhead Core Systems of a Local Cloud de-\nployment to analyze and understand, potentially manage\nthe CPSoS via the Arrowhead governing middleware.\nThis tool can be embedded as a widget to the Arrowhead\nManagement Tool GUI. Intended users are the authen-\nticated Local Cloud (SoS) operators.\n3)Arrowhead Design Copilot which can integrate with the\nengineering toolchain to design SoS deployment and\nunderlying industrial automation processes and infras-\ntructure (i.e. SysML modeling with Eclipse Papyrus).\nThis co-pilot can be integrated into the Arrowhead\nEngineering Toolchain, interacting wit

In [None]:
#IN CASE YOU NEED TO DELETE THE VECTORSTORE
#vectorstore.delete_collection()

Creating system prompt for retrieval

In [32]:
from langchain_core.prompts import ChatPromptTemplate

prompt = ChatPromptTemplate.from_messages([
    ("system","""Use the following pieces of context to answer the question at the end.
        If you don't know the answer, just say that you don't know, don't try to make up an answer.
        Use three sentences maximum and keep the answer as concise as possible.
        Always say "thanks for asking!" at the end of the answer.

        {context}

        Question: {question}

        Helpful Answer:"""
    )
])
prompt.pretty_print()



Use the following pieces of context to answer the question at the end.
        If you don't know the answer, just say that you don't know, don't try to make up an answer.
        Use three sentences maximum and keep the answer as concise as possible.
        Always say "thanks for asking!" at the end of the answer.

        {context}

        Question: {question}

        Helpful Answer:


In [46]:
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

def format_docs(docs):
    concatenated_text = "\n\n".join(doc.page_content for doc in docs)
    return concatenated_text

rag_chain = (
    # creates a dictionary where context value is filled up by retriever then formatted by format_docs
    # and question is passed over unchanged by RunnablePassthrough
    # these are Runnable objects that will be executed in parallel or sequence and the output is fed forward
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    #prompt expects dictionary of context and question
    | prompt
    | llm
    | StrOutputParser()
)

In [47]:
rag_chain.invoke("What is Arrowhead Design Copilot?")

'The Arrowhead Design Copilot is a tool that integrates with the engineering toolchain to assist in designing Systems of Systems (SoS) deployment and underlying industrial automation processes. It can interact with design tools like SysML modeling with Eclipse Papyrus. The intended users are SoS engineers. Thanks for asking!'
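
The `|` syntax in the chain can look like magic, so here is a toy re-implementation of the idea in plain Python. It is only a sketch: `Step` and `fan_out` are made-up names, and LangChain's real runnables do considerably more (for example, a plain dict in a chain is turned into a parallel step automatically, and streaming and async execution are supported).

```python
class Step:
    """Tiny stand-in for a runnable: wraps a function and supports `|`."""

    def __init__(self, func):
        self.func = func

    def __or__(self, other):
        # a | b builds a new step that feeds a's output into b
        return Step(lambda value: other.invoke(self.invoke(value)))

    def invoke(self, value):
        return self.func(value)


def fan_out(**branches):
    """Run every branch on the same input and collect the results in a dict."""
    return Step(lambda value: {name: step.invoke(value) for name, step in branches.items()})


# Stand-ins for the retriever, format_docs, RunnablePassthrough and the prompt
fake_retriever = Step(lambda question: ["chunk about " + question])
format_chunks = Step(lambda chunks: "\n\n".join(chunks))
passthrough = Step(lambda value: value)
fill_prompt = Step(lambda inputs: f"{inputs['context']}\n\nQuestion: {inputs['question']}")

toy_chain = fan_out(context=fake_retriever | format_chunks, question=passthrough) | fill_prompt
print(toy_chain.invoke("Arrowhead"))
```

The question enters once, is sent both through the retrieval branch and through the passthrough, and the resulting dictionary is exactly what the prompt step expects.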

Further reading:

https://python.langchain.com/v0.1/docs/use_cases/question_answering/chat_history/

More advanced RAG variants are better implemented with LangGraph.

* Adaptive RAG
https://github.com/langchain-ai/langgraph/blob/main/examples/rag/langgraph_adaptive_rag.ipynb 
* Corrective RAG
https://github.com/langchain-ai/langgraph/blob/main/examples/rag/langgraph_crag.ipynb
* Self RAG
https://github.com/langchain-ai/langgraph/blob/main/examples/rag/langgraph_self_rag.ipynb 
