# Multi-Doc Analyser with LLMs

Today we test how well LLMs can analyse and summarise multiple documents when understanding each document depends on the context of the documents before it.
This is useful when one needs to analyse a sequence of quarterly or annual reports to identify updates and changes over time.

We leverage frameworks such as LangChain and LangGraph to support this task.

Here, I am using the Goldman Sachs Private Wealth ISG Outlook reports from the past three years to identify how their outlook on the market has changed over time. For simplicity, I have scraped only the commodities section of each report.



### Loading Documents
First, I convert the text in my txt files into LangChain Documents.

In [1]:
from langchain_core.documents import Document

documents = {
    '2023 Outlook': 'data/2023-isg-outlook-commodities.txt',
    '2024 Outlook': 'data/2024-isg-outlook-commodities.txt',
    '2025 Outlook': 'data/2025-isg-outlook-commodities.txt'
}

data = []

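for doc_id, path in documents.items():

    # Read each report's commodities section from disk
    with open(path, encoding="utf-8") as f:
        text = f.read()

    # Skip empty files so the graph never receives a blank document
    if text:
        document = Document(page_content=text, id=doc_id)
        data.append(document)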

data

[Document(id='2023 Outlook', metadata={}, page_content='Commodities were a bright spot for markets last \nyear (see Exhibit 180). The S&P GSCI returned \n23%, topping all other major asset classes for a \nsecond consecutive year. But this impressive gain \nbelied a more nuanced reality, as returns reached \n54% by early June, only to be halved in the \nsecond half of the year. Investors also faced a wide \ndispersion among individual commodities and their \nsource of returns. In contrast to the index’s overall \nstrength, both industrial metals and precious \nmetals suffered losses last year. And while the \nenergy subindex outperformed with a 39% gain, \nthe bulk of this came from positive carry, or the \nadditional return holders of a commodity get when \nthe futures curve is strongly downward sloping, \ncalled “backwardation.” The appreciation in spot \nenergy prices was a much smaller 14%.  \n\n\nThe disjointed nature of these returns reflects \nthe tug-of-war between bullish suppl

### Creating LangGraph
Here we build a LangGraph chain in which the LLM first summarises the initial document, then summarises each following document in turn whilst relying on context from the previously generated summary.

In [2]:
from typing import List, TypedDict
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnableConfig
from langgraph.graph import END, START, StateGraph
from langchain.chat_models import init_chat_model


llm = init_chat_model("gpt-4o-mini", model_provider="openai", temperature=0)


class State(TypedDict):
    content: List[str]
    title: List[str]
    index: int
    summary: str




# Initial summary

initial_template = """
You are writing a summary of multiple documents step by step.

To start, You will be given the first document to summarise.

Write a concise summary of the first document, outlining its main points and themes.

Note: Write down the document title before the document summary.

Document title:
{title}

Document:
{context}
"""

initial_prompt = ChatPromptTemplate([("human", initial_template)])
initial_summary_chain = initial_prompt | llm | StrOutputParser()



# Refining the summary with new docs

refine_template = """
 
Read the summary of the previous document to understand the context and content.  

Use this understanding when writing a summary of the new document, 
incorporating relevant details from the previous document to outline new updates.

Write a concise summary of the new document, highlighting the key similarities and differences compared to the previous document summary.

Note: Write down the document title given before document summary.

Summary of previous document:
{existing_answer}

New Document title:
{title}

New Document:
{context}



"""
refine_prompt = ChatPromptTemplate([("human", refine_template)])
refine_summary_chain = refine_prompt | llm | StrOutputParser()




async def initial_summary(state: State, config: RunnableConfig):
    summary = await initial_summary_chain.ainvoke(
        {"title": state["title"][0], "context": state["content"][0]},
        config,
    )
    return {"summary": summary, "index": 1}


async def refine_summary(state: State, config: RunnableConfig):
    content = state["content"][state["index"]]
    title = state["title"][state["index"]]
    summary = await refine_summary_chain.ainvoke(
        {"existing_answer": state["summary"], "context": content, "title":title},
        config,
    )

    return {"summary": summary, "index": state["index"] + 1}


def should_refine(state: State):
    if state["index"] >= len(state["content"]):
        return END
    else:
        return "refine_summary"


graph = StateGraph(State)
graph.add_node("initial_summary", initial_summary)
graph.add_node("refine_summary", refine_summary)

graph.add_edge(START, "initial_summary")
graph.add_conditional_edges("initial_summary", should_refine)
graph.add_conditional_edges("refine_summary", should_refine)
app = graph.compile()
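
To make the control flow of the graph easier to follow, here is a minimal sketch of the same loop in plain Python. It is not how LangGraph runs things internally; `refine_summaries`, `fake_first` and `fake_next` are hypothetical names, and the two "fake" functions stand in for the LLM chains so the data flow can be traced without an API key. Unlike the graph state, which only holds the latest summary, the sketch keeps every summary in a list.

```python
from typing import Callable, List


def refine_summaries(
    titles: List[str],
    contents: List[str],
    summarise_first: Callable[[str, str], str],
    summarise_next: Callable[[str, str, str], str],
) -> List[str]:
    """Return one summary per document, each built on the previous summary."""
    if not contents:
        return []
    # Step 1: the "initial_summary" node handles document 0.
    summaries = [summarise_first(titles[0], contents[0])]
    # Step 2: the "refine_summary" node repeats until the index reaches the
    # end of the list, which is the condition should_refine checks.
    index = 1
    while index < len(contents):
        previous = summaries[-1]
        summaries.append(summarise_next(previous, titles[index], contents[index]))
        index += 1
    return summaries


# Hypothetical stand-ins for the LLM chains, used only to trace the data flow.
def fake_first(title: str, text: str) -> str:
    return f"{title}: {text[:20]}"


def fake_next(previous: str, title: str, text: str) -> str:
    return f"{title}: {text[:20]} (vs. {previous.split(':')[0]})"


demo = refine_summaries(
    ["2023 Outlook", "2024 Outlook", "2025 Outlook"],
    ["oil up", "oil down", "oil flat"],
    fake_first,
    fake_next,
)
for line in demo:
    print(line)
```

Each summary after the first only sees the summary immediately before it, so details from earlier documents survive only if the intermediate summaries carry them forward.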

### Invoking Graph
Execute the sequence and generate the summaries sequentially.

In [3]:
async for step in app.astream(
    {
    "content": [doc.page_content for doc in data],
    "title" : [doc.id for doc in data]
    },
    stream_mode="values",
):
    if summary := step.get("summary"):
        print("\n===============================================================\n\n")
        print(summary)
        print("\n===============================================================\n\n")




**Document title: 2023 Outlook**

**Document Summary:**
The 2023 Outlook discusses the performance and future expectations for commodities, particularly focusing on oil and gold. In 2022, commodities, led by the S&P GSCI's 23% return, outperformed other asset classes, although this was marked by significant volatility and disparities among individual commodities. Energy prices surged due to geopolitical tensions, particularly from the Russia-Ukraine conflict, but have since retreated due to various bearish factors, including a release from the US Strategic Petroleum Reserve and declining demand from China.

The document highlights the precarious balance between supply-side pressures and demand concerns, particularly in light of low global inventories and potential geopolitical disruptions. It anticipates that oil prices will fluctuate between $70 and $100 per barrel, influenced by various risks, including a possible US recession and ongoing geopolitical tensions.

In contrast, gold'

The results show a summary for each document, highlighting the key updates in each year's outlook relative to previous years.

It seems like four main themes are mentioned consistently:

- Overall market commentary
- Supply and demand (S&D) trends
- Oil
- Gold

Let's refine our documents by splitting them via common themes to generate a better summary.

## Optional Enhancement: Further Splitting of Texts via k-means
Now we use a basic k-means algorithm to cluster similar themes together within our documents.

To simplify the text splitting task, I've manually separated paragraphs in my txt files on the basis that each paragraph discusses one particular theme.

The task would probably have been easier if we had a strong agentic PDF crawler.


In [4]:
from langchain_text_splitters import RecursiveCharacterTextSplitter
import pandas as pd

documents = {
    '2023 Outlook': 'data/2023-isg-outlook-commodities.txt',
    '2024 Outlook': 'data/2024-isg-outlook-commodities.txt',
    '2025 Outlook': 'data/2025-isg-outlook-commodities.txt'
}

data = []

for title, doc in documents.items():
    
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0, separators=['\n\n'])
    with open(doc, encoding="utf-8") as f:
        text = f.read()

    splitted_texts = text_splitter.split_text(text)
    for text in splitted_texts:
        data.append({'title':title, 'text':text})


df = pd.DataFrame(data)
df

| | title | text |
|---|---|---|
| 0 | 2023 Outlook | Commodities were a bright spot for markets las... |
| 1 | 2023 Outlook | The disjointed nature of these returns reflect... |
| 2 | 2023 Outlook | Against these risks, we note that global \ninv... |
| 3 | 2023 Outlook | Oil: High Risks, Low Inventories\nOil prices l... |
| 4 | 2023 Outlook | Although oil prices have receded, the bullish ... |
| 5 | 2023 Outlook | As a result, close to 2 million b/d of Russian... |
| 6 | 2023 Outlook | To be sure, the demand for oil faces an equall... |
| 7 | 2023 Outlook | Meeting this demand will require continuing \n... |
| 8 | 2023 Outlook | Given these moving pieces, we expect WTI \npri... |
| 9 | 2023 Outlook | Gold: Not as Advertised\nSince its meteoric ri... |
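
For intuition, the paragraph-level split can be approximated with the standard library alone. This is a simplified sketch, not a replacement for `RecursiveCharacterTextSplitter`: as far as I understand, the LangChain splitter may also merge short neighbouring pieces up to `chunk_size`, which this version does not do. The helper names and the sample text are made up for illustration.

```python
from typing import Dict, List


def split_paragraphs(text: str) -> List[str]:
    """Split on blank lines and drop empty pieces."""
    return [part.strip() for part in text.split("\n\n") if part.strip()]


def to_rows(title: str, text: str) -> List[Dict[str, str]]:
    """Build one {'title', 'text'} row per paragraph, like the DataFrame rows."""
    return [{"title": title, "text": para} for para in split_paragraphs(text)]


# Made-up sample text, not taken from the reports.
sample = "Oil: overview\nPrices fell.\n\nGold: overview\nPrices rose.\n\n\n"
rows = to_rows("2023 Outlook", sample)
for row in rows:
    print(row)
```

Single newlines inside a paragraph are preserved, so only the manually inserted blank lines act as theme boundaries.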


### Embeddings
Now we take each text chunk and use OpenAI's embedding model to embed it as a vector.

In [5]:
from langchain_openai import OpenAIEmbeddings
import os

embeddings = OpenAIEmbeddings(
    model='text-embedding-ada-002',
    api_key=os.getenv('OPENAI_API_KEY')
)

# embed_documents expects a list of strings, so convert the column first
df['embeddings'] = embeddings.embed_documents(df['text'].tolist())
df.head(10)

| | title | text | embeddings |
|---|---|---|---|
| 0 | 2023 Outlook | Commodities were a bright spot for markets las... | [-0.0022702275309711695, -0.007764836307615042... |
| 1 | 2023 Outlook | The disjointed nature of these returns reflect... | [-0.025304894894361496, -0.0422799326479435, 0... |
| 2 | 2023 Outlook | Against these risks, we note that global \ninv... | [0.007760149892419577, -0.02713713049888611, -... |
| 3 | 2023 Outlook | Oil: High Risks, Low Inventories\nOil prices l... | [-0.0027492695953696966, -0.0334058441221714, ... |
| 4 | 2023 Outlook | Although oil prices have receded, the bullish ... | [0.0030426045414060354, -0.04315152019262314, ... |
| 5 | 2023 Outlook | As a result, close to 2 million b/d of Russian... | [0.004910386633127928, -0.03235391899943352, 0... |
| 6 | 2023 Outlook | To be sure, the demand for oil faces an equall... | [0.0008597111445851624, -0.05090921372175217, ... |
| 7 | 2023 Outlook | Meeting this demand will require continuing \n... | [-0.02213151566684246, -0.05137115344405174, 0... |
| 8 | 2023 Outlook | Given these moving pieces, we expect WTI \npri... | [-0.017851943150162697, -0.03615216165781021, ... |
| 9 | 2023 Outlook | Gold: Not as Advertised\nSince its meteoric ri... | [-0.016793372109532356, -0.005321118514984846,... |
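
The idea behind embedding the chunks is that texts on similar topics should end up as nearby vectors. One common way to compare two vectors is cosine similarity, sketched below with the standard library. The three-dimensional vectors are made up for illustration (real embeddings from the model have far more dimensions), and `cosine_similarity` is a hypothetical helper, not part of LangChain.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors: two "oil-like" chunks and one "gold-like" chunk.
oil_a = [0.9, 0.1, 0.0]
oil_b = [0.8, 0.2, 0.1]
gold = [0.1, 0.1, 0.9]

print("oil_a vs oil_b:", round(cosine_similarity(oil_a, oil_b), 3))
print("oil_a vs gold: ", round(cosine_similarity(oil_a, gold), 3))
```

The two oil-like vectors score much closer to 1 than the oil/gold pair, which is the kind of structure the clustering step relies on.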


### Clustering
I'm using 4 clusters for now.

In [6]:
from sklearn.cluster import KMeans
import numpy as np

n_clusters = 4
embedding_matrix = np.array(df['embeddings'].tolist())

kmeans = KMeans(n_clusters=n_clusters, init="k-means++", random_state=42)
kmeans.fit(embedding_matrix)
labels = kmeans.labels_

df["cluster"] = labels
df.head(10)

| | title | text | embeddings | cluster |
|---|---|---|---|---|
| 0 | 2023 Outlook | Commodities were a bright spot for markets las... | [-0.0022702275309711695, -0.007764836307615042... | 0 |
| 1 | 2023 Outlook | The disjointed nature of these returns reflect... | [-0.025304894894361496, -0.0422799326479435, 0... | 2 |
| 2 | 2023 Outlook | Against these risks, we note that global \ninv... | [0.007760149892419577, -0.02713713049888611, -... | 1 |
| 3 | 2023 Outlook | Oil: High Risks, Low Inventories\nOil prices l... | [-0.0027492695953696966, -0.0334058441221714, ... | 2 |
| 4 | 2023 Outlook | Although oil prices have receded, the bullish ... | [0.0030426045414060354, -0.04315152019262314, ... | 2 |
| 5 | 2023 Outlook | As a result, close to 2 million b/d of Russian... | [0.004910386633127928, -0.03235391899943352, 0... | 2 |
| 6 | 2023 Outlook | To be sure, the demand for oil faces an equall... | [0.0008597111445851624, -0.05090921372175217, ... | 2 |
| 7 | 2023 Outlook | Meeting this demand will require continuing \n... | [-0.02213151566684246, -0.05137115344405174, 0... | 2 |
| 8 | 2023 Outlook | Given these moving pieces, we expect WTI \npri... | [-0.017851943150162697, -0.03615216165781021, ... | 1 |
| 9 | 2023 Outlook | Gold: Not as Advertised\nSince its meteoric ri... | [-0.016793372109532356, -0.005321118514984846,... | 3 |
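
To show what k-means is doing with the embedding vectors, here is a bare-bones sketch of the classic assign-then-update loop (Lloyd's algorithm) in plain Python. It is deliberately simpler than scikit-learn's `KMeans`: it seeds the centroids with the first `k` points rather than k-means++, and it runs on made-up 2D points instead of real embeddings. All names in it are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point = Sequence[float]


def nearest(point: Point, centroids: List[List[float]]) -> int:
    """Index of the centroid closest to the point (Euclidean distance)."""
    distances = [math.dist(point, c) for c in centroids]
    return distances.index(min(distances))


def mean(points: List[Point]) -> List[float]:
    """Coordinate-wise average of a non-empty list of points."""
    return [sum(values) / len(points) for values in zip(*points)]


def kmeans(points: List[Point], k: int, max_iter: int = 100) -> Tuple[List[int], List[List[float]]]:
    # Naive initialisation (an assumption of this sketch): first k points.
    centroids = [list(p) for p in points[:k]]
    labels = [-1] * len(points)
    for _ in range(max_iter):
        # Assignment step: attach every point to its nearest centroid.
        new_labels = [nearest(p, centroids) for p in points]
        if new_labels == labels:
            break  # converged: no point changed cluster
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                centroids[j] = mean(members)
    return labels, centroids


# Two made-up, well-separated groups of points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```

Even though both starting centroids sit in the first group, the second centroid is pulled towards the far group after one update, and the assignment settles on the two obvious clusters.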


### Groupings
Checking the size of each cluster and how many chunks each document contributes to it.

In [7]:
df.groupby(['cluster']).size()

cluster
0     6
1     7
2    14
3    12
dtype: int64

In [8]:
df.groupby(['cluster','title'])['title'].size().reset_index(name='count')

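| | cluster | title | count |
|---|---|---|---|
| 0 | 0 | 2023 Outlook | 1 |
| 1 | 0 | 2024 Outlook | 2 |
| 2 | 0 | 2025 Outlook | 3 |
| 3 | 1 | 2023 Outlook | 2 |
| 4 | 1 | 2024 Outlook | 2 |
| 5 | 1 | 2025 Outlook | 3 |
| 6 | 2 | 2023 Outlook | 6 |
| 7 | 2 | 2024 Outlook | 5 |
| 8 | 2 | 2025 Outlook | 3 |
| 9 | 3 | 2023 Outlook | 4 |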


### Loading Documents
Now we combine the texts that share a cluster and a document title into a single string, then convert the dataframe into LangChain Documents.

In [9]:
# Join each group's paragraphs with a blank line so they do not run together
df['grouped'] = df.groupby(['cluster','title'])['text'].transform(lambda x: "\n\n".join(x))
df_grouped = df[['cluster','title','grouped']].drop_duplicates().sort_values(by=['cluster', 'title']).reset_index(drop=True)

df_grouped

| | cluster | title | grouped |
|---|---|---|---|
| 0 | 0 | 2023 Outlook | Commodities were a bright spot for markets las... |
| 1 | 0 | 2024 Outlook | Last year reminded commodity investors that it... |
| 2 | 0 | 2025 Outlook | Commodity markets often change the locks just ... |
| 3 | 1 | 2023 Outlook | Against these risks, we note that global \ninv... |
| 4 | 1 | 2024 Outlook | Considering these challenges, we expect WTI \n... |
| 5 | 1 | 2025 Outlook | While we remain neutral on commodities \novera... |
| 6 | 2 | 2023 Outlook | The disjointed nature of these returns reflect... |
| 7 | 2 | 2024 Outlook | Oil: Threading the Needle\n“It ain’t what you ... |
| 8 | 2 | 2025 Outlook | Supply disruptions remained modest despite \na... |
| 9 | 3 | 2023 Outlook | Gold: Not as Advertised\nSince its meteoric ri... |
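
The grouping step boils down to collecting texts under a `(cluster, title)` key and joining them. A minimal sketch of that idea without pandas is below; the rows are made up, `group_texts` is a hypothetical helper, and the separator is a parameter of the sketch rather than something taken from the reports.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Made-up rows mimicking the dataframe columns used for grouping.
rows = [
    {"cluster": 0, "title": "2023 Outlook", "text": "Para A"},
    {"cluster": 2, "title": "2023 Outlook", "text": "Para B"},
    {"cluster": 2, "title": "2023 Outlook", "text": "Para C"},
    {"cluster": 0, "title": "2024 Outlook", "text": "Para D"},
]


def group_texts(rows: List[dict], separator: str = "\n\n") -> Dict[Tuple[int, str], str]:
    """Join the texts of rows sharing a (cluster, title) key, sorted by key."""
    buckets: Dict[Tuple[int, str], List[str]] = defaultdict(list)
    for row in rows:
        buckets[(row["cluster"], row["title"])].append(row["text"])
    return {key: separator.join(texts) for key, texts in sorted(buckets.items())}


grouped = group_texts(rows)
for (cluster, title), text in grouped.items():
    print(cluster, title, repr(text))
```

Sorting by cluster and then title keeps each theme's documents in chronological order, which matters because the refine step reads them one after another.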


In [10]:
from langchain_community.document_loaders import DataFrameLoader

loader = DataFrameLoader(df_grouped, page_content_column='grouped')
new_documents = loader.load()
new_documents

[Document(metadata={'cluster': 0, 'title': '2023 Outlook'}, page_content='Commodities were a bright spot for markets last \nyear (see Exhibit 180). The S&P GSCI returned \n23%, topping all other major asset classes for a \nsecond consecutive year. But this impressive gain \nbelied a more nuanced reality, as returns reached \n54% by early June, only to be halved in the \nsecond half of the year. Investors also faced a wide \ndispersion among individual commodities and their \nsource of returns. In contrast to the index’s overall \nstrength, both industrial metals and precious \nmetals suffered losses last year. And while the \nenergy subindex outperformed with a 39% gain, \nthe bulk of this came from positive carry, or the \nadditional return holders of a commodity get when \nthe futures curve is strongly downward sloping, \ncalled “backwardation.” The appreciation in spot \nenergy prices was a much smaller 14%.'),
 Document(metadata={'cluster': 0, 'title': '2024 Outlook'}, page_content

### Creating LangGraph
This is the same LangGraph setup as earlier, with slightly reworded prompts.

In [11]:
from typing import List, TypedDict
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnableConfig
from langgraph.graph import END, START, StateGraph
from langchain.chat_models import init_chat_model


llm = init_chat_model("gpt-4o-mini", model_provider="openai", temperature=0)


class State(TypedDict):
    content: List[str]
    title: List[str]
    index: int
    summary: str




# Initial summary

initial_template = """
You are writing a summary of multiple documents step by step.
To start, You will be given the first document to summarise.

Write a concise summary of the first document, outlining the main points.

Note: Write down the document title given before document summary.

First document title:
{title}

First document:
{context}
"""

initial_prompt = ChatPromptTemplate([("human", initial_template)])
initial_summary_chain = initial_prompt | llm | StrOutputParser()



# Refining the summary with new docs

refine_template = """
 
Read the summary of the previous document to understand the context and content.  

Use this understanding when writing a summary of the next document, 
incorporating relevant details from the previous document to outline new updates.

Write a concise summary of the next document, highlighting the key similarities and differences compared to the previous document summary

Note: Write down the document title given before document summary.

Summary of previous document:
{existing_answer}

New Document title:
{title}

New Document:
{context}
"""
refine_prompt = ChatPromptTemplate([("human", refine_template)])
refine_summary_chain = refine_prompt | llm | StrOutputParser()




async def initial_summary(state: State, config: RunnableConfig):
    summary = await initial_summary_chain.ainvoke(
        {"title": state["title"][0], "context": state["content"][0]},
        config,
    )
    return {"summary": summary, "index": 1}


async def refine_summary(state: State, config: RunnableConfig):
    content = state["content"][state["index"]]
    title = state["title"][state["index"]]
    summary = await refine_summary_chain.ainvoke(
        {"existing_answer": state["summary"], "context": content, "title":title},
        config,
    )

    return {"summary": summary, "index": state["index"] + 1}


def should_refine(state: State):
    if state["index"] >= len(state["content"]):
        return END
    else:
        return "refine_summary"


graph = StateGraph(State)
graph.add_node("initial_summary", initial_summary)
graph.add_node("refine_summary", refine_summary)

graph.add_edge(START, "initial_summary")
graph.add_conditional_edges("initial_summary", should_refine)
graph.add_conditional_edges("refine_summary", should_refine)
app = graph.compile()

### Sampling
Let's first evaluate the first cluster to see if everything makes sense.

In [12]:
print("\n\nSummary of Theme 0")
cluster_docs = [doc for doc in new_documents if doc.metadata.get('cluster') == 0]

async for step in app.astream(
    {
    "content": [doc.page_content for doc in cluster_docs],
    "title" : [doc.metadata.get("title") for doc in cluster_docs]
    },
    stream_mode="values",
):
    if summary := step.get("summary"):
        print("\n===============================================================\n\n")
        print(summary)
        print("\n===============================================================\n\n")



Summary of Theme 0



**Document Title: 2023 Outlook**

**Document Summary:**
The 2023 Outlook highlights the strong performance of commodities in the previous year, with the S&P GSCI achieving a 23% return, outperforming other asset classes for the second year in a row. However, this overall gain masked significant volatility, as returns peaked at 54% by June before dropping in the latter half of the year. There was considerable variation among individual commodities; while the energy subindex saw a notable 39% increase, primarily driven by positive carry from backwardation, both industrial and precious metals experienced losses. The actual appreciation in spot energy prices was limited to 14%.






**Document Title: 2024 Outlook**

**Document Summary:**
The 2024 Outlook presents a stark contrast to the previous year's performance, as the GSCI experienced a decline of 9% in 2023, following two years of strong gains that had led to expectations of a supercycle. This downturn was wid

### Invoking Graph
The output looks reasonable. Let's run it for each cluster and print the results.

In [13]:
for i in range(n_clusters):  # n_clusters was set to 4 in the clustering cell
    print("\n===============================================================")
    print(f"Theme {i} \n\n")
    
    cluster_docs = [doc for doc in new_documents if doc.metadata.get('cluster') == i]

    async for step in app.astream(
        {
        "content": [doc.page_content for doc in cluster_docs],
        "title" : [doc.metadata.get("title") for doc in cluster_docs]
        },
        stream_mode="values",
    ):
        if summary := step.get("summary"):
            print(summary)

    print("\n===============================================================\n\n")


Theme 0 


**Document Title: 2023 Outlook**

**Summary:**
The 2023 Outlook highlights the strong performance of commodities in the previous year, with the S&P GSCI achieving a 23% return, outperforming other asset classes for the second year in a row. However, this overall gain masked significant volatility, as returns peaked at 54% by early June but were halved in the latter half of the year. There was considerable variation among individual commodities; while the energy subindex saw a notable 39% gain primarily due to positive carry from backwardation, both industrial and precious metals experienced losses. The actual appreciation in spot energy prices was limited to 14%.
**Document Title: 2024 Outlook**

**Summary:**
The 2024 Outlook presents a stark contrast to the previous year's performance, as the GSCI experienced a decline of 9% in 2023, following two years of strong gains that had led to expectations of a supercycle. This downturn was widespread, affecting energy, agriculture