# Multi-Doc Analyser Powered by LLMs
### Analysing multiple documents via iterative refinement
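
Before wiring anything to an LLM, it may help to see the refinement loop on its own. The sketch below is a minimal, plain-Python illustration of the pattern, not the LangGraph implementation used in this notebook: `iterative_summaries`, `fake_first` and `fake_refine` are hypothetical names, and the two `fake_*` functions merely stand in for model calls.

```python
from typing import Callable, Iterator, List


def iterative_summaries(
    titles: List[str],
    contents: List[str],
    summarise_first: Callable[[str, str], str],
    refine: Callable[[str, str, str], str],
) -> Iterator[str]:
    """Yield one summary per document, each informed by the previous summary."""
    if not contents:
        return
    # The first document is summarised on its own.
    summary = summarise_first(titles[0], contents[0])
    yield summary
    # Every later document is summarised with the previous summary as context.
    for title, content in zip(titles[1:], contents[1:]):
        summary = refine(summary, title, content)
        yield summary


# Hypothetical stand-ins for the LLM calls (assumption: a real model goes here).
def fake_first(title: str, content: str) -> str:
    return f"[{title}] {content}"


def fake_refine(previous: str, title: str, content: str) -> str:
    previous_title = previous.split("]")[0] + "]"
    return f"[{title}] {content} (after: {previous_title})"


steps = list(
    iterative_summaries(
        ["2023 Outlook", "2024 Outlook"],
        ["oil up", "oil down"],
        fake_first,
        fake_refine,
    )
)
for step in steps:
    print(step)
```

Each yielded value corresponds to one streamed state in the graph version: the summary produced after reading one more document.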

In [None]:
from langchain_core.documents import Document

documents = {
    '2023 Outlook': 'data/2023-isg-outlook-commodities.txt',
    '2024 Outlook': 'data/2024-isg-outlook-commodities.txt',
    '2025 Outlook': 'data/2025-isg-outlook-commodities.txt'
}

data = []

for title, path in documents.items():
    # Read each outlook file in full
    with open(path, encoding="utf-8") as f:
        text = f.read()

    # Skip empty files; the title doubles as the document id
    if text:
        data.append(Document(page_content=text, id=title))

data

[Document(id='2023', metadata={}, page_content='Commodities were a bright spot for markets last \nyear (see Exhibit 180). The S&P GSCI returned \n23%, topping all other major asset classes for a \nsecond consecutive year. But this impressive gain \nbelied a more nuanced reality, as returns reached \n54% by early June, only to be halved in the \nsecond half of the year. Investors also faced a wide \ndispersion among individual commodities and their \nsource of returns. In contrast to the index’s overall \nstrength, both industrial metals and precious \nmetals suffered losses last year. And while the \nenergy subindex outperformed with a 39% gain, \nthe bulk of this came from positive carry, or the \nadditional return holders of a commodity get when \nthe futures curve is strongly downward sloping, \ncalled “backwardation.” The appreciation in spot \nenergy prices was a much smaller 14%.  \n\n\nThe disjointed nature of these returns reflects \nthe tug-of-war between bullish supply-side \

In [None]:
from typing import List, TypedDict
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnableConfig
from langgraph.graph import END, START, StateGraph
from langchain.chat_models import init_chat_model


llm = init_chat_model("gpt-4o-mini", model_provider="openai")


class State(TypedDict):
    content: List[str]
    title: List[str]
    index: int
    summary: str




# Initial summary

initial_template = """
You are writing a summary of multiple documents step by step.

To start, you will be given the first document to summarise.

Write a concise summary of the first document, outlining its main points.

Note: Write down the document title before the document summary.

Document title:
{title}

Document:
{context}
"""

initial_prompt = ChatPromptTemplate([("human", initial_template)])
initial_summary_chain = initial_prompt | llm | StrOutputParser()



# Refining the summary with new docs

refine_template = """
 
Read the summary of the previous document to understand the context and content.  

Use this understanding when writing a summary of the next document, 
incorporating relevant details from the previous document to outline new updates.

Write a concise summary of the next document, outlining the main points.

Note: Write down the given document title before the document summary.

Summary of previous document:
{existing_answer}

Document title:
{title}

Document:
{context}



"""
refine_prompt = ChatPromptTemplate([("human", refine_template)])
refine_summary_chain = refine_prompt | llm | StrOutputParser()




async def initial_summary(state: State, config: RunnableConfig):
    summary = await initial_summary_chain.ainvoke(
        {"title": state["title"][0], "context": state["content"][0]},
        config,
    )
    return {"summary": summary, "index": 1}


async def refine_summary(state: State, config: RunnableConfig):
    content = state["content"][state["index"]]
    title = state["title"][state["index"]]
    summary = await refine_summary_chain.ainvoke(
        {"existing_answer": state["summary"], "context": content, "title":title},
        config,
    )

    return {"summary": summary, "index": state["index"] + 1}


def should_refine(state: State):
    if state["index"] >= len(state["content"]):
        return END
    else:
        return "refine_summary"


graph = StateGraph(State)
graph.add_node("initial_summary", initial_summary)
graph.add_node("refine_summary", refine_summary)

graph.add_edge(START, "initial_summary")
graph.add_conditional_edges("initial_summary", should_refine)
graph.add_conditional_edges("refine_summary", should_refine)
app = graph.compile()

In [None]:
async for step in app.astream(
    {
    "content": [doc.page_content for doc in data],
    "title" : [doc.id for doc in data]
    },
    stream_mode="values",
):
    if summary := step.get("summary"):
        print(summary)

**Document title: None**

Summary:
The document discusses the performance of commodities in financial markets during the past year, noting a notable 23% return in the S&P GSCI, though returns were uneven across individual commodities. Industrial and precious metals lost value, while the energy sector saw a significant gain primarily due to "backwardation" in futures pricing. The report highlights the ongoing tension between supply-side pressures, stemming from years of underinvestment, and demand-side concerns exacerbated by global economic slowdowns, particularly in China. Oil prices were notably influenced by geopolitical developments, especially the impact of sanctions on Russia due to its invasion of Ukraine, which led to brief spikes in prices around $124. A subsequent decline in prices was attributed to reduced demand and significant releases from the US Strategic Petroleum Reserve. The document expresses uncertainty regarding future oil demand, projecting prices to range between

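From the results, you can observe that:
- Each summary seems to run through every theme of its document in sequence, which makes it hard to follow how any single theme develops across the three outlooks.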

### Optional Enhancement: Further Splitting of Texts via k-means
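
The first step of this enhancement breaks each document into paragraph-sized chunks. As a rough intuition for splitting on blank lines with a size limit, the sketch below packs paragraphs greedily into chunks. It is a simplified approximation under stated assumptions, not the implementation of `RecursiveCharacterTextSplitter`; `pack_paragraphs` is a hypothetical helper.

```python
from typing import List


def pack_paragraphs(text: str, chunk_size: int = 1000, separator: str = "\n\n") -> List[str]:
    """Greedily merge consecutive paragraphs while the chunk stays within chunk_size."""
    paragraphs = [p.strip() for p in text.split(separator) if p.strip()]
    chunks: List[str] = []
    current = ""
    for paragraph in paragraphs:
        candidate = paragraph if not current else current + separator + paragraph
        if current and len(candidate) > chunk_size:
            # Adding this paragraph would overflow: close the current chunk.
            chunks.append(current)
            current = paragraph
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


sample = "Oil outlook.\n\nGold outlook.\n\nCopper outlook."
chunks = pack_paragraphs(sample, chunk_size=30)
print(chunks)
```

In this sketch, a paragraph longer than `chunk_size` is kept whole rather than cut mid-sentence, because the blank line is the only separator available.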

In [4]:
from langchain_text_splitters import RecursiveCharacterTextSplitter
import pandas as pd

documents = {
    '2023 Outlook': 'data/2023-isg-outlook-commodities.txt',
    '2024 Outlook': 'data/2024-isg-outlook-commodities.txt',
    '2025 Outlook': 'data/2025-isg-outlook-commodities.txt'
}

data = []

# One splitter serves all files: split on blank lines into chunks of up to 1000 characters
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0, separators=['\n\n'])

for title, path in documents.items():
    with open(path, encoding="utf-8") as f:
        text = f.read()

    # Tag every chunk with the title of the document it came from
    for chunk in text_splitter.split_text(text):
        data.append({'title': title, 'text': chunk})


df = pd.DataFrame(data)
df

Unnamed: 0,title,text
0,2023 Outlook,Commodities were a bright spot for markets las...
1,2023 Outlook,The disjointed nature of these returns reflect...
2,2023 Outlook,"Against these risks, we note that global \ninv..."
3,2023 Outlook,"Oil: High Risks, Low Inventories\nOil prices l..."
4,2023 Outlook,"Although oil prices have receded, the bullish ..."
5,2023 Outlook,"As a result, close to 2 million b/d of Russian..."
6,2023 Outlook,"To be sure, the demand for oil faces an equall..."
7,2023 Outlook,Meeting this demand will require continuing \n...
8,2023 Outlook,"Given these moving pieces, we expect WTI \npri..."
9,2023 Outlook,Gold: Not as Advertised\nSince its meteoric ri...


In [5]:
from langchain_openai import OpenAIEmbeddings
import os

embeddings = OpenAIEmbeddings(
    model='text-embedding-ada-002',
    api_key=os.getenv('OPENAI_API_KEY')
)

# embed_documents expects a list of strings, so convert the Series explicitly
df['embeddings'] = embeddings.embed_documents(df['text'].tolist())
df.head(10)

Unnamed: 0,title,text,embeddings
0,2023 Outlook,Commodities were a bright spot for markets las...,"[-0.0022702275309711695, -0.007764836307615042..."
1,2023 Outlook,The disjointed nature of these returns reflect...,"[-0.025304894894361496, -0.0422799326479435, 0..."
2,2023 Outlook,"Against these risks, we note that global \ninv...","[0.007760149892419577, -0.02713713049888611, -..."
3,2023 Outlook,"Oil: High Risks, Low Inventories\nOil prices l...","[-0.0027492695953696966, -0.0334058441221714, ..."
4,2023 Outlook,"Although oil prices have receded, the bullish ...","[0.0030426045414060354, -0.04315152019262314, ..."
5,2023 Outlook,"As a result, close to 2 million b/d of Russian...","[0.004910386633127928, -0.03235391899943352, 0..."
6,2023 Outlook,"To be sure, the demand for oil faces an equall...","[0.0008597111445851624, -0.05090921372175217, ..."
7,2023 Outlook,Meeting this demand will require continuing \n...,"[-0.02213151566684246, -0.05137115344405174, 0..."
8,2023 Outlook,"Given these moving pieces, we expect WTI \npri...","[-0.017851943150162697, -0.03615216165781021, ..."
9,2023 Outlook,Gold: Not as Advertised\nSince its meteoric ri...,"[-0.016793372109532356, -0.005321118514984846,..."


In [6]:
from sklearn.cluster import KMeans
import numpy as np

n_clusters = 4
embedding_matrix = np.array(df['embeddings'].tolist())

kmeans = KMeans(n_clusters=n_clusters, init="k-means++", random_state=42)
kmeans.fit(embedding_matrix)
labels = kmeans.labels_

df["cluster"] = labels
df.head(10)

Unnamed: 0,title,text,embeddings,cluster
0,2023 Outlook,Commodities were a bright spot for markets las...,"[-0.0022702275309711695, -0.007764836307615042...",0
1,2023 Outlook,The disjointed nature of these returns reflect...,"[-0.025304894894361496, -0.0422799326479435, 0...",2
2,2023 Outlook,"Against these risks, we note that global \ninv...","[0.007760149892419577, -0.02713713049888611, -...",1
3,2023 Outlook,"Oil: High Risks, Low Inventories\nOil prices l...","[-0.0027492695953696966, -0.0334058441221714, ...",2
4,2023 Outlook,"Although oil prices have receded, the bullish ...","[0.0030426045414060354, -0.04315152019262314, ...",2
5,2023 Outlook,"As a result, close to 2 million b/d of Russian...","[0.004910386633127928, -0.03235391899943352, 0...",2
6,2023 Outlook,"To be sure, the demand for oil faces an equall...","[0.0008597111445851624, -0.05090921372175217, ...",2
7,2023 Outlook,Meeting this demand will require continuing \n...,"[-0.02213151566684246, -0.05137115344405174, 0...",2
8,2023 Outlook,"Given these moving pieces, we expect WTI \npri...","[-0.017851943150162697, -0.03615216165781021, ...",1
9,2023 Outlook,Gold: Not as Advertised\nSince its meteoric ri...,"[-0.016793372109532356, -0.005321118514984846,...",3
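
To make the `cluster` column less of a black box, here is a toy version of the idea on 2-D points instead of embedding vectors. It is a minimal sketch of Lloyd's algorithm with a deterministic "first k points" initialisation, which is an assumption made for reproducibility; scikit-learn's `KMeans` uses k-means++ initialisation and is considerably more careful. The function `toy_kmeans` and the sample points are hypothetical.

```python
from math import dist
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def toy_kmeans(points: Sequence[Point], k: int, iterations: int = 20) -> Tuple[List[int], List[Point]]:
    """Plain Lloyd's algorithm; the first k points serve as initial centroids."""
    centroids: List[Point] = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [min(range(k), key=lambda c: dist(p, centroids[c])) for p in points]
        # Update step: move each centroid to the mean of its members.
        new_centroids: List[Point] = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centroids.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # an empty cluster keeps its centroid
        if new_centroids == centroids:
            break  # converged: nothing moved
        centroids = new_centroids
    return labels, centroids


toy_embeddings = [(0.0, 0.1), (5.0, 5.1), (0.2, 0.0), (5.2, 4.9), (0.1, 0.2), (4.9, 5.0)]
labels, centroids = toy_kmeans(toy_embeddings, k=2)
print(labels)
```

Points that sit close together receive the same label, which is the property relied on above: chunks with similar embeddings are assumed to discuss a similar theme.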


In [7]:
df.groupby(['cluster']).size()

cluster
0     6
1     7
2    14
3    12
dtype: int64

In [8]:
df.groupby(['cluster','title'])['title'].size().reset_index(name='count')

Unnamed: 0,cluster,title,count
0,0,2023 Outlook,1
1,0,2024 Outlook,2
2,0,2025 Outlook,3
3,1,2023 Outlook,2
4,1,2024 Outlook,2
5,1,2025 Outlook,3
6,2,2023 Outlook,6
7,2,2024 Outlook,5
8,2,2025 Outlook,3
9,3,2023 Outlook,4


In [9]:
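# Rejoin the chunks of each (cluster, title) pair, restoring the paragraph break the splitter removed
df['grouped'] = df.groupby(['cluster','title'])['text'].transform(lambda x: "\n\n".join(x))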
df_grouped = df[['cluster','title','grouped']].drop_duplicates().sort_values(by=['cluster', 'title']).reset_index(drop=True)

df_grouped

Unnamed: 0,cluster,title,grouped
0,0,2023 Outlook,Commodities were a bright spot for markets las...
1,0,2024 Outlook,Last year reminded commodity investors that it...
2,0,2025 Outlook,Commodity markets often change the locks just ...
3,1,2023 Outlook,"Against these risks, we note that global \ninv..."
4,1,2024 Outlook,"Considering these challenges, we expect WTI \n..."
5,1,2025 Outlook,While we remain neutral on commodities \novera...
6,2,2023 Outlook,The disjointed nature of these returns reflect...
7,2,2024 Outlook,Oil: Threading the Needle\n“It ain’t what you ...
8,2,2025 Outlook,Supply disruptions remained modest despite \na...
9,3,2023 Outlook,Gold: Not as Advertised\nSince its meteoric ri...


In [10]:
from langchain_community.document_loaders import DataFrameLoader

loader = DataFrameLoader(df_grouped, page_content_column='grouped')
new_documents = loader.load()
new_documents

[Document(metadata={'cluster': 0, 'title': '2023 Outlook'}, page_content='Commodities were a bright spot for markets last \nyear (see Exhibit 180). The S&P GSCI returned \n23%, topping all other major asset classes for a \nsecond consecutive year. But this impressive gain \nbelied a more nuanced reality, as returns reached \n54% by early June, only to be halved in the \nsecond half of the year. Investors also faced a wide \ndispersion among individual commodities and their \nsource of returns. In contrast to the index’s overall \nstrength, both industrial metals and precious \nmetals suffered losses last year. And while the \nenergy subindex outperformed with a 39% gain, \nthe bulk of this came from positive carry, or the \nadditional return holders of a commodity get when \nthe futures curve is strongly downward sloping, \ncalled “backwardation.” The appreciation in spot \nenergy prices was a much smaller 14%.'),
 Document(metadata={'cluster': 0, 'title': '2024 Outlook'}, page_content

In [None]:
from typing import List, TypedDict
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnableConfig
from langgraph.graph import END, START, StateGraph
from langchain.chat_models import init_chat_model


llm = init_chat_model("gpt-4o-mini", model_provider="openai")


class State(TypedDict):
    content: List[str]
    title: List[str]
    index: int
    summary: str




# Initial summary

initial_template = """
You are writing a summary of multiple documents step by step.
To start, you will be given the first document to summarise.

Write a concise summary of the first document, outlining the main points.

Note: Write down the given document title before the document summary.

First document title:
{title}

First document:
{context}
"""

initial_prompt = ChatPromptTemplate([("human", initial_template)])
initial_summary_chain = initial_prompt | llm | StrOutputParser()



# Refining the summary with new docs

refine_template = """
 
Read the summary of the previous document to understand the context and content.  

Use this understanding when writing a summary of the next document, 
incorporating relevant details from the previous document to outline new updates.

Write a concise summary of the next document, outlining the main points and themes.

Note: Write down the given document title before the document summary.

Existing summary of previous document:
{existing_answer}

Next document title:
{title}

Next document:
{context}
"""
refine_prompt = ChatPromptTemplate([("human", refine_template)])
refine_summary_chain = refine_prompt | llm | StrOutputParser()




async def initial_summary(state: State, config: RunnableConfig):
    summary = await initial_summary_chain.ainvoke(
        {"title": state["title"][0], "context": state["content"][0]},
        config,
    )
    return {"summary": summary, "index": 1}


async def refine_summary(state: State, config: RunnableConfig):
    content = state["content"][state["index"]]
    title = state["title"][state["index"]]
    summary = await refine_summary_chain.ainvoke(
        {"existing_answer": state["summary"], "context": content, "title":title},
        config,
    )

    return {"summary": summary, "index": state["index"] + 1}


def should_refine(state: State):
    if state["index"] >= len(state["content"]):
        return END
    else:
        return "refine_summary"


graph = StateGraph(State)
graph.add_node("initial_summary", initial_summary)
graph.add_node("refine_summary", refine_summary)

graph.add_edge(START, "initial_summary")
graph.add_conditional_edges("initial_summary", should_refine)
graph.add_conditional_edges("refine_summary", should_refine)
app = graph.compile()

In [12]:
print("\n\nSummary of Theme 0")
cluster_docs = [doc for doc in new_documents if doc.metadata.get('cluster') == 0]

async for step in app.astream(
    {
    "content": [doc.page_content for doc in cluster_docs],
    "title" : [doc.metadata.get("title") for doc in cluster_docs]
    },
    stream_mode="values",
):
    if summary := step.get("summary"):
        print(summary)



Summary of Theme 0
**Document Title:** 2023 Outlook

**Document Summary:**  
The 2023 outlook highlights commodities as a standout performer in the market, with the S&P GSCI achieving a 23% return, outperforming other asset classes for the second consecutive year. However, this gain masked a complex landscape, as returns peaked at 54% by early June but were cut in half during the latter half of the year. There was significant variability among individual commodities, with industrial and precious metals recording losses, while energy commodities excelled with a 39% gain primarily driven by the benefits of backwardation. Spot energy prices saw a modest increase of 14%, indicating that the performance of commodities was not uniform across the board.
**Document Title:** 2024 Outlook

**Document Summary:**  
The 2024 outlook reveals a stark contrast to the previous year's strong commodity performance, as the S&P GSCI declined by 9% in 2023. This downturn affected multiple sectors, includi

In [None]:
for i in range(n_clusters):
    print(f"Theme {i}")

    # Select the documents of the current cluster (theme)
    cluster_docs = [doc for doc in new_documents if doc.metadata.get('cluster') == i]

    async for step in app.astream(
        {
        "content": [doc.page_content for doc in cluster_docs],
        "title" : [doc.metadata.get("title") for doc in cluster_docs]
        },
        stream_mode="values",
    ):
        if summary := step.get("summary"):
            print(summary)

**Document Title: 2023 Outlook**

**Document Summary:**
The document reviews the performance of commodities in the market for the previous year, highlighting that the S&P GSCI achieved a notable 23% return, outperforming other major asset classes for the second consecutive year. However, this performance masked deeper complexities, as returns peaked at 54% by early June, only to decline significantly in the latter half of the year. It notes a stark difference in performance among individual commodities; while the energy sector was bolstered by a 39% gain largely attributed to positive carry from backwardation, both industrial and precious metals experienced losses. Spot prices for energy saw a more modest increase of 14%.
**Document Title: 2024 Outlook**

**Document Summary:**
The 2024 Outlook reflects on the significant downturn in commodity performance in 2023, marking a 9% decline in the GSCI after two years of strong returns that had led to expectations of a supercycle. This declin