# **README**







---







## About the notebook



This notebook demonstrates the power of LangChain to summarize documents using the "Stuff" and the "Map Reduce" techniques. <br>



*Learning Objectives:*

- Understand when to use Stuff vs. Map-Reduce document processing

- Implement both approaches with real code examples

<br>



*Prerequisites:*

- Skills: Python 3+, Basics of LangChain concepts, Basics of prompt engineering
- Difficulty Level: Beginner


<br>



*Quick Start:*<br>

1. Clone the Repository

```

git clone https://github.com/PradnyaSA/MyAIExperiments.git

cd hands-on-notebooks

```



2. Set-Up Access  

   - Use [Google Colab](https://colab.research.google.com/) to get started instantly, for *free*!



   - Get your access key to the [OpenAI](https://platform.openai.com/account/api-keys) API.

   <br>



3. To set up API keys in Colab: Go to the "🔑" icon on the left sidebar (Secrets).

    - Click "Add new secret".

    - For the name, use 'openai_api_key'.

    - For the value, paste your OpenAI API key.

    - Make sure "Notebook access" is enabled for this secret.

<br>


4. Run the notebook

   - Open the notebook in Google Colab.

   - Run each cell or run all.

<br>

You are all set!

---
<br>





## Topic: LangChain - Transform Your AI Assistant into a Document Summarizer tool

#### Exercise 1: Basic Comparison (~15 minutes)

Learn the fundamental differences between the stuff and the map-reduce approaches:



- Run both methods on the same dataset

- Compare outputs and token usage *Extra Credit*

- Understand the processing flow *Extra Credit*<br>



Expected Output:



- Side-by-side comparison of results

- Token usage statistics *Extra Credit*

- Performance metrics *Extra Credit*
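
A side-by-side comparison like the one listed above could be collected with a small timing helper. This is a minimal sketch: `time_call` and `compare` are hypothetical helper names, and the two lambdas are placeholders standing in for the actual chain invocations.

```python
import time


def time_call(label, fn):
    """Run fn() once and record its output, word count, and wall-clock time."""
    start = time.perf_counter()
    output = fn()
    elapsed = time.perf_counter() - start
    return {
        "method": label,
        "seconds": elapsed,
        "words": len(output.split()),
        "output": output,
    }


def compare(runs):
    """runs maps a label to a zero-argument callable that returns a string."""
    return [time_call(label, fn) for label, fn in runs.items()]


# Placeholder callables (assumption): in the notebook these would wrap
# the stuff chain and the map-reduce chain invocations.
rows = compare({
    "stuff": lambda: "placeholder answer from the stuff chain",
    "map_reduce": lambda: "placeholder answer from the map-reduce chain",
})

for row in rows:
    print(f"{row['method']:<12} {row['seconds']:.4f}s  {row['words']} words")
```

Word count is only a rough proxy for output size; real token usage would need a tokenizer or the usage metadata returned by the API.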



<br>



#### Exercise 2: Scaling Challenge (~30 minutes) Extra Credit

Discover when the stuff approach fails but map-reduce succeeds:



- Process an increasing number of documents

- Hit context window limits

- Measure performance degradation

<br>



Tasks:



- Start with a few documents, say 5: both methods work

- Scale to 15+ documents: the stuff approach struggles

- Scale to 35+ documents: only map-reduce works<br>
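
One way to anticipate where the stuff approach stops fitting is to estimate the combined prompt size before calling the model. The sketch below is illustrative only: the 4-characters-per-token ratio is a rough heuristic rather than a real tokenizer, and the 16,000-token budget and the sample document are hypothetical values chosen for the example.

```python
def estimate_tokens(text, chars_per_token=4):
    """Rough token estimate; assumes about 4 characters per token (heuristic)."""
    return len(text) // chars_per_token + 1


def choose_strategy(docs, context_budget):
    """Return ('stuff', total) if all docs fit in one prompt, else ('map_reduce', total)."""
    total = sum(estimate_tokens(doc) for doc in docs)
    strategy = "stuff" if total <= context_budget else "map_reduce"
    return strategy, total


# Hypothetical document of roughly 1,000 tokens.
sample_doc = "x" * 4000

for n in (5, 15, 35):
    strategy, total = choose_strategy([sample_doc] * n, context_budget=16_000)
    print(f"{n:>2} documents -> ~{total} tokens -> {strategy}")
```

With these made-up numbers, 15 documents land just under the budget and 35 exceed it; the actual thresholds depend on the model's context window and the real document lengths.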



<br>



#### Exercise 3: Custom Implementation (~45 minutes) Extra Credit

Build your own document processing strategy:



- Create custom prompts for your use case

- Implement hybrid approaches

- Optimize for your specific requirements<br>



Challenges:



- Design prompts for financial analysis

- Handle different document types

- Implement error handling and retries<br>
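
For the error handling and retries challenge, a common pattern is retrying with exponential backoff. The following is a minimal sketch, not part of the notebook: `call_with_retries` and `flaky` are hypothetical names, and the broad `except Exception` is an assumption made for brevity (real code would usually catch only transient errors such as rate limits or timeouts).

```python
import random
import time


def call_with_retries(fn, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on failure wait base_delay * 2**attempt (plus jitter) and retry."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:  # assumption: narrow this to transient errors in real code
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt + random.uniform(0, 0.1))


# Hypothetical flaky function that fails twice and then succeeds.
attempts = {"count": 0}


def flaky():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("transient failure")
    return "ok"


result = call_with_retries(flaky, base_delay=0.01)
print(result, attempts["count"])  # prints: ok 3
```

In the notebook, `fn` could be a lambda wrapping a chain invocation, so a temporary API failure does not abort the whole run.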

<br>





## Performance Benchmarks - Extra Credit

## Sample Results - Extra Credit

## Key Concepts

Stuff Documents Chain



- How it works: Concatenates all documents into a single prompt

- Pros: Fast, simple, maintains full context

- Cons: Limited by context window, fails with many documents

- Best for: Small document sets, cross-document analysis



Map-Reduce Documents Chain



- How it works: Processes documents individually, then combines the results

- Pros: Scales to any number of documents, handles large datasets

- Cons: More API calls, higher cost, may miss cross-document patterns

- Best for: Large document sets, parallel processing needs
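
The difference in call patterns between the two chains can be sketched in plain Python. This is a conceptual illustration, not LangChain's implementation: `fake_llm` is a hypothetical stand-in that truncates its prompt instead of calling a model, and it counts how often it is invoked.

```python
def fake_llm(prompt):
    """Stand-in for a model call: returns the first 40 characters of the prompt."""
    fake_llm.calls += 1
    return prompt[:40]


fake_llm.calls = 0


def stuff(docs, llm):
    """Concatenate every document into one prompt and make a single call."""
    return llm("\n\n".join(docs))


def map_reduce(docs, llm):
    """Call the model once per document (map), then once over the partial results (reduce)."""
    partials = [llm(doc) for doc in docs]
    return llm("\n\n".join(partials))


docs = ["alpha section text", "beta section text", "gamma section text"]

fake_llm.calls = 0
stuff(docs, fake_llm)
stuff_calls = fake_llm.calls

fake_llm.calls = 0
map_reduce(docs, fake_llm)
map_reduce_calls = fake_llm.calls

print(stuff_calls, map_reduce_calls)  # prints: 1 4
```

The stuff approach makes one call regardless of document count, while map-reduce makes one call per document plus one for the final combination, which is where the extra API calls and cost come from.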



---



In [235]:
#Import display and Markdown from IPython for formatted rendering of generated response
from IPython.display import Markdown, display

In [236]:
def printmd(string):
    display(Markdown(string))

Access setup: the `userdata` module provides secure access to user-defined secrets stored in the Colab environment, such as API keys.

In [237]:
from google.colab import userdata
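
If the notebook might also be run outside Colab, the key lookup could fall back to an environment variable. This is a sketch under assumptions: `get_openai_key` is a hypothetical helper, and `OPENAI_API_KEY` is used as the fallback variable name because the OpenAI SDK reads it by default.

```python
import os


def get_openai_key(secret_name="openai_api_key", env_var="OPENAI_API_KEY"):
    """Read the key from Colab Secrets when available, else from an environment variable."""
    try:
        # google.colab is only importable inside a Colab runtime.
        from google.colab import userdata
    except ImportError:
        return os.environ.get(env_var)
    return userdata.get(secret_name)
```

The result could then be passed as `api_key=get_openai_key()` when constructing the chat model; it returns `None` outside Colab if the environment variable is not set.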

In [238]:
!pip install -U langchain-community pypdf langchain-openai tiktoken



In [239]:
from langchain.chains.summarize import load_summarize_chain
from langchain_community.document_loaders import WebBaseLoader
from langchain_openai import ChatOpenAI

Download an example Summary Plan Description (SPD) from a popular fintech organization for demo purposes, then list the directory contents.

In [240]:
!wget https://www.paypalbenefits.com/document/57
!ls -lart

--2025-09-13 22:07:23--  https://www.paypalbenefits.com/document/57
Resolving www.paypalbenefits.com (www.paypalbenefits.com)... 151.101.130.216, 151.101.194.216, 151.101.66.216, ...
Connecting to www.paypalbenefits.com (www.paypalbenefits.com)|151.101.130.216|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 511430 (499K) [application/pdf]
Saving to: ‘57.9’


2025-09-13 22:07:24 (5.26 MB/s) - ‘57.9’ saved [511430/511430]

total 5016
-rw-r--r-- 1 root root 511430 Mar 28  2024 57.9
-rw-r--r-- 1 root root 511430 Mar 28  2024 57.8
-rw-r--r-- 1 root root 511430 Mar 28  2024 57.7
-rw-r--r-- 1 root root 511430 Mar 28  2024 57.6
-rw-r--r-- 1 root root 511430 Mar 28  2024 57.5
-rw-r--r-- 1 root root 511430 Mar 28  2024 57.4
-rw-r--r-- 1 root root 511430 Mar 28  2024 57.3
-rw-r--r-- 1 root root 511430 Mar 28  2024 57.2
-rw-r--r-- 1 root root 511430 Mar 28  2024 57.1
-rw-r--r-- 1 root root 511430 Mar 28  2024 57
drwxr-xr-x 4 root root   4096 Sep  9 13:46 .config
drwxr-xr-

In [241]:
llm = ChatOpenAI(temperature=0, model_name="gpt-5", api_key=userdata.get('openai_api_key'))

Import the loader designed to read documents from PDF files.

In [242]:
from langchain_community.document_loaders import PyPDFLoader

In [243]:
loader = PyPDFLoader("57")
pages = loader.load_and_split()

Exercise 1: Basic Comparison (~15 minutes)

In [244]:
from langchain.chains.llm import LLMChain
from langchain.prompts import PromptTemplate

In [245]:
from langchain.chains.combine_documents import create_stuff_documents_chain

In [246]:
question = "How much does the company match for a 401(k), and what's the vesting schedule?"

In [247]:
# Define stuff single prompt template
prompt_template = """You are an expert Analyzer of 401K documents. Based on the following documents,
write an answer to this question: {question}
Documents: {context}
SUMMARY:"""


In [248]:
stuff_prompt = PromptTemplate(template=prompt_template, input_variables=["context", "question"])

Let's give the model some instructions and use LangChain's modules to summarize the content of this PDF.

In [249]:
# Create the stuff chain using LCEL (LangChain Expression Language)
stuff_chain = create_stuff_documents_chain(llm=llm,prompt=stuff_prompt)

In [250]:
res = stuff_chain.invoke({
    "context": pages[4:10],  # Pages are limited to keep this exercise small; in real-world use, remove the slice to scan all pages.
    "question": question
})

In [251]:
about_this_print = 'Following is how the model summarized the documents to answer the question: %s'
printmd('<div style="background-color: lightblue; padding: 10px;">%s</div><br>' % (about_this_print % question))
printmd('<div style="background-color: lightblue; padding: 10px;">%s</div>' % res)

<div style="background-color: lightblue; padding: 10px;">Following is how the model summarized the documents to answer the question: How much does the company match for a 401(k), and what's the vesting schedule?</div><br>

<div style="background-color: lightblue; padding: 10px;">- Company match: Dollar-for-dollar (100%) on your 401(k) contributions up to 4% of eligible compensation each allocation period. Catch-up contributions count toward the match.
- Vesting: Safe Harbor matching contributions are 100% vested immediately. Your own 401(k) contributions are also 100% vested at all times.</div>

In [252]:
from langchain.chains import MapReduceDocumentsChain, ReduceDocumentsChain
from langchain.chains.combine_documents.stuff import StuffDocumentsChain

In [253]:
# Define map prompt template
map_prompt_template = """You are an expert Analyzer of 401K documents. Analyze the following document section and extract key information relevant to the question: {question}
Document section: {docs}
ANALYSIS:"""


In [254]:
# Map prompt - processes each document individually
map_prompt = PromptTemplate.from_template(
    template=map_prompt_template
)

In [255]:
# Define reduce prompt template
reduce_prompt_template = """You are an expert Analyzer of 401K documents. Based on the following analysis from multiple document sections,
write a concise summary to answer this question: {question}
Document analyses: {docs}
ANALYSIS:"""

In [256]:
# Reduce prompt - combines all individual results
reduce_prompt = PromptTemplate.from_template(reduce_prompt_template)

In [257]:
# Create the chain - Map step
map_chain = LLMChain(
    llm=llm,
    prompt=map_prompt
)

In [258]:
# Create the chain - Reduce step
reduce_chain = LLMChain(
    llm=llm,
    prompt=reduce_prompt
)

In [259]:
combine_documents_chain = StuffDocumentsChain(
    llm_chain=reduce_chain, document_variable_name="docs"
)

In [260]:
reduce_documents_chain = ReduceDocumentsChain(
    combine_documents_chain=combine_documents_chain,
    collapse_documents_chain=combine_documents_chain,
    token_max=1000,
)
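
The role of `token_max` and the collapse step can be pictured as follows: when the mapped outputs together exceed the limit, they are grouped into batches that fit, each batch is combined, and this repeats until a final combine is possible. The sketch below is a simplified illustration of that general idea, not LangChain's actual code; `estimate_tokens`, `batch_under_limit`, `reduce_with_collapse`, and `combine` are hypothetical names, and counting words stands in for a real tokenizer.

```python
def estimate_tokens(text):
    """Rough stand-in for a tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def batch_under_limit(texts, token_max):
    """Greedily group texts so each batch stays at or under token_max tokens."""
    batches, current, used = [], [], 0
    for text in texts:
        size = estimate_tokens(text)
        if current and used + size > token_max:
            batches.append(current)
            current, used = [], 0
        current.append(text)
        used += size
    if current:
        batches.append(current)
    return batches


def reduce_with_collapse(texts, combine, token_max):
    """Collapse batches repeatedly until everything fits, then combine once."""
    while len(texts) > 1 and sum(estimate_tokens(t) for t in texts) > token_max:
        batches = batch_under_limit(texts, token_max)
        if len(batches) == len(texts):
            break  # every batch holds one text; collapsing cannot make progress
        texts = [combine(batch) for batch in batches]
    return combine(texts)


def combine(batch):
    """Hypothetical combiner: keeps the first two words of each text."""
    return " ".join(" ".join(t.split()[:2]) for t in batch)


analyses = ["one two three four five"] * 4  # 5 tokens each, 20 in total
batches = batch_under_limit(analyses, token_max=10)
final = reduce_with_collapse(analyses, combine, token_max=10)
print(len(batches), final)  # prints: 2 one two one two
```

A small `token_max` forces more collapse rounds (and more model calls), while a larger value lets more of the mapped output reach the final reduce prompt at once.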

In [261]:
# Full chain
map_reduce_chain = MapReduceDocumentsChain(
    llm_chain=map_chain,
    reduce_documents_chain=reduce_documents_chain,
    document_variable_name="docs",
    return_intermediate_steps=False,
)

In [262]:
documents = pages[4:10]

In [263]:
# Execute map-reduce chain
map_reduce_result = map_reduce_chain.invoke({'input_documents':documents,'question':question})

In [264]:
about_this_print = 'Following is how the model summarized the documents to answer the question: %s'
printmd('<div style="background-color: #a8ee90; padding: 10px;">%s</div><br>' % (about_this_print % question))
printmd('<div style="background-color: #a8ee90; padding: 10px;">%s</div>' % map_reduce_result["output_text"])

<div style="background-color: #a8ee90; padding: 10px;">Following is how the model summarized the documents to answer the question: How much does the company match for a 401(k), and what's the vesting schedule?</div><br>

<div style="background-color: #a8ee90; padding: 10px;">- Company match: Safe Harbor match of 100% of your 401(k) contributions on the first 4% of eligible compensation each allocation period (catch-up contributions included).
- Vesting: Your own contributions are 100% vested immediately. The excerpts provided do not state the vesting for the employer match; Safe Harbor matches are typically immediately 100% vested, but please confirm in the plan’s Vesting section.</div>

---

## Congratulations on running this fun exercise!
This lab is designed to give you practical, hands-on experience with LangChain document processing. Take your time with each exercise and don't hesitate to experiment beyond the provided examples.

#### Additional Resources
- [LangChain](https://www.langchain.com/langchain)

Happy Learning!