# 🦜🔗 Text Summarization of Large Documents using LangChain 


| | |
|-|-|
|Author(s) | [Abonia Sojasingarayar](https://github.com/Abonia1) |

## Overview

Text summarization is an NLP task that creates a concise and informative summary of a longer text. LLMs can be used to create summaries of news articles, research papers, technical documents, and other types of text.

Summarizing large documents can be challenging. To create summaries, you need to apply summarization strategies to your indexed documents. 

In this notebook, you will use LangChain, a framework for developing LLM applications, to apply some summarization strategies. The notebook covers several examples of how to summarize large documents.

### Objective

In this tutorial, you learn how to use LangChain with Ollama - Mistral to summarize large documents by working through the following examples:

- Stuffing method
- MapReduce method
- Refine method

## Getting Started

In [8]:
# !pip install langchain langchain_community  pypdf

Collecting pypdf
  Using cached pypdf-4.1.0-py3-none-any.whl.metadata (7.4 kB)
Downloading pypdf-4.1.0-py3-none-any.whl (286 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m286.1/286.1 kB[0m [31m3.8 MB/s[0m eta [36m0:00:00[0m [36m0:00:01[0m
[?25hInstalling collected packages: pypdf
Successfully installed pypdf-4.1.0


### Import libraries

In [5]:
import requests
import warnings
from pathlib import Path

import pandas as pd
from langchain import PromptTemplate
from langchain.chains.summarize import load_summarize_chain
from langchain.document_loaders import PyPDFLoader
from langchain_community.llms import Ollama
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler

warnings.filterwarnings("ignore")

### Import models

we need to load the pre-trained text generation model called `mistral` using Ollama.

In [2]:

llm = Ollama(model="mistral", callback_manager=CallbackManager([StreamingStdOutCallbackHandler()]))

## Summarization with Large Documents

### Preparing data files

To begin, we will need to download a few files that are required for the summarizing tasks below.

In [9]:

# Define the current working directory and the data folder within it
data_folder = Path.cwd() / "data"
data_folder.mkdir(parents=True, exist_ok=True)

# pdf_url = "https://arxiv.org/pdf/2305.14314.pdf"
# pdf_url = "https://arxiv.org/pdf/2307.06435.pdf"
# Specify the URL of the PDF to download
pdf_url = "https://uoft-csc413.github.io/2023/assets/tutorials/tut09_llm.pdf"

# Construct the file path for the downloaded PDF
pdf_file = str(data_folder / pdf_url.split("/")[-1])

# Use requests to download the PDF file
response = requests.get(pdf_url)

# Ensure the response is successful
if response.status_code == 200:
    # Write the content to a file
    with open(pdf_file, 'wb') as file:
        file.write(response.content)
    print(f"PDF downloaded successfully to {pdf_file}")
else:
    print(f"Failed to download PDF. Status code: {response.status_code}")


PDF downloaded successfully to /Users/abonia.sojasingarayaribm.com/Documents/GitHub/Orange-RAG-Demo/data/tut09_llm.pdf


### Extract text from the PDF

You use an `PdfReader` to extract the text from our scanned documents.

In [12]:
pdf_loader = PyPDFLoader(pdf_file)
pages = pdf_loader.load_and_split()
print(pages[2].page_content)

What are Language Models? 
●Narrow Sense 
○A probabilistic model that assigns a probability to every ﬁnite sequence (grammatical or not) 
●Broad Sense 
○Decoder-only models (GPT-X, OPT, LLaMA, PaLM) 
○Encoder-only models (BERT, RoBERTa, ELECTRA) 
○Encoder-decoder models (T5, BART)


## Method 1: Stuffing

Stuffing is the simplest method to pass data to a language model. It "stuffs" text into the prompt as context in a way that all of the relevant information can be processed by the model to get what you want.

In LangChain, you can use `StuffDocumentsChain` as part of the `load_summarize_chain` method. What you need to do is setting `stuff` as `chain_type` of your chain.

### Prompt design with `Stuffing` chain

In [14]:
prompt_template = """Write a concise summary of the following text delimited by triple backquotes.
              Return your response in bullet points which covers the key points of the text.
              ```{text}```
              BULLET POINT SUMMARY:
  """

prompt = PromptTemplate(template=prompt_template, input_variables=["text"])

### Retrying
Initiate a chain using `stuff` method and process three pages document.

In [15]:
stuff_chain = load_summarize_chain(llm, chain_type="stuff", prompt=prompt)

In [18]:
four_pages = pages[:4]

In [19]:
four_pages

[Document(page_content='Large Language Models \nCSC413 Tutorial 9 \nYongchao Zhou', metadata={'source': '/Users/abonia.sojasingarayaribm.com/Documents/GitHub/Orange-RAG-Demo/data/tut09_llm.pdf', 'page': 0}),
 Document(page_content='Overview \n●What are LLMs? \n●Why LLMs? \n●Emergent Capabilities \n○Few-shot In-context Learning \n○Advanced Prompt Techniques \n●LLM Training \n○Architectures \n○Objectives \n●LLM Finetuning \n○Instruction ﬁnetuning \n○RLHF \n○Bootstrapping \n●LLM Risks', metadata={'source': '/Users/abonia.sojasingarayaribm.com/Documents/GitHub/Orange-RAG-Demo/data/tut09_llm.pdf', 'page': 1}),
 Document(page_content='What are Language Models? \n●Narrow Sense \n○A probabilistic model that assigns a probability to every ﬁnite sequence (grammatical or not) \n●Broad Sense \n○Decoder-only models (GPT-X, OPT, LLaMA, PaLM) \n○Encoder-only models (BERT, RoBERTa, ELECTRA) \n○Encoder-decoder models (T5, BART)', metadata={'source': '/Users/abonia.sojasingarayaribm.com/Documents/GitHub

In [20]:
try:
    print(stuff_chain.run(four_pages))
except Exception as e:
    print(
        "The code failed since it won't be able to run inference on such a huge context and throws this exception: ",
        e,
    )

 * Large Language Models (LLMs) are probabilistic models assigning a probability to every finite sequence.
* Reason for existence includes few-shot in-context learning and advanced prompt techniques.
* LLMs come in various architectures and objectives.
* Training methods include instruction finetuning, RLHF, and bootstrapping.
* Narrow definition of LLMs is a model assigning probabilities to sequences.
* Broad definition includes decoder-only models (GPT-X, OPT, LLaMA, PaLM), encoder-only models (BERT, RoBerta, ELECTRA), and encoder-decoder models (T5, BART).
* Billions of parameters in large LLMs can be found through Hugging Face's blog. * Large Language Models (LLMs) are probabilistic models assigning a probability to every finite sequence.
* Reason for existence includes few-shot in-context learning and advanced prompt techniques.
* LLMs come in various architectures and objectives.
* Training methods include instruction finetuning, RLHF, and bootstrapping.
* Narrow definition of LL

As you can see, with the `stuff` method, you can summarize the entire document content with a single API call passing all data at once.

Depending on the context length of LLM, the `stuff` method would not work as it result in a prompt larger than the context length.

In [21]:
try:
    print(stuff_chain.run(pages))
except Exception as e:
    print(
        "The code failed since it won't be able to run inference on such a huge context and throws this exception: ",
        e,
    )

 * Language models (LLMs) have shown emergent capabilities, demonstrating human-like behaviors.
* Pretraining involves teaching LLMs to understand and generate text based on patterns in large datasets.
* Fine-tuning involves adapting pretrained LLMs to specific tasks through instruction or supervision.
* Prompting techniques include zero-shot, self-consistency, least-to-most, division-and-conquer, and instruction finetune.
* Training architectures include encoder-decoder models (T5, BART) and decoder-only models (GPT-X, PaLM).
* Risks of LLMs include making mistakes, being misused or attacked, causing harms, and serving as defenses.
* Emergent capabilities include in-context learning, decomposed prompting, and instruction finetune.
* Training techniques include parallelism (Training Objectives - UL2, OpenAI research).
* For further reading, check out the resources listed. * Language models (LLMs) have shown emergent capabilities, demonstrating human-like behaviors.
* Pretraining involv

### Considerations

The `stuffing` method is a way to summarize text by feeding the entire document to a large language model (LLM) in a single call. This method has both pros and cons.

The stuffing method only requires a single call to the LLM, which can be faster than other methods that require multiple calls. When summarizing text, the LLM has access to all the data at once, which can result in a better summary.

But, LLMs have a context length, which is the maximum number of tokens that can be processed in a single call. If the document is longer than the context length, the stuffing method will not work. Also the stuffing method is not suitable for summarizing large documents, as it can be slow and may not produce a good summary.

Let's explore other approaches to help deal with having longer text than context lengh limit of LLMs.

## Method 2: MapReduce

The `MapReduce` method implements a multi-stage summarization. It is a technique for summarizing large pieces of text by first summarizing smaller chunks of text and then combining those summaries into a single summary.

In LangChain, you can use `MapReduceDocumentsChain` as part of the `load_summarize_chain` method. What you need to do is setting `map_reduce` as `chain_type` of your chain.

### Prompt design with `MapReduce` chain

In our example, you have a 40-page document that you need to summarize.

With LangChain, the `map_reduce` chain breaks the document down into 1024 token chunks max. Then it runs the initial prompt you define on each chunk to generate a summary of that chunk. In the example below, you use the following first stage or map prompt.

```Write a concise summary of the following text delimited by triple backquotes. Return your response in bullet points which covers the key points of the text.
'''{text}'''. BULLET POINT SUMMARY:```

Once summaries for all of the chunks are generated, it runs a different prompt to combine those summaries into a single summary. In the example below, you use the following second stage or combine prompt.

```Write a summary of the entire document that includes the main points from all of the individual summaries.```

In [25]:
map_prompt_template = """
                      Write a summary of this chunk of text that includes the main points and any important details.
                      {text}
                      """

map_prompt = PromptTemplate(template=map_prompt_template, input_variables=["text"])

combine_prompt_template = """
                      Write a concise summary of the following text delimited by triple backquotes.
                      Return your response in bullet points which covers the key points of the text.
                      ```{text}```
                      BULLET POINT SUMMARY:
                      """

combine_prompt = PromptTemplate(
    template=combine_prompt_template, input_variables=["text"]
)

### Generate summaries using MapReduce method

After defining prompts, you initialize the associated `map_reduce_chain`.

In [26]:
map_reduce_chain = load_summarize_chain(
    llm,
    chain_type="map_reduce",
    map_prompt=map_prompt,
    combine_prompt=combine_prompt,
    return_intermediate_steps=True,
)

Then, you generate summaries using the chain. Notice that LangChain use a tokenizer (from transformer library) with 1024 token limit by default.

In [27]:
map_reduce_outputs = map_reduce_chain({"input_documents": pages})

 This text appears to be a tutorial slide or note titled "Large Language Models" for a Computer Science course, CSC413, authored by Yongchao Zhou. The main points discussed in the text are as follows:

1. Large language models: They are deep neural network models capable of processing and generating human-like natural language text. They can be used for various natural language processing tasks such as translation, summarization, generation, etc.

2. Model architecture: The large language models consist of an embedding layer (representing words in dense vector form), multiple recurrent or transformer layers (for capturing context and dependencies), and output layer (generating tokens).

3. Training data: These models are trained on vast amounts of text data from the internet to learn statistical patterns in language. Pre-trained models like BERT, RoBERTa, GPT-3, etc., are fine-tuned on specific tasks or datasets for better performance.

4. Capabilities: Large language models can perfor

After summaries are generated, you can validate them by organize input documents and associated output in a Pandas Dataframe.

In [28]:
final_mp_data = []
for doc, out in zip(
    map_reduce_outputs["input_documents"], map_reduce_outputs["intermediate_steps"]
):
    output = {}
    output["file_name"] = Path(doc.metadata["source"]).stem
    output["file_type"] = Path(doc.metadata["source"]).suffix
    output["page_number"] = doc.metadata["page"]
    output["chunks"] = doc.page_content
    output["concise_summary"] = out
    final_mp_data.append(output)

In [29]:
pdf_mp_summary = pd.DataFrame.from_dict(final_mp_data)
pdf_mp_summary = pdf_mp_summary.sort_values(
    by=["file_name", "page_number"]
)  # sorting the dataframe by filename and page_number
pdf_mp_summary.reset_index(inplace=True, drop=True)
pdf_mp_summary.head()

Unnamed: 0,file_name,file_type,page_number,chunks,concise_summary
0,tut09_llm,.pdf,0,Large Language Models \nCSC413 Tutorial 9 \nYo...,This text appears to be a tutorial slide or n...
1,tut09_llm,.pdf,1,Overview \n●What are LLMs? \n●Why LLMs? \n●Eme...,"LLMs, or Large Language Models, are artificia..."
2,tut09_llm,.pdf,2,What are Language Models? \n●Narrow Sense \n○A...,Language models are probabilistic models used...
3,tut09_llm,.pdf,3,Large Language Models - Billions of Parameters...,The blog post from Hugging Face discusses the...
4,tut09_llm,.pdf,4,Large Language Models - Hundreds of Billions o...,"The text is about Large Language Models, spec..."


In [30]:
index = 3
print("[Context]")
print(pdf_mp_summary["chunks"].iloc[index])
print("\n\n [Simple Summary]")
print(pdf_mp_summary["concise_summary"].iloc[index])
print("\n\n [Page number]")
print(pdf_mp_summary["page_number"].iloc[index])
print("\n\n [Source: file_name]")
print(pdf_mp_summary["file_name"].iloc[index])

[Context]
Large Language Models - Billions of Parameters  
https://huggingface.co/blog/large-language-models


 [Simple Summary]
 The blog post from Hugging Face discusses the advancements in large language models, which have grown significantly in size over the past few years. These models, such as BERT, RoBERTa, and T5, now contain billions of parameters, enabling them to better understand and generate human-like text. The larger models are able to grasp longer contexts, exhibit improved performance on various NLP tasks, and even demonstrate some ability to transfer knowledge across different domains. However, their increased size also brings challenges in terms of computational resources and ethical considerations around biases and misinformation. The post provides a brief overview of these developments and highlights the potential applications and implications of large language models in various fields.


 [Page number]
3


 [Source: file_name]
tut09_llm


### Considerations

With `MapReduce` method, the model is able to summarize a large paper by overcoming the context limit of `Stuffing` method with parallel processing.

However, the `MapReduce` requires multiple calls to the model to generate intermeadiate summary and potentially losing context between pages.

To deal this challenge, you can try another method to summarize multiple pages at a time.

## Method 3: Refine

The Refine method is an alternative method to deal with large document summarization. It works by first running an initial prompt on a small chunk of data, generating some output. Then, for each subsequent document, the output from the previous document is passed in along with the new document, and the LLM is asked to refine the output based on the new document.

In LangChain, you can use `MapReduceDocumentsChain` as part of the load_summarize_chain method. What you need to do is setting `refine` as `chain_type` of your chain.

### Prompt design with `Refine` chain

With LangChain, the `refine` chain requires two prompts.

The question prompt to generate the output for subsequent task. The refine prompt to refine the output based on the generated content.

In this example, the question prompt is:

```
Please provide a summary of the following text.
TEXT: {text}
SUMMARY:
```

and the refine prompt is:

```
Write a concise summary of the following text delimited by triple backquotes.
Return your response in bullet points which covers the key points of the text.
```{text}```
BULLET POINT SUMMARY:
```


In [31]:
question_prompt_template = """
                  Please provide a summary of the following text.
                  TEXT: {text}
                  SUMMARY:
                  """

question_prompt = PromptTemplate(
    template=question_prompt_template, input_variables=["text"]
)

refine_prompt_template = """
              Write a concise summary of the following text delimited by triple backquotes.
              Return your response in bullet points which covers the key points of the text.
              ```{text}```
              BULLET POINT SUMMARY:
              """

refine_prompt = PromptTemplate(
    template=refine_prompt_template, input_variables=["text"]
)

### Generate summaries using Refine method

After you define prompts, you initiate a summarization chain using `refine` chain type.

In [32]:
refine_chain = load_summarize_chain(
    llm,
    chain_type="refine",
    question_prompt=question_prompt,
    refine_prompt=refine_prompt,
    return_intermediate_steps=True,
)

Then, you use the summatization chain to summarize document using Refine method.

In [33]:
refine_outputs = refine_chain({"input_documents": pages})

 In this tutorial for CSC413, Yongchao Zhou discusses large language models. He begins by explaining that these models are neural networks with a large number of parameters, allowing them to learn complex patterns in data. The author then describes the Transformer model, which uses attention mechanisms to allow the model to focus on different parts of the input sequence when generating an output sequence.

Next, Zhou discusses some applications of large language models, such as text generation and machine translation. He also mentions their limitations, including the generation of inaccurate or nonsensical responses, and the potential for misuse or biased outputs.

The tutorial then covers some techniques for fine-tuning large language models on specific tasks. Fine-tuning involves training a model on a smaller dataset that is relevant to the task at hand. This allows the model to learn domain-specific knowledge and improve its performance on the given task.

Finally, Zhou discusses so

Below you can see the resulting summaries.

In [34]:
final_refine_data = []
for doc, out in zip(
    refine_outputs["input_documents"], refine_outputs["intermediate_steps"]
):
    output = {}
    output["file_name"] = Path(doc.metadata["source"]).stem
    output["file_type"] = Path(doc.metadata["source"]).suffix
    output["page_number"] = doc.metadata["page"]
    output["chunks"] = doc.page_content
    output["concise_summary"] = out
    final_refine_data.append(output)

In [35]:
pdf_refine_summary = pd.DataFrame.from_dict(final_refine_data)
pdf_refine_summary = pdf_mp_summary.sort_values(
    by=["file_name", "page_number"]
)  # sorting the datafram by filename and page_number
pdf_refine_summary.reset_index(inplace=True, drop=True)
pdf_refine_summary.head()

Unnamed: 0,file_name,file_type,page_number,chunks,concise_summary
0,tut09_llm,.pdf,0,Large Language Models \nCSC413 Tutorial 9 \nYo...,This text appears to be a tutorial slide or n...
1,tut09_llm,.pdf,1,Overview \n●What are LLMs? \n●Why LLMs? \n●Eme...,"LLMs, or Large Language Models, are artificia..."
2,tut09_llm,.pdf,2,What are Language Models? \n●Narrow Sense \n○A...,Language models are probabilistic models used...
3,tut09_llm,.pdf,3,Large Language Models - Billions of Parameters...,The blog post from Hugging Face discusses the...
4,tut09_llm,.pdf,4,Large Language Models - Hundreds of Billions o...,"The text is about Large Language Models, spec..."


In [36]:
index = 3
print("[Context]")
print(pdf_refine_summary["chunks"].iloc[index])
print("\n\n [Simple Summary]")
print(pdf_refine_summary["concise_summary"].iloc[index])
print("\n\n [Page number]")
print(pdf_refine_summary["page_number"].iloc[index])
print("\n\n [Source: file_name]")
print(pdf_refine_summary["file_name"].iloc[index])

[Context]
Large Language Models - Billions of Parameters  
https://huggingface.co/blog/large-language-models


 [Simple Summary]
 The blog post from Hugging Face discusses the advancements in large language models, which have grown significantly in size over the past few years. These models, such as BERT, RoBERTa, and T5, now contain billions of parameters, enabling them to better understand and generate human-like text. The larger models are able to grasp longer contexts, exhibit improved performance on various NLP tasks, and even demonstrate some ability to transfer knowledge across different domains. However, their increased size also brings challenges in terms of computational resources and ethical considerations around biases and misinformation. The post provides a brief overview of these developments and highlights the potential applications and implications of large language models in various fields.


 [Page number]
3


 [Source: file_name]
tut09_llm


### Considerations

In short, the Refine method for text summarization with LLMs can pull in more relevant context and may be less lossy than Map Reduce. However, it requires many more calls to the LLM than Stuffing, and these calls are not independent, meaning they cannot be parallelized. Additionally, there is some potential dependency on the ordering of the documents. Latest documents they might become more relevant as this method suffers from recency bias.

## Conclusion


In this notebook you learn about different techniques to summarize long documents with LangChain and Mistral. What you have seen in this notebook are only some of the possibilities you have. For example, there is another method called the Map-Rerank method which involves running an initial prompt on each chunk of data, which not only tries to complete a task but also gives a score for how certain it is in its answer/output. The responses are then ranked according to this score, and the highest score is returned.


It's crucial to note that, depending on specific requirements, leveraging a foundational model in conjunction with a custom framework might be advantageous for developing generative AI applications. This approach offers several benefits:

1. Greater adaptability in integrating different Large Language Models (LLMs), prompting templates, and document handling strategies.
2. Enhanced customization capabilities to tailor generative applications to specific scenarios.
 3.  Improved performance metrics, including reduced latency and enhanced scalability of the application.

These advantages underscore the flexibility and control that a foundational model with a custom framework can provide, making it a viable option for various generative AI application development needs.
