<a href="https://colab.research.google.com/github/MittalMonika/Signdetection/blob/main/Summarization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Simplifying Financial Document Summarization with LangChain

## Overview

Text summarization simplifies lengthy texts into brief, informative summaries. This NLP technique is ideal for condensing news articles, research papers, and other texts.

This guide focuses on summarizing financial documents, particularly identifying due taxes and critical deadlines. Summarizing financial texts involves extracting key dates, tax obligations, and preparatory steps required before deadlines. Prior familiarity with basic summarization techniques is advantageous for effectively handling complex financial documents.

Utilize LangChain in this Google Colab notebook to apply advanced summarization strategies specifically tailored for financial documents. Through practical examples, you'll learn to efficiently extract essential information, ensuring timely tax submissions and compliance with financial deadlines.

[Blog](https://medium.com/google-cloud/langchain-chain-types-large-document-summarization-using-langchain-and-google-cloud-vertex-ai-1650801899f6related)

### Objective

To summarize large financial documents, several LLM providers and APIs are available. This notebook uses OpenAI; Vertex AI Generative AI Studio on Google Cloud is another option to explore. The following methods are covered:

- Stuffing method
- MapReduce method
- Refine method

In [52]:
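import os
from getpass import getpass

# Never hard-code an API key in a notebook; read it from the environment or prompt for it.
if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass("OpenAI API key: ")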

In [None]:
!sudo apt -y -qq install tesseract-ocr
!sudo apt -y -qq install libtesseract-dev
!sudo apt-get -y -qq install poppler-utils #required by PyPDF2 for page count and other pdf utilities
!sudo apt-get -y -qq install python-dev libxml2-dev libxslt1-dev antiword unrtf poppler-utils pstotext tesseract-ocr flac ffmpeg lame libmad0 libsox-fmt-mp3 sox libjpeg-dev swig

In [None]:
pip install  pytesseract pypdf  PyPDF2 textract langchain transformers openai tiktoken

In [5]:
import urllib
import warnings
from pathlib import Path as p

import pandas as pd
from langchain import PromptTemplate
from langchain.chains.summarize import load_summarize_chain
from langchain.document_loaders import PyPDFLoader
from langchain.llms import VertexAI

warnings.filterwarnings("ignore")



### Preparing data files

To begin, place the PDF you want to summarize in the `data` folder. This notebook expects a local file named `tax1.pdf`; the download lines in the cell below are commented out.

In [6]:
data_folder = p.cwd() / "data"
p(data_folder).mkdir(parents=True, exist_ok=True)

#pdf_url = "https://services.google.com/fh/files/misc/practitioners_guide_to_mlops_whitepaper.pdf"
pdf_file = str(p(data_folder, 'tax1.pdf'))

#urllib.request.urlretrieve(pdf_url, pdf_file)

### Extract text from the PDF

Use `PyPDFLoader` to load the PDF and extract its text.

In [56]:
pdf_loader = PyPDFLoader(pdf_file)
pages = pdf_loader.load_and_split()
text = pages[0].page_content
print(text)

Notice of Balance Due for Tax Year 2018 Notice Date: May 6, 2019 Social Security Number: XXX-KX-X0K To contact us: 800-829-1040 Your Caller ID: OOK Page 1 of 4You have a balance due for 2018Amount due: $262.76Our records show you have unpaid taxes and/or penalties and interest on your December 31, 2018 Form 1040. If you already have an installment or payment agreement in place for this tax year, then continue with that agreement.Billing Summary: Tax you owed: $38,371.00 Payments and credits: $0.00 Failure to pay proper estimated tax penalty: $262.76 Amount due by May 27, 2019: $262.76
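Because the extracted text is fairly regular, the key fields can also be pulled out with a regular expression as a sanity check on whatever the model returns. The sketch below is only an illustration: `extract_due` and `DUE_PATTERN` are hypothetical helpers written against the wording of this one notice (`Amount due by <date>: $<amount>`), and other notices will likely need different patterns.

```python
import re

# Excerpt of the text extracted from the notice above.
sample_text = (
    "Failure to pay proper estimated tax penalty: $262.76 "
    "Amount due by May 27, 2019: $262.76"
)

# Hypothetical pattern: matches "Amount due by <Month D, YYYY>: $<amount>".
DUE_PATTERN = re.compile(
    r"Amount due by (?P<date>[A-Z][a-z]+ \d{1,2}, \d{4}): \$(?P<amount>[\d,]+\.\d{2})"
)


def extract_due(text):
    """Return (due_date, amount) as strings, or None if the pattern is absent."""
    match = DUE_PATTERN.search(text)
    if match is None:
        return None
    return match.group("date"), match.group("amount")


print(extract_due(sample_text))
```

If the regex result and the model's summary disagree, that is a useful signal to inspect the page by hand.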


In [30]:
three_pages = pages[:1]  # despite the name, only the first chunk is used here

## Method 1: Stuffing

Stuffing is the simplest way to pass data to a language model. It "stuffs" the text into the prompt as context, so that the model can process all of the relevant information in one go.

In LangChain, you can use `StuffDocumentsChain` as part of the `load_summarize_chain` method. All you need to do is set `stuff` as the `chain_type` of your chain.
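Conceptually, stuffing is string concatenation followed by a single model call. The following sketch shows the idea without LangChain; `build_stuffed_prompt` is a hypothetical helper, not part of the library, and the separator and template are assumptions made for illustration.

```python
def build_stuffed_prompt(template, page_texts, separator="\n\n"):
    """Join all page texts and place them into a single prompt."""
    combined = separator.join(page_texts)
    return template.format(text=combined)


template = "Write a concise summary of the following text:\n{text}\nSUMMARY:"
pages_text = ["Notice of Balance Due for Tax Year 2018", "Amount due: $262.76"]

# One prompt for the whole document, so only one model call is needed.
stuffed_prompt = build_stuffed_prompt(template, pages_text)
print(stuffed_prompt)
```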

### Prompt design with `Stuffing` chain

In [24]:
## Basic Prompt Summarization
from langchain.chat_models import ChatOpenAI
from langchain.schema import(
    AIMessage,
    HumanMessage,
    SystemMessage
)

In [57]:
chat_messages=[
    SystemMessage(content='You are an expert assistant with expertise in summarizing financial documents and extracting the tax to be paid and the due date'),
    HumanMessage(content=f'Please provide a short and concise summary of the following document:\n TEXT: {text}')
]

llm=ChatOpenAI(model_name='gpt-3.5-turbo')

In [59]:
llm.get_num_tokens(text)

183

In [36]:
prompt_template = """Write a concise summary of the following text delimited by triple backquotes.
              Return your response in bullet points which covers the key points of the text.
              ```{text}```
              BULLET POINT SUMMARY:
  """

prompt = PromptTemplate(template=prompt_template, input_variables=["text"])

In [60]:
llm = ChatOpenAI(temperature=0, model_name='gpt-3.5-turbo')

In [46]:
from langchain import PromptTemplate
from langchain.chat_models import ChatOpenAI
from langchain.chains.summarize import load_summarize_chain
from langchain.docstore.document import Document

In [75]:
template = '''Write a concise and short summary of the following financial document. Begin your answer with one line about the form or motive, then provide: 1. Name or SSN of the person, 2. tax to be paid, 3. date by which it should be paid.
Document: `{text}`
'''
prompt = PromptTemplate(
    input_variables=['text'],
    template=template
)

In [76]:
chain = load_summarize_chain(
    llm,
    chain_type='stuff',
    prompt=prompt,
    verbose=False
)
output_summary = chain.run(three_pages)

In [77]:
# Split the output into lines
lines = output_summary.split('\n')

# Parse each line into a key/value pair
summary_data = {}
for line in lines:
    # Skip blank lines and lines without a "key: value" separator
    if ': ' not in line:
        continue
    # Split on the first occurrence of ": " only
    key, value = line.split(': ', 1)
    summary_data[key.strip()] = value.strip()

# `summary_data` is now a dictionary with the extracted fields
print(summary_data)

{'Form/Motive': 'Notice of Balance Due for Tax Year 2018', '1. Name or SSN of the person': 'XXX-KX-X0K', '2. Tax to be paid': '$262.76', '3. Date by which it should be paid': 'May 27, 2019'}
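Once the fields are in a dictionary, converting them to typed values makes deadline arithmetic easier. This is a minimal sketch, assuming the model returns the amount as `$262.76` and the date as `May 27, 2019`, as in the output above. `parse_amount` and `parse_due_date` are hypothetical helpers, and other formats would need extra handling.

```python
from datetime import date, datetime
from decimal import Decimal


def parse_amount(value):
    """Convert a string such as '$38,371.00' to a Decimal."""
    return Decimal(value.replace("$", "").replace(",", "").strip())


def parse_due_date(value):
    """Convert a string such as 'May 27, 2019' to a date (English month names)."""
    return datetime.strptime(value.strip(), "%B %d, %Y").date()


amount = parse_amount("$262.76")
due = parse_due_date("May 27, 2019")
notice_date = date(2019, 5, 6)  # notice date taken from the document text

# Days between the notice date and the payment deadline.
days_to_pay = (due - notice_date).days
print(amount, due, days_to_pay)
```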


In [73]:
output_summary

'Name or SSN of the person: XXX-KX-X0K\nTax to be paid: $262.76\nDate by which it should be paid: May 27, 2019'

In [70]:
output_summary

'1. Name of the person: Not provided\n2. Tax to be paid: $262.76\n3. Date by which it should be paid: May 27, 2019'

In [67]:
output_summary

'- Notice of Balance Due for Tax Year 2018\n- Notice Date: May 6, 2019\n- Social Security Number: XXX-KX-X0K\n- Contact number: 800-829-1040\n- Caller ID: OOK\n- Balance due for 2018: $262.76\n- Unpaid taxes and/or penalties and interest on December 31, 2018 Form 1040\n- If installment or payment agreement is in place, continue with that agreement\n- Billing Summary:\n  - Tax owed: $38,371.00\n  - Payments and credits: $0.00\n  - Failure to pay proper estimated tax penalty: $262.76\n- Amount due by May 27, 2019: $262.76'

In [63]:
output_summary

'Summary: The document is a notice of balance due for tax year 2018. The recipient has a balance due of $262.76, which includes unpaid taxes, penalties, and interest. The amount is to be paid by May 27, 2019.'

In [55]:
output_summary

'This speech is a notice informing the recipient that they have a balance due for the tax year 2018. The amount due is $262.76, which includes unpaid taxes, penalties, and interest. The recipient is advised to continue with any existing installment or payment agreement. The deadline for payment is May 27, 2019.'

### Retrying
Initiate a chain using the `stuff` method to process a three-page document.

With the `stuff` method, you can summarize the entire document with a single API call that passes all the data at once.

However, depending on the context length of the LLM, the `stuff` method may not work, because it can result in a prompt larger than the context length.

In that case, the call returns an exception message instead of a summary.

### Considerations

The `stuffing` method is a way to summarize text by feeding the entire document to a large language model (LLM) in a single call. This method has both pros and cons.

The stuffing method only requires a single call to the LLM, which can be faster than other methods that require multiple calls. When summarizing text, the LLM has access to all the data at once, which can result in a better summary.

However, LLMs have a context length, which is the maximum number of tokens that can be processed in a single call. If the document is longer than the context length, the stuffing method will not work. Even when a large document does fit, stuffing can be slow and may not produce a good summary.
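A quick pre-check can tell you whether stuffing is even an option. The sketch below uses a crude rule of thumb of about four characters per token, which is an assumption and not a real tokenizer; in this notebook, `llm.get_num_tokens(text)` gives the actual count. The function names, the limit, and the reserve for the answer are illustrative values.

```python
def estimate_tokens(text, chars_per_token=4):
    """Rough token estimate; a real tokenizer should be preferred."""
    return max(1, len(text) // chars_per_token)


def fits_in_context(text, context_limit, reserved_for_output=256):
    """Return True if the text plus room for the answer fits in the context window."""
    return estimate_tokens(text) + reserved_for_output <= context_limit


short_text = "Amount due by May 27, 2019: $262.76"
long_text = short_text * 2000  # simulate a very long document

print(fits_in_context(short_text, context_limit=4096))
print(fits_in_context(long_text, context_limit=4096))
```

When the check fails, one of the multi-call methods is the safer choice.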

Let's explore other approaches that help when the text is longer than the context length limit of the LLM.

## Method 2: MapReduce

The `MapReduce` method implements multi-stage summarization. It summarizes a large piece of text by first summarizing smaller chunks and then combining those summaries into a single summary.

In LangChain, you can use `MapReduceDocumentsChain` as part of the `load_summarize_chain` method. All you need to do is set `map_reduce` as the `chain_type` of your chain.
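The two stages can be sketched in plain Python. This is not LangChain's implementation; `map_reduce_summarize`, `first_sentence`, and `join_summaries` are hypothetical stand-ins so the example runs without an API key, with a trivial "keep the first sentence" function playing the role of the model.

```python
def map_reduce_summarize(chunks, summarize_chunk, combine):
    """Summarize each chunk independently, then combine the partial summaries."""
    partial_summaries = [summarize_chunk(chunk) for chunk in chunks]  # map step
    return combine(partial_summaries), partial_summaries  # reduce step


# Stand-in "models": keep the first sentence, then join with " | ".
def first_sentence(text):
    return text.split(".")[0].strip() + "."


def join_summaries(summaries):
    return " | ".join(summaries)


chunks = [
    "You have a balance due for 2018. Our records show unpaid taxes.",
    "The payment deadline is May 27, 2019. Pay online or by mail.",
]
final_summary, steps = map_reduce_summarize(chunks, first_sentence, join_summaries)
print(final_summary)
```

Returning the partial summaries alongside the final one mirrors the idea of inspecting intermediate steps when validating results.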

### Prompt design with `MapReduce` chain

In this example, you have a multi-page document that you need to summarize.

With LangChain, the `map_reduce` chain breaks the document down into chunks of at most 1024 tokens. It then runs the initial prompt you define on each chunk to generate a summary of that chunk. In the example below, you use the following first-stage, or map, prompt.

```Write a summary of this chunk of text that includes the main points and any important details. {text}```

Once summaries for all of the chunks are generated, it runs a different prompt to combine those summaries into a single summary. In the example below, you use the following second-stage, or combine, prompt.

```Write a concise summary of the following text delimited by triple backquotes. Return your response in bullet points which covers the key points of the text.
'''{text}'''. BULLET POINT SUMMARY:```
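The chunking step described above can be pictured with a small splitter. In this sketch, whitespace-separated words stand in for tokens, which is a simplifying assumption; real token counts depend on the model's tokenizer. `split_into_chunks` is a hypothetical helper, not a LangChain function.

```python
def split_into_chunks(text, max_tokens):
    """Split text into consecutive chunks of at most max_tokens words."""
    words = text.split()
    return [
        " ".join(words[start:start + max_tokens])
        for start in range(0, len(words), max_tokens)
    ]


document = "Notice of Balance Due for Tax Year 2018 Amount due by May 27, 2019"
chunks = split_into_chunks(document, max_tokens=5)
for chunk in chunks:
    print(chunk)
```

Each chunk would then be passed to the map prompt on its own.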

In [None]:
map_prompt_template = """
                      Write a summary of this chunk of text that includes the main points and any important details.
                      {text}
                      """

map_prompt = PromptTemplate(template=map_prompt_template, input_variables=["text"])

combine_prompt_template = """
                      Write a concise summary of the following text delimited by triple backquotes.
                      Return your response in bullet points which covers the key points of the text.
                      ```{text}```
                      BULLET POINT SUMMARY:
                      """

combine_prompt = PromptTemplate(
    template=combine_prompt_template, input_variables=["text"]
)

### Generate summaries using MapReduce method

After defining prompts, you initialize the associated `map_reduce_chain`.

In [None]:
map_reduce_chain = load_summarize_chain(
    llm,
    chain_type="map_reduce",
    map_prompt=map_prompt,
    combine_prompt=combine_prompt,
    return_intermediate_steps=True,
)

Then, you generate summaries using the chain. Note that LangChain uses a tokenizer (from the transformers library) with a 1024-token limit by default.

In [None]:
map_reduce_outputs = map_reduce_chain({"input_documents": pages})

After the summaries are generated, you can validate them by organizing the input documents and their associated outputs in a Pandas DataFrame.

In [None]:
final_mp_data = []
for doc, out in zip(
    map_reduce_outputs["input_documents"], map_reduce_outputs["intermediate_steps"]
):
    output = {}
    output["file_name"] = p(doc.metadata["source"]).stem
    output["file_type"] = p(doc.metadata["source"]).suffix
    output["page_number"] = doc.metadata["page"]
    output["chunks"] = doc.page_content
    output["concise_summary"] = out
    final_mp_data.append(output)

In [None]:
pdf_mp_summary = pd.DataFrame.from_dict(final_mp_data)
pdf_mp_summary = pdf_mp_summary.sort_values(
    by=["file_name", "page_number"]
)  # sorting the dataframe by filename and page_number
pdf_mp_summary.reset_index(inplace=True, drop=True)
pdf_mp_summary.head()

In [None]:
index = 3
print("[Context]")
print(pdf_mp_summary["chunks"].iloc[index])
print("\n\n [Simple Summary]")
print(pdf_mp_summary["concise_summary"].iloc[index])
print("\n\n [Page number]")
print(pdf_mp_summary["page_number"].iloc[index])
print("\n\n [Source: file_name]")
print(pdf_mp_summary["file_name"].iloc[index])

### Considerations

With the `MapReduce` method, the model can summarize a large document by processing chunks in parallel, which overcomes the context limit of the `Stuffing` method.

However, `MapReduce` requires multiple calls to the model and can lose context between pages.

To deal with this challenge, you can try another method that carries context across multiple pages.
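The parallelism mentioned above works because each map call depends only on its own chunk. A minimal sketch using the standard library follows; `summarize_chunk` is a stand-in for a model call and `parallel_map` is a hypothetical helper, so no API key is needed to run it.

```python
from concurrent.futures import ThreadPoolExecutor


def summarize_chunk(chunk):
    """Stand-in for a model call: report the chunk's word count."""
    return f"{len(chunk.split())} words"


def parallel_map(chunks, worker, max_workers=4):
    """Run the map step concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker, chunks))


chunks = ["balance due for 2018", "amount due by May 27", "pay online"]
partial_summaries = parallel_map(chunks, summarize_chunk)
print(partial_summaries)
```

Threads suit this case because real model calls spend most of their time waiting on the network.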

## Method 3: Refine

The Refine method is an alternative method to deal with large document summarization. It works by first running an initial prompt on a small chunk of data, generating some output. Then, for each subsequent document, the output from the previous document is passed in along with the new document, and the LLM is asked to refine the output based on the new document.

In LangChain, you can use `RefineDocumentsChain` as part of the `load_summarize_chain` method. All you need to do is set `refine` as the `chain_type` of your chain.
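The refine loop described above can be sketched in a few lines of plain Python. This is an illustration under assumptions, not LangChain's code: `refine_summarize`, `initial`, and `refine_step` are hypothetical names, and the stand-in functions simply upper-case and append text so the flow is easy to follow.

```python
def refine_summarize(chunks, initial_summarize, refine):
    """Build a summary incrementally, one chunk at a time."""
    if not chunks:
        return "", []
    summary = initial_summarize(chunks[0])
    intermediate = [summary]
    for chunk in chunks[1:]:
        # Each step sees the running summary plus the next chunk.
        summary = refine(summary, chunk)
        intermediate.append(summary)
    return summary, intermediate


# Stand-in "models" so the sketch runs without an API key.
def initial(chunk):
    return chunk.upper()


def refine_step(existing_summary, chunk):
    return f"{existing_summary}; {chunk.upper()}"


final_summary, steps = refine_summarize(
    ["balance due", "amount $262.76", "pay by May 27, 2019"], initial, refine_step
)
print(final_summary)
```

Because every step needs the previous result, the calls have to run one after another.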

### Prompt design with `Refine` chain

With LangChain, the `refine` chain requires two prompts.

The question prompt generates the initial output for the subsequent steps. The refine prompt then refines that output based on each new chunk of content.

In this example, the question prompt is:

```
Please provide a summary of the following text.
TEXT: {text}
SUMMARY:
```

and the refine prompt, which receives the summary so far as `{existing_answer}`, is:

```
Here is the existing summary so far:
{existing_answer}
Refine it, if needed, using the additional text delimited by triple backquotes.
Return your response in bullet points which covers the key points of the text.
'''{text}'''
BULLET POINT SUMMARY:
```


In [None]:
question_prompt_template = """
                  Please provide a summary of the following text.
                  TEXT: {text}
                  SUMMARY:
                  """

question_prompt = PromptTemplate(
    template=question_prompt_template, input_variables=["text"]
)

refine_prompt_template = """
              Here is the existing summary so far:
              {existing_answer}
              Refine it, if needed, using the additional text delimited by triple backquotes.
              Return your response in bullet points which covers the key points of the text.
              ```{text}```
              BULLET POINT SUMMARY:
              """

# `existing_answer` is the variable the refine chain uses to pass in the previous summary.
refine_prompt = PromptTemplate(
    template=refine_prompt_template, input_variables=["existing_answer", "text"]
)

### Generate summaries using Refine method

After you define the prompts, you initiate a summarization chain using the `refine` chain type.

In [None]:
refine_chain = load_summarize_chain(
    llm,
    chain_type="refine",
    question_prompt=question_prompt,
    refine_prompt=refine_prompt,
    return_intermediate_steps=True,
)

Then, you use the summarization chain to summarize the document using the Refine method.

In [None]:
refine_outputs = refine_chain({"input_documents": pages})

Below you can see the resulting summaries.

In [None]:
final_refine_data = []
for doc, out in zip(
    refine_outputs["input_documents"], refine_outputs["intermediate_steps"]
):
    output = {}
    output["file_name"] = p(doc.metadata["source"]).stem
    output["file_type"] = p(doc.metadata["source"]).suffix
    output["page_number"] = doc.metadata["page"]
    output["chunks"] = doc.page_content
    output["concise_summary"] = out
    final_refine_data.append(output)

In [None]:
pdf_refine_summary = pd.DataFrame.from_dict(final_refine_data)
pdf_refine_summary = pdf_refine_summary.sort_values(
    by=["file_name", "page_number"]
)  # sorting the dataframe by filename and page_number
pdf_refine_summary.reset_index(inplace=True, drop=True)
pdf_refine_summary.head()

In [None]:
index = 3
print("[Context]")
print(pdf_refine_summary["chunks"].iloc[index])
print("\n\n [Simple Summary]")
print(pdf_refine_summary["concise_summary"].iloc[index])
print("\n\n [Page number]")
print(pdf_refine_summary["page_number"].iloc[index])
print("\n\n [Source: file_name]")
print(pdf_refine_summary["file_name"].iloc[index])

### Considerations

In short, the Refine method for text summarization with LLMs can pull in more relevant context and may be less lossy than MapReduce. However, it requires many more calls to the LLM than Stuffing, and these calls are not independent, meaning they cannot be parallelized. Additionally, the result may depend on the ordering of the documents: later documents can carry more weight, because this method suffers from recency bias.
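To make the cost comparison concrete, the number of model calls per method can be estimated from the number of chunks. This is a rough sketch: `llm_calls` is a hypothetical helper, and the `map_reduce` figure assumes a single combine pass, whereas a real chain may need extra passes when the partial summaries are themselves too long.

```python
def llm_calls(method, n_chunks):
    """Estimate the number of model calls for each summarization strategy."""
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")
    if method == "stuff":
        return 1  # everything goes into a single prompt
    if method == "map_reduce":
        return n_chunks + 1  # one call per chunk plus one combine call (assumed single pass)
    if method == "refine":
        return n_chunks  # one initial call plus one refine call per remaining chunk
    raise ValueError(f"unknown method: {method}")


for method in ("stuff", "map_reduce", "refine"):
    print(method, llm_calls(method, n_chunks=4))
```

The counts alone do not capture latency: the `map_reduce` calls can overlap in time, while the `refine` calls must run sequentially.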