# Intelligent Book Summarization Using LangChain and ChatGPT with PDF Export


## Project Description:

In today's fast-paced world, the ability to distill large volumes of information into concise summaries is a valuable skill. This project aims to leverage cutting-edge technologies to create an intelligent book summarization system that combines the power of the **LangChain** framework, **ChatGPT** as a Language Model, and the **FPDF** Python library to produce well-structured PDF summaries of books.

## LangChain Framework:

The LangChain framework serves as the foundation of our project. LangChain is a framework for building applications on top of large language models. It provides document loaders, text splitters, and ready-made chains such as the summarization chain used here, which makes it a good fit for our summarization task.

## ChatGPT as Language Model:

ChatGPT, an advanced Language Model, plays a central role in this project. It harnesses the power of deep learning and natural language processing to understand and generate human-like text. We employ ChatGPT to process the content of the book, ensuring that our summaries are not only concise but also coherent and contextually accurate.

OpenAI restricts API usage with certain constraints, such as limits on the number of tokens and the number of API calls per minute. A workaround is provided below to address these constraints.

## Summarization Process:

Our project follows a systematic summarization process. We first load the PDF book with LangChain's **PyPDFLoader**, which uses the **pypdf** library. The text is then pre-processed to remove noise and keep the relevant content. Next, the refined text is passed to ChatGPT, which generates a summary that captures the most salient points and key takeaways from the book. The combination of LangChain and ChatGPT helps keep our summaries both accurate and comprehensive.

## PDF Export Using FPDF:

To make our summaries easily accessible and shareable, we utilize the **FPDF** Python library to convert the generated summaries into PDF documents. FPDF allows us to format the content, add headers and footers, include images or graphs (although we did not add any), and create a professional-looking document that retains the essence of the book. The PDF export feature ensures that our summaries can be seamlessly integrated into various platforms, such as e-readers or websites.

A noteworthy highlight is the use of external Unicode font files in FPDF to accommodate special characters.

For demonstration purposes, this code was tested on the book [Crime & Punishment](https://books.google.com.pk/books/about/Crime_and_Punishment.html?id=9uBXAAAAYAAJ&printsec=frontcover&source=kp_read_button&hl=en&redir_esc=y#v=onepage&q&f=false) by **Fyodor Dostoevsky**.

In [1]:
#Install the necessary libraries
!pip install -q pypdf langchain openai tiktoken fpdf



To handle the OpenAI API key securely, I have stored it in a Python file named **"API_file.py"**, located in the same directory as this notebook. Let's load the API key and store it as an environment variable.

In [2]:
#Extract the API key as API_key
from API_file import API_key

In [3]:
#Create an Environment Variable named OPENAI_API_KEY to store the API_key
import os

os.environ["OPENAI_API_KEY"] = API_key
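If the key is already exported in the shell, a small check like the one below can fail early with a readable message instead of a later authentication error. This is only a sketch: `require_api_key` is a hypothetical helper name, not part of the OpenAI or LangChain libraries, and the `sk-test` value is a placeholder.

```python
import os

def require_api_key(env=None, name="OPENAI_API_KEY"):
    """Return the API key stored under `name`, or raise a clear error."""
    # Default to the real process environment; a plain dict can be passed for testing.
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it or load it from API_file.py first")
    return key

# Example with a placeholder value (not a real key)
print(require_api_key({"OPENAI_API_KEY": " sk-test "}))
```

Passing a dictionary keeps the function easy to try without touching the real environment.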

In [4]:
from langchain.chat_models import ChatOpenAI
from langchain.chains.summarize import load_summarize_chain

In [5]:
#Instantiate the GPT-4 LLM from OpenAI
llm = ChatOpenAI(model="gpt-4", temperature=0.4, max_tokens=500)

We'll set the **chain_type** to **map_reduce** because the document is large: it contains **209,050** words. **PyPDFLoader** splits the document into smaller, page-level chunks. With the **map_reduce** chain, each chunk is first mapped to an individual summary using an LLMChain, and a **ReduceDocumentsChain** then combines those summaries into a single global summary.

In [6]:
#Instantiate the LangChain Summarization API
chain = load_summarize_chain(llm, chain_type="map_reduce")
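To make the shape of the map and reduce steps concrete, here is a toy sketch that does not call any LLM. `map_reduce_summarize` and `first_sentences` are hypothetical names used only for illustration; the real LangChain chain uses prompts and an LLM for both steps, so this shows the control flow only.

```python
from typing import Callable, List

def map_reduce_summarize(chunks: List[str], summarize: Callable[[str], str]) -> str:
    """Summarize each chunk (map), then summarize the joined partial summaries (reduce)."""
    partial = [summarize(chunk) for chunk in chunks]  # map step: one summary per chunk
    return summarize("\n".join(partial))              # reduce step: one global summary

def first_sentences(text: str) -> str:
    """Stand-in 'summarizer': keep only the first sentence of every line."""
    return " ".join(
        line.split(".")[0].strip() + "."
        for line in text.splitlines()
        if line.strip()
    )

pages = ["Page one opens. More detail.", "Page two follows. Extra text."]
result = map_reduce_summarize(pages, first_sentences)
print(result)  # Page one opens. Page two follows.
```

Swapping `first_sentences` for a function that calls an LLM gives the same two-stage structure that the summarization chain applies to the book's pages.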

In [7]:
#Read the PDF document
from langchain.document_loaders import PyPDFLoader

loader = PyPDFLoader(file_path="crime-and-punishment.pdf")
documents = loader.load_and_split()

## Pre-Processing

Let's clean the text and discard some irrelevant information.

In [8]:
#Remove some initial irrelevant pages
for i in range (6):
    _ = documents.pop(0)

In [11]:
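#Remove the redundant watermark phrase from the text of every page.
#Note: each item in `documents` is a Document object, not a string,
#so the phrase has to be removed from its page_content.
for document in documents:
    document.page_content = document.page_content.replace("Free eBooks at Planet eBook.com", "")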

## Summarize the text

We cannot send the whole book to OpenAI in one rapid burst of API calls, because that would raise a **RateLimitError**.

To avoid RateLimitError when making API calls to OpenAI, we must adhere to OpenAI's rate limits, which are set at:

- 10,000 TPM (tokens per minute), and
- 200 RPM (requests per minute).

Given the substantial size of books, such as the one used in our testing, which contains **209,050** words, an efficient strategy is needed to manage the API calls. We break the book into segments of consecutive pages, summarize each segment with a separate call, pause between calls, and then merge the individual summaries into one comprehensive summary. This approach keeps us within OpenAI's rate limits while still summarizing lengthy texts like books.
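The segmenting idea can be isolated as a small pure function. The sketch below assumes that a word count is an adequate stand-in for a token count, as the 6,500-word threshold in this notebook does. `batch_by_word_budget` is a hypothetical helper, not a LangChain function.

```python
def batch_by_word_budget(texts, budget=6500):
    """Group consecutive texts so that each batch stays at or under `budget` words.

    A single text that exceeds the budget on its own still gets its own batch.
    """
    batches, current, count = [], [], 0
    for text in texts:
        words = len(text.split())
        # Close the current batch if adding this text would exceed the budget
        if current and count + words > budget:
            batches.append(current)
            current, count = [], 0
        current.append(text)
        count += words
    if current:
        batches.append(current)  # keep the final, possibly smaller batch
    return batches

pages = ["a b c", "d e", "f g h i", "j"]
print(batch_by_word_budget(pages, budget=5))
# [['a b c', 'd e'], ['f g h i', 'j']]
```

Each batch can then be summarized with one call, followed by a pause before the next call.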

In [12]:
import time, random

In [13]:
#Variable to store the summarized text
summary = ""

#Running word count of the current batch; used to make sure we send fewer than 10k tokens
length = 0

#Variable to store the page number of last request sent to OpenAI
last_page = 0

wait_time = 1.1
for page_num in range(0,len(documents)):
    length = length + len(documents[page_num].page_content.split(" "))
    if length > 6500:
        summary = summary + " " + chain.run(documents[last_page:page_num])
        length = 0
        last_page = page_num
        #Since ChatGPT takes a long time to process, print the
        # progress along the way so that we know the program
        # is working and is not stuck anywhere
        print("Current Page Number:",page_num)
        print("\n\t\t\t\tSystem is waiting to ensure 10k TPM request")
        #Wait 66 seconds before making the next request
        time.sleep(wait_time*60)
        print("\n\t\t\t\t- System is done waiting")
        #wait_time = wait_time + 1

Current Page Number: 22

				System is waiting to ensure 10k TPM request

				- System is done waiting
Current Page Number: 45

				System is waiting to ensure 10k TPM request

				- System is done waiting
Current Page Number: 67

				System is waiting to ensure 10k TPM request

				- System is done waiting
Current Page Number: 91

				System is waiting to ensure 10k TPM request

				- System is done waiting
Current Page Number: 114

				System is waiting to ensure 10k TPM request

				- System is done waiting
Current Page Number: 140

				System is waiting to ensure 10k TPM request

				- System is done waiting
Current Page Number: 163

				System is waiting to ensure 10k TPM request

				- System is done waiting
Current Page Number: 189

				System is waiting to ensure 10k TPM request

				- System is done waiting
Current Page Number: 214

				System is waiting to ensure 10k TPM request

				- System is done waiting
Current Page Number: 239

				System is waiting to ensure 10k TPM re

In [14]:
#Summarize the remaining pages that the loop above did not send
if last_page < len(documents):
    summary = summary + " " + chain.run(documents[last_page:len(documents)])

In [15]:
#Print the summary
summary

' The protagonist of "Crime and Punishment," Raskolnikov, is a poor, anxious young man who avoids social interaction. He visits an old woman, Alyona Ivanovna, to pawn a watch, and observes her home and actions carefully, hinting at a secret plan. Later, he goes to a tavern where he meets Marmeladov, a drunkard who shares his tragic life story, including his marriage to a noblewoman, Katerina Ivanovna, and his subsequent descent into alcoholism and poverty. Marmeladov also discusses his daughter Sonia\'s difficult life and limited education. This passage from "Crime and Punishment" describes the hardships of Katerina Ivanovna, a sick woman caring for her children and dealing with financial struggles. A character named Marmeladov recounts a tense situation involving his daughter, Sonia, who is forced to leave their home due to accusations from a notorious woman, Darya Frantsovna. Marmeladov, a drunken man, shares his struggles and guilt over his actions, such as stealing money from his f

## PDF Generation

The final step is to convert the generated summary from a string into a clean PDF document. This is done with the **FPDF** library. FPDF was originally a PHP library and has been ported to Python by its community.

However, it does not appear to be actively updated, which is why it raises errors when it encounters special characters. It throws the following error if your text contains a special character such as ‘ or ’:

```
    
python fpdf UnicodeEncodeError: 'latin-1' codec can't encode character '\u2018' in position 403: ordinal not in range(256)
    
```

**FPDF** cannot handle special characters with its built-in fonts. We need to explicitly add the font files for the specific font family; otherwise, it throws an error whenever it encounters a special character. I am using the standard **Times New Roman** font. Its font files were downloaded from [here](https://colab.research.google.com/corgiredirector?site=https%3A%2F%2Fdwl.freefontsfamily.com%2Fdownload%2FTimes-New-Roman-Font%2F) and are stored in the **fonts** directory. If you are using another font, add its font files instead.

There is a newer **FPDF2** library, but in my experience it had more issues than the older **FPDF**: custom fonts can be added to FPDF as described above, but the same solution did not work for me with FPDF2.
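If font files are not available, a different workaround is to normalize the text to Latin-1 before writing it. This is a swapped-in technique, not what this notebook does, and it loses the original typography. `to_latin1_safe` and the `REPLACEMENTS` table are hypothetical names chosen for this sketch.

```python
# Map common typographic characters to Latin-1-safe equivalents.
REPLACEMENTS = {
    "\u2018": "'", "\u2019": "'",   # single curly quotes
    "\u201c": '"', "\u201d": '"',   # double curly quotes
    "\u2013": "-", "\u2014": "-",   # en dash and em dash
    "\u2026": "...",                # ellipsis
}

def to_latin1_safe(text):
    """Replace known characters, then substitute '?' for anything else outside Latin-1."""
    table = str.maketrans(REPLACEMENTS)
    cleaned = text.translate(table)
    # errors="replace" turns any remaining unencodable character into '?'
    return cleaned.encode("latin-1", errors="replace").decode("latin-1")

print(to_latin1_safe("\u2018Sonia\u2019 said \u201cyes\u201d\u2026"))
# 'Sonia' said "yes"...
```

Characters that Latin-1 already supports, such as é, pass through unchanged.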
    

In [16]:
#Load the FPDF library
from fpdf import FPDF

In [19]:
"""
This function accepts a string as input and converts it into a PDF file.
And stores the PDF file on the current working directory.
"""
import random
def create_pdf(final_summary):
    
    #Instantiate the FPDF class
    pdf = FPDF(orientation='P', #Portrait
           unit= 'in', #Other values for unit: 'mm' --> millimeter, 'pt' --> point, 'cm' --> centimeter, 'in' --> inch
           format='A4' #A4 Page
           )
    
    
    #Add the font files for Times New Roman
    pdf.add_font(family="Times", style="", fname="fonts/times new roman.ttf", uni=True)
    pdf.add_font(family="Times", style="B", fname="fonts/times new roman bold.ttf", uni=True)
    pdf.add_font(family="Times", style="I", fname="fonts/times new roman italic.ttf", uni=True)
    pdf.add_font(family="Times", style="BI", fname="fonts/times new roman bold italic.ttf", uni=True)
    
    #Set margins to 1 inch
    pdf.set_margins(left= 1, top=1, right=1)
    
    #Add the title and author name
    pdf.set_title(title="Summary of the Book Crime & Punishment")
    pdf.set_author(author="Umair")
    
    #Add a page
    pdf.add_page()

    #Set the style to Bold & Italic, and size to 20 for the first page
    pdf.set_font(family="Times",
                 style='BI',  # 'I' --> italics, 'U' --> underlined, '' --> regular font
                 size=20
                 )
    
    #Add the text "Summary of the Book Crime & Punishment" on the first page
    pdf.cell(w=6.25, h=1, txt = "Summary", ln = 1, align = 'C')
    pdf.cell(w=6.25, h=1, txt = "of the Book", ln = 1, align = 'C')
    pdf.cell(w=6.25, h=1, txt = "Crime & Punishment", ln = 1, align = 'C')
    
    #Add a new page
    pdf.add_page()

    #Set the style to normal and font size to 12
    pdf.set_font(family="Times",
                 style='',  # 'I' --> italics, 'U' --> underlined, '' --> regular font
                 size=12
                 )
    
    line = ""
    next_line = random.randrange(start=50, stop=200)
    para_length = 0
    #Abbreviations that end with a period but do not end a sentence
    period_words = ["Mrs.", "Mr.", "Ms.", "Dr.", "Prof.", "Capt.", "Gen.", "Sen.", "Rev.", "Hon.", "St.",
                    "Jr.", "Sr.", "i.e.", "e.g.", "etc.", "a.m.", "p.m."]
    for count, word in enumerate(final_summary.split(" ")):
        para_length = para_length + 1
        if para_length > next_line:
            if (word.endswith(".")) and (word not in period_words) and (len(word) > 2):
                #insert the line in pdf
                pdf.cell(w=6.25, h=0.285, txt = line + " "+ word, ln = 1, align = '')
                #insert a line-break
                pdf.cell(w=0, h=0.285, txt = "", ln = 1, align = '')
                line = ""
                next_line = random.randrange(start=50, stop=200)
                para_length = 0
                continue
        if len(line) + len(word) <= 87:
            line = line + " " + word
        else:
            #insert the line in pdf
            pdf.cell(w=6.25, h=0.285, txt = line, ln = 1, align = '')
            line = word
    
    #Write the last line
    pdf.cell(w=6.25, h=0.285, txt = line, ln = 1, align = '')
    
    #Save the PDF as summary.pdf in the current working directory
    pdf.output("summary.pdf")

In [20]:
#Convert the summary (a string variable) into a PDF document by using create_pdf function
create_pdf(summary)
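The manual 87-character line wrapping inside `create_pdf` could also be delegated to Python's standard `textwrap` module. The sketch below is an alternative, not what the function above does, and `wrap_paragraphs` is a hypothetical helper name. It assumes paragraphs are separated by blank lines.

```python
import textwrap

def wrap_paragraphs(text, width=87):
    """Split text on blank lines and wrap each paragraph to at most `width` characters."""
    wrapped = []
    for paragraph in text.split("\n\n"):
        # textwrap.wrap returns a list of lines, none longer than `width`
        wrapped.append(textwrap.wrap(paragraph, width=width))
    return wrapped

lines = wrap_paragraphs("one two three four\n\nfive six", width=9)
print(lines)
# [['one two', 'three', 'four'], ['five six']]
```

Each inner list could then be written with one `pdf.cell` call per line, followed by an empty cell between paragraphs. FPDF's own `multi_cell` method is another option for automatic wrapping.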

## Conclusion

By combining LangChain, ChatGPT, and FPDF, our project delivers a powerful and efficient book summarization system that can be utilized across a wide range of domains, from education to research and beyond.