[Summarization System](https://www.youtube.com/watch?v=LNq_2s_H01Y&t=204s)<br>
Summarization exercise based on Sam Witteveen's tutorial

In [8]:
from dotenv import load_dotenv
import os
load_dotenv()

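# load_dotenv() already exports OPENAI_API_KEY from .env; fail early if it is missing
if not os.getenv("OPENAI_API_KEY"):
    raise RuntimeError("OPENAI_API_KEY is not set; add it to your .env file")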

In [9]:
from langchain_openai import OpenAI
from langchain.text_splitter import CharacterTextSplitter
from langchain.prompts import PromptTemplate
from langchain.document_loaders import PyPDFLoader

llm = OpenAI(temperature=0)

In [23]:
text_splitter = CharacterTextSplitter()
loader = PyPDFLoader("how-to-win-friends-and-influence-people.pdf")
docs = loader.load_and_split(text_splitter=text_splitter)

In [32]:
docs[5:9]

[Document(page_content='How  This Book Was Written—And Why  \n \n \nDuring the first thirty-five years of t he twentieth cent ury, the publ ishing houses of Am erica pri nted \nmore than a fifth of a m illion different  books. Most of them  were deadly dull,  and m any were financial failures. \n“Many,” did I say ? The presi dent of one of t he largest  publ ishing houses i n the worl d confessed t o me that his \ncompany, after seventy-five years of  publishing experience, still lost m oney on seven out of every eight books it \npublished. \n \nWhy, then, did I have the tem erity to write anot her book?  And, after I had written it, why should you \nbother to read i t? \n \nFair quest ions, bot h; and I’l l try to answer t hem. \n \nI have, si nce 1912, been conduct ing educat ional cour ses for busi ness and professi onal men and wom en \nin New York. At  first, I conduct ed courses i n publ ic speaking onl y—courses desi gned t o train adul ts, by actual \nexperience, to think on th

### Map-reduce Summarization

Summarizing a large document is limited by the model's context window.<br>This method involves **an initial prompt on each chunk of data** (for summarization tasks, this could be a summary of that chunk; for question-answering tasks, it could be an answer based solely on that chunk). **Then a different prompt is run to combine all the initial outputs.** This is implemented in LangChain as the MapReduceDocumentsChain.
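
The two-step flow described above can be sketched without any LLM at all. In this minimal sketch, `map_step` and `reduce_step` are hypothetical stand-ins for the two prompts (they are not LangChain functions): the first keeps only the first sentence of a chunk, the second simply joins the partial results.

```python
from concurrent.futures import ThreadPoolExecutor


def map_step(chunk: str) -> str:
    """Hypothetical stand-in for the per-chunk LLM call: keep the first sentence."""
    return chunk.split(".")[0].strip() + "."


def reduce_step(partial_summaries: list[str]) -> str:
    """Hypothetical stand-in for the combining LLM call: join the partial summaries."""
    return " ".join(partial_summaries)


def map_reduce_summary(chunks: list[str]) -> str:
    # Map: each chunk is processed independently, so the calls can run in parallel.
    with ThreadPoolExecutor() as pool:
        partial_summaries = list(pool.map(map_step, chunks))
    # Reduce: a second, different step combines the per-chunk outputs.
    return reduce_step(partial_summaries)


chunks = [
    "Chunk one talks about publishing. It has more detail here.",
    "Chunk two talks about courses. It also has more detail.",
]
result = map_reduce_summary(chunks)
print(result)
```

With a real model, both steps would be prompts sent to the LLM; the structure of the calls stays the same.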

**Pros:** Can scale to larger documents (and more documents) than StuffDocumentsChain. The calls to the LLM on individual documents are independent and can therefore be parallelized.

**Cons:** Requires many more calls to the LLM than StuffDocumentsChain. Loses some information during the final combining call. If an important piece of information is split across chunks 2 and 3, so that neither slice looks important on its own, map-reduce might drop it altogether.
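
One common way to soften the chunk-boundary problem is to let neighbouring chunks overlap, so that text near a boundary appears whole in at least one chunk. LangChain's text splitters accept a `chunk_overlap` argument for this purpose; the helper below is only a plain-Python sketch of the idea, and `split_with_overlap` is a hypothetical name, not a library function.

```python
def split_with_overlap(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into character chunks where consecutive chunks share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    # Stop once the remaining text is already covered by the previous chunk.
    last_start = max(len(text) - overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]


text = "abcdefghij"
chunks = split_with_overlap(text, chunk_size=4, overlap=2)
print(chunks)
```

Here every pair of adjacent chunks shares two characters, so a short phrase sitting on a boundary still shows up intact in one of them. The trade-off is more chunks and therefore more LLM calls.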

In [36]:
from langchain.chains.summarize import load_summarize_chain
import textwrap

chain = load_summarize_chain(llm, 
                             chain_type="map_reduce")

output_summary = chain.run(docs[5:9])
wrapped_text = textwrap.fill(output_summary, width=100)
print(wrapped_text)

  The author wrote this book to address the need for training in public speaking and getting along
with people. He discusses the importance of human engineering and how it can lead to financial
success. The book shares proven principles that have transformed the lives of many, including
successful salespeople, executives, and spouses. It provides nine suggestions for getting the most
out of the book, emphasizing the importance of applying the principles in daily life. The book
should be treated as a working handbook on human relations and referred to often.


Incidentally, __load_summarize_chain__ has two prompt templates built into it.
- The first one uses the summarization prompt on each of the document chunks to generate individual summaries.
- The second one uses a similar prompt to generate a summary of the summaries.

Both of these prompts are shown below.

In [38]:
# for summarizing each part
chain.llm_chain.prompt.template

'Write a concise summary of the following:\n\n\n"{text}"\n\n\nCONCISE SUMMARY:'

In [39]:
# for combining the parts
chain.combine_document_chain.llm_chain.prompt.template

'Write a concise summary of the following:\n\n\n"{text}"\n\n\nCONCISE SUMMARY:'
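
To make the role of the `{text}` placeholder concrete, here is a minimal sketch that fills the template shown above using plain `str.format`. This only approximates what the chain does internally for each chunk; `chunk_text` is an illustrative value, not taken from the chain.

```python
# The default template printed above, copied verbatim.
map_template = 'Write a concise summary of the following:\n\n\n"{text}"\n\n\nCONCISE SUMMARY:'

# Illustrative chunk text (an assumption for this sketch).
chunk_text = "I have, since 1912, been conducting educational courses."

# The placeholder is replaced by the chunk, giving the prompt sent to the model.
formatted = map_template.format(text=chunk_text)
print(formatted)
```

In the map step the placeholder receives one document chunk; in the combine step it receives the concatenated chunk summaries.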

We can repeat the above exercise with __verbose=True__ to see what is happening under the hood.

In [42]:
chain = load_summarize_chain(llm, 
                             chain_type="map_reduce",
                             verbose=True
                             )


output_summary = chain.run(docs[5:9])
wrapped_text = textwrap.fill(output_summary, 
                             width=100,
                             break_long_words=False,
                             replace_whitespace=False)
print(wrapped_text)



> Entering new MapReduceDocumentsChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
Write a concise summary of the following:


"How  This Book Was Written—And Why  
 
 
During the first thirty-five years of t he twentieth cent ury, the publ ishing houses of Am erica pri nted 
more than a fifth of a m illion different  books. Most of them  were deadly dull,  and m any were financial failures. 
“Many,” did I say ? The presi dent of one of t he largest  publ ishing houses i n the worl d confessed t o me that his 
company, after seventy-five years of  publishing experience, still lost m oney on seven out of every eight books it 
published. 
 
Why, then, did I have the tem erity to write anot her book?  And, after I had written it, why should you 
bother to read i t? 
 
Fair quest ions, bot h; and I’l l try to answer t hem. 
 
I have, si nce 1912, been conduct ing educat ional cour ses for busi ness and professi onal men and wom en 
in

A slight variation on the above is tested below. We try a custom prompt and also retrieve the summary of each individual chunk. To do this, we set __return_intermediate_steps__ to True when we instantiate the chain.

In [57]:
prompt_template = """Write a concise summary of the following:


{text}


CONCISE SUMMARY IN BULLET POINTS:"""

bullet_prompt = PromptTemplate(template=prompt_template, input_variables=["text"])
chain = load_summarize_chain(llm,
                             chain_type ="map_reduce",
                             return_intermediate_steps=True,
                             map_prompt = bullet_prompt,
                             combine_prompt = bullet_prompt)

output_summary = chain({"input_documents":docs[5:9]}, return_only_outputs=True)
wrapped_text = textwrap.fill(output_summary['output_text'],
                                             width = 100,
                                             break_long_words=False,
                                             replace_whitespace=False)

print(wrapped_text)                        





 

- Author conducted educational courses for adults in New York since 1912, focusing on public
speaking and getting along with people in business and social contacts.
- Research shows that 15% of
financial success is due to technical knowledge and 85% is due to skill in human engineering and
personality.
- John D. Rockefeller believed that the ability to deal with people is a valuable
commodity.
- A survey revealed that adults are interested in studying subjects related to health and
understanding human relations.
- The author wrote "How to Win Friends and Influence People" after
conducting extensive research and interviews with successful people.
- The book offers practical
principles that have been proven to work like magic.
- Examples of success include increased sales,
promotions, and happier homes.
- The book aims to help readers discover and utilize their dormant
assets and handle life's situations.
- Nine suggestions are given on how to get the most out of the
book, including r

We can access the summary of any intermediate chunk like this:

In [59]:
wrapped_text = textwrap.fill(output_summary['intermediate_steps'][1], 
                             width=100,
                             break_long_words=False,
                             replace_whitespace=False)
print(wrapped_text)

 

- A committee conducted a survey and found that adults are interested in understanding and
getting along with people, making people like them, and winning others to their way of thinking.
-
They searched for a practical textbook on the subject but could not find one.
- The author, who had
been searching for a practical handbook on human relations, decided to write one for use in his own
courses.
- He read extensively on the subject and hired a researcher to spend one and a half years
in various libraries.
- He also personally interviewed successful people to discover their
techniques in human relations.
- The book, "How to Win Friends and Influence People," grew out of
the author's lectures and experiences with thousands of adults.
- The principles discussed in the
book are not mere theories, but have been proven to work like magic.
- An example is given of an
employer who, after applying the principles, saw a positive change in his organization and gained
more profit, leisure, and 
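
The bullets in the outputs above break mid-line, most likely because `textwrap.fill` with `replace_whitespace=False` keeps the embedded newlines but still measures the whole string as one paragraph. A minimal sketch of one alternative is to wrap each line separately; `wrap_bullets` is a hypothetical helper and `sample` is made-up text, not model output.

```python
import textwrap


def wrap_bullets(text: str, width: int = 100) -> str:
    """Wrap each line on its own so every bullet starts on a fresh line."""
    wrapped_lines = []
    for line in text.strip().splitlines():
        wrapped_lines.append(
            textwrap.fill(
                line,
                width=width,
                subsequent_indent="  ",  # indent continuation lines under the bullet
                break_long_words=False,
            )
        )
    return "\n".join(wrapped_lines)


sample = (
    "- First bullet that is long enough to need wrapping onto a second line.\n"
    "- Second bullet."
)
wrapped = wrap_bullets(sample, width=40)
print(wrapped)
```

The same helper could be applied to `output_summary['output_text']` or to any entry of `output_summary['intermediate_steps']`.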

### Stuff Summarization

Stuffing is the simplest method, whereby you simply stuff all the related data into the prompt as context to pass to the language model. This is implemented in LangChain as the StuffDocumentsChain.

**Pros:** Only makes a single call to the LLM. When generating text, the LLM has access to all the data at once.

**Cons:** Most LLMs have a limited context length, and for large documents (or many documents) this will not work as it will result in a prompt larger than the context length.

The main downside of this method is that **it only works on smaller pieces of data.** Once you are working with many pieces of data, this approach is no longer feasible. Approaches such as map-reduce, described above, are designed to help deal with that.
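
As a rough sketch of the idea, stuffing amounts to concatenating every chunk into one prompt and checking that the result still fits. The helper `stuff_prompt` and its `max_chars` budget are assumptions made for illustration; a character count is only a crude proxy for the model's token limit.

```python
def stuff_prompt(chunks: list[str], max_chars: int) -> str:
    """Build a single prompt from all chunks; refuse if it exceeds the budget."""
    context = "\n\n".join(chunks)
    prompt = (
        "Write a concise summary of the following:\n\n"
        f"{context}\n\n"
        "CONCISE SUMMARY:"
    )
    # max_chars is a hypothetical stand-in for the real context-length limit.
    if len(prompt) > max_chars:
        raise ValueError(
            f"Prompt is {len(prompt)} characters, over the budget of {max_chars}"
        )
    return prompt


chunks = ["This is the first chunk.", "This is the second chunk."]
prompt = stuff_prompt(chunks, max_chars=500)
print(prompt)
```

When the check fails, the document is too large for a single call and a chunk-wise method is needed instead.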



In [54]:
chain = load_summarize_chain(llm,chain_type="stuff")
prompt_template = """Write a concise bullet point summary of the following:

{text}

CONCISE SUMMARY IN BULLET POINTS:"""

bullet_prompt = PromptTemplate(template=prompt_template, input_variables=["text"])

In [55]:
chain = load_summarize_chain(llm,
                             chain_type="stuff",
                             prompt=bullet_prompt)

output_summary = chain.run(docs[5:6])
wrapped_text = textwrap.fill(output_summary, 
                             width=100,
                             break_long_words=False,
                             replace_whitespace=False)
print(wrapped_text)

 
- Author conducted educational courses for adults in New York since 1912 
- Realized adults needed
training in public speaking and getting along with people 
- Dealing with people is a big problem,
even in technical fields 
- Research shows 15% of financial success is due to technical knowledge,
85% is due to human engineering and personality 
- Highest-paid personnel in engineering are not
necessarily those with the most technical knowledge 
- John D. Rockefeller valued the ability to
deal with people and was willing to pay for it 
- Colleges do not offer practical courses on
developing this ability 
- University of Chicago and United Y.M.C.A. Schools conducted a survey to
determine what adults want to study 
- Survey revealed that health is the prime concern for adults


In [52]:
docs[5]

Document(page_content='How  This Book Was Written—And Why  \n \n \nDuring the first thirty-five years of t he twentieth cent ury, the publ ishing houses of Am erica pri nted \nmore than a fifth of a m illion different  books. Most of them  were deadly dull,  and m any were financial failures. \n“Many,” did I say ? The presi dent of one of t he largest  publ ishing houses i n the worl d confessed t o me that his \ncompany, after seventy-five years of  publishing experience, still lost m oney on seven out of every eight books it \npublished. \n \nWhy, then, did I have the tem erity to write anot her book?  And, after I had written it, why should you \nbother to read i t? \n \nFair quest ions, bot h; and I’l l try to answer t hem. \n \nI have, si nce 1912, been conduct ing educat ional cour ses for busi ness and professi onal men and wom en \nin New York. At  first, I conduct ed courses i n publ ic speaking onl y—courses desi gned t o train adul ts, by actual \nexperience, to think on the