<a href="https://colab.research.google.com/github/sudarshan-koirala/youtube-stuffs/blob/main/PDFSummarizer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# PDF Summarizer in a few lines of code using Gradio, OpenAI and LangChain

## Install necessary packages

[LangChain documentation](https://docs.langchain.com/docs/)

In [25]:
!pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu


Looking in indexes: https://download.pytorch.org/whl/cpu


In [26]:
!pip install -q gradio openai pypdf tiktoken langchain transformers

In [None]:
#with open('env_vars.json', 'r') as f:
#    env_vars = json.load(f)
#openai.api_key = env_vars["OPENAI_API_KEY"]

In [27]:
import os

# Fill in your own API tokens before running the cells below
os.environ["HUGGINGFACEHUB_API_TOKEN"] = ""
os.environ["OPENAI_API_KEY"] = ""

In [28]:
# https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb
import tiktoken

def num_tokens_from_string(string: str, encoding_name: str) -> int:
    """Returns the number of tokens in a text string."""
    encoding = tiktoken.get_encoding(encoding_name)
    tokens = encoding.encode(string)  # encode once and reuse the result
    print(tokens)  # show the token ids for illustration
    return len(tokens)

num_tokens_from_string("tiktoken is great!", "cl100k_base")

[83, 1609, 5963, 374, 2294, 0]


6
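Token ids like the ones printed above can be grouped into fixed-size chunks when a text is too long for a single model call. The following is only a minimal sketch of that idea: `chunk_tokens` and the chunk size of 4 are hypothetical names and values chosen for illustration, not part of tiktoken or LangChain.

```python
from typing import List


def chunk_tokens(token_ids: List[int], chunk_size: int) -> List[List[int]]:
    """Group token ids into consecutive chunks of at most chunk_size ids.

    Hypothetical helper for illustration only.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [
        token_ids[i:i + chunk_size]
        for i in range(0, len(token_ids), chunk_size)
    ]


# Token ids taken from the output of num_tokens_from_string above
token_ids = [83, 1609, 5963, 374, 2294, 0]
chunks = chunk_tokens(token_ids, chunk_size=4)
print(chunks)
```

With tiktoken, each chunk could then be turned back into text with `encoding.decode(chunk)`. Note that cutting at arbitrary token positions may split a word or sentence in the middle.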

In [29]:
import gradio as gr
from langchain import OpenAI, PromptTemplate, HuggingFaceHub
from langchain.chains.summarize import load_summarize_chain
from langchain.document_loaders import PyPDFLoader

# Alternative: use an OpenAI model instead of the Hugging Face Hub model
# llm = OpenAI(temperature=0)
hub_llm = HuggingFaceHub(
    repo_id="google/flan-t5-large",
    model_kwargs={"temperature": 0.9, "max_length": 64},
)

In [None]:
#PyPDFLoader??

## LangChain part
#### Function that takes a PDF file path as input and returns a summary of that PDF
- LangChain's `PyPDFLoader` loads the PDF.
- `load_and_split` then splits the document into smaller chunks.
- `load_summarize_chain` creates the summarization chain, here with `chain_type="map_reduce"`.
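The control flow behind a map-reduce summary can be sketched in plain Python: summarize each chunk on its own (map), then summarize the combined partial summaries (reduce). This is a simplified illustration, not LangChain's implementation. The names `split_into_chunks`, `map_reduce_summarize` and `first_sentence` are hypothetical, and `first_sentence` is a stand-in for a real model call.

```python
from typing import Callable, List


def split_into_chunks(text: str, max_words: int) -> List[str]:
    """Split text into chunks of at most max_words whitespace-separated words."""
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


def map_reduce_summarize(
    text: str, summarize: Callable[[str], str], max_words: int = 50
) -> str:
    """Summarize each chunk (map), then summarize the joined results (reduce)."""
    chunks = split_into_chunks(text, max_words)
    partial_summaries = [summarize(chunk) for chunk in chunks]  # map step
    return summarize(" ".join(partial_summaries))  # reduce step


def first_sentence(text: str) -> str:
    """Stand-in for a model call (assumption): keep only the first sentence."""
    return text.split(".")[0].strip() + "."


text = "Alpha is first. Beta is second. Gamma is third. Delta is fourth."
result = map_reduce_summarize(text, first_sentence, max_words=6)
print(result)
```

In the notebook, the role of `first_sentence` is played by the language model, and the chunks come from the PDF loader rather than from a word count.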

In [30]:
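def summarize_pdf(pdf_file_path):
    """Return a summary of the PDF at pdf_file_path."""
    # Load the PDF and split it into chunks
    loader = PyPDFLoader(pdf_file_path)
    docs = loader.load_and_split()
    # Summarize each chunk, then combine the partial summaries
    chain = load_summarize_chain(hub_llm, chain_type="map_reduce")
    summary = chain.run(docs)
    return summary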

In [31]:
summary = summarize_pdf("data/gpt4all.pdf")
summary


'The GPT4All ecosystem includes a number of language models that are available for use in the GPT4All-Snoozy project.'

In [23]:
# Inspect what the loader returns: the number of chunks and the first chunk
loader = PyPDFLoader('data/gpt4all.pdf')
docs = loader.load_and_split()
print(len(docs))
docs[0]

9


Document(metadata={'source': 'data/gpt4all.pdf', 'page': 0}, page_content='GPT4All: An Ecosystem of Open Source Compressed Language Models\nYuvanesh Anand\nNomic AI\nyuvanesh@nomic.ai\nZach Nussbaum\nNomic AI\nzach@nomic.ai\nAdam Treat\nNomic AI\nadam@nomic.ai\nAaron Miller\nNomic AI\naaron@nomic.ai\nRichard Guo\nNomic AI\nrichard@nomic.ai\nBen Schmidt\nNomic AI\nben@nomic.ai\nGPT4All Community\nPlanet Earth\nBrandon Duderstadt∗\nNomic AI\nbrandon@nomic.ai\nAndriy Mulyar∗\nNomic AI\nandriy@nomic.ai\nAbstract\nLarge language models (LLMs) have recently\nachieved human-level performance on a range\nof professional and academic benchmarks. The\naccessibility of these models has lagged behind\ntheir performance. State-of-the-art LLMs re-\nquire costly infrastructure; are only accessible\nvia rate-limited, geo-locked, and censored web\ninterfaces; and lack publicly available code and\ntechnical reports.\nIn this paper, we tell the story of GPT4All, a\npopular open source repository that aim

## Create a simple Gradio UI (if you prefer a UI)

In [None]:

input_pdf_path = gr.components.Textbox(label="Provide the PDF file path")
output_summary = gr.components.Textbox(label="Summary")

# Build the interface first so that `interface` refers to the Interface object
interface = gr.Interface(
    fn=summarize_pdf,
    inputs=input_pdf_path,
    outputs=output_summary,
    title="PDF Summarizer",
    description="Provide the PDF file path to get the summary.",
)
# share=True also creates a temporary public link
interface.launch(share=True)