In [1]:
import fitz
from unidecode import unidecode

doc = fitz.open("test2.pdf")
toc = doc.get_toc(simple=True)

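# Each TOC entry is [level, title, page]; page numbers are 1-based.
for level, title, page in toc:
    # Character offset of the heading within the page text (-1 if not found).
    print(unidecode(doc[page - 1].get_text()).find(title))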

1516
2358
943
1200
2190
287
2412
2428
2276
1004
2987


In [2]:
toc

[[1, 'Introduction', 1],
 [1, 'Related Work', 2],
 [1, 'Framework', 3],
 [2, 'Unsupervised pre-training', 3],
 [2, 'Supervised fine-tuning', 3],
 [2, 'Task-specific input transformations', 4],
 [1, 'Experiments', 4],
 [2, 'Setup', 4],
 [2, 'Supervised fine-tuning', 5],
 [1, 'Analysis', 7],
 [1, 'Conclusion', 8]]
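
The TOC above gives only the page where each heading starts. A minimal sketch of turning it into page ranges follows, assuming a section ends on the page where the next heading begins. The helper name `section_page_spans` and the `last_page=8` value are hypothetical; in the notebook the page count would come from `doc.page_count`.

```python
from typing import List, Tuple


def section_page_spans(toc: List[list], last_page: int) -> List[Tuple[str, int, int]]:
    """Return (title, first_page, last_page) per TOC entry, 1-based and inclusive."""
    spans = []
    for i, (_level, title, page) in enumerate(toc):
        # A section ends on the page where the next heading starts,
        # or on the document's last page for the final entry.
        end = toc[i + 1][2] if i + 1 < len(toc) else last_page
        spans.append((title, page, end))
    return spans


# A few entries copied from the TOC output above.
toc = [
    [1, "Introduction", 1],
    [1, "Related Work", 2],
    [2, "Supervised fine-tuning", 5],
    [1, "Analysis", 7],
]
for span in section_page_spans(toc, last_page=8):  # last_page is an assumed value
    print(span)
```

Note that 'Supervised fine-tuning' spans pages 5 to 7, so any extraction that reads only the first and last page of a section would skip page 6.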

In [4]:
# The section runs from its heading on page 1 to the next heading on page 2.
page1 = unidecode(doc[0].get_text())
page2 = unidecode(doc[1].get_text())
idx_start = page1.find("Introduction")
idx_end = page2.find("Related Work")

content = page1[idx_start:] + page2[:idx_end]
print(content)

Introduction
The ability to learn effectively from raw text is crucial to alleviating the dependence on supervised
learning in natural language processing (NLP). Most deep learning methods require substantial
amounts of manually labeled data, which restricts their applicability in many domains that suffer
from a dearth of annotated resources [61]. In these situations, models that can leverage linguistic
information from unlabeled data provide a valuable alternative to gathering more annotation, which
can be time-consuming and expensive. Further, even in cases where considerable supervision
is available, learning good representations in an unsupervised fashion can provide a significant
performance boost. The most compelling evidence for this so far has been the extensive use of pre-
trained word embeddings [10, 39, 42] to improve performance on a range of NLP tasks [8, 11, 26, 45].
Leveraging more than word-level information from unlabeled text, however, is challenging for two
main reas

In [12]:
import fitz
from unidecode import unidecode
import os
import requests
from dotenv import load_dotenv, find_dotenv

load_dotenv(find_dotenv())


class DocumentSummarizer:

    API_TOKEN = os.getenv("HUGGINGFACE_API_TOKEN")

    # API_URL = "https://api-inference.huggingface.co/models/facebook/bart-large-cnn"
    API_URL = "https://api-inference.huggingface.co/models/Falconsai/text_summarization"
    headers = {"Authorization": f"Bearer {API_TOKEN}"}

    def __init__(self, path: str) -> None:
        self.doc = fitz.open(path)
        self.toc = self.doc.get_toc(simple=True)
    
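    def _summarize(self, payload: str) -> str:
        # The Inference API's documented request format puts the text under "inputs".
        response = requests.post(
            self.API_URL, headers=self.headers, json={"inputs": payload}, timeout=60
        )
        response.raise_for_status()  # surface HTTP errors instead of a KeyError below
        return response.json()[0]["summary_text"]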
    
    def _page_text(self, page_no: int) -> str:
        # TOC page numbers are 1-based; fitz pages are 0-based.
        return unidecode(self.doc[page_no - 1].get_text())

    def summarize(self) -> str:
        sections = []  # (title, summary) pairs; a list keeps duplicate titles
        for i, (_, title, page_no) in enumerate(self.toc):
            text = self._page_text(page_no)
            start_idx = text.find(title)
            # Fall back to the top of the page if the heading is not found.
            start = start_idx + len(title) if start_idx != -1 else 0

            if i + 1 == len(self.toc):
                # Last entry: take the rest of the page the section starts on.
                content_text = text[start:]
            else:
                _, next_title, next_page = self.toc[i + 1]
                if next_page == page_no:
                    # Search after the current heading so the end never precedes the start.
                    end_idx = text.find(next_title, start)
                    content_text = text[start:] if end_idx == -1 else text[start:end_idx]
                else:
                    # Include any whole pages between the two headings.
                    middle = "".join(self._page_text(p) for p in range(page_no + 1, next_page))
                    last = self._page_text(next_page)
                    end_idx = last.find(next_title)
                    content_text = text[start:] + middle + (last[:end_idx] if end_idx != -1 else "")

            content_text = content_text.replace("\n", " ")
            summary = ""
            # Summarize in 512-character chunks, including the final short one.
            for pos in range(0, len(content_text), 512):
                chunk = content_text[pos:pos + 512]
                if chunk.strip():
                    summary += self._summarize(chunk)
            sections.append((title, summary))

        return "\n\n".join(f"{title}\n{summary}" for title, summary in sections)
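
The summarizer sends text to the API in 512-character slices. A minimal sketch of a standalone chunking helper follows, written so that the final slice shorter than 512 characters is also produced. The name `chunk_text` is hypothetical, and splitting on raw character counts (which can cut words in half) is a simplifying assumption.

```python
from typing import Iterator


def chunk_text(text: str, size: int = 512) -> Iterator[str]:
    """Yield consecutive slices of at most `size` characters, including the last short one."""
    for pos in range(0, len(text), size):
        yield text[pos:pos + size]


sample = "x" * 1100
print([len(chunk) for chunk in chunk_text(sample)])  # [512, 512, 76]
```

A loop of the form `while len(text) > 512` would stop before the 76-character tail, and would produce nothing at all for a section shorter than 512 characters.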

In [13]:
ds = DocumentSummarizer(path="test2.pdf")
res = ds.summarize()

In [15]:
print(res)

Introduction
the ability to learn effectively from raw text is crucial to alleviating the dependence on supervised learning in natural language processing (NLP) deep learning methods require substantial amounts of manually labeled data, which restricts their applicability in many domains that suffer from a dearth of annotated resources .the most compelling evidence for this so far has been the extensive use of pre-trained word embeddings [10, 39, 42] to improve performance on a range of NLP tasks [8, 11, 26, 45].optimization objectives are most effective at learning text representations that are useful for transfer . Recent research has looked at various objectives such as language modeling [44], machine translation [38], and discourse coherence [22] .l architecture [43, 44], using intricate learning schemes [21] and adding auxiliary learning objectives [50]. These uncertainties have made it difficult to develop effective semi-supervised learning approaches for language processing task

In [17]:
ds.toc

[[1, 'Introduction', 1],
 [1, 'Related Work', 2],
 [1, 'Framework', 3],
 [2, 'Unsupervised pre-training', 3],
 [2, 'Supervised fine-tuning', 3],
 [2, 'Task-specific input transformations', 4],
 [1, 'Experiments', 4],
 [2, 'Setup', 4],
 [2, 'Supervised fine-tuning', 5],
 [1, 'Analysis', 7],
 [1, 'Conclusion', 8]]