# Summarizing research papers

This notebook explores text summarization: specifically, downloading arXiv papers in PDF format and summarizing them.

In [None]:
# First install the required packages (transformers and torch are needed by the imports below)
!pip install openai wget pdfplumber transformers torch

In [24]:
import numpy as np
import pathlib
import pdfplumber
import torch
import wget

from os.path import exists
from transformers import pipeline, logging
from urllib.parse import urlparse

# Suppress some unnecessary warnings
logging.set_verbosity_error()

## T5

Let's create a class that downloads a PDF file from arXiv and converts its pages to text, ready for summarization.

In [11]:
# PDF file handler for downloading papers from arXiv
class PDFFile:
    pages = []

    def __init__(self, url: str, file_name: str):
        # Validate URL
        if "arxiv.org" not in url:
            raise Exception(f"The URL is not for arxiv.org: {url}")
        # Is this a URL to the paper or to the PDF?
        r = urlparse(url)
        if "/abs/" in url:
            # URL to the paper - not the PDF. Create PDF url
            url = url.replace('/abs/', '/pdf/')
        elif r.path.endswith('pdf') or "/pdf/" in url:
            # This is a URL to the PDF
            pass
        else:
            raise Exception(f"The URL is for neither a paper nor a PDF: {url}")
        self.url = url
        self.file_name = file_name
        if not exists(file_name):
            print(f'Will download: {url}')
            self.get_file()
        else:
            self.path = file_name
        self.get_pages()

    def get_file(self):
        # Get file
        file = wget.download(self.url, self.file_name)
        self.path = pathlib.Path(file)

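    def get_pages(self):
        # Start from a fresh per-instance list: the class-level `pages` list
        # is shared, so appending to it would mix pages across PDFFile objects
        self.pages = []
        # Open the PDF and close it again once the text has been extracted
        with pdfplumber.open(self.path) as pdf:
            for p in pdf.pages:
                # Fall back to an empty string if a page yields no text
                self.pages.append(p.extract_text() or '')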

    def page_count(self) -> int:
        return len(self.pages)
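
The URL handling in the constructor can also be written as a small standalone function, which makes it easy to check without downloading anything. The following is a minimal sketch, not part of the class above: `normalize_arxiv_url` is a hypothetical helper name, and it assumes that checking the hostname and the start of the path (rather than searching the whole URL string) is acceptable.

```python
from urllib.parse import urlparse


def normalize_arxiv_url(url: str) -> str:
    """Return the PDF URL for an arXiv abstract or PDF link (hypothetical helper)."""
    parsed = urlparse(url)
    host = parsed.netloc.lower()
    # Accept arxiv.org itself and its subdomains (e.g. www.arxiv.org)
    if host != "arxiv.org" and not host.endswith(".arxiv.org"):
        raise ValueError(f"The URL is not for arxiv.org: {url}")
    if parsed.path.startswith("/abs/"):
        # Abstract page: swap the path prefix so the URL points at the PDF
        pdf_path = "/pdf/" + parsed.path[len("/abs/"):]
        return parsed._replace(path=pdf_path).geturl()
    if parsed.path.startswith("/pdf/") or parsed.path.endswith(".pdf"):
        # Already a link to the PDF
        return url
    raise ValueError(f"The URL is for neither a paper nor a PDF: {url}")


print(normalize_arxiv_url("https://arxiv.org/abs/2210.11427"))
# prints https://arxiv.org/pdf/2210.11427
```

Raising `ValueError` instead of a bare `Exception` is a design choice in this sketch; it lets callers catch bad URLs specifically.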

In [16]:
# Create the pipeline
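# torch.backends.mps.is_available() replaces the deprecated torch.has_mps attribute
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("mps") if torch.backends.mps.is_available() else torch.device("cpu")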
non_mps = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
# MPS devices currently don't appear to work correctly with this pipeline - so going with non-MPS
summarizer = pipeline("summarization", "pszemraj/long-t5-tglobal-base-16384-book-summary", device=non_mps)

In [25]:
# Helper function to get a summary for a range of pages from a PDF file
def get_summary(pdf: PDFFile, start: int = 0, end: int = -1):
    # If end is -1, we go to the end of the list
    if end == -1:
        end = len(pdf.pages)
    # Join the selected pages into one block of text, one newline after each page
    txt = ''.join(p + '\n' for p in pdf.pages[start:end])
    summ = summarizer(txt)[0]['summary_text']
    return summ

In [26]:
pdf = PDFFile('https://arxiv.org/abs/2210.11427', 'diffedit.pdf')
results = get_summary(pdf, 0, 8)
results



'In this paper, the author demonstrates an efficient and cost-effective method for generating convincing images by using a combination of image-editing methods. He uses a mixture of diffusion-based and text-conditioned modeling approaches to develop a new kind of image editor that can be used to create accurate edits in a wide variety of subject matter.'

The above summarizes all 8 pages in one go, so what you get is just an overall summary. What if we wanted something a bit more detailed for each page? We can pass each page separately to the summarizer and see if a single summary per page makes things clearer.

Let's modify the `get_summary` function to support summarizing either a range of pages at once or a page at a time ...

In [32]:
# Updated helper function to get a summary for a range of pages from a PDF file - either a page at a time,
# or for a range of pages summarized at once
def get_summary(pdf: PDFFile, start: int = 0, end: int = -1, each_page: bool = True):
    # If end is -1, we go to the end of the list
    if end == -1:
        end = len(pdf.pages)
    pages = pdf.pages[start:end]
    if each_page:
        # Summarize each page separately, with a blank line after each summary
        return ''.join(summarizer(p)[0]['summary_text'] + '\n\n' for p in pages)
    # Otherwise summarize all the selected pages as one block of text
    txt = ''.join(p + '\n' for p in pages)
    return summarizer(txt)[0]['summary_text']

In [33]:
pdf = PDFFile('https://arxiv.org/abs/2210.11427', 'diffedit.pdf')
results = get_summary(pdf, 0, 8)
print(results)



In this paper, the author discusses the use of masked image-editing to create convincing images for text. The author uses a combination of mask-based image generation and canbe-viewed images to create realistic, eloquent effects.

In this paper, we present a new approach to image-editing. We focus on the use of machine learning to predict where images should be edited and how they should be manipulated. This paper also presents an extended discussion of image-retrieval, which involves using machine learning approaches to learn how to manipulate images in real time.

The first part of the paper discusses some of the most important methods for transforming images into text. For example, in the Diffusion-Framework approach, we use a novel method called "Diffussion-Matrion" to transform an image into a model that can be used to predict how it will change over time.

In this paper, the author demonstrates how to use a noise-reduction technique to decode an image and then apply that informat

OK, that certainly gives us longer text and a better overview of each page, but I'm not sure there's enough info in there for somebody to understand the paper entirely 🙂 But it's a start, right?
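
Since the per-page output is one long string, it can be hard to tell which summary belongs to which page. One possible variation, sketched here under assumptions, returns each summary together with its page number and skips pages with no text. `summarize_pages` is a hypothetical helper, and the `summarize` argument stands for any function that maps text to a summary; with the pipeline above that could be `lambda t: summarizer(t)[0]['summary_text']`.

```python
from typing import Callable, List, Tuple


def summarize_pages(pages: List[str], summarize: Callable[[str], str],
                    start: int = 0, end: int = -1) -> List[Tuple[int, str]]:
    """Return (page_number, summary) pairs; page numbers are 1-based."""
    # If end is -1, go to the end of the list (same convention as get_summary)
    if end == -1:
        end = len(pages)
    labelled = []
    for number, text in enumerate(pages[start:end], start=start + 1):
        # Skip pages with no extractable text instead of summarizing nothing
        if not text or not text.strip():
            continue
        labelled.append((number, summarize(text)))
    return labelled


# Stand-in summarizer for demonstration only: it just upper-cases the text
for number, summary in summarize_pages(["first page", "", "third page"], str.upper):
    print(f"Page {number}: {summary}")
```

With the real pipeline, the call would look like `summarize_pages(pdf.pages, lambda t: summarizer(t)[0]['summary_text'], 0, 8)`.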

The model card for the model used is here:
https://huggingface.co/pszemraj/long-t5-tglobal-base-16384-book-summary

It might be possible to improve on the results by using other parameters/options supported by the model and/or by trying a different text summarization model. But that's an exercise for later, perhaps?
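
As a starting point for that exercise, here is a rough sketch of how generation options could be passed through to the summarizer. The Hugging Face summarization pipeline forwards extra keyword arguments to the model's `generate` call, so options such as `max_length`, `min_length`, `num_beams`, `no_repeat_ngram_size` and `early_stopping` can be supplied per call. The values below are illustrative assumptions, not tuned or recommended settings, and `summarize_with_options` is a hypothetical helper name.

```python
from typing import Any, Callable, Dict

# Illustrative values only (assumptions, not tuned settings)
GENERATION_OPTIONS: Dict[str, Any] = {
    "max_length": 256,          # upper bound on summary length, in tokens
    "min_length": 32,           # lower bound on summary length, in tokens
    "num_beams": 4,             # beam search width
    "no_repeat_ngram_size": 3,  # block repeated 3-grams in the output
    "early_stopping": True,     # stop beam search once enough beams finish
}


def summarize_with_options(summarizer: Callable[..., Any], text: str,
                           **overrides: Any) -> str:
    """Call a summarization pipeline with default options plus any overrides."""
    options = {**GENERATION_OPTIONS, **overrides}
    return summarizer(text, **options)[0]["summary_text"]
```

With the pipeline created earlier, a call might look like `summarize_with_options(summarizer, pdf.pages[0], num_beams=8)`; comparing a few settings on the same page is probably the quickest way to see what helps.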