# **Automatic Text Summarizer and Metadata Extractor with JSON Output**
## *Leo Schuhmann*
This notebook takes an English PDF file as input, extracts its available metadata, and reads its text. <br> It then generates summaries with extractive and abstractive techniques and extracts keywords.

# **Please install the following libraries**

In [None]:
!pip install --upgrade pyPDF2
!pip install --upgrade sentencepiece
!pip install --upgrade bert-extractive-summarizer
!pip install --upgrade spacy
!pip install --upgrade transformers
!pip install --upgrade neuralcoref
!pip install --upgrade pegasuspy
!python -m spacy download en_core_web_md
!pip install --upgrade git+https://github.com/google/flax.git
!pip install --upgrade python-rake
!pip install --upgrade nltk
!pip install --upgrade torch
# re is part of the Python standard library and does not need to be installed
!pip install ipywidgets
!pip install IPython

# **Import necessary libraries and download helper files**

In [None]:
from ipywidgets import FileUpload
from IPython.display import display
from PyPDF2 import PdfFileReader
import nltk
import re
from summarizer import Summarizer
from transformers import T5Tokenizer, T5Config, T5ForConditionalGeneration, PegasusTokenizer, PegasusForConditionalGeneration, AutoModelForSeq2SeqLM, AutoTokenizer
import torch
import pprint
import RAKE  # provided by the python-rake package
nltk.download('punkt')
nltk.download('stopwords')
pp = pprint.PrettyPrinter(indent=4) #pretty print outputs

# **Set the English PDF file**
*Click Upload and select the PDF file to get started.* <br>
**After running the cell below and uploading the file, run only the following cells; do not run this one again.**

In [None]:
upload = FileUpload(accept='.pdf', multiple=False)
display(upload)

*The code below writes the uploaded PDF to disk so that the next steps can read it.*

In [None]:
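# upload.value is a dict keyed by filename in ipywidgets 7 and a tuple of dicts in ipywidgets 8
files = upload.value.values() if isinstance(upload.value, dict) else upload.value
with open('file_output.pdf', 'wb') as output_file:
    for uploaded_file in files:
        output_file.write(uploaded_file['content'])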

# **Now we can start**
## 1. Read PDF Metadata and Text

In [None]:
from PyPDF2 import PdfReader  # PyPDF2 >= 2.0; PdfFileReader no longer works in PyPDF2 3.0

def get_info(path):
    """Return the full text, the document metadata, and the page count of a PDF."""
    with open(path, 'rb') as f:
        pdf = PdfReader(f)
        info = pdf.metadata
        number_of_pages = len(pdf.pages)

        # Extract the text page by page and join the pages with spaces
        full_text = " ".join(page.extract_text() for page in pdf.pages)

    return full_text, info, number_of_pages

full_text, metadata, number_of_pages = get_info('file_output.pdf')
meta = dict(metadata or {})  # metadata is None when the PDF has no info dictionary
meta['/NumberPages'] = number_of_pages
pp.pprint(meta)
pp.pprint(full_text[:100])  # only view the first 100 characters
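
PDF metadata keys carry a leading slash (for example `/Author`), just like the `/NumberPages` key added above. The following is a minimal sketch of how such a dictionary could be tidied before export. The helper name `normalize_metadata` and the sample values are hypothetical and not part of the notebook's pipeline.

```python
def normalize_metadata(raw_meta):
    """Strip the leading '/' from PDF info keys and make every value a plain type."""
    cleaned = {}
    for key, value in raw_meta.items():
        # Keep numbers as they are; turn everything else into a plain string
        cleaned[str(key).lstrip('/')] = value if isinstance(value, (int, float)) else str(value)
    return cleaned

# Hypothetical sample in the shape of a PDF info dictionary
sample_meta = {'/Author': 'Jane Doe', '/Title': 'Example', '/NumberPages': 12}
clean_meta = normalize_metadata(sample_meta)
print(clean_meta)
```

Whether to strip the slash is a matter of taste; keeping the raw keys, as the notebook does, stays closer to the PDF specification.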

## 2. Basic Text Preparation and Cleaning

In [None]:
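# str.replace() matches literal text, so the JavaScript-style pattern "/[^A-Za-z0-9\s!?]/g" never matched anything.
# Applying it with re.sub would also strip periods and commas, which the sentence-based summarizers rely on, so it is dropped.
body = full_text.replace("\n", " ").strip()
body = re.sub(r'\s+', ' ', body)  # collapse runs of whitespace into a single space
pp.pprint(body[:100])  # only view the first 100 characters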
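
The cleaning step can also be wrapped in a small function. The sketch below additionally contrasts `str.replace`, which treats its argument as literal text, with `re.sub`, which interprets it as a regular expression. The function names `clean_text` and `strip_symbols` and the sample strings are assumptions made for illustration.

```python
import re

def clean_text(raw_text):
    """Replace line breaks with spaces and collapse repeated whitespace."""
    text = raw_text.replace("\n", " ")
    text = re.sub(r"\s+", " ", text)
    return text.strip()

def strip_symbols(text):
    """Remove everything except letters, digits, whitespace, '!' and '?'."""
    return re.sub(r"[^A-Za-z0-9\s!?]", "", text)

sample = "  Automatic   text\nsummarization  works. \n\nDoes it?  "
print(clean_text(sample))

# A JavaScript-style regex literal is just an ordinary string for str.replace
unchanged = "a.b, c!".replace(r"/[^A-Za-z0-9\s!?]/g", "")
print(unchanged)                 # the text is returned untouched
print(strip_symbols("a.b, c!"))  # periods and commas are removed
```

Note that `strip_symbols` also deletes periods, so it should only be applied where sentence boundaries no longer matter.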

## 3. BERT Extractive Summarizer

Available options and parameters:

model = Summarizer(
    **model**: Used by the Hugging Face BERT library to load the model; you can supply a custom-trained model here.
    **custom_model**: If you have a pre-trained model, you can pass the model class here.
    **custom_tokenizer**: If you have a custom tokenizer, you can pass the tokenizer here.
    **hidden**: Must be negative; selects the layer the embeddings are taken from.
    **reduce_option**: One of 'mean', 'median', or 'max'. Reduces the embedding layer for pooling.
    **sentence_handler**: The handler used to process sentences. To use coreference resolution, instantiate a CoreferenceHandler and pass it here.
)

model(
    **body**: str # The text that you want to summarize
    **ratio**: float # The ratio of sentences to keep in the final summary
    **min_length**: int # Sentences shorter than this number of characters are removed
    **max_length**: int # Sentences longer than this number of characters are removed
    **num_sentences**: int # Number of sentences to use. Overrides ratio if supplied.
)

**In my tests, keeping the default parameters and only setting the output length with num_sentences generally gave the best results.**


In [None]:
model = Summarizer()
result = model(body, num_sentences=3) 
pp.pprint(result)

## 4. T5 Model
For fine-tuning, see the Hugging Face documentation: <br>
https://huggingface.co/docs/transformers/main_classes/model

https://huggingface.co/docs/transformers/main_classes/tokenizer

In [None]:
# Load the pre-trained T5 model and its tokenizer
model = T5ForConditionalGeneration.from_pretrained('t5-small')
tokenizer = T5Tokenizer.from_pretrained('t5-small')

# T5 expects a task prefix; the extractive BERT summary from step 3 serves as input
t5_prepared_text = "summarize: " + result
tokenized_text = tokenizer.encode(t5_prepared_text, return_tensors="pt")

summary_ids = model.generate(tokenized_text, min_length=30, max_length=100)

# Decode the first (and only) generated sequence back into text
output = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
pp.pprint(output)

## Input and output tensors generated:

In [None]:
pp.pprint(tokenized_text)

pp.pprint(summary_ids)

## 5. Pegasus Model
For fine-tuning, see the Hugging Face documentation: <br>
https://huggingface.co/docs/transformers/main_classes/model

https://huggingface.co/docs/transformers/main_classes/tokenizer

In [None]:
model = PegasusForConditionalGeneration.from_pretrained('google/pegasus-xsum')
tokenizer = PegasusTokenizer.from_pretrained('google/pegasus-xsum')

tokenized_text = tokenizer.encode(result, return_tensors='pt')

summary_ids = model.generate(tokenized_text)
out = ([tokenizer.decode(g, skip_special_tokens=True) for g in summary_ids])[0]
pp.pprint(out)

## 6. Keywords: RAKE
The parameters passed to `run` here are the only ones available.

In [None]:
rake = RAKE.Rake(RAKE.SmartStopList())

# Relax the minimum frequency step by step until RAKE returns at least one phrase
out_key = []
for min_frequency in (3, 2, 1):
    out_key = rake.run(body, minCharacters=1, maxWords=3, minFrequency=min_frequency)
    if out_key:
        break

# rake.run returns (phrase, score) pairs; keep at most the first three phrases,
# which avoids an IndexError when fewer than three keywords are found
keywords = [phrase for phrase, _score in out_key[:3]]

pp.pprint(keywords)
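
To make the ranking less of a black box, here is a simplified sketch of the RAKE scoring idea: the text is split into candidate phrases at punctuation and stop words, each word is scored as degree divided by frequency, and a phrase scores the sum of its words. This is a re-implementation for illustration only, not the code of the python-rake package; the tiny stop list, the function names, and the sample text are assumptions.

```python
import re
from collections import defaultdict

# Tiny illustrative stop list; the notebook uses RAKE.SmartStopList() instead
STOPWORDS = {"a", "an", "and", "are", "for", "in", "is", "of", "on", "the", "to", "with"}

def candidate_phrases(text):
    """Split text into candidate phrases at punctuation and stop words."""
    phrases = []
    for fragment in re.split(r"[.,;:!?()\n]", text.lower()):
        current = []
        for word in re.findall(r"[a-z0-9]+", fragment):
            if word in STOPWORDS:
                if current:
                    phrases.append(tuple(current))
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(tuple(current))
    return phrases

def rake_scores(text):
    """Score each phrase as the sum of degree/frequency over its words."""
    phrases = candidate_phrases(text)
    frequency = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            frequency[word] += 1
            degree[word] += len(phrase)  # co-occurrences, counting the word itself
    scores = {" ".join(p): sum(degree[w] / frequency[w] for w in p) for p in set(phrases)}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

sample = "Text summarization is useful. Extractive text summarization selects sentences."
for phrase, score in rake_scores(sample):
    print(f"{score:5.2f}  {phrase}")
```

Long phrases accumulate high scores in this scheme, which is one reason the notebook caps phrases at three words with `maxWords = 3`.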

## Generate JSON Output

In [None]:
json_result = {"metadata": meta, 
               "BERT_extractive_sum": result,
               "T5_abstractive_sum": output,
               "Pegasus_abstractive_sum": out,
               "Rake_top_3_keywords": keywords}
pp.pprint(json_result)
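
`pp.pprint` shows the Python dictionary, which is not yet valid JSON (it uses single quotes, for instance). A minimal sketch of serializing such a dictionary with the standard `json` module follows. The dictionary `example_result` is a hypothetical stand-in for `json_result`, and `default=str` is an assumption for handling metadata values that `json` cannot encode natively.

```python
import json

# Hypothetical stand-in for the json_result dictionary built above
example_result = {
    "metadata": {"/Author": "Jane Doe", "/NumberPages": 12},
    "BERT_extractive_sum": "First sentence. Second sentence.",
    "Rake_top_3_keywords": ["text summarization", "keywords"],
}

# default=str converts values that json cannot encode natively into strings
json_text = json.dumps(example_result, indent=4, ensure_ascii=False, default=str)
print(json_text)
```

The resulting string can be written to a file or returned from an API as is.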