# **Automatic Text Summarizer and Metadata Extractor with JSON Output**
## *Leo Schuhmann*
This notebook takes an English PDF file as input, extracts its available metadata, and reads its text. <br> It then generates keywords and summaries of the text using both extractive and abstractive techniques.

# **Please install the following libraries**

In [None]:
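# PyPDF2 3.0 removed the PdfFileReader API used in this notebook, so stay below that version.
!pip install --upgrade "PyPDF2<3.0"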
!pip install --upgrade sentencepiece
!pip install --upgrade bert-extractive-summarizer
!pip install --upgrade spacy
!pip install --upgrade transformers
!pip install --upgrade neuralcoref
!pip install --upgrade pegasuspy
!python -m spacy download en_core_web_md
!pip install --upgrade git+https://github.com/google/flax.git
!pip install --upgrade python-rake
!pip install --upgrade nltk
!pip install --upgrade torch
# re is part of the Python standard library and needs no pip install.
!pip install --upgrade ipywidgets
!pip install --upgrade IPython

# **Import necessary libraries and download helper files**

In [None]:
from ipywidgets import FileUpload
from IPython.display import display
from PyPDF2 import PdfFileReader
import nltk
import re
from summarizer import Summarizer
from transformers import T5Tokenizer, T5Config, T5ForConditionalGeneration, PegasusTokenizer, PegasusForConditionalGeneration, AutoModelForSeq2SeqLM, AutoTokenizer
import torch
import pprint
import RAKE
nltk.download('punkt')
nltk.download('stopwords')
pp = pprint.PrettyPrinter(indent=4) #pretty print outputs

# **Upload the English PDF file**
*Click Upload and select the PDF file to get started.* <br>
**After uploading the file, run only the cells that follow; do not run this cell again.**

In [67]:
upload = FileUpload(accept='.pdf', multiple=False)
display(upload)

FileUpload(value={}, accept='.pdf', description='Upload')

*This cell writes the uploaded PDF to disk so that the next steps can read it.*

In [68]:
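# ipywidgets 7 stores uploads in a dict keyed by filename, ipywidgets 8 in a tuple of dicts.
uploaded_files = upload.value.values() if isinstance(upload.value, dict) else upload.value
with open('file_output.pdf', 'wb') as output_file:
    for uploaded_file in uploaded_files:
        output_file.write(uploaded_file['content'])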

# **Now we can start**
## 1. Read PDF Metadata and Text

In [69]:
def get_info(path):
    """Return the full text, the document info and the page count of a PDF."""
    with open(path, 'rb') as f:
        pdf = PdfFileReader(f)
        info = pdf.getDocumentInfo()
        number_of_pages = pdf.getNumPages()

        # Extract the text page by page and join the pages with spaces.
        pages = [pdf.getPage(i).extractText() for i in range(number_of_pages)]
        full_text = " ".join(pages)

    return full_text, info, number_of_pages

full_text, metadata, number_of_pages = get_info('file_output.pdf')
meta = dict(metadata or {})  # a PDF may carry no document info at all
meta['/NumberPages'] = number_of_pages  # add the page count to the metadata
pp.pprint(meta)
pp.pprint(full_text[:100])  # show only the first 100 characters

{   '/Author': 'Leo Schuhmann',
    '/CreationDate': 'D:20211218124501Z',
    '/Creator': 'Microsoft® Word for Microsoft 365',
    '/Keywords': 'Business Informatics, IT, Study',
    '/ModDate': "D:20211218134549+01'00'",
    '/NumberPages': 2,
    '/Producer': 'Microsoft® Word for Microsoft 365',
    '/Subject': 'Business Informatics',
    '/Title': 'Business Informatics'}
('\n'
 ' \n'
 'My studies in business informatics:\n'
 ' \n'
 'I \n'
 "study business and computer science. That's cool! Busine")

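The `/CreationDate` and `/ModDate` values in the metadata above are PDF date strings (`D:YYYYMMDDHHmmSS` followed by an optional UTC offset). If real `datetime` objects are wanted in the final output, a small parser along the following lines could be used. This is only a sketch: the name `parse_pdf_date` is hypothetical, and it assumes the PDF producer follows the standard date layout.

```python
import re
from datetime import datetime, timedelta, timezone

# Every field after the year is optional in the PDF date format.
_PDF_DATE = re.compile(
    r"^D:(?P<year>\d{4})(?P<month>\d{2})?(?P<day>\d{2})?"
    r"(?P<hour>\d{2})?(?P<minute>\d{2})?(?P<second>\d{2})?"
    r"(?P<sign>[+\-Z])?(?P<tz_hour>\d{2})?'?(?P<tz_minute>\d{2})?'?$"
)


def parse_pdf_date(value: str) -> datetime:
    """Convert a PDF date string such as D:20211218124501Z to a datetime."""
    match = _PDF_DATE.match(value.strip())
    if match is None:
        raise ValueError(f"not a PDF date string: {value!r}")
    parts = match.groupdict()

    sign = parts["sign"]
    if sign is None:
        tz = None  # no offset given: return a naive datetime
    elif sign == "Z":
        tz = timezone.utc
    else:
        offset = timedelta(
            hours=int(parts["tz_hour"] or 0),
            minutes=int(parts["tz_minute"] or 0),
        )
        tz = timezone(offset if sign == "+" else -offset)

    # Missing date parts default to 1, missing time parts to 0.
    return datetime(
        int(parts["year"]),
        int(parts["month"] or 1),
        int(parts["day"] or 1),
        int(parts["hour"] or 0),
        int(parts["minute"] or 0),
        int(parts["second"] or 0),
        tzinfo=tz,
    )


print(parse_pdf_date("D:20211218124501Z").isoformat())
print(parse_pdf_date("D:20211218134549+01'00'").isoformat())
```

ISO 8601 strings produced this way are easier for downstream consumers to handle than the raw PDF notation.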

## 2. Basic Text Preparation and Cleaning

In [70]:
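# Replace line breaks with spaces and trim leading and trailing whitespace.
# str.replace() only matches literal text, so a regex pattern has no effect there; use re.sub() for patterns.
body = full_text.replace("\n", " ").strip()
# Collapse runs of spaces into a single space.
body = re.sub(' +', ' ', body)
pp.pprint(body[:100])  # show only the first 100 characters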

('My studies in business informatics: I study business and computer science. '
 "That's cool! Business is ")

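The cleaning step can also be wrapped in a small reusable helper. The sketch below is one possible version, not the notebook's own code: the name `clean_text`, the `drop_noise` flag and the character whitelist are assumptions made for illustration. Sentence punctuation is kept on purpose, since the summarizers that follow depend on sentence boundaries. The whitelist only covers ASCII letters, which fits the English-only input assumed here.

```python
import re

# Assumption: characters outside this whitelist count as extraction noise.
# Sentence punctuation stays so that later sentence splitting still works.
_NOISE = re.compile(r"[^A-Za-z0-9\s.,:;!?'()-]")


def clean_text(raw: str, drop_noise: bool = False) -> str:
    """Flatten line breaks, optionally drop unusual symbols, collapse whitespace."""
    text = raw.replace("\n", " ")
    if drop_noise:
        text = _NOISE.sub("", text)
    # Collapse any run of whitespace into one space and trim both ends.
    return re.sub(r"\s+", " ", text).strip()


sample = "\n \nMy studies in business informatics:\n \nI \nstudy business and computer science."
print(clean_text(sample))
```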

## 3. BERT Extractive Summarizer

In [71]:
model = Summarizer()
result = model(body, num_sentences=3) 
pp.pprint(result)

Some weights of the model checkpoint at bert-large-uncased were not used when initializing BertModel: ['cls.predictions.decoder.weight', 'cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


('My studies in business informatics: I study business and computer science. '
 'From the perspective of computer science, business informatics is an applied '
 'computer science. Although business informatics has many characterist ics of '
 'a so - called interface or bridge discipline, which is open to other '
 'disciplines, it has its own field of statement: it deals with theories, '
 'methods, tools and develops intersubjectively verifiable knowledge about '
 'information and communication systems . At many university locations, '
 'business informatics is therefore assigned to the economic sciences or the '
 'social and economic sciences.')


## 4. T5 Model

In [72]:
model = T5ForConditionalGeneration.from_pretrained('t5-small')
tokenizer = T5Tokenizer.from_pretrained('t5-small')
device = torch.device('cpu')

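# T5 selects its task through a text prefix; the input here is the BERT extractive summary.
t5_prepared_text = "summarize: " + result
tokenized_text = tokenizer.encode(t5_prepared_text, return_tensors="pt").to(device)

# Beam search with blocked repeated bigrams, producing between 30 and 100 tokens.
summary_ids = model.generate(
    tokenized_text,
    num_beams=4,
    no_repeat_ngram_size=2,
    min_length=30,
    max_length=100,
    early_stopping=True,
)

# generate() returns a batch; the single input yields one sequence.
output = tokenizer.decode(summary_ids[0], skip_special_tokens=True)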
pp.pprint(output)

('business informatics has many characterist ics of a so - called interface or '
 'bridge discipline, which is open to other disciplines. it deals with '
 'theories, methods, tools and develops intersubjectively verifiable knowledge '
 'about information and communication systems.')


## 5. Pegasus Model

In [73]:
model = PegasusForConditionalGeneration.from_pretrained('google/pegasus-xsum')
tokenizer = PegasusTokenizer.from_pretrained('google/pegasus-xsum')

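# Tokenize the BERT extractive summary; `device` was defined in the T5 cell above.
tokenized_text = tokenizer(result, return_tensors='pt').to(device)

# Generate with the model's default settings and decode the single returned sequence.
summary_ids = model.generate(tokenized_text['input_ids'])
out = tokenizer.decode(summary_ids[0], skip_special_tokens=True)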
pp.pprint(out)

'What is business informatics?'


## 6. Keywords: RAKE

In [74]:
rake = RAKE.Rake(RAKE.SmartStopList())

# Start with a strict frequency threshold and relax it until RAKE returns phrases.
out_key = []
for min_frequency in (3, 2, 1):
    out_key = rake.run(body, minCharacters=1, maxWords=3, minFrequency=min_frequency)
    if out_key:
        break

# rake.run() returns (phrase, score) pairs sorted by score; keep at most the top three phrases.
keywords = [phrase for phrase, _ in out_key[:3]]

pp.pprint(keywords)

['business informatics deals', 'business informatics', 'computer science']
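To make the ranking above easier to interpret, here is a minimal sketch of the scoring idea behind RAKE: candidate phrases are cut at stop words and punctuation, each word is scored as degree divided by frequency, and a phrase scores the sum of its words. It is a simplified illustration, not the `python-rake` implementation. The stop list and the names `candidate_phrases` and `rake_scores` are assumptions, and the frequency and length filters used in the cell above are left out.

```python
import re
from collections import defaultdict

# Assumption: a tiny illustrative stop list; python-rake's SmartStopList is far larger.
STOP_WORDS = {"a", "an", "and", "the", "is", "of", "in", "to", "with", "it", "i", "that"}


def candidate_phrases(text):
    """Split text into candidate phrases at punctuation and stop words."""
    phrases = []
    for fragment in re.split(r"[.,:;!?()\n]+", text.lower()):
        current = []
        for word in re.findall(r"[a-z0-9'-]+", fragment):
            if word in STOP_WORDS:
                # A stop word ends the phrase collected so far.
                if current:
                    phrases.append(current)
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(current)
    return phrases


def rake_scores(text):
    """Score each phrase as the sum of deg(w) / freq(w) over its words."""
    phrases = candidate_phrases(text)
    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)  # co-occurrences, counting the word itself
    scores = {
        " ".join(phrase): sum(degree[w] / freq[w] for w in phrase)
        for phrase in phrases
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


text = (
    "Business informatics is an applied computer science. "
    "I study business informatics and computer science."
)
for phrase, score in rake_scores(text):
    print(f"{score:4.1f}  {phrase}")
```

In this toy input the three-word phrases score 8.0 and the two-word phrases 5.0, because every extra co-occurring word raises the degree of its neighbours. That tendency may be why a longer phrase such as 'business informatics deals' ranks ahead of 'business informatics' in the output above.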


## 7. Generate JSON Output

In [75]:
json_result = {"metadata": meta, 
               "BERT_extractive_sum": result,
               "T5_abstractive_sum": output,
               "Pegasus_abstractive_sum": out,
               "Rake_top_3_keywords": keywords}
pp.pprint(json_result)

{   'BERT_extractive_sum': 'My studies in business informatics: I study '
                           'business and computer science. From the '
                           'perspective of computer science, business '
                           'informatics is an applied computer science. '
                           'Although business informatics has many '
                           'characterist ics of a so - called interface or '
                           'bridge discipline, which is open to other '
                           'disciplines, it has its own field of statement: it '
                           'deals with theories, methods, tools and develops '
                           'intersubjectively verifiable knowledge about '
                           'information and communication systems . At many '
                           'university locations, business informatics is '
                           'therefore assigned to the economic sciences or the '
                      