# Following procedure will be followed for this problem statement:
# - Convert pdf into xml file using Grobid
# - Extract title, headers and text section wise from xml file
# - Text processing
# - Extractive Summarization using BERT Summarizer
# - Predict title from the summarized text for that section using Flan-T5 base model
# - Combine all the text into single object for evaluation
# - Extract titles and content from the reference ppt of the paper from kaggle dataset
# - Perform Evaluation using Rouge Score
# - Create slides from the summarrized data and titles
# - Create final function which will take file_path as input and gives ppt as the output
# - Check by uploading the file and check reference.pptx file after using above function
# - Scope

In [5]:
# Convert the given file into the xml format using Grobid

!pip install pygrobid

Collecting pygrobid
  Downloading pygrobid-0.1.6.tar.gz (3.8 kB)
  Preparing metadata (setup.py) ... [?25l[?25hdone
Building wheels for collected packages: pygrobid
  Building wheel for pygrobid (setup.py) ... [?25l[?25hdone
  Created wheel for pygrobid: filename=pygrobid-0.1.6-py3-none-any.whl size=3939 sha256=4db78193fa85d3b8a431e584097e137b45bf85de276b76f167bd9e3f013f5a54
  Stored in directory: /root/.cache/pip/wheels/8e/25/2d/3916f3225cb2b366b89fcb316ffdd86432863e42f86023ce1a
Successfully built pygrobid
Installing collected packages: pygrobid
Successfully installed pygrobid-0.1.6


In [6]:
from grobid.client import GrobidClient

In [7]:
# host = "localhost"
# port = "8070"
# client = GrobidClient(host, port)

# rsp = client.serve("processFulltextDocument", "/content/2021.sdp-1.11.pdf", consolidate_header="1")

In [8]:
# Above cell was giving http connection error while connecting to the Grobid.
# Hence using the local setup of Grobid to convert and upload the xml file to this directory.
# Following is the procedure to setup Grobid in local

# Run following docker command in the shell to run the docker image in local (for windows launch docker desktop before running this)
# docker run --rm --init --ulimit core=0 -p 8070:8070 lfoppiano/grobid:0.8.0

# Then visit localhost:8070 on browser and upload the pdf file into TEI section and choose Process FullText Document option
# Convert and upload the xml file

In [9]:
# Extract sectionwise data required for summarization from the converted file
import xml.etree.ElementTree as ET

In [10]:
def extract_section_wise_text_from_file(file_path):
  # Your XML content

  # Parse the XML content
  tree = ET.parse(file_path)
  root = tree.getroot()

  # Dictionary to store extracted information
  result_dict = {}

  # Iterate through each "div" element
  for div_element in root.iterfind('.//{http://www.tei-c.org/ns/1.0}div'):
      div_data = {}

      # Extract headers and paragraphs under the "div"
      headers = [head.text for head in div_element.iterfind('.//{http://www.tei-c.org/ns/1.0}head')]
      paragraphs = [p.text for p in div_element.iterfind('.//{http://www.tei-c.org/ns/1.0}p')]

      # Store information in the dictionary
      div_data['headers'] = headers
      div_data['paragraphs'] = paragraphs

      # Add the information to the main dictionary
      result_dict[f"Division_{len(result_dict)+1}"] = div_data
  return result_dict

In [11]:
def extract_title_from_file(file_path):
  tree = ET.parse(file_path)
  root = tree.getroot()

  # Find the titleStmt element using the XML namespace
  namespace = {'tei': 'http://www.tei-c.org/ns/1.0'}
  title_element = root.find('.//tei:titleStmt/tei:title', namespace)

  # Extract the text content of the title element
  if title_element is not None:
      title_text = title_element.text
      print("Title Text:", title_text)
      return title_text
  else:
      print("Title Element not found in the XML content.")
      return "unknown title"

In [12]:
research_paper_file_path = "/content/Paper_cf.pdf.tei.xml"
result_dict = extract_section_wise_text_from_file(research_paper_file_path)
paper_title = extract_title_from_file(research_paper_file_path)

# Print the result dict
for key, value in result_dict.items():
    print(f"{key}:")
    print(f"Headers: {value['headers']}")
    print(f"Paragraphs: {value['paragraphs']}")
    print("\n")

Title Text: Approximation Algorithms for Combinatorial Auctions with Complement-Free Bidders
Division_1:
Headers: []
Paragraphs: ['In a combinatorial auction m heterogenous indivisible items are sold to n bidders. This paper considers settings in which the valuation functions of the bidders are known to be complement-free (a.k.a. subadditive). We provide several approximation algorithms for the social-welfare maximization problem in such settings. Firstly, we present a logarithmic upper bound for the case that the access to the valuation functions is via demand queries. For the weaker value queries model we provide a tight O( √ m) approximation. Unlike the other algorithms we present, this algorithm is also incentive compatible. Finally, we present two approximation algorithms for the more restricted class of XOS valuations: A simple deterministic algorithm that provides an approximation ratio of 2 and an optimal e e-1 approximation achieved via randomized rounding. We also present opt

In [13]:
# combining the content for each division
for key, value in result_dict.items():
    value["headers"] = " ".join(value["headers"])
    value["paragraphs"] = " ".join(value["paragraphs"])

In [14]:
def combine_content_for_each_division(input_dict):
  for key, value in input_dict.items():
    value["headers"] = " ".join(value["headers"])
    value["paragraphs"] = " ".join(value["paragraphs"])
  return input_dict

In [15]:
# Remove duplicate entries

# list to track values and their corresponding keys
value_counts = []

# List to store keys to remove
keys_to_remove = []

# Iterate through the original dictionary
for key, value in result_dict.items():
    # If the value is already in the value_counts dictionary, add the key to the keys_to_remove list
    if value["paragraphs"] in value_counts or value["paragraphs"] == "":
        keys_to_remove.append(key)
    else:
        # Otherwise, add the value to the value_counts dictionary
        value_counts.append(value["paragraphs"])

# Remove the keys outside the loop
print(keys_to_remove)


['Division_12', 'Division_13']


In [16]:
def remove_duplicate_entries(input_dict):
  # list to track values and their corresponding keys
  value_counts = []

  # List to store keys to remove
  keys_to_remove = []

  # Iterate through the original dictionary
  for key, value in input_dict.items():
      # If the value is already in the value_counts dictionary, add the key to the keys_to_remove list
      if value["paragraphs"] in value_counts or value["paragraphs"] == "":
          keys_to_remove.append(key)
      else:
          # Otherwise, add the value to the value_counts dictionary
          value_counts.append(value["paragraphs"])

  # Remove the keys outside the loop
  print(keys_to_remove)
  for key in keys_to_remove:
    input_dict.pop(key)

  return input_dict

In [17]:
for key in keys_to_remove:
    result_dict.pop(key)

In [18]:
# Exctractive text summarization using BERT Extractive Summarizer
# Reasons to select Extractive Text Summarization technique:
# 1. As the pdf documents can be too long and passing the whole document content to the LLM model wont be possible everytime.
# 2. Abstractive Summarization technique can give random output based on its creativity which will not be related to the actual context.
# 3. We need to create content for slides which will be done section wise hence to get relatable and fast output we will use Extractive Summarization.
!pip install bert-extractive-summarizer

Collecting bert-extractive-summarizer
  Downloading bert_extractive_summarizer-0.10.1-py3-none-any.whl (25 kB)
Installing collected packages: bert-extractive-summarizer
Successfully installed bert-extractive-summarizer-0.10.1


In [19]:
from summarizer import Summarizer
bert_model = Summarizer()

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/571 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/1.34G [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

In [20]:
def generate_extractive_summary(text):
  ext_summary = bert_model(text, min_length = 40, max_length = 150)
  summary = "".join(ext_summary)
  return summary

In [21]:
!pip install torch
!pip install transformers



In [22]:
import torch
from transformers import AutoModelForSeq2SeqLM
from transformers import AutoTokenizer

In [23]:
# We will then use the T5 base model to extract appropriate title for the slides as every time the headers extracted from the doc
# will not be correct.

model_name='google/flan-t5-base'

model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)

config.json:   0%|          | 0.00/1.40k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/990M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/2.54k [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.42M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/2.20k [00:00<?, ?B/s]

In [24]:
# Function to generate title from the given text

def generate_title(text):
  prompt = f"""
    Give a suitable title explaining the following text.

    {text}

  """
  inputs = tokenizer(prompt, return_tensors='pt')
  output = tokenizer.decode(
      model.generate(
          inputs["input_ids"],
          max_new_tokens=100
          # min_length=50,
          # max_length=150,
          # num_beams=2
      )[0],
      skip_special_tokens=True
  )
  return output

In [25]:
# Function to generate abstractive summary from the given text

def generate_abstractive_summary(text):
  prompt = f"""
  Summarize the following text which corresponds to the research paper.

  {text}

  Summary:
    """
  inputs = tokenizer(prompt, return_tensors='pt')
  output = tokenizer.decode(
      model.generate(
          inputs["input_ids"],
          # max_new_tokens=100
          # min_length=50,
          # max_length=150,
          num_beams=2
      )[0],
      skip_special_tokens=True
  )
  return output

In [26]:
for key, value in result_dict.items():
  para = value["paragraphs"]
  ext_sum = generate_extractive_summary(para)
  title = generate_title(ext_sum)
  value["summary"] = ext_sum
  value["title"] = title



In [27]:
def generate_final_summary_title_dict(input_dict):
  for key, value in input_dict.items():
    para = value["paragraphs"]
    ext_sum = generate_extractive_summary(para)
    if ext_sum == "":
      ext_sum = generate_abstractive_summary(para)
      title = generate_title(ext_sum)
    else:
      title = generate_title(ext_sum)
    value["summary"] = ext_sum
    value["title"] = title
  return input_dict

In [28]:
# Print the result
for key, value in result_dict.items():
    print(f"{key}:")
    print(f"Headers: {value['headers']}")
    print(f"Paragraphs: {value['paragraphs']}")
    print(f"title: {value['title']}")
    print(f"summary: {value['summary']}")
    print("\n")

Division_1:
Headers: 
Paragraphs: In a combinatorial auction m heterogenous indivisible items are sold to n bidders. This paper considers settings in which the valuation functions of the bidders are known to be complement-free (a.k.a. subadditive). We provide several approximation algorithms for the social-welfare maximization problem in such settings. Firstly, we present a logarithmic upper bound for the case that the access to the valuation functions is via demand queries. For the weaker value queries model we provide a tight O( √ m) approximation. Unlike the other algorithms we present, this algorithm is also incentive compatible. Finally, we present two approximation algorithms for the more restricted class of XOS valuations: A simple deterministic algorithm that provides an approximation ratio of 2 and an optimal e e-1 approximation achieved via randomized rounding. We also present optimal lower bounds for both the demand oracles model and the value oracles model.
title: A combina

In [29]:
# Extract data from slides for evaluation

presentation_slide_file_path = "/content/Slides_cf.pdf.tei.xml"
result_dict_slides = extract_section_wise_text_from_file(presentation_slide_file_path)

# combining the content for each division
for key, value in result_dict_slides.items():
    value["headers"] = " ".join(value["headers"])
    value["paragraphs"] = " ".join(value["paragraphs"])


# Remove duplicate entries

# Dictionary to track values and their corresponding keys
value_counts = []

# List to store keys to remove
keys_to_remove = []

# Iterate through the original dictionary
for key, value in result_dict_slides.items():
    # If the value is already in the value_counts dictionary, add the key to the keys_to_remove list
    if value["paragraphs"] in value_counts or value["paragraphs"] == "":
        keys_to_remove.append(key)
    else:
        # Otherwise, add the value to the value_counts dictionary
        value_counts.append(value["paragraphs"])

# Remove the keys outside the loop
print(keys_to_remove)


['Division_29', 'Division_30']


In [30]:
for key in keys_to_remove:
    result_dict_slides.pop(key)

In [31]:
# combining all text 0f slides into a single object for evaluation

def combine_all_text_of_slides_dict(input_dict):
  final_text = ""
  for key, value in input_dict.items():
    final_text += value["headers"] + "\n"
    final_text += value["paragraphs"] + "\n"

  return final_text

In [32]:
# combining all summarized text into a single object for evaluation

def combine_all_text_of_paper_summary_dict(input_dict):
  final_text = ""
  for key, value in input_dict.items():
    final_text += value["title"] + "\n"
    final_text += value["summary"] + "\n"

  return final_text

In [33]:
slide_full_summary = combine_all_text_of_slides_dict(result_dict_slides)
slide_full_summary = slide_full_summary.replace("", "")

full_paper_summary = combine_all_text_of_paper_summary_dict(result_dict)


In [34]:
# Evaluation of the results

!pip install evaluate

Collecting evaluate
  Downloading evaluate-0.4.1-py3-none-any.whl (84 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m84.1/84.1 kB[0m [31m2.4 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting datasets>=2.0.0 (from evaluate)
  Downloading datasets-2.16.1-py3-none-any.whl (507 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m507.1/507.1 kB[0m [31m8.9 MB/s[0m eta [36m0:00:00[0m
Collecting dill (from evaluate)
  Downloading dill-0.3.7-py3-none-any.whl (115 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m115.3/115.3 kB[0m [31m13.9 MB/s[0m eta [36m0:00:00[0m
Collecting multiprocess (from evaluate)
  Downloading multiprocess-0.70.15-py310-none-any.whl (134 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m134.8/134.8 kB[0m [31m14.6 MB/s[0m eta [36m0:00:00[0m
Collecting responses<0.19 (from evaluate)
  Downloading responses-0.18.0-py3-none-any.whl (38 kB)
Collecting pyarrow-hotfix (from datasets>=2.0.0->

In [35]:
!pip install rouge_score

Collecting rouge_score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py) ... [?25l[?25hdone
Building wheels for collected packages: rouge_score
  Building wheel for rouge_score (setup.py) ... [?25l[?25hdone
  Created wheel for rouge_score: filename=rouge_score-0.1.2-py3-none-any.whl size=24933 sha256=c010b3f9dcc7bf255e6e11e94689e867348ef8edeb98c7dda9e2e35c4dff8260
  Stored in directory: /root/.cache/pip/wheels/5f/dd/89/461065a73be61a532ff8599a28e9beef17985c9e9c31e541b4
Successfully built rouge_score
Installing collected packages: rouge_score
Successfully installed rouge_score-0.1.2


In [36]:
# Evaluation of the summarized text with the reference slide text

from evaluate import load
# Load the ROUGE metric
import evaluate
rouge = evaluate.load('rouge')

listed_full_paper_summary = [full_paper_summary]
listed_slide_full_summary = [slide_full_summary]

rouge_scores = rouge.compute(
    predictions=listed_full_paper_summary,
    references=listed_slide_full_summary,
    use_aggregator=True,
    use_stemmer=True,
)

print(rouge_scores)

Downloading builder script:   0%|          | 0.00/6.27k [00:00<?, ?B/s]

{'rouge1': 0.4424379232505643, 'rouge2': 0.17360438851242338, 'rougeL': 0.16510802966784907, 'rougeLsum': 0.40180586907449206}


In [37]:
# creating presentation from the section wise summarized research data

!pip install python-pptx

Collecting python-pptx
  Downloading python_pptx-0.6.23-py3-none-any.whl (471 kB)
[?25l     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/471.6 kB[0m [31m?[0m eta [36m-:--:--[0m[2K     [91m━━━━━━━━━━━━━━[0m[91m╸[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m174.1/471.6 kB[0m [31m5.0 MB/s[0m eta [36m0:00:01[0m[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m471.6/471.6 kB[0m [31m7.4 MB/s[0m eta [36m0:00:00[0m
Collecting XlsxWriter>=0.5.7 (from python-pptx)
  Downloading XlsxWriter-3.1.9-py3-none-any.whl (154 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m154.8/154.8 kB[0m [31m9.8 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: XlsxWriter, python-pptx
Successfully installed XlsxWriter-3.1.9 python-pptx-0.6.23


In [38]:

# import Presentation class
# from pptx library
from pptx import Presentation
from pptx.util import Inches, Pt

In [42]:
# Function to create first title slide

def create_title_slide(pres, title):
  first_slide_layout = pres.slide_layouts[0]
  slide = pres.slides.add_slide(first_slide_layout)

  slide.shapes.title.text = title
  return pres


In [40]:
# Function to convert text into the slide


def create_slide_from_text(pres, title, summary):
  listed_summary = summary.split(".")
  bullet_slide_layout = pres.slide_layouts[1]
  slide = pres.slides.add_slide(bullet_slide_layout)
  shapes = slide.shapes

  title_shape = shapes.title
  body_shape = shapes.placeholders[1]

  title_shape.text = title
  tf = body_shape.text_frame
  # tf.text = listed_summary[0]

  for i, v in enumerate(listed_summary):
    if i < 6:
      p = tf.add_paragraph()
      p.text = v
      p.font.size = Pt(22)
    else:
      break

  return pres

In [43]:
# create presentation from the title and summaries

prs_1 = Presentation()
prs_1 = create_title_slide(prs_1, paper_title)
for k, v in result_dict.items():
  prs_1 = create_slide_from_text(prs_1, v["title"], v["summary"])

prs_1.save("test1.pptx")

In [44]:
# You can check the test1.pptx file saved in the current directory.

# Now applying the same functionalities to generate the presentation for another file.

In [45]:
# Final function combining all the functionalities which will take file path as input and will save presentation in the current folder

def p2s_converter(file_path):
  r_dict = extract_section_wise_text_from_file(file_path)
  title = extract_title_from_file(file_path)
  r_dict = combine_content_for_each_division(r_dict)
  r_dict = remove_duplicate_entries(r_dict)
  f_dict = generate_final_summary_title_dict(r_dict)
  n_pres = Presentation()
  n_pres = create_title_slide(n_pres, title)
  for k, v in f_dict.items():
    n_pres = create_slide_from_text(n_pres, v["title"], v["summary"])

  n_pres.save("reference.pptx")
  return print("Presentation created for the given file path!")


In [46]:
# User can call this function after loading the required libraries directly with the xml file path

p2s_converter("/content/2021.sdp-1.11.pdf.tei.xml")

Title Text: Extractive Research Slide Generation Using Windowed Labeling Ranking
['Division_13']




Presentation created for the given file path!


# Scope

# Not able to use pdf figure 2.0 as the documentation available is not much clear.

# Tried image extraction using different methods but was not able to find out correct algorithm to add the images in the slide
# as image with caption was not able to extract from the file hence excluding image addition part for now.

# Following is the procedure that can be used to add images to the slide
#  - Extract image along with the image caption
#  - Check the caption from the extracted headers
#  - Associate that image to the header which is having max similarity

# We can even improve the summarries by fine tuning the model for the given dataset.
# Can find even better LLM model to create bullet points from the summary got from Extractive Summarization.