

## 📘 Notebook Summary: Section-Based Structured Summarization

This notebook demonstrates a method for generating structured summaries from the full text of research papers. It follows a three-step approach:
1. **Section Extraction**: Sends the paper's PDF to a local Grobid service and splits the returned TEI XML into sections by their headings.
2. **Semantic Matching**: Uses sentence embeddings (`all-MiniLM-L6-v2`) and cosine similarity to match section titles to categories such as problem, innovation, results, and related work.
3. **Structured Summary Generation**: Produces a dictionary-based summary useful for downstream tasks like metadata tagging or paper indexing.

The code requires a running Grobid service (setup below), the local `helper_functions` module, and the `ai_ml_papers.csv` dataset.
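
The matching in step 2 comes down to comparing vectors by cosine similarity. As a rough, dependency-free illustration, the sketch below uses hand-made three-dimensional vectors in place of real sentence embeddings. The vectors and the helper names `cosine` and `closest_concept` are made up for this example and are not part of the notebook.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def closest_concept(title_vec, concept_vecs):
    """Return the concept whose vector is most similar to title_vec."""
    return max(concept_vecs, key=lambda name: cosine(title_vec, concept_vecs[name]))


# Toy vectors (assumed values, not real embeddings)
concept_vecs = {
    "introduction": [1.0, 0.1, 0.0],
    "method": [0.0, 1.0, 0.2],
    "result": [0.1, 0.0, 1.0],
}
title_vecs = {
    "1 introduction": [0.9, 0.2, 0.1],
    "3 proposed approach": [0.1, 0.8, 0.3],
    "5 evaluation": [0.2, 0.1, 0.9],
}

# Assign every section title to its nearest concept
assignments = {title: closest_concept(vec, concept_vecs) for title, vec in title_vecs.items()}
print(assignments)
```

In the notebook itself the vectors come from the sentence-embedding model, so titles such as "proposed approach" can land near "method" without sharing any words.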


## 🧰 Setup Instructions for Grobid (Local)

1. Install Java 11 (required)
   Recommended: Use SDKMAN for easy version management
     - Install SDKMAN:
         curl -s "https://get.sdkman.io" | bash
     - Install Java 11:
         sdk install java 11.0.19-tem
         sdk use java 11.0.19-tem

2. Clone the Grobid repository:
     git clone https://github.com/kermitt2/grobid.git
     cd grobid

3. Build the project using Gradle:
     ./gradlew clean install

4. Run the Grobid service locally:
     ./gradlew run

Ensure 'java -version' shows Java 11 before building Grobid.
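
Once the service is running, a quick check from Python can confirm it is reachable before the notebook sends PDFs to it. The sketch below assumes Grobid's default port 8070 (the same host used by the extraction code later in the notebook) and its `/api/isalive` health endpoint. `grobid_is_alive` is a hypothetical helper name, and the standard library is used here only to keep the example free of extra dependencies.

```python
from urllib.request import urlopen


def grobid_is_alive(base_url="http://localhost:8070", timeout=5.0):
    """Return True if a Grobid service answers the health check at base_url."""
    try:
        with urlopen(f"{base_url}/api/isalive", timeout=timeout) as response:
            body = response.read().decode("utf-8", errors="replace")
            return response.status == 200 and body.strip().lower() == "true"
    except OSError:
        # URLError, HTTPError, connection refusals, and timeouts are all OSError subclasses
        return False


if __name__ == "__main__":
    print("Grobid reachable:", grobid_is_alive())
```

If this prints `False`, the service is probably still starting up or listening on a different port.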


In [18]:
from io import BytesIO

import pandas as pd
import requests
import spacy
from bs4 import BeautifulSoup
from keybert import KeyBERT
from sentence_transformers import SentenceTransformer, util

from helper_functions import pad_arxiv_id


In [11]:
df = pd.read_csv("ai_ml_papers.csv")
kw_model = KeyBERT('all-MiniLM-L6-v2')
sbert_model = SentenceTransformer('all-MiniLM-L6-v2')
nlp = spacy.load("en_core_web_sm")
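

The sample ID printed later in this notebook, `704.0985`, has lost the leading zero of the arXiv ID `0704.0985`, which can happen when an ID column is stored or read as a number. The imported `pad_arxiv_id` helper is not shown here, so the function below is only a guess at the kind of repair it might perform, under the name `pad_arxiv_id_sketch`. It assumes new-style numeric IDs (`YYMM.NNNN` up to 2014, `YYMM.NNNNN` from 2015 onward) and passes anything else through unchanged.

```python
def pad_arxiv_id_sketch(arxiv_id):
    """Restore zeros that a numeric round-trip strips from a new-style arXiv ID."""
    text = str(arxiv_id)
    prefix, _, number = text.partition(".")
    if not prefix.isdigit() or not number:
        # Old-style IDs such as "hep-th/9901001" are returned untouched
        return text
    prefix = prefix.zfill(4)  # YYMM is always four digits, e.g. "704" -> "0704"
    # Sequence numbers have 4 digits up to 1412 and 5 digits from 1501 onward
    width = 5 if int(prefix) >= 1501 else 4
    return f"{prefix}.{number.ljust(width, '0')}"


for raw in (704.0985, "1501.0001", "2301.12345"):
    print(raw, "->", pad_arxiv_id_sketch(raw))
```

Reading the column as text in the first place (for example with a string dtype for the ID column) would avoid the problem, provided the CSV itself still contains the zeros.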



In [15]:

def extract_fulltext_grobid(arxiv_id, grobid_url="http://localhost:8070/api/processFulltextDocument"):
    """
    Download a PDF from arXiv and send it to Grobid for full-text extraction.

    Args:
        arxiv_id (str): The arXiv paper ID (e.g., "2301.12345").
        grobid_url (str): The endpoint of the Grobid service.

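    Returns:
        str: Extracted TEI XML text from Grobid.

    Raises:
        RuntimeError: If the PDF download or the Grobid request fails.
    """
    padded_id = pad_arxiv_id(arxiv_id)
    url = f"https://arxiv.org/pdf/{padded_id}.pdf"
    try:
        response = requests.get(url, timeout=60)
        response.raise_for_status()
        files = {"input": (f"{arxiv_id}.pdf", BytesIO(response.content), "application/pdf")}

        # Full-text processing can be slow, so allow a generous timeout
        grobid_response = requests.post(grobid_url, files=files, timeout=300)
        grobid_response.raise_for_status()
        return grobid_response.text
    except requests.RequestException as e:
        # Raise instead of returning the message, which would otherwise be parsed as TEI XML
        raise RuntimeError(f"Error with Grobid processing: {e}") from e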
    

def parse_sections_from_tei(tei_xml):
    """
    Parses TEI XML output from Grobid into section-wise text.

    Args:
        tei_xml (str): TEI XML string extracted by Grobid.

    Returns:
        dict: Dictionary mapping section titles to their full content.
    """
    sections = {}
    soup = BeautifulSoup(tei_xml, "xml")
    for div in soup.find_all("div"):
        head = div.find("head")
        if head:
            title = head.text.strip().lower()
            paragraphs = [p.text.strip() for p in div.find_all("p")]
            sections[title] = " ".join(paragraphs)
    return sections    
    

def semantic_section_match(sections):
    """
    Assigns each section to its closest concept by comparing sentence embeddings of the section titles.

    Args:
        sections (dict): Dictionary of section titles to text.

    Returns:
        dict: Mapping of concept to the text of its best-matching section.
              Concepts that no section maps to are omitted.
    """
    section_titles = list(sections.keys())
    if not section_titles:
        return {}
    potential_section_headings = ["introduction", "motivation", "method", "experiment", "related work", "background", "innovation", "result"]

    query_embeddings = sbert_model.encode(potential_section_headings, convert_to_tensor=True)
    title_embeddings = sbert_model.encode(section_titles, convert_to_tensor=True)

    # similarity has shape (num_concepts, num_titles)
    similarity = util.cos_sim(query_embeddings, title_embeddings)
    best = similarity.max(dim=0)  # closest concept for each section title

    matched, best_scores = {}, {}
    for i, (score, idx) in enumerate(zip(best.values.tolist(), best.indices.tolist())):
        concept = potential_section_headings[idx]
        # If several sections map to the same concept, keep the highest-scoring one
        if score > best_scores.get(concept, float("-inf")):
            best_scores[concept] = score
            matched[concept] = sections[section_titles[i]]
    return matched
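

To see what the section parser above works with, it can help to run the same idea on a tiny input. The sketch below is a standard-library variant built on `xml.etree.ElementTree` rather than BeautifulSoup, applied to a hand-written fragment that only imitates the shape of Grobid's TEI output. The fragment and the name `parse_sections_stdlib` are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# TEI elements live in this XML namespace
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hand-written fragment shaped like Grobid output (not real output)
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <div><head>1 Introduction</head><p>First paragraph.</p><p>Second paragraph.</p></div>
    <div><head>2 Method</head><p>We propose a thing.</p></div>
    <div><p>A division without a heading is skipped.</p></div>
  </body></text>
</TEI>"""


def parse_sections_stdlib(tei_xml):
    """Map lower-cased section headings to the joined text of their paragraphs."""
    sections = {}
    root = ET.fromstring(tei_xml)
    for div in root.iterfind(".//tei:div", TEI_NS):
        head = div.find(".//tei:head", TEI_NS)
        if head is None:
            continue  # no heading, nothing to key the section by
        title = "".join(head.itertext()).strip().lower()
        paragraphs = ["".join(p.itertext()).strip() for p in div.iterfind(".//tei:p", TEI_NS)]
        sections[title] = " ".join(paragraphs)
    return sections


sections = parse_sections_stdlib(SAMPLE_TEI)
print(sections)
```

Note that headings keep their numbering ("1 introduction"), which is one reason the notebook matches titles by embedding similarity instead of exact string comparison.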



In [16]:

def generate_summary(arxiv_id):
    """
    Generates a structured summary from an arXiv ID using Grobid and semantic matching.

    Args:
        arxiv_id (str): The arXiv identifier of the paper.

    Returns:
        dict: Structured summary with key sections.
    """
    tei_xml = extract_fulltext_grobid(arxiv_id)
    sections = parse_sections_from_tei(tei_xml)

    matched_sections = semantic_section_match(sections)

    def join_sections(*concepts):
        # Join the matched sections with a space so their text does not run together
        return " ".join(matched_sections[c] for c in concepts if c in matched_sections)

    summary = {
        "Problem/Motivation": join_sections("motivation", "introduction"),
        "Key Innovations": join_sections("method", "innovation"),
        "Results": join_sections("experiment", "result"),
        "Related Work": join_sections("related work", "background"),
    }

    # Optional LLM refinement
    # if USE_LLM:
    #     for category in summary:
    #         summary[category] = refine_with_llm(summary[category], category)

    return summary


df_sample = df.iloc[5]
print(df_sample['id'])
generate_summary(df_sample['id'])

704.0985


{'Problem/Motivation': 'Conventional embedded systems consist of a microcontroller and DSP components realized using Field programmable gate arrays (FPGA), Complex programmable logic arrays (CPLDs), etc. With the increasing trend of System on Chip (SoC) integrations, mixed signal design on the single chip has become achievable. Such systems are excessively used in the areas of wireless communication, networking, signal processing, multimedia and networking. In order to increase the quality of service (QoS) the embedded system needs to be fault tolerant, must consume low power, must have high life time and should be economically feasible. These services have become a common specification for all the embedded systems and consequently to attract attention from commercial market different researchers have come up with novel solutions to redefine the QoS of embedded systems. Future embedded systems consists of evolutionary techniques that repair, evolve and adapt themselves to the condition