# **Install Dependencies**
This cell installs all required packages and downloads the spaCy model.

In [1]:
# Install required libraries (run this cell once)
!pip install spacy nltk transformers beautifulsoup4 pymupdf
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.8.0
  Using cached https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


# **Import Libraries and Setup**
This cell imports necessary libraries and initializes the BigBird-Pegasus summarization pipeline.

In [9]:
# Import required libraries
import re
import nltk
import spacy
import fitz
from transformers import pipeline

# Download NLTK punkt if not already done
nltk.download('punkt')
nltk.download('punkt_tab')

# Load spaCy model
nlp = spacy.load("en_core_web_sm")

# Initialize the BigBird-Pegasus summarization pipeline for PubMed texts
summarizer = pipeline("summarization", model="google/bigbird-pegasus-large-pubmed")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
Device set to use cpu


# **Helper Functions: PDF Reading and Cleaning**
This cell contains functions for reading PDF files with a content filter and cleaning the extracted text.

In [10]:
def read_pdf_with_content_filter(file_path, keywords=("Abstract", "Introduction", "Methods", "Results", "Conclusions")):
    """
    Reads a PDF file and returns text only from pages that contain one of the specified keywords.
    This helps exclude pages that mainly contain header/metadata.
    """
    lowered_keywords = [keyword.lower() for keyword in keywords]
    content_pages = []
    # Open the document in a context manager so the file handle is released afterwards.
    with fitz.open(file_path) as doc:
        for page in doc:
            page_text = page.get_text()
            page_text_lower = page_text.lower()  # lowercase once per page, not once per keyword
            if any(keyword in page_text_lower for keyword in lowered_keywords):
                content_pages.append(page_text)
    return "\n".join(content_pages)

def clean_text(text):
    """
    Cleans the text by removing citations, extra whitespace, and unwanted characters.
    """
    text = re.sub(r'\[\d+\]', '', text)  # Remove citations like [12]
    text = re.sub(r'\(\d+\)', '', text)  # Remove citations like (3)
    text = re.sub(r'\s+', ' ', text)     # Normalize whitespace
    return text.strip()
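
As a quick check of what `clean_text` does, the sketch below repeats the function so that it runs on its own and applies it to two made-up strings (the samples are illustrative, not taken from the paper). Two side effects are worth knowing: the `\(\d+\)` pattern removes any parenthesized number, so a year such as `(2005)` is dropped along with numeric citations, and the space left in front of punctuation after a removed citation is not cleaned up.

```python
import re

def clean_text(text):
    # Same three substitutions as the helper above.
    text = re.sub(r'\[\d+\]', '', text)  # bracketed citations like [12]
    text = re.sub(r'\(\d+\)', '', text)  # parenthesized numbers like (3)
    text = re.sub(r'\s+', ' ', text)     # collapse runs of whitespace
    return text.strip()

# Illustrative input: two citations, a blank line, and irregular spacing.
sample = "Circumcision reduced risk [12] in trials (3).\n\nSee   Table 1."
cleaned = clean_text(sample)
print(cleaned)

# A parenthesized year is removed as well, which may or may not be desired.
year_sample = "First published (2005) in a journal."
year_cleaned = clean_text(year_sample)
print(year_cleaned)
```

If years in parentheses need to survive, one option is to restrict the second pattern to one or two digits, at the cost of missing citation numbers above 99.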

# **Helper Functions: Core Section Extraction**
This cell includes functions to extract core sections from text and remove header metadata as a fallback.


In [11]:
def extract_core_sections(text):
    """
    Attempts to extract core sections using common headings.
    Returns a dictionary with section name (lowercase) as key and its content as value.
    """
    pattern = r'(?i)(Abstract|Introduction|Methods|Results|Conclusions|Discussion)\s*[:\n\.]'
    splits = re.split(pattern, text)
    sections = {}
    if len(splits) > 1:
        for i in range(1, len(splits), 2):
            heading = splits[i].strip().lower()
            content = splits[i+1].strip() if i+1 < len(splits) else ""
            sections[heading] = content
    return sections

def remove_header_metadata(text, marker="Competing Interests:"):
    """
    Removes header/metadata from the text by using a marker.
    If the marker is found, returns text after it; otherwise, returns the original text.
    """
    idx = text.find(marker)
    if idx != -1:
        return text[idx + len(marker):].strip()
    return text
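
The sketch below exercises both helpers on short made-up strings, using condensed copies of the functions so that the block runs on its own. It also shows a limitation of the heading pattern: because it is case-insensitive and accepts a period as the delimiter, an ordinary sentence ending in "results." is treated as a section heading. The sample texts are assumptions for illustration only.

```python
import re

def extract_core_sections(text):
    # Condensed copy of the helper above: split on heading + delimiter.
    pattern = r'(?i)(Abstract|Introduction|Methods|Results|Conclusions|Discussion)\s*[:\n\.]'
    splits = re.split(pattern, text)
    sections = {}
    for i in range(1, len(splits), 2):
        sections[splits[i].strip().lower()] = splits[i + 1].strip()
    return sections

def remove_header_metadata(text, marker="Competing Interests:"):
    # Copy of the fallback helper: keep only the text after the marker.
    idx = text.find(marker)
    if idx != -1:
        return text[idx + len(marker):].strip()
    return text

# Well-formed headings are split as intended.
toy = "Abstract: Circumcision lowers risk. Methods: A randomized trial. Results: Fewer infections."
toy_sections = extract_core_sections(toy)
print(toy_sections)

# An in-sentence "results." is mistaken for a heading and truncates the abstract.
pitfall = "Abstract: We summarize the results. Details follow."
pitfall_sections = extract_core_sections(pitfall)
print(pitfall_sections)

# Fallback: everything up to and including the marker is discarded.
header = "Title and authors Competing Interests: none declared. Body text."
body = remove_header_metadata(header)
print(body)
```

Since repeated headings overwrite earlier entries in the dictionary, a stricter pattern (for example, requiring the heading to start a line in the uncleaned text) may give more reliable sections.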

# **Helper Functions: Chunking and Summarization**
This cell defines functions to split text into chunks, summarize text, and format bullet points.


In [12]:
def split_into_chunks(text, chunk_size=500):
    """
    Splits the text into chunks of approximately chunk_size words.
    """
    words = text.split()
    chunks = []
    for i in range(0, len(words), chunk_size):
        chunk = " ".join(words[i:i+chunk_size])
        chunks.append(chunk)
    return chunks

def summarize_text(text, max_length=200, min_length=50):
    """
    Summarizes the given text using BigBird-Pegasus.
    Adjusts output lengths if the input is very short.
    """
    input_length = len(text.split())
    if input_length < 60:
        max_length = min(max_length, 40)
        min_length = min(min_length, 10)
    summary = summarizer(text, max_length=max_length, min_length=min_length, do_sample=False)
    return summary[0]['summary_text']

def format_bullet_points(summary):
    """
    Splits the summary into sentences and formats each as a bullet point.
    """
    sentences = nltk.sent_tokenize(summary)
    bullets = ["- " + sentence for sentence in sentences]
    return "\n".join(bullets)
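
To make the chunking behavior concrete, the sketch below runs a condensed copy of `split_into_chunks` on synthetic text and then shows a hypothetical alternative, `split_by_sentences`, which is not part of the notebook. The word-based splitter can cut a sentence in half at a chunk boundary. The alternative packs whole sentences greedily up to the word budget, assuming the sentences have already been produced by something like `nltk.sent_tokenize`.

```python
def split_into_chunks(text, chunk_size=500):
    # Condensed copy of the helper above: fixed-size word windows.
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]

def split_by_sentences(sentences, chunk_size=500):
    # Hypothetical variant: never split inside a sentence.
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n_words = len(sentence.split())
        # Start a new chunk when adding this sentence would exceed the budget.
        if current and count + n_words > chunk_size:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n_words
    if current:
        chunks.append(" ".join(current))
    return chunks

# 1,200 synthetic words produce two full chunks and one remainder.
text = " ".join(f"w{i}" for i in range(1200))
sizes = [len(chunk.split()) for chunk in split_into_chunks(text, chunk_size=500)]
print(sizes)  # [500, 500, 200]

# Sentence-aware packing with a small budget of 8 words per chunk.
sentences = ["a b c .", "d e .", "f g h i ."]
sentence_chunks = split_by_sentences(sentences, chunk_size=8)
print(sentence_chunks)  # ['a b c . d e .', 'f g h i .']
```

A single sentence longer than the budget still becomes its own oversized chunk in this variant, so it is a sketch rather than a complete replacement.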

# **Helper Function: Post-Processing (Wrap Paragraph)**
This cell defines a function to convert bullet points into a neatly wrapped paragraph using the textwrap module.


In [13]:
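import textwrap  # needed by textwrap.fill below

def bullet_to_paragraph_wrapped(bullet_text, width=80):
    """
    Converts bullet point summary into a paragraph and wraps the text to a specified width.
    """
    # Strip only line-leading bullet markers so spaced hyphens inside sentences
    # (e.g. "long - term") are preserved, then replace the model's "<n>" newline tokens.
    paragraph = re.sub(r'(?m)^- ', '', bullet_text).replace("<n>", " ")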
    paragraph = re.sub(r'\s+', ' ', paragraph).strip()
    wrapped_paragraph = textwrap.fill(paragraph, width=width)
    return wrapped_paragraph
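
The following sketch shows the wrapping step on a small made-up bullet list. `wrap_summary` is a hypothetical stand-in that follows the same steps as the function above (drop line-leading bullet markers, replace the `<n>` newline tokens that the model emits, collapse whitespace, wrap), defined here only so that the block runs on its own.

```python
import re
import textwrap

def wrap_summary(bullet_text, width=80):
    # Hypothetical stand-in for the helper above.
    # Remove "- " only at the start of a line, then replace "<n>" tokens.
    paragraph = re.sub(r'(?m)^- ', '', bullet_text).replace("<n>", " ")
    paragraph = re.sub(r'\s+', ' ', paragraph).strip()
    return textwrap.fill(paragraph, width=width)

# Illustrative input in the tokenized style the summarizer produces.
bullets = (
    "- a long - term follow - up strategy is essential .<n> it should be planned early .\n"
    "- results are summarized below ."
)
wrapped = wrap_summary(bullets, width=40)
print(wrapped)
```

Each output line is at most `width` characters long, and the spaced hyphens inside "long - term" are left untouched because only markers at the start of a line are removed.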

# **Main Pipeline**
This cell integrates all helper functions: reading the PDF, cleaning text, extracting core sections (or using fallback), chunking, summarizing each chunk, and then generating a final bullet point summary.

In [14]:
# Main Pipeline with Improved Extraction

# Step 1: Read the PDF using a content filter to exclude pages that lack core content.
pdf_file_path = '/content/pmed.0020298.pdf'  # Replace with your actual PDF path
full_text = read_pdf_with_content_filter(pdf_file_path)
print("Filtered extracted text length (characters):", len(full_text))

# Step 2: Clean the extracted text.
cleaned_text = clean_text(full_text)
print("\n--- First 1000 characters of cleaned text ---\n")
print(cleaned_text[:1000])

# Step 3: Attempt to extract core sections from the cleaned text.
sections = extract_core_sections(cleaned_text)
print("\nExtracted sections:", list(sections.keys()))

# Step 4: Fallback if no core sections are found.
if not sections:
    print("\nNo clear sections detected. Using fallback extraction to remove header metadata.")
    core_text = remove_header_metadata(cleaned_text)
else:
    # Combine sections in a preferred order if found.
    order = ['abstract', 'introduction', 'methods', 'results', 'conclusions', 'discussion']
    core_content = [sections[sec] for sec in order if sec in sections]
    core_text = " ".join(core_content) if core_content else cleaned_text

print("\n--- Combined Core Text Preview (first 1000 characters) ---\n")
print(core_text[:1000])

# Step 5: Split the core text into manageable chunks.
chunks = split_into_chunks(core_text, chunk_size=500)
print("\nNumber of core text chunks:", len(chunks))

# Step 6: Summarize each chunk individually.
chunk_summaries = []
for i, chunk in enumerate(chunks):
    print(f"\nSummarizing core chunk {i+1}/{len(chunks)}...")
    try:
        chunk_summary = summarize_text(chunk, max_length=200, min_length=50)
    except Exception as e:
        print(f"Error in chunk {i+1}: {e}")
        chunk_summary = ""
    chunk_summaries.append(chunk_summary)
    print("Chunk summary:", chunk_summary)

# Step 7: Combine all chunk summaries and perform a final summarization.
final_core_summary_text = " ".join(chunk_summaries)
final_summary = summarize_text(final_core_summary_text, max_length=200, min_length=50)
print("\nFinal Summary:\n", final_summary)

# Step 8: Format the final summary as bullet points.
bullet_points = format_bullet_points(final_summary)
print("\nBullet Point Summary from Core Content:\n", bullet_points)

Filtered extracted text length (characters): 41943

--- First 1000 characters of cleaned text ---

Randomized, Controlled Intervention Trial of Male Circumcision for Reduction of HIV Infection Risk: The ANRS 1265 Trial Bertran Auvert1,2,3,4*, Dirk Taljaard5, Emmanuel Lagarde2,4, Joe¨lle Sobngwi-Tambekou2, Re´mi Sitta2,4, Adrian Puren6 1 Hoˆpital Ambroise-Pare´, Assitance Publique—Hoˆpitaux de Paris, Boulogne, France, 2 INSERM U 687, Saint-Maurice, France, 3 University Versailles Saint-Quentin, Versailles, France, 4 IFR 69, Villejuif, France, 5 Progressus, Johannesburg, South Africa, 6 National Institute for Communicable Disease, Johannesburg, South Africa Competing Interests: The authors have declared that no competing interests exist. Author Contributions: BA designed the study with DT, EL, and AP. DT and AP were responsible for operational aspects, including laboratory and field work and in- country administration of the study. BA monitored the study with input from EL and wrote the 

Attention type 'block_sparse' is not possible if sequence_length: 618 <= num global tokens: 2 * config.block_size + min. num sliding tokens: 3 * config.block_size + config.num_random_blocks * config.block_size + additional buffer: config.num_random_blocks * config.block_size = 704 with config.block_size = 64, config.num_random_blocks = 3. Changing attention type to 'original_full'...


Chunk summary: this paper provides an overview of recent developments in the field of infectious diseases .<n> it is anticipated that the information gained from these studies will be useful for understanding the emergence of new pathogens and for improving the control of infections in the future .<n> the paper is divided into three sections .<n> the first section deals with general principles of infection control . in this section ,<n> authors have discussed general principles of infection control , including the role of the immune system and innate and adaptive immunity .<n> the second section is dedicated to specific aspects of infection control , including immunological principles .<n> the third section is dedicated to animal models of infection . in the first section ,<n> authors have discussed general principles of infection control , including the role of the immune system and innate and adaptive immunity . in the second section ,<n> authors have discussed specific aspects of in

Your max_length is set to 200, but your input_length is only 137. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=68)


Chunk summary: efforts to prevent men from becoming infected with hiv have been hampered by lack of knowledge about risk factors .<n> circumcised men have higher chances of becoming infected with hiv , compared to uncircumcised men .<n> this study was aimed at understanding risk factors for hiv infection and ways to reduce men from becoming infected .

Summarizing core chunk 14/14...
Chunk summary: key clinical messageatypical manifestation of huntington 's disease ( hd ) is an autosomal dominant loss of hair and teeth .<n> the distinctive feature of hd is that the affected individual often does not reconstitute his or her lost hair and/or teeth over the course of the disease .<n> therefore , a well - designed long - term follow - up strategy is essential for patients with hd .

Final Summary:
 this paper provides an overview of recent developments in the field of infectious diseases .<n> authors have discussed general principles of infection control , including the role of the immune 
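
The `max_length` warning in the output above appears when the requested output length exceeds the length of the input, which happens for the short final chunk and for the combined chunk summaries. One way to avoid it is to derive both limits from the input size before calling the summarizer. The helper below, `scaled_lengths`, is a hypothetical sketch and not part of the notebook. It uses the word count as a rough proxy for the token count that the pipeline actually checks, and the `ratio` of 0.5 is an assumed default.

```python
def scaled_lengths(input_words, max_length=200, min_length=50, ratio=0.5):
    """Hypothetical helper: cap the output limits relative to the input size."""
    # Never ask for more than `ratio` times the input, with a small floor of 8.
    scaled_max = max(8, min(max_length, int(input_words * ratio)))
    # Keep the minimum at or below half of the maximum so the two never conflict.
    scaled_min = max(1, min(min_length, scaled_max // 2))
    return scaled_max, scaled_min

print(scaled_lengths(500))  # (200, 50): long input keeps the defaults
print(scaled_lengths(100))  # (50, 25): limits shrink with the input
print(scaled_lengths(10))   # (8, 4): the floor applies to very short input

# Intended use inside summarize_text (not executed here):
#   max_len, min_len = scaled_lengths(len(text.split()))
#   summarizer(text, max_length=max_len, min_length=min_len, do_sample=False)
```

Counting tokens with the model's tokenizer instead of words would match the pipeline's check more closely.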

# **Post-Processing: Final Paragraph Summary**
This cell converts the bullet point summary into a wrapped paragraph for a clean final output.

In [16]:
import textwrap
# Convert bullet points to a neatly wrapped paragraph.
paragraph_summary_wrapped = bullet_to_paragraph_wrapped(bullet_points, width=80)
print("\nFinal Paragraph Summary (wrapped):\n", paragraph_summary_wrapped)


Final Paragraph Summary (wrapped):
 this paper provides an overview of recent developments in the field of
infectious diseases . authors have discussed general principles of infection
control , including the role of the immune system and innate and adaptive
immunity . authors have discussed specific aspects of infection control ,
including the role of the immune system and innate and adaptive immunity .
authors have also discussed animal models of infection , which provide an
understanding of how organisms control the body shape up in response to external
forces . in the future , it is expected that the information gained from these
studies will be useful for understanding the emergence of new pathogens and for
improving the control of infections in the future .
