In [1]:
!pip install pymupdf

Collecting pymupdf
  Downloading pymupdf-1.25.4-cp39-abi3-win_amd64.whl (16.6 MB)
Installing collected packages: pymupdf
Successfully installed pymupdf-1.25.4




# Extracting text from PDF

In [2]:
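import fitz  # PyMuPDF

def extract_text_from_pdf(pdf_path):
    """Return the plain text of every page, with a newline after each page."""
    # The context manager closes the file even if extraction fails
    with fitz.open(pdf_path) as doc:
        # get_text("text") returns the plain text of one page
        return "".join(page.get_text("text") + "\n" for page in doc)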

# Example usage
pdf_text = extract_text_from_pdf("quantum.pdf")
print(pdf_text[:1000])  # Print first 1000 characters


arXiv:2503.09776v1  [cs.NI]  12 Mar 2025
A Short Scalability Study on the SeQUeNCe
Parallel Quantum Network Simulator
Aaron Welch
Computational Sciences &
Engineering Division
Oak Ridge National Laboratory, USA
welchda@ornl.gov
Mariam Kiran
Computational Sciences &
Engineering Division
Oak Ridge National Laboratory, USA
kiranm@ornl.gov
Abstract—As quantum networking continues to grow in im-
portance, its study is of interest to an ever wider community and
at an increasing scale. However, the development of its physical
infrastructure remains burdensome, and services providing third-
party access are not enough to meet demand. A variety of
simulation frameworks provide a method for testing aspects of
such systems on commodity hardware, but are predominantly
serial and thus unable to scale to larger networks and/or
workloads. One effort to address this was focused on parallelising
the SeQUeNCe discrete event simulator, though it has yet to be
proven to work well across system architectur

# Removing newlines, extra spaces, citation numbers

In [3]:
import re

def clean_text(text):
    # Remove numeric citation markers such as [1] or [12] first,
    # so the gaps they leave are collapsed by the next step
    text = re.sub(r'\[[0-9]+\]', '', text)
    # Collapse every run of whitespace (newlines included) into a single space
    text = re.sub(r'\s+', ' ', text)
    return text.strip()

cleaned_text = clean_text(pdf_text)
print(cleaned_text[:1000])


arXiv:2503.09776v1 [cs.NI] 12 Mar 2025 A Short Scalability Study on the SeQUeNCe Parallel Quantum Network Simulator Aaron Welch Computational Sciences & Engineering Division Oak Ridge National Laboratory, USA welchda@ornl.gov Mariam Kiran Computational Sciences & Engineering Division Oak Ridge National Laboratory, USA kiranm@ornl.gov Abstract—As quantum networking continues to grow in im- portance, its study is of interest to an ever wider community and at an increasing scale. However, the development of its physical infrastructure remains burdensome, and services providing third- party access are not enough to meet demand. A variety of simulation frameworks provide a method for testing aspects of such systems on commodity hardware, but are predominantly serial and thus unable to scale to larger networks and/or workloads. One effort to address this was focused on parallelising the SeQUeNCe discrete event simulator, though it has yet to be proven to work well across system architectures
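
The cleaned output above still contains words that the PDF split with line-end hyphens, such as "im- portance". Below is a minimal sketch of one way to rejoin them, assuming it runs on the raw extracted text before whitespace is collapsed. `join_hyphenated_words` is a hypothetical helper, not part of the original notebook. It is a heuristic: a genuine compound broken at a line end ("third-" / "party") is merged as well.

```python
import re

def join_hyphenated_words(text):
    # A lowercase letter, a hyphen, a line break, then another lowercase letter
    # is treated as one word that was split across two lines
    return re.sub(r'(?<=[a-z])-\s*\n\s*(?=[a-z])', '', text)

raw = "continues to grow in im-\nportance, its study"
print(join_hyphenated_words(raw))
```

Hyphens inside a line, as in "ultra-secure", are left untouched because the pattern requires a line break after the hyphen.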

# Break Long Research Papers into Chunks for Summarization
* Most models have a token limit (e.g., T5 checkpoints are typically used with inputs of up to 512 tokens, and `facebook/bart-large-cnn` accepts up to 1,024).
* Documents longer than the limit must be split into smaller chunks before summarization.

# 📌 1️⃣ Define Chunk Size (Based on Model Limitations)
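* T5: Can process 512 tokens (~400 words) at a time; BART (`facebook/bart-large-cnn`) accepts up to 1,024 tokens (~750 words).
* Longformer & BigBird: Can process 4,096+ tokens (but are slower).
* Best Practice: Split text into chunks of ~400 words, which fits within both the T5 and BART limits.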
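
The word budgets above come from a rough words-to-tokens ratio. Here is a small sketch of that arithmetic, assuming about 1.3 tokens per English word. That ratio is a rule of thumb, not a property of any particular tokenizer, and `max_words_for` is a hypothetical helper name.

```python
def max_words_for(token_limit, tokens_per_word=1.3):
    # tokens_per_word is an assumed average; real ratios vary by tokenizer and text
    return int(token_limit / tokens_per_word)

print(max_words_for(512))   # word budget for a 512-token model
print(max_words_for(4096))  # word budget for a 4,096-token model
```

For an exact count, tokenize the chunk with the model's own tokenizer and measure its length instead of estimating.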

In [4]:
def split_text_into_chunks(text, max_words=400):
    words = text.split()  # Split text into words on any whitespace
    # Take consecutive slices of max_words words; the last slice may be shorter
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]

# Example usage
text_chunks = split_text_into_chunks(cleaned_text, max_words=400)
print(f"Total chunks: {len(text_chunks)}")
print(text_chunks[0])  # Print the first chunk


Total chunks: 10
arXiv:2503.09776v1 [cs.NI] 12 Mar 2025 A Short Scalability Study on the SeQUeNCe Parallel Quantum Network Simulator Aaron Welch Computational Sciences & Engineering Division Oak Ridge National Laboratory, USA welchda@ornl.gov Mariam Kiran Computational Sciences & Engineering Division Oak Ridge National Laboratory, USA kiranm@ornl.gov Abstract—As quantum networking continues to grow in im- portance, its study is of interest to an ever wider community and at an increasing scale. However, the development of its physical infrastructure remains burdensome, and services providing third- party access are not enough to meet demand. A variety of simulation frameworks provide a method for testing aspects of such systems on commodity hardware, but are predominantly serial and thus unable to scale to larger networks and/or workloads. One effort to address this was focused on parallelising the SeQUeNCe discrete event simulator, though it has yet to be proven to work well across sys

# 📌 3️⃣ Ensure Chunks Don't Cut Off Mid-Sentence
* Instead of splitting blindly every 400 words, we can split by sentences.

In [5]:
import nltk
nltk.download('punkt')  # Newer NLTK releases may ask for 'punkt_tab' instead
from nltk.tokenize import sent_tokenize

def split_text_smart(text, max_words=400):
    sentences = sent_tokenize(text)  # Split into sentences
    chunks = []
    current_chunk = []
    word_count = 0  # Running word total for the current chunk

    for sentence in sentences:
        current_chunk.append(sentence)
        word_count += len(sentence.split())
        # The check runs after appending, so a chunk can overshoot
        # max_words by up to one sentence
        if word_count >= max_words:
            chunks.append(" ".join(current_chunk))
            current_chunk = []
            word_count = 0

    # Add last chunk if not empty
    if current_chunk:
        chunks.append(" ".join(current_chunk))

    return chunks

# Example usage
text_chunks = split_text_smart(cleaned_text, max_words=400)
print(f"Total smart chunks: {len(text_chunks)}")
print(text_chunks[0])  # Print first chunk


Total smart chunks: 9
arXiv:2503.09776v1 [cs.NI] 12 Mar 2025 A Short Scalability Study on the SeQUeNCe Parallel Quantum Network Simulator Aaron Welch Computational Sciences & Engineering Division Oak Ridge National Laboratory, USA welchda@ornl.gov Mariam Kiran Computational Sciences & Engineering Division Oak Ridge National Laboratory, USA kiranm@ornl.gov Abstract—As quantum networking continues to grow in im- portance, its study is of interest to an ever wider community and at an increasing scale. However, the development of its physical infrastructure remains burdensome, and services providing third- party access are not enough to meet demand. A variety of simulation frameworks provide a method for testing aspects of such systems on commodity hardware, but are predominantly serial and thus unable to scale to larger networks and/or workloads. One effort to address this was focused on parallelising the SeQUeNCe discrete event simulator, though it has yet to be proven to work well acros

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Adeeshunnikrishnan\AppData\Roaming\nltk_data.
[nltk_data]     ..
[nltk_data]   Package punkt is already up-to-date!
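
`split_text_smart` appends a sentence before it checks the budget, so a chunk can exceed `max_words` by up to one sentence. Below is a sketch of a variant that checks first, and so stays within the budget whenever no single sentence is longer than the budget. To keep the example self-contained it uses a naive regex splitter as a stand-in for `sent_tokenize`. `split_text_capped` and `naive_sentences` are hypothetical names.

```python
import re

def naive_sentences(text):
    # Stand-in for nltk's sent_tokenize: split after ., ! or ? followed by whitespace
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

def split_text_capped(text, max_words=400, sentence_splitter=naive_sentences):
    chunks, current, count = [], [], 0
    for sentence in sentence_splitter(text):
        n = len(sentence.split())
        # Close the current chunk first if this sentence would push it over budget
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks

demo = "One two three. Four five six. Seven eight nine ten."
print(split_text_capped(demo, max_words=6))
```

In the notebook, passing `sentence_splitter=sent_tokenize` would reuse the NLTK tokenizer loaded above.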


# 🔹  Summarize Each Chunk Using a Hugging Face Model

## 📌 1️⃣ Choose a Summarization Model  
Hugging Face provides several summarization models:  

| Model | Max Token Limit | Best For |
|------------|-----------------|-------------|
| `facebook/bart-large-cnn` | **1024 tokens (~750 words)** | **General summarization (news, research, etc.)** |
| `t5-small`, `t5-base`, `t5-large` | **512 tokens (~400 words)** | **Flexible text summarization** |
| `google/pegasus-xsum` | **512 tokens (~400 words)** | **Extremely short summaries** |

**Best Choice for Our Task:**  
- **BART (`facebook/bart-large-cnn`)**, for two reasons:
  - ✅ It handles long text (up to 1024 tokens), so each ~400-word chunk fits in a single input.
  - ✅ It produces coherent and informative summaries.

In [6]:
!pip install transformers






In [7]:
from transformers import pipeline

# Load the summarization model
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

# Example test
test_text = "Natural language processing (NLP) has revolutionized many fields, including machine translation, text summarization, and conversational AI. With the advent of deep learning, NLP models have become more sophisticated, enabling better understanding of human language."
summary = summarizer(test_text, max_length=50, min_length=10, do_sample=False)
print(summary[0]['summary_text'])







Device set to use cpu


Natural language processing (NLP) has revolutionized many fields, including machine translation. With the advent of deep learning, NLP models have become more sophisticated, enabling better understanding of human language.


# 📌 3️⃣ Summarize Each Chunk
* Now, we summarize each chunk from our previous step.

In [8]:
summarized_chunks = []
for chunk in text_chunks:
    # truncation=True guards against a chunk that exceeds the model's input limit
    result = summarizer(chunk, max_length=100, min_length=20, do_sample=False, truncation=True)
    summarized_chunks.append(result[0]['summary_text'])

# Print first summarized chunk
print(summarized_chunks[0])


Quantum networks are being designed to develop ultra-secure and highly accurate sensor networks for science and commercial applications. Current implementations are limited to a 300 km distance. SeQUeNCe addresses this limitation by enabling parallel discrete event simulation that can scale across many processes or nodes.


# 📌 4️⃣ Save Summarized Data to CSV
* Once we have the summaries, let’s save them for future use.

In [9]:
import pandas as pd

summary_df = pd.DataFrame({"original_text": text_chunks, "summary": summarized_chunks})
summary_df.to_csv("summarized_research_papers.csv", index=False)

print("Summarization complete! Results saved to 'summarized_research_papers.csv'.")


Summarization complete! Results saved to 'summarized_research_papers.csv'.
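
One quick sanity check on the saved summaries is to compare word counts between each chunk and its summary. This is a sketch only: `compression_ratio` is a hypothetical helper, and the sample strings are illustrative rather than taken from the paper.

```python
def compression_ratio(original_text, summary):
    # Ratio of summary words to original words; lower means stronger compression
    original_words = len(original_text.split())
    if original_words == 0:
        return 0.0
    return len(summary.split()) / original_words

original = "Quantum networks are hard to build so researchers rely on simulators"
summary = "Researchers rely on simulators"
print(round(compression_ratio(original, summary), 2))
```

With the DataFrame above, the same function could be applied row by row, for example `summary_df.apply(lambda r: compression_ratio(r["original_text"], r["summary"]), axis=1)`.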


# 📌 1️⃣ Choose an NER Model

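* ✅ Model for Our Task:
* We will use `dslim/bert-base-NER`, a BERT model fine-tuned on CoNLL-2003 that tags persons (PER), organizations (ORG), locations (LOC), and miscellaneous entities (MISC).
* `allenai/scibert_scivocab_uncased` is pretrained on scientific papers, but it is not fine-tuned for NER, so it cannot be used directly in an `ner` pipeline.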

In [10]:
from transformers import pipeline

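# Load NER model; aggregation_strategy="simple" merges word pieces into whole
# entities and replaces the deprecated grouped_entities=True flag
ner_pipeline = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")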

# Example text
test_text = "This research was conducted by John Doe at MIT in collaboration with Google Research."
entities = ner_pipeline(test_text)

# Print extracted entities
print(entities)


Device set to use cpu


[{'entity_group': 'PER', 'score': 0.99362546, 'word': 'John Doe', 'start': 31, 'end': 39}, {'entity_group': 'ORG', 'score': 0.99840766, 'word': 'MIT', 'start': 43, 'end': 46}, {'entity_group': 'ORG', 'score': 0.9990243, 'word': 'Google Research', 'start': 69, 'end': 84}]
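
Each result is a dictionary holding an entity type, a confidence score, the matched text, and character offsets. Here is a sketch of one way to drop low-confidence hits and bucket the rest by type. `group_entities` is a hypothetical helper, and the 0.9 threshold is an arbitrary assumption.

```python
from collections import defaultdict

def group_entities(entities, min_score=0.9):
    # Keep entities at or above min_score and bucket their text by entity type
    grouped = defaultdict(list)
    for ent in entities:
        if ent["score"] >= min_score:
            grouped[ent["entity_group"]].append(ent["word"])
    return dict(grouped)

# Sample copied from the pipeline output above
sample = [
    {"entity_group": "PER", "score": 0.99362546, "word": "John Doe", "start": 31, "end": 39},
    {"entity_group": "ORG", "score": 0.99840766, "word": "MIT", "start": 43, "end": 46},
    {"entity_group": "ORG", "score": 0.9990243, "word": "Google Research", "start": 69, "end": 84},
]
print(group_entities(sample))
```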


# 📌 3️⃣ Extract Named Entities from Summarized Chunks

* Now, let’s extract key details from our summarized research paper chunks.
* ✅ Explanation:
  * The loop applies NER to each summarized chunk.
  * The entities found in each chunk are stored in `extracted_entities`, one list per chunk.

In [11]:
extracted_entities = []
for summary in summarized_chunks:
    entities = ner_pipeline(summary)
    extracted_entities.append(entities)

# Print entities from the first summarized chunk
print(extracted_entities[0])


[{'entity_group': 'ORG', 'score': 0.9056627, 'word': 'SeQUeNC', 'start': 195, 'end': 202}]
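
The entity above is reported as 'SeQUeNC', one character short of the name used in the summary. Each entity carries `start` and `end` character offsets, so one possible repair is to widen the span to the surrounding word. This is a sketch: `expand_to_word` is a hypothetical helper, and the example string is shortened.

```python
def expand_to_word(text, start, end):
    # Move start left and end right while the neighbouring characters are alphanumeric
    while start > 0 and text[start - 1].isalnum():
        start -= 1
    while end < len(text) and text[end].isalnum():
        end += 1
    return text[start:end]

text = "SeQUeNCe addresses this limitation."
print(expand_to_word(text, 0, 7))
```

In the loop above, this would be called with the summary string and each entity's `start` and `end` values.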


# 📌 4️⃣ Save Extracted Entities to CSV

In [12]:
entity_list = []

for idx, entities in enumerate(extracted_entities):
    for entity in entities:
        entity_list.append({
            "summary_chunk": summarized_chunks[idx],
            "entity": entity['word'],
            "entity_type": entity['entity_group']
        })

# Convert to DataFrame
entity_df = pd.DataFrame(entity_list)

# Save to CSV
entity_df.to_csv("extracted_entities.csv", index=False)

print("Entity extraction complete! Results saved to 'extracted_entities.csv'.")


Entity extraction complete! Results saved to 'extracted_entities.csv'.
