# Text Summarization of Research Papers Using Pre-trained Models

In [1]:
!pip install keybert
!pip install PyMuPDF




In [19]:
!pip install rapidfuzz

Collecting rapidfuzz
  Downloading rapidfuzz-3.13.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Downloading rapidfuzz-3.13.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.1 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m3.1/3.1 MB[0m [31m25.9 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: rapidfuzz
Successfully installed rapidfuzz-3.13.0


In [10]:
!pip install rouge

Collecting rouge
  Downloading rouge-1.0.1-py3-none-any.whl.metadata (4.1 kB)
Downloading rouge-1.0.1-py3-none-any.whl (13 kB)
Installing collected packages: rouge
Successfully installed rouge-1.0.1


In [9]:
import fitz
import textwrap
import torch
from google.colab import files
from tqdm import tqdm
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Define Summarization Models
models = {
    "DistilBART": "sshleifer/distilbart-cnn-12-6",
    "BART": "facebook/bart-large-cnn",
    "T5": "t5-base",

}

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

#  Upload and Extract Text from PDF
def extract_text_from_pdf(pdf_path):
    text = ""
    with fitz.open(pdf_path) as doc:
        for page in doc:
            text += page.get_text("text") + " "
    return text.strip()

# Split text into chunks (~900 tokens per chunk)
def split_text(text, chunk_size=5000):
    return textwrap.wrap(text, width=chunk_size)

#  Summarization Function
def summarize_text(model_name, model_path, text_chunks):
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_path).to(device)

    summarized_chunks = []

    print(f"\n🔹 **{model_name} Summarization in Progress...** 🔹\n")
    for i, chunk in enumerate(tqdm(text_chunks, desc=f"{model_name}", unit="chunk")):
        inputs = tokenizer("summarize: " + chunk, return_tensors="pt", truncation=True, max_length=1024).to(device)
        summary_ids = model.generate(inputs.input_ids, max_length=200, min_length=50, num_beams=4)
        summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

        summarized_chunks.append(summary)
        print(f"\n🔹 **Chunk {i+1} Summary:**\n{summary}\n")

    return " ".join(summarized_chunks)

#  Upload PDF File
print("\n **Upload a Research Paper PDF:**")
uploaded_file = files.upload()
pdf_filename = list(uploaded_file.keys())[0]

# Extract and Process PDF Text
text = extract_text_from_pdf(pdf_filename)
if not text:
    print("\nError: Extracted text is empty! The PDF may contain only images.")
else:
    chunks = split_text(text)

    #  Summarize Using Each Model
    for model_name, model_path in models.items():
        final_summary = summarize_text(model_name, model_path, chunks)
        print(f"\n🔹 **Final {model_name} Summary:**\n{final_summary}\n")



 **Upload a Research Paper PDF:**


Saving 10.pdf to 10 (1).pdf

🔹 **DistilBART Summarization in Progress...** 🔹



DistilBART:  33%|███▎      | 1/3 [00:32<01:05, 32.79s/chunk]


🔹 **Chunk 1 Summary:**
 Propositions to Reconsider the Organization of a Scientiﬁc Workshop by Christoph Schommer, University of Luxembourg . The idea of a workshop has changed; it is less a meeting place of researchers who share a common research interest but more a market .



DistilBART:  67%|██████▋   | 2/3 [00:56<00:27, 27.29s/chunk]


🔹 **Chunk 2 Summary:**
 A solid review frame may contain a general statement of the reviewer, the ﬁnal decision if the contribution is to be accepted or rejected, as well as the own conﬁdence to the subject . All of the authors receive black points that range from 1 to k with a maximal limit of max . Contact persons of the accepted paper receive up to k black points, whereas co-authors receive only 1 .



DistilBART: 100%|██████████| 3/3 [01:12<00:00, 24.18s/chunk]


🔹 **Chunk 3 Summary:**
 The word workshop originates from to work out and remembers in form and content a study group . Social things like lunch (should be served in situ in order to avoid to hazard the progressing of the workshop) or a walk (a common excursion) should be done together .


🔹 **Final DistilBART Summary:**
 Propositions to Reconsider the Organization of a Scientiﬁc Workshop by Christoph Schommer, University of Luxembourg . The idea of a workshop has changed; it is less a meeting place of researchers who share a common research interest but more a market .  A solid review frame may contain a general statement of the reviewer, the ﬁnal decision if the contribution is to be accepted or rejected, as well as the own conﬁdence to the subject . All of the authors receive black points that range from 1 to k with a maximal limit of max . Contact persons of the accepted paper receive up to k black points, whereas co-authors receive only 1 .  The word workshop originates from to w




config.json:   0%|          | 0.00/1.58k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/1.63G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/363 [00:00<?, ?B/s]


🔹 **BART Summarization in Progress...** 🔹



BART:  33%|███▎      | 1/3 [00:29<00:58, 29.22s/chunk]


🔹 **Chunk 1 Summary:**
The idea of a basic understanding of a workshop has changed. It is less a meeting place of researchers who share a common research interest but more a market. A half-day workshop is mostly ﬁnished at lunchtime, speakers are sometimes not present and unexcused. The number of participants do seldom exceed the number of talks.



BART:  67%|██████▋   | 2/3 [00:55<00:27, 27.36s/chunk]


🔹 **Chunk 2 Summary:**
Workshops are important as they concern with current research issues much more detailed than conferences do. Workshops consist exclusively of presentations that are ordered. Some con- tributions live from practical demonstrations or build on other presentations. Presentations are pressed into a time schedule with a limited amount of time for questions.



BART: 100%|██████████| 3/3 [01:19<00:00, 26.66s/chunk]


🔹 **Chunk 3 Summary:**
summarize: example a video conferencing through Skype. Social things like lunch (should be served in situ in order to avoid to hazard the progressing of the workshop) or a walk (a common excursion) should be done together. Linking with a panel sessions that is open to everyone without any fee.


🔹 **Final BART Summary:**
The idea of a basic understanding of a workshop has changed. It is less a meeting place of researchers who share a common research interest but more a market. A half-day workshop is mostly ﬁnished at lunchtime, speakers are sometimes not present and unexcused. The number of participants do seldom exceed the number of talks. Workshops are important as they concern with current research issues much more detailed than conferences do. Workshops consist exclusively of presentations that are ordered. Some con- tributions live from practical demonstrations or build on other presentations. Presentations are pressed into a time schedule with a limited am




config.json:   0%|          | 0.00/1.21k [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.39M [00:00<?, ?B/s]

Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


model.safetensors:   0%|          | 0.00/892M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]


🔹 **T5 Summarization in Progress...** 🔹



T5:  33%|███▎      | 1/3 [00:36<01:13, 36.54s/chunk]


🔹 **Chunk 1 Summary:**
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 



T5:  67%|██████▋   | 2/3 [01:03<00:31, 31.14s/chunk]


🔹 **Chunk 2 Summary:**
each author who is missing unexcused is to be blacklisted . all of the authors receive black points that range from 1 to k with a maximum limit of max . if an author exceeds the given limit, he is barred from presentation for a period of time .



T5: 100%|██████████| 3/3 [01:22<00:00, 27.38s/chunk]


🔹 **Chunk 3 Summary:**
if we still follow a publication mainstream, we will not advance . if we still follow a publication mainstream, we will never see the world outside of our field of vision . if we still follow a publication mainstream, we will not advance .


🔹 **Final T5 Summary:**
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  each author who is missing unexcused is to be blacklisted . all of the authors receive black points that range from 1 to k with a maximum limit of max . if an author exceeds the given limit, he is barred from presentation for a period of time . if we still follow a publication mainstream, we will not advance . if we still follow a publication mainstream, we will never see the world outside of our field of vision . if we still follow a publication mainstream, we will not advance .






# Evaluation of Summarization Models Using ROUGE Scores

In [12]:
from rouge import Rouge

models = {
    "DistilBART": "sshleifer/distilbart-cnn-12-6",
    "BART": "facebook/bart-large-cnn",
    "T5": "t5-base",

}

#  Compute ROUGE Scores
def compute_rouge_scores(original_text, summarized_text):
    rouge = Rouge()
    scores = rouge.get_scores(summarized_text, original_text, avg=True)
    return scores

# Store Precomputed Summaries
precomputed_summaries = {}

#  Summarize Once and Store Results
for model_name, model_path in models.items():
    print(f"\n**Summarizing with {model_name}...** \n")
    precomputed_summaries[model_name] = summarize_text(model_name, model_path, chunks)

#  Store Scores
model_scores = {}

#  Evaluate Each Precomputed Summary
for model_name, summary in precomputed_summaries.items():
    print(f"\n**Evaluating {model_name} Summary...** \n")

    rouge_scores = compute_rouge_scores(text, summary)  # Compute ROUGE scores

    # Display Scores
    print(f"\n **{model_name} ROUGE Scores:**")
    print(f"ROUGE-1: {rouge_scores['rouge-1']}")
    print(f"ROUGE-2: {rouge_scores['rouge-2']}")
    print(f"ROUGE-L: {rouge_scores['rouge-l']}\n")

    model_scores[model_name] = rouge_scores

#  Rank Models Based on ROUGE-1 F1 Score
ranked_models = sorted(model_scores.items(), key=lambda x: x[1]['rouge-1']['f'], reverse=True)

print("\n **Model Rankings Based on ROUGE-1 F1 Score:**")
for rank, (model_name, scores) in enumerate(ranked_models, 1):
    print(f"{rank}. {model_name} - ROUGE-1 F1: {scores['rouge-1']['f']:.4f}")



**Summarizing with DistilBART...** 


🔹 **DistilBART Summarization in Progress...** 🔹



DistilBART:  33%|███▎      | 1/3 [00:21<00:43, 21.60s/chunk]


🔹 **Chunk 1 Summary:**
 Propositions to Reconsider the Organization of a Scientiﬁc Workshop by Christoph Schommer, University of Luxembourg . The idea of a workshop has changed; it is less a meeting place of researchers who share a common research interest but more a market .



DistilBART:  67%|██████▋   | 2/3 [00:49<00:25, 25.17s/chunk]


🔹 **Chunk 2 Summary:**
 A solid review frame may contain a general statement of the reviewer, the ﬁnal decision if the contribution is to be accepted or rejected, as well as the own conﬁdence to the subject . All of the authors receive black points that range from 1 to k with a maximal limit of max . Contact persons of the accepted paper receive up to k black points, whereas co-authors receive only 1 .



DistilBART: 100%|██████████| 3/3 [01:06<00:00, 22.07s/chunk]


🔹 **Chunk 3 Summary:**
 The word workshop originates from to work out and remembers in form and content a study group . Social things like lunch (should be served in situ in order to avoid to hazard the progressing of the workshop) or a walk (a common excursion) should be done together .


**Summarizing with BART...** 







🔹 **BART Summarization in Progress...** 🔹



BART:  33%|███▎      | 1/3 [00:29<00:58, 29.29s/chunk]


🔹 **Chunk 1 Summary:**
The idea of a basic understanding of a workshop has changed. It is less a meeting place of researchers who share a common research interest but more a market. A half-day workshop is mostly ﬁnished at lunchtime, speakers are sometimes not present and unexcused. The number of participants do seldom exceed the number of talks.



BART:  67%|██████▋   | 2/3 [00:56<00:27, 27.80s/chunk]


🔹 **Chunk 2 Summary:**
Workshops are important as they concern with current research issues much more detailed than conferences do. Workshops consist exclusively of presentations that are ordered. Some con- tributions live from practical demonstrations or build on other presentations. Presentations are pressed into a time schedule with a limited amount of time for questions.



BART: 100%|██████████| 3/3 [01:21<00:00, 27.15s/chunk]


🔹 **Chunk 3 Summary:**
summarize: example a video conferencing through Skype. Social things like lunch (should be served in situ in order to avoid to hazard the progressing of the workshop) or a walk (a common excursion) should be done together. Linking with a panel sessions that is open to everyone without any fee.


**Summarizing with T5...** 







🔹 **T5 Summarization in Progress...** 🔹



T5:  33%|███▎      | 1/3 [00:37<01:14, 37.36s/chunk]


🔹 **Chunk 1 Summary:**
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 



T5:  67%|██████▋   | 2/3 [01:03<00:31, 31.02s/chunk]


🔹 **Chunk 2 Summary:**
each author who is missing unexcused is to be blacklisted . all of the authors receive black points that range from 1 to k with a maximum limit of max . if an author exceeds the given limit, he is barred from presentation for a period of time .



T5: 100%|██████████| 3/3 [01:22<00:00, 27.51s/chunk]


🔹 **Chunk 3 Summary:**
if we still follow a publication mainstream, we will not advance . if we still follow a publication mainstream, we will never see the world outside of our field of vision . if we still follow a publication mainstream, we will not advance .


**Evaluating DistilBART Summary...** 







 **DistilBART ROUGE Scores:**
ROUGE-1: {'r': 0.12967581047381546, 'p': 0.9811320754716981, 'f': 0.22907488780560464}
ROUGE-2: {'r': 0.07837687604224569, 'p': 0.94, 'f': 0.14468958298147136}
ROUGE-L: {'r': 0.12967581047381546, 'p': 0.9811320754716981, 'f': 0.22907488780560464}


**Evaluating BART Summary...** 


 **BART ROUGE Scores:**
ROUGE-1: {'r': 0.1371571072319202, 'p': 0.9565217391304348, 'f': 0.23991275680309773}
ROUGE-2: {'r': 0.0744858254585881, 'p': 0.881578947368421, 'f': 0.13736545217675175}
ROUGE-L: {'r': 0.1371571072319202, 'p': 0.9565217391304348, 'f': 0.23991275680309773}


**Evaluating T5 Summary...** 


 **T5 ROUGE Scores:**
ROUGE-1: {'r': 0.06234413965087282, 'p': 0.9433962264150944, 'f': 0.11695906316457032}
ROUGE-2: {'r': 0.030572540300166758, 'p': 0.7857142857142857, 'f': 0.05885500195421855}
ROUGE-L: {'r': 0.06234413965087282, 'p': 0.9433962264150944, 'f': 0.11695906316457032}


 **Model Rankings Based on ROUGE-1 F1 Score:**
1. BART - ROUGE-1 F1: 0.2399
2. Distil

# Performance Evaluation of Summarization Models: Speed and Token Efficiency

In [13]:
import time
import torch

#  Measure Inference Time and Token Efficiency
def measure_speed_efficiency(model_name, model_path, text_chunks):
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_path).to(device)

    total_time = 0
    total_tokens_input = 0
    total_tokens_output = 0

    print(f"\n **Measuring Speed for {model_name}...** ")

    for chunk in tqdm(text_chunks, desc=f" {model_name}", unit="chunk"):
        inputs = tokenizer("summarize: " + chunk, return_tensors="pt", truncation=True, max_length=1024).to(device)
        total_tokens_input += inputs.input_ids.shape[1]  # Count input tokens

        start_time = time.time()
        summary_ids = model.generate(inputs.input_ids, max_length=200, min_length=50, num_beams=4)
        end_time = time.time()

        total_time += (end_time - start_time)
        total_tokens_output += summary_ids.shape[1]

    avg_time_per_chunk = total_time / len(text_chunks)
    token_efficiency = total_tokens_output / total_tokens_input

    print(f"**{model_name} Performance:**")
    print(f" Avg Time per Chunk: {avg_time_per_chunk:.4f} sec")
    print(f" Token Efficiency: {token_efficiency:.4f} (Output/Input ratio)\n")

    return {
        "avg_time_per_chunk": avg_time_per_chunk,
        "token_efficiency": token_efficiency
    }

# Evaluate Each Model for Speed & Efficiency
performance_metrics = {}
for model_name, model_path in models.items():
    performance_metrics[model_name] = measure_speed_efficiency(model_name, model_path, chunks)

# Rank Models by Speed (Lower is Better)
ranked_by_speed = sorted(performance_metrics.items(), key=lambda x: x[1]['avg_time_per_chunk'])

print("\n **Model Rankings by Speed (Faster is Better):**")
for rank, (model_name, metrics) in enumerate(ranked_by_speed, 1):
    print(f"{rank}. {model_name} - Avg Time: {metrics['avg_time_per_chunk']:.4f} sec")

# Rank Models by Token Efficiency (Higher is Better)
ranked_by_efficiency = sorted(performance_metrics.items(), key=lambda x: x[1]['token_efficiency'], reverse=True)

print("\n **Model Rankings by Token Efficiency (Higher is Better):**")
for rank, (model_name, metrics) in enumerate(ranked_by_efficiency, 1):
    print(f"{rank}. {model_name} - Token Efficiency: {metrics['token_efficiency']:.4f}")



 **Measuring Speed for DistilBART...** 


 DistilBART: 100%|██████████| 3/3 [00:56<00:00, 18.75s/chunk]


**DistilBART Performance:**
 Avg Time per Chunk: 18.7376 sec
 Token Efficiency: 0.0706 (Output/Input ratio)


 **Measuring Speed for BART...** 


 BART: 100%|██████████| 3/3 [01:27<00:00, 29.33s/chunk]


**BART Performance:**
 Avg Time per Chunk: 29.3163 sec
 Token Efficiency: 0.0696 (Output/Input ratio)


 **Measuring Speed for T5...** 


 T5: 100%|██████████| 3/3 [01:37<00:00, 32.60s/chunk]

**T5 Performance:**
 Avg Time per Chunk: 32.5813 sec
 Token Efficiency: 0.0817 (Output/Input ratio)


 **Model Rankings by Speed (Faster is Better):**
1. DistilBART - Avg Time: 18.7376 sec
2. BART - Avg Time: 29.3163 sec
3. T5 - Avg Time: 32.5813 sec

 **Model Rankings by Token Efficiency (Higher is Better):**
1. T5 - Token Efficiency: 0.0817
2. DistilBART - Token Efficiency: 0.0706
3. BART - Token Efficiency: 0.0696





In [None]:
import re
import nltk
import torch
from google.colab import files
from tqdm import tqdm
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import fitz

nltk.download("punkt_tab")

# Load full BART model (Use GPU if available)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Facebook BART model for summarization (full version)
bart_name = "facebook/bart-large-cnn"
bart_tokenizer = AutoTokenizer.from_pretrained(bart_name)
bart_model = AutoModelForSeq2SeqLM.from_pretrained(bart_name).to(device)

# Extract text from PDF
def extract_text_with_pymupdf(pdf_path):
    text = ""
    try:
        with fitz.open(pdf_path) as doc:
            for page in doc:
                text += page.get_text("text")

        if not text.strip():
            raise ValueError("Error: Extracted text is empty! The PDF may contain only images.")

        print("Successfully extracted text!")
        return text

    except Exception as e:
        print(f"PDF Extraction Failed: {e}")
        return None


# Split text into chunks
def split_text(text, chunk_size=1000):
    sentences = nltk.sent_tokenize(text)
    chunks, current_chunk = [], []
    chunk_length = 0

    for sentence in sentences:
        words = sentence.split()
        if chunk_length + len(words) <= chunk_size:
            current_chunk.append(sentence)
            chunk_length += len(words)
        else:
            chunks.append(" ".join(current_chunk))
            current_chunk, chunk_length = [sentence], len(words)

    if current_chunk:
        chunks.append(" ".join(current_chunk))
    return chunks if chunks else [text]

# Summarization function
def summarize_with_bart(text, max_length=200, min_length=50):
    if not text.strip():
        return "No valid content to summarize."

    inputs = bart_tokenizer.encode("summarize: " + text, return_tensors="pt", truncation=True, max_length=1024).to(device)
    summary_ids = bart_model.generate(inputs, max_length=max_length, min_length=min_length, num_beams=3)
    return bart_tokenizer.decode(summary_ids[0], skip_special_tokens=True)

# Extract keywords based on sentence importance using BART (simplified method)
def extract_keywords_bart(text, num_keywords=5):
    if not text.strip():
        return []

    # Summarize the text and extract keywords based on the summary's relevance
    summary = summarize_with_bart(text, max_length=50)
    keywords = summary.split()[:num_keywords]  # Take the first few words from the summary as "keywords"

    return keywords

# Generate structured key points (with bullet points)
def generate_key_points(text, max_length=100):
    if not text.strip():
        return "No valid text for key points."

    inputs = bart_tokenizer.encode("summarize: " + text, return_tensors="pt", truncation=True, max_length=512).to(device)
    summary_ids = bart_model.generate(inputs, max_length=max_length, num_beams=2)
    summary = bart_tokenizer.decode(summary_ids[0], skip_special_tokens=True)

    # Return key points as bullet points
    key_points = summary.split('. ')
    bullet_points = "\n- ".join([point.strip() for point in key_points if point.strip()])
    return f"- {bullet_points}" if bullet_points else "No key points found."

# Process a single chunk
def process_chunk(chunk):
    summary = summarize_with_bart(chunk)
    keywords = extract_keywords_bart(chunk)
    key_points = generate_key_points(chunk)
    return summary, keywords, key_points

# **Main Execution**
print("Upload a research paper PDF:")
pdf_path = list(files.upload().keys())[0]

# Corrected: Using extract_text_with_pymupdf to extract text from PDF
text = extract_text_with_pymupdf(pdf_path)

if text:
    chunks = split_text(text)

    print("\n **Processing Research Paper...** ")

    results = []
    all_keywords = set()
    all_key_points = []

    # Sequential execution with progress bar
    for i, chunk in enumerate(tqdm(chunks, desc=" Summarizing", unit="chunk")):
        summary, keywords, key_points = process_chunk(chunk)
        results.append(summary)
        all_keywords.update(keywords)
        all_key_points.append(key_points)

    #  Print final outputs
    print("\n🔹 **Final Merged Summary:** 🔹\n")
    print("\n\n".join(results))

    print("\n**Combined Key Terms:** \n")
    print(", ".join(all_keywords))

    print("\n **Final Key Takeaways:** \n")
    print("\n".join(all_key_points))

else:
    print(" Extraction failed. Try another file.")


[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


Upload a research paper PDF:


Saving 10.pdf to 10 (3).pdf
Successfully extracted text!

 **Processing Research Paper...** 


 Summarizing: 100%|██████████| 3/3 [03:33<00:00, 71.24s/chunk]


🔹 **Final Merged Summary:** 🔹

The Organization of a Scientiﬁc Workshop. Propositions to Reconsider the Organization of the Workshop. The Motivation for the Workshop and the Motivation of its Participants. The Goals of the workshop. The Conclusions.

Each author must pay the registration fee and must declare to be present at the workshop. If the presenter fails unexcused, all authors and co-authors are blacklisted (see 2.3). With this, the workshop can become more credible; participants may be sure that the presentation takes place.

Pinging experts hit-or-miss is a good way to enrich the workshop. Non-present experts could be contacted by telephone, skype, or other video-conferencing machines. Social things like lunch (should be served in situ in order to avoid to hazard the progressing of the workshop) should be done together.

**Combined Key Terms:** 

is, author, the, of, Organization, Each, experts, pay, Scientiﬁc, The, must, Pinging, a, hit-or-miss

 **Final Key Takeaways:** 

-




# Highlighting Key Information in Research Papers Based on Keyword Matching

In [23]:
import os
import nltk
from nltk.corpus import stopwords
import fitz

# Download NLTK stopwords
nltk.download("stopwords")
stop_words = set(stopwords.words('english'))

# Function to preprocess and clean text (removes stop words)
def preprocess_text(text):
    words = text.split()
    words = [word for word in words if word.lower() not in stop_words]
    return " ".join(words)

# Function to highlight sentences in the PDF based on matching key terms
def highlight_sentences_in_pdf(pdf_path, key_points, threshold=75):
    try:
        with fitz.open(pdf_path) as doc:
            # Loop through all pages in the PDF
            for page_num in range(len(doc)):
                page = doc.load_page(page_num)
                text_instances = []

                # Loop through the key points and search for matching sentences in the page
                for key_point in key_points:
                    # Preprocess the key point (remove stopwords)
                    preprocessed_key_point = preprocess_text(key_point)

                    # Extract sentences from the page
                    sentences = page.get_text("text").split("\n")

                    for sentence in sentences:
                        # Preprocess each sentence from the PDF (remove stopwords)
                        preprocessed_sentence = preprocess_text(sentence)

                        # Split the key point and sentence into words
                        key_terms = set(preprocessed_key_point.split())
                        sentence_terms = set(preprocessed_sentence.split())


                        if len(sentence_terms & key_terms) >= 1:
                            text_instances += page.search_for(sentence)

                # Highlight the matched sentences
                for inst in text_instances:
                    page.add_highlight_annot(inst)

            # Get the original file name and append "highlighted" to it
            base_name = os.path.splitext(os.path.basename(pdf_path))[0]
            highlighted_pdf_path = f"/content/{base_name}_highlighted.pdf"

            # Save the PDF with highlights
            doc.save(highlighted_pdf_path)

            print(f" Highlighted PDF saved as {highlighted_pdf_path}")
            return highlighted_pdf_path  # Return the path for downloading

    except Exception as e:
        print(f"Error highlighting sentences in PDF: {e}")
        return None

# **Main Execution for Highlighting with Keyword Matching**
if text:
    print("\n🔹 **Highlighting All Key Points in PDF (with keyword matching)...** 🔹")

    # Use the already existing key points (from earlier execution block)
    combined_key_points = "\n".join(all_key_points)

    # Now highlight and get the correct highlighted PDF
    highlighted_pdf_path = highlight_sentences_in_pdf(pdf_path, all_key_points)

    if highlighted_pdf_path:
        # Allow the user to download the highlighted PDF
        files.download(highlighted_pdf_path)
    else:
        print(" Could not create the highlighted PDF.")


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!



🔹 **Highlighting All Key Points in PDF (with keyword matching)...** 🔹
 Highlighted PDF saved as /content/10 (2)_highlighted.pdf


<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

# Comparing AI Models for Extracting and Evaluating Research Methodology in Academic Summarie

In [16]:
import requests
import json

# Function to extract methodology using OpenRouter API
def extract_methodology_with_openrouter(summary_text, model_name):
    prompt = f"""
    Given the following summary of a research paper, please extract the specific research methodology used in the study. The methodology should include the research design, data collection methods, data analysis techniques, and any tools or frameworks used. Additionally, provide a brief description or explanation of the methodology.

    Summary: {summary_text}

    Methodology:
    """

    # API endpoint and headers
    url = "https://openrouter.ai/api/v1/chat/completions"
    headers = {
        "Authorization": "Bearer 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",
        "Content-Type": "application/json",
    }

    # Request data
    data = json.dumps({
        "model": model_name,
        "messages": [
            {
                "role": "user",
                "content": prompt
            }
        ]
    })

    # Send the request
    response = requests.post(url, headers=headers, data=data)

    # Parse the response
    if response.status_code == 200:
        response_data = response.json()
        return response_data['choices'][0]['message']['content'].strip()
    else:
        print(f"Error: {response.status_code}, {response.text}")
        return None


# Function to evaluate methodology output
def evaluate_methodology(methodology):
    # Score the methodology based on details, relevance, clarity, and completeness
    score = 0

    # Criteria for scoring:
    criteria = {
        "research_design": ["research design", "approach", "method", "study", "qualitative", "quantitative"],
        "data_collection": ["data collection", "survey", "interviews", "questionnaire", "observation", "samples"],
        "data_analysis": ["data analysis", "analysis techniques", "statistical", "coding", "qualitative", "quantitative"],
        "tools_frameworks": ["tools", "framework", "software", "methodology", "tool", "database", "framework"]
    }

    # Check for research design section
    if any(keyword in methodology.lower() for keyword in criteria["research_design"]):
        score += 1

    # Check for data collection section
    if any(keyword in methodology.lower() for keyword in criteria["data_collection"]):
        score += 1

    # Check for data analysis section
    if any(keyword in methodology.lower() for keyword in criteria["data_analysis"]):
        score += 1

    # Check for tools and frameworks section
    if any(keyword in methodology.lower() for keyword in criteria["tools_frameworks"]):
        score += 1

    return score


# Function to compare models based on their extracted methodology
def compare_models(summary_text):
    models = [
        "deepseek/deepseek-r1:free",
        "meta-llama/llama-3.2-3b-instruct:free"
    ]

    print("\n **Extracted Methodology using OpenRouter AI (Model Comparison):** ")
    outputs = {}
    scores = {}

    # Extract methodology from both models
    for model in models:
        print(f"\nExtracting methodology using model: {model}\n\n")
        methodology = extract_methodology_with_openrouter(summary_text, model)
        if methodology:
            print(f"Methodology extracted with {model}: \n{methodology[:300]}...")
            outputs[model] = methodology
            scores[model] = evaluate_methodology(methodology)
        else:
            print(f"Failed to extract methodology with {model}")

    # Compare the scores
    print("\n**Scoring Comparison**:")
    for model in models:
        print(f"{model}: Score = {scores[model]}")

    # Determine the best model
    best_model = max(scores, key=scores.get)
    print(f"\n**Best Model:** {best_model} with a score of {scores[best_model]}")


merged_summary = " ".join(results)

# Compare the outputs from both models and rank the best one
compare_models(merged_summary)



 **Extracted Methodology using OpenRouter AI (Model Comparison):** 

Extracting methodology using model: deepseek/deepseek-r1:free


Methodology extracted with deepseek/deepseek-r1:free: 
**Methodology:**  

**1. Research Design:**  
- **Workshop Framework Design:** The study employs a structured, case-based approach to redesigning the organization of a scientific workshop. The design focuses on credibility, participant accountability, and interactive engagement.  

**2. Data Collect...

Extracting methodology using model: meta-llama/llama-3.2-3b-instruct:free


Methodology extracted with meta-llama/llama-3.2-3b-instruct:free: 
Here is the extracted research methodology:

**Research Design:** The study appears to be a qualitative or mixed-methods study, as it involves a workshop or seminar where participants are expected to be present and engage with the content. The design is likely to be a case study or a pilot study, as...

**Scoring Comparison**:
deepseek/deepseek-r1:free: Score =

deepseek/deepseek-r1:free provided a more detailed and structured methodology, which is why it is considered the better option in this case.




# Methodology Extraction from Research Paper Summaries Using OpenRouter's deepseek-r1 Model

In [17]:
import requests
import json

# Function to extract methodology using OpenRouter API
def extract_methodology_with_openrouter(summary_text):
    # Define the prompt for extracting methodology
    prompt = f"""
    Given the following summary of a research paper, please extract the specific research methodology used in the study. The methodology should include the research design, data collection methods, data analysis techniques, and any tools or frameworks used. Additionally, provide a brief description or explanation of the methodology.

    Summary: {summary_text}

    Methodology:
    """

    # API endpoint and headers
    url = "https://openrouter.ai/api/v1/chat/completions"
    headers = {
        "Authorization": "Bearer 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",
        "Content-Type": "application/json",
    }

    # Request data
    data = json.dumps({
        "model": "deepseek/deepseek-r1:free",
        "messages": [
            {
                "role": "user",
                "content": prompt
            }
        ]
    })

    # Send the request
    response = requests.post(url, headers=headers, data=data)

    # Parse the response
    if response.status_code == 200:
        response_data = response.json()
        return response_data['choices'][0]['message']['content'].strip()
    else:
        print(f"Error: {response.status_code}, {response.text}")
        return None

# ✅ Extract Methodology from Merged Summary (using OpenRouter API)
merged_summary = " ".join(results)

print("\n**Extracted Methodology using OpenRouter AI:** ")
methodology_answer = extract_methodology_with_openrouter(merged_summary)
print(methodology_answer)



**Extracted Methodology using OpenRouter AI:** 
**Methodology:**

1. **Research Design:**  
   - **Descriptive Case Study:** The study employs a descriptive approach to outline a proposed framework for organizing a scientific workshop. It focuses on institutional mechanisms (e.g., registration fees, penalties for non-attendance) and participant engagement strategies to enhance credibility and ensure presentation delivery.

2. **Data Collection Methods:**  
   - **Registration and Attendance Records:** Mandatory fee payment and declaration of presence were used to track participant commitment.  
   - **Remote Expert Inclusion:** Non-present experts were engaged via video-conferencing tools (e.g., Skype, telephone) to diversify perspectives.  
   - **Observational Data:** Social interventions (e.g., communal lunches held *in situ*) and workshop progression were monitored to assess their impact on collaboration and schedule adherence.  

3. **Data Analysis Techniques:**  
   - **Complian

# Generating Creative Project Ideas from Research Paper Summaries Using OpenRouter's GPT-3.5 Model

In [24]:
import requests
import json

# Function to extract key sections from the paper
def extract_key_sections(summary_text):
    prompt = f"""
    Please extract the key sections from the following research paper summary. Focus on the abstract, conclusion, and future work sections.

    Summary: {summary_text}

    Key Sections:
    """

    url = "https://openrouter.ai/api/v1/chat/completions"
    headers = {
        "Authorization": "Bearer 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",  # Replace with your actual API key
        "Content-Type": "application/json",
    }

    data = json.dumps({
        "model": "gpt-3.5-turbo",
        "messages": [
            {"role": "user", "content": prompt}
        ]
    })

    response = requests.post(url, headers=headers, data=data)

    if response.status_code == 200:
        return response.json()['choices'][0]['message']['content'].strip()
    else:
        return None

# Function to generate project ideas based on extracted sections
def generate_project_ideas(summary_text):
    print("\n**Generating Project Ideas...**")
    key_sections = extract_key_sections(summary_text)

    if key_sections:


        prompt = f"""
        Please generate creative and feasible **project ideas** based on the following research paper summary.
        Use the context from the abstract, conclusion, and future work to suggest:
        - Real-world implementation ideas
        - Academic or industry research projects
        - Prototype or product development ideas
        - Applications of the findings

        Research Summary Key Sections:
        {key_sections}

        Project Ideas:
        """

        url = "https://openrouter.ai/api/v1/chat/completions"
        headers = {
            "Authorization": "Bearer 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",
            "Content-Type": "application/json",
        }

        data = json.dumps({
            "model": "gpt-3.5-turbo",
            "messages": [
                {"role": "user", "content": prompt}
            ]
        })

        response = requests.post(url, headers=headers, data=data)

        if response.status_code == 200:
            ideas = response.json()['choices'][0]['message']['content'].strip()
            print(f"\n Generated Project Ideas:\n{ideas}")
        else:
            print(f"Error generating project ideas: {response.status_code}, {response.text}")
    else:
        print("Failed to extract key sections. Project idea generation skipped.")

# Sample usage
# merged_summary = " ".join(results)  # If you have multiple summaries combined
merged_summary = " ".join(results)


# Generate project ideas from the research summary
generate_project_ideas(merged_summary)



**Generating Project Ideas...**

 Generated Project Ideas:
1. Online Workshop Platform Development: Create a virtual workshop platform that allows for seamless online participation for experts who are unable to attend in person. The platform can include features such as live streaming of presentations, interactive Q&A sessions, and networking opportunities for participants.

2. Workshop Attendance Tracking System: Develop a system for tracking presenter attendance at workshops to ensure all authors and co-authors are present. This system could use RFID technology, mobile check-ins, or other methods to monitor and enforce attendance policies.

3. Workshop Credibility Assessment Tool: Create a tool for evaluating the credibility of scientific workshops based on factors such as presenter attendance, participant engagement, and expert involvement. This tool can help organizers improve workshop quality and reputation.

4. Workshop Communication Strategy Consultation: Offer consulting servi