<a href="https://colab.research.google.com/github/Aarohi-jain84/scientific-text-summarizer/blob/main/Scientific_Text_Summarizer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
# 📦 Install the libraries this notebook imports (NLTK, scikit-learn, Sumy)
!pip install nltk scikit-learn sumy



In [6]:
# 📚 Import required libraries and download NLTK resources
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lsa import LsaSummarizer
import warnings
warnings.filterwarnings("ignore")

# Tokenizer models for sent_tokenize/word_tokenize, plus the stopword list used below
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords', quiet=True)  # required by stopwords.words('english')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [8]:
# 🧼 Preprocess: Tokenize text, remove stopwords, and filter meaningful words
sample_text = """
Graph neural networks (GNNs) have emerged as a powerful framework for representation learning on graphs.
They achieve state-of-the-art results on tasks such as node classification, link prediction, and graph classification.
In this work, we investigate how architectural choices influence the performance of GNNs and propose new variants
with improved expressiveness and efficiency.
"""

# Convert to lowercase and tokenize
words = word_tokenize(sample_text.lower())

# Remove stopwords and non-alphanumeric tokens
stop_words = set(stopwords.words('english'))
filtered_words = [word for word in words if word.isalnum() and word not in stop_words]

print("Filtered Words:", filtered_words)

Filtered Words: ['graph', 'neural', 'networks', 'gnns', 'emerged', 'powerful', 'framework', 'representation', 'learning', 'graphs', 'achieve', 'results', 'tasks', 'node', 'classification', 'link', 'prediction', 'graph', 'classification', 'work', 'investigate', 'architectural', 'choices', 'influence', 'performance', 'gnns', 'propose', 'new', 'variants', 'improved', 'expressiveness', 'efficiency']
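
Note that `state-of-the-art` from the sample text is missing from the filtered words: `word_tokenize` keeps it as a single token, and `str.isalnum()` rejects any token that contains a hyphen. If such terms matter, a slightly more permissive filter is one option. The sketch below is illustrative only; `keep_token` and the short token list are hypothetical and not part of the notebook.

```python
import re

def keep_token(token: str) -> bool:
    """Accept lowercase alphanumeric tokens, including internally hyphenated ones."""
    return re.fullmatch(r"[a-z0-9]+(?:-[a-z0-9]+)*", token) is not None

# Hypothetical tokens, similar to what word_tokenize produces on lowercased text
tokens = ["state-of-the-art", "results", ",", "(", "gnns", "-"]

strict = [t for t in tokens if t.isalnum()]     # the filter used in the cell above
relaxed = [t for t in tokens if keep_token(t)]  # also keeps hyphenated words

print(strict)   # ['results', 'gnns']
print(relaxed)  # ['state-of-the-art', 'results', 'gnns']
```

Stopword removal would still be applied on top of either filter, exactly as in the cell above.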


In [9]:
# 📝 Summarization using Sumy's LSA method

# Convert the text into a parser-friendly format
parser = PlaintextParser.from_string(sample_text, Tokenizer("english"))

# Initialize the LSA summarizer
lsa_summarizer = LsaSummarizer()

# Generate the summary (limit to 2 sentences)
summary = lsa_summarizer(parser.document, 2)

# Display the summary
print("🔍 LSA Summary:")
for sentence in summary:
    print("-", sentence)


🔍 LSA Summary:
- Graph neural networks (GNNs) have emerged as a powerful framework for representation learning on graphs.
- In this work, we investigate how architectural choices influence the performance of GNNs and propose new variants with improved expressiveness and efficiency.
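
For intuition about what an LSA summarizer does, the sketch below scores sentences from a singular value decomposition of a small term-by-sentence count matrix. It is not Sumy's implementation: the function name `lsa_sentence_scores`, the toy matrix, and the choice of `n_topics` are all assumptions made for illustration, and the scoring rule (singular-value-weighted length of each sentence in the reduced topic space) is one common variant of LSA ranking.

```python
import numpy as np

def lsa_sentence_scores(term_sentence: np.ndarray, n_topics: int) -> np.ndarray:
    """Score each sentence (column) by its weighted length in the top latent topics."""
    # Rows of vt are latent topics; columns of vt correspond to sentences
    _, sigma, vt = np.linalg.svd(term_sentence, full_matrices=False)
    k = min(n_topics, len(sigma))
    weighted = (sigma[:k, None] * vt[:k, :]) ** 2
    return np.sqrt(weighted.sum(axis=0))

# Hypothetical counts: 4 terms (rows) across 3 sentences (columns)
A = np.array([
    [1, 0, 1],
    [1, 1, 0],
    [0, 1, 0],
    [0, 0, 2],
], dtype=float)

scores = lsa_sentence_scores(A, n_topics=2)

# Pick the two highest-scoring sentences, then restore document order
top2 = sorted(np.argsort(scores)[::-1][:2].tolist())
print(scores, top2)
```

When every topic is kept, this score reduces to the Euclidean norm of each sentence's column, so the ranking only differs from a plain length measure once low-weight topics are dropped.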


In [10]:
# 🧠 Summarization using TF-IDF sentence scoring

# Step 1: Split the text into sentences, collapsing the line breaks inside sample_text
sentences = [" ".join(sent.split()) for sent in sent_tokenize(sample_text)]

# Step 2: Build a sentence-by-term TF-IDF matrix (each sentence is treated as one document)
vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = vectorizer.fit_transform(sentences)

# Step 3: Score each sentence by the sum of its TF-IDF weights
sentence_scores = tfidf_matrix.sum(axis=1).A1.tolist()

# Step 4: Pair scores with sentences and sort by score, highest first
scored_sentences = sorted(zip(sentence_scores, sentences), reverse=True)

# Step 5: Extract top 2 sentences
top_sentences = [sent for score, sent in scored_sentences[:2]]

# Step 6: Display the summary
print("🧩 TF-IDF Summary:")
for sent in top_sentences:
    print("-", sent)


🧩 TF-IDF Summary:
- In this work, we investigate how architectural choices influence the performance of GNNs and propose new variants with improved expressiveness and efficiency.
- Graph neural networks (GNNs) have emerged as a powerful framework for representation learning on graphs.
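
Summing TF-IDF weights per sentence tends to favor sentences with more distinct terms, which is worth keeping in mind when reading the ranking above. The sketch below recomputes the row sums in plain Python using the formula that `TfidfVectorizer` applies by default (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, L2-normalized rows). The function name `tfidf_row_sums` and the two toy sentences are hypothetical.

```python
import math
from collections import Counter

def tfidf_row_sums(tokenized_sentences):
    """Sum of L2-normalized TF-IDF weights per sentence (smoothed IDF)."""
    n = len(tokenized_sentences)
    # Document frequency: in how many sentences each term appears
    df = Counter(term for sent in tokenized_sentences for term in set(sent))
    sums = []
    for sent in tokenized_sentences:
        tf = Counter(sent)
        weights = [
            count * (math.log((1 + n) / (1 + df[term])) + 1)
            for term, count in tf.items()
        ]
        norm = math.sqrt(sum(w * w for w in weights))
        sums.append(sum(weights) / norm if norm else 0.0)
    return sums

# Hypothetical pre-tokenized sentences: one short, one long
docs = [
    ["graphs", "matter"],
    ["graphs", "support", "node", "classification", "tasks"],
]
scores = tfidf_row_sums(docs)
print(scores)  # the longer sentence receives the larger sum
```

For a sentence with `m` distinct terms, the normalized row sum lies between 1 and `sqrt(m)`, so dividing the score by `sqrt(m)` is one simple way to reduce the length bias.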


In [12]:
# ✅ Final Example: Full pipeline demonstration
text_to_summarize = """
Artificial Intelligence (AI) is transforming scientific research. With capabilities in automating data analysis, AI systems can process enormous datasets,
generate hypotheses, and even assist in experimental design. Natural Language Processing (NLP) specifically aids in mining scientific literature by summarizing key findings,
extracting concepts, and linking related works. Tools like TF-IDF, sentence embeddings, and deep learning summarizers are gaining popularity in research institutions.
"""

print("📌 Original Text:")
print(text_to_summarize)

# Preprocess (shown for illustration; the summarizers below work on the raw sentences)
words = word_tokenize(text_to_summarize.lower())
filtered_words = [word for word in words if word.isalnum() and word not in stop_words]
filtered_text = ' '.join(filtered_words)

# TF-IDF Summary: tokenize once, collapse line breaks, and pick the highest-scoring sentence
sentences = [" ".join(sent.split()) for sent in sent_tokenize(text_to_summarize)]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(sentences)
scores = X.sum(axis=1).A1
top_sentence_index = scores.argmax()
print("\n📝 TF-IDF Summary:")
print(sentences[top_sentence_index])

# LSA Summary
parser = PlaintextParser.from_string(text_to_summarize, Tokenizer("english"))
lsa_summary = LsaSummarizer()(parser.document, 2)
print("\n🧠 LSA Summary:")
for sentence in lsa_summary:
    print("-", sentence)


📌 Original Text:

Artificial Intelligence (AI) is transforming scientific research. With capabilities in automating data analysis, AI systems can process enormous datasets,
generate hypotheses, and even assist in experimental design. Natural Language Processing (NLP) specifically aids in mining scientific literature by summarizing key findings,
extracting concepts, and linking related works. Tools like TF-IDF, sentence embeddings, and deep learning summarizers are gaining popularity in research institutions.


📝 TF-IDF Summary:
Natural Language Processing (NLP) specifically aids in mining scientific literature by summarizing key findings, extracting concepts, and linking related works.

🧠 LSA Summary:
- Natural Language Processing (NLP) specifically aids in mining scientific literature by summarizing key findings, extracting concepts, and linking related works.
- Tools like TF-IDF, sentence embeddings, and deep learning summarizers are gaining popularity in research institutions.
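
The TF-IDF summary for the GNN sample lists sentences by score, whereas the Sumy LSA output keeps them in their original order. If a multi-sentence TF-IDF summary should read in document order as well, the selected indices can be re-sorted before printing. This is a minimal sketch; `top_k_in_document_order` and the sample scores are hypothetical and not part of the notebook.

```python
def top_k_in_document_order(scores, sentences, k):
    """Return the k highest-scoring sentences, in the order they appear in the text."""
    # Rank sentence indices by score, highest first
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    # Keep the top k, then restore their original positions
    chosen = sorted(ranked[:k])
    return [sentences[i] for i in chosen]

# Hypothetical sentences and scores
sentences = ["First sentence.", "Second sentence.", "Third sentence."]
scores = [1.2, 0.4, 2.5]

print(top_k_in_document_order(scores, sentences, k=2))
# ['First sentence.', 'Third sentence.']
```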
