# 📌 Task 1 – Extractive Text Summarization with TextRank and Stopword Removal

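This notebook demonstrates extractive summarization using TF-IDF vectorization and cosine similarity. Each sentence is scored by its total similarity to every other sentence, which is a simplified, degree-based variant of TextRank's graph ranking. The highest-scoring sentences are then returned in their original order, giving a concise summary built from the most relevant sentences of the input text.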
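To make the idea concrete before the full implementation, here is a minimal, standard-library sketch of TF-IDF weighting and cosine similarity on three toy sentences. The helper names `tfidf_vectors` and `cosine` and the toy sentences are illustrative assumptions, and the weighting is a simplified formula rather than the exact output of `TfidfVectorizer`.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document (simplified TF-IDF)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed IDF, so terms found in every document keep a nonzero weight.
        vectors.append(
            {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t, c in tf.items()}
        )
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical toy sentences: the first two share vocabulary, the third does not.
toy_sentences = [
    "machine learning solves problems",
    "machine learning builds logic from data",
    "rockets reduce launch costs",
]
vecs = tfidf_vectors(toy_sentences)
sim_01 = cosine(vecs[0], vecs[1])  # shared terms -> positive similarity
sim_02 = cosine(vecs[0], vecs[2])  # no shared terms -> zero similarity
print(sim_01 > sim_02)
```

Sentences that share weighted vocabulary with many others accumulate high similarity, which is what the summarizer below exploits.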

In [24]:
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
import re
import joblib

# Download required nltk data
nltk.download('punkt')
nltk.download('stopwords')

from nltk.corpus import stopwords

# Define improved summary function
def generate_summary(text, num_sentences=4):
    # Step 1: Tokenize into sentences
    sentences = nltk.sent_tokenize(text)
    
    # Step 2: Clean sentences (remove non-alphanumeric chars)
    clean_sentences = [re.sub(r'\W+', ' ', sentence) for sentence in sentences]

    # Step 3: Remove stopwords for TF-IDF input
    stop_words = set(stopwords.words('english'))
    processed_sentences = []
    for sent in clean_sentences:
        words = sent.lower().split()
        filtered = [w for w in words if w not in stop_words]
        processed_sentences.append(" ".join(filtered))
    
    # Step 4: TF-IDF vectorization with tuned parameters
    vectorizer = TfidfVectorizer(stop_words='english', ngram_range=(1,2), sublinear_tf=True)
    tfidf_matrix = vectorizer.fit_transform(processed_sentences)
    
    # Step 5: Compute cosine similarity matrix between sentences
    similarity_matrix = cosine_similarity(tfidf_matrix, tfidf_matrix)

    # Step 6: Save vectorizer and matrix for reproducibility
    joblib.dump(vectorizer, "vectorizer.pkl")
    joblib.dump(tfidf_matrix, "tfidf_matrix.pkl")

    # Step 7: Score sentences by sum of similarity to all others
    scores = similarity_matrix.sum(axis=1)

    # Step 8: Pick top N sentences with highest scores
    top_indices = scores.argsort()[-num_sentences:][::-1]

    # Step 9: Return selected sentences in original order for coherence
    top_sentences = [sentences[i] for i in sorted(top_indices)]
    return " ".join(top_sentences)


[nltk_data] Downloading package punkt to C:\Users\Pratham
[nltk_data]     Modi\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to C:\Users\Pratham
[nltk_data]     Modi\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
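
The function above scores each sentence by summing its row of the similarity matrix. TextRank, as usually described, instead runs a PageRank-style iteration over the same sentence graph. The following is a minimal sketch of that alternative scoring, not the method used in this notebook. The function name `pagerank_scores`, the damping factor of 0.85, the fixed iteration count, and the toy matrix are all assumptions for illustration.

```python
def pagerank_scores(similarity, damping=0.85, iterations=50):
    """Score nodes of a weighted similarity graph by PageRank-style power iteration."""
    n = len(similarity)
    # Drop self-similarity so a sentence does not vote for itself.
    weights = [[0.0 if i == j else similarity[i][j] for j in range(n)] for i in range(n)]
    out_sums = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            # Each neighbour j passes on a share of its score proportional
            # to the weight of the edge j -> i.
            incoming = sum(
                scores[j] * weights[j][i] / out_sums[j]
                for j in range(n)
                if out_sums[j] > 0
            )
            new_scores.append((1 - damping) / n + damping * incoming)
        scores = new_scores
    return scores

# Hypothetical 4-sentence similarity matrix; sentence 3 is nearly unrelated to the rest.
toy_similarity = [
    [1.0, 0.6, 0.5, 0.0],
    [0.6, 1.0, 0.4, 0.1],
    [0.5, 0.4, 1.0, 0.0],
    [0.0, 0.1, 0.0, 1.0],
]
toy_scores = pagerank_scores(toy_similarity)
ranking = sorted(range(len(toy_scores)), key=lambda i: toy_scores[i], reverse=True)
print(ranking)
```

To try it in `generate_summary`, one could pass `similarity_matrix.tolist()` to this function and use the result in place of the row sums. On short texts the two scorings often select similar sentences.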


### Example 1 – Machine Learning

In [25]:
text = """
Machine learning is a field of computer science that gives computers the ability to learn without being explicitly programmed.
It has become a key technique for solving a wide range of problems in science and industry.
Applications of machine learning are everywhere: in recommendation systems, spam filtering, image recognition, self-driving cars, and more.
Instead of writing code to solve a problem, you feed data into a generic algorithm, and it builds its own logic based on that data.
This has opened up entirely new possibilities in automation and artificial intelligence.
"""

summary = generate_summary(text, num_sentences=4)
print("Summary:")
print(summary)

Summary:

Machine learning is a field of computer science that gives computers the ability to learn without being explicitly programmed. It has become a key technique for solving a wide range of problems in science and industry. Applications of machine learning are everywhere: in recommendation systems, spam filtering, image recognition, self-driving cars, and more. This has opened up entirely new possibilities in automation and artificial intelligence.


### Example 2 – Technology and Society

In [17]:
text = """The rapid evolution of technology has dramatically reshaped modern society. From smartphones and cloud computing to artificial intelligence and biotechnology, technological innovations have revolutionized the way we communicate, work, and live. These advancements have improved global connectivity, increased productivity, and opened up new opportunities for innovation in every sector. However, they have also introduced challenges such as digital addiction, privacy concerns, and job displacement due to automation. As society becomes more reliant on technology, it becomes crucial to ensure that ethical considerations guide development and that access to these tools is equitable across different populations."""

summary = generate_summary(text, num_sentences=4)
print("Summary:")
print(summary)

Summary:
The rapid evolution of technology has dramatically reshaped modern society. From smartphones and cloud computing to artificial intelligence and biotechnology, technological innovations have revolutionized the way we communicate, work, and live. These advancements have improved global connectivity, increased productivity, and opened up new opportunities for innovation in every sector. However, they have also introduced challenges such as digital addiction, privacy concerns, and job displacement due to automation.


### Example 3 – Space Exploration (Extended)

In [18]:
text = """Space exploration has long fascinated humanity, offering the promise of discovery, innovation, and the quest to understand our place in the universe. Since the launch of Sputnik 1 in 1957, the world has witnessed rapid advancements in aerospace technology. From the historic Apollo moon landings to the establishment of the International Space Station, humanity has continued to push the boundaries of what is possible. In recent years, space missions have moved beyond governmental agencies. Private companies like SpaceX, Blue Origin, and Rocket Lab are pioneering reusable rocket technology, drastically reducing the cost of launching satellites and paving the way for commercial space travel. These ventures have also inspired ambitious goals such as colonizing Mars and mining asteroids for rare minerals. Simultaneously, robotic missions to planets and moons have uncovered new data, such as evidence of subsurface oceans on Europa and organic molecules on Mars. With the James Webb Space Telescope now delivering breathtaking images of galaxies billions of light-years away, our understanding of the cosmos continues to deepen. However, space exploration is not without its challenges—space debris, high mission costs, and the unknown effects of long-term space travel on human health remain significant concerns. Nonetheless, the pursuit of space exploration symbolizes hope, curiosity, and the unyielding human desire to reach for the stars."""

summary = generate_summary(text, num_sentences=4)
print("Summary:")
print(summary)

Summary:
Space exploration has long fascinated humanity, offering the promise of discovery, innovation, and the quest to understand our place in the universe. In recent years, space missions have moved beyond governmental agencies. However, space exploration is not without its challenges—space debris, high mission costs, and the unknown effects of long-term space travel on human health remain significant concerns. Nonetheless, the pursuit of space exploration symbolizes hope, curiosity, and the unyielding human desire to reach for the stars.


## Model Evaluation (ROUGE)

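Summarization is not a classification task, so classification accuracy does not apply directly.
Instead we use ROUGE, a standard family of metrics that measures unigram overlap (ROUGE-1) and longest-common-subsequence overlap (ROUGE-L) between a generated summary and a reference summary.
Higher ROUGE scores mean more overlap with the reference, which is a proxy for summary quality rather than a direct measure of it.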
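
As a rough illustration of what ROUGE-1 measures, the sketch below computes unigram precision, recall, and F-measure by hand. It assumes plain whitespace tokenization and no stemming, so its numbers will not exactly match the `rouge-score` package. The function name `rouge1_sketch` and the toy sentences are hypothetical.

```python
from collections import Counter

def rouge1_sketch(reference, candidate):
    """Unigram-overlap precision, recall and F-measure (simplified ROUGE-1)."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Clipped overlap: a candidate word counts at most as often as it
    # appears in the reference.
    overlap = sum((ref_counts & cand_counts).values())
    precision = overlap / max(sum(cand_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# Toy pair: 5 of the 7 candidate words also occur in the 6-word reference.
p, r, f = rouge1_sketch("the cat sat on the mat", "the cat lay on the mat today")
print(round(p, 3), round(r, 3), round(f, 3))  # 0.714 0.833 0.769
```

Precision penalizes extra words in the generated summary, while recall penalizes reference words that the summary misses.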

In [None]:
%pip install rouge-score

from rouge_score import rouge_scorer

reference_summary = """Machine learning is a field that enables computers to learn without explicit programming.
It is widely used in many applications including recommendation systems and self-driving cars."""

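# `text` must hold the Example 1 (machine learning) passage at this point.
# Examples 2 and 3 reassign it, so re-run the Example 1 cell before this one.
generated_summary = generate_summary(text, num_sentences=4)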

scorer = rouge_scorer.RougeScorer(['rouge1', 'rougeL'], use_stemmer=True)
scores = scorer.score(reference_summary, generated_summary)
print("ROUGE Scores:", scores)

Note: you may need to restart the kernel to use updated packages.
ROUGE Scores: {'rouge1': Score(precision=0.3181818181818182, recall=0.7777777777777778, fmeasure=0.45161290322580644), 'rougeL': Score(precision=0.30303030303030304, recall=0.7407407407407407, fmeasure=0.4301075268817205)}
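
The `fmeasure` values printed above are consistent with the harmonic mean of the reported precision and recall. The short check below recomputes them from the printed numbers. The helper name `f_measure` is an assumption for illustration.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision and recall copied from the ROUGE output above.
rouge1_f = f_measure(0.3181818181818182, 0.7777777777777778)
rougeL_f = f_measure(0.30303030303030304, 0.7407407407407407)
print(round(rouge1_f, 4), round(rougeL_f, 4))  # 0.4516 0.4301
```

Recall is high while precision is low. That pattern is expected when a four-sentence generated summary is compared with a two-sentence reference: most reference words are covered, but many generated words have no counterpart in the reference.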
