# Extractive Summarization using TF-IDF & Cosine Similarity

This notebook builds a basic extractive summarizer with `nltk`, `TfidfVectorizer`, and `cosine_similarity`. Each sentence is scored by the cosine similarity between its TF-IDF vector and a document vector (the mean of all sentence vectors), and the highest-scoring sentences are kept in their original order as the summary.


In [None]:
# Only the packages imported below are needed for this notebook
!pip install nltk scikit-learn numpy --quiet

### 🔹 Step 1: Import Libraries and Download NLTK Tokenizer


In [4]:
import nltk
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import heapq

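nltk.download('punkt')
# Recent NLTK releases (3.9 and later) load the 'punkt_tab' resource in sent_tokenize;
# if a LookupError mentions it, run nltk.download('punkt_tab') as well.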

[nltk_data] Downloading package punkt to /usr/share/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

### 🔹 Step 2: Define Input Text


In [None]:
# Sample input text
text = """
India's GDP grew 7.8% in the first quarter of 2024, driven by strong performance in services and manufacturing sectors.
The Finance Ministry highlighted that inflation remains under control, and consumer spending is steadily rising.
Foreign investment saw a marginal dip due to global uncertainty, but domestic consumption remained strong.
The Reserve Bank of India is expected to keep interest rates steady in the coming quarter.
Analysts predict a GDP growth rate of 6.5% for the next fiscal year, assuming monsoon conditions remain favorable.
"""

### 🔹 Step 3: Sentence Tokenization


In [None]:
from nltk.tokenize import sent_tokenize
sentences = sent_tokenize(text)

### 🔹 Step 4: TF-IDF Vectorization of Sentences


In [None]:
vectorizer = TfidfVectorizer()
sentence_vectors = vectorizer.fit_transform(sentences)

### 🔹 Step 5: Compute Document Vector (Mean of Sentence Vectors)


In [None]:
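# Average the sentence rows; np.asarray converts the result to a plain ndarray
# before reshaping it to (1, n_terms) for cosine_similarity
doc_vector = np.asarray(sentence_vectors.mean(axis=0)).reshape(1, -1)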

### 🔹 Step 6: Compute Cosine Similarity


In [None]:
# Cosine similarity between each sentence vector and the document vector;
# the result is an array of shape (n_sentences, 1)
scores = cosine_similarity(sentence_vectors, doc_vector)
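
As a rough illustration of what the similarity step computes, the sketch below reproduces cosine similarity by hand (dot product divided by the product of the vector norms) and compares it with scikit-learn's result. The small `sentence_matrix` is a hypothetical stand-in for TF-IDF rows, not values from this notebook.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical small vectors standing in for TF-IDF rows.
sentence_matrix = np.array([
    [1.0, 0.0, 1.0],
    [0.0, 2.0, 0.0],
    [1.0, 1.0, 1.0],
])

# Document vector as the mean of the rows, kept 2-D for scikit-learn.
document = sentence_matrix.mean(axis=0, keepdims=True)

# Manual cosine similarity: dot product divided by the product of the norms.
dots = sentence_matrix @ document.ravel()
norms = np.linalg.norm(sentence_matrix, axis=1) * np.linalg.norm(document)
manual_scores = dots / norms

# Library result, flattened from shape (3, 1) to (3,).
library_scores = cosine_similarity(sentence_matrix, document).ravel()

print(np.allclose(manual_scores, library_scores))
```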

### 🔹 Step 7: Rank Top Sentences


In [None]:
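top_n = 3  # number of summary sentences
# Flatten the (n_sentences, 1) score array so each index maps to a single number
flat_scores = np.ravel(scores)
top_indices = heapq.nlargest(top_n, range(len(flat_scores)), key=flat_scores.__getitem__)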
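
The ranking step can be sketched in isolation with a plain list of scores. The values in `example_scores` are invented for illustration; the point is that `heapq.nlargest` returns indices ordered best-first, and sorting them afterwards restores document order.

```python
import heapq

# Hypothetical similarity scores for five sentences.
example_scores = [0.42, 0.10, 0.35, 0.58, 0.51]

# Indices of the three highest scores, best first.
best_first = heapq.nlargest(
    3, range(len(example_scores)), key=example_scores.__getitem__
)

# Sorting the indices puts the chosen sentences back in document order.
document_order = sorted(best_first)

print(best_first)
print(document_order)
```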

### 🔹 Step 8: Print Extractive Summary


In [None]:
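# Print the selected sentences in their original document order;
# strip() removes the leading newline that the triple-quoted text leaves on the first sentence
summary = [sentences[i].strip() for i in sorted(top_indices)]
print("📝 Extractive Summary:\n")
for s in summary:
    print("- " + s)

📝 Extractive Summary:

- India's GDP grew 7.8% in the first quarter of 2024, driven by strong performance in services and manufacturing sectors.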
- The Reserve Bank of India is expected to keep interest rates steady in the coming quarter.
- Analysts predict a GDP growth rate of 6.5% for the next fiscal year, assuming monsoon conditions remain favorable.
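
The steps above could be collected into one reusable function. The following is a minimal sketch under a few assumptions: the helper name `summarize_sentences` is hypothetical, it takes an already tokenized list of sentences (so it does not depend on NLTK data being downloaded), and it does not handle the `ValueError` that `TfidfVectorizer` raises when no usable terms are found.

```python
import heapq

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def summarize_sentences(sentences, top_n=3):
    """Return up to top_n sentences closest to the mean TF-IDF vector."""
    # Drop empty entries and surrounding whitespace.
    cleaned = [s.strip() for s in sentences if s.strip()]
    if not cleaned:
        return []

    # One TF-IDF row per sentence.
    vectors = TfidfVectorizer().fit_transform(cleaned)

    # Document vector: mean of the sentence rows, shaped (1, n_terms).
    doc_vector = np.asarray(vectors.mean(axis=0)).reshape(1, -1)

    # Similarity of every sentence to the document vector, flattened to 1-D.
    scores = cosine_similarity(vectors, doc_vector).ravel()

    # Pick the best-scoring indices, then restore document order.
    k = min(top_n, len(cleaned))
    top = heapq.nlargest(k, range(len(cleaned)), key=scores.__getitem__)
    return [cleaned[i] for i in sorted(top)]


# Hypothetical usage with made-up sentences.
example = ["Cats sleep a lot.", "Dogs bark loudly.", "Cats and dogs sleep."]
print(summarize_sentences(example, top_n=2))
```

To use it with the notebook's data, the list produced by `sent_tokenize(text)` could be passed in directly.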
