# Extractive Text Summarization using NLP

## Project Overview
This project implements an **Extractive Text Summarization** system using classical Natural Language Processing techniques such as:
- Sentence Tokenization
- Text Preprocessing
- TF-IDF Vectorization
- Sentence Scoring & Ranking

The goal is to automatically generate a concise summary by selecting the most important sentences from the original text.

---

In [15]:
import re
import nltk
import numpy as np
import pandas as pd

from nltk.tokenize import sent_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer


In [16]:
# Download required NLTK resources (run once)
nltk.download('punkt')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\sahua\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\sahua\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [17]:
text = """Artificial Intelligence (AI) is transforming industries by enabling machines to learn from data and make decisions.
Machine learning, a subset of AI, allows systems to improve automatically through experience.
Deep learning, a specialized form of machine learning, uses neural networks with multiple layers.
AI is widely used in healthcare, finance, education, and autonomous vehicles.
Despite its benefits, AI also raises ethical concerns such as bias, job displacement, and privacy.
Researchers are actively working on responsible AI to address these challenges.
"""

In [18]:
sentences = sent_tokenize(text)

print("Number of sentences:", len(sentences))
print("\nSentences:\n")
for i, s in enumerate(sentences, 1):
    print(f"{i}. {s}")


Number of sentences: 6

Sentences:

1. Artificial Intelligence (AI) is transforming industries by enabling machines to learn from data and make decisions.
2. Machine learning, a subset of AI, allows systems to improve automatically through experience.
3. Deep learning, a specialized form of machine learning, uses neural networks with multiple layers.
4. AI is widely used in healthcare, finance, education, and autonomous vehicles.
5. Despite its benefits, AI also raises ethical concerns such as bias, job displacement, and privacy.
6. Researchers are actively working on responsible AI to address these challenges.


In [20]:
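def clean_sentence(sentence):
    """Lowercase a sentence and keep only ASCII letters and spaces."""
    sentence = sentence.lower()
    # Digits and punctuation are deleted, not replaced by a space,
    # so a hyphenated word such as "well-known" becomes "wellknown".
    sentence = re.sub(r'[^a-zA-Z ]', '', sentence)
    return sentence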

cleaned_sentences = [clean_sentence(sentence) for sentence in sentences]
cleaned_sentences

['artificial intelligence ai is transforming industries by enabling machines to learn from data and make decisions',
 'machine learning a subset of ai allows systems to improve automatically through experience',
 'deep learning a specialized form of machine learning uses neural networks with multiple layers',
 'ai is widely used in healthcare finance education and autonomous vehicles',
 'despite its benefits ai also raises ethical concerns such as bias job displacement and privacy',
 'researchers are actively working on responsible ai to address these challenges']
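The cleaning rule above deletes every non-letter character, which is harmless for the sample text but can glue tokens together in other inputs. The sketch below repeats the notebook's rule and compares it with a hypothetical variant, `clean_sentence_spaced`, that replaces non-letters with a space instead. The variant and the example strings are illustrative assumptions, not part of the original notebook.

```python
import re

def clean_sentence(sentence):
    # Same rule as the notebook cell: lowercase, then delete every
    # character that is not an ASCII letter or a space.
    sentence = sentence.lower()
    return re.sub(r'[^a-zA-Z ]', '', sentence)

def clean_sentence_spaced(sentence):
    # Hypothetical variant: replace each run of non-letters with one space,
    # so hyphenated words stay separate tokens.
    sentence = re.sub(r'[^a-z]+', ' ', sentence.lower())
    return sentence.strip()

examples = ["State-of-the-art models", "GPT-4 scored 90%"]
for s in examples:
    print(repr(clean_sentence(s)), "vs", repr(clean_sentence_spaced(s)))
```

Which behaviour is preferable depends on the corpus: joining hyphenated words creates rare vocabulary entries, while splitting them adds common words that the stop-word filter may then remove.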

In [21]:
tfidf = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf.fit_transform(cleaned_sentences)

tfidf_df = pd.DataFrame(
    tfidf_matrix.toarray(),
    columns=tfidf.get_feature_names_out()
)

tfidf_df


|   | actively | address | ai | allows | artificial | automatically | autonomous | benefits | bias | challenges | ... | responsible | specialized | subset | systems | transforming | used | uses | vehicles | widely | working |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.159927 | 0.0 | 0.312158 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.312158 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.185751 | 0.362563 | 0.0 | 0.362563 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.362563 | 0.362563 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.296668 | 0.0 | 0.0 | 0.0 | 0.0 | 0.296668 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.19011 | 0.0 | 0.0 | 0.0 | 0.371071 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.371071 | 0.0 | 0.371071 | 0.371071 | 0.0 |
| 4 | 0.0 | 0.0 | 0.168338 | 0.0 | 0.0 | 0.0 | 0.0 | 0.328576 | 0.328576 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 5 | 0.399601 | 0.399601 | 0.204726 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.399601 | ... | 0.399601 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.399601 |
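To make the numbers in this matrix less opaque, here is a minimal pure-Python sketch of the weighting that `TfidfVectorizer` applies under its documented defaults (`smooth_idf=True`, `norm='l2'`, raw term counts). The helper names `smooth_idf` and `tfidf_rows` and the toy corpus are hypothetical. Stop-word removal and tokenization are assumed to have happened already, so this illustrates the formula rather than replacing the library.

```python
import math
from collections import Counter

def smooth_idf(n_docs, doc_freq):
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

def tfidf_rows(docs):
    """docs is a list of token lists; returns (vocab, L2-normalised rows)."""
    n_docs = len(docs)
    vocab = sorted({tok for doc in docs for tok in doc})
    # Document frequency: in how many documents each token occurs
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    rows = []
    for doc in docs:
        counts = Counter(doc)
        raw = [counts[tok] * smooth_idf(n_docs, doc_freq[tok]) for tok in vocab]
        # Scale each row to unit Euclidean length
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0
        rows.append([w / norm for w in raw])
    return vocab, rows

# Toy, pre-tokenised corpus (hypothetical)
docs = [["ai", "data"], ["ai", "learning"], ["deep", "learning"]]
vocab, rows = tfidf_rows(docs)
for doc, row in zip(docs, rows):
    print(doc, [round(w, 3) for w in row])

# In the notebook, "ai" has a nonzero weight in 5 of the 6 sentences
# and "artificial" in 1, so their idf ratio can be checked directly.
print(round(smooth_idf(6, 5) / smooth_idf(6, 1), 4))
```

The last line prints the ratio of the two idf values, which should be close to 0.159927 / 0.312158 from row 0 of the matrix, since both terms occur once in that sentence and share the same normalisation factor.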


In [22]:
# Score each sentence as the sum of the TF-IDF weights in its row
sentence_scores = np.sum(tfidf_matrix.toarray(), axis=1)

score_df = pd.DataFrame({
    "Sentence": sentences,
    "Score": sentence_scores
})

score_df


|   | Sentence | Score |
|---|---|---|
| 0 | Artificial Intelligence (AI) is transforming i... | 3.281502 |
| 1 | Machine learning, a subset of AI, allows syste... | 2.955746 |
| 2 | Deep learning, a specialized form of machine l... | 3.103158 |
| 3 | AI is widely used in healthcare, finance, educ... | 2.78761 |
| 4 | Despite its benefits, AI also raises ethical c... | 3.125526 |
| 5 | Researchers are actively working on responsibl... | 2.602334 |
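One property of this scoring rule is worth keeping in mind: because each TF-IDF row has unit Euclidean length, a sentence with `k` equally weighted terms sums to `sqrt(k)`, so the sum tends to favour sentences with more distinct terms. The sketch below illustrates that effect on synthetic rows and contrasts it with a hypothetical alternative, `mean_nonzero_score`, which averages the nonzero weights. The alternative is an assumption for comparison only and is not the notebook's method.

```python
import math

def sum_score(row):
    # The notebook's rule: add up all TF-IDF weights in the row
    return sum(row)

def mean_nonzero_score(row):
    # Hypothetical alternative: average over the terms actually present
    nonzero = [w for w in row if w > 0]
    return sum(nonzero) / len(nonzero) if nonzero else 0.0

def equal_weight_row(k):
    # Synthetic unit-length row with k equally weighted terms
    return [1 / math.sqrt(k)] * k

short_row = equal_weight_row(4)
long_row = equal_weight_row(9)
print("sum :", sum_score(short_row), sum_score(long_row))
print("mean:", mean_nonzero_score(short_row), mean_nonzero_score(long_row))
```

Note that the mean has the opposite bias and favours short sentences, so neither rule is neutral with respect to length; the choice depends on whether longer, denser sentences are desirable in the summary.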


In [23]:
ranked_sentences = score_df.sort_values(by="Score", ascending=False)
ranked_sentences

|   | Sentence | Score |
|---|---|---|
| 0 | Artificial Intelligence (AI) is transforming i... | 3.281502 |
| 4 | Despite its benefits, AI also raises ethical c... | 3.125526 |
| 2 | Deep learning, a specialized form of machine l... | 3.103158 |
| 1 | Machine learning, a subset of AI, allows syste... | 2.955746 |
| 3 | AI is widely used in healthcare, finance, educ... | 2.78761 |
| 5 | Researchers are actively working on responsibl... | 2.602334 |


In [24]:
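def generate_summary(sentences, scores, top_n=3):
    """Join the top_n highest-scoring sentences in their original order."""
    # argsort is ascending, so reverse it and keep the first top_n indices
    ranked_indices = np.argsort(scores)[::-1][:top_n]
    ranked_indices = sorted(ranked_indices)  # preserve original order
    summary = " ".join([sentences[i] for i in ranked_indices])
    return summary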

summary = generate_summary(sentences, sentence_scores, top_n=3)
summary

'Artificial Intelligence (AI) is transforming industries by enabling machines to learn from data and make decisions. Deep learning, a specialized form of machine learning, uses neural networks with multiple layers. Despite its benefits, AI also raises ethical concerns such as bias, job displacement, and privacy.'
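The selection step can be traced by hand using the scores shown earlier. The following sketch reimplements it without NumPy, purely for illustration; the helper name `top_n_in_order` is hypothetical, and the score values are copied from the score table.

```python
# Scores copied from the score table, in original sentence order
scores = [3.281502, 2.955746, 3.103158, 2.787610, 3.125526, 2.602334]

def top_n_in_order(scores, top_n=3):
    # Rank sentence indices from highest to lowest score
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Keep the best top_n, then restore document order
    return sorted(ranked[:top_n])

selected = top_n_in_order(scores)
print(selected)  # [0, 2, 4]
```

Indices 0, 2, and 4 correspond to sentences 1, 3, and 5, which are exactly the three sentences in the summary above. Re-sorting the indices keeps the summary in the same order as the source text.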

In [25]:
print("ORIGINAL TEXT:\n")
print(text)

print("\n" + "="*80 + "\n")

print("GENERATED SUMMARY:\n")
print(summary)

ORIGINAL TEXT:

Artificial Intelligence (AI) is transforming industries by enabling machines to learn from data and make decisions.
Machine learning, a subset of AI, allows systems to improve automatically through experience.
Deep learning, a specialized form of machine learning, uses neural networks with multiple layers.
AI is widely used in healthcare, finance, education, and autonomous vehicles.
Despite its benefits, AI also raises ethical concerns such as bias, job displacement, and privacy.
Researchers are actively working on responsible AI to address these challenges.


================================================================================

GENERATED SUMMARY:

Artificial Intelligence (AI) is transforming industries by enabling machines to learn from data and make decisions. Deep learning, a specialized form of machine learning, uses neural networks with multiple layers. Despite its benefits, AI also raises ethical concerns such as bias, job displacement, and privacy.


## Conclusion

In this project, we implemented an Extractive Text Summarization system using classical NLP techniques.
The approach is:
- Simple
- Explainable
- Computationally efficient

### Key Techniques Used:
- Sentence Tokenization
- Text Cleaning
- TF-IDF Vectorization
- Sentence Ranking

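Because every summary sentence is copied verbatim from the source and its score can be traced back to individual TF-IDF weights, this method suits applications where transparency and speed matter more than fluent, rephrased output.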

---