<a href="https://colab.research.google.com/github/Saish31/Python-Projects/blob/main/Extractive_Summarization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Extractive Text Summarization

**Extractive Summarization**: The extractive approach selects the most important sentences and phrases directly from the document and joins them, unchanged, into a summary.

LexRank is a graph-based method for automatic text summarization that ranks the sentences of a document by how similar they are to the other sentences. Each sentence is a node in a graph, and each edge is weighted by the similarity of the two sentences it connects, typically the cosine similarity of their TF-IDF vectors. A sentence counts as important when many other sentences, especially important ones, are similar to it, so every similar sentence acts as a "recommendation". A PageRank-style centrality computation turns these recommendations into one score per sentence.
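The ranking idea described above can be sketched on a toy example. This is a minimal illustration, not the implementation used in this notebook (which relies on `nx.pagerank`): the function names `cosine_similarity_matrix` and `rank_sentences`, the damping factor of 0.85, and the hand-made term-count vectors are all assumptions chosen for illustration.

```python
import numpy as np


def cosine_similarity_matrix(vectors: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between the rows of `vectors`."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    # Guard against all-zero rows to avoid division by zero
    unit = vectors / np.where(norms == 0, 1.0, norms)
    return unit @ unit.T


def rank_sentences(similarity: np.ndarray, damping: float = 0.85,
                   tol: float = 1e-8, max_iter: int = 200) -> np.ndarray:
    """PageRank-style power iteration over a sentence-similarity matrix."""
    sim = similarity.astype(float).copy()
    np.fill_diagonal(sim, 0.0)  # a sentence does not recommend itself
    n = sim.shape[0]
    row_sums = sim.sum(axis=1, keepdims=True)
    # Each row becomes a probability distribution over "recommended" sentences;
    # a sentence with no similar neighbours spreads its weight uniformly.
    safe_sums = np.where(row_sums > 0, row_sums, 1.0)
    transition = np.where(row_sums > 0, sim / safe_sums, 1.0 / n)
    scores = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        updated = (1 - damping) / n + damping * (transition.T @ scores)
        if np.abs(updated - scores).sum() < tol:
            return updated
        scores = updated
    return scores


# Hypothetical term-count vectors for four sentences (columns are terms).
# Sentence 1 shares terms with sentences 0 and 2; sentence 3 shares none.
toy_vectors = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 1],
], dtype=float)

scores = rank_sentences(cosine_similarity_matrix(toy_vectors))
print(scores)  # sentence 1 scores highest, sentence 3 lowest
```

Sentence 1 comes out on top because both of its neighbours are similar to it, while the isolated sentence 3 receives only the small baseline share that the damping term gives every node.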

In [6]:
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer
import networkx as nx

In [11]:
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [12]:
class ExtractiveSummarizer:
    """
    High-level extractive summarizer that ranks sentences with PageRank
    over a TF-IDF cosine-similarity graph (a LexRank-style approach).
    """
    def __init__(self, num_sentences: int = 3):
        self.num_sentences = num_sentences
        self.vectorizer = TfidfVectorizer(stop_words='english')

    def _tokenize(self, text: str) -> list[str]:
        """Split text into sentences."""
        return nltk.sent_tokenize(text)

    def _build_similarity_graph(self, sentences: list[str]) -> nx.Graph:
        """Create a graph where nodes are sentences and edges are TF-IDF cosine similarities."""
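        tfidf_matrix = self.vectorizer.fit_transform(sentences)
        # TfidfVectorizer L2-normalizes each row by default, so the dot product
        # of two rows is their cosine similarity. `@` is explicit matrix
        # multiplication and does not depend on the sparse type's `*` semantics.
        similarity_matrix = (tfidf_matrix @ tfidf_matrix.T).toarray()
        # Zero out self-similarities so no sentence recommends itself
        for i in range(len(similarity_matrix)):
            similarity_matrix[i, i] = 0.0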

        graph = nx.from_numpy_array(similarity_matrix)
        return graph

    def summarize(self, text: str) -> str:
        """
        Generate an extractive summary by selecting top-ranked sentences.
        :param text: Input document as a single string.
        :return: Concise summary as a string.
        """
        sentences = self._tokenize(text)
        if len(sentences) <= self.num_sentences:
            # Nothing to rank: return the input without surrounding whitespace
            return text.strip()

        graph = self._build_similarity_graph(sentences)
        scores = nx.pagerank(graph)
        # Rank sentences by score
        ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
        top_indices = [idx for idx, _ in ranked[: self.num_sentences]]
        # Preserve original order
        top_indices.sort()

        summary = ' '.join(sentences[i] for i in top_indices)
        return summary

In [15]:
if __name__ == '__main__':
    sample = (
        """
        Immediately after the verdict, in a statement released through her spokesperson, Amber had said she was ‘sad’ she had ‘lost the case’. The jury had also found Johnny guilty of defamation on one count and ordered him to pay Amber $2 million in damages. However, most legal experts said the case had been vindication for Johnny.
        Speaking about it on Today Show, Amber said about the jury, “I don’t blame them. I actually understand. He’s a beloved character and people feel they know him. He’s a fantastic actor.”
        The actor also addressed the memes that have been made about her and the hate coming her way on social media through the trial. She said, “I don’t care what one thinks about me or what judgments you want to make about what happened in the privacy of my own home, in my marriage, behind closed doors. I don’t presume the average person should know those things. And so I don’t take it personally. But even somebody who is sure I’m deserving of all this hate and vitriol, even if you think that I’m lying, you still couldn’t look me in the eye and tell me that you think on social media there’s been a fair representation. You cannot tell me that you think that this has been fair.
        """
    )

    summarizer = ExtractiveSummarizer(num_sentences=2)
    print("Summary:", summarizer.summarize(sample))

Summary: Speaking about it on Today Show, Amber said about the jury, “I don’t blame them. But even somebody who is sure I’m deserving of all this hate and vitriol, even if you think that I’m lying, you still couldn’t look me in the eye and tell me that you think on social media there’s been a fair representation.
