# Summarization
## This notebook outlines the concepts behind Text Summarization

## Summarization
- The task of condensing a long piece of text into a short version that keeps its most important points

### Types of Summarization
- 1. **Extractive Summarization**
    - Selects the sentences from the source text that best represent it
    - Arranges them, unchanged, to form a summary
- 2. **Abstractive Summarization**
    - Identifies the most important ideas in the text
    - Generates new sentences that paraphrase those ideas to form a summary
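
To make the extractive idea concrete, here is a minimal sketch that scores each sentence by the average frequency of its words and keeps the top ones in their original order. The stop-word list, the naive sentence splitter, and the function names are assumptions made for illustration; this is not how any of the libraries below are implemented.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real libraries ship fuller lists).
STOP_WORDS = {"a", "an", "and", "are", "as", "in", "is", "it", "of", "on", "the", "to"}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def content_words(text):
    """Lowercased words with the stop words removed."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP_WORDS]


def extractive_summary(text, n_sentences=2):
    """Return the n highest-scoring sentences, kept in their original order."""
    sentences = split_sentences(text)
    freq = Counter(content_words(text))

    def score(sentence):
        tokens = content_words(sentence)
        # Average frequency, so long sentences are not favoured automatically.
        return sum(freq[w] for w in tokens) / len(tokens) if tokens else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:n_sentences])  # restore document order
    return [sentences[i] for i in chosen]


if __name__ == "__main__":
    sample = ("Cats sleep a lot. Cats and dogs are pets. "
              "The weather is nice today. Cats chase mice.")
    for line in extractive_summary(sample, n_sentences=3):
        print(line)
```

An abstractive system would instead generate new wording, which is why it needs a trained language model rather than a scoring rule like this one.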

## Summarization Libraries
- Sumy
- Gensim
- Summa
- Transformer-based models **
    - BERT **
    - BART **
    - PEGASUS **
    - T5 **

** Will be seen in DL-1


## 1. Sumy
    1. Luhn – heuristic method based on word frequency
    2. Latent Semantic Analysis (LSA)
    3. LexRank – unsupervised, graph-based approach inspired by the PageRank and HITS algorithms
    4. TextRank – graph-based ranking technique, also used for keyword extraction from a document
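
The graph-based methods in this list (LexRank and TextRank) share one core idea: treat sentences as nodes, connect them by similarity, and rank them PageRank-style. The sketch below illustrates that idea only. It uses Jaccard word overlap as the similarity measure, which is a simplification chosen here; the published algorithms use their own similarity measures, and the function names are hypothetical rather than part of Sumy.

```python
import re


def sentence_similarity(a, b):
    """Jaccard word overlap between two sentences (a simplifying assumption)."""
    wa = set(re.findall(r"[a-z]+", a.lower()))
    wb = set(re.findall(r"[a-z]+", b.lower()))
    if not wa or not wb:
        return 0.0
    return len(wa & wb) / len(wa | wb)


def rank_sentences(sentences, damping=0.85, iterations=50):
    """Score sentences by power iteration over a weighted similarity graph."""
    n = len(sentences)
    if n == 0:
        return []
    # Edge weights: similarity between every pair of distinct sentences.
    weights = [
        [sentence_similarity(sentences[i], sentences[j]) if i != j else 0.0
         for j in range(n)]
        for i in range(n)
    ]
    out_sums = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            # Each neighbour j passes on its score in proportion to the edge weight.
            incoming = sum(
                scores[j] * weights[j][i] / out_sums[j]
                for j in range(n)
                if out_sums[j] > 0
            )
            new_scores.append((1 - damping) / n + damping * incoming)
        scores = new_scores
    return scores


if __name__ == "__main__":
    docs = [
        "the cat sat on the mat",
        "the cat ate the fish",
        "the dog sat on the mat",
        "quantum physics is hard",
    ]
    for sentence, score in zip(docs, rank_sentences(docs)):
        print(f"{score:.3f}  {sentence}")
```

A sentence that shares no words with the rest receives only the baseline `(1 - damping) / n` score, which is why isolated sentences rarely appear in graph-based summaries.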

Documentation Reference [sumy](https://github.com/miso-belica/sumy)

## Task: Take a piece of text from a Wikipedia page and summarize it using Sumy
### Steps
- Install the necessary libraries
- Import the libraries
- Scrape the text from a pre-defined webpage
- Summarize

### Install Sumy

In [1]:
!pip install sumy
!pip install lxml_html_clean




### Import the libraries
- HtmlParser
- Tokenizer
- TextRankSummarizer

In [2]:
from sumy.parsers.html import HtmlParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.text_rank import TextRankSummarizer


### Scrape the text

In [3]:
url = "https://en.wikipedia.org/wiki/Automatic_summarization"


In [4]:
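import nltk
# Sumy's Tokenizer relies on NLTK tokenizer data; 'all' downloads every NLTK package,
# which is far more than sentence splitting needs
nltk.download('all')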


[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to
[nltk_data]    |     C:\Users\pegah\AppData\Roaming\nltk_data...
[nltk_data]    |   Package abc is already up-to-date!
[nltk_data]    | Downloading package alpino to
[nltk_data]    |     C:\Users\pegah\AppData\Roaming\nltk_data...
[nltk_data]    |   Package alpino is already up-to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger to
[nltk_data]    |     C:\Users\pegah\AppData\Roaming\nltk_data...
[nltk_data]    |   Package averaged_perceptron_tagger is already up-
[nltk_data]    |       to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger_eng to
[nltk_data]    |     C:\Users\pegah\AppData\Roaming\nltk_data...
[nltk_data]    |   Package averaged_perceptron_tagger_eng is already
[nltk_data]    |       up-to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger_ru to
[nltk_data]    |     C:\Users\pegah\AppData\Roaming\nltk_data...
[

True

### Summarize - TextRankSummarizer

In [5]:
parser = HtmlParser.from_url(url, Tokenizer("english"))
summarizer = TextRankSummarizer()
summary = summarizer(parser.document, 5)

for sentence in summary:
    print(sentence)


For text, extraction is analogous to the process of skimming, where the summary (if available), headings and subheadings, figures, the first and last paragraphs of a section, and optionally the first and last sentences in a paragraph are read before one chooses to read the entire document in detail.
Instead of trying to learn explicit features that characterize keyphrases, the TextRank algorithm[16] exploits the structure of the text itself to determine keyphrases that appear "central" to the text in the same way that PageRank selects important Web pages.
Once the graph is constructed, it is used to form a stochastic matrix, combined with a damping factor (as in the "random surfer model"), and the ranking over vertices is obtained by finding the eigenvector corresponding to eigenvalue 1 (i.e., the stationary distribution of the random walk on the graph).
While the goal of a brief summary is to simplify information search and cut the time by pointing to the most relevant source document

### Try different Summarizers
- LexRankSummarizer
- LuhnSummarizer
- LsaSummarizer

### Import the summarizers

In [6]:
from sumy.summarizers.lex_rank import LexRankSummarizer
from sumy.summarizers.luhn import LuhnSummarizer
from sumy.summarizers.lsa import LsaSummarizer


### Create Summarizers

### LexRankSummarizer

In [7]:
summarizer = LexRankSummarizer()
summary = summarizer(parser.document, 5)
for sentence in summary:
    print(sentence)


An example of a summarization problem is document summarization, which attempts to automatically produce an abstract from a given document.
The main difficulty in supervised extractive summarization is that the known summaries must be manually created by extracting sentences so the sentences in an original training document can be labeled as "in summary" or "not in summary".
The sentences in these summaries do not necessarily match up with sentences in the original text, so it would be difficult to assign labels to examples for training.
For example, in document summarization, one would like the summary to cover all important and relevant concepts in the document.
Automatic Text Summarization.


### LuhnSummarizer

In [8]:
summarizer = LuhnSummarizer()
summary = summarizer(parser.document, 5)
for sentence in summary:
    print(sentence)


Such transformation, however, is computationally much more challenging than extraction, involving both natural language processing and often a deep understanding of the domain of the original text in cases where the original document relates to a special field of knowledge.
Once the graph is constructed, it is used to form a stochastic matrix, combined with a damping factor (as in the "random surfer model"), and the ranking over vertices is obtained by finding the eigenvector corresponding to eigenvalue 1 (i.e., the stationary distribution of the random walk on the graph).
For example, if we rank unigrams and find that "advanced", "natural", "language", and "processing" all get high ranks, then we would look at the original text and see that these words appear consecutively and create a final keyphrase using all four together.
It is worth noting that TextRank was applied to summarization exactly as described here, while LexRank was used as part of a larger summarization system ( MEAD) 
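
Luhn's heuristic, the idea behind `LuhnSummarizer`, scores a sentence by how densely its "significant" (frequent, non-stop) words cluster together. The sketch below assumes a tiny stop-word list, a minimum count of 2 for significance, and a maximum gap of 4 insignificant words inside a cluster; these thresholds and the function names are illustrative assumptions, not Sumy's actual settings.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption).
STOP_WORDS = {"a", "an", "and", "are", "from", "in", "is", "of", "the", "to"}


def significant_words(text, min_count=2):
    """Words that are not stop words and occur at least min_count times."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOP_WORDS)
    return {word for word, count in counts.items() if count >= min_count}


def luhn_score(sentence_tokens, significant, max_gap=4):
    """Best cluster score: (significant words in cluster)^2 / cluster length."""
    positions = [i for i, tok in enumerate(sentence_tokens) if tok in significant]
    if not positions:
        return 0.0
    best = 0.0
    start = prev = positions[0]
    count = 1
    for pos in positions[1:]:
        if pos - prev - 1 <= max_gap:
            # Close enough to the previous significant word: same cluster.
            count += 1
        else:
            # Gap too large: close the current cluster and start a new one.
            best = max(best, count ** 2 / (prev - start + 1))
            start, count = pos, 1
        prev = pos
    return max(best, count ** 2 / (prev - start + 1))


if __name__ == "__main__":
    text = "Neural networks learn from data. Neural networks need data."
    keywords = significant_words(text)
    tokens = re.findall(r"[a-z]+", "neural networks learn from data")
    print(sorted(keywords), luhn_score(tokens, keywords))
```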

### LsaSummarizer

In [9]:
summarizer = LsaSummarizer()
summary = summarizer(parser.document, 5)
for sentence in summary:
    print(sentence)


For instance, in the above text, we might learn a rule that says phrases with initial capital letters are likely to be keyphrases.
Hulth uses a reduced set of features, which were found most successful in the KEA (Keyphrase Extraction Algorithm) work derived from Turney's seminal paper.
It is worth noting that TextRank was applied to summarization exactly as described here, while LexRank was used as part of a larger summarization system ( MEAD) that combines the LexRank score (stationary probability) with other features like sentence position and length using a linear combination with either user-specified or automatically tuned weights.
Automatic summaries present information extracted from multiple sources algorithmically, without any editorial touch or subjective human intervention, thus making it completely unbiased.
Although they did not replace other approaches and are often combined with them, by 2019 machine learning methods dominated the extractive summarization of single docu

## 2. Gensim

## Task: Take a piece of text from a Wikipedia page and summarize it using Gensim
### Steps
- Install the necessary libraries
- Import the libraries
- Scrape the text from a pre-defined webpage
- Summarize

### Install the library

In [14]:

!pip install gensim==3.8.0




Collecting gensim==3.8.0
  Downloading gensim-3.8.0.tar.gz (23.4 MB)
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'
Collecting smart_open>=1.7.0 (from gensim==3.8.0)
  Using cached smart_open-7.3.0.post1-py3-none-any.whl.metadata (24 kB)
Using cached smart_open-7.3.0.post1-py3-none-any.whl (61 kB)
Building wheels for collected packages: gensim
  Building wheel for gensim (setup.py): started
  Building wheel for gensim (setup.py): finished with status 'done'
  Created w

  DEPRECATION: Building 'gensim' using the legacy setup.py bdist_wheel mechanism, which will be removed in a future version. pip 25.3 will enforce this behaviour change. A possible replacement is to use the standardized build interface by setting the `--use-pep517` option, (possibly combined with `--no-build-isolation`), or adding a `pyproject.toml` file to the source tree of 'gensim'. Discussion can be found at https://github.com/pypa/pip/issues/6334


In [None]:
#Pay Attention Please

I tried to use Gensim's `summarize()` function, but the `gensim.summarization` module only exists in Gensim 3.x (it was removed in 4.0), so an older release such as 3.8.3 has to be installed. First, I attempted the installation in Google Colab, but it failed because of compatibility issues and the lack of support for custom environments.

Next, I installed Anaconda on my local Windows system and created a dedicated environment with Python 3.8 to install `gensim==3.8.3`. The installation failed again: the build stopped with system-level errors about missing Microsoft Visual C++ Build Tools, and no matching pre-built binary was found even with the `--only-binary` option.

At this point, I have tried both cloud and local approaches, but I was unable to install Gensim 3.8.3 successfully.

In [15]:
!pip install beautifulsoup4
!pip install requests




### Import the library

In [16]:
from gensim.summarization import summarize
from bs4 import BeautifulSoup
import requests



### Scrape the text
- Use BeautifulSoup to extract the text (from Task 1 of ML-1)

In [17]:
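def get_page(url):
    # Download the page; the timeout keeps the call from hanging indefinitely
    response = requests.get(url, timeout=30)
    # Raise an exception on HTTP errors instead of parsing an error page
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'html.parser')
    return soup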

In [18]:
def collect_text(soup):
    # Join the text of every <p> tag into one string
    paragraphs = soup.find_all('p')
    text = ' '.join([para.get_text() for para in paragraphs])
    return text

In [19]:
url = "https://en.wikipedia.org/wiki/Automatic_summarization"

In [20]:
text = collect_text(get_page(url))
text

summary = summarize(text, ratio=0.2)
print(summary)

Automatic summarization is the process of shortening a set of data computationally, to create a subset (a summary) that represents the most important or relevant information within the original content.
Text summarization is usually implemented by natural language processing methods, designed to locate the most informative sentences in a given document.[1] On the other hand, visual content can be summarized using computer vision algorithms.
Image summarization is the subject of ongoing research; existing approaches typically attempt to display the most representative images from a given image collection, or generate a video that only includes the most important content from the entire collection.[2][3][4] Video summarization algorithms identify and extract from the original video content the most important frames (key-frames), and/or the most important video segments (key-shots), normally in a temporally ordered fashion.[5][6][7][8] Video summaries simply retain a carefully selected su
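
The summary above still carries Wikipedia's inline citation markers (`[1]`, `[2][3][4]`), because `get_text()` returns them along with the paragraph text. A minimal sketch of a cleanup step that could be applied to `text` before summarizing is shown below; the helper name `strip_citation_markers` is hypothetical and not part of any library used in this notebook.

```python
import re


def strip_citation_markers(text):
    """Remove bracketed numeric markers such as [1] or [12] and tidy the spacing."""
    cleaned = re.sub(r"\[\d+\]", "", text)
    # Collapse runs of spaces or tabs that removing a marker may leave behind.
    return re.sub(r"[ \t]{2,}", " ", cleaned).strip()


if __name__ == "__main__":
    raw = "sentences in a given document.[1] On the other hand, visual content"
    print(strip_citation_markers(raw))
```

Only purely numeric markers are removed, so bracketed text such as `[citation needed]` is left in place.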

### Summarize
- **word_count**: maximum number of words wanted in the summary; when it is given, `ratio` is ignored
- **ratio**: fraction of the sentences in the original text to return in the summary (a number between 0 and 1)
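
The difference between the two parameters can be sketched on a list of sentences that has already been ranked from most to least important. This is only an illustration of what each limit means, not Gensim's implementation; the function names are assumptions.

```python
def select_by_ratio(ranked_sentences, ratio=0.2):
    """Keep the top fraction of the sentences (always at least one)."""
    keep = max(1, int(len(ranked_sentences) * ratio))
    return ranked_sentences[:keep]


def select_by_word_count(ranked_sentences, word_count):
    """Add sentences in rank order while the running total stays within word_count."""
    chosen, total = [], 0
    for sentence in ranked_sentences:
        n_words = len(sentence.split())
        if total + n_words > word_count:
            break
        chosen.append(sentence)
        total += n_words
    return chosen


if __name__ == "__main__":
    ranked = ["a b c", "d e", "f g h i", "j"]
    print(select_by_ratio(ranked, ratio=0.5))
    print(select_by_word_count(ranked, word_count=5))
```

A ratio scales with the length of the input, while a word count gives a fixed budget, which is usually easier to compare across articles of different lengths.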

## 3. Summa

## Task: Take a piece of text from a Wikipedia page and summarize it using Summa
### Steps
- Install the necessary libraries
- Import the libraries
- Scrape the text from a pre-defined webpage
- Summarize

### Install the library

In [29]:
!pip install summa --upgrade







### Import the library

In [31]:
from summa.summarizer import summarize
from bs4 import BeautifulSoup
import requests


### Scrape the text
- Use BeautifulSoup to extract the text (from Task 1 of ML-1)

In [32]:
def get_page(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    return soup

def collect_text(soup):
    paragraphs = soup.find_all('p')
    text = ' '.join([para.get_text() for para in paragraphs])
    return text


In [33]:
url = "https://en.wikipedia.org/wiki/Automatic_summarization"
text = collect_text(get_page(url))


### Summarize

In [34]:
summary = summarize(text, ratio=0.2)
print(summary)


Automatic summarization is the process of shortening a set of data computationally, to create a subset (a summary) that represents the most important or relevant information within the original content.
Text summarization is usually implemented by natural language processing methods, designed to locate the most informative sentences in a given document.[1] On the other hand, visual content can be summarized using computer vision algorithms.
Image summarization is the subject of ongoing research; existing approaches typically attempt to display the most representative images from a given image collection, or generate a video that only includes the most important content from the entire collection.[2][3][4] Video summarization algorithms identify and extract from the original video content the most important frames (key-frames), and/or the most important video segments (key-shots), normally in a temporally ordered fashion.[5][6][7][8] Video summaries simply retain a carefully selected su

## ASSIGNMENT: Take the same Medium article (the one I wrote) that we used for Task 1 of ML-1, extract its text, summarize it with all of the above methods, and provide the best summary with a note explaining why the chosen library is the best
url = https://medium.com/@subashgandyer/papa-what-is-a-neural-network-c5e5cc427c7

### Submit 2 files
- (notebook) .ipynb
- (summary) .txt

In [43]:
!pip install sumy
!pip install summa
!pip install beautifulsoup4
!pip install requests
!pip install gensim==3.8.0




In [45]:
from bs4 import BeautifulSoup
import requests
from summa.summarizer import summarize as summa_summarize

from sumy.parsers.html import HtmlParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.text_rank import TextRankSummarizer
from sumy.summarizers.lsa import LsaSummarizer
from sumy.summarizers.luhn import LuhnSummarizer
from sumy.summarizers.lex_rank import LexRankSummarizer
from gensim.summarization import summarize as gensim_summarize



In [46]:
def get_page(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    return soup

def collect_text(soup):
    paragraphs = soup.find_all('p')
    text = ' '.join([para.get_text() for para in paragraphs])
    return text

url = "https://medium.com/@subashgandyer/papa-what-is-a-neural-network-c5e5cc427c7"
text = collect_text(get_page(url))


In [47]:
summa_summary = summa_summarize(text, ratio=0.2)


In [48]:
try:
    gensim_summary = gensim_summarize(text, ratio=0.2)
except ValueError:
    gensim_summary = "Gensim failed to generate summary. Possibly text is too short or not clean."


In [49]:
parser = HtmlParser.from_url(url, Tokenizer("english"))

# TextRank
text_rank = TextRankSummarizer()
text_rank_summary = '\n'.join(str(sentence) for sentence in text_rank(parser.document, 5))

# Luhn
luhn = LuhnSummarizer()
luhn_summary = '\n'.join(str(sentence) for sentence in luhn(parser.document, 5))

# LexRank
lex = LexRankSummarizer()
lex_summary = '\n'.join(str(sentence) for sentence in lex(parser.document, 5))

# LSA
lsa = LsaSummarizer()
lsa_summary = '\n'.join(str(sentence) for sentence in lsa(parser.document, 5))


In [51]:
summaries = {
    "Summa (TextRank)": summa_summary,
    "Gensim": gensim_summary,
    "Sumy - TextRank": text_rank_summary,
    "Sumy - Luhn": luhn_summary,
    "Sumy - LexRank": lex_summary,
    "Sumy - LSA": lsa_summary,
}

In [52]:
note = """
===== Final Conclusion =====

Chosen Library: Summa (TextRank-based)

Why?
✓ Among all methods, Summa produced the most coherent, compact, and relevant summary.
✓ It does not require manual tokenization or parsing.
✓ It works directly on the plain text extracted from the HTML page.
✓ Output is fluent, with logically connected sentences.

Use Case: Ideal for quick and reliable summarization of online articles with minimal setup.
"""

In [53]:
# Write every summary with simple size statistics, followed by the final note
with open("summary_comparison.txt", "w", encoding="utf-8") as f:
    for method, summary in summaries.items():
        f.write(f"===== {method} =====\n")
        # Counting '.' only approximates the number of sentences
        f.write(f"#Sentences: {summary.count('.')}\n")
        f.write(f"#Words: {len(summary.split())}\n")
        f.write(summary + "\n\n")
    f.write(note)
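
Counting `.` characters overestimates the sentence count whenever a summary contains decimals or version numbers. A slightly more careful sketch, assuming a sentence ends at `.`, `!` or `?` followed by whitespace, is shown below; `summary_stats` is a hypothetical helper, and abbreviations such as "Dr." would still be miscounted.

```python
import re


def summary_stats(summary):
    """Return approximate sentence and word counts for a summary string."""
    # Split only where sentence-ending punctuation is followed by whitespace,
    # so dots inside numbers such as 3.8.0 do not start a new sentence.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", summary.strip()) if s]
    return {"sentences": len(sentences), "words": len(summary.split())}


if __name__ == "__main__":
    sample = "Gensim 3.8.0 was used. It worked!"
    print(sample.count("."), summary_stats(sample))
```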
