# Summarization
## This notebook outlines the concepts behind Text Summarization

## Summarization
- The task of capturing the essential gist of a long piece of text in a much shorter form

### Types of Summarization
- 1. **Extractive Summarization**
    - Selects the sentences from the text that best represent it
    - Arranges them, unchanged, to form a summary
- 2. **Abstractive Summarization**
    - Identifies the most important ideas in the text
    - Paraphrases them in newly generated sentences to form a summary
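
To make the extractive idea concrete, here is a minimal sketch that scores each sentence by the average frequency of its words and keeps the top ones in their original order. It is not how Sumy, Gensim, or Summa work internally; the tiny `STOP_WORDS` set, the naive sentence splitter, and the function names are assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical stop-word list, kept tiny for illustration only
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "to", "in", "and", "it", "on", "that"}


def split_sentences(text):
    # Naive splitter: break after ., ! or ? followed by whitespace
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def content_words(text):
    # Lowercase words with stop words removed
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP_WORDS]


def extractive_summary(text, sentences_count=2):
    sentences = split_sentences(text)
    freq = Counter(content_words(text))

    def score(sentence):
        tokens = content_words(sentence)
        # Average word frequency, so long sentences are not favoured automatically
        return sum(freq[w] for w in tokens) / len(tokens) if tokens else 0.0

    # Pick the top-scoring sentences, then restore their original order
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:sentences_count])
    return " ".join(sentences[i] for i in chosen)


print(extractive_summary("Cats sleep a lot. Cats chase mice. Dogs bark.", sentences_count=2))
```

Every sentence in the result is copied verbatim from the input, which is the defining property of an extractive summary. An abstractive system would instead generate new sentences.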

## Summarization Libraries
- Sumy
- Gensim
- Summa
- BERT **
- BART **
- PEGASUS **
- T5 **

** Will be seen in DL-1


## 1. Sumy
    1. Luhn – Heuristic method
    2. Latent Semantic Analysis (LSA)
    3. LexRank – Unsupervised approach inspired by the PageRank and HITS algorithms
    4. TextRank – Graph-based summarization technique that also supports keyword extraction from the document

Documentation Reference [sumy](https://github.com/miso-belica/sumy)
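
The graph-based methods listed above (LexRank and TextRank) treat sentences as nodes and rank them the way PageRank ranks pages. The sketch below illustrates that idea only and is not Sumy's implementation: the Jaccard word-overlap similarity, the function names, and the fixed iteration count are simplifying assumptions.

```python
import re


def tokenize(sentence):
    # Set of lowercase words in the sentence
    return set(re.findall(r"[a-z]+", sentence.lower()))


def similarity(a, b):
    # Jaccard overlap between the word sets (an assumed, simplified measure)
    words_a, words_b = tokenize(a), tokenize(b)
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


def rank_sentences(sentences, damping=0.85, iterations=50):
    n = len(sentences)
    if n == 0:
        return []
    # Edge weights between every pair of distinct sentences
    weights = [
        [similarity(sentences[i], sentences[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    out_sum = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            # Score flowing into sentence i from every connected sentence j
            incoming = sum(
                weights[j][i] / out_sum[j] * scores[j] for j in range(n) if out_sum[j] > 0
            )
            new_scores.append((1 - damping) / n + damping * incoming)
        scores = new_scores
    return scores


sentences = ["the cat sat on the mat", "the cat ate the fish", "stock prices fell sharply"]
for sentence, score in zip(sentences, rank_sentences(sentences)):
    print(f"{score:.3f}  {sentence}")
```

Sentences that share vocabulary with many others collect a higher score, while an unrelated sentence keeps only the small baseline contributed by the damping term.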

## Task: Take a piece of text from a wiki page and summarize it using Sumy
### Steps
- Install the necessary libraries
- Import the libraries
- Scrape the text from a pre-defined webpage
- Summarize

### Install Sumy

In [1]:
# ! pip install sumy

In [2]:
# !pip install numpy

In [4]:
import nltk
# Sumy's Tokenizer relies on NLTK's punkt data; uncomment on the first run
# nltk.download('punkt')

### Import the libraries
- HtmlParser
- Tokenizer
- TextRankSummarizer

In [5]:
from sumy.parsers.html import HtmlParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.text_rank import TextRankSummarizer

In [6]:
# Additional parser, summarizers, and scraping helpers used later in the notebook
from sumy.parsers.plaintext import PlaintextParser
from sumy.summarizers.lsa import LsaSummarizer
from sumy.summarizers.luhn import LuhnSummarizer
from sumy.summarizers.lex_rank import LexRankSummarizer
import requests
from bs4 import BeautifulSoup

### Scrape the text

In [7]:
wiki_url = "https://en.wikipedia.org/wiki/Automatic_summarization"

In [8]:
# Parse HTML content from the URL
parser = HtmlParser.from_url(wiki_url, Tokenizer("english"))
wiki_text = ' '.join([str(sentence) for sentence in parser.document.sentences])

### Summarize - TextRankSummarizer

In [9]:
# Summarize using TextRank
text_rank_summarizer = TextRankSummarizer()
summary = ' '.join([str(sentence) for sentence in text_rank_summarizer(parser.document, sentences_count=3)])

In [10]:
# Print the summary
print("TextRank Summary:", summary)

TextRank Summary: For text, extraction is analogous to the process of skimming, where the summary (if available), headings and subheadings, figures, the first and last paragraphs of a section, and optionally the first and last sentences in a paragraph are read before one chooses to read the entire document in detail. Once the graph is constructed, it is used to form a stochastic matrix, combined with a damping factor (as in the "random surfer model"), and the ranking over vertices is obtained by finding the eigenvector corresponding to eigenvalue 1 (i.e., the stationary distribution of the random walk on the graph). A Class of Submodular Functions for Document Summarization", The 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-HLT), 2011^ Sebastian Tschiatschek, Rishabh Iyer, Hoachen Wei and Jeff Bilmes, Learning Mixtures of Submodular Functions for Image Collection Summarization, In Advances of Neural Information Processing System

### Try different Summarizers
- LexRankSummarizer
- LuhnSummarizer
- LsaSummarizer

### Import the summarizers

In [11]:
from sumy.nlp.tokenizers import Tokenizer
from sumy.parsers.html import HtmlParser
from sumy.summarizers.lex_rank import LexRankSummarizer
from sumy.summarizers.luhn import LuhnSummarizer
from sumy.summarizers.lsa import LsaSummarizer

### Create Summarizers

In [12]:
# Summarize using LexRank
lex_rank_summarizer = LexRankSummarizer()
lex_rank_summary = ' '.join([str(sentence) for sentence in lex_rank_summarizer(parser.document, sentences_count=3)])

# Summarize using Luhn
luhn_summarizer = LuhnSummarizer()
luhn_summary = ' '.join([str(sentence) for sentence in luhn_summarizer(parser.document, sentences_count=3)])

# Summarize using LSA
lsa_summarizer = LsaSummarizer()
lsa_summary = ' '.join([str(sentence) for sentence in lsa_summarizer(parser.document, sentences_count=3)])
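
The three blocks above repeat the same pattern: call a summarizer on the document, then join the returned sentences. A small helper can express that once. This is only a sketch; `run_summarizers` is a hypothetical name, and it assumes each summarizer is callable as `summarizer(document, sentences_count)`, which is how the Sumy summarizers are used in this notebook.

```python
def run_summarizers(summarizers, document, sentences_count=3):
    """Apply each summarizer to the same document and join its sentences."""
    results = {}
    for name, summarizer in summarizers.items():
        sentences = summarizer(document, sentences_count)
        results[name] = " ".join(str(sentence) for sentence in sentences)
    return results


# Usage in this notebook (assumes the imports and `parser` from the cells above):
# summaries = run_summarizers(
#     {"LexRank": LexRankSummarizer(), "Luhn": LuhnSummarizer(), "LSA": LsaSummarizer()},
#     parser.document,
# )
# print(summaries["LexRank"])
```

Keeping the results in a dictionary keyed by method name also makes it easier to print or save every summary in one loop later on.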

### LexRankSummarizer

In [13]:
print("LexRank Summary:", lex_rank_summary)

LexRank Summary: An example of a summarization problem is document summarization, which attempts to automatically produce an abstract from a given document. The main difficulty in supervised extractive summarization is that the known summaries must be manually created by extracting sentences so the sentences in an original training document can be labeled as "in summary" or "not in summary". Automatic Text Summarization.


### LuhnSummarizer

In [14]:
print("Luhn Summary:", luhn_summary)

Luhn Summary: Once the graph is constructed, it is used to form a stochastic matrix, combined with a damping factor (as in the "random surfer model"), and the ranking over vertices is obtained by finding the eigenvector corresponding to eigenvalue 1 (i.e., the stationary distribution of the random walk on the graph). It is worth noting that TextRank was applied to summarization exactly as described here, while LexRank was used as part of a larger summarization system ( MEAD) that combines the LexRank score (stationary probability) with other features like sentence position and length using a linear combination with either user-specified or automatically tuned weights. A Class of Submodular Functions for Document Summarization", The 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-HLT), 2011^ Sebastian Tschiatschek, Rishabh Iyer, Hoachen Wei and Jeff Bilmes, Learning Mixtures of Submodular Functions for Image Collection Summarization

### LsaSummarizer

In [15]:
print("LSA Summary:", lsa_summary)

LSA Summary: For instance, in the above text, we might learn a rule that says phrases with initial capital letters are likely to be keyphrases. Hulth uses a reduced set of features, which were found most successful in the KEA (Keyphrase Extraction Algorithm) work derived from Turney's seminal paper. Although they did not replace other approaches and are often combined with them, by 2019 machine learning methods dominated the extractive summarization of single documents, which was considered to be nearing maturity.


## 2. Gensim

## Task: Take a piece of text from a wiki page and summarize it using Gensim
### Steps
- Install the necessary libraries
- Import the libraries
- Scrape the text from a pre-defined webpage
- Summarize

### Install the library

In [16]:
# gensim.summarization was removed in Gensim 4.0, so pin a 3.x release
# !pip install "gensim==3.8.2"

### Import the library

In [17]:
from gensim.summarization import summarize
import requests
from bs4 import BeautifulSoup

In [18]:
# Parse HTML content from the URL
response = requests.get(wiki_url)
soup = BeautifulSoup(response.text, 'html.parser')

# Extract the text from the page
wiki_text = ' '.join([p.text for p in soup.find_all('p')])

# Summarize using Gensim
summary2 = summarize(wiki_text, ratio=0.2)  # Adjust the ratio parameter as needed

In [19]:
print("Gensim Summary:", summary2)

Gensim Summary: Automatic summarization is the process of shortening a set of data computationally, to create a subset (a summary) that represents the most important or relevant information within the original content.
Text summarization is usually implemented by natural language processing methods, designed to locate the most informative sentences in a given document.[1] On the other hand, visual content can be summarized using computer vision algorithms.
Image summarization is the subject of ongoing research; existing approaches typically attempt to display the most representative images from a given image collection, or generate a video that only includes the most important content from the entire collection.[2][3][4] Video summarization algorithms identify and extract from the original video content the most important frames (key-frames), and/or the most important video segments (key-shots), normally in a temporally ordered fashion.[5][6][7][8] Video summaries simply retain a caref
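
The summary above still carries Wikipedia's bracketed citation markers such as `[1]` and `[2][3][4]`. One possible cleanup, sketched below, is to strip them from the scraped text before summarizing. The helper name `strip_citation_markers` is hypothetical, and the pattern only covers purely numeric markers, not notes such as `[citation needed]`.

```python
import re


def strip_citation_markers(text):
    # Remove bracketed numeric references such as [1] or [12]
    cleaned = re.sub(r"\[\d+\]", "", text)
    # Collapse any run of spaces left behind, leaving newlines untouched
    return re.sub(r" {2,}", " ", cleaned).strip()


sample = "in a given document.[1] On the other hand, key-shots are ordered.[5][6][7][8] Video summaries"
print(strip_citation_markers(sample))
```

In this notebook the call would sit between scraping and summarizing, for example `summarize(strip_citation_markers(wiki_text), ratio=0.2)`.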

### Scrape the text
- Use BeautifulSoup to extract the text (from Task 1 of ML-1)

In [20]:
import os
import requests
import re
import sys
from bs4 import BeautifulSoup

In [21]:
def get_page(url):
#     # Validate the URL
#     if not re.match(r'https?://medium.com/', url):
#         print('Please enter a valid website, or make sure it is a medium article')
#         sys.exit(1)

    # Make a GET request to the URL
    res = requests.get(url)
    res.raise_for_status()
    
    # Parse the HTML content
    soup = BeautifulSoup(res.text, 'html.parser')
    return soup

In [22]:
def collect_text(soup, url):
    # Start the text with the source URL so the origin is recorded
    text = f'url: {url}\n\n'
    para_text = soup.find_all("p")
    
    # Print paragraphs text for debugging
    print(f"Paragraphs text = \n {[para.text for para in para_text]}")
    
    # Add paragraphs to the text
    for para in para_text:
        text += f"{para.text}\n\n"
    return text

In [23]:
url = "https://en.wikipedia.org/wiki/Automatic_summarization"

In [24]:
# Pass the URL explicitly instead of relying on a global variable
text = collect_text(get_page(url), url)
text

Paragraphs text = 
 ['Automatic summarization is the process of shortening a set of data computationally, to create a subset (a summary) that represents the most important or relevant information within the original content. Artificial intelligence algorithms are commonly developed and employed to achieve this, specialized for different types of data.\n', 'Text summarization is usually implemented by natural language processing methods, designed to locate the most informative sentences in a given document.[1] On the other hand, visual content can be summarized using computer vision algorithms. Image summarization is the subject of ongoing research; existing approaches typically attempt to display the most representative images from a given image collection, or generate a video that only includes the most important content from the entire collection.[2][3][4] Video summarization algorithms identify and extract from the original video content the most important frames (key-frames), and/o

'url: https://en.wikipedia.org/wiki/Automatic_summarization\n\nAutomatic summarization is the process of shortening a set of data computationally, to create a subset (a summary) that represents the most important or relevant information within the original content. Artificial intelligence algorithms are commonly developed and employed to achieve this, specialized for different types of data.\n\n\nText summarization is usually implemented by natural language processing methods, designed to locate the most informative sentences in a given document.[1] On the other hand, visual content can be summarized using computer vision algorithms. Image summarization is the subject of ongoing research; existing approaches typically attempt to display the most representative images from a given image collection, or generate a video that only includes the most important content from the entire collection.[2][3][4] Video summarization algorithms identify and extract from the original video content the 

### Summarize
- **word_count**: maximum number of words we want in the summary
- **ratio**: fraction of the sentences in the original text to return as the summary
- If both are given, Gensim uses **word_count** and ignores **ratio**
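
The two parameters above are two different ways of cutting down a ranked list of sentences. The sketch below shows the difference on a toy list. It is not Gensim's internal logic; the function names and the stop-at-the-budget rule are assumptions chosen for clarity.

```python
def select_by_ratio(ranked_sentences, ratio):
    # Keep the top fraction of sentences (at least one if any exist)
    if not ranked_sentences:
        return []
    keep = max(1, int(len(ranked_sentences) * ratio))
    return ranked_sentences[:keep]


def select_by_word_count(ranked_sentences, word_count):
    # Keep adding top-ranked sentences while the word budget is not exceeded
    selected, used = [], 0
    for sentence in ranked_sentences:
        length = len(sentence.split())
        if used + length > word_count:
            break
        selected.append(sentence)
        used += length
    return selected


# Sentences are assumed to be sorted from most to least important
ranked = ["one two three", "four five", "six seven eight nine", "ten"]
print(select_by_ratio(ranked, 0.5))
print(select_by_word_count(ranked, 5))
```

A ratio scales with the length of the source text, while a word budget gives a summary of roughly fixed size regardless of how long the source is.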

In [25]:
gensim_summary_text = summarize(text, word_count=200, ratio = 0.1)

In [26]:
gensim_summary_text

'Automatic summarization is the process of shortening a set of data computationally, to create a subset (a summary) that represents the most important or relevant information within the original content.\nText summarization is usually implemented by natural language processing methods, designed to locate the most informative sentences in a given document.[1] On the other hand, visual content can be summarized using computer vision algorithms.\nImage summarization is the subject of ongoing research; existing approaches typically attempt to display the most representative images from a given image collection, or generate a video that only includes the most important content from the entire collection.[2][3][4] Video summarization algorithms identify and extract from the original video content the most important frames (key-frames), and/or the most important video segments (key-shots), normally in a temporally ordered fashion.[5][6][7][8] Video summaries simply retain a carefully selected

## 3. Summa

## Task: Take a piece of text from a wiki page and summarize it using Summa
### Steps
- Install the necessary libraries
- Import the libraries
- Scrape the text from a pre-defined webpage
- Summarize

### Install the library

In [27]:
# !pip install summa

### Import the library

In [28]:
from summa import summarizer
from summa import keywords

### Scrape the text
- Use BeautifulSoup to extract the text (from Task 1 of ML-1); the `text` collected in the Gensim section is reused here

### Summarize

In [29]:
# Extractive summarization
extractive_summary = summarizer.summarize(text)
print("Extractive Summary:")
print(extractive_summary)

Extractive Summary:
Automatic summarization is the process of shortening a set of data computationally, to create a subset (a summary) that represents the most important or relevant information within the original content.
Text summarization is usually implemented by natural language processing methods, designed to locate the most informative sentences in a given document.[1] On the other hand, visual content can be summarized using computer vision algorithms.
Image summarization is the subject of ongoing research; existing approaches typically attempt to display the most representative images from a given image collection, or generate a video that only includes the most important content from the entire collection.[2][3][4] Video summarization algorithms identify and extract from the original video content the most important frames (key-frames), and/or the most important video segments (key-shots), normally in a temporally ordered fashion.[5][6][7][8] Video summaries simply retain a c

In [30]:
# Keyword extraction
keywords_text = keywords.keywords(text)
print("\nKeywords:")
print(keywords_text)


Keywords:
summarized
summarizes
summarizing
summary
keyphrases
keyphrase
automatic summarization
text
texts
algorithms
algorithm
algorithmically
informativeness
model
models
modeling
informative sentences
similarly
similarity
similar
document
documents
summaries simply
sentence
relevant information
evaluated
evaluation
evaluates
evaluate
evaluations
methods
method
extract
extraction
extracted
extractive
extracting
feature
features
rank
ranked
ranks
video
videos
results
resulting
result
human
humans
based
problem
problems
learning
learn
learned
relevance
important
importance
like
likely
examples
example
different
difference
differ
generate
generating
generates
generated
multi
words
word
abstraction
abstract
abstractive
abstracts
graph ranking
process
processes
unigram
unigrams
set
sets
textrank
news
approached
approach
automatically produce
supervised
supervision
use
useful
uses
researchers
high
highly
factor
factorization
function
functions
appear
appears
rouge
training
new
selected
s
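
The keyword list above contains many inflected variants of the same word (for example `summarized`, `summarizes`, `summarizing`). A rough way to make it easier to read is to group variants that share a stem. The sketch below uses a deliberately crude suffix stripper; `crude_stem`, `group_keywords`, and the suffix list are assumptions for illustration, and a proper stemmer would group more reliably.

```python
# Suffixes are tried in this order; the list is an illustrative assumption
SUFFIXES = ("ing", "ed", "es", "s")


def crude_stem(word):
    # Strip the first matching suffix, keeping at least three characters
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def group_keywords(keywords):
    # Group keywords that reduce to the same crude stem, preserving input order
    groups = {}
    for keyword in keywords:
        groups.setdefault(crude_stem(keyword), []).append(keyword)
    return list(groups.values())


sample = ["summarized", "summarizes", "summarizing", "text", "texts", "model", "models", "modeling"]
for group in group_keywords(sample):
    print(", ".join(group))
```

With the output above, the input list could be built with `keywords_text.splitlines()`, since the keywords are printed one per line.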

## ASSIGNMENT: Take the same Medium article (the one I wrote) that we used for Task 1 of ML-1, extract its text, summarize it using all the above methods, and provide the best summary with a note explaining why the chosen library is the best
url = https://medium.com/@subashgandyer/papa-what-is-a-neural-network-c5e5cc427c7

### Submit 2 files
- (notebook) .ipynb
- (summary) .txt

In [31]:
med_url = "https://medium.com/@subashgandyer/papa-what-is-a-neural-network-c5e5cc427c7"

In [32]:
# Parse the Medium article once and reuse the parser for every Sumy summarizer
med_parser = HtmlParser.from_url(med_url, Tokenizer("english"))

# Join the parsed sentences into plain text for Gensim and Summa
med_text = ' '.join([str(sentence) for sentence in med_parser.document.sentences])

# Summarize using Gensim with the default settings (ratio=0.2)
gensim_summary = summarize(med_text)

# Summarize using Gensim with a 200-word limit (ratio is ignored when word_count is set)
gensim_summary1 = summarize(med_text, word_count=200, ratio=0.1)

# Summarize using LexRank for Medium
lex_rank_summarizer = LexRankSummarizer()
lex_rank_summary_med = ' '.join([str(sentence) for sentence in lex_rank_summarizer(med_parser.document, sentences_count=3)])

# Summarize using Luhn for Medium
luhn_summarizer = LuhnSummarizer()
luhn_summary_med = ' '.join([str(sentence) for sentence in luhn_summarizer(med_parser.document, sentences_count=3)])

# Summarize using LSA for Medium
lsa_summarizer = LsaSummarizer()
lsa_summary_med = ' '.join([str(sentence) for sentence in lsa_summarizer(med_parser.document, sentences_count=3)])

# Summarize using TextRank for Medium
text_rank_summarizer = TextRankSummarizer()
text_rank_summary_med = ' '.join([str(sentence) for sentence in text_rank_summarizer(med_parser.document, sentences_count=3)])

# Summarize using Summa for Medium
summa_summary_med = summarizer.summarize(med_text, words=50)

In [33]:
# Print Gensim Summary
print("\nGensim Summary:")
print(gensim_summary)


Gensim Summary:
Pictures of Tanishi and MeMy 4-year-old angel came running to me, asked me to play with her for a while.
“Papa, tell me what stuff means and something means.” Cannot help evade a cute curious face, I said, “I am working on Neural Network.” Before I finish the statement, “Papa, What is a Meural Metark?” I gave up my stubbornness of avoiding her.
With a smile, I said slowly, “Its Neu — ral Net — work” She asked, “Papa, What is Meu-ral Met-ark?” Papa, What is a Neural Network?At the back of my head, thoughts of me taking days to comprehend what a NN (short for Neural Network) is, how it would work, where it is used, how it is simulating our human brain’s inner workings were going through.
“Neural Network is a collection (a network) of neurons whose job is to learn a new thing or a new place or a new process or a new concept.” It would be stupid on my part to start with a definition of Neural Network like how we used to teach adults in college.
Asked her to draw a dog out 

In [34]:
# Summarize using Gensim with a 200-word limit and print the result
gensim_summary1 = summarize(med_text, word_count=200, ratio=0.1)
print("\nGensim Summary:")
print(gensim_summary1)


Gensim Summary:
“Neural Network is a collection (a network) of neurons whose job is to learn a new thing or a new place or a new process or a new concept.” It would be stupid on my part to start with a definition of Neural Network like how we used to teach adults in college.
What I was actually doing here was teaching her neural network (brain) the features of a lion like exactly how Machine Learning Engineers would train the machine to learn new features.
After telling her the features of a lion, asked her “Can you draw these for me?” She happily drew almost a similar figure to that of a dog she drew before.
A dog will have features like face, body, legs, and tail.
A lion will have features like face, body, legs, tail and a beard.
The neurons grouped together with features like face, body, legs, tail and a beard forms a lion.
Once all the features are there, the neurons will send a signal that the picture you are looking at is a lion and not a dog.
Neural network is a group of neuron

In [35]:
# Print LexRank Summary for Medium
print("\nLexRank Summary:")
print(lex_rank_summary_med)


LexRank Summary:
Was it a dog or a lion? Picture of my version of Neural Network with their Neuron friends“Your brain is here inside our head. Ultimately, the neurons in your brain tell that it is a lion and not a dog.


In [36]:
# Print Luhn Summary for Medium
print("\nLuhn Summary:")
print(luhn_summary_med)


Luhn Summary:
Papa, What is a Neural Network?At the back of my head, thoughts of me taking days to comprehend what a NN (short for Neural Network) is, how it would work, where it is used, how it is simulating our human brain’s inner workings were going through. How you learnt it is because of Neural Network inside your brain.” Now, a neural network is a collection of neurons that keeps switching on and off based on things you see, feel, hear and think just like switching on light bulb at our home. When you see a new object, your brain will ask the neurons, ‘Hey, anybody experienced this before?’ The neurons will say, ‘Yes, I have seen this.’ Certain other neurons will say, ‘No, I have not seen this.’ The neurons that have seen this before, will group together and form logical connections from the past and gives us an object from our memory.


In [37]:
# Print LSA Summary for Medium
print("\nLSA Summary:")
print(lsa_summary_med)


LSA Summary:
For example, when I showed you a lion picture, your brain asked the neurons who had seen it before. Every neuron will tune itself to pick up certain features like legs, tail, face, beard, and so on. And I hope she will not come to me running asking “Papa, what is Meural Metark?” again.


In [38]:
# Print TextRank Summary for Medium
print("\nTextRank Summary:")
print(text_rank_summary_med)


TextRank Summary:
Papa, What is a Neural Network?At the back of my head, thoughts of me taking days to comprehend what a NN (short for Neural Network) is, how it would work, where it is used, how it is simulating our human brain’s inner workings were going through. “Neural Network is a collection (a network) of neurons whose job is to learn a new thing or a new place or a new process or a new concept.” The same principle is applied for a song that you hear, a cartoon that you watch, a rhyme that you sing, an animal that you draw, a food that you taste, a flower that you smell and so on.


In [39]:
# Print Summa Summary for Medium
print("\nSumma Summary:")
print(summa_summary_med)


Summa Summary:
After telling her the features of a lion, asked her “Can you draw these for me?” She happily drew almost a similar figure to that of a dog she drew before.
The neurons grouped together with features like face, body, legs, tail and a beard forms a lion.


In [40]:
with open("lex_rank_summary_med.txt", "w", encoding="utf-8") as file:
    file.write(lex_rank_summary_med)
with open("luhn_summary_med.txt", "w", encoding="utf-8") as file:
    file.write(luhn_summary_med)
with open("lsa_summary_med.txt", "w", encoding="utf-8") as file:
    file.write(lsa_summary_med)
with open("text_rank_summary_med.txt", "w", encoding="utf-8") as file:
    file.write(text_rank_summary_med)
with open("gensim_summary.txt", "w", encoding="utf-8") as file:
    file.write(gensim_summary)
with open("gensim_summary1.txt", "w", encoding="utf-8") as file:
    file.write(gensim_summary1)
with open("summa_summary_med.txt", "w", encoding="utf-8") as file:
    file.write(summa_summary_med)
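
The cell above repeats the same `open`/`write` pair for every summary. One alternative, sketched here, is to collect the summaries in a dictionary and write them in a loop. `save_summaries` and its `output_dir` parameter are hypothetical names, and the file names are simply taken from the dictionary keys.

```python
from pathlib import Path


def save_summaries(summaries, output_dir="."):
    """Write each summary to <output_dir>/<name>.txt and return the paths."""
    output_path = Path(output_dir)
    output_path.mkdir(parents=True, exist_ok=True)
    written = []
    for name, summary in summaries.items():
        target = output_path / f"{name}.txt"
        target.write_text(summary, encoding="utf-8")
        written.append(target)
    return written


# Usage in this notebook (assumes the summary variables from the cells above):
# save_summaries({
#     "lex_rank_summary_med": lex_rank_summary_med,
#     "luhn_summary_med": luhn_summary_med,
#     "lsa_summary_med": lsa_summary_med,
#     "text_rank_summary_med": text_rank_summary_med,
#     "gensim_summary": gensim_summary,
#     "gensim_summary1": gensim_summary1,
#     "summa_summary_med": summa_summary_med,
# })
```

Adding another summarizer then only requires one more dictionary entry rather than another `with` block.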

In [41]:
# Note explaining which summarizer produced the best summary and why
best_summarizer = (
    "In my opinion, the most effective summarizer is Text Rank Summary. While the other summarizers are decent, they often contain information that requires additional context for a comprehensive understanding. Among them, Lex Rank stands out as the least effective, presenting information in a simplistic manner almost like a child. Text Rank, on the other hand, offers a concise overview with general examples, sparking reader interest and encouraging further exploration of the article for more specific details."
)

# Print the statement
print(best_summarizer)

In my opinion, the most effective summarizer is Text Rank Summary. While the other summarizers are decent, they often contain information that requires additional context for a comprehensive understanding. Among them, Lex Rank stands out as the least effective, presenting information in a simplistic manner almost like a child. Text Rank, on the other hand, offers a concise overview with general examples, sparking reader interest and encouraging further exploration of the article for more specific details.


In [42]:
# Save the statement to a text file
filename = "best_summarizer(Text_Rank_Summary).txt"
with open(filename, "w", encoding="utf-8") as file:
    # Save the statement to the file
    file.write(best_summarizer)

# Print statement for confirmation
print(f"\nStatement has been saved to {filename}")


Statement has been saved to best_summarizer(Text_Rank_Summary).txt
