# Summarization
## This notebook outlines the concepts behind Text Summarization

## Summarization
- The task of capturing the essential gist of a long piece of text in a much shorter form

### Types of Summarization
- 1. **Extractive Summarization**
    - Selects the sentences from the source text that best represent it
    - Arranges them, unchanged, to form a summary
- 2. **Abstractive Summarization**
    - Identifies the most important ideas in the text
    - Paraphrases them in newly generated sentences to form a summary
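
To make the extractive idea concrete, here is a minimal sketch that scores sentences by average word frequency and keeps the top ones in their original order. It uses only the standard library. The stop-word list, the naive sentence splitter, and the function names are illustrative assumptions, not what Sumy, Gensim, or Summa do internally.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real libraries ship larger ones).
STOP_WORDS = {"a", "an", "and", "are", "as", "by", "for", "in", "is", "it",
              "of", "on", "that", "the", "to", "with"}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def content_words(text):
    """Lowercased words with stop words removed."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP_WORDS]


def extractive_summary(text, sentence_count=2):
    """Return the highest-scoring sentences, kept in their original order."""
    sentences = split_sentences(text)
    freq = Counter(content_words(text))

    def score(sentence):
        tokens = content_words(sentence)
        # Average frequency, so long sentences are not favoured automatically.
        return sum(freq[t] for t in tokens) / len(tokens) if tokens else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:sentence_count])
    return [sentences[i] for i in chosen]


sample = (
    "Summarization shortens a long text. "
    "Extractive summarization selects sentences from the text. "
    "The weather was pleasant yesterday. "
    "Abstractive summarization rewrites the text in new words."
)
summary = extractive_summary(sample, sentence_count=2)
print(summary)
```

An abstractive summarizer would instead generate new sentences, which requires a language model and is outside the scope of this sketch.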

## Summarization Libraries
- Sumy
- Gensim
- Summa
- Transformer models **
    - BERT **
    - BART **
    - PEGASUS **
    - T5 **

** Will be seen in DL-1


## 1. Sumy
    1. Luhn – Heuristic method
    2. Latent Semantic Analysis
    3. LexRank – Unsupervised approach inspired by the PageRank and HITS algorithms
    4. TextRank – Graph-based summarization technique that also supports keyword extraction from the document

Documentation Reference [sumy](https://github.com/miso-belica/sumy)
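
The graph-based idea behind LexRank and TextRank can be sketched in a few lines: treat each sentence as a node, weight edges by sentence similarity, and run a PageRank-style iteration. The code below is a simplified illustration under stated assumptions. It uses Jaccard word overlap as the similarity measure and a fixed number of iterations, which is not how Sumy implements either algorithm.

```python
import re


def tokenize(sentence):
    """Set of lowercased words in a sentence."""
    return set(re.findall(r"[a-z]+", sentence.lower()))


def similarity(a, b):
    """Jaccard word overlap; a stand-in for the measures the real algorithms use."""
    ta, tb = tokenize(a), tokenize(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def rank_sentences(sentences, damping=0.85, iterations=50):
    """PageRank-style scores over a sentence-similarity graph."""
    n = len(sentences)
    if n == 0:
        return []
    # Weighted adjacency matrix without self-loops.
    weights = [
        [similarity(sentences[i], sentences[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    out_sum = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            # Each neighbour j passes on a share of its score proportional to the edge weight.
            incoming = sum(
                scores[j] * weights[j][i] / out_sum[j]
                for j in range(n)
                if out_sum[j] > 0
            )
            new_scores.append((1 - damping) / n + damping * incoming)
        scores = new_scores
    return scores


sentences = [
    "Graph ranking scores sentences by similarity.",
    "Similar sentences vote for each other in the graph.",
    "Ranking uses a damping factor like PageRank.",
    "Bananas are yellow.",
]
scores = rank_sentences(sentences)
for sentence, score in zip(sentences, scores):
    print(f"{score:.3f}  {sentence}")
```

A sentence that shares no words with the rest of the text receives only the baseline score, which is why off-topic sentences tend to be left out of the summary.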

#### Citation: ChatGPT, Stack Overflow

## Task: Take a piece of text from a wiki page and summarize it using Sumy
### Steps
- Install the necessary libraries
- Import the libraries
- Scrape the text from a pre-defined webpage
- Summarize

### Install Sumy

In [1]:
! pip install sumy



In [2]:
!pip install nltk
import nltk
nltk.download('punkt')



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [3]:
import nltk
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [4]:
from nltk.tokenize import sent_tokenize

test_text = "Hello world. This is a test sentence."
sentences = sent_tokenize(test_text)
print("Tokenization test successful. Sentences:", sentences)

Tokenization test successful. Sentences: ['Hello world.', 'This is a test sentence.']


### Import the libraries
- HtmlParser
- Tokenizer
- TextRankSummarizer

In [5]:
from sumy.parsers.html import HtmlParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.text_rank import TextRankSummarizer

### Scrape the text

In [6]:
url = "https://en.wikipedia.org/wiki/Automatic_summarization"


In [7]:
parser = HtmlParser.from_url(url, Tokenizer("english"))

In [8]:
doc = parser.document
doc

<DOM with 63 paragraphs>

### Summarize - TextRankSummarizer

In [9]:
summarizer = TextRankSummarizer()

In [10]:
summary_text = summarizer(doc, 5)
summary_text

(<Sentence: For text, extraction is analogous to the process of skimming, where the summary (if available), headings and subheadings, figures, the first and last paragraphs of a section, and optionally the first and last sentences in a paragraph are read before one chooses to read the entire document in detail.>,
 <Sentence: Instead of trying to learn explicit features that characterize keyphrases, the TextRank algorithm[16] exploits the structure of the text itself to determine keyphrases that appear "central" to the text in the same way that PageRank selects important Web pages.>,
 <Sentence: Once the graph is constructed, it is used to form a stochastic matrix, combined with a damping factor (as in the "random surfer model"), and the ranking over vertices is obtained by finding the eigenvector corresponding to eigenvalue 1 (i.e., the stationary distribution of the random walk on the graph).>,
 <Sentence: While the goal of a brief summary is to simplify information search and cut the

### Try different Summarizers
- LexRankSummarizer
- LuhnSummarizer
- LsaSummarizer
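
When trying several summarizers on the same document, a small helper can collect the results side by side. This is a sketch, not part of Sumy. `compare_summarizers` is a hypothetical name, and the two stand-in summarizers exist only so the example runs without any library installed. The helper assumes each summarizer is a callable that takes a document and a sentence count, which is how the summarizers are called in this notebook.

```python
def compare_summarizers(summarizers, document, sentence_count=5):
    """Run each summarizer on the same document and collect sentences as strings.

    `summarizers` maps a label to any callable taking (document, sentence_count).
    """
    results = {}
    for name, summarize in summarizers.items():
        results[name] = [str(sentence) for sentence in summarize(document, sentence_count)]
    return results


# Stand-in summarizers (hypothetical) so the sketch is runnable on its own.
def first_sentences(document, sentence_count):
    return document[:sentence_count]


def longest_sentences(document, sentence_count):
    return sorted(document, key=len, reverse=True)[:sentence_count]


toy_document = [
    "Short one.",
    "A somewhat longer sentence about summarization.",
    "Mid-length sentence here.",
]
comparison = compare_summarizers(
    {"first": first_sentences, "longest": longest_sentences},
    toy_document,
    sentence_count=2,
)
for name, picked in comparison.items():
    print(name, "->", picked)
```

Once the Sumy summarizers have been created, the same helper could be called with a dictionary such as `{"LexRank": lexSummarizer, "Luhn": luhnSummarizer, "LSA": lsaSummarizer}` and the parsed `doc`.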

### Import the summarizers

In [11]:
from sumy.summarizers.lex_rank import LexRankSummarizer
from sumy.summarizers.luhn import LuhnSummarizer
from sumy.summarizers.lsa import LsaSummarizer

### Create Summarizers

In [12]:
lexSummarizer =  LexRankSummarizer()
luhnSummarizer = LuhnSummarizer()
lsaSummarizer = LsaSummarizer()

### LexRankSummarizer

In [13]:
lex_summary_text = lexSummarizer(doc, 5)
lex_summary_text

(<Sentence: An example of a summarization problem is document summarization, which attempts to automatically produce an abstract from a given document.>,
 <Sentence: The main difficulty in supervised extractive summarization is that the known summaries must be manually created by extracting sentences so the sentences in an original training document can be labeled as "in summary" or "not in summary".>,
 <Sentence: The sentences in these summaries do not necessarily match up with sentences in the original text, so it would be difficult to assign labels to examples for training.>,
 <Sentence: For example, in document summarization, one would like the summary to cover all important and relevant concepts in the document.>,
 <Sentence: Automatic Text Summarization.>)

### LuhnSummarizer

In [14]:
luhn_summary_text = luhnSummarizer(doc, 5)
luhn_summary_text

(<Sentence: Such transformation, however, is computationally much more challenging than extraction, involving both natural language processing and often a deep understanding of the domain of the original text in cases where the original document relates to a special field of knowledge.>,
 <Sentence: Once the graph is constructed, it is used to form a stochastic matrix, combined with a damping factor (as in the "random surfer model"), and the ranking over vertices is obtained by finding the eigenvector corresponding to eigenvalue 1 (i.e., the stationary distribution of the random walk on the graph).>,
 <Sentence: For example, if we rank unigrams and find that "advanced", "natural", "language", and "processing" all get high ranks, then we would look at the original text and see that these words appear consecutively and create a final keyphrase using all four together.>,
 <Sentence: It is worth noting that TextRank was applied to summarization exactly as described here, while LexRank was 

### LsaSummarizer

In [15]:
lsa_summary_text = lsaSummarizer(doc, 5)
lsa_summary_text

(<Sentence: For instance, in the above text, we might learn a rule that says phrases with initial capital letters are likely to be keyphrases.>,
 <Sentence: Hulth uses a reduced set of features, which were found most successful in the KEA (Keyphrase Extraction Algorithm) work derived from Turney's seminal paper.>,
 <Sentence: It is worth noting that TextRank was applied to summarization exactly as described here, while LexRank was used as part of a larger summarization system ( MEAD) that combines the LexRank score (stationary probability) with other features like sentence position and length using a linear combination with either user-specified or automatically tuned weights.>,
 <Sentence: Automatic summaries present information extracted from multiple sources algorithmically, without any editorial touch or subjective human intervention, thus making it completely unbiased.>,
 <Sentence: Although they did not replace other approaches and are often combined with them, by 2019 machine le

## 2. Gensim

## Task: Take a piece of text from a wiki page and summarize it using Gensim
### Steps
- Install the necessary libraries
- Import the libraries
- Scrape the text from a pre-defined webpage
- Summarize

### Install the library

In [16]:
#!pip install --upgrade pip setuptools wheel

In [17]:
!pip install gensim==3.6.0



In [21]:
! pip install --upgrade gensim

Collecting gensim
  Downloading gensim-4.3.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (8.2 kB)
Downloading gensim-4.3.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (26.5 MB)
Installing collected packages: gensim
  Attempting uninstall: gensim
    Found existing installation: gensim 3.6.0
    Uninstalling gensim-3.6.0:
      Successfully uninstalled gensim-3.6.0
Successfully installed gensim-4.3.3


In [24]:
from collections.abc import Mapping
from collections import defaultdict

In [26]:
# The gensim.summarization module was removed in gensim 4.x, so pin a 3.x release
! pip install gensim==3.6.0

Collecting gensim==3.6.0
  Using cached gensim-3.6.0-cp310-cp310-linux_x86_64.whl
Installing collected packages: gensim
  Attempting uninstall: gensim
    Found existing installation: gensim 4.3.3
    Uninstalling gensim-4.3.3:
      Successfully uninstalled gensim-4.3.3
Successfully installed gensim-3.6.0


### Import the library

In [27]:
from gensim.summarization import summarize

### Scrape the text
- Use BeautifulSoup to extract the text (from Task 1 of ML-1)

In [32]:
import requests
import re
from bs4 import BeautifulSoup

In [33]:
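def get_page(url):
    # Fetch the page and parse its HTML into a BeautifulSoup tree
    res = requests.get(url, timeout=10)
    res.raise_for_status()  # fail early on HTTP errors
    soup = BeautifulSoup(res.text, 'html.parser')
    return soup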

In [34]:
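def collect_text(soup):
    # Note: relies on the global `url` defined in the next cell
    text = f'url: {url}\n\n'
    para_text = soup.find_all('p')  # every <p> tag on the page
    print(f"paragraphs text = \n {para_text}")
    # Append the plain text of each paragraph, separated by blank lines
    for para in para_text:
        text += f"{para.text}\n\n"
    return text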

In [35]:
url = "https://en.wikipedia.org/wiki/Automatic_summarization"

In [36]:
text = collect_text(get_page(url))
text

paragraphs text = 
 [<p><b>Automatic summarization</b> is the process of shortening a set of data computationally, to create a subset (a <a href="/wiki/Abstract_(summary)" title="Abstract (summary)">summary</a>) that represents the most important or relevant information within the original content. <a href="/wiki/Artificial_intelligence" title="Artificial intelligence">Artificial intelligence</a> <a href="/wiki/Algorithm" title="Algorithm">algorithms</a> are commonly developed and employed to achieve this, specialized for different types of data.
</p>, <p><a href="/wiki/Plain_text" title="Plain text">Text</a> summarization is usually implemented by <a href="/wiki/Natural_language_processing" title="Natural language processing">natural language processing</a> methods, designed to locate the most informative sentences in a given document.<sup class="reference" id="cite_ref-Torres2014_1-0"><a href="#cite_note-Torres2014-1"><span class="cite-bracket">[</span>1<span class="cite-bracket">]</

'url: https://en.wikipedia.org/wiki/Automatic_summarization\n\nAutomatic summarization is the process of shortening a set of data computationally, to create a subset (a summary) that represents the most important or relevant information within the original content. Artificial intelligence algorithms are commonly developed and employed to achieve this, specialized for different types of data.\n\n\nText summarization is usually implemented by natural language processing methods, designed to locate the most informative sentences in a given document.[1] On the other hand, visual content can be summarized using computer vision algorithms. Image summarization is the subject of ongoing research; existing approaches typically attempt to display the most representative images from a given image collection, or generate a video that only includes the most important content from the entire collection.[2][3][4] Video summarization algorithms identify and extract from the original video content the 
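
The scraped text still contains Wikipedia citation markers such as `[1]` and runs of blank lines. A small cleanup step with `re` (already imported above) can be sketched as follows. The function name `clean_text` is an assumption for illustration, and the pattern only targets purely numeric markers.

```python
import re


def clean_text(raw_text):
    """Strip numeric citation markers like [1] and collapse runs of blank lines."""
    without_citations = re.sub(r"\[\d+\]", "", raw_text)
    # Three or more consecutive newlines become a single blank line.
    return re.sub(r"\n{3,}", "\n\n", without_citations)


raw_text = (
    "in a given document.[1] On the other hand.\n\n\n"
    "Image summarization is ongoing.[2][3][4] Video"
)
cleaned = clean_text(raw_text)
print(cleaned)
```

Removing the markers before summarizing keeps them out of the selected sentences.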

### Summarize
- **word_count**: maximum number of words we want in the summary
- **ratio**: fraction of the sentences in the original text that should be returned as output
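
The difference between the two limits can be illustrated with a standard-library sketch. Given sentences already sorted from most to least important, `ratio` keeps a fraction of them, while `word_count` keeps adding sentences until a word budget is reached. The function names and the exact cut-off rules here are assumptions for illustration and may differ from Gensim's own behaviour.

```python
def select_by_ratio(ranked_sentences, ratio=0.2):
    """Keep the top fraction of the ranked sentences (at least one, if any exist)."""
    keep = max(1, int(len(ranked_sentences) * ratio))
    return ranked_sentences[:keep]


def select_by_word_count(ranked_sentences, word_count=50):
    """Add sentences in rank order until the next one would exceed the word budget."""
    selected, total = [], 0
    for sentence in ranked_sentences:
        n_words = len(sentence.split())
        if total + n_words > word_count:
            break
        selected.append(sentence)
        total += n_words
    return selected


# Ten toy sentences of three words each, assumed to be sorted by importance.
ranked = [f"sentence number {i}" for i in range(10)]

by_ratio = select_by_ratio(ranked, ratio=0.2)
by_words = select_by_word_count(ranked, word_count=7)
print(len(by_ratio), len(by_words))
```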

In [61]:
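# Note: this cell reuses Sumy's TextRankSummarizer on `doc` from section 1;
# the gensim `summarize` function imported above is not called here.
from sumy.summarizers.text_rank import TextRankSummarizer

summarizer = TextRankSummarizer()
summary_text = summarizer(doc, 5)  # top 5 sentences

for sentence in summary_text:
    print(sentence)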

For text, extraction is analogous to the process of skimming, where the summary (if available), headings and subheadings, figures, the first and last paragraphs of a section, and optionally the first and last sentences in a paragraph are read before one chooses to read the entire document in detail.
Instead of trying to learn explicit features that characterize keyphrases, the TextRank algorithm[16] exploits the structure of the text itself to determine keyphrases that appear "central" to the text in the same way that PageRank selects important Web pages.
Once the graph is constructed, it is used to form a stochastic matrix, combined with a damping factor (as in the "random surfer model"), and the ranking over vertices is obtained by finding the eigenvector corresponding to eigenvalue 1 (i.e., the stationary distribution of the random walk on the graph).
While the goal of a brief summary is to simplify information search and cut the time by pointing to the most relevant source document

## 3. Summa

## Task: Take a piece of text from a wiki page and summarize it using Summa
### Steps
- Install the necessary libraries
- Import the libraries
- Scrape the text from a pre-defined webpage
- Summarize

### Install the library

In [45]:
! pip install summa

Collecting summa
  Downloading summa-1.2.0.tar.gz (54 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: summa
  Building wheel for summa (setup.py) ... done
  Created wheel for summa: filename=summa-1.2.0-py3-none-any.whl size=54389 sha256=828e08e40e3a039ae190b9a6d5f207e40f87f27aa036373ff501625030d13faf
  Stored in directory: /root/.cache/pip/wheels/4a/ca/c5/4958614cfba88ed6ceb7cb5a849f9f89f9ac49971616bc919f
Successfully built summa
Installing collected packages: summa
Successfully installed summa-1.2.0


### Import the library

In [46]:
from summa import summarizer
from summa import keywords

### Scrape the text
- Reuse the `text` extracted with BeautifulSoup in the Gensim section (from Task 1 of ML-1)

### Summarize

In [47]:
summa_summary_text = summarizer.summarize(text, ratio=0.1)
summa_summary_text

'Automatic summarization is the process of shortening a set of data computationally, to create a subset (a summary) that represents the most important or relevant information within the original content.\nText summarization is usually implemented by natural language processing methods, designed to locate the most informative sentences in a given document.[1] On the other hand, visual content can be summarized using computer vision algorithms.\nImage summarization is the subject of ongoing research; existing approaches typically attempt to display the most representative images from a given image collection, or generate a video that only includes the most important content from the entire collection.[2][3][4] Video summarization algorithms identify and extract from the original video content the most important frames (key-frames), and/or the most important video segments (key-shots), normally in a temporally ordered fashion.[5][6][7][8] Video summaries simply retain a carefully selected

## ASSIGNMENT: Take the same Medium article (the one I wrote) that we used for Task 1 of ML-1, extract the text, summarize it using all the above methods, and provide the best summary with a note explaining why the chosen library is the best
url = https://medium.com/@subashgandyer/papa-what-is-a-neural-network-c5e5cc427c7

### Submit 2 files
- (notebook) .ipynb
- (summary) .txt

In [56]:
import requests
from bs4 import BeautifulSoup

url = "https://medium.com/@subashgandyer/papa-what-is-a-neural-network-c5e5cc427c7"

def get_article_text(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    paragraphs = soup.find_all('p')
    article_text = ""
    for p in paragraphs:
        article_text += p.get_text() + "\n"
    return article_text

article_text = get_article_text(url)
print("Extracted text:")
print(article_text[:500])


Extracted text:
Sign up
Sign in
Sign up
Sign in
Subash Gandyer
Follow
--
1
Listen
Share
It was a cozy Sunday afternoon in the month of February 2018. I just finished my huge customary Sunday lunch spread with family and resting along. Everyone in the family was taking a quick nap for a pre-planned evening outing. Well not everyone, actually.
My 4-year-old angel came running to me, asked me to play with her for a while. As I was lazy and not in a position to move after the big spread, I evaded the chance to play
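
The extracted text starts with page residue such as "Sign up", "Follow" and "Share", which can end up in a summary. One rough way to drop it is to discard very short lines before summarizing. This is a heuristic sketch. The `drop_short_lines` name and the five-word threshold are assumptions, and the filter would also remove genuinely short lines of dialogue.

```python
def drop_short_lines(text, min_words=5):
    """Keep only the lines that contain at least `min_words` words."""
    kept = [line for line in text.splitlines() if len(line.split()) >= min_words]
    return "\n".join(kept)


raw_article = (
    "Sign up\nSign in\nSubash Gandyer\nFollow\n--\n1\nListen\nShare\n"
    "It was a cozy Sunday afternoon in the month of February 2018.\n"
)
filtered_article = drop_short_lines(raw_article)
print(filtered_article)
```

The filtered string could then be passed to the summarizers in place of `article_text`.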


In [68]:
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.text_rank import TextRankSummarizer

parser = PlaintextParser.from_string(article_text, Tokenizer("english"))
summarizer = TextRankSummarizer()

sumy_summary = summarizer(parser.document, 10)
sumy_summary_text = " ".join([str(sentence) for sentence in sumy_summary])
print("Sumy Summary:")
print(sumy_summary_text)


Sumy Summary:
“Papa, tell me what stuff means and something means.” Cannot help evade a cute curious face, I said, “I am working on Neural Network.” Before I finish the statement, “Papa, What is a Meural Metark?” I gave up my stubbornness of avoiding her. With a smile, I said slowly, “Its Neu — ral Net — work” She asked, “Papa, What is Meu-ral Met-ark?” At the back of my head, thoughts of me taking days to comprehend what a NN (short for Neural Network) is, how it would work, where it is used, how it is simulating our human brain’s inner workings were going through. “Neural Network is a collection (a network) of neurons whose job is to learn a new thing or a new place or a new process or a new concept.” It would be stupid on my part to start with a definition of Neural Network like how we used to teach adults in college. After telling her the features of a lion, asked her “Can you draw these for me?” She happily drew almost a similar figure to that of a dog she drew before. Do you know

In [58]:
from summa import summarizer

summa_summary_text = summarizer.summarize(article_text, ratio=0.1)
print("Summa Summary:")
print(summa_summary_text)


Summa Summary:
After all, neural network inside our brain helps us to learn new things in our life.
What I was actually doing here was teaching her neural network (brain) the features of a lion like exactly how Machine Learning Engineers would train the machine to learn new features.
After telling her the features of a lion, asked her “Can you draw these for me?” She happily drew almost a similar figure to that of a dog she drew before.
A dog will have features like face, body, legs, and tail.
A lion will have features like face, body, legs, tail and a beard.
Her neural network got aligned with classifying Dogs and Lions after some training.
Do you know what is the difference between a lion and a dog?” She said, “Yes.” I said, “This is called Learning.
How you learnt it is because of Neural Network inside your brain.” Now, a neural network is a collection of neurons that keeps switching on and off based on things you see, feel, hear and think just like switching on light bulb at our ho

In [84]:

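# `summary_content` is assumed to have been built in an earlier cell (not shown here).
file_path = "Shrey_summarized_final.txt"
with open(file_path, "w") as file:
    file.write(summary_content)

# Note: the file above is written to the current working directory;
# the Windows path below is only echoed in the message, not used for saving.
windows_path = r"C:\Users\HP\Desktop\GBC Semester 1\Machine Learning 2\Assignments"
print(f"Summarized texts saved to {windows_path}")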

Summarized texts saved to C:\Users\HP\Desktop\GBC Semester 1\Machine Learning 2\Assignments


In [66]:
from google.colab import drive
drive.mount('/content/drive')


file_path = "/content/drive/My Drive/Shrey_summarized_texts.txt"
with open(file_path, "w") as file:
    file.write(summary_content)

print(f"Summarized texts saved to {file_path}")

Summarized texts saved to C:\Users\HP\Shrey_summarized_texts.txt
