# Summarization
## This notebook outlines the concepts behind Text Summarization
### Emilja Beneja 101539668

## Summarization
- The task of capturing the essential gist of a long piece of text in a much shorter form

### Types of Summarization
- 1. **Extractive Summarization**
    - Select sentences from the corpus that best represent the text
    - Arrange them to form a summary
- 2. **Abstractive Summarization**
    - Captures the most important ideas in the text
    - Generates new sentences that paraphrase them to form a summary
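
To make the extractive idea concrete, here is a minimal sketch that scores each sentence by the average frequency of its words and returns the top sentences in their original order. It is an illustration only, not how any of the libraries below work: the function names, the naive sentence splitter, and the scoring rule are all assumptions, and no stop-word removal is done.

```python
import re
from collections import Counter


def split_sentences(text):
    # Naive splitter (assumption): break after ., ! or ? followed by whitespace
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]


def extractive_summary(text, n_sentences=2):
    sentences = split_sentences(text)
    # Word frequencies over the whole text
    freq = Counter(re.findall(r'[a-z]+', text.lower()))

    def score(sentence):
        tokens = re.findall(r'[a-z]+', sentence.lower())
        # Average word frequency, so long sentences are not favoured automatically
        return sum(freq[t] for t in tokens) / max(len(tokens), 1)

    # Rank sentence indices by score, keep the best, then restore document order
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:n_sentences])
    return [sentences[i] for i in chosen]


sample = "Cats chase mice. Cats sleep a lot. The weather is nice today."
print(extractive_summary(sample, n_sentences=2))
```

Every sentence in the result is copied verbatim from the input, which is the defining property of extractive summarization; an abstractive system would instead generate new wording.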

## Summarization Libraries
- Sumy
- Gensim
- Summa
- Transformer-based models (covered in DL-1)
    - BERT
    - BART
    - PEGASUS
    - T5


## 1. Sumy
1. Luhn: heuristic method
2. Latent Semantic Analysis (LSA)
3. LexRank: unsupervised approach inspired by the PageRank and HITS algorithms
4. TextRank: graph-based summarization technique that also supports keyword extraction from the document
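
The graph-based ranking idea shared by LexRank and TextRank can be sketched in plain Python: treat sentences as nodes, weight edges by sentence similarity, and run a PageRank-style iteration. This is a simplified illustration, not Sumy's implementation; the Jaccard word-overlap similarity, the function names, and the default `damping` and `iterations` values are assumptions made for the sketch.

```python
import re


def tokenize(sentence):
    # Lowercased set of alphabetic words
    return set(re.findall(r'[a-z]+', sentence.lower()))


def similarity(a, b):
    # Jaccard overlap between two word sets (assumed similarity measure)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def rank_sentences(sentences, damping=0.85, iterations=50):
    n = len(sentences)
    if n == 0:
        return []
    tokens = [tokenize(s) for s in sentences]
    # Weighted adjacency matrix with no self-loops
    weights = [[similarity(tokens[i], tokens[j]) if i != j else 0.0 for j in range(n)]
               for i in range(n)]
    out_sum = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            # Each neighbour j passes on its score in proportion to the edge weight j -> i
            incoming = sum(weights[j][i] / out_sum[j] * scores[j]
                           for j in range(n) if out_sum[j] > 0)
            new_scores.append((1 - damping) / n + damping * incoming)
        scores = new_scores
    return scores


demo_sentences = [
    "the cat sat on the mat",
    "the cat lay on the mat",
    "stock prices fell sharply",
]
print(rank_sentences(demo_sentences))
```

Sentences that share vocabulary with many other sentences accumulate a higher score, while an unrelated sentence keeps only the baseline `(1 - damping) / n`.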

Documentation reference: [sumy](https://github.com/miso-belica/sumy)

## Task: Take a piece of text from a web page and summarize it using Sumy
### Steps
- Install the necessary libraries
- Import the libraries
- Scrape the text from a pre-defined webpage
- Summarize

### Install Sumy

In [3]:
pip install lxml[html_clean]

Collecting lxml-html-clean (from lxml[html_clean])
  Downloading lxml_html_clean-0.4.0-py3-none-any.whl.metadata (2.4 kB)
Downloading lxml_html_clean-0.4.0-py3-none-any.whl (14 kB)
Installing collected packages: lxml-html-clean
Successfully installed lxml-html-clean-0.4.0
Note: you may need to restart the kernel to use updated packages.


In [1]:
pip install sumy

Collecting sumy
  Downloading sumy-0.11.0-py2.py3-none-any.whl.metadata (7.5 kB)
Collecting docopt<0.7,>=0.6.1 (from sumy)
  Downloading docopt-0.6.2.tar.gz (25 kB)
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Getting requirements to build wheel: started
  Getting requirements to build wheel: finished with status 'done'
  Preparing metadata (pyproject.toml): started
  Preparing metadata (pyproject.toml): finished with status 'done'
Collecting breadability>=0.1.20 (from sumy)
  Downloading breadability-0.1.20.tar.gz (32 kB)
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Getting requirements to build wheel: started
  Getting requirements to build wheel: finished with status 'done'
  Preparing metadata (pyproject.toml): started
  Preparing metadata (pyproject.toml): finished with status 'done'
Collecting pycountry>=18.2.23 (from sumy)
  Downloading pycountry-24.6.1-py3-none-

In [11]:
pip install nltk

Note: you may need to restart the kernel to use updated packages.


### Import the libraries
- HtmlParser
- Tokenizer
- TextRankSummarizer

In [25]:
from sumy.parsers.html import HtmlParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.text_rank import TextRankSummarizer
import nltk
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\user\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt_tab.zip.


True

### Scrape the text

In [62]:
url = "https://medium.com/@subashgandyer/papa-what-is-a-neural-network-c5e5cc427c7"
parser = HtmlParser.from_url(url, Tokenizer("english"))


In [63]:
doc = parser.document
doc

<DOM with 46 paragraphs>

### Summarize - TextRankSummarizer

In [64]:
summarizer = TextRankSummarizer()

In [65]:
summary_text = summarizer(doc, 5)
summary_text

(<Sentence: Papa, What is a Neural Network?At the back of my head, thoughts of me taking days to comprehend what a NN (short for Neural Network) is, how it would work, where it is used, how it is simulating our human brain’s inner workings were going through.>,
 <Sentence: “Neural Network is a collection (a network) of neurons whose job is to learn a new thing or a new place or a new process or a new concept.”>,
 <Sentence: After telling her the features of a lion, asked her “Can you draw these for me?” She happily drew almost a similar figure to that of a dog she drew before.>,
 <Sentence: When you see a new object, your brain will ask the neurons, ‘Hey, anybody experienced this before?’ The neurons will say, ‘Yes, I have seen this.’ Certain other neurons will say, ‘No, I have not seen this.’ The neurons that have seen this before, will group together and form logical connections from the past and gives us an object from our memory.>,
 <Sentence: The same principle is applied for a so
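
The summarizer returns a tuple of `Sentence` objects rather than a string. A small helper like the one below can turn such a result into plain text; it only assumes that each item converts to its text with `str()`, which is how the Sumy README prints sentences. The helper name and the `FakeSentence` stand-in class are hypothetical and exist only so the example runs without Sumy installed.

```python
def sentences_to_text(sentences, separator="\n"):
    """Join summary sentences into one plain-text string."""
    return separator.join(str(sentence).strip() for sentence in sentences)


class FakeSentence:
    """Hypothetical stand-in for a Sumy Sentence, used only for this demo."""

    def __init__(self, text):
        self.text = text

    def __str__(self):
        return self.text


demo = (FakeSentence("First point."), FakeSentence("Second point."))
print(sentences_to_text(demo))
```

In the notebook the same call would be `sentences_to_text(summary_text)`.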

### Try different Summarizers
- LexRankSummarizer
- LuhnSummarizer
- LsaSummarizer

### Import the summarizers

In [30]:
from sumy.summarizers.lex_rank import LexRankSummarizer
from sumy.summarizers.luhn import LuhnSummarizer
from sumy.summarizers.lsa import LsaSummarizer

### Create Summarizers

In [66]:
lexSummarizer = LexRankSummarizer()
luhnSummarizer = LuhnSummarizer()
lsaSummarizer = LsaSummarizer()

### LexRankSummarizer

In [67]:
lex_summary_text = lexSummarizer(doc, 5)
lex_summary_text

(<Sentence: After telling her the features of a lion, asked her “Can you draw these for me?” She happily drew almost a similar figure to that of a dog she drew before.>,
 <Sentence: Was it a dog or a lion?>,
 <Sentence: Do you know what is the difference between a lion and a dog?” She said, “Yes.” I said, “This is called Learning.>,
 <Sentence: Picture of my version of Neural Network with their Neuron friends“Your brain is here inside our head.>,
 <Sentence: Ultimately, the neurons in your brain tell that it is a lion and not a dog.>)

### LuhnSummarizer

In [68]:
luhn_summary_text = luhnSummarizer(doc, 5)
luhn_summary_text

(<Sentence: Papa, What is a Neural Network?At the back of my head, thoughts of me taking days to comprehend what a NN (short for Neural Network) is, how it would work, where it is used, how it is simulating our human brain’s inner workings were going through.>,
 <Sentence: How you learnt it is because of Neural Network inside your brain.” Now, a neural network is a collection of neurons that keeps switching on and off based on things you see, feel, hear and think just like switching on light bulb at our home.>,
 <Sentence: Every neuron is waiting for your eyes to see something new, for your nose to smell something new, for your ears to hear something new, for your tongue to taste something new.>,
 <Sentence: When something new is heard, or smelled, or seen, or tasted, the neurons will group together to send signals and forms connections with already seen, heard, tasted or smelled neurons.>,
 <Sentence: When you see a new object, your brain will ask the neurons, ‘Hey, anybody experience

### LsaSummarizer

In [69]:
lsa_summary_text = lsaSummarizer(doc, 5)
lsa_summary_text


(<Sentence: If you’ve noticed, this is how ML people make their machines learn through Reinforcement Learning.>,
 <Sentence: For example, when I showed you a lion picture, your brain asked the neurons who had seen it before.>,
 <Sentence: Every neuron will tune itself to pick up certain features like legs, tail, face, beard, and so on.>,
 <Sentence: And I hope she will not come to me running asking “Papa, what is Meural Metark?” again.>,
 <Sentence: And I have a strong feeling; she would ask me another stunning question sooner or later.>)
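
The four summarizers pick noticeably different sentences. One rough way to quantify that is the Jaccard overlap between the sets of selected sentences, sketched below. The function names and the toy data are assumptions for illustration; the measure says how much two summaries agree, not which one is better.

```python
from itertools import combinations


def sentence_overlap(summary_a, summary_b):
    # Compare summaries as sets of sentence strings
    a = {str(s).strip() for s in summary_a}
    b = {str(s).strip() for s in summary_b}
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def pairwise_overlap(summaries):
    # summaries: dict mapping a label to a sequence of sentences
    return {
        (x, y): sentence_overlap(summaries[x], summaries[y])
        for x, y in combinations(summaries, 2)
    }


demo = {
    "textrank": ["A.", "B.", "C."],
    "lexrank": ["B.", "C.", "D."],
    "lsa": ["E."],
}
for pair, score in pairwise_overlap(demo).items():
    print(pair, round(score, 2))
```

With the notebook's variables this could be called as `pairwise_overlap({"TextRank": summary_text, "LexRank": lex_summary_text, "Luhn": luhn_summary_text, "LSA": lsa_summary_text})`.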

## 2. Gensim

## Task: Take a piece of text from a web page and summarize it using Gensim
### Steps
- Install the necessary libraries
- Import the libraries
- Scrape the text from a pre-defined webpage
- Summarize

### Install the library

In [35]:
pip install gensim

Collecting gensim
  Downloading gensim-4.3.3-cp311-cp311-win_amd64.whl.metadata (8.2 kB)
Collecting scipy<1.14.0,>=1.7.0 (from gensim)
  Downloading scipy-1.13.1-cp311-cp311-win_amd64.whl.metadata (60 kB)
Collecting smart-open>=1.8.1 (from gensim)
  Downloading smart_open-7.0.5-py3-none-any.whl.metadata (24 kB)
Collecting wrapt (from smart-open>=1.8.1->gensim)
  Using cached wrapt-1.16.0-cp311-cp311-win_amd64.whl.metadata (6.8 kB)
Downloading gensim-4.3.3-cp311-cp311-win_amd64.whl (24.0 MB)
   ---------------------------------------- 0.0/24.0 MB ? eta -:--:--
   ------------------ --------------------- 11.0/24.0 MB 62.5 MB/s eta 0:00:01
   ---------------------------------------  23.9/24.0 MB 65.7 MB/s eta 0:00:01
   ---------------------------------------- 24.0/24.0 MB 44.7 MB/s eta 0:00:00
Downloading scipy-1.13.1-cp311-cp311-win_amd64.whl (46.2 MB)
   ---------------------------------------- 0.0/46.2 MB ? eta -:--:--
   ------------- -------------------------- 16.0/46.2 MB 77.5 MB/s

  You can safely remove it manually.


In [45]:
# Keep comments on their own line: on Windows a trailing "# ..." after a "!" command is passed to pip and causes "ERROR: Too many arguments"
!pip uninstall gensim -y
!pip cache purge


In [46]:
!pip install gensim==3.6.0


Collecting gensim==3.6.0
  Downloading gensim-3.6.0.tar.gz (23.1 MB)
     ---------------------------------------- 0.0/23.1 MB ? eta -:--:--
     ------------ --------------------------- 7.3/23.1 MB 50.2 MB/s eta 0:00:01
     -------------------------------- ------ 19.4/23.1 MB 55.6 MB/s eta 0:00:01
     --------------------------------------- 23.1/23.1 MB 44.4 MB/s eta 0:00:00
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Getting requirements to build wheel: started
  Getting requirements to build wheel: finished with status 'done'
  Installing backend dependencies: started
  Installing backend dependencies: finished with status 'done'
  Preparing metadata (pyproject.toml): started
  Preparing metadata (pyproject.toml): finished with status 'done'
Building wheels for collected packages: gensim
  Building wheel for gensim (pyproject.toml): started
  Building wheel for gensim (pyproject.toml): finished with status 'done'
  Create

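### Import the library
- `gensim.summarization` was removed in Gensim 4.0, so this import needs a 3.x release such as the 3.6.0 installed above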

In [47]:
from gensim.summarization import summarize

### Scrape the text
- Use BeautifulSoup to extract text (from Task1 of ML-1)

In [48]:
import requests
import re
from bs4 import BeautifulSoup

In [70]:
def get_page(url):
    # Fetch the page; fail fast on network stalls or HTTP errors
    res = requests.get(url, timeout=10)
    res.raise_for_status()
    soup = BeautifulSoup(res.text, 'html.parser')
    return soup

In [71]:
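def collect_text(soup):
    # Note: relies on the global `url` (set in the next cell) for the header line
    text = f'url: {url}\n\n'
    # Collect every <p> tag; this also picks up navigation text such as "Sign in"
    para_text = soup.find_all('p')
    print(f"paragraphs text = \n {para_text}")
    for para in para_text:
        text += f"{para.text}\n\n"
    return text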

In [72]:
url = "https://medium.com/@subashgandyer/papa-what-is-a-neural-network-c5e5cc427c7"

In [73]:
text = collect_text(get_page(url))
text

paragraphs text = 
 [<p class="bf b dx dy dz ea eb ec ed ee ef eg du"><span><button class="bf b dx dy eh dz ea ei eb ec ej ek ee el em eg eo ep eq er es et eu ev ew ex ey ez fa fb fc fd bm fe ff" data-testid="headerSignUpButton">Sign up</button></span></p>, <p class="bf b dx dy dz ea eb ec ed ee ef eg du"><span><a class="af ag ah ai aj ak al am an ao ap aq ar as at" data-testid="headerSignInButton" href="/m/signin?operation=login&amp;redirect=https%3A%2F%2Fmedium.com%2F%40subashgandyer%2Fpapa-what-is-a-neural-network-c5e5cc427c7&amp;source=post_page---top_nav_layout_nav-----------------------global_nav-----------" rel="noopener follow">Sign in</a></span></p>, <p class="bf b dx dy dz ea eb ec ed ee ef eg du"><span><button class="bf b dx dy eh dz ea ei eb ec ej ek ee el em eg eo ep eq er es et eu ev ew ex ey ez fa fb fc fd bm fe ff" data-testid="headerSignUpButton">Sign up</button></span></p>, <p class="bf b dx dy dz ea eb ec ed ee ef eg du"><span><a class="af ag ah ai aj ak al am an ao 

'url: https://medium.com/@subashgandyer/papa-what-is-a-neural-network-c5e5cc427c7\n\nSign up\n\nSign in\n\nSign up\n\nSign in\n\nSubash Gandyer\n\nFollow\n\n--\n\n1\n\nListen\n\nShare\n\nIt was a cozy Sunday afternoon in the month of February 2018. I just finished my huge customary Sunday lunch spread with family and resting along. Everyone in the family was taking a quick nap for a pre-planned evening outing. Well not everyone, actually.\n\nMy 4-year-old angel came running to me, asked me to play with her for a while. As I was lazy and not in a position to move after the big spread, I evaded the chance to play with her by telling her “Papa’s got some work baby. Got to code some stuff.” I thought that would be the end of the conversation. No! It wasn’t. As my daughter was very inquisitive, she asked me “Papa, what stuff?” I said, “I need to code something for my work.” She didn’t leave. She again asked, “What is code something?” I wanted to end this conversation, as I was half past asl

### Another approach to make it more structured


In [74]:
def collect_main_text_from_url(url):
    # Fetch the webpage content
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    
    # Find all paragraph tags
    paragraphs = soup.find_all('p')
    
    # Extract text from each paragraph, filtering out unwanted phrases
    main_content = []
    for paragraph in paragraphs:
        text = paragraph.get_text()
        
        # Filter out known unwanted phrases (drops any paragraph containing one of them)
        unwanted = ('Sign in', 'Sign up', 'Member-only story', 'Follow', 'Towards Data Science', 'linkedin.com')
        if not any(phrase in text for phrase in unwanted):
            main_content.append(text)
    
    # Join the main content paragraphs into a single string
    main_text = ' '.join(main_content)
    
    return main_text

# Example usage
url = "https://medium.com/@subashgandyer/papa-what-is-a-neural-network-c5e5cc427c7"
cleaned_text = collect_main_text_from_url(url)

# Now use cleaned_text for summarization
from gensim.summarization import summarize
# When both are given, gensim uses word_count and ignores ratio
gensim_summary_text = summarize(cleaned_text, word_count=200, ratio=0.1)
print(gensim_summary_text)


“Neural Network is a collection (a network) of neurons whose job is to learn a new thing or a new place or a new process or a new concept.” It would be stupid on my part to start with a definition of Neural Network like how we used to teach adults in college.
What I was actually doing here was teaching her neural network (brain) the features of a lion like exactly how Machine Learning Engineers would train the machine to learn new features.
After telling her the features of a lion, asked her “Can you draw these for me?” She happily drew almost a similar figure to that of a dog she drew before.
A lion will have features like face, body, legs, tail and a beard.
The neurons grouped together with features like face, body, legs, tail and a beard forms a lion.
Once all the features are there, the neurons will send a signal that the picture you are looking at is a lion and not a dog.
Tanishi: Yes. Me: So, for a dog, the features are face, body, legs and tail.
Neural network is a group of neur

### Summarize
- **word_count**: maximum number of words wanted in the summary; when it is set, `ratio` is ignored
- **ratio**: fraction of the sentences in the original text to return as the summary (used only when `word_count` is not given)
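
How the two parameters interact can be sketched with a toy selector over sentences that are already ranked best-first. This is a simplified illustration and not Gensim's code: the function name and the greedy word-budget rule are assumptions, and Gensim's own selection logic differs in detail.

```python
def select_sentences(ranked_sentences, ratio=0.2, word_count=None):
    """Pick sentences from a best-first list by ratio or by word budget."""
    if word_count is None:
        # Ratio mode: keep a fixed fraction of the sentences
        keep = int(len(ranked_sentences) * ratio)
        return ranked_sentences[:keep]
    # Word-count mode: ratio is ignored; add sentences until the budget is used up
    selected, total = [], 0
    for sentence in ranked_sentences:
        n_words = len(sentence.split())
        if total + n_words > word_count:
            break
        selected.append(sentence)
        total += n_words
    return selected


ranked = ["a b c", "d e", "f g h i", "j"]
print(select_sentences(ranked, ratio=0.5))
print(select_sentences(ranked, ratio=1.0, word_count=4))
```

The second call shows why passing both arguments is redundant: once `word_count` is set, changing `ratio` has no effect on the result.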

In [75]:
# Clean the text by removing URLs and any unwanted phrases
# Note: the phrases match anywhere, even inside longer words (e.g. "Press" in "Pressure")
cleaned_text = re.sub(r'http\S+|Sign in|Follow|Member-only story|Towards Data Science|Help|Status|Careers|Press|Blog|Privacy|Terms|Text to speech|Teams', '', text)

# Summarize the cleaned text
gensim_summary_text = summarize(cleaned_text, word_count=200, ratio=0.1)
print(gensim_summary_text)

What I was actually doing here was teaching her neural network (brain) the features of a lion like exactly how Machine Learning Engineers would train the machine to learn new features.
After telling her the features of a lion, asked her “Can you draw these for me?” She happily drew almost a similar figure to that of a dog she drew before.
A dog will have features like face, body, legs, and tail.
A lion will have features like face, body, legs, tail and a beard.
The neurons grouped together with features like face, body, legs, tail and a beard forms a lion.
Once all the features are there, the neurons will send a signal that the picture you are looking at is a lion and not a dog.
Every neuron will tune itself to pick up certain features like legs, tail, face, beard, and so on.
Ultimately, the neurons in your brain tell that it is a lion and not a dog.
When I ask you to draw a dog, what are the features there?
All neurons work together like your friends and identify lion and dog.
Neural 
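
A plain alternation such as `Help|Press` also matches inside longer words. One possible refinement, sketched below, wraps the phrases in `\b` word boundaries so only whole words or phrases are removed. The phrase list is shortened for the example, and the names `UNWANTED`, `PATTERN`, and `clean_text` are assumptions, not part of the notebook.

```python
import re

# Shortened, illustrative phrase list (assumption)
UNWANTED = ["Sign in", "Sign up", "Follow", "Help", "Press"]

# URLs, or any unwanted phrase that stands as a whole word or phrase
PATTERN = re.compile(
    r'http\S+|\b(?:' + '|'.join(re.escape(p) for p in UNWANTED) + r')\b'
)


def clean_text(text):
    cleaned = PATTERN.sub('', text)
    # Collapse the runs of spaces left behind by the removals
    return re.sub(r'[ \t]{2,}', ' ', cleaned).strip()


print(clean_text("Press Follow us. Helpful neurons handle Pressure."))
print(clean_text("see https://example.com/x now"))
```

Because of the boundaries, "Helpful" and "Pressure" survive while the standalone "Press" and "Follow" are removed.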

## 3. Summa

## Task: Take a piece of text from a web page and summarize it using Summa
### Steps
- Install the necessary libraries
- Import the libraries
- Scrape the text from a pre-defined webpage
- Summarize

### Install the library

In [59]:
pip install summa

Collecting summa
  Downloading summa-1.2.0.tar.gz (54 kB)
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Getting requirements to build wheel: started
  Getting requirements to build wheel: finished with status 'done'
  Preparing metadata (pyproject.toml): started
  Preparing metadata (pyproject.toml): finished with status 'done'
Building wheels for collected packages: summa
  Building wheel for summa (pyproject.toml): started
  Building wheel for summa (pyproject.toml): finished with status 'done'
  Created wheel for summa: filename=summa-1.2.0-py3-none-any.whl size=54410 sha256=126888a8f9b27f96a57f90821a3c94291e3c9c83368d1b43a23cf7c26e7f19b7
  Stored in directory: c:\users\user\appdata\local\pip\cache\wheels\10\2d\7a\abce87c4ea233f8dcca0d99b740ac0257eced1f99a124a0e1f
Successfully built summa
Installing collected packages: summa
Successfully installed summa-1.2.0
Note: you may need to restart the kernel to use updated packages.


### Import the library

In [77]:
from summa import summarizer
from summa import keywords

### Scrape the text
- Reuse the `text` already extracted with BeautifulSoup in the Gensim section (from Task1 of ML-1)

### Summarize

In [78]:
summa_summary_text = summarizer.summarize(text, ratio=0.1)
summa_summary_text

'After all, neural network inside our brain helps us to learn new things in our life.\nWhat I was actually doing here was teaching her neural network (brain) the features of a lion like exactly how Machine Learning Engineers would train the machine to learn new features.\nAfter telling her the features of a lion, asked her “Can you draw these for me?” She happily drew almost a similar figure to that of a dog she drew before.\nA dog will have features like face, body, legs, and tail.\nA lion will have features like face, body, legs, tail and a beard.\nHer neural network got aligned with classifying Dogs and Lions after some training.\nDo you know what is the difference between a lion and a dog?” She said, “Yes.” I said, “This is called Learning.\nHow you learnt it is because of Neural Network inside your brain.” Now, a neural network is a collection of neurons that keeps switching on and off based on things you see, feel, hear and think just like switching on light bulb at our home.\nFo

## ASSIGNMENT: Take the same Medium article (the one I wrote) that we used for Task 1 of ML-1, extract the text, and summarize it using all of the above methods. Provide the best summary with a note explaining why the chosen library is the best
url = https://medium.com/@subashgandyer/papa-what-is-a-neural-network-c5e5cc427c7

### Submit 2 files
- (notebook) .ipynb
- (summary) .txt
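
For the `.txt` deliverable, one possible way to write the chosen summary and the justification note to a file is sketched below. The function name, the file layout, and the example file name in the usage note are assumptions; the assignment does not prescribe a format.

```python
from pathlib import Path


def save_summary(path, library, summary, note):
    """Write the chosen summary and the reason for choosing it to a text file."""
    content = (
        f"Library: {library}\n\n"
        f"Summary:\n{summary}\n\n"
        f"Why this library:\n{note}\n"
    )
    Path(path).write_text(content, encoding="utf-8")
    return content
```

A call such as `save_summary("summary.txt", "Summa", summa_summary_text, "...")` would produce the file, with the note filled in by hand.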