# Text Summarization

**Text summarization is the task of producing a concise and fluent summary of a text while preserving its key information and overall meaning.**


There are two types of summarization:

1. Abstractive summarization

2. Extractive summarization

**Abstractive Summarization**:
 

- Abstractive methods select words based on semantic understanding, including words that do not appear in the source documents.

- They aim to present the important material in a new way. They interpret and examine the text using advanced natural language techniques in order to generate a new, shorter text that conveys the most critical information from the original.

- This is comparable to the way a human reads an article or blog post and then summarizes it in their own words.

**Input document → understand context → semantics → create own summary**

---

**Extractive Summarization**: 


- In extractive summarization, we identify the important phrases or sentences in the original text and extract only those. The extracted sentences form the summary.

- Extractive methods attempt to summarize articles by selecting a subset of the source's words, phrases, or sentences that retains the most important points.

- This approach assigns weights to the important parts of the sentences and uses those weights to form the summary.

- Different algorithms and techniques are used to define weights for the sentences and then rank them by importance and by similarity to each other.

**Input document → sentences similarity → weight sentences → select sentences with higher rank**

## Steps in Text Summarization:

1. Obtain Data
2. Text Preprocessing
3. Convert paragraphs to sentences
4. Tokenizing the sentences
5. Find weighted frequency of occurrence
6. Replace words by weighted frequency in sentences
7. Sort sentences in descending order of weights
8. Summarizing the Article
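
The steps above can be sketched end to end on a toy input. This is a minimal illustration using only the standard library, not the notebook's implementation: the `summarize` helper, the tiny `STOPWORDS` set, and the regex-based sentence splitting are all assumptions made for the example (the notebook itself relies on NLTK for tokenization and stopwords).

```python
import heapq
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; the notebook uses NLTK's list).
STOPWORDS = {"the", "a", "an", "is", "of", "and", "to", "in", "it", "by"}

def summarize(text, n_sentences=1):
    # Step 3: naive sentence split on ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # Steps 4-5: tokenize, drop stopwords, count, then normalize by the max count.
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]
    counts = Counter(words)
    max_count = max(counts.values())
    weights = {w: c / max_count for w, c in counts.items()}
    # Step 6: score each sentence as the sum of its word weights.
    scores = {
        s: sum(weights.get(w, 0.0) for w in re.findall(r"[a-z]+", s.lower()))
        for s in sentences
    }
    # Steps 7-8: keep the highest-scoring sentences.
    return heapq.nlargest(n_sentences, scores, key=scores.get)

sample = (
    "Reinforcement learning trains an agent. "
    "The agent learns by reward. "
    "Cats sleep a lot."
)
top_sentence = summarize(sample, n_sentences=1)
print(top_sentence)  # ['Reinforcement learning trains an agent.']
```

Here "agent" occurs twice, so it gets weight 1.0 and every other non-stopword gets 0.5; the first sentence wins with a score of 2.5.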

### Installation

In [None]:
!pip install beautifulsoup4
!pip install lxml

### Importing all necessary libraries

In [None]:
import bs4 as bs
import urllib.request
import re

import nltk
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

### Text for Summarization

In [None]:
scraped_data = urllib.request.urlopen('https://en.wikipedia.org/wiki/Reinforcement_learning')
article = scraped_data.read()
parsed_article = bs.BeautifulSoup(article, 'lxml')
paragraphs = parsed_article.find_all('p')
article_text = ""
for p in paragraphs:
    article_text += p.text

In [None]:
article_text

This script first imports the libraries needed for web scraping, including BeautifulSoup. The urllib.request module fetches the page: the urlopen function opens the URL and read() returns the raw HTML. The re module provides the regular expressions used later for text preprocessing. The HTML is then parsed with a BeautifulSoup object and the lxml parser.

In Wikipedia articles, the body text sits inside `<p>` tags. Hence we use the find_all function to retrieve all the text wrapped in `<p>` tags.
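
To see what collecting the text of every `<p>` tag amounts to without fetching a live page, here is a dependency-free sketch. It deliberately swaps BeautifulSoup for the standard library's `html.parser`, and the `ParagraphText` class and the inline HTML string are hypothetical, made up for illustration.

```python
from html.parser import HTMLParser

class ParagraphText(HTMLParser):
    """Collects the text found inside <p> elements (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # how many <p> elements we are currently inside
        self.chunks = []    # text fragments seen inside <p> elements

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "p" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)

html = (
    "<html><body><h1>Title</h1>"
    "<p>First <b>para</b>.</p><div>skip</div><p>Second.</p>"
    "</body></html>"
)
parser = ParagraphText()
parser.feed(html)
# Paragraph texts are concatenated without a separator, as in the scraping cell.
paragraph_text = "".join(parser.chunks)
print(paragraph_text)  # First para.Second.
```

Note that joining paragraphs with no separator glues the last word of one paragraph to the first word of the next; joining with a space instead may give the sentence tokenizer cleaner input.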

After scraping, we need to perform data preprocessing on the text extracted.

### Text Preprocessing

In [None]:
# Remove reference markers such as [1], then collapse newlines and extra spaces
text = re.sub(r'\[[0-9]*\]', ' ', article_text)
text = re.sub(r'\s+', ' ', text).strip()

In [None]:
text

The first task is to remove the references in the Wikipedia article. They appear as numbers enclosed in square brackets, such as [1]. The code above replaces each marker with a space and then collapses runs of whitespace, including newlines, into single spaces.

The text variable now holds the original wording without the reference markers. We do not remove any other words or punctuation marks here, because the sentence tokenizer needs the punctuation and the summary is built directly from these sentences.
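
The reference-removal step can be checked on a short string. This is a small sketch with a made-up input (`raw` is not taken from the article); it uses only `re.sub`, with the square brackets escaped so they match literal `[` and `]` rather than opening a character class.

```python
import re

raw = (
    "Reinforcement learning[1] is an area\n"
    "of machine learning.[23]  It differs from supervised learning."
)

# Replace bracketed reference numbers such as [1] or [23] with a space.
no_refs = re.sub(r"\[[0-9]*\]", " ", raw)
# Collapse newlines and repeated spaces into single spaces.
clean = re.sub(r"\s+", " ", no_refs).strip()

print(clean)
# Reinforcement learning is an area of machine learning. It differs from supervised learning.
```

Punctuation survives this step, which is what lets a sentence tokenizer find the sentence boundaries afterwards.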

Run the code below to build a cleaned, letters-only copy of the text, which is used to compute the weighted frequencies:

In [None]:
# Keep letters only: replace special characters and digits with spaces
formatted_article_text = re.sub('[^a-zA-Z]', ' ', article_text)
formatted_article_text = re.sub(r'\s+', ' ', formatted_article_text)

In [None]:
formatted_article_text

Here formatted_article_text contains the formatted article. We will use this object to calculate the weighted frequencies, and later replace the words in each sentence of the original text with their weighted frequencies.

### Convert text to sentences

The text is split into individual sentences so that each one can be scored separately.

In [None]:
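import nltk
# One-time downloads of the NLTK data used below
# (newer NLTK releases may ask for 'punkt_tab' instead of 'punkt')
nltk.download('punkt')
nltk.download('stopwords')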

In [None]:
sentence_list = nltk.sent_tokenize(text)

In [None]:
sentence_list

### Finding weighted frequencies of occurrence

In [None]:
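# English stopwords from NLTK; a set makes membership checks fast
stopwords = set(nltk.corpus.stopwords.words('english'))
word_frequencies = {}
# Count words from the letters-only text, lowercased to match the stopword list
for word in nltk.word_tokenize(formatted_article_text.lower()):
    if word not in stopwords:
        if word not in word_frequencies:
            word_frequencies[word] = 1
        else:
            word_frequencies[word] += 1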

In [None]:
word_frequencies

All English stopwords from the nltk library are stored in the stopwords variable. Iterate over all the words in the text and check whether each word is a stopword. If it is not, check for its presence in the word_frequencies dictionary. If it doesn’t exist there, insert it as a key and set its value to 1. If it already exists, increase its count by 1.
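
The same counting logic can be written more compactly with `collections.Counter`. The sketch below is an alternative formulation on toy data, not the notebook's code: the `stop_words` set and the sample sentence are assumptions chosen for illustration.

```python
from collections import Counter

# Illustrative stopword set (the notebook uses NLTK's English list instead).
stop_words = {"the", "is", "a", "of"}
tokens = "the agent maximizes the reward of the agent".split()

# Count every token that is not a stopword.
word_counts = Counter(t for t in tokens if t not in stop_words)

print(word_counts.most_common())
# [('agent', 2), ('maximizes', 1), ('reward', 1)]
```

`Counter` handles the "insert with 1 or increment" branching internally, so the explicit membership check disappears.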

In [None]:
maximum_frequency = max(word_frequencies.values())
for word in word_frequencies.keys():
    word_frequencies[word] = (word_frequencies[word]/maximum_frequency)

In [None]:
word_frequencies

To find the weighted frequency, divide the frequency of each word by the frequency of the most frequent word.
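
As a worked example of this normalization, assume three words with made-up raw counts (the `counts` dictionary is hypothetical):

```python
# Hypothetical raw counts for three words.
counts = {"learning": 8, "agent": 4, "reward": 2}

# Divide every count by the largest count so the weights fall in (0, 1].
max_count = max(counts.values())
weighted = {word: count / max_count for word, count in counts.items()}

print(weighted)  # {'learning': 1.0, 'agent': 0.5, 'reward': 0.25}
```

The most frequent word always ends up with weight 1.0, which keeps weights comparable across articles of different lengths.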

### Calculate sentence scores

We have calculated the weighted frequencies. Now the score of each sentence can be calculated by adding the weighted frequencies of the words it contains.

In [None]:
sentence_list = nltk.sent_tokenize(text)

In [None]:
sentence_scores = {}
for sent in sentence_list:
    # Skip long sentences: only those with fewer than 30 words are scored
    if len(sent.split(' ')) >= 30:
        continue
    for word in nltk.word_tokenize(sent.lower()):
        if word in word_frequencies:
            # Add the word's weighted frequency to the sentence's running score
            sentence_scores[sent] = sentence_scores.get(sent, 0) + word_frequencies[word]

In [None]:
sentence_scores

The sentence_scores dictionary stores the sentences as keys and their scores as values. Iterate over all the sentences and tokenize each one into words. If a word exists in word_frequencies, its weighted frequency contributes to the sentence's score: if the sentence is not yet in sentence_scores, insert it with that weighted frequency as its initial value; otherwise add the weighted frequency to the existing score. Longer sentences are not considered, so only sentences with fewer than 30 words are scored.
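
The scoring rule can be traced by hand on a toy input. In this sketch the `weights` dictionary and the two sentences are made up, and a simple regex stands in for NLTK's word tokenizer, so treat it as an illustration of the rule rather than the notebook's exact behavior.

```python
import re

# Hypothetical weighted frequencies, as produced by the normalization step.
weights = {"learning": 1.0, "agent": 0.5, "reward": 0.25}
sentences = ["The agent collects reward.", "Learning is hard."]
MAX_WORDS = 30  # sentences with this many words or more are skipped

sentence_scores = {}
for sent in sentences:
    if len(sent.split()) >= MAX_WORDS:
        continue
    for word in re.findall(r"[a-z]+", sent.lower()):
        if word in weights:
            # Accumulate the weighted frequency of each known word.
            sentence_scores[sent] = sentence_scores.get(sent, 0.0) + weights[word]

print(sentence_scores)
# {'The agent collects reward.': 0.75, 'Learning is hard.': 1.0}
```

The first sentence scores 0.5 + 0.25 = 0.75, while the second scores 1.0 from "learning" alone, so a single high-weight word can outrank several low-weight ones.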

### Summary of the article

The sentence_scores dictionary consists of the sentences along with their scores. Now the top N sentences can be used to form the summary of the article; here N is 7.

In [None]:
import heapq
summary_sentences = heapq.nlargest(7, sentence_scores, key=sentence_scores.get)
summary = ' '.join(summary_sentences)
print(summary)
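
One property of heapq.nlargest worth knowing is that it returns the sentences ordered by score, not by their position in the article. The sketch below, on made-up sentences and scores, shows that behavior and one possible way to restore document order; the re-sorting step is a suggestion, not something the notebook does.

```python
import heapq

# Hypothetical sentences in document order, with made-up scores.
sentences = ["Alpha is first.", "Beta is second.", "Gamma is third."]
scores = {"Alpha is first.": 0.75, "Beta is second.": 0.2, "Gamma is third.": 1.5}

# Pick the two highest-scoring sentences (returned best-first).
top = heapq.nlargest(2, scores, key=scores.get)
print(top)  # ['Gamma is third.', 'Alpha is first.']

# Optionally put the chosen sentences back into their original order.
in_order = sorted(top, key=sentences.index)
summary = " ".join(in_order)
print(summary)  # Alpha is first. Gamma is third.
```

Restoring document order often makes the summary easier to follow, because later sentences tend to depend on earlier ones.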