# Extraction-based summarization
**In extraction-based summarization, the parts of a text that carry its most important points (here, whole sentences) are pulled out and combined to make a summary.**

# Code Blueprint

### Creating a dictionary for the word frequency table
frequency_table = _create_dictionary_table(article)

### Tokenizing the sentences
sentences = sent_tokenize(article)

### Algorithm for scoring a sentence by its words
sentence_scores = _calculate_sentence_scores(sentences, frequency_table)

### Getting the threshold
threshold = _calculate_average_score(sentence_scores)

### Producing the summary
article_summary = _get_article_summary(sentences, sentence_scores, 1.5 * threshold)

print(article_summary)

In [85]:
import bs4 as BeautifulSoup
import urllib.request  

# Fetching the content from the URL
fetched_data = urllib.request.urlopen('https://en.wikipedia.org/wiki/20th_century')

article_read = fetched_data.read()

# Parsing the URL content and storing in a variable
article_parsed = BeautifulSoup.BeautifulSoup(article_read,'html.parser')

# Returning <p> tags
paragraphs = article_parsed.find_all('p')

article_content = ''

# Looping through the paragraphs and adding them to the variable
for p in paragraphs:  
    article_content += p.text

BeautifulSoup converts the incoming text to Unicode characters and the outgoing text to UTF-8 characters, saving you the hassle of managing different charset encodings while scraping text from the web.
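
For readers who want to try the paragraph-extraction step without a network call or third-party packages, here is a minimal sketch that swaps in `html.parser` from the standard library for BeautifulSoup. The class `ParagraphExtractor`, the helper `extract_paragraph_text`, and the sample HTML are hypothetical names and data chosen for illustration; the bytes are decoded explicitly, which is the step BeautifulSoup handles for you.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collects the text found inside <p> elements (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self._depth = 0      # how many <p> tags are currently open
        self.chunks = []     # text fragments seen inside <p> tags

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "p" and self._depth > 0:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth > 0:
            self.chunks.append(data)


def extract_paragraph_text(html_bytes, encoding="utf-8"):
    # Decode the raw bytes ourselves; the encoding is an assumption here.
    parser = ParagraphExtractor()
    parser.feed(html_bytes.decode(encoding))
    parser.close()
    return "".join(parser.chunks)


sample_html = (
    "<html><body><h1>Title</h1>"
    "<p>First <b>bold</b> point.</p><p>Café culture.</p>"
    "</body></html>"
).encode("utf-8")

paragraph_text = extract_paragraph_text(sample_html)
print(paragraph_text)
```

As in the notebook's loop, the paragraphs are joined without a separator, so the last sentence of one paragraph runs straight into the first sentence of the next.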

In [86]:
len(article_content)   # number of characters in the scraped text

21224

In [87]:
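# One-time setup
# import nltk
# nltk.download('punkt')      # tokenizer models used by word_tokenize / sent_tokenize
# nltk.download('stopwords')  # stop-word lists used by stopwords.words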

In [88]:
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

def _create_dictionary_table(text_string):
    stemmer = PorterStemmer()
    words = word_tokenize(text_string)            # split the text into word tokens
    frequency_table = dict()                      # maps stemmed token -> count
    stop_words = set(stopwords.words("english"))  # a set gives fast membership tests

    for wd in words:
        wd = stemmer.stem(wd)                     # reduce the token to its stem
        if wd in stop_words:                      # skip common words such as "the"
            continue
        # punctuation tokens are not filtered, so ',' and '.' are counted too
        frequency_table[wd] = frequency_table.get(wd, 0) + 1

    return frequency_table


In [89]:
_create_dictionary_table(article_content)

{'20th': 20,
 '(': 10,
 'twentieth': 1,
 ')': 10,
 'centuri': 46,
 'began': 4,
 'januari': 1,
 '1': 3,
 ',': 299,
 '1901': 2,
 '[': 39,
 ']': 39,
 'end': 14,
 'decemb': 1,
 '31': 1,
 '2000': 2,
 '.': 125,
 '2': 2,
 'It': 6,
 'wa': 36,
 'tenth': 1,
 'final': 2,
 '2nd': 1,
 'millennium': 1,
 'unlik': 1,
 'year': 12,
 'leap': 2,
 'first': 19,
 'gregorian': 1,
 'calendar': 1,
 'sinc': 9,
 '1600': 1,
 'domin': 3,
 'chain': 2,
 'event': 1,
 'herald': 2,
 'signific': 4,
 'chang': 7,
 'world': 49,
 'histori': 7,
 'redefin': 1,
 'era': 1,
 ':': 4,
 'spanish': 1,
 'flu': 1,
 'pandem': 1,
 'war': 39,
 'I': 4,
 'II': 8,
 'nuclear': 10,
 'power': 12,
 'space': 4,
 'explor': 3,
 'nation': 19,
 'decolon': 3,
 'cold': 5,
 'post-cold': 1,
 'conflict': 8,
 ';': 12,
 'intergovernment': 1,
 'organ': 3,
 'cultur': 7,
 'homogen': 2,
 'develop': 17,
 'emerg': 3,
 'transport': 4,
 'commun': 10,
 'technolog': 20,
 'poverti': 2,
 'reduct': 1,
 'popul': 11,
 'growth': 2,
 'awar': 2,
 'environment': 5,
 'degrad':
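
The table above counts punctuation tokens such as `','` and `'.'` alongside real words. The following is a minimal sketch of one way to keep only alphanumeric tokens. It assumes regex tokenization instead of NLTK, skips stemming, and uses a tiny hand-written stop-word set; `STOP_WORDS` and `build_frequency_table` are hypothetical names, not part of the notebook.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list for the demo.
STOP_WORDS = {"the", "a", "of", "and", "was", "it", "in"}


def build_frequency_table(text):
    # Lowercase first, then keep runs of letters/digits; punctuation is dropped.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(token for token in tokens if token not in STOP_WORDS)


sample = "The war ended. The war, it was said, changed the world."
table = build_frequency_table(sample)
print(table)
```

Lowercasing before the stop-word check also means capitalized stop words such as "It" are filtered, which the table above does not do.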

**Tokenizing the article into sentences**

In [90]:
from nltk.tokenize import word_tokenize, sent_tokenize

sentences = sent_tokenize(article_content)

In [91]:
sentences[0][:7]

'The 20t'
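
Wikipedia places citation markers such as `[2]` right after the full stop, so they can end up at the start of the next sentence. A minimal sketch of stripping them before splitting, assuming a naive regex splitter in place of `sent_tokenize`; `strip_citations` and `naive_sentence_split` are hypothetical helpers, and the splitter will mis-handle abbreviations that a trained tokenizer copes with.

```python
import re


def strip_citations(text):
    # Remove bracketed numeric markers like [1] or [12].
    return re.sub(r"\[\d+\]", "", text)


def naive_sentence_split(text):
    # Split on whitespace that follows ., ! or ? (a rough approximation).
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


raw = (
    "The century began in 1901.[1] It ended in 2000.[2] "
    "It was the tenth century of the millennium."
)
sentences_demo = naive_sentence_split(strip_citations(raw))
print(sentences_demo)
```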

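## Finding the weighted frequencies of the sentences

To keep long sentences from outscoring short ones simply because they contain more words, each sentence's total is divided by the number of frequency-table entries found in that sentence. Matching is done by substring on the lowercased sentence, so a short entry such as `'wa'` also matches inside "war".

To keep the dictionary keys small, each score is stored under `sent[:10]`, the **first 10 characters** of the sentence, rather than under the full text; the length 10 is arbitrary. **The entire sentence need not be stored, only its first 10 characters** (this saves `dict()` memory). The trade-off is that sentences beginning with the same 10 characters share one dictionary entry, so their scores are mixed together.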
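
A minimal sketch of the length normalization described above, under two assumptions that differ from the notebook's function: words are matched as whole tokens rather than substrings, and each sentence is keyed by its position in the list instead of its first characters. `score_sentences` and the demo data are hypothetical.

```python
import re
from collections import Counter


def score_sentences(sentences, frequency_table):
    scores = {}
    for index, sentence in enumerate(sentences):
        tokens = re.findall(r"[a-z0-9]+", sentence.lower())
        matched = [token for token in tokens if token in frequency_table]
        if not matched:
            continue  # nothing to score, and avoids dividing by zero
        total = sum(frequency_table[token] for token in matched)
        scores[index] = total / len(matched)  # average weight per matched word
    return scores


demo_sentences = [
    "War changed the world.",
    "The world saw war and more war.",
    "Nothing here.",
]
demo_table = Counter({"war": 3, "world": 2, "changed": 1, "saw": 1})
demo_scores = score_sentences(demo_sentences, demo_table)
print(demo_scores)
```

The second sentence is longer and has a larger raw total (9 versus 6), but after dividing by the number of matched words the two scores are close, which is the point of the normalization.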

In [92]:
def _calculate_sentence_scores(sentences, frequency_table):
    sentence_weight = dict()

    for sent in sentences:
        key = sent[:10]                  # first 10 characters identify the sentence
        sentence_lower = sent.lower()
        matched_entries = 0              # frequency-table entries found in this sentence

        for word, weight in frequency_table.items():
            if word in sentence_lower:   # substring match against the lowercased sentence
                matched_entries += 1
                sentence_weight[key] = sentence_weight.get(key, 0) + weight

        if matched_entries:              # avoids a KeyError when nothing matched
            sentence_weight[key] = sentence_weight[key] / matched_entries

    return sentence_weight




In [93]:
_calculate_sentence_scores(sentences,_create_dictionary_table(article_content))

{'The 20th (': 28.85,
 '[2] It was': 27.333333333333332,
 'Unlike mos': 25.571428571428573,
 'The 20th c': 15.266586354821651,
 'It saw gre': 19.233333333333334,
 'Man-made g': 12.226415094339623,
 '[5]\nThe re': 24.966666666666665,
 'The Marsha': 10.27027027027027,
 'Throughout': 19.125,
 'The dissol': 16.205128205128204,
 'It took ov': 16.41860465116279,
 '[8][9][10]': 18.153846153846153,
 'Penicillin': 9.96551724137931,
 '[11] Machi': 18.314285714285713,
 'Trade impr': 16.6,
 'Until the ': 15.35,
 '[12]\nThe c': 21.944444444444443,
 'Nationalis': 17.96969696969697,
 'The centur': 27.85,
 'Terms like': 28.227272727272727,
 'Scientific': 22.282651072124757,
 'It was a c': 21.75,
 'Horses and': 19.307692307692307,
 'These deve': 18.48148148148148,
 'Humans exp': 39.333333333333336,
 'Mass media': 18.90625,
 'Advancemen': 11.85,
 'Rapid tech': 29.68421052631579,
 'World War ': 26.583333333333332,
 'However, t': 47.27272727272727,
 'For the fi': 23.714285714285715,
 'The last t': 23.9545
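
Keys such as `'The 20th ('` and `'The 20th c'` in the output above differ only in their last character, which hints at how easily prefix keys can collide. A small sketch with made-up sentences, showing that two sentences sharing their first 10 characters map to a single prefix key while position-based keys stay distinct:

```python
demo_sentences = [
    "The 20th century began in 1901.",
    "The 20th century ended in 2000.",
]

# Keys built from the first 10 characters, as in the scoring function.
prefix_keys = {sentence[:10] for sentence in demo_sentences}

# Keys built from each sentence's position in the list (an alternative, not the notebook's choice).
index_keys = set(range(len(demo_sentences)))

print(prefix_keys)  # one shared key for both sentences
print(index_keys)
```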

## Calculating the threshold of the sentences

To narrow down which sentences are eligible for the summary, we compute the average sentence score and use it as a threshold: sentences scoring below it are left out. The final call multiplies this average by 1.5, which makes the cut stricter.
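
The thresholding step can be sketched on toy numbers as follows. `average_score` is a hypothetical stand-in for the function defined below, and the scores are invented for illustration.

```python
from statistics import mean


def average_score(sentence_scores):
    # An empty dictionary would make mean() raise, so return 0.0 instead.
    if not sentence_scores:
        return 0.0
    return mean(sentence_scores.values())


toy_scores = {"a": 10.0, "b": 20.0, "c": 30.0}
toy_threshold = average_score(toy_scores)

# Keep only sentences scoring at least 1.5 times the average.
selected = [key for key, score in toy_scores.items() if score >= 1.5 * toy_threshold]
print(toy_threshold, selected)
```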

In [94]:
def _calculate_average_score(sentence_weight):

    # Calculating the average score for the sentences
    sum_values = sum(sentence_weight.values())

    average_score = sum_values / len(sentence_weight)
    return average_score  # threshold value

In [95]:
_calculate_average_score(_calculate_sentence_scores(sentences,_create_dictionary_table(article_content)))

22.878991308947914

## Getting the summary

In [96]:
def _get_article_summary(sentences, sentence_weight, threshold):
    article_summary = ''

    for sent in sentences:
        key = sent[:10]
        if key in sentence_weight and sentence_weight[key] >= threshold:
            article_summary += " " + sent    # the entire sentence is concatenated, not just sent[:10]

    return article_summary

In [110]:
def _run_article_summary(article):
    
    #creating a dictionary for the word frequency table
    frequency_table = _create_dictionary_table(article)

    #tokenizing the sentences
    sentences = sent_tokenize(article)

    #algorithm for scoring a sentence by its words
    sentence_scores = _calculate_sentence_scores(sentences, frequency_table)

    #getting the threshold
    threshold = _calculate_average_score(sentence_scores)

    #producing the summary; adjust the 1.5 multiplier to change the length of the summary
    article_summary = _get_article_summary(sentences, sentence_scores, 1.5 * threshold)

    return article_summary

In [111]:
print(_run_article_summary(article_content))

 Humans explored space for the first time, taking their first footsteps on the Moon. However, these same wars resulted in the destruction of the imperial system. The victorious Bolsheviks then established the Soviet Union, the world's first communist state. At the beginning of the period, the British Empire was the world's most powerful nation,[15] having acted as the world's policeman for the past century. In total, World War II left some 60 million people dead. At the beginning of the century, strong discrimination based on race and sex was significant in general society. During the century, the social taboo of sexism fell. Communications and information technology, transportation technology, and medical advances had radically altered daily lives. Since the US was in a dominant position, a major part of the process was Americanization. Terrorism, dictatorship, and the spread of nuclear weapons were pressing global issues. Millions were infected with HIV, the virus which causes AIDS. 
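
The multiplier applied to the average score controls how long the summary is. A minimal sketch on invented data, assuming scores are held in a list parallel to the sentences; `summarize_with_multiplier` is a hypothetical helper, not one of the notebook's functions.

```python
def summarize_with_multiplier(sentences, scores, multiplier):
    # Threshold = multiplier * average score; higher multipliers keep fewer sentences.
    threshold = multiplier * sum(scores) / len(scores)
    kept = [s for s, score in zip(sentences, scores) if score >= threshold]
    return " ".join(kept)


toy_sentences = ["Alpha.", "Beta.", "Gamma.", "Delta."]
toy_sentence_scores = [10.0, 20.0, 30.0, 40.0]

short_summary = summarize_with_multiplier(toy_sentences, toy_sentence_scores, 1.5)
long_summary = summarize_with_multiplier(toy_sentences, toy_sentence_scores, 1.0)
print(short_summary)
print(long_summary)
```

With an average of 25, a multiplier of 1.5 keeps only the top-scoring sentence, while a multiplier of 1.0 keeps the two sentences at or above the average.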

## Here is the entire code for the simple extractive text summarizer:

In [99]:
#importing libraries
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize, sent_tokenize
import bs4 as BeautifulSoup
import urllib.request  

#fetching the content from the URL
fetched_data = urllib.request.urlopen('https://en.wikipedia.org/wiki/20th_century')

article_read = fetched_data.read()

#parsing the URL content and storing in a variable
article_parsed = BeautifulSoup.BeautifulSoup(article_read,'html.parser')

#returning <p> tags
paragraphs = article_parsed.find_all('p')

article_content = ''

#looping through the paragraphs and adding them to the variable
for p in paragraphs:  
    article_content += p.text


def _create_dictionary_table(text_string) -> dict:
   
    #removing stop words
    stop_words = set(stopwords.words("english"))
    
    words = word_tokenize(text_string)
    
    #reducing words to their root form
    stem = PorterStemmer()
    
    #creating dictionary for the word frequency table
    frequency_table = dict()
    for wd in words:
        wd = stem.stem(wd)
        if wd in stop_words:
            continue
        if wd in frequency_table:
            frequency_table[wd] += 1
        else:
            frequency_table[wd] = 1

    return frequency_table


def _calculate_sentence_scores(sentences, frequency_table) -> dict:   

    #algorithm for scoring a sentence by its words
    sentence_weight = dict()

    for sentence in sentences:
        sentence_wordcount_without_stop_words = 0
        for word_weight in frequency_table:
            if word_weight in sentence.lower():
                sentence_wordcount_without_stop_words += 1
                if sentence[:7] in sentence_weight:
                    sentence_weight[sentence[:7]] += frequency_table[word_weight]
                else:
                    sentence_weight[sentence[:7]] = frequency_table[word_weight]

        sentence_weight[sentence[:7]] = sentence_weight[sentence[:7]] / sentence_wordcount_without_stop_words

       

    return sentence_weight

def _calculate_average_score(sentence_weight) -> float:
   
    #calculating the average score for the sentences
    sum_values = 0
    for entry in sentence_weight:
        sum_values += sentence_weight[entry]

    #getting sentence average value from source text
    average_score = (sum_values / len(sentence_weight))

    return average_score

def _get_article_summary(sentences, sentence_weight, threshold):
    article_summary = ''

    for sentence in sentences:
        if sentence[:7] in sentence_weight and sentence_weight[sentence[:7]] >= (threshold):
            article_summary += " " + sentence

    return article_summary


#Our main function
def _run_article_summary(article):
    
    #creating a dictionary for the word frequency table
    frequency_table = _create_dictionary_table(article)

    #tokenizing the sentences
    sentences = sent_tokenize(article)

    #algorithm for scoring a sentence by its words
    sentence_scores = _calculate_sentence_scores(sentences, frequency_table)

    #getting the threshold
    threshold = _calculate_average_score(sentence_scores)

    #producing the summary
    article_summary = _get_article_summary(sentences, sentence_scores, 1.5 * threshold) 

    return article_summary

if __name__ == '__main__':
    summary_results = _run_article_summary(article_content)
    print(summary_results)

 Humans explored space for the first time, taking their first footsteps on the Moon. However, these same wars resulted in the destruction of the imperial system. The victorious Bolsheviks then established the Soviet Union, the world's first communist state. At the beginning of the period, the British Empire was the world's most powerful nation,[15] having acted as the world's policeman for the past century. In total, World War II left some 60 million people dead. At the beginning of the century, strong discrimination based on race and sex was significant in general society. During the century, the social taboo of sexism fell. Communications and information technology, transportation technology, and medical advances had radically altered daily lives. Since the US was in a dominant position, a major part of the process was Americanization. Terrorism, dictatorship, and the spread of nuclear weapons were pressing global issues. Millions were infected with HIV, the virus which causes AIDS. 