https://blog.floydhub.com/gentle-introduction-to-text-summarization-in-machine-learning/

Here is the code blueprint of the summarizer:

Here are the steps for creating a simple text summarizer in Python.

### Step 1: Preparing the data

In this example, we want to summarize the information found on (https://en.wikipedia.org/wiki/20th_century) Wikipedia article, which just gives an overview of the major happenings during the 20th century.

To enable us to fetch the article’s text, we’ll use the Beautiful Soup library.

Here is the code for scraping the article’s content:Here is the code for scraping the article’s content:

In [2]:
import bs4 as BeautifulSoup
import urllib.request  

# Fetching the content from the URL
fetched_data = urllib.request.urlopen('https://en.wikipedia.org/wiki/20th_century')

article_read = fetched_data.read()

# Parsing the URL content and storing in a variable
article_parsed = BeautifulSoup.BeautifulSoup(article_read,'html.parser')

# Returning <p> tags
paragraphs = article_parsed.find_all('p')

article_content = ''

# Looping through the paragraphs and adding them to the variable
for p in paragraphs:  
    article_content += p.text

In the above code, we begin by importing the essential libraries for fetching data from the web page. The BeautifulSoup library is used for parsing the page while the urllib library is used for connecting to the page and retrieving the HTML.

BeautifulSoup converts the incoming text to Unicode characters and the outgoing text to UTF-8 characters, saving you the hassle of managing different charset encodings while scraping text from the web.

We’ll use the urlopen function from the urllib.request utility to open the web page. Then, we’ll use the read function to read the scraped data object. For parsing the data, we’ll call the BeautifulSoup object and pass two parameters to it; that is, the article_read and the html.parser.

The find_all function is used to return all the <p> elements present in the HTML. Furthermore, using .text enables us to select only the texts found within the <p> elements.

### Step 2: Processing the data

To ensure the scrapped textual data is as noise-free as possible, we’ll perform some basic text cleaning.  To assist us to do the processing, we’ll import a list of stopwords from the nltk library.

We’ll also import PorterStemmer, which is an algorithm for reducing words into their root forms. For example, cleaning, cleaned, and cleaner can be reduced to the root clean.

Furthermore, we’ll create a dictionary table having the frequency of occurrence of each of the words in the text. We’ll loop through the text and the corresponding words to eliminate any stop words.

Then, we’ll check if the words are present in the frequency_table. If the word was previously available in the dictionary, its value is updated by 1. Otherwise, if the word is recognized for the first time, its value is set to 1.

For example, the frequency table should look like the following:

WORD	FREQUENCY

century	7

world	4

United States	3

computer	1

Here is the code:

In [3]:
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
def _create_dictionary_table(text_string) -> dict:
   
    # Removing stop words
    stop_words = set(stopwords.words("english"))
    
    words = word_tokenize(text_string)
    
    # Reducing words to their root form
    stem = PorterStemmer()
    
    # Creating dictionary for the word frequency table
    frequency_table = dict()
    for wd in words:
        wd = stem.stem(wd)
        if wd in stop_words:
            continue
        if wd in frequency_table:
            frequency_table[wd] += 1
        else:
            frequency_table[wd] = 1

    return frequency_table

### Step 3:  Tokenizing the article into sentences

To split the article_content into a set of sentences, we’ll use the built-in method from the nltk library.

In [5]:
from nltk.tokenize import word_tokenize, sent_tokenize

sentences = sent_tokenize(article_content)

### Step 4: Finding the weighted frequencies of the sentences

To evaluate the score for every sentence in the text, we’ll be analyzing the frequency of occurrence of each term. In this case, we’ll be scoring each sentence by its words; that is, adding the frequency of each important word found in the sentence.

Take a look at the following code:

In [6]:
def _calculate_sentence_scores(sentences, frequency_table) -> dict:   

    # Algorithm for scoring a sentence by its words
    sentence_weight = dict()

    for sentence in sentences:
        sentence_wordcount = (len(word_tokenize(sentence)))
        sentence_wordcount_without_stop_words = 0
        for word_weight in frequency_table:
            if word_weight in sentence.lower():
                sentence_wordcount_without_stop_words += 1
                if sentence[:7] in sentence_weight:
                    sentence_weight[sentence[:7]] += frequency_table[word_weight]
                else:
                    sentence_weight[sentence[:7]] = frequency_table[word_weight]

        sentence_weight[sentence[:7]] = sentence_weight[sentence[:7]] /        sentence_wordcount_without_stop_words
      
    return sentence_weight

Importantly, to ensure long sentences do not have unnecessarily high scores over short sentences, we divided each score of a sentence by the number of words found in that sentence.

Also, to optimize the dictionary’s memory, we arbitrarily added sentence[:7], which refers to the first 7 characters in each sentence. However, for longer documents, where you are likely to encounter sentences with the same first n_chars, it’s better to use hash functions or smart index functions to take into account such edge-cases and avoid collisions.

### Step 5: Calculating the threshold of the sentences

To further tweak the kind of sentences eligible for summarization, we’ll create the average score for the sentences. With this threshold, we can avoid selecting the sentences with a lower score than the average score.

Here is the code:

In [7]:
def _calculate_average_score(sentence_weight) -> int:
   
    # Calculating the average score for the sentences
    sum_values = 0
    for entry in sentence_weight:
        sum_values += sentence_weight[entry]

    # Getting sentence average value from source text
    average_score = (sum_values / len(sentence_weight))

    return average_score

### Step 6: Getting the summary

Lastly, since we have all the required parameters, we can now generate a summary for the article.

In [8]:
def _get_article_summary(sentences, sentence_weight, threshold):
    sentence_counter = 0
    article_summary = ''

    for sentence in sentences:
        if sentence[:7] in sentence_weight and sentence_weight[sentence[:7]] >= (threshold):
            article_summary += " " + sentence
            sentence_counter += 1

    return article_summary

In [10]:
def _run_article_summary(article):
    
    #creating a dictionary for the word frequency table
    frequency_table = _create_dictionary_table(article)

    #tokenizing the sentences
    sentences = sent_tokenize(article)

    #algorithm for scoring a sentence by its words
    sentence_scores = _calculate_sentence_scores(sentences, frequency_table)

    #getting the threshold
    threshold = _calculate_average_score(sentence_scores)

    #producing the summary
    article_summary = _get_article_summary(sentences, sentence_scores, 1.5 * threshold)

    return article_summary

In [11]:
if __name__ == '__main__':
    summary_results = _run_article_summary(article_content)
    print(summary_results)

 Humans explored space for the first time, taking their first footsteps on the Moon. However, these same wars resulted in the destruction of the imperial system. At the beginning of the period, the British Empire was the world's most powerful nation,[14] having acted as the world's policeman for the past century. The victorious Bolsheviks then established the Soviet Union, the world's first communist state. After the victory of the Allies in Europe, the war in Asia ended with the Soviet invasion of Manchuria and the dropping of two atomic bombs on Japan by the US, the first nation to develop nuclear weapons and the only one to use them in warfare. In total, World War II left some 60 million people dead. After the war, Germany was occupied and divided between the Western powers and the Soviet Union. With the Axis defeated and Britain and France rebuilding, the United States and the Soviet Union were left standing as the world's only superpowers. At the beginning of the century, strong d