The notebook below uses the nltk package (Natural Language Toolkit) to create a summary of online articles. The sample article is the Wikipedia article about the hippopotamus. Change the URL to get a summary of a different article. An article by Ekta Shah guided this approach to text summarization:
https://www.analyticsvidhya.com/blog/2020/12/tired-of-reading-long-articles-text-summarization-will-make-your-task-easier/

In [1]:
! pip install bs4
! pip install lxml
! pip install --user -U nltk

import bs4 as bs
import urllib.request
import re
import nltk
nltk.download('punkt')
nltk.download('stopwords')

Collecting nltk
  Downloading nltk-3.6.5-py3-none-any.whl (1.5 MB)
Collecting regex>=2021.8.3
  Downloading regex-2021.10.8-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (748 kB)
Installing collected packages: regex, nltk
Successfully installed nltk-3.6.5 regex-2021.10.8
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

The code below obtains the article through web scraping. It uses the BeautifulSoup and lxml libraries to parse the HTML and keeps the text of every paragraph (<p>) element. Swap in another URL to summarize a different article.

In [2]:
# Replace URL with another article
scraped_data = urllib.request.urlopen('https://en.wikipedia.org/wiki/Hippopotamus')
article = scraped_data.read()
parsed_article = bs.BeautifulSoup(article, 'lxml')
paragraphs = parsed_article.find_all('p')
article_text = ""
for p in paragraphs:
  article_text += p.text
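
The paragraph extraction above can also be tried offline. The sketch below is not the notebook's method: it swaps BeautifulSoup for the standard-library html.parser module so it runs without extra packages or a network connection. The class name ParagraphCollector, the function extract_paragraph_text, and the sample HTML are made up for illustration.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collect text found inside <p> elements (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # number of <p> tags currently open
        self.chunks = []  # text fragments found inside paragraphs

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "p" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Keep text only while we are inside at least one <p> element
        if self.depth > 0:
            self.chunks.append(data)


def extract_paragraph_text(html):
    """Return the concatenated text of all <p> elements in an HTML string."""
    collector = ParagraphCollector()
    collector.feed(html)
    collector.close()
    return "".join(collector.chunks)


# Made-up HTML standing in for a downloaded page
sample_html = (
    "<html><body><h1>Title</h1>"
    "<p>First <b>bold</b> paragraph.</p>"
    "<div>skip</div>"
    "<p>Second.</p></body></html>"
)
sample_text = extract_paragraph_text(sample_html)
print(sample_text)
```

As in the cell above, paragraphs are joined without a separator, so the last sentence of one paragraph runs directly into the first sentence of the next.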

In [3]:
# Count words in the raw text before cleaning
original_word_count = article_text.count(" ") + 1
# Remove bracketed citation numbers such as [12], then collapse extra whitespace
article_text = re.sub(r"\[[0-9]*\]", "", article_text)
article_text = re.sub(r"\s+", " ", article_text)

# Remove special characters and extra whitespace
formatted_text = re.sub("[^a-zA-Z]", " ", article_text)
formatted_text = re.sub(r"\s+", " ", formatted_text)
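
To see what the two cleaning passes do, here is a small sketch on a made-up string. The sample text and the variable names are illustrative only, and the citation pattern is written with escaped brackets so that it matches only a literal [ followed by digits and a literal ].

```python
import re

# Made-up sample with citation markers, extra spaces, and a line break
sample = "Hippos are large.[1][23]  They live\nin Africa (sub-Saharan)."

# Pass 1: drop citation markers, then collapse runs of whitespace.
# This version keeps punctuation, so it can still be split into sentences.
no_citations = re.sub(r"\[[0-9]*\]", "", sample)
no_citations = re.sub(r"\s+", " ", no_citations)

# Pass 2: replace every non-letter with a space, then collapse whitespace.
# This version is only suitable for counting words.
letters_only = re.sub(r"[^a-zA-Z]", " ", no_citations)
letters_only = re.sub(r"\s+", " ", letters_only)

print(no_citations)
print(letters_only)
```

The first result plays the role of article_text and the second the role of formatted_text in the cell above.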



The code below counts how often each word appears, skipping the stop words provided by the nltk package, and then divides each count by the highest count to get a weighted frequency.

In [4]:
# Split the article into sentences
sentence_list = nltk.sent_tokenize(article_text)
# obtain stop words from nltk library
stopwords = nltk.corpus.stopwords.words('english')
word_frequencies = {}

# Count every word that is not a stop word.
# Note: formatted_text keeps its original casing and the stop word list is
# lowercase, so "The" is counted while "the" is skipped.
for word in nltk.word_tokenize(formatted_text):
  if word not in stopwords:
    word_frequencies[word] = word_frequencies.get(word, 0) + 1

max_frequency = max(word_frequencies.values())

# Calculate weighted frequencies by dividing each word's count by the max frequency
for word in word_frequencies:
  word_frequencies[word] = word_frequencies[word] / max_frequency
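
The weighting step can be checked by hand on a tiny input. This is a minimal sketch under simplifying assumptions: it splits on whitespace instead of using the nltk tokenizer, and toy_stopwords is a made-up four-word list, not the nltk stop word corpus.

```python
from collections import Counter

# Assumption: a tiny hand-written stop word set, not nltk's list
toy_stopwords = {"the", "in", "and", "are"}
toy_text = "the hippo rests in the river and the hippo grazes"

# Count every word that is not a stop word
counts = Counter(w for w in toy_text.split() if w not in toy_stopwords)

# Divide each count by the highest count so weights fall in (0, 1]
max_count = max(counts.values())
weighted = {word: count / max_count for word, count in counts.items()}

print(weighted)
```

Here "hippo" appears twice and every other kept word once, so "hippo" gets weight 1.0 and the rest get 0.5.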


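The code below scores each sentence by adding up the weighted frequencies of the words it contains. Sentences with 30 or more words are skipped so that the summary stays short.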

In [5]:
sentence_scores = {}
for sentence in sentence_list:
  # Skip long sentences (30 or more space-separated words)
  if len(sentence.split(' ')) >= 30:
    continue
  for word in nltk.word_tokenize(sentence.lower()):
    if word in word_frequencies:
      # Add this word's weight to the running score of its sentence
      sentence_scores[sentence] = sentence_scores.get(sentence, 0) + word_frequencies[word]
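
The scoring rule can be illustrated with a few made-up sentences. The sketch below is a simplified stand-in for the cell above: it uses a regular expression instead of the nltk tokenizer, and the function name score_sentences, the max_words parameter, and the toy weights are assumptions for illustration.

```python
import re


def score_sentences(sentences, weights, max_words=30):
    """Sum the weights of known words in each sentence shorter than max_words."""
    scores = {}
    for sentence in sentences:
        # Skip long sentences, mirroring the 30-word cutoff
        if len(sentence.split(" ")) >= max_words:
            continue
        # Assumption: lowercase letter runs stand in for nltk.word_tokenize
        for word in re.findall(r"[a-z]+", sentence.lower()):
            if word in weights:
                scores[sentence] = scores.get(sentence, 0) + weights[word]
    return scores


toy_weights = {"hippo": 1.0, "river": 0.5, "grazes": 0.5}
toy_sentences = [
    "The hippo rests in the river.",
    "It grazes at night.",
    "Nothing relevant here.",
]
toy_scores = score_sentences(toy_sentences, toy_weights)
print(toy_scores)
```

A sentence that contains none of the weighted words never enters the dictionary, so it cannot be selected for the summary.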

The code below builds the summary from the n highest-scoring sentences in the sentence_scores dictionary (n = 7 here).

In [9]:
import heapq
import textwrap

# Create a summary from the 7 highest-scoring sentences
summary_sentences = heapq.nlargest(7, sentence_scores, key=sentence_scores.get)
summary = " ".join(summary_sentences)

# Format paragraph output
summary = textwrap.dedent(summary).strip()
print(textwrap.fill(summary, width = 150))
print("")

# Print original word count and summary word count
word_count_summary = summary.count(" ") + 1
print(f"Summary Word Count: {word_count_summary}")
print(f"Original Word Count: {original_word_count}")

Hippos inhabit rivers, lakes, and mangrove swamps, where territorial males preside over a stretch of river and groups of five to thirty females and
young hippos. While hippos rest near each other in the water, grazing is a solitary activity and hippos are not territorial on land. The earliest
evidence of human interaction with hippos comes from butchery cut marks on hippo bones at Bouri Formation dated around 160,000 years ago.  The
hippopotamus (/ˌhɪpəˈpɒtəməs/ HIP-ə-POT-ə-məs; Hippopotamus amphibius), also called the hippo, common hippopotamus or river hippopotamus, is a large,
mostly herbivorous, semiaquatic mammal and ungulate native to sub-Saharan Africa. Isolated members of Malagasy hippos may have survived in remote
pockets; in 1976, villagers described a living animal called the kilopilopitsofy, which may have been a Malagasy hippo. Crocodiles are frequent
targets of hippo aggression, probably because they often inhabit the same riparian habitats; crocodiles may be either aggre
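
The summary above lists sentences from highest score to lowest, because heapq.nlargest returns items in descending order of the key. The sketch below shows that behavior on made-up scores and, as one possible variation that the notebook does not use, re-sorts the selected sentences into their original document order. All names and values are illustrative.

```python
import heapq

# Made-up sentences in document order, with made-up scores
toy_sentences = [
    "They are native to Africa.",
    "Hippos graze at night.",
    "Hippos rest in rivers by day.",
]
toy_scores = {
    toy_sentences[0]: 1.0,
    toy_sentences[1]: 0.5,
    toy_sentences[2]: 1.5,
}

# Iterating a dict yields its keys, so this returns the two sentences
# with the highest scores, best first
top_two = heapq.nlargest(2, toy_scores, key=toy_scores.get)

# Variation (an assumption, not the notebook's approach): restore document order
in_document_order = sorted(top_two, key=toy_sentences.index)

print(" ".join(top_two))
print(" ".join(in_document_order))
```

Keeping document order can make a summary read more naturally, at the cost of no longer showing which sentence scored highest.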