<p>The notebook below uses the nltk package (Natural Language Tool Kit) to create a summary of online articles. The sample article is about text summarization using abstractive methods. Change the URL to get a summary of a different article. A <a href="https://www.analyticsvidhya.com/blog/2020/12/tired-of-reading-long-articles-text-summarization-will-make-your-task-easier/">paper</a> by Ekta Shah guided this approach to text summarization. 
</p>

In [13]:
! pip install bs4
! pip install lxml
! pip install --user -U nltk

import bs4 as bs
from urllib.request import Request, urlopen
import re
import nltk
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

The code below obtains data through web scraping. The code uses the the BeautifulSoup and lxml libraries to parse text. Swap in another URL to summarize another article. 

In [14]:
# Replace URL with another article
# Work around 403 forbidden for web scrapping with bots
req = Request('https://towardsdatascience.com/understanding-automatic-text-summarization-2-abstractive-methods-7099fa8656fe', headers={'User-Agent': 'Mozilla/5.0'})
scraped_data = urlopen(req)

article = scraped_data.read()
parsed_article = bs.BeautifulSoup(article, 'lxml')
paragraphs = parsed_article.find_all('p')
article_text = ""
for p in paragraphs:
  article_text += p.text

In [15]:
# Remove square brackets and extra spaces from article
original_word_count = article_text.count(" ") + 1
article_text = re.sub(r"[[0-9]*]", "", article_text)
article_text = re.sub(r"\s+", " ", article_text)

# Remove special characters and extra whitespace
formatted_text = re.sub("[^a-zA-Z]", " ", article_text)
formatted_text = re.sub(r"\s+", " ", formatted_text)

The code below creates a word frequency count. The nltk package provides stop words. 

In [16]:
# break sentences into words
sentence_list = nltk.sent_tokenize(article_text)
# obtain stop words from nltk library
stopwords = nltk.corpus.stopwords.words('english')
word_frequencies = {}

# Create a word count of all words that are not stopwords
for word in nltk.word_tokenize(formatted_text):
  if word not in stopwords:
    if word not in word_frequencies.keys():
      word_frequencies[word] = 1
    else:
      word_frequencies[word] +=1

max_frequency = max(word_frequencies.values())

# Calculate the weighted frequencies by dividing the frequency of each word by te max frequency 
for word in word_frequencies.keys():
  word_frequencies[word] = (word_frequencies[word]/max_frequency)


Calculate scores for the sentences.

In [17]:
sentence_scores = {}
for sentence in sentence_list:
  for word in nltk.word_tokenize(sentence.lower()):
    if word in word_frequencies.keys() and len(sentence.split(' ')) < 30:
        if sentence not in sentence_scores.keys():
          sentence_scores[sentence] = word_frequencies[word]
        else:
          sentence_scores[sentence] += word_frequencies[word]

The code below creates a summary using the top n sentences in the sentence scores dictionary. 

In [19]:
import heapq
import textwrap

# Create a summary of sentences using the top n sentences. 
summary_sentences = heapq.nlargest(9, sentence_scores, key=sentence_scores.get)
summary = " ".join(summary_sentences)

# Format paragraph output
summary = textwrap.dedent(summary).strip()
print(textwrap.fill(summary, width = 100))
print("")

# Print orignal word count and summary word count
word_count_summary = summary.count(" ") + 1
print(f"Summary Word Count: {word_count_summary}")
print(f"Original Word Count: {original_word_count}")

# Uncomment to store data in a text file
#!echo "Text Summarization Techniques\n" > text_summarization_research.txt
#with open('text_summarization_research.txt', 'a') as writefile:
#    writefile.write(textwrap.fill(summary, width=150))
#    writefile.write("\n")
#    writefile.write(f"Summary word count: {word_count_summary}\n")
#    writefile.write(f"Original word count: {original_word_count}\n")

The word embeddings help to gain several insights about the word like whether a given word is
similar to a given word or not. The decoder model takes in the inputs and generates the predicted
words of the output sequence, given the previous word generated. The term “sequence to sequence
models” is used because the models are designed to create an output sequence of words from an input
sequence of words. So, the decoder has basically two actions at a time step, it can generate a word
from the target dictionary or it can point and copy a word. The decoder generates the words time
steps by time steps until the <end> tag is faced.This might raise a question, how are the words
generated?. We can see the for predicting a word each word in the input sequence is being assigned a
weight, called the attention weights. So, the paper proposed to take into consideration factors like
part of speech tags, named-entity tags, and TFIDF statistics of a word alongside embeddings to
represent a word. This