In [5]:
## input text article
article_text = "Just what is agility in the context of software engineering work? Ivar Jacobson [Jac02a] provides a useful discussion: Agility has become today’s buzzword when describing a modern software process. Everyone is agile. An agile team is a nimble team able to appropriately respond to changes. Change is what software development is very much about. Changes in the software being built, changes to the team members, changes because of new technology, changes of all kinds that may have an impact on the product they build or the project that creates the product. Support for changes should be built-in everything we do in software, something we embrace because it is the heart and soul of software. An agile team recognizes that software is developed by individuals working in teams and that the skills of these people, their ability to collaborate is at the core for the success of the project. In Jacobson’s view, the pervasiveness of change is the primary driver for agility. Software engineers must be quick on their feet if they are to accommodate the rapid changes that Jacobson describes. But agility is more than an effective response to change. It also encompasses the philosophy espoused in the manifesto noted at the beginning of this chapter. It encourages team structures and attitudes that make communication (among team members, between technologists and business people, between software engineers and their managers) more facile. It emphasizes rapid delivery of operational software and deemphasizes the importance of intermediate work products (not always a good thing); it adopts the customer as a part of the development team and works to eliminate the “us and them” attitude that continues to pervade many software projects; it recognizes that planning in an uncertain world has its limits and that a project plan must be flexible. Agility can be applied to any software process. However, to accomplish this, it is essential that the process be designed in a way that allows the project team to adapt tasks and to streamline them, conduct planning in a way that understands the fluidity of an agile development approach, eliminate all but the most essential work products and keep them lean, and emphasize an incremental delivery strategy that gets working software to the customer as rapidly as feasible for the product type and operational environment. "

In [22]:
article_text ='We use the method word_tokenize() to split a sentence into words. The output of word tokenization can be converted to Data Frame for better text understanding in machine learning applications. It can also be provided as input for further text cleaning steps such as punctuation removal, numeric character removal or stemming. Machine learning models need numeric data to be trained and make a prediction. Word tokenization becomes a crucial part of the text (string) to numeric data conversion. Please read about Bag of Words or CountVectorizer. Please refer to below word tokenize NLTK example to understand the theory better.'

In [36]:
article_text = 'Text summarization is the process of shortening long pieces of text while preserving key information content and overall meaning, to create a subset (a summary) that represents the most important or relevant information within the Text. An example of Text summarization problem is news article summarization, which attempts to automatically produce an abstract from a given article. Sometimes one might be interested in generating a summary from a single source article, while others can use multiple source articles (for example, a cluster of articles on the same topic). This problem is called multi-document summarization. A related application is summarizing news articles. Imagine a system, which automatically pulls together news articles on a given topic (from the web), and concisely represents the latest news as a summary. An example for this is App called inShorts, which summarizes news articles into 60 words.'

## Import Modules

In [37]:
import re
import nltk

## Data Preprocessing

In [38]:
article_text = article_text.lower()
article_text

'text summarization is the process of shortening long pieces of text while preserving key information content and overall meaning, to create a subset (a summary) that represents the most important or relevant information within the text. an example of text summarization problem is news article summarization, which attempts to automatically produce an abstract from a given article. sometimes one might be interested in generating a summary from a single source article, while others can use multiple source articles (for example, a cluster of articles on the same topic). this problem is called multi-document summarization. a related application is summarizing news articles. imagine a system, which automatically pulls together news articles on a given topic (from the web), and concisely represents the latest news as a summary. an example for this is app called inshorts, which summarizes news articles into 60 words.'

In [39]:
# replace every non-letter character (punctuation, digits) with a space,
# then collapse runs of whitespace into a single space
clean_text = re.sub(r'[^a-zA-Z]', ' ', article_text)
clean_text = re.sub(r'\s+', ' ', clean_text)
clean_text

'text summarization is the process of shortening long pieces of text while preserving key information content and overall meaning to create a subset a summary that represents the most important or relevant information within the text an example of text summarization problem is news article summarization which attempts to automatically produce an abstract from a given article sometimes one might be interested in generating a summary from a single source article while others can use multiple source articles for example a cluster of articles on the same topic this problem is called multi document summarization a related application is summarizing news articles imagine a system which automatically pulls together news articles on a given topic from the web and concisely represents the latest news as a summary an example for this is app called inshorts which summarizes news articles into words '
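The cleaning step has two side effects visible in the output above: the hyphenated "multi-document" becomes two separate words, and "60" disappears entirely. A minimal sketch of the same two substitutions wrapped in a function makes this easy to check on a short string. The name `clean_for_counting` and the sample sentence are illustrative assumptions, and the final `.strip()` is an addition the notebook cell does not perform.

```python
import re

def clean_for_counting(text: str) -> str:
    """Lowercase, turn non-letters into spaces, collapse whitespace (hypothetical helper)."""
    text = text.lower()
    text = re.sub(r'[^a-z]', ' ', text)      # punctuation and digits become spaces
    return re.sub(r'\s+', ' ', text).strip()  # .strip() is an addition, not in the notebook

sample = "Multi-document summarization: 60 words (or fewer)!"
cleaned = clean_for_counting(sample)
print(cleaned)
```

Because the cleaned text is only used for counting words, losing digits and hyphens is acceptable here; the original `article_text` is kept for sentence splitting.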

In [40]:
## run this cell once to download the stopword list and the punkt tokenizer models
import nltk
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [41]:
# split into sentence list
sentence_list = nltk.sent_tokenize(article_text)
sentence_list

['text summarization is the process of shortening long pieces of text while preserving key information content and overall meaning, to create a subset (a summary) that represents the most important or relevant information within the text.',
 'an example of text summarization problem is news article summarization, which attempts to automatically produce an abstract from a given article.',
 'sometimes one might be interested in generating a summary from a single source article, while others can use multiple source articles (for example, a cluster of articles on the same topic).',
 'this problem is called multi-document summarization.',
 'a related application is summarizing news articles.',
 'imagine a system, which automatically pulls together news articles on a given topic (from the web), and concisely represents the latest news as a summary.',
 'an example for this is app called inshorts, which summarizes news articles into 60 words.']

## Word Frequencies

In [43]:
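# English stopwords as a set, so membership tests are fast
stopwords = set(nltk.corpus.stopwords.words('english'))

# count how often each non-stopword occurs in the cleaned text
word_frequencies = {}
for word in nltk.word_tokenize(clean_text):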
    if word not in stopwords:
        if word not in word_frequencies:
            word_frequencies[word] = 1
        else:
            word_frequencies[word] += 1
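The counting loop above can also be written with `collections.Counter`. The sketch below is a standard-library-only variant, not the notebook's code: `STOPWORDS` is a tiny hypothetical list standing in for NLTK's English stopwords, and `str.split()` is a rough stand-in for `nltk.word_tokenize`, which can split some words differently.

```python
from collections import Counter

# Hypothetical miniature stopword list; the notebook uses NLTK's English list.
STOPWORDS = {"a", "as", "is", "of", "the"}

def count_words(clean_text: str, stopwords: set) -> Counter:
    """Count every whitespace-separated word that is not a stopword."""
    # str.split() is an assumption made to avoid the NLTK dependency.
    return Counter(word for word in clean_text.split() if word not in stopwords)

counts = count_words("news articles summarize the news as a summary of articles", STOPWORDS)
print(counts)
```

A `Counter` returns 0 for missing keys, so the explicit "first time seen" branch is no longer needed.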

In [44]:
maximum_frequency = max(word_frequencies.values())

for word in word_frequencies:
    word_frequencies[word] = word_frequencies[word] / maximum_frequency
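Dividing by the maximum turns raw counts into weights between 0 and 1, with the most frequent word at exactly 1.0. The sketch below shows the same normalization as a function that returns a new dictionary. The name `normalize_by_max` is an assumption, and the raw counts are the ones implied by the weights printed later in this notebook (a weight of 0.2 corresponds to one occurrence, so the maximum count is 5).

```python
def normalize_by_max(counts: dict) -> dict:
    """Divide every count by the largest count (hypothetical helper)."""
    if not counts:
        return {}  # avoid calling max() on an empty dictionary
    top = max(counts.values())
    return {word: n / top for word, n in counts.items()}

# Raw counts implied by the printed weights: 5 -> 1.0, 4 -> 0.8, 3 -> 0.6, ...
raw = {"news": 5, "summarization": 4, "summary": 3, "topic": 2, "web": 1}
weights = normalize_by_max(raw)
print(weights)
```

Returning a new dictionary keeps the raw counts available for inspection, whereas the notebook cell overwrites them in place.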

## Calculate Sentence Scores

In [45]:
sentence_scores = {}

for sentence in sentence_list:
    # skip long sentences (30 or more space-separated words)
    if len(sentence.split(' ')) >= 30:
        continue
    # a sentence's score is the sum of the weights of its known words
    for word in nltk.word_tokenize(sentence):
        if word in word_frequencies:
            sentence_scores[sentence] = sentence_scores.get(sentence, 0) + word_frequencies[word]

In [46]:
word_frequencies

{'abstract': 0.2,
 'app': 0.2,
 'application': 0.2,
 'article': 0.6,
 'articles': 1.0,
 'attempts': 0.2,
 'automatically': 0.4,
 'called': 0.4,
 'cluster': 0.2,
 'concisely': 0.2,
 'content': 0.2,
 'create': 0.2,
 'document': 0.2,
 'example': 0.6,
 'generating': 0.2,
 'given': 0.4,
 'imagine': 0.2,
 'important': 0.2,
 'information': 0.4,
 'inshorts': 0.2,
 'interested': 0.2,
 'key': 0.2,
 'latest': 0.2,
 'long': 0.2,
 'meaning': 0.2,
 'might': 0.2,
 'multi': 0.2,
 'multiple': 0.2,
 'news': 1.0,
 'one': 0.2,
 'others': 0.2,
 'overall': 0.2,
 'pieces': 0.2,
 'preserving': 0.2,
 'problem': 0.4,
 'process': 0.2,
 'produce': 0.2,
 'pulls': 0.2,
 'related': 0.2,
 'relevant': 0.2,
 'represents': 0.4,
 'shortening': 0.2,
 'single': 0.2,
 'sometimes': 0.2,
 'source': 0.4,
 'subset': 0.2,
 'summarization': 0.8,
 'summarizes': 0.2,
 'summarizing': 0.2,
 'summary': 0.6,
 'system': 0.2,
 'text': 0.8,
 'together': 0.2,
 'topic': 0.4,
 'use': 0.2,
 'web': 0.2,
 'within': 0.2,
 'words': 0.2}

In [47]:
sentence_scores

{'a related application is summarizing news articles.': 2.6,
 'an example for this is app called inshorts, which summarizes news articles into 60 words.': 3.8000000000000003,
 'an example of text summarization problem is news article summarization, which attempts to automatically produce an abstract from a given article.': 7.000000000000001,
 'imagine a system, which automatically pulls together news articles on a given topic (from the web), and concisely represents the latest news as a summary.': 6.6000000000000005,
 'this problem is called multi-document summarization.': 1.6}
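Only five of the seven sentences appear in this output because the two longest ones contain 30 or more words and are skipped. The sketch below reproduces the scoring rule without NLTK to show where a value such as 2.6 comes from. The function name `score_sentences`, the `max_words` parameter, and the use of `str.split()` with punctuation stripping in place of `nltk.word_tokenize` are all assumptions.

```python
def score_sentences(sentences, weights, max_words=30):
    """Sum the weights of known words in each sufficiently short sentence."""
    scores = {}
    for sentence in sentences:
        words = sentence.split()
        if len(words) >= max_words:
            continue  # long sentences are never scored
        # strip surrounding punctuation as a crude stand-in for word_tokenize
        matched = [weights[w.strip('.,()')] for w in words if w.strip('.,()') in weights]
        if matched:
            scores[sentence] = sum(matched)
    return scores

# Weights copied from the word_frequencies output for the words in this sentence.
weights = {"related": 0.2, "application": 0.2, "summarizing": 0.2,
           "news": 1.0, "articles": 1.0}
short = "a related application is summarizing news articles."
long_sentence = " ".join(["news"] * 30) + "."

scores = score_sentences([short, long_sentence], weights)
print(scores)
```

The score of the short sentence is 0.2 + 0.2 + 0.2 + 1.0 + 1.0, about 2.6. Because the weights are floats, sums can carry small rounding errors, which explains values like 3.8000000000000003 in the output above.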

## Text Summarization

In [48]:
# get the 2 highest-scoring sentences
import heapq
summary = heapq.nlargest(2, sentence_scores, key=sentence_scores.get)

print(" ".join(summary))

an example of text summarization problem is news article summarization, which attempts to automatically produce an abstract from a given article. imagine a system, which automatically pulls together news articles on a given topic (from the web), and concisely represents the latest news as a summary.
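`heapq.nlargest` returns sentences ordered by score, highest first. In the run above the two selected sentences happen to follow their order in the article, but that is not guaranteed in general. One possible variation, sketched here as an assumption rather than as the notebook's method, selects the top sentences and then emits them in their original order. The name `pick_summary` and the toy sentences are hypothetical.

```python
import heapq

def pick_summary(sentences, scores, n=2):
    """Pick the n best-scoring sentences, then join them in document order."""
    top = set(heapq.nlargest(n, scores, key=scores.get))
    return " ".join(s for s in sentences if s in top)

# Toy data: the best sentence comes last in the document.
sentences = ["First point.", "Second point.", "Third point.", "Fourth point."]
scores = {"Second point.": 1.6, "Third point.": 6.6, "Fourth point.": 7.0}

by_score = " ".join(heapq.nlargest(2, scores, key=scores.get))
in_order = pick_summary(sentences, scores, n=2)
print(by_score)
print(in_order)
```

Keeping document order tends to read more naturally when later sentences refer back to earlier ones; ordering by score puts the most relevant sentence first.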
