# Article summary using NLTK

Let's start by importing what we need to open a URL and scrape content from the web.

In [1]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

I will use one specific article from the Washington Post.

In [2]:
articleURL = "https://www.washingtonpost.com/news/the-switch/wp/2016/10/18/the-pentagons-massive-new-telescope-is-designed-to-track-space-junk-and-watch-out-for-killer-asteroids/"

Now let me open the URL, read the response and decode it.

In [3]:
page = urlopen(articleURL).read().decode('utf8','ignore') 
soup = BeautifulSoup(page,"lxml")
print(soup.prettify()[:400])

<!DOCTYPE html>
<html class="blog layout_article rendering-context-www outputtype_default-article" itemscope="" itemtype="http://schema.org/NewsArticle" lang="en">
 <head>
  <script>
   window.pbDeferredScripts=window.pbDeferredScripts||new Array;
  </script>
  <script id="_$cookiemonster">
   (function(document,undefined){var wl={};wl.reg=[];wl.map=[];function CM(wlmap,wlreg){this.wl={map:wl.map.


I need to find the actual article in the HTML...

In [4]:
print(soup.find('article').prettify()[:400])

<article class="paywall" itemprop="articleBody">
 <div class="inline-content inline-video">
  <div class="posttv-video-embed powa" data-ad-bar="1" data-aspect-ratio="0.5625" data-blurb="1" data-live="0" data-object-id="580640a4e4b0d16481f68b07" data-org="wapo" data-playthrough="1" data-uuid="24fd7912-9548-11e6-9cae-2a3574e296a6" data-youtube-id="aGH_nLCOWOw">
   <script src="https://d1pz6dax0t5mop


And extract only the text from the HTML tags...

In [5]:
soup.find('article').text[:400]

'      The Space Surveillance Telescope offers improvements in determining the orbits of newly discovered objects and provides rapid observations of events that may only occur over a relatively short period of time, like a supernova. (DARPAtv)    There are a lot of rocks flying around through space. Lots of debris, too. Old satellites, spent rocket boosters, even for a short while a spatula that go'

OK, but there is more than one "article" tag, so I need to join them.

In [6]:
text = ' '.join(map(lambda p: p.text, soup.find_all('article')))
text[:400]

'      The Space Surveillance Telescope offers improvements in determining the orbits of newly discovered objects and provides rapid observations of events that may only occur over a relatively short period of time, like a supernova. (DARPAtv)    There are a lot of rocks flying around through space. Lots of debris, too. Old satellites, spent rocket boosters, even for a short while a spatula that go'

And remove non-ASCII characters...

In [7]:
text.encode('ascii', errors='replace').decode("utf-8").replace("?"," ")[:400]

'      The Space Surveillance Telescope offers improvements in determining the orbits of newly discovered objects and provides rapid observations of events that may only occur over a relatively short period of time, like a supernova. (DARPAtv)    There are a lot of rocks flying around through space. Lots of debris, too. Old satellites, spent rocket boosters, even for a short while a spatula that go'
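One side effect of the cell above: `errors='replace'` turns every non-ASCII character into `?`, so the following `replace("?", " ")` also blanks out genuine question marks in the article. A minimal sketch of an alternative, assuming it is fine to turn each non-ASCII character into a space directly. The helper name `strip_non_ascii` and the sample string are made up for illustration and are not part of the notebook:

```python
def strip_non_ascii(text):
    """Replace every non-ASCII character with a space, leaving '?' untouched."""
    return ''.join(ch if ord(ch) < 128 else ' ' for ch in text)


# Sample with curly quotes (non-ASCII) and a real question mark.
sample = "Is it \u201csafe\u201d? Yes."

# The notebook's approach: the real '?' disappears together with the quotes.
notebook_way = sample.encode('ascii', errors='replace').decode('utf-8').replace('?', ' ')

# The sketch: only the curly quotes are replaced.
cleaned = strip_non_ascii(sample)

print(repr(notebook_way))
print(repr(cleaned))
```

Whether losing question marks matters depends on the use; for frequency-based ranking it is mostly harmless, but it can affect how sentences are split.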

### Creating a function

Everything works, so I will wrap the previous steps in a function for later use.

In [8]:
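def getTextWaPo(url):
    """Download a Washington Post article and return its text as plain ASCII."""
    # 'ignore' skips undecodable bytes, as in the exploratory cell In [3]
    page = urlopen(url).read().decode('utf8', 'ignore')
    soup = BeautifulSoup(page, "lxml")
    # Join the text of every <article> tag on the page
    text = ' '.join(map(lambda p: p.text, soup.find_all('article')))
    # Non-ASCII characters become '?' and are then replaced with spaces
    return text.encode('ascii', errors='replace').decode("utf-8").replace("?", " ")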

Let's fetch the article again using this function.

In [9]:
text = getTextWaPo(articleURL)

### Tokenization

Now I will split the text into sentences, and then into words.

In [10]:
from nltk.tokenize import sent_tokenize,word_tokenize
from nltk.corpus import stopwords
from string import punctuation

Tokenization of sentences:

In [11]:
sents = sent_tokenize(text)
sents[:5]

['      The Space Surveillance Telescope offers improvements in determining the orbits of newly discovered objects and provides rapid observations of events that may only occur over a relatively short period of time, like a supernova.',
 '(DARPAtv)    There are a lot of rocks flying around through space.',
 'Lots of debris, too.',
 'Old satellites, spent rocket boosters, even for a short while a spatula that got loose during a space shuttle mission in 2006.',
 'All of it swirling around in orbit, creating a bit of a traffic jam.']

Tokenization of words:

In [12]:
word_sent = word_tokenize(text.lower())
word_sent[:10]

['the',
 'space',
 'surveillance',
 'telescope',
 'offers',
 'improvements',
 'in',
 'determining',
 'the',
 'orbits']

I combine the English stopwords and punctuation into a single set.

In [13]:
_stopwords = set(stopwords.words('english') + list(punctuation))
#_stopwords

Now I remove all stopwords and punctuation from tokenized words.

In [14]:
word_sent=[word for word in word_sent if word not in _stopwords]

In [15]:
word_sent[:20]

['space',
 'surveillance',
 'telescope',
 'offers',
 'improvements',
 'determining',
 'orbits',
 'newly',
 'discovered',
 'objects',
 'provides',
 'rapid',
 'observations',
 'events',
 'may',
 'occur',
 'relatively',
 'short',
 'period',
 'time']

Now let's count the frequency of each remaining word.

In [16]:
from nltk.probability import FreqDist
freq = FreqDist(word_sent)
#freq

And pick the ten most common ones...

In [17]:
from heapq import nlargest

In [18]:
nlargest(10, freq, key=freq.get)

['space',
 'telescope',
 'satellites',
 'objects',
 'debris',
 'orbit',
 'air',
 'force',
 'around',
 'small']
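As far as I know, NLTK's `FreqDist` is built on top of `collections.Counter`, so the counting and the top-N lookup can be reproduced with the standard library alone. A small sketch on a made-up token list (not the real article) that compares the `nlargest` call used above with `Counter.most_common`:

```python
from collections import Counter
from heapq import nlargest

# Made-up tokens standing in for the filtered word list (not the real article).
tokens = ['space', 'telescope', 'space', 'debris', 'space', 'telescope']

freq = Counter(tokens)

# Same call shape as in the notebook: iterate over the keys, rank by count.
top_heapq = nlargest(2, freq, key=freq.get)

# Counter's own shortcut, which also returns the counts.
top_counter = freq.most_common(2)

print(top_heapq)
print(top_counter)
```

`most_common` is handy when the counts themselves are needed, while `nlargest` returns only the words.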

Now we can rank the sentences using the word frequencies from the whole article: the score of a sentence is the sum of the frequencies of its words.

In [19]:
from collections import defaultdict
ranking = defaultdict(int)

for i,sent in enumerate(sents):
    for w in word_tokenize(sent.lower()):
        if w in freq:
            ranking[i] += freq[w]
            
ranking

defaultdict(int,
            {0: 55,
             1: 25,
             2: 8,
             3: 38,
             4: 17,
             5: 46,
             6: 72,
             7: 62,
             8: 54,
             9: 28,
             10: 17,
             11: 15,
             12: 19,
             13: 26,
             14: 57,
             15: 26,
             16: 47,
             17: 72,
             18: 36,
             19: 68,
             20: 44,
             21: 63,
             22: 30,
             23: 27,
             24: 37,
             25: 23,
             26: 23,
             27: 84,
             28: 30})
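To make the scoring concrete, here is a tiny worked example that uses only the standard library. The three sentences, the stopword set and the `simple_tokens` helper are made up for illustration; the helper is a crude stand-in for `word_tokenize`, not a replacement for it:

```python
import re
from collections import Counter, defaultdict

# Toy input and a tiny stopword set, both made up for illustration.
sentences = [
    "The telescope tracks debris.",
    "Debris is everywhere.",
    "The telescope tracks small debris in space.",
]
toy_stopwords = {'the', 'is', 'in'}


def simple_tokens(s):
    """Crude stand-in for word_tokenize: lowercase alphabetic runs only."""
    return re.findall(r'[a-z]+', s.lower())


# Frequency of every non-stopword across the whole text.
freq = Counter(w for s in sentences for w in simple_tokens(s)
               if w not in toy_stopwords)

# A sentence's score is the sum of the frequencies of its words.
ranking = defaultdict(int)
for i, s in enumerate(sentences):
    for w in simple_tokens(s):
        if w in freq:
            ranking[i] += freq[w]

print(dict(ranking))
```

Here 'debris' occurs three times and 'telescope' and 'tracks' twice each, so the scores are 7, 4 and 9. Because the score is a plain sum, longer sentences tend to rank higher.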

This time we look for the four highest-scoring sentences...

In [20]:
sents_idx = nlargest(4, ranking, key=ranking.get)
sents_idx

[27, 6, 17, 19]

And print them in the order in which they appear in the article...

In [21]:
[sents[j] for j in sorted(sents_idx)]

['On Tuesday, the Defense Department took another significant step toward monitoring all of the cosmic junk swirling around in space, by delivering a gigantic new telescope capable of seeing small objects from very far away.',
 'The telescope is  a big improvement over the legacy ground-based optical telescopes that are used by the U.S. Air Force, because it can search large areas of sky and also track very faint (small) objects in and around GEO,  Brian Weeden, a Technical Advisor at the Secure World Foundation, wrote in an email.',
 'The telescope would join another new space debris tracking technology known as the Space Fence, which is now being built by Bethesda-based Lockheed Martin.',
 'Every military operation that takes place in the world today is critically dependent on space in one way or another,  Air Force Gen. John Hyten said in an interview earlier this year when he was the commander of the Air Force Space Command.']
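Note the two different orderings involved: `nlargest` returns the indices from the highest score down, while `sorted` puts them back into article order so the summary reads naturally. A small sketch using five of the scores from the ranking output above:

```python
from heapq import nlargest

# A few (index, score) pairs copied from the ranking printed earlier.
ranking = {6: 72, 17: 72, 19: 68, 27: 84, 21: 63}

# Highest score first; this is the order stored in sents_idx.
by_score = nlargest(4, ranking, key=ranking.get)

# Ascending sentence index, i.e. the order of appearance in the article.
by_position = sorted(by_score)

print(by_score)
print(by_position)
```

To list the sentences from most to least valuable instead, iterate over `sents_idx` directly and skip the `sorted` call.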

And again we put all of this into a function.

In [22]:
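def summarize(text, n):
    """Return the n highest-scoring sentences of text, in article order."""
    sents = sent_tokenize(text)

    assert n <= len(sents)
    word_sent = word_tokenize(text.lower())
    _stopwords = set(stopwords.words('english') + list(punctuation))

    # Count every word that is neither a stopword nor punctuation
    word_sent = [word for word in word_sent if word not in _stopwords]
    freq = FreqDist(word_sent)

    # Score each sentence by summing the frequencies of its words
    ranking = defaultdict(int)

    for i, sent in enumerate(sents):
        for w in word_tokenize(sent.lower()):
            if w in freq:
                ranking[i] += freq[w]

    # Pick the n best sentences, then restore their original order
    sents_idx = nlargest(n, ranking, key=ranking.get)
    return [sents[j] for j in sorted(sents_idx)]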
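The `assert n <= len(sents)` check is skipped when Python runs with the `-O` flag, and it does not catch a zero or negative `n`. A hedged sketch of an explicit check that could replace it; the helper name `check_summary_length` is hypothetical and not part of the notebook:

```python
def check_summary_length(n, num_sentences):
    """Return n if a summary of n sentences is possible, else raise ValueError."""
    if not 0 < n <= num_sentences:
        raise ValueError(
            f"n must be between 1 and {num_sentences}, got {n}"
        )
    return n


# The article above was split into 29 sentences (indices 0 to 28).
print(check_summary_length(3, 29))

try:
    check_summary_length(30, 29)
except ValueError as err:
    print("rejected:", err)
```

Raising `ValueError` gives the caller something specific to handle, whereas a failed `assert` surfaces as a bare `AssertionError`.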

Now we can use this function to summarize other articles from the Washington Post as well. Here it is on our article, with three sentences:

In [23]:
summarize(text,3)

['On Tuesday, the Defense Department took another significant step toward monitoring all of the cosmic junk swirling around in space, by delivering a gigantic new telescope capable of seeing small objects from very far away.',
 'The telescope is  a big improvement over the legacy ground-based optical telescopes that are used by the U.S. Air Force, because it can search large areas of sky and also track very faint (small) objects in and around GEO,  Brian Weeden, a Technical Advisor at the Secure World Foundation, wrote in an email.',
 'Every military operation that takes place in the world today is critically dependent on space in one way or another,  Air Force Gen. John Hyten said in an interview earlier this year when he was the commander of the Air Force Space Command.']