# Text Summarization using Word Importance

In this lecture we will apply the NLP techniques you learned in previous lectures to a very useful problem: text summarization.

In [2]:
import sys
import nltk
import math
import operator
import urllib.request
from bs4 import BeautifulSoup
from nltk import FreqDist
from nltk.tokenize import sent_tokenize,word_tokenize
from nltk.corpus import stopwords


Many news sites publish their HTML content inside specific tags. The most common one is `<article></article>`, which wraps the actual text of the story.
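
To make the idea concrete, here is a minimal sketch that pulls the text out of an `<article>` element in a small hand-written page. It deliberately uses the standard library's `html.parser.HTMLParser` as a dependency-free stand-in for BeautifulSoup; the `TagTextExtractor` class, the `extract_tag_text` helper, and the sample HTML are assumptions made up for this illustration, not part of the lecture code.

```python
from html.parser import HTMLParser


class TagTextExtractor(HTMLParser):
    """Collect the text found inside every occurrence of one tag (hypothetical helper)."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.depth = 0    # greater than 0 while we are inside the target tag
        self.chunks = []  # text fragments collected so far

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Keep text only while inside the target tag
        if self.depth > 0:
            self.chunks.append(data)


def extract_tag_text(html, tag="article"):
    parser = TagTextExtractor(tag)
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


# Made-up page: navigation and footer sit outside <article>
sample_html = (
    "<html><head><title>Demo</title></head><body>"
    "<nav>Home | Opinions</nav>"
    "<article><p>First paragraph.</p><p>Second paragraph.</p></article>"
    "<footer>Copyright notice</footer>"
    "</body></html>"
)

article_text = extract_tag_text(sample_html)
print(article_text)
```

The navigation and footer text are left out because they sit outside the `<article>` element. Note that the fragments are joined without a separator, so paragraphs end up glued together.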

In [3]:
def getArticle(url,tag):
    # Most content providers use the <article></article> tag to delimit the article's actual text
    try:
        response = urllib.request.urlopen(url)
        html = response.read().decode('utf-8')
        soup = BeautifulSoup(html,"html.parser")
    except Exception:
        print ("Unexpected error:", sys.exc_info()[1])
        return (None,None)
    
    # Find every occurrence of the tag and join its text.
    # find_all() returns an empty list (never None) when nothing matches,
    # so article is simply "" in that case.
    article = ''.join(element.text for element in soup.find_all(tag))
    # Finding the paragraphs <p> tag
    #soup_p = BeautifulSoup(article, "html.parser")
    #if (soup_p.find_all('p') is not None):
    #    article = ''.join(map(lambda p: p.text, soup_p.find_all('p')))
    
    # Some pages have no <title>; return None for the title in that case
    title = soup.title.text if soup.title is not None else None
    return article, title

## Defining All the NLP Functions Needed

In [4]:
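def remove_punctuation(corpus):
    # Drop tokens made up entirely of punctuation characters.
    # A set of characters avoids the invalid escape sequences of the earlier string literal.
    punctuations = set(".,\"-\\/#!?$%^&*;:{}=_'~()")
    filtered_corpus = [token for token in corpus if not all(ch in punctuations for ch in token)]
    return filtered_corpus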

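def apply_stopwording(corpus, min_len):
    # Build the stopword set once instead of reloading the list for every token
    stop_words = set(stopwords.words('english'))
    filtered_corpus = [token for token in corpus if token not in stop_words and len(token) > min_len]
    return filtered_corpus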

def apply_stemming(corpus):
    stemmer = nltk.PorterStemmer()
    normalized_corpus = [stemmer.stem(token) for token in corpus]
    return normalized_corpus

def apply_lemmatization(corpus):
    lemmatizer = nltk.WordNetLemmatizer()
    normalized_corpus = [lemmatizer.lemmatize(token) for token in corpus]
    return normalized_corpus

def getSummary(article, low, high, num_of_sentences):
    freq={}
    sentences = sent_tokenize(article)
    # Calculate word frequencies
    for sentence in sentences:
        doc =nltk.Text(nltk.word_tokenize(sentence.lower()))
        doc_clean = apply_lemmatization(apply_stopwording(remove_punctuation(doc), 3))
        for token in doc_clean:
            if token in freq.keys():
                freq[token]+=1
            else:
                freq[token]=1
    
    # Nothing to summarize if no token survived the cleaning steps
    if not freq:
        return []

    # Normalize the counts: divide by the largest frequency so we can make comparisons
    max_freq = float(max(freq.values()))
    freq_final={}

    # Keep only tokens whose normalized frequency lies strictly between low and high,
    # dropping both the too-frequent and the too-rare ones
    for token in freq.keys():
        freq[token]=freq[token]/max_freq
        if (freq[token]<high and freq[token]>low):
            freq_final[token]=freq[token]
    
    # For debugging purposes
    #print (len(freq.keys()))
    #print (len(freq_final.keys()))
    
    # Now we are ready to summarize
    # 1. Score all sentences: sum the frequency scores of all terms in the sentence
    # 2. Normalize the sentence score (longer sentences would otherwise score higher) by dividing by the longest sentence length
    # 3. Rank the sentences and return the top <num_of_sentences>
    
    scores = {}
    sLen = 0.0
    # 1. Score all sentences
    for sentence in sentences:
        # Process each sentence
        doc =nltk.Text(nltk.word_tokenize(sentence.lower()))
        doc_clean = apply_lemmatization(apply_stopwording(remove_punctuation(doc), 3))
        
        # Identify the longest sentence length
        if len(doc_clean)>sLen:
            sLen = len(doc_clean)
        
        # Score the sentence
        for token in doc_clean:
            if (token in freq_final.keys()):
                if sentence in scores.keys():
                    scores[sentence]+=freq_final[token]
                else:
                    scores[sentence]=freq_final[token]
        
    # 2. Normalize the sentence score
    for key in scores.keys():
        scores[key]=scores[key]/sLen

    # 3. Rank the scores
    sorted_sentences = reversed(sorted(scores.items(), key=operator.itemgetter(1)))
    
    # 4. Return the top <num_of_sentences>
    # sorted_sentences holds (sentence, score) pairs, highest score first
    result = [sentence for sentence, score in list(sorted_sentences)[:num_of_sentences]]
    return result
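
The scoring scheme used by `getSummary` can be traced by hand on a tiny text. The sketch below is a simplified, standard-library-only version of the same steps (count, normalize by the largest count, filter by `low`/`high`, score, divide by the longest sentence length, rank). It assumes a naive regex sentence splitter and tokenizer in place of NLTK and skips stopword removal and lemmatization; the `toy_summary` name, the sample text, and the thresholds 0.4 and 0.9 are made up for illustration.

```python
import re
from collections import Counter


def toy_summary(text, low, high, num_of_sentences):
    # Assumption: split on whitespace that follows ., ! or ? (not sent_tokenize)
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # Assumption: lowercase alphabetic tokens longer than 3 characters
    tokenized = [[w for w in re.findall(r"[a-z]+", s.lower()) if len(w) > 3]
                 for s in sentences]

    # Count tokens over the whole text and normalize by the largest count
    counts = Counter(token for tokens in tokenized for token in tokens)
    if not counts:
        return []
    max_count = max(counts.values())
    # Keep tokens whose normalized frequency is strictly between low and high
    weights = {t: c / max_count for t, c in counts.items()
               if low < c / max_count < high}

    # Score each sentence and divide by the longest cleaned sentence length
    longest = max(len(tokens) for tokens in tokenized)
    scores = {}
    for sentence, tokens in zip(sentences, tokenized):
        total = sum(weights.get(t, 0.0) for t in tokens)
        if total > 0:
            scores[sentence] = total / longest

    # Rank by score, highest first, and keep the top sentences
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:num_of_sentences]


text = "Cats chase mice. Cats sleep often. Dogs chase cats often. Birds sing."
top = toy_summary(text, 0.4, 0.9, 1)
print(top)  # ['Dogs chase cats often.']
```

Here "cats" appears three times, so its normalized frequency is 1.0 and it is filtered out by `high = 0.9`; "chase" and "often" (2/3 each) are the only tokens that carry weight, and the third sentence wins because it contains both. Since the comparison is strict, any `high` of 1.0 or less always discards the most frequent token.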

In [5]:
url = "https://www.washingtonpost.com/opinions/on-gun-violence-we-are-a-failed-state/2018/02/18/88ecf09a-137a-11e8-9065-e55346f6de81_story.html"
tag ="article"
article, title = getArticle(url,tag)

In [10]:
summary = getSummary(article,0.15,0.9,5)
for s in summary:
    print (s)

And Peter Stone and Greg Gordon of McClatchy reported in January that the FBI “is investigating whether a top Russian banker with ties to the Kremlin illegally funneled money to the National Rifle Association to help Donald Trump win the presidency.”  Wherever this Russia story goes, we already know that the NRA and its political servants are immobilizing our government on one of the gravest problems confronting us.
President Trump’s rote address to the nation after the killing of 17 people at Marjory Stoneman Douglas High School in Parkland, Fla., had all the passion of a CEO delivering a middling annual report.
He told us: “We are committed to working with state and local leaders to help secure our schools and tackle the difficult issue of mental health.”  Trump’s speech, as Vox’s German Lopez observed, was “one giant lie by omission.” Those 17 people were killed by an AR-15 rifle, not by a knife or a sword or a bomb.
Yes, and if Trump cared so much about mental health, he wouldn’t b

In [None]:
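# getSummary expects (article, low, high, num_of_sentences);
# the thresholds below reuse the values from the earlier cell
summary = getSummary(article,0.15,0.9,2)
print ("# of sentences: "+str(len(summary)))
for index, sentence in enumerate(summary, start=1):
    print ("%s - %s" % (index,sentence))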

In [8]:
print (article)

 The surest sign a political regime is failing is its inability to do anything about a problem universally seen as urgent that has some obvious remedies. And it’s a mark of political corruption when unaccountable cliques block solutions that enjoy broad support and force their selfish interests to prevail over the common good.  On gun violence, the United States has become a corrupt failed state. This is the only conclusion to draw from the endless enraging replays of the same political paralysis, no matter how many children are gunned down at our schools or how many innocent Americans are slaughtered at shopping centers and other public places. Whatever happens, we can’t ban assault weapons, we can’t strengthen background checks, we can’t do anything. In corrupt failed states, politics is about lying and misdirection. On guns, our debate is a pack of lies and evasions. In no other country is the phrase “thoughts and prayers” a sacrilege, a cover for cowardice. In no other country are 