# Write a Simple Summarizer in Python from Scratch

https://towardsdatascience.com/write-a-simple-summarizer-in-python-e9ca6138a08e

Step 1. Read from source — Read the unabridged content from the source, a file in the case of this exercise.

Step 2. Perform formatting and cleanup — Format and clean up our format so that it is free of extra white space or other issues.

Step 3. Tokenize input — Take the input and break it up into its individual words.

Step 4. Scoring — Score (count) the frequency of each word in the input and score sentences based on word score.

Step 5. Selection — Choose the top N sentences based on their score.

In [44]:
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from string import punctuation
from nltk.probability import FreqDist
from heapq import nlargest
from collections import defaultdict

In [45]:
def read_file(path):
    try:
        with open(path, 'r') as file:
            return file.read()
    except IOError as e:
        print("Fatal Error: File ({}) could not be located or is not readable.".format(path))

In [46]:
def sanitize_input(data):
    replace = {
        ord('\f') : ' ',
        ord('\t') : ' ',
        ord('\n') : ' ',
        ord('\r') : None
    }
    return data.translate(replace)

In [47]:
def tokenize_content(content):
    stop_words = set(stopwords.words('english') + list(punctuation))
    words = word_tokenize(content.lower())
    
    return [
        sent_tokenize(content),
        [word for word in words if word not in stop_words]    
    ]

In [48]:
def score_tokens(filtered_words, sentence_tokens):
    word_freq = FreqDist(filtered_words)
    ranking = defaultdict(int)
    for i, sentence in enumerate(sentence_tokens):
        for word in word_tokenize(sentence.lower()):
            if word in word_freq:
                ranking[i] += word_freq[word]
    return ranking

In [49]:
def summarize(ranks, sentences, length):
    if int(length) > len(sentences): 
        print("Error, more sentences requested than available. Use --l (--length) flag to adjust.")
        exit()
    indexes = nlargest(length, ranks, key=ranks.get)
    final_sentences = [sentences[j] for j in sorted(indexes)]
    return ' '.join(final_sentences) 

In [50]:
content = read_file("Greenland-Melting-Full.txt")
content = sanitize_input(content)
print(content,"\n")
sentence_tokens, word_tokens = tokenize_content(content)  
print(sentence_tokens,"\n")
print(word_tokens,"\n")
    
sentence_ranks = score_tokens(word_tokens, sentence_tokens)
print(sentence_ranks,"\n")

print(summarize(sentence_ranks, sentence_tokens, 3))

Like a bowling ball on a skating rink, the black geodesic sphere of the East Greenland Ice-Core Project's communal living space stands out against the endless white nothingness of the Greenland ice sheet.  But the real action at East GRIP is under the surface. Researchers are drilling through more than 2.5 kilometers of ice, down to the bedrock below. The ice is sliding fast - for a glacier - toward the sea. Scientists here want to know why. The answer may hold clues to the future of the world's coastal cities.  Greenland is melting. As it melts, it adds roughly 1 millimeter of water per year to global sea levels. And the pace of melting is quickening.  If all the ice covering the world's largest island were to thaw, sea levels would rise roughly 6 meters. Scientists don't know how fast, or how likely, that is to happen. East GRIP is looking for evidence to inform both those questions.  The answers are a matter of growing urgency. The seas are rising faster. And the same processes at w

In [51]:
#http://ai.intelligentonlinetools.com/ml/text-summarization/
import requests
from bs4 import BeautifulSoup
from urllib.request import urlopen
 
def get_only_text(url):
    page = urlopen(url)
    soup = BeautifulSoup(page, "lxml")
    text = ' '.join(map(lambda p: p.text, soup.find_all('p')))
    return soup.title.text, text    
 
url="https://en.wikipedia.org/wiki/Deep_learning"
text = get_only_text(url)

# Text Summarization using sumy

In [52]:
from sumy.parsers.html import HtmlParser
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lsa import LsaSummarizer
from sumy.nlp.stemmers import Stemmer
from sumy.utils import get_stop_words
 
from sumy.summarizers.luhn import LuhnSummarizer
from sumy.summarizers.edmundson import EdmundsonSummarizer   #found this is the best as 
# it is picking from beginning also while other skip

LANGUAGE = "english"
SENTENCES_COUNT = 3
 
if __name__ == "__main__":
    url="https://en.wikipedia.org/wiki/Deep_learning"
    text = get_only_text(url)
    parser = PlaintextParser(text, Tokenizer(LANGUAGE))
    # or for plain text files
    # parser = PlaintextParser.from_file("document.txt", Tokenizer(LANGUAGE))
 
    print ("--LsaSummarizer--")
    summarizer = LsaSummarizer(Stemmer(LANGUAGE))#Latent Semantic Analysis
    summarizer.stop_words = get_stop_words(LANGUAGE)
    for sentence in summarizer(parser.document, SENTENCES_COUNT):
        print(sentence)
         
    print ("\n--LuhnSummarizer--")     
    summarizer = LuhnSummarizer(Stemmer(LANGUAGE)) #Word freq method
    summarizer.stop_words = ("I", "am", "the", "you", "are", "me", "is", "than", "that", "this",)
    for sentence in summarizer(parser.document, SENTENCES_COUNT):
        print(sentence)
         
    print ("\n--EdmundsonSummarizer--")     
    summarizer = EdmundsonSummarizer() #cue phrase method
    words = ("deep", "learning", "neural" )
    summarizer.bonus_words = words
    words = ("another", "and", "some", "next",)
    summarizer.stigma_words = words
    words = ("another", "and", "some", "next",)
    summarizer.null_words = words
    for sentence in summarizer(parser.document, SENTENCES_COUNT):
        print(sentence)   

--LsaSummarizer--
[138] Another example is Facial Dysmorphology Novel Analysis (FDNA) used to analyze cases of human malformation connected to a large database of genetic syndromes.\n Closely related to the progress that has been made in image recognition is the increasing application of deep learning techniques to various visual art tasks.
[142] Deep neural architectures provide the best results for constituency parsing,[143] sentiment analysis,[144] information retrieval,[145][146] spoken language understanding,[147] machine translation,[103][148] contextual entity linking,[148] writing style recognition,[149] Text classification and others.
[216] Learning a grammar (visual or linguistic) from training data would be equivalent to restricting the system to commonsense reasoning that operates on concepts in terms of grammatical production rules and is a basic goal of both human language acquisition[220] and artificial intelligence (AI).

--LuhnSummarizer--
[2]\n Deep-learning architect

# LexRank using sumy

In [53]:
text='''Operating a pawn shop in a small neighborhood, Cha Tae-sik now leads a quiet life. His only connection to the rest of the world is a little girl, So-mee, who lives nearby. A heroin addict and So-mis mother, Hyo-jeong, smuggles drugs from a drug trafficking organization and entrusts Tae-sik with the product without his knowledge. When the traffickers find out about this they kidnap both Hyo-jeong and So-mi. The gang sends a number of thugs to Tae-sik's pawn shop to retrieve the stolen drugs, but is easily overpowered by Tae-sik, making his identity ambiguous. However, upon learning that the gang now has in their possession both Hyo-jeong and So-mi, Tae-sik gives the beaten gang members what they are looking for.
Realizing that Tae-sik may serve better as a mule than their former thug, the brothers that lead the gang Man-sik and Jong-sik promise to release Hyo-jeong and So-mi under the condition that Tae-sik make a delivery for them. Tae-sik makes the decision to face the outside world in order to rescue So-mi. However, the delivery was part of a larger plot to eliminate a drug ring superior, Mr. Oh, and Tae-sik is arrested. At the same time, Hyo-jeongs body, with her organs harvested, is discovered in the back of the car used by Tae-sik when he made the delivery, and Tae-sik realizes that So-mis life may also be in danger. He fights off half a dozen detectives and escapes from the police station. During his escape, the police are bewildered at Tae-sik's display of power, combat techniques and agility, and further investigates his bio and finds out that he was once a black operation agent for the Korean government with numerous commendations, but dropped out from the agency after witnessing his pregnant wife being murdered in front of his eyes in connection with him by being hit by a truck while inside their car. The incident nearly drove him mad, hence he went into hiding.
Realizing this, the Narcotics head contacted a weakened Tae-sik after his encounter with a highly trained assassin who backs up the brothers. Now with the knowledge that So-mee is being used by an ANT organization to secretly smuggle drugs and - in the future might be killed for organ harvesting, Tae-sik goes on a mad killing.hunting spree to locate and save So-mee, who is his only connection to a caring world.
A gore battle ensues, from the ANT/Drug manufacturing location where he was able to free the remaining children and kill off the younger of the brothers, to their posh condo unit where the rest of the killing continues. Tae-sik went mad when a container that holds harvested eyes was rolled towards him, believing to be So-mees. A final stand-off with the assassin ensues, with Tae-sik winning the fight.
After killing off the last hoodlum in the parking lot, Tae-sik was about to resign to his fate by shooting himself when a scared and dirty So-mee emerged from the darkness. She was saved by the assassin who took to her kindly and killed instead the surgeon accomplice - his eyes were in the container that Tae-sik saw.
A police escort in the end sees Tae-sik and So-mee together in the back of the detectives car. While she sleeps. Tae-sik asks if they can be dropped off at the small convenience store - a surprise for So-mee. The owner got a shock upon seeing all the police lined up around them, and with the words: You really messed it up big time.
Tae-sik pulls up a fancy backpack with a star, and fills it up with fancy school stuff, much to her delight.
Tae-sik asks if he can hug her, and as they embrace, he could not stop the tears as they fell from his battered bandaged face.'''

import sumy
from sumy.parsers.plaintext import PlaintextParser #We're choosing a plaintext parser here, other parsers available for HTML etc.
from sumy.nlp.tokenizers import Tokenizer 
from sumy.summarizers.lex_rank import LexRankSummarizer #We're choosing Lexrank, other algorithms are also built in

#file = "plain_text.txt" #name of the plain-text file
#parser = PlaintextParser.from_file(file, Tokenizer("english"))

parser = PlaintextParser(text, Tokenizer("english"))
summarizer = LexRankSummarizer()

summary = summarizer(parser.document, 5) #Summarize the document with 5 sentences

for sentence in summary:
    print(sentence)


However, upon learning that the gang now has in their possession both Hyo-jeong and So-mi, Tae-sik gives the beaten gang members what they are looking for.
Now with the knowledge that So-mee is being used by an ANT organization to secretly smuggle drugs and - in the future might be killed for organ harvesting, Tae-sik goes on a mad killing.hunting spree to locate and save So-mee, who is his only connection to a caring world.
Tae-sik went mad when a container that holds harvested eyes was rolled towards him, believing to be So-mees.
A police escort in the end sees Tae-sik and So-mee together in the back of the detectives car.
Tae-sik pulls up a fancy backpack with a star, and fills it up with fancy school stuff, much to her delight.


# Text Rank using summa

In [54]:
#https://github.com/summanlp/textrank
text = "Automatic summarization is the process of reducing a text document with a computer program in order to create a summary that retains the most important points of the original document. As the problem of information overload has grown, and as the quantity of data has increased, so has interest in automatic summarization. Technologies that can make a coherent summary take into account variables such as length, writing style and syntax. An example of the use of summarization technology is search engines such as Google. Document summarization is another."""

In [55]:
from summa import summarizer
print(summarizer.summarize(text))

Automatic summarization is the process of reducing a text document with a computer program in order to create a summary that retains the most important points of the original document.


In [56]:
#extract keywords
from summa import keywords
print(keywords.keywords(text))

summarization
text document
original


In [57]:
from summa.summarizer import summarize
print(summarize(text, ratio=0.6))

Automatic summarization is the process of reducing a text document with a computer program in order to create a summary that retains the most important points of the original document.
An example of the use of summarization technology is search engines such as Google.
Document summarization is another.


In [58]:
print(summarize(text, words=20))
print()
print(summarize(text, words=40))
print()
print(summarize(text, words=60))
print()
print(summarize(text, words=80))

Automatic summarization is the process of reducing a text document with a computer program in order to create a summary that retains the most important points of the original document.

Automatic summarization is the process of reducing a text document with a computer program in order to create a summary that retains the most important points of the original document.
Document summarization is another.

Automatic summarization is the process of reducing a text document with a computer program in order to create a summary that retains the most important points of the original document.
As the problem of information overload has grown, and as the quantity of data has increased, so has interest in automatic summarization.
An example of the use of summarization technology is search engines such as Google.
Document summarization is another.

Automatic summarization is the process of reducing a text document with a computer program in order to create a summary that retains the most important

Here are some other summarizers:

 - https://github.com/thavelick/summarize/ - Python, TF (very simple)
 - Reduction - Python, TextRank (simple)
 - Open Text Summarizer - C, TF without normalization
 - Simple program that summarize text - Python, TF without normalization
 - Intro to Computational Linguistics - Java, LexRank
 - Sumtract: Second project for UW LING 572 - Python
 - TextTeaser - Scala
 - PyTeaser - TextTeaser port in Python
 - Automatic Document Summarizer - Java, Bipartite HITS (no sources)
 - Pythia - Python, LexRank & Centroid
 - SWING - Ruby
 - Topic Networks - R, topic models & bipartite graphs
 - Almus: Automatic Text Summarizer - Java, LSA (without source code)
 - Musutelsa - Java, LSA (always freezes)
 - http://mff.bajecni.cz/index.php - C++
 - MEAD - Perl, various methods + evaluation framework