# Extractive Summary Tool

There are two kinds of summaries: **extractive** and **abstractive**.

This tool is an **extractive** summary tool: it simply selects the "most important" sentences from a body of text and returns them. Computers are dumb, though, so there's no guarantee that the result is a good summary. In our case, the tool counts how often each non-stopword appears in the text, then weights each word by its count relative to the most frequent word. A sentence's score is the sum of the weights of its words, so a sentence that uses commonly-used words often will likely score higher and be returned by this tool. There are some unintended consequences too: because scores are sums, longer sentences tend to rank above shorter ones (the code does skip sentences of 30 or more words).
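The scoring idea described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the tool's actual code: the `score_sentences` helper, the tiny `STOPWORDS` set, and the regex-based sentence split are all assumptions made for the example (the tool itself relies on NLTK's tokenizers and stopword list).

```python
import re
from collections import Counter

# Tiny illustrative stopword set (an assumption; the tool uses NLTK's English list).
STOPWORDS = {"the", "a", "is", "and", "of", "on", "it", "to", "in"}

def score_sentences(text):
    """Score each sentence by summing the normalized frequencies of its words."""
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # Count every lowercase word that is not a stopword.
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]
    counts = Counter(words)
    if not counts:
        return {}
    # Normalize so the most frequent word has weight 1.
    top = max(counts.values())
    weights = {w: c / top for w, c in counts.items()}
    # A sentence's score is the sum of the weights of its words.
    return {
        s: sum(weights.get(w, 0.0) for w in re.findall(r"[a-z]+", s.lower()))
        for s in sentences
    }

text = "Cats sleep. Cats sleep on the warm sofa and cats purr. Dogs bark."
scores = score_sentences(text)
for sentence, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{score:.2f}  {sentence}")
```

In this toy input the second sentence comes out on top largely because it contains the most words, which is the length bias mentioned above.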

When humans summarize text, they synthesize new content based on the input. That's an abstractive summary, and it is not how this tool works. If that's something you're interested in, I'd recommend looking into Google Research's PEGASUS model.

Code borrowed from:
https://stackabuse.com/text-summarization-with-nltk-in-python/

Another reference:
https://towardsdatascience.com/understand-text-summarization-and-create-your-own-summarizer-in-python-b26a9f09fc70

## Capabilities

This tool can ingest the following file formats:
- .txt
- .pdf (text-based; this won't OCR anything)
- .doc, .docx
- URLs (limited support)
    - The tool will try to scrape your target site, but it will not return particularly helpful messages if the request fails

In [12]:
import nltk
from pprint import pprint
import heapq
import bs4 as bs
import urllib.request
import re
import textract

import docx  # Odd, but this is how you import the python-docx package.
# Install it with "pip install python-docx"; "pip install docx" installs the wrong thing.
from docx import Document
from docx.shared import Pt

import requests
import boilerpy3
from boilerpy3 import extractors

## Trying to add IPython widget support
The widget code below is experimental and is commented out.

In [2]:
# from ipywidgets import interact, interactive, fixed, interact_manual
# import ipywidgets as widgets

# def interactiveBoxes(k, article):
#     summarize(k, article)

# # Keyword names must match interactiveBoxes' parameters (k, article),
# # and k must be an integer for heapq.nlargest.
# iplot = interact(interactiveBoxes,
#     k = widgets.IntText(value=20, description='Total #:', disabled=False),
#     article = widgets.Text(value='https://en.wikipedia.org/wiki/Abstract_(summary)',
#         placeholder='Type something', description='Article:', disabled=False))

In [13]:
def getPDFtext(filename):
    pdf_text = textract.process(filename)
    if isinstance(pdf_text, (bytes, bytearray)):
        pdf_text = pdf_text.decode("utf-8")
    
    return pdf_text

def getDocXtext(filename):
    ## Dumps the text of your word doc
    doc = docx.Document(filename)
    fullText = []
    for para in doc.paragraphs:
        fullText.append(para.text)
    return ' '.join(fullText)

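def getUrlText(url):
    # Earlier approach, kept for reference: join the text of every <p> tag with BeautifulSoup
    #     scraped_data = urllib.request.urlopen(url)
    #     article = scraped_data.read()
    #     parsed_article = bs.BeautifulSoup(article,'lxml')
    #     paragraphs = parsed_article.find_all('p')
    #     webtext = ""
    #     for p in paragraphs:
    #         webtext += p.text
    #     return webtext

    # Current approach: let boilerpy3 pull the main article text from the page
    extractor = extractors.ArticleExtractor()
    doc = extractor.get_doc_from_url(url)
    return doc.content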


def getText(filename):
    # Split once on the last period and take the final piece as the extension
    filetype = filename.rsplit(".", 1)[-1].lower()
    article_text = ""
    # Check for links first so a URL ending in ".pdf" or ".txt" isn't treated as a local file
    if filename.startswith("http"):
        print("Looks like a link!")
        try:
            article_text = getUrlText(filename)
        except Exception as err:
            print(f"\t>>Couldn't grab text: {err}")
    elif filetype == "pdf":
        try:
            print("Looks like a PDF")
            article_text = getPDFtext(filename)
        except Exception as err:
            print(f"\t>>Couldn't grab text: {err}")
    elif filetype in ["doc", "docx"]:
        # Note: python-docx reads .docx files; a legacy .doc will likely fail here
        try:
            print("Looks like a Word Doc")
            article_text = getDocXtext(filename)
        except Exception as err:
            print(f"\t>>Couldn't grab text: {err}")
    elif filetype == "txt":
        try:
            print("Looks like a .txt doc")
            # Use a context manager so the file handle is closed
            with open(filename, "r") as f:
                article_text = f.read()
        except Exception as err:
            print(f"\t>>Couldn't grab text: {err}")
    else:
        print("\t>>Not sure what kind of file that is!")
    print(f"\t>>{len(nltk.word_tokenize(article_text))} words\n\t>>{len(nltk.sent_tokenize(article_text))} sentences")
    return article_text

In [4]:
def summarize(k, article):
    article_text = getText(article)
    # Remove bracketed citation numbers such as [12], then collapse whitespace
    article_text = re.sub(r'\[[0-9]*\]', ' ', article_text)
    article_text = re.sub(r'\s+', ' ', article_text)

    # Keep letters only for word counting. Lowercase here so the counts
    # match the lowercased words looked up when scoring sentences below.
    formatted_article_text = re.sub('[^a-zA-Z]', ' ', article_text).lower()
    formatted_article_text = re.sub(r'\s+', ' ', formatted_article_text)

    sentence_list = nltk.sent_tokenize(article_text)

    stopwords = set(nltk.corpus.stopwords.words('english'))

    # Count how often each non-stopword appears
    word_frequencies = {}
    for word in nltk.word_tokenize(formatted_article_text):
        if word not in stopwords:
            word_frequencies[word] = word_frequencies.get(word, 0) + 1

    # Nothing to score (for example, the text could not be loaded)
    if not word_frequencies:
        print("\t>>No text to summarize")
        return

    # Normalize counts so the most frequent word has weight 1
    maximum_frequency = max(word_frequencies.values())
    for word in word_frequencies:
        word_frequencies[word] = word_frequencies[word] / maximum_frequency

    # Score each sentence shorter than 30 words by summing its word weights
    sentence_scores = {}
    for sent in sentence_list:
        if len(sent.split(' ')) >= 30:
            continue
        for word in nltk.word_tokenize(sent.lower()):
            if word in word_frequencies:
                sentence_scores[sent] = sentence_scores.get(sent, 0) + word_frequencies[word]

    # Pick the k highest-scoring sentences and print them as bullets
    summary_sentences = heapq.nlargest(k, sentence_scores, key=sentence_scores.get)

    for sentence in summary_sentences:
        print("• " + sentence)

In [None]:
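# Example values taken from the widget defaults above; swap in your own file path or URL.
k = 20
article = 'https://en.wikipedia.org/wiki/Abstract_(summary)'
summarize(k, article)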