<a href='https://ai.meng.duke.edu'><img align="left" style="padding-top:10px;" src="https://storage.googleapis.com/aipi_datasets/Duke-AIPI-Logo.png"></a>

# Abstractive Text Summarization
Abstractive text summarization methods create a summary of a document by generating new text that captures the main points of the source in far fewer words.  Unlike extractive summarization methods, which reuse sentences from the source, abstractive methods may produce phrases and sentences that did not appear in the source text.

The current state-of-the-art approach for abstractive text summarization uses transformer models that have been pre-trained on large text corpora and fine-tuned on datasets suited to the summarization task.  In this notebook we will use the open source [Hugging Face library](https://huggingface.co) to load and run such a model.

**Notes:**  
- This notebook does not require a GPU, although it will take a few minutes to run on a CPU
- This notebook uses a [DistilBart model](https://arxiv.org/pdf/2010.13002.pdf), but you can also use other BART models or Google's T5 instead  

**References:**  
- Review the Hugging Face [summarization documentation](https://huggingface.co/docs/transformers/task_summary#summarization)


In [1]:
from bs4 import BeautifulSoup
import nltk
import numpy as np
import requests
from nltk.corpus import stopwords
from sentence_transformers import SentenceTransformer, util
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, pipeline

## Get document to summarize
We will use BeautifulSoup to fetch an article from the web and extract its text content from the HTML.

In [2]:
# Get article
url = 'https://en.wikipedia.org/wiki/Random_forest'
page = requests.get(url)
page.raise_for_status()  # Fail early if the request was not successful
soup = BeautifulSoup(page.content, 'html.parser')

# Extract body text from the article's paragraph (<p>) tags
paragraphs = soup.find_all('p')
bodytext = [p.text for p in paragraphs]
article_text = ' '.join(bodytext)

## Load the model & associated tokenizer
We will use the open source Hugging Face library to load a pre-trained transformer model from their Model Hub.  Hugging Face recommends using a BART or Google's T5 model for summarization tasks.  Below we will use a [DistilBart model](https://arxiv.org/pdf/2010.13002.pdf), but you can try others.

In [3]:
model = AutoModelForSeq2SeqLM.from_pretrained("sshleifer/distilbart-cnn-12-6")
tokenizer = AutoTokenizer.from_pretrained("sshleifer/distilbart-cnn-12-6")

## Generate summary
Now that our model is loaded, we can use it to generate summary text.  We first tokenize the article text and then feed the token IDs into the model to generate the summary.  We can specify a desired minimum and maximum length (in tokens) for the output summary.  Note that the DistilBart model accepts a maximum input sequence length of 1024 tokens, not characters, so we must either truncate our source document to 1024 tokens or split it into chunks that fit within the limit, summarize each chunk, and then combine the results into a summary of the full document.

Let's first try it by simply truncating our input text to 1024 tokens.

In [20]:
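def truncate_summary(input_text, min_length, max_length):
    # Tokenize, keeping only the first 1024 tokens of the input
    inputs = tokenizer(input_text, return_tensors="pt", max_length=1024, truncation=True)
    # Generate with beam search; min_length and max_length are measured in tokens
    outputs = model.generate(inputs["input_ids"], max_length=max_length, min_length=min_length, length_penalty=1.0, num_beams=4, early_stopping=True)
    # Decode to text; special tokens such as <s> and </s> are kept in the string
    return tokenizer.decode(outputs[0])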

In [27]:
# Set desired min and max length for summary
min_length = 50
max_length = 400
# Generate summary
summary = truncate_summary(article_text,min_length,max_length)
# Clean up output formatting
summary = summary.split('</s>')[-2].split('<s>')[-1].strip()

print('Length of the source document: {}'.format(len(article_text)))
print('Length of the summary: {}'.format(len(summary)))
print('Summary: ')
print(summary)

Length of the source document: 24552
Length of the summary: 375
Summary: 

Random forests are an ensemble learning method for classification, regression and other tasks that operates by constructing a multitude of decision trees at training time. For classification tasks, the output of the random forest is the class selected by most trees. Random forests generally outperform decision trees, but their accuracy is lower than gradient boosted trees.
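
The clean-up step above assumes the decoded string has the shape `</s><s>...</s>`, which is what the chained `split` calls are written for. As a minimal sketch of a less position-dependent alternative, the hypothetical helper below (`strip_special_tokens` is not part of the notebook or of any library) simply removes the marker strings wherever they occur. The list of markers is an assumption based on the tokens visible in this notebook.

```python
def strip_special_tokens(decoded, special_tokens=("<s>", "</s>", "<pad>")):
    """Remove marker tokens from a decoded string and trim whitespace.

    The default marker list is an assumption for illustration only.
    """
    for token in special_tokens:
        decoded = decoded.replace(token, "")
    return decoded.strip()


# Example with a string shaped like the decoded model output
raw = "</s><s>Random forests are an ensemble learning method.</s>"
print(strip_special_tokens(raw))
```

Alternatively, `tokenizer.decode` accepts `skip_special_tokens=True`, which drops these markers during decoding so that no string clean-up is needed afterwards.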


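Now let's try another approach: "chunking" our document into pieces of 1024 characters, summarizing each one, and then combining the results.  Because a token usually spans several characters, a 1024-character chunk fits comfortably within the model's 1024-token input limit.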

In [28]:
def chunked_summary(input_text, min_chunk_len, max_chunk_len):
    # Separate the input text into fixed-size chunks of 1024 characters
    # (a chunk boundary may fall in the middle of a sentence or word)
    chunked_inputs = [input_text[i:i+1024] for i in range(0, len(input_text), 1024)]
    summary = ''
    # Summarize each chunk and strip the special tokens from the output
    for chunk in chunked_inputs:
        chunk_summary = truncate_summary(chunk, min_chunk_len, max_chunk_len)
        chunk_summary = chunk_summary.split('</s>')[-2].split('<s>')[-1].strip()
        summary += (' ' + chunk_summary)
    return summary
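
Fixed 1024-character slices can cut a sentence in half, which leaves the model with fragments at the start and end of each chunk. The sketch below shows one possible alternative that packs whole sentences into each chunk. The function name `chunk_by_sentence`, its parameters, and the regex-based sentence split are all assumptions for illustration; the split is naive and will be confused by abbreviations or citation markers such as `[13]`.

```python
import re


def chunk_by_sentence(text, max_len=1024, length_fn=len):
    """Greedily pack whole sentences into chunks of at most max_len units.

    length_fn measures a string; the default counts characters.
    A single sentence longer than max_len becomes its own oversized chunk.
    """
    # Naive split: break after '.', '!' or '?' when followed by whitespace
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if current and length_fn(candidate) > max_len:
            # Adding this sentence would exceed the budget: start a new chunk
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


sample = "Random forests are ensembles. They combine many trees. Each tree votes."
for chunk in chunk_by_sentence(sample, max_len=60):
    print(repr(chunk))
```

Each resulting chunk could be passed to `truncate_summary` in place of the character slices. To budget by tokens rather than characters, one option is to pass `length_fn=lambda s: len(tokenizer(s)["input_ids"])`, which counts the token IDs the tokenizer produces (including its special tokens).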


In [30]:
# Set desired min and max length for summary
min_length = 25
max_length = 100
# Generate summary
summary = chunked_summary(article_text,min_length,max_length)

print('Length of the source document: {}'.format(len(article_text)))
print('Length of the summary: {}'.format(len(summary)))
print('Summary: ')
print(summary)

Length of the source document: 24552
Length of the summary: 6299
Summary: 
 Random forests are an ensemble learning method for classification, regression and other tasks that operates by constructing a multitude of decision trees at training time. For classification tasks, the output of the random forest is the class selected by most trees. The first algorithm for random decision forests was created in 1995 by Tin Kam Ho using the random subspace method. Random forests are frequently used as "blackbox" models in businesses. They generate reasonable predictions across a wide range of data while requiring little configuration. The general method of random decision forests was first proposed by Ho in 1995. Breiman's notion of random forests was influenced by the work of Amit and Geman[13] who introduced the idea of searching over a random subset of the available decisions when splitting a node. In this method a forest of trees is grown, and variation among the trees is introduced by proje