<a href='https://ai.meng.duke.edu'> = <img align="left" style="padding-top:10px;" src=https://storage.googleapis.com/aipi_datasets/Duke-AIPI-Logo.png>

# Extractive Text Summarization
Extractive methods of text summarization try to summarize a document by selecting a subset of sentences which retain the most important points in the document.  In this notebook we will apply the [TextRank method](https://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf) to extract the most important sentences in the document as the summary.  We build a graph from the sentences in the document and use the [PageRank algorithm](https://en.wikipedia.org/wiki/PageRank) to select the most central sentences in the document, which should also be the most important sentences in the document.  To create the graph from the document, we calculate the cosine similarity of each sentence with every other in the document and create a similarity matrix.  The similarity represents the weight of the edge between every pair of sentences.  The intuition is that the sentences which are most "connected" to the maximum number of other sentences in the document should be the most important.

In order to calculate the similarity between sentences in a document, we need to create vectors representing each sentence. There are many ways we can do this - in this notebook we will demonstrate text summarization using Count Vectorization and TF-IDF Vectorization to create the text feature vectors.

**Notes:**  
- This does not need to be run on GPU for smaller documents such as articles

**References:**  
- Read the [TextRank paper](https://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf) by Mihalcea and Tarau

In [16]:
from bs4 import BeautifulSoup
import nltk
import numpy as np
import requests

from nltk import sent_tokenize
from nltk.corpus import stopwords
from nltk.cluster.util import cosine_distance
import numpy as np
import networkx as nx

from sklearn.feature_extraction.text import TfidfVectorizer

## Get document to summarize
We will use BeautifulSoup to get the content of an article on the web and strip the text content from the hmtl.

In [27]:
# Get article
url = 'https://en.wikipedia.org/wiki/Linear_regression'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')

# Extract body text from article
bodytext = soup.find_all('p')
bodytext = [i.text for i in bodytext]
article_text = ' '.join(bodytext)

## Pre-process text
We will use some simple pre-processing on our document text:  
- Split the text into sentences
- Remove non-alphanumeric characters and stopwords from each sentence  
- Separate sentences into lists of lower-case words

In [28]:
def get_sentences(document):
    sentences = sent_tokenize(document)
    return sentences

In [29]:
def preprocess(sents):
    sentences_processed = []
    for sentence in sents:
        sentence_reduced = sentence.replace("[^a-zA-Z0-9_]", '')
        sentence_reduced = [word.lower() for word in sentence_reduced.split(' ') if word.lower() not in stopwords.words('english')]
        sentences_processed.append(' '.join(word for word in sentence_reduced))
    return sentences_processed

## Create features using word counts or TFIDF
Now we are ready to create our features.  We will first use simple word counts to create a numeric feature vector for each sentence.  Rather than using the Scikit-learn convenience method, we'll do this from scratch to demonstrate it.

In [30]:
def vectorize(sentences, vectorizer_type='count'):
    if vectorizer_type == 'count':
        # Get vocabulary for entire document
        sentences = [sent.split(' ') for sent in sentences]
        all_words = list(set([word for s in sentences for word in s]))

        # Create feature vector for each sentence
        feature_vecs = []
        for sentence in sentences:
            feature_vec = [0] * len(all_words)
            for word in sentence:
                feature_vec[all_words.index(word)] += 1
            feature_vecs.append(feature_vec)
    else:
        vectorizer = TfidfVectorizer()
        feature_vecs = vectorizer.fit_transform(sentences)
        feature_vecs = feature_vecs.todense().tolist()
        
    return feature_vecs

## Create graph representing document
We will now convert our document, represented by sentence feature vectors, into a graph representing the document.  The nodes of the graph are the sentences, and the edges connecting the nodes represent the similarity of each sentence to every other.  To generate the graph we will create an adjacency matrix which stores the similarity values between every pair of sentences in the document.

In [31]:
def generate_adjacency_matrix(feature_vecs):
    # Create empty adjacency matrix
    adjacency_matrix = np.zeros((len(feature_vecs), len(feature_vecs)))
 
    # Populate the adjacency matrix using the similarity of all pairs of sentences
    for i in range(len(feature_vecs)):
        for j in range(len(feature_vecs)):
            if i == j: #ignore if both are the same sentence
                continue 
            adjacency_matrix[i][j] = 1 - cosine_distance(feature_vecs[1], feature_vecs[j])
    
    return adjacency_matrix

## Apply PageRank to get most important sentences
Now that we have generated a graph representing the document, we can apply the PageRank algorithm to identify the most important sentences in the document as the most central nodes in the graph.

In [32]:
def summarize(sentences,adjacency_matrix,top_n):

    # Create the graph representing the document
    document_graph = nx.from_numpy_array(adjacency_matrix)

    # Apply PageRank algorithm to get centrality scores for each node/sentence
    scores = nx.pagerank(document_graph)
    scores_list = list(scores.values())

    # Sort and pick top sentences
    ranking_idx = np.argsort(scores_list)[::-1]
    ranked_sentences = [sentences[i] for i in ranking_idx]   

    summary = []
    for i in range(top_n):
        summary.append(ranked_sentences[i])

    summary = " ".join(summary)

    return summary

## Run the summarizer
We've created all the components we need, now let's try it out on our example document.

In [34]:
sentences_extracted = get_sentences(article_text)
sentences_processed = preprocess(sentences_extracted)
# Can set vectorizer_type = 'count' or 'tfidf' to change vectorizer type
feature_vecs = vectorize(sentences_processed,vectorizer_type='count')
adjacency_matrix = generate_adjacency_matrix(feature_vecs)
summary = summarize(sentences_extracted,adjacency_matrix,top_n=5)
print(summary)

The case of one explanatory variable is called simple linear regression; for more than one, the process is called multiple linear regression. Such models are called linear models. Multiple linear regression is a generalization of simple linear regression to the case of more than one independent variable, and a special case of general linear models, restricted to one dependent variable. These are not the same as multivariable linear models (also called "multiple linear models"). "General linear models" are also called "multivariate linear models".
