# Text Summarization

In [188]:
from bs4 import BeautifulSoup
import nltk
from sentence_transformers import SentenceTransformer, util
import numpy as np
import requests
from transformers import pipeline

from nltk.corpus import stopwords
from nltk.cluster.util import cosine_distance
import networkx as nx

from sklearn.feature_extraction.text import TfidfVectorizer

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

In [34]:
# Summarize saved article
# with open("rf_article.txt") as f:
#     f_list = f.readlines()
# document = [i.strip() for i in f_list]
# document = ' '.join(document)

In [180]:
# Summarize web article
# url = 'https://towardsdatascience.com/understanding-random-forest-58381e0602d2'
url = 'https://towardsdatascience.com/a-one-stop-shop-for-principal-component-analysis-5582fb7e0a9c'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')

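# Use each <h1> heading as a section boundary in the page text
article_text = soup.text
h1s = [h1.text for h1 in soup.find_all('h1')]

# Mark the first occurrence of each heading so the text can be split on it
for title in h1s:
    article_text = article_text.replace(title, 'SECTION_BREAK', 1)

# Drop the page text that precedes the first heading
article_sections = article_text.split('SECTION_BREAK')[1:]

# Map each heading to the text that follows it
sections = dict(zip(h1s, article_sections))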
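
The heading-based split above can be tried on a toy string without fetching a page. This is a minimal sketch; `split_by_headings` and `toy_page` are illustrative names that do not appear elsewhere in the notebook, and the sketch assumes the headings are listed in the order they occur in the text.

```python
def split_by_headings(text, headings, marker="SECTION_BREAK"):
    """Map each heading to the text between it and the next heading."""
    for heading in headings:
        # Replace only the first occurrence so later mentions in the body survive.
        text = text.replace(heading, marker, 1)
    # The piece before the first marker precedes every heading, so drop it.
    bodies = text.split(marker)[1:]
    return dict(zip(headings, bodies))


toy_page = "Site menu Intro Welcome to the post. Method We rank sentences."
toy_sections = split_by_headings(toy_page, ["Intro", "Method"])
for heading, body in toy_sections.items():
    print(heading, "->", body.strip())
```

One caveat of this approach: if a heading's text also occurs earlier in the page (for example in a navigation menu), the first occurrence is the one that gets replaced, so the section boundaries can shift.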

## Summarize using SentenceTransformer and LexRank

In [182]:
"""
This example uses LexRank (https://www.aaai.org/Papers/JAIR/Vol22/JAIR-2214.pdf)
to create an extractive summarization of a long document.
The document is split into sentences using NLTK, and an embedding is computed for each sentence. We
then compute the cosine similarity across all possible sentence pairs
and use LexRank to find the most central sentences in the document, which form our summary.
Input document: First section from the English Wikipedia Section
Output summary:
Located at the southern tip of the U.S. state of New York, the city is the center of the New York metropolitan area, the largest metropolitan area in the world by urban landmass.
New York City (NYC), often called simply New York, is the most populous city in the United States.
Anchored by Wall Street in the Financial District of Lower Manhattan, New York City has been called both the world's leading financial center and the most financially powerful city in the world, and is home to the world's two largest stock exchanges by total market capitalization, the New York Stock Exchange and NASDAQ.
New York City has been described as the cultural, financial, and media capital of the world, significantly influencing commerce, entertainment, research, technology, education, politics, tourism, art, fashion, and sports.
If the New York metropolitan area were a sovereign state, it would have the eighth-largest economy in the world.
"""


#from lexrank import degree_centrality_scores
import sys
sys.path.append("../lexrank/")
from lexrank.lexrank import degree_centrality_scores

model = SentenceTransformer('all-MiniLM-L6-v2')
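
To make the idea of "most central sentence" concrete, here is a simplified sketch of a continuous LexRank-style score: row-normalise the similarity matrix into transition probabilities and run power iteration until the scores settle. It is not the `lexrank` library's code, which may differ in details such as scaling. `centrality_scores` and `toy_similarity` are hypothetical names, and the sketch assumes all similarities are non-negative with non-zero row sums.

```python
def centrality_scores(similarity, iterations=100):
    """Power iteration on a row-normalised similarity matrix."""
    n = len(similarity)
    # Each row becomes a probability distribution over the other sentences.
    transition = [[value / sum(row) for value in row] for row in similarity]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        # A sentence's new score is the mass flowing into it from all sentences.
        scores = [
            sum(scores[i] * transition[i][j] for i in range(n))
            for j in range(n)
        ]
    return scores


# Sentence 1 is similar to both of the others, so it should rank first.
toy_similarity = [
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.7],
    [0.1, 0.7, 1.0],
]
toy_scores = centrality_scores(toy_similarity)
toy_ranking = sorted(range(len(toy_scores)), key=lambda i: -toy_scores[i])
print(toy_ranking)
```

For a symmetric matrix like this one, the scores converge to values proportional to the row sums, which is why the sentence with the most total similarity comes out on top.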

In [183]:
def summarize(document,model):
    #Split the document into sentences
    sentences = nltk.sent_tokenize(document)
    print("Num sentences:", len(sentences))

    #Compute the sentence embeddings
    embeddings = model.encode(sentences, convert_to_tensor=True)

    #Compute the pair-wise cosine similarities (move to CPU before converting to NumPy)
    cos_scores = util.cos_sim(embeddings, embeddings).cpu().numpy()

    #Compute the centrality for each sentence
    centrality_scores = degree_centrality_scores(cos_scores, threshold=None)

    #We argsort so that the first element is the sentence with the highest score
    most_central_sentence_indices = np.argsort(-centrality_scores)
    
    return [sentences[idx].strip() for idx in most_central_sentence_indices[0:2]]

for section in sections.keys():
    if len(sections[section])>0:
        print(section)
        top_sents = summarize(sections[section],model)
        for sent in top_sents:
            print(sent)
        print()

A One-Stop Shop for Principal Component Analysis
Num sentences: 8
It’s safe to say that I’m not “entirely satisfied with the available texts” here.As a result, I wanted to put together the “What,” “When,” “How,” and “Why” of PCA as well as links to some of the resources that can help to further explain this topic.
Being familiar with some or all of the following will make this article and PCA as a method easier to understand: matrix operations/linear algebra (matrix multiplication, matrix transposition, matrix inverses, matrix decomposition, eigenvectors/eigenvalues) and statistics/machine learning (standardization, variance, covariance, independence, linear regression, feature selection).

What is PCA?
Num sentences: 23
In the GDP example above, instead of considering every single variable, we might drop all variables except the three we think will best predict what the U.S.’s gross domestic product will look like.
But — and here’s the kicker — because these new independent variables 

## Summarize using word counts and NetworkX PageRank
Source: https://towardsdatascience.com/understand-text-summarization-and-create-your-own-summarizer-in-python-b26a9f09fc70

In [170]:
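def preprocess(input_text):
    # Naive sentence split on ". "; each sentence becomes a list of words
    article = input_text.split(". ")
    sentences = []
    for sentence in article:
        # NOTE: str.replace matches this pattern literally, not as a regex,
        # so the call is effectively a no-op (re.sub would be needed to strip non-letters)
        sentences.append(sentence.replace("[^a-zA-Z]", " ").split(" "))
    return sentences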

def sentence_similarity(sent1, sent2, stopwords=None):
    if stopwords is None:
        stopwords = []
 
    sent1 = [w.lower() for w in sent1]
    sent2 = [w.lower() for w in sent2]
 
    all_words = list(set(sent1 + sent2))
 
    vector1 = [0] * len(all_words)
    vector2 = [0] * len(all_words)
 
    # build the vector for the first sentence
    for w in sent1:
        if w in stopwords:
            continue
        vector1[all_words.index(w)] += 1
 
    # build the vector for the second sentence
    for w in sent2:
        if w in stopwords:
            continue
        vector2[all_words.index(w)] += 1
 
    return 1 - cosine_distance(vector1, vector2)
 
def build_similarity_matrix(sentences, stop_words):
    # Create an empty similarity matrix
    similarity_matrix = np.zeros((len(sentences), len(sentences)))
 
    for idx1 in range(len(sentences)):
        for idx2 in range(len(sentences)):
            if idx1 == idx2: #ignore if both are same sentences
                continue 
            similarity_matrix[idx1][idx2] = sentence_similarity(sentences[idx1], sentences[idx2], stop_words)

    return similarity_matrix


def generate_summary(input_text, top_n=5):
    stop_words = stopwords.words('english')
    summarize_text = []

    # Step 1 - Read text and split it
    #sentences =  read_article(file_name)
    sentences = preprocess(input_text)

    # Step 2 - Generate similarity matrix across sentences
    sentence_similarity_matrix = build_similarity_matrix(sentences, stop_words)

    # Step 3 - Rank sentences in similarity matrix
    sentence_similarity_graph = nx.from_numpy_array(sentence_similarity_matrix)
    scores = nx.pagerank(sentence_similarity_graph)
    scores_list = list(scores.values())

    # Step 4 - Sort the rank and pick top sentences
    ranking_idx = np.argsort(scores_list)[::-1]
    ranked_sentence = [sentences[i] for i in ranking_idx]

    # Guard against top_n exceeding the number of sentences
    for i in range(min(top_n, len(ranked_sentence))):
        summarize_text.append(" ".join(ranked_sentence[i]))

    # Step 5 - Of course, output the summarized text
    #print("Summarize Text: \n", ". ".join(summarize_text))
    summary = ". ".join(summarize_text)
    return summary
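
The similarity used in this approach is the cosine between raw word-count vectors. Below is a minimal standard-library sketch of that measure; `count_cosine` is a hypothetical helper, not part of the notebook. Unlike the NLTK-based version above, it returns 0.0 when a sentence has no words left after stop-word removal instead of dividing by zero.

```python
import math
from collections import Counter


def count_cosine(words_a, words_b, stop_words=frozenset()):
    """Cosine similarity between the word-count vectors of two sentences."""
    counts_a = Counter(w.lower() for w in words_a if w.lower() not in stop_words)
    counts_b = Counter(w.lower() for w in words_b if w.lower() not in stop_words)
    # Only words present in both sentences contribute to the dot product.
    dot = sum(count * counts_b[word] for word, count in counts_a.items())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: treat an empty sentence as sharing nothing
    return dot / (norm_a * norm_b)


sent_a = "the cat sat".split()
sent_b = "the cat ran".split()
with_stop_words = count_cosine(sent_a, sent_b)
without_stop_words = count_cosine(sent_a, sent_b, stop_words={"the"})
print(with_stop_words, without_stop_words)
```

The toy pair shows why stop words are filtered: the shared "the" inflates the similarity even though it says nothing about the content.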


In [171]:
for section in sections.keys():
    print(section)
    num_sents = len(sections[section].split('. '))
    print(num_sents)
    if num_sents>0:
        if num_sents>3:
            top_n=2
        else:
            top_n = num_sents-1
        top_sents = generate_summary(sections[section],top_n=top_n)
        print(top_sents)
        print()

A One-Stop Shop for Principal Component Analysis
8
It’s safe to say that I’m not “entirely satisfied with the available texts” here.As a result, I wanted to put together the “What,” “When,” “How,” and “Why” of PCA as well as links to some of the resources that can help to further explain this topic. You are writing a book because you are not entirely satisfied with the available texts.”I apply the authors’ logic here

What is PCA?
23
Say we have ten independent variables. However, we create these new independent variables in a specific way and order these new variables by how well they predict our dependent variable.You might say, “Where does the dimensionality reduction come into play?” Well, we keep as many of the new independent variables as we want, but we drop the “least important ones.” Because we ordered the new variables by how well they predict our dependent variable, we know which variable is the most important and least important

When should I use PCA?
2
If you answered “no
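
The ranking step in `generate_summary` hands the similarity matrix to `nx.pagerank`. As a rough illustration of what that computes, here is a plain-Python PageRank over a weighted matrix. It is a sketch, not NetworkX's implementation: it assumes a damping factor of 0.85, runs a fixed number of iterations rather than checking a tolerance, and `pagerank_sketch` and `toy_weights` are hypothetical names.

```python
def pagerank_sketch(weights, damping=0.85, iterations=100):
    """Weighted PageRank by power iteration on a square weight matrix."""
    n = len(weights)
    row_sums = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        # Sentences with no links spread their score evenly over all sentences.
        dangling = sum(scores[i] for i in range(n) if row_sums[i] == 0)
        new_scores = []
        for j in range(n):
            incoming = sum(
                scores[i] * weights[i][j] / row_sums[i]
                for i in range(n)
                if row_sums[i] > 0
            )
            new_scores.append((1 - damping) / n + damping * (incoming + dangling / n))
        scores = new_scores
    return scores


# Sentence 0 is linked to both others; sentences 1 and 2 only link back to it.
toy_weights = [
    [0.0, 0.5, 0.5],
    [0.5, 0.0, 0.0],
    [0.5, 0.0, 0.0],
]
toy_ranks = pagerank_sketch(toy_weights)
print(toy_ranks)
```

The damping term gives every sentence a small baseline score, which keeps the iteration stable even when the graph has isolated or weakly connected sentences.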

## Summarize using TF-IDF and NetworkX PageRank

In [186]:
def preprocess(input_text,stopwords):
    # Split into sentences with NLTK; the stopwords argument is unused here
    # (stop words are removed later, in featurize)
    sentences = nltk.sent_tokenize(input_text)
    return sentences

def featurize(input_text,stopwords):
    text_minus_stopwords = []
    for sentence in input_text:
        sentence_reduced = [word for word in sentence.split(' ') if word not in stopwords]
        sentence_reduced = ' '.join(sentence_reduced)
        text_minus_stopwords.append(sentence_reduced)
    vectorizer = TfidfVectorizer()
    featurized_text = vectorizer.fit_transform(text_minus_stopwords)
    return featurized_text

def sentence_similarity(featurized_text,idx1, idx2):
    vector1 = featurized_text[idx1,:].todense().tolist()[0]
    vector2 = featurized_text[idx2,:].todense().tolist()[0]
    return 1 - cosine_distance(vector1, vector2)
 
def build_similarity_matrix(featurized_text):
    # Create an empty similarity matrix
    dim = featurized_text.shape[0]
    similarity_matrix = np.zeros((dim,dim))
 
    for idx1 in range(dim):
        for idx2 in range(dim):
            if idx1 == idx2: #ignore if both are same sentences
                continue 
            similarity_matrix[idx1][idx2] = sentence_similarity(featurized_text, idx1, idx2)
            
    return similarity_matrix


def generate_summary(input_text, top_n=5):
    stop_words = stopwords.words('english')
    summarize_text = []

    # Preprocess text and featurize using TFIDF
    sentences = preprocess(input_text,stop_words)
    featurized_text = featurize(sentences,stop_words)

    # Generate similarity matrix
    sentence_similarity_matrix = build_similarity_matrix(featurized_text)

    # Rank sentences in similarity matrix
    sentence_similarity_graph = nx.from_numpy_array(sentence_similarity_matrix)
    scores = nx.pagerank(sentence_similarity_graph)
    scores_list = list(scores.values())

    # Sort the scores and pick top sentences
    ranking_idx = np.argsort(scores_list)[::-1]
    ranked_sentences = [sentences[i] for i in ranking_idx[:top_n]]   
    summary = " ".join(ranked_sentences)
    
    return summary
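
To show how TF-IDF changes the sentence vectors compared with raw counts, here is a small standard-library sketch. It mirrors the smoothed IDF and L2 normalisation that `TfidfVectorizer` applies by default, but it skips that class's tokenisation rules and works on pre-split word lists. `tfidf_vectors`, `dot`, and `toy_sentences` are hypothetical names.

```python
import math
from collections import Counter


def tfidf_vectors(tokenized_sentences):
    """Return (vocabulary, L2-normalised TF-IDF vectors) for tokenised sentences."""
    n = len(tokenized_sentences)
    vocabulary = sorted({w for sentence in tokenized_sentences for w in sentence})
    # Document frequency: in how many sentences each word appears.
    doc_freq = Counter(w for sentence in tokenized_sentences for w in set(sentence))
    # Smoothed IDF: words found in many sentences get a lower weight.
    idf = {w: math.log((1 + n) / (1 + doc_freq[w])) + 1 for w in vocabulary}
    vectors = []
    for sentence in tokenized_sentences:
        counts = Counter(sentence)
        raw = [counts[w] * idf[w] for w in vocabulary]
        norm = math.sqrt(sum(v * v for v in raw))
        vectors.append([v / norm if norm else 0.0 for v in raw])
    return vocabulary, vectors


def dot(u, v):
    # For unit-length vectors the dot product equals the cosine similarity.
    return sum(a * b for a, b in zip(u, v))


toy_sentences = [
    ["pca", "reduces", "dimensions"],
    ["pca", "finds", "components"],
    ["forests", "average", "trees"],
]
vocab, vecs = tfidf_vectors(toy_sentences)
print(dot(vecs[0], vecs[1]), dot(vecs[0], vecs[2]))
```

In the toy data, "pca" appears in two sentences and therefore weighs less than "reduces" within the first sentence, so shared common words pull sentences together less strongly than they would with raw counts.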



In [187]:
for section in sections.keys():
    print(section)
    num_sents = len(sections[section].split('. '))
    print(num_sents)
    if num_sents>0:
        if num_sents>3:
            top_n=2
        else:
            top_n = num_sents-1
        top_sents = generate_summary(sections[section],top_n=top_n)
        print(top_sents)
        print()

A One-Stop Shop for Principal Component Analysis
8
It’s safe to say that I’m not “entirely satisfied with the available texts” here.As a result, I wanted to put together the “What,” “When,” “How,” and “Why” of PCA as well as links to some of the resources that can help to further explain this topic. You are writing a book because you are not entirely satisfied with the available texts.”I apply the authors’ logic here.

What is PCA?
23
In feature extraction, we create ten “new” independent variables, where each “new” independent variable is a combination of each of the ten “old” independent variables. However, we create these new independent variables in a specific way and order these new variables by how well they predict our dependent variable.You might say, “Where does the dimensionality reduction come into play?” Well, we keep as many of the new independent variables as we want, but we drop the “least important ones.” Because we ordered the new variables by how well they predict our

## Summarize using Transformers

In [223]:
model = AutoModelForSeq2SeqLM.from_pretrained("sshleifer/distilbart-cnn-12-6")
tokenizer = AutoTokenizer.from_pretrained("sshleifer/distilbart-cnn-12-6")

def summarize_bart(input_text):
#     if len(input_text)>1024:
#         input_text = input_text[:1024]
#     summarizer = pipeline("summarization")
#     summarized = summarizer(input_text, min_length=50, max_length=400)
#     return summarized[0]['summary_text']
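    inputs = tokenizer(input_text, return_tensors="pt", max_length=1024, truncation=True)
    # Inputs longer than 1024 tokens are truncated, so only the start of a long section is summarized
    outputs = model.generate(
        inputs["input_ids"], max_length=300, min_length=50, length_penalty=1.0, num_beams=4, early_stopping=True)
    # decode() keeps special tokens such as <s> and </s>; pass skip_special_tokens=True to drop them
    return tokenizer.decode(outputs[0])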
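
Because the tokenizer call truncates at 1024 tokens, a long section is only summarized from its opening part. One possible workaround, sketched here as an assumption rather than something this notebook does, is to group sentences into chunks and summarize each chunk separately. `chunk_sentences` and `max_words` are hypothetical, and a word count is only a crude proxy for the model's subword token count, so the budget should be set conservatively.

```python
def chunk_sentences(sentences, max_words=400):
    """Greedily pack whole sentences into chunks of at most max_words words."""
    chunks, current, current_len = [], [], 0
    for sentence in sentences:
        n_words = len(sentence.split())
        # Start a new chunk when adding this sentence would exceed the budget.
        if current and current_len + n_words > max_words:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n_words
    if current:
        chunks.append(" ".join(current))
    return chunks


toy_sentences = ["one two three.", "four five.", "six seven eight nine."]
toy_chunks = chunk_sentences(toy_sentences, max_words=5)
print(toy_chunks)
```

Each chunk could then be passed to `summarize_bart` in turn. A single sentence longer than the budget still ends up in its own oversized chunk, so truncation can still occur in that case.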

In [224]:
for section in sections.keys():
    print(section)
    num_sents = len(sections[section].split('. '))
    print(num_sents)
    top_sents = summarize_bart(sections[section])
    print(top_sents)
    print()

A One-Stop Shop for Principal Component Analysis
8
</s><s> Principal component analysis (PCA) is an important technique to understand in the fields of statistics and data science. The algorithm we’ll cover is pretty technical. Being familiar with some or all of the following will make this article and PCA as a method easier to understand.</s>

What is PCA?
23
</s><s> You have a lot of variables to consider when trying to predict what the U.S. GDP will look like in 2017. You might ask, “How do I take all of the variables I’ve collected and focus on only a few of them? In technical terms, you want to “reduce the dimension of your feature space”</s>

When should I use PCA?
2
</s><s> Do you want to reduce the number of variables, but aren’t able to identify variables to completely remove from consideration? Are you comfortable making your independent variables less interpretable? If you answered “yes” to all three questions, then PCA is a good method to use.</s>

How does PCA work?
51
</s>