In [None]:
# -*- coding: utf-8 -*-
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
# implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

### Install libraries

**IMPORTANT:** Currently, BERT in this notebook does not work well on Mac M1 and M2 hardware architectures.

In [None]:
# !pip install spacy
# !python -m spacy download en_core_web_sm
# !pip install bert-extractive-summarizer

# Text summarization 

**Sources:**
- https://medium.com/luisfredgs/automatic-text-summarization-with-machine-learning-an-overview-68ded5717a25
- https://www.analyticsvidhya.com/blog/2019/06/comprehensive-guide-text-summarization-using-deep-learning-python/
- https://blog.floydhub.com/gentle-introduction-to-text-summarization-in-machine-learning/
- https://iq.opengenus.org/bert-for-text-summarization/

Text summarization refers to the technique of condensing a lengthy text document into a short and well-written summary that captures the essential information and main ideas of the original text.
This process is achieved by highlighting the significant points of the document.

There are **two different approaches** used for text summarization:
- Extractive Summarization
- Abstractive Summarization

# Extractive Summarization

In extraction-based summarization, a subset of words that represent the most important points is pulled from a piece of text and combined to make a summary.
In machine learning, extractive summarization usually involves weighing the essential sections of sentences and using the results to generate summaries.

Summarizing the text consists of the following steps: 

1. **Preprocessing:**

    Tokenization: Break the text into individual words or phrases (Simple White-Space Tokenization, Regular Expression-Based Tokenization). 

    Stop Words Removal: Eliminate common words that do not carry significant meaning (using stop-word lists from libraries such as NLTK and spaCy).

    Lemmatization or Stemming: Reduce words to their base form to normalize the text (Porter stemming algorithm, Lancaster stemming algorithm, or lemmatization techniques provided by libraries like NLTK or spaCy).

2. **Sentence Scoring:**

    Sentence Importance Calculation: Assign scores to sentences based on different features such as word frequency, sentence length, and position in the document.

    Use of Statistical Methods: Apply statistical techniques like TF-IDF (Term Frequency-Inverse Document Frequency) to measure the importance of each word in the context of the entire document.

3. **Sentence Ranking:**

    Rank sentences based on their calculated importance scores.

    Identify the top-ranked sentences as potential candidates for the summary.

4. **Summary Generation:**

    Select Top Sentences: Choose the top-ranked sentences based on the predetermined summary length or the desired compression ratio.

    Arrange the Selected Sentences: Organize the selected sentences in a coherent manner to ensure the flow and coherence of the summary.

    Optional Post-Processing: Perform additional linguistic processing to improve the grammatical structure and overall readability of the summary.

5. **Output:**

    Generate the final extractive summary by combining the selected sentences.

    Present the summary in a readable format that effectively captures the key points of the source text.
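The scoring, ranking, and selection steps above can be sketched with TF-IDF in plain Python. This is a minimal illustration, not a reference implementation: the function names, the tiny stop-word list, the naive sentence splitter, and the choice to treat each sentence as a "document" when computing IDF are all assumptions made for this sketch.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real pipelines use NLTK or spaCy lists).
STOP = {"a", "an", "and", "the", "is", "are", "of", "to", "in", "on", "it", "that"}

def split_sentences(text):
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokenize(sentence):
    # Regular-expression tokenization, lowercased, with stop words removed.
    return [t for t in re.findall(r"[a-z0-9']+", sentence.lower()) if t not in STOP]

def tfidf_summary(text, num_sentences=2):
    sentences = split_sentences(text)
    tokenized = [tokenize(s) for s in sentences]
    n = len(sentences)
    # Document frequency: number of sentences in which each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term, count in tf.items():
            # Smoothed IDF, so terms that occur in every sentence still contribute.
            idf = math.log((1 + n) / (1 + df[term])) + 1
            score += (count / len(tokens)) * idf
        scores.append(score)
    # Rank sentences by score, keep the top ones, then restore document order.
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    chosen = sorted(ranked[:num_sentences])
    return " ".join(sentences[i] for i in chosen)
```

Because the score is normalized by sentence length, it favors sentences dense in rare terms rather than simply long ones; whether that is desirable depends on the text being summarized.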

### Example of extractive text summarization

In [None]:
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from heapq import nlargest

# Load the SpaCy English model
nlp = spacy.load("en_core_web_sm")

def summarize_text(text, num_sentences=3):

    # Phase 1: Preprocessing
    doc = nlp(text)

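    # Build word frequencies (lowercased so they match the lookup in Phase 2),
    # skipping stop words, punctuation and whitespace tokens
    word_frequencies = {}
    for word in doc:
        token = word.text.lower()
        if token in STOP_WORDS or word.is_punct or word.is_space:
            continue
        word_frequencies[token] = word_frequencies.get(token, 0) + 1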

    # Get the highest word frequency (default avoids an error on empty input)
    max_frequency = max(word_frequencies.values(), default=1)

    # Normalize the frequencies
    for word in word_frequencies.keys():
        word_frequencies[word] = word_frequencies[word] / max_frequency

    # Phase 2: Sentence Scoring
    sentence_tokens = [sent for sent in doc.sents]
    sentence_scores = {}
    for sent in sentence_tokens:
        for word in sent:
            if word.text.lower() in word_frequencies.keys():
                if sent not in sentence_scores.keys():
                    sentence_scores[sent] = word_frequencies[word.text.lower()]
                else:
                    sentence_scores[sent] += word_frequencies[word.text.lower()]

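    # Phase 3: Sentence Ranking
    summarized_sentences = nlargest(num_sentences, sentence_scores, key=sentence_scores.get)
    # Restore document order so the summary reads coherently
    summarized_sentences.sort(key=lambda sent: sent.start)
    final_sentences = [sent.text.strip() for sent in summarized_sentences]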

    # Phase 4: Summary Generation
    summary = ' '.join(final_sentences)
    return summary

def read_text_from_file(file_name):
    with open(file_name, 'r', encoding='utf-8') as file:
        text = file.read()
    return text

# Read input text from file
file_name = 'text_summarization.txt'
text = read_text_from_file(file_name)

# Generate and print the summary
summary = summarize_text(text)

# Phase 5: Output 
print("Text summarization:\n\n" + summary)

# Abstractive Summarization using pretrained model(s) - BERT

In abstractive summarization, advanced deep learning techniques are applied to paraphrase and shorten the original document, just like humans do.
Since abstractive machine learning algorithms can generate new phrases and sentences that represent the most important information from the source text, they can assist in overcoming the grammatical inaccuracies of the extraction techniques. 

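### Example of BERT-based text summarization

In this example, we use the pretrained BERT model through the `bert-extractive-summarizer` package installed above. Note that this package is itself extractive: it uses BERT sentence embeddings to select representative sentences from the source text rather than generating new ones.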

Utilizing BERT for text summarization involves fine-tuning the pre-trained model on a dataset specific to summarization tasks. This process leverages BERT's extensive knowledge base, acquired from pre-training on a vast corpus of text, to adapt its capabilities to the specific requirements of summarization. The result is a powerful tool that can efficiently process large documents, extract key points, and present them in a clear, concise manner, transforming the way information is consumed and comprehended in the digital age.

The "Bidirectional" part of BERT means that the model doesn't just look at the words before or after a given word to understand its meaning; it considers the entire sentence, or even multiple sentences, at once. This comprehensive view allows it to grasp the subtleties of language, such as how the meaning of a word can change based on the words around it.
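The difference between one-directional and bidirectional context can be pictured as an attention mask. The sketch below is a simplified illustration and not BERT's actual implementation; `attention_mask` is a hypothetical helper, and the token list is a made-up example.

```python
def attention_mask(num_tokens, bidirectional):
    # mask[i][j] == 1 means token i may use token j as context.
    # A left-to-right model only allows j <= i; a bidirectional encoder allows all j.
    return [[1 if bidirectional or j <= i else 0 for j in range(num_tokens)]
            for i in range(num_tokens)]

tokens = ["the", "bank", "of", "the", "river"]
causal = attention_mask(len(tokens), bidirectional=False)
full = attention_mask(len(tokens), bidirectional=True)

# Context visible to "bank" (index 1) under each scheme.
left_only = [t for t, m in zip(tokens, causal[1]) if m]
both_sides = [t for t, m in zip(tokens, full[1]) if m]
print(left_only)   # ['the', 'bank']
print(both_sides)  # ['the', 'bank', 'of', 'the', 'river']
```

With only the left context, "bank" is ambiguous; with the full sentence visible, "river" is available to disambiguate it.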

**Advantages**

1. Contextual Understanding: BERT's bidirectional nature allows it to understand the context of words in a sentence more effectively than many previous models, leading to more accurate and coherent summaries.
2. Pre-trained Model: Since BERT has been pre-trained on a vast corpus of text, it comes with a general understanding of language, which can significantly reduce the time and resources required for model training for specific summarization tasks.
3. Versatility: BERT can be fine-tuned with additional layers for a wide range of NLP tasks beyond summarization, such as question answering and sentiment analysis, making it a versatile tool in the NLP toolkit.
4. High Performance: BERT has demonstrated state-of-the-art performance on numerous NLP benchmarks, indicating its capability to produce high-quality text summaries.

**Disadvantages**

1. Resource Intensive: BERT's complexity and the size of its neural network make it computationally expensive, requiring significant hardware resources for training and inference, which might not be accessible to everyone.
2. Fine-tuning Challenges: While BERT can be fine-tuned for specific tasks, the process requires NLP expertise and can be time-consuming to optimize for best performance on text summarization specifically.
3. Overfitting Risk: Given its large parameter count, there's a risk of overfitting, especially when fine-tuning on smaller datasets. This could lead to less generalizable models that don't perform well on unseen data.
4. Handling of Long Documents: BERT has a maximum token limit (typically 512 tokens), which can be a limitation for summarizing longer documents directly, necessitating workarounds that may complicate the summarization process. 
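One common workaround for the token limit is to split a long document into chunks, summarize each chunk, and join the results. The sketch below shows only the chunking step. It is a rough illustration under stated assumptions: `chunk_by_token_budget` is a hypothetical helper, sentences are split naively on punctuation, and whitespace-separated words stand in for BERT's WordPiece tokens, which usually number more than the words, so a budget well below 512 is safer in practice.

```python
import re

def chunk_by_token_budget(text, max_tokens=512):
    """Greedily pack whole sentences into chunks of at most max_tokens tokens."""
    # Naive sentence split after '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current, current_len = [], [], 0
    for sentence in sentences:
        # Whitespace word count as a proxy for the real tokenizer (an assumption).
        n_tokens = len(sentence.split())
        # Start a new chunk when adding this sentence would exceed the budget.
        if current and current_len + n_tokens > max_tokens:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        # A single sentence longer than the budget becomes its own oversized chunk.
        current.append(sentence)
        current_len += n_tokens
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Each chunk could then be passed to the summarizer separately; how to merge the per-chunk summaries (simple concatenation, or a second summarization pass) is a design choice this sketch leaves open.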

### Running note
- the following code downloads more than 1.34 GB of model weights and metadata
- inference may take several minutes, depending on hardware

In [None]:
import warnings
# Suppress FutureWarnings, specifically those from sklearn
warnings.filterwarnings("ignore", category=FutureWarning, module="sklearn")

from summarizer import Summarizer

# Initialize the model
model = Summarizer()

In [None]:
# Read input text from file
file_name = 'text_summarization.txt'
text = read_text_from_file(file_name)

summary = model(text)
print("Text summarization:\n\n" + summary)