# **Business Objective**

Text summarization is an active research area within data science. While text summarization techniques have a history, recent years have seen significant progress in the fields of natural language processing and deep learning. Many technology companies are actively contributing to this research and publishing research papers. Salesforce, for example, has published significant papers demonstrating state-of-the-art abstractive summarization techniques. In May 2018, a substantial milestone was achieved with the release of a sizable summarization dataset, supported by a Google Research grant.

Despite the intense research activities, there is a lack of literature discussing practical applications of AI-driven summarization. Summarization is a complex task without a universal solution. Factors such as document length and content genre (e.g., technology, sports, finance, travel) significantly influence the approach to summarization. Summarizing a news article, for instance, is quite different from summarizing a financial earnings report. Consequently, the approach to summarization must be tailored to the specific use case.

---


# **BART Summarization Pre-Training Data Description: CNN/DM**

The CNN/DailyMail dataset, as introduced by Hermann et al. in 2015, comprises 93,000 articles from CNN and 220,000 articles from the Daily Mail newspapers. Both publications include concise bullet-point summaries alongside their articles. A non-anonymized variant of this dataset can be found in the work by See et al. in 2017.

To obtain this dataset, you can download and extract the stories directories for both CNN and the Daily Mail. The files are accessible for download through the terminal using the gdown tool, which can be installed with the command "pip install gdown."

---


In [1]:
# python version- 3.8.10 and 3.10.12(recent colab python version)
!pip install datasets==2.9.0
!pip install transformers==4.26.1
!pip install pytorch_lightning==1.9.1
!pip install torch==1.13.1+cu116
!pip install scikit-learn==1.0.2
!pip install pandas==1.3.5

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting datasets==2.9.0
  Downloading datasets-2.9.0-py3-none-any.whl (462 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m462.8/462.8 kB[0m [31m22.4 MB/s[0m eta [36m0:00:00[0m
Collecting dill<0.3.7 (from datasets==2.9.0)
  Downloading dill-0.3.6-py3-none-any.whl (110 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m110.5/110.5 kB[0m [31m15.7 MB/s[0m eta [36m0:00:00[0m
Collecting xxhash (from datasets==2.9.0)
  Downloading xxhash-3.2.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (212 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m212.5/212.5 kB[0m [31m24.4 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting multiprocess (from datasets==2.9.0)
  Downloading multiprocess-0.70.14-py310-none-any.whl (134 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m134.3/134.3 kB[0m [31m19.1 MB/s[0m eta [36

In [2]:
from datasets import load_dataset, list_datasets

datasets = list_datasets()

Permalink: https://huggingface.co/datasets/viewer/?dataset=cnn_dailymail&config=3.0.0



In [3]:
from pprint import pprint

print(f"🤩 Currently {len(datasets)} datasets are available on the hub:")
pprint(datasets, compact=True)

[1;30;43mStreaming output truncated to the last 5000 lines.[0m
 'joey234/mmlu-high_school_computer_science-rule-neg',
 'joey234/mmlu-high_school_european_history-rule-neg',
 'joey234/mmlu-high_school_geography-rule-neg',
 'joey234/mmlu-high_school_government_and_politics-rule-neg',
 'joey234/mmlu-high_school_macroeconomics-rule-neg',
 'joey234/mmlu-high_school_mathematics-rule-neg',
 'joey234/mmlu-high_school_microeconomics-rule-neg',
 'joey234/mmlu-high_school_physics-rule-neg',
 'joey234/mmlu-high_school_psychology-rule-neg',
 'joey234/mmlu-high_school_statistics-rule-neg',
 'joey234/mmlu-high_school_us_history-rule-neg',
 'joey234/mmlu-high_school_world_history-rule-neg',
 'joey234/mmlu-human_aging-rule-neg', 'joey234/mmlu-human_sexuality-rule-neg',
 'joey234/mmlu-international_law-rule-neg',
 'joey234/mmlu-jurisprudence-rule-neg',
 'joey234/mmlu-logical_fallacies-rule-neg',
 'joey234/mmlu-machine_learning-rule-neg', 'joey234/mmlu-management-rule-neg',
 'joey234/mmlu-marketing-rul

---

In [4]:
# Load a subset of the "cnn_dailymail" dataset (version 3.0.0) with the first 15 examples from the "train" split
dataset_ = load_dataset('cnn_dailymail', '3.0.0', split='train[:15]')


Downloading builder script:   0%|          | 0.00/8.33k [00:00<?, ?B/s]

Downloading metadata:   0%|          | 0.00/9.88k [00:00<?, ?B/s]

Downloading readme:   0%|          | 0.00/15.1k [00:00<?, ?B/s]

Downloading and preparing dataset cnn_dailymail/3.0.0 to /root/.cache/huggingface/datasets/cnn_dailymail/3.0.0/3.0.0/1b3c71476f6d152c31c1730e83ccb08bcf23e348233f4fcc11e182248e6bf7de...


Downloading data files:   0%|          | 0/5 [00:00<?, ?it/s]

Downloading data:   0%|          | 0.00/159M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/376M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/12.3M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/661k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/572k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/287113 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/13368 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/11490 [00:00<?, ? examples/s]

Dataset cnn_dailymail downloaded and prepared to /root/.cache/huggingface/datasets/cnn_dailymail/3.0.0/3.0.0/1b3c71476f6d152c31c1730e83ccb08bcf23e348233f4fcc11e182248e6bf7de. Subsequent calls will reuse this data.


In [5]:
print(dataset_)

Dataset({
    features: ['article', 'highlights', 'id'],
    num_rows: 15
})


In [6]:
print(f"👉Dataset len(dataset): {len(dataset_)}")
print("\n👉First item 'dataset[0]':")
pprint(dataset_[0])

👉Dataset len(dataset): 15

👉First item 'dataset[0]':
{'article': 'LONDON, England (Reuters) -- Harry Potter star Daniel Radcliffe '
            'gains access to a reported £20 million ($41.1 million) fortune as '
            "he turns 18 on Monday, but he insists the money won't cast a "
            'spell on him. Daniel Radcliffe as Harry Potter in "Harry Potter '
            'and the Order of the Phoenix" To the disappointment of gossip '
            'columnists around the world, the young actor says he has no plans '
            'to fritter his cash away on fast cars, drink and celebrity '
            'parties. "I don\'t plan to be one of those people who, as soon as '
            'they turn 18, suddenly buy themselves a massive sports car '
            'collection or something similar," he told an Australian '
            'interviewer earlier this month. "I don\'t think I\'ll be '
            'particularly extravagant. "The things I like buying are things '
            'that cost a

---

### **BART Fine-Tuning: Using Transformers**

In [7]:
# Importing librareis
import torch
from torch.nn import functional as F
from torch import nn
import pytorch_lightning as pl

from transformers import BartForConditionalGeneration, BartTokenizer
from sklearn.model_selection import train_test_split
import pandas as pd

from transformers import (
    AdamW,
    get_linear_schedule_with_warmup
)
from torch.utils.data import DataLoader

In [8]:
# Checking out the GPU we have access to. This is output is from the google colab version. 
!nvidia-smi

Fri Jun  9 04:17:39 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla V100-SXM2...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   33C    P0    25W / 300W |      0MiB / 16384MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

---

In [9]:
import torch
import pandas as pd
import pytorch_lightning as pl
from torch.utils.data import Dataset, DataLoader
from sklearn.model_selection import train_test_split

class Dataset(torch.utils.data.Dataset):
    """Custom dataset class for text summarization using PyTorch DataLoader.

    For more information about Dataset and DataLoader, see:
    https://pytorch.org/tutorials/beginner/data_loading_tutorial.html
    """

    def __init__(self, texts, summaries, tokenizer, source_len, summ_len):
        """
        Initialize the Dataset.

        Args:
            texts (list): List of input texts.
            summaries (list): List of target summaries.
            tokenizer: Tokenizer for text encoding.
            source_len (int): Maximum length for input text.
            summ_len (int): Maximum length for target summary.
        """
        self.texts = texts
        self.summaries = summaries
        self.tokenizer = tokenizer
        self.source_len = source_len
        self.summ_len = summ_len

    def __len__(self):
        """
        Get the number of samples in the dataset.
        """
        return len(self.summaries) - 1

    def __getitem__(self, index):
        """
        Get a single data sample from the dataset.

        Args:
            index (int): Index of the data sample to retrieve.

        Returns:
            Tuple containing:
            - source input IDs
            - source attention mask
            - target input IDs
            - target attention mask
        """
        text = ' '.join(str(self.texts[index]).split())
        summary = ' '.join(str(self.summaries[index]).split())

        # Article text pre-processing
        source = self.tokenizer.batch_encode_plus([text],
                                                  max_length=self.source_len,
                                                  pad_to_max_length=True,
                                                  return_tensors='pt')
        # Summary Target pre-processing
        target = self.tokenizer.batch_encode_plus([summary],
                                                  max_length=self.summ_len,
                                                  pad_to_max_length=True,
                                                  return_tensors='pt')

        return (
            source['input_ids'].squeeze(),
            source['attention_mask'].squeeze(),
            target['input_ids'].squeeze(),
            target['attention_mask'].squeeze()
        )

class BARTDataLoader(pl.LightningDataModule):
    '''Pytorch Lightning Model Dataloader class for BART'''

    def __init__(self, tokenizer, text_len, summarized_len, file_path,
                 corpus_size, columns_name, train_split_size, batch_size):
        """
        Initialize the BARTDataLoader.

        Args:
            tokenizer: Tokenizer for text encoding.
            text_len (int): Maximum length for input text.
            summarized_len (int): Maximum length for target summary.
            file_path (str): Path to the CSV data file.
            corpus_size (int): Number of rows to read from the CSV file.
            columns_name (list): List of column names to use.
            train_split_size (float): Size of the training split (e.g., 0.8 for 80%).
            batch_size (int): Batch size for data loading.
        """
        super().__init__()
        self.tokenizer = tokenizer
        self.text_len = text_len
        self.summarized_len = summarized_len
        self.input_text_length = summarized_len
        self.file_path = file_path
        self.nrows = corpus_size
        self.columns = columns_name
        self.train_split_size = train_split_size
        self.batch_size = batch_size

    def prepare_data(self):
        """
        Load and preprocess the data from the CSV file.
        """
        data = pd.read_csv(self.file_path, nrows=self.nrows, encoding='latin-1')
        data = data[self.columns]
        data.iloc[:, 1] = 'summarize: ' + data.iloc[:, 1]
        self.text = list(data.iloc[:, 0].values)
        self.summary = list(data.iloc[:, 1].values)

    def setup(self, stage=None):
        """
        Split the data into training and validation sets.

        Args:
            stage (str): The current stage ('fit' or 'test').
        """
        X_train, y_train, X_val, y_val = train_test_split(
            self.text, self.summary, train_size=self.train_split_size
        )

        self.train_dataset = (X_train, y_train)
        self.val_dataset = (X_val, y_val)

    def train_dataloader(self):
        """
        Create a DataLoader for the training dataset.
        """
        train_data = Dataset(texts=self.train_dataset[0],
                             summaries=self.train_dataset[1],
                             tokenizer=self.tokenizer,
                             source_len=self.text_len,
                             summ_len=self.summarized_len)
        return DataLoader(train_data, self.batch_size)

    def val_dataloader(self):
        """
        Create a DataLoader for the validation dataset.
        """
        val_dataset = Dataset(texts=self.val_dataset[0],
                              summaries=self.val_dataset[1],
                              tokenizer=self.tokenizer,
                              source_len=self.text_len,
                              summ_len=self.summarized_len)
        return DataLoader(val_dataset, self.batch_size)


In [10]:
import torch
import pytorch_lightning as pl
from transformers import AdamW

class AbstractiveSummarizationBARTFineTuning(pl.LightningModule):
    """Abstractive summarization model class for fine-tuning BART."""

    def __init__(self, model, tokenizer):
        """
        Initialize the AbstractiveSummarizationBARTFineTuning model.

        Args:
            model: Pre-trained BART model.
            tokenizer: BART tokenizer.
        """
        super().__init__()
        self.model = model
        self.tokenizer = tokenizer

    def forward(self, input_ids, attention_mask, decoder_input_ids,
                decoder_attention_mask=None, lm_labels=None):
        """
        Forward pass for the model.

        Args:
            input_ids: Input token IDs.
            attention_mask: Attention mask for input.
            decoder_input_ids: Target token IDs.
            decoder_attention_mask: Attention mask for target.
            lm_labels: Language modeling labels.

        Returns:
            Model outputs.
        """
        outputs = self.model.forward(
            input_ids=input_ids,
            attention_mask=attention_mask,
            decoder_input_ids=decoder_input_ids,
            labels=decoder_input_ids
        )

        return outputs

    def preprocess_batch(self, batch):
        """
        Reformat and preprocess the batch for model input.

        Args:
            batch: Batch of data.

        Returns:
            Formatted input and target data.
        """
        input_ids, source_attention_mask, decoder_input_ids, \
        decoder_attention_mask = batch

        y = decoder_input_ids
        decoder_ids = decoder_input_ids
        source_ids = input_ids
        source_mask = source_attention_mask

        return source_ids, source_mask, decoder_ids, decoder_attention_mask, decoder_attention_mask

    def training_step(self, batch, batch_idx):
        """
        Training step for the model.

        Args:
            batch: Batch of training data.
            batch_idx: Index of the batch.

        Returns:
            Loss for the training step.
        """
        input_ids, source_attention_mask, decoder_input_ids, \
        decoder_attention_mask, lm_labels = self.preprocess_batch(batch)

        outputs = self.forward(input_ids=input_ids, attention_mask=source_attention_mask,
                               decoder_input_ids=decoder_input_ids,
                               decoder_attention_mask=decoder_attention_mask,
                               lm_labels=lm_labels
                       )
        loss = outputs.loss

        return loss

    def validation_step(self, batch, batch_idx):
        """
        Validation step for the model.

        Args:
            batch: Batch of validation data.
            batch_idx: Index of the batch.

        Returns:
            Loss for the validation step.
        """
        input_ids, source_attention_mask, decoder_input_ids, \
        decoder_attention_mask, lm_labels = self.preprocess_batch(batch)

        outputs = self.forward(input_ids=input_ids, attention_mask=source_attention_mask,
                               decoder_input_ids=decoder_input_ids,
                               decoder_attention_mask=decoder_attention_mask,
                               lm_labels=lm_labels
                       )
        loss = outputs.loss

        return loss

    def training_epoch_end(self, outputs):
        """
        Calculate and log the average training loss for the epoch.

        Args:
            outputs: List of training step outputs.
        """
        avg_loss = torch.stack([x["loss"] for x in outputs]).mean()
        self.log('Epoch', self.trainer.current_epoch)
        self.log('avg_epoch_loss', {'train': avg_loss})

    def val_epoch_end(self, loss):
        """
        Calculate and log the average validation loss for the epoch.

        Args:
            loss: List of validation step losses.
        """
        avg_loss = torch.stack([x["loss"] for x in loss]).mean()
        self.log('avg_epoch_loss', {'Val': avg_loss})

    def configure_optimizers(self):
        """
        Configure and return the optimizer for the model.

        Returns:
            Optimizer for training.
        """
        model = self.model
        optimizer = AdamW(model.parameters())
        self.opt = optimizer

        return [optimizer]


In [12]:
# Tokenizer
# Upload the curated_data_subset.csv if using Colab or change the path to a local file
model_ = BartForConditionalGeneration.from_pretrained("facebook/bart-base")
tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")

# Dataloader
# Initialize a DataLoader for processing and loading data
dataloader = BARTDataLoader(tokenizer=tokenizer, text_len=512,
                            summarized_len=150,
                            file_path='/content/curated_data_subset.csv',
                            corpus_size=50, columns_name=['article_content','summary'],
                            train_split_size=0.8, batch_size=2)

# Read and pre-process data
dataloader.prepare_data()

# Train-test Split
# Split the data into training and validation sets
dataloader.setup()


In [13]:
# Main Model class
# Create an instance of the AbstractiveSummarizationBARTFineTuning model
model = AbstractiveSummarizationBARTFineTuning(model=model_, tokenizer=tokenizer)


In [14]:
# Trainer Class
# Initialize a PyTorch Lightning Trainer for training and evaluation
# You can specify the number of GPUs (e.g., gpus=1) if available, or remove it if not.
trainer = pl.Trainer(check_val_every_n_epoch=1, max_epochs=5, gpus=1)

# Fit model
# Train the model using the specified trainer and data loader
trainer.fit(model, dataloader)


  rank_zero_deprecation(
INFO:pytorch_lightning.utilities.rank_zero:GPU available: True (cuda), used: True
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:IPU available: False, using: 0 IPUs
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs
INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO:pytorch_lightning.callbacks.model_summary:
  | Name  | Type                         | Params
-------------------------------------------------------
0 | model | BartForConditionalGeneration | 139 M 
-------------------------------------------------------
139 M     Trainable params
0         Non-trainable params
139 M     Total params
557.682   Total estimated model params size (MB)


Sanity Checking: 0it [00:00, ?it/s]

  rank_zero_warn(
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
  rank_zero_warn(
  rank_zero_warn(


Training: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]



Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

INFO:pytorch_lightning.utilities.rank_zero:`Trainer.fit` stopped: `max_epochs=5` reached.


---

# **BART Abstractive Summarization: Using Pre Trained Model**

In [16]:
from transformers import BartTokenizer, BartForConditionalGeneration, BartConfig

In [19]:
def summarize_article(article):
    # Load BART model and tokenizer
    model_name = 'facebook/bart-large-cnn'
    tokenizer = BartTokenizer.from_pretrained(model_name)
    model = BartForConditionalGeneration.from_pretrained(model_name)

    # Tokenize and encode the article
    inputs = tokenizer.encode(article, return_tensors='pt', max_length=1024, truncation=True)

    # Generate summary
    summary_ids = model.generate(inputs, num_beams=4, max_length=150, early_stopping=True)
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

    return summary

# Example usage
article = """
My friends are cool but they eat too many carbs.
"""

summary = summarize_article(article)
print("Summary:")
print(summary)


Summary:
My friends are cool but they eat too many carbs. That's what this is all about. I don't want you to think I'm a bad person. I'm not. I just don't like to be around people who eat too much carbs. This is my way of telling you that.


In [17]:
article = '''Former U.S. President Donald Trump charged in classified documents probe
There was no immediate confirmation from the Justice Department regarding Mr. Trump’s assertion, although some U.S. media outlet cited sources saying that the former U.S. President has bee indicted'''

In [18]:
summary = summarize_article(article)
print("Summary:")
print(summary)

Summary:
Former U.S. President Donald Trump charged in classified documents probe. There was no immediate confirmation from the Justice Department regarding Mr. Trump’s assertion. Some media outlet cited sources saying that the former U.s. President has bee indicted. The Justice Department has not commented on the reports.


---

# References:

* https://huggingface.co/transformers/model_doc/bart.html
* https://www.pytorchlightning.ai/
* https://www.frase.io/blog/20-applications-of-automatic-summarization-in-the-enterprise/
* https://medium.com/sciforce/towards-automatic-summarization-part-2-abstractive-methods-c424386a65ea
* https://medium.com/swlh/a-simple-overview-of-rnn-lstm-and-attention-mechanism-9e844763d07b
* https://github.com/CurationCorp/curation-corpus

---