<a href="https://colab.research.google.com/github/Ahmad10Raza/Text-Summarizer-WebApp/blob/master/notebook/Text%20Summarization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Text Summarization**

Text summarization is the process of extracting the main points from a text document and presenting them in a concise and coherent manner. It can be used to shorten a document, make it more readable, or extract key information.

There are many different approaches to text summarization, each with its own strengths and weaknesses. Some of the most common approaches include:

**Extractive summarization:** This approach simply extracts the most important sentences from the document and presents them in a new order.

**Abstractive summarization:** This approach attempts to generate a new summary that captures the main points of the document in a more natural and coherent way.

**Hybrid summarization:** This approach combines elements of both extractive and abstractive summarization.

The choice of which approach to use depends on the specific task at hand. For example, extractive summarization is often used for tasks such as generating bullet points or highlights, while abstractive summarization is often used for tasks such as generating short summaries or creating news articles.

Text summarization is a challenging task, but it has a wide range of applications. It can be used in areas such as information retrieval, natural language processing, and machine learning.

In [2]:
! pip install transformers[sentencepiece] datasets sacrebleu rouge_score py7zr -q

! pip install --upgrade accelerate
! pip uninstall -y transformers accelerate
! pip install transformers accelerate

# Importing Libraries

In [3]:
from transformers import pipeline, set_seed
from datasets import load_dataset, load_from_disk
import matplotlib.pyplot as plt
from datasets import load_dataset
import pandas as pd
from datasets import load_dataset, load_metric

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

import nltk
from nltk.tokenize import sent_tokenize

from tqdm import tqdm
import torch

nltk.download("punkt")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

## pegasus Model
"google/pegasus-cnn_dailymail" refers to a **pre-trained model for text summarization** developed by Google AI. It is based on the Pegasus neural network architecture and trained on a massive dataset of news articles and summaries from the CNN/Daily Mail website.

Here's a breakdown of what "google/pegasus-cnn_dailymail" signifies:

* **google:** Indicates the model originates from Google AI.
* **pegasus:** Refers to the underlying neural network architecture used for the model, which is specifically designed for text summarization tasks.
* **cnn_dailymail:** Specifies the dataset the model was trained on, which consists of news articles and summaries from the CNN/Daily Mail website.

Therefore, "google/pegasus-cnn_dailymail" represents a readily available pre-trained model for text summarization, enabling users to perform the following:

* **Generate summaries of text documents:** The model can be used to automatically create concise and informative summaries of any text input, such as news articles, research papers, or even emails.
* **Fine-tune for specific tasks:** The pre-trained model can be further adapted and trained on smaller, domain-specific datasets to improve its performance on particular text summarization tasks.
* **Explore and understand text summarization:** Users can leverage this model to experiment and gain insights into the workings of text summarization algorithms.

Overall, "google/pegasus-cnn_dailymail" is a valuable resource for researchers, developers, and anyone interested in applying text summarization techniques for various purposes.


In [4]:
device = "cuda" if torch.cuda.is_available() else "cpu"
device

'cuda'

In [5]:
# Import necessary libraries
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Define device for model execution (CPU or GPU)
device = "cuda" if torch.cuda.is_available() else "cpu"

# Specify pre-trained model checkpoint
model_ckpt = "google/pegasus-cnn_dailymail"

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)

# Load pre-trained model for text summarization
model_pegasus = AutoModelForSeq2SeqLM.from_pretrained(model_ckpt).to(device)




tokenizer_config.json:   0%|          | 0.00/88.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/1.12k [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/1.91M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/65.0 [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/2.28G [00:00<?, ?B/s]

Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at google/pegasus-cnn_dailymail and are newly initialized: ['model.encoder.embed_positions.weight', 'model.decoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


generation_config.json:   0%|          | 0.00/280 [00:00<?, ?B/s]

## Example usage:

In [6]:

text = "This is a long and detailed text that requires a concise summary."
encoded_text = tokenizer(text, return_tensors="pt").to(device)
generated_summary = model_pegasus.generate(**encoded_text)
decoded_summary = tokenizer.decode(generated_summary[0], skip_special_tokens=True)

# Print the generated summary
print(f"Summary: {decoded_summary}")

Summary: This is a long and detailed text that requires a concise summary.<n>This is a long and detailed text that requires a concise summary.<n>This is a long and detailed text that requires a concise summary.


# The Datasets

In [7]:
#dowload & unzip data

! wget https://github.com/entbappy/Branching-tutorial/raw/master/summarizer-data.zip
! unzip summarizer-data.zip

--2023-12-09 17:21:28--  https://github.com/entbappy/Branching-tutorial/raw/master/summarizer-data.zip
Resolving github.com (github.com)... 140.82.114.4
Connecting to github.com (github.com)|140.82.114.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/entbappy/Branching-tutorial/master/summarizer-data.zip [following]
--2023-12-09 17:21:28--  https://raw.githubusercontent.com/entbappy/Branching-tutorial/master/summarizer-data.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7903594 (7.5M) [application/zip]
Saving to: ‘summarizer-data.zip’


2023-12-09 17:21:29 (101 MB/s) - ‘summarizer-data.zip’ saved [7903594/7903594]

Archive:  summarizer-data.zip
  inflating: samsum-test.csv         
  infl

In [8]:
data=load_from_disk('samsum_dataset')
data

DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary'],
        num_rows: 14732
    })
    test: Dataset({
        features: ['id', 'dialogue', 'summary'],
        num_rows: 819
    })
    validation: Dataset({
        features: ['id', 'dialogue', 'summary'],
        num_rows: 818
    })
})

In [10]:
data.shape

{'train': (14732, 3), 'test': (819, 3), 'validation': (818, 3)}

In [13]:
split_lengths = [len(data[split])for split in data]

print(f"Split lengths: {split_lengths}")
print(f"Features: {data['train'].column_names}")
print("\nDialogue:")

print(data["test"][1]["dialogue"])

print("\nSummary:")

print(data["test"][1]["summary"])

Split lengths: [14732, 819, 818]
Features: ['id', 'dialogue', 'summary']

Dialogue:
Eric: MACHINE!
Rob: That's so gr8!
Eric: I know! And shows how Americans see Russian ;)
Rob: And it's really funny!
Eric: I know! I especially like the train part!
Rob: Hahaha! No one talks to the machine like that!
Eric: Is this his only stand-up?
Rob: Idk. I'll check.
Eric: Sure.
Rob: Turns out no! There are some of his stand-ups on youtube.
Eric: Gr8! I'll watch them now!
Rob: Me too!
Eric: MACHINE!
Rob: MACHINE!
Eric: TTYL?
Rob: Sure :)

Summary:
Eric and Rob are going to watch a stand-up on youtube.


In [14]:
def convert_examples_to_features(example_batch):
    input_encodings = tokenizer(example_batch['dialogue'] , max_length = 1024, truncation = True )

    with tokenizer.as_target_tokenizer():
        target_encodings = tokenizer(example_batch['summary'], max_length = 128, truncation = True )

    return {
        'input_ids' : input_encodings['input_ids'],
        'attention_mask': input_encodings['attention_mask'],
        'labels': target_encodings['input_ids']
    }


In [17]:
data_pt = data.map(convert_examples_to_features, batched = True)

Map:   0%|          | 0/14732 [00:00<?, ? examples/s]



In [18]:
data_pt['train']

Dataset({
    features: ['id', 'dialogue', 'summary', 'input_ids', 'attention_mask', 'labels'],
    num_rows: 14732
})

In [19]:
data_pt['train'][1]

{'id': '13728867',
 'dialogue': 'Olivia: Who are you voting for in this election? \r\nOliver: Liberals as always.\r\nOlivia: Me too!!\r\nOliver: Great',
 'summary': 'Olivia and Olivier are voting for liberals in this election. ',
 'input_ids': [18038,
  151,
  2632,
  127,
  119,
  6228,
  118,
  115,
  136,
  2974,
  152,
  10463,
  151,
  35884,
  130,
  329,
  107,
  18038,
  151,
  2587,
  314,
  1242,
  10463,
  151,
  1509,
  1],
 'attention_mask': [1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1],
 'labels': [18038, 111, 34296, 127, 6228, 118, 33195, 115, 136, 2974, 107, 1]}

In [21]:
from transformers import DataCollatorForSeq2Seq
seq2seq_data_collator = DataCollatorForSeq2Seq(tokenizer, model=model_pegasus)

In [22]:
from transformers import TrainingArguments, Trainer

trainer_args = TrainingArguments(
    output_dir='pegasus-samsum', num_train_epochs=1, warmup_steps=500,
    per_device_train_batch_size=1, per_device_eval_batch_size=1,
    weight_decay=0.01, logging_steps=10,
    evaluation_strategy='steps', eval_steps=500, save_steps=1e6,
    gradient_accumulation_steps=16
)

In [24]:

trainer = Trainer(model=model_pegasus, args=trainer_args,
                  tokenizer=tokenizer, data_collator=seq2seq_data_collator,
                  train_dataset=data_pt["test"],
                  eval_dataset=data_pt["validation"])

In [25]:
trainer.train()

You're using a PegasusTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Step,Training Loss,Validation Loss


TrainOutput(global_step=51, training_loss=3.0043694084765864, metrics={'train_runtime': 176.4837, 'train_samples_per_second': 4.641, 'train_steps_per_second': 0.289, 'total_flos': 313450454089728.0, 'train_loss': 3.0043694084765864, 'epoch': 1.0})

In [26]:
# Evaluation

def generate_batch_sized_chunks(list_of_elements, batch_size):
    """split the dataset into smaller batches that we can process simultaneously
    Yield successive batch-sized chunks from list_of_elements."""
    for i in range(0, len(list_of_elements), batch_size):
        yield list_of_elements[i : i + batch_size]



def calculate_metric_on_test_ds(dataset, metric, model, tokenizer,
                               batch_size=16, device=device,
                               column_text="article",
                               column_summary="highlights"):
    article_batches = list(generate_batch_sized_chunks(dataset[column_text], batch_size))
    target_batches = list(generate_batch_sized_chunks(dataset[column_summary], batch_size))

    for article_batch, target_batch in tqdm(
        zip(article_batches, target_batches), total=len(article_batches)):

        inputs = tokenizer(article_batch, max_length=1024,  truncation=True,
                        padding="max_length", return_tensors="pt")

        summaries = model.generate(input_ids=inputs["input_ids"].to(device),
                         attention_mask=inputs["attention_mask"].to(device),
                         length_penalty=0.8, num_beams=8, max_length=128)
        ''' parameter for length penalty ensures that the model does not generate sequences that are too long. '''

        # Finally, we decode the generated texts,
        # replace the  token, and add the decoded texts with the references to the metric.
        decoded_summaries = [tokenizer.decode(s, skip_special_tokens=True,
                                clean_up_tokenization_spaces=True)
               for s in summaries]

        decoded_summaries = [d.replace("", " ") for d in decoded_summaries]


        metric.add_batch(predictions=decoded_summaries, references=target_batch)

    #  Finally compute and return the ROUGE scores.
    score = metric.compute()
    return score


In [27]:
rouge_names = ["rouge1", "rouge2", "rougeL", "rougeLsum"]
rouge_metric = load_metric('rouge')

  rouge_metric = load_metric('rouge')


Downloading builder script:   0%|          | 0.00/2.17k [00:00<?, ?B/s]

In [29]:
score = calculate_metric_on_test_ds(
    data['test'][0:10], rouge_metric, trainer.model, tokenizer, batch_size = 2, column_text = 'dialogue', column_summary= 'summary'
)

rouge_dict = dict((rn, score[rn].mid.fmeasure ) for rn in rouge_names )

pd.DataFrame(rouge_dict, index = [f'pegasus'] )

100%|██████████| 5/5 [00:23<00:00,  4.66s/it]


Unnamed: 0,rouge1,rouge2,rougeL,rougeLsum
pegasus,0.021073,0.0,0.020539,0.020758


## Save The Model

In [30]:
model_pegasus.save_pretrained("pegasus-samsum-model")

In [31]:
tokenizer.save_pretrained("tokenizer")

('tokenizer/tokenizer_config.json',
 'tokenizer/special_tokens_map.json',
 'tokenizer/spiece.model',
 'tokenizer/added_tokens.json',
 'tokenizer/tokenizer.json')

In [32]:
#Load

tokenizer = AutoTokenizer.from_pretrained("/content/tokenizer")

In [34]:
#Prediction

gen_kwargs = {"length_penalty": 0.8, "num_beams":8, "max_length": 128}



sample_text = data["test"][3]["dialogue"]

reference = data["test"][3]["summary"]

pipe = pipeline("summarization", model="pegasus-samsum-model",tokenizer=tokenizer)

##
print("Dialogue:")
print(sample_text)


print("\nReference Summary:")
print(reference)


print("\nModel Summary:")
print(pipe(sample_text, **gen_kwargs)[0]["summary_text"])

Dialogue:
Will: hey babe, what do you want for dinner tonight?
Emma:  gah, don't even worry about it tonight
Will: what do you mean? everything ok?
Emma: not really, but it's ok, don't worry about cooking though, I'm not hungry
Will: Well what time will you be home?
Emma: soon, hopefully
Will: you sure? Maybe you want me to pick you up?
Emma: no no it's alright. I'll be home soon, i'll tell you when I get home. 
Will: Alright, love you. 
Emma: love you too. 

Reference Summary:
Emma will be home soon and she will let Will know.

Model Summary:
Emma: not really, but it's ok, don't worry about cooking though, I'm not hungry .<n>Will: soon, hopefully Will: you sure? Maybe you want me to pick you up?<n>Emma: no no it's alright. I'll be home soon, i'll tell you when I get home.


# **Thank You!**