# TEXT SUMMARIZATION USING BART TRANSFORMER MODEL

MODEL = BART (Bidirectional and Auto-Regressive Transformers)

#### LOADING THE DATASET

In [2]:
!pip install datasets



In [3]:
from datasets import load_dataset

ds = load_dataset("knkarthick/dialogsum")

Error while fetching `HF_TOKEN` secret value from your vault: 'Requesting secret HF_TOKEN timed out. Secrets can only be fetched when running from the Colab UI.'.
You are not authenticated with the Hugging Face Hub in this notebook.
If the error persists, please let us know by opening an issue on GitHub (https://github.com/huggingface/huggingface_hub/issues/new).



In [4]:
ds

DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 12460
    })
    validation: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 500
    })
    test: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 1500
    })
})

In [6]:
ds['train'][1]['dialogue']

"#Person1#: Hello Mrs. Parker, how have you been?\n#Person2#: Hello Dr. Peters. Just fine thank you. Ricky and I are here for his vaccines.\n#Person1#: Very well. Let's see, according to his vaccination record, Ricky has received his Polio, Tetanus and Hepatitis B shots. He is 14 months old, so he is due for Hepatitis A, Chickenpox and Measles shots.\n#Person2#: What about Rubella and Mumps?\n#Person1#: Well, I can only give him these for now, and after a couple of weeks I can administer the rest.\n#Person2#: OK, great. Doctor, I think I also may need a Tetanus booster. Last time I got it was maybe fifteen years ago!\n#Person1#: We will check our records and I'll have the nurse administer and the booster as well. Now, please hold Ricky's arm tight, this may sting a little."

In [7]:
ds['train'][1]['summary']

'Mrs Parker takes Ricky for his vaccines. Dr. Peters checks the record and then gives Ricky a vaccine.'

## 1. USING THE MODEL WITHOUT FINE-TUNING

#### LOADING THE BART MODEL

In [8]:
!pip install transformers



In [9]:
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("summarization", model="facebook/bart-large-cnn")

The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling `transformers.utils.move_cache()`.



Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


In [12]:
article_1 = ds['train'][1]['dialogue']

pipe(article_1, max_length=20, min_length=10, do_sample=False)

[{'summary_text': 'Ricky has received his Polio, Tetanus and Hepatitis B shots.'}]

## 2. FINE-TUNING THE MODEL

In [13]:
# Load model directly
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large-cnn")



`input_ids` is the tokenized form of the input text: each token (a word or part of a word) is mapped to an integer ID from the model's vocabulary.

`attention_mask` is a binary tensor that marks which tokens the model should attend to and which it should ignore (usually padding tokens):

- 1 indicates that the token should be attended to.
- 0 indicates that the token is padding and should be ignored.
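
As a rough illustration of the two fields described above, the following sketch pads a toy batch by hand. It does not call the real tokenizer: the token IDs are made up, `pad_batch` is a hypothetical helper, and `PAD_ID = 1` is an assumption (in practice, read `tokenizer.pad_token_id`).

```python
# Toy illustration of padding and attention masks (no tokenizer involved).
PAD_ID = 1  # assumed pad ID for this sketch; use tokenizer.pad_token_id in practice


def pad_batch(sequences, max_length, pad_id=PAD_ID):
    """Truncate or right-pad each ID list to max_length and build the matching mask."""
    input_ids, attention_mask = [], []
    for seq in sequences:
        seq = seq[:max_length]                     # truncation
        n_pad = max_length - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)   # padding
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([[0, 713, 16, 2], [0, 31414, 2]], max_length=5)
print(batch["input_ids"])       # [[0, 713, 16, 2, 1], [0, 31414, 2, 1, 1]]
print(batch["attention_mask"])  # [[1, 1, 1, 1, 0], [1, 1, 1, 0, 0]]
```

Every 0 in the mask lines up with a pad ID, which is how the model knows to skip those positions.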

In sequence-to-sequence models, such as text summarization models, you have:

- Input IDs: Tokenized IDs of the source text (e.g., dialogue).
- Target IDs: Tokenized IDs of the target text (e.g., summary).

During training, the model computes the loss between the predicted sequence and the target sequence. To keep padding tokens out of this loss, padding token IDs in the labels are replaced with -100, the value that PyTorch's cross-entropy loss ignores by default (`ignore_index=-100`).
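
To make the effect of -100 concrete, here is a minimal pure-Python sketch of a cross-entropy that skips ignored positions. It only mimics the idea and is not the code the model runs; `mask_labels` and `masked_cross_entropy` are hypothetical helpers, and the three-token vocabulary, logits, and `PAD_ID = 1` are made up for illustration.

```python
import math

IGNORE_INDEX = -100  # label value skipped by the loss
PAD_ID = 1           # assumed pad ID for this sketch


def mask_labels(label_ids, pad_id=PAD_ID):
    """Replace pad IDs with IGNORE_INDEX so they do not contribute to the loss."""
    return [tok if tok != pad_id else IGNORE_INDEX for tok in label_ids]


def masked_cross_entropy(logits, labels):
    """Mean negative log-likelihood over positions whose label is not IGNORE_INDEX."""
    total, count = 0.0, 0
    for row, label in zip(logits, labels):
        if label == IGNORE_INDEX:
            continue  # padded position: skipped entirely
        log_norm = math.log(sum(math.exp(x) for x in row))
        total += log_norm - row[label]
        count += 1
    return total / count


labels = mask_labels([2, 0, 1, 1])
print(labels)  # [2, 0, -100, -100]

logits = [
    [0.0, 0.0, 2.0],  # real token, target ID 2
    [1.0, 0.0, 0.0],  # real token, target ID 0
    [5.0, 0.0, 0.0],  # padded position
    [0.0, 5.0, 0.0],  # padded position
]
loss = masked_cross_entropy(logits, labels)
# Same value as scoring only the two real positions:
print(abs(loss - masked_cross_entropy(logits[:2], labels[:2])) < 1e-12)  # True
```

Whatever the model predicts at the padded positions, the loss is unchanged, so the padding length does not influence training.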

More about padding: https://www.nature.com/articles/s41598-020-71450-8/figures/1

In [14]:
# Tokenization

def preprocess_function(batch):
    source = batch["dialogue"]
    target = batch["summary"]
    # Truncate or pad both dialogues and summaries to a fixed length of 128 tokens
    source_ids = tokenizer(source, truncation=True, padding="max_length", max_length=128)
    target_ids = tokenizer(target, truncation=True, padding="max_length", max_length=128)

    # Replace pad token id with -100 for labels to ignore padding in loss computation
    labels = [
        [(label if label != tokenizer.pad_token_id else -100) for label in labels_example]
        for labels_example in target_ids["input_ids"]
    ]

    return {
        "input_ids": source_ids["input_ids"],
        "attention_mask": source_ids["attention_mask"],
        "labels": labels
    }

df_source = ds.map(preprocess_function, batched=True)


In [16]:
# Define training arguments
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="/content",  # Replace with your output directory
    per_device_train_batch_size=8,
    num_train_epochs=2,  # Adjust number of epochs as needed
    remove_unused_columns=False
)

In [17]:
# Create Trainer object
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=df_source["train"],
    eval_dataset=df_source["test"]
)

trainer.train()

| Step | Training Loss |
|------|---------------|
| 500  | 1.5915 |
| 1000 | 1.4878 |
| 1500 | 1.4338 |
| 2000 | 1.0835 |
| 2500 | 1.017  |
| 3000 | 0.9991 |


Non-default generation parameters: {'max_length': 142, 'min_length': 56, 'early_stopping': True, 'num_beams': 4, 'length_penalty': 2.0, 'no_repeat_ngram_size': 3, 'forced_bos_token_id': 0, 'forced_eos_token_id': 2}

TrainOutput(global_step=3116, training_loss=1.259912561850003, metrics={'train_runtime': 3534.9048, 'train_samples_per_second': 7.05, 'train_steps_per_second': 0.881, 'total_flos': 6750530835578880.0, 'train_loss': 1.259912561850003, 'epoch': 2.0})
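
The `global_step` and throughput figures in this output can be cross-checked with a little arithmetic. The sketch below assumes a single device, no gradient accumulation, and that the last partial batch is kept; under those assumptions it reproduces the reported values.

```python
import math

train_examples = 12460      # rows in the train split
batch_size = 8              # per_device_train_batch_size
epochs = 2                  # num_train_epochs
train_runtime = 3534.9048   # seconds, from the TrainOutput above

steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)  # 1558 3116

samples_per_second = train_examples * epochs / train_runtime
print(round(samples_per_second, 2))  # 7.05
```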

In [18]:
# Evaluate the model
eval_results = trainer.evaluate()

# Print evaluation results
print(eval_results)

{'eval_loss': 1.67119562625885, 'eval_runtime': 58.0579, 'eval_samples_per_second': 25.836, 'eval_steps_per_second': 3.238, 'epoch': 2.0}


## SAVING THE MODEL

In [20]:
# Save the model and tokenizer after training
model.save_pretrained("/content/model_directory")
tokenizer.save_pretrained("/content/model_directory")

Non-default generation parameters: {'max_length': 142, 'min_length': 56, 'early_stopping': True, 'num_beams': 4, 'length_penalty': 2.0, 'no_repeat_ngram_size': 3, 'forced_bos_token_id': 0, 'forced_eos_token_id': 2}


('/content/model_directory/tokenizer_config.json',
 '/content/model_directory/special_tokens_map.json',
 '/content/model_directory/vocab.json',
 '/content/model_directory/merges.txt',
 '/content/model_directory/added_tokens.json',
 '/content/model_directory/tokenizer.json')

#### SUMMARIZING THE CUSTOM DATA USING SAVED MODEL AND TOKENIZER

In [23]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the trained model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("/content/model_directory")
model = AutoModelForSeq2SeqLM.from_pretrained("/content/model_directory")

# Function to summarize a blog post
def summarize(blog_post):
    # Tokenize the input blog post
    inputs = tokenizer(blog_post, max_length=1024, truncation=True, return_tensors="pt")

    # Generate the summary
    summary_ids = model.generate(inputs["input_ids"], max_length=150, min_length=40, length_penalty=2.0, num_beams=4, early_stopping=True)

    # Decode the summary
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    return summary

# Example blog post
blog_post = """
Very early abortions

Between five and seven weeks, a pregnancy can be ended by a procedure called menstrual extraction. This procedure is also sometimes called menstrual regulation, mini-suction, or preemptive abortion. The contents of the uterus are suctioned out through a thin (3-4 mm) plastic
tube that is inserted through the undilated cervix. Suction is applied either by a bulb syringe or a small pump.

Another method is called the “morning after” pill, or emergency contraception. Basically, it involves taking high doses of birth control pills within 24 to 48 hours of having unprotected sex. The high doses of hormones causes the uterine lining to change so that it will not support a pregnancy. Thus, if the egg has been fertilized, it is simply expelled from the body.

There are two types of emergency contraception.

One type is identical to ordinary birth control pills, anduses the hormones estrogen and progestin). This type is available with a prescription under the brand name Preven. But women can even use their regular birth control pills for emergency contraception, after they check with their doctor about the proper dose. About half of women who use birth control pills for emergency contraception get nauseated and 20 percent vomit. This method cuts the risk of pregnancy 75 percent. The other type of morning-after pill contains only one hormone: progestin, and is available under the brand name Plan B. It is more effective than the first type with a lower risk of nausea and vomiting. It reduces the risk of pregnancy 89 percent.
"""

# Get the summary
summary = summarize(blog_post)
print("Summary:", summary)


Summary: There are two types of emergency contraception. One is identical to ordinary birth control pills and uses the hormones estrogen and progestin. The other contains only one hormone and is available under the brand name Plan B. It is more effective than the first type with a lower risk of nausea and vomiting.


## COMPUTING ROUGE SCORES

In [25]:
!pip install transformers rouge-score





In [27]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from rouge_score import rouge_scorer

# Load the fine-tuned model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("/content/model_directory")
model = AutoModelForSeq2SeqLM.from_pretrained("/content/model_directory")

# Initialize the ROUGE scorer
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)

# Function to compute ROUGE for each prediction-reference pair
def compute_rouge(prediction, reference):
    scores = scorer.score(reference, prediction)
    return scores

# Generate predictions using the fine-tuned model
def generate_summary(text):
    inputs = tokenizer(text, max_length=1024, truncation=True, return_tensors="pt")
    summary_ids = model.generate(inputs["input_ids"], max_length=150, min_length=40,
                                 length_penalty=2.0, num_beams=4, early_stopping=True)
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)

# Example: evaluate ROUGE on the blog post summarized above (not a sample from the dataset)
dialogue = """Very early abortions

Between five and seven weeks, a pregnancy can be ended by a procedure called menstrual extraction. This procedure is also sometimes called menstrual regulation, mini-suction, or preemptive abortion. The contents of the uterus are suctioned out through a thin (3-4 mm) plastic
tube that is inserted through the undilated cervix. Suction is applied either by a bulb syringe or a small pump.

Another method is called the “morning after” pill, or emergency contraception. Basically, it involves taking high doses of birth control pills within 24 to 48 hours of having unprotected sex. The high doses of hormones causes the uterine lining to change so that it will not support a pregnancy. Thus, if the egg has been fertilized, it is simply expelled from the body.

There are two types of emergency contraception.

One type is identical to ordinary birth control pills, anduses the hormones estrogen and progestin). This type is available with a prescription under the brand name Preven. But women can even use their regular birth control pills for emergency contraception, after they check with their doctor about the proper dose. About half of women who use birth control pills for emergency contraception get nauseated and 20 percent vomit. This method cuts the risk of pregnancy 75 percent. The other type of morning-after pill contains only one hormone: progestin, and is available under the brand name Plan B. It is more effective than the first type with a lower risk of nausea and vomiting. It reduces the risk of pregnancy 89 percent.
"""

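# NOTE: this reference is identical to the summary the fine-tuned model generated above,
# so the ROUGE scores below are 1.0 by construction rather than a measure of quality.
reference_summary = "There are two types of emergency contraception. One is identical to ordinary birth control pills and uses the hormones estrogen and progestin. The other contains only one hormone and is available under the brand name Plan B. It is more effective than the first type with a lower risk of nausea and vomiting."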

# Generate model prediction
predicted_summary = generate_summary(dialogue)

# Compute ROUGE scores
rouge_scores = compute_rouge(predicted_summary, reference_summary)

# Display the ROUGE scores
print("Generated Summary:", predicted_summary)
print("ROUGE Scores:")
for metric, score in rouge_scores.items():
    print(f"{metric}: Precision={score.precision}, Recall={score.recall}, F1={score.fmeasure}")


Generated Summary: There are two types of emergency contraception. One is identical to ordinary birth control pills and uses the hormones estrogen and progestin. The other contains only one hormone and is available under the brand name Plan B. It is more effective than the first type with a lower risk of nausea and vomiting.
ROUGE Scores:
rouge1: Precision=1.0, Recall=1.0, F1=1.0
rouge2: Precision=1.0, Recall=1.0, F1=1.0
rougeL: Precision=1.0, Recall=1.0, F1=1.0
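
For intuition about what these numbers measure, here is a minimal sketch of ROUGE-1 as unigram overlap. It is a simplification: the hypothetical `rouge1` helper uses plain whitespace tokenization and no stemming, so its values will not match the `rouge_score` package exactly. It does show why a prediction that is identical to its reference scores 1.0 on every component.

```python
from collections import Counter


def rouge1(prediction, reference):
    """Unigram-overlap precision, recall, and F1 (whitespace tokens, no stemming)."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    overlap = sum((pred_counts & ref_counts).values())  # clipped matches
    precision = overlap / max(sum(pred_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1


same = "There are two types of emergency contraception."
print(rouge1(same, same))                  # (1.0, 1.0, 1.0)
print(rouge1("two types of pills", same))  # lower scores: only 3 of 4 words match
```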


## COMPLEMENTING ROUGE WITH BLEU AND BERTSCORE

BLEU and BERTScore complement ROUGE with additional views of summary quality:

- BLEU: Measures n-gram precision; it is most often used in machine translation.
- BERTScore: Uses contextual embeddings to compare semantic similarity.
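
The n-gram precision that BLEU builds on can be sketched in a few lines. This covers only the clipped ("modified") precision for one n-gram order; full BLEU also combines several orders and applies a brevity penalty, which this sketch leaves out. The helper names and example sentences are made up for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(prediction, reference, n):
    """Clipped n-gram precision: each predicted n-gram counts at most
    as many times as it occurs in the reference."""
    pred = Counter(ngrams(prediction.split(), n))
    ref = Counter(ngrams(reference.split(), n))
    clipped = sum(min(count, ref[gram]) for gram, count in pred.items())
    total = sum(pred.values())
    return clipped / total if total else 0.0


reference = "the cat sat on the mat"
prediction = "the the the cat"
print(modified_precision(prediction, reference, 1))  # 0.75 ("the" is clipped to 2 of 3)
print(modified_precision(prediction, reference, 2))  # 1/3: only ("the", "cat") matches
```

Clipping is what stops a prediction from scoring well by repeating a common word.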

In [28]:
!pip install transformers rouge-score nltk bert-score


Collecting bert-score
  Downloading bert_score-0.3.13-py3-none-any.whl.metadata (15 kB)
Downloading bert_score-0.3.13-py3-none-any.whl (61 kB)
Installing collected packages: bert-score
Successfully installed bert-score-0.3.13


In [29]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from rouge_score import rouge_scorer
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from bert_score import score as bert_score


In [30]:
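# Initialize tokenizer and model
# (note: this loads the original facebook/bart-large-cnn checkpoint,
# not the fine-tuned model saved in /content/model_directory)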
tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large-cnn")

# Initialize ROUGE Scorer
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)

# Function to generate a summary
def generate_summary(text):
    inputs = tokenizer(text, max_length=1024, truncation=True, return_tensors="pt")
    summary_ids = model.generate(inputs["input_ids"], max_length=150, min_length=40,
                                 length_penalty=2.0, num_beams=4, early_stopping=True)
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)

# Function to compute ROUGE Score
def compute_rouge(prediction, reference):
    return scorer.score(reference, prediction)

# Function to compute BLEU Score
def compute_bleu(prediction, reference):
    reference_tokens = [reference.split()]
    prediction_tokens = prediction.split()
    smooth_fn = SmoothingFunction().method4  # Smoothing for short sentences
    bleu = sentence_bleu(reference_tokens, prediction_tokens, smoothing_function=smooth_fn)
    return bleu

# Function to compute BERTScore
def compute_bert_score(predictions, references):
    P, R, F1 = bert_score(predictions, references, lang="en")
    return {"Precision": P.mean().item(), "Recall": R.mean().item(), "F1": F1.mean().item()}




In [31]:
# Example dialogue and reference summary
dialogue = """
Between five and seven weeks, a pregnancy can be ended by a procedure called menstrual extraction.
Another method is called the “morning after” pill, or emergency contraception.
There are two types of emergency contraception. One type is identical to birth control pills, using estrogen and progestin.
The other type contains only progestin and is available under the brand name Plan B.
"""

reference_summary = (
    "There are two types of emergency contraception. One is identical to ordinary birth control pills "
    "and uses the hormones estrogen and progestin. The other contains only one hormone and is available "
    "under the brand name Plan B. It is more effective than the first type with a lower risk of nausea and vomiting."
)

# Generate model prediction
predicted_summary = generate_summary(dialogue)

# Compute Scores
rouge_scores = compute_rouge(predicted_summary, reference_summary)
bleu_score = compute_bleu(predicted_summary, reference_summary)
bert_scores = compute_bert_score([predicted_summary], [reference_summary])

# Display the Results
print("Generated Summary:", predicted_summary)
print("\nROUGE Scores:")
for metric, score in rouge_scores.items():
    print(f"{metric}: Precision={score.precision}, Recall={score.recall}, F1={score.fmeasure}")

print(f"\nBLEU Score: {bleu_score:.4f}")
print("\nBERTScore:")
for metric, value in bert_scores.items():
    print(f"{metric}: {value:.4f}")



Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Generated Summary: Between five and seven weeks, a pregnancy can be ended by a procedure called menstrual extraction. Another method is called the “morning after” pill, or emergency contraception. One type is identical to birth control pills, using estrogen and progestin.

ROUGE Scores:
rouge1: Precision=0.46153846153846156, Recall=0.33962264150943394, F1=0.3913043478260869
rouge2: Precision=0.21052631578947367, Recall=0.15384615384615385, F1=0.17777777777777778
rougeL: Precision=0.3333333333333333, Recall=0.24528301886792453, F1=0.2826086956521739

BLEU Score: 0.0621

BERTScore:
Precision: 0.8849
Recall: 0.8880
F1: 0.8864
