<a href="https://colab.research.google.com/github/Pranav-JJ/Transformers-Abstractive-Summarisation/blob/main/AbstractiveSummarisationXsum.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install transformers==4.20.0
!pip install keras_nlp==0.3.0
!pip install datasets
!pip install huggingface-hub
!pip install nltk
!pip install rouge-score

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers==4.20.0
  Downloading transformers-4.20.0-py3-none-any.whl (4.4 MB)
Collecting huggingface-hub<1.0,>=0.1.0 (from transformers==4.20.0)
  Downloading huggingface_hub-0.14.1-py3-none-any.whl (224 kB)
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1 (from transformers==4.20.0)
  Downloading tokenizers-0.12.1-cp310-cp310-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.14.1 tokenizers-0.12.1 transforme

In [None]:
import os
import logging

import nltk
import numpy as np
import tensorflow as tf
from tensorflow import keras

# Only log error messages
tf.get_logger().setLevel(logging.ERROR)

os.environ["TOKENIZERS_PARALLELISM"] = "false"

In [None]:
# Fraction of the dataset used for each of the train and test splits
TRAIN_TEST_SPLIT = 0.05

MAX_INPUT_LENGTH = 1024  # Maximum number of tokens in the model input
MIN_TARGET_LENGTH = 5  # Minimum number of tokens in a generated summary
MAX_TARGET_LENGTH = 128  # Maximum number of tokens in a generated summary
BATCH_SIZE = 2  # Batch size for training our model
LEARNING_RATE = 1e-5  # Learning rate for training our model
MAX_EPOCHS = 1  # Maximum number of epochs we will train the model for

# This notebook is built on the t5-small checkpoint from the Hugging Face Model Hub
MODEL_CHECKPOINT = "t5-small"

In [None]:
from datasets import load_dataset

raw_datasets = load_dataset("xsum", split="train")




In [None]:
print(raw_datasets)

Dataset({
    features: ['document', 'summary', 'id'],
    num_rows: 204045
})


In [None]:
print(raw_datasets[0])



In [None]:
raw_datasets = raw_datasets.train_test_split(
    train_size=TRAIN_TEST_SPLIT, test_size=TRAIN_TEST_SPLIT
)

In [None]:
print(raw_datasets)

DatasetDict({
    train: Dataset({
        features: ['document', 'summary', 'id'],
        num_rows: 10202
    })
    test: Dataset({
        features: ['document', 'summary', 'id'],
        num_rows: 10203
    })
})
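
The row counts above follow from applying the 5% fraction to the 204,045 rows of the original split. A minimal sketch of that arithmetic, assuming the train size is rounded down and the test size is rounded up (this rounding rule is an assumption that happens to match the printed output, and `split_sizes` is a hypothetical helper, not part of `datasets`):

```python
import math

NUM_ROWS = 204045        # rows in the XSum train split, from the output above
TRAIN_TEST_SPLIT = 0.05  # same fraction as in the notebook


def split_sizes(num_rows, train_fraction, test_fraction):
    """Return (n_train, n_test), assuming floor for train and ceil for test."""
    n_train = math.floor(train_fraction * num_rows)
    n_test = math.ceil(test_fraction * num_rows)
    return n_train, n_test


n_train, n_test = split_sizes(NUM_ROWS, TRAIN_TEST_SPLIT, TRAIN_TEST_SPLIT)
print(n_train, n_test)  # 10202 10203
```

Because both fractions are 0.05, roughly 90% of the original rows are left out of this experiment entirely, which keeps training time short.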


In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(MODEL_CHECKPOINT)

In [None]:
if MODEL_CHECKPOINT in ["t5-small", "t5-base", "t5-large", "t5-3b", "t5-11b"]:
    prefix = "summarize: "
else:
    prefix = ""

In [None]:
def preprocess_function(examples):
    inputs = [prefix + doc for doc in examples["document"]]
    model_inputs = tokenizer(inputs, max_length=MAX_INPUT_LENGTH, truncation=True)

    # Set up the tokenizer for targets
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(
            examples["summary"], max_length=MAX_TARGET_LENGTH, truncation=True
        )

    model_inputs["labels"] = labels["input_ids"]

    return model_inputs
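
To make the shape of the returned dictionary concrete, here is a toy sketch of the same steps (prepend the prefix, truncate the inputs, tokenize the summaries separately, store them under `labels`). `toy_tokenize` and `toy_preprocess` are hypothetical stand-ins that split on whitespace; the real tokenizer returns integer ids, adds an end-of-sequence token, and also produces an `attention_mask`.

```python
PREFIX = "summarize: "
MAX_INPUT_LENGTH = 8   # tiny limits so that truncation is visible
MAX_TARGET_LENGTH = 4


def toy_tokenize(texts, max_length):
    """Hypothetical tokenizer: split on whitespace, then truncate."""
    return [text.split()[:max_length] for text in texts]


def toy_preprocess(examples):
    # Prepend the task prefix to every document in the batch
    inputs = [PREFIX + doc for doc in examples["document"]]
    model_inputs = {"input_ids": toy_tokenize(inputs, MAX_INPUT_LENGTH)}
    # Targets are tokenized separately and stored under "labels"
    model_inputs["labels"] = toy_tokenize(examples["summary"], MAX_TARGET_LENGTH)
    return model_inputs


batch = {
    "document": ["the cat sat on the mat and then fell asleep"],
    "summary": ["a cat slept on a mat"],
}
out = toy_preprocess(batch)
print(out["input_ids"][0])
print(out["labels"][0])
```

Note that the prefix counts toward the input length limit, so it takes up part of the truncation budget.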

In [None]:
tokenized_datasets = raw_datasets.map(preprocess_function, batched=True)


In [None]:
from transformers import TFAutoModelForSeq2SeqLM

model = TFAutoModelForSeq2SeqLM.from_pretrained(MODEL_CHECKPOINT)

All model checkpoint layers were used when initializing TFT5ForConditionalGeneration.

All the layers of TFT5ForConditionalGeneration were initialized from the model checkpoint at t5-small.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFT5ForConditionalGeneration for predictions without further training.


In [None]:
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, return_tensors="tf")
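
The collator pads every batch to the length of its longest sequence rather than to a fixed global length. A simplified sketch of that idea in plain Python, assuming T5's pad token id of 0 and the collator's default label padding value of -100 (positions the loss ignores). `pad_batch` is a hypothetical helper; the real collator additionally returns tensors and can build `decoder_input_ids` from the labels when a model is passed.

```python
PAD_TOKEN_ID = 0           # T5's pad token id
LABEL_PAD_TOKEN_ID = -100  # label positions with this value are ignored by the loss


def pad_batch(features):
    """Pad each field to the longest sequence in this batch (dynamic padding)."""
    max_in = max(len(f["input_ids"]) for f in features)
    max_lab = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_in, n_lab = len(f["input_ids"]), len(f["labels"])
        batch["input_ids"].append(f["input_ids"] + [PAD_TOKEN_ID] * (max_in - n_in))
        batch["attention_mask"].append([1] * n_in + [0] * (max_in - n_in))
        batch["labels"].append(f["labels"] + [LABEL_PAD_TOKEN_ID] * (max_lab - n_lab))
    return batch


features = [
    {"input_ids": [5, 6, 7, 1], "labels": [8, 1]},
    {"input_ids": [9, 1], "labels": [3, 4, 1]},
]
padded = pad_batch(features)
print(padded["input_ids"])
print(padded["labels"])
```

The -100 label padding is the reason the metric function later in the notebook replaces negative label ids with the pad token id before decoding.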

In [None]:
train_dataset = tokenized_datasets["train"].to_tf_dataset(
    batch_size=BATCH_SIZE,
    columns=["input_ids", "attention_mask", "labels"],
    shuffle=True,
    collate_fn=data_collator,
)
test_dataset = tokenized_datasets["test"].to_tf_dataset(
    batch_size=BATCH_SIZE,
    columns=["input_ids", "attention_mask", "labels"],
    shuffle=False,
    collate_fn=data_collator,
)
generation_dataset = (
    tokenized_datasets["test"]
    .shuffle()
    .select(list(range(200)))
    .to_tf_dataset(
        batch_size=BATCH_SIZE,
        columns=["input_ids", "attention_mask", "labels"],
        shuffle=False,
        collate_fn=data_collator,
    )
)

In [None]:
optimizer = keras.optimizers.Adam(learning_rate=LEARNING_RATE)
model.compile(optimizer=optimizer)

No loss specified in compile() - the model's internal loss computation will be used as the loss. Don't panic - this is a common way to train TensorFlow models in Transformers! To disable this behaviour please pass a loss argument, or explicitly pass `loss=None` if you do not want your model to compute a loss.


In [None]:
import keras_nlp

rouge_l = keras_nlp.metrics.RougeL()


def metric_fn(eval_predictions):
    predictions, labels = eval_predictions
    decoded_predictions = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    for label in labels:
        label[label < 0] = tokenizer.pad_token_id  # Replace masked label tokens
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    result = rouge_l(decoded_labels, decoded_predictions)
    # Report only the F1 score; precision and recall are also available in the result
    result = {"RougeL": result["f1_score"]}

    return result
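
ROUGE-L is based on the longest common subsequence (LCS) between the reference and the prediction. The sketch below shows the underlying computation in plain Python. It assumes lowercased whitespace tokens, which is a simplification: `keras_nlp` applies its own tokenization, so its scores may differ. `lcs_length` and `rouge_l_f1` are hypothetical helper names.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, candidate):
    """ROUGE-L F1 from LCS-based precision and recall."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


score = rouge_l_f1("the cat sat on the mat", "the cat lay on the mat")
print(round(score, 4))  # 0.8333
```

Here the LCS is "the cat on the mat" (5 tokens out of 6 on each side), so precision, recall, and F1 all equal 5/6.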

In [None]:
from transformers.keras_callbacks import KerasMetricCallback

metric_callback = KerasMetricCallback(
    metric_fn, eval_dataset=generation_dataset, predict_with_generate=True
)

callbacks = [metric_callback]

# For now we will use our test set as our validation_data
model.fit(
    train_dataset, validation_data=test_dataset, epochs=MAX_EPOCHS, callbacks=callbacks
)





<keras.callbacks.History at 0x7f95a113f550>

In [None]:
from transformers import pipeline

summarizer = pipeline("summarization", model=model, tokenizer=tokenizer, framework="tf")

summarizer(
    raw_datasets["test"][0]["document"],
    min_length=MIN_TARGET_LENGTH,
    max_length=MAX_TARGET_LENGTH,
)

Token indices sequence length is longer than the specified maximum sequence length for this model (786 > 512). Running this sequence through the model will result in indexing errors


[{'summary_text': 'Jeremy Corbyn has said the UK must "stand together" and defend Christian values after Brexit, she said.'}]

In [None]:
from huggingface_hub import notebook_login

notebook_login()


In [None]:
output_dir = ".model"  # Replace with your desired output directory

# Save the model and tokenizer
model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)


Configuration saved in .model/config.json
Model weights saved in .model/tf_model.h5
tokenizer config file saved in .model/tokenizer_config.json
Special tokens file saved in .model/special_tokens_map.json


('.model/tokenizer_config.json',
 '.model/special_tokens_map.json',
 '.model/tokenizer.json')

In [None]:
from transformers.keras_callbacks import PushToHubCallback

push_to_hub_callback = PushToHubCallback(
    output_dir="TF-Finetuned-xsum",
    tokenizer=tokenizer,
)


Cloning https://huggingface.co/phoen1x/TF-Finetuned-xsum into local empty directory.


In [None]:
# Push to the logged-in user's namespace, matching the repository shown in the output
model.push_to_hub("TF-Finetuned-xsum")
tokenizer.push_to_hub("TF-Finetuned-xsum")

Configuration saved in TF-Finetuned-xsum/config.json
Model weights saved in TF-Finetuned-xsum/tf_model.h5


To https://huggingface.co/phoen1x/TF-Finetuned-xsum
   7335431..51c35ba  main -> main


tokenizer config file saved in TF-Finetuned-xsum/tokenizer_config.json
Special tokens file saved in TF-Finetuned-xsum/special_tokens_map.json
To https://huggingface.co/phoen1x/TF-Finetuned-xsum
   51c35ba..1d1008f  main -> main




'https://huggingface.co/phoen1x/TF-Finetuned-xsum/commit/1d1008f1cf86e9a9390f5446e957eb006ec30dd8'

In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("phoen1x/TF-Finetuned-xsum")

model = AutoModelForSeq2SeqLM.from_pretrained("phoen1x/TF-Finetuned-xsum", from_tf=True)


All TF 2.0 model weights were used when initializing T5ForConditionalGeneration.

All the weights of T5ForConditionalGeneration were initialized from the TF 2.0 model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use T5ForConditionalGeneration for predictions without further training.


In [None]:
text = "summarize: The Inflation Reduction Act lowers prescription drug costs, health care costs, and energy costs. It's the most aggressive action on tackling the climate crisis in American history, which will lift up American workers and create good-paying, union jobs across the country. It'll lower the deficit and ask the ultra-wealthy and corporations to pay their fair share. And no one making under $400,000 per year will pay a penny more in taxes."


In [None]:
from transformers import pipeline

summarizer = pipeline("summarization", model="phoen1x/TF-Finetuned-xsum")
summarizer(text)

All model checkpoint layers were used when initializing TFT5ForConditionalGeneration.

All the layers of TFT5ForConditionalGeneration were initialized from the model checkpoint at phoen1x/TF-Finetuned-xsum.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFT5ForConditionalGeneration for predictions without further training.
Your max_length is set to 200, but you input_length is only 103. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=51)


[{'summary_text': 'No one making under $400,000 per year will pay a penny more in taxes than a decade ago, the Inflation Reduction Act says.'}]

In [None]:
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'])

reference_summary = "Under an order passed by the appellant, a Magistrate, one G was put in possession of some property on October 14, 1955. in revision the order was set aside by the High Court on August 27, 1957, and the opposite party S applied, on November. 20, 1957, to the appellant for redelivery of possession. G applied to the High Court for a review of its previous order and on November 25, 1957, the application was admitted and an interim stay was granted of the proceedings before the appellant. On November 26, 1957, an application bearing an illegible signature and not Supported by an affidavit was filed before the appellant indicating that the High Court had stayed the proceedings. A telegram addressed to a pleader, not the Counsel for G, was filed along with the application. The appellant refused to act on this application and telegram and on November 27, 1957, he passed an order allowing the application of S for restitution. On November 28, 1957, a copy of the order of the High Court was received and thereupon the writ for redelivery of possession was not issued. The High Court convicted the appellant for contempt of court for passing the order for restitution on November 27, when the High Court had stayed the proceedings. The appellant appealed to the Supreme Court and impleaded the Chief justice and judges of the High Court as respondents. 320 Held, that the appellant was not guilty of cortempt of court. Before a subordinate court can be held to be guilt, of contempt of court it must be stablished that it had knowledge of the order of the High Court and intentionally disobeyed it. The knowledge must be obtained from a source which was either authorised or otherwise authentic. In the present case the appellant was entitled to ignore the application as well as the telegram. In a contempt matter the Chief justice and judges of the High Court should not be made parties and the title of such a proceeding should be In re. . the alleged contemnor. "
generated_summary = "The appellant was a Sub Divisional Magistrate at Dhenkanal in the year 1957. In a criminal matter (1)(1936) A.C. 322. 41 322 before the appellant for redelivery of possession. The application was not ready and so the matter was adjourned to November 27, 1957. It was also confirmed by the Additional District Magistrat in appeal. This application was accompanied by a telegram addressed to Mr. Neelakanth Misra, Pleader."


scores = scorer.score(reference_summary, generated_summary)

print(scores['rouge1'].fmeasure)  # Print F1 score for ROUGE-1
print(scores['rouge2'].fmeasure)  # Print F1 score for ROUGE-2
print(scores['rougeL'].fmeasure)  # Print F1 score for ROUGE-L



0.21307506053268765
0.09732360097323602
0.13559322033898302
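
To see where numbers like these come from, the sketch below computes a ROUGE-1 F1 score by hand from clipped unigram overlap. It assumes lowercased whitespace tokens, whereas `rouge_score` uses its own tokenizer, so this will not reproduce the values above exactly. `rouge1_f1` is a hypothetical helper name.

```python
from collections import Counter


def rouge1_f1(reference, candidate):
    """ROUGE-1 F1 from clipped unigram overlap between reference and candidate."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Counter intersection keeps the minimum count of each shared token
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


score = rouge1_f1("the cat sat on the mat", "the cat lay on the mat the")
print(round(score, 4))  # 0.7692
```

In this toy case the overlap is 5 tokens, precision is 5/7, and recall is 5/6, giving F1 = 10/13. When the reference is much longer than the generated text, as in the example above, recall is capped by the length ratio, which pulls the F1 score down even if most generated tokens appear in the reference.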
