# **INSTALLING THE REQUIRED LIBRARIES**

In [1]:
pip install --upgrade transformers



In [2]:
pip -q install datasets

In [3]:
pip install evaluate



In [4]:
pip install rouge_score



In [5]:
pip install accelerate -U



In [6]:
pip install transformers[torch]



In [7]:
pip install huggingface_hub



# **IMPORTING THE LIBRARIES**

In [8]:
!huggingface-cli login



    A token is already saved on your machine. Run `huggingface-cli whoami` to get more information or `huggingface-cli logout` if you want to log out.
    Setting a new token will erase the existing one.
    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Token: 
Add token as git credential? (Y/n) Y
Token is valid (permission: write).
Your token has been saved in your con

In [9]:
from datasets import load_dataset

billsum = load_dataset("billsum", split="ca_test")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In [10]:
billsum

Dataset({
    features: ['text', 'summary', 'title'],
    num_rows: 1237
})

In [11]:
billsum = billsum.train_test_split(test_size=0.2)

In [12]:
billsum["train"][0]

{'text': 'The people of the State of California do enact as follows:\n\n\nSECTION 1.\nChapter 15.8 (commencing with Section 67395) is added to Part 40 of Division 5 of Title 3 of the Education Code, to read:\nCHAPTER  15.8. Autism Employment and Education Act\n67395.\n(a) This chapter shall be known, and may be cited, as the Autism Employment and Education Act.\n(b) The Legislature finds and declares all of the following:\n(1) Autism spectrum disorder (ASD) is a lifelong neurological condition estimated to affect as many as one in 88 children. It is now the most common neurological disorder affecting children and one of the most common developmental disabilities.\n(2) Many individuals living with ASD will need some level of support over the course of their lives. In cases where adolescents and adults with severe autism are placed into long-term care or other supported housing arrangements, the annual cost of housing, which includes caregiver time, can be four hundred dollars ($400) per

In [25]:
from transformers import AutoTokenizer

checkpoint = "google-t5/t5-base"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on google-t5/t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.


In [26]:
prefix = "summarize: "


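def preprocess_function(examples):
    # Prepend the T5 task prefix to every bill text in the batch.
    inputs = [prefix + doc for doc in examples["text"]]
    # Tokenize the inputs, truncating anything beyond 1024 tokens.
    model_inputs = tokenizer(inputs, max_length=1024, truncation=True)

    # Tokenize the reference summaries as targets, capped at 128 tokens.
    labels = tokenizer(text_target=examples["summary"], max_length=128, truncation=True)

    # Store the target token ids as the labels the model is trained to predict.
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs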

In [27]:
tokenized_billsum = billsum.map(preprocess_function, batched=True)

Map:   0%|          | 0/989 [00:00<?, ? examples/s]

Map:   0%|          | 0/248 [00:00<?, ? examples/s]
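
The two `Map` progress lines above report 989 and 248 examples, which is how the 1,237-row `ca_test` split divides under `test_size=0.2`. A quick arithmetic sketch, assuming the test count is rounded up and the remainder goes to training (an assumption that happens to match the counts shown here):

```python
import math

num_rows = 1237      # size of the ca_test split reported earlier
test_size = 0.2      # fraction passed to train_test_split

# Assumption: the test count is rounded up, the rest is used for training.
n_test = math.ceil(num_rows * test_size)
n_train = num_rows - n_test

print(n_train, n_test)
```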

In [28]:
tokenized_billsum['train'].features

{'text': Value(dtype='string', id=None),
 'summary': Value(dtype='string', id=None),
 'title': Value(dtype='string', id=None),
 'input_ids': Sequence(feature=Value(dtype='int32', id=None), length=-1, id=None),
 'attention_mask': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None),
 'labels': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None)}
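
The feature listing shows three new columns next to the original text fields. To make their origin concrete, the sketch below mimics what `preprocess_function` receives and returns when `batched=True`: a dictionary of lists goes in, and `input_ids`, `attention_mask`, and `labels` come out. The whitespace-based `toy_tokenize` helper and the two-row batch are hypothetical stand-ins for the real T5 tokenizer and BillSum rows, so the ids are not real vocabulary ids.

```python
prefix = "summarize: "

def toy_tokenize(texts, max_length):
    """Hypothetical stand-in for the tokenizer: one fake id per whitespace token."""
    input_ids = [[len(tok) for tok in text.split()][:max_length] for text in texts]
    return {
        "input_ids": input_ids,
        "attention_mask": [[1] * len(ids) for ids in input_ids],
    }

def toy_preprocess(examples, max_input=8, max_target=4):
    # Prepend the task prefix to every document, then truncate.
    inputs = [prefix + doc for doc in examples["text"]]
    model_inputs = toy_tokenize(inputs, max_length=max_input)
    # Summaries are tokenized separately and stored as labels.
    labels = toy_tokenize(examples["summary"], max_length=max_target)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# A batch arrives as a dict of lists, one list per column.
batch = {
    "text": ["The people of the State of California do enact as follows", "Short bill"],
    "summary": ["A bill about education and employment", "A short bill"],
}
out = toy_preprocess(batch)
print(sorted(out))                               # column names added by preprocessing
print([len(ids) for ids in out["input_ids"]])    # inputs truncated to max_input
print([len(ids) for ids in out["labels"]])       # targets truncated to max_target
```

Sequences keep their own lengths at this stage; padding is deferred to the data collator.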

In [29]:
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=checkpoint)
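
Because the tokenized examples have different lengths, the collator pads each batch to its longest member at collation time. A minimal pure-Python sketch of that idea, assuming the pad token id `0` used by T5 and the `-100` label padding value that the loss ignores. `toy_collate` and `pad_to` are hypothetical helpers for illustration, not the library implementation, and the token ids are made up.

```python
PAD_TOKEN_ID = 0        # T5 pad token id
LABEL_PAD_ID = -100     # label positions with this value are ignored by the loss

def pad_to(seq, length, value):
    # Right-pad a list with `value` until it reaches `length`.
    return seq + [value] * (length - len(seq))

def toy_collate(features):
    """Hypothetical helper: pad every field to the longest sequence in the batch."""
    max_in = max(len(f["input_ids"]) for f in features)
    max_lab = max(len(f["labels"]) for f in features)
    return {
        "input_ids": [pad_to(f["input_ids"], max_in, PAD_TOKEN_ID) for f in features],
        "attention_mask": [pad_to([1] * len(f["input_ids"]), max_in, 0) for f in features],
        "labels": [pad_to(f["labels"], max_lab, LABEL_PAD_ID) for f in features],
    }

batch = toy_collate([
    {"input_ids": [21, 7, 9, 1], "labels": [5, 1]},
    {"input_ids": [21, 1], "labels": [8, 3, 1]},
])
print(batch["input_ids"])
print(batch["attention_mask"])
print(batch["labels"])
```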

In [30]:
import evaluate

rouge = evaluate.load("rouge")

In [31]:
import numpy as np


def compute_metrics(eval_pred):
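    predictions, labels = eval_pred
    # Decode generated token ids back into text.
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    # Replace the -100 label padding with the pad token id so the labels can be decoded.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Score the decoded predictions against the reference summaries.
    result = rouge.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)

    # Report the mean number of non-padding tokens per generated summary.
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
    result["gen_len"] = np.mean(prediction_lens)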

    return {k: round(v, 4) for k, v in result.items()}
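
Two details in `compute_metrics` are easy to miss: label positions padded with `-100` must be mapped back to a real token id before decoding, and `gen_len` counts only the non-padding tokens in each prediction. The sketch below reproduces both steps with plain lists. The token ids are made up for illustration, and pad id `0` is assumed, as in T5.

```python
PAD_TOKEN_ID = 0  # assumed pad id (T5)

# Made-up ids: labels padded with -100, predictions padded with the pad id.
labels = [[8, 3, 1, -100, -100], [5, 1, -100, -100, -100]]
predictions = [[0, 8, 3, 1, 0], [0, 5, 9, 4, 1]]

# Same effect as np.where(labels != -100, labels, pad_token_id)
clean_labels = [[tok if tok != -100 else PAD_TOKEN_ID for tok in row] for row in labels]

# Same effect as np.count_nonzero(pred != pad_token_id) for each prediction
prediction_lens = [sum(tok != PAD_TOKEN_ID for tok in row) for row in predictions]
gen_len = sum(prediction_lens) / len(prediction_lens)

print(clean_labels)
print(prediction_lens, gen_len)
```

The leading `0` in each made-up prediction stands for the decoder start token, which for T5 is the pad token, so it is excluded from the count. That would be consistent with a generation length one below whatever token limit is in effect, although this notebook does not state the limit.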

In [32]:
from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer

model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

In [33]:
model

T5ForConditionalGeneration(
  (shared): Embedding(32128, 768)
  (encoder): T5Stack(
    (embed_tokens): Embedding(32128, 768)
    (block): ModuleList(
      (0): T5Block(
        (layer): ModuleList(
          (0): T5LayerSelfAttention(
            (SelfAttention): T5Attention(
              (q): Linear(in_features=768, out_features=768, bias=False)
              (k): Linear(in_features=768, out_features=768, bias=False)
              (v): Linear(in_features=768, out_features=768, bias=False)
              (o): Linear(in_features=768, out_features=768, bias=False)
              (relative_attention_bias): Embedding(32, 12)
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (1): T5LayerFF(
            (DenseReluDense): T5DenseActDense(
              (wi): Linear(in_features=768, out_features=3072, bias=False)
              (wo): Linear(in_features=3072, out_features=768, bias=False)
              (dropout): Dro

In [34]:
training_args = Seq2SeqTrainingArguments(
    "bert-on-the-billsum",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=4,
    predict_with_generate=True,
    fp16=True,
    push_to_hub=True,
)



In [35]:
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_billsum["train"],
    eval_dataset=tokenized_billsum["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)



In [36]:
trainer.train()

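| Epoch | Training Loss | Validation Loss | Rouge1 | Rouge2 | Rougel | Rougelsum | Gen Len |
|-------|---------------|-----------------|--------|--------|--------|-----------|---------|
| 1 | No log | 1.920353 | 0.1939 | 0.1054 | 0.1689 | 0.1688 | 19.0 |
| 2 | 2.296300 | 1.860363 | 0.1953 | 0.1051 | 0.1714 | 0.1714 | 19.0 |
| 3 | 1.927000 | 1.834611 | 0.1955 | 0.1068 | 0.1716 | 0.1715 | 19.0 |
| 4 | 1.849100 | 1.826394 | 0.195 | 0.1052 | 0.1714 | 0.1713 | 19.0 |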




TrainOutput(global_step=1980, training_loss=1.9693615229442867, metrics={'train_runtime': 1480.4311, 'train_samples_per_second': 2.672, 'train_steps_per_second': 1.337, 'total_flos': 4818074830110720.0, 'train_loss': 1.9693615229442867, 'epoch': 4.0})
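
The figures in `TrainOutput` can be cross-checked from the training configuration. The arithmetic below assumes a single device and no gradient accumulation (neither is stated in the notebook), and it takes the 989 training examples from the earlier map step.

```python
import math

train_rows = 989          # training examples after the split
batch_size = 2            # per_device_train_batch_size
epochs = 4                # num_train_epochs
runtime_s = 1480.4311     # train_runtime from TrainOutput

# The last batch of each epoch is smaller, so round up.
steps_per_epoch = math.ceil(train_rows / batch_size)
global_step = steps_per_epoch * epochs

samples_per_second = train_rows * epochs / runtime_s
steps_per_second = global_step / runtime_s

print(steps_per_epoch, global_step)
print(round(samples_per_second, 3), round(steps_per_second, 3))
```

Under those assumptions the numbers agree with the reported `global_step`, `train_samples_per_second`, and `train_steps_per_second`.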

In [37]:
model.push_to_hub("Reyansh4/T5_on_Billsum")


CommitInfo(commit_url='https://huggingface.co/Reyansh4/T5_on_Billsum/commit/f0f593d011a46b3599623c323adb775e38be8719', commit_message='Upload T5ForConditionalGeneration', commit_description='', oid='f0f593d011a46b3599623c323adb775e38be8719', pr_url=None, pr_revision=None, pr_num=None)

In [38]:
tokenizer.push_to_hub("Reyansh4/T5_on_Billsum")


CommitInfo(commit_url='https://huggingface.co/Reyansh4/T5_on_Billsum/commit/f4a76f21d1536f32e62cb1c0c72bc68047935d26', commit_message='Upload tokenizer', commit_description='', oid='f4a76f21d1536f32e62cb1c0c72bc68047935d26', pr_url=None, pr_revision=None, pr_num=None)

In [39]:
from transformers import pipeline
from datasets import load_metric

summarizer = pipeline("summarization", model="Reyansh4/T5_on_Billsum")

reference_summary = "AI and NLP are transforming industries. Text summarization is a valuable application of NLP."

rouge = load_metric("rouge")


Generated Summary: Artificial intelligence (AI) is transforming industries and societies around the world. From healthcare to finance to transportation, AI has the potential to revolutionize how we live and work. One area where AI is making a significant impact is in natural language processing (NLP). NLP is a branch of AI that focuses on the interaction between computers and humans through natural language.


  rouge = load_metric("rouge")
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datasets`.



ROUGE scores: {'rouge1': AggregateScore(low=Score(precision=0.14754098360655737, recall=0.6428571428571429, fmeasure=0.24), mid=Score(precision=0.14754098360655737, recall=0.6428571428571429, fmeasure=0.24), high=Score(precision=0.14754098360655737, recall=0.6428571428571429, fmeasure=0.24)), 'rouge2': AggregateScore(low=Score(precision=0.03333333333333333, recall=0.15384615384615385, fmeasure=0.0547945205479452), mid=Score(precision=0.03333333333333333, recall=0.15384615384615385, fmeasure=0.0547945205479452), high=Score(precision=0.03333333333333333, recall=0.15384615384615385, fmeasure=0.0547945205479452)), 'rougeL': AggregateScore(low=Score(precision=0.09836065573770492, recall=0.42857142857142855, fmeasure=0.16000000000000003), mid=Score(precision=0.09836065573770492, recall=0.42857142857142855, fmeasure=0.16000000000000003), high=Score(precision=0.09836065573770492, recall=0.42857142857142855, fmeasure=0.16000000000000003)), 'rougeLsum': AggregateScore(low=Score(precision=0.09836

In [41]:
text_to_summarize = """
Artificial intelligence (AI) is transforming industries and societies around the world. From healthcare to finance to transportation, AI has the potential to revolutionize how we live and work. One area where AI is making a significant impact is in natural language processing (NLP). NLP is a branch of AI that focuses on the interaction between computers and humans through natural language.In recent years, NLP has seen rapid advancements thanks to deep learning techniques and large datasets. Models like OpenAI's GPT and Google's BERT have achieved remarkable results in tasks such as language translation, sentiment analysis, and text summarization.Text summarization, in particular, is a valuable application of NLP. It involves condensing a piece of text into a shorter version while preserving its key information and meaning. Summarization models like T5 are capable of producing high-quality summaries that capture the essence of a document in just a few sentences.In this example, we'll use the T5 model to summarize a passage about the importance of AI and NLP. Let's see how well the model can distill the main points of the text into a concise summary.
"""

In [44]:
summary = summarizer(text_to_summarize, max_length=150, min_length=30, do_sample=False)[0]

print("Generated Summary:", summary['summary_text'])

Generated Summary: Artificial intelligence (AI) is transforming industries and societies around the world. From healthcare to finance to transportation, AI has the potential to revolutionize how we live and work. One area where AI is making a significant impact is in natural language processing (NLP).


In [47]:
rouge_output = rouge.compute(predictions=[summary['summary_text']], references=[[reference_summary]])

print("ROUGE scores:", rouge_output['rouge1'])

ROUGE scores: AggregateScore(low=Score(precision=0.16279069767441862, recall=0.5, fmeasure=0.24561403508771928), mid=Score(precision=0.16279069767441862, recall=0.5, fmeasure=0.24561403508771928), high=Score(precision=0.16279069767441862, recall=0.5, fmeasure=0.24561403508771928))
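
The ROUGE-1 numbers above can be reproduced by hand, which shows what the metric measures: clipped unigram overlap between the generated summary and the reference. The sketch below is a simplified re-implementation for illustration, not the library code. Its `tokenize` helper (lowercase, alphanumeric runs only, no stemming) is an assumption about the tokenization that happens to match the reported scores.

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase and keep alphanumeric runs only (assumed approximation of ROUGE tokenization).
    return re.findall(r"[a-z0-9]+", text.lower())

reference = "AI and NLP are transforming industries. Text summarization is a valuable application of NLP."
prediction = (
    "Artificial intelligence (AI) is transforming industries and societies around the world. "
    "From healthcare to finance to transportation, AI has the potential to revolutionize how we live and work. "
    "One area where AI is making a significant impact is in natural language processing (NLP)."
)

ref_counts = Counter(tokenize(reference))
pred_counts = Counter(tokenize(prediction))

# Clipped overlap: each word counts at most as often as it appears in both texts.
overlap = sum((ref_counts & pred_counts).values())

precision = overlap / sum(pred_counts.values())   # share of generated words found in the reference
recall = overlap / sum(ref_counts.values())       # share of reference words recovered
fmeasure = 2 * precision * recall / (precision + recall)

print(overlap, round(precision, 4), round(recall, 4), round(fmeasure, 4))
```

The low precision reflects how much longer the generated summary is than the reference: 43 tokens against 14.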


In [46]:
print("ROUGE scores:", rouge_output['rouge2'])

ROUGE scores: AggregateScore(low=Score(precision=0.023809523809523808, recall=0.07692307692307693, fmeasure=0.03636363636363636), mid=Score(precision=0.023809523809523808, recall=0.07692307692307693, fmeasure=0.03636363636363636), high=Score(precision=0.023809523809523808, recall=0.07692307692307693, fmeasure=0.03636363636363636))


In [48]:
print("ROUGE scores:", rouge_output['rougeL'])

ROUGE scores: AggregateScore(low=Score(precision=0.13953488372093023, recall=0.42857142857142855, fmeasure=0.2105263157894737), mid=Score(precision=0.13953488372093023, recall=0.42857142857142855, fmeasure=0.2105263157894737), high=Score(precision=0.13953488372093023, recall=0.42857142857142855, fmeasure=0.2105263157894737))
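
ROUGE-L replaces unigram overlap with the longest common subsequence (LCS) of the two token lists, so word order matters. The following sketch recomputes the score above with a standard dynamic-programming LCS. As before, it is an illustrative re-implementation under an assumed tokenization (lowercase, alphanumeric runs, no stemming), not the library code.

```python
import re

def tokenize(text):
    # Lowercase and keep alphanumeric runs only (assumed approximation of ROUGE tokenization).
    return re.findall(r"[a-z0-9]+", text.lower())

def lcs_length(a, b):
    # Dynamic programme over two token lists, keeping only one previous row.
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

reference = "AI and NLP are transforming industries. Text summarization is a valuable application of NLP."
prediction = (
    "Artificial intelligence (AI) is transforming industries and societies around the world. "
    "From healthcare to finance to transportation, AI has the potential to revolutionize how we live and work. "
    "One area where AI is making a significant impact is in natural language processing (NLP)."
)

ref_tokens = tokenize(reference)
pred_tokens = tokenize(prediction)

lcs = lcs_length(ref_tokens, pred_tokens)
precision = lcs / len(pred_tokens)
recall = lcs / len(ref_tokens)
fmeasure = 2 * precision * recall / (precision + recall)

print(lcs, round(precision, 4), round(recall, 4), round(fmeasure, 4))
```

The LCS is one token shorter than the unigram overlap because "and" appears in both texts but out of order relative to "transforming industries".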


In [51]:
print("ROUGE scores:", rouge_output['rougeLsum'])

ROUGE scores: AggregateScore(low=Score(precision=0.13953488372093023, recall=0.42857142857142855, fmeasure=0.2105263157894737), mid=Score(precision=0.13953488372093023, recall=0.42857142857142855, fmeasure=0.2105263157894737), high=Score(precision=0.13953488372093023, recall=0.42857142857142855, fmeasure=0.2105263157894737))
