
In [5]:
# Import necessary libraries
from transformers import T5Tokenizer, T5ForConditionalGeneration

# Load pre-trained T5 model and tokenizer
model_name = "t5-small"  # You can use "t5-base" or "t5-large" for larger models
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

# Define a function for text summarization
def summarize_text(text, max_length=50, min_length=20, length_penalty=2.0, num_beams=4):
    """
    Summarizes the input text using the T5 model.

    Args:
    - text (str): The input text to be summarized.
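    - max_length (int): Maximum length of the generated summary, in tokens.
    - min_length (int): Minimum length of the generated summary, in tokens.
    - length_penalty (float): Exponent applied to the sequence length when scoring beams;
      values above 0 favor longer summaries, values below 0 favor shorter ones.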
    - num_beams (int): Number of beams for beam search.

    Returns:
    - str: The summarized text.
    """
    # Prepend the "summarize: " task prefix, which tells T5 which task to perform
    input_text = "summarize: " + text

    # Tokenize the input text; anything beyond 512 tokens is truncated
    inputs = tokenizer.encode(input_text, return_tensors="pt", max_length=512, truncation=True)

    # Generate the summary
    summary_ids = model.generate(
        inputs,
        max_length=max_length,
        min_length=min_length,
        length_penalty=length_penalty,
        num_beams=num_beams,
        early_stopping=True,
    )

    # Decode the summary
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    return summary

# Example text for summarization
text = """
The COVID-19 pandemic, also known as the coronavirus pandemic, is an ongoing global pandemic of coronavirus disease 2019 (COVID-19) caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2).
The novel virus was first identified in an outbreak in the Chinese city of Wuhan in December 2019. Attempts to contain it there failed, allowing the virus to spread to other areas of Asia and later worldwide.
The pandemic has caused global social and economic disruption, including the largest global recession since the Great Depression.
Widespread supply shortages, including food shortages, were caused by supply chain disruptions. Reduced human activity led to an unprecedented decrease in pollution.
Educational institutions and public areas were partially or fully closed in many jurisdictions, and many events were cancelled or postponed during 2020 and 2021.
"""

# Summarize the text
summary = summarize_text(text, max_length=50, min_length=20, length_penalty=2.0, num_beams=4)
print("Original Text:")
print(text)
print("\nSummarized Text:")
print(summary)

Original Text:

The COVID-19 pandemic, also known as the coronavirus pandemic, is an ongoing global pandemic of coronavirus disease 2019 (COVID-19) caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). 
The novel virus was first identified in an outbreak in the Chinese city of Wuhan in December 2019. Attempts to contain it there failed, allowing the virus to spread to other areas of Asia and later worldwide. 
The pandemic has caused global social and economic disruption, including the largest global recession since the Great Depression. 
Widespread supply shortages, including food shortages, were caused by supply chain disruptions. Reduced human activity led to an unprecedented decrease in pollution. 
Educational institutions and public areas were partially or fully closed in many jurisdictions, and many events were cancelled or postponed during 2020 and 2021.


Summarized Text:
the novel virus was first identified in an outbreak in the Chinese city of Wuhan in Decemb
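
The `length_penalty=2.0` argument passed to `model.generate` above is easy to misread as a penalty on long outputs. As a rough illustration, the sketch below assumes the normalization described in the Hugging Face generation docs, where a beam's cumulative log-probability is divided by its length raised to `length_penalty`. The helper names (`length_normalized_score`, `pick_winner`) and the candidate numbers are hypothetical and chosen only to show the effect; this is not the library's actual implementation.

```python
def length_normalized_score(sum_logprobs: float, length: int, length_penalty: float) -> float:
    """Score a beam as its cumulative log-probability divided by length ** length_penalty."""
    return sum_logprobs / (length ** length_penalty)


def pick_winner(candidates: dict, length_penalty: float) -> str:
    """Return the name of the candidate with the highest normalized score."""
    return max(
        candidates,
        key=lambda name: length_normalized_score(
            candidates[name]["sum_logprobs"],
            candidates[name]["length"],
            length_penalty,
        ),
    )


# Two made-up beam candidates: a short one and a longer, less probable one.
candidates = {
    "short": {"sum_logprobs": -6.0, "length": 10},
    "long": {"sum_logprobs": -9.0, "length": 20},
}

for lp in (0.0, 1.0, 2.0):
    print(f"length_penalty={lp}: winner is {pick_winner(candidates, lp)}")
```

With no normalization (`length_penalty=0.0`) the shorter candidate wins because its raw log-probability is higher; with positive values the longer candidate overtakes it, since log-probabilities are negative and dividing by a larger number moves the score toward zero.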