# Summarizing News Articles with the T5 Model

In [None]:
!pip install transformers
!pip install sentencepiece

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


This example uses the T5 model to summarize a news article. We first request the article's URL and read the HTML response, then parse the HTML with the BeautifulSoup library and extract the text inside the page's `<article>` element.

Next, we encode the text with the T5 tokenizer and put the model in evaluation mode. We generate the summary by calling the `model.generate()` method with a maximum summary length, measured in tokens, and then decode the result while skipping special tokens such as the padding and end-of-sequence markers. Because the length limit can cut the output mid-sentence, the last step trims the summary at its final period.

To change the summary length, adjust `max_length`. To use a different variant of the T5 model, replace `'t5-base'` with another checkpoint name such as `'t5-small'` or `'t5-large'`.

In [17]:
from transformers import T5Tokenizer, T5ForConditionalGeneration
import requests
from bs4 import BeautifulSoup

# Load the T5 model
model = T5ForConditionalGeneration.from_pretrained('t5-base')
tokenizer = T5Tokenizer.from_pretrained('t5-base')

# Set the URL of the news article
url = "https://edition.cnn.com/2022/12/24/world/new-minerals-discovered-in-el-ali-meteorite-scn/index.html"

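# Send a request to the URL and get the HTML response
response = requests.get(url, timeout=30)
response.raise_for_status()  # Fail early on an HTTP error instead of summarizing an error page
html = response.text

# Use BeautifulSoup to parse the HTML and extract the main text
soup = BeautifulSoup(html, 'html.parser')
article = soup.find('article')
if article is None:
    raise ValueError("No <article> element found in the page")

# Set the input text and the maximum length of the summary (in tokens)
text = article.text
max_length = 20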
max_length = 20

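# Encode the input text and set the model to evaluation mode.
# No truncation is applied here, so a long article triggers the
# tokenizer's sequence-length warning shown in the output.
input_ids = tokenizer.encode(text, return_tensors='pt')
model.eval()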

# Generate the summary
summary_ids = model.generate(input_ids, max_length=max_length)

# Decode the summary, skipping special tokens (padding and end-of-sequence)
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

# Find the position of the final dot in the summary
dot_pos = summary.rfind(".")

# Truncate the summary at the final dot
if dot_pos > 0:
    summary = summary[:dot_pos+1]

Token indices sequence length is longer than the specified maximum sequence length for this model (1449 > 512). Running this sequence through the model will result in indexing errors
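
The warning above appears because the article tokenizes to 1449 tokens, while the `t5-base` tokenizer is configured for at most 512. T5 was also trained with a task prefix (`summarize: ` for summarization), which the cell above does not add. The sketch below shows one way to handle both; it is not how the recorded run was produced, so its output will generally differ from the summary shown in this notebook. The helper names `build_t5_input` and `summarize` are hypothetical, and `num_beams=4` and the 60-token summary budget are illustrative assumptions.

```python
import re


def build_t5_input(text: str, prefix: str = "summarize: ") -> str:
    """Collapse runs of whitespace and prepend the T5 task prefix."""
    cleaned = re.sub(r"\s+", " ", text).strip()
    return prefix + cleaned


def summarize(text, model, tokenizer, max_input_tokens=512, max_summary_tokens=60):
    """Summarize `text`, truncating the input to the tokenizer's limit.

    `model` and `tokenizer` are assumed to be the T5 objects loaded earlier.
    """
    inputs = tokenizer(
        build_t5_input(text),
        return_tensors="pt",
        truncation=True,              # drop tokens beyond max_input_tokens
        max_length=max_input_tokens,
    )
    summary_ids = model.generate(
        inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_length=max_summary_tokens,
        num_beams=4,                  # assumption: beam search rather than greedy decoding
        early_stopping=True,
    )
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)


# The prefix helper is plain Python and can be checked without loading a model.
print(build_t5_input("  Two new minerals\n\nwere found.  "))
# → summarize: Two new minerals were found.
```

With the notebook's objects in scope, `summarize(article.text, model, tokenizer)` would return the summary string. Truncation drops everything past the first 512 tokens, so content late in the article cannot influence the result.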


In [28]:
summary.capitalize()

"Scientists have identified two minerals that don't naturally form on earth."
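
Two details of the post-processing are worth noting. The truncation step only looks for a period, and `str.capitalize()` lowercases every character after the first, so a proper noun such as "Earth" comes out as "earth" whatever the model produced. A minimal sketch of an alternative follows; `tidy_summary` is a hypothetical helper name, not part of the original notebook.

```python
def tidy_summary(summary: str) -> str:
    """Trim to the last sentence terminator and uppercase only the first character."""
    # Position of the last '.', '!' or '?' (-1 if none is present).
    end = max(summary.rfind(ch) for ch in ".!?")
    if end > 0:
        summary = summary[:end + 1]
    # Unlike str.capitalize(), this leaves the remaining characters untouched.
    return summary[:1].upper() + summary[1:]


print(tidy_summary("scientists found minerals on Earth. they also"))
# → Scientists found minerals on Earth.
```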