<a href="https://colab.research.google.com/github/Khoawawa/text-summarization/blob/main/text_summarization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

INSTALL DEPENDENCIES

In [53]:
!pip install transformers
!pip install datasets
!pip install accelerate
!pip install evaluate
!pip install numpy
!pip install requests
!pip install bs4
!pip install bert-score



LOAD BART-LARGE-CNN MODEL AND CNN_DAILYMAIL TEST DATASET FOR EVALUATION

In [9]:
from transformers import pipeline

bart_pipe = pipeline("summarization", model = "facebook/bart-large-cnn")




In [10]:
from datasets import load_dataset
ds_test = load_dataset("abisee/cnn_dailymail","3.0.0", split = "test")

EVALUATE BART-LARGE-CNN MODEL USING BERTSCORE

In [45]:
no_eval_articles = 1

In [25]:
def chunked_text(text, chunk_size):
    # Split text into consecutive pieces of at most chunk_size characters (not tokens)
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
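
Because chunked_text slices by character count, a chunk boundary can land in the middle of a word or sentence. The following is a minimal sketch of a sentence-aware alternative, not the method used in this notebook. The name chunk_by_sentences is hypothetical, and the sentence splitter is a rough heuristic that assumes sentences end in '.', '!' or '?' followed by whitespace.

```python
import re


def chunk_by_sentences(text, chunk_size):
    """Group whole sentences into chunks of at most chunk_size characters.

    A single sentence longer than chunk_size is kept whole rather than split.
    """
    # Split after '.', '!' or '?' followed by whitespace (rough heuristic)
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    chunks = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= chunk_size or not current:
            # Still fits, or this is the first sentence of a new chunk
            current = candidate
        else:
            # Adding the sentence would overflow: close the chunk and start a new one
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

It takes the same arguments as chunked_text, so it could be tried inside summarize without other changes. Abbreviations such as "U.S." will be treated as sentence ends by this heuristic.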

In [48]:
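def summarize(bart_pipe, text, chunk_size, chunk_summary_size=128):
    # Split the article into fixed-size character chunks
    chunks = chunked_text(text, chunk_size)

    # Reuse the tokenizer that ships with the pipeline
    tokenizer = bart_pipe.tokenizer

    summaries = []

    for chunk in chunks:
        # Tokenize the chunk to get its length in tokens
        tokenized_chunk = tokenizer(chunk, return_tensors='pt', truncation=True)
        input_length = tokenized_chunk['input_ids'].shape[1]  # Number of tokens in the chunk

        # Cap the summary at half the input length, but never below 2 tokens
        size = max(2, min(chunk_summary_size, input_length // 2))

        summary = bart_pipe(chunk, max_length=size, min_length=1, do_sample=False)[0]['summary_text']

        summaries.append(summary)

    # Concatenate the per-chunk summaries into one summary
    return ' '.join(summaries)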
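
The per-chunk length cap in summarize is half the chunk's token count, limited by chunk_summary_size. The sketch below isolates that arithmetic so it can be checked without loading the model. The function name summary_token_budget and its floor parameter are hypothetical; the floor is only a safeguard for very short chunks, where half the token count would otherwise round down to 0 or 1.

```python
def summary_token_budget(input_length, chunk_summary_size=128, floor=2):
    """Return the max_length to request for a chunk of input_length tokens."""
    # Half the input, capped at chunk_summary_size, and never below floor
    return max(floor, min(chunk_summary_size, input_length // 2))


for tokens in (300, 100, 3):
    print(tokens, "->", summary_token_budget(tokens))
```

A 300-token chunk hits the 128-token cap, a 100-token chunk gets 50 tokens, and a 3-token chunk is raised to the floor of 2.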


In [49]:
bart_summaries = []
ref_summaries = []
CHUNK_SIZE = 1024  # characters per chunk, not tokens
for i in range(no_eval_articles):
    article = ds_test[i]['article']
    summary = ds_test[i]['highlights']
    # SUMMARIZE
    bart_summary = summarize(bart_pipe,article,CHUNK_SIZE)

    bart_summaries.append(bart_summary)
    ref_summaries.append(summary)

In [50]:
from evaluate import load

bert_score = load("bertscore")

results = bert_score.compute(predictions=bart_summaries, references=ref_summaries, model_type="facebook/bart-large-cnn")

f1s = results['f1']
precisions = results['precision']



In [57]:
import numpy

print(f"F1: {numpy.average(f1s)}")
print(f"Precisions: {numpy.average(precisions)}")

F1: 0.5722450613975525
Precisions: 0.5055912137031555
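
The bertscore metric returns per-example lists under the keys precision, recall and f1, and the cells above average two of them. As a minimal sketch, the helper below averages every score list that is present, so recall can be reported alongside the other two. The name mean_scores is hypothetical, and the numbers in example_results are made up for illustration; they are not results from this notebook.

```python
from statistics import mean


def mean_scores(results, keys=("precision", "recall", "f1")):
    """Average each per-example score list found in a BERTScore-style result dict."""
    return {key: mean(results[key]) for key in keys if key in results}


# Made-up scores for two examples, for illustration only
example_results = {
    "precision": [0.25, 0.75],
    "recall": [0.5, 1.0],
    "f1": [0.5, 0.5],
    "hashcode": "hypothetical",
}

for name, value in mean_scores(example_results).items():
    print(f"{name}: {value}")
```

Passing the results dictionary from the earlier cell in place of example_results should give the same averages as the numpy calls, plus the recall.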


GET DATA FROM CNN WEBSITE AND SUMMARIZE IT

In [58]:
import requests
from bs4 import BeautifulSoup

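def scrape_cnn_article(url):
    # Only CNN article pages are supported
    if "cnn.com" not in url:
        return None, None

    # A timeout keeps the cell from hanging on an unresponsive server
    response = requests.get(url, timeout=10)

    if response.status_code != 200:
        print(f"Failed to retrieve the article. Status code: {response.status_code}")
        return None, None

    soup = BeautifulSoup(response.content, "html.parser")

    # Guard against pages without an <h1> element
    heading = soup.find('h1')
    title = heading.get_text() if heading else None

    # Body paragraphs are selected by CNN's CSS classes, which depend on the site layout
    article_body = soup.find_all('p', class_="paragraph inline-placeholder vossi-paragraph")
    content = " ".join([p.get_text() for p in article_body])
    return title, content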
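
The substring test "cnn.com" in url also matches URLs that only mention cnn.com in a path or query string. A stricter check can compare the parsed hostname instead. This is a minimal sketch using only the standard library; the helper name is_cnn_url is hypothetical and is not part of the notebook.

```python
from urllib.parse import urlparse


def is_cnn_url(url):
    """Return True when the URL's host is cnn.com or one of its subdomains."""
    # hostname is None when the URL has no scheme, so fall back to ""
    host = (urlparse(url).hostname or "").lower()
    return host == "cnn.com" or host.endswith(".cnn.com")


print(is_cnn_url("https://edition.cnn.com/2024/10/20/politics/mcdonalds-donald-trump-pennsylvania/index.html"))
print(is_cnn_url("https://example.com/?ref=cnn.com"))
```

URLs without a scheme, such as "cnn.com/story", are rejected by this check because urlparse cannot extract a hostname from them.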

In [59]:
title, content = scrape_cnn_article("https://edition.cnn.com/2024/10/20/politics/mcdonalds-donald-trump-pennsylvania/index.html")

data ={
    'title': title,
    'article': content,
}

In [60]:
cnn_summary = summarize(bart_pipe,data['article'],CHUNK_SIZE)

In [61]:
print(cnn_summary)

The former president stopped by one of the fast-food chain's Pennsylvania franchises. He swapped his suit jacket for an apron to work as a fry attendant. It is the same job Vice President Kamala Harris has said she held as a young woman. Donald Trump visited a McDonald’s in Washington state on Sunday. He told the owner he had always wanted to work at the fast food chain. Trump regularly accuses Hillary Harris of making up her work history at the restaurant. Harris worked the register and manned the fry and ice cream machines, an official says. Her time there was repeatedly referenced onstage at this summer’s Democratic National Convention. Trump has repeatedly questioned the biographies of his rivals, often without merit. He was one of the loudest voices in the debunked “birther’s” movement. During a 2007 deposition, lawyers caught Trump lying at least 30 times over two days. “It’s an innocent form of exaggeration,” he wrote, “and a very effective form of promotion” The former presiden