<a href="https://colab.research.google.com/github/Khoawawa/text-summarization/blob/main/text_summarization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

INSTALL DEPENDENCIES

In [1]:
!pip install -U transformers
!pip install -U datasets
!pip install -U accelerate
!pip install -U evaluate
!pip install -U requests
!pip install -U beautifulsoup4
!pip install -U bert-score



LOAD BART-LARGE-CNN MODEL AND CNN_DAILYMAIL TEST DATASET FOR EVALUATION

In [2]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from datasets import load_dataset

In [3]:
tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large-cnn")

In [4]:
ds_test = load_dataset("abisee/cnn_dailymail","3.0.0", split = "test")

EVALUATE BART-LARGE-CNN MODEL USING BERTSCORE

In [5]:
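def abstract_summarize(text, max_length=250, min_length=30):
    # Truncate to the 1024-token input limit of bart-large-cnn.
    # A single input needs no padding, so padding="max_length" is dropped.
    tokenized_text = tokenizer(text,
                               max_length = 1024,
                               truncation = True,
                               return_tensors = "pt"
                               )

    # Pass the attention mask explicitly instead of leaving generate() to infer it
    output = model.generate(
        tokenized_text["input_ids"],
        attention_mask = tokenized_text["attention_mask"],
        max_length = max_length,
        min_length = min_length
        )

    # Decode the single generated sequence and drop special tokens
    summary = tokenizer.decode(output[0], skip_special_tokens=True)

    return summary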


In [6]:
def summarize(text):
    abstractive_summary = abstract_summarize(text)
    return abstractive_summary

In [7]:
bart_summaries = []
ref_summaries = []
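no_eval_articles = 1  # number of test articles to evaluate
CHUNK_SIZE = 1024  # maximum input length of bart-large-cnn in tokens (not used in this loop)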
for i in range(no_eval_articles):
    article = ds_test[i]['article']
    summary = ds_test[i]['highlights']
    # SUMMARIZE
    bart_summary = summarize(article)

    bart_summaries.append(bart_summary)
    ref_summaries.append(summary)
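
Because abstract_summarize truncates its input to 1024 tokens, anything past that point in a long article never reaches the model. One possible workaround, sketched here as an assumption and not as the notebook's method, is to split the token ids into windows of at most 1024 ids and summarize each window separately. The helper name chunk_token_ids and its overlap parameter are hypothetical.

```python
from typing import List, Sequence


def chunk_token_ids(token_ids: Sequence[int],
                    chunk_size: int = 1024,
                    overlap: int = 0) -> List[List[int]]:
    """Split token ids into consecutive windows of at most chunk_size ids.

    Hypothetical helper: consecutive windows share `overlap` ids so that
    text cut at a window boundary also appears at the start of the next one.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(token_ids), step):
        chunks.append(list(token_ids[start:start + chunk_size]))
        # Stop once a window has reached the end of the sequence
        if start + chunk_size >= len(token_ids):
            break
    return chunks
```

The ids could come from tokenizer(article, add_special_tokens=False)["input_ids"]. If each window is decoded back to text and passed to summarize, the tokenizer adds its own special tokens again, so a window slightly smaller than 1024 may be needed.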

In [8]:
from evaluate import load

bert_score = load("bertscore")

results = bert_score.compute(predictions=bart_summaries, references=ref_summaries, model_type="facebook/bart-large-cnn")
score = {
    'f1': results['f1'],
    'precision':results['precision'],
    'recall': results['recall']
}


In [9]:
import numpy

print(f"F1: {numpy.average(score['f1'])}")
print(f"Precision: {numpy.average(score['precision'])}")
print(f"Recall: {numpy.average(score['recall'])}")

F1: 0.6951242089271545
Precision: 0.7186446189880371
Recall: 0.673094630241394
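
BERTScore returns one precision, recall, and F1 value per prediction, and the cell above reports only their averages. When more than one article is evaluated, the spread can be as informative as the mean. The following is a minimal sketch using only the standard library; the function name aggregate_scores is hypothetical and not part of evaluate or bert-score.

```python
from statistics import mean, pstdev
from typing import Dict, Sequence


def aggregate_scores(scores: Dict[str, Sequence[float]]) -> Dict[str, dict]:
    """Return the mean, population standard deviation, and count per metric."""
    summary = {}
    for name, values in scores.items():
        values = list(values)
        if not values:
            raise ValueError(f"no values for metric {name!r}")
        summary[name] = {
            "mean": mean(values),
            "std": pstdev(values),  # 0.0 when only one article was scored
            "n": len(values),
        }
    return summary
```

Calling aggregate_scores(score) on the dictionary built above would give one entry each for 'f1', 'precision', and 'recall'.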


GET DATA FROM CNN WEBSITE AND SUMMARIZE IT

In [10]:
import requests
from bs4 import BeautifulSoup

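def scrape_cnn_article(url):
    """Return (title, content) for a CNN article URL, or (None, None) on failure."""
    # Only CNN pages are supported, so check before making the request
    if "cnn.com" not in url:
        return None, None

    # A timeout keeps the cell from hanging if the server does not respond
    response = requests.get(url, timeout=10)

    if response.status_code != 200:
        print(f"Failed to retrieve the article. Status code: {response.status_code}")
        return None, None

    soup = BeautifulSoup(response.content, "html.parser")

    # Guard against pages without an <h1> element
    heading = soup.find('h1')
    title = heading.get_text(strip=True) if heading else None

    # CNN-specific class string; it may need updating if the site's markup changes
    article_body = soup.find_all('p', class_="paragraph inline-placeholder vossi-paragraph")
    content = " ".join([p.get_text() for p in article_body])
    return title, content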
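
A substring test such as "cnn.com" in url also accepts URLs that only mention that text in the path or query string. A stricter option, sketched below with the standard library, is to compare the parsed hostname. The helper name is_cnn_url is hypothetical and is not used elsewhere in this notebook.

```python
from urllib.parse import urlparse


def is_cnn_url(url: str) -> bool:
    """Return True only if the URL's hostname is cnn.com or one of its subdomains."""
    # hostname is None for strings that do not parse as a URL with a network location
    host = (urlparse(url).hostname or "").lower()
    return host == "cnn.com" or host.endswith(".cnn.com")
```

It could replace the substring check at the top of scrape_cnn_article, so that requests.get is only called for hosts under cnn.com.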

In [11]:
title, content = scrape_cnn_article("https://edition.cnn.com/2024/10/20/politics/mcdonalds-donald-trump-pennsylvania/index.html")

data ={
    'title': title,
    'article': content,
}

In [13]:
cnn_summary = summarize(data['article'])

In [14]:
print(cnn_summary)

Donald Trump stopped by a McDonald’s in Pennsylvania during his Sunday swing. He handed customers food through the drive-thru window, telling them he had made it himself. It's the same job Vice President Kamala Harris has said she held as a young woman. Trump has grown fixated on Harris’ employment there.
