<a href="https://colab.research.google.com/github/Jaywestty/News-Crime-Classification/blob/main/Text_summarizer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### **NEWS TEXT SUMMARIZER PROJECT**

#### **Project Description:**
This project fine-tunes a Hugging Face Transformers model to summarize news articles into concise, factual highlights. The model starts from the `facebook/bart-base` checkpoint, chosen because BART is well suited to abstractive summarization and the base variant (about 139 million parameters) is light enough to train within Google Colab's free-tier resource limits. The CNN/DailyMail dataset (version 3.0.0), loaded from the Hugging Face Hub, supplies news articles paired with human-written highlights for training and evaluation; to fit Colab's limits, only 8,000 training, 800 validation, and 800 test examples are used. The goal is short, accurate, readable summaries that retain the key points of the original article, making them useful for quick news consumption.

#### **Install dependencies**

In [3]:
!pip install transformers datasets evaluate rouge_score accelerate nltk -q


#### **Import required libraries**

In [4]:
from datasets import load_dataset
from transformers import BartForConditionalGeneration, BartTokenizer, DataCollatorForSeq2Seq, Seq2SeqTrainer, Seq2SeqTrainingArguments
import numpy as np
import torch
import nltk
import gc
import evaluate
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [5]:
# Free Python and cached GPU memory to stay within Colab's limits
def clear_memory():
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()

#### **Load Bart tokenizer and model**

In [6]:
model_name = "facebook/bart-base"
tokenizer = BartTokenizer.from_pretrained(model_name)
model = BartForConditionalGeneration.from_pretrained(model_name)

print(f"Model loaded! Parameters: {model.num_parameters():,}")
clear_memory()


Model loaded! Parameters: 139,420,416


#### **Load Dataset**

In [7]:
print("Loading CNN-DailyMail dataset...")
dataset = load_dataset('cnn_dailymail', '3.0.0')

print("Sample article:\n", dataset['train'][0]['article'][:200])
print("\nSample summary:\n", dataset['train'][0]['highlights'])

# Reduce dataset for Colab constraints
train_dataset = dataset['train'].select(range(8000))  # Slightly smaller for BART
val_dataset = dataset['validation'].select(range(800))
test_dataset = dataset['test'].select(range(800))

print(f"Dataset sizes - Train: {len(train_dataset)}, Val: {len(val_dataset)}, Test: {len(test_dataset)}")

Loading CNN-DailyMail dataset...



Sample article:
 LONDON, England (Reuters) -- Harry Potter star Daniel Radcliffe gains access to a reported £20 million ($41.1 million) fortune as he turns 18 on Monday, but he insists the money won't cast a spell on 

Sample summary:
 Harry Potter star Daniel Radcliffe gets £20M fortune as he turns 18 Monday .
Young actor says he has no plans to fritter his cash away .
Radcliffe's earnings from first five Potter films have been held in trust fund .
Dataset sizes - Train: 8000, Val: 800, Test: 800
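
To put the reduced dataset in context, the short sketch below computes what share of each full split the notebook actually uses. The full split sizes (287,113 / 13,368 / 11,490 examples) are those of the `3.0.0` configuration; the helper name `subset_share` is only illustrative and not part of the notebook.

```python
# Full split sizes of the cnn_dailymail 3.0.0 configuration.
full_sizes = {"train": 287_113, "validation": 13_368, "test": 11_490}
# Subset sizes selected in this notebook.
subset_sizes = {"train": 8_000, "validation": 800, "test": 800}


def subset_share(subset, full):
    """Return the percentage of each full split covered by the subset."""
    return {name: round(100 * subset[name] / full[name], 2) for name in subset}


shares = subset_share(subset_sizes, full_sizes)
for name, pct in shares.items():
    print(f"{name}: {subset_sizes[name]} of {full_sizes[name]} examples ({pct}%)")
```

The training subset is under 3% of the available articles, so the ROUGE scores reported later should be read as a quick baseline rather than what the model could reach with the full data.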


#### **BART-specific preprocessing**

In [8]:
max_input_length = 1024  # Maximum input length (in tokens) supported by bart-base
max_target_length = 142  # CNN-DM standard summary length

def preprocess(example):
    # Tokenize the articles, truncating and padding each one to max_input_length
    model_inputs = tokenizer(
        example['article'],
        max_length=max_input_length,
        truncation=True,
        padding="max_length"
    )

    # Tokenize the reference summaries as targets; text_target replaces the
    # deprecated tokenizer.as_target_tokenizer() context manager
    labels = tokenizer(
        text_target=example['highlights'],
        max_length=max_target_length,
        truncation=True,
        padding="max_length"
    )

    # Replace pad token id with -100 so padded positions are ignored by the loss
    labels_ids = labels["input_ids"]
    labels_ids = [
        [(token if token != tokenizer.pad_token_id else -100) for token in label]
        for label in labels_ids
    ]
    model_inputs["labels"] = labels_ids
    return model_inputs


print("Preprocessing datasets...")
train_tokenized = train_dataset.map(preprocess, batched=True, remove_columns=train_dataset.column_names)
val_tokenized = val_dataset.map(preprocess, batched=True, remove_columns=val_dataset.column_names)

clear_memory()

Preprocessing datasets...



In [9]:
print(train_tokenized)

Dataset({
    features: ['input_ids', 'attention_mask', 'labels'],
    num_rows: 8000
})


In [10]:
print("Pad token ID:", tokenizer.pad_token_id)
print("Vocab size:", tokenizer.vocab_size)


Pad token ID: 1
Vocab size: 50265
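
The pad token ID matters because the preprocessing step swaps it for `-100` in the labels, and the metric function later swaps it back before decoding. The sketch below shows that round trip on a made-up label sequence. It is plain Python rather than the notebook's own code, the two middle token IDs are hypothetical, and `-100` is the default `ignore_index` of PyTorch's cross-entropy loss.

```python
PAD_TOKEN_ID = 1      # value printed above for the BART tokenizer
IGNORE_INDEX = -100   # label value skipped by PyTorch's cross-entropy loss


def mask_padding(label_ids, pad_id=PAD_TOKEN_ID):
    """Replace padding positions so they do not contribute to the loss."""
    return [tok if tok != pad_id else IGNORE_INDEX for tok in label_ids]


def unmask_padding(label_ids, pad_id=PAD_TOKEN_ID):
    """Restore the pad token so the sequence can be decoded again."""
    return [tok if tok != IGNORE_INDEX else pad_id for tok in label_ids]


# Hypothetical padded labels: <s> (0), two word pieces, </s> (2), then padding.
padded = [0, 20920, 232, 2, 1, 1, 1]
masked = mask_padding(padded)
restored = unmask_padding(masked)

print(masked)    # [0, 20920, 232, 2, -100, -100, -100]
print(restored)  # [0, 20920, 232, 2, 1, 1, 1]
```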


#### **Data Collator (Dynamic Padding)**

In [11]:
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)
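
As a rough illustration of what dynamic padding means, the sketch below pads each batch only to the length of its longest sequence, using the pad token (ID 1) for inputs and `-100` for labels, which is the collator's default label padding value. It is a simplified plain-Python stand-in, not the collator's actual implementation, and the token IDs are made up.

```python
def pad_batch(sequences, pad_value):
    """Right-pad every sequence to the length of the longest one in the batch."""
    longest = max(len(seq) for seq in sequences)
    return [seq + [pad_value] * (longest - len(seq)) for seq in sequences]


# Hypothetical token IDs for a batch of two examples.
batch_input_ids = [[0, 11, 12, 2], [0, 21, 22, 23, 24, 2]]
batch_labels = [[0, 31, 2], [0, 41, 42, 2]]

padded_inputs = pad_batch(batch_input_ids, pad_value=1)    # pad token ID
padded_labels = pad_batch(batch_labels, pad_value=-100)    # ignored by the loss

print(padded_inputs)  # [[0, 11, 12, 2, 1, 1], [0, 21, 22, 23, 24, 2]]
print(padded_labels)  # [[0, 31, 2, -100], [0, 41, 42, 2]]
```

Note that in this notebook the preprocessing step already pads every example to the maximum length, so the collator has nothing left to pad. Removing `padding="max_length"` from `preprocess` would likely let the collator pad per batch and save memory, though that variant was not run here.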

#### **Load ROUGE for Evaluation**

In [12]:
rouge = evaluate.load("rouge")

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
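    # Gathered predictions may be padded with -100; restore the pad token before decoding
    predictions = np.where(predictions != -100, predictions, tokenizer.pad_token_id)
    predictions = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    # Undo the -100 label masking applied during preprocessing
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    labels = tokenizer.batch_decode(labels, skip_special_tokens=True)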

    result = rouge.compute(predictions=predictions, references=labels, use_stemmer=True)
    return {k: round(v * 100, 2) for k, v in result.items()}
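
For intuition about what the ROUGE-1 score measures, here is a deliberately simplified version: the F1 of unigram overlap between a prediction and a reference. It skips the stemming and tokenization rules of the `rouge_score` package used above, so its numbers will not match `evaluate` exactly; the function name and example sentences are illustrative only.

```python
from collections import Counter


def rouge1_f1(prediction, reference):
    """Simplified ROUGE-1: F1 of whitespace-token overlap, without stemming."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Multiset intersection counts each shared token at most as often as it
    # appears in both texts.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


reference = "radcliffe gets fortune as he turns 18"
prediction = "radcliffe turns 18 and gets his fortune"
score = rouge1_f1(prediction, reference)
print(round(score * 100, 2))  # 71.43 (5 shared tokens out of 7 on each side)
```

ROUGE-2 applies the same idea to bigrams, and ROUGE-L uses the longest common subsequence instead of a bag of tokens.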


#### **Training Arguments**

In [13]:
training_args = Seq2SeqTrainingArguments(
    output_dir="./bart-news-summarizer",
    eval_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=2,  # Small for Colab memory
    per_device_eval_batch_size=2,
    weight_decay=0.01,
    save_total_limit=2,
    num_train_epochs=1,
    predict_with_generate=True,
    fp16=torch.cuda.is_available(),
    logging_dir='./logs',
)
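
With these settings, the number of optimizer steps can be estimated ahead of time. The sketch below assumes a single device and no gradient accumulation (the default), and the helper `optimizer_steps` is hypothetical rather than a Transformers function.

```python
import math


def optimizer_steps(num_examples, per_device_batch_size, num_epochs, num_devices=1):
    """Estimate training steps, assuming no gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / (per_device_batch_size * num_devices))
    return steps_per_epoch * num_epochs


steps = optimizer_steps(num_examples=8000, per_device_batch_size=2, num_epochs=1)
print(steps)  # 4000
```

This agrees with the `global_step=4000` reported by the training run in this notebook.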


#### **Trainer Setup**

In [14]:
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_tokenized,
    eval_dataset=val_tokenized,
    processing_class=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

#### **Train the Model**

In [15]:
clear_memory()
trainer.train()



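| Epoch | Training Loss | Validation Loss | Rouge1 | Rouge2 | Rougel | Rougelsum |
|-------|---------------|-----------------|--------|--------|--------|-----------|
| 1 | 2.0609 | 2.12732 | 24.34 | 9.83 | 19.93 | 22.31 |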




TrainOutput(global_step=4000, training_loss=2.253662353515625, metrics={'train_runtime': 1218.5305, 'train_samples_per_second': 6.565, 'train_steps_per_second': 3.283, 'total_flos': 4877891665920000.0, 'train_loss': 2.253662353515625, 'epoch': 1.0})

#### **Evaluation of the Model**

In [16]:
metrics = trainer.evaluate()
print(metrics)

{'eval_loss': 2.127319574356079, 'eval_rouge1': 24.34, 'eval_rouge2': 9.83, 'eval_rougeL': 19.93, 'eval_rougeLsum': 22.31, 'eval_runtime': 164.7847, 'eval_samples_per_second': 4.855, 'eval_steps_per_second': 2.427, 'epoch': 1.0}
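
One way to read `eval_loss` is to convert it to perplexity. Assuming the loss is the mean per-token cross-entropy in nats over the unmasked label tokens (the usual case for this kind of model), perplexity is simply its exponential. The snippet is a small worked calculation, not part of the original notebook.

```python
import math


def perplexity_from_loss(mean_cross_entropy):
    """Convert a mean per-token cross-entropy (in nats) to perplexity."""
    return math.exp(mean_cross_entropy)


eval_loss = 2.127319574356079  # value reported by trainer.evaluate() above
perplexity = perplexity_from_loss(eval_loss)
print(f"Validation perplexity: {perplexity:.2f}")
```

The result is roughly 8.4, which can be read loosely as the model being about as uncertain as a uniform choice among eight or nine tokens at each step of the reference summaries.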


#### **Test the Model on an article**

In [30]:
import re

def clean_and_merge_article(article):
    # Step 1: Clean article text
    article = re.sub(r"\s+", " ", article.strip())  # collapse spaces & newlines
    article = article.replace(" ,", ",").replace(" .", ".")  # fix space before punctuation

    # Step 2: Summarize with the fine-tuned model (beam search, 56 to 142 tokens)
    inputs = tokenizer(article, return_tensors="pt", max_length=1024, truncation=True).to(model.device)
    summary_ids = model.generate(**inputs, max_length=142, min_length=56, length_penalty=2.0, num_beams=4)
    raw_summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

    # Step 3: Merge summary into one sentence
    summary = re.sub(r'\s+', ' ', raw_summary.strip())
    sentences = re.split(r'(?<=[.!?])\s+(?=[A-Z])', summary)
    sentences = [s.strip(" .") for s in sentences if s.strip()]

    if not sentences:
        return ""
    if len(sentences) == 1:
        return sentences[0] + "."

    merged = ", ".join(sentences[:-1]) + " and " + sentences[-1]
    return merged.strip() + "."
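
The merging step (Step 3) can be tried on its own, without loading the model. The sketch below repeats the same regular expressions in a standalone helper; the name `merge_sentences` and the sample text are made up for illustration.

```python
import re


def merge_sentences(raw_summary):
    """Join the sentences of a summary into one sentence, as in Step 3 above."""
    summary = re.sub(r"\s+", " ", raw_summary.strip())
    # Split after ., ! or ? when the next sentence starts with a capital letter.
    sentences = re.split(r"(?<=[.!?])\s+(?=[A-Z])", summary)
    sentences = [s.strip(" .") for s in sentences if s.strip()]

    if not sentences:
        return ""
    if len(sentences) == 1:
        return sentences[0] + "."
    return ", ".join(sentences[:-1]) + " and " + sentences[-1] + "."


sample = "Tinubu pays tribute. Ogbeh served as minister.  He was a patriot."
merged = merge_sentences(sample)
print(merged)
# Tinubu pays tribute, Ogbeh served as minister and He was a patriot.
```

As the sample shows, the merge keeps the capital letter at the start of every joined sentence and links clauses only with commas and "and", so the result can read awkwardly for longer summaries.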


In [36]:
article = """
          President Bola Tinubu has paid tribute to a former Minister of Agriculture and ex-Peoples Democratic Party’s National Chairman, Chief Audu Ogbeh, describing him as a patriot who believed deeply in the nation’s potential.The family of the former minister announced his death in a statement earlier on Saturday.In a statement on Saturday by his Special Adviser on Information and Strategy, Bayo Onanuga, the President said the late politician was “always ready with facts and figures to support his propositions” and “a man of strong convictions who spoke the truth as he saw it.”The President added that Ogbeh’s contributions to Nigeria’s agricultural development, political stability, and democratic growth “will remain indelible in the country’s history.”Tinubu extended condolences to the government and people of Benue State, as well as Ogbeh’s family, friends, and associates, praying “the Almighty God will receive his soul and comfort his family at this difficult time.”

"""

In [37]:
clean = clean_and_merge_article(article)
print(clean)

President Bola Tinubu pays tribute to former Minister of Agriculture and ex-Peoples Democratic Party's National Chairman, Ogbeh's contributions to Nigeria's agricultural development, political stability, democratic growth will remain indelible in the country's history and Tinubu extended condolences to the government and people of Benue State, as well as Ogbeh's family.
