<a href="https://colab.research.google.com/github/Jaywestty/News-Crime-Classification/blob/main/News_Text_summarizer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### **NEWS TEXT SUMMARIZER PROJECT**

#### **Project Description:**
This project aims to automatically summarize news articles into concise, factual highlights using Hugging Face Transformers. The summarization model is based on the bart-base architecture, chosen for its strong performance on abstractive summarization while remaining lightweight enough to run within Google Colab's free-tier resource limits. The dataset, sourced from Hugging Face’s public datasets repository, contains diverse news articles for training and evaluation. The system is designed to generate short, accurate, and easily readable summaries that retain the key points of the original article, making it useful for quick news consumption.

#### **Install dependencies**

In [None]:
!pip install transformers datasets evaluate rouge_score accelerate nltk -q


#### **Import required libraries**

In [None]:
from datasets import load_dataset
from transformers import BartForConditionalGeneration, BartTokenizer, DataCollatorForSeq2Seq, Seq2SeqTrainer, Seq2SeqTrainingArguments
import numpy as np
import torch
import nltk
import gc
import evaluate
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
# Free Python and CUDA memory to stay within Colab's limits
def clear_memory():
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()

#### **Load BART Tokenizer and Model**  

For fine-tuning, we load the **BART-base** model and tokenizer directly from Hugging Face.  

- **BART-base** is chosen over **T5-small** because:  
  - It generally produces **higher-quality summaries**.  
  - It balances performance with efficiency, making it suitable for **Colab free tier GPUs**.  


In [None]:
model_name = "facebook/bart-base"
tokenizer = BartTokenizer.from_pretrained(model_name)
model = BartForConditionalGeneration.from_pretrained(model_name)

print(f"Model loaded! Parameters: {model.num_parameters():,}")
clear_memory()

Model loaded! Parameters: 139,420,416
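
As a rough check on the Colab footprint, the parameter count printed above can be converted into memory. The sketch below assumes 4 bytes per parameter for fp32 weights, and uses a common rule of thumb of about 16 bytes per parameter for full fine-tuning with an Adam-style optimizer (weights, gradients, and two optimizer moments). Activations and mixed-precision copies are not counted, so the real figure during training will be higher.

```python
NUM_PARAMS = 139_420_416  # from the "Model loaded!" line above

def weights_megabytes(num_params, bytes_per_param=4):
    """Approximate storage in MB (10^6 bytes) for a given bytes-per-parameter."""
    return num_params * bytes_per_param / 1e6

fp32_weights_mb = weights_megabytes(NUM_PARAMS)
# Assumption: ~16 bytes/param for weights + gradients + two Adam moments in fp32
train_state_mb = weights_megabytes(NUM_PARAMS, bytes_per_param=16)

print(f"fp32 weights: ~{fp32_weights_mb:.0f} MB")
print(f"weights + grads + Adam moments: ~{train_state_mb / 1000:.1f} GB")
```

Both estimates are back-of-the-envelope numbers, useful mainly for judging whether a batch size of 2 leaves enough headroom on a free-tier GPU.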


#### **Load Dataset**

We use the **CNN/DailyMail dataset** provided by Hugging Face Datasets.  

Due to restricted GPU access on Colab, we work with a **subset**:  
- 8,000 samples from the **training set**  
- 800 samples from the **validation set**  
- 800 samples from the **test set**  
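
The notebook takes the first rows of each split. If a random subset is preferred, one possible approach (an assumption, not what was run here) is to draw reproducible indices and pass them to `Dataset.select`. The sketch below uses only the standard library; `sample_indices` and the seed value are illustrative, and the split sizes are those of CNN/DailyMail 3.0.0.

```python
import random

# Full split sizes of CNN/DailyMail 3.0.0 and the subset sizes used in this notebook
SPLIT_SIZES = {"train": 287_113, "validation": 13_368, "test": 11_490}
SUBSET_SIZES = {"train": 8_000, "validation": 800, "test": 800}

def sample_indices(split, seed=42):
    """Return sorted, reproducible row indices for a random subset of `split`."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(SPLIT_SIZES[split]), SUBSET_SIZES[split]))

train_idx = sample_indices("train")
print(len(train_idx), train_idx[:3])
```

The indices could then be used as `dataset['train'].select(train_idx)`. Doing so changes which articles are used, so the scores reported in this notebook would not be reproduced exactly.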

In [None]:
print("Loading CNN-DailyMail dataset...")
dataset = load_dataset('cnn_dailymail', '3.0.0')

print("Sample article:\n", dataset['train'][0]['article'][:200])
print("\nSample summary:\n", dataset['train'][0]['highlights'])

# Reduce dataset for Colab constraints
train_dataset = dataset['train'].select(range(8000))  # Slightly smaller for BART
val_dataset = dataset['validation'].select(range(800))
test_dataset = dataset['test'].select(range(800))

print(f"Dataset sizes - Train: {len(train_dataset)}, Val: {len(val_dataset)}, Test: {len(test_dataset)}")

Loading CNN-DailyMail dataset...


Sample article:
 LONDON, England (Reuters) -- Harry Potter star Daniel Radcliffe gains access to a reported £20 million ($41.1 million) fortune as he turns 18 on Monday, but he insists the money won't cast a spell on 

Sample summary:
 Harry Potter star Daniel Radcliffe gets £20M fortune as he turns 18 Monday .
Young actor says he has no plans to fritter his cash away .
Radcliffe's earnings from first five Potter films have been held in trust fund .
Dataset sizes - Train: 8000, Val: 800, Test: 800


#### **BART-specific preprocessing**

Before training, we need to prepare the text for BART:  
- Tokenize the input and target texts  
- Truncate or pad sequences to a fixed length  
- Format inputs and labels for Seq2Seq training  
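
The padding and label-formatting steps can be illustrated on a toy sequence without loading the tokenizer. This is a simplified sketch: the token ids are made up, `pad_and_mask` is a hypothetical helper, and truncation is done by plain slicing (the real tokenizer also keeps the end-of-sequence token when it truncates). The value `-100` is the index that PyTorch's cross-entropy loss ignores by default.

```python
PAD_ID = 1           # BART's pad token id
IGNORE_INDEX = -100  # label value skipped by PyTorch cross-entropy loss

def pad_and_mask(label_ids, max_length):
    """Truncate or pad label ids to max_length, then mask padding with -100."""
    clipped = label_ids[:max_length]
    padded = clipped + [PAD_ID] * (max_length - len(clipped))
    return [tok if tok != PAD_ID else IGNORE_INDEX for tok in padded]

toy_labels = [0, 713, 16, 10, 1296, 2]  # made-up ids: <s> ... </s>
masked = pad_and_mask(toy_labels, max_length=8)
print(masked)  # [0, 713, 16, 10, 1296, 2, -100, -100]
```

Masking matters because otherwise the model would be rewarded for predicting padding, which dominates short summaries padded to a fixed length.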



In [None]:
max_input_length = 1024  # BART's maximum input length in tokens
max_target_length = 142  # summary length cap commonly used for CNN/DailyMail

def preprocess(example):
    # Tokenize the articles; pad to a fixed length so every row has the same shape
    model_inputs = tokenizer(
        example['article'],
        max_length=max_input_length,
        truncation=True,
        padding="max_length"
    )

    # Tokenize the reference summaries as targets.
    # `text_target` replaces the deprecated `as_target_tokenizer()` context manager.
    labels = tokenizer(
        text_target=example['highlights'],
        max_length=max_target_length,
        truncation=True,
        padding="max_length"
    )

    # Replace pad token ids with -100 so padding is ignored by the loss
    model_inputs["labels"] = [
        [(token if token != tokenizer.pad_token_id else -100) for token in label]
        for label in labels["input_ids"]
    ]
    return model_inputs


print("Preprocessing datasets...")
train_tokenized = train_dataset.map(preprocess, batched=True, remove_columns=train_dataset.column_names)
val_tokenized = val_dataset.map(preprocess, batched=True, remove_columns=val_dataset.column_names)

clear_memory()

Preprocessing datasets...



In [None]:
print(train_tokenized)

Dataset({
    features: ['input_ids', 'attention_mask', 'labels'],
    num_rows: 8000
})


In [None]:
print("Pad token ID:", tokenizer.pad_token_id)
print("Vocab size:", tokenizer.vocab_size)


Pad token ID: 1
Vocab size: 50265


#### **Load ROUGE for Evaluation**


In [None]:
rouge = evaluate.load("rouge")

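def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    if isinstance(predictions, tuple):
        predictions = predictions[0]
    # -100 marks ignored positions and cannot be decoded; swap it for the pad token
    predictions = np.where(predictions != -100, predictions, tokenizer.pad_token_id)
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)

    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    result = rouge.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)
    # Report scores as percentages
    return {k: round(v * 100, 2) for k, v in result.items()}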
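
To make the metric less of a black box, here is a simplified sketch of what ROUGE-1 measures: the F1 score of unigram overlap between a prediction and a reference. It is not the `rouge_score` implementation; it assumes lowercased whitespace tokens and applies no stemming, so its numbers will differ somewhat from those reported by `evaluate`. The function name `rouge1_f1` is illustrative.

```python
from collections import Counter

def rouge1_f1(prediction, reference):
    """Unigram-overlap F1 on lowercased whitespace tokens (no stemming)."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared token
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

score = rouge1_f1("the cat sat", "the cat sat on the mat")
print(round(score * 100, 2))  # 66.67
```

In this example every predicted word appears in the reference (precision 1.0) but only half of the reference words are covered (recall 0.5), which gives an F1 of about 0.67.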


#### **Train the Model**

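The model is trained on the reduced dataset:  
- Training loss is logged every 100 steps  
- Validation loss and ROUGE are computed at the end of each epoch to watch for overfitting  
- A checkpoint is saved at the end of each epoch, and only the two most recent are kept (`save_total_limit=2`)  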
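
The training cell in this notebook keeps per-epoch checkpoints and the best one is chosen by hand afterwards. One possible alternative, shown here as an assumption rather than what was run, is to let the Trainer reload the best checkpoint automatically. The dictionary below only collects the relevant `Seq2SeqTrainingArguments` options; the name `best_model_kwargs` is hypothetical.

```python
# Hypothetical extra keyword arguments; not part of the original training cell.
best_model_kwargs = {
    "load_best_model_at_end": True,        # reload the best checkpoint after training
    "metric_for_best_model": "eval_loss",  # rank checkpoints by validation loss
    "greater_is_better": False,            # lower loss is better
}

# Intended use (requires transformers):
#   Seq2SeqTrainingArguments(..., eval_strategy="epoch", save_strategy="epoch",
#                            **best_model_kwargs)
print(sorted(best_model_kwargs))
```

This option requires the evaluation and save strategies to match, which is the case when both are set to `"epoch"`.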


In [None]:
#Training Arguments
training_args = Seq2SeqTrainingArguments(
    output_dir="./bart-news-summarizer",
    eval_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    weight_decay=0.01,
    save_total_limit=2,
    num_train_epochs=3,   # Increase to 3 (sweet spot for Colab free tier)
    predict_with_generate=True,
    fp16=torch.cuda.is_available(),
    logging_dir='./logs',
    logging_steps=100,   # Log progress every 100 steps
    save_strategy="epoch"  # Save at the end of each epoch
)


In [None]:
#Trainer Setup
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_tokenized,
    eval_dataset=val_tokenized,
    processing_class=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

In [None]:
clear_memory()
trainer.train()



| Epoch | Training Loss | Validation Loss | ROUGE-1 | ROUGE-2 | ROUGE-L | ROUGE-Lsum |
|-------|---------------|-----------------|---------|---------|---------|------------|
| 1 | 2.1622 | 2.242791 | 23.44 | 9.01 | 19.2 | 21.54 |
| 2 | 1.6402 | 2.230019 | 24.66 | 9.58 | 19.92 | 22.48 |
| 3 | 1.389 | 2.266821 | 24.27 | 9.56 | 19.82 | 22.28 |


TrainOutput(global_step=12000, training_loss=1.7839368540445963, metrics={'train_runtime': 2612.7081, 'train_samples_per_second': 9.186, 'train_steps_per_second': 4.593, 'total_flos': 1.463367499776e+16, 'train_loss': 1.7839368540445963, 'epoch': 3.0})

#### **Best Model Selection**  

After training, **epoch 2** was identified as the best-performing checkpoint.  

- **Training Loss:** 1.64  
- **Validation Loss:** 2.23  
- **ROUGE-1:** 24.66  
- **ROUGE-2:** 9.58  
- **ROUGE-L:** 19.92  
- **ROUGE-Lsum:** 22.48  

Epoch 2 has the lowest validation loss and the highest ROUGE scores of the three epochs. In epoch 3 the training loss keeps falling (to 1.39) while the validation loss rises to 2.27 and ROUGE dips slightly, which suggests the model is starting to overfit.  

For this reason, **the epoch 2 checkpoint is kept as the final model**, which will be used for deployment.  
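
The same selection can be made programmatically from the logged metrics. The sketch below copies the per-epoch values from the training log into a list (the variable names are illustrative) and picks the best epoch by validation loss and by ROUGE-Lsum.

```python
# Per-epoch metrics copied from the training log
history = [
    {"epoch": 1, "train_loss": 2.1622, "val_loss": 2.242791, "rougeLsum": 21.54},
    {"epoch": 2, "train_loss": 1.6402, "val_loss": 2.230019, "rougeLsum": 22.48},
    {"epoch": 3, "train_loss": 1.3890, "val_loss": 2.266821, "rougeLsum": 22.28},
]

best_by_loss = min(history, key=lambda row: row["val_loss"])
best_by_rouge = max(history, key=lambda row: row["rougeLsum"])
print(best_by_loss["epoch"], best_by_rouge["epoch"])  # 2 2

# The gap between validation and training loss widens every epoch
gaps = [round(row["val_loss"] - row["train_loss"], 4) for row in history]
print(gaps)
```

Both criteria agree on epoch 2, and the widening gap between validation and training loss is the usual sign that further epochs mostly fit the training subset.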


#### **Evaluation of Model**

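We evaluate the model on the validation subset using **ROUGE scores** (ROUGE-1, ROUGE-2, ROUGE-L, ROUGE-Lsum).  
`trainer.evaluate()` scores the weights currently in memory, which are those from the end of epoch 3, so the numbers below match the epoch 3 results rather than the epoch 2 checkpoint.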

In [None]:
metrics = trainer.evaluate()
print(metrics)

{'eval_loss': 2.2668209075927734, 'eval_rouge1': 24.27, 'eval_rouge2': 9.56, 'eval_rougeL': 19.82, 'eval_rougeLsum': 22.28, 'eval_runtime': 192.8242, 'eval_samples_per_second': 4.149, 'eval_steps_per_second': 2.074, 'epoch': 3.0}


#### **Testing Best Checkpoint (Epoch 2) on an article**

Before saving, we reload the **epoch 2 checkpoint** and run it on a sample article.  
This is a quick sanity check that the model generates **coherent and concise summaries** before we commit it as the final saved version.  

A single article cannot establish how well the checkpoint generalizes; it only confirms that generation works end to end on text from outside the CNN/DailyMail data.  
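
Checkpoint folders are named after the optimizer step at which they were written, so the folder for a given epoch can be derived from the training setup. The sketch below assumes a single GPU and no gradient accumulation, which is consistent with the 12,000 total steps reported for 3 epochs; `checkpoint_for_epoch` is a hypothetical helper.

```python
import math

TRAIN_SAMPLES = 8000
BATCH_SIZE = 2  # per_device_train_batch_size
OUTPUT_DIR = "./bart-news-summarizer"

def checkpoint_for_epoch(epoch):
    """Path of the checkpoint written at the end of `epoch` (1-based).

    Assumes one device and no gradient accumulation.
    """
    steps_per_epoch = math.ceil(TRAIN_SAMPLES / BATCH_SIZE)
    return f"{OUTPUT_DIR}/checkpoint-{steps_per_epoch * epoch}"

print(checkpoint_for_epoch(2))  # ./bart-news-summarizer/checkpoint-8000
```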


In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Point directly to epoch 2 checkpoint
model_path = "./bart-news-summarizer/checkpoint-8000"

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSeq2SeqLM.from_pretrained(model_path)

In [None]:
print(model)

BartForConditionalGeneration(
  (model): BartModel(
    (shared): BartScaledWordEmbedding(50265, 768, padding_idx=1)
    (encoder): BartEncoder(
      (embed_tokens): BartScaledWordEmbedding(50265, 768, padding_idx=1)
      (embed_positions): BartLearnedPositionalEmbedding(1026, 768)
      (layers): ModuleList(
        (0-5): 6 x BartEncoderLayer(
          (self_attn): BartAttention(
            (k_proj): Linear(in_features=768, out_features=768, bias=True)
            (v_proj): Linear(in_features=768, out_features=768, bias=True)
            (q_proj): Linear(in_features=768, out_features=768, bias=True)
            (out_proj): Linear(in_features=768, out_features=768, bias=True)
          )
          (self_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (activation_fn): GELUActivation()
          (fc1): Linear(in_features=768, out_features=3072, bias=True)
          (fc2): Linear(in_features=3072, out_features=768, bias=True)
          (final_layer_n

In [None]:
import re

def clean_and_merge_article(article):
    # Step 1: Clean article text
    article = re.sub(r"\s+", " ", article.strip())  # collapse spaces & newlines
    article = article.replace(" ,", ",").replace(" .", ".")  # fix space before punctuation

    # Step 2: Summarize using your model
    inputs = tokenizer(article, return_tensors="pt", max_length=1024, truncation=True).to(model.device)
    summary_ids = model.generate(**inputs, max_length=200, min_length=80, length_penalty=2.0, num_beams=4)
    raw_summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

    # Step 3: Merge summary into one sentence
    summary = re.sub(r'\s+', ' ', raw_summary.strip())
    sentences = re.split(r'(?<=[.!?])\s+(?=[A-Z])', summary)
    sentences = [s.strip(" .") for s in sentences if s.strip()]

    if not sentences:
        return ""
    if len(sentences) == 1:
        return sentences[0] + "."

    merged = ", ".join(sentences[:-1]) + " and " + sentences[-1]
    return merged.strip() + "."


In [None]:
article = """
The Vice Chancellor, Federal University of Technology and Environmental Sciences, Iyin Ekiti, Prof. Gbenga Aribisala, has said that the new institution will begin the admission process in September.

Aribisala said that the admission process would follow the National Universities Commission Resource Verification exercise taking place soon in the university.

The VC, who spoke in Ado Ekiti on Sunday at a reception to celebrate the 90th birthday of his mother, Deaconess Felicia Aribisala, also canvassed support from well-meaning Nigerians to the institution, saying, “A technology-based institution of this nature is capital-intensive”.

He said, “The NUC is coming for Resource Verification of all the 36 programmes that we are trying to offer. As soon as they come, by the special grace of God, we have provided those things that will be needed.
“We have provided a modern laboratory for all the programmes. We have a library now. We have classrooms fixed.

“We have offices and furniture fixed. We have all of those things. So we are very confident we are going to scale through.

“By the time we now scale through, by the special grace of God, by September this year, we are going to ask those who are interested in our university to do Change of University, and admission will begin. That is the icing. And after that, recruitment of staff will just follow”.

The VC, who said that funding of education should not be left to the government alone, said, “Universities need a lot of funding. Funding is a major challenge. You have to provide facilities and all of those things.
“So, as I speak to you, we (FUTES) do not have enough funds. That’s why we keep appealing and going to people because the government cannot do it all alone. We have been visiting some people who are public-spirited, people who like education, tertiary education.

“If we have people who want to donate buildings, we are going to name such after them; people who want to give scholarships; people who want to build hostels in such a manner that it is their own and they will take rent and all of those things.

“I think the funding is crucial because if you look at the nature of our university, University of Technology and Environmental Sciences, it is capital-intensive, it is technology-based.

“It means we need a lot of equipment. As I said, the government cannot do everything. So we need help at this time financially,” the VC said.

Aribisala disclosed that the land issue, which could have been a challenge to the university, had been resolved amicably with an agreement made with the concerned families.

“As I speak to you now, it has been resolved. The 200 hectares that have been donated to the university are very intact.

“There has been an agreement. The community and government will also pay some compensation to the families.

“So they are now at peace. The community is not trying to force the land. I think that was the kind of misconception that happened at the time,” the Vice Chancellor said.
"""

In [None]:
clean = clean_and_merge_article(article)
print(clean)

The Vice Chancellor, Federal University of Technology and Environmental Sciences, Iyin Ekiti, has said that the new institution will begin the admission process in September, Aribisala canvassing support from well-meaning Nigerians to the institution, He says funding of education should not be left to the government alone and The land issue, which could have been a challenge to the university, has been resolved amicably.
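
The output above contains fragments such as "and The land issue" with a capital letter mid-sentence. To see where that comes from, the merging step can be isolated from the model call. The sketch below copies step 3 of `clean_and_merge_article` into a standalone function (the name `merge_sentences` is introduced here for illustration) and runs it on a made-up summary.

```python
import re

def merge_sentences(raw_summary):
    """Join sentences into one, mirroring step 3 of clean_and_merge_article."""
    summary = re.sub(r"\s+", " ", raw_summary.strip())
    # Split after ., ! or ? when the next sentence starts with a capital letter
    sentences = re.split(r"(?<=[.!?])\s+(?=[A-Z])", summary)
    sentences = [s.strip(" .") for s in sentences if s.strip()]
    if not sentences:
        return ""
    if len(sentences) == 1:
        return sentences[0] + "."
    return ", ".join(sentences[:-1]) + " and " + sentences[-1] + "."

merged = merge_sentences(
    "Admissions open soon. He asked for funding. The land dispute is settled."
)
print(merged)
# Admissions open soon, He asked for funding and The land dispute is settled.
```

Because the function joins sentences without changing their first letters, every sentence after the first keeps its capital. Whether to lowercase those words is a judgment call, since many of them are proper nouns.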


#### **Model Saving and Deployment**

With **epoch 2** identified as the best-performing checkpoint, we save this model for future use.  

The saved model can be:  
- **Reloaded locally** for inference or further fine-tuning  
- **Uploaded to Hugging Face Hub** to make it publicly accessible  
- **Integrated into applications** (e.g., Flask, FastAPI, or Streamlit apps) for real-world summarization tasks  
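
As an illustration of the application route, here is a framework-agnostic sketch of a request handler that a Flask or FastAPI endpoint could delegate to. Everything in it is an assumption for demonstration: `make_handler`, the `"article"` and `"summary"` field names, and the `max_chars` limit are hypothetical, and a stand-in function replaces the fine-tuned model so the example runs without it.

```python
import json

def make_handler(summarize_fn, max_chars=20_000):
    """Wrap a summarizer callable as a JSON-in, JSON-out request handler."""
    def handle(request_body):
        try:
            payload = json.loads(request_body)
        except json.JSONDecodeError:
            return 400, json.dumps({"error": "body must be valid JSON"})
        article = payload.get("article") if isinstance(payload, dict) else None
        if not isinstance(article, str) or not article.strip():
            return 400, json.dumps({"error": "'article' must be a non-empty string"})
        # Cap the input size before handing it to the model
        summary = summarize_fn(article[:max_chars])
        return 200, json.dumps({"summary": summary})
    return handle

# Stand-in summarizer for demonstration: returns the first sentence only
def fake_summarizer(text):
    return text.split(".")[0].strip() + "."

handler = make_handler(fake_summarizer)
print(handler('{"article": "Rates rose today. Markets fell."}'))
```

In a real service, `summarize_fn` would wrap the saved model, for example a function like `clean_and_merge_article` from this notebook.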

Saving a clean copy preserves the selected checkpoint and makes it straightforward to reuse in downstream applications.  


In [None]:
!pip install huggingface_hub

from huggingface_hub import notebook_login
notebook_login()




In [None]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

checkpoint_path = "./bart-news-summarizer/checkpoint-8000"  # <-- your epoch2 path
save_path = "./bart-summarizer-epoch2"  # final folder you’ll save

# load checkpoint
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint_path)
tokenizer = AutoTokenizer.from_pretrained(checkpoint_path)

# save clean copy
model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)

print(f"Epoch2 model saved at {save_path}")

Epoch2 model saved at ./bart-summarizer-epoch2


In [None]:
model = AutoModelForSeq2SeqLM.from_pretrained(save_path)
tokenizer = AutoTokenizer.from_pretrained(save_path)

print("✅ Reloaded epoch2 model successfully")

✅ Reloaded epoch2 model successfully


In [None]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

repo_id = "Jayywestty/bart-summarizer-epoch2"  # change to your HF username/repo

# load from your already cleaned folder
model = AutoModelForSeq2SeqLM.from_pretrained("./bart-summarizer-epoch2")
tokenizer = AutoTokenizer.from_pretrained("./bart-summarizer-epoch2")

# push to hub
model.push_to_hub(repo_id)
tokenizer.push_to_hub(repo_id)

print(f"✅ Model uploaded to https://huggingface.co/{repo_id}")



✅ Model uploaded to https://huggingface.co/Jayywestty/bart-summarizer-epoch2
