<div align="center">
  <h1><strong>News Article Summarizer</strong></h1>
</div>
This project automatically generates summaries of Indian news articles about the Mahakumbh by following three main steps:

## Task 1. Dataset Collection
- **Scraping:** Uses Selenium and BeautifulSoup to scrape article listings from the Indian Express Mahakumbh topic page (`https://indianexpress.com/about/mahakumbh/`).
- **Extraction & Storage:** Extracts titles, dates, and summaries, saving them as `mahakumbh_articles.csv`.
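
The extraction rule can be sketched without a browser. This is a minimal illustration that assumes the paragraph texts of one listing entry have already been pulled out of the page; the helper names `split_date_and_summary` and `rows_to_csv` and the sample strings are hypothetical, and the standard-library `csv` module stands in for the pandas export used in the notebook.

```python
import csv
import io

def split_date_and_summary(paragraphs):
    """Mirror the notebook's rule: the first <p> is the date, the rest form the summary."""
    texts = [p.strip() for p in paragraphs]
    date = texts[0] if texts else "No Date"
    summary = " ".join(texts[1:]) if len(texts) > 1 else "No Summary"
    return date, summary

def rows_to_csv(rows):
    """Serialize (title, date, summary) rows under the notebook's column names."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Title", "Date & Time", "Summary"])
    writer.writerows(rows)
    return buffer.getvalue()

# Hypothetical paragraph texts standing in for one scraped listing entry.
date, summary = split_date_and_summary(
    ["  March 1, 2025 10:00 IST ", "First sentence.", "Second sentence."]
)
csv_text = rows_to_csv([("Example title", date, summary)])
print(csv_text)
```

Entries with no paragraphs fall back to the same `No Date` and `No Summary` placeholders that the scraping cell writes.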

## Task 2. Dataset Annotation
- **Cleaning & Preparation:** Cleans the text and combines the title and summary.
- **Ground Truth:** Uses the original summary as the annotated summary.
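
As a rough sketch of these two steps, the snippet below cleans a title and a summary and builds the input/target pair. `clean_text` follows the notebook's function; `build_example` and the sample strings are hypothetical names and data used only for illustration.

```python
import re

def clean_text(text):
    """Drop non-ASCII characters and collapse whitespace runs, as the notebook does."""
    text = text.encode("ascii", errors="ignore").decode()
    return re.sub(r"\s+", " ", text).strip()

def build_example(title, summary):
    """Return the (article, annotated_summary) pair used for training."""
    title, summary = clean_text(title), clean_text(summary)
    return f"{title}. {summary}", summary

# Hypothetical inputs with an en dash, a tab, and stray whitespace.
article, target = build_example(
    "  Kumbh \u2013 crowd   update\n", "Lakhs take a dip\tat the Sangam."
)
print(article)
print(target)
```

Note that an input built this way contains the target summary verbatim, since the article is the title followed by that same summary.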

## Task 3. Model Development & Summarization
- **Model:** Fine-tunes the pre-trained `t5-base` checkpoint for abstractive summarization.
- **Processing & Training:** Tokenizes and splits the dataset; trains with Hugging Face’s `Seq2SeqTrainer`, logging loss and ROUGE scores.
- **Evaluation:** Tracks validation loss and ROUGE metrics to assess performance.
- **Generation:** Generates a summary for every article with beam search and saves the results in `outputT5_fixed.csv`.
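
The split sizes can be checked with a little arithmetic. The sketch below assumes that each split rounds its held-out part up (`ceil(fraction * n)`), which is consistent with the `Map:` lines in the training output (88, 11, and 12 examples, 111 rows in total); the actual splitting is done by `Dataset.train_test_split`, and `split_sizes` is a hypothetical helper.

```python
import math

def split_sizes(n_rows, test_fraction=0.2, holdout_fraction=0.5):
    """Return (train, validation, test) sizes for two successive splits.

    Assumption: each split rounds its held-out part up, ceil(fraction * n).
    """
    held_out = math.ceil(test_fraction * n_rows)   # first split: 20% held out
    train = n_rows - held_out
    test = math.ceil(holdout_fraction * held_out)  # second split: half of the held-out rows
    validation = held_out - test
    return train, validation, test

print(split_sizes(111))  # (88, 11, 12)
```

With 111 rows this is roughly an 80/10/10 split, so the validation and test metrics each rest on about a dozen examples.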

### **Contributors & Work Distribution**  

1. **Aditya Vilasrao Bhagat (2411AI27)** – Led dataset collection, including web scraping, storage, and initial processing. Also handled text cleaning and preparation.  
2. **Divyanshu Singh (2411AI41)** – Took charge of model development, implementing tokenization, dataset splitting, and training setup.  
3. **Vaibhav Shikhar Singh (2411AI48)** – Focused on model training, evaluation, and final performance assessment.  

## Task 1 and Task 2 (Dataset Collection & Dataset Annotation)

In [1]:
!apt-get update
!apt-get install -y google-chrome-stable wget unzip
!pip install selenium webdriver-manager beautifulsoup4 pandas

Get:1 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease [3,632 B]
Get:2 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease [1,581 B]
Get:3 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ Packages [75.2 kB]
Get:4 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages [1,604 kB]
Hit:5 http://archive.ubuntu.com/ubuntu jammy InRelease
Get:6 http://security.ubuntu.com/ubuntu jammy-security InRelease [129 kB]
Get:7 http://archive.ubuntu.com/ubuntu jammy-updates InRelease [128 kB]
Hit:8 https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy InRelease
Hit:9 https://ppa.launchpadcontent.net/graphics-drivers/ppa/ubuntu jammy InRelease
Get:10 https://r2u.stat.illinois.edu/ubuntu jammy InRelease [6,555 B]
Hit:11 https://ppa.launchpadcontent.net/ubuntugis/ppa/ubuntu jammy InRelease
Get:12 https://r2u.stat.illinois.edu/ubuntu jammy/main all Packages [8,843 kB]
Hit:13 http:/

In [2]:
# Search for Chrome and Chromium binaries in /usr directory
!find /usr -name "google-chrome*"
!find /usr -name "chromium-browser*"

In [3]:
# List files for google-chrome-stable if installed
!dpkg -L google-chrome-stable

# List files for chromium-browser if installed
!dpkg -L chromium-browser


dpkg-query: package 'google-chrome-stable' is not installed
Use dpkg --contents (= dpkg-deb --contents) to list archive files contents.
dpkg-query: package 'chromium-browser' is not installed
Use dpkg --contents (= dpkg-deb --contents) to list archive files contents.


In [4]:
# Install Google Chrome (if needed)
!wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb
!apt-get update
!apt install -y ./google-chrome-stable_current_amd64.deb

# Or install Chromium
!apt-get update
!apt-get install -y chromium-browser

--2025-04-20 06:10:54--  https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb
Resolving dl.google.com (dl.google.com)... 74.125.24.93, 74.125.24.190, 74.125.24.136, ...
Connecting to dl.google.com (dl.google.com)|74.125.24.93|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 115256744 (110M) [application/x-debian-package]
Saving to: ‘google-chrome-stable_current_amd64.deb’


2025-04-20 06:10:54 (264 MB/s) - ‘google-chrome-stable_current_amd64.deb’ saved [115256744/115256744]

Hit:1 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease
Hit:2 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease
Hit:3 http://archive.ubuntu.com/ubuntu jammy InRelease
Hit:4 http://security.ubuntu.com/ubuntu jammy-security InRelease
Hit:5 http://archive.ubuntu.com/ubuntu jammy-updates InRelease
Hit:6 https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy InRelease
Hit:7 http://archive.ubuntu.com/ubuntu jammy-ba

In [5]:
!which google-chrome
!which chromium-browser

/usr/bin/google-chrome
/usr/bin/chromium-browser


In [6]:
# Import libraries
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
import pandas as pd
import time

# Set up Chrome options for Colab using Chromium
chrome_options = Options()
# Set binary location to the Chromium binary found in your system
chrome_options.binary_location = '/usr/bin/chromium-browser'
chrome_options.add_argument("--headless")  # Use headless mode
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument("--disable-dev-shm-usage")
chrome_options.add_argument("--disable-blink-features=AutomationControlled")
chrome_options.add_argument("--disable-gpu")
chrome_options.add_argument("--disable-software-rasterizer")
chrome_options.add_argument("--ignore-certificate-errors")
chrome_options.add_argument("--disable-web-security")
chrome_options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36")

# Initialize the driver using the Chromium binary and installed Chromedriver
driver = webdriver.Chrome(options=chrome_options)

# Define the URL of the webpage to scrape
url = "https://indianexpress.com/about/mahakumbh/"
driver.get(url)

# Wait for initial load
time.sleep(5)

# Function to load more articles
def load_more_articles():
    while True:
        try:
            # Find and click the "Load More" button
            load_more_button = WebDriverWait(driver, 5).until(
                EC.element_to_be_clickable((By.XPATH,
                    "//span[contains(@class, 'm-featured-link__highlight') and contains(text(), 'Load More')]"
                ))
            )
            load_more_button.click()
            time.sleep(3)  # Wait for new articles to load
        except Exception as e:
            print("No more 'Load More' button found or an error occurred.", str(e))
            break

# Click "Load More" repeatedly until no further articles are offered
load_more_articles()

# Parse final page source with BeautifulSoup
soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()

# Extract article data
titles, dates, summaries = [], [], []

articles_section = soup.find('div', {'id': 'tag_article'})
if articles_section:
    articles = articles_section.find_all('div', {'class': 'details'})
    for article in articles:
        h3_tag = article.find('h3')
        a_tag = h3_tag.find('a', href=True) if h3_tag else None
        p_tags = article.find_all('p')

        title = a_tag.text.strip() if a_tag else 'No Title'
        date = p_tags[0].text.strip() if len(p_tags) > 0 else 'No Date'
        summary = " ".join([p.text.strip() for p in p_tags[1:]]) if len(p_tags) > 1 else 'No Summary'

        titles.append(title)
        dates.append(date)
        summaries.append(summary)

# Create and save DataFrame
df = pd.DataFrame({'Title': titles, 'Date & Time': dates, 'Summary': summaries})
df.to_csv('mahakumbh_articles.csv', index=False, encoding='utf-8')
print("Data has been successfully saved to mahakumbh_articles.csv")


No more 'Load More' button found or an error occurred. Message: 
Stacktrace:
#0 0x5d0ae9c05cea <unknown>
#1 0x5d0ae96b65f0 <unknown>
#2 0x5d0ae9707a33 <unknown>
#3 0x5d0ae9707c21 <unknown>
#4 0x5d0ae9756274 <unknown>
#5 0x5d0ae972d68d <unknown>
#6 0x5d0ae9753660 <unknown>
#7 0x5d0ae972d433 <unknown>
#8 0x5d0ae96f9ea3 <unknown>
#9 0x5d0ae96fab01 <unknown>
#10 0x5d0ae9bcab3b <unknown>
#11 0x5d0ae9bcea21 <unknown>
#12 0x5d0ae9bb1c32 <unknown>
#13 0x5d0ae9bcf594 <unknown>
#14 0x5d0ae9b95eef <unknown>
#15 0x5d0ae9bf3d98 <unknown>
#16 0x5d0ae9bf3f76 <unknown>
#17 0x5d0ae9c04b36 <unknown>
#18 0x7903471f5ac3 <unknown>

Data has been successfully saved to mahakumbh_articles.csv
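
The "Load More" loop in the cell above runs until the wait for the button times out, which is what produces the long stack trace in the output. One possible variant, sketched here under stated assumptions, separates "find the button" from "click it" and caps the number of clicks. `click_until_gone`, `FakePage`, and the `max_clicks` default are hypothetical; with Selenium, `find_button` would wrap the `WebDriverWait(...).until(...)` call and return `None` when it raises `selenium.common.exceptions.TimeoutException`.

```python
def click_until_gone(find_button, max_clicks=50):
    """Click the element returned by find_button() until it returns None or the cap is reached."""
    clicks = 0
    while clicks < max_clicks:
        button = find_button()
        if button is None:  # nothing left to click
            break
        button.click()
        clicks += 1
    return clicks

class FakePage:
    """Stand-in for a page whose button disappears after a fixed number of clicks."""
    def __init__(self, available_clicks):
        self.remaining = available_clicks

    def click(self):
        self.remaining -= 1

    def find_button(self):
        return self if self.remaining > 0 else None

page = FakePage(available_clicks=3)
clicks = click_until_gone(page.find_button)
print(clicks)  # 3
```

The cap only guards against a button that never disappears; a real scraper would still pause after each click, as the notebook does with `time.sleep(3)`.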


## Task 3. Model Development & Summarization

In [7]:
!pip install pandas torch scikit-learn transformers datasets rouge-score selenium webdriver-manager beautifulsoup4

Collecting datasets
  Downloading datasets-3.5.0-py3-none-any.whl.metadata (19 kB)
Collecting rouge-score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py) ... done
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70 (from torch)
  Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu12==12.4.5.8 (from torch)
  Downloading nvidia_cublas_cu12-12.4.5.8-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cufft-cu12==11.

In [8]:
!pip install pandas torch scikit-learn transformers datasets rouge-score
!pip install selenium webdriver-manager beautifulsoup4



In [9]:
import os
import re
import pandas as pd
import torch
import tempfile
import shutil
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer
from datasets import Dataset
from rouge_score import rouge_scorer, scoring
from torch.utils.data import DataLoader

# Optimize CUDA memory allocation
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:64"
torch.backends.cuda.matmul.allow_tf32 = True

# Clean text function
def clean_text(text):
    text = text.encode("ascii", errors="ignore").decode()
    text = re.sub(r'\s+', ' ', text)
    return text.strip()

# Load and preprocess data
df = pd.read_csv("mahakumbh_articles.csv")
# Drop rows with missing text before cleaning: empty CSV cells load as NaN,
# and clean_text only accepts strings
df.dropna(subset=['Title', 'Summary'], inplace=True)
df['Title'] = df['Title'].apply(clean_text)
df['Summary'] = df['Summary'].apply(clean_text)
df['Date & Time'] = df['Date & Time'].fillna('No Date').apply(clean_text)
df['Article'] = df['Title'] + ". " + df['Summary']

if 'Annotated_Summary' not in df.columns:
    df['Annotated_Summary'] = df['Summary']

# Load tokenizer and model
model_name = "t5-base"
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to("cuda")

# Define max sequence lengths
max_input_length = 512
max_target_length = 150

# Preprocessing function
def preprocess_function(examples):
    inputs = examples["Article"]
    targets = examples["Annotated_Summary"]
    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True, padding="max_length", return_tensors="pt")
    labels = tokenizer(targets, max_length=max_target_length, truncation=True, padding="max_length", return_tensors="pt")["input_ids"]

    # Ensure -100 padding for loss masking
    labels[labels == tokenizer.pad_token_id] = -100
    model_inputs["labels"] = labels
    return model_inputs

# Create datasets
dataset = Dataset.from_pandas(df[["Article", "Annotated_Summary"]])
train_val_test = dataset.train_test_split(test_size=0.2, seed=42)
valid_test = train_val_test["test"].train_test_split(test_size=0.5, seed=42)
train_dataset, val_dataset, test_dataset = train_val_test["train"], valid_test["train"], valid_test["test"]

# Preprocess datasets
train_dataset = train_dataset.map(preprocess_function, batched=True)
val_dataset = val_dataset.map(preprocess_function, batched=True)
test_dataset = test_dataset.map(preprocess_function, batched=True)

# Set format for PyTorch
columns = ["input_ids", "attention_mask", "labels"]
train_dataset.set_format(type="torch", columns=columns)
val_dataset.set_format(type="torch", columns=columns)
test_dataset.set_format(type="torch", columns=columns)

# DataLoaders for manual iteration (not passed to Seq2SeqTrainer, which builds its own from the datasets)
train_dataloader = DataLoader(train_dataset, batch_size=1, num_workers=2, shuffle=True)
val_dataloader = DataLoader(val_dataset, batch_size=1, num_workers=2)

# Temporary directory for training outputs
temp_dir = tempfile.mkdtemp()

# Define training arguments
training_args = Seq2SeqTrainingArguments(
    output_dir=temp_dir,
    eval_strategy="epoch",
    learning_rate=3e-5,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    weight_decay=0.01,
    save_total_limit=0,
    num_train_epochs=10,
    predict_with_generate=True,
    generation_max_length=180,
    generation_num_beams=6,
    fp16=True,
    logging_steps=50,
    save_strategy="no",
    report_to=[],
    max_grad_norm=1.0,
    gradient_accumulation_steps=8,  # Effective batch size of 8 (batch size 1 x 8 accumulation steps)
)

# Safe decoding: drop ids outside the tokenizer's vocabulary range (such as -100 padding values) before decoding
def safe_decode(token_ids):
    valid_ids = [int(t) for t in token_ids if 0 <= int(t) < tokenizer.vocab_size]
    return tokenizer.decode(valid_ids, skip_special_tokens=True)

# Metric computation function
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    decoded_preds = [safe_decode(pred.tolist()) for pred in predictions]
    decoded_labels = [
        safe_decode([l if l != -100 else tokenizer.pad_token_id for l in label])
        for label in labels
    ]
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
    aggregator = scoring.BootstrapAggregator()
    for ref, pred in zip(decoded_labels, decoded_preds):
        scores = scorer.score(ref, pred)
        aggregator.add_scores(scores)
    result = aggregator.aggregate()
    return {key: value.mid.fmeasure * 100 for key, value in result.items()}

# Initialize Trainer
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

print("Starting training...")
torch.cuda.empty_cache()
train_result = trainer.train()
torch.cuda.empty_cache()
print("Training completed.")
print("Final training metrics:", train_result.metrics)

# Save trained model
torch.save(model.state_dict(), "final_model.pt")
shutil.rmtree(temp_dir)

# Evaluate on the test set
print("Evaluating on the test set...")
torch.cuda.empty_cache()
results = trainer.evaluate(test_dataset)
torch.cuda.empty_cache()
print("Test set evaluation results:")
print(results)

# Function to generate summaries
def generate_summary(article_text):
    inputs = tokenizer.encode(article_text, return_tensors="pt", max_length=max_input_length, truncation=True).to("cuda")
    outputs = model.generate(
        inputs,
        max_length=180,
        num_beams=6,
        early_stopping=True,
        length_penalty=2.0,
        min_length=40
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Generate summaries for all articles
generated_summaries = []
for idx, article in enumerate(df['Article']):
    print(f"\n--- Article {idx+1} ---")
    print(article)
    summary = generate_summary(article)
    print("\n--- Generated Summary ---")
    print(summary)
    generated_summaries.append(summary)

df["Generated_Summary"] = generated_summaries
df.to_csv("outputT5_fixed.csv", index=False)
print("\nSummaries have been saved to outputT5_fixed.csv")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


Map:   0%|          | 0/88 [00:00<?, ? examples/s]

Map:   0%|          | 0/11 [00:00<?, ? examples/s]

Map:   0%|          | 0/12 [00:00<?, ? examples/s]


Starting training...


Passing a tuple of `past_key_values` is deprecated and will be removed in Transformers v4.48.0. You should pass an instance of `EncoderDecoderCache` instead, e.g. `past_key_values=EncoderDecoderCache.from_legacy_cache(past_key_values)`.


| Epoch | Training Loss | Validation Loss | ROUGE-1 | ROUGE-2 | ROUGE-L |
|---|---|---|---|---|---|
| 1 | No log | 0.258646 | 83.316059 | 82.030323 | 83.112358 |
| 2 | No log | 0.15466 | 85.03435 | 84.234798 | 85.03435 |
| 3 | No log | 0.118082 | 87.474875 | 87.240624 | 87.410349 |
| 4 | No log | 0.095457 | 87.636304 | 87.102998 | 87.907239 |
| 5 | 0.269900 | 0.078471 | 89.060611 | 88.591617 | 89.66435 |
| 6 | 0.269900 | 0.06592 | 92.173442 | 91.682193 | 92.078035 |
| 7 | 0.269900 | 0.058982 | 92.190269 | 91.55429 | 92.173442 |
| 8 | 0.269900 | 0.05523 | 93.605669 | 92.682065 | 92.906382 |
| 9 | 0.269900 | 0.053015 | 93.717902 | 92.336841 | 93.605669 |
| 10 | 0.051600 | 0.052351 | 93.589126 | 92.773893 | 92.962499 |


Training completed.
Final training metrics: {'train_runtime': 334.3059, 'train_samples_per_second': 2.632, 'train_steps_per_second': 0.329, 'total_flos': 535882943692800.0, 'train_loss': 0.1492013321681456, 'epoch': 10.0}
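
The "No log" entries for epochs 1 to 4 follow from the step arithmetic. The sketch below is a back-of-the-envelope check, assuming one optimizer step per 8 accumulated samples and one training-loss log every `logging_steps` optimizer steps; the constants come from the training arguments and from the 88-example training split reported in the output.

```python
import math

train_rows = 88         # training split size, from the "Map: 0/88" line
accumulation_steps = 8  # gradient_accumulation_steps, with a per-device batch size of 1
epochs = 10
logging_steps = 50

steps_per_epoch = math.ceil(train_rows / accumulation_steps)
total_steps = steps_per_epoch * epochs
# Epoch in which each logging step (50, 100, ...) falls.
log_epochs = [
    math.ceil(step / steps_per_epoch)
    for step in range(logging_steps, total_steps + 1, logging_steps)
]
print(steps_per_epoch, total_steps, log_epochs)  # 11 110 [5, 10]
```

The 110 total steps agree with the reported `train_steps_per_second` of 0.329 over about 334 seconds, and the two logging points explain why a training loss first appears at epoch 5 and changes only at epoch 10.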
Evaluating on the test set...


Test set evaluation results:
{'eval_loss': 0.06734377890825272, 'eval_rouge1': 90.11608821776844, 'eval_rouge2': 88.01404853128992, 'eval_rougeL': 89.69593521752938, 'eval_runtime': 21.1487, 'eval_samples_per_second': 0.567, 'eval_steps_per_second': 0.567, 'epoch': 10.0}
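
To make the ROUGE-1 figure concrete, here is a simplified sketch of a unigram-overlap F-measure. It is not the `rouge_score` implementation used in the notebook: it splits on whitespace and applies no stemming, so its numbers will differ from the library's. `rouge1_f` and the sample sentences are hypothetical.

```python
from collections import Counter

def rouge1_f(reference, prediction):
    """Unigram-overlap F-measure on lowercased whitespace tokens (no stemming)."""
    ref_counts = Counter(reference.lower().split())
    pred_counts = Counter(prediction.lower().split())
    # Clipped overlap: each token counts at most as often as it appears in both texts.
    overlap = sum((ref_counts & pred_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

score = rouge1_f("the water was unfit for bathing", "water was unfit for bathing")
print(round(score * 100, 2))
```

In this example the prediction covers five of the six reference tokens with no extra words, so precision is 1 and recall is 5/6, giving an F-measure of 10/11. The notebook reports the same kind of score multiplied by 100.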

--- Article 1 ---
In Bihar, Ganga water has not conformed to faecal coliform levels at 34 sites, state pollution control board tells NGT. The Central Pollution Control Board had submitted to the National Green Tribunal that faecal coliform levels in the Ganga water at the Sangam during the Mahakumbh were higher than permissible limits, making it unfit for bathing.

--- Generated Summary ---
The Central Pollution Control Board had submitted to the National Green Tribunal that faecal coliform levels in the Ganga water at the Sangam during the Mahakumbh were higher than permissible limits, making it unfit for bathing.

--- Article 2 ---
Maha Kumbh awakened nation like Dandi; Opposition questions silence on stampede: PM M