<div align="center">
  <h1><strong>News Article Summarizer</strong></h1>
</div>

This project automatically summarizes Indian news articles about the Mahakumbh in three main steps:

## Task 1. Dataset Collection
- **Scraping:** Uses Selenium and BeautifulSoup to scrape articles from Indian Express.
- **Extraction & Storage:** Extracts titles, dates, and summaries, saving them as `mahakumbh_articles.csv`.
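As a rough illustration of the storage step, the sketch below writes scraped records to CSV text using the same three column names the notebook saves (`Title`, `Date & Time`, `Summary`). It uses the standard-library `csv` module rather than pandas, and both the `records_to_csv` helper and the sample record are hypothetical.

```python
import csv
import io

# Column names match the DataFrame the notebook saves to mahakumbh_articles.csv.
COLUMNS = ["Title", "Date & Time", "Summary"]

def records_to_csv(records):
    """Serialize a list of dicts to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()

# Hypothetical record, for illustration only.
sample = [{
    "Title": "Sample headline",
    "Date & Time": "January 1, 2025 10:00 IST",
    "Summary": "A short, made-up summary.",
}]
csv_text = records_to_csv(sample)
print(csv_text)
```

Fields that contain commas, such as the date, are quoted automatically, so they survive a round trip through `pd.read_csv`.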

## Task 2. Dataset Annotation
- **Cleaning & Preparation:** Strips non-ASCII characters, normalizes whitespace, and joins the title and summary into a single `Article` field.
- **Ground Truth:** Uses the scraped summary as the reference (annotated) summary.
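The two annotation steps can be sketched on a single row as follows. `clean_text` mirrors the function defined later in the notebook; `build_example` and the sample strings are hypothetical and only illustrate how the model input and target are derived.

```python
import re

def clean_text(text):
    """Drop non-ASCII characters and collapse runs of whitespace."""
    text = text.encode("ascii", errors="ignore").decode()
    text = re.sub(r"\s+", " ", text)
    return text.strip()

def build_example(title, summary):
    """Return (model input, target summary) for one scraped row."""
    title, summary = clean_text(title), clean_text(summary)
    article = title + ". " + summary  # input text fed to the model
    return article, summary           # the scraped summary doubles as ground truth

# Made-up strings with a curly apostrophe, extra spaces, and a newline.
article, target = build_example("  Kumbh   crowd\u2019s rise ", "Pilgrims  arrive\nearly.")
print(article)
print(target)
```

Because the target is also part of the input, the model can score well by copying, which is worth keeping in mind when reading the ROUGE numbers.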

## Task 3. Model Development & Summarization
- **Model:** Utilizes the pre-trained `T5-base` for abstractive summarization.
- **Processing & Training:** Tokenizes and splits the dataset; trains with Hugging Face’s `Seq2SeqTrainer`, logging loss and ROUGE scores.
- **Evaluation:** Tracks validation loss and ROUGE metrics to assess performance.
- **Generation:** Generates a summary for every article with beam search and saves the results to `outputT5_fixed.csv`.
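To give a feel for what the ROUGE metric measures, here is a simplified ROUGE-1 F-measure in plain Python. It is a sketch only: the notebook itself relies on the `rouge_score` package with stemming enabled, whereas this version splits on whitespace and does no stemming, so its numbers will not match exactly. The function name `rouge1_f` and the example sentences are hypothetical.

```python
from collections import Counter

def rouge1_f(reference, prediction):
    """Unigram-overlap F-measure between a reference and a predicted summary."""
    ref_tokens = reference.lower().split()
    pred_tokens = prediction.lower().split()
    if not ref_tokens or not pred_tokens:
        return 0.0
    # Clipped overlap: a word counts at most as often as it appears in both texts.
    overlap = sum((Counter(ref_tokens) & Counter(pred_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# Three of five words overlap in each direction.
score = rouge1_f("pilgrims gather at the sangam", "pilgrims gather near the river")
print(round(score, 3))
```

The notebook reports these F-measures multiplied by 100, alongside ROUGE-2 (bigrams) and ROUGE-L (longest common subsequence).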

### **Contributors & Work Distribution**  

1. **Aditya Vilasrao Bhagat (2411AI27)** – Led dataset collection, including web scraping, storage, and initial processing. Also handled text cleaning and preparation.  
2. **Divyanshu Singh (2411AI41)** – Took charge of model development, implementing tokenization, dataset splitting, and training setup.  
3. **Vaibhav Shikhar Singh (2411AI48)** – Focused on model training, evaluation, and final performance assessment.  

## Task 1 and Task 2 (Dataset Collection & Dataset Annotation)

In [None]:
!apt-get update
!apt-get install -y google-chrome-stable wget unzip
!pip install selenium webdriver-manager beautifulsoup4 pandas

In [None]:
# Search for Chrome and Chromium binaries in /usr directory
!find /usr -name "google-chrome*"
!find /usr -name "chromium-browser*"

In [None]:
# List files for google-chrome-stable if installed
!dpkg -L google-chrome-stable

# List files for chromium-browser if installed
!dpkg -L chromium-browser


In [None]:
# Install Google Chrome (if needed)
!wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb
!apt-get update
!apt install -y ./google-chrome-stable_current_amd64.deb

# Or install Chromium
!apt-get update
!apt-get install -y chromium-browser

In [None]:
!which google-chrome
!which chromium-browser

In [None]:
# Import libraries
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
import pandas as pd
import time

# Set up Chrome options for Colab using Chromium
chrome_options = Options()
# Set binary location to the Chromium binary found in your system
chrome_options.binary_location = '/usr/bin/chromium-browser'
chrome_options.add_argument("--headless")  # Use headless mode
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument("--disable-dev-shm-usage")
chrome_options.add_argument("--disable-blink-features=AutomationControlled")
chrome_options.add_argument("--disable-gpu")
chrome_options.add_argument("--disable-software-rasterizer")
chrome_options.add_argument("--ignore-certificate-errors")
chrome_options.add_argument("--disable-web-security")
chrome_options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36")

# Initialize the driver using the Chromium binary and installed Chromedriver
driver = webdriver.Chrome(options=chrome_options)

# Define the URL of the webpage to scrape
url = "https://indianexpress.com/about/mahakumbh/"
driver.get(url)

# Wait for initial load
time.sleep(5)

# Function to load more articles
def load_more_articles():
    while True:
        try:
            # Find and click the "Load More" button
            load_more_button = WebDriverWait(driver, 5).until(
                EC.element_to_be_clickable((By.XPATH,
                    "//span[contains(@class, 'm-featured-link__highlight') and contains(text(), 'Load More')]"
                ))
            )
            load_more_button.click()
            time.sleep(3)  # Wait for new articles to load
        except Exception as e:
            print("No more 'Load More' button found or an error occurred.", str(e))
            break

# Click "Load More" repeatedly until no further articles are available
load_more_articles()

# Parse final page source with BeautifulSoup
soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()

# Extract article data
titles, dates, summaries = [], [], []

articles_section = soup.find('div', {'id': 'tag_article'})
if articles_section:
    articles = articles_section.find_all('div', {'class': 'details'})
    for article in articles:
        h3_tag = article.find('h3')
        a_tag = h3_tag.find('a', href=True) if h3_tag else None
        p_tags = article.find_all('p')

        title = a_tag.text.strip() if a_tag else 'No Title'
        date = p_tags[0].text.strip() if len(p_tags) > 0 else 'No Date'
        summary = " ".join([p.text.strip() for p in p_tags[1:]]) if len(p_tags) > 1 else 'No Summary'

        titles.append(title)
        dates.append(date)
        summaries.append(summary)

# Create and save DataFrame
df = pd.DataFrame({'Title': titles, 'Date & Time': dates, 'Summary': summaries})
df.to_csv('mahakumbh_articles.csv', index=False, encoding='utf-8')
print("Data has been successfully saved to mahakumbh_articles.csv")
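The `load_more_articles` loop above keeps clicking until the button lookup fails, so it has no upper bound on the number of clicks. A minimal sketch of a bounded variant follows. It assumes a hypothetical `click_once` callable that performs one click and raises when the button is gone; the `max_clicks` cap and the `fake_click` stand-in are also illustrative and not part of the original scraper.

```python
def click_until_exhausted(click_once, max_clicks=50):
    """Call click_once() until it raises or max_clicks is reached.

    Returns the number of successful clicks.
    """
    clicks = 0
    while clicks < max_clicks:
        try:
            click_once()
        except Exception as exc:  # e.g. a Selenium timeout in the real scraper
            print(f"Stopping after {clicks} clicks: {exc}")
            break
        clicks += 1
    return clicks

# Stand-in for the browser: the button "disappears" after three clicks.
remaining = [3]

def fake_click():
    if remaining[0] == 0:
        raise RuntimeError("button not found")
    remaining[0] -= 1

total = click_until_exhausted(fake_click)
print(total)
```

In the real scraper, `click_once` would wrap the `WebDriverWait(...).until(...)` lookup and the `click()` call.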


## Task 3. Model Development & Summarization

In [None]:
!pip install pandas torch scikit-learn transformers datasets rouge-score selenium webdriver-manager beautifulsoup4

In [None]:
import os
import re
import pandas as pd
import torch
import tempfile
import shutil
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer
from datasets import Dataset
from rouge_score import rouge_scorer, scoring
from torch.utils.data import DataLoader

# Reduce CUDA memory fragmentation and allow TF32 matrix multiplications
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:64"
torch.backends.cuda.matmul.allow_tf32 = True

# Clean text function
def clean_text(text):
    text = text.encode("ascii", errors="ignore").decode()
    text = re.sub(r'\s+', ' ', text)
    return text.strip()

# Load and preprocess data
df = pd.read_csv("mahakumbh_articles.csv")
# Drop rows with missing text first: clean_text expects strings, not NaN
df.dropna(subset=['Title', 'Summary'], inplace=True)
df['Title'] = df['Title'].apply(clean_text)
df['Summary'] = df['Summary'].apply(clean_text)
df['Date & Time'] = df['Date & Time'].fillna('').apply(clean_text)
# Model input is the title followed by the summary
df['Article'] = df['Title'] + ". " + df['Summary']

if 'Annotated_Summary' not in df.columns:
    df['Annotated_Summary'] = df['Summary']

# Load tokenizer and model
model_name = "t5-base"
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to("cuda")

# Define max sequence lengths
max_input_length = 512
max_target_length = 150

# Preprocessing function
def preprocess_function(examples):
    inputs = examples["Article"]
    targets = examples["Annotated_Summary"]
    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True, padding="max_length", return_tensors="pt")
    labels = tokenizer(targets, max_length=max_target_length, truncation=True, padding="max_length", return_tensors="pt")["input_ids"]

    # Replace padding token IDs with -100 so the loss ignores them
    labels[labels == tokenizer.pad_token_id] = -100
    model_inputs["labels"] = labels
    return model_inputs

# Create datasets
dataset = Dataset.from_pandas(df[["Article", "Annotated_Summary"]])
train_val_test = dataset.train_test_split(test_size=0.2, seed=42)
valid_test = train_val_test["test"].train_test_split(test_size=0.5, seed=42)
train_dataset, val_dataset, test_dataset = train_val_test["train"], valid_test["train"], valid_test["test"]

# Preprocess datasets
train_dataset = train_dataset.map(preprocess_function, batched=True)
val_dataset = val_dataset.map(preprocess_function, batched=True)
test_dataset = test_dataset.map(preprocess_function, batched=True)

# Set format for PyTorch
columns = ["input_ids", "attention_mask", "labels"]
train_dataset.set_format(type="torch", columns=columns)
val_dataset.set_format(type="torch", columns=columns)
test_dataset.set_format(type="torch", columns=columns)

# Standalone DataLoaders (not used below: Seq2SeqTrainer builds its own from the datasets)
train_dataloader = DataLoader(train_dataset, batch_size=1, num_workers=2, shuffle=True)
val_dataloader = DataLoader(val_dataset, batch_size=1, num_workers=2)

# Temporary directory for training outputs
temp_dir = tempfile.mkdtemp()

# Define training arguments
training_args = Seq2SeqTrainingArguments(
    output_dir=temp_dir,
    eval_strategy="epoch",
    learning_rate=3e-5,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    weight_decay=0.01,
    save_total_limit=0,
    num_train_epochs=10,
    predict_with_generate=True,
    generation_max_length=180,
    generation_num_beams=6,
    fp16=True,
    logging_steps=50,
    save_strategy="no",
    report_to=[],
    max_grad_norm=1.0,
    gradient_accumulation_steps=8,  # Accumulate 8 steps: effective batch size of 8 per device
)

# Safe decoding function
def safe_decode(token_ids):
    valid_ids = [int(t) for t in token_ids if 0 <= int(t) < tokenizer.vocab_size]
    return tokenizer.decode(valid_ids, skip_special_tokens=True)

# Metric computation function
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    decoded_preds = [safe_decode(pred.tolist()) for pred in predictions]
    decoded_labels = [
        safe_decode([l if l != -100 else tokenizer.pad_token_id for l in label])
        for label in labels
    ]
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
    aggregator = scoring.BootstrapAggregator()
    for ref, pred in zip(decoded_labels, decoded_preds):
        scores = scorer.score(ref, pred)
        aggregator.add_scores(scores)
    result = aggregator.aggregate()
    return {key: value.mid.fmeasure * 100 for key, value in result.items()}

# Initialize Trainer
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

print("Starting training...")
torch.cuda.empty_cache()
train_result = trainer.train()
torch.cuda.empty_cache()
print("Training completed.")
print("Final training metrics:", train_result.metrics)

# Save trained model weights, then remove the temporary output directory
torch.save(model.state_dict(), "final_model.pt")
shutil.rmtree(temp_dir)

# Evaluate on the test set
print("Evaluating on the test set...")
torch.cuda.empty_cache()
results = trainer.evaluate(test_dataset)
torch.cuda.empty_cache()
print("Test set evaluation results:")
print(results)

# Function to generate summaries
def generate_summary(article_text):
    inputs = tokenizer.encode(article_text, return_tensors="pt", max_length=max_input_length, truncation=True).to("cuda")
    outputs = model.generate(
        inputs,
        max_length=180,
        num_beams=6,
        early_stopping=True,
        length_penalty=2.0,
        min_length=40
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Generate summaries for all articles
generated_summaries = []
for idx, article in enumerate(df['Article']):
    print(f"\n--- Article {idx+1} ---")
    print(article)
    summary = generate_summary(article)
    print("\n--- Generated Summary ---")
    print(summary)
    generated_summaries.append(summary)

df["Generated_Summary"] = generated_summaries
df.to_csv("outputT5_fixed.csv", index=False)
print("\nSummaries have been saved to outputT5_fixed.csv")
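The preprocessing function replaces padding IDs in the labels with `-100` so that the loss skips them, and `compute_metrics` reverses that substitution before decoding. The sketch below shows the round trip on plain Python lists. The helper names and the token IDs are made up for illustration; the pad ID of 0 is T5's, and `-100` is the default ignore index of PyTorch's cross-entropy loss.

```python
PAD_ID = 0           # T5's pad token ID
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default

def mask_padding(label_ids, pad_id=PAD_ID):
    """Replace padding IDs so they do not contribute to the training loss."""
    return [IGNORE_INDEX if t == pad_id else t for t in label_ids]

def unmask_padding(label_ids, pad_id=PAD_ID):
    """Restore padding IDs so the sequence can be decoded by the tokenizer."""
    return [pad_id if t == IGNORE_INDEX else t for t in label_ids]

labels = [8774, 296, 1, 0, 0]  # made-up token IDs followed by two padding positions
masked = mask_padding(labels)
print(masked)
print(unmask_padding(masked))
```

Skipping the reverse step would hand negative IDs to the tokenizer, which is why the notebook also filters out-of-range IDs in `safe_decode`.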