# 7.4.1 News Aggregation

This notebook demonstrates a **News Aggregation** pipeline that pairs a lightweight extractive summarizer (LexRank, standing in for the small-model side) with heavier abstractive summarizers (BART and PEGASUS, standing in for the LLM side) to turn multiple articles on a topic into concise, informative summaries.

## 1. Install Dependencies

We’ll use:
- **newspaper3k** to fetch and parse online articles  
- **Sumy** for extractive summarization (LexRank)  
- **Transformers** for abstractive summarization (BART & PEGASUS)  
- **NLTK** for tokenization support

In [None]:
!pip install newspaper3k sumy nltk transformers

In [None]:
import nltk
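nltk.download('punkt')      # sentence tokenizer data required by Sumy
nltk.download('punkt_tab')  # newer NLTK releases look for this resource instead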

In [None]:
from newspaper import Article
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lex_rank import LexRankSummarizer
from transformers import pipeline

In [None]:
# -- Sample news article URLs (you can swap these for any RSS-derived list) --
urls = [
    "https://www.bbc.com/news/world-us-canada-65878956",
    "https://www.bbc.com/news/technology-65872304",
    "https://www.bbc.com/news/business-65870012"
]

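def fetch_article_text(url):
    """Download and parse one article; return its body text, or "" on failure."""
    art = Article(url)
    try:
        art.download()
        art.parse()
    except Exception as exc:  # e.g. a dead link or a blocked request
        print(f"Skipping {url}: {exc}")
        return ""
    return art.text

# Fetch and combine, dropping articles that failed to download or came back empty
articles = [text for text in (fetch_article_text(u) for u in urls) if text]
combined_text = "\n\n".join(articles)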
print(f"Fetched {len(articles)} articles, total characters: {len(combined_text)}")
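
The static `urls` list can also be built from an RSS feed. Below is a minimal sketch using only the standard library. It assumes a plain RSS 2.0 layout (`<item>` elements that each carry a `<link>` child); the helper name `extract_rss_links` and the sample feed are illustrative and not part of the original notebook.

```python
import xml.etree.ElementTree as ET

def extract_rss_links(rss_xml: str, limit: int = 10) -> list[str]:
    """Return up to `limit` article URLs from an RSS 2.0 document (hypothetical helper)."""
    root = ET.fromstring(rss_xml)
    links = []
    for item in root.iter("item"):
        if len(links) >= limit:
            break
        link = item.findtext("link")
        if link and link.strip():
            links.append(link.strip())
    return links

# Made-up feed for illustration; in practice the XML would be downloaded from a feed URL.
sample_feed = """<?xml version="1.0"?>
<rss version="2.0"><channel>
  <title>Example feed</title>
  <item><title>A</title><link>https://example.com/a</link></item>
  <item><title>B</title><link>https://example.com/b</link></item>
</channel></rss>"""

urls_from_feed = extract_rss_links(sample_feed)
print(urls_from_feed)
```

The resulting list can be passed to `fetch_article_text` in place of the hard-coded `urls`.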

In [None]:
def extractive_summary(text: str, num_sentences: int = 3) -> str:
    parser    = PlaintextParser.from_string(text, Tokenizer("english"))
    summarizer = LexRankSummarizer()
    summary    = summarizer(parser.document, num_sentences)
    return " ".join(str(s) for s in summary)

ext_sum = extractive_summary(combined_text, num_sentences=5)
print("=== Extractive Summary (LexRank, 5 sentences) ===\n")
print(ext_sum)
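
To make the extractive step less of a black box, here is a toy version of the idea behind LexRank: build a graph whose nodes are sentences and whose edge weights are similarities, run a PageRank-style power iteration, and keep the highest-scoring sentences. This is a simplified sketch, not Sumy's implementation. It uses plain term-frequency cosine similarity, and the function name `toy_lexrank`, the damping factor, and the iteration count are all assumptions made for illustration.

```python
import math
import re
from collections import Counter

def _cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def toy_lexrank(sentences: list[str], top_n: int = 2,
                damping: float = 0.85, iterations: int = 50) -> list[str]:
    """Rank sentences by centrality in a similarity graph (simplified sketch)."""
    bags = [Counter(re.findall(r"[a-z']+", s.lower())) for s in sentences]
    n = len(bags)
    if n == 0:
        return []
    # Pairwise similarity matrix with a zero diagonal.
    sim = [[_cosine(bags[i], bags[j]) if i != j else 0.0 for j in range(n)]
           for i in range(n)]
    # Row-normalize into transition probabilities; isolated sentences spread uniformly.
    trans = []
    for row in sim:
        total = sum(row)
        trans.append([v / total for v in row] if total else [1.0 / n] * n)
    # Power iteration, as in PageRank.
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [(1 - damping) / n
                  + damping * sum(scores[j] * trans[j][i] for j in range(n))
                  for i in range(n)]
    # Keep the top_n sentences, restored to their original order.
    best = sorted(sorted(range(n), key=lambda i: scores[i], reverse=True)[:top_n])
    return [sentences[i] for i in best]

sentences = [
    "The central bank raised interest rates on Tuesday.",
    "Interest rates were raised by the central bank to curb inflation.",
    "Analysts expect the bank to keep interest rates high.",
    "A local bakery won an award for its sourdough.",
]
top = toy_lexrank(sentences, top_n=2)
print(top)
```

The bakery sentence shares no vocabulary with the others, so it receives no incoming weight from them and is not selected.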

In [None]:
# Set up Hugging Face summarization pipelines
bart_summarizer    = pipeline("summarization", model="facebook/bart-large-cnn")
pegasus_summarizer = pipeline("summarization", model="google/pegasus-xsum")

def abstractive_summary(text: str, model: str = "bart",
                        max_length: int = 150, min_length: int = 40) -> str:
    summarizers = {"bart": bart_summarizer, "pegasus": pegasus_summarizer}
    if model not in summarizers:
        raise ValueError(f"Unsupported model: {model}")
    # truncation=True clips the input to the model's maximum input length;
    # without it, several concatenated articles can overflow that limit and fail.
    result = summarizers[model](text,
                                max_length=max_length,
                                min_length=min_length,
                                truncation=True,
                                do_sample=False)
    return result[0]["summary_text"]

abs_bart    = abstractive_summary(combined_text, model="bart")
abs_pegasus = abstractive_summary(combined_text, model="pegasus")

print("=== Abstractive Summary (BART) ===\n", abs_bart, "\n")
print("=== Abstractive Summary (PEGASUS) ===\n", abs_pegasus)
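
Both models read only a bounded number of input tokens, so text joined from several articles may be longer than they can take in one pass. One common workaround is to summarize the text in chunks and join the partial summaries. The sketch below is a hedged illustration of that idea: the helper names `chunk_paragraphs` and `summarize_long_text` are hypothetical, the 400-word budget is an arbitrary assumption, and word count is only a rough proxy for the tokenizer's real token count. The summarizer is passed in as a callable so the logic can be tried without loading a model.

```python
from typing import Callable

def chunk_paragraphs(text: str, max_words: int = 400) -> list[str]:
    """Greedily pack paragraphs into chunks of roughly `max_words` words.

    A single paragraph longer than `max_words` becomes its own oversized chunk.
    """
    chunks, current, count = [], [], 0
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        words = len(para.split())
        if current and count + words > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += words
    if current:
        chunks.append("\n\n".join(current))
    return chunks

def summarize_long_text(text: str, summarize: Callable[[str], str],
                        max_words: int = 400) -> str:
    """Summarize each chunk separately and join the partial summaries."""
    return " ".join(summarize(chunk) for chunk in chunk_paragraphs(text, max_words))

# Stand-in summarizer for demonstration: keeps the first sentence of each chunk.
def first_sentence(chunk: str) -> str:
    return chunk.split(". ")[0].rstrip(".") + "."

demo_text = "\n\n".join([
    "Alpha one. Alpha two.",
    "Beta one. Beta two.",
    "Gamma one. Gamma two.",
])
demo_chunks = chunk_paragraphs(demo_text, max_words=8)
demo_summary = summarize_long_text(demo_text, first_sentence, max_words=8)
print(len(demo_chunks), demo_summary)
```

In this notebook, the stand-in could be replaced with `lambda t: abstractive_summary(t, model="bart")`.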

## 2. Analysis & Next Steps

- **Extractive vs. Abstractive**: The LexRank summary copies sentences verbatim from the sources, so it cannot introduce new claims, although a sentence pulled out of context can still mislead. BART/PEGASUS can rephrase and remove redundancy but may hallucinate.  
- **Scaling Up**: Swap the static URL list for an RSS-feed reader loop or a news API to ingest hundreds of articles.  
- **Clustering**: Before summarizing, group articles by similarity (e.g., embed each article and cluster nearby vectors, optionally backed by a vector store) to generate topic-specific summaries.  
- **Deployment**:  
  - **Local**: Expose via a Flask/FastAPI endpoint.  
  - **Cloud**: Containerize with Docker; deploy on AWS/GCP with autoscaling.  
  - **Real-Time**: Integrate with streaming platforms (Kafka, Pub/Sub) for live summarization.
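
As a rough illustration of the clustering step, the sketch below groups texts by bag-of-words cosine similarity using a greedy single pass. It is a stand-in for a real embedding model and clustering algorithm: the function names, the example headlines, and the 0.3 threshold are all assumptions chosen for the demonstration.

```python
import math
import re
from collections import Counter

def bag_of_words(text: str) -> Counter:
    """Lowercased word counts; a crude substitute for a learned embedding."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def greedy_cluster(texts: list[str], threshold: float = 0.3) -> list[list[int]]:
    """Put each text in the first cluster whose first member is similar enough."""
    bags = [bag_of_words(t) for t in texts]
    clusters: list[list[int]] = []
    for i, bag in enumerate(bags):
        for cluster in clusters:
            if cosine(bag, bags[cluster[0]]) >= threshold:
                cluster.append(i)
                break
        else:
            # No existing cluster matched, so start a new one.
            clusters.append([i])
    return clusters

docs = [
    "Stocks rallied as the central bank held interest rates steady.",
    "The new smartphone features a faster chip and a better camera.",
    "Interest rates stay unchanged after the central bank meeting; stocks rose.",
    "Reviewers praised the smartphone camera and its faster chip.",
]
clusters = greedy_cluster(docs)
print(clusters)  # → [[0, 2], [1, 3]]
```

Each cluster's texts can then be joined and summarized separately, giving one summary per topic instead of a single summary over unrelated articles.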

## 3. Conclusion

In this use case, we built a simple yet powerful News Aggregation system:
1. **Fetched** real-world articles.  
2. **Summarized** them extractively with Sumy (LexRank).  
3. **Generated** fluent abstractive summaries with BART & PEGASUS.  

This pattern can be extended with clustering, reranking, custom fine-tuning, or pipeline orchestration to power real-time news dashboards and alerting systems.