# News Crawling

News crawling collects articles from many sources as they are published, so the latest information can be analyzed in near real time. This exercise explores two approaches to news crawling: (1) reading RSS feeds to collect structured updates from news websites, and (2) using the Fundus library developed at Humboldt University Berlin, which provides tools for advanced web scraping and data extraction.

## Crawl a News Website Using an RSS Reader



In [1]:
# RSS feed: https://www.tagesschau.de/infoservices/alle-meldungen-100~rss2.xml
import feedparser

feed = feedparser.parse(
    "https://www.tagesschau.de/infoservices/alle-meldungen-100~rss2.xml"
)

for entry in feed.entries:
    print(f"Title: {entry.title}")
    print(f"Link: {entry.link}")
    print(f"Published: {entry.published}")
    print(f"Summary: {entry.summary}")
    print("-" * 80)

Title: Forscher halten stärkeres Beben in Istanbul für möglich
Link: https://www.tagesschau.de/ausland/istanbul-sorge-vor-neuen-beben-100.html
Published: Wed, 23 Apr 2025 20:58:01 +0200
Summary: Mehr als 200 Verletzte, aber kaum Schäden: Die Bilanz des Erdbebens in Istanbul klingt weniger schlimm als befürchtet. Doch Forscher rechnen bald mit heftigeren Erschütterungen. Warum ist die Region besonders gefährdet?
--------------------------------------------------------------------------------
Title: Ukraine-Liveblog: ++ Trump macht der Ukraine neue Vorwürfe ++
Link: https://www.tagesschau.de/newsticker/liveblog-ukraine-mittwoch-490.html
Published: Wed, 23 Apr 2025 20:43:44 +0200
Summary: US-Präsident Trump kritisiert die Ukraine. Das Land verlängere den Krieg, weil es die Krim nicht aufgeben wolle. Die ukrainische Vizeregierungschefin Swyrydenko betont, dass Gespräche mit Russland möglich seien, eine Kapitulation aber nicht.
---------------------------------------------------------------
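
The `Published` values above are RFC 822 style date strings, which are awkward to sort or compare as plain text. The following is a minimal sketch, using only the Python standard library, of converting such a string to a timezone-aware UTC `datetime`. The helper name `published_to_utc` is hypothetical, and the sketch assumes the feed always includes an explicit UTC offset such as `+0200`.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def published_to_utc(published: str) -> datetime:
    """Parse an RFC 822 style date string and convert it to UTC.

    Assumes the string carries an explicit offset (e.g. "+0200").
    """
    return parsedate_to_datetime(published).astimezone(timezone.utc)


# Date string copied from the feed output above.
print(published_to_utc("Wed, 23 Apr 2025 20:58:01 +0200").isoformat())
# prints 2025-04-23T18:58:01+00:00
```

Normalizing to UTC makes entries from feeds in different time zones directly comparable, for example when sorting by publication time.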

## Crawl News Websites with Fundus

Fundus library: https://github.com/flairNLP/fundus

In [2]:
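from fundus import Crawler, PublisherCollection

# Restrict the crawler to a single German publisher.
crawler = Crawler(PublisherCollection.de.Tagesschau)

# Crawl up to 10 articles and also write them to a JSON file.
for article in crawler.crawl(max_articles=10, save_to_file="tagesschau_news.json"):
    print(article)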

Fundus-Article including 1 image(s):
- Title: "Forscher halten stärkeres Beben in Istanbul für möglich"
- Text:  "Mehr als 200 Verletzte, aber kaum Schäden: Die Bilanz des Erdbebens in Istanbul
          klingt weniger schlimm als befürchtet. Doch Forscher rechnen [...]"
- URL:    https://www.tagesschau.de/ausland/istanbul-sorge-vor-neuen-beben-100.html
- From:   Tagesschau (2025-04-23 18:58)
Fundus-Article including 3 image(s):
- Title: "Ukraine-Liveblog: ++ Trump macht der Ukraine neue Vorwürfe ++"
- Text:  "US-Präsident Trump kritisiert die Ukraine. Das Land verlängere den Krieg, weil
          es die Krim nicht aufgeben wolle. Die ukrainische [...]"
- URL:    https://www.tagesschau.de/newsticker/liveblog-ukraine-mittwoch-490.html
- From:   Tagesschau (2025-04-23 18:43)
Fundus-Article:
- Title: "Ukraine-Verhandlungen in der Sackgasse?"
- Text:  "Ein neuer US-Vorschlag zur Beendigung des Kriegs gegen die Ukraine, der auch
          Gebietsabtretungen vorsieht, stößt in Kiew auf Ableh
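
Fundus can also filter articles while crawling: the `only_complete` parameter of `crawl` accepts a callable that receives the extracted attributes as a dict. According to the Fundus filtering documentation, such a filter works inversely to Python's built-in `filter`: returning `True` drops the article, returning `False` keeps it. The sketch below illustrates that convention offline. The factory `keep_only_keyword` and the sample dicts are hypothetical stand-ins, not Fundus code or real extractions.

```python
from typing import Any, Callable, Dict

ExtractionFilter = Callable[[Dict[str, Any]], bool]


def keep_only_keyword(keyword: str) -> ExtractionFilter:
    """Build a filter that drops every article NOT mentioning `keyword`."""
    needle = keyword.lower()

    def extraction_filter(extracted: Dict[str, Any]) -> bool:
        title = extracted.get("title")
        body = extracted.get("body")
        if not title or not body:
            return True  # drop: title or body missing
        if needle in title.lower():
            return False  # keep: keyword in title
        topics = extracted.get("topics") or []
        if any(needle in topic.lower() for topic in topics):
            return False  # keep: keyword in a topic
        return True  # drop: keyword not found

    return extraction_filter


# Hypothetical hand-written dicts standing in for extracted attributes.
samples = [
    {"title": "Trump in the title", "body": "text", "topics": []},
    {"title": "Keyword only in topics", "body": "text", "topics": ["Donald Trump"]},
    {"title": "Unrelated article", "body": "text", "topics": ["Sports"]},
    {"title": "Trump but no body", "body": None, "topics": []},
]

trump_only = keep_only_keyword("trump")
kept = [sample["title"] for sample in samples if not trump_only(sample)]
print(kept)
# prints ['Trump in the title', 'Keyword only in topics']
```

Testing a filter against plain dicts like this is a quick way to check which direction it works in before spending time on a real crawl.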

In [8]:
# TODO:
# - Update the crawler to crawl news sources from the US.
# - Read the docs starting at https://github.com/flairNLP/fundus/blob/master/docs/3_the_article_class.md
#   and write a filter that drops all articles mentioning "trump" in the topics or title.
# - The remaining articles should have a title and a body.

from typing import Any

from fundus import Crawler, PublisherCollection

crawler = Crawler(PublisherCollection.us)


def topic_filter(extracted: dict[str, Any]) -> bool:
    """Fundus extraction filter: return True to DROP an article, False to keep it."""
    # Drop articles whose title or body is missing or empty.
    if not extracted.get("title") or not extracted.get("body"):
        return True

    # Drop articles that mention "trump" in the title.
    if "trump" in extracted["title"].lower():
        return True

    # Drop articles that mention "trump" in any topic.
    # https://realpython.com/python-walrus-operator/
    if topics := extracted.get("topics"):
        for topic in topics:
            if "trump" in topic.lower():
                return True
    return False


for article in crawler.crawl(
    max_articles=5,
    only_complete=topic_filter
):
    print(article)

Fundus-Article:
- Title: "What to know about the attack on tourists in Indian-administered Kashmir"
- Text:  "At least 26 people were killed and more than a dozen were injured when militants
          opened fire on tourists in a picturesque mountain town in [...]"
- URL:    https://www.washingtonpost.com/world/2025/04/23/india-kashmir-attack-pahalgam-baisaran/
- From:   Washington Post (2025-04-23 18:46)
Fundus-Article including 1 image(s):
- Title: "Star Wars’ ‘Andor’ Season 2 Depicts the Banality of American Fascism"
- Text:  "The franchise’s Disney era has been defined by toothless politics, but Andor
          Season 2 is a vivid metaphor for America’s descent into [...]"
- URL:    https://www.wired.com/story/star-wars-andor-season-2-depicts-the-banality-of-american-fascism/
- From:   Wired (2025-04-23 13:43)
Fundus-Article including 2 image(s):
- Title: "Grupo Firme Announce Album 'Evolución,' First Original LP in 3 Years"
- Text:  "Frontman Eduin Caz says the group has been "mor