# Sourcing and grouping RSS feeds

There's a lot of focus nowadays on webscraping for information retrieval, but before web-scraping and its questionable ethics there already existed a solution for computer-readable news data.

RSS (Rich Site Summary) feeds publish frequently updated information on the news/pages/documents available from a site.  The RSS document (an xml kind of thing) contains a headline, summary description (usually just one sentence), a picture and a link to the full article or document.  RSS systems are used in RSS reader software and other news aggregation apps to provide information on available stories while minimising overhead (eg; an app for browsing news articles doesn't have to retrieve and load every article/web page in full, just the much smaller amount of information in the RSS document).

So I can use them to create a corpus of summarized news items right?

Info on how to get the RSS docs from https://www.pythonforbeginners.com/feedparser/using-feedparser-in-python

In [1]:
import numpy as np
import pandas as pd
import feedparser

### The BBC feed
Is a fairly straight-forward example

In [2]:
bbc = feedparser.parse("http://feeds.bbci.co.uk/news/rss.xml")

In [14]:
bbc.keys()

dict_keys(['feed', 'entries', 'bozo', 'headers', 'href', 'status', 'encoding', 'version', 'namespaces'])

In [9]:
len(bbc['entries'])

41

In [8]:
for article in bbc['entries']:
    print(article['title'], article['summary'], article['links'][0]['href'])

Brexit: Ministers plead with MPs to back Theresa May's deal MPs are said to be planning to delay Brexit, as Jeremy Corbyn vows to try to topple the government "soon". https://www.bbc.co.uk/news/uk-politics-46853689
Alps snow: Avalanche kills three skiers near Lech, Austria Three German skiers die and one is missing after an avalanche near the Austrian ski resort of Lech. https://www.bbc.co.uk/news/world-europe-46854663
Paris bakery explosion death toll rises to four The body of a woman is found in the rubble of the buildings destroyed by a huge explosion on Saturday. https://www.bbc.co.uk/news/world-europe-46856109
Sturgeon refers herself to standards panel over Salmond case A standards panel will review the conduct of Scotland's first minister during an investigation into the Alex Salmond allegations. https://www.bbc.co.uk/news/uk-scotland-46856934
Woman and toddler left in crashed car in Long Eaton The driver hit five cars before crashing and trying to run off, leaving a toddler and 

### The Reuters feed
Includes a lot of formatting information within the summary field, which is a pain in the ass.  Solved by splitting on "<"

In [22]:
reu = feedparser.parse("http://feeds.reuters.com/Reuters/worldNews")

In [24]:
len(reu['entries'])

20

In [21]:
for article in reu['entries']:
    print(article['title'], article['summary'].split("<")[0], article['links'][0]['href'])

Sex abuse cases color immigration debate before Finnish election The parliamentary heads of two of Finland's largest parties have called for action after investigations against 19 foreign-born men on suspicion of sexual abuse of minors. http://feeds.reuters.com/~r/Reuters/worldNews/~3/rQGsBqoi9Mg/sex-abuse-cases-color-immigration-debate-before-finnish-election-idUSKCN1P70OE
French media denounce violent 'yellow vest' attacks on press French media and reporters' organizations on Sunday denounced attacks on journalists by "yellow vest" anti-government protesters and called for better protection after a series of incidents this weekend. http://feeds.reuters.com/~r/Reuters/worldNews/~3/rFeAOyBS21w/french-media-denounce-violent-yellow-vest-attacks-on-press-idUSKCN1P70J3
Congo should recount presidential election vote - Southern African bloc An influential African bloc urged Democratic Republic of Congo on Sunday to recount votes cast in its disorganized presidential election, raising pressu

In [25]:
cbn = feedparser.parse("http://www.cbn.com/cbnnews/world/feed/")

## 1. Parsing lots of feeds!

Each of these feeds only gives a set number of news stories (to limit overhead/abuse I guess) so lets parse lots of different feeds and build a news corpus.

Lots of different country's world news sites - I sourced all the links from this blog;
https://blog.feedspot.com/world_news_rss_feeds/

In [2]:
feeds = ["http://feeds.bbci.co.uk/news/world/rss.xml",
         "http://feeds.reuters.com/Reuters/worldNews",
         "http://www.cbn.com/cbnnews/world/feed/",
         "https://news.google.com/news/rss/headlines/section/topic/WORLD?ned=us&hl=en",
         "https://www.reddit.com/r/worldnews/.rss"
         "https://www.nytimes.com/svc/collections/v1/publish/https://www.nytimes.com/section/world/rss.xml",
         "https://www.buzzfeed.com/world.xml",
         "http://www.aljazeera.com/xml/rss/all.xml",
         "http://defence-blog.com/feed",
         "http://www.e-ir.info/category/blogs/feed",
         "http://www.globalissues.org/news/feed",
         "https://www.thecipherbrief.com/feed",
         "http://feeds.feedburner.com/WarNewsUpdates",
         "https://www.yahoo.com/news/world/rss",
         "https://www.theguardian.com/world/rss",
         "http://feeds.washingtonpost.com/rss/world",
         "http://timesofindia.indiatimes.com/rssfeeds/296589292.cms",
         "http://feeds.feedburner.com/ndtvnews-world-news",
         "https://www.rt.com/rss/news",
         "http://www.independent.co.uk/news/world/rss",
         "http://www.spiegel.de/international/index.rss",
         "http://www.npr.org/rss/rss.php?id=1004",
         "http://feeds.feedburner.com/daily-express-world-news",
         "https://www.cnbc.com/id/100727362/device/rss/rss.html",
         "http://www.mirror.co.uk/news/world-news/rss.xml",
         "http://www.cbc.ca/cmlink/rss-world",
         "https://www.cbsnews.com/latest/rss/world",
         "https://www.thesun.co.uk/news/worldnews/feed",
         "http://www.latimes.com/world/rss2.0.xml",
         "https://sputniknews.com/export/rss2/world/index.xml",
         "http://abcnews.go.com/abcnews/internationalheadlines",
         "http://www.abc.net.au/news/feed/52278/rss.xml",
         "https://www.vox.com/rss/world/index.xml",
         "http://feeds.skynews.com/feeds/rss/world.xml",
         "http://www.smh.com.au/rssheadlines/world/article/rss.xml",
         "http://en.rfi.fr/general/rss",
         "http://feeds.news24.com/articles/news24/World/rss",
         "http://www.rawstory.com/category/world/feed",
         "http://www.euronews.com/rss?level=theme&name=news"]

In [47]:
# Parse all, drop info you don't want
corpus = []

for source in feeds:
    feed = feedparser.parse(source)
    
    for article in feed['entries']:
    
        try:
            corpus.append({"title":article['title'],                  # News titles
                           "summary":article['summary'].split("<")[0],  # payload (sans any HTML stuff)
                           "date":article['published'],
                           "link":article['links'][0]['href']})      # associated links
    
        except KeyError as e:
            print("failed on ", article, e)

print("Finished!  With {} articles".format(len(corpus)))

failed on  {'id': 'http://www.globalissues.org/news/2019/01/12/24878', 'guidislink': True, 'link': 'http://www.globalissues.org/news/2019/01/12/24878', 'title': "Argentina's Indigenous People Fight for Land Rights", 'title_detail': {'type': 'text/plain', 'language': None, 'base': 'http://www.globalissues.org/news/feed', 'value': "Argentina's Indigenous People Fight for Land Rights"}, 'updated': '2019-01-12T00:24:33-08:00', 'updated_parsed': time.struct_time(tm_year=2019, tm_mon=1, tm_mday=12, tm_hour=8, tm_min=24, tm_sec=33, tm_wday=5, tm_yday=12, tm_isdst=0), 'links': [{'href': 'http://www.globalissues.org/news/2019/01/12/24878', 'rel': 'alternate', 'type': 'text/html'}], 'summary': '<p><img src="http://cdn.ipsnews.net/Library/2019/01/00.jpg" width="640" alt="" /></p><p>TARTAGAL, Argentina, Jan 12 (IPS)  - Nancy López lives in a house made of clay, wood and corrugated metal sheets, on private land dedicated to agriculture. She is part of an indigenous community of 12 families in north

Finished!  With 989 articles


In [42]:
for article in corpus:
    print(article['date'])

Saudi Prince al-Faisal warns against US Syria pullout
Juan Guaidó: Venezuela's opposition leader briefly detained
James Watson: Scientist loses titles after claims over race
Paris bakery explosion death toll rises to four
Trump denies hiding detail of Putin summit talks from staff
Alps snow: Avalanche kills three skiers near Lech, Austria
Greek government crisis over Macedonia name change
DR Congo election: Sadc proposes unity government
ANC manifesto: Land, jobs and blockchain
Cesare Battisti: Italian ex-militant arrested in Bolivia
Andy Murray: Roger Federer & Novak Djokovic pay tribute to Briton
Colombia: Missing Farc leader Iván Márquez re-appears on video
Refugee homes in Lebanon have been flooded
The Maasai 'Olympics' helping to save lions' lives
Rahaf al-Qunun: Saudi teen refugee arrives in Canada
Actress Rania Youssef facing jail term over revealing dress
US shutdown: 'I don't need a wall, I want money to plant crops'
Jayme Closs: Missing girl lauded for 'courageous' escape
Sno

In [45]:
feed['entries'][0]

{'title': 'Democrats vow congressional action following reports on Trump, Russia',
 'title_detail': {'type': 'text/plain',
  'language': None,
  'base': 'https://www.euronews.com/rss?level=theme&name=news',
  'value': 'Democrats vow congressional action following reports on Trump, Russia'},
 'links': [{'rel': 'alternate',
   'type': 'text/html',
   'href': 'http://www.euronews.com/2019/01/13/democrats-vow-congressional-action-following-reports-trump-russia-n958151'}],
 'link': 'http://www.euronews.com/2019/01/13/democrats-vow-congressional-action-following-reports-trump-russia-n958151',
 'id': 'http://www.euronews.com/2019/01/13/democrats-vow-congressional-action-following-reports-trump-russia-n958151',
 'guidislink': False,
 'published': 'Sun, 13 Jan 2019 19:18:00 +0100',
 'published_parsed': time.struct_time(tm_year=2019, tm_mon=1, tm_mday=13, tm_hour=18, tm_min=18, tm_sec=0, tm_wday=6, tm_yday=13, tm_isdst=0),
 'summary': '&quot;America deserves the truth and the Foreign Affairs Com