# Sourcing and grouping RSS feeds

There's a lot of focus nowadays on web scraping for information retrieval, but before web scraping and its questionable ethics there was already a solution for machine-readable news data.

RSS (Rich Site Summary, also known as Really Simple Syndication) feeds publish frequently updated information on the news, pages or documents available from a site. The RSS document (an XML file) contains, for each item, a headline, a summary description (usually just one sentence), optionally a picture, and a link to the full article or document. RSS reader software and other news aggregation apps use these feeds to show which stories are available while minimising overhead (e.g. an app for browsing news articles doesn't have to retrieve and load every article or web page in full, just the much smaller amount of information in the RSS document).
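To make that structure concrete, here is a minimal sketch that pulls the headline, summary, link and date out of a small RSS 2.0 document using only the standard library. The sample XML and the `parse_items` helper are made up for illustration; real feeds vary in which tags they include, so missing fields come back as `None`.

```python
import xml.etree.ElementTree as ET

# Hand-written RSS 2.0 sample (illustrative only, not taken from a real feed)
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example World News</title>
    <link>https://example.com/news</link>
    <description>Illustrative feed</description>
    <item>
      <title>First headline</title>
      <description>One-sentence summary of the first story.</description>
      <link>https://example.com/news/1</link>
      <pubDate>Fri, 31 May 2019 18:27:49 GMT</pubDate>
    </item>
    <item>
      <title>Second headline</title>
      <description>One-sentence summary of the second story.</description>
      <link>https://example.com/news/2</link>
    </item>
  </channel>
</rss>"""


def parse_items(rss_text):
    """Return one dict per <item>, with None for any missing field."""
    root = ET.fromstring(rss_text)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title"),
            "summary": item.findtext("description"),
            "link": item.findtext("link"),
            "date": item.findtext("pubDate"),
        })
    return items


stories = parse_items(SAMPLE_RSS)
for story in stories:
    print(story["title"], "->", story["link"])
```

In practice a dedicated feed-parsing library copes with the many feed dialects far better than this; the sketch is only meant to show how little data each item carries.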

So I can use RSS feeds to create a corpus of summarized news items, right?

Info on how to get the RSS docs came from https://www.pythonforbeginners.com/feedparser/using-feedparser-in-python

In [1]:
import json

import numpy as np
import pandas as pd

from datetime import datetime

# Own convenience module
from rsspump import RSSPump

## 1. Parsing lots of feeds!

Each of these feeds only gives a set number of news stories (to limit overhead/abuse, I guess), so let's parse lots of different feeds and build a news corpus.

The feeds are world news sites from lots of different countries. I sourced all the links from this blog:
https://blog.feedspot.com/world_news_rss_feeds/

In [2]:
feeds_df = pd.read_csv("rss_urls.csv")
feeds_df.head()

| Unnamed: 0 | url | type |
|---|---|---|
| 0 | http://feeds.bbci.co.uk/news/world/rss.xml | world |
| 1 | http://feeds.reuters.com/Reuters/worldNews | world |
| 2 | http://www.cbn.com/cbnnews/world/feed/ | world |
| 3 | https://news.google.com/news/rss/headlines/sec... | world |
| 4 | https://www.reddit.com/r/worldnews/.rsshttps:/... | world |


In [3]:
# Parse all, drop info you don't want
corpus = []
entities = []

# Only the url column is needed, so iterate over it directly
for url in feeds_df['url']:
    pump = RSSPump(url)
    corpus.extend(pump.get_stories())
    entities.extend(pump.get_entities())
print("Finished!  With {} articles".format(len(corpus)))

failed on  {'id': 'http://www.globalissues.org/news/2019/05/31/25329', 'guidislink': True, 'link': 'http://www.globalissues.org/news/2019/05/31/25329', 'title': 'We Can’t Halt Extinctions Unless We Protect Water', 'title_detail': {'type': 'text/plain', 'language': None, 'base': 'http://www.globalissues.org/news/feed', 'value': 'We Can’t Halt Extinctions Unless We Protect Water'}, 'updated': '2019-05-31T10:22:11-07:00', 'updated_parsed': time.struct_time(tm_year=2019, tm_mon=5, tm_mday=31, tm_hour=17, tm_min=22, tm_sec=11, tm_wday=4, tm_yday=151, tm_isdst=0), 'links': [{'href': 'http://www.globalissues.org/news/2019/05/31/25329', 'rel': 'alternate', 'type': 'text/html'}], 'summary': '<p><img src="http://cdn.ipsnews.net/Library/2019/05/fisherman_2_.jpg" width="640" alt="" /></p><p>COLOMBO, Sri Lanka, May 31 (IPS)  - Claudia Sadoff is Director General, International Water Management Institute (IWMI)Global <a href="https://www.ipbes.net/news/Media-Release-Global-Assessment" rel="noopener" 

failed on  {'id': 'http://www.globalissues.org/news/2019/05/28/25322', 'guidislink': True, 'link': 'http://www.globalissues.org/news/2019/05/28/25322', 'title': 'Asia-Pacific Region Viewed as Engine of the World Economy', 'title_detail': {'type': 'text/plain', 'language': None, 'base': 'http://www.globalissues.org/news/feed', 'value': 'Asia-Pacific Region Viewed as Engine of the World Economy'}, 'updated': '2019-05-28T13:19:10-07:00', 'updated_parsed': time.struct_time(tm_year=2019, tm_mon=5, tm_mday=28, tm_hour=20, tm_min=19, tm_sec=10, tm_wday=1, tm_yday=148, tm_isdst=0), 'links': [{'href': 'http://www.globalissues.org/news/2019/05/28/25322', 'rel': 'alternate', 'type': 'text/html'}], 'summary': '<p><img src="http://cdn.ipsnews.net/Library/2019/05/Armida-Salsiah-Alisjahbana_2_.jpg" width="640" alt="" /></p><p>BANGKOK, Thailand, May 28 (IPS)  - Armida Salsiah Alisjahbana * is UN Under-Secretary-General and Executive Secretary of the UN Economic and Social Commission for Asia and the P

failed on  {'id': 'http://www.globalissues.org/news/2019/05/29/25325', 'guidislink': True, 'link': 'http://www.globalissues.org/news/2019/05/29/25325', 'title': 'A Call for Concrete Changes to Achieve a More Gender Equal World', 'title_detail': {'type': 'text/plain', 'language': None, 'base': 'http://www.globalissues.org/news/feed', 'value': 'A Call for Concrete Changes to Achieve a More Gender Equal World'}, 'updated': '2019-05-29T13:49:06-07:00', 'updated_parsed': time.struct_time(tm_year=2019, tm_mon=5, tm_mday=29, tm_hour=20, tm_min=49, tm_sec=6, tm_wday=2, tm_yday=149, tm_isdst=0), 'links': [{'href': 'http://www.globalissues.org/news/2019/05/29/25325', 'rel': 'alternate', 'type': 'text/html'}], 'summary': '<p><img src="http://cdn.ipsnews.net/Library/2019/05/HRH-Sarah-Zeid_.jpg" width="640" alt="" /></p><p>AMMAN, May 29 (IPS)  - Princess Sarah Zeid is a member of UNHCR\'s Advisory Group on Gender, Forced Displacement, and Protection, a Special Advisor to the World Food Programme on
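The `failed on` lines above dump each problem entry in full, which makes the log hard to scan. One option, sketched below, is to record only the entry's `id` and `title` along with the error, and carry on with the rest of the feed. `process_entry` is a hypothetical stand-in for whatever `RSSPump` does per entry (its internals aren't shown here), and the missing `published` field is only a guess at the kind of thing that can go wrong. The sample entries are invented.

```python
def process_entry(entry):
    """Hypothetical per-entry step: keep only the fields we want."""
    return {
        "title": entry["title"],
        "summary": entry["summary"],
        "date": entry["published"],  # raises KeyError if the feed omits it
        "link": entry["link"],
    }


def process_all(entries):
    """Process every entry, collecting compact records for the failures."""
    stories, failures = [], []
    for entry in entries:
        try:
            stories.append(process_entry(entry))
        except KeyError as err:
            failures.append({
                "id": entry.get("id"),
                "title": entry.get("title"),
                "error": repr(err),
            })
    return stories, failures


# Invented entries: the second one has 'updated' instead of 'published'
sample_entries = [
    {"id": "a", "title": "Has a date", "summary": "s",
     "published": "Fri, 31 May 2019 18:27:49 GMT",
     "link": "https://example.com/a"},
    {"id": "b", "title": "No published field", "summary": "s",
     "updated": "2019-05-31T10:22:11-07:00",
     "link": "https://example.com/b"},
]

stories, failures = process_all(sample_entries)
print("parsed {}, failed {}".format(len(stories), len(failures)))
for failure in failures:
    print("failed on", failure["id"], failure["error"])
```

Keeping the failures in a list also means they can be counted per feed or dumped to file next to the corpus.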

## 2.  Quick check that the data's what we expect

In [4]:
len(corpus)

1557

In [5]:
corpus[0]

{'title': "Mexico 'won't be provoked by US' over migrant row",
 'summary': 'The Mexican leader hits back after President Trump vowed tariffs until illegal migration was curbed.',
 'date': 'Fri, 31 May 2019 18:27:49 GMT',
 'link': 'https://www.bbc.co.uk/news/world-us-canada-48477335',
 'source_url': 'http://feeds.bbci.co.uk/news/world/rss.xml',
 'retrieval_timestamp': '2019-05-31 21:23:05.565232'}
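Both `date` and `retrieval_timestamp` are stored as strings. If real datetime objects are needed later (for sorting or filtering, say), the standard library can parse both formats seen in this one example. This is a sketch based on that single record: other feeds may format their dates differently, so the hypothetical `parse_feed_date` helper returns `None` rather than raising.

```python
from datetime import datetime, timedelta
from email.utils import parsedate_to_datetime


def parse_feed_date(text):
    """Parse an RFC 822 style date string; return None if it does not fit."""
    try:
        return parsedate_to_datetime(text)
    except (TypeError, ValueError):
        return None


# The two string fields from the story shown above
story = {
    "date": "Fri, 31 May 2019 18:27:49 GMT",
    "retrieval_timestamp": "2019-05-31 21:23:05.565232",
}

published = parse_feed_date(story["date"])
retrieved = datetime.fromisoformat(story["retrieval_timestamp"])

print(published.isoformat())
print(retrieved.isoformat())
```

Note that `published` carries a timezone while `retrieved` does not, so the two can't be compared directly without first deciding which timezone the retrieval timestamp was recorded in.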

In [6]:
entities[0]

{'source_url': 'http://feeds.bbci.co.uk/news/world/rss.xml',
 'retrieval_timestamp': '2019-05-31 21:23:05.565232',
 'title': "Mexico 'won't be provoked by US' over migrant row",
 'entities': 'GPE:Mexico,GPE:US,NORP:Mexican,PERSON:Trump'}
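The `entities` value is a single comma-separated string of `LABEL:text` pairs. For any downstream grouping it's handy to split it back into pairs. A minimal sketch, assuming the entity text itself never contains a comma (the hypothetical `split_entities` helper would mis-split it if it did):

```python
from collections import Counter


def split_entities(entity_string):
    """Turn 'LABEL:text,LABEL:text' into a list of (label, text) tuples."""
    pairs = []
    for chunk in entity_string.split(","):
        if ":" not in chunk:
            continue  # skip empty or malformed chunks
        # Split on the first colon only, in case the text contains one
        label, text = chunk.split(":", 1)
        pairs.append((label.strip(), text.strip()))
    return pairs


# The entities string from the record shown above
record = {"entities": "GPE:Mexico,GPE:US,NORP:Mexican,PERSON:Trump"}

pairs = split_entities(record["entities"])
label_counts = Counter(label for label, _ in pairs)

print(pairs)
print(label_counts)
```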

## 3.  Dump to file

In [7]:
# Dump the corpus to file, record the date and time in the filename
filename = "./working/RSS_corpus_{}.json".format(datetime.now().strftime("%Y-%m-%d_%H%M"))

with open(filename, "w") as f:
    json.dump(corpus, f)

In [8]:
# Dump the entities to file, record the date and time in the filename
filename = "./working/RSS_entities_{}.json".format(datetime.now().strftime("%Y-%m-%d_%H%M"))

with open(filename, "w") as f:
    json.dump(entities, f)
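Both dump cells assume that `./working` already exists; `open` raises `FileNotFoundError` otherwise. The sketch below wraps the same idea in a hypothetical `dump_with_timestamp` helper that creates the directory first, then reads the file back as a sanity check. It writes to a temporary directory and uses an invented one-item sample so it can be run anywhere.

```python
import json
import tempfile
from datetime import datetime
from pathlib import Path


def dump_with_timestamp(records, out_dir, prefix):
    """Write records to <out_dir>/<prefix>_<date>_<time>.json and return the path."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)  # create ./working if missing
    stamp = datetime.now().strftime("%Y-%m-%d_%H%M")
    path = out_dir / "{}_{}.json".format(prefix, stamp)
    with open(path, "w") as f:
        json.dump(records, f)
    return path


# Invented sample standing in for the real corpus
sample = [{"title": "Example headline", "link": "https://example.com/1"}]

with tempfile.TemporaryDirectory() as tmp:
    path = dump_with_timestamp(sample, Path(tmp) / "working", "RSS_corpus")
    with open(path) as f:
        reloaded = json.load(f)
    file_name = path.name

print(file_name)
print(reloaded == sample)
```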