# Sourcing and grouping RSS feeds

There's a lot of focus nowadays on web scraping for information retrieval, but before web scraping and its questionable ethics there was already a solution for machine-readable news data.

RSS (Rich Site Summary, also expanded as Really Simple Syndication) feeds publish frequently updated information on the news, pages or documents available from a site. The RSS document (an XML file) contains, for each item, a headline, a summary description (usually just one sentence), a link to the full article or document, and often a picture. RSS reader software and other news aggregation apps use these feeds to show which stories are available while minimising overhead (e.g. an app for browsing news articles doesn't have to retrieve and load every article or web page in full, just the much smaller amount of information in the RSS document).
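To make that structure concrete, here is a minimal sketch that parses a hand-written RSS 2.0 document with the standard library. The sample feed and the `parse_items` helper are made up for illustration; they are not taken from any of the feeds used below, and real feeds vary in which fields they include.

```python
import xml.etree.ElementTree as ET

# Hand-written sample feed (illustrative only, not a real feed)
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example News</title>
    <link>https://example.com/</link>
    <description>Illustrative feed</description>
    <item>
      <title>First headline</title>
      <description>One-sentence summary of the first story.</description>
      <link>https://example.com/first</link>
      <pubDate>Tue, 18 Jun 2019 18:26:23 GMT</pubDate>
    </item>
    <item>
      <title>Second headline</title>
      <description>One-sentence summary of the second story.</description>
      <link>https://example.com/second</link>
      <pubDate>Tue, 18 Jun 2019 17:00:00 GMT</pubDate>
    </item>
  </channel>
</rss>
"""


def parse_items(rss_text):
    """Hypothetical helper: return one dict per <item> in an RSS 2.0 string."""
    root = ET.fromstring(rss_text)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title"),
            "summary": item.findtext("description"),
            "link": item.findtext("link"),
            "date": item.findtext("pubDate"),
        })
    return items


items = parse_items(SAMPLE_RSS)
for story in items:
    print(story["title"], "->", story["link"])
```

The point is just that each story is a handful of short text fields, which is why a feed is so much cheaper to fetch than the articles themselves.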

So I can use them to create a corpus of summarized news items, right?

I got the details of how to fetch the RSS documents from https://www.pythonforbeginners.com/feedparser/using-feedparser-in-python

In [1]:
import json

import numpy as np
import pandas as pd

from datetime import datetime

# Own convenience module
from rsspump import RSSPump

## 1. Parsing lots of feeds!

Each of these feeds only gives a set number of news stories (to limit overhead/abuse, I guess), so let's parse lots of different feeds and build a news corpus.

These are world news sites from lots of different countries. I sourced all the links from this blog:
https://blog.feedspot.com/world_news_rss_feeds/

In [2]:
feeds_df = pd.read_csv("rss_urls.csv")
feeds_df.head()

Unnamed: 0,url,type
0,http://feeds.bbci.co.uk/news/world/rss.xml,world
1,http://feeds.reuters.com/Reuters/worldNews,world
2,http://www.cbn.com/cbnnews/world/feed/,world
3,https://news.google.com/news/rss/headlines/sec...,world
4,https://www.reddit.com/r/worldnews/.rsshttps:/...,world


In [3]:
# Parse all, drop info you don't want
corpus = []
entities = []

for url in feeds_df['url']:
    pump = RSSPump(url)
    # Extend in place rather than rebuilding the lists on every iteration
    corpus.extend(pump.get_stories())
    entities.extend(pump.get_entities())
print("Finished!  With {} articles".format(len(corpus)))

failed on  {'id': 'http://www.globalissues.org/news/2019/06/18/25377', 'guidislink': True, 'link': 'http://www.globalissues.org/news/2019/06/18/25377', 'title': 'Colombia – Trade Unionism Under Threat of Death', 'title_detail': {'type': 'text/plain', 'language': None, 'base': 'http://www.globalissues.org/news/feed', 'value': 'Colombia – Trade Unionism Under Threat of Death'}, 'updated': '2019-06-18T13:40:56-07:00', 'updated_parsed': time.struct_time(tm_year=2019, tm_mon=6, tm_mday=18, tm_hour=20, tm_min=40, tm_sec=56, tm_wday=1, tm_yday=169, tm_isdst=0), 'links': [{'href': 'http://www.globalissues.org/news/2019/06/18/25377', 'rel': 'alternate', 'type': 'text/html'}], 'summary': '<p><img src="http://cdn.ipsnews.net/Library/2019/06/colombia1.jpg" width="640" alt="" /></p><p>BOGOTA, Jun 18 (IPS)  - Miguel Morantes was almost murdered. Ever since, three bodyguards are part of his everyday life in one of the most dangerous countries for trade union members.</p><p><a href="http://www.globali

failed on  {'id': 'http://www.globalissues.org/news/2019/06/18/25374', 'guidislink': True, 'link': 'http://www.globalissues.org/news/2019/06/18/25374', 'title': 'The Importance of the Upcoming FAO Election', 'title_detail': {'type': 'text/plain', 'language': None, 'base': 'http://www.globalissues.org/news/feed', 'value': 'The Importance of the Upcoming FAO Election'}, 'updated': '2019-06-18T08:06:31-07:00', 'updated_parsed': time.struct_time(tm_year=2019, tm_mon=6, tm_mday=18, tm_hour=15, tm_min=6, tm_sec=31, tm_wday=1, tm_yday=169, tm_isdst=0), 'links': [{'href': 'http://www.globalissues.org/news/2019/06/18/25374', 'rel': 'alternate', 'type': 'text/html'}], 'summary': '<p><img src="http://cdn.ipsnews.net/Library/2019/06/Ambassador-Tom-official-photo_.jpg" width="640" alt="" /></p><p>ROME, Jun 18 (IPS)  - With each passing day, the world gets just a little smaller as the internet and cell phones bring our communities together, reveal our shared challenges, and lay bare our failures.  A

failed on  {'id': 'http://www.globalissues.org/news/2019/06/17/25372', 'guidislink': True, 'link': 'http://www.globalissues.org/news/2019/06/17/25372', 'title': '&lt;em&gt;The Art of the Deal&lt;/em&gt;: What Trump May Teach Us', 'title_detail': {'type': 'text/plain', 'language': None, 'base': 'http://www.globalissues.org/news/feed', 'value': '&lt;em&gt;The Art of the Deal&lt;/em&gt;: What Trump May Teach Us'}, 'updated': '2019-06-17T13:29:13-07:00', 'updated_parsed': time.struct_time(tm_year=2019, tm_mon=6, tm_mday=17, tm_hour=20, tm_min=29, tm_sec=13, tm_wday=0, tm_yday=168, tm_isdst=0), 'links': [{'href': 'http://www.globalissues.org/news/2019/06/17/25372', 'rel': 'alternate', 'type': 'text/html'}], 'summary': '<p>STOCKHOLM / ROME, Jun 17 (IPS)  - Like any dealer he was watching for the card\nthat is so high and wild\nhe\'ll never need to deal another.                                         \n\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\x

failed on  {'id': 'http://www.globalissues.org/news/2019/06/17/25370', 'guidislink': True, 'link': 'http://www.globalissues.org/news/2019/06/17/25370', 'title': 'Air Pollution Ranked as Biggest Environmental Threat to Human Health', 'title_detail': {'type': 'text/plain', 'language': None, 'base': 'http://www.globalissues.org/news/feed', 'value': 'Air Pollution Ranked as Biggest Environmental Threat to Human Health'}, 'updated': '2019-06-17T11:33:20-07:00', 'updated_parsed': time.struct_time(tm_year=2019, tm_mon=6, tm_mday=17, tm_hour=18, tm_min=33, tm_sec=20, tm_wday=0, tm_yday=168, tm_isdst=0), 'links': [{'href': 'http://www.globalissues.org/news/2019/06/17/25370', 'rel': 'alternate', 'type': 'text/html'}], 'summary': '<p><img src="http://cdn.ipsnews.net/Library/2019/06/Air-Pollution_2_.jpg" width="640" alt="" /></p><p>UNITED NATIONS, Jun 17 (IPS)  - In a world that is becoming more and more industrial by the day, air pollution appears to be on the rise. </p><p>While there have been e

failed on  {'id': 'http://www.globalissues.org/news/2019/06/18/25376', 'guidislink': True, 'link': 'http://www.globalissues.org/news/2019/06/18/25376', 'title': 'As Sudan Struggles, AU Should Press for Justice and Accountability', 'title_detail': {'type': 'text/plain', 'language': None, 'base': 'http://www.globalissues.org/news/feed', 'value': 'As Sudan Struggles, AU Should Press for Justice and Accountability'}, 'updated': '2019-06-18T09:49:32-07:00', 'updated_parsed': time.struct_time(tm_year=2019, tm_mon=6, tm_mday=18, tm_hour=16, tm_min=49, tm_sec=32, tm_wday=1, tm_yday=169, tm_isdst=0), 'links': [{'href': 'http://www.globalissues.org/news/2019/06/18/25376', 'rel': 'alternate', 'type': 'text/html'}], 'summary': '<p><img src="http://cdn.ipsnews.net/Library/2019/06/africanunion.jpg" width="640" alt="" /></p><p>WASHINGTON DC, Jun 18 (IPS)  - Carine Kaneza Nantulya is the Africa advocacy director at Human Rights WatchOn June 6, the <a href="https://au.int/en" data-saferedirecturl="http

failed on  {'id': 'http://www.globalissues.org/news/2019/06/18/25375', 'guidislink': True, 'link': 'http://www.globalissues.org/news/2019/06/18/25375', 'title': 'UN’s Development Goals Remain Largely Elusive', 'title_detail': {'type': 'text/plain', 'language': None, 'base': 'http://www.globalissues.org/news/feed', 'value': 'UN’s Development Goals Remain Largely Elusive'}, 'updated': '2019-06-18T07:51:09-07:00', 'updated_parsed': time.struct_time(tm_year=2019, tm_mon=6, tm_mday=18, tm_hour=14, tm_min=51, tm_sec=9, tm_wday=1, tm_yday=169, tm_isdst=0), 'links': [{'href': 'http://www.globalissues.org/news/2019/06/18/25375', 'rel': 'alternate', 'type': 'text/html'}], 'summary': '<p><img src="http://cdn.ipsnews.net/Library/2019/06/sdgs_22_.jpg" width="640" alt="" /></p><p>UNITED NATIONS, Jun 18 (IPS)  - The United Nations, in a new report to be released next month, has warned "there is no escaping the fact that the global landscape for the implementation of the 17 Sustainable Development Goa

failed on  {'id': 'http://www.globalissues.org/news/2019/06/17/25371', 'guidislink': True, 'link': 'http://www.globalissues.org/news/2019/06/17/25371', 'title': "'What it Takes to Feed 7.5 Billion People'", 'title_detail': {'type': 'text/plain', 'language': None, 'base': 'http://www.globalissues.org/news/feed', 'value': "'What it Takes to Feed 7.5 Billion People'"}, 'updated': '2019-06-17T13:08:07-07:00', 'updated_parsed': time.struct_time(tm_year=2019, tm_mon=6, tm_mday=17, tm_hour=20, tm_min=8, tm_sec=7, tm_wday=0, tm_yday=168, tm_isdst=0), 'links': [{'href': 'http://www.globalissues.org/news/2019/06/17/25371', 'rel': 'alternate', 'type': 'text/html'}], 'summary': '<p><img src="http://cdn.ipsnews.net/Library/2019/06/48079080523_a3f7347db0_z.jpg" width="640" alt="" /></p><p>ANKARA, Jun 17 (IPS)  - Events marking the <a href="https://www.unccd.int/news-events/25-years-protecting-our-land-biodiversity-and-climate">25th anniversary of the Convention to Combat Desertification (UNCCD) and 
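
All of the failures above come from the same feed, and the visible part of each entry has an `updated` field. I can't tell from the truncated output what actually went wrong inside `RSSPump`, so the following is only a sketch of one defensive approach, assuming the problem is a missing field such as `published`. The `entry_to_story` helper is hypothetical and is not the code in `rsspump`.

```python
from datetime import datetime


def entry_to_story(entry, source_url):
    """Hypothetical helper: build a story dict from a dict-like feed entry.

    Returns None when the entry lacks a title or link, instead of raising.
    """
    title = entry.get("title")
    link = entry.get("link")
    if not title or not link:
        return None

    # Assumption: fall back to 'updated' when 'published' is absent
    date = entry.get("published") or entry.get("updated")

    return {
        "title": title,
        "summary": entry.get("summary", ""),
        "date": date,
        "link": link,
        "source_url": source_url,
        "retrieval_timestamp": str(datetime.now()),
    }


# A cut-down version of one of the failing entries above
entry = {
    "title": "The Importance of the Upcoming FAO Election",
    "link": "http://www.globalissues.org/news/2019/06/18/25374",
    "updated": "2019-06-18T08:06:31-07:00",
}
story = entry_to_story(entry, "http://www.globalissues.org/news/feed")
print(story["date"])
```

Using `.get` with fallbacks means one oddly shaped feed costs a few missing values rather than a batch of dropped stories.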

## 2.  Quick check that the data's what we expect

In [4]:
len(corpus)

1561

In [5]:
corpus[0]

{'title': 'Patrick Shanahan: Trump says his choice for Pentagon chief is out',
 'summary': 'The president tweeted that Patrick Shanahan will "devote more time to his family".',
 'date': 'Tue, 18 Jun 2019 18:26:23 GMT',
 'link': 'https://www.bbc.co.uk/news/world-us-canada-48665798',
 'source_url': 'http://feeds.bbci.co.uk/news/world/rss.xml',
 'retrieval_timestamp': '2019-06-18 19:45:18.293350'}
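
Looking at one record is a start, but a slightly broader check is cheap. This is a minimal sketch, assuming every story should have exactly the six keys shown above; `check_corpus` and the tiny sample list are made up for illustration.

```python
from collections import Counter

# Keys seen in corpus[0] above
EXPECTED_KEYS = {"title", "summary", "date", "link",
                 "source_url", "retrieval_timestamp"}


def check_corpus(stories):
    """Hypothetical helper: count stories per feed and collect odd-looking records."""
    per_source = Counter(s.get("source_url") for s in stories)
    malformed = [s for s in stories if set(s) != EXPECTED_KEYS]
    return per_source, malformed


# Tiny stand-in for the real corpus
sample_corpus = [
    {"title": "A", "summary": "a", "date": "d", "link": "l1",
     "source_url": "feed-1", "retrieval_timestamp": "t"},
    {"title": "B", "summary": "b", "date": "d", "link": "l2",
     "source_url": "feed-1", "retrieval_timestamp": "t"},
    {"title": "C", "link": "l3", "source_url": "feed-2"},  # missing keys
]

per_source, malformed = check_corpus(sample_corpus)
print(per_source.most_common())
print("{} malformed records".format(len(malformed)))
```

Running `check_corpus(corpus)` on the real list would show whether any feed contributed nothing at all, which is easy to miss when only the total is printed.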

In [6]:
entities[0]

{'source_url': 'http://feeds.bbci.co.uk/news/world/rss.xml',
 'retrieval_timestamp': '2019-06-18 19:45:18.293350',
 'title': 'Patrick Shanahan: Trump says his choice for Pentagon chief is out',
 'entities': 'PERSON:Patrick Shanahan,ORG:Trump,ORG:Pentagon,PERSON:Patrick Shanahan'}
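
The entities come back as one comma-separated string of `LABEL:text` pairs, and the same entity can appear more than once (Patrick Shanahan is listed twice above). Here is a small sketch for splitting that string back into pairs and counting them. `parse_entities` is a hypothetical helper, and the naive split on commas would break if an entity's text itself contained a comma.

```python
from collections import Counter


def parse_entities(entity_string):
    """Hypothetical helper: split 'LABEL:text,LABEL:text' into (label, text) pairs."""
    pairs = []
    for chunk in entity_string.split(","):
        # partition splits on the first colon only, so colons in the text survive
        label, sep, text = chunk.partition(":")
        if sep:  # skip empty or malformed chunks with no colon
            pairs.append((label, text))
    return pairs


sample = "PERSON:Patrick Shanahan,ORG:Trump,ORG:Pentagon,PERSON:Patrick Shanahan"
counts = Counter(parse_entities(sample))
for (label, text), n in counts.items():
    print(label, text, n)
```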

## 3.  Dump to file

In [7]:
# Dump the corpus to file, record the date and time in the filename
filename = "./working/RSS_corpus_{}.json".format(datetime.now().strftime("%Y-%m-%d_%H%M"))

with open(filename, "w") as f:
    json.dump(corpus, f)

In [8]:
# Dump the entities to file, record the date and time in the filename
filename = "./working/RSS_entities_{}.json".format(datetime.now().strftime("%Y-%m-%d_%H%M"))

with open(filename, "w") as f:
    json.dump(entities, f)
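
Both dump cells assume the `./working` directory already exists, since `open` won't create it. A sketch of a small helper that creates the directory if needed and returns the path it wrote to is below. `dump_with_timestamp` is a hypothetical name, and the example writes to a temporary directory so it doesn't touch the real output folder.

```python
import json
import os
import tempfile
from datetime import datetime


def dump_with_timestamp(records, directory, prefix):
    """Hypothetical helper: write records to <directory>/<prefix>_<date>_<time>.json."""
    os.makedirs(directory, exist_ok=True)  # create the folder if it's missing
    stamp = datetime.now().strftime("%Y-%m-%d_%H%M")
    path = os.path.join(directory, "{}_{}.json".format(prefix, stamp))
    with open(path, "w") as f:
        json.dump(records, f)
    return path


# Round-trip check in a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    path = dump_with_timestamp([{"title": "example"}],
                               os.path.join(tmp, "working"), "RSS_corpus")
    with open(path) as f:
        reloaded = json.load(f)

print(os.path.basename(path), len(reloaded))
```

Reading the file straight back in is a quick way to confirm the dump is valid JSON before relying on it later.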