# Sourcing and grouping RSS feeds

There's a lot of focus nowadays on web scraping for information retrieval, but before web scraping and its questionable ethics there already existed a solution for computer-readable news data.

RSS (Rich Site Summary, also expanded as Really Simple Syndication) feeds publish frequently updated information about the news, pages or documents available from a site.  The RSS document (an XML file) contains, for each item, a headline, a summary description (usually just one sentence), often a picture, and a link to the full article or document.  RSS reader software and other news aggregation apps use these documents to list available stories while minimising overhead (e.g. an app for browsing news articles doesn't have to retrieve and load every article or web page in full, just the much smaller amount of information in the RSS document).
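To make that structure concrete, here is a minimal sketch that reads a hand-written RSS 2.0 document using only the standard library. The sample feed, its URLs and the `read_items` helper are all made up for illustration; real feeds carry more fields and are better handled by a dedicated parser.

```python
import xml.etree.ElementTree as ET

# A hand-written, minimal RSS 2.0 document (illustrative only, not a real feed)
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example News</title>
    <link>https://example.com/</link>
    <description>Illustrative feed</description>
    <item>
      <title>First headline</title>
      <description>One-sentence summary of the first story.</description>
      <link>https://example.com/story-1</link>
      <pubDate>Tue, 28 May 2019 15:26:06 GMT</pubDate>
    </item>
    <item>
      <title>Second headline</title>
      <description>One-sentence summary of the second story.</description>
      <link>https://example.com/story-2</link>
      <pubDate>Tue, 28 May 2019 16:00:43 GMT</pubDate>
    </item>
  </channel>
</rss>
"""


def read_items(rss_text):
    """Return one dict per <item>: headline, summary and link."""
    root = ET.fromstring(rss_text)
    return [
        {
            "title": item.findtext("title"),
            "summary": item.findtext("description"),
            "link": item.findtext("link"),
        }
        for item in root.iter("item")
    ]


items = read_items(SAMPLE_RSS)
for item in items:
    print(item["title"], "|", item["summary"], "|", item["link"])
```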

So I can use them to create a corpus of summarized news items, right?

Info on how to get the RSS docs comes from https://www.pythonforbeginners.com/feedparser/using-feedparser-in-python

In [51]:
import feedparser
import json
import nltk
import importlib

import numpy as np
import pandas as pd

# Only the datetime class is used below; a plain `import datetime` would be shadowed by this line
from datetime import datetime
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag

# For the spacy run
import spacy
from spacy import displacy
from collections import Counter

# Load the english language model
import en_core_web_sm
nlp = en_core_web_sm.load()

In [2]:
# These only need running once on your pc!
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Martin\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Martin\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping taggers\averaged_perceptron_tagger.zip.


True

### The BBC feed
A fairly straightforward example.

In [44]:
bbc = feedparser.parse("http://feeds.bbci.co.uk/news/rss.xml")

In [48]:
bbc.feed

{'title': 'BBC News - Home',
 'title_detail': {'type': 'text/plain',
  'language': None,
  'base': 'http://feeds.bbci.co.uk/news/rss.xml',
  'value': 'BBC News - Home'},
 'subtitle': 'BBC News - Home',
 'subtitle_detail': {'type': 'text/html',
  'language': None,
  'base': 'http://feeds.bbci.co.uk/news/rss.xml',
  'value': 'BBC News - Home'},
 'links': [{'rel': 'alternate',
   'type': 'text/html',
   'href': 'https://www.bbc.co.uk/news/'}],
 'link': 'https://www.bbc.co.uk/news/',
 'image': {'href': 'https://news.bbcimg.co.uk/nol/shared/img/bbc_news_120x60.gif',
  'title': 'BBC News - Home',
  'title_detail': {'type': 'text/plain',
   'language': None,
   'base': 'http://feeds.bbci.co.uk/news/rss.xml',
   'value': 'BBC News - Home'},
  'links': [{'rel': 'alternate',
    'type': 'text/html',
    'href': 'https://www.bbc.co.uk/news/'}],
  'link': 'https://www.bbc.co.uk/news/'},
 'generator_detail': {'name': 'RSS for Node'},
 'generator': 'RSS for Node',
 'updated': 'Tue, 28 May 2019 17:38

In [5]:
len(bbc['entries'])

43

In [6]:
for article in bbc['entries']:
    print(article['title'], article['summary'], article['links'][0]['href'])

Alastair Campbell expelled from Labour Party Tony Blair's former spin doctor and People's Vote campaigner was kicked out after voting Lib Dem. https://www.bbc.co.uk/news/uk-48434842
JLS star Oritse Williams not guilty of rape Oritse Williams denied raping the woman after a concert in Wolverhampton in December 2016. https://www.bbc.co.uk/news/uk-england-birmingham-48380382
Tory leadership contest: Jeremy Hunt warns against no-deal Brexit 'suicide' The foreign secretary says a no-deal policy could see the Tories "annihilated" in a general election. https://www.bbc.co.uk/news/uk-politics-48428761
Spencer Matthews at jeweller's during armed robbery Made in Chelsea's Spencer Matthews says he was collecting a vintage watch when armed robbers struck. https://www.bbc.co.uk/news/uk-england-london-48435132
Equality watchdog launches Labour anti-Semitism probe The Equality and Human Rights Commission has received a number of complaints about discrimination. https://www.bbc.co.uk/news/uk-48433964


### The Reuters feed
The summary field includes a lot of HTML formatting, which is a pain to deal with.  Solved by splitting on "<" and keeping only the text before the first tag.
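One caveat with splitting on "<": a summary that starts with a tag or an HTML comment comes back as an empty string. As an alternative, here is a minimal sketch that strips markup with the standard library's `html.parser`; the `strip_html` helper and its class are my own names, not part of feedparser.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML fragment (tags and comments are dropped)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(fragment):
    """Return the visible text of an HTML fragment with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(fragment)
    parser.close()
    # Join text nodes with a space, then collapse runs of whitespace
    return " ".join(" ".join(parser.parts).split())


sample = "<!--PSTYLE=TX Standard Intro--><p>A man has died in Dunmow High Street this afternoon. </p>"
print(repr(sample.split("<")[0]))   # the split approach on the same input
print(repr(strip_html(sample)))     # the parser-based approach
```

Joining text nodes with a space keeps adjacent paragraphs from running together, at the cost of occasionally inserting a space where inline tags sit inside a word.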

In [49]:
reu = feedparser.parse("http://feeds.reuters.com/Reuters/worldNews")

In [50]:
reu.feed

{'title': 'Reuters: World News',
 'title_detail': {'type': 'text/plain',
  'language': None,
  'base': 'http://feeds.reuters.com/Reuters/worldNews',
  'value': 'Reuters: World News'},
 'links': [{'rel': 'alternate',
   'type': 'text/html',
   'href': 'https://www.reuters.com'},
  {'rel': 'self',
   'type': 'application/rss+xml',
   'href': 'http://feeds.reuters.com/Reuters/worldNews'},
  {'rel': 'hub',
   'href': 'http://pubsubhubbub.appspot.com/',
   'type': 'text/html'}],
 'link': 'https://www.reuters.com',
 'subtitle': "Reuters.com is your source for breaking news, business, financial and investing news, including personal finance and stocks.  Reuters is the leading global provider of news, financial information and technology solutions to the world's media, financial institutions, businesses and individuals.",
 'subtitle_detail': {'type': 'text/html',
  'language': None,
  'base': 'http://feeds.reuters.com/Reuters/worldNews',
  'value': "Reuters.com is your source for breaking news

In [8]:
len(reu['entries'])

20

In [9]:
for article in reu['entries']:
    print(article['title'], article['summary'].split("<")[0], article['links'][0]['href'])

U.S. Supreme Court takes up Mexican border shooting dispute The U.S. Supreme Court on Tuesday agreed to decide whether the family of a Mexican teenager fatally shot while on Mexican soil by a U.S. Border Patrol agent who fired from across the border in Texas can pursue a civil rights lawsuit in American courts. http://feeds.reuters.com/~r/Reuters/worldNews/~3/JoUI6UxA65U/u-s-supreme-court-takes-up-mexican-border-shooting-dispute-idUSKCN1SY1JC
Satellite images show fields in northwest Syria on fire New satellite images show fields, orchards and olive groves burning in northwest Syria, where the army has waged an assault against rebels in their last major stronghold. http://feeds.reuters.com/~r/Reuters/worldNews/~3/cVvTpLLRieM/satellite-images-show-fields-in-northwest-syria-on-fire-idUSKCN1SY1R7
Finland's coalition parties agree spending boost The five Finnish parties in talks to form a new government plan to increase public spending by 1.2 billion euros over four years, Finland's likely

## 1. Parsing lots of feeds!

Each of these feeds only gives a set number of news stories (to limit overhead/abuse, I guess), so let's parse lots of different feeds and build a news corpus.

The list covers world news sites from lots of different countries. I sourced all the links from this blog:
https://blog.feedspot.com/world_news_rss_feeds/

In [15]:
feeds_df = pd.read_csv("rss_urls.csv")
feeds_df.head()

|   | url | type |
|---|-----|------|
| 0 | http://feeds.bbci.co.uk/news/world/rss.xml | world |
| 1 | http://feeds.reuters.com/Reuters/worldNews | world |
| 2 | http://www.cbn.com/cbnnews/world/feed/ | world |
| 3 | https://news.google.com/news/rss/headlines/sec... | world |
| 4 | https://www.reddit.com/r/worldnews/.rsshttps:/... | world |


In [23]:
# Parse all, drop info you don't want
corpus = []

for index, row in feeds_df.iterrows():
    feed = feedparser.parse(row['url'])
    
    for article in feed['entries']:
    
        try:
            corpus.append({"title":article['title'],                     # News titles
                           "summary":article['summary'].split("<")[0],   # payload (sans any HTML stuff)
                           "date":article['published'],
                           "link":article['links'][0]['href'],           # associated links
                           "source_url":row['url'],
                           "type":row['type'],
                           "retrieval_timestamp":str(datetime.now())})      
    
        except KeyError as e:
            print("failed on ", article, e)

print("Finished!  With {} articles".format(len(corpus)))

failed on  {'id': 'http://www.globalissues.org/news/2019/05/28/25322', 'guidislink': True, 'link': 'http://www.globalissues.org/news/2019/05/28/25322', 'title': 'Asia-Pacific Region Viewed as Engine of the World Economy', 'title_detail': {'type': 'text/plain', 'language': None, 'base': 'http://www.globalissues.org/news/feed', 'value': 'Asia-Pacific Region Viewed as Engine of the World Economy'}, 'updated': '2019-05-28T13:19:10-07:00', 'updated_parsed': time.struct_time(tm_year=2019, tm_mon=5, tm_mday=28, tm_hour=20, tm_min=19, tm_sec=10, tm_wday=1, tm_yday=148, tm_isdst=0), 'links': [{'href': 'http://www.globalissues.org/news/2019/05/28/25322', 'rel': 'alternate', 'type': 'text/html'}], 'summary': '<p><img src="http://cdn.ipsnews.net/Library/2019/05/Armida-Salsiah-Alisjahbana_2_.jpg" width="640" alt="" /></p><p>BANGKOK, Thailand, May 28 (IPS)  - Armida Salsiah Alisjahbana * is UN Under-Secretary-General and Executive Secretary of the UN Economic and Social Commission for Asia and the P

Finished!  With 1549 articles
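The failure printed above is cut off before the missing key, but the visible part of that entry has `updated` and no `published`, so a missing `published` is one plausible cause. Here is a sketch of a more tolerant record builder that falls back to `updated` instead of dropping the article. `entry_to_record` is a hypothetical helper; it relies only on entries behaving like dicts, and it reads the `link` key seen in the output above rather than `links[0]['href']`.

```python
from datetime import datetime


def entry_to_record(entry, source_url, feed_type):
    """Build one corpus record, or return None if the title or link is missing."""
    title = entry.get("title")
    link = entry.get("link")
    if not title or not link:
        return None

    return {
        "title": title,
        # Same "<" split as the loop above; an absent summary becomes ""
        "summary": entry.get("summary", "").split("<")[0],
        # Prefer the published date, fall back to the updated date
        "date": entry.get("published") or entry.get("updated"),
        "link": link,
        "source_url": source_url,
        "type": feed_type,
        "retrieval_timestamp": str(datetime.now()),
    }


# A cut-down, hand-made entry shaped like the one that failed above
example_entry = {
    "title": "Asia-Pacific Region Viewed as Engine of the World Economy",
    "link": "http://www.globalissues.org/news/2019/05/28/25322",
    "updated": "2019-05-28T13:19:10-07:00",
    "summary": "<p>BANGKOK, Thailand, May 28 (IPS)</p>",
}
record = entry_to_record(example_entry, "http://www.globalissues.org/news/feed", "world")
print(record["date"])
```

Inside the loop, the `try`/`except` would then become a check for `None` before appending.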


In [12]:
for article in corpus:
    print(article['date'])

Tue, 28 May 2019 15:26:06 GMT
Tue, 28 May 2019 11:46:17 GMT
Tue, 28 May 2019 15:32:26 GMT
Mon, 27 May 2019 23:05:17 GMT
Tue, 28 May 2019 14:54:06 GMT
Tue, 28 May 2019 15:45:50 GMT
Tue, 28 May 2019 08:54:06 GMT
Tue, 28 May 2019 16:00:43 GMT
Tue, 28 May 2019 08:32:18 GMT
Tue, 28 May 2019 15:43:18 GMT
Tue, 28 May 2019 13:15:36 GMT
Tue, 28 May 2019 13:08:15 GMT
Mon, 27 May 2019 23:14:48 GMT
Mon, 27 May 2019 23:13:30 GMT
Tue, 28 May 2019 03:04:28 GMT
Mon, 27 May 2019 23:32:28 GMT
Sun, 26 May 2019 23:10:35 GMT
Sun, 26 May 2019 23:11:48 GMT
Tue, 28 May 2019 07:41:20 GMT
Mon, 27 May 2019 08:50:36 GMT
Mon, 27 May 2019 23:10:52 GMT
Mon, 27 May 2019 08:20:33 GMT
Mon, 27 May 2019 12:17:34 GMT
Sun, 26 May 2019 20:22:44 GMT
Mon, 27 May 2019 03:00:18 GMT
Mon, 27 May 2019 13:08:08 GMT
Mon, 27 May 2019 03:13:41 GMT
Sun, 26 May 2019 23:11:09 GMT
Tue, 28 May 2019 12:01:40 -0400
Tue, 28 May 2019 11:55:36 -0400
Tue, 28 May 2019 11:52:16 -0400
Tue, 28 May 2019 11:51:18 -0400
Tue, 28 May 2019 11:50:16 -0400
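The dates above mix GMT and `-0400` offsets, so they can't be compared as strings. Both are RFC 2822 style, which the standard library can parse. Here is a minimal sketch that converts them to UTC; `to_utc` is my own helper name, and treating dates without an offset as UTC is an assumption.

```python
from datetime import timezone
from email.utils import parsedate_to_datetime


def to_utc(date_string):
    """Parse an RFC 2822 style date and return an aware datetime in UTC, or None."""
    try:
        parsed = parsedate_to_datetime(date_string)
    except (TypeError, ValueError):
        # Not an RFC 2822 style date
        return None
    if parsed.tzinfo is None:
        # Assumption: a date with no offset is already in UTC
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)


for raw in ["Tue, 28 May 2019 15:26:06 GMT", "Tue, 28 May 2019 12:01:40 -0400"]:
    print(raw, "->", to_utc(raw).isoformat())
```

feedparser also exposes pre-parsed fields such as `updated_parsed` (visible in the failed entry above), which may be a simpler route when they are present.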


In [24]:
feed['entries'][0]

{'title': 'Ambulance service confirms man has died in Dunmow High Street this afternoon',
 'title_detail': {'type': 'text/plain',
  'language': None,
  'base': 'https://www.dunmowbroadcast.co.uk/cmlink/dunmow_broadcast_news_1_295754',
  'value': 'Ambulance service confirms man has died in Dunmow High Street this afternoon'},
 'links': [{'rel': 'alternate',
   'type': 'text/html',
   'href': 'https://www.dunmowbroadcast.co.uk/ambulance-service-confirms-man-has-died-in-dunmow-high-street-this-afternoon-1-6075654'}],
 'link': 'https://www.dunmowbroadcast.co.uk/ambulance-service-confirms-man-has-died-in-dunmow-high-street-this-afternoon-1-6075654',
 'summary': '<!--PSTYLE=TX Standard Intro--><p>A man has died in Dunmow High Street this afternoon. </p>',
 'summary_detail': {'type': 'text/html',
  'language': None,
  'base': 'https://www.dunmowbroadcast.co.uk/cmlink/dunmow_broadcast_news_1_295754',
  'value': '<!--PSTYLE=TX Standard Intro--><p>A man has died in Dunmow High Street this aftern

In [26]:
# Dump the corpus to file, record the date and time in the filename
# (no colon in the time, since ":" is not allowed in Windows filenames)
filename = "./working/RSS_corpus_{}.JSON".format(datetime.now().strftime("%Y-%m-%d_%H-%M"))

with open(filename, "w") as f:
    json.dump(corpus, f)

## 2. Named Entity Recognition

The idea here is that for every article I want a list of the people, places and other entities mentioned.

I got this code and approach from https://towardsdatascience.com/named-entity-recognition-with-nltk-and-spacy-8c4a7d88e7da

### Old way with NLTK

This is really just for reference. spaCy for the win!

In [None]:
filename = "./working/RSS_corpus_2019-01-18_14:16.JSON"

with open(filename, "r") as f:
    corpus = json.load(f)

In [None]:
# Little helper function to get the POS tag tuples
def preprocess(sent):
    sent = word_tokenize(sent)
    sent = pos_tag(sent)
    return sent

In [None]:
example_tagged = preprocess(corpus[0]['summary'])
example_tagged

In [None]:
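def get_names_nltk(tagged_sent):
    # Chunk noun phrases: optional determiner, any adjectives, then a noun
    pattern = 'NP: {<DT>?<JJ>*<NN>}'
    cp = nltk.RegexpParser(pattern)
    # Parse the argument, not the global example_tagged
    cs = cp.parse(tagged_sent)

    return cs  # [item for item in cs if item[1].startswith("NNP")]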

In [None]:
get_names_nltk(example_tagged)
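The noun-phrase pattern above chunks common nouns (`NN`), while names are tagged `NNP`, which is what the commented-out filter hints at. Here is a rough sketch of that idea: group consecutive `NNP`-tagged tokens into candidate names. It works on any list of `(word, tag)` tuples such as the output of `preprocess`. The tagged sentence below is hand-tagged for illustration, not real tagger output.

```python
from itertools import groupby


def proper_noun_runs(tagged_tokens):
    """Join each run of consecutive NNP/NNPS tokens into one candidate name."""
    names = []
    for is_proper, group in groupby(tagged_tokens, key=lambda pair: pair[1].startswith("NNP")):
        if is_proper:
            names.append(" ".join(word for word, _tag in group))
    return names


# Hand-tagged example (illustrative only)
tagged = [
    ("Amazon", "NNP"), ("boss", "NN"), ("Jeff", "NNP"), ("Bezos", "NNP"),
    ("will", "MD"), ("give", "VB"), ("away", "RP"), ("half", "NN"),
]
print(proper_noun_runs(tagged))
```

This only finds capitalised name candidates; it does not say whether each one is a person, place or organisation, which is where spaCy's labels help.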

### With spaCy

In [39]:
doc = nlp(corpus[3]['summary'])

# doc.ents holds the named-entity spans spaCy found in the text; each has .text and .label_
print([(x.text, x.label_) for x in doc.ents])

[('Amazon', 'ORG'), ('Jeff Bezos', 'PERSON'), ('half', 'CARDINAL'), ('37bn', 'MONEY')]


### Producing an entity-frequency table

In [40]:
def count_entities(corpus, ignore_tags=('CARDINAL', 'DATE', 'MONEY', 'ORDINAL', 'QUANTITY')):

    # One row per entity mention. Building the DataFrame once avoids the
    # row-by-row DataFrame.append, which was removed in pandas 2.0
    rows = [{"text": ent.text, "POS": ent.label_, "count": 1}
            for doc in corpus
            for ent in nlp(doc['summary']).ents
            if ent.label_ not in ignore_tags]

    entities = pd.DataFrame(rows, columns=["text", "POS", "count"])

    # Note: "POS" holds the spaCy entity label (ORG, GPE, ...), not a part-of-speech tag
    return (entities.groupby(['text', 'POS'])
                    .agg('count')
                    .reset_index()
                    .sort_values("count", ascending=False))

In [41]:
corpus_entities = count_entities(corpus)

In [42]:
corpus_entities

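|      | text | POS | count |
|------|------|-----|-------|
| 354  | EU | ORG | 37 |
| 432  | French | NORP | 28 |
| 396  | European | NORP | 27 |
| 564  | Japan | GPE | 26 |
| 1169 | US | GPE | 26 |
| 1162 | U.S. | GPE | 23 |
| 1421 |  | ORG | 23 |
| 1164 | UK | GPE | 22 |
| 347  | Donald Trump | PERSON | 18 |
| 211  | British | NORP | 17 |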
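The table counts "US" and "U.S." separately and includes a row with blank text. Here is a small sketch of one way to tidy that before counting, using plain `(text, label)` pairs. The alias map is hypothetical and would need to be built by hand, and `count_mentions` is my own helper name.

```python
from collections import Counter

# Hypothetical alias map: spelling variant -> preferred form
ALIASES = {"U.S.": "US"}


def count_mentions(mentions, aliases=ALIASES):
    """Count (text, label) pairs, skipping blank text and merging known aliases."""
    counts = Counter()
    for text, label in mentions:
        text = text.strip()
        if not text:
            continue  # drop blank entity text
        counts[(aliases.get(text, text), label)] += 1
    return counts.most_common()


# Made-up mentions for illustration
mentions = [("US", "GPE"), ("U.S.", "GPE"), (" ", "ORG"), ("EU", "ORG"), ("US", "GPE")]
print(count_mentions(mentions))
```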


## Testing new rsspump module

In [59]:
from rsspump import rsspump

In [65]:
importlib.reload(rsspump)
test = rsspump.RSSPump("http://feeds.bbci.co.uk/news/rss.xml")
test.fetch()
test.get_stories()

[{'title': "I'm not a Lib Dem, says Alastair Campbell after Labour expulsion",
  'summary': "The Labour Party kicked out Tony Blair's former spin doctor after he publicly said he voted Lib Dem.",
  'date': 'Tue, 28 May 2019 19:27:38 GMT',
  'link': 'https://www.bbc.co.uk/news/uk-48434842',
  'source_url': 'http://feeds.bbci.co.uk/news/rss.xml',
  'retrieval_timestamp': '2019-05-28 21:01:10.364950'},
 {'title': 'JLS star Oritse Williams not guilty of rape',
  'summary': 'Oritse Williams denied raping the woman after a concert in Wolverhampton in December 2016.',
  'date': 'Tue, 28 May 2019 16:17:58 GMT',
  'link': 'https://www.bbc.co.uk/news/uk-england-birmingham-48380382',
  'source_url': 'http://feeds.bbci.co.uk/news/rss.xml',
  'retrieval_timestamp': '2019-05-28 21:01:10.364950'},
 {'title': 'Conservative leadership: BBC to host TV debates',
  'summary': "The winner of the Conservative Party leadership race is set to become the UK's next prime minister.",
  'date': 'Tue, 28 May 2019 

In [66]:
test.get_stories()

[{'title': "I'm not a Lib Dem, says Alastair Campbell after Labour expulsion",
  'summary': "The Labour Party kicked out Tony Blair's former spin doctor after he publicly said he voted Lib Dem.",
  'date': 'Tue, 28 May 2019 19:27:38 GMT',
  'link': 'https://www.bbc.co.uk/news/uk-48434842',
  'source_url': 'http://feeds.bbci.co.uk/news/rss.xml',
  'retrieval_timestamp': '2019-05-28 21:01:10.364950'},
 {'title': 'JLS star Oritse Williams not guilty of rape',
  'summary': 'Oritse Williams denied raping the woman after a concert in Wolverhampton in December 2016.',
  'date': 'Tue, 28 May 2019 16:17:58 GMT',
  'link': 'https://www.bbc.co.uk/news/uk-england-birmingham-48380382',
  'source_url': 'http://feeds.bbci.co.uk/news/rss.xml',
  'retrieval_timestamp': '2019-05-28 21:01:10.364950'},
 {'title': 'Conservative leadership: BBC to host TV debates',
  'summary': "The winner of the Conservative Party leadership race is set to become the UK's next prime minister.",
  'date': 'Tue, 28 May 2019 

In [72]:
importlib.reload(rsspump)
test = rsspump.RSSPumpB("http://feeds.bbci.co.uk/news/rss.xml")

In [73]:
test.get_stories()

[{'title': "I'm not a Lib Dem, says Alastair Campbell after Labour expulsion",
  'summary': "The Labour Party kicked out Tony Blair's former spin doctor after he publicly said he voted Lib Dem.",
  'date': 'Tue, 28 May 2019 19:27:38 GMT',
  'link': 'https://www.bbc.co.uk/news/uk-48434842',
  'source_url': 'http://feeds.bbci.co.uk/news/rss.xml',
  'retrieval_timestamp': '2019-05-28 21:22:43.112753'},
 {'title': 'JLS star Oritse Williams not guilty of rape',
  'summary': 'Oritse Williams denied raping the woman after a concert in Wolverhampton in December 2016.',
  'date': 'Tue, 28 May 2019 16:17:58 GMT',
  'link': 'https://www.bbc.co.uk/news/uk-england-birmingham-48380382',
  'source_url': 'http://feeds.bbci.co.uk/news/rss.xml',
  'retrieval_timestamp': '2019-05-28 21:22:43.112753'},
 {'title': 'Conservative leadership: BBC to host TV debates',
  'summary': "The winner of the Conservative Party leadership race is set to become the UK's next prime minister.",
  'date': 'Tue, 28 May 2019 

In [74]:
test.get_entities()

[{'source_url': 'http://feeds.bbci.co.uk/news/rss.xml',
  'title': "I'm not a Lib Dem, says Alastair Campbell after Labour expulsion",
  'entities': "NORP:Dem,PERSON:Alastair Campbell,ORG:Labour,ORG:The Labour Party,PERSON:Tony Blair's,PERSON:Lib Dem"},
 {'source_url': 'http://feeds.bbci.co.uk/news/rss.xml',
  'title': 'JLS star Oritse Williams not guilty of rape',
  'entities': 'PERSON:Oritse Williams,PERSON:Oritse Williams,GPE:Wolverhampton'},
 {'source_url': 'http://feeds.bbci.co.uk/news/rss.xml',
  'title': 'Conservative leadership: BBC to host TV debates',
  'entities': 'ORG:BBC,ORG:the Conservative Party,GPE:UK'},
 {'source_url': 'http://feeds.bbci.co.uk/news/rss.xml',
  'title': 'Boots review puts 200 stores at risk of closure',
  'entities': ''},
 {'source_url': 'http://feeds.bbci.co.uk/news/rss.xml',
  'title': "Spencer Matthews at jeweller's during armed robbery",
  'entities': 'PERSON:Spencer Matthews,ORG:Chelsea,PERSON:Spencer Matthews'},
 {'source_url': 'http://feeds.bbci.
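
The `entities` field above is a single comma-separated string of `LABEL:text` pairs. Here is a minimal sketch for turning it back into tuples. `parse_entity_string` is my own helper, not part of `rsspump`, and it assumes no entity text contains a comma.

```python
def parse_entity_string(entity_string):
    """Split 'LABEL:text,LABEL:text' into a list of (label, text) tuples."""
    pairs = []
    for chunk in entity_string.split(","):
        if not chunk:
            continue  # an empty string means no entities were found
        # Split on the first colon only, in case the text itself has one
        label, _sep, text = chunk.partition(":")
        pairs.append((label, text))
    return pairs


print(parse_entity_string("ORG:BBC,ORG:the Conservative Party,GPE:UK"))
print(parse_entity_string(""))
```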