# Sourcing and grouping RSS feeds

There's a lot of focus nowadays on web scraping for information retrieval, but before web scraping and its questionable ethics there was already a solution for machine-readable news data.

RSS (Rich Site Summary, nowadays usually expanded as Really Simple Syndication) feeds publish frequently updated information on the news, pages or documents available from a site.  The RSS document (an XML file) contains, for each item, a headline, a summary description (usually just one sentence), often a picture, and a link to the full article or document.  RSS reader software and other news aggregation apps use these feeds to list the available stories while minimising overhead (e.g. an app for browsing news articles doesn't have to retrieve and load every article or web page in full, just the much smaller amount of information in the RSS document).
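To make that structure concrete, here is a minimal sketch that reads a hand-written RSS 2.0 document with Python's standard library.  The feed content and the example.com URLs are made up for illustration, and `parse_items` is a hypothetical helper; real feeds carry more fields, and the rest of this notebook uses feedparser instead.

```python
import xml.etree.ElementTree as ET

# A made-up, minimal RSS 2.0 document (illustrative only)
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example News</title>
    <link>https://example.com/</link>
    <description>Made-up feed for illustration</description>
    <item>
      <title>First headline</title>
      <description>One-sentence summary of the first story.</description>
      <link>https://example.com/story-1</link>
    </item>
    <item>
      <title>Second headline</title>
      <description>One-sentence summary of the second story.</description>
      <link>https://example.com/story-2</link>
    </item>
  </channel>
</rss>
"""

def parse_items(rss_text):
    """Return one dict per <item> with its title, summary and link."""
    root = ET.fromstring(rss_text)
    return [
        {
            "title": item.findtext("title"),
            "summary": item.findtext("description"),
            "link": item.findtext("link"),
        }
        for item in root.findall("./channel/item")
    ]

for story in parse_items(SAMPLE_RSS):
    print(story["title"], story["summary"], story["link"])
```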

So I can use them to create a corpus of summarized news items, right?

Info on how to get the RSS docs is from https://www.pythonforbeginners.com/feedparser/using-feedparser-in-python

In [None]:
import feedparser
import json
import nltk
import importlib

import numpy as np
import pandas as pd

from datetime import datetime
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag

# For the spacy run
import spacy
from spacy import displacy
from collections import Counter

# Load the english language model
import en_core_web_sm
nlp = en_core_web_sm.load()

In [None]:
# These only need running once on your pc!
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

### The BBC feed
A fairly straightforward example.

In [None]:
bbc = feedparser.parse("http://feeds.bbci.co.uk/news/rss.xml")

In [None]:
bbc.keys()

In [None]:
len(bbc['entries'])

In [None]:
for article in bbc['entries']:
    print(article['title'], article['summary'], article['links'][0]['href'])

### The Reuters feed
This one includes a lot of HTML formatting within the summary field, which is a pain.  Solved by splitting on "<" and keeping only the text before the first tag.

In [None]:
reu = feedparser.parse("http://feeds.reuters.com/Reuters/worldNews")

In [None]:
reu.feed

In [None]:
len(reu['entries'])

In [None]:
for article in reu['entries']:
    print(article['title'], article['summary'].split("<")[0], article['links'][0]['href'])
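Splitting on "<" works for these summaries because the text comes first and the markup trails after it.  If a summary starts with a tag, `split("<")[0]` returns an empty string.  A sketch of a more tolerant alternative, using only the standard library, is below; `strip_html` is a hypothetical helper and the sample summary is made up.

```python
from html.parser import HTMLParser

class _TextOnly(HTMLParser):
    """Collects the text nodes of an HTML fragment and ignores the tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def strip_html(fragment):
    """Return the visible text of an HTML fragment with whitespace collapsed."""
    parser = _TextOnly()
    parser.feed(fragment)
    parser.close()
    return " ".join("".join(parser.chunks).split())

# Made-up summary in the style of a feed that embeds markup
sample = ('<p>Talks resumed on <b>Monday</b> &amp; will continue.</p>'
          '<div class="feedflare"><img src="x.gif"/></div>')

print(repr(sample.split("<")[0]))  # what the split approach keeps
print(repr(strip_html(sample)))    # what the parser approach keeps
```

The trade-off is that this keeps all the text in the field, including any link text in trailing markup, so it may need trimming for feeds that append "read more" style footers.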

## 1. Parsing lots of feeds!

Each of these feeds only gives a set number of news stories (to limit overhead/abuse, I guess), so let's parse lots of different feeds and build a news corpus.

World news sites from lots of different countries.  I sourced all the links from this blog:
https://blog.feedspot.com/world_news_rss_feeds/

In [None]:
feeds_df = pd.read_csv("rss_urls.csv")
feeds_df.head()

In [None]:
# Parse all, drop info you don't want
corpus = []

for index, row in feeds_df.iterrows():
    feed = feedparser.parse(row['url'])
    
    for article in feed['entries']:
    
        try:
            corpus.append({"title":article['title'],                     # News titles
                           "summary":article['summary'].split("<")[0],   # payload (sans any HTML stuff)
                           "date":article['published'],
                           "link":article['links'][0]['href'],           # associated links
                           "source_url":row['url'],
                           "type":row['type'],
                           "retrieval_timestamp":str(datetime.now())})      
    
        except KeyError as e:
            print("failed on ", article, e)

print("Finished!  With {} articles".format(len(corpus)))
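The loop above drops an article entirely when any key is missing.  If keeping partial records is preferable, one option is to read the optional fields with `.get`, which works because feedparser entries behave like dicts.  This is only a sketch: `article_to_record` is a hypothetical helper, and treating `title` as the one required field is my assumption.

```python
from datetime import datetime

def article_to_record(article, source_url, feed_type):
    """Build one corpus record, substituting None for missing optional fields."""
    links = article.get("links") or [{}]
    summary = article.get("summary") or ""
    return {
        "title": article["title"],           # still required: KeyError if absent
        "summary": summary.split("<")[0],    # payload (sans any HTML stuff)
        "date": article.get("published"),    # None when the feed omits a date
        "link": links[0].get("href"),        # None when there are no links
        "source_url": source_url,
        "type": feed_type,
        "retrieval_timestamp": str(datetime.now()),
    }

# Made-up entry with no 'published' or 'links' key
entry = {"title": "Headline", "summary": "Short summary.<br/>"}
record = article_to_record(entry, "https://example.com/rss.xml", "world")
print(record["summary"], record["date"], record["link"])
```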

In [None]:
for article in corpus:
    print(article['date'])
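The dates come back as plain strings, typically in the RFC 822 style used by RSS (for example 'Wed, 29 May 2019 16:40:12 GMT').  A sketch of normalising them with the standard library follows; `parse_feed_date` is a hypothetical helper, and treating zone-less dates as UTC is an assumption that may not hold for every feed.

```python
from datetime import timezone
from email.utils import parsedate_to_datetime

def parse_feed_date(text):
    """Parse an RFC 822 style date string; return None if it cannot be parsed."""
    try:
        parsed = parsedate_to_datetime(text)
    except (TypeError, ValueError):
        return None
    if parsed.tzinfo is None:
        # Assumption: treat dates without a usable zone as UTC
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)

print(parse_feed_date("Wed, 29 May 2019 16:40:12 GMT"))
print(parse_feed_date("not a date"))
```

Feeds that use other date formats will come back as None here and need their own handling.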

In [None]:
feed['entries'][0]

In [None]:
# Dump the corpus to file, record the date and time in the filename
filename = "./working/RSS_corpus_{}.JSON".format(datetime.now().strftime("%Y-%m-%d_%H:%M"))

with open(filename, "w") as f:
    json.dump(corpus, f)
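One caveat with the filename above: it contains a ":" from the time, which is not allowed in Windows filenames.  A sketch of a portable variant is below.  `dump_corpus` is a hypothetical helper, the colon-free timestamp format is my choice rather than anything the notebook relies on, and the temporary directory only stands in for "./working".

```python
import json
import tempfile
from datetime import datetime
from pathlib import Path

def dump_corpus(corpus, directory, now=None):
    """Write the corpus to a timestamped JSON file and return its path."""
    now = now or datetime.now()
    # %H%M instead of %H:%M keeps ':' out of the filename
    name = "RSS_corpus_{}.JSON".format(now.strftime("%Y-%m-%d_%H%M"))
    path = Path(directory) / name
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(corpus, f, ensure_ascii=False)
    return path

# Usage with a throwaway directory and a made-up one-item corpus
with tempfile.TemporaryDirectory() as tmp:
    written = dump_corpus([{"title": "Headline"}], tmp)
    print(written.name)
```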

# 2.  Named Entity Recognition

The idea here is for every article I want a list of people/places/entities mentioned.

I got this code/approach/etc. from: https://towardsdatascience.com/named-entity-recognition-with-nltk-and-spacy-8c4a7d88e7da

## Old way with NLTK

This is really just for reference.  spaCy for the win!

In [None]:
filename = "./working/RSS_corpus_2019-01-18_14:16.JSON"

with open(filename, "r") as f:
    corpus = json.load(f)

In [None]:
# Little helper function to get the POS tag tuples
def preprocess(sent):
    tokens = word_tokenize(sent)   # split the text into word tokens
    return pos_tag(tokens)         # list of (token, POS tag) tuples

In [None]:
example_tagged = preprocess(corpus[0]['summary'])
example_tagged

In [None]:
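def get_names_nltk(tagged_sent):
    # Chunk simple noun phrases: optional determiner, any adjectives, then a noun
    pattern = 'NP: {<DT>?<JJ>*<NN>}'
    cp = nltk.RegexpParser(pattern)
    cs = cp.parse(tagged_sent)   # parse the argument, not the global example_tagged
    
    return cs # [item for item in cs if item[1].startswith("NNP")])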

In [None]:
get_names_nltk(example_tagged)

## With spaCy

In [None]:
doc = nlp(corpus[3]['summary'])

# doc.ents holds the named-entity spans spaCy found; label_ is the entity type
print([(x.text, x.label_) for x in doc.ents])

## Producing an entity-frequency table

In [None]:
def count_entities(corpus, ignore_tags=('CARDINAL', 'DATE', 'MONEY', 'ORDINAL', 'QUANTITY')):
    
    # One row per entity mention.  Building the DataFrame once at the end avoids
    # DataFrame.append, which was removed in pandas 2.0
    rows = [{"text": x.text, "POS": x.label_, "count": 1}
            for doc in corpus
            for x in nlp(doc['summary']).ents
            if x.label_ not in ignore_tags]
    
    # Note: "POS" holds spaCy's entity label (PERSON, ORG, ...), not a part-of-speech tag
    entities = pd.DataFrame(rows, columns=["text", "POS", "count"])
    
    return (entities.groupby(['text', 'POS'])
                    .agg('count')
                    .reset_index()
                    .sort_values("count", ascending=False))

In [None]:
corpus_entities = count_entities(corpus)

In [None]:
corpus_entities
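The same frequency table can be built without pandas using `collections.Counter`, which the import cell already pulls in.  The sketch below assumes the entities have already been extracted into (text, label) pairs; the pairs here are hand-written stand-ins for spaCy output, and `count_entity_pairs` is a hypothetical helper.

```python
from collections import Counter

IGNORED = ('CARDINAL', 'DATE', 'MONEY', 'ORDINAL', 'QUANTITY')

def count_entity_pairs(pairs, ignore_tags=IGNORED):
    """Count (text, label) pairs, most frequent first, skipping ignored labels."""
    counts = Counter((text, label) for text, label in pairs
                     if label not in ignore_tags)
    return counts.most_common()

# Hand-written pairs standing in for [(x.text, x.label_) for x in doc.ents]
pairs = [
    ("Boris Johnson", "PERSON"),
    ("EU", "ORG"),
    ("Scotland", "GPE"),
    ("EU", "ORG"),
    ("2016", "DATE"),
]

for (text, label), n in count_entity_pairs(pairs):
    print(text, label, n)
```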

## Testing new rsspump module

In [1]:
from rsspump import RSSPump

In [2]:
#importlib.reload(rsspump)
test = RSSPump("http://feeds.bbci.co.uk/news/rss.xml")

In [3]:
test.corpus

[{'title': 'Brexit: Boris Johnson ordered to appear in court over £350m claim',
  'summary': 'He is accused of misconduct in public office for his claim during the EU referendum in 2016.',
  'date': 'Wed, 29 May 2019 16:40:12 GMT',
  'link': 'https://www.bbc.co.uk/news/uk-politics-48445430',
  'source_url': 'http://feeds.bbci.co.uk/news/rss.xml',
  'retrieval_timestamp': '2019-05-29 20:38:14.210003'},
 {'title': 'Robert Mueller: Charging Trump was not an option',
  'summary': 'The special counsel said legal guidelines meant he was unable to charge a sitting president.',
  'date': 'Wed, 29 May 2019 17:27:58 GMT',
  'link': 'https://www.bbc.co.uk/news/world-us-canada-48450534',
  'source_url': 'http://feeds.bbci.co.uk/news/rss.xml',
  'retrieval_timestamp': '2019-05-29 20:38:14.210003'},
 {'title': "English 'short-changed on care funding'",
  'summary': 'Funding per person in Scotland is 43% higher than in England, while in Wales it is 33% higher.',
  'date': 'Wed, 29 May 2019 11:00:35 G

In [4]:
test.refresh()

'Refreshed feed data.'

In [5]:
test.get_stories()

[{'title': 'Brexit: Boris Johnson ordered to appear in court over £350m claim',
  'summary': 'He is accused of misconduct in public office for his claim during the EU referendum in 2016.',
  'date': 'Wed, 29 May 2019 16:40:12 GMT',
  'link': 'https://www.bbc.co.uk/news/uk-politics-48445430',
  'source_url': 'http://feeds.bbci.co.uk/news/rss.xml',
  'retrieval_timestamp': '2019-05-29 20:38:17.366986'},
 {'title': 'Robert Mueller: Charging Trump was not an option',
  'summary': 'The special counsel said legal guidelines meant he was unable to charge a sitting president.',
  'date': 'Wed, 29 May 2019 17:27:58 GMT',
  'link': 'https://www.bbc.co.uk/news/world-us-canada-48450534',
  'source_url': 'http://feeds.bbci.co.uk/news/rss.xml',
  'retrieval_timestamp': '2019-05-29 20:38:17.366986'},
 {'title': "English 'short-changed on care funding'",
  'summary': 'Funding per person in Scotland is 43% higher than in England, while in Wales it is 33% higher.',
  'date': 'Wed, 29 May 2019 11:00:35 G

In [6]:
test.get_entities()

[{'source_url': 'http://feeds.bbci.co.uk/news/rss.xml',
  'retrieval_timestamp': '2019-05-29 20:38:17.366986',
  'title': 'Brexit: Boris Johnson ordered to appear in court over £350m claim',
  'entities': 'PERSON:Boris Johnson,ORG:EU'},
 {'source_url': 'http://feeds.bbci.co.uk/news/rss.xml',
  'retrieval_timestamp': '2019-05-29 20:38:17.366986',
  'title': 'Robert Mueller: Charging Trump was not an option',
  'entities': 'PERSON:Robert Mueller:,PERSON:Charging Trump'},
 {'source_url': 'http://feeds.bbci.co.uk/news/rss.xml',
  'retrieval_timestamp': '2019-05-29 20:38:17.366986',
  'title': "English 'short-changed on care funding'",
  'entities': 'LANGUAGE:English,GPE:Scotland,PERCENT:43%,GPE:England,GPE:Wales,PERCENT:33%'},
 {'source_url': 'http://feeds.bbci.co.uk/news/rss.xml',
  'retrieval_timestamp': '2019-05-29 20:38:17.366986',
  'title': 'Carbon credit fraud trial collapses as expert witness was no expert',
  'entities': ''},
 {'source_url': 'http://feeds.bbci.co.uk/news/rss.xml',