# RSS Testing

Compare three methods of collecting RSS feeds for the same set of media sources:
- Current MediaCloud API: https://github.com/berkmancenter/mediacloud/blob/master/doc/api_2_0_spec/api_2_0_spec.md
- Python feed_seeker library: https://github.com/mitmedialab/feed_seeker
- Manually generated list: https://github.com/berkmancenter/mediacloud/issues/333

In [12]:
## Standard imports
%matplotlib inline
import matplotlib.pyplot as plt
import pandas 
from datetime import datetime
import seaborn as sns
import glob
from os.path import basename
sns.set_context('poster')
import re
import csv

## Read manual list

In [15]:
df_sources = pandas.read_csv('../data/random-sources.txt')

In [17]:
df_sources.head()

|   | media_id | url |
|---|---|---|
| 0 | 1747 | http://www.dailymail.co.uk/home/index.html |
| 1 | 1750 | http://www.telegraph.co.uk/ |
| 2 | 20120 | http://www.nj.com |
| 3 | 1 | http://nytimes.com |
| 4 | 9 | http://www.chicagotribune.com/ |


### Here's the list of media sources we're testing. There should be 50 distinct sources, but the 50 rows contain only 48 unique media IDs.


In [54]:
media_ids = list(df_sources.media_id)
print(len(media_ids))
print(len(set(media_ids)))

50
48
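
The gap between 50 rows and 48 unique IDs means some `media_id` values are repeated. A minimal sketch for finding which ones, using only the standard library; `duplicated_ids` is a hypothetical helper and the sample IDs are illustrative, not the real duplicates:

```python
from collections import Counter


def duplicated_ids(ids):
    """Return {id: count} for every id that appears more than once."""
    return {i: n for i, n in Counter(ids).items() if n > 1}


# Illustrative IDs only (not the real list): 5 rows, 4 unique values.
sample_ids = [1747, 1750, 20120, 1, 1750]
print(len(sample_ids), len(set(sample_ids)))
print(duplicated_ids(sample_ids))
```

In the notebook, `duplicated_ids(media_ids)` would list the repeated IDs together with how often each occurs.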


In [29]:
df = pandas.read_csv('../data/feeds-manual.csv')

In [30]:
df.head()

|   | media_id | url | feed_url |
|---|---|---|---|
| 0 | 1750 | http://www.telegraph.co.uk/ | http://announcements.telegraph.co.uk/rss-feeds |
| 1 | 1750 | http://www.telegraph.co.uk/ | https://www.telegraph.co.uk/finance/rssfeeds/ |
| 2 | 1750 | http://www.telegraph.co.uk/ | https://feedly.com/i/subscription/feed%2Fhttp%... |
| 3 | 9 | http://www.chicagotribune.com/ | http://www.chicagotribune.com/news/local/break... |
| 4 | 9 | http://www.chicagotribune.com/ | http://www.chicagotribune.com/business/rss2.0.xml |


In [48]:
len(set(list(df.feed_url)))

1522

In [51]:
len(set(list(df.media_id)))

42

In [40]:
sources_dict = df_sources.to_dict('records')
sources_dict[0]

{'media_id': 1747, 'url': 'http://www.dailymail.co.uk/home/index.html'}

### The manual list covers only 42 of the 48 unique sources. Which ones are missing?


In [58]:
set(media_ids) - set(list(df.media_id))

{29812, 57174, 118429, 348065, 385113, 385125}

In [63]:
set(df_sources.url) - set(df.url)

{'http://cn.linkedin.com/',
 'http://in.news.yahoo.com#spider',
 'http://wkrg.com/',
 'http://www.superiodicoaldia.com/',
 'http://www.theheritagenews.com/',
 'http://www.viasat1.com.gh/vone/'}
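
The two set differences above give the missing IDs and the missing URLs separately, without showing which ID belongs to which URL. A small sketch that keeps them paired, assuming records shaped like `sources_dict`; `uncovered_sources` is a hypothetical helper and the sample records are placeholders, not real MediaCloud entries:

```python
def uncovered_sources(sources, covered_ids):
    """Return the source records whose media_id is absent from covered_ids."""
    covered = set(covered_ids)
    return [s for s in sources if s['media_id'] not in covered]


# Placeholder records (hypothetical IDs and URLs), shaped like sources_dict.
sample_sources = [
    {'media_id': 101, 'url': 'http://example.com/'},
    {'media_id': 102, 'url': 'http://example.org/'},
    {'media_id': 103, 'url': 'http://example.net/'},
]
# IDs that appear in the feed list being checked (also made up).
sample_covered = [101, 103]

for source in uncovered_sources(sample_sources, sample_covered):
    print(source['media_id'], source['url'])
```

In the notebook this would be called as `uncovered_sources(sources_dict, df.media_id)`.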

## Set up the MediaCloud API client

In [4]:
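import os
import mediacloud, json, datetime
# Read the API key from the environment instead of hard-coding it in the notebook.
# MC_API_KEY is an assumed variable name; set it before starting Jupyter.
mc = mediacloud.api.MediaCloud(os.environ['MC_API_KEY'])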

In [72]:
# Write one row per (source, feed) pair returned by the MediaCloud API.
with open('../data/feeds-via-API.csv', 'w', newline='') as csvfile:
    # Use the csv defaults (quotechar '"') so pandas.read_csv can parse the file
    # with its own defaults; the earlier quotechar='|' would not round-trip.
    writer = csv.writer(csvfile)
    writer.writerow(['media_id', 'url', 'feed_url'])
    for i, source in enumerate(sources_dict, start=1):
        print("SOURCE: " + str(i) + " " + source['url'])
        feeds = mc.feedList(source['media_id'])
        print(len(feeds))
        for feed in feeds:
            writer.writerow([source['media_id'], source['url'], feed['url']])
            print(feed['url'])
        print("\n")


SOURCE: 1 http://www.dailymail.co.uk/home/index.html
20
http://www.dailymail.co.uk/news/columnist-1001421/alex-brummer.rss
http://www.dailymail.co.uk/femail/columnist-1005042/alexandra-shulman.rss
http://www.dailymail.co.uk/news/columnist-323/allison-pearson.rss
http://www.dailymail.co.uk/news/columnist-463/amanda-platell.rss
http://www.dailymail.co.uk/tvshowbiz/azstar/archives/amy-winehouse.rss
http://www.dailymail.co.uk/news/columnist-248/andrew-alexander.rss
http://www.dailymail.co.uk/news/columnist-1041755/andrew-pierce.rss
http://www.dailymail.co.uk/sport/columnist-1031773/andy-roddick.rss
http://www.dailymail.co.uk/sport/columnist-1040409/andy-townsend.rss
http://www.dailymail.co.uk/sport/teampages/arsenal.rss
http://www.dailymail.co.uk/sport/columnist-1000161/ash-wednesday.rss
http://www.dailymail.co.uk/sport/teampages/aston-villa.rss
http://www.dailymail.co.uk/tvshowbiz/columnist-1000601/baz-bamigboye.rss
http://www.dailymail.co.uk/femail/columnist-465/bel-mooney.rss
http://www

2
http://newsok.com
spider:http://newsok.com


SOURCE: 14 http://telesurtv.net/
9
http://www.telesurtv.net/rss
http://telesurtv.net/
http://www.telesurtv.net/rss/RssLatinoamerica.html
http://www.telesurtv.net/rss/RssCultura.html
http://www.telesurtv.net/rss/RssDeporte.html
http://www.telesurtv.net/rss/RssPortada.html
http://www.telesurtv.net/rss/RssOpinion.xml
http://www.telesurtv.net/rss/RssMundo.html
http://www.telesurtv.net/rss/RssBlogs.xml


SOURCE: 15 http://www.rightwingwatch.org
4
http://www.rightwingwatch.org
http://www.rightwingwatch.org/rss.xml
http://www.rightwingwatch.org/feed/
MediaWords::ImportStories::Feedly:http://www.rightwingwatch.org


SOURCE: 16 http://www.commentarymagazine.com/blogs
12
http://www.commentarymagazine.com/blogs/index.php/feed
http://www.commentarymagazine.com/blogs/index.php/category/connecting-the-dots/feed
http://www.commentarymagazine.com/blogs/index.php/category/contentions/feed
http://www.commentarymagazine.com/blogs/index.php/category/the-horiz

2
http://in.news.yahoo.com#spider
https://in.news.yahoo.com/rss


SOURCE: 40 http://indiatoday.intoday.in/#spider
20
http://indiatoday.intoday.in/#spider
http://indiatoday.intoday.in/rss/homepage-topstories.jsp
http://indiatoday.intoday.in/rss/article.jsp?sid=120
http://indiatoday.intoday.in/rss/article.jsp?sid=150
http://indiatoday.intoday.in/rss/article.jsp?sid=149
http://indiatoday.intoday.in/rss/article.jsp?sid=146
http://indiatoday.intoday.in/rss/article.jsp?sid=27
http://indiatoday.intoday.in/rss/article.jsp?sid=41
http://indiatoday.intoday.in/rss/article.jsp?sid=21
http://indiatoday.intoday.in/rss/article.jsp?sid=25
http://indiatoday.intoday.in/rss/article.jsp?sid=24
http://indiatoday.intoday.in/rss/article.jsp?sid=36
http://indiatoday.intoday.in/rss/article.jsp?sid=85
http://indiatoday.intoday.in/rss/article.jsp?sid=61
http://indiatoday.intoday.in/rss/article.jsp?sid=34
http://indiatoday.intoday.in/rss/article.jsp?sid=30
http://indiatoday.intoday.in/rss/gallery.jsp?pcid=0
http:
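
Some entries in the API output do not look like fetchable feed URLs: `spider:http://newsok.com`, `http://in.news.yahoo.com#spider`, and `MediaWords::ImportStories::Feedly:http://www.rightwingwatch.org`. A rough sketch for separating them before counting feeds per method. The three patterns are a heuristic read off the output above, not documented MediaCloud behaviour, and `is_pseudo_feed` is a hypothetical helper:

```python
def is_pseudo_feed(feed_url):
    """Heuristic: True for entries that look like crawler placeholders, not feeds."""
    return (
        feed_url.startswith('spider:')
        or feed_url.endswith('#spider')
        or feed_url.startswith('MediaWords::')
    )


# Entries copied from the API output above.
listed = [
    'http://www.rightwingwatch.org',
    'http://www.rightwingwatch.org/rss.xml',
    'http://www.rightwingwatch.org/feed/',
    'MediaWords::ImportStories::Feedly:http://www.rightwingwatch.org',
    'spider:http://newsok.com',
    'http://in.news.yahoo.com#spider',
]

kept = [u for u in listed if not is_pseudo_feed(u)]
print(kept)
```

Note that a bare homepage URL such as `http://www.rightwingwatch.org` passes this filter; telling a homepage from a real feed would require fetching the URL and checking its content.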

### The manual list has way more feeds than the API. Compare the two for media_id 1 (nytimes.com):

In [74]:
df[df.media_id==1]

|   | media_id | url | feed_url |
|---|---|---|---|
| 1190 | 1 | http://nytimes.com | http://rss.nytimes.com/services/xml/rss/nyt/Ho... |
| 1191 | 1 | http://nytimes.com | http://rss.nytimes.com/services/xml/rss/nyt/Wo... |
| 1192 | 1 | http://nytimes.com | https://atwar.blogs.nytimes.com/feed/ |
| 1193 | 1 | http://nytimes.com | http://rss.nytimes.com/services/xml/rss/nyt/Af... |
| 1194 | 1 | http://nytimes.com | http://rss.nytimes.com/services/xml/rss/nyt/Am... |
| 1195 | 1 | http://nytimes.com | http://rss.nytimes.com/services/xml/rss/nyt/As... |
| 1196 | 1 | http://nytimes.com | http://rss.nytimes.com/services/xml/rss/nyt/Eu... |
| 1197 | 1 | http://nytimes.com | http://rss.nytimes.com/services/xml/rss/nyt/Mi... |
| 1198 | 1 | http://nytimes.com | http://rss.nytimes.com/services/xml/rss/nyt/US... |
| 1199 | 1 | http://nytimes.com | http://rss.nytimes.com/services/xml/rss/nyt/Ed... |



In [75]:
df_api = pandas.read_csv('../data/feeds-via-API.csv')

In [77]:
df_api[df_api.media_id == 1]

|   | media_id | url | feed_url |
|---|---|---|---|
| 60 | 1 | http://nytimes.com | http://bits.blogs.nytimes.com/rss2.xml |
| 61 | 1 | http://nytimes.com | http://dealbook.blogs.nytimes.com/rss2.xml |
| 62 | 1 | http://nytimes.com | http://feeds.feedburner.com/essentialknowledge |
| 63 | 1 | http://nytimes.com | http://www.nytimes.com/services/xml/rss/nyt/Af... |
| 64 | 1 | http://nytimes.com | http://www.nytimes.com/services/xml/rss/nyt/Am... |
| 65 | 1 | http://nytimes.com | http://www.nytimes.com/services/xml/rss/nyt/Ar... |
| 66 | 1 | http://nytimes.com | http://www.nytimes.com/services/xml/rss/nyt/Ar... |
| 67 | 1 | http://nytimes.com | http://www.nytimes.com/services/xml/rss/nyt/As... |
| 68 | 1 | http://nytimes.com | http://www.nytimes.com/services/xml/rss/nyt/Au... |
| 69 | 1 | http://nytimes.com | http://www.nytimes.com/services/xml/rss/nyt/Ba... |
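
Eyeballing two `head()`-style tables only goes so far. One way to quantify the difference is to group feeds by `media_id` for each method and count the overlap. This is a sketch with made-up rows; `feeds_by_source` and `compare_feeds` are hypothetical helpers, and matching is by exact string, so `http://rss.nytimes.com/...` and `http://www.nytimes.com/...` would count as different feeds:

```python
from collections import defaultdict


def feeds_by_source(rows):
    """Group feed URLs into a {media_id: set(feed_url)} mapping."""
    grouped = defaultdict(set)
    for row in rows:
        grouped[int(row['media_id'])].add(row['feed_url'])
    return grouped


def compare_feeds(manual, api):
    """Count shared and method-specific feeds for every media_id seen in either mapping."""
    report = {}
    for media_id in sorted(set(manual) | set(api)):
        m = manual.get(media_id, set())
        a = api.get(media_id, set())
        report[media_id] = {
            'both': len(m & a),
            'manual_only': len(m - a),
            'api_only': len(a - m),
        }
    return report


# Placeholder rows (hypothetical), shaped like the rows of the two CSV files.
manual_rows = [
    {'media_id': 1, 'feed_url': 'http://example.com/a.xml'},
    {'media_id': 1, 'feed_url': 'http://example.com/b.xml'},
    {'media_id': 2, 'feed_url': 'http://example.org/feed/'},
]
api_rows = [
    {'media_id': 1, 'feed_url': 'http://example.com/b.xml'},
    {'media_id': 1, 'feed_url': 'http://example.com/c.xml'},
]

report = compare_feeds(feeds_by_source(manual_rows), feeds_by_source(api_rows))
for media_id, counts in report.items():
    print(media_id, counts)
```

With the notebook's data, the rows could come from `df.to_dict('records')` and `df_api.to_dict('records')`.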


In [1]:
from feed_seeker import find_feed_url

In [2]:
find_feed_url('http://www.rightwingwatch.org')

'http://www.rightwingwatch.org/feed/'
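
To compare feed_seeker with the other two methods, its results need to land in a CSV with the same `media_id,url,feed_url` columns. A sketch of such a loop follows. `collect_feeds` is a hypothetical helper that takes the lookup function as a parameter, so the block runs offline with a stub; in the notebook the `finder` argument would be `find_feed_url`. It assumes the finder returns a falsy value when no feed is found, and the `media_id` values below are placeholders:

```python
import csv
import io


def collect_feeds(sources, finder, out):
    """Write one media_id,url,feed_url row per source for which finder returns a feed.

    Returns the number of sources with a feed.
    """
    writer = csv.writer(out)
    writer.writerow(['media_id', 'url', 'feed_url'])
    found = 0
    for source in sources:
        try:
            feed_url = finder(source['url'])
        except Exception as err:  # one unreachable site should not abort the run
            print('FAILED:', source['url'], err)
            continue
        if feed_url:
            writer.writerow([source['media_id'], source['url'], feed_url])
            found += 1
    return found


def stub_finder(url):
    """Offline stand-in for feed_seeker.find_feed_url, used only for this sketch."""
    known = {'http://www.rightwingwatch.org': 'http://www.rightwingwatch.org/feed/'}
    return known.get(url)


sources = [
    {'media_id': 100, 'url': 'http://www.rightwingwatch.org'},  # placeholder id
    {'media_id': 101, 'url': 'http://example.com/'},            # placeholder source
]
buffer = io.StringIO()
count = collect_feeds(sources, stub_finder, buffer)
print(count)
print(buffer.getvalue())
```

Since `find_feed_url` returns a single URL per site, this yields at most one row per source, which matters when comparing feed counts against the API and manual lists.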