## Challenge

Do a little scraping or API-calling of your own.  Pick a new website and see what you can get out of it.  Expect that you'll run into bugs and blind alleys, and rely on your mentor to help you get through.  

Formally, your goal is to write a scraper that will:

1) Return specific pieces of information (rather than just downloading a whole page)  
2) Iterate over multiple pages/queries  
3) Save the data to your computer  

Once you have your data, compute some statistical summaries and/or visualizations that give you some new insights into your scraping topic of interest.  Write up a report from scraping code to summary and share it with your mentor.

In [1]:
import json
import scrapy
from scrapy.crawler import CrawlerProcess

class SpidyQuotesSpider(scrapy.Spider):
    name = 'newsapi'
    start_urls = [
        'https://newsapi.org/v2/top-headlines?country=us&apiKey=efd926684bba4292a267b8d2c8b52084',
        'https://newsapi.org/v2/top-headlines?country=gb&apiKey=efd926684bba4292a267b8d2c8b52084',
        'https://newsapi.org/v2/top-headlines?country=ru&apiKey=efd926684bba4292a267b8d2c8b52084',
        'https://newsapi.org/v2/top-headlines?country=ca&apiKey=efd926684bba4292a267b8d2c8b52084',
        'https://newsapi.org/v2/top-headlines?country=fr&apiKey=efd926684bba4292a267b8d2c8b52084']
 
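    def parse(self, response):
        # The API returns JSON, so decode the body instead of using selectors.
        data = json.loads(response.body)
        for article in data['articles']:
            # Note: the 'title' field stores the name of the news source.
            source_name = article['source']['name']
            print(source_name)
            yield {'title': source_name}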
                    
process = CrawlerProcess({
    'FEED_FORMAT': 'json',
    'FEED_URI': 'NewsLinks.json',
    # Note that because we are doing API queries, the robots.txt file doesn't apply to us.
    'ROBOTSTXT_OBEY': False,
    'USER_AGENT': 'ThinkfulDataScienceBootcampCrawler (thinkful.com)',
    'AUTOTHROTTLE_ENABLED': True,
    'HTTPCACHE_ENABLED': True,
    'LOG_ENABLED': False,
    # CLOSESPIDER_PAGECOUNT caps the number of responses crawled; we only request 5 URLs, so 10 is a safe upper bound.
    'CLOSESPIDER_PAGECOUNT' : 10
})
                                         

# Starting the crawler with our spider.
process.crawl(SpidyQuotesSpider)
process.start()
print('Source names extracted!')

Mercurynews.com
CNBC
USA Today
Chicagotribune.com
Bbc.com
CNN
The Wall Street Journal
Nydailynews.com
The Washington Post
The Washington Post
The New York Times
Latimes.com
Chicagotribune.com
Reuters
The Washington Post
Npr.org
The Wall Street Journal
The Washington Post
The Wall Street Journal
NBC News
Daily Mail
The Guardian (AU)
Independent
The Economist
The Economist
Independent
Independent
Bloomberg
Mirror
Standard.co.uk
Daily Mail
The Guardian (AU)
The Guardian (AU)
Football365.com
Daily Mail
Standard.co.uk
The Telegraph
The Guardian (AU)
Gloucestershirelive.co.uk
Daily Mail
Lenta
Vesti.ru
Fontanka.ru
Interfax.ru
RBC
Tass.ru
RBC
Lenta
Ria.ru
Rambler.ru
RBC
Vesti.ru
Ria.ru
Riafan.ru
Ria.ru
Www.bfm.ru
Tvrain.ru
Rambler.ru
RBC
RBC
Vox.com
The Globe And Mail
Nationalpost.com
Bbc.com
Fox News
CBC News
Fox News
Thechronicleherald.ca
The New York Times
CBC News
Huffingtonpost.ca
CBC News
The Globe And Mail
The Washington Post
Reuters
Ctvnews.ca
Anandtech.com
The Globe And Mail
Al.com
Sp
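
The five start URLs above differ only in the country code, so they could be generated instead of written out by hand. The following is a minimal sketch, not part of the original notebook: `build_start_urls` and the `NEWSAPI_KEY` environment variable are hypothetical names, and reading the key from the environment is just one way to keep it out of the notebook.

```python
import os
from urllib.parse import urlencode

BASE_URL = 'https://newsapi.org/v2/top-headlines'
COUNTRIES = ['us', 'gb', 'ru', 'ca', 'fr']


def build_start_urls(countries, api_key):
    """Return one top-headlines URL per country code (hypothetical helper)."""
    return [
        f"{BASE_URL}?{urlencode({'country': code, 'apiKey': api_key})}"
        for code in countries
    ]


# NEWSAPI_KEY is an assumed environment variable name; the fallback is a placeholder.
api_key = os.environ.get('NEWSAPI_KEY', 'YOUR_API_KEY')
start_urls = build_start_urls(COUNTRIES, api_key)
```

The resulting list could then be assigned to `start_urls` inside the spider class.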

In [2]:
import pandas as pd

# Checking whether we got data 

News = pd.read_json('NewsLinks.json')
print(News.shape)
print(News.tail())

(100, 1)
                 title
95          Ctvnews.ca
96       Anandtech.com
97  The Globe And Mail
98              Al.com
99       Spokesman.com


In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy as sp
import seaborn as sns
from scipy import stats, integrate
%matplotlib inline

In [5]:
np.unique(News, return_counts=True)

(array(['20minutes.fr', 'Al.com', 'Anandtech.com', 'Bbc.com',
        'Bienpublic.com', 'Bloomberg', 'CBC News', 'CNBC', 'CNN',
        'Chicagotribune.com', 'Ctvnews.ca', 'Daily Mail', 'Fontanka.ru',
        'Football365.com', 'Fox News', 'Francebleu.fr', 'Francetvinfo.fr',
        'Gloucestershirelive.co.uk', 'Huffingtonpost.ca', 'Independent',
        'Interfax.ru', "L'equipe", 'Latimes.com', 'Le Monde',
        'Ledauphine.com', 'Lefigaro.fr', 'Lenta', 'Leparisien.fr',
        'Lepoint.fr', 'Mercurynews.com', 'Mirror', 'NBC News',
        'Nationalpost.com', 'Npr.org', 'Nydailynews.com', 'RBC',
        'Rambler.ru', 'Reuters', 'Ria.ru', 'Riafan.ru', 'Spokesman.com',
        'Standard.co.uk', 'Tass.ru', 'The Economist', 'The Globe And Mail',
        'The Guardian (AU)', 'The New York Times', 'The Telegraph',
        'The Wall Street Journal', 'The Washington Post',
        'Thechronicleherald.ca', 'Tvrain.ru', 'USA Today', 'Vesti.ru',
        'Vox.com', 'Www.bfm.ru'], dtype=object),

In [6]:
news_dict = {}
for each in News.title:
    if each in news_dict:
        news_dict[each] += 1
    else:
        news_dict[each] = 1

In [7]:
news_dict

{'20minutes.fr': 3,
 'Al.com': 1,
 'Anandtech.com': 1,
 'Bbc.com': 2,
 'Bienpublic.com': 1,
 'Bloomberg': 1,
 'CBC News': 3,
 'CNBC': 1,
 'CNN': 1,
 'Chicagotribune.com': 2,
 'Ctvnews.ca': 1,
 'Daily Mail': 4,
 'Fontanka.ru': 1,
 'Football365.com': 1,
 'Fox News': 2,
 'Francebleu.fr': 1,
 'Francetvinfo.fr': 4,
 'Gloucestershirelive.co.uk': 1,
 'Huffingtonpost.ca': 1,
 'Independent': 3,
 'Interfax.ru': 1,
 "L'equipe": 1,
 'Latimes.com': 1,
 'Le Monde': 1,
 'Ledauphine.com': 1,
 'Lefigaro.fr': 4,
 'Lenta': 2,
 'Leparisien.fr': 3,
 'Lepoint.fr': 1,
 'Mercurynews.com': 1,
 'Mirror': 1,
 'NBC News': 1,
 'Nationalpost.com': 1,
 'Npr.org': 1,
 'Nydailynews.com': 1,
 'RBC': 5,
 'Rambler.ru': 2,
 'Reuters': 2,
 'Ria.ru': 3,
 'Riafan.ru': 1,
 'Spokesman.com': 1,
 'Standard.co.uk': 2,
 'Tass.ru': 1,
 'The Economist': 2,
 'The Globe And Mail': 3,
 'The Guardian (AU)': 4,
 'The New York Times': 2,
 'The Telegraph': 1,
 'The Wall Street Journal': 3,
 'The Washington Post': 5,
 'Thechronicleherald.ca

In [8]:
for each in news_dict:
    if news_dict[each] > 2:
        print(each)

The Wall Street Journal
The Washington Post
Daily Mail
The Guardian (AU)
Independent
RBC
Ria.ru
The Globe And Mail
CBC News
Francetvinfo.fr
Lefigaro.fr
20minutes.fr
Leparisien.fr
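
The counting and filtering steps above can also be expressed with `collections.Counter` from the standard library. This is a sketch of an alternative, not the approach used in the notebook; `frequent_sources` is a hypothetical helper and the sample list is made up for illustration.

```python
from collections import Counter


def frequent_sources(names, min_count=3):
    """Return (name, count) pairs for names seen at least min_count times."""
    counts = Counter(names)
    # most_common() sorts by descending count, so the busiest sources come first.
    return [(name, n) for name, n in counts.most_common() if n >= min_count]


# Made-up sample data, only to show the shape of the result.
sample = ['RBC', 'RBC', 'RBC', 'Lenta', 'Lenta', 'Reuters']
print(frequent_sources(sample))  # [('RBC', 3)]
```

In the notebook, the same function could be called as `frequent_sources(News.title)`, which matches the `> 2` threshold used above.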


## Conclusion

For this challenge I used News API (https://newsapi.org/docs) as an access point for news articles from multiple sources around the world. The challenge was difficult for several reasons. First, the lesson had only shown how to access an XML API page rather than a JSON source. Next, I wasn't able to use the 'next page' code because this JSON API doesn't have a 'next page' button. Lastly, for some reason my computer sometimes returned incorrect information when querying the API. It took a lot of trial and error to reach the point of identifying the 13 news sources that appeared more than twice among the top stories in the US, UK, Russia, Canada, and France.
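
JSON APIs often replace the 'next page' button with query parameters. The sketch below assumes that the endpoint accepts `page` and `pageSize` parameters and that the response reports a `totalResults` count; both should be checked against the News API documentation before relying on them. `page_urls` is a hypothetical helper, not part of the notebook.

```python
import math
from urllib.parse import urlencode

BASE_URL = 'https://newsapi.org/v2/top-headlines'


def page_urls(country, api_key, total_results, page_size=20):
    """Build one URL per results page (assumes 'page'/'pageSize' parameters)."""
    n_pages = math.ceil(total_results / page_size)
    urls = []
    for page in range(1, n_pages + 1):
        params = {
            'country': country,
            'apiKey': api_key,
            'pageSize': page_size,
            'page': page,
        }
        urls.append(f"{BASE_URL}?{urlencode(params)}")
    return urls


# 45 results at 20 per page need 3 requests.
for url in page_urls('us', 'YOUR_API_KEY', total_results=45):
    print(url)
```

Inside a Scrapy spider, `parse` could read the total from the first response and yield a `scrapy.Request` for each remaining URL.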