Looking at Amazon's robots.txt file (or Twitter's, or Facebook's), you may be surprised to see them prohibit or severely restrict scraping.  Aren't there a lot of projects online using Twitter data?  And how dare they keep all that delicious, delicious information to themselves?  But before you start setting `'ROBOTSTXT_OBEY': False`, read on!

Most of The Big Websites (Google, Facebook, Twitter, etc.) have APIs that allow you to access their information programmatically without using webpages.  This is good for both you and the website.  With an API, you can ask the server to send you only the specific information you want, without having to retrieve, filter out, and discard the HTML, CSS, JavaScript, and other page code from the website.  This minimizes demand on the server and speeds up your task.  

APIs typically include their own throttling to keep you from overloading the server, usually done by limiting the number of server requests per hour to a certain number.  
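
To make the idea of a per-hour request limit concrete, here is a minimal client-side sketch.  The class name `RequestBudget`, its parameters, and the limit values are hypothetical and not taken from any particular API; real services enforce their limits on the server and publish their own numbers.

```python
import time
from collections import deque


class RequestBudget:
    """Track request timestamps and refuse calls beyond a fixed window limit."""

    def __init__(self, max_requests, window_seconds, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock          # injectable so the logic can be tested
        self.timestamps = deque()   # times of requests still inside the window

    def allow(self):
        """Return True and record the request if the budget permits it."""
        now = self.clock()
        # Drop timestamps that have aged out of the window.
        while self.timestamps and now - self.timestamps[0] >= self.window_seconds:
            self.timestamps.popleft()
        if len(self.timestamps) < self.max_requests:
            self.timestamps.append(now)
            return True
        return False
```

A script could call `allow()` before each request and wait when it returns `False`, which keeps it under the limit instead of relying on the server to reject the extra calls.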

To access an API, you will usually need an API key or token that uniquely identifies you.  This lets the company or service providing the API keep an eye on your usage and track what you are doing.  Different API keys can also be associated with different levels of authorization and access, so they work as a data security measure.  Keys or tokens may also be set to expire after a certain amount of time or number of uses.

## Anatomy of an API

*Access*- You request a key.  Your program provides the key with each API call, and it determines what your program can do in the API.  
*Requests*- Your program requests the data you want with a call to the API.  The request will be made up of a method (type of query, using language defined by the API) and parameters (refine the query).  
*Response*- The data returned by the API, usually in a common format such as JSON that your program can parse.  
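
The three parts above can be sketched in a few lines of Python.  Everything specific here is made up for illustration: the endpoint `https://api.example.com/v1/search`, the `api_key` parameter name, the key value, and the sample JSON body are all hypothetical, and no network call is made.

```python
import json
from urllib.parse import urlencode

# Access: a key issued by the provider (hypothetical placeholder value).
API_KEY = "YOUR_KEY_HERE"

# Request: a method (here, the endpoint path) plus parameters that refine it.
BASE_URL = "https://api.example.com/v1/search"  # hypothetical endpoint


def build_request_url(query, limit=5):
    """Assemble the URL for one API call, including the key."""
    params = {"q": query, "limit": limit, "api_key": API_KEY}
    return "{}?{}".format(BASE_URL, urlencode(params))


# Response: APIs commonly return JSON, which parses into dicts and lists.
SAMPLE_RESPONSE = '{"results": [{"title": "Monty Python"}], "count": 1}'


def parse_titles(body):
    """Pull the titles out of a JSON body shaped like SAMPLE_RESPONSE."""
    return [entry["title"] for entry in json.loads(body)["results"]]


print(build_request_url("monty python"))
print(parse_titles(SAMPLE_RESPONSE))
```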

The specific syntax for each of these elements, and the format of the response, will vary from API to API.  In addition, APIs vary widely in their level of documentation and ease of use.  Before diving too deeply into an API-scraping project, do some judicious googling and if you see a lot of posts [like this one](https://mollyrocket.com/casey/stream_0029.html) consider going elsewhere.  Not all websites put their APIs front-and-center (did you know there are APIs for [NASA](https://api.nasa.gov/), [Marvel Comics](http://developer.marvel.com/), and [Star Wars](https://swapi.co/)?) so google will be your friend there as well.

## Basics of API Queries: Wikipedia's API

The process of using an API sounds a lot like scraping (make request, get response), but with an occasional added authorization layer.  Scrapy can handle authorization, so we can use it to access APIs too.

That said, the first API we'll pull from is [Wikipedia's](https://www.mediawiki.org/wiki/API:Main_page), which doesn't require an authorization key.  Aside from needing to master the API's language, you'll find that using scrapy with an API is very similar to using scrapy on a website.

We want to know what other entries on Wikipedia link to the [Monty Python](https://en.wikipedia.org/wiki/Monty_Python) page.  To do this, we can build a query using the [Wikipedia API Sandbox](https://en.wikipedia.org/wiki/Special:ApiSandbox).  Someone who is comfortable with the MediaWiki API syntax wouldn't need to use the sandbox, but for beginners it is very handy.  Note that API queries are nothing like SQL queries in syntax, despite their shared name.

The query we will use looks like this:
`https://en.wikipedia.org/w/api.php?action=query&format=xml&prop=linkshere&titles=Monty_Python&lhprop=title%7Credirect`

Let's break that down into its components:

* `w/api.php`
    * Tells the server that we are using the API to pull info, rather than scraping the raw pages.  
    
* `action=query`   
    * We want information from the API (as opposed to changing information in the API)  
    
* `format=xml`  
    * Return the response as XML, which we will then parse with XPath  
    
* `prop=linkshere`  
    * We are interested in which pages link to our target page 
    
* `titles=Monty_Python`  
    * The target page is the Monty Python page.  Note that we used the exact name of the wikipedia page (Monty_Python).  
    
* `lhprop=title`  
    * From those links, we want the title of each page  
    
* `redirect`  
    * We also want to know if that link is a redirect  
    

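The `?` and `&` in the query are standard URL syntax: `?` starts the query string and `&` separates its parameters.  The `%7C` between `title` and `redirect` is the URL-encoded pipe character (`|`), which the MediaWiki API uses to separate multiple values for a single parameter.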

For most of the query elements, we could have passed multiple arguments.  For example, we could request the page ID as well as the title of the linking pages, or ask for all the pages that link to Monty_Python and to Monty_Python's_Flying_Circus.  
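
As a sketch of how such a query string is assembled, the snippet below rebuilds the URL above with Python's standard library.  The helper name `build_query` is hypothetical; the point is that multiple values for one parameter are joined with `|`, which URL encoding turns into `%7C`.

```python
from urllib.parse import urlencode

API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def build_query(titles, lhprop):
    """Build a 'linkshere' query; lists are joined with '|' before encoding."""
    params = {
        "action": "query",
        "format": "xml",
        "prop": "linkshere",
        "titles": "|".join(titles),
        "lhprop": "|".join(lhprop),
    }
    return "{}?{}".format(API_ENDPOINT, urlencode(params))


# Reproduces the query used in this lesson.
print(build_query(["Monty_Python"], ["title", "redirect"]))

# Two target pages in one query.
print(build_query(["Monty_Python", "Monty_Python's_Flying_Circus"], ["title"]))
```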

A query like this highlights why APIs are so handy.  Without an API, to find out the name of every page on Wikipedia that links to the Monty Python page we would have to scrape every single one of the 5,000,000+ articles in the English-language Wikipedia.  

If you haven't done so already, click on the query link above and see what it returns.



## Why use Scrapy for API calls

For some API calls, scrapy would be overkill.  If you know that your query can be answered in one response, then you don't need scrapy- you can use the `requests` library to make your API call and a library like `lxml` to parse the return.
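
For a single-response query, a sketch without scrapy might look like the following.  It uses only the standard library (`urllib.request` and `xml.etree.ElementTree`) as a stand-in for `requests` and `lxml`, and the `SAMPLE_XML` string is a hand-written, simplified illustration of the response shape, not real API output.

```python
import urllib.request
import xml.etree.ElementTree as ET

QUERY_URL = (
    "https://en.wikipedia.org/w/api.php?action=query&format=xml"
    "&prop=linkshere&titles=Monty_Python&lhprop=title%7Credirect"
)

# Hand-written sample using the element and attribute names that the
# spider in this lesson reads (lh, ns, title, continue/lhcontinue).
SAMPLE_XML = """<api>
  <continue lhcontinue="123" continue="||" />
  <query><pages><page title="Monty Python"><linkshere>
    <lh ns="0" title="Example article" />
    <lh ns="1" title="Talk:Example article" />
  </linkshere></page></pages></query>
</api>"""


def fetch(url):
    """Download the raw response body (defined here but not called below)."""
    request = urllib.request.Request(url, headers={"User-Agent": "example-script"})
    with urllib.request.urlopen(request) as reply:
        return reply.read().decode("utf-8")


def article_titles(xml_text):
    """Return titles of linking pages in namespace 0 (regular entries)."""
    root = ET.fromstring(xml_text)
    return [lh.get("title") for lh in root.iter("lh") if lh.get("ns") == "0"]


print(article_titles(SAMPLE_XML))
```

To run it against the live API, `article_titles(fetch(QUERY_URL))` would replace the sample string.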

The Wikipedia API, however, returns only ten items at a time for this query by default.  This sort of limitation is common among APIs, since it avoids overwhelming the server.  We can use scrapy to iterate over query results the same way that we iterated over the pages of the EverydaySexism website. 
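
The continuation pattern itself does not depend on scrapy.  The sketch below follows continuation tokens until none is returned or a page cap is hit; `fetch_page` is a hypothetical callable standing in for the real HTTP request, and the fake pages exist only to exercise the loop.

```python
def collect_all(fetch_page, max_pages=10):
    """Follow continuation tokens, gathering items from each batch.

    fetch_page(token) must return (items, next_token); next_token is None
    when there are no more results.
    """
    results = []
    token = None
    for _ in range(max_pages):
        items, token = fetch_page(token)
        results.extend(items)
        if token is None:
            break
    return results


# Fake three-page API used only to demonstrate the loop.
FAKE_PAGES = {
    None: (["a", "b"], "t1"),
    "t1": (["c", "d"], "t2"),
    "t2": (["e"], None),
}

print(collect_all(FAKE_PAGES.get))      # follows all three pages
print(collect_all(FAKE_PAGES.get, 2))   # stops after two pages
```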

Let's see the Wikipedia API and scrapy in action:



In [1]:
import scrapy
from scrapy.crawler import CrawlerProcess


class WikiSpider(scrapy.Spider):
    name = "WS"
    
    # Here is where we insert our API call.
    start_urls = [
        'https://en.wikipedia.org/w/api.php?action=query&format=xml&prop=linkshere&titles=Monty_Python&lhprop=title%7Credirect'
        ]

    # Identifying the information we want from the query response and extracting it using xpath.
    def parse(self, response):
        for item in response.xpath('//lh'):
            # The ns code identifies the type of page the link comes from.  '0' means it is a Wikipedia entry.
            # Other codes indicate links from 'Talk' pages, etc.  Since we are only interested in entries, we filter:
            if item.xpath('@ns').extract_first() == '0':
                yield {
                    'title': item.xpath('@title').extract_first() 
                    }
        # Getting the information needed to continue to the next ten entries.
        next_page = response.xpath('continue/@lhcontinue').extract_first()
        
        # Recursively calling the spider to process the next ten entries, if they exist.
        if next_page is not None:
            next_page = '{}&lhcontinue={}'.format(self.start_urls[0],next_page)
            yield scrapy.Request(next_page, callback=self.parse)
            
    
process = CrawlerProcess({
    'FEED_FORMAT': 'json',
    'FEED_URI': 'PythonLinks.json',
    # Note that because we are doing API queries, the robots.txt file doesn't apply to us.
    'ROBOTSTXT_OBEY': False,
    'USER_AGENT': 'ThinkfulDataScienceBootcampCrawler (thinkful.com)',
    'AUTOTHROTTLE_ENABLED': True,
    'HTTPCACHE_ENABLED': True,
    'LOG_ENABLED': False,
    # CLOSESPIDER_PAGECOUNT stops the spider after 10 responses; at ten links
    # per response, that limits our scraper to the first 100 links.
    'CLOSESPIDER_PAGECOUNT': 10
})
                                         

# Starting the crawler with our spider.
process.crawl(WikiSpider)
process.start()
print('First 100 links extracted!')

First 100 links extracted!


In [2]:
import pandas as pd

# Checking whether we got data 

Monty=pd.read_json('https://tf-assets-prod.s3.amazonaws.com/tf-curric/data-science/PythonLinks.json', orient='records')
print(Monty.shape)
print(Monty.tail())

(92, 1)
                        title
87               Hans Moleman
88              Ripping Yarns
89  List of British comedians
90         Wensleydale cheese
91              Art Garfunkel


## Wrap up

Our API call was successful.  While we examined 100 links, we only saved 92 (the others weren't links from entry pages).  

We've barely scraped (pun intended) the surface of what scrapy and APIs can do.  Scrapy has changed a lot in the years since its debut, so when googling make sure the answers you see are from 2015 or later-- otherwise you'll likely not be able to use the code.  

Back to the issue of authorization keys- often the key is simply included in the query string as an additional argument.  In other cases, if you need your scraper to be able to enter a key or login information into a form, scrapy [has you covered](http://stackoverflow.com/questions/30102199/form-authentication-login-a-site-using-scrapy).  
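
When the key travels in the query string, adding it is just one more parameter.  A minimal sketch, assuming a parameter called `api_key`; the actual name varies by API, and the helper `add_key` and the example URL are hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def add_key(url, key, param_name="api_key"):
    """Return url with the key appended to its existing query parameters."""
    parts = urlsplit(url)
    params = parse_qsl(parts.query)
    params.append((param_name, key))
    return urlunsplit(parts._replace(query=urlencode(params)))


print(add_key("https://api.example.com/data?format=json", "YOUR_KEY_HERE"))
```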

There's a lot of fun to be had in scraping and APIs-- it's a way to feel like you're getting a lot of information with very little effort!  Beware, however.  You're not getting information at all.  Scraping gives you *data*, an undifferentiated mess of bytes with no compelling meaning on its own.  Think of that list of Wiki entries that link to Monty Python.  It's cool that we could get it, but what does it mean?  Your job as a data scientist is to convert *data* to *information*-- something people can use to make decisions or understand the world.  Modeling data to get information is hard but worthwhile work, and it's those kinds of projects that will really build your portfolio as you go on the market.  

That said, scraping up some original data can provide the *foundation* for an interesting and original final project.

## Challenge

Do a little scraping or API-calling of your own.  Pick a new website and see what you can get out of it.  Expect that you'll run into bugs and blind alleys, and rely on your mentor to help you get through.  

Formally, your goal is to write a scraper that will:

1) Return specific pieces of information (rather than just downloading a whole page)  
2) Iterate over multiple pages/queries  
3) Save the data to your computer  

Once you have your data, compute some statistical summaries and/or visualizations that give you some new insights into your scraping topic of interest.  Write up a report from scraping code to summary and share it with your mentor.

In [1]:
# Importing in each cell because of the kernel restarts.

#First Iteration: Scraping only the Titles

import scrapy
import re
from scrapy.crawler import CrawlerProcess

class BASpider(scrapy.Spider):
    # Naming the spider is important if you are running more than one spider of
    # this class simultaneously.
    name = "BAS"
    
    # URL(s) to start with.
    start_urls = [
        'https://www.bonappetit.com/basically',
    ]

    # Use XPath to parse the response we get.
    def parse(self, response):
        
        # Iterate over every standard card <div> on the page.
        for article in response.xpath('//*[@id="react-app"]/div/div[2]/div/div[@class="basically__home__card basically__home__card--full basically__home__card--standard"]'):
            
            # Yield a dictionary with the values we want.
            yield {
                
                'title': article.xpath('span/h2/a/span/text()').extract()

                
                #'name': article.xpath('header/h2/a/@title').extract_first(),
                #'date': article.xpath('header/section/span[@class="entry-date"]/text()').extract_first(),
                #'text': article.xpath('section[@class="entry-content"]/p/text()').extract(),
                #'tags': article.xpath('*/span[@class="tag-links"]/a/text()').extract()
            }
        # Get the URL of the previous page.
#        next_page = response.xpath('//div[@class="nav-previous"]/a/@href').extract_first()
        
        # There are a LOT of pages here.  For our example, we'll just scrape the first 9.
        # This finds the page number. The next segment of code prevents us from going beyond page 9.
#        pagenum = int(re.findall(r'\d+',next_page)[0])
        
        # Recursively call the spider to run on the next page, if it exists.
#        if next_page is not None and pagenum < 10:
#            next_page = response.urljoin(next_page)
            # Request the next page and recursively parse it the same way we did above
#            yield scrapy.Request(next_page, callback=self.parse)

# Tell the script how to run the crawler by passing in settings.
# The new settings have to do with scraping etiquette.          
process = CrawlerProcess({
    'FEED_FORMAT': 'json',         # Store data in JSON format.
    'FEED_URI': 'Basically.json',       # Name our storage file.
    'LOG_ENABLED': False,          # Turn off logging for now.
    'ROBOTSTXT_OBEY': True,
    'USER_AGENT': 'ThinkfulDataScienceBootcampCrawler (thinkful.com)',
    'AUTOTHROTTLE_ENABLED': True,
    'HTTPCACHE_ENABLED': True
})

# Start the crawler with our spider.
process.crawl(BASpider)
process.start()
print('Success!')

Success!


In [1]:
### Specific Article
# Importing in each cell because of the kernel restarts.
import scrapy
import re
from scrapy.crawler import CrawlerProcess

class BAArtSpider(scrapy.Spider):
    # Naming the spider is important if you are running more than one spider of
    # this class simultaneously.
    name = "BAArt"
    
    # URL(s) to start with.
    start_urls = [
        'https://www.bonappetit.com/story/what-is-a-waxy-potato',
    ]

    # Use XPath to parse the response we get.
    def parse(self, response):
        
        # Select the <article> element that holds the story.
        article = response.xpath('//*[@id="main-content"]/article')
            
        # Yield a dictionary with the values we want.
        yield {

            'title': article.xpath('div[1]/header/div/div[1]/h1/text()').extract(),
            'subtitle': article.xpath('div[1]/header/div/div[2]/p/text()').extract(),
            'contributor': article.xpath('div[1]/header/div/div[2]/div[1]/div/div/div/div/a/text()').extract(),
            'date': article.xpath('div[1]/header/div/div[2]/div[1]/div/time/text()').extract(),
            'text': " ".join(article.xpath('div[2]/div/div[1]/div/div[1]//text()').extract())
            
            #'name': article.xpath('header/h2/a/@title').extract_first(),
            #'date': article.xpath('header/section/span[@class="entry-date"]/text()').extract_first(),
            #'text': article.xpath('section[@class="entry-content"]/p/text()').extract(),
            #'tags': article.xpath('*/span[@class="tag-links"]/a/text()').extract()
        }
        # Get the URL of the previous page.
#        next_page = response.xpath('//div[@class="nav-previous"]/a/@href').extract_first()
        
        # There are a LOT of pages here.  For our example, we'll just scrape the first 9.
        # This finds the page number. The next segment of code prevents us from going beyond page 9.
#        pagenum = int(re.findall(r'\d+',next_page)[0])
        
        # Recursively call the spider to run on the next page, if it exists.
#        if next_page is not None and pagenum < 10:
#            next_page = response.urljoin(next_page)
            # Request the next page and recursively parse it the same way we did above
#            yield scrapy.Request(next_page, callback=self.parse)

# Tell the script how to run the crawler by passing in settings.
# The new settings have to do with scraping etiquette.          
process = CrawlerProcess({
    'FEED_FORMAT': 'json',         # Store data in JSON format.
    'FEED_URI': 'BasicallyPotatoe.json',       # Name our storage file.
    'LOG_ENABLED': False,          # Turn off logging for now.
    'ROBOTSTXT_OBEY': True,
    'USER_AGENT': 'ThinkfulDataScienceBootcampCrawler (thinkful.com)',
    'AUTOTHROTTLE_ENABLED': True,
    'HTTPCACHE_ENABLED': True
})

# Start the crawler with our spider.
process.crawl(BAArtSpider)
process.start()
print('Success!')

Success!


In [1]:
### Specific Article
# Importing in each cell because of the kernel restarts.
import scrapy
import re
from scrapy.crawler import CrawlerProcess

class BasicallySpider(scrapy.Spider):
    # Naming the spider is important if you are running more than one spider of
    # this class simultaneously.
    name = "Basically"

    # URL(s) to start with.
    start_urls = ['https://www.bonappetit.com/basically']
    BASE_URL = 'https://www.bonappetit.com/'

    # Use XPath to parse the response we get.
    def parse(self, response):
        
        links = response.xpath('//*[@id="react-app"]/div/div[2]/div/div/span/h2/a/@href').extract()
        for link in links:
            absolute_url = self.BASE_URL + link
            yield scrapy.Request(absolute_url, callback = self.parse_attr)
        
    def parse_attr(self, response):
        # Select the <article> element that holds the story.
        article = response.xpath('//*[@id="main-content"]/article')
            
        # Yield a dictionary with the values we want.
        yield {

            'title': article.xpath('div[1]/header/div/div[1]/h1/text()').extract(),
            'subtitle': article.xpath('div[1]/header/div/div[2]/p/text()').extract(),
            'contributor': article.xpath('div[1]/header/div/div[2]/div[1]/div/div/div/div/a/text()').extract(),
            'date': article.xpath('div[1]/header/div/div[2]/div[1]/div/time/text()').extract(),
            'text': " ".join(article.xpath('div[2]/div/div[1]/div/div[1]//text()').extract())
            
             }

# Tell the script how to run the crawler by passing in settings.
# The new settings have to do with scraping etiquette.          
process = CrawlerProcess({
    'FEED_FORMAT': 'json',                  # Store data in JSON format.
    'FEED_URI': 'BasicallyArticles.json',   # Name our storage file.
    'LOG_ENABLED': False,                   # Turn off logging for now.
    'ROBOTSTXT_OBEY': True,
    'USER_AGENT': 'ThinkfulDataScienceBootcampCrawler (thinkful.com)',
    'AUTOTHROTTLE_ENABLED': True,
    'HTTPCACHE_ENABLED': True
})

# Start the crawler with our spider.
process.crawl(BasicallySpider)
process.start()
print('Success!')

Success!


In [1]:
### The Final Round! This one actually worked.
# Importing in each cell because of the kernel restarts.
import scrapy
import re
from scrapy.crawler import CrawlerProcess

class BasicItem(scrapy.Item):
    # define the fields for your item here like:
    link = scrapy.Field()
    title = scrapy.Field()
    subtitle = scrapy.Field()
    contributor = scrapy.Field()
    date = scrapy.Field()
    text = scrapy.Field()

class BasicallySpider(scrapy.Spider):
    # Naming the spider is important if you are running more than one spider of
    # this class simultaneously.
    name = "Basically"

    # URL(s) to start with.
    start_urls = ['https://www.bonappetit.com/basically']
    BASE_URL = 'https://www.bonappetit.com/'

    # Use XPath to parse the response we get.
    def parse(self, response):
        
        links = response.xpath('//*[@id="react-app"]/div/div[2]/div/div/span/h2/a/@href').extract()
        for link in links:
            absolute_url = self.BASE_URL + link
            yield scrapy.Request(absolute_url, callback = self.parse_attr)
        
    def parse_attr(self, response):
        article = response.xpath('//*[@id="main-content"]/article')
        item = BasicItem()
        item['link'] = response.url
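        # Note: the trailing commas on the next four lines wrap each extracted list in a
        # one-element tuple, which is why these fields appear doubly nested in the output.
        item['title'] = article.xpath('div[1]/header/div/div[1]/h1/text()').extract(),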
        item['subtitle'] = article.xpath('div[1]/header/div/div[2]/p/text()').extract(),
        item['contributor'] =  article.xpath('div[1]/header/div/div[2]/div[1]/div/div/div/div/a/text()').extract(),
        item['date'] = article.xpath('div[1]/header/div/div[2]/div[1]/div/time/text()').extract(),
        item['text'] = " ".join(article.xpath('div[2]/div/div[1]/div/div[1]//text()').extract())
        return item

# Tell the script how to run the crawler by passing in settings.
# The new settings have to do with scraping etiquette.          
process = CrawlerProcess({
    'FEED_FORMAT': 'json',                  # Store data in JSON format.
    'FEED_URI': 'BasicallyArticles.json',   # Name our storage file.
    'LOG_ENABLED': False,                   # Turn off logging for now.
    'ROBOTSTXT_OBEY': True,
    'USER_AGENT': 'ThinkfulDataScienceBootcampCrawler (thinkful.com)',
    'AUTOTHROTTLE_ENABLED': True,
    'HTTPCACHE_ENABLED': True
})

# Start the crawler with our spider.
process.crawl(BasicallySpider)
process.start()
print('Success!')

Success!


In [4]:
import pandas as pd
basic_df = pd.read_json('BasicallyArticles.json')
basic_df[basic_df['contributor'] == '[[Sarah Jampel]]']

Unnamed: 0,contributor,date,link,subtitle,text,title
0,[[Christina Chae]],"[[September 13, 2019]]",https://www.bonappetit.com/story/should-you-bu...,[[What’s it good for? How do you use it? And m...,I didn’t think of myself as an “appliance junk...,[[Real Talk: Should You Buy a Food Processor?]]
1,[[Al Cullito]],"[[June 6, 2019]]",https://www.bonappetit.com/story/happy-hour-wi...,"[[This refreshing, low-ABV cocktail is what Ju...",Tired of cocktail recipes that call for expens...,[[A Pimm's Cup Recipe to Start Summer Off Right]]
2,[[Elyse Inamin]],"[[June 7, 2019]]",https://www.bonappetit.com/story/never-fail-ki...,[[You can't go wrong with spaghetti tossed wit...,"Welcome to Never Fail, a weekly column where...",[[The Kinda-Healthy Kale Pasta I Can Always Co...
3,[[Molly Ba]],"[[June 11, 2019]]",https://www.bonappetit.com/story/mediocre-berr...,"[[Not all berries are created equal, but these...",If you follow any Bon Appétit staffer on Ins...,[[These 2 Ingredients Make Even Mediocre Berri...
4,[[Alex Delan]],"[[June 14, 2019]]",https://www.bonappetit.com/story/better-whippe...,[[Your whipped cream could be even better. (Ye...,Some would call us big whipped cream people ov...,[[This One Ingredient Takes Whipped Cream From...


In [5]:
basic_df['text'][0]

"I didn’t think of myself as an “appliance junkie” before I moved into my current apartment. By Brooklyn standards, it feels luxuriously large, like I could run a restaurant supply store out of the kitchen. As it turns out, all I really needed to change my tune was simply more square-footage: I’m now the co-owner of, among other things, a food processor that I’m surprised to find myself reaching for all the time. After a year of chopping, slicing, shredding, and puréeing, I have some thoughts on whether you, too, should own a food processor. First, let’s talk about what a food processor even  is . At minimum, a basic food processor comes with a bowl, a removable lid, a base (which contains the motor), and a very sharp blade. It’s more or less an extremely powerful knife that excels at quickly chopping and grinding tons of different ingredients, from onions to to nuts to hard cheeses, in a matter of seconds. The set of attachments will shred and slice carrots ( carrot cake !), cabbage (

In [22]:
contributor_list = [contributor[0][0] for contributor in basic_df['contributor']]
#pd.Series(contributor_list).unique()
basic_df['contributor'] = contributor_list
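
The double index is needed because each scraped value arrives doubly nested: `.extract()` returns a list, and the trailing comma after the assignment in `parse_attr` wraps that list in a one-element tuple (which becomes a list once written to JSON).  A small pure-Python sketch of the effect, using a made-up name:

```python
# .extract() returns a list of matches (a one-item list here).
extracted = ["Example Author"]  # made-up value for illustration

# A trailing comma turns the right-hand side into a one-element tuple.
wrapped = extracted,
print(wrapped)

# Two indexing steps recover the plain string.
print(wrapped[0][0])
```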

In [23]:
basic_df.head()

Unnamed: 0,contributor,date,link,subtitle,text,title
0,Christina Chae,"[[September 13, 2019]]",https://www.bonappetit.com/story/should-you-bu...,[[What’s it good for? How do you use it? And m...,I didn’t think of myself as an “appliance junk...,[[Real Talk: Should You Buy a Food Processor?]]
1,Al Cullito,"[[June 6, 2019]]",https://www.bonappetit.com/story/happy-hour-wi...,"[[This refreshing, low-ABV cocktail is what Ju...",Tired of cocktail recipes that call for expens...,[[A Pimm's Cup Recipe to Start Summer Off Right]]
2,Elyse Inamin,"[[June 7, 2019]]",https://www.bonappetit.com/story/never-fail-ki...,[[You can't go wrong with spaghetti tossed wit...,"Welcome to Never Fail, a weekly column where...",[[The Kinda-Healthy Kale Pasta I Can Always Co...
3,Molly Ba,"[[June 11, 2019]]",https://www.bonappetit.com/story/mediocre-berr...,"[[Not all berries are created equal, but these...",If you follow any Bon Appétit staffer on Ins...,[[These 2 Ingredients Make Even Mediocre Berri...
4,Alex Delan,"[[June 14, 2019]]",https://www.bonappetit.com/story/better-whippe...,[[Your whipped cream could be even better. (Ye...,Some would call us big whipped cream people ov...,[[This One Ingredient Takes Whipped Cream From...


In [33]:
basic_df[basic_df['contributor'] == 'Sarah Jampe']['text'][148]

"Here’s a thought: Would life would be (just a tad bit) easier if, at the grocery store, potatoes were divided up into the “waxy,” “floury,” and “in-between” categories? If you’ve ever bought a couple of russets thinking you were going to make potato salad, or a pound of fingerlings thinking they’d make a really unique mashed potato situation… then you feel my pain. Waxy and floury potatoes are not interchangeable—your gloppy potato salad or gluey mash will have alerted you to that truth. But what do these categories even mean, which type is best for what use, and what common potatoes fall in each group?! And is there a kind of potato that will almost surely work in any situation? You have questions, and we have answers: Waxy New potatoes , Red Bliss, pee wees, fingerlings! These potatoes, which are often small (and apparently, fun-named?), have thin, smooth skin and creamy, almost shiny flesh. They’re also known for their particularly potato-forward flavor. Because waxy potatoes are r

In [26]:
basic_df['contributor'].unique()

array(['Christina Chae', 'Al Cullito', 'Elyse Inamin', 'Molly Ba',
       'Alex Delan', 'Sarah Jampe', 'Carla Lalli Musi', 'Alison Roma',
       'Amiel Stane', 'Aliza Abarbane', 'Emma Wartzma', 'Claire Saffit',
       'Alyse Whitne', 'Julia Krame', 'Jesse Spark', 'Rochelle Bilo',
       'Alex Begg', 'Emily Schult', 'Elaheh Nozar', 'Hilary Cadiga',
       'Carey Poli', 'Amanda Shapir', 'Rachel Karte', 'Adam Rapopor',
       'Meryl Rothstei', 'Alex Pastro'], dtype=object)

In [34]:
basic_df.shape

(149, 6)

The scraper successfully collected the contents, broken out by field, of 149 articles from the Bon Appétit 'Basically' blog.