# Scraping Tutorial - Whole Foods

We keep using Whole Foods (WF) as an example of looking up average demographics.  To find the average demographic for Whole Foods, we need to find where the WF stores are located.  In the absence of a list of store locations, we may need to venture forth onto the web to get one ourselves.  We may need to scrape.

Scraping is essentially the process of automatically gathering text or data from a web page.  Depending on the site, this can be very easy or very difficult.

In the case of WF, their store location page looks simple enough (http://www.wholefoodsmarket.com/stores/list/state).

**Note: If you are using Anaconda, you probably have all the required packages.  If not, you will need to `pip install` `requests`, `beautifulsoup4`, and `pandas`.**

## Initialization and imports

In [1]:
import requests # Used for getting HTML
from bs4 import BeautifulSoup # Used for parsing HTML
import pandas as pd # Used to organize parsed data

BASE_URL = "http://www.wholefoodsmarket.com/stores/list/state"
PAGE_URL = "http://www.wholefoodsmarket.com/stores/list/state?field_postal_address_administrative_area=&page=1"

## Get a single URL

Here we will simply request one page and look at the response that we get back.

In [None]:
response = requests.get(PAGE_URL)
#print(response.url)
print(response.text)

This is the plain HTML from the page.  If we look through it carefully, we can see some things that look like addresses.  Ideally we want to extract each of those addresses into a convenient data structure.  In short, we want to parse this HTML into something useful.

Conveniently, this HTML is well structured: each part of an address sits in its own tag with a descriptive class.  There are tags such as:

`<div class="views-field views-field-field-postal-address">`

`<div class="thoroughfare">450 Rhode Island St</div>`

`<span class="locality">San Francisco</span>`

`<span class="state">CA</span>`

`<span class="postal-code">94107</span>`

One way to extract these values is with regular expressions, but those tend to be brittle and hard to maintain against real HTML. Instead we can use `Beautiful Soup`, which is designed with this task in mind. In the following section we will pass the HTML to BeautifulSoup and use its pretty-printing feature to get a better view.
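To make the trade-off concrete, here is a minimal sketch of the regular-expression approach, run on a fragment shaped like the page's address markup. The `grab` helper is hypothetical and uses only the standard `re` module. It works on this exact fragment, but the pattern assumes double-quoted attributes and a single class per tag, so a small change in the markup makes it come back empty.

```python
import re

# Fragment shaped like the address markup on the store list page.
html = (
    '<div class="street-block"><div class="thoroughfare">225 Lincoln Blvd.</div></div>'
    '<span class="locality">Venice</span>, <span class="state">CA</span> '
    '<span class="postal-code">90291</span>'
)

def grab(css_class, text):
    """Return the text of the first div/span whose class attribute equals css_class."""
    # Assumes class="..." is the only attribute and is written with double quotes.
    pattern = r'<(div|span) class="{}">(.*?)</\1>'.format(re.escape(css_class))
    match = re.search(pattern, text)
    return match.group(2) if match else None

street = grab("thoroughfare", html)
city = grab("locality", html)
zip_code = grab("postal-code", html)
```

Swapping the double quotes for single quotes, as in `grab("locality", html.replace('"', "'"))`, already returns `None`, which is the kind of fragility an HTML parser avoids.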

In [None]:
soup = BeautifulSoup(response.text, "html.parser") # Name the parser explicitly
print(soup.prettify())

It looks like the information we want is within a div tag whose class is `"views-field views-field-field-postal-address"`.  We will ask BeautifulSoup to find the first occurrence of this tag and return everything inside it.

In [28]:
address_class = "views-field views-field-field-postal-address"
parent_div = soup.find('div', attrs={'class': address_class}) #Find (at most) *one*
print(parent_div)

<div class="views-field views-field-field-postal-address"> <div class="field-content"><div class="street-block"><div class="thoroughfare">225 Lincoln Blvd.</div></div><div class="addressfield-container-inline locality-block country-US"><span class="locality">Venice</span>, <span class="state">CA</span> <span class="postal-code">90291</span></div><span class="country">United States</span></div> </div>


Yep, that looks like all the info we need for an address, so now we will parse out the individual components.

In [27]:
street_class = "thoroughfare"
city_class = "locality"
state_class = "state"
zip_class = "postal-code"

address = {}

street = parent_div.find("div", street_class)
address["street"] = street.text

city = parent_div.find("span", city_class)
address["city"] = city.text

state = parent_div.find("span", state_class)
address["state"] = state.text

zip_code = parent_div.find("span", zip_class)
address["zip code"] = zip_code.text

print(address)

{'city': u'Venice', 'state': u'CA', 'street': u'225 Lincoln Blvd.', 'zip code': u'90291'}
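The four lookups above follow the same pattern, so they could be folded into one helper. The sketch below is one possible way to do that, not part of the original notebook: `FIELDS` and `parse_address` are hypothetical names, and the sample markup is a trimmed copy of the div printed earlier. It also guards against `find` returning `None`, which would otherwise raise an `AttributeError` on `.text` if a store were missing a field.

```python
from bs4 import BeautifulSoup

# Maps the dict keys used in this tutorial to (tag name, CSS class).
FIELDS = {
    "street": ("div", "thoroughfare"),
    "city": ("span", "locality"),
    "state": ("span", "state"),
    "zip code": ("span", "postal-code"),
}

def parse_address(store):
    """Build an address dict from one postal-address div (hypothetical helper)."""
    address = {}
    for key, (tag_name, css_class) in FIELDS.items():
        tag = store.find(tag_name, css_class)
        # find() returns None when nothing matches, so check before reading text.
        address[key] = tag.get_text(strip=True) if tag is not None else None
    return address

# Trimmed copy of the div shown above, used here so the sketch runs offline.
sample = (
    '<div class="views-field views-field-field-postal-address">'
    '<div class="thoroughfare">225 Lincoln Blvd.</div>'
    '<span class="locality">Venice</span>, <span class="state">CA</span> '
    '<span class="postal-code">90291</span></div>'
)
store = BeautifulSoup(sample, "html.parser").find("div", "views-field-field-postal-address")
address = parse_address(store)
```

With a helper like this, a loop over many stores reduces to calling `parse_address(store)` once per div.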


Great, that gives us one address.  Now we just have to repeat this for every address element on the page.

In [40]:
address_list = []

for store in soup.find_all("div", address_class):
    address = {}
    
    street = store.find("div", street_class)
    address["street"] = street.text

    city = store.find("span", city_class)
    address["city"] = city.text

    state = store.find("span", state_class)
    address["state"] = state.text

    zip_code = store.find("span", zip_class)
    address["zip code"] = zip_code.text

    print(address)
    address_list.append(address)

{'city': u'Venice', 'state': u'CA', 'street': u'225 Lincoln Blvd.', 'zip code': u'90291'}
{'city': u'Thousand Oaks', 'state': u'CA', 'street': u'740 North Moorpark Rd', 'zip code': u'91360'}
{'city': u'San Francisco', 'state': u'CA', 'street': u'1765 California St', 'zip code': u'94109'}
{'city': u'Walnut Creek', 'state': u'CA', 'street': u'1333 Newell Ave', 'zip code': u'94596'}
{'city': u'Napa', 'state': u'CA', 'street': u'3682 Bel Aire Plaza', 'zip code': u'94558'}
{'city': u'West Hollywood', 'state': u'CA', 'street': u'7871 Santa Monica Blvd', 'zip code': u'90046'}
{'city': u'San Ramon', 'state': u'CA', 'street': u'100 Sunset Drive', 'zip code': u'94583'}
{'city': u'Novato', 'state': u'CA', 'street': u'790 De Long Avenue', 'zip code': u'94945-7005'}
{'city': u'Santa Monica', 'state': u'CA', 'street': u'2201 Wilshire Blvd', 'zip code': u'90403'}
{'city': u'San Mateo', 'state': u'CA', 'street': u'1010 Park Place', 'zip code': u'94403'}
{'city': u'Santa Clarita', 'state': u'CA', 'stre
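Note that the output mixes plain five-digit ZIP codes with a ZIP+4 value (`94945-7005` for Novato). If the later demographic lookup is keyed on five-digit ZIP codes, which is an assumption about that step, a small normalizing function along these lines may help. `zip5` is a hypothetical name.

```python
def zip5(zip_code):
    """Return the five-digit part of a ZIP or ZIP+4 string (hypothetical helper)."""
    # "94945-7005" -> "94945"; a plain "90291" is returned unchanged.
    return zip_code.strip().split("-")[0]

novato = zip5(u"94945-7005")
venice = zip5(u"90291")
```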

## What's next? All the URLs

In [45]:
PAGE_URL = "http://www.wholefoodsmarket.com/stores/list/state?field_postal_address_administrative_area=&page="
page_num = 1
max_pages = 3

address_class = "views-field views-field-field-postal-address"
street_class = "thoroughfare"
city_class = "locality"
state_class = "state"
zip_class = "postal-code"
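Instead of concatenating the page number onto a long string, the URL could also be assembled from `BASE_URL` and a list of query parameters. This is a sketch in Python 3 syntax (in Python 2, `urlencode` lives in `urllib`). `page_url` is a hypothetical helper, and the `state` argument only assumes that the empty `field_postal_address_administrative_area` parameter is a state filter; the tutorial always leaves it blank.

```python
from urllib.parse import urlencode

BASE_URL = "http://www.wholefoodsmarket.com/stores/list/state"

def page_url(page_num, state=""):
    """Build the store-list URL for one results page (hypothetical helper)."""
    params = [
        # Left empty in the tutorial; assumed to filter results by state.
        ("field_postal_address_administrative_area", state),
        ("page", page_num),
    ]
    return BASE_URL + "?" + urlencode(params)

first_page = page_url(1)
```

For page 1 this reproduces the `PAGE_URL` string defined in the first cell.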



In [46]:
address_list = []
for page_num in range(1, max_pages + 1):
    response = requests.get(PAGE_URL + str(page_num))
    soup = BeautifulSoup(response.text, "html.parser")

    for store in soup.find_all("div", address_class):
        address = {}

        street = store.find("div", street_class)
        address["street"] = street.text

        city = store.find("span", city_class)
        address["city"] = city.text

        state = store.find("span", state_class)
        address["state"] = state.text

        zip_code = store.find("span", zip_class)
        address["zip code"] = zip_code.text

        address_list.append(address)
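The loop above stops after a hard-coded `max_pages = 3`. One possible variation, sketched below, keeps going until a page yields no addresses and pauses between requests. It rests on two assumptions that are not verified here: that a page past the end of the list simply contains no address divs, and that page numbering starts where the tutorial starts it (some sites number their first page 0). `scrape_all` and `fetch_page` are hypothetical names; `fetch_page` stands in for "request one page and parse its addresses".

```python
import time

def scrape_all(fetch_page, first_page=1, max_pages=50, delay=1.0):
    """Collect addresses page by page until a page yields none.

    fetch_page is a hypothetical callable: page number -> list of address dicts.
    """
    results = []
    for page_num in range(first_page, first_page + max_pages):
        addresses = fetch_page(page_num)
        if not addresses:
            break  # Assumed signal that we ran past the last page
        results.extend(addresses)
        time.sleep(delay)  # Pause between requests to go easy on the server
    return results

# Offline stand-in for the real site: two pages of results, then nothing.
fake_site = {1: [{"city": "Venice"}], 2: [{"city": "Napa"}, {"city": "Novato"}]}
collected = scrape_all(lambda n: fake_site.get(n, []), delay=0)
```

`max_pages` remains as an upper bound so that a site that never returns an empty page cannot keep the loop running indefinitely.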

In [48]:
print(len(address_list))

60
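`pandas` was imported at the top to organize the parsed data, and a list of dicts like `address_list` maps directly onto a `DataFrame`. The sketch below uses two of the addresses printed earlier in place of the full list so that it runs without network access. The column order and the per-state count are illustrative choices, not part of the original notebook.

```python
import pandas as pd

# Two rows copied from the output above, standing in for the full address_list.
address_list = [
    {"street": u"225 Lincoln Blvd.", "city": u"Venice", "state": u"CA", "zip code": u"90291"},
    {"street": u"790 De Long Avenue", "city": u"Novato", "state": u"CA", "zip code": u"94945-7005"},
]

# Each dict becomes one row; `columns` fixes the column order.
df = pd.DataFrame(address_list, columns=["street", "city", "state", "zip code"])

# Number of stores per state, as a quick sanity check on the scrape.
stores_per_state = df.groupby("state").size()
```

From here, `df.to_csv(...)` would save the table for the demographic lookup.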
