# Starbucks in Manhattan?

Our objective today is to map out and explore every Starbucks location in Manhattan. For the purposes of analysis, we 
will then use that list to compute the average distance-to-Starbucks in Manhattan.

## First, some background

There are a few ways to get data off of the web: OCR, web scraping, APIs, and data stores are all valid approaches which are variously necessary for different specific problems. For our problem we will be using an API---specifically, the Yelp! API. We need a list of coordinates of chain store locations in Manhattan, and---lo and behold---Yelp! provides us with just such a list! We just have to figure out how to get it out of their hands into ours.

For brevity, we are actually skipping one of the hardest parts of a data science project: figuring out what tools to use to approach the problem. Until you become a pro (and often even then) you will rarely know up front what you should do to answer a particular question, and will instead have to research your options and pick the best one (if one is available at all!).

## Before you begin

If you want to bring your own computer to the workshop (**this is recommended**), you will need to do the following in order to get everything we need for this workshop set up:

1. [Register for an account on Yelp!](https://www.yelp.com/signup).
2. Every Yelp! account is able to generate and use an API key, which needs to be passed to the website with every query, for security/monetization/rate limiting purposes. [Generate your own API key here](https://www.yelp.com/developers/manage_api_keys). You'll need this information during the workshop!
3. Download and install the [Anaconda Python 3.5 distribution](https://www.continuum.io/downloads). Note that this is a fairly sizable download: Anaconda is a widely used data science distribution and includes stable versions of most of the heavyweight tools that you will need.
4. Verify that your installation worked: open `Terminal` or `Command Console`. If you've never done so before, Google for instructions for your OS; it shouldn't be hard. Got it open? Good; now type in `python`. If all is well, this should open the Python interpreter: to make sure you installed the right version of Python, type in `print("Hello world!")` and make sure that doesn't raise an error message. Once done, get out with `exit()`.
5. Type (copy-paste) into the console: `pip install yelp folium requests geojson geopy`. This will install all of the listed libraries--more on what a library is later--that we will use for this project.

If you are not planning on bringing your own computer, the computers in the development room will have these packages pre-installed and ready to go.

## Super-basic Python

This workshop is a beginner's Python workshop, but it has a specific goal in mind. Thus my focus is on providing a summary of the Python necessary for data analysis development and nothing more. If you're serious about this domain you will need to spend a lot of time learning Python on your own as well!

Obligatory print statement.

In [None]:
print("Hello World!")

Print statement in a function.

In [None]:
def print_stuff():
    print("Hello World!")

Now we can execute it.

In [None]:
print_stuff()

Functions can take arguments.

In [None]:
def print_better_stuff(arg):
    print(arg)

In [None]:
print_better_stuff("Better stuff right here!")

When an argument is declared this way it is mandatory: the function call will fail otherwise.

In [None]:
print_better_stuff()

Python has a facility for optional parameters.

In [None]:
def print_even_better_stuff(stuff="Hello World!"):
    # Print the argument; it falls back to the default when none is passed.
    print(stuff)

In [None]:
print_even_better_stuff()

In [None]:
print_even_better_stuff(stuff="Goodbye World!")

Print statements are great--you will use them to debug your code forever after. But let's move on to functions which actually do stuff to things, and then return them. This is the bread and butter of programming. In Python this is handled by a `return` statement.

In [None]:
def print_this_thing(stuff_to_print):
    return "There, I gave you " + stuff_to_print + ". Happy?"

In this case we are building a `string`. One of the simplest data types, a string is just that: a string of characters.

In [None]:
print_this_thing("50 bucks")

You can tell this is a string because it has single quotes (`'`) around it. You can also have double-quoted strings; the difference is only a minor technicality:

In [None]:
"Also a string."

Other data types are integers:

In [None]:
1 + 1

And floats (for "floating point").

In [None]:
3.14 + 42

All of these data types represent instances of `objects`. We can bind objects to names so that we can use them to do useful stuff.

In [None]:
life = 1
death = -1
life + death # 1 + -1

Functions which `return` something return objects.

In [None]:
def return_42():
    return 42

life = return_42()
death = -return_42() # Hey! We took a negative!
life + death

Python is an object-oriented language, as is almost every programming language of note today, so in addition to these simple objects - strings, floats, etc. - we can also define our own, more complicated objects. Here's how to create one:

In [None]:
class Life_Universe_Everything:
    
    def __init__(self):
        self.answer = 42
    
    def answer_question(self):
        return self.answer

In [None]:
question = Life_Universe_Everything()

There's a lot going on here - but in the interest of time we'll gloss over the details, since we'll only be using pre-existing objects in this project, not defining our own. Instead I want to point out two things:

1. To run an object method, do `object.method()`.
2. To access an object parameter, do `object.parameter`.

Since our `answer_question()` method is pretty silly, we can answer this question in two ways:

In [None]:
question.answer_question()

In [None]:
question.answer

These kinds of structures are especially important in data science, where robust error handling is super critical.

There are two kinds of basic data structures that we will use today: `lists` and `dicts`. Here's a list:

In [None]:
silly_list = [1, 2, 3, 4, "Elmo wants to count some more!"]
print(silly_list)

To index a list you use an index. **Note that indexes in programming start at 0, not 1**. See:

In [None]:
print(silly_list[0])
print(silly_list[1])

If you try to get an item that is outside of the bounds of the list, you get an explosion:

In [None]:
print(silly_list[5])

The other data structure is the dict. A dict stores information by name.

In [None]:
silly_dict   = {"One": 1,
                "Two": 2,
                "Three": 3,
                "Four": 4,
                "Five": 5
               }

To index these we have to call what we want by name.

In [None]:
silly_dict["One"]

We can provide control conditions using a so-called `if-else` block.

In [None]:
def count_to_four(name):
    if name == 'Elmo':
        print("One, two, three, four, I want to count some more!")
    else:
        print("Are you questioning my intelligence?")

In [None]:
count_to_four("Elmo")

In [None]:
count_to_four("You")

`==` here is a comparison operator. It checks if the `name` argument that we pass really is `Elmo`. `If` it is, do this, or `else` do that.

Loops are useful for more complex things. There are two types: `while` loops and `for` loops. A `while` loop executes as long as a condition holds, whereas a `for` loop iterates through a sequence such as a list.
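
Since loops show up again in the core code later on, here is a minimal sketch of both kinds. The names `countdown` and `squares` are made up for this illustration.

```python
# A `while` loop keeps running for as long as its condition holds.
countdown = 3
while countdown > 0:
    print(countdown)
    countdown -= 1

# A `for` loop visits each item of a list in turn.
squares = []
for number in [1, 2, 3, 4]:
    squares.append(number * number)
print(squares)
```

The `while` loop stops once `countdown` reaches zero; the `for` loop stops by itself when the list runs out.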

Most programming languages implement so-called `try-catch` blocks to help with handling errors. Python is no exception, although here the keywords are `try` and `except`. Here's an example in action:

In [None]:
try:
    print(1 + "Dagnabit!")
except TypeError:
    print("You can't add an integer and a string! Like what?")

You can also `raise` your own errors.

In [None]:
raise OSError("Most troubling, Master Bruce.")

Why reinvent the wheel? A lot of libraries (packages) exist out there. At a minimum they solve many of the problems that you will encounter; at best they open up whole new worlds to explore.

To get these packages yourself you need to `pip install` them in the command console. We did that a bunch of times for the stuff that we need for this project at the beginning of this session (or at home beforehand, even better!). Then once they're available, you can `import` them so that you can use them.

Here's what happens when your package is not available:

In [None]:
import pseudoscorpion

Of course when it is available, nothing happens - it just works.

In [None]:
import os

There's one more semantic thing that we need to pay attention to. When importing a library you can choose instead to import a class or module from that library. There are all sorts of constructions that you can use for this, but for our purposes you'll only see three:

* `import library`. Then, if we want the `Book` object in our code we'll need `library.Book`.
* `from library import Book`. Then if we want the `Book` object in our code we'll need the `Book` object, that's it.
* `import library as lib`. Then if we want the `Book` object we'll need `lib.Book`. Nothing major.
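
The three forms above can be tried out with `math`, a module that ships with Python. This is just a sketch: `math` and `sqrt` stand in for the made-up `library` and `Book`.

```python
# Form 1: import the whole library, then reach into it by name.
import math
print(math.sqrt(16))

# Form 2: import just the one name we want.
from math import sqrt
print(sqrt(16))

# Form 3: import the library under a shorter alias.
import math as m
print(m.sqrt(16))
```

All three lines call the very same function; only the name you type differs.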

That concludes Python in 20 minutes! I'll talk about resources for learning more Python---and *really* learning it, on the level necessary to operate as a data scientist---at the end of the presentation.

## Some boilerplate

Let's import a giant pile of stuff! We'll talk about each of these in turn over the course of the presentation.

In [None]:
from yelp.client import Client
from yelp.oauth1_authenticator import Oauth1Authenticator
from yelp.errors import BusinessUnavailable
import os
import json
from pandas import DataFrame
import folium
import geojson
import random
import requests
import numpy as np
from geopy.distance import vincenty
import bokeh.plotting as plt

Let's authenticate with Yelp!.

Here's the method we will use to import our credentials:

In [None]:
def import_credentials(filename='yelp_credentials.json'):
    """
    Finds the credentials file describing the token that's needed to access Yelp services.

    :param filename -- The filename at which Yelp service credentials are stored. Defaults to
    `yelp_credentials.json`.
    """
    if os.path.isfile(filename):
        # Use a context manager so the file is closed once it has been read.
        with open(filename) as f:
            return json.load(f)
    else:
        raise IOError('This API requires Yelp credentials to work. Did you forget to define them?')
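
To make the idea concrete, here is a hedged sketch of what such a credentials file might contain and how it reads back. The key names are an assumption: they simply mirror the keyword arguments passed to the authenticator in the next cell, and the values are placeholders rather than real keys.

```python
import json
import os
import tempfile

# Hypothetical file contents: key names mirror the authenticator's keyword
# arguments; the "???" values are placeholders, not real credentials.
example_credentials = {
    "consumer_key": "???",
    "consumer_secret": "???",
    "token": "???",
    "token_secret": "???",
}

# Write the file into a temporary folder, then read it back with json.load,
# which is the same call import_credentials() relies on.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "yelp_credentials.json")
    with open(path, "w") as f:
        json.dump(example_credentials, f)
    with open(path) as f:
        loaded = json.load(f)

print(sorted(loaded))
```

Keeping keys in a file like this, instead of pasting them into the notebook, makes it harder to share them by accident.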

At this point you need to copy those credentials you have open in another tab (the ones on Yelp! - to reopen them jump to the top of this notebook and read the setup instructions again to get the link). Go into the following code and replace those `???` with the proper information from that page!

In [None]:
auth = Oauth1Authenticator(
    consumer_key='???',
    consumer_secret='???',
    token='???',
    token_secret='???'
)

client = Client(auth)

What is this actually doing? I'll talk briefly about it.

## The core code

**Now here comes the most important bit in this whole notebook. We're going to walk through this method line-by-line**.

In [None]:
def fetch_businesses(name, area='New York'):
    """
    Fetches all yelp.obj.business_response.BusinessResponse objects for incidences of the given chain in Manhattan.
    Constructs Yelp business ids for incidences of the chain in the area, then queries Yelp to check if they
    exist.
    IDs are constructed name-location-number, so we just have to check numbers in ascending order until it breaks.
    e.g. http://www.yelp.com/biz/gregorys-coffee-new-york-18 is good.
         http://www.yelp.com/biz/gregorys-coffee-new-york-200 is not.
    Then we do reverse GIS searches using the business ID through the Yelp API and extract coordinates from the results.
    Some technical notes:
    1.  The first incidence of any store in the area is reported without any numeral.
        e.g. "dunkin-donuts-new-york", not "dunkin-donuts-new-york-1".
        Numbers pick up from there: the next location will be "dunkin-donuts-new-york-2".
    2.  Yelp IDs are unique and are not reassigned when a location is closed.
        Thus we need to check for and exclude closed locations when munging the data.
    3.  Places with a single instance in Manhattan sometimes have a "name-place-2" that redirects to their only location.
        At least this seems to be the case with Bibble & Sip...
        This method does not correct for that, so such a place shows up twice in the output.
    4.  Sometimes IDs are given to locations that don't actually exist.
        e.g. the best-buy-3 id points to a non-existent storefront.
        But best-buy-4, best-buy-5, and so on actually exist!
        Yelp acknowledges this fact, but still returns a BusinessUnavailable error when queried.
        This method sends a web request, checks the response, and terminates on a 404, which has proven to be a reliable
        way of circumventing this issue.
    5.  In case the above doesn't work...
        The manual_override parameter forces the fetcher to keep moving past this error
        (note: it is not part of the signature shown here).
        For debugging purposes, this method prints a URL for the purposes of manually checking breakpoints.
        That way you can incrementally run fetch() and then comb over trouble spots you find by moving up manual_override.
        If you hit that URL and you get either a valid ID or an invalid but existing ID, you need to bump up manual_override
        to correct it and rerun the fetch.
        If you hit that URL and you get a 404 page then you're done!
        e.g. In the Best Buy case both best-buy-3 and best-buy-10 are phantoms.
        But once we set manual_override=10 we're good, and get all of the actual storefronts.
    """
    # Normalize the inputs into the lowercase, hyphenated form that Yelp IDs use.
    area = area.lower().replace(' ', '-')
    name = name.lower().replace(' ', '-')
    i = 2
    # Run the first one through by hand.
    try:
        responses = [client.get_business("{0}-{1}".format(name, area))]
    # This can happen, and did, in the Dunkin' Donuts case.
    except BusinessUnavailable:
        responses = []
    # The rest are handled by a loop.
    while True:
        bus_id = "{0}-{1}-{2}".format(name, area, i)
        try:
            response = client.get_business(bus_id)
        except BusinessUnavailable:
            # We manually check trouble spots.
            # But see the TODO.
            if requests.get('http://www.yelp.com/biz/' + bus_id).status_code != requests.codes.ok:
                break
            else:
                # Increment the counter but don't include the troubled ID.
                i += 1
                continue
        responses += [response]
        i += 1
    print("Ended `fetch_businesses()` on:", "http://www.yelp.com/biz/" + bus_id)
    return responses     

*Long explainy part in words, not in text*
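
As a rough illustration of the ID scheme described in the docstring, the sketch below builds the first few candidate IDs without touching the network. `candidate_ids` is a hypothetical helper written for this example; it is not part of the Yelp library or of the method above.

```python
def candidate_ids(name, area="New York", count=4):
    """Builds the first `count` Yelp-style business IDs for a chain (illustrative only)."""
    slug = "{0}-{1}".format(name.lower().replace(" ", "-"),
                            area.lower().replace(" ", "-"))
    # The first location carries no number; numbering picks up at 2 after that.
    ids = [slug]
    for i in range(2, count + 1):
        ids.append("{0}-{1}".format(slug, i))
    return ids


print(candidate_ids("Gregorys Coffee"))
```

`fetch_businesses()` does the same thing one ID at a time, asking Yelp about each one and stopping once an ID turns out not to exist at all.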

Let's try running this method on an example business. What do we get back?

In [None]:
bas = fetch_businesses('Bibble and Sip')
bas

We don't need all of the nuts and bolts of the business entity, though all of that is present in these objects. We need just one piece of information: their coordinates. After reading some of the library's documentation (omitted) you will find that to get that you need a somewhat hideously long construction:

In [None]:
bas[0]

In [None]:
bas[0].business.location.coordinate.latitude

In [None]:
[bas[0].business.location.coordinate.latitude, bas[0].business.location.coordinate.longitude]

It works! We can move on to our next chunky method:

In [None]:
def frame(responses):
    """
    Given a list of yelp.obj.business_response.BusinessResponse objects like the one returned by fetch_businesses(),
    builds a coordinate-logging DataFrame out of them.
    """
    latitudes = [response.business.location.coordinate.latitude for response in responses]
    longitudes = [response.business.location.coordinate.longitude for response in responses]
    df = DataFrame({'latitude': latitudes, 'longitude': longitudes})
    df.index.name = responses[0].business.name
    return df

*Explainy explainy explainy.*

Ok so what do we get?

In [None]:
frame(bas)

These two locations are in fact one location!

Data science is all about **edge cases**. This problem doesn't apply to chain stores - which is what we will be dealing with - so we can actually, yes, *ignore it*. Wow. Such duct tape.

DataFrames like this one are the interpretive core of the Python data science stack, and you'll get very familiar with them very quickly as you go along.

Next we'll write the Folium method we need for visualization. Our DataFrame serves as an intermediary!

In [None]:
def map_coordinates(df):
    """
    Returns a folium map of all of the coordinates stored in a coordinate DataFrame, like the one returned by frame().
    """
    ret = folium.Map(location=[40.753889, -73.983611], zoom_start=11)
    for row in df.iterrows():
        ret.simple_marker([row[1]['latitude'], row[1]['longitude']])
    return ret

In [None]:
bas_df = frame(bas)
map_coordinates(bas_df)

Yes it really is that easy!!!

Let's run this on something a little harder. You'll notice that this query takes a little while to run. Remember that in the background we're sending requests to the web and reading the responses: this takes time.

In [None]:
map_coordinates(frame(fetch_businesses('Gregorys Coffee')))

Still, it executes reasonably fast. The following query we're going to have to leave running for a while.

In [None]:
starbucks = fetch_businesses('Starbucks')
map_coordinates(frame(starbucks))

## Part II: Measuring average distance

Now that we've got our fancy Starbucks map we can jump into our true end goal: measuring distances. Hold on to your seatbelts, because this is going to be a wild ride!

*Explain methodology and approach*

Ok, let's look at the actual code.

To start with, just take a quick look at [geojson.io](http://geojson.io).

Ok, so let's assume I've generated this shapefile - [here is what my attempt came out to be](https://github.com/ResidentMario/chain-incidence/blob/master/manhattan.geojson). I then saved this file locally - if you followed the instructions at the beginning of this workshop, you will have it!

These first two methods are for loading the shapefile and dumping the result into a list of coordinates.

In [None]:
def load_geojson(filename="manhattan.geojson"):
    """
    Returns a geojson object for the given file.
    """
    with open(filename) as f:
        dat = f.read()
        obj = geojson.loads(dat)
    return obj


def load_coordinates(name):
    """
    Loads Manhattan.
    What else?
    Are you surprised?
    """
    # Encode according to our storage scheme.
    filename = name.lower().replace(' ', '_') + '.geojson'
    obj = load_geojson(filename)
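    if obj['type'] == 'FeatureCollection':
        ret = []
        for feature in obj['features']:
            ret += list(geojson.utils.coords(feature))
    elif obj['type'] == 'Feature':
        ret = list(geojson.utils.coords(obj))
    else:
        # Without this branch, `ret` would be undefined for any other GeoJSON type.
        raise ValueError("Expected a Feature or FeatureCollection, got " + str(obj['type']))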
    # GeoJSON stores coordinates [Longitude, Latitude] -- the "modern" format.
    # For historical reasons, coordinates are usually represented in the format [Latitude, Longitude].
    # And this is indeed the format that the rest of the libraries used for this project expect.
    # So we need to swap the two: [Longitude, Latitude] -> [Latitude, Longitude]
    ret = [(coord[1], coord[0]) for coord in ret]
    return ret
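
The coordinate swap is easy to get backwards, so here is a small sketch of it using only the standard library. The square below is made up for illustration and is not the Manhattan boundary; unlike `load_coordinates()`, this sketch digs into the polygon by hand rather than through the `geojson` library.

```python
import json

# A tiny hand-written GeoJSON Feature (a made-up square, not real data).
feature = json.loads("""
{"type": "Feature",
 "properties": {},
 "geometry": {"type": "Polygon",
              "coordinates": [[[-73.99, 40.75], [-73.98, 40.75],
                               [-73.98, 40.76], [-73.99, 40.76],
                               [-73.99, 40.75]]]}}
""")

# A GeoJSON polygon holds a list of rings; the first ring is the outer boundary.
outer_ring = feature["geometry"]["coordinates"][0]

# Swap [longitude, latitude] -> (latitude, longitude).
lat_long = [(lon_lat[1], lon_lat[0]) for lon_lat in outer_ring]
print(lat_long[0])
```

A quick way to check the order for New York: latitudes are around 40 and longitudes are around -74.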

Next up, a highly analytical point-in-polygon algorithm. This one we can treat as a black box - I never put in the time to figure out exactly how it works, and instead just grabbed it from elsewhere on the Internet.

In [None]:
def point_inside_polygon(x,y,poly):
    """
    Checks if a point is inside a polygon.
    Used to validate points as being inside of Manhattan.
    Borrowed from: http://www.ariel.com.au/a/python-point-int-poly.html
    
    The shapely library provides features for this and other things besides, but is too much to deal with at the moment.
    """

    n = len(poly)
    inside = False

    p1x,p1y = poly[0]
    for i in range(n+1):
        p2x,p2y = poly[i % n]
        if y > min(p1y,p2y):
            if y <= max(p1y,p2y):
                if x <= max(p1x,p2x):
                    if p1y != p2y:
                        xints = (y-p1y)*(p2x-p1x)/(p2y-p1y)+p1x
                    if p1x == p2x or x <= xints:
                        inside = not inside
        p1x,p1y = p2x,p2y

    return inside
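
For anyone who wants to peek inside the black box, the sketch below is a compact variant of the same general idea (ray casting with the even-odd rule). It is not the borrowed function above, and `crossing_test` and `square` are names made up for this example.

```python
def crossing_test(x, y, poly):
    """Even-odd rule: flip `inside` each time a ray from (x, y) crosses an edge."""
    inside = False
    n = len(poly)
    for i in range(n):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % n]
        # Does this edge straddle the horizontal line through y?
        if (y1 > y) != (y2 > y):
            # x-coordinate at which the edge meets that horizontal line.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


square = [(0, 0), (0, 2), (2, 2), (2, 0)]
print(crossing_test(1, 1, square))
print(crossing_test(3, 1, square))
```

A point inside the shape crosses the boundary an odd number of times on its way out; a point outside crosses it an even number of times (possibly zero).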

So we have a list of coordinates which are boundaries of Manhattan and a method for checking if some random point is located in that shape. Here are the methods I use for generating a list of points which satisfy our conditions. I'll walk through these bits line-by-line.

In [None]:
def generate_sample_points(coordinate_list, n=1000):
    """
    Generates n uniformly distributed sample points within the given coordinate list.
    
    When the geometry is sufficiently complex and the list of points large this query can take a while to process.
    """
    lats = [coords[0] for coords in coordinate_list]
    longs = [coords[1] for coords in coordinate_list]
    max_lat = max(lats)
    min_lat = min(lats)
    max_long = max(longs)
    min_long = min(longs)
    ret = []
    # Rejection sampling: draw points in the bounding box, keep those inside the shape.
    # Stop at exactly n points (the original `>` check returned n + 1).
    while len(ret) < n:
        p_lat = random.uniform(min_lat, max_lat)
        p_long = random.uniform(min_long, max_long)
        if point_inside_polygon(p_lat, p_long, coordinate_list):
            ret.append((p_lat, p_long))
    return ret


def sample_points(search_location, n=10000):
    """
    Given the name of the location being searched, returns n uniformly distributed points within that location.
    
    Wraps the above.
    """
    return generate_sample_points(load_coordinates(search_location), n)

Let's take a look at what we come up with!

In [None]:
manhattan_point_cloud = sample_points("Manhattan", n=2000)

The following is the format needed to get a Bokeh rendering of our point cloud.

In [None]:
plt.output_notebook(hide_banner=True)

p = plt.figure(height=500,
                width=960,
                title="Manhattan Point Cloud",
                x_axis_label="Latitude",
                y_axis_label="Longitude"
               )

p.scatter(
    [coord[0] for coord in manhattan_point_cloud],
    [coord[1] for coord in manhattan_point_cloud]
)

plt.show(p)

Finally, the methods that compress all of this splurge down to just the one number we want: average distance! Again let's walk through this line-by-line.

In [None]:
def get_minimum_distance(coordinate, coordinate_list):
    """
    Naively calculates the minimum distance, in miles, from the given coordinate to any point in the coordinate list.
    """
    # Start from infinity so that any real distance is an improvement.
    best_distance = float('inf')
    for candidate_coord in coordinate_list:
        dist = vincenty(coordinate, candidate_coord).miles
        if dist < best_distance:
            best_distance = dist
    return best_distance


def average_distance(chain_name, search_location, point_cloud):
    """
    This is the main function of this notebook!
    Takes the name of the chain in question and the point cloud associated with the location
    for which we are computing average distance.
    Returns the average distance to that chain within that location.
    We ask for a point cloud and not the name of the location because it's more efficient to precompute an extremely large,
    essentially totally random point cloud, and then check against that, instead of recomputing it every round.
    Output is in feet!
    """
    # First load the coordinates corresponding to the search location.
    # (These are not used below: the point cloud already encodes the location.)
    location_coords = load_coordinates(search_location)
    # Now generate a list of the chains' locations.
    chain_df = frame(fetch_businesses(chain_name))
    chain_coords = list(zip(chain_df['latitude'], chain_df['longitude']))
    # Finally, get and average the minimum distances between the points in the point cloud and the chain locations.
    distances = [get_minimum_distance(point, chain_coords) for point in point_cloud]
    avg = sum(distances)/len(distances)
    avg_in_feet = int(5280*avg)
    return avg_in_feet
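
To see the averaging step in isolation, here is an offline sketch that runs on made-up coordinates. Note the swapped-in technique: it uses the haversine (great-circle) formula from the standard library instead of geopy's `vincenty`, so its numbers will differ slightly from the real method. `haversine_miles`, `stores` and `points` are all hypothetical names and values.

```python
import math


def haversine_miles(a, b):
    """Great-circle distance in miles between two (latitude, longitude) pairs."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 3958.8 * math.asin(math.sqrt(h))  # 3958.8 = mean Earth radius in miles


# Made-up store locations and sample points, purely for illustration.
stores = [(40.7500, -73.9900), (40.7800, -73.9600)]
points = [(40.7500, -73.9900), (40.7600, -73.9800), (40.7790, -73.9610)]

# For each sample point keep only the distance to its nearest store, then average.
nearest = [min(haversine_miles(p, s) for s in stores) for p in points]
average_feet = int(5280 * sum(nearest) / len(nearest))
print(average_feet)
```

The real `average_distance()` follows the same pattern, with thousands of random points in place of three and the Yelp results in place of the two invented stores.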

At last we are ready to generate our final answer!

In [None]:
avg = average_distance("Starbucks", "Manhattan", manhattan_point_cloud)
avg  # Display the result (in feet) as the cell's output.

That concludes this workshop!


**For next time**:

All of our future work is going to be in Python (though if you're already a pro in R you may not need it), so you really need to bootcamp the language. [This is the best introductory course in data science-targeted Python available](https://www.datacamp.com/courses/intro-to-python-for-data-science). Start doing stuff there!!!

Try to come up with a question like this one which you think you would be able to answer using tools from data science, and try to look up data sources that you can use to do that.