# Course 9 Capstone Project: Where to Put the Next Starbucks in Manhattan?

## Introduction/Business Problem: 

Starbucks' executives hired a data science team from IBM to find out where to put their next coffee shop in Manhattan.
Starbucks wants to know based upon their current locations and the concentration of their competition in Manhattan, New York, 
where might be the best place for their next coffee shop. 

Ideally, Starbucks wants to put a coffee shop directly in the middle of a spot with the most competitors to chip away at the market share of their competitors and capitalize on a known area of Manhattan that already has high foot traffic by coffee shop patrons/consumers (known high demand for coffee). 

## Data 

Data Needed - 3 Dataframes: 1) A Dataframe for all coffee shops in Manhattan 2) A Dataframe with Starbucks current locations 3) A Dataframe with all the coffee shops in Manhattan minus the Starbucks coffee shops (a Dataframe of just Starbucks competitors).

### Libraries Needed For Data Analysis 

In [1]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analsysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas.io.json import json_normalize # tranform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

print('Libraries imported.')

Fetching package metadata .............
Solving package specifications: .

Package plan for installation in environment /opt/conda/envs/DSX-Python35:

The following NEW packages will be INSTALLED:

    geographiclib: 1.49-py_0   conda-forge
    geopy:         1.18.1-py_0 conda-forge

geographiclib- 100% |################################| Time: 0:00:00  23.38 MB/s
geopy-1.18.1-p 100% |################################| Time: 0:00:00  35.91 MB/s
Fetching package metadata .............
Solving package specifications: .

Package plan for installation in environment /opt/conda/envs/DSX-Python35:

The following NEW packages will be INSTALLED:

    altair:  2.2.2-py35_1 conda-forge
    branca:  0.3.1-py_0   conda-forge
    folium:  0.5.0-py_0   conda-forge
    vincent: 0.4.4-py_1   conda-forge

altair-2.2.2-p 100% |################################| Time: 0:00:00  48.74 MB/s
branca-0.3.1-p 100% |################################| Time: 0:00:00  36.87 MB/s
vincent-0.4.4- 100% |###################

### Download the New York City Data

In [2]:
!wget -q -O 'newyork_data.json' https://cocl.us/new_york_dataset
print('Data downloaded!')

Data downloaded!


### Load the data 

In [3]:
with open('newyork_data.json') as json_data:
    newyork_data = json.load(json_data)

### Look at the data

In [4]:
newyork_data

{'bbox': [-74.2492599487305,
  40.5033187866211,
  -73.7061614990234,
  40.9105606079102],
 'crs': {'properties': {'name': 'urn:ogc:def:crs:EPSG::4326'}, 'type': 'name'},
 'features': [{'geometry': {'coordinates': [-73.84720052054902,
     40.89470517661],
    'type': 'Point'},
   'geometry_name': 'geom',
   'id': 'nyu_2451_34572.1',
   'properties': {'annoangle': 0.0,
    'annoline1': 'Wakefield',
    'annoline2': None,
    'annoline3': None,
    'bbox': [-73.84720052054902,
     40.89470517661,
     -73.84720052054902,
     40.89470517661],
    'borough': 'Bronx',
    'name': 'Wakefield',
    'stacked': 1},
   'type': 'Feature'},
  {'geometry': {'coordinates': [-73.82993910812398, 40.87429419303012],
    'type': 'Point'},
   'geometry_name': 'geom',
   'id': 'nyu_2451_34572.2',
   'properties': {'annoangle': 0.0,
    'annoline1': 'Co-op',
    'annoline2': 'City',
    'annoline3': None,
    'bbox': [-73.82993910812398,
     40.87429419303012,
     -73.82993910812398,
     40.874294193

### Create features key

In [5]:
neighborhoods_data = newyork_data['features']

### Transform the data into a pandas dataframe

In [6]:
# define the dataframe columns
column_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude'] 

# instantiate the dataframe
neighborhoods = pd.DataFrame(columns=column_names)

In [7]:
neighborhoods.head()

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude


### Loop through the data to fill in the dataframe

In [41]:
for data in neighborhoods_data:
    borough = neighborhood_name = data['properties']['borough'] 
    neighborhood_name = data['properties']['name']
        
    neighborhood_latlon = data['geometry']['coordinates']
    neighborhood_lat = neighborhood_latlon[1]
    neighborhood_lon = neighborhood_latlon[0]
    
    neighborhoods = neighborhoods.append({'Borough': borough,
                                          'Neighborhood': neighborhood_name,
                                          'Latitude': neighborhood_lat,
                                          'Longitude': neighborhood_lon}, ignore_index=True)

In [9]:
neighborhoods.head()

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,Bronx,Wakefield,40.894705,-73.847201
1,Bronx,Co-op City,40.874294,-73.829939
2,Bronx,Eastchester,40.887556,-73.827806
3,Bronx,Fieldston,40.895437,-73.905643
4,Bronx,Riverdale,40.890834,-73.912585


### Use geocoder to get the longitude and latitude of NYC

In [42]:
address = 'New York City, NY'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geograpical coordinate of New York City are {}, {}.'.format(latitude, longitude))

The geograpical coordinate of New York City are 40.7308619, -73.9871558.


### Slice the data to just include Manhattan

In [43]:
manhattan_data = neighborhoods[neighborhoods['Borough'] == 'Manhattan'].reset_index(drop=True)
manhattan_data.head()

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,Manhattan,Marble Hill,40.876551,-73.91066
1,Manhattan,Chinatown,40.715618,-73.994279
2,Manhattan,Washington Heights,40.851903,-73.9369
3,Manhattan,Inwood,40.867684,-73.92121
4,Manhattan,Hamilton Heights,40.823604,-73.949688


### Define Foursquare Credentials and Version

In [44]:
CLIENT_ID = 'W21OBLNNZY5YUVJCLZK123MQKNOG3FDJHLIK4P3KMT3LYMDI' # your Foursquare ID
CLIENT_SECRET = '0IOT2CUFZRZ4X4VBCB4IOJDNLNO5BFXLZMMH30EQ1UFDFXWZ' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentails:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentails:
CLIENT_ID: W21OBLNNZY5YUVJCLZK123MQKNOG3FDJHLIK4P3KMT3LYMDI
CLIENT_SECRET:0IOT2CUFZRZ4X4VBCB4IOJDNLNO5BFXLZMMH30EQ1UFDFXWZ


### Get latitude and longitude data for first neighborhood in Manhattan as an example

In [14]:
manhattan_data.loc[0, 'Neighborhood']

'Marble Hill'

In [45]:
neighborhood_latitude = manhattan_data.loc[0, 'Latitude'] # neighborhood latitude value
neighborhood_longitude = manhattan_data.loc[0, 'Longitude'] # neighborhood longitude value

neighborhood_name = manhattan_data.loc[0, 'Neighborhood'] # neighborhood name

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))

Latitude and longitude values of Marble Hill are 40.87655077879964, -73.91065965862981.


### Repeat this for all the neighborhoods in Manhattan with venues (limited to 1000 which should give us all)

In [46]:
LIMIT = 1000 # limit of number of venues returned by Foursquare API
radius = 500 # define radius
# create URL
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighborhood_latitude, 
    neighborhood_longitude, 
    radius, 
    LIMIT)
url # display URL

'https://api.foursquare.com/v2/venues/explore?&client_id=W21OBLNNZY5YUVJCLZK123MQKNOG3FDJHLIK4P3KMT3LYMDI&client_secret=0IOT2CUFZRZ4X4VBCB4IOJDNLNO5BFXLZMMH30EQ1UFDFXWZ&v=20180605&ll=40.87655077879964,-73.91065965862981&radius=500&limit=1000'

In [47]:
results = requests.get(url).json()
results

{'meta': {'code': 200, 'requestId': '5c52003e9fb6b767e739640e'},
 'response': {'groups': [{'items': [{'reasons': {'count': 0,
       'items': [{'reasonName': 'globalInteractionReason',
         'summary': 'This spot is popular',
         'type': 'general'}]},
      'referralId': 'e-0-4b4429abf964a52037f225e3-0',
      'venue': {'categories': [{'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/food/pizza_',
          'suffix': '.png'},
         'id': '4bf58dd8d48988d1ca941735',
         'name': 'Pizza Place',
         'pluralName': 'Pizza Places',
         'primary': True,
         'shortName': 'Pizza'}],
       'delivery': {'id': '72548',
        'provider': {'icon': {'name': '/delivery_provider_seamless_20180129.png',
          'prefix': 'https://fastly.4sqi.net/img/general/cap/',
          'sizes': [40, 50]},
         'name': 'seamless'},
        'url': 'https://www.seamless.com/menu/arturos-pizza-5189-broadway-ave-new-york/72548?affiliate=1131&utm_source=foursquare-affiliat

### Extracting the Categories for all Venues and Neighborhoods in Manhattan

In [48]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

In [49]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues =nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

Unnamed: 0,name,categories,lat,lng
0,Arturo's,Pizza Place,40.874412,-73.910271
1,Bikram Yoga,Yoga Studio,40.876844,-73.906204
2,Tibbett Diner,Diner,40.880404,-73.908937
3,Starbucks,Coffee Shop,40.877531,-73.905582
4,Land & Sea Restaurant,Seafood Restaurant,40.877885,-73.905873


In [50]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues =nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

Unnamed: 0,name,categories,lat,lng
0,Arturo's,Pizza Place,40.874412,-73.910271
1,Bikram Yoga,Yoga Studio,40.876844,-73.906204
2,Tibbett Diner,Diner,40.880404,-73.908937
3,Starbucks,Coffee Shop,40.877531,-73.905582
4,Land & Sea Restaurant,Seafood Restaurant,40.877885,-73.905873


In [52]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

In [53]:
manhattan_venues = getNearbyVenues(names=manhattan_data['Neighborhood'],
                                   latitudes=manhattan_data['Latitude'],
                                   longitudes=manhattan_data['Longitude']
                                  )


Marble Hill
Chinatown
Washington Heights
Inwood
Hamilton Heights
Manhattanville
Central Harlem
East Harlem
Upper East Side
Yorkville
Lenox Hill
Roosevelt Island
Upper West Side
Lincoln Square
Clinton
Midtown
Murray Hill
Chelsea
Greenwich Village
East Village
Lower East Side
Tribeca
Little Italy
Soho
West Village
Manhattan Valley
Morningside Heights
Gramercy
Battery Park City
Financial District
Carnegie Hill
Noho
Civic Center
Midtown South
Sutton Place
Turtle Bay
Tudor City
Stuyvesant Town
Flatiron
Hudson Yards
Marble Hill
Chinatown
Washington Heights
Inwood
Hamilton Heights
Manhattanville
Central Harlem
East Harlem
Upper East Side
Yorkville
Lenox Hill
Roosevelt Island
Upper West Side
Lincoln Square
Clinton
Midtown
Murray Hill
Chelsea
Greenwich Village
East Village
Lower East Side
Tribeca
Little Italy
Soho
West Village
Manhattan Valley
Morningside Heights
Gramercy
Battery Park City
Financial District
Carnegie Hill
Noho
Civic Center
Midtown South
Sutton Place
Turtle Bay
Tudor City
Stuyve

In [54]:
print(manhattan_venues.shape)
manhattan_venues.head()

(6622, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Marble Hill,40.876551,-73.91066,Arturo's,40.874412,-73.910271,Pizza Place
1,Marble Hill,40.876551,-73.91066,Bikram Yoga,40.876844,-73.906204,Yoga Studio
2,Marble Hill,40.876551,-73.91066,Tibbett Diner,40.880404,-73.908937,Diner
3,Marble Hill,40.876551,-73.91066,Starbucks,40.877531,-73.905582,Coffee Shop
4,Marble Hill,40.876551,-73.91066,Land & Sea Restaurant,40.877885,-73.905873,Seafood Restaurant


### Now pull out Coffee Shops and create a separate dataframe with only Coffee Shop Venues

### Delete every Venue that is not a Coffee Shop - Get a Dataframe that is just Coffee Shops

In [55]:
coffee_shop_data = manhattan_venues.loc[manhattan_venues['Venue Category'] == 'Coffee Shop']

In [56]:
coffee_shop_data.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
3,Marble Hill,40.876551,-73.91066,Starbucks,40.877531,-73.905582,Coffee Shop
6,Marble Hill,40.876551,-73.91066,Starbucks,40.873755,-73.908613,Coffee Shop
86,Chinatown,40.715618,-73.994279,Little Canal,40.714317,-73.990361,Coffee Shop
117,Chinatown,40.715618,-73.994279,Cafe Grumpy,40.715069,-73.989952,Coffee Shop
146,Washington Heights,40.851903,-73.9369,Starbucks,40.850961,-73.93833,Coffee Shop


### Reset the index on the dataframe

In [57]:
CSIndexed = coffee_shop_data.reset_index()  



In [58]:
CSIndexed.head()

Unnamed: 0,index,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,3,Marble Hill,40.876551,-73.91066,Starbucks,40.877531,-73.905582,Coffee Shop
1,6,Marble Hill,40.876551,-73.91066,Starbucks,40.873755,-73.908613,Coffee Shop
2,86,Chinatown,40.715618,-73.994279,Little Canal,40.714317,-73.990361,Coffee Shop
3,117,Chinatown,40.715618,-73.994279,Cafe Grumpy,40.715069,-73.989952,Coffee Shop
4,146,Washington Heights,40.851903,-73.9369,Starbucks,40.850961,-73.93833,Coffee Shop


In [59]:
CSIndexed.shape

(260, 8)

### Get a Dataframe of exclusively of Starbucks Coffee Shops

In [62]:
Starbucks_data = CSIndexed.loc[CSIndexed['Venue'] == 'Starbucks']

In [63]:
Starbucks = Starbucks_data.reset_index()

In [64]:
Starbucks

Unnamed: 0,level_0,index,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,0,3,Marble Hill,40.876551,-73.91066,Starbucks,40.877531,-73.905582,Coffee Shop
1,1,6,Marble Hill,40.876551,-73.91066,Starbucks,40.873755,-73.908613,Coffee Shop
2,4,146,Washington Heights,40.851903,-73.9369,Starbucks,40.850961,-73.93833,Coffee Shop
3,14,529,Upper East Side,40.775639,-73.960508,Starbucks,40.773533,-73.95981,Coffee Shop
4,20,650,Yorkville,40.77593,-73.947118,Starbucks,40.772356,-73.949984,Coffee Shop
5,29,768,Roosevelt Island,40.76216,-73.949168,Starbucks,40.75936,-73.953153,Coffee Shop
6,34,971,Lincoln Square,40.773529,-73.985338,Starbucks,40.771392,-73.982424,Coffee Shop
7,71,2053,Manhattan Valley,40.797307,-73.964286,Starbucks,40.795369,-73.965589,Coffee Shop
8,73,2077,Manhattan Valley,40.797307,-73.964286,Starbucks,40.79888,-73.96837,Coffee Shop
9,90,2334,Battery Park City,40.711932,-74.016869,Starbucks,40.712217,-74.011585,Coffee Shop


In [68]:
Starbucks.shape

(28, 9)

### Make a Dataframe of All Coffee Shops in Manhattan Minus the Starbucks Coffee Shops

In [65]:
CSIndexed_no_Starbucks = CSIndexed[~CSIndexed['Venue'].isin(['Starbucks'])]



In [66]:
CSIndexed_no_Starbucks.reset_index()

Unnamed: 0,level_0,index,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,2,86,Chinatown,40.715618,-73.994279,Little Canal,40.714317,-73.990361,Coffee Shop
1,3,117,Chinatown,40.715618,-73.994279,Cafe Grumpy,40.715069,-73.989952,Coffee Shop
2,5,213,Inwood,40.867684,-73.92121,Darling Coffee,40.868034,-73.92051,Coffee Shop
3,6,278,Hamilton Heights,40.823604,-73.949688,Monkey Cup,40.825694,-73.947234,Coffee Shop
4,7,294,Hamilton Heights,40.823604,-73.949688,Matto Espresso (Espresso Matto),40.824958,-73.951759,Coffee Shop
5,8,295,Hamilton Heights,40.823604,-73.949688,Manhattanville Coffee,40.821496,-73.944595,Coffee Shop
6,9,311,Hamilton Heights,40.823604,-73.949688,Starbucks NAC Rotunda,40.819923,-73.950154,Coffee Shop
7,10,347,Manhattanville,40.816934,-73.957385,Kuro Kuma,40.813892,-73.960027,Coffee Shop
8,11,415,East Harlem,40.792249,-73.944182,Dear Mama Coffee,40.792255,-73.940779,Coffee Shop
9,12,481,Upper East Side,40.775639,-73.960508,Handcraft Coffee,40.773535,-73.95967,Coffee Shop


### Check the shape to make sure it is correct (260 - 28 = 232 is the shape of the new Dataframe if correct - see below)

In [70]:
CSIndexed_no_Starbucks.shape

(232, 8)