# A Coffee Lover's Guide to America: Comparing Five Major US Cities

----------------------------------------------------------------------------------------------------------------------
 Felix Reznitskiy
 
 December 18, 2020
 
----------------------------------------------------------------------------------------------------------------------

## Introduction

![image](./coffee-caffeinated-history.jpg)

Coffee first became popular in the U.S. after the Boston Tea Party, when the switch was seen as “patriotic,” [according to PBS](http://www.pbs.org/food/the-history-kitchen/history-coffee/). Since Starbucks debuted in 1971, the drink has become accessible almost anywhere you go. A recent survey by the National Coffee Association found that [62 percent](https://www.ncausa.org/Newsroom/NCA-releases-Atlas-of-American-Coffee) of Americans drink coffee every day, with the average coffee drinker consuming 3 cups daily.
What gave rise to java culture? Science, for one, has convinced us that caffeine has health benefits beyond mental stimulation. At the right dosages, caffeine may contribute to [longevity](https://time.com/5326420/coffee-longevity-study/). Perhaps just as important, though, is coffee’s social purpose. Today, coffee stations are a staple of the workplace, and tens of thousands of shops serve as meeting places for friends, dates and coworkers – though in 2020 many have had to switch to take-out service only because of the COVID-19 pandemic.

## Business Problem

Our customer wants to open a coffee-bean roasting facility in one of the major US cities. For the new business to succeed, the customer needs to find the best location for it. We are therefore asked to find the city and the neighborhood with the highest density of coffee shops.
To determine the best city, we will compare five major US cities and pick the one with the highest density of coffee shops. Next, we will compare that city's neighborhoods to find the ones with the highest density.

## Data Description

We will fetch data about coffee shops in the following five largest US cities:
 -	New York City, NY (Population: 8,622,357)
 -	Los Angeles, CA (Population: 4,085,014)
 -	Chicago, IL (Population: 2,670,406)
 -	Houston, TX (Population: 2,378,146)
 -	Phoenix, AZ (Population: 1,743,469)

Using geopy, we will find the coordinates of each city center, and then we will collect the coffee shop data through the Foursquare API. After collecting the data, we will visualize each city on a separate Folium map. Then we will measure the density and merge the results into a single table, sorted to find the winning city. The city with the highest density (lowest mean distance) will be considered the best.



### 1. Geocoders

We require geographical location data for each of the five cities. The city center will be used as the starting point for the Foursquare API (we will run the search query around a particular geographical location). We will use geopy.geocoders to obtain the following for each of the five cities:
- city
- latitude
- longitude

### 2. Foursquare API

We need to make sure that the Foursquare API search returns only coffee shops.
To do so, we will first call the Foursquare API once and fetch a single coffee shop from one city in order to extract the category Id of "Coffee Shop". This Id will then be used to restrict the search to that one venue category.
- category name
- category Id

Once the city center coordinates and the category Id are available, we will run the Foursquare API search query and pull the list of coffee shops for each city:
- venue name
- venue category
- latitude
- longitude
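One way to assemble such a request is to let the standard library encode the query string, so that city names containing spaces and commas are escaped properly. The sketch below is illustrative only: `build_explore_url` is a hypothetical helper and the environment-variable names are assumptions, while the endpoint, the parameter names, and the category Id are the ones used in this notebook's own requests.

```python
import os
from urllib.parse import urlencode

# Endpoint and parameter names are copied from the requests in this notebook.
EXPLORE_ENDPOINT = "https://api.foursquare.com/v2/venues/explore"

def build_explore_url(city, category_id, limit=100, version="20180605"):
    """Hypothetical helper: build an explore URL for one city and one category."""
    params = {
        # Assumed environment-variable names; they keep secrets out of the notebook.
        "client_id": os.environ.get("FOURSQUARE_CLIENT_ID", ""),
        "client_secret": os.environ.get("FOURSQUARE_CLIENT_SECRET", ""),
        "v": version,
        "near": city,
        "limit": limit,
        "categoryId": category_id,
    }
    # urlencode escapes the space and the comma in "New York, NY".
    return EXPLORE_ENDPOINT + "?" + urlencode(params)

url = build_explore_url("New York, NY", "4bf58dd8d48988d1e0931735")
```

Reading the credentials from the environment also avoids printing or committing them together with the notebook.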

We will create a Folium map for each city and plot all the venues on it for a preliminary look at the coffee shop density.

Next, we will use this data to measure the density of coffee shops in the selected cities. Our first measure is the mean distance from the venues to the city center coordinates. The city with the lowest mean distance will be considered the best.

As a second measure, we will compute the mean distance from the venues to the mean coordinates of all the coffee shops in the city. Then we will create a dataframe with the following columns:
- City
- Average_Proximity_To_The_City_Center
- Average_Distance_To_Mean_Coordinates
- Coffee_Shops_Per_City

As noted above, the city with the lowest mean distance will be considered the best.
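The two measures can be sketched in a few lines of plain Python. This is a minimal illustration rather than the notebook's actual implementation: the venue coordinates are made up, `mean_distance` is a hypothetical helper, and distances are straight-line differences in degrees of latitude and longitude.

```python
import math
from statistics import mean

def mean_distance(points, origin):
    """Average straight-line distance (in degrees) from each point to origin."""
    return mean(math.hypot(lat - origin[0], lng - origin[1]) for lat, lng in points)

# Made-up venue coordinates, for illustration only.
venues = [(40.71, -74.00), (40.72, -74.01), (40.70, -73.99)]
city_center = (40.7127281, -74.0060152)  # New York city center (latitude, longitude)

# Mean coordinates of the venues themselves (their centroid).
centroid = (mean(lat for lat, _ in venues), mean(lng for _, lng in venues))

to_center = mean_distance(venues, city_center)   # first measure
to_centroid = mean_distance(venues, centroid)    # second measure
```

The second measure does not depend on where the geocoder places the city center, which makes it a useful cross-check on the first.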

### 3. Public databases / websites scraping

To segment the neighborhoods and explore them, we need a dataset that contains all the boroughs, the neighborhoods in each borough, and the latitude and longitude coordinates of each neighborhood.
We will use the nested JSON file provided in one of the course labs: newyork_data.json.

The following features will be extracted from the JSON file:

- Borough
- Neighborhood
- Latitude
- Longitude

After the best city is found, we will use K-Means clustering to find the neighborhoods with the highest density of coffee shops. We are going to utilize the pandas dataframes and Folium maps to cluster the venues and present the findings on the map.

The following features will be used for the map creation and clustering:
 - Neighborhood Latitude (extracted from JSON file)
 - Neighborhood Longitude (extracted from JSON file)
 - Venue Latitude (extracted from Foursquare API)
 - Venue Longitude (extracted from Foursquare API)
 - Cluster Labels (generated)


In [1]:
import numpy as np # library for working with arrays, vectors etc.
import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None) # table formatting settings
pd.set_option('display.max_rows', None) # table formatting settings
import requests # library to handle requests
import folium # library for generating the maps
import json # library to handle JSON files
from pandas import json_normalize # flatten nested JSON into a pandas dataframe
import math # built-in module with mathematical functions and constants

from sklearn.cluster import KMeans # we will be using k-means clustering later to visualize the best neighborhoods' venues

#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

print('Libraries imported.')

Libraries imported.


In [2]:
cityList = ['New York, NY', 'Los Angeles, CA', 'Chicago, IL', 'Houston, TX', 'Phoenix, AZ']

cityCoordinates = {}

geolocator = Nominatim(user_agent="my_coffee_explorer") # one geocoder instance is enough for all cities
for city in cityList:
    location = geolocator.geocode(city) # e.g. 'New York, NY'
    cityCoordinates[city] = [location.latitude, location.longitude]
    print('The geographical coordinates of {} are {}, {}.'.format(city, cityCoordinates[city][0], cityCoordinates[city][1]))

The geographical coordinates of New York, NY are 40.7127281, -74.0060152.
The geographical coordinates of Los Angeles, CA are 34.0536909, -118.242766.
The geographical coordinates of Chicago, IL are 41.8755616, -87.6244212.
The geographical coordinates of Houston, TX are 29.7589382, -95.3676974.
The geographical coordinates of Phoenix, AZ are 33.4484367, -112.0741417.


In [3]:
search_query = 'Coffee'
#search_query = 'Coffee Shop'
radius = 500
#print(search_query + ' .... OK!')
CLIENT_ID = '0YOP1FXJVEUP5BOXUZG1FH3Y2EIWH04A5EYLAVRC2SUXR2XT' # your Foursquare ID
CLIENT_SECRET = 'CLASSIFIED' # your Foursquare Secret
ACCESS_TOKEN = 'CLASSIFIED' # your Foursquare Access Token
VERSION = '20180605'
LIMIT = 1 # we will use this single query result to fetch the category Id
#print('Your credentials:')
#print('CLIENT_ID: ' + CLIENT_ID)
#print('CLIENT_SECRET:' + CLIENT_SECRET)

First, we need to find the category Id of coffee shops so that we can fetch only coffee shops through the Foursquare API.

In [4]:
neighborhood_latitude = cityCoordinates['New York, NY'][0]
neighborhood_longitude = cityCoordinates['New York, NY'][1]

url = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll={},{}&oauth_token={}&v={}&query={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, neighborhood_latitude, neighborhood_longitude,ACCESS_TOKEN, VERSION, search_query, radius, LIMIT)

# checking the URL
#print(url)

#fetching one Coffee Shop in order to get the category Id
queryResult = requests.get(url).json()

# fetching category name and id
#print(queryResult['response']['venues'][0]['categories'][0]['name'] + ", " + queryResult['response']['venues'][0]['categories'][0]['id']) #'4bf58dd8d48988d1e0931735'
print(queryResult['response']['venues'][0]) #'4bf58dd8d48988d1e0931735'

{'id': '49c79540f964a520af571fe3', 'name': 'Blue Spoon Coffee Co.', 'location': {'address': '76 Chambers St', 'crossStreet': 'at Broadway', 'lat': 40.714427584609766, 'lng': -74.00685853301651, 'labeledLatLngs': [{'label': 'display', 'lat': 40.714427584609766, 'lng': -74.00685853301651}], 'distance': 202, 'postalCode': '10007', 'cc': 'US', 'city': 'New York', 'state': 'NY', 'country': 'United States', 'formattedAddress': ['76 Chambers St (at Broadway)', 'New York, NY 10007']}, 'categories': [{'id': '4bf58dd8d48988d1e0931735', 'name': 'Coffee Shop', 'pluralName': 'Coffee Shops', 'shortName': 'Coffee Shop', 'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/food/coffeeshop_', 'suffix': '.png'}, 'primary': True}], 'referralId': 'v-1609268332', 'hasPerk': False}
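The category Id can be read from the response without hard-coding list positions. The sketch below works on a trimmed copy of the venue printed above; `primary_category` is a hypothetical helper, not part of the Foursquare API.

```python
# Trimmed copy of the venue printed above; only the fields we need are kept.
venue = {
    "name": "Blue Spoon Coffee Co.",
    "categories": [
        {"id": "4bf58dd8d48988d1e0931735", "name": "Coffee Shop", "primary": True}
    ],
}

def primary_category(venue):
    """Hypothetical helper: return (name, id) of the primary category, or None."""
    categories = venue.get("categories", [])
    for category in categories:
        if category.get("primary"):
            return category["name"], category["id"]
    # Fall back to the first listed category when none is flagged as primary.
    if categories:
        return categories[0]["name"], categories[0]["id"]
    return None

print(primary_category(venue))
```

For the venue above this prints the "Coffee Shop" name together with the Id `4bf58dd8d48988d1e0931735`, which is the value used in the next cell.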


Now we can pull the coffee shops for each city, restricting the search to the category Id found above.

In [5]:
LIMIT = 500
results = {}
for city in cityList:
    url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&near={}&limit={}&categoryId={}'.format(
        CLIENT_ID, CLIENT_SECRET, VERSION, city, LIMIT,
        "4bf58dd8d48988d1e0931735") # Category from the previous step
    results[city] = requests.get(url).json()

In [6]:
#from pandas import json_normalize
df_venues={}
for city in cityList:
    venues = json_normalize(results[city]['response']['groups'][0]['items'])
    df_venues[city] = venues[['venue.name', 'venue.location.address', 'venue.location.lat', 'venue.location.lng','venue.location.postalCode']]
    df_venues[city].columns = ['name', 'address', 'lat', 'lng','postalCode']

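Before mapping, let's check the size of the data. The dataframe for the last city processed (Phoenix) holds 100 venues even though LIMIT is set to 500, so the distances computed below are based on a sample of venues rather than on every coffee shop counted in totalResults.

In [7]:
print(venues.shape)

(100, 28)

Let's take a look at the maps to see the density of coffee shops in each city.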


In [8]:
CoffeeShopsPerCity = [] # this list will be used later for the final report
maps = {} # will contain five maps of the cities
for city in cityList:
    maps[city] = folium.Map(location=[cityCoordinates[city][0], cityCoordinates[city][1]], zoom_start=11)

    # add markers to map
    for lat, lng, label in zip(df_venues[city]['lat'], df_venues[city]['lng'], df_venues[city]['name']):
        label = folium.Popup(label, parse_html=True)
        folium.CircleMarker(
            [lat, lng],
            radius=5,
            popup=label,
            color='blue',
            fill=True,
            fill_color='#3186cc',
            fill_opacity=0.7,
            parse_html=False).add_to(maps[city])  
    print(f"Total number of coffee shops in {city} = ", results[city]['response']['totalResults'])
    CoffeeShopsPerCity.append(results[city]['response']['totalResults'])

Total number of coffee shops in New York, NY =  219
Total number of coffee shops in Los Angeles, CA =  198
Total number of coffee shops in Chicago, IL =  188
Total number of coffee shops in Houston, TX =  155
Total number of coffee shops in Phoenix, AZ =  161


In [9]:
maps[cityList[0]]

In [10]:
maps[cityList[1]]

In [11]:
maps[cityList[2]]

In [12]:
maps[cityList[3]]

In [13]:
maps[cityList[4]]

On the maps, New York and Chicago appear to have the highest density of coffee shops.

To back up this observation, we will measure the density and build a table with concrete numbers.
We will use two calculations: the average distance from the coffee shops to the corresponding city center, and the average distance from the coffee shops to their own mean coordinates.

In [14]:
# these lists will be used later for creating the final report columns
citiesCol=[]
distance1Col=[]
distance2Col=[]

for city in cityList:
    # calculating mean coordinates of the coffee shops
    coffeeShopsMeanCoordinates = [df_venues[city]['lat'].mean(), df_venues[city]['lng'].mean()] 
    #print(city)
    # calculating average distance from coffee shops to the city center coordinates
    averageDistanceToCenter = np.mean(np.apply_along_axis(lambda x: math.hypot(x[0]-cityCoordinates[city][0],x[1]-cityCoordinates[city][1]),1,df_venues[city][['lat','lng']].values))
    #print(averageDistanceToCenter)
    # calculating average distance from coffee shops to the mean coordinates
    averageDistanceToMean = np.mean(np.apply_along_axis(lambda x: math.hypot(x[0]-coffeeShopsMeanCoordinates[0],x[1]-coffeeShopsMeanCoordinates[1]),1,df_venues[city][['lat','lng']].values))
    #print(averageDistanceToMean)
    citiesCol.append(city)
    distance1Col.append(averageDistanceToCenter)
    distance2Col.append(averageDistanceToMean)

# building the final report table
dfReport = pd.DataFrame()
dfReport['City'] = citiesCol
dfReport['Average_Proximity_To_The_City_Center'] = distance1Col
dfReport['Average_Distance_To_Mean_Coordinates'] = distance2Col
dfReport['Coffee_Shops_Per_City'] = CoffeeShopsPerCity

# sorting the results by average proximity (the lower, the better)
dfReport = dfReport.sort_values(by=['Average_Proximity_To_The_City_Center'], ascending=True)
dfReport.reset_index(drop = True, inplace = True)

dfReport.head()

| | City | Average_Proximity_To_The_City_Center | Average_Distance_To_Mean_Coordinates | Coffee_Shops_Per_City |
|---|---|---|---|---|
| 0 | New York, NY | 0.033239 | 0.022924 | 219 |
| 1 | Chicago, IL | 0.069026 | 0.050861 | 188 |
| 2 | Houston, TX | 0.115246 | 0.107027 | 155 |
| 3 | Phoenix, AZ | 0.132877 | 0.12388 | 161 |
| 4 | Los Angeles, CA | 0.139922 | 0.101165 | 198 |
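The distances in this table are expressed in degrees of latitude and longitude. Because one degree of longitude covers less ground the farther a city lies from the equator, degree-based averages for cities at different latitudes are not strictly comparable. A possible refinement, not used in this notebook, is to measure great-circle distances in kilometers with the haversine formula. The sketch below assumes a spherical Earth with a mean radius of 6371 km; `haversine_km` is a hypothetical helper.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, a common approximation

def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in kilometers between two (lat, lng) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lng2 - lng1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# New York city center and the sample venue returned by the category lookup.
center = (40.7127281, -74.0060152)
blue_spoon = (40.714427584609766, -74.00685853301651)

d = haversine_km(*center, *blue_spoon)
```

For this pair the result is roughly 0.2 km, in line with the `'distance': 202` (meters) that Foursquare reported for the same venue. Swapping this function in for `math.hypot` in the cell above would make the report columns comparable across cities.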


## Conclusion

#### New York has the lowest mean distance on both measures, that is, the highest density of coffee shops.
#### Therefore, we pronounce New York the best city for coffee lovers!!!

Let's group the New York neighborhoods in order to find the ones with the highest density. First, we need to load the list of neighborhoods.

New York has a total of 5 boroughs and 306 neighborhoods. To segment the neighborhoods and explore them, we need a dataset that contains the 5 boroughs, the neighborhoods in each borough, and the latitude and longitude coordinates of each neighborhood.

In [15]:
import json

with open('C:/Users/UserName/Downloads/_Coursera/_IBM Data Science Capstone project/newyork_data.json') as json_data:
    newyork_data = json.load(json_data)

In [16]:
newyork_data

{'type': 'FeatureCollection',
 'totalFeatures': 306,
 'features': [{'type': 'Feature',
   'id': 'nyu_2451_34572.1',
   'geometry': {'type': 'Point',
    'coordinates': [-73.84720052054902, 40.89470517661]},
   'geometry_name': 'geom',
   'properties': {'name': 'Wakefield',
    'stacked': 1,
    'annoline1': 'Wakefield',
    'annoline2': None,
    'annoline3': None,
    'annoangle': 0.0,
    'borough': 'Bronx',
    'bbox': [-73.84720052054902,
     40.89470517661,
     -73.84720052054902,
     40.89470517661]}},
  {'type': 'Feature',
   'id': 'nyu_2451_34572.2',
   'geometry': {'type': 'Point',
    'coordinates': [-73.82993910812398, 40.87429419303012]},
   'geometry_name': 'geom',
   'properties': {'name': 'Co-op City',
    'stacked': 2,
    'annoline1': 'Co-op',
    'annoline2': 'City',
    'annoline3': None,
    'annoangle': 0.0,
    'borough': 'Bronx',
    'bbox': [-73.82993910812398,
     40.87429419303012,
     -73.82993910812398,
     40.87429419303012]}},
  {'type': 'Feature',
 

Notice how all the relevant data is in the features key, which is basically a list of the neighborhoods. So, let's define a new variable that includes this data.

In [17]:
neighborhoods_data = newyork_data['features']

### Transform the data into a pandas dataframe
The next task is essentially transforming this data of nested Python dictionaries into a pandas dataframe. So let's start by creating an empty dataframe.

In [18]:
# define the dataframe columns
column_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude'] 

# instantiate the dataframe
neighborhoods = pd.DataFrame(columns=column_names)

Take a look at the empty dataframe to confirm that the columns are as intended.

In [19]:
neighborhoods.head()

| | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|


Then let's loop through the data and fill the dataframe one row at a time.

In [20]:
for data in neighborhoods_data:
    borough = data['properties']['borough']
    neighborhood_name = data['properties']['name']

    # GeoJSON stores coordinates as [longitude, latitude]
    neighborhood_latlon = data['geometry']['coordinates']
    neighborhood_lat = neighborhood_latlon[1]
    neighborhood_lon = neighborhood_latlon[0]

    # DataFrame.append was removed in pandas 2.0; add the row by label instead
    neighborhoods.loc[len(neighborhoods)] = [borough, neighborhood_name,
                                             neighborhood_lat, neighborhood_lon]

In [21]:
print('The dataframe has {} boroughs and {} neighborhoods.'.format(
        len(neighborhoods['Borough'].unique()),
        neighborhoods.shape[0]
    )
)

The dataframe has 5 boroughs and 306 neighborhoods.
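The same flattening can also be written as a single pass that collects plain rows first. The sketch below uses two features copied from the JSON preview above, trimmed to the fields we need; `feature_to_row` is a hypothetical helper.

```python
# Two features copied from the JSON preview above, trimmed to the fields we use.
features = [
    {"geometry": {"coordinates": [-73.84720052054902, 40.89470517661]},
     "properties": {"name": "Wakefield", "borough": "Bronx"}},
    {"geometry": {"coordinates": [-73.82993910812398, 40.87429419303012]},
     "properties": {"name": "Co-op City", "borough": "Bronx"}},
]

def feature_to_row(feature):
    """GeoJSON stores coordinates as [longitude, latitude], so unpack in that order."""
    lon, lat = feature["geometry"]["coordinates"]
    props = feature["properties"]
    return {"Borough": props["borough"], "Neighborhood": props["name"],
            "Latitude": lat, "Longitude": lon}

rows = [feature_to_row(f) for f in features]
# pd.DataFrame(rows) would then build the whole table in a single call.
```

Building the list first and creating the dataframe once is usually faster than growing a dataframe row by row, which matters little for 306 neighborhoods but helps on larger files.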


Quickly examine the resulting dataframe.

In [22]:
neighborhoods.head()

| | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|
| 0 | Bronx | Wakefield | 40.894705 | -73.847201 |
| 1 | Bronx | Co-op City | 40.874294 | -73.829939 |
| 2 | Bronx | Eastchester | 40.887556 | -73.827806 |
| 3 | Bronx | Fieldston | 40.895437 | -73.905643 |
| 4 | Bronx | Riverdale | 40.890834 | -73.912585 |


### Use geopy library to get the latitude and longitude values of New York City.
In order to define an instance of the geocoder, we need to define a user_agent. We will name our agent ny_explorer, as shown below.

In [23]:
address = 'New York City, NY'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of New York City are {}, {}.'.format(latitude, longitude))

The geographical coordinates of New York City are 40.7127281, -74.0060152.


### Create a map of New York with neighborhoods superimposed on top.

In [24]:
# create map of New York using latitude and longitude values
map_newyork = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(neighborhoods['Latitude'], neighborhoods['Longitude'], neighborhoods['Borough'], neighborhoods['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_newyork)  
    
map_newyork

In [25]:
print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: 0YOP1FXJVEUP5BOXUZG1FH3Y2EIWH04A5EYLAVRC2SUXR2XT
CLIENT_SECRET:CLASSIFIED


Now we are ready to clean the json and structure it into a pandas dataframe.

In [26]:
import pandas as pd
nearby_venues = pd.DataFrame()

Next, we define a function that gathers the venues around each neighborhood.

In [27]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
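The function above indexes straight into `["response"]['groups'][0]['items']`, so a response without groups (for example after hitting a rate limit) or a venue without categories would raise an exception partway through a borough. A more defensive variant might look like the sketch below; `extract_items` and `item_to_row` are hypothetical helpers, and the payloads are made up to mirror the response shape used in this notebook.

```python
def extract_items(payload):
    """Hypothetical helper: return the venue items of an explore response, or []."""
    groups = payload.get("response", {}).get("groups", [])
    if not groups:
        return []
    return groups[0].get("items", [])

def item_to_row(neighborhood, item):
    """Flatten one item; venues without a category get None as their category."""
    venue = item["venue"]
    categories = venue.get("categories", [])
    category = categories[0]["name"] if categories else None
    return (neighborhood, venue["name"],
            venue["location"]["lat"], venue["location"]["lng"], category)

# Made-up payloads shaped like the responses used in this notebook.
ok = {"response": {"groups": [{"items": [
    {"venue": {"name": "Starbucks",
               "location": {"lat": 40.877531, "lng": -73.905582},
               "categories": [{"name": "Coffee Shop"}]}}]}]}}
error = {"meta": {"code": 429}, "response": {}}

rows = [item_to_row("Marble Hill", item) for item in extract_items(ok)]
skipped = extract_items(error)  # empty list instead of a KeyError
```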

### Analyze Each Borough/Neighborhood

We will use the function written above to gather info for each neighborhood's venues into a separate Pandas dataframe.

In [28]:
manhattan_data = neighborhoods[neighborhoods['Borough'] == 'Manhattan'].reset_index(drop=True)
manhattan_data.head()

| | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|
| 0 | Manhattan | Marble Hill | 40.876551 | -73.91066 |
| 1 | Manhattan | Chinatown | 40.715618 | -73.994279 |
| 2 | Manhattan | Washington Heights | 40.851903 | -73.9369 |
| 3 | Manhattan | Inwood | 40.867684 | -73.92121 |
| 4 | Manhattan | Hamilton Heights | 40.823604 | -73.949688 |


In [29]:
queens_data = neighborhoods[neighborhoods['Borough'] == 'Queens'].reset_index(drop=True)
queens_data.head()

| | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|
| 0 | Queens | Astoria | 40.768509 | -73.915654 |
| 1 | Queens | Woodside | 40.746349 | -73.901842 |
| 2 | Queens | Jackson Heights | 40.751981 | -73.882821 |
| 3 | Queens | Elmhurst | 40.744049 | -73.881656 |
| 4 | Queens | Howard Beach | 40.654225 | -73.838138 |


In [30]:
brooklyn_data = neighborhoods[neighborhoods['Borough'] == 'Brooklyn'].reset_index(drop=True)
brooklyn_data.head()

| | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|
| 0 | Brooklyn | Bay Ridge | 40.625801 | -74.030621 |
| 1 | Brooklyn | Bensonhurst | 40.611009 | -73.99518 |
| 2 | Brooklyn | Sunset Park | 40.645103 | -74.010316 |
| 3 | Brooklyn | Greenpoint | 40.730201 | -73.954241 |
| 4 | Brooklyn | Gravesend | 40.59526 | -73.973471 |


In [31]:
bronx_data = neighborhoods[neighborhoods['Borough'] == 'Bronx'].reset_index(drop=True)
bronx_data.head()

| | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|
| 0 | Bronx | Wakefield | 40.894705 | -73.847201 |
| 1 | Bronx | Co-op City | 40.874294 | -73.829939 |
| 2 | Bronx | Eastchester | 40.887556 | -73.827806 |
| 3 | Bronx | Fieldston | 40.895437 | -73.905643 |
| 4 | Bronx | Riverdale | 40.890834 | -73.912585 |


In [32]:
statenIsland_data = neighborhoods[neighborhoods['Borough'] == 'Staten Island'].reset_index(drop=True)
statenIsland_data.head()

| | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|
| 0 | Staten Island | St. George | 40.644982 | -74.079353 |
| 1 | Staten Island | New Brighton | 40.640615 | -74.087017 |
| 2 | Staten Island | Stapleton | 40.626928 | -74.077902 |
| 3 | Staten Island | Rosebank | 40.615305 | -74.069805 |
| 4 | Staten Island | West Brighton | 40.631879 | -74.107182 |


In [33]:
manhattan_venues = getNearbyVenues(manhattan_data['Neighborhood'],manhattan_data['Latitude'],manhattan_data['Longitude'])
print("Done!")

Marble Hill
Chinatown
Washington Heights
Inwood
Hamilton Heights
Manhattanville
Central Harlem
East Harlem
Upper East Side
Yorkville
Lenox Hill
Roosevelt Island
Upper West Side
Lincoln Square
Clinton
Midtown
Murray Hill
Chelsea
Greenwich Village
East Village
Lower East Side
Tribeca
Little Italy
Soho
West Village
Manhattan Valley
Morningside Heights
Gramercy
Battery Park City
Financial District
Carnegie Hill
Noho
Civic Center
Midtown South
Sutton Place
Turtle Bay
Tudor City
Stuyvesant Town
Flatiron
Hudson Yards
Done!


In [34]:
queens_venues = getNearbyVenues(queens_data['Neighborhood'],queens_data['Latitude'],queens_data['Longitude'])
print("Done!")

Astoria
Woodside
Jackson Heights
Elmhurst
Howard Beach
Corona
Forest Hills
Kew Gardens
Richmond Hill
Flushing
Long Island City
Sunnyside
East Elmhurst
Maspeth
Ridgewood
Glendale
Rego Park
Woodhaven
Ozone Park
South Ozone Park
College Point
Whitestone
Bayside
Auburndale
Little Neck
Douglaston
Glen Oaks
Bellerose
Kew Gardens Hills
Fresh Meadows
Briarwood
Jamaica Center
Oakland Gardens
Queens Village
Hollis
South Jamaica
St. Albans
Rochdale
Springfield Gardens
Cambria Heights
Rosedale
Far Rockaway
Broad Channel
Breezy Point
Steinway
Beechhurst
Bay Terrace
Edgemere
Arverne
Rockaway Beach
Neponsit
Murray Hill
Floral Park
Holliswood
Jamaica Estates
Queensboro Hill
Hillcrest
Ravenswood
Lindenwood
Laurelton
Lefrak City
Belle Harbor
Rockaway Park
Somerville
Brookville
Bellaire
North Corona
Forest Hills Gardens
Jamaica Hills
Utopia
Pomonok
Astoria Heights
Hunters Point
Sunnyside Gardens
Blissville
Roxbury
Middle Village
Malba
Hammels
Bayswater
Queensbridge
Done!


In [35]:
brooklyn_venues = getNearbyVenues(brooklyn_data['Neighborhood'],brooklyn_data['Latitude'],brooklyn_data['Longitude'])
print("Done!")

Bay Ridge
Bensonhurst
Sunset Park
Greenpoint
Gravesend
Brighton Beach
Sheepshead Bay
Manhattan Terrace
Flatbush
Crown Heights
East Flatbush
Kensington
Windsor Terrace
Prospect Heights
Brownsville
Williamsburg
Bushwick
Bedford Stuyvesant
Brooklyn Heights
Cobble Hill
Carroll Gardens
Red Hook
Gowanus
Fort Greene
Park Slope
Cypress Hills
East New York
Starrett City
Canarsie
Flatlands
Mill Island
Manhattan Beach
Coney Island
Bath Beach
Borough Park
Dyker Heights
Gerritsen Beach
Marine Park
Clinton Hill
Sea Gate
Downtown
Boerum Hill
Prospect Lefferts Gardens
Ocean Hill
City Line
Bergen Beach
Midwood
Prospect Park South
Georgetown
East Williamsburg
North Side
South Side
Ocean Parkway
Fort Hamilton
Ditmas Park
Wingate
Rugby
Remsen Village
New Lots
Paerdegat Basin
Mill Basin
Fulton Ferry
Vinegar Hill
Weeksville
Broadway Junction
Dumbo
Homecrest
Highland Park
Madison
Erasmus
Done!


In [36]:
bronx_venues = getNearbyVenues(bronx_data['Neighborhood'],bronx_data['Latitude'],bronx_data['Longitude'])
print("Done!")

Wakefield
Co-op City
Eastchester
Fieldston
Riverdale
Kingsbridge
Woodlawn
Norwood
Williamsbridge
Baychester
Pelham Parkway
City Island
Bedford Park
University Heights
Morris Heights
Fordham
East Tremont
West Farms
High  Bridge
Melrose
Mott Haven
Port Morris
Longwood
Hunts Point
Morrisania
Soundview
Clason Point
Throgs Neck
Country Club
Parkchester
Westchester Square
Van Nest
Morris Park
Belmont
Spuyten Duyvil
North Riverdale
Pelham Bay
Schuylerville
Edgewater Park
Castle Hill
Olinville
Pelham Gardens
Concourse
Unionport
Edenwald
Claremont Village
Concourse Village
Mount Eden
Mount Hope
Bronxdale
Allerton
Kingsbridge Heights
Done!


In [37]:
statenIsland_venues = getNearbyVenues(statenIsland_data['Neighborhood'],statenIsland_data['Latitude'],statenIsland_data['Longitude'])
print("Done!")

St. George
New Brighton
Stapleton
Rosebank
West Brighton
Grymes Hill
Todt Hill
South Beach
Port Richmond
Mariner's Harbor
Port Ivory
Castleton Corners
New Springville
Travis
New Dorp
Oakwood
Great Kills
Eltingville
Annadale
Woodrow
Tottenville
Tompkinsville
Silver Lake
Sunnyside
Park Hill
Westerleigh
Graniteville
Arlington
Arrochar
Grasmere
Old Town
Dongan Hills
Midland Beach
Grant City
New Dorp Beach
Bay Terrace
Huguenot
Pleasant Plains
Butler Manor
Charleston
Rossville
Arden Heights
Greenridge
Heartland Village
Chelsea
Bloomfield
Bulls Head
Richmond Town
Shore Acres
Clifton
Concord
Emerson Hill
Randall Manor
Howland Hook
Elm Park
Manor Heights
Willowbrook
Sandy Ground
Egbertville
Prince's Bay
Lighthouse Hill
Richmond Valley
Fox Hills
Done!


In [38]:
manhattan_venues.head()

| | Neighborhood | Neighborhood Latitude | Neighborhood Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|---|
| 0 | Marble Hill | 40.876551 | -73.91066 | Arturo's | 40.874412 | -73.910271 | Pizza Place |
| 1 | Marble Hill | 40.876551 | -73.91066 | Bikram Yoga | 40.876844 | -73.906204 | Yoga Studio |
| 2 | Marble Hill | 40.876551 | -73.91066 | Tibbett Diner | 40.880404 | -73.908937 | Diner |
| 3 | Marble Hill | 40.876551 | -73.91066 | Dunkin' | 40.877136 | -73.906666 | Donut Shop |
| 4 | Marble Hill | 40.876551 | -73.91066 | Starbucks | 40.877531 | -73.905582 | Coffee Shop |


Next, we will clean the data to remove all irrelevant venues (other than coffee shops) from the dataframe.

In [39]:
manhattan_venues = manhattan_venues[manhattan_venues['Venue Category'].isin(['Coffee Shop'])]

In [40]:
manhattan_venues.head()

| | Neighborhood | Neighborhood Latitude | Neighborhood Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|---|
| 4 | Marble Hill | 40.876551 | -73.91066 | Starbucks | 40.877531 | -73.905582 | Coffee Shop |
| 5 | Marble Hill | 40.876551 | -73.91066 | Starbucks | 40.873755 | -73.908613 | Coffee Shop |
| 75 | Chinatown | 40.715618 | -73.994279 | Little Canal | 40.714317 | -73.990361 | Coffee Shop |
| 105 | Chinatown | 40.715618 | -73.994279 | Cafe Grumpy | 40.715069 | -73.989952 | Coffee Shop |
| 131 | Washington Heights | 40.851903 | -73.9369 | Forever Coffee Bar | 40.850433 | -73.936607 | Coffee Shop |


In [41]:
manhattan_venues.shape

(146, 7)

In [42]:
queens_venues = queens_venues[queens_venues['Venue Category'].isin(['Coffee Shop'])]
queens_venues.shape

(37, 7)

In [43]:
brooklyn_venues = brooklyn_venues[brooklyn_venues['Venue Category'].isin(['Coffee Shop'])]
brooklyn_venues.shape

(95, 7)

In [44]:
bronx_venues = bronx_venues[bronx_venues['Venue Category'].isin(['Coffee Shop'])]
bronx_venues.shape

(15, 7)

In [45]:
statenIsland_venues = statenIsland_venues[statenIsland_venues['Venue Category'].isin(['Coffee Shop'])]
statenIsland_venues.shape

(15, 7)

### Manhattan is the winner of the boroughs battle!

Now let's visualize the venues.

First, let's find the top five neighborhoods.

In [46]:
#manhattan_venues['Venue'].value_counts()
manhattan_venues['Neighborhood'].value_counts().head()

Financial District    12
Chelsea                8
Carnegie Hill          8
Civic Center           7
Yorkville              6
Name: Neighborhood, dtype: int64
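A plain `head()` cuts the ranking at five rows even when the fifth place is shared. The sketch below shows one way to keep ties, using the counts printed above plus Upper East Side, which the label counts later in this notebook show to have six coffee shops as well; `top_n_with_ties` is a hypothetical helper.

```python
from collections import Counter

# Counts from this notebook's output; Upper East Side comes from the label counts further down.
counts = Counter({
    "Financial District": 12,
    "Chelsea": 8,
    "Carnegie Hill": 8,
    "Civic Center": 7,
    "Yorkville": 6,
    "Upper East Side": 6,
})

def top_n_with_ties(counts, n):
    """Return every (name, count) whose count is at least the n-th largest count."""
    ranked = counts.most_common()
    if len(ranked) <= n:
        return ranked
    cutoff = ranked[n - 1][1]
    return [(name, count) for name, count in ranked if count >= cutoff]

top = top_n_with_ties(counts, 5)
```

With these counts the result contains six neighborhoods, because Yorkville and Upper East Side share fifth place.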

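Next, we will label the top five neighborhoods with a small lookup function applied row by row. Labels run from 1 to 5, and 0 means that the venue does not belong to any of the five. Note that the fifth label goes to Upper East Side rather than Yorkville; the label counts below show that it also has six coffee shops, so the two neighborhoods are tied.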

In [47]:
import pandas as pd
manhattan_grouped = pd.DataFrame()

def clusterNeigh(x):
    return {
        'Financial District': 1,
        'Carnegie Hill': 2,
        'Chelsea': 3,
        'Civic Center': 4,
        'Upper East Side': 5
    }.get(x, 0)

manhattan_grouped = manhattan_venues.copy()
manhattan_grouped['Cluster Labels'] = manhattan_grouped.apply(lambda x: clusterNeigh(x['Neighborhood']), axis=1)

#dfTemp.head(25)
manhattan_grouped['Cluster Labels'].value_counts()

0    105
1     12
3      8
2      8
4      7
5      6
Name: Cluster Labels, dtype: int64

Finally, we will clean the data by removing the venues that we don't need from the dataframe.

In [48]:
# the labels are integers, so compare with 0 rather than the string '0'
manhattan_grouped = manhattan_grouped[manhattan_grouped['Cluster Labels'] != 0]

In [49]:
manhattan_grouped['Cluster Labels'].value_counts()

1    12
3     8
2     8
4     7
5     6
Name: Cluster Labels, dtype: int64

In [50]:
manhattan_grouped.head()

| | Neighborhood | Neighborhood Latitude | Neighborhood Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category | Cluster Labels |
|---|---|---|---|---|---|---|---|---|
| 458 | Upper East Side | 40.775639 | -73.960508 | Handcraft Coffee | 40.773535 | -73.95967 | Coffee Shop | 5 |
| 507 | Upper East Side | 40.775639 | -73.960508 | Joe the Art of Coffee | 40.772044 | -73.960805 | Coffee Shop | 5 |
| 508 | Upper East Side | 40.775639 | -73.960508 | Starbucks | 40.773543 | -73.959836 | Coffee Shop | 5 |
| 515 | Upper East Side | 40.775639 | -73.960508 | 787 Coffee | 40.774461 | -73.955438 | Coffee Shop | 5 |
| 526 | Upper East Side | 40.775639 | -73.960508 | Starbucks Reserve | 40.77985 | -73.959584 | Coffee Shop | 5 |


#### Run _k_-means to cluster Manhattan into 5 clusters according to the above.

In [51]:
# import k-means from clustering stage
from sklearn.cluster import KMeans

# set number of clusters
kclusters = 5

#manhattan_grouped_clustering = manhattan_grouped.drop('Neighborhood', 1)
manhattan_grouped_clustering = manhattan_grouped.drop(columns=['Neighborhood','Venue Category','Venue'])

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(manhattan_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([4, 4, 4, 4, 4, 4, 3, 3, 3, 3])
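Note that the features passed to k-means above still include the manually assigned `Cluster Labels` column, since it is not among the dropped columns. As a point of comparison, the sketch below clusters venues on their coordinates alone. It is an illustration under stated assumptions, not the notebook's method: it uses seven venue coordinates copied from the tables in this notebook and three clusters instead of five.

```python
from sklearn.cluster import KMeans

# (latitude, longitude) of venues copied from the tables in this notebook.
coords = [
    (40.877531, -73.905582),  # Marble Hill
    (40.873755, -73.908613),  # Marble Hill
    (40.773535, -73.95967),   # Upper East Side
    (40.772044, -73.960805),  # Upper East Side
    (40.773543, -73.959836),  # Upper East Side
    (40.714317, -73.990361),  # Chinatown
    (40.715069, -73.989952),  # Chinatown
]

# Cluster on location only; n_init and random_state are set for repeatability.
kmeans_geo = KMeans(n_clusters=3, n_init=10, random_state=0).fit(coords)
labels = kmeans_geo.labels_
```

On such well-separated points the fitted labels should simply recover the three neighborhoods, which gives a quick sanity check before applying the same call to the full dataframe.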

In [52]:
# replace the fitted labels with the manually assigned neighborhood labels (1 to 5)
kmeans.labels_ = manhattan_grouped['Cluster Labels']

In [53]:
#manhattan_grouped.head()
manhattan_merged = manhattan_grouped.copy()

In [54]:
kmeans.labels_[0:10]

458     5
507     5
508     5
515     5
526     5
537     5
1277    3
1291    3
1307    3
1314    3
Name: Cluster Labels, dtype: int64

In [55]:
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# create map
#map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)
map_clusters = folium.Map(location=[cityCoordinates['New York, NY'][0], cityCoordinates['New York, NY'][1]], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []

#fix the NaN values in order to build the graph!!!
manhattan_merged['Cluster Labels'] = manhattan_merged['Cluster Labels'].replace(np.nan, 0)

for lat, lon, poi, poiName, cluster in zip(manhattan_merged['Venue Latitude'], manhattan_merged['Venue Longitude'], manhattan_merged['Neighborhood'], manhattan_grouped['Venue'], manhattan_merged['Cluster Labels']):
    label = folium.Popup(str(poiName) + ', '+ str(poi) + ', Cluster ' + str(cluster), parse_html=True)
#    print("cluster = {}".format(cluster) + "; color code = {}".format(int(cluster-1)) + "; color result code = {}".format(rainbow[int(cluster-1)]))
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster-1)],
#        color=rainbow[(cluster-1)],
        fill=True,
        fill_color=rainbow[int(cluster-1)],
        fill_opacity=0.7).add_to(map_clusters)
      
map_clusters

Based on the analysis, we were able to find the best location for the new business. The most attractive area for the new coffee-bean roasting facility is Manhattan, and in particular its top five neighborhoods. The details are summarized below:

1. The best city:
    New York has the highest density of coffee shops among the five cities compared.
    Therefore, we pronounce New York the best city for coffee lovers!!!
2. Best borough: Manhattan (146 coffee shops)
3. Top five neighborhoods are:
    - Financial District (12)
    - Carnegie Hill (8)
    - Chelsea (8)
    - Civic Center (7)
    - Upper East Side (6)

This concludes the notebook.