# The Battle of Neighborhoods

## Introduction, Business Problem

### Background

My boss, who already owns several restaurants in New York, has asked me, a data scientist, to help him start a new business. He wants to expand his reach and open venues that appeal to tourists. He has noticed that tourists love having a coffee during their trips, so he thinks we should open a café where people can enjoy themselves while we grow the business. As I am French and love Italian coffee, he knew I would want to work with him on this. He adds that it does not have to be exactly a café: a related kind of venue could also work, since people love eating too. He is open to any recommendation from me.

We want to open a café (or something similar) in New York City, as people love to gather somewhere to drink a hot beverage or enjoy a treat. We need to know where to establish our café to give the new business the best chance of success. Because our first choice may turn out to be too different from what we want, we will establish a top 5 of the best areas.

## Data required

We will focus on the activity and the income of an area, and we will need to find nearby cafés with good ratings.

The aim of this project is to solve the business problem, so we favor a practical approach that works.

The data we will need:
* Boroughs
* Neighborhoods
* Locations
* Venues (locations, category)
* Possibly some demographics
* Income (average)

We will obtain this data using the Foursquare API.

We will process the data in the New York City JSON file to find the borough with the most café venues. Within that borough, we will look for clusters to understand where and how to set up a new café. We will then make a top 5 so we know which areas are the most promising.

## Imports and Packages

In [202]:
import numpy as np
import pandas as pd
import geocoder
import requests
from geopy.geocoders import Nominatim
import matplotlib.cm as cm
import matplotlib.colors as colors
from sklearn.cluster import KMeans
import folium
import json
from pandas import json_normalize  # public location since pandas 1.0
import matplotlib.pyplot as plt
import re
import bs4 as bs
import urllib
import urllib.request

## Foursquare Credentials

In [203]:
CLIENT_ID = 'FA0TFZTR0RWPD4OCSHAQGEFCSS4SXVPDK50VHT5KPFWLRZTU' # your Foursquare ID
CLIENT_SECRET = 'JUADW5XDGYN5HCJFYUQSDKLRLNGUVMFMYVCZW1VARRF4Y5IO' # your Foursquare Secret
VERSION = '20180604'

# Methodology

## Getting New York Data

We open the New York JSON file and load the data. The file is the same one used in the course labs.

In [204]:
with open('ny.json') as json_data:
    newyork_data = json.load(json_data)

Get the features out of the data

In [205]:
neighborhoods_data = newyork_data['features']

Creating a dataframe from it

In [206]:
column_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude'] 
neighborhoods = pd.DataFrame(columns=column_names)
neighborhoods.head()

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude


In [207]:
# Collect one record per neighborhood, then build the dataframe in one step
# (DataFrame.append was removed in pandas 2.0).
rows = []
for data in neighborhoods_data:
    borough = data['properties']['borough']
    neighborhood_name = data['properties']['name']

    # GeoJSON stores coordinates as [longitude, latitude]
    neighborhood_latlon = data['geometry']['coordinates']
    neighborhood_lat = neighborhood_latlon[1]
    neighborhood_lon = neighborhood_latlon[0]

    rows.append({'Borough': borough,
                 'Neighborhood': neighborhood_name,
                 'Latitude': neighborhood_lat,
                 'Longitude': neighborhood_lon})

neighborhoods = pd.DataFrame(rows, columns=column_names)

In [208]:
neighborhoods.head()

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,Bronx,Wakefield,40.894705,-73.847201
1,Bronx,Co-op City,40.874294,-73.829939
2,Bronx,Eastchester,40.887556,-73.827806
3,Bronx,Fieldston,40.895437,-73.905643
4,Bronx,Riverdale,40.890834,-73.912585


Getting the number of boroughs and neighborhoods

In [209]:
print('{} boroughs\n{} neighborhoods.'.format(len(neighborhoods['Borough'].unique()),neighborhoods.shape[0]))

5 boroughs
306 neighborhoods.


Getting New York City coordinates

In [210]:
address = 'New York City, NY'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('New York City coordinates: {}, {}.'.format(latitude, longitude))

New York City coordinates: 40.7127281, -74.0060152.


Creating a map of New York for visualization

In [211]:
map_newyork = folium.Map(location=[latitude, longitude], zoom_start=11)

for lat, lng, borough, neighborhood in zip(neighborhoods['Latitude'], neighborhoods['Longitude'], neighborhoods['Borough'], neighborhoods['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_newyork)  
    
map_newyork

Getting all boroughs so we can treat them separately

In [212]:
neighborhoods['Borough'].unique()

array(['Bronx', 'Manhattan', 'Brooklyn', 'Queens', 'Staten Island'],
      dtype=object)

Getting longitude and latitude of all the New York Boroughs

In [213]:
geolocator = Nominatim(user_agent="ny_explorer")

bronx = neighborhoods[neighborhoods['Borough'] == 'Bronx'].reset_index(drop=True)
address = 'Bronx, NY'
location = geolocator.geocode(address)
bronx_lat = location.latitude
bronx_lon = location.longitude
print('Bronx: {}, {}.'.format(bronx_lat, bronx_lon))

manhattan = neighborhoods[neighborhoods['Borough'] == 'Manhattan'].reset_index(drop=True)
address = 'Manhattan, NY'
location = geolocator.geocode(address)
manhattan_lat = location.latitude
manhattan_lon = location.longitude
print('Manhattan: {}, {}.'.format(manhattan_lat, manhattan_lon))

brooklyn = neighborhoods[neighborhoods['Borough'] == 'Brooklyn'].reset_index(drop=True)
address = 'Brooklyn, NY'
location = geolocator.geocode(address)
brooklyn_lat = location.latitude
brooklyn_lon = location.longitude
print('Brooklyn: {}, {}.'.format(brooklyn_lat, brooklyn_lon))

queens = neighborhoods[neighborhoods['Borough'] == 'Queens'].reset_index(drop=True)
address = 'Queens, NY'
location = geolocator.geocode(address)
queens_lat = location.latitude
queens_lon = location.longitude
print('Queens: {}, {}.'.format(queens_lat, queens_lon))

staten_island = neighborhoods[neighborhoods['Borough'] == 'Staten Island'].reset_index(drop=True)
address = 'Staten Island, NY'
location = geolocator.geocode(address)
staten_island_lat = location.latitude
staten_island_lon = location.longitude
print('Staten Island: {}, {}.'.format(staten_island_lat, staten_island_lon))

Bronx: 40.8466508, -73.8785937.
Manhattan: 40.7896239, -73.9598939.
Brooklyn: 40.6501038, -73.9495823.
Queens: 40.7498243, -73.7976337.
Staten Island: 40.5834557, -74.1496048.
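The five nearly identical blocks above can also be written as a loop. Here is a minimal sketch of the splitting step only, assuming a toy `neighborhoods` dataframe with the same columns as the notebook's. The name `borough_frames` is hypothetical, and geocoding is left out so the example runs offline.

```python
import pandas as pd

# Toy stand-in for the notebook's `neighborhoods` dataframe
# (rows copied from the tables shown in this notebook).
neighborhoods = pd.DataFrame({
    'Borough': ['Bronx', 'Bronx', 'Queens', 'Queens', 'Queens'],
    'Neighborhood': ['Wakefield', 'Co-op City', 'Astoria', 'Woodside', 'Jackson Heights'],
    'Latitude': [40.894705, 40.874294, 40.768509, 40.746349, 40.751981],
    'Longitude': [-73.847201, -73.829939, -73.915654, -73.901842, -73.882821],
})

# One dataframe per borough, each with a fresh 0-based index.
borough_frames = {
    name: frame.reset_index(drop=True)
    for name, frame in neighborhoods.groupby('Borough')
}

for name, frame in borough_frames.items():
    print('{}: {} neighborhoods'.format(name, len(frame)))
```

With this layout, `borough_frames['Queens']` would play the role of the `queens` dataframe used later.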


Plotting a map for each borough in New York so we have a representation of each

In [214]:
def plot_map(data, lat, lon):
    map1 = folium.Map(location=[lat, lon], zoom_start=11)

    for lat, lng, label in zip(data['Latitude'], data['Longitude'], data['Neighborhood']):
        label = folium.Popup(label, parse_html=True)
        folium.CircleMarker(
            [lat, lng],
            radius=5,
            popup=label,
            color='blue',
            fill=True,
            fill_color='#3186cc',
            fill_opacity=0.7,
            parse_html=False).add_to(map1)  
    return map1

In [215]:
bronx_map = plot_map(bronx, bronx_lat, bronx_lon)
bronx_map

In [216]:
manhattan_map = plot_map(manhattan, manhattan_lat, manhattan_lon)
manhattan_map

In [217]:
brooklyn_map = plot_map(brooklyn, brooklyn_lat, brooklyn_lon)
brooklyn_map

In [218]:
queens_map = plot_map(queens, queens_lat, queens_lon)
queens_map

In [219]:
staten_island_map = plot_map(staten_island, staten_island_lat, staten_island_lon)
staten_island_map

## Foursquare

Let's go through all neighborhoods and get the best borough to settle in.

In [220]:
borough = [bronx, manhattan, brooklyn, queens, staten_island]

print("Bronx: {} neighborhoods".format(bronx['Neighborhood'].shape[0]))
print("Manhattan: {} neighborhoods".format(manhattan['Neighborhood'].shape[0]))
print("Brooklyn: {} neighborhoods".format(brooklyn['Neighborhood'].shape[0]))
print("Queens: {} neighborhoods".format(queens['Neighborhood'].shape[0]))
print("Staten Island: {} neighborhoods".format(staten_island['Neighborhood'].shape[0]))

Bronx: 52 neighborhoods
Manhattan: 40 neighborhoods
Brooklyn: 70 neighborhoods
Queens: 81 neighborhoods
Staten Island: 63 neighborhoods


Recall that there are 306 neighborhoods in NYC and that we can make at most 950 calls per day with Foursquare.
That would allow about 3 calls per neighborhood (950/306).
Here we choose to focus only on Queens for technical reasons. With 81 neighborhoods in Queens, we can then make up to 10 calls per neighborhood, which makes 810 calls in total.
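The call budget above can be checked with a few lines of arithmetic. This is only a sketch of the reasoning; the constant names are made up for illustration, and the figures are those quoted in the text.

```python
DAILY_CALL_LIMIT = 950      # daily quota quoted in the text
ALL_NEIGHBORHOODS = 306     # neighborhoods in the whole city
QUEENS_NEIGHBORHOODS = 81   # neighborhoods in Queens

# Whole calls available per neighborhood under the daily quota.
calls_per_neighborhood_city = DAILY_CALL_LIMIT // ALL_NEIGHBORHOODS
calls_per_neighborhood_queens = DAILY_CALL_LIMIT // QUEENS_NEIGHBORHOODS

# Budget if we settle on 10 calls per Queens neighborhood.
planned_calls = 10 * QUEENS_NEIGHBORHOODS

print(calls_per_neighborhood_city, calls_per_neighborhood_queens, planned_calls)
```

The planned 810 calls stay under the 950 limit, which leaves some room for test calls.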

Now let's test a single call:

In [20]:
coffee_shop = '4bf58dd8d48988d1e0931735'
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&query={}&ll={},{}&radius={}&limit={}'.format(
        CLIENT_ID, 
        CLIENT_SECRET, 
        VERSION,
        'coffee',
        queens.loc[0, 'Latitude'], 
        queens.loc[0, 'Longitude'],
        500, 
        1)  

results = requests.get(url).json()
results

{'meta': {'code': 200, 'requestId': '5e4bf0b51d67cb001b2de7ac'},
 'response': {'suggestedFilters': {'header': 'Tap to show:',
   'filters': [{'name': 'Open now', 'key': 'openNow'}]},
  'headerLocation': 'Astoria',
  'headerFullLocation': 'Astoria, Queens',
  'headerLocationGranularity': 'neighborhood',
  'query': 'coffee',
  'totalResults': 20,
  'suggestedBounds': {'ne': {'lat': 40.773008597854925,
    'lng': -73.90972309237958},
   'sw': {'lat': 40.76400858885492, 'lng': -73.9215843937051}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '53c91530498e10c40af10721',
       'name': 'Starbucks',
       'location': {'address': '30-18 Astoria Blvd',
        'crossStreet': 'at 31st St',
        'lat': 40.77003,
        'lng': -73.91851,
        'labeledLatLngs': [{'label': '
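Rather than reading the whole response by eye, a small check can confirm that a call succeeded. The sketch below assumes only the payload shape visible in the output above (`meta.code` and `response.groups`). The helper names `build_explore_params` and `is_valid_response` are hypothetical, and the sample payload is hand-written, not a real API response.

```python
def build_explore_params(client_id, client_secret, version, query, lat, lon, radius, limit):
    """Return the query parameters of the explore call as a dict."""
    return {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'query': query,
        'll': '{},{}'.format(lat, lon),
        'radius': radius,
        'limit': limit,
    }


def is_valid_response(payload):
    """True when the payload reports code 200 and contains at least one group."""
    code = payload.get('meta', {}).get('code')
    groups = payload.get('response', {}).get('groups', [])
    return code == 200 and len(groups) > 0


# Hand-written payload shaped like the output shown above.
sample = {'meta': {'code': 200}, 'response': {'groups': [{'items': []}]}}
params = build_explore_params('ID', 'SECRET', '20180604', 'coffee',
                              40.768509, -73.915654, 500, 1)
print(params['ll'], is_valid_response(sample))
```

A dict like `params` can be passed to `requests.get(url, params=params)`, which avoids formatting the query string by hand.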

Now let's call some of the coffee shops from Queens in NY

In [21]:
LIMIT = 10
radius = 500

url = []
res = []

coffee_shop = 'coffee'
for index, row in queens.iterrows():
    
    curr_url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&query={}&ll={},{}&radius={}&limit={}'.format(
        CLIENT_ID, 
        CLIENT_SECRET, 
        VERSION,
        coffee_shop,
        row['Latitude'], 
        row['Longitude'],
        radius, 
        LIMIT)
    
    url.append(curr_url)
    
    res.append(requests.get(curr_url).json())


In [91]:
res[0]

{'meta': {'code': 200, 'requestId': '5e4bf156216785001cbddaee'},
 'response': {'suggestedFilters': {'header': 'Tap to show:',
   'filters': [{'name': 'Open now', 'key': 'openNow'}]},
  'headerLocation': 'Astoria',
  'headerFullLocation': 'Astoria, Queens',
  'headerLocationGranularity': 'neighborhood',
  'query': 'coffee',
  'totalResults': 20,
  'suggestedBounds': {'ne': {'lat': 40.773008597854925,
    'lng': -73.90972309237958},
   'sw': {'lat': 40.76400858885492, 'lng': -73.9215843937051}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '53c91530498e10c40af10721',
       'name': 'Starbucks',
       'location': {'address': '30-18 Astoria Blvd',
        'crossStreet': 'at 31st St',
        'lat': 40.77003,
        'lng': -73.91851,
        'labeledLatLngs': [{'label': '

Let's clean the JSON responses

In [221]:
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

In [222]:
nearby_venues_list = []
filtered_columns = ['venue.location.city','venue.name','venue.categories','venue.location.lat','venue.location.lng']

i = 0
for response in res:  # avoid shadowing the imported `re` module
    venues = response['response']['groups'][0]['items']
    if not venues:
        i += 1
    else:
        nearby_venues = json_normalize(venues)
        nearby_venues = nearby_venues.loc[:, filtered_columns]

        nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)
        nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]
        nearby_venues_list.append(nearby_venues)
    
nb_venues = 0
for nv in nearby_venues_list:
    nb_venues += nv.shape[0]
    
print('{} venues were returned by Foursquare.'.format(nb_venues), " and {} json element had empty items and could not be treated which makes {} treated elements.".format(i, nb_venues-i))

nearby_venues_list[0].head()

240 venues were returned by Foursquare.  and 16 json element had empty items and could not be treated which makes 224 treated elements.
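The message above subtracts the number of empty responses from the number of venues, which mixes two different units. A sketch of a summary that keeps them apart is shown below; the function name `summarize_responses` is hypothetical and the input is hand-made, shaped like the Foursquare responses in `res`.

```python
def summarize_responses(responses):
    """Count responses with and without venues, and the total number of venues."""
    empty = 0
    venues = 0
    for response in responses:
        items = response['response']['groups'][0]['items']
        if items:
            venues += len(items)
        else:
            empty += 1
    return {
        'responses': len(responses),
        'empty': empty,
        'non_empty': len(responses) - empty,
        'venues': venues,
    }


# Hand-made stand-ins for API responses: two with venues, one empty.
fake_responses = [
    {'response': {'groups': [{'items': [{'venue': {}}, {'venue': {}}]}]}},
    {'response': {'groups': [{'items': []}]}},
    {'response': {'groups': [{'items': [{'venue': {}}]}]}},
]
summary = summarize_responses(fake_responses)
print(summary)
```

Applied to the run above (81 Queens requests, 16 of them empty), this would report 65 non-empty responses carrying the 240 venues.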


Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.

See the documentation here:
https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike
  return self._getitem_tuple(key)


Unnamed: 0,city,name,categories,lat,lng
0,Astoria,Starbucks,Coffee Shop,40.77003,-73.91851
1,Astoria,Café Via Espresso,Café,40.768676,-73.910851
2,Queens,Gossip Coffee And Cocktails,Coffee Shop,40.764522,-73.916462
3,Astoria,New York City Bagel & Coffee House,Bagel Shop,40.76597,-73.919495
4,Astoria,Café To Go Creperie,Café,40.766831,-73.920441


In [223]:
print(nearby_venues_list[0]['city'])

0    Astoria
1    Astoria
2     Queens
3    Astoria
4    Astoria
5    Astoria
6     Queens
7    Astoria
8    Astoria
9    Astoria
Name: city, dtype: object


Let's get the number of coffee venues per neighborhood

In [224]:
# Count the venues with a known city in each non-empty response
for ven in nearby_venues_list:
    print("There are {} venues in {} !".format(ven['city'].count(), ven['city'][0]))

There are 10 venues in Astoria !
There are 10 venues in Woodside !
There are 6 venues in Jackson Heights !
There are 10 venues in Jackson Heights !
There are 4 venues in Howard Beach !
There are 2 venues in Corona !
There are 2 venues in Forest Hills !
There are 5 venues in Kew Gardens !
There are 5 venues in Queens !
There are 10 venues in Flushing !
There are 10 venues in Long Island City !
There are 5 venues in Sunnyside !
There are 4 venues in East Elmhurst !
There are 3 venues in Maspeth !
There are 6 venues in Ridgewood !
There are 1 venues in Glendale !
There are 8 venues in Rego Park !
There are 1 venues in Woodhaven !
There are 4 venues in Jamaica !
There are 1 venues in South Ozone Park !
There are 2 venues in College Point !
There are 8 venues in Flushing !
There are 2 venues in Flushing !
There are 5 venues in Little Neck !
There are 2 venues in Little Neck !
There are 2 venues in New Hyde Park !
There are 2 venues in Bellerose !
There are 1 venues in Flushing !
There are 3
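The counts printed above can be aggregated into the top 5 mentioned in the introduction. This is a rough sketch using only the first lines of that output. Note that the labels are Foursquare `city` values rather than neighborhood names, which is why some of them appear more than once.

```python
from collections import Counter

# (city, venue count) pairs copied from the first lines of the output above.
counts_from_output = [
    ('Astoria', 10), ('Woodside', 10), ('Jackson Heights', 6),
    ('Jackson Heights', 10), ('Howard Beach', 4), ('Corona', 2),
    ('Forest Hills', 2), ('Kew Gardens', 5), ('Queens', 5),
    ('Flushing', 10), ('Long Island City', 10),
]

# Sum the counts of labels that appear several times.
totals = Counter()
for city, count in counts_from_output:
    totals[city] += count

top5 = totals.most_common(5)
for city, count in top5:
    print('{}: {} venues'.format(city, count))
```

Because each request was capped at 10 results, these totals are lower bounds on venue density rather than exact counts.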

In [225]:
nearby_venues_list[0]['city']

0    Astoria
1    Astoria
2     Queens
3    Astoria
4    Astoria
5    Astoria
6     Queens
7    Astoria
8    Astoria
9    Astoria
Name: city, dtype: object

In [226]:
nearby_venues_list[0]

Unnamed: 0,city,name,categories,lat,lng
0,Astoria,Starbucks,Coffee Shop,40.77003,-73.91851
1,Astoria,Café Via Espresso,Café,40.768676,-73.910851
2,Queens,Gossip Coffee And Cocktails,Coffee Shop,40.764522,-73.916462
3,Astoria,New York City Bagel & Coffee House,Bagel Shop,40.76597,-73.919495
4,Astoria,Café To Go Creperie,Café,40.766831,-73.920441
5,Astoria,Higher Grounds Astoria,Coffee Shop,40.770629,-73.9205
6,Queens,Blvd Bagel Cafe,Bagel Shop,40.770279,-73.918466
7,Astoria,Dunkin',Donut Shop,40.770407,-73.918029
8,Astoria,Cafe Istanbul Mediterranean Restaurant and Hoo...,Café,40.766661,-73.91237
9,Astoria,Dunkin',Donut Shop,40.766784,-73.920645


In [227]:
nearby_venues = pd.concat(nearby_venues_list)
nearby_venues.head()

Unnamed: 0,city,name,categories,lat,lng
0,Astoria,Starbucks,Coffee Shop,40.77003,-73.91851
1,Astoria,Café Via Espresso,Café,40.768676,-73.910851
2,Queens,Gossip Coffee And Cocktails,Coffee Shop,40.764522,-73.916462
3,Astoria,New York City Bagel & Coffee House,Bagel Shop,40.76597,-73.919495
4,Astoria,Café To Go Creperie,Café,40.766831,-73.920441


Let's check how many venue categories exist

In [228]:
print('There are {} unique categories.'.format(len(nearby_venues['categories'].unique())))
print('These categories are {}'.format(nearby_venues['categories'].unique()))

There are 26 unique categories.
These categories are ['Coffee Shop' 'Café' 'Bagel Shop' 'Donut Shop' 'Pub' 'Tea Room'
 'Latin American Restaurant' 'Bubble Tea Shop' 'Chinese Restaurant'
 'Italian Restaurant' 'Convenience Store' 'Bakery' 'Japanese Restaurant'
 'Bar' 'Dessert Shop' 'Greek Restaurant' 'Ice Cream Shop'
 'Vegetarian / Vegan Restaurant' 'Juice Bar' 'Shopping Mall' 'Art Gallery'
 'College Cafeteria' 'French Restaurant' 'Mediterranean Restaurant'
 'Bowling Alley' 'Smoothie Shop']


## Analyzing Neighborhoods

One-hot encoding the categories

In [229]:
ny_onehot = pd.get_dummies(nearby_venues[['categories']], prefix="", prefix_sep="")
ny_onehot['city'] = nearby_venues['city']
fixed_columns = [ny_onehot.columns[-1]] + list(ny_onehot.columns[:-1])
ny_onehot = ny_onehot[fixed_columns]

ny_onehot.head()

Unnamed: 0,city,Art Gallery,Bagel Shop,Bakery,Bar,Bowling Alley,Bubble Tea Shop,Café,Chinese Restaurant,Coffee Shop,...,Italian Restaurant,Japanese Restaurant,Juice Bar,Latin American Restaurant,Mediterranean Restaurant,Pub,Shopping Mall,Smoothie Shop,Tea Room,Vegetarian / Vegan Restaurant
0,Astoria,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1,Astoria,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Queens,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3,Astoria,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Astoria,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0


Check the size

In [230]:
ny_onehot.shape

(240, 27)

Group rows per neighborhood

In [231]:
ny_grouped = ny_onehot.groupby('city').mean().reset_index()
ny_grouped

Unnamed: 0,city,Art Gallery,Bagel Shop,Bakery,Bar,Bowling Alley,Bubble Tea Shop,Café,Chinese Restaurant,Coffee Shop,...,Italian Restaurant,Japanese Restaurant,Juice Bar,Latin American Restaurant,Mediterranean Restaurant,Pub,Shopping Mall,Smoothie Shop,Tea Room,Vegetarian / Vegan Restaurant
0,Arverne,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.333333,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.333333,0.0,0.0
1,Astoria,0.0,0.066667,0.0,0.0,0.0,0.0,0.4,0.0,0.266667,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Bayside,0.0,0.111111,0.111111,0.0,0.0,0.0,0.222222,0.0,0.111111,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Belle Harbor,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Bellerose,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,Broad Channel,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,Brooklyn,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,College Point,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,Corona,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,East Elmhurst,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.666667,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Getting the new size

In [232]:
ny_grouped.shape

(45, 27)
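To make the two steps above concrete, here is a small sketch of one-hot encoding followed by a group mean. It uses the first five venues listed earlier as toy data. The mean of a 0/1 column within a group is the share of that group's venues in the category, so each row sums to 1.

```python
import pandas as pd

# First five venues from the table shown earlier (city and category only).
venues = pd.DataFrame({
    'city': ['Astoria', 'Astoria', 'Queens', 'Astoria', 'Astoria'],
    'categories': ['Coffee Shop', 'Café', 'Coffee Shop', 'Bagel Shop', 'Café'],
})

# One column per category, 1 where the venue belongs to it.
onehot = pd.get_dummies(venues[['categories']], prefix='', prefix_sep='').astype(int)
onehot.insert(0, 'city', venues['city'])

# Averaging the indicators per city yields category frequencies.
grouped = onehot.groupby('city').mean().reset_index()
print(grouped)
```

This also explains the shrinking row count: 240 venue rows collapse into one row per distinct `city` value.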

Let's get the top 5 venue categories for each neighborhood

In [233]:
num_top_venues = 5

for hood in ny_grouped['city']:
    print("----"+hood+"----")
    temp = ny_grouped[ny_grouped['city'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Arverne----
              venue  freq
0     Smoothie Shop  0.33
1       Coffee Shop  0.33
2        Donut Shop  0.33
3       Art Gallery  0.00
4  Greek Restaurant  0.00


----Astoria----
         venue  freq
0         Café  0.40
1  Coffee Shop  0.27
2   Donut Shop  0.27
3   Bagel Shop  0.07
4  Art Gallery  0.00


----Bayside----
         venue  freq
0   Donut Shop  0.44
1         Café  0.22
2       Bakery  0.11
3  Coffee Shop  0.11
4   Bagel Shop  0.11


----Belle Harbor----
              venue  freq
0        Donut Shop   1.0
1       Art Gallery   0.0
2  Greek Restaurant   0.0
3          Tea Room   0.0
4     Smoothie Shop   0.0


----Bellerose----
              venue  freq
0              Café   0.5
1        Donut Shop   0.5
2       Art Gallery   0.0
3  Greek Restaurant   0.0
4          Tea Room   0.0


----Broad Channel----
              venue  freq
0              Café   1.0
1       Art Gallery   0.0
2  Greek Restaurant   0.0
3          Tea Room   0.0
4     Smoothie Shop   0.0


---

A function to put the most common venue categories in a dataframe

In [234]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [235]:
num_top_venues = 10
indicators = ['st', 'nd', 'rd']

columns = ['city']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['city'] = ny_grouped['city']

for ind in np.arange(ny_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(ny_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,city,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Arverne,Donut Shop,Smoothie Shop,Coffee Shop,Dessert Shop,Bagel Shop,Bakery,Bar,Bowling Alley,Bubble Tea Shop,Café
1,Astoria,Café,Donut Shop,Coffee Shop,Bagel Shop,Tea Room,Bakery,Bar,Bowling Alley,Bubble Tea Shop,Chinese Restaurant
2,Bayside,Donut Shop,Café,Bagel Shop,Bakery,Coffee Shop,Tea Room,Bar,Bowling Alley,Bubble Tea Shop,Chinese Restaurant
3,Belle Harbor,Donut Shop,Tea Room,Bagel Shop,Bakery,Bar,Bowling Alley,Bubble Tea Shop,Café,Chinese Restaurant,Coffee Shop
4,Bellerose,Donut Shop,Café,Tea Room,Bagel Shop,Bakery,Bar,Bowling Alley,Bubble Tea Shop,Chinese Restaurant,Coffee Shop
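In the table above, many entries (for example "Tea Room" for Belle Harbor) have a frequency of zero and appear only because ties between zeros are broken by sort order. A possible variant that skips them is sketched below; `top_nonzero_categories` is a hypothetical helper, not part of the notebook, and it assumes a numeric Series (in the notebook the row would first need `.astype(float)`).

```python
import pandas as pd


def top_nonzero_categories(frequencies, num_top_venues):
    """Return the most frequent categories, skipping those with zero frequency."""
    present = frequencies[frequencies > 0].sort_values(ascending=False)
    top = list(present.index[:num_top_venues])
    # Pad with None so every neighborhood yields the same number of entries.
    return top + [None] * (num_top_venues - len(top))


# Frequencies as printed for Bellerose earlier, plus two zero-frequency categories.
bellerose = pd.Series({'Art Gallery': 0.0, 'Café': 0.5, 'Donut Shop': 0.5, 'Tea Room': 0.0})
print(top_nonzero_categories(bellerose, 3))
```

Padding with `None` keeps the table rectangular while making it clear that a neighborhood has fewer than ten observed categories.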


We can see that donut shops are very popular and we might choose to open a donut shop instead of a café.

## Clustering

In [236]:
k_clusters = 4
ny_grouped_clustering = ny_grouped.drop(columns='city')
kmeans = KMeans(n_clusters=k_clusters, random_state=0).fit(ny_grouped_clustering)

kmeans.labels_[0:10] 

array([3, 3, 3, 0, 3, 2, 1, 3, 3, 1])
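The number of clusters is fixed by hand at 4 above. One common way to sanity-check such a choice is to compare silhouette scores across several values of k. The sketch below does this on synthetic data; `features` is a made-up stand-in for `ny_grouped_clustering`, so the resulting k says nothing about the Queens data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in: three well-separated blobs of 15 points each.
rng = np.random.RandomState(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
features = np.vstack([center + 0.3 * rng.randn(15, 2) for center in centers])

# Fit k-means for several k and record the mean silhouette score of each fit.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(features)
    scores[k] = silhouette_score(features, labels)

best_k = max(scores, key=scores.get)
print(best_k)
```

On the real frequency table the scores may be close to each other, in which case the choice of k remains a judgment call.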

Let's check the top 10 for each neighborhood, including its cluster

In [237]:
neighborhoods_venues_sorted = neighborhoods_venues_sorted.rename(columns = {'city':'Neighborhood'})
neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Arverne,Donut Shop,Smoothie Shop,Coffee Shop,Dessert Shop,Bagel Shop,Bakery,Bar,Bowling Alley,Bubble Tea Shop,Café
1,Astoria,Café,Donut Shop,Coffee Shop,Bagel Shop,Tea Room,Bakery,Bar,Bowling Alley,Bubble Tea Shop,Chinese Restaurant
2,Bayside,Donut Shop,Café,Bagel Shop,Bakery,Coffee Shop,Tea Room,Bar,Bowling Alley,Bubble Tea Shop,Chinese Restaurant
3,Belle Harbor,Donut Shop,Tea Room,Bagel Shop,Bakery,Bar,Bowling Alley,Bubble Tea Shop,Café,Chinese Restaurant,Coffee Shop
4,Bellerose,Donut Shop,Café,Tea Room,Bagel Shop,Bakery,Bar,Bowling Alley,Bubble Tea Shop,Chinese Restaurant,Coffee Shop


In [238]:
neighborhoods.head()

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,Bronx,Wakefield,40.894705,-73.847201
1,Bronx,Co-op City,40.874294,-73.829939
2,Bronx,Eastchester,40.887556,-73.827806
3,Bronx,Fieldston,40.895437,-73.905643
4,Bronx,Riverdale,40.890834,-73.912585


In [239]:
neigh = neighborhoods.loc[neighborhoods['Borough'] == 'Queens']
neigh = neigh[['Neighborhood','Latitude','Longitude']]

neigh.head()

Unnamed: 0,Neighborhood,Latitude,Longitude
129,Astoria,40.768509,-73.915654
130,Woodside,40.746349,-73.901842
131,Jackson Heights,40.751981,-73.882821
132,Elmhurst,40.744049,-73.881656
133,Howard Beach,40.654225,-73.838138


In [240]:
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

In [248]:
ny_merged = neigh
ny_merged = ny_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

ny_merged.head()

Unnamed: 0,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
129,Astoria,40.768509,-73.915654,3.0,Café,Donut Shop,Coffee Shop,Bagel Shop,Tea Room,Bakery,Bar,Bowling Alley,Bubble Tea Shop,Chinese Restaurant
130,Woodside,40.746349,-73.901842,3.0,Donut Shop,Café,Pub,Coffee Shop,Tea Room,Dessert Shop,Bagel Shop,Bakery,Bar,Bowling Alley
131,Jackson Heights,40.751981,-73.882821,3.0,Donut Shop,Café,Latin American Restaurant,Bubble Tea Shop,Coffee Shop,Dessert Shop,Bagel Shop,Bakery,Bar,Bowling Alley
132,Elmhurst,40.744049,-73.881656,3.0,Donut Shop,Bubble Tea Shop,Café,Chinese Restaurant,Tea Room,Bagel Shop,Bakery,Bar,Bowling Alley,Coffee Shop
133,Howard Beach,40.654225,-73.838138,3.0,Donut Shop,Café,Italian Restaurant,Convenience Store,Dessert Shop,Bagel Shop,Bakery,Bar,Bowling Alley,Bubble Tea Shop


In [249]:
ny_merged.dropna(inplace=True)

In [250]:
ny_merged['Cluster Labels'] = ny_merged['Cluster Labels'].astype(int)

Let's visualize

In [252]:
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

x = np.arange(k_clusters)
ys = [i + x + (i*x)**2 for i in range(k_clusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

markers_colors = []
for lat, lon, poi, cluster in zip(ny_merged['Latitude'], ny_merged['Longitude'], ny_merged['Neighborhood'], ny_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
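The map above indexes colors with `rainbow[cluster-1]`, so label 0 wraps around to the last color. The colors stay distinct, but the offset is easy to misread. A minimal alternative, indexing directly by label, is sketched below; `cluster_palette` is a hypothetical helper built on the standard-library `colorsys` module rather than matplotlib.

```python
import colorsys


def cluster_palette(k):
    """Return k evenly spaced hues as '#rrggbb' strings."""
    palette = []
    for i in range(k):
        r, g, b = colorsys.hsv_to_rgb(i / k, 0.8, 0.9)
        palette.append('#{:02x}{:02x}{:02x}'.format(
            int(r * 255), int(g * 255), int(b * 255)))
    return palette


palette = cluster_palette(4)
# Index directly by label so that cluster 0 gets the first color.
for label in range(4):
    print(label, palette[label])
```

The resulting strings can be passed as `color` and `fill_color` to `folium.CircleMarker`, as in the cell above.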

## Clusters Examination

### Cluster 1

In [259]:
ny_merged.loc[ny_merged['Cluster Labels'] == 0, ny_merged.columns[[0] + list(range(4, ny_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
146,Woodhaven,Donut Shop,Tea Room,Bagel Shop,Bakery,Bar,Bowling Alley,Bubble Tea Shop,Café,Chinese Restaurant,Coffee Shop
148,South Ozone Park,Donut Shop,Tea Room,Bagel Shop,Bakery,Bar,Bowling Alley,Bubble Tea Shop,Café,Chinese Restaurant,Coffee Shop
150,Whitestone,Donut Shop,Tea Room,Bagel Shop,Bakery,Bar,Bowling Alley,Bubble Tea Shop,Café,Chinese Restaurant,Coffee Shop
155,Glen Oaks,Donut Shop,Tea Room,Bagel Shop,Bakery,Bar,Bowling Alley,Bubble Tea Shop,Café,Chinese Restaurant,Coffee Shop
162,Queens Village,Donut Shop,Tea Room,Bagel Shop,Bakery,Bar,Bowling Alley,Bubble Tea Shop,Café,Chinese Restaurant,Coffee Shop
163,Hollis,Donut Shop,Tea Room,Bagel Shop,Bakery,Bar,Bowling Alley,Bubble Tea Shop,Café,Chinese Restaurant,Coffee Shop
190,Belle Harbor,Donut Shop,Tea Room,Bagel Shop,Bakery,Bar,Bowling Alley,Bubble Tea Shop,Café,Chinese Restaurant,Coffee Shop


### Cluster 2

In [260]:
ny_merged.loc[ny_merged['Cluster Labels'] == 1, ny_merged.columns[[0] + list(range(4, ny_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
136,Kew Gardens,Coffee Shop,Donut Shop,Café,Tea Room,Bagel Shop,Bakery,Bar,Bowling Alley,Bubble Tea Shop,Chinese Restaurant
139,Long Island City,Coffee Shop,Donut Shop,Café,Bar,Tea Room,Bagel Shop,Bakery,Bowling Alley,Bubble Tea Shop,Chinese Restaurant
140,Sunnyside,Coffee Shop,Donut Shop,Italian Restaurant,Dessert Shop,Bagel Shop,Bakery,Bar,Bowling Alley,Bubble Tea Shop,Café
141,East Elmhurst,Coffee Shop,Donut Shop,Tea Room,Bagel Shop,Bakery,Bar,Bowling Alley,Bubble Tea Shop,Café,Chinese Restaurant
142,Maspeth,Coffee Shop,Donut Shop,Tea Room,Bagel Shop,Bakery,Bar,Bowling Alley,Bubble Tea Shop,Café,Chinese Restaurant
144,Glendale,Coffee Shop,Vegetarian / Vegan Restaurant,Tea Room,Bagel Shop,Bakery,Bar,Bowling Alley,Bubble Tea Shop,Café,Chinese Restaurant
170,Far Rockaway,Donut Shop,Coffee Shop,Tea Room,Bagel Shop,Bakery,Bar,Bowling Alley,Bubble Tea Shop,Café,Chinese Restaurant
180,Murray Hill,Coffee Shop,Vegetarian / Vegan Restaurant,Tea Room,Bagel Shop,Bakery,Bar,Bowling Alley,Bubble Tea Shop,Café,Chinese Restaurant


### Cluster 3

In [261]:
ny_merged.loc[ny_merged['Cluster Labels'] == 2, ny_merged.columns[[0] + list(range(4, ny_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
147,Ozone Park,Bakery,Café,Vegetarian / Vegan Restaurant,Tea Room,Bagel Shop,Bar,Bowling Alley,Bubble Tea Shop,Chinese Restaurant,Coffee Shop
167,Springfield Gardens,Café,Vegetarian / Vegan Restaurant,Tea Room,Bagel Shop,Bakery,Bar,Bowling Alley,Bubble Tea Shop,Chinese Restaurant,Coffee Shop
171,Broad Channel,Café,Vegetarian / Vegan Restaurant,Tea Room,Bagel Shop,Bakery,Bar,Bowling Alley,Bubble Tea Shop,Chinese Restaurant,Coffee Shop
290,Middle Village,Bakery,Café,Vegetarian / Vegan Restaurant,Tea Room,Bagel Shop,Bar,Bowling Alley,Bubble Tea Shop,Chinese Restaurant,Coffee Shop


### Cluster 4

In [262]:
ny_merged.loc[ny_merged['Cluster Labels'] == 3, ny_merged.columns[[0] + list(range(4, ny_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
129,Astoria,Café,Donut Shop,Coffee Shop,Bagel Shop,Tea Room,Bakery,Bar,Bowling Alley,Bubble Tea Shop,Chinese Restaurant
130,Woodside,Donut Shop,Café,Pub,Coffee Shop,Tea Room,Dessert Shop,Bagel Shop,Bakery,Bar,Bowling Alley
131,Jackson Heights,Donut Shop,Café,Latin American Restaurant,Bubble Tea Shop,Coffee Shop,Dessert Shop,Bagel Shop,Bakery,Bar,Bowling Alley
132,Elmhurst,Donut Shop,Bubble Tea Shop,Café,Chinese Restaurant,Tea Room,Bagel Shop,Bakery,Bar,Bowling Alley,Coffee Shop
133,Howard Beach,Donut Shop,Café,Italian Restaurant,Convenience Store,Dessert Shop,Bagel Shop,Bakery,Bar,Bowling Alley,Bubble Tea Shop
134,Corona,Donut Shop,Café,Tea Room,Bagel Shop,Bakery,Bar,Bowling Alley,Bubble Tea Shop,Chinese Restaurant,Coffee Shop
135,Forest Hills,Donut Shop,Café,Convenience Store,Tea Room,Bagel Shop,Bakery,Bar,Bowling Alley,Bubble Tea Shop,Chinese Restaurant
137,Richmond Hill,Donut Shop,Café,Coffee Shop,Tea Room,Bagel Shop,Bakery,Bar,Bowling Alley,Bubble Tea Shop,Chinese Restaurant
138,Flushing,Coffee Shop,Donut Shop,Café,Bakery,Shopping Mall,Bubble Tea Shop,Japanese Restaurant,Greek Restaurant,Tea Room,Convenience Store
143,Ridgewood,Donut Shop,Café,Dessert Shop,Tea Room,Bagel Shop,Bakery,Bar,Bowling Alley,Bubble Tea Shop,Chinese Restaurant


As this section shows, I first tried to find the best borough in which to set up our business. I had to clean the data several times because the JSON returned by Foursquare did not always contain all the information. I then used k-means clustering to group neighborhoods with similar venue profiles, so that we know which venue categories similar to a café are the most popular.

## Results and Discussion

As a result, Queens is our choice for starting the new business, because coffee shops are very popular there. Donut shops turned out to be the most popular category, with coffee shops close behind. I would therefore propose to my boss a hybrid shop selling both coffee and donuts.

## Conclusion

To conclude, simple cafés are not that popular: people prefer coffee shops or donut shops, so we might build our business around these types of venues. We would settle in Queens because it is more efficient. According to the clusters, donut shops are very popular, so we can position our business as a donut shop that also serves the best coffee, and we would settle in the north of Queens, because that is where most of these businesses operate.