#  *The best locations where it pays to open a restaurant in New York*

## Introduction/Business Problem

Restaurants are one of the most profitable sectors. However, according to one study, 60 percent close or change owners within the first year of operation, and 80 percent fail within five years. Restaurants usually fail because of a combination of problems that eventually leads to their closure, and a bad location is one of the biggest reasons. For example, a restaurant can sell the best burger in the world, but if it is in a poor location (hidden or poorly visible, sparsely inhabited, difficult to access), it will have to put much more effort into attracting customers than into serving them.

In this context, how can we identify the best locations where it pays to open a restaurant?

Our objective is to recommend the best locations in New York City to open a restaurant: well inhabited, close to subway stations, and distant from existing restaurants. We do not distinguish between kinds of restaurant.

This exercise is the final capstone project for the "IBM Data Science" course on Coursera, and it also showcases my data science skills in a real-world application.



## Project Data Source

The data required for this project come from four different sources:

- Coordinates of the boundaries of Neighborhood Tabulation Areas (NTA) in New York from https://www1.nyc.gov/site/planning/data-maps/open-data/dwn-nynta.page

- Population numbers by New York City Neighborhood Tabulation Area (NTA) from https://data.cityofnewyork.us. The link to the CSV file is https://query.data.world/s/zdkpdxvomgauu4r3jvymhy57mwtolg

- Location data of New York City subway stations from https://data.cityofnewyork.us. The link to the CSV file is https://query.data.world/s/rttrjnk7raatdri6ialljpsucvbv5b
It helps determine the minimal distance from an NTA to a subway station and the number of subway stations located within a given radius.

- Location data of restaurants provided by the Foursquare API. It helps determine the minimal distance from an NTA to a restaurant and the number of restaurants located within a given radius.

These data required substantial pre-processing to turn them into a working set suitable for the machine learning algorithms and visualizations applied to it.

We therefore generate a dataframe with one row per NTA and the following columns:
* longitude and latitude
* population
* minimal distance from a neighborhood location to a subway station
* number of subway stations located within a given radius
* minimal distance from a neighborhood location to a restaurant
* number of restaurants located within a given radius


The best locations are those that have few or no restaurants, are close to subway stations, and are well inhabited.
NTA boundaries and their associated names may not definitively represent neighborhoods. In this exercise, we treat the center of each NTA as a district.






In [662]:
import wget
import json # library to handle JSON files
import numpy as np # numerical operations, used below for the district centers
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe (pandas >= 1.0)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans
import folium # map rendering library
from bs4 import BeautifulSoup # library to scrape websites

import geopandas as gpd

print('Libraries imported.')

Libraries imported.


# Data visualization and pre-processing

## Let's load the Population Numbers By New York City Neighborhood Tabulation Areas (NTA) dataset from a CSV file
- NTA Name: The name of Neighborhood Tabulation Areas. 
- Population: Population number

We create "df_pop" dataframe

In [663]:
df_pop = pd.read_csv('https://query.data.world/s/zdkpdxvomgauu4r3jvymhy57mwtolg')
df_pop = df_pop[df_pop["Year"]==2010].reset_index(drop = True)
df_pop.rename(columns = {'NTA Code':'DistrictCode'}, inplace =True)
df_pop.head()

Unnamed: 0,Borough,Year,FIPS County Code,DistrictCode,NTA Name,Population
0,Bronx,2010,5,BX01,Claremont-Bathgate,31078
1,Bronx,2010,5,BX03,Eastchester-Edenwald-Baychester,34517
2,Bronx,2010,5,BX05,Bedford Park-Fordham North,54415
3,Bronx,2010,5,BX06,Belmont,27378
4,Bronx,2010,5,BX07,Bronxdale,35538


As we need the coordinates of the NTAs, we download them and save them in GeoJSON format.

In [665]:
filename2 = wget.download("https://services5.arcgis.com/GfwWNkhOj9bNBqoJ/ArcGIS/rest/services/NYC_Neighborhood_Tabulation_Areas/FeatureServer/0/query?where=1=1&outFields=*&outSR=4326&f=pgeojson")

100% [..........................................................................] 6206903 / 6206903

In [666]:
with open(filename2) as json_data:
    newyork_puma = json.load(json_data)

## NTAs are Polygons or MultiPolygons. Our strategy is to determine the centroid of each feature, which we will call a "District"

- A function that determines the coordinates of the point representing the center of a polygon or multipolygon, approximated here as the average of its vertices

In [667]:
def centeroid(arr):
    length = len(arr)
    sum_x = sum([r[0] for r in arr])
    sum_y =  sum([r[1] for r in arr])
    return sum_x/length, sum_y/length
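The function above averages the ring's vertices, so a boundary that is sampled more densely on one side pulls the center toward that side. As a point of comparison only, here is a minimal sketch of an area-weighted centroid based on the shoelace formula. The names `vertex_mean` and `area_centroid` and the sample ring are illustrative and are not part of the notebook; coordinates are treated as planar, which is an approximation for longitude/latitude.

```python
def vertex_mean(ring):
    """Average of the ring's vertices (what `centeroid` above computes)."""
    n = len(ring)
    return sum(p[0] for p in ring) / n, sum(p[1] for p in ring) / n


def area_centroid(ring):
    """Area-weighted centroid of a simple polygon ring (shoelace formula).

    `ring` is a list of (x, y) pairs; a repeated closing vertex is optional.
    """
    pts = list(ring)
    if pts[0] != pts[-1]:
        pts.append(pts[0])  # close the ring
    twice_area = cx = cy = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        cross = x0 * y1 - x1 * y0
        twice_area += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    if twice_area == 0:
        return vertex_mean(ring)  # degenerate ring: fall back to the vertex mean
    return cx / (3 * twice_area), cy / (3 * twice_area)


# Hypothetical ring: a 4 x 2 rectangle whose right edge is densely sampled.
ring = [(0, 0), (4, 0), (4, 0.5), (4, 1), (4, 1.5), (4, 2), (0, 2)]
print(vertex_mean(ring))    # pulled toward the densely sampled edge
print(area_centroid(ring))  # geometric center of the rectangle
```

For compact NTAs the two results should be close; the difference matters mostly for elongated shapes or shapes with unevenly detailed boundaries.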

Based on the GeoJSON file, we generate the "districts" dataframe that contains:
- Borough
- DistrictName: The name of the NTA 
- DistrictCode: The code of the NTA
- Latitude and Longitude of the center of each NTA

In [668]:
districts_data = newyork_puma['features']
# define the dataframe columns
column_names = ['Borough', 'DistrictCode', 'DistrictName', 'Latitude', 'Longitude'] 

# collect one row per NTA, then build the dataframe once
# (DataFrame.append was removed in pandas 2.0)
rows = []
for data in districts_data:
    borough = data['properties']['BoroName'] 
    DistrictCode = data['properties']['NTACode']       
    DistrictName = data['properties']['NTAName']
    d = data["geometry"]["coordinates"]
    if len(d)==1: 
        # a single ring: average its vertices
        Longitude, Latitude = np.mean(np.array(d[0]), axis=0)
    if len(d)>1 and data["geometry"]["type"]=="Polygon":
        # several rings: average the centers of the rings
        z = [centeroid(f) for f in d]
        Longitude, Latitude  = np.mean(np.array(z), axis=0)
    if len(d)>1 and data["geometry"]["type"]=="MultiPolygon":
        # several polygons: only the rings of the first polygon are used
        z = [centeroid(f) for f in d[0]]
        Longitude, Latitude  = np.mean(np.array(z), axis=0)
    
    rows.append({'Borough': borough,
                 'DistrictCode': DistrictCode,
                 'DistrictName': DistrictName,
                 'Latitude': Latitude,
                 'Longitude': Longitude})
districts = pd.DataFrame(rows, columns=column_names)
districts.head(3)

Unnamed: 0,Borough,DistrictCode,DistrictName,Latitude,Longitude
0,Brooklyn,BK88,Borough Park,40.630667,-73.987897
1,Queens,QN51,Murray Hill,40.768102,-73.807672
2,Queens,QN27,East Elmhurst,40.763467,-73.866047


We merge the districts dataframe with df_pop into a new dataframe "df"

In [669]:
df=pd.merge(districts, df_pop, on='DistrictCode')

In [670]:
df.head(5)

Unnamed: 0,Borough_x,DistrictCode,DistrictName,Latitude,Longitude,Borough_y,Year,FIPS County Code,NTA Name,Population
0,Brooklyn,BK88,Borough Park,40.630667,-73.987897,Brooklyn,2010,47,Borough Park,106357
1,Queens,QN51,Murray Hill,40.768102,-73.807672,Queens,2010,81,Murray Hill,51739
2,Queens,QN27,East Elmhurst,40.763467,-73.866047,Queens,2010,81,East Elmhurst,23150
3,Queens,QN07,Hollis,40.710505,-73.764068,Queens,2010,81,Hollis,20269
4,Brooklyn,BK25,Homecrest,40.598383,-73.964717,Brooklyn,2010,47,Homecrest,44316


We drop the columns that are not required

In [671]:
df.drop(['Borough_y','Year', 'NTA Name', 'FIPS County Code'],inplace=True,axis=1)


In [672]:
df.head(5)

Unnamed: 0,Borough_x,DistrictCode,DistrictName,Latitude,Longitude,Population
0,Brooklyn,BK88,Borough Park,40.630667,-73.987897,106357
1,Queens,QN51,Murray Hill,40.768102,-73.807672,51739
2,Queens,QN27,East Elmhurst,40.763467,-73.866047,23150
3,Queens,QN07,Hollis,40.710505,-73.764068,20269
4,Brooklyn,BK25,Homecrest,40.598383,-73.964717,44316


In [673]:
df.rename(columns = {'Borough_x':'Borough'}, inplace =True)

In [674]:
df.head(5)

Unnamed: 0,Borough,DistrictCode,DistrictName,Latitude,Longitude,Population
0,Brooklyn,BK88,Borough Park,40.630667,-73.987897,106357
1,Queens,QN51,Murray Hill,40.768102,-73.807672,51739
2,Queens,QN27,East Elmhurst,40.763467,-73.866047,23150
3,Queens,QN07,Hollis,40.710505,-73.764068,20269
4,Brooklyn,BK25,Homecrest,40.598383,-73.964717,44316


In [675]:
print('The dataframe has {} boroughs and {} districts.'.format(
        len(df['Borough'].unique()),
        df.shape[0]
    )
)

The dataframe has 5 boroughs and 195 districts.


# Visualization of New York NTA through the coordinates of their calculated centers

In [676]:
address = 'New York City, NY'
geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of New York City are {}, {}.'.format(latitude, longitude))

The geographical coordinates of New York City are 40.7127281, -74.0060152.


In [677]:
# create map of New York using latitude and longitude values
map_newyork = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, district in zip(df['Latitude'], df['Longitude'], df['Borough'], df['DistrictName']):
    label = '{}, {}'.format(district, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_newyork)  
    
map_newyork

In [None]:
# Code to save Folium map in png 
"""import os
import time
from selenium import webdriver

delay=5
fn='clusters.html'
tmpurl='file://{path}/{mapfile}'.format(path=os.getcwd(),mapfile=fn)
map_clusters.save(fn)

browser = webdriver.Chrome("C:/Users/midingoy/Downloads/chromedriver_win32/chromedriver.exe")
browser.get(tmpurl)
#Give the map tiles some time to load
time.sleep(delay)
browser.save_screenshot('clusters.png')
browser.quit()"""

# Pre-Processing of the Subway stations dataset

In [678]:
df_subw = pd.read_csv("https://query.data.world/s/rttrjnk7raatdri6ialljpsucvbv5b")

In [679]:
df_subw.head(5)

Unnamed: 0,URL,OBJECTID,NAME,the_geom,LINE,NOTES
0,http://web.mta.info/nyct/service/,1,Astor Pl,POINT (-73.99106999861966 40.73005400028978),4-6-6 Express,"4 nights, 6-all times, 6 Express-weekdays AM s..."
1,http://web.mta.info/nyct/service/,2,Canal St,POINT (-74.00019299927328 40.71880300107709),4-6-6 Express,"4 nights, 6-all times, 6 Express-weekdays AM s..."
2,http://web.mta.info/nyct/service/,3,50th St,POINT (-73.98384899986625 40.76172799961419),1-2,"1-all times, 2-nights"
3,http://web.mta.info/nyct/service/,4,Bergen St,POINT (-73.97499915116808 40.68086213682956),2-3-4,"4-nights, 3-all other times, 2-all times"
4,http://web.mta.info/nyct/service/,5,Pennsylvania Ave,POINT (-73.89488591154061 40.66471445143568),3-4,"4-nights, 3-all other times"


As the table above shows, the coordinates of the subway stations are stored as text in the "the_geom" column, so we need to extract them

- Function to extract the coordinates

In [680]:
def coord(x): 
    # "POINT (lon lat)" -> (lon, lat) as floats
    z = x.replace("POINT", "").replace("(", "").replace(")", "").split()
    return float(z[0]), float(z[1])
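The chained `replace` calls work for the `the_geom` values shown in the table, but they silently accept strings in other shapes. As a minimal sketch, assuming every value looks like `POINT (lon lat)`, a stricter parser could use a regular expression and fail loudly otherwise. The name `parse_wkt_point` is illustrative and does not appear in the notebook.

```python
import re

# Matches e.g. "POINT (-73.99106999861966 40.73005400028978)"
_POINT_RE = re.compile(r"^\s*POINT\s*\(\s*(\S+)\s+(\S+)\s*\)\s*$")


def parse_wkt_point(text):
    """Return (longitude, latitude) from a WKT 'POINT (lon lat)' string."""
    match = _POINT_RE.match(text)
    if match is None:
        raise ValueError("not a WKT point: {!r}".format(text))
    return float(match.group(1)), float(match.group(2))


# First station of the table above (Astor Pl).
lon, lat = parse_wkt_point("POINT (-73.99106999861966 40.73005400028978)")
print(lat, lon)
```

Note that the order inside the string is longitude first, then latitude, which is why the dataframe below swaps the two when filling its columns.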

- Create a DataFrame of subway stations with their name, longitude and latitude

In [681]:
column_names = ['Name', 'Latitude', 'Longitude'] 
# instantiate the dataframe
subw = pd.DataFrame(columns=column_names)
lonlat = list(map(coord, df_subw["the_geom"] ))
subw["Latitude"]= [l[1] for l in lonlat]
subw["Longitude"]= [l[0] for l in lonlat]
subw["Name"] = df_subw["NAME"]
subw.head(5)

Unnamed: 0,Name,Latitude,Longitude
0,Astor Pl,40.730054,-73.99107
1,Canal St,40.718803,-74.000193
2,50th St,40.761728,-73.983849
3,Bergen St,40.680862,-73.974999
4,Pennsylvania Ave,40.664714,-73.894886


- Haversine formula used to calculate the great-circle distance (in km) between two points given their coordinates in decimal degrees

In [682]:
from math import *
def Distance(lat1, lon1, lat2, lon2):
    # inputs are in decimal degrees; convert to radians first
    l1 = radians(lat1)
    l2 = radians(lat2)
    dlat = radians(lat2-lat1)
    dlon = radians(lon2-lon1)
    a = sin(dlat/2) * sin(dlat/2) + cos(l1) * cos(l2) * sin(dlon/2) * sin(dlon/2)
    c = 2 * atan2(sqrt(a), sqrt(1-a))
    d = 6371 * c # 6371 km is the mean Earth radius, so d is in km
    return d
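For reference, the `Distance` function corresponds term by term to the haversine formula, written here with latitudes $\varphi_1, \varphi_2$ and longitudes $\lambda_1, \lambda_2$ in radians and with $R = 6371$ km as in the code:

```latex
\begin{aligned}
a &= \sin^2\!\left(\frac{\varphi_2 - \varphi_1}{2}\right)
     + \cos\varphi_1 \,\cos\varphi_2 \,\sin^2\!\left(\frac{\lambda_2 - \lambda_1}{2}\right) \\
c &= 2\,\operatorname{atan2}\!\left(\sqrt{a},\ \sqrt{1 - a}\right) \\
d &= R\,c
\end{aligned}
```

The formula assumes a spherical Earth, which is a reasonable approximation at the scale of a city.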

* Here, we determine the distance from each district to the nearest subway station, and the number of subway stations within a radius of 1.5 km

In [683]:
res, nb = [], []
for lat, lon in zip(df['Latitude'], df['Longitude']):
    z = []
    k = 0
    for lat1, lon1 in zip(subw['Latitude'], subw['Longitude']):
        dist = Distance(lat1, lon1, lat, lon)
        if dist<1.5: k = k+1
        z.append(dist)
    res.append(min(z))
    nb.append(k)
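The same "nearest distance plus count within a radius" pattern is used again later for restaurants, so it can be convenient to wrap it in one helper. The following is only a sketch under that assumption: `haversine_km` repeats the formula of `Distance`, and `nearest_and_count` is a hypothetical name that does not appear in the notebook.

```python
from math import radians, sin, cos, atan2, sqrt


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km (same formula as Distance above)."""
    l1, l2 = radians(lat1), radians(lat2)
    dlat, dlon = radians(lat2 - lat1), radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(l1) * cos(l2) * sin(dlon / 2) ** 2
    return 6371 * 2 * atan2(sqrt(a), sqrt(1 - a))


def nearest_and_count(lat, lon, points, radius_km):
    """Return (distance to the nearest point, number of points within radius_km).

    `points` is an iterable of (latitude, longitude) pairs. If it is empty,
    the nearest distance is reported as None instead of raising an error.
    """
    distances = [haversine_km(lat, lon, plat, plon) for plat, plon in points]
    nearest = min(distances, default=None)
    count = sum(1 for d in distances if d < radius_km)
    return nearest, count


# Two stations taken from the table above (Astor Pl and Canal St);
# the query point is Astor Pl itself.
stations = [(40.730054, -73.991070), (40.718803, -74.000193)]
print(nearest_and_count(40.730054, -73.991070, stations, radius_km=1.5))
```

With the notebook's dataframes, the station list could be built as `list(zip(subw['Latitude'], subw['Longitude']))` and the helper called once per row of `df`.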

In [684]:
df["min_dist_to_subways_km"] = res

In [685]:
df["number_subways_1.5km"] = nb

In [686]:
df.head(5)

Unnamed: 0,Borough,DistrictCode,DistrictName,Latitude,Longitude,Population,min_dist_to_subways_km,number_subways_1.5km
0,Brooklyn,BK88,Borough Park,40.630667,-73.987897,106357,0.635259,10
1,Queens,QN51,Murray Hill,40.768102,-73.807672,51739,2.106955,0
2,Queens,QN27,East Elmhurst,40.763467,-73.866047,23150,1.53852,0
3,Queens,QN07,Hollis,40.710505,-73.764068,20269,1.681509,0
4,Brooklyn,BK25,Homecrest,40.598383,-73.964717,44316,0.758909,10


## Using the Foursquare API to retrieve information about the venues, including the restaurants, in New York NTAs (districts)

The API returns JSON, which is then converted into a pandas DataFrame.

In [687]:
CLIENT_ID = 'NPXEJNZYDELVG0KRFPVPZ3AAUJFKPJLIGVKAXN1MZ0VLKT40' # your Foursquare ID
CLIENT_SECRET = '5Q1BDASO43ZB5AQZVEJLE3UJDSK4SPDXXONMYC201MRDWIXV' # your Foursquare Secret
ACCESS_TOKEN = 'J3B4AQ5DAY2G4PENKGKEJM0XW1P3QWREURLU3IYWMWJLOUBO' # your FourSquare Access Token
VERSION = '20180604'
LIMIT = 100
print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: NPXEJNZYDELVG0KRFPVPZ3AAUJFKPJLIGVKAXN1MZ0VLKT40
CLIENT_SECRET:5Q1BDASO43ZB5AQZVEJLE3UJDSK4SPDXXONMYC201MRDWIXV


* We restrict the analysis to one borough at a time and select Manhattan here. Another borough could be selected instead.

In [688]:
manhattan_data = df[df['Borough'] == 'Manhattan'].reset_index(drop=True)
manhattan_data.head()

Unnamed: 0,Borough,DistrictCode,DistrictName,Latitude,Longitude,Population,min_dist_to_subways_km,number_subways_1.5km
0,Manhattan,MN15,Clinton,40.766378,-73.996548,45884,0.990834,13
1,Manhattan,MN25,Battery Park City-Lower Manhattan,40.694739,-74.001444,39699,0.767109,13
2,Manhattan,MN14,Lincoln Square,40.775236,-73.988364,61489,0.555412,9
3,Manhattan,MN17,Midtown-Midtown South,40.756759,-73.982858,28630,0.245619,32
4,Manhattan,MN40,Upper East Side-Carnegie Hill,40.775283,-73.960816,61207,0.201265,10


In [689]:
address = 'Manhattan, NY'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Manhattan are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Manhattan are 40.7896239, -73.9598939.


- Create a map of Manhattan using latitude and longitude values

In [690]:
map_manhattan = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, label in zip(manhattan_data['Latitude'], manhattan_data['Longitude'], manhattan_data['DistrictName']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_manhattan)  
    
map_manhattan

In [691]:
manhattan_data.loc[0, 'DistrictName']

'Clinton'

In [692]:
district_latitude = manhattan_data.loc[0, 'Latitude'] # district latitude value
district_longitude = manhattan_data.loc[0, 'Longitude'] # district longitude value

district_name = manhattan_data.loc[0, 'DistrictName'] # district name

print('Latitude and longitude values of {} are {}, {}.'.format(district_name, 
                                                               district_latitude, 
                                                               district_longitude))

Latitude and longitude values of Clinton are 40.7663775596076, -73.99654837674895.


- Querying the Foursquare API for the venues around one Manhattan district

In [693]:
LIMIT = 100 # limit of number of venues returned by Foursquare API
radius = 500 # define radius
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    district_latitude, 
    district_longitude, 
    radius, 
    LIMIT)
url 


'https://api.foursquare.com/v2/venues/explore?&client_id=NPXEJNZYDELVG0KRFPVPZ3AAUJFKPJLIGVKAXN1MZ0VLKT40&client_secret=5Q1BDASO43ZB5AQZVEJLE3UJDSK4SPDXXONMYC201MRDWIXV&v=20180604&ll=40.7663775596076,-73.99654837674895&radius=500&limit=100'
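The URL above is assembled with positional `format` arguments, which is easy to get out of order. As an alternative sketch, the query string can be built from named parameters with the standard library. The function name `build_explore_url` and the placeholder credentials are illustrative; the endpoint and parameter names are the ones used in the cell above.

```python
from urllib.parse import urlencode


def build_explore_url(client_id, client_secret, lat, lng,
                      radius=500, limit=100, version="20180604"):
    """Build the same venues/explore URL as above from named parameters."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "limit": limit,
    }
    # safe="," keeps the comma in "lat,lng" readable instead of encoding it as %2C
    return "https://api.foursquare.com/v2/venues/explore?" + urlencode(params, safe=",")


# Placeholder credentials, not real ones.
print(build_explore_url("MY_ID", "MY_SECRET", 40.7663775596076, -73.99654837674895))
```

In practice the real credentials would typically be read from environment variables or a local configuration file rather than written into the notebook.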

In [694]:
results = requests.get(url).json()
#results

In [695]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

Before querying every district, we can inspect the list of venues returned for the first district of Manhattan

In [696]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues =nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

Unnamed: 0,name,categories,lat,lng
0,Ink 48 Hotel,Hotel,40.764505,-73.995987
1,The Press Lounge,Hotel Bar,40.764531,-73.996029
2,"Intrepid Sea, Air & Space Museum",History Museum,40.764514,-73.999385
3,Intrepid Museum Store,Gift Shop,40.764492,-73.999237
4,Print,American Restaurant,40.764658,-73.995808


In [697]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

100 venues were returned by Foursquare.


* Generalizing the previous steps to all the districts of a borough

In [698]:
def getNearbyVenues(names, latitudes, longitudes, radius=1000):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        #print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['DistrictName', 
                  'District Latitude', 
                  'District Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
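`getNearbyVenues` indexes `categories[0]` directly, which raises an `IndexError` if a venue comes back with an empty category list (the case that `get_category_type` above handles by returning `None`). The following sketch shows one way the row extraction could tolerate that. `venue_rows` is a hypothetical helper, and the sample items are made up to mimic the shape of the response used in the notebook.

```python
def venue_rows(district, lat, lng, items):
    """Flatten Foursquare 'items' into tuples, tolerating missing categories."""
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        category = categories[0]["name"] if categories else None
        rows.append((
            district, lat, lng,
            venue["name"],
            venue["location"]["lat"],
            venue["location"]["lng"],
            category,
        ))
    return rows


# Made-up items shaped like the response above; the second has no category.
sample_items = [
    {"venue": {"name": "Print",
               "location": {"lat": 40.764658, "lng": -73.995808},
               "categories": [{"name": "American Restaurant"}]}},
    {"venue": {"name": "Unnamed spot",
               "location": {"lat": 40.76, "lng": -73.99},
               "categories": []}},
]
for row in venue_rows("Clinton", 40.766378, -73.996548, sample_items):
    print(row)
```

Rows with a `None` category can then be dropped or kept explicitly before the one-hot encoding step.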

In [699]:
manhattan_venues = getNearbyVenues(names=manhattan_data['DistrictName'],
                                   latitudes=manhattan_data['Latitude'],
                                   longitudes=manhattan_data['Longitude']
                                  )

In [700]:
manhattan_venues.head(5)

Unnamed: 0,DistrictName,District Latitude,District Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Clinton,40.766378,-73.996548,Ink 48 Hotel,40.764505,-73.995987,Hotel
1,Clinton,40.766378,-73.996548,The Press Lounge,40.764531,-73.996029,Hotel Bar
2,Clinton,40.766378,-73.996548,"Intrepid Sea, Air & Space Museum",40.764514,-73.999385,History Museum
3,Clinton,40.766378,-73.996548,Intrepid Museum Store,40.764492,-73.999237,Gift Shop
4,Clinton,40.766378,-73.996548,Sullivan Street Bakery,40.763512,-73.994837,Bakery


In [701]:
manhattan_venues_cc=manhattan_venues['Venue Category'].value_counts().to_frame()

In [702]:
manhattan_venues_cc.rename(columns={'Venue Category':'Count'})


Unnamed: 0,Count
Coffee Shop,113
Park,97
Italian Restaurant,81
Café,80
Pizza Place,77
Mexican Restaurant,65
Bar,63
Bakery,63
American Restaurant,57
Art Gallery,45


- Convert venue categories into indicator variables (one column per category) and create a new dataframe manhattan_onehot

In [703]:
manhattan_onehot = pd.get_dummies(manhattan_venues[['Venue Category']], prefix="", prefix_sep="")

# add district column back to dataframe
manhattan_onehot['DistrictName'] = manhattan_venues['DistrictName'] 

# move district column to the first column
fixed_columns = [manhattan_onehot.columns[-1]] + list(manhattan_onehot.columns[:-1])
manhattan_onehot = manhattan_onehot[fixed_columns]

manhattan_onehot.shape

(2805, 307)

- Group this dataframe by each district

In [704]:
manhattan_grouped = manhattan_onehot.groupby('DistrictName').sum().reset_index()
manhattan_grouped.head(8)

Unnamed: 0,DistrictName,Accessories Store,Adult Boutique,African Restaurant,American Restaurant,Amphitheater,Animal Shelter,Antique Shop,Argentinian Restaurant,Art Gallery,Art Museum,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,Australian Restaurant,Austrian Restaurant,BBQ Joint,Bagel Shop,Bakery,Bank,Bar,Baseball Field,Baseball Stadium,Basketball Court,Beach,Beer Bar,Beer Garden,Beer Store,Big Box Store,Bike Rental / Bike Share,Bike Shop,Bike Trail,Bistro,Board Shop,Boat or Ferry,Bookstore,Boutique,Bowling Alley,Boxing Gym,Brazilian Restaurant,Breakfast Spot,Brewery,Bridge,Bubble Tea Shop,Building,Burger Joint,Burrito Place,Butcher,Café,Candy Store,Cantonese Restaurant,Caribbean Restaurant,Cheese Shop,Chinese Restaurant,Chocolate Shop,Church,Circus,Climbing Gym,Clothing Store,Cocktail Bar,Coffee Shop,College Arts Building,College Theater,Comfort Food Restaurant,Comic Shop,Concert Hall,Convenience Store,Cosmetics Shop,Coworking Space,Creperie,Cuban Restaurant,Cupcake Shop,Cycle Studio,Dance Studio,Daycare,Deli / Bodega,Department Store,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Dive Bar,Doctor's Office,Dog Run,Donut Shop,Drugstore,Dumpling Restaurant,Duty-free Shop,Eastern European Restaurant,Electronics Store,Empanada Restaurant,Entertainment Service,Ethiopian Restaurant,Event Space,Exhibit,Factory,Falafel Restaurant,Farmers Market,Fast Food Restaurant,Field,Filipino Restaurant,Film Studio,Fish Market,Flea Market,Flower Shop,Food,Food & Drink Shop,Food Court,Food Truck,Fountain,French Restaurant,Fried Chicken Joint,Frozen Yogurt Shop,Furniture / Home Store,Garden,Garden Center,Gastropub,Gay Bar,German Restaurant,Gift Shop,Golf Driving Range,Gourmet Shop,Greek Restaurant,Grocery Store,Gym,Gym / Fitness Center,Gymnastics Gym,Halal Restaurant,Harbor / Marina,Hardware Store,Hawaiian Restaurant,Health & Beauty Service,Health Food Store,Heliport,Historic Site,History Museum,Hobby Shop,Hot Dog Joint,Hotel,Hotel Bar,Hotpot Restaurant,Ice Cream Shop,Indian Restaurant,Indie Movie Theater,Indie Theater,Intersection,Irish Pub,Island,Israeli Restaurant,Italian Restaurant,Japanese Curry Restaurant,Japanese Restaurant,Jazz Club,Jewelry Store,Jewish Restaurant,Juice Bar,Karaoke Bar,Kids Store,Kitchen Supply Store,Korean Restaurant,Kosher Restaurant,Latin American Restaurant,Laundry Service,Lebanese Restaurant,Library,Lighthouse,Lingerie Store,Liquor Store,Lounge,Malay Restaurant,Market,Martial Arts School,Massage Studio,Mattress Store,Mediterranean Restaurant,Men's Store,Mexican Restaurant,Middle Eastern Restaurant,Miscellaneous Shop,Mobile Phone Shop,Molecular Gastronomy Restaurant,Monument / Landmark,Moroccan Restaurant,Motorcycle Shop,Movie Theater,Multiplex,Museum,Music Store,Music Venue,Nail Salon,New American Restaurant,Newsstand,Non-Profit,Noodle House,Office,Opera House,Optical Shop,Organic Grocery,Other Great Outdoors,Outdoor Sculpture,Paper / Office Supplies Store,Park,Pedestrian Plaza,Performing Arts Venue,Peruvian Restaurant,Peruvian Roast Chicken Joint,Pet Café,Pet Service,Pet Store,Pharmacy,Pie Shop,Pier,Pilates Studio,Pizza Place,Playground,Plaza,Poke Place,Pool,Pub,Public Art,Puerto Rican Restaurant,Ramen Restaurant,Record Shop,Recreation Center,Rental Car Location,Residential Building (Apartment / Condo),Resort,Rest Area,Restaurant,Roller Rink,Roof Deck,Rugby Pitch,Russian Restaurant,Sake Bar,Salad Place,Salon / Barbershop,Sandwich Place,Scandinavian Restaurant,Scenic Lookout,School,Science Museum,Sculpture Garden,Seafood Restaurant,Shanghai Restaurant,Shipping Store,Shoe Store,Skating Rink,Smoke Shop,Snack Place,Soba Restaurant,Soccer Field,Soup Place,South American Restaurant,South Indian Restaurant,Southern / Soul Food Restaurant,Spa,Spanish Restaurant,Speakeasy,Sporting Goods Shop,Sports Bar,Sports Club,State / Provincial Park,Steakhouse,Street Art,Supermarket,Supplement Shop,Sushi Restaurant,Swiss Restaurant,Szechuan Restaurant,Taco Place,Taiwanese Restaurant,Tapas Restaurant,Tea Room,Tennis Court,Tennis Stadium,Tex-Mex Restaurant,Thai Restaurant,Theater,Theme Park Ride / Attraction,Theme Restaurant,Thrift / Vintage Store,Tiki Bar,Tour Provider,Toy / Game Store,Track,Track Stadium,Trail,Train Station,Turkish Restaurant,Udon Restaurant,Ukrainian Restaurant,Vegetarian / Vegan Restaurant,Venezuelan Restaurant,Veterinarian,Video Game Store,Video Store,Vietnamese Restaurant,Volleyball Court,Waterfront,Whisky Bar,Wine Bar,Wine Shop,Winery,Wings Joint,Women's Store,Yemeni Restaurant,Yoga Studio
0,Battery Park City-Lower Manhattan,0,0,0,1,0,0,0,0,0,0,0,1,1,0,0,0,1,3,0,3,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,4,0,0,0,0,0,0,2,0,0,0,0,1,0,0,3,0,0,0,1,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,1,0,0,1,0,1,0,3,2,1,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,2,0,0,1,0,0,0,0,4,0,2,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,5,0,0,0,0,0,1,2,0,0,3,1,4,3,0,0,0,0,0,0,0,0,0,0,0,0,0,2,1,0,0,0,0,0,0,0,0,2,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,1,1,0,0,3,0,0,0,1,5
1,Central Harlem North-Polo Grounds,0,0,1,2,0,0,0,0,0,0,2,0,0,0,0,1,0,0,0,3,2,4,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,3,0,0,2,0,0,4,0,0,0,0,0,0,1,0,4,0,0,0,0,0,0,0,0,0,0,0,0,1,0,3,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,1,1,0,0,0,1,1,0,0,0,0,0,0,1,0,1,1,0,0,1,0,0,0,0,0,1,0,0,0,0,0,1,6,0,0,0,0,0,0,0,3,1,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,2,0,1,2,0,1,2,0,0,0,0,0,1,0,1,0,0,0,1,0,0,0,0,2,0,0,1,0,0,0,1,0,0,0,0,0,0,3,0,0,0,0,0,1,0,0,1,0,1,1,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1
2,Central Harlem South,0,0,3,1,1,0,0,0,0,1,1,0,0,0,0,0,1,3,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,1,0,0,0,1,0,0,1,0,0,5,0,0,1,0,0,0,0,0,0,0,2,7,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,0,0,1,0,0,0,0,1,0,0,3,0,0,0,0,1,0,0,0,0,0,0,0,2,2,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,1,0,0,0,0,0,0,2,0,0,2,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,2,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,5,0,0,0,0,0,0,0,0,0,0,0,3,3,1,0,1,1,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,5,1,1,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,4,2,0,0,0,0,1
3,Chinatown,0,0,0,1,0,1,0,0,0,0,0,3,0,1,1,0,1,4,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,3,0,1,0,0,6,0,0,0,0,0,4,4,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,2,1,1,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,2,0,0,1,0,0,0,0,0,1,0,2,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,5,0,1,0,0,0,0,0,0,0,0,0,2,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,3,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,4,1,1,0,1,0,0,0,0,0,0,0,0,0,1,0,1,1,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,3,4,0,0,0,0,0,1,1,0,1,0,0,1,0,0,0,0,0,0,2,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,4,0,0,0,0,0,0
4,Clinton,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,3,0,4,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,2,0,0,0,0,1,2,1,0,0,0,0,0,0,0,0,0,0,0,0,0,5,0,0,0,0,0,0,1,0,0,0,1,0,2,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,2,1,0,3,0,0,0,0,3,2,0,0,0,0,0,1,0,0,0,1,0,0,1,1,0,0,0,0,0,0,0,0,0,3,0,1,0,0,0,2,0,0,0,2,0,1,0,0,0,0,0,1,1,0,0,0,0,0,3,1,4,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,2,0,0,1,0,0,0,0,0,1,1,0,3,0,0,2,0,1,0,0,1,0,0,0,0,1,0,1,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,2,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,3,2,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,4,4,0,0,0,0,0
5,East Harlem North,0,0,1,1,1,0,1,0,0,0,0,1,0,0,0,0,1,2,2,2,0,0,0,0,1,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,3,0,0,5,0,0,3,0,0,0,0,0,0,0,3,2,0,0,1,0,0,1,2,1,1,0,0,0,0,0,1,0,0,0,0,2,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,3,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,2,0,1,1,0,1,0,0,5,0,0,1,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,6,1,0,0,0,0,0,1,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,1,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,3,1,1,1,0,0,0,0,1,0,0,0,1,0,0,1,0,2,0,0,0,0,1,2,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,1,1,0,0,0,0,1
6,East Harlem South,0,0,0,1,0,1,0,0,0,1,0,0,0,0,0,0,1,4,1,2,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,6,0,0,0,0,0,0,0,0,1,0,4,2,0,0,0,0,0,0,0,0,0,1,0,0,2,1,5,0,2,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,4,1,0,0,0,0,1,1,0,0,1,1,0,0,0,0,0,0,2,0,0,0,0,0,0,2,0,0,0,0,0,0,1,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,1,0,3,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,4,0,0,0,0,0,0,0,1,0,0,0,4,2,1,0,0,2,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,1,3,0,0,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,0,1,1,0,0,0,0,0,1,0,1,0,1,0,0,1,0,0,0,0,0,0,3,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0
7,East Village,0,0,0,1,0,0,0,1,1,0,1,1,0,1,0,1,2,1,0,2,0,0,0,0,1,0,1,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,1,0,1,2,0,0,1,1,1,0,0,0,0,0,1,4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,2,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,1,0,0,1,0,0,0,0,0,1,0,2,2,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,3,0,0,0,0,0,0,0,4,0,3,0,1,0,1,0,0,0,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,2,0,0,0,0,2,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,1,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,1,0,1,0,1,0,0,0,0,0,2,1,0,0,0,0,0,1,0,0,0,0,0,1,1,2,0,0,1,0,2,0,0,0,4,2,0,0,0,0,0


- As we only need restaurants, we restrict the columns to the venue categories whose name contains "Restaurant"

In [705]:
mann = pd.DataFrame(manhattan_grouped, columns = ["DistrictName"]+ list(manhattan_grouped.columns[manhattan_grouped.columns.str.contains("Restaurant")]))

In [706]:
mann["Number_of_restaurants"] = mann.sum(axis=1, numeric_only=True) # skip the DistrictName text column

In [707]:
extr_mann = pd.DataFrame(mann, columns = ["DistrictName", "Number_of_restaurants"])
extr_mann.head(5)

Unnamed: 0,DistrictName,Number_of_restaurants
0,Battery Park City-Lower Manhattan,20
1,Central Harlem North-Polo Grounds,22
2,Central Harlem South,26
3,Chinatown,31
4,Clinton,28
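
The column filter and row total above can be tried on a small frame. The sketch below uses invented counts and hypothetical category names (the frame `toy` is not part of the notebook), and it sums only the restaurant columns so that the result does not depend on how `sum(axis=1)` treats the text column `DistrictName`.

```python
import pandas as pd

# Toy stand-in for manhattan_grouped; counts and categories are invented.
toy = pd.DataFrame({
    "DistrictName": ["A", "B"],
    "Bakery": [2, 0],
    "Italian Restaurant": [3, 1],
    "Thai Restaurant": [0, 4],
})

# Keep only the columns whose name contains "Restaurant".
restaurant_cols = toy.columns[toy.columns.str.contains("Restaurant")]
subset = toy[["DistrictName", *restaurant_cols]].copy()

# Total restaurants per district, ignoring the text column.
subset["Number_of_restaurants"] = subset[restaurant_cols].sum(axis=1)
print(subset[["DistrictName", "Number_of_restaurants"]])
```

Note that `str.contains("Restaurant")` only matches categories with that word in their name, so a category such as a pizza place or a diner would not be counted.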


In [708]:
df_manh=pd.merge(manhattan_data,extr_mann, on='DistrictName')

In [709]:
df_manh.head(5)

Unnamed: 0,Borough,DistrictCode,DistrictName,Latitude,Longitude,Population,min_dist_to_subways_km,number_subways_1.5km,Number_of_restaurants
0,Manhattan,MN15,Clinton,40.766378,-73.996548,45884,0.990834,13,28
1,Manhattan,MN25,Battery Park City-Lower Manhattan,40.694739,-74.001444,39699,0.767109,13,20
2,Manhattan,MN14,Lincoln Square,40.775236,-73.988364,61489,0.555412,9,15
3,Manhattan,MN17,Midtown-Midtown South,40.756759,-73.982858,28630,0.245619,32,16
4,Manhattan,MN40,Upper East Side-Carnegie Hill,40.775283,-73.960816,61207,0.201265,10,23


- We add to the table the distance between the center of each district and its nearest restaurant

In [710]:
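res, nb = [], []  # nearest-restaurant distance and count within 1 km, per district
for lat, lon, distr in zip(df_manh['Latitude'], df_manh['Longitude'], df_manh['DistrictName']):
    z = []  # distances from the district center to each of its restaurants
    k = 0   # number of restaurants closer than 1 km to the district center
    for lat1, lon1, cat, dis in zip(manhattan_venues['Venue Latitude'], manhattan_venues['Venue Longitude'], manhattan_venues["Venue Category"], manhattan_venues["DistrictName"]):
        if "Restaurant" in cat and distr == dis:
            dist = Distance(lat1, lon1, lat, lon)
            if dist < 1:
                k += 1
            z.append(dist)
    # min() fails on an empty list, so a district without restaurants gets NaN
    res.append(min(z) if z else float("nan"))
    nb.append(k)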

In [711]:
df_manh["min_dist_to_restau_km"] = res
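
`Distance` is defined earlier in the notebook and is not shown in this section. As an assumption, it is taken here to return the great-circle distance in kilometers; one common way to compute that is the haversine formula. The function name `haversine_km` and the Earth radius of 6371 km are choices made for this sketch, not the notebook's own code.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, assumed for this sketch


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


# Centers of Clinton and Lincoln Square, taken from df_manh.head(5) above.
clinton = (40.766378, -73.996548)
lincoln_square = (40.775236, -73.988364)
d = haversine_km(*clinton, *lincoln_square)
print(round(d, 2))
```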

Now we build the table that will be clustered to determine the best places

In [712]:
df_manh.head(5)

Unnamed: 0,Borough,DistrictCode,DistrictName,Latitude,Longitude,Population,min_dist_to_subways_km,number_subways_1.5km,Number_of_restaurants,min_dist_to_restau_km
0,Manhattan,MN15,Clinton,40.766378,-73.996548,45884,0.990834,13,28,0.448717
1,Manhattan,MN25,Battery Park City-Lower Manhattan,40.694739,-74.001444,39699,0.767109,13,20,0.15659
2,Manhattan,MN14,Lincoln Square,40.775236,-73.988364,61489,0.555412,9,15,0.477536
3,Manhattan,MN17,Midtown-Midtown South,40.756759,-73.982858,28630,0.245619,32,16,0.154862
4,Manhattan,MN40,Upper East Side-Carnegie Hill,40.775283,-73.960816,61207,0.201265,10,23,0.168686


# First, we use the K-means algorithm

In [713]:
# set number of clusters
kclusters = 4

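# drop the non-numeric columns; the positional axis argument is no longer accepted by recent pandas
manhattan_clustering = df_manh.drop(columns=['Borough', 'DistrictCode', 'DistrictName'])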

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(manhattan_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([3, 3, 2, 0, 2, 2, 3, 2, 2, 1])

In [714]:
manhattan_clustering.insert(0, 'Cluster Labels', kmeans.labels_)

In [715]:
manhattan_clustering["DistrictName"] = df_manh["DistrictName"]

In [716]:
manhattan_clustering.head(5)

Unnamed: 0,Cluster Labels,Latitude,Longitude,Population,min_dist_to_subways_km,number_subways_1.5km,Number_of_restaurants,min_dist_to_restau_km,DistrictName
0,3,40.766378,-73.996548,45884,0.990834,13,28,0.448717,Clinton
1,3,40.694739,-74.001444,39699,0.767109,13,20,0.15659,Battery Park City-Lower Manhattan
2,2,40.775236,-73.988364,61489,0.555412,9,15,0.477536,Lincoln Square
3,0,40.756759,-73.982858,28630,0.245619,32,16,0.154862,Midtown-Midtown South
4,2,40.775283,-73.960816,61207,0.201265,10,23,0.168686,Upper East Side-Carnegie Hill
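
The features passed to `KMeans` above sit on very different scales: `Population` is in the tens of thousands, while the distances are below 2 km. Because k-means relies on Euclidean distance, the column with the largest scale can dominate the result. The sketch below uses the five rows shown in `df_manh.head(5)` (latitude and longitude are left out for brevity) to measure how much of the squared distance between two districts comes from `Population`, before and after z-score standardization. The helper names are hypothetical, and standardizing is a suggestion rather than a step of the notebook.

```python
from statistics import mean, pstdev

# Population, min_dist_to_subways_km, number_subways_1.5km,
# Number_of_restaurants, min_dist_to_restau_km (rows from df_manh.head(5)).
rows = [
    [45884, 0.990834, 13, 28, 0.448717],  # Clinton
    [39699, 0.767109, 13, 20, 0.156590],  # Battery Park City-Lower Manhattan
    [61489, 0.555412, 9, 15, 0.477536],   # Lincoln Square
    [28630, 0.245619, 32, 16, 0.154862],  # Midtown-Midtown South
    [61207, 0.201265, 10, 23, 0.168686],  # Upper East Side-Carnegie Hill
]


def zscore_columns(data):
    """Rescale every column to mean 0 and (population) standard deviation 1."""
    cols = list(zip(*data))
    stats = [(mean(c), pstdev(c)) for c in cols]
    return [[(v - m) / s for v, (m, s) in zip(row, stats)] for row in data]


def population_share(a, b):
    """Fraction of the squared Euclidean distance due to the first feature."""
    sq = [(x - y) ** 2 for x, y in zip(a, b)]
    return sq[0] / sum(sq)


# Compare Clinton (row 0) with Midtown-Midtown South (row 3).
raw_share = population_share(rows[0], rows[3])
scaled = zscore_columns(rows)
scaled_share = population_share(scaled[0], scaled[3])
print(f"raw: {raw_share:.6f}, standardized: {scaled_share:.6f}")
```

In scikit-learn, `StandardScaler` applies the same column-wise z-scoring and could be applied to `manhattan_clustering` before calling `KMeans`.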


# Results 

## Clusters Visualization with KMeans Algorithm

In [717]:
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(manhattan_clustering['Latitude'], manhattan_clustering['Longitude'], manhattan_clustering['DistrictName'], manhattan_clustering['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],  # cluster labels start at 0, so index directly
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)

map_clusters

- Cluster 0

In [718]:
manhattan_clustering.loc[manhattan_clustering['Cluster Labels'] == 0, manhattan_clustering.columns[[manhattan_clustering.shape[1]-1] + list(range(3, manhattan_clustering.shape[1]-1))]]

Unnamed: 0,DistrictName,Population,min_dist_to_subways_km,number_subways_1.5km,Number_of_restaurants,min_dist_to_restau_km
3,Midtown-Midtown South,28630,0.245619,32,16,0.154862
12,park-cemetery-etc-Manhattan,1849,1.361403,1,13,0.584997
26,Manhattanville,22950,0.336913,8,31,0.124191
27,Gramercy,27988,0.440937,22,35,0.105912
28,Stuyvesant Town-Cooper Village,21049,0.620675,7,25,0.569645


- Cluster 1

In [719]:
manhattan_clustering.loc[manhattan_clustering['Cluster Labels'] == 1, manhattan_clustering.columns[[manhattan_clustering.shape[1]-1] + list(range(3, manhattan_clustering.shape[1]-1))]]

Unnamed: 0,DistrictName,Population,min_dist_to_subways_km,number_subways_1.5km,Number_of_restaurants,min_dist_to_restau_km
9,Upper West Side,132378,0.250604,9,30,0.248552


- Cluster 2

In [720]:
manhattan_clustering.loc[manhattan_clustering['Cluster Labels'] == 2, manhattan_clustering.columns[[manhattan_clustering.shape[1]-1] + list(range(3, manhattan_clustering.shape[1]-1))]]

Unnamed: 0,DistrictName,Population,min_dist_to_subways_km,number_subways_1.5km,Number_of_restaurants,min_dist_to_restau_km
2,Lincoln Square,61489,0.555412,9,15,0.477536
4,Upper East Side-Carnegie Hill,61207,0.201265,10,23,0.168686
5,Yorkville,77942,0.830499,5,24,0.59943
7,Hudson Yards-Chelsea-Flatiron-Union Square,70150,0.755112,13,12,0.215781
8,West Village,66880,0.306087,23,36,0.041516
16,Lower East Side,72957,0.7771,5,7,0.340188
19,Washington Heights North,67136,0.306602,7,23,0.264096
20,Washington Heights South,84438,0.361357,9,29,0.080978
22,Lenox Hill-Roosevelt Island,80771,0.727095,10,28,0.075408
25,Central Harlem North-Polo Grounds,75282,0.247548,15,22,0.247593


- Cluster 3

In [721]:
manhattan_clustering.loc[manhattan_clustering['Cluster Labels'] == 3, manhattan_clustering.columns[[manhattan_clustering.shape[1]-1] + list(range(3, manhattan_clustering.shape[1]-1))]]

Unnamed: 0,DistrictName,Population,min_dist_to_subways_km,number_subways_1.5km,Number_of_restaurants,min_dist_to_restau_km
0,Clinton,45884,0.990834,13,28,0.448717
1,Battery Park City-Lower Manhattan,39699,0.767109,13,20,0.15659
6,Marble Hill-Inwood,46746,0.580384,11,13,0.578123
10,Hamilton Heights,48520,0.161192,10,33,0.081053
11,East Harlem South,57902,0.506402,8,23,0.164852
13,Central Harlem South,43383,0.190487,15,26,0.10573
14,East Harlem North,58019,0.249872,9,28,0.472816
15,East Village,44136,0.498885,21,36,0.027682
17,Murray Hill-Kips Bay,50742,0.879675,13,31,0.193868
18,Morningside Heights,55929,0.081502,10,33,0.235167


# Discussion

As we can observe, the population of a district has a strong weight in the classification, whereas the numbers of restaurants and subway stations vary widely within each cluster. Cluster 1 contains a single district, Upper West Side, which is the most populated. Although it has 30 restaurants, its ratio of population to restaurants is favorable enough to justify opening another one. Cluster 2 contains the second tier of well-populated districts, and some of them, such as Lower East Side, have only 7 restaurants within a radius of 1 km. We cannot recommend clusters 0 and 3, which contain districts with smaller populations and many restaurants.
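
The comparison between population and restaurant count made above can be quantified as residents per restaurant. The sketch below uses the `Population` and `Number_of_restaurants` values from the cluster tables. The ratio is an illustrative indicator introduced here, not a feature used in the clustering, and the counts only reflect the venues returned by Foursquare.

```python
# (Population, Number_of_restaurants) copied from the cluster tables above.
districts = {
    "Upper West Side": (132378, 30),   # cluster 1
    "Lower East Side": (72957, 7),     # cluster 2
    "Manhattanville": (22950, 31),     # cluster 0
    "Gramercy": (27988, 35),           # cluster 0
}

# Residents per restaurant: a higher value suggests less local competition.
ratio = {name: pop / n for name, (pop, n) in districts.items()}

# Print the districts from the highest to the lowest ratio.
for name, value in sorted(ratio.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<16} {value:>8.0f}")
```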

Our analysis did not take into account how the population is distributed within a district; it is based on the center of each Neighborhood Tabulation Area. However, the results are consistent with what is observed in Manhattan. For example, cluster 0, which is not recommended, contains the Manhattanville neighborhood, which is sparsely populated but has many restaurants. It welcomes many tourists, as does Gramercy.

### Conclusion and Future work

In terms of future work, I would like to apply this approach to the other boroughs and to try other machine learning techniques, such as density-based clustering (DBSCAN), to strengthen the clustering results.
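
As a rough illustration of the density-based alternative mentioned above, the sketch below runs scikit-learn's `DBSCAN` on a few standardized feature rows. The data are synthetic, and `eps=1.0` and `min_samples=2` are arbitrary values chosen for the example; real settings would need tuning on the district table.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# Synthetic rows: [population, number of restaurants]; not the notebook's data.
X = np.array([
    [45000, 28], [46000, 27], [44000, 29],   # one dense group
    [80000, 10], [81000, 11], [79000, 9],    # another dense group
    [130000, 30],                            # isolated point
], dtype=float)

# Standardize so both features contribute comparably to the distance.
X_scaled = StandardScaler().fit_transform(X)

# DBSCAN needs no cluster count; points in no dense region get the label -1.
labels = DBSCAN(eps=1.0, min_samples=2).fit_predict(X_scaled)
print(labels)
```

Unlike k-means, this approach does not force every district into a cluster, which could help flag outliers such as a single very populated district.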
Thanks
