# Capstone Project - The Battle of the Neighborhoods (Week 2)

### Applied Data Science Capstone by IBM/Coursera

### Table of Contents
- Introduction: Business Problem
- Data

## Introduction: Business Problem

In this project, we will try to search for an ideal location to start a restaurant specifically in Brooklyn, one of the 5 boroughs in New York that has a high density of population. 

By an "ideal" location, we would like to detect a location that has high density of population but at the mean time to lessen the competition of the same sector. In this project, we would like to know which income group of people will be attracted most to it based on the location. 

Hence, with the power of data science, we would like to generate a few potential neighborhoods based on this criteria by using data analysis. 

## Data

#### Based on the problem, some dependent factors are listed as below:

- Existing restaurants in the neighbourfood
- Population and group of people with their income

We decided to use regularly spaced grid of locations, centered around city center, to define our neighborhoods

#### The following would be the data sources gathered to extract the information needed:

- This csv file contains the data for all the 5 boroughs in New York which also includes coordinates and their neighborhoods respectively. 

Link : https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/labs/newyork_data.json


- Using foursquare data to get inforamtion about restaurants in Selangor.

Link: https://foursquare.com/explore?mode=url&ne=44.418088%2C-78.362732&q=Restaurant&sw=42.742978%2C-80.554504

## 1. Explore Dataset

In [1]:
# Import necessary libraries
import json
import pandas as pd
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors
from geopy.geocoders import Nominatim
from sklearn.cluster import KMeans

In [2]:
# Download data
!wget -q -O 'newyork_data.json' https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/labs/newyork_data.json
with open('newyork_data.json') as json_data:
    neywork_data = json.load(json_data)

# Define a new variable that includes the data
neighborhoods_data = neywork_data["features"]

In [3]:
# Transfrom the data into pandas dataframe
# Creare dataframe columns
column_names = ["Borough","Neighborhood","Latitude", "Longitude"]
neighborhoods = pd.DataFrame(columns = column_names)

#Loop the data into the new created dataframe
for data in neighborhoods_data:
    borough = neighborhood_name = data['properties']['borough'] 
    neighborhood_name = data['properties']['name']
        
    neighborhood_latlon = data['geometry']['coordinates']
    neighborhood_lat = neighborhood_latlon[1]
    neighborhood_lon = neighborhood_latlon[0]
    
    neighborhoods = neighborhoods.append({'Borough': borough,
                                          'Neighborhood': neighborhood_name,
                                          'Latitude': neighborhood_lat,
                                          'Longitude': neighborhood_lon}, ignore_index=True)

In [4]:
# To examine the resulting dataframe
neighborhoods.head()

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,Bronx,Wakefield,40.894705,-73.847201
1,Bronx,Co-op City,40.874294,-73.829939
2,Bronx,Eastchester,40.887556,-73.827806
3,Bronx,Fieldston,40.895437,-73.905643
4,Bronx,Riverdale,40.890834,-73.912585


In [5]:
# Create new dataframe for neighborhoods in Brooklyn
brooklyn_data = neighborhoods[neighborhoods["Borough"] == "Brooklyn"].reset_index(drop=True)
brooklyn_data.head()

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,Brooklyn,Bay Ridge,40.625801,-74.030621
1,Brooklyn,Bensonhurst,40.611009,-73.99518
2,Brooklyn,Sunset Park,40.645103,-74.010316
3,Brooklyn,Greenpoint,40.730201,-73.954241
4,Brooklyn,Gravesend,40.59526,-73.973471


In [6]:
# To get the geographical coordinates of Brooklyn
address = 'Brooklyn, NY'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geograpical coordinate of Brooklyn are {}, {}.'.format(latitude, longitude))

The geograpical coordinate of Brooklyn are 40.6501038, -73.9495823.


In [7]:
!conda install -c conda-forge folium=0.5.0 --yes
import folium

Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.



In [8]:
map_brooklyn = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, label in zip(brooklyn_data['Latitude'], brooklyn_data['Longitude'], brooklyn_data['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_brooklyn)  
    
map_brooklyn

## 2. Use FOURSQUARE API to explore the neighborhood

In [9]:
CLIENT_ID = 'TE5LBETEPPHSRPDK3SAKTK5LDFE4F15QQCNFDC1LTIWGXOU5' # your Foursquare ID
CLIENT_SECRET = 'XAJ525FWIICSFL2C431QDFHYCXVIK34NV01512IHWM2H4TS4' # your Foursquare Secret
ACCESS_TOKEN = 'R1D3W2EVQNLZVVNL0GK3NGNACVLIBBDXVEYO1XMT1F1VCI4L' # your FourSquare Access Token
VERSION = '20180605' # Foursquare API version
LIMIT = 100

print('Successfully Logged-In')

Successfully Logged-In


#### The reason I chose Beaford Stuyvesant as my option is because this location is the most populated neighborhood with a population of 157,530 residents in Brooklyn.

In [10]:
# To get the neighborhood's latitude and longitude values.
# The reason I choose Bedford Stuyvesant 
brooklyn_data.loc[brooklyn_data["Neighborhood"] == "Bedford Stuyvesant"]
brooklyn_data.loc[17,"Neighborhood"]

'Bedford Stuyvesant'

In [11]:
neighborhood_latitude = brooklyn_data.loc[17, 'Latitude'] # neighborhood latitude value
neighborhood_longitude = brooklyn_data.loc[17, 'Longitude'] # neighborhood longitude value

neighborhood_name = brooklyn_data.loc[17, 'Neighborhood'] # neighborhood name

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))

Latitude and longitude values of Bedford Stuyvesant are 40.687231607720456, -73.94178488690297.


#### We will now get the top 100 venues that are in Beaford Stuyvesant with a radius of 500 meters.

In [12]:
limit = 100
radius = 500

url = "https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}".format(CLIENT_ID, CLIENT_SECRET, VERSION, neighborhood_latitude, neighborhood_longitude,radius,limit)
url

'https://api.foursquare.com/v2/venues/explore?client_id=TE5LBETEPPHSRPDK3SAKTK5LDFE4F15QQCNFDC1LTIWGXOU5&client_secret=XAJ525FWIICSFL2C431QDFHYCXVIK34NV01512IHWM2H4TS4&v=20180605&ll=40.687231607720456,-73.94178488690297&radius=500&limit=100'

In [13]:
import requests
from pandas.io.json import json_normalize

results = requests.get(url).json()
results

{'meta': {'code': 200, 'requestId': '60373f85d226252cc3a62652'},
 'response': {'suggestedFilters': {'header': 'Tap to show:',
   'filters': [{'name': '$-$$$$', 'key': 'price'}]},
  'headerLocation': 'Bedford-Stuyvesant',
  'headerFullLocation': 'Bedford-Stuyvesant, Brooklyn',
  'headerLocationGranularity': 'neighborhood',
  'totalResults': 30,
  'suggestedBounds': {'ne': {'lat': 40.69173161222046,
    'lng': -73.93586147520539},
   'sw': {'lat': 40.682731603220454, 'lng': -73.94770829860055}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '5e4567fa2eafa100085e9ec3',
       'name': 'Bar Camillo',
       'location': {'address': '333 Tompkins Ave',
        'lat': 40.686523,
        'lng': -73.944379,
        'labeledLatLngs': [{'label': 'display',
          'lat': 40.68652

In [14]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

In [15]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues =nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

  app.launch_new_instance()


Unnamed: 0,name,categories,lat,lng
0,Bar Camillo,Italian Restaurant,40.686523,-73.944379
1,Sincerely Tommy,Boutique,40.686066,-73.944294
2,The Bush Doctor,Juice Bar,40.687399,-73.94448
3,Bed-Vyne Brew,Bar,40.684751,-73.944319
4,Bed-Vyne Wine & Spirits,Wine Shop,40.684668,-73.944363


In [16]:
print("{} venues were returned by Foursquare.".format(nearby_venues.shape[0]))

30 venues were returned by Foursquare.


### Exploring neighborhoods in Brooklyn

In [17]:
#Create a function to repeat the same process to all neighborhoods in Brooklyn
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

In [18]:
# Create a new dataframe for Brooklyn
brooklyn_venues = getNearbyVenues(names=brooklyn_data["Neighborhood"],
                                   latitudes=brooklyn_data["Latitude"],
                                  longitudes=brooklyn_data["Longitude"])

Bay Ridge
Bensonhurst
Sunset Park
Greenpoint
Gravesend
Brighton Beach
Sheepshead Bay
Manhattan Terrace
Flatbush
Crown Heights
East Flatbush
Kensington
Windsor Terrace
Prospect Heights
Brownsville
Williamsburg
Bushwick
Bedford Stuyvesant
Brooklyn Heights
Cobble Hill
Carroll Gardens
Red Hook
Gowanus
Fort Greene
Park Slope
Cypress Hills
East New York
Starrett City
Canarsie
Flatlands
Mill Island
Manhattan Beach
Coney Island
Bath Beach
Borough Park
Dyker Heights
Gerritsen Beach
Marine Park
Clinton Hill
Sea Gate
Downtown
Boerum Hill
Prospect Lefferts Gardens
Ocean Hill
City Line
Bergen Beach
Midwood
Prospect Park South
Georgetown
East Williamsburg
North Side
South Side
Ocean Parkway
Fort Hamilton
Ditmas Park
Wingate
Rugby
Remsen Village
New Lots
Paerdegat Basin
Mill Basin
Fulton Ferry
Vinegar Hill
Weeksville
Broadway Junction
Dumbo
Homecrest
Highland Park
Madison
Erasmus


In [19]:
# To check the size of the resulting dataframe
print(brooklyn_venues.shape)
brooklyn_venues.head()

(2742, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Bay Ridge,40.625801,-74.030621,Pilo Arts Day Spa and Salon,40.624748,-74.030591,Spa
1,Bay Ridge,40.625801,-74.030621,Bagel Boy,40.627896,-74.029335,Bagel Shop
2,Bay Ridge,40.625801,-74.030621,Leo's Casa Calamari,40.6242,-74.030931,Pizza Place
3,Bay Ridge,40.625801,-74.030621,Cocoa Grinder,40.623967,-74.030863,Juice Bar
4,Bay Ridge,40.625801,-74.030621,Pegasus Cafe,40.623168,-74.031186,Breakfast Spot


In [20]:
# To check how many venues were returned
brooklyn_venues.groupby('Neighborhood').count()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Bath Beach,48,48,48,48,48,48
Bay Ridge,81,81,81,81,81,81
Bedford Stuyvesant,30,30,30,30,30,30
Bensonhurst,30,30,30,30,30,30
Bergen Beach,6,6,6,6,6,6
...,...,...,...,...,...,...
Vinegar Hill,31,31,31,31,31,31
Weeksville,17,17,17,17,17,17
Williamsburg,33,33,33,33,33,33
Windsor Terrace,30,30,30,30,30,30


In [21]:
print('There are {} uniques categories.'.format(len(brooklyn_venues['Venue Category'].unique())))

There are 291 uniques categories.


### Analyze each neighborhood

In [22]:
# one hot encoding
brooklyn_onehot = pd.get_dummies(brooklyn_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
brooklyn_onehot['Neighborhood'] = brooklyn_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [brooklyn_onehot.columns[-1]] + list(brooklyn_onehot.columns[:-1])
brooklyn_onehot = brooklyn_onehot[fixed_columns]

brooklyn_onehot.head()

Unnamed: 0,Yoga Studio,Accessories Store,American Restaurant,Antique Shop,Arepa Restaurant,Argentinian Restaurant,Art Gallery,Arts & Crafts Store,Arts & Entertainment,Asian Restaurant,...,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Waterfront,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [23]:
#Examine the new dataframe size
brooklyn_onehot.shape

(2742, 291)

In [24]:
# Group rows by neighborhood and by taking the mean of the frequency of occurrence of each category
brooklyn_grouped = brooklyn_onehot.groupby('Neighborhood').mean().reset_index()
brooklyn_grouped

Unnamed: 0,Neighborhood,Yoga Studio,Accessories Store,American Restaurant,Antique Shop,Arepa Restaurant,Argentinian Restaurant,Art Gallery,Arts & Crafts Store,Arts & Entertainment,...,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Waterfront,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store
0,Bath Beach,0.000000,0.0,0.000000,0.000000,0.0,0.0,0.000000,0.000000,0.0,...,0.0,0.020833,0.0,0.000000,0.0,0.000000,0.000000,0.000000,0.0,0.0
1,Bay Ridge,0.000000,0.0,0.037037,0.000000,0.0,0.0,0.000000,0.000000,0.0,...,0.0,0.012346,0.0,0.012346,0.0,0.000000,0.000000,0.000000,0.0,0.0
2,Bedford Stuyvesant,0.000000,0.0,0.000000,0.000000,0.0,0.0,0.000000,0.000000,0.0,...,0.0,0.000000,0.0,0.000000,0.0,0.000000,0.033333,0.033333,0.0,0.0
3,Bensonhurst,0.000000,0.0,0.000000,0.000000,0.0,0.0,0.000000,0.000000,0.0,...,0.0,0.000000,0.0,0.000000,0.0,0.000000,0.000000,0.000000,0.0,0.0
4,Bergen Beach,0.000000,0.0,0.000000,0.000000,0.0,0.0,0.000000,0.000000,0.0,...,0.0,0.000000,0.0,0.000000,0.0,0.000000,0.000000,0.000000,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
65,Vinegar Hill,0.000000,0.0,0.032258,0.000000,0.0,0.0,0.064516,0.000000,0.0,...,0.0,0.000000,0.0,0.000000,0.0,0.032258,0.032258,0.032258,0.0,0.0
66,Weeksville,0.000000,0.0,0.058824,0.000000,0.0,0.0,0.000000,0.000000,0.0,...,0.0,0.000000,0.0,0.000000,0.0,0.000000,0.000000,0.000000,0.0,0.0
67,Williamsburg,0.030303,0.0,0.000000,0.000000,0.0,0.0,0.030303,0.000000,0.0,...,0.0,0.000000,0.0,0.000000,0.0,0.000000,0.030303,0.000000,0.0,0.0
68,Windsor Terrace,0.000000,0.0,0.033333,0.033333,0.0,0.0,0.000000,0.033333,0.0,...,0.0,0.000000,0.0,0.000000,0.0,0.000000,0.000000,0.033333,0.0,0.0


In [25]:
#Print each neighborhood along with the top 5 most commmon venues.
num_top_venues = 5

for hood in brooklyn_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = brooklyn_grouped[brooklyn_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Bath Beach----
                venue  freq
0     Bubble Tea Shop  0.06
1  Chinese Restaurant  0.06
2         Pizza Place  0.06
3            Pharmacy  0.04
4         Gas Station  0.04


----Bay Ridge----
                 venue  freq
0                  Spa  0.06
1           Bagel Shop  0.05
2          Pizza Place  0.05
3   Italian Restaurant  0.05
4  American Restaurant  0.04


----Bedford Stuyvesant----
           venue  freq
0    Coffee Shop  0.10
1    Pizza Place  0.07
2           Café  0.07
3  Deli / Bodega  0.07
4            Bar  0.07


----Bensonhurst----
                venue  freq
0                Park  0.10
1          Donut Shop  0.07
2         Pizza Place  0.07
3  Italian Restaurant  0.07
4  Chinese Restaurant  0.07


----Bergen Beach----
                venue  freq
0     Harbor / Marina  0.33
1          Playground  0.17
2      Baseball Field  0.17
3  Athletics & Sports  0.17
4                Park  0.17


----Boerum Hill----
            venue  freq
0     Coffee Shop  0.07
1

                 venue  freq
0          Coffee Shop  0.11
1          Pizza Place  0.05
2                  Bar  0.04
3               Bakery  0.04
4  American Restaurant  0.04


----Ocean Hill----
                             venue  freq
0                    Deli / Bodega  0.21
1              Fried Chicken Joint  0.07
2                             Food  0.07
3                    Grocery Store  0.07
4  Southern / Soul Food Restaurant  0.07


----Ocean Parkway----
                           venue  freq
0  Paper / Office Supplies Store  0.06
1                         Bakery  0.06
2                            Spa  0.06
3                      Nightclub  0.06
4                     Steakhouse  0.06


----Paerdegat Basin----
                  venue  freq
0          Home Service  0.33
1                  Food  0.33
2      Business Service  0.33
3       Organic Grocery  0.00
4  Pakistani Restaurant  0.00


----Park Slope----
          venue  freq
0   Coffee Shop  0.10
1  Burger Joint  0.06
2   Pizz

In [26]:
# Reconstruct it into a pd frame
# To sort the venues in descending order
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [27]:
# create the new dataframe and display the top 10 venues for each neighborhood.

num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = brooklyn_grouped['Neighborhood']

for ind in np.arange(brooklyn_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(brooklyn_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Bath Beach,Bubble Tea Shop,Chinese Restaurant,Pizza Place,Pharmacy,Gas Station,Fast Food Restaurant,Italian Restaurant,Donut Shop,Peruvian Restaurant,Sushi Restaurant
1,Bay Ridge,Spa,Bagel Shop,Pizza Place,Italian Restaurant,American Restaurant,Greek Restaurant,Pharmacy,Chinese Restaurant,Bar,Grocery Store
2,Bedford Stuyvesant,Coffee Shop,Pizza Place,Café,Deli / Bodega,Bar,Boutique,Gourmet Shop,Thrift / Vintage Store,Gift Shop,Tiki Bar
3,Bensonhurst,Park,Donut Shop,Pizza Place,Italian Restaurant,Chinese Restaurant,Sushi Restaurant,Ice Cream Shop,Noodle House,Sporting Goods Shop,Cha Chaan Teng
4,Bergen Beach,Harbor / Marina,Playground,Baseball Field,Athletics & Sports,Park,Noodle House,North Indian Restaurant,Opera House,Optical Shop,Organic Grocery


## 3. Cluster Neighbors

In [28]:
#Run k-means to cluster the neighborhood into 5 clusters.
# set number of clusters
kclusters = 5

brooklyn_grouped_clustering = brooklyn_grouped.drop('Neighborhood', 1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(brooklyn_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([2, 0, 0, 0, 3, 0, 2, 2, 2, 0], dtype=int32)

In [29]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

brooklyn_merged = brooklyn_data

# merge brooklyn_grouped with brooklyn_data to add latitude/longitude for each neighborhood
brooklyn_merged = brooklyn_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

brooklyn_merged.head() # check the last columns!

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Brooklyn,Bay Ridge,40.625801,-74.030621,0,Spa,Bagel Shop,Pizza Place,Italian Restaurant,American Restaurant,Greek Restaurant,Pharmacy,Chinese Restaurant,Bar,Grocery Store
1,Brooklyn,Bensonhurst,40.611009,-73.99518,0,Park,Donut Shop,Pizza Place,Italian Restaurant,Chinese Restaurant,Sushi Restaurant,Ice Cream Shop,Noodle House,Sporting Goods Shop,Cha Chaan Teng
2,Brooklyn,Sunset Park,40.645103,-74.010316,2,Pizza Place,Mexican Restaurant,Latin American Restaurant,Bakery,Bank,Fried Chicken Joint,Gym,Mobile Phone Shop,Deli / Bodega,Sandwich Place
3,Brooklyn,Greenpoint,40.730201,-73.954241,0,Pizza Place,Coffee Shop,Bar,Cocktail Bar,Grocery Store,Yoga Studio,Record Shop,Deli / Bodega,Sandwich Place,Tea Room
4,Brooklyn,Gravesend,40.59526,-73.973471,2,Italian Restaurant,Lounge,Pizza Place,Chinese Restaurant,Bakery,Men's Store,Spa,Furniture / Home Store,Gym,Pharmacy


In [30]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(brooklyn_merged['Latitude'], brooklyn_merged['Longitude'], brooklyn_merged['Neighborhood'], brooklyn_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

## 4. Examine Clusters

#### Cluster 1

In [31]:
brooklyn_merged.loc[brooklyn_merged['Cluster Labels'] == 0, brooklyn_merged.columns[[1] + list(range(5, brooklyn_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Bay Ridge,Spa,Bagel Shop,Pizza Place,Italian Restaurant,American Restaurant,Greek Restaurant,Pharmacy,Chinese Restaurant,Bar,Grocery Store
1,Bensonhurst,Park,Donut Shop,Pizza Place,Italian Restaurant,Chinese Restaurant,Sushi Restaurant,Ice Cream Shop,Noodle House,Sporting Goods Shop,Cha Chaan Teng
3,Greenpoint,Pizza Place,Coffee Shop,Bar,Cocktail Bar,Grocery Store,Yoga Studio,Record Shop,Deli / Bodega,Sandwich Place,Tea Room
6,Sheepshead Bay,Dessert Shop,Turkish Restaurant,Yoga Studio,Sandwich Place,Miscellaneous Shop,Fishing Spot,Harbor / Marina,Russian Restaurant,Café,Beer Garden
9,Crown Heights,Pizza Place,Café,Museum,Bagel Shop,Bakery,Coffee Shop,Pharmacy,Cosmetics Shop,Salon / Barbershop,Playground
11,Kensington,Grocery Store,Thai Restaurant,Pizza Place,Ice Cream Shop,Japanese Restaurant,Café,Mobile Phone Shop,Supermarket,Music Venue,Gas Station
12,Windsor Terrace,Deli / Bodega,Park,Grocery Store,Diner,Plaza,Bakery,Beer Store,French Restaurant,Food Truck,Middle Eastern Restaurant
13,Prospect Heights,Mexican Restaurant,Bar,Bakery,Thai Restaurant,Cocktail Bar,Coffee Shop,Wine Shop,Diner,Café,Wine Bar
15,Williamsburg,Pizza Place,Coffee Shop,Bagel Shop,Burger Joint,Clothing Store,Deli / Bodega,Middle Eastern Restaurant,Pet Store,Yoga Studio,Greek Restaurant
16,Bushwick,Bar,Mexican Restaurant,Coffee Shop,Discount Store,Deli / Bodega,Thrift / Vintage Store,Vegetarian / Vegan Restaurant,Bakery,Café,Pharmacy


#### Cluster 2

In [32]:
brooklyn_merged.loc[brooklyn_merged['Cluster Labels'] == 1, brooklyn_merged.columns[[1] + list(range(5, brooklyn_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
30,Mill Island,Pool,Yoga Studio,Optical Shop,Outlet Store,Outdoors & Recreation,Outdoor Gym,Other Repair Shop,Other Great Outdoors,Organic Grocery,Opera House


#### Cluster 3

In [33]:
brooklyn_merged.loc[brooklyn_merged['Cluster Labels'] == 2, brooklyn_merged.columns[[1] + list(range(5, brooklyn_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
2,Sunset Park,Pizza Place,Mexican Restaurant,Latin American Restaurant,Bakery,Bank,Fried Chicken Joint,Gym,Mobile Phone Shop,Deli / Bodega,Sandwich Place
4,Gravesend,Italian Restaurant,Lounge,Pizza Place,Chinese Restaurant,Bakery,Men's Store,Spa,Furniture / Home Store,Gym,Pharmacy
5,Brighton Beach,Restaurant,Eastern European Restaurant,Russian Restaurant,Gourmet Shop,Bank,Mobile Phone Shop,Beach,Pharmacy,Sushi Restaurant,Convenience Store
7,Manhattan Terrace,Pizza Place,Ice Cream Shop,Grocery Store,Donut Shop,Bank,Jazz Club,Chinese Restaurant,Mobile Phone Shop,Coffee Shop,Convenience Store
8,Flatbush,Pharmacy,Mexican Restaurant,Coffee Shop,Deli / Bodega,Caribbean Restaurant,Bagel Shop,Middle Eastern Restaurant,Sandwich Place,Pizza Place,Lounge
10,East Flatbush,Chinese Restaurant,Park,Print Shop,Wine Shop,Fast Food Restaurant,Department Store,Supermarket,Liquor Store,Pharmacy,Caribbean Restaurant
14,Brownsville,Fried Chicken Joint,Moving Target,Chinese Restaurant,Park,Pizza Place,Spanish Restaurant,Farmers Market,Performing Arts Venue,Burger Joint,Playground
25,Cypress Hills,Pizza Place,Donut Shop,Fast Food Restaurant,Ice Cream Shop,Fried Chicken Joint,Metro Station,Baseball Field,Discount Store,Deli / Bodega,Dance Studio
26,East New York,Pharmacy,Spanish Restaurant,Fried Chicken Joint,Fast Food Restaurant,Deli / Bodega,Caribbean Restaurant,Plaza,Pizza Place,Salon / Barbershop,Home Service
27,Starrett City,Pharmacy,Convenience Store,American Restaurant,Bus Station,Intersection,Donut Shop,Bus Stop,Chinese Restaurant,River,Moving Target


#### Cluster 4

In [34]:
brooklyn_merged.loc[brooklyn_merged['Cluster Labels'] == 3, brooklyn_merged.columns[[1] + list(range(5, brooklyn_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
31,Manhattan Beach,Beach,Harbor / Marina,Bus Stop,Ice Cream Shop,Sandwich Place,Café,Playground,Pizza Place,Food,Other Repair Shop
36,Gerritsen Beach,Bar,Harbor / Marina,Seafood Restaurant,Ice Cream Shop,Baseball Field,Pizza Place,Bagel Shop,Deli / Bodega,Department Store,Boat or Ferry
45,Bergen Beach,Harbor / Marina,Playground,Baseball Field,Athletics & Sports,Park,Noodle House,North Indian Restaurant,Opera House,Optical Shop,Organic Grocery
46,Midwood,Pizza Place,Bakery,Moving Target,Ice Cream Shop,Pharmacy,Video Game Store,Candy Store,Convenience Store,Outlet Store,Outdoors & Recreation


#### Cluster 5

In [35]:
brooklyn_merged.loc[brooklyn_merged['Cluster Labels'] == 4, brooklyn_merged.columns[[1] + list(range(5, brooklyn_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
59,Paerdegat Basin,Home Service,Food,Business Service,Organic Grocery,Pakistani Restaurant,Outlet Store,Outdoors & Recreation,Outdoor Gym,Other Repair Shop,Other Great Outdoors
