
<h1 align=center><font size = 5>Final Capstone Project - Battle of the Coffee Merchants</font></h1>


## Table of Contents


1.  <a href="#item1">Introduction</a>

2.  <a href="#item2">Business Problem</a>

3.  <a href="#item2">Data Section</a>

4.  <a href="#item3">Data Cleaning</a>

5.  <a href="#item3">Analyze Each Neighborhood</a>

6.  <a href="#item4">Cluster Neighborhoods</a>

7.  <a href="#item5">Examine Clusters</a>  

8.  <a href="#item5">Conclusion</a>  


## 1. Introduction

We are exploring the opportunities for a mobile coffee merchant service in the neighborhoods of New York. The business concept is an independent unit that provides a full range of barista offerings and can relocate during the day, based on demand and the volume of pedestrian traffic, to optimize sales.
<br/>
The initial survey will determine the competition in the various neighborhoods and identify the optimal locations for placement.
<br/>
Once this phase has been completed, the availability of licenses to trade may further refine the options. Finally, proximity to supply outlets for consumables will be factored in, as the units have limited storage capacity for items such as milk and milk alternatives.


## 2. Business Problem
The data comparison could extend to all suppliers of coffee. However, we are concerned primarily with direct competitors selling take-away coffee. Other providers, such as diners, attract customers with their additional food offerings, and this service does not compete with them. Likewise, venues that do not primarily offer take-away provide a seated service, which is not comparable to a mobile coffee kiosk.

## 3. Data Section



Before we get the data and start exploring it, let's download all the dependencies that we will need.


In [1]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

!conda install -c conda-forge geopy --yes # comment this line out if geopy is already installed
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe (pandas >= 1.0)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

print('Libraries imported.')


<a id='item1'></a>


### 3.1. Download and Explore Dataset


New York City has a total of 5 boroughs and 306 neighborhoods. In order to segment the neighborhoods and explore them, we need a dataset that contains the 5 boroughs and the neighborhoods that exist in each borough, as well as the latitude and longitude coordinates of each neighborhood.


The same newyork_data.json file provided in the labs is being used here.


In [2]:
!wget -q -O 'newyork_data.json' https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/labs/newyork_data.json
print('Data downloaded!')

Data downloaded!


#### Load and explore the data


Next, let's load the data.


In [3]:
with open('newyork_data.json') as json_data:
    newyork_data = json.load(json_data)

We know from looking at the data offline that most of the relevant fields are contained in the _features_ key, which is basically a list of the neighborhoods. With this in mind, we define a new variable that holds this data.


In [4]:
neighborhoods_data = newyork_data['features']

By way of example let's take a look at the first item in this list.


In [5]:
neighborhoods_data[0]

{'type': 'Feature',
 'id': 'nyu_2451_34572.1',
 'geometry': {'type': 'Point',
  'coordinates': [-73.84720052054902, 40.89470517661]},
 'geometry_name': 'geom',
 'properties': {'name': 'Wakefield',
  'stacked': 1,
  'annoline1': 'Wakefield',
  'annoline2': None,
  'annoline3': None,
  'annoangle': 0.0,
  'borough': 'Bronx',
  'bbox': [-73.84720052054902,
   40.89470517661,
   -73.84720052054902,
   40.89470517661]}}

#### Transform the data into a _pandas_ dataframe


The next task is essentially transforming this data of nested Python dictionaries into a _pandas_ dataframe. So let's start by creating an empty dataframe.


In [6]:
# define the dataframe columns
column_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude'] 

# instantiate the dataframe
neighborhoods = pd.DataFrame(columns=column_names)

Take a look at the empty dataframe to confirm that the columns are as intended.


In [7]:
neighborhoods

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude


Then let's loop through the data, collect one row per neighborhood, and fill the dataframe.


In [8]:
rows = []
for data in neighborhoods_data:
    borough = data['properties']['borough']
    neighborhood_name = data['properties']['name']

    # GeoJSON stores coordinates as [longitude, latitude]
    neighborhood_latlon = data['geometry']['coordinates']
    neighborhood_lat = neighborhood_latlon[1]
    neighborhood_lon = neighborhood_latlon[0]

    rows.append({'Borough': borough,
                 'Neighborhood': neighborhood_name,
                 'Latitude': neighborhood_lat,
                 'Longitude': neighborhood_lon})

# DataFrame.append was removed in pandas 2.0, so build the dataframe in one step
neighborhoods = pd.DataFrame(rows, columns=column_names)

Quickly examine the resulting dataframe.


In [9]:
neighborhoods.head()

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,Bronx,Wakefield,40.894705,-73.847201
1,Bronx,Co-op City,40.874294,-73.829939
2,Bronx,Eastchester,40.887556,-73.827806
3,Bronx,Fieldston,40.895437,-73.905643
4,Bronx,Riverdale,40.890834,-73.912585
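
As an alternative to the explicit loop, the same table can probably be built with `json_normalize`, which flattens nested keys into dotted column names. The sketch below is only an illustration on a single feature copied from the output above; the names `sample_features`, `flat` and `neighborhoods_alt` are ours and do not appear in the notebook, and pandas 1.0 or later is assumed.

```python
import pandas as pd

# one feature copied from the output above (trimmed);
# in the notebook the full list is newyork_data['features']
sample_features = [
    {
        'geometry': {'type': 'Point',
                     'coordinates': [-73.84720052054902, 40.89470517661]},
        'properties': {'name': 'Wakefield', 'borough': 'Bronx'},
    }
]

# nested dictionaries become columns such as 'properties.name';
# the coordinates list is kept as a single list-valued column
flat = pd.json_normalize(sample_features)

neighborhoods_alt = pd.DataFrame({
    'Borough': flat['properties.borough'],
    'Neighborhood': flat['properties.name'],
    # GeoJSON order is [longitude, latitude]
    'Latitude': flat['geometry.coordinates'].apply(lambda c: c[1]),
    'Longitude': flat['geometry.coordinates'].apply(lambda c: c[0]),
})
print(neighborhoods_alt)
```

Building the dataframe in one step avoids growing it row by row, which matters little for 306 neighborhoods but scales better for larger files.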


And make sure that the dataset has all 5 boroughs and 306 neighborhoods.


In [10]:
print('The dataframe has {} boroughs and {} neighborhoods.'.format(
        len(neighborhoods['Borough'].unique()),
        neighborhoods.shape[0]
    )
)

The dataframe has 5 boroughs and 306 neighborhoods.


#### Use geopy library to get the latitude and longitude values of New York City.


In order to define an instance of the geocoder, we need to define a user_agent. We will name our agent <em>ny_explorer</em>, as shown below.


In [11]:
address = 'New York City, NY'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of New York City are {}, {}.'.format(latitude, longitude))

The geographical coordinates of New York City are 40.7127281, -74.0060152.
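
The geocoder returns `None` when it finds no match, in which case `location.latitude` would fail with an `AttributeError`. A minimal sketch of a guard is shown below. The helper `geocode_or_raise` and the `FakeGeocoder` stand-in are hypothetical names introduced here so the example runs offline; in the notebook the first argument would be the `Nominatim` instance.

```python
from types import SimpleNamespace


def geocode_or_raise(geolocator, address):
    """Return (latitude, longitude) for address, or raise if nothing was found."""
    location = geolocator.geocode(address)
    if location is None:
        raise ValueError('No geocoding result for {!r}'.format(address))
    return location.latitude, location.longitude


class FakeGeocoder:
    """Offline stand-in that mimics the geocode() call used in this notebook."""

    def geocode(self, address):
        if address == 'New York City, NY':
            # coordinates copied from the output above
            return SimpleNamespace(latitude=40.7127281, longitude=-74.0060152)
        return None


latitude, longitude = geocode_or_raise(FakeGeocoder(), 'New York City, NY')
print(latitude, longitude)
```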


#### Create a map of New York with neighborhoods superimposed on top.


In [12]:
# create map of New York using latitude and longitude values
map_newyork = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(neighborhoods['Latitude'], neighborhoods['Longitude'], neighborhoods['Borough'], neighborhoods['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_newyork)  
    
map_newyork

**Folium** is a great visualization library. Feel free to zoom into the above map, and click on each circle marker to reveal the name of the neighborhood and its respective borough.


However, for illustration purposes, let's simplify the above map and segment and cluster only the neighborhoods in Manhattan. So let's slice the original dataframe and create a new dataframe of the Manhattan data.


In [13]:
manhattan_data = neighborhoods[neighborhoods['Borough'] == 'Manhattan'].reset_index(drop=True)
manhattan_data.head()

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,Manhattan,Marble Hill,40.876551,-73.91066
1,Manhattan,Chinatown,40.715618,-73.994279
2,Manhattan,Washington Heights,40.851903,-73.9369
3,Manhattan,Inwood,40.867684,-73.92121
4,Manhattan,Hamilton Heights,40.823604,-73.949688


Let's get the geographical coordinates of Manhattan.


In [14]:
address = 'Manhattan, NY'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Manhattan are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Manhattan are 40.7896239, -73.9598939.


Let's visualize Manhattan and the neighborhoods in it.


In [15]:
# create map of Manhattan using latitude and longitude values
map_manhattan = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, label in zip(manhattan_data['Latitude'], manhattan_data['Longitude'], manhattan_data['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_manhattan)  
    
map_manhattan

Next, we are going to start utilizing the Foursquare API to explore the neighborhoods and segment them.


#### Define Foursquare Credentials and Version

This hidden section provides the account connection tokens; the results will be evident in the data below.


In [16]:
CLIENT_ID = 'IFEMUJQT1JS5K4M2PSN2VPWR3HKJEJLHRQSUYSI3ZVYIT1O1' # your Foursquare ID
CLIENT_SECRET = 'YHBXGFHQWGBXD2IMTHT3IJ211GYBROVCA1DZDM0AYMHABOIH' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: IFEMUJQT1JS5K4M2PSN2VPWR3HKJEJLHRQSUYSI3ZVYIT1O1
CLIENT_SECRET:YHBXGFHQWGBXD2IMTHT3IJ211GYBROVCA1DZDM0AYMHABOIH


#### Let's explore the first neighborhood in our dataframe.


Get the neighborhood's name.


In [17]:
manhattan_data.loc[0, 'Neighborhood']

'Marble Hill'

Get the neighborhood's latitude and longitude values.


In [18]:
neighborhood_latitude = manhattan_data.loc[0, 'Latitude'] # neighborhood latitude value
neighborhood_longitude = manhattan_data.loc[0, 'Longitude'] # neighborhood longitude value

neighborhood_name = manhattan_data.loc[0, 'Neighborhood'] # neighborhood name

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))

Latitude and longitude values of Marble Hill are 40.87655077879964, -73.91065965862981.


#### Now, let's get the top 100 venues that are in Marble Hill within a radius of 500 meters.


First, let's create the GET request URL and store it in **url**.


In [19]:
# build the request URL

radius = 500
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighborhood_latitude, 
    neighborhood_longitude, 
    radius, 
    LIMIT)
url # display URL

'https://api.foursquare.com/v2/venues/explore?&client_id=IFEMUJQT1JS5K4M2PSN2VPWR3HKJEJLHRQSUYSI3ZVYIT1O1&client_secret=YHBXGFHQWGBXD2IMTHT3IJ211GYBROVCA1DZDM0AYMHABOIH&v=20180605&ll=40.87655077879964,-73.91065965862981&radius=500&limit=100'
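
The same URL can also be assembled from a dictionary of parameters, which keeps each value next to its name and lets the standard library handle escaping. This is a sketch only: `build_explore_url` is a hypothetical helper, and the credential strings are placeholders rather than working keys.

```python
from urllib.parse import urlencode


def build_explore_url(client_id, client_secret, version, lat, lng, radius, limit):
    """Assemble the explore request URL from named query parameters."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),
        'radius': radius,
        'limit': limit,
    }
    # urlencode percent-escapes the values (the comma in 'll' becomes %2C)
    return 'https://api.foursquare.com/v2/venues/explore?' + urlencode(params)


example_url = build_explore_url(
    'YOUR_CLIENT_ID',      # placeholder
    'YOUR_CLIENT_SECRET',  # placeholder
    '20180605',
    40.87655077879964,
    -73.91065965862981,
    500,
    100)
print(example_url)
```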



Send the GET request and examine the results.


In [20]:
results = requests.get(url).json()
# results

From the Foursquare lab in the previous module, we know that all the information is in the _items_ key. Before we proceed, let's borrow the **get_category_type** function from the Foursquare lab.


In [33]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']
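
To make the nesting of the response explicit, here is a minimal sketch that pulls the same fields out with plain dictionaries, including the case where a venue has no category. The function `extract_venue_rows` and the `sample_response` dictionary are hypothetical; the sample only mirrors the keys this notebook reads and is not real API output.

```python
def extract_venue_rows(response_json):
    """Return (name, category, lat, lng) tuples from an explore-style response dict."""
    rows = []
    for item in response_json['response']['groups'][0]['items']:
        venue = item['venue']
        categories = venue.get('categories', [])
        # a venue may carry no category; use None in that case
        category = categories[0]['name'] if categories else None
        rows.append((venue['name'], category,
                     venue['location']['lat'], venue['location']['lng']))
    return rows


# hand-made stand-in with the same nesting as the response used in this notebook
sample_response = {
    'response': {'groups': [{'items': [
        {'venue': {'name': 'Starbucks',
                   'categories': [{'name': 'Coffee Shop'}],
                   'location': {'lat': 40.877531, 'lng': -73.905582}}},
        {'venue': {'name': 'Uncategorized Venue',  # made-up entry
                   'categories': [],
                   'location': {'lat': 40.0, 'lng': -73.0}}},
    ]}]}
}

venue_rows = extract_venue_rows(sample_response)
print(venue_rows)
```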

Now we are ready to clean the json and structure it into a _pandas_ dataframe.


In [34]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1) 

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()


Unnamed: 0,name,categories,lat,lng
0,Arturo's,Pizza Place,40.874412,-73.910271
1,Bikram Yoga,Yoga Studio,40.876844,-73.906204
2,Tibbett Diner,Diner,40.880404,-73.908937
3,Dunkin',Donut Shop,40.877136,-73.906666
4,Starbucks,Coffee Shop,40.877531,-73.905582
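
As a sanity check on the 500 meter radius, the distance between the neighborhood coordinates and a returned venue can be estimated with the haversine formula. The sketch below assumes a spherical Earth with a mean radius of 6,371 km; `haversine_m` is a hypothetical helper, and the coordinates are the rounded values shown in the tables above.

```python
from math import radians, sin, cos, asin, sqrt


def haversine_m(lat1, lng1, lat2, lng2):
    """Great-circle distance in meters between two latitude/longitude points."""
    earth_radius_m = 6371000  # mean Earth radius (spherical approximation)
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlambda = radians(lng2 - lng1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * earth_radius_m * asin(sqrt(a))


# Marble Hill neighborhood coordinates versus the first Starbucks in the table
distance = haversine_m(40.876551, -73.910660, 40.877531, -73.905582)
print(round(distance), 'meters')
```

A result below 500 is consistent with the radius passed in the request.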


## 4. Data Cleaning
We only need the venues categorized as "Coffee Shop"; the rest can be excluded. Although this filtering could have been done in the Foursquare API request itself, filtering after download makes it faster to run similar queries across other venue categories if the business model needs to be expanded to investigate related or alternative competitors.

In [35]:
nearby_venues = nearby_venues.drop(nearby_venues[nearby_venues.categories != 'Coffee Shop'].index)
nearby_venues.head()

Unnamed: 0,name,categories,lat,lng
4,Starbucks,Coffee Shop,40.877531,-73.905582
17,Starbucks,Coffee Shop,40.873234,-73.90873


And how many coffee shops remain after filtering?


In [24]:
print('{} coffee shops remain after filtering.'.format(nearby_venues.shape[0]))

2 coffee shops remain after filtering.
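
Since the filter may later be widened to related competitors, it can be written as a small function that accepts any set of categories. This is a sketch under our own naming (`filter_by_category` is not part of the notebook), applied to a few rows copied from the table above. Unlike the `drop` call, it resets the index.

```python
import pandas as pd


def filter_by_category(venues, categories):
    """Keep only the rows whose 'categories' value is in the given collection."""
    mask = venues['categories'].isin(categories)
    return venues[mask].reset_index(drop=True)


# a few rows copied from the venue table above
sample = pd.DataFrame({
    'name': ["Arturo's", 'Tibbett Diner', "Dunkin'", 'Starbucks'],
    'categories': ['Pizza Place', 'Diner', 'Donut Shop', 'Coffee Shop'],
})

coffee_only = filter_by_category(sample, ['Coffee Shop'])
coffee_and_donuts = filter_by_category(sample, ['Coffee Shop', 'Donut Shop'])
print(coffee_only)
print(coffee_and_donuts)
```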


<a id='item2'></a>


### 4.1. Explore Neighborhoods in Manhattan


#### Let's create a function to repeat the same process to all the neighborhoods in Manhattan


In [25]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # keep only the relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues

#### Now run the above function on each neighborhood and create a new dataframe called _manhattan_venues_.


In [69]:
manhattan_venues = getNearbyVenues(names=manhattan_data['Neighborhood'],
                                   latitudes=manhattan_data['Latitude'],
                                   longitudes=manhattan_data['Longitude']
                                  )

Marble Hill
Chinatown
Washington Heights
Inwood
Hamilton Heights
Manhattanville
Central Harlem
East Harlem
Upper East Side
Yorkville
Lenox Hill
Roosevelt Island
Upper West Side
Lincoln Square
Clinton
Midtown
Murray Hill
Chelsea
Greenwich Village
East Village
Lower East Side
Tribeca
Little Italy
Soho
West Village
Manhattan Valley
Morningside Heights
Gramercy
Battery Park City
Financial District
Carnegie Hill
Noho
Civic Center
Midtown South
Sutton Place
Turtle Bay
Tudor City
Stuyvesant Town
Flatiron
Hudson Yards




#### Let's check the size of the resulting dataframe


In [71]:
print(manhattan_venues.shape)
manhattan_venues.head()

(3257, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Marble Hill,40.876551,-73.91066,Arturo's,40.874412,-73.910271,Pizza Place
1,Marble Hill,40.876551,-73.91066,Bikram Yoga,40.876844,-73.906204,Yoga Studio
2,Marble Hill,40.876551,-73.91066,Tibbett Diner,40.880404,-73.908937,Diner
3,Marble Hill,40.876551,-73.91066,Dunkin',40.877136,-73.906666,Donut Shop
4,Marble Hill,40.876551,-73.91066,Starbucks,40.877531,-73.905582,Coffee Shop


In [73]:
manhattan_venues = manhattan_venues.drop(manhattan_venues[manhattan_venues['Venue Category'] != 'Coffee Shop'].index)
manhattan_venues.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
4,Marble Hill,40.876551,-73.91066,Starbucks,40.877531,-73.905582,Coffee Shop
17,Marble Hill,40.876551,-73.91066,Starbucks,40.873234,-73.90873,Coffee Shop
82,Chinatown,40.715618,-73.994279,Little Canal,40.714317,-73.990361,Coffee Shop
129,Washington Heights,40.851903,-73.9369,Forever Coffee Bar,40.850433,-73.936607,Coffee Shop
152,Washington Heights,40.851903,-73.9369,Starbucks,40.850961,-73.93833,Coffee Shop


Let's check how many coffee shops were returned for each neighborhood.


In [105]:
manhattan_venues.groupby('Neighborhood').count()

Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Battery Park City,5,5,5,5,5,5
Carnegie Hill,7,7,7,7,7,7
Chelsea,6,6,6,6,6,6
Chinatown,1,1,1,1,1,1
Civic Center,7,7,7,7,7,7
Clinton,4,4,4,4,4,4
East Village,2,2,2,2,2,2
Financial District,10,10,10,10,10,10
Flatiron,3,3,3,3,3,3
Gramercy,3,3,3,3,3,3


#### Let's find out how many unique categories remain among the returned venues


In [75]:
print('There are {} unique categories.'.format(len(manhattan_venues['Venue Category'].unique())))

There are 1 unique categories.


<a id='item3'></a>


## 5. Analyze Each Neighborhood


In [76]:
# one hot encoding
manhattan_onehot = pd.get_dummies(manhattan_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
manhattan_onehot['Neighborhood'] = manhattan_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [manhattan_onehot.columns[-1]] + list(manhattan_onehot.columns[:-1])
manhattan_onehot = manhattan_onehot[fixed_columns]

manhattan_onehot.head()

Unnamed: 0,Neighborhood,Coffee Shop
4,Marble Hill,1
17,Marble Hill,1
82,Chinatown,1
129,Washington Heights,1
152,Washington Heights,1


And let's examine the new dataframe size.


In [77]:
manhattan_onehot.shape

(135, 2)

#### Next, let's group rows by neighborhood, taking the mean of the frequency of occurrence of each category


In [78]:
manhattan_grouped = manhattan_onehot.groupby('Neighborhood').mean().reset_index()
manhattan_grouped

Unnamed: 0,Neighborhood,Coffee Shop
0,Battery Park City,1
1,Carnegie Hill,1
2,Chelsea,1
3,Chinatown,1
4,Civic Center,1
5,Clinton,1
6,East Village,1
7,Financial District,1
8,Flatiron,1
9,Gramercy,1


#### Let's confirm the new size


In [79]:
manhattan_grouped.shape

(38, 2)
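
Because only the Coffee Shop category survives the filter, its one-hot column is 1 on every row, so the grouped mean is 1 for every neighborhood and cannot tell them apart. A count per neighborhood keeps the signal. The sketch below illustrates this on the five rows shown in the head output above (so the counts are for those rows only); `sample_venues`, `means` and `coffee_counts` are names introduced here.

```python
import pandas as pd

# the five coffee-shop rows shown in the head output above
sample_venues = pd.DataFrame({
    'Neighborhood': ['Marble Hill', 'Marble Hill', 'Chinatown',
                     'Washington Heights', 'Washington Heights'],
    'Venue': ['Starbucks', 'Starbucks', 'Little Canal',
              'Forever Coffee Bar', 'Starbucks'],
    'Venue Category': ['Coffee Shop'] * 5,
})

# same one-hot step as in the notebook (dtype=int keeps the column numeric)
onehot = pd.get_dummies(sample_venues[['Venue Category']],
                        prefix="", prefix_sep="", dtype=int)
onehot['Neighborhood'] = sample_venues['Neighborhood']

# the mean of a column that is always 1 is 1 for every group
means = onehot.groupby('Neighborhood').mean()

# the number of rows per neighborhood differs between groups
coffee_counts = (sample_venues.groupby('Neighborhood')
                 .size()
                 .sort_values(ascending=False))
print(means)
print(coffee_counts)
```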

#### Let's print each neighborhood along with the top 5 most common venues


In [95]:
num_top_venues = 5

for hood in manhattan_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = manhattan_grouped[manhattan_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Battery Park City----
         venue  freq
0  Coffee Shop   1.0


----Carnegie Hill----
         venue  freq
0  Coffee Shop   1.0


----Chelsea----
         venue  freq
0  Coffee Shop   1.0


----Chinatown----
         venue  freq
0  Coffee Shop   1.0


----Civic Center----
         venue  freq
0  Coffee Shop   1.0


----Clinton----
         venue  freq
0  Coffee Shop   1.0


----East Village----
         venue  freq
0  Coffee Shop   1.0


----Financial District----
         venue  freq
0  Coffee Shop   1.0


----Flatiron----
         venue  freq
0  Coffee Shop   1.0


----Gramercy----
         venue  freq
0  Coffee Shop   1.0


----Greenwich Village----
         venue  freq
0  Coffee Shop   1.0


----Hamilton Heights----
         venue  freq
0  Coffee Shop   1.0


----Hudson Yards----
         venue  freq
0  Coffee Shop   1.0


----Inwood----
         venue  freq
0  Coffee Shop   1.0


----Lenox Hill----
         venue  freq
0  Coffee Shop   1.0


----Lincoln Square----
         v

#### Let's put that into a _pandas_ dataframe


First, let's write a function to sort the venues in descending order.


In [81]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

Now let's create the new dataframe and display the top 10 venues for each neighborhood.


In [82]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = manhattan_grouped['Neighborhood']

for ind in np.arange(manhattan_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(manhattan_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Battery Park City,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop
1,Carnegie Hill,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop
2,Chelsea,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop
3,Chinatown,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop
4,Civic Center,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop
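
The column labels above rely on a three-item suffix list plus an exception fallback, which works for 1 through 10 but would label position 21 as "21th". A small helper covers the general case. This is a sketch only, and `ordinal` is a hypothetical name.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 11 -> '11th'."""
    if 10 <= n % 100 <= 20:
        # 11th, 12th, 13th and the rest of the teens always take 'th'
        suffix = 'th'
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)


num_top_venues = 10
columns = ['Neighborhood'] + ['{} Most Common Venue'.format(ordinal(i))
                              for i in range(1, num_top_venues + 1)]
print(columns)
```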


<a id='item4'></a>


## 6. Cluster Neighborhoods


Run _k_-means to cluster the neighborhoods into 5 clusters.


In [83]:
# set number of clusters
kclusters = 5

manhattan_grouped_clustering = manhattan_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(manhattan_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 


array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int32)
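
Every label shown is 0 because each neighborhood has the same single feature value, so there is nothing for _k_-means to separate. One quick check before clustering is to count the distinct feature rows, since asking for more clusters than that cannot produce meaningful groups. The sketch below uses NumPy only; `count_distinct_rows` is a hypothetical helper and the second input is made-up illustrative data, not results from this notebook.

```python
import numpy as np


def count_distinct_rows(features):
    """Number of distinct feature vectors in a 2-D array-like."""
    return len(np.unique(np.asarray(features), axis=0))


# what the clustering input looks like here: the same value on every row
identical_features = [[1.0]] * 10

# made-up per-neighborhood counts, for contrast only
varied_features = [[2], [1], [3], [10], [7]]

print(count_distinct_rows(identical_features))
print(count_distinct_rows(varied_features))
```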

Let's create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.


In [84]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

manhattan_merged = manhattan_data

# merge manhattan_grouped with manhattan_data to add latitude/longitude for each neighborhood
manhattan_merged = manhattan_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

manhattan_merged.head() # check the last columns!

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Manhattan,Marble Hill,40.876551,-73.91066,0.0,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop
1,Manhattan,Chinatown,40.715618,-73.994279,0.0,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop
2,Manhattan,Washington Heights,40.851903,-73.9369,0.0,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop
3,Manhattan,Inwood,40.867684,-73.92121,0.0,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop
4,Manhattan,Hamilton Heights,40.823604,-73.949688,0.0,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop


<a id='item5'></a>


## 7. Examine Clusters


Now we can examine each cluster and determine the venue categories that distinguish it, then name each cluster based on its defining categories. Note that because only the Coffee Shop category was retained, every cluster label printed above is 0, so only one cluster is examined here.


#### Cluster 1


In [86]:
manhattan_merged.loc[manhattan_merged['Cluster Labels'] == 0, manhattan_merged.columns[[1] + list(range(5, manhattan_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Marble Hill,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop
1,Chinatown,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop
2,Washington Heights,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop
3,Inwood,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop
4,Hamilton Heights,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop
5,Manhattanville,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop
8,Upper East Side,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop
9,Yorkville,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop
10,Lenox Hill,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop
11,Roosevelt Island,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop


## 8. Conclusion

Fixed-address coffee shops appear to be dispersed at broadly similar densities throughout the neighborhoods. This is likely the result of a mature market that has been refined over decades, and it provides enough data to warrant further investigation into workplace densities and pedestrian traffic routes.

Neighborhoods with a higher number of competitors included:
<ul>
<li>Financial District</li>
<li>Carnegie Hill</li>
<li>Chelsea</li>
<li>Civic Center</li>
<li>Lenox Hill</li>
<li>Yorkville</li>
</ul>
An initial opportunity would be to experiment with locations in close proximity to existing sites (licenses permitting). There will invariably be queues during peak hours, and overflow customers may be worth pursuing; in time they may become direct customers.

Once traffic densities have been explored, one kiosk would be a worthwhile investment for an actual trial to determine the best locations before committing to any further CAPEX.