<h1 align=center><font size = 5>Clustering New York and Toronto Neighbourhoods together</font></h1>


This project clusters the neighbourhoods of two cities, Toronto and New York, together. A person moving from one city to the other can then choose an area in the new city from the same cluster as their current area, and so find similar venues and facilities nearby.

For this, we need the neighbourhood data for both cities arranged in the same format. We then combine the two datasets and apply a clustering algorithm to the combined data.

Before we get the data and start exploring it, let's download all the dependencies that we will need.

In [1]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

!conda install -c conda-forge geopy --yes # installs geopy; comment this line out if it is already installed
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

import urllib.request

print('Libraries imported.')


## 1. NEW YORK DATA.

We will download and arrange the data for the two cities one at a time, in the same format.
First, we get the New York dataset.

This dataset exists for free on the web. Feel free to try to find this dataset on your own, but here is the link to the dataset: https://geo.nyu.edu/catalog/nyu_2451_34572

In [2]:
!wget -q -O 'newyork_data.json' https://cocl.us/new_york_dataset
print('Data downloaded!')

Data downloaded!


In the following cell, we process the downloaded New York City data into a readable DataFrame for further processing. The cell also filters the data down to Manhattan and maps it; for this project, we use only the Manhattan neighbourhoods.

In [3]:
with open('newyork_data.json') as json_data:
    newyork_data = json.load(json_data)

neighborhoods_data = newyork_data['features']

# define the dataframe columns
column_names = ['Borough', 'Neighbourhood', 'Latitude', 'Longitude'] 

# collect one dictionary per neighbourhood
neighborhood_rows = []

for data in neighborhoods_data:
    borough = data['properties']['borough'] 
    neighborhood_name = data['properties']['name']
        
    # coordinates are stored as [longitude, latitude]
    neighborhood_latlon = data['geometry']['coordinates']
    neighborhood_lat = neighborhood_latlon[1]
    neighborhood_lon = neighborhood_latlon[0]
    
    neighborhood_rows.append({'Borough': borough,
                              'Neighbourhood': neighborhood_name,
                              'Latitude': neighborhood_lat,
                              'Longitude': neighborhood_lon})

# build the dataframe in one step (DataFrame.append was removed in pandas 2.0)
neighborhoods = pd.DataFrame(neighborhood_rows, columns=column_names)
    

print('The dataframe has {} boroughs and {} neighbourhoods.'.format(
        len(neighborhoods['Borough'].unique()),
        neighborhoods.shape[0]
    )
)



manhattan_data = neighborhoods[neighborhoods['Borough'] == 'Manhattan'].reset_index(drop=True)
manhattan_data.head()


address = 'Manhattan, NY'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinate of Manhattan are {}, {}.'.format(latitude, longitude))



# create map of Manhattan using latitude and longitude values
map_manhattan = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, label in zip(manhattan_data['Latitude'], manhattan_data['Longitude'], manhattan_data['Neighbourhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_manhattan)  
    
map_manhattan

The dataframe has 5 boroughs and 306 neighbourhoods.
The geographical coordinate of Manhattan are 40.7896239, -73.9598939.
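
The row-by-row loop above could also be written with `pd.json_normalize`, which flattens nested JSON records into columns. The following is only a sketch: the two sample features are made up to mirror the `properties` / `geometry` layout read above (the second row and its coordinates are purely illustrative), and the variable names are not part of the notebook.

```python
import pandas as pd

# Hypothetical sample mirroring the layout of newyork_data['features'].
sample_features = [
    {"properties": {"borough": "Manhattan", "name": "Chinatown"},
     "geometry": {"coordinates": [-73.994279, 40.715618]}},
    {"properties": {"borough": "Bronx", "name": "Sample Neighbourhood"},
     "geometry": {"coordinates": [-73.85, 40.89]}},
]

# Nested keys become dotted column names such as 'properties.name'.
flat = pd.json_normalize(sample_features)

sample_neighborhoods = pd.DataFrame({
    "Borough": flat["properties.borough"],
    "Neighbourhood": flat["properties.name"],
    # Coordinates are stored as [longitude, latitude].
    "Latitude": flat["geometry.coordinates"].apply(lambda c: c[1]),
    "Longitude": flat["geometry.coordinates"].apply(lambda c: c[0]),
})

# Same Manhattan filter as in the cell above.
sample_manhattan = sample_neighborhoods[
    sample_neighborhoods["Borough"] == "Manhattan"
].reset_index(drop=True)
print(sample_manhattan)
```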


## 2. TORONTO DATA.

Now we get the Toronto dataset and arrange it in the same format; the top 10 most common venues are added later for both cities together.

This dataset exists for free on the web.
https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

In [4]:
html = urllib.request.urlopen("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M").read().decode('utf-8')

# one [Postal Code, Borough, Neighbourhood] entry per postal code
records = []

while True:
    # read the next three <td> cells: postal code, borough, neighbourhood
    rows = []
    for i in range(3):
        LinkStart = html.find("<td>")
        if LinkStart == -1:
            break
        html = html[LinkStart+4:]
        LinkEnd = html.find("<")
        rows.append(html[:LinkEnd].strip())

    # if the html holds no further complete row then exit the loop
    if len(rows) < 3:
        break

    # copy the row only if its borough is not "Not assigned"
    if rows[1] == 'Not assigned':
        continue

    # if Neighbourhood is 'Not assigned' then set Borough as the Neighbourhood
    if rows[2] == 'Not assigned':
        rows[2] = rows[1]

    # if more than one Neighbourhood shares a postal code, put them all in the same row
    if records and records[-1][0] == rows[0]:
        records[-1][2] = records[-1][2] + ', ' + rows[2]
    else:
        records.append(rows)

df = pd.DataFrame(records, columns=['Postal Code', 'Borough', 'Neighbourhood'])

!wget -O Geospatial_Coordinates.csv https://cocl.us/Geospatial_data
    

latlong = pd.read_csv("Geospatial_Coordinates.csv")

df1 = df.merge(latlong, on='Postal Code')

address = 'Toronto, Canada'

geolocator = Nominatim(user_agent="ca_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinate of Toronto are {}, {}.'.format(latitude, longitude))

# create map of Toronto using latitude and longitude values
map_Toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(df1['Latitude'], df1['Longitude'], df1['Borough'], df1['Neighbourhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_Toronto)  
    
map_Toronto

Toronto_Data = df1[df1['Borough'].str.contains('Toronto')].reset_index(drop=True)

# create map of Toronto using latitude and longitude values
map_Toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(Toronto_Data['Latitude'], Toronto_Data['Longitude'], Toronto_Data['Borough'], Toronto_Data['Neighbourhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_Toronto)  
    
map_Toronto

Toronto_Data_Filtered = Toronto_Data[['Borough','Neighbourhood','Latitude','Longitude']]
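
The merge and filter steps in this cell can be tried on a small, self-contained example. The sketch below assumes two tiny stand-in tables; the postal codes, boroughs and coordinates are illustrative values, not rows read from the real files.

```python
import pandas as pd

# Hypothetical rows standing in for the scraped table and the coordinates CSV.
postal = pd.DataFrame({
    "Postal Code": ["M4E", "M1B"],
    "Borough": ["East Toronto", "Scarborough"],
    "Neighbourhood": ["The Beaches", "Rouge, Malvern"],
})
coords = pd.DataFrame({
    "Postal Code": ["M1B", "M4E"],
    "Latitude": [43.81, 43.68],
    "Longitude": [-79.19, -79.29],
})

# Inner join on the shared key, as in df.merge(latlong, on='Postal Code').
merged = postal.merge(coords, on="Postal Code")

# Keep only boroughs whose name contains 'Toronto', then the four shared columns.
toronto_only = merged[merged["Borough"].str.contains("Toronto")].reset_index(drop=True)
filtered = toronto_only[["Borough", "Neighbourhood", "Latitude", "Longitude"]]
print(filtered)
```

Because the join is an inner join, a postal code missing from either table is silently dropped, so it can be worth comparing `len(df)` and `len(df1)` after the merge.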


Now we will use the Foursquare API to get the venue data for the neighbourhoods of both Manhattan and Toronto. First, we combine the two datasets.

In [5]:
Final_Data = pd.concat([manhattan_data,Toronto_Data_Filtered])

In [6]:

CLIENT_ID = 'UCPZ1OFL3V4RMZHIIUZZYZHWG0URVQVS4PBCTBZQVPOL325W' # your Foursquare ID
CLIENT_SECRET = 'ZDSXDNIU1Y1KJMFVWE02OW0JAC5SU2KBQHUVU3WALNP0MM3L' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

LIMIT = 100 # limit of number of venues returned by Foursquare API

radius = 500 # define radius


Your credentials:
CLIENT_ID: UCPZ1OFL3V4RMZHIIUZZYZHWG0URVQVS4PBCTBZQVPOL325W
CLIENT_SECRET:ZDSXDNIU1Y1KJMFVWE02OW0JAC5SU2KBQHUVU3WALNP0MM3L


In [7]:
Final_Data = Final_Data.reset_index(drop = True)
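
The concatenation and the index reset can also be done in a single call with `ignore_index=True`. A minimal sketch, assuming two small frames with the same four columns (the Toronto coordinates are illustrative values):

```python
import pandas as pd

# Hypothetical frames with the same columns as manhattan_data and Toronto_Data_Filtered.
city_a = pd.DataFrame({"Borough": ["Manhattan"],
                       "Neighbourhood": ["Chinatown"],
                       "Latitude": [40.715618],
                       "Longitude": [-73.994279]})
city_b = pd.DataFrame({"Borough": ["East Toronto", "Downtown Toronto"],
                       "Neighbourhood": ["The Beaches", "Berczy Park"],
                       "Latitude": [43.68, 43.64],
                       "Longitude": [-79.29, -79.37]})

# ignore_index=True renumbers the rows 0..n-1, replacing a separate reset_index call.
combined = pd.concat([city_a, city_b], ignore_index=True)
print(combined)
```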

Below, we inspect the combined data and then create a function to get the venue data for all the neighbourhoods. The function is generic and is used for both cities.

In [8]:
Final_Data

| | Borough | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|
| 0 | Manhattan | Marble Hill | 40.876551 | -73.91066 |
| 1 | Manhattan | Chinatown | 40.715618 | -73.994279 |
| 2 | Manhattan | Washington Heights | 40.851903 | -73.9369 |
| 3 | Manhattan | Inwood | 40.867684 | -73.92121 |
| 4 | Manhattan | Hamilton Heights | 40.823604 | -73.949688 |
| 5 | Manhattan | Manhattanville | 40.816934 | -73.957385 |
| 6 | Manhattan | Central Harlem | 40.815976 | -73.943211 |
| 7 | Manhattan | East Harlem | 40.792249 | -73.944182 |
| 8 | Manhattan | Upper East Side | 40.775639 | -73.960508 |
| 9 | Manhattan | Yorkville | 40.77593 | -73.947118 |


In [9]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighbourhood', 
                  'Neighbourhood Latitude', 
                  'Neighbourhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
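
`getNearbyVenues` indexes straight into the response, so a neighbourhood with no `groups`, or a venue without a category, would raise an exception. The helper below is a hypothetical sketch of a more defensive parsing step; it assumes only the response layout already used above, and both the function name and the sample payload are made up for illustration.

```python
def extract_venue_rows(name, lat, lng, payload):
    """Turn one explore-style response into flat tuples, skipping incomplete venues."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item.get("venue", {})
        location = venue.get("location", {})
        categories = venue.get("categories", [])
        if not categories or "lat" not in location or "lng" not in location:
            continue  # skip venues without a category or coordinates
        rows.append((name, lat, lng, venue.get("name"),
                     location["lat"], location["lng"], categories[0]["name"]))
    return rows

# Hypothetical payload shaped like the JSON indexed in getNearbyVenues.
sample_payload = {
    "response": {"groups": [{"items": [
        {"venue": {"name": "Sample Cafe",
                   "location": {"lat": 40.71, "lng": -73.99},
                   "categories": [{"name": "Café"}]}},
        {"venue": {"name": "Uncategorised Spot",
                   "location": {"lat": 40.72, "lng": -73.98},
                   "categories": []}},
    ]}]}
}

venue_rows = extract_venue_rows("Chinatown", 40.715618, -73.994279, sample_payload)
print(venue_rows)
```

The second sample venue has an empty category list and is skipped, and an empty payload yields an empty list instead of an error.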

With the function created, we now call it on the combined Manhattan and Toronto data.

In [10]:
Final_venues = getNearbyVenues(names=Final_Data['Neighbourhood'],
                                   latitudes=Final_Data['Latitude'],
                                   longitudes=Final_Data['Longitude']
                                  )


Marble Hill
Chinatown
Washington Heights
Inwood
Hamilton Heights
Manhattanville
Central Harlem
East Harlem
Upper East Side
Yorkville
Lenox Hill
Roosevelt Island
Upper West Side
Lincoln Square
Clinton
Midtown
Murray Hill
Chelsea
Greenwich Village
East Village
Lower East Side
Tribeca
Little Italy
Soho
West Village
Manhattan Valley
Morningside Heights
Gramercy
Battery Park City
Financial District
Carnegie Hill
Noho
Civic Center
Midtown South
Sutton Place
Turtle Bay
Tudor City
Stuyvesant Town
Flatiron
Hudson Yards
Regent Park, Harbourfront
Queen's Park, Ontario Provincial Government
Garden District, Ryerson
St. James Town
The Beaches
Berczy Park
Central Bay Street
Christie
Richmond, Adelaide, King
Dufferin, Dovercourt Village
Harbourfront East, Union Station, Toronto Islands
Little Portugal, Trinity
The Danforth West, Riverdale
Toronto Dominion Centre, Design Exchange
Brockton, Parkdale Village, Exhibition Place
India Bazaar, The Beaches West
Commerce Court, Victoria Hotel
Studio District


In [24]:
Final_venues.groupby('Neighbourhood').count()

# one hot encoding
Final_onehot = pd.get_dummies(Final_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
Final_onehot['Neighbourhood'] = Final_venues['Neighbourhood'] 

# move neighborhood column to the first column
fixed_columns = [Final_onehot.columns[-1]] + list(Final_onehot.columns[:-1])
Final_onehot = Final_onehot[fixed_columns]

Final_onehot.head()

Final_grouped = Final_onehot.groupby('Neighbourhood').mean().reset_index()
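
To see what the one-hot encoding followed by `groupby(...).mean()` produces, here is a small sketch on made-up venues. Each value in the result is the share of a neighbourhood's venues that fall into a category, so every row sums to 1. The toy names are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical venues: neighbourhood A has four venues, B has two.
toy_venues = pd.DataFrame({
    "Neighbourhood": ["A", "A", "A", "A", "B", "B"],
    "Venue Category": ["Park", "Park", "Café", "Gym", "Café", "Café"],
})

# One indicator column per category (dtype=float keeps the mean numeric).
toy_onehot = pd.get_dummies(toy_venues["Venue Category"], dtype=float)
toy_onehot.insert(0, "Neighbourhood", toy_venues["Neighbourhood"])

# The mean of the indicator columns is the relative frequency of each category.
toy_grouped = toy_onehot.groupby("Neighbourhood").mean().reset_index()
print(toy_grouped)
```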


In [25]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [26]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighbourhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighbourhood'] = Final_grouped['Neighbourhood']

for ind in np.arange(Final_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(Final_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

| | Neighbourhood | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue | 6th Most Common Venue | 7th Most Common Venue | 8th Most Common Venue | 9th Most Common Venue | 10th Most Common Venue |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Battery Park City | Park | Coffee Shop | Hotel | Gym | Memorial Site | Plaza | Wine Shop | Boat or Ferry | Gourmet Shop | Food Court |
| 1 | Berczy Park | Coffee Shop | Cocktail Bar | Café | Cheese Shop | Pub | Beer Bar | Restaurant | Bakery | Seafood Restaurant | Breakfast Spot |
| 2 | Brockton, Parkdale Village, Exhibition Place | Café | Coffee Shop | Breakfast Spot | Bakery | Restaurant | Grocery Store | Gym | Climbing Gym | Italian Restaurant | Performing Arts Venue |
| 3 | Business reply mail Processing Centre, South C... | Yoga Studio | Gym / Fitness Center | Comic Shop | Recording Studio | Restaurant | Park | Skate Park | Burrito Place | Farmers Market | Fast Food Restaurant |
| 4 | CN Tower, King and Spadina, Railway Lands, Har... | Airport Service | Airport Lounge | Airport Terminal | Plane | Boutique | Sculpture Garden | Coffee Shop | Harbor / Marina | Rental Car Location | Boat or Ferry |
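
`return_most_common_venues` sorts every category column, so when a neighbourhood has fewer than `num_top_venues` categories with a non-zero frequency, the remaining slots are filled with zero-frequency categories. The sketch below shows one possible refinement that drops those zeros first. It is not part of the notebook; the function name and toy row are hypothetical.

```python
import pandas as pd

def top_nonzero_categories(row, num_top_venues):
    # Drop the first entry (the neighbourhood name) and keep the frequencies.
    frequencies = row.iloc[1:].astype(float)
    # Discard categories that never occur in this neighbourhood.
    frequencies = frequencies[frequencies > 0]
    return list(frequencies.sort_values(ascending=False).index[:num_top_venues])

# Hypothetical row shaped like one row of Final_grouped.
toy_row = pd.Series({"Neighbourhood": "A", "Café": 0.4, "Gym": 0.0, "Park": 0.6, "Zoo": 0.0})

top = top_nonzero_categories(toy_row, 3)
print(top)
```

With this variant a neighbourhood can return fewer than `num_top_venues` entries, so the fixed-width table built above would need padding (for example with empty strings) before assignment.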


## 3. Cluster combined Neighborhoods data

We now cluster the combined neighbourhood data using k-means clustering.
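
As a small illustration of what k-means does with venue-frequency vectors, the sketch below clusters four made-up neighbourhoods into two groups. The frequency values and the explicit `n_init=10` are assumptions for the example, not settings taken from the notebook.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical frequency vectors; columns are [Park, Café, Gym].
toy_frequencies = np.array([
    [0.8, 0.1, 0.1],   # park-heavy
    [0.7, 0.2, 0.1],   # park-heavy
    [0.1, 0.8, 0.1],   # café-heavy
    [0.0, 0.9, 0.1],   # café-heavy
])

toy_kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(toy_frequencies)
toy_labels = toy_kmeans.labels_
print(toy_labels)
```

The label numbers themselves are arbitrary; what matters is which rows share a label.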

In [14]:
print('There are {} unique categories.'.format(len(Final_venues['Venue Category'].unique())))

There are 376 unique categories.


In [20]:
num_top_venues = 5

for hood in Final_grouped['Neighbourhood']:
    print("----"+hood+"----")
    temp = Final_grouped[Final_grouped['Neighbourhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Battery Park City----
           venue  freq
0           Park  0.12
1    Coffee Shop  0.08
2          Hotel  0.06
3            Gym  0.05
4  Memorial Site  0.05


----Berczy Park----
                venue  freq
0         Coffee Shop  0.07
1        Cocktail Bar  0.05
2         Cheese Shop  0.03
3  Seafood Restaurant  0.03
4              Bakery  0.03


----Brockton, Parkdale Village, Exhibition Place----
            venue  freq
0            Café  0.13
1  Breakfast Spot  0.09
2     Coffee Shop  0.09
3             Bar  0.04
4   Grocery Store  0.04


----Business reply mail Processing Centre, South Central Letter Processing Plant Toronto----
                  venue  freq
0           Yoga Studio  0.06
1            Restaurant  0.06
2         Burrito Place  0.06
3  Fast Food Restaurant  0.06
4                Garden  0.06


----CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport----
                 venue  freq
0      Airport Service  0.

In [28]:
# set number of clusters
kclusters = 5

# keep only the frequency columns for clustering
Final_grouped_clustering = Final_grouped.drop('Neighbourhood', axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(Final_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]

# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

Final_merged = Final_Data

# join Final_Data with neighborhoods_venues_sorted to add the cluster label and top venues for each neighbourhood
Final_merged = Final_merged.join(neighborhoods_venues_sorted.set_index('Neighbourhood'), on='Neighbourhood')

Final_merged.head() # check the last columns!

| | Borough | Neighbourhood | Latitude | Longitude | Cluster Labels | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue | 6th Most Common Venue | 7th Most Common Venue | 8th Most Common Venue | 9th Most Common Venue | 10th Most Common Venue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Manhattan | Marble Hill | 40.876551 | -73.91066 | 3 | Sandwich Place | Coffee Shop | Gym | Yoga Studio | Miscellaneous Shop | Steakhouse | Shopping Mall | Supplement Shop | Seafood Restaurant | Donut Shop |
| 1 | Manhattan | Chinatown | 40.715618 | -73.994279 | 3 | Chinese Restaurant | Bakery | Cocktail Bar | Bubble Tea Shop | Vietnamese Restaurant | Salon / Barbershop | Coffee Shop | Ice Cream Shop | Optical Shop | American Restaurant |
| 2 | Manhattan | Washington Heights | 40.851903 | -73.9369 | 3 | Café | Bakery | Grocery Store | Chinese Restaurant | Mobile Phone Shop | Gym | Mexican Restaurant | Supermarket | Sandwich Place | Bank |
| 3 | Manhattan | Inwood | 40.867684 | -73.92121 | 3 | Mexican Restaurant | Lounge | Restaurant | Café | Bakery | Deli / Bodega | Frozen Yogurt Shop | Chinese Restaurant | Caribbean Restaurant | American Restaurant |
| 4 | Manhattan | Hamilton Heights | 40.823604 | -73.949688 | 3 | Pizza Place | Café | Coffee Shop | Mexican Restaurant | Deli / Bodega | Yoga Studio | Caribbean Restaurant | Sushi Restaurant | Bakery | School |


In [29]:
address = 'Rochester, New York'

geolocator = Nominatim(user_agent="ca_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
#print('The geographical coordinate of Toronto are {}, {}.'.format(latitude, longitude))


# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=7)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(Final_merged['Latitude'], Final_merged['Longitude'], Final_merged['Neighbourhood'], Final_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
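
The colour scheme above boils down to sampling one colour per cluster from matplotlib's `rainbow` colormap. A minimal sketch of just that step, assuming the same `kclusters = 5`:

```python
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

kclusters = 5

# Sample kclusters evenly spaced colours and convert them to hex strings for folium.
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(c) for c in colors_array]

# Cluster labels run from 0 to kclusters-1, so the label can index the list directly.
for cluster in range(kclusters):
    print(cluster, rainbow[cluster])
```

In the map cell the index is `cluster-1`, so label 0 wraps around to the last colour in the list. Every cluster still gets its own colour, only in a shifted order.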

In [30]:
Final_merged

| | Borough | Neighbourhood | Latitude | Longitude | Cluster Labels | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue | 6th Most Common Venue | 7th Most Common Venue | 8th Most Common Venue | 9th Most Common Venue | 10th Most Common Venue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Manhattan | Marble Hill | 40.876551 | -73.91066 | 3 | Sandwich Place | Coffee Shop | Gym | Yoga Studio | Miscellaneous Shop | Steakhouse | Shopping Mall | Supplement Shop | Seafood Restaurant | Donut Shop |
| 1 | Manhattan | Chinatown | 40.715618 | -73.994279 | 3 | Chinese Restaurant | Bakery | Cocktail Bar | Bubble Tea Shop | Vietnamese Restaurant | Salon / Barbershop | Coffee Shop | Ice Cream Shop | Optical Shop | American Restaurant |
| 2 | Manhattan | Washington Heights | 40.851903 | -73.9369 | 3 | Café | Bakery | Grocery Store | Chinese Restaurant | Mobile Phone Shop | Gym | Mexican Restaurant | Supermarket | Sandwich Place | Bank |
| 3 | Manhattan | Inwood | 40.867684 | -73.92121 | 3 | Mexican Restaurant | Lounge | Restaurant | Café | Bakery | Deli / Bodega | Frozen Yogurt Shop | Chinese Restaurant | Caribbean Restaurant | American Restaurant |
| 4 | Manhattan | Hamilton Heights | 40.823604 | -73.949688 | 3 | Pizza Place | Café | Coffee Shop | Mexican Restaurant | Deli / Bodega | Yoga Studio | Caribbean Restaurant | Sushi Restaurant | Bakery | School |
| 5 | Manhattan | Manhattanville | 40.816934 | -73.957385 | 3 | Coffee Shop | Seafood Restaurant | Bus Stop | Italian Restaurant | Mexican Restaurant | Chinese Restaurant | Park | Bus Station | BBQ Joint | Fried Chicken Joint |
| 6 | Manhattan | Central Harlem | 40.815976 | -73.943211 | 3 | African Restaurant | American Restaurant | Cosmetics Shop | Chinese Restaurant | French Restaurant | Bar | Seafood Restaurant | Market | Juice Bar | BBQ Joint |
| 7 | Manhattan | East Harlem | 40.792249 | -73.944182 | 3 | Mexican Restaurant | Bakery | Thai Restaurant | Deli / Bodega | Latin American Restaurant | Park | Spa | Sandwich Place | Liquor Store | Taco Place |
| 8 | Manhattan | Upper East Side | 40.775639 | -73.960508 | 3 | Italian Restaurant | Bakery | Gym / Fitness Center | Coffee Shop | Juice Bar | Spa | French Restaurant | Exhibit | Yoga Studio | Wine Shop |
| 9 | Manhattan | Yorkville | 40.77593 | -73.947118 | 3 | Coffee Shop | Italian Restaurant | Gym | Bar | Deli / Bodega | Sushi Restaurant | Wine Shop | Pizza Place | Japanese Restaurant | Mexican Restaurant |


## 4. Examine Clusters

Now, you can examine each cluster and determine the discriminating venue categories that distinguish each cluster. Based on the defining categories, you can then assign a name to each cluster. I will leave this exercise to you.
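
One way to find the discriminating categories is to average the venue frequencies within each cluster and look at the largest values. The sketch below does this on made-up data; the toy frame stands in for `Final_grouped` and the toy labels for `kmeans.labels_`, and neither comes from the real results.

```python
import pandas as pd

# Hypothetical frequency table and cluster labels.
toy_grouped = pd.DataFrame({
    "Neighbourhood": ["A", "B", "C", "D"],
    "Park": [0.8, 0.7, 0.1, 0.0],
    "Café": [0.1, 0.2, 0.8, 0.9],
    "Gym":  [0.1, 0.1, 0.1, 0.1],
})
toy_labels = [0, 0, 1, 1]

# Mean frequency of each category within each cluster.
profile = (toy_grouped.drop("Neighbourhood", axis=1)
           .assign(**{"Cluster Labels": toy_labels})
           .groupby("Cluster Labels").mean())

# The category with the highest mean frequency suggests a name for the cluster.
top_category = profile.idxmax(axis=1)
print(top_category)
```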

#### Cluster 1

In [31]:
Cluster1 = Final_merged.loc[Final_merged['Cluster Labels'] == 0, Final_merged.columns[[1] + list(range(5, Final_merged.shape[1]))]]

In [32]:
Cluster1

| | Neighbourhood | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue | 6th Most Common Venue | 7th Most Common Venue | 8th Most Common Venue | 9th Most Common Venue | 10th Most Common Venue |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 11 | Roosevelt Island | Park | Playground | Liquor Store | Japanese Restaurant | Residential Building (Apartment / Condo) | Restaurant | Baseball Field | Sandwich Place | Scenic Lookout | School |
| 28 | Battery Park City | Park | Coffee Shop | Hotel | Gym | Memorial Site | Plaza | Wine Shop | Boat or Ferry | Gourmet Shop | Food Court |
| 37 | Stuyvesant Town | Park | Playground | Heliport | Baseball Field | Gas Station | Gym / Fitness Center | Cocktail Bar | Harbor / Marina | Pet Service | Fountain |
| 55 | India Bazaar, The Beaches West | Park | Fast Food Restaurant | Fish & Chips Shop | Italian Restaurant | Pet Store | Pub | Restaurant | Sandwich Place | Movie Theater | Burrito Place |
| 58 | Lawrence Park | Bus Line | Park | Swim School | Yoga Studio | Drugstore | Dumpling Restaurant | Duty-free Shop | Eastern European Restaurant | Egyptian Restaurant | Electronics Store |
| 60 | Davisville North | Hotel | Breakfast Spot | Pizza Place | Food & Drink Shop | Sandwich Place | Gym / Fitness Center | Park | Department Store | Electronics Store | Donut Shop |
| 78 | Business reply mail Processing Centre, South C... | Yoga Studio | Gym / Fitness Center | Comic Shop | Recording Studio | Restaurant | Park | Skate Park | Burrito Place | Farmers Market | Fast Food Restaurant |


#### Cluster 2

In [33]:
Cluster2 = Final_merged.loc[Final_merged['Cluster Labels'] == 1, Final_merged.columns[[1] + list(range(5, Final_merged.shape[1]))]]

In [34]:
Cluster2

| | Neighbourhood | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue | 6th Most Common Venue | 7th Most Common Venue | 8th Most Common Venue | 9th Most Common Venue | 10th Most Common Venue |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 59 | Roselawn | Home Service | Garden | Yoga Studio | English Restaurant | Donut Shop | Drugstore | Dumpling Restaurant | Duty-free Shop | Eastern European Restaurant | Egyptian Restaurant |


#### Cluster 3

In [35]:
Cluster3 = Final_merged.loc[Final_merged['Cluster Labels'] == 2, Final_merged.columns[[1] + list(range(5, Final_merged.shape[1]))]]

In [36]:
Cluster3

| | Neighbourhood | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue | 6th Most Common Venue | 7th Most Common Venue | 8th Most Common Venue | 9th Most Common Venue | 10th Most Common Venue |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 44 | The Beaches | Pub | Trail | Health Food Store | Neighborhood | Yoga Studio | Empanada Restaurant | Doner Restaurant | Donut Shop | Drugstore | Dumpling Restaurant |


#### Cluster 4

In [37]:
Cluster4 = Final_merged.loc[Final_merged['Cluster Labels'] == 3, Final_merged.columns[[1] + list(range(5, Final_merged.shape[1]))]]

In [38]:
Cluster4

| | Neighbourhood | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue | 6th Most Common Venue | 7th Most Common Venue | 8th Most Common Venue | 9th Most Common Venue | 10th Most Common Venue |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Marble Hill | Sandwich Place | Coffee Shop | Gym | Yoga Studio | Miscellaneous Shop | Steakhouse | Shopping Mall | Supplement Shop | Seafood Restaurant | Donut Shop |
| 1 | Chinatown | Chinese Restaurant | Bakery | Cocktail Bar | Bubble Tea Shop | Vietnamese Restaurant | Salon / Barbershop | Coffee Shop | Ice Cream Shop | Optical Shop | American Restaurant |
| 2 | Washington Heights | Café | Bakery | Grocery Store | Chinese Restaurant | Mobile Phone Shop | Gym | Mexican Restaurant | Supermarket | Sandwich Place | Bank |
| 3 | Inwood | Mexican Restaurant | Lounge | Restaurant | Café | Bakery | Deli / Bodega | Frozen Yogurt Shop | Chinese Restaurant | Caribbean Restaurant | American Restaurant |
| 4 | Hamilton Heights | Pizza Place | Café | Coffee Shop | Mexican Restaurant | Deli / Bodega | Yoga Studio | Caribbean Restaurant | Sushi Restaurant | Bakery | School |
| 5 | Manhattanville | Coffee Shop | Seafood Restaurant | Bus Stop | Italian Restaurant | Mexican Restaurant | Chinese Restaurant | Park | Bus Station | BBQ Joint | Fried Chicken Joint |
| 6 | Central Harlem | African Restaurant | American Restaurant | Cosmetics Shop | Chinese Restaurant | French Restaurant | Bar | Seafood Restaurant | Market | Juice Bar | BBQ Joint |
| 7 | East Harlem | Mexican Restaurant | Bakery | Thai Restaurant | Deli / Bodega | Latin American Restaurant | Park | Spa | Sandwich Place | Liquor Store | Taco Place |
| 8 | Upper East Side | Italian Restaurant | Bakery | Gym / Fitness Center | Coffee Shop | Juice Bar | Spa | French Restaurant | Exhibit | Yoga Studio | Wine Shop |
| 9 | Yorkville | Coffee Shop | Italian Restaurant | Gym | Bar | Deli / Bodega | Sushi Restaurant | Wine Shop | Pizza Place | Japanese Restaurant | Mexican Restaurant |


#### Cluster 5

In [39]:
Cluster5 = Final_merged.loc[Final_merged['Cluster Labels'] == 4, Final_merged.columns[[1] + list(range(5, Final_merged.shape[1]))]]

In [86]:
Cluster5

| | Neighbourhood | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue | 6th Most Common Venue | 7th Most Common Venue | 8th Most Common Venue | 9th Most Common Venue | 10th Most Common Venue |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 69 | Moore Park, Summerhill East | Park | Trail | Restaurant | Yoga Studio | Empanada Restaurant | Doner Restaurant | Donut Shop | Drugstore | Dumpling Restaurant | Duty-free Shop |
| 73 | Rosedale | Park | Playground | Trail | Yoga Studio | Empanada Restaurant | Doner Restaurant | Donut Shop | Drugstore | Dumpling Restaurant | Duty-free Shop |


In [115]:
area = input("Enter your current area: ")

if Cluster5['Neighbourhood'].str.contains(area, case = False, regex = False).any():
    print("Correct")
else:
    print("Not")

Enter your current area:  rose


Correct


In [116]:
area = input("Enter your current area: ")

# check the clusters in order and report the first one that contains the area
clusters = {'Cluster1': Cluster1, 'Cluster2': Cluster2, 'Cluster3': Cluster3,
            'Cluster4': Cluster4, 'Cluster5': Cluster5}

for cluster_name, cluster_df in clusters.items():
    if cluster_df['Neighbourhood'].str.contains(area, case = False, regex = False).any():
        print(cluster_name)
        break
else:
    # the for loop finished without a break, so no cluster matched
    print("Area not found in any cluster.")
    



Enter your current area:  lawrence


Cluster1
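
To go one step further towards the goal stated at the top, the lookup can return the neighbourhoods in the other city that share a cluster with the entered area. The sketch below is a hypothetical helper, not the notebook's method. It assumes an extra `City` column, which `Final_merged` does not have (it would need to be added when the two datasets are combined). The toy rows reuse cluster memberships shown in the tables above.

```python
import pandas as pd

def similar_neighbourhoods(area, merged):
    """Return neighbourhoods in other cities that share a cluster with `area`."""
    match = merged[merged["Neighbourhood"].str.contains(area, case=False, regex=False)]
    if match.empty:
        return merged.iloc[0:0]  # empty frame with the same columns
    same_cluster = merged["Cluster Labels"].isin(match["Cluster Labels"].unique())
    other_city = ~merged["City"].isin(match["City"].unique())
    return merged[same_cluster & other_city].reset_index(drop=True)

# Toy frame; the 'City' column is an assumption added for this example.
toy_merged = pd.DataFrame({
    "City": ["New York", "New York", "Toronto", "Toronto"],
    "Neighbourhood": ["Roosevelt Island", "Chinatown", "Lawrence Park", "Rosedale"],
    "Cluster Labels": [0, 3, 0, 4],
})

suggestions = similar_neighbourhoods("lawrence", toy_merged)
print(suggestions["Neighbourhood"].tolist())
```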
