# Capstone Project - The Battle of the Neighborhoods (Week 2)
### Applied Data Science Capstone by IBM/Coursera

## Table of contents
* [Introduction: Business Problem](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)

## Introduction <a name="introduction"></a>

New York City and the city of Toronto are very diverse cities and both the financial capitals of their respective countries. Sometimes it is very useful to compare both cities to find out their structure. City segmentation will be useful mostly for tourists, using this they can plan their trip and save more time. 

## Data description <a name="data"></a>

For this problem, we will get the services of Foursquare API to explore the data of two cities, in terms of their neighborhoods. The data also include the information about the places around each neighborhood like restaurants, hotels, coffee shops, parks, theaters, art galleries, museums and many more. We selected all Borough from each city to analyze their neighborhoods in order to understand the size of clusters. We will use machine learning technique, “Clustering” to segment the neighborhoods with similar objects on the basis of each neighborhood data. 

## Methodology <a name="methodology"></a>

We have selected all cities Borough to explore their neighborhoods. The data exploration, analysis and visualization for both cities are done in the same way but separately.

In [10]:
pip install lxml

Note: you may need to restart the kernel to use updated packages.


C:\ProgramData\Miniconda3\python.exe: No module named pip


In [176]:
import requests
import lxml.html as lh
import pandas as pd

### 1. Get the dataset of Toronto Neiborhoods

In [177]:
#Scrape Table Cells
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M' #Create a handle, page, to handle the contents of the website
page = requests.get(url) #Store the contents of the website under doc
doc = lh.fromstring(page.content) #Parse data that are stored between <tr>..</tr> of HTML
tr_els = doc.xpath('//tr')

#Parse Table Header
tr_els = doc.xpath('//tr')
#Create empty list
col=[]
i=0
#For each row, store each first element (header) and an empty list
for t in tr_els[0]:
    i+=1
    name=t.text_content()
    #print (i,':','"',name,'"')
    col.append((name,[]))
    
#Creating Pandas DataFrame
#Each header is appended to a tuple along with an empty list

#Since out first row is the header, data is stored on the second row onwards
for j in range(1,len(tr_els)):
    #T is our j'th row
    T=tr_els[j]
    
    #If row is not of size 3, the //tr data is not from our table 
    if len(T)!=3:
        break
    
    #i is the index of our column
    i=0
    
    #Iterate through each element of the row
    for t in T.iterchildren():
        data=t.text_content() 
        #Check if row is empty
        if i>0:
        #Convert any numerical value to integers
            try:
                data=int(data)
            except:
                pass
        #Append the data to the empty list of the i'th column
        col[i][1].append(data)
        #Increment i for the next column
        i+=1
        
Dict={title:column for (title,column) in col}
df_Tor=pd.DataFrame(Dict)

#Delete "\n" in our dataframe
df_Tor = df_Tor.replace(r'\n','', regex=True) #Delete /n in all columns
df_Tor.columns = df_Tor.columns.str.replace(r'\n','') #Delete /n in header
df_Tor.head(10)

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M6A,North York,Lawrence Heights
6,M6A,North York,Lawrence Manor
7,M7A,Downtown Toronto,Queen's Park
8,M8A,Not assigned,Not assigned
9,M9A,Queen's Park,Not assigned


Process this dataset

In [178]:
#Drop all raws with Borough = 'Not assigned'
df_Tor = df_Tor[df_Tor.Borough != 'Not assigned']
df_Tor = df_Tor.reset_index() #Reset index 
del df_Tor['index'] #delete additional index column

#If Neighbourhood = 'Not assigned' replace it with 'Borough' value
df_Tor.loc[df_Tor['Neighbourhood'] == 'Not assigned', 'Neighbourhood'] = df_Tor['Borough']

#Group the repeating Post code
df_Tor = df_Tor.groupby(['Postcode','Borough'], sort=False).agg( ','.join)
df_Tor = df_Tor.reset_index()

df_Tor.head(10)

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M6A,North York,"Lawrence Heights,Lawrence Manor"
4,M7A,Downtown Toronto,Queen's Park
5,M9A,Queen's Park,Queen's Park
6,M1B,Scarborough,"Rouge,Malvern"
7,M3B,North York,Don Mills North
8,M4B,East York,"Woodbine Gardens,Parkview Hill"
9,M5B,Downtown Toronto,"Ryerson,Garden District"


Append the geocoordinates to Toronto dataset

In [179]:
#Load the geoloc csv file and put it into dataframe df_geoloc
geoloc_file = 'http://cocl.us/Geospatial_data' #url to csv file with geolocation informatio
df_geoloc = pd.read_csv(geoloc_file)
df_geoloc.head(10)

#Parce the Latitude and Longitude values according to df postal code values and put the result into 2 lists
lat=[]
long=[]
flag = 0
for ind in df_Tor.index:
    for indd in df_geoloc.index:
        if df_Tor['Postcode'][ind] == df_geoloc['Postal Code'][indd]:
            lat.append(df_geoloc['Latitude'][indd])
            long.append(df_geoloc['Longitude'][indd])
            #print('For', df['Postcode'][ind], 'lat:', df_geoloc['Latitude'][indd], 'long:', df_geoloc['Longitude'][indd])
            flag = flag + 1
    if flag == 0:
        lat.append(0)
        long.append(0)
    flag = 0
        #else:
            #print('for', df['Postcode'][ind], 'there is no gelocation data')
            
#Append derived list as additional columns to df as Latitude and Longitude
df_Tor['Latitude'] = lat
df_Tor['Longitude'] = long

df_Tor=df_Tor.drop(['Postcode'], axis=1)


df_Tor.head(10)

Unnamed: 0,Borough,Neighbourhood,Latitude,Longitude
0,North York,Parkwoods,43.753259,-79.329656
1,North York,Victoria Village,43.725882,-79.315572
2,Downtown Toronto,Harbourfront,43.65426,-79.360636
3,North York,"Lawrence Heights,Lawrence Manor",43.718518,-79.464763
4,Downtown Toronto,Queen's Park,43.662301,-79.389494
5,Queen's Park,Queen's Park,43.667856,-79.532242
6,Scarborough,"Rouge,Malvern",43.806686,-79.194353
7,North York,Don Mills North,43.745906,-79.352188
8,East York,"Woodbine Gardens,Parkview Hill",43.706397,-79.309937
9,Downtown Toronto,"Ryerson,Garden District",43.657162,-79.378937


In [180]:
df_Tor.tail(10)

Unnamed: 0,Borough,Neighbourhood,Latitude,Longitude
93,Etobicoke,"Alderwood,Long Branch",43.602414,-79.543484
94,Etobicoke,Northwest,43.706748,-79.594054
95,Scarborough,Upper Rouge,43.836125,-79.205636
96,Downtown Toronto,"Cabbagetown,St. James Town",43.667967,-79.367675
97,Downtown Toronto,"First Canadian Place,Underground city",43.648429,-79.38228
98,Etobicoke,"The Kingsway,Montgomery Road,Old Mill North",43.653654,-79.506944
99,Downtown Toronto,Church and Wellesley,43.66586,-79.38316
100,East Toronto,Business Reply Mail Processing Centre 969 Eastern,43.662744,-79.321558
101,Etobicoke,"Humber Bay,King's Mill Park,Kingsway Park Sout...",43.636258,-79.498509
102,Etobicoke,"Kingsway Park South West,Mimico NW,The Queensw...",43.628841,-79.520999


In [181]:
print('The Toronto dataframe has {} boroughs and {} neighborhoods.'.format(
        len(df_Tor['Borough'].unique()),
        df_Tor.shape[0]
    )
)

The Toronto dataframe has 11 boroughs and 103 neighborhoods.


### 2. Get the dataset of New York Neiborhoods

In [182]:
import urllib.request as request
import json

In [183]:
with request.urlopen('https://cocl.us/new_york_dataset') as response:
    source = response.read()
    newyork_data = json.loads(source)

In [184]:
neighborhoods_data = newyork_data['features']

for data in neighborhoods_data:
    borough = neighborhood_name = data['properties']['borough'] 
    neighborhood_name = data['properties']['name']
        
    neighborhood_latlon = data['geometry']['coordinates']
    neighborhood_lat = neighborhood_latlon[1]
    neighborhood_lon = neighborhood_latlon[0]
    
    df_NT = df_NY.append({'Borough': borough,
                                          'Neighborhood': neighborhood_name,
                                          'Latitude': neighborhood_lat,
                                          'Longitude': neighborhood_lon}, ignore_index=True)

In [185]:
df_NY.head(10)

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,Staten Island,Fox Hills,40.617311,-74.08174
1,Bronx,Wakefield,40.894705,-73.847201
2,Bronx,Co-op City,40.874294,-73.829939
3,Bronx,Eastchester,40.887556,-73.827806
4,Bronx,Fieldston,40.895437,-73.905643
5,Bronx,Riverdale,40.890834,-73.912585
6,Bronx,Kingsbridge,40.881687,-73.902818
7,Manhattan,Marble Hill,40.876551,-73.91066
8,Bronx,Woodlawn,40.898273,-73.867315
9,Bronx,Norwood,40.877224,-73.879391


In [186]:
df_NY.tail(10)

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
297,Brooklyn,Madison,40.609378,-73.948415
298,Bronx,Bronxdale,40.852723,-73.861726
299,Bronx,Allerton,40.865788,-73.859319
300,Bronx,Kingsbridge Heights,40.870392,-73.901523
301,Brooklyn,Erasmus,40.646926,-73.948177
302,Manhattan,Hudson Yards,40.756658,-74.000111
303,Queens,Hammels,40.587338,-73.80553
304,Queens,Bayswater,40.611322,-73.765968
305,Queens,Queensbridge,40.756091,-73.945631
306,Staten Island,Fox Hills,40.617311,-74.08174


In [187]:
print('The New York dataframe has {} boroughs and {} neighborhoods.'.format(
        len(df_NY['Borough'].unique()),
        df_NY.shape[0]
    )
)

The New York dataframe has 5 boroughs and 307 neighborhoods.


### 3. Visualise the data

In [97]:
!conda install -c conda-forge geopy --yes 

Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done

## Package Plan ##

  environment location: C:\ProgramData\Miniconda3

  added / updated specs:
    - geopy


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    geopy-1.21.0               |             py_0          58 KB  conda-forge
    ------------------------------------------------------------
                                           Total:          58 KB

The following packages will be UPDATED:

  geopy                                         1.20.0-py_0 --> 1.21.0-py_0



Downloading and Extracting Packages

geopy-1.21.0         | 58 KB     |            |   0% 
geopy-1.21.0         | 58 KB     | ##7        |  27% 
geopy-1.21.0         | 58 KB     | ########## | 100% 
Preparing transaction: ...working... done
Verifying transaction: ...working... done
Executing transaction

In [188]:
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

In [189]:
address = 'Toronto'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude_Tor = location.latitude
longitude_Tor = location.longitude
print('The geograpical coordinate of Toronto are {}, {}.'.format(latitude_Tor, longitude_Tor))

address = 'New York City, NY'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude_NY = location.latitude
longitude_NY = location.longitude
print('The geograpical coordinate of New York City are {}, {}.'.format(latitude_NY, longitude_NY))

The geograpical coordinate of Toronto are 43.653963, -79.387207.
The geograpical coordinate of New York City are 40.7127281, -74.0060152.


In [99]:
conda install folium -c conda-forge

Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done

# All requested packages already installed.


Note: you may need to restart the kernel to use updated packages.


In [190]:
import folium # map rendering library

### Toronto

In [191]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude_Tor, longitude_Tor], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(df_Tor['Latitude'], df_Tor['Longitude'], df_Tor['Borough'], df_Tor['Neighbourhood']):

    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

### New York

In [192]:
# create map of New York using latitude and longitude values
map_newyork = folium.Map(location=[latitude_NY, longitude_NY], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(df_NY['Latitude'], df_NY['Longitude'], df_NY['Borough'], df_NY['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_newyork)  
    
map_newyork

### 4. Get the venue data

Define Foursquare Credentials and Version

In [193]:
CLIENT_ID = 'PDK2DALILYZBAMX2OTBMI0BDTQDYH1MVNYMDY1KRVIK4U2FM' # your Foursquare ID
CLIENT_SECRET = 'GREMQSCEPPIJAKPU011XP0TL4INFOUT1ZN4CJXE4U5C2UGQT' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentails:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentails:
CLIENT_ID: PDK2DALILYZBAMX2OTBMI0BDTQDYH1MVNYMDY1KRVIK4U2FM
CLIENT_SECRET:GREMQSCEPPIJAKPU011XP0TL4INFOUT1ZN4CJXE4U5C2UGQT


In [194]:
LIMIT = 100 # limit of number of venues returned by Foursquare API
#radius = 500 # define radius

def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

In [195]:
toronto_venues = getNearbyVenues(names=df_Tor['Neighbourhood'],
                                   latitudes=df_Tor['Latitude'],
                                   longitudes=df_Tor['Longitude']
                                  )

Parkwoods
Victoria Village
Harbourfront
Lawrence Heights,Lawrence Manor
Queen's Park
Queen's Park
Rouge,Malvern
Don Mills North
Woodbine Gardens,Parkview Hill
Ryerson,Garden District
Glencairn
Cloverdale,Islington,Martin Grove,Princess Gardens,West Deane Park
Highland Creek,Rouge Hill,Port Union
Flemingdon Park,Don Mills South
Woodbine Heights
St. James Town
Humewood-Cedarvale
Bloordale Gardens,Eringate,Markland Wood,Old Burnhamthorpe
Guildwood,Morningside,West Hill
The Beaches
Berczy Park
Caledonia-Fairbanks
Woburn
Leaside
Central Bay Street
Christie
Cedarbrae
Hillcrest Village
Bathurst Manor,Downsview North,Wilson Heights
Thorncliffe Park
Adelaide,King,Richmond
Dovercourt Village,Dufferin
Scarborough Village
Fairview,Henry Farm,Oriole
Northwood Park,York University
East Toronto
Harbourfront East,Toronto Islands,Union Station
Little Portugal,Trinity
East Birchmount Park,Ionview,Kennedy Park
Bayview Village
CFB Toronto,Downsview East
The Danforth West,Riverdale
Design Exchange,Toronto 

In [196]:
NY_venues = getNearbyVenues(names=df_NY['Neighborhood'],
                                   latitudes=df_NY['Latitude'],
                                   longitudes=df_NY['Longitude']
                                  )

Fox Hills
Wakefield
Co-op City
Eastchester
Fieldston
Riverdale
Kingsbridge
Marble Hill
Woodlawn
Norwood
Williamsbridge
Baychester
Pelham Parkway
City Island
Bedford Park
University Heights
Morris Heights
Fordham
East Tremont
West Farms
High  Bridge
Melrose
Mott Haven
Port Morris
Longwood
Hunts Point
Morrisania
Soundview
Clason Point
Throgs Neck
Country Club
Parkchester
Westchester Square
Van Nest
Morris Park
Belmont
Spuyten Duyvil
North Riverdale
Pelham Bay
Schuylerville
Edgewater Park
Castle Hill
Olinville
Pelham Gardens
Concourse
Unionport
Edenwald
Bay Ridge
Bensonhurst
Sunset Park
Greenpoint
Gravesend
Brighton Beach
Sheepshead Bay
Manhattan Terrace
Flatbush
Crown Heights
East Flatbush
Kensington
Windsor Terrace
Prospect Heights
Brownsville
Williamsburg
Bushwick
Bedford Stuyvesant
Brooklyn Heights
Cobble Hill
Carroll Gardens
Red Hook
Gowanus
Fort Greene
Park Slope
Cypress Hills
East New York
Starrett City
Canarsie
Flatlands
Mill Island
Manhattan Beach
Coney Island
Bath Beach
Borough 

In [197]:
print('There are {} uniques categories in Toronto.'.format(len(toronto_venues['Venue Category'].unique())))
print('There are {} uniques categories in New York.'.format(len(NY_venues['Venue Category'].unique())))

There are 266 uniques categories in Toronto.
There are 431 uniques categories in New York.


### 4. Analyze Each Neighborhood

Prepare the data for klustering algorithm

#### Toronto

In [198]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

toronto_onehot.head()

Unnamed: 0,Yoga Studio,Accessories Store,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,...,Transportation Service,Turkish Restaurant,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wings Joint,Women's Store
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [199]:
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
toronto_grouped

Unnamed: 0,Neighborhood,Yoga Studio,Accessories Store,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,...,Transportation Service,Turkish Restaurant,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wings Joint,Women's Store
0,"Adelaide,King,Richmond",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.02,0.0,0.000,0.0,0.0,0.01,0.0,0.01
1,Agincourt,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.00,0.0,0.000,0.0,0.0,0.00,0.0,0.00
2,"Agincourt North,L'Amoreaux East,Milliken,Steel...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.00,0.0,0.000,0.0,0.0,0.00,0.0,0.00
3,"Albion Gardens,Beaumond Heights,Humbergate,Jam...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.00,0.0,0.000,0.0,0.0,0.00,0.0,0.00
4,"Alderwood,Long Branch",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.00,0.0,0.000,0.0,0.0,0.00,0.0,0.00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
94,Willowdale West,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.00,0.0,0.000,0.0,0.0,0.00,0.0,0.00
95,Woburn,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.00,0.0,0.000,0.0,0.0,0.00,0.0,0.00
96,"Woodbine Gardens,Parkview Hill",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.00,0.0,0.000,0.0,0.0,0.00,0.0,0.00
97,Woodbine Heights,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.00,0.0,0.125,0.0,0.0,0.00,0.0,0.00


In [200]:
num_top_venues = 5

for hood in toronto_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = toronto_grouped[toronto_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Adelaide,King,Richmond----
         venue  freq
0  Coffee Shop  0.07
1   Steakhouse  0.04
2         Café  0.04
3          Bar  0.04
4   Restaurant  0.03


----Agincourt----
                       venue  freq
0                     Lounge  0.25
1  Latin American Restaurant  0.25
2             Breakfast Spot  0.25
3               Skating Rink  0.25
4                     Market  0.00


----Agincourt North,L'Amoreaux East,Milliken,Steeles East----
                             venue  freq
0                       Playground   0.5
1                             Park   0.5
2                      Men's Store   0.0
3  Molecular Gastronomy Restaurant   0.0
4       Modern European Restaurant   0.0


----Albion Gardens,Beaumond Heights,Humbergate,Jamestown,Mount Olive,Silverstone,South Steeles,Thistletown----
            venue  freq
0     Pizza Place  0.17
1   Grocery Store  0.17
2        Pharmacy  0.08
3  Discount Store  0.08
4      Beer Store  0.08


----Alderwood,Long Branch----
          venu

Function to sort the venues in descending order

In [201]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [202]:
import numpy as np

num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Adelaide,King,Richmond",Coffee Shop,Steakhouse,Bar,Café,Burger Joint,Breakfast Spot,Cosmetics Shop,Thai Restaurant,Restaurant,Bakery
1,Agincourt,Latin American Restaurant,Lounge,Skating Rink,Breakfast Spot,Donut Shop,Dim Sum Restaurant,Diner,Discount Store,Dog Run,Doner Restaurant
2,"Agincourt North,L'Amoreaux East,Milliken,Steel...",Playground,Park,Women's Store,Dog Run,Department Store,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Donut Shop
3,"Albion Gardens,Beaumond Heights,Humbergate,Jam...",Grocery Store,Pizza Place,Pharmacy,Coffee Shop,Fried Chicken Joint,Sandwich Place,Beer Store,Fast Food Restaurant,Discount Store,Japanese Restaurant
4,"Alderwood,Long Branch",Pizza Place,Gym,Skating Rink,Dance Studio,Coffee Shop,Pub,Sandwich Place,Pharmacy,Airport Service,Deli / Bodega


#### New York

In [203]:
# one hot encoding
NY_onehot = pd.get_dummies(NY_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
NY_onehot['Neighborhood'] = NY_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns1 = [NY_onehot.columns[-1]] + list(NY_onehot.columns[:-1])
NY_onehot = NY_onehot[fixed_columns1]

NY_onehot.head()

Unnamed: 0,Yoga Studio,Accessories Store,Adult Boutique,Afghan Restaurant,African Restaurant,Airport Terminal,American Restaurant,Antique Shop,Arcade,Arepa Restaurant,...,Volleyball Court,Warehouse Store,Waste Facility,Waterfront,Weight Loss Center,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [204]:
NY_grouped = NY_onehot.groupby('Neighborhood').mean().reset_index()
NY_grouped

Unnamed: 0,Neighborhood,Yoga Studio,Accessories Store,Adult Boutique,Afghan Restaurant,African Restaurant,Airport Terminal,American Restaurant,Antique Shop,Arcade,...,Volleyball Court,Warehouse Store,Waste Facility,Waterfront,Weight Loss Center,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store
0,Allerton,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.00,0.00,0.0,0.0
1,Annadale,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.00,0.00,0.0,0.0
2,Arden Heights,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.00,0.00,0.0,0.0
3,Arlington,0.0,0.0,0.0,0.0,0.0,0.0,0.166667,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.00,0.00,0.0,0.0
4,Arrochar,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.00,0.00,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
295,Woodhaven,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.00,0.00,0.0,0.0
296,Woodlawn,0.0,0.0,0.0,0.0,0.0,0.0,0.033333,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.00,0.00,0.0,0.0
297,Woodrow,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.00,0.00,0.0,0.0
298,Woodside,0.0,0.0,0.0,0.0,0.0,0.0,0.038961,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.00,0.00,0.0,0.0


In [205]:
num_top_venues = 5

for hood in NY_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = NY_grouped[NY_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Allerton----
                venue  freq
0         Pizza Place  0.12
1       Deli / Bodega  0.09
2  Chinese Restaurant  0.09
3         Supermarket  0.09
4    Department Store  0.06


----Annadale----
            venue  freq
0     Pizza Place  0.33
1        Pharmacy  0.08
2  Cosmetics Shop  0.08
3   Train Station  0.08
4           Diner  0.08


----Arden Heights----
          venue  freq
0      Bus Stop   0.2
1   Coffee Shop   0.2
2  Home Service   0.2
3   Pizza Place   0.2
4      Pharmacy   0.2


----Arlington----
                 venue  freq
0         Intersection  0.17
1  American Restaurant  0.17
2          Coffee Shop  0.17
3             Bus Stop  0.17
4        Grocery Store  0.17


----Arrochar----
                      venue  freq
0             Deli / Bodega  0.11
1        Italian Restaurant  0.11
2                  Bus Stop  0.11
3  Mediterranean Restaurant  0.05
4                Food Truck  0.05


----Arverne----
             venue  freq
0        Surf Spot  0.22
1    Metro 

In [206]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted1 = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted1['Neighborhood'] = NY_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted1.iloc[ind, 1:] = return_most_common_venues(NY_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted1.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Allerton,Pizza Place,Chinese Restaurant,Deli / Bodega,Supermarket,Department Store,Fried Chicken Joint,Bike Trail,Bakery,Check Cashing Service,Fast Food Restaurant
1,Annadale,Pizza Place,Dance Studio,Train Station,Diner,Pharmacy,Restaurant,Sports Bar,Pub,Cosmetics Shop,Event Service
2,Arden Heights,Pharmacy,Bus Stop,Coffee Shop,Home Service,Pizza Place,Women's Store,Farmers Market,Ethiopian Restaurant,Event Service,Event Space
3,Arlington,Deli / Bodega,Bus Stop,Intersection,Grocery Store,Coffee Shop,American Restaurant,Farmers Market,Farm,Falafel Restaurant,Women's Store
4,Arrochar,Italian Restaurant,Deli / Bodega,Bus Stop,Supermarket,Food Truck,Middle Eastern Restaurant,Bagel Shop,Outdoors & Recreation,Sandwich Place,Taco Place


### 5 Cluster Neighborhoods

Run *k*-means to cluster the neighborhood into 5 clusters.

In [207]:
# import k-means from clustering stage
from sklearn.cluster import KMeans

#### Toronto

In [208]:
# set number of clusters
kclusters = 5

toronto_grouped_clustering = toronto_grouped.drop('Neighborhood', 1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([0, 0, 4, 0, 0, 0, 0, 0, 0, 0])

In [209]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

toronto_merged = df_Tor

# merge toronto_grouped with toronto_data to add latitude/longitude for each neighborhood
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighbourhood')

In [210]:
toronto_merged.head(20) # check the last columns!

Unnamed: 0,Borough,Neighbourhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,North York,Parkwoods,43.753259,-79.329656,0.0,Park,Food & Drink Shop,Women's Store,Doner Restaurant,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Dog Run,Donut Shop
1,North York,Victoria Village,43.725882,-79.315572,0.0,Pizza Place,Coffee Shop,Portuguese Restaurant,Hockey Arena,Women's Store,Dog Run,Department Store,Dessert Shop,Dim Sum Restaurant,Diner
2,Downtown Toronto,Harbourfront,43.65426,-79.360636,0.0,Coffee Shop,Bakery,Café,Pub,Park,Breakfast Spot,Mexican Restaurant,Ice Cream Shop,Chocolate Shop,Beer Store
3,North York,"Lawrence Heights,Lawrence Manor",43.718518,-79.464763,0.0,Clothing Store,Furniture / Home Store,Women's Store,Boutique,Sporting Goods Shop,Miscellaneous Shop,Event Space,Vietnamese Restaurant,Coffee Shop,Accessories Store
4,Downtown Toronto,Queen's Park,43.662301,-79.389494,0.0,Coffee Shop,Gym,Park,Diner,Salad Place,Portuguese Restaurant,Mexican Restaurant,Liquor Store,Juice Bar,Italian Restaurant
5,Queen's Park,Queen's Park,43.667856,-79.532242,0.0,Coffee Shop,Gym,Park,Diner,Salad Place,Portuguese Restaurant,Mexican Restaurant,Liquor Store,Juice Bar,Italian Restaurant
6,Scarborough,"Rouge,Malvern",43.806686,-79.194353,1.0,Fast Food Restaurant,Women's Store,Deli / Bodega,Empanada Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Drugstore,Donut Shop,Doner Restaurant
7,North York,Don Mills North,43.745906,-79.352188,0.0,Japanese Restaurant,Gym / Fitness Center,Café,Caribbean Restaurant,Doner Restaurant,Dim Sum Restaurant,Diner,Discount Store,Dog Run,Women's Store
8,East York,"Woodbine Gardens,Parkview Hill",43.706397,-79.309937,0.0,Fast Food Restaurant,Pizza Place,Pharmacy,Athletics & Sports,Bus Line,Intersection,Gastropub,Bank,Gym / Fitness Center,Pet Store
9,Downtown Toronto,"Ryerson,Garden District",43.657162,-79.378937,0.0,Coffee Shop,Clothing Store,Cosmetics Shop,Café,Japanese Restaurant,Bookstore,Restaurant,Ramen Restaurant,Bakery,Electronics Store


Unnamed: 0,Borough,Neighbourhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,North York,Parkwoods,43.753259,-79.329656,0.0,Park,Food & Drink Shop,Women's Store,Doner Restaurant,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Dog Run,Donut Shop
1,North York,Victoria Village,43.725882,-79.315572,0.0,Pizza Place,Coffee Shop,Portuguese Restaurant,Hockey Arena,Women's Store,Dog Run,Department Store,Dessert Shop,Dim Sum Restaurant,Diner
2,Downtown Toronto,Harbourfront,43.65426,-79.360636,0.0,Coffee Shop,Bakery,Café,Pub,Park,Breakfast Spot,Mexican Restaurant,Ice Cream Shop,Chocolate Shop,Beer Store
3,North York,"Lawrence Heights,Lawrence Manor",43.718518,-79.464763,0.0,Clothing Store,Furniture / Home Store,Women's Store,Boutique,Sporting Goods Shop,Miscellaneous Shop,Event Space,Vietnamese Restaurant,Coffee Shop,Accessories Store
4,Downtown Toronto,Queen's Park,43.662301,-79.389494,0.0,Coffee Shop,Gym,Park,Diner,Salad Place,Portuguese Restaurant,Mexican Restaurant,Liquor Store,Juice Bar,Italian Restaurant
5,Queen's Park,Queen's Park,43.667856,-79.532242,0.0,Coffee Shop,Gym,Park,Diner,Salad Place,Portuguese Restaurant,Mexican Restaurant,Liquor Store,Juice Bar,Italian Restaurant
6,Scarborough,"Rouge,Malvern",43.806686,-79.194353,1.0,Fast Food Restaurant,Women's Store,Deli / Bodega,Empanada Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Drugstore,Donut Shop,Doner Restaurant
7,North York,Don Mills North,43.745906,-79.352188,0.0,Japanese Restaurant,Gym / Fitness Center,Café,Caribbean Restaurant,Doner Restaurant,Dim Sum Restaurant,Diner,Discount Store,Dog Run,Women's Store
8,East York,"Woodbine Gardens,Parkview Hill",43.706397,-79.309937,0.0,Fast Food Restaurant,Pizza Place,Pharmacy,Athletics & Sports,Bus Line,Intersection,Gastropub,Bank,Gym / Fitness Center,Pet Store
9,Downtown Toronto,"Ryerson,Garden District",43.657162,-79.378937,0.0,Coffee Shop,Clothing Store,Cosmetics Shop,Café,Japanese Restaurant,Bookstore,Restaurant,Ramen Restaurant,Bakery,Electronics Store


In [229]:
toronto_merged = toronto_merged.dropna(subset=['Cluster Labels'])   #Drop only if NaN in specific column (as asked in the question)

In [230]:
pd.set_option('display.max_rows', toronto_merged.shape[0]+1) # to display all rows of data

toronto_merged 

Unnamed: 0,Borough,Neighbourhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,North York,Parkwoods,43.753259,-79.329656,0.0,Park,Food & Drink Shop,Women's Store,Doner Restaurant,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Dog Run,Donut Shop
1,North York,Victoria Village,43.725882,-79.315572,0.0,Pizza Place,Coffee Shop,Portuguese Restaurant,Hockey Arena,Women's Store,Dog Run,Department Store,Dessert Shop,Dim Sum Restaurant,Diner
2,Downtown Toronto,Harbourfront,43.65426,-79.360636,0.0,Coffee Shop,Bakery,Café,Pub,Park,Breakfast Spot,Mexican Restaurant,Ice Cream Shop,Chocolate Shop,Beer Store
3,North York,"Lawrence Heights,Lawrence Manor",43.718518,-79.464763,0.0,Clothing Store,Furniture / Home Store,Women's Store,Boutique,Sporting Goods Shop,Miscellaneous Shop,Event Space,Vietnamese Restaurant,Coffee Shop,Accessories Store
4,Downtown Toronto,Queen's Park,43.662301,-79.389494,0.0,Coffee Shop,Gym,Park,Diner,Salad Place,Portuguese Restaurant,Mexican Restaurant,Liquor Store,Juice Bar,Italian Restaurant
5,Queen's Park,Queen's Park,43.667856,-79.532242,0.0,Coffee Shop,Gym,Park,Diner,Salad Place,Portuguese Restaurant,Mexican Restaurant,Liquor Store,Juice Bar,Italian Restaurant
6,Scarborough,"Rouge,Malvern",43.806686,-79.194353,1.0,Fast Food Restaurant,Women's Store,Deli / Bodega,Empanada Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Drugstore,Donut Shop,Doner Restaurant
7,North York,Don Mills North,43.745906,-79.352188,0.0,Japanese Restaurant,Gym / Fitness Center,Café,Caribbean Restaurant,Doner Restaurant,Dim Sum Restaurant,Diner,Discount Store,Dog Run,Women's Store
8,East York,"Woodbine Gardens,Parkview Hill",43.706397,-79.309937,0.0,Fast Food Restaurant,Pizza Place,Pharmacy,Athletics & Sports,Bus Line,Intersection,Gastropub,Bank,Gym / Fitness Center,Pet Store
9,Downtown Toronto,"Ryerson,Garden District",43.657162,-79.378937,0.0,Coffee Shop,Clothing Store,Cosmetics Shop,Café,Japanese Restaurant,Bookstore,Restaurant,Ramen Restaurant,Bakery,Electronics Store


#### New York

In [211]:
NY_grouped_clustering = NY_grouped.drop('Neighborhood', 1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(NY_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([2, 2, 4, 4, 0, 0, 0, 0, 0, 2])

In [212]:
# add clustering labels
neighborhoods_venues_sorted1.insert(0, 'Cluster Labels', kmeans.labels_)

NY_merged = df_NY

# merge toronto_grouped with toronto_data to add latitude/longitude for each neighborhood
NY_merged = NY_merged.join(neighborhoods_venues_sorted1.set_index('Neighborhood'), on='Neighborhood')

In [213]:
NY_merged.head(20) # check the last columns!

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Staten Island,Fox Hills,40.617311,-74.08174,4.0,,,,,,,,,,
1,Bronx,Wakefield,40.894705,-73.847201,2.0,,,,,,,,,,
2,Bronx,Co-op City,40.874294,-73.829939,2.0,Bus Station,Liquor Store,Baseball Field,Bagel Shop,Gift Shop,Pharmacy,Mattress Store,Fast Food Restaurant,Pizza Place,Discount Store
3,Bronx,Eastchester,40.887556,-73.827806,2.0,Caribbean Restaurant,Bus Station,Diner,Deli / Bodega,Chinese Restaurant,Metro Station,Cosmetics Shop,Bus Stop,Convenience Store,Seafood Restaurant
4,Bronx,Fieldston,40.895437,-73.905643,3.0,Plaza,River,Bus Station,Field,English Restaurant,Ethiopian Restaurant,Event Service,Event Space,Exhibit,Eye Doctor
5,Bronx,Riverdale,40.890834,-73.912585,3.0,,,,,,,,,,
6,Bronx,Kingsbridge,40.881687,-73.902818,2.0,,,,,,,,,,
7,Manhattan,Marble Hill,40.876551,-73.91066,2.0,,,,,,,,,,
8,Bronx,Woodlawn,40.898273,-73.867315,0.0,,,,,,,,,,
9,Bronx,Norwood,40.877224,-73.879391,2.0,,,,,,,,,,


In [235]:
NY_merged = NY_merged.dropna(subset=['Cluster Labels'])   #Drop only if NaN in specific column (as asked in the question)

pd.set_option('display.max_rows', NY_merged.shape[0]+1) # to display all rows of data
NY_merged

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Staten Island,Fox Hills,40.617311,-74.08174,4.0,,,,,,,,,,
1,Bronx,Wakefield,40.894705,-73.847201,2.0,,,,,,,,,,
2,Bronx,Co-op City,40.874294,-73.829939,2.0,Bus Station,Liquor Store,Baseball Field,Bagel Shop,Gift Shop,Pharmacy,Mattress Store,Fast Food Restaurant,Pizza Place,Discount Store
3,Bronx,Eastchester,40.887556,-73.827806,2.0,Caribbean Restaurant,Bus Station,Diner,Deli / Bodega,Chinese Restaurant,Metro Station,Cosmetics Shop,Bus Stop,Convenience Store,Seafood Restaurant
4,Bronx,Fieldston,40.895437,-73.905643,3.0,Plaza,River,Bus Station,Field,English Restaurant,Ethiopian Restaurant,Event Service,Event Space,Exhibit,Eye Doctor
5,Bronx,Riverdale,40.890834,-73.912585,3.0,,,,,,,,,,
6,Bronx,Kingsbridge,40.881687,-73.902818,2.0,,,,,,,,,,
7,Manhattan,Marble Hill,40.876551,-73.91066,2.0,,,,,,,,,,
8,Bronx,Woodlawn,40.898273,-73.867315,0.0,,,,,,,,,,
9,Bronx,Norwood,40.877224,-73.879391,2.0,,,,,,,,,,


### 6 Visualise Clusters

In [214]:
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

In [231]:
# create map
map_clusters = folium.Map(location=[latitude_Tor, longitude_Tor], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighbourhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster-1)],
        fill=True,
        fill_color=rainbow[int(cluster-1)],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

In [238]:
# create map
map_clusters = folium.Map(location=[latitude_NY, longitude_NY], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(NY_merged['Latitude'], NY_merged['Longitude'], NY_merged['Neighborhood'], NY_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster-1)],
        fill=True,
        fill_color=rainbow[int(cluster-1)],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

## Analysis <a name="analysis"></a>

Calculate the amounts of clusters in %

#### Toronto

In [246]:
toronto_counts = toronto_merged['Cluster Labels'].value_counts()
percent_toronto = toronto_merged['Cluster Labels'].value_counts(normalize=True)
percent100_toronto = toronto_merged['Cluster Labels'].value_counts(normalize=True).mul(100).round(1).astype(str) + '%'
pd.DataFrame({'counts': toronto_counts, 'percentage': percent100_toronto})

Unnamed: 0,counts,percentage
0.0,90,90.0%
2.0,3,3.0%
4.0,3,3.0%
1.0,3,3.0%
3.0,1,1.0%


#### New York

In [263]:
NY_counts = NY_merged['Cluster Labels'].value_counts()
percent_NY = NY_merged['Cluster Labels'].value_counts(normalize=True)
percent100_NY = NY_merged['Cluster Labels'].value_counts(normalize=True).mul(100).round(1).astype(str) + '%'
pd.DataFrame({'counts': NY_counts, 'percentage': percent100_NY})

Unnamed: 0,counts,percentage
0.0,149,48.9%
2.0,134,43.9%
4.0,12,3.9%
3.0,6,2.0%
1.0,4,1.3%


## Results and Discussion <a name="results"></a>

It can be seen that Toronto has one big cluster (90.0% of the neighborhoods) and other are much smaller (about 3.0%).  For New York, there are two big clusters (48.9% and 43.9% of the neighborhoods) and other are much smaller (about 3.0% too). Segmentation of two cities are different. Toronto has more uniform neighborhood type.  New York has much more varieties. 

## Conclusion <a name="conclusion"></a>

As it was said in discussion there are two big clusters for New York and one large cluster for Toronto. If you are going to visit two cities and don’t know what to visit first it is better to first visit Toronto because it has more uniform neighborhood type as were determined in this report. 