<h1 align=center><font size = 5>Segmenting and Clustering Neighborhoods in Toronto City</font></h1>

## Table of Contents

<div class="alert alert-block alert-info" style="margin-top: 20px">

<font size = 3>

1. <a href="#item1">Download and Explore Dataset</a>

2. <a href="#item2">Explore Neighborhoods in Toronto</a>

3. <a href="#item3">Analyze Each Neighborhood</a>

4. <a href="#item4">Cluster Neighborhoods</a>

5. <a href="#item5">Examine Clusters</a>    
</font>
</div>

In [1]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

!conda install -c conda-forge geopy --yes # comment this line out if geopy is already installed
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON data into a pandas dataframe (top-level import, available in pandas 1.0+)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

print('Libraries imported.')

Solving environment: done

# All requested packages already installed.

Libraries imported.


## 1. Download and Explore Dataset

For the Toronto neighborhood data, a Wikipedia page has all the information we need to explore and cluster the neighborhoods in Toronto. We need to scrape the page, wrangle and clean the data, and then read it into a pandas dataframe so that it is in a structured format.

The table of postal codes on the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M is the source of the data, which we transform into a pandas dataframe.

#### The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood

In [2]:
import pandas as pd # library for data analysis

df = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')[0]   
#df= pd.DataFrame(df.values[1:], columns=df.iloc[0])
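# keep only the rows with an assigned borough; copy so later assignments do not act on a view
df = df[df.Borough != 'Not assigned'].copy()
# combine the neighbourhoods that share a postcode into one comma-separated string
df['Neighbourhood'] = df.groupby(['Postcode','Borough'])['Neighbourhood'].transform(lambda x: ','.join(x))
df1 = df[['Postcode','Borough','Neighbourhood']].drop_duplicates()

# label any neighbourhood still marked 'Not assigned' as Queen's Park
df1['Neighbourhood'] = df1['Neighbourhood'].replace(to_replace="Not assigned", value="Queen's Park")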


df1.shape

(103, 3)
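
The cleaning steps in the cell above can be illustrated on a small hand-made table. This is a minimal sketch rather than the notebook's exact code: the rows in `raw` are made up for illustration, and filling an unassigned neighbourhood with its borough name is an assumed generalisation of the hard-coded `Queen's Park` replacement.

```python
import pandas as pd

# Hypothetical rows that mimic the shape of the Wikipedia table (made up for illustration)
raw = pd.DataFrame({
    'Postcode': ['M1A', 'M5A', 'M5A', 'M7A'],
    'Borough': ['Not assigned', 'Downtown Toronto', 'Downtown Toronto', "Queen's Park"],
    'Neighbourhood': ['Not assigned', 'Harbourfront', 'Regent Park', 'Not assigned'],
})

# 1. drop postcodes that have no assigned borough
assigned = raw[raw['Borough'] != 'Not assigned'].copy()

# 2. assumption: an unassigned neighbourhood takes the name of its borough
missing = assigned['Neighbourhood'] == 'Not assigned'
assigned.loc[missing, 'Neighbourhood'] = assigned.loc[missing, 'Borough']

# 3. collapse to one row per postcode, joining neighbourhood names with commas
combined = (
    assigned.groupby(['Postcode', 'Borough'])['Neighbourhood']
    .agg(','.join)
    .reset_index()
)
print(combined)
```

Aggregating with `agg` returns one row per postcode directly, so no separate `drop_duplicates` step is needed in this variant.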

Now that we have built a dataframe of the postal code of each neighborhood along with the borough name and neighborhood name, in order to utilize the Foursquare location data, we need to get the latitude and the longitude coordinates of each neighborhood.

The CSV file at http://cocl.us/Geospatial_data has the geographical coordinates of each postal code. We read it into a second dataframe and merge it with the first on the postal code.

In [3]:
df2 = pd.read_csv('http://cocl.us/Geospatial_data')

neighborhoods = pd.merge(df1, df2, left_on='Postcode', right_on='Postal Code')
neighborhoods.drop(['Postal Code'], axis = 1 , inplace=True)
print(neighborhoods.shape)
neighborhoods.head()

(103, 5)


Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,Harbourfront,43.65426,-79.360636
3,M6A,North York,"Lawrence Heights,Lawrence Manor",43.718518,-79.464763
4,M7A,Downtown Toronto,Queen's Park,43.662301,-79.389494
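
`pd.merge` performs an inner join by default, so a postcode without a match in the coordinates file would be dropped silently. Here the row count stays at 103, so nothing was lost. The sketch below shows one way to check for unmatched keys; the two small tables and their values are hypothetical.

```python
import pandas as pd

# Hypothetical postcode and coordinate tables (values made up for illustration)
codes = pd.DataFrame({
    'Postcode': ['M3A', 'M4A', 'M9Z'],
    'Borough': ['North York', 'North York', 'Example Borough'],
})
coords = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A'],
    'Latitude': [43.75, 43.73],
    'Longitude': [-79.33, -79.32],
})

# a left join keeps every postcode; indicator=True adds a '_merge' column
# that records whether each row found a match in the coordinates table
merged = pd.merge(codes, coords, how='left',
                  left_on='Postcode', right_on='Postal Code', indicator=True)

unmatched = merged.loc[merged['_merge'] == 'left_only', 'Postcode'].tolist()
print(unmatched)
```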


In [4]:
#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

address = 'Toronto City, Toronto'

geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto City are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto City are 43.653963, -79.387207.


In [5]:
Toronto_data = neighborhoods[neighborhoods['Borough'] =='Downtown Toronto'].reset_index(drop=True)

print(Toronto_data.shape)

Toronto_data.head()

(19, 5)


Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M5A,Downtown Toronto,Harbourfront,43.65426,-79.360636
1,M7A,Downtown Toronto,Queen's Park,43.662301,-79.389494
2,M5B,Downtown Toronto,"Ryerson,Garden District",43.657162,-79.378937
3,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
4,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306


In [6]:
#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
#import folium # map rendering library

#### Define Foursquare Credentials and Version

In [7]:
CLIENT_ID = 'HT4J1D0RWL52AV4QMEJ2DE0A0L1NWSUYRIVURPOTOXY04OAY' # your Foursquare ID
CLIENT_SECRET = 'J5QCUFEICHUV5NFJVKI1STS2M1WXSQOVJ3CBQR3KMI0CY0FM' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET: ' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: HT4J1D0RWL52AV4QMEJ2DE0A0L1NWSUYRIVURPOTOXY04OAY
CLIENT_SECRET: J5QCUFEICHUV5NFJVKI1STS2M1WXSQOVJ3CBQR3KMI0CY0FM
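
Credentials written into a notebook travel with every copy of it. A common alternative is to read them from environment variables. The sketch below assumes two hypothetical variable names, `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET`; they are not defined by Foursquare or by this notebook.

```python
import os

def load_foursquare_credentials(env=os.environ):
    """Return (client_id, client_secret) read from an environment-style mapping.

    The variable names are an assumption made for this sketch.
    """
    try:
        return env['FOURSQUARE_CLIENT_ID'], env['FOURSQUARE_CLIENT_SECRET']
    except KeyError as missing:
        raise RuntimeError('Missing environment variable: {}'.format(missing)) from None

# usage with a stand-in mapping so the example runs without real credentials
demo_env = {'FOURSQUARE_CLIENT_ID': 'demo-id', 'FOURSQUARE_CLIENT_SECRET': 'demo-secret'}
client_id, client_secret = load_foursquare_credentials(demo_env)
print('Credentials loaded:', bool(client_id and client_secret))
```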


#### Let's explore the first neighborhood in our dataframe.

In [8]:
#Get the neighborhood's name.
Toronto_data.loc[0,'Neighbourhood']

'Harbourfront'

In [9]:
# Initialise

neighborhood_latitude = Toronto_data.loc[0, 'Latitude'] # neighborhood latitude value
neighborhood_longitude = Toronto_data.loc[0, 'Longitude'] # neighborhood longitude value

neighborhood_name = Toronto_data.loc[0, 'Neighbourhood'] # neighborhood name

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))

Latitude and longitude values of Harbourfront are 43.6542599, -79.3606359.


#### Now, let's get the top 100 venues that are within a radius of 500 meters.

First, let's create the GET request URL. Name your URL **url**.

In [10]:
# create the GET request URL from the credentials defined above

url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    20200101, 
    neighborhood_latitude, 
    neighborhood_longitude, 
    500, 
    100)
url # display URL



'https://api.foursquare.com/v2/venues/explore?&client_id=HT4J1D0RWL52AV4QMEJ2DE0A0L1NWSUYRIVURPOTOXY04OAY&client_secret=J5QCUFEICHUV5NFJVKI1STS2M1WXSQOVJ3CBQR3KMI0CY0FM&v=20200101&ll=43.6542599,-79.3606359&radius=500&limit=100'
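
The same URL can also be assembled from a dictionary of query parameters, which keeps each value next to its name. This is a sketch using only the standard library; `build_explore_url` is a hypothetical helper, and the endpoint and parameter names are the ones that appear in the URL above.

```python
from urllib.parse import urlencode

def build_explore_url(client_id, client_secret, version, lat, lng, radius=500, limit=100):
    """Build the venues/explore request URL from named query parameters."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),  # latitude,longitude pair
        'radius': radius,                # search radius in meters
        'limit': limit,                  # maximum number of venues to return
    }
    # safe=',' keeps the comma in the ll value unescaped
    return 'https://api.foursquare.com/v2/venues/explore?' + urlencode(params, safe=',')

# placeholder credentials, for illustration only
url_demo = build_explore_url('YOUR_ID', 'YOUR_SECRET', '20200101', 43.6542599, -79.3606359)
print(url_demo)
```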

In [11]:
results = requests.get(url).json()
venues = results['response']['groups'][0]['items']
nearby_venues = json_normalize(venues)

Send the GET request and examine the results.

### **get_category_type** function from the Foursquare lab.

In [12]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

### Clean the JSON and structure it into a *pandas* dataframe.

In [13]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues =nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.shape


(46, 4)

In [14]:
nearby_venues

Unnamed: 0,name,categories,lat,lng
0,Roselle Desserts,Bakery,43.653447,-79.362017
1,Tandem Coffee,Coffee Shop,43.653559,-79.361809
2,Cooper Koo Family YMCA,Gym / Fitness Center,43.653191,-79.357947
3,Body Blitz Spa East,Spa,43.654735,-79.359874
4,Morning Glory Cafe,Breakfast Spot,43.653947,-79.361149
5,Impact Kitchen,Restaurant,43.656369,-79.35698
6,Corktown Common,Park,43.655618,-79.356211
7,Figs Breakfast & Lunch,Breakfast Spot,43.655675,-79.364503
8,Dominion Pub and Kitchen,Pub,43.656919,-79.358967
9,The Distillery Historic District,Historic Site,43.650244,-79.359323


In [15]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

46 venues were returned by Foursquare.
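
The flattening and cleaning steps above can be tried without calling the API. The sketch below uses two hand-made items that only imitate the fields the notebook reads (`venue.name`, `venue.categories`, `venue.location.lat`, `venue.location.lng`); the venue names and coordinates are invented.

```python
import pandas as pd

# Hypothetical response items, shaped like the fields used in this notebook
items = [
    {'venue': {'name': 'Example Bakery',
               'categories': [{'name': 'Bakery'}],
               'location': {'lat': 43.6534, 'lng': -79.3620}}},
    {'venue': {'name': 'Unlabelled Spot',
               'categories': [],
               'location': {'lat': 43.6540, 'lng': -79.3610}}},
]

# flatten the nested dictionaries into dotted column names
flat = pd.json_normalize(items)
flat = flat[['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']].copy()

# keep the first category name, or None when the list is empty
flat['venue.categories'] = flat['venue.categories'].apply(
    lambda cats: cats[0]['name'] if cats else None)

# strip the 'venue.' and 'location.' prefixes from the column names
flat.columns = [col.split('.')[-1] for col in flat.columns]
print(flat)
```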


## 2. Explore Neighborhoods in Toronto

Let's create a function that repeats the same process for every neighborhood in Downtown Toronto.

In [16]:
def getNearbyVenues(names, latitudes, longitudes, radius=100, LIMIT=50):
    # radius and LIMIT default to the values the request previously hard-coded (100 m, 50 venues)
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            20200101, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()['response']['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
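
Inside `getNearbyVenues`, the expression `v['venue']['categories'][0]['name']` raises an `IndexError` if a venue comes back with an empty category list. A possible guard is sketched below; `venue_rows` is a hypothetical helper, and the sample items are invented to match the fields the function reads.

```python
def venue_rows(name, lat, lng, items):
    """Turn the items returned for one neighborhood into flat tuples."""
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories') or []
        # fall back to None when the venue has no category
        category = categories[0]['name'] if categories else None
        rows.append((name, lat, lng,
                     venue['name'],
                     venue['location']['lat'],
                     venue['location']['lng'],
                     category))
    return rows

# hypothetical items, for illustration only
sample_items = [
    {'venue': {'name': 'Example Spa', 'categories': [{'name': 'Spa'}],
               'location': {'lat': 43.6547, 'lng': -79.3599}}},
    {'venue': {'name': 'No Category Place', 'categories': [],
               'location': {'lat': 43.6550, 'lng': -79.3600}}},
]
rows = venue_rows('Harbourfront', 43.65426, -79.360636, sample_items)
print(rows)
```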

### Extraction 

In [17]:
import requests

Toronto_data_venues = getNearbyVenues(names=Toronto_data['Neighbourhood'],
                                   latitudes=Toronto_data['Latitude'],
                                   longitudes=Toronto_data['Longitude']
                                  )                

Harbourfront
Queen's Park
Ryerson,Garden District
St. James Town
Berczy Park
Central Bay Street
Christie
Adelaide,King,Richmond
Harbourfront East,Toronto Islands,Union Station
Design Exchange,Toronto Dominion Centre
Commerce Court,Victoria Hotel
Harbord,University of Toronto
Chinatown,Grange Park,Kensington Market
CN Tower,Bathurst Quay,Island airport,Harbourfront West,King and Spadina,Railway Lands,South Niagara
Rosedale
Stn A PO Boxes 25 The Esplanade
Cabbagetown,St. James Town
First Canadian Place,Underground city
Church and Wellesley


In [18]:
#### Let's check the size of the resulting dataframe

In [19]:
print(Toronto_data_venues.shape)
Toronto_data_venues.head()

(85, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Harbourfront,43.65426,-79.360636,Body Blitz Spa East,43.654735,-79.359874,Spa
1,Harbourfront,43.65426,-79.360636,Morning Glory Cafe,43.653947,-79.361149,Breakfast Spot
2,Harbourfront,43.65426,-79.360636,Massimo's Kitchen Studio,43.65477,-79.359698,Italian Restaurant
3,Harbourfront,43.65426,-79.360636,Sackville Playground,43.654656,-79.359871,Park
4,Queen's Park,43.662301,-79.389494,Ontario Police and Peace Officer's Memorial,43.662159,-79.389482,Sculpture Garden


Let's check how many venues were returned for each neighborhood

In [20]:
Toronto_data_venues.groupby('Neighborhood').count()

Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
"Adelaide,King,Richmond",11,11,11,11,11,11
"CN Tower,Bathurst Quay,Island airport,Harbourfront West,King and Spadina,Railway Lands,South Niagara",1,1,1,1,1,1
"Cabbagetown,St. James Town",1,1,1,1,1,1
Central Bay Street,4,4,4,4,4,4
"Chinatown,Grange Park,Kensington Market",5,5,5,5,5,5
Christie,1,1,1,1,1,1
"Commerce Court,Victoria Hotel",15,15,15,15,15,15
"Design Exchange,Toronto Dominion Centre",7,7,7,7,7,7
"First Canadian Place,Underground city",12,12,12,12,12,12
"Harbord,University of Toronto",1,1,1,1,1,1


#### Let's find out how many unique categories can be curated from all the returned venues

In [21]:
print('There are {} unique categories.'.format(len(Toronto_data_venues['Venue Category'].unique())))

There are 48 unique categories.


## 3. Analyze Each Neighborhood

In [22]:
# one hot encoding
T_onehot = pd.get_dummies(Toronto_data_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
T_onehot['Neighborhood'] = Toronto_data_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [T_onehot.columns[-1]] + list(T_onehot.columns[:-1])
T_onehot = T_onehot[fixed_columns]

T_onehot.head()

Unnamed: 0,Neighborhood,American Restaurant,Art Gallery,Bakery,Bar,Bookstore,Breakfast Spot,Building,Burger Joint,Burrito Place,Café,Cocktail Bar,Coffee Shop,College Gym,Concert Hall,Deli / Bodega,Diner,Farmers Market,Fast Food Restaurant,Food Court,Gastropub,Gift Shop,Gluten-free Restaurant,Greek Restaurant,Gym,Gym / Fitness Center,Hostel,Hotel,Italian Restaurant,Japanese Restaurant,Liquor Store,Nightclub,Park,Performing Arts Venue,Pharmacy,Pub,Restaurant,Salad Place,Sandwich Place,Sculpture Garden,Seafood Restaurant,Spa,Steakhouse,Sushi Restaurant,Taco Place,Tea Room,Thai Restaurant,Thrift / Vintage Store,Vegetarian / Vegan Restaurant
0,Harbourfront,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
1,Harbourfront,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,Harbourfront,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,Harbourfront,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,Queen's Park,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0


And let's examine the new dataframe size.

In [23]:
T_onehot.shape

(85, 49)

#### Next, let's group rows by neighborhood, taking the mean of the frequency of occurrence of each category

In [24]:
T_grouped = T_onehot.groupby('Neighborhood').mean().reset_index()
T_grouped

Unnamed: 0,Neighborhood,American Restaurant,Art Gallery,Bakery,Bar,Bookstore,Breakfast Spot,Building,Burger Joint,Burrito Place,Café,Cocktail Bar,Coffee Shop,College Gym,Concert Hall,Deli / Bodega,Diner,Farmers Market,Fast Food Restaurant,Food Court,Gastropub,Gift Shop,Gluten-free Restaurant,Greek Restaurant,Gym,Gym / Fitness Center,Hostel,Hotel,Italian Restaurant,Japanese Restaurant,Liquor Store,Nightclub,Park,Performing Arts Venue,Pharmacy,Pub,Restaurant,Salad Place,Sandwich Place,Sculpture Garden,Seafood Restaurant,Spa,Steakhouse,Sushi Restaurant,Taco Place,Tea Room,Thai Restaurant,Thrift / Vintage Store,Vegetarian / Vegan Restaurant
0,"Adelaide,King,Richmond",0.0,0.0,0.0,0.090909,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.090909,0.0,0.090909,0.0,0.0,0.0,0.0,0.090909,0.0,0.0,0.0,0.090909,0.0,0.0,0.0,0.0,0.0,0.090909,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.090909,0.090909,0.090909,0.090909,0.0,0.0,0.090909
1,"CN Tower,Bathurst Quay,Island airport,Harbourf...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"Cabbagetown,St. James Town",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Central Bay Street,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,"Chinatown,Grange Park,Kensington Market",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.2,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.0
5,Christie,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,"Commerce Court,Victoria Hotel",0.066667,0.066667,0.066667,0.0,0.066667,0.0,0.0,0.0,0.066667,0.066667,0.0,0.066667,0.0,0.0,0.066667,0.0,0.0,0.066667,0.0,0.0,0.0,0.0,0.0,0.066667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.066667,0.0,0.066667,0.066667,0.0,0.066667,0.0,0.0,0.066667,0.0,0.0,0.0,0.0,0.0
7,"Design Exchange,Toronto Dominion Centre",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.428571,0.0,0.0,0.142857,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.142857,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.285714,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,"First Canadian Place,Underground city",0.0,0.0,0.083333,0.0,0.0,0.0,0.083333,0.083333,0.0,0.083333,0.0,0.083333,0.0,0.0,0.083333,0.0,0.0,0.0,0.0,0.083333,0.083333,0.083333,0.0,0.0,0.083333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.083333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.083333,0.0,0.0
9,"Harbord,University of Toronto",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


#### Let's confirm the new size

In [25]:
T_grouped.shape

(17, 49)
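
The one-hot encoding and the grouped mean can be followed on a tiny example. The sketch below uses made-up neighborhoods `A` and `B`. The mean of each 0/1 column within a neighborhood is the share of that neighborhood's venues in the category. A neighborhood with no returned venues has no rows to group, which is consistent with the 17 rows here against the 19 Downtown Toronto postal codes.

```python
import pandas as pd

# Hypothetical venues: three in neighborhood A, one in neighborhood B
venues = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'B'],
    'Venue Category': ['Café', 'Café', 'Park', 'Park'],
})

# one column per category, 1 where the venue belongs to it
onehot = pd.get_dummies(venues['Venue Category'], dtype=int)
onehot.insert(0, 'Neighborhood', venues['Neighborhood'])

# the mean of each indicator column is the category's share within the neighborhood
grouped = onehot.groupby('Neighborhood').mean().reset_index()
print(grouped)
```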

#### Let's print each neighborhood along with the top 5 most common venues

In [26]:
num_top_venues = 5

for hood in T_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = T_grouped[T_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Adelaide,King,Richmond----
                           venue  freq
0  Vegetarian / Vegan Restaurant  0.09
1               Sushi Restaurant  0.09
2               Greek Restaurant  0.09
3                     Food Court  0.09
4                    Coffee Shop  0.09


----CN Tower,Bathurst Quay,Island airport,Harbourfront West,King and Spadina,Railway Lands,South Niagara----
                   venue  freq
0  Performing Arts Venue   1.0
1    American Restaurant   0.0
2            Art Gallery   0.0
3                  Hotel   0.0
4     Italian Restaurant   0.0


----Cabbagetown,St. James Town----
                 venue  freq
0   Italian Restaurant   1.0
1  American Restaurant   0.0
2          Art Gallery   0.0
3                Hotel   0.0
4  Japanese Restaurant   0.0


----Central Bay Street----
                venue  freq
0         Coffee Shop  0.50
1      Sandwich Place  0.25
2            Pharmacy  0.25
3         Salad Place  0.00
4  Italian Restaurant  0.00


----Chinatown,Grange Park,Ke

#### Let's put that into a *pandas* dataframe

In [27]:
# First, let's write a function to sort the venues in descending order.

def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]



In [28]:
#Now let's create the new dataframe and display the top 10 venues for each neighborhood.

num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = T_grouped['Neighborhood']

for ind in np.arange(T_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(T_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Adelaide,King,Richmond",Vegetarian / Vegan Restaurant,Coffee Shop,Greek Restaurant,Tea Room,Taco Place,Sushi Restaurant,Steakhouse,Food Court,Bar,Japanese Restaurant
1,"CN Tower,Bathurst Quay,Island airport,Harbourf...",Performing Arts Venue,Vegetarian / Vegan Restaurant,Greek Restaurant,Gift Shop,Gastropub,Food Court,Fast Food Restaurant,Farmers Market,Diner,Deli / Bodega
2,"Cabbagetown,St. James Town",Italian Restaurant,Vegetarian / Vegan Restaurant,Greek Restaurant,Gift Shop,Gastropub,Food Court,Fast Food Restaurant,Farmers Market,Diner,Deli / Bodega
3,Central Bay Street,Coffee Shop,Sandwich Place,Pharmacy,Vegetarian / Vegan Restaurant,Gift Shop,Gastropub,Food Court,Fast Food Restaurant,Farmers Market,Diner
4,"Chinatown,Grange Park,Kensington Market",Thrift / Vintage Store,Liquor Store,Farmers Market,Café,Cocktail Bar,Concert Hall,Gluten-free Restaurant,Gift Shop,Gastropub,Food Court
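
The `indicators` list with a fallback to `th` gives correct labels for the first ten columns. If `num_top_venues` were raised past 20, numbers such as 21 would be labelled `21th`. A small helper that handles those cases is sketched below; `ordinal` is a hypothetical name, not part of the notebook.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1st, 2nd, 3rd, 4th, 11th, 21st."""
    if 10 <= n % 100 <= 20:
        suffix = 'th'  # 11th, 12th, 13th and the rest of the teens
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)

num_top = 10
columns_demo = ['Neighborhood'] + ['{} Most Common Venue'.format(ordinal(i))
                                   for i in range(1, num_top + 1)]
print(columns_demo)
```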


## 4. Cluster Neighborhoods

Run k-means to cluster the neighborhoods into 5 clusters.

In [29]:
# set number of clusters
kclusters = 5

T_grouped_clustering = T_grouped.drop('Neighborhood', axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(T_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([1, 2, 1, 1, 1, 0, 1, 1, 1, 4], dtype=int32)
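
The notebook fixes the number of clusters at 5. One common way to sanity-check such a choice is to compare the within-cluster sum of squares (`inertia_`) across several values of k. The sketch below does this on synthetic points with three obvious groups. It is an illustration of the technique under those assumptions, not an analysis of the Toronto data.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data: three tight groups of four points each (made up for illustration)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
offsets = np.array([[0.1, 0.1], [0.1, -0.1], [-0.1, 0.1], [-0.1, -0.1]])
points = np.vstack([center + offsets for center in centers])

# fit k-means for several k and record the within-cluster sum of squares
inertias = {}
for k in range(1, 5):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(points)
    inertias[k] = model.inertia_

for k, value in inertias.items():
    print(k, round(value, 3))
```

The inertia drops sharply up to k=3 and changes little afterwards, which is the "elbow" pattern that suggests three clusters for this toy data.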

Let's create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.

In [30]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

T_merged = Toronto_data

# merge toronto_grouped with toronto_data to add latitude/longitude for each neighborhood
T_merged = T_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighbourhood')

T_merged.head() # check the last columns!

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M5A,Downtown Toronto,Harbourfront,43.65426,-79.360636,1.0,Italian Restaurant,Breakfast Spot,Spa,Park,Vegetarian / Vegan Restaurant,College Gym,Gastropub,Food Court,Fast Food Restaurant,Farmers Market
1,M7A,Downtown Toronto,Queen's Park,43.662301,-79.389494,1.0,Sculpture Garden,Vegetarian / Vegan Restaurant,Coffee Shop,Gift Shop,Gastropub,Food Court,Fast Food Restaurant,Farmers Market,Diner,Deli / Bodega
2,M5B,Downtown Toronto,"Ryerson,Garden District",43.657162,-79.378937,1.0,Coffee Shop,Art Gallery,College Gym,Gluten-free Restaurant,Gift Shop,Gastropub,Food Court,Fast Food Restaurant,Farmers Market,Diner
3,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418,1.0,Japanese Restaurant,Hostel,Italian Restaurant,Diner,Performing Arts Venue,Coffee Shop,Vegetarian / Vegan Restaurant,Concert Hall,Gastropub,Food Court
4,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306,,,,,,,,,,,
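
Berczy Park has empty values because no venues were returned for it, so it has no row in `neighborhoods_venues_sorted`, and the join fills the gap with `NaN`. That also turns `Cluster Labels` into a float column (`1.0` instead of `1`). The sketch below reproduces the effect on made-up data and shows one way to keep only clustered rows with integer labels.

```python
import pandas as pd

# Hypothetical data: neighbourhood B has no cluster label
hoods = pd.DataFrame({'Neighbourhood': ['A', 'B', 'C']})
labels = pd.DataFrame({'Neighborhood': ['A', 'C'], 'Cluster Labels': [1, 0]})

# the join keeps every row of hoods and leaves NaN where there is no match
merged = hoods.join(labels.set_index('Neighborhood'), on='Neighbourhood')
print(merged)

# drop the unmatched rows, then restore the integer dtype
clustered = merged.dropna(subset=['Cluster Labels']).copy()
clustered['Cluster Labels'] = clustered['Cluster Labels'].astype(int)
print(clustered)
```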


In [31]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(T_merged['Latitude'], T_merged['Longitude'], T_merged['Neighbourhood'], T_merged['Cluster Labels']):
    if pd.isna(cluster):
        continue # no venues were returned for this neighborhood, so it has no cluster label
    cluster = int(cluster) # labels are floats because the join introduced NaN values
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

## Thanks..