# Table of Contents

<div class="alert alert-block alert-info" style="margin-top: 20px">

<font size = 3>

1. <a href="#item1">Data Scraping</a>

2. <a href="#item2">Data Preparation</a>

3. <a href="#item2">Explore Neighborhoods in Toronto</a>

4. <a href="#item3">Analyze Each Neighborhood</a>

5. <a href="#item4">Cluster Neighborhoods</a>

6. <a href="#item5">Examine Clusters</a>    
</font>
</div>

## 1. Data Scraping

In [1]:
import pandas as pd
import requests

In [2]:
url ='https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

In [3]:
tables = pd.read_html(url) # scrape all tables from the HTML page
df = tables[0] # use the first table 
df.columns = ['PostalCode', 'Borough', 'Neighborhood'] # PostCode --> PostalCode
df.head() # check

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


In [4]:
df1 = df[df.Borough != 'Not assigned'] #copying only assigned Boroughs into df1
df1.head() #check

Unnamed: 0,PostalCode,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights


In [5]:
# merge neighborhoods that share a postal code into one comma-separated cell
cleaned_df=df1.groupby("PostalCode").agg(lambda x:', '.join(set(x))) # set() removes duplicates (so Borough collapses to one value) but does not preserve row order
cleaned_df=cleaned_df.reset_index() #resetting the index
cleaned_df.head()



Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Port Union, Highland Creek, Rouge Hill"
2,M1E,Scarborough,"Morningside, West Hill, Guildwood"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae


In [6]:
cleaned_df.loc[cleaned_df['Neighborhood']=="Not assigned",'Neighborhood']=cleaned_df.loc[cleaned_df['Neighborhood']=="Not assigned",'Borough'] # assigning Borough name where Neighborhood is Not assigned
cleaned_df.shape



(103, 3)
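
The `set(x)` call in the merge step above removes duplicates but gives no guarantee about the order of the joined names. A minimal sketch of an order-preserving alternative, assuming a hypothetical helper named `join_unique` (not part of the original notebook):

```python
def join_unique(values, sep=', '):
    """Join values with `sep`, dropping duplicates but keeping first-seen order."""
    # dict.fromkeys keeps insertion order (Python 3.7+), unlike set()
    return sep.join(dict.fromkeys(values))


# Rows sharing postal code M5A in the scraped table
print(join_unique(['Harbourfront', 'Regent Park']))           # Harbourfront, Regent Park
print(join_unique(['Downtown Toronto', 'Downtown Toronto']))  # Downtown Toronto
```

It could be passed to the aggregation in place of the lambda, for example `df1.groupby("PostalCode").agg(join_unique)`.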

##  2. Data Preparation
<div class="alert alert-block alert-info" style="margin-top: 40px">

</div>

In [7]:
import pandas as pd
import matplotlib.pylab as plt #not sure if I need this one - just in case ;-)

In [8]:
file = 'https://cocl.us/Geospatial_data' 

In [9]:
df3 = pd.read_csv(file) #loading file

In [10]:
df3.columns = ['PostalCode', 'latitude', 'longitude'] #make sure that column names are consistent

In [11]:
neighborhoods = pd.merge(cleaned_df, df3, on='PostalCode') # merge two dataframes into one where PostalCode is the same

In [12]:
neighborhoods.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,latitude,longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Port Union, Highland Creek, Rouge Hill",43.784535,-79.160497
2,M1E,Scarborough,"Morningside, West Hill, Guildwood",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
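
`pd.merge` defaults to an inner join, so any postal code missing from either table is dropped without warning. A hedged sketch of how the merge could be checked, using small made-up frames (the `M9Z` row is hypothetical) and the `indicator` option of `pd.merge`:

```python
import pandas as pd

# Toy stand-ins for cleaned_df and df3; 'M9Z' is a made-up code with no coordinates
left = pd.DataFrame({'PostalCode': ['M1B', 'M1C', 'M9Z'],
                     'Borough': ['Scarborough', 'Scarborough', 'Example']})
right = pd.DataFrame({'PostalCode': ['M1B', 'M1C'],
                      'latitude': [43.806686, 43.784535],
                      'longitude': [-79.194353, -79.160497]})

# how='left' keeps every row of `left`; indicator=True adds a '_merge' column
checked = pd.merge(left, right, on='PostalCode', how='left', indicator=True)

# Rows marked 'left_only' found no matching coordinates
unmatched = checked.loc[checked['_merge'] == 'left_only', 'PostalCode'].tolist()
print(unmatched)  # ['M9Z']
```

An empty `unmatched` list would indicate that every postal code received coordinates.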


## 3. Explore Neighborhoods in Toronto
<div class="alert alert-block alert-info" style="margin-top: 20px">

</div>

In [16]:
pip search geopy

tornado-geopy (0.1.0)         - tornado-geopy is an asynchronous version of the awesome geopy library.
geopy (1.20.0)                - Python Geocoding Toolbox
pivotal-geopy (1.0.0)         - Python Geocoding Toolbox
swisslandstats-geopy (0.7.1)  - Python for the land statistics datasets from the SFSO
Note: you may need to restart the kernel to use updated packages.


In [20]:
import numpy as np # library to handle data in a vectorized manner
#!pip install geopy
import geopy
import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't installed geopy
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

!pip install folium
#!conda install -c conda-forge folium=0.5.0 --yes # this one takes forever to load, I hate it!
import folium # map rendering library

print('Libraries imported.') 

Collecting folium
  Downloading https://files.pythonhosted.org/packages/4f/86/1ab30184cb60bc2b95deffe2bd86b8ddbab65a4fac9f7313c278c6e8d049/folium-0.9.1-py2.py3-none-any.whl (91kB)
Collecting branca>=0.3.0 (from folium)
  Downloading https://files.pythonhosted.org/packages/63/36/1c93318e9653f4e414a2e0c3b98fc898b4970e939afeedeee6075dd3b703/branca-0.3.1-py3-none-any.whl
Installing collected packages: branca, folium
Successfully installed branca-0.3.1 folium-0.9.1
Libraries imported.


Let's quickly examine the dataset

In [21]:
print('The dataframe has {} boroughs and {} neighborhoods.'.format(
        len(neighborhoods['Borough'].unique()),
        neighborhoods.shape[0]
    )
)

The dataframe has 11 boroughs and 103 neighborhoods.


Use the geopy library to get the latitude and longitude values of Toronto.

In [22]:
address = 'Toronto'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.653963, -79.387207.
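
`geocode` returns `None` when the geocoding service finds no match, in which case `location.latitude` raises an `AttributeError`. A small sketch of a guard, assuming a hypothetical wrapper `coordinates_for` that takes the geocoding function as an argument so it can be tried without network access:

```python
from types import SimpleNamespace


def coordinates_for(geocode, address):
    """Return (latitude, longitude) for `address`, or raise if nothing was found.

    `geocode` is any callable with the shape of geolocator.geocode.
    """
    location = geocode(address)
    if location is None:
        raise ValueError('No geocoding result for {!r}'.format(address))
    return location.latitude, location.longitude


# Stand-in for geolocator.geocode, so the example needs no network call
def fake_geocode(address):
    known = {'Toronto': SimpleNamespace(latitude=43.653963, longitude=-79.387207)}
    return known.get(address)


print(coordinates_for(fake_geocode, 'Toronto'))  # (43.653963, -79.387207)
```

With the notebook's objects this would be called as `coordinates_for(geolocator.geocode, address)`.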


### 3.1 Create a map of Toronto with neighborhoods superimposed on top

In [23]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lon, borough, neighborhood in zip(neighborhoods['latitude'], neighborhoods['longitude'], neighborhoods['Borough'], neighborhoods['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

##### For illustration purposes, let's simplify the map above and segment and cluster only the neighborhoods in Central Toronto. To do that, slice the original dataframe to create a new dataframe containing only the Central Toronto data.

In [24]:
central_data = neighborhoods[neighborhoods['Borough'] == 'Central Toronto'].reset_index(drop=True)
central_data.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,latitude,longitude
0,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879
1,M4P,Central Toronto,Davisville North,43.712751,-79.390197
2,M4R,Central Toronto,North Toronto West,43.715383,-79.405678
3,M4S,Central Toronto,Davisville,43.704324,-79.38879
4,M4T,Central Toronto,"Moore Park, Summerhill East",43.689574,-79.38316


In [25]:


address = 'Central Toronto, Toronto'

geolocator = Nominatim(user_agent="Toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Central Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Central Toronto are 43.653963, -79.387207.


### 3.2 Map of Central Toronto 

In [26]:
map_central = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, label in zip(central_data['latitude'], central_data['longitude'], central_data['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_central)  
    
map_central

Next, we are going to start utilizing the Foursquare API to explore the neighborhoods and segment them.

#### Define Foursquare Credentials and Version

In [27]:
CLIENT_ID = 'HCCZDOZC0UJLPOW23GXTUEFXUFF41IXP3QMFWFGT25EKUEHK' # your Foursquare ID
CLIENT_SECRET = 'KUWKZ5PJP0ODKJR1E44PG3WIKMM4UDNAPX0BTGFD4ZXHI2K2' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: HCCZDOZC0UJLPOW23GXTUEFXUFF41IXP3QMFWFGT25EKUEHK
CLIENT_SECRET:KUWKZ5PJP0ODKJR1E44PG3WIKMM4UDNAPX0BTGFD4ZXHI2K2
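
Hard-coding and printing API keys in a shared notebook exposes them to every reader. One common alternative, sketched here under the assumption that the keys are stored in environment variables named `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` (hypothetical names), is to read them at run time:

```python
import os


def load_foursquare_credentials(env=os.environ):
    """Read the Foursquare keys from a mapping (the process environment by default).

    The variable names are an assumption made for this example.
    """
    try:
        return env['FOURSQUARE_CLIENT_ID'], env['FOURSQUARE_CLIENT_SECRET']
    except KeyError as missing:
        raise RuntimeError('Missing environment variable: {}'.format(missing)) from None


# Demonstration with a plain dict instead of the real environment
client_id, client_secret = load_foursquare_credentials(
    {'FOURSQUARE_CLIENT_ID': 'demo-id', 'FOURSQUARE_CLIENT_SECRET': 'demo-secret'})
print('CLIENT_ID loaded:', bool(client_id))
```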


#### Let's explore the first neighborhood in our dataframe.

In [28]:
central_data.loc[0, 'Neighborhood']

'Lawrence Park'

Get the neighborhood's latitude and longitude values.

In [29]:
neighborhood_latitude = central_data.loc[0, 'latitude'] # neighborhood latitude value
neighborhood_longitude = central_data.loc[0, 'longitude'] # neighborhood longitude value

neighborhood_name = central_data.loc[0, 'Neighborhood'] # neighborhood name

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))

Latitude and longitude values of Lawrence Park are 43.7280205, -79.3887901.


#### Now, let's get the top 100 venues that are in Lawrence Park within a radius of 500 meters.

In [30]:
radius = 500 # define radius
limit = 100
# create URL
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighborhood_latitude, 
    neighborhood_longitude, 
    radius, 
    limit)
url # display URL

'https://api.foursquare.com/v2/venues/explore?&client_id=HCCZDOZC0UJLPOW23GXTUEFXUFF41IXP3QMFWFGT25EKUEHK&client_secret=KUWKZ5PJP0ODKJR1E44PG3WIKMM4UDNAPX0BTGFD4ZXHI2K2&v=20180605&ll=43.7280205,-79.3887901&radius=500&limit=100'
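
The request URL can also be assembled from a dictionary of parameters, which keeps each value next to its name. A minimal sketch using the standard library's `urlencode`; the helper name `build_explore_url` and the placeholder credentials are assumptions for illustration:

```python
from urllib.parse import urlencode

EXPLORE_ENDPOINT = 'https://api.foursquare.com/v2/venues/explore'


def build_explore_url(client_id, client_secret, version, lat, lng, radius=500, limit=100):
    """Build the venues/explore URL with the same parameters as the cell above."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),
        'radius': radius,
        'limit': limit,
    }
    # safe=',' leaves the comma in the 'll' value unescaped
    return EXPLORE_ENDPOINT + '?' + urlencode(params, safe=',')


example_url = build_explore_url('YOUR_ID', 'YOUR_SECRET', '20180605', 43.7280205, -79.3887901)
print(example_url)
```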

In [31]:
results = requests.get(url).json()
results

{'meta': {'code': 200, 'requestId': '5d34ed35af35f30025f427b2'},
  'headerLocation': 'Toronto',
  'headerFullLocation': 'Toronto',
  'headerLocationGranularity': 'city',
  'totalResults': 3,
  'suggestedBounds': {'ne': {'lat': 43.7325205045, 'lng': -79.3825744605273},
   'sw': {'lat': 43.7235204955, 'lng': -79.3950057394727}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '50e6da19e4b0d8a78a0e9794',
       'name': 'Lawrence Park Ravine',
       'location': {'address': '3055 Yonge Street',
        'crossStreet': 'Lawrence Avenue East',
        'lat': 43.72696303913755,
        'lng': -79.39438246708775,
        'labeledLatLngs': [{'label': 'display',
          'lat': 43.72696303913755,
          'lng': -79.39438246708775}],
        'distance': 465,
        'cc': 'CA',
  

In [32]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError: # fall back to the flattened column name produced by json_normalize
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

Now we are ready to clean the JSON and structure it into a pandas dataframe.

In [33]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

Unnamed: 0,name,categories,lat,lng
0,Lawrence Park Ravine,Park,43.726963,-79.394382
1,Zodiac Swim School,Swim School,43.728532,-79.38286
2,TTC Bus #162 - Lawrence-Donway,Bus Line,43.728026,-79.382805
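
The same flattening can be done by hand, which makes the nesting of the response explicit. A sketch assuming each item has the `venue` structure shown in the output above; `venue_row` is a hypothetical helper, and the sample item is trimmed to the fields used here:

```python
def venue_row(item):
    """Pull name, first category name (or None), lat and lng out of one response item."""
    venue = item['venue']
    categories = venue.get('categories', [])
    category = categories[0]['name'] if categories else None
    return {
        'name': venue['name'],
        'categories': category,
        'lat': venue['location']['lat'],
        'lng': venue['location']['lng'],
    }


# One item, reduced from the response printed earlier
sample_item = {
    'venue': {
        'name': 'Lawrence Park Ravine',
        'location': {'lat': 43.72696303913755, 'lng': -79.39438246708775},
        'categories': [{'name': 'Park'}],
    }
}
row = venue_row(sample_item)
print(row['name'], '|', row['categories'])  # Lawrence Park Ravine | Park
```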


And how many venues were returned by Foursquare?

In [34]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

3 venues were returned by Foursquare.


### Explore Neighborhoods in Central Toronto

##### Let's create a function to repeat the same process for all the neighborhoods in Central Toronto

In [35]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    LIMIT = 100 # maximum number of venues requested per neighborhood
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius,
            LIMIT
            )
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
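
`getNearbyVenues` assumes that every request succeeds and that the response always contains a group of items. A hedged sketch of a more defensive fetch step, where `fetch_items` is a hypothetical helper and `http_get` is expected to behave like `requests.get`; a stand-in response object is used so the example runs offline:

```python
def fetch_items(http_get, url, timeout=10):
    """Return the list of venue items for one request, or [] if there are none."""
    response = http_get(url, timeout=timeout)
    response.raise_for_status()  # surface HTTP errors instead of a later KeyError
    groups = response.json().get('response', {}).get('groups', [])
    return groups[0].get('items', []) if groups else []


class FakeResponse:
    """Minimal stand-in for requests.Response, for demonstration only."""

    def __init__(self, payload):
        self._payload = payload

    def raise_for_status(self):
        pass

    def json(self):
        return self._payload


def fake_get(url, timeout=None):
    return FakeResponse({'response': {'groups': [{'items': [{'venue': {'name': 'Sherwood Park'}}]}]}})


items = fetch_items(fake_get, 'https://example.invalid/explore')
print(len(items))  # 1
```

In the notebook this would be called as `fetch_items(requests.get, url)` inside the loop.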

##### Create a new dataframe called central_venues

In [36]:
central_venues = getNearbyVenues(names=central_data['Neighborhood'],
                                   latitudes=central_data['latitude'],
                                   longitudes=central_data['longitude']
                                  )


Lawrence Park
Davisville North
North Toronto West
Davisville
Moore Park, Summerhill East
South Hill, Deer Park, Summerhill West, Forest Hill SE, Rathnelly
Roselawn
Forest Hill West, Forest Hill North
North Midtown, Yorkville, The Annex


##### Let's check the size of the resulting dataframe

In [37]:
print(central_venues.shape)
central_venues.head()

(112, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Lawrence Park,43.72802,-79.38879,Lawrence Park Ravine,43.726963,-79.394382,Park
1,Lawrence Park,43.72802,-79.38879,Zodiac Swim School,43.728532,-79.38286,Swim School
2,Lawrence Park,43.72802,-79.38879,TTC Bus #162 - Lawrence-Donway,43.728026,-79.382805,Bus Line
3,Davisville North,43.712751,-79.390197,Sherwood Park,43.716551,-79.387776,Park
4,Davisville North,43.712751,-79.390197,Summerhill Market North,43.715499,-79.392881,Food & Drink Shop


##### Let's check how many venues were returned for each neighborhood

In [38]:
central_venues.groupby('Neighborhood').count()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Davisville,35,35,35,35,35,35
Davisville North,7,7,7,7,7,7
"Forest Hill West, Forest Hill North",4,4,4,4,4,4
Lawrence Park,3,3,3,3,3,3
"Moore Park, Summerhill East",4,4,4,4,4,4
"North Midtown, Yorkville, The Annex",24,24,24,24,24,24
North Toronto West,18,18,18,18,18,18
Roselawn,1,1,1,1,1,1
"South Hill, Deer Park, Summerhill West, Forest Hill SE, Rathnelly",16,16,16,16,16,16


#### Let's find out how many unique categories can be curated from all the returned venues

In [39]:
print('There are {} unique categories.'.format(len(central_venues['Venue Category'].unique())))

There are 59 unique categories.


## 4. Analyze Each Neighborhood

<div class="alert alert-block alert-info" style="margin-top: 20px">

</div>

In [40]:


# one hot encoding
cenral_onehot = pd.get_dummies(central_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
cenral_onehot['Neighborhood'] = central_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [cenral_onehot.columns[-1]] + list(cenral_onehot.columns[:-1])
cenral_onehot = cenral_onehot[fixed_columns]

cenral_onehot.head()



Unnamed: 0,Neighborhood,American Restaurant,BBQ Joint,Bagel Shop,Breakfast Spot,Brewery,Burger Joint,Bus Line,Café,Cheese Shop,Chinese Restaurant,Clothing Store,Coffee Shop,Cosmetics Shop,Deli / Bodega,Dessert Shop,Diner,Farmers Market,Fast Food Restaurant,Food & Drink Shop,Fried Chicken Joint,Garden,Gourmet Shop,Greek Restaurant,Gym,History Museum,Hotel,Ice Cream Shop,Indian Restaurant,Indoor Play Area,Italian Restaurant,Jewelry Store,Jewish Restaurant,Light Rail Station,Liquor Store,Mexican Restaurant,Park,Pharmacy,Pizza Place,Playground,Pub,Rental Car Location,Restaurant,Salon / Barbershop,Sandwich Place,Seafood Restaurant,Spa,Sporting Goods Shop,Sports Bar,Summer Camp,Supermarket,Sushi Restaurant,Swim School,Tennis Court,Thai Restaurant,Toy / Game Store,Trail,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Yoga Studio
0,Lawrence Park,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,Lawrence Park,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
2,Lawrence Park,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,Davisville North,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,Davisville North,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


And let's examine the new dataframe size.

In [41]:
cenral_onehot.shape

(112, 60)

#### Next, let's group rows by neighborhood and take the mean of the frequency of occurrence of each category

In [42]:
central_grouped = cenral_onehot.groupby('Neighborhood').mean().reset_index()
central_grouped

Unnamed: 0,Neighborhood,American Restaurant,BBQ Joint,Bagel Shop,Breakfast Spot,Brewery,Burger Joint,Bus Line,Café,Cheese Shop,Chinese Restaurant,Clothing Store,Coffee Shop,Cosmetics Shop,Deli / Bodega,Dessert Shop,Diner,Farmers Market,Fast Food Restaurant,Food & Drink Shop,Fried Chicken Joint,Garden,Gourmet Shop,Greek Restaurant,Gym,History Museum,Hotel,Ice Cream Shop,Indian Restaurant,Indoor Play Area,Italian Restaurant,Jewelry Store,Jewish Restaurant,Light Rail Station,Liquor Store,Mexican Restaurant,Park,Pharmacy,Pizza Place,Playground,Pub,Rental Car Location,Restaurant,Salon / Barbershop,Sandwich Place,Seafood Restaurant,Spa,Sporting Goods Shop,Sports Bar,Summer Camp,Supermarket,Sushi Restaurant,Swim School,Tennis Court,Thai Restaurant,Toy / Game Store,Trail,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Yoga Studio
0,Davisville,0.0,0.0,0.0,0.0,0.028571,0.0,0.0,0.057143,0.0,0.028571,0.0,0.085714,0.0,0.028571,0.085714,0.028571,0.028571,0.0,0.0,0.028571,0.0,0.028571,0.028571,0.028571,0.0,0.0,0.0,0.028571,0.028571,0.057143,0.0,0.0,0.0,0.0,0.0,0.028571,0.028571,0.057143,0.0,0.0,0.0,0.028571,0.0,0.085714,0.028571,0.0,0.0,0.0,0.0,0.0,0.057143,0.0,0.0,0.057143,0.028571,0.0,0.0,0.0,0.0
1,Davisville North,0.0,0.0,0.0,0.142857,0.0,0.0,0.0,0.0,0.0,0.0,0.142857,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.142857,0.0,0.0,0.0,0.0,0.142857,0.0,0.142857,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.142857,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.142857,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"Forest Hill West, Forest Hill North",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.0
3,Lawrence Park,0.0,0.0,0.0,0.0,0.0,0.0,0.333333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.333333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.333333,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,"Moore Park, Summerhill East",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.0
5,"North Midtown, Yorkville, The Annex",0.041667,0.041667,0.0,0.0,0.0,0.041667,0.0,0.125,0.041667,0.0,0.0,0.125,0.041667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.041667,0.0,0.0,0.041667,0.0,0.0,0.0,0.041667,0.0,0.041667,0.0,0.041667,0.041667,0.083333,0.0,0.041667,0.0,0.0,0.0,0.125,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.041667,0.0,0.0
6,North Toronto West,0.0,0.0,0.055556,0.0,0.0,0.0,0.0,0.0,0.0,0.055556,0.111111,0.111111,0.0,0.0,0.055556,0.055556,0.0,0.055556,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.055556,0.0,0.0,0.055556,0.0,0.0,0.0,0.0,0.055556,0.055556,0.0,0.0,0.0,0.0,0.055556,0.0,0.055556,0.0,0.0,0.055556,0.055556,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.055556
7,Roselawn,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,"South Hill, Deer Park, Summerhill West, Forest...",0.0625,0.0,0.0625,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.1875,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0625,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0625,0.0625,0.0,0.0,0.0,0.0625,0.0,0.125,0.0,0.0625,0.0,0.0,0.0,0.0,0.0,0.0625,0.0,0.0625,0.0625,0.0,0.0,0.0,0.0,0.0,0.0,0.0625,0.0
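
Because each venue contributes exactly one 1 to its category column, the mean of a one-hot column within a neighborhood is that category's share of the neighborhood's venues. A small sketch with the three Lawrence Park venues listed earlier, using `collections.Counter` in place of `get_dummies` and `groupby`:

```python
from collections import Counter

# (neighborhood, venue category) pairs taken from the Lawrence Park rows above
venues = [
    ('Lawrence Park', 'Park'),
    ('Lawrence Park', 'Swim School'),
    ('Lawrence Park', 'Bus Line'),
]

totals = Counter(hood for hood, _ in venues)  # venues per neighborhood
pair_counts = Counter(venues)                 # venues per (neighborhood, category)

# share of each category within its neighborhood
frequency = {pair: count / totals[pair[0]] for pair, count in pair_counts.items()}

for (hood, category), share in frequency.items():
    print(hood, category, round(share, 6))  # each share is 0.333333
```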


#### Let's confirm the new size

In [43]:
central_grouped.shape

(9, 60)

#### Let's print each neighborhood along with the top 5 most common venues

In [44]:
num_top_venues = 5

for hood in central_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = central_grouped[central_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Davisville----
                venue  freq
0      Sandwich Place  0.09
1        Dessert Shop  0.09
2         Coffee Shop  0.09
3  Italian Restaurant  0.06
4                Café  0.06


----Davisville North----
               venue  freq
0              Hotel  0.14
1     Clothing Store  0.14
2     Sandwich Place  0.14
3                Gym  0.14
4  Food & Drink Shop  0.14


----Forest Hill West, Forest Hill North----
                 venue  freq
0                Trail  0.25
1   Mexican Restaurant  0.25
2     Sushi Restaurant  0.25
3        Jewelry Store  0.25
4  American Restaurant  0.00


----Lawrence Park----
                 venue  freq
0                 Park  0.33
1             Bus Line  0.33
2          Swim School  0.33
3  American Restaurant  0.00
4   Seafood Restaurant  0.00


----Moore Park, Summerhill East----
                 venue  freq
0                 Park  0.25
1         Tennis Court  0.25
2           Playground  0.25
3          Summer Camp  0.25
4  American Restaurant 

#### Let's put that into a *pandas* dataframe

First, let's write a function to sort the venues in descending order.

In [45]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
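
Many categories share the same frequency (often 0.0), and `sort_values` with its default sorting algorithm does not guarantee any particular order among tied values, so the ranking of tied categories is arbitrary. One way to make it deterministic, sketched in plain Python with a hypothetical `top_categories` helper, is to break ties alphabetically:

```python
def top_categories(frequencies, n):
    """Return the n category names with the highest frequency.

    Ties are broken alphabetically so the result is reproducible.
    """
    ranked = sorted(frequencies.items(), key=lambda item: (-item[1], item[0]))
    return [name for name, _ in ranked[:n]]


# Frequencies for Lawrence Park from the grouped table, plus two zero-valued categories
lawrence_park = {
    'Yoga Studio': 0.0,
    'Swim School': 0.333333,
    'Park': 0.333333,
    'American Restaurant': 0.0,
    'Bus Line': 0.333333,
}
print(top_categories(lawrence_park, 4))
# ['Bus Line', 'Park', 'Swim School', 'American Restaurant']
```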

Now let's create the new dataframe and display the top 10 venues for each neighborhood.

In [46]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError: # no special suffix beyond 'rd', so fall back to 'th'
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = central_grouped['Neighborhood']

for ind in np.arange(central_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(central_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Davisville,Coffee Shop,Sandwich Place,Dessert Shop,Italian Restaurant,Pizza Place,Café,Sushi Restaurant,Thai Restaurant,Farmers Market,Fried Chicken Joint
1,Davisville North,Sandwich Place,Hotel,Breakfast Spot,Gym,Park,Clothing Store,Food & Drink Shop,Dessert Shop,History Museum,Greek Restaurant
2,"Forest Hill West, Forest Hill North",Trail,Jewelry Store,Sushi Restaurant,Mexican Restaurant,Yoga Studio,Dessert Shop,History Museum,Gym,Greek Restaurant,Gourmet Shop
3,Lawrence Park,Swim School,Bus Line,Park,Yoga Studio,Dessert Shop,Hotel,History Museum,Gym,Greek Restaurant,Gourmet Shop
4,"Moore Park, Summerhill East",Tennis Court,Park,Summer Camp,Playground,Yoga Studio,Dessert Shop,History Museum,Gym,Greek Restaurant,Gourmet Shop
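
The `indicators` list covers 1st, 2nd and 3rd and falls back to `th` for everything else, which is enough for ten columns but would produce `21th` if more than twenty were requested. A general sketch, with `ordinal` as a hypothetical helper:

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 12 -> '12th'."""
    if 10 <= n % 100 <= 20:
        suffix = 'th'  # 11th, 12th, 13th and the rest of the teens
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)


columns = ['Neighborhood'] + ['{} Most Common Venue'.format(ordinal(i)) for i in range(1, 11)]
print(columns[1], '|', columns[10])  # 1st Most Common Venue | 10th Most Common Venue
```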


## 5. Cluster Neighborhoods
<div class="alert alert-block alert-info" style="margin-top: 20px">

</div>

Run k-means to cluster the neighborhoods into 5 clusters.

In [47]:
# set number of clusters
kclusters = 5

central_grouped_clustering = central_grouped.drop(columns='Neighborhood') # keep only the numeric frequency columns

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(central_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([0, 0, 3, 2, 4, 0, 0, 1, 0])
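
With nine neighborhoods and five clusters, most clusters in the output above hold a single neighborhood (labels 1, 2, 3 and 4 each appear once). One common way to compare candidate values of `k` is the silhouette score. The following is a sketch on synthetic data, not on the notebook's venue frequencies; the three made-up blobs and the candidate range are assumptions for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data: three well-separated groups of ten 2-D points each
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
points = np.vstack([center + rng.normal(scale=0.5, size=(10, 2)) for center in centers])

scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(points)
    # silhouette_score is near 1 for compact, well-separated clusters
    scores[k] = silhouette_score(points, labels)

best_k = max(scores, key=scores.get)
print(best_k)  # 3
```

The same loop could be run on `central_grouped_clustering`, keeping `k` below the number of neighborhoods.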

Let's create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.

In [48]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

central_merged = central_data

# join central_data with neighborhoods_venues_sorted to add the cluster label and top venues to each neighborhood
central_merged = central_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

central_merged.head() # check the last columns!

Unnamed: 0,PostalCode,Borough,Neighborhood,latitude,longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879,2,Swim School,Bus Line,Park,Yoga Studio,Dessert Shop,Hotel,History Museum,Gym,Greek Restaurant,Gourmet Shop
1,M4P,Central Toronto,Davisville North,43.712751,-79.390197,0,Sandwich Place,Hotel,Breakfast Spot,Gym,Park,Clothing Store,Food & Drink Shop,Dessert Shop,History Museum,Greek Restaurant
2,M4R,Central Toronto,North Toronto West,43.715383,-79.405678,0,Clothing Store,Coffee Shop,Yoga Studio,Salon / Barbershop,Bagel Shop,Chinese Restaurant,Dessert Shop,Diner,Fast Food Restaurant,Ice Cream Shop
3,M4S,Central Toronto,Davisville,43.704324,-79.38879,0,Coffee Shop,Sandwich Place,Dessert Shop,Italian Restaurant,Pizza Place,Café,Sushi Restaurant,Thai Restaurant,Farmers Market,Fried Chicken Joint
4,M4T,Central Toronto,"Moore Park, Summerhill East",43.689574,-79.38316,4,Tennis Court,Park,Summer Camp,Playground,Yoga Studio,Dessert Shop,History Museum,Gym,Greek Restaurant,Gourmet Shop


Finally, let's visualize the resulting clusters

In [49]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(central_merged['latitude'], central_merged['longitude'], central_merged['Neighborhood'], central_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
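
The marker color only needs to differ between cluster labels. A dependency-free sketch of one way to produce such a palette with the standard library's `colorsys` (a swapped-in technique; the notebook itself uses matplotlib's `rainbow` colormap), where `cluster_palette` is a hypothetical helper:

```python
import colorsys


def cluster_palette(n_clusters):
    """Return n_clusters hex colors spaced evenly around the hue wheel."""
    palette = []
    for index in range(n_clusters):
        red, green, blue = colorsys.hsv_to_rgb(index / n_clusters, 0.8, 0.9)
        palette.append('#{:02x}{:02x}{:02x}'.format(
            int(red * 255), int(green * 255), int(blue * 255)))
    return palette


palette = cluster_palette(5)
# Index the palette directly with the cluster label (0 to 4)
print(palette[0], palette[4])
```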

## 6. Examine Clusters
<div class="alert alert-block alert-info" style="margin-top: 20px">

</div>

Let's examine each cluster and determine the discriminating venue categories that distinguish each cluster. Based on the defining categories, you can then assign a name to each cluster.
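
Before looking at each cluster in turn, it can help to list the members of every cluster at once. A short sketch that pairs the label array printed in section 5 with the neighborhood order of `central_grouped`:

```python
from collections import defaultdict

# Neighborhood order of central_grouped and the labels printed by kmeans.labels_
neighborhoods_in_order = [
    'Davisville',
    'Davisville North',
    'Forest Hill West, Forest Hill North',
    'Lawrence Park',
    'Moore Park, Summerhill East',
    'North Midtown, Yorkville, The Annex',
    'North Toronto West',
    'Roselawn',
    'South Hill, Deer Park, Summerhill West, Forest Hill SE, Rathnelly',
]
labels = [0, 0, 3, 2, 4, 0, 0, 1, 0]

members = defaultdict(list)
for name, label in zip(neighborhoods_in_order, labels):
    members[label].append(name)

for label in sorted(members):
    print(label, len(members[label]), members[label])
```

Label 0 collects five of the nine neighborhoods, while each remaining label holds exactly one.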

#### Cluster 1

In [50]:
central_merged.loc[central_merged['Cluster Labels'] == 0, central_merged.columns[[1] + list(range(5, central_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
1,Central Toronto,0,Sandwich Place,Hotel,Breakfast Spot,Gym,Park,Clothing Store,Food & Drink Shop,Dessert Shop,History Museum,Greek Restaurant
2,Central Toronto,0,Clothing Store,Coffee Shop,Yoga Studio,Salon / Barbershop,Bagel Shop,Chinese Restaurant,Dessert Shop,Diner,Fast Food Restaurant,Ice Cream Shop
3,Central Toronto,0,Coffee Shop,Sandwich Place,Dessert Shop,Italian Restaurant,Pizza Place,Café,Sushi Restaurant,Thai Restaurant,Farmers Market,Fried Chicken Joint
5,Central Toronto,0,Coffee Shop,Pub,American Restaurant,Sports Bar,Vietnamese Restaurant,Fried Chicken Joint,Light Rail Station,Liquor Store,Pizza Place,Restaurant
8,Central Toronto,0,Sandwich Place,Coffee Shop,Café,Pizza Place,American Restaurant,Park,Pub,Liquor Store,History Museum,Cosmetics Shop


#### Cluster 2

In [51]:
central_merged.loc[central_merged['Cluster Labels'] == 1, central_merged.columns[[1] + list(range(5, central_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
6,Central Toronto,1,Garden,Yoga Studio,Dessert Shop,Ice Cream Shop,Hotel,History Museum,Gym,Greek Restaurant,Gourmet Shop,Fried Chicken Joint


#### Cluster 3

In [52]:
central_merged.loc[central_merged['Cluster Labels'] == 2, central_merged.columns[[1] + list(range(5, central_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Central Toronto,2,Swim School,Bus Line,Park,Yoga Studio,Dessert Shop,Hotel,History Museum,Gym,Greek Restaurant,Gourmet Shop


#### Cluster 4

In [53]:
central_merged.loc[central_merged['Cluster Labels'] == 3, central_merged.columns[[1] + list(range(5, central_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
7,Central Toronto,3,Trail,Jewelry Store,Sushi Restaurant,Mexican Restaurant,Yoga Studio,Dessert Shop,History Museum,Gym,Greek Restaurant,Gourmet Shop


#### Cluster 5

In [54]:
central_merged.loc[central_merged['Cluster Labels'] == 4, central_merged.columns[[1] + list(range(5, central_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
4,Central Toronto,4,Tennis Court,Park,Summer Camp,Playground,Yoga Studio,Dessert Shop,History Museum,Gym,Greek Restaurant,Gourmet Shop
