## NOTE - FOR GRADING CONVENIENCE, I HAVE INCLUDED ALL PROMPTS WITHIN THE SAME NOTEBOOK

# Week 3 Submission - Segmenting and Clustering Neighborhoods in Toronto

First, we import pandas, numpy, and requests, which the rest of the notebook relies on.

# QUESTION 1

In [47]:
import pandas as pd
import numpy as np
import requests

### Attempt 1 (Question 1)

Next, after the libraries are imported, we need to find the data.  Originally, I attempted to use "https://en.wikipedia.org/w/index.php?title=List_of_postal_codes_of_Canada:_M"

We use postal codes beginning with 'M' because 'M' is the postal designation for Toronto, Canada.

In [48]:
url = "https://en.wikipedia.org/w/index.php?title=List_of_postal_codes_of_Canada:_M"
wiki_url = requests.get(url)

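if wiki_url:  # a Response object is truthy only when its status code is below 400
    print("[Response: {}] - Successfully connected to the Webpage".format(wiki_url.status_code))
else:
    print("[Response: {}] - Request failed.  Please check the URL of 'url' and try again.".format(wiki_url.status_code))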

[Response: 200] - Successfully connected to the Webpage


The code above received a response of 200, so we know the request succeeded and the page is ready for use.

In [49]:
wiki_data = pd.read_html(wiki_url.text)
wiki_data

[                                                    0  \
 0                                     M1ANot assigned   
 1                     M1BScarborough(Malvern / Rouge)   
 2   M1CScarborough(Rouge Hill / Port Union / Highl...   
 3   M1EScarborough(Guildwood / Morningside / West ...   
 4                              M1GScarborough(Woburn)   
 5                           M1HScarborough(Cedarbrae)   
 6                 M1JScarborough(Scarborough Village)   
 7   M1KScarborough(Kennedy Park / Ionview / East B...   
 8   M1LScarborough(Golden Mile / Clairlea / Oakridge)   
 9   M1MScarborough(Cliffside / Cliffcrest / Scarbo...   
 10       M1NScarborough(Birch Cliff / Cliffside West)   
 11  M1PScarborough(Dorset Park / Wexford Heights /...   
 12                 M1RScarborough(Wexford / Maryvale)   
 13                          M1SScarborough(Agincourt)   
 14  M1TScarborough(Clarks Corners / Tam O'Shanter ...   
 15  M1VScarborough(Milliken / Agincourt North / St...   
 16     M1WSca

The data does not appear to be properly structured: each cell fuses the postal code, borough, and neighbourhoods into one string.  Is this a table?  On the website itself, the table is oddly formatted and reads more like "unstructured data."

We can group these parameters into a separate list, but I would like to approach this a different way...
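
For reference, the first approach could probably still be made to work by splitting each fused cell with a regular expression.  The sketch below is only an illustration: it assumes every cell looks like the strings printed above (a three-character postal code, then the borough, then an optional parenthesised list of neighbourhoods separated by " / ").  `split_cell` and `CELL_PATTERN` are hypothetical names, not part of the notebook.

```python
import re

# Assumed cell layout: <postal code><borough>(<neighbourhood> / <neighbourhood> ...)
CELL_PATTERN = re.compile(r"^([A-Z]\d[A-Z])(.*?)(?:\((.*)\))?$")

def split_cell(cell):
    """Split one fused cell into (postal code, borough, neighbourhoods)."""
    match = CELL_PATTERN.match(cell.strip())
    if match is None:
        raise ValueError("Unexpected cell format: {!r}".format(cell))
    postal_code, borough, neighbourhoods = match.groups()
    if neighbourhoods is None:
        # No parenthesised list (e.g. "Not assigned"): reuse the borough text
        neighbourhoods = borough
    else:
        neighbourhoods = neighbourhoods.replace(" / ", ", ")
    return postal_code, borough.strip(), neighbourhoods.strip()

print(split_cell("M1BScarborough(Malvern / Rouge)"))
print(split_cell("M1ANot assigned"))
```

Using the older page revision, as done below, avoids this parsing step entirely, which is why I went that way.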

---------------------------------------------------------------------------------------------------------------------------------------------------------

### Attempt 2 (Question 1)

Let's take the same approach, except this time, use 'url' : "https://en.wikipedia.org/w/index.php?title=List_of_postal_codes_of_Canada:_M&oldid=1012118802"

This is a past version of the Wikipedia page where a proper table is formed with headings.  Let us try it out.  :)

In [50]:
url2 = "https://en.wikipedia.org/w/index.php?title=List_of_postal_codes_of_Canada:_M&oldid=1012118802"
wiki_url2 = requests.get(url2)
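if wiki_url2:  # a Response object is truthy only when its status code is below 400
    print("[Response: {}] - Successfully connected to the Webpage".format(wiki_url2.status_code))
else:
    print("[Response: {}] - Request failed.  Please check the URL of 'url2' and try again.".format(wiki_url2.status_code))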

[Response: 200] - Successfully connected to the Webpage


In [51]:
wiki_data2 = pd.read_html(wiki_url2.text)
wiki_data2

[    Postal Code           Borough  \
 0           M1A      Not assigned   
 1           M2A      Not assigned   
 2           M3A        North York   
 3           M4A        North York   
 4           M5A  Downtown Toronto   
 ..          ...               ...   
 175         M5Z      Not assigned   
 176         M6Z      Not assigned   
 177         M7Z      Not assigned   
 178         M8Z         Etobicoke   
 179         M9Z      Not assigned   
 
                                          Neighbourhood  
 0                                         Not assigned  
 1                                         Not assigned  
 2                                            Parkwoods  
 3                                     Victoria Village  
 4                            Regent Park, Harbourfront  
 ..                                                 ...  
 175                                       Not assigned  
 176                                       Not assigned  
 177                

Ahhh, that is so much more refreshing.  Let us check out the 'len' and 'type' now to determine our next steps.

In [52]:
print("'wiki_data2' has a 'len' of",len(wiki_data2))
print("'wiki_data2' is", type(wiki_data2))

'wiki_data2' has a 'len' of 3
'wiki_data2' is <class 'list'>


Okay, so 'pd.read_html' returned a list of three tables, and in order to do some data manipulation we need the first table by itself.  We could drop the other tables, but I think it is easier to overwrite the variable with just the first element of the list.

In [53]:
wiki_data2 = wiki_data2[0]
wiki_data2

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
...,...,...,...
175,M5Z,Not assigned,Not assigned
176,M6Z,Not assigned,Not assigned
177,M7Z,Not assigned,Not assigned
178,M8Z,Etobicoke,"Mimico NW, The Queensway West, South of Bloor,..."


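I am an American, so I do not like the 'u' in 'Neighbourhood'.  I am renaming the columns so I do not accidentally type a wrong value in the future.  And I'm stingy.  :P  Note that 'rename' returns a new DataFrame; the result is not assigned here, so 'wiki_data2' keeps the 'Neighbourhood' spelling and the later cells use that name.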

In [54]:
wiki_data2.rename(columns={'Neighbourhood': 'Neighborhood'})

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
...,...,...,...
175,M5Z,Not assigned,Not assigned
176,M6Z,Not assigned,Not assigned
177,M7Z,Not assigned,Not assigned
178,M8Z,Etobicoke,"Mimico NW, The Queensway West, South of Bloor,..."
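
To make the behaviour of `rename` concrete, here is a minimal sketch on a toy DataFrame (the `toy` and `renamed` names are made up for illustration).  It shows that the original frame keeps its column names unless the returned frame is assigned to a variable.

```python
import pandas as pd

# Toy frame standing in for 'wiki_data2'
toy = pd.DataFrame({
    "Postal Code": ["M3A"],
    "Borough": ["North York"],
    "Neighbourhood": ["Parkwoods"],
})

# rename returns a new DataFrame; 'toy' itself is left as it was
renamed = toy.rename(columns={"Neighbourhood": "Neighborhood"})

print(list(toy.columns))      # original frame is untouched
print(list(renamed.columns))  # the returned frame carries the new name
```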


In the prompt, it says "Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned."  Below, we drop the rows whose "Borough" is "Not assigned".

In [55]:
df = wiki_data2[wiki_data2["Borough"] != "Not assigned"]
df

Unnamed: 0,Postal Code,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
...,...,...,...
160,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
165,M4Y,Downtown Toronto,Church and Wellesley
168,M7Y,East Toronto,"Business reply mail Processing Centre, South C..."
169,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."


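Next, we group the data by Postal Code.  Note that 'groupby(...).head()' returns up to the first five rows of each group rather than an aggregate; the output below shows the same rows as before, so this step leaves the data unchanged.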

In [56]:
df = df.groupby(['Postal Code']).head()
df

Unnamed: 0,Postal Code,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
...,...,...,...
160,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
165,M4Y,Downtown Toronto,Church and Wellesley
168,M7Y,East Toronto,"Business reply mail Processing Centre, South C..."
169,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."
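
If a postal code ever did appear on several rows, one way to combine them would be an aggregation that joins the neighbourhood names with commas.  This is a sketch on made-up toy rows, not something the data above needed, since this version of the table already lists the neighbourhoods together.

```python
import pandas as pd

# Toy data (hypothetical rows) where one postal code appears twice
toy = pd.DataFrame({
    "Postal Code": ["M5A", "M5A", "M3A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighbourhood": ["Regent Park", "Harbourfront", "Parkwoods"],
})

# One row per postal code, neighbourhoods joined into a single string
combined = (
    toy.groupby(["Postal Code", "Borough"], sort=False)["Neighbourhood"]
       .agg(", ".join)
       .reset_index()
)
print(combined)
```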


The assignment says, "If a cell has a borough but a Not assigned  neighborhood, then the neighborhood will be the same as the borough."  First, let us check how many records still contain a 'Not assigned' neighborhood after the Borough filtering above.

In [57]:
df.Neighbourhood.str.count("Not assigned").sum

<bound method Series.sum of 2      0
3      0
4      0
5      0
6      0
      ..
160    0
165    0
168    0
169    0
178    0
Name: Neighbourhood, Length: 103, dtype: int64>

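The call above is missing its parentheses ('.sum' instead of '.sum()'), so the output is the bound method's repr, which lists the per-row counts rather than their total.  Every visible row is 0, so it doesn't look like we have any instances of this.  Let's move on...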
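
A more direct check, and the replacement the assignment describes, could look like the sketch below.  It runs on a toy DataFrame with one made-up unassigned row, so the names and values are illustrative only.

```python
import pandas as pd

# Toy frame with one hypothetical 'Not assigned' neighbourhood
toy = pd.DataFrame({
    "Borough": ["North York", "Downtown Toronto"],
    "Neighbourhood": ["Parkwoods", "Not assigned"],
})

# Boolean mask of the rows that need fixing; summing it gives the count
missing = toy["Neighbourhood"] == "Not assigned"
print("Rows with an unassigned neighbourhood:", int(missing.sum()))

# Where the neighbourhood is unassigned, fall back to the borough name
toy["Neighbourhood"] = toy["Neighbourhood"].mask(missing, toy["Borough"])
print(toy)
```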

In [58]:
df = df.reset_index()
df

Unnamed: 0,index,Postal Code,Borough,Neighbourhood
0,2,M3A,North York,Parkwoods
1,3,M4A,North York,Victoria Village
2,4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,5,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
...,...,...,...,...
98,160,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
99,165,M4Y,Downtown Toronto,Church and Wellesley
100,168,M7Y,East Toronto,"Business reply mail Processing Centre, South C..."
101,169,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."


"In the last cell of your notebook, use the .shape method to print the number of rows of your dataframe."

In [59]:
df.drop(['index'], axis = 'columns', inplace = True)
df

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
99,M4Y,Downtown Toronto,Church and Wellesley
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C..."
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."
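
As a side note, the reset and the drop above can be done in one step with `reset_index(drop=True)`.  A minimal sketch on a toy frame (the `toy` and `tidy` names are illustrative):

```python
import pandas as pd

# Toy frame whose index has gaps, like 'df' after filtering
toy = pd.DataFrame({"Postal Code": ["M3A", "M4A"]}, index=[2, 3])

# drop=True discards the old index instead of adding it as an 'index' column
tidy = toy.reset_index(drop=True)
print(tidy)
```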


In [60]:
df.shape

(103, 3)

# QUESTION 2

First, we need to install geocoder so that we can import it into this notebook.

In [61]:
pip install geocoder

Note: you may need to restart the kernel to use updated packages.


In [62]:
import geocoder

The assignment provided a link, 'https://cocl.us/Geospatial_data', to a CSV file with "Postal Code", "Latitude", and "Longitude" columns.

In [63]:
url3 = "https://cocl.us/Geospatial_data"
df_coordinates = pd.read_csv(url3)

print("Connection Successful")
print(df_coordinates)

Connection Successful
    Postal Code   Latitude  Longitude
0           M1B  43.806686 -79.194353
1           M1C  43.784535 -79.160497
2           M1E  43.763573 -79.188711
3           M1G  43.770992 -79.216917
4           M1H  43.773136 -79.239476
..          ...        ...        ...
98          M9N  43.706876 -79.518188
99          M9P  43.696319 -79.532242
100         M9R  43.688905 -79.554724
101         M9V  43.739416 -79.588437
102         M9W  43.706748 -79.594054

[103 rows x 3 columns]


All 103 rows loaded successfully.  Now let's check the data types in both our previous 'df' and the newly created 'df_coordinates'.

In [64]:
print("df Types:")
print("")
print(df.dtypes)
print("")
print("")
print("df_coordinates Types:")
print("")
print(df_coordinates.dtypes)

df Types:

Postal Code      object
Borough          object
Neighbourhood    object
dtype: object


df_coordinates Types:

Postal Code     object
Latitude       float64
Longitude      float64
dtype: object


Before joining the data together, let's check the number of records in each DataFrame.

In [65]:
print("df.shape is",df.shape)
print("df_coordinates.shape is",df_coordinates.shape)

df.shape is (103, 3)
df_coordinates.shape is (103, 3)


Let's join both tables together on the Primary Key "Postal Code"

In [66]:
df = df.join(df_coordinates.set_index('Postal Code'), on='Postal Code', how='inner')
df

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.654260,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
...,...,...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North",43.653654,-79.506944
99,M4Y,Downtown Toronto,Church and Wellesley,43.665860,-79.383160
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C...",43.662744,-79.321558
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu...",43.636258,-79.498509


In [67]:
df.shape

(103, 5)
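
The inner join kept all 103 rows, so every postal code found its coordinates.  If that were ever in doubt, a left merge with `indicator=True` would show which codes failed to match, and `validate` would raise if a key were duplicated.  This is a sketch on toy data; "M0X" is a made-up postal code used only to produce a non-match.

```python
import pandas as pd

neighbourhoods = pd.DataFrame({
    "Postal Code": ["M3A", "M4A", "M0X"],  # "M0X" is hypothetical
    "Borough": ["North York", "North York", "Nowhere"],
})
coords = pd.DataFrame({
    "Postal Code": ["M3A", "M4A"],
    "Latitude": [43.753259, 43.725882],
    "Longitude": [-79.329656, -79.315572],
})

# validate raises MergeError if either side repeats a postal code;
# indicator adds a '_merge' column describing where each row came from
checked = neighbourhoods.merge(
    coords, on="Postal Code", how="left", validate="one_to_one", indicator=True
)
unmatched = checked.loc[checked["_merge"] == "left_only", "Postal Code"].tolist()
print("Postal codes without coordinates:", unmatched)
```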

# QUESTION 3

In [68]:
import geocoder
from geopy.geocoders import Nominatim

In [69]:
address = 'Toronto, Ontario'

geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {},{}'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.6534817,-79.3839347


Let's install folium so that we can draw maps later in the notebook.

In [70]:
!pip install folium
import folium



In [71]:
map_Toronto = folium.Map(location=[latitude, longitude], zoom_start=11)

# Use lat/lng as loop names so Toronto's 'latitude' and 'longitude' from the geocoder are not overwritten
for lat, lng, borough, neighbourhood in zip(df['Latitude'], df['Longitude'], df['Borough'], df['Neighbourhood']):
    label = '{}, {}'.format(neighbourhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='red',
        fill=True
    ).add_to(map_Toronto)

map_Toronto

Nice!  Looks like the map is working just fine.  Hopefully we can use it later to visualize our k-means clustering results!

Let's enter our Foursquare account credentials so we can pull information from the API.

In [72]:
Client_ID = '4S2BNX0NLMGN43TZUPGZRIT5NTCP3AITVD11P2VPE3JOXRLD'
Client_Secret = 'BMNO2DDUIZENHZZNEMTXVGTKOD1YZTZFTLTJC1M1ON3D400T'
Version = '20180605'

print('Your credentials:')
print('Client_ID: ' + Client_ID)
print('Client_Secret: ' + Client_Secret)

Your credentials:
Client_ID: 4S2BNX0NLMGN43TZUPGZRIT5NTCP3AITVD11P2VPE3JOXRLD
Client_Secret: BMNO2DDUIZENHZZNEMTXVGTKOD1YZTZFTLTJC1M1ON3D400T
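
Since this notebook is shared, it would be safer not to hard-code or print the credentials.  One possible approach is to read them from environment variables.  The sketch below assumes the variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET`, which I made up for illustration; `load_foursquare_credentials` is likewise a hypothetical helper.

```python
import os

def load_foursquare_credentials(env=os.environ):
    """Read the Foursquare credentials from a mapping of environment variables."""
    try:
        client_id = env["FOURSQUARE_CLIENT_ID"]          # assumed variable name
        client_secret = env["FOURSQUARE_CLIENT_SECRET"]  # assumed variable name
    except KeyError as missing:
        raise RuntimeError("Set the {} environment variable first".format(missing)) from None
    return client_id, client_secret

# Demonstration with placeholder values instead of real keys
demo_env = {"FOURSQUARE_CLIENT_ID": "demo-id", "FOURSQUARE_CLIENT_SECRET": "demo-secret"}
client_id, client_secret = load_foursquare_credentials(demo_env)
print("Credentials loaded:", bool(client_id and client_secret))
```

In the notebook itself, calling `load_foursquare_credentials()` with no argument would read from the real environment.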


In [73]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}'.format(
            Client_ID, 
            Client_Secret, 
            Version, 
            lat, 
            lng, 
            radius
            )
 
        results = requests.get(url).json()["response"]['groups'][0]['items']

        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighbourhood', 
                  'Neighbourhood Latitude', 
                  'Neighbourhood Longitude', 
                  'Venue', 
                  'Venue Category']
    
    return(nearby_venues)
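
The function above indexes straight into the JSON, so it would raise if a response came back without 'groups' or if a venue had an empty category list.  A more defensive parsing step might look like the sketch below.  `extract_venues` is a hypothetical helper, and the payload shape is assumed from the keys the function above already uses; the sample payload is made up.

```python
def extract_venues(name, lat, lng, payload):
    """Turn one 'venues/explore' JSON payload into (name, lat, lng, venue, category) rows."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        # Fall back to a placeholder when a venue carries no category
        category = categories[0]["name"] if categories else "Unknown"
        rows.append((name, lat, lng, venue["name"], category))
    return rows

# Made-up payload in the assumed shape
sample = {"response": {"groups": [{"items": [
    {"venue": {"name": "Brookbanks Park", "categories": [{"name": "Park"}]}},
    {"venue": {"name": "Mystery Spot", "categories": []}},
]}]}}

print(extract_venues("Parkwoods", 43.753259, -79.329656, sample))
print(extract_venues("Parkwoods", 43.753259, -79.329656, {"response": {}}))
```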

With the credentials and the helper function defined, let's call it for every neighborhood to build the venue list.  The function prints each neighborhood name as it goes.

In [74]:
toronto_venues = getNearbyVenues(df['Neighbourhood'], df['Latitude'], df['Longitude'])

Parkwoods
Victoria Village
Regent Park, Harbourfront
Lawrence Manor, Lawrence Heights
Queen's Park, Ontario Provincial Government
Islington Avenue, Humber Valley Village
Malvern, Rouge
Don Mills
Parkview Hill, Woodbine Gardens
Garden District, Ryerson
Glencairn
West Deane Park, Princess Gardens, Martin Grove, Islington, Cloverdale
Rouge Hill, Port Union, Highland Creek
Don Mills
Woodbine Heights
St. James Town
Humewood-Cedarvale
Eringate, Bloordale Gardens, Old Burnhamthorpe, Markland Wood
Guildwood, Morningside, West Hill
The Beaches
Berczy Park
Caledonia-Fairbanks
Woburn
Leaside
Central Bay Street
Christie
Cedarbrae
Hillcrest Village
Bathurst Manor, Wilson Heights, Downsview North
Thorncliffe Park
Richmond, Adelaide, King
Dufferin, Dovercourt Village
Scarborough Village
Fairview, Henry Farm, Oriole
Northwood Park, York University
East Toronto, Broadview North (Old East York)
Harbourfront East, Union Station, Toronto Islands
Little Portugal, Trinity
Kennedy Park, Ionview, East Birchmo

Now we have the venues for each neighborhood.  Awesome sauce.

In [75]:
toronto_venues.shape

(1320, 5)

In [76]:
toronto_venues.head()

Unnamed: 0,Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Category
0,Parkwoods,43.753259,-79.329656,Brookbanks Park,Park
1,Parkwoods,43.753259,-79.329656,Variety Store,Food & Drink Shop
2,Victoria Village,43.725882,-79.315572,Victoria Village Arena,Hockey Arena
3,Victoria Village,43.725882,-79.315572,Portugril,Portuguese Restaurant
4,Victoria Village,43.725882,-79.315572,Tim Hortons,Coffee Shop


Just want to play around with the data for a little bit.  :)

In [77]:
toronto_venues.groupby('Neighbourhood').head()

Unnamed: 0,Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Category
0,Parkwoods,43.753259,-79.329656,Brookbanks Park,Park
1,Parkwoods,43.753259,-79.329656,Variety Store,Food & Drink Shop
2,Victoria Village,43.725882,-79.315572,Victoria Village Arena,Hockey Arena
3,Victoria Village,43.725882,-79.315572,Portugril,Portuguese Restaurant
4,Victoria Village,43.725882,-79.315572,Tim Hortons,Coffee Shop
...,...,...,...,...,...
1305,"Mimico NW, The Queensway West, South of Bloor,...",43.628841,-79.520999,South St. Burger,Burger Joint
1306,"Mimico NW, The Queensway West, South of Bloor,...",43.628841,-79.520999,Wingporium,Wings Joint
1307,"Mimico NW, The Queensway West, South of Bloor,...",43.628841,-79.520999,Dollarama,Discount Store
1308,"Mimico NW, The Queensway West, South of Bloor,...",43.628841,-79.520999,Healthy Planet,Supplement Shop


In [78]:
toronto_venues.groupby('Venue Category').max()

Unnamed: 0_level_0,Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue
Venue Category,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Accessories Store,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763,Ardene Shoes Outlet
Adult Boutique,Church and Wellesley,43.665860,-79.383160,Seduction
Airport,Downsview,43.737473,-79.394420,Toronto Downsview Airport (YZD)
Airport Food Court,"CN Tower, King and Spadina, Railway Lands, Har...",43.628947,-79.394420,Billy Bishop Café
Airport Gate,"CN Tower, King and Spadina, Railway Lands, Har...",43.628947,-79.394420,Gate 8
...,...,...,...,...
Warehouse Store,Thorncliffe Park,43.705369,-79.349372,Costco
Wine Bar,"Little Portugal, Trinity",43.653206,-79.400049,Paris Paris Bar
Wings Joint,"Mimico NW, The Queensway West, South of Bloor,...",43.628841,-79.520999,Wingporium
Women's Store,Caledonia-Fairbanks,43.689026,-79.453512,Maximum Woman


In [79]:
# How many venues were returned for each neighborhood?

toronto_venues.groupby('Neighbourhood').count()

Unnamed: 0_level_0,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Category
Neighbourhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Agincourt,5,5,5,5
"Alderwood, Long Branch",7,7,7,7
"Bathurst Manor, Wilson Heights, Downsview North",21,21,21,21
Bayview Village,4,4,4,4
"Bedford Park, Lawrence Manor East",25,25,25,25
...,...,...,...,...
"Willowdale, Willowdale East",30,30,30,30
"Willowdale, Willowdale West",5,5,5,5
Woburn,5,5,5,5
Woodbine Heights,7,7,7,7


In [80]:
toronto_venue_category = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")
toronto_venue_category

Unnamed: 0,Accessories Store,Adult Boutique,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Truck Stop,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wings Joint,Women's Store,Yoga Studio
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1315,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1316,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1317,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1318,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [81]:
toronto_venue_category['Neighbourhood'] = toronto_venues['Neighbourhood'] 

fixed_columns = [toronto_venue_category.columns[-1]] + list(toronto_venue_category.columns[:-1])
toronto_venue_category = toronto_venue_category[fixed_columns]

toronto_venue_category.head()

Unnamed: 0,Neighbourhood,Accessories Store,Adult Boutique,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,...,Truck Stop,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wings Joint,Women's Store,Yoga Studio
0,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Victoria Village,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Victoria Village,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Victoria Village,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Let's group the rows by neighborhood and take the mean of each one-hot column.  This gives the share of a neighborhood's venues that falls into each category.

In [82]:
toronto_grouped = toronto_venue_category.groupby('Neighbourhood').mean().reset_index()
toronto_grouped.head()

Unnamed: 0,Neighbourhood,Accessories Store,Adult Boutique,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,...,Truck Stop,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wings Joint,Women's Store,Yoga Studio
0,Agincourt,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,"Alderwood, Long Branch",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"Bathurst Manor, Wilson Heights, Downsview North",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Bayview Village,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,"Bedford Park, Lawrence Manor East",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.04,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
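
To see why the mean of the one-hot columns is a frequency, here is a small sketch on made-up venues.  With four venues in a neighbourhood, two of them coffee shops, the "Coffee Shop" value comes out as 0.5.  The toy rows are illustrative only.

```python
import pandas as pd

# Toy venue list (hypothetical rows)
toy = pd.DataFrame({
    "Neighbourhood": ["Parkwoods", "Parkwoods",
                      "Victoria Village", "Victoria Village",
                      "Victoria Village", "Victoria Village"],
    "Venue Category": ["Park", "Food & Drink Shop",
                       "Coffee Shop", "Coffee Shop", "Hockey Arena", "Park"],
})

# One column per category, 1 where the venue belongs to it
onehot = pd.get_dummies(toy["Venue Category"], dtype=int)
onehot.insert(0, "Neighbourhood", toy["Neighbourhood"])

# Mean of 0/1 indicators = fraction of the neighbourhood's venues in that category
frequencies = onehot.groupby("Neighbourhood").mean()
print(frequencies)
```

Each row of `frequencies` sums to 1, which matches the 0.04 shown above for a neighborhood with 25 venues and a single American Restaurant.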


In [83]:
def top_venues(row, top_number):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:top_number]

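There were a lot of venue categories, as shown above.  The following code keeps only the 'top_number' (here, the top 15) most common venue categories for each neighborhood, using the 'top_venues' function defined above.  This table is used to describe the clusters; the clustering itself is fit on the full frequency table.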

In [84]:
top_number = 15

indicators = ['st', 'nd', 'rd']

columns = ['Neighbourhood']

for ind in np.arange(top_number):
    try:
        columns.append('{}{} Commonality Rating'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Commonality Rating'.format(ind+1))
    
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighbourhood'] = toronto_grouped['Neighbourhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = top_venues(toronto_grouped.iloc[ind,:], top_number)
    
neighborhoods_venues_sorted.head()

Unnamed: 0,Neighbourhood,1st Commonality Rating,2nd Commonality Rating,3rd Commonality Rating,4th Commonality Rating,5th Commonality Rating,6th Commonality Rating,7th Commonality Rating,8th Commonality Rating,9th Commonality Rating,10th Commonality Rating,11th Commonality Rating,12th Commonality Rating,13th Commonality Rating,14th Commonality Rating,15th Commonality Rating
0,Agincourt,Latin American Restaurant,Clothing Store,Lounge,Skating Rink,Breakfast Spot,Dance Studio,Donut Shop,Dog Run,Distribution Center,Discount Store,Diner,Dim Sum Restaurant,Dessert Shop,Department Store,Deli / Bodega
1,"Alderwood, Long Branch",Pizza Place,Coffee Shop,Gym,Sandwich Place,Skating Rink,Pub,Department Store,Curling Ice,Dance Studio,Deli / Bodega,Dim Sum Restaurant,Dessert Shop,Creperie,Diner,Discount Store
2,"Bathurst Manor, Wilson Heights, Downsview North",Coffee Shop,Bank,Supermarket,Shopping Mall,Sandwich Place,Bridal Shop,Restaurant,Pizza Place,Pharmacy,Park,Mobile Phone Shop,Middle Eastern Restaurant,Deli / Bodega,Diner,Ice Cream Shop
3,Bayview Village,Chinese Restaurant,Bank,Japanese Restaurant,Café,Yoga Studio,Dance Studio,Drugstore,Donut Shop,Dog Run,Distribution Center,Discount Store,Diner,Dim Sum Restaurant,Dessert Shop,Department Store
4,"Bedford Park, Lawrence Manor East",Sandwich Place,Italian Restaurant,Restaurant,Coffee Shop,Pizza Place,Pharmacy,Indian Restaurant,Fast Food Restaurant,Sushi Restaurant,Butcher,Hobby Shop,Café,Thai Restaurant,Liquor Store,Pub
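
One thing to keep in mind when reading this table: a neighborhood such as Agincourt only has 5 venues, so most of its 15 ranked entries are categories with a frequency of zero that simply ended up next in the sort.  A variant that skips zero-frequency categories could look like the sketch below; `top_nonzero_venues` is a hypothetical alternative to 'top_venues', and the toy row is made up.

```python
import pandas as pd

def top_nonzero_venues(row, top_number):
    """Return up to 'top_number' category names, skipping zero-frequency categories."""
    frequencies = row.iloc[1:].astype(float)  # first entry is the neighbourhood name
    present = frequencies[frequencies > 0]
    return present.sort_values(ascending=False).index.tolist()[:top_number]

# Toy row in the same layout as one row of 'toronto_grouped'
toy_row = pd.Series({"Neighbourhood": "Toyville", "Bank": 0.0, "Gym": 0.25, "Park": 0.75, "Zoo": 0.0})
print(top_nonzero_venues(toy_row, 3))
```

The result can be shorter than `top_number`, so it would need padding before being written into a fixed-width table like the one above.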


In [85]:
from sklearn.cluster import KMeans

In [86]:
k_num_clusters = 5

toronto_grouped_clustering = toronto_grouped.drop(columns='Neighbourhood')

kmeans = KMeans(n_clusters=k_num_clusters, random_state=0).fit(toronto_grouped_clustering)

kmeans

KMeans(n_clusters=5, random_state=0)

In [87]:
kmeans.labels_[0:100]

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0,
       0, 0, 4, 0, 0, 2, 4, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 4, 0, 0, 0, 4,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 4], dtype=int32)
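
Most of the labels above are 0, so it can help to count how many neighborhoods land in each cluster before mapping them.  The sketch below uses a short made-up label array in place of 'kmeans.labels_'.

```python
import numpy as np

# Hypothetical label array standing in for 'kmeans.labels_'
labels = np.array([0, 0, 4, 0, 3, 0, 2, 4, 0, 1])

# bincount tallies how often each integer label occurs
sizes = np.bincount(labels, minlength=5)
for cluster, size in enumerate(sizes):
    print("Cluster {}: {} neighbourhoods".format(cluster, size))
```

A very lopsided split like this one may be a hint that a different number of clusters is worth trying.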

In [88]:
neighborhoods_venues_sorted.insert(0,'Cluster Labels', kmeans.labels_)

Next, we join this table onto our earlier 'df' so that the latitude and longitude are available for the folium cluster map later!

In [89]:
merged = df

merged = merged.join(neighborhoods_venues_sorted.set_index('Neighbourhood'), on='Neighbourhood')

merged.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude,Cluster Labels,1st Commonality Rating,2nd Commonality Rating,3rd Commonality Rating,4th Commonality Rating,...,6th Commonality Rating,7th Commonality Rating,8th Commonality Rating,9th Commonality Rating,10th Commonality Rating,11th Commonality Rating,12th Commonality Rating,13th Commonality Rating,14th Commonality Rating,15th Commonality Rating
0,M3A,North York,Parkwoods,43.753259,-79.329656,4.0,Park,Food & Drink Shop,Yoga Studio,Cuban Restaurant,...,Dog Run,Distribution Center,Discount Store,Diner,Dim Sum Restaurant,Dessert Shop,Department Store,Deli / Bodega,Dance Studio,Curling Ice
1,M4A,North York,Victoria Village,43.725882,-79.315572,0.0,Coffee Shop,Financial or Legal Service,Hockey Arena,Portuguese Restaurant,...,Distribution Center,Discount Store,Diner,Dim Sum Restaurant,Creperie,Dessert Shop,Department Store,Deli / Bodega,Dog Run,Dance Studio
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636,0.0,Coffee Shop,Park,Bakery,Breakfast Spot,...,Yoga Studio,Restaurant,Pub,Chocolate Shop,Mexican Restaurant,Dessert Shop,Distribution Center,Farmers Market,Historic Site,French Restaurant
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763,0.0,Furniture / Home Store,Clothing Store,Accessories Store,Boutique,...,Coffee Shop,Vietnamese Restaurant,Airport Gate,College Rec Center,Eastern European Restaurant,Drugstore,Donut Shop,Dog Run,Distribution Center,Discount Store
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494,0.0,Coffee Shop,Sushi Restaurant,Yoga Studio,Bar,...,Smoothie Shop,Sandwich Place,Burrito Place,Café,Portuguese Restaurant,College Auditorium,Music Venue,Mexican Restaurant,Creperie,Diner


In [90]:
merged_clean = merged.dropna(subset=['Cluster Labels'])
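
The cluster labels show up as 4.0 and 0.0 in the merged table, presumably because the join left NaN for neighborhoods with no venues, and NaN forces the column to float.  After dropping those rows, the labels can be cast back to integers.  A minimal sketch on a toy frame ("Toyville" is made up):

```python
import numpy as np
import pandas as pd

# Toy frame: one labelled row and one row that found no match in the join
toy = pd.DataFrame({
    "Neighbourhood": ["Parkwoods", "Toyville"],
    "Cluster Labels": [4.0, np.nan],
})

# Drop the unlabelled rows, then restore the integer dtype
clean = toy.dropna(subset=["Cluster Labels"]).copy()
clean["Cluster Labels"] = clean["Cluster Labels"].astype(int)
print(clean)
```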

In [91]:
import matplotlib.cm as cm
import matplotlib.colors as colors

And now to plot the clusters on the map!  Yay!  Awesome.

In [93]:
# Center the map on Toronto using the geocoder result from earlier
map_clusters = folium.Map(location=[location.latitude, location.longitude], zoom_start=11)

x = np.arange(k_num_clusters)
ys = [i + x + (1*x)**2 for i in range(k_num_clusters)]
colors_array = cm.rainbow(np.linspace(0,1,len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

markers_colors = []

for lat, lon, poi, cluster in zip(merged_clean['Latitude'], merged_clean['Longitude'], merged_clean['Neighbourhood'], merged_clean['Cluster Labels']):
    label = folium.Popup('Cluster ' + str(int(cluster) +1) + '\n' + str(poi) , parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster)],
        fill=True,
        fill_color=rainbow[int(cluster)]
        ).add_to(map_clusters)
        
map_clusters

# NOTE: MAP WORKS IN JUPYTER NOTEBOOK IN IBM STUDIO

When the notebook is viewed on GitHub, the map does not appear, most likely because GitHub's static notebook preview does not run the JavaScript that folium maps rely on.